Tag Archives: mode

Finding the mode of a list

Question: Finding the mode of a list

Given a list of items, recall that the mode of the list is the item that occurs most often.

I would like to know how to create a function that can find the mode of a list but that displays a message if the list does not have a mode (e.g., all the items in the list only appear once). I want to make this function without importing any functions. I’m trying to make my own function from scratch.


Answer 0

You can use the max function with a key. Have a look at "python max function using 'key' and lambda expression".

max(set(lst), key=lst.count)
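
The one-liner above does not by itself handle the "no mode" case from the question. A minimal sketch of how it could be wrapped to do that without any imports (the function name and message below are illustrative, not part of the original answer):

def find_mode(lst):
    # The item with the highest count is the candidate mode.
    candidate = max(set(lst), key=lst.count)
    # If even the best candidate occurs only once, the list has no mode.
    if lst.count(candidate) == 1:
        print("The list has no mode: every item appears only once.")
        return None
    return candidate

print(find_mode([1, 2, 2, 3]))  # 2
find_mode([1, 2, 3])            # prints the "no mode" message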

Answer 1

You can use the Counter supplied in the collections package, which has a mode-esque function:

from collections import Counter
data = Counter(your_list_in_here)
data.most_common()   # Returns all unique items and their counts
data.most_common(1)  # Returns the highest occurring item

Note: Counter is new in Python 2.7 and is not available in earlier versions.
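
A short usage sketch (the variable names are illustrative): most_common() returns (item, count) pairs sorted by count, so ties and the no-mode case can be detected by comparing counts:

from collections import Counter

data = Counter([1, 1, 2, 2, 3])
print(data.most_common())    # [(1, 2), (2, 2), (3, 1)]
print(data.most_common(1))   # [(1, 2)]

# Keep every item whose count matches the highest count (handles ties).
highest = data.most_common(1)[0][1]
modes = [item for item, count in data.most_common() if count == highest]
print(modes)                 # [1, 2]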


Answer 2

Python 3.4 includes the method statistics.mode, so it is straightforward:

>>> from statistics import mode
>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
 3

You can have any type of elements in the list, not just numeric:

>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
 'red'
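
One caveat worth noting: on Python 3.4 through 3.7, statistics.mode raises StatisticsError when there is no unique mode (Python 3.8+ returns the first mode encountered instead), so a guard along these lines may be needed:

from statistics import mode, StatisticsError

try:
    print(mode([1, 1, 2, 2]))  # two equally common values
except StatisticsError:
    # Raised on Python <= 3.7 when there is no unique mode.
    print("The list does not have a unique mode.")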

Answer 3

Taking a leaf from some statistics software, namely SciPy and MATLAB, these just return the smallest most common value, so if two values occur equally often, the smallest of them is returned. Hopefully an example will help:

>>> from scipy.stats import mode

>>> mode([1, 2, 3, 4, 5])
(array([ 1.]), array([ 1.]))

>>> mode([1, 2, 2, 3, 3, 4, 5])
(array([ 2.]), array([ 2.]))

>>> mode([1, 2, 2, -3, -3, 4, 5])
(array([-3.]), array([ 2.]))

Is there any reason why you can't follow this convention?


Answer 4

There are many simple ways to find the mode of a list in Python, such as:

import statistics
statistics.mode([1,2,3,3])
>>> 3

Or, you could find the max by its count

max(array, key = array.count)

The problem with those two methods is that they don't work with multiple modes. The first returns an error, while the second returns the first mode.

In order to find the modes of a set, you could use this function:

def mode(array):
    most = max(list(map(array.count, array)))
    return list(set(filter(lambda x: array.count(x) == most, array)))
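
A brief usage example of the function above (the order of elements in the returned list is not guaranteed, since it comes from a set):

print(mode([1, 1, 2, 2, 3]))  # [1, 2]
print(mode([1, 2, 3]))        # [1, 2, 3] -- every item ties, so all are returned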

Answer 5

Extending the community answer, which will not work when the list is empty, here is working code for finding the mode:

def mode(arr):
        if arr==[]:
            return None
        else:
            return max(set(arr), key=arr.count)

Answer 6

In case you are interested in either the smallest, largest or all modes:

def get_small_mode(numbers, out_mode):
    counts = {k:numbers.count(k) for k in set(numbers)}
    modes = sorted(dict(filter(lambda x: x[1] == max(counts.values()), counts.items())).keys())
    if out_mode=='smallest':
        return modes[0]
    elif out_mode=='largest':
        return modes[-1]
    else:
        return modes
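
For example, with a list that has two tied modes (any string other than 'smallest' or 'largest' returns all modes):

data = [1, 2, 2, 3, 3, 4]
print(get_small_mode(data, 'smallest'))  # 2
print(get_small_mode(data, 'largest'))   # 3
print(get_small_mode(data, 'all'))       # [2, 3]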

Answer 7

I wrote up this handy function to find the mode.

def mode(nums):
    corresponding={}
    occurances=[]
    for i in nums:
            count = nums.count(i)
            corresponding.update({i:count})

    for i in corresponding:
            freq=corresponding[i]
            occurances.append(freq)

    maxFreq=max(occurances)

    keys = list(corresponding.keys())
    values = list(corresponding.values())

    index_v = values.index(maxFreq)
    return keys[index_v]

Answer 8

Short, but somehow ugly:

def mode(arr) :
    m = max([arr.count(a) for a in arr])
    return [x for x in arr if arr.count(x) == m][0] if m>1 else None

Using a dictionary, slightly less ugly:

def mode(arr) :
    f = {}
    for a in arr : f[a] = f.get(a,0)+1
    m = max(f.values())
    t = [(x,f[x]) for x in f if f[x]==m]
    return t[0][0] if m > 1 else None
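
Both versions behave the same way: they return one of the modes when something repeats, and None otherwise. For example:

print(mode([1, 1, 2]))  # 1
print(mode([1, 2, 3]))  # None -- no item repeats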

Answer 9

A little longer, but it can return multiple modes and can handle strings or a mix of datatypes.

def getmode(inplist):
    '''with list of items as input, returns mode
    '''
    dictofcounts = {}
    listofcounts = []
    for i in inplist:
        countofi = inplist.count(i) # count items for each item in list
        listofcounts.append(countofi) # add counts to list
        dictofcounts[i]=countofi # add counts and item in dict to get later
    maxcount = max(listofcounts) # get max count of items
    if maxcount ==1:
        print "There is no mode for this dataset, values occur only once"
    else:
        modelist = [] # if more than one mode, add to list to print out
        for key, item in dictofcounts.iteritems():
            if item ==maxcount: # get item from original list with most counts
                modelist.append(str(key))
        print "The mode(s) are:",' and '.join(modelist)
        return modelist 

Answer 10

For a number to be a mode, it must occur more times than at least one other number in the list, and it must not be the only number in the list. So, I refactored @mathwizurd's answer (to use the difference method) as follows:

def mode(array):
    '''
    returns a set containing valid modes
    returns a message if no valid mode exists
      - when all numbers occur the same number of times
      - when only one number occurs in the list 
      - when no number occurs in the list 
    '''
    most = max(map(array.count, array)) if array else None
    mset = set(filter(lambda x: array.count(x) == most, array))
    return mset if set(array) - mset else "list does not have a mode!" 

These tests pass successfully:

mode([]) == None 
mode([1]) == None
mode([1, 1]) == None 
mode([1, 1, 2, 2]) == None 

Answer 11

Why not just

def print_mode (thelist):
  counts = {}
  for item in thelist:
    counts [item] = counts.get (item, 0) + 1
  maxcount = 0
  maxitem = None
  for k, v in counts.items ():
    if v > maxcount:
      maxitem = k
      maxcount = v
  if maxcount == 1:
    print "All values only appear once"
  elif counts.values().count (maxcount) > 1:
    print "List has multiple modes"
  else:
    print "Mode of list:", maxitem

This is missing a few error checks that it probably should have, but it will find the mode without importing any functions, and it will print a message if all values appear only once. It will also detect multiple items sharing the same maximum count, although it wasn't clear if you wanted that.


Answer 12

This function returns the mode or modes of a list, no matter how many there are, as well as the frequency of the mode or modes in the dataset. If there is no mode (i.e., all items occur only once), the function returns an error string. This is similar to A_nagpal's function above but is, in my humble opinion, more complete, and I think it is easier for any Python novices (such as yours truly) reading this question to understand.

def l_mode(list_in):
    count_dict = {}
    for e in (list_in):   
        count = list_in.count(e)
        if e not in count_dict.keys():
            count_dict[e] = count
    max_count = 0 
    for key in count_dict: 
        if count_dict[key] >= max_count:
            max_count = count_dict[key]
    corr_keys = [] 
    for corr_key, count_value in count_dict.items():
        if count_dict[corr_key] == max_count:
            corr_keys.append(corr_key)
    if max_count == 1 and len(count_dict) != 1: 
        return 'There is no mode for this data set. All values occur only once.'
    else: 
        corr_keys = sorted(corr_keys)
        return corr_keys, max_count

Answer 13

This will return all modes:

def mode(numbers):
    largestCount = 0
    modes = []
    for x in numbers:
        if x in modes:
            continue
        count = numbers.count(x)
        if count > largestCount:
            del modes[:]
            modes.append(x)
            largestCount = count
        elif count == largestCount:
            modes.append(x)
    return modes

Answer 14

Simple code that finds the mode of the list without any imports:

nums = #your_list_goes_here
nums.sort()
counts = dict()
for i in nums:
    counts[i] = counts.get(i, 0) + 1
mode = max(counts, key=counts.get)

In case of multiple modes, it should return the minimum mode.
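
A quick check of that tie-breaking behaviour (this relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 on, together with the list being sorted first):

nums = [3, 1, 1, 3, 2]
nums.sort()                        # [1, 1, 2, 3, 3]
counts = dict()
for i in nums:
    counts[i] = counts.get(i, 0) + 1
mode = max(counts, key=counts.get)
print(mode)                        # 1 -- the tie between 1 and 3 resolves to the smaller value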


Answer 15

def mode(inp_list):
    sort_list = sorted(inp_list)
    dict1 = {}
    for i in sort_list:        
            count = sort_list.count(i)
            if i not in dict1.keys():
                dict1[i] = count

    maximum = 0 #no. of occurences
    max_key = -1 #element having the most occurences

    for key in dict1:
        if(dict1[key]>maximum):
            maximum = dict1[key]
            max_key = key 
        elif(dict1[key]==maximum):
            if(key<max_key):
                maximum = dict1[key]
                max_key = key

    return max_key

Answer 16

def mode(data):
    lst =[]
    hgh=0
    for i in range(len(data)):
        lst.append(data.count(data[i]))
    m= max(lst)
    ml = [x for x in data if data.count(x)==m ] #to find most frequent values
    mode = []
    for x in ml: #to remove duplicates of mode
        if x not in mode:
            mode.append(x)
    return mode
print mode([1,2,2,2,2,7,7,5,5,5,5])

Answer 17

Here is a simple function that gets the first mode that occurs in a list. It makes a dictionary with the list elements as keys and their number of occurrences as values, and then reads the dictionary values to get the mode.

def findMode(readList):
    numCount={}
    highestNum=0
    for i in readList:
        if i in numCount.keys(): numCount[i] += 1
        else: numCount[i] = 1
    for i in numCount.keys():
        if numCount[i] > highestNum:
            highestNum=numCount[i]
            mode=i
    if highestNum != 1: print(mode)
    elif highestNum == 1: print("All elements of list appear once.")

Answer 18

If you want a clear approach, useful for the classroom and using only lists and dictionary comprehensions, you can do:

def mode(my_list):
    # Form a new list with the unique elements
    unique_list = sorted(list(set(my_list)))
    # Create a comprehensive dictionary with the uniques and their count
    appearance = {a:my_list.count(a) for a in unique_list} 
    # Calculate max number of appearances
    max_app = max(appearance.values())
    # Return the elements of the dictionary that appear that # of times
    return {k: v for k, v in appearance.items() if v == max_app}
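
For example, the function returns a dictionary mapping each mode to its count:

print(mode([1, 2, 2, 3, 3]))  # {2: 2, 3: 2}
print(mode(['a', 'b', 'a']))  # {'a': 2}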

Answer 19

# Function to find the mode.
def mode(data):
    modecnt = 0                       # highest count seen so far
    for i in range(len(data)):
        icount = data.count(data[i])  # count of the current number
        if icount > modecnt:          # current count is greater than the previous best
            mode = data[i]            # store the number as the mode so far
            modecnt = icount          # store its count
    return mode
print mode(data1)

Answer 20

Here is how you can find the mean, median, and mode of a list:

import numpy as np
from scipy import stats

#to take input
size = int(input())
numbers = list(map(int, input().split()))

print(np.mean(numbers))
print(np.median(numbers))
print(int(stats.mode(numbers)[0]))

Answer 21

import numpy as np
def get_mode(xs):
    values, counts = np.unique(xs, return_counts=True)
    max_count_index = np.argmax(counts) #return the index with max value counts
    return values[max_count_index]
print(get_mode([1,7,2,5,3,3,8,3,2]))

Answer 22

For those looking for the minimum mode, e.g., in the case of a bimodal distribution, using NumPy:

import numpy as np
mode = np.argmax(np.bincount(your_list))
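
Note that np.bincount only accepts non-negative integers, and np.argmax returns the first (i.e. smallest) index with the maximum count. For example:

import numpy as np

your_list = [1, 2, 2, 3, 3]  # non-negative integers only
mode = np.argmax(np.bincount(your_list))
print(mode)                  # 2 -- the tie between 2 and 3 resolves to the smaller value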

Answer 23

The mode of a data set is the member (or members) that occurs most frequently in the set. If there are two members that appear most often, the same number of times, then the data has two modes. This is called bimodal.

If there are more than 2 modes, then the data would be called multimodal. If all the members in the data set appear the same number of times, then the data set has no mode at all.

The following function modes() can find the mode(s) in a given list of data:

import numpy as np; import pandas as pd

def modes(arr):
    df = pd.DataFrame(arr, columns=['Values'])
    dat = pd.crosstab(df['Values'], columns=['Freq'])
    if len(np.unique((dat['Freq']))) > 1:
        mode = list(dat.index[np.array(dat['Freq'] == max(dat['Freq']))])
        return mode
    else:
        print("There is NO mode in the data set")

Output:

# For a list of numbers in x as
In [1]: x = [2, 3, 4, 5, 7, 9, 8, 12, 2, 1, 1, 1, 3, 3, 2, 6, 12, 3, 7, 8, 9, 7, 12, 10, 10, 11, 12, 2]
In [2]: modes(x)
Out[2]: [2, 3, 12]
# For a list of repeated numbers in y as
In [3]: y = [2, 2, 3, 3, 4, 4, 10, 10]
In [4]: modes(y)
There is NO mode in the data set
# For a list of strings/characters in z as
In [5]: z = ['a', 'b', 'b', 'b', 'e', 'e', 'e', 'd', 'g', 'g', 'c', 'g', 'g', 'a', 'a', 'c', 'a']
In [6]: modes(z)
Out[6]: ['a', 'g']

If we do not want to import numpy or pandas to call any function from these packages, then to get this same output, the modes() function can be written as:

def modes(arr):
    cnt = []
    for i in arr:
        cnt.append(arr.count(i))
    uniq_cnt = []
    for i in cnt:
        if i not in uniq_cnt:
            uniq_cnt.append(i)
    if len(uniq_cnt) > 1:
        m = []
        for i in list(range(len(cnt))):
            if cnt[i] == max(uniq_cnt):
                m.append(arr[i])
        mode = []
        for i in m:
            if i not in mode:
                mode.append(i)
        return mode
    else:
        print("There is NO mode in the data set")

GroupBy a pandas DataFrame and select the most common value

Question: GroupBy a pandas DataFrame and select the most common value

I have a data frame with three string columns. I know that only one value in the 3rd column is valid for every combination of the first two. To clean the data I have to group the data frame by the first two columns and select the most common value of the third column for each combination.

My code:

import pandas as pd
from scipy import stats

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
                  'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                  'Short name' : ['NY','New','Spb','NY']})

print source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0])

The last line of code doesn't work; it says "KeyError: 'Short name'", and if I try to group only by City, then I get an AssertionError. How can I fix it?


Answer 0

You can use value_counts() to get a count series, and get the first row:

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
                  'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                  'Short name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])

In case you are wondering about performing other agg functions in .agg(), try this.

# Let's add a new col,  account
source['account'] = [1,2,3,3]

source.groupby(['Country','City']).agg(mod  = ('Short name', \
                                        lambda x: x.value_counts().index[0]),
                                        avg = ('account', 'mean') \
                                      )
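
For reference, the first groupby/agg call above should produce a result along these lines (the exact formatting depends on the pandas version):

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY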

Answer 1

Pandas >= 0.16

pd.Series.mode is available!

Use groupby, GroupBy.agg, and apply the pd.Series.mode function to each group:

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

If this is needed as a DataFrame, use

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY

The useful thing about Series.mode is that it always returns a Series, making it very compatible with agg and apply, especially when reconstructing the groupby output. It is also faster.

# Accepted answer.
%timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
# Proposed in this post.
%timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Dealing with Multiple Modes

Series.mode also does a good job when there are multiple modes:

source2 = source.append(
    pd.Series({'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}),
    ignore_index=True)

# Now `source2` has two modes for the 
# ("USA", "New-York") group, they are "NY" and "New".
source2

  Country              City Short name
0     USA          New-York         NY
1     USA          New-York        New
2  Russia  Sankt-Petersburg        Spb
3     USA          New-York         NY
4     USA          New-York        New

source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg          Spb
USA      New-York            [NY, New]
Name: Short name, dtype: object

Or, if you want a separate row for each mode, you can use GroupBy.apply:

source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)

Country  City               
Russia   Sankt-Petersburg  0    Spb
USA      New-York          0     NY
                           1    New
Name: Short name, dtype: object

If you don't care which mode is returned, as long as it's one of them, then you will need a lambda that calls mode and extracts the first result.

source2.groupby(['Country','City'])['Short name'].agg(
    lambda x: pd.Series.mode(x)[0])

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

Alternatives to (not) consider

You can also use statistics.mode from python, but…

source.groupby(['Country','City'])['Short name'].apply(statistics.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

…it does not work well when having to deal with multiple modes; a StatisticsError is raised. This is mentioned in the docs:

If data is empty, or if there is not exactly one most common value, StatisticsError is raised.

But you can see for yourself…

statistics.mode([1, 2])
# ---------------------------------------------------------------------------
# StatisticsError                           Traceback (most recent call last)
# ...
# StatisticsError: no unique mode; found 2 equally common values

Answer 2

For agg, the lambda function gets a Series, which does not have a 'Short name' attribute.

stats.mode returns a tuple of two arrays, so you have to take the first element of the first array in this tuple.

With these two simple changes:

source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0][0])

returns

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY

Answer 3

A little late to the game here, but I was running into some performance issues with HYRY’s solution, so I had to come up with another one.

It works by finding the frequency of each key-value, and then, for each key, only keeping the value that appears with it most often.

There’s also an additional solution that supports multiple modes.

On a scale test that’s representative of the data I’m working with, this reduced runtime from 37.4s to 0.5s!

Here’s the code for the solution, some example usage, and the scale test:

import numpy as np
import pandas as pd
import random
import time

test_input = pd.DataFrame(columns=[ 'key',          'value'],
                          data=  [[ 1,              'A'    ],
                                  [ 1,              'B'    ],
                                  [ 1,              'B'    ],
                                  [ 1,              np.nan ],
                                  [ 2,              np.nan ],
                                  [ 3,              'C'    ],
                                  [ 3,              'C'    ],
                                  [ 3,              'D'    ],
                                  [ 3,              'D'    ]])

def mode(df, key_cols, value_col, count_col):
    '''                                                                                                                                                                                                                                                                                                                                                              
    Pandas does not provide a `mode` aggregation function                                                                                                                                                                                                                                                                                                            
    for its `GroupBy` objects. This function is meant to fill                                                                                                                                                                                                                                                                                                        
    that gap, though the semantics are not exactly the same.                                                                                                                                                                                                                                                                                                         

    The input is a DataFrame with the columns `key_cols`                                                                                                                                                                                                                                                                                                             
    that you would like to group on, and the column                                                                                                                                                                                                                                                                                                                  
    `value_col` for which you would like to obtain the mode.                                                                                                                                                                                                                                                                                                         

    The output is a DataFrame with a record per group that has at least one mode                                                                                                                                                                                                                                                                                     
    (null values are not counted). The `key_cols` are included as columns, `value_col`                                                                                                                                                                                                                                                                               
    contains a mode (ties are broken arbitrarily and deterministically) for each                                                                                                                                                                                                                                                                                     
    group, and `count_col` indicates how many times each mode appeared in its group.                                                                                                                                                                                                                                                                                 
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)

def modes(df, key_cols, value_col, count_col):
    '''                                                                                                                                                                                                                                                                                                                                                              
    Pandas does not provide a `mode` aggregation function                                                                                                                                                                                                                                                                                                            
    for its `GroupBy` objects. This function is meant to fill                                                                                                                                                                                                                                                                                                        
    that gap, though the semantics are not exactly the same.                                                                                                                                                                                                                                                                                                         

    The input is a DataFrame with the columns `key_cols`                                                                                                                                                                                                                                                                                                             
    that you would like to group on, and the column                                                                                                                                                                                                                                                                                                                  
    `value_col` for which you would like to obtain the modes.                                                                                                                                                                                                                                                                                                        

    The output is a DataFrame with a record per group that has at least                                                                                                                                                                                                                                                                                              
    one mode (null values are not counted). The `key_cols` are included as                                                                                                                                                                                                                                                                                           
    columns, `value_col` contains lists indicating the modes for each group,                                                                                                                                                                                                                                                                                         
    and `count_col` indicates how many times each mode appeared in its group.                                                                                                                                                                                                                                                                                        
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .groupby(key_cols + [count_col])[value_col].unique() \
             .to_frame().reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)

print test_input
print mode(test_input, ['key'], 'value', 'count')
print modes(test_input, ['key'], 'value', 'count')

scale_test_data = [[random.randint(1, 100000),
                    str(random.randint(123456789001, 123456789100))] for i in range(1000000)]
scale_test_input = pd.DataFrame(columns=['key', 'value'],
                                data=scale_test_data)

start = time.time()
mode(scale_test_input, ['key'], 'value', 'count')
print time.time() - start

start = time.time()
modes(scale_test_input, ['key'], 'value', 'count')
print time.time() - start

start = time.time()
scale_test_input.groupby(['key']).agg(lambda x: x.value_counts().index[0])
print time.time() - start

Running this code will print something like:

   key value
0    1     A
1    1     B
2    1     B
3    1   NaN
4    2   NaN
5    3     C
6    3     C
7    3     D
8    3     D
   key value  count
1    1     B      2
2    3     C      2
   key  count   value
1    1      2     [B]
2    3      2  [C, D]
0.489614009857
9.19386196136
37.4375009537

Hope this helps!


Answer 4

The two top answers here suggest:

df.groupby(cols).agg(lambda x:x.value_counts().index[0])

or, preferably

df.groupby(cols).agg(pd.Series.mode)

However both of these fail in simple edge cases, as demonstrated here:

df = pd.DataFrame({
    'client_id':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'date':['2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01'],
    'location':['NY', 'NY', 'LA', 'LA', 'DC', 'DC', 'LA', np.NaN]
})

The first:

df.groupby(['client_id', 'date']).agg(lambda x:x.value_counts().index[0])

yields IndexError (because of the empty Series returned by group C). The second:

df.groupby(['client_id', 'date']).agg(pd.Series.mode)

returns ValueError: Function does not reduce, since the first group returns a list of two (since there are two modes). (As documented here, if the first group returned a single mode this would work!)

Two possible solutions for this case are:

import scipy.stats
df.groupby(['client_id', 'date']).agg(lambda x: scipy.stats.mode(x)[0])

And the solution given to me by cs95 in the comments here:

def foo(x): 
    m = pd.Series.mode(x); 
    return m.values[0] if not m.empty else np.nan
df.groupby(['client_id', 'date']).agg(foo)

However, all of these are slow and not suited for large datasets. A solution I ended up using which a) can deal with these cases and b) is much, much faster, is a lightly modified version of abw33’s answer (which should be higher):

def get_mode_per_column(dataframe, group_cols, col):
    return (dataframe.fillna(-1)  # NaN placeholder to keep group 
            .groupby(group_cols + [col])
            .size()
            .to_frame('count')
            .reset_index()
            .sort_values('count', ascending=False)
            .drop_duplicates(subset=group_cols)
            .drop(columns=['count'])
            .sort_values(group_cols)
            .replace(-1, np.NaN))  # restore NaNs

group_cols = ['client_id', 'date']    
non_grp_cols = list(set(df).difference(group_cols))
output_df = get_mode_per_column(df, group_cols, non_grp_cols[0]).set_index(group_cols)
for col in non_grp_cols[1:]:
    output_df[col] = get_mode_per_column(df, group_cols, col)[col].values

Essentially, the method works on one col at a time and outputs a df, so instead of concat, which is intensive, you treat the first as a df, and then iteratively add the output array (values.flatten()) as a column in the df.
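
As a quick illustrative check (assuming the snippet above has just been run on the edge-case df): client A's tie between NY and LA is broken arbitrarily by the sort, and client C keeps its NaN instead of raising:

print(output_df.reset_index())
#   client_id        date location
# 0         A  2019-01-01       NY   (or LA; the tie is broken by sort order)
# 1         B  2019-01-01       DC
# 2         C  2019-01-01      NaN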


回答 5

正确的答案是@eumiro解决方案。@HYRY解决方案的问题是,当您拥有[1,2,3,4]之类的数字序列时,解决方案是错误的,即您没有mode。例:

>>> import pandas as pd
>>> df = pd.DataFrame(
        {
            'client': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'], 
            'total': [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4], 
            'bla': [10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40]
        }
    )

如果您像@HYRY那样进行计算,则会得到:

>>> print(df.groupby(['client']).agg(lambda x: x.value_counts().index[0]))
        total  bla
client            
A           4   30
B           4   40
C           1   10
D           3   30
E           2   20

这显然是错误的(请看 A 的值,它应该是 1 而不是 4),因为它无法处理取值全部唯一的情况。

因此,另一种解决方案是正确的:

>>> import scipy.stats
>>> print(df.groupby(['client']).agg(lambda x: scipy.stats.mode(x)[0][0]))
        total  bla
client            
A           1   10
B           4   40
C           1   10
D           3   30
E           2   20

Formally, the correct answer is the @eumiro solution. The problem with the @HYRY solution is that when you have a sequence of numbers like [1, 2, 3, 4] the result is wrong, i.e., you don't have the mode. Example:

>>> import pandas as pd
>>> df = pd.DataFrame(
        {
            'client': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'], 
            'total': [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4], 
            'bla': [10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40]
        }
    )

If you compute like @HYRY you obtain:

>>> print(df.groupby(['client']).agg(lambda x: x.value_counts().index[0]))
        total  bla
client            
A           4   30
B           4   40
C           1   10
D           3   30
E           2   20

This is clearly wrong (see the A value, which should be 1 and not 4) because it cannot handle values that are all unique.

Thus, the other solution is correct:

>>> import scipy.stats
>>> print(df.groupby(['client']).agg(lambda x: scipy.stats.mode(x)[0][0]))
        total  bla
client            
A           1   10
B           4   40
C           1   10
D           3   30
E           2   20
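
To make the client A case concrete (a quick check against the df above): its total values are all unique, so every value ties at a count of 1; value_counts().index[0] then returns whichever tied value happens to come first, while scipy.stats.mode deterministically reports the smallest.

>>> df[df.client == 'A'].total.mode().tolist()   # every value ties with count 1
[1, 2, 3, 4]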

回答 6

如果您想要另一种不依赖 value_counts 或 scipy.stats 的解决方法,可以使用 collections 中的 Counter:

from collections import Counter
get_most_common = lambda values: max(Counter(values).items(), key = lambda x: x[1])[0]

这样可以应用于上面的例子

src = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

src.groupby(['Country','City']).agg(get_most_common)

If you want another approach that does not depend on value_counts or scipy.stats, you can use Counter from the collections module

from collections import Counter
get_most_common = lambda values: max(Counter(values).items(), key = lambda x: x[1])[0]

Which can be applied to the above example like this

src = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

src.groupby(['Country','City']).agg(get_most_common)
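
On the example frame this should print something like the following (illustrative output; on ties, max keeps whichever value Counter saw first):

print(src.groupby(['Country', 'City']).agg(get_most_common))
#                          Short_name
# Country City
# Russia  Sankt-Petersburg        Spb
# USA     New-York                 NY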

回答 7

如果您不想把 NaN 值计算在内,使用 Counter 比 pd.Series.mode 或 pd.Series.value_counts()[0] 快得多:

from collections import Counter

def get_most_common(srs):
    x = list(srs)
    my_counter = Counter(x)
    return my_counter.most_common(1)[0][0]

df.groupby(col).agg(get_most_common)

应该管用。当您具有NaN值时,这将失败,因为每个NaN都将被分别计算。

If you don’t want to include NaN values, using Counter is much much faster than pd.Series.mode or pd.Series.value_counts()[0]:

from collections import Counter

def get_most_common(srs):
    x = list(srs)
    my_counter = Counter(x)
    return my_counter.most_common(1)[0][0]

df.groupby(col).agg(get_most_common)

should work. This will fail when you have NaN values, as each NaN will be counted separately.
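
If the NaN behaviour matters, here is a minimal sketch (not part of the original answer) that drops missing values before counting and falls back to NaN for all-missing groups:

import numpy as np
from collections import Counter

def get_most_common_dropna(srs):
    # count only non-missing values; an empty Counter means the whole group was NaN
    counts = Counter(srs.dropna())
    return counts.most_common(1)[0][0] if counts else np.nan

df.groupby(col).agg(get_most_common_dropna)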


回答 8

这里的问题是性能,如果您有很多行,那将是一个问题。

如果是您的情况,请尝试以下操作:

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])

source.groupby(['Country','City']).Short_name.value_counts().groupby(['Country','City']).first()

The problem here is performance: if you have a lot of rows, it will be an issue.

If it is your case, please try with this:

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])

source.groupby(['Country','City']).Short_name.value_counts().groupby(['Country','City']).first()
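
Note that .first() here returns the highest count in each group, while the corresponding value lives in the index. If you want the modal value itself, a hedged variant (same column names as above) keeps the top row of each group instead and reads the value back out of the index:

top = (source.groupby(['Country', 'City'])['Short_name']
             .value_counts()                   # counts, sorted descending within each group
             .groupby(['Country', 'City'])
             .head(1))                         # keep the top (most frequent) row per group
mode_values = top.index.to_frame(index=False)  # columns: Country, City, Short_name (the mode)
print(mode_values)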

回答 9

对于较大的数据集,较为笨拙但较快的方法包括获取感兴趣列的计数,将计数从高到低排序,然后对子集进行重复数据删除以仅保留最大的个案。代码示例如下:

>>> import pandas as pd
>>> source = pd.DataFrame(
        {
            'Country': ['USA', 'USA', 'Russia', 'USA'], 
            'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
            'Short name': ['NY', 'New', 'Spb', 'NY']
        }
    )
>>> grouped_df = source\
        .groupby(['Country','City','Short name'])[['Short name']]\
        .count()\
        .rename(columns={'Short name':'count'})\
        .reset_index()\
        .sort_values('count', ascending=False)\
        .drop_duplicates(subset=['Country', 'City'])\
        .drop('count', axis=1)
>>> print(grouped_df)
  Country              City Short name
1     USA          New-York         NY
0  Russia  Sankt-Petersburg        Spb

A slightly clumsier but faster approach for larger datasets involves getting the counts for the column of interest, sorting the counts from highest to lowest, and then de-duplicating on a subset to retain only the largest cases. The code example follows:

>>> import pandas as pd
>>> source = pd.DataFrame(
        {
            'Country': ['USA', 'USA', 'Russia', 'USA'], 
            'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
            'Short name': ['NY', 'New', 'Spb', 'NY']
        }
    )
>>> grouped_df = source\
        .groupby(['Country','City','Short name'])[['Short name']]\
        .count()\
        .rename(columns={'Short name':'count'})\
        .reset_index()\
        .sort_values('count', ascending=False)\
        .drop_duplicates(subset=['Country', 'City'])\
        .drop('count', axis=1)
>>> print(grouped_df)
  Country              City Short name
1     USA          New-York         NY
0  Russia  Sankt-Petersburg        Spb
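
One caveat (an added note, not part of the original answer): when two values tie within a group, drop_duplicates simply keeps whichever row the sort happened to put first. Adding a secondary sort key makes the tie-breaking deterministic, e.g. keeping the alphabetically smallest value:

>>> grouped_df = source\
        .groupby(['Country','City','Short name'])[['Short name']]\
        .count()\
        .rename(columns={'Short name':'count'})\
        .reset_index()\
        .sort_values(['count', 'Short name'], ascending=[False, True])\
        .drop_duplicates(subset=['Country', 'City'])\
        .drop('count', axis=1)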