Question: finding the most frequent number in a numpy vector

Suppose I have the following list in python:

a = [1,2,3,1,2,1,1,1,3,2,2,1]

How to find the most frequent number in this list in a neat way?


Answer 0

If your list contains all non-negative ints, you should take a look at numpy.bincount:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html

and then probably use np.argmax:

import numpy as np

a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])
counts = np.bincount(a)
print(np.argmax(counts))

For a more complicated list (one that perhaps contains negative numbers or non-integer values), you can use np.histogram in a similar way (see the sketch after the Counter example below). Alternatively, if you just want to work in python without using numpy, collections.Counter is a good way of handling this sort of data:

from collections import Counter
a = [1,2,3,1,2,1,1,1,3,2,2,1]
b = Counter(a)
print(b.most_common(1))
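
A quick sketch of the np.histogram route mentioned above (the data and the bin count of 4 are assumptions for illustration; a histogram locates the most populated bin rather than an exact value):

import numpy as np

a = np.array([-1.5, 2.0, 3.0, -1.5, 2.0, -1.5, -1.5])
counts, bin_edges = np.histogram(a, bins=4)
most_frequent_bin = np.argmax(counts)
# The most frequent values lie somewhere inside this bin:
print(bin_edges[most_frequent_bin], bin_edges[most_frequent_bin + 1])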

Answer 1

You may use

import numpy as np

values, counts = np.unique(a, return_counts=True)
ind = np.argmax(counts)
print(values[ind])  # prints the most frequent element

If several elements tie for the highest frequency, this code returns only the first of them.
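
If you need every tied value rather than just the first, a small extension of the same idea (my sketch, not part of the original answer) is:

import numpy as np

a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])
values, counts = np.unique(a, return_counts=True)
all_modes = values[counts == counts.max()]  # every value sharing the top count
print(all_modes)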


Answer 2

If you’re willing to use SciPy:

>>> from scipy.stats import mode
>>> mode([1,2,3,1,2,1,1,1,3,2,2,1])
(array([ 1.]), array([ 6.]))
>>> most_frequent = mode([1,2,3,1,2,1,1,1,3,2,2,1])[0][0]
>>> most_frequent
1.0
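
The transcript above shows the output of an older SciPy. In more recent releases (roughly SciPy 1.9 and later; an assumption about your installed version), mode returns its results as a ModeResult named tuple, so attribute access is clearer:

from scipy.stats import mode

result = mode([1,2,3,1,2,1,1,1,3,2,2,1])
print(result.mode, result.count)  # e.g. 1 6 (exact shapes depend on the SciPy version)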

Answer 3

Performances (using iPython) for some solutions found here:

>>> # small array
>>> a = [12,3,65,33,12,3,123,888000]
>>> 
>>> import collections
>>> collections.Counter(a).most_common()[0][0]
3
>>> %timeit collections.Counter(a).most_common()[0][0]
100000 loops, best of 3: 11.3 µs per loop
>>> 
>>> import numpy
>>> numpy.bincount(a).argmax()
3
>>> %timeit numpy.bincount(a).argmax()
100 loops, best of 3: 2.84 ms per loop
>>> 
>>> import scipy.stats
>>> scipy.stats.mode(a)[0][0]
3.0
>>> %timeit scipy.stats.mode(a)[0][0]
10000 loops, best of 3: 172 µs per loop
>>> 
>>> from collections import defaultdict
>>> def jjc(l):
...     d = defaultdict(int)
...     for i in l:
...         d[i] += 1
...     return sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]
... 
>>> jjc(a)[0]
3
>>> %timeit jjc(a)[0]
100000 loops, best of 3: 5.58 µs per loop
>>> 
>>> max(map(lambda val: (a.count(val), val), set(a)))[1]
12
>>> %timeit max(map(lambda val: (a.count(val), val), set(a)))[1]
100000 loops, best of 3: 4.11 µs per loop
>>> 

For small arrays like the one in the question, ‘max’ with ‘set’ is the fastest.

According to @David Sanders, if you increase the array size to something like 100,000 elements, the “max w/set” algorithm ends up being the worst by far whereas the “numpy bincount” method is the best.
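
To reproduce the large-array comparison, a sketch along these lines (the size and value range are assumptions) can be pasted into an IPython session:

import collections
import numpy as np

a = np.random.randint(0, 1000, size=100000).tolist()
# %timeit collections.Counter(a).most_common(1)[0][0]
# %timeit np.bincount(a).argmax()
# %timeit max(map(lambda val: (a.count(val), val), set(a)))[1]  # very slow at this size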


Answer 4

Also, if you want to get the most frequent value (positive or negative) without loading any modules, you can use the following code:

lVals = [1,2,3,1,2,1,1,1,3,2,2,1]
print(max(map(lambda val: (lVals.count(val), val), set(lVals))))  # (6, 1): the value 1 occurs 6 times
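
Note that max compares the (count, value) pairs lexicographically, so when two values tie on count, the larger value wins. A tiny illustration (my example, not from the answer):

lVals = [1, 1, 2, 2, 3]
print(max(map(lambda val: (lVals.count(val), val), set(lVals))))  # (2, 2), not (2, 1)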

Answer 5

While most of the answers above are useful, you may: 1) need support for non-positive-integer values (e.g. floats or negative integers ;-)), 2) not be on Python 2.7 (which collections.Counter requires), and 3) prefer not to add a scipy (or even numpy) dependency to your code. In that case, a pure Python 2.6 solution that is O(n log n) (i.e., efficient) is just this:

from collections import defaultdict

a = [1,2,3,1,2,1,1,1,3,2,2,1]

d = defaultdict(int)
for i in a:
    d[i] += 1
# d.iteritems() is the Python 2 spelling; on Python 3, use d.items()
most_frequent = sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]
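
A minor variant (my suggestion, not part of the original answer): taking max with a key avoids the full sort, bringing the final step down to O(n):

most_frequent = max(d.iteritems(), key=lambda x: x[1])  # a (value, count) pair; use d.items() on Python 3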

Answer 6

I like the solution by JoshAdel.

But there is just one catch.

The np.bincount() solution only works on numbers.

If you have strings, the collections.Counter solution will work for you.
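
For instance (my example, not from the answer):

from collections import Counter

words = ['apple', 'banana', 'apple', 'cherry', 'apple']
print(Counter(words).most_common(1))  # [('apple', 3)]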


Answer 7

Expanding on the np.unique method above, for when you also need the index of the mode in the actual array, e.g. to see how far the value is from the center of the distribution:

import numpy as np

(_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
index = idx[np.argmax(counts)]
mode = a[index]

Remember to discard the mode when several counts tie for the maximum; note that the original check, len(np.argmax(counts)) > 1, does not work, because np.argmax returns a scalar.
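
A working tie test (my sketch) would be:

if (counts == counts.max()).sum() > 1:
    pass  # more than one value shares the top count; the mode is ambiguous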


Answer 8

In Python 3 the following should work:

max(set(a), key=lambda x: a.count(x))
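
The same one-liner also runs on Python 2, and passing the bound method directly is slightly tidier:

max(set(a), key=a.count)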

Answer 9

Starting in Python 3.4, the standard library includes the statistics.mode function to return the single most common data point.

from statistics import mode

mode([1, 2, 3, 1, 2, 1, 1, 1, 3, 2, 2, 1])
# 1

If there are multiple modes with the same frequency, statistics.mode raises StatisticsError on Python 3.4 through 3.7; from Python 3.8 on, it returns the first one encountered.


Starting in Python 3.8, the statistics.multimode function returns a list of the most frequently occurring values in the order they were first encountered:

from statistics import multimode

multimode([1, 2, 3, 1, 2])
# [1, 2]

Answer 10

Here is a general solution that can be applied along an axis, regardless of values, using numpy alone. I’ve also found it to be much faster than scipy.stats.mode when there are a lot of unique values.

import numpy

def mode(ndarray, axis=0):
    # Check inputs
    ndarray = numpy.asarray(ndarray)
    ndim = ndarray.ndim
    if ndarray.size == 1:
        return (ndarray[0], 1)
    elif ndarray.size == 0:
        raise ValueError('Cannot compute mode on empty array')
    try:
        axis = range(ndarray.ndim)[axis]
    except (IndexError, TypeError):
        raise ValueError('Axis "{}" incompatible with the {}-dimensional array'.format(axis, ndim))

    # If the array is 1-D and the numpy version is >= 1.9, numpy.unique suffices
    if ndim == 1 and tuple(int(v) for v in numpy.__version__.split('.')[:2]) >= (1, 9):
        modals, counts = numpy.unique(ndarray, return_counts=True)
        index = numpy.argmax(counts)
        return modals[index], counts[index]

    # Sort array
    sort = numpy.sort(ndarray, axis=axis)
    # Create array to transpose along the axis and get padding shape
    transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
    shape = list(sort.shape)
    shape[axis] = 1
    # Create a boolean array along strides of unique values
    strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                 numpy.diff(sort, axis=axis) == 0,
                                 numpy.zeros(shape=shape, dtype='bool')],
                                axis=axis).transpose(transpose).ravel()
    # Count the stride lengths
    counts = numpy.cumsum(strides)
    counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
    counts[strides] = 0
    # Get shape of padded counts and slice to return to the original shape
    shape = numpy.array(sort.shape)
    shape[axis] += 1
    shape = shape[transpose]
    slices = [slice(None)] * ndim
    slices[axis] = slice(1, None)
    # Reshape and compute final counts
    counts = counts.reshape(shape).transpose(transpose)[tuple(slices)] + 1

    # Find maximum counts and return modals/counts
    slices = [slice(None, i) for i in sort.shape]
    del slices[axis]
    index = list(numpy.ogrid[tuple(slices)])
    index.insert(axis, numpy.argmax(counts, axis=axis))
    # Index with a tuple; modern numpy rejects indexing an array with a plain list
    return sort[tuple(index)], counts[tuple(index)]
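
A usage sketch for the function above (my example, assuming the function is defined as written; the output shapes follow from how the index arrays broadcast):

import numpy

arr = numpy.array([[1, 2, 2],
                   [1, 3, 2],
                   [4, 3, 2]])
modals, counts = mode(arr, axis=0)  # most common value in each column
print(modals, counts)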

Answer 11

I was recently working on a project and using collections.Counter (which tortured me).

In my opinion, Counter from collections has very, very bad performance. It’s just a class wrapping dict().

What’s worse, if you use cProfile to profile its methods, you will see a lot of ‘__missing__’ and ‘__instancecheck__’ calls wasting the whole time.

Be careful using its most_common(): every call invokes a sort, which makes it extremely slow. And if you use most_common(x), it invokes a heap sort, which is also slow.

Btw, numpy’s bincount also has a problem: np.bincount([1,2,4000000]) gives you an array with 4,000,001 elements (one bin for every integer from 0 up to the maximum value).

