As of Numpy 1.9, the easiest and fastest method is to simply use numpy.unique, which now has a return_counts keyword argument:
import numpy as np
x = np.array([1,1,1,2,2,2,5,25,1,1])
unique, counts = np.unique(x, return_counts=True)
print(np.asarray((unique, counts)).T)
Which gives:
[[ 1 5]
[ 2 3]
[ 5 1]
[25 1]]
A quick comparison with scipy.stats.itemfreq:
In [4]: x = np.random.random_integers(0,100,1e6)
In [5]: %timeit unique, counts = np.unique(x, return_counts=True)
10 loops, best of 3: 31.5 ms per loop
In [6]: %timeit scipy.stats.itemfreq(x)
10 loops, best of 3: 170 ms per loop
Answer 2
Update: the method mentioned in the original answer is deprecated; the new approach below should be used instead:
>>> import numpy as np
>>> x = [1,1,1,2,2,2,5,25,1,1]
>>> np.array(np.unique(x, return_counts=True)).T
array([[ 1,  5],
       [ 2,  3],
       [ 5,  1],
       [25,  1]])
>>> from scipy.stats import itemfreq
>>> x = [1,1,1,2,2,2,5,25,1,1]
>>> itemfreq(x)
/usr/local/bin/python:1: DeprecationWarning: `itemfreq` is deprecated! `itemfreq` is deprecated and will be removed in a future version. Use instead `np.unique(..., return_counts=True)`
array([[ 1., 5.],
[ 2., 3.],
[ 5., 1.],
[ 25., 1.]])
Unlike the currently accepted answer, it works on any datatype that is sortable (not just positive ints), and it has optimal performance; the only significant expense is in the sorting done by np.unique.
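To illustrate the point about sortable datatypes, here is a small sketch (the string data is my own example, not from the answer) showing the same pattern on an array of strings:

```python
import numpy as np

# return_counts works for any sortable dtype, not just non-negative ints
words = np.array(["apple", "pear", "apple", "plum", "pear", "apple"])
unique, counts = np.unique(words, return_counts=True)
print(dict(zip(unique, counts.tolist())))
```

Note that `np.unique` returns the values in sorted order, so the counts line up with the sorted unique values.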
numpy.bincount is probably the best choice. If your array contains anything besides small dense integers, it might be useful to wrap it in something like this:
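The wrapper code itself is missing from this copy of the answer; the following is a plausible sketch of what such a wrapper looks like (the function name `count_values` is my own), using `return_inverse` to map arbitrary values down to a dense 0..k-1 range so that `bincount`'s output stays small:

```python
import numpy as np

def count_values(arr):
    # Hypothetical wrapper (reconstruction, not the original answer's code):
    # map values to dense indices 0..k-1, then bincount those indices, so
    # large or sparse values don't blow up the bincount array.
    unique, inverse = np.unique(arr, return_inverse=True)
    counts = np.bincount(inverse)
    return unique, counts

u, c = count_values(np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1]))
print(np.asarray((u, c)).T)  # same [[value count]] table as above
```

Without the remapping, `np.bincount([1,1,1,2,2,2,5,25,1,1])` would allocate a slot for every integer from 0 to 25, most of them zero.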
Even though this has already been answered, I suggest a different approach that makes use of numpy.histogram. Given a sequence, this function returns the frequency of its elements grouped into bins.
Beware though: it works in this example because the numbers are integers. If they were real numbers, this solution would not apply as nicely.
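The answer's own code is not included in this copy; a minimal sketch of the histogram approach, assuming one bin per integer value, would look like this:

```python
import numpy as np

x = np.array([1, 1, 1, 2, 2, 2, 5, 25, 1, 1])
# One bin per integer value; as warned above, this only works cleanly
# because the data are integers.
bins = np.arange(x.min(), x.max() + 2)  # bin edges, right edge exclusive
freq, edges = np.histogram(x, bins=bins)
# Keep only the values that actually occur
nonzero = freq > 0
print(list(zip(edges[:-1][nonzero].tolist(), freq[nonzero].tolist())))
```

Each `(value, count)` pair comes from a bin `[value, value + 1)`, which is why non-integer data would smear across bins instead of counting exactly.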
To count unique non-integers – similar to Eelco Hoogendoorn’s answer but considerably faster (a factor of 5 on my machine) – I used weave.inline to combine numpy.unique with a bit of C code:
import numpy as np
from scipy import weave
def count_unique(datain):
"""
Similar to numpy.unique function for returning unique members of
data, but also returns their counts
"""
data = np.sort(datain)
uniq = np.unique(data)
nums = np.zeros(uniq.shape, dtype='int')
code="""
int i,count,j;
j=0;
count=0;
for(i=1; i<Ndata[0]; i++){
count++;
if(data(i) > data(i-1)){
nums(j) = count;
count = 0;
j++;
}
}
// Handle last value
nums(j) = count+1;
"""
weave.inline(code,
['data', 'nums'],
extra_compile_args=['-O2'],
type_converters=weave.converters.blitz)
return uniq, nums
Profile info
> %timeit count_unique(data)
> 10000 loops, best of 3: 55.1 µs per loop
Eelco’s pure numpy version:
> %timeit unique_count(data)
> 1000 loops, best of 3: 284 µs per loop
Note
There’s redundancy here (unique performs a sort also), meaning that the code could probably be further optimized by putting the unique functionality inside the c-code loop.
Old question, but I’d like to provide my own solution, which turned out to be the fastest in my benchmarks; use a plain list instead of an np.array as input (or convert to a list first).
Check it out if you run into the same need.
def count(a):
results = {}
for x in a:
if x not in results:
results[x] = 1
else:
results[x] += 1
return results
For example:
>>> %timeit count([1,1,1,2,2,2,5,25,1,1])
See the comments below on caching and other in-RAM side effects that can heavily skew timing results when a small dataset is tested repeatedly.
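Timing aside, the dictionary the function returns for the example input is deterministic and easy to verify:

```python
def count(a):
    # Same dictionary-based counter as defined in the answer above
    results = {}
    for x in a:
        if x not in results:
            results[x] = 1
        else:
            results[x] += 1
    return results

print(count([1, 1, 1, 2, 2, 2, 5, 25, 1, 1]))  # {1: 5, 2: 3, 5: 1, 25: 1}
```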
Answer 11
Something like this should do it:
import numpy

#create 100 random numbers
arr = numpy.random.random_integers(0,50,100)
#create a dictionary of the unique values
d = dict([(i,0) for i in numpy.unique(arr)])
for number in arr:
    d[number] += 1 #increment when that value is found