pd.unique returns the unique values from an input array, or DataFrame column or index.
The input to this function needs to be one-dimensional, so multiple columns will need to be combined. The simplest way is to select the columns you want and then view the values in a flattened NumPy array. The whole operation looks like this:
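Presumably (reconstructed from the timings further down) the operation is:

pd.unique(df[['Col1', 'Col2']].values.ravel('K'))
# array(['Bob', 'Joe', 'Bill', 'Mary', 'Steve'], dtype=object)  # for the example df below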
Note that ravel() is an array method that returns a view (if possible) of a multidimensional array. The argument 'K' tells the method to flatten the array in the order the elements are stored in memory (pandas typically stores underlying arrays in Fortran-contiguous order, columns before rows). This can be significantly faster than using the method's default 'C' order.
An alternative way is to select the columns and pass them to np.unique:
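Presumably the call is (note that np.unique returns its result sorted):

np.unique(df[['Col1', 'Col2']].values)
# array(['Bill', 'Bob', 'Joe', 'Mary', 'Steve'], dtype=object)  # for the example df below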
There is no need to use ravel() here as the method handles multidimensional arrays. Even so, this is likely to be slower than pd.unique as it uses a sort-based algorithm rather than a hashtable to identify unique values.
The difference in speed is significant for larger DataFrames (especially if there are only a handful of unique values):
>>> df1 = pd.concat([df]*100000, ignore_index=True) # DataFrame with 500000 rows
>>> %timeit np.unique(df1[['Col1', 'Col2']].values)
1 loop, best of 3: 1.12 s per loop
>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel('K'))
10 loops, best of 3: 38.9 ms per loop
>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel()) # ravel using C order
10 loops, best of 3: 49.9 ms per loop
An updated solution for numpy v1.13+: specify the axis in np.unique when using multiple columns, otherwise the array is implicitly flattened.
import numpy as np
np.unique(df[['Col1', 'Col2']], axis=0)
For reference, here is the example DataFrame from the question:

Col1 Col2 Col3
0 Bob Joe 0.201079
1 Joe Steve 0.703279
2 Bill Bob 0.722724
3 Mary Bob 0.093912
4 Joe Steve 0.766027

and the desired output, the unique values across Col1 and Col2:

set(['Steve', 'Bob', 'Bill', 'Joe', 'Mary'])
So I’m trying to make this program that will ask the user for input and store the values in an array / list.
Then when a blank line is entered it will tell the user how many of those values are unique.
I’m building this for real life reasons and not as a problem set.
enter: happy
enter: rofl
enter: happy
enter: mpg8
enter: Cpp
enter: Cpp
enter:
There are 4 unique words!
My code is as follows:
# ask for input
ipta = raw_input("Word: ")
# create list
uniquewords = []
counter = 0
uniquewords.append(ipta)
a = 0 # loop thingy
# while loop to ask for input and append in list
while ipta:
ipta = raw_input("Word: ")
new_words.append(input1)
counter = counter + 1
for p in uniquewords:
..and that’s about all I’ve gotten so far.
I’m not sure how to count the number of unique words in a list.
If someone can post the solution so I can learn from it, or at least show me how it would be great, thanks!
from collections import Counter
words = ['a', 'b', 'c', 'a']
Counter(words).keys() # equals to list(set(words))
Counter(words).values() # counts the elements' frequency
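For the word-counting question above, the length of the Counter (a tie-in of mine, not part of the original answer) gives the number of unique words directly:

from collections import Counter
words = ['a', 'b', 'c', 'a']
len(Counter(words))  # 3 unique words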
Although a set is the easiest way, you could also use a dict and check some_dict.has_key(key) (or, in modern Python, key in some_dict) to populate a dictionary with only unique keys and values.
Assuming you have already populated words[] with input from the user, create a dict mapping the unique words in the list to a number:
word_map = {}
i = 1
for j in range(len(words)):
if not word_map.has_key(words[j]):
word_map[words[j]] = i
i += 1
num_unique_words = len(word_map) # or num_unique_words = i, however you prefer
Answer 8
Another method, using pandas:
import pandas as pd
LIST =["a","a","c","a","a","v","d"]
vc = pd.Series(LIST).value_counts()
counts, values = vc.values, vc.index
df_results = pd.DataFrame(list(zip(values, counts)), columns=["value", "count"])
uniquewords = []
while True:
ipta = raw_input("Word: ")
if ipta == "":
break
if not ipta in uniquewords:
uniquewords.append(ipta)
print "There are", len(uniquewords), "unique words!"
Answer 12
ipta = raw_input("Word: ")## asks for input
words =[]## creates listwhile ipta:## while loop to ask for input and append in list
words.append(ipta)
ipta = raw_input("Word: ")
words.append(ipta)#Create a set, sets do not have repeats
unique_words = set(words)print"There are "+ str(len(unique_words))+" unique words!"
ipta = raw_input("Word: ") ## asks for input
words = [] ## creates list
while ipta: ## while loop to ask for input and append in list
words.append(ipta)
ipta = raw_input("Word: ")
words.append(ipta)
#Create a set, sets do not have repeats
unique_words = set(words)
print "There are " + str(len(unique_words)) + " unique words!"
Not the most efficient, but straightforward and concise:
if len(x) > len(set(x)):
pass # do something
Probably won’t make much of a difference for short lists.
Answer 1
Here are two one-liners that also exit early:
>>> def allUnique(x):
... seen = set()
... return not any(i in seen or seen.add(i) for i in x)
...
>>> allUnique("ABCDEF")
True
>>> allUnique("ABACDEF")
False
If the elements of x aren’t hashable, then you’ll have to resort to using a list for seen:
>>> def allUnique(x):
... seen = list()
... return not any(i in seen or seen.append(i) for i in x)
...
>>> allUnique([list("ABC"), list("DEF")])
True
>>> allUnique([list("ABC"), list("DEF"), list("ABC")])
False
Answer 2
An early-exit solution could be:
def unique_values(g):
    s = set()
    for x in g:
        if x in s:
            return False
        s.add(x)
    return True
You can use Yan’s syntax (len(x) > len(set(x))), but instead of set(x), define a function:
def f5(seq, idfun=None):
# order preserving
if idfun is None:
def idfun(x): return x
seen = {}
result = []
for item in seq:
marker = idfun(item)
# in old Python versions:
# if seen.has_key(marker)
# but in new ones:
if marker in seen: continue
seen[marker] = 1
result.append(item)
return result
and do len(x) > len(f5(x)). This will be fast and is also order preserving.
>>> a # I have
array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
>>> new_a # I want to get to
array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 1, 0]])
I know that I can create a set and loop over the array, but I am looking for an efficient pure numpy solution. I believe there is a way to set the data type to void and then just use numpy.unique, but I couldn’t figure out how to make it work.
Also, at least on my system, it is performance-wise on par with, or even better than, the lexsort method:
a = np.random.randint(2, size=(10000, 6))
%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
100 loops, best of 3: 3.17 ms per loop
%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
100 loops, best of 3: 5.93 ms per loop
a = np.random.randint(2, size=(10000, 100))
%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
10 loops, best of 3: 29.9 ms per loop
%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
10 loops, best of 3: 116 ms per loop
If you want to avoid the memory expense of converting to a series of tuples or another similar data structure, you can exploit numpy’s structured arrays.
The trick is to view your original array as a structured array where each item corresponds to a row of the original array. This doesn’t make a copy, and is quite efficient.
To understand what’s going on, have a look at the intermediary results.
Once we view things as a structured array, each element in the array is a row in your original array. (Basically, it’s a similar data structure to a list of tuples.)
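The intermediary results are not reproduced here; a minimal sketch of the void-view trick (variable names are mine):

import numpy as np

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

# View each row as one opaque `void` item spanning the row's bytes; this is
# a view, not a copy, so each element of b stands for a whole row of a.
b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, idx = np.unique(b, return_index=True)  # np.unique now compares whole rows as scalars
unique_a = a[idx]  # the unique rows, at their first occurrences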
When I run np.unique on np.random.random(100).reshape(10,10) it returns all the unique individual elements, but you want the unique rows, so first you need to put them into tuples:
array = #your numpy array of lists
new_array = [tuple(row) for row in array]
uniques = np.unique(new_array)
That is the only way I see to change the types to do what you want, and I am not sure if the list iteration to convert to tuples is okay with your requirement of “not looping through”.
np.unique works by sorting a flattened array, then looking at whether each item is equal to the previous. This can be done manually without flattening:
ind = np.lexsort(a.T)
a[ind[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]]
This method does not use tuples, and should be much faster and simpler than other methods given here.
NOTE: A previous version of this did not have the ind right after a[, which meant that the wrong indices were used. Also, Joe Kington makes a good point that this does make a variety of intermediate copies. The following method makes fewer, by making a sorted copy and then using views of it:
b = a[np.lexsort(a.T)]
b[np.concatenate(([True], np.any(b[1:] != b[:-1],axis=1)))]
This is faster and uses less memory.
Also, if you want to find unique rows in an ndarray regardless of how many dimensions are in the array, the following will work:
b = a[np.lexsort(a.reshape((a.shape[0], -1)).T)]
b[np.concatenate(([True], np.any(b[1:] != b[:-1], axis=tuple(range(1, a.ndim)))))]
An interesting remaining issue would be if you wanted to sort/unique along an arbitrary axis of an arbitrary-dimension array, something that would be more difficult.
Edit:
To demonstrate the speed differences, I ran a few tests in ipython of the three different methods described in the answers. With your exact a, there isn’t too much of a difference, though this version is a bit faster:
In [87]: %timeit unique(a.view(dtype)).view('<i8')
10000 loops, best of 3: 48.4 us per loop
In [88]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True], np.any(a[ind[1:]]!= a[ind[:-1]], axis=1)))]
10000 loops, best of 3: 37.6 us per loop
In [89]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10000 loops, best of 3: 41.6 us per loop
With a larger a, however, this version ends up being much, much faster:
In [96]: a = np.random.randint(0,2,size=(10000,6))
In [97]: %timeit unique(a.view(dtype)).view('<i8')
10 loops, best of 3: 24.4 ms per loop
In [98]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10 loops, best of 3: 28.2 ms per loop
In [99]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!= a[ind[:-1]],axis=1)))]
100 loops, best of 3: 3.25 ms per loop
I’ve compared the suggested alternatives for speed and found that, surprisingly, the void-view unique solution is even a bit faster than numpy’s native unique with the axis argument. If you’re looking for speed, you’ll want the void-view approach shown above rather than np.unique(a, axis=0).
I didn’t like any of these answers because none handle floating-point arrays in a linear algebra or vector space sense, where two rows being “equal” means “within some 𝜀”. The one answer that has a tolerance threshold, https://stackoverflow.com/a/26867764/500207, took the threshold to be both element-wise and decimal precision, which works for some cases but isn’t as mathematically general as a true vector distance.
Here’s my version:
from scipy.spatial.distance import squareform, pdist
def uniqueRows(arr, thresh=0.0, metric='euclidean'):
"Returns subset of rows that are unique, in terms of Euclidean distance"
distances = squareform(pdist(arr, metric=metric))
idxset = {tuple(np.nonzero(v)[0]) for v in distances <= thresh}
return arr[[x[0] for x in idxset]]
# With this, unique columns are super-easy:
def uniqueColumns(arr, *args, **kwargs):
return uniqueRows(arr.T, *args, **kwargs)
The public-domain function above uses scipy.spatial.distance.pdist to find the Euclidean (customizable) distance between each pair of rows. Then it compares each distance to a threshold to find the rows that are within thresh of each other, and returns just one row from each thresh-cluster.
As hinted, the distance metric needn’t be Euclidean—pdist can compute sundry distances including cityblock (Manhattan-norm) and cosine (the angle between vectors).
If thresh=0 (the default), then rows have to be bit-exact to be considered “unique”. Other good values for thresh use scaled machine precision, e.g., thresh=np.spacing(1)*1e3.
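A small usage sketch (the example array is mine, not from the answer):

import numpy as np
arr = np.array([[0.0, 1.0],
                [0.0, 1.0 + 1e-12],  # within thresh of the row above
                [2.0, 3.0]])
uniqueRows(arr, thresh=1e-9)  # keeps one of the first two rows, plus [2., 3.]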
Answer 9
Why not use drop_duplicates from pandas:
>>> timeit pd.DataFrame(image.reshape(-1,3)).drop_duplicates().values
1 loops, best of 3: 3.08 s per loop
>>> timeit np.vstack({tuple(r) for r in image.reshape(-1,3)})
1 loops, best of 3: 51 s per loop
Based on the answers on this page, I have written a function that replicates the capability of MATLAB’s unique(input,'rows') function, with the additional feature of accepting a tolerance for the uniqueness check. It also returns the indices such that c = data[ia,:] and data = c[ic,:]. Please report if you see any discrepancies or errors.
def unique_rows(data, prec=5):
    import numpy as np
    d_r = np.fix(data * 10 ** prec) / 10 ** prec + 0.0
    b = np.ascontiguousarray(d_r).view(np.dtype((np.void, d_r.dtype.itemsize * d_r.shape[1])))
    c, ia, ic = np.unique(b, return_index=True, return_inverse=True)
    return c.view(d_r.dtype).reshape(-1, d_r.shape[1]), ia, ic
Beyond @Jaime’s excellent answer, another way to collapse a row is to use a.strides[0] (assuming a is C-contiguous), which is equal to a.dtype.itemsize*a.shape[1]. Furthermore, void(n) is a shortcut for dtype((void, n)). We finally arrive at this shortest version:
a[unique(a.view(void(a.strides[0])),1)[1]]
which, applied to the question’s a, gives:
[[0 1 1 1 0 0]
[1 1 1 0 0 0]
[1 1 1 1 1 0]]
Answer 14
For general use, such as 3D or higher-dimensional nested arrays, try this:
import numpy as np

def unique_nested_arrays(ar):
    origin_shape = ar.shape
    origin_dtype = ar.dtype
    ar = ar.reshape(origin_shape[0], np.prod(origin_shape[1:]))
    ar = np.ascontiguousarray(ar)
    unique_ar = np.unique(ar.view([('', origin_dtype)] * np.prod(origin_shape[1:])))
    return unique_ar.view(origin_dtype).reshape((unique_ar.shape[0],) + origin_shape[1:])
which works on your 2D dataset:
a = np.array([[1,1,1,0,0,0],[0,1,1,1,0,0],[0,1,1,1,0,0],[1,1,1,0,0,0],[1,1,1,1,1,0]])
unique_nested_arrays(a)
None of these answers worked for me. I’m assuming it’s because my rows contained strings rather than numbers. However, this answer from another thread did work:
coor = np.array([[10, 10], [12, 9], [10, 5], [12, 9]])
coor_tuple = [tuple(x) for x in coor]
unique_coor = sorted(set(coor_tuple), key=lambda x: coor_tuple.index(x))
unique_count = [coor_tuple.count(x) for x in unique_coor]
unique_index = [coor_tuple.index(x) for x in unique_coor]
Example (using the uniqueRow function defined below):

A = np.array([[3.17, 9.502, 3.291],
              [9.984, 2.773, 6.852],
              [1.172, 8.885, 4.258],
              [9.73, 7.518, 3.227],
              [8.113, 9.563, 9.117],
              [9.984, 2.773, 6.852],
              [9.73, 7.518, 3.227]])
B, inv_, c = uniqueRow(A)

Results:

B:
[[1.172 8.885 4.258]
 [3.17  9.502 3.291]
 [8.113 9.563 9.117]
 [9.73  7.518 3.227]
 [9.984 2.773 6.852]]

inv_: [3 4 1 0 2 4 0]
c: [2 1 1 1 2]
We can actually turn an m x n numeric numpy array into an m x 1 numpy string array. Try the following function; it provides counts, inverse indices, etc., just like numpy.unique:
import numpy as np

def uniqueRow(a):
    # This function turns an m x n numpy array into an m x 1 numpy array of
    # strings, so that np.unique can be used.
    # Input: an m x n numpy array (a)
    # Output: unique m' x n numpy array, inverse_indx, and counts
    s = np.chararray((a.shape[0], 1))
    s[:] = '-'
    b = (a).astype(np.str)
    s2 = np.expand_dims(b[:, 0], axis=1) + s + np.expand_dims(b[:, 1], axis=1)
    n = a.shape[1] - 2
    for i in range(0, n):
        s2 = s2 + s + np.expand_dims(b[:, i + 2], axis=1)
    s3, idx, inv_, c = np.unique(s2, return_index=True, return_inverse=True, return_counts=True)
    return a[idx], inv_, c
The most straightforward solution is to make the rows a single item by turning them into strings. Each row can then be compared as a whole for uniqueness using numpy. This solution is generalizable: you just need to reshape and transpose your array for other combinations. Here is the solution for the problem provided.
import numpy as np
original = np.array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
uniques, index = np.unique([str(i) for i in original], return_index=True)
cleaned = original[index]
print(cleaned)
import numpy as np

original = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 0]])
# create a view that treats each row as a single structured item, and
# return the indices of the unique rows
_, unique_index = np.unique(original.view(original.dtype.descr * original.shape[1]),
                            return_index=True)
# get the unique set
print(original[unique_index])
Is there a built-in that removes duplicates from a list in Python, whilst preserving order? I know that I can use a set to remove duplicates, but that destroys the original order. I also know that I can roll my own like this:
def uniq(input):
output = []
for x in input:
if x not in output:
output.append(x)
return output
def f7(seq):
seen = set()
seen_add = seen.add
return [x for x in seq if not (x in seen or seen_add(x))]
Why assign seen.add to seen_add instead of just calling seen.add? Python is a dynamic language, and resolving seen.add each iteration is more costly than resolving a local variable. seen.add could have changed between iterations, and the runtime isn’t smart enough to rule that out. To play it safe, it has to check the object each time.
Sets give O(1) insertion, deletion, and membership checks per operation.
(Small additional note: seen.add() always returns None, so the or above is there only as a way to attempt a set update, and not as an integral part of the logical test.)
def unique_everseen(iterable, key=None):"List unique elements, preserving order. Remember all elements ever seen."# unique_everseen('AAAABBBCCDAABBB') --> A B C D# unique_everseen('ABBCcAD', str.lower) --> A B C D
seen = set()
seen_add = seen.addif key isNone:for element in filterfalse(seen.__contains__, iterable):
seen_add(element)yield elementelse:for element in iterable:
k = key(element)if k notin seen:
seen_add(k)yield element
As Raymond pointed out, in python 3.5+ where OrderedDict is implemented in C, the list comprehension approach will be slower than OrderedDict (unless you actually need the list at the end – and even then, only if the input is very short). So the best solution for 3.5+ is OrderedDict.
Important Edit 2015
As @abarnert notes, the more_itertools library (pip install more_itertools) contains a unique_everseen function built to solve this problem without any unreadable (not seen.add) mutations in list comprehensions. It is also the fastest solution:
Just one simple library import and no hacks.
This comes from an implementation of the itertools recipe unique_everseen which looks like:
from itertools import filterfalse

def unique_everseen(iterable, key=None):
"List unique elements, preserving order. Remember all elements ever seen."
# unique_everseen('AAAABBBCCDAABBB') --> A B C D
# unique_everseen('ABBCcAD', str.lower) --> A B C D
seen = set()
seen_add = seen.add
if key is None:
for element in filterfalse(seen.__contains__, iterable):
seen_add(element)
yield element
else:
for element in iterable:
k = key(element)
if k not in seen:
seen_add(k)
yield element
In Python 2.7+ the accepted common idiom (which works but isn’t optimized for speed, I would now use unique_everseen) for this uses collections.OrderedDict:
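Presumably the idiom meant here is:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']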
In Python 3.5, the OrderedDict has a C implementation. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5.
In Python 3.6, the regular dict became both ordered and compact. (This holds for CPython and PyPy but may not be present in other implementations.) That gives us a new fastest way of deduping while retaining order:
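Presumably:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']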
Response to @max: Once you move to 3.6 or 3.7 and use the regular dict instead of OrderedDict, you can’t really beat the performance in any other way. The dictionary is dense and readily converts to a list with almost no overhead. The target list is pre-sized to len(d), which saves all the resizes that occur in a list comprehension. Also, since the internal key list is dense, copying the pointers is almost as fast as a list copy.
Answer 3

sequence = ['1', '2', '3', '3', '6', '4', '5', '6']
unique = []
[unique.append(item) for item in sequence if item not in unique]
Not to kick a dead horse (this question is very old and already has lots of good answers), but here is a solution using pandas that is quite fast in many circumstances and is dead simple to use.
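The code itself is not included here; a minimal sketch of one such pandas approach (drop_duplicates keeps the first occurrence by default, preserving order):

import pandas as pd
items = [1, 2, 2, 4, 1]  # example input (mine)
print(pd.Series(items).drop_duplicates().tolist())  # [1, 2, 4]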
from itertools import groupby
[key for key, _ in groupby(sortedList)]
The list doesn’t even have to be sorted, the sufficient condition is that equal values are grouped together.
Edit: I assumed that “preserving order” implies that the list is actually ordered. If this is not the case, then the solution from MizardX is the right one.
Community edit: This is however the most elegant way to “compress duplicate consecutive elements into a single element”.
In Python 3.7 and above, dictionaries are guaranteed to remember their key insertion order. The answer to this question summarizes the current state of affairs.
The OrderedDict solution thus becomes obsolete and without any import statements we can simply issue:
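Presumably the snippet is the plain-dict version:

>>> list(dict.fromkeys([1, 2, 1, 3, 2]))
[1, 2, 3]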
For another very late answer to another very old question:
The itertools recipes have a function that does this, using the seen set technique, but:
Handles a standard key function.
Uses no unseemly hacks.
Optimizes the loop by pre-binding seen.add instead of looking it up N times. (f7 also does this, but some versions don’t.)
Optimizes the loop by using ifilterfalse, so you only have to loop over the unique elements in Python, instead of all of them. (You still iterate over all of them inside ifilterfalse, of course, but that’s in C, and much faster.)
Is it actually faster than f7? It depends on your data, so you’ll have to test it and see. If you want a list in the end, f7 uses a listcomp, and there’s no way to do that here. (You can directly append instead of yielding, or you can feed the generator into the list function, but neither one can be as fast as the LIST_APPEND inside a listcomp.) At any rate, usually, squeezing out a few microseconds is not going to be as important as having an easily-understandable, reusable, already-written function that doesn’t require DSU when you want to decorate.
As with all of the recipes, it’s also available in more-itertools.
If you just want the no-key case, you can simplify it as:
import itertools

def unique(iterable):
seen = set()
seen_add = seen.add
for element in itertools.ifilterfalse(seen.__contains__, iterable):
seen_add(element)
yield element
I did some timings (Python 3.6) and these show that it’s faster than all other alternatives I tested, including OrderedDict.fromkeys, f7 and more_itertools.unique_everseen:
%matplotlib notebook
from iteration_utilities import unique_everseen
from collections import OrderedDict
from more_itertools import unique_everseen as mi_unique_everseen
def f7(seq):
seen = set()
seen_add = seen.add
return [x for x in seq if not (x in seen or seen_add(x))]
def iteration_utilities_unique_everseen(seq):
return list(unique_everseen(seq))
def more_itertools_unique_everseen(seq):
return list(mi_unique_everseen(seq))
def odict(seq):
return list(OrderedDict.fromkeys(seq))
from simple_benchmark import benchmark
b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
{2**i: list(range(2**i)) for i in range(1, 20)},
'list size (no duplicates)')
b.plot()
And just to make sure I also did a test with more duplicates just to check if it makes a difference:
import random
b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
{2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(1, 20)},
'list size (lots of duplicates)')
b.plot()
And one containing only one value:
b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
{2**i: [1]*(2**i) for i in range(1, 20)},
'list size (only duplicates)')
b.plot()
In all of these cases the iteration_utilities.unique_everseen function is the fastest (on my computer).
This iteration_utilities.unique_everseen function can also handle unhashable values in the input (however with O(n*n) performance instead of the O(n) performance for hashable values).
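For example (a small check of the unhashable path, not from the original benchmark):

>>> from iteration_utilities import unique_everseen
>>> list(unique_everseen([[1, 2], [1, 2], [3, 4]]))
[[1, 2], [3, 4]]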
In [118]: unique([1,5,1,1,4,3,4])
Out[118]: [1, 5, 4, 3]
I tried it for growing data sizes and saw sub-linear time-complexity (not definitive, but suggests this should be fine for normal data).
In [122]: %timeit unique(np.random.randint(5, size=(1)))
10000 loops, best of 3: 25.3 us per loop
In [123]: %timeit unique(np.random.randint(5, size=(10)))
10000 loops, best of 3: 42.9 us per loop
In [124]: %timeit unique(np.random.randint(5, size=(100)))
10000 loops, best of 3: 132 us per loop
In [125]: %timeit unique(np.random.randint(5, size=(1000)))
1000 loops, best of 3: 1.05 ms per loop
In [126]: %timeit unique(np.random.randint(5, size=(10000)))
100 loops, best of 3: 11 ms per loop
I also think it’s interesting that this could readily be generalized to uniqueness by other operations.
For example, you could pass in a function that uses the notion of rounding to the same integer as if it was “equality” for uniqueness purposes, like this:
def test_round(x,y):
return round(x) != round(y)
Then unique(some_list, test_round) would provide the unique elements of the list, where “uniqueness” no longer means traditional equality (which is implied by any set-based or dict-key-based approach to this problem), but instead means taking only the first element that rounds to K, for each possible integer K that the elements might round to. For example:
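The unique function with a custom test is not reproduced here; one minimal sketch consistent with that usage (names are mine; O(n*n), since every kept element is compared against each candidate):

def unique(iterable, test=lambda x, y: x != y):
    # Keep the first element of each group that `test` considers "equal";
    # test(x, y) returns True when x and y count as distinct.
    kept = []
    for item in iterable:
        if all(test(item, k) for k in kept):
            kept.append(item)
    return kept

print(unique([1.1, 0.9, 2.4, 2.0], test_round))  # [1.1, 2.4]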
>>> l = [5, 6, 6, 1, 1, 2, 2, 3, 4]
>>> reduce(lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r, l, ([], set()))[0]
[5, 6, 1, 2, 3, 4]
Explanation:
default = (list(), set())
# use list to keep order
# use set to make lookup faster
def reducer(result, item):
if item not in result[1]:
result[0].append(item)
result[1].add(item)
return result
>>> reduce(reducer, l, default)[0]
[5, 6, 1, 2, 3, 4]
You can reference a list comprehension as it is being built by the symbol ‘_[1]’. For example, the following function unique-ifies a list of elements without changing their order by referencing its list comprehension.
def unique(my_list):
return [x for x in my_list if x not in locals()['_[1]']]
Demo:
l1 = [1, 2, 3, 4, 1, 2, 3, 4, 5]
l2 = [x for x in l1 if x not in locals()['_[1]']]
print l2
Output:
[1, 2, 3, 4, 5]
Answer 14
MizardX’s answer gives a good summary of the different approaches.
Here is what I came up with while thinking out loud:
mylist = [x for i, x in enumerate(mylist) if x not in mylist[i+1:]]
If you routinely use pandas, and aesthetics are preferred over performance, then consider the built-in function pandas.Series.drop_duplicates:
import pandas as pd
import numpy as np
uniquifier = lambda alist: pd.Series(alist).drop_duplicates().tolist()
# from the chosen answer
def f7(seq):
seen = set()
seen_add = seen.add
return [ x for x in seq if not (x in seen or seen_add(x))]
alist = np.random.randint(low=0, high=1000, size=10000).tolist()
print uniquifier(alist) == f7(alist) # True
Timing:
In [104]: %timeit f7(alist)
1000 loops, best of 3: 1.3 ms per loop
In [110]: %timeit uniquifier(alist)
100 loops, best of 3: 4.39 ms per loop
This will preserve order and run in O(n) time. Basically, the idea is to create a hole wherever a duplicate is found and sink it down to the bottom. It makes use of a read pointer and a write pointer. Whenever a duplicate is found, only the read pointer advances, while the write pointer stays on the duplicate entry so it can be overwritten.
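A minimal sketch of that read/write-pointer idea (the function name is mine; it assumes hashable elements):

def dedupe_in_place(items):
    # Order-preserving in-place dedupe: `write` trails `read` and only
    # advances when a previously unseen value is copied down.
    seen = set()
    write = 0
    for read in range(len(items)):
        if items[read] not in seen:
            seen.add(items[read])
            items[write] = items[read]
            write += 1
    del items[write:]  # drop the leftover tail after the last write

lst = [1, 5, 1, 1, 4, 3, 4]
dedupe_in_place(lst)
print(lst)  # [1, 5, 4, 3]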
text ="ask not what your country can do for you ask what you can do for your country"
sentence = text.split(" ")
noduplicates =[(sentence[i])for i in range (0,len(sentence))if sentence[i]notin sentence[:i]]print(noduplicates)
A solution without using imported modules or sets:
text = "ask not what your country can do for you ask what you can do for your country"
sentence = text.split(" ")
noduplicates = [(sentence[i]) for i in range (0,len(sentence)) if sentence[i] not in sentence[:i]]
print(noduplicates)
This method is quadratic, because we have a linear lookup into the list for every element of the list (to which we must add the cost of rearranging the list because of the del l[i] calls).
That said, it is possible to operate in place if we start from the end of the list and proceed toward the origin removing each term that is present in the sub-list at its left
This idea in code is simply
for i in range(len(l)-1,0,-1):
if l[i] in l[:i]: del l[i]
zmk’s approach uses a list comprehension, which is very fast, yet keeps the order naturally. It can easily be modified to compare strings case-insensitively, while still preserving the original case.
def DelDupes(aseq) :
seen = set()
return [x for x in aseq if (x.lower() not in seen) and (not seen.add(x.lower()))]
Closely associated functions are:
def HasDupes(aseq) :
s = set()
return any(((x.lower() in s) or s.add(x.lower())) for x in aseq)
def GetDupes(aseq) :
s = set()
return set(x for x in aseq if ((x.lower() in s) or s.add(x.lower())))