标签归档:arrays

从数组中删除Nan值

问题:从数组中删除Nan值

我想弄清楚如何从数组中删除nan值。我的数组看起来像这样:

x = [1400, 1500, 1600, nan, nan, nan ,1700] #Not in this exact configuration

如何从中删除nanx

I want to figure out how to remove nan values from my array. My array looks something like this:

x = [1400, 1500, 1600, nan, nan, nan ,1700] #Not in this exact configuration

How can I remove the nan values from x?


回答 0

如果您对数组使用numpy,也可以使用

x = x[numpy.logical_not(numpy.isnan(x))]

等效地

x = x[~numpy.isnan(x)]

[感谢chbrown新增了速记]

说明

内部函数numpy.isnan返回一个布尔值/逻辑数组,该数组在True每个地方都x具有非数字值。因为我们希望相反,我们使用逻辑不操作,~以获得与阵列True到处都是这x 一个有效的数字。

最后,我们使用此逻辑数组索引到原始数组x,仅检索非NaN值。

If you’re using numpy for your arrays, you can also use

x = x[numpy.logical_not(numpy.isnan(x))]

Equivalently

x = x[~numpy.isnan(x)]

[Thanks to chbrown for the added shorthand]

Explanation

The inner function, numpy.isnan returns a boolean/logical array which has the value True everywhere that x is not-a-number. As we want the opposite, we use the logical-not operator, ~ to get an array with Trues everywhere that x is a valid number.

Lastly we use this logical array to index into the original array x, to retrieve just the non-NaN values.


回答 1

filter(lambda v: v==v, x)

由于v!= v仅适用于NaN,因此适用于列表和numpy数组

filter(lambda v: v==v, x)

works both for lists and numpy array since v!=v only for NaN


回答 2

试试这个:

import math
print [value for value in x if not math.isnan(value)]

有关更多信息,请阅读列表理解

Try this:

import math
print [value for value in x if not math.isnan(value)]

For more, read on List Comprehensions.


回答 3

对我来说,@ jmetz的答案不起作用,但是使用熊猫的isull()可以。

x = x[~pd.isnull(x)]

For me the answer by @jmetz didn’t work, however using pandas isnull() did.

x = x[~pd.isnull(x)]

回答 4

执行以上操作:

x = x[~numpy.isnan(x)]

要么

x = x[numpy.logical_not(numpy.isnan(x))]

我发现重置为相同的变量(x)不会删除实际的nan值,而必须使用其他变量。将其设置为其他变量将删除nans。例如

y = x[~numpy.isnan(x)]

Doing the above :

x = x[~numpy.isnan(x)]

or

x = x[numpy.logical_not(numpy.isnan(x))]

I found that resetting to the same variable (x) did not remove the actual nan values and had to use a different variable. Setting it to a different variable removed the nans. e.g.

y = x[~numpy.isnan(x)]

回答 5

如其他人所示

x[~numpy.isnan(x)]

作品。但是,如果numpy dtype不是本机数据类型(例如,如果它是object),则将引发错误。在这种情况下,您可以使用熊猫。

x[~pandas.isna(x)] or x[~pandas.isnull(x)]

As shown by others

x[~numpy.isnan(x)]

works. But it will throw an error if the numpy dtype is not a native data type, for example if it is object. In that case you can use pandas.

x[~pandas.isna(x)] or x[~pandas.isnull(x)]

回答 6

所述接受的答案改变为2D阵列的形状。我在这里提出了一个使用Pandas dropna()功能的解决方案。它适用于一维和二维阵列。在2D情况下,您可以选择天气删除包含的行或列np.nan

import pandas as pd
import numpy as np

def dropna(arr, *args, **kwarg):
    assert isinstance(arr, np.ndarray)
    dropped=pd.DataFrame(arr).dropna(*args, **kwarg).values
    if arr.ndim==1:
        dropped=dropped.flatten()
    return dropped

x = np.array([1400, 1500, 1600, np.nan, np.nan, np.nan ,1700])
y = np.array([[1400, 1500, 1600], [np.nan, 0, np.nan] ,[1700,1800,np.nan]] )


print('='*20+' 1D Case: ' +'='*20+'\nInput:\n',x,sep='')
print('\ndropna:\n',dropna(x),sep='')

print('\n\n'+'='*20+' 2D Case: ' +'='*20+'\nInput:\n',y,sep='')
print('\ndropna (rows):\n',dropna(y),sep='')
print('\ndropna (columns):\n',dropna(y,axis=1),sep='')

print('\n\n'+'='*20+' x[np.logical_not(np.isnan(x))] for 2D: ' +'='*20+'\nInput:\n',y,sep='')
print('\ndropna:\n',x[np.logical_not(np.isnan(x))],sep='')

结果:

==================== 1D Case: ====================
Input:
[1400. 1500. 1600.   nan   nan   nan 1700.]

dropna:
[1400. 1500. 1600. 1700.]


==================== 2D Case: ====================
Input:
[[1400. 1500. 1600.]
 [  nan    0.   nan]
 [1700. 1800.   nan]]

dropna (rows):
[[1400. 1500. 1600.]]

dropna (columns):
[[1500.]
 [   0.]
 [1800.]]


==================== x[np.logical_not(np.isnan(x))] for 2D: ====================
Input:
[[1400. 1500. 1600.]
 [  nan    0.   nan]
 [1700. 1800.   nan]]

dropna:
[1400. 1500. 1600. 1700.]

The accepted answer changes shape for 2d arrays. I present a solution here, using the Pandas dropna() functionality. It works for 1D and 2D arrays. In the 2D case you can choose weather to drop the row or column containing np.nan.

import pandas as pd
import numpy as np

def dropna(arr, *args, **kwarg):
    assert isinstance(arr, np.ndarray)
    dropped=pd.DataFrame(arr).dropna(*args, **kwarg).values
    if arr.ndim==1:
        dropped=dropped.flatten()
    return dropped

x = np.array([1400, 1500, 1600, np.nan, np.nan, np.nan ,1700])
y = np.array([[1400, 1500, 1600], [np.nan, 0, np.nan] ,[1700,1800,np.nan]] )


print('='*20+' 1D Case: ' +'='*20+'\nInput:\n',x,sep='')
print('\ndropna:\n',dropna(x),sep='')

print('\n\n'+'='*20+' 2D Case: ' +'='*20+'\nInput:\n',y,sep='')
print('\ndropna (rows):\n',dropna(y),sep='')
print('\ndropna (columns):\n',dropna(y,axis=1),sep='')

print('\n\n'+'='*20+' x[np.logical_not(np.isnan(x))] for 2D: ' +'='*20+'\nInput:\n',y,sep='')
print('\ndropna:\n',x[np.logical_not(np.isnan(x))],sep='')

Result:

==================== 1D Case: ====================
Input:
[1400. 1500. 1600.   nan   nan   nan 1700.]

dropna:
[1400. 1500. 1600. 1700.]


==================== 2D Case: ====================
Input:
[[1400. 1500. 1600.]
 [  nan    0.   nan]
 [1700. 1800.   nan]]

dropna (rows):
[[1400. 1500. 1600.]]

dropna (columns):
[[1500.]
 [   0.]
 [1800.]]


==================== x[np.logical_not(np.isnan(x))] for 2D: ====================
Input:
[[1400. 1500. 1600.]
 [  nan    0.   nan]
 [1700. 1800.   nan]]

dropna:
[1400. 1500. 1600. 1700.]

回答 7

如果您正在使用 numpy

# first get the indices where the values are finite
ii = np.isfinite(x)

# second get the values
x = x[ii]

If you’re using numpy

# first get the indices where the values are finite
ii = np.isfinite(x)

# second get the values
x = x[ii]

回答 8

最简单的方法是:

numpy.nan_to_num(x)

文档:https : //docs.scipy.org/doc/numpy/reference/generated/numpy.nan_to_num.html


回答 9

这是我为NaN和infs过滤ndarray “ X”的方法,

我创建的行映射不包含NaN任何内容inf,如下所示:

idx = np.where((np.isnan(X)==False) & (np.isinf(X)==False))

idx是一个元组。它的第二列(idx[1])包含数组的索引,在该行中找不到NaNinf

然后:

filtered_X = X[idx[1]]

filtered_X包含X,而不 包含NaNnor inf

This is my approach to filter ndarray “X” for NaNs and infs,

I create a map of rows without any NaN and any inf as follows:

idx = np.where((np.isnan(X)==False) & (np.isinf(X)==False))

idx is a tuple. It’s second column (idx[1]) contains the indices of the array, where no NaN nor inf where found across the row.

Then:

filtered_X = X[idx[1]]

filtered_X contains X without NaN nor inf.


回答 10

@jmetz的答案可能是大多数人需要的答案。但是,它会产生一维数组,例如,使其无法删除矩阵中的整个行或列。

为此,应将逻辑数组缩小为一维,然后索引目标数组。例如,以下内容将删除至少具有一个NaN值的行:

x = x[~numpy.isnan(x).any(axis=1)]

在这里查看更多详细信息

@jmetz’s answer is probably the one most people need; however it yields a one-dimensional array, e.g. making it unusable to remove entire rows or columns in matrices.

To do so, one should reduce the logical array to one dimension, then index the target array. For instance, the following will remove rows which have at least one NaN value:

x = x[~numpy.isnan(x).any(axis=1)]

See more detail here.


如何删除numpy数组中的特定元素

问题:如何删除numpy数组中的特定元素

如何从numpy数组中删除某些特定元素?说我有

import numpy as np

a = np.array([1,2,3,4,5,6,7,8,9])

然后我想删除3,4,7a。我所知道的只是值的索引(index=[2,3,6])。

How can I remove some specific elements from a numpy array? Say I have

import numpy as np

a = np.array([1,2,3,4,5,6,7,8,9])

I then want to remove 3,4,7 from a. All I know is the index of the values (index=[2,3,6]).


回答 0

使用numpy.delete() -返回一个新的数组,该数组具有沿删除的轴的子数组

numpy.delete(a, index)

对于您的具体问题:

import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
index = [2, 3, 6]

new_a = np.delete(a, index)

print(new_a) #Prints `[1, 2, 5, 6, 8, 9]`

请注意,numpy.delete()由于数组标量是不变的,因此返回一个新数组,类似于Python中的字符串,因此每次对其进行更改时,都会创建一个新对象。即,引用delete() 文档

删除了由obj指定的元素的arr 副本请注意,删除不会就地发生 …”

如果我发布的代码已输出,则是运行代码的结果。

Use numpy.delete() – returns a new array with sub-arrays along an axis deleted

numpy.delete(a, index)

For your specific question:

import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
index = [2, 3, 6]

new_a = np.delete(a, index)

print(new_a) #Prints `[1, 2, 5, 6, 8, 9]`

Note that numpy.delete() returns a new array since array scalars are immutable, similar to strings in Python, so each time a change is made to it, a new object is created. I.e., to quote the delete() docs:

“A copy of arr with the elements specified by obj removed. Note that delete does not occur in-place…”

If the code I post has output, it is the result of running the code.


回答 1

有一个内置的numpy函数可以帮助您。

import numpy as np
>>> a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> b = np.array([3,4,7])
>>> c = np.setdiff1d(a,b)
>>> c
array([1, 2, 5, 6, 8, 9])

There is a numpy built-in function to help with that.

import numpy as np
>>> a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> b = np.array([3,4,7])
>>> c = np.setdiff1d(a,b)
>>> c
array([1, 2, 5, 6, 8, 9])

回答 2

Numpy数组是不可变的,从技术上讲,您不能从中删除任何项目。但是,您可以构造一个没有所需值的数组,如下所示:

b = np.delete(a, [2,3,6])

A Numpy array is immutable, meaning you technically cannot delete an item from it. However, you can construct a new array without the values you don’t want, like this:

b = np.delete(a, [2,3,6])

回答 3

要按值删除:

modified_array = np.delete(original_array, np.where(original_array == value_to_delete))

To delete by value :

modified_array = np.delete(original_array, np.where(original_array == value_to_delete))

回答 4

我不是一个麻木的人,所以开枪:

>>> import numpy as np
>>> import itertools
>>> 
>>> a = np.array([1,2,3,4,5,6,7,8,9])
>>> index=[2,3,6]
>>> a = np.array(list(itertools.compress(a, [i not in index for i in range(len(a))])))
>>> a
array([1, 2, 5, 6, 8, 9])

根据我的测试,它的性能优于numpy.delete()。我不知道为什么会这样,也许是由于初始数组的大小?

python -m timeit -s "import numpy as np" -s "import itertools" -s "a = np.array([1,2,3,4,5,6,7,8,9])" -s "index=[2,3,6]" "a = np.array(list(itertools.compress(a, [i not in index for i in range(len(a))])))"
100000 loops, best of 3: 12.9 usec per loop

python -m timeit -s "import numpy as np" -s "a = np.array([1,2,3,4,5,6,7,8,9])" -s "index=[2,3,6]" "np.delete(a, index)"
10000 loops, best of 3: 108 usec per loop

这是一个相当大的差异(与我期望的方向相反),任何人都知道为什么会这样吗?

更奇怪的是,传递numpy.delete()列表的效果比遍历列表并为其赋予单个索引更糟糕。

python -m timeit -s "import numpy as np" -s "a = np.array([1,2,3,4,5,6,7,8,9])" -s "index=[2,3,6]" "for i in index:" "    np.delete(a, i)"
10000 loops, best of 3: 33.8 usec per loop

编辑:这似乎与数组的大小有关。对于大型阵列,numpy.delete()速度明显加快。

python -m timeit -s "import numpy as np" -s "import itertools" -s "a = np.array(list(range(10000)))" -s "index=[i for i in range(10000) if i % 2 == 0]" "a = np.array(list(itertools.compress(a, [i not in index for i in range(len(a))])))"
10 loops, best of 3: 200 msec per loop

python -m timeit -s "import numpy as np" -s "a = np.array(list(range(10000)))" -s "index=[i for i in range(10000) if i % 2 == 0]" "np.delete(a, index)"
1000 loops, best of 3: 1.68 msec per loop

显然,这完全无关紧要,因为您应该始终保持清晰并避免重新发明轮子,但是我发现它有点有趣,因此我想将其留在这里。

Not being a numpy person, I took a shot with:

>>> import numpy as np
>>> import itertools
>>> 
>>> a = np.array([1,2,3,4,5,6,7,8,9])
>>> index=[2,3,6]
>>> a = np.array(list(itertools.compress(a, [i not in index for i in range(len(a))])))
>>> a
array([1, 2, 5, 6, 8, 9])

According to my tests, this outperforms numpy.delete(). I don’t know why that would be the case, maybe due to the small size of the initial array?

python -m timeit -s "import numpy as np" -s "import itertools" -s "a = np.array([1,2,3,4,5,6,7,8,9])" -s "index=[2,3,6]" "a = np.array(list(itertools.compress(a, [i not in index for i in range(len(a))])))"
100000 loops, best of 3: 12.9 usec per loop

python -m timeit -s "import numpy as np" -s "a = np.array([1,2,3,4,5,6,7,8,9])" -s "index=[2,3,6]" "np.delete(a, index)"
10000 loops, best of 3: 108 usec per loop

That’s a pretty significant difference (in the opposite direction to what I was expecting), anyone have any idea why this would be the case?

Even more weirdly, passing numpy.delete() a list performs worse than looping through the list and giving it single indices.

python -m timeit -s "import numpy as np" -s "a = np.array([1,2,3,4,5,6,7,8,9])" -s "index=[2,3,6]" "for i in index:" "    np.delete(a, i)"
10000 loops, best of 3: 33.8 usec per loop

Edit: It does appear to be to do with the size of the array. With large arrays, numpy.delete() is significantly faster.

python -m timeit -s "import numpy as np" -s "import itertools" -s "a = np.array(list(range(10000)))" -s "index=[i for i in range(10000) if i % 2 == 0]" "a = np.array(list(itertools.compress(a, [i not in index for i in range(len(a))])))"
10 loops, best of 3: 200 msec per loop

python -m timeit -s "import numpy as np" -s "a = np.array(list(range(10000)))" -s "index=[i for i in range(10000) if i % 2 == 0]" "np.delete(a, index)"
1000 loops, best of 3: 1.68 msec per loop

Obviously, this is all pretty irrelevant, as you should always go for clarity and avoid reinventing the wheel, but I found it a little interesting, so I thought I’d leave it here.


回答 5

如果您不知道索引,则无法使用 logical_and

x = 10*np.random.randn(1,100)
low = 5
high = 27
x[0,np.logical_and(x[0,:]>low,x[0,:]<high)]

If you don’t know the index, you can’t use logical_and

x = 10*np.random.randn(1,100)
low = 5
high = 27
x[0,np.logical_and(x[0,:]>low,x[0,:]<high)]

回答 6

np.delete如果我们知道要删除的元素的索引,则使用是最快的方法。但是,为了完整起见,让我添加另一种使用借助于创建的布尔掩码“删除”数组元素的方法np.isin。此方法允许我们通过直接指定元素或通过其索引来删除元素:

import numpy as np
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

按索引删除

indices_to_remove = [2, 3, 6]
a = a[~np.isin(np.arange(a.size), indices_to_remove)]

按元素删除(不要忘记重新创建原始内容,a因为它已在上一行中被重写):

elements_to_remove = a[indices_to_remove]  # [3, 4, 7]
a = a[~np.isin(a, elements_to_remove)]

Using np.delete is the fastest way to do it, if we know the indices of the elements that we want to remove. However, for completeness, let me add another way of “removing” array elements using a boolean mask created with the help of np.isin. This method allows us to remove the elements by specifying them directly or by their indices:

import numpy as np
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

Remove by indices:

indices_to_remove = [2, 3, 6]
a = a[~np.isin(np.arange(a.size), indices_to_remove)]

Remove by elements (don’t forget to recreate the original a since it was rewritten in the previous line):

elements_to_remove = a[indices_to_remove]  # [3, 4, 7]
a = a[~np.isin(a, elements_to_remove)]

回答 7

删除特定索引(我从矩阵中删除了16和21)

import numpy as np
mat = np.arange(12,26)
a = [4,9]
del_map = np.delete(mat, a)
del_map.reshape(3,4)

输出:

array([[12, 13, 14, 15],
      [17, 18, 19, 20],
      [22, 23, 24, 25]])

Remove specific index(i removed 16 and 21 from matrix)

import numpy as np
mat = np.arange(12,26)
a = [4,9]
del_map = np.delete(mat, a)
del_map.reshape(3,4)

Output:

array([[12, 13, 14, 15],
      [17, 18, 19, 20],
      [22, 23, 24, 25]])

回答 8

您还可以使用集合:

a = numpy.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
the_index_list = [2, 3, 6]

the_big_set = set(numpy.arange(len(a)))
the_small_set = set(the_index_list)
the_delta_row_list = list(the_big_set - the_small_set)

a = a[the_delta_row_list]

You can also use sets:

a = numpy.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
the_index_list = [2, 3, 6]

the_big_set = set(numpy.arange(len(a)))
the_small_set = set(the_index_list)
the_delta_row_list = list(the_big_set - the_small_set)

a = a[the_delta_row_list]

检查项目是否在数组/列表中

问题:检查项目是否在数组/列表中

如果我有一个字符串数组,是否可以检查字符串是否在数组中而不进行for循环?具体来说,我正在寻找一种在if语句中执行此操作的方法,因此如下所示:

if [check that item is in array]:

If I’ve got an array of strings, can I check to see if a string is in the array without doing a for loop? Specifically, I’m looking for a way to do it within an if statement, so something like this:

if [check that item is in array]:

回答 0

假设您说“列表”的意思是“数组”,那么您可以

if item in my_list:
    # whatever

这适用于任何集合,而不仅仅是列表。对于字典,它将检查字典中是否存在给定的键。

Assuming you mean “list” where you say “array”, you can do

if item in my_list:
    # whatever

This works for any collection, not just for lists. For dictionaries, it checks whether the given key is present in the dictionary.


回答 1

我还要假设您说“数组”时的意思是“列表”。Sven Marnach的解决方案很好。如果要对列表进行重复检查,则可能有必要将其转换为集合或冻结集,这对于每次检查来说可能更快。假设您的str列表称为subjects

subject_set = frozenset(subjects)
if query in subject_set:
    # whatever

I’m also going to assume that you mean “list” when you say “array.” Sven Marnach’s solution is good. If you are going to be doing repeated checks on the list, then it might be worth converting it to a set or frozenset, which can be faster for each check. Assuming your list of strs is called subjects:

subject_set = frozenset(subjects)
if query in subject_set:
    # whatever

回答 2

使用lambda函数。

假设您有一个数组:

nums = [0,1,5]

检查5是否在nums

(len(filter (lambda x : x == 5, nums)) > 0)

该解决方案更加强大。现在,您可以检查数组中是否有满足特定条件的任何数字nums

例如,检查是否存在大于或等于5的任何数字nums

(len(filter (lambda x : x >= 5, nums)) > 0)

Use a lambda function.

Let’s say you have an array:

nums = [0,1,5]

Check whether 5 is in nums in Python 3.X:

(len(list(filter (lambda x : x == 5, nums))) > 0)

Check whether 5 is in nums in Python 2.7:

(len(filter (lambda x : x == 5, nums)) > 0)

This solution is more robust. You can now check whether any number satisfying a certain condition is in your array nums.

For example, check whether any number that is greater than or equal to 5 exists in nums:

(len(filter (lambda x : x >= 5, nums)) > 0)

回答 3

您必须对数组使用.values。例如,假设您的数据框的列名称为test [‘Name’],则可以

if name in test['Name'].values :
   print(name)

对于普通列表,您不必使用.values

You have to use .values for arrays. for example say you have dataframe which has a column name ie, test[‘Name’], you can do

if name in test['Name'].values :
   print(name)

for a normal list you dont have to use .values


回答 4

您也可以对数组使用相同的语法。例如,在熊猫系列中搜索:

ser = pd.Series(['some', 'strings', 'to', 'query'])

if item in ser.values:
    # do stuff

You can also use the same syntax for an array. For example, searching within a Pandas series:

ser = pd.Series(['some', 'strings', 'to', 'query'])

if item in ser.values:
    # do stuff

如何找到列表交集?

问题:如何找到列表交集?

a = [1,2,3,4,5]
b = [1,3,5,6]
c = a and b
print c

实际输出:[1,3,5,6] 预期输出:[1,3,5]

我们如何在两个列表上实现布尔AND操作(列表交集)?

a = [1,2,3,4,5]
b = [1,3,5,6]
c = a and b
print c

actual output: [1,3,5,6] expected output: [1,3,5]

How can we achieve a boolean AND operation (list intersection) on two lists?


回答 0

如果顺序不重要,并且您不必担心重复,则可以使用set相交:

>>> a = [1,2,3,4,5]
>>> b = [1,3,5,6]
>>> list(set(a) & set(b))
[1, 3, 5]

If order is not important and you don’t need to worry about duplicates then you can use set intersection:

>>> a = [1,2,3,4,5]
>>> b = [1,3,5,6]
>>> list(set(a) & set(b))
[1, 3, 5]

回答 1

对我而言,使用列表推导是很明显的。不确定性能,但至少没有其他问题。

[x for x in a if x in b]

或“如果X值在B中,则所有A中的x值”。

Using list comprehensions is a pretty obvious one for me. Not sure about performance, but at least things stay lists.

[x for x in a if x in b]

Or “all the x values that are in A, if the X value is in B”.


回答 2

如果将两个列表中的较大者转换为一个集合,则可以使用将该集合与任何可迭代对象相交intersection()

a = [1,2,3,4,5]
b = [1,3,5,6]
set(a).intersection(b)

If you convert the larger of the two lists into a set, you can get the intersection of that set with any iterable using intersection():

a = [1,2,3,4,5]
b = [1,3,5,6]
set(a).intersection(b)

回答 3

从较大的一组中取出一组:

_auxset = set(a)

然后,

c = [x for x in b if x in _auxset]

会做您想要的事情(保留b顺序,而不是a-不一定要同时保留两者)并快速完成。(使用if x in a列表理解中的条件也是可行的,并且避免了构建的需要_auxset,但是不幸的是,对于长度很大的列表,它会慢很多)。

如果您希望对结果进行排序,而不是保留任何一个列表的顺序,那么更整洁的方法可能是:

c = sorted(set(a).intersection(b))

Make a set out of the larger one:

_auxset = set(a)

Then,

c = [x for x in b if x in _auxset]

will do what you want (preserving b‘s ordering, not a‘s — can’t necessarily preserve both) and do it fast. (Using if x in a as the condition in the list comprehension would also work, and avoid the need to build _auxset, but unfortunately for lists of substantial length it would be a lot slower).

If you want the result to be sorted, rather than preserve either list’s ordering, an even neater way might be:

c = sorted(set(a).intersection(b))

回答 4

这是一些Python 2 / Python 3代码,它们为查找两个列表的交集的基于列表的方法和基于集合的方法生成计时信息。

纯列表理解算法为O(n ^ 2),因为in列表上是线性搜索。基于集合的算法为O(n),因为集合搜索为O(1),集合创建为O(n)(并且将集合转换为列表也为O(n))。因此,对于足够大的n,基于集合的算法速度更快,但是对于很小的n,创建集合的开销使其比纯列表压缩算法慢。

#!/usr/bin/env python

''' Time list- vs set-based list intersection
    See http://stackoverflow.com/q/3697432/4014959
    Written by PM 2Ring 2015.10.16
'''

from __future__ import print_function, division
from timeit import Timer

setup = 'from __main__ import a, b'
cmd_lista = '[u for u in a if u in b]'
cmd_listb = '[u for u in b if u in a]'

cmd_lcsa = 'sa=set(a);[u for u in b if u in sa]'

cmd_seta = 'list(set(a).intersection(b))'
cmd_setb = 'list(set(b).intersection(a))'

reps = 3
loops = 50000

def do_timing(heading, cmd, setup):
    t = Timer(cmd, setup)
    r = t.repeat(reps, loops)
    r.sort()
    print(heading, r)
    return r[0]

m = 10
nums = list(range(6 * m))

for n in range(1, m + 1):
    a = nums[:6*n:2]
    b = nums[:6*n:3]
    print('\nn =', n, len(a), len(b))
    #print('\nn = %d\n%s %d\n%s %d' % (n, a, len(a), b, len(b)))
    la = do_timing('lista', cmd_lista, setup) 
    lb = do_timing('listb', cmd_listb, setup) 
    lc = do_timing('lcsa ', cmd_lcsa, setup)
    sa = do_timing('seta ', cmd_seta, setup)
    sb = do_timing('setb ', cmd_setb, setup)
    print(la/sa, lb/sa, lc/sa, la/sb, lb/sb, lc/sb)

输出

n = 1 3 2
lista [0.082171916961669922, 0.082588911056518555, 0.0898590087890625]
listb [0.069530963897705078, 0.070394992828369141, 0.075379848480224609]
lcsa  [0.11858987808227539, 0.1188349723815918, 0.12825107574462891]
seta  [0.26900982856750488, 0.26902294158935547, 0.27298116683959961]
setb  [0.27218389511108398, 0.27459001541137695, 0.34307217597961426]
0.305460649521 0.258469975867 0.440838458259 0.301898526833 0.255455833892 0.435697630214

n = 2 6 4
lista [0.15915989875793457, 0.16000485420227051, 0.16551494598388672]
listb [0.13000702857971191, 0.13060092926025391, 0.13543915748596191]
lcsa  [0.18650484085083008, 0.18742108345031738, 0.19513416290283203]
seta  [0.33592700958251953, 0.34001994132995605, 0.34146714210510254]
setb  [0.29436492919921875, 0.2953648567199707, 0.30039691925048828]
0.473793098554 0.387009751735 0.555194537893 0.540689066428 0.441652573672 0.633583767462

n = 3 9 6
lista [0.27657914161682129, 0.28098297119140625, 0.28311991691589355]
listb [0.21585917472839355, 0.21679902076721191, 0.22272896766662598]
lcsa  [0.22559309005737305, 0.2271728515625, 0.2323150634765625]
seta  [0.36382699012756348, 0.36453008651733398, 0.36750602722167969]
setb  [0.34979605674743652, 0.35533690452575684, 0.36164689064025879]
0.760194128313 0.59330170819 0.62005595016 0.790686848184 0.61710008036 0.644927481902

n = 4 12 8
lista [0.39616990089416504, 0.39746403694152832, 0.41129183769226074]
listb [0.33485794067382812, 0.33914685249328613, 0.37850618362426758]
lcsa  [0.27405810356140137, 0.2745978832244873, 0.28249192237854004]
seta  [0.39211201667785645, 0.39234519004821777, 0.39317893981933594]
setb  [0.36988520622253418, 0.37011313438415527, 0.37571001052856445]
1.01034878821 0.85398540833 0.698928091731 1.07106176249 0.905302334456 0.740927452493

n = 5 15 10
lista [0.56792402267456055, 0.57422614097595215, 0.57740211486816406]
listb [0.47309303283691406, 0.47619009017944336, 0.47628307342529297]
lcsa  [0.32805585861206055, 0.32813096046447754, 0.3349759578704834]
seta  [0.40036201477050781, 0.40322518348693848, 0.40548801422119141]
setb  [0.39103078842163086, 0.39722800254821777, 0.43811702728271484]
1.41852623806 1.18166313332 0.819398061028 1.45237674242 1.20986133789 0.838951479847

n = 6 18 12
lista [0.77897095680236816, 0.78187918663024902, 0.78467702865600586]
listb [0.629547119140625, 0.63210701942443848, 0.63321495056152344]
lcsa  [0.36563992500305176, 0.36638498306274414, 0.38175487518310547]
seta  [0.46695613861083984, 0.46992206573486328, 0.47583580017089844]
setb  [0.47616910934448242, 0.47661614418029785, 0.4850609302520752]
1.66818870637 1.34819326075 0.783028414812 1.63591241329 1.32210827369 0.767878297495

n = 7 21 14
lista [0.9703209400177002, 0.9734041690826416, 1.0182771682739258]
listb [0.82394003868103027, 0.82625699043273926, 0.82796716690063477]
lcsa  [0.40975093841552734, 0.41210508346557617, 0.42286920547485352]
seta  [0.5086359977722168, 0.50968098640441895, 0.51014018058776855]
setb  [0.48688101768493652, 0.4879908561706543, 0.49204087257385254]
1.90769222837 1.61990115188 0.805587768483 1.99293236904 1.69228211566 0.841583309951

n = 8 24 16
lista [1.204819917678833, 1.2206029891967773, 1.258256196975708]
listb [1.014998197555542, 1.0206191539764404, 1.0343101024627686]
lcsa  [0.50966787338256836, 0.51018595695495605, 0.51319599151611328]
seta  [0.50310111045837402, 0.50556015968322754, 0.51335406303405762]
setb  [0.51472997665405273, 0.51948785781860352, 0.52113485336303711]
2.39478683834 2.01748351664 1.01305257092 2.34068341135 1.97190418975 0.990165516871

n = 9 27 18
lista [1.511646032333374, 1.5133969783782959, 1.5639569759368896]
listb [1.2461750507354736, 1.254518985748291, 1.2613379955291748]
lcsa  [0.5565330982208252, 0.56119203567504883, 0.56451296806335449]
seta  [0.5966339111328125, 0.60275578498840332, 0.64791703224182129]
setb  [0.54694414138793945, 0.5508568286895752, 0.55375313758850098]
2.53362406013 2.08867620074 0.932788243907 2.76380331728 2.27843203069 1.01753187594

n = 10 30 20
lista [1.7777848243713379, 2.1453688144683838, 2.4085969924926758]
listb [1.5070111751556396, 1.5202279090881348, 1.5779800415039062]
lcsa  [0.5954139232635498, 0.59703707695007324, 0.60746097564697266]
seta  [0.61563014984130859, 0.62125110626220703, 0.62354087829589844]
setb  [0.56723213195800781, 0.57257509231567383, 0.57460403442382812]
2.88774814689 2.44791645689 0.967161734066 3.13413984189 2.6567803378 1.04968299523

使用具有2GB RAM的2GHz单核计算机生成,该计算机在Debian风格的Linux上运行Python 2.6.6(在后台运行Firefox)。

这些数字只是一个粗略的参考,因为两种源列表中元素的比例会影响各种算法的实际速度。

Here’s some Python 2 / Python 3 code that generates timing information for both list-based and set-based methods of finding the intersection of two lists.

The pure list comprehension algorithms are O(n^2), since in on a list is a linear search. The set-based algorithms are O(n), since set search is O(1), and set creation is O(n) (and converting a set to a list is also O(n)). So for sufficiently large n the set-based algorithms are faster, but for small n the overheads of creating the set(s) make them slower than the pure list comp algorithms.

#!/usr/bin/env python

''' Time list- vs set-based list intersection
    See http://stackoverflow.com/q/3697432/4014959
    Written by PM 2Ring 2015.10.16
'''

from __future__ import print_function, division
from timeit import Timer

setup = 'from __main__ import a, b'
cmd_lista = '[u for u in a if u in b]'
cmd_listb = '[u for u in b if u in a]'

cmd_lcsa = 'sa=set(a);[u for u in b if u in sa]'

cmd_seta = 'list(set(a).intersection(b))'
cmd_setb = 'list(set(b).intersection(a))'

reps = 3
loops = 50000

def do_timing(heading, cmd, setup):
    t = Timer(cmd, setup)
    r = t.repeat(reps, loops)
    r.sort()
    print(heading, r)
    return r[0]

m = 10
nums = list(range(6 * m))

for n in range(1, m + 1):
    a = nums[:6*n:2]
    b = nums[:6*n:3]
    print('\nn =', n, len(a), len(b))
    #print('\nn = %d\n%s %d\n%s %d' % (n, a, len(a), b, len(b)))
    la = do_timing('lista', cmd_lista, setup) 
    lb = do_timing('listb', cmd_listb, setup) 
    lc = do_timing('lcsa ', cmd_lcsa, setup)
    sa = do_timing('seta ', cmd_seta, setup)
    sb = do_timing('setb ', cmd_setb, setup)
    print(la/sa, lb/sa, lc/sa, la/sb, lb/sb, lc/sb)

output

n = 1 3 2
lista [0.082171916961669922, 0.082588911056518555, 0.0898590087890625]
listb [0.069530963897705078, 0.070394992828369141, 0.075379848480224609]
lcsa  [0.11858987808227539, 0.1188349723815918, 0.12825107574462891]
seta  [0.26900982856750488, 0.26902294158935547, 0.27298116683959961]
setb  [0.27218389511108398, 0.27459001541137695, 0.34307217597961426]
0.305460649521 0.258469975867 0.440838458259 0.301898526833 0.255455833892 0.435697630214

n = 2 6 4
lista [0.15915989875793457, 0.16000485420227051, 0.16551494598388672]
listb [0.13000702857971191, 0.13060092926025391, 0.13543915748596191]
lcsa  [0.18650484085083008, 0.18742108345031738, 0.19513416290283203]
seta  [0.33592700958251953, 0.34001994132995605, 0.34146714210510254]
setb  [0.29436492919921875, 0.2953648567199707, 0.30039691925048828]
0.473793098554 0.387009751735 0.555194537893 0.540689066428 0.441652573672 0.633583767462

n = 3 9 6
lista [0.27657914161682129, 0.28098297119140625, 0.28311991691589355]
listb [0.21585917472839355, 0.21679902076721191, 0.22272896766662598]
lcsa  [0.22559309005737305, 0.2271728515625, 0.2323150634765625]
seta  [0.36382699012756348, 0.36453008651733398, 0.36750602722167969]
setb  [0.34979605674743652, 0.35533690452575684, 0.36164689064025879]
0.760194128313 0.59330170819 0.62005595016 0.790686848184 0.61710008036 0.644927481902

n = 4 12 8
lista [0.39616990089416504, 0.39746403694152832, 0.41129183769226074]
listb [0.33485794067382812, 0.33914685249328613, 0.37850618362426758]
lcsa  [0.27405810356140137, 0.2745978832244873, 0.28249192237854004]
seta  [0.39211201667785645, 0.39234519004821777, 0.39317893981933594]
setb  [0.36988520622253418, 0.37011313438415527, 0.37571001052856445]
1.01034878821 0.85398540833 0.698928091731 1.07106176249 0.905302334456 0.740927452493

n = 5 15 10
lista [0.56792402267456055, 0.57422614097595215, 0.57740211486816406]
listb [0.47309303283691406, 0.47619009017944336, 0.47628307342529297]
lcsa  [0.32805585861206055, 0.32813096046447754, 0.3349759578704834]
seta  [0.40036201477050781, 0.40322518348693848, 0.40548801422119141]
setb  [0.39103078842163086, 0.39722800254821777, 0.43811702728271484]
1.41852623806 1.18166313332 0.819398061028 1.45237674242 1.20986133789 0.838951479847

n = 6 18 12
lista [0.77897095680236816, 0.78187918663024902, 0.78467702865600586]
listb [0.629547119140625, 0.63210701942443848, 0.63321495056152344]
lcsa  [0.36563992500305176, 0.36638498306274414, 0.38175487518310547]
seta  [0.46695613861083984, 0.46992206573486328, 0.47583580017089844]
setb  [0.47616910934448242, 0.47661614418029785, 0.4850609302520752]
1.66818870637 1.34819326075 0.783028414812 1.63591241329 1.32210827369 0.767878297495

n = 7 21 14
lista [0.9703209400177002, 0.9734041690826416, 1.0182771682739258]
listb [0.82394003868103027, 0.82625699043273926, 0.82796716690063477]
lcsa  [0.40975093841552734, 0.41210508346557617, 0.42286920547485352]
seta  [0.5086359977722168, 0.50968098640441895, 0.51014018058776855]
setb  [0.48688101768493652, 0.4879908561706543, 0.49204087257385254]
1.90769222837 1.61990115188 0.805587768483 1.99293236904 1.69228211566 0.841583309951

n = 8 24 16
lista [1.204819917678833, 1.2206029891967773, 1.258256196975708]
listb [1.014998197555542, 1.0206191539764404, 1.0343101024627686]
lcsa  [0.50966787338256836, 0.51018595695495605, 0.51319599151611328]
seta  [0.50310111045837402, 0.50556015968322754, 0.51335406303405762]
setb  [0.51472997665405273, 0.51948785781860352, 0.52113485336303711]
2.39478683834 2.01748351664 1.01305257092 2.34068341135 1.97190418975 0.990165516871

n = 9 27 18
lista [1.511646032333374, 1.5133969783782959, 1.5639569759368896]
listb [1.2461750507354736, 1.254518985748291, 1.2613379955291748]
lcsa  [0.5565330982208252, 0.56119203567504883, 0.56451296806335449]
seta  [0.5966339111328125, 0.60275578498840332, 0.64791703224182129]
setb  [0.54694414138793945, 0.5508568286895752, 0.55375313758850098]
2.53362406013 2.08867620074 0.932788243907 2.76380331728 2.27843203069 1.01753187594

n = 10 30 20
lista [1.7777848243713379, 2.1453688144683838, 2.4085969924926758]
listb [1.5070111751556396, 1.5202279090881348, 1.5779800415039062]
lcsa  [0.5954139232635498, 0.59703707695007324, 0.60746097564697266]
seta  [0.61563014984130859, 0.62125110626220703, 0.62354087829589844]
setb  [0.56723213195800781, 0.57257509231567383, 0.57460403442382812]
2.88774814689 2.44791645689 0.967161734066 3.13413984189 2.6567803378 1.04968299523

Generated using a 2GHz single core machine with 2GB of RAM running Python 2.6.6 on a Debian flavour of Linux (with Firefox running in the background).

These figures are only a rough guide, since the actual speeds of the various algorithms are affected differently by the proportion of elements that are in both source lists.


回答 5

a = [1,2,3,4,5]
b = [1,3,5,6]
c = list(set(a).intersection(set(b)))

应该像梦一样工作。并且,如果可以的话,请使用集合而不是列表来避免所有这些类型的改变!

a = [1,2,3,4,5]
b = [1,3,5,6]
c = list(set(a).intersection(set(b)))

Should work like a dream. And, if you can, use sets instead of lists to avoid all this type changing!


回答 6

使用filterlambda运算符可以实现功能方式。

list1 = [1,2,3,4,5,6]

list2 = [2,4,6,9,10]

>>> list(filter(lambda x:x in list1, list2))

[2, 4, 6]

编辑:筛选出存在于list1和list中的x,还可以使用以下方法实现集差:

>>> list(filter(lambda x:x not in list1, list2))
[9,10]

Edit2:python3 filter返回一个过滤器对象,将其封装起来并list返回输出列表。

A functional way can be achieved using filter and lambda operator.

list1 = [1,2,3,4,5,6]

list2 = [2,4,6,9,10]

>>> list(filter(lambda x:x in list1, list2))

[2, 4, 6]

Edit: It filters out x that exists in both list1 and list, set difference can also be achieved using:

>>> list(filter(lambda x:x not in list1, list2))
[9,10]

Edit2: python3 filter returns a filter object, encapsulating it with list returns the output list.


回答 7

当您需要此示例时,结果中的每个元素应出现在两个数组中的次数相同。

def intersection(nums1, nums2):
    #example:
    #nums1 = [1,2,2,1]
    #nums2 = [2,2]
    #output = [2,2]
    #find first 2 and remove from target, continue iterating

    target, iterate = [nums1, nums2] if len(nums2) >= len(nums1) else [nums2, nums1] #iterate will look into target

    if len(target) == 0:
            return []

    i = 0
    store = []
    while i < len(iterate):

         element = iterate[i]

         if element in target:
               store.append(element)
               target.remove(element)

         i += 1


    return store

This is an example when you need Each element in the result should appear as many times as it shows in both arrays.

def intersection(nums1, nums2):
    #example:
    #nums1 = [1,2,2,1]
    #nums2 = [2,2]
    #output = [2,2]
    #find first 2 and remove from target, continue iterating

    target, iterate = [nums1, nums2] if len(nums2) >= len(nums1) else [nums2, nums1] #iterate will look into target

    if len(target) == 0:
            return []

    i = 0
    store = []
    while i < len(iterate):

         element = iterate[i]

         if element in target:
               store.append(element)
               target.remove(element)

         i += 1


    return store

回答 8

可能已经晚了,但我只是想分享我这种情况,要求您手动执行此操作(显示正在工作-哈哈),或者当您需要所有元素尽可能多地出现或者您还需要使其唯一时。

请注意,也为此编写了测试。



    from nose.tools import assert_equal

    '''
    Given two lists, print out the list of overlapping elements
    '''

    def overlap(l_a, l_b):
        '''
        compare the two lists l_a and l_b and return the overlapping
        elements (intersecting) between the two
        '''

        #edge case is when they are the same lists
        if l_a == l_b:
            return [] #no overlapping elements

        output = []

        if len(l_a) == len(l_b):
            for i in range(l_a): #same length so either one applies
                if l_a[i] in l_b:
                    output.append(l_a[i])

            #found all by now
            #return output #if repetition does not matter
            return list(set(output))

        else:
            #find the smallest and largest lists and go with that
            sm = l_a if len(l_a)  len(l_b) else l_b

            for i in range(len(sm)):
                if sm[i] in lg:
                    output.append(sm[i])

            #return output #if repetition does not matter
            return list(set(output))

    ## Test the Above Implementation

    a = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
    b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
    exp = [1, 2, 3, 5, 8, 13]

    c = [4, 4, 5, 6]
    d = [5, 7, 4, 8 ,6 ] #assuming it is not ordered
    exp2 = [4, 5, 6]

    class TestOverlap(object):

        def test(self, sol):
            t = sol(a, b)
            assert_equal(t, exp)
            print('Comparing the two lists produces')
            print(t)

            t = sol(c, d)
            assert_equal(t, exp2)
            print('Comparing the two lists produces')
            print(t)

            print('All Tests Passed!!')

    t = TestOverlap()
    t.test(overlap)

It might be late but I just thought I should share for the case where you are required to do it manually (show working – haha) OR when you need all elements to appear as many times as possible or when you also need it to be unique.

Kindly note that tests have also been written for it.



    from nose.tools import assert_equal

    '''
    Given two lists, print out the list of overlapping elements
    '''

    def overlap(l_a, l_b):
        '''
        compare the two lists l_a and l_b and return the overlapping
        elements (intersecting) between the two
        '''

        #edge case is when they are the same lists
        if l_a == l_b:
            return [] #no overlapping elements

        output = []

        if len(l_a) == len(l_b):
            for i in range(l_a): #same length so either one applies
                if l_a[i] in l_b:
                    output.append(l_a[i])

            #found all by now
            #return output #if repetition does not matter
            return list(set(output))

        else:
            #find the smallest and largest lists and go with that
            sm = l_a if len(l_a)  len(l_b) else l_b

            for i in range(len(sm)):
                if sm[i] in lg:
                    output.append(sm[i])

            #return output #if repetition does not matter
            return list(set(output))

    ## Test the Above Implementation

    a = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
    b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
    exp = [1, 2, 3, 5, 8, 13]

    c = [4, 4, 5, 6]
    d = [5, 7, 4, 8 ,6 ] #assuming it is not ordered
    exp2 = [4, 5, 6]

    class TestOverlap(object):

        def test(self, sol):
            t = sol(a, b)
            assert_equal(t, exp)
            print('Comparing the two lists produces')
            print(t)

            t = sol(c, d)
            assert_equal(t, exp2)
            print('Comparing the two lists produces')
            print(t)

            print('All Tests Passed!!')

    t = TestOverlap()
    t.test(overlap)


回答 9

如果,由布尔AND,同时出现在列表中你的意思项目,如路口,那么你应该看看Python的setfrozenset类型。

If, by Boolean AND, you mean items that appear in both lists, e.g. intersection, then you should look at Python’s set and frozenset types.


回答 10

您也可以使用柜台!它不会保留顺序,但会考虑重复项:

>>> from collections import Counter
>>> a = [1,2,3,4,5]
>>> b = [1,3,5,6]
>>> d1, d2 = Counter(a), Counter(b)
>>> c = [n for n in d1.keys() & d2.keys() for _ in range(min(d1[n], d2[n]))]
>>> print(c)
[1,3,5]

You can also use a counter! It doesn’t preserve the order, but it’ll consider the duplicates:

>>> from collections import Counter
>>> a = [1,2,3,4,5]
>>> b = [1,3,5,6]
>>> d1, d2 = Counter(a), Counter(b)
>>> c = [n for n in d1.keys() & d2.keys() for _ in range(min(d1[n], d2[n]))]
>>> print(c)
[1,3,5]

在numpy.array中查找唯一的行

问题:在numpy.array中查找唯一的行

我需要在中找到唯一的行numpy.array

例如:

>>> a # I have
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])
>>> new_a # I want to get to
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 0]])

我知道我可以创建一个set并在数组上循环,但是我正在寻找一种有效的纯numpy解决方案。我相信有一种方法可以将数据类型设置为void,然后我可以使用numpy.unique,但是我不知道如何使它工作。

I need to find unique rows in a numpy.array.

For example:

>>> a # I have
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])
>>> new_a # I want to get to
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 0]])

I know that i can create a set and loop over the array, but I am looking for an efficient pure numpy solution. I believe that there is a way to set data type to void and then I could just use numpy.unique, but I couldn’t figure out how to make it work.


回答 0

从NumPy 1.13开始,您可以简单地选择轴来选择任何N维数组中的唯一值。要获得唯一的行,可以执行以下操作:

unique_rows = np.unique(original_array, axis=0)

As of NumPy 1.13, one can simply choose the axis for selection of unique values in any N-dim array. To get unique rows, one can do:

unique_rows = np.unique(original_array, axis=0)


回答 1

另一个可能的解决方案

np.vstack({tuple(row) for row in a})

Yet another possible solution

np.vstack({tuple(row) for row in a})

回答 2

使用结构化数组的另一种选择是使用一种void类型的视图,该视图将整行连接到单个项目中:

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, idx = np.unique(b, return_index=True)

unique_a = a[idx]

>>> unique_a
array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

编辑 添加np.ascontiguousarray以下@seberg的建议。如果数组不是连续的,这会使方法变慢。

编辑 可以通过执行以下操作来稍微加快上述速度,也许是以清楚为代价的:

unique_a = np.unique(b).view(a.dtype).reshape(-1, a.shape[1])

另外,至少在我的系统上,性能方面与lexsort方法相当,甚至更好:

a = np.random.randint(2, size=(10000, 6))

%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
100 loops, best of 3: 3.17 ms per loop

%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
100 loops, best of 3: 5.93 ms per loop

a = np.random.randint(2, size=(10000, 100))

%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
10 loops, best of 3: 29.9 ms per loop

%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
10 loops, best of 3: 116 ms per loop

Another option to the use of structured arrays is using a view of a void type that joins the whole row into a single item:

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, idx = np.unique(b, return_index=True)

unique_a = a[idx]

>>> unique_a
array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

EDIT Added np.ascontiguousarray following @seberg’s recommendation. This will slow the method down if the array is not already contiguous.

EDIT The above can be slightly sped up, perhaps at the cost of clarity, by doing:

unique_a = np.unique(b).view(a.dtype).reshape(-1, a.shape[1])

Also, at least on my system, performance wise it is on par, or even better, than the lexsort method:

a = np.random.randint(2, size=(10000, 6))

%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
100 loops, best of 3: 3.17 ms per loop

%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
100 loops, best of 3: 5.93 ms per loop

a = np.random.randint(2, size=(10000, 100))

%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
10 loops, best of 3: 29.9 ms per loop

%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
10 loops, best of 3: 116 ms per loop

回答 3

如果要避免转换为一系列元组或其他类似数据结构的内存开销,则可以利用numpy的结构化数组。

诀窍是将原始数组视为结构化数组,其中每个项目都对应于原始数组的一行。这不会产生副本,并且非常有效。

作为一个简单的例子:

import numpy as np

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])

ncols = data.shape[1]
dtype = data.dtype.descr * ncols
struct = data.view(dtype)

uniq = np.unique(struct)
uniq = uniq.view(data.dtype).reshape(-1, ncols)
print uniq

要了解发生了什么,请看一下中间结果。

一旦我们将事物视为结构化数组,则数组中的每个元素都是原始数组中的一行。(基本上,它是与元组列表类似的数据结构。)

In [71]: struct
Out[71]:
array([[(1, 1, 1, 0, 0, 0)],
       [(0, 1, 1, 1, 0, 0)],
       [(0, 1, 1, 1, 0, 0)],
       [(1, 1, 1, 0, 0, 0)],
       [(1, 1, 1, 1, 1, 0)]],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

In [72]: struct[0]
Out[72]:
array([(1, 1, 1, 0, 0, 0)],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

一旦运行numpy.unique,我们将返回一个结构化数组:

In [73]: np.unique(struct)
Out[73]:
array([(0, 1, 1, 1, 0, 0), (1, 1, 1, 0, 0, 0), (1, 1, 1, 1, 1, 0)],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

然后,我们需要将其视为“常规”数组(_将最后一次计算的结果存储在中ipython,这就是您看到的原因_.view...):

In [74]: _.view(data.dtype)
Out[74]: array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0])

然后重塑为2D数组(-1是一个占位符,告诉numpy计算正确的行数,给出列数):

In [75]: _.reshape(-1, ncols)
Out[75]:
array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

显然,如果您想更加简洁,可以将其编写为:

import numpy as np

def unique_rows(data):
    uniq = np.unique(data.view(data.dtype.descr * data.shape[1]))
    return uniq.view(data.dtype).reshape(-1, data.shape[1])

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])
print unique_rows(data)

结果是:

[[0 1 1 1 0 0]
 [1 1 1 0 0 0]
 [1 1 1 1 1 0]]

If you want to avoid the memory expense of converting to a series of tuples or another similar data structure, you can exploit numpy’s structured arrays.

The trick is to view your original array as a structured array where each item corresponds to a row of the original array. This doesn’t make a copy, and is quite efficient.

As a quick example:

import numpy as np

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])

ncols = data.shape[1]
dtype = data.dtype.descr * ncols
struct = data.view(dtype)

uniq = np.unique(struct)
uniq = uniq.view(data.dtype).reshape(-1, ncols)
print uniq

To understand what’s going on, have a look at the intermediary results.

Once we view things as a structured array, each element in the array is a row in your original array. (Basically, it’s a similar data structure to a list of tuples.)

In [71]: struct
Out[71]:
array([[(1, 1, 1, 0, 0, 0)],
       [(0, 1, 1, 1, 0, 0)],
       [(0, 1, 1, 1, 0, 0)],
       [(1, 1, 1, 0, 0, 0)],
       [(1, 1, 1, 1, 1, 0)]],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

In [72]: struct[0]
Out[72]:
array([(1, 1, 1, 0, 0, 0)],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

Once we run numpy.unique, we’ll get a structured array back:

In [73]: np.unique(struct)
Out[73]:
array([(0, 1, 1, 1, 0, 0), (1, 1, 1, 0, 0, 0), (1, 1, 1, 1, 1, 0)],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

That we then need to view as a “normal” array (_ stores the result of the last calculation in ipython, which is why you’re seeing _.view...):

In [74]: _.view(data.dtype)
Out[74]: array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0])

And then reshape back into a 2D array (-1 is a placeholder that tells numpy to calculate the correct number of rows, give the number of columns):

In [75]: _.reshape(-1, ncols)
Out[75]:
array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

Obviously, if you wanted to be more concise, you could write it as:

import numpy as np

def unique_rows(data):
    uniq = np.unique(data.view(data.dtype.descr * data.shape[1]))
    return uniq.view(data.dtype).reshape(-1, data.shape[1])

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])
print unique_rows(data)

Which results in:

[[0 1 1 1 0 0]
 [1 1 1 0 0 0]
 [1 1 1 1 1 0]]

回答 4

np.unique当我运行它时,np.random.random(100).reshape(10,10)返回所有唯一的单个元素,但是您想要唯一的行,因此首先需要将它们放入元组:

array = #your numpy array of lists
new_array = [tuple(row) for row in array]
uniques = np.unique(new_array)

这是我看到您更改类型以执行所需操作的唯一方法,并且我不确定将列表迭代更改为元组是否可以“不循环”

np.unique when I run it on np.random.random(100).reshape(10,10) returns all the unique individual elements, but you want the unique rows, so first you need to put them into tuples:

array = #your numpy array of lists
new_array = [tuple(row) for row in array]
uniques = np.unique(new_array)

That is the only way I see you changing the types to do what you want, and I am not sure if the list iteration to change to tuples is okay with your “not looping through”


回答 5

np.unique的工作方式是:对扁平化的数组进行排序,然后查看各项是否等于上一项。这可以手动完成而无需展平:

ind = np.lexsort(a.T)
a[ind[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]]

此方法不使用元组,并且应比此处给出的其他方法更快,更简单。

注意:以前的版本在a [之后没有ind,这表示使用了错误的索引。另外,乔·肯顿(Joe Kington)指出,这样做确实可以制作各种中间副本。通过制作排序后的副本,然后使用其视图,以下方法可以减少数量:

b = a[np.lexsort(a.T)]
b[np.concatenate(([True], np.any(b[1:] != b[:-1],axis=1)))]

这样速度更快,占用的内存更少。

此外,如果您要在ndarray中查找唯一行,而不管该数组中有多少维,则可以执行以下操作:

b = a[lexsort(a.reshape((a.shape[0],-1)).T)];
b[np.concatenate(([True], np.any(b[1:]!=b[:-1],axis=tuple(range(1,a.ndim)))))]

剩下的一个有趣的问题是,如果您想沿着任意维数组的任意轴排序/唯一,那将更加困难。

编辑:

为了演示速度差异,我在ipython中对答案中描述的三种不同方法进行了一些测试。与您的精确值a相比,差别不大,尽管此版本要快一些:

In [87]: %timeit unique(a.view(dtype)).view('<i8')
10000 loops, best of 3: 48.4 us per loop

In [88]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True], np.any(a[ind[1:]]!= a[ind[:-1]], axis=1)))]
10000 loops, best of 3: 37.6 us per loop

In [89]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10000 loops, best of 3: 41.6 us per loop

但是,使用更大的版本,最终会变得快得多:

In [96]: a = np.random.randint(0,2,size=(10000,6))

In [97]: %timeit unique(a.view(dtype)).view('<i8')
10 loops, best of 3: 24.4 ms per loop

In [98]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10 loops, best of 3: 28.2 ms per loop

In [99]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!= a[ind[:-1]],axis=1)))]
100 loops, best of 3: 3.25 ms per loop

np.unique works by sorting a flattened array, then looking at whether each item is equal to the previous. This can be done manually without flattening:

ind = np.lexsort(a.T)
a[ind[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]]

This method does not use tuples, and should be much faster and simpler than other methods given here.

NOTE: A previous version of this did not have the ind right after a[, which mean that the wrong indices were used. Also, Joe Kington makes a good point that this does make a variety of intermediate copies. The following method makes fewer, by making a sorted copy and then using views of it:

b = a[np.lexsort(a.T)]
b[np.concatenate(([True], np.any(b[1:] != b[:-1],axis=1)))]

This is faster and uses less memory.

Also, if you want to find unique rows in an ndarray regardless of how many dimensions are in the array, the following will work:

b = a[lexsort(a.reshape((a.shape[0],-1)).T)];
b[np.concatenate(([True], np.any(b[1:]!=b[:-1],axis=tuple(range(1,a.ndim)))))]

An interesting remaining issue would be if you wanted to sort/unique along an arbitrary axis of an arbitrary-dimension array, something that would be more difficult.

Edit:

To demonstrate the speed differences, I ran a few tests in ipython of the three different methods described in the answers. With your exact a, there isn’t too much of a difference, though this version is a bit faster:

In [87]: %timeit unique(a.view(dtype)).view('<i8')
10000 loops, best of 3: 48.4 us per loop

In [88]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True], np.any(a[ind[1:]]!= a[ind[:-1]], axis=1)))]
10000 loops, best of 3: 37.6 us per loop

In [89]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10000 loops, best of 3: 41.6 us per loop

With a larger a, however, this version ends up being much, much faster:

In [96]: a = np.random.randint(0,2,size=(10000,6))

In [97]: %timeit unique(a.view(dtype)).view('<i8')
10 loops, best of 3: 24.4 ms per loop

In [98]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10 loops, best of 3: 28.2 ms per loop

In [99]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!= a[ind[:-1]],axis=1)))]
100 loops, best of 3: 3.25 ms per loop

回答 6

这是@Greg pythonic答案的另一种变化

np.vstack(set(map(tuple, a)))

Here is another variation for @Greg pythonic answer

np.vstack(set(map(tuple, a)))

回答 7

我比较了建议的替代速度,发现令人惊讶的是,无效视图unique解决方案甚至比unique带有axis参数的numpy本机快一点。如果您正在寻找速度,您会想要

numpy.unique(
    a.view(numpy.dtype((numpy.void, a.dtype.itemsize*a.shape[1])))
    ).view(a.dtype).reshape(-1, a.shape[1])

在此处输入图片说明


复制剧情的代码:

import numpy
import perfplot


def unique_void_view(a):
    return numpy.unique(
        a.view(numpy.dtype((numpy.void, a.dtype.itemsize*a.shape[1])))
        ).view(a.dtype).reshape(-1, a.shape[1])


def lexsort(a):
    ind = numpy.lexsort(a.T)
    return a[ind[
        numpy.concatenate((
            [True], numpy.any(a[ind[1:]] != a[ind[:-1]], axis=1)
            ))
        ]]


def vstack(a):
    return numpy.vstack({tuple(row) for row in a})


def unique_axis(a):
    return numpy.unique(a, axis=0)


perfplot.show(
    setup=lambda n: numpy.random.randint(2, size=(n, 20)),
    kernels=[unique_void_view, lexsort, vstack, unique_axis],
    n_range=[2**k for k in range(15)],
    logx=True,
    logy=True,
    xlabel='len(a)',
    equality_check=None
    )

I’ve compared the suggested alternative for speed and found that, surprisingly, the void view unique solution is even a bit faster than numpy’s native unique with the axis argument. If you’re looking for speed, you’ll want

numpy.unique(
    a.view(numpy.dtype((numpy.void, a.dtype.itemsize*a.shape[1])))
    ).view(a.dtype).reshape(-1, a.shape[1])

enter image description here


Code to reproduce the plot:

import numpy
import perfplot


def unique_void_view(a):
    return numpy.unique(
        a.view(numpy.dtype((numpy.void, a.dtype.itemsize*a.shape[1])))
        ).view(a.dtype).reshape(-1, a.shape[1])


def lexsort(a):
    ind = numpy.lexsort(a.T)
    return a[ind[
        numpy.concatenate((
            [True], numpy.any(a[ind[1:]] != a[ind[:-1]], axis=1)
            ))
        ]]


def vstack(a):
    return numpy.vstack({tuple(row) for row in a})


def unique_axis(a):
    return numpy.unique(a, axis=0)


perfplot.show(
    setup=lambda n: numpy.random.randint(2, size=(n, 20)),
    kernels=[unique_void_view, lexsort, vstack, unique_axis],
    n_range=[2**k for k in range(15)],
    logx=True,
    logy=True,
    xlabel='len(a)',
    equality_check=None
    )

回答 8

我不喜欢这些答案,因为没有一个以线性代数或向量空间的意义处理浮点数组,其中两行“相等”表示“在𝜀内”。具有容忍度阈值的一个答案https://stackoverflow.com/a/26867764/500207将阈值设为按元素和十进制精度精度,这在某些情况下适用,但在数学上不如真实向量距离。

这是我的版本:

from scipy.spatial.distance import squareform, pdist

def uniqueRows(arr, thresh=0.0, metric='euclidean'):
    "Returns subset of rows that are unique, in terms of Euclidean distance"
    distances = squareform(pdist(arr, metric=metric))
    idxset = {tuple(np.nonzero(v)[0]) for v in distances <= thresh}
    return arr[[x[0] for x in idxset]]

# With this, unique columns are super-easy:
def uniqueColumns(arr, *args, **kwargs):
    return uniqueRows(arr.T, *args, **kwargs)

上面的公共域函数scipy.spatial.distance.pdist用于查找每对行之间的欧式距离(可自定义)。然后,将每个距离与thresh旧距离进行比较,以找到彼此之间的行thresh,并从每个行中仅返回一行thresh -cluster中。

如所暗示的,距离metric不必是欧几里得- pdist可以计算各种距离,包括cityblock(曼哈顿范数)和cosine(向量之间的角度)。

如果thresh=0(默认),则行必须精确到位才能被视为“唯一”。用于thresh缩放机器精度的其他良好值,即thresh=np.spacing(1)*1e3

I didn’t like any of these answers because none handle floating-point arrays in a linear algebra or vector space sense, where two rows being “equal” means “within some 𝜀”. The one answer that has a tolerance threshold, https://stackoverflow.com/a/26867764/500207, took the threshold to be both element-wise and decimal precision, which works for some cases but isn’t as mathematically general as a true vector distance.

Here’s my version:

from scipy.spatial.distance import squareform, pdist

def uniqueRows(arr, thresh=0.0, metric='euclidean'):
    "Returns subset of rows that are unique, in terms of Euclidean distance"
    distances = squareform(pdist(arr, metric=metric))
    idxset = {tuple(np.nonzero(v)[0]) for v in distances <= thresh}
    return arr[[x[0] for x in idxset]]

# With this, unique columns are super-easy:
def uniqueColumns(arr, *args, **kwargs):
    return uniqueRows(arr.T, *args, **kwargs)

The public-domain function above uses scipy.spatial.distance.pdist to find the Euclidean (customizable) distance between each pair of rows. Then it compares each each distance to a threshold to find the rows that are within thresh of each other, and returns just one row from each thresh-cluster.

As hinted, the distance metric needn’t be Euclidean—pdist can compute sundry distances including cityblock (Manhattan-norm) and cosine (the angle between vectors).

If thresh=0 (the default), then rows have to be bit-exact to be considered “unique”. Other good values for thresh use scaled machine-precision, i.e., thresh=np.spacing(1)*1e3.


回答 9

为什么不使用drop_duplicates熊猫:

>>> timeit pd.DataFrame(image.reshape(-1,3)).drop_duplicates().values
1 loops, best of 3: 3.08 s per loop

>>> timeit np.vstack({tuple(r) for r in image.reshape(-1,3)})
1 loops, best of 3: 51 s per loop

Why not use drop_duplicates from pandas:

>>> timeit pd.DataFrame(image.reshape(-1,3)).drop_duplicates().values
1 loops, best of 3: 3.08 s per loop

>>> timeit np.vstack({tuple(r) for r in image.reshape(-1,3)})
1 loops, best of 3: 51 s per loop

回答 10

numpy_indexed包(免责声明:我是它的作者)包装由Jaime在一个不错的发布解决方案和测试界面,再加上还有更多的功能:

import numpy_indexed as npi
new_a = npi.unique(a)  # unique elements over axis=0 (rows) by default

The numpy_indexed package (disclaimer: I am its author) wraps the solution posted by Jaime in a nice and tested interface, plus many more features:

import numpy_indexed as npi
new_a = npi.unique(a)  # unique elements over axis=0 (rows) by default

回答 11

np.unique给出了元组列表:

>>> np.unique([(1, 1), (2, 2), (3, 3), (4, 4), (2, 2)])
Out[9]: 
array([[1, 1],
       [2, 2],
       [3, 3],
       [4, 4]])

通过列表列表,它会引发一个 TypeError: unhashable type: 'list'

np.unique works given a list of tuples:

>>> np.unique([(1, 1), (2, 2), (3, 3), (4, 4), (2, 2)])
Out[9]: 
array([[1, 1],
       [2, 2],
       [3, 3],
       [4, 4]])

With a list of lists it raises a TypeError: unhashable type: 'list'


回答 12

基于此页面上的答案,我编写了一个函数,该函数复制了MATLAB函数的unique(input,'rows')功能,并具有接受检查唯一性公差的附加功能。它还返回诸如c = data[ia,:]和的索引data = c[ic,:]。如果发现任何差异或错误,请报告。

def unique_rows(data, prec=5):
    import numpy as np
    d_r = np.fix(data * 10 ** prec) / 10 ** prec + 0.0
    b = np.ascontiguousarray(d_r).view(np.dtype((np.void, d_r.dtype.itemsize * d_r.shape[1])))
    _, ia = np.unique(b, return_index=True)
    _, ic = np.unique(b, return_inverse=True)
    return np.unique(b).view(d_r.dtype).reshape(-1, d_r.shape[1]), ia, ic

Based on the answer in this page I have written a function that replicates the capability of MATLAB’s unique(input,'rows') function, with the additional feature to accept tolerance for checking the uniqueness. It also returns the indices such that c = data[ia,:] and data = c[ic,:]. Please report if you see any discrepancies or errors.

def unique_rows(data, prec=5):
    import numpy as np
    d_r = np.fix(data * 10 ** prec) / 10 ** prec + 0.0
    b = np.ascontiguousarray(d_r).view(np.dtype((np.void, d_r.dtype.itemsize * d_r.shape[1])))
    _, ia = np.unique(b, return_index=True)
    _, ic = np.unique(b, return_inverse=True)
    return np.unique(b).view(d_r.dtype).reshape(-1, d_r.shape[1]), ia, ic

回答 13

除了@Jaime最佳答案之外,折叠行的另一种方法是使用等于a.strides[0](假设a是C连续的)a.dtype.itemsize*a.shape[0]。而且void(n)是的快捷方式dtype((void,n))。我们终于到了最短的版本:

a[unique(a.view(void(a.strides[0])),1)[1]]

对于

[[0 1 1 1 0 0]
 [1 1 1 0 0 0]
 [1 1 1 1 1 0]]

Beyond @Jaime excellent answer, another way to collapse a row is to uses a.strides[0] (assuming a is C-contiguous) which is equal to a.dtype.itemsize*a.shape[0]. Furthermore void(n) is a shortcut for dtype((void,n)). we arrive finally to this shortest version :

a[unique(a.view(void(a.strides[0])),1)[1]]

For

[[0 1 1 1 0 0]
 [1 1 1 0 0 0]
 [1 1 1 1 1 0]]

回答 14

对于3D或更高级别的多维嵌套数组等一般用途,请尝试以下操作:

import numpy as np

def unique_nested_arrays(ar):
    origin_shape = ar.shape
    origin_dtype = ar.dtype
    ar = ar.reshape(origin_shape[0], np.prod(origin_shape[1:]))
    ar = np.ascontiguousarray(ar)
    unique_ar = np.unique(ar.view([('', origin_dtype)]*np.prod(origin_shape[1:])))
    return unique_ar.view(origin_dtype).reshape((unique_ar.shape[0], ) + origin_shape[1:])

满足您的2D数据集:

a = np.array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])
unique_nested_arrays(a)

给出:

array([[0, 1, 1, 1, 0, 0],
   [1, 1, 1, 0, 0, 0],
   [1, 1, 1, 1, 1, 0]])

而且还有3D阵列,例如:

b = np.array([[[1, 1, 1], [0, 1, 1]],
              [[0, 1, 1], [1, 1, 1]],
              [[1, 1, 1], [0, 1, 1]],
              [[1, 1, 1], [1, 1, 1]]])
unique_nested_arrays(b)

给出:

array([[[0, 1, 1], [1, 1, 1]],
   [[1, 1, 1], [0, 1, 1]],
   [[1, 1, 1], [1, 1, 1]]])

For general purpose like 3D or higher multidimensional nested arrays, try this:

import numpy as np

def unique_nested_arrays(ar):
    origin_shape = ar.shape
    origin_dtype = ar.dtype
    ar = ar.reshape(origin_shape[0], np.prod(origin_shape[1:]))
    ar = np.ascontiguousarray(ar)
    unique_ar = np.unique(ar.view([('', origin_dtype)]*np.prod(origin_shape[1:])))
    return unique_ar.view(origin_dtype).reshape((unique_ar.shape[0], ) + origin_shape[1:])

which satisfies your 2D dataset:

a = np.array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])
unique_nested_arrays(a)

gives:

array([[0, 1, 1, 1, 0, 0],
   [1, 1, 1, 0, 0, 0],
   [1, 1, 1, 1, 1, 0]])

But also 3D arrays like:

b = np.array([[[1, 1, 1], [0, 1, 1]],
              [[0, 1, 1], [1, 1, 1]],
              [[1, 1, 1], [0, 1, 1]],
              [[1, 1, 1], [1, 1, 1]]])
unique_nested_arrays(b)

gives:

array([[[0, 1, 1], [1, 1, 1]],
   [[1, 1, 1], [0, 1, 1]],
   [[1, 1, 1], [1, 1, 1]]])

回答 15

这些答案都不对我有用。我假设我的唯一行包含字符串而不是数字。但是,另一个线程的这个答案确实起作用:

资料来源:https : //stackoverflow.com/a/38461043/5402386

您可以使用.count()和.index()列表的方法

coor = np.array([[10, 10], [12, 9], [10, 5], [12, 9]])
coor_tuple = [tuple(x) for x in coor]
unique_coor = sorted(set(coor_tuple), key=lambda x: coor_tuple.index(x))
unique_count = [coor_tuple.count(x) for x in unique_coor]
unique_index = [coor_tuple.index(x) for x in unique_coor]

None of these answers worked for me. I’m assuming as my unique rows contained strings and not numbers. However this answer from another thread did work:

Source: https://stackoverflow.com/a/38461043/5402386

You can use .count() and .index() list’s methods

coor = np.array([[10, 10], [12, 9], [10, 5], [12, 9]])
coor_tuple = [tuple(x) for x in coor]
unique_coor = sorted(set(coor_tuple), key=lambda x: coor_tuple.index(x))
unique_count = [coor_tuple.count(x) for x in unique_coor]
unique_index = [coor_tuple.index(x) for x in unique_coor]

回答 16

我们实际上可以将mxn数字numpy数组转换为mx 1 numpy字符串数组,请尝试使用以下函数,它提供了countinverse_idx等,就像numpy.unique一样:

import numpy as np

def uniqueRow(a):
    #This function turn m x n numpy array into m x 1 numpy array storing 
    #string, and so the np.unique can be used

    #Input: an m x n numpy array (a)
    #Output unique m' x n numpy array (unique), inverse_indx, and counts 

    s = np.chararray((a.shape[0],1))
    s[:] = '-'

    b = (a).astype(np.str)

    s2 = np.expand_dims(b[:,0],axis=1) + s + np.expand_dims(b[:,1],axis=1)

    n = a.shape[1] - 2    

    for i in range(0,n):
         s2 = s2 + s + np.expand_dims(b[:,i+2],axis=1)

    s3, idx, inv_, c = np.unique(s2,return_index = True,  return_inverse = True, return_counts = True)

    return a[idx], inv_, c

例:

A = np.array([[ 3.17   9.502  3.291],
  [ 9.984  2.773  6.852],
  [ 1.172  8.885  4.258],
  [ 9.73   7.518  3.227],
  [ 8.113  9.563  9.117],
  [ 9.984  2.773  6.852],
  [ 9.73   7.518  3.227]])

B, inv_, c = uniqueRow(A)

Results:

B:
[[ 1.172  8.885  4.258]
[ 3.17   9.502  3.291]
[ 8.113  9.563  9.117]
[ 9.73   7.518  3.227]
[ 9.984  2.773  6.852]]

inv_:
[3 4 1 0 2 4 0]

c:
[2 1 1 1 2]

We can actually turn m x n numeric numpy array into m x 1 numpy string array, please try using the following function, it provides count, inverse_idx and etc, just like numpy.unique:

import numpy as np

def uniqueRow(a):
    #This function turn m x n numpy array into m x 1 numpy array storing 
    #string, and so the np.unique can be used

    #Input: an m x n numpy array (a)
    #Output unique m' x n numpy array (unique), inverse_indx, and counts 

    s = np.chararray((a.shape[0],1))
    s[:] = '-'

    b = (a).astype(np.str)

    s2 = np.expand_dims(b[:,0],axis=1) + s + np.expand_dims(b[:,1],axis=1)

    n = a.shape[1] - 2    

    for i in range(0,n):
         s2 = s2 + s + np.expand_dims(b[:,i+2],axis=1)

    s3, idx, inv_, c = np.unique(s2,return_index = True,  return_inverse = True, return_counts = True)

    return a[idx], inv_, c

Example:

A = np.array([[ 3.17   9.502  3.291],
  [ 9.984  2.773  6.852],
  [ 1.172  8.885  4.258],
  [ 9.73   7.518  3.227],
  [ 8.113  9.563  9.117],
  [ 9.984  2.773  6.852],
  [ 9.73   7.518  3.227]])

B, inv_, c = uniqueRow(A)

Results:

B:
[[ 1.172  8.885  4.258]
[ 3.17   9.502  3.291]
[ 8.113  9.563  9.117]
[ 9.73   7.518  3.227]
[ 9.984  2.773  6.852]]

inv_:
[3 4 1 0 2 4 0]

c:
[2 1 1 1 2]

回答 17

让我们以列表的形式获取整个numpy矩阵,然后从该列表中删除重复项,最后将唯一列表返回到numpy矩阵中:

matrix_as_list=data.tolist() 
matrix_as_list:
[[1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0], [0, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]]

uniq_list=list()
uniq_list.append(matrix_as_list[0])

[uniq_list.append(item) for item in matrix_as_list if item not in uniq_list]

unique_matrix=np.array(uniq_list)
unique_matrix:
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 0]])

Lets get the entire numpy matrix as a list, then drop duplicates from this list, and finally return our unique list back into a numpy matrix:

matrix_as_list=data.tolist() 
matrix_as_list:
[[1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0], [0, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]]

uniq_list=list()
uniq_list.append(matrix_as_list[0])

[uniq_list.append(item) for item in matrix_as_list if item not in uniq_list]

unique_matrix=np.array(uniq_list)
unique_matrix:
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 0]])

回答 18

最直接的解决方案是通过将行设置为字符串来使其成为单个项目。然后,可以使用numpy将每一行的唯一性进行整体比较。该解决方案是可概括的,您只需要重塑形状并为其他组合转置数组即可。这是所提供问题的解决方案。

import numpy as np

original = np.array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

uniques, index = np.unique([str(i) for i in original], return_index=True)
cleaned = original[index]
print(cleaned)    

会给:

 array([[0, 1, 1, 1, 0, 0],
        [1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 0]])

通过邮件发送我的诺贝尔奖

The most straightforward solution is to make the rows a single item by making them strings. Each row then can be compared as a whole for its uniqueness using numpy. This solution is generalize-able you just need to reshape and transpose your array for other combinations. Here is the solution for the problem provided.

import numpy as np

original = np.array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

uniques, index = np.unique([str(i) for i in original], return_index=True)
cleaned = original[index]
print(cleaned)    

Will Give:

 array([[0, 1, 1, 1, 0, 0],
        [1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 0]])

Send my nobel prize in the mail


回答 19

import numpy as np
original = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 0]])
# create a view that the subarray as tuple and return unique indeies.
_, unique_index = np.unique(original.view(original.dtype.descr * original.shape[1]),
                            return_index=True)
# get unique set
print(original[unique_index])
import numpy as np
original = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 0]])
# create a view that the subarray as tuple and return unique indeies.
_, unique_index = np.unique(original.view(original.dtype.descr * original.shape[1]),
                            return_index=True)
# get unique set
print(original[unique_index])

Python CSV字符串到数组

问题:Python CSV字符串到数组

有人知道一个简单的库或函数来解析csv编码的字符串并将其转换为数组或字典吗?

我不认为我想要内置的csv模块,因为在所有示例中,我看到的都是文件路径,而不是字符串。

Anyone know of a simple library or function to parse a csv encoded string and turn it into an array or dictionary?

I don’t think I want the built in csv module because in all the examples I’ve seen that takes filepaths, not strings.


回答 0

您可以使用将字符串转换为文件对象io.StringIO,然后将其传递给csv模块:

from io import StringIO
import csv

scsv = """text,with,Polish,non-Latin,letters
1,2,3,4,5,6
a,b,c,d,e,f
gęś,zółty,wąż,idzie,wąską,dróżką,
"""

f = StringIO(scsv)
reader = csv.reader(f, delimiter=',')
for row in reader:
    print('\t'.join(row))

带有split()换行符的简单版本:

reader = csv.reader(scsv.split('\n'), delimiter=',')
for row in reader:
    print('\t'.join(row))

或者,您可以split()使用\n分隔符将此字符串简单地分成几行,然后将split()每一行变成值,但是这种方式您必须知道引号,因此csv首选使用module。

Python 2上,您必须导入StringIO

from StringIO import StringIO

代替。

You can convert a string to a file object using io.StringIO and then pass that to the csv module:

from io import StringIO
import csv

scsv = """text,with,Polish,non-Latin,letters
1,2,3,4,5,6
a,b,c,d,e,f
gęś,zółty,wąż,idzie,wąską,dróżką,
"""

f = StringIO(scsv)
reader = csv.reader(f, delimiter=',')
for row in reader:
    print('\t'.join(row))

simpler version with split() on newlines:

reader = csv.reader(scsv.split('\n'), delimiter=',')
for row in reader:
    print('\t'.join(row))

Or you can simply split() this string into lines using \n as separator, and then split() each line into values, but this way you must be aware of quoting, so using csv module is preferred.

On Python 2 you have to import StringIO as

from StringIO import StringIO

instead.


回答 1

简单-csv模块也可以使用列表:

>>> a=["1,2,3","4,5,6"]  # or a = "1,2,3\n4,5,6".split('\n')
>>> import csv
>>> x = csv.reader(a)
>>> list(x)
[['1', '2', '3'], ['4', '5', '6']]

Simple – the csv module works with lists, too:

>>> a=["1,2,3","4,5,6"]  # or a = "1,2,3\n4,5,6".split('\n')
>>> import csv
>>> x = csv.reader(a)
>>> list(x)
[['1', '2', '3'], ['4', '5', '6']]

回答 2

csv.reader() https://docs.python.org/2/library/csv.html的官方文档 非常有帮助,它说

文件对象和列表对象都适合

import csv

text = """1,2,3
a,b,c
d,e,f"""

lines = text.splitlines()
reader = csv.reader(lines, delimiter=',')
for row in reader:
    print('\t'.join(row))

The official doc for csv.reader() https://docs.python.org/2/library/csv.html is very helpful, which says

file objects and list objects are both suitable

import csv

text = """1,2,3
a,b,c
d,e,f"""

lines = text.splitlines()
reader = csv.reader(lines, delimiter=',')
for row in reader:
    print('\t'.join(row))

回答 3

>>> a = "1,2"
>>> a
'1,2'
>>> b = a.split(",")
>>> b
['1', '2']

解析CSV文件:

f = open(file.csv, "r")
lines = f.read().split("\n") # "\r\n" if needed

for line in lines:
    if line != "": # add other needed checks to skip titles
        cols = line.split(",")
        print cols
>>> a = "1,2"
>>> a
'1,2'
>>> b = a.split(",")
>>> b
['1', '2']

To parse a CSV file:

f = open(file.csv, "r")
lines = f.read().split("\n") # "\r\n" if needed

for line in lines:
    if line != "": # add other needed checks to skip titles
        cols = line.split(",")
        print cols

回答 4

正如其他人已经指出的那样,Python包含一个用于读取和写入CSV文件的模块。只要输入字符保持在ASCII限制内,它就可以很好地工作。如果您要处理其他编码,则需要做更多的工作。

csv模块Python文档实现了csv.reader的扩展,该扩展使用相同的接口,但可以处理其他编码并返回unicode字符串。只需复制并粘贴文档中的代码即可。之后,您可以像这样处理CSV文件:

with open("some.csv", "rb") as csvFile: 
    for row in UnicodeReader(csvFile, encoding="iso-8859-15"):
        print row

As others have already pointed out, Python includes a module to read and write CSV files. It works pretty well as long as the input characters stay within ASCII limits. In case you want to process other encodings, more work is needed.

The Python documentation for the csv module implements an extension of csv.reader, which uses the same interface but can handle other encodings and returns unicode strings. Just copy and paste the code from the documentation. After that, you can process a CSV file like this:

with open("some.csv", "rb") as csvFile: 
    for row in UnicodeReader(csvFile, encoding="iso-8859-15"):
        print row

回答 5

根据文档:

尽管该模块不直接支持解析字符串,但可以轻松实现:

import csv
for row in csv.reader(['one,two,three']):
    print row

只需将您的字符串转换为单个元素列表即可。

当这个例子在文档中明确时,导入StringIO对我来说似乎有点多余。

Per the documentation:

And while the module doesn’t directly support parsing strings, it can easily be done:

import csv
for row in csv.reader(['one,two,three']):
    print row

Just turn your string into a single element list.

Importing StringIO seems a bit excessive to me when this example is explicitly in the docs.


回答 6

https://docs.python.org/2/library/csv.html?highlight=csv#csv.reader

csvfile可以是任何支持迭代器协议的对象,并且每次调用其next()方法时都会返回一个字符串

因此,一个StringIO.StringIO()str.splitlines()甚至一个生成器都很好。

https://docs.python.org/2/library/csv.html?highlight=csv#csv.reader

csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called

Thus, a StringIO.StringIO(), str.splitlines() or even a generator are all good.


回答 7

这是一个替代解决方案:

>>> import pyexcel as pe
>>> text="""1,2,3
... a,b,c
... d,e,f"""
>>> s = pe.load_from_memory('csv', text)
>>> s
Sheet Name: csv
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| a | b | c |
+---+---+---+
| d | e | f |
+---+---+---+
>>> s.to_array()
[[u'1', u'2', u'3'], [u'a', u'b', u'c'], [u'd', u'e', u'f']]

这是文档

Here’s an alternative solution:

>>> import pyexcel as pe
>>> text="""1,2,3
... a,b,c
... d,e,f"""
>>> s = pe.load_from_memory('csv', text)
>>> s
Sheet Name: csv
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| a | b | c |
+---+---+---+
| d | e | f |
+---+---+---+
>>> s.to_array()
[[u'1', u'2', u'3'], [u'a', u'b', u'c'], [u'd', u'e', u'f']]

Here’s the documentation


回答 8

使用此功能将csv加载到列表中

import csv

csvfile = open(myfile, 'r')
reader = csv.reader(csvfile, delimiter='\t')
my_list = list(reader)
print my_list
>>>[['1st_line', '0'],
    ['2nd_line', '0']]

Use this to have a csv loaded into a list

import csv

csvfile = open(myfile, 'r')
reader = csv.reader(csvfile, delimiter='\t')
my_list = list(reader)
print my_list
>>>[['1st_line', '0'],
    ['2nd_line', '0']]

回答 9

Panda是功能强大且智能的库,可使用Python读取CSV

这里有一个简单的例子,我有example.zip文件,其中有四个文件。

EXAMPLE.zip
 -- example1.csv
 -- example1.txt
 -- example2.csv
 -- example2.txt

from zipfile import ZipFile
import pandas as pd


filepath = 'EXAMPLE.zip'
file_prefix = filepath[:-4].lower()

zipfile = ZipFile(filepath)
target_file = ''.join([file_prefix, '/', file_prefix, 1 , '.csv'])

df = pd.read_csv(zipfile.open(target_file))

print(df.head()) # print first five row of csv
print(df[COL_NAME]) # fetch the col_name data

有了数据后,您就可以操纵播放列表或其他格式。

Panda is quite powerful and smart library reading CSV in Python

A simple example here, I have example.zip file with four files in it.

EXAMPLE.zip
 -- example1.csv
 -- example1.txt
 -- example2.csv
 -- example2.txt

from zipfile import ZipFile
import pandas as pd


filepath = 'EXAMPLE.zip'
file_prefix = filepath[:-4].lower()

zipfile = ZipFile(filepath)
target_file = ''.join([file_prefix, '/', file_prefix, 1 , '.csv'])

df = pd.read_csv(zipfile.open(target_file))

print(df.head()) # print first five row of csv
print(df[COL_NAME]) # fetch the col_name data

Once you have data you can manipulate to play with a list or other formats.


如何创建全为真或全为假的numpy数组?

问题:如何创建全为真或全为假的numpy数组?

在Python中,如何创建由全True或全False填充的任意形状的numpy数组?

In Python, how do I create a numpy array of arbitrary shape filled with all True or all False?


回答 0

numpy已经可以非常容易地创建全1或全0的数组:

例如numpy.ones((2, 2))numpy.zeros((2, 2))

由于TrueFalsePython中被表示为10,分别,我们只有指定这个数组应该是布尔使用可选dtype参数,我们正在这样做。

numpy.ones((2, 2), dtype=bool)

返回:

array([[ True,  True],
       [ True,  True]], dtype=bool)

更新:2013年10月30日

从numpy 版本1.8开始,我们可以使用full语法更清楚地表明我们意图的语法来达到相同的结果(如fmonegaglia指出):

numpy.full((2, 2), True, dtype=bool)

更新:2017年1月16日

因为至少numpy的1.12版full自动转换结果到了dtype第二个参数,所以我们可以这样写:

numpy.full((2, 2), True)

numpy already allows the creation of arrays of all ones or all zeros very easily:

e.g. numpy.ones((2, 2)) or numpy.zeros((2, 2))

Since True and False are represented in Python as 1 and 0, respectively, we have only to specify this array should be boolean using the optional dtype parameter and we are done.

numpy.ones((2, 2), dtype=bool)

returns:

array([[ True,  True],
       [ True,  True]], dtype=bool)

UPDATE: 30 October 2013

Since numpy version 1.8, we can use full to achieve the same result with syntax that more clearly shows our intent (as fmonegaglia points out):

numpy.full((2, 2), True, dtype=bool)

UPDATE: 16 January 2017

Since at least numpy version 1.12, full automatically casts results to the dtype of the second parameter, so we can just write:

numpy.full((2, 2), True)


回答 1

numpy.full((2,2), True, dtype=bool)
numpy.full((2,2), True, dtype=bool)

回答 2

ones和和zeros分别创建一个全为1和0的数组,它们带有一个可选dtype参数:

>>> numpy.ones((2, 2), dtype=bool)
array([[ True,  True],
       [ True,  True]], dtype=bool)
>>> numpy.zeros((2, 2), dtype=bool)
array([[False, False],
       [False, False]], dtype=bool)

ones and zeros, which create arrays full of ones and zeros respectively, take an optional dtype parameter:

>>> numpy.ones((2, 2), dtype=bool)
array([[ True,  True],
       [ True,  True]], dtype=bool)
>>> numpy.zeros((2, 2), dtype=bool)
array([[False, False],
       [False, False]], dtype=bool)

回答 3

如果它不是可写的,则可以使用以下方法创建这样的数组np.broadcast_to

>>> import numpy as np
>>> np.broadcast_to(True, (2, 5))
array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]], dtype=bool)

如果需要可写,也可以fill自己创建一个空数组:

>>> arr = np.empty((2, 5), dtype=bool)
>>> arr.fill(1)
>>> arr
array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]], dtype=bool)

这些方法只是替代建议。通常,您应该坚持np.fullnp.zeros或者np.ones像其他答案所建议的那样。

If it doesn’t have to be writeable you can create such an array with np.broadcast_to:

>>> import numpy as np
>>> np.broadcast_to(True, (2, 5))
array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]], dtype=bool)

If you need it writable you can also create an empty array and fill it yourself:

>>> arr = np.empty((2, 5), dtype=bool)
>>> arr.fill(1)
>>> arr
array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]], dtype=bool)

These approaches are only alternative suggestions. In general you should stick with np.full, np.zeros or np.ones like the other answers suggest.


回答 4

快速运行一个timeit,以查看np.full和之间是否有任何差异np.ones版本。

答:不可以

import timeit

n_array, n_test = 1000, 10000
setup = f"import numpy as np; n = {n_array};"

print(f"np.ones: {timeit.timeit('np.ones((n, n), dtype=bool)', number=n_test, setup=setup)}s")
print(f"np.full: {timeit.timeit('np.full((n, n), True)', number=n_test, setup=setup)}s")

结果:

np.ones: 0.38416870904620737s
np.full: 0.38430388597771525s


重要

关于帖子np.empty(由于声誉太低,我无法评论):

不要那样做。请勿使用np.empty初始化全True数组

由于数组为空,因此不会写入内存,也无法保证您的值是多少,例如

>>> print(np.empty((4,4), dtype=bool))
[[ True  True  True  True]
 [ True  True  True  True]
 [ True  True  True  True]
 [ True  True False False]]

Quickly ran a timeit to see, if there are any differences between the np.full and np.ones version.

Answer: No

import timeit

n_array, n_test = 1000, 10000
setup = f"import numpy as np; n = {n_array};"

print(f"np.ones: {timeit.timeit('np.ones((n, n), dtype=bool)', number=n_test, setup=setup)}s")
print(f"np.full: {timeit.timeit('np.full((n, n), True)', number=n_test, setup=setup)}s")

Result:

np.ones: 0.38416870904620737s
np.full: 0.38430388597771525s


IMPORTANT

Regarding the post about np.empty (and I cannot comment, as my reputation is too low):

DON’T DO THAT. DON’T USE np.empty to initialize an all-True array

As the array is empty, the memory is not written and there is no guarantee, what your values will be, e.g.

>>> print(np.empty((4,4), dtype=bool))
[[ True  True  True  True]
 [ True  True  True  True]
 [ True  True  True  True]
 [ True  True False False]]

回答 5

>>> a = numpy.full((2,4), True, dtype=bool)
>>> a[1][3]
True
>>> a
array([[ True,  True,  True,  True],
       [ True,  True,  True,  True]], dtype=bool)

numpy.full(大小,标量值,类型)。也可以传递其他参数,有关文档,请检查https://docs.scipy.org/doc/numpy/reference/produced/numpy.full.html

>>> a = numpy.full((2,4), True, dtype=bool)
>>> a[1][3]
True
>>> a
array([[ True,  True,  True,  True],
       [ True,  True,  True,  True]], dtype=bool)

numpy.full(Size, Scalar Value, Type). There is other arguments as well that can be passed, for documentation on that, check https://docs.scipy.org/doc/numpy/reference/generated/numpy.full.html


了解NumPy的einsum

问题:了解NumPy的einsum

我正在努力了解确切的einsum工作原理。我看了一下文档和一些示例,但看起来似乎并不固定。

这是我们在课堂上讲的一个例子:

C = np.einsum("ij,jk->ki", A, B)

对于两个数组AB

我认为可以A^T * B,但是我不确定(正在对其中之一进行移调吗?)。谁能告诉我这里到底发生了什么(以及使用时的一般情况einsum)?

I’m struggling to understand exactly how einsum works. I’ve looked at the documentation and a few examples, but it’s not seeming to stick.

Here’s an example we went over in class:

C = np.einsum("ij,jk->ki", A, B)

for two arraysA and B

I think this would take A^T * B, but I’m not sure (it’s taking the transpose of one of them right?). Can anyone walk me through exactly what’s happening here (and in general when using einsum)?


回答 0

(注:这个答案是基于短的博客文章einsum我写了前一阵子。)

怎么einsum办?

假设我们有两个多维数组,AB。现在假设我们要…

  • AB一种特殊的方式来创造新的产品阵列; 然后也许
  • 沿特定轴求和该新数组;然后也许
  • 以特定顺序转置新数组的轴。

有一个很好的机会,einsum可以帮助我们做到这一点更快,内存更是有效的NumPy的功能组合,喜欢multiplysumtranspose允许。

einsum工作如何?

这是一个简单(但并非完全无关紧要)的示例。取以下两个数组:

A = np.array([0, 1, 2])

B = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])

我们将逐个元素相乘AB然后沿着新数组的行求和。在“普通” NumPy中,我们将编写:

>>> (A[:, np.newaxis] * B).sum(axis=1)
array([ 0, 22, 76])

因此,此处的索引操作A将两个数组的第一个轴对齐,以便可以广播乘法。然后将乘积数组中的行相加以返回答案。

现在,如果我们想使用它einsum,我们可以这样写:

>>> np.einsum('i,ij->i', A, B)
array([ 0, 22, 76])

签名字符串'i,ij->i'是这里的关键,需要解释的一点。您可以将其分为两半。在左侧(的左侧->),我们标记了两个输入数组。在的右侧->,我们标记了要结束的数组。

接下来会发生以下情况:

  • A有一个轴 我们已经标记了它i。并且B有两个轴;我们将轴0标记为i,将轴1 标记为j

  • 通过在两个输入数组中重复标签i,我们告诉我们einsum这两个轴应该相乘。换句话说,就像A数组B一样,我们将array 与array 的每一列相乘A[:, np.newaxis] * B

  • 请注意,j它不会在所需的输出中显示为标签;我们刚刚使用过i(我们想以一维数组结尾)。通过省略标签,我们告诉einsum总结沿着这条轴线。换句话说,我们就像对行进行求和.sum(axis=1)

基本上,这是您需要了解的所有信息einsum。玩一会会有所帮助;如果我们将两个标签都留在输出中,则会'i,ij->ij'返回2D产品数组(与相同A[:, np.newaxis] * B)。如果我们说没有输出标签,'i,ij->我们将返回一个数字(与相同(A[:, np.newaxis] * B).sum())。

einsum但是,最重要的是,它不会首先构建临时产品系列;它只是对产品进行累加。这样可以节省大量内存。

一个更大的例子

为了解释点积,这里有两个新数组:

A = array([[1, 1, 1],
           [2, 2, 2],
           [5, 5, 5]])

B = array([[0, 1, 0],
           [1, 1, 0],
           [1, 1, 1]])

我们将使用计算点积np.einsum('ij,jk->ik', A, B)。这是一张图片,显示了从函数获得的AB和输出数组的标签:

在此处输入图片说明

您会看到j重复的标签-这意味着我们会将的行A与的列相乘B。此外,j输出中不包含标签-我们对这些产品进行求和。标签ik被保留用于输出,因此我们得到一个2D数组。

这一结果与其中标签阵列比较可能是更加明显j求和。在下面的左侧,您可以看到写入产生的3D数组np.einsum('ij,jk->ijk', A, B)(即,我们保留了label j):

在此处输入图片说明

求和轴j给出了预期的点积,如右图所示。

一些练习

为了获得更多的感觉einsum,使用下标符号实现熟悉的NumPy数组操作可能会很有用。任何涉及乘法和求和轴组合的内容都可以使用编写 einsum

令A和B为两个具有相同长度的一维数组。例如A = np.arange(10)B = np.arange(5, 15)

  • 的总和A可以写成:

    np.einsum('i->', A)
  • A * B可以按元素写成:

    np.einsum('i,i->i', A, B)
  • 内积或点积np.inner(A, B)np.dot(A, B)可以写成:

    np.einsum('i,i->', A, B) # or just use 'i,i'
  • 外部乘积np.outer(A, B)可以写成:

    np.einsum('i,j->ij', A, B)

对于2D数组,CD,只要轴是兼容的长度(相同长度或其中之一具有长度1),下面是一些示例:

  • C(主对角线总和)的轨迹np.trace(C)可以写成:

    np.einsum('ii', C)
  • 的元素方式乘法C和转置DC * D.T可以写成:

    np.einsum('ij,ji->ij', C, D)
  • 可以将每个元素乘以C该数组D(以构成4D数组)C[:, :, None, None] * D,可以写成:

    np.einsum('ij,kl->ijkl', C, D)  

(Note: this answer is based on a short blog post about einsum I wrote a while ago.)

What does einsum do?

Imagine that we have two multi-dimensional arrays, A and B. Now let’s suppose we want to…

  • multiply A with B in a particular way to create new array of products; and then maybe
  • sum this new array along particular axes; and then maybe
  • transpose the axes of the new array in a particular order.

There’s a good chance that einsum will help us do this faster and more memory-efficiently that combinations of the NumPy functions like multiply, sum and transpose will allow.

How does einsum work?

Here’s a simple (but not completely trivial) example. Take the following two arrays:

A = np.array([0, 1, 2])

B = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])

We will multiply A and B element-wise and then sum along the rows of the new array. In “normal” NumPy we’d write:

>>> (A[:, np.newaxis] * B).sum(axis=1)
array([ 0, 22, 76])

So here, the indexing operation on A lines up the first axes of the two arrays so that the multiplication can be broadcast. The rows of the array of products is then summed to return the answer.

Now if we wanted to use einsum instead, we could write:

>>> np.einsum('i,ij->i', A, B)
array([ 0, 22, 76])

The signature string 'i,ij->i' is the key here and needs a little bit of explaining. You can think of it in two halves. On the left-hand side (left of the ->) we’ve labelled the two input arrays. To the right of ->, we’ve labelled the array we want to end up with.

Here is what happens next:

  • A has one axis; we’ve labelled it i. And B has two axes; we’ve labelled axis 0 as i and axis 1 as j.

  • By repeating the label i in both input arrays, we are telling einsum that these two axes should be multiplied together. In other words, we’re multiplying array A with each column of array B, just like A[:, np.newaxis] * B does.

  • Notice that j does not appear as a label in our desired output; we’ve just used i (we want to end up with a 1D array). By omitting the label, we’re telling einsum to sum along this axis. In other words, we’re summing the rows of the products, just like .sum(axis=1) does.

That’s basically all you need to know to use einsum. It helps to play about a little; if we leave both labels in the output, 'i,ij->ij', we get back a 2D array of products (same as A[:, np.newaxis] * B). If we say no output labels, 'i,ij->, we get back a single number (same as doing (A[:, np.newaxis] * B).sum()).

The great thing about einsum however, is that is does not build a temporary array of products first; it just sums the products as it goes. This can lead to big savings in memory use.

A slightly bigger example

To explain the dot product, here are two new arrays:

A = array([[1, 1, 1],
           [2, 2, 2],
           [5, 5, 5]])

B = array([[0, 1, 0],
           [1, 1, 0],
           [1, 1, 1]])

We will compute the dot product using np.einsum('ij,jk->ik', A, B). Here’s a picture showing the labelling of the A and B and the output array that we get from the function:

enter image description here

You can see that label j is repeated – this means we’re multiplying the rows of A with the columns of B. Furthermore, the label j is not included in the output – we’re summing these products. Labels i and k are kept for the output, so we get back a 2D array.

It might be even clearer to compare this result with the array where the label j is not summed. Below, on the left you can see the 3D array that results from writing np.einsum('ij,jk->ijk', A, B) (i.e. we’ve kept label j):

enter image description here

Summing axis j gives the expected dot product, shown on the right.

Some exercises

To get more of feel for einsum, it can be useful to implement familiar NumPy array operations using the subscript notation. Anything that involves combinations of multiplying and summing axes can be written using einsum.

Let A and B be two 1D arrays with the same length. For example, A = np.arange(10) and B = np.arange(5, 15).

  • The sum of A can be written:

    np.einsum('i->', A)
    
  • Element-wise multiplication, A * B, can be written:

    np.einsum('i,i->i', A, B)
    
  • The inner product or dot product, np.inner(A, B) or np.dot(A, B), can be written:

    np.einsum('i,i->', A, B) # or just use 'i,i'
    
  • The outer product, np.outer(A, B), can be written:

    np.einsum('i,j->ij', A, B)
    

For 2D arrays, C and D, provided that the axes are compatible lengths (both the same length or one of them of has length 1), here are a few examples:

  • The trace of C (sum of main diagonal), np.trace(C), can be written:

    np.einsum('ii', C)
    
  • Element-wise multiplication of C and the transpose of D, C * D.T, can be written:

    np.einsum('ij,ji->ij', C, D)
    
  • Multiplying each element of C by the array D (to make a 4D array), C[:, :, None, None] * D, can be written:

    np.einsum('ij,kl->ijkl', C, D)  
    

回答 1

numpy.einsum()如果您直观地理解它的想法,将非常容易。作为示例,让我们从涉及矩阵乘法的简单描述开始。


使用时numpy.einsum(),您要做的就是传递所谓的下标字符串作为参数,然后传递输入数组

假设您有两个2D数组AB,并且想要进行矩阵乘法。所以你也是:

np.einsum("ij, jk -> ik", A, B)

在这里,下标字符串 ij对应于array,A下标字符串 jk对应于array B。另外,这里要注意的最重要的一点是,每个下标字符串中的字符数必须与数组的大小匹配。(例如,对于2D数组为2个字符,对于3D数组为3个字符,依此类推。)如果您在下标字符串之间重复字符(在我们的示例中),则意味着您希望总和沿着这些维度发生。因此,它们将减少总和。(即该维度将消失 jein

此之后的下标字符串->将成为我们的结果数组。如果将其保留为空,则将对所有内容求和,并返回标量值作为结果。否则,所得数组将具有根据下标字符串的尺寸。在我们的示例中,它将为ik。这很直观,因为我们知道对于矩阵乘法,数组中的列数A必须与数组中的行数相匹配,B这就是这里发生的情况(即,我们通过在下标字符串中重复char j来编码此知识)


这里还有一些其他示例,简要说明了np.einsum()实现某些常见张量nd数组操作的用途/功能。

输入项

# a vector
In [197]: vec
Out[197]: array([0, 1, 2, 3])

# an array
In [198]: A
Out[198]: 
array([[11, 12, 13, 14],
       [21, 22, 23, 24],
       [31, 32, 33, 34],
       [41, 42, 43, 44]])

# another array
In [199]: B
Out[199]: 
array([[1, 1, 1, 1],
       [2, 2, 2, 2],
       [3, 3, 3, 3],
       [4, 4, 4, 4]])

1)矩阵乘法(类似于np.matmul(arr1, arr2)

In [200]: np.einsum("ij, jk -> ik", A, B)
Out[200]: 
array([[130, 130, 130, 130],
       [230, 230, 230, 230],
       [330, 330, 330, 330],
       [430, 430, 430, 430]])

2)沿主对角线提取元素(类似于np.diag(arr)

In [202]: np.einsum("ii -> i", A)
Out[202]: array([11, 22, 33, 44])

3)Hadamard乘积(即两个数组的按元素乘积)(类似于arr1 * arr2

In [203]: np.einsum("ij, ij -> ij", A, B)
Out[203]: 
array([[ 11,  12,  13,  14],
       [ 42,  44,  46,  48],
       [ 93,  96,  99, 102],
       [164, 168, 172, 176]])

4)逐元素平方(类似于np.square(arr)arr ** 2

In [210]: np.einsum("ij, ij -> ij", B, B)
Out[210]: 
array([[ 1,  1,  1,  1],
       [ 4,  4,  4,  4],
       [ 9,  9,  9,  9],
       [16, 16, 16, 16]])

5)痕迹(即主对角元素的总和)(类似于np.trace(arr)

In [217]: np.einsum("ii -> ", A)
Out[217]: 110

6)矩阵转置(类似于np.transpose(arr)

In [221]: np.einsum("ij -> ji", A)
Out[221]: 
array([[11, 21, 31, 41],
       [12, 22, 32, 42],
       [13, 23, 33, 43],
       [14, 24, 34, 44]])

7)(向量的)外积(类似于np.outer(vec1, vec2)

In [255]: np.einsum("i, j -> ij", vec, vec)
Out[255]: 
array([[0, 0, 0, 0],
       [0, 1, 2, 3],
       [0, 2, 4, 6],
       [0, 3, 6, 9]])

8)(向量的)内积(类似于np.inner(vec1, vec2)

In [256]: np.einsum("i, i -> ", vec, vec)
Out[256]: 14

9)沿轴0求和(类似于np.sum(arr, axis=0)

In [260]: np.einsum("ij -> j", B)
Out[260]: array([10, 10, 10, 10])

10)沿轴1的总和(类似于np.sum(arr, axis=1)

In [261]: np.einsum("ij -> i", B)
Out[261]: array([ 4,  8, 12, 16])

11)批矩阵乘法

In [287]: BM = np.stack((A, B), axis=0)

In [288]: BM
Out[288]: 
array([[[11, 12, 13, 14],
        [21, 22, 23, 24],
        [31, 32, 33, 34],
        [41, 42, 43, 44]],

       [[ 1,  1,  1,  1],
        [ 2,  2,  2,  2],
        [ 3,  3,  3,  3],
        [ 4,  4,  4,  4]]])

In [289]: BM.shape
Out[289]: (2, 4, 4)

# batch matrix multiply using einsum
In [292]: BMM = np.einsum("bij, bjk -> bik", BM, BM)

In [293]: BMM
Out[293]: 
array([[[1350, 1400, 1450, 1500],
        [2390, 2480, 2570, 2660],
        [3430, 3560, 3690, 3820],
        [4470, 4640, 4810, 4980]],

       [[  10,   10,   10,   10],
        [  20,   20,   20,   20],
        [  30,   30,   30,   30],
        [  40,   40,   40,   40]]])

In [294]: BMM.shape
Out[294]: (2, 4, 4)

12)沿轴2的总和(类似于np.sum(arr, axis=2)

In [330]: np.einsum("ijk -> ij", BM)
Out[330]: 
array([[ 50,  90, 130, 170],
       [  4,   8,  12,  16]])

13)对数组中的所有元素求和(类似于np.sum(arr)

In [335]: np.einsum("ijk -> ", BM)
Out[335]: 480

14)多轴总和(即边际化)
(类似于np.sum(arr, axis=(axis0, axis1, axis2, axis3, axis4, axis6, axis7))

# 8D array
In [354]: R = np.random.standard_normal((3,5,4,6,8,2,7,9))

# marginalize out axis 5 (i.e. "n" here)
In [363]: esum = np.einsum("ijklmnop -> n", R)

# marginalize out axis 5 (i.e. sum over rest of the axes)
In [364]: nsum = np.sum(R, axis=(0,1,2,3,4,6,7))

In [365]: np.allclose(esum, nsum)
Out[365]: True

15)双点(类似于np.sum(哈达玛积) cf. 3

In [772]: A
Out[772]: 
array([[1, 2, 3],
       [4, 2, 2],
       [2, 3, 4]])

In [773]: B
Out[773]: 
array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])

In [774]: np.einsum("ij, ij -> ", A, B)
Out[774]: 124

16)2D和3D阵列乘法

在要验证结果的线性方程组(Ax = b)求解时,这种乘法可能非常有用。

# inputs
In [115]: A = np.random.rand(3,3)
In [116]: b = np.random.rand(3, 4, 5)

# solve for x
In [117]: x = np.linalg.solve(A, b.reshape(b.shape[0], -1)).reshape(b.shape)

# 2D and 3D array multiplication :)
In [118]: Ax = np.einsum('ij, jkl', A, x)

# indeed the same!
In [119]: np.allclose(Ax, b)
Out[119]: True

相反,如果必须使用np.matmul()此验证,则我们必须执行几项reshape操作才能获得相同的结果,例如:

# reshape 3D array `x` to 2D, perform matmul
# then reshape the resultant array to 3D
In [123]: Ax_matmul = np.matmul(A, x.reshape(x.shape[0], -1)).reshape(x.shape)

# indeed correct!
In [124]: np.allclose(Ax, Ax_matmul)
Out[124]: True

奖金:在这里阅读更多数学:爱因斯坦求和,当然在这里:张量表示法

Grasping the idea of numpy.einsum() is very easy if you understand it intuitively. As an example, let’s start with a simple description involving matrix multiplication.


To use numpy.einsum(), all you have to do is to pass the so-called subscripts string as an argument, followed by your input arrays.

Let’s say you have two 2D arrays, A and B, and you want to do matrix multiplication. So, you do:

np.einsum("ij, jk -> ik", A, B)

Here the subscript string ij corresponds to array A while the subscript string jk corresponds to array B. Also, the most important thing to note here is that the number of characters in each subscript string must match the dimensions of the array. (i.e. two chars for 2D arrays, three chars for 3D arrays, and so on.) And if you repeat the chars between subscript strings (j in our case), then that means you want the einsum to happen along those dimensions. Thus, they will be sum-reduced. (i.e. that dimension will be gone)

The subscript string after this ->, will be our resultant array. If you leave it empty, then everything will be summed and a scalar value is returned as result. Else the resultant array will have dimensions according to the subscript string. In our example, it’ll be ik. This is intuitive because we know that for matrix multiplication the number of columns in array A has to match the number of rows in array B which is what is happening here (i.e. we encode this knowledge by repeating the char j in the subscript string)


Here are some more examples illustrating the use/power of np.einsum() in implementing some common tensor or nd-array operations, succinctly.

Inputs

# a vector
In [197]: vec
Out[197]: array([0, 1, 2, 3])

# an array
In [198]: A
Out[198]: 
array([[11, 12, 13, 14],
       [21, 22, 23, 24],
       [31, 32, 33, 34],
       [41, 42, 43, 44]])

# another array
In [199]: B
Out[199]: 
array([[1, 1, 1, 1],
       [2, 2, 2, 2],
       [3, 3, 3, 3],
       [4, 4, 4, 4]])

1) Matrix multiplication (similar to np.matmul(arr1, arr2))

In [200]: np.einsum("ij, jk -> ik", A, B)
Out[200]: 
array([[130, 130, 130, 130],
       [230, 230, 230, 230],
       [330, 330, 330, 330],
       [430, 430, 430, 430]])

2) Extract elements along the main-diagonal (similar to np.diag(arr))

In [202]: np.einsum("ii -> i", A)
Out[202]: array([11, 22, 33, 44])

3) Hadamard product (i.e. element-wise product of two arrays) (similar to arr1 * arr2)

In [203]: np.einsum("ij, ij -> ij", A, B)
Out[203]: 
array([[ 11,  12,  13,  14],
       [ 42,  44,  46,  48],
       [ 93,  96,  99, 102],
       [164, 168, 172, 176]])

4) Element-wise squaring (similar to np.square(arr) or arr ** 2)

In [210]: np.einsum("ij, ij -> ij", B, B)
Out[210]: 
array([[ 1,  1,  1,  1],
       [ 4,  4,  4,  4],
       [ 9,  9,  9,  9],
       [16, 16, 16, 16]])

5) Trace (i.e. sum of main-diagonal elements) (similar to np.trace(arr))

In [217]: np.einsum("ii -> ", A)
Out[217]: 110

6) Matrix transpose (similar to np.transpose(arr))

In [221]: np.einsum("ij -> ji", A)
Out[221]: 
array([[11, 21, 31, 41],
       [12, 22, 32, 42],
       [13, 23, 33, 43],
       [14, 24, 34, 44]])

7) Outer Product (of vectors) (similar to np.outer(vec1, vec2))

In [255]: np.einsum("i, j -> ij", vec, vec)
Out[255]: 
array([[0, 0, 0, 0],
       [0, 1, 2, 3],
       [0, 2, 4, 6],
       [0, 3, 6, 9]])

8) Inner Product (of vectors) (similar to np.inner(vec1, vec2))

In [256]: np.einsum("i, i -> ", vec, vec)
Out[256]: 14

9) Sum along axis 0 (similar to np.sum(arr, axis=0))

In [260]: np.einsum("ij -> j", B)
Out[260]: array([10, 10, 10, 10])

10) Sum along axis 1 (similar to np.sum(arr, axis=1))

In [261]: np.einsum("ij -> i", B)
Out[261]: array([ 4,  8, 12, 16])

11) Batch Matrix Multiplication

In [287]: BM = np.stack((A, B), axis=0)

In [288]: BM
Out[288]: 
array([[[11, 12, 13, 14],
        [21, 22, 23, 24],
        [31, 32, 33, 34],
        [41, 42, 43, 44]],

       [[ 1,  1,  1,  1],
        [ 2,  2,  2,  2],
        [ 3,  3,  3,  3],
        [ 4,  4,  4,  4]]])

In [289]: BM.shape
Out[289]: (2, 4, 4)

# batch matrix multiply using einsum
In [292]: BMM = np.einsum("bij, bjk -> bik", BM, BM)

In [293]: BMM
Out[293]: 
array([[[1350, 1400, 1450, 1500],
        [2390, 2480, 2570, 2660],
        [3430, 3560, 3690, 3820],
        [4470, 4640, 4810, 4980]],

       [[  10,   10,   10,   10],
        [  20,   20,   20,   20],
        [  30,   30,   30,   30],
        [  40,   40,   40,   40]]])

In [294]: BMM.shape
Out[294]: (2, 4, 4)

12) Sum along axis 2 (similar to np.sum(arr, axis=2))

In [330]: np.einsum("ijk -> ij", BM)
Out[330]: 
array([[ 50,  90, 130, 170],
       [  4,   8,  12,  16]])

13) Sum all the elements in array (similar to np.sum(arr))

In [335]: np.einsum("ijk -> ", BM)
Out[335]: 480

14) Sum over multiple axes (i.e. marginalization)
(similar to np.sum(arr, axis=(axis0, axis1, axis2, axis3, axis4, axis6, axis7)))

# 8D array
In [354]: R = np.random.standard_normal((3,5,4,6,8,2,7,9))

# marginalize out axis 5 (i.e. "n" here)
In [363]: esum = np.einsum("ijklmnop -> n", R)

# marginalize out axis 5 (i.e. sum over rest of the axes)
In [364]: nsum = np.sum(R, axis=(0,1,2,3,4,6,7))

In [365]: np.allclose(esum, nsum)
Out[365]: True

15) Double Dot Products (similar to np.sum(hadamard-product) cf. 3)

In [772]: A
Out[772]: 
array([[1, 2, 3],
       [4, 2, 2],
       [2, 3, 4]])

In [773]: B
Out[773]: 
array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])

In [774]: np.einsum("ij, ij -> ", A, B)
Out[774]: 124

16) 2D and 3D array multiplication

Such a multiplication could be very useful when solving linear system of equations (Ax = b) where you want to verify the result.

# inputs
In [115]: A = np.random.rand(3,3)
In [116]: b = np.random.rand(3, 4, 5)

# solve for x
In [117]: x = np.linalg.solve(A, b.reshape(b.shape[0], -1)).reshape(b.shape)

# 2D and 3D array multiplication :)
In [118]: Ax = np.einsum('ij, jkl', A, x)

# indeed the same!
In [119]: np.allclose(Ax, b)
Out[119]: True

On the contrary, if one has to use np.matmul() for this verification, we have to do couple of reshape operations to achieve the same result like:

# reshape 3D array `x` to 2D, perform matmul
# then reshape the resultant array to 3D
In [123]: Ax_matmul = np.matmul(A, x.reshape(x.shape[0], -1)).reshape(x.shape)

# indeed correct!
In [124]: np.allclose(Ax, Ax_matmul)
Out[124]: True

Bonus: Read more math here : Einstein-Summation and definitely here: Tensor-Notation


回答 2

让我们制作2个数组,它们具有不同但兼容的维度,以突出它们之间的相互作用

In [43]: A=np.arange(6).reshape(2,3)
Out[43]: 
array([[0, 1, 2],
       [3, 4, 5]])


In [44]: B=np.arange(12).reshape(3,4)
Out[44]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

您的计算将(2,3)的“点”(乘积之和)与(3,4)相乘,以生成(4,2)数组。 i是第一个昏暗的A,最后一个C; k最后B1个,第1个Cj通过求和“消耗”。

In [45]: C=np.einsum('ij,jk->ki',A,B)
Out[45]: 
array([[20, 56],
       [23, 68],
       [26, 80],
       [29, 92]])

这与np.dot(A,B).T-是转置的最终输出相同。

要查看更多情况j,请将C下标更改为ijk

In [46]: np.einsum('ij,jk->ijk',A,B)
Out[46]: 
array([[[ 0,  0,  0,  0],
        [ 4,  5,  6,  7],
        [16, 18, 20, 22]],

       [[ 0,  3,  6,  9],
        [16, 20, 24, 28],
        [40, 45, 50, 55]]])

这也可以通过以下方式生成:

A[:,:,None]*B[None,:,:]

即,添加一个k维度的端部A,以及i与前部B,产生了(2,3,4)阵列。

0 + 4 + 16 = 209 + 28 + 55 = 92等; 求和j转置以获得较早的结果:

np.sum(A[:,:,None] * B[None,:,:], axis=1).T

# C[k,i] = sum(j) A[i,j (,k) ] * B[(i,)  j,k]

Lets make 2 arrays, with different, but compatible dimensions to highlight their interplay

In [43]: A=np.arange(6).reshape(2,3)
Out[43]: 
array([[0, 1, 2],
       [3, 4, 5]])


In [44]: B=np.arange(12).reshape(3,4)
Out[44]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

Your calculation, takes a ‘dot’ (sum of products) of a (2,3) with a (3,4) to produce a (4,2) array. i is the 1st dim of A, the last of C; k the last of B, 1st of C. j is ‘consumed’ by the summation.

In [45]: C=np.einsum('ij,jk->ki',A,B)
Out[45]: 
array([[20, 56],
       [23, 68],
       [26, 80],
       [29, 92]])

This is the same as np.dot(A,B).T – it’s the final output that’s transposed.

To see more of what happens to j, change the C subscripts to ijk:

In [46]: np.einsum('ij,jk->ijk',A,B)
Out[46]: 
array([[[ 0,  0,  0,  0],
        [ 4,  5,  6,  7],
        [16, 18, 20, 22]],

       [[ 0,  3,  6,  9],
        [16, 20, 24, 28],
        [40, 45, 50, 55]]])

This can also be produced with:

A[:,:,None]*B[None,:,:]

That is, add a k dimension to the end of A, and an i to the front of B, resulting in a (2,3,4) array.

0 + 4 + 16 = 20, 9 + 28 + 55 = 92, etc; Sum on j and transpose to get the earlier result:

np.sum(A[:,:,None] * B[None,:,:], axis=1).T

# C[k,i] = sum(j) A[i,j (,k) ] * B[(i,)  j,k]

回答 3

我发现NumPy:交易技巧(第二部分)具有启发性

我们使用->指示输出数组的顺序。因此,将“ ij,i-> j”视为具有左侧(LHS)和右侧(RHS)。LHS上标签的任何重复都会明智地计算乘积元素,然后求和。通过更改RHS(输出)端的标签,我们可以相对于输入数组定义要在其中进行处理的轴,即沿轴0、1求和,依此类推。

import numpy as np

>>> a
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])
>>> b
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> d = np.einsum('ij, jk->ki', a, b)

请注意,存在三个轴,即i,j,k,并且重复了j(在左侧)。 i,j代表的行和列aj,kb

为了计算乘积并对齐j轴,我们需要在上添加一个轴a。(b将沿第一个轴广播?)

a[i, j, k]
   b[j, k]

>>> c = a[:,:,np.newaxis] * b
>>> c
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 0,  2,  4],
        [ 6,  8, 10],
        [12, 14, 16]],

       [[ 0,  3,  6],
        [ 9, 12, 15],
        [18, 21, 24]]])

j在右侧不存在,因此我们求和j是3x3x3数组的第二个轴

>>> c = c.sum(1)
>>> c
array([[ 9, 12, 15],
       [18, 24, 30],
       [27, 36, 45]])

最后,索引在右侧(按字母顺序)相反,因此我们进行了转置。

>>> c.T
array([[ 9, 18, 27],
       [12, 24, 36],
       [15, 30, 45]])

>>> np.einsum('ij, jk->ki', a, b)
array([[ 9, 18, 27],
       [12, 24, 36],
       [15, 30, 45]])
>>>

I found NumPy: The tricks of the trade (Part II) instructive

We use -> to indicate the order of the output array. So think of ‘ij, i->j’ as having left hand side (LHS) and right hand side (RHS). Any repetition of labels on the LHS computes the product element wise and then sums over. By changing the label on the RHS (output) side, we can define the axis in which we want to proceed with respect to the input array, i.e. summation along axis 0, 1 and so on.

import numpy as np

>>> a
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])
>>> b
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> d = np.einsum('ij, jk->ki', a, b)

Notice there are three axes, i, j, k, and that j is repeated (on the left-hand-side). i,j represent rows and columns for a. j,k for b.

In order to calculate the product and align the j axis we need to add an axis to a. (b will be broadcast along(?) the first axis)

a[i, j, k]
   b[j, k]

>>> c = a[:,:,np.newaxis] * b
>>> c
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 0,  2,  4],
        [ 6,  8, 10],
        [12, 14, 16]],

       [[ 0,  3,  6],
        [ 9, 12, 15],
        [18, 21, 24]]])

j is absent from the right-hand-side so we sum over j which is the second axis of the 3x3x3 array

>>> c = c.sum(1)
>>> c
array([[ 9, 12, 15],
       [18, 24, 30],
       [27, 36, 45]])

Finally, the indices are (alphabetically) reversed on the right-hand-side so we transpose.

>>> c.T
array([[ 9, 18, 27],
       [12, 24, 36],
       [15, 30, 45]])

>>> np.einsum('ij, jk->ki', a, b)
array([[ 9, 18, 27],
       [12, 24, 36],
       [15, 30, 45]])
>>>

回答 4

在阅读einsum方程式时,我发现最简单的方法就是将它们简化为必要的形式。

让我们从以下(强加)语句开始:

C = np.einsum('bhwi,bhwj->bij', A, B)

首先通过标点符号进行操作,我们看到在箭头前有两个4个逗号分隔的斑点- bhwibhwj,在箭头后有一个3个字母斑点bij。因此,该方程从两个4级张量输入产生3级张量结果。

现在,让每个斑点中的每个字母成为范围变量的名称。字母在Blob中出现的位置是该张量范围内的轴的索引。因此,产生C的每个元素的命令式求和必须从三个嵌套的for循环开始,每个C的索引一个。

for b in range(...):
    for i in range(...):
        for j in range(...):
            # the variables b, i and j index C in the order of their appearance in the equation
            C[b, i, j] = ...

因此,从本质for上讲,每个C的输出索引都有一个循环。我们现在暂时不确定范围。

接下来,我们看一下左侧-是否有没有出现在右侧的范围变量?在我们的情况下-是,h并且wfor为每个此类变量添加一个内部嵌套循环:

for b in range(...):
    for i in range(...):
        for j in range(...):
            C[b, i, j] = 0
            for h in range(...):
                for w in range(...):
                    ...

现在,在最内层的循环中,我们定义了所有索引,因此我们可以编写实际的求和并完成转换:

# three nested for-loops that index the elements of C
for b in range(...):
    for i in range(...):
        for j in range(...):

            # prepare to sum
            C[b, i, j] = 0

            # two nested for-loops for the two indexes that don't appear on the right-hand side
            for h in range(...):
                for w in range(...):
                    # Sum! Compare the statement below with the original einsum formula
                    # 'bhwi,bhwj->bij'

                    C[b, i, j] += A[b, h, w, i] * B[b, h, w, j]

如果到目前为止您已经能够遵循该代码,那么恭喜您!这就是您需要阅读einsum方程式所需的全部。请特别注意原始einsum公式如何映射到以上代码段中的最终sumsum语句。for循环和范围边界只是模糊不清的,最终声明是您真正需要了解的所有内容。

为了完整起见,让我们看看如何确定每个范围变量的范围。嗯,每个变量的范围只是它索引的维度的长度。显然,如果变量在一个或多个张量中索引一个以上的维度,则每个维度的长度必须相等。这是上面带有完整范围的代码:

# C's shape is determined by the shapes of the inputs
# b indexes both A and B, so its range can come from either A.shape or B.shape
# i indexes only A, so its range can only come from A.shape, the same is true for j and B
assert A.shape[0] == B.shape[0]
assert A.shape[1] == B.shape[1]
assert A.shape[2] == B.shape[2]
C = np.zeros((A.shape[0], A.shape[3], B.shape[3]))
for b in range(A.shape[0]): # b indexes both A and B, or B.shape[0], which must be the same
    for i in range(A.shape[3]):
        for j in range(B.shape[3]):
            # h and w can come from either A or B
            for h in range(A.shape[1]):
                for w in range(A.shape[2]):
                    C[b, i, j] += A[b, h, w, i] * B[b, h, w, j]

When reading einsum equations, I’ve found it the most helpful to just be able to mentally boil them down to their imperative versions.

Let’s start with the following (imposing) statement:

C = np.einsum('bhwi,bhwj->bij', A, B)

Working through the punctuation first we see that we have two 4-letter comma-separated blobs – bhwi and bhwj, before the arrow, and a single 3-letter blob bij after it. Therefore, the equation produces a rank-3 tensor result from two rank-4 tensor inputs.

Now, let each letter in each blob be the name of a range variable. The position at which the letter appears in the blob is the index of the axis that it ranges over in that tensor. The imperative summation that produces each element of C, therefore, has to start with three nested for loops, one for each index of C.

for b in range(...):
    for i in range(...):
        for j in range(...):
            # the variables b, i and j index C in the order of their appearance in the equation
            C[b, i, j] = ...

So, essentially, you have a for loop for every output index of C. We’ll leave the ranges undetermined for now.

Next we look at the left-hand side – are there any range variables there that don’t appear on the right-hand side? In our case – yes, h and w. Add an inner nested for loop for every such variable:

for b in range(...):
    for i in range(...):
        for j in range(...):
            C[b, i, j] = 0
            for h in range(...):
                for w in range(...):
                    ...

Inside the innermost loop we now have all indices defined, so we can write the actual summation and the translation is complete:

# three nested for-loops that index the elements of C
for b in range(...):
    for i in range(...):
        for j in range(...):

            # prepare to sum
            C[b, i, j] = 0

            # two nested for-loops for the two indexes that don't appear on the right-hand side
            for h in range(...):
                for w in range(...):
                    # Sum! Compare the statement below with the original einsum formula
                    # 'bhwi,bhwj->bij'

                    C[b, i, j] += A[b, h, w, i] * B[b, h, w, j]

If you’ve been able to follow the code thus far, then congratulations! This is all you need to be able to read einsum equations. Notice in particular how the original einsum formula maps to the final summation statement in the snippet above. The for-loops and range bounds are just fluff and that final statement is all you really need to understand what’s going on.

For the sake of completeness, let’s see how to determine the ranges for each range variable. Well, the range of each variable is simply the length of the dimension(s) which it indexes. Obviously, if a variable indexes more than one dimension in one or more tensors, then the lengths of each of those dimensions have to be equal. Here’s the code above with the complete ranges:

# C's shape is determined by the shapes of the inputs
# b indexes both A and B, so its range can come from either A.shape or B.shape
# i indexes only A, so its range can only come from A.shape, the same is true for j and B
assert A.shape[0] == B.shape[0]
assert A.shape[1] == B.shape[1]
assert A.shape[2] == B.shape[2]
C = np.zeros((A.shape[0], A.shape[3], B.shape[3]))
for b in range(A.shape[0]): # b indexes both A and B, or B.shape[0], which must be the same
    for i in range(A.shape[3]):
        for j in range(B.shape[3]):
            # h and w can come from either A or B
            for h in range(A.shape[1]):
                for w in range(A.shape[2]):
                    C[b, i, j] += A[b, h, w, i] * B[b, h, w, j]

替换Python NumPy数组中所有大于某个值的元素

问题:替换Python NumPy数组中所有大于某个值的元素

我有一个2D NumPy数组,并希望将大于或等于阈值T的所有值替换为255.0。据我所知,最基本的方法是:

shape = arr.shape
result = np.zeros(shape)
for x in range(0, shape[0]):
    for y in range(0, shape[1]):
        if arr[x, y] >= T:
            result[x, y] = 255
  1. 什么是最简洁,最pythonic的方法?

  2. 有更快的方法(可能不太简洁和/或更少的pythonic)来做到这一点吗?

这将是用于人头MRI扫描的窗口/水平调整子程序的一部分。2D numpy数组是图像像素数据。

I have a 2D NumPy array and would like to replace all values in it greater than or equal to a threshold T with 255.0. To my knowledge, the most fundamental way would be:

shape = arr.shape
result = np.zeros(shape)
for x in range(0, shape[0]):
    for y in range(0, shape[1]):
        if arr[x, y] >= T:
            result[x, y] = 255
  1. What is the most concise and pythonic way to do this?

  2. Is there a faster (possibly less concise and/or less pythonic) way to do this?

This will be part of a window/level adjustment subroutine for MRI scans of the human head. The 2D numpy array is the image pixel data.


回答 0

我认为最快和最简洁的方法是使用NumPy内置的Fancy indexing。如果您具有ndarraynamed arr,则可以将所有元素替换>255为一个值x,如下所示:

arr[arr > 255] = x

我使用500 x 500随机矩阵在计算机上运行此命令,将所有> 0.5的值替换为5,平均花费了7.59ms。

In [1]: import numpy as np
In [2]: A = np.random.rand(500, 500)
In [3]: timeit A[A > 0.5] = 5
100 loops, best of 3: 7.59 ms per loop

I think both the fastest and most concise way to do this is to use NumPy’s built-in Fancy indexing. If you have an ndarray named arr, you can replace all elements >255 with a value x as follows:

arr[arr > 255] = x

I ran this on my machine with a 500 x 500 random matrix, replacing all values >0.5 with 5, and it took an average of 7.59ms.

In [1]: import numpy as np
In [2]: A = np.random.rand(500, 500)
In [3]: timeit A[A > 0.5] = 5
100 loops, best of 3: 7.59 ms per loop

回答 1

由于您实际上想要的是arrwhere 的其他数组arr < 255255否则可以简单地完成此操作:

result = np.minimum(arr, 255)

更一般而言,对于下限和/或上限:

result = np.clip(arr, 0, 255)

如果您只想访问超过255的值,或者更复杂的值,则@ mtitan8的回答更为笼统,但对于您的情况,np.clipand和np.minimum(或np.maximum)更好,更快:

In [292]: timeit np.minimum(a, 255)
100000 loops, best of 3: 19.6 µs per loop

In [293]: %%timeit
   .....: c = np.copy(a)
   .....: c[a>255] = 255
   .....: 
10000 loops, best of 3: 86.6 µs per loop

如果您想就地进行操作(即修改arr而不是创建result),则可以使用out参数np.minimum

np.minimum(arr, 255, out=arr)

要么

np.clip(arr, 0, 255, arr)

out=名称是可选的,因为参数与函数定义的顺序相同。)

对于就地修改,布尔索引可以提高很多速度(无需分别制作然后修改副本),但是仍然不如minimum

In [328]: %%timeit
   .....: a = np.random.randint(0, 300, (100,100))
   .....: np.minimum(a, 255, a)
   .....: 
100000 loops, best of 3: 303 µs per loop

In [329]: %%timeit
   .....: a = np.random.randint(0, 300, (100,100))
   .....: a[a>255] = 255
   .....: 
100000 loops, best of 3: 356 µs per loop

为了进行比较,如果您想限制最小值和最大值,而无需clip两次,例如

np.minimum(a, 255, a)
np.maximum(a, 0, a)

要么,

a[a>255] = 255
a[a<0] = 0

Since you actually want a different array which is arr where arr < 255, and 255 otherwise, this can be done simply:

result = np.minimum(arr, 255)

More generally, for a lower and/or upper bound:

result = np.clip(arr, 0, 255)

If you just want to access the values over 255, or something more complicated, @mtitan8’s answer is more general, but np.clip and np.minimum (or np.maximum) are nicer and much faster for your case:

In [292]: timeit np.minimum(a, 255)
100000 loops, best of 3: 19.6 µs per loop

In [293]: %%timeit
   .....: c = np.copy(a)
   .....: c[a>255] = 255
   .....: 
10000 loops, best of 3: 86.6 µs per loop

If you want to do it in-place (i.e., modify arr instead of creating result) you can use the out parameter of np.minimum:

np.minimum(arr, 255, out=arr)

or

np.clip(arr, 0, 255, arr)

(the out= name is optional since the arguments in the same order as the function’s definition.)

For in-place modification, the boolean indexing speeds up a lot (without having to make and then modify the copy separately), but is still not as fast as minimum:

In [328]: %%timeit
   .....: a = np.random.randint(0, 300, (100,100))
   .....: np.minimum(a, 255, a)
   .....: 
100000 loops, best of 3: 303 µs per loop

In [329]: %%timeit
   .....: a = np.random.randint(0, 300, (100,100))
   .....: a[a>255] = 255
   .....: 
100000 loops, best of 3: 356 µs per loop

For comparison, if you wanted to restrict your values with a minimum as well as a maximum, without clip you would have to do this twice, with something like

np.minimum(a, 255, a)
np.maximum(a, 0, a)

or,

a[a>255] = 255
a[a<0] = 0

回答 2

我认为您可以使用以下where功能最快地实现此目的:

例如,在numpy数组中查找大于0.2的项并将其替换为0:

import numpy as np

nums = np.random.rand(4,3)

print np.where(nums > 0.2, 0, nums)

I think you can achieve this the quickest by using the where function:

For example looking for items greater than 0.2 in a numpy array and replacing those with 0:

import numpy as np

nums = np.random.rand(4,3)

print np.where(nums > 0.2, 0, nums)

回答 3

您可以考虑使用numpy.putmask

np.putmask(arr, arr>=T, 255.0)

这是与Numpy内置索引的性能比较:

In [1]: import numpy as np
In [2]: A = np.random.rand(500, 500)

In [3]: timeit np.putmask(A, A>0.5, 5)
1000 loops, best of 3: 1.34 ms per loop

In [4]: timeit A[A > 0.5] = 5
1000 loops, best of 3: 1.82 ms per loop

You can consider using numpy.putmask:

np.putmask(arr, arr>=T, 255.0)

Here is a performance comparison with the Numpy’s builtin indexing:

In [1]: import numpy as np
In [2]: A = np.random.rand(500, 500)

In [3]: timeit np.putmask(A, A>0.5, 5)
1000 loops, best of 3: 1.34 ms per loop

In [4]: timeit A[A > 0.5] = 5
1000 loops, best of 3: 1.82 ms per loop

回答 4

另一种方法是使用np.place它进行就地替换并与多维数组一起使用:

import numpy as np

# create 2x3 array with numbers 0..5
arr = np.arange(6).reshape(2, 3)

# replace 0 with -10
np.place(arr, arr == 0, -10)

Another way is to use np.place which does in-place replacement and works with multidimentional arrays:

import numpy as np

# create 2x3 array with numbers 0..5
arr = np.arange(6).reshape(2, 3)

# replace 0 with -10
np.place(arr, arr == 0, -10)

回答 5

你也可以使用&|(和/或)有更多的灵活性:

介于5到10之间的值: A[(A>5)&(A<10)]

大于10或小于5的值: A[(A<5)|(A>10)]

You can also use &, | (and/or) for more flexibility:

values between 5 and 10: A[(A>5)&(A<10)]

values greater than 10 or smaller than 5: A[(A<5)|(A>10)]


ValueError:使用序列设置数组元素

问题:ValueError:使用序列设置数组元素

此Python代码:

import numpy as p

def firstfunction():
    UnFilteredDuringExSummaryOfMeansArray = []
    MeanOutputHeader=['TestID','ConditionName','FilterType','RRMean','HRMean',
                      'dZdtMaxVoltageMean','BZMean','ZXMean','LVETMean','Z0Mean',
                      'StrokeVolumeMean','CardiacOutputMean','VelocityIndexMean']
    dataMatrix = BeatByBeatMatrixOfMatrices[column]
    roughTrimmedMatrix = p.array(dataMatrix[1:,1:17])


    trimmedMatrix = p.array(roughTrimmedMatrix,dtype=p.float64)  #ERROR THROWN HERE


    myMeans = p.mean(trimmedMatrix,axis=0,dtype=p.float64)
    conditionMeansArray = [TestID,testCondition,'UnfilteredBefore',myMeans[3], myMeans[4], 
                           myMeans[6], myMeans[9], myMeans[10], myMeans[11], myMeans[12],
                           myMeans[13], myMeans[14], myMeans[15]]
    UnFilteredDuringExSummaryOfMeansArray.append(conditionMeansArray)
    secondfunction(UnFilteredDuringExSummaryOfMeansArray)
    return

def secondfunction(UnFilteredDuringExSummaryOfMeansArray):
    RRDuringArray = p.array(UnFilteredDuringExSummaryOfMeansArray,dtype=p.float64)[1:,3]
    return

firstfunction()

引发此错误消息:

File "mypath\mypythonscript.py", line 3484, in secondfunction
RRDuringArray = p.array(UnFilteredDuringExSummaryOfMeansArray,dtype=p.float64)[1:,3]
ValueError: setting an array element with a sequence.

谁能告诉我该怎么做才能解决上面破碎的代码中的问题,以便停止抛出错误消息?


编辑: 我做了一个打印命令来获取矩阵的内容,这就是它打印出来的内容:

UnFilteredDuringExSummaryOfMeansArray为:

[['TestID', 'ConditionName', 'FilterType', 'RRMean', 'HRMean', 'dZdtMaxVoltageMean', 'BZMean', 'ZXMean', 'LVETMean', 'Z0Mean', 'StrokeVolumeMean', 'CardiacOutputMean', 'VelocityIndexMean'],
[u'HF101710', 'PreEx10SecondsBEFORE', 'UnfilteredBefore', 0.90670000000000006, 66.257731979420001, 1.8305673000000002, 0.11750000000000001, 0.15120546389880002, 0.26870546389879996, 27.628261216480002, 86.944190346160013, 5.767261352345999, 0.066259118585869997],
[u'HF101710', '25W10SecondsBEFORE', 'UnfilteredBefore', 0.68478571428571422, 87.727887206978565, 2.2965444125714285, 0.099642857142857144, 0.14952476549885715, 0.24916762264164286, 27.010483303721429, 103.5237336525, 9.0682762747642869, 0.085022572648242867],
[u'HF101710', '50W10SecondsBEFORE', 'UnfilteredBefore', 0.54188235294117659, 110.74841107829413, 2.6719262705882354, 0.077705882352917643, 0.15051306356552943, 0.2282189459185294, 26.768787504858825, 111.22827075238826, 12.329456404418824, 0.099814258468417641],
[u'HF101710', '75W10SecondsBEFORE', 'UnfilteredBefore', 0.4561904761904762, 131.52996981880955, 3.1818159523809522, 0.074714285714290493, 0.13459344175047619, 0.20930772746485715, 26.391156337028569, 123.27387909873812, 16.214243779323812, 0.1205685359981619]]

对我来说,这看起来像是5行乘13列的矩阵,但是当通过脚本运行不同的数据时,行数是可变的。使用我要添加的相同数据。

编辑2:但是,脚本抛出错误。因此,我认为您的想法不能解释此处正在发生的问题。谢谢你 还有其他想法吗?


编辑3:

仅供参考,如果我替换此有问题的代码行:

    RRDuringArray = p.array(UnFilteredDuringExSummaryOfMeansArray,dtype=p.float64)[1:,3]

与此相反:

    RRDuringArray = p.array(UnFilteredDuringExSummaryOfMeansArray)[1:,3]

然后,脚本的该部分可以正常工作而不会引发错误,但是此代码行更进一步:

p.ylim(.5*RRDuringArray.min(),1.5*RRDuringArray.max())

引发此错误:

File "mypath\mypythonscript.py", line 3631, in CreateSummaryGraphics
  p.ylim(.5*RRDuringArray.min(),1.5*RRDuringArray.max())
TypeError: cannot perform reduce with flexible type

因此,您可以看到我需要指定数据类型以便能够在matplotlib中使用ylim,但是指定数据类型会引发引发此帖子的错误消息。

This Python code:

import numpy as p

def firstfunction():
    UnFilteredDuringExSummaryOfMeansArray = []
    MeanOutputHeader=['TestID','ConditionName','FilterType','RRMean','HRMean',
                      'dZdtMaxVoltageMean','BZMean','ZXMean','LVETMean','Z0Mean',
                      'StrokeVolumeMean','CardiacOutputMean','VelocityIndexMean']
    dataMatrix = BeatByBeatMatrixOfMatrices[column]
    roughTrimmedMatrix = p.array(dataMatrix[1:,1:17])


    trimmedMatrix = p.array(roughTrimmedMatrix,dtype=p.float64)  #ERROR THROWN HERE


    myMeans = p.mean(trimmedMatrix,axis=0,dtype=p.float64)
    conditionMeansArray = [TestID,testCondition,'UnfilteredBefore',myMeans[3], myMeans[4], 
                           myMeans[6], myMeans[9], myMeans[10], myMeans[11], myMeans[12],
                           myMeans[13], myMeans[14], myMeans[15]]
    UnFilteredDuringExSummaryOfMeansArray.append(conditionMeansArray)
    secondfunction(UnFilteredDuringExSummaryOfMeansArray)
    return

def secondfunction(UnFilteredDuringExSummaryOfMeansArray):
    RRDuringArray = p.array(UnFilteredDuringExSummaryOfMeansArray,dtype=p.float64)[1:,3]
    return

firstfunction()

Throws this error message:

File "mypath\mypythonscript.py", line 3484, in secondfunction
RRDuringArray = p.array(UnFilteredDuringExSummaryOfMeansArray,dtype=p.float64)[1:,3]
ValueError: setting an array element with a sequence.

Can anyone show me what to do to fix the problem in the broken code above so that it stops throwing an error message?


EDIT: I did a print command to get the contents of the matrix, and this is what it printed out:

UnFilteredDuringExSummaryOfMeansArray is:

[['TestID', 'ConditionName', 'FilterType', 'RRMean', 'HRMean', 'dZdtMaxVoltageMean', 'BZMean', 'ZXMean', 'LVETMean', 'Z0Mean', 'StrokeVolumeMean', 'CardiacOutputMean', 'VelocityIndexMean'],
[u'HF101710', 'PreEx10SecondsBEFORE', 'UnfilteredBefore', 0.90670000000000006, 66.257731979420001, 1.8305673000000002, 0.11750000000000001, 0.15120546389880002, 0.26870546389879996, 27.628261216480002, 86.944190346160013, 5.767261352345999, 0.066259118585869997],
[u'HF101710', '25W10SecondsBEFORE', 'UnfilteredBefore', 0.68478571428571422, 87.727887206978565, 2.2965444125714285, 0.099642857142857144, 0.14952476549885715, 0.24916762264164286, 27.010483303721429, 103.5237336525, 9.0682762747642869, 0.085022572648242867],
[u'HF101710', '50W10SecondsBEFORE', 'UnfilteredBefore', 0.54188235294117659, 110.74841107829413, 2.6719262705882354, 0.077705882352917643, 0.15051306356552943, 0.2282189459185294, 26.768787504858825, 111.22827075238826, 12.329456404418824, 0.099814258468417641],
[u'HF101710', '75W10SecondsBEFORE', 'UnfilteredBefore', 0.4561904761904762, 131.52996981880955, 3.1818159523809522, 0.074714285714290493, 0.13459344175047619, 0.20930772746485715, 26.391156337028569, 123.27387909873812, 16.214243779323812, 0.1205685359981619]]

Looks like a 5 row by 13 column matrix to me, though the number of rows is variable when different data are run through the script. With this same data that I am adding in this.

EDIT 2: However, the script is throwing an error. So I do not think that your idea explains the problem that is happening here. Thank you, though. Any other ideas?


EDIT 3:

FYI, if I replace this problem line of code:

    RRDuringArray = p.array(UnFilteredDuringExSummaryOfMeansArray,dtype=p.float64)[1:,3]

with this instead:

    RRDuringArray = p.array(UnFilteredDuringExSummaryOfMeansArray)[1:,3]

Then that section of the script works fine without throwing an error, but then this line of code further down the line:

p.ylim(.5*RRDuringArray.min(),1.5*RRDuringArray.max())

Throws this error:

File "mypath\mypythonscript.py", line 3631, in CreateSummaryGraphics
  p.ylim(.5*RRDuringArray.min(),1.5*RRDuringArray.max())
TypeError: cannot perform reduce with flexible type

So you can see that I need to specify the data type in order to be able to use ylim in matplotlib, but yet specifying the data type is throwing the error message that initiated this post.


回答 0

从您展示给我们的代码中,我们唯一可以看出的是您正在尝试从形状不像多维数组的列表中创建数组。例如

numpy.array([[1,2], [2, 3, 4]])

要么

numpy.array([[1,2], [2, [3, 4]]])

将产生此错误消息,因为输入列表的形状不是可以转换为多维数组的(通用)“框”。因此可能UnFilteredDuringExSummaryOfMeansArray包含不同长度的序列。

编辑:此错误消息的另一个可能原因是尝试将字符串用作类型数组中的元素float

numpy.array([1.2, "abc"], dtype=float)

那就是您根据编辑尝试的内容。如果您确实想拥有一个同时包含字符串和浮点数的NumPy数组,则可以使用dtype object,它使该数组可以容纳任意Python对象:

numpy.array([1.2, "abc"], dtype=object)

不知道您的代码将完成什么,我无法判断这是否是您想要的。

From the code you showed us, the only thing we can tell is that you are trying to create an array from a list that isn’t shaped like a multi-dimensional array. For example

numpy.array([[1,2], [2, 3, 4]])

or

numpy.array([[1,2], [2, [3, 4]]])

will yield this error message, because the shape of the input list isn’t a (generalised) “box” that can be turned into a multidimensional array. So probably UnFilteredDuringExSummaryOfMeansArray contains sequences of different lengths.

Edit: Another possible cause for this error message is trying to use a string as an element in an array of type float:

numpy.array([1.2, "abc"], dtype=float)

That is what you are trying according to your edit. If you really want to have a NumPy array containing both strings and floats, you could use the dtype object, which enables the array to hold arbitrary Python objects:

numpy.array([1.2, "abc"], dtype=object)

Without knowing what your code shall accomplish, I can’t judge if this is what you want.


回答 1

Python ValueError:

ValueError: setting an array element with a sequence.

就是说的意思,您正在尝试将一系列数字填充到单个数字槽中。它可以在各种情况下抛出。

1.当您将python元组或列表传递为numpy数组元素时:

import numpy

numpy.array([1,2,3])               #good

numpy.array([1, (2,3)])            #Fail, can't convert a tuple into a numpy 
                                   #array element


numpy.mean([5,(6+7)])              #good

numpy.mean([5,tuple(range(2))])    #Fail, can't convert a tuple into a numpy 
                                   #array element


def foo():
    return 3
numpy.array([2, foo()])            #good


def foo():
    return [3,4]
numpy.array([2, foo()])            #Fail, can't convert a list into a numpy 
                                   #array element

2.通过尝试将长度大于1的numpy数组塞入numpy数组元素:

x = np.array([1,2,3])
x[0] = np.array([4])         #good



x = np.array([1,2,3])
x[0] = np.array([4,5])       #Fail, can't convert the numpy array to fit 
                             #into a numpy array element

正在创建一个numpy数组,并且numpy不知道如何将多值元组或数组填充到单个元素插槽中。它期望您给它提供的任何结果都可以求出单个数字,如果没有,Numpy会回答说它不知道如何设置带有序列的数组元素。

The Python ValueError:

ValueError: setting an array element with a sequence.

Means exactly what it says, you’re trying to cram a sequence of numbers into a single number slot. It can be thrown under various circumstances.

1. When you pass a python tuple or list to be interpreted as a numpy array element:

import numpy

numpy.array([1,2,3])               #good

numpy.array([1, (2,3)])            #Fail, can't convert a tuple into a numpy 
                                   #array element


numpy.mean([5,(6+7)])              #good

numpy.mean([5,tuple(range(2))])    #Fail, can't convert a tuple into a numpy 
                                   #array element


def foo():
    return 3
numpy.array([2, foo()])            #good


def foo():
    return [3,4]
numpy.array([2, foo()])            #Fail, can't convert a list into a numpy 
                                   #array element

2. By trying to cram a numpy array length > 1 into a numpy array element:

x = np.array([1,2,3])
x[0] = np.array([4])         #good



x = np.array([1,2,3])
x[0] = np.array([4,5])       #Fail, can't convert the numpy array to fit 
                             #into a numpy array element

A numpy array is being created, and numpy doesn’t know how to cram multivalued tuples or arrays into single element slots. It expects whatever you give it to evaluate to a single number, if it doesn’t, Numpy responds that it doesn’t know how to set an array element with a sequence.


回答 2

就我而言,我在Tensorflow中遇到此错误,原因是我试图输入具有不同长度或序列的数组:

例如:

import tensorflow as tf

input_x = tf.placeholder(tf.int32,[None,None])



word_embedding = tf.get_variable('embeddin',shape=[len(vocab_),110],dtype=tf.float32,initializer=tf.random_uniform_initializer(-0.01,0.01))

embedding_look=tf.nn.embedding_lookup(word_embedding,input_x)

with tf.Session() as tt:
    tt.run(tf.global_variables_initializer())

    a,b=tt.run([word_embedding,embedding_look],feed_dict={input_x:example_array})
    print(b)

如果我的数组是:

example_array = [[1,2,3],[1,2]]

然后我会得到错误:

ValueError: setting an array element with a sequence.

但是如果我做填充,那么:

example_array = [[1,2,3],[1,2,0]]

现在可以了。

In my case , I got this Error in Tensorflow , Reason was i was trying to feed a array with different length or sequences :

example :

import tensorflow as tf

input_x = tf.placeholder(tf.int32,[None,None])



word_embedding = tf.get_variable('embeddin',shape=[len(vocab_),110],dtype=tf.float32,initializer=tf.random_uniform_initializer(-0.01,0.01))

embedding_look=tf.nn.embedding_lookup(word_embedding,input_x)

with tf.Session() as tt:
    tt.run(tf.global_variables_initializer())

    a,b=tt.run([word_embedding,embedding_look],feed_dict={input_x:example_array})
    print(b)

And if my array is :

example_array = [[1,2,3],[1,2]]

Then i will get error :

ValueError: setting an array element with a sequence.

but if i do padding then :

example_array = [[1,2,3],[1,2,0]]

Now it’s working.


回答 3

对于那些在Numpy中遇到类似问题的人,一个非常简单的解决方案是:

定义dtype=object限定的阵列用于向它分配值时。例如:

out = np.empty_like(lil_img, dtype=object)

for those who are having trouble with similar problems in Numpy, a very simple solution would be:

defining dtype=object when defining an array for assigning values to it. for instance:

out = np.empty_like(lil_img, dtype=object)

回答 4

就我而言,问题是另一个。我正在尝试将int列表转换为array。问题在于,一个列表的长度与其他列表不同。如果要证明这一点,则必须执行以下操作:

print([i for i,x in enumerate(list) if len(x) != 560])

在我的情况下,长度参考为560。

In my case, the problem was another. I was trying convert lists of lists of int to array. The problem was that there was one list with a different length than others. If you want to prove it, you must do:

print([i for i,x in enumerate(list) if len(x) != 560])

In my case, the length reference was 560.


回答 5

就我而言,问题在于数据帧X []的散点图:

ax.scatter(X[:,0],X[:,1],c=colors,    
       cmap=CMAP, edgecolor='k', s=40)  #c=y[:,0],

#ValueError: setting an array element with a sequence.
#Fix with .toarray():
colors = 'br'
y = label_binarize(y, classes=['Irrelevant','Relevant'])
ax.scatter(X[:,0].toarray(),X[:,1].toarray(),c=colors,   
       cmap=CMAP, edgecolor='k', s=40)

In my case, the problem was with a scatterplot of a dataframe X[]:

ax.scatter(X[:,0],X[:,1],c=colors,    
       cmap=CMAP, edgecolor='k', s=40)  #c=y[:,0],

#ValueError: setting an array element with a sequence.
#Fix with .toarray():
colors = 'br'
y = label_binarize(y, classes=['Irrelevant','Relevant'])
ax.scatter(X[:,0].toarray(),X[:,1].toarray(),c=colors,   
       cmap=CMAP, edgecolor='k', s=40)

回答 6

当形状不规则或元素具有不同的数据类型时,dtype传递给np.array 的参数只能为object

import numpy as np

# arr1 = np.array([[10, 20.], [30], [40]], dtype=np.float32)  # error
arr2 = np.array([[10, 20.], [30], [40]])  # OK, and the dtype is object
arr3 = np.array([[10, 20.], 'hello'])     # OK, and the dtype is also object

When the shape is not regular or the elements have different data types, the dtype argument passed to np.array only can be object.

import numpy as np

# arr1 = np.array([[10, 20.], [30], [40]], dtype=np.float32)  # error
arr2 = np.array([[10, 20.], [30], [40]])  # OK, and the dtype is object
arr3 = np.array([[10, 20.], 'hello'])     # OK, and the dtype is also object