标签归档:arrays

如何在Python中声明数组?

问题:如何在Python中声明数组?

如何在Python中声明数组?

我在文档中找不到对数组的任何引用。

How do I declare an array in Python?

I can’t find any reference to arrays in the documentation.


回答 0

variable = []

现在variable引用一个空列表*

当然,这是分配,而不是声明。在Python中,没有办法说“此变量不应引用列表以外的任何东西”,因为Python是动态类型的。


*默认的内置Python类型称为list,而不是数组。它是一个任意长度的有序容器,可以容纳异构对象集合(它们的类型无关紧要,可以自由混合)。请勿将其与array模块混淆,后者提供的类型更接近C array类型。内容必须是同质的(所有类型都相同),但是长度仍然是动态的。

variable = []

Now variable refers to an empty list*.

Of course this is an assignment, not a declaration. There’s no way to say in Python “this variable should never refer to anything other than a list”, since Python is dynamically typed.


*The default built-in Python type is called a list, not an array. It is an ordered container of arbitrary length that can hold a heterogenous collection of objects (their types do not matter and can be freely mixed). This should not be confused with the array module, which offers a type closer to the C array type; the contents must be homogenous (all of the same type), but the length is still dynamic.


回答 1

这是Python中令人惊讶的复杂主题。

实用答案

数组由类表示list(请参见参考,不要将它们与generator混合使用)。

查看用法示例:

# empty array
arr = [] 

# init with values (can contain mixed types)
arr = [1, "eels"]

# get item by index (can be negative to access end of array)
arr = [1, 2, 3, 4, 5, 6]
arr[0]  # 1
arr[-1] # 6

# get length
length = len(arr)

# supports append and insert
arr.append(8)
arr.insert(6, 7)

理论答案

Python的list内幕是一个真实数组的包装,该数组包含对项目的引用。同样,基础数组会创建一些额外的空间。

其后果是:

  • 随机访问真的很便宜(arr[6653]与相同arr[0]
  • append 操作是“免费的”,同时有一些额外的空间
  • insert 操作费用昂贵

检查这张很棒的操作复杂性表

另外,请参见这张图片,在这里我试图显示数组,引用数组和链接列表之间的最重要区别:

This is surprisingly complex topic in Python.

Practical answer

Arrays are represented by class list (see reference and do not mix them with generators).

Check out usage examples:

# empty array
arr = [] 

# init with values (can contain mixed types)
arr = [1, "eels"]

# get item by index (can be negative to access end of array)
arr = [1, 2, 3, 4, 5, 6]
arr[0]  # 1
arr[-1] # 6

# get length
length = len(arr)

# supports append and insert
arr.append(8)
arr.insert(6, 7)

Theoretical answer

Under the hood Python’s list is a wrapper for a real array which contains references to items. Also, underlying array is created with some extra space.

Consequences of this are:

  • random access is really cheap (arr[6653] is same to arr[0])
  • append operation is ‘for free’ while some extra space
  • insert operation is expensive

Check this awesome table of operations complexity.

Also, please see this picture, where I’ve tried to show most important differences between array, array of references and linked list:


回答 2

您实际上并没有声明任何东西,但这是在Python中创建数组的方式:

from array import array
intarray = array('i')

有关更多信息,请参见数组模块:http : //docs.python.org/library/array.html

现在可能您不想要数组,而是列表,但是其他人已经回答了。:)

You don’t actually declare things, but this is how you create an array in Python:

from array import array
intarray = array('i')

For more info see the array module: http://docs.python.org/library/array.html

Now possible you don’t want an array, but a list, but others have answered that already. :)


回答 3

我认为您想要一个列表,其中前30个单元格已经填充。所以

   f = []

   for i in range(30):
       f.append(0)

斐波那契数列就是一个可以使用它的例子。请参阅欧拉计画中的问题2

I think you (meant)want an list with the first 30 cells already filled. So

   f = []

   for i in range(30):
       f.append(0)

An example to where this could be used is in Fibonacci sequence. See problem 2 in Project Euler


回答 4

这是这样的:

my_array = [1, 'rebecca', 'allard', 15]

This is how:

my_array = [1, 'rebecca', 'allard', 15]

回答 5

对于计算,请使用如下的numpy数组:

import numpy as np

a = np.ones((3,2))        # a 2D array with 3 rows, 2 columns, filled with ones
b = np.array([1,2,3])     # a 1D array initialised using a list [1,2,3]
c = np.linspace(2,3,100)  # an array with 100 points beteen (and including) 2 and 3

print(a*1.5)  # all elements of a times 1.5
print(a.T+b)  # b added to the transpose of a

这些numpy数组可以从磁盘保存和加载(甚至压缩),并且包含大量元素的复杂计算的速度类似于C。

在科学环境中大量使用。看到这里更多。

For calculations, use numpy arrays like this:

import numpy as np

a = np.ones((3,2))        # a 2D array with 3 rows, 2 columns, filled with ones
b = np.array([1,2,3])     # a 1D array initialised using a list [1,2,3]
c = np.linspace(2,3,100)  # an array with 100 points beteen (and including) 2 and 3

print(a*1.5)  # all elements of a times 1.5
print(a.T+b)  # b added to the transpose of a

these numpy arrays can be saved and loaded from disk (even compressed) and complex calculations with large amounts of elements are C-like fast.

Much used in scientific environments. See here for more.


回答 6

JohnMachin的评论应该是真正的答案。我认为所有其他答案只是解决方法!所以:

array=[0]*element_count

JohnMachin’s comment should be the real answer. All the other answers are just workarounds in my opinion! So:

array=[0]*element_count

回答 7

一些贡献表明python中的数组由列表表示。这是不正确的。Python array()在标准库模块arrayarray.array()”中具有独立的实现,因此将两者混淆是不正确的。列表是python中的列表,因此请谨慎使用所使用的术语。

list_01 = [4, 6.2, 7-2j, 'flo', 'cro']

list_01
Out[85]: [4, 6.2, (7-2j), 'flo', 'cro']

list和之间有一个非常重要的区别array.array()。虽然这两个对象都是有序序列,但array.array()是有序均质序列,而列表是非均质序列。

A couple of contributions suggested that arrays in python are represented by lists. This is incorrect. Python has an independent implementation of array() in the standard library module arrayarray.array()” hence it is incorrect to confuse the two. Lists are lists in python so be careful with the nomenclature used.

list_01 = [4, 6.2, 7-2j, 'flo', 'cro']

list_01
Out[85]: [4, 6.2, (7-2j), 'flo', 'cro']

There is one very important difference between list and array.array(). While both of these objects are ordered sequences, array.array() is an ordered homogeneous sequences whereas a list is a non-homogeneous sequence.


回答 8

您无需在Python中声明任何内容。您只需要使用它。我建议您从http://diveintopython.net之类的东西开始。

You don’t declare anything in Python. You just use it. I recommend you start out with something like http://diveintopython.net.


回答 9

我通常只是做a = [1,2,3]一个实际上是一个,listarrays看看这个正式定义

I would normally just do a = [1,2,3] which is actually a list but for arrays look at this formal definition


回答 10

为了增加Lennart的答案,可以这样创建一个数组:

from array import array
float_array = array("f",values)

其中可以采取一个元组,列表的形式,或np.array,但不是数组:

values = [1,2,3]
values = (1,2,3)
values = np.array([1,2,3],'f')
# 'i' will work here too, but if array is 'i' then values have to be int
wrong_values = array('f',[1,2,3])
# TypeError: 'array.array' object is not callable

并且输出仍然是相同的:

print(float_array)
print(float_array[1])
print(isinstance(float_array[1],float))

# array('f', [1.0, 2.0, 3.0])
# 2.0
# True

list的大多数方法也适用于数组,常见的方法是pop(),extend()和append()。

从答案和评论来看,似乎数组数据结构并不流行。我喜欢它,就像人们可能会更喜欢元组而不是列表一样。

数组结构比列表或np.array具有更严格的规则,这可以减少错误并简化调试,尤其是在处理数字数据时。

尝试将浮点数插入/附加到int数组将引发TypeError:

values = [1,2,3]
int_array = array("i",values)
int_array.append(float(1))
# or int_array.extend([float(1)])

# TypeError: integer argument expected, got float

因此,将要为整数(例如索引列表)的值保留为数组形式可能会防止“ TypeError:列表索引必须为整数,而不是浮点数”,因为可以迭代数组,类似于np.array和list:

int_array = array('i',[1,2,3])
data = [11,22,33,44,55]
sample = []
for i in int_array:
    sample.append(data[i])

烦人的是,将int附加到float数组将导致int成为float,而不会引发异常。

np.array的条目也保留相同的数据类型,但是它不会改变错误,而是会更改其数据类型以适合新条目(通常为double或str):

import numpy as np
numpy_int_array = np.array([1,2,3],'i')
for i in numpy_int_array:
    print(type(i))
    # <class 'numpy.int32'>
numpy_int_array_2 = np.append(numpy_int_array,int(1))
# still <class 'numpy.int32'>
numpy_float_array = np.append(numpy_int_array,float(1))
# <class 'numpy.float64'> for all values
numpy_str_array = np.append(numpy_int_array,"1")
# <class 'numpy.str_'> for all values
data = [11,22,33,44,55]
sample = []
for i in numpy_int_array_2:
    sample.append(data[i])
    # no problem here, but TypeError for the other two

在分配期间也是如此。如果指定了数据类型,则np.array将在可能的情况下将条目转换为该数据类型:

int_numpy_array = np.array([1,2,float(3)],'i')
# 3 becomes an int
int_numpy_array_2 = np.array([1,2,3.9],'i')
# 3.9 gets truncated to 3 (same as int(3.9))
invalid_array = np.array([1,2,"string"],'i')
# ValueError: invalid literal for int() with base 10: 'string'
# Same error as int('string')
str_numpy_array = np.array([1,2,3],'str')
print(str_numpy_array)
print([type(i) for i in str_numpy_array])
# ['1' '2' '3']
# <class 'numpy.str_'>

或者,本质上:

data = [1.2,3.4,5.6]
list_1 = np.array(data,'i').tolist()
list_2 = [int(i) for i in data]
print(list_1 == list_2)
# True

而数组只会给出:

invalid_array = array([1,2,3.9],'i')
# TypeError: integer argument expected, got float

因此,对特定于类型的命令使用np.array不是一个好主意。数组结构在这里很有用。list保留值的数据类型。

对于某些问题,我感到有些讨厌:数据类型在array()中指定为第一个参数,但是(通常)在np.array()中指定为第二个参数。:|

与C的关系在这里引用: Python列表与数组-何时使用?

祝您探索愉快!

注意:数组的类型化和相当严格的性质更倾向于C而不是Python,并且通过设计,Python在其函数中没有许多特定于类型的约束。它的不受欢迎也给协作工作带来了积极的反馈,而替换它通常需要附加的[文件中x的int(x)]。因此,忽略数组的存在是完全可行和合理的。它不应该以任何方式阻碍我们大多数人。:D

To add to Lennart’s answer, an array may be created like this:

from array import array
float_array = array("f",values)

where values can take the form of a tuple, list, or np.array, but not array:

values = [1,2,3]
values = (1,2,3)
values = np.array([1,2,3],'f')
# 'i' will work here too, but if array is 'i' then values have to be int
wrong_values = array('f',[1,2,3])
# TypeError: 'array.array' object is not callable

and the output will still be the same:

print(float_array)
print(float_array[1])
print(isinstance(float_array[1],float))

# array('f', [1.0, 2.0, 3.0])
# 2.0
# True

Most methods for list work with array as well, common ones being pop(), extend(), and append().

Judging from the answers and comments, it appears that the array data structure isn’t that popular. I like it though, the same way as one might prefer a tuple over a list.

The array structure has stricter rules than a list or np.array, and this can reduce errors and make debugging easier, especially when working with numerical data.

Attempts to insert/append a float to an int array will throw a TypeError:

values = [1,2,3]
int_array = array("i",values)
int_array.append(float(1))
# or int_array.extend([float(1)])

# TypeError: integer argument expected, got float

Keeping values which are meant to be integers (e.g. list of indices) in the array form may therefore prevent a “TypeError: list indices must be integers, not float”, since arrays can be iterated over, similar to np.array and lists:

int_array = array('i',[1,2,3])
data = [11,22,33,44,55]
sample = []
for i in int_array:
    sample.append(data[i])

Annoyingly, appending an int to a float array will cause the int to become a float, without throwing an exception.

np.array retain the same data type for its entries too, but instead of giving an error it will change its data type to fit new entries (usually to double or str):

import numpy as np
numpy_int_array = np.array([1,2,3],'i')
for i in numpy_int_array:
    print(type(i))
    # <class 'numpy.int32'>
numpy_int_array_2 = np.append(numpy_int_array,int(1))
# still <class 'numpy.int32'>
numpy_float_array = np.append(numpy_int_array,float(1))
# <class 'numpy.float64'> for all values
numpy_str_array = np.append(numpy_int_array,"1")
# <class 'numpy.str_'> for all values
data = [11,22,33,44,55]
sample = []
for i in numpy_int_array_2:
    sample.append(data[i])
    # no problem here, but TypeError for the other two

This is true during assignment as well. If the data type is specified, np.array will, wherever possible, transform the entries to that data type:

int_numpy_array = np.array([1,2,float(3)],'i')
# 3 becomes an int
int_numpy_array_2 = np.array([1,2,3.9],'i')
# 3.9 gets truncated to 3 (same as int(3.9))
invalid_array = np.array([1,2,"string"],'i')
# ValueError: invalid literal for int() with base 10: 'string'
# Same error as int('string')
str_numpy_array = np.array([1,2,3],'str')
print(str_numpy_array)
print([type(i) for i in str_numpy_array])
# ['1' '2' '3']
# <class 'numpy.str_'>

or, in essence:

data = [1.2,3.4,5.6]
list_1 = np.array(data,'i').tolist()
list_2 = [int(i) for i in data]
print(list_1 == list_2)
# True

while array will simply give:

invalid_array = array([1,2,3.9],'i')
# TypeError: integer argument expected, got float

Because of this, it is not a good idea to use np.array for type-specific commands. The array structure is useful here. list preserves the data type of the values.

And for something I find rather pesky: the data type is specified as the first argument in array(), but (usually) the second in np.array(). :|

The relation to C is referred to here: Python List vs. Array – when to use?

Have fun exploring!

Note: The typed and rather strict nature of array leans more towards C rather than Python, and by design Python does not have many type-specific constraints in its functions. Its unpopularity also creates a positive feedback in collaborative work, and replacing it mostly involves an additional [int(x) for x in file]. It is therefore entirely viable and reasonable to ignore the existence of array. It shouldn’t hinder most of us in any way. :D


回答 11

这个怎么样…

>>> a = range(12)
>>> a
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>>> a[7]
6

How about this…

>>> a = range(12)
>>> a
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
>>> a[7]
6

回答 12

Python将它们称为list。您可以使用方括号和逗号编写列表文字:

>>> [6,28,496,8128]
[6, 28, 496, 8128]

Python calls them lists. You can write a list literal with square brackets and commas:

>>> [6,28,496,8128]
[6, 28, 496, 8128]

回答 13

在Lennart之后,还有numpy,它实现了同类的多维数组。

Following on from Lennart, there’s also numpy which implements homogeneous multi-dimensional arrays.


回答 14

我有一个字符串数组,并且需要一个相同长度的布尔值数组(初始化为True)。这就是我所做的

strs = ["Hi","Bye"] 
bools = [ True for s in strs ]

I had an array of strings and needed an array of the same length of booleans initiated to True. This is what I did

strs = ["Hi","Bye"] 
bools = [ True for s in strs ]

回答 15

您可以创建列表并将其转换为数组,也可以使用numpy模块创建数组。以下是一些说明此问题的示例。Numpy还使使用多维数组更容易。

import numpy as np
a = np.array([1, 2, 3, 4])

#For custom inputs
a = np.array([int(x) for x in input().split()])

您还可以使用reshape函数将此数组整形为2X2矩阵,该函数将输入作为矩阵的尺寸。

mat = a.reshape(2, 2)

You can create lists and convert them into arrays or you can create array using numpy module. Below are few examples to illustrate the same. Numpy also makes it easier to work with multi-dimensional arrays.

import numpy as np
a = np.array([1, 2, 3, 4])

#For custom inputs
a = np.array([int(x) for x in input().split()])

You can also reshape this array into a 2X2 matrix using reshape function which takes in input as the dimensions of the matrix.

mat = a.reshape(2, 2)

与常规Python列表相比,NumPy有什么优势?

问题:与常规Python列表相比,NumPy有什么优势?

与常规Python列表相比,NumPy有什么优势?

我大约有100个金融市场系列,我将创建一个100x100x100 = 1百万个单元的多维数据集数组。我将每个x与y和z回归(3变量),以用标准误差填充数组。

我听说对于“大型矩阵”,出于性能和可伸缩性的原因,我应该使用NumPy而不是Python列表。事实是,我知道Python列表,它们似乎对我有用。

如果我转到NumPy,会有什么好处?

如果我有1000个序列(即立方体中有10亿个浮点单元)怎么办?

What are the advantages of NumPy over regular Python lists?

I have approximately 100 financial markets series, and I am going to create a cube array of 100x100x100 = 1 million cells. I will be regressing (3-variable) each x with each y and z, to fill the array with standard errors.

I have heard that for “large matrices” I should use NumPy as opposed to Python lists, for performance and scalability reasons. Thing is, I know Python lists and they seem to work for me.

What will the benefits be if I move to NumPy?

What if I had 1000 series (that is, 1 billion floating point cells in the cube)?


回答 0

NumPy的数组比Python列表更紧凑-您在Python中描述的列表列表至少需要20 MB左右,而单元格中具有单精度浮点数的NumPy 3D数组则需要4 MB。使用NumPy可以更快地读取和写入项目。

也许您只关心一百万个单元就不会那么在意,但是您肯定会关心十亿个单元-两种方法都不适合32位体系结构,但是使用64位版本,NumPy可以节省约4 GB ,仅Python一项就至少需要12 GB(很多指针的大小加倍),这是一个昂贵得多的硬件!

差异主要是由于“间接性”造成的-Python列表是指向Python对象的指针的数组,每个指针至少4个字节,对于最小的Python对象也至少包含16个字节(类型指针为4,引用计数为4,类型为4值-内存分配器舍入为16)。NumPy数组是统一值的数组-单精度数字每个占用4个字节,双精度数字每个占用8个字节。灵活性较差,但您要为标准Python列表的灵活性付出高昂的代价!

NumPy’s arrays are more compact than Python lists — a list of lists as you describe, in Python, would take at least 20 MB or so, while a NumPy 3D array with single-precision floats in the cells would fit in 4 MB. Access in reading and writing items is also faster with NumPy.

Maybe you don’t care that much for just a million cells, but you definitely would for a billion cells — neither approach would fit in a 32-bit architecture, but with 64-bit builds NumPy would get away with 4 GB or so, Python alone would need at least about 12 GB (lots of pointers which double in size) — a much costlier piece of hardware!

The difference is mostly due to “indirectness” — a Python list is an array of pointers to Python objects, at least 4 bytes per pointer plus 16 bytes for even the smallest Python object (4 for type pointer, 4 for reference count, 4 for value — and the memory allocators rounds up to 16). A NumPy array is an array of uniform values — single-precision numbers takes 4 bytes each, double-precision ones, 8 bytes. Less flexible, but you pay substantially for the flexibility of standard Python lists!


回答 1

NumPy不仅效率更高;这也更加方便。您可以免费获得许多矢量和矩阵运算,有时这可以避免不必要的工作。而且它们也得到有效实施。

例如,您可以将多维数据集直接从文件读取到数组中:

x = numpy.fromfile(file=open("data"), dtype=float).reshape((100, 100, 100))

沿第二维求和:

s = x.sum(axis=1)

查找哪些单元格高于阈值:

(x > 0.5).nonzero()

删除沿第三维的每个偶数索引切片:

x[:, :, ::2]

同样,许多有用的库都可以与NumPy数组一起使用。例如,统计分析和可视化库。

即使您没有性能问题,学习NumPy也是值得的。

NumPy is not just more efficient; it is also more convenient. You get a lot of vector and matrix operations for free, which sometimes allow one to avoid unnecessary work. And they are also efficiently implemented.

For example, you could read your cube directly from a file into an array:

x = numpy.fromfile(file=open("data"), dtype=float).reshape((100, 100, 100))

Sum along the second dimension:

s = x.sum(axis=1)

Find which cells are above a threshold:

(x > 0.5).nonzero()

Remove every even-indexed slice along the third dimension:

x[:, :, ::2]

Also, many useful libraries work with NumPy arrays. For example, statistical analysis and visualization libraries.

Even if you don’t have performance problems, learning NumPy is worth the effort.


回答 2

Alex提到了内存效率,Roberto提到了便利性,这些都是不错的地方。对于其他一些想法,我将提到速度功能

功能性:NumPy,FFT,卷积,快速搜索,基本统计信息,线性代数,直方图等都内置了很多功能。实际上,没有FFT谁能活下去?

速度:这是一项对列表和NumPy数组求和的测试,表明NumPy数组的求和速度快10倍(在此测试中,里程可能会有所不同)。

from numpy import arange
from timeit import Timer

Nelements = 10000
Ntimeits = 10000

x = arange(Nelements)
y = range(Nelements)

t_numpy = Timer("x.sum()", "from __main__ import x")
t_list = Timer("sum(y)", "from __main__ import y")
print("numpy: %.3e" % (t_numpy.timeit(Ntimeits)/Ntimeits,))
print("list:  %.3e" % (t_list.timeit(Ntimeits)/Ntimeits,))

在我的系统上(运行备份时),它会给出:

numpy: 3.004e-05
list:  5.363e-04

Alex mentioned memory efficiency, and Roberto mentions convenience, and these are both good points. For a few more ideas, I’ll mention speed and functionality.

Functionality: You get a lot built in with NumPy, FFTs, convolutions, fast searching, basic statistics, linear algebra, histograms, etc. And really, who can live without FFTs?

Speed: Here’s a test on doing a sum over a list and a NumPy array, showing that the sum on the NumPy array is 10x faster (in this test — mileage may vary).

from numpy import arange
from timeit import Timer

Nelements = 10000
Ntimeits = 10000

x = arange(Nelements)
y = range(Nelements)

t_numpy = Timer("x.sum()", "from __main__ import x")
t_list = Timer("sum(y)", "from __main__ import y")
print("numpy: %.3e" % (t_numpy.timeit(Ntimeits)/Ntimeits,))
print("list:  %.3e" % (t_list.timeit(Ntimeits)/Ntimeits,))

which on my systems (while I’m running a backup) gives:

numpy: 3.004e-05
list:  5.363e-04

回答 3

这是scipy.org网站上的常见问题解答中的一个很好的答案:

与(嵌套)Python列表相比,NumPy数组有什么优势?

Python的列表是有效的通用容器。它们支持(相当)高效的插入,删除,附加和连接,并且Python的列表理解使它们易于构造和操作。但是,它们有一定的局限性:它们不支持“向量化”操作,例如逐元素加法和乘法,并且它们可以包含不同类型的对象这一事实意味着Python必须为每个元素存储类型信息,并且必须执行类型分派代码在每个元素上操作时。这也意味着有效的C循环几乎无法执行列表操作-每次迭代都需要类型检查和其他Python API簿记。

Here’s a nice answer from the FAQ on the scipy.org website:

What advantages do NumPy arrays offer over (nested) Python lists?

Python’s lists are efficient general-purpose containers. They support (fairly) efficient insertion, deletion, appending, and concatenation, and Python’s list comprehensions make them easy to construct and manipulate. However, they have certain limitations: they don’t support “vectorized” operations like elementwise addition and multiplication, and the fact that they can contain objects of differing types mean that Python must store type information for every element, and must execute type dispatching code when operating on each element. This also means that very few list operations can be carried out by efficient C loops – each iteration would require type checks and other Python API bookkeeping.


回答 4

所有人都强调了numpy数组和python列表之间的几乎所有主要区别,在这里我将向大家简单介绍一下:

  1. Numpy数组在创建时具有固定的大小,这与python列表(可以动态增长)不同。更改ndarray的大小将创建一个新数组并删除原始数组。

  2. Numpy数组中的所有元素都必须具有相同的数据类型(我们也可以具有异构类型,但这将不允许您进行数学运算),因此在内存中的大小将相同

  3. Numpy数组有助于对大量数据进行数学运算和其他类型的运算。通常,与使用python顺序构建相比,此类操作执行效率更高且代码更少

All have highlighted almost all major differences between numpy array and python list, I will just brief them out here:

  1. Numpy arrays have a fixed size at creation, unlike python lists (which can grow dynamically). Changing the size of ndarray will create a new array and delete the original.

  2. The elements in a Numpy array are all required to be of the same data type (we can have the heterogeneous type as well but that will not gonna permit you mathematical operations) and thus will be the same size in memory

  3. Numpy arrays are facilitated advances mathematical and other types of operations on large numbers of data. Typically such operations are executed more efficiently and with less code than is possible using pythons build in sequences


将pandas数据框转换为NumPy数组

问题:将pandas数据框转换为NumPy数组

我对知道如何将熊猫数据框转换为NumPy数组感兴趣。

数据框:

import numpy as np
import pandas as pd

index = [1, 2, 3, 4, 5, 6, 7]
a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]
c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]
df = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index)
df = df.rename_axis('ID')

label   A    B    C
ID                                 
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

我想将其转换为NumPy数组,如下所示:

array([[ nan,  0.2,  nan],
       [ nan,  nan,  0.5],
       [ nan,  0.2,  0.5],
       [ 0.1,  0.2,  nan],
       [ 0.1,  0.2,  0.5],
       [ 0.1,  nan,  0.5],
       [ 0.1,  nan,  nan]])

我怎样才能做到这一点?


另外,是否可以像这样保留dtype?

array([[ 1, nan,  0.2,  nan],
       [ 2, nan,  nan,  0.5],
       [ 3, nan,  0.2,  0.5],
       [ 4, 0.1,  0.2,  nan],
       [ 5, 0.1,  0.2,  0.5],
       [ 6, 0.1,  nan,  0.5],
       [ 7, 0.1,  nan,  nan]],
     dtype=[('ID', '<i4'), ('A', '<f8'), ('B', '<f8'), ('B', '<f8')])

或类似的?

I am interested in knowing how to convert a pandas dataframe into a NumPy array.

dataframe:

import numpy as np
import pandas as pd

index = [1, 2, 3, 4, 5, 6, 7]
a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]
c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]
df = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index)
df = df.rename_axis('ID')

gives

label   A    B    C
ID                                 
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

I would like to convert this to a NumPy array, as so:

array([[ nan,  0.2,  nan],
       [ nan,  nan,  0.5],
       [ nan,  0.2,  0.5],
       [ 0.1,  0.2,  nan],
       [ 0.1,  0.2,  0.5],
       [ 0.1,  nan,  0.5],
       [ 0.1,  nan,  nan]])

How can I do this?


As a bonus, is it possible to preserve the dtypes, like this?

array([[ 1, nan,  0.2,  nan],
       [ 2, nan,  nan,  0.5],
       [ 3, nan,  0.2,  0.5],
       [ 4, 0.1,  0.2,  nan],
       [ 5, 0.1,  0.2,  0.5],
       [ 6, 0.1,  nan,  0.5],
       [ 7, 0.1,  nan,  nan]],
     dtype=[('ID', '<i4'), ('A', '<f8'), ('B', '<f8'), ('B', '<f8')])

or similar?


回答 0

要将熊猫数据框(df)转换为numpy ndarray,请使用以下代码:

df.values

array([[nan, 0.2, nan],
       [nan, nan, 0.5],
       [nan, 0.2, 0.5],
       [0.1, 0.2, nan],
       [0.1, 0.2, 0.5],
       [0.1, nan, 0.5],
       [0.1, nan, nan]])

To convert a pandas dataframe (df) to a numpy ndarray, use this code:

df.values

array([[nan, 0.2, nan],
       [nan, nan, 0.5],
       [nan, 0.2, 0.5],
       [0.1, 0.2, nan],
       [0.1, 0.2, 0.5],
       [0.1, nan, 0.5],
       [0.1, nan, nan]])

回答 1

弃用的用法valuesas_matrix()

pandas v0.24.0引入了两种从pandas对象获取NumPy数组的新方法:

  1. to_numpy(),其定义上IndexSeries,DataFrame对象,并
  2. array,仅在IndexSeries对象上定义。

如果您访问的v0.24文档.values,则会看到一个红色的大警告:

警告:我们建议DataFrame.to_numpy()改为使用。

请参阅v0.24.0发行说明的本部分以及此答案以获取更多信息。


追求更好的一致性: to_numpy()

本着整个API更好的一致性的精神,to_numpy引入了一种新方法来从DataFrames中提取底层的NumPy数组。

# Setup.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])

df.to_numpy()
array([[1, 4],
       [2, 5],
       [3, 6]])

如上所述,该方法也在IndexSeries对象上定义(请参见此处)。

df.index.to_numpy()
# array(['a', 'b', 'c'], dtype=object)

df['A'].to_numpy()
#  array([1, 2, 3])

默认情况下,将返回视图,因此所做的任何修改都会影响原始视图。

v = df.to_numpy()
v[0, 0] = -1

df
   A  B
a -1  4
b  2  5
c  3  6

如果您需要副本,请使用to_numpy(copy=True

Pandas> = 1.0更新为ExtensionTypes

如果您使用的是熊猫1.x,那么您可能会更多地处理扩展类型。您必须多加注意,这些扩展名类型已正确转换。

a = pd.array([1, 2, None], dtype="Int64")                                  
a                                                                          

<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64 

# Wrong
a.to_numpy()                                                               
# array([1, 2, <NA>], dtype=object)  # yuck, objects

# Right
a.to_numpy(dtype='float', na_value=np.nan)                                 
# array([ 1.,  2., nan])

在文档中对此进行了标注

如果您需要dtypes

如另一个答案所示,DataFrame.to_records是执行此操作的好方法。

df.to_records()
# rec.array([('a', -1, 4), ('b',  2, 5), ('c',  3, 6)],
#           dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8')])

to_numpy不幸的是,这不能做到。但是,您也可以使用np.rec.fromrecords

v = df.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
# rec.array([('a', -1, 4), ('b',  2, 5), ('c',  3, 6)],
#          dtype=[('index', '<U1'), ('A', '<i8'), ('B', '<i8')])

在性能方面,它几乎是相同的(实际上,使用rec.fromrecords速度要快一些)。

df2 = pd.concat([df] * 10000)

%timeit df2.to_records()
%%timeit
v = df2.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())

11.1 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.67 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

添加新方法的理由

to_numpy()(除了array)作为两个GitHub问题GH19954GH23623下讨论的结果而添加

具体来说,文档提到了基本原理:

[…] .values尚不清楚返回的值是实际数组,它的某种转换还是熊猫自定义数组之一(如Categorical)。例如,使用PeriodIndex,每次都会.values 生成一个新ndarray的周期对象。[…]

to_numpy旨在提高API的一致性,这是朝正确方向迈出的重要一步。.values不会在当前版本中被弃用,但我希望这种情况可能会在将来的某个时刻发生,因此,我敦促用户尽快向较新的API迁移。


批判其他解决方案

DataFrame.values 如前所述,具有不一致的行为。

DataFrame.get_values()只是一个包装器DataFrame.values,因此上述所有内容均适用。

DataFrame.as_matrix()现在已弃用,请勿使用!

Deprecate your usage of values and as_matrix()!

pandas v0.24.0 introduced two new methods for obtaining NumPy arrays from pandas objects:

  1. to_numpy(), which is defined on Index, Series, and DataFrame objects, and
  2. array, which is defined on Index and Series objects only.

If you visit the v0.24 docs for .values, you will see a big red warning that says:

Warning: We recommend using DataFrame.to_numpy() instead.

See this section of the v0.24.0 release notes, and this answer for more information.


Towards Better Consistency: to_numpy()

In the spirit of better consistency throughout the API, a new method to_numpy has been introduced to extract the underlying NumPy array from DataFrames.

# Setup.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])

df.to_numpy()
array([[1, 4],
       [2, 5],
       [3, 6]])

As mentioned above, this method is also defined on Index and Series objects (see here).

df.index.to_numpy()
# array(['a', 'b', 'c'], dtype=object)

df['A'].to_numpy()
#  array([1, 2, 3])

By default, a view is returned, so any modifications made will affect the original.

v = df.to_numpy()
v[0, 0] = -1

df
   A  B
a -1  4
b  2  5
c  3  6

If you need a copy instead, use to_numpy(copy=True).

pandas >= 1.0 update for ExtensionTypes

If you’re using pandas 1.x, chances are you’ll be dealing with extension types a lot more. You’ll have to be a little more careful that these extension types are correctly converted.

a = pd.array([1, 2, None], dtype="Int64")                                  
a                                                                          

<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64 

# Wrong
a.to_numpy()                                                               
# array([1, 2, <NA>], dtype=object)  # yuck, objects

# Right
a.to_numpy(dtype='float', na_value=np.nan)                                 
# array([ 1.,  2., nan])

This is called out in the docs.

If you need the dtypes

As shown in another answer, DataFrame.to_records is a good way to do this.

df.to_records()
# rec.array([('a', -1, 4), ('b',  2, 5), ('c',  3, 6)],
#           dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8')])

This cannot be done with to_numpy, unfortunately. However, as an alternative, you can use np.rec.fromrecords:

v = df.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
# rec.array([('a', -1, 4), ('b',  2, 5), ('c',  3, 6)],
#          dtype=[('index', '<U1'), ('A', '<i8'), ('B', '<i8')])

Performance wise, it’s nearly the same (actually, using rec.fromrecords is a bit faster).

df2 = pd.concat([df] * 10000)

%timeit df2.to_records()
%%timeit
v = df2.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())

11.1 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.67 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Rationale for Adding a New Method

to_numpy() (in addition to array) was added as a result of discussions under two GitHub issues GH19954 and GH23623.

Specifically, the docs mention the rationale:

[…] with .values it was unclear whether the returned value would be the actual array, some transformation of it, or one of pandas custom arrays (like Categorical). For example, with PeriodIndex, .values generates a new ndarray of period objects each time. […]

to_numpy aim to improve the consistency of the API, which is a major step in the right direction. .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API, as soon as you can.


Critique of Other Solutions

DataFrame.values has inconsistent behaviour, as already noted.

DataFrame.get_values() is simply a wrapper around DataFrame.values, so everything said above applies.

DataFrame.as_matrix() is deprecated now, do NOT use!


回答 2

注意.as_matrix()不建议使用此答案中的方法。熊猫0.23.4警告:

方法.as_matrix将在以后的版本中删除。请改用.values。


熊猫内置一些东西…

numpy_matrix = df.as_matrix()

array([[nan, 0.2, nan],
       [nan, nan, 0.5],
       [nan, 0.2, 0.5],
       [0.1, 0.2, nan],
       [0.1, 0.2, 0.5],
       [0.1, nan, 0.5],
       [0.1, nan, nan]])

Note: The .as_matrix() method used in this answer is deprecated. Pandas 0.23.4 warns:

Method .as_matrix will be removed in a future version. Use .values instead.


Pandas has something built in…

numpy_matrix = df.as_matrix()

gives

array([[nan, 0.2, nan],
       [nan, nan, 0.5],
       [nan, 0.2, 0.5],
       [0.1, 0.2, nan],
       [0.1, 0.2, 0.5],
       [0.1, nan, 0.5],
       [0.1, nan, nan]])

回答 3

我只需要链接DataFrame.reset_index()DataFrame.values函数来获得数据帧的Numpy表示,包括索引:

In [8]: df
Out[8]: 
          A         B         C
0 -0.982726  0.150726  0.691625
1  0.617297 -0.471879  0.505547
2  0.417123 -1.356803 -1.013499
3 -0.166363 -0.957758  1.178659
4 -0.164103  0.074516 -0.674325
5 -0.340169 -0.293698  1.231791
6 -1.062825  0.556273  1.508058
7  0.959610  0.247539  0.091333

[8 rows x 3 columns]

In [9]: df.reset_index().values
Out[9]:
array([[ 0.        , -0.98272574,  0.150726  ,  0.69162512],
       [ 1.        ,  0.61729734, -0.47187926,  0.50554728],
       [ 2.        ,  0.4171228 , -1.35680324, -1.01349922],
       [ 3.        , -0.16636303, -0.95775849,  1.17865945],
       [ 4.        , -0.16410334,  0.0745164 , -0.67432474],
       [ 5.        , -0.34016865, -0.29369841,  1.23179064],
       [ 6.        , -1.06282542,  0.55627285,  1.50805754],
       [ 7.        ,  0.95961001,  0.24753911,  0.09133339]])

为了获得dtypes,我们需要使用view将此ndarray转换为结构化数组:

In [10]: df.reset_index().values.ravel().view(dtype=[('index', int), ('A', float), ('B', float), ('C', float)])
Out[10]:
array([( 0, -0.98272574,  0.150726  ,  0.69162512),
       ( 1,  0.61729734, -0.47187926,  0.50554728),
       ( 2,  0.4171228 , -1.35680324, -1.01349922),
       ( 3, -0.16636303, -0.95775849,  1.17865945),
       ( 4, -0.16410334,  0.0745164 , -0.67432474),
       ( 5, -0.34016865, -0.29369841,  1.23179064),
       ( 6, -1.06282542,  0.55627285,  1.50805754),
       ( 7,  0.95961001,  0.24753911,  0.09133339),
       dtype=[('index', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

I would just chain the DataFrame.reset_index() and DataFrame.values functions to get the Numpy representation of the dataframe, including the index:

In [8]: df
Out[8]: 
          A         B         C
0 -0.982726  0.150726  0.691625
1  0.617297 -0.471879  0.505547
2  0.417123 -1.356803 -1.013499
3 -0.166363 -0.957758  1.178659
4 -0.164103  0.074516 -0.674325
5 -0.340169 -0.293698  1.231791
6 -1.062825  0.556273  1.508058
7  0.959610  0.247539  0.091333

[8 rows x 3 columns]

In [9]: df.reset_index().values
Out[9]:
array([[ 0.        , -0.98272574,  0.150726  ,  0.69162512],
       [ 1.        ,  0.61729734, -0.47187926,  0.50554728],
       [ 2.        ,  0.4171228 , -1.35680324, -1.01349922],
       [ 3.        , -0.16636303, -0.95775849,  1.17865945],
       [ 4.        , -0.16410334,  0.0745164 , -0.67432474],
       [ 5.        , -0.34016865, -0.29369841,  1.23179064],
       [ 6.        , -1.06282542,  0.55627285,  1.50805754],
       [ 7.        ,  0.95961001,  0.24753911,  0.09133339]])

To get the dtypes we’d need to transform this ndarray into a structured array using view:

In [10]: df.reset_index().values.ravel().view(dtype=[('index', int), ('A', float), ('B', float), ('C', float)])
Out[10]:
array([( 0, -0.98272574,  0.150726  ,  0.69162512),
       ( 1,  0.61729734, -0.47187926,  0.50554728),
       ( 2,  0.4171228 , -1.35680324, -1.01349922),
       ( 3, -0.16636303, -0.95775849,  1.17865945),
       ( 4, -0.16410334,  0.0745164 , -0.67432474),
       ( 5, -0.34016865, -0.29369841,  1.23179064),
       ( 6, -1.06282542,  0.55627285,  1.50805754),
       ( 7,  0.95961001,  0.24753911,  0.09133339),
       dtype=[('index', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

回答 4

您可以使用该to_records方法,但是如果dtypes不是您一开始就想要的,就必须多花点时间。就我而言,从字符串复制了DF,索引类型为字符串(以objectdtypes在pandas中表示):

In [102]: df
Out[102]: 
label    A    B    C
ID                  
1      NaN  0.2  NaN
2      NaN  NaN  0.5
3      NaN  0.2  0.5
4      0.1  0.2  NaN
5      0.1  0.2  0.5
6      0.1  NaN  0.5
7      0.1  NaN  NaN

In [103]: df.index.dtype
Out[103]: dtype('object')
In [104]: df.to_records()
Out[104]: 
rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)], 
      dtype=[('index', '|O8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
In [106]: df.to_records().dtype
Out[106]: dtype([('index', '|O8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

转换recarray dtype对我不起作用,但是已经可以在Pandas中做到这一点:

In [109]: df.index = df.index.astype('i8')
In [111]: df.to_records().view([('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
Out[111]:
rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)], 
      dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

请注意,Pandas ID在导出的记录数组中没有正确地将索引名设置为(错误?),因此我们可以从类型转换中受益,也可以对此进行更正。

目前,Pandas只有8个字节的整数i8,并且浮点数f8(请参阅本期)。

You can use the to_records method, but have to play around a bit with the dtypes if they are not what you want from the get go. In my case, having copied your DF from a string, the index type is string (represented by an object dtype in pandas):

In [102]: df
Out[102]: 
label    A    B    C
ID                  
1      NaN  0.2  NaN
2      NaN  NaN  0.5
3      NaN  0.2  0.5
4      0.1  0.2  NaN
5      0.1  0.2  0.5
6      0.1  NaN  0.5
7      0.1  NaN  NaN

In [103]: df.index.dtype
Out[103]: dtype('object')
In [104]: df.to_records()
Out[104]: 
rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)], 
      dtype=[('index', '|O8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
In [106]: df.to_records().dtype
Out[106]: dtype([('index', '|O8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Converting the recarray dtype does not work for me, but one can do this in Pandas already:

In [109]: df.index = df.index.astype('i8')
In [111]: df.to_records().view([('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
Out[111]:
rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)], 
      dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Note that Pandas does not set the name of the index properly (to ID) in the exported record array (a bug?), so we profit from the type conversion to also correct for that.

At the moment Pandas has only 8-byte integers, i8, and floats, f8 (see this issue).


回答 5

似乎df.to_records()会为您工作。您正在寻找的确切功能已被要求to_records指出作为替代。

我使用您的示例在本地进行了尝试,该调用产生的结果与您正在寻找的输出非常相似:

rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)],
      dtype=[(u'ID', '<i8'), (u'A', '<f8'), (u'B', '<f8'), (u'C', '<f8')])

请注意,这是一个recarray而不是array。您可以通过将其构造函数调用为,将结果移入常规numpy数组np.array(df.to_records())

It seems like df.to_records() will work for you. The exact feature you’re looking for was requested and to_records pointed to as an alternative.

I tried this out locally using your example, and that call yields something very similar to the output you were looking for:

rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)],
      dtype=[(u'ID', '<i8'), (u'A', '<f8'), (u'B', '<f8'), (u'C', '<f8')])

Note that this is a recarray rather than an array. You could move the result in to regular numpy array by calling its constructor as np.array(df.to_records()).


回答 6

尝试这个:

a = numpy.asarray(df)

Try this:

a = numpy.asarray(df)

回答 7

这是我从pandas DataFrame制作结构数组的方法。

创建数据框

import pandas as pd
import numpy as np
import six

NaN = float('nan')
ID = [1, 2, 3, 4, 5, 6, 7]
A = [NaN, NaN, NaN, 0.1, 0.1, 0.1, 0.1]
B = [0.2, NaN, 0.2, 0.2, 0.2, NaN, NaN]
C = [NaN, 0.5, 0.5, NaN, 0.5, 0.5, NaN]
columns = {'A':A, 'B':B, 'C':C}
df = pd.DataFrame(columns, index=ID)
df.index.name = 'ID'
print(df)

      A    B    C
ID               
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

定义函数以从pandas DataFrame中创建一个numpy结构数组(而不是记录数组)。

def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    This is functionally equivalent to but more efficient than
    np.array(df.to_array())

    :param df: the data frame to convert
    :return: a numpy structured array representation of df
    """

    v = df.values
    cols = df.columns

    if six.PY2:  # python 2 needs .encode() but 3 does not
        types = [(cols[i].encode(), df[k].dtype.type) for (i, k) in enumerate(cols)]
    else:
        types = [(cols[i], df[k].dtype.type) for (i, k) in enumerate(cols)]
    dtype = np.dtype(types)
    z = np.zeros(v.shape[0], dtype)
    for (i, k) in enumerate(z.dtype.names):
        z[k] = v[:, i]
    return z

使用reset_index使包括索引作为其数据的一部分,新的数据帧。将该数据帧转换为结构数组。

sa = df_to_sarray(df.reset_index())
sa

array([(1L, nan, 0.2, nan), (2L, nan, nan, 0.5), (3L, nan, 0.2, 0.5),
       (4L, 0.1, 0.2, nan), (5L, 0.1, 0.2, 0.5), (6L, 0.1, nan, 0.5),
       (7L, 0.1, nan, nan)], 
      dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

编辑:更新df_to_sarray以避免错误调用.encode()与Python 3.感谢约瑟夫·加尔文宁静 为他们的意见和解决方案。

Here is my approach to making a structure array from a pandas DataFrame.

Create the data frame

import pandas as pd
import numpy as np
import six

NaN = float('nan')
ID = [1, 2, 3, 4, 5, 6, 7]
A = [NaN, NaN, NaN, 0.1, 0.1, 0.1, 0.1]
B = [0.2, NaN, 0.2, 0.2, 0.2, NaN, NaN]
C = [NaN, 0.5, 0.5, NaN, 0.5, 0.5, NaN]
columns = {'A':A, 'B':B, 'C':C}
df = pd.DataFrame(columns, index=ID)
df.index.name = 'ID'
print(df)

      A    B    C
ID               
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

Define function to make a numpy structure array (not a record array) from a pandas DataFrame.

def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    This is functionally equivalent to but more efficient than
    np.array(df.to_array())

    :param df: the data frame to convert
    :return: a numpy structured array representation of df
    """

    v = df.values
    cols = df.columns

    if six.PY2:  # python 2 needs .encode() but 3 does not
        types = [(cols[i].encode(), df[k].dtype.type) for (i, k) in enumerate(cols)]
    else:
        types = [(cols[i], df[k].dtype.type) for (i, k) in enumerate(cols)]
    dtype = np.dtype(types)
    z = np.zeros(v.shape[0], dtype)
    for (i, k) in enumerate(z.dtype.names):
        z[k] = v[:, i]
    return z

Use reset_index to make a new data frame that includes the index as part of its data. Convert that data frame to a structure array.

sa = df_to_sarray(df.reset_index())
sa

array([(1L, nan, 0.2, nan), (2L, nan, nan, 0.5), (3L, nan, 0.2, 0.5),
       (4L, 0.1, 0.2, nan), (5L, 0.1, 0.2, 0.5), (6L, 0.1, nan, 0.5),
       (7L, 0.1, nan, nan)], 
      dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

EDIT: Updated df_to_sarray to avoid error calling .encode() with python 3. Thanks to Joseph Garvin and halcyon for their comment and solution.


回答 8

将数据帧转换为其Numpy数组表示形式的两种方法。

  • mah_np_array = df.as_matrix(columns=None)

  • mah_np_array = df.values

Doc:https : //pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.as_matrix.html

Two ways to convert the data-frame to its Numpy-array representation.

  • mah_np_array = df.as_matrix(columns=None)

  • mah_np_array = df.values

Doc: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.as_matrix.html


回答 9

示例DataFrame的更简单方法:

df

         gbm       nnet        reg
0  12.097439  12.047437  12.100953
1  12.109811  12.070209  12.095288
2  11.720734  11.622139  11.740523
3  11.824557  11.926414  11.926527
4  11.800868  11.727730  11.729737
5  12.490984  12.502440  12.530894

采用:

np.array(df.to_records().view(type=np.matrix))

得到:

array([[(0, 12.097439  , 12.047437, 12.10095324),
        (1, 12.10981081, 12.070209, 12.09528824),
        (2, 11.72073428, 11.622139, 11.74052253),
        (3, 11.82455653, 11.926414, 11.92652727),
        (4, 11.80086775, 11.72773 , 11.72973699),
        (5, 12.49098389, 12.50244 , 12.53089367)]],
dtype=(numpy.record, [('index', '<i8'), ('gbm', '<f8'), ('nnet', '<f4'),
       ('reg', '<f8')]))

A Simpler Way for Example DataFrame:

df

         gbm       nnet        reg
0  12.097439  12.047437  12.100953
1  12.109811  12.070209  12.095288
2  11.720734  11.622139  11.740523
3  11.824557  11.926414  11.926527
4  11.800868  11.727730  11.729737
5  12.490984  12.502440  12.530894

USE:

np.array(df.to_records().view(type=np.matrix))

GET:

array([[(0, 12.097439  , 12.047437, 12.10095324),
        (1, 12.10981081, 12.070209, 12.09528824),
        (2, 11.72073428, 11.622139, 11.74052253),
        (3, 11.82455653, 11.926414, 11.92652727),
        (4, 11.80086775, 11.72773 , 11.72973699),
        (5, 12.49098389, 12.50244 , 12.53089367)]],
dtype=(numpy.record, [('index', '<i8'), ('gbm', '<f8'), ('nnet', '<f4'),
       ('reg', '<f8')]))

回答 10

从数据框导出到arcgis表时遇到了类似的问题,偶然发现了来自usgs的解决方案(https://my.usgs.gov/confluence/display/cdi/pandas.DataFrame+to+ArcGIS+Table)。简而言之,您的问题具有类似的解决方案:

df

      A    B    C
ID               
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

np_data = np.array(np.rec.fromrecords(df.values))
np_names = df.dtypes.index.tolist()
np_data.dtype.names = tuple([name.encode('UTF8') for name in np_names])

np_data

array([( nan,  0.2,  nan), ( nan,  nan,  0.5), ( nan,  0.2,  0.5),
       ( 0.1,  0.2,  nan), ( 0.1,  0.2,  0.5), ( 0.1,  nan,  0.5),
       ( 0.1,  nan,  nan)], 
      dtype=(numpy.record, [('A', '<f8'), ('B', '<f8'), ('C', '<f8')]))

Just had a similar problem when exporting from dataframe to arcgis table and stumbled on a solution from usgs (https://my.usgs.gov/confluence/display/cdi/pandas.DataFrame+to+ArcGIS+Table). In short your problem has a similar solution:

df

      A    B    C
ID               
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

np_data = np.array(np.rec.fromrecords(df.values))
np_names = df.dtypes.index.tolist()
np_data.dtype.names = tuple([name.encode('UTF8') for name in np_names])

np_data

array([( nan,  0.2,  nan), ( nan,  nan,  0.5), ( nan,  0.2,  0.5),
       ( 0.1,  0.2,  nan), ( 0.1,  0.2,  0.5), ( 0.1,  nan,  0.5),
       ( 0.1,  nan,  nan)], 
      dtype=(numpy.record, [('A', '<f8'), ('B', '<f8'), ('C', '<f8')]))

回答 11

我经历了以上答案。“ as_matrix() ”方法可以使用,但是现在已经过时了。对我来说,有效的是“ .to_numpy() ”。

这将返回一个多维数组。如果您要从excel工作表中读取数据,并且需要从任何索引访问数据,则我会更喜欢使用此方法。希望这可以帮助 :)

I went through the answers above. The “as_matrix()” method works but its obsolete now. For me, What worked was “.to_numpy()“.

This returns a multidimensional array. I’ll prefer using this method if you’re reading data from excel sheet and you need to access data from any index. Hope this helps :)


回答 12

除了陨石的答案,我找到了代码

df.index = df.index.astype('i8')

对我不起作用。因此,我将代码放在这里,以方便陷入此问题的其他人。

city_cluster_df = pd.read_csv(text_filepath, encoding='utf-8')
# the field 'city_en' is a string, when converted to Numpy array, it will be an object
city_cluster_arr = city_cluster_df[['city_en','lat','lon','cluster','cluster_filtered']].to_records()
descr=city_cluster_arr.dtype.descr
# change the field 'city_en' to string type (the index for 'city_en' here is 1 because before the field is the row index of dataframe)
descr[1]=(descr[1][0], "S20")
newArr=city_cluster_arr.astype(np.dtype(descr))

Further to meteore’s answer, I found the code

df.index = df.index.astype('i8')

doesn’t work for me. So I put my code here for the convenience of others stuck with this issue.

city_cluster_df = pd.read_csv(text_filepath, encoding='utf-8')
# the field 'city_en' is a string, when converted to Numpy array, it will be an object
city_cluster_arr = city_cluster_df[['city_en','lat','lon','cluster','cluster_filtered']].to_records()
descr=city_cluster_arr.dtype.descr
# change the field 'city_en' to string type (the index for 'city_en' here is 1 because before the field is the row index of dataframe)
descr[1]=(descr[1][0], "S20")
newArr=city_cluster_arr.astype(np.dtype(descr))

回答 13

将数据帧转换为numpy数组的简单方法:

import pandas as pd
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df_to_array = df.to_numpy()
array([[1, 3],
   [2, 4]])

鼓励使用to_numpy来保持一致性。

参考:https : //pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html

A simple way to convert dataframe to numpy array:

import pandas as pd
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df_to_array = df.to_numpy()
array([[1, 3],
   [2, 4]])

Use of to_numpy is encouraged to preserve consistency.

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html


回答 14

尝试这个:

np.array(df) 

array([['ID', nan, nan, nan],
   ['1', nan, 0.2, nan],
   ['2', nan, nan, 0.5],
   ['3', nan, 0.2, 0.5],
   ['4', 0.1, 0.2, nan],
   ['5', 0.1, 0.2, 0.5],
   ['6', 0.1, nan, 0.5],
   ['7', 0.1, nan, nan]], dtype=object)

有关更多信息,请访问:[ https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html] 对numpy 1.16.5和pandas 0.25.2有效。

Try this:

np.array(df) 

array([['ID', nan, nan, nan],
   ['1', nan, 0.2, nan],
   ['2', nan, nan, 0.5],
   ['3', nan, 0.2, 0.5],
   ['4', 0.1, 0.2, nan],
   ['5', 0.1, 0.2, 0.5],
   ['6', 0.1, nan, 0.5],
   ['7', 0.1, nan, nan]], dtype=object)

Some more information at: [https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html] Valid for numpy 1.16.5 and pandas 0.25.2.


如何访问NumPy多维数组的第i列?

问题:如何访问NumPy多维数组的第i列?

假设我有:

test = numpy.array([[1, 2], [3, 4], [5, 6]])

test[i]使我得到数组的第i行(例如[1, 2])。如何访问第ith列?(例如[1, 3, 5])。另外,这将是一项昂贵的操作吗?

Suppose I have:

test = numpy.array([[1, 2], [3, 4], [5, 6]])

test[i] gets me ith line of the array (eg [1, 2]). How can I access the ith column? (eg [1, 3, 5]). Also, would this be an expensive operation?


回答 0

>>> test[:,0]
array([1, 3, 5])

同样,

>>> test[1,:]
array([3, 4])

使您可以访问行。NumPy参考资料的第1.4节(索引)对此进行了介绍。这很快,至少以我的经验而言。它肯定比循环访问每个元素要快得多。

>>> test[:,0]
array([1, 3, 5])

Similarly,

>>> test[1,:]
array([3, 4])

lets you access rows. This is covered in Section 1.4 (Indexing) of the NumPy reference. This is quick, at least in my experience. It’s certainly much quicker than accessing each element in a loop.


回答 1

如果您想一次访问多个列,则可以执行以下操作:

>>> test = np.arange(9).reshape((3,3))
>>> test
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> test[:,[0,2]]
array([[0, 2],
       [3, 5],
       [6, 8]])

And if you want to access more than one column at a time you could do:

>>> test = np.arange(9).reshape((3,3))
>>> test
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> test[:,[0,2]]
array([[0, 2],
       [3, 5],
       [6, 8]])

回答 2

>>> test[:,0]
array([1, 3, 5])

该命令为您提供了行向量,如果您只想在其上循环,就可以了,但是如果您要与其他尺寸为3xN的数组进行堆叠,则可以

ValueError: all the input arrays must have same number of dimensions

>>> test[:,[0]]
array([[1],
       [3],
       [5]])

为您提供列向量,以便您可以进行串联或hstack操作。

例如

>>> np.hstack((test, test[:,[0]]))
array([[1, 2, 1],
       [3, 4, 3],
       [5, 6, 5]])
>>> test[:,0]
array([1, 3, 5])

this command gives you a row vector, if you just want to loop over it, it’s fine, but if you want to hstack with some other array with dimension 3xN, you will have

ValueError: all the input arrays must have same number of dimensions

while

>>> test[:,[0]]
array([[1],
       [3],
       [5]])

gives you a column vector, so that you can do concatenate or hstack operation.

e.g.

>>> np.hstack((test, test[:,[0]]))
array([[1, 2, 1],
       [3, 4, 3],
       [5, 6, 5]])

回答 3

您还可以转置并返回一行:

In [4]: test.T[0]
Out[4]: array([1, 3, 5])

You could also transpose and return a row:

In [4]: test.T[0]
Out[4]: array([1, 3, 5])

回答 4

要获得几个独立的列,只需:

> test[:,[0,2]]

你会得到第0列和第2列

To get several and indepent columns, just:

> test[:,[0,2]]

you will get colums 0 and 2


回答 5

尽管问题已得到回答,但让我提及一些细微差别。

假设您对数组的第一列感兴趣

arr = numpy.array([[1, 2],
                   [3, 4],
                   [5, 6]])

从其他答案中已经知道,要以“行向量”(shape的数组(3,))的形式获取它,可以使用切片:

arr_c1_ref = arr[:, 1]  # creates a reference to the 1st column of the arr
arr_c1_copy = arr[:, 1].copy()  # creates a copy of the 1st column of the arr

要检查一个数组是视图还是另一个数组的副本,可以执行以下操作:

arr_c1_ref.base is arr  # True
arr_c1_copy.base is arr  # False

参见ndarray.base

除了两者之间的明显区别(修改arr_c1_ref将影响arr),遍历它们中每一个的字节步数也不同:

arr_c1_ref.strides[0]  # 8 bytes
arr_c1_copy.strides[0]  # 4 bytes

大步。为什么这很重要?假设您有一个很大的数组,A而不是arr

A = np.random.randint(2, size=(10000,10000), dtype='int32')
A_c1_ref = A[:, 1] 
A_c1_copy = A[:, 1].copy()

并且您要计算第一列的所有元素的总和,即A_c1_ref.sum()A_c1_copy.sum()。使用复制的版本要快得多:

%timeit A_c1_ref.sum()  # ~248 µs
%timeit A_c1_copy.sum()  # ~12.8 µs

这是由于前面提到的跨步数不同:

A_c1_ref.strides[0]  # 40000 bytes
A_c1_copy.strides[0]  # 4 bytes

尽管使用列副本似乎更好,但由于创建副本需要时间并使用更多的内存(在这种情况下,我花了大约200 µs的时间来创建副本)并不总是正确的。 A_c1_copy)。但是,如果我们首先需要复制,或者需要在数组的特定列上执行许多不同的操作,并且可以牺牲内存以提高速度,那么复制就可以了。

如果我们有兴趣主要使用列,最好以列大(’F’)顺序而不是行大(’C’)顺序创建数组(这是默认设置) ),然后像以前一样进行切片以获取一列而不复制它:

A = np.asfortranarray(A)  # or np.array(A, order='F')
A_c1_ref = A[:, 1]
A_c1_ref.strides[0]  # 4 bytes
%timeit A_c1_ref.sum()  # ~12.6 µs vs ~248 µs

现在,在列视图上执行求和运算(或其他任何运算)要快得多。

最后,让我注意到,转置数组并使用行切片与在原始数组上使用列切片相同,因为转置是通过交换原始数组的形状和步幅来完成的。

A.T[1,:].strides[0]  # 40000

Although the question has been answered, let me mention some nuances.

Let’s say you are interested in the first column of the array

arr = numpy.array([[1, 2],
                   [3, 4],
                   [5, 6]])

As you already know from other answers, to get it in the form of “row vector” (array of shape (3,)), you use slicing:

arr_c1_ref = arr[:, 1]  # creates a reference to the 1st column of the arr
arr_c1_copy = arr[:, 1].copy()  # creates a copy of the 1st column of the arr

To check if an array is a view or a copy of another array you can do the following:

arr_c1_ref.base is arr  # True
arr_c1_copy.base is arr  # False

see ndarray.base.

Besides the obvious difference between the two (modifying arr_c1_ref will affect arr), the number of byte-steps for traversing each of them is different:

arr_c1_ref.strides[0]  # 8 bytes
arr_c1_copy.strides[0]  # 4 bytes

see strides. Why is this important? Imagine that you have a very big array A instead of the arr:

A = np.random.randint(2, size=(10000,10000), dtype='int32')
A_c1_ref = A[:, 1] 
A_c1_copy = A[:, 1].copy()

and you want to compute the sum of all the elements of the first column, i.e. A_c1_ref.sum() or A_c1_copy.sum(). Using the copied version is much faster:

%timeit A_c1_ref.sum()  # ~248 µs
%timeit A_c1_copy.sum()  # ~12.8 µs

This is due to the different number of strides mentioned before:

A_c1_ref.strides[0]  # 40000 bytes
A_c1_copy.strides[0]  # 4 bytes

Although it might seem that using column copies is better, it is not always true for the reason that making a copy takes time and uses more memory (in this case it took me approx. 200 µs to create the A_c1_copy). However if we need the copy in the first place, or we need to do many different operations on a specific column of the array and we are ok with sacrificing memory for speed, then making a copy is the way to go.

In the case that we are interested in working mostly with columns, it could be a good idea to create our array in column-major (‘F’) order instead of the row-major (‘C’) order (which is the default), and then do the slicing as before to get a column without copying it:

A = np.asfortranarray(A)  # or np.array(A, order='F')
A_c1_ref = A[:, 1]
A_c1_ref.strides[0]  # 4 bytes
%timeit A_c1_ref.sum()  # ~12.6 µs vs ~248 µs

Now, performing the sum operation (or any other) on a column-view is much faster.

Finally let me note that transposing an array and using row-slicing is the same as using the column-slicing on the original array, because transposing is done by just swapping the shape and the strides of the original array.

A.T[1,:].strides[0]  # 40000

回答 6

>>> test
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

>>> ncol = test.shape[1]
>>> ncol
5L

然后,您可以通过以下方式选择第二至第四列:

>>> test[0:, 1:(ncol - 1)]
array([[1, 2, 3],
       [6, 7, 8]])
>>> test
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

>>> ncol = test.shape[1]
>>> ncol
5L

Then you can select the 2nd – 4th column this way:

>>> test[0:, 1:(ncol - 1)]
array([[1, 2, 3],
       [6, 7, 8]])

是否有NumPy函数返回数组中某物的第一个索引?

问题:是否有NumPy函数返回数组中某物的第一个索引?

我知道有一种方法可以让Python列表返回某些内容的第一个索引:

>>> l = [1, 2, 3]
>>> l.index(2)
1

NumPy数组有类似的东西吗?

I know there is a method for a Python list to return the first index of something:

>>> l = [1, 2, 3]
>>> l.index(2)
1

Is there something like that for NumPy arrays?


回答 0

是的,这是给定NumPy数组array和值的答案item,以搜索:

itemindex = numpy.where(array==item)

结果是具有所有行索引,然后是所有列索引的元组。

例如,如果一个数组是二维的,并且它在两个位置包含您的商品,则

array[itemindex[0][0]][itemindex[1][0]]

将等于您的物品,所以

array[itemindex[0][1]][itemindex[1][1]]

numpy.where

Yes, here is the answer given a NumPy array, array, and a value, item, to search for:

itemindex = numpy.where(array==item)

The result is a tuple with first all the row indices, then all the column indices.

For example, if an array is two dimensions and it contained your item at two locations then

array[itemindex[0][0]][itemindex[1][0]]

would be equal to your item and so would

array[itemindex[0][1]][itemindex[1][1]]

numpy.where


回答 1

如果仅需要一个值的第一次出现的索引,则可以使用nonzero(或where,在这种情况下,这等于同一件事):

>>> t = array([1, 1, 1, 2, 2, 3, 8, 3, 8, 8])
>>> nonzero(t == 8)
(array([6, 8, 9]),)
>>> nonzero(t == 8)[0][0]
6

如果您需要每个的第一个索引,显然可以重复执行与上述相同的操作,但是有一个技巧可能会更快。以下是每个子序列的第一个元素的索引:

>>> nonzero(r_[1, diff(t)[:-1]])
(array([0, 3, 5, 6, 7, 8]),)

请注意,它找到了3s的两个子序列和8s的两个子序列的开头:

[ 1,1,1,2,2,3838,8]

因此,它与找到每个值的第一个匹配项略有不同。在您的程序中,您可以使用的排序版本t来获得所需的内容:

>>> st = sorted(t)
>>> nonzero(r_[1, diff(st)[:-1]])
(array([0, 3, 5, 7]),)

If you need the index of the first occurrence of only one value, you can use nonzero (or where, which amounts to the same thing in this case):

>>> t = array([1, 1, 1, 2, 2, 3, 8, 3, 8, 8])
>>> nonzero(t == 8)
(array([6, 8, 9]),)
>>> nonzero(t == 8)[0][0]
6

If you need the first index of each of many values, you could obviously do the same as above repeatedly, but there is a trick that may be faster. The following finds the indices of the first element of each subsequence:

>>> nonzero(r_[1, diff(t)[:-1]])
(array([0, 3, 5, 6, 7, 8]),)

Notice that it finds the beginning of both subsequence of 3s and both subsequences of 8s:

[1, 1, 1, 2, 2, 3, 8, 3, 8, 8]

So it’s slightly different than finding the first occurrence of each value. In your program, you may be able to work with a sorted version of t to get what you want:

>>> st = sorted(t)
>>> nonzero(r_[1, diff(st)[:-1]])
(array([0, 3, 5, 7]),)

回答 2

您还可以将NumPy数组转换为空中列表并获取其索引。例如,

l = [1,2,3,4,5] # Python list
a = numpy.array(l) # NumPy array
i = a.tolist().index(2) # i will return index of 2
print i

它将打印1。

You can also convert a NumPy array to list in the air and get its index. For example,

l = [1,2,3,4,5] # Python list
a = numpy.array(l) # NumPy array
i = a.tolist().index(2) # i will return index of 2
print i

It will print 1.


回答 3

只是添加了一个非常高性能和方便 替代基于np.ndenumerate查找第一个索引:

from numba import njit
import numpy as np

@njit
def index(array, item):
    for idx, val in np.ndenumerate(array):
        if val == item:
            return idx
    # If no item was found return None, other return types might be a problem due to
    # numbas type inference.

这非常快,并且自然可以处理多维数组

>>> arr1 = np.ones((100, 100, 100))
>>> arr1[2, 2, 2] = 2

>>> index(arr1, 2)
(2, 2, 2)

>>> arr2 = np.ones(20)
>>> arr2[5] = 2

>>> index(arr2, 2)
(5,)

这可能比使用或的任何方法都要快得多(因为这会使操作短路)。np.wherenp.nonzero


但是np.argwhere也可以处理优雅与多维数组(您需要手动将它转换到一个元组它不是短路),但如果没有找到匹配它会失败:

>>> tuple(np.argwhere(arr1 == 2)[0])
(2, 2, 2)
>>> tuple(np.argwhere(arr2 == 2)[0])
(5,)

Just to add a very performant and handy alternative based on np.ndenumerate to find the first index:

from numba import njit
import numpy as np

@njit
def index(array, item):
    for idx, val in np.ndenumerate(array):
        if val == item:
            return idx
    # If no item was found return None, other return types might be a problem due to
    # numbas type inference.

This is pretty fast and deals naturally with multidimensional arrays:

>>> arr1 = np.ones((100, 100, 100))
>>> arr1[2, 2, 2] = 2

>>> index(arr1, 2)
(2, 2, 2)

>>> arr2 = np.ones(20)
>>> arr2[5] = 2

>>> index(arr2, 2)
(5,)

This can be much faster (because it’s short-circuiting the operation) than any approach using np.where or np.nonzero.


However np.argwhere could also deal gracefully with multidimensional arrays (you would need to manually cast it to a tuple and it’s not short-circuited) but it would fail if no match is found:

>>> tuple(np.argwhere(arr1 == 2)[0])
(2, 2, 2)
>>> tuple(np.argwhere(arr2 == 2)[0])
(5,)

回答 4

如果要将其用作其他内容的索引,则可以使用布尔索引(如果数组可广播);您不需要显式索引。做到这一点的最简单的绝对方法是根据真值简单地建立索引。

other_array[first_array == item]

任何布尔运算均可:

a = numpy.arange(100)
other_array[first_array > 50]

非零方法也需要布尔值:

index = numpy.nonzero(first_array == item)[0][0]

两个零用于索引的元组(假设first_array为1D),然后是索引数组中的第一项。

If you’re going to use this as an index into something else, you can use boolean indices if the arrays are broadcastable; you don’t need explicit indices. The absolute simplest way to do this is to simply index based on a truth value.

other_array[first_array == item]

Any boolean operation works:

a = numpy.arange(100)
other_array[first_array > 50]

The nonzero method takes booleans, too:

index = numpy.nonzero(first_array == item)[0][0]

The two zeros are for the tuple of indices (assuming first_array is 1D) and then the first item in the array of indices.


回答 5

l.index(x)返回最小的i,使i是列表中x首次出现的索引。

可以放心地假定index()Python 中的函数已实现,以便在找到第一个匹配项后停止,从而获得最佳的平均性能。

为了找到在NumPy数组中的第一个匹配项之后停止的元素,请使用迭代器(ndenumerate)。

In [67]: l=range(100)

In [68]: l.index(2)
Out[68]: 2

NumPy数组:

In [69]: a = np.arange(100)

In [70]: next((idx for idx, val in np.ndenumerate(a) if val==2))
Out[70]: (2L,)

请注意,这两种方法index(),并next在未找到该元素,返回一个错误。使用next,可以在找不到元素的情况下使用第二个参数返回一个特殊值,例如

In [77]: next((idx for idx, val in np.ndenumerate(a) if val==400),None)

NumPy(argmaxwherenonzero)中还有其他函数可用于查找数组中的元素,但是它们都有缺点,即遍历整个数组以查找所有出现的元素,因此没有为查找第一个元素而进行优化。还需要注意的是where,并nonzero返回数组,所以你需要选择得到索引的第一个元素。

In [71]: np.argmax(a==2)
Out[71]: 2

In [72]: np.where(a==2)
Out[72]: (array([2], dtype=int64),)

In [73]: np.nonzero(a==2)
Out[73]: (array([2], dtype=int64),)

时间比较

当搜索的项位于数组的开头%timeit在IPython shell中使用)时,仅检查大型数组的解决方案使用迭代器的解决方案会更快:

In [285]: a = np.arange(100000)

In [286]: %timeit next((idx for idx, val in np.ndenumerate(a) if val==0))
100000 loops, best of 3: 17.6 µs per loop

In [287]: %timeit np.argmax(a==0)
1000 loops, best of 3: 254 µs per loop

In [288]: %timeit np.where(a==0)[0][0]
1000 loops, best of 3: 314 µs per loop

这是NumPy GitHub的未解决问题

另请参阅:Numpy:快速找到价值的第一个索引

l.index(x) returns the smallest i such that i is the index of the first occurrence of x in the list.

One can safely assume that the index() function in Python is implemented so that it stops after finding the first match, and this results in an optimal average performance.

For finding an element stopping after the first match in a NumPy array use an iterator (ndenumerate).

In [67]: l=range(100)

In [68]: l.index(2)
Out[68]: 2

NumPy array:

In [69]: a = np.arange(100)

In [70]: next((idx for idx, val in np.ndenumerate(a) if val==2))
Out[70]: (2L,)

Note that both methods index() and next return an error if the element is not found. With next, one can use a second argument to return a special value in case the element is not found, e.g.

In [77]: next((idx for idx, val in np.ndenumerate(a) if val==400),None)

There are other functions in NumPy (argmax, where, and nonzero) that can be used to find an element in an array, but they all have the drawback of going through the whole array looking for all occurrences, thus not being optimized for finding the first element. Note also that where and nonzero return arrays, so you need to select the first element to get the index.

In [71]: np.argmax(a==2)
Out[71]: 2

In [72]: np.where(a==2)
Out[72]: (array([2], dtype=int64),)

In [73]: np.nonzero(a==2)
Out[73]: (array([2], dtype=int64),)

Time comparison

Just checking that for large arrays the solution using an iterator is faster when the searched item is at the beginning of the array (using %timeit in the IPython shell):

In [285]: a = np.arange(100000)

In [286]: %timeit next((idx for idx, val in np.ndenumerate(a) if val==0))
100000 loops, best of 3: 17.6 µs per loop

In [287]: %timeit np.argmax(a==0)
1000 loops, best of 3: 254 µs per loop

In [288]: %timeit np.where(a==0)[0][0]
1000 loops, best of 3: 314 µs per loop

This is an open NumPy GitHub issue.

See also: Numpy: find first index of value fast


回答 6

对于一维排序数组,使用numpy.searchsorted返回NumPy整数(位置)会更加简单有效。例如,

arr = np.array([1, 1, 1, 2, 3, 3, 4])
i = np.searchsorted(arr, 3)

只要确保数组已经排序

还要检查返回的索引i是否实际上包含被搜索的元素,因为searchsorted的主要目的是查找应该在其中插入元素以保持顺序的索引。

if arr[i] == 3:
    print("present")
else:
    print("not present")

For one-dimensional sorted arrays, it would be much more simpler and efficient O(log(n)) to use numpy.searchsorted which returns a NumPy integer (position). For example,

arr = np.array([1, 1, 1, 2, 3, 3, 4])
i = np.searchsorted(arr, 3)

Just make sure the array is already sorted

Also check if returned index i actually contains the searched element, since searchsorted’s main objective is to find indices where elements should be inserted to maintain order.

if arr[i] == 3:
    print("present")
else:
    print("not present")

回答 7

要根据任何条件建立索引,可以执行以下操作:

In [1]: from numpy import *
In [2]: x = arange(125).reshape((5,5,5))
In [3]: y = indices(x.shape)
In [4]: locs = y[:,x >= 120] # put whatever you want in place of x >= 120
In [5]: pts = hsplit(locs, len(locs[0]))
In [6]: for pt in pts:
   .....:         print(', '.join(str(p[0]) for p in pt))
4, 4, 0
4, 4, 1
4, 4, 2
4, 4, 3
4, 4, 4

这是一个快速执行list.index()的功能的函数,除非找不到该异常不会引发异常。当心-在大型阵列上这可能非常慢。如果您愿意将其用作方法,则可以将其修补到数组上。

def ndindex(ndarray, item):
    if len(ndarray.shape) == 1:
        try:
            return [ndarray.tolist().index(item)]
        except:
            pass
    else:
        for i, subarray in enumerate(ndarray):
            try:
                return [i] + ndindex(subarray, item)
            except:
                pass

In [1]: ndindex(x, 103)
Out[1]: [4, 0, 3]

To index on any criteria, you can so something like the following:

In [1]: from numpy import *
In [2]: x = arange(125).reshape((5,5,5))
In [3]: y = indices(x.shape)
In [4]: locs = y[:,x >= 120] # put whatever you want in place of x >= 120
In [5]: pts = hsplit(locs, len(locs[0]))
In [6]: for pt in pts:
   .....:         print(', '.join(str(p[0]) for p in pt))
4, 4, 0
4, 4, 1
4, 4, 2
4, 4, 3
4, 4, 4

And here’s a quick function to do what list.index() does, except doesn’t raise an exception if it’s not found. Beware — this is probably very slow on large arrays. You can probably monkey patch this on to arrays if you’d rather use it as a method.

def ndindex(ndarray, item):
    if len(ndarray.shape) == 1:
        try:
            return [ndarray.tolist().index(item)]
        except:
            pass
    else:
        for i, subarray in enumerate(ndarray):
            try:
                return [i] + ndindex(subarray, item)
            except:
                pass

In [1]: ndindex(x, 103)
Out[1]: [4, 0, 3]

回答 8

对于一维数组,我建议np.flatnonzero(array == value)[0],这相当于两np.nonzero(array == value)[0][0]np.where(array == value)[0][0],但避免了拆箱1-元件的元组的丑陋。

For 1D arrays, I’d recommend np.flatnonzero(array == value)[0], which is equivalent to both np.nonzero(array == value)[0][0] and np.where(array == value)[0][0] but avoids the ugliness of unboxing a 1-element tuple.


回答 9

从np.where()中选择第一个元素的替代方法是将生成器表达式与枚举一起使用,例如:

>>> import numpy as np
>>> x = np.arange(100)   # x = array([0, 1, 2, 3, ... 99])
>>> next(i for i, x_i in enumerate(x) if x_i == 2)
2

对于二维数组,可以这样做:

>>> x = np.arange(100).reshape(10,10)   # x = array([[0, 1, 2,... 9], [10,..19],])
>>> next((i,j) for i, x_i in enumerate(x) 
...            for j, x_ij in enumerate(x_i) if x_ij == 2)
(0, 2)

这种方法的优势在于,它会在找到第一个匹配项后停止检查数组的元素,而np.where会检查所有元素是否匹配。如果数组中的早期匹配项,则生成器表达式会更快。

An alternative to selecting the first element from np.where() is to use a generator expression together with enumerate, such as:

>>> import numpy as np
>>> x = np.arange(100)   # x = array([0, 1, 2, 3, ... 99])
>>> next(i for i, x_i in enumerate(x) if x_i == 2)
2

For a two dimensional array one would do:

>>> x = np.arange(100).reshape(10,10)   # x = array([[0, 1, 2,... 9], [10,..19],])
>>> next((i,j) for i, x_i in enumerate(x) 
...            for j, x_ij in enumerate(x_i) if x_ij == 2)
(0, 2)

The advantage of this approach is that it stops checking the elements of the array after the first match is found, whereas np.where checks all elements for a match. A generator expression would be faster if there’s match early in the array.


回答 10

NumPy中有很多操作可以组合在一起完成。这将返回等于item的元素的索引:

numpy.nonzero(array - item)

然后,您可以使用列表的第一个元素来获得一个元素。

There are lots of operations in NumPy that could perhaps be put together to accomplish this. This will return indices of elements equal to item:

numpy.nonzero(array - item)

You could then take the first elements of the lists to get a single element.


回答 11

numpy_indexed包(声明,我是它的作者)包含list.index为numpy.ndarray的矢量相当于; 那是:

sequence_of_arrays = [[0, 1], [1, 2], [-5, 0]]
arrays_to_query = [[-5, 0], [1, 0]]

import numpy_indexed as npi
idx = npi.indices(sequence_of_arrays, arrays_to_query, missing=-1)
print(idx)   # [2, -1]

该解决方案具有矢量化的性能,可推广到ndarrays,并具有多种处理缺失值的方法。

The numpy_indexed package (disclaimer, I am its author) contains a vectorized equivalent of list.index for numpy.ndarray; that is:

sequence_of_arrays = [[0, 1], [1, 2], [-5, 0]]
arrays_to_query = [[-5, 0], [1, 0]]

import numpy_indexed as npi
idx = npi.indices(sequence_of_arrays, arrays_to_query, missing=-1)
print(idx)   # [2, -1]

This solution has vectorized performance, generalizes to ndarrays, and has various ways of dealing with missing values.


回答 12

注意:这是针对python 2.7版本的

您可以使用lambda函数来解决此问题,并且它对NumPy数组和列表均有效。

your_list = [11, 22, 23, 44, 55]
result = filter(lambda x:your_list[x]>30, range(len(your_list)))
#result: [3, 4]

import numpy as np
your_numpy_array = np.array([11, 22, 23, 44, 55])
result = filter(lambda x:your_numpy_array [x]>30, range(len(your_list)))
#result: [3, 4]

你可以使用

result[0]

获取过滤元素的第一个索引。

对于python 3.6,请使用

list(result)

代替

result

Note: this is for python 2.7 version

You can use a lambda function to deal with the problem, and it works both on NumPy array and list.

your_list = [11, 22, 23, 44, 55]
result = filter(lambda x:your_list[x]>30, range(len(your_list)))
#result: [3, 4]

import numpy as np
your_numpy_array = np.array([11, 22, 23, 44, 55])
result = filter(lambda x:your_numpy_array [x]>30, range(len(your_list)))
#result: [3, 4]

And you can use

result[0]

to get the first index of the filtered elements.

For python 3.6, use

list(result)

instead of

result

如何从列表中删除第一个项目?

问题:如何从列表中删除第一个项目?

我有表[0, 1, 2, 3, 4],我想将它做成[1, 2, 3, 4]。我该怎么办?

I have the list [0, 1, 2, 3, 4] I’d like to make it into [1, 2, 3, 4]. How do I go about this?


回答 0

Python清单

list.pop(索引)

>>> l = ['a', 'b', 'c', 'd']
>>> l.pop(0)
'a'
>>> l
['b', 'c', 'd']
>>> 

删除列表[索引]

>>> l = ['a', 'b', 'c', 'd']
>>> del l[0]
>>> l
['b', 'c', 'd']
>>> 

这些都将修改您的原始列表。

其他人建议使用切片:

  • 复制清单
  • 可以返回一个子集

另外,如果要执行许多pop(0),则应查看collections.deque

from collections import deque
>>> l = deque(['a', 'b', 'c', 'd'])
>>> l.popleft()
'a'
>>> l
deque(['b', 'c', 'd'])
  • 从列表的左端提供更高的性能

Python List

list.pop(index)

>>> l = ['a', 'b', 'c', 'd']
>>> l.pop(0)
'a'
>>> l
['b', 'c', 'd']
>>> 

del list[index]

>>> l = ['a', 'b', 'c', 'd']
>>> del l[0]
>>> l
['b', 'c', 'd']
>>> 

These both modify your original list.

Others have suggested using slicing:

  • Copies the list
  • Can return a subset

Also, if you are performing many pop(0), you should look at collections.deque

from collections import deque
>>> l = deque(['a', 'b', 'c', 'd'])
>>> l.popleft()
'a'
>>> l
deque(['b', 'c', 'd'])
  • Provides higher performance popping from left end of the list

回答 1

切片:

x = [0,1,2,3,4]
x = x[1:]

这实际上将返回原始的子集,但不会对其进行修改。

Slicing:

x = [0,1,2,3,4]
x = x[1:]

Which would actually return a subset of the original but not modify it.


回答 2

>>> x = [0, 1, 2, 3, 4]
>>> x.pop(0)
0

更多关于此这里

>>> x = [0, 1, 2, 3, 4]
>>> x.pop(0)
0

More on this here.


回答 3

使用列表切片,请参阅有关列表的Python教程以获取更多详细信息:

>>> l = [0, 1, 2, 3, 4]
>>> l[1:]
[1, 2, 3, 4]

With list slicing, see the Python tutorial about lists for more details:

>>> l = [0, 1, 2, 3, 4]
>>> l[1:]
[1, 2, 3, 4]

回答 4

你会这样做

l = [0, 1, 2, 3, 4]
l.pop(0)

要么 l = l[1:]

利弊

使用pop可以获取值

x = l.pop(0) x0

you would just do this

l = [0, 1, 2, 3, 4]
l.pop(0)

or l = l[1:]

Pros and Cons

Using pop you can retrieve the value

say x = l.pop(0) x would be 0


回答 5

然后将其删除:

x = [0, 1, 2, 3, 4]
del x[0]
print x
# [1, 2, 3, 4]

Then just delete it:

x = [0, 1, 2, 3, 4]
del x[0]
print x
# [1, 2, 3, 4]

回答 6

您可以list.reverse()用来反转列表,然后list.pop()删除最后一个元素,例如:

l = [0, 1, 2, 3, 4]
l.reverse()
print l
[4, 3, 2, 1, 0]


l.pop()
0
l.pop()
1
l.pop()
2
l.pop()
3
l.pop()
4

You can use list.reverse() to reverse the list, then list.pop() to remove the last element, for example:

l = [0, 1, 2, 3, 4]
l.reverse()
print l
[4, 3, 2, 1, 0]


l.pop()
0
l.pop()
1
l.pop()
2
l.pop()
3
l.pop()
4

回答 7

您也可以使用list.remove(a[0])pop删除列表中的第一个元素。

>>>> a=[1,2,3,4,5]
>>>> a.remove(a[0])
>>>> print a
>>>> [2,3,4,5]

You can also use list.remove(a[0]) to pop out the first element in the list.

>>>> a=[1,2,3,4,5]
>>>> a.remove(a[0])
>>>> print a
>>>> [2,3,4,5]

回答 8

如果使用numpy,则需要使用delete方法:

import numpy as np

a = np.array([1, 2, 3, 4, 5])

a = np.delete(a, 0)

print(a) # [2 3 4 5]

If you are working with numpy you need to use the delete method:

import numpy as np

a = np.array([1, 2, 3, 4, 5])

a = np.delete(a, 0)

print(a) # [2 3 4 5]

回答 9

有一个称为“双端队列”或双头队列的数据结构,它比列表更快,更高效。您可以使用列表并将其转换为双端队列,并在其中进行所需的转换。您也可以将双端队列转换回列表。

import collections
mylist = [0, 1, 2, 3, 4]

#make a deque from your list
de = collections.deque(mylist)

#you can remove from a deque from either left side or right side
de.popleft()
print(de)

#you can covert the deque back to list
mylist = list(de)
print(mylist)

Deque还提供了非常有用的功能,例如将元素插入列表的任一侧或任何特定的索引。您也可以旋转或反转双端队列。试试看!!

There is a datastructure called “deque” or double ended queue which is faster and efficient than a list. You can use your list and convert it to deque and do the required transformations in it. You can also convert the deque back to list.

import collections
mylist = [0, 1, 2, 3, 4]

#make a deque from your list
de = collections.deque(mylist)

#you can remove from a deque from either left side or right side
de.popleft()
print(de)

#you can covert the deque back to list
mylist = list(de)
print(mylist)

Deque also provides very useful functions like inserting elements to either side of the list or to any specific index. You can also rotate or reverse a deque. Give it a try!!


arr .__ len __()是在Python中获取数组长度的首选方法吗?

问题:arr .__ len __()是在Python中获取数组长度的首选方法吗?

Python中,以下是获取元素数量的唯一方法吗?

arr.__len__()

如果是这样,为什么会有奇怪的语法?

In Python, is the following the only way to get the number of elements?

arr.__len__()

If so, why the strange syntax?


回答 0

my_list = [1,2,3,4,5]
len(my_list)
# 5

对于元组也是如此:

my_tuple = (1,2,3,4,5)
len(my_tuple)
# 5

和字符串,它们实际上只是字符数组:

my_string = 'hello world'
len(my_string)
# 11

这样做的目的是为了使列表,元组和其他容器类型或可迭代对象都不需要显式实现公共.length()方法,而只需检查len()实现了“魔术” __len__()方法的所有内容即可。

当然,这似乎是多余的,但是长度检查的实现可能会有很大的不同,即使在同一语言中也是如此。经常会看到一种类型的集合使用一种.length()方法,而另一种类型使用一种.length属性,而另一种类型使用.count()。使用语言级别的关键字可以统一所有这些类型的入口点。因此,即使您可能不认为是元素列表的对象也可以进行长度检查。这包括字符串,队列,树等。

的功能性质len()也很适合编程的功能样式。

lengths = map(len, list_of_containers)
my_list = [1,2,3,4,5]
len(my_list)
# 5

The same works for tuples:

my_tuple = (1,2,3,4,5)
len(my_tuple)
# 5

And strings, which are really just arrays of characters:

my_string = 'hello world'
len(my_string)
# 11

It was intentionally done this way so that lists, tuples and other container types or iterables didn’t all need to explicitly implement a public .length() method, instead you can just check the len() of anything that implements the ‘magic’ __len__() method.

Sure, this may seem redundant, but length checking implementations can vary considerably, even within the same language. It’s not uncommon to see one collection type use a .length() method while another type uses a .length property, while yet another uses .count(). Having a language-level keyword unifies the entry point for all these types. So even objects you may not consider to be lists of elements could still be length-checked. This includes strings, queues, trees, etc.

The functional nature of len() also lends itself well to functional styles of programming.

lengths = map(len, list_of_containers)

回答 1

取任何有意义的长度(列表,字典,元组,字符串等)的方法是调用len它。

l = [1,2,3,4]
s = 'abcde'
len(l) #returns 4
len(s) #returns 5

使用“奇怪”语法的原因是python在内部翻译len(object)object.__len__()。这适用于任何对象。因此,如果您正在定义某个类,并且让它具有长度是有意义的,则只需__len__()在其上定义一个方法,然后人们便可以调用len这些实例。

The way you take a length of anything for which that makes sense (a list, dictionary, tuple, string, …) is to call len on it.

l = [1,2,3,4]
s = 'abcde'
len(l) #returns 4
len(s) #returns 5

The reason for the “strange” syntax is that internally python translates len(object) into object.__len__(). This applies to any object. So, if you are defining some class and it makes sense for it to have a length, just define a __len__() method on it and then one can call len on those instances.


回答 2

Python使用鸭子类型:它不在乎对象什么,只要它具有适合当前情况的适当接口即可。当您在对象上调用内置函数len()时,实际上是在调用其内部__len__方法。自定义对象可以实现此接口,并且len()将返回答案,即使该对象在概念上不是序列。

有关接口的完整列表,请在此处查看:http : //docs.python.org/reference/datamodel.html#basic-customization

Python uses duck typing: it doesn’t care about what an object is, as long as it has the appropriate interface for the situation at hand. When you call the built-in function len() on an object, you are actually calling its internal __len__ method. A custom object can implement this interface and len() will return the answer, even if the object is not conceptually a sequence.

For a complete list of interfaces, have a look here: http://docs.python.org/reference/datamodel.html#basic-customization


回答 3

获取任何python对象长度的首选方法是将其作为参数传递给len函数。然后在内部,python将尝试调用__len__所传递对象的特殊方法。

The preferred way to get the length of any python object is to pass it as an argument to the len function. Internally, python will then try to call the special __len__ method of the object that was passed.


回答 4

只需使用len(arr)

>>> import array
>>> arr = array.array('i')
>>> arr.append('2')
>>> arr.__len__()
1
>>> len(arr)
1

Just use len(arr):

>>> import array
>>> arr = array.array('i')
>>> arr.append('2')
>>> arr.__len__()
1
>>> len(arr)
1

回答 5

您可以len(arr) 按照前面的答案中的建议使用,以获取数组的长度。如果需要二维数组的尺寸,可以使用arr.shape返回高度和宽度

you can use len(arr) as suggested in previous answers to get the length of the array. In case you want the dimensions of a 2D array you could use arr.shape returns height and width


回答 6

len(list_name)函数以list作为参数,并调用list的__len__()函数。

len(list_name) function takes list as a parameter and it calls list’s __len__() function.


回答 7

Python建议用户使用len()而不是__len__()为了保持一致性,就像其他人所说的那样。但是,还有其他一些好处:

对于一些内置的类型,如liststrbytearray等等,在用Cython实现len()需要一个快捷方式。它直接ob_size以C结构返回,比调用更快__len__()

如果您对这样的细节感兴趣,可以阅读Luciano Ramalho的书“ Fluent Python”。其中包含许多有趣的细节,可能有助于您更深入地了解Python。

Python suggests users use len() instead of __len__() for consistency, just like other guys said. However, There’re some other benefits:

For some built-in types like list, str, bytearray and so on, the Cython implementation of len() takes a shortcut. It directly returns the ob_size in a C structure, which is faster than calling __len__().

If you are interested in such details, you could read the book called “Fluent Python” by Luciano Ramalho. There’re many interesting details in it, and may help you understand Python more deeply.