Tag archive: numpy

Converting a NumPy array to a Python list structure?

Question: Converting a NumPy array to a Python list structure?

How do I convert a NumPy array to a Python List (for example [[1,2,3],[4,5,6]] ), and do it reasonably fast?


Answer 0

Use tolist():

>>> import numpy as np
>>> np.array([[1,2,3],[4,5,6]]).tolist()
[[1, 2, 3], [4, 5, 6]]

Note that this converts the values from whatever numpy type they may have (e.g. np.int32 or np.float32) to the “nearest compatible Python type” (in a list). If you want to preserve the numpy data types, you could call list() on your array instead, and you’ll end up with a list of numpy scalars. (Thanks to Mr_and_Mrs_D for pointing that out in a comment.)
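
To make the type difference concrete, here is a small sketch (the variable names are illustrative, not from the answer) comparing the element types you get from tolist() and from list() on a 1D array:

```python
import numpy as np

arr = np.array([1, 2, 3], dtype=np.int32)

as_python = arr.tolist()   # elements become built-in Python ints
as_numpy = list(arr)       # elements stay NumPy scalars

print(type(as_python[0]))  # <class 'int'>
print(type(as_numpy[0]))   # <class 'numpy.int32'>
```

Note that for a 2D array, list() only unpacks the first axis, so the result is a list of 1D arrays rather than a nested list.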


Answer 1

The numpy .tolist method produces nested lists if the numpy array is 2D.

If a flat list is desired, the method below works.

import numpy as np
from itertools import chain

a = [1,2,3,4,5,6,7,8,9]
print(type(a), len(a), a)
npa = np.asarray(a)
print(type(npa), npa.shape, "\n", npa)
npa = npa.reshape((3, 3))
print(type(npa), npa.shape, "\n", npa)
# Chain the rows together into one flat list (elements remain numpy scalars)
a = list(chain.from_iterable(npa))
print(type(a), len(a), a)
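
As an alternative that is not part of the answer above, a flat list of plain Python numbers can also be obtained by flattening the array before calling tolist(). A minimal sketch, assuming a small 3x3 integer array:

```python
import numpy as np

npa = np.arange(1, 10).reshape((3, 3))

flat = npa.ravel().tolist()   # flatten first, then convert to Python ints
print(flat)                   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(type(flat[0]))          # <class 'int'>
```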

Answer 2

tolist() works fine even on a nested array, for example the values of a pandas DataFrame:

import pandas as pd

my_list = [0,1,2,3,4,5,4,3,2,1,0]
my_dt = pd.DataFrame(my_list)
# my_dt.values is a 2D array of shape (11, 1); take the single element of each row
new_list = [i[0] for i in my_dt.values.tolist()]

print(type(my_list),type(my_dt),type(new_list))

Answer 3

someList = [list(map(int, input().split())) for i in range(N)]


Answer 4

c = np.array([[1,2,3],[4,5,6]])

list(c.flatten())

What is the difference between the flatten and ravel functions in numpy?

Question: What is the difference between the flatten and ravel functions in numpy?

import numpy as np
y = np.array(((1,2,3),(4,5,6),(7,8,9)))
OUTPUT:
print(y.flatten())
[1   2   3   4   5   6   7   8   9]
print(y.ravel())
[1   2   3   4   5   6   7   8   9]

Both functions return the same result. So why are there two different functions that do the same job?


Answer 0

The current API is that:

  • flatten always returns a copy.
  • ravel returns a view of the original array whenever possible. This isn’t visible in the printed output, but if you modify the array returned by ravel, it may modify the entries in the original array. If you modify the entries in an array returned from flatten this will never happen. ravel will often be faster since no memory is copied, but you have to be more careful about modifying the array it returns.
  • reshape((-1,)) gets a view whenever the strides of the array allow it even if that means you don’t always get a contiguous array.
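
The copy-versus-view behaviour described above can be observed by writing into the results. This is a minimal sketch with illustrative variable names; it assumes a small C-contiguous integer array, for which ravel can return a view:

```python
import numpy as np

y = np.array([[1, 2, 3], [4, 5, 6]])

flat_copy = y.flatten()   # always a copy
flat_view = y.ravel()     # a view here, because y is C-contiguous

flat_copy[0] = 100        # does not touch y
flat_view[1] = 200        # writes through to y

print(y)
# [[  1 200   3]
#  [  4   5   6]]

# For a non-contiguous array (a transpose), ravel has to copy,
# so the result no longer shares memory with the original.
t = y.T
print(np.shares_memory(t, t.ravel()))  # False
```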

Answer 1

As explained here a key difference is that:

  • flatten is a method of an ndarray object and hence can only be called for true numpy arrays.

  • ravel is a library-level function and hence can be called on any object that can successfully be parsed.

For example ravel will work on a list of ndarrays, while flatten is not available for that type of object.

@IanH also points out important differences with memory handling in his answer.
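
A short sketch of the point about lists of ndarrays (the name parts is illustrative): the library-level np.ravel accepts the list directly, while the list itself has no flatten method until it is converted to an array.

```python
import numpy as np

parts = [np.array([1, 2]), np.array([3, 4])]  # a plain list of ndarrays

flat = np.ravel(parts)        # the list is converted to an array first
print(flat)                   # [1 2 3 4]

print(hasattr(parts, "flatten"))       # False: lists have no flatten method
print(np.asarray(parts).flatten())     # [1 2 3 4] after explicit conversion
```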


Answer 2

Here is the correct namespace for the functions:

Both functions return flattened 1D arrays pointing to the new memory structures.

import numpy
a = numpy.array([[1,2],[3,4]])

r = numpy.ravel(a)
f = numpy.ndarray.flatten(a)  

print(id(a))
print(id(r))
print(id(f))

print(r)
print(f)

print("\nbase r:", r.base)
print("\nbase f:", f.base)

---returns---
140541099429760
140541099471056
140541099473216

[1 2 3 4]
[1 2 3 4]

base r: [[1 2]
 [3 4]]

base f: None

In the example above:

  • the memory locations of the results are different,
  • the results look the same,
  • flatten returns a copy,
  • ravel returns a view.

How do we check whether something is a copy? Use the .base attribute of the ndarray. If it is a view, the base will be the original array; if it is a copy, the base will be None.


How to add an extra column to a NumPy array

Question: How to add an extra column to a NumPy array

Let’s say I have a NumPy array, a:

a = np.array([
    [1, 2, 3],
    [2, 3, 4]
    ])

And I would like to add a column of zeros to get an array, b:

b = np.array([
    [1, 2, 3, 0],
    [2, 3, 4, 0]
    ])

How can I do this easily in NumPy?


Answer 0

I think a more straightforward solution and faster to boot is to do the following:

import numpy as np
N = 10
a = np.random.rand(N,N)
b = np.zeros((N,N+1))
b[:,:-1] = a

And timings:

In [23]: N = 10

In [24]: a = np.random.rand(N,N)

In [25]: %timeit b = np.hstack((a,np.zeros((a.shape[0],1))))
10000 loops, best of 3: 19.6 us per loop

In [27]: %timeit b = np.zeros((a.shape[0],a.shape[1]+1)); b[:,:-1] = a
100000 loops, best of 3: 5.62 us per loop
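
Applied to the integer array from the question, the preallocate-and-assign approach might look like the sketch below. Passing dtype=a.dtype is an addition here (the answer uses the default float dtype) so that the result stays an integer array:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [2, 3, 4]])

# Allocate the result with one extra column, matching a's dtype,
# then copy a into all but the last column.
b = np.zeros((a.shape[0], a.shape[1] + 1), dtype=a.dtype)
b[:, :-1] = a

print(b)
# [[1 2 3 0]
#  [2 3 4 0]]
```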

Answer 1

np.r_[ ... ] and np.c_[ ... ] are useful alternatives to vstack and hstack, with square brackets [] instead of round ().
A couple of examples:

: import numpy as np
: N = 3
: A = np.eye(N)

: np.c_[ A, np.ones(N) ]              # add a column
array([[ 1.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  1.],
       [ 0.,  0.,  1.,  1.]])

: np.c_[ np.ones(N), A, np.ones(N) ]  # or two
array([[ 1.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  1.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  1.]])

: np.r_[ A, [A[1]] ]              # add a row
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  1.,  0.]])
: # not np.r_[ A, A[1] ]

: np.r_[ A[0], 1, 2, 3, A[1] ]    # mix vecs and scalars
  array([ 1.,  0.,  0.,  1.,  2.,  3.,  0.,  1.,  0.])

: np.r_[ A[0], [1, 2, 3], A[1] ]  # lists
  array([ 1.,  0.,  0.,  1.,  2.,  3.,  0.,  1.,  0.])

: np.r_[ A[0], (1, 2, 3), A[1] ]  # tuples
  array([ 1.,  0.,  0.,  1.,  2.,  3.,  0.,  1.,  0.])

: np.r_[ A[0], 1:4, A[1] ]        # same, 1:4 == arange(1,4) == 1,2,3
  array([ 1.,  0.,  0.,  1.,  2.,  3.,  0.,  1.,  0.])

(The reason for square brackets [] instead of round () is that Python expands e.g. 1:4 inside square brackets, thanks to the wonders of overloading.)
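
For the array in the question, the np.c_ approach could be sketched as follows. The dtype=a.dtype argument is an addition so that the zero column does not turn the integer array into floats:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [2, 3, 4]])

# A 1D vector is treated as a column by np.c_.
b = np.c_[a, np.zeros(a.shape[0], dtype=a.dtype)]

print(b)
# [[1 2 3 0]
#  [2 3 4 0]]
```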


Answer 2

Use numpy.append:

>>> a = np.array([[1,2,3],[2,3,4]])
>>> a
array([[1, 2, 3],
       [2, 3, 4]])

>>> z = np.zeros((2,1), dtype=np.int64)
>>> z
array([[0],
       [0]])

>>> np.append(a, z, axis=1)
array([[1, 2, 3, 0],
       [2, 3, 4, 0]])

Answer 3

One way, using hstack, is:

b = np.hstack((a, np.zeros((a.shape[0], 1), dtype=a.dtype)))

Answer 4

I find the following most elegant:

b = np.insert(a, 3, values=0, axis=1) # Insert values before column 3

An advantage of insert is that it also allows you to insert columns (or rows) at other places inside the array. Also instead of inserting a single value you can easily insert a whole vector, for instance duplicate the last column:

b = np.insert(a, insert_index, values=a[:,2], axis=1)

Which leads to:

array([[1, 2, 3, 3],
       [2, 3, 4, 4]])

For the timing, insert might be slower than JoshAdel’s solution:

In [1]: N = 10

In [2]: a = np.random.rand(N,N)

In [3]: %timeit b = np.hstack((a, np.zeros((a.shape[0], 1))))
100000 loops, best of 3: 7.5 µs per loop

In [4]: %timeit b = np.zeros((a.shape[0], a.shape[1]+1)); b[:,:-1] = a
100000 loops, best of 3: 2.17 µs per loop

In [5]: %timeit b = np.insert(a, 3, values=0, axis=1)
100000 loops, best of 3: 10.2 µs per loop
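
A runnable sketch of the two np.insert uses described above, on the array from the question. The index 3 used for the duplicated column is an assumption, since the answer leaves insert_index undefined; an insert at position 0 is added to show that other positions work too:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [2, 3, 4]])

front = np.insert(a, 0, values=0, axis=1)        # zeros before column 0
dup = np.insert(a, 3, values=a[:, 2], axis=1)    # copy of the last column at the end

print(front)
# [[0 1 2 3]
#  [0 2 3 4]]
print(dup)
# [[1 2 3 3]
#  [2 3 4 4]]
```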

Answer 5

I was also interested in this question and compared the speed of

numpy.c_[a, a]
numpy.stack([a, a]).T
numpy.vstack([a, a]).T
numpy.ascontiguousarray(numpy.stack([a, a]).T)               
numpy.ascontiguousarray(numpy.vstack([a, a]).T)
numpy.column_stack([a, a])
numpy.concatenate([a[:,None], a[:,None]], axis=1)
numpy.concatenate([a[None], a[None]], axis=0).T

which all do the same thing for any input vector a. Timings for growing a:

[Plot: runtime of each variant against len(a), on log-log axes]

Note that all non-contiguous variants (in particular stack/vstack) are eventually faster than all contiguous variants. column_stack (for its clarity and speed) appears to be a good option if you require contiguity.


Code to reproduce the plot:

import numpy
import perfplot

perfplot.save(
    "out.png",
    setup=lambda n: numpy.random.rand(n),
    kernels=[
        lambda a: numpy.c_[a, a],
        lambda a: numpy.ascontiguousarray(numpy.stack([a, a]).T),
        lambda a: numpy.ascontiguousarray(numpy.vstack([a, a]).T),
        lambda a: numpy.column_stack([a, a]),
        lambda a: numpy.concatenate([a[:, None], a[:, None]], axis=1),
        lambda a: numpy.ascontiguousarray(
            numpy.concatenate([a[None], a[None]], axis=0).T
        ),
        lambda a: numpy.stack([a, a]).T,
        lambda a: numpy.vstack([a, a]).T,
        lambda a: numpy.concatenate([a[None], a[None]], axis=0).T,
    ],
    labels=[
        "c_",
        "ascont(stack)",
        "ascont(vstack)",
        "column_stack",
        "concat",
        "ascont(concat)",
        "stack (non-cont)",
        "vstack (non-cont)",
        "concat (non-cont)",
    ],
    n_range=[2 ** k for k in range(20)],
    xlabel="len(a)",
    logx=True,
    logy=True,
)

Answer 6

I think:

np.column_stack((a, np.zeros(np.shape(a)[0])))

is more elegant.


Answer 7

np.concatenate also works

>>> a = np.array([[1,2,3],[2,3,4]])
>>> a
array([[1, 2, 3],
       [2, 3, 4]])
>>> z = np.zeros((2,1))
>>> z
array([[ 0.],
       [ 0.]])
>>> np.concatenate((a, z), axis=1)
array([[ 1.,  2.,  3.,  0.],
       [ 2.,  3.,  4.,  0.]])
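
The float output above comes from np.zeros defaulting to float64. A small sketch (variable names are illustrative) showing how matching the dtype keeps the result an integer array:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [2, 3, 4]])

z_float = np.zeros((a.shape[0], 1))                 # default dtype is float64
z_int = np.zeros((a.shape[0], 1), dtype=a.dtype)    # matches a

print(np.concatenate((a, z_float), axis=1).dtype)   # float64
print(np.concatenate((a, z_int), axis=1).dtype == a.dtype)  # True
```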

Answer 8

Assuming M is a (100,3) ndarray and y is a (100,) ndarray, append can be used as follows:

M=numpy.append(M,y[:,None],1)

The trick is to use

y[:, None]

This converts y to a (100, 1) 2D array.

M.shape

now gives

(100, 4)
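
Put together as a runnable sketch, with placeholder contents for M and y since the answer only specifies their shapes:

```python
import numpy as np

M = np.zeros((100, 3))
y = np.arange(100, dtype=float)   # shape (100,)

col = y[:, None]                  # shape (100, 1); same as y[:, np.newaxis]
M = np.append(M, col, axis=1)

print(col.shape)  # (100, 1)
print(M.shape)    # (100, 4)
```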

Answer 9

I like JoshAdel’s answer because of the focus on performance. A minor performance improvement is to avoid the overhead of initializing with zeros, only to be overwritten. This has a measurable difference when N is large, empty is used instead of zeros, and the column of zeros is written as a separate step:

In [1]: import numpy as np

In [2]: N = 10000

In [3]: a = np.ones((N,N))

In [4]: %timeit b = np.zeros((a.shape[0],a.shape[1]+1)); b[:,:-1] = a
1 loops, best of 3: 492 ms per loop

In [5]: %timeit b = np.empty((a.shape[0],a.shape[1]+1)); b[:,:-1] = a; b[:,-1] = np.zeros((a.shape[0],))
1 loops, best of 3: 407 ms per loop

Answer 10

np.insert also serves the purpose.

matA = np.array([[1,2,3], 
                 [2,3,4]])
idx = 3
new_col = np.array([0, 0])
np.insert(matA, idx, new_col, axis=1)

array([[1, 2, 3, 0],
       [2, 3, 4, 0]])

It inserts values (here new_col) before a given index (here idx) along one axis. In other words, the newly inserted values occupy column idx, and whatever was originally at or after idx is shifted one position later.


Answer 11

Add an extra column to a numpy array:

NumPy's np.append function takes three arguments here: the first two are 2D numpy arrays and the third is an axis parameter specifying the axis along which to append:

import numpy as np  
x = np.array([[1,2,3], [4,5,6]]) 
print("Original x:") 
print(x) 

y = np.array([[1], [1]]) 
print("Original y:") 
print(y) 

print("x appended to y on axis of 1:") 
print(np.append(x, y, axis=1)) 

Prints:

Original x:
[[1 2 3]
 [4 5 6]]
Original y:
[[1]
 [1]]
x appended to y on axis of 1:
[[1 2 3 1]
 [4 5 6 1]]

Answer 12

A bit late to the party, but nobody has posted this answer yet, so for the sake of completeness: you can do this with a list comprehension on a plain Python list:

source = a.tolist()
result = [row + [0] for row in source]
b = np.array(result)

Answer 13

In my case, I had to add a column of ones to a NumPy array

X = array([ 6.1101, 5.5277, ... ])
X.shape => (97,)
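m = X.shape[0]  # number of rows, 97 here
# np.int was removed in NumPy 1.24; the built-in int is equivalent
X = np.concatenate((np.ones((m,1), dtype=int), X.reshape(m,1)), axis=1)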

After X.shape => (97, 2)

array([[ 1. , 6.1101],
       [ 1. , 5.5277],
...

Answer 14

For me, the following way looks pretty intuitive and simple.

zeros = np.zeros((2,1)) #2 is a number of rows in your array.   
b = np.hstack((a, zeros))

Answer 15

There is a function specifically for this. It is called numpy.pad:

a = np.array([[1,2,3], [2,3,4]])
b = np.pad(a, ((0, 0), (0, 1)), mode='constant', constant_values=0)
print(b)
# [[1 2 3 0]
#  [2 3 4 0]]

Here is what it says in the docstring:

Pads an array.

Parameters
----------
array : array_like of rank N
    Input array
pad_width : {sequence, array_like, int}
    Number of values padded to the edges of each axis.
    ((before_1, after_1), ... (before_N, after_N)) unique pad widths
    for each axis.
    ((before, after),) yields same before and after pad for each axis.
    (pad,) or int is a shortcut for before = after = pad width for all
    axes.
mode : str or function
    One of the following string values or a user supplied function.

    'constant'
        Pads with a constant value.
    'edge'
        Pads with the edge values of array.
    'linear_ramp'
        Pads with the linear ramp between end_value and the
        array edge value.
    'maximum'
        Pads with the maximum value of all or part of the
        vector along each axis.
    'mean'
        Pads with the mean value of all or part of the
        vector along each axis.
    'median'
        Pads with the median value of all or part of the
        vector along each axis.
    'minimum'
        Pads with the minimum value of all or part of the
        vector along each axis.
    'reflect'
        Pads with the reflection of the vector mirrored on
        the first and last values of the vector along each
        axis.
    'symmetric'
        Pads with the reflection of the vector mirrored
        along the edge of the array.
    'wrap'
        Pads with the wrap of the vector along the axis.
        The first values are used to pad the end and the
        end values are used to pad the beginning.
    <function>
        Padding function, see Notes.
stat_length : sequence or int, optional
    Used in 'maximum', 'mean', 'median', and 'minimum'.  Number of
    values at edge of each axis used to calculate the statistic value.

    ((before_1, after_1), ... (before_N, after_N)) unique statistic
    lengths for each axis.

    ((before, after),) yields same before and after statistic lengths
    for each axis.

    (stat_length,) or int is a shortcut for before = after = statistic
    length for all axes.

    Default is ``None``, to use the entire axis.
constant_values : sequence or int, optional
    Used in 'constant'.  The values to set the padded values for each
    axis.

    ((before_1, after_1), ... (before_N, after_N)) unique pad constants
    for each axis.

    ((before, after),) yields same before and after constants for each
    axis.

    (constant,) or int is a shortcut for before = after = constant for
    all axes.

    Default is 0.
end_values : sequence or int, optional
    Used in 'linear_ramp'.  The values used for the ending value of the
    linear_ramp and that will form the edge of the padded array.

    ((before_1, after_1), ... (before_N, after_N)) unique end values
    for each axis.

    ((before, after),) yields same before and after end values for each
    axis.

    (constant,) or int is a shortcut for before = after = end value for
    all axes.

    Default is 0.
reflect_type : {'even', 'odd'}, optional
    Used in 'reflect', and 'symmetric'.  The 'even' style is the
    default with an unaltered reflection around the edge value.  For
    the 'odd' style, the extented part of the array is created by
    subtracting the reflected values from two times the edge value.

Returns
-------
pad : ndarray
    Padded array of rank equal to `array` with shape increased
    according to `pad_width`.

Notes
-----
.. versionadded:: 1.7.0

For an array with rank greater than 1, some of the padding of later
axes is calculated from padding of previous axes.  This is easiest to
think about with a rank 2 array where the corners of the padded array
are calculated by using padded values from the first axis.

The padding function, if used, should return a rank 1 array equal in
length to the vector argument with padded values replaced. It has the
following signature::

    padding_func(vector, iaxis_pad_width, iaxis, kwargs)

where

    vector : ndarray
        A rank 1 array already padded with zeros.  Padded values are
        vector[:pad_tuple[0]] and vector[-pad_tuple[1]:].
    iaxis_pad_width : tuple
        A 2-tuple of ints, iaxis_pad_width[0] represents the number of
        values padded at the beginning of vector where
        iaxis_pad_width[1] represents the number of values padded at
        the end of vector.
    iaxis : int
        The axis currently being calculated.
    kwargs : dict
        Any keyword arguments the function requires.

Examples
--------
>>> a = [1, 2, 3, 4, 5]
>>> np.pad(a, (2,3), 'constant', constant_values=(4, 6))
array([4, 4, 1, 2, 3, 4, 5, 6, 6, 6])

>>> np.pad(a, (2, 3), 'edge')
array([1, 1, 1, 2, 3, 4, 5, 5, 5, 5])

>>> np.pad(a, (2, 3), 'linear_ramp', end_values=(5, -4))
array([ 5,  3,  1,  2,  3,  4,  5,  2, -1, -4])

>>> np.pad(a, (2,), 'maximum')
array([5, 5, 1, 2, 3, 4, 5, 5, 5])

>>> np.pad(a, (2,), 'mean')
array([3, 3, 1, 2, 3, 4, 5, 3, 3])

>>> np.pad(a, (2,), 'median')
array([3, 3, 1, 2, 3, 4, 5, 3, 3])

>>> a = [[1, 2], [3, 4]]
>>> np.pad(a, ((3, 2), (2, 3)), 'minimum')
array([[1, 1, 1, 2, 1, 1, 1],
       [1, 1, 1, 2, 1, 1, 1],
       [1, 1, 1, 2, 1, 1, 1],
       [1, 1, 1, 2, 1, 1, 1],
       [3, 3, 3, 4, 3, 3, 3],
       [1, 1, 1, 2, 1, 1, 1],
       [1, 1, 1, 2, 1, 1, 1]])

>>> a = [1, 2, 3, 4, 5]
>>> np.pad(a, (2, 3), 'reflect')
array([3, 2, 1, 2, 3, 4, 5, 4, 3, 2])

>>> np.pad(a, (2, 3), 'reflect', reflect_type='odd')
array([-1,  0,  1,  2,  3,  4,  5,  6,  7,  8])

>>> np.pad(a, (2, 3), 'symmetric')
array([2, 1, 1, 2, 3, 4, 5, 5, 4, 3])

>>> np.pad(a, (2, 3), 'symmetric', reflect_type='odd')
array([0, 1, 1, 2, 3, 4, 5, 5, 6, 7])

>>> np.pad(a, (2, 3), 'wrap')
array([4, 5, 1, 2, 3, 4, 5, 1, 2, 3])

>>> def pad_with(vector, pad_width, iaxis, kwargs):
...     pad_value = kwargs.get('padder', 10)
...     vector[:pad_width[0]] = pad_value
...     vector[-pad_width[1]:] = pad_value
...     return vector
>>> a = np.arange(6)
>>> a = a.reshape((2, 3))
>>> np.pad(a, 2, pad_with)
array([[10, 10, 10, 10, 10, 10, 10],
       [10, 10, 10, 10, 10, 10, 10],
       [10, 10,  0,  1,  2, 10, 10],
       [10, 10,  3,  4,  5, 10, 10],
       [10, 10, 10, 10, 10, 10, 10],
       [10, 10, 10, 10, 10, 10, 10]])
>>> np.pad(a, 2, pad_with, padder=100)
array([[100, 100, 100, 100, 100, 100, 100],
       [100, 100, 100, 100, 100, 100, 100],
       [100, 100,   0,   1,   2, 100, 100],
       [100, 100,   3,   4,   5, 100, 100],
       [100, 100, 100, 100, 100, 100, 100],
       [100, 100, 100, 100, 100, 100, 100]])
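
One detail the `pad_with` example does not cover is a pad width of zero on the trailing side: `vector[-0:]` selects the whole vector, so the data would be overwritten. A minimal sketch of a guard, assuming the same four-argument signature described above (the name `pad_with_safe` is made up for illustration):

```python
import numpy as np

def pad_with_safe(vector, pad_width, iaxis, kwargs):
    # Hypothetical variant of pad_with that tolerates a zero pad width.
    pad_value = kwargs.get('padder', 10)
    before, after = pad_width
    vector[:before] = pad_value      # no-op when before == 0
    if after > 0:                    # vector[-0:] would select everything
        vector[-after:] = pad_value
    return vector

a = np.arange(6).reshape((2, 3))
# Pad one row before axis 0 and two columns after axis 1.
padded = np.pad(a, ((1, 0), (0, 2)), pad_with_safe, padder=-1)
print(padded)
```

The first row and the last two columns are filled with the pad value, while the original 2x3 block is left intact.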

Converting between datetime, Timestamp and datetime64

Question: Converting between datetime, Timestamp and datetime64

How do I convert a numpy.datetime64 object to a datetime.datetime (or Timestamp)?

In the following code, I create datetime, Timestamp and datetime64 objects.

import datetime
import numpy as np
import pandas as pd
dt = datetime.datetime(2012, 5, 1)
# A strange way to extract a Timestamp object, there's surely a better way?
ts = pd.DatetimeIndex([dt])[0]
dt64 = np.datetime64(dt)

In [7]: dt
Out[7]: datetime.datetime(2012, 5, 1, 0, 0)

In [8]: ts
Out[8]: <Timestamp: 2012-05-01 00:00:00>

In [9]: dt64
Out[9]: numpy.datetime64('2012-05-01T01:00:00.000000+0100')

Note: it’s easy to get the datetime from the Timestamp:

In [10]: ts.to_datetime()
Out[10]: datetime.datetime(2012, 5, 1, 0, 0)

But how do we extract the datetime or Timestamp from a numpy.datetime64 (dt64)?

Update: a somewhat nasty example in my dataset (perhaps the motivating example) seems to be:

dt64 = numpy.datetime64('2002-06-28T01:00:00.000000000+0100')

which should be datetime.datetime(2002, 6, 28, 1, 0), and not a long (!) (1025222400000000000L)…


Answer 0

To convert a numpy.datetime64 to a datetime object that represents time in UTC on numpy-1.8:

>>> from datetime import datetime
>>> import numpy as np
>>> dt = datetime.utcnow()
>>> dt
datetime.datetime(2012, 12, 4, 19, 51, 25, 362455)
>>> dt64 = np.datetime64(dt)
>>> ts = (dt64 - np.datetime64('1970-01-01T00:00:00Z')) / np.timedelta64(1, 's')
>>> ts
1354650685.3624549
>>> datetime.utcfromtimestamp(ts)
datetime.datetime(2012, 12, 4, 19, 51, 25, 362455)
>>> np.__version__
'1.8.0.dev-7b75899'

The above example assumes that a naive datetime object is interpreted by np.datetime64 as time in UTC.


To convert datetime to np.datetime64 and back (numpy-1.6):

>>> np.datetime64(datetime.utcnow()).astype(datetime)
datetime.datetime(2012, 12, 4, 13, 34, 52, 827542)

It works both on a single np.datetime64 object and a numpy array of np.datetime64.

Think of np.datetime64 the same way you would think of np.int8, np.int16, etc., and apply the same methods to convert between Python objects such as int or datetime and the corresponding numpy objects.

Your “nasty example” works correctly:

>>> from datetime import datetime
>>> import numpy 
>>> numpy.datetime64('2002-06-28T01:00:00.000000000+0100').astype(datetime)
datetime.datetime(2002, 6, 28, 0, 0)
>>> numpy.__version__
'1.6.2' # current version available via pip install numpy

I can reproduce the long value on numpy-1.8.0 installed as:

pip install git+https://github.com/numpy/numpy.git#egg=numpy-dev

The same example:

>>> from datetime import datetime
>>> import numpy
>>> numpy.datetime64('2002-06-28T01:00:00.000000000+0100').astype(datetime)
1025222400000000000L
>>> numpy.__version__
'1.8.0.dev-7b75899'

It returns long because, for the numpy.datetime64 type, .astype(datetime) is equivalent to .astype(object), which returns a Python integer (long) for nanosecond-precision values on numpy-1.8.

To get a datetime object, you could:

>>> dt64.dtype
dtype('<M8[ns]')
>>> ns = 1e-9 # number of seconds in a nanosecond
>>> datetime.utcfromtimestamp(dt64.astype(int) * ns)
datetime.datetime(2002, 6, 28, 0, 0)

To get a datetime64 that uses seconds directly:

>>> dt64 = numpy.datetime64('2002-06-28T01:00:00.000000000+0100', 's')
>>> dt64.dtype
dtype('<M8[s]')
>>> datetime.utcfromtimestamp(dt64.astype(int))
datetime.datetime(2002, 6, 28, 0, 0)

The numpy docs say that the datetime API is experimental and may change in future numpy versions.
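
For readers on current versions, here is a rough sketch of the same epoch arithmetic. It assumes a recent NumPy (where datetime64 is timezone-naive and offsets such as Z or +0100 in the input string trigger a warning) and a recent Python (where utcnow() and utcfromtimestamp() are deprecated). The timestamp is written without an offset and treated as UTC, which is an assumption of this example rather than part of the original answer.

```python
from datetime import datetime, timezone
import numpy as np

# Timezone-naive datetime64; this sketch treats it as UTC.
dt64 = np.datetime64('2002-06-28T01:00:00.000000000')

# Seconds since the UNIX epoch as a float.
seconds = (dt64 - np.datetime64(0, 's')) / np.timedelta64(1, 's')

# Build an aware datetime in UTC, then drop tzinfo if a naive one is wanted.
aware = datetime.fromtimestamp(seconds, tz=timezone.utc)
naive = aware.replace(tzinfo=None)
print(aware, naive)
```

Going through a float loses sub-microsecond digits, which does not matter here because datetime.datetime only stores microseconds anyway.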


Answer 1

You can just use the pd.Timestamp constructor. The following diagram may be useful for this and related questions.

Conversions between time representations
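
The diagram itself is not reproduced in this text. As a stand-in, the following sketch walks through the round trips between the three types using calls that exist in current pandas and NumPy. The variable names follow the question; which conversions the original diagram showed is not known here.

```python
from datetime import datetime
import numpy as np
import pandas as pd

dt = datetime(2012, 5, 1)

# datetime -> Timestamp and datetime -> datetime64
ts = pd.Timestamp(dt)
dt64 = np.datetime64(dt)          # microsecond unit when built from a datetime

# datetime64 -> Timestamp
ts_from_dt64 = pd.Timestamp(dt64)

# Timestamp -> datetime and Timestamp -> datetime64
dt_from_ts = ts.to_pydatetime()
dt64_from_ts = ts.to_datetime64()

# datetime64 (microsecond unit) -> datetime
dt_from_dt64 = dt64.astype(datetime)
```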


Answer 2

Welcome to hell.

You can just pass a datetime64 object to pandas.Timestamp:

In [16]: Timestamp(numpy.datetime64('2012-05-01T01:00:00.000000'))
Out[16]: <Timestamp: 2012-05-01 01:00:00>

I noticed, though, that this doesn’t work correctly in NumPy 1.6.1:

numpy.datetime64('2012-05-01T01:00:00.000000+0100')

Also, pandas.to_datetime can be used (this is off of the dev version, haven’t checked v0.9.1):

In [24]: pandas.to_datetime('2012-05-01T01:00:00.000000+0100')
Out[24]: datetime.datetime(2012, 5, 1, 1, 0, tzinfo=tzoffset(None, 3600))

Answer 3

I think there could be a more consolidated effort in an answer to better explain the relationship between Python’s datetime module, numpy’s datetime64/timedelta64 and pandas’ Timestamp/Timedelta objects.

The datetime standard library of Python

The datetime standard library has four main objects

  • time – only time, measured in hours, minutes, seconds and microseconds
  • date – only year, month and day
  • datetime – All components of time and date
  • timedelta – An amount of time with maximum unit of days

Create these four objects

>>> import datetime
>>> datetime.time(hour=4, minute=3, second=10, microsecond=7199)
datetime.time(4, 3, 10, 7199)

>>> datetime.date(year=2017, month=10, day=24)
datetime.date(2017, 10, 24)

>>> datetime.datetime(year=2017, month=10, day=24, hour=4, minute=3, second=10, microsecond=7199)
datetime.datetime(2017, 10, 24, 4, 3, 10, 7199)

>>> datetime.timedelta(days=3, minutes = 55)
datetime.timedelta(3, 3300)

>>> # add timedelta to datetime
>>> datetime.timedelta(days=3, minutes = 55) + \
    datetime.datetime(year=2017, month=10, day=24, hour=4, minute=3, second=10, microsecond=7199)
datetime.datetime(2017, 10, 27, 4, 58, 10, 7199)

NumPy’s datetime64 and timedelta64 objects

NumPy has no separate date and time objects, just a single datetime64 object to represent a single moment in time. The datetime module’s datetime object has microsecond precision (one-millionth of a second). NumPy’s datetime64 object allows you to set its precision from years all the way to attoseconds (10^-18 seconds). Its constructor is more flexible and can take a variety of inputs.

Construct NumPy’s datetime64 and timedelta64 objects

Pass an integer with a string for the units (the NumPy datetime documentation lists all units). It gets converted to that many units after the UNIX epoch: Jan 1, 1970.

>>> np.datetime64(5, 'ns') 
numpy.datetime64('1970-01-01T00:00:00.000000005')

>>> np.datetime64(1508887504, 's')
numpy.datetime64('2017-10-24T23:25:04')

You can also use strings as long as they are in ISO 8601 format.

>>> np.datetime64('2017-10-24')
numpy.datetime64('2017-10-24')

Timedeltas have a single unit

>>> np.timedelta64(5, 'D') # 5 days
>>> np.timedelta64(10, 'h') # 10 hours

You can also create them by subtracting two datetime64 objects

>>> np.datetime64('2017-10-24T05:30:45.67') - np.datetime64('2017-10-22T12:35:40.123')
numpy.timedelta64(147305547,'ms')

Pandas Timestamp and Timedelta build much more functionality on top of NumPy

A pandas Timestamp is a moment in time very similar to a datetime but with much more functionality. You can construct them with either pd.Timestamp or pd.to_datetime.

>>> pd.Timestamp(1239.1238934) # defaults to nanoseconds
Timestamp('1970-01-01 00:00:00.000001239')

>>> pd.Timestamp(1239.1238934, unit='D') # change units
Timestamp('1973-05-24 02:58:24.355200')

>>> pd.Timestamp('2017-10-24 05') # partial strings work
Timestamp('2017-10-24 05:00:00')

pd.to_datetime works very similarly (with a few more options) and can convert a list of strings into Timestamps.

>>> pd.to_datetime('2017-10-24 05')
Timestamp('2017-10-24 05:00:00')

>>> pd.to_datetime(['2017-1-1', '2017-1-2'])
DatetimeIndex(['2017-01-01', '2017-01-02'], dtype='datetime64[ns]', freq=None)

Converting Python datetime to datetime64 and Timestamp

>>> dt = datetime.datetime(year=2017, month=10, day=24, hour=4, 
                   minute=3, second=10, microsecond=7199)
>>> np.datetime64(dt)
numpy.datetime64('2017-10-24T04:03:10.007199')

>>> pd.Timestamp(dt) # or pd.to_datetime(dt)
Timestamp('2017-10-24 04:03:10.007199')

Converting numpy datetime64 to datetime and Timestamp

>>> dt64 = np.datetime64('2017-10-24 05:34:20.123456')
>>> unix_epoch = np.datetime64(0, 's')
>>> one_second = np.timedelta64(1, 's')
>>> seconds_since_epoch = (dt64 - unix_epoch) / one_second
>>> seconds_since_epoch
1508823260.123456

>>> datetime.datetime.utcfromtimestamp(seconds_since_epoch)
datetime.datetime(2017, 10, 24, 5, 34, 20, 123456)

Convert to Timestamp

>>> pd.Timestamp(dt64)
Timestamp('2017-10-24 05:34:20.123456')

Convert from Timestamp to datetime and datetime64

This is quite easy as pandas timestamps are very powerful

>>> ts = pd.Timestamp('2017-10-24 04:24:33.654321')

>>> ts.to_pydatetime()   # Python's datetime
datetime.datetime(2017, 10, 24, 4, 24, 33, 654321)

>>> ts.to_datetime64()
numpy.datetime64('2017-10-24T04:24:33.654321000')

Answer 4

>>> dt64.tolist()
datetime.datetime(2012, 5, 1, 0, 0)

For a DatetimeIndex, tolist returns a list of datetime objects. For a single datetime64 object, it returns a single datetime object.


Answer 5

If you want to convert an entire pandas series of datetimes to regular python datetimes, you can also use .to_pydatetime().

pd.date_range('20110101','20110102',freq='H').to_pydatetime()

> [datetime.datetime(2011, 1, 1, 0, 0) datetime.datetime(2011, 1, 1, 1, 0)
   datetime.datetime(2011, 1, 1, 2, 0) datetime.datetime(2011, 1, 1, 3, 0)
   ....

It also supports timezones:

pd.date_range('20110101','20110102',freq='H').tz_localize('UTC').tz_convert('Australia/Sydney').to_pydatetime()

[ datetime.datetime(2011, 1, 1, 11, 0, tzinfo=<DstTzInfo 'Australia/Sydney' EST+11:00:00 DST>)
 datetime.datetime(2011, 1, 1, 12, 0, tzinfo=<DstTzInfo 'Australia/Sydney' EST+11:00:00 DST>)
....

NOTE: If you are operating on a Pandas Series you cannot call to_pydatetime() on the entire series. You will need to call .to_pydatetime() on each individual datetime64 using a list comprehension or something similar:

datetimes = [val.to_pydatetime() for val in df.problem_datetime_column]
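
As an alternative to the list comprehension, the Series can be wrapped in a DatetimeIndex, which does have to_pydatetime(). This is a small sketch with a made-up two-element Series, not the `df.problem_datetime_column` from the note above.

```python
from datetime import datetime
import pandas as pd

# Hypothetical Series of datetime64 values, standing in for a DataFrame column.
s = pd.Series(pd.to_datetime(['2011-01-01 00:00', '2011-01-01 01:00']))

# DatetimeIndex.to_pydatetime() returns an object array of datetime.datetime.
datetimes = pd.DatetimeIndex(s).to_pydatetime()
print(list(datetimes))
```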

Answer 6

One option is to use str, and then to_datetime (or similar):

In [11]: str(dt64)
Out[11]: '2012-05-01T01:00:00.000000+0100'

In [12]: pd.to_datetime(str(dt64))
Out[12]: datetime.datetime(2012, 5, 1, 1, 0, tzinfo=tzoffset(None, 3600))

Note: it is not equal to dt because it’s become “offset-aware”:

In [13]: pd.to_datetime(str(dt64)).replace(tzinfo=None)
Out[13]: datetime.datetime(2012, 5, 1, 1, 0)

This seems inelegant.

Update: this can deal with the “nasty example”:

In [21]: dt64 = numpy.datetime64('2002-06-28T01:00:00.000000000+0100')

In [22]: pd.to_datetime(str(dt64)).replace(tzinfo=None)
Out[22]: datetime.datetime(2002, 6, 28, 1, 0)

Answer 7

This post has been up for 4 years and I still struggled with this conversion problem – so the issue is still active in 2017 in some sense. I was somewhat shocked that the numpy documentation does not readily offer a simple conversion algorithm but that’s another story.

I have come across another way to do the conversion that only involves the modules numpy and datetime. It does not require importing pandas, which seems to me to be a lot of code to import for such a simple conversion. I noticed that datetime64.astype(datetime.datetime) will return a datetime.datetime object if the original datetime64 is in microsecond units, while finer units such as nanoseconds return an integer timestamp. I use the xarray module for data I/O from NetCDF files, which uses datetime64 in nanosecond units, making the conversion fail unless you first convert to microsecond units. Here is the example conversion code:

import numpy as np
import datetime

def convert_datetime64_to_datetime( usert: np.datetime64 )->datetime.datetime:
    t = np.datetime64( usert, 'us').astype(datetime.datetime)
    return t

It’s only tested on my machine, which is Python 3.6 with a recent 2017 Anaconda distribution. I have only looked at scalar conversion and have not checked array-based conversions, although I’m guessing it will be fine. Nor have I looked at the numpy datetime64 source code to see if the operation makes sense or not.
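
For the array case that the answer leaves unchecked, the same idea appears to carry over: cast the array to microsecond units first, then ask for Python objects. A minimal sketch with made-up sample values:

```python
from datetime import datetime
import numpy as np

# Nanosecond-precision array, e.g. what xarray typically hands back.
arr = np.array(['2017-10-24T05:34:20.123456789', '2002-06-28T01:00:00'],
               dtype='datetime64[ns]')

# Nanosecond values come back as plain integers ...
as_ints = arr.tolist()

# ... while microsecond values come back as datetime.datetime objects.
# The cast truncates the last three digits (789 nanoseconds are dropped).
as_datetimes = arr.astype('datetime64[us]').tolist()
print(as_datetimes)
```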


Answer 8

I’ve come back to this answer more times than I can count, so I decided to throw together a quick little class, which converts a Numpy datetime64 value to Python datetime value. I hope it helps others out there.

from datetime import datetime
import pandas as pd

class NumpyConverter(object):
    @classmethod
    def to_datetime(cls, dt64, tzinfo=None):
        """
        Converts a Numpy datetime64 to a Python datetime.
        :param dt64: A Numpy datetime64 variable
        :type dt64: numpy.datetime64
        :param tzinfo: The timezone the date / time value is in
        :type tzinfo: pytz.timezone
        :return: A Python datetime variable
        :rtype: datetime
        """
        ts = pd.to_datetime(dt64)
        if tzinfo is not None:
            return datetime(ts.year, ts.month, ts.day, ts.hour, ts.minute, ts.second, tzinfo=tzinfo)
        return datetime(ts.year, ts.month, ts.day, ts.hour, ts.minute, ts.second)

I’m gonna keep this in my tool bag, something tells me I’ll need it again.
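
Note that the class above copies fields only down to ts.second, so microseconds are dropped. If sub-second precision matters, a shorter variant is possible. This is a sketch, not the answer's own method; the function name `to_datetime_us` is made up, and it assumes a fixed-offset tzinfo such as datetime.timezone.utc (pytz zones generally need localize() rather than replace()).

```python
from datetime import datetime, timezone
import numpy as np
import pandas as pd

def to_datetime_us(dt64, tzinfo=None):
    # Hypothetical helper: keeps microseconds; nanoseconds cannot be
    # represented by datetime.datetime and are discarded by to_pydatetime().
    return pd.Timestamp(dt64).to_pydatetime().replace(tzinfo=tzinfo)

dt64 = np.datetime64('2017-10-24T05:34:20.123456')
naive = to_datetime_us(dt64)
aware = to_datetime_us(dt64, tzinfo=timezone.utc)
print(naive, aware)
```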


Answer 9

import numpy as np
import pandas as pd 

def np64toDate(np64):
    # Timestamp.to_datetime() was removed from pandas; to_pydatetime() replaces it
    return pd.to_datetime(str(np64)).replace(tzinfo=None).to_pydatetime()

Use this function to get Python’s native datetime object.


Answer 10

Some solutions work well for me, but numpy will deprecate some parameters. The solution that works better for me is to read the date as a pandas datetime and explicitly extract the year, month and day of the pandas object. The following code works for the most common situations.

import datetime
import pandas as pd
def format_dates(dates):
    dt = pd.to_datetime(dates)
    try:
        return [datetime.date(x.year, x.month, x.day) for x in dt]
    except TypeError:  # a single Timestamp is not iterable
        return datetime.date(dt.year, dt.month, dt.day)

Answer 11

Indeed, all of these datetime types can be difficult, and potentially problematic (you must keep careful track of timezone information). Here’s what I have done, though I admit that I am concerned that at least part of it is “not by design”. Also, this can be made a bit more compact as needed. Starting with a numpy.datetime64 dt_a:

dt_a

numpy.datetime64('2015-04-24T23:11:26.270000-0700')

dt_a1 = dt_a.tolist() # yields a datetime object in UTC, but without tzinfo

dt_a1

datetime.datetime(2015, 4, 25, 6, 11, 26, 270000)

# now, make your "aware" datetime:

dt_a2 = datetime.datetime(*list(dt_a1.timetuple()[:6]) + [dt_a1.microsecond], tzinfo=pytz.timezone('UTC'))

… and of course, that can be compressed into one line as needed.
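
A more compact way to attach UTC is datetime.replace with the standard library's timezone.utc, which avoids both the timetuple round trip and the pytz dependency. This is a sketch under the assumption that the datetime64 value already holds UTC wall time. It is written without an offset because recent NumPy versions warn about offsets in the input string.

```python
from datetime import datetime, timezone
import numpy as np

# Same instant as in the answer, written directly as UTC wall time.
dt_a = np.datetime64('2015-04-25T06:11:26.270000')

dt_a1 = dt_a.tolist()                        # naive datetime.datetime
dt_a2 = dt_a1.replace(tzinfo=timezone.utc)   # aware datetime in UTC
print(dt_a2)
```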


Simple Digit Recognition OCR in OpenCV-Python

Question: Simple Digit Recognition OCR in OpenCV-Python

I am trying to implement a “Digit Recognition OCR” in OpenCV-Python (cv2). It is just for learning purposes. I would like to learn both KNearest and SVM features in OpenCV.

I have 100 samples (i.e. images) of each digit. I would like to train with them.

There is a sample, letter_recog.py, that comes with the OpenCV samples, but I still couldn’t figure out how to use it. I don’t understand what the samples, responses, etc. are. Also, it loads a txt file at the start, which I didn’t understand at first.

Later, after searching a little, I found letter_recognition.data in the cpp samples. I used it and wrote some code for cv2.KNearest, modeled on letter_recog.py (just for testing):

import numpy as np
import cv2

fn = 'letter-recognition.data'
a = np.loadtxt(fn, np.float32, delimiter=',', converters={ 0 : lambda ch : ord(ch)-ord('A') })
samples, responses = a[:,1:], a[:,0]

model = cv2.KNearest()
retval = model.train(samples,responses)
retval, results, neigh_resp, dists = model.find_nearest(samples, k = 10)
print(results.ravel())

It gave me an array of size 20000, and I don’t understand what it is.

Questions:

1) What is the letter_recognition.data file? How do I build that file from my own data set?

2) What does results.ravel() denote?

3) How can we write a simple digit recognition tool using the letter_recognition.data file (either KNearest or SVM)?


Answer 0

好吧,我决定对我的问题进行锻炼以解决上述问题。我想要的是使用OpenCV中的KNearest或SVM功能实现简单的OCR。以下是我的工作方式。(这只是为了学习如何将KNearest用于简单的OCR目的)。

1)我的第一个问题是有关OpenCV示例随附的letter_recognition.data文件的。我想知道该文件中的内容。

它包含一个字母以及该字母的16个功能。

this SOF帮助我找到了它。本文介绍了这16个功能Letter Recognition Using Holland-Style Adaptive Classifiers。(尽管我不了解最后的一些功能)

2)由于我知道,如果不了解所有这些功能,就很难做到这一点。我尝试了其他一些论文,但是对于初学者来说,都有些困难。

So I just decided to take all the pixel values as my features. (我并不担心准确性或性能,我只是希望它能够工作,至少以最低的准确性)

我为训练数据拍摄了下图:

在此处输入图片说明

(我知道训练数据的数量较少。但是,由于所有字母的字体和大小都相同,因此我决定尝试一下)。

为了准备训练数据,我在OpenCV中编写了一个小代码。它执行以下操作:

  1. 它加载图像。
  2. 选择数字(显然是通过轮廓查找并在字母的面积和高度上施加约束来避免错误检测)。
  3. 围绕一个字母绘制边界矩形,然后等待key press manually。这次我们自己按对应于框中字母的数字键
  4. 一旦按下相应的数字键,它将将该框的大小调整为10×10,并在一个数组(此处为样本)中保存100个像素值,在另一个数组中(此处为响应)保存相应的手动输入的数字。
  5. 然后将两个数组保存在单独的txt文件中。

手动数字分类结束时,火车数据(train.png)中的所有数字都是由我们自己手动标记的,图像如下所示:

在此处输入图片说明

以下是我用于上述目的的代码(当然,不是很干净):

import sys

import numpy as np
import cv2

im = cv2.imread('pitrain.png')
im3 = im.copy()

gray = cv2.cvtColor(im,cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray,(5,5),0)
thresh = cv2.adaptiveThreshold(blur,255,1,1,11,2)

#################      Now finding Contours         ###################

contours,hierarchy = cv2.findContours(thresh,cv2.RETR_LIST,cv2.CHAIN_APPROX_SIMPLE)

samples =  np.empty((0,100))
responses = []
keys = [i for i in range(48,58)]

for cnt in contours:
    if cv2.contourArea(cnt)>50:
        [x,y,w,h] = cv2.boundingRect(cnt)

        if  h>28:
            cv2.rectangle(im,(x,y),(x+w,y+h),(0,0,255),2)
            roi = thresh[y:y+h,x:x+w]
            roismall = cv2.resize(roi,(10,10))
            cv2.imshow('norm',im)
            key = cv2.waitKey(0)

            if key == 27:  # (escape to quit)
                sys.exit()
            elif key in keys:
                responses.append(int(chr(key)))
                sample = roismall.reshape((1,100))
                samples = np.append(samples,sample,0)

responses = np.array(responses,np.float32)
responses = responses.reshape((responses.size,1))
print("training complete")

np.savetxt('generalsamples.data',samples)
np.savetxt('generalresponses.data',responses)

现在我们进入培训和测试部分。

为了测试零件,我使用了下面的图片,该图片与我训练过的字母具有相同的类型。

在此处输入图片说明

对于培训,我们执行以下操作

  1. 加载我们之前已经保存的txt文件
  2. 创建一个我们正在使用的分类器的实例(这里是KNearest)
  3. 然后我们使用KNearest.train函数来训练数据

出于测试目的,我们执行以下操作:

  1. 我们加载用于测试的图像
  2. 较早处理图像并使用轮廓法提取每个数字
  3. 为其绘制一个边界框,然后将其大小调整为10×10,并将其像素值存储在数组中,如之前所做的那样。
  4. 然后,我们使用KNearest.find_nearest()函数查找与我们给出的项目最接近的项目。(如果幸运,它将识别出正确的数字。)

我在下面的单个代码中包括了最后两个步骤(培训和测试):

import cv2
import numpy as np

#######   training part    ############### 
samples = np.loadtxt('generalsamples.data',np.float32)
responses = np.loadtxt('generalresponses.data',np.float32)
responses = responses.reshape((responses.size,1))

model = cv2.KNearest()
model.train(samples,responses)

############################# testing part  #########################

im = cv2.imread('pi.png')
out = np.zeros(im.shape,np.uint8)
gray = cv2.cvtColor(im,cv2.COLOR_BGR2GRAY)
thresh = cv2.adaptiveThreshold(gray,255,1,1,11,2)

contours,hierarchy = cv2.findContours(thresh,cv2.RETR_LIST,cv2.CHAIN_APPROX_SIMPLE)

for cnt in contours:
    if cv2.contourArea(cnt)>50:
        [x,y,w,h] = cv2.boundingRect(cnt)
        if  h>28:
            cv2.rectangle(im,(x,y),(x+w,y+h),(0,255,0),2)
            roi = thresh[y:y+h,x:x+w]
            roismall = cv2.resize(roi,(10,10))
            roismall = roismall.reshape((1,100))
            roismall = np.float32(roismall)
            retval, results, neigh_resp, dists = model.find_nearest(roismall, k = 1)
            string = str(int((results[0][0])))
            cv2.putText(out,string,(x,y+h),0,1,(0,255,0))

cv2.imshow('im',im)
cv2.imshow('out',out)
cv2.waitKey(0)

它奏效了,下面是我得到的结果:

在此处输入图片说明


在这里,它以100%的精度工作。我认为这是因为所有数字都是相同的种类和大小。

但是无论如何,这对于初学者来说是一个不错的开始(我希望如此)。

Well, I decided to work through my own question and solve the problem above. What I wanted was to implement a simple OCR using the KNearest or SVM features in OpenCV. Below is what I did and how. (It is just for learning how to use KNearest for simple OCR purposes.)

1) My first question was about the letter_recognition.data file that comes with the OpenCV samples. I wanted to know what is inside that file.

It contains a letter, along with 16 features of that letter.

And this SOF helped me to find it. These 16 features are explained in the paperLetter Recognition Using Holland-Style Adaptive Classifiers. ( Although I didn’t understand some of the features at end)

2) Since I knew, without understanding all those features, it is difficult to do that method. I tried some other papers, but all were a little difficult for a beginner.

So I just decided to take all the pixel values as my features. (I was not worried about accuracy or performance, I just wanted it to work, at least with the least accuracy)

I took below image for my training data:

enter image description here

( I know the amount of training data is less. But, since all letters are of same font and size, I decided to try on this).

To prepare the data for training, I made a small code in OpenCV. It does following things:

  1. It loads the image.
  2. Selects the digits ( obviously by contour finding and applying constraints on area and height of letters to avoid false detections).
  3. Draws the bounding rectangle around one letter and wait for key press manually. This time we press the digit key ourselves corresponding to the letter in box.
  4. Once corresponding digit key is pressed, it resizes this box to 10×10 and saves 100 pixel values in an array (here, samples) and corresponding manually entered digit in another array(here, responses).
  5. Then save both the arrays in separate txt files.
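
As a rough illustration of steps 4 and 5, the sketch below builds the samples and responses arrays with NumPy only. The random `rois` and the `labels` list are made-up stand-ins for real digit crops and key presses, and an in-memory buffer replaces the generalsamples.data file, so this only shows the array shapes involved, not the OpenCV part.

```python
import io
import numpy as np

# Hypothetical stand-ins for two thresholded 10x10 digit crops (values 0 or 255).
rng = np.random.default_rng(0)
rois = [rng.integers(0, 2, size=(10, 10)).astype(np.uint8) * 255 for _ in range(2)]
labels = [3, 1]  # the digits that would have been typed for each crop

samples = np.empty((0, 100), np.float32)
for roi in rois:
    row = roi.reshape((1, 100)).astype(np.float32)  # flatten 10x10 -> 1x100
    samples = np.append(samples, row, axis=0)       # one row per labeled digit

responses = np.array(labels, np.float32).reshape((-1, 1))  # one label per row

# Round-trip through NumPy's text format, as np.savetxt / np.loadtxt would do on disk.
buf = io.StringIO()
np.savetxt(buf, samples)
buf.seek(0)
reloaded = np.loadtxt(buf, dtype=np.float32)

print(samples.shape, responses.shape, reloaded.shape)
```

Each digit becomes one 100-element row, and the label array has one entry per row, which is the layout the training step expects.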

At the end of the manual classification, all the digits in the training data (train.png) have been labeled by hand, and the image looks like the one below:

[Image: training image with all digits labeled]

Below is the code I used for the above purpose (of course, not so clean):

import sys

import numpy as np
import cv2

im = cv2.imread('pitrain.png')
im3 = im.copy()

gray = cv2.cvtColor(im,cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray,(5,5),0)
thresh = cv2.adaptiveThreshold(blur,255,1,1,11,2)

#################      Now finding Contours         ###################

contours,hierarchy = cv2.findContours(thresh,cv2.RETR_LIST,cv2.CHAIN_APPROX_SIMPLE)

samples =  np.empty((0,100))
responses = []
keys = [i for i in range(48,58)]

for cnt in contours:
    if cv2.contourArea(cnt)>50:
        [x,y,w,h] = cv2.boundingRect(cnt)

        if  h>28:
            cv2.rectangle(im,(x,y),(x+w,y+h),(0,0,255),2)
            roi = thresh[y:y+h,x:x+w]
            roismall = cv2.resize(roi,(10,10))
            cv2.imshow('norm',im)
            key = cv2.waitKey(0)

            if key == 27:  # (escape to quit)
                sys.exit()
            elif key in keys:
                responses.append(int(chr(key)))
                sample = roismall.reshape((1,100))
                samples = np.append(samples,sample,0)

responses = np.array(responses,np.float32)
responses = responses.reshape((responses.size,1))
print("training complete")

np.savetxt('generalsamples.data',samples)
np.savetxt('generalresponses.data',responses)

Now we move on to the training and testing part.

For testing, I used the image below, which has the same type of letters I used for training.

[Image: test image]

For training, we do the following:

  1. Load the txt files we saved earlier.
  2. Create an instance of the classifier we are using (here, KNearest).
  3. Use the KNearest.train function to train on the data.

For testing, we do the following:

  1. Load the image used for testing.
  2. Process the image as before and extract each digit using contour methods.
  3. Draw a bounding box for it, resize it to 10×10, and store its pixel values in an array as before.
  4. Use the KNearest.find_nearest() function to find the nearest item to the one we gave. (If lucky, it recognises the correct digit.)
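
To make the last testing step less of a black box, here is a minimal NumPy sketch of what a nearest-neighbour lookup with k = 1 amounts to, assuming plain Euclidean distance. The function name `nearest_label` and the tiny 4-pixel training set are made up for illustration; this is not OpenCV's implementation.

```python
import numpy as np

def nearest_label(train_samples, train_labels, query):
    """Return the label of the training row closest to `query` (k = 1)."""
    diffs = train_samples - query           # broadcast the query over all rows
    dists = np.sum(diffs * diffs, axis=1)   # squared Euclidean distance per row
    return train_labels[np.argmin(dists)]   # label of the closest row

# Tiny made-up training set: 3 samples with 4 "pixels" each.
train_samples = np.array([[0, 0, 255, 255],
                          [255, 255, 0, 0],
                          [255, 0, 255, 0]], dtype=np.float32)
train_labels = np.array([1, 7, 4], dtype=np.float32)

query = np.array([250, 240, 10, 0], dtype=np.float32)
predicted = int(nearest_label(train_samples, train_labels, query))
print(predicted)
```

The query is closest to the second training row, so the sketch returns that row's label. With 100 pixel values per digit the idea is the same, only the rows are longer.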

I included the last two steps (training and testing) in a single script below:

import cv2
import numpy as np

#######   training part    ############### 
samples = np.loadtxt('generalsamples.data',np.float32)
responses = np.loadtxt('generalresponses.data',np.float32)
responses = responses.reshape((responses.size,1))

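# OpenCV 2.4 API. In OpenCV 3 and later the equivalents are cv2.ml.KNearest_create(),
# model.train(samples, cv2.ml.ROW_SAMPLE, responses) and model.findNearest(roismall, k=1).
model = cv2.KNearest()
model.train(samples,responses)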

############################# testing part  #########################

im = cv2.imread('pi.png')
out = np.zeros(im.shape,np.uint8)
gray = cv2.cvtColor(im,cv2.COLOR_BGR2GRAY)
thresh = cv2.adaptiveThreshold(gray,255,1,1,11,2)

contours,hierarchy = cv2.findContours(thresh,cv2.RETR_LIST,cv2.CHAIN_APPROX_SIMPLE)

for cnt in contours:
    if cv2.contourArea(cnt)>50:
        [x,y,w,h] = cv2.boundingRect(cnt)
        if  h>28:
            cv2.rectangle(im,(x,y),(x+w,y+h),(0,255,0),2)
            roi = thresh[y:y+h,x:x+w]
            roismall = cv2.resize(roi,(10,10))
            roismall = roismall.reshape((1,100))
            roismall = np.float32(roismall)
            retval, results, neigh_resp, dists = model.find_nearest(roismall, k = 1)
            string = str(int((results[0][0])))
            cv2.putText(out,string,(x,y+h),0,1,(0,255,0))

cv2.imshow('im',im)
cv2.imshow('out',out)
cv2.waitKey(0)

And it worked. Below is the result I got:

[Image: recognition result]

Here it worked with 100% accuracy. I assume this is because all the digits are of the same kind and the same size.

But anyway, this is a good starting point for beginners (I hope so).


Answer 1

For those interested in C++ code, refer to the code below. Thanks to Abid Rahman for the nice explanation.


The procedure is the same as above, but the contour finding uses only the first hierarchy level, so the algorithm uses only the outer contour of each digit.

Code for creating sample and label data

//Process image to extract contour
Mat thr,gray,con;
Mat src=imread("digit.png",1);
cvtColor(src,gray,CV_BGR2GRAY);
threshold(gray,thr,200,255,THRESH_BINARY_INV); //Threshold to find contour
thr.copyTo(con);

// Create sample and label data
vector< vector <Point> > contours; // Vector for storing contour
vector< Vec4i > hierarchy;
Mat sample;
Mat response_array;  
findContours( con, contours, hierarchy,CV_RETR_CCOMP, CV_CHAIN_APPROX_SIMPLE ); //Find contour

for( int i = 0; i< contours.size(); i=hierarchy[i][0] ) // iterate through first hierarchy level contours
{
    Rect r= boundingRect(contours[i]); //Find bounding rect for each contour
    rectangle(src,Point(r.x,r.y), Point(r.x+r.width,r.y+r.height), Scalar(0,0,255),2,8,0);
    Mat ROI = thr(r); //Crop the image
    Mat tmp1, tmp2;
    resize(ROI,tmp1, Size(10,10), 0,0,INTER_LINEAR ); //resize to 10X10
    tmp1.convertTo(tmp2,CV_32FC1); //convert to float
    sample.push_back(tmp2.reshape(1,1)); // Store  sample data
    imshow("src",src);
    int c=waitKey(0); // Read corresponding label for contour from keyboard
    c-=0x30;     // Convert ASCII to integer value
    response_array.push_back(c); // Store label to a mat
    rectangle(src,Point(r.x,r.y), Point(r.x+r.width,r.y+r.height), Scalar(0,255,0),2,8,0);    
}

// Store the data to file
Mat response,tmp;
tmp=response_array.reshape(1,1); //make continuous
tmp.convertTo(response,CV_32FC1); // Convert  to float

FileStorage Data("TrainingData.yml",FileStorage::WRITE); // Store the sample data in a file
Data << "data" << sample;
Data.release();

FileStorage Label("LabelData.yml",FileStorage::WRITE); // Store the label data in a file
Label << "label" << response;
Label.release();
cout<<"Training and Label data created successfully....!! "<<endl;

imshow("src",src);
waitKey();

Code for training and testing

Mat thr,gray,con;
Mat src=imread("dig.png",1);
cvtColor(src,gray,CV_BGR2GRAY);
threshold(gray,thr,200,255,THRESH_BINARY_INV); // Threshold to create input
thr.copyTo(con);


// Read stored sample and label for training
Mat sample;
Mat response,tmp;
FileStorage Data("TrainingData.yml",FileStorage::READ); // Read training data to a Mat
Data["data"] >> sample;
Data.release();

FileStorage Label("LabelData.yml",FileStorage::READ); // Read label data to a Mat
Label["label"] >> response;
Label.release();


KNearest knn;
knn.train(sample,response); // Train with sample and responses
cout<<"Training completed.....!!"<<endl;

vector< vector <Point> > contours; // Vector for storing contour
vector< Vec4i > hierarchy;

//Create input sample by contour finding and cropping
findContours( con, contours, hierarchy,CV_RETR_CCOMP, CV_CHAIN_APPROX_SIMPLE );
Mat dst(src.rows,src.cols,CV_8UC3,Scalar::all(0));

for( int i = 0; i< contours.size(); i=hierarchy[i][0] ) // iterate through each contour for first hierarchy level .
{
    Rect r= boundingRect(contours[i]);
    Mat ROI = thr(r);
    Mat tmp1, tmp2;
    resize(ROI,tmp1, Size(10,10), 0,0,INTER_LINEAR );
    tmp1.convertTo(tmp2,CV_32FC1);
    float p=knn.find_nearest(tmp2.reshape(1,1), 1);
    char name[4];
    sprintf(name,"%d",(int)p);
    putText( dst,name,Point(r.x,r.y+r.height) ,0,1, Scalar(0, 255, 0), 2, 8 );
}

imshow("src",src);
imshow("dst",dst);
imwrite("dest.jpg",dst);
waitKey();

Result

In the result, the dot in the first line is detected as 8, because we haven’t trained for the dot. Also, I treat every contour in the first hierarchy level as a sample input; the user can avoid this by filtering on contour area.

Results


Answer 2

If you are interested in the state of the art in machine learning, you should look into deep learning. You should have a CUDA-capable GPU or alternatively use a GPU on Amazon Web Services.

Google Udacity has a nice tutorial on this using TensorFlow. This tutorial will teach you how to train your own classifier on handwritten digits. I got an accuracy of over 97% on the test set using convolutional networks.


How do I count the occurrences of certain items in an ndarray in Python?

Question: How do I count the occurrences of certain items in an ndarray in Python?

In Python, I have an ndarray y that is printed as array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])

I’m trying to count how many 0s and how many 1s there are in this array.

But when I type y.count(0) or y.count(1), it says

numpy.ndarray object has no attribute count

What should I do?


Answer 0

>>> import numpy
>>> a = numpy.array([0, 3, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0, 1, 3, 4])
>>> unique, counts = numpy.unique(a, return_counts=True)
>>> dict(zip(unique, counts))
{0: 7, 1: 4, 2: 1, 3: 2, 4: 1}

Non-numpy way:

Use collections.Counter:

>>> import collections, numpy

>>> a = numpy.array([0, 3, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0, 1, 3, 4])
>>> collections.Counter(a)
Counter({0: 7, 1: 4, 3: 2, 2: 1, 4: 1})
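
If the counts are going to be looked up later, it can help to have plain Python integers as keys and values rather than NumPy scalars. A small sketch of that variation (the name `count_by_value` is just illustrative):

```python
import numpy as np

a = np.array([0, 3, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0, 1, 3, 4])

# np.unique returns the sorted distinct values and how often each occurs.
values, counts = np.unique(a, return_counts=True)

# .tolist() converts the NumPy scalars to plain Python ints.
count_by_value = dict(zip(values.tolist(), counts.tolist()))

print(count_by_value)            # {0: 7, 1: 4, 2: 1, 3: 2, 4: 1}
print(count_by_value.get(5, 0))  # a value that never occurs -> 0
```

Using `.get(value, 0)` avoids a KeyError for values that are absent from the array.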

Answer 1

What about using numpy.count_nonzero, something like

>>> import numpy as np
>>> y = np.array([1, 2, 2, 2, 2, 0, 2, 3, 3, 3, 0, 0, 2, 2, 0])

>>> np.count_nonzero(y == 1)
1
>>> np.count_nonzero(y == 2)
7
>>> np.count_nonzero(y == 3)
3

Answer 2

Personally, I’d go for: (y == 0).sum() and (y == 1).sum()

E.g.

import numpy as np
y = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
num_zeros = (y == 0).sum()
num_ones = (y == 1).sum()

Answer 3

For your case you could also look into numpy.bincount

In [56]: a = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])

In [57]: np.bincount(a)
Out[57]: array([8, 4])  #count of zeros is at index 0 : 8
                        #count of ones is at index 1 : 4
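
One detail worth knowing: np.bincount only accepts non-negative integers, and its result is only as long as the largest value plus one. If a fixed number of slots is wanted even for values that never occur, `minlength` can be passed, as in this small sketch (the choice of 4 slots is arbitrary):

```python
import numpy as np

a = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])

# counts[v] is the number of times the integer v occurs in a.
# minlength=4 pads the result so that values 2 and 3 get an explicit 0.
counts = np.bincount(a, minlength=4)

n_zeros, n_ones = int(counts[0]), int(counts[1])
print(counts.tolist(), n_zeros, n_ones)   # [8, 4, 0, 0] 8 4
```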

Answer 4

Convert your array y to list l and then do l.count(1) and l.count(0)

>>> y = numpy.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
>>> l = list(y)
>>> l.count(1)
4
>>> l.count(0)
8 

Answer 5

y = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])

If you know that they are just 0 and 1:

np.sum(y)

gives you the number of ones. np.sum(1-y) gives the zeroes.

More generally, if you want to count zero versus non-zero values (where the non-zero values might be 2 or 3):

np.count_nonzero(y)

gives the number of non-zero values.

But if you need something more complicated, I don’t think numpy will provide a nice count option. In that case, go to collections:

import collections
collections.Counter(y)
> Counter({0: 8, 1: 4})

This behaves like a dict

collections.Counter(y)[0]
> 8

Answer 6

If you know exactly which number you’re looking for, you can use the following:

lst = np.array([1,1,2,3,3,6,6,6,3,2,1])
(lst == 2).sum()

This returns how many times 2 occurs in your array.


Answer 7

Honestly I find it easiest to convert to a pandas Series or DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'data':np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])})
print(df['data'].value_counts())

Or this nice one-liner suggested by Robert Muil:

pd.Series([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1]).value_counts()

Answer 8

No one has suggested using numpy.bincount(input, minlength) with minlength = np.size(input), but it seems to be a good solution, and it was by far the fastest in my timings:

In [1]: choices = np.random.randint(0, 100, 10000)

In [2]: %timeit [ np.sum(choices == k) for k in range(min(choices), max(choices)+1) ]
100 loops, best of 3: 2.67 ms per loop

In [3]: %timeit np.unique(choices, return_counts=True)
1000 loops, best of 3: 388 µs per loop

In [4]: %timeit np.bincount(choices, minlength=np.size(choices))
100000 loops, best of 3: 16.3 µs per loop

That’s a crazy speedup between numpy.unique(x, return_counts=True) and numpy.bincount(x, minlength=np.size(x))!
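
For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. It first checks that both approaches agree and then times them; the absolute numbers depend on the machine and NumPy version, so none are quoted here. Note that `minlength` only needs to be the largest value plus one (100 is used below, which is an assumption about the data range).

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
choices = rng.integers(0, 100, 10000)   # 10000 integers in [0, 100)

# Both approaches should report the same count for every value.
values, counts = np.unique(choices, return_counts=True)
by_bincount = np.bincount(choices, minlength=100)
same = all(by_bincount[v] == c for v, c in zip(values, counts))

# Time 200 repetitions of each approach.
t_unique = timeit.timeit(lambda: np.unique(choices, return_counts=True), number=200)
t_bincount = timeit.timeit(lambda: np.bincount(choices, minlength=100), number=200)

print(same, f"unique: {t_unique:.4f}s", f"bincount: {t_bincount:.4f}s")
```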


Answer 9

What about len(y[y==0]) and len(y[y==1]) ?


Answer 10

y.tolist().count(val)

with val 0 or 1

Since a Python list has a built-in count method, converting to a list before calling it is a simple solution.


Answer 11

Yet another simple solution might be to use numpy.count_nonzero():

import numpy as np
y = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
y_nonzero_num = np.count_nonzero(y==1)
y_zero_num = np.count_nonzero(y==0)
y_nonzero_num
4
y_zero_num
8

Don’t let the name mislead you: if you use it with a boolean array, as in the example, it will do the trick.
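
The reason this works can be seen by looking at the intermediate boolean array. A short sketch (the variable names are illustrative):

```python
import numpy as np

y = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])

mask = (y == 1)                   # boolean array: True where y equals 1
n_ones = np.count_nonzero(mask)   # True counts as "non-zero", so this counts matches
n_ones_by_sum = int(mask.sum())   # True also sums as 1, giving the same number

print(mask.dtype, n_ones, n_ones_by_sum)   # bool 4 4
```

So np.count_nonzero(y == value) and (y == value).sum() are two spellings of the same count.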


Answer 12

To count the number of occurrences, you can use np.unique(array, return_counts=True):

In [75]: boo = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])

# use bool value `True` or equivalently `1`
In [77]: uniq, cnts = np.unique(boo, return_counts=1)
In [81]: uniq
Out[81]: array([0, 1])   #unique elements in input array are: 0, 1

In [82]: cnts
Out[82]: array([8, 4])   # 0 occurs 8 times, 1 occurs 4 times

Answer 13

I’d use np.where:

how_many_0 = len(np.where(a==0.)[0])
how_many_1 = len(np.where(a==1.)[0])

Answer 14

Take advantage of the methods offered by a pandas Series:

>>> import pandas as pd
>>> y = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1]
>>> pd.Series(y).value_counts()
0    8
1    4
dtype: int64

Answer 15

A general and simple answer would be:

numpy.sum(MyArray==x)   # sum of a boolean array marking where x (= 0 or 1) occurs in MyArray

which gives this full code as an example:

import numpy
MyArray=numpy.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])  # array we want to search in
x=0   # the value I want to count (can be iterator, in a list, etc.)
numpy.sum(MyArray==x)   # sum of a boolean array marking where x occurs in MyArray

Now, if MyArray has multiple dimensions and you want to count the occurrences of a particular row of values (called a pattern below):

MyArray=numpy.array([[6, 1],[4, 5],[0, 7],[5, 1],[2, 5],[1, 2],[3, 2],[0, 2],[2, 5],[5, 1],[3, 0]])
x=numpy.array([5,1])   # the value I want to count (can be iterator, in a list, etc.)
temp = numpy.ascontiguousarray(MyArray).view(numpy.dtype((numpy.void, MyArray.dtype.itemsize * MyArray.shape[1])))  # convert the 2d-array into an array of analyzable patterns
xt=numpy.ascontiguousarray(x).view(numpy.dtype((numpy.void, x.dtype.itemsize * x.shape[0])))  # convert what you search into one analyzable pattern
numpy.sum(temp==xt)  # count of the searched pattern in the list of patterns
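
For the row-pattern case, a simpler alternative (not the void-view technique used above) is to compare every row with the pattern through broadcasting and require all columns to match. A minimal sketch with the same data:

```python
import numpy as np

MyArray = np.array([[6, 1], [4, 5], [0, 7], [5, 1], [2, 5], [1, 2],
                    [3, 2], [0, 2], [2, 5], [5, 1], [3, 0]])
x = np.array([5, 1])   # the row pattern to look for

# MyArray == x broadcasts x across every row; .all(axis=1) is True only
# for rows where every column matches the pattern.
row_matches = (MyArray == x).all(axis=1)
n_rows = int(row_matches.sum())

print(n_rows, np.flatnonzero(row_matches).tolist())   # 2 [3, 9]
```

This is easier to read, at the cost of building an intermediate boolean array with the same shape as MyArray.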

Answer 16

You can use a dictionary comprehension to create a neat one-liner. More about dictionary comprehensions can be found here.

>>> counts = {int(value): list(y).count(value) for value in set(y)}
>>> print(counts)
{0: 8, 1: 4}

This creates a dictionary with the values in your ndarray as keys and the counts of those values as the corresponding dictionary values.

This works whenever you want to count occurrences of a value in arrays of this format.


Answer 17

Try this:

a = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
list(a).count(1)

Answer 18

This can be done easily in the following method

y = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
y.tolist().count(1)

Answer 19

Since your ndarray contains only 0 and 1, you can use sum() to get the occurrence of 1s and len()-sum() to get the occurrence of 0s.

num_of_ones = sum(array)
num_of_zeros = len(array)-sum(array)

Answer 20

You have a special array with only 1 and 0 here. So a trick is to use

np.mean(x)

which gives you the fraction of 1s in your array. Alternatively,

np.sum(x)
np.sum(1-x)

will give you the absolute number of 1s and 0s in your array.


Answer 21

dict(zip(*numpy.unique(y, return_counts=True)))

Just copied Seppo Enarvi’s comment here which deserves to be a proper answer


Answer 22

It involves one more step, but a more flexible solution, which also works for 2D arrays and more complicated filters, is to create a boolean mask and then use .sum() on the mask.

>>> y = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
>>> mask = y == 0
>>> mask.sum()
8
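
To illustrate the "2D arrays and more complicated filters" part, here is a small sketch on a made-up 3×3 array (the array `grid` and the range 2..6 are arbitrary choices for the example):

```python
import numpy as np

grid = np.array([[0, 1, 2],
                 [3, 0, 5],
                 [6, 7, 0]])

zeros_total = int((grid == 0).sum())        # count over the whole array
zeros_per_row = (grid == 0).sum(axis=1)     # one count per row

# Masks combine with & (and), | (or) and ~ (not); the parentheses are required.
in_range = int(((grid >= 2) & (grid <= 6)).sum())

print(zeros_total, zeros_per_row.tolist(), in_range)   # 3 [1, 1, 1] 4
```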

Answer 23

If you don’t want to use numpy or a collections module you can use a dictionary:

d = dict()
a = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1]
for item in a:
    try:
        d[item]+=1
    except KeyError:
        d[item]=1

Result:

>>> d
{0: 8, 1: 4}

Of course you can also use an if/else statement. I think Counter does almost the same thing, but this is more transparent.


Answer 24

For generic entries:

x = np.array([11, 2, 3, 5, 3, 2, 16, 10, 10, 3, 11, 4, 5, 16, 3, 11, 4])
n = {i:len([j for j in np.where(x==i)[0]]) for i in set(x)}
ix = {i:[j for j in np.where(x==i)[0]] for i in set(x)}

Will output a count:

{2: 2, 3: 4, 4: 2, 5: 2, 10: 2, 11: 3, 16: 2}

And indices:

{2: [1, 5],
3: [2, 4, 9, 14],
4: [11, 16],
5: [3, 12],
10: [7, 8],
11: [0, 10, 15],
16: [6, 13]}
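
The comprehension above scans the array once per distinct value. As a sketch of an alternative that sorts once instead (a different technique, using a stable argsort plus np.split), the same two dictionaries can be built like this:

```python
import numpy as np

x = np.array([11, 2, 3, 5, 3, 2, 16, 10, 10, 3, 11, 4, 5, 16, 3, 11, 4])

values, counts = np.unique(x, return_counts=True)

# Indices of x ordered by value; a stable sort keeps ties in original order.
order = np.argsort(x, kind="stable")

# Cut the sorted indices into one group per distinct value.
groups = np.split(order, np.cumsum(counts)[:-1])

n = dict(zip(values.tolist(), counts.tolist()))
ix = {v: g.tolist() for v, g in zip(values.tolist(), groups)}

print(n)
print(ix[3], ix[11])   # [2, 4, 9, 14] [0, 10, 15]
```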

Answer 25

Here is a way to count the occurrences of a particular number, based on your code:

count_of_zero=list(y[y==0]).count(0)

print(count_of_zero)

# y == 0 yields boolean values; indexing with it keeps only the elements where it is True, so counting 0 in the result gives the number of zeros


Answer 26

If you are interested in the fastest execution, you know in advance which value(s) to look for, and your array is 1D, or you are otherwise interested in the result on the flattened array (in which case the input of the function should be arr.ravel() rather than just arr), then Numba is your friend:

import numba as nb


@nb.jit
def count_nb(arr, value):
    result = 0
    for x in arr:
        if x == value:
            result += 1
    return result

or, for very large arrays where parallelization may be beneficial:

@nb.jit(parallel=True)
def count_nbp(arr, value):
    result = 0
    for i in nb.prange(arr.size):
        if arr[i] == value:
            result += 1
    return result

Benchmarking these against np.count_nonzero() (which, used this way, also creates a temporary array that could be avoided) and an np.unique()-based solution:

import numpy as np


def count_np(arr, value):
    return np.count_nonzero(arr == value)


def count_np2(arr, value):
    uniques, counts = np.unique(arr, return_counts=True)
    counter = dict(zip(uniques, counts))
    return counter[value] if value in counter else 0

for input generated with:

def gen_input(n, a=0, b=100):
    return np.random.randint(a, b, n)

the following plots are obtained (the second row of plots is a zoom on the faster approaches):

[Images: bm_full, bm_zoom]

This shows that the Numba-based solutions are noticeably faster than their NumPy counterparts and that, for very large inputs, the parallel approach is faster than the naive one.


Full code available here.


Answer 27

If you are dealing with very large arrays, using a generator could be an option. The nice thing here is that this approach works for both arrays and lists, and you don’t need any additional package. Additionally, you are not using that much memory.

my_array = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
sum(1 for val in my_array if val==0)
Out: 8

Answer 28

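NumPy has a function for this, numpy.histogram. Just a small hack: derive the bin edges from your input array, with one unit-wide bin per integer value. (Bin edges must increase monotonically, so passing the unsorted y itself as bins raises a ValueError.)

numpy.histogram(y, bins=numpy.arange(y.min(), y.max() + 2))

The output is two arrays: the first holds the frequencies, the second the bin edges.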
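
As a hedged sketch of how np.histogram can be made to count integer values: the function expects bin edges that increase monotonically, so the edges below are built with np.arange to give one unit-wide bin per integer. This assumes the data are integers; it is not a general recipe for floats.

```python
import numpy as np

y = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])

# Edges [0, 1, 2]: bin 0 covers [0, 1) and the last bin covers [1, 2].
edges = np.arange(y.min(), y.max() + 2)

# np.histogram returns the counts first and the bin edges second.
freq, edges_out = np.histogram(y, bins=edges)

print(freq.tolist(), edges_out.tolist())
```

The first returned array holds the counts (8 zeros and 4 ones here) and the second holds the bin edges, not the values themselves.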


Answer 29

NumPy has no top-level numpy.count function for numeric arrays; the closest equivalent is numpy.count_nonzero on a comparison:

>>> import numpy as np
>>> a = np.array([0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1])
>>> np.count_nonzero(a == 1)
4

Numpy array dimensions

Question: Numpy array dimensions

I’m currently trying to learn Numpy and Python. Given the following array:

import numpy as np
a = np.array([[1,2],[1,2]])

Is there a function that returns the dimensions of a (e.g. a is a 2 by 2 array)?

size() returns 4 and that doesn’t help very much.


Answer 0

It is .shape:

ndarray.shape
Tuple of array dimensions.

Thus:

>>> a.shape
(2, 2)

Answer 1

第一:

按照惯例,在Python世界中,的快捷方式numpynp,因此:

In [1]: import numpy as np

In [2]: a = np.array([[1,2],[3,4]])

第二:

在Numpy中,维度轴/轴形状是相关的,有时是相似的概念:

尺寸

在“ 数学/物理学”中,维或维数被非正式地定义为指定空间中任何点所需的最小坐标数。但在numpy的,根据numpy的文档,这是相同的轴线/轴:

在Numpy中,尺寸称为轴。轴数为等级。

In [3]: a.ndim  # num of dimensions/axes, *Mathematics definition of dimension*
Out[3]: 2

轴/轴

在Numpy中索引an 的第n个坐标array。多维数组每个轴可以有一个索引。

In [4]: a[1,0]  # to index `a`, we specify 1 at the first axis and 0 at the second axis.
Out[4]: 3  # which results in 3 (located at row 1 and column 0, 0-based index)

形状

描述沿每个可用轴有多少数据(或范围)。

In [5]: a.shape
Out[5]: (2, 2)  # both the first and second axis have 2 (columns/rows/pages/blocks/...) data

First:

By convention, in Python world, the shortcut for numpy is np, so:

In [1]: import numpy as np

In [2]: a = np.array([[1,2],[3,4]])

Second:

In Numpy, dimension, axis/axes, shape are related and sometimes similar concepts:

dimension

In Mathematics/Physics, dimension or dimensionality is informally defined as the minimum number of coordinates needed to specify any point within a space. But in Numpy, according to the numpy doc, it’s the same as axis/axes:

In Numpy dimensions are called axes. The number of axes is rank.

In [3]: a.ndim  # num of dimensions/axes, *Mathematics definition of dimension*
Out[3]: 2

axis/axes

The nth coordinate used to index an array in Numpy. Multidimensional arrays can have one index per axis.

In [4]: a[1,0]  # to index `a`, we specify 1 at the first axis and 0 at the second axis.
Out[4]: 3  # which results in 3 (located at row 1 and column 0, 0-based index)

shape

Describes how many elements there are (the extent) along each available axis.

In [5]: a.shape
Out[5]: (2, 2)  # both the first and second axis have 2 (columns/rows/pages/blocks/...) data

回答 2

import numpy as np   
>>> np.shape(a)
(2,2)

如果输入不是numpy数组而是列表列表,则也可以使用

>>> a = [[1,2],[1,2]]
>>> np.shape(a)
(2,2)

或元组的元组

>>> a = ((1,2),(1,2))
>>> np.shape(a)
(2,2)
>>> import numpy as np
>>> a = np.array([[1,2],[1,2]])
>>> np.shape(a)
(2, 2)

Also works if the input is not a numpy array but a list of lists

>>> a = [[1,2],[1,2]]
>>> np.shape(a)
(2, 2)

Or a tuple of tuples

>>> a = ((1,2),(1,2))
>>> np.shape(a)
(2, 2)

回答 3

您可以使用.shape

In: a = np.array([[1,2,3],[4,5,6]])
In: a.shape
Out: (2, 3)
In: a.shape[0] # size along axis 0 (number of rows)
Out: 2
In: a.shape[1] # size along axis 1 (number of columns)
Out: 3

You can use .shape

In: a = np.array([[1,2,3],[4,5,6]])
In: a.shape
Out: (2, 3)
In: a.shape[0] # size along axis 0 (number of rows)
Out: 2
In: a.shape[1] # size along axis 1 (number of columns)
Out: 3

回答 4

您可以使用.ndim尺寸并.shape知道确切尺寸

var = np.array([[1,2,3,4,5,6], [1,2,3,4,5,6]])

var.ndim
# displays 2

var.shape
# displays (2, 6)

您可以使用.reshape功能更改尺寸

var = np.array([[1,2,3,4,5,6], [1,2,3,4,5,6]]).reshape(3,4)

var.ndim
#display 2

var.shape
#display 3, 4

You can use .ndim for the number of dimensions and .shape for the size along each dimension

var = np.array([[1,2,3,4,5,6], [1,2,3,4,5,6]])

var.ndim
# displays 2

var.shape
# displays (2, 6)

You can change the dimensions using the .reshape function

var = np.array([[1,2,3,4,5,6], [1,2,3,4,5,6]]).reshape(3,4)

var.ndim
# displays 2

var.shape
# displays (3, 4)

回答 5

shape方法要求它a是一个Numpy ndarray。但是Numpy还可以计算纯python对象的可迭代对象的形状:

np.shape([[1,2],[1,2]])

The .shape attribute requires that a be a Numpy ndarray. But np.shape can also compute the shape of nested pure-Python sequences such as lists of lists:

np.shape([[1,2],[1,2]])

回答 6

a.shape只是的受限版本np.info()。看一下这个:

import numpy as np
a = np.array([[1,2],[1,2]])
np.info(a)

class:  ndarray
shape:  (2, 2)
strides:  (8, 4)
itemsize:  4
aligned:  True
contiguous:  True
fortran:  False
data pointer: 0x27509cf0560
byteorder:  little
byteswap:  False
type: int32

a.shape is just a limited version of np.info(). Check this out:

import numpy as np
a = np.array([[1,2],[1,2]])
np.info(a)

Out

class:  ndarray
shape:  (2, 2)
strides:  (8, 4)
itemsize:  4
aligned:  True
contiguous:  True
fortran:  False
data pointer: 0x27509cf0560
byteorder:  little
byteswap:  False
type: int32

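What are the differences between numpy arrays and matrices? Which one should I use?

Question: What are the differences between numpy arrays and matrices? Which one should I use?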

每种都有哪些优点和缺点?

从我所看到的情况来看,如果需要,任何一个都可以替代另一个,所以我应该同时使用这两个还是应该仅使用其中之一?

程序的样式会影响我的选择吗?我正在使用numpy进行一些机器学习,因此确实有很多矩阵,但也有很多向量(数组)。

What are the advantages and disadvantages of each?

From what I’ve seen, either one can work as a replacement for the other if need be, so should I bother using both or should I stick to just one of them?

Will the style of the program influence my choice? I am doing some machine learning using numpy, so there are indeed lots of matrices, but also lots of vectors (arrays).


回答 0

根据官方文件,不再建议使用矩阵类,因为将来会删除它。

https://numpy.org/doc/stable/reference/generated/numpy.matrix.html

正如其他答案所指出的那样,您可以使用NumPy数组实现所有操作。

As per the official documentation, it is no longer advisable to use the matrix class, since it may be removed in the future.

https://numpy.org/doc/stable/reference/generated/numpy.matrix.html

As other answers already state, you can achieve all the operations with NumPy arrays.


回答 1

numpy的矩阵是严格2维的,而numpy的阵列(ndarrays)是N维的。矩阵对象是ndarray的子​​类,因此它们继承了ndarray的所有属性和方法。

numpy矩阵的主要优点是它们为矩阵乘法提供了一种方便的表示法:如果a和b是矩阵,则a*b它们是矩阵乘积。

import numpy as np

a = np.asmatrix('4 3; 2 1')
b = np.asmatrix('1 2; 3 4')
print(a)
# [[4 3]
#  [2 1]]
print(b)
# [[1 2]
#  [3 4]]
print(a*b)
# [[13 20]
#  [ 5  8]]

另一方面,从Python 3.5开始,NumPy使用@运算符支持中缀矩阵乘法,因此您可以在Python> = 3.5中使用ndarrays实现相同的矩阵乘法便捷性。

import numpy as np

a = np.array([[4, 3], [2, 1]])
b = np.array([[1, 2], [3, 4]])
print(a@b)
# [[13 20]
#  [ 5  8]]

矩阵对象和ndarray都.T必须返回转置,但是矩阵对象也必须具有.H共轭转置和.I逆。

相反,numpy数组始终遵守以元素为单位应用操作的规则(除了new @运算符)。因此,如果ab是numpy数组,则a*b该数组是通过按元素逐个乘以组成的:

c = np.array([[4, 3], [2, 1]])
d = np.array([[1, 2], [3, 4]])
print(c*d)
# [[4 6]
#  [6 4]]

要获得矩阵乘法的结果,请使用np.dot(或@在Python> = 3.5中,如上所示):

print(np.dot(c,d))
# [[13 20]
#  [ 5  8]]

**运营商还表现不同:

print(a**2)
# [[22 15]
#  [10  7]]
print(c**2)
# [[16  9]
#  [ 4  1]]

由于a是矩阵,所以a**2返回矩阵乘积a*a。由于c是ndarray,因此c**2返回一个ndarray,每个组件的元素均平方。

矩阵对象和ndarray之间还有其他技术差异(与np.ravel,项目选择和序列行为有关)。

numpy数组的主要优点是它们比二维矩阵更通用。当您需要3维数组时会发生什么?然后,您必须使用ndarray,而不是矩阵对象。因此,学习使用矩阵对象的工作量更大-您必须学习矩阵对象操作和ndarray操作。

编写一个将矩阵和数组混合在一起的程序会使您的生活变得困难,因为您必须跟踪变量是什么类型的对象,以免乘法返回您不期望的东西。

相反,如果仅使用ndarray,则可以执行矩阵对象可以执行的所有操作,以及更多操作,但功能/符号略有不同。

如果您愿意放弃NumPy矩阵产品表示法的视觉吸引力(使用python> = 3.5的ndarrays几乎可以优雅地实现),那么我认为NumPy数组绝对是可行的方法。

PS。当然,您实际上不必选择以牺牲另一个为代价,因为np.asmatrixnp.asarray允许您将一个转换为另一个(只要数组是二维的)。


还有就是与NumPy之间的差异大纲arraysVS NumPy的matrixES 这里

Numpy matrices are strictly 2-dimensional, while numpy arrays (ndarrays) are N-dimensional. Matrix objects are a subclass of ndarray, so they inherit all the attributes and methods of ndarrays.

The main advantage of numpy matrices is that they provide a convenient notation for matrix multiplication: if a and b are matrices, then a*b is their matrix product.

import numpy as np

a = np.asmatrix('4 3; 2 1')
b = np.asmatrix('1 2; 3 4')
print(a)
# [[4 3]
#  [2 1]]
print(b)
# [[1 2]
#  [3 4]]
print(a*b)
# [[13 20]
#  [ 5  8]]

On the other hand, as of Python 3.5, NumPy supports infix matrix multiplication using the @ operator, so you can achieve the same convenience of matrix multiplication with ndarrays in Python >= 3.5.

import numpy as np

a = np.array([[4, 3], [2, 1]])
b = np.array([[1, 2], [3, 4]])
print(a@b)
# [[13 20]
#  [ 5  8]]

Both matrix objects and ndarrays have .T to return the transpose, but matrix objects also have .H for the conjugate transpose, and .I for the inverse.
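
With plain ndarrays the same quantities are available through ordinary methods and functions. A minimal sketch, assuming a small invertible complex array z chosen only for illustration:

```python
import numpy as np

# Hypothetical example array (invertible, complex-valued)
z = np.array([[4 + 1j, 3], [2, 1 - 1j]])

transpose = z.T               # same attribute as on matrix objects
conj_transpose = z.conj().T   # ndarray counterpart of matrix .H
inverse = np.linalg.inv(z)    # ndarray counterpart of matrix .I

# The product of an array with its inverse should be close to the identity
print(np.allclose(z @ inverse, np.eye(2)))  # True
```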

In contrast, numpy arrays consistently abide by the rule that operations are applied element-wise (except for the new @ operator). Thus, if a and b are numpy arrays, then a*b is the array formed by multiplying the components element-wise:

c = np.array([[4, 3], [2, 1]])
d = np.array([[1, 2], [3, 4]])
print(c*d)
# [[4 6]
#  [6 4]]

To obtain the result of matrix multiplication, you use np.dot (or @ in Python >= 3.5, as shown above):

print(np.dot(c,d))
# [[13 20]
#  [ 5  8]]

The ** operator also behaves differently:

print(a**2)
# [[22 15]
#  [10  7]]
print(c**2)
# [[16  9]
#  [ 4  1]]

Since a is a matrix, a**2 returns the matrix product a*a. Since c is an ndarray, c**2 returns an ndarray with each component squared element-wise.

There are other technical differences between matrix objects and ndarrays (having to do with np.ravel, item selection and sequence behavior).

The main advantage of numpy arrays is that they are more general than 2-dimensional matrices. What happens when you want a 3-dimensional array? Then you have to use an ndarray, not a matrix object. Thus, learning to use matrix objects is more work — you have to learn matrix object operations, and ndarray operations.

Writing a program that mixes both matrices and arrays makes your life difficult because you have to keep track of what type of object your variables are, lest multiplication return something you don’t expect.

In contrast, if you stick solely with ndarrays, then you can do everything matrix objects can do, and more, except with slightly different functions/notation.

If you are willing to give up the visual appeal of NumPy matrix product notation (which can be achieved almost as elegantly with ndarrays in Python >= 3.5), then I think NumPy arrays are definitely the way to go.

PS. Of course, you really don’t have to choose one at the expense of the other, since np.asmatrix and np.asarray allow you to convert one to the other (as long as the array is 2-dimensional).


There is a synopsis of the differences between NumPy arrays and NumPy matrices here.


回答 2

Scipy.org建议您使用数组:

*’array’或’matrix’?我应该使用哪个?-简短答案

使用数组。

  • 它们是numpy的标准向量/矩阵/张量类型。许多numpy函数返回数组,而不是矩阵。

  • 在逐元素运算和线性代数运算之间有明显的区别。

  • 如果愿意,可以有标准向量或行/列向量。

使用数组类型的唯一缺点是,您将不得不使用dot而不是*乘(减少)两个张量(标量积,矩阵向量乘法等)。

Scipy.org recommends that you use arrays:

*’array’ or ‘matrix’? Which should I use? – Short answer

Use arrays.

  • They are the standard vector/matrix/tensor type of numpy. Many numpy functions return arrays, not matrices.

  • There is a clear distinction between element-wise operations and linear algebra operations.

  • You can have standard vectors or row/column vectors if you like.

The only disadvantage of using the array type is that you will have to use dot instead of * to multiply (reduce) two tensors (scalar product, matrix vector multiplication etc.).


回答 3

只是将一个案例添加到unutbu的列表中。

与numpy矩阵或矩阵语言(如matlab)相比,numpy ndarray对我而言最大的实际差异之一是,在归约运算中未保留维。矩阵始终为2d,而数组的均值则少一维。

例如,矩阵或数组的行为不佳的行:

带矩阵

>>> m = np.asmatrix([[1,2],[2,3]])
>>> m
matrix([[1, 2],
        [2, 3]])
>>> mm = m.mean(1)
>>> mm
matrix([[ 1.5],
        [ 2.5]])
>>> mm.shape
(2, 1)
>>> m - mm
matrix([[-0.5,  0.5],
        [-0.5,  0.5]])

带阵列

>>> a = np.array([[1,2],[2,3]])
>>> a
array([[1, 2],
       [2, 3]])
>>> am = a.mean(1)
>>> am.shape
(2,)
>>> am
array([ 1.5,  2.5])
>>> a - am #wrong
array([[-0.5, -0.5],
       [ 0.5,  0.5]])
>>> a - am[:, np.newaxis]  #right
array([[-0.5,  0.5],
       [-0.5,  0.5]])

我还认为混合数组和矩阵会带来很多“快乐的”调试时间。但是,就乘法而言,scipy.sparse矩阵始终是矩阵。

Just to add one case to unutbu’s list.

One of the biggest practical differences for me of numpy ndarrays compared to numpy matrices or matrix languages like matlab, is that the dimension is not preserved in reduce operations. Matrices are always 2d, while the mean of an array, for example, has one dimension less.

For example, to demean the rows of a matrix or array:

with matrix

>>> m = np.asmatrix([[1,2],[2,3]])
>>> m
matrix([[1, 2],
        [2, 3]])
>>> mm = m.mean(1)
>>> mm
matrix([[ 1.5],
        [ 2.5]])
>>> mm.shape
(2, 1)
>>> m - mm
matrix([[-0.5,  0.5],
        [-0.5,  0.5]])

with array

>>> a = np.array([[1,2],[2,3]])
>>> a
array([[1, 2],
       [2, 3]])
>>> am = a.mean(1)
>>> am.shape
(2,)
>>> am
array([ 1.5,  2.5])
>>> a - am #wrong
array([[-0.5, -0.5],
       [ 0.5,  0.5]])
>>> a - am[:, np.newaxis]  #right
array([[-0.5,  0.5],
       [-0.5,  0.5]])

I also think that mixing arrays and matrices gives rise to many “happy” debugging hours. However, scipy.sparse matrices are always matrices in terms of operators like multiplication.
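
The array version can also keep the reduced axis by passing keepdims=True, which makes the subtraction broadcast the same way as in the matrix case. A short sketch using the same array a as above:

```python
import numpy as np

a = np.array([[1, 2], [2, 3]])

# keepdims=True keeps the reduced axis with length 1, so the result is (2, 1)
am = a.mean(axis=1, keepdims=True)
print(am.shape)  # (2, 1)

# Broadcasting a (2, 2) array against a (2, 1) column demeans each row
demeaned = a - am
print(demeaned)
```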


回答 4

正如其他人提到的那样,也许它的主要优点matrix是它为矩阵乘法提供了一种方便的符号。

但是,在Python 3.5中,终于有了一个专用的infix运算符用于矩阵乘法@

在最新的NumPy版本中,它可以与ndarrays 一起使用:

A = numpy.ones((1, 3))
B = numpy.ones((3, 3))
A @ B

因此,如今,如果有更多疑问,您应该坚持ndarray

As others have mentioned, perhaps the main advantage of matrix was that it provided a convenient notation for matrix multiplication.

However, in Python 3.5 there is finally a dedicated infix operator for matrix multiplication: @.

With recent NumPy versions, it can be used with ndarrays:

A = numpy.ones((1, 3))
B = numpy.ones((3, 3))
A @ B

So nowadays, even more, when in doubt, you should stick to ndarray.


Most efficient way to map a function over a numpy array

Question: Most efficient way to map a function over a numpy array

在numpy数组上映射函数的最有效方法是什么?我在当前项目中一直采用的方式如下:

import numpy as np 

x = np.array([1, 2, 3, 4, 5])

# Obtain array of square of each element in x
squarer = lambda t: t ** 2
squares = np.array([squarer(xi) for xi in x])

但是,这似乎效率很低,因为我正在使用列表推导将新数组构造为Python列表,然后再将其转换回numpy数组。

我们可以做得更好吗?

What is the most efficient way to map a function over a numpy array? The way I’ve been doing it in my current project is as follows:

import numpy as np 

x = np.array([1, 2, 3, 4, 5])

# Obtain array of square of each element in x
squarer = lambda t: t ** 2
squares = np.array([squarer(xi) for xi in x])

However, this seems like it is probably very inefficient, since I am using a list comprehension to construct the new array as a Python list before converting it back to a numpy array.

Can we do better?


回答 0

我测试过的所有建议的方法,加上np.array(map(f, x))perfplot(我的一个小项目)。

消息1:如果可以使用numpy的本机函数,请执行此操作。

如果你想已经矢量化功能矢量(如x**2在原岗位的例子),使用的是比什么都更快(注意对数标度):

在此处输入图片说明

如果您确实需要向量化,那么使用哪种变体并不重要。

在此处输入图片说明


复制图的代码:

import numpy as np
import perfplot
import math


def f(x):
    # return math.sqrt(x)
    return np.sqrt(x)


vf = np.vectorize(f)


def array_for(x):
    return np.array([f(xi) for xi in x])


def array_map(x):
    return np.array(list(map(f, x)))


def fromiter(x):
    return np.fromiter((f(xi) for xi in x), x.dtype)


def vectorize(x):
    return np.vectorize(f)(x)


def vectorize_without_init(x):
    return vf(x)


perfplot.show(
    setup=lambda n: np.random.rand(n),
    n_range=[2 ** k for k in range(20)],
    kernels=[f, array_for, array_map, fromiter, vectorize, vectorize_without_init],
    xlabel="len(x)",
)

I’ve tested all suggested methods plus np.array(list(map(f, x))) with perfplot (a small project of mine).

Message #1: If you can use numpy’s native functions, do that.

If the function you’re trying to vectorize already is vectorized (like the x**2 example in the original post), using that is much faster than anything else (note the log scale):

enter image description here

If you actually need vectorization, it doesn’t really matter much which variant you use.

enter image description here


Code to reproduce the plots:

import numpy as np
import perfplot
import math


def f(x):
    # return math.sqrt(x)
    return np.sqrt(x)


vf = np.vectorize(f)


def array_for(x):
    return np.array([f(xi) for xi in x])


def array_map(x):
    return np.array(list(map(f, x)))


def fromiter(x):
    return np.fromiter((f(xi) for xi in x), x.dtype)


def vectorize(x):
    return np.vectorize(f)(x)


def vectorize_without_init(x):
    return vf(x)


perfplot.show(
    setup=lambda n: np.random.rand(n),
    n_range=[2 ** k for k in range(20)],
    kernels=[f, array_for, array_map, fromiter, vectorize, vectorize_without_init],
    xlabel="len(x)",
)

回答 1

如何使用numpy.vectorize

import numpy as np
x = np.array([1, 2, 3, 4, 5])
squarer = lambda t: t ** 2
vfunc = np.vectorize(squarer)
vfunc(x)
# Output : array([ 1,  4,  9, 16, 25])

How about using numpy.vectorize.

import numpy as np
x = np.array([1, 2, 3, 4, 5])
squarer = lambda t: t ** 2
vfunc = np.vectorize(squarer)
vfunc(x)
# Output : array([ 1,  4,  9, 16, 25])

回答 2

TL; DR

@ user2357112所述,应用函数的“直接”方法始终是在Numpy数组上映射函数的最快,最简单的方法:

import numpy as np
x = np.array([1, 2, 3, 4, 5])
f = lambda x: x ** 2
squares = f(x)

通常应避免使用np.vectorize它,因为它运行不佳,并且有(或遇到)许多问题。如果要处理其他数据类型,则可能需要研究以下所示的其他方法。

方法比较

以下是一些简单的测试,用于比较三种映射函数的方法,本示例在Python 3.6和NumPy 1.15.4中使用。首先,用于测试的设置功能:

import timeit
import numpy as np

f = lambda x: x ** 2
vf = np.vectorize(f)

def test_array(x, n):
    t = timeit.timeit(
        'np.array([f(xi) for xi in x])',
        'from __main__ import np, x, f', number=n)
    print('array: {0:.3f}'.format(t))

def test_fromiter(x, n):
    t = timeit.timeit(
        'np.fromiter((f(xi) for xi in x), x.dtype, count=len(x))',
        'from __main__ import np, x, f', number=n)
    print('fromiter: {0:.3f}'.format(t))

def test_direct(x, n):
    t = timeit.timeit(
        'f(x)',
        'from __main__ import x, f', number=n)
    print('direct: {0:.3f}'.format(t))

def test_vectorized(x, n):
    t = timeit.timeit(
        'vf(x)',
        'from __main__ import x, vf', number=n)
    print('vectorized: {0:.3f}'.format(t))

用五个元素(从最快到最慢排序)进行测试:

x = np.array([1, 2, 3, 4, 5])
n = 100000
test_direct(x, n)      # 0.265
test_fromiter(x, n)    # 0.479
test_array(x, n)       # 0.865
test_vectorized(x, n)  # 2.906

具有100多个元素:

x = np.arange(100)
n = 10000
test_direct(x, n)      # 0.030
test_array(x, n)       # 0.501
test_vectorized(x, n)  # 0.670
test_fromiter(x, n)    # 0.883

并且具有1000或更多的数组元素:

x = np.arange(1000)
n = 1000
test_direct(x, n)      # 0.007
test_fromiter(x, n)    # 0.479
test_array(x, n)       # 0.516
test_vectorized(x, n)  # 0.945

不同版本的Python / NumPy和编译器优化将产生不同的结果,因此请针对您的环境进行类似的测试。

TL;DR

As noted by @user2357112, a “direct” method of applying the function is always the fastest and simplest way to map a function over Numpy arrays:

import numpy as np
x = np.array([1, 2, 3, 4, 5])
f = lambda x: x ** 2
squares = f(x)

Generally avoid np.vectorize, as it does not perform well, and has (or had) a number of issues. If you are handling other data types, you may want to investigate the other methods shown below.

Comparison of methods

Here are some simple tests to compare four methods of mapping a function, run with Python 3.6 and NumPy 1.15.4. First, the set-up functions for testing:

import timeit
import numpy as np

f = lambda x: x ** 2
vf = np.vectorize(f)

def test_array(x, n):
    t = timeit.timeit(
        'np.array([f(xi) for xi in x])',
        'from __main__ import np, x, f', number=n)
    print('array: {0:.3f}'.format(t))

def test_fromiter(x, n):
    t = timeit.timeit(
        'np.fromiter((f(xi) for xi in x), x.dtype, count=len(x))',
        'from __main__ import np, x, f', number=n)
    print('fromiter: {0:.3f}'.format(t))

def test_direct(x, n):
    t = timeit.timeit(
        'f(x)',
        'from __main__ import x, f', number=n)
    print('direct: {0:.3f}'.format(t))

def test_vectorized(x, n):
    t = timeit.timeit(
        'vf(x)',
        'from __main__ import x, vf', number=n)
    print('vectorized: {0:.3f}'.format(t))

Testing with five elements (sorted from fastest to slowest):

x = np.array([1, 2, 3, 4, 5])
n = 100000
test_direct(x, n)      # 0.265
test_fromiter(x, n)    # 0.479
test_array(x, n)       # 0.865
test_vectorized(x, n)  # 2.906

With 100s of elements:

x = np.arange(100)
n = 10000
test_direct(x, n)      # 0.030
test_array(x, n)       # 0.501
test_vectorized(x, n)  # 0.670
test_fromiter(x, n)    # 0.883

And with 1000s of array elements or more:

x = np.arange(1000)
n = 1000
test_direct(x, n)      # 0.007
test_fromiter(x, n)    # 0.479
test_array(x, n)       # 0.516
test_vectorized(x, n)  # 0.945

Different versions of Python/NumPy and compiler optimization will have different results, so do a similar test for your environment.


回答 3

numexprnumbacython周围,此答案的目的是考虑这些可能性。

但是首先让我们说明一个显而易见的事实:无论您如何将Python函数映射到numpy数组,它都会保留为Python函数,这意味着每次评估:

  • numpy-array元素必须转换为Python对象(例如, Float)。
  • 所有的计算都是使用Python对象完成的,这意味着要占用解释器,动态分配和不可变对象的开销。

因此,由于上面提到的开销,实际上用于循环遍历数组的机制不会发挥很大的作用-它比使用numpy的内置功能要慢得多。

让我们看下面的例子:

# numpy-functionality
def f(x):
    return x+2*x*x+4*x*x*x

# python-function as ufunc
import numpy as np
vf=np.vectorize(f)
vf.__name__="vf"

np.vectorize被选为方法的纯Python函数类的代表。使用perfplot(请参阅此答案的附录中的代码),我们得到以下运行时间:

在此处输入图片说明

我们可以看到,numpy方法比纯python版本快10到100倍。对于更大的数组大小,性能下降可能是因为数据不再适合高速缓存。

值得一提的是,vectorize它还占用大量内存,因此内存使用常常是瓶颈(请参阅相关的SO问题)。还要注意,numpy的文档np.vectorize指出“主要是为了方便而不是性能而提供”。

需要性能时,应使用其他工具,除了从头开始编写C扩展名外,还有以下可能性:


人们经常听到,numpy性能是最好的,因为它是纯C语言。但是还有很多改进的空间!

向量化的numpy版本使用大量额外的内存和内存访问。Numexp库尝试对numpy数组进行平铺,从而获得更好的缓存利用率:

# less cache misses than numpy-functionality
import numexpr as ne
def ne_f(x):
    return ne.evaluate("x+2*x*x+4*x*x*x")

导致以下比较:

在此处输入图片说明

我无法解释上面图表中的所有内容:一开始我们会看到numexpr-library的开销更大,但是因为它更好地利用了缓存,所以对于较大的数组,它的速度要快大约10倍!


另一种方法是通过jit编译功能,从而获得真正的纯C UFunc。这是numba的方法:

# runtime generated C-function as ufunc
import numba as nb
@nb.vectorize(target="cpu")
def nb_vf(x):
    return x+2*x*x+4*x*x*x

它比原始的numpy方法快10倍:

在此处输入图片说明


但是,该任务可尴尬地可并行化,因此我们也可以使用prange它来并行计算循环:

@nb.njit(parallel=True)
def nb_par_jitf(x):
    y=np.empty(x.shape)
    for i in nb.prange(len(x)):
        y[i]=x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
    return y

不出所料,并行功能对于较小的输入而言较慢,但对于较大的输入则较快(几乎为2倍):

在此处输入图片说明


虽然numba专门研究使用numpy数组优化操作,但Cython是更通用的工具。提取与numba相同的性能更加复杂-相对于本地编译器(gcc / MSVC),通常归结为llvm(numba):

%%cython -c=/openmp -a
import numpy as np
import cython

#single core:
@cython.boundscheck(False) 
@cython.wraparound(False) 
def cy_f(double[::1] x):
    y_out=np.empty(len(x))
    cdef Py_ssize_t i
    cdef double[::1] y=y_out
    for i in range(len(x)):
        y[i] = x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
    return y_out

#parallel:
from cython.parallel import prange
@cython.boundscheck(False) 
@cython.wraparound(False)  
def cy_par_f(double[::1] x):
    y_out=np.empty(len(x))
    cdef double[::1] y=y_out
    cdef Py_ssize_t i
    cdef Py_ssize_t n = len(x)
    for i in prange(n, nogil=True):
        y[i] = x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
    return y_out

Cython导致功能变慢:

在此处输入图片说明


结论

显然,仅测试一个功能并不能证明任何事情。还要记住的是,对于所选的功能示例,内存的带宽是大于10 ^ 5个元素的瓶颈-因此,在该区域中,numba,numexpr和cython的性能相同。

最后,最终答案取决于函数的类型,硬件,Python分布和其他因素。例如,Anaconda-distribution使用Intel的VML来实现numpy的功能,从而在超越性功能(如,和类似功能)方面的性能要优于numba(除非它使用SVML,请参见此SO-post),例如exp,请参见以下SO-postsincos

但是从这次调查和到目前为止的经验来看,只要不涉及先验功能,numba似乎是性能最佳的最简单工具。


使用perfplot -package绘制运行时间:

import perfplot
perfplot.show(
    setup=lambda n: np.random.rand(n),
    n_range=[2**k for k in range(0,24)],
    kernels=[
        f, 
        vf,
        ne_f, 
        nb_vf, nb_par_jitf,
        cy_f, cy_par_f,
        ],
    logx=True,
    logy=True,
    xlabel='len(x)'
    )

numexpr, numba and cython are also available; the goal of this answer is to take these possibilities into consideration.

But first let’s state the obvious: no matter how you map a Python function onto a numpy array, it stays a Python function, which means that for every evaluation:

  • the numpy-array element must be converted to a Python object (e.g. a float).
  • all calculations are done with Python objects, which incurs the overhead of the interpreter, dynamic dispatch and immutable objects.

So the machinery used to actually loop through the array doesn’t play a big role because of the overhead mentioned above: it stays much slower than using numpy’s built-in functionality.
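
The per-element overhead described above can be made visible by recording what the wrapped function receives. A small sketch, assuming np.vectorize as the mapping mechanism and a hypothetical helper named record; otypes is passed so that no extra probing call is made:

```python
import numpy as np

seen_types = set()
call_count = 0

def record(v):
    """Hypothetical helper: remember the argument type and count the calls."""
    global call_count
    call_count += 1
    seen_types.add(type(v))
    return v * 2.0

x = np.arange(5, dtype=np.float64)

# otypes fixes the output dtype up front, so record is called once per element
result = np.vectorize(record, otypes=[np.float64])(x)

print(call_count)   # one Python-level call per element
print(seen_types)   # the types the function actually received
```

Each of those calls goes through the interpreter, which is where the time goes regardless of which looping mechanism drives them.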

Let’s take a look at the following example:

# numpy-functionality
def f(x):
    return x+2*x*x+4*x*x*x

# python-function as ufunc
import numpy as np
vf=np.vectorize(f)
vf.__name__="vf"

np.vectorize is picked as a representative of the pure-python function class of approaches. Using perfplot (see code in the appendix of this answer) we get the following running times:

enter image description here

We can see that the numpy approach is 10x-100x faster than the pure Python version. The decrease in performance for bigger array sizes is probably because the data no longer fits in the cache.

It is also worth mentioning that vectorize uses a lot of memory, so memory usage is often the bottleneck (see the related SO question). Also note that numpy’s documentation on np.vectorize states that it is “provided primarily for convenience, not for performance”.

When performance is desired, other tools should be used. Besides writing a C extension from scratch, there are the following possibilities:


One often hears that numpy performance is as good as it gets, because it is pure C under the hood. Yet there is a lot of room for improvement!

The vectorized numpy version uses a lot of additional memory and memory accesses. The numexpr library tries to tile the numpy arrays and thus get better cache utilization:

# less cache misses than numpy-functionality
import numexpr as ne
def ne_f(x):
    return ne.evaluate("x+2*x*x+4*x*x*x")

Leads to the following comparison:

enter image description here

I cannot explain everything in the plot above: we can see a bigger overhead for the numexpr library at the beginning, but because it utilizes the cache better it is about 10 times faster for bigger arrays!


Another approach is to jit-compile the function and thus get a real pure-C ufunc. This is numba’s approach:

# runtime generated C-function as ufunc
import numba as nb
@nb.vectorize(target="cpu")
def nb_vf(x):
    return x+2*x*x+4*x*x*x

It is 10 times faster than the original numpy-approach:

enter image description here


However, the task is embarrassingly parallelizable, thus we also could use prange in order to calculate the loop in parallel:

@nb.njit(parallel=True)
def nb_par_jitf(x):
    y=np.empty(x.shape)
    for i in nb.prange(len(x)):
        y[i]=x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
    return y

As expected, the parallel function is slower for smaller inputs, but faster (almost factor 2) for larger sizes:

enter image description here


While numba specializes in optimizing operations on numpy arrays, Cython is a more general tool. It is more complicated to extract the same performance as with numba; often it comes down to llvm (numba) vs the local compiler (gcc/MSVC):

%%cython -c=/openmp -a
import numpy as np
import cython

#single core:
@cython.boundscheck(False) 
@cython.wraparound(False) 
def cy_f(double[::1] x):
    y_out=np.empty(len(x))
    cdef Py_ssize_t i
    cdef double[::1] y=y_out
    for i in range(len(x)):
        y[i] = x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
    return y_out

#parallel:
from cython.parallel import prange
@cython.boundscheck(False) 
@cython.wraparound(False)  
def cy_par_f(double[::1] x):
    y_out=np.empty(len(x))
    cdef double[::1] y=y_out
    cdef Py_ssize_t i
    cdef Py_ssize_t n = len(x)
    for i in prange(n, nogil=True):
        y[i] = x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
    return y_out

Cython results in somewhat slower functions:

enter image description here


Conclusion

Obviously, testing only one function doesn’t prove anything. One should also keep in mind that, for the chosen example function, memory bandwidth was the bottleneck for sizes larger than 10^5 elements; thus we had the same performance for numba, numexpr and cython in this region.

In the end, the ultimate answer depends on the type of function, hardware, Python distribution and other factors. For example, the Anaconda distribution uses Intel’s VML for numpy’s functions and thus easily outperforms numba (unless it uses SVML, see this SO-post) for transcendental functions like exp, sin, cos and similar; see e.g. the following SO-post.

Yet from this investigation and from my experience so far, I would say that numba seems to be the easiest tool with the best performance as long as no transcendental functions are involved.


Plotting running times with perfplot-package:

import perfplot
perfplot.show(
    setup=lambda n: np.random.rand(n),
    n_range=[2**k for k in range(0,24)],
    kernels=[
        f, 
        vf,
        ne_f, 
        nb_vf, nb_par_jitf,
        cy_f, cy_par_f,
        ],
    logx=True,
    logy=True,
    xlabel='len(x)'
    )

回答 4

squares = squarer(x)

数组上的算术运算会自动按元素进行应用,并使用高效的C级循环,避免了所有适用于Python级循环或理解的解释器开销。

您想将所有元素应用于NumPy数组的大多数功能都可以使用,尽管有些功能可能需要更改。例如,if不能逐个元素地工作。您想要将其转换为使用类似numpy.where以下的构造:

def using_if(x):
    if x < 5:
        return x
    else:
        return x**2

变成

def using_where(x):
    return numpy.where(x < 5, x, x**2)
squares = squarer(x)

Arithmetic operations on arrays are automatically applied elementwise, with efficient C-level loops that avoid all the interpreter overhead that would apply to a Python-level loop or comprehension.

Most of the functions you’d want to apply to a NumPy array elementwise will just work, though some may need changes. For example, if doesn’t work elementwise. You’d want to convert those to use constructs like numpy.where:

def using_if(x):
    if x < 5:
        return x
    else:
        return x**2

becomes

def using_where(x):
    return numpy.where(x < 5, x, x**2)
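
A quick check of the converted function on a small array; the input values are only an example. Note that numpy.where evaluates both x and x**2 for every element and then selects between them, so this pattern assumes both branches are safe to compute everywhere.

```python
import numpy

def using_where(x):
    # Elementwise choice: keep x where x < 5, otherwise use x**2
    return numpy.where(x < 5, x, x**2)

x = numpy.array([1, 4, 5, 10])
result = using_where(x)
print(result)  # [  1   4  25 100]
```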

回答 5

我相信在numpy的较新版本(我使用1.13)中,您可以通过将numpy数组传递给您为标量类型编写的函数来调用函数,它将自动将函数调用应用于numpy数组上的每个元素并返回另一个numpy数组

>>> import numpy as np
>>> squarer = lambda t: t ** 2
>>> x = np.array([1, 2, 3, 4, 5])
>>> squarer(x)
array([ 1,  4,  9, 16, 25])

I believe that in newer versions of numpy (I use 1.13) you can simply pass the numpy array to a function that you wrote for a scalar type; as long as the function only uses operations that numpy arrays support (such as ** here), they are applied to each element and another numpy array is returned.

>>> import numpy as np
>>> squarer = lambda t: t ** 2
>>> x = np.array([1, 2, 3, 4, 5])
>>> squarer(x)
array([ 1,  4,  9, 16, 25])

回答 6

在许多情况下,numpy.apply_along_axis是最佳选择。与其他方法相比,它的性能提高了约100倍-不仅对于微不足道的测试功能,而且对于numpy和scipy的更复杂的功能组成。

当我添加方法时:

def along_axis(x):
    return np.apply_along_axis(f, 0, x)

到perfplot代码,我得到以下结果: 在此处输入图片说明

In many cases, numpy.apply_along_axis will be the best choice. It increases the performance by about 100x compared to the other approaches – and not only for trivial test functions, but also for more complex function compositions from numpy and scipy.

When I add the method:

def along_axis(x):
    return np.apply_along_axis(f, 0, x)

to the perfplot code, I get the following results: enter image description here
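
It may help to check how often np.apply_along_axis actually calls the function when the input is one-dimensional. The sketch below assumes a hypothetical counting wrapper counted_sqrt; on a 1-D array the function is called once with the whole array, which is the same as calling it directly and may explain much of the measured speed-up.

```python
import numpy as np

calls = 0

def counted_sqrt(v):
    """Hypothetical wrapper around np.sqrt that counts its invocations."""
    global calls
    calls += 1
    return np.sqrt(v)

x = np.arange(1000, dtype=np.float64)
result = np.apply_along_axis(counted_sqrt, 0, x)

print(calls)                            # 1
print(np.allclose(result, np.sqrt(x)))  # True
```

For a function that only works on scalars, this would therefore not act as an element-by-element map.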


回答 7

似乎没有人提到过内置的工厂生产ufuncnumpy软件包的方法:np.frompyfunc我再次进行了测试np.vectorize,其性能要比其高出20%到30%。当然,它可以按规定的C代码甚至numba(我还没有测试过的)性能很好,但是比起更好的选择np.vectorize

f = lambda x, y: x * y
f_arr = np.frompyfunc(f, 2, 1)
vf = np.vectorize(f)
arr = np.linspace(0, 1, 10000)

%timeit f_arr(arr, arr) # 307ms
%timeit vf(arr, arr) # 450ms

我还测试了较大的样本,并且改进成比例。另请参阅文档在这里

It seems no one has mentioned a built-in factory method for producing a ufunc in the numpy package: np.frompyfunc, which I have tested against np.vectorize and which outperformed it by about 20~30%. Of course it will not perform as well as hand-written C code or even numba (which I have not tested), but it can be a better alternative than np.vectorize.

f = lambda x, y: x * y
f_arr = np.frompyfunc(f, 2, 1)
vf = np.vectorize(f)
arr = np.linspace(0, 1, 10000)

%timeit f_arr(arr, arr) # 307ms
%timeit vf(arr, arr) # 450ms

I have also tested larger samples, and the improvement is proportional. See the documentation also here
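
One detail to keep in mind is that a ufunc created by np.frompyfunc returns arrays of dtype object. A minimal sketch, assuming a numeric result is wanted afterwards:

```python
import numpy as np

f = lambda x, y: x * y
f_arr = np.frompyfunc(f, 2, 1)   # 2 inputs, 1 output

arr = np.linspace(0, 1, 5)
raw = f_arr(arr, arr)
print(raw.dtype)                 # object

# Cast back to a numeric dtype before further numerical work
squares = raw.astype(np.float64)
print(np.allclose(squares, arr * arr))  # True
```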


回答 8

正如提到的这篇文章,只是使用生成器表达式如下所示:

numpy.fromiter((<some_func>(x) for x in <something>),<dtype>,<size of something>)

As mentioned in this post, just use generator expressions like so:

numpy.fromiter((<some_func>(x) for x in <something>),<dtype>,<size of something>)
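
Filled in with concrete names, the template above might look like the following sketch; squarer and x are taken from the question, and count is the optional size hint.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
squarer = lambda t: t ** 2

# count lets fromiter pre-allocate the output instead of growing it
squares = np.fromiter((squarer(xi) for xi in x), dtype=x.dtype, count=len(x))
print(squares)  # [ 1  4  9 16 25]
```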

回答 9

以上所有答案比较都不错,但是如果您需要使用自定义函数进行映射,并且 numpy.ndarray,则需要保留数组的形状。

我只比较了两个,但它将保留的形状ndarray。我已经将数组与100万个条目进行比较。在这里,我使用平方函数,该函数也是内置在numpy中的,并且具有很大的性能提升,因为有需要,您可以使用自己选择的函数。

import numpy, time
def timeit():
    y = numpy.arange(1000000)
    now = time.time()
    numpy.array([x * x for x in y.reshape(-1)]).reshape(y.shape)        
    print(time.time() - now)
    now = time.time()
    numpy.fromiter((x * x for x in y.reshape(-1)), y.dtype).reshape(y.shape)
    print(time.time() - now)
    now = time.time()
    numpy.square(y)  
    print(time.time() - now)

输出量

>>> timeit()
1.162431240081787    # list comprehension and then building numpy array
1.0775556564331055   # from numpy.fromiter
0.002948284149169922 # using inbuilt function

在这里,您可以清楚地看到numpy.fromiter采用简单方法的效果很好,如果内置功能可用,请使用它。

All the answers above compare well, but what if you need to use a custom function for mapping, you have a numpy.ndarray, and you need to retain the shape of the array?

I have compared just two approaches, and both retain the shape of the ndarray. I used an array with 1 million entries for the comparison. Here I use the square function, which is also built into numpy and gives a great performance boost; if you need something else, you can use a function of your choice.

import numpy, time
def timeit():
    y = numpy.arange(1000000)
    now = time.time()
    numpy.array([x * x for x in y.reshape(-1)]).reshape(y.shape)        
    print(time.time() - now)
    now = time.time()
    numpy.fromiter((x * x for x in y.reshape(-1)), y.dtype).reshape(y.shape)
    print(time.time() - now)
    now = time.time()
    numpy.square(y)  
    print(time.time() - now)

Output

>>> timeit()
1.162431240081787    # list comprehension and then building numpy array
1.0775556564331055   # from numpy.fromiter
0.002948284149169922 # using inbuilt function

Here you can see that numpy.fromiter does slightly better than the simple list-comprehension approach, and that if a built-in function is available, you should use it.
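
For a multi-dimensional array, the same flatten, map, reshape idea can be sketched as follows. The function custom is a hypothetical placeholder, and passing count is my addition rather than part of the timing code above:

```python
import numpy as np

def custom(x):
    # Hypothetical element-wise function with no built-in NumPy equivalent.
    return x * x + 1

y = np.arange(6).reshape(2, 3)

# Flatten, map element by element, then restore the original shape.
mapped = np.fromiter(
    (custom(x) for x in y.reshape(-1)), dtype=y.dtype, count=y.size
).reshape(y.shape)

print(mapped)
# [[ 1  2  5]
#  [10 17 26]]
```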


Answer 10

Use numpy.fromfunction(function, shape, **kwargs)

See https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfunction.html
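
A caveat that is easy to miss: numpy.fromfunction does not call the function once per element. It calls it a single time with whole index arrays, so the function has to work on arrays. A small sketch (index_formula and the calls list are illustrative names only):

```python
import numpy as np

calls = []  # records the shape of the argument on each call

def index_formula(i, j):
    # i and j are full index arrays of the requested shape, not scalars.
    calls.append(i.shape)
    return i * 10 + j

grid = np.fromfunction(index_formula, (2, 3), dtype=int)

print(grid)
# [[ 0  1  2]
#  [10 11 12]]
print(calls)  # [(2, 3)]
```

So this fits when the value depends on the element's position; for mapping a scalar function over existing data, the other answers are probably a better match.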


Sorting arrays in NumPy by column

Question: Sorting arrays in NumPy by column

How can I sort an array in NumPy by the nth column?

For example,

a = array([[9, 2, 3],
           [4, 5, 6],
           [7, 0, 5]])

I’d like to sort rows by the second column, such that I get back:

array([[7, 0, 5],
       [9, 2, 3],
       [4, 5, 6]])

Answer 0

@steve's is actually the most elegant way of doing it.

For the “correct” way see the order keyword argument of numpy.ndarray.sort

However, you’ll need to view your array as an array with fields (a structured array).

The “correct” way is quite ugly if you didn’t initially define your array with fields…

As a quick example, to sort it and return a copy:

In [1]: import numpy as np

In [2]: a = np.array([[1,2,3],[4,5,6],[0,0,1]], dtype=np.int64)  # 'i8' view needs 8-byte ints

In [3]: np.sort(a.view('i8,i8,i8'), order=['f1'], axis=0).view(np.int64)  # np.int was removed in NumPy 1.24
Out[3]: 
array([[0, 0, 1],
       [1, 2, 3],
       [4, 5, 6]])

To sort it in-place:

In [6]: a.view('i8,i8,i8').sort(order=['f1'], axis=0) #<-- returns None

In [7]: a
Out[7]: 
array([[0, 0, 1],
       [1, 2, 3],
       [4, 5, 6]])

@Steve’s really is the most elegant way to do it, as far as I know…

The only advantage to this method is that the "order" argument is a list of the fields to order the search by. For example, you can sort by the second column, then the third column, then the first column by supplying order=['f1','f2','f0'].
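
To make the multi-field case concrete, here is a small sketch using a different example array (chosen so that the second column has ties). It assumes a C-contiguous int64 array so that the 'i8,i8,i8' view lines up with the rows:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 0, 6],
              [0, 0, 1],
              [7, 2, 0]], dtype=np.int64)

# View each row as one record with fields f0, f1, f2, then sort by
# second column, breaking ties by third column, then by first column.
structured = a.view('i8,i8,i8')
sorted_rows = np.sort(structured, order=['f1', 'f2', 'f0'], axis=0).view(np.int64)

print(sorted_rows)
# [[0 0 1]
#  [4 0 6]
#  [7 2 0]
#  [1 2 3]]
```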


Answer 1

I suppose this works: a[a[:,1].argsort()]

This selects the second column of a, computes the indices that would sort that column, and uses them to reorder the rows of a.
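
Spelled out step by step with the array from the question (the intermediate name order is only for illustration):

```python
import numpy as np

a = np.array([[9, 2, 3],
              [4, 5, 6],
              [7, 0, 5]])

order = a[:, 1].argsort()  # indices that sort the second column: [2 0 1]
sorted_a = a[order]        # reorder whole rows using those indices

print(sorted_a)
# [[7 0 5]
#  [9 2 3]
#  [4 5 6]]
```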


Answer 2

You can sort on multiple columns as per Steve Tjoa’s method by using a stable sort like mergesort and sorting the indices from the least significant to the most significant columns:

a = a[a[:,2].argsort()] # First sort doesn't need to be stable.
a = a[a[:,1].argsort(kind='mergesort')]
a = a[a[:,0].argsort(kind='mergesort')]

This sorts by column 0, then 1, then 2.
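
A quick way to convince yourself that the repeated stable sort gives a full lexicographic ordering is to compare it against Python's own sorted on the rows. This is only a sanity-check sketch on a made-up array; kind='stable' is used here as an equivalent spelling of a stable sort in recent NumPy versions:

```python
import numpy as np

a = np.array([[1, 2, 0],
              [0, 1, 5],
              [1, 0, 3],
              [0, 1, 2],
              [1, 0, 1]])

b = a[a[:, 2].argsort()]               # least significant column first
b = b[b[:, 1].argsort(kind='stable')]  # stable sort keeps column 2 order on ties
b = b[b[:, 0].argsort(kind='stable')]  # most significant column last

print(b.tolist() == sorted(a.tolist()))  # True
```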


Answer 3

From the Python documentation wiki, I think you can do:

a = [[1, 2, 3], [4, 5, 6], [0, 0, 1]]
a = sorted(a, key=lambda a_entry: a_entry[1])  # sort rows by their second element
print(a)

The output is:

[[0, 0, 1], [1, 2, 3], [4, 5, 6]]

Answer 4

In case someone wants to sort in a performance-critical part of their program, here's a performance comparison of the different proposals:

import numpy as np
table = np.random.rand(5000, 10)

%timeit table.view('f8,f8,f8,f8,f8,f8,f8,f8,f8,f8').sort(order=['f9'], axis=0)
1000 loops, best of 3: 1.88 ms per loop

%timeit table[table[:,9].argsort()]
10000 loops, best of 3: 180 µs per loop

import pandas as pd
df = pd.DataFrame(table)
%timeit df.sort_values(9, ascending=True)
1000 loops, best of 3: 400 µs per loop

So, it looks like indexing with argsort is the quickest method so far…


Answer 5

From the NumPy mailing list, here’s another solution:

>>> a
array([[1, 2],
       [0, 0],
       [1, 0],
       [0, 2],
       [2, 1],
       [1, 0],
       [1, 0],
       [0, 0],
       [1, 0],
       [2, 2]])
>>> a[np.lexsort(np.fliplr(a).T)]
array([[0, 0],
       [0, 0],
       [0, 2],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 2],
       [2, 1],
       [2, 2]])
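
If the fliplr/.T combination looks cryptic: np.lexsort takes a sequence of keys and treats the last key as the primary one. Flipping the columns and transposing therefore makes column 0 the primary key. A small sketch on a shortened version of the array (intermediate names are illustrative):

```python
import numpy as np

a = np.array([[1, 2],
              [0, 0],
              [1, 0],
              [0, 2],
              [2, 1]])

keys = np.fliplr(a).T   # rows of keys: column 1, then column 0
idx = np.lexsort(keys)  # last key (column 0) is primary, column 1 breaks ties
sorted_a = a[idx]

print(sorted_a)
# [[0 0]
#  [0 2]
#  [1 0]
#  [1 2]
#  [2 1]]
```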

Answer 6

I had a similar problem.

My Problem:

I want to calculate an SVD and need to sort my eigenvalues in descending order. But I want to keep the mapping between eigenvalues and eigenvectors. My eigenvalues were in the first row and the corresponding eigenvector below it in the same column.

So I want to sort a two-dimensional array column-wise by the first row in descending order.

My Solution

a = a[::, a[0,].argsort()[::-1]]

So how does this work?

a[0,] is just the first row I want to sort by.

Now I use argsort to get the order of indices.

I use [::-1] because I need descending order.

Lastly I use a[::, ...] to get the columns in the right order (note that indexing with an index array returns a copy rather than a view).
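
A tiny worked example of the same idea, with made-up numbers standing in for the eigenvalues (first row) and the vectors below them:

```python
import numpy as np

a = np.array([[ 2.0,  5.0,  1.0],   # values to sort by (descending)
              [10.0, 20.0, 30.0],   # data that must stay attached to each column
              [11.0, 21.0, 31.0]])

order = a[0].argsort()[::-1]  # column indices from largest to smallest: [1 0 2]
sorted_cols = a[:, order]     # reorder whole columns together

print(sorted_cols)
# [[ 5.  2.  1.]
#  [20. 10. 30.]
#  [21. 11. 31.]]
```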


Answer 7

A slightly more complicated lexsort example: descending on the 1st column, then ascending on the 2nd. The tricks with lexsort are that it sorts on rows (hence the .T) and gives priority to the last row, i.e. the last key.

In [120]: b=np.array([[1,2,1],[3,1,2],[1,1,3],[2,3,4],[3,2,5],[2,1,6]])
In [121]: b
Out[121]: 
array([[1, 2, 1],
       [3, 1, 2],
       [1, 1, 3],
       [2, 3, 4],
       [3, 2, 5],
       [2, 1, 6]])
In [122]: b[np.lexsort(([1,-1]*b[:,[1,0]]).T)]
Out[122]: 
array([[3, 1, 2],
       [3, 2, 5],
       [2, 1, 6],
       [2, 3, 4],
       [1, 1, 3],
       [1, 2, 1]])
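
The one-liner can also be written with the keys listed explicitly, which some may find easier to read. This is just an equivalent sketch: negating a numeric column turns an ascending sort into a descending one, and the last key passed to lexsort is the primary one.

```python
import numpy as np

b = np.array([[1, 2, 1], [3, 1, 2], [1, 1, 3], [2, 3, 4], [3, 2, 5], [2, 1, 6]])

# Secondary key first (column 1 ascending), primary key last (column 0 descending).
idx = np.lexsort((b[:, 1], -b[:, 0]))
result = b[idx]

print(result)
# [[3 1 2]
#  [3 2 5]
#  [2 1 6]
#  [2 3 4]
#  [1 1 3]
#  [1 2 1]]
```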

Answer 8

Here is another solution considering all columns (a more compact version of J.J's answer):

ar=np.array([[0, 0, 0, 1],
             [1, 0, 1, 0],
             [0, 1, 0, 0],
             [1, 0, 0, 1],
             [0, 0, 1, 0],
             [1, 1, 0, 0]])

Sort with lexsort:

ar[np.lexsort(([ar[:, i] for i in range(ar.shape[1]-1, -1, -1)]))]

Output:

array([[0, 0, 0, 1],
       [0, 0, 1, 0],
       [0, 1, 0, 0],
       [1, 0, 0, 1],
       [1, 0, 1, 0],
       [1, 1, 0, 0]])

Answer 9

Simply use sorted, passing the number of the column you want to sort by.

import numpy as np

a = np.array([[1,1], [1,-1], [-1,1], [-1,-1]])
print(a)
a = a.tolist()
a = np.array(sorted(a, key=lambda a_entry: a_entry[0]))  # sort rows by column 0
print(a)

Answer 10

This is an old question, but if you need to generalize this to arrays with more than 2 dimensions, here is a solution that can be easily generalized:

np.einsum('ij->ij', a[a[:,1].argsort(),:])

This is overkill for two dimensions, where a[a[:,1].argsort()] would be enough as per @steve's answer; however, that answer cannot be generalized to higher dimensions. You can find an example with a 3D array in this question.

Output:

[[7 0 5]
 [9 2 3]
 [4 5 6]]