Tag Archives: arrays

What is the difference between contiguous and non-contiguous arrays?

Question: What is the difference between contiguous and non-contiguous arrays?


In the numpy manual about the reshape() function, it says

>>> a = np.zeros((10, 2))
# A transpose make the array non-contiguous
>>> b = a.T
# Taking a view makes it possible to modify the shape without modifying the
# initial object.
>>> c = b.view()
>>> c.shape = (20)
AttributeError: incompatible shape for a non-contiguous array

My questions are:

  1. What are contiguous and non-contiguous arrays? Is this similar to a contiguous memory block in C (as in "What is a contiguous memory block?")?
  2. Is there any performance difference between these two? When should we use one or the other?
  3. Why does transpose make the array non-contiguous?
  4. Why does c.shape = (20) throw the error incompatible shape for a non-contiguous array?

Thanks for your answer!


Answer 0


A contiguous array is just an array stored in an unbroken block of memory: to access the next value in the array, we just move to the next memory address.

Consider the 2D array arr = np.arange(12).reshape(3,4). It looks like this:
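
[[ 0,  1,  2,  3],
 [ 4,  5,  6,  7],
 [ 8,  9, 10, 11]]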

In the computer’s memory, the values of arr are stored like this:
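
[ 0][ 1][ 2][ 3][ 4][ 5][ 6][ 7][ 8][ 9][10][11]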

This means arr is a C contiguous array because the rows are stored as contiguous blocks of memory. The next memory address holds the next value in that row. If we want to move down a column, we just need to jump over three blocks (e.g. to jump from 0 to 4 means we skip over 1, 2 and 3).

Transposing the array with arr.T means that C contiguity is lost because adjacent row entries are no longer in adjacent memory addresses. However, arr.T is Fortran contiguous since the columns are in contiguous blocks of memory:
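
[[ 0,  4,  8],
 [ 1,  5,  9],
 [ 2,  6, 10],
 [ 3,  7, 11]]

Each column of arr.T (0,1,2,3 then 4,5,6,7 then 8,9,10,11) occupies one consecutive run of the same, unchanged memory block.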


Performance-wise, accessing memory addresses which are next to each other is very often faster than accessing addresses which are more “spread out” (fetching a value from RAM could entail a number of neighbouring addresses being fetched and cached for the CPU.) This means that operations over contiguous arrays will often be quicker.

As a consequence of C contiguous memory layout, row-wise operations are usually faster than column-wise operations. For example, you’ll typically find that

np.sum(arr, axis=1) # sum the rows

is slightly faster than:

np.sum(arr, axis=0) # sum the columns

Similarly, operations on columns will be slightly faster for Fortran contiguous arrays.
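
If you want to check this on your own machine, here is a quick timing sketch (illustrative only; the exact numbers depend on hardware, array size and NumPy version):

import numpy as np
from timeit import timeit

arr = np.random.rand(1000, 1000)   # C contiguous by default

# row-wise sum walks consecutive memory addresses
print(timeit(lambda: arr.sum(axis=1), number=100))
# column-wise sum strides through memory
print(timeit(lambda: arr.sum(axis=0), number=100))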


Finally, why can’t we flatten the Fortran contiguous array by assigning a new shape?

>>> arr2 = arr.T
>>> arr2.shape = 12
AttributeError: incompatible shape for a non-contiguous array

In order for this to be possible NumPy would have to put the rows of arr.T together like this:
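
[ 0][ 4][ 8][ 1][ 5][ 9][ 2][ 6][10][ 3][ 7][11]

(while the actual buffer still holds 0, 1, 2, ..., 11 in that order)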

(Setting the shape attribute directly assumes C order – i.e. NumPy tries to perform the operation row-wise.)

This is impossible to do. For any axis, NumPy needs to have a constant stride length (the number of bytes to move) to get to the next element of the array. Flattening arr.T in this way would require skipping forwards and backwards in memory to retrieve consecutive values of the array.

If we wrote arr2.reshape(12) instead, NumPy would copy the values of arr2 into a new block of memory (since it can’t return a view on to the original data for this shape).
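
A small check of the copy semantics (np.shares_memory is available in NumPy 1.11 and later):

import numpy as np

arr = np.arange(12).reshape(3, 4)
arr2 = arr.T

flat = arr2.reshape(12)                    # works: NumPy copies the data
print(flat)                                # [ 0  4  8  1  5  9  2  6 10  3  7 11]
print(np.shares_memory(arr, flat))         # False: a copy, not a view
print(np.shares_memory(arr, arr.ravel()))  # True: ravel of a C-contiguous array is a view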


Answer 1


Maybe this example with 12 different array values will help:

In [207]: x=np.arange(12).reshape(3,4).copy()

In [208]: x.flags
Out[208]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  ...
In [209]: x.T.flags
Out[209]: 
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  ...

The C order values are in the order in which they were generated. The transposed ones are not:

In [212]: x.reshape(12,)   # same as x.ravel()
Out[212]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [213]: x.T.reshape(12,)
Out[213]: array([ 0,  4,  8,  1,  5,  9,  2,  6, 10,  3,  7, 11])

You can get 1d views of both

In [214]: x1=x.T

In [217]: x.shape=(12,)

the shape of x can also be changed.

In [220]: x1.shape=(12,)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-220-cf2b1a308253> in <module>()
----> 1 x1.shape=(12,)

AttributeError: incompatible shape for a non-contiguous array

But the shape of the transpose cannot be changed. The data is still in the 0,1,2,3,4... order, which can’t be accessed as 0,4,8... in a 1d array.

But a copy of x1 can be changed:

In [227]: x2=x1.copy()

In [228]: x2.flags
Out[228]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  ...
In [229]: x2.shape=(12,)

Looking at strides might also help. A stride is how far (in bytes) it has to step to get to the next value. For a 2d array, there will be 2 stride values:

In [233]: x=np.arange(12).reshape(3,4).copy()

In [234]: x.strides
Out[234]: (16, 4)

To get to the next row, step 16 bytes, next column only 4.

In [235]: x1.strides
Out[235]: (4, 16)

Transpose just switches the order of the strides. The next row is only 4 bytes away, i.e. the next number.

In [236]: x.shape=(12,)

In [237]: x.strides
Out[237]: (4,)

Changing the shape also changes the strides – just step through the buffer 4 bytes at a time.

In [238]: x2=x1.copy()

In [239]: x2.strides
Out[239]: (12, 4)

Even though x2 looks just like x1, it has its own data buffer, with the values in a different order. The next column is now 4 bytes over, while the next row is 12 (3*4).

In [240]: x2.shape=(12,)

In [241]: x2.strides
Out[241]: (4,)

And as with x, changing the shape to 1d reduces the strides to (4,).

For x1, with data in the 0,1,2,... order, there isn’t a 1d stride that would give 0,4,8....

__array_interface__ is another useful way of displaying array information:

In [242]: x1.__array_interface__
Out[242]: 
{'strides': (4, 16),
 'typestr': '<i4',
 'shape': (4, 3),
 'version': 3,
 'data': (163336056, False),
 'descr': [('', '<i4')]}

The x1 data buffer address will be the same as for x, with which it shares the data. x2 has a different buffer address.

You could also experiment with adding a order='F' parameter to the copy and reshape commands.
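
For example, a sketch of what to expect (assuming 4-byte ints, as in the examples above):

In [243]: x = np.arange(12).reshape(3,4)

In [244]: x.T.reshape(12, order='F')   # read the transpose in Fortran order: a view, no copy
Out[244]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [245]: x.copy(order='F').strides    # Fortran-ordered copy: down a column is one itemsize
Out[245]: (4, 12)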


Numpy isnan() fails on an array of floats (from a pandas dataframe apply)

Question: Numpy isnan() fails on an array of floats (from a pandas dataframe apply)


I have an array of floats (some normal numbers, some nans) that is coming out of an apply on a pandas dataframe.

For some reason, numpy.isnan is failing on this array. However, as shown below, each element is a float, numpy.isnan runs correctly on each element, and the type of the variable is definitely a numpy array.

What’s going on?!

set([type(x) for x in tester])
Out[59]: {float}

tester
Out[60]: 
array([-0.7000000000000001, nan, nan, nan, nan, nan, nan, nan, nan, nan,
   nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
   nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
   nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
   nan, nan], dtype=object)

set([type(x) for x in tester])
Out[61]: {float}

np.isnan(tester)
Traceback (most recent call last):

File "<ipython-input-62-e3638605b43c>", line 1, in <module>
np.isnan(tester)

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

set([np.isnan(x) for x in tester])
Out[65]: {False, True}

type(tester)
Out[66]: numpy.ndarray

Answer 0


np.isnan can be applied to NumPy arrays of native dtype (such as np.float64):

In [99]: np.isnan(np.array([np.nan, 0], dtype=np.float64))
Out[99]: array([ True, False], dtype=bool)

but raises TypeError when applied to object arrays:

In [96]: np.isnan(np.array([np.nan, 0], dtype=object))
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Since you have Pandas, you could use pd.isnull instead — it can accept NumPy arrays of object or native dtypes:

In [97]: pd.isnull(np.array([np.nan, 0], dtype=float))
Out[97]: array([ True, False], dtype=bool)

In [98]: pd.isnull(np.array([np.nan, 0], dtype=object))
Out[98]: array([ True, False], dtype=bool)

Note that None is also considered a null value in object arrays.
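
Alternatively, if every element really is a Python float (as in the question's tester array), you can cast the object array to a native dtype first; a small sketch:

import numpy as np

tester = np.array([-0.7, np.nan, np.nan], dtype=object)  # stand-in for the question's array
print(np.isnan(tester.astype(np.float64)))               # [False  True  True]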


Answer 1


A great substitute for np.isnan() and pd.isnull() is

for i in range(0, a.shape[0]):
    if a[i] != a[i]:
        # do something here:
        # a[i] is nan
        pass

since only nan is not equal to itself.
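
The same trick also works without the explicit loop, even on object arrays, since the comparison is applied elementwise; for example:

import numpy as np

a = np.array([1.0, float('nan'), 2.0], dtype=object)
mask = a != a     # True exactly where the value is nan
print(mask)       # [False True False]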


Answer 2


On top of @unutbu's answer, you could coerce the pandas object array to a native (float64) dtype, something along the lines of

import pandas as pd
pd.to_numeric(df['tester'], errors='coerce')

Specify errors='coerce' to force strings that can't be parsed to a numeric value to become NaN. The column type will be dtype: float64, and then the isnan check should work.
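
For instance, a small illustration with a made-up column:

import numpy as np
import pandas as pd

s = pd.Series(['1.5', 'not a number', None])
cleaned = pd.to_numeric(s, errors='coerce')  # unparseable entries become NaN
print(cleaned.dtype)             # float64
print(np.isnan(cleaned.values))  # [False  True  True]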


Answer 3


Make sure you import the csv file using Pandas:

import pandas as pd

condition = pd.isnull(data[i][j])

How do I split/partition a dataset into training and test datasets for, e.g., cross validation?

Question: How do I split/partition a dataset into training and test datasets for, e.g., cross validation?


What is a good way to split a NumPy array randomly into training and testing/validation dataset? Something similar to the cvpartition or crossvalind functions in Matlab.


Answer 0


If you want to split the data set once in two halves, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices:

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]

or

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]

There are many ways to repeatedly partition the same data set for cross validation. One strategy is to resample from the dataset, with repetition:

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
training_idx = numpy.random.randint(x.shape[0], size=80)
test_idx = numpy.random.randint(x.shape[0], size=20)
training, test = x[training_idx,:], x[test_idx,:]

Finally, sklearn contains several cross validation methods (k-fold, leave-n-out, …). It also includes more advanced “stratified sampling” methods that create a partition of the data that is balanced with respect to some features, for example to make sure that there is the same proportion of positive and negative examples in the training and test set.
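
For instance, a minimal sketch of k-fold partitioning with scikit-learn's current API (the module path has moved over the years, so adjust to your version):

import numpy as np
from sklearn.model_selection import KFold

x = np.random.rand(100, 5)
for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(x):
    training, test = x[train_idx], x[test_idx]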


Answer 1


There is another option that just entails using scikit-learn. As scikit’s wiki describes, you can just use the following instructions:

import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)

This way you can keep the labels in sync with the data you're splitting into training and test sets.


Answer 2


Just a note. In case you want train, test, AND validation sets, you can do this:

from sklearn.cross_validation import train_test_split

X = get_my_X()
y = get_my_y()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

These parameters will give 70% to the training set and 15% each to the test and validation sets. Hope this helps.


Answer 3


As the sklearn.cross_validation module was deprecated, you can use:

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)

X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=42)

Answer 4


You may also consider stratified division into training and testing sets. Stratified division also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

y = np.array([1,1,2,2,3,3])
train_inds,test_inds = get_train_test_inds(y,train_proportion=0.5)
print(y[train_inds])
print(y[test_inds])

This code outputs:

[1 2 3]
[1 2 3]

Answer 5


I wrote a function for my own project to do this (it doesn’t use numpy, though):

def partition(seq, chunks):
    """Splits the sequence into equal sized chunks and them as a list"""
    result = []
    for i in range(chunks):
        chunk = []
        for element in seq[i:len(seq):chunks]:
            chunk.append(element)
        result.append(chunk)
    return result

If you want the chunks to be randomized, just shuffle the list before passing it in.


Answer 6


Here is some code to split the data into n=5 folds in a stratified manner:

# X = data array
# y = class labels
from sklearn.cross_validation import StratifiedKFold
skf = StratifiedKFold(y, n_folds=5)
for train_index, test_index in skf:
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Answer 7


Thanks pberkes for your answer. I just modified it to avoid (1) replacement while sampling and (2) duplicated instances occurring in both training and testing:

# option 1: sample 80% of the indices without replacement
training_idx = np.random.choice(X.shape[0], int(np.round(X.shape[0] * 0.8)), replace=False)
# option 2: equivalently, take the first 80% of a random permutation
training_idx = np.random.permutation(np.arange(X.shape[0]))[:int(np.round(X.shape[0] * 0.8))]
test_idx = np.setdiff1d(np.arange(0, X.shape[0]), training_idx)

Answer 8


After doing some reading and taking into account the (many..) different ways of splitting the data to train and test, I decided to timeit!

I used 4 different methods (none of them use the sklearn library, which I'm sure would give the best results, given that it is well designed and tested code):

  1. shuffle the whole matrix arr and then split the data to train and test
  2. shuffle the indices and then assign them to x and y to split the data
  3. same as method 2, but done in a more efficient way
  4. using pandas dataframe to split

Method 3 won by far with the shortest time, followed by method 1; methods 2 and 4 turned out to be really inefficient.

The code for the 4 different methods I timed:

import numpy as np
arr = np.random.rand(100, 3)
X = arr[:,:2]
Y = arr[:,2]
spl = 0.7
N = len(arr)
sample = int(spl*N)

#%% Method 1:  shuffle the whole matrix arr and then split
np.random.shuffle(arr)
x_train, x_test, y_train, y_test = X[:sample,:], X[sample:, :], Y[:sample, ], Y[sample:,]

#%% Method 2: shuffle the indices and then apply them to X and Y
train_idx = np.random.choice(N, sample)
Xtrain = X[train_idx]
Ytrain = Y[train_idx]

test_idx = [idx for idx in range(N) if idx not in train_idx]
Xtest = X[test_idx]
Ytest = Y[test_idx]

#%% Method 3: shuffle indices without a for loop
idx = np.random.permutation(arr.shape[0])  # can also use random.shuffle
train_idx, test_idx = idx[:sample], idx[sample:]
x_train, x_test, y_train, y_test = X[train_idx,:], X[test_idx,:], Y[train_idx,], Y[test_idx,]

#%% Method 4: using pandas dataframe to split
import pandas as pd
df = pd.read_csv(file_path, header=None) # Some csv file (I used some file with 3 columns)

train = df.sample(frac=0.7, random_state=200)
test = df.drop(train.index)

And for the times, the minimum time to execute out of 3 repetitions of 1000 loops is:

  • Method 1: 0.35883826200006297 seconds
  • Method 2: 1.7157016959999964 seconds
  • Method 3: 1.7876616719995582 seconds
  • Method 4: 0.07562861499991413 seconds

I hope that’s helpful!


Answer 9


Likely you will not only need to split into train and test, but also cross validate to make sure your model generalizes. Here I am assuming 70% training data, 20% validation and 10% holdout/test data.

Check out np.split:

If indices_or_sections is a 1-D array of sorted integers, the entries indicate where along axis the array is split. For example, [2, 3] would, for axis=0, result in

ary[:2] ary[2:3] ary[3:]

t, v, h = np.split(df.sample(frac=1, random_state=1), [int(0.7*len(df)), int(0.9*len(df))]) 
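
The same three-way split also works on a plain NumPy array, for example:

import numpy as np

x = np.random.rand(100, 5)
train, validate, test = np.split(np.random.permutation(x), [70, 90])  # 70% / 20% / 10%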

Answer 10


Split into train, test and validation sets:

import numpy as np

x = np.expand_dims(np.arange(100), -1)

print(x)

indices = np.random.permutation(x.shape[0])

training_idx = indices[:int(x.shape[0]*.9)]                 # first 90%
test_idx = indices[int(x.shape[0]*.9):int(x.shape[0]*.95)]  # next 5%
val_idx = indices[int(x.shape[0]*.95):]                     # remaining 5%

training, test, val = x[training_idx,:], x[test_idx,:], x[val_idx,:]

print(training, test, val)

How to pad a numpy array with zeros in python

Question: How to pad a numpy array with zeros in python


I want to know how I can pad a 2D numpy array with zeros using python 2.6.6 with numpy version 1.5.0. Sorry! But these are my limitations. Therefore I cannot use np.pad. For example, I want to pad a with zeros such that its shape matches b. The reason why I want to do this is so I can do:

b-a

such that

>>> a
array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])
>>> b
array([[ 3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.]])
>>> c
array([[1, 1, 1, 1, 1, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 1, 1, 1, 1, 0],
       [0, 0, 0, 0, 0, 0]])

The only way I can think of doing this is appending; however, this seems pretty ugly. Is there a cleaner solution, possibly using b.shape?

Edit: thanks to MSeifert's answer. I had to clean it up a bit, and this is what I got:

def pad(array, reference_shape, offsets):
    """
    array: Array to be padded
    reference_shape: tuple of size of ndarray to create
    offsets: list of offsets (number of elements must be equal to the dimension of the array)
    will throw a ValueError if offsets is too big and the reference_shape cannot handle the offsets
    """

    # Create an array of zeros with the reference shape
    result = np.zeros(reference_shape)
    # Create a list of slices from offset to offset + shape in each dimension
    insertHere = [slice(offsets[dim], offsets[dim] + array.shape[dim]) for dim in range(array.ndim)]
    # Insert the array in the result at the specified offsets
    result[tuple(insertHere)] = array  # index with a tuple of slices (a list of slices is deprecated)
    return result
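
For example, a small usage sketch with the function above:

import numpy as np

a = np.ones((3, 5))
c = pad(a, reference_shape=(4, 6), offsets=[0, 0])  # zero-pad a up to shape (4, 6)
# c now has b's shape from the question, so b - c works elementwise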

Answer 0


Very simple, you create an array containing zeros using the reference shape:

result = np.zeros(b.shape)
# actually you can also use result = np.zeros_like(b) 
# but that also copies the dtype not only the shape

and then insert the array where you need it:

result[:a.shape[0],:a.shape[1]] = a

and voila you have padded it:

print(result)
array([[ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.]])

You can also make it a bit more general if you define where your upper left element should be inserted

result = np.zeros_like(b)
x_offset = 1  # 0 would be what you wanted
y_offset = 1  # 0 in your case
result[x_offset:a.shape[0]+x_offset,y_offset:a.shape[1]+y_offset] = a
result

array([[ 0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  1.,  1.,  1.,  1.],
       [ 0.,  1.,  1.,  1.,  1.,  1.],
       [ 0.,  1.,  1.,  1.,  1.,  1.]])

but be careful not to use offsets bigger than allowed; for example, x_offset = 2 would fail here.


If you have an arbitrary number of dimensions you can define a list of slices to insert the original array. I’ve found it interesting to play around a bit, and created a padding function that can pad (with offset) an arbitrarily shaped array, as long as the array and reference have the same number of dimensions and the offsets are not too big.

def pad(array, reference, offsets):
    """
    array: Array to be padded
    reference: Reference array with the desired shape
    offsets: list of offsets (number of elements must be equal to the dimension of the array)
    """
    # Create an array of zeros with the reference shape
    result = np.zeros(reference.shape)
    # Create a list of slices from offset to offset + shape in each dimension
    insertHere = [slice(offsets[dim], offsets[dim] + array.shape[dim]) for dim in range(array.ndim)]
    # Insert the array in the result at the specified offsets (index with a tuple of slices)
    result[tuple(insertHere)] = array
    return result

And some test cases:

import numpy as np

# 1 Dimension
a = np.ones(2)
b = np.ones(5)
offset = [3]
pad(a, b, offset)

# 3 Dimensions

a = np.ones((3,3,3))
b = np.ones((5,4,3))
offset = [1,0,0]
pad(a, b, offset)

Answer 1


NumPy 1.7.0 (when numpy.pad was added) is pretty old now (it was released in 2013), so even though the question asked for a way without using that function, I thought it could be useful to know how this could be achieved using numpy.pad.

It’s actually pretty simple:

>>> import numpy as np
>>> a = np.array([[ 1.,  1.,  1.,  1.,  1.],
...               [ 1.,  1.,  1.,  1.,  1.],
...               [ 1.,  1.,  1.,  1.,  1.]])
>>> np.pad(a, [(0, 1), (0, 1)], mode='constant')
array([[ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.]])

In this case I relied on the fact that 0 is the default value for mode='constant'. But it can also be specified by passing it in explicitly:

>>> np.pad(a, [(0, 1), (0, 1)], mode='constant', constant_values=0)
array([[ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.]])

Just in case the second argument ([(0, 1), (0, 1)]) seems confusing: each list item (in this case a tuple) corresponds to a dimension, and the items therein represent the padding before (first element) and after (second element). So:

[(0, 1), (0, 1)]
         ^^^^^^------ padding for second dimension
 ^^^^^^-------------- padding for first dimension

  ^------------------ no padding at the beginning of the first axis
     ^--------------- pad with one "value" at the end of the first axis.

In this case the padding for the first and second axes is identical, so one could also just pass in the 2-tuple:

>>> np.pad(a, (0, 1), mode='constant')
array([[ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.]])

In case the padding before and after is identical one could even omit the tuple (not applicable in this case though):

>>> np.pad(a, 1, mode='constant')
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  1.,  1.,  1.,  1.,  0.],
       [ 0.,  1.,  1.,  1.,  1.,  1.,  0.],
       [ 0.,  1.,  1.,  1.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.]])

Or if the padding before and after is identical but differs between the axes, you could also omit the second element in the inner tuples:

>>> np.pad(a, [(1, ), (2, )], mode='constant')
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

However, I tend to always use the explicit form, because it's just too easy to make mistakes (when NumPy's expectations differ from your intentions):

>>> np.pad(a, [1, 2], mode='constant')
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.],
       [ 0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.],
       [ 0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

Here NumPy thinks you wanted to pad every axis with 1 element before and 2 elements after! Even if you intended it to pad with 1 element in axis 1 and 2 elements in axis 2.

I used lists of tuples for the padding; note that this is just "my convention", and you could also use lists of lists or tuples of tuples, or even tuples of arrays. NumPy just checks the length of the argument (or whether it has a length at all) and the length of each item (or whether it has one)!


Answer 2


I understand that your main problem is that you need to calculate d = b - a but your arrays have different sizes. There is no need for an intermediate padded c.

You can solve this without padding:

import numpy as np

a = np.array([[ 1.,  1.,  1.,  1.,  1.],
              [ 1.,  1.,  1.,  1.,  1.],
              [ 1.,  1.,  1.,  1.,  1.]])

b = np.array([[ 3.,  3.,  3.,  3.,  3.,  3.],
              [ 3.,  3.,  3.,  3.,  3.,  3.],
              [ 3.,  3.,  3.,  3.,  3.,  3.],
              [ 3.,  3.,  3.,  3.,  3.,  3.]])

d = b.copy()
d[:a.shape[0],:a.shape[1]] -=  a

print(d)

Output:

[[ 2.  2.  2.  2.  2.  3.]
 [ 2.  2.  2.  2.  2.  3.]
 [ 2.  2.  2.  2.  2.  3.]
 [ 3.  3.  3.  3.  3.  3.]]

Answer 3


In case you need to add a fence of 1s to an array:

>>> mat = np.zeros((4,4), np.int32)
>>> mat
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])
>>> mat[0,:] = mat[:,0] = mat[:,-1] =  mat[-1,:] = 1
>>> mat
array([[1, 1, 1, 1],
       [1, 0, 0, 1],
       [1, 0, 0, 1],
       [1, 1, 1, 1]])

Answer 4


I know I’m a bit late to this, but in case you wanted to perform relative padding (aka edge padding), here’s how you can implement it. Note that the very first instance of assignment results in zero-padding, so you can use this for both zero-padding and relative padding (this is where you copy the edge values of the original array into the padded array).

def replicate_padding(arr):
    """Perform replicate padding on a numpy array."""
    new_pad_shape = tuple(np.array(arr.shape) + 2) # 2 indicates the width + height to change, a (512, 512) image --> (514, 514) padded image.
    padded_array = np.zeros(new_pad_shape) #create an array of zeros with new dimensions
    
    # perform replication
    padded_array[1:-1,1:-1] = arr        # result will be zero-pad
    padded_array[0,1:-1] = arr[0]        # perform edge pad for top row
    padded_array[-1, 1:-1] = arr[-1]     # edge pad for bottom row
    padded_array.T[0, 1:-1] = arr.T[0]   # edge pad for first column
    padded_array.T[-1, 1:-1] = arr.T[-1] # edge pad for last column
    
    #at this point, all values except for the 4 corners should have been replicated
    padded_array[0][0] = arr[0][0]     # top left corner
    padded_array[-1][0] = arr[-1][0]   # bottom left corner
    padded_array[0][-1] = arr[0][-1]   # top right corner 
    padded_array[-1][-1] = arr[-1][-1] # bottom right corner

    return padded_array

Complexity Analysis:

The optimal solution for this is numpy's pad method. After averaging over 5 runs, np.pad with edge padding is only 8% faster than the function defined above. This shows that the function above is a fairly optimal method for relative (edge) and zero padding.


#My method, replicate_padding
start = time.time()
padded = replicate_padding(input_image)
end = time.time()
delta0 = end - start

#np.pad with edge padding
start = time.time()
padded = np.pad(input_image, 1, mode='edge')
end = time.time()
delta = end - start


print(delta0) # my method: 0.0008790493011474609
print(delta)  # np.pad:    0.0008130073547363281
print(100*((delta0-delta)/delta)) # Percent difference: 8.12316715542522%

How to convert a string of bytes into an int?

Question: How to convert a string of bytes into an int?


How can I convert a string of bytes into an int in python?

Say like this: 'y\xcc\xa6\xbb'

I came up with a clever/stupid way of doing it:

sum(ord(c) << (i * 8) for i, c in enumerate('y\xcc\xa6\xbb'[::-1]))

I know there has to be something builtin or in the standard library that does this more simply…

This is different from converting a string of hex digits for which you can use int(xxx, 16), but instead I want to convert a string of actual byte values.

UPDATE:

I kind of like James’ answer a little better because it doesn’t require importing another module, but Greg’s method is faster:

>>> from timeit import Timer
>>> Timer('struct.unpack("<L", "y\xcc\xa6\xbb")[0]', 'import struct').timeit()
0.36242198944091797
>>> Timer("int('y\xcc\xa6\xbb'.encode('hex'), 16)").timeit()
1.1432669162750244

My hacky method:

>>> Timer("sum(ord(c) << (i * 8) for i, c in enumerate('y\xcc\xa6\xbb'[::-1]))").timeit()
2.8819329738616943

FURTHER UPDATE:

Someone asked in comments what’s the problem with importing another module. Well, importing a module isn’t necessarily cheap, take a look:

>>> Timer("""import struct\nstruct.unpack(">L", "y\xcc\xa6\xbb")[0]""").timeit()
0.98822188377380371

Including the cost of importing the module negates almost all of the advantage that this method has. I believe that this will only include the expense of importing it once for the entire benchmark run; look what happens when I force it to reload every time:

>>> Timer("""reload(struct)\nstruct.unpack(">L", "y\xcc\xa6\xbb")[0]""", 'import struct').timeit()
68.474128007888794

Needless to say, if you're doing a lot of executions of this method per one import, then this becomes proportionally less of an issue. It's also probably I/O cost rather than CPU, so it may depend on the capacity and load characteristics of the particular machine.


Answer 0


You can also use the struct module to do this:

>>> import struct
>>> struct.unpack("<L", "y\xcc\xa6\xbb")[0]
3148270713L

Answer 1


In Python 3.2 and later, use

>>> int.from_bytes(b'y\xcc\xa6\xbb', byteorder='big')
2043455163

or

>>> int.from_bytes(b'y\xcc\xa6\xbb', byteorder='little')
3148270713

according to the endianness of your byte-string.

This also works for bytestring-integers of arbitrary length, and for two’s-complement signed integers by specifying signed=True. See the docs for from_bytes.
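
The inverse operation is int.to_bytes (also available since Python 3.2), which round-trips with the example above:

>>> (2043455163).to_bytes(4, byteorder='big')
b'y\xcc\xa6\xbb'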


Answer 2


As Greg said, you can use struct if you are dealing with binary values, but if you just have a "hex number" in byte format you might want to convert it like:

s = 'y\xcc\xa6\xbb'
num = int(s.encode('hex'), 16)

…this is the same as:

num = struct.unpack(">L", s)[0]

…except it’ll work for any number of bytes.


Answer 3


I use the following function to convert data between int, hex and bytes.

def bytes2int(str):
 return int(str.encode('hex'), 16)

def bytes2hex(str):
 return '0x'+str.encode('hex')

def int2bytes(i):
 h = int2hex(i)
 return hex2bytes(h)

def int2hex(i):
 return hex(i)

def hex2int(h):
 if len(h) > 1 and h[0:2] == '0x':
  h = h[2:]

 if len(h) % 2:
  h = "0" + h

 return int(h, 16)

def hex2bytes(h):
 if len(h) > 1 and h[0:2] == '0x':
  h = h[2:]

 if len(h) % 2:
  h = "0" + h

 return h.decode('hex')

Source: http://opentechnotes.blogspot.com.au/2014/04/convert-values-to-from-integer-hex.html


Answer 4


import array
integerValue = array.array("I", 'y\xcc\xa6\xbb')[0]

Warning: the above is strongly platform-specific. Both the “I” specifier and the endianness of the string->int conversion are dependent on your particular Python implementation. But if you want to convert many integers/strings at once, then the array module does it quickly.


Answer 5


In Python 2.x, you could use the format specifiers <B for unsigned bytes, and <b for signed bytes with struct.unpack/struct.pack.

E.g.:

x = '\xff\x10\x11'

data_ints = struct.unpack('<' + 'B'*len(x), x) # [255, 16, 17]

And:

data_bytes = struct.pack('<' + 'B'*len(data_ints), *data_ints) # '\xff\x10\x11'

That * is required!

See https://docs.python.org/2/library/struct.html#format-characters for a list of the format specifiers.


Answer 6


>>> reduce(lambda s, x: s*256 + x, bytearray("y\xcc\xa6\xbb"))
2043455163

Test 1: inverse:

>>> hex(2043455163)
'0x79cca6bb'

Test 2: Number of bytes > 8:

>>> reduce(lambda s, x: s*256 + x, bytearray("AAAAAAAAAAAAAAA"))
338822822454978555838225329091068225L

Test 3: Increment by one:

>>> reduce(lambda s, x: s*256 + x, bytearray("AAAAAAAAAAAAAAB"))
338822822454978555838225329091068226L

Test 4: Append one byte, say ‘A’:

>>> reduce(lambda s, x: s*256 + x, bytearray("AAAAAAAAAAAAAABA"))
86738642548474510294585684247313465921L

Test 5: Divide by 256:

>>> reduce(lambda s, x: s*256 + x, bytearray("AAAAAAAAAAAAAABA"))/256
338822822454978555838225329091068226L

Result equals the result of Test 4, as expected.


Answer 7


I was struggling to find a solution for arbitrary length byte sequences that would work under Python 2.x. Finally I wrote this one, it’s a bit hacky because it performs a string conversion, but it works.

Function for Python 2.x, arbitrary length

def signedbytes(data):
    """Convert a bytearray into an integer, considering the first bit as
    sign. The data must be big-endian."""
    negative = data[0] & 0x80 > 0

    if negative:
        inverted = bytearray(~d % 256 for d in data)
        return -signedbytes(inverted) - 1

    encoded = str(data).encode('hex')
    return int(encoded, 16)

This function has two requirements:

  • The input data needs to be a bytearray. You may call the function like this:

    s = 'y\xcc\xa6\xbb'
    n = signedbytes(s)
    
  • The data needs to be big-endian. In case you have a little-endian value, you should reverse it first:

    n = signedbytes(s[::-1])
    

Of course, this should be used only if arbitrary length is needed. Otherwise, stick with more standard ways (e.g. struct).


Answer 8

如果版本> = 3.2,则int.from_bytes是最佳解决方案。“ struct.unpack”解决方案需要一个字符串,因此它不适用于字节数组。这是另一种解决方案:

def bytes2int( tb, order='big'):
    if order == 'big': seq=[0,1,2,3]
    elif order == 'little': seq=[3,2,1,0]
    i = 0
    for j in seq: i = (i<<8)+tb[j]
    return i

hex(bytes2int([0x87,0x65,0x43,0x21]))返回’0x87654321’。

它能处理大端和小端字节序,并且很容易修改以支持 8 个字节。

int.from_bytes is the best solution if you are at version >=3.2. The “struct.unpack” solution requires a string so it will not apply to arrays of bytes. Here is another solution:

def bytes2int( tb, order='big'):
    if order == 'big': seq=[0,1,2,3]
    elif order == 'little': seq=[3,2,1,0]
    i = 0
    for j in seq: i = (i<<8)+tb[j]
    return i

hex( bytes2int( [0x87, 0x65, 0x43, 0x21])) returns ‘0x87654321’.

It handles big and little endianness and is easily modifiable for 8 bytes.
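
As a sketch of that modification (my own generalization, not part of the original answer), the index sequence can be derived from the input length instead of being hard-coded:

def bytes2int_any(tb, order='big'):
    # build the index sequence from len(tb) instead of hard-coding 4 bytes
    seq = range(len(tb)) if order == 'big' else range(len(tb) - 1, -1, -1)
    i = 0
    for j in seq:
        i = (i << 8) + tb[j]
    return i

print(hex(bytes2int_any([0x87, 0x65, 0x43, 0x21])))      # 0x87654321
print(hex(bytes2int_any([0x12, 0x34], order='little')))  # 0x3412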


回答 9

如上文所述,使用 struct 的 unpack 函数是一个很好的方式。如果想自己实现这个功能,还有另一种解决方案:

def bytes_to_int(bytes):
    result = 0
    for b in bytes:
        result = result * 256 + int(b)
    return result

As mentioned above, using the unpack function of struct is a good way. If you want to implement your own function, there is another solution:

def bytes_to_int(bytes):
    result = 0
    for b in bytes:
        result = result * 256 + int(b)
    return result

回答 10

在python 3中,您可以通过以下方式轻松地将字节字符串转换为整数列表(0..255)

>>> list(b'y\xcc\xa6\xbb')
[121, 204, 166, 187]

In python 3 you can easily convert a byte string into a list of integers (0..255) by

>>> list(b'y\xcc\xa6\xbb')
[121, 204, 166, 187]

回答 11

一种使用array.array的快速方法,我已经使用了一段时间:

预定义变量:

from array import array

offset = 0
size = 4
big = True # endian
arr = array('B')
arr.fromstring("\x00\x00\xff\x00") # 5 bytes (encoding issues) [0, 0, 195, 191, 0]

转为 int:(读取)

val = 0
for v in arr[offset:offset+size][::pow(-1,not big)]: val = (val<<8)|v

从 int 转回:(写入)

val = 16384
arr[offset:offset+size] = \
    array('B',((val>>(i<<3))&255 for i in range(size)))[::pow(-1,not big)]

不过,这些方法或许还能更快。

编辑:
给出一些数字:下面是一项性能测试(Anaconda 2.3.0),显示读取操作的平均耗时与 reduce() 相比十分稳定:

========================= byte array to int.py =========================
5000 iterations; threshold of min + 5000ns:
______________________________________code___|_______min______|_______max______|_______avg______|_efficiency
⣿⠀⠀⠀⠀⡇⢀⡀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⡀⠀⢰⠀⠀⠀⢰⠀⠀⠀⢸⠀⠀⢀⡇⠀⢀⠀⠀⠀⠀⢠⠀⠀⠀⠀⢰⠀⠀⠀⢸⡀⠀⠀⠀⢸⠀⡇⠀⠀⢠⠀⢰⠀⢸⠀
⣿⣦⣴⣰⣦⣿⣾⣧⣤⣷⣦⣤⣶⣾⣿⣦⣼⣶⣷⣶⣸⣴⣤⣀⣾⣾⣄⣤⣾⡆⣾⣿⣿⣶⣾⣾⣶⣿⣤⣾⣤⣤⣴⣼⣾⣼⣴⣤⣼⣷⣆⣴⣴⣿⣾⣷⣧⣶⣼⣴⣿⣶⣿⣶
    val = 0 \nfor v in arr: val = (val<<8)|v |     5373.848ns |   850009.965ns |     ~8649.64ns |  62.128%
⡇⠀⠀⢀⠀⠀⠀⡇⠀⡇⠀⠀⣠⠀⣿⠀⠀⠀⠀⡀⠀⠀⡆⠀⡆⢰⠀⠀⡆⠀⡄⠀⠀⠀⢠⢀⣼⠀⠀⡇⣠⣸⣤⡇⠀⡆⢸⠀⠀⠀⠀⢠⠀⢠⣿⠀⠀⢠⠀⠀⢸⢠⠀⡀
⣧⣶⣶⣾⣶⣷⣴⣿⣾⡇⣤⣶⣿⣸⣿⣶⣶⣶⣶⣧⣷⣼⣷⣷⣷⣿⣦⣴⣧⣄⣷⣠⣷⣶⣾⣸⣿⣶⣶⣷⣿⣿⣿⣷⣧⣷⣼⣦⣶⣾⣿⣾⣼⣿⣿⣶⣶⣼⣦⣼⣾⣿⣶⣷
                  val = reduce( shift, arr ) |     6489.921ns |  5094212.014ns |   ~12040.269ns |  53.902%

这是原始性能测试,因此省略了字节序的 pow 翻转。
所示的 shift 函数与 for 循环执行相同的移位或运算;arr 就是 array.array('B',[0,0,255,0]),因为它的迭代性能仅次于 dict。

我可能还应该指出,“效率”是以相对平均时间的准确度来衡量的。

A decently speedy method utilizing array.array I’ve been using for some time:

predefined variables:

from array import array

offset = 0
size = 4
big = True # endian
arr = array('B')
arr.fromstring("\x00\x00\xff\x00") # 5 bytes (encoding issues) [0, 0, 195, 191, 0]

to int: (read)

val = 0
for v in arr[offset:offset+size][::pow(-1,not big)]: val = (val<<8)|v

from int: (write)

val = 16384
arr[offset:offset+size] = \
    array('B',((val>>(i<<3))&255 for i in range(size)))[::pow(-1,not big)]

It’s possible these could be faster though.
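
A note I am adding on the [::pow(-1, not big)] trick: pow(-1, not big) evaluates to 1 when big is True and to -1 otherwise, so the slice step either keeps or reverses the byte order:

for big in (True, False):
    step = pow(-1, not big)  # 1 for big-endian, -1 for little-endian
    print(big, step, [0, 0, 255, 0][::step])
# True 1 [0, 0, 255, 0]
# False -1 [0, 255, 0, 0]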

EDIT:
For some numbers, here’s a performance test (Anaconda 2.3.0) showing stable averages on read in comparison to reduce():

========================= byte array to int.py =========================
5000 iterations; threshold of min + 5000ns:
______________________________________code___|_______min______|_______max______|_______avg______|_efficiency
⣿⠀⠀⠀⠀⡇⢀⡀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⡀⠀⢰⠀⠀⠀⢰⠀⠀⠀⢸⠀⠀⢀⡇⠀⢀⠀⠀⠀⠀⢠⠀⠀⠀⠀⢰⠀⠀⠀⢸⡀⠀⠀⠀⢸⠀⡇⠀⠀⢠⠀⢰⠀⢸⠀
⣿⣦⣴⣰⣦⣿⣾⣧⣤⣷⣦⣤⣶⣾⣿⣦⣼⣶⣷⣶⣸⣴⣤⣀⣾⣾⣄⣤⣾⡆⣾⣿⣿⣶⣾⣾⣶⣿⣤⣾⣤⣤⣴⣼⣾⣼⣴⣤⣼⣷⣆⣴⣴⣿⣾⣷⣧⣶⣼⣴⣿⣶⣿⣶
    val = 0 \nfor v in arr: val = (val<<8)|v |     5373.848ns |   850009.965ns |     ~8649.64ns |  62.128%
⡇⠀⠀⢀⠀⠀⠀⡇⠀⡇⠀⠀⣠⠀⣿⠀⠀⠀⠀⡀⠀⠀⡆⠀⡆⢰⠀⠀⡆⠀⡄⠀⠀⠀⢠⢀⣼⠀⠀⡇⣠⣸⣤⡇⠀⡆⢸⠀⠀⠀⠀⢠⠀⢠⣿⠀⠀⢠⠀⠀⢸⢠⠀⡀
⣧⣶⣶⣾⣶⣷⣴⣿⣾⡇⣤⣶⣿⣸⣿⣶⣶⣶⣶⣧⣷⣼⣷⣷⣷⣿⣦⣴⣧⣄⣷⣠⣷⣶⣾⣸⣿⣶⣶⣷⣿⣿⣿⣷⣧⣷⣼⣦⣶⣾⣿⣾⣼⣿⣿⣶⣶⣼⣦⣼⣾⣿⣶⣷
                  val = reduce( shift, arr ) |     6489.921ns |  5094212.014ns |   ~12040.269ns |  53.902%

This is a raw performance test, so the endian pow-flip is left out.
The shift function shown applies the same shift-oring operation as the for loop, and arr is just array.array('B',[0,0,255,0]) as it has the fastest iterative performance next to dict.

I should probably also note efficiency is measured by accuracy to the average time.


numpy-将行添加到数组

问题:numpy-将行添加到数组

如何将行添加到numpy数组?

我有一个数组A:

A = array([[0, 1, 2], [0, 2, 0]])

如果另一个数组 X 中某行的第一个元素满足特定条件,我希望把 X 中的这些行添加到这个数组里。

Numpy 数组似乎没有像列表那样的 'append' 方法。

如果A和X是列表,我只会这样做:

for i in X:
    if i[0] < 3:
        A.append(i)

是否有 numpythonic 的方式来实现同样的效果?

谢谢,S ;-)

How does one add rows to a numpy array?

I have an array A:

A = array([[0, 1, 2], [0, 2, 0]])

I wish to add rows to this array from another array X if the first element of each row in X meets a specific condition.

Numpy arrays do not have a method ‘append’ like that of lists, or so it seems.

If A and X were lists I would merely do:

for i in X:
    if i[0] < 3:
        A.append(i)

Is there a numpythonic way to do the equivalent?

Thanks, S ;-)


回答 0

X 是什么?如果它是一个二维数组,你怎么能把它的行与一个数字作比较:i < 3?

OP评论后编辑:

A = array([[0, 1, 2], [0, 2, 0]])
X = array([[0, 1, 2], [1, 2, 0], [2, 1, 2], [3, 2, 0]])

将 X 中第一个元素 < 3 的所有行添加到 A:

import numpy as np
A = np.vstack((A, X[X[:,0] < 3]))

# returns: 
array([[0, 1, 2],
       [0, 2, 0],
       [0, 1, 2],
       [1, 2, 0],
       [2, 1, 2]])

What is X? If it is a 2D-array, how can you then compare its row to a number: i < 3?

EDIT after OP’s comment:

A = array([[0, 1, 2], [0, 2, 0]])
X = array([[0, 1, 2], [1, 2, 0], [2, 1, 2], [3, 2, 0]])

add to A all rows from X where the first element < 3:

import numpy as np
A = np.vstack((A, X[X[:,0] < 3]))

# returns: 
array([[0, 1, 2],
       [0, 2, 0],
       [0, 1, 2],
       [1, 2, 0],
       [2, 1, 2]])

回答 1

好吧,你可以这样做:

  newrow = [1,2,3]
  A = numpy.vstack([A, newrow])

well u can do this :

  newrow = [1,2,3]
  A = numpy.vstack([A, newrow])

回答 2

由于这个问题提出已有 7 年,我使用的是较新的 numpy 1.13 和 python3。向矩阵添加一行时做法相同,但请记住给第二个参数加上双层括号,否则会引发维度错误。

这里我要向矩阵 A

1 2 3
4 5 6

添加一行

7 8 9

np.r_ 的用法也一样:

A= [[1, 2, 3], [4, 5, 6]]
np.append(A, [[7, 8, 9]], axis=0)

    >> array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
#or 
np.r_[A,[[7,8,9]]]

顺带一提,如果有人感兴趣,想要添加一列的话:

array = np.c_[A, np.zeros(len(A))]  # len(A) 即 A 的行数

沿用我们之前对矩阵 A 的例子,给它添加一列:

np.c_[A, [2,8]]

>> array([[1, 2, 3, 2],
          [4, 5, 6, 8]])

As this question was asked 7 years ago: in the latest version, which I am using (numpy 1.13 with python3), I am doing the same thing, adding a row to a matrix; remember to put double brackets around the second argument, otherwise it will raise a dimension error.

Here I am adding to matrix A

1 2 3
4 5 6

the row

7 8 9

The usage of np.r_ is the same:

A= [[1, 2, 3], [4, 5, 6]]
np.append(A, [[7, 8, 9]], axis=0)

    >> array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
#or 
np.r_[A,[[7,8,9]]]

Just in case someone is interested, if you would like to add a column:

array = np.c_[A, np.zeros(len(A))]  # len(A) is the number of rows of A

Following what we did before on matrix A, adding a column to it:

np.c_[A, [2,8]]

>> array([[1, 2, 3, 2],
          [4, 5, 6, 8]])

回答 3

您也可以这样做:

newrow = [1,2,3]
A = numpy.concatenate((A, [newrow]))  # 注意 [newrow]:保持二维,否则维度不匹配

You can also do this:

newrow = [1,2,3]
A = numpy.concatenate((A, [newrow]))  # note [newrow]: keeps it 2-D so the shapes match

回答 4

如果每行之后都不需要进行计算,则在python中添加行然后转换为numpy会更快。以下是使用python 3.6与numpy 1.14进行的时序测试,添加了100行,一次添加一行:

import numpy as np 
from time import perf_counter, sleep

def time_it():
    # Compare performance of two methods for adding rows to numpy array
    py_array = [[0, 1, 2], [0, 2, 0]]
    py_row = [4, 5, 6]
    numpy_array = np.array(py_array)
    numpy_row = np.array([4,5,6])
    n_loops = 100

    start_clock = perf_counter()
    for count in range(0, n_loops):
       numpy_array = np.vstack([numpy_array, numpy_row]) # 5.8 micros
    duration = perf_counter() - start_clock
    print('numpy 1.14 takes {:.3f} micros per row'.format(duration * 1e6 / n_loops))

    start_clock = perf_counter()
    for count in range(0, n_loops):
        py_array.append(py_row) # .15 micros
    numpy_array = np.array(py_array) # 43.9 micros       
    duration = perf_counter() - start_clock
    print('python 3.6 takes {:.3f} micros per row'.format(duration * 1e6 / n_loops))
    sleep(15)

#time_it() prints:

numpy 1.14 takes 5.971 micros per row
python 3.6 takes 0.694 micros per row

因此,对于七年前这个原始问题,简单的解决方案是:把行转换为 numpy 数组后,用 vstack() 添加新行。但更现实的解决方案应该考虑到 vstack 在这种情形下的糟糕性能。如果您不需要在每次添加后都对数组做数据分析,更好的做法是把新行缓冲到一个 python 行列表(实际上是列表的列表)中,然后在进行任何数据分析之前,用 vstack() 把它们作为一组一次性添加到 numpy 数组中。

If no calculations are necessary after every row, it’s much quicker to add rows in python, then convert to numpy. Here are timing tests using python 3.6 vs. numpy 1.14, adding 100 rows, one at a time:

import numpy as np 
from time import perf_counter, sleep

def time_it():
    # Compare performance of two methods for adding rows to numpy array
    py_array = [[0, 1, 2], [0, 2, 0]]
    py_row = [4, 5, 6]
    numpy_array = np.array(py_array)
    numpy_row = np.array([4,5,6])
    n_loops = 100

    start_clock = perf_counter()
    for count in range(0, n_loops):
       numpy_array = np.vstack([numpy_array, numpy_row]) # 5.8 micros
    duration = perf_counter() - start_clock
    print('numpy 1.14 takes {:.3f} micros per row'.format(duration * 1e6 / n_loops))

    start_clock = perf_counter()
    for count in range(0, n_loops):
        py_array.append(py_row) # .15 micros
    numpy_array = np.array(py_array) # 43.9 micros       
    duration = perf_counter() - start_clock
    print('python 3.6 takes {:.3f} micros per row'.format(duration * 1e6 / n_loops))
    sleep(15)

#time_it() prints:

numpy 1.14 takes 5.971 micros per row
python 3.6 takes 0.694 micros per row

So, the simple solution to the original question, from seven years ago, is to use vstack() to add a new row after converting the row to a numpy array. But a more realistic solution should consider vstack’s poor performance under those circumstances. If you don’t need to run data analysis on the array after every addition, it is better to buffer the new rows to a python list of rows (a list of lists, really), and add them as a group to the numpy array using vstack() before doing any data analysis.
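
A minimal sketch of that buffering pattern (the variable names are my own), applied to the arrays from the original question:

import numpy as np

A = np.array([[0, 1, 2], [0, 2, 0]])
X = np.array([[0, 1, 2], [1, 2, 0], [2, 1, 2], [3, 2, 0]])

buffered = [row for row in X if row[0] < 3]  # cheap Python-list appends
if buffered:
    A = np.vstack([A] + buffered)            # one single vstack at the end
print(A)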


回答 5

import numpy as np
array_ = np.array([[1,2,3]])
add_row = np.array([[4,5,6]])

array_ = np.concatenate((array_, add_row), axis=0)

import numpy as np
array_ = np.array([[1,2,3]])
add_row = np.array([[4,5,6]])

array_ = np.concatenate((array_, add_row), axis=0)

回答 6

如果您可以在一次操作中完成构造,那么类似 vstack 加花式索引的答案就是很好的方法。但如果您的条件更复杂,或者行是不断到来的,您可能就需要增长数组。实际上,做这类事情(动态增长数组)的 numpythonic 方式,是动态增长一个列表:

A = np.array([[1,2,3],[4,5,6]])
Alist = [r for r in A]
for i in range(100):
    newrow = np.arange(3)+i
    if i%5:
        Alist.append(newrow)
A = np.array(Alist)
del Alist

列表针对这种访问模式做了高度优化。在列表形式下,您没有 numpy 那种方便的多维索引,但只要您还在不断追加,就很难找到比行数组列表更好的做法。

If you can do the construction in a single operation, then something like the vstack-with-fancy-indexing answer is a fine approach. But if your condition is more complicated or your rows come in on the fly, you may want to grow the array. In fact the numpythonic way to do something like this – dynamically grow an array – is to dynamically grow a list:

A = np.array([[1,2,3],[4,5,6]])
Alist = [r for r in A]
for i in range(100):
    newrow = np.arange(3)+i
    if i%5:
        Alist.append(newrow)
A = np.array(Alist)
del Alist

Lists are highly optimized for this kind of access pattern; you don’t have convenient numpy multidimensional indexing while in list form, but for as long as you’re appending it’s hard to do better than a list of row arrays.


回答 7

我使用更快的 np.vstack,例如:

import numpy as np

input_array=np.array([1,2,3])
new_row= np.array([4,5,6])

new_array=np.vstack([input_array, new_row])

I use np.vstack, which is faster. For example:

import numpy as np

input_array=np.array([1,2,3])
new_row= np.array([4,5,6])

new_array=np.vstack([input_array, new_row])

回答 8

您可以使用 numpy.append() 向 numpy 数组追加一行,之后再将其重塑为矩阵。

import numpy as np
a = np.array([1,2])
a = np.append(a, [3,4])
print a
# [1,2,3,4]
# in your example
A = [1,2]
for row in X:
    A = np.append(A, row)

You can use numpy.append() to append a row to a numpy array and reshape it to a matrix later on.

import numpy as np
a = np.array([1,2])
a = np.append(a, [3,4])
print a
# [1,2,3,4]
# in your example
A = [1,2]
for row in X:
    A = np.append(A, row)

为什么Python的数组变慢?

问题:为什么Python的数组变慢?

我原以为 array.array 会比列表更快,因为数组似乎是未装箱的。

但是,我得到以下结果:

In [1]: import array

In [2]: L = list(range(100000000))

In [3]: A = array.array('l', range(100000000))

In [4]: %timeit sum(L)
1 loop, best of 3: 667 ms per loop

In [5]: %timeit sum(A)
1 loop, best of 3: 1.41 s per loop

In [6]: %timeit sum(L)
1 loop, best of 3: 627 ms per loop

In [7]: %timeit sum(A)
1 loop, best of 3: 1.39 s per loop

造成这种差异的原因是什么?

I expected array.array to be faster than lists, as arrays seem to be unboxed.

However, I get the following result:

In [1]: import array

In [2]: L = list(range(100000000))

In [3]: A = array.array('l', range(100000000))

In [4]: %timeit sum(L)
1 loop, best of 3: 667 ms per loop

In [5]: %timeit sum(A)
1 loop, best of 3: 1.41 s per loop

In [6]: %timeit sum(L)
1 loop, best of 3: 627 ms per loop

In [7]: %timeit sum(A)
1 loop, best of 3: 1.39 s per loop

What could be the cause of such a difference?


回答 0

存储确实是“未装箱的”,但每次访问元素时,Python 都必须先把它“装箱”(嵌入一个常规 Python 对象),才能对它做任何处理。例如,您的 sum(A) 会遍历数组,把每个整数逐一装箱成常规 Python int 对象。这要花时间。而在您的 sum(L) 中,所有装箱在创建列表时就已完成。

因此,归根结底,array 通常更慢,但所需的内存要少得多。


这是最新版本的Python 3的相关代码,但是自Python首次发布以来,相同的基本思想也适用于所有CPython实现。

这是访问列表项的代码:

PyObject *
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
    /* error checking omitted */
    return ((PyListObject *)op) -> ob_item[i];
}

它几乎没做什么:somelist[i] 只是返回列表中的第 i 个对象(CPython 中所有 Python 对象都是指向结构体的指针,该结构体的起始段与 struct PyObject 的布局一致)。

下面是类型代码为 l 的 array 的 __getitem__ 实现:

static PyObject *
l_getitem(arrayobject *ap, Py_ssize_t i)
{
    return PyLong_FromLong(((long *)ap->ob_item)[i]);
}

原始内存被视为平台原生 C long 整数的向量;先读出第 i 个 C long;然后调用 PyLong_FromLong() 把这个原生 C long 包装(“装箱”)成 Python long 对象(在 Python 3 中它实际显示为 int 类型,因为 Python 3 消除了 Python 2 中 int 与 long 的区别)。

这次装箱必须为 Python int 对象分配新内存,再把原生 C long 的位复制进去。在原始示例的上下文中,该对象的生命周期非常短(只够 sum() 把内容累加到运行总和里),之后还需要额外时间来释放这个新的 int 对象。

在 CPython 实现中,这就是速度差异的来源:过去如此,现在如此,将来也仍会如此。

The storage is “unboxed”, but every time you access an element Python has to “box” it (embed it in a regular Python object) in order to do anything with it. For example, your sum(A) iterates over the array, and boxes each integer, one at a time, in a regular Python int object. That costs time. In your sum(L), all the boxing was done at the time the list was created.

So, in the end, an array is generally slower, but requires substantially less memory.


Here’s the relevant code from a recent version of Python 3, but the same basic ideas apply to all CPython implementations since Python was first released.

Here’s the code to access a list item:

PyObject *
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
    /* error checking omitted */
    return ((PyListObject *)op) -> ob_item[i];
}

There’s very little to it: somelist[i] just returns the i‘th object in the list (and all Python objects in CPython are pointers to a struct whose initial segment conforms to the layout of a struct PyObject).

And here’s the __getitem__ implementation for an array with type code l:

static PyObject *
l_getitem(arrayobject *ap, Py_ssize_t i)
{
    return PyLong_FromLong(((long *)ap->ob_item)[i]);
}

The raw memory is treated as a vector of platform-native C long integers; the i‘th C long is read up; and then PyLong_FromLong() is called to wrap (“box”) the native C long in a Python long object (which, in Python 3, is actually shown as type int, since Python 3 eliminates Python 2’s distinction between int and long).

This boxing has to allocate new memory for a Python int object, and spray the native C long‘s bits into it. In the context of the original example, this object’s lifetime is very brief (just long enough for sum() to add the contents into a running total), and then more time is required to deallocate the new int object.

This is where the speed difference comes from, always has come from, and always will come from in the CPython implementation.
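
A small sketch I am adding to make the boxing visible (assuming CPython; a large value is used because CPython caches small ints and would reuse them):

import array

big = 10**9
L = [big]
A = array.array('l', [big])

print(L[0] is L[0])  # True: the list hands back the same stored object
print(A[0] is A[0])  # False: each access boxes a brand-new int object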


回答 1

作为对 Tim Peters 出色答案的补充:数组实现了缓冲协议,而列表没有。这意味着,如果您在编写 C 扩展(或与之等价的东西,例如 Cython 模块),就能以远快于纯 Python 的速度访问和操作数组元素。这会带来可观的提速,可能远超一个数量级。不过,它也有不少缺点:

  1. 您现在正从事编写C而不是Python的工作。Cython是改善这种情况的一种方法,但是并不能消除语言之间的许多基本差异;您需要熟悉C语义并了解它在做什么。
  2. PyPy的C API 在某种程度上可以正常工作,但是速度不是很快。如果您以PyPy为目标,则可能应该只编写带有常规列表的简单代码,然后让JITter为您优化它。
  3. C扩展比纯Python代码更难分发,因为它们需要编译。编译通常取决于体系结构和操作系统,因此您需要确保要针对目标平台进行编译。

不过,视您的用例而定,直接去写 C 扩展可能是杀鸡用牛刀。您应该先研究 NumPy,看它是否强大到足以完成您想做的数学运算。正确使用的话,它也会比原生 Python 快得多。

To add to Tim Peters’ excellent answer, arrays implement the buffer protocol, while lists do not. This means that, if you are writing a C extension (or the moral equivalent, such as writing a Cython module), then you can access and work with the elements of an array much faster than anything Python can do. This will give you considerable speed improvements, possibly well over an order of magnitude. However, it has a number of downsides:

  1. You are now in the business of writing C instead of Python. Cython is one way to ameliorate this, but it does not eliminate many fundamental differences between the languages; you need to be familiar with C semantics and understand what it is doing.
  2. PyPy’s C API works to some extent, but isn’t very fast. If you are targeting PyPy, you should probably just write simple code with regular lists, and then let the JITter optimize it for you.
  3. C extensions are harder to distribute than pure Python code because they need to be compiled. Compilation tends to be architecture and operating-system dependent, so you will need to ensure you are compiling for your target platform.

Going straight to C extensions may be using a sledgehammer to swat a fly, depending on your use case. You should first investigate NumPy and see if it is powerful enough to do whatever math you’re trying to do. It will also be much faster than native Python, if used correctly.
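
A quick sketch I am adding to show the buffer protocol difference without writing any C:

import array

a = array.array('l', range(10))
mv = memoryview(a)         # works: array.array exposes its raw buffer
print(mv[3])               # 3

try:
    memoryview([1, 2, 3])  # lists hold boxed objects, not a flat buffer
except TypeError as e:
    print(e)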


回答 2

蒂姆·彼得斯(Tim Peters)回答了为什么这样做很慢,但让我们看看如何改善它。

沿用您的 sum(range(...)) 示例(规模比您的示例小 10 倍,以便装进这里的内存):

import numpy
import array
L = list(range(10**7))
A = array.array('l', L)
N = numpy.array(L)

%timeit sum(L)
10 loops, best of 3: 101 ms per loop

%timeit sum(A)
1 loop, best of 3: 237 ms per loop

%timeit sum(N)
1 loop, best of 3: 743 ms per loop

这种用法下 numpy 也需要装箱/拆箱,带来额外开销。要让它快起来,必须停留在 numpy 的 C 代码内:

%timeit N.sum()
100 loops, best of 3: 6.27 ms per loop

因此,从列表方案到 numpy 版本,运行时间相差 16 倍。

我们还要检查创建这些数据结构需要花费多长时间

%timeit list(range(10**7))
1 loop, best of 3: 283 ms per loop

%timeit array.array('l', range(10**7))
1 loop, best of 3: 884 ms per loop

%timeit numpy.array(range(10**7))
1 loop, best of 3: 1.49 s per loop

%timeit numpy.arange(10**7)
10 loops, best of 3: 21.7 ms per loop

明确的获胜者:Numpy

还要注意,创建数据结构所花费的时间与求和所花费的时间差不多,甚至更多。分配内存很慢。

它们的内存使用情况:

sys.getsizeof(L)
90000112
sys.getsizeof(A)
81940352
sys.getsizeof(N)
80000096

因此,每个数字占用 8 个字节,外加各不相同的额外开销。对于这个取值范围,32 位整数就足够了,所以我们可以节省一些内存。

N=numpy.arange(10**7, dtype=numpy.int32)

sys.getsizeof(N)
40000096

%timeit N.sum()
100 loops, best of 3: 8.35 ms per loop

但是事实证明,在我的计算机上添加64位整数要快于32位整数,因此只有在受内存/带宽限制的情况下才值得这样做。

Tim Peters answered why this is slow, but let’s see how to improve it.

Sticking to your example of sum(range(...)) (factor 10 smaller than your example to fit into memory here):

import numpy
import array
L = list(range(10**7))
A = array.array('l', L)
N = numpy.array(L)

%timeit sum(L)
10 loops, best of 3: 101 ms per loop

%timeit sum(A)
1 loop, best of 3: 237 ms per loop

%timeit sum(N)
1 loop, best of 3: 743 ms per loop

Used this way, numpy also needs to box/unbox, which has additional overhead. To make it fast one has to stay within the numpy C code:

%timeit N.sum()
100 loops, best of 3: 6.27 ms per loop

So from the list solution to the numpy version this is a factor of 16 in runtime.

Let’s also check how long creating those data structures takes

%timeit list(range(10**7))
1 loop, best of 3: 283 ms per loop

%timeit array.array('l', range(10**7))
1 loop, best of 3: 884 ms per loop

%timeit numpy.array(range(10**7))
1 loop, best of 3: 1.49 s per loop

%timeit numpy.arange(10**7)
10 loops, best of 3: 21.7 ms per loop

Clear winner: Numpy

Also note that creating the data structure takes about as much time as summing, if not more. Allocating memory is slow.

Memory usage of those:

sys.getsizeof(L)
90000112
sys.getsizeof(A)
81940352
sys.getsizeof(N)
80000096

So these take 8 bytes per number, with varying overhead. For the range we use, 32-bit ints are sufficient, so we can save some memory.

N=numpy.arange(10**7, dtype=numpy.int32)

sys.getsizeof(N)
40000096

%timeit N.sum()
100 loops, best of 3: 8.35 ms per loop

But it turns out that adding 64bit ints is faster than 32bit ints on my machine, so this is only worth it if you are limited by memory/bandwidth.


回答 3

请注意,100000000 等于 10^8,而不是 10^7;我的结果如下:

100000000 == 10**8

# my test results on a Linux virtual machine:
#<L = list(range(100000000))> Time: 0:00:03.263585
#<A = array.array('l', range(100000000))> Time: 0:00:16.728709
#<L = list(range(10**8))> Time: 0:00:03.119379
#<A = array.array('l', range(10**8))> Time: 0:00:18.042187
#<A = array.array('l', L)> Time: 0:00:07.524478
#<sum(L)> Time: 0:00:01.640671
#<np.sum(L)> Time: 0:00:20.762153

Please note that 100000000 equals 10^8, not 10^7, and my results are as follows:

100000000 == 10**8

# my test results on a Linux virtual machine:
#<L = list(range(100000000))> Time: 0:00:03.263585
#<A = array.array('l', range(100000000))> Time: 0:00:16.728709
#<L = list(range(10**8))> Time: 0:00:03.119379
#<A = array.array('l', range(10**8))> Time: 0:00:18.042187
#<A = array.array('l', L)> Time: 0:00:07.524478
#<sum(L)> Time: 0:00:01.640671
#<np.sum(L)> Time: 0:00:20.762153

将 numpy 矩阵转换为数组

问题:将 numpy 矩阵转换为数组

我正在使用numpy。我有一个具有1列和N行的矩阵,并且我想从中获得具有N个元素的数组。

例如,如果我有M = matrix([[1], [2], [3], [4]]),我想得到A = array([1,2,3,4])

为此,我使用A = np.array(M.T)[0]。有谁知道一种更优雅的方式来获得相同的结果?

谢谢!

I am using numpy. I have a matrix with 1 column and N rows and I want to get an array from it with N elements.

For example, if I have M = matrix([[1], [2], [3], [4]]), I want to get A = array([1,2,3,4]).

To achieve it, I use A = np.array(M.T)[0]. Does anyone know a more elegant way to get the same result?

Thanks!


回答 0

如果您想让内容更具可读性,可以执行以下操作:

A = np.squeeze(np.asarray(M))

同样,您也可以执行以下操作:A = np.asarray(M).reshape(-1),但是它不太容易阅读。

If you’d like something a bit more readable, you can do this:

A = np.squeeze(np.asarray(M))

Equivalently, you could also do: A = np.asarray(M).reshape(-1), but that’s a bit less easy to read.


回答 2

A, = np.array(M.T)

我想,这取决于您所说的“优雅”是什么意思,但这就是我会做的:

A, = np.array(M.T)

Depends what you mean by elegance, I suppose, but that's what I would do.


回答 3

您可以尝试以下变体:

result=np.array(M).flatten()

You can try the following variant:

result=np.array(M).flatten()

回答 4

np.array(M).ravel()

如果您在乎速度,用上面这个;但如果您在乎内存,用:

np.asarray(M).ravel()

np.array(M).ravel()

If you care about speed, use the one above; but if you care about memory:

np.asarray(M).ravel()

回答 5

或者,您可以尝试用下面的方式避免一些临时对象:

A = M.view(np.ndarray)
A.shape = -1

Or you could try to avoid some temps with

A = M.view(np.ndarray)
A.shape = -1

回答 6

首先,Mv = numpy.asarray(M.T),这会得到一个 1×4 的、但仍是 2D 的数组。

然后,执行 A = Mv[0,:],这就给出了您想要的结果。也可以把它们合在一起写成 numpy.asarray(M.T)[0,:]。

First, Mv = numpy.asarray(M.T), which gives you a 1×4 but 2D array.

Then, perform A = Mv[0,:], which gives you what you want. You could put them together, as numpy.asarray(M.T)[0,:].


回答 7

这会将矩阵转换为数组

A = np.ravel(M).T

This will convert the matrix into array

A = np.ravel(M).T

回答 8

numpy 的 ravel() 和 flatten() 函数是我会在这里尝试的两种技术。我想对 Joe、Siraj、bubble 和 Kevad 的帖子做一点补充。

Ravel:

A = M.ravel()
print A, A.shape
>>> [1 2 3 4] (4,)

展平:

M = np.array([[1], [2], [3], [4]])
A = M.flatten()
print A, A.shape
>>> [1 2 3 4] (4,)

numpy.ravel() 更快,因为它是库级函数,不会复制数组。但要注意,使用 numpy.ravel() 时,对数组 A 的任何更改都会带回到原始数组 M 中。

numpy.flatten() 比 numpy.ravel() 慢。但如果用 numpy.flatten() 创建 A,那么对 A 的更改不会传递到原始数组 M。

numpy.squeeze() 和 M.reshape(-1) 比 numpy.flatten() 和 numpy.ravel() 都慢。

%timeit M.ravel()
>>> 1000000 loops, best of 3: 309 ns per loop

%timeit M.flatten()
>>> 1000000 loops, best of 3: 650 ns per loop

%timeit M.reshape(-1)
>>> 1000000 loops, best of 3: 755 ns per loop

%timeit np.squeeze(M)
>>> 1000000 loops, best of 3: 886 ns per loop

ravel() and flatten() functions from numpy are two techniques that I would try here. I would like to add to the posts made by Joe, Siraj, bubble and Kevad.

Ravel:

A = M.ravel()
print A, A.shape
>>> [1 2 3 4] (4,)

Flatten:

M = np.array([[1], [2], [3], [4]])
A = M.flatten()
print A, A.shape
>>> [1 2 3 4] (4,)

numpy.ravel() is faster, since it is a library level function which does not make any copy of the array. However, any change in array A will carry itself over to the original array M if you are using numpy.ravel().

numpy.flatten() is slower than numpy.ravel(). But if you are using numpy.flatten() to create A, then changes in A will not get carried over to the original array M.

numpy.squeeze() and M.reshape(-1) are slower than numpy.flatten() and numpy.ravel().

%timeit M.ravel()
>>> 1000000 loops, best of 3: 309 ns per loop

%timeit M.flatten()
>>> 1000000 loops, best of 3: 650 ns per loop

%timeit M.reshape(-1)
>>> 1000000 loops, best of 3: 755 ns per loop

%timeit np.squeeze(M)
>>> 1000000 loops, best of 3: 886 ns per loop
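
To make the view-versus-copy distinction above concrete, here is a small sketch I am adding (using a plain ndarray; note that ravel() returns a view only when it can):

import numpy as np

M = np.array([[1], [2], [3], [4]])

A = M.ravel()    # a view here, since M is contiguous
A[0] = 99
print(M[0, 0])   # 99: the change reached the original

B = M.flatten()  # always a copy
B[1] = -1
print(M[1, 0])   # 2: the original is untouched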

使用numpy构建两个数组的所有组合的数组

问题:使用numpy构建两个数组的所有组合的数组

在对一个 6 参数函数做任何复杂处理之前,我想先遍历它的参数空间、研究它的数值行为,因此我在寻找一种高效的方法来做这件事。

我的函数接收一个 6 维 numpy 数组作为输入,返回 float 值。我最初尝试的做法是:

首先,我创建了一个函数,该函数接受2个数组并生成一个包含两个数组中值的所有组合的数组

from numpy import *
def comb(a,b):
    c = []
    for i in a:
        for j in b:
            c.append(r_[i,j])
    return c

然后我用 reduce() 将其应用于同一数组的 m 个副本:

def combs(a,m):
    return reduce(comb,[a]*m)

然后我像这样对我的函数求值:

values = combs(np.arange(0,1,0.1),6)
for val in values:
    print F(val)

这可行,但实在太慢了。我知道参数空间很大,但不该这么慢。在这个示例中,我只采样了 10^6(一百万)个点,仅创建数组 values 就花了超过 15 秒。

您知道使用numpy进行此操作的更有效的方法吗?

如果需要,我可以修改函数 F 接收参数的方式。

I’m trying to run over the parameter space of a 6-parameter function to study its numerical behavior before trying to do anything complex with it, so I’m searching for an efficient way to do this.

My function takes float values given a 6-dim numpy array as input. What I tried to do initially was this:

First I created a function that takes 2 arrays and generate an array with all combinations of values from the two arrays

from numpy import *
def comb(a,b):
    c = []
    for i in a:
        for j in b:
            c.append(r_[i,j])
    return c

Then I used reduce() to apply that to m copies of the same array:

def combs(a,m):
    return reduce(comb,[a]*m)

And then I evaluate my function like this:

values = combs(np.arange(0,1,0.1),6)
for val in values:
    print F(val)

This works but it’s waaaay too slow. I know the space of parameters is huge, but this shouldn’t be so slow. I have only sampled 10^6 (a million) points in this example and it took more than 15 seconds just to create the array values.

Do you know any more efficient way of doing this with numpy?

I can modify the way the function F takes its arguments if necessary.


回答 0

numpy(> 1.8.x)的较新版本中,numpy.meshgrid()提供了更快的实现:

@pv 的解决方案

In [113]:

%timeit cartesian(([1, 2, 3], [4, 5], [6, 7]))
10000 loops, best of 3: 135 µs per loop
In [114]:

cartesian(([1, 2, 3], [4, 5], [6, 7]))

Out[114]:
array([[1, 4, 6],
       [1, 4, 7],
       [1, 5, 6],
       [1, 5, 7],
       [2, 4, 6],
       [2, 4, 7],
       [2, 5, 6],
       [2, 5, 7],
       [3, 4, 6],
       [3, 4, 7],
       [3, 5, 6],
       [3, 5, 7]])

numpy.meshgrid() 以前只能用于 2D,现在可以用于 ND。在这个例子中是 3D:

In [115]:

%timeit np.array(np.meshgrid([1, 2, 3], [4, 5], [6, 7])).T.reshape(-1,3)
10000 loops, best of 3: 74.1 µs per loop
In [116]:

np.array(np.meshgrid([1, 2, 3], [4, 5], [6, 7])).T.reshape(-1,3)

Out[116]:
array([[1, 4, 6],
       [1, 5, 6],
       [2, 4, 6],
       [2, 5, 6],
       [3, 4, 6],
       [3, 5, 6],
       [1, 4, 7],
       [1, 5, 7],
       [2, 4, 7],
       [2, 5, 7],
       [3, 4, 7],
       [3, 5, 7]])

请注意,最终结果的顺序略有不同。

In newer version of numpy (>1.8.x), numpy.meshgrid() provides a much faster implementation:

@pv’s solution

In [113]:

%timeit cartesian(([1, 2, 3], [4, 5], [6, 7]))
10000 loops, best of 3: 135 µs per loop
In [114]:

cartesian(([1, 2, 3], [4, 5], [6, 7]))

Out[114]:
array([[1, 4, 6],
       [1, 4, 7],
       [1, 5, 6],
       [1, 5, 7],
       [2, 4, 6],
       [2, 4, 7],
       [2, 5, 6],
       [2, 5, 7],
       [3, 4, 6],
       [3, 4, 7],
       [3, 5, 6],
       [3, 5, 7]])

numpy.meshgrid() used to be 2D only; now it is capable of ND. In this case, 3D:

In [115]:

%timeit np.array(np.meshgrid([1, 2, 3], [4, 5], [6, 7])).T.reshape(-1,3)
10000 loops, best of 3: 74.1 µs per loop
In [116]:

np.array(np.meshgrid([1, 2, 3], [4, 5], [6, 7])).T.reshape(-1,3)

Out[116]:
array([[1, 4, 6],
       [1, 5, 6],
       [2, 4, 6],
       [2, 5, 6],
       [3, 4, 6],
       [3, 5, 6],
       [1, 4, 7],
       [1, 5, 7],
       [2, 4, 7],
       [2, 5, 7],
       [3, 4, 7],
       [3, 5, 7]])

Note that the order of the final resultant is slightly different.
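
If the exact row order matters, passing indexing='ij' (an option I am adding beyond the original answer) reproduces the row order of the cartesian() solution referenced above:

import numpy as np

out = np.array(np.meshgrid([1, 2, 3], [4, 5], [6, 7], indexing='ij'))
out = out.reshape(3, -1).T  # 3 input arrays -> 12 rows of 3
print(out)                  # same row order as cartesian()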


回答 1

这是一个纯粹的numpy实现。它比使用itertools快约5倍。


import numpy as np

def cartesian(arrays, out=None):
    """
    Generate a cartesian product of input arrays.

    Parameters
    ----------
    arrays : list of array-like
        1-D arrays to form the cartesian product of.
    out : ndarray
        Array to place the cartesian product in.

    Returns
    -------
    out : ndarray
        2-D array of shape (M, len(arrays)) containing cartesian products
        formed of input arrays.

    Examples
    --------
    >>> cartesian(([1, 2, 3], [4, 5], [6, 7]))
    array([[1, 4, 6],
           [1, 4, 7],
           [1, 5, 6],
           [1, 5, 7],
           [2, 4, 6],
           [2, 4, 7],
           [2, 5, 6],
           [2, 5, 7],
           [3, 4, 6],
           [3, 4, 7],
           [3, 5, 6],
           [3, 5, 7]])

    """

    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

    m = n / arrays[0].size
    out[:,0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m, 1:])
        for j in xrange(1, arrays[0].size):
            out[j*m:(j+1)*m, 1:] = out[0:m, 1:]
    return out

Here’s a pure-numpy implementation. It’s about 5× faster than using itertools.


import numpy as np

def cartesian(arrays, out=None):
    """
    Generate a cartesian product of input arrays.

    Parameters
    ----------
    arrays : list of array-like
        1-D arrays to form the cartesian product of.
    out : ndarray
        Array to place the cartesian product in.

    Returns
    -------
    out : ndarray
        2-D array of shape (M, len(arrays)) containing cartesian products
        formed of input arrays.

    Examples
    --------
    >>> cartesian(([1, 2, 3], [4, 5], [6, 7]))
    array([[1, 4, 6],
           [1, 4, 7],
           [1, 5, 6],
           [1, 5, 7],
           [2, 4, 6],
           [2, 4, 7],
           [2, 5, 6],
           [2, 5, 7],
           [3, 4, 6],
           [3, 4, 7],
           [3, 5, 6],
           [3, 5, 7]])

    """

    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

    m = n / arrays[0].size
    out[:,0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m, 1:])
        for j in xrange(1, arrays[0].size):
            out[j*m:(j+1)*m, 1:] = out[0:m, 1:]
    return out

回答 2

通常,itertools.combinations 是从 Python 容器中获取组合的最快方法(如果您确实想要组合,即无重复且与顺序无关的选取;这似乎不是您的代码在做的事,但我无法判断这是因为您的代码有错误,还是因为您用错了术语)。

如果您想要的不是组合,那么 itertools 中的其他迭代器(product 或 permutations)也许更适合您。例如,您的代码看起来与下面的代码大致相同:

for val in itertools.product(np.arange(0, 1, 0.1), repeat=6):
    print F(val)

所有这些迭代器生成的都是元组,而不是列表或 numpy 数组。因此,如果您的 F 非要接收 numpy 数组不可,您就得承担在每一步构造一个新数组,或清空并重新填充同一个数组的额外开销。

itertools.combinations is in general the fastest way to get combinations from a Python container (if you do in fact want combinations, i.e., arrangements WITHOUT repetitions and independent of order; that’s not what your code appears to be doing, but I can’t tell whether that’s because your code is buggy or because you’re using the wrong terminology).

If you want something different than combinations perhaps other iterators in itertools, product or permutations, might serve you better. For example, it looks like your code is roughly the same as:

for val in itertools.product(np.arange(0, 1, 0.1), repeat=6):
    print F(val)

All of these iterators yield tuples, not lists or numpy arrays, so if your F is picky about getting specifically a numpy array you’ll have to accept the extra overhead of constructing or clearing and re-filling one at each step.
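
A minimal sketch of the 'clearing and re-filling' approach just mentioned (the buffer and the dummy F are my own, and the grid is kept small):

import itertools
import numpy as np

def F(arr):        # stand-in for the real 6-parameter function
    return arr.sum()

buf = np.empty(3)  # allocated once, reused on every step
for val in itertools.product(np.arange(0, 1, 0.1), repeat=3):
    buf[:] = val   # refill the same array from the tuple
    F(buf)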


回答 3

你可以做这样的事情

import numpy as np

def cartesian_coord(*arrays):
    grid = np.meshgrid(*arrays)        
    coord_list = [entry.ravel() for entry in grid]
    points = np.vstack(coord_list).T
    return points

a = np.arange(4)  # fake data
print(cartesian_coord(*6*[a]))

这使

array([[0, 0, 0, 0, 0, 0],
   [0, 0, 0, 0, 0, 1],
   [0, 0, 0, 0, 0, 2],
   ..., 
   [3, 3, 3, 3, 3, 1],
   [3, 3, 3, 3, 3, 2],
   [3, 3, 3, 3, 3, 3]])

You can do something like this

import numpy as np

def cartesian_coord(*arrays):
    grid = np.meshgrid(*arrays)        
    coord_list = [entry.ravel() for entry in grid]
    points = np.vstack(coord_list).T
    return points

a = np.arange(4)  # fake data
print(cartesian_coord(*6*[a]))

which gives

array([[0, 0, 0, 0, 0, 0],
   [0, 0, 0, 0, 0, 1],
   [0, 0, 0, 0, 0, 2],
   ..., 
   [3, 3, 3, 3, 3, 1],
   [3, 3, 3, 3, 3, 2],
   [3, 3, 3, 3, 3, 3]])

回答 4

以下 numpy 实现的速度应约为上面给定答案的 2 倍:

def cartesian2(arrays):
    arrays = [np.asarray(a) for a in arrays]
    shape = tuple(len(x) for x in arrays)  # 各维长度组成的元组

    ix = np.indices(shape, dtype=int)
    ix = ix.reshape(len(arrays), -1).T

    for n, arr in enumerate(arrays):
        ix[:, n] = arr[ix[:, n]]

    return ix

The following numpy implementation should be approx. 2x the speed of the given answer:

def cartesian2(arrays):
    arrays = [np.asarray(a) for a in arrays]
    shape = tuple(len(x) for x in arrays)  # tuple of per-axis lengths

    ix = np.indices(shape, dtype=int)
    ix = ix.reshape(len(arrays), -1).T

    for n, arr in enumerate(arrays):
        ix[:, n] = arr[ix[:, n]]

    return ix
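
A quick usage check I am adding, assuming the cartesian2 definition above; it produces the same 12x3 result, in the same row order, as the docstring example of the previous answer:

print(cartesian2([[1, 2, 3], [4, 5], [6, 7]]))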

回答 5

看起来您是想用一个网格来求值您的函数,这种情况下可以使用 numpy.ogrid(开放式)或 numpy.mgrid(完整式):

import numpy
my_grid = numpy.mgrid[[slice(0,1,0.1)]*6]

It looks like you want a grid to evaluate your function, in which case you can use numpy.ogrid (open) or numpy.mgrid (fleshed out):

import numpy
my_grid = numpy.mgrid[[slice(0,1,0.1)]*6]
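
To turn that grid into a flat list of parameter points (a follow-up sketch I am adding, not part of the original answer):

import numpy as np

my_grid = np.mgrid[[slice(0, 1, 0.1)] * 6]  # shape (6, 10, 10, 10, 10, 10, 10)
points = my_grid.reshape(6, -1).T           # shape (1000000, 6): one row per point
print(points.shape)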

回答 6

您可以使用 np.array(list(itertools.product(a, b)))

you can use np.array(list(itertools.product(a, b))) (the list() is needed because np.array cannot consume a bare iterator)


回答 7

这是又一种方式:使用纯 NumPy,没有递归,没有列表推导式,也没有显式的 for 循环。它比原始答案慢约 20%,基于 np.meshgrid。

def cartesian(*arrays):
    mesh = np.meshgrid(*arrays)  # standard numpy meshgrid
    dim = len(mesh)  # number of dimensions
    elements = mesh[0].size  # number of elements, any index will do
    flat = np.concatenate(mesh).ravel()  # flatten the whole meshgrid
    reshape = np.reshape(flat, (dim, elements)).T  # reshape and transpose
    return reshape

例如,

x = np.arange(3)
a = cartesian(x, x, x, x, x)
print(a)

[[0 0 0 0 0]
 [0 0 0 0 1]
 [0 0 0 0 2]
 ..., 
 [2 2 2 2 0]
 [2 2 2 2 1]
 [2 2 2 2 2]]

Here’s yet another way, using pure NumPy, no recursion, no list comprehension, and no explicit for loops. It’s about 20% slower than the original answer, and it’s based on np.meshgrid.

def cartesian(*arrays):
    mesh = np.meshgrid(*arrays)  # standard numpy meshgrid
    dim = len(mesh)  # number of dimensions
    elements = mesh[0].size  # number of elements, any index will do
    flat = np.concatenate(mesh).ravel()  # flatten the whole meshgrid
    reshape = np.reshape(flat, (dim, elements)).T  # reshape and transpose
    return reshape

For example,

x = np.arange(3)
a = cartesian(x, x, x, x, x)
print(a)

gives

[[0 0 0 0 0]
 [0 0 0 0 1]
 [0 0 0 0 2]
 ..., 
 [2 2 2 2 0]
 [2 2 2 2 1]
 [2 2 2 2 2]]

回答 8

要用纯 numpy 实现一维数组(或扁平 python 列表)的笛卡尔积,只需使用 meshgrid(),用 transpose() 滚动轴,再 reshape 成想要的输出形状:

 from numpy import arange, meshgrid, roll, transpose

 def cartprod(*arrays):
     N = len(arrays)
     return transpose(meshgrid(*arrays, indexing='ij'), 
                      roll(arange(N + 1), -1)).reshape(-1, N)

请注意,这遵循最后一个轴变化最快的约定(“C 风格”或“行主序”)。

In [88]: cartprod([1,2,3], [4,8], [100, 200, 300, 400], [-5, -4])
Out[88]: 
array([[  1,   4, 100,  -5],
       [  1,   4, 100,  -4],
       [  1,   4, 200,  -5],
       [  1,   4, 200,  -4],
       [  1,   4, 300,  -5],
       [  1,   4, 300,  -4],
       [  1,   4, 400,  -5],
       [  1,   4, 400,  -4],
       [  1,   8, 100,  -5],
       [  1,   8, 100,  -4],
       [  1,   8, 200,  -5],
       [  1,   8, 200,  -4],
       [  1,   8, 300,  -5],
       [  1,   8, 300,  -4],
       [  1,   8, 400,  -5],
       [  1,   8, 400,  -4],
       [  2,   4, 100,  -5],
       [  2,   4, 100,  -4],
       [  2,   4, 200,  -5],
       [  2,   4, 200,  -4],
       [  2,   4, 300,  -5],
       [  2,   4, 300,  -4],
       [  2,   4, 400,  -5],
       [  2,   4, 400,  -4],
       [  2,   8, 100,  -5],
       [  2,   8, 100,  -4],
       [  2,   8, 200,  -5],
       [  2,   8, 200,  -4],
       [  2,   8, 300,  -5],
       [  2,   8, 300,  -4],
       [  2,   8, 400,  -5],
       [  2,   8, 400,  -4],
       [  3,   4, 100,  -5],
       [  3,   4, 100,  -4],
       [  3,   4, 200,  -5],
       [  3,   4, 200,  -4],
       [  3,   4, 300,  -5],
       [  3,   4, 300,  -4],
       [  3,   4, 400,  -5],
       [  3,   4, 400,  -4],
       [  3,   8, 100,  -5],
       [  3,   8, 100,  -4],
       [  3,   8, 200,  -5],
       [  3,   8, 200,  -4],
       [  3,   8, 300,  -5],
       [  3,   8, 300,  -4],
       [  3,   8, 400,  -5],
       [  3,   8, 400,  -4]])

如果希望第一个轴变化最快(“FORTRAN 风格”或“列主序”),只需像这样更改 reshape() 的 order 参数:reshape((-1, N), order='F')

For a pure numpy implementation of the Cartesian product of 1D arrays (or flat python lists), just use meshgrid(), roll the axes with transpose(), and reshape to the desired output:

 from numpy import arange, meshgrid, roll, transpose

 def cartprod(*arrays):
     N = len(arrays)
     return transpose(meshgrid(*arrays, indexing='ij'), 
                      roll(arange(N + 1), -1)).reshape(-1, N)

Note this has the convention of last axis changing fastest (“C style” or “row-major”).

In [88]: cartprod([1,2,3], [4,8], [100, 200, 300, 400], [-5, -4])
Out[88]: 
array([[  1,   4, 100,  -5],
       [  1,   4, 100,  -4],
       [  1,   4, 200,  -5],
       [  1,   4, 200,  -4],
       [  1,   4, 300,  -5],
       [  1,   4, 300,  -4],
       [  1,   4, 400,  -5],
       [  1,   4, 400,  -4],
       [  1,   8, 100,  -5],
       [  1,   8, 100,  -4],
       [  1,   8, 200,  -5],
       [  1,   8, 200,  -4],
       [  1,   8, 300,  -5],
       [  1,   8, 300,  -4],
       [  1,   8, 400,  -5],
       [  1,   8, 400,  -4],
       [  2,   4, 100,  -5],
       [  2,   4, 100,  -4],
       [  2,   4, 200,  -5],
       [  2,   4, 200,  -4],
       [  2,   4, 300,  -5],
       [  2,   4, 300,  -4],
       [  2,   4, 400,  -5],
       [  2,   4, 400,  -4],
       [  2,   8, 100,  -5],
       [  2,   8, 100,  -4],
       [  2,   8, 200,  -5],
       [  2,   8, 200,  -4],
       [  2,   8, 300,  -5],
       [  2,   8, 300,  -4],
       [  2,   8, 400,  -5],
       [  2,   8, 400,  -4],
       [  3,   4, 100,  -5],
       [  3,   4, 100,  -4],
       [  3,   4, 200,  -5],
       [  3,   4, 200,  -4],
       [  3,   4, 300,  -5],
       [  3,   4, 300,  -4],
       [  3,   4, 400,  -5],
       [  3,   4, 400,  -4],
       [  3,   8, 100,  -5],
       [  3,   8, 100,  -4],
       [  3,   8, 200,  -5],
       [  3,   8, 200,  -4],
       [  3,   8, 300,  -5],
       [  3,   8, 300,  -4],
       [  3,   8, 400,  -5],
       [  3,   8, 400,  -4]])

If you want to change the first axis fastest (“FORTRAN style” or “column-major”), just change the order parameter of reshape() like this: reshape((-1, N), order='F')


回答 9

Pandas 的 merge 为这个问题提供了一个朴素而快速的解决方案:

import numpy as np
import pandas as pd

# given the lists
x, y, z = [1, 2, 3], [4, 5], [6, 7]

# get dfs with the same, constant index
x = pd.DataFrame({'x': x}, index=np.repeat(0, len(x)))
y = pd.DataFrame({'y': y}, index=np.repeat(0, len(y)))
z = pd.DataFrame({'z': z}, index=np.repeat(0, len(z)))

# get all combinations stored in a new df (a cross join via the shared index)
df = pd.merge(x, pd.merge(y, z, left_index=True, right_index=True),
              left_index=True, right_index=True)

Pandas merge offers a naive, fast solution to the problem:

import numpy as np
import pandas as pd

# given the lists
x, y, z = [1, 2, 3], [4, 5], [6, 7]

# get dfs with the same, constant index
x = pd.DataFrame({'x': x}, index=np.repeat(0, len(x)))
y = pd.DataFrame({'y': y}, index=np.repeat(0, len(y)))
z = pd.DataFrame({'z': z}, index=np.repeat(0, len(z)))

# get all combinations stored in a new df (a cross join via the shared index)
df = pd.merge(x, pd.merge(y, z, left_index=True, right_index=True),
              left_index=True, right_index=True)

如何在Python中将带有逗号分隔的项目的字符串转换为列表?

问题:如何在Python中将带有逗号分隔的项目的字符串转换为列表?

如何将字符串转换为列表?

假设字符串是 text = "a,b,c"。转换后希望得到 text == ['a', 'b', 'c'],并且 text[0] == 'a'、text[1] == 'b'?

How do you convert a string into a list?

Say the string is like text = "a,b,c". After the conversion, text == ['a', 'b', 'c'] and hopefully text[0] == 'a', text[1] == 'b'?


回答 0

像这样:

>>> text = 'a,b,c'
>>> text = text.split(',')
>>> text
[ 'a', 'b', 'c' ]

另外,如果您确信字符串是安全的,也可以使用 eval()(注意:这要求各项本身是合法的 Python 字面量,例如 "1,2,3";对于 "a,b,c" 会引发 NameError):

>>> text = 'a,b,c'
>>> text = eval('[' + text + ']')

Like this:

>>> text = 'a,b,c'
>>> text = text.split(',')
>>> text
[ 'a', 'b', 'c' ]

Alternatively, you can use eval() if you trust the string to be safe (note that the items must themselves be valid Python literals, e.g. "1,2,3"; the literal "a,b,c" above would raise a NameError):

>>> text = 'a,b,c'
>>> text = eval('[' + text + ']')

回答 1

只是补充现有的答案:希望将来您会遇到更多类似的情况:

>>> word = 'abc'
>>> L = list(word)
>>> L
['a', 'b', 'c']
>>> ''.join(L)
'abc'

但对于你现在要处理的情况,请采用 @Cameron 的回答。

>>> word = 'a,b,c'
>>> L = word.split(',')
>>> L
['a', 'b', 'c']
>>> ','.join(L)
'a,b,c'

Just to add on to the existing answers: hopefully, you’ll encounter something more like this in the future:

>>> word = 'abc'
>>> L = list(word)
>>> L
['a', 'b', 'c']
>>> ''.join(L)
'abc'

But for what you’re dealing with right now, go with @Cameron‘s answer.

>>> word = 'a,b,c'
>>> L = word.split(',')
>>> L
['a', 'b', 'c']
>>> ','.join(L)
'a,b,c'

回答 2

以下Python代码会将您的字符串转换为字符串列表:

import ast
teststr = "['aaa','bbb','ccc']"
testarray = ast.literal_eval(teststr)

The following Python code will turn your string into a list of strings:

import ast
teststr = "['aaa','bbb','ccc']"
testarray = ast.literal_eval(teststr)

回答 3

我不认为你需要

在python中,您几乎不需要将字符串转换为列表,因为字符串和列表非常相似

改变类型

如果您确实有一个应该是字符数组的字符串,请执行以下操作:

In [1]: x = "foobar"
In [2]: list(x)
Out[2]: ['f', 'o', 'o', 'b', 'a', 'r']

不更改类型

请注意,字符串非常类似于python中的列表

字符串具有访问器,例如列表

In [3]: x[0]
Out[3]: 'f'

字符串是可迭代的,例如列表

In [4]: for i in range(len(x)):
...:     print x[i]
...:     
f
o
o
b
a
r

TLDR

字符串是列表。几乎。

I don’t think you need to

In python you seldom need to convert a string to a list, because strings and lists are very similar

Changing the type

If you really have a string which should be a character array, do this:

In [1]: x = "foobar"
In [2]: list(x)
Out[2]: ['f', 'o', 'o', 'b', 'a', 'r']

Not changing the type

Note that Strings are very much like lists in python

Strings have accessors, like lists

In [3]: x[0]
Out[3]: 'f'

Strings are iterable, like lists

In [4]: for i in range(len(x)):
...:     print x[i]
...:     
f
o
o
b
a
r

TLDR

Strings are lists. Almost.


回答 4

如果要按空格分割,可以使用.split()

a = 'mary had a little lamb'
z = a.split()
print z

输出:

['mary', 'had', 'a', 'little', 'lamb'] 

In case you want to split by spaces, you can just use .split():

a = 'mary had a little lamb'
z = a.split()
print z

Output:

['mary', 'had', 'a', 'little', 'lamb'] 

回答 5

如果您实际上想要数组:

>>> from array import array
>>> text = "a,b,c"
>>> text = text.replace(',', '')
>>> myarray = array('c', text)
>>> myarray
array('c', 'abc')
>>> myarray[0]
'a'
>>> myarray[1]
'b'

如果您不需要数组,只想按索引访问字符,请记住字符串是可迭代的,就像列表一样,只不过它是不可变的:

>>> text = "a,b,c"
>>> text = text.replace(',', '')
>>> text[0]
'a'

If you actually want arrays:

>>> from array import array
>>> text = "a,b,c"
>>> text = text.replace(',', '')
>>> myarray = array('c', text)
>>> myarray
array('c', 'abc')
>>> myarray[0]
'a'
>>> myarray[1]
'b'

If you do not need arrays, and only want to look at your characters by index, remember a string is an iterable, just like a list, except that it is immutable:

>>> text = "a,b,c"
>>> text = text.replace(',', '')
>>> text[0]
'a'

回答 6

m = '[[1,2,3],[4,5,6],[7,8,9]]'

m= eval(m.split()[0])

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
m = '[[1,2,3],[4,5,6],[7,8,9]]'

m= eval(m.split()[0])

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

回答 7

所有答案都很好。还有另一种方法,即列表推导式,参见下面的解决方案。

u = "UUUDDD"

lst = [x for x in u]

对于逗号分隔的列表,请执行以下操作

u = "U,U,U,D,D,D"

lst = [x for x in u.split(',')]

All the answers are good. There is another way of doing it, which is list comprehension; see the solution below.

u = "UUUDDD"

lst = [x for x in u]

for comma separated list do the following

u = "U,U,U,D,D,D"

lst = [x for x in u.split(',')]

回答 8

我通常使用:

l = [ word.strip() for word in text.split(',') ]

strip 会去掉单词两侧的空格。

I usually use:

l = [ word.strip() for word in text.split(',') ]

the strip removes spaces around the words.


回答 9

要转换形如 a="[[1, 3], [2, -6]]" 的字符串,我写了下面这段尚未优化的代码:

matrixAr = []
mystring = "[[1, 3], [2, -4], [19, -15]]"
b=mystring.replace("[[","").replace("]]","") # to remove head [[ and tail ]]
for line in b.split('], ['):
    row =list(map(int,line.split(','))) #map = to convert the number from string (some has also space ) to integer
    matrixAr.append(row)
print matrixAr

To convert a string having the form a="[[1, 3], [2, -6]]", I wrote the following (not yet optimized) code:

matrixAr = []
mystring = "[[1, 3], [2, -4], [19, -15]]"
b=mystring.replace("[[","").replace("]]","") # to remove head [[ and tail ]]
for line in b.split('], ['):
    row =list(map(int,line.split(','))) #map = to convert the number from string (some has also space ) to integer
    matrixAr.append(row)
print matrixAr

回答 10

# to strip `,` and `.` from a string ->

>>> 'a,b,c.'.translate(None, ',.')
'abc'

您应该对字符串使用内置的 translate 方法。

在 Python shell 中键入 help('abc'.translate) 以获取更多信息。

# to strip `,` and `.` from a string ->

>>> 'a,b,c.'.translate(None, ',.')
'abc'

You should use the built-in translate method for strings.

Type help('abc'.translate) at the Python shell for more info.
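
The two-argument form above is Python 2 only; the Python 3 equivalent (a note I am adding) goes through str.maketrans:

# Python 3 equivalent of the two-argument translate used above
print('a,b,c.'.translate(str.maketrans('', '', ',.')))  # abc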


回答 11

使用函数式 Python:

text=filter(lambda x:x!=',',map(str,text))

Using functional Python:

text=filter(lambda x:x!=',',map(str,text))

回答 12

例子1

>>> email= "myemailid@gmail.com"
>>> email.split()
#OUTPUT
["myemailid@gmail.com"]

例子2

>>> email= "myemailid@gmail.com, someonsemailid@gmail.com"
>>> email.split(',')
#OUTPUT
["myemailid@gmail.com", "someonsemailid@gmail.com"]

Example 1

>>> email= "myemailid@gmail.com"
>>> email.split()
#OUTPUT
["myemailid@gmail.com"]

Example 2

>>> email= "myemailid@gmail.com, someonsemailid@gmail.com"
>>> email.split(',')
#OUTPUT
["myemailid@gmail.com", "someonsemailid@gmail.com"]