Tag Archives: arrays

What is the difference between contiguous and non-contiguous arrays?

Question: What is the difference between contiguous and non-contiguous arrays?


In the numpy manual about the reshape() function, it says

>>> a = np.zeros((10, 2))
# A transpose make the array non-contiguous
>>> b = a.T
# Taking a view makes it possible to modify the shape without modifying the
# initial object.
>>> c = b.view()
>>> c.shape = (20)
AttributeError: incompatible shape for a non-contiguous array

My questions are:

  1. What are contiguous and non-contiguous arrays? Is this similar to a contiguous memory block in C (as in "What is a contiguous memory block?")?
  2. Is there any performance difference between these two? When should we use one or the other?
  3. Why does transpose make the array non-contiguous?
  4. Why does c.shape = (20) throw the error incompatible shape for a non-contiguous array?

Thanks for your answer!


Answer 0


A contiguous array is just an array stored in an unbroken block of memory: to access the next value in the array, we just move to the next memory address.

Consider the 2D array arr = np.arange(12).reshape(3,4). It looks like this:
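
[[ 0,  1,  2,  3],
 [ 4,  5,  6,  7],
 [ 8,  9, 10, 11]]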

In the computer’s memory, the values of arr are stored like this:
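
[ 0][ 1][ 2][ 3][ 4][ 5][ 6][ 7][ 8][ 9][10][11]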

This means arr is a C contiguous array because the rows are stored as contiguous blocks of memory. The next memory address holds the next value in that row. If we want to move down a column, we just need to jump over three blocks (e.g. to jump from 0 to 4 means we skip over 1, 2 and 3).

Transposing the array with arr.T means that C contiguity is lost because adjacent row entries are no longer in adjacent memory addresses. However, arr.T is Fortran contiguous since the columns are in contiguous blocks of memory:
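
[[ 0,  4,  8],
 [ 1,  5,  9],
 [ 2,  6, 10],
 [ 3,  7, 11]]

Each column of arr.T (0,1,2,3 then 4,5,6,7 then 8,9,10,11) occupies one consecutive run of the same, unchanged memory block.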


Performance-wise, accessing memory addresses which are next to each other is very often faster than accessing addresses which are more “spread out” (fetching a value from RAM could entail a number of neighbouring addresses being fetched and cached for the CPU.) This means that operations over contiguous arrays will often be quicker.

As a consequence of C contiguous memory layout, row-wise operations are usually faster than column-wise operations. For example, you’ll typically find that

np.sum(arr, axis=1) # sum the rows

is slightly faster than:

np.sum(arr, axis=0) # sum the columns

Similarly, operations on columns will be slightly faster for Fortran contiguous arrays.
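
If you want to check this on your own machine, here is a quick timing sketch (illustrative only; the exact numbers depend on hardware, array size and NumPy version):

import numpy as np
from timeit import timeit

arr = np.random.rand(1000, 1000)   # C contiguous by default

# row-wise sum walks consecutive memory addresses
print(timeit(lambda: arr.sum(axis=1), number=100))
# column-wise sum strides through memory
print(timeit(lambda: arr.sum(axis=0), number=100))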


Finally, why can’t we flatten the Fortran contiguous array by assigning a new shape?

>>> arr2 = arr.T
>>> arr2.shape = 12
AttributeError: incompatible shape for a non-contiguous array

In order for this to be possible NumPy would have to put the rows of arr.T together like this:
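
[ 0][ 4][ 8][ 1][ 5][ 9][ 2][ 6][10][ 3][ 7][11]

(while the actual buffer still holds 0, 1, 2, ..., 11 in that order)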

(Setting the shape attribute directly assumes C order – i.e. NumPy tries to perform the operation row-wise.)

This is impossible to do. For any axis, NumPy needs to have a constant stride length (the number of bytes to move) to get to the next element of the array. Flattening arr.T in this way would require skipping forwards and backwards in memory to retrieve consecutive values of the array.

If we wrote arr2.reshape(12) instead, NumPy would copy the values of arr2 into a new block of memory (since it can’t return a view on to the original data for this shape).
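
A small check of the copy semantics (np.shares_memory is available in NumPy 1.11 and later):

import numpy as np

arr = np.arange(12).reshape(3, 4)
arr2 = arr.T

flat = arr2.reshape(12)                    # works: NumPy copies the data
print(flat)                                # [ 0  4  8  1  5  9  2  6 10  3  7 11]
print(np.shares_memory(arr, flat))         # False: a copy, not a view
print(np.shares_memory(arr, arr.ravel()))  # True: ravel of a C-contiguous array is a view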


Answer 1


Maybe this example with 12 different array values will help:

In [207]: x=np.arange(12).reshape(3,4).copy()

In [208]: x.flags
Out[208]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  ...
In [209]: x.T.flags
Out[209]: 
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  ...

The C order values are in the order in which they were generated. The transposed ones are not:

In [212]: x.reshape(12,)   # same as x.ravel()
Out[212]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [213]: x.T.reshape(12,)
Out[213]: array([ 0,  4,  8,  1,  5,  9,  2,  6, 10,  3,  7, 11])

You can get 1d views of both

In [214]: x1=x.T

In [217]: x.shape=(12,)

the shape of x can also be changed.

In [220]: x1.shape=(12,)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-220-cf2b1a308253> in <module>()
----> 1 x1.shape=(12,)

AttributeError: incompatible shape for a non-contiguous array

But the shape of the transpose cannot be changed. The data is still in the 0,1,2,3,4... order, which can’t be accessed as 0,4,8... in a 1d array.

But a copy of x1 can be changed:

In [227]: x2=x1.copy()

In [228]: x2.flags
Out[228]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  ...
In [229]: x2.shape=(12,)

Looking at strides might also help. A stride is how far (in bytes) it has to step to get to the next value. For a 2d array, there will be 2 stride values:

In [233]: x=np.arange(12).reshape(3,4).copy()

In [234]: x.strides
Out[234]: (16, 4)

To get to the next row, step 16 bytes, next column only 4.

In [235]: x1.strides
Out[235]: (4, 16)

Transpose just switches the order of the strides. The next row is only 4 bytes away, i.e. the next number.

In [236]: x.shape=(12,)

In [237]: x.strides
Out[237]: (4,)

Changing the shape also changes the strides – just step through the buffer 4 bytes at a time.

In [238]: x2=x1.copy()

In [239]: x2.strides
Out[239]: (12, 4)

Even though x2 looks just like x1, it has its own data buffer, with the values in a different order. The next column is now 4 bytes over, while the next row is 12 (3*4).

In [240]: x2.shape=(12,)

In [241]: x2.strides
Out[241]: (4,)

And as with x, changing the shape to 1d reduces the strides to (4,).

For x1, with data in the 0,1,2,... order, there isn’t a 1d stride that would give 0,4,8....

__array_interface__ is another useful way of displaying array information:

In [242]: x1.__array_interface__
Out[242]: 
{'strides': (4, 16),
 'typestr': '<i4',
 'shape': (4, 3),
 'version': 3,
 'data': (163336056, False),
 'descr': [('', '<i4')]}

The x1 data buffer address will be the same as for x, with which it shares the data. x2 has a different buffer address.

You could also experiment with adding a order='F' parameter to the copy and reshape commands.
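
For example, a sketch of what to expect (assuming 4-byte ints, as in the examples above):

In [243]: x = np.arange(12).reshape(3,4)

In [244]: x.T.reshape(12, order='F')   # read the transpose in Fortran order: a view, no copy
Out[244]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [245]: x.copy(order='F').strides    # Fortran-ordered copy: down a column is one itemsize
Out[245]: (4, 12)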


Numpy isnan() fails on an array of floats (from a pandas dataframe apply)

Question: Numpy isnan() fails on an array of floats (from a pandas dataframe apply)


I have an array of floats (some normal numbers, some nans) that is coming out of an apply on a pandas dataframe.

For some reason, numpy.isnan is failing on this array. However, as shown below, each element is a float, numpy.isnan runs correctly on each element, and the type of the variable is definitely a numpy array.

What’s going on?!

set([type(x) for x in tester])
Out[59]: {float}

tester
Out[60]: 
array([-0.7000000000000001, nan, nan, nan, nan, nan, nan, nan, nan, nan,
   nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
   nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
   nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
   nan, nan], dtype=object)

set([type(x) for x in tester])
Out[61]: {float}

np.isnan(tester)
Traceback (most recent call last):

File "<ipython-input-62-e3638605b43c>", line 1, in <module>
np.isnan(tester)

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

set([np.isnan(x) for x in tester])
Out[65]: {False, True}

type(tester)
Out[66]: numpy.ndarray

Answer 0


np.isnan can be applied to NumPy arrays of native dtype (such as np.float64):

In [99]: np.isnan(np.array([np.nan, 0], dtype=np.float64))
Out[99]: array([ True, False], dtype=bool)

but raises TypeError when applied to object arrays:

In [96]: np.isnan(np.array([np.nan, 0], dtype=object))
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Since you have Pandas, you could use pd.isnull instead — it can accept NumPy arrays of object or native dtypes:

In [97]: pd.isnull(np.array([np.nan, 0], dtype=float))
Out[97]: array([ True, False], dtype=bool)

In [98]: pd.isnull(np.array([np.nan, 0], dtype=object))
Out[98]: array([ True, False], dtype=bool)

Note that None is also considered a null value in object arrays.
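
Alternatively, if every element really is a Python float (as in the question's tester array), you can cast the object array to a native dtype first; a small sketch:

import numpy as np

tester = np.array([-0.7, np.nan, np.nan], dtype=object)  # stand-in for the question's array
print(np.isnan(tester.astype(np.float64)))               # [False  True  True]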


Answer 1


A great substitute for np.isnan() and pd.isnull() is

for i in range(0, a.shape[0]):
    if a[i] != a[i]:
        # do something here:
        # a[i] is nan
        pass

since only nan is not equal to itself.
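
The same trick also works without the explicit loop, even on object arrays, since the comparison is applied elementwise; for example:

import numpy as np

a = np.array([1.0, float('nan'), 2.0], dtype=object)
mask = a != a     # True exactly where the value is nan
print(mask)       # [False True False]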


Answer 2


On top of @unutbu's answer, you could coerce the pandas object array to a native (float64) dtype, something along the lines of

import pandas as pd
pd.to_numeric(df['tester'], errors='coerce')

Specify errors='coerce' to force strings that can't be parsed to a numeric value to become NaN. The column type will be dtype: float64, and then the isnan check should work.
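
For instance, a small illustration with a made-up column:

import numpy as np
import pandas as pd

s = pd.Series(['1.5', 'not a number', None])
cleaned = pd.to_numeric(s, errors='coerce')  # unparseable entries become NaN
print(cleaned.dtype)             # float64
print(np.isnan(cleaned.values))  # [False  True  True]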


Answer 3


Make sure you import the csv file using Pandas:

import pandas as pd

condition = pd.isnull(data[i][j])

How do I split/partition a dataset into training and test datasets for, e.g., cross validation?

Question: How do I split/partition a dataset into training and test datasets for, e.g., cross validation?


What is a good way to split a NumPy array randomly into training and testing/validation dataset? Something similar to the cvpartition or crossvalind functions in Matlab.


Answer 0


If you want to split the data set once in two halves, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices:

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]

or

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]

There are many ways to repeatedly partition the same data set for cross validation. One strategy is to resample from the dataset, with repetition:

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
training_idx = numpy.random.randint(x.shape[0], size=80)
test_idx = numpy.random.randint(x.shape[0], size=20)
training, test = x[training_idx,:], x[test_idx,:]

Finally, sklearn contains several cross validation methods (k-fold, leave-n-out, …). It also includes more advanced “stratified sampling” methods that create a partition of the data that is balanced with respect to some features, for example to make sure that there is the same proportion of positive and negative examples in the training and test set.
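
For instance, a minimal sketch of k-fold partitioning with scikit-learn's current API (the module path has moved over the years, so adjust to your version):

import numpy as np
from sklearn.model_selection import KFold

x = np.random.rand(100, 5)
for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(x):
    training, test = x[train_idx], x[test_idx]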


Answer 1


There is another option that just entails using scikit-learn. As scikit’s wiki describes, you can just use the following instructions:

import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)

This way you can keep the labels in sync with the data you're splitting into training and test sets.


Answer 2


Just a note. In case you want train, test, AND validation sets, you can do this:

from sklearn.cross_validation import train_test_split

X = get_my_X()
y = get_my_y()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

These parameters will give 70% to the training set and 15% each to the test and validation sets. Hope this helps.


Answer 3


As the sklearn.cross_validation module was deprecated, you can use:

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)

X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=42)

Answer 4


You may also consider stratified division into training and testing sets. Stratified division also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

y = np.array([1,1,2,2,3,3])
train_inds,test_inds = get_train_test_inds(y,train_proportion=0.5)
print(y[train_inds])
print(y[test_inds])

This code outputs:

[1 2 3]
[1 2 3]

Answer 5


I wrote a function for my own project to do this (it doesn’t use numpy, though):

def partition(seq, chunks):
    """Splits the sequence into equal sized chunks and them as a list"""
    result = []
    for i in range(chunks):
        chunk = []
        for element in seq[i:len(seq):chunks]:
            chunk.append(element)
        result.append(chunk)
    return result

If you want the chunks to be randomized, just shuffle the list before passing it in.


Answer 6


Here is some code to split the data into n=5 folds in a stratified manner:

# X = data array
# y = class labels
from sklearn.cross_validation import StratifiedKFold
skf = StratifiedKFold(y, n_folds=5)
for train_index, test_index in skf:
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Answer 7


Thanks pberkes for your answer. I just modified it to avoid (1) replacement while sampling and (2) duplicated instances occurring in both training and testing:

# option 1: sample 80% of the indices without replacement
training_idx = np.random.choice(X.shape[0], int(np.round(X.shape[0] * 0.8)), replace=False)
# option 2: equivalently, take the first 80% of a random permutation
training_idx = np.random.permutation(np.arange(X.shape[0]))[:int(np.round(X.shape[0] * 0.8))]
test_idx = np.setdiff1d(np.arange(0, X.shape[0]), training_idx)

Answer 8


After doing some reading and taking into account the (many..) different ways of splitting the data to train and test, I decided to timeit!

I used 4 different methods (none of them use the sklearn library, which I'm sure would give the best results, given that it is well designed and tested code):

  1. shuffle the whole matrix arr and then split the data to train and test
  2. shuffle the indices and then assign them to x and y to split the data
  3. same as method 2, but done in a more efficient way
  4. using pandas dataframe to split

Method 3 won by far with the shortest time, followed by method 1; methods 2 and 4 turned out to be really inefficient.

The code for the 4 different methods I timed:

import numpy as np
arr = np.random.rand(100, 3)
X = arr[:,:2]
Y = arr[:,2]
spl = 0.7
N = len(arr)
sample = int(spl*N)

#%% Method 1:  shuffle the whole matrix arr and then split
np.random.shuffle(arr)
x_train, x_test, y_train, y_test = X[:sample,:], X[sample:, :], Y[:sample, ], Y[sample:,]

#%% Method 2: shuffle the indices and then apply them to X and Y
train_idx = np.random.choice(N, sample)
Xtrain = X[train_idx]
Ytrain = Y[train_idx]

test_idx = [idx for idx in range(N) if idx not in train_idx]
Xtest = X[test_idx]
Ytest = Y[test_idx]

#%% Method 3: shuffle indices without a for loop
idx = np.random.permutation(arr.shape[0])  # can also use random.shuffle
train_idx, test_idx = idx[:sample], idx[sample:]
x_train, x_test, y_train, y_test = X[train_idx,:], X[test_idx,:], Y[train_idx,], Y[test_idx,]

#%% Method 4: using pandas dataframe to split
import pandas as pd
df = pd.read_csv(file_path, header=None) # Some csv file (I used some file with 3 columns)

train = df.sample(frac=0.7, random_state=200)
test = df.drop(train.index)

And for the times, the minimum time to execute out of 3 repetitions of 1000 loops is:

  • Method 1: 0.35883826200006297 seconds
  • Method 2: 1.7157016959999964 seconds
  • Method 3: 1.7876616719995582 seconds
  • Method 4: 0.07562861499991413 seconds

I hope that’s helpful!


Answer 9


Likely you will not only need to split into train and test, but also cross validate to make sure your model generalizes. Here I am assuming 70% training data, 20% validation and 10% holdout/test data.

Check out np.split:

If indices_or_sections is a 1-D array of sorted integers, the entries indicate where along axis the array is split. For example, [2, 3] would, for axis=0, result in

ary[:2] ary[2:3] ary[3:]

t, v, h = np.split(df.sample(frac=1, random_state=1), [int(0.7*len(df)), int(0.9*len(df))]) 
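
The same three-way split also works on a plain NumPy array, for example:

import numpy as np

x = np.random.rand(100, 5)
train, validate, test = np.split(np.random.permutation(x), [70, 90])  # 70% / 20% / 10%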

Answer 10


Split into train, test and validation sets:

import numpy as np

x = np.expand_dims(np.arange(100), -1)

print(x)

indices = np.random.permutation(x.shape[0])

training_idx = indices[:int(x.shape[0]*.9)]                 # first 90%
test_idx = indices[int(x.shape[0]*.9):int(x.shape[0]*.95)]  # next 5%
val_idx = indices[int(x.shape[0]*.95):]                     # remaining 5%

training, test, val = x[training_idx,:], x[test_idx,:], x[val_idx,:]

print(training, test, val)

How to pad a numpy array with zeros in python

Question: How to pad a numpy array with zeros in python


I want to know how I can pad a 2D numpy array with zeros using python 2.6.6 with numpy version 1.5.0. Sorry! But these are my limitations. Therefore I cannot use np.pad. For example, I want to pad a with zeros such that its shape matches b. The reason why I want to do this is so I can do:

b-a

such that

>>> a
array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])
>>> b
array([[ 3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.],
       [ 3.,  3.,  3.,  3.,  3.,  3.]])
>>> c
array([[1, 1, 1, 1, 1, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 1, 1, 1, 1, 0],
       [0, 0, 0, 0, 0, 0]])

The only way I can think of doing this is appending; however, this seems pretty ugly. Is there a cleaner solution, possibly using b.shape?

Edit: thanks to MSeifert's answer. I had to clean it up a bit, and this is what I got:

def pad(array, reference_shape, offsets):
    """
    array: Array to be padded
    reference_shape: tuple of size of ndarray to create
    offsets: list of offsets (number of elements must be equal to the dimension of the array)
    will throw a ValueError if offsets is too big and the reference_shape cannot handle the offsets
    """

    # Create an array of zeros with the reference shape
    result = np.zeros(reference_shape)
    # Create a list of slices from offset to offset + shape in each dimension
    insertHere = [slice(offsets[dim], offsets[dim] + array.shape[dim]) for dim in range(array.ndim)]
    # Insert the array in the result at the specified offsets
    result[tuple(insertHere)] = array  # index with a tuple of slices (a list of slices is deprecated)
    return result
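
For example, a small usage sketch with the function above:

import numpy as np

a = np.ones((3, 5))
c = pad(a, reference_shape=(4, 6), offsets=[0, 0])  # zero-pad a up to shape (4, 6)
# c now has b's shape from the question, so b - c works elementwise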

Answer 0


Very simple, you create an array containing zeros using the reference shape:

result = np.zeros(b.shape)
# actually you can also use result = np.zeros_like(b) 
# but that also copies the dtype not only the shape

and then insert the array where you need it:

result[:a.shape[0],:a.shape[1]] = a

and voila you have padded it:

print(result)
array([[ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.]])

You can also make it a bit more general if you define where your upper left element should be inserted

result = np.zeros_like(b)
x_offset = 1  # 0 would be what you wanted
y_offset = 1  # 0 in your case
result[x_offset:a.shape[0]+x_offset,y_offset:a.shape[1]+y_offset] = a
result

array([[ 0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  1.,  1.,  1.,  1.],
       [ 0.,  1.,  1.,  1.,  1.,  1.],
       [ 0.,  1.,  1.,  1.,  1.,  1.]])

but be careful not to use offsets bigger than allowed; for example, x_offset = 2 would fail here.


If you have an arbitrary number of dimensions you can define a list of slices to insert the original array. I’ve found it interesting to play around a bit, and created a padding function that can pad (with offset) an arbitrarily shaped array, as long as the array and reference have the same number of dimensions and the offsets are not too big.

def pad(array, reference, offsets):
    """
    array: Array to be padded
    reference: Reference array with the desired shape
    offsets: list of offsets (number of elements must be equal to the dimension of the array)
    """
    # Create an array of zeros with the reference shape
    result = np.zeros(reference.shape)
    # Create a list of slices from offset to offset + shape in each dimension
    insertHere = [slice(offsets[dim], offsets[dim] + array.shape[dim]) for dim in range(array.ndim)]
    # Insert the array in the result at the specified offsets (index with a tuple of slices)
    result[tuple(insertHere)] = array
    return result

And some test cases:

import numpy as np

# 1 Dimension
a = np.ones(2)
b = np.ones(5)
offset = [3]
pad(a, b, offset)

# 3 Dimensions

a = np.ones((3,3,3))
b = np.ones((5,4,3))
offset = [1,0,0]
pad(a, b, offset)

Answer 1


NumPy 1.7.0 (when numpy.pad was added) is pretty old now (it was released in 2013), so even though the question asked for a way without using that function, I thought it could be useful to know how this could be achieved using numpy.pad.

It’s actually pretty simple:

>>> import numpy as np
>>> a = np.array([[ 1.,  1.,  1.,  1.,  1.],
...               [ 1.,  1.,  1.,  1.,  1.],
...               [ 1.,  1.,  1.,  1.,  1.]])
>>> np.pad(a, [(0, 1), (0, 1)], mode='constant')
array([[ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.]])

In this case I relied on the fact that 0 is the default value for mode='constant'. But it can also be specified by passing it in explicitly:

>>> np.pad(a, [(0, 1), (0, 1)], mode='constant', constant_values=0)
array([[ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.]])

Just in case the second argument ([(0, 1), (0, 1)]) seems confusing: each list item (in this case a tuple) corresponds to a dimension, and the items therein represent the padding before (first element) and after (second element). So:

[(0, 1), (0, 1)]
         ^^^^^^------ padding for second dimension
 ^^^^^^-------------- padding for first dimension

  ^------------------ no padding at the beginning of the first axis
     ^--------------- pad with one "value" at the end of the first axis.

In this case the padding for the first and second axes is identical, so one could also just pass in the 2-tuple:

>>> np.pad(a, (0, 1), mode='constant')
array([[ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.]])

In case the padding before and after is identical one could even omit the tuple (not applicable in this case though):

>>> np.pad(a, 1, mode='constant')
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  1.,  1.,  1.,  1.,  0.],
       [ 0.,  1.,  1.,  1.,  1.,  1.,  0.],
       [ 0.,  1.,  1.,  1.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.]])

Or if the padding before and after is identical but differs between the axes, you could also omit the second element in the inner tuples:

>>> np.pad(a, [(1, ), (2, )], mode='constant')
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

However, I tend to always use the explicit form, because it's just too easy to make mistakes (when NumPy's expectations differ from your intentions):

>>> np.pad(a, [1, 2], mode='constant')
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.],
       [ 0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.],
       [ 0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

Here NumPy thinks you wanted to pad every axis with 1 element before and 2 elements after! Even if you intended it to pad with 1 element in axis 1 and 2 elements in axis 2.

I used lists of tuples for the padding; note that this is just "my convention", and you could also use lists of lists or tuples of tuples, or even tuples of arrays. NumPy just checks the length of the argument (or whether it has a length at all) and the length of each item (or whether it has one)!


Answer 2


I understand that your main problem is that you need to calculate d = b - a but your arrays have different sizes. There is no need for an intermediate padded c.

You can solve this without padding:

import numpy as np

a = np.array([[ 1.,  1.,  1.,  1.,  1.],
              [ 1.,  1.,  1.,  1.,  1.],
              [ 1.,  1.,  1.,  1.,  1.]])

b = np.array([[ 3.,  3.,  3.,  3.,  3.,  3.],
              [ 3.,  3.,  3.,  3.,  3.,  3.],
              [ 3.,  3.,  3.,  3.,  3.,  3.],
              [ 3.,  3.,  3.,  3.,  3.,  3.]])

d = b.copy()
d[:a.shape[0],:a.shape[1]] -=  a

print(d)

Output:

[[ 2.  2.  2.  2.  2.  3.]
 [ 2.  2.  2.  2.  2.  3.]
 [ 2.  2.  2.  2.  2.  3.]
 [ 3.  3.  3.  3.  3.  3.]]

Answer 3


In case you need to add a fence of 1s to an array:

>>> mat = np.zeros((4,4), np.int32)
>>> mat
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])
>>> mat[0,:] = mat[:,0] = mat[:,-1] =  mat[-1,:] = 1
>>> mat
array([[1, 1, 1, 1],
       [1, 0, 0, 1],
       [1, 0, 0, 1],
       [1, 1, 1, 1]])

Answer 4


I know I’m a bit late to this, but in case you wanted to perform relative padding (aka edge padding), here’s how you can implement it. Note that the very first instance of assignment results in zero-padding, so you can use this for both zero-padding and relative padding (this is where you copy the edge values of the original array into the padded array).

def replicate_padding(arr):
    """Perform replicate padding on a numpy array."""
    new_pad_shape = tuple(np.array(arr.shape) + 2) # 2 indicates the width + height to change, a (512, 512) image --> (514, 514) padded image.
    padded_array = np.zeros(new_pad_shape) #create an array of zeros with new dimensions
    
    # perform replication
    padded_array[1:-1,1:-1] = arr        # result will be zero-pad
    padded_array[0,1:-1] = arr[0]        # perform edge pad for top row
    padded_array[-1, 1:-1] = arr[-1]     # edge pad for bottom row
    padded_array.T[0, 1:-1] = arr.T[0]   # edge pad for first column
    padded_array.T[-1, 1:-1] = arr.T[-1] # edge pad for last column
    
    #at this point, all values except for the 4 corners should have been replicated
    padded_array[0][0] = arr[0][0]     # top left corner
    padded_array[-1][0] = arr[-1][0]   # bottom left corner
    padded_array[0][-1] = arr[0][-1]   # top right corner 
    padded_array[-1][-1] = arr[-1][-1] # bottom right corner

    return padded_array

Complexity Analysis:

The optimal solution for this is numpy's pad method. After averaging over 5 runs, np.pad with edge padding is only 8% faster than the function defined above. This shows that the function above is a fairly optimal method for relative (edge) and zero padding.


#My method, replicate_padding
start = time.time()
padded = replicate_padding(input_image)
end = time.time()
delta0 = end - start

#np.pad with edge padding
start = time.time()
padded = np.pad(input_image, 1, mode='edge')
end = time.time()
delta = end - start


print(delta0) # my method: 0.0008790493011474609
print(delta)  # np.pad:    0.0008130073547363281
print(100*((delta0-delta)/delta)) # Percent difference: 8.12316715542522%

How to convert a string of bytes into an int?

Question: How to convert a string of bytes into an int?


How can I convert a string of bytes into an int in python?

Say like this: 'y\xcc\xa6\xbb'

I came up with a clever/stupid way of doing it:

sum(ord(c) << (i * 8) for i, c in enumerate('y\xcc\xa6\xbb'[::-1]))

I know there has to be something builtin or in the standard library that does this more simply…

This is different from converting a string of hex digits for which you can use int(xxx, 16), but instead I want to convert a string of actual byte values.

UPDATE:

I kind of like James’ answer a little better because it doesn’t require importing another module, but Greg’s method is faster:

>>> from timeit import Timer
>>> Timer('struct.unpack("<L", "y\xcc\xa6\xbb")[0]', 'import struct').timeit()
0.36242198944091797
>>> Timer("int('y\xcc\xa6\xbb'.encode('hex'), 16)").timeit()
1.1432669162750244

My hacky method:

>>> Timer("sum(ord(c) << (i * 8) for i, c in enumerate('y\xcc\xa6\xbb'[::-1]))").timeit()
2.8819329738616943

FURTHER UPDATE:

Someone asked in comments what’s the problem with importing another module. Well, importing a module isn’t necessarily cheap, take a look:

>>> Timer("""import struct\nstruct.unpack(">L", "y\xcc\xa6\xbb")[0]""").timeit()
0.98822188377380371

Including the cost of importing the module negates almost all of the advantage that this method has. I believe that this will only include the expense of importing it once for the entire benchmark run; look what happens when I force it to reload every time:

>>> Timer("""reload(struct)\nstruct.unpack(">L", "y\xcc\xa6\xbb")[0]""", 'import struct').timeit()
68.474128007888794

Needless to say, if you're doing a lot of executions of this method per one import, then this becomes proportionally less of an issue. It's also probably I/O cost rather than CPU, so it may depend on the capacity and load characteristics of the particular machine.


Answer 0


You can also use the struct module to do this:

>>> import struct
>>> struct.unpack("<L", "y\xcc\xa6\xbb")[0]
3148270713L

Answer 1


In Python 3.2 and later, use

>>> int.from_bytes(b'y\xcc\xa6\xbb', byteorder='big')
2043455163

or

>>> int.from_bytes(b'y\xcc\xa6\xbb', byteorder='little')
3148270713

according to the endianness of your byte-string.

This also works for bytestring-integers of arbitrary length, and for two’s-complement signed integers by specifying signed=True. See the docs for from_bytes.
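
The inverse operation is int.to_bytes (also available since Python 3.2), which round-trips with the example above:

>>> (2043455163).to_bytes(4, byteorder='big')
b'y\xcc\xa6\xbb'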


Answer 2


As Greg said, you can use struct if you are dealing with binary values, but if you just have a "hex number" in byte format you might want to convert it like:

s = 'y\xcc\xa6\xbb'
num = int(s.encode('hex'), 16)

…this is the same as:

num = struct.unpack(">L", s)[0]

…except it’ll work for any number of bytes.


Answer 3


I use the following function to convert data between int, hex and bytes.

def bytes2int(str):
 return int(str.encode('hex'), 16)

def bytes2hex(str):
 return '0x'+str.encode('hex')

def int2bytes(i):
 h = int2hex(i)
 return hex2bytes(h)

def int2hex(i):
 return hex(i)

def hex2int(h):
 if len(h) > 1 and h[0:2] == '0x':
  h = h[2:]

 if len(h) % 2:
  h = "0" + h

 return int(h, 16)

def hex2bytes(h):
 if len(h) > 1 and h[0:2] == '0x':
  h = h[2:]

 if len(h) % 2:
  h = "0" + h

 return h.decode('hex')

Source: http://opentechnotes.blogspot.com.au/2014/04/convert-values-to-from-integer-hex.html


Answer 4


import array
integerValue = array.array("I", 'y\xcc\xa6\xbb')[0]

Warning: the above is strongly platform-specific. Both the “I” specifier and the endianness of the string->int conversion are dependent on your particular Python implementation. But if you want to convert many integers/strings at once, then the array module does it quickly.


Answer 5


In Python 2.x, you could use the format specifiers <B for unsigned bytes, and <b for signed bytes with struct.unpack/struct.pack.

E.g.:

x = '\xff\x10\x11'

data_ints = struct.unpack('<' + 'B'*len(x), x) # [255, 16, 17]

And:

data_bytes = struct.pack('<' + 'B'*len(data_ints), *data_ints) # '\xff\x10\x11'

That * is required!

See https://docs.python.org/2/library/struct.html#format-characters for a list of the format specifiers.


Answer 6


>>> reduce(lambda s, x: s*256 + x, bytearray("y\xcc\xa6\xbb"))
2043455163

Test 1: inverse:

>>> hex(2043455163)
'0x79cca6bb'

Test 2: Number of bytes > 8:

>>> reduce(lambda s, x: s*256 + x, bytearray("AAAAAAAAAAAAAAA"))
338822822454978555838225329091068225L

Test 3: Increment by one:

>>> reduce(lambda s, x: s*256 + x, bytearray("AAAAAAAAAAAAAAB"))
338822822454978555838225329091068226L

Test 4: Append one byte, say ‘A’:

>>> reduce(lambda s, x: s*256 + x, bytearray("AAAAAAAAAAAAAABA"))
86738642548474510294585684247313465921L

Test 5: Divide by 256:

>>> reduce(lambda s, x: s*256 + x, bytearray("AAAAAAAAAAAAAABA"))/256
338822822454978555838225329091068226L

Result equals the result of Test 4, as expected.


Answer 7


I was struggling to find a solution for arbitrary length byte sequences that would work under Python 2.x. Finally I wrote this one, it’s a bit hacky because it performs a string conversion, but it works.

Function for Python 2.x, arbitrary length

def signedbytes(data):
    """Convert a bytearray into an integer, considering the first bit as
    sign. The data must be big-endian."""
    negative = data[0] & 0x80 > 0

    if negative:
        inverted = bytearray(~d % 256 for d in data)
        return -signedbytes(inverted) - 1

    encoded = str(data).encode('hex')
    return int(encoded, 16)

This function has two requirements:

  • The input data needs to be a bytearray. You may call the function like this:

    s = 'y\xcc\xa6\xbb'
    n = signedbytes(s)
    
  • The data needs to be big-endian. In case you have a little-endian value, you should reverse it first:

    n = signedbytes(s[::-1])
    

Of course, this should be used only if arbitrary length is needed. Otherwise, stick with more standard ways (e.g. struct).


Answer 8

如果版本> = 3.2,则int.from_bytes是最佳解决方案。“ struct.unpack”解决方案需要一个字符串,因此它不适用于字节数组。这是另一种解决方案:

def bytes2int( tb, order='big'):
    if order == 'big': seq=[0,1,2,3]
    elif order == 'little': seq=[3,2,1,0]
    i = 0
    for j in seq: i = (i<<8)+tb[j]
    return i

hex(bytes2int([0x87,0x65,0x43,0x21]))返回’0x87654321’。

它能处理大端和小端字节序,并且很容易修改以支持 8 个字节。

int.from_bytes is the best solution if you are at version >=3.2. The “struct.unpack” solution requires a string so it will not apply to arrays of bytes. Here is another solution:

def bytes2int( tb, order='big'):
    if order == 'big': seq=[0,1,2,3]
    elif order == 'little': seq=[3,2,1,0]
    i = 0
    for j in seq: i = (i<<8)+tb[j]
    return i

hex( bytes2int( [0x87, 0x65, 0x43, 0x21])) returns ‘0x87654321’.

It handles big and little endianness and is easily modifiable for 8 bytes.
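
As a sketch of that modification (my own generalization, not part of the original answer), the index sequence can be derived from the input length instead of being hard-coded:

def bytes2int_any(tb, order='big'):
    # build the index sequence from len(tb) instead of hard-coding 4 bytes
    seq = range(len(tb)) if order == 'big' else range(len(tb) - 1, -1, -1)
    i = 0
    for j in seq:
        i = (i << 8) + tb[j]
    return i

print(hex(bytes2int_any([0x87, 0x65, 0x43, 0x21])))      # 0x87654321
print(hex(bytes2int_any([0x12, 0x34], order='little')))  # 0x3412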


回答 9

如上文所述,使用 struct 的 unpack 函数是一个很好的方式。如果想自己实现这个功能,还有另一种解决方案:

def bytes_to_int(bytes):
    result = 0
    for b in bytes:
        result = result * 256 + int(b)
    return result

As mentioned above, using the unpack function of struct is a good way. If you want to implement your own function, there is another solution:

def bytes_to_int(bytes):
    result = 0
    for b in bytes:
        result = result * 256 + int(b)
    return result

回答 10

在python 3中,您可以通过以下方式轻松地将字节字符串转换为整数列表(0..255)

>>> list(b'y\xcc\xa6\xbb')
[121, 204, 166, 187]

In python 3 you can easily convert a byte string into a list of integers (0..255) by

>>> list(b'y\xcc\xa6\xbb')
[121, 204, 166, 187]

回答 11

一种使用array.array的快速方法,我已经使用了一段时间:

预定义变量:

from array import array

offset = 0
size = 4
big = True # endian
arr = array('B')
arr.fromstring("\x00\x00\xff\x00") # 5 bytes (encoding issues) [0, 0, 195, 191, 0]

转为 int:(读取)

val = 0
for v in arr[offset:offset+size][::pow(-1,not big)]: val = (val<<8)|v

从 int 转回:(写入)

val = 16384
arr[offset:offset+size] = \
    array('B',((val>>(i<<3))&255 for i in range(size)))[::pow(-1,not big)]

不过,这些方法或许还能更快。

编辑:
给出一些数字:下面是一项性能测试(Anaconda 2.3.0),显示读取操作的平均耗时与 reduce() 相比十分稳定:

========================= byte array to int.py =========================
5000 iterations; threshold of min + 5000ns:
______________________________________code___|_______min______|_______max______|_______avg______|_efficiency
⣿⠀⠀⠀⠀⡇⢀⡀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⡀⠀⢰⠀⠀⠀⢰⠀⠀⠀⢸⠀⠀⢀⡇⠀⢀⠀⠀⠀⠀⢠⠀⠀⠀⠀⢰⠀⠀⠀⢸⡀⠀⠀⠀⢸⠀⡇⠀⠀⢠⠀⢰⠀⢸⠀
⣿⣦⣴⣰⣦⣿⣾⣧⣤⣷⣦⣤⣶⣾⣿⣦⣼⣶⣷⣶⣸⣴⣤⣀⣾⣾⣄⣤⣾⡆⣾⣿⣿⣶⣾⣾⣶⣿⣤⣾⣤⣤⣴⣼⣾⣼⣴⣤⣼⣷⣆⣴⣴⣿⣾⣷⣧⣶⣼⣴⣿⣶⣿⣶
    val = 0 \nfor v in arr: val = (val<<8)|v |     5373.848ns |   850009.965ns |     ~8649.64ns |  62.128%
⡇⠀⠀⢀⠀⠀⠀⡇⠀⡇⠀⠀⣠⠀⣿⠀⠀⠀⠀⡀⠀⠀⡆⠀⡆⢰⠀⠀⡆⠀⡄⠀⠀⠀⢠⢀⣼⠀⠀⡇⣠⣸⣤⡇⠀⡆⢸⠀⠀⠀⠀⢠⠀⢠⣿⠀⠀⢠⠀⠀⢸⢠⠀⡀
⣧⣶⣶⣾⣶⣷⣴⣿⣾⡇⣤⣶⣿⣸⣿⣶⣶⣶⣶⣧⣷⣼⣷⣷⣷⣿⣦⣴⣧⣄⣷⣠⣷⣶⣾⣸⣿⣶⣶⣷⣿⣿⣿⣷⣧⣷⣼⣦⣶⣾⣿⣾⣼⣿⣿⣶⣶⣼⣦⣼⣾⣿⣶⣷
                  val = reduce( shift, arr ) |     6489.921ns |  5094212.014ns |   ~12040.269ns |  53.902%

这是原始性能测试,因此省略了字节序的 pow 翻转。
所示的 shift 函数与 for 循环执行相同的移位或运算;arr 就是 array.array('B',[0,0,255,0]),因为它的迭代性能仅次于 dict。

我可能还应该指出,“效率”是以相对平均时间的准确度来衡量的。

A decently speedy method utilizing array.array I’ve been using for some time:

predefined variables:

from array import array

offset = 0
size = 4
big = True # endian
arr = array('B')
arr.fromstring("\x00\x00\xff\x00") # 5 bytes (encoding issues) [0, 0, 195, 191, 0]

to int: (read)

val = 0
for v in arr[offset:offset+size][::pow(-1,not big)]: val = (val<<8)|v

from int: (write)

val = 16384
arr[offset:offset+size] = \
    array('B',((val>>(i<<3))&255 for i in range(size)))[::pow(-1,not big)]

It’s possible these could be faster though.
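
A note I am adding on the [::pow(-1, not big)] trick: pow(-1, not big) evaluates to 1 when big is True and to -1 otherwise, so the slice step either keeps or reverses the byte order:

for big in (True, False):
    step = pow(-1, not big)  # 1 for big-endian, -1 for little-endian
    print(big, step, [0, 0, 255, 0][::step])
# True 1 [0, 0, 255, 0]
# False -1 [0, 255, 0, 0]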

EDIT:
For some numbers, here’s a performance test (Anaconda 2.3.0) showing stable averages on read in comparison to reduce():

========================= byte array to int.py =========================
5000 iterations; threshold of min + 5000ns:
______________________________________code___|_______min______|_______max______|_______avg______|_efficiency
⣿⠀⠀⠀⠀⡇⢀⡀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⡀⠀⢰⠀⠀⠀⢰⠀⠀⠀⢸⠀⠀⢀⡇⠀⢀⠀⠀⠀⠀⢠⠀⠀⠀⠀⢰⠀⠀⠀⢸⡀⠀⠀⠀⢸⠀⡇⠀⠀⢠⠀⢰⠀⢸⠀
⣿⣦⣴⣰⣦⣿⣾⣧⣤⣷⣦⣤⣶⣾⣿⣦⣼⣶⣷⣶⣸⣴⣤⣀⣾⣾⣄⣤⣾⡆⣾⣿⣿⣶⣾⣾⣶⣿⣤⣾⣤⣤⣴⣼⣾⣼⣴⣤⣼⣷⣆⣴⣴⣿⣾⣷⣧⣶⣼⣴⣿⣶⣿⣶
    val = 0 \nfor v in arr: val = (val<<8)|v |     5373.848ns |   850009.965ns |     ~8649.64ns |  62.128%
⡇⠀⠀⢀⠀⠀⠀⡇⠀⡇⠀⠀⣠⠀⣿⠀⠀⠀⠀⡀⠀⠀⡆⠀⡆⢰⠀⠀⡆⠀⡄⠀⠀⠀⢠⢀⣼⠀⠀⡇⣠⣸⣤⡇⠀⡆⢸⠀⠀⠀⠀⢠⠀⢠⣿⠀⠀⢠⠀⠀⢸⢠⠀⡀
⣧⣶⣶⣾⣶⣷⣴⣿⣾⡇⣤⣶⣿⣸⣿⣶⣶⣶⣶⣧⣷⣼⣷⣷⣷⣿⣦⣴⣧⣄⣷⣠⣷⣶⣾⣸⣿⣶⣶⣷⣿⣿⣿⣷⣧⣷⣼⣦⣶⣾⣿⣾⣼⣿⣿⣶⣶⣼⣦⣼⣾⣿⣶⣷
                  val = reduce( shift, arr ) |     6489.921ns |  5094212.014ns |   ~12040.269ns |  53.902%

This is a raw performance test, so the endian pow-flip is left out.
The shift function shown applies the same shift-oring operation as the for loop, and arr is just array.array('B',[0,0,255,0]) as it has the fastest iterative performance next to dict.

I should probably also note efficiency is measured by accuracy to the average time.


numpy-将行添加到数组

问题:numpy-将行添加到数组

如何将行添加到numpy数组?

我有一个数组A:

A = array([[0, 1, 2], [0, 2, 0]])

如果另一个数组 X 中某行的第一个元素满足特定条件,我希望把 X 中的这些行添加到这个数组里。

Numpy 数组似乎没有像列表那样的 'append' 方法。

如果A和X是列表,我只会这样做:

for i in X:
    if i[0] < 3:
        A.append(i)

是否有 numpythonic 的方式来实现同样的效果?

谢谢,S ;-)

How does one add rows to a numpy array?

I have an array A:

A = array([[0, 1, 2], [0, 2, 0]])

I wish to add rows to this array from another array X if the first element of each row in X meets a specific condition.

Numpy arrays do not have a method ‘append’ like that of lists, or so it seems.

If A and X were lists I would merely do:

for i in X:
    if i[0] < 3:
        A.append(i)

Is there a numpythonic way to do the equivalent?

Thanks, S ;-)


回答 0

X 是什么?如果它是一个二维数组,你怎么能把它的行与一个数字作比较:i < 3?

OP评论后编辑:

A = array([[0, 1, 2], [0, 2, 0]])
X = array([[0, 1, 2], [1, 2, 0], [2, 1, 2], [3, 2, 0]])

将 X 中第一个元素 < 3 的所有行添加到 A:

import numpy as np
A = np.vstack((A, X[X[:,0] < 3]))

# returns: 
array([[0, 1, 2],
       [0, 2, 0],
       [0, 1, 2],
       [1, 2, 0],
       [2, 1, 2]])

What is X? If it is a 2D-array, how can you then compare its row to a number: i < 3?

EDIT after OP’s comment:

A = array([[0, 1, 2], [0, 2, 0]])
X = array([[0, 1, 2], [1, 2, 0], [2, 1, 2], [3, 2, 0]])

add to A all rows from X where the first element < 3:

import numpy as np
A = np.vstack((A, X[X[:,0] < 3]))

# returns: 
array([[0, 1, 2],
       [0, 2, 0],
       [0, 1, 2],
       [1, 2, 0],
       [2, 1, 2]])

回答 1

好吧,你可以这样做:

  newrow = [1,2,3]
  A = numpy.vstack([A, newrow])

well u can do this :

  newrow = [1,2,3]
  A = numpy.vstack([A, newrow])

回答 2

由于这个问题提出已有 7 年,我使用的是较新的 numpy 1.13 和 python3。向矩阵添加一行时做法相同,但请记住给第二个参数加上双层括号,否则会引发维度错误。

这里我要向矩阵 A

1 2 3
4 5 6

添加一行

7 8 9

np.r_ 的用法也一样:

A= [[1, 2, 3], [4, 5, 6]]
np.append(A, [[7, 8, 9]], axis=0)

    >> array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
#or 
np.r_[A,[[7,8,9]]]

顺带一提,如果有人感兴趣,想要添加一列的话:

array = np.c_[A, np.zeros(len(A))]  # len(A) 即 A 的行数

沿用我们之前对矩阵 A 的例子,给它添加一列:

np.c_[A, [2,8]]

>> array([[1, 2, 3, 2],
          [4, 5, 6, 8]])

As this question was asked 7 years ago: in the latest version, which I am using (numpy 1.13 with python3), I am doing the same thing, adding a row to a matrix; remember to put double brackets around the second argument, otherwise it will raise a dimension error.

Here I am adding to matrix A

1 2 3
4 5 6

the row

7 8 9

The usage of np.r_ is the same:

A= [[1, 2, 3], [4, 5, 6]]
np.append(A, [[7, 8, 9]], axis=0)

    >> array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
#or 
np.r_[A,[[7,8,9]]]

Just in case someone is interested, if you would like to add a column:

array = np.c_[A, np.zeros(len(A))]  # len(A) is the number of rows of A

Following what we did before on matrix A, adding a column to it:

np.c_[A, [2,8]]

>> array([[1, 2, 3, 2],
          [4, 5, 6, 8]])

回答 3

您也可以这样做:

newrow = [1,2,3]
A = numpy.concatenate((A, [newrow]))  # 注意 [newrow]:保持二维,否则维度不匹配

You can also do this:

newrow = [1,2,3]
A = numpy.concatenate((A, [newrow]))  # note [newrow]: keeps it 2-D so the shapes match

回答 4

如果每行之后都不需要进行计算,则在python中添加行然后转换为numpy会更快。以下是使用python 3.6与numpy 1.14进行的时序测试,添加了100行,一次添加一行:

import numpy as np 
from time import perf_counter, sleep

def time_it():
    # Compare performance of two methods for adding rows to numpy array
    py_array = [[0, 1, 2], [0, 2, 0]]
    py_row = [4, 5, 6]
    numpy_array = np.array(py_array)
    numpy_row = np.array([4,5,6])
    n_loops = 100

    start_clock = perf_counter()
    for count in range(0, n_loops):
       numpy_array = np.vstack([numpy_array, numpy_row]) # 5.8 micros
    duration = perf_counter() - start_clock
    print('numpy 1.14 takes {:.3f} micros per row'.format(duration * 1e6 / n_loops))

    start_clock = perf_counter()
    for count in range(0, n_loops):
        py_array.append(py_row) # .15 micros
    numpy_array = np.array(py_array) # 43.9 micros       
    duration = perf_counter() - start_clock
    print('python 3.6 takes {:.3f} micros per row'.format(duration * 1e6 / n_loops))
    sleep(15)

#time_it() prints:

numpy 1.14 takes 5.971 micros per row
python 3.6 takes 0.694 micros per row

因此,对于七年前这个原始问题,简单的解决方案是:把行转换为 numpy 数组后,用 vstack() 添加新行。但更现实的解决方案应该考虑到 vstack 在这种情形下的糟糕性能。如果您不需要在每次添加后都对数组做数据分析,更好的做法是把新行缓冲到一个 python 行列表(实际上是列表的列表)中,然后在进行任何数据分析之前,用 vstack() 把它们作为一组一次性添加到 numpy 数组中。

If no calculations are necessary after every row, it’s much quicker to add rows in python, then convert to numpy. Here are timing tests using python 3.6 vs. numpy 1.14, adding 100 rows, one at a time:

import numpy as np 
from time import perf_counter, sleep

def time_it():
    # Compare performance of two methods for adding rows to numpy array
    py_array = [[0, 1, 2], [0, 2, 0]]
    py_row = [4, 5, 6]
    numpy_array = np.array(py_array)
    numpy_row = np.array([4,5,6])
    n_loops = 100

    start_clock = perf_counter()
    for count in range(0, n_loops):
       numpy_array = np.vstack([numpy_array, numpy_row]) # 5.8 micros
    duration = perf_counter() - start_clock
    print('numpy 1.14 takes {:.3f} micros per row'.format(duration * 1e6 / n_loops))

    start_clock = perf_counter()
    for count in range(0, n_loops):
        py_array.append(py_row) # .15 micros
    numpy_array = np.array(py_array) # 43.9 micros       
    duration = perf_counter() - start_clock
    print('python 3.6 takes {:.3f} micros per row'.format(duration * 1e6 / n_loops))
    sleep(15)

#time_it() prints:

numpy 1.14 takes 5.971 micros per row
python 3.6 takes 0.694 micros per row

So, the simple solution to the original question, from seven years ago, is to use vstack() to add a new row after converting the row to a numpy array. But a more realistic solution should consider vstack’s poor performance under those circumstances. If you don’t need to run data analysis on the array after every addition, it is better to buffer the new rows to a python list of rows (a list of lists, really), and add them as a group to the numpy array using vstack() before doing any data analysis.
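
A minimal sketch of that buffering pattern (the variable names are my own), applied to the arrays from the original question:

import numpy as np

A = np.array([[0, 1, 2], [0, 2, 0]])
X = np.array([[0, 1, 2], [1, 2, 0], [2, 1, 2], [3, 2, 0]])

buffered = [row for row in X if row[0] < 3]  # cheap Python-list appends
if buffered:
    A = np.vstack([A] + buffered)            # one single vstack at the end
print(A)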


回答 5

import numpy as np
array_ = np.array([[1,2,3]])
add_row = np.array([[4,5,6]])

array_ = np.concatenate((array_, add_row), axis=0)

import numpy as np
array_ = np.array([[1,2,3]])
add_row = np.array([[4,5,6]])

array_ = np.concatenate((array_, add_row), axis=0)

回答 6

如果您可以在一次操作中完成构造,那么类似 vstack 加花式索引的答案就是很好的方法。但如果您的条件更复杂,或者行是不断到来的,您可能就需要增长数组。实际上,做这类事情(动态增长数组)的 numpythonic 方式,是动态增长一个列表:

A = np.array([[1,2,3],[4,5,6]])
Alist = [r for r in A]
for i in range(100):
    newrow = np.arange(3)+i
    if i%5:
        Alist.append(newrow)
A = np.array(Alist)
del Alist

列表针对这种访问模式做了高度优化。在列表形式下,您没有 numpy 那种方便的多维索引,但只要您还在不断追加,就很难找到比行数组列表更好的做法。

If you can do the construction in a single operation, then something like the vstack-with-fancy-indexing answer is a fine approach. But if your condition is more complicated or your rows come in on the fly, you may want to grow the array. In fact the numpythonic way to do something like this – dynamically grow an array – is to dynamically grow a list:

A = np.array([[1,2,3],[4,5,6]])
Alist = [r for r in A]
for i in range(100):
    newrow = np.arange(3)+i
    if i%5:
        Alist.append(newrow)
A = np.array(Alist)
del Alist

Lists are highly optimized for this kind of access pattern; you don’t have convenient numpy multidimensional indexing while in list form, but for as long as you’re appending it’s hard to do better than a list of row arrays.


回答 7

我使用更快的 np.vstack,例如:

import numpy as np

input_array=np.array([1,2,3])
new_row= np.array([4,5,6])

new_array=np.vstack([input_array, new_row])

I use np.vstack, which is faster. For example:

import numpy as np

input_array=np.array([1,2,3])
new_row= np.array([4,5,6])

new_array=np.vstack([input_array, new_row])

回答 8

您可以使用 numpy.append() 向 numpy 数组追加一行,之后再将其重塑为矩阵。

import numpy as np
a = np.array([1,2])
a = np.append(a, [3,4])
print a
# [1,2,3,4]
# in your example
A = [1,2]
for row in X:
    A = np.append(A, row)

You can use numpy.append() to append a row to a numpy array and reshape it to a matrix later on.

import numpy as np
a = np.array([1,2])
a = np.append(a, [3,4])
print a
# [1,2,3,4]
# in your example
A = [1,2]
for row in X:
    A = np.append(A, row)

为什么Python的数组变慢?

问题:为什么Python的数组变慢?

我原以为 array.array 会比列表更快,因为数组似乎是未装箱的。

但是,我得到以下结果:

In [1]: import array

In [2]: L = list(range(100000000))

In [3]: A = array.array('l', range(100000000))

In [4]: %timeit sum(L)
1 loop, best of 3: 667 ms per loop

In [5]: %timeit sum(A)
1 loop, best of 3: 1.41 s per loop

In [6]: %timeit sum(L)
1 loop, best of 3: 627 ms per loop

In [7]: %timeit sum(A)
1 loop, best of 3: 1.39 s per loop

造成这种差异的原因是什么?

I expected array.array to be faster than lists, as arrays seem to be unboxed.

However, I get the following result:

In [1]: import array

In [2]: L = list(range(100000000))

In [3]: A = array.array('l', range(100000000))

In [4]: %timeit sum(L)
1 loop, best of 3: 667 ms per loop

In [5]: %timeit sum(A)
1 loop, best of 3: 1.41 s per loop

In [6]: %timeit sum(L)
1 loop, best of 3: 627 ms per loop

In [7]: %timeit sum(A)
1 loop, best of 3: 1.39 s per loop

What could be the cause of such a difference?


回答 0

存储确实是“未装箱的”,但每次访问元素时,Python 都必须先把它“装箱”(嵌入一个常规 Python 对象),才能对它做任何处理。例如,您的 sum(A) 会遍历数组,把每个整数逐一装箱成常规 Python int 对象。这要花时间。而在您的 sum(L) 中,所有装箱在创建列表时就已完成。

因此,归根结底,array 通常更慢,但所需的内存要少得多。


这是最新版本的Python 3的相关代码,但是自Python首次发布以来,相同的基本思想也适用于所有CPython实现。

这是访问列表项的代码:

PyObject *
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
    /* error checking omitted */
    return ((PyListObject *)op) -> ob_item[i];
}

它几乎没做什么:somelist[i] 只是返回列表中的第 i 个对象(CPython 中所有 Python 对象都是指向结构体的指针,该结构体的起始段与 struct PyObject 的布局一致)。

下面是类型代码为 l 的 array 的 __getitem__ 实现:

static PyObject *
l_getitem(arrayobject *ap, Py_ssize_t i)
{
    return PyLong_FromLong(((long *)ap->ob_item)[i]);
}

原始内存被视为平台原生 C long 整数的向量;先读出第 i 个 C long;然后调用 PyLong_FromLong() 把这个原生 C long 包装(“装箱”)成 Python long 对象(在 Python 3 中它实际显示为 int 类型,因为 Python 3 消除了 Python 2 中 int 与 long 的区别)。

这次装箱必须为 Python int 对象分配新内存,再把原生 C long 的位复制进去。在原始示例的上下文中,该对象的生命周期非常短(只够 sum() 把内容累加到运行总和里),之后还需要额外时间来释放这个新的 int 对象。

在 CPython 实现中,这就是速度差异的来源:过去如此,现在如此,将来也仍会如此。

The storage is “unboxed”, but every time you access an element Python has to “box” it (embed it in a regular Python object) in order to do anything with it. For example, your sum(A) iterates over the array, and boxes each integer, one at a time, in a regular Python int object. That costs time. In your sum(L), all the boxing was done at the time the list was created.

So, in the end, an array is generally slower, but requires substantially less memory.


Here’s the relevant code from a recent version of Python 3, but the same basic ideas apply to all CPython implementations since Python was first released.

Here’s the code to access a list item:

PyObject *
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
    /* error checking omitted */
    return ((PyListObject *)op) -> ob_item[i];
}

There’s very little to it: somelist[i] just returns the i‘th object in the list (and all Python objects in CPython are pointers to a struct whose initial segment conforms to the layout of a struct PyObject).

And here’s the __getitem__ implementation for an array with type code l:

static PyObject *
l_getitem(arrayobject *ap, Py_ssize_t i)
{
    return PyLong_FromLong(((long *)ap->ob_item)[i]);
}

The raw memory is treated as a vector of platform-native C long integers; the i‘th C long is read up; and then PyLong_FromLong() is called to wrap (“box”) the native C long in a Python long object (which, in Python 3, is actually shown as type int, since Python 3 eliminates Python 2’s distinction between int and long).

This boxing has to allocate new memory for a Python int object, and spray the native C long‘s bits into it. In the context of the original example, this object’s lifetime is very brief (just long enough for sum() to add the contents into a running total), and then more time is required to deallocate the new int object.

This is where the speed difference comes from, always has come from, and always will come from in the CPython implementation.
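
A small sketch I am adding to make the boxing visible (assuming CPython; a large value is used because CPython caches small ints and would reuse them):

import array

big = 10**9
L = [big]
A = array.array('l', [big])

print(L[0] is L[0])  # True: the list hands back the same stored object
print(A[0] is A[0])  # False: each access boxes a brand-new int object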


回答 1

作为对 Tim Peters 出色答案的补充:数组实现了缓冲协议,而列表没有。这意味着,如果您在编写 C 扩展(或与之等价的东西,例如 Cython 模块),就能以远快于纯 Python 的速度访问和操作数组元素。这会带来可观的提速,可能远超一个数量级。不过,它也有不少缺点:

  1. 您现在正从事编写C而不是Python的工作。Cython是改善这种情况的一种方法,但是并不能消除语言之间的许多基本差异;您需要熟悉C语义并了解它在做什么。
  2. PyPy的C API 在某种程度上可以正常工作,但是速度不是很快。如果您以PyPy为目标,则可能应该只编写带有常规列表的简单代码,然后让JITter为您优化它。
  3. C扩展比纯Python代码更难分发,因为它们需要编译。编译通常取决于体系结构和操作系统,因此您需要确保要针对目标平台进行编译。

不过,视您的用例而定,直接去写 C 扩展可能是杀鸡用牛刀。您应该先研究 NumPy,看它是否强大到足以完成您想做的数学运算。正确使用的话,它也会比原生 Python 快得多。

To add to Tim Peters’ excellent answer, arrays implement the buffer protocol, while lists do not. This means that, if you are writing a C extension (or the moral equivalent, such as writing a Cython module), then you can access and work with the elements of an array much faster than anything Python can do. This will give you considerable speed improvements, possibly well over an order of magnitude. However, it has a number of downsides:

  1. You are now in the business of writing C instead of Python. Cython is one way to ameliorate this, but it does not eliminate many fundamental differences between the languages; you need to be familiar with C semantics and understand what it is doing.
  2. PyPy’s C API works to some extent, but isn’t very fast. If you are targeting PyPy, you should probably just write simple code with regular lists, and then let the JITter optimize it for you.
  3. C extensions are harder to distribute than pure Python code because they need to be compiled. Compilation tends to be architecture and operating-system dependent, so you will need to ensure you are compiling for your target platform.

Going straight to C extensions may be using a sledgehammer to swat a fly, depending on your use case. You should first investigate NumPy and see if it is powerful enough to do whatever math you’re trying to do. It will also be much faster than native Python, if used correctly.
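
A quick sketch I am adding to show the buffer protocol difference without writing any C:

import array

a = array.array('l', range(10))
mv = memoryview(a)         # works: array.array exposes its raw buffer
print(mv[3])               # 3

try:
    memoryview([1, 2, 3])  # lists hold boxed objects, not a flat buffer
except TypeError as e:
    print(e)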


回答 2

蒂姆·彼得斯(Tim Peters)回答了为什么这样做很慢,但让我们看看如何改善它。

沿用您的 sum(range(...)) 示例(规模比您的示例小 10 倍,以便装进这里的内存):

import numpy
import array
L = list(range(10**7))
A = array.array('l', L)
N = numpy.array(L)

%timeit sum(L)
10 loops, best of 3: 101 ms per loop

%timeit sum(A)
1 loop, best of 3: 237 ms per loop

%timeit sum(N)
1 loop, best of 3: 743 ms per loop

这种用法下 numpy 也需要装箱/拆箱,带来额外开销。要让它快起来,必须停留在 numpy 的 C 代码内:

%timeit N.sum()
100 loops, best of 3: 6.27 ms per loop

因此,从列表方案到 numpy 版本,运行时间相差 16 倍。

我们还要检查创建这些数据结构需要花费多长时间

%timeit list(range(10**7))
1 loop, best of 3: 283 ms per loop

%timeit array.array('l', range(10**7))
1 loop, best of 3: 884 ms per loop

%timeit numpy.array(range(10**7))
1 loop, best of 3: 1.49 s per loop

%timeit numpy.arange(10**7)
10 loops, best of 3: 21.7 ms per loop

明确的获胜者:Numpy

还要注意,创建数据结构所花费的时间与求和所花费的时间差不多,甚至更多。分配内存很慢。

它们的内存使用情况:

sys.getsizeof(L)
90000112
sys.getsizeof(A)
81940352
sys.getsizeof(N)
80000096

因此,每个数字占用 8 个字节,外加各不相同的额外开销。对于这个取值范围,32 位整数就足够了,所以我们可以节省一些内存。

N=numpy.arange(10**7, dtype=numpy.int32)

sys.getsizeof(N)
40000096

%timeit N.sum()
100 loops, best of 3: 8.35 ms per loop

但是事实证明,在我的计算机上添加64位整数要快于32位整数,因此只有在受内存/带宽限制的情况下才值得这样做。

Tim Peters answered why this is slow, but let’s see how to improve it.

Sticking to your example of sum(range(...)) (factor 10 smaller than your example to fit into memory here):

import numpy
import array
L = list(range(10**7))
A = array.array('l', L)
N = numpy.array(L)

%timeit sum(L)
10 loops, best of 3: 101 ms per loop

%timeit sum(A)
1 loop, best of 3: 237 ms per loop

%timeit sum(N)
1 loop, best of 3: 743 ms per loop

Used this way, numpy also needs to box/unbox, which has additional overhead. To make it fast one has to stay within the numpy C code:

%timeit N.sum()
100 loops, best of 3: 6.27 ms per loop

So from the list solution to the numpy version this is a factor of 16 in runtime.

Let’s also check how long creating those data structures takes

%timeit list(range(10**7))
1 loop, best of 3: 283 ms per loop

%timeit array.array('l', range(10**7))
1 loop, best of 3: 884 ms per loop

%timeit numpy.array(range(10**7))
1 loop, best of 3: 1.49 s per loop

%timeit numpy.arange(10**7)
10 loops, best of 3: 21.7 ms per loop

Clear winner: Numpy

Also note that creating the data structure takes about as much time as summing, if not more. Allocating memory is slow.

Memory usage of those:

sys.getsizeof(L)
90000112
sys.getsizeof(A)
81940352
sys.getsizeof(N)
80000096

So these take 8 bytes per number, with varying overhead. For the range we use, 32-bit ints are sufficient, so we can save some memory.

N=numpy.arange(10**7, dtype=numpy.int32)

sys.getsizeof(N)
40000096

%timeit N.sum()
100 loops, best of 3: 8.35 ms per loop

But it turns out that adding 64bit ints is faster than 32bit ints on my machine, so this is only worth it if you are limited by memory/bandwidth.


回答 3

请注意,100000000 等于 10^8,而不是 10^7;我的结果如下:

100000000 == 10**8

# my test results on a Linux virtual machine:
#<L = list(range(100000000))> Time: 0:00:03.263585
#<A = array.array('l', range(100000000))> Time: 0:00:16.728709
#<L = list(range(10**8))> Time: 0:00:03.119379
#<A = array.array('l', range(10**8))> Time: 0:00:18.042187
#<A = array.array('l', L)> Time: 0:00:07.524478
#<sum(L)> Time: 0:00:01.640671
#<np.sum(L)> Time: 0:00:20.762153

Please note that 100000000 equals 10^8, not 10^7, and my results are as follows:

100000000 == 10**8

# my test results on a Linux virtual machine:
#<L = list(range(100000000))> Time: 0:00:03.263585
#<A = array.array('l', range(100000000))> Time: 0:00:16.728709
#<L = list(range(10**8))> Time: 0:00:03.119379
#<A = array.array('l', range(10**8))> Time: 0:00:18.042187
#<A = array.array('l', L)> Time: 0:00:07.524478
#<sum(L)> Time: 0:00:01.640671
#<np.sum(L)> Time: 0:00:20.762153

将 numpy 矩阵转换为数组

问题:将 numpy 矩阵转换为数组

我正在使用numpy。我有一个具有1列和N行的矩阵,并且我想从中获得具有N个元素的数组。

例如,如果我有M = matrix([[1], [2], [3], [4]]),我想得到A = array([1,2,3,4])

为此,我使用A = np.array(M.T)[0]。有谁知道一种更优雅的方式来获得相同的结果?

谢谢!

I am using numpy. I have a matrix with 1 column and N rows and I want to get an array from it with N elements.

For example, if I have M = matrix([[1], [2], [3], [4]]), I want to get A = array([1,2,3,4]).

To achieve it, I use A = np.array(M.T)[0]. Does anyone know a more elegant way to get the same result?

Thanks!


回答 0

如果您想让内容更具可读性,可以执行以下操作:

A = np.squeeze(np.asarray(M))

同样,您也可以执行以下操作:A = np.asarray(M).reshape(-1),但是它不太容易阅读。

If you’d like something a bit more readable, you can do this:

A = np.squeeze(np.asarray(M))

Equivalently, you could also do: A = np.asarray(M).reshape(-1), but that’s a bit less easy to read.


回答 2

A, = np.array(M.T)

我想,这取决于您所说的“优雅”是什么意思,但这就是我会做的:

A, = np.array(M.T)

Depends what you mean by elegance, I suppose, but that's what I would do.


回答 3

您可以尝试以下变体:

result=np.array(M).flatten()

You can try the following variant:

result=np.array(M).flatten()

回答 4

np.array(M).ravel()

如果您在乎速度,用上面这个;但如果您在乎内存,用:

np.asarray(M).ravel()

np.array(M).ravel()

If you care about speed, use the one above; but if you care about memory:

np.asarray(M).ravel()

回答 5

或者,您可以尝试用下面的方式避免一些临时对象:

A = M.view(np.ndarray)
A.shape = -1

Or you could try to avoid some temps with

A = M.view(np.ndarray)
A.shape = -1

回答 6

首先,Mv = numpy.asarray(M.T),这会得到一个 1×4 的、但仍是 2D 的数组。

然后,执行 A = Mv[0,:],这就给出了您想要的结果。也可以把它们合在一起写成 numpy.asarray(M.T)[0,:]。

First, Mv = numpy.asarray(M.T), which gives you a 1×4 but 2D array.

Then, perform A = Mv[0,:], which gives you what you want. You could put them together, as numpy.asarray(M.T)[0,:].


回答 7

这会将矩阵转换为数组

A = np.ravel(M).T

This will convert the matrix into array

A = np.ravel(M).T

回答 8

numpy 的 ravel() 和 flatten() 函数是我会在这里尝试的两种技术。我想对 Joe、Siraj、bubble 和 Kevad 的帖子做一点补充。

Ravel:

A = M.ravel()
print A, A.shape
>>> [1 2 3 4] (4,)

展平:

M = np.array([[1], [2], [3], [4]])
A = M.flatten()
print A, A.shape
>>> [1 2 3 4] (4,)

numpy.ravel() 更快,因为它是库级函数,不会复制数组。但要注意,使用 numpy.ravel() 时,对数组 A 的任何更改都会带回到原始数组 M 中。

numpy.flatten() 比 numpy.ravel() 慢。但如果用 numpy.flatten() 创建 A,那么对 A 的更改不会传递到原始数组 M。

numpy.squeeze() 和 M.reshape(-1) 比 numpy.flatten() 和 numpy.ravel() 都慢。

%timeit M.ravel()
>>> 1000000 loops, best of 3: 309 ns per loop

%timeit M.flatten()
>>> 1000000 loops, best of 3: 650 ns per loop

%timeit M.reshape(-1)
>>> 1000000 loops, best of 3: 755 ns per loop

%timeit np.squeeze(M)
>>> 1000000 loops, best of 3: 886 ns per loop

ravel() and flatten() functions from numpy are two techniques that I would try here. I would like to add to the posts made by Joe, Siraj, bubble and Kevad.

Ravel:

A = M.ravel()
print A, A.shape
>>> [1 2 3 4] (4,)

Flatten:

M = np.array([[1], [2], [3], [4]])
A = M.flatten()
print A, A.shape
>>> [1 2 3 4] (4,)

numpy.ravel() is faster, since it is a library level function which does not make any copy of the array. However, any change in array A will carry itself over to the original array M if you are using numpy.ravel().

numpy.flatten() is slower than numpy.ravel(). But if you are using numpy.flatten() to create A, then changes in A will not get carried over to the original array M.

numpy.squeeze() and M.reshape(-1) are slower than numpy.flatten() and numpy.ravel().

%timeit M.ravel()
>>> 1000000 loops, best of 3: 309 ns per loop

%timeit M.flatten()
>>> 1000000 loops, best of 3: 650 ns per loop

%timeit M.reshape(-1)
>>> 1000000 loops, best of 3: 755 ns per loop

%timeit np.squeeze(M)
>>> 1000000 loops, best of 3: 886 ns per loop
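
To make the view-versus-copy distinction above concrete, here is a small sketch I am adding (using a plain ndarray; note that ravel() returns a view only when it can):

import numpy as np

M = np.array([[1], [2], [3], [4]])

A = M.ravel()    # a view here, since M is contiguous
A[0] = 99
print(M[0, 0])   # 99: the change reached the original

B = M.flatten()  # always a copy
B[1] = -1
print(M[1, 0])   # 2: the original is untouched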

使用numpy构建两个数组的所有组合的数组

问题:使用numpy构建两个数组的所有组合的数组

在对一个 6 参数函数做任何复杂处理之前,我想先遍历它的参数空间、研究它的数值行为,因此我在寻找一种高效的方法来做这件事。

我的函数接收一个 6 维 numpy 数组作为输入,返回 float 值。我最初尝试的做法是:

首先,我创建了一个函数,该函数接受2个数组并生成一个包含两个数组中值的所有组合的数组

from numpy import *
def comb(a,b):
    c = []
    for i in a:
        for j in b:
            c.append(r_[i,j])
    return c

然后我用 reduce() 将其应用于同一数组的 m 个副本:

def combs(a,m):
    return reduce(comb,[a]*m)

然后我像这样对我的函数求值:

values = combs(np.arange(0,1,0.1),6)
for val in values:
    print F(val)

这可行,但实在太慢了。我知道参数空间很大,但不该这么慢。在这个示例中,我只采样了 10^6(一百万)个点,仅创建数组 values 就花了超过 15 秒。

您知道使用numpy进行此操作的更有效的方法吗?

如果需要,我可以修改函数 F 接收参数的方式。

I’m trying to run over the parameter space of a 6-parameter function to study its numerical behavior before trying to do anything complex with it, so I’m searching for an efficient way to do this.

My function takes float values given a 6-dim numpy array as input. What I tried to do initially was this:

First I created a function that takes 2 arrays and generate an array with all combinations of values from the two arrays

from numpy import *
def comb(a,b):
    c = []
    for i in a:
        for j in b:
            c.append(r_[i,j])
    return c

Then I used reduce() to apply that to m copies of the same array:

def combs(a,m):
    return reduce(comb,[a]*m)

And then I evaluate my function like this:

values = combs(np.arange(0,1,0.1),6)
for val in values:
    print F(val)

This works but it’s waaaay too slow. I know the space of parameters is huge, but this shouldn’t be so slow. I have only sampled 10^6 (a million) points in this example and it took more than 15 seconds just to create the array values.

Do you know any more efficient way of doing this with numpy?

I can modify the way the function F takes its arguments if necessary.


回答 0

numpy(> 1.8.x)的较新版本中,numpy.meshgrid()提供了更快的实现:

@pv 的解决方案

In [113]:

%timeit cartesian(([1, 2, 3], [4, 5], [6, 7]))
10000 loops, best of 3: 135 µs per loop
In [114]:

cartesian(([1, 2, 3], [4, 5], [6, 7]))

Out[114]:
array([[1, 4, 6],
       [1, 4, 7],
       [1, 5, 6],
       [1, 5, 7],
       [2, 4, 6],
       [2, 4, 7],
       [2, 5, 6],
       [2, 5, 7],
       [3, 4, 6],
       [3, 4, 7],
       [3, 5, 6],
       [3, 5, 7]])

numpy.meshgrid() 以前只能用于 2D,现在可以用于 ND。在这个例子中是 3D:

In [115]:

%timeit np.array(np.meshgrid([1, 2, 3], [4, 5], [6, 7])).T.reshape(-1,3)
10000 loops, best of 3: 74.1 µs per loop
In [116]:

np.array(np.meshgrid([1, 2, 3], [4, 5], [6, 7])).T.reshape(-1,3)

Out[116]:
array([[1, 4, 6],
       [1, 5, 6],
       [2, 4, 6],
       [2, 5, 6],
       [3, 4, 6],
       [3, 5, 6],
       [1, 4, 7],
       [1, 5, 7],
       [2, 4, 7],
       [2, 5, 7],
       [3, 4, 7],
       [3, 5, 7]])

请注意,最终结果的顺序略有不同。

In newer version of numpy (>1.8.x), numpy.meshgrid() provides a much faster implementation:

@pv’s solution

In [113]:

%timeit cartesian(([1, 2, 3], [4, 5], [6, 7]))
10000 loops, best of 3: 135 µs per loop
In [114]:

cartesian(([1, 2, 3], [4, 5], [6, 7]))

Out[114]:
array([[1, 4, 6],
       [1, 4, 7],
       [1, 5, 6],
       [1, 5, 7],
       [2, 4, 6],
       [2, 4, 7],
       [2, 5, 6],
       [2, 5, 7],
       [3, 4, 6],
       [3, 4, 7],
       [3, 5, 6],
       [3, 5, 7]])

numpy.meshgrid() used to be 2D only; now it is capable of ND. In this case, 3D:

In [115]:

%timeit np.array(np.meshgrid([1, 2, 3], [4, 5], [6, 7])).T.reshape(-1,3)
10000 loops, best of 3: 74.1 µs per loop
In [116]:

np.array(np.meshgrid([1, 2, 3], [4, 5], [6, 7])).T.reshape(-1,3)

Out[116]:
array([[1, 4, 6],
       [1, 5, 6],
       [2, 4, 6],
       [2, 5, 6],
       [3, 4, 6],
       [3, 5, 6],
       [1, 4, 7],
       [1, 5, 7],
       [2, 4, 7],
       [2, 5, 7],
       [3, 4, 7],
       [3, 5, 7]])

Note that the order of the final resultant is slightly different.
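
If the exact row order matters, passing indexing='ij' (an option I am adding beyond the original answer) reproduces the row order of the cartesian() solution referenced above:

import numpy as np

out = np.array(np.meshgrid([1, 2, 3], [4, 5], [6, 7], indexing='ij'))
out = out.reshape(3, -1).T  # 3 input arrays -> 12 rows of 3
print(out)                  # same row order as cartesian()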


回答 1

这是一个纯粹的numpy实现。它比使用itertools快约5倍。


import numpy as np

def cartesian(arrays, out=None):
    """
    Generate a cartesian product of input arrays.

    Parameters
    ----------
    arrays : list of array-like
        1-D arrays to form the cartesian product of.
    out : ndarray
        Array to place the cartesian product in.

    Returns
    -------
    out : ndarray
        2-D array of shape (M, len(arrays)) containing cartesian products
        formed of input arrays.

    Examples
    --------
    >>> cartesian(([1, 2, 3], [4, 5], [6, 7]))
    array([[1, 4, 6],
           [1, 4, 7],
           [1, 5, 6],
           [1, 5, 7],
           [2, 4, 6],
           [2, 4, 7],
           [2, 5, 6],
           [2, 5, 7],
           [3, 4, 6],
           [3, 4, 7],
           [3, 5, 6],
           [3, 5, 7]])

    """

    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

    m = n / arrays[0].size
    out[:,0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m, 1:])
        for j in xrange(1, arrays[0].size):
            out[j*m:(j+1)*m, 1:] = out[0:m, 1:]
    return out

Here’s a pure-numpy implementation. It’s about 5× faster than using itertools.


import numpy as np

def cartesian(arrays, out=None):
    """
    Generate a cartesian product of input arrays.

    Parameters
    ----------
    arrays : list of array-like
        1-D arrays to form the cartesian product of.
    out : ndarray
        Array to place the cartesian product in.

    Returns
    -------
    out : ndarray
        2-D array of shape (M, len(arrays)) containing cartesian products
        formed of input arrays.

    Examples
    --------
    >>> cartesian(([1, 2, 3], [4, 5], [6, 7]))
    array([[1, 4, 6],
           [1, 4, 7],
           [1, 5, 6],
           [1, 5, 7],
           [2, 4, 6],
           [2, 4, 7],
           [2, 5, 6],
           [2, 5, 7],
           [3, 4, 6],
           [3, 4, 7],
           [3, 5, 6],
           [3, 5, 7]])

    """

    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

    m = n / arrays[0].size
    out[:,0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m, 1:])
        for j in xrange(1, arrays[0].size):
            out[j*m:(j+1)*m, 1:] = out[0:m, 1:]
    return out

回答 2

通常,itertools.combinations 是从 Python 容器中获取组合的最快方法(如果您确实想要组合,即无重复且与顺序无关的选取;这似乎不是您的代码在做的事,但我无法判断这是因为您的代码有错误,还是因为您用错了术语)。

如果您想要的不是组合,那么 itertools 中的其他迭代器(product 或 permutations)也许更适合您。例如,您的代码看起来与下面的代码大致相同:

for val in itertools.product(np.arange(0, 1, 0.1), repeat=6):
    print F(val)

所有这些迭代器生成的都是元组,而不是列表或 numpy 数组。因此,如果您的 F 非要接收 numpy 数组不可,您就得承担在每一步构造一个新数组,或清空并重新填充同一个数组的额外开销。

itertools.combinations is in general the fastest way to get combinations from a Python container (if you do in fact want combinations, i.e., arrangements WITHOUT repetitions and independent of order; that’s not what your code appears to be doing, but I can’t tell whether that’s because your code is buggy or because you’re using the wrong terminology).

If you want something different than combinations perhaps other iterators in itertools, product or permutations, might serve you better. For example, it looks like your code is roughly the same as:

for val in itertools.product(np.arange(0, 1, 0.1), repeat=6):
    print F(val)

All of these iterators yield tuples, not lists or numpy arrays, so if your F is picky about getting specifically a numpy array you’ll have to accept the extra overhead of constructing or clearing and re-filling one at each step.
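
A minimal sketch of the 'clearing and re-filling' approach just mentioned (the buffer and the dummy F are my own, and the grid is kept small):

import itertools
import numpy as np

def F(arr):        # stand-in for the real 6-parameter function
    return arr.sum()

buf = np.empty(3)  # allocated once, reused on every step
for val in itertools.product(np.arange(0, 1, 0.1), repeat=3):
    buf[:] = val   # refill the same array from the tuple
    F(buf)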


回答 3

你可以做这样的事情

import numpy as np

def cartesian_coord(*arrays):
    grid = np.meshgrid(*arrays)        
    coord_list = [entry.ravel() for entry in grid]
    points = np.vstack(coord_list).T
    return points

a = np.arange(4)  # fake data
print(cartesian_coord(*6*[a]))

这使

array([[0, 0, 0, 0, 0, 0],
   [0, 0, 0, 0, 0, 1],
   [0, 0, 0, 0, 0, 2],
   ..., 
   [3, 3, 3, 3, 3, 1],
   [3, 3, 3, 3, 3, 2],
   [3, 3, 3, 3, 3, 3]])

You can do something like this

import numpy as np

def cartesian_coord(*arrays):
    grid = np.meshgrid(*arrays)        
    coord_list = [entry.ravel() for entry in grid]
    points = np.vstack(coord_list).T
    return points

a = np.arange(4)  # fake data
print(cartesian_coord(*6*[a]))

which gives

array([[0, 0, 0, 0, 0, 0],
   [0, 0, 0, 0, 0, 1],
   [0, 0, 0, 0, 0, 2],
   ..., 
   [3, 3, 3, 3, 3, 1],
   [3, 3, 3, 3, 3, 2],
   [3, 3, 3, 3, 3, 3]])

回答 4

以下 numpy 实现的速度应约为上面给定答案的 2 倍:

def cartesian2(arrays):
    arrays = [np.asarray(a) for a in arrays]
    shape = tuple(len(x) for x in arrays)  # 各维长度组成的元组

    ix = np.indices(shape, dtype=int)
    ix = ix.reshape(len(arrays), -1).T

    for n, arr in enumerate(arrays):
        ix[:, n] = arr[ix[:, n]]

    return ix

The following numpy implementation should be approx. 2x the speed of the given answer:

def cartesian2(arrays):
    arrays = [np.asarray(a) for a in arrays]
    shape = tuple(len(x) for x in arrays)  # tuple of per-axis lengths

    ix = np.indices(shape, dtype=int)
    ix = ix.reshape(len(arrays), -1).T

    for n, arr in enumerate(arrays):
        ix[:, n] = arr[ix[:, n]]

    return ix
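
A quick usage check I am adding, assuming the cartesian2 definition above; it produces the same 12x3 result, in the same row order, as the docstring example of the previous answer:

print(cartesian2([[1, 2, 3], [4, 5], [6, 7]]))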

回答 5

看起来您是想用一个网格来求值您的函数,这种情况下可以使用 numpy.ogrid(开放式)或 numpy.mgrid(完整式):

import numpy
my_grid = numpy.mgrid[[slice(0,1,0.1)]*6]

It looks like you want a grid to evaluate your function, in which case you can use numpy.ogrid (open) or numpy.mgrid (fleshed out):

import numpy
my_grid = numpy.mgrid[[slice(0,1,0.1)]*6]
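
To turn that grid into a flat list of parameter points (a follow-up sketch I am adding, not part of the original answer):

import numpy as np

my_grid = np.mgrid[[slice(0, 1, 0.1)] * 6]  # shape (6, 10, 10, 10, 10, 10, 10)
points = my_grid.reshape(6, -1).T           # shape (1000000, 6): one row per point
print(points.shape)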

回答 6

您可以使用 np.array(list(itertools.product(a, b)))

you can use np.array(list(itertools.product(a, b))) (the list() is needed because np.array cannot consume a bare iterator)


回答 7

这是又一种方式:使用纯 NumPy,没有递归,没有列表推导式,也没有显式的 for 循环。它比原始答案慢约 20%,基于 np.meshgrid。

def cartesian(*arrays):
    mesh = np.meshgrid(*arrays)  # standard numpy meshgrid
    dim = len(mesh)  # number of dimensions
    elements = mesh[0].size  # number of elements, any index will do
    flat = np.concatenate(mesh).ravel()  # flatten the whole meshgrid
    reshape = np.reshape(flat, (dim, elements)).T  # reshape and transpose
    return reshape

例如,

x = np.arange(3)
a = cartesian(x, x, x, x, x)
print(a)

[[0 0 0 0 0]
 [0 0 0 0 1]
 [0 0 0 0 2]
 ..., 
 [2 2 2 2 0]
 [2 2 2 2 1]
 [2 2 2 2 2]]

Here’s yet another way, using pure NumPy, no recursion, no list comprehension, and no explicit for loops. It’s about 20% slower than the original answer, and it’s based on np.meshgrid.

def cartesian(*arrays):
    mesh = np.meshgrid(*arrays)  # standard numpy meshgrid
    dim = len(mesh)  # number of dimensions
    elements = mesh[0].size  # number of elements, any index will do
    flat = np.concatenate(mesh).ravel()  # flatten the whole meshgrid
    reshape = np.reshape(flat, (dim, elements)).T  # reshape and transpose
    return reshape

For example,

x = np.arange(3)
a = cartesian(x, x, x, x, x)
print(a)

gives

[[0 0 0 0 0]
 [0 0 0 0 1]
 [0 0 0 0 2]
 ..., 
 [2 2 2 2 0]
 [2 2 2 2 1]
 [2 2 2 2 2]]

回答 8

要用纯 numpy 实现一维数组(或扁平 python 列表)的笛卡尔积,只需使用 meshgrid(),用 transpose() 滚动轴,再 reshape 成想要的输出形状:

 from numpy import arange, meshgrid, roll, transpose

 def cartprod(*arrays):
     N = len(arrays)
     return transpose(meshgrid(*arrays, indexing='ij'), 
                      roll(arange(N + 1), -1)).reshape(-1, N)

请注意,这遵循最后一个轴变化最快的约定(“C 风格”或“行主序”)。

In [88]: cartprod([1,2,3], [4,8], [100, 200, 300, 400], [-5, -4])
Out[88]: 
array([[  1,   4, 100,  -5],
       [  1,   4, 100,  -4],
       [  1,   4, 200,  -5],
       [  1,   4, 200,  -4],
       [  1,   4, 300,  -5],
       [  1,   4, 300,  -4],
       [  1,   4, 400,  -5],
       [  1,   4, 400,  -4],
       [  1,   8, 100,  -5],
       [  1,   8, 100,  -4],
       [  1,   8, 200,  -5],
       [  1,   8, 200,  -4],
       [  1,   8, 300,  -5],
       [  1,   8, 300,  -4],
       [  1,   8, 400,  -5],
       [  1,   8, 400,  -4],
       [  2,   4, 100,  -5],
       [  2,   4, 100,  -4],
       [  2,   4, 200,  -5],
       [  2,   4, 200,  -4],
       [  2,   4, 300,  -5],
       [  2,   4, 300,  -4],
       [  2,   4, 400,  -5],
       [  2,   4, 400,  -4],
       [  2,   8, 100,  -5],
       [  2,   8, 100,  -4],
       [  2,   8, 200,  -5],
       [  2,   8, 200,  -4],
       [  2,   8, 300,  -5],
       [  2,   8, 300,  -4],
       [  2,   8, 400,  -5],
       [  2,   8, 400,  -4],
       [  3,   4, 100,  -5],
       [  3,   4, 100,  -4],
       [  3,   4, 200,  -5],
       [  3,   4, 200,  -4],
       [  3,   4, 300,  -5],
       [  3,   4, 300,  -4],
       [  3,   4, 400,  -5],
       [  3,   4, 400,  -4],
       [  3,   8, 100,  -5],
       [  3,   8, 100,  -4],
       [  3,   8, 200,  -5],
       [  3,   8, 200,  -4],
       [  3,   8, 300,  -5],
       [  3,   8, 300,  -4],
       [  3,   8, 400,  -5],
       [  3,   8, 400,  -4]])

如果希望第一个轴变化最快(“FORTRAN 风格”或“列主序”),只需像这样更改 reshape() 的 order 参数:reshape((-1, N), order='F')

For a pure numpy implementation of the Cartesian product of 1D arrays (or flat python lists), just use meshgrid(), roll the axes with transpose(), and reshape to the desired output:

 from numpy import arange, meshgrid, roll, transpose

 def cartprod(*arrays):
     N = len(arrays)
     return transpose(meshgrid(*arrays, indexing='ij'), 
                      roll(arange(N + 1), -1)).reshape(-1, N)

Note this has the convention of last axis changing fastest (“C style” or “row-major”).

In [88]: cartprod([1,2,3], [4,8], [100, 200, 300, 400], [-5, -4])
Out[88]: 
array([[  1,   4, 100,  -5],
       [  1,   4, 100,  -4],
       [  1,   4, 200,  -5],
       [  1,   4, 200,  -4],
       [  1,   4, 300,  -5],
       [  1,   4, 300,  -4],
       [  1,   4, 400,  -5],
       [  1,   4, 400,  -4],
       [  1,   8, 100,  -5],
       [  1,   8, 100,  -4],
       [  1,   8, 200,  -5],
       [  1,   8, 200,  -4],
       [  1,   8, 300,  -5],
       [  1,   8, 300,  -4],
       [  1,   8, 400,  -5],
       [  1,   8, 400,  -4],
       [  2,   4, 100,  -5],
       [  2,   4, 100,  -4],
       [  2,   4, 200,  -5],
       [  2,   4, 200,  -4],
       [  2,   4, 300,  -5],
       [  2,   4, 300,  -4],
       [  2,   4, 400,  -5],
       [  2,   4, 400,  -4],
       [  2,   8, 100,  -5],
       [  2,   8, 100,  -4],
       [  2,   8, 200,  -5],
       [  2,   8, 200,  -4],
       [  2,   8, 300,  -5],
       [  2,   8, 300,  -4],
       [  2,   8, 400,  -5],
       [  2,   8, 400,  -4],
       [  3,   4, 100,  -5],
       [  3,   4, 100,  -4],
       [  3,   4, 200,  -5],
       [  3,   4, 200,  -4],
       [  3,   4, 300,  -5],
       [  3,   4, 300,  -4],
       [  3,   4, 400,  -5],
       [  3,   4, 400,  -4],
       [  3,   8, 100,  -5],
       [  3,   8, 100,  -4],
       [  3,   8, 200,  -5],
       [  3,   8, 200,  -4],
       [  3,   8, 300,  -5],
       [  3,   8, 300,  -4],
       [  3,   8, 400,  -5],
       [  3,   8, 400,  -4]])

If you want to change the first axis fastest (“FORTRAN style” or “column-major”), just change the order parameter of reshape() like this: reshape((-1, N), order='F')


回答 9

Pandas 的 merge 为这个问题提供了一个朴素而快速的解决方案:

import numpy as np
import pandas as pd

# given the lists
x, y, z = [1, 2, 3], [4, 5], [6, 7]

# get dfs with the same, constant index
x = pd.DataFrame({'x': x}, index=np.repeat(0, len(x)))
y = pd.DataFrame({'y': y}, index=np.repeat(0, len(y)))
z = pd.DataFrame({'z': z}, index=np.repeat(0, len(z)))

# get all combinations stored in a new df (a cross join via the shared index)
df = pd.merge(x, pd.merge(y, z, left_index=True, right_index=True),
              left_index=True, right_index=True)

Pandas merge offers a naive, fast solution to the problem:

import numpy as np
import pandas as pd

# given the lists
x, y, z = [1, 2, 3], [4, 5], [6, 7]

# get dfs with the same, constant index
x = pd.DataFrame({'x': x}, index=np.repeat(0, len(x)))
y = pd.DataFrame({'y': y}, index=np.repeat(0, len(y)))
z = pd.DataFrame({'z': z}, index=np.repeat(0, len(z)))

# get all combinations stored in a new df (a cross join via the shared index)
df = pd.merge(x, pd.merge(y, z, left_index=True, right_index=True),
              left_index=True, right_index=True)

如何在Python中将带有逗号分隔的项目的字符串转换为列表?

问题:如何在Python中将带有逗号分隔的项目的字符串转换为列表?

如何将字符串转换为列表?

假设字符串是 text = "a,b,c"。转换后希望得到 text == ['a', 'b', 'c'],并且 text[0] == 'a'、text[1] == 'b'?

How do you convert a string into a list?

Say the string is like text = "a,b,c". After the conversion, text == ['a', 'b', 'c'] and hopefully text[0] == 'a', text[1] == 'b'?


回答 0

像这样:

>>> text = 'a,b,c'
>>> text = text.split(',')
>>> text
[ 'a', 'b', 'c' ]

另外,如果您确信字符串是安全的,也可以使用 eval()(注意:这要求各项本身是合法的 Python 字面量,例如 "1,2,3";对于 "a,b,c" 会引发 NameError):

>>> text = 'a,b,c'
>>> text = eval('[' + text + ']')

Like this:

>>> text = 'a,b,c'
>>> text = text.split(',')
>>> text
[ 'a', 'b', 'c' ]

Alternatively, you can use eval() if you trust the string to be safe (note that the items must themselves be valid Python literals, e.g. "1,2,3"; the literal "a,b,c" above would raise a NameError):

>>> text = 'a,b,c'
>>> text = eval('[' + text + ']')

回答 1

只是补充现有的答案:希望将来您会遇到更多类似的情况:

>>> word = 'abc'
>>> L = list(word)
>>> L
['a', 'b', 'c']
>>> ''.join(L)
'abc'

但对于你现在要处理的情况,请采用 @Cameron 的回答。

>>> word = 'a,b,c'
>>> L = word.split(',')
>>> L
['a', 'b', 'c']
>>> ','.join(L)
'a,b,c'

Just to add on to the existing answers: hopefully, you’ll encounter something more like this in the future:

>>> word = 'abc'
>>> L = list(word)
>>> L
['a', 'b', 'c']
>>> ''.join(L)
'abc'

But for what you’re dealing with right now, go with @Cameron‘s answer.

>>> word = 'a,b,c'
>>> L = word.split(',')
>>> L
['a', 'b', 'c']
>>> ','.join(L)
'a,b,c'

回答 2

以下Python代码会将您的字符串转换为字符串列表:

import ast
teststr = "['aaa','bbb','ccc']"
testarray = ast.literal_eval(teststr)

The following Python code will turn your string into a list of strings:

import ast
teststr = "['aaa','bbb','ccc']"
testarray = ast.literal_eval(teststr)

回答 3

我不认为你需要

在python中,您几乎不需要将字符串转换为列表,因为字符串和列表非常相似

改变类型

如果您确实有一个应该是字符数组的字符串,请执行以下操作:

In [1]: x = "foobar"
In [2]: list(x)
Out[2]: ['f', 'o', 'o', 'b', 'a', 'r']

不更改类型

请注意,字符串非常类似于python中的列表

字符串具有访问器,例如列表

In [3]: x[0]
Out[3]: 'f'

字符串是可迭代的,例如列表

In [4]: for i in range(len(x)):
...:     print x[i]
...:     
f
o
o
b
a
r

TLDR

字符串是列表。几乎。

I don’t think you need to

In python you seldom need to convert a string to a list, because strings and lists are very similar

Changing the type

If you really have a string which should be a character array, do this:

In [1]: x = "foobar"
In [2]: list(x)
Out[2]: ['f', 'o', 'o', 'b', 'a', 'r']

Not changing the type

Note that Strings are very much like lists in python

Strings have accessors, like lists

In [3]: x[0]
Out[3]: 'f'

Strings are iterable, like lists

In [4]: for i in range(len(x)):
...:     print x[i]
...:     
f
o
o
b
a
r

TLDR

Strings are lists. Almost.


回答 4

如果要按空格分割,可以使用.split()

a = 'mary had a little lamb'
z = a.split()
print z

输出:

['mary', 'had', 'a', 'little', 'lamb'] 

In case you want to split by spaces, you can just use .split():

a = 'mary had a little lamb'
z = a.split()
print z

Output:

['mary', 'had', 'a', 'little', 'lamb'] 

回答 5

如果您实际上想要数组:

>>> from array import array
>>> text = "a,b,c"
>>> text = text.replace(',', '')
>>> myarray = array('c', text)
>>> myarray
array('c', 'abc')
>>> myarray[0]
'a'
>>> myarray[1]
'b'

如果您不需要数组,只想按索引访问字符,请记住字符串是可迭代的,就像列表一样,只不过它是不可变的:

>>> text = "a,b,c"
>>> text = text.replace(',', '')
>>> text[0]
'a'

If you actually want arrays:

>>> from array import array
>>> text = "a,b,c"
>>> text = text.replace(',', '')
>>> myarray = array('c', text)
>>> myarray
array('c', 'abc')
>>> myarray[0]
'a'
>>> myarray[1]
'b'

If you do not need arrays, and only want to look at your characters by index, remember a string is an iterable, just like a list, except that it is immutable:

>>> text = "a,b,c"
>>> text = text.replace(',', '')
>>> text[0]
'a'

回答 6

m = '[[1,2,3],[4,5,6],[7,8,9]]'

m= eval(m.split()[0])

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
m = '[[1,2,3],[4,5,6],[7,8,9]]'

m= eval(m.split()[0])

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

回答 7

所有答案都很好。还有另一种方法,即列表推导式,参见下面的解决方案。

u = "UUUDDD"

lst = [x for x in u]

对于逗号分隔的列表,请执行以下操作

u = "U,U,U,D,D,D"

lst = [x for x in u.split(',')]

All the answers are good. There is another way of doing it, which is list comprehension; see the solution below.

u = "UUUDDD"

lst = [x for x in u]

for comma separated list do the following

u = "U,U,U,D,D,D"

lst = [x for x in u.split(',')]

回答 8

我通常使用:

l = [ word.strip() for word in text.split(',') ]

strip 会去掉单词两侧的空格。

I usually use:

l = [ word.strip() for word in text.split(',') ]

the strip removes spaces around the words.


回答 9

要转换形如 a="[[1, 3], [2, -6]]" 的字符串,我写了下面这段尚未优化的代码:

matrixAr = []
mystring = "[[1, 3], [2, -4], [19, -15]]"
b=mystring.replace("[[","").replace("]]","") # to remove head [[ and tail ]]
for line in b.split('], ['):
    row =list(map(int,line.split(','))) #map = to convert the number from string (some has also space ) to integer
    matrixAr.append(row)
print matrixAr

To convert a string having the form a="[[1, 3], [2, -6]]", I wrote the following (not yet optimized) code:

matrixAr = []
mystring = "[[1, 3], [2, -4], [19, -15]]"
b=mystring.replace("[[","").replace("]]","") # to remove head [[ and tail ]]
for line in b.split('], ['):
    row =list(map(int,line.split(','))) #map = to convert the number from string (some has also space ) to integer
    matrixAr.append(row)
print matrixAr

回答 10

# to strip `,` and `.` from a string ->

>>> 'a,b,c.'.translate(None, ',.')
'abc'

您应该对字符串使用内置的 translate 方法。

在 Python shell 中键入 help('abc'.translate) 以获取更多信息。

# to strip `,` and `.` from a string ->

>>> 'a,b,c.'.translate(None, ',.')
'abc'

You should use the built-in translate method for strings.

Type help('abc'.translate) at the Python shell for more info.
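
The two-argument form above is Python 2 only; the Python 3 equivalent (a note I am adding) goes through str.maketrans:

# Python 3 equivalent of the two-argument translate used above
print('a,b,c.'.translate(str.maketrans('', '', ',.')))  # abc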


回答 11

使用函数式 Python:

text=filter(lambda x:x!=',',map(str,text))

Using functional Python:

text=filter(lambda x:x!=',',map(str,text))

回答 12

例子1

>>> email= "myemailid@gmail.com"
>>> email.split()
#OUTPUT
["myemailid@gmail.com"]

例子2

>>> email= "myemailid@gmail.com, someonsemailid@gmail.com"
>>> email.split(',')
#OUTPUT
["myemailid@gmail.com", "someonsemailid@gmail.com"]

Example 1

>>> email= "myemailid@gmail.com"
>>> email.split()
#OUTPUT
["myemailid@gmail.com"]

Example 2

>>> email= "myemailid@gmail.com, someonsemailid@gmail.com"
>>> email.split(',')
#OUTPUT
["myemailid@gmail.com", "someonsemailid@gmail.com"]