Tag Archives: numpy

How to take column-slices of a dataframe in pandas

Question: How to take column-slices of a dataframe in pandas

I load some machine learning data from a CSV file. The first 2 columns are observations and the remaining columns are features.

Currently, I do the following:

data = pandas.read_csv('mydata.csv')

which gives something like:

data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))

I’d like to slice this dataframe into two dataframes: one containing the columns a and b, and one containing the columns c, d and e.

It is not possible to write something like

observations = data[:'c']
features = data['c':]

I’m not sure what the best method is. Do I need a pd.Panel?

By the way, I find dataframe indexing pretty inconsistent: data['a'] is permitted, but data[0] is not. On the other side, data['a':] is not permitted but data[0:] is. Is there a practical reason for this? This is really confusing if columns are indexed by Int, given that data[0] != data[0:1]
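
A quick illustration of the inconsistency described above, assuming the example dataframe with columns 'a' through 'e' and the default integer row index:

import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))

data['a']    # a plain label inside [] selects a *column*
data[0:2]    # a plain slice inside [] selects *rows* 0 and 1
# data[0]    # raises KeyError: an integer inside [] is looked up against the column labels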


Answer 0

2017 Answer – pandas 0.20: .ix is deprecated. Use .loc

See the deprecation in the docs

.loc uses label based indexing to select both rows and columns. The labels being the values of the index or the columns. Slicing with .loc includes the last element.

Let’s assume we have a DataFrame with the following columns:
foo, bar, quz, ant, cat, sat, dat.

# selects all rows and all columns beginning at 'foo' up to and including 'sat'
df.loc[:, 'foo':'sat']
# foo bar quz ant cat sat

.loc accepts the same slice notation that Python lists do, for both rows and columns. Slice notation is start:stop:step.

# slice from 'foo' to 'cat' by every 2nd column
df.loc[:, 'foo':'cat':2]
# foo quz cat

# slice from the beginning to 'bar'
df.loc[:, :'bar']
# foo bar

# slice from 'quz' to the end by 3
df.loc[:, 'quz'::3]
# quz sat

# attempt from 'sat' to 'bar'
df.loc[:, 'sat':'bar']
# no columns returned

# slice from 'sat' to 'bar'
df.loc[:, 'sat':'bar':-1]
# sat cat ant quz bar

# slice notation is syntactic sugar for the slice function
# slice from 'quz' to the end by 2 with slice function
df.loc[:, slice('quz',None, 2)]
# quz cat dat

# select specific columns with a list
# select columns foo, bar and dat
df.loc[:, ['foo','bar','dat']]
# foo bar dat

You can slice by rows and columns. For instance, if you have 5 rows with labels v, w, x, y, z

# slice from 'w' to 'y' and 'foo' to 'ant' by 3
df.loc['w':'y', 'foo':'ant':3]
#    foo ant
# w
# x
# y

Answer 1

Note: .ix has been deprecated since Pandas v0.20. You should instead use .loc or .iloc, as appropriate.

The DataFrame.ix index is what you want to be accessing. It’s a little confusing (I agree that Pandas indexing is perplexing at times!), but the following seems to do what you want:

>>> df = DataFrame(np.random.rand(4,5), columns = list('abcde'))
>>> df.ix[:,'b':]
      b         c         d         e
0  0.418762  0.042369  0.869203  0.972314
1  0.991058  0.510228  0.594784  0.534366
2  0.407472  0.259811  0.396664  0.894202
3  0.726168  0.139531  0.324932  0.906575

where .ix[row slice, column slice] is what is being interpreted. More on Pandas indexing here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-advanced


Answer 2

Let’s use the titanic dataset from the seaborn package as an example.

# Load dataset (pip install seaborn)
>> import seaborn.apionly as sns
>> titanic = sns.load_dataset('titanic')

using the column names

>> titanic.loc[:,['sex','age','fare']]

using the column indices

>> titanic.iloc[:,[2,3,6]]

using ix (for pandas versions older than 0.20)

>> titanic.ix[:,['sex','age','fare']]

or

>> titanic.ix[:,[2,3,6]]

using the reindex method

>> titanic.reindex(columns=['sex','age','fare'])

Answer 3

Also, given a DataFrame

data

as in your example, if you would like to extract columns a and d only (i.e. the 1st and the 4th columns), the iloc method of the pandas dataframe is what you need, and it can be used very effectively. All you need to know are the indices of the columns you would like to extract. For example:

>>> data.iloc[:,[0,3]]

will give you

          a         d
0  0.883283  0.100975
1  0.614313  0.221731
2  0.438963  0.224361
3  0.466078  0.703347
4  0.955285  0.114033
5  0.268443  0.416996
6  0.613241  0.327548
7  0.370784  0.359159
8  0.692708  0.659410
9  0.806624  0.875476

Answer 4

You can slice along the columns of a DataFrame by referring to the names of each column in a list, like so:

data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))
data_ab = data[list('ab')]
data_cde = data[list('cde')]

Answer 5

And if you came here looking for slicing two ranges of columns and combining them together (like me), you can do something like

op = df[list(df.columns[0:899]) + list(df.columns[3593:])]
print(op)

This will create a new dataframe with the first 899 columns and all columns from index 3593 onward (assuming you have some 4000 columns in your data set).


Answer 6

Here’s how you could use different methods to do selective column slicing, including label-based, index-based, and range-based column slicing.

In [37]: import pandas as pd    
In [38]: import numpy as np
In [43]: df = pd.DataFrame(np.random.rand(4,7), columns = list('abcdefg'))

In [44]: df
Out[44]: 
          a         b         c         d         e         f         g
0  0.409038  0.745497  0.890767  0.945890  0.014655  0.458070  0.786633
1  0.570642  0.181552  0.794599  0.036340  0.907011  0.655237  0.735268
2  0.568440  0.501638  0.186635  0.441445  0.703312  0.187447  0.604305
3  0.679125  0.642817  0.697628  0.391686  0.698381  0.936899  0.101806

In [45]: df.loc[:, ["a", "b", "c"]] ## label based selective column slicing 
Out[45]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

In [46]: df.loc[:, "a":"c"] ## label based column ranges slicing 
Out[46]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

In [47]: df.iloc[:, 0:3] ## index based column ranges slicing 
Out[47]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

### with 2 different column ranges, index based slicing: 
In [49]: df[df.columns[0:1].tolist() + df.columns[1:3].tolist()]
Out[49]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

Answer 7

These are equivalent:

 >>> print(df2.loc[140:160,['Relevance','Title']])
 >>> print(df2.ix[140:160,[3,7]])

Answer 8

If the dataframe looks like this:

group         name      count
fruit         apple     90
fruit         banana    150
fruit         orange    130
vegetable     broccoli  80
vegetable     kale      70
vegetable     lettuce   125

and the desired output is:

   group    name  count
0  fruit   apple     90
1  fruit  banana    150
2  fruit  orange    130

then you can use the logical operator np.logical_not:

df[np.logical_not(df['group'] == 'vegetable')]

More about the numpy logic functions here:

https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.logic.html

Other logical operators:

  1. logical_and(x1, x2, /[, out, where, …]) Compute the truth value of x1 AND x2 element-wise.

  2. logical_or(x1, x2, /[, out, where, casting, …]) Compute the truth value of x1 OR x2 element-wise.

  3. logical_not(x, /[, out, where, casting, …]) Compute the truth value of NOT x element-wise.
  4. logical_xor(x1, x2, /[, out, where, ..]) Compute the truth value of x1 XOR x2, element-wise.

Answer 9

Another way to get a subset of columns from your DataFrame, assuming you want all the rows, would be to do:
data[['a','b']] and data[['c','d','e']]
If you want to use numerical column indexes you can do:
data[data.columns[:2]] and data[data.columns[2:]]


Saving a Numpy array as an image

Question: Saving a Numpy array as an image

I have a matrix in the type of a Numpy array. How would I write it to disk as an image? Any format works (png, jpeg, bmp…). One important constraint is that PIL is not present.


Answer 0

You can use PyPNG. It’s a pure Python (no dependencies) open source PNG encoder/decoder and it supports writing NumPy arrays as images.
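
For example, a minimal sketch assuming pypng’s from_array helper (the array contents and the filename here are made up):

import numpy as np
import png  # pip install pypng

# 64 rows of an 8-bit grayscale gradient
arr = np.tile(np.arange(256, dtype=np.uint8), (64, 1))

# 'L' means grayscale; 'RGB'/'RGBA' modes take rows of interleaved channel values
png.from_array(arr, 'L').save('gradient.png')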


Answer 1

This uses PIL, but maybe some might find it useful:

import scipy.misc
scipy.misc.imsave('outfile.jpg', image_array)

EDIT: The current scipy version started to normalize all images so that min(data) becomes black and max(data) becomes white. This is unwanted if the data should be exact grey levels or exact RGB channels. The solution:

import scipy.misc
scipy.misc.toimage(image_array, cmin=0.0, cmax=...).save('outfile.jpg')

Answer 2

An answer using PIL (just in case it’s useful).

Given a numpy array “A”:

from PIL import Image
im = Image.fromarray(A)
im.save("your_file.jpeg")

You can replace “jpeg” with almost any format you want. More details about the formats here.


Answer 3

With matplotlib:

import matplotlib

matplotlib.image.imsave('name.png', array)

Works with matplotlib 1.3.1; I don’t know about lower versions. From the docstring:

Arguments:
  *fname*:
    A string containing a path to a filename, or a Python file-like object.
    If *format* is *None* and *fname* is a string, the output
    format is deduced from the extension of the filename.
  *arr*:
    An MxN (luminance), MxNx3 (RGB) or MxNx4 (RGBA) array.



Answer 4

Pure Python (2 & 3), a snippet without 3rd party dependencies.

This function writes compressed, true-color (4 bytes per pixel) RGBA PNGs.

def write_png(buf, width, height):
    """ buf: must be bytes or a bytearray in Python3.x,
        a regular string in Python2.x.
    """
    import zlib, struct

    # reverse the vertical line order and add null bytes at the start
    width_byte_4 = width * 4
    raw_data = b''.join(
        b'\x00' + buf[span:span + width_byte_4]
        for span in range((height - 1) * width_byte_4, -1, - width_byte_4)
    )

    def png_pack(png_tag, data):
        chunk_head = png_tag + data
        return (struct.pack("!I", len(data)) +
                chunk_head +
                struct.pack("!I", 0xFFFFFFFF & zlib.crc32(chunk_head)))

    return b''.join([
        b'\x89PNG\r\n\x1a\n',
        png_pack(b'IHDR', struct.pack("!2I5B", width, height, 8, 6, 0, 0, 0)),
        png_pack(b'IDAT', zlib.compress(raw_data, 9)),
        png_pack(b'IEND', b'')])

… The data should be written directly to a file opened as binary, as in:

data = write_png(buf, 64, 64)
with open("my_image.png", 'wb') as fh:
    fh.write(data)


Answer 5

There’s opencv for python (documentation here).

import cv2
import numpy as np

img = ... # Your image as a numpy array 

cv2.imwrite("filename.png", img)

Useful if you need to do more processing than just saving.


Answer 6

If you have matplotlib, you can do:

import matplotlib.pyplot as plt
plt.imshow(matrix) #Needs to be in row,col order
plt.savefig(filename)

This will save the plot (not the image itself).


Answer 7

You can use the skimage library in Python.

Example:

from skimage.io import imsave
imsave('Path_to_your_folder/File_name.jpg',your_array)

Answer 8

scipy.misc gives deprecation warning about imsave function and suggests usage of imageio instead.

import imageio
imageio.imwrite('image_name.png', img)

Answer 9

Addendum to @ideasman42’s answer:

def saveAsPNG(array, filename):
    import struct
    if any(len(row) != len(array[0]) for row in array):
        raise ValueError("Array should have elements of equal size")

                                #First row becomes top row of image.
    flat = []
    for row in reversed(array):
        flat.extend(row)
                                 #Big-endian, unsigned 32-bit integer.
    buf = b''.join([struct.pack('>I', ((0xffFFff & i32)<<8)|(i32>>24) )
                    for i32 in flat])   #Rotate from ARGB to RGBA.

    data = write_png(buf, len(array[0]), len(array))
    with open(filename, 'wb') as f:
        f.write(data)

So you can do:

saveAsPNG([[0xffFF0000, 0xffFFFF00],
           [0xff00aa77, 0xff333333]], 'test_grid.png')

Producing test_grid.png:

Grid of red, yellow, dark-aqua, grey

(Transparency also works, by reducing the high byte from 0xff.)


Answer 10

For those looking for a direct fully working example:

from PIL import Image
import numpy

w,h = 200,100
img = numpy.zeros((h,w,3),dtype=numpy.uint8) # has to be unsigned bytes

img[:] = (0,0,255) # fill blue

x,y = 40,20
img[y:y+30, x:x+50] = (255,0,0) # 50x30 red box

Image.fromarray(img).convert("RGB").save("art.png") # the .convert("RGB") is redundant here; fromarray already yields RGB

Also, if you want high quality JPEGs:
.save(file, subsampling=0, quality=100)


Answer 11

matplotlib svn has a new function to save images as just an image, with no axes etc. It’s a very simple function to backport too, if you don’t want to install svn (copied straight from image.py in matplotlib svn; the docstring was removed for brevity):

def imsave(fname, arr, vmin=None, vmax=None, cmap=None, format=None, origin=None):
    from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
    from matplotlib.figure import Figure

    fig = Figure(figsize=arr.shape[::-1], dpi=1, frameon=False)
    canvas = FigureCanvas(fig)
    fig.figimage(arr, cmap=cmap, vmin=vmin, vmax=vmax, origin=origin)
    fig.savefig(fname, dpi=1, format=format)

Answer 12

The world probably doesn’t need yet another package for writing a numpy array to a PNG file, but for those who can’t get enough, I recently put up numpngw on github:

https://github.com/WarrenWeckesser/numpngw

and on pypi: https://pypi.python.org/pypi/numpngw/

The only external dependency is numpy.

Here’s the first example from the examples directory of the repository. The essential line is simply

write_png('example1.png', img)

where img is a numpy array. All the code before that line is import statements and code to create img.

import numpy as np
from numpngw import write_png


# Example 1
#
# Create an 8-bit RGB image.

img = np.zeros((80, 128, 3), dtype=np.uint8)

grad = np.linspace(0, 255, img.shape[1])

img[:16, :, :] = 127
img[16:32, :, 0] = grad
img[32:48, :, 1] = grad[::-1]
img[48:64, :, 2] = grad
img[64:, :, :] = 127

write_png('example1.png', img)

Here’s the PNG file that it creates:

example1.png


Answer 13

Assuming you want a grayscale image:

from PIL import Image  # import needed for this snippet

im = Image.new('L', (width, height))
im.putdata(an_array.flatten().tolist())
im.save("image.tiff")

Answer 14

Imageio is a Python library that provides an easy interface to read and write a wide range of image data, including animated images, video, volumetric data, and scientific formats. It is cross-platform, runs on Python 2.7 and 3.4+, and is easy to install.

This is example for grayscale image:

import numpy as np
import imageio

# data is numpy array with grayscale value for each pixel.
data = np.array([70,80,82,72,58,58,60,63,54,58,60,48,89,115,121,119])

# 16 pixels can be converted into square of 4x4 or 2x8 or 8x2
data = data.reshape((4, 4)).astype('uint8')

# save image
imageio.imwrite('pic.jpg', data)

Answer 15

If you happen to use [Py]Qt already, you may be interested in qimage2ndarray. Starting with version 1.4 (just released), PySide is supported as well, and there will be a tiny imsave(filename, array) function similar to scipy’s, but using Qt instead of PIL. With 1.3, just use something like the following:

qImage = array2qimage(image, normalize = False) # create QImage from ndarray
success = qImage.save(filename) # use Qt's image IO functions for saving PNG/JPG/..

(Another advantage of 1.4 is that it is a pure python solution, which makes this even more lightweight.)


Answer 16

If you are working in the Spyder Python environment, it cannot get any easier: just right-click the array in the Variable Explorer and choose the Show Image option.

This will ask you to save the image to disk, usually in PNG format.

The PIL library is not needed in this case.


Answer 17

Use cv2.imwrite.

import cv2
assert mat.shape[2] == 1 or mat.shape[2] == 3, 'the third dim should be channel'
cv2.imwrite(path, mat) # note: the array should be laid out as height x width x channel

Answer 18

For saving a numpy array as an image, you have several choices:

1) OpenCV (arguably the best of these):

 import cv2   
 cv2.imwrite('file name with extension(like .jpg)', numpy_array)

2) Matplotlib

  from matplotlib import pyplot as plt
  plt.imsave('file name with extension(like .jpg)', numpy_array)

3) PIL

  from PIL import Image
  image = Image.fromarray(numpy_array)
  image.save('file name with extension(like .jpg)')

4) …


Pandas read_csv: low_memory and dtype options

Question: Pandas read_csv: low_memory and dtype options

When calling

df = pd.read_csv('somefile.csv')

I get:

/Users/josh/anaconda/envs/py27/lib/python2.7/site-packages/pandas/io/parsers.py:1130: DtypeWarning: Columns (4,5,7,16) have mixed types. Specify dtype option on import or set low_memory=False.

Why is the dtype option related to low_memory, and why would making it False help with this problem?


Answer 0

The deprecated low_memory option

The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently [source].

The reason you get this low_memory warning is because guessing dtypes for each column is very memory demanding. Pandas tries to determine what dtype to set by analyzing the data in each column.

Dtype Guessing (very bad)

Pandas can only determine what dtype a column should have once the whole file is read. This means nothing can really be parsed before the whole file is read unless you risk having to change the dtype of that column when you read the last value.

Consider the example of one file which has a column called user_id. It contains 10 million rows where the user_id is always numbers. Since pandas cannot know it is only numbers, it will probably keep it as the original strings until it has read the whole file.

Specifying dtypes (should always be done)

adding

dtype={'user_id': int}

to the pd.read_csv() call will let pandas know, when it starts reading the file, that this column contains only integers.

Also worth noting is that if the last line in the file would have "foobar" written in the user_id column, the loading would crash if the above dtype was specified.

Example of broken data that breaks when dtypes are defined

import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO


csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""
sio = StringIO(csvdata)
pd.read_csv(sio, dtype={"user_id": int, "username": "string"})

ValueError: invalid literal for long() with base 10: 'foobar'
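
If the occasional bad value like 'foobar' is expected, a common alternative (an assumption on my part, not something the example above uses) is to read the column as strings and coerce afterwards, so that unparseable entries become NaN instead of crashing the load:

import pandas as pd
from io import StringIO

csvdata = """user_id,username
1,Alice
3,Bob
foobar,Caesar"""

df = pd.read_csv(StringIO(csvdata), dtype={"user_id": str, "username": "string"})
df["user_id"] = pd.to_numeric(df["user_id"], errors="coerce")  # 'foobar' -> NaN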

dtypes are typically a numpy thing, read more about them here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html

What dtypes exists?

We have access to numpy dtypes: float, int, bool, timedelta64[ns] and datetime64[ns]. Note that the numpy date/time dtypes are not time zone aware.

Pandas extends this set of dtypes with its own:

‘datetime64[ns, <tz>]’, which is a time zone aware timestamp.

‘category’, which is essentially an enum (strings represented by integer keys, to save space).

‘period[<freq>]’, not to be confused with a timedelta; these objects are anchored to specific time periods.

‘Sparse’, ‘Sparse[int]’, ‘Sparse[float]’ are for sparse data, or “data that has a lot of holes in it”; instead of saving the NaN or None in the dataframe it omits those objects, saving space.

‘Interval’ is a topic of its own but its main use is for indexing. See more here

‘Int8’, ‘Int16’, ‘Int32’, ‘Int64’, ‘UInt8’, ‘UInt16’, ‘UInt32’, ‘UInt64’ are all pandas specific integers that are nullable, unlike the numpy variant.

‘string’ is a specific dtype for working with string data and gives access to the .str attribute on the series.

‘boolean’ is like the numpy ‘bool’ but it also supports missing data.

Read the complete reference here:

Pandas dtype reference
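
As a small illustration of the nullable pandas integers mentioned in the list above (a sketch assuming a recent pandas, 1.0+, where missing values display as <NA>):

import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)   # Int64
print(s.isna())  # the third entry is missing, yet the dtype stays integer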

Gotchas, caveats, notes

Setting dtype=object will silence the above warning, but will not make it more memory efficient, only process efficient if anything.

Setting dtype=unicode will not do anything, since to numpy, a unicode is represented as object.

Usage of converters

@sparrow correctly points out the usage of converters to avoid pandas blowing up when encountering 'foobar' in a column specified as int. I would like to add that converters are really heavy and inefficient to use in pandas and should be used as a last resort. This is because the read_csv process is a single process.

CSV files can be processed line by line and thus can be processed by multiple converters in parallel more efficiently by simply cutting the file into segments and running multiple processes, something that pandas does not support. But this is a different story.
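
On the memory point: pandas does support reading the file in chunks via the chunksize parameter, which bounds memory use even though it does not parallelize the converters. A minimal sketch (the filename, dtype, and chunk size are placeholders):

import pandas as pd

# iterate over the CSV in 100,000-row pieces instead of loading it all at once
chunks = pd.read_csv('somefile.csv', dtype={'user_id': str}, chunksize=100000)
df = pd.concat(chunk for chunk in chunks)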


Answer 1

Try:

dashboard_df = pd.read_csv(p_file, sep=',', error_bad_lines=False, index_col=False, dtype='unicode')

According to the pandas documentation:

dtype : Type name or dict of column -> type

As for low_memory, it’s True by default and isn’t yet documented. I don’t think it’s relevant though. The error message is generic, so you shouldn’t need to mess with low_memory anyway. Hope this helps; let me know if you have further problems.


Answer 2

df = pd.read_csv('somefile.csv', low_memory=False)

This should solve the issue. I got exactly the same error, when reading 1.8M rows from a CSV.


Answer 3

As mentioned earlier by firelynx, if a dtype is explicitly specified and there is mixed data that is not compatible with that dtype, then loading will crash. I used a converter like this as a workaround to change the values with an incompatible data type so that the data could still be loaded.

def conv(val):
    if not val:
        return 0    
    try:
        return np.float64(val)
    except:        
        return np.float64(0)

df = pd.read_csv(csv_file,converters={'COL_A':conv,'COL_B':conv})

Answer 4

I had a similar issue with a ~400MB file. Setting low_memory=False did the trick for me. Do the simple things first: I would check that your dataframe isn’t bigger than your system memory, reboot, and clear the RAM before proceeding. If you’re still running into errors, it’s worth making sure your .csv file is OK; take a quick look in Excel and make sure there’s no obvious corruption. Broken original data can wreak havoc…


Answer 5

I was facing a similar issue when processing a huge CSV file (6 million rows). I had three issues:

  1. the file contained strange characters (fixed by specifying the encoding);
  2. the datatype was not specified (fixed using the dtype property);
  3. using the above I still faced an issue with the file_format, which could not be defined based on the filename (fixed using try .. except ..).

df = pd.read_csv(csv_file,sep=';', encoding = 'ISO-8859-1',
                 names=['permission','owner_name','group_name','size','ctime','mtime','atime','filename','full_filename'],
                 dtype={'permission':str,'owner_name':str,'group_name':str,'size':str,'ctime':object,'mtime':object,'atime':object,'filename':str,'full_filename':str,'first_date':object,'last_date':object})

try:
    df['file_format'] = [Path(f).suffix[1:] for f in df.filename.tolist()]
except:
    df['file_format'] = ''

Answer 6

It worked for me with low_memory = False while importing a DataFrame. That is all the change that worked for me:

df = pd.read_csv('export4_16.csv',low_memory=False)

Difference between numpy.array shapes (R, 1) and (R,)

Question: Difference between numpy.array shapes (R, 1) and (R,)

In numpy, some of the operations return in shape (R, 1) but some return (R,). This will make matrix multiplication more tedious since explicit reshape is required. For example, given a matrix M, if we want to do numpy.dot(M[:,0], numpy.ones((1, R))) where R is the number of rows (of course, the same issue also occurs column-wise). We will get matrices are not aligned error since M[:,0] is in shape (R,) but numpy.ones((1, R)) is in shape (1, R).

So my questions are:

  1. What’s the difference between shape (R, 1) and (R,)? I know literally it’s list of numbers and list of lists where all list contains only a number. Just wondering why not design numpy so that it favors shape (R, 1) instead of (R,) for easier matrix multiplication.

  2. Are there better ways for the above example, without an explicit reshape like this: numpy.dot(M[:,0].reshape(R, 1), numpy.ones((1, R)))?
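
A quick demonstration of the difference, for reference while reading the answers below:

import numpy as np

M = np.random.rand(4, 3)

M[:, 0].shape                 # (4,)   -- indexing with a scalar drops the axis
M[:, 0:1].shape               # (4, 1) -- slicing keeps the column axis
M[:, 0][:, np.newaxis].shape  # (4, 1) -- or re-add the axis with np.newaxis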


Answer 0

1. The meaning of shapes in NumPy

You write, “I know literally it’s list of numbers and list of lists where all list contains only a number” but that’s a bit of an unhelpful way to think about it.

The best way to think about NumPy arrays is that they consist of two parts, a data buffer which is just a block of raw elements, and a view which describes how to interpret the data buffer.

For example, if we create an array of 12 integers:

>>> a = numpy.arange(12)
>>> a
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

Then a consists of a data buffer, arranged something like this:

┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│  0 │  1 │  2 │  3 │  4 │  5 │  6 │  7 │  8 │  9 │ 10 │ 11 │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘

and a view which describes how to interpret the data:

>>> a.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
>>> a.dtype
dtype('int64')
>>> a.itemsize
8
>>> a.strides
(8,)
>>> a.shape
(12,)

Here the shape (12,) means the array is indexed by a single index which runs from 0 to 11. Conceptually, if we label this single index i, the array a looks like this:

i= 0    1    2    3    4    5    6    7    8    9   10   11
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│  0 │  1 │  2 │  3 │  4 │  5 │  6 │  7 │  8 │  9 │ 10 │ 11 │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘

If we reshape an array, this doesn’t change the data buffer. Instead, it creates a new view that describes a different way to interpret the data. So after:

>>> b = a.reshape((3, 4))

the array b has the same data buffer as a, but now it is indexed by two indices which run from 0 to 2 and 0 to 3 respectively. If we label the two indices i and j, the array b looks like this:

i= 0    0    0    0    1    1    1    1    2    2    2    2
j= 0    1    2    3    0    1    2    3    0    1    2    3
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│  0 │  1 │  2 │  3 │  4 │  5 │  6 │  7 │  8 │  9 │ 10 │ 11 │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘

which means that:

>>> b[2,1]
9

You can see that the second index changes quickly and the first index changes slowly. If you prefer this to be the other way round, you can specify the order parameter:

>>> c = a.reshape((3, 4), order='F')

which results in an array indexed like this:

i= 0    1    2    0    1    2    0    1    2    0    1    2
j= 0    0    0    1    1    1    2    2    2    3    3    3
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│  0 │  1 │  2 │  3 │  4 │  5 │  6 │  7 │  8 │  9 │ 10 │ 11 │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘

which means that:

>>> c[2,1]
5

It should now be clear what it means for an array to have a shape with one or more dimensions of size 1. After:

>>> d = a.reshape((12, 1))

the array d is indexed by two indices, the first of which runs from 0 to 11, and the second index is always 0:

i= 0    1    2    3    4    5    6    7    8    9   10   11
j= 0    0    0    0    0    0    0    0    0    0    0    0
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│  0 │  1 │  2 │  3 │  4 │  5 │  6 │  7 │  8 │  9 │ 10 │ 11 │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘

and so:

>>> d[10,0]
10

A dimension of length 1 is “free” (in some sense), so there’s nothing stopping you from going to town:

>>> e = a.reshape((1, 2, 1, 6, 1))

giving an array indexed like this:

i= 0    0    0    0    0    0    0    0    0    0    0    0
j= 0    0    0    0    0    0    1    1    1    1    1    1
k= 0    0    0    0    0    0    0    0    0    0    0    0
l= 0    1    2    3    4    5    0    1    2    3    4    5
m= 0    0    0    0    0    0    0    0    0    0    0    0
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│  0 │  1 │  2 │  3 │  4 │  5 │  6 │  7 │  8 │  9 │ 10 │ 11 │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘

and so:

>>> e[0,1,0,0,0]
6

See the NumPy internals documentation for more details about how arrays are implemented.

2. What to do?

Since numpy.reshape just creates a new view, you shouldn’t be scared about using it whenever necessary. It’s the right tool to use when you want to index an array in a different way.

However, in a long computation it’s usually possible to arrange to construct arrays with the “right” shape in the first place, and so minimize the number of reshapes and transposes. But without seeing the actual context that led to the need for a reshape, it’s hard to say what should be changed.

The example in your question is:

numpy.dot(M[:,0], numpy.ones((1, R)))

but this is not realistic. First, this expression:

M[:,0].sum()

computes the result more simply. Second, is there really something special about column 0? Perhaps what you actually need is:

M.sum(axis=0)

Answer 1

The difference between (R,) and (1,R) is literally the number of indices that you need to use. ones((1,R)) is a 2-D array that happens to have only one row. ones(R) is a vector. Generally if it doesn’t make sense for the variable to have more than one row/column, you should be using a vector, not a matrix with a singleton dimension.

For your specific case, there are a couple of options:

1) Just make the second argument a vector. The following works fine:

    np.dot(M[:,0], np.ones(R))

2) If you want MATLAB-like matrix operations, use the class matrix instead of ndarray. All matrices are forced into being 2-D arrays, and the operator * does matrix multiplication instead of element-wise multiplication (so you don’t need dot). In my experience, this is more trouble than it is worth, but it may be nice if you are used to MATLAB.


Answer 2

The shape is a tuple. If there is only 1 dimension, the shape will be one number with just a blank after the comma. For 2+ dimensions, there will be a number after all the commas.

# 1 dimension with 2 elements, shape = (2,). 
# Note there's nothing after the comma.
z=np.array([  # start dimension
    10,       # not a dimension
    20        # not a dimension
])            # end dimension
print(z.shape)

(2,)

# 2 dimensions, each with 1 element, shape = (2,1)
w=np.array([  # start outer dimension 
    [10],     # element is in an inner dimension
    [20]      # element is in an inner dimension
])            # end outer dimension
print(w.shape)

(2,1)


Answer 3

For its base array class, 2d arrays are no more special than 1d or 3d ones. There are some operations that preserve the dimensions, some that reduce them, and others that combine or even expand them.

M=np.arange(9).reshape(3,3)
M[:,0].shape # (3,) selects one column, returns a 1d array
M[0,:].shape # same, one row, 1d array
M[:,[0]].shape # (3,1), index with a list (or array), returns 2d
M[:,[0,1]].shape # (3,2)

In [20]: np.dot(M[:,0].reshape(3,1),np.ones((1,3)))

Out[20]: 
array([[ 0.,  0.,  0.],
       [ 3.,  3.,  3.],
       [ 6.,  6.,  6.]])

In [21]: np.dot(M[:,[0]],np.ones((1,3)))
Out[21]: 
array([[ 0.,  0.,  0.],
       [ 3.,  3.,  3.],
       [ 6.,  6.,  6.]])

Other expressions that give the same array

np.dot(M[:,0][:,np.newaxis],np.ones((1,3)))
np.dot(np.atleast_2d(M[:,0]).T,np.ones((1,3)))
np.einsum('i,j',M[:,0],np.ones((3)))
M1=M[:,0]; R=np.ones((3)); np.dot(M1[:,None], R[None,:])

MATLAB started out with just 2D arrays. Newer versions allow more dimensions, but retain the lower bound of 2. But you still have to pay attention to the difference between a row matrix and a column one, with shape (1,3) vs (3,1). How often have you written [1,2,3].'? I was going to write row vector and column vector, but with that 2d constraint, there aren’t any vectors in MATLAB – at least not in the mathematical sense of vector as being 1d.

Have you looked at np.atleast_2d (also _1d and _3d versions)?


Answer 4

1) The reason not to prefer a shape of (R, 1) over (R,) is that it unnecessarily complicates things. Besides, why would it be preferable to have shape (R, 1) by default for a length-R vector instead of (1, R)? It’s better to keep it simple and be explicit when you require additional dimensions.

2) For your example, you are computing an outer product so you can do this without a reshape call by using np.outer:

np.outer(M[:,0], np.ones((1, R)))

Answer 5

There are a lot of good answers here already. But for me it was hard to find an example where the shape of an array can break a whole program.

So here is one:

import numpy as np
a = np.array([1,2,3,4])
b = np.array([10,20,30,40])


from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(a,b)

This will fail with error:

ValueError: Expected 2D array, got 1D array instead

but if we add reshape to a:

a = np.array([1,2,3,4]).reshape(-1,1)

this works correctly!


Pandas: create a new column based on values from other columns / apply a function of multiple columns, row-wise

Question: Pandas: create a new column based on values from other columns / apply a function of multiple columns, row-wise

I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian, ERI_Black_Afr.Amer, ERI_HI_PacIsl, ERI_White) in each row of my dataframe.

I’ve tried different methods from other questions but still can’t seem to find the right answer for my problem. The critical piece of this is that if the person is counted as Hispanic they can’t be counted as anything else. Even if they have a “1” in another ethnicity column they still are counted as Hispanic, not two or more races. Similarly, if the sum of all the ERI columns is greater than 1 they are counted as two or more races and can’t be counted as a unique ethnicity (except for Hispanic). Hopefully this makes sense. Any help will be greatly appreciated.

It’s almost like doing a for loop through each row, and if a record meets the criterion it is added to one list and eliminated from the original.

From the dataframe below I need to calculate a new column based on the following spec in SQL:

========================= CRITERIA ===============================

IF [ERI_Hispanic] = 1 THEN RETURN “Hispanic”
ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN “Two or More”
ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN “A/I AK Native”
ELSE IF [ERI_Asian] = 1 THEN RETURN “Asian”
ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN “Black/AA”
ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN “Haw/Pac Isl.”
ELSE IF [ERI_White] = 1 THEN RETURN “White”

Comment: If the ERI Flag for Hispanic is True (1), the employee is classified as “Hispanic”

Comment: If more than 1 non-Hispanic ERI Flag is true, return “Two or More”

====================== DATAFRAME ===========================

     lname          fname       rno_cd  eri_afr_amer    eri_asian   eri_hawaiian    eri_hispanic    eri_nat_amer    eri_white   rno_defined
0    MOST           JEFF        E       0               0           0               0               0               1           White
1    CRUISE         TOM         E       0               0           0               1               0               0           White
2    DEPP           JOHNNY              0               0           0               0               0               1           Unknown
3    DICAP          LEO                 0               0           0               0               0               1           Unknown
4    BRANDO         MARLON      E       0               0           0               0               0               0           White
5    HANKS          TOM         0                       0           0               0               0               1           Unknown
6    DENIRO         ROBERT      E       0               1           0               0               0               1           White
7    PACINO         AL          E       0               0           0               0               0               1           White
8    WILLIAMS       ROBIN       E       0               0           1               0               0               0           White
9    EASTWOOD       CLINT       E       0               0           0               0               0               1           White

回答 0

好的,这需要两个步骤。第一步是编写一个执行所需转换的函数;我已根据您的伪代码拼出了一个示例:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

您可能还需要再检查一遍,但它似乎能达到目的。请注意,传入该函数的参数被视为一个标记为“row”的 Series 对象。

接下来,使用 pandas 的 apply 函数来应用该函数,例如:

df.apply (lambda row: label_race(row), axis=1)

请注意 axis=1 说明符,它表示按行而不是按列进行应用。结果在这里:

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White

如果您对这些结果感到满意,请再次运行它,将结果保存到原始数据框中的新列中。

df['race_label'] = df.apply (lambda row: label_race(row), axis=1)

结果数据框如下所示(向右滚动以查看新列):

      lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label
0      MOST    JEFF      E             0          0             0              0             0          1       White         White
1    CRUISE     TOM      E             0          0             0              1             0          0       White      Hispanic
2      DEPP  JOHNNY    NaN             0          0             0              0             0          1     Unknown         White
3     DICAP     LEO    NaN             0          0             0              0             0          1     Unknown         White
4    BRANDO  MARLON      E             0          0             0              0             0          0       White         Other
5     HANKS     TOM    NaN             0          0             0              0             0          1     Unknown         White
6    DENIRO  ROBERT      E             0          1             0              0             0          1       White   Two Or More
7    PACINO      AL      E             0          0             0              0             0          1       White         White
8  WILLIAMS   ROBIN      E             0          0             1              0             0          0       White  Haw/Pac Isl.
9  EASTWOOD   CLINT      E             0          0             0              0             0          1       White         White

OK, two steps to this – first is to write a function that does the translation you want – I’ve put an example together based on your pseudo-code:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

You may want to go over this, but it seems to do the trick – notice that the parameter going into the function is considered to be a Series object labelled “row”.

Next, use the apply function in pandas to apply the function – e.g.

df.apply (lambda row: label_race(row), axis=1)

Note the axis=1 specifier, that means that the application is done at a row, rather than a column level. The results are here:

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White

If you’re happy with those results, then run it again, saving the results into a new column in your original dataframe.

df['race_label'] = df.apply (lambda row: label_race(row), axis=1)

The resultant dataframe looks like this (scroll to the right to see the new column):

      lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label
0      MOST    JEFF      E             0          0             0              0             0          1       White         White
1    CRUISE     TOM      E             0          0             0              1             0          0       White      Hispanic
2      DEPP  JOHNNY    NaN             0          0             0              0             0          1     Unknown         White
3     DICAP     LEO    NaN             0          0             0              0             0          1     Unknown         White
4    BRANDO  MARLON      E             0          0             0              0             0          0       White         Other
5     HANKS     TOM    NaN             0          0             0              0             0          1     Unknown         White
6    DENIRO  ROBERT      E             0          1             0              0             0          1       White   Two Or More
7    PACINO      AL      E             0          0             0              0             0          1       White         White
8  WILLIAMS   ROBIN      E             0          0             1              0             0          0       White  Haw/Pac Isl.
9  EASTWOOD   CLINT      E             0          0             0              0             0          1       White         White

回答 1

由于这是 Google 搜索 “pandas new column from others” 的第一个结果,下面是一个简单的示例:

import pandas as pd

# make a simple dataframe
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
df
#    a  b
# 0  1  3
# 1  2  4

# create an unattached column with an index
df.apply(lambda row: row.a + row.b, axis=1)
# 0    4
# 1    6

# do same but attach it to the dataframe
df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
df
#    a  b  c
# 0  1  3  4
# 1  2  4  6

如果遇到 SettingWithCopyWarning,您也可以这样做:

fn = lambda row: row.a + row.b # define a function for the new column
col = df.apply(fn, axis=1) # get column data with an index
df = df.assign(c=col.values) # assign values to column 'c'

资料来源:https://stackoverflow.com/a/12555510/243392

如果列名包含空格,则可以使用如下语法:

df = df.assign(**{'some column name': col.values})

这里是 apply 和 assign 的文档。

Since this is the first Google result for ‘pandas new column from others’, here’s a simple example:

import pandas as pd

# make a simple dataframe
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
df
#    a  b
# 0  1  3
# 1  2  4

# create an unattached column with an index
df.apply(lambda row: row.a + row.b, axis=1)
# 0    4
# 1    6

# do same but attach it to the dataframe
df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
df
#    a  b  c
# 0  1  3  4
# 1  2  4  6

If you get the SettingWithCopyWarning you can do it this way also:

fn = lambda row: row.a + row.b # define a function for the new column
col = df.apply(fn, axis=1) # get column data with an index
df = df.assign(c=col.values) # assign values to column 'c'

Source: https://stackoverflow.com/a/12555510/243392

And if your column name includes spaces you can use syntax like this:

df = df.assign(**{'some column name': col.values})

And here’s the documentation for apply, and assign.
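
For completeness, a minimal sketch combining the snippets above (the column name 'a plus b' is just an illustrative stand-in):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
col = df.apply(lambda row: row.a + row.b, axis=1)  # unattached column with an index
df = df.assign(**{'a plus b': col.values})         # column name containing a space
print(df)
#    a  b  a plus b
# 0  1  3         4
# 1  2  4         6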


回答 2

上面的答案是完全正确的,但存在一种向量化的解决方案,即 numpy.select。它让您先定义条件,再为这些条件定义输出,比使用 apply 高效得多:


首先,定义条件:

conditions = [
    df['eri_hispanic'] == 1,
    df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    df['eri_nat_amer'] == 1,
    df['eri_asian'] == 1,
    df['eri_afr_amer'] == 1,
    df['eri_hawaiian'] == 1,
    df['eri_white'] == 1,
]

现在,定义相应的输出:

outputs = [
    'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]

最后,使用numpy.select

res = np.select(conditions, outputs, 'Other')
pd.Series(res)

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White
dtype: object

为什么应该用 numpy.select 而不是 apply?以下是一些性能检查:

df = pd.concat([df]*1000)

In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [44]: %%timeit
    ...: conditions = [
    ...:     df['eri_hispanic'] == 1,
    ...:     df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    ...:     df['eri_nat_amer'] == 1,
    ...:     df['eri_asian'] == 1,
    ...:     df['eri_afr_amer'] == 1,
    ...:     df['eri_hawaiian'] == 1,
    ...:     df['eri_white'] == 1,
    ...: ]
    ...:
    ...: outputs = [
    ...:     'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
    ...: ]
    ...:
    ...: np.select(conditions, outputs, 'Other')
    ...:
    ...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

使用 numpy.select 极大地提高了性能,而且随着数据量增长,差距只会越来越大。

The answers above are perfectly valid, but a vectorized solution exists, in the form of numpy.select. This allows you to define conditions, then define outputs for those conditions, much more efficiently than using apply:


First, define conditions:

conditions = [
    df['eri_hispanic'] == 1,
    df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    df['eri_nat_amer'] == 1,
    df['eri_asian'] == 1,
    df['eri_afr_amer'] == 1,
    df['eri_hawaiian'] == 1,
    df['eri_white'] == 1,
]

Now, define the corresponding outputs:

outputs = [
    'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]

Finally, using numpy.select:

res = np.select(conditions, outputs, 'Other')
pd.Series(res)

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White
dtype: object

Why should numpy.select be used over apply? Here are some performance checks:

df = pd.concat([df]*1000)

In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [44]: %%timeit
    ...: conditions = [
    ...:     df['eri_hispanic'] == 1,
    ...:     df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    ...:     df['eri_nat_amer'] == 1,
    ...:     df['eri_asian'] == 1,
    ...:     df['eri_afr_amer'] == 1,
    ...:     df['eri_hawaiian'] == 1,
    ...:     df['eri_white'] == 1,
    ...: ]
    ...:
    ...: outputs = [
    ...:     'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
    ...: ]
    ...:
    ...: np.select(conditions, outputs, 'Other')
    ...:
    ...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using numpy.select gives us vastly improved performance, and the discrepancy will only increase as the data grows.


回答 3

.apply() 以函数作为第一个参数;像这样直接传入 label_race 函数:

df['race_label'] = df.apply(label_race, axis=1)

您无需创建一个 lambda 函数来传入函数。

.apply() takes in a function as the first parameter; pass in the label_race function as so:

df['race_label'] = df.apply(label_race, axis=1)

You don’t need to make a lambda function to pass in a function.


回答 4

尝试这个,

df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)

O/P:

     lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian  \
0      MOST    JEFF      E             0          0             0   
1    CRUISE     TOM      E             0          0             0   
2      DEPP  JOHNNY    NaN             0          0             0   
3     DICAP     LEO    NaN             0          0             0   
4    BRANDO  MARLON      E             0          0             0   
5     HANKS     TOM    NaN             0          0             0   
6    DENIRO  ROBERT      E             0          1             0   
7    PACINO      AL      E             0          0             0   
8  WILLIAMS   ROBIN      E             0          0             1   
9  EASTWOOD   CLINT      E             0          0             0   

   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label  
0             0             0          1       White         White  
1             1             0          0       White      Hispanic  
2             0             0          1     Unknown         White  
3             0             0          1     Unknown         White  
4             0             0          0       White         Other  
5             0             0          1     Unknown         White  
6             0             0          1       White   Two Or More  
7             0             0          1       White         White  
8             0             0          0       White  Haw/Pac Isl.  
9             0             0          1       White         White 

使用.loc代替apply

它改善了向量化。

.loc 以简单的方式工作,根据条件屏蔽行,将值应用于冻结行。

有关更多详细信息,请访问.loc docs

性能指标:

接受的答案:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

df=pd.read_csv('dataser.csv')
df = pd.concat([df]*1000)

%timeit df.apply(lambda row: label_race(row), axis=1)

每个循环1.15 s±46.5 ms(平均±标准偏差,共7次运行,每个循环1次)

我的建议答案:

def label_race(df):
    df.loc[df['eri_white']==1,'race_label'] = 'White'
    df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
    df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
    df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
    df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
    df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
    df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
    df['race_label'].fillna('Other', inplace=True)
df=pd.read_csv('s22.csv')
df = pd.concat([df]*1000)

%timeit label_race(df)

每个循环24.7 ms±1.7 ms(平均±标准偏差,运行7次,每个循环10个)

try this,

df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)

O/P:

     lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian  \
0      MOST    JEFF      E             0          0             0   
1    CRUISE     TOM      E             0          0             0   
2      DEPP  JOHNNY    NaN             0          0             0   
3     DICAP     LEO    NaN             0          0             0   
4    BRANDO  MARLON      E             0          0             0   
5     HANKS     TOM    NaN             0          0             0   
6    DENIRO  ROBERT      E             0          1             0   
7    PACINO      AL      E             0          0             0   
8  WILLIAMS   ROBIN      E             0          0             1   
9  EASTWOOD   CLINT      E             0          0             0   

   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label  
0             0             0          1       White         White  
1             1             0          0       White      Hispanic  
2             0             0          1     Unknown         White  
3             0             0          1     Unknown         White  
4             0             0          0       White         Other  
5             0             0          1     Unknown         White  
6             0             0          1       White   Two Or More  
7             0             0          1       White         White  
8             0             0          0       White  Haw/Pac Isl.  
9             0             0          1       White         White 

Use .loc instead of apply.

It improves vectorization.

.loc works in a simple manner: mask rows based on the condition, then apply values to the masked rows.

For more details, see the .loc docs.

Performance metrics:

Accepted Answer:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

df=pd.read_csv('dataser.csv')
df = pd.concat([df]*1000)

%timeit df.apply(lambda row: label_race(row), axis=1)

1.15 s ± 46.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My Proposed Answer:

def label_race(df):
    df.loc[df['eri_white']==1,'race_label'] = 'White'
    df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
    df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
    df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
    df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
    df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
    df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
    df['race_label'].fillna('Other', inplace=True)
df=pd.read_csv('s22.csv')
df = pd.concat([df]*1000)

%timeit label_race(df)

24.7 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


熊猫有条件地创建系列/数据框列

问题:熊猫有条件地创建系列/数据框列

我有下面的数据框:

    Type       Set
1    A          Z
2    B          Z           
3    B          X
4    C          Y

我想向数据框添加另一列(或生成一个 Series),其长度与数据框相同(记录/行数相等):如果 Set == 'Z',则颜色设置为 'green',否则为 'red'。

最好的方法是什么?

I have a dataframe along the lines of the below:

    Type       Set
1    A          Z
2    B          Z           
3    B          X
4    C          Y

I want to add another column to the dataframe (or generate a series) of the same length as the dataframe (equal number of records/rows) which sets a colour 'green' if Set == 'Z' and 'red' if Set equals anything else.

What’s the best way to do this?


回答 0

如果您只有两个选项可供选择:

df['color'] = np.where(df['Set']=='Z', 'green', 'red')

例如,

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)

产生

  Set Type  color
0   Z    A  green
1   Z    B  green
2   X    B    red
3   Y    C    red

如果您有两个以上的条件,请使用 np.select。例如,如果您希望 color 为:

  • 当 (df['Set'] == 'Z') & (df['Type'] == 'A') 时为 yellow
  • 否则,当 (df['Set'] == 'Z') & (df['Type'] == 'B') 时为 blue
  • 否则,当 (df['Type'] == 'B') 时为 purple
  • 否则为 black

然后使用

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
    (df['Set'] == 'Z') & (df['Type'] == 'A'),
    (df['Set'] == 'Z') & (df['Type'] == 'B'),
    (df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)

产生

  Set Type   color
0   Z    A  yellow
1   Z    B    blue
2   X    B  purple
3   Y    C   black

If you only have two choices to select from:

df['color'] = np.where(df['Set']=='Z', 'green', 'red')

For example,

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)

yields

  Set Type  color
0   Z    A  green
1   Z    B  green
2   X    B    red
3   Y    C    red

If you have more than two conditions then use np.select. For example, if you want color to be

  • yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
  • otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
  • otherwise purple when (df['Type'] == 'B')
  • otherwise black,

then use

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
    (df['Set'] == 'Z') & (df['Type'] == 'A'),
    (df['Set'] == 'Z') & (df['Type'] == 'B'),
    (df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)

which yields

  Set Type   color
0   Z    A  yellow
1   Z    B    blue
2   X    B  purple
3   Y    C   black

回答 1

列表推导式是有条件地创建另一列的另一种方法。如果像示例中那样处理列中的 object dtype,列表推导式通常胜过大多数其他方法。

列表推导式示例:

df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]

%timeit测试:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
%timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

1000 loops, best of 3: 239 µs per loop
1000 loops, best of 3: 523 µs per loop
1000 loops, best of 3: 263 µs per loop

List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.

Example list comprehension:

df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]

%timeit tests:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
%timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

1000 loops, best of 3: 239 µs per loop
1000 loops, best of 3: 523 µs per loop
1000 loops, best of 3: 263 µs per loop

回答 2

可以实现这一目标的另一种方法是

df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

Another way in which this could be achieved is

df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')
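
For reference, a minimal sketch running this one-liner on the example frame from the question:

import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})
df['color'] = df.Set.map(lambda x: 'red' if x == 'Z' else 'green')
print(df)
#   Type Set  color
# 0    A   Z    red
# 1    B   Z    red
# 2    B   X  green
# 3    C   Y  green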

回答 3

这是解决这个问题的又一种方法:使用字典把新值映射到列中的键上:

def map_values(row, values_dict):
    return values_dict[row]

values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]})

df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args = (values_dict,))

结果看起来像这样:

df
Out[2]: 
  INDICATOR  VALUE  NEW_VALUE
0         A     10          1
1         B      9          2
2         C      8          3
3         D      7          4

当您有许多 if/else 类型的判断要做(即有许多唯一值要替换)时,此方法可能非常强大。

当然,您可以始终这样做:

df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)

但在我的机器上,那种方法比上面的 apply 方法慢三倍以上。

您也可以使用dict.get

df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]

Here’s yet another way to skin this cat, using a dictionary to map new values onto the keys in the list:

def map_values(row, values_dict):
    return values_dict[row]

values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]})

df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args = (values_dict,))

What’s it look like:

df
Out[2]: 
  INDICATOR  VALUE  NEW_VALUE
0         A     10          1
1         B      9          2
2         C      8          3
3         D      7          4

This approach can be very powerful when you have many ifelse-type statements to make (i.e. many unique values to replace).

And of course you could always do this:

df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)

But that approach is more than three times as slow as the apply approach from above, on my machine.

And you could also do this, using dict.get:

df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]

回答 4

以下方法比这里计时过的那些方法要慢,但我们可以基于多于一列的内容来计算额外的列,而且额外的列也可以取两个以上的值。

仅使用 “Set” 列的简单示例:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

具有更多颜色和更多列的示例:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    elif row["Type"] == "C":
        return "blue"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C   blue

编辑(21/06/2019):使用plydata

也可以使用 plydata 来做这类事情(尽管这似乎比使用 assign 和 apply 还要慢)。

from plydata import define, if_else

简单if_else

df = define(df, color=if_else('Set=="Z"', '"red"', '"green"'))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

嵌套if_else

df = define(df, color=if_else(
    'Set=="Z"',
    '"red"',
    if_else('Type=="C"', '"green"', '"blue"')))

print(df)                            
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B   blue
3   Y    C  green

The following is slower than the approaches timed here, but we can compute the extra column based on the contents of more than one column, and more than two values can be computed for the extra column.

Simple example using just the “Set” column:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Example with more colours and more columns taken into account:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    elif row["Type"] == "C":
        return "blue"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C   blue

Edit (21/06/2019): Using plydata

It is also possible to use plydata to do this kind of things (this seems even slower than using assign and apply, though).

from plydata import define, if_else

Simple if_else:

df = define(df, color=if_else('Set=="Z"', '"red"', '"green"'))

print(df)
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Nested if_else:

df = define(df, color=if_else(
    'Set=="Z"',
    '"red"',
    if_else('Type=="C"', '"green"', '"blue"')))

print(df)                            
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B   blue
3   Y    C  green

回答 5

也许这需要较新版本的 Pandas 才行(我在 pandas=1.0.5 下测试过),但我认为以下是迄今为止该问题最简短、也许也是最好的答案。您可以使用 .loc 方法,并根据需要使用一个或多个条件。

代码摘要:

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
df.loc[(df['Set']=="Z"), 'Color'] = "green"

#practice!
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"

说明:

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))

# df so far: 
  Type Set  
0    A   Z 
1    B   Z 
2    B   X 
3    C   Y

添加一个 'Color' 列并将所有值设置为 "red"

df['Color'] = "red"

应用您的单个条件:

df.loc[(df['Set']=="Z"), 'Color'] = "green"


# df: 
  Type Set  Color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

或多个条件(如果需要):

df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"

您可以在此处阅读Pandas逻辑运算符和条件选择: Pandas中用于布尔索引的逻辑运算符

Maybe this has been possible with newer updates of Pandas (tested with pandas=1.0.5), but I think the following is the shortest and maybe best answer for the question, so far. You can use the .loc method and use one condition or several depending on your need.

Code Summary:

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
df.loc[(df['Set']=="Z"), 'Color'] = "green"

#practice!
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"

Explanation:

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))

# df so far: 
  Type Set  
0    A   Z 
1    B   Z 
2    B   X 
3    C   Y

add a ‘color’ column and set all values to “red”

df['Color'] = "red"

Apply your single condition:

df.loc[(df['Set']=="Z"), 'Color'] = "green"


# df: 
  Type Set  Color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

or multiple conditions if you want:

df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"

You can read on Pandas logical operators and conditional selection here: Logical operators for boolean indexing in Pandas


回答 6

使用 .apply() 方法的单行写法如下:

df['color'] = df['Set'].apply(lambda set_: 'green' if set_=='Z' else 'red')

之后,df数据帧如下所示:

>>> print(df)
  Type Set  color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

A one-liner with the .apply() method is the following:

df['color'] = df['Set'].apply(lambda set_: 'green' if set_=='Z' else 'red')

After that, df data frame looks like this:

>>> print(df)
  Type Set  color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

回答 7

如果您要处理海量数据,则最好采用记忆化(memoization)的方式:

# First create a dictionary of manually stored values
color_dict = {'Z':'red'}

# Second, build a dictionary of "other" values
color_dict_other = {x:'green' for x in df['Set'].unique() if x not in color_dict.keys()}

# Next, merge the two
color_dict.update(color_dict_other)

# Finally, map it to your column
df['color'] = df['Set'].map(color_dict)

当您有很多重复的值时,这种方法将是最快的。我的一般经验法则是:当 data_size > 10**4 且 n_distinct < data_size/4 时使用记忆化。

例如,在 10,000 行、不同值不超过 2,500 个的情况下使用记忆化。

If you’re working with massive data, a memoized approach would be best:

# First create a dictionary of manually stored values
color_dict = {'Z':'red'}

# Second, build a dictionary of "other" values
color_dict_other = {x:'green' for x in df['Set'].unique() if x not in color_dict.keys()}

# Next, merge the two
color_dict.update(color_dict_other)

# Finally, map it to your column
df['color'] = df['Set'].map(color_dict)

This approach will be fastest when you have many repeated values. My general rule of thumb is to memoize when: data_size > 10**4 & n_distinct < data_size/4

E.g. memoize in a case of 10,000 rows with 2,500 or fewer distinct values.
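
A minimal self-contained sketch of the pattern (the frame here is a hypothetical stand-in with many repeated values):

import numpy as np
import pandas as pd

# hypothetical data: 100,000 rows but only 4 distinct values
df = pd.DataFrame({'Set': np.random.choice(list('ZXYW'), size=100_000)})

color_dict = {'Z': 'red'}  # manually stored values
color_dict.update({x: 'green' for x in df['Set'].unique() if x not in color_dict})
df['color'] = df['Set'].map(color_dict)
print(df['color'].value_counts())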


如何在NumPy中创建一个空数组/矩阵?

问题:如何在NumPy中创建一个空数组/矩阵?

我无法弄清楚如何以通常使用列表的方式使用数组或矩阵。我想创建一个空数组(或矩阵),然后一次向其中添加一列(或行)。

目前,我能找到的唯一方法是:

mat = None
for col in columns:
    if mat is None:
        mat = col
    else:
        mat = hstack((mat, col))

而如果这是一个列表,我会做这样的事情:

list = []
for item in data:
    list.append(item)

有没有办法对NumPy数组或矩阵使用这种表示法?

I can’t figure out how to use an array or matrix in the way that I would normally use a list. I want to create an empty array (or matrix) and then add one column (or row) to it at a time.

At the moment the only way I can find to do this is like:

mat = None
for col in columns:
    if mat is None:
        mat = col
    else:
        mat = hstack((mat, col))

Whereas if it were a list, I’d do something like this:

list = []
for item in data:
    list.append(item)

Is there a way to use that kind of notation for NumPy arrays or matrices?


回答 0

您对高效使用 NumPy 的思维模式有误。NumPy 数组存储在连续的内存块中。如果要向现有数组添加行或列,就需要把整个数组复制到一块新的内存中,为要存储的新元素留出位置。如果为了构建数组而反复这样做,效率会非常低。

在添加行的情况下,最好的选择是创建一个与数据集最终大小一样大的数组,然后逐行向其中添加数据:

>>> import numpy
>>> a = numpy.zeros(shape=(5,2))
>>> a
array([[ 0.,  0.],
   [ 0.,  0.],
   [ 0.,  0.],
   [ 0.,  0.],
   [ 0.,  0.]])
>>> a[0] = [1,2]
>>> a[1] = [2,3]
>>> a
array([[ 1.,  2.],
   [ 2.,  3.],
   [ 0.,  0.],
   [ 0.,  0.],
   [ 0.,  0.]])

You have the wrong mental model for using NumPy efficiently. NumPy arrays are stored in contiguous blocks of memory. If you want to add rows or columns to an existing array, the entire array needs to be copied to a new block of memory, creating gaps for the new elements to be stored. This is very inefficient if done repeatedly to build an array.

In the case of adding rows, your best bet is to create an array that is as big as your data set will eventually be, and then assign data to it row-by-row:

>>> import numpy
>>> a = numpy.zeros(shape=(5,2))
>>> a
array([[ 0.,  0.],
   [ 0.,  0.],
   [ 0.,  0.],
   [ 0.,  0.],
   [ 0.,  0.]])
>>> a[0] = [1,2]
>>> a[1] = [2,3]
>>> a
array([[ 1.,  2.],
   [ 2.,  3.],
   [ 0.,  0.],
   [ 0.,  0.],
   [ 0.,  0.]])

回答 1

NumPy 数组是与列表非常不同的数据结构,其设计用途也不同。您对 hstack 的使用可能效率很低……每次调用它时,现有数组中的所有数据都会被复制到一个新数组中。(append 函数也有同样的问题。)如果您想一次一列地构建矩阵,最好先把数据保留在列表中,等完成后再将其转换为数组。

例如


mylist = []
for item in data:
    mylist.append(item)
mat = numpy.array(mylist)

item 可以是列表、数组或任何可迭代对象,只要每个 item 具有相同数量的元素即可。
在这种特殊情况下(data 是保存矩阵列的某个可迭代对象),您可以简单地使用


mat = numpy.array(data)

(还请注意,将 list 用作变量名可能不是好习惯,因为它会遮蔽同名的内置类型,这可能会导致 bug。)

编辑:

如果出于某种原因您确实想创建一个空数组,则可以使用 numpy.array([]),但这很少有用!

A NumPy array is a very different data structure from a list and is designed to be used in different ways. Your use of hstack is potentially very inefficient… every time you call it, all the data in the existing array is copied into a new one. (The append function will have the same issue.) If you want to build up your matrix one column at a time, you might be best off to keep it in a list until it is finished, and only then convert it into an array.

e.g.


mylist = []
for item in data:
    mylist.append(item)
mat = numpy.array(mylist)

item can be a list, an array or any iterable, as long as each item has the same number of elements.
In this particular case (data is some iterable holding the matrix columns) you can simply use


mat = numpy.array(data)

(Also note that using list as a variable name is probably not good practice since it masks the built-in type by that name, which can lead to bugs.)

EDIT:

If for some reason you really do want to create an empty array, you can just use numpy.array([]), but this is rarely useful!


回答 2

要在 NumPy 中创建一个空的多维数组(例如一个 m*n 的 2D 数组来存储矩阵),如果您不知道将要追加多少行 m,并且不在乎 Stephen Simmons 提到的计算成本(即每次追加都重新构建数组),您可以把要追加的那个维度压缩为 0:X = np.empty(shape=[0, n])。

例如,您可以这样使用(这里 m = 5,我们假设在创建空矩阵时并不知道它,而 n = 2):

import numpy as np

n = 2
X = np.empty(shape=[0, n])

for i in range(5):
    for j in range(2):
        X = np.append(X, [[i, j]], axis=0)

print(X)

这将为您提供:

[[ 0.  0.]
 [ 0.  1.]
 [ 1.  0.]
 [ 1.  1.]
 [ 2.  0.]
 [ 2.  1.]
 [ 3.  0.]
 [ 3.  1.]
 [ 4.  0.]
 [ 4.  1.]]

To create an empty multidimensional array in NumPy (e.g. a 2D array m*n to store your matrix), in case you don’t know how many rows m you will append and don’t care about the computational cost Stephen Simmons mentioned (namely re-building the array at each append), you can squeeze to 0 the dimension to which you want to append: X = np.empty(shape=[0, n]).

This way you can use for example (here m = 5 which we assume we didn’t know when creating the empty matrix, and n = 2):

import numpy as np

n = 2
X = np.empty(shape=[0, n])

for i in range(5):
    for j in range(2):
        X = np.append(X, [[i, j]], axis=0)

print(X)

which will give you:

[[ 0.  0.]
 [ 0.  1.]
 [ 1.  0.]
 [ 1.  1.]
 [ 2.  0.]
 [ 2.  1.]
 [ 3.  0.]
 [ 3.  1.]
 [ 4.  0.]
 [ 4.  1.]]
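
For comparison, a minimal sketch of the list-then-convert approach recommended in the earlier answers; it builds the same (10, 2) array with a single allocation instead of one copy per np.append call:

import numpy as np

rows = []
for i in range(5):
    for j in range(2):
        rows.append([i, j])
X = np.array(rows, dtype=float)
print(X)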

回答 3

我研究了很久,因为我需要在一个学校项目中把 numpy.array 当作集合来使用,而且需要把它初始化为空。我在 Stack Overflow 上没有找到相关的答案,于是自己动手试了试。

# Initialize your variable as an empty list first
In [32]: x=[]
# and now cast it as a numpy ndarray
In [33]: x=np.array(x)

结果将是:

In [34]: x
Out[34]: array([], dtype=float64)

因此,您可以按如下所示直接初始化np数组:

In [36]: x= np.array([], dtype=np.float64)

我希望这有帮助。

I looked into this a lot because I needed to use a numpy.array as a set in one of my school projects, and it needed to be initialized empty… I didn’t find any relevant answer here on Stack Overflow, so I started doodling something.

# Initialize your variable as an empty list first
In [32]: x=[]
# and now cast it as a numpy ndarray
In [33]: x=np.array(x)

The result will be:

In [34]: x
Out[34]: array([], dtype=float64)

Therefore you can directly initialize an np array as follows:

In [36]: x= np.array([], dtype=np.float64)

I hope this helps.


回答 4

您可以使用 append 函数。对于行:

>>> from numpy import *
>>> a = array([[10, 20, 30]])
>>> a = append(a, [[1, 2, 3]], axis=0)
>>> a
array([[10, 20, 30],
       [ 1,  2,  3]])

对于列:

>>> append(a, [[15], [15]], axis=1)
array([[10, 20, 30, 15],
       [ 1,  2,  3, 15]])

编辑
当然,正如其他答案中所述,除非您每次向矩阵/数组追加内容时都要对它做某种处理(例如求逆),否则我会直接创建一个列表,向列表追加,最后再把它转换为数组。

You can use the append function. For rows:

>>> from numpy import *
>>> a = array([[10, 20, 30]])
>>> a = append(a, [[1, 2, 3]], axis=0)
>>> a
array([[10, 20, 30],
       [ 1,  2,  3]])

For columns:

>>> append(a, [[15], [15]], axis=1)
array([[10, 20, 30, 15],
       [ 1,  2,  3, 15]])

EDIT
Of course, as mentioned in other answers, unless you’re doing some processing (ex. inversion) on the matrix/array EVERY time you append something to it, I would just create a list, append to it then convert it to an array.


回答 5

如果您完全不知道数组的最终大小,则可以像这样增加数组的大小:

import numpy

my_arr = numpy.zeros((0, 5))
for i in range(3):
    my_arr = numpy.concatenate((my_arr, numpy.ones((1, 5))))
print(my_arr)

[[ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]]
  • 注意第一行中的 0。
  • numpy.append 是另一种选择。它调用 numpy.concatenate。

If you absolutely don’t know the final size of the array, you can increment the size of the array like this:

import numpy

my_arr = numpy.zeros((0, 5))
for i in range(3):
    my_arr = numpy.concatenate((my_arr, numpy.ones((1, 5))))
print(my_arr)

[[ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]]
  • Notice the 0 in the first line.
  • numpy.append is another option. It calls numpy.concatenate.

回答 6

您可以用这种方式构建任何种类的数组,例如全零数组:

a = range(5)
a = [i*0 for i in a]
print(a)
[0, 0, 0, 0, 0]

You can apply it to build any kind of array, like zeros:

a = range(5)
a = [i*0 for i in a]
print(a)
[0, 0, 0, 0, 0]

回答 7

这里有一个变通方法,可以让 numpy 数组用起来更像列表:

import numpy as np

np_arr = np.array([])
np_arr = np.append(np_arr , 2)
np_arr = np.append(np_arr , 24)
print(np_arr)

输出:array([2.,24.])

Here is some workaround to make numpys look more like Lists

import numpy as np

np_arr = np.array([])
np_arr = np.append(np_arr , 2)
np_arr = np.append(np_arr , 24)
print(np_arr)

OUTPUT: array([ 2., 24.])


回答 8

根据您使用它的目的,您可能需要指定数据类型(请参见‘dtype’)。

例如,要创建一个8位值的2D数组(适合用作单色图像):

myarray = numpy.empty(shape=(H,W),dtype='u1')

对于RGB图像,在形状中包括颜色通道的数量: shape=(H,W,3)

您可能还想考虑用 numpy.zeros 进行零初始化,而不是使用 numpy.empty。请参阅此处的注释。

Depending on what you are using this for, you may need to specify the data type (see ‘dtype’).

For example, to create a 2D array of 8-bit values (suitable for use as a monochrome image):

myarray = numpy.empty(shape=(H,W),dtype='u1')

For an RGB image, include the number of color channels in the shape: shape=(H,W,3)

You may also want to consider zero-initializing with numpy.zeros instead of using numpy.empty. See the note here.
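
A minimal sketch of both variants (H and W are hypothetical image dimensions):

import numpy as np

H, W = 4, 6  # hypothetical height and width
mono = np.zeros(shape=(H, W), dtype='u1')    # 8-bit monochrome, zero-initialized
rgb = np.zeros(shape=(H, W, 3), dtype='u1')  # 8-bit RGB
print(mono.dtype, mono.shape, rgb.shape)     # uint8 (4, 6) (4, 6, 3)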


回答 9

我认为您想先用列表来完成大部分工作,然后把结果用作矩阵。也许可以这样做:

ur_list = []
for col in columns:
    ur_list.append(list(col))

mat = np.matrix(ur_list)

I think you want to handle most of the work with lists then use the result as a matrix. Maybe this is a way ;

ur_list = []
for col in columns:
    ur_list.append(list(col))

mat = np.matrix(ur_list)

回答 10

我认为您可以创建空的numpy数组,例如:

>>> import numpy as np
>>> empty_array= np.zeros(0)
>>> empty_array
array([], dtype=float64)
>>> empty_array.shape
(0,)

当您想在循环中向 numpy 数组追加元素时,这种形式很有用。

I think you can create empty numpy array like:

>>> import numpy as np
>>> empty_array= np.zeros(0)
>>> empty_array
array([], dtype=float64)
>>> empty_array.shape
(0,)

This format is useful when you want to append to a numpy array in a loop.
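
For illustration, a minimal sketch of that looping pattern (keep in mind each np.append call copies the whole array, so prefer a list for large loops):

import numpy as np

empty_array = np.zeros(0)
for i in range(3):
    empty_array = np.append(empty_array, i)  # returns a new, larger array
print(empty_array)  # [0. 1. 2.]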


回答 11

要在不定义形状的情况下创建一个空的 NumPy 数组,有两种方法:

1。

arr = np.array([]) 

首选,因为您知道自己会把它当作 numpy 数组来使用。

2。

arr = []
# and use it as numpy. append to it or etc..

之后 NumPy 会将其转换为 np.ndarray 类型,不会多出额外的 [] 维度。

For creating an empty NumPy array without defining its shape, there are two ways:

1.

arr = np.array([]) 

Preferred, because you know you will be using this as a NumPy array.

2.

arr = []
# and use it as numpy. append to it or etc..

NumPy converts this to np.ndarray type afterward, without an extra [] dimension.


Python / NumPy中的meshgrid的用途是什么?

问题:Python / NumPy中的meshgrid的用途是什么?

有人可以向我解释 NumPy 中 meshgrid 函数的目的吗?我知道它会为绘图创建某种坐标网格,但我真的看不出它的直接好处。

我正在学习 Sebastian Raschka 的《Python 机器学习》,他用它来绘制决策边界。请参阅此处的 input 11。

我也从官方文档中尝试过此代码,但是再次,输出对我来说真的没有意义。

x = np.arange(-5, 5, 1)
y = np.arange(-5, 5, 1)
xx, yy = np.meshgrid(x, y, sparse=True)
z = np.sin(xx**2 + yy**2) / (xx**2 + yy**2)
h = plt.contourf(x,y,z)

如果可能的话,还请给我展示一些真实世界的例子。

Can someone explain to me what is the purpose of meshgrid function in Numpy? I know it creates some kind of grid of coordinates for plotting, but I can’t really see the direct benefit of it.

I am studying “Python Machine Learning” from Sebastian Raschka, and he is using it for plotting the decision borders. See input 11 here.

I have also tried this code from official documentation, but, again, the output doesn’t really make sense to me.

x = np.arange(-5, 5, 1)
y = np.arange(-5, 5, 1)
xx, yy = np.meshgrid(x, y, sparse=True)
z = np.sin(xx**2 + yy**2) / (xx**2 + yy**2)
h = plt.contourf(x,y,z)

Please, if possible, also show me a lot of real-world examples.


回答 0

meshgrid 的目的是根据一个 x 值数组和一个 y 值数组创建一个矩形网格。

举例来说,假设我们要创建这样一个网格:在 x 和 y 方向上,0 到 4 之间的每个整数处都有一个点。要创建矩形网格,我们需要 x 点和 y 点的每一种组合。

这将是 25 个点,对吧?因此,如果我们想为所有这些点创建 x 和 y 数组,可以这样做。

x[0,0] = 0    y[0,0] = 0
x[0,1] = 1    y[0,1] = 0
x[0,2] = 2    y[0,2] = 0
x[0,3] = 3    y[0,3] = 0
x[0,4] = 4    y[0,4] = 0
x[1,0] = 0    y[1,0] = 1
x[1,1] = 1    y[1,1] = 1
...
x[4,3] = 3    y[4,3] = 4
x[4,4] = 4    y[4,4] = 4

这将导致以下xy矩阵,使得每个矩阵中对应元素的配对给出网格中一个点的x和y坐标。

x =   0 1 2 3 4        y =   0 0 0 0 0
      0 1 2 3 4              1 1 1 1 1
      0 1 2 3 4              2 2 2 2 2
      0 1 2 3 4              3 3 3 3 3
      0 1 2 3 4              4 4 4 4 4

然后,我们可以绘制这些图形以验证它们是否为网格:

plt.plot(x,y, marker='.', color='k', linestyle='none')

在此处输入图片说明

显然,这非常繁琐,尤其是当 x 和 y 的范围很大时。相反,meshgrid 可以为我们生成这些:我们只需指定唯一的 x 值和 y 值即可。

xvalues = np.array([0, 1, 2, 3, 4]);
yvalues = np.array([0, 1, 2, 3, 4]);

现在,当我们调用 meshgrid 时,就会自动得到前面的输出。

xx, yy = np.meshgrid(xvalues, yvalues)

plt.plot(xx, yy, marker='.', color='k', linestyle='none')

在此处输入图片说明

创建这些矩形网格对许多任务都很有用。在您帖子中提供的示例中,它只是在 x 和 y 的取值范围内对函数 sin(x**2 + y**2) / (x**2 + y**2) 进行采样的一种方法。

由于此函数已在矩形网格上采样,因此现在可以将其可视化为“图像”。

在此处输入图片说明

此外,现在可以将结果传递给期望在矩形网格上获得数据的函数(例如contourf

The purpose of meshgrid is to create a rectangular grid out of an array of x values and an array of y values.

So, for example, if we want to create a grid where we have a point at each integer value between 0 and 4 in both the x and y directions. To create a rectangular grid, we need every combination of the x and y points.

This is going to be 25 points, right? So if we wanted to create an x and y array for all of these points, we could do the following.

x[0,0] = 0    y[0,0] = 0
x[0,1] = 1    y[0,1] = 0
x[0,2] = 2    y[0,2] = 0
x[0,3] = 3    y[0,3] = 0
x[0,4] = 4    y[0,4] = 0
x[1,0] = 0    y[1,0] = 1
x[1,1] = 1    y[1,1] = 1
...
x[4,3] = 3    y[4,3] = 4
x[4,4] = 4    y[4,4] = 4

This would result in the following x and y matrices, such that the pairing of the corresponding element in each matrix gives the x and y coordinates of a point in the grid.

x =   0 1 2 3 4        y =   0 0 0 0 0
      0 1 2 3 4              1 1 1 1 1
      0 1 2 3 4              2 2 2 2 2
      0 1 2 3 4              3 3 3 3 3
      0 1 2 3 4              4 4 4 4 4

We can then plot these to verify that they are a grid:

plt.plot(x,y, marker='.', color='k', linestyle='none')

enter image description here

Obviously, this gets very tedious especially for large ranges of x and y. Instead, meshgrid can actually generate this for us: all we have to specify are the unique x and y values.

xvalues = np.array([0, 1, 2, 3, 4]);
yvalues = np.array([0, 1, 2, 3, 4]);

Now, when we call meshgrid, we get the previous output automatically.

xx, yy = np.meshgrid(xvalues, yvalues)

plt.plot(xx, yy, marker='.', color='k', linestyle='none')

enter image description here

Creation of these rectangular grids is useful for a number of tasks. In the example that you have provided in your post, it is simply a way to sample a function (sin(x**2 + y**2) / (x**2 + y**2)) over a range of values for x and y.

Because this function has been sampled on a rectangular grid, the function can now be visualized as an “image”.

enter image description here

Additionally, the result can now be passed to functions which expect data on rectangular grid (i.e. contourf)
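
To close the loop with the code from the question, a minimal sketch (assuming matplotlib is available; 100 points per axis are used so the sampling is finer than the 1-unit grid above and the origin, where the formula is 0/0, is not hit exactly):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
xx, yy = np.meshgrid(x, y)
z = np.sin(xx**2 + yy**2) / (xx**2 + yy**2)  # sampled on the rectangular grid
plt.contourf(x, y, z)
plt.colorbar()
plt.show()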


回答 1

由Microsoft Excel提供: 

在此处输入图片说明

Courtesy of Microsoft Excel: 

enter image description here


回答 2

实际上,np.meshgrid 的目的在文档中已经提到:

np.meshgrid

从坐标向量返回坐标矩阵。

给定一维坐标数组 x1, x2, …, xn,生成 N 维坐标数组,以便在 N 维网格上对 N 维标量/向量场进行向量化求值。

因此,其主要目的是创建坐标矩阵。

您可能只是问自己:

为什么我们需要创建坐标矩阵?

在 Python/NumPy 中需要坐标矩阵的原因是,坐标和值之间没有直接的对应关系,除非您的坐标从零开始并且是纯正整数。那时您可以直接把数组的索引当作坐标。但如果不是这种情况,您就需要以某种方式把坐标和数据一起存储。这就是网格的用武之地。

假设您的数据是:

1  2  1
2  5  2
1  2  1

但是,每个值代表一个水平 2 公里、垂直 3 公里的区域。假设原点在左上角,并且您想要表示距离的数组,您可以使用:

import numpy as np
h, v = np.meshgrid(np.arange(3)*3, np.arange(3)*2)

其中v是:

array([[0, 0, 0],
       [2, 2, 2],
       [4, 4, 4]])

和h:

array([[0, 3, 6],
       [0, 3, 6],
       [0, 3, 6]])

所以,如果你有两个索引,比方说 x 和 y(这就是为什么 meshgrid 的返回值通常写作 xx 或 xs 而不是 x;在这个例子里,我选择 h 表示水平方向!),那么你可以用以下方法得到该点的 x 坐标、y 坐标以及该点处的值:

h[x, y]    # horizontal coordinate
v[x, y]    # vertical coordinate
data[x, y]  # value

这样可以更轻松地跟踪坐标,而且(更重要的是)您可以把它们传递给需要知道坐标的函数。

稍长的解释

但是,np.meshgrid 本身并不经常被直接使用,大多数人使用的是类似的对象 np.mgrid 或 np.ogrid 之一。其中 np.mgrid 对应 sparse=False,np.ogrid 对应 sparse=True 的情况(我指的是 np.meshgrid 的 sparse 参数)。请注意,np.meshgrid 与 np.ogrid、np.mgrid 之间有一个显著差异:返回的前两个值(如果有两个或更多)是颠倒的。通常这无关紧要,但您应该根据上下文给变量起有意义的名字。

例如,对于 2D 网格和 matplotlib.pyplot.imshow,把 np.meshgrid 返回的第一项命名为 x、第二项命名为 y 是合理的,而对于 np.mgrid 和 np.ogrid 则正好相反。

np.ogrid 和稀疏的网格

>>> import numpy as np
>>> yy, xx = np.ogrid[-5:6, -5:6]
>>> xx
array([[-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5]])
>>> yy
array([[-5],
       [-4],
       [-3],
       [-2],
       [-1],
       [ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5]])

正如前面所说,与 np.meshgrid 相比,输出是相反的,这就是为什么我解包为 yy, xx 而不是 xx, yy:

>>> xx, yy = np.meshgrid(np.arange(-5, 6), np.arange(-5, 6), sparse=True)
>>> xx
array([[-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5]])
>>> yy
array([[-5],
       [-4],
       [-3],
       [-2],
       [-1],
       [ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5]])

这已经看起来像坐标了,特别是 2D 图的 x 和 y 线。

可视化:

yy, xx = np.ogrid[-5:6, -5:6]
plt.figure()
plt.title('ogrid (sparse meshgrid)')
plt.grid()
plt.xticks(xx.ravel())
plt.yticks(yy.ravel())
plt.scatter(xx, np.zeros_like(xx), color="blue", marker="*")
plt.scatter(np.zeros_like(yy), yy, color="red", marker="x")

在此处输入图片说明

np.mgrid 和密集/完整的网格

>>> yy, xx = np.mgrid[-5:6, -5:6]
>>> xx
array([[-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5]])
>>> yy
array([[-5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5],
       [-4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4],
       [-3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3],
       [-2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2],
       [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3],
       [ 4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4],
       [ 5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5]])

此处同样适用:与 np.meshgrid 相比,输出是反转的:

>>> xx, yy = np.meshgrid(np.arange(-5, 6), np.arange(-5, 6))
>>> xx
array([[-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5]])
>>> yy
array([[-5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5],
       [-4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4],
       [-3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3],
       [-2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2],
       [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3],
       [ 4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4],
       [ 5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5]])

与 ogrid 不同,这些数组包含了 -5 <= xx <= 5、-5 <= yy <= 5 网格中的所有 xx 和 yy 坐标。

yy, xx = np.mgrid[-5:6, -5:6]
plt.figure()
plt.title('mgrid (dense meshgrid)')
plt.grid()
plt.xticks(xx[0])
plt.yticks(yy[:, 0])
plt.scatter(xx, yy, color="red", marker="x")

在此处输入图片说明

功能性

它不仅限于二维,这些函数适用于任意维度(好吧,Python 中函数的参数个数有上限,NumPy 允许的维度数也有上限):

>>> x1, x2, x3, x4 = np.ogrid[:3, 1:4, 2:5, 3:6]
>>> for i, x in enumerate([x1, x2, x3, x4]):
...     print('x{}'.format(i+1))
...     print(repr(x))
x1
array([[[[0]]],


       [[[1]]],


       [[[2]]]])
x2
array([[[[1]],

        [[2]],

        [[3]]]])
x3
array([[[[2],
         [3],
         [4]]]])
x4
array([[[[3, 4, 5]]]])

>>> # equivalent meshgrid output, note how the first two arguments are reversed and the unpacking
>>> x2, x1, x3, x4 = np.meshgrid(np.arange(1,4), np.arange(3), np.arange(2, 5), np.arange(3, 6), sparse=True)
>>> for i, x in enumerate([x1, x2, x3, x4]):
...     print('x{}'.format(i+1))
...     print(repr(x))
# Identical output so it's omitted here.

即使这些函数也适用于一维,还有两个(更为常见的)一维网格创建函数:

除了 start 和 stop 参数外,它还支持 step 参数(甚至支持表示步数的复数步长):

>>> x1, x2 = np.mgrid[1:10:2, 1:10:4j]
>>> x1  # The dimension with the explicit step width of 2
array([[1., 1., 1., 1.],
       [3., 3., 3., 3.],
       [5., 5., 5., 5.],
       [7., 7., 7., 7.],
       [9., 9., 9., 9.]])
>>> x2  # The dimension with the "number of steps"
array([[ 1.,  4.,  7., 10.],
       [ 1.,  4.,  7., 10.],
       [ 1.,  4.,  7., 10.],
       [ 1.,  4.,  7., 10.],
       [ 1.,  4.,  7., 10.]])

应用领域

您专门问到了用途:实际上,如果您需要一个坐标系,这些网格就非常有用。

例如,如果您有一个NumPy函数,它可以在两个维度上计算距离:

def distance_2d(x_point, y_point, x, y):
    return np.hypot(x-x_point, y-y_point)

并且您想知道每个点到某个点的距离:

>>> ys, xs = np.ogrid[-5:5, -5:5]
>>> distances = distance_2d(1, 2, xs, ys)  # distance to point (1, 2)
>>> distances
array([[9.21954446, 8.60232527, 8.06225775, 7.61577311, 7.28010989,
        7.07106781, 7.        , 7.07106781, 7.28010989, 7.61577311],
       [8.48528137, 7.81024968, 7.21110255, 6.70820393, 6.32455532,
        6.08276253, 6.        , 6.08276253, 6.32455532, 6.70820393],
       [7.81024968, 7.07106781, 6.40312424, 5.83095189, 5.38516481,
        5.09901951, 5.        , 5.09901951, 5.38516481, 5.83095189],
       [7.21110255, 6.40312424, 5.65685425, 5.        , 4.47213595,
        4.12310563, 4.        , 4.12310563, 4.47213595, 5.        ],
       [6.70820393, 5.83095189, 5.        , 4.24264069, 3.60555128,
        3.16227766, 3.        , 3.16227766, 3.60555128, 4.24264069],
       [6.32455532, 5.38516481, 4.47213595, 3.60555128, 2.82842712,
        2.23606798, 2.        , 2.23606798, 2.82842712, 3.60555128],
       [6.08276253, 5.09901951, 4.12310563, 3.16227766, 2.23606798,
        1.41421356, 1.        , 1.41421356, 2.23606798, 3.16227766],
       [6.        , 5.        , 4.        , 3.        , 2.        ,
        1.        , 0.        , 1.        , 2.        , 3.        ],
       [6.08276253, 5.09901951, 4.12310563, 3.16227766, 2.23606798,
        1.41421356, 1.        , 1.41421356, 2.23606798, 3.16227766],
       [6.32455532, 5.38516481, 4.47213595, 3.60555128, 2.82842712,
        2.23606798, 2.        , 2.23606798, 2.82842712, 3.60555128]])

如果传入密集网格而不是开放网格,输出将是相同的。NumPy 的广播机制使之成为可能!

让我们可视化结果:

plt.figure()
plt.title('distance to point (1, 2)')
plt.imshow(distances, origin='lower', interpolation="none")
plt.xticks(np.arange(xs.shape[1]), xs.ravel())  # need to set the ticks manually
plt.yticks(np.arange(ys.shape[0]), ys.ravel())
plt.colorbar()

在此处输入图片说明

这也是 NumPy 的 mgrid 和 ogrid 非常方便的地方,因为它们让您可以轻松更改网格的分辨率:

ys, xs = np.ogrid[-5:5:200j, -5:5:200j]
# otherwise same code as above

在此处输入图片说明

但是,由于 imshow 不支持 x 和 y 输入,因此必须手动更改刻度。如果它能接受 x 和 y 坐标,那会非常方便,对吗?

使用 NumPy 编写能自然处理网格的函数很容易。此外,NumPy、SciPy、matplotlib 中有一些函数期望您传入网格。

我喜欢图片,因此让我们来探索一下matplotlib.pyplot.contour

ys, xs = np.mgrid[-5:5:200j, -5:5:200j]
density = np.sin(ys)-np.cos(xs)
plt.figure()
plt.contour(xs, ys, density)

在此处输入图片说明

注意如何正确设置坐标!如果您只是传入,则不会是这种情况density

或举另一个使用天体模型的有趣示例(这次我不太在乎坐标,我只是使用它们来创建一些网格):

from astropy.modeling import models
z = np.zeros((100, 100))
y, x = np.mgrid[0:100, 0:100]
for _ in range(10):
    g2d = models.Gaussian2D(amplitude=100, 
                           x_mean=np.random.randint(0, 100), 
                           y_mean=np.random.randint(0, 100), 
                           x_stddev=3, 
                           y_stddev=3)
    z += g2d(x, y)
    a2d = models.AiryDisk2D(amplitude=70, 
                            x_0=np.random.randint(0, 100), 
                            y_0=np.random.randint(0, 100), 
                            radius=5)
    z += a2d(x, y)

在此处输入图片说明

尽管那只是为了“外观”,但与Scipy中的功能模型和拟合(例如scipy.interpolate.interp2dscipy.interpolate.griddata甚至显示示例使用np.mgrid)相关的一些功能仍需要网格。其中大多数都适用于开放式网格和密集型网格,但是有些仅适用于其中之一。

Actually the purpose of np.meshgrid is already mentioned in the documentation:

np.meshgrid

Return coordinate matrices from coordinate vectors.

Make N-D coordinate arrays for vectorized evaluations of N-D scalar/vector fields over N-D grids, given one-dimensional coordinate arrays x1, x2,…, xn.

So its primary purpose is to create coordinate matrices.

You probably just asked yourself:

Why do we need to create coordinate matrices?

The reason you need coordinate matrices with Python/NumPy is that there is no direct relation from coordinates to values, except when your coordinates start at zero and are non-negative integers. Then you can just use the indices of an array as the index. However, when that’s not the case, you somehow need to store coordinates alongside your data. That’s where grids come in.

Suppose your data is:

1  2  1
2  5  2
1  2  1

However, suppose each value represents a region 3 kilometers wide horizontally and 2 kilometers tall vertically (matching the spacings used in the code below). Suppose your origin is the upper-left corner and you want arrays that represent the distances. You could use:

import numpy as np
h, v = np.meshgrid(np.arange(3)*3, np.arange(3)*2)

where v is:

array([[0, 0, 0],
       [2, 2, 2],
       [4, 4, 4]])

and h:

array([[0, 3, 6],
       [0, 3, 6],
       [0, 3, 6]])

So if you have two indices, let’s say x and y (which is why the return value of meshgrid is usually xx or xs instead of x; in this case I chose h for horizontal!), then you can get the x coordinate of the point, the y coordinate of the point, and the value at that point by using:

h[x, y]    # horizontal coordinate
v[x, y]    # vertical coordinate
data[x, y]  # value

That makes it much easier to keep track of coordinates and (even more importantly) you can pass them to functions that need to know the coordinates.
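
As a minimal runnable sketch of the lookup above (using the hypothetical 3×3 data from this example):

import numpy as np

# the example data: each cell is a measurement over a rectangular region
data = np.array([[1, 2, 1],
                 [2, 5, 2],
                 [1, 2, 1]])

# horizontal spacing of 3 km and vertical spacing of 2 km, as in the code above
h, v = np.meshgrid(np.arange(3) * 3, np.arange(3) * 2)

x, y = 1, 2  # row and column indices of one grid cell
print(h[x, y], v[x, y], data[x, y])  # 6 2 2: coordinates and value of that cell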

A slightly longer explanation

However, np.meshgrid itself isn’t often used directly; mostly one just uses one of the similar objects np.mgrid or np.ogrid. Here np.mgrid represents the sparse=False case and np.ogrid the sparse=True case (I refer to the sparse argument of np.meshgrid). Note that there is a significant difference between np.meshgrid and np.ogrid and np.mgrid: the first two returned values (if there are two or more) are reversed. Often this doesn’t matter, but you should give meaningful variable names depending on the context.

For example, in case of a 2D grid and matplotlib.pyplot.imshow it makes sense to name the first returned item of np.meshgrid x and the second one y while it’s the other way around for np.mgrid and np.ogrid.

np.ogrid and sparse grids

>>> import numpy as np
>>> yy, xx = np.ogrid[-5:6, -5:6]
>>> xx
array([[-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5]])
>>> yy
array([[-5],
       [-4],
       [-3],
       [-2],
       [-1],
       [ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5]])
       

As already said the output is reversed when compared to np.meshgrid, that’s why I unpacked it as yy, xx instead of xx, yy:

>>> xx, yy = np.meshgrid(np.arange(-5, 6), np.arange(-5, 6), sparse=True)
>>> xx
array([[-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5]])
>>> yy
array([[-5],
       [-4],
       [-3],
       [-2],
       [-1],
       [ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5]])

This already looks like coordinates, specifically the x and y lines for 2D plots.
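
The sparse form is sufficient because broadcasting fills in the missing axis whenever the two arrays are combined; a quick sketch:

import numpy as np

yy, xx = np.ogrid[-5:6, -5:6]
# xx has shape (1, 11) and yy has shape (11, 1); combining them
# broadcasts to the full (11, 11) grid without ever storing it
squared = xx**2 + yy**2
print(squared.shape)  # (11, 11)
print(squared[0, 0])  # 50, i.e. (-5)**2 + (-5)**2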

Visualized:

import matplotlib.pyplot as plt

yy, xx = np.ogrid[-5:6, -5:6]
plt.figure()
plt.title('ogrid (sparse meshgrid)')
plt.grid()
plt.xticks(xx.ravel())
plt.yticks(yy.ravel())
plt.scatter(xx, np.zeros_like(xx), color="blue", marker="*")
plt.scatter(np.zeros_like(yy), yy, color="red", marker="x")

[figure: scatter plot titled 'ogrid (sparse meshgrid)' showing the sparse coordinate lines along the axes]

np.mgrid and dense/fleshed out grids

>>> yy, xx = np.mgrid[-5:6, -5:6]
>>> xx
array([[-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5]])
>>> yy
array([[-5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5],
       [-4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4],
       [-3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3],
       [-2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2],
       [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3],
       [ 4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4],
       [ 5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5]])
       

The same applies here: The output is reversed compared to np.meshgrid:

>>> xx, yy = np.meshgrid(np.arange(-5, 6), np.arange(-5, 6))
>>> xx
array([[-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5]])
>>> yy
array([[-5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5],
       [-4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4],
       [-3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3],
       [-2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2],
       [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3],
       [ 4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4],
       [ 5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5]])
       

Unlike ogrid these arrays contain all xx and yy coordinates in the -5 <= xx <= 5; -5 <= yy <= 5 grid.

yy, xx = np.mgrid[-5:6, -5:6]
plt.figure()
plt.title('mgrid (dense meshgrid)')
plt.grid()
plt.xticks(xx[0])
plt.yticks(yy[:, 0])
plt.scatter(xx, yy, color="red", marker="x")

[figure: scatter plot titled 'mgrid (dense meshgrid)' showing all grid points]
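
A small sketch contrasting the two variants; broadcasting the open grid reproduces the dense one exactly:

import numpy as np

yy_o, xx_o = np.ogrid[-5:6, -5:6]   # open (sparse) grid
yy_d, xx_d = np.mgrid[-5:6, -5:6]   # dense (fleshed out) grid

print(xx_o.shape, yy_o.shape)  # (1, 11) (11, 1)
print(xx_d.shape, yy_d.shape)  # (11, 11) (11, 11)

# the dense arrays are just the broadcast versions of the open ones
print(np.array_equal(np.broadcast_to(xx_o, xx_d.shape), xx_d))  # True
print(np.array_equal(np.broadcast_to(yy_o, yy_d.shape), yy_d))  # True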

Functionality

It’s not limited to 2D; these functions work for arbitrary dimensions (well, there is a maximum number of arguments that can be given to a function in Python and a maximum number of dimensions that NumPy allows):

>>> x1, x2, x3, x4 = np.ogrid[:3, 1:4, 2:5, 3:6]
>>> for i, x in enumerate([x1, x2, x3, x4]):
...     print('x{}'.format(i+1))
...     print(repr(x))
x1
array([[[[0]]],


       [[[1]]],


       [[[2]]]])
x2
array([[[[1]],

        [[2]],

        [[3]]]])
x3
array([[[[2],
         [3],
         [4]]]])
x4
array([[[[3, 4, 5]]]])

>>> # equivalent meshgrid output; note how the first two arguments (and the unpacking) are reversed
>>> x2, x1, x3, x4 = np.meshgrid(np.arange(1,4), np.arange(3), np.arange(2, 5), np.arange(3, 6), sparse=True)
>>> for i, x in enumerate([x1, x2, x3, x4]):
...     print('x{}'.format(i+1))
...     print(repr(x))
# Identical output so it's omitted here.

Even though these also work for 1D, there are two (much more common) 1D grid creation functions: np.arange and np.linspace.

Besides the start and stop arguments, the mgrid/ogrid slice notation also supports a step argument (even complex steps, which represent the number of steps):

>>> x1, x2 = np.mgrid[1:10:2, 1:10:4j]
>>> x1  # The dimension with the explicit step width of 2
array([[1., 1., 1., 1.],
       [3., 3., 3., 3.],
       [5., 5., 5., 5.],
       [7., 7., 7., 7.],
       [9., 9., 9., 9.]])
>>> x2  # The dimension with the "number of steps"
array([[ 1.,  4.,  7., 10.],
       [ 1.,  4.,  7., 10.],
       [ 1.,  4.,  7., 10.],
       [ 1.,  4.,  7., 10.],
       [ 1.,  4.,  7., 10.]])
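
For comparison, a short sketch of how these slices line up with np.arange and np.linspace:

import numpy as np

# an explicit step behaves like np.arange (the stop value is excluded)
print(np.array_equal(np.mgrid[1:10:2], np.arange(1, 10, 2)))  # True

# a complex step behaves like np.linspace (the stop value is included)
print(np.allclose(np.mgrid[1:10:4j], np.linspace(1, 10, 4)))  # True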
       

Applications

You specifically asked about the purpose, and in fact these grids are extremely useful if you need a coordinate system.

For example if you have a NumPy function that calculates the distance in two dimensions:

def distance_2d(x_point, y_point, x, y):
    return np.hypot(x-x_point, y-y_point)
    

And you want to know the distance of each point:

>>> ys, xs = np.ogrid[-5:5, -5:5]
>>> distances = distance_2d(1, 2, xs, ys)  # distance to point (1, 2)
>>> distances
array([[9.21954446, 8.60232527, 8.06225775, 7.61577311, 7.28010989,
        7.07106781, 7.        , 7.07106781, 7.28010989, 7.61577311],
       [8.48528137, 7.81024968, 7.21110255, 6.70820393, 6.32455532,
        6.08276253, 6.        , 6.08276253, 6.32455532, 6.70820393],
       [7.81024968, 7.07106781, 6.40312424, 5.83095189, 5.38516481,
        5.09901951, 5.        , 5.09901951, 5.38516481, 5.83095189],
       [7.21110255, 6.40312424, 5.65685425, 5.        , 4.47213595,
        4.12310563, 4.        , 4.12310563, 4.47213595, 5.        ],
       [6.70820393, 5.83095189, 5.        , 4.24264069, 3.60555128,
        3.16227766, 3.        , 3.16227766, 3.60555128, 4.24264069],
       [6.32455532, 5.38516481, 4.47213595, 3.60555128, 2.82842712,
        2.23606798, 2.        , 2.23606798, 2.82842712, 3.60555128],
       [6.08276253, 5.09901951, 4.12310563, 3.16227766, 2.23606798,
        1.41421356, 1.        , 1.41421356, 2.23606798, 3.16227766],
       [6.        , 5.        , 4.        , 3.        , 2.        ,
        1.        , 0.        , 1.        , 2.        , 3.        ],
       [6.08276253, 5.09901951, 4.12310563, 3.16227766, 2.23606798,
        1.41421356, 1.        , 1.41421356, 2.23606798, 3.16227766],
       [6.32455532, 5.38516481, 4.47213595, 3.60555128, 2.82842712,
        2.23606798, 2.        , 2.23606798, 2.82842712, 3.60555128]])
        

The output would be identical if one passed in a dense grid instead of an open grid. NumPy’s broadcasting makes it possible!
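
A quick verification of that claim, reusing the distance_2d function defined above:

ys_open, xs_open = np.ogrid[-5:5, -5:5]
ys_dense, xs_dense = np.mgrid[-5:5, -5:5]
same = np.array_equal(distance_2d(1, 2, xs_open, ys_open),
                      distance_2d(1, 2, xs_dense, ys_dense))
print(same)  # True: broadcasting expands the open grid to the dense result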

Let’s visualize the result:

plt.figure()
plt.title('distance to point (1, 2)')
plt.imshow(distances, origin='lower', interpolation="none")
plt.xticks(np.arange(xs.shape[1]), xs.ravel())  # need to set the ticks manually
plt.yticks(np.arange(ys.shape[0]), ys.ravel())
plt.colorbar()

[figure: imshow heatmap titled 'distance to point (1, 2)']

And this is also where NumPy’s mgrid and ogrid become very convenient, because they let you easily change the resolution of your grids:

ys, xs = np.ogrid[-5:5:200j, -5:5:200j]
# otherwise same code as above

[figure: the same distance plot at 200×200 resolution]

However, since imshow doesn’t support x and y inputs, one has to change the ticks by hand. It would be really convenient if it accepted the x and y coordinates, right?

It’s easy to write functions with NumPy that deal naturally with grids. Furthermore, there are several functions in NumPy, SciPy, and matplotlib that expect you to pass in the grid.

I like images so let’s explore matplotlib.pyplot.contour:

ys, xs = np.mgrid[-5:5:200j, -5:5:200j]
density = np.sin(ys)-np.cos(xs)
plt.figure()
plt.contour(xs, ys, density)

[figure: contour plot of the density on the mgrid coordinates]

Note how the coordinates are already correctly set! That wouldn’t be the case if you just passed in the density.

Or to give another fun example using astropy models (this time I don’t care much about the coordinates, I just use them to create some grid):

from astropy.modeling import models
z = np.zeros((100, 100))
y, x = np.mgrid[0:100, 0:100]
for _ in range(10):
    g2d = models.Gaussian2D(amplitude=100, 
                           x_mean=np.random.randint(0, 100), 
                           y_mean=np.random.randint(0, 100), 
                           x_stddev=3, 
                           y_stddev=3)
    z += g2d(x, y)
    a2d = models.AiryDisk2D(amplitude=70, 
                            x_0=np.random.randint(0, 100), 
                            y_0=np.random.randint(0, 100), 
                            radius=5)
    z += a2d(x, y)
    

[figure: rendering of the combined Gaussian and Airy-disk models]

Although that’s just “for the looks”, several functions related to functional models and fitting in SciPy and elsewhere require grids (for example scipy.interpolate.interp2d; scipy.interpolate.griddata even shows examples using np.mgrid). Most of these work with both open grids and dense grids; however, some only work with one of them.
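
As one hedged illustration of such a consumer, scipy.interpolate.griddata can resample scattered samples onto a dense mgrid (the sample data below is made up for the sketch):

import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)
points = rng.uniform(-5, 5, size=(200, 2))            # scattered (x, y) locations
values = np.sin(points[:, 0]) * np.cos(points[:, 1])  # samples at those locations

# dense target grid, 100x100 resolution via the complex-step notation
grid_y, grid_x = np.mgrid[-5:5:100j, -5:5:100j]
resampled = griddata(points, values, (grid_x, grid_y), method='linear')
print(resampled.shape)  # (100, 100); NaN outside the convex hull of the samples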


Answer 3

Suppose you have a function:

def sinus2d(x, y):
    return np.sin(x) + np.sin(y)

and you want, for example, to see what it looks like in the range 0 to 2*pi. How would you do it? That’s where np.meshgrid comes in:

xx, yy = np.meshgrid(np.linspace(0,2*np.pi,100), np.linspace(0,2*np.pi,100))
z = sinus2d(xx, yy) # Create the image on this grid

and such a plot would look like:

import matplotlib.pyplot as plt
plt.imshow(z, origin='lower', interpolation='none')
plt.show()

[figure: imshow of the sinus2d values on the grid]

So np.meshgrid is just a convenience. In principle the same could be done by:

z2 = sinus2d(np.linspace(0,2*np.pi,100)[:,None], np.linspace(0,2*np.pi,100)[None,:])

but there you need to be aware of your dimensions (suppose you have more than two …) and the right broadcasting. np.meshgrid does all of this for you.
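
A quick sanity check (a sketch reusing sinus2d from above) that the manual broadcasting matches the meshgrid result; note it relies on the grid being the same in x and y:

import numpy as np

xs = np.linspace(0, 2*np.pi, 100)
xx, yy = np.meshgrid(xs, xs)
z = sinus2d(xx, yy)
z2 = sinus2d(xs[:, None], xs[None, :])
print(np.allclose(z, z2))  # True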

Also meshgrid allows you to delete coordinates together with the data if you, for example, want to do an interpolation but exclude certain values:

condition = z>0.6
z_new = z[condition] # This will make your array 1D

So how would you do the interpolation now? You can give x and y to an interpolation function like scipy.interpolate.interp2d, but then you need a way to know which coordinates were deleted:

x_new = xx[condition]
y_new = yy[condition]

and then you can still interpolate with the “right” coordinates (try it without the meshgrid and you will have a lot of extra code):

from scipy.interpolate import interp2d
interpolated = interp2d(x_new, y_new, z_new)

and the original meshgrid allows you to get the interpolation on the original grid again:

interpolated_grid = interpolated(xx[0], yy[:, 0]).reshape(xx.shape)

These are just some examples where I used meshgrid; there might be a lot more.


Answer 4

meshgrid helps in creating a rectangular grid from two 1-D arrays of all pairs of points from the two arrays.

x = np.array([0, 1, 2, 3, 4])
y = np.array([0, 1, 2, 3, 4])

Now, if you have defined a function f(x, y) and you want to apply this function to all possible combinations of points from the arrays x and y, you can do this:

f(*np.meshgrid(x, y))

Say your function just produces the product of two elements; then this is how a Cartesian product can be achieved efficiently for large arrays (a sketch follows below).

Referred from here
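
For instance, a minimal sketch of the product case mentioned above (the helper f below is hypothetical, only to make the idea concrete):

import numpy as np

x = np.array([0, 1, 2, 3, 4])
y = np.array([0, 1, 2, 3, 4])

def f(a, b):
    # elementwise product; broadcasting applies it to every (x, y) pair
    return a * b

table = f(*np.meshgrid(x, y))
print(table[2, 3])  # 6, i.e. x[3] * y[2] under the default indexing='xy'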


Answer 5

Basic Idea

Given possible x values, xs (think of them as the tick-marks on the x-axis of a plot), and possible y values, ys, meshgrid generates the corresponding set of (x, y) grid points—analogous to set((x, y) for x in xs for y in ys). For example, if xs=[1,2,3] and ys=[4,5,6], we’d get the set of coordinates {(1,4), (2,4), (3,4), (1,5), (2,5), (3,5), (1,6), (2,6), (3,6)}.

Form of the Return Value

However, the representation that meshgrid returns is different from the above expression in two ways:

First, meshgrid lays out the grid points in a 2d array: rows correspond to different y-values, columns correspond to different x-values—as in list(list((x, y) for x in xs) for y in ys), which would give the following array:

   [[(1,4), (2,4), (3,4)],
    [(1,5), (2,5), (3,5)],
    [(1,6), (2,6), (3,6)]]

Second, meshgrid returns the x and y coordinates separately (i.e. in two different numpy 2d arrays):

   xcoords, ycoords = (
       array([[1, 2, 3],
              [1, 2, 3],
              [1, 2, 3]]),
       array([[4, 4, 4],
              [5, 5, 5],
              [6, 6, 6]]))
   # same thing using np.meshgrid:
   xcoords, ycoords = np.meshgrid([1,2,3], [4,5,6])
   # same thing without meshgrid:
   xcoords = np.array([xs] * len(ys))
   ycoords = np.array([ys] * len(xs)).T

Note that np.meshgrid can also generate grids for higher dimensions. Given xs, ys, and zs, you’d get back xcoords, ycoords, zcoords as 3d arrays. meshgrid also supports reverse ordering of the dimensions as well as a sparse representation of the result.

Applications

Why would we want this form of output?

Apply a function at every point on a grid: One motivation is that binary operators like (+, -, *, /, **) are overloaded for numpy arrays as elementwise operations. This means that if I have a function def f(x, y): return (x - y) ** 2 that works on two scalars, I can also apply it on two numpy arrays to get an array of elementwise results: e.g. f(xcoords, ycoords) or f(*np.meshgrid(xs, ys)) gives the following on the above example:

array([[ 9,  4,  1],
       [16,  9,  4],
       [25, 16,  9]])

Higher dimensional outer product: I’m not sure how efficient this is, but you can get high-dimensional outer products this way: np.prod(np.meshgrid([1,2,3], [1,2], [1,2,3,4]), axis=0) (see the check after this list).

Contour plots in matplotlib: I came across meshgrid when investigating drawing contour plots with matplotlib for plotting decision boundaries. For this, you generate a grid with meshgrid, evaluate the function at each grid point (e.g. as shown above), and then pass the xcoords, ycoords, and computed f-values (i.e. zcoords) into the contourf function.
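
Returning to the outer-product item above, a hedged check: using indexing='ij' keeps the axes in argument order so the result can be compared against an einsum reference:

import numpy as np

a, b, c = [1, 2, 3], [1, 2], [1, 2, 3, 4]

# stack the three coordinate arrays and multiply them elementwise
outer = np.prod(np.meshgrid(a, b, c, indexing='ij'), axis=0)

# reference outer product with explicit axis labels
reference = np.einsum('i,j,k->ijk', a, b, c)
print(outer.shape, np.array_equal(outer, reference))  # (3, 2, 4) True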


How do I convert a 2D float numpy array to a 2D int numpy array?

Question: How do I convert a 2D float numpy array to a 2D int numpy array?

How do I convert a real-valued numpy array to an int numpy array? I tried using map directly on the array, but it did not work.


Answer 0

Use the astype method.

>>> x = np.array([[1.0, 2.3], [1.3, 2.9]])
>>> x
array([[ 1. ,  2.3],
       [ 1.3,  2.9]])
>>> x.astype(int)
array([[1, 2],
       [1, 2]])
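
One caveat worth knowing: astype(int) truncates toward zero rather than rounding, so combine it with np.rint (see the next answer) if you want the nearest integer. A small sketch:

>>> x = np.array([[1.7, -1.7], [2.99, -0.5]])
>>> x.astype(int)
array([[ 1, -1],
       [ 2,  0]])
>>> np.rint(x).astype(int)
array([[ 2, -2],
       [ 3,  0]])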

Answer 1

Some numpy functions let you control the rounding: rint, floor, trunc, and ceil, depending on whether you wish to round the floats up, down, toward zero, or to the nearest int.

>>> x = np.array([[1.0,2.3],[1.3,2.9]])
>>> x
array([[ 1. ,  2.3],
       [ 1.3,  2.9]])
>>> y = np.trunc(x)
>>> y
array([[ 1.,  2.],
       [ 1.,  2.]])
>>> z = np.ceil(x)
>>> z
array([[ 1.,  3.],
       [ 2.,  3.]])
>>> t = np.floor(x)
>>> t
array([[ 1.,  2.],
       [ 1.,  2.]])
>>> a = np.rint(x)
>>> a
array([[ 1.,  2.],
       [ 1.,  3.]])

To convert one of these to int, or to one of the other types in numpy, use astype (as answered by BrenBern):

a.astype(int)
array([[1, 2],
       [1, 3]])

>>> y.astype(int)
array([[1, 2],
       [1, 2]])

Answer 2

You can use np.int_:

>>> x = np.array([[1.0, 2.3], [1.3, 2.9]])
>>> x
array([[ 1. ,  2.3],
       [ 1.3,  2.9]])
>>> np.int_(x)
array([[1, 2],
       [1, 2]])

Answer 3

If you’re not sure your input is going to be a Numpy array, you can use asarray with dtype=int instead of astype:

>>> np.asarray([1,2,3,4], dtype=int)
array([1, 2, 3, 4])

If the input array already has the correct dtype, asarray avoids the array copy while astype does not (unless you specify copy=False):

>>> a = np.array([1,2,3,4])
>>> a is np.asarray(a)  # no copy :)
True
>>> a is a.astype(int)  # copy :(
False
>>> a is a.astype(int, copy=False)  # no copy :)
True
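
Note that asarray only avoids the copy when the dtype already matches; if a conversion is needed it copies just like astype would. A small sketch:

>>> a = np.array([1.5, 2.5])
>>> b = np.asarray(a, dtype=int)  # dtype differs, so this makes a copy
>>> b
array([1, 2])
>>> np.shares_memory(a, b)
False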

What is the difference between Numpy's array() and asarray() functions?

Question: What is the difference between Numpy's array() and asarray() functions?

What is the difference between Numpy’s array() and asarray() functions? When should you use one rather than the other? They seem to generate identical output for all the inputs I can think of.


Answer 0

Since other questions are being redirected to this one which ask about asanyarray or other array creation routines, it’s probably worth having a brief summary of what each of them does.

The differences are mainly about when to return the input unchanged, as opposed to making a new array as a copy.

array offers a wide variety of options (most of the other functions are thin wrappers around it), including flags to determine when to copy. A full explanation would take just as long as the docs (see Array Creation), but briefly, here are some examples:

Assume a is an ndarray, and m is a matrix, and they both have a dtype of float32:

  • np.array(a) and np.array(m) will copy both, because that’s the default behavior.
  • np.array(a, copy=False) and np.array(m, copy=False) will copy m but not a, because m is not an ndarray.
  • np.array(a, copy=False, subok=True) and np.array(m, copy=False, subok=True) will copy neither, because m is a matrix, which is a subclass of ndarray.
  • np.array(a, dtype=int, copy=False, subok=True) will copy both, because the dtype is not compatible.

Most of the other functions are thin wrappers around array that control when copying happens:

  • asarray: The input will be returned uncopied iff it’s a compatible ndarray (copy=False).
  • asanyarray: The input will be returned uncopied iff it’s a compatible ndarray or subclass like matrix (copy=False, subok=True).
  • ascontiguousarray: The input will be returned uncopied iff it’s a compatible ndarray in contiguous C order (copy=False, order='C').
  • asfortranarray: The input will be returned uncopied iff it’s a compatible ndarray in contiguous Fortran order (copy=False, order='F').
  • require: The input will be returned uncopied iff it’s compatible with the specified requirements string.
  • copy: The input is always copied.
  • fromiter: The input is treated as an iterable (so, e.g., you can construct an array from an iterator’s elements, instead of an object array with the iterator); always copied.

There are also convenience functions, like asarray_chkfinite (same copying rules as asarray, but raises ValueError if there are any nan or inf values), and constructors for subclasses like matrix or for special cases like record arrays, and of course the actual ndarray constructor (which lets you create an array directly out of strides over a buffer).
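
A couple of the bullets above can be verified directly (a minimal sketch; note that NumPy 2.0 changed np.array(..., copy=False) to raise an error when a copy would be required, so the flag behavior described here reflects older releases):

>>> import numpy as np
>>> a = np.zeros(3, dtype=np.float32)
>>> np.asarray(a) is a  # compatible ndarray: returned uncopied
True
>>> np.asarray(a, dtype=np.float64) is a  # dtype mismatch: a copy is made
False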


Answer 1

The definition of asarray is:

def asarray(a, dtype=None, order=None):
    return array(a, dtype, copy=False, order=order)

So it is like array, except it has fewer options, and copy=False. array has copy=True by default.

The main difference is that array (by default) will make a copy of the object, while asarray will not unless necessary.


Answer 2

The difference can be demonstrated by this example:

  1. generate a matrix

    >>> A = numpy.matrix(numpy.ones((3,3)))
    >>> A
    matrix([[ 1.,  1.,  1.],
            [ 1.,  1.,  1.],
            [ 1.,  1.,  1.]])
    
  2. use numpy.array to modify A. Doesn’t work because you are modifying a copy

    >>> numpy.array(A)[2]=2
    >>> A
    matrix([[ 1.,  1.,  1.],
            [ 1.,  1.,  1.],
            [ 1.,  1.,  1.]])
    
  3. use numpy.asarray to modify A. It works because you are modifying A itself

    >>> numpy.asarray(A)[2]=2
    >>> A
    matrix([[ 1.,  1.,  1.],
            [ 1.,  1.,  1.],
            [ 2.,  2.,  2.]])
    

Hope this helps!
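
The underlying reason is that asarray returns a base-class ndarray view sharing memory with A, while array makes a copy; a quick sketch to confirm:

>>> numpy.shares_memory(numpy.asarray(A), A)  # view: writes affect A
True
>>> numpy.shares_memory(numpy.array(A), A)    # copy: writes are isolated
False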


Answer 3

The differences are mentioned quite clearly in the documentation of array and asarray. The differences lie in the argument list and hence the action of the function depending on those parameters.

The function definitions are :

numpy.array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)

and

numpy.asarray(a, dtype=None, order=None)

The following arguments are those that may be passed to array and not asarray, as mentioned in the documentation:

copy : bool, optional If true (default), then the object is copied. Otherwise, a copy will only be made if __array__ returns a copy, if obj is a nested sequence, or if a copy is needed to satisfy any of the other requirements (dtype, order, etc.).

subok : bool, optional If True, then sub-classes will be passed-through, otherwise the returned array will be forced to be a base-class array (default).

ndmin : int, optional Specifies the minimum number of dimensions that the resulting array should have. Ones will be pre-pended to the shape as needed to meet this requirement.


Answer 4

Here’s a simple example that can demonstrate the difference.

The main difference is that array will make a copy of the original data, while with asarray we get another object through which we can work with the data in the original array.

import numpy as np
a = np.arange(0.0, 10.2, 0.12)
int_cvr = np.asarray(a, dtype = np.int64)

The contents of the array a remain untouched, and we can still perform any operation on the data using another object without modifying the contents of the original array.


Answer 5

asarray(x) is like array(x, copy=False)

Use asarray(x) when you want to ensure that x will be an array before any other operations are done. If x is already an array then no copy would be done. It would not cause a redundant performance hit.

Here is an example of a function that ensure x is converted into an array first.

def mysum(x):
    return np.asarray(x).sum()
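
Usage sketch: the function accepts lists, tuples, and arrays alike, and an existing array passes through without a copy:

>>> mysum([1, 2, 3])
6
>>> mysum(np.arange(4))
6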