Python 实用宝典

Question 1

I load some machine learning data from a CSV file. The first 2 columns are observations and the remaining columns are features.

Currently, I do the following:

data = pandas.read_csv('mydata.csv')

which gives something like:

data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))

I’d like to slice this dataframe in two dataframes: one containing the columns a and b and one containing the columns c, d and e.

It is not possible to write something like

observations = data[:'c']
features = data['c':]

I’m not sure what the best method is. Do I need a pd.Panel?

By the way, I find dataframe indexing pretty inconsistent: data['a'] is permitted, but data[0] is not. On the other side, data['a':] is not permitted but data[0:] is. Is there a practical reason for this? This is really confusing if columns are indexed by Int, given that data[0] != data[0:1]

Question 2

2017 Answer – pandas 0.20: .ix is deprecated. Use .loc

See the deprecation in the docs

.loc uses label based indexing to select both rows and columns. The labels being the values of the index or the columns. Slicing with .loc includes the last element.

Let’s assume we have a DataFrame with the following columns:
foo, bar, quz, ant, cat, sat, dat.

# selects all rows and all columns beginning at 'foo' up to and including 'sat'
df.loc[:, 'foo':'sat']
# foo bar quz ant cat sat

.loc accepts the same slice notation that Python lists do for both row and columns. Slice notation being start:stop:step

# slice from 'foo' to 'cat' by every 2nd column
df.loc[:, 'foo':'cat':2]
# foo quz cat

# slice from the beginning to 'bar'
df.loc[:, :'bar']
# foo bar

# slice from 'quz' to the end by 3
df.loc[:, 'quz'::3]
# quz sat

# attempt from 'sat' to 'bar'
df.loc[:, 'sat':'bar']
# no columns returned

# slice from 'sat' to 'bar'
df.loc[:, 'sat':'bar':-1]
sat cat ant quz bar

# slice notation is syntatic sugar for the slice function
# slice from 'quz' to the end by 2 with slice function
df.loc[:, slice('quz',None, 2)]
# quz cat dat

# select specific columns with a list
# select columns foo, bar and dat
df.loc[:, ['foo','bar','dat']]
# foo bar dat

You can slice by rows and columns. For instance, if you have 5 rows with labels v, w, x, y, z

# slice from 'w' to 'y' and 'foo' to 'ant' by 3
df.loc['w':'y', 'foo':'ant':3]
#    foo ant
# w
# x
# y

Question 3

Note: .ix has been deprecated since Pandas v0.20. You should instead use .loc or .iloc, as appropriate.

The DataFrame.ix index is what you want to be accessing. It’s a little confusing (I agree that Pandas indexing is perplexing at times!), but the following seems to do what you want:

>>> df = DataFrame(np.random.rand(4,5), columns = list('abcde'))
>>> df.ix[:,'b':]
      b         c         d         e
0  0.418762  0.042369  0.869203  0.972314
1  0.991058  0.510228  0.594784  0.534366
2  0.407472  0.259811  0.396664  0.894202
3  0.726168  0.139531  0.324932  0.906575

where .ix[row slice, column slice] is what is being interpreted. More on Pandas indexing here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-advanced

Question 4

Lets use the titanic dataset from the seaborn package as an example

# Load dataset (pip install seaborn)
>> import seaborn.apionly as sns
>> titanic = sns.load_dataset('titanic')

using the column names

>> titanic.loc[:,['sex','age','fare']]

using the column indices

>> titanic.iloc[:,[2,3,6]]

using ix (Older than Pandas <.20 version)

>> titanic.ix[:,[‘sex’,’age’,’fare’]]

or

>> titanic.ix[:,[2,3,6]]

using the reindex method

>> titanic.reindex(columns=['sex','age','fare'])

Question 5

Also, Given a DataFrame

data

as in your example, if you would like to extract column a and d only (e.i. the 1st and the 4th column), iloc mothod from the pandas dataframe is what you need and could be used very effectively. All you need to know is the index of the columns you would like to extract. For example:

>>> data.iloc[:,[0,3]]

will give you

          a         d
0  0.883283  0.100975
1  0.614313  0.221731
2  0.438963  0.224361
3  0.466078  0.703347
4  0.955285  0.114033
5  0.268443  0.416996
6  0.613241  0.327548
7  0.370784  0.359159
8  0.692708  0.659410
9  0.806624  0.875476

Question 6

You can slice along the columns of a DataFrame by referring to the names of each column in a list, like so:

data = pandas.DataFrame(np.random.rand(10,5), columns = list('abcde'))
data_ab = data[list('ab')]
data_cde = data[list('cde')]

Question 7

And if you came here looking for slicing two ranges of columns and combining them together (like me) you can do something like

op = df[list(df.columns[0:899]) + list(df.columns[3593:])]
print op

This will create a new dataframe with first 900 columns and (all) columns > 3593 (assuming you have some 4000 columns in your data set).

Question 8

Here’s how you could use different methods to do selective column slicing, including selective label based, index based and the selective ranges based column slicing.

In [37]: import pandas as pd    
In [38]: import numpy as np
In [43]: df = pd.DataFrame(np.random.rand(4,7), columns = list('abcdefg'))

In [44]: df
Out[44]: 
          a         b         c         d         e         f         g
0  0.409038  0.745497  0.890767  0.945890  0.014655  0.458070  0.786633
1  0.570642  0.181552  0.794599  0.036340  0.907011  0.655237  0.735268
2  0.568440  0.501638  0.186635  0.441445  0.703312  0.187447  0.604305
3  0.679125  0.642817  0.697628  0.391686  0.698381  0.936899  0.101806

In [45]: df.loc[:, ["a", "b", "c"]] ## label based selective column slicing 
Out[45]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

In [46]: df.loc[:, "a":"c"] ## label based column ranges slicing 
Out[46]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

In [47]: df.iloc[:, 0:3] ## index based column ranges slicing 
Out[47]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

### with 2 different column ranges, index based slicing: 
In [49]: df[df.columns[0:1].tolist() + df.columns[1:3].tolist()]
Out[49]: 
          a         b         c
0  0.409038  0.745497  0.890767
1  0.570642  0.181552  0.794599
2  0.568440  0.501638  0.186635
3  0.679125  0.642817  0.697628

Question 9

Its equivalent

 >>> print(df2.loc[140:160,['Relevance','Title']])
 >>> print(df2.ix[140:160,[3,7]])

Question 10

if Data frame look like that:

group         name      count
fruit         apple     90
fruit         banana    150
fruit         orange    130
vegetable     broccoli  80
vegetable     kale      70
vegetable     lettuce   125

and OUTPUT could be like

   group    name  count
0  fruit   apple     90
1  fruit  banana    150
2  fruit  orange    130

if you use logical operator np.logical_not

df[np.logical_not(df['group'] == 'vegetable')]

more about

https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.logic.html

other logical operators

logical_and(x1, x2, /[, out, where, …]) Compute the truth value of x1 AND x2 element-wise.
logical_or(x1, x2, /[, out, where, casting, …]) Compute the truth value of x1 OR x2 element-wise.
logical_not(x, /[, out, where, casting, …]) Compute the truth value of NOT x element-wise.
logical_xor(x1, x2, /[, out, where, ..]) Compute the truth value of x1 XOR x2, element-wise.

Question 11

Another way to get a subset of columns from your DataFrame, assuming you want all the rows, would be to do:
data[['a','b']] and data[['c','d','e']]
If you want to use numerical column indexes you can do:
data[data.columns[:2]] and data[data.columns[2:]]

Question 12

I have a matrix in the type of a Numpy array. How would I write it to disk it as an image? Any format works (png, jpeg, bmp…). One important constraint is that PIL is not present.

Question 13

You can use PyPNG. It’s a pure Python (no dependencies) open source PNG encoder/decoder and it supports writing NumPy arrays as images.

Question 14

This uses PIL, but maybe some might find it useful:

import scipy.misc
scipy.misc.imsave('outfile.jpg', image_array)

EDIT: The current scipy version started to normalize all images so that min(data) become black and max(data) become white. This is unwanted if the data should be exact grey levels or exact RGB channels. The solution:

import scipy.misc
scipy.misc.toimage(image_array, cmin=0.0, cmax=...).save('outfile.jpg')

Question 15

An answer using PIL (just in case it’s useful).

given a numpy array “A”:

from PIL import Image
im = Image.fromarray(A)
im.save("your_file.jpeg")

you can replace “jpeg” with almost any format you want. More details about the formats here

Question 16

With matplotlib:

import matplotlib

matplotlib.image.imsave('name.png', array)

Works with matplotlib 1.3.1, I don’t know about lower version. From the docstring:

Arguments:
  *fname*:
    A string containing a path to a filename, or a Python file-like object.
    If *format* is *None* and *fname* is a string, the output
    format is deduced from the extension of the filename.
  *arr*:
    An MxN (luminance), MxNx3 (RGB) or MxNx4 (RGBA) array.

Question 17

Pure Python (2 & 3), a snippet without 3rd party dependencies.

This function writes compressed, true-color (4 bytes per pixel) RGBA PNG’s.

def write_png(buf, width, height):
    """ buf: must be bytes or a bytearray in Python3.x,
        a regular string in Python2.x.
    """
    import zlib, struct

    # reverse the vertical line order and add null bytes at the start
    width_byte_4 = width * 4
    raw_data = b''.join(
        b'\x00' + buf[span:span + width_byte_4]
        for span in range((height - 1) * width_byte_4, -1, - width_byte_4)
    )

    def png_pack(png_tag, data):
        chunk_head = png_tag + data
        return (struct.pack("!I", len(data)) +
                chunk_head +
                struct.pack("!I", 0xFFFFFFFF & zlib.crc32(chunk_head)))

    return b''.join([
        b'\x89PNG\r\n\x1a\n',
        png_pack(b'IHDR', struct.pack("!2I5B", width, height, 8, 6, 0, 0, 0)),
        png_pack(b'IDAT', zlib.compress(raw_data, 9)),
        png_pack(b'IEND', b'')])

… The data should be written directly to a file opened as binary, as in:

data = write_png(buf, 64, 64)
with open("my_image.png", 'wb') as fh:
    fh.write(data)

Original source
See also: Rust Port from this question.
Example usage thanks to @Evgeni Sergeev: https://stackoverflow.com/a/21034111/432509

Question 18

There’s opencv for python (documentation here).

import cv2
import numpy as np

img = ... # Your image as a numpy array 

cv2.imwrite("filename.png", img)

useful if you need to do more processing other than saving.

Question 19

If you have matplotlib, you can do:

import matplotlib.pyplot as plt
plt.imshow(matrix) #Needs to be in row,col order
plt.savefig(filename)

This will save the plot (not the images itself).

Question 20

You can use ‘skimage’ library in Python

Example:

from skimage.io import imsave
imsave('Path_to_your_folder/File_name.jpg',your_array)

Question 21

scipy.misc gives deprecation warning about imsave function and suggests usage of imageio instead.

import imageio
imageio.imwrite('image_name.png', img)

Question 22

Addendum to @ideasman42’s answer:

def saveAsPNG(array, filename):
    import struct
    if any([len(row) != len(array[0]) for row in array]):
        raise ValueError, "Array should have elements of equal size"

                                #First row becomes top row of image.
    flat = []; map(flat.extend, reversed(array))
                                 #Big-endian, unsigned 32-byte integer.
    buf = b''.join([struct.pack('>I', ((0xffFFff & i32)<<8)|(i32>>24) )
                    for i32 in flat])   #Rotate from ARGB to RGBA.

    data = write_png(buf, len(array[0]), len(array))
    f = open(filename, 'wb')
    f.write(data)
    f.close()

So you can do:

saveAsPNG([[0xffFF0000, 0xffFFFF00],
           [0xff00aa77, 0xff333333]], 'test_grid.png')

Producing test_grid.png:

Grid of red, yellow, dark-aqua, grey

(Transparency also works, by reducing the high byte from 0xff.)

Question 23

For those looking for a direct fully working example:

from PIL import Image
import numpy

w,h = 200,100
img = numpy.zeros((h,w,3),dtype=numpy.uint8) # has to be unsigned bytes

img[:] = (0,0,255) # fill blue

x,y = 40,20
img[y:y+30, x:x+50] = (255,0,0) # 50x30 red box

Image.fromarray(img).convert("RGB").save("art.png") # don't need to convert

also, if you want high quality jpeg’s
.save(file, subsampling=0, quality=100)

Question 24

matplotlib svn has a new function to save images as just an image — no axes etc. it’s a very simple function to backport too, if you don’t want to install svn (copied straight from image.py in matplotlib svn, removed the docstring for brevity):

def imsave(fname, arr, vmin=None, vmax=None, cmap=None, format=None, origin=None):
    from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
    from matplotlib.figure import Figure

    fig = Figure(figsize=arr.shape[::-1], dpi=1, frameon=False)
    canvas = FigureCanvas(fig)
    fig.figimage(arr, cmap=cmap, vmin=vmin, vmax=vmax, origin=origin)
    fig.savefig(fname, dpi=1, format=format)

Question 25

The world probably doesn’t need yet another package for writing a numpy array to a PNG file, but for those who can’t get enough, I recently put up numpngw on github:

https://github.com/WarrenWeckesser/numpngw

and on pypi: https://pypi.python.org/pypi/numpngw/

The only external dependency is numpy.

Here’s the first example from the examples directory of the repository. The essential line is simply

write_png('example1.png', img)

where img is a numpy array. All the code before that line is import statements and code to create img.

import numpy as np
from numpngw import write_png


# Example 1
#
# Create an 8-bit RGB image.

img = np.zeros((80, 128, 3), dtype=np.uint8)

grad = np.linspace(0, 255, img.shape[1])

img[:16, :, :] = 127
img[16:32, :, 0] = grad
img[32:48, :, 1] = grad[::-1]
img[48:64, :, 2] = grad
img[64:, :, :] = 127

write_png('example1.png', img)

Here’s the PNG file that it creates:

Question 26

Assuming you want a grayscale image:

im = Image.new('L', (width, height))
im.putdata(an_array.flatten().tolist())
im.save("image.tiff")

Question 27

Imageio is a Python library that provides an easy interface to read and write a wide range of image data, including animated images, video, volumetric data, and scientific formats. It is cross-platform, runs on Python 2.7 and 3.4+, and is easy to install.

This is example for grayscale image:

import numpy as np
import imageio

# data is numpy array with grayscale value for each pixel.
data = np.array([70,80,82,72,58,58,60,63,54,58,60,48,89,115,121,119])

# 16 pixels can be converted into square of 4x4 or 2x8 or 8x2
data = data.reshape((4, 4)).astype('uint8')

# save image
imageio.imwrite('pic.jpg', data)

Question 28

If you happen to use [Py]Qt already, you may be interested in qimage2ndarray. Starting with version 1.4 (just released), PySide is supported as well, and there will be a tiny imsave(filename, array) function similar to scipy’s, but using Qt instead of PIL. With 1.3, just use something like the following:

qImage = array2qimage(image, normalize = False) # create QImage from ndarray
success = qImage.save(filename) # use Qt's image IO functions for saving PNG/JPG/..

(Another advantage of 1.4 is that it is a pure python solution, which makes this even more lightweight.)

Question 29

If you are working in python environment Spyder, then it cannot get more easier than to just right click the array in variable explorer, and then choose Show Image option.

This will ask you to save image to dsik, mostly in PNG format.

PIL library will not be needed in this case.

Question 30

Use cv2.imwrite.

import cv2
assert mat.shape[2] == 1 or mat.shape[2] == 3, 'the third dim should be channel'
cv2.imwrite(path, mat) # note the form of data should be height - width - channel

Question 31

for saving a numpy array as image, U have several choices:

1) best of other: OpenCV

 import cv2   
 cv2.imwrite('file name with extension(like .jpg)', numpy_array)

2) Matplotlib

  from matplotlib import pyplot as plt
  plt.imsave('file name with extension(like .jpg)', numpy_array)

3) PIL

  from PIL import Image
  image = Image.fromarray(numpy_array)
  image.save('file name with extension(like .jpg)')

4) …

Question 32

I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian, ERI_Black_Afr.Amer, ERI_HI_PacIsl, ERI_White) in each row of my dataframe.

I’ve tried different methods from other questions but still can’t seem to find the right answer for my problem. The critical piece of this is that if the person is counted as Hispanic they can’t be counted as anything else. Even if they have a “1” in another ethnicity column they still are counted as Hispanic not two or more races. Similarly, if the sum of all the ERI columns is greater than 1 they are counted as two or more races and can’t be counted as a unique ethnicity(except for Hispanic). Hopefully this makes sense. Any help will be greatly appreciated.

Its almost like doing a for loop through each row and if each record meets a criterion they are added to one list and eliminated from the original.

From the dataframe below I need to calculate a new column based on the following spec in SQL:

========================= CRITERIA ===============================

IF [ERI_Hispanic] = 1 THEN RETURN “Hispanic”
ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN “Two or More”
ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN “A/I AK Native”
ELSE IF [ERI_Asian] = 1 THEN RETURN “Asian”
ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN “Black/AA”
ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN “Haw/Pac Isl.”
ELSE IF [ERI_White] = 1 THEN RETURN “White”

Comment: If the ERI Flag for Hispanic is True (1), the employee is classified as “Hispanic”

Comment: If more than 1 non-Hispanic ERI Flag is true, return “Two or More”

====================== DATAFRAME ===========================

     lname          fname       rno_cd  eri_afr_amer    eri_asian   eri_hawaiian    eri_hispanic    eri_nat_amer    eri_white   rno_defined
0    MOST           JEFF        E       0               0           0               0               0               1           White
1    CRUISE         TOM         E       0               0           0               1               0               0           White
2    DEPP           JOHNNY              0               0           0               0               0               1           Unknown
3    DICAP          LEO                 0               0           0               0               0               1           Unknown
4    BRANDO         MARLON      E       0               0           0               0               0               0           White
5    HANKS          TOM         0                       0           0               0               0               1           Unknown
6    DENIRO         ROBERT      E       0               1           0               0               0               1           White
7    PACINO         AL          E       0               0           0               0               0               1           White
8    WILLIAMS       ROBIN       E       0               0           1               0               0               0           White
9    EASTWOOD       CLINT       E       0               0           0               0               0               1           White

Question 33

OK, two steps to this – first is to write a function that does the translation you want – I’ve put an example together based on your pseudo-code:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

You may want to go over this, but it seems to do the trick – notice that the parameter going into the function is considered to be a Series object labelled “row”.

Next, use the apply function in pandas to apply the function – e.g.

df.apply (lambda row: label_race(row), axis=1)

Note the axis=1 specifier, that means that the application is done at a row, rather than a column level. The results are here:

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White

If you’re happy with those results, then run it again, saving the results into a new column in your original dataframe.

df['race_label'] = df.apply (lambda row: label_race(row), axis=1)

The resultant dataframe looks like this (scroll to the right to see the new column):

      lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label
0      MOST    JEFF      E             0          0             0              0             0          1       White         White
1    CRUISE     TOM      E             0          0             0              1             0          0       White      Hispanic
2      DEPP  JOHNNY    NaN             0          0             0              0             0          1     Unknown         White
3     DICAP     LEO    NaN             0          0             0              0             0          1     Unknown         White
4    BRANDO  MARLON      E             0          0             0              0             0          0       White         Other
5     HANKS     TOM    NaN             0          0             0              0             0          1     Unknown         White
6    DENIRO  ROBERT      E             0          1             0              0             0          1       White   Two Or More
7    PACINO      AL      E             0          0             0              0             0          1       White         White
8  WILLIAMS   ROBIN      E             0          0             1              0             0          0       White  Haw/Pac Isl.
9  EASTWOOD   CLINT      E             0          0             0              0             0          1       White         White

Question 34

Since this is the first Google result for ‘pandas new column from others’, here’s a simple example:

import pandas as pd

# make a simple dataframe
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
df
#    a  b
# 0  1  3
# 1  2  4

# create an unattached column with an index
df.apply(lambda row: row.a + row.b, axis=1)
# 0    4
# 1    6

# do same but attach it to the dataframe
df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
df
#    a  b  c
# 0  1  3  4
# 1  2  4  6

If you get the SettingWithCopyWarning you can do it this way also:

fn = lambda row: row.a + row.b # define a function for the new column
col = df.apply(fn, axis=1) # get column data with an index
df = df.assign(c=col.values) # assign values to column 'c'

Source: https://stackoverflow.com/a/12555510/243392

And if your column name includes spaces you can use syntax like this:

df = df.assign(**{'some column name': col.values})

And here’s the documentation for apply, and assign.

Question 35

The answers above are perfectly valid, but a vectorized solution exists, in the form of numpy.select. This allows you to define conditions, then define outputs for those conditions, much more efficiently than using apply:

First, define conditions:

conditions = [
    df['eri_hispanic'] == 1,
    df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    df['eri_nat_amer'] == 1,
    df['eri_asian'] == 1,
    df['eri_afr_amer'] == 1,
    df['eri_hawaiian'] == 1,
    df['eri_white'] == 1,
]

Now, define the corresponding outputs:

outputs = [
    'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]

Finally, using numpy.select:

res = np.select(conditions, outputs, 'Other')
pd.Series(res)

0           White
1        Hispanic
2           White
3           White
4           Other
5           White
6     Two Or More
7           White
8    Haw/Pac Isl.
9           White
dtype: object

Why should numpy.select be used over apply? Here are some performance checks:

df = pd.concat([df]*1000)

In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [44]: %%timeit
    ...: conditions = [
    ...:     df['eri_hispanic'] == 1,
    ...:     df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
    ...:     df['eri_nat_amer'] == 1,
    ...:     df['eri_asian'] == 1,
    ...:     df['eri_afr_amer'] == 1,
    ...:     df['eri_hawaiian'] == 1,
    ...:     df['eri_white'] == 1,
    ...: ]
    ...:
    ...: outputs = [
    ...:     'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
    ...: ]
    ...:
    ...: np.select(conditions, outputs, 'Other')
    ...:
    ...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using numpy.select gives us vastly improved performance, and the discrepancy will only increase as the data grows.

Question 36

.apply() takes in a function as the first parameter; pass in the label_race function as so:

df['race_label'] = df.apply(label_race, axis=1)

You don’t need to make a lambda function to pass in a function.

Question 37

try this,

df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)

O/P:

     lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian  \
0      MOST    JEFF      E             0          0             0   
1    CRUISE     TOM      E             0          0             0   
2      DEPP  JOHNNY    NaN             0          0             0   
3     DICAP     LEO    NaN             0          0             0   
4    BRANDO  MARLON      E             0          0             0   
5     HANKS     TOM    NaN             0          0             0   
6    DENIRO  ROBERT      E             0          1             0   
7    PACINO      AL      E             0          0             0   
8  WILLIAMS   ROBIN      E             0          0             1   
9  EASTWOOD   CLINT      E             0          0             0   

   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label  
0             0             0          1       White         White  
1             1             0          0       White      Hispanic  
2             0             0          1     Unknown         White  
3             0             0          1     Unknown         White  
4             0             0          0       White         Other  
5             0             0          1     Unknown         White  
6             0             0          1       White   Two Or More  
7             0             0          1       White         White  
8             0             0          0       White  Haw/Pac Isl.  
9             0             0          1       White         White

use .loc instead of apply.

it improves vectorization.

.loc works in simple manner, mask rows based on the condition, apply values to the freeze rows.

for more details visit, .loc docs

Performance metrics:

Accepted Answer:

def label_race (row):
   if row['eri_hispanic'] == 1 :
      return 'Hispanic'
   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
      return 'Two Or More'
   if row['eri_nat_amer'] == 1 :
      return 'A/I AK Native'
   if row['eri_asian'] == 1:
      return 'Asian'
   if row['eri_afr_amer']  == 1:
      return 'Black/AA'
   if row['eri_hawaiian'] == 1:
      return 'Haw/Pac Isl.'
   if row['eri_white'] == 1:
      return 'White'
   return 'Other'

df=pd.read_csv('dataser.csv')
df = pd.concat([df]*1000)

%timeit df.apply(lambda row: label_race(row), axis=1)

1.15 s ± 46.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

My Proposed Answer:

def label_race(df):
    df.loc[df['eri_white']==1,'race_label'] = 'White'
    df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
    df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
    df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
    df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
    df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
    df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
    df['race_label'].fillna('Other', inplace=True)
df=pd.read_csv('s22.csv')
df = pd.concat([df]*1000)

%timeit label_race(df)

24.7 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Question 38

I have a dataframe along the lines of the below:

    Type       Set
1    A          Z
2    B          Z           
3    B          X
4    C          Y

I want to add another column to the dataframe (or generate a series) of the same length as the dataframe (equal number of records/rows) which sets a colour 'green' if Set == 'Z' and 'red' if Set equals anything else.

What’s the best way to do this?

Question 39

If you only have two choices to select from:

df['color'] = np.where(df['Set']=='Z', 'green', 'red')

For example,

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)

yields

  Set Type  color
0   Z    A  green
1   Z    B  green
2   X    B    red
3   Y    C    red

If you have more than two conditions then use np.select. For example, if you want color to be

yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
otherwise purple when (df['Type'] == 'B')
otherwise black,

then use

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
    (df['Set'] == 'Z') & (df['Type'] == 'A'),
    (df['Set'] == 'Z') & (df['Type'] == 'B'),
    (df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)

which yields

  Set Type   color
0   Z    A  yellow
1   Z    B    blue
2   X    B  purple
3   Y    C   black

Question 40

List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.

Example list comprehension:

df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]

%timeit tests:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
%timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

1000 loops, best of 3: 239 µs per loop
1000 loops, best of 3: 523 µs per loop
1000 loops, best of 3: 263 µs per loop

Question 41

Another way in which this could be achieved is

df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

Question 42

Here’s yet another way to skin this cat, using a dictionary to map new values onto the keys in the list:

def map_values(row, values_dict):
    return values_dict[row]

values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]})

df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args = (values_dict,))

What’s it look like:

df
Out[2]: 
  INDICATOR  VALUE  NEW_VALUE
0         A     10          1
1         B      9          2
2         C      8          3
3         D      7          4

This approach can be very powerful when you have many ifelse-type statements to make (i.e. many unique values to replace).

And of course you could always do this:

df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)

But that approach is more than three times as slow as the apply approach from above, on my machine.

And you could also do this, using dict.get:

df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]

Question 43

The following is slower than the approaches timed here, but we can compute the extra column based on the contents of more than one column, and more than two values can be computed for the extra column.

Simple example using just the “Set” column:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)

  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Example with more colours and more columns taken into account:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    elif row["Type"] == "C":
        return "blue"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)

  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C   blue

Edit (21/06/2019): Using plydata

It is also possible to use plydata to do this kind of things (this seems even slower than using assign and apply, though).

from plydata import define, if_else

Simple if_else:

df = define(df, color=if_else('Set=="Z"', '"red"', '"green"'))

print(df)

  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Nested if_else:

df = define(df, color=if_else(
    'Set=="Z"',
    '"red"',
    if_else('Type=="C"', '"green"', '"blue"')))

print(df)

  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B   blue
3   Y    C  green

Question 44

Maybe this has been possible with newer updates of Pandas (tested with pandas=1.0.5), but I think the following is the shortest and maybe best answer for the question, so far. You can use the .loc method and use one condition or several depending on your need.

Code Summary:

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
df.loc[(df['Set']=="Z"), 'Color'] = "green"

#practice!
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"

Explanation:

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))

# df so far: 
  Type Set  
0    A   Z 
1    B   Z 
2    B   X 
3    C   Y

add a ‘color’ column and set all values to “red”

df['Color'] = "red"

Apply your single condition:

df.loc[(df['Set']=="Z"), 'Color'] = "green"


# df: 
  Type Set  Color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

or multiple conditions if you want:

df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"

You can read on Pandas logical operators and conditional selection here: Logical operators for boolean indexing in Pandas

Question 45

One liner with .apply() method is following:

df['color'] = df['Set'].apply(lambda set_: 'green' if set_=='Z' else 'red')

After that, df data frame looks like this:

>>> print(df)
  Type Set  color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

Question 46

If you’re working with massive data, a memoized approach would be best:

# First create a dictionary of manually stored values
color_dict = {'Z':'red'}

# Second, build a dictionary of "other" values
color_dict_other = {x:'green' for x in df['Set'].unique() if x not in color_dict.keys()}

# Next, merge the two
color_dict.update(color_dict_other)

# Finally, map it to your column
df['color'] = df['Set'].map(color_dict)

This approach will be fastest when you have many repeated values. My general rule of thumb is to memoize when: data_size > 10**4 & n_distinct < data_size/4

E.x. Memoize in a case 10,000 rows with 2,500 or fewer distinct values.

Question 47

I can’t figure out how to use an array or matrix in the way that I would normally use a list. I want to create an empty array (or matrix) and then add one column (or row) to it at a time.

At the moment the only way I can find to do this is like:

mat = None
for col in columns:
    if mat is None:
        mat = col
    else:
        mat = hstack((mat, col))

Whereas if it were a list, I’d do something like this:

list = []
for item in data:
    list.append(item)

Is there a way to use that kind of notation for NumPy arrays or matrices?

Question 48

You have the wrong mental model for using NumPy efficiently. NumPy arrays are stored in contiguous blocks of memory. If you want to add rows or columns to an existing array, the entire array needs to be copied to a new block of memory, creating gaps for the new elements to be stored. This is very inefficient if done repeatedly to build an array.

In the case of adding rows, your best bet is to create an array that is as big as your data set will eventually be, and then assign data to it row-by-row:

>>> import numpy
>>> a = numpy.zeros(shape=(5,2))
>>> a
array([[ 0.,  0.],
   [ 0.,  0.],
   [ 0.,  0.],
   [ 0.,  0.],
   [ 0.,  0.]])
>>> a[0] = [1,2]
>>> a[1] = [2,3]
>>> a
array([[ 1.,  2.],
   [ 2.,  3.],
   [ 0.,  0.],
   [ 0.,  0.],
   [ 0.,  0.]])

Question 49

A NumPy array is a very different data structure from a list and is designed to be used in different ways. Your use of hstack is potentially very inefficient… every time you call it, all the data in the existing array is copied into a new one. (The append function will have the same issue.) If you want to build up your matrix one column at a time, you might be best off to keep it in a list until it is finished, and only then convert it into an array.

e.g.


mylist = []
for item in data:
    mylist.append(item)
mat = numpy.array(mylist)

item can be a list, an array or any iterable, as long as each item has the same number of elements.
In this particular case (data is some iterable holding the matrix columns) you can simply use


mat = numpy.array(data)

(Also note that using list as a variable name is probably not good practice since it masks the built-in type by that name, which can lead to bugs.)

EDIT:

If for some reason you really do want to create an empty array, you can just use numpy.array([]), but this is rarely useful!

Question 50

To create an empty multidimensional array in NumPy (e.g. a 2D array m*n to store your matrix), in case you don’t know m how many rows you will append and don’t care about the computational cost Stephen Simmons mentioned (namely re-buildinging the array at each append), you can squeeze to 0 the dimension to which you want to append to: X = np.empty(shape=[0, n]).

This way you can use for example (here m = 5 which we assume we didn’t know when creating the empty matrix, and n = 2):

import numpy as np

n = 2
X = np.empty(shape=[0, n])

for i in range(5):
    for j  in range(2):
        X = np.append(X, [[i, j]], axis=0)

print X

which will give you:

[[ 0.  0.]
 [ 0.  1.]
 [ 1.  0.]
 [ 1.  1.]
 [ 2.  0.]
 [ 2.  1.]
 [ 3.  0.]
 [ 3.  1.]
 [ 4.  0.]
 [ 4.  1.]]

Question 51

I looked into this a lot because I needed to use a numpy.array as a set in one of my school projects and I needed to be initialized empty… I didn’t found any relevant answer here on Stack Overflow, so I started doodling something.

# Initialize your variable as an empty list first
In [32]: x=[]
# and now cast it as a numpy ndarray
In [33]: x=np.array(x)

The result will be:

In [34]: x
Out[34]: array([], dtype=float64)

Therefore you can directly initialize an np array as follows:

In [36]: x= np.array([], dtype=np.float64)

I hope this helps.

Question 52

You can use the append function. For rows:

>>> from numpy import *
>>> a = array([10,20,30])
>>> append(a, [[1,2,3]], axis=0)
array([[10, 20, 30],      
       [1, 2, 3]])

For columns:

>>> append(a, [[15],[15]], axis=1)
array([[10, 20, 30, 15],      
       [1, 2, 3, 15]])

EDIT
Of course, as mentioned in other answers, unless you’re doing some processing (ex. inversion) on the matrix/array EVERY time you append something to it, I would just create a list, append to it then convert it to an array.

Question 53

If you absolutely don’t know the final size of the array, you can increment the size of the array like this:

my_arr = numpy.zeros((0,5))
for i in range(3):
    my_arr=numpy.concatenate( ( my_arr, numpy.ones((1,5)) ) )
print(my_arr)

[[ 1.  1.  1.  1.  1.]  [ 1.  1.  1.  1.  1.]  [ 1.  1.  1.  1.  1.]]

Notice the 0 in the first line.
numpy.append is another option. It calls numpy.concatenate.

Question 54

You can apply it to build any kind of array, like zeros:

a = range(5)
a = [i*0 for i in a]
print a 
[0, 0, 0, 0, 0]

Question 55

Here is some workaround to make numpys look more like Lists

np_arr = np.array([])
np_arr = np.append(np_arr , 2)
np_arr = np.append(np_arr , 24)
print(np_arr)

OUTPUT: array([ 2., 24.])

Question 56

Depending on what you are using this for, you may need to specify the data type (see ‘dtype’).

For example, to create a 2D array of 8-bit values (suitable for use as a monochrome image):

myarray = numpy.empty(shape=(H,W),dtype='u1')

For an RGB image, include the number of color channels in the shape: shape=(H,W,3)

You may also want to consider zero-initializing with numpy.zeros instead of using numpy.empty. See the note here.

Question 57

I think you want to handle most of the work with lists then use the result as a matrix. Maybe this is a way ;

ur_list = []
for col in columns:
    ur_list.append(list(col))

mat = np.matrix(ur_list)

Question 58

I think you can create empty numpy array like:

>>> import numpy as np
>>> empty_array= np.zeros(0)
>>> empty_array
array([], dtype=float64)
>>> empty_array.shape
(0,)

This format is useful when you want to append numpy array in the loop.

Question 59

For creating an empty NumPy array without defining its shape there is to way:

1.

arr = np.array([])

preferred. cause you know you will be using this as numpy.

2.

arr = []
# and use it as numpy. append to it or etc..

NumPy converts this to np.ndarray type afterward, without extra [] dimionsion.

Question 60

Can someone explain to me what is the purpose of meshgrid function in Numpy? I know it creates some kind of grid of coordinates for plotting, but I can’t really see the direct benefit of it.

I am studying “Python Machine Learning” from Sebastian Raschka, and he is using it for plotting the decision borders. See input 11 here.

I have also tried this code from official documentation, but, again, the output doesn’t really make sense to me.

x = np.arange(-5, 5, 1)
y = np.arange(-5, 5, 1)
xx, yy = np.meshgrid(x, y, sparse=True)
z = np.sin(xx**2 + yy**2) / (xx**2 + yy**2)
h = plt.contourf(x,y,z)

Please, if possible, also show me a lot of real-world examples.

Question 61

The purpose of meshgrid is to create a rectangular grid out of an array of x values and an array of y values.

So, for example, if we want to create a grid where we have a point at each integer value between 0 and 4 in both the x and y directions. To create a rectangular grid, we need every combination of the x and y points.

This is going to be 25 points, right? So if we wanted to create an x and y array for all of these points, we could do the following.

x[0,0] = 0    y[0,0] = 0
x[0,1] = 1    y[0,1] = 0
x[0,2] = 2    y[0,2] = 0
x[0,3] = 3    y[0,3] = 0
x[0,4] = 4    y[0,4] = 0
x[1,0] = 0    y[1,0] = 1
x[1,1] = 1    y[1,1] = 1
...
x[4,3] = 3    y[4,3] = 4
x[4,4] = 4    y[4,4] = 4

This would result in the following x and y matrices, such that the pairing of the corresponding element in each matrix gives the x and y coordinates of a point in the grid.

x =   0 1 2 3 4        y =   0 0 0 0 0
      0 1 2 3 4              1 1 1 1 1
      0 1 2 3 4              2 2 2 2 2
      0 1 2 3 4              3 3 3 3 3
      0 1 2 3 4              4 4 4 4 4

We can then plot these to verify that they are a grid:

plt.plot(x,y, marker='.', color='k', linestyle='none')

Obviously, this gets very tedious especially for large ranges of x and y. Instead, meshgrid can actually generate this for us: all we have to specify are the unique x and y values.

xvalues = np.array([0, 1, 2, 3, 4]);
yvalues = np.array([0, 1, 2, 3, 4]);

Now, when we call meshgrid, we get the previous output automatically.

xx, yy = np.meshgrid(xvalues, yvalues)

plt.plot(xx, yy, marker='.', color='k', linestyle='none')

Creation of these rectangular grids is useful for a number of tasks. In the example that you have provided in your post, it is simply a way to sample a function (sin(x**2 + y**2) / (x**2 + y**2)) over a range of values for x and y.

Because this function has been sampled on a rectangular grid, the function can now be visualized as an “image”.

Additionally, the result can now be passed to functions which expect data on rectangular grid (i.e. contourf)

Question 62

Courtesy of Microsoft Excel:

Question 63

Actually the purpose of np.meshgrid is already mentioned in the documentation:

np.meshgrid

Return coordinate matrices from coordinate vectors.

Make N-D coordinate arrays for vectorized evaluations of N-D scalar/vector fields over N-D grids, given one-dimensional coordinate arrays x1, x2,…, xn.

So it’s primary purpose is to create a coordinates matrices.

You probably just asked yourself:

Why do we need to create coordinate matrices?

The reason you need coordinate matrices with Python/NumPy is that there is no direct relation from coordinates to values, except when your coordinates start with zero and are purely positive integers. Then you can just use the indices of an array as the index. However when that’s not the case you somehow need to store coordinates alongside your data. That’s where grids come in.

Suppose your data is:

1  2  1
2  5  2
1  2  1

However, each value represents a 2 kilometers wide region horizontally and 3 kilometers vertically. Suppose your origin is the upper left corner and you want arrays that represent the distance you could use:

import numpy as np
h, v = np.meshgrid(np.arange(3)*3, np.arange(3)*2)

where v is:

array([[0, 0, 0],
       [2, 2, 2],
       [4, 4, 4]])

and h:

array([[0, 3, 6],
       [0, 3, 6],
       [0, 3, 6]])

So if you have two indices, let’s say x and y (that’s why the return value of meshgrid is usually xx or xs instead of x in this case I chose h for horizontally!) then you can get the x coordinate of the point, the y coordinate of the point and the value at that point by using:

h[x, y]    # horizontal coordinate
v[x, y]    # vertical coordinate
data[x, y]  # value

That makes it much easier to keep track of coordinates and (even more importantly) you can pass them to functions that need to know the coordinates.

A slightly longer explanation

However, np.meshgrid itself isn’t often used directly, mostly one just uses one of similar objects np.mgrid or np.ogrid. Here np.mgrid represents the sparse=False and np.ogrid the sparse=True case (I refer to the sparse argument of np.meshgrid). Note that there is a significant difference between np.meshgrid and np.ogrid and np.mgrid: The first two returned values (if there are two or more) are reversed. Often this doesn’t matter but you should give meaningful variable names depending on the context.

For example, in case of a 2D grid and matplotlib.pyplot.imshow it makes sense to name the first returned item of np.meshgrid x and the second one y while it’s the other way around for np.mgrid and np.ogrid.

`np.ogrid` and sparse grids

>>> import numpy as np
>>> yy, xx = np.ogrid[-5:6, -5:6]
>>> xx
array([[-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5]])
>>> yy
array([[-5],
       [-4],
       [-3],
       [-2],
       [-1],
       [ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5]])

As already said the output is reversed when compared to np.meshgrid, that’s why I unpacked it as yy, xx instead of xx, yy:

>>> xx, yy = np.meshgrid(np.arange(-5, 6), np.arange(-5, 6), sparse=True)
>>> xx
array([[-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5]])
>>> yy
array([[-5],
       [-4],
       [-3],
       [-2],
       [-1],
       [ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5]])

This already looks like coordinates, specifically the x and y lines for 2D plots.

Visualized:

yy, xx = np.ogrid[-5:6, -5:6]
plt.figure()
plt.title('ogrid (sparse meshgrid)')
plt.grid()
plt.xticks(xx.ravel())
plt.yticks(yy.ravel())
plt.scatter(xx, np.zeros_like(xx), color="blue", marker="*")
plt.scatter(np.zeros_like(yy), yy, color="red", marker="x")

`np.mgrid` and dense/fleshed out grids

>>> yy, xx = np.mgrid[-5:6, -5:6]
>>> xx
array([[-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5]])
>>> yy
array([[-5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5],
       [-4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4],
       [-3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3],
       [-2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2],
       [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3],
       [ 4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4],
       [ 5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5]])

The same applies here: The output is reversed compared to np.meshgrid:

>>> xx, yy = np.meshgrid(np.arange(-5, 6), np.arange(-5, 6))
>>> xx
array([[-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5],
       [-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5]])
>>> yy
array([[-5, -5, -5, -5, -5, -5, -5, -5, -5, -5, -5],
       [-4, -4, -4, -4, -4, -4, -4, -4, -4, -4, -4],
       [-3, -3, -3, -3, -3, -3, -3, -3, -3, -3, -3],
       [-2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2],
       [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3],
       [ 4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4],
       [ 5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5]])

Unlike ogrid these arrays contain all xx and yy coordinates in the -5 <= xx <= 5; -5 <= yy <= 5 grid.

yy, xx = np.mgrid[-5:6, -5:6]
plt.figure()
plt.title('mgrid (dense meshgrid)')
plt.grid()
plt.xticks(xx[0])
plt.yticks(yy[:, 0])
plt.scatter(xx, yy, color="red", marker="x")

Functionality

It’s not only limited to 2D, these functions work for arbitrary dimensions (well, there is a maximum number of arguments given to function in Python and a maximum number of dimensions that NumPy allows):

>>> x1, x2, x3, x4 = np.ogrid[:3, 1:4, 2:5, 3:6]
>>> for i, x in enumerate([x1, x2, x3, x4]):
...     print('x{}'.format(i+1))
...     print(repr(x))
x1
array([[[[0]]],


       [[[1]]],


       [[[2]]]])
x2
array([[[[1]],

        [[2]],

        [[3]]]])
x3
array([[[[2],
         [3],
         [4]]]])
x4
array([[[[3, 4, 5]]]])

>>> # equivalent meshgrid output, note how the first two arguments are reversed and the unpacking
>>> x2, x1, x3, x4 = np.meshgrid(np.arange(1,4), np.arange(3), np.arange(2, 5), np.arange(3, 6), sparse=True)
>>> for i, x in enumerate([x1, x2, x3, x4]):
...     print('x{}'.format(i+1))
...     print(repr(x))
# Identical output so it's omitted here.

Even if these also work for 1D there are two (much more common) 1D grid creation functions:

Besides the start and stop argument it also supports the step argument (even complex steps that represent the number of steps):

>>> x1, x2 = np.mgrid[1:10:2, 1:10:4j]
>>> x1  # The dimension with the explicit step width of 2
array([[1., 1., 1., 1.],
       [3., 3., 3., 3.],
       [5., 5., 5., 5.],
       [7., 7., 7., 7.],
       [9., 9., 9., 9.]])
>>> x2  # The dimension with the "number of steps"
array([[ 1.,  4.,  7., 10.],
       [ 1.,  4.,  7., 10.],
       [ 1.,  4.,  7., 10.],
       [ 1.,  4.,  7., 10.],
       [ 1.,  4.,  7., 10.]])

Applications

You specifically asked about the purpose and in fact, these grids are extremely useful if you need a coordinate system.

For example if you have a NumPy function that calculates the distance in two dimensions:

def distance_2d(x_point, y_point, x, y):
    return np.hypot(x-x_point, y-y_point)

And you want to know the distance of each point:

>>> ys, xs = np.ogrid[-5:5, -5:5]
>>> distances = distance_2d(1, 2, xs, ys)  # distance to point (1, 2)
>>> distances
array([[9.21954446, 8.60232527, 8.06225775, 7.61577311, 7.28010989,
        7.07106781, 7.        , 7.07106781, 7.28010989, 7.61577311],
       [8.48528137, 7.81024968, 7.21110255, 6.70820393, 6.32455532,
        6.08276253, 6.        , 6.08276253, 6.32455532, 6.70820393],
       [7.81024968, 7.07106781, 6.40312424, 5.83095189, 5.38516481,
        5.09901951, 5.        , 5.09901951, 5.38516481, 5.83095189],
       [7.21110255, 6.40312424, 5.65685425, 5.        , 4.47213595,
        4.12310563, 4.        , 4.12310563, 4.47213595, 5.        ],
       [6.70820393, 5.83095189, 5.        , 4.24264069, 3.60555128,
        3.16227766, 3.        , 3.16227766, 3.60555128, 4.24264069],
       [6.32455532, 5.38516481, 4.47213595, 3.60555128, 2.82842712,
        2.23606798, 2.        , 2.23606798, 2.82842712, 3.60555128],
       [6.08276253, 5.09901951, 4.12310563, 3.16227766, 2.23606798,
        1.41421356, 1.        , 1.41421356, 2.23606798, 3.16227766],
       [6.        , 5.        , 4.        , 3.        , 2.        ,
        1.        , 0.        , 1.        , 2.        , 3.        ],
       [6.08276253, 5.09901951, 4.12310563, 3.16227766, 2.23606798,
        1.41421356, 1.        , 1.41421356, 2.23606798, 3.16227766],
       [6.32455532, 5.38516481, 4.47213595, 3.60555128, 2.82842712,
        2.23606798, 2.        , 2.23606798, 2.82842712, 3.60555128]])

The output would be identical if one passed in a dense grid instead of an open grid. NumPys broadcasting makes it possible!

Let’s visualize the result:

plt.figure()
plt.title('distance to point (1, 2)')
plt.imshow(distances, origin='lower', interpolation="none")
plt.xticks(np.arange(xs.shape[1]), xs.ravel())  # need to set the ticks manually
plt.yticks(np.arange(ys.shape[0]), ys.ravel())
plt.colorbar()

And this is also when NumPys mgrid and ogrid become very convenient because it allows you to easily change the resolution of your grids:

ys, xs = np.ogrid[-5:5:200j, -5:5:200j]
# otherwise same code as above

However, since imshow doesn’t support x and y inputs one has to change the ticks by hand. It would be really convenient if it would accept the x and y coordinates, right?

It’s easy to write functions with NumPy that deal naturally with grids. Furthermore, there are several functions in NumPy, SciPy, matplotlib that expect you to pass in the grid.

I like images so let’s explore matplotlib.pyplot.contour:

ys, xs = np.mgrid[-5:5:200j, -5:5:200j]
density = np.sin(ys)-np.cos(xs)
plt.figure()
plt.contour(xs, ys, density)

Note how the coordinates are already correctly set! That wouldn’t be the case if you just passed in the density.

Or to give another fun example using astropy models (this time I don’t care much about the coordinates, I just use them to create some grid):

from astropy.modeling import models
z = np.zeros((100, 100))
y, x = np.mgrid[0:100, 0:100]
for _ in range(10):
    g2d = models.Gaussian2D(amplitude=100, 
                           x_mean=np.random.randint(0, 100), 
                           y_mean=np.random.randint(0, 100), 
                           x_stddev=3, 
                           y_stddev=3)
    z += g2d(x, y)
    a2d = models.AiryDisk2D(amplitude=70, 
                            x_0=np.random.randint(0, 100), 
                            y_0=np.random.randint(0, 100), 
                            radius=5)
    z += a2d(x, y)

Although that’s just “for the looks” several functions related to functional models and fitting (for example scipy.interpolate.interp2d, scipy.interpolate.griddata even show examples using np.mgrid) in Scipy, etc. require grids. Most of these work with open grids and dense grids, however some only work with one of them.

Question 64

Suppose you have a function:

def sinus2d(x, y):
    return np.sin(x) + np.sin(y)

and you want, for example, to see what it looks like in the range 0 to 2*pi. How would you do it? There np.meshgrid comes in:

xx, yy = np.meshgrid(np.linspace(0,2*np.pi,100), np.linspace(0,2*np.pi,100))
z = sinus2d(xx, yy) # Create the image on this grid

and such a plot would look like:

import matplotlib.pyplot as plt
plt.imshow(z, origin='lower', interpolation='none')
plt.show()

So np.meshgrid is just a convenience. In principle the same could be done by:

z2 = sinus2d(np.linspace(0,2*np.pi,100)[:,None], np.linspace(0,2*np.pi,100)[None,:])

but there you need to be aware of your dimensions (suppose you have more than two …) and the right broadcasting. np.meshgrid does all of this for you.

Also meshgrid allows you to delete coordinates together with the data if you, for example, want to do an interpolation but exclude certain values:

condition = z>0.6
z_new = z[condition] # This will make your array 1D

so how would you do the interpolation now? You can give x and y to an interpolation function like scipy.interpolate.interp2d so you need a way to know which coordinates were deleted:

x_new = xx[condition]
y_new = yy[condition]

and then you can still interpolate with the “right” coordinates (try it without the meshgrid and you will have a lot of extra code):

from scipy.interpolate import interp2d
interpolated = interp2d(x_new, y_new, z_new)

and the original meshgrid allows you to get the interpolation on the original grid again:

interpolated_grid = interpolated(xx[0], yy[:, 0]).reshape(xx.shape)

These are just some examples where I used the meshgrid there might be a lot more.

问题：如何在熊猫中获取数据框的列切片

回答 0

2017年答案-熊猫0.20：.ix已弃用。使用.loc

2017 Answer – pandas 0.20: .ix is deprecated. Use .loc

回答 1

回答 2

使用列名

使用列索引

使用ix（版本低于.20的Pandas版本）

使用重新索引方法

using the column names

using the column indices

using ix (Older than Pandas <.20 version)

using the reindex method

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

问题：将Numpy数组另存为图像

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

回答 12

回答 13

回答 14

回答 15

回答 16

回答 17

回答 18

问题：熊猫read_csv low_memory和dtype选项

回答 0

不推荐使用的low_memory选项

Dtype猜测（非常糟糕）

指定dtypes（应该总是这样做）

定义dtypes时会中断的中断数据示例

存在哪些dtype？

陷阱，注意事项，笔记

转换器的使用

The deprecated low_memory option

Dtype Guessing (very bad)

Specifying dtypes (should always be done)

Example of broken data that breaks when dtypes are defined

What dtypes exists?

Gotchas, caveats, notes

Usage of converters

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

问题：numpy.array形状（R，1）和（R，）之间的区别

回答 0

1. NumPy中形状的含义

2.怎么办？

1. The meaning of shapes in NumPy

2. What to do?

回答 1

回答 2

回答 3

回答 4

回答 5

问题：熊猫根据其他列的值创建新列/逐行应用多列的功能

回答 0

回答 1

回答 2

回答 3

回答 4

`np.ogrid` 和稀疏的网格

`np.mgrid` 和密集/充实的网格

`np.ogrid` and sparse grids

`np.mgrid` and dense/fleshed out grids