Python 实用宝典

Question 1

In R I can create the desired output by doing:

data = c(rep(1.5, 7), rep(2.5, 2), rep(3.5, 8),
         rep(4.5, 3), rep(5.5, 1), rep(6.5, 8))
plot(density(data, bw=0.5))

Density plot in R

In python (with matplotlib) the closest I got was with a simple histogram:

import matplotlib.pyplot as plt
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
plt.hist(data, bins=6)
plt.show()

Histogram in matplotlib

I also tried the normed=True parameter but couldn’t get anything other than trying to fit a gaussian to the histogram.

My latest attempts were around scipy.stats and gaussian_kde, following examples on the web, but I’ve been unsuccessful so far.

Question 2

Sven has shown how to use the class gaussian_kde from Scipy, but you will notice that it doesn’t look quite like what you generated with R. This is because gaussian_kde tries to infer the bandwidth automatically. You can play with the bandwidth in a way by changing the function covariance_factor of the gaussian_kde class. First, here is what you get without changing that function:

alt text

However, if I use the following code:

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
density = gaussian_kde(data)
xs = np.linspace(0,8,200)
density.covariance_factor = lambda : .25
density._compute_covariance()
plt.plot(xs,density(xs))
plt.show()

I get

alt text

which is pretty close to what you are getting from R. What have I done? gaussian_kde uses a changable function, covariance_factor to calculate its bandwidth. Before changing the function, the value returned by covariance_factor for this data was about .5. Lowering this lowered the bandwidth. I had to call _compute_covariance after changing that function so that all of the factors would be calculated correctly. It isn’t an exact correspondence with the bw parameter from R, but hopefully it helps you get in the right direction.

Question 3

Five years later, when I Google “how to create a kernel density plot using python”, this thread still shows up at the top!

Today, a much easier way to do this is to use seaborn, a package that provides many convenient plotting functions and good style management.

import numpy as np
import seaborn as sns
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
sns.set_style('whitegrid')
sns.kdeplot(np.array(data), bw=0.5)

Question 4

Option 1:

Use pandas dataframe plot (built on top of matplotlib):

import pandas as pd
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
pd.DataFrame(data).plot(kind='density') # or pd.Series()

Option 2:

Use distplot of seaborn:

import seaborn as sns
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
sns.distplot(data, hist=False)

Question 5

Maybe try something like:

import matplotlib.pyplot as plt
import numpy
from scipy import stats
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
density = stats.kde.gaussian_kde(data)
x = numpy.arange(0., 8, .1)
plt.plot(x, density(x))
plt.show()

You can easily replace gaussian_kde() by a different kernel density estimate.

Question 6

The density plot can also be created by using matplotlib: The function plt.hist(data) returns the y and x values necessary for the density plot (see the documentation https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.hist.html). Resultingly, the following code creates a density plot by using the matplotlib library:

import matplotlib.pyplot as plt
dat=[-1,2,1,4,-5,3,6,1,2,1,2,5,6,5,6,2,2,2]
a=plt.hist(dat,density=True)
plt.close()
plt.figure()
plt.plot(a[1][1:],a[0])

This code returns the following density plot

Question 7

I have a numpy array containing:

[1, 2, 3]

I want to create an array containing:

[1, 2, 3, 1]

That is, I want to add the first element on to the end of the array.

I have tried the obvious:

np.concatenate((a, a[0]))

But I get an error saying ValueError: arrays must have same number of dimensions

I don’t understand this – the arrays are both just 1d arrays.

Question 8

append() creates a new array which can be the old array with the appended element.

I think it’s more normal to use the proper method for adding an element:

a = numpy.append(a, a[0])

Question 9

When appending only once or once every now and again, using np.append on your array should be fine. The drawback of this approach is that memory is allocated for a completely new array every time it is called. When growing an array for a significant amount of samples it would be better to either pre-allocate the array (if the total size is known) or to append to a list and convert to an array afterward.

Using np.append:

b = np.array([0])
for k in range(int(10e4)):
    b = np.append(b, k)
1.2 s ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using python list converting to array afterward:

d = [0]
for k in range(int(10e4)):
    d.append(k)
f = np.array(d)
13.5 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pre-allocating numpy array:

e = np.zeros((n,))
for k in range(n):
    e[k] = k
9.92 ms ± 752 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

When the final size is unkown pre-allocating is difficult, I tried pre-allocating in chunks of 50 but it did not come close to using a list.

85.1 ms ± 561 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Question 10

a[0] isn’t an array, it’s the first element of a and therefore has no dimensions.

Try using a[0:1] instead, which will return the first element of a inside a single item array.

Question 11

Try this:

np.concatenate((a, np.array([a[0]])))

http://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html

concatenate needs both elements to be numpy arrays; however, a[0] is not an array. That is why it does not work.

Question 12

This command,

numpy.append(a, a[0])

does not alter a array. However, it returns a new modified array. So, if a modification is required, then the following must be used.

a = numpy.append(a, a[0])

Question 13

t = np.array([2, 3])
t = np.append(t, [4])

Question 14

This might be a bit overkill, but I always use the the np.take function for any wrap-around indexing:

>>> a = np.array([1, 2, 3])
>>> np.take(a, range(0, len(a)+1), mode='wrap')
array([1, 2, 3, 1])

>>> np.take(a, range(-1, len(a)+1), mode='wrap')
array([3, 1, 2, 3, 1])

Question 15

Let’s say a=[1,2,3] and you want it to be [1,2,3,1].

You may use the built-in append function

np.append(a,1)

Here 1 is an int, it may be a string and it may or may not belong to the elements in the array. Prints: [1,2,3,1]

Question 16

If you want to add an element use append()

a = numpy.append(a, 1) in this case add the 1 at the end of the array

If you want to insert an element use insert()

a = numpy.insert(a, index, 1) in this case you can put the 1 where you desire, using index to set the position in the array.

Question 17

I am surprised this specific question hasn’t been asked before, but I really didn’t find it on SO nor on the documentation of np.sort.

Say I have a random numpy array holding integers, e.g:

> temp = np.random.randint(1,10, 10)    
> temp
array([2, 4, 7, 4, 2, 2, 7, 6, 4, 4])

If I sort it, I get ascending order by default:

> np.sort(temp)
array([2, 2, 2, 4, 4, 4, 4, 6, 7, 7])

but I want the solution to be sorted in descending order.

Now, I know I can always do:

reverse_order = np.sort(temp)[::-1]

but is this last statement efficient? Doesn’t it create a copy in ascending order, and then reverses this copy to get the result in reversed order? If this is indeed the case, is there an efficient alternative? It doesn’t look like np.sort accepts parameters to change the sign of the comparisons in the sort operation to get things in reverse order.

Question 18

temp[::-1].sort() sorts the array in place, whereas np.sort(temp)[::-1] creates a new array.

In [25]: temp = np.random.randint(1,10, 10)

In [26]: temp
Out[26]: array([5, 2, 7, 4, 4, 2, 8, 6, 4, 4])

In [27]: id(temp)
Out[27]: 139962713524944

In [28]: temp[::-1].sort()

In [29]: temp
Out[29]: array([8, 7, 6, 5, 4, 4, 4, 4, 2, 2])

In [30]: id(temp)
Out[30]: 139962713524944

Question 19

>>> a=np.array([5, 2, 7, 4, 4, 2, 8, 6, 4, 4])

>>> np.sort(a)
array([2, 2, 4, 4, 4, 4, 5, 6, 7, 8])

>>> -np.sort(-a)
array([8, 7, 6, 5, 4, 4, 4, 4, 2, 2])

Question 20

For short arrays I suggest using np.argsort() by finding the indices of the sorted negatived array, which is slightly faster than reversing the sorted array:

In [37]: temp = np.random.randint(1,10, 10)

In [38]: %timeit np.sort(temp)[::-1]
100000 loops, best of 3: 4.65 µs per loop

In [39]: %timeit temp[np.argsort(-temp)]
100000 loops, best of 3: 3.91 µs per loop

Question 21

Unfortunately when you have a complex array, only np.sort(temp)[::-1] works properly. The two other methods mentioned here are not effective.

Question 22

Be careful with dimensions.

Let

x  # initial numpy array
I = np.argsort(x) or I = x.argsort() 
y = np.sort(x)    or y = x.sort()
z  # reverse sorted array

Full Reverse

z = x[I[::-1]]
z = -np.sort(-x)
z = np.flip(y)

flip changed in 1.15, previous versions 1.14 required axis. Solution: pip install --upgrade numpy.

First Dimension Reversed

z = y[::-1]
z = np.flipud(y)
z = np.flip(y, axis=0)

Second Dimension Reversed

z = y[::-1, :]
z = np.fliplr(y)
z = np.flip(y, axis=1)

Testing

Testing on a 100×10×10 array 1000 times.

Method       | Time (ms)
-------------+----------
y[::-1]      | 0.126659  # only in first dimension
-np.sort(-x) | 0.133152
np.flip(y)   | 0.121711
x[I[::-1]]   | 4.611778

x.sort()     | 0.024961
x.argsort()  | 0.041830
np.flip(x)   | 0.002026

This is mainly due to reindexing rather than argsort.

# Timing code
import time
import numpy as np


def timeit(fun, xs):
    t = time.time()
    for i in range(len(xs)):  # inline and map gave much worse results for x[-I], 5*t
        fun(xs[i])
    t = time.time() - t
    print(np.round(t,6))

I, N = 1000, (100, 10, 10)
xs = np.random.rand(I,*N)
timeit(lambda x: np.sort(x)[::-1], xs)
timeit(lambda x: -np.sort(-x), xs)
timeit(lambda x: np.flip(x.sort()), xs)
timeit(lambda x: x[x.argsort()[::-1]], xs)
timeit(lambda x: x.sort(), xs)
timeit(lambda x: x.argsort(), xs)
timeit(lambda x: np.flip(x), xs)

Question 23

Hello I was searching for a solution to reverse sorting a two dimensional numpy array, and I couldn’t find anything that worked, but I think I have stumbled on a solution which I am uploading just in case anyone is in the same boat.

x=np.sort(array)
y=np.fliplr(x)

np.sort sorts ascending which is not what you want, but the command fliplr flips the rows left to right! Seems to work!

Hope it helps you out!

I guess it’s similar to the suggest about -np.sort(-a) above but I was put off going for that by comment that it doesn’t always work. Perhaps my solution won’t always work either however I have tested it with a few arrays and seems to be OK.

Question 24

You could sort the array first (Ascending by default) and then apply np.flip() (https://docs.scipy.org/doc/numpy/reference/generated/numpy.flip.html)

FYI It works with datetime objects as well.

Example:

    x = np.array([2,3,1,0]) 
    x_sort_asc=np.sort(x) 
    print(x_sort_asc)

    >>> array([0, 1, 2, 3])

    x_sort_desc=np.flip(x_sort_asc) 
    print(x_sort_desc)

    >>> array([3,2,1,0])

Question 25

Here is a quick trick

In[3]: import numpy as np
In[4]: temp = np.random.randint(1,10, 10)
In[5]: temp
Out[5]: array([5, 4, 2, 9, 2, 3, 4, 7, 5, 8])

In[6]: sorted = np.sort(temp)
In[7]: rsorted = list(reversed(sorted))
In[8]: sorted
Out[8]: array([2, 2, 3, 4, 4, 5, 5, 7, 8, 9])

In[9]: rsorted
Out[9]: [9, 8, 7, 5, 5, 4, 4, 3, 2, 2]

Question 26

i suggest using this …

np.arange(start_index, end_index, intervals)[::-1]

for example:

np.arange(10, 20, 0.5)
np.arange(10, 20, 0.5)[::-1]

Then your resault:

[ 19.5,  19. ,  18.5,  18. ,  17.5,  17. ,  16.5,  16. ,  15.5,
    15. ,  14.5,  14. ,  13.5,  13. ,  12.5,  12. ,  11.5,  11. ,
    10.5,  10. ]

Question 27

I’m looking for the fastest way to check for the occurrence of NaN (np.nan) in a NumPy array X. np.isnan(X) is out of the question, since it builds a boolean array of shape X.shape, which is potentially gigantic.

I tried np.nan in X, but that seems not to work because np.nan != np.nan. Is there a fast and memory-efficient way to do this at all?

(To those who would ask “how gigantic”: I can’t tell. This is input validation for library code.)

Question 28

Ray’s solution is good. However, on my machine it is about 2.5x faster to use numpy.sum in place of numpy.min:

In [13]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 244 us per loop

In [14]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 97.3 us per loop

Unlike min, sum doesn’t require branching, which on modern hardware tends to be pretty expensive. This is probably the reason why sum is faster.

edit The above test was performed with a single NaN right in the middle of the array.

It is interesting to note that min is slower in the presence of NaNs than in their absence. It also seems to get slower as NaNs get closer to the start of the array. On the other hand, sum‘s throughput seems constant regardless of whether there are NaNs and where they’re located:

In [40]: x = np.random.rand(100000)

In [41]: %timeit np.isnan(np.min(x))
10000 loops, best of 3: 153 us per loop

In [42]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop

In [43]: x[50000] = np.nan

In [44]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 239 us per loop

In [45]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.8 us per loop

In [46]: x[0] = np.nan

In [47]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 326 us per loop

In [48]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop

Question 29

I think np.isnan(np.min(X)) should do what you want.

Question 30

Even there exist an accepted answer, I’ll like to demonstrate the following (with Python 2.7.2 and Numpy 1.6.0 on Vista):

In []: x= rand(1e5)
In []: %timeit isnan(x.min())
10000 loops, best of 3: 200 us per loop
In []: %timeit isnan(x.sum())
10000 loops, best of 3: 169 us per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 134 us per loop

In []: x[5e4]= NaN
In []: %timeit isnan(x.min())
100 loops, best of 3: 4.47 ms per loop
In []: %timeit isnan(x.sum())
100 loops, best of 3: 6.44 ms per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 138 us per loop

Thus, the really efficient way might be heavily dependent on the operating system. Anyway dot(.) based seems to be the most stable one.

Question 31

There are two general approaches here:

Check each array item for nan and take any.
Apply some cumulative operation that preserves nans (like sum) and check its result.

While the first approach is certainly the cleanest, the heavy optimization of some of the cumulative operations (particularly the ones that are executed in BLAS, like dot) can make those quite fast. Note that dot, like some other BLAS operations, are multithreaded under certain conditions. This explains the difference in speed between different machines.

import numpy
import perfplot


def min(a):
    return numpy.isnan(numpy.min(a))


def sum(a):
    return numpy.isnan(numpy.sum(a))


def dot(a):
    return numpy.isnan(numpy.dot(a, a))


def any(a):
    return numpy.any(numpy.isnan(a))


def einsum(a):
    return numpy.isnan(numpy.einsum("i->", a))


perfplot.show(
    setup=lambda n: numpy.random.rand(n),
    kernels=[min, sum, dot, any, einsum],
    n_range=[2 ** k for k in range(20)],
    logx=True,
    logy=True,
    xlabel="len(a)",
)

Question 32

use .any()

if numpy.isnan(myarray).any()
numpy.isfinite maybe better than isnan for checking

if not np.isfinite(prop).all()

Question 33

If you’re comfortable with numba it allows to create a fast short-circuit (stops as soon as a NaN is found) function:

import numba as nb
import math

@nb.njit
def anynan(array):
    array = array.ravel()
    for i in range(array.size):
        if math.isnan(array[i]):
            return True
    return False

If there is no NaN the function might actually be slower than np.min, I think that’s because np.min uses multiprocessing for large arrays:

import numpy as np
array = np.random.random(2000000)

%timeit anynan(array)          # 100 loops, best of 3: 2.21 ms per loop
%timeit np.isnan(array.sum())  # 100 loops, best of 3: 4.45 ms per loop
%timeit np.isnan(array.min())  # 1000 loops, best of 3: 1.64 ms per loop

But in case there is a NaN in the array, especially if it’s position is at low indices, then it’s much faster:

array = np.random.random(2000000)
array[100] = np.nan

%timeit anynan(array)          # 1000000 loops, best of 3: 1.93 µs per loop
%timeit np.isnan(array.sum())  # 100 loops, best of 3: 4.57 ms per loop
%timeit np.isnan(array.min())  # 1000 loops, best of 3: 1.65 ms per loop

Similar results may be achieved with Cython or a C extension, these are a bit more complicated (or easily avaiable as bottleneck.anynan) but ultimatly do the same as my anynan function.

Question 34

Related to this is the question of how to find the first occurrence of NaN. This is the fastest way to handle that that I know of:

index = next((i for (i,n) in enumerate(iterable) if n!=n), None)

Question 35

I find myself typing import numpy as np almost every single time I fire up the python interpreter. How do I set up the python or ipython interpreter so that numpy is automatically imported?

Question 36

Use the environment variable PYTHONSTARTUP. From the official documentation:

If this is the name of a readable file, the Python commands in that file are executed before the first prompt is displayed in interactive mode. The file is executed in the same namespace where interactive commands are executed so that objects defined or imported in it can be used without qualification in the interactive session.

So, just create a python script with the import statement and point the environment variable to it. Having said that, remember that ‘Explicit is always better than implicit’, so don’t rely on this behavior for production scripts.

For Ipython, see this tutorial on how to make a ipython_config file

Question 37

For ipython, there are two ways to achieve this. Both involve ipython’s configuration directory which is located in ~/.ipython.

Create a custom ipython profile.
Or you can add a startup file to ~/.ipython/profile_default/startup/

For simplicity, I’d use option 2. All you have to do is place a .py or .ipy file in the ~/.ipython/profile_default/startup directory and it will automatically be executed. So you could simple place import numpy as np in a simple file and you’ll have np in the namespace of your ipython prompt.

Option 2 will actually work with a custom profile, but using a custom profile will allow you to change the startup requirements and other configuration based on a particular case. However, if you’d always like np to be available to you then by all means put it in the startup directory.

For more information on ipython configuration. The docs have a much more complete explanation.

Question 38

I use a ~/.startup.py file like this:

# Ned's .startup.py file
print("(.startup.py)")
import datetime, os, pprint, re, sys, time
print("(imported datetime, os, pprint, re, sys, time)")

pp = pprint.pprint

Then define PYTHONSTARTUP=~/.startup.py, and Python will use it when starting a shell.

The print statements are there so when I start the shell, I get a reminder that it’s in effect, and what has been imported already. The pp shortcut is really handy too…

Question 39

While creating a custom startup script like ravenac95 suggests is the best general answer for most cases, it won’t work in circumstances where you want to use a from __future__ import X. If you sometimes work in Python 2.x but want to use modern division, there is only one way to do this. Once you create a profile, edit the profile_default (For Ubuntu this is located in ~/.ipython/profile_default) and add something like the following to the bottom:

c.InteractiveShellApp.exec_lines = [
    'from __future__ import division, print_function',
    'import numpy as np',
    'import matplotlib.pyplot as plt',
    ]

Question 40

As a simpler alternative to the accepted answer, on linux:

just define an alias, e.g. put alias pynp='python -i -c"import numpy as np"' in your ~/.bash_aliases file. You can then invoke python+numpy with pynp, and you can still use just python with python. Python scripts’ behaviour is left untouched.

Question 41

You can create a normal python script as import_numpy.py or anything you like

#!/bin/env python3
import numpy as np

then launch it with -i flag.

python -i import_numpy.py

Way like this will give you flexibility to choose only modules you want for different projects.

Question 42

As ravenac95 mentioned in his answer, you can either create a custom profile or modify the default profile. This answer is quick view of Linux commands needed to import numpy as np automatically.

If you want to use a custom profile called numpy, run:

ipython profile create numpy
echo 'import numpy as np' >> $(ipython locate profile numpy)/startup/00_imports.py
ipython --profile=numpy

Or if you want to modify the default profile to always import numpy:

echo 'import numpy as np' >> $(ipython locate profile default)/startup/00_imports.py
ipython

Check out the IPython config tutorial to read more in depth about configuring profiles. See .ipython/profile_default/startup/README to understand how the startup directory works.

Question 43

My default ipython invocation is

ipython --pylab --nosep --InteractiveShellApp.pylab_import_all=False

--pylab has been a ipython option for some time. It imports numpy and (parts of) matplotlib. I’ve added the --Inter... option so it does not use the * import, since I prefer to use the explicit np.....

This can be a shortcut, alias or script.

Question 44

Suppose I have a numpy array:

data = np.array([[1,1,1],[2,2,2],[3,3,3]])

and I have a corresponding “vector:”

vector = np.array([1,2,3])

How do I operate on data along each row to either subtract or divide so the result is:

sub_result = [[0,0,0], [0,0,0], [0,0,0]]
div_result = [[1,1,1], [1,1,1], [1,1,1]]

Long story short: How do I perform an operation on each row of a 2D array with a 1D array of scalars that correspond to each row?

Question 45

Here you go. You just need to use None (or alternatively np.newaxis) combined with broadcasting:

In [6]: data - vector[:,None]
Out[6]:
array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])

In [7]: data / vector[:,None]
Out[7]:
array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]])

Question 46

As has been mentioned, slicing with None or with np.newaxes is a great way to do this. Another alternative is to use transposes and broadcasting, as in

(data.T - vector).T

and

(data.T / vector).T

For higher dimensional arrays you may want to use the swapaxes method of NumPy arrays or the NumPy rollaxis function. There really are a lot of ways to do this.

For a fuller explanation of broadcasting, see http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html

Question 47

JoshAdel’s solution uses np.newaxis to add a dimension. An alternative is to use reshape() to align the dimensions in preparation for broadcasting.

data = np.array([[1,1,1],[2,2,2],[3,3,3]])
vector = np.array([1,2,3])

data
# array([[1, 1, 1],
#        [2, 2, 2],
#        [3, 3, 3]])
vector
# array([1, 2, 3])

data.shape
# (3, 3)
vector.shape
# (3,)

data / vector.reshape((3,1))
# array([[1, 1, 1],
#        [1, 1, 1],
#        [1, 1, 1]])

Performing the reshape() allows the dimensions to line up for broadcasting:

data:            3 x 3
vector:              3
vector reshaped: 3 x 1

Note that data/vector is ok, but it doesn’t get you the answer that you want. It divides each column of array (instead of each row) by each corresponding element of vector. It’s what you would get if you explicitly reshaped vector to be 1x3 instead of 3x1.

data / vector
# array([[1, 0, 0],
#        [2, 1, 0],
#        [3, 1, 1]])
data / vector.reshape((1,3))
# array([[1, 0, 0],
#        [2, 1, 0],
#        [3, 1, 1]])

Question 48

Pythonic way to do this is …

np.divide(data.T,vector).T

This takes care of reshaping and also the results are in floating point format. In other answers results are in rounded integer format.

#NOTE: No of columns in both data and vector should match

Question 49

Adding to the answer of stackoverflowuser2010, in the general case you can just use

data = np.array([[1,1,1],[2,2,2],[3,3,3]])

vector = np.array([1,2,3])

data / vector.reshape(-1,1)

This will turn your vector into a column matrix/vector. Allowing you to do the elementwise operations as you wish. At least to me, this is the most intuitive way going about it and since (in most cases) numpy will just use a view of the same internal memory for the reshaping it’s efficient too.

Question 50

While reading up on numpy, I encountered the function numpy.histogram().

What is it for and how does it work? In the docs they mention bins: What are they?

Some googling led me to the definition of Histograms in general. I get that. But unfortunately I can’t link this knowledge to the examples given in the docs.

Question 51

A bin is range that represents the width of a single bar of the histogram along the X-axis. You could also call this the interval. (Wikipedia defines them more formally as “disjoint categories”.)

The Numpy histogram function doesn’t draw the histogram, but it computes the occurrences of input data that fall within each bin, which in turns determines the area (not necessarily the height if the bins aren’t of equal width) of each bar.

In this example:

 np.histogram([1, 2, 1], bins=[0, 1, 2, 3])

There are 3 bins, for values ranging from 0 to 1 (excl 1.), 1 to 2 (excl. 2) and 2 to 3 (incl. 3), respectively. The way Numpy defines these bins if by giving a list of delimiters ([0, 1, 2, 3]) in this example, although it also returns the bins in the results, since it can choose them automatically from the input, if none are specified. If bins=5, for example, it will use 5 bins of equal width spread between the minimum input value and the maximum input value.

The input values are 1, 2 and 1. Therefore, bin “1 to 2” contains two occurrences (the two 1 values), and bin “2 to 3” contains one occurrence (the 2). These results are in the first item in the returned tuple: array([0, 2, 1]).

Since the bins here are of equal width, you can use the number of occurrences for the height of each bar. When drawn, you would have:

a bar of height 0 for range/bin [0,1] on the X-axis,
a bar of height 2 for range/bin [1,2],
a bar of height 1 for range/bin [2,3].

You can plot this directly with Matplotlib (its hist function also returns the bins and the values):

>>> import matplotlib.pyplot as plt
>>> plt.hist([1, 2, 1], bins=[0, 1, 2, 3])
(array([0, 2, 1]), array([0, 1, 2, 3]), <a list of 3 Patch objects>)
>>> plt.show()

enter image description here

Question 52

import numpy as np    
hist, bin_edges = np.histogram([1, 1, 2, 2, 2, 2, 3], bins = range(5))

Below, hist indicates that there are 0 items in bin #0, 2 in bin #1, 4 in bin #3, 1 in bin #4.

print(hist)
# array([0, 2, 4, 1])

bin_edges indicates that bin #0 is the interval [0,1), bin #1 is [1,2), …, bin #3 is [3,4).

print (bin_edges)
# array([0, 1, 2, 3, 4]))

Play with the above code, change the input to np.histogram and see how it works.

But a picture is worth a thousand words:

import matplotlib.pyplot as plt
plt.bar(bin_edges[:-1], hist, width = 1)
plt.xlim(min(bin_edges), max(bin_edges))
plt.show()

enter image description here

Question 53

Another useful thing to do with numpy.histogram is to plot the output as the x and y coordinates on a linegraph. For example:

arr = np.random.randint(1, 51, 500)
y, x = np.histogram(arr, bins=np.arange(51))
fig, ax = plt.subplots()
ax.plot(x[:-1], y)
fig.show()

This can be a useful way to visualize histograms where you would like a higher level of granularity without bars everywhere. Very useful in image histograms for identifying extreme pixel values.

Question 54

I recently moved to Python 3.5 and noticed the new matrix multiplication operator (@) sometimes behaves differently from the numpy dot operator. In example, for 3d arrays:

import numpy as np

a = np.random.rand(8,13,13)
b = np.random.rand(8,13,13)
c = a @ b  # Python 3.5+
d = np.dot(a, b)

The @ operator returns an array of shape:

c.shape
(8, 13, 13)

while the np.dot() function returns:

d.shape
(8, 13, 8, 13)

How can I reproduce the same result with numpy dot? Are there any other significant differences?

Question 55

The @ operator calls the array’s __matmul__ method, not dot. This method is also present in the API as the function np.matmul.

>>> a = np.random.rand(8,13,13)
>>> b = np.random.rand(8,13,13)
>>> np.matmul(a, b).shape
(8, 13, 13)

From the documentation:

matmul differs from dot in two important ways.

Multiplication by scalars is not allowed.

Stacks of matrices are broadcast together as if the matrices were elements.

The last point makes it clear that dot and matmul methods behave differently when passed 3D (or higher dimensional) arrays. Quoting from the documentation some more:

For matmul:

If either argument is N-D, N > 2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.

For np.dot:

For 2-D arrays it is equivalent to matrix multiplication, and for 1-D arrays to inner product of vectors (without complex conjugation). For N dimensions it is a sum product over the last axis of a and the second-to-last of b

Question 56

The answer by @ajcr explains how the dot and matmul (invoked by the @ symbol) differ. By looking at a simple example, one clearly sees how the two behave differently when operating on ‘stacks of matricies’ or tensors.

To clarify the differences take a 4×4 array and return the dot product and matmul product with a 3x4x2 ‘stack of matricies’ or tensor.

import numpy as np
fourbyfour = np.array([
                       [1,2,3,4],
                       [3,2,1,4],
                       [5,4,6,7],
                       [11,12,13,14]
                      ])


threebyfourbytwo = np.array([
                             [[2,3],[11,9],[32,21],[28,17]],
                             [[2,3],[1,9],[3,21],[28,7]],
                             [[2,3],[1,9],[3,21],[28,7]],
                            ])

print('4x4*3x4x2 dot:\n {}\n'.format(np.dot(fourbyfour,threebyfourbytwo)))
print('4x4*3x4x2 matmul:\n {}\n'.format(np.matmul(fourbyfour,threebyfourbytwo)))

The products of each operation appear below. Notice how the dot product is,

…a sum product over the last axis of a and the second-to-last of b

and how the matrix product is formed by broadcasting the matrix together.

4x4*3x4x2 dot:
 [[[232 152]
  [125 112]
  [125 112]]

 [[172 116]
  [123  76]
  [123  76]]

 [[442 296]
  [228 226]
  [228 226]]

 [[962 652]
  [465 512]
  [465 512]]]

4x4*3x4x2 matmul:
 [[[232 152]
  [172 116]
  [442 296]
  [962 652]]

 [[125 112]
  [123  76]
  [228 226]
  [465 512]]

 [[125 112]
  [123  76]
  [228 226]
  [465 512]]]

Question 57

Just FYI, @ and its numpy equivalents dot and matmul are all equally fast. (Plot created with perfplot, a project of mine.)

Code to reproduce the plot:

import perfplot
import numpy


def setup(n):
    A = numpy.random.rand(n, n)
    x = numpy.random.rand(n)
    return A, x


def at(data):
    A, x = data
    return A @ x


def numpy_dot(data):
    A, x = data
    return numpy.dot(A, x)


def numpy_matmul(data):
    A, x = data
    return numpy.matmul(A, x)


perfplot.show(
    setup=setup,
    kernels=[at, numpy_dot, numpy_matmul],
    n_range=[2 ** k for k in range(15)],
)

Question 58

In mathematics, I think the dot in numpy makes more sense

dot(a,b)_{i,j,k,a,b,c} = $\sum_m a_{i,j,k,m}b_{a,b,m,c}$

since it gives the dot product when a and b are vectors, or the matrix multiplication when a and b are matrices

As for matmul operation in numpy, it consists of parts of dot result, and it can be defined as

>matmul(a,b)_{i,j,k,c} = $\sum_m a_{i,j,k,m}b_{i,j,m,c}$

So, you can see that matmul(a,b) returns an array with a small shape, which has smaller memory consumption and make more sense in applications. In particular, combining with broadcasting, you can get

matmul(a,b)_{i,j,k,l} = $\sum_m a_{i,j,k,m}b_{j,m,l}$

for example.

From the above two definitions, you can see the requirements to use those two operations. Assume a.shape=(s1,s2,s3,s4) and b.shape=(t1,t2,t3,t4)

To use dot(a,b) you need
1. t3=s4;
To use matmul(a,b) you need
1. t3=s4
2. t2=s2, or one of t2 and s2 is 1
3. t1=s1, or one of t1 and s1 is 1

Use the following piece of code to convince yourself.

Code sample

import numpy as np
for it in xrange(10000):
    a = np.random.rand(5,6,2,4)
    b = np.random.rand(6,4,3)
    c = np.matmul(a,b)
    d = np.dot(a,b)
    #print 'c shape: ', c.shape,'d shape:', d.shape

    for i in range(5):
        for j in range(6):
            for k in range(2):
                for l in range(3):
                    if not c[i,j,k,l] == d[i,j,k,j,l]:
                        print it,i,j,k,l,c[i,j,k,l]==d[i,j,k,j,l] #you will not see them

Question 59

Here is a comparison with np.einsum to show how the indices are projected

np.allclose(np.einsum('ijk,ijk->ijk', a,b), a*b)        # True 
np.allclose(np.einsum('ijk,ikl->ijl', a,b), a@b)        # True
np.allclose(np.einsum('ijk,lkm->ijlm',a,b), a.dot(b))   # True

Question 60

My experience with MATMUL and DOT

I was constantly getting “ValueError: Shape of passed values is (200, 1), indices imply (200, 3)” when trying to use MATMUL. I wanted a quick workaround and found DOT to deliver the same functionality. I don’t get any error using DOT. I get the correct answer

with MATMUL

X.shape
>>>(200, 3)

type(X)

>>>pandas.core.frame.DataFrame

w

>>>array([0.37454012, 0.95071431, 0.73199394])

YY = np.matmul(X,w)

>>>  ValueError: Shape of passed values is (200, 1), indices imply (200, 3)"

with DOT

YY = np.dot(X,w)
# no error message
YY
>>>array([ 2.59206877,  1.06842193,  2.18533396,  2.11366346,  0.28505879, …

YY.shape

>>> (200, )

Question 61

I have a 2 dimensional NumPy array. I know how to get the maximum values over axes:

>>> a = array([[1,2,3],[4,3,1]])
>>> amax(a,axis=0)
array([4, 3, 3])

How can I get the indices of the maximum elements? I would like as output array([1,1,0]) instead.

Question 62

>>> a.argmax(axis=0)

array([1, 1, 0])

Question 63

>>> import numpy as np
>>> a = np.array([[1,2,3],[4,3,1]])
>>> i,j = np.unravel_index(a.argmax(), a.shape)
>>> a[i,j]
4

Question 64

argmax() will only return the first occurrence for each row. http://docs.scipy.org/doc/numpy/reference/generated/numpy.argmax.html

If you ever need to do this for a shaped array, this works better than unravel:

import numpy as np
a = np.array([[1,2,3], [4,3,1]])  # Can be of any shape
indices = np.where(a == a.max())

You can also change your conditions:

indices = np.where(a >= 1.5)

The above gives you results in the form that you asked for. Alternatively, you can convert to a list of x,y coordinates by:

x_y_coords =  zip(indices[0], indices[1])

问题：如何在matplotlib中创建密度图？

回答 0

回答 1

回答 2

回答 3

回答 4

问题：将单个元素添加到numpy中的数组

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

问题：有效地对numpy数组进行降序排序？

回答 0

回答 1

回答 2

回答 3

回答 4

注意尺寸。

让

全反转

第一维反转

逆向二维

测试中

Be careful with dimensions.

Let

Full Reverse

First Dimension Reversed

Second Dimension Reversed

Testing

回答 5

回答 6

回答 7

回答 8

然后您的恢复：

Then your resault:

问题：快速检查NumPy中的NaN

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

问题：输入python或ipython解释器时自动导入模块

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

问题：numpy：将每行除以一个向量元素

回答 0

回答 1

回答 2

回答 3

回答 4

问题：numpy.histogram（）如何工作？

回答 0

回答 1

回答 2

问题：numpy dot（）和Python 3.5+矩阵乘法@之间的区别

回答 0

回答 1

回答 2

回答 3

> matmul（a，b）_ {i，j，k，c} =

代码样例

>matmul(a,b)_{i,j,k,c} =

Code sample

回答 4

回答 5

问题：如何沿一个轴获取numpy数组中最大元素的索引

回答 0

> matmul（a，b）_ {i，j，k，c} = $\sum_m a_{i,j,k,m}b_{i,j,m,c}$

>matmul(a,b)_{i,j,k,c} = $\sum_m a_{i,j,k,m}b_{i,j,m,c}$