标签归档:numpy

Python OpenCV2(cv2)包装器获取图像大小?

问题:Python OpenCV2(cv2)包装器获取图像大小?

如何在 Python OpenCV(numpy)的 cv2 包装器中获取图像的大小?除了 numpy.shape() 之外,还有其他正确的方法吗?如何获得 (宽度, 高度) 这种格式的尺寸?

How to get the size of an image in cv2 wrapper in Python OpenCV (numpy). Is there a correct way to do that other than numpy.shape(). How can I get it in these format dimensions: (width, height) list?


回答 0

cv2 使用 numpy 来处理图像,因此获取图像大小的正确且最佳的方法就是使用 numpy.shape。假设您使用的是 BGR 图像,下面是一个示例:

>>> import numpy as np
>>> import cv2
>>> img = cv2.imread('foo.jpg')
>>> height, width, channels = img.shape
>>> print height, width, channels
  600 800 3

如果您处理的是二值图像,img 将只有两个维度,因此必须把代码改成:height, width = img.shape

cv2 uses numpy for manipulating images, so the proper and best way to get the size of an image is using numpy.shape. Assuming you are working with BGR images, here is an example:

>>> import numpy as np
>>> import cv2
>>> img = cv2.imread('foo.jpg')
>>> height, width, channels = img.shape
>>> print height, width, channels
  600 800 3

In case you were working with binary images, img will have two dimensions, and therefore you must change the code to: height, width = img.shape
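As a small illustrative sketch (assuming an image file named foo.jpg exists), the following handles both the colour and the grayscale/binary case without knowing the number of channels in advance:

import cv2

img = cv2.imread('foo.jpg')          # returns None if the file cannot be read
if img is not None:
    if img.ndim == 3:                # colour (BGR) image: three dimensions
        height, width, channels = img.shape
    else:                            # grayscale / binary image: only two dimensions
        height, width = img.shape
    print(width, height)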


回答 1

恐怕没有“更好”的方法来获得这个大小,不过也没有那么麻烦。

当然,您的代码对二值/单通道图像以及多通道图像都应当是安全的,但图像的主要维度在 numpy 数组的 shape 中总是排在最前面。如果您更看重可读性,或者不想每次都敲这段代码,可以把它包装成一个函数,并起一个您喜欢的名字,例如 cv_size:

import numpy as np
import cv2

# ...

def cv_size(img):
    return tuple(img.shape[1::-1])

如果您是在终端 / ipython 中工作,也可以用 lambda 来表达:

>>> cv_size = lambda img: tuple(img.shape[1::-1])
>>> cv_size(img)
(640, 480)

交互式工作时,用 def 来编写函数并不方便。

编辑

本来我以为用 [:2] 就可以了,但 numpy 的 shape 是 (height, width[, depth]),而我们需要的是 (width, height),正如 cv2.resize 所期望的那样,因此必须使用 [1::-1]。这比 [:2] 更难记,更何况谁会记得反向切片呢?

I’m afraid there is no “better” way to get this size, however it’s not that much pain.

Of course your code should be safe for both binary/mono images as well as multi-channel ones, but the principal dimensions of the image always come first in the numpy array’s shape. If you opt for readability, or don’t want to bother typing this, you can wrap it up in a function, and give it a name you like, e.g. cv_size:

import numpy as np
import cv2

# ...

def cv_size(img):
    return tuple(img.shape[1::-1])

If you’re on a terminal / ipython, you can also express it with a lambda:

>>> cv_size = lambda img: tuple(img.shape[1::-1])
>>> cv_size(img)
(640, 480)

Writing functions with def is not fun while working interactively.

Edit

Originally I thought that using [:2] was OK, but the numpy shape is (height, width[, depth]), and we need (width, height), as e.g. cv2.resize expects, so – we must use [1::-1]. Even less memorable than [:2]. And who remembers reverse slicing anyway?
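For illustration only (the 640x480 array below is synthetic, not from the original answer), this shows that the tuple returned by cv_size is exactly what cv2.resize expects as its dsize argument:

import numpy as np
import cv2

cv_size = lambda img: tuple(img.shape[1::-1])

img = np.zeros((480, 640, 3), dtype=np.uint8)   # height=480, width=640
print(cv_size(img))                              # (640, 480)
resized = cv2.resize(img, cv_size(img))          # cv2.resize takes (width, height)
print(resized.shape)                             # (480, 640, 3)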


Numpy 索引切片而不丢失维度信息

问题:Numpy 索引切片而不丢失维度信息

我正在使用numpy,并希望在不丢失维度信息的情况下对行进行索引。

import numpy as np
X = np.zeros((100,10))
X.shape        # >> (100, 10)
xslice = X[10,:]
xslice.shape   # >> (10,)  

在此示例中,xslice 现在是 1 维的,但我希望它是 (1,10)。在 R 中,我会使用 X[10,:,drop=F]。numpy 中有类似的东西吗?我在文档中没有找到,也没有看到有人问过类似的问题。

谢谢!

I’m using numpy and want to index a row without losing the dimension information.

import numpy as np
X = np.zeros((100,10))
X.shape        # >> (100, 10)
xslice = X[10,:]
xslice.shape   # >> (10,)  

In this example xslice is now 1 dimension, but I want it to be (1,10). In R, I would use X[10,:,drop=F]. Is there something similar in numpy. I couldn’t find it in the documentation and didn’t see a similar question asked.

Thanks!


回答 0

最简单的做法大概是 x[None, 10, :],或者等效但更具可读性的 x[np.newaxis, 10, :]。

至于为什么这不是默认行为:就我个人而言,我发现总是得到带有单例维度的数组会很快让人厌烦。我猜 numpy 的开发者们也有同样的感觉。

另外,numpy可以很好地处理广播数组,因此通常没有理由保留切片所来自的数组的尺寸。如果您这样做了,那么类似:

a = np.zeros((100,100,10))
b = np.zeros((100,10))
a[0,:,:] = b

要么行不通,要么实施起来更加困难。

(或者至少这是我对切片时删除维度信息背后的numpy开发人员的猜测)

It’s probably easiest to do x[None, 10, :] or equivalently (but more readable) x[np.newaxis, 10, :].

As far as why it’s not the default, personally, I find that constantly having arrays with singleton dimensions gets annoying very quickly. I’d guess the numpy devs felt the same way.

Also, numpy handle broadcasting arrays very well, so there’s usually little reason to retain the dimension of the array the slice came from. If you did, then things like:

a = np.zeros((100,100,10))
b = np.zeros((100,10))
a[0,:,:] = b

either wouldn’t work or would be much more difficult to implement.

(Or at least that’s my guess at the numpy dev’s reasoning behind dropping dimension info when slicing)
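A quick, illustrative check of the shapes involved (using the same X as in the question):

import numpy as np

X = np.zeros((100, 10))
print(X[10, :].shape)              # (10,)   -- the dimension is dropped
print(X[np.newaxis, 10, :].shape)  # (1, 10) -- the dimension is kept
print(X[10, :, None].shape)        # (10, 1) -- placement of None decides where the new axis goes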


回答 1

另一个解决方案是

X[[10],:]

要么

I = np.array([10])
X[I,:]

当通过索引列表(或数组)执行索引时,将保留数组的维数。这很好,因为它使您可以选择保持尺寸和压缩尺寸。

Another solution is to do

X[[10],:]

or

I = np.array([10])
X[I,:]

The dimensionality of an array is preserved when indexing is performed by a list (or an array) of indexes. This is nice because it leaves you with the choice between keeping the dimension and squeezing.


回答 2

我找到了一些合理的解决方案。

1)使用 numpy.take(X,[10],0)

2)使用这个奇怪的索引 X[10:11:, :]

理想情况下,这应该是默认设置。我从未理解过为什么尺寸会下降。但这是关于numpy的讨论…

I found a few reasonable solutions.

1) use numpy.take(X,[10],0)

2) use this strange indexing X[10:11:, :]

Ideally, this should be the default. I never understood why dimensions are ever dropped. But that’s a discussion for numpy…


回答 3

这是我更喜欢的替代方法。而不是使用单个数字编制索引,而是使用范围进行索引。即使用X[10:11,:]。(请注意,其中10:11不包括11)。

import numpy as np
X = np.zeros((100,10))
X.shape        # >> (100, 10)
xslice = X[10:11,:]
xslice.shape   # >> (1,10)

这在维度更多时也很容易理解:不需要摆弄 None,也不用琢磨该在哪个轴上插入新轴。同样也不需要针对数组大小做额外的记账,只要把常规索引中会用到的任何 i 写成 i:i+1 即可。

b = np.ones((2, 3, 4))
b.shape # >> (2, 3, 4)
b[1:2,:,:].shape  # >> (1, 3, 4)
b[:, 2:3, :].shape  # >> (2, 1, 4)

Here’s an alternative I like better. Instead of indexing with a single number, index with a range. That is, use X[10:11,:]. (Note that 10:11 does not include 11).

import numpy as np
X = np.zeros((100,10))
X.shape        # >> (100, 10)
xslice = X[10:11,:]
xslice.shape   # >> (1,10)

This makes it easy to understand with more dimensions too, no None juggling and figuring out which axis to use which index. Also no need to do extra bookkeeping regarding array size, just i:i+1 for any i that you would have used in regular indexing.

b = np.ones((2, 3, 4))
b.shape # >> (2, 3, 4)
b[1:2,:,:].shape  # >> (1, 3, 4)
b[:, 2:3, :].shape  # >> (2, 1, 4)

回答 4

作为对 gnebehay 提出的按列表或数组建立索引方案的补充,还可以使用元组:

X[(10,),:]

To add to the solution involving indexing by lists or arrays by gnebehay, it is also possible to use tuples:

X[(10,),:]

回答 5

如果您在运行时要用一个长度可能为 1 的数组来建立索引,这一点就尤其令人讨厌。对于这种情况,可以使用 np.ix_:

some_array[np.ix_(row_index,column_index)]

This is especially annoying if you’re indexing by an array that might be length 1 at runtime. For that case, there’s np.ix_:

some_array[np.ix_(row_index,column_index)]
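A small, made-up example of the shape-preserving behaviour of np.ix_ (the array and index names are illustrative):

import numpy as np

a = np.arange(12).reshape(3, 4)
row_index = np.array([1])            # might happen to be length 1 at runtime
column_index = np.array([0, 2])
print(a[np.ix_(row_index, column_index)])        # [[4 6]]
print(a[np.ix_(row_index, column_index)].shape)  # (1, 2) -- still 2-D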

在Python中如何用numpy计算自然对数(即“ln()”)?

问题:在Python中如何用numpy计算自然对数(即“ln()”)?

使用numpy,如何执行以下操作:

ln(x)

它是否等效于:

np.log(x)

我为这样一个看似微不足道的问题道歉,但按照我对 log 与 ln 区别的理解,ln 是以 e 为底的对数(logspace e)?

Using numpy, how can I do the following:

ln(x)

Is it equivalent to:

np.log(x)

I apologise for such a seemingly trivial question, but my understanding of the difference between log and ln is that ln is logspace e?


回答 0


回答 1

正确,np.log(x) 就是 x 的自然对数(以 e 为底的对数)。

对于其他底数,请记住换底公式:log-b(x) = log-k(x) / log-k(b),其中 log-b 是以任意底数 b 为底的对数,log-k 是以 k 为底的对数,例如

这里k = e

l = np.log(x) / np.log(100)

并且 l 就是 x 以 100 为底的对数。

Correct, np.log(x) is the Natural Log (base e log) of x.

For other bases, remember this law of logs: log-b(x) = log-k(x) / log-k(b) where log-b is the log in some arbitrary base b, and log-k is the log in base k, e.g.

here k = e

l = np.log(x) / np.log(100)

and l is the log-base-100 of x
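A brief numeric sketch of the rule above (values are approximate):

import numpy as np

x = 100.0
print(np.log(x))                 # ~4.60517  natural log, i.e. ln(x)
print(np.log(x) / np.log(100))   # 1.0       log base 100, via the change-of-base rule
print(np.log10(x))               # 2.0       numpy also provides base-10...
print(np.log2(x))                # ~6.64386  ...and base-2 logs directly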


回答 2

我通常这样做:

from numpy import log as ln

也许这可以使您更舒适。

I usually do like this:

from numpy import log as ln

Perhaps this can make you more comfortable.


回答 3

您也可以反过来,把 math.log 的底数设为 e 来计算自然对数。

import math

e = 2.718281

math.log(10, e) = 2.302585093
ln(10) = 2.302585093

You could simply do the reverse by making the base of the log e.

import math

e = 2.718281

math.log(10, e) = 2.302585093
ln(10) = 2.302585093

回答 4

from numpy.lib.scimath import logn
from math import e

#using: x - var
logn(e, x)
from numpy.lib.scimath import logn
from math import e

#using: x - var
logn(e, x)

在Python中绘制快速傅立叶变换

问题:在Python中绘制快速傅立叶变换

我可以访问NumPy和SciPy,并希望为数据集创建一个简单的FFT。我有两个列表,一个是y值,另一个是这些y值的时间戳。

将这些列表输入SciPy或NumPy方法并绘制所得FFT的最简单方法是什么?

我查看过一些示例,但它们都依赖于创建带有一定数量数据点和频率等参数的伪造数据集,并没有真正展示如何仅用一组数据及其对应的时间戳来做到这一点。

我尝试了以下示例:

from scipy.fftpack import fft

# Number of samplepoints
N = 600

# Sample spacing
T = 1.0 / 800.0
x = np.linspace(0.0, N*T, N)
y = np.sin(50.0 * 2.0*np.pi*x) + 0.5*np.sin(80.0 * 2.0*np.pi*x)
yf = fft(y)
xf = np.linspace(0.0, 1.0/(2.0*T), N/2)
import matplotlib.pyplot as plt
plt.plot(xf, 2.0/N * np.abs(yf[0:N/2]))
plt.grid()
plt.show()

但是,当我把 fft 的参数换成我的数据集并绘制出来时,会得到极其奇怪的结果,而且频率的刻度看起来可能不对。我不太确定。

这是我尝试进行 FFT 的数据的 pastebin:

http://pastebin.com/0WhjjMkb http://pastebin.com/ksM4FvZS

当我对整个数据使用 fft() 时,结果只在零处有一个巨大的尖峰,其他什么都没有。

这是我的代码:

## Perform FFT with SciPy
signalFFT = fft(yInterp)

## Get power spectral density
signalPSD = np.abs(signalFFT) ** 2

## Get frequencies corresponding to signal PSD
fftFreq = fftfreq(len(signalPSD), spacing)

## Get positive half of frequencies
i = fftfreq>0

##
plt.figure(figsize=(8, 4));
plt.plot(fftFreq[i], 10*np.log10(signalPSD[i]));
#plt.xlim(0, 100);
plt.xlabel('Frequency [Hz]');
plt.ylabel('PSD [dB]')

间距等于xInterp[1]-xInterp[0]

I have access to NumPy and SciPy and want to create a simple FFT of a data set. I have two lists, one that is y values and the other is timestamps for those y values.

What is the simplest way to feed these lists into a SciPy or NumPy method and plot the resulting FFT?

I have looked up examples, but they all rely on creating a set of fake data with some certain number of data points, and frequency, etc. and don’t really show how to do it with just a set of data and the corresponding timestamps.

I have tried the following example:

from scipy.fftpack import fft

# Number of samplepoints
N = 600

# Sample spacing
T = 1.0 / 800.0
x = np.linspace(0.0, N*T, N)
y = np.sin(50.0 * 2.0*np.pi*x) + 0.5*np.sin(80.0 * 2.0*np.pi*x)
yf = fft(y)
xf = np.linspace(0.0, 1.0/(2.0*T), N/2)
import matplotlib.pyplot as plt
plt.plot(xf, 2.0/N * np.abs(yf[0:N/2]))
plt.grid()
plt.show()

But when I change the argument of fft to my data set and plot it, I get extremely odd results, and it appears the scaling for the frequency may be off. I am unsure.

Here is a pastebin of the data I am attempting to FFT

http://pastebin.com/0WhjjMkb http://pastebin.com/ksM4FvZS

When I use fft() on the whole thing it just has a huge spike at zero and nothing else.

Here is my code:

## Perform FFT with SciPy
signalFFT = fft(yInterp)

## Get power spectral density
signalPSD = np.abs(signalFFT) ** 2

## Get frequencies corresponding to signal PSD
fftFreq = fftfreq(len(signalPSD), spacing)

## Get positive half of frequencies
i = fftfreq>0

##
plt.figure(figsize=(8, 4));
plt.plot(fftFreq[i], 10*np.log10(signalPSD[i]));
#plt.xlim(0, 100);
plt.xlabel('Frequency [Hz]');
plt.ylabel('PSD [dB]')

Spacing is just equal to xInterp[1]-xInterp[0].


回答 0

因此,我在IPython笔记本中运行了功能等效的代码形式:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import scipy.fftpack

# Number of samplepoints
N = 600
# sample spacing
T = 1.0 / 800.0
x = np.linspace(0.0, N*T, N)
y = np.sin(50.0 * 2.0*np.pi*x) + 0.5*np.sin(80.0 * 2.0*np.pi*x)
yf = scipy.fftpack.fft(y)
xf = np.linspace(0.0, 1.0/(2.0*T), N//2)

fig, ax = plt.subplots()
ax.plot(xf, 2.0/N * np.abs(yf[:N//2]))
plt.show()

我得到了我认为非常合理的输出。

在此处输入图片说明

距离我在工程学院里思考信号处理的日子,已经过去了比我愿意承认的更久的时间,但 50 和 80 处的尖峰正是我所期望的。那么问题出在哪里呢?

针对随后发布的原始数据和评论的回应

这里的问题是您没有定期数据。您应该始终检查输入任何算法的数据,以确保它是适当的。

import pandas
import matplotlib.pyplot as plt
#import seaborn
%matplotlib inline

# the OP's data
x = pandas.read_csv('http://pastebin.com/raw.php?i=ksM4FvZS', skiprows=2, header=None).values
y = pandas.read_csv('http://pastebin.com/raw.php?i=0WhjjMkb', skiprows=2, header=None).values
fig, ax = plt.subplots()
ax.plot(x, y)

在此处输入图片说明

So I run a functionally equivalent form of your code in an IPython notebook:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import scipy.fftpack

# Number of samplepoints
N = 600
# sample spacing
T = 1.0 / 800.0
x = np.linspace(0.0, N*T, N)
y = np.sin(50.0 * 2.0*np.pi*x) + 0.5*np.sin(80.0 * 2.0*np.pi*x)
yf = scipy.fftpack.fft(y)
xf = np.linspace(0.0, 1.0/(2.0*T), N//2)

fig, ax = plt.subplots()
ax.plot(xf, 2.0/N * np.abs(yf[:N//2]))
plt.show()

I get what I believe to be very reasonable output.

enter image description here

It’s been longer than I care to admit since I was in engineering school thinking about signal processing, but spikes at 50 and 80 are exactly what I would expect. So what’s the issue?

In response to the raw data and comments being posted

The problem here is that you don’t have periodic data. You should always inspect the data that you feed into any algorithm to make sure that it’s appropriate.

import pandas
import matplotlib.pyplot as plt
#import seaborn
%matplotlib inline

# the OP's data
x = pandas.read_csv('http://pastebin.com/raw.php?i=ksM4FvZS', skiprows=2, header=None).values
y = pandas.read_csv('http://pastebin.com/raw.php?i=0WhjjMkb', skiprows=2, header=None).values
fig, ax = plt.subplots()
ax.plot(x, y)

enter image description here


回答 1

关于 fft 的重要一点是,它只能应用于时间戳均匀的数据(即时间上均匀采样,就像上面展示的那样)。

如果采样不均匀,请使用某个函数来拟合数据。有多种教程和函数可供选择:

https://github.com/tiagopereira/python_tips/wiki/Scipy%3A-curve-fitting http://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html

如果无法选择拟合,则可以直接使用某种形式的插值将数据插值为统一采样:

https://docs.scipy.org/doc/scipy-0.14.0/reference/tutorial/interpolate.html

当采样均匀时,您只需要关心样本的时间增量(t[1] - t[0])。这时就可以直接使用 fft 相关函数:

Y    = numpy.fft.fft(y)
freq = numpy.fft.fftfreq(len(y), t[1] - t[0])

pylab.figure()
pylab.plot( freq, numpy.abs(Y) )
pylab.figure()
pylab.plot(freq, numpy.angle(Y) )
pylab.show()

这应该可以解决您的问题。

The important thing about fft is that it can only be applied to data in which the timestamp is uniform (i.e. uniform sampling in time, like what you have shown above).

In case of non-uniform sampling, please use a function for fitting the data. There are several tutorials and functions to choose from:

https://github.com/tiagopereira/python_tips/wiki/Scipy%3A-curve-fitting http://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html

If fitting is not an option, you can directly use some form of interpolation to interpolate data to a uniform sampling:

https://docs.scipy.org/doc/scipy-0.14.0/reference/tutorial/interpolate.html

When you have uniform samples, you will only have to worry about the time delta (t[1] - t[0]) of your samples. In this case, you can directly use the fft functions

Y    = numpy.fft.fft(y)
freq = numpy.fft.fftfreq(len(y), t[1] - t[0])

pylab.figure()
pylab.plot( freq, numpy.abs(Y) )
pylab.figure()
pylab.plot(freq, numpy.angle(Y) )
pylab.show()

This should solve your problem.
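As a concrete (purely synthetic) sketch of the interpolation route, assuming irregular timestamps t and measured values y like the OP's lists:

import numpy as np
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt

t = np.sort(np.random.uniform(0.0, 1.0, 600))      # stand-in for irregular timestamps
y = np.sin(50.0 * 2.0 * np.pi * t)                  # stand-in for the measured values

t_uniform = np.linspace(t.min(), t.max(), t.size)   # uniform grid over the same interval
y_uniform = interp1d(t, y)(t_uniform)               # linear interpolation onto that grid

Y = np.fft.fft(y_uniform)
freq = np.fft.fftfreq(t_uniform.size, t_uniform[1] - t_uniform[0])

plt.plot(freq, np.abs(Y))
plt.show()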


回答 2

您看到的那个很高的尖峰,是由信号的直流分量(DC,即不随时间变化、freq = 0 的部分)引起的。这是一个尺度问题。如果想查看非直流的频率成分,为了便于可视化,可以从信号 FFT 的偏移 1 而不是偏移 0 开始绘制。

修改@PaulH上面给出的示例

import numpy as np
import matplotlib.pyplot as plt
import scipy.fftpack

# Number of samplepoints
N = 600
# sample spacing
T = 1.0 / 800.0
x = np.linspace(0.0, N*T, N)
y = 10 + np.sin(50.0 * 2.0*np.pi*x) + 0.5*np.sin(80.0 * 2.0*np.pi*x)
yf = scipy.fftpack.fft(y)
xf = np.linspace(0.0, 1.0/(2.0*T), N//2)

plt.subplot(2, 1, 1)
plt.plot(xf, 2.0/N * np.abs(yf[0:N//2]))
plt.subplot(2, 1, 2)
plt.plot(xf[1:], 2.0/N * np.abs(yf[0:N//2])[1:])

输出图: 用DC绘制FFT信号,然后将其删除(跳过频率= 0)

另一种方法是以对数刻度可视化数据:

使用:

plt.semilogy(xf, 2.0/N * np.abs(yf[0:N//2]))

将会呈现: 在此处输入图片说明

The high spike that you have is due to the DC (non-varying, i.e. freq = 0) portion of your signal. It’s an issue of scale. If you want to see non-DC frequency content, for visualization, you may need to plot from the offset 1 not from offset 0 of the FFT of the signal.

Modifying the example given above by @PaulH

import numpy as np
import matplotlib.pyplot as plt
import scipy.fftpack

# Number of samplepoints
N = 600
# sample spacing
T = 1.0 / 800.0
x = np.linspace(0.0, N*T, N)
y = 10 + np.sin(50.0 * 2.0*np.pi*x) + 0.5*np.sin(80.0 * 2.0*np.pi*x)
yf = scipy.fftpack.fft(y)
xf = np.linspace(0.0, 1.0/(2.0*T), N//2)

plt.subplot(2, 1, 1)
plt.plot(xf, 2.0/N * np.abs(yf[0:N//2]))
plt.subplot(2, 1, 2)
plt.plot(xf[1:], 2.0/N * np.abs(yf[0:N//2])[1:])

The output plots: Ploting FFT signal with DC and then when removing it (skipping freq = 0)

Another way, is to visualize the data in log scale:

Using:

plt.semilogy(xf, 2.0/N * np.abs(yf[0:N//2]))

Will show: enter image description here


回答 3

作为对已有答案的补充,我想指出,常常需要试一试 FFT 的 bin 数量。测试一批取值并选择对您的应用最有意义的那个是值得的。通常,它与样本数量处于同一量级。前面给出的大多数答案都默认如此,并且得到了很好且合理的结果。如果有人想进一步探索,这是我的代码版本:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import scipy.fftpack

fig = plt.figure(figsize=[14,4])
N = 600           # Number of samplepoints
Fs = 800.0
T = 1.0 / Fs      # N_samps*T (#samples x sample period) is the sample spacing.
N_fft = 80        # Number of bins (chooses granularity)
x = np.linspace(0, N*T, N)     # the interval
y = np.sin(50.0 * 2.0*np.pi*x) + 0.5*np.sin(80.0 * 2.0*np.pi*x)   # the signal

# removing the mean of the signal
mean_removed = np.ones_like(y)*np.mean(y)
y = y - mean_removed

# Compute the fft.
yf = scipy.fftpack.fft(y,n=N_fft)
xf = np.arange(0,Fs,Fs/N_fft)

##### Plot the fft #####
ax = plt.subplot(121)
pt, = ax.plot(xf,np.abs(yf), lw=2.0, c='b')
p = plt.Rectangle((Fs/2, 0), Fs/2, ax.get_ylim()[1], facecolor="grey", fill=True, alpha=0.75, hatch="/", zorder=3)
ax.add_patch(p)
ax.set_xlim((ax.get_xlim()[0],Fs))
ax.set_title('FFT', fontsize= 16, fontweight="bold")
ax.set_ylabel('FFT magnitude (power)')
ax.set_xlabel('Frequency (Hz)')
plt.legend((p,), ('mirrowed',))
ax.grid()

##### Close up on the graph of fft#######
# This is the same histogram above, but truncated at the max frequence + an offset. 
offset = 1    # just to help the visualization. Nothing important.
ax2 = fig.add_subplot(122)
ax2.plot(xf,np.abs(yf), lw=2.0, c='b')
ax2.set_xticks(xf)
ax2.set_xlim(-1,int(Fs/6)+offset)
ax2.set_title('FFT close-up', fontsize= 16, fontweight="bold")
ax2.set_ylabel('FFT magnitude (power) - log')
ax2.set_xlabel('Frequency (Hz)')
ax2.hold(True)
ax2.grid()

plt.yscale('log')

输出图: 在此处输入图片说明

Just as a complement to the answers already given, I would like to point out that often it is important to play with the size of the bins for the FFT. It would make sense to test a bunch of values and pick the one that makes more sense to your application. Often, it is in the same magnitude of the number of samples. This was as assumed by most of the answers given, and produces great and reasonable results. In case one wants to explore that, here is my code version:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import scipy.fftpack

fig = plt.figure(figsize=[14,4])
N = 600           # Number of samplepoints
Fs = 800.0
T = 1.0 / Fs      # N_samps*T (#samples x sample period) is the sample spacing.
N_fft = 80        # Number of bins (chooses granularity)
x = np.linspace(0, N*T, N)     # the interval
y = np.sin(50.0 * 2.0*np.pi*x) + 0.5*np.sin(80.0 * 2.0*np.pi*x)   # the signal

# removing the mean of the signal
mean_removed = np.ones_like(y)*np.mean(y)
y = y - mean_removed

# Compute the fft.
yf = scipy.fftpack.fft(y,n=N_fft)
xf = np.arange(0,Fs,Fs/N_fft)

##### Plot the fft #####
ax = plt.subplot(121)
pt, = ax.plot(xf,np.abs(yf), lw=2.0, c='b')
p = plt.Rectangle((Fs/2, 0), Fs/2, ax.get_ylim()[1], facecolor="grey", fill=True, alpha=0.75, hatch="/", zorder=3)
ax.add_patch(p)
ax.set_xlim((ax.get_xlim()[0],Fs))
ax.set_title('FFT', fontsize= 16, fontweight="bold")
ax.set_ylabel('FFT magnitude (power)')
ax.set_xlabel('Frequency (Hz)')
plt.legend((p,), ('mirrowed',))
ax.grid()

##### Close up on the graph of fft#######
# This is the same histogram above, but truncated at the max frequence + an offset. 
offset = 1    # just to help the visualization. Nothing important.
ax2 = fig.add_subplot(122)
ax2.plot(xf,np.abs(yf), lw=2.0, c='b')
ax2.set_xticks(xf)
ax2.set_xlim(-1,int(Fs/6)+offset)
ax2.set_title('FFT close-up', fontsize= 16, fontweight="bold")
ax2.set_ylabel('FFT magnitude (power) - log')
ax2.set_xlabel('Frequency (Hz)')
ax2.hold(True)
ax2.grid()

plt.yscale('log')

the output plots: enter image description here


回答 4

我写了一个函数,用于绘制实信号的 FFT。相比前面的回答,我这个函数额外的好处是您可以得到信号的实际幅度。

另外,由于假设是实信号,FFT 是对称的,因此我们只需绘制 x 轴的正半边:

import matplotlib.pyplot as plt
import numpy as np
import warnings


def fftPlot(sig, dt=None, plot=True):
    # Here it's assumes analytic signal (real signal...) - so only half of the axis is required

    if dt is None:
        dt = 1
        t = np.arange(0, sig.shape[-1])
        xLabel = 'samples'
    else:
        t = np.arange(0, sig.shape[-1]) * dt
        xLabel = 'freq [Hz]'

    if sig.shape[0] % 2 != 0:
        warnings.warn("signal preferred to be even in size, autoFixing it...")
        t = t[0:-1]
        sig = sig[0:-1]

    sigFFT = np.fft.fft(sig) / t.shape[0]  # Divided by size t for coherent magnitude

    freq = np.fft.fftfreq(t.shape[0], d=dt)

    # Plot analytic signal - right half of frequence axis needed only...
    firstNegInd = np.argmax(freq < 0)
    freqAxisPos = freq[0:firstNegInd]
    sigFFTPos = 2 * sigFFT[0:firstNegInd]  # *2 because of magnitude of analytic signal

    if plot:
        plt.figure()
        plt.plot(freqAxisPos, np.abs(sigFFTPos))
        plt.xlabel(xLabel)
        plt.ylabel('mag')
        plt.title('Analytic FFT plot')
        plt.show()

    return sigFFTPos, freqAxisPos


if __name__ == "__main__":
    dt = 1 / 1000

    # Build a signal within Nyquist - the result will be the positive FFT with actual magnitude
    f0 = 200  # [Hz]
    t = np.arange(0, 1 + dt, dt)
    sig = 1 * np.sin(2 * np.pi * f0 * t) + \
        10 * np.sin(2 * np.pi * f0 / 2 * t) + \
        3 * np.sin(2 * np.pi * f0 / 4 * t) +\
        7.5 * np.sin(2 * np.pi * f0 / 5 * t)

    # Result in frequencies
    fftPlot(sig, dt=dt)
    # Result in samples (if the frequencies axis is unknown)
    fftPlot(sig)

解析FFT图结果

I’ve built a function that deals with plotting FFT of real signals. The extra bonus in my function relative to the messages above is that you get the ACTUAL amplitude of the signal. Also, because of the assumption of a real signal, the FFT is symmetric so we can plot only the positive side of the x axis:

import matplotlib.pyplot as plt
import numpy as np
import warnings


def fftPlot(sig, dt=None, plot=True):
    # here it's assumes analytic signal (real signal...)- so only half of the axis is required

    if dt is None:
        dt = 1
        t = np.arange(0, sig.shape[-1])
        xLabel = 'samples'
    else:
        t = np.arange(0, sig.shape[-1]) * dt
        xLabel = 'freq [Hz]'

    if sig.shape[0] % 2 != 0:
        warnings.warn("signal prefered to be even in size, autoFixing it...")
        t = t[0:-1]
        sig = sig[0:-1]

    sigFFT = np.fft.fft(sig) / t.shape[0]  # divided by size t for coherent magnitude

    freq = np.fft.fftfreq(t.shape[0], d=dt)

    # plot analytic signal - right half of freq axis needed only...
    firstNegInd = np.argmax(freq < 0)
    freqAxisPos = freq[0:firstNegInd]
    sigFFTPos = 2 * sigFFT[0:firstNegInd]  # *2 because of magnitude of analytic signal

    if plot:
        plt.figure()
        plt.plot(freqAxisPos, np.abs(sigFFTPos))
        plt.xlabel(xLabel)
        plt.ylabel('mag')
        plt.title('Analytic FFT plot')
        plt.show()

    return sigFFTPos, freqAxisPos


if __name__ == "__main__":
    dt = 1 / 1000

    # build a signal within nyquist - the result will be the positive FFT with actual magnitude
    f0 = 200  # [Hz]
    t = np.arange(0, 1 + dt, dt)
    sig = 1 * np.sin(2 * np.pi * f0 * t) + \
        10 * np.sin(2 * np.pi * f0 / 2 * t) + \
        3 * np.sin(2 * np.pi * f0 / 4 * t) +\
        7.5 * np.sin(2 * np.pi * f0 / 5 * t)

    # res in freqs
    fftPlot(sig, dt=dt)
    # res in samples (if freqs axis is unknown)
    fftPlot(sig)

analytic FFT plot result


回答 5

此页面上已经有一些不错的解决方案,但它们都假定数据集是均匀采样/分布的。我将尝试提供一个更一般的、针对随机采样数据的示例。我还会用这个 MATLAB 教程作为参考示例:

添加所需的模块:

import numpy as np
import matplotlib.pyplot as plt
import scipy.fftpack
import scipy.signal

生成样本数据:

N = 600 # Number of samples
t = np.random.uniform(0.0, 1.0, N) # Assuming the time start is 0.0 and time end is 1.0
S = 1.0 * np.sin(50.0 * 2 * np.pi * t) + 0.5 * np.sin(80.0 * 2 * np.pi * t)
X = S + 0.01 * np.random.randn(N) # Adding noise

排序数据集:

order = np.argsort(t)
ts = np.array(t)[order]
Xs = np.array(X)[order]

重采样:

T = (t.max() - t.min()) / N # Average period
Fs = 1 / T # Average sample rate frequency
f = Fs * np.arange(0, N // 2 + 1) / N; # Resampled frequency vector
X_new, t_new = scipy.signal.resample(Xs, N, ts)

绘制数据和重新采样的数据:

plt.xlim(0, 0.1)
plt.plot(t_new, X_new, label="resampled")
plt.plot(ts, Xs, label="org")
plt.legend()
plt.ylabel("X")
plt.xlabel("t")

在此处输入图片说明

现在计算FFT:

Y = scipy.fftpack.fft(X_new)
P2 = np.abs(Y / N)
P1 = P2[0 : N // 2 + 1]
P1[1 : -2] = 2 * P1[1 : -2]

plt.ylabel("Y")
plt.xlabel("f")
plt.plot(f, P1)

在此处输入图片说明

PS我终于有时间实施一个更规范的算法来获得不均匀分布数据的傅立叶变换。您可以在此处查看代码,说明和示例Jupyter笔记本。

There are already great solutions on this page, but all have assumed the dataset is uniformly/evenly sampled/distributed. I will try to provide a more general example of randomly sampled data. I will also use this MATLAB tutorial as an example:

Adding the required modules:

import numpy as np
import matplotlib.pyplot as plt
import scipy.fftpack
import scipy.signal

Generating sample data:

N = 600 # number of samples
t = np.random.uniform(0.0, 1.0, N) # assuming the time start is 0.0 and time end is 1.0
S = 1.0 * np.sin(50.0 * 2 * np.pi * t) + 0.5 * np.sin(80.0 * 2 * np.pi * t) 
X = S + 0.01 * np.random.randn(N) # adding noise

Sorting the data set:

order = np.argsort(t)
ts = np.array(t)[order]
Xs = np.array(X)[order]

Resampling:

T = (t.max() - t.min()) / N # average period 
Fs = 1 / T # average sample rate frequency
f = Fs * np.arange(0, N // 2 + 1) / N; # resampled frequency vector
X_new, t_new = scipy.signal.resample(Xs, N, ts)

plotting the data and resampled data:

plt.xlim(0, 0.1)
plt.plot(t_new, X_new, label="resampled")
plt.plot(ts, Xs, label="org")
plt.legend()
plt.ylabel("X")
plt.xlabel("t")

enter image description here

now calculating the fft:

Y = scipy.fftpack.fft(X_new)
P2 = np.abs(Y / N)
P1 = P2[0 : N // 2 + 1]
P1[1 : -2] = 2 * P1[1 : -2]

plt.ylabel("Y")
plt.xlabel("f")
plt.plot(f, P1)

enter image description here

P.S. I finally got time to implement a more canonical algorithm to get a Fourier transform of unevenly distributed data. You may see the code, description, and example Jupyter notebook here.
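As a complementary, hedged sketch (this is not the canonical algorithm referenced in the P.S.): SciPy's Lomb-Scargle periodogram can estimate the spectrum of unevenly sampled data directly, without resampling, reusing ts and Xs from the sorting step above; the frequency grid below is an arbitrary illustrative choice:

import numpy as np
import scipy.signal
import matplotlib.pyplot as plt

freqs = 2 * np.pi * np.linspace(1, 100, 1000)           # angular frequencies to evaluate
pgram = scipy.signal.lombscargle(ts, Xs - Xs.mean(), freqs)

plt.plot(freqs / (2 * np.pi), pgram)
plt.xlabel("f")
plt.ylabel("Lomb-Scargle power")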


回答 6

我写这个额外的答案,是为了解释使用 FFT 时尖峰扩散的来源,特别是讨论 scipy.fftpack 教程,我对其中某些地方并不认同。

在此示例中,记录时长为 tmax=N*T=0.75。信号是 sin(50*2*pi*x) + 0.5*sin(80*2*pi*x)。频谱中应包含两个尖峰,频率分别为 50 和 80,幅度分别为 1 和 0.5。但是,如果被分析的信号不包含整数个周期,信号截断就会导致扩散:

  • 尖峰 1:50*tmax=37.5 => 频率 50 不是 1/tmax 的倍数 => 该频率处由于信号截断而出现扩散。
  • 尖峰 2:80*tmax=60 => 频率 80 是 1/tmax 的倍数 => 该频率处没有由信号截断引起的扩散。

这是一段代码,它分析的信号与教程(sin(50*2*pi*x) + 0.5*sin(80*2*pi*x))中的信号相同,但略有不同:

  1. 原始的scipy.fftpack示例。
  2. 原始 scipy.fftpack 示例,但使用整数个信号周期(tmax=1.0 而不是 0.75,以避免截断引起的扩散)。
  3. 原始scipy.fftpack示例,其中包含整数个信号周期,并且日期和频率均取自FFT理论。

编码:

import numpy as np
import matplotlib.pyplot as plt
import scipy.fftpack

# 1. Linspace
N = 600
# Sample spacing
tmax = 3/4
T = tmax / N # =1.0 / 800.0
x1 = np.linspace(0.0, N*T, N)
y1 = np.sin(50.0 * 2.0*np.pi*x1) + 0.5*np.sin(80.0 * 2.0*np.pi*x1)
yf1 = scipy.fftpack.fft(y1)
xf1 = np.linspace(0.0, 1.0/(2.0*T), N//2)

# 2. Integer number of periods
tmax = 1
T = tmax / N # Sample spacing
x2 = np.linspace(0.0, N*T, N)
y2 = np.sin(50.0 * 2.0*np.pi*x2) + 0.5*np.sin(80.0 * 2.0*np.pi*x2)
yf2 = scipy.fftpack.fft(y2)
xf2 = np.linspace(0.0, 1.0/(2.0*T), N//2)

# 3. Correct positioning of dates relatively to FFT theory ('arange' instead of 'linspace')
tmax = 1
T = tmax / N # Sample spacing
x3 = T * np.arange(N)
y3 = np.sin(50.0 * 2.0*np.pi*x3) + 0.5*np.sin(80.0 * 2.0*np.pi*x3)
yf3 = scipy.fftpack.fft(y3)
xf3 = 1/(N*T) * np.arange(N)[:N//2]

fig, ax = plt.subplots()
# Plotting only the left part of the spectrum to not show aliasing
ax.plot(xf1, 2.0/N * np.abs(yf1[:N//2]), label='fftpack tutorial')
ax.plot(xf2, 2.0/N * np.abs(yf2[:N//2]), label='Integer number of periods')
ax.plot(xf3, 2.0/N * np.abs(yf3[:N//2]), label='Correct positioning of dates')
plt.legend()
plt.grid()
plt.show()

输出:

从图中可以看到,即使使用整数个周期,仍然残留一些扩散。这种行为是由于 scipy.fftpack 教程中日期和频率的取法不正确造成的。因此,按照离散傅立叶变换的理论:

  • 信号应在时刻 t=0,T,...,(N-1)*T 处取样,其中 T 为采样周期,信号的总持续时间为 tmax=N*T。请注意,采样在 tmax-T 处停止。
  • 对应的频率为 f=0,df,...,(N-1)*df,其中 df=1/tmax=1/(N*T) 是频率分辨率。信号的所有谐波都应是 df 的倍数,以避免扩散。

在上面的示例中,您可以看到使用 arange 代替 linspace 可以避免频谱中额外的扩散。此外,使用 linspace 的版本还会导致尖峰的偏移:尖峰出现在比应有值略高的频率处,这在第一张图中可以看到,尖峰稍微位于频率 50 和 80 的右边。

我将得出结论,用法示例应替换为以下代码(在我看来,这不太容易引起误解):

import numpy as np
from scipy.fftpack import fft

# Number of sample points
N = 600
T = 1.0 / 800.0
x = T*np.arange(N)
y = np.sin(50.0 * 2.0*np.pi*x) + 0.5*np.sin(80.0 * 2.0*np.pi*x)
yf = fft(y)
xf = 1/(N*T)*np.arange(N//2)
import matplotlib.pyplot as plt
plt.plot(xf, 2.0/N * np.abs(yf[0:N//2]))
plt.grid()
plt.show()

输出(第二个峰值不再扩散):

我认为这个答案仍然带来了一些关于如何正确应用离散傅立叶变换的补充说明。显然,我的回答已经太长了,而且总还有别的可说(例如 @ewerlopes 简要谈到了混叠,关于加窗也还有很多可以说),所以我就此打住。

我认为,在应用离散傅里叶变换时,深刻理解它的原理非常重要,因为我们都见过太多人在应用它时到处添加系数,只为得到自己想要的结果。

I write this additional answer to explain the origins of the diffusion of the spikes when using FFT, and especially to discuss the scipy.fftpack tutorial, with which I disagree at some points.

In this example, the recording time tmax=N*T=0.75. The signal is sin(50*2*pi*x)+0.5*sin(80*2*pi*x). The frequency signal should contain 2 spikes at frequencies 50 and 80 with amplitudes 1 and 0.5. However, if the analysed signal does not have an integer number of periods, diffusion can appear due to the truncation of the signal:

  • Pike 1: 50*tmax=37.5 => frequency 50 is not a multiple of 1/tmax => Presence of diffusion due to signal truncation at this frequency.
  • Pike 2: 80*tmax=60 => frequency 80 is a multiple of 1/tmax => No diffusion due to signal truncation at this frequency.

Here is a code that analyses the same signal as in the tutorial (sin(50*2*pi*x)+0.5*sin(80*2*pi*x)) but with the slight differences:

  1. The original scipy.fftpack example.
  2. The original scipy.fftpack example with an integer number of signal periods (tmax=1.0 instead of 0.75 to avoid truncation diffusion).
  3. The original scipy.fftpack example with an integer number of signal periods and where the dates and frequencies are taken from the FFT theory.

The code:

import numpy as np
import matplotlib.pyplot as plt
import scipy.fftpack

# 1. Linspace
N = 600
# sample spacing
tmax = 3/4
T = tmax / N # =1.0 / 800.0
x1 = np.linspace(0.0, N*T, N)
y1 = np.sin(50.0 * 2.0*np.pi*x1) + 0.5*np.sin(80.0 * 2.0*np.pi*x1)
yf1 = scipy.fftpack.fft(y1)
xf1 = np.linspace(0.0, 1.0/(2.0*T), N//2)

# 2. Integer number of periods
tmax = 1
T = tmax / N # sample spacing
x2 = np.linspace(0.0, N*T, N)
y2 = np.sin(50.0 * 2.0*np.pi*x2) + 0.5*np.sin(80.0 * 2.0*np.pi*x2)
yf2 = scipy.fftpack.fft(y2)
xf2 = np.linspace(0.0, 1.0/(2.0*T), N//2)

# 3. Correct positionning of dates relatively to FFT theory (arange instead of linspace)
tmax = 1
T = tmax / N # sample spacing
x3 = T * np.arange(N)
y3 = np.sin(50.0 * 2.0*np.pi*x3) + 0.5*np.sin(80.0 * 2.0*np.pi*x3)
yf3 = scipy.fftpack.fft(y3)
xf3 = 1/(N*T) * np.arange(N)[:N//2]

fig, ax = plt.subplots()
# Plotting only the left part of the spectrum to not show aliasing
ax.plot(xf1, 2.0/N * np.abs(yf1[:N//2]), label='fftpack tutorial')
ax.plot(xf2, 2.0/N * np.abs(yf2[:N//2]), label='Integer number of periods')
ax.plot(xf3, 2.0/N * np.abs(yf3[:N//2]), label='Correct positionning of dates')
plt.legend()
plt.grid()
plt.show()

Output:

As can be seen here, even when using an integer number of periods, some diffusion still remains. This behaviour is due to a bad positioning of dates and frequencies in the scipy.fftpack tutorial. Hence, in the theory of discrete Fourier transforms:

  • the signal should be evaluated at dates t=0,T,...,(N-1)*T where T is the sampling period and the total duration of the signal is tmax=N*T. Note that we stop at tmax-T.
  • the associated frequencies are f=0,df,...,(N-1)*df where df=1/tmax=1/(N*T) is the frequency resolution. All harmonics of the signal should be multiples of df to avoid diffusion.

In the example above, you can see that the use of arange instead of linspace makes it possible to avoid additional diffusion in the frequency spectrum. Moreover, using the linspace version also leads to an offset of the spikes, which are located at slightly higher frequencies than they should be, as can be seen in the first picture where the spikes are a little bit to the right of the frequencies 50 and 80.

I’ll just conclude that the example of usage should be replaced by the following code (which is less misleading in my opinion):

import numpy as np
from scipy.fftpack import fft
# Number of sample points
N = 600
T = 1.0 / 800.0
x = T*np.arange(N)
y = np.sin(50.0 * 2.0*np.pi*x) + 0.5*np.sin(80.0 * 2.0*np.pi*x)
yf = fft(y)
xf = 1/(N*T)*np.arange(N//2)
import matplotlib.pyplot as plt
plt.plot(xf, 2.0/N * np.abs(yf[0:N//2]))
plt.grid()
plt.show()

Output (the second spike is not diffused anymore):

I think this answer still brings some additional explanation on how to apply the discrete Fourier transform correctly. Obviously, my answer is too long and there is always more to say (@ewerlopes talked briefly about aliasing, for instance, and a lot can be said about windowing), so I’ll stop. I think that it is very important to understand the principles of the discrete Fourier transform deeply when applying it, because we all know so many people who add factors here and there when applying it in order to obtain what they want.
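Side note (a small sketch, not part of the original answer): np.fft.fftfreq builds the same positive frequency grid that the corrected code above derives by hand:

import numpy as np

N = 600
T = 1.0 / 800.0
freqs = np.fft.fftfreq(N, d=T)     # 0, 1/(N*T), 2/(N*T), ..., then the negative half
print(np.allclose(freqs[:N//2], 1/(N*T) * np.arange(N//2)))   # True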


DataFrame中的字符串,但dtype是object

问题:DataFrame中的字符串,但dtype是object

为什么Pandas告诉我我有对象,尽管所选列中的每个项目都是一个字符串-即使经过显式转换也是如此。

这是我的DataFrame:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56992 entries, 0 to 56991
Data columns (total 7 columns):
id            56992  non-null values
attr1         56992  non-null values
attr2         56992  non-null values
attr3         56992  non-null values
attr4         56992  non-null values
attr5         56992  non-null values
attr6         56992  non-null values
dtypes: int64(2), object(5)

其中有五列的 dtype 是 object。我显式地把这些 object 转换为字符串:

for c in df.columns:
    if df[c].dtype == object:
        print "convert ", df[c].name, " to string"
        df[c] = df[c].astype(str)

然后,df["attr2"] 的 dtype 仍然是 object,尽管 type(df["attr2"].ix[0]) 显示为 str,这是正确的。

Pandas 区分 int64、float64 和 object。既然没有 dtype str,这背后的逻辑是什么?为什么 str 会被 object 涵盖?

Why does Pandas tell me that I have objects, although every item in the selected column is a string — even after explicit conversion.

This is my DataFrame:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56992 entries, 0 to 56991
Data columns (total 7 columns):
id            56992  non-null values
attr1         56992  non-null values
attr2         56992  non-null values
attr3         56992  non-null values
attr4         56992  non-null values
attr5         56992  non-null values
attr6         56992  non-null values
dtypes: int64(2), object(5)

Five of them are dtype object. I explicitly convert those objects to strings:

for c in df.columns:
    if df[c].dtype == object:
        print "convert ", df[c].name, " to string"
        df[c] = df[c].astype(str)

Then, df["attr2"] still has dtype object, although type(df["attr2"].ix[0]) reveals str, which is correct.

Pandas distinguishes between int64 and float64 and object. What is the logic behind it when there is no dtype str? Why is a str covered by object?


回答 0

dtype对象来自NumPy,它描述ndarray中元素的类型。ndarray中的每个元素都必须具有相同的字节大小。对于int64和float64,它们是8个字节。但是对于字符串,字符串的长度不是固定的。因此,熊猫没有直接将字符串的字节保存在ndarray中,而是使用对象ndarray来保存指向对象的指针,因此,这种ndarray的dtype是object。

这是一个例子:

  • int64数组包含4个int64值。
  • 对象数组包含4个指向3个字符串对象的指针。

在此处输入图片说明

The dtype object comes from NumPy, it describes the type of element in a ndarray. Every element in a ndarray must has the same size in byte. For int64 and float64, they are 8 bytes. But for strings, the length of the string is not fixed. So instead of save the bytes of strings in the ndarray directly, Pandas use object ndarray, which save pointers to objects, because of this the dtype of this kind ndarray is object.

Here is an example:

  • the int64 array contains 4 int64 value.
  • the object array contains 4 pointers to 3 string objects.

enter image description here
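A tiny illustrative check of this behaviour:

import pandas as pd

s = pd.Series(['a', 'bb', 'ccc'])
print(s.dtype)          # object -- the underlying array holds pointers, not the characters
print(type(s.iloc[0]))  # <class 'str'>
print(s.values.dtype)   # object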


回答 1

接受的答案是好的。只是想提供一个参考文档的答案。该文档说:

熊猫使用对象dtype来存储字符串。

正如置顶评论所说:“不用担心;它本来就应该是这样。”(尽管被接受的答案在解释“为什么”方面做得很好:字符串是可变长度的)

但是对于字符串,字符串的长度不是固定的。

The accepted answer is good. Just wanted to provide an answer which referenced the documentation. The documentation says:

Pandas uses the object dtype for storing strings.

As the leading comment says “Don’t worry about it; it’s supposed to be like this.” (Although the accepted answer did a great job explaining the “why”; strings are variable-length)

But for strings, the length of the string is not fixed.


回答 2

@HYRY的答案很好。我只想提供更多背景信息。

数组将数据存储为连续的、固定大小的内存块。正是这些属性的结合,使得数组的数据访问如此之快。例如,考虑您的计算机可能如何存储一个 32 位整数数组 [3,0,1]。

在此处输入图片说明

如果您要求计算机获取数组中的第 3 个元素,它会从头开始,跨过 64 位跳到第 3 个元素。正是因为确切知道要跳过多少位,数组才这么快。

现在考虑字符串序列 ['hello', 'i', 'am', 'a', 'banana']。字符串是大小各异的对象,因此如果您试图把它们存放在连续的内存块中,最终会变成下面这样。

在此处输入图片说明

现在,您的计算机没有快速的方法来访问随机请求的元素。克服这个问题的关键是使用指针。基本上,将每个字符串存储在某个随机的内存位置,然后用每个字符串的内存地址填充数组。(内存地址只是整数。)所以现在,事情看起来像这样

在此处输入图片说明

现在,如果您像以前一样要求计算机获取第三个元素,它可以跨64位跳转(假设内存地址是32位整数),然后再执行一个步骤来获取字符串。

NumPy面临的挑战是不能保证指针实际上指向字符串。这就是为什么它将dtype报告为“对象”的原因。

无耻地插入我自己的博客文章,最初是在此进行讨论的。

@HYRY’s answer is great. I just want to provide a little more context..

Arrays store data as contiguous, fixed-size memory blocks. The combination of these properties together is what makes arrays lightning fast for data access. For example, consider how your computer might store an array of 32-bit integers, [3,0,1].

enter image description here

If you ask your computer to fetch the 3rd element in the array, it’ll start at the beginning and then jump across 64 bits to get to the 3rd element. Knowing exactly how many bits to jump across is what makes arrays fast.

Now consider the sequence of strings ['hello', 'i', 'am', 'a', 'banana']. Strings are objects that vary in size, so if you tried to store them in contiguous memory blocks, it’d end up looking like this.

enter image description here

Now your computer doesn’t have a fast way to access a randomly requested element. The key to overcoming this is to use pointers. Basically, store each string in some random memory location, and fill the array with the memory address of each string. (Memory addresses are just integers.) So now, things look like this

enter image description here

Now, if you ask your computer to fetch the 3rd element, just as before, it can jump across 64 bits (assuming the memory addresses are 32-bit integers) and then make one extra step to go fetch the string.

The challenge for NumPy is that there’s no guarantee the pointers are actually pointing to strings. That’s why it reports the dtype as ‘object’.

Shamelessly gonna plug my own blog article where I originally discussed this.
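A short sketch of the pointer layout described above (the 8-byte figure assumes a 64-bit build):

import numpy as np

a = np.array(['hello', 'i', 'am', 'a', 'banana'], dtype=object)
print(a.dtype)      # object
print(a.itemsize)   # 8 -- each slot is just a pointer
print(type(a[0]))   # <class 'str'> -- the actual string lives elsewhere in memory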


回答 3

从 1.0.0 版(2020 年 1 月)开始,pandas 引入了一项实验性功能,通过 pandas.StringDtype 为字符串类型提供一流支持。

虽然默认情况下您仍会看到 object,但可以通过把 dtype 指定为 pd.StringDtype,或者简单地指定 'string',来使用这个新类型:

>>> pd.Series(['abc', None, 'def'])
0     abc
1    None
2     def
dtype: object
>>> pd.Series(['abc', None, 'def'], dtype=pd.StringDtype())
0     abc
1    <NA>
2     def
dtype: string
>>> pd.Series(['abc', None, 'def']).astype('string')
0     abc
1    <NA>
2     def
dtype: string

As of version 1.0.0 (January 2020), pandas has introduced, as an experimental feature, first-class support for string types through pandas.StringDtype.

While you’ll still be seeing object by default, the new type can be used by specifying a dtype of pd.StringDtype or simply 'string':

>>> pd.Series(['abc', None, 'def'])
0     abc
1    None
2     def
dtype: object
>>> pd.Series(['abc', None, 'def'], dtype=pd.StringDtype())
0     abc
1    <NA>
2     def
dtype: string
>>> pd.Series(['abc', None, 'def']).astype('string')
0     abc
1    <NA>
2     def
dtype: string

imshow()的数字太小

问题:imshow()的数字太小

我正在尝试使用imshow()可视化一个numpy数组,因为它类似于Matlab中的imagesc()。

imshow(random.rand(8, 90), interpolation='nearest')

最终的图形在灰色窗口的中心很小,而大部分空间都未被占用。如何设置参数以使图形更大?我尝试了figsize =(xx,xx),这不是我想要的。谢谢!

I’m trying to visualize a numpy array using imshow() since it’s similar to imagesc() in Matlab.

imshow(random.rand(8, 90), interpolation='nearest')

The resulting figure is very small at the center of the grey window, while most of the space is unoccupied. How can I set the parameters to make the figure larger? I tried figsize=(xx,xx) and it’s not what I want. Thanks!


回答 0

如果你不给 imshow 传入 aspect 参数,它会使用你 matplotlibrc 中 image.aspect 的值。在新的 matplotlibrc 中,该值默认为 equal。因此 imshow 会以相等的纵横比绘制你的数组。

如果您不需要相等的纵横比,可以把 aspect 设置为 auto:

imshow(random.rand(8, 90), interpolation='nearest', aspect='auto')

如下图

imshow自动

如果您想要相等的纵横比,则必须根据纵横比来调整 figsize:

fig, ax = subplots(figsize=(18, 2))
ax.imshow(random.rand(8, 90), interpolation='nearest')
tight_layout()

这给你:

不等式

If you don’t give an aspect argument to imshow, it will use the value for image.aspect in your matplotlibrc. The default for this value in a new matplotlibrc is equal. So imshow will plot your array with equal aspect ratio.

If you don’t need an equal aspect you can set aspect to auto

imshow(random.rand(8, 90), interpolation='nearest', aspect='auto')

which gives the following figure

imshow-auto

If you want an equal aspect ratio you have to adapt your figsize according to the aspect

fig, ax = subplots(figsize=(18, 2))
ax.imshow(random.rand(8, 90), interpolation='nearest')
tight_layout()

which gives you:

imshow-equal


回答 1

奇怪,它绝对对我有用:

from matplotlib import pyplot as plt

plt.figure(figsize = (20,2))
plt.imshow(random.rand(8, 90), interpolation='nearest')

我正在使用“ MacOSX”后端,顺便说一句。

That’s strange, it definitely works for me:

from matplotlib import pyplot as plt

plt.figure(figsize = (20,2))
plt.imshow(random.rand(8, 90), interpolation='nearest')

I am using the “MacOSX” backend, btw.


回答 2

我也是python的新手。这看起来像会做您想做的事

axes([0.08, 0.08, 0.94-0.08, 0.94-0.08]) #[left, bottom, width, height]
axis('scaled')

我相信这决定了画布的大小。

I’m new to python too. Here is something that looks like will do what you want to

axes([0.08, 0.08, 0.94-0.08, 0.94-0.08]) #[left, bottom, width, height]
axis('scaled')

I believe this decides the size of the canvas.


回答 3

更新2020

按照@baxxx的要求,这是一个更新,因为random.rand同时已弃用。

这适用于 matplotlib 3.2.1:

from matplotlib import pyplot as plt
import random
import numpy as np

random = np.random.random ([8,90])

plt.figure(figsize = (20,2))
plt.imshow(random, interpolation='nearest')

此图:

在此处输入图片说明

要改变随机数据,您可以尝试 np.random.normal(0,1,(8,90))(这里均值 = 0,标准差 = 1)。

Update 2020

as requested by @baxxx, here is an update because random.rand is deprecated meanwhile.

This works with matplotlib 3.2.1:

from matplotlib import pyplot as plt
import random
import numpy as np

random = np.random.random ([8,90])

plt.figure(figsize = (20,2))
plt.imshow(random, interpolation='nearest')

This plots:

enter image description here

To change the random number, you can experiment with np.random.normal(0,1,(8,90)) (here mean = 0, standard deviation = 1).


将HDF5用于大型阵列存储(而不是平面二进制文件)是否具有分析速度或内存使用优势?

问题:将HDF5用于大型阵列存储(而不是平面二进制文件)是否具有分析速度或内存使用优势?

我正在处理大型3D阵列,通常需要以各种方式对其进行切片以进行各种数据分析。一个典型的“立方体”可以达到〜100GB(将来可能会更大)

似乎对于python中的大型数据集,通常推荐的文件格式是使用HDF5(h5py或pytables)。我的问题是:使用HDF5来存储和分析这些多维数据集,而不是将它们存储在简单的平面二进制文件中,对速度或内存使用有好处吗?HDF5是否更适合表格数据,而不是像我正在使用的大型数组?我看到HDF5可以提供很好的压缩,但是我对处理速度和处理内存溢出更感兴趣。

我经常只想分析多维数据集的一个大子集。pytables和h5py的一个缺点似乎是,当我对数组进行切片时,我总是得到一个numpy数组,占用了内存。但是,如果我对平面二进制文件的numpy内存映射进行切片,则可以获得一个视图,该视图将数据保留在磁盘上。因此,看来我可以更轻松地分析数据的特定扇区,而不会耗尽内存。

我已经浏览了pytables和h5py,到目前为止,还没有看到两者对我的好处。

I am processing large 3D arrays, which I often need to slice in various ways to do a variety of data analysis. A typical “cube” can be ~100GB (and will likely get larger in the future)

It seems that the typical recommended file format for large datasets in python is to use HDF5 (either h5py or pytables). My question is: is there any speed or memory usage benefit to using HDF5 to store and analyze these cubes over storing them in simple flat binary files? Is HDF5 more appropriate for tabular data, as opposed to large arrays like what I am working with? I see that HDF5 can provide nice compression, but I am more interested in processing speed and dealing with memory overflow.

I frequently want to analyze only one large subset of the cube. One drawback of both pytables and h5py is it seems is that when I take a slice of the array, I always get a numpy array back, using up memory. However, if I slice a numpy memmap of a flat binary file, I can get a view, which keeps the data on disk. So, it seems that I can more easily analyze specific sectors of my data without overrunning my memory.

I have explored both pytables and h5py, and haven’t seen the benefit of either so far for my purpose.


回答 0

HDF5的优势:组织,灵活性,互操作性

HDF5的一些主要优点是其层次结构(类似于文件夹/文件),与每个项目一起存储的可选任意元数据以及其灵活性(例如压缩)。这种组织结构和元数据存储听起来很琐碎,但在实践中非常有用。

HDF 的另一个优点是,数据集既可以是固定大小的,也可以是大小可伸缩的。因此,向大型数据集追加数据很容易,而无需创建一个全新的副本。

此外,HDF5是一种标准格式,几乎所有语言都可以使用库,因此使用HDF可以很容易地在Matlab,Fortran,R,C和Python之间共享磁盘数据。(公平地说,只要知道C与F的顺序并知道存储数组的形状,dtype等,使用大型二进制数组也不太困难。)

HDF对于大型阵列的优势:更快的任意切片I / O

简而言之(TL/DR):对于一个约 8GB 的 3D 数组,使用分块的 HDF5 数据集沿任何轴读取一个“完整”切片大约需要 20 秒;而对同样数据的 memmap 数组,则从 0.3 秒(最佳情况)到 3 个多小时(最坏情况)不等。

除了上面列出的优点外,“分块”*的磁盘数据格式(例如 HDF5)还有另一个大优势:读取任意切片(强调是任意的)通常会快得多,因为磁盘上的数据平均而言更加连续。

*(HDF5 不一定是分块的数据格式。它支持分块,但并不强制。实际上,如果我没记错的话,h5py 中创建数据集的默认设置是不分块的。)

基本上,只要您选择了合理的块大小(或者让库为您选择一个),对于数据集的某个给定切片,分块的 HDF 数据集在最佳情况和最坏情况下的磁盘读取速度会相当接近。而对简单的二进制数组,最佳情况更快,但最坏情况要糟糕得多。

请注意,如果您有SSD,则可能不会注意到读/写速度的巨大差异。但是,使用常规硬盘驱动器,顺序读取要比随机读取快得多。(即,普通硬盘驱动器需要很长的seek时间。)HDF在SSD上仍然具有优势,但更多的原因是其其他功能(例如,元数据,组织等)而不是原始速度。


首先,为了消除混乱,访问h5py数据集将返回一个对象,该对象的行为与numpy数组非常相似,但是直到将数据切片后才将其加载到内存中。(类似于memmap,但不完全相同。)有关更多信息,请参见h5py介绍

对数据集进行切片会将数据的一个子集加载到内存中,但是大概您想对它做点什么,这时无论如何您都将需要它在内存中。

如果您确实想做核外(out-of-core)计算,对于表格数据,用 pandas 或 pytables 可以很容易地实现。用 h5py 也可以(更适合大型 N 维数组),但您需要下到更低的层级并自己处理迭代。

但是,类似numpy的内核外计算的未来是Blaze。如果您真的想走那条路,请看一下


“无条件”案例

首先,考虑一个写入磁盘的3D C顺序数组(我将通过调用arr.ravel()和打印结果来模拟它,以使内容更加可见):

In [1]: import numpy as np

In [2]: arr = np.arange(4*6*6).reshape(4,6,6)

In [3]: arr
Out[3]:
array([[[  0,   1,   2,   3,   4,   5],
        [  6,   7,   8,   9,  10,  11],
        [ 12,  13,  14,  15,  16,  17],
        [ 18,  19,  20,  21,  22,  23],
        [ 24,  25,  26,  27,  28,  29],
        [ 30,  31,  32,  33,  34,  35]],

       [[ 36,  37,  38,  39,  40,  41],
        [ 42,  43,  44,  45,  46,  47],
        [ 48,  49,  50,  51,  52,  53],
        [ 54,  55,  56,  57,  58,  59],
        [ 60,  61,  62,  63,  64,  65],
        [ 66,  67,  68,  69,  70,  71]],

       [[ 72,  73,  74,  75,  76,  77],
        [ 78,  79,  80,  81,  82,  83],
        [ 84,  85,  86,  87,  88,  89],
        [ 90,  91,  92,  93,  94,  95],
        [ 96,  97,  98,  99, 100, 101],
        [102, 103, 104, 105, 106, 107]],

       [[108, 109, 110, 111, 112, 113],
        [114, 115, 116, 117, 118, 119],
        [120, 121, 122, 123, 124, 125],
        [126, 127, 128, 129, 130, 131],
        [132, 133, 134, 135, 136, 137],
        [138, 139, 140, 141, 142, 143]]])

这些值将按顺序存储在磁盘上,如下面的第4行所示。(暂时忽略文件系统详细信息和碎片。)

In [4]: arr.ravel(order='C')
Out[4]:
array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143])

作为最佳情况,让我们沿第一个轴取一个切片。请注意,这些正好是数组的前 36 个值。这将是一次非常快的读取!(一次寻道,一次读取)

In [5]: arr[0,:,:]
Out[5]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])

同样,沿第一个轴的下一个切片将仅是下一个36个值。要沿该轴读取完整的切片,我们只需执行一项seek操作。如果我们要读取的只是沿着该轴的各个切片,那么这就是理想的文件结构。

但是,让我们考虑最坏的情况:沿着最后一个轴的切片。

In [6]: arr[:,:,0]
Out[6]:
array([[  0,   6,  12,  18,  24,  30],
       [ 36,  42,  48,  54,  60,  66],
       [ 72,  78,  84,  90,  96, 102],
       [108, 114, 120, 126, 132, 138]])

要读入此片,我们需要36次搜索和36次读取,因为所有值都在磁盘上分开。他们都不是相邻的!

这看起来似乎没什么大不了,但随着数组越来越大,寻道操作的数量和规模会迅速增长。对于一个以这种方式存储、并通过 memmap 读取的较大(约 10GB)3D 数组,即使使用现代硬件,沿“最差”轴读取一个完整切片也很容易花上几十分钟;而沿最佳轴的切片可能不到一秒。为简单起见,我这里只展示了沿单个轴的“完整”切片,但对数据任意子集的任意切片,都会发生完全相同的事情。

顺便说一句,有几种文件格式可以利用此功能,并且基本上在磁盘上存储三份巨大的 3D阵列副本:一份以C顺序,一份以F顺序,一份在两者之间的中间位置。(一个示例是Geoprobe的D3D格式,尽管我不确定它是否在任何地方都有记录。)谁在乎最终文件大小是否为4TB,但存储空间却很便宜!疯狂的是,由于主要用例是在每个方向上提取单个子切片,因此您要进行的读取非常非常快。效果很好!


简单的“块状”案例

假设我们将3D数组的2x2x2“块”存储为磁盘上的连续块。换句话说,类似:

nx, ny, nz = arr.shape
slices = []
for i in range(0, nx, 2):
    for j in range(0, ny, 2):
        for k in range(0, nz, 2):
            slices.append((slice(i, i+2), slice(j, j+2), slice(k, k+2)))

chunked = np.hstack([arr[chunk].ravel() for chunk in slices])

因此磁盘上的数据如下所示chunked

array([  0,   1,   6,   7,  36,  37,  42,  43,   2,   3,   8,   9,  38,
        39,  44,  45,   4,   5,  10,  11,  40,  41,  46,  47,  12,  13,
        18,  19,  48,  49,  54,  55,  14,  15,  20,  21,  50,  51,  56,
        57,  16,  17,  22,  23,  52,  53,  58,  59,  24,  25,  30,  31,
        60,  61,  66,  67,  26,  27,  32,  33,  62,  63,  68,  69,  28,
        29,  34,  35,  64,  65,  70,  71,  72,  73,  78,  79, 108, 109,
       114, 115,  74,  75,  80,  81, 110, 111, 116, 117,  76,  77,  82,
        83, 112, 113, 118, 119,  84,  85,  90,  91, 120, 121, 126, 127,
        86,  87,  92,  93, 122, 123, 128, 129,  88,  89,  94,  95, 124,
       125, 130, 131,  96,  97, 102, 103, 132, 133, 138, 139,  98,  99,
       104, 105, 134, 135, 140, 141, 100, 101, 106, 107, 136, 137, 142, 143])

只是为了表明它们确实是 arr 的 2x2x2 块,请注意这些正是 chunked 的前 8 个值:

In [9]: arr[:2, :2, :2]
Out[9]:
array([[[ 0,  1],
        [ 6,  7]],

       [[36, 37],
        [42, 43]]])

要读取沿某个轴的任意切片,我们会读取 6 个或 9 个连续的块(大约是所需数据量的两倍),然后只保留需要的部分。最坏情况下最多 9 次寻道,而非分块版本最多需要 36 次寻道。(最佳情况仍然是 6 次寻道,对比 memmap 数组的 1 次。)由于顺序读取比寻道快得多,这大大减少了把任意子集读入内存所需的时间。而且,数组越大,这种效果越明显。

HDF5更进一步。块不必连续存储,它们由B树索引。此外,它们不必在磁盘上具有相同的大小,因此可以将压缩应用于每个块。


分块数组 h5py

默认情况下,h5py 不会在磁盘上创建分块的 HDF 文件(相比之下,我认为 pytables 会)。但是,如果在创建数据集时指定 chunks=True,就会得到磁盘上分块存储的数组。

作为一个简短的示例:

import numpy as np
import h5py

data = np.random.random((100, 100, 100))

with h5py.File('test.hdf', 'w') as outfile:
    dset = outfile.create_dataset('a_descriptive_name', data=data, chunks=True)
    dset.attrs['some key'] = 'Did you want some metadata?'

请注意,chunks=True 是让 h5py 自动为我们选择块大小。如果您清楚自己最常见的用例,可以通过指定一个形状元组(例如上面简单示例中的 (2,2,2))来优化块的大小/形状。这使您可以让沿特定轴的读取更高效,或者针对特定大小的读/写进行优化。
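作为一个示意性的补充(文件名 test_chunked.hdf 和块形状 (1, 100, 100) 都只是假设的示例值),下面演示如何在创建数据集时显式指定块形状:

import numpy as np
import h5py

data = np.random.random((100, 100, 100))

with h5py.File('test_chunked.hdf', 'w') as outfile:
    # 块形状 (1, 100, 100) 意味着沿第一个轴的每个完整切片正好对应一个块
    dset = outfile.create_dataset('a_descriptive_name', data=data, chunks=(1, 100, 100))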


I / O性能比较

为了强调这一点,让我们比较从分块的HDF5数据集和包含相同确切数据的大型(〜8GB)Fortran排序3D数组中读取的片段。

我已经清除了每次运行之间的所有操作系统缓存,因此我们看到的是“冷”性能。

对于每种文件类型,我们将测试沿第一个轴的“完整” x切片和沿最后一个轴的“完整” z切片读取。对于按Fortran排序的映射数组,“ x”切片是最坏的情况,而“ z”切片是最好的情况。

所使用的代码放在一个 gist 里(包括创建 hdf 文件的部分)。我无法轻易分享这里用到的数据,但您可以用一个形状((621, 4991, 2600))和类型(np.uint8)都相同的全零数组来模拟它。

chunked_hdf.py如下所示:

import sys
import h5py

def main():
    data = read()

    if sys.argv[1] == 'x':
        x_slice(data)
    elif sys.argv[1] == 'z':
        z_slice(data)

def read():
    f = h5py.File('/tmp/test.hdf5', 'r')
    return f['seismic_volume']

def z_slice(data):
    return data[:,:,0]

def x_slice(data):
    return data[0,:,:]

main()

memmapped_array.py 与之类似,但为了确保切片真的被加载进内存,稍微复杂一些(默认情况下,对 memmap 数组切片只会返回另一个 memmap 数组,那就不是同类比较了)。

import numpy as np
import sys

def main():
    data = read()

    if sys.argv[1] == 'x':
        x_slice(data)
    elif sys.argv[1] == 'z':
        z_slice(data)

def read():
    big_binary_filename = '/data/nankai/data/Volumes/kumdep01_flipY.3dv.vol'
    shape = 621, 4991, 2600
    header_len = 3072

    data = np.memmap(filename=big_binary_filename, mode='r', offset=header_len,
                     order='F', shape=shape, dtype=np.uint8)
    return data

def z_slice(data):
    dat = np.empty(data.shape[:2], dtype=data.dtype)
    dat[:] = data[:,:,0]
    return dat

def x_slice(data):
    dat = np.empty(data.shape[1:], dtype=data.dtype)
    dat[:] = data[0,:,:]
    return dat

main()

首先让我们看一下HDF的性能:

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python chunked_hdf.py z
python chunked_hdf.py z  0.64s user 0.28s system 3% cpu 23.800 total

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python chunked_hdf.py x
python chunked_hdf.py x  0.12s user 0.30s system 1% cpu 21.856 total

“完整的” x切片和“完整的” z切片大约需要相同的时间(约20秒)。考虑到这是一个8GB的阵列,还算不错。大多数时候

并且,如果将其与映射的数组时间进行比较(按Fortran顺序排序:最佳情况是“ z切片”,最坏情况是“ x切片”):

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python memmapped_array.py z
python memmapped_array.py z  0.07s user 0.04s system 28% cpu 0.385 total

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python memmapped_array.py x
python memmapped_array.py x  2.46s user 37.24s system 0% cpu 3:35:26.85 total

是的,你没看错:一个切片方向是 0.3 秒,另一个方向是约 3.5 个小时!

在“ x”方向上进行切片的时间远远大于将整个8GB阵列加载到内存中并选择所需切片所需的时间!(同样,这是一个按Fortran顺序排列的数组。对于C顺序的数组来说,相反的x / z slice时序将是这种情况。)

但是,如果我们一直想沿最佳情况方向进行切片,那么磁盘上的大二进制数组就非常好。(〜0.3秒!)

对于映射阵列,您将陷入这种I / O差异(或者各向异性是一个更好的说法)。但是,对于分块的HDF数据集,您可以选择分块大小,以使访问权限相等或针对特定用例进行了优化。它为您提供了更多的灵活性。

综上所述

希望这可以帮助您至少部分解决问题。HDF5与“原始”内存映射相比还具有许多其他优点,但是我在这里没有足够的空间来扩展它们。压缩可以加快某些速度(我处理的数据不能从压缩中获得很多好处,因此我很少使用它),并且与“原始”内存映射相比,HDF5文件对OS级缓存的播放效果更好。除此之外,HDF5是一种非常出色的容器格式。它为您提供了很大的灵活性来管理数据,并且可以或多或少地使用任何编程语言来使用它。

总体而言,尝试一下,看看它是否适合您的用例。我想您可能会感到惊讶。

HDF5 Advantages: Organization, flexibility, interoperability

Some of the main advantages of HDF5 are its hierarchical structure (similar to folders/files), optional arbitrary metadata stored with each item, and its flexibility (e.g. compression). This organizational structure and metadata storage may sound trivial, but it’s very useful in practice.

Another advantage of HDF is that the datasets can be either fixed-size or flexibly sized. Therefore, it’s easy to append data to a large dataset without having to create an entire new copy.

Additionally, HDF5 is a standardized format with libraries available for almost any language, so sharing your on-disk data between, say Matlab, Fortran, R, C, and Python is very easy with HDF. (To be fair, it’s not too hard with a big binary array, too, as long as you’re aware of the C vs. F ordering and know the shape, dtype, etc of the stored array.)

HDF advantages for a large array: Faster I/O of an arbitrary slice

Just as the TL/DR: For an ~8GB 3D array, reading a “full” slice along any axis took ~20 seconds with a chunked HDF5 dataset, and 0.3 seconds (best-case) to over three hours (worst case) for a memmapped array of the same data.

Beyond the things listed above, there’s another big advantage to a “chunked”* on-disk data format such as HDF5: Reading an arbitrary slice (emphasis on arbitrary) will typically be much faster, as the on-disk data is more contiguous on average.

*(HDF5 doesn’t have to be a chunked data format. It supports chunking, but doesn’t require it. In fact, the default for creating a dataset in h5py is not to chunk, if I recall correctly.)

Basically, your best case disk-read speed and your worst case disk read speed for a given slice of your dataset will be fairly close with a chunked HDF dataset (assuming you chose a reasonable chunk size or let a library choose one for you). With a simple binary array, the best-case is faster, but the worst-case is much worse.

One caveat, if you have an SSD, you likely won’t notice a huge difference in read/write speed. With a regular hard drive, though, sequential reads are much, much faster than random reads. (i.e. A regular hard drive has long seek time.) HDF still has an advantage on an SSD, but it’s more due its other features (e.g. metadata, organization, etc) than due to raw speed.


First off, to clear up confusion, accessing an h5py dataset returns an object that behaves fairly similarly to a numpy array, but does not load the data into memory until it’s sliced. (Similar to memmap, but not identical.) Have a look at the h5py introduction for more information.

Slicing the dataset will load a subset of the data into memory, but presumably you want to do something with it, at which point you’ll need it in memory anyway.
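A minimal sketch of that lazy behaviour (the file and dataset names are placeholders matching the h5py example given later in this answer):

import h5py

with h5py.File('test.hdf', 'r') as f:
    dset = f['a_descriptive_name']   # just a handle; nothing is read from disk yet
    block = dset[10:20, :, :]        # slicing reads this subset into a real numpy array
    print(type(dset), type(block), block.shape)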

If you do want to do out-of-core computations, you can fairly easily for tabular data with pandas or pytables. It is possible with h5py (nicer for big N-D arrays), but you need to drop down to a touch lower level and handle the iteration yourself.

However, the future of numpy-like out-of-core computations is Blaze. Have a look at it if you really want to take that route.


The “unchunked” case

First off, consider a 3D C-ordered array written to disk (I’ll simulate it by calling arr.ravel() and printing the result, to make things more visible):

In [1]: import numpy as np

In [2]: arr = np.arange(4*6*6).reshape(4,6,6)

In [3]: arr
Out[3]:
array([[[  0,   1,   2,   3,   4,   5],
        [  6,   7,   8,   9,  10,  11],
        [ 12,  13,  14,  15,  16,  17],
        [ 18,  19,  20,  21,  22,  23],
        [ 24,  25,  26,  27,  28,  29],
        [ 30,  31,  32,  33,  34,  35]],

       [[ 36,  37,  38,  39,  40,  41],
        [ 42,  43,  44,  45,  46,  47],
        [ 48,  49,  50,  51,  52,  53],
        [ 54,  55,  56,  57,  58,  59],
        [ 60,  61,  62,  63,  64,  65],
        [ 66,  67,  68,  69,  70,  71]],

       [[ 72,  73,  74,  75,  76,  77],
        [ 78,  79,  80,  81,  82,  83],
        [ 84,  85,  86,  87,  88,  89],
        [ 90,  91,  92,  93,  94,  95],
        [ 96,  97,  98,  99, 100, 101],
        [102, 103, 104, 105, 106, 107]],

       [[108, 109, 110, 111, 112, 113],
        [114, 115, 116, 117, 118, 119],
        [120, 121, 122, 123, 124, 125],
        [126, 127, 128, 129, 130, 131],
        [132, 133, 134, 135, 136, 137],
        [138, 139, 140, 141, 142, 143]]])

The values would be stored on-disk sequentially as shown on line 4 below. (Let’s ignore filesystem details and fragmentation for the moment.)

In [4]: arr.ravel(order='C')
Out[4]:
array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143])

In the best case scenario, let’s take a slice along the first axis. Notice that these are just the first 36 values of the array. This will be a very fast read! (one seek, one read)

In [5]: arr[0,:,:]
Out[5]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])

Similarly, the next slice along the first axis will just be the next 36 values. To read a complete slice along this axis, we only need one seek operation. If all we’re going to be reading is various slices along this axis, then this is the perfect file structure.

However, let’s consider the worst-case scenario: A slice along the last axis.

In [6]: arr[:,:,0]
Out[6]:
array([[  0,   6,  12,  18,  24,  30],
       [ 36,  42,  48,  54,  60,  66],
       [ 72,  78,  84,  90,  96, 102],
       [108, 114, 120, 126, 132, 138]])

To read this slice in, we need 36 seeks and 36 reads, as all of the values are separated on disk. None of them are adjacent!
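
You can see the same thing from the array's strides (assuming a 64-bit platform where np.arange gives 8-byte integers): the slice has to jump 48 bytes between each pair of values it needs, instead of reading them back-to-back.

In [7]: arr.strides           # bytes to step along each axis
Out[7]: (288, 48, 8)

In [8]: arr[:, :, 0].strides  # the slice skips 48 bytes between consecutive values
Out[8]: (288, 48)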

This may seem pretty minor, but as we get to larger and larger arrays, the number and size of the seek operations grows rapidly. For a large-ish (~10Gb) 3D array stored in this way and read in via memmap, reading a full slice along the “worst” axis can easily take tens of minutes, even with modern hardware. At the same time, a slice along the best axis can take less than a second. For simplicity, I’m only showing “full” slices along a single axis, but the exact same thing happens with arbitrary slices of any subset of the data.

Incidentally there are several file formats that take advantage of this and basically store three copies of huge 3D arrays on disk: one in C-order, one in F-order, and one in the intermediate between the two. (An example of this is Geoprobe’s D3D format, though I’m not sure it’s documented anywhere.) Who cares if the final file size is 4TB, storage is cheap! The crazy thing about that is that because the main use case is extracting a single sub-slice in each direction, the reads you want to make are very, very fast. It works very well!


The simple “chunked” case

Let’s say we store 2x2x2 “chunks” of the 3D array as contiguous blocks on disk. In other words, something like:

nx, ny, nz = arr.shape
slices = []
for i in range(0, nx, 2):
    for j in range(0, ny, 2):
        for k in range(0, nz, 2):
            slices.append((slice(i, i+2), slice(j, j+2), slice(k, k+2)))

chunked = np.hstack([arr[chunk].ravel() for chunk in slices])

So the data on disk would look like chunked:

array([  0,   1,   6,   7,  36,  37,  42,  43,   2,   3,   8,   9,  38,
        39,  44,  45,   4,   5,  10,  11,  40,  41,  46,  47,  12,  13,
        18,  19,  48,  49,  54,  55,  14,  15,  20,  21,  50,  51,  56,
        57,  16,  17,  22,  23,  52,  53,  58,  59,  24,  25,  30,  31,
        60,  61,  66,  67,  26,  27,  32,  33,  62,  63,  68,  69,  28,
        29,  34,  35,  64,  65,  70,  71,  72,  73,  78,  79, 108, 109,
       114, 115,  74,  75,  80,  81, 110, 111, 116, 117,  76,  77,  82,
        83, 112, 113, 118, 119,  84,  85,  90,  91, 120, 121, 126, 127,
        86,  87,  92,  93, 122, 123, 128, 129,  88,  89,  94,  95, 124,
       125, 130, 131,  96,  97, 102, 103, 132, 133, 138, 139,  98,  99,
       104, 105, 134, 135, 140, 141, 100, 101, 106, 107, 136, 137, 142, 143])

And just to show that they’re 2x2x2 blocks of arr, notice that these are the first 8 values of chunked:

In [9]: arr[:2, :2, :2]
Out[9]:
array([[[ 0,  1],
        [ 6,  7]],

       [[36, 37],
        [42, 43]]])

To read in any slice along an axis, we’d read in either 6 or 9 contiguous chunks (twice as much data as we need) and then only keep the portion we wanted. That’s a worst-case maximum of 9 seeks vs a maximum of 36 seeks for the non-chunked version. (But the best case is still 6 seeks vs 1 for the memmapped array.) Because sequential reads are very fast compared to seeks, this significantly reduces the amount of time it takes to read an arbitrary subset into memory. Once again, this effect becomes larger with larger arrays.

HDF5 takes this a few steps farther. The chunks don’t have to be stored contiguously, and they’re indexed by a B-Tree. Furthermore, they don’t have to be the same size on disk, so compression can be applied to each chunk.


Chunked arrays with h5py

By default, h5py doesn't create chunked HDF files on disk (I think pytables does, by contrast). If you specify chunks=True when creating the dataset, however, you'll get a chunked array on disk.

As a quick, minimal example:

import numpy as np
import h5py

data = np.random.random((100, 100, 100))

with h5py.File('test.hdf', 'w') as outfile:
    dset = outfile.create_dataset('a_descriptive_name', data=data, chunks=True)
    dset.attrs['some key'] = 'Did you want some metadata?'

Note that chunks=True tells h5py to automatically pick a chunk size for us. If you know more about your most common use-case, you can optimize the chunk size/shape by specifying a shape tuple (e.g. (2,2,2) in the simple example above). This allows you to make reads along a particular axis more efficient or optimize for reads/writes of a certain size.
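
For instance, a sketch of specifying the chunk shape (and optional compression) explicitly; the shape used here is only an illustration, not a recommendation:

import numpy as np
import h5py

data = np.random.random((100, 100, 100))

with h5py.File('test_chunked.hdf', 'w') as outfile:
    # Each chunk is one full plane along the first axis, so reading data[i, :, :]
    # touches exactly one chunk, while reading data[:, :, k] touches all of them.
    dset = outfile.create_dataset('a_descriptive_name', data=data,
                                  chunks=(1, 100, 100), compression='gzip')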


I/O Performance comparison

Just to emphasize the point, let’s compare reading in slices from a chunked HDF5 dataset and a large (~8GB), Fortran-ordered 3D array containing the same exact data.

I’ve cleared all OS caches between each run, so we’re seeing the “cold” performance.

For each file type, we’ll test reading in a “full” x-slice along the first axis and a “full” z-slice along the last axis. For the Fortran-ordered memmapped array, the “x” slice is the worst case, and the “z” slice is the best case.

The code used is in a gist (including creating the hdf file). I can’t easily share the data used here, but you could simulate it by an array of zeros of the same shape (621, 4991, 2600) and type np.uint8.
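
For instance, a rough sketch of building a stand-in file of the same shape and dtype (the dataset name just matches the scripts below; this obviously won't reproduce the exact timings):

import numpy as np
import h5py

shape = (621, 4991, 2600)   # ~8GB of uint8

with h5py.File('/tmp/test.hdf5', 'w') as f:
    dset = f.create_dataset('seismic_volume', shape=shape, dtype=np.uint8,
                            chunks=True)
    # Write each slab explicitly so the chunks are actually allocated on disk
    # (unwritten chunks take up no space, which would skew any timing test).
    for i in range(shape[0]):
        dset[i, :, :] = np.zeros(shape[1:], dtype=np.uint8)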

The chunked_hdf.py looks like this:

import sys
import h5py

def main():
    data = read()

    if sys.argv[1] == 'x':
        x_slice(data)
    elif sys.argv[1] == 'z':
        z_slice(data)

def read():
    f = h5py.File('/tmp/test.hdf5', 'r')
    return f['seismic_volume']

def z_slice(data):
    return data[:,:,0]

def x_slice(data):
    return data[0,:,:]

main()

memmapped_array.py is similar, but has a touch more complexity to ensure the slices are actually loaded into memory (by default, another memmapped array would be returned, which wouldn’t be an apples-to-apples comparison).

import numpy as np
import sys

def main():
    data = read()

    if sys.argv[1] == 'x':
        x_slice(data)
    elif sys.argv[1] == 'z':
        z_slice(data)

def read():
    big_binary_filename = '/data/nankai/data/Volumes/kumdep01_flipY.3dv.vol'
    shape = 621, 4991, 2600
    header_len = 3072

    data = np.memmap(filename=big_binary_filename, mode='r', offset=header_len,
                     order='F', shape=shape, dtype=np.uint8)
    return data

def z_slice(data):
    dat = np.empty(data.shape[:2], dtype=data.dtype)
    dat[:] = data[:,:,0]
    return dat

def x_slice(data):
    dat = np.empty(data.shape[1:], dtype=data.dtype)
    dat[:] = data[0,:,:]
    return dat

main()

Let’s have a look at the HDF performance first:

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python chunked_hdf.py z
python chunked_hdf.py z  0.64s user 0.28s system 3% cpu 23.800 total

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python chunked_hdf.py x
python chunked_hdf.py x  0.12s user 0.30s system 1% cpu 21.856 total

A “full” x-slice and a “full” z-slice take about the same amount of time (~20sec). Considering this is an 8GB array, that’s not too bad. Most of that time is spent waiting on disk I/O rather than in Python (note how small the user CPU times above are compared to the totals).

And if we compare this to the memmapped array times (it’s Fortran-ordered: A “z-slice” is the best case and an “x-slice” is the worst case.):

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python memmapped_array.py z
python memmapped_array.py z  0.07s user 0.04s system 28% cpu 0.385 total

jofer at cornbread in ~ 
$ sudo ./clear_cache.sh

jofer at cornbread in ~ 
$ time python memmapped_array.py x
python memmapped_array.py x  2.46s user 37.24s system 0% cpu 3:35:26.85 total

Yes, you read that right. 0.3 seconds for one slice direction and ~3.5 hours for the other.

The time to slice in the “x” direction is far longer than the amount of time it would take to load the entire 8GB array into memory and select the slice we wanted! (Again, this is a Fortran-ordered array. The opposite x/z slice timing would be the case for a C-ordered array.)

However, if we’re always wanting to take a slice along the best-case direction, the big binary array on disk is very good. (~0.3 sec!)

With a memmapped array, you’re stuck with this I/O discrepancy (or perhaps anisotropy is a better term). However, with a chunked HDF dataset, you can choose the chunksize such that access is either equal or is optimized for a particular use-case. It gives you a lot more flexibility.

In summary

Hopefully that helps clear up one part of your question, at any rate. HDF5 has many other advantages over “raw” memmaps, but I don’t have room to expand on all of them here. Compression can speed some things up (the data I work with doesn’t benefit much from compression, so I rarely use it), and OS-level caching often plays more nicely with HDF5 files than with “raw” memmaps. Beyond that, HDF5 is a really fantastic container format. It gives you a lot of flexibility in managing your data, and can be used from more or less any programming language.

Overall, try it and see if it works well for your use case. I think you might be surprised.


从NumPy数组中选择特定的行和列

问题:从NumPy数组中选择特定的行和列

我一直在发疯,试图找出我在这里做错了什么愚蠢的事情。

我正在使用NumPy,并且我想从中选择特定的行索引和特定的列索引。这是我的问题的要点:

import numpy as np

a = np.arange(20).reshape((5,4))
# array([[ 0,  1,  2,  3],
#        [ 4,  5,  6,  7],
#        [ 8,  9, 10, 11],
#        [12, 13, 14, 15],
#        [16, 17, 18, 19]])

# If I select certain rows, it works
print a[[0, 1, 3], :]
# array([[ 0,  1,  2,  3],
#        [ 4,  5,  6,  7],
#        [12, 13, 14, 15]])

# If I select certain rows and a single column, it works
print a[[0, 1, 3], 2]
# array([ 2,  6, 14])

# But if I select certain rows AND certain columns, it fails
print a[[0,1,3], [0,2]]
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# ValueError: shape mismatch: objects cannot be broadcast to a single shape

为什么会这样呢?我当然应该能够选择第一行,第二行和第四行以及第一列和第三列?我期望的结果是:

a[[0,1,3], [0,2]] => [[0,  2],
                      [4,  6],
                      [12, 14]]

I’ve been going crazy trying to figure out what stupid thing I’m doing wrong here.

I’m using NumPy, and I have specific row indices and specific column indices that I want to select from. Here’s the gist of my problem:

import numpy as np

a = np.arange(20).reshape((5,4))
# array([[ 0,  1,  2,  3],
#        [ 4,  5,  6,  7],
#        [ 8,  9, 10, 11],
#        [12, 13, 14, 15],
#        [16, 17, 18, 19]])

# If I select certain rows, it works
print a[[0, 1, 3], :]
# array([[ 0,  1,  2,  3],
#        [ 4,  5,  6,  7],
#        [12, 13, 14, 15]])

# If I select certain rows and a single column, it works
print a[[0, 1, 3], 2]
# array([ 2,  6, 14])

# But if I select certain rows AND certain columns, it fails
print a[[0,1,3], [0,2]]
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# ValueError: shape mismatch: objects cannot be broadcast to a single shape

Why is this happening? Surely I should be able to select the 1st, 2nd, and 4th rows, and 1st and 3rd columns? The result I’m expecting is:

a[[0,1,3], [0,2]] => [[0,  2],
                      [4,  6],
                      [12, 14]]

回答 0

花式索引要求您提供每个维度的所有索引。您为第一个提供3个索引,为第二个仅提供2个索引,因此会出现错误。您想做这样的事情:

>>> a[[[0, 0], [1, 1], [3, 3]], [[0,2], [0,2], [0, 2]]]
array([[ 0,  2],
       [ 4,  6],
       [12, 14]])

当然写这很痛苦,所以您可以让广播帮助您:

>>> a[[[0], [1], [3]], [0, 2]]
array([[ 0,  2],
       [ 4,  6],
       [12, 14]])

如果您使用数组而不是列表建立索引,则此操作要简单得多:

>>> row_idx = np.array([0, 1, 3])
>>> col_idx = np.array([0, 2])
>>> a[row_idx[:, None], col_idx]
array([[ 0,  2],
       [ 4,  6],
       [12, 14]])

Fancy indexing requires you to provide all indices for each dimension. You are providing 3 indices for the first one, and only 2 for the second one, hence the error. You want to do something like this:

>>> a[[[0, 0], [1, 1], [3, 3]], [[0,2], [0,2], [0, 2]]]
array([[ 0,  2],
       [ 4,  6],
       [12, 14]])

That is of course a pain to write, so you can let broadcasting help you:

>>> a[[[0], [1], [3]], [0, 2]]
array([[ 0,  2],
       [ 4,  6],
       [12, 14]])

This is much simpler to do if you index with arrays, not lists:

>>> row_idx = np.array([0, 1, 3])
>>> col_idx = np.array([0, 2])
>>> a[row_idx[:, None], col_idx]
array([[ 0,  2],
       [ 4,  6],
       [12, 14]])

回答 1

正如 Toan 所建议的,一个简单的技巧是先选出想要的行,然后再在其结果上选出想要的列。

>>> a[[0,1,3], :]            # Returns the rows you want
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [12, 13, 14, 15]])
>>> a[[0,1,3], :][:, [0,2]]  # Selects the columns you want as well
array([[ 0,  2],
       [ 4,  6],
       [12, 14]])

[编辑]内置方法: np.ix_

最近,我发现 numpy 提供了一个内置的单行方法,可以准确实现 @Jaime 的建议,而不必使用广播语法(那种写法可读性较差)。摘自文档:

使用ix_可以快速构建索引数组,该索引数组将对叉积进行索引。a[np.ix_([1,3],[2,5])]返回数组[[a[1,2] a[1,5]], [a[3,2] a[3,5]]]

因此,您可以这样使用它:

>>> a = np.arange(20).reshape((5,4))
>>> a[np.ix_([0,1,3], [0,2])]
array([[ 0,  2],
       [ 4,  6],
       [12, 14]])

而且它的工作方式是像Jaime所建议的那样照顾数组的对齐,以便正确进行广播:

>>> np.ix_([0,1,3], [0,2])
(array([[0],
        [1],
        [3]]), array([[0, 2]]))

而且,正如 MikeC 在评论中所说,np.ix_ 的优点是返回视图,而我的第一个(编辑前的)答案做不到这一点。这意味着您现在可以对索引结果赋值:

>>> a[np.ix_([0,1,3], [0,2])] = -1
>>> a    
array([[-1,  1, -1,  3],
       [-1,  5, -1,  7],
       [ 8,  9, 10, 11],
       [-1, 13, -1, 15],
       [16, 17, 18, 19]])

As Toan suggests, a simple hack would be to just select the rows first, and then select the columns over that.

>>> a[[0,1,3], :]            # Returns the rows you want
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [12, 13, 14, 15]])
>>> a[[0,1,3], :][:, [0,2]]  # Selects the columns you want as well
array([[ 0,  2],
       [ 4,  6],
       [12, 14]])

[Edit] The built-in method: np.ix_

I recently discovered that numpy gives you an in-built one-liner to doing exactly what @Jaime suggested, but without having to use broadcasting syntax (which suffers from lack of readability). From the docs:

Using ix_ one can quickly construct index arrays that will index the cross product. a[np.ix_([1,3],[2,5])] returns the array [[a[1,2] a[1,5]], [a[3,2] a[3,5]]].

So you use it like this:

>>> a = np.arange(20).reshape((5,4))
>>> a[np.ix_([0,1,3], [0,2])]
array([[ 0,  2],
       [ 4,  6],
       [12, 14]])

And the way it works is that it takes care of aligning arrays the way Jaime suggested, so that broadcasting happens properly:

>>> np.ix_([0,1,3], [0,2])
(array([[0],
        [1],
        [3]]), array([[0, 2]]))

Also, as MikeC says in a comment, np.ix_ has the advantage of returning a view, which my first (pre-edit) answer did not. This means you can now assign to the indexed array:

>>> a[np.ix_([0,1,3], [0,2])] = -1
>>> a    
array([[-1,  1, -1,  3],
       [-1,  5, -1,  7],
       [ 8,  9, 10, 11],
       [-1, 13, -1, 15],
       [16, 17, 18, 19]])

回答 2

使用:

 >>> a[[0,1,3]][:,[0,2]]
array([[ 0,  2],
   [ 4,  6],
   [12, 14]])

或者:

>>> a[[0,1,3],::2]
array([[ 0,  2],
   [ 4,  6],
   [12, 14]])

USE:

 >>> a[[0,1,3]][:,[0,2]]
array([[ 0,  2],
   [ 4,  6],
   [12, 14]])

OR:

>>> a[[0,1,3],::2]
array([[ 0,  2],
   [ 4,  6],
   [12, 14]])

回答 3

使用 np.ix_ 是最方便的方法(如其他答案所述),但这里还有另一种有趣的做法:

>>> rows = [0, 1, 3]
>>> cols = [0, 2]

>>> a[rows].T[cols].T

array([[ 0,  2],
       [ 4,  6],
       [12, 14]])

Using np.ix_ is the most convenient way to do it (as answered by others), but here is another interesting way to do it:

>>> rows = [0, 1, 3]
>>> cols = [0, 2]

>>> a[rows].T[cols].T

array([[ 0,  2],
       [ 4,  6],
       [12, 14]])

在TensorFlow中使用预训练的单词嵌入(word2vec或Glove)

问题:在TensorFlow中使用预训练的单词嵌入(word2vec或Glove)

我最近审查了卷积文本分类的一个有趣的实现。但是我检查过的所有TensorFlow代码都使用如下的随机(未经预训练)嵌入向量:

with tf.device('/cpu:0'), tf.name_scope("embedding"):
    W = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="W")
    self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

有谁知道如何使用Word2vec或GloVe预训练词嵌入的结果,而不是随机的结果?

I’ve recently reviewed an interesting implementation for convolutional text classification. However all TensorFlow code I’ve reviewed uses a random (not pre-trained) embedding vectors like the following:

with tf.device('/cpu:0'), tf.name_scope("embedding"):
    W = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="W")
    self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

Does anybody know how to use the results of Word2vec or a GloVe pre-trained word embedding instead of a random one?


回答 0

您可以通过多种方式在 TensorFlow 中使用预训练的嵌入。假设你有一个名为 embedding 的 NumPy 数组,它有 vocab_size 行和 embedding_dim 列,而你想创建一个可以在 tf.nn.embedding_lookup() 调用中使用的张量 W。

  1. 只需将 W 创建为一个以 embedding 作为其值的 tf.constant():

    W = tf.constant(embedding, name="W")

    这是最简单的方法,但是由于 tf.constant() 的值会在内存中存储多份,所以内存使用效率不高。由于 embedding 可能非常大,这种方法只应该用在玩具示例里。

  2. 将 W 创建为 tf.Variable,并通过 tf.placeholder() 用 NumPy 数组对其进行初始化:

    W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                    trainable=False, name="W")
    
    embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
    embedding_init = W.assign(embedding_placeholder)
    
    # ...
    sess = tf.Session()
    
    sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})
    

    这样可以避免在计算图中存储一份 embedding 的副本,但它确实需要足够的内存来同时保存矩阵的两个副本(一个是 NumPy 数组,一个是 tf.Variable)。请注意,我假设您想在训练期间保持嵌入矩阵不变,所以 W 是用 trainable=False 创建的。

  3. 如果嵌入是作为另一个 TensorFlow 模型的一部分训练出来的,则可以使用 tf.train.Saver 从该模型的检查点文件加载其值。这意味着嵌入矩阵可以完全绕过 Python。按照选项 2 创建 W,然后执行以下操作:

    W = tf.Variable(...)
    
    embedding_saver = tf.train.Saver({"name_of_variable_in_other_model": W})
    
    # ...
    sess = tf.Session()
    embedding_saver.restore(sess, "checkpoint_filename.ckpt")
    

There are a few ways that you can use a pre-trained embedding in TensorFlow. Let’s say that you have the embedding in a NumPy array called embedding, with vocab_size rows and embedding_dim columns and you want to create a tensor W that can be used in a call to tf.nn.embedding_lookup().

  1. Simply create W as a tf.constant() that takes embedding as its value:

    W = tf.constant(embedding, name="W")
    

    This is the easiest approach, but it is not memory efficient because the value of a tf.constant() is stored multiple times in memory. Since embedding can be very large, you should only use this approach for toy examples.

  2. Create W as a tf.Variable and initialize it from the NumPy array via a tf.placeholder():

    W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                    trainable=False, name="W")
    
    embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
    embedding_init = W.assign(embedding_placeholder)
    
    # ...
    sess = tf.Session()
    
    sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})
    

    This avoid storing a copy of embedding in the graph, but it does require enough memory to keep two copies of the matrix in memory at once (one for the NumPy array, and one for the tf.Variable). Note that I’ve assumed that you want to hold the embedding matrix constant during training, so W is created with trainable=False.

  3. If the embedding was trained as part of another TensorFlow model, you can use a tf.train.Saver to load the value from the other model’s checkpoint file. This means that the embedding matrix can bypass Python altogether. Create W as in option 2, then do the following:

    W = tf.Variable(...)
    
    embedding_saver = tf.train.Saver({"name_of_variable_in_other_model": W})
    
    # ...
    sess = tf.Session()
    embedding_saver.restore(sess, "checkpoint_filename.ckpt")
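
None of the options above cover building the embedding array itself. A rough sketch of filling it from a GloVe text file (the file name, the vocab dict, and the random initialization of out-of-vocabulary words are all assumptions you would replace with your own):

import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}   # your own word -> index mapping
embedding_dim = 100
embedding = np.random.uniform(-1.0, 1.0,
                              (len(vocab), embedding_dim)).astype(np.float32)

with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        word, vector = parts[0], np.asarray(parts[1:], dtype=np.float32)
        if word in vocab:
            embedding[vocab[word]] = vector

# `embedding` can now be used with any of the options above.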
    

回答 1

我使用这种方法来加载和共享嵌入。

W = tf.get_variable(name="W", shape=embedding.shape, initializer=tf.constant_initializer(embedding), trainable=False)

I use this method to load and share embedding.

W = tf.get_variable(name="W", shape=embedding.shape, initializer=tf.constant_initializer(embedding), trainable=False)

回答 2

@mrry 的答案并不对,因为每次运行网络时它都会覆盖嵌入权重。因此,如果您采用小批量(minibatch)方式训练网络,嵌入权重就会被反复覆盖。在我看来,使用预训练嵌入的正确方法是:

embeddings = tf.get_variable("embeddings", shape=[dim1, dim2], initializer=tf.constant_initializer(np.array(embeddings_matrix))

The answer of @mrry is not right because it provokes the overwriting of the embedding weights each time the network is run, so if you are following a minibatch approach to train your network, you are overwriting the weights of the embeddings. So, from my point of view, the right way to use pre-trained embeddings is:

embeddings = tf.get_variable("embeddings", shape=[dim1, dim2], initializer=tf.constant_initializer(np.array(embeddings_matrix))

回答 3

2.0兼容答案:有很多预训练的嵌入,这些嵌入是由Google开发的,并且是开源的。

其中一些是Universal Sentence Encoder (USE), ELMO, BERT,等等。在代码中重用它们非常容易。

重用预训练嵌入 Universal Sentence Encoder 的代码如下所示:

  !pip install "tensorflow_hub>=0.6.0"
  !pip install "tensorflow>=2.0.0"

  import tensorflow as tf
  import tensorflow_hub as hub

  module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
  embed = hub.KerasLayer(module_url)
  embeddings = embed(["A long sentence.", "single-word",
                      "http://example.com"])
  print(embeddings.shape)  #(3,128)

有关 Google 开发并开源的预训练嵌入的更多信息,请参见 TF Hub Link。

2.0 Compatible Answer: There are many Pre-Trained Embeddings, which are developed by Google and which have been Open Sourced.

Some of them are Universal Sentence Encoder (USE), ELMO, BERT, etc.. and it is very easy to reuse them in your code.

Code to reuse the Pre-Trained Embedding, Universal Sentence Encoder is shown below:

  !pip install "tensorflow_hub>=0.6.0"
  !pip install "tensorflow>=2.0.0"

  import tensorflow as tf
  import tensorflow_hub as hub

  module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
  embed = hub.KerasLayer(module_url)
  embeddings = embed(["A long sentence.", "single-word",
                      "http://example.com"])
  print(embeddings.shape)  #(3,128)

For more information on the Pre-Trained Embeddings developed and open-sourced by Google, refer to the TF Hub Link.


回答 4

在Tensorflow版本2中,如果您使用Embedding层,则非常简单

X=tf.keras.layers.Embedding(input_dim=vocab_size,
                            output_dim=300,
                            input_length=Length_of_input_sequences,
                            embeddings_initializer=matrix_of_pretrained_weights
                            )(ur_inp)

With tensorflow version 2 it’s quite easy if you use the Embedding layer

X=tf.keras.layers.Embedding(input_dim=vocab_size,
                            output_dim=300,
                            input_length=Length_of_input_sequences,
                            embeddings_initializer=matrix_of_pretrained_weights
                            )(ur_inp)


回答 5

我也遇到过嵌入的问题,所以我写了一个带数据集的详细教程。在这里我想补充一下我尝试过的方法,您也可以试试:

import tensorflow as tf

tf.reset_default_graph()

input_x=tf.placeholder(tf.int32,shape=[None,None])

#you have to edit shape according to your embedding size


Word_embedding = tf.get_variable(name="W", shape=[400000,100], initializer=tf.constant_initializer(np.array(word_embedding)), trainable=False)
embedding_loopup= tf.nn.embedding_lookup(Word_embedding,input_x)

with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for ii in final_:
            print(sess.run(embedding_loopup,feed_dict={input_x:[ii]}))

如果您想从头开始理解,可以看看这里可运行的详细 Tutorial Ipython 示例。

I was also facing an embedding issue, so I wrote a detailed tutorial with a dataset. Here I would like to add what I tried; you can also try this method:

import tensorflow as tf

tf.reset_default_graph()

input_x=tf.placeholder(tf.int32,shape=[None,None])

#you have to edit shape according to your embedding size


Word_embedding = tf.get_variable(name="W", shape=[400000,100], initializer=tf.constant_initializer(np.array(word_embedding)), trainable=False)
embedding_loopup= tf.nn.embedding_lookup(Word_embedding,input_x)

with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for ii in final_:
            print(sess.run(embedding_loopup,feed_dict={input_x:[ii]}))

Here is a working, detailed Tutorial Ipython example; if you want to understand from scratch, take a look.


大小调整/缩放图像

问题:大小调整/缩放图像

我想取一张图像,并在它还是一个 numpy 数组的时候改变它的缩放比例。

例如,我有一个可口可乐瓶的图像: bottle-1

它对应一个形状为 (528, 203, 3) 的 numpy 数组,我想把它调整到比如第二张图像的大小: bottle-2

形状为(140, 54, 3)

如何在保留原始图像内容的同时,把图像尺寸改成特定的形状?其他回答建议每隔一行或两行丢弃像素,但我想做的基本上是像在图像编辑器里那样缩小图像,只不过用 Python 代码来实现。numpy/SciPy 中有没有可以做到这一点的库?

I would like to take an image and change the scale of the image, while it is a numpy array.

For example I have this image of a coca-cola bottle: bottle-1

Which translates to a numpy array of shape (528, 203, 3) and I want to resize that to say the size of this second image: bottle-2

Which has a shape of (140, 54, 3).

How do I change the size of the image to a certain shape while still maintaining the original image? Other answers suggest stripping every other or third row out, but what I want to do is basically shrink the image how you would via an image editor but in python code. Are there any libraries to do this in numpy/SciPy?


回答 0

是的,您可以安装 opencv(一个用于图像处理和计算机视觉的库),然后使用 cv2.resize 函数,例如:

import cv2
import numpy as np

img = cv2.imread('your_image.jpg')
res = cv2.resize(img, dsize=(54, 140), interpolation=cv2.INTER_CUBIC)

因此,img 是包含原始图像的 numpy 数组,而 res 是包含调整大小后图像的 numpy 数组。interpolation 参数是一个重要方面:调整图像大小有好几种方法,尤其是在缩小图像、而且原图尺寸不是目标尺寸整数倍的情况下。可能的插值方案有:

  • INTER_NEAREST -最近邻插值
  • INTER_LINEAR -双线性插值(默认使用)
  • INTER_AREA-使用像素面积关系进行重采样。这可能是图像抽取(缩小)时的首选方法,因为它可以给出没有摩尔纹的结果。但当图像被放大时,它与INTER_NEAREST方法类似。
  • INTER_CUBIC -在4×4像素邻域上的双三次插值
  • INTER_LANCZOS4 -在8×8像素邻域上进行Lanczos插值

与大多数选项一样,并不存在对所有缩放场景都“最佳”的方案;在某些情况下,一种策略会比另一种更可取。

Yeah, you can install opencv (this is a library used for image processing, and computer vision), and use the cv2.resize function. And for instance use:

import cv2
import numpy as np

img = cv2.imread('your_image.jpg')
res = cv2.resize(img, dsize=(54, 140), interpolation=cv2.INTER_CUBIC)

Here img is thus a numpy array containing the original image, whereas res is a numpy array containing the resized image. An important aspect is the interpolation parameter: there are several ways how to resize an image. Especially since you scale down the image, and the size of the original image is not a multiple of the size of the resized image. Possible interpolation schemas are:

  • INTER_NEAREST – a nearest-neighbor interpolation
  • INTER_LINEAR – a bilinear interpolation (used by default)
  • INTER_AREA – resampling using pixel area relation. It may be a preferred method for image decimation, as it gives moire’-free results. But when the image is zoomed, it is similar to the INTER_NEAREST method.
  • INTER_CUBIC – a bicubic interpolation over 4×4 pixel neighborhood
  • INTER_LANCZOS4 – a Lanczos interpolation over 8×8 pixel neighborhood

Like with most options, there is no “best” option in the sense that for every resize schema, there are scenarios where one strategy can be preferred over another.


回答 1

尽管或许可以只用 numpy 来完成这件事,但这个操作并不是内置的。话虽如此,您可以使用基于 numpy 构建的 scikit-image 来做这类图像处理。

Scikit-Image重缩放文档在此处

例如,您可以对图像执行以下操作:

from skimage.transform import resize
bottle_resized = resize(bottle, (140, 54))

这将为您处理插值,抗锯齿等问题。

While it might be possible to use numpy alone to do this, the operation is not built-in. That said, you can use scikit-image (which is built on numpy) to do this kind of image manipulation.

Scikit-Image rescaling documentation is here.

For example, you could do the following with your image:

from skimage.transform import resize
bottle_resized = resize(bottle, (140, 54))

This will take care of things like interpolation, anti-aliasing, etc. for you.
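
One usage note (a sketch, assuming a uint8 input image and a reasonably recent scikit-image): resize returns a floating-point array scaled to [0, 1] by default, so convert back if you need 8-bit pixel values:

import numpy as np
from skimage.transform import resize

bottle_resized = resize(bottle, (140, 54), anti_aliasing=True)   # float64, values in [0, 1]
bottle_uint8 = (bottle_resized * 255).astype(np.uint8)           # back to 8-bit pixels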


回答 2

对于从 Google 搜索过来、想找一种快速方法对 numpy 数组图像做下采样以用于机器学习应用的人,这里有一种超快的方法(改编自此处)。此方法仅在输入尺寸是输出尺寸的整数倍时才有效。

以下示例把图像从 128×128 下采样到 64×64(可以轻松更改)。

通道在后(channels last)的排列

# large image is shape (128, 128, 3)
# small image is shape (64, 64, 3)
input_size = 128
output_size = 64
bin_size = input_size // output_size
small_image = large_image.reshape((output_size, bin_size, 
                                   output_size, bin_size, 3)).max(3).max(1)

通道在前(channels first)的排列

# large image is shape (3, 128, 128)
# small image is shape (3, 64, 64)
input_size = 128
output_size = 64
bin_size = input_size // output_size
small_image = large_image.reshape((3, output_size, bin_size, 
                                      output_size, bin_size)).max(4).max(2)

对于灰度图像,只需像下面这样把 3 改为 1:

通道在前(channels first)的排列

# large image is shape (1, 128, 128)
# small image is shape (1, 64, 64)
input_size = 128
output_size = 64
bin_size = input_size // output_size
small_image = large_image.reshape((1, output_size, bin_size,
                                      output_size, bin_size)).max(4).max(2)

此方法使用的操作等价于最大池化(max pooling)。这是我找到的最快方法。

For people coming here from Google looking for a fast way to downsample images in numpy arrays for use in Machine Learning applications, here’s a super fast method (adapted from here ). This method only works when the input dimensions are a multiple of the output dimensions.

The following examples downsample from 128×128 to 64×64 (this can be easily changed).

Channels last ordering

# large image is shape (128, 128, 3)
# small image is shape (64, 64, 3)
input_size = 128
output_size = 64
bin_size = input_size // output_size
small_image = large_image.reshape((output_size, bin_size, 
                                   output_size, bin_size, 3)).max(3).max(1)

Channels first ordering

# large image is shape (3, 128, 128)
# small image is shape (3, 64, 64)
input_size = 128
output_size = 64
bin_size = input_size // output_size
small_image = large_image.reshape((3, output_size, bin_size, 
                                      output_size, bin_size)).max(4).max(2)

For grayscale images just change the 3 to a 1 like this:

Channels first ordering

# large image is shape (1, 128, 128)
# small image is shape (1, 64, 64)
input_size = 128
output_size = 64
bin_size = input_size // output_size
small_image = large_image.reshape((1, output_size, bin_size,
                                      output_size, bin_size)).max(4).max(2)

This method uses the equivalent of max pooling. It’s the fastest way to do this that I’ve found.
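
If the hard cut-off of max pooling is not what you want, the same reshape trick works with a mean instead (a small variant of the snippet above, not from the original answer; note the result is float even for uint8 input):

# large image is shape (128, 128, 3), small image is shape (64, 64, 3)
input_size = 128
output_size = 64
bin_size = input_size // output_size
small_image = large_image.reshape((output_size, bin_size,
                                   output_size, bin_size, 3)).mean(3).mean(1)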


回答 3

如果有人来这里是想找一种不借助其他库、在 Python 中缩放/调整图像大小的简单方法,下面是一个非常简单的图像缩放函数:

#simple image scaling to (nR x nC) size
def scale(im, nR, nC):
  nR0 = len(im)     # source number of rows 
  nC0 = len(im[0])  # source number of columns 
  return [[ im[int(nR0 * r / nR)][int(nC0 * c / nC)]  
             for c in range(nC)] for r in range(nR)]

用法示例:将(30 x 30)图像调整为(100 x 200):

import matplotlib.pyplot as plt

def sqr(x):
  return x*x

def f(r, c, nR, nC):
  return 1.0 if sqr(c - nC/2) + sqr(r - nR/2) < sqr(nC/4) else 0.0

# a red circle on a canvas of size (nR x nC)
def circ(nR, nC):
  return [[ [f(r, c, nR, nC), 0, 0] 
             for c in range(nC)] for r in range(nR)]

plt.imshow(scale(circ(30, 30), 100, 200))

输出: 缩放图像

这可以缩小/缩放图像,并且可以与numpy数组一起使用。

If anyone came here looking for a simple method to scale/resize an image in Python, without using additional libraries, here’s a very simple image resize function:

#simple image scaling to (nR x nC) size
def scale(im, nR, nC):
  nR0 = len(im)     # source number of rows 
  nC0 = len(im[0])  # source number of columns 
  return [[ im[int(nR0 * r / nR)][int(nC0 * c / nC)]  
             for c in range(nC)] for r in range(nR)]

Example usage: resizing a (30 x 30) image to (100 x 200):

import matplotlib.pyplot as plt

def sqr(x):
  return x*x

def f(r, c, nR, nC):
  return 1.0 if sqr(c - nC/2) + sqr(r - nR/2) < sqr(nC/4) else 0.0

# a red circle on a canvas of size (nR x nC)
def circ(nR, nC):
  return [[ [f(r, c, nR, nC), 0, 0] 
             for c in range(nC)] for r in range(nR)]

plt.imshow(scale(circ(30, 30), 100, 200))

Output: scaled image

This works to shrink/scale images, and works fine with numpy arrays.


回答 4

SciPy 的 imresize() 是另一种调整大小的方法,但从 SciPy v1.3.0 开始它将被移除。SciPy 建议改用 PIL 的图像调整大小方法:Image.resize(size, resample=0)

size –请求的大小(以像素为单位),为2元组:(宽度,高度)。
重采样 – 可选的重采样过滤器。可以是 PIL.Image.NEAREST(最近邻)、PIL.Image.BILINEAR(线性插值)、PIL.Image.BICUBIC(三次样条插值)或 PIL.Image.LANCZOS(高质量下采样滤波器)之一。如果省略,或者图像的模式为“1”或“P”,则设为 PIL.Image.NEAREST。

链接到这里:https : //pillow.readthedocs.io/en/3.1.x/reference/Image.html#PIL.Image.Image.resize

SciPy’s imresize() method was another resize method, but it will be removed starting with SciPy v 1.3.0 . SciPy refers to PIL image resize method: Image.resize(size, resample=0)

size – The requested size in pixels, as a 2-tuple: (width, height).
resample – An optional resampling filter. This can be one of PIL.Image.NEAREST (use nearest neighbour), PIL.Image.BILINEAR (linear interpolation), PIL.Image.BICUBIC (cubic spline interpolation), or PIL.Image.LANCZOS (a high-quality downsampling filter). If omitted, or if the image has mode “1” or “P”, it is set PIL.Image.NEAREST.

Link here: https://pillow.readthedocs.io/en/3.1.x/reference/Image.html#PIL.Image.Image.resize
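
A minimal sketch of using that PIL/Pillow method together with a numpy array (going array to PIL Image and back; the array is assumed to be uint8 RGB):

import numpy as np
from PIL import Image

img = Image.fromarray(arr)                                  # arr: uint8 array of shape (H, W, 3)
resized = img.resize((54, 140), resample=Image.BILINEAR)    # size is (width, height)
arr_resized = np.array(resized)                             # numpy array of shape (140, 54, 3)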


回答 5

是否有任何库可以在numpy / SciPy中执行此操作

当然。您可以在没有OpenCV,scikit-image或PIL的情况下执行此操作。

图像调整大小基本上是将每个像素的坐标从原始图像映射到其调整大小的位置。

由于图像的坐标必须是整数(可以把图像看成一个矩阵),如果映射后的坐标带有小数,就应当对像素值进行插值,把它近似到整数位置(例如,直接取离该位置最近的像素,这就是所谓的最近邻插值)。

您需要的只是一个能替您完成这种插值的函数。SciPy 提供了 interpolate.interp2d。

您可以使用它来调整numpy数组中图像的大小,例如arr,如下所示:

import numpy as np
from scipy.interpolate import interp2d

H, W = arr.shape[:2]             # numpy 数组的形状是 (行, 列),即 (高, 宽)
new_H, new_W = (300, 600)
xrange = lambda x: np.linspace(0, 1, x)

# interp2d 要求 z.shape == (len(y), len(x)),即 x 沿列方向变化
f = interp2d(xrange(W), xrange(H), arr, kind="linear")
new_arr = f(xrange(new_W), xrange(new_H))

当然,如果您的图像是RGB,则必须对每个通道执行插值。

如果您想了解更多信息,建议您观看“ 调整图像大小-Computerphile”

Are there any libraries to do this in numpy/SciPy

Sure. You can do this without OpenCV, scikit-image or PIL.

Image resizing is basically mapping the coordinates of each pixel from the original image to its resized position.

Since the coordinates of an image must be integers (think of it as a matrix), if the mapped coordinate has decimal values, you should interpolate the pixel value to approximate it to the integer position (e.g. getting the nearest pixel to that position is known as Nearest neighbor interpolation).

All you need is a function that does this interpolation for you. SciPy has interpolate.interp2d.

You can use it to resize an image in numpy array, say arr, as follows:

import numpy as np
from scipy.interpolate import interp2d

H, W = arr.shape[:2]             # numpy arrays are (rows, cols), i.e. (height, width)
new_H, new_W = (300, 600)
xrange = lambda x: np.linspace(0, 1, x)

# interp2d expects z.shape == (len(y), len(x)), i.e. x runs along the columns
f = interp2d(xrange(W), xrange(H), arr, kind="linear")
new_arr = f(xrange(new_W), xrange(new_H))

Of course, if your image is RGB, you have to perform the interpolation for each channel.

If you would like to understand more, I suggest watching Resizing Images – Computerphile.


回答 6

import cv2
import numpy as np

image_read = cv2.imread('filename.jpg',0) 
original_image = np.asarray(image_read)
width , height = 452,452
resize_image = np.zeros(shape=(width,height), dtype=np.uint8)  # match the uint8 dtype of the source image

for W in range(width):
    for H in range(height):
        new_width = int( W * original_image.shape[0] / width )
        new_height = int( H * original_image.shape[1] / height )
        resize_image[W][H] = original_image[new_width][new_height]

print("Resized image size : " , resize_image.shape)

cv2.imshow('resized image', resize_image)  # imshow needs a window name as its first argument
cv2.waitKey(0)
import cv2
import numpy as np

image_read = cv2.imread('filename.jpg',0) 
original_image = np.asarray(image_read)
width , height = 452,452
resize_image = np.zeros(shape=(width,height), dtype=np.uint8)  # match the uint8 dtype of the source image

for W in range(width):
    for H in range(height):
        new_width = int( W * original_image.shape[0] / width )
        new_height = int( H * original_image.shape[1] / height )
        resize_image[W][H] = original_image[new_width][new_height]

print("Resized image size : " , resize_image.shape)

cv2.imshow('resized image', resize_image)  # imshow needs a window name as its first argument
cv2.waitKey(0)