Tag archive: scipy

How to smooth a curve the right way?

Question: How to smooth a curve the right way?


Let's assume we have a dataset which might be given approximately by

import numpy as np
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2

Therefore we have a variation of 20% of the dataset. My first idea was to use the UnivariateSpline function of scipy, but the problem is that this does not handle the small noise well. If you consider the frequencies, the background is much smaller than the signal, so a spline only of the cutoff might be an idea, but that would involve a back-and-forth Fourier transformation, which might result in bad behaviour. Another way would be a moving average, but this would also need the right choice of the delay.
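
For reference, a minimal sketch of the spline attempt described above, reusing the x and y from the snippet (the smoothing factor is a hypothetical choice, not from the original question):

from scipy.interpolate import UnivariateSpline

spl = UnivariateSpline(x, y)
spl.set_smoothing_factor(0.5)   # hypothetical value; tune s to trade smoothness against fidelity
y_spline = spl(x)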

Any hints, books, or links on how to tackle this problem?



Answer 0


I prefer a Savitzky-Golay filter. It uses least squares to regress a small window of your data onto a polynomial, then uses the polynomial to estimate the point in the center of the window. Finally the window is shifted forward by one data point and the process repeats. This continues until every point has been optimally adjusted relative to its neighbors. It works great even with noisy samples from non-periodic and non-linear sources.

Here is a thorough cookbook example. See my code below to get an idea of how easy it is to use. Note: I left out the code for defining the savitzky_golay() function because you can literally copy/paste it from the cookbook example I linked above.
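
Since that cookbook page is no longer available (see the update below), here is a minimal reconstruction along the lines of the cookbook's savitzky_golay() function, written from memory rather than copied verbatim:

import numpy as np
from math import factorial

def savitzky_golay(y, window_size, order, deriv=0, rate=1):
    # window_size: odd number of points per fit window; order: polynomial order
    half_window = (window_size - 1) // 2
    # precompute the least-squares coefficients of the polynomial fit
    b = np.array([[k**i for i in range(order + 1)]
                  for k in range(-half_window, half_window + 1)])
    m = np.linalg.pinv(b)[deriv] * rate**deriv * factorial(deriv)
    # pad the signal at both ends with mirrored values to tame edge transients
    firstvals = y[0] - np.abs(y[1:half_window+1][::-1] - y[0])
    lastvals = y[-1] + np.abs(y[-half_window-1:-1][::-1] - y[-1])
    y = np.concatenate((firstvals, y, lastvals))
    return np.convolve(m[::-1], y, mode='valid')

With a definition like this in scope, the usage below runs as-is.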

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
yhat = savitzky_golay(y, 51, 3) # window size 51, polynomial order 3

plt.plot(x,y)
plt.plot(x,yhat, color='red')
plt.show()

(figure: optimally smoothing a noisy sinusoid)

UPDATE: It has come to my attention that the cookbook example I linked to has been taken down. Fortunately, the Savitzky-Golay filter has been incorporated into the SciPy library, as pointed out by @dodohjk. To adapt the above code to use the SciPy implementation, type:

from scipy.signal import savgol_filter
yhat = savgol_filter(y, 51, 3) # window size 51, polynomial order 3

Answer 1


A quick and dirty way I use to smooth data, based on a moving-average box (by convolution):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.8

def smooth(y, box_pts):
    # a flat (boxcar) kernel of width box_pts, normalised to sum to 1
    box = np.ones(box_pts)/box_pts
    y_smooth = np.convolve(y, box, mode='same')
    return y_smooth

plt.plot(x, y, 'o')
plt.plot(x, smooth(y,3), 'r-', lw=2)
plt.plot(x, smooth(y,19), 'g-', lw=2)
plt.show()



Answer 2


If you are interested in a “smooth” version of a signal that is periodic (like your example), then an FFT is the right way to go. Take the Fourier transform and subtract out the low-contributing frequencies:

import numpy as np
import scipy.fftpack

N = 100
x = np.linspace(0,2*np.pi,N)
y = np.sin(x) + np.random.random(N) * 0.2

w = scipy.fftpack.rfft(y)
f = scipy.fftpack.rfftfreq(N, x[1]-x[0])
spectrum = w**2

cutoff_idx = spectrum < (spectrum.max()/5)
w2 = w.copy()
w2[cutoff_idx] = 0

y2 = scipy.fftpack.irfft(w2)


Even if your signal is not completely periodic, this will do a great job of subtracting out white noise. There are many types of filters to use (high-pass, low-pass, etc…); the appropriate one depends on what you are looking for.
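
As one illustration of that point (my addition, not from the original answer), a conventional low-pass alternative is a Butterworth filter from scipy.signal; the order and normalized cutoff below are assumptions to tune:

import numpy as np
from scipy.signal import butter, filtfilt

x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x) + np.random.random(100) * 0.2

b, a = butter(3, 0.05, btype='low')   # 3rd-order low-pass, assumed cutoff of 0.05 (Nyquist units)
y_filt = filtfilt(b, a, y)            # filtfilt runs the filter forwards and backwards for zero phase shift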


Answer 3


Fitting a moving average to your data would smooth out the noise; see this answer for how to do that.

If you’d like to use LOWESS to fit your data (it’s similar to a moving average but more sophisticated), you can do that using the statsmodels library:

import numpy as np
import pylab as plt
import statsmodels.api as sm

x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
lowess = sm.nonparametric.lowess(y, x, frac=0.1)

plt.plot(x, y, '+')
plt.plot(lowess[:, 0], lowess[:, 1])
plt.show()

Finally, if you know the functional form of your signal, you could fit a curve to your data, which would probably be the best thing to do.
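
For instance, a minimal sketch of that curve-fitting approach with scipy.optimize.curve_fit, assuming the sine-plus-offset form of the example data (the model and initial guess are assumptions):

import numpy as np
from scipy.optimize import curve_fit

x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x) + np.random.random(100) * 0.2

def model(x, a, b, c):
    # assumed functional form: a*sin(b*x) + c
    return a * np.sin(b * x) + c

popt, pcov = curve_fit(model, x, y, p0=[1.0, 1.0, 0.1])
y_fit = model(x, *popt)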


Answer 4


Another option is to use KernelReg in statsmodels:

from statsmodels.nonparametric.kernel_regression import KernelReg
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2

# The third parameter specifies the type of the variable x;
# 'c' stands for continuous
kr = KernelReg(y,x,'c')
plt.plot(x, y, '+')
y_pred, y_std = kr.fit(x)

plt.plot(x, y_pred)
plt.show()

Answer 5


Check this out! There is a clear definition of smoothing of a 1D signal.

http://scipy-cookbook.readthedocs.io/items/SignalSmooth.html

Shortcut:

import numpy

def smooth(x,window_len=11,window='hanning'):
    """smooth the data using a window with requested size.

    This method is based on the convolution of a scaled window with the signal.
    The signal is prepared by introducing reflected copies of the signal 
    (with the window size) in both ends so that transient parts are minimized
    in the begining and end part of the output signal.

    input:
        x: the input signal 
        window_len: the dimension of the smoothing window; should be an odd integer
        window: the type of window from 'flat', 'hanning', 'hamming', 'bartlett', 'blackman'
            flat window will produce a moving average smoothing.

    output:
        the smoothed signal

    example:

    t=linspace(-2,2,0.1)
    x=sin(t)+randn(len(t))*0.1
    y=smooth(x)

    see also: 

    numpy.hanning, numpy.hamming, numpy.bartlett, numpy.blackman, numpy.convolve
    scipy.signal.lfilter

    TODO: the window parameter could be the window itself if an array instead of a string
    NOTE: length(output) != length(input), to correct this: return y[(window_len/2-1):-(window_len/2)] instead of just y.
    """

    if x.ndim != 1:
        raise ValueError("smooth only accepts 1 dimension arrays.")

    if x.size < window_len:
        raise ValueError("Input vector needs to be bigger than window size.")


    if window_len<3:
        return x


    if window not in ['flat', 'hanning', 'hamming', 'bartlett', 'blackman']:
        raise ValueError("Window is one of 'flat', 'hanning', 'hamming', 'bartlett', 'blackman'")


    s=numpy.r_[x[window_len-1:0:-1],x,x[-2:-window_len-1:-1]]
    #print(len(s))
    if window == 'flat': #moving average
        w=numpy.ones(window_len,'d')
    else:
        w=eval('numpy.'+window+'(window_len)')

    y=numpy.convolve(w/w.sum(),s,mode='valid')
    return y




from numpy import *
from pylab import *

def smooth_demo():

    t=linspace(-4,4,100)
    x=sin(t)
    xn=x+randn(len(t))*0.1
    y=smooth(x)

    ws=31

    subplot(211)
    plot(ones(ws))

    windows=['flat', 'hanning', 'hamming', 'bartlett', 'blackman']

    for w in windows[1:]:
        eval('plot('+w+'(ws) )')

    axis([0,30,0,1.1])

    legend(windows)
    title("The smoothing windows")
    subplot(212)
    plot(x)
    plot(xn)
    for w in windows:
        plot(smooth(xn,10,w))
    l=['original signal', 'signal with noise']
    l.extend(windows)

    legend(l)
    title("Smoothing a noisy signal")
    show()


if __name__=='__main__':
    smooth_demo()

Answer 6


If you are plotting a time series and have used matplotlib to draw the graph, you can use the rolling median to smooth it:

smotDeriv = timeseries.rolling(window=20, min_periods=5, center=True).median()

where timeseries is the data set you passed in; you can alter the window size for more smoothing.


ImportError: No module named numpy

Question: ImportError: No module named numpy


I have a very similar question to this question, but still one step behind. I have only one version of Python 3 installed on my Windows 7 (sorry) 64-bit system.

I installed numpy following this link – as suggested in the question. The installation went fine but when I execute

import numpy

I got the following error:

Import error: No module named numpy

I know this is probably a super basic question, but I’m still learning.

Thanks


Answer 0


Support for Python 3 was added in NumPy version 1.5.0, so to begin with, you must download/install a newer version of NumPy.


Answer 1


You can simply use

pip install numpy

Or for python3, use

pip3 install numpy

Answer 2


I think there is something wrong with the installation of numpy. Here are my steps to solve this problem.

  1. go to this website to download the correct package: http://sourceforge.net/projects/numpy/files/
  2. unzip the package
  3. go to the unpacked directory
  4. use this command to install numpy: python setup.py install

Answer 3


Installing Numpy on Windows

  1. Open Windows command prompt with administrator privileges (quick method: Press the Windows key. Type “cmd”. Right-click on the suggested “Command Prompt” and select “Run as Administrator”)
  2. Navigate to the Python installation directory’s Scripts folder using the “cd” (change directory) command. e.g. “cd C:\Program Files (x86)\PythonXX\Scripts”

This might be: C:\Users\\AppData\Local\Programs\Python\PythonXX\Scripts or C:\Program Files (x86)\PythonXX\Scripts (where XX represents the Python version number), depending on where it was installed. It may be easier to find the folder using Windows explorer, and then paste or type the address from the Explorer address bar into the command prompt.

  1. Enter the following command: “pip install numpy”.

You should see something similar to the following text appear as the package is downloaded and installed.

Collecting numpy
  Downloading numpy-1.13.3-2-cp27-none-win32.whl (6.7MB)  
  100% |################################| 6.7MB 112kB/s
Installing collected packages: numpy
Successfully installed numpy-1.13.3

Answer 4


I also had this problem (Import Error: No module named numpy) but in my case it was a problem with my PATH variables in Mac OS X. I had made an earlier edit to my .bash_profile file that caused the paths for my Anaconda installation (and others) to not be added properly.

Just adding this comment to the list here in case other people like me come to this page with the same error message and have the same problem as I had.
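
If you suspect the same kind of PATH mix-up, a quick check (my addition, not part of the original answer) is to ask Python which interpreter and which numpy it actually resolves:

import sys
print(sys.executable)   # the Python binary that is actually running

import numpy
print(numpy.__file__)   # the location numpy was imported from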


Answer 5


You installed the Numpy Version for Python 2.6 – so you can only use it with Python 2.6. You have to install Numpy for Python 3.x, e.g. that one: http://sourceforge.net/projects/numpy/files/NumPy/1.6.1/numpy-1.6.1-win32-superpack-python3.2.exe/download

For an overview of the different versions, see here: http://sourceforge.net/projects/numpy/files/NumPy/1.6.1/


Answer 6


I had this problem too after I installed Numpy. I solved it by just closing the Python interpreter and reopening. It may be something else to try if anyone else has this problem, perhaps it will save a few minutes!


Answer 7


Faced with the same issue:

ImportError: No module named numpy

So, in our case (we use pip and Python 2.7) the solution was to split the pip install command:

From

RUN pip install numpy scipy pandas sklearn

TO

RUN pip install numpy scipy
RUN pip install pandas sklearn

Solution found here: https://github.com/pandas-dev/pandas/issues/25193; it's related to the latest update of pandas to v0.24.0.


Answer 8


I had numpy installed on the same environment both by pip and by conda, and simply removing and reinstalling either was not enough.

I had to reinstall both.

I don’t know why it suddenly happened, but the solution was

pip uninstall numpy

conda uninstall numpy

uninstalling from conda also removed torch and torchvision.

then

conda install pytorch-cpu torchvision-cpu -c pytorch

and

pip install numpy

this resolved the issue for me.


Answer 9


I too faced the above problem with Python 3 while setting up Python for machine learning.

I followed the steps below:

Install python-2.7.13.msi

• set PATH=C:\Python27

• set PATH=C:\Python27\Scripts

Go to http://www.lfd.uci.edu/~gohlke/pythonlibs/#scipy

Downloaded: numpy-1.13.1+mkl-cp27-cp27m-win32.whl
            and scipy-0.18.0-cp27-cp27m-win32.whl

Installing numpy: pip install numpy-1.13.1+mkl-cp27-cp27m-win32.whl

Installing scipy: pip install scipy-0.18.0-cp27-cp27m-win32.whl

You can test the correctness using the commands below:

>>> import numpy
>>> import scipy
>>> import sklearn
>>> numpy.version.version
'1.13.1'
>>> scipy.version.version
'0.19.1'
>>>

Answer 10


I’m not sure exactly why I was getting the error, but pip3 uninstall numpy then pip3 install numpy resolved the issue for me.


Answer 11


For installing NumPy via Anaconda, use the commands below:

  • conda install -c conda-forge numpy
  • conda install -c conda-forge/label/broken numpy

Answer 12


Those who are using xonsh, do xpip install numpy.


Answer 13


For those using Python 2.7, try:

apt-get install -y python-numpy

Instead of pip install numpy


Answer 14


You can try:

py -3 -m pip install anyPackageName

In your case use:

py -3 -m pip install numpy

Thanks


Answer 15


This is a problem with the numpy version; please check out $CAFFE_ROOT/python/requirement.txt. Then exec: sudo apt-get install python-numpy>=x.x.x, and this problem will be solved.


Answer 16

import numpy as np
ImportError: No module named numpy 

I got this even though I knew numpy was installed, and I unsuccessfully tried all the advice above. The fix for me was to remove the as np and refer to the module directly (Python 3.4.8 on CentOS).

import numpy
DataTwo=numpy.stack((OutputListUnixTwo))...

Answer 17


You should try to install numpy using one of these:

pip install numpy
pip2 install numpy
pip3 install numpy

For some reason, in my case pip2 solved the problem.


Answer 18


After trying many suggestions from various sites and similar questions, what worked for me was to uninstall all Python stuff and reinstall Anaconda only (see https://stackoverflow.com/a/38330088/1083292).

The previous Python installation I had was not only redundant but also caused me trouble.


Answer 19


If it was working before, reinstalling Python should solve the issue.

I just hit and resolved this issue using: How can I install a previous version of Python 3 in macOS using homebrew?


Answer 20


For me, on Windows 10, I had unknowingly installed multiple Python versions (one from the PyCharm IDE and another from the Windows Store). I uninstalled the one from the Windows Store and, just to be thorough, uninstalled numpy (pip uninstall numpy) and then installed it again (pip install numpy). It worked in the terminal in PyCharm and also in the command prompt.


Calculating Pearson correlation and significance in Python

Question: Calculating Pearson correlation and significance in Python


I am looking for a function that takes as input two lists, and returns the Pearson correlation, and the significance of the correlation.


Answer 0


You can have a look at scipy.stats:

from pydoc import help
from scipy.stats.stats import pearsonr
help(pearsonr)

>>>
Help on function pearsonr in module scipy.stats.stats:

pearsonr(x, y)
 Calculates a Pearson correlation coefficient and the p-value for testing
 non-correlation.

 The Pearson correlation coefficient measures the linear relationship
 between two datasets. Strictly speaking, Pearson's correlation requires
 that each dataset be normally distributed. Like other correlation
 coefficients, this one varies between -1 and +1 with 0 implying no
 correlation. Correlations of -1 or +1 imply an exact linear
 relationship. Positive correlations imply that as x increases, so does
 y. Negative correlations imply that as x increases, y decreases.

 The p-value roughly indicates the probability of an uncorrelated system
 producing datasets that have a Pearson correlation at least as extreme
 as the one computed from these datasets. The p-values are not entirely
 reliable but are probably reasonable for datasets larger than 500 or so.

 Parameters
 ----------
 x : 1D array
 y : 1D array the same length as x

 Returns
 -------
 (Pearson's correlation coefficient,
  2-tailed p-value)

 References
 ----------
 http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
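
A minimal usage sketch with hypothetical lists, returning both values the question asks for:

from scipy.stats import pearsonr

r, p = pearsonr([1, 2, 3, 4, 5], [5, 6, 7, 8, 7])   # hypothetical data
print(r, p)   # correlation coefficient and two-tailed p-value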

Answer 1


The Pearson correlation can be calculated with numpy’s corrcoef.

import numpy
numpy.corrcoef(list1, list2)[0, 1]
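
corrcoef reports only the coefficient. If you also need the significance, here is a hedged sketch (my addition, with hypothetical data) of deriving a two-sided p-value from r and the sample size via the standard t-statistic:

import numpy as np
from scipy import stats

list1 = [1, 2, 3, 4, 5, 6]
list2 = [2, 1, 4, 3, 7, 5]   # hypothetical data

r = np.corrcoef(list1, list2)[0, 1]
n = len(list1)
t = r * np.sqrt((n - 2) / (1 - r**2))   # t-statistic under H0: rho = 0
p = 2 * stats.t.sf(abs(t), df=n - 2)    # two-tailed p-value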

Answer 2


An alternative can be the native scipy function linregress, which calculates:

slope : slope of the regression line

intercept : intercept of the regression line

r-value : correlation coefficient

p-value : two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero

stderr : Standard error of the estimate

And here is an example:

a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
from scipy.stats import linregress
linregress(a, b)

will return you:

LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)

Answer 3


If you don’t feel like installing scipy, I’ve used this quick hack, slightly modified from Programming Collective Intelligence:

(Edited for correctness.)

def pearsonr(x, y):
  # Assume len(x) == len(y)
  n = len(x)
  sum_x = float(sum(x))
  sum_y = float(sum(y))
  sum_x_sq = sum(map(lambda x: pow(x, 2), x))
  sum_y_sq = sum(map(lambda x: pow(x, 2), y))
  psum = sum(map(lambda x, y: x * y, x, y))
  num = psum - (sum_x * sum_y/n)
  den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
  if den == 0: return 0
  return num / den

Answer 4


The following code is a straight-up interpretation of the definition:

import math

def average(x):
    assert len(x) > 0
    return float(sum(x)) / len(x)

def pearson_def(x, y):
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0
    xdiff2 = 0
    ydiff2 = 0
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff

    return diffprod / math.sqrt(xdiff2 * ydiff2)

Test:

print(pearson_def([1,2,3], [1,5,7]))

returns

0.981980506062

This agrees with Excel, this calculator, and SciPy (also NumPy), which return 0.981980506, 0.9819805060619657, and 0.98198050606196574, respectively.

R:

> cor( c(1,2,3), c(1,5,7))
[1] 0.9819805

EDIT: Fixed a bug pointed out by a commenter.


Answer 5


You can do this with pandas.DataFrame.corr, too:

import pandas as pd
a = [[1, 2, 3],
     [5, 6, 9],
     [5, 6, 11],
     [5, 6, 13],
     [5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()

This gives

          0         1         2
0  1.000000  0.745601  0.916579
1  0.745601  1.000000  0.544248
2  0.916579  0.544248  1.000000

Answer 6


Rather than relying on numpy/scipy, I think my answer should be the easiest to code and to understand in terms of the steps for calculating the Pearson Correlation Coefficient (PCC).

import math

# calculates the mean
def mean(x):
    sum = 0.0
    for i in x:
         sum += i
    return sum / len(x) 

# calculates the sample standard deviation
def sampleStandardDeviation(x):
    sumv = 0.0
    for i in x:
         sumv += (i - mean(x))**2
    return math.sqrt(sumv/(len(x)-1))

# calculates the PCC using both the 2 functions above
def pearson(x,y):
    scorex = []
    scorey = []

    for i in x: 
        scorex.append((i - mean(x))/sampleStandardDeviation(x)) 

    for j in y:
        scorey.append((j - mean(y))/sampleStandardDeviation(y))

# multiplies both lists together into 1 list (hence zip) and sums the whole list   
    return (sum([i*j for i,j in zip(scorex,scorey)]))/(len(x)-1)

The significance of the PCC is basically to show you how strongly correlated the two variables/lists are. It is important to note that the PCC value ranges from -1 to 1. A value between 0 and 1 denotes a positive correlation, a value of 0 means no correlation whatsoever, and a value between -1 and 0 denotes a negative correlation.


Answer 7


Pearson coefficient calculation using pandas in Python: I would suggest trying this approach since your data contains lists. It will be easy to interact with your data and manipulate it from the console, since you can visualise your data structure and update it as you wish. You can also export the data set, save it, and add new data outside of the Python console for later analysis. This code is simpler and contains fewer lines of code. I am assuming you need a few quick lines of code to screen your data for further analysis.

Example:

data = {'list 1':[2,4,6,8],'list 2':[4,16,36,64]}

import pandas as pd #To Convert your lists to pandas data frames convert your lists into pandas dataframes

df = pd.DataFrame(data, columns = ['list 1','list 2'])

from scipy import stats # For in-built method to get PCC

pearson_coef, p_value = stats.pearsonr(df["list 1"], df["list 2"]) #define the columns to perform calculations on
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", p_value) # Results 

However, you did not post your data for me to see the size of the data set or the transformations that might be needed before the analysis.


Answer 8


Hmm, many of these responses have long and hard-to-read code…

I’d suggest using numpy with its nifty features when working with arrays:

import numpy as np
def pcc(X, Y):
   ''' Compute Pearson Correlation Coefficient. '''
   # Normalise X and Y (as copies, so the caller's arrays are not modified)
   X = X - X.mean(0)
   Y = Y - Y.mean(0)
   # Standardise X and Y
   X = X / X.std(0)
   Y = Y / Y.std(0)
   # Compute mean product
   return np.mean(X*Y)

# Using it on a random example
from random import random
X = np.array([random() for x in range(100)])
Y = np.array([random() for x in range(100)])
pcc(X, Y)

Answer 9


This is an implementation of the Pearson correlation function using numpy:


def corr(data1, data2):
    "data1 & data2 should be numpy arrays."
    mean1 = data1.mean() 
    mean2 = data2.mean()
    std1 = data1.std()
    std2 = data2.std()

#     corr = ((data1-mean1)*(data2-mean2)).mean()/(std1*std2)
    corr = ((data1*data2).mean()-mean1*mean2)/(std1*std2)
    return corr


Answer 10


Here’s a variant on mkh’s answer that runs much faster than it (and than scipy.stats.pearsonr), using numba.

import numba

@numba.jit
def corr(data1, data2):
    M = data1.size

    sum1 = 0.
    sum2 = 0.
    for i in range(M):
        sum1 += data1[i]
        sum2 += data2[i]
    mean1 = sum1 / M
    mean2 = sum2 / M

    var_sum1 = 0.
    var_sum2 = 0.
    cross_sum = 0.
    for i in range(M):
        var_sum1 += (data1[i] - mean1) ** 2
        var_sum2 += (data2[i] - mean2) ** 2
        cross_sum += (data1[i] * data2[i])

    std1 = (var_sum1 / M) ** .5
    std2 = (var_sum2 / M) ** .5
    cross_mean = cross_sum / M

    return (cross_mean - mean1 * mean2) / (std1 * std2)

Answer 11


Here is an implementation of Pearson correlation based on sparse vectors. The vectors here are expressed as lists of tuples of the form (index, value). The two sparse vectors can be of different lengths, but the overall vector size has to be the same. This is useful for text mining applications where the vector size is extremely large, since most features are bags of words and calculations are therefore usually performed using sparse vectors.

from math import sqrt

def get_pearson_corelation(self, first_feature_vector=[], second_feature_vector=[], length_of_featureset=0):
    indexed_feature_dict = {}
    if first_feature_vector == [] or second_feature_vector == [] or length_of_featureset == 0:
        raise ValueError("Empty feature vectors or zero length of featureset in get_pearson_corelation")

    sum_a = sum(value for index, value in first_feature_vector)
    sum_b = sum(value for index, value in second_feature_vector)

    avg_a = float(sum_a) / length_of_featureset
    avg_b = float(sum_b) / length_of_featureset

    mean_sq_error_a = sqrt((sum((value - avg_a) ** 2 for index, value in first_feature_vector)) + ((
        length_of_featureset - len(first_feature_vector)) * ((0 - avg_a) ** 2)))
    mean_sq_error_b = sqrt((sum((value - avg_b) ** 2 for index, value in second_feature_vector)) + ((
        length_of_featureset - len(second_feature_vector)) * ((0 - avg_b) ** 2)))

    covariance_a_b = 0

    #calculate covariance for the sparse vectors
    for tuple in first_feature_vector:
        if len(tuple) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s") % (tuple,)
        indexed_feature_dict[tuple[0]] = tuple[1]
    count_of_features = 0
    for tuple in second_feature_vector:
        count_of_features += 1
        if len(tuple) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s") % (tuple,)
        if tuple[0] in indexed_feature_dict:
            covariance_a_b += ((indexed_feature_dict[tuple[0]] - avg_a) * (tuple[1] - avg_b))
            del (indexed_feature_dict[tuple[0]])
        else:
            covariance_a_b += (0 - avg_a) * (tuple[1] - avg_b)

    for index in indexed_feature_dict:
        count_of_features += 1
        covariance_a_b += (indexed_feature_dict[index] - avg_a) * (0 - avg_b)

    #adjust covariance with rest of vector with 0 value
    covariance_a_b += (length_of_featureset - count_of_features) * -avg_a * -avg_b

    if mean_sq_error_a == 0 or mean_sq_error_b == 0:
        return -1
    else:
        return float(covariance_a_b) / (mean_sq_error_a * mean_sq_error_b)

Unit tests:

def test_get_get_pearson_corelation(self):
    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7)]
    self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 3), 0.981980506062, 3, None, None)

    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7), (4, 14)]
    self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 5), -0.0137089240555, 3, None, None)

Answer 12


I have a very simple and easy-to-understand solution for this. For two arrays of equal length, the Pearson coefficient can easily be computed as follows:

import numpy as np

def manual_pearson(a, b):
  """
  Accepts two arrays of equal length, and computes correlation coefficient.
  Numerator is the sum of product of (a - a_avg) and (b - b_avg),
  while denominator is the product of a_std and b_std multiplied by
  length of array.
  """
  a_avg, b_avg = np.average(a), np.average(b)
  a_stdev, b_stdev = np.std(a), np.std(b)
  n = len(a)
  denominator = a_stdev * b_stdev * n
  numerator = np.sum(np.multiply(a-a_avg, b-b_avg))
  p_coef = numerator/denominator
  return p_coef

Answer 13


You may wonder how to interpret your probability in the context of looking for a correlation in a particular direction (negative or positive correlation.) Here is a function I wrote to help with that. It might even be right!

It’s based on info I gleaned from http://www.vassarstats.net/rsig.html and http://en.wikipedia.org/wiki/Student%27s_t_distribution, thanks to other answers posted here.

# Given (possibly random) variables, X and Y, and a correlation direction,
# returns:
#  (r, p),
# where r is the Pearson correlation coefficient, and p is the probability
# that there is no correlation in the given direction.
#
# direction:
#  if positive, p is the probability that there is no positive correlation in
#    the population sampled by X and Y
#  if negative, p is the probability that there is no negative correlation
#  if 0, p is the probability that there is no correlation in either direction
from scipy import stats

def probabilityNotCorrelated(X, Y, direction=0):
    x = len(X)
    if x != len(Y):
        raise ValueError("variables not same len: " + str(x) + ", and " + \
                         str(len(Y)))
    if x < 6:
        raise ValueError("must have at least 6 samples, but have " + str(x))
    (corr, prb_2_tail) = stats.pearsonr(X, Y)

    if not direction:
        return (corr, prb_2_tail)

    prb_1_tail = prb_2_tail / 2
    if corr * direction > 0:
        return (corr, prb_1_tail)

    return (corr, 1 - prb_1_tail)

Answer 14


You can take a look at this article. It is a well-documented example of calculating correlation based on historical forex currency-pair data from multiple files using the pandas library (for Python), and then generating a heatmap plot using the seaborn library.

http://www.tradinggeeks.net/2015/08/calculating-correlation-in-python/


Answer 15

def pearson(x,y):
  n=len(x)
  vals=range(n)

  sumx=sum([float(x[i]) for i in vals])
  sumy=sum([float(y[i]) for i in vals])

  sumxSq=sum([x[i]**2.0 for i in vals])
  sumySq=sum([y[i]**2.0 for i in vals])

  pSum=sum([x[i]*y[i] for i in vals])
  # Calculating Pearson correlation
  num=pSum-(sumx*sumy/n)
  den=((sumxSq-pow(sumx,2)/n)*(sumySq-pow(sumy,2)/n))**.5
  if den==0: return 0
  r=num/den
  return r

What is the difference between pandas and NumPy + SciPy in Python? [closed]

Question: What is the difference between pandas and NumPy + SciPy in Python? [closed]


They both seem exceedingly similar and I’m curious as to which package would be more beneficial for financial data analysis.


Answer 0


pandas provides high level data manipulation tools built on top of NumPy. NumPy by itself is a fairly low-level tool, similar to MATLAB. pandas on the other hand provides rich time series functionality, data alignment, NA-friendly statistics, groupby, merge and join methods, and lots of other conveniences. It has become very popular in recent years in financial applications. I will have a chapter dedicated to financial data analysis using pandas in my upcoming book.


Answer 1


Numpy is required by pandas (and by virtually all numerical tools for Python). Scipy is not strictly required for pandas but is listed as an “optional dependency”. I wouldn’t say that pandas is an alternative to Numpy and/or Scipy. Rather, it’s an extra tool that provides a more streamlined way of working with numerical and tabular data in Python. You can use pandas data structures but freely draw on Numpy and Scipy functions to manipulate them.


Answer 2


Pandas offers a great way to manipulate tables, as it makes binning easy (binning a dataframe in pandas in Python) and calculating statistics straightforward. Another great thing in pandas is the Panel class, with which you can join series of layers with different properties and combine them using the groupby function.
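
As a hedged sketch of the binning-plus-statistics workflow described above (the column and bucket names are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({'price': np.random.random(100) * 100})
df['bucket'] = pd.cut(df['price'], bins=5)   # bin into 5 equal-width buckets
print(df.groupby('bucket')['price'].agg(['count', 'mean', 'std']))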


Moving average or running mean

Question: Moving average or running mean


Is there a SciPy function or NumPy function or module for Python that calculates the running mean of a 1D array given a specific window?


Answer 0


For a short, fast solution that does the whole thing in one loop, without dependencies, the code below works great.

mylist = [1, 2, 3, 4, 5, 6, 7]
N = 3
cumsum, moving_aves = [0], []

for i, x in enumerate(mylist, 1):
    cumsum.append(cumsum[i-1] + x)
    if i>=N:
        moving_ave = (cumsum[i] - cumsum[i-N])/N
        #can do stuff with moving_ave here
        moving_aves.append(moving_ave)

Answer 1


UPD: more efficient solutions have been proposed by Alleo and jasaarim.


You can use np.convolve for that:

np.convolve(x, np.ones((N,))/N, mode='valid')

Explanation

The running mean is a case of the mathematical operation of convolution. For the running mean, you slide a window along the input and compute the mean of the window’s contents. For discrete 1D signals, convolution is the same thing, except instead of the mean you compute an arbitrary linear combination, i.e. multiply each element by a corresponding coefficient and add up the results. Those coefficients, one for each position in the window, are sometimes called the convolution kernel. Now, the arithmetic mean of N values is (x_1 + x_2 + ... + x_N) / N, so the corresponding kernel is (1/N, 1/N, ..., 1/N), and that’s exactly what we get by using np.ones((N,))/N.

Edges

The mode argument of np.convolve specifies how to handle the edges. I chose the valid mode here because I think that’s how most people expect the running mean to work, but you may have other priorities. Here is a plot that illustrates the difference between the modes:

import numpy as np
import matplotlib.pyplot as plt
modes = ['full', 'same', 'valid']
for m in modes:
    plt.plot(np.convolve(np.ones((200,)), np.ones((50,))/50, mode=m));
plt.axis([-10, 251, -.1, 1.1]);
plt.legend(modes, loc='lower center');
plt.show()

(figure: running mean convolve modes)


Answer 2


Efficient solution

Convolution is much better than the straightforward approach, but (I guess) it uses FFT and is thus quite slow. However, specifically for computing the running mean, the following approach works fine:

def running_mean(x, N):
    cumsum = numpy.cumsum(numpy.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / float(N)

The code to check

In[3]: x = numpy.random.random(100000)
In[4]: N = 1000
In[5]: %timeit result1 = numpy.convolve(x, numpy.ones((N,))/N, mode='valid')
10 loops, best of 3: 41.4 ms per loop
In[6]: %timeit result2 = running_mean(x, N)
1000 loops, best of 3: 1.04 ms per loop

Note that numpy.allclose(result1, result2) is True, so the two methods are equivalent. The greater N, the greater the difference in time.

Warning: although cumsum is faster, there will be increased floating point error that may cause your results to be invalid/incorrect/unacceptable.

The comments pointed out this floating point error issue, but I am making it more obvious here in the answer:

# demonstrate loss of precision with only 100,000 points
np.random.seed(42)
x = np.random.randn(100000)+1e6
y1 = running_mean_convolve(x, 10)
y2 = running_mean_cumsum(x, 10)
assert np.allclose(y1, y2, rtol=1e-12, atol=0)
  • the more points you accumulate over the greater the floating point error (so 1e5 points is noticable, 1e6 points is more significant, more than 1e6 and you may want to resetting the accumulators)
  • you can cheat by using np.longdouble but your floating point error still will get significant for relatively large number of points (around >1e5 but depends on your data)
  • you can plot the error and see it increasing relatively fast
  • the convolve solution is slower but does not have this floating point loss of precision
  • the uniform_filter1d solution is faster than this cumsum solution AND does not have this floating point loss of precision
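
To illustrate the bullets above, here is a small sketch (mine, not part of the original answer) that measures how far the cumsum result drifts from the convolve result along the array:

import numpy as np

def running_mean_convolve(x, N):
    return np.convolve(x, np.ones(N) / float(N), 'valid')

def running_mean_cumsum(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / float(N)

np.random.seed(42)
x = np.random.randn(100000) + 1e6          # the large offset amplifies cancellation
err = np.abs(running_mean_cumsum(x, 10) - running_mean_convolve(x, 10))
print(err[:100].max(), err[-100:].max())   # the error near the end is far larger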

回答 3

更新:以下示例展示的是旧的 pandas.rolling_mean 函数,该函数已在较新版本的 pandas 中移除。与下面的函数调用等效的现代写法为

In [8]: pd.Series(x).rolling(window=N).mean().iloc[N-1:].values
Out[8]: 
array([ 0.49815397,  0.49844183,  0.49840518, ...,  0.49488191,
        0.49456679,  0.49427121])

pandas 比 NumPy 或 SciPy 更适合做这件事。它的 rolling_mean 函数可以方便地完成这项工作。当输入是数组时,它还会返回一个 NumPy 数组。

用任何自定义的纯 Python 实现都很难在性能上胜过 rolling_mean。下面是它与前面两个解决方案的性能对比示例:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: def running_mean(x, N):
   ...:     cumsum = np.cumsum(np.insert(x, 0, 0)) 
   ...:     return (cumsum[N:] - cumsum[:-N]) / N
   ...:

In [4]: x = np.random.random(100000)

In [5]: N = 1000

In [6]: %timeit np.convolve(x, np.ones((N,))/N, mode='valid')
10 loops, best of 3: 172 ms per loop

In [7]: %timeit running_mean(x, N)
100 loops, best of 3: 6.72 ms per loop

In [8]: %timeit pd.rolling_mean(x, N)[N-1:]
100 loops, best of 3: 4.74 ms per loop

In [9]: np.allclose(pd.rolling_mean(x, N)[N-1:], running_mean(x, N))
Out[9]: True

关于如何处理边缘值,也有不错的选择。

Update: The example below shows the old pandas.rolling_mean function which has been removed in recent versions of pandas. A modern equivalent of the function call below would be

In [8]: pd.Series(x).rolling(window=N).mean().iloc[N-1:].values
Out[8]: 
array([ 0.49815397,  0.49844183,  0.49840518, ...,  0.49488191,
        0.49456679,  0.49427121])

pandas is more suitable for this than NumPy or SciPy. Its function rolling_mean does the job conveniently. It also returns a NumPy array when the input is an array.

It is difficult to beat rolling_mean in performance with any custom pure Python implementation. Here is an example performance against two of the proposed solutions:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: def running_mean(x, N):
   ...:     cumsum = np.cumsum(np.insert(x, 0, 0)) 
   ...:     return (cumsum[N:] - cumsum[:-N]) / N
   ...:

In [4]: x = np.random.random(100000)

In [5]: N = 1000

In [6]: %timeit np.convolve(x, np.ones((N,))/N, mode='valid')
10 loops, best of 3: 172 ms per loop

In [7]: %timeit running_mean(x, N)
100 loops, best of 3: 6.72 ms per loop

In [8]: %timeit pd.rolling_mean(x, N)[N-1:]
100 loops, best of 3: 4.74 ms per loop

In [9]: np.allclose(pd.rolling_mean(x, N)[N-1:], running_mean(x, N))
Out[9]: True

There are also nice options as to how to deal with the edge values.


回答 4

您可以使用以下方法计算运行平均值:

import numpy as np

def runningMean(x, N):
    y = np.zeros((len(x),))
    for ctr in range(len(x)):
         y[ctr] = np.sum(x[ctr:(ctr+N)])
    return y/N

但这很慢。

幸运的是,numpy 包含一个卷积函数,我们可以用它来加速计算。运行均值等价于将 x 与一个长度为 N、所有元素都等于 1/N 的向量做卷积。convolve 的 numpy 实现包含起始瞬变,因此您必须删除前 N-1 个点:

def runningMeanFast(x, N):
    return np.convolve(x, np.ones((N,))/N)[(N-1):]

在我的机器上,快速版本的速度要快20-30倍,具体取决于输入向量的长度和平均窗口的大小。

请注意,convolve 确实有一个 'same' 模式,看起来应该能解决起始瞬变问题,但它会把瞬变拆分到开头和结尾两处。

You can calculate a running mean with:

import numpy as np

def runningMean(x, N):
    y = np.zeros((len(x),))
    for ctr in range(len(x)):
         y[ctr] = np.sum(x[ctr:(ctr+N)])
    return y/N

But it’s slow.

Fortunately, numpy includes a convolve function which we can use to speed things up. The running mean is equivalent to convolving x with a vector that is N long, with all members equal to 1/N. The numpy implementation of convolve includes the starting transient, so you have to remove the first N-1 points:

def runningMeanFast(x, N):
    return np.convolve(x, np.ones((N,))/N)[(N-1):]

On my machine, the fast version is 20-30 times faster, depending on the length of the input vector and size of the averaging window.

Note that convolve does include a 'same' mode which seems like it should address the starting transient issue, but it splits it between the beginning and end.


回答 5

或者,也可以使用现成的 Python 模块来计算移动平均。

在Tradewave.net上进行的测试中,TA-lib总是会赢得:

import talib as ta
import numpy as np
import pandas as pd
import scipy
from scipy import signal
import time as t

PAIR = info.primary_pair
PERIOD = 30

def initialize():
    storage.reset()
    storage.elapsed = storage.get('elapsed', [0,0,0,0,0,0])

def cumsum_sma(array, period):
    ret = np.cumsum(array, dtype=float)
    ret[period:] = ret[period:] - ret[:-period]
    return ret[period - 1:] / period

def pandas_sma(array, period):
    return pd.rolling_mean(array, period)

def api_sma(array, period):
    # this method is native to Tradewave and does NOT return an array
    return (data[PAIR].ma(PERIOD))

def talib_sma(array, period):
    return ta.MA(array, period)

def convolve_sma(array, period):
    return np.convolve(array, np.ones((period,))/period, mode='valid')

def fftconvolve_sma(array, period):    
    return scipy.signal.fftconvolve(
        array, np.ones((period,))/period, mode='valid')    

def tick():

    close = data[PAIR].warmup_period('close')

    t1 = t.time()
    sma_api = api_sma(close, PERIOD)
    t2 = t.time()
    sma_cumsum = cumsum_sma(close, PERIOD)
    t3 = t.time()
    sma_pandas = pandas_sma(close, PERIOD)
    t4 = t.time()
    sma_talib = talib_sma(close, PERIOD)
    t5 = t.time()
    sma_convolve = convolve_sma(close, PERIOD)
    t6 = t.time()
    sma_fftconvolve = fftconvolve_sma(close, PERIOD)
    t7 = t.time()

    storage.elapsed[-1] = storage.elapsed[-1] + t2-t1
    storage.elapsed[-2] = storage.elapsed[-2] + t3-t2
    storage.elapsed[-3] = storage.elapsed[-3] + t4-t3
    storage.elapsed[-4] = storage.elapsed[-4] + t5-t4
    storage.elapsed[-5] = storage.elapsed[-5] + t6-t5    
    storage.elapsed[-6] = storage.elapsed[-6] + t7-t6        

    plot('sma_api', sma_api)  
    plot('sma_cumsum', sma_cumsum[-5])
    plot('sma_pandas', sma_pandas[-10])
    plot('sma_talib', sma_talib[-15])
    plot('sma_convolve', sma_convolve[-20])    
    plot('sma_fftconvolve', sma_fftconvolve[-25])

def stop():

    log('ticks....: %s' % info.max_ticks)

    log('api......: %.5f' % storage.elapsed[-1])
    log('cumsum...: %.5f' % storage.elapsed[-2])
    log('pandas...: %.5f' % storage.elapsed[-3])
    log('talib....: %.5f' % storage.elapsed[-4])
    log('convolve.: %.5f' % storage.elapsed[-5])    
    log('fft......: %.5f' % storage.elapsed[-6])

结果:

[2015-01-31 23:00:00] ticks....: 744
[2015-01-31 23:00:00] api......: 0.16445
[2015-01-31 23:00:00] cumsum...: 0.03189
[2015-01-31 23:00:00] pandas...: 0.03677
[2015-01-31 23:00:00] talib....: 0.00700  # <<< Winner!
[2015-01-31 23:00:00] convolve.: 0.04871
[2015-01-31 23:00:00] fft......: 0.22306

在此处输入图片说明

Alternatively, there are ready-made modules for Python that calculate moving averages.

in my tests at Tradewave.net TA-lib always wins:

import talib as ta
import numpy as np
import pandas as pd
import scipy
from scipy import signal
import time as t

PAIR = info.primary_pair
PERIOD = 30

def initialize():
    storage.reset()
    storage.elapsed = storage.get('elapsed', [0,0,0,0,0,0])

def cumsum_sma(array, period):
    ret = np.cumsum(array, dtype=float)
    ret[period:] = ret[period:] - ret[:-period]
    return ret[period - 1:] / period

def pandas_sma(array, period):
    return pd.rolling_mean(array, period)

def api_sma(array, period):
    # this method is native to Tradewave and does NOT return an array
    return (data[PAIR].ma(PERIOD))

def talib_sma(array, period):
    return ta.MA(array, period)

def convolve_sma(array, period):
    return np.convolve(array, np.ones((period,))/period, mode='valid')

def fftconvolve_sma(array, period):    
    return scipy.signal.fftconvolve(
        array, np.ones((period,))/period, mode='valid')    

def tick():

    close = data[PAIR].warmup_period('close')

    t1 = t.time()
    sma_api = api_sma(close, PERIOD)
    t2 = t.time()
    sma_cumsum = cumsum_sma(close, PERIOD)
    t3 = t.time()
    sma_pandas = pandas_sma(close, PERIOD)
    t4 = t.time()
    sma_talib = talib_sma(close, PERIOD)
    t5 = t.time()
    sma_convolve = convolve_sma(close, PERIOD)
    t6 = t.time()
    sma_fftconvolve = fftconvolve_sma(close, PERIOD)
    t7 = t.time()

    storage.elapsed[-1] = storage.elapsed[-1] + t2-t1
    storage.elapsed[-2] = storage.elapsed[-2] + t3-t2
    storage.elapsed[-3] = storage.elapsed[-3] + t4-t3
    storage.elapsed[-4] = storage.elapsed[-4] + t5-t4
    storage.elapsed[-5] = storage.elapsed[-5] + t6-t5    
    storage.elapsed[-6] = storage.elapsed[-6] + t7-t6        

    plot('sma_api', sma_api)  
    plot('sma_cumsum', sma_cumsum[-5])
    plot('sma_pandas', sma_pandas[-10])
    plot('sma_talib', sma_talib[-15])
    plot('sma_convolve', sma_convolve[-20])    
    plot('sma_fftconvolve', sma_fftconvolve[-25])

def stop():

    log('ticks....: %s' % info.max_ticks)

    log('api......: %.5f' % storage.elapsed[-1])
    log('cumsum...: %.5f' % storage.elapsed[-2])
    log('pandas...: %.5f' % storage.elapsed[-3])
    log('talib....: %.5f' % storage.elapsed[-4])
    log('convolve.: %.5f' % storage.elapsed[-5])    
    log('fft......: %.5f' % storage.elapsed[-6])

results:

[2015-01-31 23:00:00] ticks....: 744
[2015-01-31 23:00:00] api......: 0.16445
[2015-01-31 23:00:00] cumsum...: 0.03189
[2015-01-31 23:00:00] pandas...: 0.03677
[2015-01-31 23:00:00] talib....: 0.00700  # <<< Winner!
[2015-01-31 23:00:00] convolve.: 0.04871
[2015-01-31 23:00:00] fft......: 0.22306

enter image description here


回答 6

有关即用型解决方案,请参见 https://scipy-cookbook.readthedocs.io/items/SignalSmooth.html 。它提供了带 flat 窗口类型的运行平均。请注意,这比简单的自己动手卷积方法要复杂一些,因为它试图通过反射数据来处理数据开头和结尾处的问题(这在您的情况下可能有效,也可能无效……)。

首先,您可以尝试:

a = np.random.random(100)
plt.plot(a)
b = smooth(a, window='flat')
plt.plot(b)

For a ready-to-use solution, see https://scipy-cookbook.readthedocs.io/items/SignalSmooth.html. It provides running average with the flat window type. Note that this is a bit more sophisticated than the simple do-it-yourself convolve-method, since it tries to handle the problems at the beginning and the end of the data by reflecting it (which may or may not work in your case…).

To start with, you could try:

a = np.random.random(100)
plt.plot(a)
b = smooth(a, window='flat')
plt.plot(b)

回答 7

您可以使用scipy.ndimage.filters.uniform_filter1d

import numpy as np
from scipy.ndimage.filters import uniform_filter1d
N = 1000
x = np.random.random(100000)
y = uniform_filter1d(x, size=N)

uniform_filter1d

  • 输出与输入具有相同的 numpy 形状(即相同的点数)
  • 允许以多种方式处理边界,默认是 'reflect',但就我而言,我更想用 'nearest'

它也相当快(比 np.convolve 快近 50 倍,比上面给出的 cumsum 方法快 2-5 倍):

%timeit y1 = np.convolve(x, np.ones((N,))/N, mode='same')
100 loops, best of 3: 9.28 ms per loop

%timeit y2 = uniform_filter1d(x, size=N)
10000 loops, best of 3: 191 µs per loop

下面是 3 个函数,可用来比较不同实现的误差/速度:

from __future__ import division
import numpy as np
import scipy.ndimage.filters as ndif
def running_mean_convolve(x, N):
    return np.convolve(x, np.ones(N) / float(N), 'valid')
def running_mean_cumsum(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / float(N)
def running_mean_uniform_filter1d(x, N):
    return ndif.uniform_filter1d(x, N, mode='constant', origin=-(N//2))[:-(N-1)]

You can use scipy.ndimage.filters.uniform_filter1d:

import numpy as np
from scipy.ndimage.filters import uniform_filter1d
N = 1000
x = np.random.random(100000)
y = uniform_filter1d(x, size=N)

uniform_filter1d:

  • gives the output with the same numpy shape (i.e. number of points)
  • allows multiple ways to handle the border where 'reflect' is the default, but in my case, I rather wanted 'nearest'

It is also rather quick (nearly 50 times faster than np.convolve and 2-5 times faster than the cumsum approach given above):

%timeit y1 = np.convolve(x, np.ones((N,))/N, mode='same')
100 loops, best of 3: 9.28 ms per loop

%timeit y2 = uniform_filter1d(x, size=N)
10000 loops, best of 3: 191 µs per loop

Here are 3 functions that let you compare the error/speed of the different implementations:

from __future__ import division
import numpy as np
import scipy.ndimage.filters as ndif
def running_mean_convolve(x, N):
    return np.convolve(x, np.ones(N) / float(N), 'valid')
def running_mean_cumsum(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / float(N)
def running_mean_uniform_filter1d(x, N):
    return ndif.uniform_filter1d(x, N, mode='constant', origin=-(N//2))[:-(N-1)]
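
As a quick sanity check (my addition, assuming the three functions above), all of them should agree up to floating point error on a moderate input:

x = np.random.random(10000)
N = 7
r1 = running_mean_convolve(x, N)
r2 = running_mean_cumsum(x, N)
r3 = running_mean_uniform_filter1d(x, N)
print(np.allclose(r1, r3), np.allclose(r2, r3))  # expect: True True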

回答 8

我知道这是一个古老的问题,但这是不使用任何额外数据结构或库的解决方案。它在输入列表中的元素数量上是线性的,我想不出任何其他方法来提高它的效率(实际上,如果有人知道更好的分配结果的方法,请告诉我)。

注意:使用numpy数组而不是列表会更快,但是我想消除所有依赖关系。通过多线程执行还可以提高性能

该函数假定输入列表是一维的,因此要小心。

### Running mean/Moving average
def running_mean(l, N):
    sum = 0
    result = list( 0 for x in l)

    for i in range( 0, N ):
        sum = sum + l[i]
        result[i] = sum / (i+1)

    for i in range( N, len(l) ):
        sum = sum - l[i-N] + l[i]
        result[i] = sum / N

    return result

假设我们有一个列表 data = [ 1, 2, 3, 4, 5, 6 ],我们要在其上计算周期为 3 的滚动平均值,并且还希望输出列表与输入列表大小相同(通常是这种情况)。

第一个元素的索引为 0,因此滚动平均值应基于索引为 -2、-1 和 0 的元素来计算。显然我们没有 data[-2] 和 data[-1](除非您想使用特殊的边界条件),因此我们假设这些元素为 0。这等效于对列表进行零填充,只是我们实际上并不填充,而只需跟踪需要填充的索引(从 0 到 N-1)即可。

因此,对于前N个元素,我们只是将这些元素累加到一个累加器中。

result[0] = (0 + 0 + 1) / 3  = 0.333    ==   (sum + 1) / 3
result[1] = (0 + 1 + 2) / 3  = 1        ==   (sum + 2) / 3
result[2] = (1 + 2 + 3) / 3  = 2        ==   (sum + 3) / 3

从第 N+1 个元素开始,简单的累加就不再有效了。我们期望 result[3] = (2 + 3 + 4)/3 = 3,但这与 (sum + 4)/3 = 3.333 不同。

计算正确值的方法是从 sum + 4 中减去 data[0] = 1,从而得出 sum + 4 - 1 = 9。

之所以如此,是因为当前 sum = data[0] + data[1] + data[2];而且对每个 i >= N 都是如此,因为在做减法之前,sum 都等于 data[i-N] + ... + data[i-2] + data[i-1]。

I know this is an old question, but here is a solution that doesn’t use any extra data structures or libraries. It is linear in the number of elements of the input list and I cannot think of any other way to make it more efficient (actually if anyone knows of a better way to allocate the result, please let me know).

NOTE: this would be much faster using a numpy array instead of a list, but I wanted to eliminate all dependencies. It would also be possible to improve performance by multi-threaded execution

The function assumes that the input list is one dimensional, so be careful.

### Running mean/Moving average
def running_mean(l, N):
    sum = 0
    result = list( 0 for x in l)

    for i in range( 0, N ):
        sum = sum + l[i]
        result[i] = sum / (i+1)

    for i in range( N, len(l) ):
        sum = sum - l[i-N] + l[i]
        result[i] = sum / N

    return result

Example

Assume that we have a list data = [ 1, 2, 3, 4, 5, 6 ] on which we want to compute a rolling mean with period of 3, and that you also want a output list that is the same size of the input one (that’s most often the case).

The first element has index 0, so the rolling mean should be computed on elements of index -2, -1 and 0. Obviously we don’t have data[-2] and data[-1] (unless you want to use special boundary conditions), so we assume that those elements are 0. This is equivalent to zero-padding the list, except we don’t actually pad it, just keep track of the indices that require padding (from 0 to N-1).

So, for the first N elements we just keep adding up the elements in an accumulator.

result[0] = (0 + 0 + 1) / 3  = 0.333    ==   (sum + 1) / 3
result[1] = (0 + 1 + 2) / 3  = 1        ==   (sum + 2) / 3
result[2] = (1 + 2 + 3) / 3  = 2        ==   (sum + 3) / 3

From element N+1 forwards, simple accumulation doesn’t work. We expect result[3] = (2 + 3 + 4)/3 = 3, but this is different from (sum + 4)/3 = 3.333.

The way to compute the correct value is to subtract data[0] = 1 from sum+4, thus giving sum + 4 - 1 = 9.

This happens because currently sum = data[0] + data[1] + data[2], but it is also true for every i >= N because, before the subtraction, sum is data[i-N] + ... + data[i-2] + data[i-1].
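
One detail worth flagging (my observation, not part of the original answer): the code divides the partial sums by i+1, so the first N-1 outputs are plain averages of the values seen so far, not the zero-padded averages used in the walkthrough. Running it on the example data (Python 3) gives:

data = [1, 2, 3, 4, 5, 6]
print(running_mean(data, 3))
# [1.0, 1.5, 2.0, 3.0, 4.0, 5.0]
# The zero-padded scheme described above would instead give 0.333 and 1.0
# for the first two entries; from index N-1 onward the two schemes agree.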


回答 9

我觉得可以用 bottleneck 优雅地解决这个问题

请参见下面的基本示例:

import numpy as np
import bottleneck as bn

a = np.random.randint(4, 1000, size=100)
mm = bn.move_mean(a, window=5, min_count=1)
  • “mm” 是 “a” 的移动平均值。

  • “window” 是计算移动平均时要考虑的最大条目数。

  • “min_count” 是计算移动平均所需的最少条目数(例如针对前几个元素,或数组含有 nan 值的情况)。

好的一点是,bottleneck 有助于处理 nan 值,而且效率很高。

I feel this can be elegantly solved using bottleneck

See basic sample below:

import numpy as np
import bottleneck as bn

a = np.random.randint(4, 1000, size=100)
mm = bn.move_mean(a, window=5, min_count=1)
  • “mm” is the moving mean for “a”.

  • “window” is the max number of entries to consider for moving mean.

  • “min_count” is min number of entries to consider for moving mean (e.g. for first few elements or if the array has nan values).

The good part is Bottleneck helps to deal with nan values and it’s also very efficient.


回答 10

我尚未检查这有多快,但是您可以尝试:

from collections import deque

cache = deque() # keep track of seen values
n = 10          # window size
A = range(100)  # some dummy iterable (use xrange in Python 2)
cum_sum = 0     # initialize cumulative sum

for t, val in enumerate(A, 1):
    cache.append(val)
    cum_sum += val
    if t <= n:
        avg = cum_sum / float(t)
    else:                           # if window is saturated,
        cum_sum -= cache.popleft()  # subtract oldest value
        avg = cum_sum / float(n)

I haven’t yet checked how fast this is, but you could try:

from collections import deque

cache = deque() # keep track of seen values
n = 10          # window size
A = range(100)  # some dummy iterable (use xrange in Python 2)
cum_sum = 0     # initialize cumulative sum

for t, val in enumerate(A, 1):
    cache.append(val)
    cum_sum += val
    if t <= n:
        avg = cum_sum / float(t)
    else:                           # if window is saturated,
        cum_sum -= cache.popleft()  # subtract oldest value
        avg = cum_sum / float(n)
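
As written, the loop computes avg on every iteration but never stores it; a minimal adaptation (mine) that collects the running averages into a list could look like this:

from collections import deque

def running_averages(iterable, n=10):
    cache = deque()  # keep track of seen values
    cum_sum = 0
    out = []
    for t, val in enumerate(iterable, 1):
        cache.append(val)
        cum_sum += val
        if t > n:                       # window saturated: drop the oldest value
            cum_sum -= cache.popleft()
        out.append(cum_sum / float(min(t, n)))
    return out

print(running_averages(range(10), n=3))  # [0.0, 0.5, 1.0, 2.0, 3.0, ...]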

回答 11

该答案包含针对三种不同情况使用Python 标准库的解决方案。


与运行平均 itertools.accumulate

这是一种内存高效的 Python 3.2+ 解决方案,它利用 itertools.accumulate 来计算一个可迭代值序列的运行平均值。

>>> from itertools import accumulate
>>> values = range(100)

请注意,它values可以是任何可迭代的,包括生成器或任何其他动态生成值的对象。

首先,延迟构造值的累加和。

>>> cumu_sum = accumulate(values)

接下来,对累加和做 enumerate(从 1 开始),并构造一个生成器,产生累加值除以当前枚举索引的商。

>>> rolling_avg = (accu/i for i, accu in enumerate(cumu_sum, 1))

如果您需要一次性把所有值放在内存中,可以执行 means = list(rolling_avg);也可以逐次调用 next 来增量获取。
(当然,您也可以用 for 循环遍历 rolling_avg,这会隐式调用 next。)

>>> next(rolling_avg) # 0/1
>>> 0.0
>>> next(rolling_avg) # (0 + 1)/2
>>> 0.5
>>> next(rolling_avg) # (0 + 1 + 2)/3
>>> 1.0

该解决方案可以写成如下函数。

from itertools import accumulate

def rolling_avg(iterable):
    cumu_sum = accumulate(iterable)
    yield from (accu/i for i, accu in enumerate(cumu_sum, 1))
    

一个可以随时向其发送值的协程

该协程会消耗您发送给它的值,并保持到目前为止所见值的运行平均值。

当您没有一个现成的可迭代值序列,而是需要在程序生命周期的不同时刻逐个获取要平均的值时,此方法很有用。

def rolling_avg_coro():
    i = 0
    total = 0.0
    avg = None

    while True:
        next_value = yield avg
        i += 1
        total += next_value
        avg = total/i
        

协程的工作方式如下:

>>> averager = rolling_avg_coro() # instantiate coroutine
>>> next(averager) # get coroutine going (this is called priming)
>>>
>>> averager.send(5) # 5/1
>>> 5.0
>>> averager.send(3) # (5 + 3)/2
>>> 4.0
>>> print('doing something else...')
doing something else...
>>> averager.send(13) # (5 + 3 + 13)/3
>>> 7.0

计算大小为 N 的滑动窗口上的平均值

该生成器函数接受一个可迭代对象和窗口大小 N,并产生窗口内当前各值的平均值。它使用 deque——一种类似列表、但针对两端的快速修改(pop、append)做了优化的数据结构。

from collections import deque
from itertools import islice

def sliding_avg(iterable, N):        
    it = iter(iterable)
    window = deque(islice(it, N))        
    num_vals = len(window)

    if num_vals < N:
        msg = 'window size {} exceeds total number of values {}'
        raise ValueError(msg.format(N, num_vals))

    N = float(N) # force floating point division if using Python 2
    s = sum(window)
    
    while True:
        yield s/N
        try:
            nxt = next(it)
        except StopIteration:
            break
        s = s - window.popleft() + nxt
        window.append(nxt)
        

这是起作用的功能:

>>> values = range(100)
>>> N = 5
>>> window_avg = sliding_avg(values, N)
>>> 
>>> next(window_avg) # (0 + 1 + 2 + 3 + 4)/5
>>> 2.0
>>> next(window_avg) # (1 + 2 + 3 + 4 + 5)/5
>>> 3.0
>>> next(window_avg) # (2 + 3 + 4 + 5 + 6)/5
>>> 4.0

This answer contains solutions using the Python standard library for three different scenarios.


Running average with itertools.accumulate

This is a memory efficient Python 3.2+ solution computing the running average over an iterable of values by leveraging itertools.accumulate.

>>> from itertools import accumulate
>>> values = range(100)

Note that values can be any iterable, including generators or any other object that produces values on the fly.

First, lazily construct the cumulative sum of the values.

>>> cumu_sum = accumulate(values)

Next, enumerate the cumulative sum (starting at 1) and construct a generator that yields the quotient of the accumulated sum and the current enumeration index.

>>> rolling_avg = (accu/i for i, accu in enumerate(cumu_sum, 1))

You can issue means = list(rolling_avg) if you need all the values in memory at once or call next incrementally.
(Of course, you can also iterate over rolling_avg with a for loop, which will call next implicitly.)

>>> next(rolling_avg) # 0/1
>>> 0.0
>>> next(rolling_avg) # (0 + 1)/2
>>> 0.5
>>> next(rolling_avg) # (0 + 1 + 2)/3
>>> 1.0

This solution can be written as a function as follows.

from itertools import accumulate

def rolling_avg(iterable):
    cumu_sum = accumulate(iterable)
    yield from (accu/i for i, accu in enumerate(cumu_sum, 1))
    

A coroutine to which you can send values at any time

This coroutine consumes values you send it and keeps a running average of the values seen so far.

It is useful when you don’t have an iterable of values but acquire the values to be averaged one by one at different times throughout your program’s life.

def rolling_avg_coro():
    i = 0
    total = 0.0
    avg = None

    while True:
        next_value = yield avg
        i += 1
        total += next_value
        avg = total/i
        

The coroutine works like this:

>>> averager = rolling_avg_coro() # instantiate coroutine
>>> next(averager) # get coroutine going (this is called priming)
>>>
>>> averager.send(5) # 5/1
>>> 5.0
>>> averager.send(3) # (5 + 3)/2
>>> 4.0
>>> print('doing something else...')
doing something else...
>>> averager.send(13) # (5 + 3 + 13)/3
>>> 7.0

Computing the average over a sliding window of size N

This generator-function takes an iterable and a window size N and yields the average over the current values inside the window. It uses a deque, which is a datastructure similar to a list, but optimized for fast modifications (pop, append) at both endpoints.

from collections import deque
from itertools import islice

def sliding_avg(iterable, N):        
    it = iter(iterable)
    window = deque(islice(it, N))        
    num_vals = len(window)

    if num_vals < N:
        msg = 'window size {} exceeds total number of values {}'
        raise ValueError(msg.format(N, num_vals))

    N = float(N) # force floating point division if using Python 2
    s = sum(window)
    
    while True:
        yield s/N
        try:
            nxt = next(it)
        except StopIteration:
            break
        s = s - window.popleft() + nxt
        window.append(nxt)
        

Here is the function in action:

>>> values = range(100)
>>> N = 5
>>> window_avg = sliding_avg(values, N)
>>> 
>>> next(window_avg) # (0 + 1 + 2 + 3 + 4)/5
>>> 2.0
>>> next(window_avg) # (1 + 2 + 3 + 4 + 5)/5
>>> 3.0
>>> next(window_avg) # (2 + 3 + 4 + 5 + 6)/5
>>> 4.0

回答 12

聚会来得有点晚,但我写了一个自己的小函数,它既不会把信号在两端环绕,也不会用零填充后再拿这些填充值去求平均。另一个好处是,它还会在线性等距的点上对信号重新采样。可以随意修改代码以获得其他功能。

该方法是与归一化的高斯核做一次简单的矩阵乘法。

def running_mean(y_in, x_in, N_out=101, sigma=1):
    '''
    Returns running mean as a Bell-curve weighted average at evenly spaced
    points. Does NOT wrap signal around, or pad with zeros.

    Arguments:
    y_in -- y values, the values to be smoothed and re-sampled
    x_in -- x values for array

    Keyword arguments:
    N_out -- NoOf elements in resampled array.
    sigma -- 'Width' of Bell-curve in units of param x .
    '''
    N_in = np.size(y_in)

    # Gaussian kernel
    x_out = np.linspace(np.min(x_in), np.max(x_in), N_out)
    x_in_mesh, x_out_mesh = np.meshgrid(x_in, x_out)
    gauss_kernel = np.exp(-np.square(x_in_mesh - x_out_mesh) / (2 * sigma**2))
    # Normalize kernel, such that the sum is one along axis 1
    normalization = np.tile(np.reshape(np.sum(gauss_kernel, axis=1), (N_out, 1)), (1, N_in))
    gauss_kernel_normalized = gauss_kernel / normalization
    # Perform running average as a linear operation
    y_out = gauss_kernel_normalized @ y_in

    return y_out, x_out

具有正态分布噪声的正弦信号的简单用法: 在此处输入图片说明

A bit late to the party, but I’ve made my own little function that does NOT wrap around the ends or pad with zeroes that would then be used to find the average as well. As a further treat, it also re-samples the signal at linearly spaced points. Customize the code at will to get other features.

The method is a simple matrix multiplication with a normalized Gaussian kernel.

def running_mean(y_in, x_in, N_out=101, sigma=1):
    '''
    Returns running mean as a Bell-curve weighted average at evenly spaced
    points. Does NOT wrap signal around, or pad with zeros.

    Arguments:
    y_in -- y values, the values to be smoothed and re-sampled
    x_in -- x values for array

    Keyword arguments:
    N_out -- NoOf elements in resampled array.
    sigma -- 'Width' of Bell-curve in units of param x .
    '''
    N_in = np.size(y_in)

    # Gaussian kernel
    x_out = np.linspace(np.min(x_in), np.max(x_in), N_out)
    x_in_mesh, x_out_mesh = np.meshgrid(x_in, x_out)
    gauss_kernel = np.exp(-np.square(x_in_mesh - x_out_mesh) / (2 * sigma**2))
    # Normalize kernel, such that the sum is one along axis 1
    normalization = np.tile(np.reshape(np.sum(gauss_kernel, axis=1), (N_out, 1)), (1, N_in))
    gauss_kernel_normalized = gauss_kernel / normalization
    # Perform running average as a linear operation
    y_out = gauss_kernel_normalized @ y_in

    return y_out, x_out

A simple usage on a sinusoidal signal with added normal distributed noise: enter image description here


回答 13

与其使用 numpy 或 scipy,我更推荐用 pandas 来更快捷地完成:

df['data'].rolling(3).mean()

这会计算“data”列周期为 3 的移动平均(MA)。您还可以计算带偏移的版本,例如不包含当前单元格的版本(向后偏移一位)可以这样轻松得到:

df['data'].shift(periods=1).rolling(3).mean()

Instead of numpy or scipy, I would recommend pandas to do this more swiftly:

df['data'].rolling(3).mean()

This takes the moving average (MA) of 3 periods of the column “data”. You can also calculate the shifted versions, for example the one that excludes the current cell (shifted one back) can be calculated easily as:

df['data'].shift(periods=1).rolling(3).mean()
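
A concrete run (my sketch) of both calls on a tiny frame, to show the alignment:

import pandas as pd

df = pd.DataFrame({'data': [1, 2, 3, 4, 5]})
print(df['data'].rolling(3).mean().tolist())
# [nan, nan, 2.0, 3.0, 4.0]
print(df['data'].shift(periods=1).rolling(3).mean().tolist())
# [nan, nan, nan, 2.0, 3.0]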

回答 14

在上面某个回答的评论里,mab 提到了这种方法:bottleneck 提供了 move_mean,一个简单的移动平均:

import numpy as np
import bottleneck as bn

a = np.arange(10) + np.random.random(10)

mva = bn.move_mean(a, window=2, min_count=1)

min_count 是一个很方便的参数,它基本上能把移动平均一直计算到数组中的当前位置。如果不设置 min_count,它将等于 window,并且在凑满 window 个点之前的结果都会是 nan。

There is a comment by mab buried in one of the answers above which has this method. bottleneck has move_mean which is a simple moving average:

import numpy as np
import bottleneck as bn

a = np.arange(10) + np.random.random(10)

mva = bn.move_mean(a, window=2, min_count=1)

min_count is a handy parameter that will basically take the moving average up to that point in your array. If you don’t set min_count, it will equal window, and everything up to window points will be nan.
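
A tiny run (my sketch) showing the effect of min_count:

import numpy as np
import bottleneck as bn

a = np.arange(5, dtype=float)                    # [0. 1. 2. 3. 4.]
print(bn.move_mean(a, window=3))                 # [nan nan 1. 2. 3.]
print(bn.move_mean(a, window=3, min_count=1))    # [0.  0.5 1.  2.  3.]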


回答 15

无需使用Numpy,Panda 即可找到移动平均线的另一种方法

import itertools
sample = [2, 6, 10, 8, 11, 10]
list(itertools.starmap(lambda a,b: b/a, 
               enumerate(itertools.accumulate(sample), 1)))

将打印[2.0、4.0、6.0、6.5、7.4、7.833333333333333]

Another approach to find moving average without using numpy, panda

import itertools
sample = [2, 6, 10, 8, 11, 10]
list(itertools.starmap(lambda a,b: b/a, 
               enumerate(itertools.accumulate(sample), 1)))

will print [2.0, 4.0, 6.0, 6.5, 7.4, 7.833333333333333]


回答 16

现在,这个问题甚至比 NeXuS 上个月写它时还要老,但我喜欢他的代码处理边界情况的方式。然而,由于它是“简单移动平均”,其结果滞后于所对应的数据。我认为,通过对基于 convolution() 的方法采用类似的思路,可以用比 NumPy 的 valid、same 和 full 模式更令人满意的方式来处理边界情况。

我的贡献使用中心化的运行平均,使其结果与对应的数据对齐。当可用点数太少、无法使用完整大小的窗口时,会在数组边缘用依次更小的窗口来计算运行平均。[实际上,是用依次更大的窗口,但这是一个实现细节。]

import numpy as np

def running_mean(l, N):
    # Also works for the (strictly invalid) cases when N is even.
    if (N//2)*2 == N:
        N = N - 1
    front = np.zeros(N//2)
    back = np.zeros(N//2)

    for i in range(1, (N//2)*2, 2):
        front[i//2] = np.convolve(l[:i], np.ones((i,))/i, mode='valid')[0]
    for i in range(1, (N//2)*2, 2):
        back[i//2] = np.convolve(l[-i:], np.ones((i,))/i, mode='valid')[0]
    return np.concatenate([front, np.convolve(l, np.ones((N,))/N, mode='valid'), back[::-1]])

它相对较慢,因为它使用了 convolve(),并且很可能可以被真正的 Pythonista 大幅改进。不过,我相信这个思路是成立的。

This question is now even older than when NeXuS wrote about it last month, BUT I like how his code deals with edge cases. However, because it is a “simple moving average,” its results lag behind the data they apply to. I thought that dealing with edge cases in a more satisfying way than NumPy’s modes valid, same, and full could be achieved by applying a similar approach to a convolution() based method.

My contribution uses a central running average to align its results with their data. When there are too few points available for the full-sized window to be used, running averages are computed from successively smaller windows at the edges of the array. [Actually, from successively larger windows, but that’s an implementation detail.]

import numpy as np

def running_mean(l, N):
    # Also works for the (strictly invalid) cases when N is even.
    if (N//2)*2 == N:
        N = N - 1
    front = np.zeros(N//2)
    back = np.zeros(N//2)

    for i in range(1, (N//2)*2, 2):
        front[i//2] = np.convolve(l[:i], np.ones((i,))/i, mode='valid')[0]
    for i in range(1, (N//2)*2, 2):
        back[i//2] = np.convolve(l[-i:], np.ones((i,))/i, mode='valid')[0]
    return np.concatenate([front, np.convolve(l, np.ones((N,))/N, mode='valid'), back[::-1]])

It’s relatively slow because it uses convolve(), and could likely be spruced up quite a lot by a true Pythonista; however, I believe that the idea stands.


回答 17

上面有很多关于计算运行平均值的答案。我的答案增加了两个额外功能:

  1. 忽略nan值
  2. 计算不包括感兴趣值本身的N个相邻值的平均值

第二个特征对于确定哪些值与总体趋势相差一定量特别有用。

我使用numpy.cumsum,因为它是最省时的方法(请参见上面的Alleo回答)。

N=10 # number of points to test on each side of point of interest, best if even
padded_x = np.insert(np.insert( np.insert(x, len(x), np.empty(int(N/2))*np.nan), 0, np.empty(int(N/2))*np.nan ),0,0)
n_nan = np.cumsum(np.isnan(padded_x))
cumsum = np.nancumsum(padded_x) 
window_sum = cumsum[N+1:] - cumsum[:-(N+1)] - x # subtract value of interest from sum of all values within window
window_n_nan = n_nan[N+1:] - n_nan[:-(N+1)] - np.isnan(x)
window_n_values = (N - window_n_nan)
movavg = (window_sum) / (window_n_values)

此代码仅适用于偶数 N。可以通过修改 padded_x 和 n_nan 中的 np.insert 来调整以适用于奇数。

输出示例(黑色为原始,蓝色为movavg): 每个值周围10点的原始数据(黑色)和移动平均值(蓝色),不包括该值。 nan值将被忽略。

可以轻松地修改此代码,以删除所有由少于 cutoff = 3 个非 nan 值计算出的移动平均值。

window_n_values = (N - window_n_nan).astype(float) # dtype must be float to set some values to nan
cutoff = 3
window_n_values[window_n_values<cutoff] = np.nan
movavg = (window_sum) / (window_n_values)

原始数据(黑色)和移动平均值(蓝色),忽略了所有非 nan 值少于 3 个的窗口

There are many answers above about calculating a running mean. My answer adds two extra features:

  1. ignores nan values
  2. calculates the mean for the N neighboring values NOT including the value of interest itself

This second feature is particularly useful for determining which values differ from the general trend by a certain amount.

I use numpy.cumsum since it is the most time-efficient method (see Alleo’s answer above).

N=10 # number of points to test on each side of point of interest, best if even
padded_x = np.insert(np.insert( np.insert(x, len(x), np.empty(int(N/2))*np.nan), 0, np.empty(int(N/2))*np.nan ),0,0)
n_nan = np.cumsum(np.isnan(padded_x))
cumsum = np.nancumsum(padded_x) 
window_sum = cumsum[N+1:] - cumsum[:-(N+1)] - x # subtract value of interest from sum of all values within window
window_n_nan = n_nan[N+1:] - n_nan[:-(N+1)] - np.isnan(x)
window_n_values = (N - window_n_nan)
movavg = (window_sum) / (window_n_values)

This code works for even Ns only. It can be adjusted for odd numbers by changing the np.insert of padded_x and n_nan.

Example output (raw in black, movavg in blue): raw data (black) and moving average (blue) of 10 points around each value, not including that value. nan values are ignored.

This code can be easily adapted to remove all moving average values calculated from fewer than cutoff = 3 non-nan values.

window_n_values = (N - window_n_nan).astype(float) # dtype must be float to set some values to nan
cutoff = 3
window_n_values[window_n_values<cutoff] = np.nan
movavg = (window_sum) / (window_n_values)

raw data (black) and moving average (blue) while ignoring any window with fewer than 3 non-nan values


回答 18

仅使用 Python 标准库(内存高效)

这里再给出一个仅使用标准库 deque 的版本。令我惊讶的是,大多数答案都在用 pandas 或 numpy。

from collections import deque

def moving_average(iterable, n=3):
    d = deque(maxlen=n)
    for i in iterable:
        d.append(i)
        if len(d) == n:
            yield sum(d)/n

r = moving_average([40, 30, 50, 46, 39, 44])
assert list(r) == [40.0, 42.0, 45.0, 43.0]

其实我在python文档中找到了另一个实现

from collections import deque
import itertools

def moving_average(iterable, n=3):
    # moving_average([40, 30, 50, 46, 39, 44]) --> 40.0 42.0 45.0 43.0
    # http://en.wikipedia.org/wiki/Moving_average
    it = iter(iterable)
    d = deque(itertools.islice(it, n-1))
    d.appendleft(0)
    s = sum(d)
    for elem in it:
        s += elem - d.popleft()
        d.append(elem)
        yield s / n

但在我看来,这个实现似乎比应有的要复杂一些。不过它能进入标准 Python 文档必有原因,有人能对比评论一下我的实现和标准文档中的实现吗?(一个值得注意的区别:文档版本对每个元素以 O(1) 更新运行总和,而上面的版本每次产出都对整个窗口重新计算 sum(d),即每个元素 O(n)。)

Use Only Python Standard Library (Memory Efficient)

Just give another version of using the standard library deque only. It’s quite a surprise to me that most of the answers are using pandas or numpy.

from collections import deque

def moving_average(iterable, n=3):
    d = deque(maxlen=n)
    for i in iterable:
        d.append(i)
        if len(d) == n:
            yield sum(d)/n

r = moving_average([40, 30, 50, 46, 39, 44])
assert list(r) == [40.0, 42.0, 45.0, 43.0]

Actually I found another implementation in python docs

from collections import deque
import itertools

def moving_average(iterable, n=3):
    # moving_average([40, 30, 50, 46, 39, 44]) --> 40.0 42.0 45.0 43.0
    # http://en.wikipedia.org/wiki/Moving_average
    it = iter(iterable)
    d = deque(itertools.islice(it, n-1))
    d.appendleft(0)
    s = sum(d)
    for elem in it:
        s += elem - d.popleft()
        d.append(elem)
        yield s / n

However, the implementation seems to me a bit more complex than it should be. But it must be in the standard python docs for a reason; could someone comment on the implementation of mine and the standard doc? (One notable difference: the docs version updates the running sum in O(1) per element, while the version above recomputes sum(d) over the whole window on every yield, i.e. O(n) per element.)


回答 19

借用 @Aikude 的变量,我写了一个单行版本。

import numpy as np

mylist = [1, 2, 3, 4, 5, 6, 7]
N = 3

mean = [np.mean(mylist[x:x+N]) for x in range(len(mylist)-N+1)]
print(mean)

>>> [2.0, 3.0, 4.0, 5.0, 6.0]

With @Aikude’s variables, I wrote a one-liner.

import numpy as np

mylist = [1, 2, 3, 4, 5, 6, 7]
N = 3

mean = [np.mean(mylist[x:x+N]) for x in range(len(mylist)-N+1)]
print(mean)

>>> [2.0, 3.0, 4.0, 5.0, 6.0]

回答 20

尽管这里有针对此问题的解决方案,但请查看我的解决方案。它非常简单并且运行良好。

import numpy as np
dataset = np.asarray([1, 2, 3, 4, 5, 6, 7])
ma = list()
window = 3
for t in range(0, len(dataset)):
    if t+window <= len(dataset):
        indices = range(t, t+window)
        ma.append(np.average(np.take(dataset, indices)))
ma = np.asarray(ma)

Although there are solutions for this question here, please take a look at my solution. It is very simple and working well.

import numpy as np
dataset = np.asarray([1, 2, 3, 4, 5, 6, 7])
ma = list()
window = 3
for t in range(0, len(dataset)):
    if t+window <= len(dataset):
        indices = range(t, t+window)
        ma.append(np.average(np.take(dataset, indices)))
ma = np.asarray(ma)

回答 21

通过阅读其他答案,我认为这并不是问题真正要问的,但我来到这里时,需要的是对一个不断增长的值列表维护运行平均值。

因此,如果您想保存从某处(网站、测量设备等)获取的一系列值,并维护最近 n 个值的更新平均值,可以使用下面的代码,它能最大限度地减少添加新元素的开销:

class Running_Average(object):
    def __init__(self, buffer_size=10):
        """
        Create a new Running_Average object.

        This object allows the efficient calculation of the average of the last
        `buffer_size` numbers added to it.

        Examples
        --------
        >>> a = Running_Average(2)
        >>> a.add(1)
        >>> a.get()
        1.0
        >>> a.add(1)  # there are two 1 in buffer
        >>> a.get()
        1.0
        >>> a.add(2)  # there's a 1 and a 2 in the buffer
        >>> a.get()
        1.5
        >>> a.add(2)
        >>> a.get()  # now there's only two 2 in the buffer
        2.0
        """
        self._buffer_size = int(buffer_size)  # make sure it's an int
        self.reset()

    def add(self, new):
        """
        Add a new number to the buffer, or replaces the oldest one there.
        """
        new = float(new)  # make sure it's a float
        n = len(self._buffer)
        if n < self.buffer_size:  # still have to add numbers to the buffer.
            self._buffer.append(new)
            if self._average != self._average:  # ~ if isNaN().
                self._average = new  # no previous numbers, so it's new.
            else:
                self._average *= n  # so it's only the sum of numbers.
                self._average += new  # add new number.
                self._average /= (n+1)  # divide by new number of numbers.
        else:  # buffer full, replace oldest value.
            old = self._buffer[self._index]  # the previous oldest number.
            self._buffer[self._index] = new  # replace with new one.
            self._index += 1  # update the index and make sure it's...
            self._index %= self.buffer_size  # ... smaller than buffer_size.
            self._average -= old/self.buffer_size  # remove old one...
            self._average += new/self.buffer_size  # ...and add new one...
            # ... weighted by the number of elements.

    def __call__(self):
        """
        Return the moving average value, for the lazy ones who don't want
        to write .get .
        """
        return self._average

    def get(self):
        """
        Return the moving average value.
        """
        return self()

    def reset(self):
        """
        Reset the moving average.

        If for some reason you don't want to just create a new one.
        """
        self._buffer = []  # could use np.empty(self.buffer_size)...
        self._index = 0  # and use this to keep track of how many numbers.
        self._average = float('nan')  # could use np.NaN .

    def get_buffer_size(self):
        """
        Return current buffer_size.
        """
        return self._buffer_size

    def set_buffer_size(self, buffer_size):
        """
        >>> a = Running_Average(10)
        >>> for i in range(15):
        ...     a.add(i)
        ...
        >>> a()
        9.5
        >>> a._buffer  # should not access this!!
        [10.0, 11.0, 12.0, 13.0, 14.0, 5.0, 6.0, 7.0, 8.0, 9.0]

        Decreasing buffer size:
        >>> a.buffer_size = 6
        >>> a._buffer  # should not access this!!
        [9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
        >>> a.buffer_size = 2
        >>> a._buffer
        [13.0, 14.0]

        Increasing buffer size:
        >>> a.buffer_size = 5
        Warning: no older data available!
        >>> a._buffer
        [13.0, 14.0]

        Keeping buffer size:
        >>> a = Running_Average(10)
        >>> for i in range(15):
        ...     a.add(i)
        ...
        >>> a()
        9.5
        >>> a._buffer  # should not access this!!
        [10.0, 11.0, 12.0, 13.0, 14.0, 5.0, 6.0, 7.0, 8.0, 9.0]
        >>> a.buffer_size = 10  # reorders buffer!
        >>> a._buffer
        [5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
        """
        buffer_size = int(buffer_size)
        # order the buffer so index is zero again:
        new_buffer = self._buffer[self._index:]
        new_buffer.extend(self._buffer[:self._index])
        self._index = 0
        if self._buffer_size < buffer_size:
            print('Warning: no older data available!')  # should use Warnings!
        else:
            diff = self._buffer_size - buffer_size
            print(diff)
            new_buffer = new_buffer[diff:]
        self._buffer_size = buffer_size
        self._buffer = new_buffer

    buffer_size = property(get_buffer_size, set_buffer_size)

您可以使用以下示例进行测试:

def graph_test(N=200):
    import matplotlib.pyplot as plt
    values = list(range(N))
    values_average_calculator = Running_Average(N/2)
    values_averages = []
    for value in values:
        values_average_calculator.add(value)
        values_averages.append(values_average_calculator())
    fig, ax = plt.subplots(1, 1)
    ax.plot(values, label='values')
    ax.plot(values_averages, label='averages')
    ax.grid()
    ax.set_xlim(0, N)
    ax.set_ylim(0, N)
    fig.show()

其结果如下:

值及其平均值作为值的函数

From reading the other answers I don’t think this is what the question asked for, but I got here with the need of keeping a running average of a list of values that was growing in size.

So if you want to keep a list of values that you are acquiring from somewhere (a site, a measuring device, etc.) and the average of the last n values updated, you can use the code below, which minimizes the effort of adding new elements:

class Running_Average(object):
    def __init__(self, buffer_size=10):
        """
        Create a new Running_Average object.

        This object allows the efficient calculation of the average of the last
        `buffer_size` numbers added to it.

        Examples
        --------
        >>> a = Running_Average(2)
        >>> a.add(1)
        >>> a.get()
        1.0
        >>> a.add(1)  # there are two 1 in buffer
        >>> a.get()
        1.0
        >>> a.add(2)  # there's a 1 and a 2 in the buffer
        >>> a.get()
        1.5
        >>> a.add(2)
        >>> a.get()  # now there's only two 2 in the buffer
        2.0
        """
        self._buffer_size = int(buffer_size)  # make sure it's an int
        self.reset()

    def add(self, new):
        """
        Add a new number to the buffer, or replaces the oldest one there.
        """
        new = float(new)  # make sure it's a float
        n = len(self._buffer)
        if n < self.buffer_size:  # still have to add numbers to the buffer.
            self._buffer.append(new)
            if self._average != self._average:  # ~ if isNaN().
                self._average = new  # no previous numbers, so it's new.
            else:
                self._average *= n  # so it's only the sum of numbers.
                self._average += new  # add new number.
                self._average /= (n+1)  # divide by new number of numbers.
        else:  # buffer full, replace oldest value.
            old = self._buffer[self._index]  # the previous oldest number.
            self._buffer[self._index] = new  # replace with new one.
            self._index += 1  # update the index and make sure it's...
            self._index %= self.buffer_size  # ... smaller than buffer_size.
            self._average -= old/self.buffer_size  # remove old one...
            self._average += new/self.buffer_size  # ...and add new one...
            # ... weighted by the number of elements.

    def __call__(self):
        """
        Return the moving average value, for the lazy ones who don't want
        to write .get .
        """
        return self._average

    def get(self):
        """
        Return the moving average value.
        """
        return self()

    def reset(self):
        """
        Reset the moving average.

        If for some reason you don't want to just create a new one.
        """
        self._buffer = []  # could use np.empty(self.buffer_size)...
        self._index = 0  # and use this to keep track of how many numbers.
        self._average = float('nan')  # could use np.NaN .

    def get_buffer_size(self):
        """
        Return current buffer_size.
        """
        return self._buffer_size

    def set_buffer_size(self, buffer_size):
        """
        >>> a = Running_Average(10)
        >>> for i in range(15):
        ...     a.add(i)
        ...
        >>> a()
        9.5
        >>> a._buffer  # should not access this!!
        [10.0, 11.0, 12.0, 13.0, 14.0, 5.0, 6.0, 7.0, 8.0, 9.0]

        Decreasing buffer size:
        >>> a.buffer_size = 6
        >>> a._buffer  # should not access this!!
        [9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
        >>> a.buffer_size = 2
        >>> a._buffer
        [13.0, 14.0]

        Increasing buffer size:
        >>> a.buffer_size = 5
        Warning: no older data available!
        >>> a._buffer
        [13.0, 14.0]

        Keeping buffer size:
        >>> a = Running_Average(10)
        >>> for i in range(15):
        ...     a.add(i)
        ...
        >>> a()
        9.5
        >>> a._buffer  # should not access this!!
        [10.0, 11.0, 12.0, 13.0, 14.0, 5.0, 6.0, 7.0, 8.0, 9.0]
        >>> a.buffer_size = 10  # reorders buffer!
        >>> a._buffer
        [5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
        """
        buffer_size = int(buffer_size)
        # order the buffer so index is zero again:
        new_buffer = self._buffer[self._index:]
        new_buffer.extend(self._buffer[:self._index])
        self._index = 0
        if self._buffer_size < buffer_size:
            print('Warning: no older data available!')  # should use Warnings!
        else:
            diff = self._buffer_size - buffer_size
            print(diff)
            new_buffer = new_buffer[diff:]
        self._buffer_size = buffer_size
        self._buffer = new_buffer

    buffer_size = property(get_buffer_size, set_buffer_size)

And you can test it with, for example:

def graph_test(N=200):
    import matplotlib.pyplot as plt
    values = list(range(N))
    values_average_calculator = Running_Average(N/2)
    values_averages = []
    for value in values:
        values_average_calculator.add(value)
        values_averages.append(values_average_calculator())
    fig, ax = plt.subplots(1, 1)
    ax.plot(values, label='values')
    ax.plot(values_averages, label='averages')
    ax.grid()
    ax.set_xlim(0, N)
    ax.set_ylim(0, N)
    fig.show()

Which gives:

Values and their average as a function of values #


回答 22

另一个使用标准库和双端队列的解决方案:

from collections import deque
import itertools

def moving_average(iterable, n=3):
    # http://en.wikipedia.org/wiki/Moving_average
    it = iter(iterable) 
    # create an iterable object from input argument
    d = deque(itertools.islice(it, n-1))  
    # create deque object by slicing iterable
    d.appendleft(0)
    s = sum(d)
    for elem in it:
        s += elem - d.popleft()
        d.append(elem)
        yield s / n

# example on how to use it
for i in  moving_average([40, 30, 50, 46, 39, 44]):
    print(i)

# 40.0
# 42.0
# 45.0
# 43.0

Another solution just using a standard library and deque:

from collections import deque
import itertools

def moving_average(iterable, n=3):
    # http://en.wikipedia.org/wiki/Moving_average
    it = iter(iterable) 
    # create an iterable object from input argument
    d = deque(itertools.islice(it, n-1))  
    # create deque object by slicing iterable
    d.appendleft(0)
    s = sum(d)
    for elem in it:
        s += elem - d.popleft()
        d.append(elem)
        yield s / n

# example on how to use it
for i in  moving_average([40, 30, 50, 46, 39, 44]):
    print(i)

# 40.0
# 42.0
# 45.0
# 43.0

回答 23

出于教育目的,让我添加另外两个Numpy解决方案(比cumsum解决方案要慢):

import numpy as np
from numpy.lib.stride_tricks import as_strided

def ra_strides(arr, window):
    ''' Running average using as_strided'''
    n = arr.shape[0] - window + 1
    arr_strided = as_strided(arr, shape=[n, window], strides=2*arr.strides)
    return arr_strided.mean(axis=1)

def ra_add(arr, window):
    ''' Running average using add.reduceat'''
    n = arr.shape[0] - window + 1
    indices = np.array([0, window]*n) + np.repeat(np.arange(n), 2)
    arr = np.append(arr, 0)
    return np.add.reduceat(arr, indices )[::2]/window

使用的函数:as_stridedadd.reduceat

For educational purposes, let me add two more Numpy solutions (which are slower than the cumsum solution):

import numpy as np
from numpy.lib.stride_tricks import as_strided

def ra_strides(arr, window):
    ''' Running average using as_strided'''
    n = arr.shape[0] - window + 1
    arr_strided = as_strided(arr, shape=[n, window], strides=2*arr.strides)
    return arr_strided.mean(axis=1)

def ra_add(arr, window):
    ''' Running average using add.reduceat'''
    n = arr.shape[0] - window + 1
    indices = np.array([0, window]*n) + np.repeat(np.arange(n), 2)
    arr = np.append(arr, 0)
    return np.add.reduceat(arr, indices )[::2]/window

Functions used: as_strided, add.reduceat


回答 24

上述所有解决方案都不够好,因为它们缺乏

  • 速度——使用了原生 Python 而非 numpy 的向量化实现,
  • 数值稳定性——对 numpy.cumsum 的使用不当,或者
  • 速度——以卷积实现,复杂度为 O(len(x) * w)。

给定

import numpy
m = 10000
x = numpy.random.rand(m)
w = 1000

注意 x_[:w].sum() 等于 x[:w-1].sum()。因此,对于第一个平均值,numpy.cumsum(...) 会加上 x[w] / w(经由 x_[w+1] / w),并减去 0(来自 x_[0] / w)。这就得到 x[0:w].mean()。

经由 cumsum,您会再加上 x[w+1] / w 并减去 x[0] / w 来更新第二个平均值,从而得到 x[1:w+1].mean()。

如此继续,直到得到 x[-w:].mean()。

x_ = numpy.insert(x, 0, 0)
sliding_average = x_[:w].sum() / w + numpy.cumsum(x_[w:] - x_[:-w]) / w

该解决方案是向量化的、O(m) 的、可读的且数值稳定的。

All the aforementioned solutions are poor because they lack

  • speed due to a native python instead of a numpy vectorized implementation,
  • numerical stability due to poor use of numpy.cumsum, or
  • speed due to O(len(x) * w) implementations as convolutions.

Given

import numpy
m = 10000
x = numpy.random.rand(m)
w = 1000

Note that x_[:w].sum() equals x[:w-1].sum(). So for the first average the numpy.cumsum(...) adds x[w] / w (via x_[w+1] / w), and subtracts 0 (from x_[0] / w). This results in x[0:w].mean().

Via cumsum, you’ll update the second average by additionally adding x[w+1] / w and subtracting x[0] / w, resulting in x[1:w+1].mean().

This goes on until x[-w:].mean() is reached.

x_ = numpy.insert(x, 0, 0)
sliding_average = x_[:w].sum() / w + numpy.cumsum(x_[w:] - x_[:-w]) / w

This solution is vectorized, O(m), readable and numerically stable.
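
A quick check (my addition) that the vectorized expression above agrees with a plain 'valid' convolution, up to floating point error:

reference = numpy.convolve(x, numpy.ones(w) / w, mode='valid')
print(numpy.allclose(sliding_average, reference))  # expect: True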


回答 25

用移动平均滤波器怎么样?它同样是一行代码,而且有一个优点:如果您需要矩形以外的窗口,可以轻松更换窗口类型。数组 a 的 N 点简单移动平均:

from scipy.signal import lfilter
lfilter(np.ones(N)/N, [1], a)[N:]

并应用三角形窗口:

lfilter(np.ones(N)*scipy.signal.triang(N)/N, [1], a)[N:]

注意:我通常会把前 N 个样本当作无效值丢弃,因此在末尾加上 [N:],但这并非必需,只是个人选择。

How about a moving average filter? It is also a one-liner and has the advantage that you can easily manipulate the window type if you need something other than the rectangle, i.e. an N-long simple moving average of an array a:

from scipy.signal import lfilter
lfilter(np.ones(N)/N, [1], a)[N:]

And with the triangular window applied:

lfilter(np.ones(N)*scipy.signal.triang(N)/N, [1], a)[N:]

Note: I usually discard the first N samples as bogus, hence the [N:] at the end, but it is not necessary and is a matter of personal choice only.


回答 26

如果您确实选择自己动手实现而不是使用现有的库,请注意浮点误差,并尽量减小其影响:

class SumAccumulator:
    def __init__(self):
        self.values = [0]
        self.count = 0

    def add( self, val ):
        self.values.append( val )
        self.count = self.count + 1
        i = self.count
        while i & 0x01:
            i = i >> 1
            v0 = self.values.pop()
            v1 = self.values.pop()
            self.values.append( v0 + v1 )

    def get_total(self):
        return sum( reversed(self.values) )

    def get_size( self ):
        return self.count

如果您所有的值都在大致相同的数量级,那么通过始终相加数量级大致相近的值,这将有助于保持精度。

If you do choose to roll your own, rather than use an existing library, please be conscious of floating point error and try to minimize its effects:

class SumAccumulator:
    def __init__(self):
        self.values = [0]
        self.count = 0

    def add( self, val ):
        self.values.append( val )
        self.count = self.count + 1
        i = self.count
        while i & 0x01:
            i = i >> 1
            v0 = self.values.pop()
            v1 = self.values.pop()
            self.values.append( v0 + v1 )

    def get_total(self):
        return sum( reversed(self.values) )

    def get_size( self ):
        return self.count

If all your values are roughly the same order of magnitude, then this will help to preserve precision by always adding values of roughly similar magnitudes.
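
A minimal usage sketch (my addition): the accumulator merges equally sized partial sums pairwise, so every addition combines values of comparable magnitude.

acc = SumAccumulator()
for v in [0.1] * 10:
    acc.add(v)
print(acc.get_total(), acc.get_size())  # ~1.0 10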


SciPy和NumPy之间的关系

问题:SciPy和NumPy之间的关系

SciPy 似乎在其自己的命名空间中提供了 NumPy 的大多数(但不是全部[1])函数。换句话说,如果有一个名为 numpy.foo 的函数,那么几乎可以肯定存在一个 scipy.foo。在大多数情况下,两者看起来完全相同,甚至常常指向同一个函数对象。

有时,它们是不同的。举一个最近出现的例子:

  • numpy.log10 是一个 ufunc,对负参数返回 NaN;
  • scipy.log10 对负参数返回复数值,并且似乎不是 ufunc。

对 log、log2 和 logn 也可以这么说,但 log1p 不是[2]。

另一方面,numpy.exp 和 scipy.exp 似乎是同一个 ufunc 的不同名称。scipy.log1p 和 numpy.log1p 的情况也是如此。

另一个例子是 numpy.linalg.solve 与 scipy.linalg.solve。它们相似,但后者比前者提供了一些附加功能。

为什么会出现这种明显的重复?如果这意味着把 numpy 整体导入到 scipy 命名空间,为什么还会有行为上的细微差别和缺失的函数?是否存在某种有助于消除困惑的总体逻辑?

[1] numpy.min、numpy.max、numpy.abs 和其他几个函数在 scipy 命名空间中没有对应项。

[2]使用NumPy 1.5.1和SciPy 0.9.0rc2进行了测试。

SciPy appears to provide most (but not all [1]) of NumPy’s functions in its own namespace. In other words, if there’s a function named numpy.foo, there’s almost certainly a scipy.foo. Most of the time, the two appear to be exactly the same, oftentimes even pointing to the same function object.

Sometimes, they’re different. To give an example that came up recently:

  • numpy.log10 is a ufunc that returns NaNs for negative arguments;
  • scipy.log10 returns complex values for negative arguments and doesn’t appear to be a ufunc.

The same can be said about log, log2 and logn, but not about log1p [2].

On the other hand, numpy.exp and scipy.exp appear to be different names for the same ufunc. This is also true of scipy.log1p and numpy.log1p.

Another example is numpy.linalg.solve vs scipy.linalg.solve. They’re similar, but the latter offers some additional features over the former.

Why the apparent duplication? If this is meant to be a wholesale import of numpy into the scipy namespace, why the subtle differences in behaviour and the missing functions? Is there some overarching logic that would help clear up the confusion?

[1] numpy.min, numpy.max, numpy.abs and a few others have no counterparts in the scipy namespace.

[2] Tested using NumPy 1.5.1 and SciPy 0.9.0rc2.


回答 0

上次我检查时,scipy 的 __init__ 方法会执行

from numpy import *

以便在导入scipy模块时将整个numpy命名空间包含到scipy中。

您描述的 log10 行为很有趣,因为两个版本都来自 numpy。一个是 ufunc,另一个是 numpy.lib 函数。至于为什么 scipy 偏爱库函数而不是 ufunc,我一时也说不上来。


编辑:事实上,我可以回答这个 log10 问题。查看 scipy 的 __init__ 方法,我看到以下内容:

# Import numpy symbols to scipy name space
import numpy as _num
from numpy import oldnumeric
from numpy import *
from numpy.random import rand, randn
from numpy.fft import fft, ifft
from numpy.lib.scimath import *

您在 scipy 中得到的 log10 函数来自 numpy.lib.scimath。查看该模块的代码,其中写道:

"""
Wrapper functions to more user-friendly calling of certain math functions
whose output data-type is different than the input data-type in certain
domains of the input.

For example, for functions like log() with branch cuts, the versions in this
module provide the mathematically valid answers in the complex plane:

>>> import math
>>> from numpy.lib import scimath
>>> scimath.log(-math.exp(1)) == (1+1j*math.pi)
True

Similarly, sqrt(), other base logarithms, power() and trig functions are
correctly handled.  See their respective docstrings for specific examples.
"""

看来该模块用自己的版本覆盖了基础 numpy 中 sqrt、log、log2、logn、log10、power、arccos、arcsin 和 arctanh 这些 ufunc。这就解释了您所看到的行为。至于为何如此设计的根本原因,可能埋在某个邮件列表的帖子里。

Last time I checked it, the scipy __init__ method executes a

from numpy import *

so that the whole numpy namespace is included into scipy when the scipy module is imported.

The log10 behavior you are describing is interesting, because both versions are coming from numpy. One is a ufunc, the other is a numpy.lib function. Why scipy is preferring the library function over the ufunc, I don’t know off the top of my head.


EDIT: In fact, I can answer the log10 question. Looking in the scipy __init__ method I see this:

# Import numpy symbols to scipy name space
import numpy as _num
from numpy import oldnumeric
from numpy import *
from numpy.random import rand, randn
from numpy.fft import fft, ifft
from numpy.lib.scimath import *

The log10 function you get in scipy comes from numpy.lib.scimath. Looking at that code, it says:

"""
Wrapper functions to more user-friendly calling of certain math functions
whose output data-type is different than the input data-type in certain
domains of the input.

For example, for functions like log() with branch cuts, the versions in this
module provide the mathematically valid answers in the complex plane:

>>> import math
>>> from numpy.lib import scimath
>>> scimath.log(-math.exp(1)) == (1+1j*math.pi)
True

Similarly, sqrt(), other base logarithms, power() and trig functions are
correctly handled.  See their respective docstrings for specific examples.
"""

It seems that module overlays the base numpy ufuncs for sqrt, log, log2, logn, log10, power, arccos, arcsin, and arctanh. That explains the behavior you are seeing. The underlying design reason why it is done like that is probably buried in a mailing list post somewhere.
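
The overlay is easy to observe directly; a small sketch (mine, using the scimath module named above):

import numpy as np
import numpy.lib.scimath as scimath

print(np.log10(-100.0))       # nan (with a RuntimeWarning)
print(scimath.log10(-100.0))  # (2+1.3643763538418412j)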


回答 1

从《 SciPy参考指南》中:

…所有的NumPy函数都已包含在scipy命名空间中,因此所有这些函数都可用,而无需另外导入NumPy。

目的是让用户不必知道scipy和numpy命名空间之间的区别,尽管显然您已经发现了一个例外。

From the SciPy Reference Guide:

… all of the Numpy functions have been subsumed into the scipy namespace so that all of those functions are available without additionally importing Numpy.

The intention is for users not to have to know the distinction between the scipy and numpy namespaces, though apparently you’ve found an exception.


回答 2

从SciPy常见问题解答来看,NumPy中的某些功能是出于历史原因而保留的,它们本应只存在于SciPy中:

NumPy和SciPy有什么区别?

在理想情况下,NumPy只会包含数组数据类型和最基本的操作:索引、排序、重塑、基本的逐元素函数等,所有数值代码都将放在SciPy中。然而,NumPy的重要目标之一是兼容性,因此NumPy会尽量保留其任一前身所支持的全部功能。因此,NumPy包含一些线性代数函数,即使它们更应该属于SciPy。无论如何,SciPy都包含功能更完整的线性代数模块以及许多其他数值算法。如果您在用Python进行科学计算,则应该同时安装NumPy和SciPy。大多数新功能会进入SciPy而不是NumPy。

这就解释了为什么scipy.linalg.solve在numpy.linalg.solve之上提供了一些附加功能。

我没有看到SethMMorton对相关问题的回答

It seems from the SciPy FAQ that some functions from NumPy are here for historical reasons while it should only be in SciPy:

What is the difference between NumPy and SciPy?

In an ideal world, NumPy would contain nothing but the array data type and the most basic operations: indexing, sorting, reshaping, basic elementwise functions, et cetera. All numerical code would reside in SciPy. However, one of NumPy’s important goals is compatibility, so NumPy tries to retain all features supported by either of its predecessors. Thus NumPy contains some linear algebra functions, even though these more properly belong in SciPy. In any case, SciPy contains more fully-featured versions of the linear algebra modules, as well as many other numerical algorithms. If you are doing scientific computing with python, you should probably install both NumPy and SciPy. Most new features belong in SciPy rather than NumPy.

That explains why scipy.linalg.solve offers some additional features over numpy.linalg.solve.

I did not see the answer of SethMMorton to the related question
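
As a concrete illustration of one such extra feature, here is a minimal sketch; it assumes SciPy >= 0.19, where the old sym_pos flag (still visible in the listings in the next answer) was replaced by assume_a, and numpy.linalg.solve has no equivalent switch:

import numpy as np
from scipy.linalg import solve

a = np.array([[4.0, 1.0],
              [1.0, 3.0]])   # symmetric positive definite
b = np.array([1.0, 2.0])

# Declaring the structure lets SciPy dispatch to LAPACK's Cholesky-based
# ?posv routine instead of the generic LU-based ?gesv:
x = solve(a, b, assume_a='pos')
print(np.allclose(a @ x, b))  # True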


回答 3

SciPy文档的简介末尾有一段简短的评论:

另一个有用的命令是source。当给定一个用Python编写的函数作为参数时,它会打印出该函数的源代码清单。这有助于学习算法,或准确了解函数对其参数做了什么。另外,不要忘记Python命令dir,它可用于查看模块或包的命名空间。

我认为,这可以让对所有相关软件包有足够了解的人准确地弄清某些scipy函数和numpy函数之间的差异(不过它对log10问题完全没有帮助)。我肯定不具备这些知识,但source确实表明,scipy.linalg.solve和numpy.linalg.solve以不同的方式与LAPACK交互:

Python 2.4.3 (#1, May  5 2011, 18:44:23) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
>>> import scipy
>>> import scipy.linalg
>>> import numpy
>>> scipy.source(scipy.linalg.solve)
In file: /usr/lib64/python2.4/site-packages/scipy/linalg/basic.py

def solve(a, b, sym_pos=0, lower=0, overwrite_a=0, overwrite_b=0,
          debug = 0):
    """ solve(a, b, sym_pos=0, lower=0, overwrite_a=0, overwrite_b=0) -> x

    Solve a linear system of equations a * x = b for x.

    Inputs:

      a -- An N x N matrix.
      b -- An N x nrhs matrix or N vector.
      sym_pos -- Assume a is symmetric and positive definite.
      lower -- Assume a is lower triangular, otherwise upper one.
               Only used if sym_pos is true.
      overwrite_y - Discard data in y, where y is a or b.

    Outputs:

      x -- The solution to the system a * x = b
    """
    a1, b1 = map(asarray_chkfinite,(a,b))
    if len(a1.shape) != 2 or a1.shape[0] != a1.shape[1]:
        raise ValueError, 'expected square matrix'
    if a1.shape[0] != b1.shape[0]:
        raise ValueError, 'incompatible dimensions'
    overwrite_a = overwrite_a or (a1 is not a and not hasattr(a,'__array__'))
    overwrite_b = overwrite_b or (b1 is not b and not hasattr(b,'__array__'))
    if debug:
        print 'solve:overwrite_a=',overwrite_a
        print 'solve:overwrite_b=',overwrite_b
    if sym_pos:
        posv, = get_lapack_funcs(('posv',),(a1,b1))
        c,x,info = posv(a1,b1,
                        lower = lower,
                        overwrite_a=overwrite_a,
                        overwrite_b=overwrite_b)
    else:
        gesv, = get_lapack_funcs(('gesv',),(a1,b1))
        lu,piv,x,info = gesv(a1,b1,
                             overwrite_a=overwrite_a,
                             overwrite_b=overwrite_b)

    if info==0:
        return x
    if info>0:
        raise LinAlgError, "singular matrix"
    raise ValueError,\
          'illegal value in %-th argument of internal gesv|posv'%(-info)

>>> scipy.source(numpy.linalg.solve)
In file: /usr/lib64/python2.4/site-packages/numpy/linalg/linalg.py

def solve(a, b):
    """
    Solve the equation ``a x = b`` for ``x``.

    Parameters
    ----------
    a : array_like, shape (M, M)
        Input equation coefficients.
    b : array_like, shape (M,)
        Equation target values.

    Returns
    -------
    x : array, shape (M,)

    Raises
    ------
    LinAlgError
        If `a` is singular or not square.

    Examples
    --------
    Solve the system of equations ``3 * x0 + x1 = 9`` and ``x0 + 2 * x1 = 8``:

    >>> a = np.array([[3,1], [1,2]])
    >>> b = np.array([9,8])
    >>> x = np.linalg.solve(a, b)
    >>> x
    array([ 2.,  3.])

    Check that the solution is correct:

    >>> (np.dot(a, x) == b).all()
    True

    """
    a, _ = _makearray(a)
    b, wrap = _makearray(b)
    one_eq = len(b.shape) == 1
    if one_eq:
        b = b[:, newaxis]
    _assertRank2(a, b)
    _assertSquareness(a)
    n_eq = a.shape[0]
    n_rhs = b.shape[1]
    if n_eq != b.shape[0]:
        raise LinAlgError, 'Incompatible dimensions'
    t, result_t = _commonType(a, b)
#    lapack_routine = _findLapackRoutine('gesv', t)
    if isComplexType(t):
        lapack_routine = lapack_lite.zgesv
    else:
        lapack_routine = lapack_lite.dgesv
    a, b = _fastCopyAndTranspose(t, a, b)
    pivots = zeros(n_eq, fortran_int)
    results = lapack_routine(n_eq, n_rhs, a, n_eq, pivots, b, n_eq, 0)
    if results['info'] > 0:
        raise LinAlgError, 'Singular matrix'
    if one_eq:
        return wrap(b.ravel().astype(result_t))
    else:
        return wrap(b.transpose().astype(result_t))

这也是我的第一篇帖子,如果有什么地方需要修改,请告诉我。

There is a short comment at the end of the introduction to SciPy documentation:

Another useful command is source. When given a function written in Python as an argument, it prints out a listing of the source code for that function. This can be helpful in learning about an algorithm or understanding exactly what a function is doing with its arguments. Also don’t forget about the Python command dir which can be used to look at the namespace of a module or package.

I think this will allow someone with enough knowledge of all the packages involved to pick apart exactly what the differences are between some scipy and numpy functions (it didn’t help me with the log10 question at all). I definitely don’t have that knowledge but source does indicate that scipy.linalg.solve and numpy.linalg.solve interact with lapack in different ways;

Python 2.4.3 (#1, May  5 2011, 18:44:23) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
>>> import scipy
>>> import scipy.linalg
>>> import numpy
>>> scipy.source(scipy.linalg.solve)
In file: /usr/lib64/python2.4/site-packages/scipy/linalg/basic.py

def solve(a, b, sym_pos=0, lower=0, overwrite_a=0, overwrite_b=0,
          debug = 0):
    """ solve(a, b, sym_pos=0, lower=0, overwrite_a=0, overwrite_b=0) -> x

    Solve a linear system of equations a * x = b for x.

    Inputs:

      a -- An N x N matrix.
      b -- An N x nrhs matrix or N vector.
      sym_pos -- Assume a is symmetric and positive definite.
      lower -- Assume a is lower triangular, otherwise upper one.
               Only used if sym_pos is true.
      overwrite_y - Discard data in y, where y is a or b.

    Outputs:

      x -- The solution to the system a * x = b
    """
    a1, b1 = map(asarray_chkfinite,(a,b))
    if len(a1.shape) != 2 or a1.shape[0] != a1.shape[1]:
        raise ValueError, 'expected square matrix'
    if a1.shape[0] != b1.shape[0]:
        raise ValueError, 'incompatible dimensions'
    overwrite_a = overwrite_a or (a1 is not a and not hasattr(a,'__array__'))
    overwrite_b = overwrite_b or (b1 is not b and not hasattr(b,'__array__'))
    if debug:
        print 'solve:overwrite_a=',overwrite_a
        print 'solve:overwrite_b=',overwrite_b
    if sym_pos:
        posv, = get_lapack_funcs(('posv',),(a1,b1))
        c,x,info = posv(a1,b1,
                        lower = lower,
                        overwrite_a=overwrite_a,
                        overwrite_b=overwrite_b)
    else:
        gesv, = get_lapack_funcs(('gesv',),(a1,b1))
        lu,piv,x,info = gesv(a1,b1,
                             overwrite_a=overwrite_a,
                             overwrite_b=overwrite_b)

    if info==0:
        return x
    if info>0:
        raise LinAlgError, "singular matrix"
    raise ValueError,\
          'illegal value in %-th argument of internal gesv|posv'%(-info)

>>> scipy.source(numpy.linalg.solve)
In file: /usr/lib64/python2.4/site-packages/numpy/linalg/linalg.py

def solve(a, b):
    """
    Solve the equation ``a x = b`` for ``x``.

    Parameters
    ----------
    a : array_like, shape (M, M)
        Input equation coefficients.
    b : array_like, shape (M,)
        Equation target values.

    Returns
    -------
    x : array, shape (M,)

    Raises
    ------
    LinAlgError
        If `a` is singular or not square.

    Examples
    --------
    Solve the system of equations ``3 * x0 + x1 = 9`` and ``x0 + 2 * x1 = 8``:

    >>> a = np.array([[3,1], [1,2]])
    >>> b = np.array([9,8])
    >>> x = np.linalg.solve(a, b)
    >>> x
    array([ 2.,  3.])

    Check that the solution is correct:

    >>> (np.dot(a, x) == b).all()
    True

    """
    a, _ = _makearray(a)
    b, wrap = _makearray(b)
    one_eq = len(b.shape) == 1
    if one_eq:
        b = b[:, newaxis]
    _assertRank2(a, b)
    _assertSquareness(a)
    n_eq = a.shape[0]
    n_rhs = b.shape[1]
    if n_eq != b.shape[0]:
        raise LinAlgError, 'Incompatible dimensions'
    t, result_t = _commonType(a, b)
#    lapack_routine = _findLapackRoutine('gesv', t)
    if isComplexType(t):
        lapack_routine = lapack_lite.zgesv
    else:
        lapack_routine = lapack_lite.dgesv
    a, b = _fastCopyAndTranspose(t, a, b)
    pivots = zeros(n_eq, fortran_int)
    results = lapack_routine(n_eq, n_rhs, a, n_eq, pivots, b, n_eq, 0)
    if results['info'] > 0:
        raise LinAlgError, 'Singular matrix'
    if one_eq:
        return wrap(b.ravel().astype(result_t))
    else:
        return wrap(b.transpose().astype(result_t))

This is also my first post so if I should change something here please let me know.


回答 4

从Wikipedia(http://en.wikipedia.org/wiki/NumPy#History):

Numeric的代码经过改造,变得更易于维护,也足够灵活,可以实现Numarray的新颖特性。这个新项目最初是SciPy的一部分。为了避免用户只为获得一个数组对象而安装整个软件包,这个新软件包被拆分出来,称为NumPy。

为了方便起见,scipy依赖numpy,并把许多numpy函数导入到自己的命名空间中。

From Wikipedia ( http://en.wikipedia.org/wiki/NumPy#History ):

The Numeric code was adapted to make it more maintainable and flexible enough to implement the novel features of Numarray. This new project was part of SciPy. To avoid installing a whole package just to get an array object, this new package was separated and called NumPy.

scipy depends on numpy and imports many numpy functions into its namespace for convenience.


回答 5

关于linalg软件包:scipy的函数会调用LAPACK和BLAS,它们在许多平台上都有高度优化的版本,性能非常好,尤其是在相当大的稠密矩阵上运算时。另一方面,这些库并不容易编译,需要Fortran编译器和许多平台相关的调整才能发挥全部性能。因此,numpy自带了许多常见线性代数函数的简单实现,对很多用途来说通常已经足够好。

Regarding the linalg package – the scipy functions will call lapack and blas, which are available in highly optimised versions on many platforms and offer very good performance, particularly for operations on reasonably large dense matrices. On the other hand, they are not easy libraries to compile, requiring a fortran compiler and many platform specific tweaks to get full performance. Therefore, numpy provides simple implementations of many common linear algebra functions which are often good enough for many purposes.


回答 6

摘自《定量经济学》(Quantitative Economics)讲座:

SciPy是一个软件包,其中包含使用NumPy构建的各种工具,这些工具使用其数组数据类型和相关功能

实际上,当我们导入SciPy时,我们也会得到NumPy,这可以从SciPy初始化文件中看到

# Import numpy symbols to scipy name space
import numpy as _num
linalg = None
from numpy import *
from numpy.random import rand, randn
from numpy.fft import fft, ifft
from numpy.lib.scimath import *

__all__  = []
__all__ += _num.__all__
__all__ += ['randn', 'rand', 'fft', 'ifft']

del _num
# Remove the linalg imported from numpy so that the scipy.linalg package can be
# imported.
del linalg
__all__.remove('linalg')

但是,显式使用NumPy功能是更常见和更好的做法

import numpy as np

a = np.identity(3)

在SciPy中有用的是其子包中的功能

  • scipy.optimize,scipy.integrate,scipy.stats等。

From Lectures on ‘Quantitative Economics’:

SciPy is a package that contains various tools that are built on top of NumPy, using its array data type and related functionality

In fact, when we import SciPy we also get NumPy, as can be seen from the SciPy initialization file

# Import numpy symbols to scipy name space
import numpy as _num
linalg = None
from numpy import *
from numpy.random import rand, randn
from numpy.fft import fft, ifft
from numpy.lib.scimath import *

__all__  = []
__all__ += _num.__all__
__all__ += ['randn', 'rand', 'fft', 'ifft']

del _num
# Remove the linalg imported from numpy so that the scipy.linalg package can be
# imported.
del linalg
__all__.remove('linalg')

However, it’s more common and better practice to use NumPy functionality explicitly

import numpy as np

a = np.identity(3)

What is useful in SciPy is the functionality in its subpackages

  • scipy.optimize, scipy.integrate, scipy.stats, etc.

回答 7

除了SciPy FAQ中所说的“重复主要是为了向后兼容”之外,NumPy文档还进一步阐明:

可选的SciPy加速例程(numpy.dual)

可能被SciPy加速的函数的别名。

可以将SciPy构建为对FFT、线性代数和特殊函数使用加速的或以其他方式改进的库。该模块使开发人员可以在SciPy可用时透明地使用这些加速函数,同时仍然支持只安装了NumPy的用户。

为简便起见,这些是:

  • 线性代数
  • 快速傅立叶变换
  • 第一类修正贝塞尔函数(0阶)

另外,从SciPy教程中

SciPy的顶层还包含NumPy和numpy.lib.scimath中的函数。但是,最好直接从NumPy模块中使用它们。

因此,在新的应用程序中,对于在SciPy顶层重复出现的数组操作,您应该首选其NumPy版本。对于上面列出的几个领域,则应首选SciPy中的版本,并在必要时检查NumPy中的向后兼容情况。

以我的个人经验,我使用的大多数数组函数都位于NumPy的顶层(random除外)。而所有特定领域的例程都位于SciPy的子包中,因此我很少使用SciPy顶层的任何东西。

In addition to the SciPy FAQ describing the duplication is mainly for backwards compatibility, it is further clarified in the NumPy documentation to say that

Optionally SciPy-accelerated routines (numpy.dual)

Aliases for functions which may be accelerated by Scipy.

SciPy can be built to use accelerated or otherwise improved libraries for FFTs, linear algebra, and special functions. This module allows developers to transparently support these accelerated functions when SciPy is available but still support users who have only installed NumPy.

For brevity, these are:

  • Linear algebra
  • FFT
  • The Modified Bessel function of the first kind, order 0

Also, from the SciPy Tutorial:

The top level of SciPy also contains functions from NumPy and numpy.lib.scimath. However, it is better to use them directly from the NumPy module instead.

So, for new applications, you should prefer the NumPy version of the array operations that are duplicated in the top level of SciPy. For the domains listed above, you should prefer those in SciPy and check backward compatibility if necessary in NumPy.

In my personal experience, most of the array functions I use exist in the top level of NumPy (except for random). However, all the domain specific routines exist in subpackages of SciPy, so I rarely use anything from the top level of SciPy.
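
A minimal sketch of the numpy.dual mechanism described above. Note that this is historical: numpy.dual was deprecated in NumPy 1.20 and removed in later releases, so the snippet assumes an older NumPy where the module still exists:

import numpy as np
from numpy import dual  # deprecated in NumPy 1.20, removed later

a = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# dual.inv resolves to scipy.linalg.inv when SciPy is installed,
# and silently falls back to numpy.linalg.inv otherwise.
print(dual.inv(a))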


在Python中读取.mat文件

问题:在Python中读取.mat文件

是否可以在Python中读取二进制MATLAB .mat文件?

我已经看到SciPy声称支持读取.mat文件,但我没有成功。我安装的是SciPy 0.7.0版,却找不到loadmat()方法。

Is it possible to read binary MATLAB .mat files in Python?

I’ve seen that SciPy has alleged support for reading .mat files, but I’m unsuccessful with it. I installed SciPy version 0.7.0, and I can’t find the loadmat() method.


回答 0

需要先导入scipy.io:

import scipy.io
mat = scipy.io.loadmat('file.mat')

An import is required, import scipy.io

import scipy.io
mat = scipy.io.loadmat('file.mat')

回答 1

scipy.io.savemat和scipy.io.loadmat都无法处理7.3版的MATLAB数组。但好消息是,MATLAB 7.3版文件本身就是HDF5数据集,因此可以用许多工具(包括NumPy)读取它们。

对于Python,您将需要h5py扩展,该扩展在系统上需要HDF5。

import numpy as np
import h5py
f = h5py.File('somefile.mat','r')
data = f.get('data/variable1')
data = np.array(data) # For converting to a NumPy array

Neither scipy.io.savemat, nor scipy.io.loadmat work for MATLAB arrays version 7.3. But the good part is that MATLAB version 7.3 files are hdf5 datasets. So they can be read using a number of tools, including NumPy.

For Python, you will need the h5py extension, which requires HDF5 on your system.

import numpy as np
import h5py
f = h5py.File('somefile.mat','r')
data = f.get('data/variable1')
data = np.array(data) # For converting to a NumPy array
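
If you don’t know the variable names in advance, here is a minimal sketch for discovering them first (it assumes the same hypothetical somefile.mat; h5py File objects behave like nested dicts):

import h5py

with h5py.File('somefile.mat', 'r') as f:
    f.visit(print)          # print the path of every group/dataset in the file
    print(list(f.keys()))   # top-level variable names only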

回答 2

首先将.mat文件另存为:

save('test.mat', '-v7')

之后,在Python中,使用通常的loadmat函数:

import scipy.io as sio
test = sio.loadmat('test.mat')

First save the .mat file as:

save('test.mat', '-v7')

After that, in Python, use the usual loadmat function:

import scipy.io as sio
test = sio.loadmat('test.mat')

回答 3

有一个很好的软件包mat4py,可以很容易地用以下命令安装:

pip install mat4py

使用起来很简单(从网站上):

从MAT文件加载数据

函数loadmat会把MAT文件中存储的所有变量加载到一个简单的Python数据结构中,只使用Python的dict和list对象。数值数组和单元格数组会被转换为按行排列的嵌套列表。数组会被压缩(squeeze),以消除只含一个元素的数组。最终的数据结构由与JSON格式兼容的简单类型组成。

示例:将MAT文件加载到Python数据结构中:

from mat4py import loadmat

data = loadmat('datafile.mat')

变量datadict带有MAT文件中包含的变量和值的a 。

将Python数据结构保存到MAT文件

可以使用savemat函数将Python数据保存到MAT文件中。数据的结构要求与loadmat的输出相同,也就是说,它应由dict、list、str、int和float等简单数据类型组成。

示例:将Python数据结构保存到MAT文件中:

from mat4py import savemat

savemat('datafile.mat', data)

参数data应为包含这些变量的dict。

There is a nice package called mat4py which can easily be installed using

pip install mat4py

It is straightforward to use (from the website):

Load data from a MAT-file

The function loadmat loads all variables stored in the MAT-file into a simple Python data structure, using only Python’s dict and list objects. Numeric and cell arrays are converted to row-ordered nested lists. Arrays are squeezed to eliminate arrays with only one element. The resulting data structure is composed of simple types that are compatible with the JSON format.

Example: Load a MAT-file into a Python data structure:

from mat4py import loadmat

data = loadmat('datafile.mat')

The variable data is a dict with the variables and values contained in the MAT-file.

Save a Python data structure to a MAT-file

Python data can be saved to a MAT-file, with the function savemat. Data has to be structured in the same way as for loadmat, i.e. it should be composed of simple data types, like dict, list, str, int, and float.

Example: Save a Python data structure to a MAT-file:

from mat4py import savemat

savemat('datafile.mat', data)

The parameter data shall be a dict with the variables.


回答 4

安装了MATLAB 2014b或更高版本后,可以使用适用于PythonMATLAB引擎

import matlab.engine
eng = matlab.engine.start_matlab()
content = eng.load("example.mat", nargout=1)

Having MATLAB 2014b or newer installed, the MATLAB engine for Python could be used:

import matlab.engine
eng = matlab.engine.start_matlab()
content = eng.load("example.mat", nargout=1)

回答 5

读取文件

import scipy.io
mat = scipy.io.loadmat(file_name)

检查MAT变量的类型

print(type(mat))
#OUTPUT - <class 'dict'>

字典中的键是MATLAB变量名,对应的值则是赋给这些变量的对象。

Reading the file

import scipy.io
mat = scipy.io.loadmat(file_name)

Inspecting the type of MAT variable

print(type(mat))
#OUTPUT - <class 'dict'>

The keys inside the dictionary are MATLAB variables, and the values are the objects assigned to those variables.
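
Building on the dict above, a minimal sketch that skips the metadata entries loadmat adds and inspects what each MATLAB variable became (assuming mat was loaded as shown):

# loadmat adds '__header__', '__version__' and '__globals__' entries;
# everything else is a MATLAB variable, usually loaded as a NumPy array.
for key, value in mat.items():
    if not key.startswith('__'):
        print(key, type(value), getattr(value, 'shape', None))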


回答 6

MathWorks自己也提供了用于Python的MATLAB引擎。如果您有MATLAB,这可能值得考虑(我自己还没有尝试过,但它的功能远不止读取MATLAB文件)。不过,我不知道是否允许把它分发给其他用户(如果那些人也有MATLAB,可能就不是问题;否则,也许NumPy才是正确的选择?)。

另外,如果您想完全自己动手,MathWorks提供了有关该文件格式结构的详细文档(如果链接失效,请尝试用google搜索matfile_format.pdf或其标题MAT-FILE Format)。它并不像我个人原先想象的那么复杂,但显然这不是最简单的路线;工作量还取决于您想支持.mat文件的多少特性。

我编写了一个“小型”(约700行)Python脚本,它可以读取一些基本的.mat文件。我既不是Python专家也不是初学者,用了大约两天时间(参考上面链接的MathWorks文档)写完它。我学到了很多新东西,而且(大部分时间)相当有趣。由于这个Python脚本是我在工作中写的,恐怕无法公开……但我可以在这里给出一些建议:

  • 首先阅读文档。
  • 使用十六进制编辑器(例如HxD),查看一个您想解析的.mat参考文件。
  • 尝试把字节保存到.txt文件并为每一行加注释,以弄清每个字节的含义。
  • 使用类来保存每个数据元素(如miCOMPRESSED、miMATRIX、mxDOUBLE或miINT32)
  • .mat文件的结构非常适合用树形数据结构来保存各数据元素;每个节点对应一个类,并拥有子节点

There is also the MATLAB Engine for Python by MathWorks itself. If you have MATLAB, this might be worth considering (I haven’t tried it myself but it has a lot more functionality than just reading MATLAB files). However, I don’t know if it is allowed to distribute it to other users (it is probably not a problem if those persons have MATLAB. Otherwise, maybe NumPy is the right way to go?).

Also, if you want to do all the basics yourself, MathWorks provides (if the link changes, try to google for matfile_format.pdf or its title MAT-FILE Format) a detailed documentation on the structure of the file format. It’s not as complicated as I personally thought, but obviously, this is not the easiest way to go. It also depends on how many features of the .mat-files you want to support.

I’ve written a “small” (about 700 lines) Python script which can read some basic .mat-files. I’m neither a Python expert nor a beginner and it took me about two days to write it (using the MathWorks documentation linked above). I’ve learned a lot of new stuff and it was quite fun (most of the time). As I’ve written the Python script at work, I’m afraid I cannot publish it… But I can give some advice here:

  • First read the documentation.
  • Use a hex editor (such as HxD) and look into a reference .mat-file you want to parse.
  • Try to figure out the meaning of each byte by saving the bytes to a .txt file and annotate each line.
  • Use classes to save each data element (such as miCOMPRESSED, miMATRIX, mxDOUBLE, or miINT32)
  • The .mat-files’ structure is optimal for saving the data elements in a tree data structure; each node has one class and subnodes

回答 7

from os.path import dirname, join as pjoin
import scipy.io as sio
data_dir = pjoin(dirname(sio.__file__), 'matlab', 'tests', 'data')
mat_fname = pjoin(data_dir, 'testdouble_7.4_GLNX86.mat')
mat_contents = sio.loadmat(mat_fname)

您可以使用上面的代码在Python中读取默认保存的.mat文件。

from os.path import dirname, join as pjoin
import scipy.io as sio
data_dir = pjoin(dirname(sio.__file__), 'matlab', 'tests', 'data')
mat_fname = pjoin(data_dir, 'testdouble_7.4_GLNX86.mat')
mat_contents = sio.loadmat(mat_fname)

You can use above code to read the default saved .mat file in Python.


按列对NumPy中的数组排序

问题:按列对NumPy中的数组排序

如何按第n列对NumPy中的数组排序?

例如,

a = array([[9, 2, 3],
           [4, 5, 6],
           [7, 0, 5]])

我想按第二列对行进行排序,以便返回:

array([[7, 0, 5],
       [9, 2, 3],
       [4, 5, 6]])

How can I sort an array in NumPy by the nth column?

For example,

a = array([[9, 2, 3],
           [4, 5, 6],
           [7, 0, 5]])

I’d like to sort rows by the second column, such that I get back:

array([[7, 0, 5],
       [9, 2, 3],
       [4, 5, 6]])

回答 0

@steve的答案实际上是最优雅的做法。

对于“正确”的方式,请参见numpy.ndarray.sort的order关键字参数。

但是,您需要将数组视为具有字段的数组(结构化数组)。

如果您最初没有使用字段定义数组,那么“正确”的方法就很难看了。

作为一个简单的示例,对其进行排序并返回副本:

In [1]: import numpy as np

In [2]: a = np.array([[1,2,3],[4,5,6],[0,0,1]])

In [3]: np.sort(a.view('i8,i8,i8'), order=['f1'], axis=0).view(np.int64)
Out[3]: 
array([[0, 0, 1],
       [1, 2, 3],
       [4, 5, 6]])

对其进行原位排序:

In [6]: a.view('i8,i8,i8').sort(order=['f1'], axis=0) #<-- returns None

In [7]: a
Out[7]: 
array([[0, 0, 1],
       [1, 2, 3],
       [4, 5, 6]])

据我所知,@Steve的方法确实是最优雅的做法……

这种(结构化数组)方法的唯一优点是,“order”参数是一个字段列表,用来指定排序所依据的字段。例如,提供order=['f1','f2','f0']即可先按第二列、再按第三列、最后按第一列排序。

@steve‘s is actually the most elegant way of doing it.

For the “correct” way see the order keyword argument of numpy.ndarray.sort

However, you’ll need to view your array as an array with fields (a structured array).

The “correct” way is quite ugly if you didn’t initially define your array with fields…

As a quick example, to sort it and return a copy:

In [1]: import numpy as np

In [2]: a = np.array([[1,2,3],[4,5,6],[0,0,1]])

In [3]: np.sort(a.view('i8,i8,i8'), order=['f1'], axis=0).view(np.int64)
Out[3]: 
array([[0, 0, 1],
       [1, 2, 3],
       [4, 5, 6]])

To sort it in-place:

In [6]: a.view('i8,i8,i8').sort(order=['f1'], axis=0) #<-- returns None

In [7]: a
Out[7]: 
array([[0, 0, 1],
       [1, 2, 3],
       [4, 5, 6]])

@Steve’s really is the most elegant way to do it, as far as I know…

The only advantage to this method is that the “order” argument is a list of the fields to sort by. For example, you can sort by the second column, then the third column, then the first column by supplying order=['f1','f2','f0'].
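
A minimal sketch of that multi-field order idea (it assumes a C-contiguous int64 array, so the 'i8,i8,i8' view is valid):

import numpy as np

a = np.array([[9, 2, 3],
              [4, 5, 6],
              [7, 2, 5]], dtype=np.int64)

# Sort in place by column 1, breaking ties with column 2:
a.view('i8,i8,i8').sort(order=['f1', 'f2'], axis=0)
print(a)
# [[9 2 3]
#  [7 2 5]
#  [4 5 6]]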


回答 1

我想这可行: a[a[:,1].argsort()]

这会取出a的第二列,并据此对a的各行进行排序。

I suppose this works: a[a[:,1].argsort()]

This takes the second column of a and sorts the rows of a based on it.


回答 2

您可以沿用Steve Tjoa的方法对多列排序:使用mergesort这样的稳定排序,并按从最不重要到最重要的列依次用索引排序:

a = a[a[:,2].argsort()] # First sort doesn't need to be stable.
a = a[a[:,1].argsort(kind='mergesort')]
a = a[a[:,0].argsort(kind='mergesort')]

排序方式为:第0列,然后是1,然后是2。

You can sort on multiple columns as per Steve Tjoa’s method by using a stable sort like mergesort and sorting the indices from the least significant to the most significant columns:

a = a[a[:,2].argsort()] # First sort doesn't need to be stable.
a = a[a[:,1].argsort(kind='mergesort')]
a = a[a[:,0].argsort(kind='mergesort')]

This sorts by column 0, then 1, then 2.


回答 3

根据Python文档wiki,我认为您可以这样做:

a = [[1, 2, 3], [4, 5, 6], [0, 0, 1]]
a = sorted(a, key=lambda a_entry: a_entry[1])
print(a)

输出为:

[[0, 0, 1], [1, 2, 3], [4, 5, 6]]

From the Python documentation wiki, I think you can do:

a = [[1, 2, 3], [4, 5, 6], [0, 0, 1]]
a = sorted(a, key=lambda a_entry: a_entry[1])
print(a)

The output is:

[[0, 0, 1], [1, 2, 3], [4, 5, 6]]

回答 4

如果有人想在他们程序的关键部分使用排序,下面是对不同提案的性能比较:

import numpy as np
table = np.random.rand(5000, 10)

%timeit table.view('f8,f8,f8,f8,f8,f8,f8,f8,f8,f8').sort(order=['f9'], axis=0)
1000 loops, best of 3: 1.88 ms per loop

%timeit table[table[:,9].argsort()]
10000 loops, best of 3: 180 µs per loop

import pandas as pd
df = pd.DataFrame(table)
%timeit df.sort_values(9, ascending=True)
1000 loops, best of 3: 400 µs per loop

因此,似乎使用argsort进行索引是迄今为止最快的方法…

In case someone wants to make use of sorting at a critical part of their programs here’s a performance comparison for the different proposals:

import numpy as np
table = np.random.rand(5000, 10)

%timeit table.view('f8,f8,f8,f8,f8,f8,f8,f8,f8,f8').sort(order=['f9'], axis=0)
1000 loops, best of 3: 1.88 ms per loop

%timeit table[table[:,9].argsort()]
10000 loops, best of 3: 180 µs per loop

import pandas as pd
df = pd.DataFrame(table)
%timeit df.sort_values(9, ascending=True)
1000 loops, best of 3: 400 µs per loop

So, it looks like indexing with argsort is the quickest method so far…


回答 5

来自NumPy邮件列表,这是另一种解决方案:

>>> a
array([[1, 2],
       [0, 0],
       [1, 0],
       [0, 2],
       [2, 1],
       [1, 0],
       [1, 0],
       [0, 0],
       [1, 0],
       [2, 2]])
>>> a[np.lexsort(np.fliplr(a).T)]
array([[0, 0],
       [0, 0],
       [0, 2],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 2],
       [2, 1],
       [2, 2]])

From the NumPy mailing list, here’s another solution:

>>> a
array([[1, 2],
       [0, 0],
       [1, 0],
       [0, 2],
       [2, 1],
       [1, 0],
       [1, 0],
       [0, 0],
       [1, 0],
       [2, 2]])
>>> a[np.lexsort(np.fliplr(a).T)]
array([[0, 0],
       [0, 0],
       [0, 2],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 2],
       [2, 1],
       [2, 2]])

回答 6

我有一个类似的问题。

我的问题:

我想计算SVD,需要按降序对特征值排序,但又想保留特征值与特征向量之间的对应关系。我的特征值位于第一行,对应的特征向量位于其下方的同一列中。

因此,我想按第一行的值对二维数组的各列进行降序排序。

我的解决方案

a = a[::, a[0,].argsort()[::-1]]

那么这是如何工作的呢?

a[0,] 只是我要排序的第一行。

现在,我使用argsort来获取索引的顺序。

我用[::-1]是因为我需要降序排列。

最后,我使用a[::, ...]得到一个列按正确顺序排列的视图。

I had a similar problem.

My Problem:

I want to calculate an SVD and need to sort my eigenvalues in descending order. But I want to keep the mapping between eigenvalues and eigenvectors. My eigenvalues were in the first row and the corresponding eigenvector below it in the same column.

So I want to sort a two-dimensional array column-wise by the first row in descending order.

My Solution

a = a[::, a[0,].argsort()[::-1]]

So how does this work?

a[0,] is just the first row I want to sort by.

Now I use argsort to get the order of indices.

I use [::-1] because I need descending order.

Lastly I use a[::, ...] to get a view with the columns in the right order.
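
A minimal sketch of this column-wise descending sort, on a hypothetical 2x3 array whose first row plays the role of the eigenvalues:

import numpy as np

a = np.array([[3.0, 9.0, 1.0],    # row 0: the values to sort by, descending
              [0.3, 0.9, 0.1]])   # row 1: entries that must follow their column

a = a[::, a[0, ].argsort()[::-1]]
print(a)
# [[9.  3.  1. ]
#  [0.9 0.3 0.1]]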


回答 7

一个稍微复杂一点的lexsort例子:第一列降序,第二列升序。lexsort的窍门在于它按行排序(因此需要.T),并且优先考虑最后一行。

In [120]: b=np.array([[1,2,1],[3,1,2],[1,1,3],[2,3,4],[3,2,5],[2,1,6]])
In [121]: b
Out[121]: 
array([[1, 2, 1],
       [3, 1, 2],
       [1, 1, 3],
       [2, 3, 4],
       [3, 2, 5],
       [2, 1, 6]])
In [122]: b[np.lexsort(([1,-1]*b[:,[1,0]]).T)]
Out[122]: 
array([[3, 1, 2],
       [3, 2, 5],
       [2, 1, 6],
       [2, 3, 4],
       [1, 1, 3],
       [1, 2, 1]])

A little more complicated lexsort example – descending on the 1st column, secondarily ascending on the 2nd. The tricks with lexsort are that it sorts on rows (hence the .T), and gives priority to the last row.

In [120]: b=np.array([[1,2,1],[3,1,2],[1,1,3],[2,3,4],[3,2,5],[2,1,6]])
In [121]: b
Out[121]: 
array([[1, 2, 1],
       [3, 1, 2],
       [1, 1, 3],
       [2, 3, 4],
       [3, 2, 5],
       [2, 1, 6]])
In [122]: b[np.lexsort(([1,-1]*b[:,[1,0]]).T)]
Out[122]: 
array([[3, 1, 2],
       [3, 2, 5],
       [2, 1, 6],
       [2, 3, 4],
       [1, 1, 3],
       [1, 2, 1]])

回答 8

这是考虑所有列的另一种解决方案(J.J的答案的更紧凑形式):

ar=np.array([[0, 0, 0, 1],
             [1, 0, 1, 0],
             [0, 1, 0, 0],
             [1, 0, 0, 1],
             [0, 0, 1, 0],
             [1, 1, 0, 0]])

用lexsort排序,

ar[np.lexsort(([ar[:, i] for i in range(ar.shape[1]-1, -1, -1)]))]

输出:

array([[0, 0, 0, 1],
       [0, 0, 1, 0],
       [0, 1, 0, 0],
       [1, 0, 0, 1],
       [1, 0, 1, 0],
       [1, 1, 0, 0]])

Here is another solution considering all columns (a more compact version of J.J‘s answer):

ar=np.array([[0, 0, 0, 1],
             [1, 0, 1, 0],
             [0, 1, 0, 0],
             [1, 0, 0, 1],
             [0, 0, 1, 0],
             [1, 1, 0, 0]])

Sort with lexsort,

ar[np.lexsort(([ar[:, i] for i in range(ar.shape[1]-1, -1, -1)]))]

Output:

array([[0, 0, 0, 1],
       [0, 0, 1, 0],
       [0, 1, 0, 0],
       [1, 0, 0, 1],
       [1, 0, 1, 0],
       [1, 1, 0, 0]])

回答 9

直接使用sorted,并指定要作为排序依据的列号即可。

import numpy as np

a = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
print(a)
a = a.tolist()
a = np.array(sorted(a, key=lambda a_entry: a_entry[0]))
print(a)

Simply use sorted, with the column number you want to sort by.

import numpy as np

a = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
print(a)
a = a.tolist()
a = np.array(sorted(a, key=lambda a_entry: a_entry[0]))
print(a)

回答 10

这是一个古老的问题,但是如果您需要将其推广到2维以上的数组,则可以采用以下解决方案:

np.einsum('ij->ij', a[a[:,1].argsort(),:])

对于二维数组来说,这有些大材小用;按照@steve的答案,a[a[:,1].argsort()]就足够了,但那个答案无法推广到更高的维度。您可以在此问题中找到3D数组的示例。

输出:

[[7 0 5]
 [9 2 3]
 [4 5 6]]

It is an old question, but if you need to generalize this to arrays with more than 2 dimensions, here is a solution that can be easily generalized:

np.einsum('ij->ij', a[a[:,1].argsort(),:])

This is overkill for two dimensions, and a[a[:,1].argsort()] would be enough per @steve’s answer; however, that answer cannot be generalized to higher dimensions. You can find an example of a 3D array in this question.

Output:

[[7 0 5]
 [9 2 3]
 [4 5 6]]
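
For completeness, a minimal sketch of the higher-dimensional case using np.take_along_axis, an alternative to the einsum trick (it assumes NumPy >= 1.15 and a hypothetical 3-D array):

import numpy as np

a = np.random.randint(0, 10, size=(4, 5, 3))

# Sort the rows of each 5x3 slab by that slab's own column 1:
order = a[:, :, 1].argsort(axis=1)                          # shape (4, 5)
sorted_a = np.take_along_axis(a, order[:, :, None], axis=1)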

如何将CSV数据读入NumPy中的记录数组?

问题:如何将CSV数据读入NumPy中的记录数组?

我想知道是否有一种直接的方法可以把CSV文件的内容导入记录数组(record array),就像R的read.table()、read.delim()和read.csv()系列函数把数据导入R的数据框那样?

还是说,最好的方法是使用csv.reader(),然后再应用类似numpy.core.records.fromrecords()的函数?

I wonder if there is a direct way to import the contents of a CSV file into a record array, much in the way that R’s read.table(), read.delim(), and read.csv() family imports data to R’s data frame?

Or is the best way to use csv.reader() and then apply something like numpy.core.records.fromrecords()?


回答 0

您可以使用NumPy的genfromtxt()方法,把delimiter关键字参数设置为逗号即可。

from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')

有关该功能的更多信息,请参见其相应的文档

You can use Numpy’s genfromtxt() method to do so, by setting the delimiter kwarg to a comma.

from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')

More information on the function can be found at its respective documentation.


回答 1

我推荐pandas库中的read_csv函数:

import pandas as pd
df=pd.read_csv('myfile.csv', sep=',',header=None)
df.values
array([[ 1. ,  2. ,  3. ],
       [ 4. ,  5.5,  6. ]])

这会得到一个pandas DataFrame,它提供许多有用的数据操作功能,而这些功能是numpy记录数组无法直接提供的。

DataFrame是带标签的二维数据结构,各列的类型可以不同。您可以将其视为电子表格或SQL表……


我也推荐genfromtxt。但是,由于该问题要的是记录数组而不是普通数组,因此需要在genfromtxt调用中添加dtype=None参数:

给定一个输入文件,myfile.csv

1.0, 2, 3
4, 5.5, 6

import numpy as np
np.genfromtxt('myfile.csv',delimiter=',')

给出一个数组:

array([[ 1. ,  2. ,  3. ],
       [ 4. ,  5.5,  6. ]])

np.genfromtxt('myfile.csv',delimiter=',',dtype=None)

给出一个记录数组:

array([(1.0, 2.0, 3), (4.0, 5.5, 6)], 
      dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<i4')])

这样的优点是可以轻松导入具有多种数据类型(包括字符串)的文件。

I would recommend the read_csv function from the pandas library:

import pandas as pd
df=pd.read_csv('myfile.csv', sep=',',header=None)
df.values
array([[ 1. ,  2. ,  3. ],
       [ 4. ,  5.5,  6. ]])

This gives a pandas DataFrame – allowing many useful data manipulation functions which are not directly available with numpy record arrays.

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table…


I would also recommend genfromtxt. However, since the question asks for a record array, as opposed to a normal array, the dtype=None parameter needs to be added to the genfromtxt call:

Given an input file, myfile.csv:

1.0, 2, 3
4, 5.5, 6

import numpy as np
np.genfromtxt('myfile.csv',delimiter=',')

gives an array:

array([[ 1. ,  2. ,  3. ],
       [ 4. ,  5.5,  6. ]])

and

np.genfromtxt('myfile.csv',delimiter=',',dtype=None)

gives a record array:

array([(1.0, 2.0, 3), (4.0, 5.5, 6)], 
      dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<i4')])

This has the advantage that files with multiple data types (including strings) can be easily imported.
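
If you start from pandas but really do need a record array, a minimal sketch of the bridge (assuming the same myfile.csv as above):

import pandas as pd

df = pd.read_csv('myfile.csv', header=None)
rec = df.to_records(index=False)   # a numpy.recarray with one field per column
print(rec.dtype)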


回答 2

我对下面两种方法进行了计时:

from numpy import genfromtxt
genfromtxt(fname = dest_file, dtype = (<whatever options>))

与

import csv
import numpy as np
with open(dest_file,'r') as dest_f:
    data_iter = csv.reader(dest_f,
                           delimiter = delimiter,
                           quotechar = '"')
    data = [data for data in data_iter]
data_array = np.asarray(data, dtype = <whatever options>)

在约460万行、约70列的数据上,NumPy路径花费了2分16秒,而csv加列表推导的方法只花费了13秒。

我推荐csv加列表推导的方法,因为它很可能主要依赖预编译的库,而不像NumPy那样依赖解释器。我猜pandas方法也会有类似的解释器开销。

I timed the

from numpy import genfromtxt
genfromtxt(fname = dest_file, dtype = (<whatever options>))

versus

import csv
import numpy as np
with open(dest_file,'r') as dest_f:
    data_iter = csv.reader(dest_f,
                           delimiter = delimiter,
                           quotechar = '"')
    data = [data for data in data_iter]
data_array = np.asarray(data, dtype = <whatever options>)

on 4.6 million rows with about 70 columns and found that the NumPy path took 2 min 16 secs and the csv-list comprehension method took 13 seconds.

I would recommend the csv-list comprehension method, as it most likely relies on pre-compiled libraries and not on the interpreter as much as NumPy. I suspect the pandas method would have similar interpreter overhead.


回答 3

您也可以尝试recfromcsv(),它可以猜测数据类型并返回格式正确的记录数组。

You can also try recfromcsv() which can guess data types and return a properly formatted record array.
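
A minimal sketch, assuming a hypothetical data.csv whose first row is a header (note that recfromcsv was deprecated and then removed in NumPy 2.0, so an older NumPy is required):

import numpy as np

# recfromcsv treats the first row as column names by default
# and guesses each column's dtype.
rec = np.recfromcsv('data.csv', delimiter=',')
print(rec.dtype)        # one named field per column
print(rec['some_col'])  # 'some_col' is a hypothetical header name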


回答 4

我用NumPy和pandas两种方式都试过之后,发现使用pandas有很多优点:

  • 更快
  • CPU使用率更低
  • RAM使用量仅为NumPy genfromtxt的1/3

这是我的测试代码:

$ for f in test_pandas.py test_numpy_csv.py ; do  /usr/bin/time python $f; done
2.94user 0.41system 0:03.05elapsed 109%CPU (0avgtext+0avgdata 502068maxresident)k
0inputs+24outputs (0major+107147minor)pagefaults 0swaps

23.29user 0.72system 0:23.72elapsed 101%CPU (0avgtext+0avgdata 1680888maxresident)k
0inputs+0outputs (0major+416145minor)pagefaults 0swaps

test_numpy_csv.py

from numpy import genfromtxt
train = genfromtxt('/home/hvn/me/notebook/train.csv', delimiter=',')

test_pandas.py

from pandas import read_csv
df = read_csv('/home/hvn/me/notebook/train.csv')

资料档案:

du -h ~/me/notebook/train.csv
 59M    /home/hvn/me/notebook/train.csv

所使用的NumPy和pandas版本:

$ pip freeze | egrep -i 'pandas|numpy'
numpy==1.13.3
pandas==0.20.2

As I tried both ways using NumPy and Pandas, using pandas has a lot of advantages:

  • Faster
  • Less CPU usage
  • 1/3 RAM usage compared to NumPy genfromtxt

This is my test code:

$ for f in test_pandas.py test_numpy_csv.py ; do  /usr/bin/time python $f; done
2.94user 0.41system 0:03.05elapsed 109%CPU (0avgtext+0avgdata 502068maxresident)k
0inputs+24outputs (0major+107147minor)pagefaults 0swaps

23.29user 0.72system 0:23.72elapsed 101%CPU (0avgtext+0avgdata 1680888maxresident)k
0inputs+0outputs (0major+416145minor)pagefaults 0swaps

test_numpy_csv.py

from numpy import genfromtxt
train = genfromtxt('/home/hvn/me/notebook/train.csv', delimiter=',')

test_pandas.py

from pandas import read_csv
df = read_csv('/home/hvn/me/notebook/train.csv')

Data file:

du -h ~/me/notebook/train.csv
 59M    /home/hvn/me/notebook/train.csv

With NumPy and pandas at versions:

$ pip freeze | egrep -i 'pandas|numpy'
numpy==1.13.3
pandas==0.20.2

回答 5

您可以使用以下代码将CSV文件数据读入数组:

import numpy as np
csv = np.genfromtxt('test.csv', delimiter=",")
print(csv)

You can use this code to send CSV file data into an array:

import numpy as np
csv = np.genfromtxt('test.csv', delimiter=",")
print(csv)

回答 6

使用 numpy.loadtxt

一个非常简单的方法,但它要求所有元素都是数值(float、int等)。

import numpy as np 
data = np.loadtxt('c:\\1.csv',delimiter=',',skiprows=0)  

Using numpy.loadtxt

A quite simple method, but it requires all the elements to be numeric (float, int, and so on).

import numpy as np 
data = np.loadtxt('c:\\1.csv',delimiter=',',skiprows=0)  

回答 7

这是最简单的方法:

import csv

with open('testfile.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))

现在,data中的每个条目都是一条记录,表示为一个数组。因此您就得到了一个二维数组。这为我节省了很多时间。

This is the easiest way:

import csv

with open('testfile.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))

Now each entry in data is a record, represented as an array. So you have a 2D array. It saved me so much time.


回答 8

我尝试了这个:

import pandas as p
import numpy as n

closingValue = p.read_csv("<FILENAME>", usecols=[4], dtype=float)
print(closingValue)

I tried this:

import pandas as p
import numpy as n

closingValue = p.read_csv("<FILENAME>", usecols=[4], dtype=float)
print(closingValue)

回答 9

我建议使用tables(pip3 install tables)。您可以使用pandas(pip3 install pandas)把.csv文件保存为.h5:

import pandas as pd
data = pd.read_csv("dataset.csv")
store = pd.HDFStore('dataset.h5')
store['mydata'] = data
store.close()

然后,即使数据量巨大,您也可以轻松且快速地把数据加载回NumPy数组:

import pandas as pd
store = pd.HDFStore('dataset.h5')
data = store['mydata']
store.close()

# Data in NumPy format
data = data.values

I would suggest using tables (pip3 install tables). You can save your .csv file to .h5 using pandas (pip3 install pandas),

import pandas as pd
data = pd.read_csv("dataset.csv")
store = pd.HDFStore('dataset.h5')
store['mydata'] = data
store.close()

You can then easily, and in less time even for huge amounts of data, load your data into a NumPy array.

import pandas as pd
store = pd.HDFStore('dataset.h5')
data = store['mydata']
store.close()

# Data in NumPy format
data = data.values

回答 10

这个方法非常好用……

import csv
with open("data.csv", 'r') as f:
    data = list(csv.reader(f, delimiter=";"))

import numpy as np
data = np.array(data, dtype=float)  # np.float was removed in NumPy 1.24; use the builtin float

This works like a charm…

import csv
with open("data.csv", 'r') as f:
    data = list(csv.reader(f, delimiter=";"))

import numpy as np
data = np.array(data, dtype=float)  # np.float was removed in NumPy 1.24; use the builtin float

mlcourse.ai – 开放机器学习课程


mlcourse.ai是一门开放的机器学习课程,由OpenDataScience (ods.ai)社区主办,主讲人是Yury Kashnitsky (yorko)。Yury拥有应用数学博士学位,是Kaggle竞赛大师,他的目标是设计一门理论与实践完美平衡的ML课程。因此,你既会在课程中复习数学公式,也会通过Kaggle Inclass竞赛进行练习。目前,该课程处于自定进度模式。请查看详细的Roadmap,它将引导您完成自定进度的学习。

附加内容:您还可以购买包含mlcourse.ai最佳非演示任务的附加作业包。请选择“Bonus Assignments”档位,并在mlcourse.ai主页上查看详情。

镜像(仅英文🇬🇧):mlcourse.ai(主站点)、Kaggle Dataset(包含同样的Kaggle笔记本)

自定进度计划

这份Roadmap将指导您完成为期11周的mlcourse.ai课程。从pandas到梯度提升,每周都会指明要阅读哪些文章、观看哪些讲座、完成哪些作业。

内容

以下是发表在medium.com上的文章列表🇬🇧,同时给出habr.com上的俄文版🇷🇺和中文笔记本🇨🇳,以及指向Kaggle笔记本(英文)的链接。图标均可点击。

  1. 用pandas进行探索性数据分析🇬🇧🇷🇺🇨🇳,Kaggle Notebook
  2. 用Python进行可视化数据分析🇬🇧🇷🇺🇨🇳,Kaggle笔记本:part1、part2
  3. 分类、决策树和k近邻🇬🇧🇷🇺🇨🇳,Kaggle Notebook
  4. 线性分类与回归🇬🇧🇷🇺🇨🇳,Kaggle笔记本:part1、part2、part3、part4、part5
  5. Bagging与随机森林🇬🇧🇷🇺🇨🇳,Kaggle笔记本:part1、part2、part3
  6. 特征工程与特征选择🇬🇧🇷🇺🇨🇳,Kaggle Notebook
  7. 无监督学习:主成分分析与聚类🇬🇧🇷🇺🇨🇳,Kaggle Notebook
  8. Vowpal Wabbit:在GB级数据上学习🇬🇧🇷🇺🇨🇳,Kaggle Notebook
  9. 用Python进行时间序列分析,第一部分🇬🇧🇷🇺🇨🇳;使用Facebook Prophet预测未来,第二部分🇬🇧🇨🇳;Kaggle笔记本:part1、part2
  10. 梯度提升🇬🇧🇷🇺🇨🇳,Kaggle Notebook

讲座

视频已上传到这个YouTube播放列表。课程简介:video、slides。

  1. 用pandas进行探索性数据分析,video
  2. 可视化,EDA的主要图表,video
  3. 决策树:theory、practical part
  4. Logistic回归:theoretical foundations、practical part(《爱丽丝》比赛中的基线)
  5. 集成方法与随机森林 - part 1;分类指标 - part 2;预测客户付款的业务任务示例 - part 3
  6. 线性回归和正则化 - theory;Lasso与Ridge、LTV预测 - practice
  7. 无监督学习 - Principal Component Analysis、Clustering
  8. 用于分类和回归的随机梯度下降 - part 1,第2部分TBA
  9. 用Python进行时间序列分析(ARIMA、Prophet)- video
  10. 梯度提升:基本思想 - part 1;XGBoost、LightGBM和CatBoost背后的关键理念与实践 - part 2

作业

以下是演示作业。此外,在“Bonus Assignments”档位中您还可以获得非演示作业。

  1. 用pandas进行探索性数据分析,nbviewer,Kaggle Notebook,solution
  2. 分析心血管疾病数据,nbviewer,Kaggle Notebook,solution
  3. 带有玩具任务和UCI成人数据集的决策树,nbviewer,Kaggle Notebook,solution
  4. 讽刺检测,Kaggle Notebook,solution;线性回归作为一个最优化问题,nbviewer,Kaggle Notebook
  5. Logistic回归和随机森林在信用评分问题中的应用,nbviewer,Kaggle Notebook,solution
  6. 在回归任务中探索OLS、LASSO和随机森林,nbviewer,Kaggle Notebook,solution
  7. 无监督学习,nbviewer,Kaggle Notebook,solution
  8. 实现在线回归,nbviewer,Kaggle Notebook,solution
  9. 时间序列分析,nbviewer,Kaggle Notebook,solution
  10. 在比赛中超越基线,Kaggle Notebook

Kaggle竞赛

  1. 如果可以,请抓住我:通过网页会话跟踪检测入侵者。Kaggle Inclass
  2. Dota 2获胜者预测。Kaggle Inclass

引用mlcourse.ai

如果您在自己的工作中引用了mlcourse.ai,可以使用以下BibTeX记录:

@misc{mlcourse_ai,
    author = {Kashnitsky, Yury},
    title = {mlcourse.ai – Open Machine Learning Course},
    year = {2020},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/Yorko/mlcourse.ai}},
}

社区

讨论在OpenDataScience (ods.ai) Slack团队的#mlcourse_ai频道中进行。

课程是免费的,但您可以通过在Patreon上按月支持或在Ko-fi上一次性付款来支持组织者。
