Tag Archives: scipy

Installing SciPy and NumPy using pip

Question: Installing SciPy and NumPy using pip

I’m trying to create required libraries in a package I’m distributing. It requires both the SciPy and NumPy libraries. While developing, I installed both using

apt-get install scipy

which installed SciPy 0.9.0 and NumPy 1.5.1, and it worked fine.

I would like to do the same using pip install, in order to be able to specify dependencies in the setup.py of my own package.

The problem is, when I try:

pip install 'numpy==1.5.1'

it works fine.

But then

pip install 'scipy==0.9.0'

fails miserably, with

raise self.notfounderror(self.notfounderror.__doc__)

numpy.distutils.system_info.BlasNotFoundError:

Blas (http://www.netlib.org/blas/) libraries not found.

Directories to search for the libraries can be specified in the

numpy/distutils/site.cfg file (section [blas]) or by setting

the BLAS environment variable.

How do I get it to work?


Answer 0

I am assuming Linux experience in my answer; I found that there are three prerequisites to getting pip install scipy to proceed nicely.

Go here: Installing SciPy

Follow the instructions to download, build and export the env variable for BLAS and then LAPACK. Be careful to not just blindly cut’n’paste the shell commands – there will be a few lines you need to select depending on your architecture, etc., and you’ll need to fix/add the correct directories that it incorrectly assumes as well.

The third thing you may need is to yum install numpy-f2py or the equivalent.

Oh, yes and lastly, you may need to yum install gcc-gfortran as the libraries above are Fortran source.


Answer 1

This worked for me on Ubuntu 14.04:

sudo apt-get install libblas-dev liblapack-dev libatlas-base-dev gfortran
pip install scipy

Answer 2

You need the libblas and liblapack dev packages if you are using Ubuntu.

aptitude install libblas-dev liblapack-dev
pip install scipy

Answer 3

Since the previous instructions for installing with yum are broken, here are updated instructions for installing on something like Fedora. I’ve tested this on “Amazon Linux AMI 2016.03”:

sudo yum install atlas-devel lapack-devel blas-devel libgfortran
pip install scipy

Answer 4

I was working on a project that depended on numpy and scipy, in a clean installation of Fedora 23, using a Python virtual environment for Python 3.4 (this also worked for Python 2.7), and with the following in my setup.py (inside the setup() call):

setup_requires=[
    'numpy',
],
install_requires=[
    'numpy',
    'scipy',
],

I found I had to run the following to get pip install -e . to work:

pip install --upgrade pip

and

sudo dnf install atlas-devel gcc-{c++,gfortran} subversion redhat-rpm-config

The redhat-rpm-config is needed because SciPy’s build uses redhat-hardened-cc1 as opposed to the regular cc1.
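
For context, a minimal setup.py built around those two arguments might look like the following sketch (the package name and version are placeholders, not taken from the answer above):

from setuptools import setup

setup(
    name='mypackage',   # hypothetical package name
    version='0.1',
    setup_requires=['numpy'],
    install_requires=['numpy', 'scipy'],
)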


Answer 5

On Windows with Python 3.5, I managed to install scipy by using conda rather than pip:

conda install scipy

Answer 6

What operating system is this? The answer might depend on the OS involved. However, it looks like you need to find this BLAS library and install it. It doesn’t seem to be installable via pip (so you’ll have to do it by hand), but once you install it, that ought to let your SciPy installation proceed.


Answer 7

In my case, upgrading pip did the trick. I also installed scipy with the -U flag (upgrade all packages to the latest available version).


How to do exponential and logarithmic curve fitting in Python? I found only polynomial fitting

Question: How to do exponential and logarithmic curve fitting in Python? I found only polynomial fitting

I have a set of data and I want to compare which line describes it best (polynomials of different orders, exponential or logarithmic).

I use Python and Numpy and for polynomial fitting there is a function polyfit(). But I found no such functions for exponential and logarithmic fitting.

Are there any? Or how to solve it otherwise?


Answer 0

For fitting y = A + B log x, just fit y against (log x).

>>> x = numpy.array([1, 7, 20, 50, 79])
>>> y = numpy.array([10, 19, 30, 35, 51])
>>> numpy.polyfit(numpy.log(x), y, 1)
array([ 8.46295607,  6.61867463])
# y ≈ 8.46 log(x) + 6.62

For fitting y = Ae^(Bx), taking the logarithm of both sides gives log y = log A + Bx. So fit (log y) against x.

Note that fitting (log y) as if it were linear will emphasize small values of y, causing large deviation for large y. This is because polyfit (linear regression) works by minimizing ∑ᵢ (ΔY)² = ∑ᵢ (Yᵢ − Ŷᵢ)². When Yᵢ = log yᵢ, the residues are ΔYᵢ = Δ(log yᵢ) ≈ Δyᵢ / |yᵢ|. So even if polyfit makes a very bad decision for large y, the “divide-by-|y|” factor will compensate for it, causing polyfit to favor small values.

This could be alleviated by giving each entry a “weight” proportional to y. polyfit supports weighted-least-squares via the w keyword argument.

>>> x = numpy.array([10, 19, 30, 35, 51])
>>> y = numpy.array([1, 7, 20, 50, 79])
>>> numpy.polyfit(x, numpy.log(y), 1)
array([ 0.10502711, -0.40116352])
#    y ≈ exp(-0.401) * exp(0.105 * x) = 0.670 * exp(0.105 * x)
# (^ biased towards small values)
>>> numpy.polyfit(x, numpy.log(y), 1, w=numpy.sqrt(y))
array([ 0.06009446,  1.41648096])
#    y ≈ exp(1.42) * exp(0.0601 * x) = 4.12 * exp(0.0601 * x)
# (^ not so biased)

Note that Excel, LibreOffice and most scientific calculators typically use the unweighted (biased) formula for exponential regression / trend lines. If you want your results to be compatible with these platforms, do not include the weights, even if weighting gives better results.


Now, if you can use scipy, you could use scipy.optimize.curve_fit to fit any model without transformations.

For y = A + B log x the result is the same as the transformation method:

>>> x = numpy.array([1, 7, 20, 50, 79])
>>> y = numpy.array([10, 19, 30, 35, 51])
>>> scipy.optimize.curve_fit(lambda t,a,b: a+b*numpy.log(t),  x,  y)
(array([ 6.61867467,  8.46295606]), 
 array([[ 28.15948002,  -7.89609542],
        [ -7.89609542,   2.9857172 ]]))
# y ≈ 6.62 + 8.46 log(x)

For y = Ae^(Bx), however, we can get a better fit, since curve_fit minimizes Δy directly instead of Δ(log y). But we need to provide an initial guess so curve_fit can reach the desired local minimum.

>>> x = numpy.array([10, 19, 30, 35, 51])
>>> y = numpy.array([1, 7, 20, 50, 79])
>>> scipy.optimize.curve_fit(lambda t,a,b: a*numpy.exp(b*t),  x,  y)
(array([  5.60728326e-21,   9.99993501e-01]),
 array([[  4.14809412e-27,  -1.45078961e-08],
        [ -1.45078961e-08,   5.07411462e+10]]))
# oops, definitely wrong.
>>> scipy.optimize.curve_fit(lambda t,a,b: a*numpy.exp(b*t),  x,  y,  p0=(4, 0.1))
(array([ 4.88003249,  0.05531256]),
 array([[  1.01261314e+01,  -4.31940132e-02],
        [ -4.31940132e-02,   1.91188656e-04]]))
# y ≈ 4.88 exp(0.0553 x). much better.
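
Since the question asks which curve describes the data best, one rough way to compare candidate models with the same number of parameters (a sketch, not part of the original answer) is the residual sum of squares on the original y scale:

import numpy
import scipy.optimize

x = numpy.array([10, 19, 30, 35, 51])
y = numpy.array([1, 7, 20, 50, 79])

# Exponential model fitted as above, with an initial guess.
(a, b), _ = scipy.optimize.curve_fit(
    lambda t, a, b: a * numpy.exp(b * t), x, y, p0=(4, 0.1))

residuals = y - a * numpy.exp(b * x)
print(numpy.sum(residuals ** 2))   # smaller is better among same-sized models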


Answer 1

You can also fit a set of data to whatever function you like using curve_fit from scipy.optimize. For example, if you want to fit an exponential function (from the documentation):

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a * np.exp(-b * x) + c

x = np.linspace(0,4,50)
y = func(x, 2.5, 1.3, 0.5)
yn = y + 0.2*np.random.normal(size=len(x))

popt, pcov = curve_fit(func, x, yn)

And then if you want to plot, you could do:

plt.figure()
plt.plot(x, yn, 'ko', label="Original Noised Data")
plt.plot(x, func(x, *popt), 'r-', label="Fitted Curve")
plt.legend()
plt.show()

(Note: the * in front of popt when you plot will expand out the terms into the a, b, and c that func is expecting.)
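
If you also want rough one-sigma uncertainties on the fitted parameters, a common idiom (an addition, not part of the original answer) is to take the square root of the diagonal of the pcov returned above:

import numpy as np

perr = np.sqrt(np.diag(pcov))   # one standard-deviation errors on a, b, c
print(perr)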


Answer 2

I was having some trouble with this, so let me be very explicit so that noobs like me can understand.

Let’s say that we have a data file or something like that:

# -*- coding: utf-8 -*-

import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import numpy as np
import sympy as sym

"""
Generate some data, let's imagine that you already have this. 
"""
x = np.linspace(0, 3, 50)
y = np.exp(x)

"""
Plot your data
"""
plt.plot(x, y, 'ro',label="Original Data")

"""
brute force to avoid errors
"""    
x = np.array(x, dtype=float) #transform your data in a numpy array of floats 
y = np.array(y, dtype=float) #so the curve_fit can work

"""
create a function to fit with your data. a, b, c and d are the coefficients
that curve_fit will calculate for you. 
In this part you need to guess and/or use mathematical knowledge to find
a function that resembles your data
"""
def func(x, a, b, c, d):
    return a*x**3 + b*x**2 +c*x + d

"""
make the curve_fit
"""
popt, pcov = curve_fit(func, x, y)

"""
The result is:
popt[0] = a , popt[1] = b, popt[2] = c and popt[3] = d of the function,
so f(x) = popt[0]*x**3 + popt[1]*x**2 + popt[2]*x + popt[3].
"""
print("a = %s, b = %s, c = %s, d = %s" % (popt[0], popt[1], popt[2], popt[3]))

"""
Use sympy to generate the LaTeX syntax of the function
"""
xs = sym.Symbol(r'\lambda')
tex = sym.latex(func(xs,*popt)).replace('$', '')
plt.title(r'$f(\lambda)= %s$' %(tex),fontsize=16)

"""
Print the coefficients and plot the function.
"""

plt.plot(x, func(x, *popt), label="Fitted Curve") #same as line above \/
#plt.plot(x, popt[0]*x**3 + popt[1]*x**2 + popt[2]*x + popt[3], label="Fitted Curve") 

plt.legend(loc='upper left')
plt.show()

the result is: a = 0.849195983017 , b = -1.18101681765, c = 2.24061176543, d = 0.816643894816


Answer 3

Well I guess you can always use:

np.log   -->  natural log
np.log10 -->  base 10
np.log2  -->  base 2

Slightly modifying IanVS’s answer:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func(x, a, b, c):
  #return a * np.exp(-b * x) + c
  return a * np.log(b * x) + c

x = np.linspace(1,5,50)   # changed boundary conditions to avoid division by 0
y = func(x, 2.5, 1.3, 0.5)
yn = y + 0.2*np.random.normal(size=len(x))

popt, pcov = curve_fit(func, x, yn)

plt.figure()
plt.plot(x, yn, 'ko', label="Original Noised Data")
plt.plot(x, func(x, *popt), 'r-', label="Fitted Curve")
plt.legend()
plt.show()

This results in the following graph:


Answer 4

Here’s a linearization option for simple data that uses tools from scikit-learn.

Given

import numpy as np

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer


np.random.seed(123)

# General Functions
def func_exp(x, a, b, c):
    """Return values from a general exponential function."""
    return a * np.exp(b * x) + c


def func_log(x, a, b, c):
    """Return values from a general log function."""
    return a * np.log(b * x) + c


# Helper
def generate_data(func, *args, jitter=0):
    """Return a tuple of arrays with random data along a general function."""
    xs = np.linspace(1, 5, 50)
    ys = func(xs, *args)
    noise = jitter * np.random.normal(size=len(xs)) + jitter
    xs = xs.reshape(-1, 1)                                  # xs[:, np.newaxis]
    ys = (ys + noise).reshape(-1, 1)
    return xs, ys
transformer = FunctionTransformer(np.log, validate=True)

Code

Fit exponential data

# Data
x_samp, y_samp = generate_data(func_exp, 2.5, 1.2, 0.7, jitter=3)
y_trans = transformer.fit_transform(y_samp)             # 1

# Regression
regressor = LinearRegression()
results = regressor.fit(x_samp, y_trans)                # 2
model = results.predict
y_fit = model(x_samp)

# Visualization
plt.scatter(x_samp, y_samp)
plt.plot(x_samp, np.exp(y_fit), "k--", label="Fit")     # 3
plt.title("Exponential Fit")

Fit log data

# Data
x_samp, y_samp = generate_data(func_log, 2.5, 1.2, 0.7, jitter=0.15)
x_trans = transformer.fit_transform(x_samp)             # 1

# Regression
regressor = LinearRegression()
results = regressor.fit(x_trans, y_samp)                # 2
model = results.predict
y_fit = model(x_trans)

# Visualization
plt.scatter(x_samp, y_samp)
plt.plot(x_samp, y_fit, "k--", label="Fit")             # 3
plt.title("Logarithmic Fit")


Details

General Steps

  1. Apply a log operation to data values (x, y or both)
  2. Regress the data to a linearized model
  3. Plot by “reversing” any log operations (with np.exp()) and fit to original data

Assuming our data follows an exponential trend, a general equation+ may be: y = A * exp(B*x) + C

We can linearize the latter equation (e.g. y = intercept + slope * x) by taking the log: log(y - C) = log(A) + B * x

Given a linearized equation++ and the regression parameters, we could calculate:

  • A via intercept (ln(A))
  • B via slope (B)

Summary of Linearization Techniques

Relationship |  Example   |     General Eqn.     |  Altered Var.  |        Linearized Eqn.  
-------------|------------|----------------------|----------------|------------------------------------------
Linear       | x          | y =     B * x    + C | -              |        y =   C    + B * x
Logarithmic  | log(x)     | y = A * log(B*x) + C | log(x)         |        y =   C    + A * (log(B) + log(x))
Exponential  | 2**x, e**x | y = A * exp(B*x) + C | log(y)         | log(y-C) = log(A) + B * x
Power        | x**2       | y =     B * x**N + C | log(x), log(y) | log(y-C) = log(B) + N * log(x)

+Note: linearizing exponential functions works best when the noise is small and C=0. Use with caution.

++Note: while altering y data helps linearize exponential data, altering x data helps linearize log data.
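
To illustrate the last row of the table, here is a power-law fit via a double log transform (a minimal sketch, assuming C = 0):

import numpy as np

# Power law y = B * x**N with C = 0; linearized: log(y) = log(B) + N * log(x)
x = np.linspace(1, 5, 50)
y = 3.0 * x ** 2.0

N, log_B = np.polyfit(np.log(x), np.log(y), 1)   # slope = N, intercept = log(B)
print(np.exp(log_B), N)                          # approximately 3.0 and 2.0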


Answer 5

We demonstrate features of lmfit while solving both problems.

Given

import lmfit

import numpy as np

import matplotlib.pyplot as plt


%matplotlib inline
np.random.seed(123)

# General Functions
def func_log(x, a, b, c):
    """Return values from a general log function."""
    return a * np.log(b * x) + c


# Data
x_samp = np.linspace(1, 5, 50)
_noise = np.random.normal(size=len(x_samp), scale=0.06)
y_samp = 2.5 * np.exp(1.2 * x_samp) + 0.7 + _noise
y_samp2 = 2.5 * np.log(1.2 * x_samp) + 0.7 + _noise

Code

Approach 1 – lmfit Model

Fit exponential data

regressor = lmfit.models.ExponentialModel()                # 1    
initial_guess = dict(amplitude=1, decay=-1)                # 2
results = regressor.fit(y_samp, x=x_samp, **initial_guess)
y_fit = results.best_fit    

plt.plot(x_samp, y_samp, "o", label="Data")
plt.plot(x_samp, y_fit, "k--", label="Fit")
plt.legend()

Approach 2 – Custom Model

Fit log data

regressor = lmfit.Model(func_log)                          # 1
initial_guess = dict(a=1, b=.1, c=.1)                      # 2
results = regressor.fit(y_samp2, x=x_samp, **initial_guess)
y_fit = results.best_fit

plt.plot(x_samp, y_samp2, "o", label="Data")
plt.plot(x_samp, y_fit, "k--", label="Fit")
plt.legend()


Details

  1. Choose a regression class
  2. Supply named, initial guesses that respect the function’s domain

You can determine the inferred parameters from the regressor object. Example:

regressor.param_names
# ['decay', 'amplitude']
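
The fitted values themselves can be read off the fit result in the same way; best_values is a plain dict (a small sketch, reusing the results object from above):

print(results.best_values)
# e.g. {'a': ..., 'b': ..., 'c': ...} for the custom model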

Note: the ExponentialModel() follows a decay function, which accepts two parameters, one of which is negative.

See also ExponentialGaussianModel(), which accepts more parameters.

Install the library via > pip install lmfit.


Answer 6

Wolfram has a closed-form solution for fitting an exponential. They also have similar solutions for fitting logarithmic and power laws.

I found this to work better than scipy’s curve_fit, especially when you don’t have data “near zero”. Here is an example:

import numpy as np
import matplotlib.pyplot as plt

# Fit the function y = A * exp(B * x) to the data
# returns (A, B)
# From: https://mathworld.wolfram.com/LeastSquaresFittingExponential.html
def fit_exp(xs, ys):
    S_x2_y = 0.0
    S_y_lny = 0.0
    S_x_y = 0.0
    S_x_y_lny = 0.0
    S_y = 0.0
    for (x,y) in zip(xs, ys):
        S_x2_y += x * x * y
        S_y_lny += y * np.log(y)
        S_x_y += x * y
        S_x_y_lny += x * y * np.log(y)
        S_y += y
    #end
    a = (S_x2_y * S_y_lny - S_x_y * S_x_y_lny) / (S_y * S_x2_y - S_x_y * S_x_y)
    b = (S_y * S_x_y_lny - S_x_y * S_y_lny) / (S_y * S_x2_y - S_x_y * S_x_y)
    return (np.exp(a), b)


xs = [33, 34, 35, 36, 37, 38, 39, 40, 41, 42]
ys = [3187, 3545, 4045, 4447, 4872, 5660, 5983, 6254, 6681, 7206]

(A, B) = fit_exp(xs, ys)

plt.figure()
plt.plot(xs, ys, 'o-', label='Raw Data')
plt.plot(xs, [A * np.exp(B *x) for x in xs], 'o-', label='Fit')

plt.title('Exponential Fit Test')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend(loc='best')
plt.tight_layout()
plt.show()


Is there a library function for root mean squared error (RMSE) in Python?

Question: Is there a library function for root mean squared error (RMSE) in Python?

I know I could implement a root mean squared error function like this:

def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())

What I’m looking for is whether this rmse function is implemented in a library somewhere, perhaps in scipy or scikit-learn?


Answer 0

sklearn.metrics has a mean_squared_error function. The RMSE is just the square root of whatever it returns.

from sklearn.metrics import mean_squared_error
from math import sqrt

rms = sqrt(mean_squared_error(y_actual, y_predicted))

Answer 1

What is RMSE? Also known as MSE, RMD, or RMS. What problem does it solve?

If you understand RMSE (root mean squared error), MSE (mean squared error), RMD (root mean squared deviation) and RMS (root mean squared), then asking for a library to calculate this for you is unnecessary over-engineering. All these metrics are a single line of Python code at most 2 inches long. The metrics rmse, mse, rmd and rms are conceptually identical at their core.

RMSE answers the question: “How similar, on average, are the numbers in list1 to list2?”. The two lists must be the same size. I want to “wash out the noise between any two given elements, wash out the size of the data collected, and get a single number feel for change over time”.

Intuition and ELI5 for RMSE:

Imagine you are learning to throw darts at a dart board. Every day you practice for one hour. You want to figure out if you are getting better or getting worse. So every day you make 10 throws and measure the distance between the bullseye and where your dart hit.

You make a list of those numbers list1. Use the root mean squared error between the distances at day 1 and a list2 containing all zeros. Do the same on the 2nd and nth days. What you will get is a single number that hopefully decreases over time. When your RMSE number is zero, you hit bullseyes every time. If the rmse number goes up, you are getting worse.

Example in calculating root mean squared error in python:

import numpy as np
d = [0.000, 0.166, 0.333]   #ideal target distances, these can be all zeros.
p = [0.000, 0.254, 0.998]   #your performance goes here

print("d is: " + str(["%.8f" % elem for elem in d]))
print("p is: " + str(["%.8f" % elem for elem in p]))

def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())

rmse_val = rmse(np.array(d), np.array(p))
print("rms error is: " + str(rmse_val))

Which prints:

d is: ['0.00000000', '0.16600000', '0.33300000']
p is: ['0.00000000', '0.25400000', '0.99800000']
rms error is: 0.387284994115

The mathematical notation:

RMSE = sqrt( (1/n) * Σᵢ (dᵢ − pᵢ)² ), with the sum running over i = 1 … n

Glyph Legend: n is a whole positive integer representing the number of throws. i represents a whole positive integer counter that enumerates the sum. d stands for the ideal distances, the list2 containing all zeros in the above example. p stands for performance, the list1 in the above example. The superscript 2 stands for numeric squaring. dᵢ is the i’th index of d. pᵢ is the i’th index of p.

The rmse computed in small steps, so it can be understood:

def rmse(predictions, targets):

    differences = predictions - targets                       #the DIFFERENCEs.

    differences_squared = differences ** 2                    #the SQUAREs of ^

    mean_of_differences_squared = differences_squared.mean()  #the MEAN of ^

    rmse_val = np.sqrt(mean_of_differences_squared)           #ROOT of ^

    return rmse_val                                           #get the ^

How does every step of RMSE work:

Subtracting one number from another gives you the distance between them.

8 - 5 = 3         #absolute distance between 8 and 5 is +3
-20 - 10 = -30    #absolute distance between -20 and 10 is +30

If you multiply any number times itself, the result is always positive because negative times negative is positive:

3*3     = 9   = positive
-30*-30 = 900 = positive

Add them all up, but wait, then an array with many elements would have a larger error than a small array, so average them by the number of elements.

But wait, we squared them all earlier to force them positive. Undo the damage with a square root!

That leaves you with a single number that represents, on average, the distance between every value of list1 and its corresponding element value in list2.

If the RMSE value goes down over time we are happy because variance is decreasing.

RMSE isn’t the most accurate line fitting strategy, total least squares is:

Root mean squared error measures the vertical distance between the point and the line, so if your data is shaped like a banana, flat near the bottom and steep near the top, then the RMSE will report greater distances to high points but shorter distances to low points, when in fact the distances are equivalent. This causes a skew where the fitted line prefers to be closer to high points than to low ones.

If this is a problem the total least squares method fixes this: https://mubaris.com/posts/linear-regression

Gotchas that can break this RMSE function:

If there are nulls or infinities in either input list, then the output rmse value is not going to make sense. There are three strategies to deal with nulls / missing values / infinities in either list: ignore that component, zero it out, or add a best guess or uniform random noise to all timesteps. Each remedy has its pros and cons depending on what your data means. In general, ignoring any component with a missing value is preferred, but this biases the RMSE toward zero, making you think performance has improved when it really hasn’t. Adding random noise on top of a best guess could be preferred if there are lots of missing values.

In order to guarantee relative correctness of the RMSE output, you must eliminate all nulls/infinites from the input.
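
A minimal sketch of that elimination with numpy (the function name is illustrative):

import numpy as np

def rmse_finite(predictions, targets):
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Keep only the pairs where both entries are finite (no NaN, no +/-inf).
    mask = np.isfinite(predictions) & np.isfinite(targets)
    return np.sqrt(((predictions[mask] - targets[mask]) ** 2).mean())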

RMSE has zero tolerance for outlier data points which don’t belong

Root mean squared error relies on all data being right, and all points are counted as equal. That means one stray point that’s way out in left field is going to totally ruin the whole calculation. To handle outlier data points and dismiss their tremendous influence after a certain threshold, see robust estimators, which build in a threshold for dismissing outliers.


Answer 2

This is probably faster?:

n = len(predictions)
rmse = np.linalg.norm(predictions - targets) / np.sqrt(n)
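
A quick check (a sketch) that this matches the direct formula:

import numpy as np

predictions = np.array([1.0, 2.0, 3.0])
targets = np.array([1.5, 2.0, 2.5])

direct = np.sqrt(((predictions - targets) ** 2).mean())
via_norm = np.linalg.norm(predictions - targets) / np.sqrt(len(predictions))
print(np.isclose(direct, via_norm))   # True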

Answer 3

In scikit-learn 0.22.0 you can pass mean_squared_error() the argument squared=False to return the RMSE.

from sklearn.metrics import mean_squared_error

mean_squared_error(y_actual, y_predicted, squared=False)


Answer 4

Just in case someone finds this thread in 2019: there is a library called ml_metrics which is available without pre-installation in Kaggle’s kernels. It is pretty lightweight and accessible through PyPI (it can be installed easily and quickly with pip install ml_metrics):

from ml_metrics import rmse
rmse(actual=[0, 1, 2], predicted=[1, 10, 5])
# 5.507570547286102

It has a few other interesting metrics which are not available in sklearn, like mapk.



Answer 5

Actually, I did write a bunch of those as utility functions for statsmodels

http://statsmodels.sourceforge.net/devel/tools.html#measure-for-fit-performance-eval-measures

and http://statsmodels.sourceforge.net/devel/generated/statsmodels.tools.eval_measures.rmse.html#statsmodels.tools.eval_measures.rmse

Mostly one- or two-liners without much input checking, mainly intended for easily getting some statistics when comparing arrays. But they have unit tests for the axis arguments, because that’s where I sometimes make sloppy mistakes.


Answer 6

Or by simply using only NumPy functions:

def rmse(y, y_pred):
    return np.sqrt(np.mean(np.square(y - y_pred)))

Where:

  • y is my target
  • y_pred is my prediction

Note that rmse(y, y_pred)==rmse(y_pred, y) due to the square function.
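
A quick demonstration of that symmetry (a sketch):

import numpy as np

def rmse(y, y_pred):
    return np.sqrt(np.mean(np.square(y - y_pred)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.1, 1.9, 3.2])
print(np.isclose(rmse(a, b), rmse(b, a)))   # True: (a - b)**2 == (b - a)**2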


Answer 7

You can’t find an RMSE function directly in sklearn. But instead of manually doing the sqrt, there is another standard way of using sklearn: sklearn’s mean_squared_error itself contains a parameter called squared, with a default value of True. If we set it to False, the same function will return the RMSE instead of the MSE.

# code changes implemented by Esha Prakash
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_true, y_pred, squared=False)

Answer 8

Here’s some example code that calculates the RMSE between two PLY polygon files. It uses both the ml_metrics lib and np.linalg.norm:

import sys
from pyntcloud import PyntCloud as pc
import numpy as np
from ml_metrics import rmse

if len(sys.argv) < 3 or sys.argv[1] == "-h" or sys.argv[1] == "--help":
    print("Usage: compute-rmse.py <input1.ply> <input2.ply>")
    sys.exit(1)

def verify_rmse(a, b):
    n = len(a)
    return np.linalg.norm(np.array(b) - np.array(a)) / np.sqrt(n)

def compare(a, b):
    m = pc.from_file(a).points
    n = pc.from_file(b).points
    m = [ tuple(m.x), tuple(m.y), tuple(m.z) ]; m = m[0]
    n = [ tuple(n.x), tuple(n.y), tuple(n.z) ]; n = n[0]
    v1, v2 = verify_rmse(m, n), rmse(m,n)
    print(v1, v2)

compare(sys.argv[1], sys.argv[2])

Answer 9

  1. No; but there is the scikit-learn library for machine learning, and it can easily be employed using the Python language. It has a function for the mean squared error, for which I am sharing the link below:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html

  2. The function is named mean_squared_error, as given below, where y_true would be the real class values for the data tuples and y_pred would be the predicted values, as predicted by the machine learning algorithm you are using:

mean_squared_error(y_true, y_pred)

  3. You have to modify it to get the RMSE (by using Python’s sqrt function). This process is described at this link: https://www.codeastar.com/regression-model-rmsd/

So, the final code would be something like:

from sklearn.metrics import mean_squared_error
from math import sqrt

RMSD = sqrt(mean_squared_error(testing_y, prediction))

print(RMSD)


How to append a new row to an empty numpy array

Question: How to append a new row to an empty numpy array

Using standard Python arrays, I can do the following:

arr = []
arr.append([1,2,3])
arr.append([4,5,6])
# arr is now [[1,2,3],[4,5,6]]

However, I cannot do the same thing in numpy. For example:

arr = np.array([])
arr = np.append(arr, np.array([1,2,3]))
arr = np.append(arr, np.array([4,5,6]))
# arr is now [1,2,3,4,5,6]

I also looked into vstack, but when I use vstack on an empty array, I get:

ValueError: all the input array dimensions except for the concatenation axis must match exactly

So how do I append a new row to an empty array in numpy?


Answer 0

The way to “start” the array that you want is:

arr = np.empty((0,3), int)

Which is an empty array but it has the proper dimensionality.

>>> arr
array([], shape=(0, 3), dtype=int64)

Then be sure to append along axis 0:

arr = np.append(arr, np.array([[1,2,3]]), axis=0)
arr = np.append(arr, np.array([[4,5,6]]), axis=0)

But, @jonrsharpe is right. In fact, if you’re going to be appending in a loop, it would be much faster to append to a list as in your first example, then convert to a numpy array at the end, since you’re really not using numpy as intended during the loop:

In [210]: %%timeit
   .....: l = []
   .....: for i in xrange(1000):
   .....:     l.append([3*i+1,3*i+2,3*i+3])
   .....: l = np.asarray(l)
   .....: 
1000 loops, best of 3: 1.18 ms per loop

In [211]: %%timeit
   .....: a = np.empty((0,3), int)
   .....: for i in xrange(1000):
   .....:     a = np.append(a, 3*i+np.array([[1,2,3]]), 0)
   .....: 
100 loops, best of 3: 18.5 ms per loop

In [214]: np.allclose(a, l)
Out[214]: True

The numpythonic way to do it depends on your application, but it would be more like:

In [220]: timeit n = np.arange(1,3001).reshape(1000,3)
100000 loops, best of 3: 5.93 µs per loop

In [221]: np.allclose(a, n)
Out[221]: True

Answer 1

Here is my solution:

arr = []
arr.append([1,2,3])
arr.append([4,5,6])
np_arr = np.array(arr)

Answer 2

In this case you might want to use the functions np.hstack and np.vstack

arr = np.array([])
arr = np.hstack((arr, np.array([1,2,3])))
# arr is now [1,2,3]

arr = np.vstack((arr, np.array([4,5,6])))
# arr is now [[1,2,3],[4,5,6]]

You also can use the np.concatenate function.

Cheers


Answer 3

Using a custom dtype definition, what worked for me was:

import numpy

# define custom dtype
type1 = numpy.dtype([('freq', numpy.float64), ('amplitude', numpy.float64)])
# declare empty array, zero rows but one column
arr = numpy.empty([0,1],dtype=type1)
# store row data, maybe inside a loop
row = numpy.array([(0.0001, 0.002)], dtype=type1)
# append row to the main array
arr = numpy.row_stack((arr, row))
# print values stored in the row 0
print(float(arr[0]['freq']))
print(float(arr[0]['amplitude']))

Answer 4

In the case of adding new rows to an array in a loop, assign the array directly the first time through the loop instead of initialising an empty array.

for i in range(0, 100):
    SOMECALCULATEDARRAY = ...   # compute your row(s) here
    if i == 0:
        finalArrayCollection = SOMECALCULATEDARRAY
    else:
        finalArrayCollection = np.vstack((finalArrayCollection, SOMECALCULATEDARRAY))

This is mainly useful when the shape of the array is unknown


Answer 5

I want to do a for loop, yet with askewchan’s method it does not work well, so I have modified it:

x = np.empty((0, 3))
y = np.array([1, 2, 3])
for i in ...:
    x = np.vstack((x, y))

Specify and save a figure with an exact size in pixels

Question: Specify and save a figure with an exact size in pixels

Say I have an image of size 3841 x 7195 pixels. I would like to save the contents of the figure to disk, resulting in an image of the exact size I specify in pixels.

No axes, no titles. Just the image. I don’t personally care about DPI, as I only want to specify the size the image takes up on disk, in pixels.

I have read other threads, and they all seem to do conversions to inches and then specify the dimensions of the figure in inches and adjust dpi’s in some way. I would like to avoid dealing with the potential loss of accuracy that could result from pixel-to-inches conversions.

I have tried with:

w = 7195
h = 3841
fig = plt.figure(frameon=False)
fig.set_size_inches(w,h)
ax = plt.Axes(fig, [0., 0., 1., 1.])
ax.set_axis_off()
fig.add_axes(ax)
ax.imshow(im_np, aspect='normal')
fig.savefig(some_path, dpi=1)

with no luck (Python complains that width and height must each be below 32768 (?))

From everything I have seen, matplotlib requires the figure size to be specified in inches and dpi, but I am only interested in the pixels the figure takes up on disk. How can I do this?

To clarify: I am looking for a way to do this with matplotlib, and not with other image-saving libraries.


Answer 0

Matplotlib doesn’t work with pixels directly, but rather with physical sizes and DPI. If you want to display a figure with a certain pixel size, you need to know the DPI of your monitor. For example, this link will detect that for you.

If you have an image of 3841×7195 pixels it is unlikely that your monitor will be that large, so you won’t be able to show a figure of that size (matplotlib requires the figure to fit on the screen; if you ask for a size too large it will shrink to the screen size). Let’s imagine you want an 800×800 pixel image just as an example. Here’s how to show an 800×800 pixel image on my monitor (my_dpi=96):

plt.figure(figsize=(800/my_dpi, 800/my_dpi), dpi=my_dpi)

So you basically just divide the pixel dimensions by your DPI to get the size in inches.

If you want to save a figure of a specific size, then it is a different matter. Screen DPIs are not so important anymore (unless you ask for a figure that won’t fit in the screen). Using the same example of the 800×800 pixel figure, we can save it in different resolutions using the dpi keyword of savefig. To save it in the same resolution as the screen just use the same dpi:

plt.savefig('my_fig.png', dpi=my_dpi)

To save it as an 8000×8000 pixel image, use a dpi 10 times larger:

plt.savefig('my_fig.png', dpi=my_dpi * 10)

Note that the setting of the DPI is not supported by all backends. Here, the PNG backend is used, but the pdf and ps backends will implement the size differently. Also, changing the DPI and sizes will also affect things like fontsize. A larger DPI will keep the same relative sizes of fonts and elements, but if you want smaller fonts for a larger figure you need to increase the physical size instead of the DPI.

Getting back to your example, if you want to save an image with 3841 x 7195 pixels, you could do the following:

plt.figure(figsize=(3.841, 7.195), dpi=100)
( your code ...)
plt.savefig('myfig.png', dpi=1000)

Note that I used the figure dpi of 100 to fit in most screens, but saved with dpi=1000 to achieve the required resolution. In my system this produces a png with 3840×7190 pixels — it seems that the DPI saved is always 0.02 pixels/inch smaller than the selected value, which will have a (small) effect on large image sizes. Some more discussion of this here.
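
Putting that pixels = inches × DPI relationship into a small helper (a sketch; the figure DPI of 100 is as arbitrary as above):

import matplotlib.pyplot as plt

def figure_with_pixel_size(width_px, height_px, dpi=100):
    """Create a figure that savefig(dpi=dpi) writes at width_px x height_px."""
    return plt.figure(figsize=(width_px / dpi, height_px / dpi), dpi=dpi)

fig = figure_with_pixel_size(3841, 7195)
# ( your code ... )
fig.savefig('myfig.png', dpi=100)   # about 3841 x 7195 pixels, modulo the rounding noted above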


Answer 1

This worked for me, based on your code, generating a 93Mb png image with color noise and the desired dimensions:

import matplotlib.pyplot as plt
import numpy

w = 7195
h = 3841

im_np = numpy.random.rand(h, w)

fig = plt.figure(frameon=False)
fig.set_size_inches(w,h)
ax = plt.Axes(fig, [0., 0., 1., 1.])
ax.set_axis_off()
fig.add_axes(ax)
ax.imshow(im_np, aspect='normal')
fig.savefig('figure.png', dpi=1)

I am using the latest pip versions of the Python 2.7 libraries on Linux Mint 13.

Hope that helps!


Answer 2

Based on the accepted response by tiago, here is a small generic function that exports a numpy array to an image having the same resolution as the array:

import matplotlib.pyplot as plt
import numpy as np

def export_figure_matplotlib(arr, f_name, dpi=200, resize_fact=1, plt_show=False):
    """
    Export array as figure in original resolution
    :param arr: array of image to save in original resolution
    :param f_name: name of file where to save figure
    :param resize_fact: resize factor wrt shape of arr, in (0, np.infty)
    :param dpi: dpi of your screen
    :param plt_show: show plot or not
    """
    fig = plt.figure(frameon=False)
    fig.set_size_inches(arr.shape[1]/dpi, arr.shape[0]/dpi)
    ax = plt.Axes(fig, [0., 0., 1., 1.])
    ax.set_axis_off()
    fig.add_axes(ax)
    ax.imshow(arr)
    plt.savefig(f_name, dpi=(dpi * resize_fact))
    if plt_show:
        plt.show()
    else:
        plt.close()

As said in the previous reply by tiago, the screen DPI needs to be found first, which can be done here for instance: http://dpi.lv

I’ve added an additional argument resize_fact to the function, with which you can export the image at 50% (0.5) of the original resolution, for instance.
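
As a usage sketch (it assumes the function above is defined, and a screen DPI of 96; adjust to your monitor):

import numpy as np

arr = np.random.rand(600, 800)
export_figure_matplotlib(arr, 'full_res.png', dpi=96)                    # 800x600 px
export_figure_matplotlib(arr, 'half_res.png', dpi=96, resize_fact=0.5)   # 400x300 px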


回答 3

plt.imsave 对我有效。文档见:https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.imsave.html

#file_path = directory address where the image will be stored along with file name and extension
#array = variable where the image is stored. I think for the original post this variable is im_np
plt.imsave(file_path, array)

plt.imsave worked for me. You can find the documentation here: https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.imsave.html

#file_path = directory address where the image will be stored along with file name and extension
#array = variable where the image is stored. I think for the original post this variable is im_np
plt.imsave(file_path, array)

将MATLAB代码转换为Python的工具

问题:将MATLAB代码转换为Python的工具

我的MS论文中有一堆MATLAB代码,现在我想将其转换为Python(使用numpy / scipy和matplotlib)并作为开源分发。我知道MATLAB与Python科学库之间的相似之处,手动转换它们的时间不会超过两周(前提是我每天都会努力一段时间)。我想知道是否已经有任何工具可以进行转换。

I have a bunch of MATLAB code from my MS thesis which I now want to convert to Python (using numpy/scipy and matplotlib) and distribute as open-source. I know the similarity between MATLAB and Python scientific libraries, and converting them manually will be not more than a fortnight (provided that I work towards it every day for some time). I was wondering if there was already any tool available which can do the conversion.


回答 0

有几种工具可以将Matlab转换为Python代码。

近期唯一仍有活动(最后一次提交于 2018 年 6 月)的是 SMOP(Small Matlab to Python compiler,也在这里开发:SMOP @ chiselapp)。

其他选项包括:

  • LiberMate:从Matlab转换为Python和SciPy(需要Python 2,最新更新为4年前)。
  • OMPC:Matlab到Python(有点过时)。

同样,对于那些对两种语言之间的接口感兴趣而不是转换的人:

  • pymatlab:从Python进行通信,方法是将数据发送到MATLAB工作区,使用脚本对其进行操作,然后拉回结果数据。
  • Python-Matlab虫洞:支持双向交互。
  • Python-Matlab桥:从Python内部使用Matlab,为iPython提供matlab_magic,以从ipython内部执行普通的matlab代码。
  • PyMat:从Python控制Matlab会话。
  • pymat2:看似被遗弃的PyMat的延续。
  • mlabwrapmlabwrap-purepy:使Matlab看起来像Python库(基于PyMat)。
  • oct2py:从Python内部运行GNU Octave命令。
  • pymex:将Python解释器嵌入Matlab(也发布在File Exchange上)
  • matpy:通过各种方式访问MATLAB:创建变量、访问.mat文件、直接对接MATLAB引擎(需要安装MATLAB)。
  • MatPy:Python软件包,用于数值线性代数并使用类似于MatLab的界面进行绘图。

顺便说一句,在这里查找其他迁移技巧可能会有所帮助:

另一方面,虽然我完全不喜欢 Fortran,但对于可能觉得有用的人,还有:

There are several tools for converting Matlab to Python code.

The only one that’s seen recent activity (last commit from June 2018) is Small Matlab to Python compiler (also developed here: SMOP@chiselapp).

Other options include:

  • LiberMate: translate from Matlab to Python and SciPy (Requires Python 2, last update 4 years ago).
  • OMPC: Matlab to Python (a bit outdated).

Also, for those interested in an interface between the two languages and not conversion:

  • pymatlab: communicate from Python by sending data to the MATLAB workspace, operating on them with scripts and pulling back the resulting data.
  • Python-Matlab wormholes: both directions of interaction supported.
  • Python-Matlab bridge: use Matlab from within Python, offers matlab_magic for iPython, to execute normal matlab code from within ipython.
  • PyMat: Control Matlab session from Python.
  • pymat2: continuation of the seemingly abandoned PyMat.
  • mlabwrap, mlabwrap-purepy: make Matlab look like Python library (based on PyMat).
  • oct2py: run GNU Octave commands from within Python.
  • pymex: Embeds the Python Interpreter in Matlab, also on File Exchange.
  • matpy: Access MATLAB in various ways: create variables, access .mat files, direct interface to MATLAB engine (requires MATLAB be installed).
  • MatPy: Python package for numerical linear algebra and plotting with a MatLab-like interface.

Btw might be helpful to look here for other migration tips:

On a different note, though I’m not a fortran fan at all, for people who might find it useful there is:


回答 1

还有 oct2py,它可以在 Python 中调用 .m 文件

https://pypi.python.org/pypi/oct2py

它需要GNU Octave,它与MATLAB高度兼容。

https://www.gnu.org/software/octave/

There’s also oct2py which can call .m files within python

https://pypi.python.org/pypi/oct2py

It requires GNU Octave, which is highly compatible with MATLAB.

https://www.gnu.org/software/octave/
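
A minimal sketch of calling Octave from Python with oct2py (myfun.m is a hypothetical script on the Octave path, not part of oct2py):

from oct2py import Oct2Py

oc = Oct2Py()
oc.eval('x = magic(4);')   # run raw Octave code in the session
x = oc.pull('x')           # fetch the variable back as a NumPy array
print(x)

# A .m function on the Octave path can be called like a Python method,
# e.g. for a hypothetical myfun.m defining `function y = myfun(a, b)`:
# y = oc.myfun(1.0, 2.0)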


使用Scipy(Python)使经验分布适合理论分布吗?

问题:使用Scipy(Python)使经验分布适合理论分布吗?

简介:我有一个包含 30,000 多个整数值的列表,取值范围为 0 到 47(含两端),例如[0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47,47,47,...],这些值采样自某个连续分布。列表中的值不一定有序,但顺序对本问题并不重要。

问题:根据我的分布,我想为任何给定值计算p值(看到更大值的概率)。例如,您可以看到0的p值将接近1,数字较大的p值将趋于0。

我不知道这样做是否正确,但为了确定概率,我认为需要把数据拟合到最适合描述它的理论分布上。我想需要某种拟合优度检验来确定最佳模型。

有没有办法在Python(ScipyNumpy)中实现这种分析?你能举个例子吗?

谢谢!

INTRODUCTION: I have a list of more than 30,000 integer values ranging from 0 to 47, inclusive, e.g.[0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47,47,47,...] sampled from some continuous distribution. The values in the list are not necessarily in order, but order doesn’t matter for this problem.

PROBLEM: Based on my distribution I would like to calculate p-value (the probability of seeing greater values) for any given value. For example, as you can see p-value for 0 would be approaching 1 and p-value for higher numbers would be tending to 0.

I don’t know if I am right, but to determine probabilities I think I need to fit my data to a theoretical distribution that is the most suitable to describe my data. I assume that some kind of goodness of fit test is needed to determine the best model.

Is there a way to implement such an analysis in Python (Scipy or Numpy)? Could you present any examples?

Thank you!


回答 0

平方误差和(SSE)的分布拟合

这是Saullo答案的更新和修改,它使用当前scipy.stats分布的完整列表,并返回分布的直方图和数据的直方图之间SSE最小的分布。

拟合示例

使用 statsmodels 中的 El Niño 数据集进行拟合并计算误差,返回误差最小的分布。

所有分布

最佳拟合分布

示例代码

%matplotlib inline

import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels as sm
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')

# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Distributions to check
    DISTRIBUTIONS = [        
        st.alpha,st.anglit,st.arcsine,st.beta,st.betaprime,st.bradford,st.burr,st.cauchy,st.chi,st.chi2,st.cosine,
        st.dgamma,st.dweibull,st.erlang,st.expon,st.exponnorm,st.exponweib,st.exponpow,st.f,st.fatiguelife,st.fisk,
        st.foldcauchy,st.foldnorm,st.frechet_r,st.frechet_l,st.genlogistic,st.genpareto,st.gennorm,st.genexpon,
        st.genextreme,st.gausshyper,st.gamma,st.gengamma,st.genhalflogistic,st.gilbrat,st.gompertz,st.gumbel_r,
        st.gumbel_l,st.halfcauchy,st.halflogistic,st.halfnorm,st.halfgennorm,st.hypsecant,st.invgamma,st.invgauss,
        st.invweibull,st.johnsonsb,st.johnsonsu,st.ksone,st.kstwobign,st.laplace,st.levy,st.levy_l,st.levy_stable,
        st.logistic,st.loggamma,st.loglaplace,st.lognorm,st.lomax,st.maxwell,st.mielke,st.nakagami,st.ncx2,st.ncf,
        st.nct,st.norm,st.pareto,st.pearson3,st.powerlaw,st.powerlognorm,st.powernorm,st.rdist,st.reciprocal,
        st.rayleigh,st.rice,st.recipinvgauss,st.semicircular,st.t,st.triang,st.truncexpon,st.truncnorm,st.tukeylambda,
        st.uniform,st.vonmises,st.vonmises_line,st.wald,st.weibull_min,st.weibull_max,st.wrapcauchy
    ]

    # Best holders
    best_distribution = st.norm
    best_params = (0.0, 1.0)
    best_sse = np.inf

    # Estimate distribution parameters from data
    for distribution in DISTRIBUTIONS:

        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings('ignore')

                # fit dist to data
                params = distribution.fit(data)

                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]

                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))

                # if an axis was passed in, add this fit to the plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                except Exception:
                    pass

                # identify if this distribution is better
                if best_sse > sse > 0:
                    best_distribution = distribution
                    best_params = params
                    best_sse = sse

        except Exception:
            pass

    return (best_distribution.name, best_params)

def make_pdf(dist, params, size=10000):
    """Generate distributions's Probability Distribution Function """

    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    # Get sane start and end points of distribution
    start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
    end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)

    # Build PDF and turn into pandas Series
    x = np.linspace(start, end, size)
    y = dist.pdf(x, loc=loc, scale=scale, *arg)
    pdf = pd.Series(y, x)

    return pdf

# Load data from statsmodels datasets
data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())

# Plot for comparison
plt.figure(figsize=(12,8))
ax = data.plot(kind='hist', bins=50, normed=True, alpha=0.5, color=plt.rcParams['axes.color_cycle'][1])
# Save plot limits
dataYLim = ax.get_ylim()

# Find best fit distribution
best_fit_name, best_fit_params = best_fit_distribution(data, 200, ax)
best_dist = getattr(st, best_fit_name)

# Update plots
ax.set_ylim(dataYLim)
ax.set_title(u'El Niño sea temp.\n All Fitted Distributions')
ax.set_xlabel(u'Temp (°C)')
ax.set_ylabel('Frequency')

# Make PDF with best params 
pdf = make_pdf(best_dist, best_fit_params)

# Display
plt.figure(figsize=(12,8))
ax = pdf.plot(lw=2, label='PDF', legend=True)
data.plot(kind='hist', bins=50, normed=True, alpha=0.5, label='Data', legend=True, ax=ax)

param_names = (best_dist.shapes + ', loc, scale').split(', ') if best_dist.shapes else ['loc', 'scale']
param_str = ', '.join(['{}={:0.2f}'.format(k,v) for k,v in zip(param_names, best_fit_params)])
dist_str = '{}({})'.format(best_fit_name, param_str)

ax.set_title(u'El Niño sea temp. with best fit distribution \n' + dist_str)
ax.set_xlabel(u'Temp. (°C)')
ax.set_ylabel('Frequency')

Distribution Fitting with Sum of Square Error (SSE)

This is an update and modification to Saullo’s answer, that uses the full list of the current scipy.stats distributions and returns the distribution with the least SSE between the distribution’s histogram and the data’s histogram.

Example Fitting

Using the El Niño dataset from statsmodels, the distributions are fit and error is determined. The distribution with the least error is returned.

All Distributions

Best Fit Distribution

Example Code

%matplotlib inline

import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels as sm
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')

# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Distributions to check
    DISTRIBUTIONS = [        
        st.alpha,st.anglit,st.arcsine,st.beta,st.betaprime,st.bradford,st.burr,st.cauchy,st.chi,st.chi2,st.cosine,
        st.dgamma,st.dweibull,st.erlang,st.expon,st.exponnorm,st.exponweib,st.exponpow,st.f,st.fatiguelife,st.fisk,
        st.foldcauchy,st.foldnorm,st.frechet_r,st.frechet_l,st.genlogistic,st.genpareto,st.gennorm,st.genexpon,
        st.genextreme,st.gausshyper,st.gamma,st.gengamma,st.genhalflogistic,st.gilbrat,st.gompertz,st.gumbel_r,
        st.gumbel_l,st.halfcauchy,st.halflogistic,st.halfnorm,st.halfgennorm,st.hypsecant,st.invgamma,st.invgauss,
        st.invweibull,st.johnsonsb,st.johnsonsu,st.ksone,st.kstwobign,st.laplace,st.levy,st.levy_l,st.levy_stable,
        st.logistic,st.loggamma,st.loglaplace,st.lognorm,st.lomax,st.maxwell,st.mielke,st.nakagami,st.ncx2,st.ncf,
        st.nct,st.norm,st.pareto,st.pearson3,st.powerlaw,st.powerlognorm,st.powernorm,st.rdist,st.reciprocal,
        st.rayleigh,st.rice,st.recipinvgauss,st.semicircular,st.t,st.triang,st.truncexpon,st.truncnorm,st.tukeylambda,
        st.uniform,st.vonmises,st.vonmises_line,st.wald,st.weibull_min,st.weibull_max,st.wrapcauchy
    ]

    # Best holders
    best_distribution = st.norm
    best_params = (0.0, 1.0)
    best_sse = np.inf

    # Estimate distribution parameters from data
    for distribution in DISTRIBUTIONS:

        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings('ignore')

                # fit dist to data
                params = distribution.fit(data)

                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]

                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))

                # if an axis was passed in, add this fit to the plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                except Exception:
                    pass

                # identify if this distribution is better
                if best_sse > sse > 0:
                    best_distribution = distribution
                    best_params = params
                    best_sse = sse

        except Exception:
            pass

    return (best_distribution.name, best_params)

def make_pdf(dist, params, size=10000):
    """Generate distributions's Probability Distribution Function """

    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    # Get sane start and end points of distribution
    start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
    end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)

    # Build PDF and turn into pandas Series
    x = np.linspace(start, end, size)
    y = dist.pdf(x, loc=loc, scale=scale, *arg)
    pdf = pd.Series(y, x)

    return pdf

# Load data from statsmodels datasets
data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())

# Plot for comparison
plt.figure(figsize=(12,8))
ax = data.plot(kind='hist', bins=50, normed=True, alpha=0.5, color=plt.rcParams['axes.color_cycle'][1])
# Save plot limits
dataYLim = ax.get_ylim()

# Find best fit distribution
best_fit_name, best_fit_params = best_fit_distribution(data, 200, ax)
best_dist = getattr(st, best_fit_name)

# Update plots
ax.set_ylim(dataYLim)
ax.set_title(u'El Niño sea temp.\n All Fitted Distributions')
ax.set_xlabel(u'Temp (°C)')
ax.set_ylabel('Frequency')

# Make PDF with best params 
pdf = make_pdf(best_dist, best_fit_params)

# Display
plt.figure(figsize=(12,8))
ax = pdf.plot(lw=2, label='PDF', legend=True)
data.plot(kind='hist', bins=50, normed=True, alpha=0.5, label='Data', legend=True, ax=ax)

param_names = (best_dist.shapes + ', loc, scale').split(', ') if best_dist.shapes else ['loc', 'scale']
param_str = ', '.join(['{}={:0.2f}'.format(k,v) for k,v in zip(param_names, best_fit_params)])
dist_str = '{}({})'.format(best_fit_name, param_str)

ax.set_title(u'El Niño sea temp. with best fit distribution \n' + dist_str)
ax.set_xlabel(u'Temp. (°C)')
ax.set_ylabel('Frequency')

回答 1

SciPy 0.12.0 中有 82 个已实现的分布函数。您可以使用它们的 fit() 方法测试其中一些与您数据的拟合程度。更多细节请看下面的代码:

import matplotlib.pyplot as plt
import scipy
import scipy.stats
size = 30000
x = scipy.arange(size)
y = scipy.int_(scipy.round_(scipy.stats.vonmises.rvs(5,size=size)*47))
h = plt.hist(y, bins=range(48))

dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']

for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
    plt.plot(pdf_fitted, label=dist_name)
    plt.xlim(0,47)
plt.legend(loc='upper right')
plt.show()

参考文献:

-拟合分布,拟合优度,p值。是否可以使用Scipy(Python)做到这一点?

-使用 Scipy 进行分布拟合

以下是Scipy 0.12.0(VI)中可用的所有分布函数的名称列表:

dist_names = [ 'alpha', 'anglit', 'arcsine', 'beta', 'betaprime', 'bradford', 'burr', 'cauchy', 'chi', 'chi2', 'cosine', 'dgamma', 'dweibull', 'erlang', 'expon', 'exponweib', 'exponpow', 'f', 'fatiguelife', 'fisk', 'foldcauchy', 'foldnorm', 'frechet_r', 'frechet_l', 'genlogistic', 'genpareto', 'genexpon', 'genextreme', 'gausshyper', 'gamma', 'gengamma', 'genhalflogistic', 'gilbrat', 'gompertz', 'gumbel_r', 'gumbel_l', 'halfcauchy', 'halflogistic', 'halfnorm', 'hypsecant', 'invgamma', 'invgauss', 'invweibull', 'johnsonsb', 'johnsonsu', 'ksone', 'kstwobign', 'laplace', 'logistic', 'loggamma', 'loglaplace', 'lognorm', 'lomax', 'maxwell', 'mielke', 'nakagami', 'ncx2', 'ncf', 'nct', 'norm', 'pareto', 'pearson3', 'powerlaw', 'powerlognorm', 'powernorm', 'rdist', 'reciprocal', 'rayleigh', 'rice', 'recipinvgauss', 'semicircular', 't', 'triang', 'truncexpon', 'truncnorm', 'tukeylambda', 'uniform', 'vonmises', 'wald', 'weibull_min', 'weibull_max', 'wrapcauchy'] 

There are 82 implemented distribution functions in SciPy 0.12.0. You can test how some of them fit to your data using their fit() method. Check the code below for more details:

import matplotlib.pyplot as plt
import scipy
import scipy.stats
size = 30000
x = scipy.arange(size)
y = scipy.int_(scipy.round_(scipy.stats.vonmises.rvs(5,size=size)*47))
h = plt.hist(y, bins=range(48))

dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']

for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
    plt.plot(pdf_fitted, label=dist_name)
    plt.xlim(0,47)
plt.legend(loc='upper right')
plt.show()

References:

– Fitting distributions, goodness of fit, p-value. Is it possible to do this with Scipy (Python)?

– Distribution fitting with Scipy

And here a list with the names of all distribution functions available in Scipy 0.12.0 (VI):

dist_names = [ 'alpha', 'anglit', 'arcsine', 'beta', 'betaprime', 'bradford', 'burr', 'cauchy', 'chi', 'chi2', 'cosine', 'dgamma', 'dweibull', 'erlang', 'expon', 'exponweib', 'exponpow', 'f', 'fatiguelife', 'fisk', 'foldcauchy', 'foldnorm', 'frechet_r', 'frechet_l', 'genlogistic', 'genpareto', 'genexpon', 'genextreme', 'gausshyper', 'gamma', 'gengamma', 'genhalflogistic', 'gilbrat', 'gompertz', 'gumbel_r', 'gumbel_l', 'halfcauchy', 'halflogistic', 'halfnorm', 'hypsecant', 'invgamma', 'invgauss', 'invweibull', 'johnsonsb', 'johnsonsu', 'ksone', 'kstwobign', 'laplace', 'logistic', 'loggamma', 'loglaplace', 'lognorm', 'lomax', 'maxwell', 'mielke', 'nakagami', 'ncx2', 'ncf', 'nct', 'norm', 'pareto', 'pearson3', 'powerlaw', 'powerlognorm', 'powernorm', 'rdist', 'reciprocal', 'rayleigh', 'rice', 'recipinvgauss', 'semicircular', 't', 'triang', 'truncexpon', 'truncnorm', 'tukeylambda', 'uniform', 'vonmises', 'wald', 'weibull_min', 'weibull_max', 'wrapcauchy'] 

回答 2

@Saullo Castro 提到的 fit() 方法提供最大似然估计(MLE)。哪种分布最适合您的数据,可以用几种不同的方式确定,例如:

1、选择对数似然(log likelihood)最高的那个分布。

2、选择 AIC、BIC 或 BICc 值最小的那个(参见 Wiki:http://en.wikipedia.org/wiki/Akaike_information_criterion;它基本上可以看作针对参数个数做了调整的对数似然,因为参数更多的分布通常拟合得更好)。

3、选择使贝叶斯后验概率最大的那个。(参见 Wiki:http://en.wikipedia.org/wiki/Posterior_probability)

当然,如果您已经有一个可以描述您的数据的分布(基于特定领域的理论)并且要坚持下去,那么您将跳过确定最佳拟合分布的步骤。

scipy 没有提供计算对数似然的现成函数(尽管提供了 MLE 方法),但自己手写一个很容易:参见`scipy.stat.distributions`的内置概率密度函数是否比用户提供的函数慢?

The fit() method mentioned by @Saullo Castro provides maximum likelihood estimates (MLE). The best distribution for your data can be determined in several different ways, such as:

1, the one that gives you the highest log likelihood.

2, the one that gives you the smallest AIC, BIC or BICc values (see wiki: http://en.wikipedia.org/wiki/Akaike_information_criterion; basically it can be viewed as log likelihood adjusted for the number of parameters, as distributions with more parameters are expected to fit better)

3, the one that maximize the Bayesian posterior probability. (see wiki: http://en.wikipedia.org/wiki/Posterior_probability)

Of course, if you already have a distribution that should describe your data (based on the theories in your particular field) and want to stick to that, you will skip the step of identifying the best fit distribution.

scipy does not come with a function to calculate log likelihood (although the MLE method is provided), but hard-coding one is easy: see Is the build-in probability density functions of `scipy.stat.distributions` slower than a user provided one?
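
As a hedged sketch of points 1 and 2, hand-coding the log likelihood and AIC on top of scipy's fit() (the gamma toy sample and the candidate list are arbitrary stand-ins):

import numpy as np
import scipy.stats as st

data = st.gamma.rvs(a=2.0, size=1000)            # toy sample; substitute your own data

for dist in (st.norm, st.gamma, st.lognorm):
    params = dist.fit(data)                       # MLE estimates (shape(s), loc, scale)
    loglik = np.sum(dist.logpdf(data, *params))   # hand-coded log likelihood
    k = len(params)                               # number of fitted parameters
    aic = 2 * k - 2 * loglik                      # Akaike information criterion
    print('{:8s}  logL={:10.2f}  AIC={:10.2f}'.format(dist.name, loglik, aic))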


回答 3

AFAICU(如果我理解正确),您的分布是离散的(而且只能是离散的)。因此,只需统计各个值出现的频率并将其归一化,就足以满足您的目的。下面用一个例子来演示:

In []: from numpy import asarray, bincount
In []: values= [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
In []: counts= asarray(bincount(values), dtype= float)
In []: cdf= counts.cumsum()/ counts.sum()

于是,看到大于 1 的值的概率就是(根据互补累积分布函数 ccdf):

In []: 1- cdf[1]
Out[]: 0.40000000000000002

请注意,ccdf 与生存函数(sf)密切相关,但 ccdf 对离散分布同样有定义,而 sf 只对连续分布有定义。

AFAICU, your distribution is discrete (and nothing but discrete). Therefore just counting the frequencies of different values and normalizing them should be enough for your purposes. So, an example to demonstrate this:

In []: from numpy import asarray, bincount
In []: values= [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
In []: counts= asarray(bincount(values), dtype= float)
In []: cdf= counts.cumsum()/ counts.sum()

Thus, the probability of seeing values higher than 1 is simply (according to the complementary cumulative distribution function (ccdf)):

In []: 1- cdf[1]
Out[]: 0.40000000000000002

Please note that ccdf is closely related to the survival function (sf), but it's also defined for discrete distributions, whereas sf is defined only for continuous distributions.
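
For a continuous fit, the survival function plays the same role as the ccdf above; a minimal sketch (the gamma distribution and the threshold 10 are arbitrary stand-ins for the OP's data):

import numpy as np
import scipy.stats as st

sample = np.random.gamma(2.0, 5.0, size=30000)   # stand-in for the 30,000 observed values

params = st.gamma.fit(sample)                    # fit a candidate continuous distribution
print(st.gamma.sf(10, *params))                  # parametric P(X > 10) via the fitted sf
print(np.mean(sample > 10))                      # empirical ccdf value, for comparison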


回答 4

在我看来,这听起来像是概率密度估计问题。

from scipy.stats import gaussian_kde
occurences = [0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47]
values = range(0,48)
kde = gaussian_kde(map(float, occurences))
p = kde(values)
p = p/sum(p)
print "P(x>=1) = %f" % sum(p[1:])

另请参阅http://jpktd.blogspot.com/2009/03/using-gaussian-kernel-density.html

It sounds like probability density estimation problem to me.

from scipy.stats import gaussian_kde
occurences = [0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47]
values = range(0,48)
kde = gaussian_kde(map(float, occurences))
p = kde(values)
p = p/sum(p)
print "P(x>=1) = %f" % sum(p[1:])

Also see http://jpktd.blogspot.com/2009/03/using-gaussian-kernel-density.html.


回答 5

试试 distfit 库。

pip install distfit

import numpy as np

# Create 1000 random integers, value between [0-50]
X = np.random.randint(0, 50, 1000)

# Retrieve P-value for y
y = [0,10,45,55,100]

# From the distfit library import the class distfit
from distfit import distfit

# Initialize.
# Set any properties here, such as alpha.
# The smoothing can be of use when working with integers. Otherwise your histogram
# may be jumping up-and-down, and getting the correct fit may be harder.
dist = distfit(alpha=0.05, smooth=10)

# Search for best theoretical fit on your empirical data
dist.fit_transform(X)

> [distfit] >fit..
> [distfit] >transform..
> [distfit] >[norm      ] [RSS: 0.0037894] [loc=23.535 scale=14.450] 
> [distfit] >[expon     ] [RSS: 0.0055534] [loc=0.000 scale=23.535] 
> [distfit] >[pareto    ] [RSS: 0.0056828] [loc=-384473077.778 scale=384473077.778] 
> [distfit] >[dweibull  ] [RSS: 0.0038202] [loc=24.535 scale=13.936] 
> [distfit] >[t         ] [RSS: 0.0037896] [loc=23.535 scale=14.450] 
> [distfit] >[genextreme] [RSS: 0.0036185] [loc=18.890 scale=14.506] 
> [distfit] >[gamma     ] [RSS: 0.0037600] [loc=-175.505 scale=1.044] 
> [distfit] >[lognorm   ] [RSS: 0.0642364] [loc=-0.000 scale=1.802] 
> [distfit] >[beta      ] [RSS: 0.0021885] [loc=-3.981 scale=52.981] 
> [distfit] >[uniform   ] [RSS: 0.0012349] [loc=0.000 scale=49.000] 

# Best fitted model
best_distr = dist.model
print(best_distr)

# Uniform shows best fit, with 95% CII (confidence intervals), and all other parameters
> {'distr': <scipy.stats._continuous_distns.uniform_gen at 0x16de3a53160>,
>  'params': (0.0, 49.0),
>  'name': 'uniform',
>  'RSS': 0.0012349021241149533,
>  'loc': 0.0,
>  'scale': 49.0,
>  'arg': (),
>  'CII_min_alpha': 2.45,
>  'CII_max_alpha': 46.55}

# Ranking distributions
dist.summary

# Plot the summary of fitted distributions
dist.plot_summary()

# Make prediction on new datapoints based on the fit
dist.predict(y)

# Retrieve your pvalues with 
dist.y_pred
# array(['down', 'none', 'none', 'up', 'up'], dtype='<U4')
dist.y_proba
array([0.02040816, 0.02040816, 0.02040816, 0.        , 0.        ])

# Or in one dataframe
dist.df

# The plot function will now also include the predictions of y
dist.plot()

请注意,在这种情况下,由于服从均匀分布,所有点都会是显著的。如果需要,可以用 dist.y_pred 进行过滤。

Try the distfit library.

pip install distfit

import numpy as np

# Create 1000 random integers, value between [0-50]
X = np.random.randint(0, 50, 1000)

# Retrieve P-value for y
y = [0,10,45,55,100]

# From the distfit library import the class distfit
from distfit import distfit

# Initialize.
# Set any properties here, such as alpha.
# The smoothing can be of use when working with integers. Otherwise your histogram
# may be jumping up-and-down, and getting the correct fit may be harder.
dist = distfit(alpha=0.05, smooth=10)

# Search for best theoretical fit on your empirical data
dist.fit_transform(X)

> [distfit] >fit..
> [distfit] >transform..
> [distfit] >[norm      ] [RSS: 0.0037894] [loc=23.535 scale=14.450] 
> [distfit] >[expon     ] [RSS: 0.0055534] [loc=0.000 scale=23.535] 
> [distfit] >[pareto    ] [RSS: 0.0056828] [loc=-384473077.778 scale=384473077.778] 
> [distfit] >[dweibull  ] [RSS: 0.0038202] [loc=24.535 scale=13.936] 
> [distfit] >[t         ] [RSS: 0.0037896] [loc=23.535 scale=14.450] 
> [distfit] >[genextreme] [RSS: 0.0036185] [loc=18.890 scale=14.506] 
> [distfit] >[gamma     ] [RSS: 0.0037600] [loc=-175.505 scale=1.044] 
> [distfit] >[lognorm   ] [RSS: 0.0642364] [loc=-0.000 scale=1.802] 
> [distfit] >[beta      ] [RSS: 0.0021885] [loc=-3.981 scale=52.981] 
> [distfit] >[uniform   ] [RSS: 0.0012349] [loc=0.000 scale=49.000] 

# Best fitted model
best_distr = dist.model
print(best_distr)

# Uniform shows best fit, with 95% CII (confidence intervals), and all other parameters
> {'distr': <scipy.stats._continuous_distns.uniform_gen at 0x16de3a53160>,
>  'params': (0.0, 49.0),
>  'name': 'uniform',
>  'RSS': 0.0012349021241149533,
>  'loc': 0.0,
>  'scale': 49.0,
>  'arg': (),
>  'CII_min_alpha': 2.45,
>  'CII_max_alpha': 46.55}

# Ranking distributions
dist.summary

# Plot the summary of fitted distributions
dist.plot_summary()

# Make prediction on new datapoints based on the fit
dist.predict(y)

# Retrieve your pvalues with 
dist.y_pred
# array(['down', 'none', 'none', 'up', 'up'], dtype='<U4')
dist.y_proba
array([0.02040816, 0.02040816, 0.02040816, 0.        , 0.        ])

# Or in one dataframe
dist.df

# The plot function will now also include the predictions of y
dist.plot()

Note that in this case, all points will be significant because of the uniform distribution. You can filter with the dist.y_pred if required.
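
A small self-contained sketch of that filtering step (it assumes, as the output above suggests, that dist.y_pred is aligned element-wise with y):

import numpy as np
from distfit import distfit

X = np.random.randint(0, 50, 1000)
y = np.array([0, 10, 45, 55, 100])

dist = distfit(alpha=0.05)
dist.fit_transform(X)
dist.predict(y)

# keep only the points flagged as significantly low ('down') or high ('up')
significant = y[np.asarray(dist.y_pred) != 'none']
print(significant)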


回答 6

使用OpenTURNS时,我将使用BIC标准来选择适合此类数据的最佳分布。这是因为此标准不会给具有更多参数的分布带来太多优势。实际上,如果分布具有更多参数,则拟合的分布更容易接近数据。此外,在这种情况下,Kolmogorov-Smirnov可能没有意义,因为测量值的微小误差将对p值产生巨大影响。

为了说明这一过程,我加载了El-Nino数据,其中包含1950年至2010年的732次每月温度测量值:

import statsmodels.api as sm
dta = sm.datasets.elnino.load_pandas().data
dta['YEAR'] = dta.YEAR.astype(int).astype(str)
dta = dta.set_index('YEAR').T.unstack()
data = dta.values

使用GetContinuousUniVariateFactories静态方法很容易获得30个内置的单变量分布工厂。完成后,BestModelBIC静态方法将返回最佳模型和相应的BIC分数。

import openturns as ot

sample = ot.Sample(data, 1)
tested_factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
best_model, best_bic = ot.FittingTest.BestModelBIC(sample,
                                                   tested_factories)
print("Best=",best_model)

打印:

Best= Beta(alpha = 1.64258, beta = 2.4348, a = 18.936, b = 29.254)

为了以图形方式将拟合结果与直方图进行比较,我使用最佳分布的 drawPDF 方法。

import openturns.viewer as otv
graph = ot.HistogramFactory().build(sample).drawPDF()
bestPDF = best_model.drawPDF()
bestPDF.setColors(["blue"])
graph.add(bestPDF)
graph.setTitle("Best BIC fit")
name = best_model.getImplementation().getClassName()
graph.setLegends(["Histogram",name])
graph.setXTitle("Temperature (°C)")
otv.View(graph)

这将生成:

BestModelBIC文档中提供了有关此主题的更多详细信息。可以将Scipy分布包括在SciPyDistribution中,甚至可以将ChaosPy分布与ChaosPyDistribution一起包含在内,但是我想当前脚本可以满足大多数实际目的。

With OpenTURNS, I would use the BIC criteria to select the best distribution that fits such data. This is because this criteria does not give too much advantage to the distributions which have more parameters. Indeed, if a distribution has more parameters, it is easier for the fitted distribution to be closer to the data. Moreover, the Kolmogorov-Smirnov may not make sense in this case, because a small error in the measured values will have a huge impact on the p-value.

To illustrate the process, I load the El-Nino data, which contains 732 monthly temperature measurements from 1950 to 2010:

import statsmodels.api as sm
dta = sm.datasets.elnino.load_pandas().data
dta['YEAR'] = dta.YEAR.astype(int).astype(str)
dta = dta.set_index('YEAR').T.unstack()
data = dta.values

It is easy to get the 30 built-in univariate factories of distributions with the GetContinuousUniVariateFactories static method. Once done, the BestModelBIC static method returns the best model and the corresponding BIC score.

import openturns as ot

sample = ot.Sample([[p] for p in data]) # data reshaping
tested_factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
best_model, best_bic = ot.FittingTest.BestModelBIC(sample,
                                                   tested_factories)
print("Best=",best_model)

which prints:

Best= Beta(alpha = 1.64258, beta = 2.4348, a = 18.936, b = 29.254)

In order to graphically compare the fit to the histogram, I use the drawPDF methods of the best distribution.

import openturns.viewer as otv
graph = ot.HistogramFactory().build(sample).drawPDF()
bestPDF = best_model.drawPDF()
bestPDF.setColors(["blue"])
graph.add(bestPDF)
graph.setTitle("Best BIC fit")
name = best_model.getImplementation().getClassName()
graph.setLegends(["Histogram",name])
graph.setXTitle("Temperature (°C)")
otv.View(graph)

This produces:

More details on this topic are presented in the BestModelBIC doc. It would be possible to include the Scipy distribution in the SciPyDistribution or even with ChaosPy distributions with ChaosPyDistribution, but I guess that the current script fulfills most practical purposes.


回答 7

如果我没有理解您的需求,请见谅,不过把数据存进一个字典怎么样?键为 0 到 47 之间的数字,值为相应键在原始列表中出现的次数。
这样,您的概率 p(x) 就是所有大于 x 的键对应的值之和除以 30000。

Forgive me if I don’t understand your need but what about storing your data in a dictionary where keys would be the numbers between 0 and 47 and values the number of occurrences of their related keys in your original list?
Thus your likelihood p(x) will be the sum of all the values for keys greater than x divided by 30000.
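
A minimal sketch of that dictionary approach, using collections.Counter and the same toy list as the earlier ccdf answer (so the result, 0.4, matches):

from collections import Counter

values = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]   # stand-in for the 30,000 samples
counts = Counter(values)
total = float(sum(counts.values()))

def p_greater(x):
    """Empirical P(value > x): counts for keys above x over the total."""
    return sum(c for v, c in counts.items() if v > x) / total

print(p_greater(1))   # 0.4 for this toy list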


从熊猫的数据框中删除无限值?

问题:从熊猫的数据框中删除无限值?

在不重置 mode.use_inf_as_null 的情况下,从 pandas DataFrame 中删除 nan 和 inf/-inf 值的最快/最简单方法是什么?我希望能使用 dropna 的 subset 和 how 参数,只是同时把 inf 视为缺失值,例如:

df.dropna(subset=["col1", "col2"], how="all", with_inf=True)

这可能吗?有没有办法告诉 dropna 在其缺失值定义中包含 inf?

what is the quickest/simplest way to drop nan and inf/-inf values from a pandas DataFrame without resetting mode.use_inf_as_null? I’d like to be able to use the subset and how arguments of dropna, except with inf values considered missing, like:

df.dropna(subset=["col1", "col2"], how="all", with_inf=True)

is this possible? Is there a way to tell dropna to include inf in its definition of missing values?


回答 0

最简单的方法是先用 replace 把 inf 替换为 NaN:

df.replace([np.inf, -np.inf], np.nan)

然后使用dropna

df.replace([np.inf, -np.inf], np.nan).dropna(subset=["col1", "col2"], how="all")

例如:

In [11]: df = pd.DataFrame([1, 2, np.inf, -np.inf])

In [12]: df.replace([np.inf, -np.inf], np.nan)
Out[12]:
    0
0   1
1   2
2 NaN
3 NaN

同样的方法也适用于 Series。

The simplest way would be to first replace infs to NaN:

df.replace([np.inf, -np.inf], np.nan)

and then use the dropna:

df.replace([np.inf, -np.inf], np.nan).dropna(subset=["col1", "col2"], how="all")

For example:

In [11]: df = pd.DataFrame([1, 2, np.inf, -np.inf])

In [12]: df.replace([np.inf, -np.inf], np.nan)
Out[12]:
    0
0   1
1   2
2 NaN
3 NaN

The same method would work for a Series.


回答 1

借助选项上下文(option context),无需永久设置 use_inf_as_na 即可实现。例如:

with pd.option_context('mode.use_inf_as_na', True):
    df = df.dropna(subset=['col1', 'col2'], how='all')

当然,也可以用下面的方式永久地将 inf 视为 NaN:

pd.set_option('use_inf_as_na', True)

对于旧版本,请将 use_inf_as_na 替换为 use_inf_as_null。

With option context, this is possible without permanently setting use_inf_as_na. For example:

with pd.option_context('mode.use_inf_as_na', True):
    df = df.dropna(subset=['col1', 'col2'], how='all')

Of course it can be set to treat inf as NaN permanently with

pd.set_option('use_inf_as_na', True)

For older versions, replace use_inf_as_na with use_inf_as_null.


回答 2

这是另一种方法:在 Series 上用 .loc 将 inf 替换为 nan:

s.loc[(~np.isfinite(s)) & s.notnull()] = np.nan

因此,针对原始问题:

df = pd.DataFrame(np.ones((3, 3)), columns=list('ABC'))

for i in range(3): 
    df.iat[i, i] = np.inf

df
          A         B         C
0       inf  1.000000  1.000000
1  1.000000       inf  1.000000
2  1.000000  1.000000       inf

df.sum()
A    inf
B    inf
C    inf
dtype: float64

df.apply(lambda s: s[np.isfinite(s)].dropna()).sum()
A    2
B    2
C    2
dtype: float64

Here is another method using .loc to replace inf with nan on a Series:

s.loc[(~np.isfinite(s)) & s.notnull()] = np.nan

So, in response to the original question:

df = pd.DataFrame(np.ones((3, 3)), columns=list('ABC'))

for i in range(3): 
    df.iat[i, i] = np.inf

df
          A         B         C
0       inf  1.000000  1.000000
1  1.000000       inf  1.000000
2  1.000000  1.000000       inf

df.sum()
A    inf
B    inf
C    inf
dtype: float64

df.apply(lambda s: s[np.isfinite(s)].dropna()).sum()
A    2
B    2
C    2
dtype: float64

回答 3

使用(快速简单):

df = df[np.isfinite(df).all(1)]

该答案基于 DougR 在另一个问题中的回答。下面是示例代码:

import pandas as pd
import numpy as np
df=pd.DataFrame([1,2,3,np.nan,4,np.inf,5,-np.inf,6])
print('Input:\n',df,sep='')
df = df[np.isfinite(df).all(1)]
print('\nDropped:\n',df,sep='')

结果:

Input:
    0
0  1.0000
1  2.0000
2  3.0000
3     NaN
4  4.0000
5     inf
6  5.0000
7    -inf
8  6.0000

Dropped:
     0
0  1.0
1  2.0
2  3.0
4  4.0
6  5.0
8  6.0

Use (fast and simple):

df = df[np.isfinite(df).all(1)]

This answer is based on DougR’s answer in another question. Here is some example code:

import pandas as pd
import numpy as np
df=pd.DataFrame([1,2,3,np.nan,4,np.inf,5,-np.inf,6])
print('Input:\n',df,sep='')
df = df[np.isfinite(df).all(1)]
print('\nDropped:\n',df,sep='')

Result:

Input:
    0
0  1.0000
1  2.0000
2  3.0000
3     NaN
4  4.0000
5     inf
6  5.0000
7    -inf
8  6.0000

Dropped:
     0
0  1.0
1  2.0
2  3.0
4  4.0
6  5.0
8  6.0

回答 4

另一个解决方案是使用 isin 方法:先用它判断每个值是否为无穷或缺失,再链式调用 all 方法判断行中是否所有值都是无穷或缺失。

最后,使用该结果的否定值通过布尔索引选择不具有所有无限值或缺失值的行。

all_inf_or_nan = df.isin([np.inf, -np.inf, np.nan]).all(axis='columns')
df[~all_inf_or_nan]

Yet another solution would be to use the isin method. Use it to determine whether each value is infinite or missing and then chain the all method to determine if all the values in the rows are infinite or missing.

Finally, use the negation of that result to select the rows that don’t have all infinite or missing values via boolean indexing.

all_inf_or_nan = df.isin([np.inf, -np.inf, np.nan]).all(axis='columns')
df[~all_inf_or_nan]

回答 5

上面的解决方案会把目标列之外的 inf 也一并修改。要避免这一点,可以:

lst = [np.inf, -np.inf]
to_replace = {v: lst for v in ['col1', 'col2']}
df.replace(to_replace, np.nan)

The above solution will modify the infs that are not in the target columns. To remedy that,

lst = [np.inf, -np.inf]
to_replace = {v: lst for v in ['col1', 'col2']}
df.replace(to_replace, np.nan)

回答 6

您可以将 pd.DataFrame.mask 与 np.isinf 搭配使用。首先应确保 DataFrame 的各列均为 float 类型,然后照常使用您已有的 dropna 逻辑。

print(df)

       col1      col2
0 -0.441406       inf
1 -0.321105      -inf
2 -0.412857  2.223047
3 -0.356610  2.513048

df = df.mask(np.isinf(df))

print(df)

       col1      col2
0 -0.441406       NaN
1 -0.321105       NaN
2 -0.412857  2.223047
3 -0.356610  2.513048

You can use pd.DataFrame.mask with np.isinf. You should ensure first your dataframe series are all of type float. Then use dropna with your existing logic.

print(df)

       col1      col2
0 -0.441406       inf
1 -0.321105      -inf
2 -0.412857  2.223047
3 -0.356610  2.513048

df = df.mask(np.isinf(df))

print(df)

       col1      col2
0 -0.441406       NaN
1 -0.321105       NaN
2 -0.412857  2.223047
3 -0.356610  2.513048

用pip安装SciPy

问题:用pip安装SciPy

可以使用 pip 通过 pip install numpy 安装 NumPy。

SciPy 是否也有类似的办法?(直接执行 pip install scipy 不起作用。)


更新

现在可以用 pip 安装 SciPy 软件包了!

It is possible to install NumPy with pip using pip install numpy.

Is there a similar possibility with SciPy? (Doing pip install scipy does not work.)


Update

The package SciPy is now available to be installed with pip!


回答 0

尝试 easy_install 可以看出,问题出在它们在 Python Package Index(也就是 pip 搜索的索引)中的条目上。

easy_install scipy
Searching for scipy
Reading http://pypi.python.org/simple/scipy/
Reading http://www.scipy.org
Reading http://sourceforge.net/project/showfiles.php?group_id=27747&package_id=19531
Reading http://new.scipy.org/Wiki/Download

不过,并非毫无办法;pip 可以从 Subversion(SVN)、Git、Mercurial 和 Bazaar 存储库安装。SciPy 使用 SVN:

pip install svn+http://svn.scipy.org/svn/scipy/trunk/#egg=scipy

更新(12-2012):

pip install git+https://github.com/scipy/scipy.git

由于NumPy是依赖项,因此也应安装它。

An attempt to easy_install indicates a problem with their listing in the Python Package Index, which pip searches.

easy_install scipy
Searching for scipy
Reading http://pypi.python.org/simple/scipy/
Reading http://www.scipy.org
Reading http://sourceforge.net/project/showfiles.php?group_id=27747&package_id=19531
Reading http://new.scipy.org/Wiki/Download

All is not lost, however; pip can install from Subversion (SVN), Git, Mercurial, and Bazaar repositories. SciPy uses SVN:

pip install svn+http://svn.scipy.org/svn/scipy/trunk/#egg=scipy

Update (12-2012):

pip install git+https://github.com/scipy/scipy.git

Since NumPy is a dependency, it should be installed as well.


回答 1

先决条件:

sudo apt-get install build-essential gfortran libatlas-base-dev python-pip python-dev
sudo pip install --upgrade pip

实际软件包:

sudo pip install numpy
sudo pip install scipy

可选软件包:

sudo pip install matplotlib   OR  sudo apt-get install python-matplotlib
sudo pip install -U scikit-learn
sudo pip install pandas

src

Prerequisite:

sudo apt-get install build-essential gfortran libatlas-base-dev python-pip python-dev
sudo pip install --upgrade pip

Actual packages:

sudo pip install numpy
sudo pip install scipy

Optional packages:

sudo pip install matplotlib   OR  sudo apt-get install python-matplotlib
sudo pip install -U scikit-learn
sudo pip install pandas

src


回答 2

在 Ubuntu 10.04(Lucid)中,安装了部分依赖项之后,我就可以(在 virtualenv 中)成功执行 pip install scipy,具体如下:

$ sudo apt-get install libamd2.2.0 libblas3gf libc6 libgcc1 libgfortran3 liblapack3gf libumfpack5.4.0 libstdc++6 build-essential gfortran libatlas-sse2-dev python-all-dev

In Ubuntu 10.04 (Lucid), I could successfully pip install scipy (within a virtualenv) after installing some of its dependencies, in particular:

$ sudo apt-get install libamd2.2.0 libblas3gf libc6 libgcc1 libgfortran3 liblapack3gf libumfpack5.4.0 libstdc++6 build-essential gfortran libatlas-sse2-dev python-all-dev

回答 3

要在Windows上安装scipy,请遵循以下说明:-

步骤 1:打开此链接 http://www.lfd.uci.edu/~gohlke/pythonlibs/#scipy 下载 scipy 的 .whl 文件(例如 scipy-0.17.0-cp34-none-win_amd64.whl)。

步骤 2:在命令提示符中进入下载文件所在的目录(cd folder-name)。

步骤3:运行以下命令:

pip install scipy-0.17.0-cp27-none-win_amd64.whl

To install scipy on windows follow these instructions:-

Step-1 : Press this link http://www.lfd.uci.edu/~gohlke/pythonlibs/#scipy to download a scipy .whl file (e.g. scipy-0.17.0-cp34-none-win_amd64.whl).

Step-2: Go to the directory where that download file is there from the command prompt (cd folder-name ).

Step-3: Run this command:

pip install scipy-0.17.0-cp27-none-win_amd64.whl

回答 4

我尝试了以上所有方法,但对我没有任何帮助。这解决了我所有的问题:

pip install -U numpy

pip install -U scipy

请注意,pip install 的 -U 选项表示请求升级软件包。如果不加该选项,当软件包已安装时,pip 会提示这一情况,然后什么也不做就退出。

I tried all the above and nothing worked for me. This solved all my problems:

pip install -U numpy

pip install -U scipy

Note that the -U option to pip install requests that the package be upgraded. Without it, if the package is already installed pip will inform you of this and exit without doing anything.


回答 5

如果我首先将BLAS,LAPACK和GCC Fortran作为系统软件包安装(我正在使用Arch Linux),则可以通过以下方式安装SciPy:

pip install scipy

If I first install BLAS, LAPACK and GCC Fortran as system packages (I’m using Arch Linux), I can get SciPy installed with:

pip install scipy

回答 6

在Fedora上,这有效:

sudo yum install -y python-pip
sudo yum install -y lapack lapack-devel blas blas-devel 
sudo yum install -y blas-static lapack-static
sudo pip install numpy
sudo pip install scipy

如果下载时出现任何 public key 错误,请给 yum 加上 --nogpgcheck 参数,例如:yum --nogpgcheck install blas-devel

从Fedora 23开始,使用dnf代替yum

On Fedora, this works:

sudo yum install -y python-pip
sudo yum install -y lapack lapack-devel blas blas-devel 
sudo yum install -y blas-static lapack-static
sudo pip install numpy
sudo pip install scipy

If you get any public key errors while downloading, add --nogpgcheck as parameter to yum, for example: yum --nogpgcheck install blas-devel

On Fedora 23 onwards, use dnf instead of yum.


回答 7

对于Arch Linux用户:

运行 pip install --user scipy 之前,需要先安装以下 Arch 软件包:

  • gcc-fortran
  • blas
  • lapack

For the Arch Linux users:

pip install --user scipy prerequisites the following Arch packages to be installed:

  • gcc-fortran
  • blas
  • lapack

回答 8

针对 Ubuntu(Ubuntu 10.04 LTS(Lucid Lynx))的补充:

存储库已移动,但是

pip install -e git+http://github.com/scipy/scipy/#egg=scipy

对我来说失败了……通过以下步骤最终成功(在虚拟环境中以 root 身份操作,python3 是指向 Python 3.2.2 的链接):先安装 Ubuntu 依赖项(请参阅 elaichi),再克隆 NumPy 和 SciPy:

git clone git://github.com/scipy/scipy.git scipy

git clone git://github.com/numpy/numpy.git numpy

生成NumPy(在numpy文件夹中):

python3 setup.py build --fcompiler=gnu95

安装SciPy(在scipy文件夹中):

python3 setup.py install

Addon for Ubuntu (Ubuntu 10.04 LTS (Lucid Lynx)):

The repository moved, but a

pip install -e git+http://github.com/scipy/scipy/#egg=scipy

failed for me… With the following steps, it finally worked out (as root in a virtual environment, where python3 is a link to Python 3.2.2): install the Ubuntu dependencies (see elaichi), clone NumPy and SciPy:

git clone git://github.com/scipy/scipy.git scipy

git clone git://github.com/numpy/numpy.git numpy

Build NumPy (within the numpy folder):

python3 setup.py build --fcompiler=gnu95

Install SciPy (within the scipy folder):

python3 setup.py install

回答 9

就我而言,直到我另外安装了以下软件包才成功:libatlas-base-dev 和 gfortran

 sudo apt-get install libatlas-base-dev gfortran

然后运行pip install scipy

In my case, it wasn’t working until I also installed the following package : libatlas-base-dev, gfortran

 sudo apt-get install libatlas-base-dev gfortran

Then run pip install scipy


回答 10

  1. 安装python-3.4.4
  2. scipy-0.15.1-win32-superpack-python3.4
  3. 执行以下命令:

py -m pip install --upgrade pip
py -m pip install numpy
py -m pip install matplotlib
py -m pip install scipy
py -m pip install scikit-learn

  1. install python-3.4.4
  2. scipy-0.15.1-win32-superpack-python3.4
  3. run the following commands:

py -m pip install --upgrade pip
py -m pip install numpy
py -m pip install matplotlib
py -m pip install scipy
py -m pip install scikit-learn

回答 11

答案是肯定的。

首先,您可以用以下命令轻松安装 numpy:

pip install numpy

然后需要安装 SciPy 所依赖的 MKL,可以在这里下载

下载 file_name.whl 后,按如下方式安装:

C:\Users\****\Desktop\a> pip install mkl_service-1.1.2-cp35-cp35m-win32.whl
Processing c:\users\****\desktop\a\mkl_service-1.1.2-cp35-cp35m-win32.whl 
Installing collected packages: mkl-service    
Successfully installed mkl-service-1.1.2

然后,您可以在同一网站上下载scipy-0.18.1-cp35-cp35m-win32.whl

注意:应根据您的 Python 版本下载 file_name.whl;如果您的 Python 是 32 位的 python3.5,就应该下载上面那个文件,其中“win32”指的是您的 Python 版本,而不是操作系统版本。

然后像这样安装file_name.whl:

C:\Users\****\Desktop\a>pip install scipy-0.18.1-cp35-cp35m-win32.whl
Processing c:\users\****\desktop\a\scipy-0.18.1-cp35-cp35m-win32.whl
Installing collected packages: scipy
Successfully installed scipy-0.18.1

接下来只剩一件事要做:注释掉特定的一行,否则执行“import scipy”命令时会出现错误消息。

所以注释掉这行

from numpy._distributor_init import NUMPY_MKL  # requires numpy+mkl

在此文件中:your_own_path\lib\site-packages\scipy\__init__.py

然后您可以使用SciPy :)

这里告诉您更多有关最后一步的信息。

这里是一个类似问题的回答。

The answer is yes, there is.

First you can easily install numpy use commands:

pip install numpy

Then you should install mkl, which is required by Scipy, and you can download it here

After download the file_name.whl you install it

C:\Users\****\Desktop\a> pip install mkl_service-1.1.2-cp35-cp35m-win32.whl
Processing c:\users\****\desktop\a\mkl_service-1.1.2-cp35-cp35m-win32.whl 
Installing collected packages: mkl-service    
Successfully installed mkl-service-1.1.2

Then at the same website you can download scipy-0.18.1-cp35-cp35m-win32.whl

Note:You should download the file_name.whl according to you python version, if you python version is 32bit python3.5 you should download this one, and the “win32” is about your python version, not your operating system version.

Then install file_name.whl like this:

C:\Users\****\Desktop\a>pip install scipy-0.18.1-cp35-cp35m-win32.whl
Processing c:\users\****\desktop\a\scipy-0.18.1-cp35-cp35m-win32.whl
Installing collected packages: scipy
Successfully installed scipy-0.18.1

Then there is only one more thing to do: comment out a specific line, or there will be error messages when you input the command “import scipy”.

So comment out this line

from numpy._distributor_init import NUMPY_MKL  # requires numpy+mkl

in this file: your_own_path\lib\site-packages\scipy\__init__.py

Then you can use SciPy :)

Here tells you more about the last step.

Here is a similar answer to a similar question.


回答 12

除了所有这些答案之外,如果在64位计算机上安装32位python,则无论您的计算机如何,都必须下载32位scipy。 http://www.lfd.uci.edu/~gohlke/pythonlibs/ 在上述URL中,您可以下载软件包,命令为:pip install

Besides all of these answers, If you install python of 32bit on your 64bit machine, you have to download scipy of 32-bit irrespective of your machine. http://www.lfd.uci.edu/~gohlke/pythonlibs/ In the above URL you can download the packages and command is: pip install


回答 13

对于gentoo,它位于主存储库中: emerge --ask scipy

For gentoo, it’s in the main repository: emerge --ask scipy


回答 14

在 Windows 上使用 Python 3.6 时也可以这样安装:python -m pip install scipy

You can also use this in windows with python 3.6 python -m pip install scipy


康达能否取代对virtualenv的需求?

问题:康达能否取代对virtualenv的需求?

我最近在安装 SciPy 时遇到了麻烦(特别是在我正在开发的一个 Heroku 应用上),随后发现了 Conda。

使用Conda,您可以创建与virtualenv十分相似的环境。我的问题是:

  1. 如果我使用 Conda,它能取代对 virtualenv 的需求吗?如果不能,该如何把两者结合使用?是在 Conda 中安装 virtualenv,还是在 virtualenv 中安装 Conda?
  2. 我还需要使用 pip 吗?如果需要,我还能在隔离环境中用 pip 安装软件包吗?

I recently discovered Conda after I was having trouble installing SciPy, specifically on a Heroku app that I am developing.

With Conda you create environments, very similar to what virtualenv does. My questions are:

  1. If I use Conda will it replace the need for virtualenv? If not, how do I use the two together? Do I install virtualenv in Conda, or Conda in virtualenv?
  2. Do I still need to use pip? If so, will I still be able to install packages with pip in an isolated environment?

回答 0

  1. Conda 取代了 virtualenv。我认为它更好:不仅限于 Python,还可以用于其他语言。以我的经验,它提供了流畅得多的体验,尤其是对科学计算软件包而言。我第一次在 Mac 上正确装好 MayaVi 就是用 conda。

  2. 您仍然可以使用 pip。实际上,conda 会在每个新环境中安装 pip,并且它了解通过 pip 安装的软件包。

例如:

conda list

列出当前环境中所有已安装的软件包。Conda安装的软件包显示如下:

sphinx_rtd_theme          0.1.7                    py35_0    defaults

而通过 pip 安装的则带有 <pip> 标记:

wxpython-common           3.0.0.0                   <pip>
  1. Conda replaces virtualenv. In my opinion it is better. It is not limited to Python but can be used for other languages too. In my experience it provides a much smoother experience, especially for scientific packages. The first time I got MayaVi properly installed on Mac was with conda.

  2. You can still use pip. In fact, conda installs pip in each new environment. It knows about pip-installed packages.

For example:

conda list

lists all installed packages in your current environment. Conda-installed packages show up like this:

sphinx_rtd_theme          0.1.7                    py35_0    defaults

and the ones installed via pip have the <pip> marker:

wxpython-common           3.0.0.0                   <pip>

回答 1

简短的答案是,您只需要conda。

  1. Conda在单个软件包中有效地结合了pip和virtualenv的功能,因此,如果您使用的是conda,则不需要virtualenv。

  2. 您会惊讶conda支持多少个软件包。如果还不够,可以在conda下使用pip。

这是到conda页面的链接,用于比较conda,pip和virtualenv:

https://docs.conda.io/projects/conda/zh-CN/latest/commands.html#conda-vs-pip-vs-virtualenv-commands

Short answer is, you only need conda.

  1. Conda effectively combines the functionality of pip and virtualenv in a single package, so you do not need virtualenv if you are using conda.

  2. You would be surprised how many packages conda supports. If it is not enough, you can use pip under conda.

Here is a link to the conda page comparing conda, pip and virtualenv:

https://docs.conda.io/projects/conda/en/latest/commands.html#conda-vs-pip-vs-virtualenv-commands.


回答 2

虚拟环境和 pip

我想补充一点:使用 Anaconda 可以轻松创建和删除 conda 环境。

> conda create --name <envname> python=<version> <optional dependencies>

> conda remove --name <envname> --all 

激活的环境中,通过conda或安装软件包pip

(envname)> conda install <package>

(envname)> pip install <package>

这些环境与conda的pip式软件包管理紧密相关,因此创建环境以及安装Python和非Python软件包都很简单。


Jupyter

此外,在环境中安装 ipykernel 会在 Jupyter 笔记本的“Kernel”下拉菜单中添加一个新条目,把可复现的环境延伸到笔记本中。从 Anaconda 4.1 起加入了 nbextensions,可以更方便地为笔记本添加扩展。

可靠性

以我的经验,conda 在安装 numpy 和 pandas 这类大型库时更快、更可靠。此外,如果您想转移某个环境的既有状态,可以通过共享或克隆该环境来实现。

Virtual Environments and pip

I will add that creating and removing conda environments is simple with Anaconda.

> conda create --name <envname> python=<version> <optional dependencies>

> conda remove --name <envname> --all 

In an activated environment, install packages via conda or pip:

(envname)> conda install <package>

(envname)> pip install <package>

These environments are strongly tied to conda’s pip-like package management, so it is simple to create environments and install both Python and non-Python packages.


Jupyter

In addition, installing ipykernel in an environment adds a new listing in the Kernels dropdown menu of Jupyter notebooks, extending reproducible environments to notebooks. As of Anaconda 4.1, nbextensions were added, adding extensions to notebooks more easily.

Reliability

In my experience, conda is faster and more reliable at installing large libraries such as numpy and pandas. Moreover, if you wish to transfer the preserved state of an environment, you can do so by sharing or cloning an env.


回答 3

安装Conda将使您能够根据需要创建和删除python环境,从而为您提供与virtualenv相同的功能。

对于这两种发行版,您都能创建一个隔离的文件系统树,在其中按需安装和删除 Python 软件包(比如用 pip)。当您想为不同用例使用同一个库的不同版本,或者只是想试用某个发行版、之后删除以节省磁盘空间时,这会很方便。

差异:

许可协议:virtualenv 采用最宽松的 MIT 许可证,而 Conda 使用 3-clause BSD 许可证。

Conda 自带自己的软件包管理系统。该系统通常(针对大多数主流系统)提供流行的非 Python 软件的预编译版本,这能让某些机器学习软件包更容易跑起来;也就是说,您不必为自己的系统编译优化过的 C/C++ 代码。这对我们大多数人来说是极大的省事,但可能影响这类库的性能。

与 virtualenv 不同,Conda(至少在 Linux 系统上)会复制一些系统库。这些库可能失去同步,导致程序行为不一致。

结论:

Conda 很棒,应该是您踏上机器学习之路时的默认选择。它能让您省去折腾 gcc 和大量软件包的时间。但 Conda 并没有取代 virtualenv:它引入了一些并非总是需要的额外复杂性,而且采用不同的许可证。在分布式环境或 HPC 硬件上,您可能要避免使用 conda。

Installing Conda will enable you to create and remove python environments as you wish, therefore providing you with the same functionality as virtualenv would.

In case of both distributions you would be able to create an isolated filesystem tree, where you can install and remove python packages (probably with pip) as you wish, which might come in handy if you want different versions of the same library for different use cases, or you just want to try some distribution and remove it afterwards to conserve disk space.

Differences:

License agreement. While virtualenv comes under most liberal MIT license, Conda uses 3 clause BSD license.

Conda provides you with their own package control system. This package control system often provides precompiled versions (for most popular systems) of popular non-python software, which can ease the way to getting some machine learning packages working. Namely, you don’t have to compile optimized C/C++ code for your system. While it is a great relief for most of us, it might affect performance of such libraries.

Unlike virtualenv, Conda duplicates some system libraries, at least on Linux systems. These libraries can get out of sync, leading to inconsistent behavior of your programs.

Verdict:

Conda is great and should be your default choice while starting your way with machine learning. It will save you some time messing with gcc and numerous packages. Yet, Conda does not replace virtualenv. It introduces some additional complexity which might not always be desired. It comes under different license. You might want to avoid using conda on a distributed environments or on HPC hardware.


回答 4

我两者都在用,(截至 2020 年 1 月)它们有一些表面上的差异,对我来说对应不同的用法。默认情况下,Conda 倾向于在一个中央位置统一管理您的环境列表,而 virtualenv 则在当前目录中创建一个文件夹。前者(集中式)适合例如做机器学习的情形:您只有几个跨许多项目通用的大环境,并希望能从任何地方进入它们。后者(每个项目一个文件夹)适合做一次性小项目的情形:每个项目的库依赖完全不同,这些依赖实际上更属于项目本身。

Conda创建的空环境约为122MB,而virtualenv约为12MB,因此,这是您可能不希望将Conda环境分散在各处的另一个原因。

最后,另一个表明 Conda 偏爱集中式环境的表面迹象是(同样是默认情况下):如果您在自己的项目文件夹里创建并激活一个 Conda 环境,shell 中显示的名称前缀会是该文件夹的(长得过分的)绝对路径。您可以通过给环境命名来解决,而 virtualenv 默认就做对了。

我预计随着这两个包管理器争夺主导地位,这些信息会很快过时,但这就是今天的权衡 :)

I use both and (as of Jan, 2020) they have some superficial differences that lend themselves to different usages for me. By default Conda prefers to manage a list of environments for you in a central location, whereas virtualenv makes a folder in the current directory. The former (centralized) makes sense if you are e.g. doing machine learning and just have a couple of broad environments that you use across many projects and want to jump into them from anywhere. The latter (per project folder) makes sense if you are doing little one-off projects that have completely different sets of lib requirements that really belong more to the project itself.

The empty environment that Conda creates is about 122MB whereas the virtualenv’s is about 12MB, so that’s another reason you may prefer not to scatter Conda environments around everywhere.

Finally, another superficial indication that Conda prefers its centralized envs is that (again, by default) if you do create a Conda env in your own project folder and activate it the name prefix that appears in your shell is the (way too long) absolute path to the folder. You can fix that by giving it a name, but virtualenv does the right thing by default.

I expect this info to become stale rapidly as the two package managers vie for dominance, but these are the trade-offs as of today :)


回答 5

Pipenv是另一个新的选择,也是我当前首选的启动和运行环境的方法。

目前,它是Python.org官方推荐的Python打包工具

Another new option and my current preferred method of getting an environment up and running is Pipenv

It is currently the officially recommended Python packaging tool from Python.org


回答 6

是的,conda 比 virtualenv 容易安装得多,而且几乎可以完全取代后者。

Yes, conda is a lot easier to install than virtualenv, and pretty much replaces the latter.


回答 7

我在一家公司工作,身处多层防火墙之后,使用的机器没有管理员权限。

以我有限的 Python 使用经验(两年),我遇到过几个库(JayDeBeApi、sasl),用 pip 安装时会抛出 C++ 依赖错误:error: Microsoft Visual C++ 14.0 is required. Get it with “Microsoft Visual C++ Build Tools”: http://landinghub.visualstudio.com/visual-cpp-build-tools

这些库用 conda 安装得很顺利,所以从那时起我就开始使用 conda 环境了。不过,要阻止 conda 把依赖装进 c.programfiles(我在那里没有写权限)并不容易。

I work in corporate, behind several firewalls, with a machine on which I have no admin access.

In my limited experience with Python (2 years) I have come across a few libraries (JayDeBeApi, sasl) which, when installed via pip, threw C++ dependency errors: error: Microsoft Visual C++ 14.0 is required. Get it with “Microsoft Visual C++ Build Tools”: http://landinghub.visualstudio.com/visual-cpp-build-tools

These installed fine with conda, so since then I have been working with conda envs. However, it isn’t easy to stop conda from installing dependencies inside c.programfiles, where I don’t have write access.