I’m trying to create required libraries in a package I’m distributing. It requires both the SciPy and NumPy libraries.
While developing, I installed both using
apt-get install scipy
which installed SciPy 0.9.0 and NumPy 1.5.1, and it worked fine.
I would like to do the same using pip install, in order to be able to specify dependencies in the setup.py of my own package.
The problem is, when I try:
pip install 'numpy==1.5.1'
it works fine.
But then
pip install 'scipy==0.9.0'
fails miserably, with
raise self.notfounderror(self.notfounderror.__doc__)
numpy.distutils.system_info.BlasNotFoundError:
Blas (http://www.netlib.org/blas/) libraries not found.
Directories to search for the libraries can be specified in the
numpy/distutils/site.cfg file (section [blas]) or by setting
the BLAS environment variable.
Follow the instructions to download, build and export the env variable for BLAS and then LAPACK. Be careful to not just blindly cut’n’paste the shell commands – there will be a few lines you need to select depending on your architecture, etc., and you’ll need to fix/add the correct directories that it incorrectly assumes as well.
The third thing you may need is to yum install numpy-f2py or the equivalent.
Oh, yes and lastly, you may need to yum install gcc-gfortran as the libraries above are Fortran source.
Since the previous instructions for installing with yum are broken, here are the updated instructions for installing on something like Fedora. I’ve tested this on “Amazon Linux AMI 2016.03”.
I was working on a project that depended on numpy and scipy. In a clean installation of Fedora 23, using a Python virtual environment for Python 3.4 (it also worked for Python 2.7), I had the following in my setup.py (inside the setup() call):
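The snippet referenced here did not survive extraction; a minimal sketch of what such a setup() call might contain (the package name is a placeholder, and the version pins are only an assumption taken from the question above):
from setuptools import setup
setup(
    name='mypackage',        # placeholder package name
    version='0.1',
    install_requires=[
        'numpy>=1.5.1',      # versions mentioned in the question
        'scipy>=0.9.0',
    ],
)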
What operating system is this? The answer might depend on the OS involved. However, it looks like you need to find this BLAS library and install it. It doesn’t seem to be in pip (so you’ll have to do it by hand), but if you install it, it ought to let you progress with your SciPy install.
I have a set of data and I want to compare which line describes it best (polynomials of different orders, exponential or logarithmic).
I use Python and Numpy and for polynomial fitting there is a function polyfit(). But I found no such functions for exponential and logarithmic fitting.
Are there any? Or how else should I solve this?
Answer 0
For fitting y = A + B log x, just fit y against (log x).
>>> x = numpy.array([1, 7, 20, 50, 79])
>>> y = numpy.array([10, 19, 30, 35, 51])
>>> numpy.polyfit(numpy.log(x), y, 1)
array([ 8.46295607, 6.61867463])
# y ≈ 8.46 log(x) + 6.62
For fitting y = Ae^(Bx), taking the logarithm of both sides gives log y = log A + Bx. So fit (log y) against x.
Note that fitting (log y) as if it were linear will emphasize small values of y, causing large deviations for large y. This is because polyfit (linear regression) works by minimizing ∑_i (ΔY_i)^2 = ∑_i (Y_i − Ŷ_i)^2. When Y_i = log y_i, the residuals ΔY_i = Δ(log y_i) ≈ Δy_i / |y_i|. So even if polyfit makes a very bad decision for large y, the “divide-by-|y|” factor will compensate for it, causing polyfit to favor small values.
This could be alleviated by giving each entry a “weight” proportional to y. polyfit supports weighted-least-squares via the w keyword argument.
>>> x = numpy.array([10, 19, 30, 35, 51])
>>> y = numpy.array([1, 7, 20, 50, 79])
>>> numpy.polyfit(x, numpy.log(y), 1)
array([ 0.10502711, -0.40116352])
# y ≈ exp(-0.401) * exp(0.105 * x) = 0.670 * exp(0.105 * x)
# (^ biased towards small values)
>>> numpy.polyfit(x, numpy.log(y), 1, w=numpy.sqrt(y))
array([ 0.06009446, 1.41648096])
# y ≈ exp(1.42) * exp(0.0601 * x) = 4.12 * exp(0.0601 * x)
# (^ not so biased)
Note that Excel, LibreOffice and most scientific calculators typically use the unweighted (biased) formula for exponential regression / trend lines. If you want your results to be compatible with these platforms, do not include the weights, even if weighting gives better results.
Now, if you can use scipy, you could use scipy.optimize.curve_fit to fit any model without transformations.
For y = A + B log x the result is the same as the transformation method:
For y = Ae^(Bx), however, we can get a better fit since it minimizes Δy directly rather than Δ(log y). But we need to provide an initial guess so curve_fit can reach the desired local minimum.
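The curve_fit snippets for these two cases appear to have been lost in extraction; a sketch of what they might look like, reusing the x and y arrays from the polyfit examples above (the p0 starting guess is an assumption):
>>> from scipy.optimize import curve_fit
>>> # y = A + B*log(x): same data as the logarithmic example above
>>> x = numpy.array([1, 7, 20, 50, 79])
>>> y = numpy.array([10, 19, 30, 35, 51])
>>> curve_fit(lambda t, a, b: a + b * numpy.log(t), x, y)
>>> # y = A*exp(B*x): same data as the exponential example above,
>>> # with an initial guess p0 so the optimizer lands in the right minimum
>>> x = numpy.array([10, 19, 30, 35, 51])
>>> y = numpy.array([1, 7, 20, 50, 79])
>>> curve_fit(lambda t, a, b: a * numpy.exp(b * t), x, y, p0=(4, 0.1))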
You can also fit a set of data to whatever function you like using curve_fit from scipy.optimize. For example, if you want to fit an exponential function (from the documentation):
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def func(x, a, b, c):
    return a * np.exp(-b * x) + c
x = np.linspace(0,4,50)
y = func(x, 2.5, 1.3, 0.5)
yn = y + 0.2*np.random.normal(size=len(x))
popt, pcov = curve_fit(func, x, yn)
(Note: the * in front of popt when you plot will expand out the terms into the a, b, and c that func is expecting.)
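A sketch of the plotting call that note refers to, reusing x, yn, func and popt from the block above:
plt.plot(x, yn, 'ko', label='Noisy data')
plt.plot(x, func(x, *popt), 'r-', label='Fitted curve')
plt.legend()
plt.show()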
Answer 2
I was having some trouble with this, so let me be very explicit so noobs like me can understand.
Let's say that we have a data file or something like that:
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import numpy as np
import sympy as sym
"""
Generate some data, let's imagine that you already have this.
"""
x = np.linspace(0, 3, 50)
y = np.exp(x)
"""
Plot your data
"""
plt.plot(x, y, 'ro',label="Original Data")
"""
brute force to avoid errors
"""
x = np.array(x, dtype=float) #transform your data in a numpy array of floats
y = np.array(y, dtype=float) #so the curve_fit can work
"""
create a function to fit with your data. a, b, c and d are the coefficients
that curve_fit will calculate for you.
In this part you need to guess and/or use mathematical knowledge to find
a function that resembles your data
"""
def func(x, a, b, c, d):
    return a*x**3 + b*x**2 + c*x + d
"""
make the curve_fit
"""
popt, pcov = curve_fit(func, x, y)
"""
The result is:
popt[0] = a , popt[1] = b, popt[2] = c and popt[3] = d of the function,
so f(x) = popt[0]*x**3 + popt[1]*x**2 + popt[2]*x + popt[3].
"""
print "a = %s , b = %s, c = %s, d = %s" % (popt[0], popt[1], popt[2], popt[3])
"""
Use sympy to generate the LaTeX syntax of the function
"""
xs = sym.Symbol('\lambda')
tex = sym.latex(func(xs,*popt)).replace('$', '')
plt.title(r'$f(\lambda)= %s$' %(tex),fontsize=16)
"""
Print the coefficients and plot the function.
"""
plt.plot(x, func(x, *popt), label="Fitted Curve") #same as line above \/
#plt.plot(x, popt[0]*x**3 + popt[1]*x**2 + popt[2]*x + popt[3], label="Fitted Curve")
plt.legend(loc='upper left')
plt.show()
the result is:
a = 0.849195983017 , b = -1.18101681765, c = 2.24061176543, d = 0.816643894816
Answer 3
Well, I guess you can always use:
np.log --> natural log
np.log10 --> base 10
np.log2 --> base 2
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
np.random.seed(123)
# General Functions
def func_exp(x, a, b, c):
    """Return values from a general exponential function."""
    return a * np.exp(b * x) + c

def func_log(x, a, b, c):
    """Return values from a general log function."""
    return a * np.log(b * x) + c

# Helper
def generate_data(func, *args, jitter=0):
    """Return a tuple of arrays with random data along a general function."""
    xs = np.linspace(1, 5, 50)
    ys = func(xs, *args)
    noise = jitter * np.random.normal(size=len(xs)) + jitter
    xs = xs.reshape(-1, 1)  # xs[:, np.newaxis]
    ys = (ys + noise).reshape(-1, 1)
    return xs, ys
Apply a log operation to data values (x, y or both)
Regress the data to a linearized model
Plot by “reversing” any log operations (with np.exp()) and fit to original data
Assuming our data follows an exponential trend, a general equation+ may be y = A * exp(B*x) + C.
We can linearize the latter equation (e.g. y = intercept + slope * x) by taking the log: log(y - C) = log(A) + B*x, which for C = 0 becomes log(y) = log(A) + B*x.
Given a linearized equation++ and the regression parameters, we could calculate (a worked sketch follows the summary table below):
A via intercept (ln(A))
B via slope (B)
Summary of Linearization Techniques
Relationship | Example | General Eqn. | Altered Var. | Linearized Eqn.
-------------|------------|----------------------|----------------|------------------------------------------
Linear | x | y = B * x + C | - | y = C + B * x
Logarithmic | log(x) | y = A * log(B*x) + C | log(x) | y = C + A * (log(B) + log(x))
Exponential | 2**x, e**x | y = A * exp(B*x) + C | log(y) | log(y-C) = log(A) + B * x
Power | x**2 | y = B * x**N + C | log(x), log(y) | log(y-C) = log(B) + N * log(x)
+Note: linearizing exponential functions works best when the noise is small and C=0. Use with caution.
++Note: while altering x data helps linearize exponential data, altering y data helps linearize log data.
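Tying the steps above together, a minimal sketch of the exponential case under the table's assumptions (C = 0, small noise), reusing the imports and the func_exp/generate_data helpers defined earlier:
# Generate exponential sample data: y = 2.5 * exp(1.2 * x)
x_samp, y_samp = generate_data(func_exp, 2.5, 1.2, 0, jitter=0)
# 1. apply a log operation to the y values (C assumed to be 0)
y_trans = FunctionTransformer(np.log).fit_transform(y_samp)
# 2. regress the linearized data: log(y) = log(A) + B * x
model = LinearRegression().fit(x_samp, y_trans)
B = model.coef_[0][0]             # slope -> B
A = np.exp(model.intercept_[0])   # intercept -> log(A), so A = exp(intercept)
# 3. "reverse" the log and compare the fit with the original data
y_fit = A * np.exp(B * x_samp)
plt.plot(x_samp, y_samp, 'ko', label='data')
plt.plot(x_samp, y_fit, 'k--', label='fit')
plt.legend()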
import lmfit
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(123)
# General Functions
def func_log(x, a, b, c):
    """Return values from a general log function."""
    return a * np.log(b * x) + c
# Data
x_samp = np.linspace(1, 5, 50)
_noise = np.random.normal(size=len(x_samp), scale=0.06)
y_samp = 2.5 * np.exp(1.2 * x_samp) + 0.7 + _noise
y_samp2 = 2.5 * np.log(1.2 * x_samp) + 0.7 + _noise
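The block above only defines the model function and the sample data; the actual fit seems to have been lost in extraction. A sketch of how it might be run with lmfit's Model interface (the starting values are assumptions):
model = lmfit.Model(func_log)
params = model.make_params(a=1.0, b=1.0, c=1.0)  # rough starting values
result = model.fit(y_samp2, params, x=x_samp)    # fit the log-shaped data
print(result.fit_report())
plt.plot(x_samp, y_samp2, 'ko', label='data')
plt.plot(x_samp, result.best_fit, 'k--', label='lmfit')
plt.legend()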
What is RMSE? Also known as MSE, RMD, or RMS. What problem does it solve?
If you understand RMSE (root mean squared error), MSE (mean squared error), RMD (root mean squared deviation) and RMS (root mean squared), then asking for a library to calculate this for you is unnecessary over-engineering. All these metrics are a single line of Python code at most 2 inches long. The metrics RMSE, MSE, RMD, and RMS are at their core conceptually identical.
RMSE answers the question: “How similar, on average, are the numbers in list1 to list2?”. The two lists must be the same size. I want to “wash out the noise between any two given elements, wash out the size of the data collected, and get a single number feel for change over time”.
Intuition and ELI5 for RMSE:
Imagine you are learning to throw darts at a dart board. Every day you practice for one hour. You want to figure out if you are getting better or getting worse. So every day you make 10 throws and measure the distance between the bullseye and where your dart hit.
You make a list of those numbers list1. Use the root mean squared error between the distances at day 1 and a list2 containing all zeros. Do the same on the 2nd and nth days. What you will get is a single number that hopefully decreases over time. When your RMSE number is zero, you hit bullseyes every time. If the rmse number goes up, you are getting worse.
Example in calculating root mean squared error in python:
import numpy as np
d = [0.000, 0.166, 0.333] #ideal target distances, these can be all zeros.
p = [0.000, 0.254, 0.998] #your performance goes here
print("d is: " + str(["%.8f" % elem for elem in d]))
print("p is: " + str(["%.8f" % elem for elem in p]))
def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())
rmse_val = rmse(np.array(d), np.array(p))
print("rms error is: " + str(rmse_val))
Which prints:
d is: ['0.00000000', '0.16600000', '0.33300000']
p is: ['0.00000000', '0.25400000', '0.99800000']
rms error between lists d and p is: 0.387284994115
The mathematical notation:
Glyph legend: n is a whole positive integer representing the number of throws. i is a whole positive integer counter that enumerates the sum. d stands for the ideal distances, the list2 containing all zeros in the above example. p stands for performance, the list1 in the above example. The superscript 2 stands for squaring. d_i is the i'th index of d. p_i is the i'th index of p.
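The formula the legend describes (the original was an image that did not survive extraction; reconstructed here in LaTeX):
\mathrm{RMSE}(d, p) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(d_i - p_i\right)^{2}}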
The rmse done in small steps so it can be understood:
def rmse(predictions, targets):
    differences = predictions - targets                        # the DIFFERENCEs
    differences_squared = differences ** 2                     # the SQUAREs of ^
    mean_of_differences_squared = differences_squared.mean()   # the MEAN of ^
    rmse_val = np.sqrt(mean_of_differences_squared)            # the ROOT of ^
    return rmse_val                                            # get the ^
How does every step of RMSE work:
Subtracting one number from another gives you the distance between them.
8 - 5 = 3 #absolute distance between 8 and 5 is +3
-20 - 10 = -30 #absolute distance between -20 and 10 is +30
If you multiply any number times itself, the result is always positive because negative times negative is positive:
3*3 = 9 = positive
-30*-30 = 900 = positive
Add them all up, but wait, then an array with many elements would have a larger error than a small array, so average them by the number of elements.
But wait, we squared them all earlier to force them positive. Undo the damage with a square root!
That leaves you with a single number that represents, on average, the distance between each value of list1 and its corresponding value in list2.
If the RMSE value goes down over time we are happy because variance is decreasing.
RMSE isn’t the most accurate line fitting strategy, total least squares is:
Root mean squared error measures the vertical distance between the point and the line, so if your data is shaped like a banana, flat near the bottom and steep near the top, then the RMSE will report greater distances to high points but shorter distances to low points, when in fact the distances are equivalent. This causes a skew where the line prefers to be closer to high points than to low ones.
If there are nulls or infinities in either input list, then the output RMSE value is not going to make sense. There are three strategies to deal with nulls / missing values / infinities in either list: ignore that component, zero it out, or add a best guess or uniform random noise to all timesteps. Each remedy has its pros and cons depending on what your data means. In general, ignoring any component with a missing value is preferred, but this biases the RMSE toward zero, making you think performance has improved when it really hasn't. Adding random noise on top of a best guess could be preferred if there are lots of missing values.
In order to guarantee relative correctness of the RMSE output, you must eliminate all nulls/infinites from the input.
RMSE has zero tolerance for outlier data points which don’t belong
Root mean squared error relies on all data being right, and all points are counted as equal. That means one stray point that's way out in left field is going to totally ruin the whole calculation. To handle outlier data points and dismiss their tremendous influence after a certain threshold, see robust estimators that build in a threshold for dismissal of outliers.
Answer 2
Could this be faster?
n = len(predictions)
rmse = np.linalg.norm(predictions - targets) / np.sqrt(n)
Just in case someone finds this thread in 2019, there is a library called ml_metrics which is available without pre-installation in Kaggle's kernels; it is pretty lightweight and accessible through PyPI (it can be installed easily and quickly with pip install ml_metrics):
Mostly one or two liners and not much input checking, and mainly intended for easily getting some statistics when comparing arrays. But they have unit tests for the axis arguments, because that’s where I sometimes make sloppy mistakes.
You can't find an RMSE function directly in sklearn.
But instead of manually taking the sqrt, there is another standard way using sklearn.
Sklearn's mean_squared_error itself contains a parameter called squared, with a default value of True. If we set it to False, the same function will return RMSE instead of MSE.
# code changes implemented by Esha Prakash
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_true, y_pred , squared=False)
No, there is the scikit-learn library for machine learning, and it can easily be used from Python. It has a function for mean squared error, which I am sharing via the link below:
The function is named mean_squared_error as given below, where y_true would be real class values for the data tuples and y_pred would be the predicted values, predicted by the machine learning algorithm you are using:
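A minimal usage sketch (y_true and y_pred are placeholders for your own arrays; on older scikit-learn versions without the squared argument, take the square root yourself):
import numpy as np
from sklearn.metrics import mean_squared_error
y_true = [3.0, -0.5, 2.0, 7.0]   # placeholder ground-truth values
y_pred = [2.5, 0.0, 2.0, 8.0]    # placeholder predictions
mse = mean_squared_error(y_true, y_pred)   # mean squared error
rmse = np.sqrt(mse)                        # root mean squared error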
But, @jonrsharpe is right. In fact, if you’re going to be appending in a loop, it would be much faster to append to a list as in your first example, then convert to a numpy array at the end, since you’re really not using numpy as intended during the loop:
In [210]: %%timeit
.....: l = []
.....: for i in xrange(1000):
.....:     l.append([3*i+1, 3*i+2, 3*i+3])
.....: l = np.asarray(l)
.....:
1000 loops, best of 3: 1.18 ms per loop
In [211]: %%timeit
.....: a = np.empty((0,3), int)
.....: for i in xrange(1000):
.....:     a = np.append(a, 3*i+np.array([[1,2,3]]), 0)
.....:
100 loops, best of 3: 18.5 ms per loop
In [214]: np.allclose(a, l)
Out[214]: True
The numpythonic way to do it depends on your application, but it would be more like:
In [220]: timeit n = np.arange(1,3001).reshape(1000,3)
100000 loops, best of 3: 5.93 µs per loop
In [221]: np.allclose(a, n)
Out[221]: True
arr = np.array([])
arr = np.hstack((arr, np.array([1,2,3])))  # arr is now [1,2,3]
arr = np.vstack((arr, np.array([4,5,6])))  # arr is now [[1,2,3],[4,5,6]]
Using a custom dtype definition, what worked for me was:
import numpy
# define custom dtype
type1 = numpy.dtype([('freq', numpy.float64, 1), ('amplitude', numpy.float64, 1)])
# declare empty array, zero rows but one column
arr = numpy.empty([0,1],dtype=type1)
# store row data, maybe inside a loop
row = numpy.array([(0.0001, 0.002)], dtype=type1)
# append row to the main array
arr = numpy.row_stack((arr, row))
# print values stored in the row 0
print float(arr[0]['freq'])
print float(arr[0]['amplitude'])
Answer 4
When adding new rows to an array in a loop, assign the array directly on the first iteration instead of initialising an empty array.
for i in range(0, 100):
    SOMECALCULATEDARRAY = .......
    if i == 0:
        finalArrayCollection = SOMECALCULATEDARRAY
    else:
        finalArrayCollection = np.vstack((finalArrayCollection, SOMECALCULATEDARRAY))
This is mainly useful when the shape of the array is unknown
Answer 5
I wanted to use a for loop, but askewchan's method did not work well for me, so I modified it:
x = np.empty((0, 3))
y = np.array([1, 2, 3])
for i in ...:
    x = np.vstack((x, y))
Say I have an image of size 3841 x 7195 pixels. I would like to save the contents of the figure to disk, resulting in an image of the exact size I specify in pixels.
No axes, no titles. Just the image. I don’t personally care about DPI, as I only want to specify the size the image takes on disk, in pixels.
I have read other threads, and they all seem to do conversions to inches and then specify the dimensions of the figure in inches and adjust DPI in some way. I would like to avoid dealing with the potential loss of accuracy that could result from pixel-to-inch conversions.
I have tried passing the pixel dimensions directly, with no luck (Python complains that width and height must each be below 32768).
From everything I have seen, matplotlib requires the figure size to be specified in inches and DPI, but I am only interested in the pixels the figure takes on disk. How can I do this?
To clarify: I am looking for a way to do this with matplotlib, and not with other image-saving libraries.
Matplotlib doesn’t work with pixels directly, but rather physical sizes and DPI. If you want to display a figure with a certain pixel size, you need to know the DPI of your monitor. For example this link will detect that for you.
If you have an image of 3841×7195 pixels, it is unlikely that your monitor will be that large, so you won’t be able to show a figure of that size (matplotlib requires the figure to fit on the screen; if you ask for a size too large it will shrink to the screen size). Let’s imagine you want an 800×800 pixel image just as an example. Here’s how to show an 800×800 pixel image on my monitor (my_dpi=96):
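The snippet that sentence introduces appears to have been lost in extraction; a minimal sketch of it, assuming a 96 DPI monitor (my_dpi is whatever your own screen reports):
import matplotlib.pyplot as plt
my_dpi = 96  # assumed monitor DPI; measure your own screen
plt.figure(figsize=(800/my_dpi, 800/my_dpi), dpi=my_dpi)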
So you basically just divide the desired pixel dimensions by your DPI to get the size in inches.
If you want to save a figure of a specific size, then it is a different matter. Screen DPIs are not so important anymore (unless you ask for a figure that won’t fit in the screen). Using the same example of the 800×800 pixel figure, we can save it in different resolutions using the dpi keyword of savefig. To save it in the same resolution as the screen just use the same dpi:
plt.savefig('my_fig.png', dpi=my_dpi)
To to save it as an 8000×8000 pixel image, use a dpi 10 times larger:
plt.savefig('my_fig.png', dpi=my_dpi * 10)
Note that setting the DPI is not supported by all backends. Here, the PNG backend is used, but the pdf and ps backends will implement the size differently. Changing the DPI and sizes will also affect things like font size: a larger DPI will keep the same relative sizes of fonts and elements, but if you want smaller fonts for a larger figure you need to increase the physical size instead of the DPI.
Getting back to your example, if you want to save an image with 3841 x 7195 pixels, you could do the following:
plt.figure(figsize=(3.841, 7.195), dpi=100)
# ... your plotting code ...
plt.savefig('myfig.png', dpi=1000)
Note that I used the figure dpi of 100 to fit in most screens, but saved with dpi=1000 to achieve the required resolution. In my system this produces a png with 3840×7190 pixels — it seems that the DPI saved is always 0.02 pixels/inch smaller than the selected value, which will have a (small) effect on large image sizes. Some more discussion of this here.
Answer 1
Based on your code, this worked for me, generating a 93 Mb png image with colour noise and the desired dimensions:
import matplotlib.pyplot as plt
import numpy
w = 7195
h = 3841
im_np = numpy.random.rand(h, w)
fig = plt.figure(frameon=False)
fig.set_size_inches(w, h)
ax = plt.Axes(fig, [0., 0., 1., 1.])
ax.set_axis_off()
fig.add_axes(ax)
ax.imshow(im_np, aspect='normal')
fig.savefig('figure.png', dpi=1)
Based on the accepted response by tiago, here is a small generic function that exports a numpy array to an image having the same resolution as the array:
import matplotlib.pyplot as plt
import numpy as np
def export_figure_matplotlib(arr, f_name, dpi=200, resize_fact=1, plt_show=False):
    """
    Export array as figure in original resolution
    :param arr: array of image to save in original resolution
    :param f_name: name of file where to save figure
    :param resize_fact: resize factor wrt shape of arr, in (0, np.infty)
    :param dpi: dpi of your screen
    :param plt_show: show plot or not
    """
    fig = plt.figure(frameon=False)
    fig.set_size_inches(arr.shape[1]/dpi, arr.shape[0]/dpi)
    ax = plt.Axes(fig, [0., 0., 1., 1.])
    ax.set_axis_off()
    fig.add_axes(ax)
    ax.imshow(arr)
    plt.savefig(f_name, dpi=(dpi * resize_fact))
    if plt_show:
        plt.show()
    else:
        plt.close()
As said in the previous reply by tiago, the screen DPI needs to be found first, which can be done here for instance: http://dpi.lv
I’ve added an additional argument resize_fact to the function, with which you can export the image at, for instance, 50% (0.5) of the original resolution.
#file_path = directory address where the image will be stored along with file name and extension
#array = variable where the image is stored. I think for the original post this variable is im_np
plt.imsave(file_path, array)
I have a bunch of MATLAB code from my MS thesis which I now want to convert to Python (using numpy/scipy and matplotlib) and distribute as open source. I know about the similarity between the MATLAB and Python scientific libraries, and converting it manually will take no more than a fortnight (provided that I work towards it every day for some time). I was wondering if there is already any tool available which can do the conversion.
INTRODUCTION: I have a list of more than 30,000 integer values ranging from 0 to 47, inclusive, e.g. [0,0,0,0,...,1,1,1,1,...,2,2,2,2,...,47,47,47,...], sampled from some continuous distribution. The values in the list are not necessarily in order, but order doesn’t matter for this problem.
PROBLEM: Based on my distribution, I would like to calculate the p-value (the probability of seeing greater values) for any given value. For example, the p-value for 0 would approach 1 and the p-value for higher numbers would tend to 0.
I don’t know if I am right, but to determine probabilities I think I need to fit my data to a theoretical distribution that is the most suitable to describe my data. I assume that some kind of goodness of fit test is needed to determine the best model.
Is there a way to implement such an analysis in Python (Scipy or Numpy)?
Could you present any examples?
Distribution Fitting with Sum of Square Error (SSE)
This is an update and modification to Saullo’s answer, that uses the full list of the current scipy.stats distributions and returns the distribution with the least SSE between the distribution’s histogram and the data’s histogram.
Example Fitting
Using the El Niño dataset from statsmodels, the distributions are fit and error is determined. The distribution with the least error is returned.
All Distributions
Best Fit Distribution
Example Code
%matplotlib inline
import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels as sm
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')
# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Distributions to check
    DISTRIBUTIONS = [
        st.alpha,st.anglit,st.arcsine,st.beta,st.betaprime,st.bradford,st.burr,st.cauchy,st.chi,st.chi2,st.cosine,
        st.dgamma,st.dweibull,st.erlang,st.expon,st.exponnorm,st.exponweib,st.exponpow,st.f,st.fatiguelife,st.fisk,
        st.foldcauchy,st.foldnorm,st.frechet_r,st.frechet_l,st.genlogistic,st.genpareto,st.gennorm,st.genexpon,
        st.genextreme,st.gausshyper,st.gamma,st.gengamma,st.genhalflogistic,st.gilbrat,st.gompertz,st.gumbel_r,
        st.gumbel_l,st.halfcauchy,st.halflogistic,st.halfnorm,st.halfgennorm,st.hypsecant,st.invgamma,st.invgauss,
        st.invweibull,st.johnsonsb,st.johnsonsu,st.ksone,st.kstwobign,st.laplace,st.levy,st.levy_l,st.levy_stable,
        st.logistic,st.loggamma,st.loglaplace,st.lognorm,st.lomax,st.maxwell,st.mielke,st.nakagami,st.ncx2,st.ncf,
        st.nct,st.norm,st.pareto,st.pearson3,st.powerlaw,st.powerlognorm,st.powernorm,st.rdist,st.reciprocal,
        st.rayleigh,st.rice,st.recipinvgauss,st.semicircular,st.t,st.triang,st.truncexpon,st.truncnorm,st.tukeylambda,
        st.uniform,st.vonmises,st.vonmises_line,st.wald,st.weibull_min,st.weibull_max,st.wrapcauchy
    ]

    # Best holders
    best_distribution = st.norm
    best_params = (0.0, 1.0)
    best_sse = np.inf

    # Estimate distribution parameters from data
    for distribution in DISTRIBUTIONS:
        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings('ignore')
                # fit dist to data
                params = distribution.fit(data)
                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]
                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))
                # if an axis was passed in, add to plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                except Exception:
                    pass
                # identify if this distribution is better
                if best_sse > sse > 0:
                    best_distribution = distribution
                    best_params = params
                    best_sse = sse
        except Exception:
            pass

    return (best_distribution.name, best_params)

def make_pdf(dist, params, size=10000):
    """Generate the distribution's Probability Distribution Function"""
    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]
    # Get sane start and end points of distribution
    start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
    end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)
    # Build PDF and turn into pandas Series
    x = np.linspace(start, end, size)
    y = dist.pdf(x, loc=loc, scale=scale, *arg)
    pdf = pd.Series(y, x)
    return pdf
# Load data from statsmodels datasets
data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())
# Plot for comparison
plt.figure(figsize=(12,8))
ax = data.plot(kind='hist', bins=50, normed=True, alpha=0.5, color=plt.rcParams['axes.color_cycle'][1])
# Save plot limits
dataYLim = ax.get_ylim()
# Find best fit distribution
best_fit_name, best_fit_params = best_fit_distribution(data, 200, ax)
best_dist = getattr(st, best_fit_name)
# Update plots
ax.set_ylim(dataYLim)
ax.set_title(u'El Niño sea temp.\n All Fitted Distributions')
ax.set_xlabel(u'Temp (°C)')
ax.set_ylabel('Frequency')
# Make PDF with best params
pdf = make_pdf(best_dist, best_fit_params)
# Display
plt.figure(figsize=(12,8))
ax = pdf.plot(lw=2, label='PDF', legend=True)
data.plot(kind='hist', bins=50, normed=True, alpha=0.5, label='Data', legend=True, ax=ax)
param_names = (best_dist.shapes + ', loc, scale').split(', ') if best_dist.shapes else ['loc', 'scale']
param_str = ', '.join(['{}={:0.2f}'.format(k,v) for k,v in zip(param_names, best_fit_params)])
dist_str = '{}({})'.format(best_fit_name, param_str)
ax.set_title(u'El Niño sea temp. with best fit distribution \n' + dist_str)
ax.set_xlabel(u'Temp. (°C)')
ax.set_ylabel('Frequency')
The fit() method mentioned by @Saullo Castro provides maximum likelihood estimates (MLE). The best distribution for your data can be determined in several different ways, such as:
1, the one that gives you the highest log likelihood.
2, the one that gives you the smallest AIC, BIC or BICc value (see wiki: http://en.wikipedia.org/wiki/Akaike_information_criterion; these can basically be viewed as the log likelihood adjusted for the number of parameters, since distributions with more parameters are expected to fit better). The formulas are sketched just below.
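For reference, with k estimated parameters, n data points and maximized likelihood \hat{L}, these criteria are:
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}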
Of course, if you already have a distribution that should describe your data (based on the theories in your particular field) and want to stick to that, you will skip the step of identifying the best-fit distribution.
AFAICU, your distribution is discrete (and nothing but discrete). Therefore just counting the frequencies of different values and normalizing them should be enough for your purposes. So, an example to demonstrate this:
In []: values= [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
In []: counts = asarray(bincount(values), dtype=float)
In []: cdf = counts.cumsum() / counts.sum()
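The ccdf referred to in the next sentence seems to have been dropped during extraction; it is simply the complement of the cdf computed above:
In []: ccdf = 1 - cdf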
Please note that the ccdf is closely related to the survival function (sf), but it is also defined for discrete distributions, whereas sf is defined only for continuous distributions.
Answer 4
This sounds like a probability density estimation problem to me.
from scipy.stats import gaussian_kde
occurences = [0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47]
values = range(0,48)
kde = gaussian_kde(map(float, occurences))
p = kde(values)
p = p / sum(p)
print "P(x>=1) = %f" % sum(p[1:])
# Create 1000 random integers, value between [0-50]
X = np.random.randint(0, 50,1000)
# Retrieve P-value for y
y = [0,10,45,55,100]
# From the distfit library import the class distfit
from distfit import distfit
# Initialize.
# Set any properties here, such as alpha.
# The smoothing can be of use when working with integers. Otherwise your histogram
# may be jumping up-and-down, and getting the correct fit may be harder.
dist = distfit(alpha=0.05, smooth=10)
# Search for best theoretical fit on your empirical data
dist.fit_transform(X)
> [distfit] >fit..
> [distfit] >transform..
> [distfit] >[norm ] [RSS: 0.0037894] [loc=23.535 scale=14.450]
> [distfit] >[expon ] [RSS: 0.0055534] [loc=0.000 scale=23.535]
> [distfit] >[pareto ] [RSS: 0.0056828] [loc=-384473077.778 scale=384473077.778]
> [distfit] >[dweibull ] [RSS: 0.0038202] [loc=24.535 scale=13.936]
> [distfit] >[t ] [RSS: 0.0037896] [loc=23.535 scale=14.450]
> [distfit] >[genextreme] [RSS: 0.0036185] [loc=18.890 scale=14.506]
> [distfit] >[gamma ] [RSS: 0.0037600] [loc=-175.505 scale=1.044]
> [distfit] >[lognorm ] [RSS: 0.0642364] [loc=-0.000 scale=1.802]
> [distfit] >[beta ] [RSS: 0.0021885] [loc=-3.981 scale=52.981]
> [distfit] >[uniform ] [RSS: 0.0012349] [loc=0.000 scale=49.000]
# Best fitted model
best_distr = dist.model
print(best_distr)
# Uniform shows best fit, with 95% CII (confidence intervals), and all other parameters
> {'distr': <scipy.stats._continuous_distns.uniform_gen at 0x16de3a53160>,
> 'params': (0.0, 49.0),
> 'name': 'uniform',
> 'RSS': 0.0012349021241149533,
> 'loc': 0.0,
> 'scale': 49.0,
> 'arg': (),
> 'CII_min_alpha': 2.45,
> 'CII_max_alpha': 46.55}
# Ranking distributions
dist.summary
# Plot the summary of fitted distributions
dist.plot_summary()
# Make prediction on new datapoints based on the fit
dist.predict(y)
# Retrieve your pvalues with
dist.y_pred
# array(['down', 'none', 'none', 'up', 'up'], dtype='<U4')
dist.y_proba
# array([0.02040816, 0.02040816, 0.02040816, 0.        , 0.        ])
# Or in one dataframe
dist.df
# The plot function will now also include the predictions of y
dist.plot()
Note that in this case, all points will be significant because of the uniform distribution. You can filter with the dist.y_pred if required.
With OpenTURNS, I would use the BIC criteria to select the best distribution that fits such data. This is because this criteria does not give too much advantage to the distributions which have more parameters. Indeed, if a distribution has more parameters, it is easier for the fitted distribution to be closer to the data. Moreover, the Kolmogorov-Smirnov may not make sense in this case, because a small error in the measured values will have a huge impact on the p-value.
To illustrate the process, I load the El-Nino data, which contains 732 monthly temperature measurements from 1950 to 2010:
import statsmodels.api as sm
dta = sm.datasets.elnino.load_pandas().data
dta['YEAR'] = dta.YEAR.astype(int).astype(str)
dta = dta.set_index('YEAR').T.unstack()
data = dta.values
It is easy to get the 30 built-in univariate distribution factories with the GetContinuousUniVariateFactories static method. Once done, the BestModelBIC static method returns the best model and the corresponding BIC score.
import openturns as ot
sample = ot.Sample([[p] for p in data])  # data reshaping
tested_factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
best_model, best_bic = ot.FittingTest.BestModelBIC(sample,
tested_factories)
print("Best=",best_model)
which prints:
Best= Beta(alpha = 1.64258, beta = 2.4348, a = 18.936, b = 29.254)
In order to graphically compare the fit to the histogram, I use the drawPDF methods of the best distribution.
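The plotting snippet did not survive extraction; a rough sketch of what it might look like, reusing sample and best_model from above (openturns.viewer is assumed to be available):
import openturns.viewer as otv
graph = ot.HistogramFactory().build(sample).drawPDF()  # histogram of the data
graph.add(best_model.drawPDF())                        # overlay the best BIC fit
graph.setTitle("El Nino temperatures and best BIC fit")
otv.View(graph)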
More details on this topic are presented in the BestModelBIC documentation. It would be possible to include a SciPy distribution via SciPyDistribution, or even ChaosPy distributions via ChaosPyDistribution, but I guess the current script fulfills most practical purposes.
Forgive me if I don’t understand your need but what about storing your data in a dictionary where keys would be the numbers between 0 and 47 and values the number of occurrences of their related keys in your original list?
Thus your likelihood p(x) will be the sum of all the values for keys greater than x divided by 30000.
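A minimal sketch of that idea (values is a placeholder for your original list of integers):
from collections import Counter
values = [0, 0, 0, 1, 1, 2, 47]       # placeholder for your 30,000+ samples
counts = Counter(values)              # key -> number of occurrences
total = sum(counts.values())
def p_value(x):
    """Probability of seeing a value greater than x."""
    return sum(c for k, c in counts.items() if k > x) / total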
What is the quickest/simplest way to drop nan and inf/-inf values from a pandas DataFrame without resetting mode.use_inf_as_null? I’d like to be able to use the subset and how arguments of dropna, except with inf values considered missing, like:
With option context, this is possible without permanently setting use_inf_as_na. For example:
with pd.option_context('mode.use_inf_as_na', True):
df = df.dropna(subset=['col1', 'col2'], how='all')
Of course it can be set to treat inf as NaN permanently with
pd.set_option('use_inf_as_na', True)
For older versions, replace use_inf_as_na with use_inf_as_null.
Answer 2
Here is another method using .loc to replace inf with nan on a Series:
s.loc[(~np.isfinite(s)) & s.notnull()] = np.nan
So, in response to the original question:
df = pd.DataFrame(np.ones((3, 3)), columns=list('ABC'))
for i in range(3):
    df.iat[i, i] = np.inf
df
A B C
0 inf 1.000000 1.000000
1 1.000000 inf 1.000000
2 1.000000 1.000000 inf
df.sum()
A inf
B inf
C inf
dtype: float64
df.apply(lambda s: s[np.isfinite(s)].dropna()).sum()
A 2
B 2
C 2
dtype: float64
Yet another solution would be to use the isin method. Use it to determine whether each value is infinite or missing and then chain the all method to determine if all the values in the rows are infinite or missing.
Finally, use the negation of that result to select the rows that don’t have all infinite or missing values via boolean indexing.
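A sketch of that approach (this assumes np.nan is matched by isin, which holds for recent pandas versions):
import numpy as np
mask = df.isin([np.nan, np.inf, -np.inf])  # True where a value is missing or infinite
df_clean = df[~mask.all(axis=1)]           # keep rows that are not entirely missing/infinite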
You can use pd.DataFrame.mask with np.isinf. You should first ensure your dataframe's series are all of type float. Then use dropna with your existing logic.
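A sketch of that combination, reusing the col1/col2 subset from the earlier example (column names are placeholders):
import numpy as np
df = df.astype(float)                 # make sure every series is float
df = df.mask(np.isinf(df))            # replace +/-inf with NaN
df = df.dropna(subset=['col1', 'col2'], how='all')  # your existing dropna logic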
I tried all the above and nothing worked for me. This solved all my problems:
pip install -U numpy
pip install -U scipy
Note that the -U option to pip install requests that the package be upgraded. Without it, if the package is already installed pip will inform you of this and exit without doing anything.
failed for me… With the following steps, it finally worked out (as root in a virtual environment, where python3 is a link to Python 3.2.2):
install the Ubuntu dependencies (see elaichi), clone NumPy and SciPy:
Then at the same website you can download scipy-0.18.1-cp35-cp35m-win32.whl
Note: you should download the file_name.whl according to your Python version. If your Python version is 32-bit Python 3.5 you should download this one; the “win32” refers to your Python build, not your operating system version.
Besides all of these answers,
If you install python of 32bit on your 64bit machine, you have to download scipy of 32-bit irrespective of your machine.
http://www.lfd.uci.edu/~gohlke/pythonlibs/
From the above URL you can download the packages; the command is pip install followed by the downloaded .whl file.
I recently discovered Conda after I was having trouble installing SciPy, specifically on a Heroku app that I am developing.
With Conda you create environments, very similar to what virtualenv does. My questions are:
If I use Conda will it replace the need for virtualenv? If not, how do I use the two together? Do I install virtualenv in Conda, or Conda in virtualenv?
Do I still need to use pip? If so, will I still be able to install packages with pip in an isolated environment?
Conda replaces virtualenv. In my opinion it is better. It is not limited to Python but can be used for other languages too. In my experience it provides a much smoother experience, especially for scientific packages. The first time I got MayaVi properly installed on Mac was with conda.
You can still use pip. In fact, conda installs pip in each new environment. It knows about pip-installed packages.
For example:
conda list
lists all installed packages in your current environment.
Conda-installed packages show up like this:
sphinx_rtd_theme 0.1.7 py35_0 defaults
and the ones installed via pip have the <pip> marker:
These environments are strongly tied to conda’s pip-like package management, so it is simple to create environments and install both Python and non-Python packages.
Jupyter
In addition, installing ipykernel in an environment adds a new entry in the Kernels dropdown menu of Jupyter notebooks, extending reproducible environments to notebooks. As of Anaconda 4.1, nbextensions were added, making it easier to add extensions to notebooks.
Reliability
In my experience, conda is faster and more reliable at installing large libraries such as numpy and pandas. Moreover, if you wish to transfer the preserved state of an environment, you can do so by sharing or cloning an env.
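For example, the commands for sharing or cloning an environment look roughly like this (environment names are placeholders):
conda env export > environment.yml         # share: dump the current env spec
conda env create -f environment.yml        # recreate it on another machine
conda create --name newenv --clone oldenv  # clone an existing env locally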
Installing Conda will enable you to create and remove Python environments as you wish, therefore providing you with the same functionality as virtualenv would.
With both tools you can create an isolated filesystem tree, where you can install and remove Python packages (probably with pip) as you wish. This can come in handy if you want different versions of the same library for different use cases, or if you just want to try some distribution and remove it afterwards, conserving your disk space.
Differences:
License agreement. While virtualenv comes under the very liberal MIT license, Conda uses a 3-clause BSD license.
Conda provides you with its own package control system. This package control system often provides precompiled versions (for the most popular systems) of popular non-Python software, which can ease one's way to getting some machine learning packages working. Namely, you don't have to compile optimized C/C++ code for your system. While this is a great relief for most of us, it might affect the performance of such libraries.
Unlike virtualenv, Conda duplicates some system libraries, at least on Linux systems. These libraries can get out of sync, leading to inconsistent behavior of your programs.
Verdict:
Conda is great and should be your default choice when starting out with machine learning. It will save you some time messing with gcc and numerous packages. Yet Conda does not replace virtualenv. It introduces some additional complexity which might not always be desired, and it comes under a different license. You might want to avoid using conda on distributed environments or on HPC hardware.
I use both and (as of Jan, 2020) they have some superficial differences that lend themselves to different usages for me. By default Conda prefers to manage a list of environments for you in a central location, whereas virtualenv makes a folder in the current directory. The former (centralized) makes sense if you are e.g. doing machine learning and just have a couple of broad environments that you use across many projects and want to jump into them from anywhere. The latter (per project folder) makes sense if you are doing little one-off projects that have completely different sets of lib requirements that really belong more to the project itself.
The empty environment that Conda creates is about 122MB whereas the virtualenv’s is about 12MB, so that’s another reason you may prefer not to scatter Conda environments around everywhere.
Finally, another superficial indication that Conda prefers its centralized envs is that (again, by default) if you do create a Conda env in your own project folder and activate it the name prefix that appears in your shell is the (way too long) absolute path to the folder. You can fix that by giving it a name, but virtualenv does the right thing by default.
I expect this info to become stale rapidly as the two package managers vie for dominance, but these are the trade-offs as of today :)
I work in a corporate environment, behind several firewalls, with a machine on which I have no admin access.
In my limited experience with Python (2 years) I have come across a few libraries (JayDeBeApi, sasl) which, when installed via pip, threw C++ dependency errors:
error: Microsoft Visual C++ 14.0 is required. Get it with “Microsoft Visual C++ Build Tools”: http://landinghub.visualstudio.com/visual-cpp-build-tools
These installed fine with conda, so since those days I have been working with conda environments.
However, it isn't easy to stop conda from installing dependencies inside C:\Program Files, where I don't have write access.