I’m trying to create required libraries in a package I’m distributing. It requires both the SciPy and NumPy libraries.
While developing, I installed both using
apt-get install scipy
which installed SciPy 0.9.0 and NumPy 1.5.1, and it worked fine.
I would like to do the same using pip install, in order to be able to specify dependencies in the setup.py of my own package.
The problem is, when I try:
pip install 'numpy==1.5.1'
it works fine.
But then
pip install 'scipy==0.9.0'
fails miserably, with
raise self.notfounderror(self.notfounderror.__doc__)
numpy.distutils.system_info.BlasNotFoundError:
Blas (http://www.netlib.org/blas/) libraries not found.
Directories to search for the libraries can be specified in the
numpy/distutils/site.cfg file (section [blas]) or by setting
the BLAS environment variable.
Follow the instructions to download, build and export the env variable for BLAS and then LAPACK. Be careful to not just blindly cut’n’paste the shell commands – there will be a few lines you need to select depending on your architecture, etc., and you’ll need to fix/add the correct directories that it incorrectly assumes as well.
The third thing you may need is to yum install numpy-f2py or the equivalent.
Oh, yes and lastly, you may need to yum install gcc-gfortran as the libraries above are Fortran source.
Since the previous instructions for installing with yum are broken, here are the updated instructions for installing on something like Fedora. I’ve tested this on “Amazon Linux AMI 2016.03”.
I was working on a project that depended on numpy and scipy. In a clean installation of Fedora 23, using a Python virtual environment for Python 3.4 (it also worked for Python 2.7), I had the following in my setup.py (inside the setup() call):
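The snippet referenced here did not survive extraction; a minimal sketch of what such a setup() call might contain (the package name is a placeholder, and the version pins are only an assumption taken from the question above):
from setuptools import setup
setup(
    name='mypackage',        # placeholder package name
    version='0.1',
    install_requires=[
        'numpy>=1.5.1',      # versions mentioned in the question
        'scipy>=0.9.0',
    ],
)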
What operating system is this? The answer might depend on the OS involved. However, it looks like you need to find this BLAS library and install it. It doesn’t seem to be in pip (so you’ll have to do it by hand), but if you install it, it ought to let you progress with your SciPy install.
I have a set of data and I want to compare which line describes it best (polynomials of different orders, exponential or logarithmic).
I use Python and Numpy and for polynomial fitting there is a function polyfit(). But I found no such functions for exponential and logarithmic fitting.
Are there any? Or how else should I solve this?
Answer 0
For fitting y = A + B log x, just fit y against (log x).
>>> x = numpy.array([1, 7, 20, 50, 79])
>>> y = numpy.array([10, 19, 30, 35, 51])
>>> numpy.polyfit(numpy.log(x), y, 1)
array([ 8.46295607, 6.61867463])
# y ≈ 8.46 log(x) + 6.62
For fitting y = Ae^(Bx), taking the logarithm of both sides gives log y = log A + Bx. So fit (log y) against x.
Note that fitting (log y) as if it were linear will emphasize small values of y, causing large deviations for large y. This is because polyfit (linear regression) works by minimizing ∑_i (ΔY_i)^2 = ∑_i (Y_i − Ŷ_i)^2. When Y_i = log y_i, the residuals ΔY_i = Δ(log y_i) ≈ Δy_i / |y_i|. So even if polyfit makes a very bad decision for large y, the “divide-by-|y|” factor will compensate for it, causing polyfit to favor small values.
This could be alleviated by giving each entry a “weight” proportional to y. polyfit supports weighted-least-squares via the w keyword argument.
>>> x = numpy.array([10, 19, 30, 35, 51])
>>> y = numpy.array([1, 7, 20, 50, 79])
>>> numpy.polyfit(x, numpy.log(y), 1)
array([ 0.10502711, -0.40116352])
# y ≈ exp(-0.401) * exp(0.105 * x) = 0.670 * exp(0.105 * x)
# (^ biased towards small values)
>>> numpy.polyfit(x, numpy.log(y), 1, w=numpy.sqrt(y))
array([ 0.06009446, 1.41648096])
# y ≈ exp(1.42) * exp(0.0601 * x) = 4.12 * exp(0.0601 * x)
# (^ not so biased)
Note that Excel, LibreOffice and most scientific calculators typically use the unweighted (biased) formula for exponential regression / trend lines. If you want your results to be compatible with these platforms, do not include the weights, even if weighting gives better results.
Now, if you can use scipy, you could use scipy.optimize.curve_fit to fit any model without transformations.
For y = A + B log x the result is the same as the transformation method:
For y = Ae^(Bx), however, we can get a better fit since it minimizes Δy directly rather than Δ(log y). But we need to provide an initial guess so curve_fit can reach the desired local minimum.
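The curve_fit snippets for these two cases appear to have been lost in extraction; a sketch of what they might look like, reusing the x and y arrays from the polyfit examples above (the p0 starting guess is an assumption):
>>> from scipy.optimize import curve_fit
>>> # y = A + B*log(x): same data as the logarithmic example above
>>> x = numpy.array([1, 7, 20, 50, 79])
>>> y = numpy.array([10, 19, 30, 35, 51])
>>> curve_fit(lambda t, a, b: a + b * numpy.log(t), x, y)
>>> # y = A*exp(B*x): same data as the exponential example above,
>>> # with an initial guess p0 so the optimizer lands in the right minimum
>>> x = numpy.array([10, 19, 30, 35, 51])
>>> y = numpy.array([1, 7, 20, 50, 79])
>>> curve_fit(lambda t, a, b: a * numpy.exp(b * t), x, y, p0=(4, 0.1))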
You can also fit a set of data to whatever function you like using curve_fit from scipy.optimize. For example, if you want to fit an exponential function (from the documentation):
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def func(x, a, b, c):
    return a * np.exp(-b * x) + c
x = np.linspace(0,4,50)
y = func(x, 2.5, 1.3, 0.5)
yn = y + 0.2*np.random.normal(size=len(x))
popt, pcov = curve_fit(func, x, yn)
(Note: the * in front of popt when you plot will expand out the terms into the a, b, and c that func is expecting.)
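A sketch of the plotting call that note refers to, reusing x, yn, func and popt from the block above:
plt.plot(x, yn, 'ko', label='Noisy data')
plt.plot(x, func(x, *popt), 'r-', label='Fitted curve')
plt.legend()
plt.show()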
Answer 2
I was having some trouble with this, so let me be very explicit so noobs like me can understand.
Let's say that we have a data file or something like that:
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import numpy as np
import sympy as sym
"""
Generate some data, let's imagine that you already have this.
"""
x = np.linspace(0, 3, 50)
y = np.exp(x)
"""
Plot your data
"""
plt.plot(x, y, 'ro',label="Original Data")
"""
brute force to avoid errors
"""
x = np.array(x, dtype=float) #transform your data in a numpy array of floats
y = np.array(y, dtype=float) #so the curve_fit can work
"""
create a function to fit with your data. a, b, c and d are the coefficients
that curve_fit will calculate for you.
In this part you need to guess and/or use mathematical knowledge to find
a function that resembles your data
"""
def func(x, a, b, c, d):
    return a*x**3 + b*x**2 + c*x + d
"""
make the curve_fit
"""
popt, pcov = curve_fit(func, x, y)
"""
The result is:
popt[0] = a , popt[1] = b, popt[2] = c and popt[3] = d of the function,
so f(x) = popt[0]*x**3 + popt[1]*x**2 + popt[2]*x + popt[3].
"""
print "a = %s , b = %s, c = %s, d = %s" % (popt[0], popt[1], popt[2], popt[3])
"""
Use sympy to generate the LaTeX syntax of the function
"""
xs = sym.Symbol('\lambda')
tex = sym.latex(func(xs,*popt)).replace('$', '')
plt.title(r'$f(\lambda)= %s$' %(tex),fontsize=16)
"""
Print the coefficients and plot the function.
"""
plt.plot(x, func(x, *popt), label="Fitted Curve") #same as line above \/
#plt.plot(x, popt[0]*x**3 + popt[1]*x**2 + popt[2]*x + popt[3], label="Fitted Curve")
plt.legend(loc='upper left')
plt.show()
the result is:
a = 0.849195983017 , b = -1.18101681765, c = 2.24061176543, d = 0.816643894816
Answer 3
Well, I guess you can always use:
np.log --> natural log
np.log10 --> base 10
np.log2 --> base 2
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
np.random.seed(123)
# General Functions
def func_exp(x, a, b, c):
    """Return values from a general exponential function."""
    return a * np.exp(b * x) + c

def func_log(x, a, b, c):
    """Return values from a general log function."""
    return a * np.log(b * x) + c

# Helper
def generate_data(func, *args, jitter=0):
    """Return a tuple of arrays with random data along a general function."""
    xs = np.linspace(1, 5, 50)
    ys = func(xs, *args)
    noise = jitter * np.random.normal(size=len(xs)) + jitter
    xs = xs.reshape(-1, 1)  # xs[:, np.newaxis]
    ys = (ys + noise).reshape(-1, 1)
    return xs, ys
Apply a log operation to data values (x, y or both)
Regress the data to a linearized model
Plot by “reversing” any log operations (with np.exp()) and fit to original data
Assuming our data follows an exponential trend, a general equation+ may be y = A * exp(B*x) + C.
We can linearize the latter equation (e.g. y = intercept + slope * x) by taking the log: log(y - C) = log(A) + B*x, which for C = 0 becomes log(y) = log(A) + B*x.
Given a linearized equation++ and the regression parameters, we could calculate (a worked sketch follows the summary table below):
A via intercept (ln(A))
B via slope (B)
Summary of Linearization Techniques
Relationship | Example | General Eqn. | Altered Var. | Linearized Eqn.
-------------|------------|----------------------|----------------|------------------------------------------
Linear | x | y = B * x + C | - | y = C + B * x
Logarithmic | log(x) | y = A * log(B*x) + C | log(x) | y = C + A * (log(B) + log(x))
Exponential | 2**x, e**x | y = A * exp(B*x) + C | log(y) | log(y-C) = log(A) + B * x
Power | x**2 | y = B * x**N + C | log(x), log(y) | log(y-C) = log(B) + N * log(x)
+Note: linearizing exponential functions works best when the noise is small and C=0. Use with caution.
++Note: while altering x data helps linearize exponential data, altering y data helps linearize log data.
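Tying the steps above together, a minimal sketch of the exponential case under the table's assumptions (C = 0, small noise), reusing the imports and the func_exp/generate_data helpers defined earlier:
# Generate exponential sample data: y = 2.5 * exp(1.2 * x)
x_samp, y_samp = generate_data(func_exp, 2.5, 1.2, 0, jitter=0)
# 1. apply a log operation to the y values (C assumed to be 0)
y_trans = FunctionTransformer(np.log).fit_transform(y_samp)
# 2. regress the linearized data: log(y) = log(A) + B * x
model = LinearRegression().fit(x_samp, y_trans)
B = model.coef_[0][0]             # slope -> B
A = np.exp(model.intercept_[0])   # intercept -> log(A), so A = exp(intercept)
# 3. "reverse" the log and compare the fit with the original data
y_fit = A * np.exp(B * x_samp)
plt.plot(x_samp, y_samp, 'ko', label='data')
plt.plot(x_samp, y_fit, 'k--', label='fit')
plt.legend()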
import lmfit
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(123)
# General Functions
def func_log(x, a, b, c):
    """Return values from a general log function."""
    return a * np.log(b * x) + c
# Data
x_samp = np.linspace(1, 5, 50)
_noise = np.random.normal(size=len(x_samp), scale=0.06)
y_samp = 2.5 * np.exp(1.2 * x_samp) + 0.7 + _noise
y_samp2 = 2.5 * np.log(1.2 * x_samp) + 0.7 + _noise
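The block above only defines the model function and the sample data; the actual fit seems to have been lost in extraction. A sketch of how it might be run with lmfit's Model interface (the starting values are assumptions):
model = lmfit.Model(func_log)
params = model.make_params(a=1.0, b=1.0, c=1.0)  # rough starting values
result = model.fit(y_samp2, params, x=x_samp)    # fit the log-shaped data
print(result.fit_report())
plt.plot(x_samp, y_samp2, 'ko', label='data')
plt.plot(x_samp, result.best_fit, 'k--', label='lmfit')
plt.legend()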
What is RMSE? Also known as MSE, RMD, or RMS. What problem does it solve?
If you understand RMSE (root mean squared error), MSE (mean squared error), RMD (root mean squared deviation) and RMS (root mean squared), then asking for a library to calculate this for you is unnecessary over-engineering. All these metrics are a single line of Python code at most 2 inches long. The metrics RMSE, MSE, RMD, and RMS are at their core conceptually identical.
RMSE answers the question: “How similar, on average, are the numbers in list1 to list2?”. The two lists must be the same size. I want to “wash out the noise between any two given elements, wash out the size of the data collected, and get a single number feel for change over time”.
Intuition and ELI5 for RMSE:
Imagine you are learning to throw darts at a dart board. Every day you practice for one hour. You want to figure out if you are getting better or getting worse. So every day you make 10 throws and measure the distance between the bullseye and where your dart hit.
You make a list of those numbers list1. Use the root mean squared error between the distances at day 1 and a list2 containing all zeros. Do the same on the 2nd and nth days. What you will get is a single number that hopefully decreases over time. When your RMSE number is zero, you hit bullseyes every time. If the rmse number goes up, you are getting worse.
Example in calculating root mean squared error in python:
import numpy as np
d = [0.000, 0.166, 0.333] #ideal target distances, these can be all zeros.
p = [0.000, 0.254, 0.998] #your performance goes here
print("d is: " + str(["%.8f" % elem for elem in d]))
print("p is: " + str(["%.8f" % elem for elem in p]))
def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())
rmse_val = rmse(np.array(d), np.array(p))
print("rms error is: " + str(rmse_val))
Which prints:
d is: ['0.00000000', '0.16600000', '0.33300000']
p is: ['0.00000000', '0.25400000', '0.99800000']
rms error between lists d and p is: 0.387284994115
The mathematical notation:
Glyph legend: n is a whole positive integer representing the number of throws. i is a whole positive integer counter that enumerates the sum. d stands for the ideal distances, the list2 containing all zeros in the above example. p stands for performance, the list1 in the above example. The superscript 2 stands for squaring. d_i is the i'th index of d. p_i is the i'th index of p.
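The formula the legend describes (the original was an image that did not survive extraction; reconstructed here in LaTeX):
\mathrm{RMSE}(d, p) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(d_i - p_i\right)^{2}}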
The rmse done in small steps so it can be understood:
def rmse(predictions, targets):
    differences = predictions - targets                        # the DIFFERENCEs
    differences_squared = differences ** 2                     # the SQUAREs of ^
    mean_of_differences_squared = differences_squared.mean()   # the MEAN of ^
    rmse_val = np.sqrt(mean_of_differences_squared)            # the ROOT of ^
    return rmse_val                                            # get the ^
How does every step of RMSE work:
Subtracting one number from another gives you the distance between them.
8 - 5 = 3 #absolute distance between 8 and 5 is +3
-20 - 10 = -30 #absolute distance between -20 and 10 is +30
If you multiply any number times itself, the result is always positive because negative times negative is positive:
3*3 = 9 = positive
-30*-30 = 900 = positive
Add them all up, but wait, then an array with many elements would have a larger error than a small array, so average them by the number of elements.
But wait, we squared them all earlier to force them positive. Undo the damage with a square root!
That leaves you with a single number that represents, on average, the distance between each value of list1 and its corresponding value in list2.
If the RMSE value goes down over time we are happy because variance is decreasing.
RMSE isn’t the most accurate line fitting strategy, total least squares is:
Root mean squared error measures the vertical distance between the point and the line, so if your data is shaped like a banana, flat near the bottom and steep near the top, then the RMSE will report greater distances to high points but shorter distances to low points, when in fact the distances are equivalent. This causes a skew where the line prefers to be closer to high points than to low ones.
If there are nulls or infinities in either input list, then the output RMSE value is not going to make sense. There are three strategies to deal with nulls / missing values / infinities in either list: ignore that component, zero it out, or add a best guess or uniform random noise to all timesteps. Each remedy has its pros and cons depending on what your data means. In general, ignoring any component with a missing value is preferred, but this biases the RMSE toward zero, making you think performance has improved when it really hasn't. Adding random noise on top of a best guess could be preferred if there are lots of missing values.
In order to guarantee relative correctness of the RMSE output, you must eliminate all nulls/infinites from the input.
RMSE has zero tolerance for outlier data points which don’t belong
Root mean squared error relies on all data being right, and all points are counted as equal. That means one stray point that's way out in left field is going to totally ruin the whole calculation. To handle outlier data points and dismiss their tremendous influence after a certain threshold, see robust estimators that build in a threshold for dismissal of outliers.
Answer 2
Could this be faster?
n = len(predictions)
rmse = np.linalg.norm(predictions - targets) / np.sqrt(n)
Just in case someone finds this thread in 2019, there is a library called ml_metrics which is available without pre-installation in Kaggle's kernels; it is pretty lightweight and accessible through PyPI (it can be installed easily and quickly with pip install ml_metrics):
Mostly one or two liners and not much input checking, and mainly intended for easily getting some statistics when comparing arrays. But they have unit tests for the axis arguments, because that’s where I sometimes make sloppy mistakes.
You can't find an RMSE function directly in sklearn.
But instead of manually taking the sqrt, there is another standard way using sklearn.
Sklearn's mean_squared_error itself contains a parameter called squared, with a default value of True. If we set it to False, the same function will return RMSE instead of MSE.
# code changes implemented by Esha Prakash
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_true, y_pred , squared=False)
No, there is the scikit-learn library for machine learning, and it can easily be used from Python. It has a function for mean squared error, which I am sharing via the link below:
The function is named mean_squared_error as given below, where y_true would be real class values for the data tuples and y_pred would be the predicted values, predicted by the machine learning algorithm you are using:
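A minimal usage sketch (y_true and y_pred are placeholders for your own arrays; on older scikit-learn versions without the squared argument, take the square root yourself):
import numpy as np
from sklearn.metrics import mean_squared_error
y_true = [3.0, -0.5, 2.0, 7.0]   # placeholder ground-truth values
y_pred = [2.5, 0.0, 2.0, 8.0]    # placeholder predictions
mse = mean_squared_error(y_true, y_pred)   # mean squared error
rmse = np.sqrt(mse)                        # root mean squared error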
But, @jonrsharpe is right. In fact, if you’re going to be appending in a loop, it would be much faster to append to a list as in your first example, then convert to a numpy array at the end, since you’re really not using numpy as intended during the loop:
In [210]: %%timeit
.....: l = []
.....: for i in xrange(1000):
.....:     l.append([3*i+1, 3*i+2, 3*i+3])
.....: l = np.asarray(l)
.....:
1000 loops, best of 3: 1.18 ms per loop
In [211]: %%timeit
.....: a = np.empty((0,3), int)
.....: for i in xrange(1000):
.....:     a = np.append(a, 3*i+np.array([[1,2,3]]), 0)
.....:
100 loops, best of 3: 18.5 ms per loop
In [214]: np.allclose(a, l)
Out[214]: True
The numpythonic way to do it depends on your application, but it would be more like:
In [220]: timeit n = np.arange(1,3001).reshape(1000,3)
100000 loops, best of 3: 5.93 µs per loop
In [221]: np.allclose(a, n)
Out[221]: True
arr = np.array([])
arr = np.hstack((arr, np.array([1,2,3])))  # arr is now [1,2,3]
arr = np.vstack((arr, np.array([4,5,6])))  # arr is now [[1,2,3],[4,5,6]]
Using a custom dtype definition, what worked for me was:
import numpy
# define custom dtype
type1 = numpy.dtype([('freq', numpy.float64, 1), ('amplitude', numpy.float64, 1)])
# declare empty array, zero rows but one column
arr = numpy.empty([0,1],dtype=type1)
# store row data, maybe inside a loop
row = numpy.array([(0.0001, 0.002)], dtype=type1)
# append row to the main array
arr = numpy.row_stack((arr, row))
# print values stored in the row 0
print float(arr[0]['freq'])
print float(arr[0]['amplitude'])
Answer 4
When adding new rows to an array in a loop, assign the array directly on the first iteration instead of initialising an empty array.
for i in range(0, 100):
    SOMECALCULATEDARRAY = .......
    if i == 0:
        finalArrayCollection = SOMECALCULATEDARRAY
    else:
        finalArrayCollection = np.vstack((finalArrayCollection, SOMECALCULATEDARRAY))
This is mainly useful when the shape of the array is unknown
Answer 5
I wanted to use a for loop, but askewchan's method did not work well for me, so I modified it:
x = np.empty((0, 3))
y = np.array([1, 2, 3])
for i in ...:
    x = np.vstack((x, y))
Say I have an image of size 3841 x 7195 pixels. I would like to save the contents of the figure to disk, resulting in an image of the exact size I specify in pixels.
No axes, no titles. Just the image. I don’t personally care about DPI, as I only want to specify the size the image takes on disk, in pixels.
I have read other threads, and they all seem to do conversions to inches and then specify the dimensions of the figure in inches and adjust DPI in some way. I would like to avoid dealing with the potential loss of accuracy that could result from pixel-to-inch conversions.
I have tried passing the pixel dimensions directly, with no luck (Python complains that width and height must each be below 32768).
From everything I have seen, matplotlib requires the figure size to be specified in inches and DPI, but I am only interested in the pixels the figure takes on disk. How can I do this?
To clarify: I am looking for a way to do this with matplotlib, and not with other image-saving libraries.
Matplotlib doesn’t work with pixels directly, but rather physical sizes and DPI. If you want to display a figure with a certain pixel size, you need to know the DPI of your monitor. For example this link will detect that for you.
If you have an image of 3841×7195 pixels, it is unlikely that your monitor will be that large, so you won’t be able to show a figure of that size (matplotlib requires the figure to fit on the screen; if you ask for a size too large it will shrink to the screen size). Let’s imagine you want an 800×800 pixel image just as an example. Here’s how to show an 800×800 pixel image on my monitor (my_dpi=96):
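The snippet that sentence introduces appears to have been lost in extraction; a minimal sketch of it, assuming a 96 DPI monitor (my_dpi is whatever your own screen reports):
import matplotlib.pyplot as plt
my_dpi = 96  # assumed monitor DPI; measure your own screen
plt.figure(figsize=(800/my_dpi, 800/my_dpi), dpi=my_dpi)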
So you basically just divide the desired pixel dimensions by your DPI to get the size in inches.
If you want to save a figure of a specific size, then it is a different matter. Screen DPIs are not so important anymore (unless you ask for a figure that won’t fit in the screen). Using the same example of the 800×800 pixel figure, we can save it in different resolutions using the dpi keyword of savefig. To save it in the same resolution as the screen just use the same dpi:
plt.savefig('my_fig.png', dpi=my_dpi)
To to save it as an 8000×8000 pixel image, use a dpi 10 times larger:
plt.savefig('my_fig.png', dpi=my_dpi * 10)
Note that setting the DPI is not supported by all backends. Here, the PNG backend is used, but the pdf and ps backends will implement the size differently. Changing the DPI and sizes will also affect things like font size: a larger DPI will keep the same relative sizes of fonts and elements, but if you want smaller fonts for a larger figure you need to increase the physical size instead of the DPI.
Getting back to your example, if you want to save an image with 3841 x 7195 pixels, you could do the following:
plt.figure(figsize=(3.841, 7.195), dpi=100)
# ... your plotting code ...
plt.savefig('myfig.png', dpi=1000)
Note that I used the figure dpi of 100 to fit in most screens, but saved with dpi=1000 to achieve the required resolution. In my system this produces a png with 3840×7190 pixels — it seems that the DPI saved is always 0.02 pixels/inch smaller than the selected value, which will have a (small) effect on large image sizes. Some more discussion of this here.
Answer 1
Based on your code, this worked for me, generating a 93 Mb png image with colour noise and the desired dimensions:
import matplotlib.pyplot as plt
import numpy
w = 7195
h = 3841
im_np = numpy.random.rand(h, w)
fig = plt.figure(frameon=False)
fig.set_size_inches(w, h)
ax = plt.Axes(fig, [0., 0., 1., 1.])
ax.set_axis_off()
fig.add_axes(ax)
ax.imshow(im_np, aspect='normal')
fig.savefig('figure.png', dpi=1)
Based on the accepted response by tiago, here is a small generic function that exports a numpy array to an image having the same resolution as the array:
import matplotlib.pyplot as plt
import numpy as np
def export_figure_matplotlib(arr, f_name, dpi=200, resize_fact=1, plt_show=False):
    """
    Export array as figure in original resolution
    :param arr: array of image to save in original resolution
    :param f_name: name of file where to save figure
    :param resize_fact: resize factor wrt shape of arr, in (0, np.infty)
    :param dpi: dpi of your screen
    :param plt_show: show plot or not
    """
    fig = plt.figure(frameon=False)
    fig.set_size_inches(arr.shape[1]/dpi, arr.shape[0]/dpi)
    ax = plt.Axes(fig, [0., 0., 1., 1.])
    ax.set_axis_off()
    fig.add_axes(ax)
    ax.imshow(arr)
    plt.savefig(f_name, dpi=(dpi * resize_fact))
    if plt_show:
        plt.show()
    else:
        plt.close()
As said in the previous reply by tiago, the screen DPI needs to be found first, which can be done here for instance: http://dpi.lv
I’ve added an additional argument resize_fact to the function, with which you can export the image at, for instance, 50% (0.5) of the original resolution.
#file_path = directory address where the image will be stored along with file name and extension
#array = variable where the image is stored. I think for the original post this variable is im_np
plt.imsave(file_path, array)
I have a bunch of MATLAB code from my MS thesis which I now want to convert to Python (using numpy/scipy and matplotlib) and distribute as open source. I know about the similarity between the MATLAB and Python scientific libraries, and converting it manually will take no more than a fortnight (provided that I work towards it every day for some time). I was wondering if there is already any tool available which can do the conversion.
INTRODUCTION: I have a list of more than 30,000 integer values ranging from 0 to 47, inclusive, e.g. [0,0,0,0,...,1,1,1,1,...,2,2,2,2,...,47,47,47,...], sampled from some continuous distribution. The values in the list are not necessarily in order, but order doesn’t matter for this problem.
PROBLEM: Based on my distribution, I would like to calculate the p-value (the probability of seeing greater values) for any given value. For example, the p-value for 0 would approach 1 and the p-value for higher numbers would tend to 0.
I don’t know if I am right, but to determine probabilities I think I need to fit my data to a theoretical distribution that is the most suitable to describe my data. I assume that some kind of goodness of fit test is needed to determine the best model.
Is there a way to implement such an analysis in Python (Scipy or Numpy)?
Could you present any examples?
Distribution Fitting with Sum of Square Error (SSE)
This is an update and modification to Saullo’s answer, that uses the full list of the current scipy.stats distributions and returns the distribution with the least SSE between the distribution’s histogram and the data’s histogram.
Example Fitting
Using the El Niño dataset from statsmodels, the distributions are fit and error is determined. The distribution with the least error is returned.
All Distributions
Best Fit Distribution
Example Code
%matplotlib inline
import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels as sm
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')
# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Distributions to check
    DISTRIBUTIONS = [
        st.alpha,st.anglit,st.arcsine,st.beta,st.betaprime,st.bradford,st.burr,st.cauchy,st.chi,st.chi2,st.cosine,
        st.dgamma,st.dweibull,st.erlang,st.expon,st.exponnorm,st.exponweib,st.exponpow,st.f,st.fatiguelife,st.fisk,
        st.foldcauchy,st.foldnorm,st.frechet_r,st.frechet_l,st.genlogistic,st.genpareto,st.gennorm,st.genexpon,
        st.genextreme,st.gausshyper,st.gamma,st.gengamma,st.genhalflogistic,st.gilbrat,st.gompertz,st.gumbel_r,
        st.gumbel_l,st.halfcauchy,st.halflogistic,st.halfnorm,st.halfgennorm,st.hypsecant,st.invgamma,st.invgauss,
        st.invweibull,st.johnsonsb,st.johnsonsu,st.ksone,st.kstwobign,st.laplace,st.levy,st.levy_l,st.levy_stable,
        st.logistic,st.loggamma,st.loglaplace,st.lognorm,st.lomax,st.maxwell,st.mielke,st.nakagami,st.ncx2,st.ncf,
        st.nct,st.norm,st.pareto,st.pearson3,st.powerlaw,st.powerlognorm,st.powernorm,st.rdist,st.reciprocal,
        st.rayleigh,st.rice,st.recipinvgauss,st.semicircular,st.t,st.triang,st.truncexpon,st.truncnorm,st.tukeylambda,
        st.uniform,st.vonmises,st.vonmises_line,st.wald,st.weibull_min,st.weibull_max,st.wrapcauchy
    ]

    # Best holders
    best_distribution = st.norm
    best_params = (0.0, 1.0)
    best_sse = np.inf

    # Estimate distribution parameters from data
    for distribution in DISTRIBUTIONS:
        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings('ignore')
                # fit dist to data
                params = distribution.fit(data)
                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]
                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))
                # if an axis was passed in, add to plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                except Exception:
                    pass
                # identify if this distribution is better
                if best_sse > sse > 0:
                    best_distribution = distribution
                    best_params = params
                    best_sse = sse
        except Exception:
            pass

    return (best_distribution.name, best_params)

def make_pdf(dist, params, size=10000):
    """Generate the distribution's Probability Distribution Function"""
    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]
    # Get sane start and end points of distribution
    start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
    end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)
    # Build PDF and turn into pandas Series
    x = np.linspace(start, end, size)
    y = dist.pdf(x, loc=loc, scale=scale, *arg)
    pdf = pd.Series(y, x)
    return pdf
# Load data from statsmodels datasets
data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())
# Plot for comparison
plt.figure(figsize=(12,8))
ax = data.plot(kind='hist', bins=50, normed=True, alpha=0.5, color=plt.rcParams['axes.color_cycle'][1])
# Save plot limits
dataYLim = ax.get_ylim()
# Find best fit distribution
best_fit_name, best_fit_params = best_fit_distribution(data, 200, ax)
best_dist = getattr(st, best_fit_name)
# Update plots
ax.set_ylim(dataYLim)
ax.set_title(u'El Niño sea temp.\n All Fitted Distributions')
ax.set_xlabel(u'Temp (°C)')
ax.set_ylabel('Frequency')
# Make PDF with best params
pdf = make_pdf(best_dist, best_fit_params)
# Display
plt.figure(figsize=(12,8))
ax = pdf.plot(lw=2, label='PDF', legend=True)
data.plot(kind='hist', bins=50, normed=True, alpha=0.5, label='Data', legend=True, ax=ax)
param_names = (best_dist.shapes + ', loc, scale').split(', ') if best_dist.shapes else ['loc', 'scale']
param_str = ', '.join(['{}={:0.2f}'.format(k,v) for k,v in zip(param_names, best_fit_params)])
dist_str = '{}({})'.format(best_fit_name, param_str)
ax.set_title(u'El Niño sea temp. with best fit distribution \n' + dist_str)
ax.set_xlabel(u'Temp. (°C)')
ax.set_ylabel('Frequency')
The fit() method mentioned by @Saullo Castro provides maximum likelihood estimates (MLE). The best distribution for your data can be determined in several different ways, such as:
1, the one that gives you the highest log likelihood.
2, the one that gives you the smallest AIC, BIC or BICc value (see wiki: http://en.wikipedia.org/wiki/Akaike_information_criterion; these can basically be viewed as the log likelihood adjusted for the number of parameters, since distributions with more parameters are expected to fit better). The formulas are sketched just below.
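For reference, with k estimated parameters, n data points and maximized likelihood \hat{L}, these criteria are:
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}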
Of course, if you already have a distribution that should describe your data (based on the theories in your particular field) and want to stick to that, you will skip the step of identifying the best-fit distribution.
AFAICU, your distribution is discrete (and nothing but discrete). Therefore just counting the frequencies of different values and normalizing them should be enough for your purposes. So, an example to demonstrate this:
In []: values= [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
In []: counts = asarray(bincount(values), dtype=float)
In []: cdf = counts.cumsum() / counts.sum()
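The ccdf referred to in the next sentence seems to have been dropped during extraction; it is simply the complement of the cdf computed above:
In []: ccdf = 1 - cdf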
Please note that the ccdf is closely related to the survival function (sf), but it is also defined for discrete distributions, whereas sf is defined only for continuous distributions.
Answer 4
This sounds like a probability density estimation problem to me.
from scipy.stats import gaussian_kde
occurences = [0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47]
values = range(0,48)
kde = gaussian_kde(map(float, occurences))
p = kde(values)
p = p / sum(p)
print "P(x>=1) = %f" % sum(p[1:])
# Create 1000 random integers, value between [0-50]
X = np.random.randint(0, 50,1000)
# Retrieve P-value for y
y = [0,10,45,55,100]
# From the distfit library import the class distfit
from distfit import distfit
# Initialize.
# Set any properties here, such as alpha.
# The smoothing can be of use when working with integers. Otherwise your histogram
# may be jumping up-and-down, and getting the correct fit may be harder.
dist = distfit(alpha=0.05, smooth=10)
# Search for best theoretical fit on your empirical data
dist.fit_transform(X)
> [distfit] >fit..
> [distfit] >transform..
> [distfit] >[norm ] [RSS: 0.0037894] [loc=23.535 scale=14.450]
> [distfit] >[expon ] [RSS: 0.0055534] [loc=0.000 scale=23.535]
> [distfit] >[pareto ] [RSS: 0.0056828] [loc=-384473077.778 scale=384473077.778]
> [distfit] >[dweibull ] [RSS: 0.0038202] [loc=24.535 scale=13.936]
> [distfit] >[t ] [RSS: 0.0037896] [loc=23.535 scale=14.450]
> [distfit] >[genextreme] [RSS: 0.0036185] [loc=18.890 scale=14.506]
> [distfit] >[gamma ] [RSS: 0.0037600] [loc=-175.505 scale=1.044]
> [distfit] >[lognorm ] [RSS: 0.0642364] [loc=-0.000 scale=1.802]
> [distfit] >[beta ] [RSS: 0.0021885] [loc=-3.981 scale=52.981]
> [distfit] >[uniform ] [RSS: 0.0012349] [loc=0.000 scale=49.000]
# Best fitted model
best_distr = dist.model
print(best_distr)
# Uniform shows best fit, with 95% CII (confidence intervals), and all other parameters
> {'distr': <scipy.stats._continuous_distns.uniform_gen at 0x16de3a53160>,
> 'params': (0.0, 49.0),
> 'name': 'uniform',
> 'RSS': 0.0012349021241149533,
> 'loc': 0.0,
> 'scale': 49.0,
> 'arg': (),
> 'CII_min_alpha': 2.45,
> 'CII_max_alpha': 46.55}
# Ranking distributions
dist.summary
# Plot the summary of fitted distributions
dist.plot_summary()
# Make prediction on new datapoints based on the fit
dist.predict(y)
# Retrieve your pvalues with
dist.y_pred
# array(['down', 'none', 'none', 'up', 'up'], dtype='<U4')
dist.y_proba
# array([0.02040816, 0.02040816, 0.02040816, 0.        , 0.        ])
# Or in one dataframe
dist.df
# The plot function will now also include the predictions of y
dist.plot()
Note that in this case, all points will be significant because of the uniform distribution. You can filter with the dist.y_pred if required.
With OpenTURNS, I would use the BIC criteria to select the best distribution that fits such data. This is because this criteria does not give too much advantage to the distributions which have more parameters. Indeed, if a distribution has more parameters, it is easier for the fitted distribution to be closer to the data. Moreover, the Kolmogorov-Smirnov may not make sense in this case, because a small error in the measured values will have a huge impact on the p-value.
To illustrate the process, I load the El-Nino data, which contains 732 monthly temperature measurements from 1950 to 2010:
import statsmodels.api as sm
dta = sm.datasets.elnino.load_pandas().data
dta['YEAR'] = dta.YEAR.astype(int).astype(str)
dta = dta.set_index('YEAR').T.unstack()
data = dta.values
It is easy to get the 30 built-in univariate distribution factories with the GetContinuousUniVariateFactories static method. Once done, the BestModelBIC static method returns the best model and the corresponding BIC score.
import openturns as ot
sample = ot.Sample([[p] for p in data])  # data reshaping
tested_factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
best_model, best_bic = ot.FittingTest.BestModelBIC(sample,
tested_factories)
print("Best=",best_model)
which prints:
Best= Beta(alpha = 1.64258, beta = 2.4348, a = 18.936, b = 29.254)
In order to graphically compare the fit to the histogram, I use the drawPDF methods of the best distribution.
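The plotting snippet did not survive extraction; a rough sketch of what it might look like, reusing sample and best_model from above (openturns.viewer is assumed to be available):
import openturns.viewer as otv
graph = ot.HistogramFactory().build(sample).drawPDF()  # histogram of the data
graph.add(best_model.drawPDF())                        # overlay the best BIC fit
graph.setTitle("El Nino temperatures and best BIC fit")
otv.View(graph)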
More details on this topic are presented in the BestModelBIC documentation. It would be possible to include a SciPy distribution via SciPyDistribution, or even ChaosPy distributions via ChaosPyDistribution, but I guess the current script fulfills most practical purposes.
Forgive me if I don’t understand your need but what about storing your data in a dictionary where keys would be the numbers between 0 and 47 and values the number of occurrences of their related keys in your original list?
Thus your likelihood p(x) will be the sum of all the values for keys greater than x divided by 30000.
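A minimal sketch of that idea (values is a placeholder for your original list of integers):
from collections import Counter
values = [0, 0, 0, 1, 1, 2, 47]       # placeholder for your 30,000+ samples
counts = Counter(values)              # key -> number of occurrences
total = sum(counts.values())
def p_value(x):
    """Probability of seeing a value greater than x."""
    return sum(c for k, c in counts.items() if k > x) / total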
What is the quickest/simplest way to drop nan and inf/-inf values from a pandas DataFrame without resetting mode.use_inf_as_null? I’d like to be able to use the subset and how arguments of dropna, except with inf values considered missing, like:
With option context, this is possible without permanently setting use_inf_as_na. For example:
with pd.option_context('mode.use_inf_as_na', True):
df = df.dropna(subset=['col1', 'col2'], how='all')
Of course it can be set to treat inf as NaN permanently with
pd.set_option('use_inf_as_na', True)
For older versions, replace use_inf_as_na with use_inf_as_null.
Answer 2
Here is another method using .loc to replace inf with nan on a Series:
s.loc[(~np.isfinite(s)) & s.notnull()] = np.nan
So, in response to the original question:
df = pd.DataFrame(np.ones((3, 3)), columns=list('ABC'))
for i in range(3):
    df.iat[i, i] = np.inf
df
A B C
0 inf 1.000000 1.000000
1 1.000000 inf 1.000000
2 1.000000 1.000000 inf
df.sum()
A inf
B inf
C inf
dtype: float64
df.apply(lambda s: s[np.isfinite(s)].dropna()).sum()
A 2
B 2
C 2
dtype: float64
Yet another solution would be to use the isin method. Use it to determine whether each value is infinite or missing and then chain the all method to determine if all the values in the rows are infinite or missing.
Finally, use the negation of that result to select the rows that don’t have all infinite or missing values via boolean indexing.
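A sketch of that approach (this assumes np.nan is matched by isin, which holds for recent pandas versions):
import numpy as np
mask = df.isin([np.nan, np.inf, -np.inf])  # True where a value is missing or infinite
df_clean = df[~mask.all(axis=1)]           # keep rows that are not entirely missing/infinite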
You can use pd.DataFrame.mask with np.isinf. You should first ensure your dataframe's series are all of type float. Then use dropna with your existing logic.
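A sketch of that combination, reusing the col1/col2 subset from the earlier example (column names are placeholders):
import numpy as np
df = df.astype(float)                 # make sure every series is float
df = df.mask(np.isinf(df))            # replace +/-inf with NaN
df = df.dropna(subset=['col1', 'col2'], how='all')  # your existing dropna logic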
I tried all the above and nothing worked for me. This solved all my problems:
pip install -U numpy
pip install -U scipy
Note that the -U option to pip install requests that the package be upgraded. Without it, if the package is already installed pip will inform you of this and exit without doing anything.
failed for me… With the following steps, it finally worked out (as root in a virtual environment, where python3 is a link to Python 3.2.2):
install the Ubuntu dependencies (see elaichi), clone NumPy and SciPy:
Then at the same website you can download scipy-0.18.1-cp35-cp35m-win32.whl
Note: you should download the file_name.whl according to your Python version. If your Python version is 32-bit Python 3.5 you should download this one; the “win32” refers to your Python build, not your operating system version.
Besides all of these answers,
If you install python of 32bit on your 64bit machine, you have to download scipy of 32-bit irrespective of your machine.
http://www.lfd.uci.edu/~gohlke/pythonlibs/
From the above URL you can download the packages; the command is pip install followed by the downloaded .whl file.
I recently discovered Conda after I was having trouble installing SciPy, specifically on a Heroku app that I am developing.
With Conda you create environments, very similar to what virtualenv does. My questions are:
If I use Conda will it replace the need for virtualenv? If not, how do I use the two together? Do I install virtualenv in Conda, or Conda in virtualenv?
Do I still need to use pip? If so, will I still be able to install packages with pip in an isolated environment?
Conda replaces virtualenv. In my opinion it is better. It is not limited to Python but can be used for other languages too. In my experience it provides a much smoother experience, especially for scientific packages. The first time I got MayaVi properly installed on Mac was with conda.
You can still use pip. In fact, conda installs pip in each new environment. It knows about pip-installed packages.
For example:
conda list
lists all installed packages in your current environment.
Conda-installed packages show up like this:
sphinx_rtd_theme 0.1.7 py35_0 defaults
and the ones installed via pip have the <pip> marker:
These environments are strongly tied to conda’s pip-like package management, so it is simple to create environments and install both Python and non-Python packages.
Jupyter
In addition, installing ipykernel in an environment adds a new entry in the Kernels dropdown menu of Jupyter notebooks, extending reproducible environments to notebooks. As of Anaconda 4.1, nbextensions were added, making it easier to add extensions to notebooks.
Reliability
In my experience, conda is faster and more reliable at installing large libraries such as numpy and pandas. Moreover, if you wish to transfer the preserved state of an environment, you can do so by sharing or cloning an env.
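For example, the commands for sharing or cloning an environment look roughly like this (environment names are placeholders):
conda env export > environment.yml         # share: dump the current env spec
conda env create -f environment.yml        # recreate it on another machine
conda create --name newenv --clone oldenv  # clone an existing env locally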
Installing Conda will enable you to create and remove Python environments as you wish, therefore providing you with the same functionality as virtualenv would.
With both tools you can create an isolated filesystem tree, where you can install and remove Python packages (probably with pip) as you wish. This can come in handy if you want different versions of the same library for different use cases, or if you just want to try some distribution and remove it afterwards, conserving your disk space.
Differences:
License agreement. While virtualenv comes under the very liberal MIT license, Conda uses a 3-clause BSD license.
Conda provides you with its own package control system. This package control system often provides precompiled versions (for the most popular systems) of popular non-Python software, which can ease one's way to getting some machine learning packages working. Namely, you don't have to compile optimized C/C++ code for your system. While this is a great relief for most of us, it might affect the performance of such libraries.
Unlike virtualenv, Conda duplicates some system libraries, at least on Linux systems. These libraries can get out of sync, leading to inconsistent behavior of your programs.
Verdict:
Conda is great and should be your default choice when starting out with machine learning. It will save you some time messing with gcc and numerous packages. Yet Conda does not replace virtualenv. It introduces some additional complexity which might not always be desired, and it comes under a different license. You might want to avoid using conda on distributed environments or on HPC hardware.
I use both and (as of Jan, 2020) they have some superficial differences that lend themselves to different usages for me. By default Conda prefers to manage a list of environments for you in a central location, whereas virtualenv makes a folder in the current directory. The former (centralized) makes sense if you are e.g. doing machine learning and just have a couple of broad environments that you use across many projects and want to jump into them from anywhere. The latter (per project folder) makes sense if you are doing little one-off projects that have completely different sets of lib requirements that really belong more to the project itself.
The empty environment that Conda creates is about 122MB whereas the virtualenv’s is about 12MB, so that’s another reason you may prefer not to scatter Conda environments around everywhere.
Finally, another superficial indication that Conda prefers its centralized envs is that (again, by default) if you do create a Conda env in your own project folder and activate it the name prefix that appears in your shell is the (way too long) absolute path to the folder. You can fix that by giving it a name, but virtualenv does the right thing by default.
I expect this info to become stale rapidly as the two package managers vie for dominance, but these are the trade-offs as of today :)
I work in a corporate environment, behind several firewalls, with a machine on which I have no admin access.
In my limited experience with Python (2 years) I have come across a few libraries (JayDeBeApi, sasl) which, when installed via pip, threw C++ dependency errors:
error: Microsoft Visual C++ 14.0 is required. Get it with “Microsoft Visual C++ Build Tools”: http://landinghub.visualstudio.com/visual-cpp-build-tools
These installed fine with conda, so since those days I have been working with conda environments.
However, it isn't easy to stop conda from installing dependencies inside C:\Program Files, where I don't have write access.