I can’t seem to find any Python libraries that do multiple regression. The only things I find do simple regression only. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).
For example, with this data:
print 'y x1 x2 x3 x4 x5 x6 x7'
for t in texts:
    print "{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}" \
        .format(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7)
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
[t.y for t in texts])
Then clf.coef_ will have the regression coefficients.
sklearn.linear_model also has similar interfaces to do various kinds of regularizations on the regression.
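As a small follow-on sketch (reusing the texts list from the question), you can inspect the fitted parameters and swap in a regularized estimator with the same interface; the alpha value here is just an illustrative choice:

from sklearn import linear_model

X = [[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts]
Y = [t.y for t in texts]

clf = linear_model.LinearRegression()
clf.fit(X, Y)
print(clf.coef_)       # one coefficient per predictor x1..x7
print(clf.intercept_)  # fitted constant term

# Same interface with L2 regularization (ridge regression)
ridge = linear_model.Ridge(alpha=1.0)
ridge.fit(X, Y)
print(ridge.coef_)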
Answer 1
Here is a little workaround that I created. I checked it against R and it works correctly.
import numpy as np
import statsmodels.api as sm
y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]
x = [
[4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
[4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
[4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
]
def reg_m(y, x):
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results
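A quick usage sketch with the y and x lists defined above (just illustrative):

results = reg_m(y, x)
print(results.summary())   # full regression table
print(results.params)      # fitted coefficients; note the x sub-lists end up stacked in reverse order, constant last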
Just to clarify, the example you gave is multiple linear regression, not multivariate linear regression. See the difference:
The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term multivariate linear regression refers to cases where y is a vector, i.e., the same as general linear regression. The difference between multivariate linear regression and multivariable linear regression should be emphasized as it causes much confusion and misunderstanding in the literature.
In short:
multiple linear regression: the response y is a scalar.
multivariate linear regression: the response y is a vector.
You can also compute the coefficients directly with numpy.linalg.lstsq:

import numpy as np

y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array([
    [-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
    [-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
    [-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
    [14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
    [4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
    [0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
    [0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55]
])
X = X.T                             # transpose so input vectors are along the rows
X = np.c_[X, np.ones(X.shape[0])]   # add bias term
beta_hat = np.linalg.lstsq(X, y)[0]
print beta_hat
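Once beta_hat is available you can, for example, reconstruct the fitted values and residuals (a small illustrative sketch):

y_hat = X.dot(beta_hat)   # fitted values; X already includes the bias column
residuals = y - y_hat
print y_hat
print residuals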
Another option is to build the design matrix as a pandas DataFrame and fit it with statsmodels OLS:

from random import random
from pandas import DataFrame
from statsmodels.api import OLS

lr = lambda: [random() for i in range(100)]
x = DataFrame({'x1': lr(), 'x2': lr(), 'x3': lr()})
x['b'] = 1
y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4

print x.head()

         x1        x2        x3  b
0  0.433681  0.946723  0.103422  1
1  0.400423  0.527179  0.131674  1
2  0.992441  0.900678  0.360140  1
3  0.413757  0.099319  0.825181  1
4  0.796491  0.862593  0.193554  1

print y.head()

0    6.637392
1    5.849802
2    7.874218
3    7.087938
4    7.102337
dtype: float64

model = OLS(y, x)
result = model.fit()
print result.summary()

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 5.859e+30
Date:                Wed, 09 Dec 2015   Prob (F-statistic):               0.00
Time:                        15:17:32   Log-Likelihood:                 3224.9
No. Observations:                 100   AIC:                            -6442.
Df Residuals:                      96   BIC:                            -6431.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|     [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             1.0000   8.98e-16   1.11e+15      0.000         1.000     1.000
x2             2.0000   8.28e-16   2.41e+15      0.000         2.000     2.000
x3             3.0000   8.34e-16    3.6e+15      0.000         3.000     3.000
b              4.0000   8.51e-16    4.7e+15      0.000         4.000     4.000
==============================================================================
Omnibus:                        7.675   Durbin-Watson:                   1.614
Prob(Omnibus):                  0.022   Jarque-Bera (JB):                3.118
Skew:                           0.045   Prob(JB):                        0.210
Kurtosis:                       2.140   Cond. No.                         6.89
==============================================================================
You can use the function below and pass it a DataFrame:
def linear(x, y=None, show=True):
    """
    @param x: pd.DataFrame
    @param y: pd.DataFrame or pd.Series or None
              if None, use the last column of x as y
    @param show: whether to print the regression summary
    """
    import statsmodels.api as sm
    xy = sm.add_constant(x if y is None else pd.concat([x, y], axis=1))
    res = sm.OLS(xy.ix[:, -1], xy.ix[:, :-1], missing='drop').fit()
    if show:
        print res.summary()
    return res
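A usage sketch, reusing the random DataFrame from the statsmodels example above; pd must be available in the calling scope because the helper uses pd.concat, and .ix assumes the older pandas versions this answer was written for:

import pandas as pd

res = linear(x[['x1', 'x2', 'x3']], y)   # prints the summary because show=True
print(res.params)                        # coefficients plus the added constant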
Scikit-learn is a machine learning library for Python which can do this job for you.
Just import the sklearn.linear_model module into your script.
Find the code template for Multiple Linear Regression using sklearn in Python:
import numpy as np
import matplotlib.pyplot as plt #to plot visualizations
import pandas as pd
# Importing the dataset
df = pd.read_csv(<Your-dataset-path>)
# Assigning feature and target variables
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
# Use label encoders, if you have any categorical variable
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
X['<column-name>'] = labelencoder.fit_transform(X['<column-name>'])
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(categorical_features = ['<index-value>'])
X = onehotencoder.fit_transform(X).toarray()
# Avoiding the dummy variable trap
X = X[:,1:] # Usually done by the algorithm itself
# Splitting the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0, test_size = 0.2)
# Fitting the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting the test set results
y_pred = regressor.predict(X_test)
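To sanity-check the fit, you could, for example, inspect the learned parameters and the R-squared score on the held-out split (a minimal sketch):

print(regressor.coef_)                  # one coefficient per feature column
print(regressor.intercept_)             # fitted constant term
print(regressor.score(X_test, y_test))  # R^2 on the test set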
That’s it. You can use this code as a template for implementing Multiple Linear Regression on any dataset.
For a better understanding with an example, visit: Linear Regression with an example
Answer 11
Here is another basic approach:
from patsy import dmatrices
import statsmodels.api as sm
y,x = dmatrices("y_data ~ x_1 + x_2 ", data = my_data)
### y_data is the name of the dependent variable in your data ###
model_fit = sm.OLS(y,x)
results = model_fit.fit()
print(results.summary())
Instead of sm.OLS you can also use sm.Logit, sm.Probit, and so on.
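For example, if the response were a binary (0/1) outcome, the same patsy design matrices could be fed to sm.Logit; a sketch, where y_binary is a hypothetical 0/1 column in my_data:

from patsy import dmatrices
import statsmodels.api as sm

y, x = dmatrices("y_binary ~ x_1 + x_2", data=my_data)  # y_binary is a hypothetical 0/1 column
logit_results = sm.Logit(y, x).fit()
print(logit_results.summary())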
I have a set of data and I want to compare which line describes it best (polynomials of different orders, exponential or logarithmic).
I use Python and NumPy, and for polynomial fitting there is the polyfit() function. But I found no such functions for exponential and logarithmic fitting.
Are there any? Or how else can I solve this?
Answer 0
For fitting y = A + B log x, just fit y against (log x).
>>> x = numpy.array([1, 7, 20, 50, 79])
>>> y = numpy.array([10, 19, 30, 35, 51])
>>> numpy.polyfit(numpy.log(x), y, 1)
array([ 8.46295607, 6.61867463])
# y ≈ 8.46 log(x) + 6.62
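If you then want the fitted values at the original x positions, numpy.polyval can evaluate the polynomial in log(x) (a small sketch):

>>> coeffs = numpy.polyfit(numpy.log(x), y, 1)
>>> y_fit = numpy.polyval(coeffs, numpy.log(x))  # evaluates 8.46*log(x) + 6.62 at the original x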
For fitting y = Ae^(Bx), taking the logarithm of both sides gives log y = log A + Bx. So fit (log y) against x.
Note that fitting (log y) as if it were linear will emphasize small values of y, causing large deviations for large y. This is because polyfit (linear regression) works by minimizing ∑ᵢ (ΔYᵢ)² = ∑ᵢ (Yᵢ − Ŷᵢ)². When Yᵢ = log yᵢ, the residuals ΔYᵢ = Δ(log yᵢ) ≈ Δyᵢ / |yᵢ|. So even if polyfit makes a very bad decision for large y, the “divide-by-|y|” factor will compensate for it, causing polyfit to favor small values.
This could be alleviated by giving each entry a “weight” proportional to y. polyfit supports weighted-least-squares via the w keyword argument.
>>> x = numpy.array([10, 19, 30, 35, 51])
>>> y = numpy.array([1, 7, 20, 50, 79])
>>> numpy.polyfit(x, numpy.log(y), 1)
array([ 0.10502711, -0.40116352])
# y ≈ exp(-0.401) * exp(0.105 * x) = 0.670 * exp(0.105 * x)
# (^ biased towards small values)
>>> numpy.polyfit(x, numpy.log(y), 1, w=numpy.sqrt(y))
array([ 0.06009446, 1.41648096])
# y ≈ exp(1.42) * exp(0.0601 * x) = 4.12 * exp(0.0601 * x)
# (^ not so biased)
Note that Excel, LibreOffice and most scientific calculators typically use the unweighted (biased) formula for the exponential regression / trend lines. If you want your results to be compatible with these platforms, do not include the weights even if it provides better results.
Now, if you can use scipy, you could use scipy.optimize.curve_fit to fit any model without transformations.
For y = A + B log x, the result is the same as with the transformation method.
For y = Ae^(Bx), however, we can get a better fit, since curve_fit minimizes the residuals Δy directly rather than Δ(log y). But we need to provide an initial guess so curve_fit can reach the desired local minimum; a sketch of both fits follows below.
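A rough sketch of both fits, reusing the arrays from the snippets above (the p0 guess is just an illustrative starting point):

import numpy
import scipy.optimize

# y = A + B log x: gives the same answer as fitting y against log(x)
x = numpy.array([1, 7, 20, 50, 79])
y = numpy.array([10, 19, 30, 35, 51])
popt_log, pcov_log = scipy.optimize.curve_fit(lambda t, a, b: a + b * numpy.log(t), x, y)

# y = A * exp(B x): fit in the original y space, with an initial guess p0
x2 = numpy.array([10, 19, 30, 35, 51])
y2 = numpy.array([1, 7, 20, 50, 79])
popt_exp, pcov_exp = scipy.optimize.curve_fit(lambda t, a, b: a * numpy.exp(b * t),
                                              x2, y2, p0=(4, 0.1))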
You can also fit a set of data to whatever function you like using curve_fit from scipy.optimize. For example, if you want to fit an exponential function (from the documentation):
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def func(x, a, b, c):
    return a * np.exp(-b * x) + c

x = np.linspace(0, 4, 50)
y = func(x, 2.5, 1.3, 0.5)
yn = y + 0.2 * np.random.normal(size=len(x))

popt, pcov = curve_fit(func, x, yn)
(Note: the * in front of popt when you plot will expand out the terms into the a, b, and c that func is expecting.)
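For reference, the plotting call that the note refers to might look like this (a minimal sketch using the variables defined above):

plt.plot(x, yn, 'ko', label="Noisy data")
plt.plot(x, func(x, *popt), 'r-', label="Fitted curve")  # *popt unpacks into a, b, c
plt.legend()
plt.show()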
Answer 2
I was having some trouble with this, so let me be very explicit so that noobs like me can understand.
Let's say that we have a data file or something like that:
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import numpy as np
import sympy as sym
"""
Generate some data, let's imagine that you already have this.
"""
x = np.linspace(0, 3, 50)
y = np.exp(x)
"""
Plot your data
"""
plt.plot(x, y, 'ro',label="Original Data")
"""
brute force to avoid errors
"""
x = np.array(x, dtype=float) #transform your data in a numpy array of floats
y = np.array(y, dtype=float) #so the curve_fit can work
"""
create a function to fit with your data. a, b, c and d are the coefficients
that curve_fit will calculate for you.
In this part you need to guess and/or use mathematical knowledge to find
a function that resembles your data
"""
def func(x, a, b, c, d):
    return a*x**3 + b*x**2 + c*x + d
"""
make the curve_fit
"""
popt, pcov = curve_fit(func, x, y)
"""
The result is:
popt[0] = a , popt[1] = b, popt[2] = c and popt[3] = d of the function,
so f(x) = popt[0]*x**3 + popt[1]*x**2 + popt[2]*x + popt[3].
"""
print "a = %s , b = %s, c = %s, d = %s" % (popt[0], popt[1], popt[2], popt[3])
"""
Use sympy to generate the LaTeX syntax of the function
"""
xs = sym.Symbol('\lambda')
tex = sym.latex(func(xs,*popt)).replace('$', '')
plt.title(r'$f(\lambda)= %s$' %(tex),fontsize=16)
"""
Print the coefficients and plot the function.
"""
plt.plot(x, func(x, *popt), label="Fitted Curve") #same as line above \/
#plt.plot(x, popt[0]*x**3 + popt[1]*x**2 + popt[2]*x + popt[3], label="Fitted Curve")
plt.legend(loc='upper left')
plt.show()
the result is:
a = 0.849195983017 , b = -1.18101681765, c = 2.24061176543, d = 0.816643894816
Answer 3
Well, I guess you can always use:
np.log --> natural log
np.log10 --> base 10
np.log2 --> base 2
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
np.random.seed(123)
# General Functions
def func_exp(x, a, b, c):
    """Return values from a general exponential function."""
    return a * np.exp(b * x) + c

def func_log(x, a, b, c):
    """Return values from a general log function."""
    return a * np.log(b * x) + c

# Helper
def generate_data(func, *args, jitter=0):
    """Return a tuple of arrays with random data along a general function."""
    xs = np.linspace(1, 5, 50)
    ys = func(xs, *args)
    noise = jitter * np.random.normal(size=len(xs)) + jitter
    xs = xs.reshape(-1, 1)   # xs[:, np.newaxis]
    ys = (ys + noise).reshape(-1, 1)
    return xs, ys
The general approach is:
1. Apply a log operation to the data values (x, y, or both)
2. Regress the data to a linearized model
3. Plot by “reversing” any log operations (with np.exp()) and fit to the original data (a sketch follows after the summary table below)
Assuming our data follows an exponential trend, a general equation+ may be y = A * exp(B*x) + C. We can linearize this equation (into the form y = intercept + slope * x) by taking the log of both sides: log(y - C) = log(A) + B*x. Given a linearized equation++ and the regression parameters, we could calculate:
A via the intercept (ln(A))
B via the slope (B)
Summary of Linearization Techniques
Relationship | Example | General Eqn. | Altered Var. | Linearized Eqn.
-------------|------------|----------------------|----------------|------------------------------------------
Linear | x | y = B * x + C | - | y = C + B * x
Logarithmic | log(x) | y = A * log(B*x) + C | log(x) | y = C + A * (log(B) + log(x))
Exponential | 2**x, e**x | y = A * exp(B*x) + C | log(y) | log(y-C) = log(A) + B * x
Power | x**2 | y = B * x**N + C | log(x), log(y) | log(y-C) = log(B) + N * log(x)
+Note: linearizing exponential functions works best when the noise is small and C=0. Use with caution.
++Note: while altering x data helps linearize exponential data, altering y data helps linearize log data.
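Below is a sketch of that linearized-regression workflow using the helpers defined above, under the C = 0 assumption from the first note; the specific parameter values (2.5, 1.2) are just illustrative:

# Generate exponential-looking data with the helper defined earlier
x_samp, y_samp = generate_data(func_exp, 2.5, 1.2, 0, jitter=0)

# Linearize: regress log(y) against x
model = LinearRegression()
model.fit(x_samp, np.log(y_samp))

B = model.coef_[0][0]             # slope -> B
A = np.exp(model.intercept_[0])   # intercept -> ln(A), so A = exp(intercept)

# "Reverse" the log to plot the fit on the original scale
plt.scatter(x_samp, y_samp, label="data")
plt.plot(x_samp, A * np.exp(B * x_samp), 'r-', label="linearized fit")
plt.legend()
plt.show()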
import lmfit
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(123)
# General Functions
def func_log(x, a, b, c):
    """Return values from a general log function."""
    return a * np.log(b * x) + c

# Data
x_samp = np.linspace(1, 5, 50)
_noise = np.random.normal(size=len(x_samp), scale=0.06)
y_samp = 2.5 * np.exp(1.2 * x_samp) + 0.7 + _noise
y_samp2 = 2.5 * np.log(1.2 * x_samp) + 0.7 + _noise
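From here, a fit with lmfit might look roughly like this (a sketch; the initial a, b, c values are just guesses):

model = lmfit.Model(func_log)
result = model.fit(y_samp2, x=x_samp, a=1.0, b=1.0, c=1.0)  # initial parameter guesses
print(result.fit_report())

plt.plot(x_samp, y_samp2, 'ko', label="data")
plt.plot(x_samp, result.best_fit, 'r-', label="lmfit fit")
plt.legend()
plt.show()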