import pandas as pd
df = pd.DataFrame({"A":[10,20,30,40,50],"B":[20,30,10,40,50],"C":[32,234,23,23,42523]})
Ideally, I would have something like ols(A ~ B + C, data = df), but when I look at the examples from algorithm libraries like scikit-learn, they appear to feed the data to the model as a list of rows instead of columns. This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most pythonic way to run an OLS regression (or any machine learning algorithm more generally) on data in a pandas DataFrame?
I think you can almost do exactly what you thought would be ideal, using the statsmodels package, which was one of pandas' optional dependencies before pandas version 0.20.0 (it was used for a few things in pandas.stats).
>>> import pandas as pd
>>> import statsmodels.formula.api as sm
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> result = sm.ols(formula="A ~ B + C", data=df).fit()
>>> print(result.params)
Intercept 14.952480
B 0.401182
C 0.000352
dtype: float64
>>> print(result.summary())
OLS Regression Results
==============================================================================
Dep. Variable: A R-squared: 0.579
Model: OLS Adj. R-squared: 0.158
Method: Least Squares F-statistic: 1.375
Date: Thu, 14 Nov 2013 Prob (F-statistic): 0.421
Time: 20:04:30 Log-Likelihood: -18.178
No. Observations: 5 AIC: 42.36
Df Residuals: 2 BIC: 41.19
Df Model: 2
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 14.9525 17.764 0.842 0.489 -61.481 91.386
B 0.4012 0.650 0.617 0.600 -2.394 3.197
C 0.0004 0.001 0.650 0.583 -0.002 0.003
==============================================================================
Omnibus: nan Durbin-Watson: 1.061
Prob(Omnibus): nan Jarque-Bera (JB): 0.498
Skew: -0.123 Prob(JB): 0.780
Kurtosis: 1.474 Cond. No. 5.21e+04
==============================================================================
Warnings:
[1] The condition number is large, 5.21e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
The same fit was also available through the ols function in pandas.stats.api (removed along with pandas.stats in 0.20.0):

>>> from pandas.stats.api import ols
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> res = ols(y=df['A'], x=df[['B', 'C']])
>>> res

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <B> + <C> + <intercept>

Number of Observations:         5
Number of Degrees of Freedom:   3

R-squared:         0.5789
Adj R-squared:     0.1577

Rmse:             14.5108

F-stat (2, 2):     1.3746, p-value:     0.4211

Degrees of Freedom: model 2, resid 2

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             B     0.4012     0.6497       0.62     0.5999    -0.8723     1.6746
             C     0.0004     0.0005       0.65     0.5826    -0.0007     0.0014
     intercept    14.9525    17.7643       0.84     0.4886   -19.8655    49.7705
---------------------------------End of Summary---------------------------------
I don’t know if this is new in sklearn or pandas, but I’m able to pass the data frame directly to sklearn without converting the data frame to a numpy array or any other data types.
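For example, a minimal sketch of what that looks like (my own illustration reusing the question's DataFrame, not the answer author's exact code):

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                   "B": [20, 30, 10, 40, 50],
                   "C": [32, 234, 23, 23, 42523]})

# X can be a DataFrame and y a Series; no conversion to lists or arrays needed.
reg = LinearRegression()
reg.fit(df[["B", "C"]], df["A"])
print(reg.intercept_, reg.coef_)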
Statsmodels can build an OLS model with column references directly to a pandas dataframe.
Short and sweet:
model = sm.OLS(df[y], df[x]).fit()
Code details and regression summary:
# imports
import pandas as pd
import statsmodels.api as sm
import numpy as np

# data
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('ABC'))

# assign dependent and independent / explanatory variables
variables = list(df.columns)
y = 'A'
x = [var for var in variables if var not in y]

# Ordinary least squares regression
model_Simple = sm.OLS(df[y], df[x]).fit()

# Add a constant term like so:
model = sm.OLS(df[y], sm.add_constant(df[x])).fit()

model.summary()
Output:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      A   R-squared:                       0.019
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.9409
Date:                Thu, 14 Feb 2019   Prob (F-statistic):              0.394
Time:                        08:35:04   Log-Likelihood:                -484.49
No. Observations:                 100   AIC:                             975.0
Df Residuals:                      97   BIC:                             982.8
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         43.4801      8.809      4.936      0.000      25.996      60.964
B              0.1241      0.105      1.188      0.238      -0.083       0.332
C             -0.0752      0.110     -0.681      0.497      -0.294       0.144
==============================================================================
Omnibus:                       50.990   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                6.905
Skew:                           0.032   Prob(JB):                       0.0317
Kurtosis:                       1.714   Cond. No.                         231.
==============================================================================
How to directly get the R-squared, coefficients, and p-values:
# commands:
model.params
model.pvalues
model.rsquared

# demo:
In [1]: model.params
Out[1]:
const    43.480106
B         0.124130
C        -0.075156
dtype: float64

In [2]: model.pvalues
Out[2]:
const    0.000003
B        0.237924
C        0.497400
dtype: float64

In [3]: model.rsquared
Out[3]: 0.0190
This is kind of overkill, but let's give it a go. First let's use statsmodels to find out what the p-values should be:

import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

We can then reproduce the standard errors, t-statistics and p-values by hand from sklearn's LinearRegression:

lm = LinearRegression()
lm.fit(X, y)
params = np.append(lm.intercept_, lm.coef_)
predictions = lm.predict(X)

newX = pd.DataFrame({"Constant": np.ones(len(X))}).join(pd.DataFrame(X))
MSE = (sum((y-predictions)**2))/(len(newX)-len(newX.columns))

# Note if you don't want to use a DataFrame replace the two lines above with
# newX = np.append(np.ones((len(X),1)), X, axis=1)
# MSE = (sum((y-predictions)**2))/(len(newX)-len(newX[0]))

var_b = MSE*(np.linalg.inv(np.dot(newX.T, newX)).diagonal())
sd_b = np.sqrt(var_b)
ts_b = params / sd_b

# degrees of freedom = number of observations minus number of fitted parameters (constant included)
p_values = [2*(1-stats.t.cdf(np.abs(i), (len(newX)-len(newX.columns)))) for i in ts_b]

sd_b = np.round(sd_b, 3)
ts_b = np.round(ts_b, 3)
p_values = np.round(p_values, 3)
params = np.round(params, 4)

myDF3 = pd.DataFrame()
myDF3["Coefficients"], myDF3["Standard Errors"], myDF3["t values"], myDF3["Probabilities"] = [params, sd_b, ts_b, p_values]
print(myDF3)

This gives us:

    Coefficients  Standard Errors  t values  Probabilities
0       152.1335            2.576    59.061          0.000
1       -10.0122           59.749    -0.168          0.867
2      -239.8191           61.222    -3.917          0.000
3       519.8398           66.534     7.813          0.000
4       324.3904           65.422     4.958          0.000
5      -792.1842          416.684    -1.901          0.058
6       476.7458          339.035     1.406          0.160
7       101.0446          212.533     0.475          0.635
8       177.0642          161.476     1.097          0.273
9       751.2793          171.902     4.370          0.000
10       67.6254           65.984     1.025          0.306
scikit-learn’s LinearRegression doesn’t calculate this information but you can easily extend the class to do it:
from sklearn import linear_model
from scipy import stats
import numpy as np
class LinearRegression(linear_model.LinearRegression):
"""
LinearRegression class after sklearn's, but calculate t-statistics
and p-values for model coefficients (betas).
Additional attributes available after .fit()
are `t` and `p` which are of the shape (y.shape[1], X.shape[1])
which is (n_features, n_coefs)
This class sets the intercept to 0 by default, since usually we include it
in X.
"""
def __init__(self, *args, **kwargs):
if not "fit_intercept" in kwargs:
kwargs['fit_intercept'] = False
super(LinearRegression, self)\
.__init__(*args, **kwargs)
def fit(self, X, y, n_jobs=1):
self = super(LinearRegression, self).fit(X, y, n_jobs)
sse = np.sum((self.predict(X) - y) ** 2, axis=0) / float(X.shape[0] - X.shape[1])
se = np.array([
np.sqrt(np.diagonal(sse[i] * np.linalg.inv(np.dot(X.T, X))))
for i in range(sse.shape[0])
])
self.t = self.coef_ / se
self.p = 2 * (1 - stats.t.cdf(np.abs(self.t), y.shape[0] - X.shape[1]))
return self
The code in elyase’s answer https://stackoverflow.com/a/27928411/4240413 does not actually work. Notice that sse is a scalar, and then it tries to iterate through it. The following code is a modified version. Not amazingly clean, but I think it works more or less.
import numpy as np
import scipy as sc
import scipy.linalg  # needed for sc.linalg.sqrtm below
from scipy.stats import t
from sklearn import linear_model

class LinearRegression(linear_model.LinearRegression):
def __init__(self,*args,**kwargs):
# *args is the list of arguments that might go into the LinearRegression object
# that we don't know about and don't want to have to deal with. Similarly, **kwargs
# is a dictionary of key words and values that might also need to go into the orginal
# LinearRegression object. We put *args and **kwargs so that we don't have to look
# these up and write them down explicitly here. Nice and easy.
if not "fit_intercept" in kwargs:
kwargs['fit_intercept'] = False
super(LinearRegression,self).__init__(*args,**kwargs)
# Adding in t-statistics for the coefficients.
def fit(self,x,y):
# This takes in numpy arrays (not matrices). Also assumes you are leaving out the column
# of constants.
# Not totally sure what 'super' does here and why you redefine self...
self = super(LinearRegression, self).fit(x,y)
n, k = x.shape
yHat = np.matrix(self.predict(x)).T
# Change X and Y into numpy matricies. x also has a column of ones added to it.
x = np.hstack((np.ones((n,1)),np.matrix(x)))
y = np.matrix(y).T
# Degrees of freedom.
df = float(n-k-1)
# Sample variance.
sse = np.sum(np.square(yHat - y),axis=0)
self.sampleVariance = sse/df
# Sample variance for x.
self.sampleVarianceX = x.T*x
# Covariance Matrix = [(s^2)(X'X)^-1]^0.5. (sqrtm = matrix square root. ugly)
self.covarianceMatrix = sc.linalg.sqrtm(self.sampleVariance[0,0]*self.sampleVarianceX.I)
# Standard erros for the difference coefficients: the diagonal elements of the covariance matrix.
self.se = self.covarianceMatrix.diagonal()[1:]
# T statistic for each beta.
self.betasTStat = np.zeros(len(self.se))
        for i in range(len(self.se)):
self.betasTStat[i] = self.coef_[0,i]/self.se[i]
# P-value for each beta. This is a two sided t-test, since the betas can be
# positive or negative.
        self.betasPValue = 2 * (1 - t.cdf(abs(self.betasTStat), df))
An easy way to pull out the p-values is to use a statsmodels regression:
import statsmodels.api as sm
mod = sm.OLS(Y,X)
fii = mod.fit()
p_values = fii.summary2().tables[1]['P>|t|']
You get a series of p-values that you can manipulate (for example, choosing which terms to keep by evaluating each p-value).
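For instance, a small sketch of that kind of manipulation (my own illustration; the 0.05 cutoff is an arbitrary example, not part of the original answer):

# p_values is a pandas Series indexed by regressor name,
# so it can be filtered and sorted like any other Series.
significant = p_values[p_values < 0.05].sort_values()
print(significant)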
The p-value is reported with the F-statistic. If you want to get that value, simply use these few lines of code:
import statsmodels.api as sm
from sklearn import datasets
from scipy import stats
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
print(est.fit().f_pvalue)
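Note that f_pvalue is the p-value of the overall F-test. If you want one p-value per coefficient instead, the fitted results also expose pvalues (the same statsmodels attribute used in the earlier answers):

results = est.fit()
print(results.f_pvalue)   # p-value of the overall F-test
print(results.pvalues)    # one p-value per coefficient, including the constant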
There could be a mistake in @JARH's answer in the case of a multivariable regression.
(I do not have enough reputation to comment.)
In the following line:
p_values = [2*(1-stats.t.cdf(np.abs(i), (len(newX)-1))) for i in ts_b],
the t-values follow a Student's t-distribution with len(newX)-1 degrees of freedom, instead of a t-distribution with len(newX)-len(newX.columns)-1 degrees of freedom.
So this should be:
p_values = [2*(1-stats.t.cdf(np.abs(i), (len(newX)-len(newX.columns)-1))) for i in ts_b]
For a one-liner you can use the pingouin.linear_regression function (disclaimer: I am the creator of Pingouin), which works with uni/multi-variate regression using NumPy arrays or a pandas DataFrame, e.g.:
import pingouin as pg
# Using a Pandas DataFrame `df`:
lm = pg.linear_regression(df[['x', 'z']], df['y'])
# Using a NumPy array:
lm = pg.linear_regression(X, y)
The output is a dataframe with the beta coefficients, standard errors, T-values, p-values and confidence intervals for each predictor, as well as the R^2 and adjusted R^2 of the fit.
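As a rough sketch of how that table might be inspected (assuming the column names used by recent Pingouin versions, e.g. 'names', 'coef' and 'pval'; check lm.columns on your install):

# Show the predictor names, coefficients and p-values from the pingouin output.
print(lm[['names', 'coef', 'pval']])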