Tag Archives: regression

Run an OLS regression with a Pandas Data Frame

Question: Run an OLS regression with a Pandas Data Frame


I have a pandas data frame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example:

import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50], 
                   "B": [20, 30, 10, 40, 50], 
                   "C": [32, 234, 23, 23, 42523]})

Ideally, I would have something like ols(A ~ B + C, data = df) but when I look at the examples from algorithm libraries like scikit-learn it appears to feed the data to the model with a list of rows instead of columns. This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place. What is the most pythonic way to run an OLS regression (or any machine learning algorithm more generally) on data in a pandas data frame?


Answer 0


I think you can almost do exactly what you thought would be ideal, using the statsmodels package, which was one of pandas' optional dependencies before pandas 0.20.0 (it was used for a few things in pandas.stats).

>>> import pandas as pd
>>> import statsmodels.formula.api as sm
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> result = sm.ols(formula="A ~ B + C", data=df).fit()
>>> print(result.params)
Intercept    14.952480
B             0.401182
C             0.000352
dtype: float64
>>> print(result.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      A   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.158
Method:                 Least Squares   F-statistic:                     1.375
Date:                Thu, 14 Nov 2013   Prob (F-statistic):              0.421
Time:                        20:04:30   Log-Likelihood:                -18.178
No. Observations:                   5   AIC:                             42.36
Df Residuals:                       2   BIC:                             41.19
Df Model:                           2                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.9525     17.764      0.842      0.489       -61.481    91.386
B              0.4012      0.650      0.617      0.600        -2.394     3.197
C              0.0004      0.001      0.650      0.583        -0.002     0.003
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.061
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.498
Skew:                          -0.123   Prob(JB):                        0.780
Kurtosis:                       1.474   Cond. No.                     5.21e+04
==============================================================================

Warnings:
[1] The condition number is large, 5.21e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
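
If you then want predictions from the fitted formula model, the result object exposes a predict method; a small follow-up sketch using the `result` above (the new B/C values are made up for illustration):

>>> result.predict(df)                                      # in-sample fitted values
>>> result.predict(pd.DataFrame({"B": [35], "C": [100]}))   # prediction for new data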

Answer 1


Note: pandas.stats was removed in pandas 0.20.0


It’s possible to do this with pandas.stats.ols:

>>> from pandas.stats.api import ols
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> res = ols(y=df['A'], x=df[['B','C']])
>>> res
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <B> + <C> + <intercept>

Number of Observations:         5
Number of Degrees of Freedom:   3

R-squared:         0.5789
Adj R-squared:     0.1577

Rmse:             14.5108

F-stat (2, 2):     1.3746, p-value:     0.4211

Degrees of Freedom: model 2, resid 2

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             B     0.4012     0.6497       0.62     0.5999    -0.8723     1.6746
             C     0.0004     0.0005       0.65     0.5826    -0.0007     0.0014
     intercept    14.9525    17.7643       0.84     0.4886   -19.8655    49.7705
---------------------------------End of Summary---------------------------------

Note that you need to have the statsmodels package installed; it is used internally by the pandas.stats.ols function.


Answer 2


I don't know if this is new in sklearn or pandas, but I'm able to pass the data frame directly to sklearn without converting it to a numpy array or any other data type.

from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit(df[['B', 'C']], df['A'])

>>> reg.coef_
array([  4.01182386e-01,   3.51587361e-04])
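
A quick usage follow-up, reusing the `reg` fitted above; note the intercept agrees with the statsmodels results in the other answers:

>>> reg.intercept_
14.952479503953672
>>> reg.predict(df[['B', 'C']])   # predictions straight from DataFrame columns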

Answer 3


This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place.

No it doesn’t, just convert to a NumPy array:

>>> data = np.asarray(df)

This takes constant time because it just creates a view on your data. Then feed it to scikit-learn:

>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> X, y = data[:, 1:], data[:, 0]
>>> lr.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
>>> lr.coef_
array([  4.01182386e-01,   3.51587361e-04])
>>> lr.intercept_
14.952479503953672
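
As an aside, pandas 0.24+ exposes the same conversion as a DataFrame method; a small sketch:

>>> data = df.to_numpy()   # the modern spelling of np.asarray(df)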

Answer 4


Statsmodels can build an OLS model with column references directly to a pandas DataFrame.

Short and sweet:

model = sm.OLS(df[y], df[x]).fit()


Code details and regression summary:

# imports
import pandas as pd
import statsmodels.api as sm
import numpy as np

# data
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('ABC'))

# assign dependent and independent / explanatory variables
variables = list(df.columns)
y = 'A'
x = [var for var in variables if var not in y ]

# Ordinary least squares regression
model_Simple = sm.OLS(df[y], df[x]).fit()

# Add a constant term like so:
model = sm.OLS(df[y], sm.add_constant(df[x])).fit()

model.summary()

Output:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      A   R-squared:                       0.019
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.9409
Date:                Thu, 14 Feb 2019   Prob (F-statistic):              0.394
Time:                        08:35:04   Log-Likelihood:                -484.49
No. Observations:                 100   AIC:                             975.0
Df Residuals:                      97   BIC:                             982.8
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         43.4801      8.809      4.936      0.000      25.996      60.964
B              0.1241      0.105      1.188      0.238      -0.083       0.332
C             -0.0752      0.110     -0.681      0.497      -0.294       0.144
==============================================================================
Omnibus:                       50.990   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                6.905
Skew:                           0.032   Prob(JB):                       0.0317
Kurtosis:                       1.714   Cond. No.                         231.
==============================================================================

How to get the R-squared, coefficients, and p-values directly:

# commands:
model.params
model.pvalues
model.rsquared

# demo:
In[1]: 
model.params
Out[1]:
const    43.480106
B         0.124130
C        -0.075156
dtype: float64

In[2]: 
model.pvalues
Out[2]: 
const    0.000003
B        0.237924
C        0.497400
dtype: float64

In[3]: 
model.rsquared
Out[3]:
0.0190
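
The confidence intervals shown in the summary can likewise be pulled directly from the fitted result; a small sketch using the `model` above:

model.conf_int()   # the [0.025, 0.975] bounds for each coefficient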

Find the p-value (significance) in scikit-learn LinearRegression

Question: Find the p-value (significance) in scikit-learn LinearRegression


How can I find the p-value (significance) of each coefficient?

lm = sklearn.linear_model.LinearRegression()
lm.fit(x,y)

Answer 0


This is kind of overkill, but let's give it a go. First, let's use statsmodels to find out what the p-values should be:

import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

and we get

                         OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.518
Model:                            OLS   Adj. R-squared:                  0.507
Method:                 Least Squares   F-statistic:                     46.27
Date:                Wed, 08 Mar 2017   Prob (F-statistic):           3.83e-62
Time:                        10:08:24   Log-Likelihood:                -2386.0
No. Observations:                 442   AIC:                             4794.
Df Residuals:                     431   BIC:                             4839.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        152.1335      2.576     59.061      0.000     147.071     157.196
x1           -10.0122     59.749     -0.168      0.867    -127.448     107.424
x2          -239.8191     61.222     -3.917      0.000    -360.151    -119.488
x3           519.8398     66.534      7.813      0.000     389.069     650.610
x4           324.3904     65.422      4.958      0.000     195.805     452.976
x5          -792.1842    416.684     -1.901      0.058   -1611.169      26.801
x6           476.7458    339.035      1.406      0.160    -189.621    1143.113
x7           101.0446    212.533      0.475      0.635    -316.685     518.774
x8           177.0642    161.476      1.097      0.273    -140.313     494.442
x9           751.2793    171.902      4.370      0.000     413.409    1089.150
x10           67.6254     65.984      1.025      0.306     -62.065     197.316
==============================================================================
Omnibus:                        1.506   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.471   Jarque-Bera (JB):                1.404
Skew:                           0.017   Prob(JB):                        0.496
Kurtosis:                       2.726   Cond. No.                         227.
==============================================================================

Ok, let’s reproduce this. It is kind of overkill as we are almost reproducing a linear regression analysis using Matrix Algebra. But what the heck.

lm = LinearRegression()
lm.fit(X,y)
params = np.append(lm.intercept_,lm.coef_)
predictions = lm.predict(X)

newX = pd.DataFrame({"Constant":np.ones(len(X))}).join(pd.DataFrame(X))
MSE = (sum((y-predictions)**2))/(len(newX)-len(newX.columns))

# Note if you don't want to use a DataFrame replace the two lines above with
# newX = np.append(np.ones((len(X),1)), X, axis=1)
# MSE = (sum((y-predictions)**2))/(len(newX)-len(newX[0]))

var_b = MSE*(np.linalg.inv(np.dot(newX.T,newX)).diagonal())
sd_b = np.sqrt(var_b)
ts_b = params/ sd_b

p_values = [2*(1-stats.t.cdf(np.abs(i),(len(newX)-len(newX.columns)))) for i in ts_b]  # df = n - k, matching the MSE above

sd_b = np.round(sd_b,3)
ts_b = np.round(ts_b,3)
p_values = np.round(p_values,3)
params = np.round(params,4)

myDF3 = pd.DataFrame()
myDF3["Coefficients"],myDF3["Standard Errors"],myDF3["t values"],myDF3["Probabilities"] = [params,sd_b,ts_b,p_values]
print(myDF3)

And this gives us:

    Coefficients  Standard Errors  t values  Probabilities
0       152.1335            2.576    59.061         0.000
1       -10.0122           59.749    -0.168         0.867
2      -239.8191           61.222    -3.917         0.000
3       519.8398           66.534     7.813         0.000
4       324.3904           65.422     4.958         0.000
5      -792.1842          416.684    -1.901         0.058
6       476.7458          339.035     1.406         0.160
7       101.0446          212.533     0.475         0.635
8       177.0642          161.476     1.097         0.273
9       751.2793          171.902     4.370         0.000
10       67.6254           65.984     1.025         0.306

So we can reproduce the values from statsmodel.
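
As a quick sanity check, the same numbers are available as attributes on the statsmodels fit from the start of this answer; a short sketch:

print(est2.params.round(4))    # should match the Coefficients column above
print(est2.pvalues.round(3))   # should match the Probabilities column above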


Answer 1


scikit-learn’s LinearRegression doesn’t calculate this information but you can easily extend the class to do it:

from sklearn import linear_model
from scipy import stats
import numpy as np


class LinearRegression(linear_model.LinearRegression):
    """
    LinearRegression class after sklearn's, but calculate t-statistics
    and p-values for model coefficients (betas).
    Additional attributes available after .fit()
    are `t` and `p` which are of the shape (y.shape[1], X.shape[1])
    which is (n_features, n_coefs)
    This class sets the intercept to 0 by default, since usually we include it
    in X.
    """

    def __init__(self, *args, **kwargs):
        if not "fit_intercept" in kwargs:
            kwargs['fit_intercept'] = False
        super(LinearRegression, self)\
                .__init__(*args, **kwargs)

    def fit(self, X, y, n_jobs=1):
        self = super(LinearRegression, self).fit(X, y, n_jobs)

        sse = np.sum((self.predict(X) - y) ** 2, axis=0) / float(X.shape[0] - X.shape[1])
        se = np.array([
            np.sqrt(np.diagonal(sse[i] * np.linalg.inv(np.dot(X.T, X))))
                                                    for i in range(sse.shape[0])
                    ])

        self.t = self.coef_ / se
        self.p = 2 * (1 - stats.t.cdf(np.abs(self.t), y.shape[0] - X.shape[1]))
        return self

Stolen from here.

You should take a look at statsmodels for this kind of statistical analysis in Python.


Answer 2


EDIT: Probably not the right way to do it, see comments

You could use sklearn.feature_selection.f_regression.

click here for the scikit-learn page
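
For what it's worth, a minimal sketch of that call (keeping the caveat above in mind: these are p-values from univariate F-tests of each feature against y, not from the joint multivariate model):

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import f_regression

X, y = load_diabetes(return_X_y=True)
F_values, p_values = f_regression(X, y)   # one F statistic and one p-value per feature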


Answer 3


The code in elyase's answer (https://stackoverflow.com/a/27928411/4240413) does not actually work: notice that sse is a scalar, and then it tries to iterate through it. The following code is a modified version. Not amazingly clean, but I think it works more or less.

from sklearn import linear_model
from scipy.stats import t
import numpy as np
import scipy as sc
import scipy.linalg  # imported so that sc.linalg.sqrtm below resolves


class LinearRegression(linear_model.LinearRegression):

    def __init__(self,*args,**kwargs):
        # *args is the list of arguments that might go into the LinearRegression object
        # that we don't know about and don't want to have to deal with. Similarly, **kwargs
        # is a dictionary of key words and values that might also need to go into the original
        # LinearRegression object. We put *args and **kwargs so that we don't have to look
        # these up and write them down explicitly here. Nice and easy.

        if not "fit_intercept" in kwargs:
            kwargs['fit_intercept'] = False

        super(LinearRegression,self).__init__(*args,**kwargs)

    # Adding in t-statistics for the coefficients.
    def fit(self,x,y):
        # This takes in numpy arrays (not matrices). Also assumes you are leaving out the column
        # of constants.

        # 'super' calls the parent class's fit, which estimates the coefficients and returns self.
        self = super(LinearRegression, self).fit(x,y)
        n, k = x.shape
        yHat = np.matrix(self.predict(x)).T

        # Change X and Y into numpy matrices. x also has a column of ones added to it.
        x = np.hstack((np.ones((n,1)),np.matrix(x)))
        y = np.matrix(y).T

        # Degrees of freedom.
        df = float(n-k-1)

        # Sample variance.     
        sse = np.sum(np.square(yHat - y),axis=0)
        self.sampleVariance = sse/df

        # Sample variance for x.
        self.sampleVarianceX = x.T*x

        # Covariance Matrix = [(s^2)(X'X)^-1]^0.5. (sqrtm = matrix square root.  ugly)
        self.covarianceMatrix = sc.linalg.sqrtm(self.sampleVariance[0,0]*self.sampleVarianceX.I)

        # Standard errors for the coefficients: the diagonal elements of the covariance matrix.
        self.se = self.covarianceMatrix.diagonal()[1:]

        # T statistic for each beta.
        self.betasTStat = np.zeros(len(self.se))
        for i in range(len(self.se)):
            self.betasTStat[i] = self.coef_[0,i]/self.se[i]

        # P-value for each beta. This is a two-sided t-test, since the betas can be
        # positive or negative.
        self.betasPValue = 2 * (1 - t.cdf(abs(self.betasTStat), df))

Answer 4


An easy way to pull the p-values is to use a statsmodels regression:

import statsmodels.api as sm
mod = sm.OLS(Y,X)
fii = mod.fit()
p_values = fii.summary2().tables[1]['P>|t|']

You get a series of p-values that you can manipulate (for example, choosing which predictors to keep by evaluating each p-value), as in the sketch below.
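
A minimal sketch, assuming `X` is the exog DataFrame passed to sm.OLS (with a `const` column added via add_constant), that keeps only predictors below a 0.05 threshold:

significant = p_values[p_values < 0.05].index             # names of the significant terms
X_reduced = X[[c for c in significant if c != 'const']]   # drop the intercept term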


Answer 5


The p-value is among the F-statistics. If you want to get the value, simply use these few lines of code:

import statsmodels.api as sm
from sklearn import datasets  # needed for load_diabetes below
from scipy import stats

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
print(est.fit().f_pvalue)

Answer 6


There could be a mistake in @JARH's answer in the case of a multivariable regression. (I do not have enough reputation to comment.)

In the following line:

p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-1))) for i in ts_b],

the t-values are referred to a Student's t-distribution with len(newX)-1 degrees of freedom, instead of len(newX)-len(newX.columns)-1 degrees of freedom.

So this should be:

p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-len(newX.columns)-1))) for i in ts_b]

(See t-values for OLS regression for more details)


Answer 7


You can use scipy for the p-value. This code is from the scipy documentation.

>>> from scipy import stats
>>> import numpy as np
>>> x = np.random.random(10)
>>> y = np.random.random(10)
>>> slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
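
A brief follow-up on what comes back; note that stats.linregress fits exactly one predictor, so for several predictors the statsmodels answers above apply:

>>> p_value   # two-sided p-value for the null hypothesis that the slope is zero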

Answer 8


For a one-liner you can use the pingouin.linear_regression function (disclaimer: I am the creator of Pingouin), which works for univariate/multivariate regression with NumPy arrays or a Pandas DataFrame, e.g.:

import pingouin as pg
# Using a Pandas DataFrame `df`:
lm = pg.linear_regression(df[['x', 'z']], df['y'])
# Using a NumPy array:
lm = pg.linear_regression(X, y)

The output is a dataframe with the beta coefficients, standard errors, T-values, p-values and confidence intervals for each predictor, as well as the R^2 and adjusted R^2 of the fit.
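
A small sketch of inspecting that output; the exact column names here are an assumption based on recent pingouin versions:

# 'names', 'coef' and 'pval' are assumed column names; check lm.columns on your version
print(lm[['names', 'coef', 'pval']])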


AiLearning: Machine Learning (ML), Deep Learning (DL), Natural Language Processing (NLP)

Website

Download

Docker

docker pull apachecn0/ailearning
docker run -tid -p <port>:80 apachecn0/ailearning
# Visit http://localhost:{port} to view the docs

PYPI

pip install apachecn-ailearning
apachecn-ailearning <port>
# Visit http://localhost:{port} to view the docs

NPM

npm install -g ailearning
ailearning <port>
# Visit http://localhost:{port} to view the docs

About the Organization

  • For cooperation or copyright issues, contact: apachecn@163.com
  • We are not an official Apache organization/institution/group, just enthusiasts of the Apache technology stack (and AI)!

"Once a new technology starts rolling, if you're not part of the steamroller, you're part of the road." – Stewart Brand

Roadmap

Supplements

1. Machine Learning - Basics

Supported Versions

Version   Support
3.6.x
2.7.x

Notes:

  • Machine Learning in Action: for learning only; please use Python 2.7.x (the 3.6.x version only has partial modifications)

Basic Introduction

Learning Docs

Module | Chapter | Type | Owner (GitHub) | QQ
Machine Learning in Action | Chapter 1: Machine Learning Basics | Intro | @毛红动 | 1306014226
Machine Learning in Action | Chapter 2: k-Nearest Neighbors (KNN) | Classification | @尤永江 | 279393323
Machine Learning in Action | Chapter 3: Decision Trees | Classification | @景涛 | 844300439
Machine Learning in Action | Chapter 4: Naive Bayes | Classification | @wnma3mz, @分析 | 1003324213, 244970749
Machine Learning in Action | Chapter 5: Logistic Regression | Classification | @微光同尘 | 529925688
Machine Learning in Action | Chapter 6: Support Vector Machines (SVM) | Classification | @王德红 | 934969547
Combined online content | Chapter 7: Ensemble Methods (Random Forest and AdaBoost) | Classification | @片刻 | 529815144
Machine Learning in Action | Chapter 8: Regression | Regression | @微光同尘 | 529925688
Machine Learning in Action | Chapter 9: Tree-Based Regression | Regression | @微光同尘 | 529925688
Machine Learning in Action | Chapter 10: K-Means Clustering | Clustering | @徐昭清 | 827106588
Machine Learning in Action | Chapter 11: Association Analysis with the Apriori Algorithm | Frequent itemsets | @刘海飞 | 1049498972
Machine Learning in Action | Chapter 12: Efficiently Finding Frequent Itemsets with FP-growth | Frequent itemsets | @程威 | 842725815
Machine Learning in Action | Chapter 13: Simplifying Data with PCA | Tools | @廖立娟 | 835670618
Machine Learning in Action | Chapter 14: Simplifying Data with SVD | Tools | @张俊皓 | 714974242
Machine Learning in Action | Chapter 15: Big Data and MapReduce | Tools | @wnma3mz | 1003324213
ML project practice | Chapter 16: Recommender Systems (migrated) | Project | Recommender systems (post-migration address) |
Season 1 wrap-up | 2017-04-08: Season 1 wrap-up | Summary | Summary | 529815144

Website Videos

Zhihu Q&A (blowing up): How should one get started with machine learning?

Of course I know the very first sentence will get mocked: the formally trained will spit in disdain, call me an idiot, and point to Andrew Ng's videos.

I also know there are people who watch Andrew Ng's videos and simply cannot follow them: the mysterious mathematical derivations, the enigmatic English-language teaching. Haven't I walked the same road? My heart probably aches more than yours: I have bookmarked more than ten "machine learning" video series online, plus tutorials in the domestic local style (7月 + 小象 and so on), and I could hardly follow any of them. Then one day a senior algorithm analyst at Baidu recommended: "Machine Learning in Action is not bad, easy to follow. Give it a try?"

I tried. Fortunately my Python basics and debugging skills were decent; I stepped through basically all of the code. Much of the lofty "theory + derivation" turned, before my eyes, into a few instances of "add-subtract-multiply-divide + loops". Isn't that exactly the kind of introduction a programmer like me wants?

Many programmers say machine learning is damn hard to learn. Yes, it damn well is. I think the hardest part is this: there is no other author like the one of "Machine Learning in Action" who is willing to explain it from a programmer's coding perspective!!

In the last few days the GitHub repo gained 300 stars and 200 people joined the group, and the numbers keep climbing++. I suppose everyone feels the same way!

Many aspiring beginners get strung along into bookmarking, bookmarking and bookmarking again, yet end up learning nothing at all: mere "resource collectors". Maybe what beginners need is exactly a MachineLearning roadmap. Right, I can give you one, because we have also recorded our learning process on video. Our level is limited, of course, but for getting started it is absolutely fine. If you still can't do it, I'll take the loss!!

How should you watch the videos?

  1. Formally trained in theory: go watch Andrew Ng's videos (Ng's videos are the authority, no question about it)
  2. Strong coding skills: watch our "Machine Learning in Action - Teaching Edition"
  3. Weaker coding skills: watch our "Machine Learning in Action - Discussion Edition"; but for the theory, watch the theory parts of the Teaching Edition. The Discussion Edition has too much chatter, but it explains the code line by line. So combine them freely according to your own needs.

[Free] Math teaching videos - Khan Academy primers

Probability | Statistics | Linear Algebra
Khan Academy (Probability) | Khan Academy (Statistics) | Khan Academy (Linear Algebra)

Machine learning videos - ApacheCN teaching edition

AcFun | Bilibili
Youku | NetEase Cloud Classroom

[Free] Machine/deep learning videos - Andrew Ng

Machine Learning | Deep Learning
Andrew Ng's Machine Learning | Neural Networks and Deep Learning

2. Deep Learning

Supported Versions

Version   Support
3.6.x
2.7.x

Getting Started

  1. Backpropagation: https://www.cnblogs.com/charlotte77/p/5629865.html
  2. How CNNs work: http://www.cnblogs.com/charlotte77/p/7759802.html
  3. How RNNs work: https://blog.csdn.net/qq_39422642/article/details/78676567
  4. How LSTMs work: https://blog.csdn.net/weixin_42111770/article/details/80900575

PyTorch Tutorials

– To be updated

TensorFlow 2.0 Tutorials

– To be updated

Directory structure:

Segmentation (tokenization)

Part-of-speech tagging

Named entity recognition

Syntactic parsing

WordNet, which can be viewed as a dictionary of synonyms

Stemming and lemmatization

TensorFlow 2.0 learning links

3. Natural Language Processing

Supported Versions

Version   Support
3.6.x
2.7.x

Mixed feelings along the way!

Ever since I started learning NLP, I have noticed the typical differences between China and abroad:
1. Attitudes toward resources are completely opposite:
  1) Domestic: conferences seem to be held for fame and for show, with no real substance, only token PPT introductions that are not aimed at the people in the room
  2) Abroad: people share all kinds of solid material and concrete implementations, as if to push NLP forward (especially: Python natural language processing)
2. Paper implementations:
  1) All kinds of lofty paper implementations, yet I still have not seen one decent GitHub project! (Maybe my search skills are just poor; I never found one)
  2) I won't cite foreign examples; I can't read them!
3. Open-source frameworks:
  1) Foreign open-source frameworks: tensorflow/pytorch, with docs + tutorials + videos (officially provided)
  2) Domestic open-source frameworks: well, I honestly cannot name one! Though the bragging is no weaker than abroad! (MXNet has many Chinese contributors, but it cannot count as a domestic framework. The MXNet-based Chinese tutorial Dive into Deep Learning (http://zh.d2l.ai & https://discuss.gluon.ai/t/topic/753), taught and recorded by 沐神 (李沐) and Aston Zhang (阿斯顿·张), has been published openly: docs + season 1 tutorials + videos.)
Every deep dive means climbing over the wall; every deep dive means Google. Every time I hear domestically how great HIT (哈工大), iFLYTEK (讯飞), USTC (中科大), Baidu, and Alibaba are, the material still has to be found abroad!
Sometimes I really resent it! I almost look down on the domestic technical environment!

Of course, many thanks to the numerous domestic blog authors, especially for some introductory demos and basic concepts. [My grasp of the deeper material is limited; I didn't understand it]

1. Use Cases (Baidu open course)

Part 1: Introduction

Part 2: Machine Translation

Part 3: Discourse Analysis

Part 4: Language Understanding and Interaction Technology

Application Areas

Chinese word segmentation:

  • Build a DAG of all candidate segmentations
  • Use dynamic programming to find the maximum-probability path through the DAG, combining the forward and backward directions (forward weighting, backward output)
  • An HMM + Viterbi model trained on an SBME-tagged corpus handles out-of-vocabulary words (see the sketch below)
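
This DAG + HMM/Viterbi pipeline is, for instance, what the popular jieba segmenter implements; a minimal sketch, assuming the jieba package is installed (exact segmentations can vary by version and dictionary):

import jieba  # builds the DAG, runs dynamic programming, falls back to HMM+Viterbi for unknown words

print(jieba.lcut("我来到北京清华大学"))  # e.g. ['我', '来到', '北京', '清华大学']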

1. Text Classification

Text classification means labeling sentences or documents; examples include email spam classification and sentiment analysis.

Below are some good beginner text classification datasets.

  1. Reuters Newswire Topic Classification (Reuters-21578). A collection of news documents that appeared on the Reuters newswire in 1987, indexed by category. See also RCV1, RCV2 and TRC2.
  2. IMDB Movie Review Sentiment Classification (Stanford). A collection of movie reviews from imdb.com and their positive or negative sentiment.
  3. News Group Movie Review Sentiment Classification (Cornell). A collection of movie reviews and their positive or negative sentiment.

For more information, see the post: Datasets for single-label text classification.

Sentiment Analysis

Competition: https://www.kaggle.com/c/word2vec-nlp-tutorial

  • Approach 1 (0.86): word counts + Naive Bayes (see the sketch after this list)
  • Approach 2 (0.94): LDA + a classifier (KNN / decision tree / logistic regression / SVM / XGBoost / random forest)
    • a) Decision trees did not work well here; they are a poor fit for these continuous features
    • b) With parameter tuning, 200 topics preserved the information best
  • Approach 3: word2vec + CNN
    • Honestly: without a good machine, you cannot tune your way to a good result (runs away)

Model performance is evaluated with AUC.
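
As a concrete illustration of Approach 1, here is a minimal scikit-learn sketch; `texts` and `labels` are hypothetical stand-ins for the Kaggle review data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the Kaggle reviews and their 0/1 sentiment labels.
texts = ["great movie", "terrible plot", "loved it", "awful acting"]
labels = [1, 0, 1, 0]

clf = make_pipeline(CountVectorizer(), MultinomialNB())  # word counts + Naive Bayes
clf.fit(texts, labels)
print(clf.predict(["what a great film"]))  # -> [1]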

2. Language Modeling

Language modeling involves developing a statistical model to predict the next word in a sentence, or the next character in a word. It is a precursor task in applications such as speech recognition and machine translation.

Below are some good beginner language modeling datasets.

  1. Project Gutenberg, a collection of free books that can be retrieved as plain text in a variety of languages.
  2. There are also more formal corpora that have been studied extensively; for example the Brown University Standard Corpus of Present-Day American English, a large sample of English words, and the Google 1 Billion Word Corpus.

New word discovery

Sentence similarity recognition

Text error correction

  • Letter bigrams + phonetic bigrams

3. Image Captioning

Image captioning is the task of generating a text description for a given image.

Below are some good beginner image captioning datasets.

  1. Common Objects in Context (COCO). A collection of more than 120 thousand images with descriptions.
  2. Flickr 8K. A collection of 8 thousand described images taken from Flickr.com.
  3. Flickr 30K. A collection of 30 thousand described images taken from Flickr.com. For more, see the post:

Exploring Image Captioning Datasets, 2016

4. Machine Translation

Machine translation is the task of translating text from one language into another.

Below are some good beginner machine translation datasets.

  1. Aligned Hansards of the 36th Parliament of Canada. Pairs of English and French sentences.
  2. European Parliament Proceedings Parallel Corpus 1996-2011. Sentence pairs across a set of European languages. There are a large number of standard datasets used for the annual machine translation challenges; see:

Statistical Machine Translation

Machine Translation

5. Question Answering

Question answering is a task where a sentence or text sample is provided, questions are posed about it, and those questions must be answered.

Below are some good beginner question answering datasets.

  1. Stanford Question Answering Dataset (SQuAD). Answering questions about Wikipedia articles.
  2. DeepMind Question Answering Corpus. Answering questions about news articles from the Daily Mail.
  3. Amazon Question/Answer Data. Answering questions about Amazon products. For more information, see the post:

Datasets: How can I get a corpus of a question-answering website like Quora, Yahoo Answers, or Stack Overflow to analyze answer quality?

6. Speech Recognition

Speech recognition is the task of converting audio of spoken language into human-readable text.

Below are some good beginner speech recognition datasets.

  1. TIMIT Acoustic-Phonetic Continuous Speech Corpus. Not free, but listed for its wide use. Spoken American English with transcriptions.
  2. VoxForge. A project to build an open-source database for speech recognition.
  3. LibriSpeech ASR Corpus. A large collection of English audiobooks collected from LibriVox.

7. Document Summarization

Document summarization is the task of creating a short, meaningful description of a larger document.

Below are some good beginner document summarization datasets.

  1. Legal Case Reports Dataset. A collection of 4,000 legal cases and their summaries.
  2. TIPSTER Text Summarization Evaluation Conference Corpus. A collection of nearly 200 documents and their summaries.
  3. The AQUAINT Corpus of English News Text. Not free, but widely used. A corpus of news articles. For more information:

Document Understanding Conference (DUC) tasks. Where can you find good datasets for text summarization?

Named entity recognition

Text summarization

Graph computation [updated gradually]

  • Dataset: data/nlp/graph
  • Learning material: 电光图片X实战.pdf [the file is too large to provide here; search Baidu for it]

Knowledge graphs

Further Reading

If you want to go deeper, this section provides additional lists of datasets.

  1. Text datasets used in Wikipedia research
  2. Datasets: What are the main text corpora used by computational linguists and NLP researchers?
  3. Stanford Statistical Natural Language Processing corpora
  4. An alphabetized list of NLP datasets
  5. NLTK corpora
  6. Open deep-learning data on DL4J
  7. NLP datasets
  8. Domestic open datasets: https://bosonnlp.com/dev/resource

Contributors

Contributors are welcome to keep adding to the project.

Disclaimer - [for learning and reference only]

  • ApacheCN translates this material purely for learning purposes and out of personal interest
  • ApacheCN retains the attribution rights and other related rights to this translated version

License

  • Each project's own license takes precedence.
  • Projects under the ApacheCN account that have no license are treated as CC BY-NC-SA 4.0.

Sources:

Acknowledgements

I recently happened upon a link forwarded by a group member and found that our work has been highly recognized, and enthusiastically promoted, by well-known people.

Our thanks to:

Sponsor Us