




I can’t seem to find any Python libraries that do multiple regression. The ones I find only do simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).

For example, with this data:

print('y        x1      x2       x3       x4      x5     x6       x7')
for t in texts:
    print("{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}".format(
        t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7))

(output for above:)

      y        x1       x2       x3        x4     x5     x6       x7
   -6.0     -4.95    -5.87    -0.76     14.73   4.02   0.20     0.45
   -5.0     -4.55    -4.52    -0.71     13.74   4.47   0.16     0.50
  -10.0    -10.96   -11.64    -0.98     15.49   4.18   0.19     0.53
   -5.0     -1.08    -3.36     0.75     24.72   4.96   0.16     0.60
   -8.0     -6.52    -7.45    -0.86     16.59   4.29   0.10     0.48
   -3.0     -0.81    -2.36    -0.50     22.44   4.81   0.15     0.53
   -6.0     -7.01    -7.33    -0.33     13.93   4.32   0.21     0.50
   -8.0     -4.46    -7.65    -0.94     11.40   4.43   0.16     0.49
   -8.0    -11.54   -10.03    -1.03     18.18   4.28   0.21     0.55

How would I regress these in Python, to get the linear regression formula:

Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + a7x7 + c

Answer 0

sklearn.linear_model.LinearRegression will do it:

from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
        [t.y for t in texts])

Then clf.coef_ will have the regression coefficients.

sklearn.linear_model also has similar interfaces to do various kinds of regularizations on the regression.
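
To make the mapping back to the formula concrete, here is a minimal sketch on made-up data (the arrays and the true coefficients 2, 3 and 1 are assumptions chosen for illustration, not the question's data). It shows that coef_ holds a1..an while the constant c is stored separately in intercept_.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2*x1 + 3*x2 + 1 exactly (illustration only).
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 5.0], [4.0, 2.0]])
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] + 1.0

model = LinearRegression()  # fits an intercept by default
model.fit(X, y)

coefficients = model.coef_     # a1, a2 in Y = a1*x1 + a2*x2 + c
intercept = model.intercept_   # c
print(coefficients, intercept)
```

Because the intercept is fitted by default, there is no need to append a column of ones to X as the numpy-based approaches do.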

Answer 1


Here is a little workaround that I created. I checked it against R and it works correctly.

import numpy as np
import statsmodels.api as sm

y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]

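x = [
    # three lists of predictor values (x1, x2, x3), each the same length as y;
    # the values are missing from this copy of the answer
]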

def reg_m(y, x):
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results


print(reg_m(y, x).summary())


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.535
Model:                            OLS   Adj. R-squared:                  0.461
Method:                 Least Squares   F-statistic:                     7.281
Date:                Tue, 19 Feb 2013   Prob (F-statistic):            0.00191
Time:                        21:51:28   Log-Likelihood:                -26.025
No. Observations:                  23   AIC:                             60.05
Df Residuals:                      19   BIC:                             64.59
Df Model:                           3                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
x1             0.2424      0.139      1.739      0.098        -0.049     0.534
x2             0.2360      0.149      1.587      0.129        -0.075     0.547
x3            -0.0618      0.145     -0.427      0.674        -0.365     0.241
const          1.5704      0.633      2.481      0.023         0.245     2.895

Omnibus:                        6.904   Durbin-Watson:                   1.905
Prob(Omnibus):                  0.032   Jarque-Bera (JB):                4.708
Skew:                          -0.849   Prob(JB):                       0.0950
Kurtosis:                       4.426   Cond. No.                         38.6

pandas provides a convenient way to run OLS as given in this answer:

Run an OLS regression with Pandas Data Frame

Answer 2




Just to clarify, the example you gave is multiple linear regression, not multivariate linear regression. The difference:

The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term multivariate linear regression refers to cases where y is a vector, i.e., the same as general linear regression. The difference between multivariate linear regression and multivariable linear regression should be emphasized as it causes much confusion and misunderstanding in the literature.

In short:

  • multiple linear regression: the response y is a scalar.
  • multivariate linear regression: the response y is a vector.

(Another source.)
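
The distinction shows up directly in the shapes involved. The sketch below uses made-up data (the random inputs and the true coefficients are assumptions for illustration) and numpy.linalg.lstsq, which accepts either a vector or a matrix of responses.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 observations, 3 predictors
X1 = np.c_[X, np.ones(len(X))]        # append a column of ones for the intercept

# Multiple regression: one scalar response per observation.
y = X @ np.array([1.0, -2.0, 0.5]) + 4.0
beta = np.linalg.lstsq(X1, y, rcond=None)[0]

# Multivariate regression: a vector of two responses per observation.
true_B = np.array([[1.0, 0.0], [-2.0, 3.0], [0.5, 1.0]])
Y = X @ true_B + np.array([4.0, -1.0])
B = np.linalg.lstsq(X1, Y, rcond=None)[0]

print(beta.shape)  # (4,)
print(B.shape)     # (4, 2)
```

In the multivariate case each column of B is simply the coefficient vector for one of the response columns.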

Answer 3


You can use numpy.linalg.lstsq:

import numpy as np

y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array(
    [
        [-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
        [-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
        [-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
        [14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
        [4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
        [0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
        [0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
    ]
)
X = X.T  # transpose so input vectors are along the rows
X = np.c_[X, np.ones(X.shape[0])]  # add bias term
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)

[ -0.49104607   0.83271938   0.0860167    0.1326091    6.85681762  22.98163883 -41.08437805 -19.08085066]

You can see the estimated output with:

print(np.dot(X, beta_hat))

[ -5.97751163,  -5.06465759, -10.16873217,  -4.96959788,  -7.96356915,  -3.06176313,  -6.01818435,  -7.90878145,  -7.86720264]
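
To judge how closely the estimated output tracks y, one option is the coefficient of determination, the same R-squared that the statsmodels summaries report. This is a small sketch; the helper name r_squared is hypothetical, and the fitted values are copied from the output above.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
    return 1.0 - ss_res / ss_tot

# Observed y and the fitted values printed above.
y = [-6, -5, -10, -5, -8, -3, -6, -8, -8]
y_hat = [-5.97751163, -5.06465759, -10.16873217, -4.96959788, -7.96356915,
         -3.06176313, -6.01818435, -7.90878145, -7.86720264]

r2 = r_squared(y, y_hat)
print(r2)
```

Keep in mind that this data set has 9 observations and 8 fitted parameters, so a value close to 1 is expected and says little about how well the model would predict new data.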

Answer 4


Use scipy.optimize.curve_fit. It is not limited to linear fits.

import numpy as np
from scipy.optimize import curve_fit

def fn(x, a, b, c):
    return a + b*x[0] + c*x[1]

# y(x0,x1) data:
#    x0=0 1 2
# ___________
# x1=0 |0 1 2
# x1=1 |1 2 3
# x1=2 |2 3 4

# scipy.array was an alias for numpy.array and is gone from current SciPy
x = np.array([[0,1,2,0,1,2,0,1,2],[0,0,0,1,1,1,2,2,2]])
y = np.array([0,1,2,1,2,3,2,3,4])
popt, pcov = curve_fit(fn, x, y)
print(popt)

Answer 5


Once you convert your data to a pandas dataframe (df),

import statsmodels.formula.api as smf
lm = smf.ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7', data=df).fit()

The intercept term is included by default.

See this notebook for more examples.

Answer 6


I think this may be the easiest way to do this:

from random import random
from pandas import DataFrame
from statsmodels.api import OLS
lr = lambda : [random() for i in range(100)]
x = DataFrame({'x1': lr(), 'x2':lr(), 'x3':lr()})
x['b'] = 1
y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4

print(x.head())

         x1        x2        x3  b
0  0.433681  0.946723  0.103422  1
1  0.400423  0.527179  0.131674  1
2  0.992441  0.900678  0.360140  1
3  0.413757  0.099319  0.825181  1
4  0.796491  0.862593  0.193554  1

print(y.head())

0    6.637392
1    5.849802
2    7.874218
3    7.087938
4    7.102337
dtype: float64

model = OLS(y, x)
result = model.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 5.859e+30
Date:                Wed, 09 Dec 2015   Prob (F-statistic):               0.00
Time:                        15:17:32   Log-Likelihood:                 3224.9
No. Observations:                 100   AIC:                            -6442.
Df Residuals:                      96   BIC:                            -6431.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
x1             1.0000   8.98e-16   1.11e+15      0.000         1.000     1.000
x2             2.0000   8.28e-16   2.41e+15      0.000         2.000     2.000
x3             3.0000   8.34e-16    3.6e+15      0.000         3.000     3.000
b              4.0000   8.51e-16    4.7e+15      0.000         4.000     4.000
Omnibus:                        7.675   Durbin-Watson:                   1.614
Prob(Omnibus):                  0.022   Jarque-Bera (JB):                3.118
Skew:                           0.045   Prob(JB):                        0.210
Kurtosis:                       2.140   Cond. No.                         6.89

Answer 7

Multiple Linear Regression can be handled using the sklearn library as referenced above. I’m using the Anaconda install of Python 3.6.

Create your model as follows:

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)

# display coefficients
print(regressor.coef_)

Answer 8


Answer 9


You can use the function below and pass it a DataFrame:

import pandas as pd
import statsmodels.api as sm

def linear(x, y=None, show=True):
    """
    @param x: pd.DataFrame
    @param y: pd.DataFrame or pd.Series or None
              if None, then use last column of x as y
    @param show: if show regression summary
    """
    # add_constant prepends the constant column, so the last column is still y
    xy = sm.add_constant(x if y is None else pd.concat([x, y], axis=1))
    # .iloc replaces the .ix indexer, which has been removed from pandas
    res = sm.OLS(xy.iloc[:, -1], xy.iloc[:, :-1], missing='drop').fit()

    if show:
        print(res.summary())
    return res

Answer 10



Scikit-learn is a machine learning library for Python which can do this job for you. Just import the sklearn.linear_model module into your script.

Here is a code template for multiple linear regression using sklearn in Python:

import numpy as np
import matplotlib.pyplot as plt #to plot visualizations
import pandas as pd

# Importing the dataset
df = pd.read_csv('<Your-dataset-path>')
# Assigning feature and target variables
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

# One-hot encode a categorical column, if you have one.
# (OneHotEncoder's categorical_features argument was removed in
# scikit-learn 0.22; ColumnTransformer now selects the columns.)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
encoder = ColumnTransformer(
    [('onehot', OneHotEncoder(drop='first'), ['<column-name>'])],
    remainder='passthrough')

# drop='first' avoids the dummy variable trap
X = encoder.fit_transform(X)

# Splitting the data into test and train sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0, test_size = 0.2)

# Fitting the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the test set results
y_pred = regressor.predict(X_test)

That’s it. You can use this code as a template for implementing multiple linear regression on any dataset. For a worked example, see: Linear Regression with an example

Answer 11


Here is an alternative and basic method:

from patsy import dmatrices
import statsmodels.api as sm

y,x = dmatrices("y_data ~ x_1 + x_2 ", data = my_data)
### y_data is the name of the dependent variable in your data ### 
model_fit = sm.OLS(y,x)
results = model_fit.fit()

Instead of sm.OLS you can also use sm.Logit, sm.Probit, etc.






I have a set of data and I want to compare which line describes it best (polynomials of different orders, exponential or logarithmic).

I use Python and NumPy, and for polynomial fitting there is the function polyfit(). But I found no such functions for exponential and logarithmic fitting.

Are there any? If not, how else can I solve this?

Answer 0

For fitting y = A + B log x, just fit y against (log x).

>>> x = numpy.array([1, 7, 20, 50, 79])
>>> y = numpy.array([10, 19, 30, 35, 51])
>>> numpy.polyfit(numpy.log(x), y, 1)
array([ 8.46295607,  6.61867463])
# y ≈ 8.46 log(x) + 6.62

For fitting y = Ae^(Bx), taking the logarithm of both sides gives log y = log A + Bx. So fit (log y) against x.

Note that fitting (log y) as if it is linear will emphasize small values of y, causing large deviation for large y. This is because polyfit (linear regression) works by minimizing Σ_i (ΔY_i)^2 = Σ_i (Y_i − Ŷ_i)^2. When Y_i = log y_i, the residuals are ΔY_i = Δ(log y_i) ≈ Δy_i / |y_i|. So even if polyfit makes a very bad decision for large y, the “divide-by-|y|” factor will compensate for it, causing polyfit to favor small values.

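This could be alleviated by giving each entry a “weight” proportional to y. polyfit supports weighted least squares via the w keyword argument. Because w multiplies the unsquared residuals, a weight of y on each squared residual is passed as w = sqrt(y).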

>>> x = numpy.array([10, 19, 30, 35, 51])
>>> y = numpy.array([1, 7, 20, 50, 79])
>>> numpy.polyfit(x, numpy.log(y), 1)
array([ 0.10502711, -0.40116352])
#    y ≈ exp(-0.401) * exp(0.105 * x) = 0.670 * exp(0.105 * x)
# (^ biased towards small values)
>>> numpy.polyfit(x, numpy.log(y), 1, w=numpy.sqrt(y))
array([ 0.06009446,  1.41648096])
#    y ≈ exp(1.42) * exp(0.0601 * x) = 4.12 * exp(0.0601 * x)
# (^ not so biased)

Note that Excel, LibreOffice and most scientific calculators typically use the unweighted (biased) formula for the exponential regression / trend lines. If you want your results to be compatible with these platforms, do not include the weights even if it provides better results.
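
The two variants above can be wrapped in one helper so the choice is explicit at the call site. This is only a sketch: the name fit_exponential and its weighted flag are hypothetical, and it assumes every y is positive so that log(y) is defined.

```python
import numpy as np

def fit_exponential(x, y, weighted=True):
    """Fit y = A * exp(B * x) by linear regression on log(y).

    Returns (A, B). With weighted=True each squared residual is weighted
    by y, which polyfit expresses as w = sqrt(y).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.sqrt(y) if weighted else None
    B, log_A = np.polyfit(x, np.log(y), 1, w=w)   # slope first, then intercept
    return np.exp(log_A), B

x = np.array([10, 19, 30, 35, 51])
y = np.array([1, 7, 20, 50, 79])
print(fit_exponential(x, y, weighted=False))   # spreadsheet-compatible variant
print(fit_exponential(x, y, weighted=True))    # less biased toward small y
```

Pass weighted=False when the result has to match Excel or LibreOffice trend lines.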

Now, if you can use scipy, you could use scipy.optimize.curve_fit to fit any model without transformations.

For y = A + B log x the result is the same as the transformation method:

>>> x = numpy.array([1, 7, 20, 50, 79])
>>> y = numpy.array([10, 19, 30, 35, 51])
>>> scipy.optimize.curve_fit(lambda t,a,b: a+b*numpy.log(t),  x,  y)
(array([ 6.61867467,  8.46295606]), 
 array([[ 28.15948002,  -7.89609542],
        [ -7.89609542,   2.9857172 ]]))
# y ≈ 6.62 + 8.46 log(x)

For y = Ae^(Bx), however, we can get a better fit since curve_fit minimizes Δy directly rather than Δ(log y). But we need to provide an initial guess so that curve_fit can reach the desired local minimum.

>>> x = numpy.array([10, 19, 30, 35, 51])
>>> y = numpy.array([1, 7, 20, 50, 79])
>>> scipy.optimize.curve_fit(lambda t,a,b: a*numpy.exp(b*t),  x,  y)
(array([  5.60728326e-21,   9.99993501e-01]),
 array([[  4.14809412e-27,  -1.45078961e-08],
        [ -1.45078961e-08,   5.07411462e+10]]))
# oops, definitely wrong.
>>> scipy.optimize.curve_fit(lambda t,a,b: a*numpy.exp(b*t),  x,  y,  p0=(4, 0.1))
(array([ 4.88003249,  0.05531256]),
 array([[  1.01261314e+01,  -4.31940132e-02],
        [ -4.31940132e-02,   1.91188656e-04]]))
# y ≈ 4.88 exp(0.0553 x). much better.
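
Rather than picking the initial guess by hand, one option is to take it from the log-linear polyfit estimate described earlier. The sketch below assumes all y values are positive; the names exp_model and fit_exp_with_guess are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_model(t, a, b):
    return a * np.exp(b * t)

def fit_exp_with_guess(x, y):
    """Fit y = a * exp(b * x), seeding curve_fit with a log-linear estimate."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b0, log_a0 = np.polyfit(x, np.log(y), 1)   # rough estimate; requires y > 0
    popt, pcov = curve_fit(exp_model, x, y, p0=(np.exp(log_a0), b0))
    return popt, pcov

x = np.array([10, 19, 30, 35, 51])
y = np.array([1, 7, 20, 50, 79])
popt, pcov = fit_exp_with_guess(x, y)
print(popt)
```

The log-linear estimate is biased toward small y, but it is usually close enough to keep the optimizer away from the degenerate solution shown above.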

Answer 1


You can also fit a set of data to whatever function you like using curve_fit from scipy.optimize. For example, if you want to fit an exponential function (from the documentation):

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a * np.exp(-b * x) + c

x = np.linspace(0,4,50)
y = func(x, 2.5, 1.3, 0.5)
yn = y + 0.2*np.random.normal(size=len(x))

popt, pcov = curve_fit(func, x, yn)

And then if you want to plot, you could do:

plt.plot(x, yn, 'ko', label="Original Noised Data")
plt.plot(x, func(x, *popt), 'r-', label="Fitted Curve")

(Note: the * in front of popt when you plot will expand out the terms into the a, b, and c that func is expecting.)
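
The second return value, pcov, is the estimated covariance matrix of the parameters, and the square roots of its diagonal give one-standard-deviation errors for a, b and c. Here is a sketch of that, with a fixed random seed added (an assumption, so that the run is repeatable):

```python
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a * np.exp(-b * x) + c

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable
x = np.linspace(0, 4, 50)
yn = func(x, 2.5, 1.3, 0.5) + 0.2 * rng.normal(size=len(x))

popt, pcov = curve_fit(func, x, yn)
perr = np.sqrt(np.diag(pcov))           # one-standard-deviation errors

for name, value, err in zip("abc", popt, perr):
    print(f"{name} = {value:.3f} +/- {err:.3f}")
```

Large errors relative to the parameter values are a hint that the model is poorly constrained by the data.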

Answer 2



I was having some trouble with this, so let me be very explicit so that beginners like me can understand.

Let's say that we have a data file or something like that:

# -*- coding: utf-8 -*-

import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import numpy as np
import sympy as sym

# Generate some data; let's imagine that you already have this.
x = np.linspace(0, 3, 50)
y = np.exp(x)

# Plot your data.
plt.plot(x, y, 'ro', label="Original Data")

# Brute force to avoid errors: convert your data to numpy arrays of floats
# so that curve_fit can work.
x = np.array(x, dtype=float)
y = np.array(y, dtype=float)

# Create a function to fit to your data. a, b, c and d are the coefficients
# that curve_fit will calculate for you.
# In this part you need to guess and/or use mathematical knowledge to find
# a function that resembles your data.
def func(x, a, b, c, d):
    return a*x**3 + b*x**2 + c*x + d

# Run curve_fit.
popt, pcov = curve_fit(func, x, y)

# The result is:
# popt[0] = a, popt[1] = b, popt[2] = c and popt[3] = d of the function,
# so f(x) = popt[0]*x**3 + popt[1]*x**2 + popt[2]*x + popt[3].
print("a = %s , b = %s, c = %s, d = %s" % (popt[0], popt[1], popt[2], popt[3]))

# Use sympy to generate the LaTeX syntax of the function.
xs = sym.Symbol(r'\lambda')
tex = sym.latex(func(xs, *popt)).replace('$', '')
plt.title(r'$f(\lambda)= %s$' % (tex), fontsize=16)

# Plot the fitted function.
plt.plot(x, func(x, *popt), label="Fitted Curve")  # same as the line below
#plt.plot(x, popt[0]*x**3 + popt[1]*x**2 + popt[2]*x + popt[3], label="Fitted Curve")

plt.legend(loc='upper left')

The result is: a = 0.849195983017, b = -1.18101681765, c = 2.24061176543, d = 0.816643894816
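
Since the original question is about deciding which curve describes the data best, one simple approach is to fit each candidate and compare the sum of squared errors on the original y scale. The sketch below does this for the same data with numpy only; the helper name sse is hypothetical, and the exponential candidate is fitted through log(y) as an assumption of convenience.

```python
import numpy as np

def sse(y, y_hat):
    """Sum of squared errors between data and model predictions."""
    return float(np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2))

x = np.linspace(0, 3, 50)
y = np.exp(x)

# Candidate 1: cubic polynomial.
cubic = np.polyfit(x, y, 3)
sse_cubic = sse(y, np.polyval(cubic, x))

# Candidate 2: exponential y = A * exp(B * x), fitted on log(y).
B, log_A = np.polyfit(x, np.log(y), 1)
sse_exp = sse(y, np.exp(log_A) * np.exp(B * x))

print(sse_cubic, sse_exp)
```

A lower error alone does not settle the question when the candidates have different numbers of parameters, because a more flexible model can always fit at least as closely. Treat the comparison as a first check rather than a verdict.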

Answer 3


Well I guess you can always use:

np.log   -->  natural log
np.log10 -->  base 10
np.log2  -->  base 2

Slightly modifying IanVS’s answer:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func(x, a, b, c):
  #return a * np.exp(-b * x) + c
  return a * np.log(b * x) + c

x = np.linspace(1,5,50)   # start at 1 rather than 0 to avoid log(0)
y = func(x, 2.5, 1.3, 0.5)
yn = y + 0.2*np.random.normal(size=len(x))

popt, pcov = curve_fit(func, x, yn)

plt.plot(x, yn, 'ko', label="Original Noised Data")
plt.plot(x, func(x, *popt), 'r-', label="Fitted Curve")

This results in the following graph:

Answer 4

Here's a linearization option on simple data that uses tools from scikit-learn.


import numpy as np

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer


# General Functions
def func_exp(x, a, b, c):
    """Return values from a general exponential function."""
    return a * np.exp(b * x) + c

def func_log(x, a, b, c):
    """Return values from a general log function."""
    return a * np.log(b * x) + c

# Helper
def generate_data(func, *args, jitter=0):
    """Return a tuple of arrays with random data along a general function."""
    xs = np.linspace(1, 5, 50)
    ys = func(xs, *args)
    noise = jitter * np.random.normal(size=len(xs)) + jitter
    xs = xs.reshape(-1, 1)                                  # xs[:, np.newaxis]
    ys = (ys + noise).reshape(-1, 1)
    return xs, ys
transformer = FunctionTransformer(np.log, validate=True)


# Data
x_samp, y_samp = generate_data(func_exp, 2.5, 1.2, 0.7, jitter=3)
y_trans = transformer.fit_transform(y_samp)             # 1

# Regression
regressor = LinearRegression()
results = regressor.fit(x_samp, y_trans)                # 2
model = results.predict
y_fit = model(x_samp)

# Visualization
plt.scatter(x_samp, y_samp)
plt.plot(x_samp, np.exp(y_fit), "k--", label="Fit")     # 3
plt.title("Exponential Fit")


# Data
x_samp, y_samp = generate_data(func_log, 2.5, 1.2, 0.7, jitter=0.15)
x_trans = transformer.fit_transform(x_samp)             # 1

# Regression
regressor = LinearRegression()
results = regressor.fit(x_trans, y_samp)                # 2
model = results.predict
y_fit = model(x_trans)

# Visualization
plt.scatter(x_samp, y_samp)
plt.plot(x_samp, y_fit, "k--", label="Fit")             # 3
plt.title("Logarithmic Fit")



  1. 应用日志操作数据值(xy或两者)
  2. 将数据回归到线性模型
  3. 通过“反转”任何日志操作(使用np.exp())进行绘制并适合原始数据


我们可以通过取log线性化后一个方程(例如y =截距+斜率* x):


  • A通过拦截(ln(A)
  • B通过坡度(B


Relationship |  Example   |     General Eqn.     |  Altered Var.  |        Linearized Eqn.  
Linear       | x          | y =     B * x    + C | -              |        y =   C    + B * x
Logarithmic  | log(x)     | y = A * log(B*x) + C | log(x)         |        y =   C    + A * (log(B) + log(x))
Exponential  | 2**x, e**x | y = A * exp(B*x) + C | log(y)         | log(y-C) = log(A) + B * x
Power        | x**2       | y =     B * x**N + C | log(x), log(y) | log(y-C) = log(B) + N * log(x)

+注意:当噪声较小且C = 0时,线性化指数函数的效果最佳。请谨慎使用。


Here’s a linearization option on simple data that uses tools from scikit-learn.


import numpy as np

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer


# General Functions
def func_exp(x, a, b, c):
    """Return values from a general exponential function."""
    return a * np.exp(b * x) + c

def func_log(x, a, b, c):
    """Return values from a general log function."""
    return a * np.log(b * x) + c

# Helper
def generate_data(func, *args, jitter=0):
    """Return a tuple of arrays with random data along a general function."""
    xs = np.linspace(1, 5, 50)
    ys = func(xs, *args)
    noise = jitter * np.random.normal(size=len(xs)) + jitter
    xs = xs.reshape(-1, 1)                                  # xs[:, np.newaxis]
    ys = (ys + noise).reshape(-1, 1)
    return xs, ys
transformer = FunctionTransformer(np.log, validate=True)


Fit exponential data

# Data
x_samp, y_samp = generate_data(func_exp, 2.5, 1.2, 0.7, jitter=3)
y_trans = transformer.fit_transform(y_samp)             # 1

# Regression
regressor = LinearRegression()
results = regressor.fit(x_samp, y_trans)                # 2
model = results.predict
y_fit = model(x_samp)

# Visualization
plt.scatter(x_samp, y_samp)
plt.plot(x_samp, np.exp(y_fit), "k--", label="Fit")     # 3
plt.title("Exponential Fit")

Fit log data

# Data
x_samp, y_samp = generate_data(func_log, 2.5, 1.2, 0.7, jitter=0.15)
x_trans = transformer.fit_transform(x_samp)             # 1

# Regression
regressor = LinearRegression()
results = regressor.fit(x_trans, y_samp)                # 2
model = results.predict
y_fit = model(x_trans)

# Visualization
plt.scatter(x_samp, y_samp)
plt.plot(x_samp, y_fit, "k--", label="Fit")             # 3
plt.title("Logarithmic Fit")


General Steps

  1. Apply a log operation to data values (x, y or both)
  2. Regress the data to a linearized model
  3. Plot by “reversing” any log operations (with np.exp()) and fit to original data

Assuming our data follows an exponential trend, a general equation+ may be:

    y = A * exp(B * x) + C

We can linearize the latter equation (e.g. y = intercept + slope * x) by taking the natural log:

    log(y - C) = log(A) + B * x

Given a linearized equation++ and the regression parameters, we could calculate:

  • A via intercept (ln(A)), i.e. A = exp(intercept)
  • B via slope (B), i.e. B = slope
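As a minimal sketch of recovering A and B from the regression parameters, the snippet below fits noise-free data with C = 0. It uses np.polyfit as a stand-in for LinearRegression (an assumption made to keep the example NumPy-only), and the values 2.5 and 1.2 are illustrative.

```python
import numpy as np

# Noise-free samples from y = A * exp(B * x) with C = 0 (illustrative values)
A_true, B_true = 2.5, 1.2
x = np.linspace(1, 5, 50)
y = A_true * np.exp(B_true * x)

# Fit log(y) = log(A) + B * x with a degree-1 polynomial;
# np.polyfit returns the highest-degree coefficient first: (slope, intercept)
slope, intercept = np.polyfit(x, np.log(y), 1)

A_est = np.exp(intercept)   # undo the log on the intercept to get A
B_est = slope               # the slope is B directly
print(A_est, B_est)
```

With noisy data the estimates would only approximate the true values, because the log transform also reshapes the noise.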

Summary of Linearization Techniques

Relationship |  Example   |     General Eqn.     |  Altered Var.  |        Linearized Eqn.  
Linear       | x          | y =     B * x    + C | -              |        y =   C    + B * x
Logarithmic  | log(x)     | y = A * log(B*x) + C | log(x)         |        y =   C    + A * (log(B) + log(x))
Exponential  | 2**x, e**x | y = A * exp(B*x) + C | log(y)         | log(y-C) = log(A) + B * x
Power        | x**2       | y =     B * x**N + C | log(x), log(y) | log(y-C) = log(B) + N * log(x)

+Note: linearizing exponential functions works best when the noise is small and C=0. Use with caution.

++Note: altering y data (taking log(y)) helps linearize exponential data, while altering x data (taking log(x)) helps linearize log data.
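The caution in the first note (C = 0) can be illustrated with a small sketch. It assumes noise-free data and a deliberately large offset C = 20, both chosen only for illustration, and uses np.polyfit rather than LinearRegression. It compares a fit that ignores C with one that subtracts a known C first.

```python
import numpy as np

A, B, C = 2.5, 1.2, 20.0          # illustrative values; C is deliberately large
x = np.linspace(1, 5, 50)
y = A * np.exp(B * x) + C         # noise-free exponential with an offset

# Naive fit: regress log(y) on x, ignoring the offset C
B_naive, _ = np.polyfit(x, np.log(y), 1)

# Offset-aware fit: regress log(y - C) on x (assumes C is known in advance)
B_shifted, _ = np.polyfit(x, np.log(y - C), 1)

print(f"true B = {B}, naive = {B_naive:.3f}, shifted = {B_shifted:.3f}")
```

The naive slope underestimates B because log(y) is no longer a straight line in x when C is not zero. In practice C is rarely known, which is one reason a nonlinear fit may be preferable.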

Answer 5




We demonstrate features of lmfit while solving both problems.


import lmfit

import numpy as np

import matplotlib.pyplot as plt

# %matplotlib inline    # Jupyter magic; uncomment when running in a notebook

# General Functions
def func_log(x, a, b, c):
    """Return values from a general log function."""
    return a * np.log(b * x) + c

# Data
x_samp = np.linspace(1, 5, 50)
_noise = np.random.normal(size=len(x_samp), scale=0.06)
y_samp = 2.5 * np.exp(1.2 * x_samp) + 0.7 + _noise
y_samp2 = 2.5 * np.log(1.2 * x_samp) + 0.7 + _noise


Approach 1 – lmfit Model

Fit exponential data

regressor = lmfit.models.ExponentialModel()                # 1    
initial_guess = dict(amplitude=1, decay=-1)                # 2
results = regressor.fit(y_samp, x=x_samp, **initial_guess)
y_fit = results.best_fit    

plt.plot(x_samp, y_samp, "o", label="Data")
plt.plot(x_samp, y_fit, "k--", label="Fit")

Approach 2 – Custom Model

Fit log data

regressor = lmfit.Model(func_log)                          # 1
initial_guess = dict(a=1, b=.1, c=.1)                      # 2
results = regressor.fit(y_samp2, x=x_samp, **initial_guess)
y_fit = results.best_fit

plt.plot(x_samp, y_samp2, "o", label="Data")
plt.plot(x_samp, y_fit, "k--", label="Fit")


  1. Choose a regression class
  2. Supply named, initial guesses that respect the function’s domain
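To see why the initial guesses should respect the function’s domain, here is a small NumPy-only sketch. It reuses func_log from above; the guess b=-0.1 is a hypothetical bad value chosen for illustration. Because np.log is undefined for negative arguments, a negative b turns every model value into NaN before the optimizer can make any progress.

```python
import numpy as np

def func_log(x, a, b, c):
    """Return values from a general log function."""
    return a * np.log(b * x) + c

x_samp = np.linspace(1, 5, 50)

# A guess outside the domain: b * x < 0, so np.log returns NaN everywhere
with np.errstate(invalid="ignore"):          # silence the RuntimeWarning
    bad = func_log(x_samp, a=1, b=-0.1, c=0.1)

# A guess inside the domain: b * x > 0, so all values are finite
good = func_log(x_samp, a=1, b=0.1, c=0.1)

print(np.isnan(bad).all(), np.isfinite(good).all())
```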

You can determine the inferred parameters from the regressor object. Example:

regressor.param_names
# ['decay', 'amplitude']

Note: ExponentialModel() follows a decay function, amplitude * exp(-x / decay), which accepts two parameters. To fit exponential growth, decay must be negative, hence the initial guess decay=-1.
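The relationship between the decay parameter and the growth rate B in y = A * exp(B * x) can be sketched without lmfit. The function exponential below is a hand-written stand-in for the decay form amplitude * exp(-x / decay), not lmfit’s own code, and the values 2.5 and 1.2 are the illustrative ones used in the sample data.

```python
import numpy as np

def exponential(x, amplitude, decay):
    """Hand-written stand-in for the form amplitude * exp(-x / decay)."""
    return amplitude * np.exp(-x / decay)

growth_rate = 1.2                 # B in y = A * exp(B * x)
decay = -1.0 / growth_rate        # a negative decay produces growth

x = np.linspace(1, 5, 50)
y_growth = 2.5 * np.exp(growth_rate * x)
y_model = exponential(x, amplitude=2.5, decay=decay)

# The two parameterizations describe the same curve
print(np.allclose(y_growth, y_model))
```

So a fitted decay can be converted back to a growth rate with B = -1 / decay.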

See also ExponentialGaussianModel(), which accepts more parameters.

Install the library via > pip install lmfit.

Answer 6




Wolfram MathWorld has a closed-form solution for fitting an exponential. They also have similar solutions for fitting a logarithmic and a power law.

I found this to work better than scipy’s curve_fit, especially when you don’t have data “near zero”. Here is an example:

import numpy as np
import matplotlib.pyplot as plt

# Fit the function y = A * exp(B * x) to the data
# returns (A, B)
# From: https://mathworld.wolfram.com/LeastSquaresFittingExponential.html
def fit_exp(xs, ys):
    S_x2_y = 0.0
    S_y_lny = 0.0
    S_x_y = 0.0
    S_x_y_lny = 0.0
    S_y = 0.0
    for (x,y) in zip(xs, ys):
        S_x2_y += x * x * y
        S_y_lny += y * np.log(y)
        S_x_y += x * y
        S_x_y_lny += x * y * np.log(y)
        S_y += y
    a = (S_x2_y * S_y_lny - S_x_y * S_x_y_lny) / (S_y * S_x2_y - S_x_y * S_x_y)
    b = (S_y * S_x_y_lny - S_x_y * S_y_lny) / (S_y * S_x2_y - S_x_y * S_x_y)
    return (np.exp(a), b)

xs = [33, 34, 35, 36, 37, 38, 39, 40, 41, 42]
ys = [3187, 3545, 4045, 4447, 4872, 5660, 5983, 6254, 6681, 7206]

(A, B) = fit_exp(xs, ys)

plt.plot(xs, ys, 'o-', label='Raw Data')
plt.plot(xs, [A * np.exp(B *x) for x in xs], 'o-', label='Fit')

plt.title('Exponential Fit Test')
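For the logarithmic case mentioned above, a closed-form fit of y = a + b * ln(x) is simply ordinary least squares on ln(x). The sketch below is a minimal version under that assumption; fit_log is a hypothetical helper name, and the test data is synthetic and noise-free.

```python
import numpy as np

def fit_log(xs, ys):
    """Fit y = a + b * ln(x) by least squares on ln(x); returns (a, b)."""
    lnx = np.log(np.asarray(xs, dtype=float))
    ys = np.asarray(ys, dtype=float)
    n = len(lnx)
    # Standard simple-linear-regression formulas with ln(x) as the regressor
    b = (n * np.sum(ys * lnx) - np.sum(ys) * np.sum(lnx)) / (
        n * np.sum(lnx ** 2) - np.sum(lnx) ** 2
    )
    a = (np.sum(ys) - b * np.sum(lnx)) / n
    return a, b

# Synthetic, noise-free data from y = 0.7 + 2.5 * ln(x)
xs = np.linspace(1, 5, 50)
ys = 0.7 + 2.5 * np.log(xs)

a, b = fit_log(xs, ys)
print(a, b)
```

All x values must be positive for the log to be defined.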



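100 Days of Machine Learning, Siraj Raval

Get the datasets from: here

Check out the code from: here

#100DaysOfMLCode To clarify my insights on logistic regression, I searched the internet for some resources or articles, and I came across this article (https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc) by Saishruthi Swaminathan.

"The Learning Problem", Professor Yaser Abu-Mostafa | Day 19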

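Started Lecture 1 of 18 of Caltech's Machine Learning course (CS156), taught by Professor Yaser Abu-Mostafa. It was basically an introduction to the upcoming lectures. He also explained the perceptron algorithm.

LEC 3 of the Bloomberg ML course introduced some core concepts such as input space, action space, outcome space, prediction functions, loss functions, and hypothesis spaces.

An amazing video on neural networks by the 3Blue1Brown YouTube channel. This video gives a good understanding of neural networks and uses the handwritten digits dataset to explain the concepts. Link to the video.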


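Part two of neural networks by the 3Blue1Brown YouTube channel. This video explains the concept of gradient descent in an interesting way. A must-watch, highly recommended. Link to the video.

Part three of neural networks by the 3Blue1Brown YouTube channel. This video mostly discusses partial derivatives and backpropagation. Link to the video.

Part four of neural networks by the 3Blue1Brown YouTube channel. The goal here is to express, in somewhat more formal terms, the intuition for how backpropagation works; the video mostly discusses partial derivatives and backpropagation. Link to the video.

I got a new book, "Python Data Science Handbook" by JK VanderPlas. Check the Jupyter notebooks here.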

Introduction to NumPy

Understanding Data Types in Python

The Basics of NumPy Arrays

Computation on NumPy Arrays: Universal Functions


Aggregations: Min, Max, and Everything In Between

Computation on Arrays: Broadcasting

Comparisons, Masks, and Boolean Logic


Fancy Indexing

Sorting Arrays

Structured Data: NumPy’s Structured Arrays


Data Manipulation with Pandas

Introducing Pandas Objects

Data Indexing and Selection

Operating on Data in Pandas

Handling Missing Data

Hierarchical Indexing

Combining Datasets: Concat and Append


Combining Datasets: Merge and Join

Aggregation and Grouping

Pivot Tables


Vectorized String Operations

Working with Time Series

High-Performance Pandas: eval() and query()


Visualization with Matplotlib

Simple Line Plots

Simple Scatter Plots

Visualizing Errors

Density and Contour Plots


Histograms, Binnings, and Density

Customizing Plot Legends

Customizing Colorbars

Multiple Subplots

Text and Annotation


Three-Dimensional Plotting in Matplotlib

