Tag Archives: scikit-learn

Can anyone explain StandardScaler?

Question: Can anyone explain StandardScaler?


I am unable to understand the page of the StandardScaler in the documentation of sklearn.

Can anyone explain this to me in simple terms?


Answer 0


The idea behind StandardScaler is that it will transform your data such that its distribution will have a mean value 0 and standard deviation of 1.
In case of multivariate data, this is done feature-wise (in other words independently for each column of the data).
Given the distribution of the data, each value in the dataset will have the mean subtracted and then be divided by the standard deviation of the whole dataset (or of the feature in the multivariate case).
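
A minimal sketch of that description, on a made-up 2-feature array: per column, subtract the mean and divide by the standard deviation, which is exactly what fit_transform does.

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # 3 samples, 2 features

manual = (X - X.mean(axis=0)) / X.std(axis=0)  # column-wise: subtract mean, divide by std
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))  # True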


Answer 1


Intro: I assume that you have a matrix X where each row/line is a sample/observation and each column is a variable/feature (this is the expected input for any sklearn ML function by the way — X.shape should be [number_of_samples, number_of_features]).


Core of method: The main idea is to normalize/standardize, i.e. transform to μ = 0 and σ = 1, your features/variables/columns of X, individually, before applying any machine learning model.

StandardScaler() will normalize the features i.e. each column of X, INDIVIDUALLY, so that each column/feature/variable will have μ = 0 and σ = 1.


P.S: I find the most upvoted answer on this page wrong. I am quoting “each value in the dataset will have the sample mean value subtracted” — this is neither true nor correct.


See also: How and why to Standardize your data: A python tutorial


Example:

from sklearn.preprocessing import StandardScaler
import numpy as np

# 4 samples/observations and 2 variables/features
data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

print(data)
[[0, 0],
 [1, 0],
 [0, 1],
 [1, 1]])

print(scaled_data)
[[-1. -1.]
 [ 1. -1.]
 [-1.  1.]
 [ 1.  1.]]

Verify that the mean of each feature (column) is 0:

scaled_data.mean(axis = 0)
array([0., 0.])

Verify that the std of each feature (column) is 1:

scaled_data.std(axis = 0)
array([1., 1.])

The maths: z = (x − μ) / σ, where μ is the column mean and σ the column standard deviation.


UPDATE 08/2020: Concerning the input parameters with_mean and with_std set to False/True, I have provided an answer here: StandardScaler difference between “with_std=False or True” and “with_mean=False or True”



Answer 3


StandardScaler performs the task of Standardization. Usually a dataset contains variables that differ in scale. For example, an Employee dataset will contain an AGE column with values on a scale of 20-70 and a SALARY column with values on a scale of 10000-80000.
As these two columns differ in scale, they are standardized to a common scale while building a machine learning model.
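
A hedged sketch of that example (the Employee numbers below are made up): scaling puts both columns on a common scale.

import numpy as np
from sklearn.preprocessing import StandardScaler

employees = np.array([[25.0, 20000.0],
                      [40.0, 45000.0],
                      [70.0, 80000.0]])  # columns: AGE, SALARY

print(StandardScaler().fit_transform(employees))  # both columns now have mean 0 and std 1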


Answer 4


This is useful when you want to compare data that correspond to different units. In that case, you want to remove the units. To do that consistently across all the data, you transform the data so that the variance is unitary and the mean of the series is 0.
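
A small sketch of the unit-removal point, with made-up heights: the same lengths in metres and in centimetres standardize to identical values, because the unit cancels out of (x − μ) / σ.

import numpy as np
from sklearn.preprocessing import StandardScaler

metres = np.array([[1.60], [1.75], [1.90]])
centimetres = metres * 100.0

z_m = StandardScaler().fit_transform(metres)
z_cm = StandardScaler().fit_transform(centimetres)
print(np.allclose(z_m, z_cm))  # True: the unit has been divided out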


Answer 5


The answers above are great, but I needed a simple example to alleviate some concerns that I have had in the past. I wanted to make sure it was indeed treating each column separately. I am now reassured, and can no longer find the example that had caused me concern. All columns ARE scaled separately, as described by those above.

CODE

import pandas as pd
import scipy.stats as ss
from sklearn.preprocessing import StandardScaler


data= [[1, 1, 1, 1, 1],[2, 5, 10, 50, 100],[3, 10, 20, 150, 200],[4, 15, 40, 200, 300]]

df = pd.DataFrame(data, columns=['N0', 'N1', 'N2', 'N3', 'N4']).astype('float64')

sc_X = StandardScaler()
df = sc_X.fit_transform(df)

num_cols = len(df[0,:])
for i in range(num_cols):
    col = df[:,i]
    col_stats = ss.describe(col)
    print(col_stats)

OUTPUT

DescribeResult(nobs=4, minmax=(-1.3416407864998738, 1.3416407864998738), mean=0.0, variance=1.3333333333333333, skewness=0.0, kurtosis=-1.3599999999999999)
DescribeResult(nobs=4, minmax=(-1.2828087129930659, 1.3778315806221817), mean=-5.551115123125783e-17, variance=1.3333333333333337, skewness=0.11003776770595125, kurtosis=-1.394993095506219)
DescribeResult(nobs=4, minmax=(-1.155344148338584, 1.53471088361394), mean=0.0, variance=1.3333333333333333, skewness=0.48089217736510326, kurtosis=-1.1471008824318165)
DescribeResult(nobs=4, minmax=(-1.2604572012883055, 1.2668071116222517), mean=-5.551115123125783e-17, variance=1.3333333333333333, skewness=0.0056842140599118185, kurtosis=-1.6438177182479734)
DescribeResult(nobs=4, minmax=(-1.338945389819976, 1.3434309690153527), mean=5.551115123125783e-17, variance=1.3333333333333333, skewness=0.005374558840039456, kurtosis=-1.3619131970819205)

NOTE:

The scipy.stats module is correctly reporting the “sample” variance, which uses (n – 1) in the denominator. The “population” variance would use n in the denominator for the calculation of variance. To understand better, please see the code below that uses scaled data from the first column of the data set above:

Code

import scipy.stats as ss

sc_Data = [[-1.34164079], [-0.4472136], [0.4472136], [1.34164079]]
col_stats = ss.describe([-1.34164079, -0.4472136, 0.4472136, 1.34164079])
print(col_stats)
print()

mean_by_hand = 0
for row in sc_Data:
    for element in row:
        mean_by_hand += element
mean_by_hand /= 4

variance_by_hand = 0
for row in sc_Data:
    for element in row:
        variance_by_hand += (mean_by_hand - element)**2
sample_variance_by_hand = variance_by_hand / 3
sample_std_dev_by_hand = sample_variance_by_hand ** 0.5

pop_variance_by_hand = variance_by_hand / 4
pop_std_dev_by_hand = pop_variance_by_hand ** 0.5

print("Sample of Population Calcs:")
print(mean_by_hand, sample_variance_by_hand, sample_std_dev_by_hand, '\n')
print("Population Calcs:")
print(mean_by_hand, pop_variance_by_hand, pop_std_dev_by_hand)

Output

DescribeResult(nobs=4, minmax=(-1.34164079, 1.34164079), mean=0.0, variance=1.3333333422778562, skewness=0.0, kurtosis=-1.36000000429325)

Sample of Population Calcs:
0.0 1.3333333422778562 1.1547005422523435

Population Calcs:
0.0 1.000000006708392 1.000000003354196
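
A short sketch of that ddof point (reusing the scaled first column from above): NumPy defaults to the population formula, while scipy.stats.describe reports the sample formula.

import numpy as np
import scipy.stats as ss

col = np.array([-1.34164079, -0.4472136, 0.4472136, 1.34164079])

print(np.var(col))                # ~1.0  (population variance, denominator n)
print(np.var(col, ddof=1))        # ~1.33 (sample variance, denominator n - 1)
print(ss.describe(col).variance)  # matches the ddof=1 value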

Answer 6


Following is a simple working example to explain how the standardization calculation works. The theory part is already well explained in other answers.

>>> import numpy as np
>>> data = [[6, 2], [4, 2], [6, 4], [8, 2]]
>>> a = np.array(data)

>>> np.std(a, axis=0)
array([1.41421356, 0.8660254 ])

>>> np.mean(a, axis=0)
array([6. , 2.5])

>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> scaler.fit(data)
>>> print(scaler.mean_)

# Xchanged = (X − μ) / σ, where σ is the standard deviation and μ is the mean
>>> z = scaler.transform(data)
>>> z

Calculation

As you can see in the output, the mean is [6., 2.5] and the standard deviation is [1.41421356, 0.8660254].

The data at position (0, 1) is 2; standardized: (2 − 2.5) / 0.8660254 = −0.57735027.

The data at position (1, 0) is 4; standardized: (4 − 6) / 1.41421356 = −1.414.

Result After Standardization

array([[ 0.        , -0.57735027],
       [-1.41421356, -0.57735027],
       [ 0.        ,  1.73205081],
       [ 1.41421356, -0.57735027]])

Check Mean and Std Deviation After Standardization

>>> z.mean(axis=0)
array([ 0.00000000e+00, -2.77555756e-17])
>>> z.std(axis=0)
array([1., 1.])

Note: -2.77555756e-17 is very close to 0.
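
Rather than eyeballing tiny floating-point residues, a self-contained np.allclose check confirms the same thing:

import numpy as np
from sklearn.preprocessing import StandardScaler

data = [[6, 2], [4, 2], [6, 4], [8, 2]]
z = StandardScaler().fit_transform(data)

print(np.allclose(z.mean(axis=0), 0.0))  # True: -2.77555756e-17 is zero up to rounding
print(np.allclose(z.std(axis=0), 1.0))   # True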

References

  1. Compare the effect of different scalers on data with outliers

  2. What’s the difference between Normalization and Standardization?

  3. Mean of data scaled with sklearn StandardScaler is not zero


Answer 7


After applying StandardScaler(), each column in X will have mean of 0 and standard deviation of 1.

Formulas are listed by others on this page.

Rationale: some algorithms require data to look like this (see sklearn docs).


How to check which version of nltk and scikit-learn is installed?

Question: How to check which version of nltk and scikit-learn is installed?


In a shell script I am checking whether these packages are installed or not, and installing them if they are not. So within the shell script:

import nltk
echo nltk.__version__

but it stops the shell script at the import line.

In a Linux terminal I tried to check it this way:

which nltk

which gives nothing, though the package is installed.

Is there any other way to verify this package's installation in a shell script, and to install it if it is not installed?


Answer 0


import nltk is Python syntax, and as such won’t work in a shell script.

To test the version of nltk and scikit_learn, you can write a Python script and run it. Such a script may look like

import nltk
import sklearn

print('The nltk version is {}.'.format(nltk.__version__))
print('The scikit-learn version is {}.'.format(sklearn.__version__))

# The nltk version is 3.0.0.
# The scikit-learn version is 0.15.2.

Note that not all Python packages are guaranteed to have a __version__ attribute, so for some others it may fail, but for nltk and scikit-learn at least it will work.
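
For the install-if-missing part of the question, a hedged shell sketch (assuming python and pip point at the same environment) is to let the import’s exit status drive the install:

python -c "import nltk" 2>/dev/null || pip install nltk
python -c "import sklearn" 2>/dev/null || pip install scikit-learn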


Answer 1


Try this:

$ python -c "import nltk; print(nltk.__version__)"

Answer 2


In Windows® systems you can simply try

pip3 list | findstr scikit

scikit-learn                  0.22.1

If you are on Anaconda try

conda list scikit

scikit-learn              0.22.1           py37h6288b17_0

And this can be used to find out the version of any package you have installed. For example

pip3 list | findstr numpy

numpy                         1.17.4
numpydoc                      0.9.2

Or if you want to look for more than one package at a time

pip3 list | findstr "scikit numpy"

numpy                         1.17.4
numpydoc                      0.9.2
scikit-learn                  0.22.1

Note the quote characters are required when searching for more than one word.

Take care.


Answer 3


For checking the version of scikit-learn in a shell script, if you have pip installed, you can try this command:

pip freeze | grep scikit-learn
scikit-learn==0.17.1

Hope it helps!


Answer 4


You can find the NLTK version simply by doing:

In [1]: import nltk

In [2]: nltk.__version__
Out[2]: '3.2.5'

And similarly for scikit-learn,

In [3]: import sklearn

In [4]: sklearn.__version__
Out[4]: '0.19.0'

I’m using python3 here.


Answer 5


You may check from a Python notebook cell as follows:

!pip install --upgrade nltk     # needed if nltk is not already installed
import nltk      
print('The nltk version is {}.'.format(nltk.__version__))
print('The nltk version is '+ str(nltk.__version__))

and

#!pip install --upgrade sklearn      # needed if sklearn is not already installed
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))
print('The scikit-learn version is '+ str(sklearn.__version__))

Answer 6


On my machine, which is Ubuntu 14.04 with Python 2.7 installed, if I go here,

/usr/local/lib/python2.7/dist-packages/nltk/

there is a file called

VERSION.

If I do a cat VERSION it prints 3.1, which is the NLTK version installed.


Is there a library function for root mean squared error (RMSE) in Python?

Question: Is there a library function for root mean squared error (RMSE) in Python?


I know I could implement a root mean squared error function like this:

def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())

What I’m looking for is whether this rmse function is implemented in a library somewhere, perhaps in scipy or scikit-learn?


Answer 0


sklearn.metrics has a mean_squared_error function. The RMSE is just the square root of whatever it returns.

from sklearn.metrics import mean_squared_error
from math import sqrt

rms = sqrt(mean_squared_error(y_actual, y_predicted))
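
The snippet above assumes y_actual and y_predicted already exist; a runnable sketch with made-up values:

import numpy as np
from math import sqrt
from sklearn.metrics import mean_squared_error

y_actual = np.array([0.000, 0.166, 0.333])
y_predicted = np.array([0.000, 0.254, 0.998])

rms = sqrt(mean_squared_error(y_actual, y_predicted))
print(rms)  # ~0.3873, matching the manual np.sqrt(((p - t) ** 2).mean())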

Answer 1


What is RMSE? Also known as MSE, RMD, or RMS. What problem does it solve?

If you understand RMSE: (Root mean squared error), MSE: (Mean Squared Error), RMD (Root mean squared deviation) and RMS: (Root Mean Squared), then asking for a library to calculate this for you is unnecessary over-engineering. All these metrics are a single line of python code at most 2 inches long. The metrics rmse, mse, rmd, and rms are at their core conceptually identical.

RMSE answers the question: “How similar, on average, are the numbers in list1 to list2?”. The two lists must be the same size. I want to “wash out the noise between any two given elements, wash out the size of the data collected, and get a single number feel for change over time”.

Intuition and ELI5 for RMSE:

Imagine you are learning to throw darts at a dart board. Every day you practice for one hour. You want to figure out if you are getting better or getting worse. So every day you make 10 throws and measure the distance between the bullseye and where your dart hit.

You make a list of those numbers list1. Use the root mean squared error between the distances at day 1 and a list2 containing all zeros. Do the same on the 2nd and nth days. What you will get is a single number that hopefully decreases over time. When your RMSE number is zero, you hit bullseyes every time. If the rmse number goes up, you are getting worse.

Example in calculating root mean squared error in python:

import numpy as np
d = [0.000, 0.166, 0.333]   #ideal target distances, these can be all zeros.
p = [0.000, 0.254, 0.998]   #your performance goes here

print("d is: " + str(["%.8f" % elem for elem in d]))
print("p is: " + str(["%.8f" % elem for elem in p]))

def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())

rmse_val = rmse(np.array(d), np.array(p))
print("rms error is: " + str(rmse_val))

Which prints:

d is: ['0.00000000', '0.16600000', '0.33300000']
p is: ['0.00000000', '0.25400000', '0.99800000']
rms error is: 0.387284994115

The mathematical notation:

RMSE = sqrt( (1/n) * sum_{i=1..n} (p_i - d_i)^2 )

Glyph Legend: n is a whole positive integer representing the number of throws. i represents a whole positive integer counter that enumerates the sum. d stands for the ideal distances, the list2 containing all zeros in the above example. p stands for performance, the list1 in the above example. The superscript 2 stands for numeric squared. d_i is the i’th element of d. p_i is the i’th element of p.

The rmse, done in small steps so it can be understood:

def rmse(predictions, targets):

    differences = predictions - targets                       #the DIFFERENCEs.

    differences_squared = differences ** 2                    #the SQUAREs of ^

    mean_of_differences_squared = differences_squared.mean()  #the MEAN of ^

    rmse_val = np.sqrt(mean_of_differences_squared)           #ROOT of ^

    return rmse_val                                           #get the ^

How does every step of RMSE work:

Subtracting one number from another gives you the distance between them.

8 - 5 = 3         #absolute distance between 8 and 5 is +3
-20 - 10 = -30    #absolute distance between -20 and 10 is +30

If you multiply any number times itself, the result is always positive because negative times negative is positive:

3*3     = 9   = positive
-30*-30 = 900 = positive

Add them all up, but wait, then an array with many elements would have a larger error than a small array, so average them by the number of elements.

But wait, we squared them all earlier to force them positive. Undo the damage with a square root!

That leaves you with a single number that represents, on average, the distance between every value of list1 and its corresponding element of list2.

If the RMSE value goes down over time we are happy because variance is decreasing.

RMSE isn’t the most accurate line fitting strategy, total least squares is:

Root mean squared error measures the vertical distance between the point and the line, so if your data is shaped like a banana, flat near the bottom and steep near the top, then the RMSE will report greater distances to high points but shorter distances to low points, when in fact the distances are equivalent. This causes a skew where the line prefers to be closer to high points than to low ones.

If this is a problem the total least squares method fixes this: https://mubaris.com/posts/linear-regression

Gotchas that can break this RMSE function:

If there are nulls or infinities in either input list, then the output rmse value is not going to make sense. There are three strategies to deal with nulls / missing values / infinities in either list: ignore that component, zero it out, or add a best guess or uniform random noise to all timesteps. Each remedy has its pros and cons depending on what your data means. In general, ignoring any component with a missing value is preferred, but this biases the RMSE toward zero, making you think performance has improved when it really hasn’t. Adding random noise on top of a best guess could be preferred if there are lots of missing values.

In order to guarantee relative correctness of the RMSE output, you must eliminate all nulls/infinities from the input.

RMSE has zero tolerance for outlier data points which don’t belong

Root mean squared error relies on all data being right, and all points are counted as equal. That means one stray point that’s way out in left field will totally ruin the whole calculation. To handle outlier data points and dismiss their tremendous influence after a certain threshold, see robust estimators, which build in a threshold for dismissal of outliers.


Answer 2


This is probably faster?:

n = len(predictions)
rmse = np.linalg.norm(predictions - targets) / np.sqrt(n)
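
As a quick check (values made up), the identity behind this, ||v|| / sqrt(n) == sqrt(mean(v**2)), can be verified against the question's manual formula:

import numpy as np

predictions = np.array([0.000, 0.254, 0.998])
targets = np.array([0.000, 0.166, 0.333])

n = len(predictions)
rmse_norm = np.linalg.norm(predictions - targets) / np.sqrt(n)
rmse_manual = np.sqrt(((predictions - targets) ** 2).mean())
print(np.isclose(rmse_norm, rmse_manual))  # True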

Answer 3


In scikit-learn 0.22.0 you can pass mean_squared_error() the argument squared=False to return the RMSE.

from sklearn.metrics import mean_squared_error

mean_squared_error(y_actual, y_predicted, squared=False)
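
As an aside, in newer scikit-learn releases (1.4 and later, so check your version) there is also a dedicated helper, and the squared argument is deprecated there; a minimal sketch with hypothetical values:

from sklearn.metrics import root_mean_squared_error  # available in scikit-learn >= 1.4

rmse = root_mean_squared_error([0.000, 0.166, 0.333], [0.000, 0.254, 0.998])
print(rmse)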


Answer 4


Just in case someone finds this thread in 2019, there is a library called ml_metrics which is available without pre-installation in Kaggle’s kernels, pretty lightweight and accessible through pypi (it can be installed easily and fast with pip install ml_metrics):

from ml_metrics import rmse
rmse(actual=[0, 1, 2], predicted=[1, 10, 5])
# 5.507570547286102

It has a few other interesting metrics which are not available in sklearn, like mapk.



Answer 5


Actually, I did write a bunch of those as utility functions for statsmodels

http://statsmodels.sourceforge.net/devel/tools.html#measure-for-fit-performance-eval-measures

and http://statsmodels.sourceforge.net/devel/generated/statsmodels.tools.eval_measures.rmse.html#statsmodels.tools.eval_measures.rmse

Mostly one or two liners and not much input checking, and mainly intended for easily getting some statistics when comparing arrays. But they have unit tests for the axis arguments, because that’s where I sometimes make sloppy mistakes.
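
A hedged usage sketch of the eval_measures.rmse helper linked above (the arrays are made up):

import numpy as np
from statsmodels.tools.eval_measures import rmse

x1 = np.array([0.000, 0.166, 0.333])
x2 = np.array([0.000, 0.254, 0.998])
print(rmse(x1, x2))  # same result as the manual numpy one-liner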


Answer 6


Or by simply using only NumPy functions:

def rmse(y, y_pred):
    return np.sqrt(np.mean(np.square(y - y_pred)))

Where:

  • y is my target
  • y_pred is my prediction

Note that rmse(y, y_pred)==rmse(y_pred, y) due to the square function.


Answer 7


You can’t find an RMSE function directly in sklearn. But instead of manually taking the square root, there is another standard way using sklearn: mean_squared_error itself has a parameter called squared with a default value of True. If we set it to False, the same function will return the RMSE instead of the MSE.

# code changes implemented by Esha Prakash
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_true, y_pred , squared=False)

Answer 8


Here’s example code that calculates the RMSE between two files in the PLY polygon file format. It uses both the ml_metrics lib and np.linalg.norm:

import sys
import SimpleITK as sitk
from pyntcloud import PyntCloud as pc
import numpy as np
from ml_metrics import rmse

if len(sys.argv) < 3 or sys.argv[1] == "-h" or sys.argv[1] == "--help":
    print("Usage: compute-rmse.py <input1.ply> <input2.ply>")
    sys.exit(1)

def verify_rmse(a, b):
    n = len(a)
    return np.linalg.norm(np.array(b) - np.array(a)) / np.sqrt(n)

def compare(a, b):
    m = pc.from_file(a).points
    n = pc.from_file(b).points
    m = [ tuple(m.x), tuple(m.y), tuple(m.z) ]; m = m[0]
    n = [ tuple(n.x), tuple(n.y), tuple(n.z) ]; n = n[0]
    v1, v2 = verify_rmse(m, n), rmse(m,n)
    print(v1, v2)

compare(sys.argv[1], sys.argv[2])

Answer 9


  1. No, there is a library, Scikit Learn, for machine learning, and it can be easily employed using the Python language. It has a function for mean squared error; I am sharing the link below:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html

  2. The function is named mean_squared_error, as given below, where y_true would be the real class values for the data tuples and y_pred would be the values predicted by the machine learning algorithm you are using:

mean_squared_error(y_true, y_pred)

  3. You have to modify it to get the RMSE (by using the sqrt function in Python). This process is described in this link: https://www.codeastar.com/regression-model-rmsd/

So, the final code would be something like:

from sklearn.metrics import mean_squared_error
from math import sqrt

RMSD = sqrt(mean_squared_error(testing_y, prediction))

print(RMSD)


How to extract the decision rules from a scikit-learn decision tree?

Question: How to extract the decision rules from a scikit-learn decision tree?


Can I extract the underlying decision-rules (or ‘decision paths’) from a trained tree in a decision tree as a textual list?

Something like:

if A>0.4 then if B<0.2 then if C>0.8 then class='X'

Thanks for your help.


Answer 0


I believe that this answer is more correct than the other answers here:

from sklearn.tree import _tree

def tree_to_code(tree, feature_names):
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    print("def tree({}):".format(", ".join(feature_names)))

    def recurse(node, depth):
        indent = "  " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            print("{}if {} <= {}:".format(indent, name, threshold))
            recurse(tree_.children_left[node], depth + 1)
            print("{}else:  # if {} > {}".format(indent, name, threshold))
            recurse(tree_.children_right[node], depth + 1)
        else:
            print("{}return {}".format(indent, tree_.value[node]))

    recurse(0, 1)

This prints out a valid Python function. Here’s an example output for a tree that is trying to return its input, a number between 0 and 10.

def tree(f0):
  if f0 <= 6.0:
    if f0 <= 1.5:
      return [[ 0.]]
    else:  # if f0 > 1.5
      if f0 <= 4.5:
        if f0 <= 3.5:
          return [[ 3.]]
        else:  # if f0 > 3.5
          return [[ 4.]]
      else:  # if f0 > 4.5
        return [[ 5.]]
  else:  # if f0 > 6.0
    if f0 <= 8.5:
      if f0 <= 7.5:
        return [[ 7.]]
      else:  # if f0 > 7.5
        return [[ 8.]]
    else:  # if f0 > 8.5
      return [[ 9.]]

Here are some stumbling blocks that I see in other answers:

  1. Using tree_.threshold == -2 to decide whether a node is a leaf isn’t a good idea. What if it’s a real decision node with a threshold of -2? Instead, you should look at tree.feature or tree.children_*.
  2. The line features = [feature_names[i] for i in tree_.feature] crashes with my version of sklearn, because some values of tree.tree_.feature are -2 (specifically for leaf nodes).
  3. There is no need to have multiple if statements in the recursive function, just one is fine.
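
For reference, a minimal way to exercise tree_to_code above; the model and data are made up to mirror the 0-to-10 example:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(10, dtype=float).reshape(-1, 1)  # a single feature f0 with values 0..9
y = X.ravel()
model = DecisionTreeRegressor(max_leaf_nodes=7, random_state=0).fit(X, y)

tree_to_code(model, ["f0"])  # prints a nested if/else function like the sample output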

Answer 1


I created my own function to extract the rules from the decision trees created by sklearn:

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# dummy data:
df = pd.DataFrame({'col1':[0,1,2,3],'col2':[3,4,5,6],'dv':[0,1,0,1]})

# create decision tree
dt = DecisionTreeClassifier(max_depth=5, min_samples_leaf=1)
dt.fit(df.iloc[:, :2], df.dv)

This function first starts with the nodes (identified by -1 in the child arrays) and then recursively finds the parents. I call this a node’s ‘lineage’. Along the way, I grab the values I need to create if/then/else SAS logic:

def get_lineage(tree, feature_names):
     left      = tree.tree_.children_left
     right     = tree.tree_.children_right
     threshold = tree.tree_.threshold
     features  = [feature_names[i] for i in tree.tree_.feature]

     # get ids of child nodes
     idx = np.argwhere(left == -1)[:,0]     

     def recurse(left, right, child, lineage=None):          
          if lineage is None:
               lineage = [child]
          if child in left:
               parent = np.where(left == child)[0].item()
               split = 'l'
          else:
               parent = np.where(right == child)[0].item()
               split = 'r'

          lineage.append((parent, split, threshold[parent], features[parent]))

          if parent == 0:
               lineage.reverse()
               return lineage
          else:
               return recurse(left, right, parent, lineage)

     for child in idx:
          for node in recurse(left, right, child):
               print(node)

The sets of tuples below contain everything I need to create SAS if/then/else statements. I do not like using do blocks in SAS which is why I create logic describing a node’s entire path. The single integer after the tuples is the ID of the terminal node in a path. All of the preceding tuples combine to create that node.

In [1]: get_lineage(dt, df.columns)
(0, 'l', 0.5, 'col1')
1
(0, 'r', 0.5, 'col1')
(2, 'l', 4.5, 'col2')
3
(0, 'r', 0.5, 'col1')
(2, 'r', 4.5, 'col2')
(4, 'l', 2.5, 'col1')
5
(0, 'r', 0.5, 'col1')
(2, 'r', 4.5, 'col2')
(4, 'r', 2.5, 'col1')
6

GraphViz output of example tree


Answer 2


I modified the code submitted by Zelazny7 to print some pseudocode:

def get_code(tree, feature_names):
        left      = tree.tree_.children_left
        right     = tree.tree_.children_right
        threshold = tree.tree_.threshold
        features  = [feature_names[i] for i in tree.tree_.feature]
        value = tree.tree_.value

        def recurse(left, right, threshold, features, node):
                if (threshold[node] != -2):
                        print("if ( " + features[node] + " <= " + str(threshold[node]) + " ) {")
                        if left[node] != -1:
                                recurse (left, right, threshold, features,left[node])
                        print("} else {")
                        if right[node] != -1:
                                recurse (left, right, threshold, features,right[node])
                        print("}")
                else:
                        print("return " + str(value[node]))

        recurse(left, right, threshold, features, 0)

if you call get_code(dt, df.columns) on the same example you will obtain:

if ( col1 <= 0.5 ) {
return [[ 1.  0.]]
} else {
if ( col2 <= 4.5 ) {
return [[ 0.  1.]]
} else {
if ( col1 <= 2.5 ) {
return [[ 1.  0.]]
} else {
return [[ 0.  1.]]
}
}
}

Answer 3


Scikit learn introduced a delicious new method called export_text in version 0.21 (May 2019) to extract the rules from a tree. Documentation here. It’s no longer necessary to create a custom function.

Once you’ve fit your model, you just need two lines of code. First, import export_text:

from sklearn.tree import export_text

Second, create an object that will contain your rules. To make the rules look more readable, use the feature_names argument and pass a list of your feature names. For example, if your model is called model and your features are named in a dataframe called X_train, you could create an object called tree_rules:

tree_rules = export_text(model, feature_names=list(X_train.columns))

Then just print or save tree_rules. Your output will look like this:

|--- Age <= 0.63
|   |--- EstimatedSalary <= 0.61
|   |   |--- Age <= -0.16
|   |   |   |--- class: 0
|   |   |--- Age >  -0.16
|   |   |   |--- EstimatedSalary <= -0.06
|   |   |   |   |--- class: 0
|   |   |   |--- EstimatedSalary >  -0.06
|   |   |   |   |--- EstimatedSalary <= 0.40
|   |   |   |   |   |--- EstimatedSalary <= 0.03
|   |   |   |   |   |   |--- class: 1
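
A self-contained sketch of those two steps, using the iris dataset as stand-in data:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

tree_rules = export_text(model, feature_names=list(iris.feature_names))
print(tree_rules)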

Answer 4


There is a new DecisionTreeClassifier method, decision_path, in the 0.18.0 release. The developers provide an extensive (well-documented) walkthrough.

The first section of code in the walkthrough that prints the tree structure seems to be OK. However, I modified the code in the second section to interrogate one sample. My changes are denoted with # <--

Edit: The changes marked by # <-- in the code below have since been updated in the walkthrough link after the errors were pointed out in pull requests #8653 and #10951. It’s much easier to follow along now.
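
The snippet below assumes the walkthrough’s variables are already defined; a sketch of that setup (the dataset and model here are illustrative stand-ins):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0).fit(X_train, y_train)

feature = clf.tree_.feature
threshold = clf.tree_.threshold
node_indicator = clf.decision_path(X_test)  # sparse matrix: nodes each sample passes through
leave_id = clf.apply(X_test)                # id of the leaf reached by each sample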

sample_id = 0
node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                    node_indicator.indptr[sample_id + 1]]

print('Rules used to predict sample %s: ' % sample_id)
for node_id in node_index:

    if leave_id[sample_id] == node_id:  # <-- changed != to ==
        #continue # <-- comment out
        print("leaf node {} reached, no decision here".format(leave_id[sample_id])) # <--

    else: # < -- added else to iterate through decision nodes
        if (X_test[sample_id, feature[node_id]] <= threshold[node_id]):
            threshold_sign = "<="
        else:
            threshold_sign = ">"

        print("decision id node %s : (X[%s, %s] (= %s) %s %s)"
              % (node_id,
                 sample_id,
                 feature[node_id],
                 X_test[sample_id, feature[node_id]], # <-- changed i to sample_id
                 threshold_sign,
                 threshold[node_id]))

Rules used to predict sample 0: 
decision id node 0 : (X[0, 3] (= 2.4) > 0.800000011921)
decision id node 2 : (X[0, 2] (= 5.1) > 4.94999980927)
leaf node 4 reached, no decision here

Change the sample_id to see the decision paths for other samples. I haven’t asked the developers about these changes, just seemed more intuitive when working through the example.


Answer 5


from io import StringIO  # Python 3; on Python 2 this was: from StringIO import StringIO
out = StringIO()
tree.export_graphviz(clf, out_file=out)
print(out.getvalue())

You can see a digraph Tree. Then, clf.tree_.feature and clf.tree_.value are the array of node splitting features and the array of node values, respectively. You can refer to this github source for more details.


Answer 6


Just because everyone was so helpful I’ll just add a modification to Zelazny7 and Daniele’s beautiful solutions. This one is for python 2.7, with tabs to make it more readable:

def get_code(tree, feature_names, tabdepth=0):
    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    features  = [feature_names[i] for i in tree.tree_.feature]
    value = tree.tree_.value

    def recurse(left, right, threshold, features, node, tabdepth=0):
            if (threshold[node] != -2):
                    print '\t' * tabdepth,
                    print "if ( " + features[node] + " <= " + str(threshold[node]) + " ) {"
                    if left[node] != -1:
                            recurse (left, right, threshold, features,left[node], tabdepth+1)
                    print '\t' * tabdepth,
                    print "} else {"
                    if right[node] != -1:
                            recurse (left, right, threshold, features,right[node], tabdepth+1)
                    print '\t' * tabdepth,
                    print "}"
            else:
                    print '\t' * tabdepth,
                    print "return " + str(value[node])

    recurse(left, right, threshold, features, 0)

Answer 7


The code below is my approach under Anaconda Python 2.7, plus a package named “pydot-ng”, for making a PDF file with decision rules. I hope it is helpful.

from sklearn import tree

clf = tree.DecisionTreeClassifier(max_leaf_nodes=n)
clf_ = clf.fit(X, data_y)

feature_names = X.columns
class_name = clf_.classes_.astype(int).astype(str)

def output_pdf(clf_, name):
    from sklearn import tree
    from sklearn.externals.six import StringIO
    import pydot_ng as pydot
    dot_data = StringIO()
    tree.export_graphviz(clf_, out_file=dot_data,
                         feature_names=feature_names,
                         class_names=class_name,
                         filled=True, rounded=True,
                         special_characters=True,
                          node_ids=1,)
    graph = pydot.graph_from_dot_data(dot_data.getvalue())
    graph.write_pdf("%s.pdf"%name)

output_pdf(clf_, name='filename%s'%n)

A tree graph is shown here.


Answer 8


I’ve been going through this, but I needed the rules to be written in this format:

if A>0.4 then if B<0.2 then if C>0.8 then class='X' 

So I adapted the answer of @paulkernfeld (thanks) so that you can customize it to your needs:

def tree_to_code(tree, feature_names, Y):
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    pathto=dict()

    global k
    k = 0
    def recurse(node, depth, parent):
        global k
        indent = "  " * depth

        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            s= "{} <= {} ".format( name, threshold, node )
            if node == 0:
                pathto[node]=s
            else:
                pathto[node]=pathto[parent]+' & ' +s

            recurse(tree_.children_left[node], depth + 1, node)
            s="{} > {}".format( name, threshold)
            if node == 0:
                pathto[node]=s
            else:
                pathto[node]=pathto[parent]+' & ' +s
            recurse(tree_.children_right[node], depth + 1, node)
        else:
            k=k+1
            print(k,')',pathto[parent], tree_.value[node])
    recurse(0, 1, 0)

Answer 9


Here is a way to translate the whole tree into a single (not necessarily too human-readable) python expression using the SKompiler library:

from skompiler import skompile
skompile(dtree.predict).to('python/code')

Answer 10


This builds on @paulkernfeld’s answer. If you have a dataframe X with your features and a target dataframe y with your responses, and you want to get an idea which y value ended up in which node (and also want to plot it accordingly), you can do the following:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    def tree_to_code(tree, feature_names):
        from sklearn.tree import _tree
        codelines = []
        codelines.append('def get_cat(X_tmp):\n')
        codelines.append('   catout = []\n')
        codelines.append('   for codelines in range(0,X_tmp.shape[0]):\n')
        codelines.append('      Xin = X_tmp.iloc[codelines]\n')
        tree_ = tree.tree_
        feature_name = [
            feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
            for i in tree_.feature
        ]
        #print "def tree({}):".format(", ".join(feature_names))

        def recurse(node, depth):
            indent = "      " * depth
            if tree_.feature[node] != _tree.TREE_UNDEFINED:
                name = feature_name[node]
                threshold = tree_.threshold[node]
                codelines.append ('{}if Xin["{}"] <= {}:\n'.format(indent, name, threshold))
                recurse(tree_.children_left[node], depth + 1)
                codelines.append( '{}else:  # if Xin["{}"] > {}\n'.format(indent, name, threshold))
                recurse(tree_.children_right[node], depth + 1)
            else:
                codelines.append( '{}mycat = {}\n'.format(indent, node))

        recurse(0, 1)
        codelines.append('      catout.append(mycat)\n')
        codelines.append('   return pd.DataFrame(catout,index=X_tmp.index,columns=["category"])\n')
        codelines.append('node_ids = get_cat(X)\n')
        return codelines
    mycode = tree_to_code(clf,X.columns.values)

    # now execute the function and obtain the dataframe with all nodes
    exec(''.join(mycode))
    node_ids = [int(x[0]) for x in node_ids.values]
    node_ids2 = pd.DataFrame(node_ids)

    print('make plot')
    import matplotlib.cm as cm
    colors = cm.rainbow(np.linspace(0, 1, 1+max( list(set(node_ids)))))
    #plt.figure(figsize=cm2inch(24, 21))
    for i in list(set(node_ids)):
        plt.plot(y[node_ids2.values==i],'o',color=colors[i], label=str(i))  
    mytitle = 'y colored by node'
    plt.title(mytitle, fontsize=14)
    plt.xlabel('my xlabel')
    plt.ylabel('my ylabel')  # the original used an undefined `tagname` here
    plt.xticks(rotation=70)       
    plt.legend(loc='upper center', bbox_to_anchor=(0.5, 1.00), shadow=True, ncol=9)
    plt.tight_layout()
    plt.show()
    plt.close()

Not the most elegant version, but it does the job…
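
As a side note, scikit-learn can give you the leaf index of every sample directly via the estimator's apply method, which avoids the string-building and exec round-trip above (a minimal sketch, assuming a fitted clf and a feature DataFrame X):

import pandas as pd

# apply() returns, for each row of X, the id of the leaf node it ends up in,
# which is the same node id that get_cat() above collects.
node_ids = pd.DataFrame(clf.apply(X), index=X.index, columns=["category"])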


Answer 11

This is the code you need

I have modified the top-voted code to indent correctly in a Jupyter notebook running Python 3.

import numpy as np
from sklearn.tree import _tree

def tree_to_code(tree, feature_names):
    tree_ = tree.tree_
    feature_name = [feature_names[i] 
                    if i != _tree.TREE_UNDEFINED else "undefined!" 
                    for i in tree_.feature]
    print("def tree({}):".format(", ".join(feature_names)))

    def recurse(node, depth):
        indent = "    " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            print("{}if {} <= {}:".format(indent, name, threshold))
            recurse(tree_.children_left[node], depth + 1)
            print("{}else:  # if {} > {}".format(indent, name, threshold))
            recurse(tree_.children_right[node], depth + 1)
        else:
            print("{}return {}".format(indent, np.argmax(tree_.value[node])))

    recurse(0, 1)

Answer 12

Here is a function that prints the rules of a scikit-learn decision tree under Python 3, with offsets for the conditional blocks to make the structure more readable:

def print_decision_tree(tree, feature_names=None, offset_unit='    '):
    '''Plots textual representation of rules of a decision tree
    tree: scikit-learn representation of tree
    feature_names: list of feature names. They are set to f1,f2,f3,... if not specified
    offset_unit: a string of offset of the conditional block'''

    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    value = tree.tree_.value
    if feature_names is None:
        features  = ['f%d'%i for i in tree.tree_.feature]
    else:
        features  = [feature_names[i] for i in tree.tree_.feature]        

    def recurse(left, right, threshold, features, node, depth=0):
            offset = offset_unit*depth
            if (threshold[node] != -2):  # threshold is -2 (_tree.TREE_UNDEFINED) at leaves
                    print(offset+"if ( " + features[node] + " <= " + str(threshold[node]) + " ) {")
                    if left[node] != -1:
                            recurse (left, right, threshold, features,left[node],depth+1)
                    print(offset+"} else {")
                    if right[node] != -1:
                            recurse (left, right, threshold, features,right[node],depth+1)
                    print(offset+"}")
            else:
                    print(offset+"return " + str(value[node]))

    recurse(left, right, threshold, features, 0,0)

Answer 13

You can also make it more informative by distinguishing it to which class it belongs or even by mentioning its output value.

def print_decision_tree(tree, feature_names, offset_unit='    '):
    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    value     = tree.tree_.value
    if feature_names is None:
        features  = ['f%d'%i for i in tree.tree_.feature]
    else:
        features  = [feature_names[i] for i in tree.tree_.feature]

    def recurse(left, right, threshold, features, node, depth=0):
        offset = offset_unit*depth
        if (threshold[node] != -2):  # threshold is -2 (_tree.TREE_UNDEFINED) at leaves
            print(offset+"if ( " + features[node] + " <= " + str(threshold[node]) + " ) {")
            if left[node] != -1:
                recurse(left, right, threshold, features, left[node], depth+1)
            print(offset+"} else {")
            if right[node] != -1:
                recurse(left, right, threshold, features, right[node], depth+1)
            print(offset+"}")
        else:
            #print(offset,value[node])

            # Parse the class counts out of str(value[node]): the first half of
            # the string holds the first class count ("no"), the second half the
            # second class count ("yes").
            temp = str(value[node])
            mid = len(temp)//2
            tempx = []
            tempy = []
            cnt = 0
            for i in temp:
                if cnt <= mid:
                    tempx.append(i)
                    cnt += 1
                else:
                    tempy.append(i)
                    cnt += 1
            val_yes = []
            val_no = []
            res = []
            for j in tempx:
                if j == "[" or j == "]" or j == "." or j == " ":
                    res.append(j)
                else:
                    val_no.append(j)
            for j in tempy:
                if j == "[" or j == "]" or j == "." or j == " ":
                    res.append(j)
                else:
                    val_yes.append(j)
            val_yes = int("".join(map(str, val_yes)))
            val_no = int("".join(map(str, val_no)))

            if val_yes > val_no:
                print(offset, '\033[1m', "YES")
                print('\033[0m')
            elif val_no > val_yes:
                print(offset, '\033[1m', "NO")
                print('\033[0m')
            else:
                print(offset, '\033[1m', "Tie")
                print('\033[0m')

    recurse(left, right, threshold, features, 0, 0)

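As a side note, the character-by-character parsing of str(value[node]) above can be replaced by reading the class counts directly; a sketch of just the else branch, assuming a two-class tree (value, node and offset as in the function above):

# value[node] has shape (1, n_classes); value[node][0] holds the class counts
val_no, val_yes = value[node][0]
if val_yes > val_no:
    print(offset, '\033[1m', "YES")
elif val_no > val_yes:
    print(offset, '\033[1m', "NO")
else:
    print(offset, '\033[1m', "Tie")
print('\033[0m')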


Answer 14

Here is my approach to extract the decision rules in a form that can be used directly in SQL, so the data can be grouped by node. (Based on the approaches of previous posters.)

The result will be a sequence of CASE clauses that can be copied into an SQL statement, e.g.:

SELECT COALESCE(
    CASE WHEN <conditions> THEN <NodeA> END,
    CASE WHEN <conditions> THEN <NodeB> END,
    ...
) AS NodeName
FROM <table or view>


import numpy as np

import pickle
feature_names=.............
features  = [feature_names[i] for i in range(len(feature_names))]
clf= pickle.loads(trained_model)
impurity=clf.tree_.impurity
importances = clf.feature_importances_
SqlOut=""

#global Conts
global ContsNode
global Path
#Conts=[]#
ContsNode=[]
Path=[]
global Results
Results=[]

def print_decision_tree(tree, feature_names, offset_unit='    '):    
    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    value = tree.tree_.value

    if feature_names is None:
        features  = ['f%d'%i for i in tree.tree_.feature]
    else:
        features  = [feature_names[i] for i in tree.tree_.feature]        

    def recurse(left, right, threshold, features, node, depth=0,ParentNode=0,IsElse=0):
        global Conts
        global ContsNode
        global Path
        global Results
        global LeftParents
        LeftParents=[]
        global RightParents
        RightParents=[]
        for i in range(len(left)): # This is just to tell you how to create a list.
            LeftParents.append(-1)
            RightParents.append(-1)
            ContsNode.append("")
            Path.append("")


        for i in range(len(left)): # i is node
            if (left[i]==-1 and right[i]==-1):      
                if LeftParents[i]>=0:
                    if Path[LeftParents[i]]>" ":
                        Path[i]=Path[LeftParents[i]]+" AND " +ContsNode[LeftParents[i]]                                 
                    else:
                        Path[i]=ContsNode[LeftParents[i]]                                   
                if RightParents[i]>=0:
                    if Path[RightParents[i]]>" ":
                        Path[i]=Path[RightParents[i]]+" AND not " +ContsNode[RightParents[i]]                                   
                    else:
                        Path[i]=" not " +ContsNode[RightParents[i]]                     
                Results.append(" case when  " +Path[i]+"  then ''" +"{:4d}".format(i)+ " "+"{:2.2f}".format(impurity[i])+" "+Path[i][0:180]+"''")

            else:       
                if LeftParents[i]>=0:
                    if Path[LeftParents[i]]>" ":
                        Path[i]=Path[LeftParents[i]]+" AND " +ContsNode[LeftParents[i]]                                 
                    else:
                        Path[i]=ContsNode[LeftParents[i]]                                   
                if RightParents[i]>=0:
                    if Path[RightParents[i]]>" ":
                        Path[i]=Path[RightParents[i]]+" AND not " +ContsNode[RightParents[i]]                                   
                    else:
                        Path[i]=" not "+ContsNode[RightParents[i]]                      
                if (left[i]!=-1):
                    LeftParents[left[i]]=i
                if (right[i]!=-1):
                    RightParents[right[i]]=i
                ContsNode[i]=   "( "+ features[i] + " <= " + str(threshold[i])   + " ) "

    recurse(left, right, threshold, features, 0,0,0,0)
print_decision_tree(clf,features)
SqlOut=""
for i in range(len(Results)): 
    SqlOut=SqlOut+Results[i]+ " end,"+chr(13)+chr(10)

Answer 15

Now you can use export_text.

from sklearn.tree import export_text

r = export_text(loan_tree, feature_names=(list(X_train.columns)))
print(r)

A complete example from the sklearn documentation:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text
iris = load_iris()
X = iris['data']
y = iris['target']
decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
decision_tree = decision_tree.fit(X, y)
r = export_text(decision_tree, feature_names=iris['feature_names'])
print(r)

Answer 16

Modified Zelazny7’s code to fetch SQL from the decision tree.

# SQL from decision tree
import numpy as np

def get_lineage(tree, feature_names):
     left      = tree.tree_.children_left
     right     = tree.tree_.children_right
     threshold = tree.tree_.threshold
     features  = [feature_names[i] for i in tree.tree_.feature]
     le='<='               
     g ='>'
     # get ids of child nodes
     idx = np.argwhere(left == -1)[:,0]     

     def recurse(left, right, child, lineage=None):          
          if lineage is None:
               lineage = [child]
          if child in left:
               parent = np.where(left == child)[0].item()
               split = 'l'
          else:
               parent = np.where(right == child)[0].item()
               split = 'r'
          lineage.append((parent, split, threshold[parent], features[parent]))
          if parent == 0:
               lineage.reverse()
               return lineage
          else:
               return recurse(left, right, parent, lineage)
     print('case ')
     for j,child in enumerate(idx):
        clause=' when '
        for node in recurse(left, right, child):
            if len(str(node))<3:
                continue
            i=node
            if i[1]=='l':  sign=le 
            else: sign=g
            clause=clause+i[3]+sign+str(i[2])+' and '
        clause=clause[:-4]+' then '+str(j)
        print(clause)
     print('else 99 end as clusters')

Answer 17

Apparently a long time ago somebody already decided to try to add the following function to the official scikit’s tree export functions (which at the time basically only supported export_graphviz):

def export_dict(tree, feature_names=None, max_depth=None):
    """Export a decision tree in dict format."""

Here is his full commit:

https://github.com/scikit-learn/scikit-learn/blob/79bdc8f711d0af225ed6be9fdb708cea9f98a910/sklearn/tree/export.py

Not exactly sure what happened to this comment. But you could also try to use that function.

I think this warrants a serious documentation request to the good people of scikit-learn to properly document the sklearn.tree.Tree API which is the underlying tree structure that DecisionTreeClassifier exposes as its attribute tree_.
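
In the meantime, a minimal dict export is easy to sketch yourself from the tree_ attributes (tree_to_dict is a hypothetical helper, not the function from that commit):

from sklearn.tree import _tree

def tree_to_dict(tree, feature_names):
    tree_ = tree.tree_

    def recurse(node):
        if tree_.feature[node] == _tree.TREE_UNDEFINED:  # leaf node
            return {"value": tree_.value[node].tolist()}
        return {
            "feature": feature_names[tree_.feature[node]],
            "threshold": float(tree_.threshold[node]),
            "left": recurse(tree_.children_left[node]),
            "right": recurse(tree_.children_right[node]),
        }

    return recurse(0)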


Answer 18

Just use the function from sklearn.tree like this:

from sklearn.tree import export_graphviz

export_graphviz(tree,
                out_file="tree.dot",
                feature_names=X.columns)  # columns of your training DataFrame, or just ["petal length", "petal width"]

Then look in your project folder for the file tree.dot, copy ALL of its content, paste it at http://www.webgraphviz.com/, and generate your graph :)
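
If you would rather render locally than paste into a website, the graphviz Python package can read the same tree.dot file (assuming both the package and the Graphviz binaries are installed):

import graphviz

with open("tree.dot") as f:
    dot_graph = graphviz.Source(f.read())
dot_graph.render("tree", format="png", cleanup=True)  # writes tree.png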


Answer 19

Thanks for the wonderful solution of @paulkernfeld. On top of his solution, for all those who want to have a serialized version of trees, just use tree.threshold, tree.children_left, tree.children_right, tree.feature and tree.value. Since the leaves don’t have splits and hence no feature names and children, their placeholders in tree.feature and tree.children_*** are _tree.TREE_UNDEFINED and _tree.TREE_LEAF. Every split is assigned a unique index by depth-first search.
Notice that tree.value is of shape [n, 1, 1]; a sketch of this serialization idea follows below.
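
A minimal sketch of that serialization, assuming a fitted classifier clf (predict_one is a hypothetical helper, not part of the sklearn API):

import numpy as np

t = clf.tree_
np.savez("tree_arrays.npz",
         children_left=t.children_left,    # -1 (_tree.TREE_LEAF) at leaves
         children_right=t.children_right,
         feature=t.feature,                # -2 (_tree.TREE_UNDEFINED) at leaves
         threshold=t.threshold,
         value=t.value)

a = np.load("tree_arrays.npz")

def predict_one(x):
    node = 0
    while a["children_left"][node] != -1:  # walk down until a leaf is reached
        if x[a["feature"][node]] <= a["threshold"][node]:
            node = a["children_left"][node]
        else:
            node = a["children_right"][node]
    return int(np.argmax(a["value"][node]))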


Answer 20

Here is a function that generates Python code from a decision tree by converting the output of export_text:

import string
from sklearn.tree import export_text

def export_py_code(tree, feature_names, max_depth=100, spacing=4):
    if spacing < 2:
        raise ValueError('spacing must be > 1')

    # Clean up feature names (for correctness)
    nums = string.digits
    alnums = string.ascii_letters + nums
    clean = lambda s: ''.join(c if c in alnums else '_' for c in s)
    features = [clean(x) for x in feature_names]
    features = ['_'+x if x[0] in nums else x for x in features if x]
    if len(set(features)) != len(feature_names):
        raise ValueError('invalid feature names')

    # First: export tree to text
    res = export_text(tree, feature_names=features, 
                        max_depth=max_depth,
                        decimals=6,
                        spacing=spacing-1)

    # Second: generate Python code from the text
    skip, dash = ' '*spacing, '-'*(spacing-1)
    code = 'def decision_tree({}):\n'.format(', '.join(features))
    for line in repr(tree).split('\n'):
        code += skip + "# " + line + '\n'
    for line in res.split('\n'):
        line = line.rstrip().replace('|',' ')
        if '<' in line or '>' in line:
            line, val = line.rsplit(maxsplit=1)
            line = line.replace(' ' + dash, 'if')
            line = '{} {:g}:'.format(line, float(val))
        else:
            line = line.replace(' {} class:'.format(dash), 'return')
        code += skip + line + '\n'

    return code

Sample usage:

res = export_py_code(tree, feature_names=names, spacing=4)
print (res)

Sample output:

def decision_tree(f1, f2, f3):
    # DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
    #                        max_features=None, max_leaf_nodes=None,
    #                        min_impurity_decrease=0.0, min_impurity_split=None,
    #                        min_samples_leaf=1, min_samples_split=2,
    #                        min_weight_fraction_leaf=0.0, presort=False,
    #                        random_state=42, splitter='best')
    if f1 <= 12.5:
        if f2 <= 17.5:
            if f1 <= 10.5:
                return 2
            if f1 > 10.5:
                return 3
        if f2 > 17.5:
            if f2 <= 22.5:
                return 1
            if f2 > 22.5:
                return 1
    if f1 > 12.5:
        if f1 <= 17.5:
            if f3 <= 23.5:
                return 2
            if f3 > 23.5:
                return 3
        if f1 > 17.5:
            if f1 <= 25:
                return 1
            if f1 > 25:
                return 2

The above example is generated with names = ['f'+str(j+1) for j in range(NUM_FEATURES)].

One handy feature is that it can generate a smaller file with reduced spacing; just set spacing=2.


Find the p-value (significance) in scikit-learn LinearRegression

Question: Find the p-value (significance) in scikit-learn LinearRegression

How can I find the p-value (significance) of each coefficient?

lm = sklearn.linear_model.LinearRegression()
lm.fit(x,y)

Answer 0

This is kind of overkill, but let’s give it a go. First let’s use statsmodels to find out what the p-values should be:

import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from scipy import stats

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

and we get

                         OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.518
Model:                            OLS   Adj. R-squared:                  0.507
Method:                 Least Squares   F-statistic:                     46.27
Date:                Wed, 08 Mar 2017   Prob (F-statistic):           3.83e-62
Time:                        10:08:24   Log-Likelihood:                -2386.0
No. Observations:                 442   AIC:                             4794.
Df Residuals:                     431   BIC:                             4839.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        152.1335      2.576     59.061      0.000     147.071     157.196
x1           -10.0122     59.749     -0.168      0.867    -127.448     107.424
x2          -239.8191     61.222     -3.917      0.000    -360.151    -119.488
x3           519.8398     66.534      7.813      0.000     389.069     650.610
x4           324.3904     65.422      4.958      0.000     195.805     452.976
x5          -792.1842    416.684     -1.901      0.058   -1611.169      26.801
x6           476.7458    339.035      1.406      0.160    -189.621    1143.113
x7           101.0446    212.533      0.475      0.635    -316.685     518.774
x8           177.0642    161.476      1.097      0.273    -140.313     494.442
x9           751.2793    171.902      4.370      0.000     413.409    1089.150
x10           67.6254     65.984      1.025      0.306     -62.065     197.316
==============================================================================
Omnibus:                        1.506   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.471   Jarque-Bera (JB):                1.404
Skew:                           0.017   Prob(JB):                        0.496
Kurtosis:                       2.726   Cond. No.                         227.
==============================================================================

Ok, let’s reproduce this. It is kind of overkill, as we are almost reproducing a linear regression analysis using matrix algebra. But what the heck.

lm = LinearRegression()
lm.fit(X,y)
params = np.append(lm.intercept_,lm.coef_)
predictions = lm.predict(X)

newX = pd.DataFrame({"Constant":np.ones(len(X))}).join(pd.DataFrame(X))
MSE = (sum((y-predictions)**2))/(len(newX)-len(newX.columns))

# Note if you don't want to use a DataFrame replace the two lines above with
# newX = np.append(np.ones((len(X),1)), X, axis=1)
# MSE = (sum((y-predictions)**2))/(len(newX)-len(newX[0]))

var_b = MSE*(np.linalg.inv(np.dot(newX.T,newX)).diagonal())
sd_b = np.sqrt(var_b)
ts_b = params/ sd_b

p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-len(newX.columns)))) for i in ts_b]  # df = n - p; len(newX[0]) only works for the ndarray variant

sd_b = np.round(sd_b,3)
ts_b = np.round(ts_b,3)
p_values = np.round(p_values,3)
params = np.round(params,4)

myDF3 = pd.DataFrame()
myDF3["Coefficients"],myDF3["Standard Errors"],myDF3["t values"],myDF3["Probabilities"] = [params,sd_b,ts_b,p_values]
print(myDF3)

And this gives us:

    Coefficients  Standard Errors  t values  Probabilities
0       152.1335            2.576    59.061         0.000
1       -10.0122           59.749    -0.168         0.867
2      -239.8191           61.222    -3.917         0.000
3       519.8398           66.534     7.813         0.000
4       324.3904           65.422     4.958         0.000
5      -792.1842          416.684    -1.901         0.058
6       476.7458          339.035     1.406         0.160
7       101.0446          212.533     0.475         0.635
8       177.0642          161.476     1.097         0.273
9       751.2793          171.902     4.370         0.000
10       67.6254           65.984     1.025         0.306

So we can reproduce the values from statsmodels.


Answer 1

scikit-learn’s LinearRegression doesn’t calculate this information but you can easily extend the class to do it:

from sklearn import linear_model
from scipy import stats
import numpy as np


class LinearRegression(linear_model.LinearRegression):
    """
    LinearRegression class after sklearn's, but calculate t-statistics
    and p-values for model coefficients (betas).
    Additional attributes available after .fit()
    are `t` and `p` which are of the shape (y.shape[1], X.shape[1])
    which is (n_features, n_coefs)
    This class sets the intercept to 0 by default, since usually we include it
    in X.
    """

    def __init__(self, *args, **kwargs):
        if not "fit_intercept" in kwargs:
            kwargs['fit_intercept'] = False
        super(LinearRegression, self)\
                .__init__(*args, **kwargs)

    def fit(self, X, y, n_jobs=1):
        self = super(LinearRegression, self).fit(X, y, n_jobs)

        sse = np.sum((self.predict(X) - y) ** 2, axis=0) / float(X.shape[0] - X.shape[1])
        se = np.array([
            np.sqrt(np.diagonal(sse[i] * np.linalg.inv(np.dot(X.T, X))))
                                                    for i in range(sse.shape[0])
                    ])

        self.t = self.coef_ / se
        self.p = 2 * (1 - stats.t.cdf(np.abs(self.t), y.shape[0] - X.shape[1]))
        return self

Stolen from here.

You should take a look at statsmodels for this kind of statistical analysis in Python.
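
A minimal usage sketch under this class’s assumptions: X already includes a constant column, and y is 2-D so that the per-output sse bookkeeping works (with a 1-D y, sse becomes a scalar; see the criticism in Answer 3 below):

import numpy as np

rng = np.random.RandomState(0)
X = np.hstack([np.ones((100, 1)), rng.randn(100, 2)])  # constant column included
y = (X @ np.array([1.0, 2.0, 0.0]) + rng.randn(100)).reshape(-1, 1)

model = LinearRegression().fit(X, y)
print(model.t)  # t-statistics, shape (1, 3)
print(model.p)  # two-sided p-values, shape (1, 3)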


Answer 2

EDIT: Probably not the right way to do it, see comments

You could use sklearn.feature_selection.f_regression.

click here for the scikit-learn page
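
For illustration, a minimal sketch of that call on the diabetes data used above; keep in mind that f_regression runs a univariate test per feature, which is one reason it does not reproduce the multivariate OLS p-values:

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import f_regression

X, y = load_diabetes(return_X_y=True)
F, p_values = f_regression(X, y)  # one F-statistic and one p-value per feature
print(p_values)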


Answer 3

The code in elyase’s answer https://stackoverflow.com/a/27928411/4240413 does not actually work: notice that sse is a scalar, and it then tries to iterate through it. The following code is a modified version. Not amazingly clean, but I think it works more or less.

import numpy as np
import scipy as sc
import scipy.linalg
from scipy.stats import t
from sklearn import linear_model

class LinearRegression(linear_model.LinearRegression):

    def __init__(self,*args,**kwargs):
        # *args is the list of arguments that might go into the LinearRegression object
        # that we don't know about and don't want to have to deal with. Similarly, **kwargs
        # is a dictionary of key words and values that might also need to go into the original
        # LinearRegression object. We put *args and **kwargs so that we don't have to look
        # these up and write them down explicitly here. Nice and easy.

        if not "fit_intercept" in kwargs:
            kwargs['fit_intercept'] = False

        super(LinearRegression,self).__init__(*args,**kwargs)

    # Adding in t-statistics for the coefficients.
    def fit(self,x,y):
        # This takes in numpy arrays (not matrices). Also assumes you are leaving out the column
        # of constants.

        # Not totally sure what 'super' does here and why you redefine self...
        self = super(LinearRegression, self).fit(x,y)
        n, k = x.shape
        yHat = np.matrix(self.predict(x)).T

        # Change X and Y into numpy matrices. x also has a column of ones added to it.
        x = np.hstack((np.ones((n,1)),np.matrix(x)))
        y = np.matrix(y).T

        # Degrees of freedom.
        df = float(n-k-1)

        # Sample variance.     
        sse = np.sum(np.square(yHat - y),axis=0)
        self.sampleVariance = sse/df

        # Sample variance for x.
        self.sampleVarianceX = x.T*x

        # Covariance Matrix = [(s^2)(X'X)^-1]^0.5. (sqrtm = matrix square root.  ugly)
        self.covarianceMatrix = sc.linalg.sqrtm(self.sampleVariance[0,0]*self.sampleVarianceX.I)

        # Standard errors for the different coefficients: the diagonal elements of the covariance matrix.
        self.se = self.covarianceMatrix.diagonal()[1:]

        # T statistic for each beta.
        self.betasTStat = np.zeros(len(self.se))
        for i in range(len(self.se)):
            self.betasTStat[i] = self.coef_[0,i]/self.se[i]

        # P-value for each beta. This is a two sided t-test, since the betas can be 
        # positive or negative.
        self.betasPValue = 1 - t.cdf(abs(self.betasTStat),df)

Answer 4

An easy way to pull out the p-values is to use a statsmodels regression:

import statsmodels.api as sm
mod = sm.OLS(Y,X)
fii = mod.fit()
p_values = fii.summary2().tables[1]['P>|t|']

You get a series of p-values that you can manipulate (for example, choosing which predictors to keep by evaluating each p-value).



Answer 5

The p-value is among the F statistics. If you want to get the value, simply use these few lines of code (note that f_pvalue is the p-value of the overall F-test, not of the individual coefficients):

import statsmodels.api as sm
from sklearn import datasets

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
print(est.fit().f_pvalue)
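
If you want per-coefficient p-values rather than the overall F-test, the fitted statsmodels results object also exposes them through its pvalues attribute:

fit = est.fit()
print(fit.f_pvalue)  # p-value of the overall F-test
print(fit.pvalues)   # one p-value per column of X2, including the constant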

Answer 6

There could be a mistake in @JARH‘s answer in the case of a multivariable regression. (I do not have enough reputation to comment.)

In the following line:

p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-1))) for i in ts_b],

the t-values follow a t-distribution with len(newX)-1 degrees of freedom instead of a t-distribution with len(newX)-len(newX.columns)-1 degrees of freedom.

So this should be:

p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-len(newX.columns)-1))) for i in ts_b]

(See t-values for OLS regression for more details)


Answer 7

You can use scipy for the p-value; this code is from the scipy documentation. Note that stats.linregress only handles simple (single-predictor) regression.

>>> from scipy import stats
>>> import numpy as np
>>> x = np.random.random(10)
>>> y = np.random.random(10)
>>> slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)

Answer 8

For a one-liner you can use the pingouin.linear_regression function (disclaimer: I am the creator of Pingouin), which works with uni/multi-variate regression using NumPy arrays or Pandas DataFrame, e.g:

import pingouin as pg
# Using a Pandas DataFrame `df`:
lm = pg.linear_regression(df[['x', 'z']], df['y'])
# Using a NumPy array:
lm = pg.linear_regression(X, y)

The output is a dataframe with the beta coefficients, standard errors, T-values, p-values and confidence intervals for each predictor, as well as the R^2 and adjusted R^2 of the fit.


RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility

Question: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility

I get this error when trying to load a saved SVM model. I have tried uninstalling sklearn, NumPy and SciPy, and reinstalling the latest versions of all of them (using pip). I am still getting this error. Why?

In [1]: import sklearn; print sklearn.__version__
0.18.1
In [3]: import numpy; print numpy.__version__
1.11.2
In [5]: import scipy; print scipy.__version__
0.18.1
In [7]: import pandas; print pandas.__version__
0.19.1

In [10]: clf = joblib.load('model/trained_model.pkl')
---------------------------------------------------------------------------
RuntimeWarning                            Traceback (most recent call last)
<ipython-input-10-5e5db1331757> in <module>()
----> 1 clf = joblib.load('sentiment_classification/model/trained_model.pkl')

/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.pyc in load(filename, mmap_mode)
    573                     return load_compatibility(fobj)
    574
--> 575                 obj = _unpickle(fobj, filename, mmap_mode)
    576
    577     return obj

/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.pyc in _unpickle(fobj, filename, mmap_mode)
    505     obj = None
    506     try:
--> 507         obj = unpickler.load()
    508         if unpickler.compat_mode:
    509             warnings.warn("The file '%s' has been generated with a "

/usr/lib/python2.7/pickle.pyc in load(self)
    862             while 1:
    863                 key = read(1)
--> 864                 dispatch[key](self)
    865         except _Stop, stopinst:
    866             return stopinst.value

/usr/lib/python2.7/pickle.pyc in load_global(self)
   1094         module = self.readline()[:-1]
   1095         name = self.readline()[:-1]
-> 1096         klass = self.find_class(module, name)
   1097         self.append(klass)
   1098     dispatch[GLOBAL] = load_global

/usr/lib/python2.7/pickle.pyc in find_class(self, module, name)
   1128     def find_class(self, module, name):
   1129         # Subclasses may override this
-> 1130         __import__(module)
   1131         mod = sys.modules[module]
   1132         klass = getattr(mod, name)

/usr/local/lib/python2.7/dist-packages/sklearn/svm/__init__.py in <module>()
     11 # License: BSD 3 clause (C) INRIA 2010
     12
---> 13 from .classes import SVC, NuSVC, SVR, NuSVR, OneClassSVM, LinearSVC, \
     14         LinearSVR
     15 from .bounds import l1_min_c

/usr/local/lib/python2.7/dist-packages/sklearn/svm/classes.py in <module>()
      2 import numpy as np
      3
----> 4 from .base import _fit_liblinear, BaseSVC, BaseLibSVM
      5 from ..base import BaseEstimator, RegressorMixin
      6 from ..linear_model.base import LinearClassifierMixin, SparseCoefMixin, \

/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.py in <module>()
      6 from abc import ABCMeta, abstractmethod
      7
----> 8 from . import libsvm, liblinear
      9 from . import libsvm_sparse
     10 from ..base import BaseEstimator, ClassifierMixin

__init__.pxd in init sklearn.svm.libsvm (sklearn/svm/libsvm.c:10207)()

RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 80

UPDATE: OK, by following here, and

pip uninstall -y scipy scikit-learn
pip install --no-binary scipy scikit-learn

The error has now gone, though I still have no idea why it occurred in the first place…


回答 0

根据 MAINT: silence Cython warnings about changes dtype/ufunc size. – numpy/numpy:

每当你导入的scipy(或其他软件包)是针对比当前已安装版本更旧的numpy编译的,就会看到这些警告。

这些检查由Cython插入(因此存在于任何用它编译的模块中)。

长话短说,对numpy而言,这些警告在这种特定情况下应该是良性的,而且从numpy 1.8(该提交所进入的分支)开始,这些消息就会被过滤掉。而scikit-learn 0.18.1是针对numpy 1.6.1编译的。

要自行过滤这些警告,你可以执行与该补丁相同的操作:

import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

当然,如果你具备相应的工具,也可以改用pip install --no-binary :all:¹,针对本地的numpy从源代码重新编译所有受影响的模块。


更长的故事:该补丁的支持者声称,对numpy来说应该没有任何风险,而且第三方软件包是有意针对较旧版本构建的:

[针对当前的numpy重建所有内容]不是可行的解决方案,当然也没有必要。Scipy(与许多其他软件包一样)与numpy的许多版本兼容。因此,当我们分发scipy二进制文件时,我们将根据支持的最低numpy版本(截至目前为1.5.1)构建它们,并且它们也可与1.6.x,1.7.x和numpy master一起使用。

真正正确的做法是,Cython只在dtype/ufunc的大小变化破坏了ABI时才发出警告,否则保持沉默。

结果,Cython的开发人员同意信任numpy团队手工维护二进制兼容性。因此我们可以预期,使用带有破坏性ABI更改的版本时,会得到一个专门构造的异常或其他明确的报错。


¹ 自pip 10.0.0起,先前可用的--no-use-wheel选项已被删除。

According to MAINT: silence Cython warnings about changes dtype/ufunc size. – numpy/numpy:

These warnings are visible whenever you import scipy (or another package) that was compiled against an older numpy than is installed.

and the checks are inserted by Cython (hence are present in any module compiled with it).

Long story short, these warnings should be benign in the particular case of numpy, and these messages are filtered out since numpy 1.8 (the branch this commit went onto). While scikit-learn 0.18.1 is compiled against numpy 1.6.1.

To filter these warnings yourself, you can do the same as the patch does:

import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

Of course, you can just recompile all affected modules from source against your local numpy with pip install --no-binary :all:¹ instead, if you have the tools for that.


Longer story: the patch’s proponent claims there should be no risk specifically with numpy, and 3rd-party packages are intentionally built against older versions:

[Rebuilding everything against current numpy is] not a feasible solution, and certainly shouldn’t be necessary. Scipy (as many other packages) is compatible with a number of versions of numpy. So when we distribute scipy binaries, we build them against the lowest supported numpy version (1.5.1 as of now) and they work with 1.6.x, 1.7.x and numpy master as well.

The real correct fix would be for Cython to only issue warnings when the size of dtypes/ufuncs has changed in a way that breaks the ABI, and to be silent otherwise.

As a result, Cython’s devs agreed to trust the numpy team with maintaining binary compatibility by hand, so we can probably expect that using versions with breaking ABI changes would yield a specially-crafted exception or some other explicit show-stopper.


¹The previously available --no-use-wheel option has been removed since pip 10.0.0.
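
To make the ordering explicit: the filterwarnings calls must run before the import that actually triggers the Cython check, because the warning is raised at import time. A minimal sketch (importing sklearn.svm here is just an example of a compiled module):

import warnings

# install the filters first; the warning fires when the compiled
# extension module is imported, so order matters
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import sklearn.svm  # would otherwise emit the RuntimeWarning on import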


回答 1

这是新的numpy版本(1.15.0)的问题

您可以降级numpy,此问题将得到解决:

sudo pip uninstall numpy
sudo pip install numpy==1.14.5

最终,numpy 1.15.1版本发布了,警告问题得以解决。

sudo pip install numpy==1.15.1

这样就可以正常工作了。

It’s the issue of new numpy version (1.15.0)

You can downgrade numpy and this problem will be fixed:

sudo pip uninstall numpy
sudo pip install numpy==1.14.5

Finally, numpy version 1.15.1 has been released, so the warning issue is fixed.

sudo pip install numpy==1.15.1

This is working..
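
A small, hedged sketch for verifying the fix took effect (the 1.15.1 floor comes from the answer above; LooseVersion avoids naive string comparison):

import numpy
from distutils.version import LooseVersion

# assumption: you just want to confirm the running numpy is at least 1.15.1
assert LooseVersion(numpy.__version__) >= LooseVersion("1.15.1"), numpy.__version__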


回答 2

如果您在anaconda环境中,请使用:

conda update --all

if you are in an anaconda environment use:

conda update --all

回答 3

我已经尝试了上述方法,但是没有任何效果。但是在我通过apt install安装库之后,问题就消失了,

对于Python3,

pip3 uninstall -y numpy scipy pandas scikit-learn
sudo apt update
sudo apt install python3-numpy python3-scipy python3-pandas python3-sklearn 

对于Python2,

pip uninstall -y numpy scipy pandas scikit-learn
sudo apt update
sudo apt install python-numpy python-scipy python-pandas python-sklearn 

希望有帮助。

I’ve tried the above-mentioned ways, but nothing worked. But the issue was gone after I installed the libraries through apt install,

For Python3,

pip3 uninstall -y numpy scipy pandas scikit-learn
sudo apt update
sudo apt install python3-numpy python3-scipy python3-pandas python3-sklearn 

For Python2,

pip uninstall -y numpy scipy pandas scikit-learn
sudo apt update
sudo apt install python-numpy python-scipy python-pandas python-sklearn 

Hope that helps.


回答 4

只需升级你的numpy模块即可,目前是1.15.4。对于Windows:

pip install numpy --upgrade

Just upgrade your numpy module, right now it is 1.15.4. For windows

pip install numpy --upgrade

回答 5

发生此错误是因为已安装的软件包是针对另一个版本的numpy构建的。
我们需要针对本地的numpy重新构建scipy和scikit-learn。

对于新的pip(在我的情况下pip 18.0),它起作用:

pip uninstall -y scipy scikit-learn
pip install --no-binary scipy,scikit-learn -I scipy scikit-learn

--no-binary接受一个软件包名称列表,表示不使用这些包的二进制文件。这里我们传入了--no-binary scipy,scikit-learn,它会忽略scipy和scikit-learn这两个包的二进制文件。

This error occurs because the installed packages were built against a different version of numpy.
We need to rebuild scipy and scikit-learn against the local numpy.

For new pip (in my case pip 18.0) this worked:

pip uninstall -y scipy scikit-learn
pip install --no-binary scipy,scikit-learn -I scipy scikit-learn

--no-binary takes a list of names of packages that you want to ignore binaries for. In this case we passed --no-binary scipy,scikit-learn, which will ignore binaries for the packages scipy and scikit-learn.


回答 6

元信息:安装sklearn的推荐方法

如果你已经有可正常使用的numpy和scipy安装,安装scikit-learn最简单的方法是使用pip:

pip install -U scikit-learn 

conda

conda install scikit-learn

[…不要使用pip从源代码编译]

如果你还没有带numpy和scipy的Python安装,我们建议通过系统的包管理器或某个Python发行版来安装一个。这些发行版自带numpy、scipy、scikit-learn、matplotlib以及许多其他有用的科学计算和数据处理库。

Meta-information: The recommended way to install sklearn

If you already have a working installation of numpy and scipy, the easiest way to install scikit-learn is using pip

pip install -U scikit-learn 

or conda:

conda install scikit-learn

[… do not compile from source using pip]

If you don’t already have a python installation with numpy and scipy, we recommend to install either via your package manager or via a python bundle. These come with numpy, scipy, scikit-learn, matplotlib and many other helpful scientific and data processing libraries.


回答 7

请注意,从cython 0.29开始,有一个新的check_size选项可以从源头上消除该警告,因此一旦该版本普及到各个软件包,就不再需要任何变通办法了。

Note that as of cython 0.29 there is a new check_size option that eliminates the warning at the source, so no work-arounds should be needed once that version percolates to the various packages


回答 8

我的环境是Python 2.7.15

我尝试

pip uninstall
pip install --no-use-wheel

但它不起作用。它显示错误:

no such option: --no-use-wheel

然后我尝试:

pip uninstall
pip install --user --install-option="--prefix=" -U scikit-learn

而且有效:不会显示无用的警告。

My enviroment is Python 2.7.15

I try

pip uninstall
pip install --no-use-wheel

but it does not work. It shows the error:

no such option: --no-use-wheel

Then I try:

pip uninstall
pip install --user --install-option="--prefix=" -U scikit-learn

And it works: the useless warnings do not show.


回答 9

导入scipy时,错误信息显示:RuntimeWarning: builtin.type size changed, may indicate binary incompatibility. Expected zd, got zd

我通过将python版本从2.7.2更新到2.7.13解决了这个问题

When importing scipy, the error info shows: RuntimeWarning: builtin.type size changed, may indicate binary incompatibility. Expected zd, got zd

I solved this problem by updating python version from 2.7.2 to 2.7.13


Scikit学习中的随机状态(伪随机数)

问题:Scikit学习中的随机状态(伪随机数)

我想在scikit-learn中实现一个机器学习算法,但我不明白random_state这个参数是做什么的。我为什么要使用它?

我也无法理解什么是伪随机数。

I want to implement a machine learning algorithm in scikit learn, but I don’t understand what this parameter random_state does? Why should I use it?

I also could not understand what is a Pseudo-random number.


回答 0

train_test_split将数组或矩阵拆分为随机训练和测试子集。这意味着,每次运行时不指定random_state,您都会得到不同的结果,这是预期的行为。例如:

运行1:

>>> a, b = np.arange(10).reshape((5, 2)), range(5)
>>> train_test_split(a, b)
[array([[6, 7],
        [8, 9],
        [4, 5]]),
 array([[2, 3],
        [0, 1]]), [3, 4, 2], [1, 0]]

运行2

>>> train_test_split(a, b)
[array([[8, 9],
        [4, 5],
        [0, 1]]),
 array([[6, 7],
        [2, 3]]), [4, 2, 0], [3, 1]]

它会变化。另一方面,如果使用random_state=some_number,则可以保证运行1的输出与运行2的输出相同,即拆分将始终一样。实际的random_state取值是42、0还是21都无关紧要;重要的是,每次使用42时,拆分得到的输出总是相同的。如果你想要可重现的结果(例如在文档中),这会很有用,这样每个人运行示例时都能看到一致的数字。实际上,我的建议是:在测试时把random_state设为某个固定数字,但如果你在生产中确实需要随机(而非固定)的拆分,就把它去掉。

关于第二个问题,伪随机数生成器是一个生成几乎真正随机数的数字生成器。为什么它们不是真正随机的,超出了这个问题的范围,并且可能对您而言无关紧要,您可以在此处查看更多详细信息。

train_test_split splits arrays or matrices into random train and test subsets. That means that everytime you run it without specifying random_state, you will get a different result, this is expected behavior. For example:

Run 1:

>>> a, b = np.arange(10).reshape((5, 2)), range(5)
>>> train_test_split(a, b)
[array([[6, 7],
        [8, 9],
        [4, 5]]),
 array([[2, 3],
        [0, 1]]), [3, 4, 2], [1, 0]]

Run 2

>>> train_test_split(a, b)
[array([[8, 9],
        [4, 5],
        [0, 1]]),
 array([[6, 7],
        [2, 3]]), [4, 2, 0], [3, 1]]

It changes. On the other hand if you use random_state=some_number, then you can guarantee that the output of Run 1 will be equal to the output of Run 2, i.e. your split will be always the same. It doesn’t matter what the actual random_state number is 42, 0, 21, … The important thing is that everytime you use 42, you will always get the same output the first time you make the split. This is useful if you want reproducible results, for example in the documentation, so that everybody can consistently see the same numbers when they run the examples. In practice I would say, you should set the random_state to some fixed number while you test stuff, but then remove it in production if you really need a random (and not a fixed) split.

Regarding your second question, a pseudo-random number generator is a number generator that generates almost truly random numbers. Why they are not truly random is out of the scope of this question and probably won’t matter in your case; you can take a look here for more details.
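
A minimal sketch of both behaviors described above: the same seed reproduces the split exactly, while omitting random_state lets every call differ:

import numpy as np
from sklearn.model_selection import train_test_split

a, b = np.arange(10).reshape((5, 2)), list(range(5))

# same seed twice -> identical splits
split_1 = train_test_split(a, b, random_state=42)
split_2 = train_test_split(a, b, random_state=42)
for part_1, part_2 in zip(split_1, split_2):
    assert np.array_equal(np.asarray(part_1), np.asarray(part_2))

# no seed -> each call may shuffle differently
print(train_test_split(a, b))
print(train_test_split(a, b))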


回答 1

如果未在代码中指定random_state,则每次运行(执行)代码时都会生成一个新的随机值,训练和测试数据集每次都会不一样。

但是,如果像random_state = 42这样赋一个固定值,那么无论执行多少次代码,结果都相同,即训练集和测试集中的值一样。

If you don’t specify the random_state in your code, then every time you run(execute) your code a new random value is generated and the train and test datasets would have different values each time.

However, if a fixed value is assigned like random_state = 42, then no matter how many times you execute your code the result would be the same, i.e., the same values in the train and test datasets.


回答 2

如果您在代码中未提及random_state,则每次执行代码时都会生成一个新的随机值,并且训练和测试数据集每次都将具有不同的值。

但是,如果每次将特定值用于random_state(random_state = 1或任何其他值),则结果将相同,即训练和测试数据集中的值相同。请参考以下代码:

import pandas as pd 
from sklearn.model_selection import train_test_split
test_series = pd.Series(range(100))
size30split = train_test_split(test_series,random_state = 1,test_size = .3)
size25split = train_test_split(test_series,random_state = 1,test_size = .25)
common = [element for element in size25split[0] if element in size30split[0]]
print(len(common))

不管运行代码多少次,输出都是70。

70

尝试删除random_state并运行代码。

import pandas as pd 
from sklearn.model_selection import train_test_split
test_series = pd.Series(range(100))
size30split = train_test_split(test_series,test_size = .3)
size25split = train_test_split(test_series,test_size = .25)
common = [element for element in size25split[0] if element in size30split[0]]
print(len(common))

现在,每次执行代码时,输出都会有所不同。

If you don’t mention the random_state in the code, then whenever you execute your code a new random value is generated and the train and test datasets would have different values each time.

However, if you use a particular value for random_state (random_state = 1 or any other value), the result will be the same every time, i.e., the same values in the train and test datasets. Refer to the code below:

import pandas as pd 
from sklearn.model_selection import train_test_split
test_series = pd.Series(range(100))
size30split = train_test_split(test_series,random_state = 1,test_size = .3)
size25split = train_test_split(test_series,random_state = 1,test_size = .25)
common = [element for element in size25split[0] if element in size30split[0]]
print(len(common))

Doesn’t matter how many times you run the code, the output will be 70.

70

Try to remove the random_state and run the code.

import pandas as pd 
from sklearn.model_selection import train_test_split
test_series = pd.Series(range(100))
size30split = train_test_split(test_series,test_size = .3)
size25split = train_test_split(test_series,test_size = .25)
common = [element for element in size25split[0] if element in size30split[0]]
print(len(common))

Now here output will be different each time you execute the code.


回答 3

random_state这个数字以随机的方式拆分测试和训练数据集。除了这里已解释的内容之外,还要记住:random_state的取值可能对模型质量有显著影响(这里的质量主要指预测准确率)。例如,如果你拿某个数据集训练一个回归模型而不指定random_state,那么每次在测试数据上得到的准确率结果都可能不同。因此,找到能给出最准确模型的最佳random_state值很重要;之后,这个数字还可以用来在其他场合(例如另一项研究实验)复现你的模型。为此,可以在for循环中把随机数赋给random_state参数,反复拆分并训练模型:

import numpy as np
from sklearn.linear_model import LarsCV
from sklearn.model_selection import train_test_split

# 假定 X, y 是你已有的特征和标签
tr_score, ts_score = [], []
for j in range(1000):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=j, test_size=0.35)
    lr = LarsCV().fit(X_train, y_train)
    tr_score.append(lr.score(X_train, y_train))
    ts_score.append(lr.score(X_test, y_test))

# 取测试得分最高的那次拆分所用的随机种子
J = ts_score.index(np.max(ts_score))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=J, test_size=0.35)
M = LarsCV().fit(X_train, y_train)
y_pred = M.predict(X_test)

random_state number splits the test and training datasets with a random manner. In addition to what is explained here, it is important to remember that random_state value can have significant effect on the quality of your model (by quality I essentially mean accuracy to predict). For instance, If you take a certain dataset and train a regression model with it, without specifying the random_state value, there is the potential that everytime, you will get a different accuracy result for your trained model on the test data. So it is important to find the best random_state value to provide you with the most accurate model. And then, that number will be used to reproduce your model in another occasion such as another research experiment. To do so, it is possible to split and train the model in a for-loop by assigning random numbers to random_state parameter:

import numpy as np
from sklearn.linear_model import LarsCV
from sklearn.model_selection import train_test_split

# X, y are assumed to be your existing features and labels
tr_score, ts_score = [], []
for j in range(1000):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=j, test_size=0.35)
    lr = LarsCV().fit(X_train, y_train)
    tr_score.append(lr.score(X_train, y_train))
    ts_score.append(lr.score(X_test, y_test))

# seed of the split with the best test score
J = ts_score.index(np.max(ts_score))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=J, test_size=0.35)
M = LarsCV().fit(X_train, y_train)
y_pred = M.predict(X_test)


回答 4

如果没有提供任何randomstate,系统将使用内部生成的randomstate。因此,当您多次运行该程序时,您可能会看到不同的训练/测试数据点,并且行为将不可预测。万一您的模型有问题,您将无法重新创建它,因为您不知道运行程序时生成的随机数。

再看树类的分类器(无论是DT还是RF),它们都会尝试用一个最优方案来构建树。虽然大多数时候这个方案可能相同,但在某些情况下树会不一样,预测结果也会随之不同。当你调试模型时,可能无法重现当初构建那棵树的情形。所以,为了避免这些麻烦,我们在构建DecisionTreeClassifier或RandomForestClassifier时使用random_state。

PS:您可以深入了解如何在DecisionTree中构建Tree,以更好地理解这一点。

randomstate基本上是为了让你的问题在每次运行时都可复现。如果不在traintestsplit中使用randomstate,每次拆分都可能得到不同的训练和测试数据点,在出现问题时就无法帮你调试。

从文档:

如果为int,则randomstate是随机数生成器使用的种子;如果是RandomState实例,则randomstate就是随机数生成器;如果为None,则随机数生成器是np.random使用的RandomState实例。

If there is no randomstate provided the system will use a randomstate that is generated internally. So, when you run the program multiple times you might see different train/test data points and the behavior will be unpredictable. In case, you have an issue with your model you will not be able to recreate it as you do not know the random number that was generated when you ran the program.

If you see the Tree Classifiers – either DT or RF, they try to build a tree using an optimal plan. Though most of the times this plan might be the same, there could be instances where the tree might be different and so the predictions. When you try to debug your model you may not be able to recreate the same instance for which a Tree was built. So, to avoid all this hassle we use a random_state while building a DecisionTreeClassifier or RandomForestClassifier.

PS: You can go a bit in depth on how the Tree is built in DecisionTree to understand this better.

randomstate is basically used for reproducing your problem the same every time it is run. If you do not use a randomstate in traintestsplit, every time you make the split you might get a different set of train and test data points and will not help you in debugging in case you get an issue.

From Doc:

If int, randomstate is the seed used by the random number generator; If RandomState instance, randomstate is the random number generator; If None, the random number generator is the RandomState instance used by np.random.


回答 5

sklearn.model_selection.train_test_split(*arrays, **options)[source]

将数组或矩阵拆分为随机训练和测试子集

Parameters: ... 
    random_state : int, RandomState instance or None, optional (default=None)

如果为int,则random_state是随机数生成器使用的种子;如果是RandomState实例,则random_state就是随机数生成器;如果为None,则随机数生成器是np.random使用的RandomState实例。来源:http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

“关于随机状态(random state),sklearn中的许多随机化算法用它来确定传给伪随机数生成器的随机种子。因此,它并不决定算法行为的任何方面。于是,在验证集上表现良好的随机状态值,并不对应于会在新的、未见过的测试集上表现良好的值。实际上,取决于具体算法,仅仅改变训练样本的顺序,你就可能看到完全不同的结果。”来源:https://stats.stackexchange.com/questions/263999/is-random-state-a-parameter-to-tune

sklearn.model_selection.train_test_split(*arrays, **options)[source]

Split arrays or matrices into random train and test subsets

Parameters: ... 
    random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. source: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

“Regarding the random state, it is used in many randomized algorithms in sklearn to determine the random seed passed to the pseudo-random number generator. Therefore, it does not govern any aspect of the algorithm’s behavior. As a consequence, random state values which performed well in the validation set do not correspond to those which would perform well in a new, unseen test set. Indeed, depending on the algorithm, you might see completely different results by just changing the ordering of training samples.” source: https://stats.stackexchange.com/questions/263999/is-random-state-a-parameter-to-tune
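
A short sketch contrasting the int and RandomState-instance cases quoted from the docs (the toy data is made up for illustration):

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(20).reshape(10, 2), list(range(10))

# an int seed gives the same split on every call
print(train_test_split(X, y, test_size=0.3, random_state=0)[2])
print(train_test_split(X, y, test_size=0.3, random_state=0)[2])  # identical

# a RandomState instance is consumed: consecutive calls advance its internal
# state, so the two splits below generally differ, while the whole sequence
# is still reproducible from the seed 0
rng = np.random.RandomState(0)
print(train_test_split(X, y, test_size=0.3, random_state=rng)[2])
print(train_test_split(X, y, test_size=0.3, random_state=rng)[2])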


scikit-learn中跨多列的标签编码

问题:scikit-learn中跨多列的标签编码

我正在尝试使用scikit-learn的LabelEncoder对一个由字符串标签组成的pandas DataFrame进行编码。由于数据框有很多列(50+),我想避免为每一列都创建一个LabelEncoder对象;我更希望只用一个能作用于所有数据列的大LabelEncoder对象。

把整个DataFrame直接丢给LabelEncoder会产生下面的错误。请注意,这里用的是虚拟数据;实际上我处理的是大约50列的字符串标签数据,所以需要一个不按名称引用任何列的解决方案。

import pandas
from sklearn import preprocessing 

df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'], 
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'], 
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 
                 'New_York']
})

le = preprocessing.LabelEncoder()

le.fit(df)

Traceback (most recent call last): File "", line 1, in File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit y = column_or_1d(y, warn=True) File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d raise ValueError("bad input shape {0}".format(shape)) ValueError: bad input shape (6, 3)

关于如何解决这个问题有什么想法吗?

I’m trying to use scikit-learn’s LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I’d rather just have one big LabelEncoder objects that works across all my columns of data.

Throwing the entire DataFrame into LabelEncoder creates the below error. Please bear in mind that I’m using dummy data here; in actuality I’m dealing with about 50 columns of string labeled data, so need a solution that doesn’t reference any columns by name.

import pandas
from sklearn import preprocessing 

df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'], 
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'], 
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 
                 'New_York']
})

le = preprocessing.LabelEncoder()

le.fit(df)

Traceback (most recent call last): File “”, line 1, in File “/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py”, line 103, in fit y = column_or_1d(y, warn=True) File “/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py”, line 306, in column_or_1d raise ValueError(“bad input shape {0}”.format(shape)) ValueError: bad input shape (6, 3)

Any thoughts on how to get around this problem?


回答 0

您可以很容易地做到这一点:

df.apply(LabelEncoder().fit_transform)

编辑2:

在scikit-learn 0.20中,推荐的方法是

OneHotEncoder().fit_transform(df)

因为OneHotEncoder现在支持字符串输入。使用ColumnTransformer可以仅将OneHotEncoder应用于某些列。

编辑:

由于这个答案是一年多以前写的,而且获得了很多赞(还有一笔悬赏),我或许应该进一步展开说明。

对于inverse_transform和transform,你需要做一点小技巧。

from collections import defaultdict
d = defaultdict(LabelEncoder)

这样,你就把所有列的LabelEncoder以字典的形式保留下来了。

# Encoding the variable
fit = df.apply(lambda x: d[x.name].fit_transform(x))

# Inverse the encoded
fit.apply(lambda x: d[x.name].inverse_transform(x))

# Using the dictionary to label future data
df.apply(lambda x: d[x.name].transform(x))

You can easily do this though,

df.apply(LabelEncoder().fit_transform)

EDIT2:

In scikit-learn 0.20, the recommended way is

OneHotEncoder().fit_transform(df)

as the OneHotEncoder now supports string input. Applying OneHotEncoder only to certain columns is possible with the ColumnTransformer.

EDIT:

Since this answer is over a year ago, and generated many upvotes (including a bounty), I should probably extend this further.

For inverse_transform and transform, you have to do a little bit of hack.

from collections import defaultdict
d = defaultdict(LabelEncoder)

With this, you now retain all columns' LabelEncoders in a dictionary.

# Encoding the variable
fit = df.apply(lambda x: d[x.name].fit_transform(x))

# Inverse the encoded
fit.apply(lambda x: d[x.name].inverse_transform(x))

# Using the dictionary to label future data
df.apply(lambda x: d[x.name].transform(x))
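
For completeness, a minimal end-to-end sketch of the defaultdict pattern on the question's toy dataframe; the printed values are illustrative:

import pandas as pd
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# one LabelEncoder per column, created lazily by the defaultdict
d = defaultdict(LabelEncoder)
encoded = df.apply(lambda col: d[col.name].fit_transform(col))
print(encoded)

# every fitted encoder is kept, so the mapping can be inspected or reversed
print(d['pets'].classes_)  # ['cat' 'dog' 'monkey']
decoded = encoded.apply(lambda col: d[col.name].inverse_transform(col))
print(decoded.equals(df))  # True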

回答 1

如larsmans所述,LabelEncoder()只接受一维数组作为参数。话虽如此,自己写一个标签编码器也很容易:它可以作用于你选择的多列,并返回转换后的数据框。我这里的代码部分基于Zac Stewart的一篇出色的博客文章(见此处)。

创建自定义编码器其实只需创建一个实现fit()、transform()和fit_transform()方法的类。就你的情况而言,一个不错的起点可能是这样的:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

# Create some toy data in a Pandas dataframe
fruit_data = pd.DataFrame({
    'fruit':  ['apple','orange','pear','orange'],
    'color':  ['red','orange','green','green'],
    'weight': [5,6,3,4]
})

class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname,col in output.iteritems():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)

假设我们要对两个分类属性(fruit和color)进行编码,同时保持数值属性weight不变。我们可以这样做:

MultiColumnLabelEncoder(columns = ['fruit','color']).fit_transform(fruit_data)

这会把我们的fruit_data数据集中的fruit和color转换为整数编码。(原页面此处为编码前后数据框的截图。)

向其传递一个完全由分类变量组成的数据框并省略该columns参数将导致对每一列进行编码(我相信这是您最初在寻找的内容):

MultiColumnLabelEncoder().fit_transform(fruit_data.drop('weight',axis=1))

这会对每一列进行编码。(原页面此处为转换前后数据框的截图。)

请注意,当它尝试对已经是数值的属性进行编码时,可能会报错(如果需要,可以自行添加处理代码)。

与此相关的另一个不错的功能是,我们可以在管道中使用此自定义转换器:

encoding_pipeline = Pipeline([
    ('encoding',MultiColumnLabelEncoder(columns=['fruit','color']))
    # add more pipeline steps as needed
])
encoding_pipeline.fit_transform(fruit_data)

As mentioned by larsmans, LabelEncoder() only takes a 1-d array as an argument. That said, it is quite easy to roll your own label encoder that operates on multiple columns of your choosing, and returns a transformed dataframe. My code here is based in part on Zac Stewart’s excellent blog post found here.

Creating a custom encoder involves simply creating a class that responds to the fit(), transform(), and fit_transform() methods. In your case, a good start might be something like this:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline

# Create some toy data in a Pandas dataframe
fruit_data = pd.DataFrame({
    'fruit':  ['apple','orange','pear','orange'],
    'color':  ['red','orange','green','green'],
    'weight': [5,6,3,4]
})

class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname,col in output.iteritems():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)

Suppose we want to encode our two categorical attributes (fruit and color), while leaving the numeric attribute weight alone. We could do this as follows:

MultiColumnLabelEncoder(columns = ['fruit','color']).fit_transform(fruit_data)

This transforms our fruit_data dataset so that fruit and color are integer-encoded while weight is left untouched. (The original post shows before/after screenshots of the dataframe here.)

Passing it a dataframe consisting entirely of categorical variables and omitting the columns parameter will result in every column being encoded (which I believe is what you were originally looking for):

MultiColumnLabelEncoder().fit_transform(fruit_data.drop('weight',axis=1))

This encodes every column. (The original post shows before/after screenshots of the dataframe here.)

Note that it’ll probably choke when it tries to encode attributes that are already numeric (add some code to handle this if you like).

Another nice feature about this is that we can use this custom transformer in a pipeline:

encoding_pipeline = Pipeline([
    ('encoding',MultiColumnLabelEncoder(columns=['fruit','color']))
    # add more pipeline steps as needed
])
encoding_pipeline.fit_transform(fruit_data)

回答 2

从scikit-learn 0.20开始,您可以使用sklearn.compose.ColumnTransformersklearn.preprocessing.OneHotEncoder

如果只有分类变量,可以直接使用OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder

OneHotEncoder(handle_unknown='ignore').fit_transform(df)

如果您的特征是异构类型的:

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder

categorical_columns = ['pets', 'owner', 'location']
numerical_columns = ['age', 'weigth', 'height']
column_trans = make_column_transformer(
    (categorical_columns, OneHotEncoder(handle_unknown='ignore')),
    (numerical_columns, RobustScaler()))
column_trans.fit_transform(df)

文档中的更多选项:http://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data

Since scikit-learn 0.20 you can use sklearn.compose.ColumnTransformer and sklearn.preprocessing.OneHotEncoder:

If you only have categorical variables, OneHotEncoder directly:

from sklearn.preprocessing import OneHotEncoder

OneHotEncoder(handle_unknown='ignore').fit_transform(df)

If you have heterogeneously typed features:

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder

categorical_columns = ['pets', 'owner', 'location']
numerical_columns = ['age', 'weigth', 'height']
column_trans = make_column_transformer(
    (categorical_columns, OneHotEncoder(handle_unknown='ignore')),
    (numerical_columns, RobustScaler()))
column_trans.fit_transform(df)

More options in the documentation: http://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data
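
A hedged sketch of the categorical-only case on the question's toy dataframe, showing how to inspect what the encoder learned:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

enc = OneHotEncoder(handle_unknown='ignore')
X = enc.fit_transform(df)  # sparse matrix with one column per category
print(X.shape)             # (6, 9): 3 pets + 4 owners + 2 locations
print(enc.categories_)     # the category list learned for each column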


回答 3

我们不需要LabelEncoder。

你可以把各列转换为分类类型(categorical),然后获取它们的编码。下面我用字典推导式把这一过程应用到每一列,并把结果包装回一个形状相同、索引和列名也相同的数据框中。

>>> pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)
   location  owner  pets
0         1      1     0
1         0      2     1
2         0      0     0
3         1      1     2
4         1      3     1
5         0      2     1

要创建映射字典,可以用字典推导式来枚举类别:

>>> {col: {n: cat for n, cat in enumerate(df[col].astype('category').cat.categories)} 
     for col in df}

{'location': {0: 'New_York', 1: 'San_Diego'},
 'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
 'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}

We don’t need a LabelEncoder.

You can convert the columns to categoricals and then get their codes. I used a dictionary comprehension below to apply this process to every column and wrap the result back into a dataframe of the same shape with identical indices and column names.

>>> pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)
   location  owner  pets
0         1      1     0
1         0      2     1
2         0      0     0
3         1      1     2
4         1      3     1
5         0      2     1

To create a mapping dictionary, you can just enumerate the categories using a dictionary comprehension:

>>> {col: {n: cat for n, cat in enumerate(df[col].astype('category').cat.categories)} 
     for col in df}

{'location': {0: 'New_York', 1: 'San_Diego'},
 'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
 'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}
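
A small sketch, assuming the same toy dataframe, showing that the codes plus the mapping dictionary round-trip back to the original strings:

import pandas as pd

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

codes = pd.DataFrame({col: df[col].astype('category').cat.codes for col in df},
                     index=df.index)
mappings = {col: dict(enumerate(df[col].astype('category').cat.categories))
            for col in df}

# decode by mapping each code back through its column's dictionary
decoded = codes.apply(lambda col: col.map(mappings[col.name]))
print(decoded.equals(df))  # True: the round trip recovers the original strings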

回答 4

这不会直接回答您的问题(Naputipulu Jon和PriceHardman对此做出了出色的答复)

但是,出于一些分类任务等目的,您可以使用

pandas.get_dummies(input_df) 

它接收带有分类数据的数据框,并返回带有二进制值的数据框;变量的取值会被编码进结果数据框的列名中。更多

this does not directly answer your question (for which Naputipulu Jon and PriceHardman have fantastic replies)

However, for the purpose of a few classification tasks etc. you could use

pandas.get_dummies(input_df) 

this can input dataframe with categorical data and return a dataframe with binary values. variable values are encoded into column names in the resulting dataframe. more
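
A minimal sketch of get_dummies on one column of the question's toy data (recent pandas versions return boolean indicator columns instead of 0/1 integers):

import pandas as pd

df = pd.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog']})

# each distinct value becomes its own indicator column
print(pd.get_dummies(df))
#    pets_cat  pets_dog  pets_monkey
# 0         1         0            0
# 1         0         1            0
# ...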


回答 5

假设你只是想得到一个可用于表示各列的sklearn.preprocessing.LabelEncoder()对象,那么你要做的就是:

le.fit(df.columns)

在上面的代码中,每一列都会对应一个唯一编号。更准确地说,df.columns与le.transform(df.columns.get_values())之间是1:1的映射。要获取某一列的编码,只需把它传给le.transform(...)。例如,以下代码会获取每一列的编码:

le.transform(df.columns.get_values())

假设你想为所有行标签创建一个sklearn.preprocessing.LabelEncoder()对象,可以这样做:

le.fit([y for x in df.get_values() for y in x])

在这种情况下,你很可能有非唯一的行标签(正如你的问题所示)。要查看编码器创建了哪些类,可以查看le.classes_。你会注意到,它的元素应当与set(y for x in df.get_values() for y in x)中的元素相同。同样,把行标签转换为编码后的标签要用le.transform(...)。例如,如果想检索df.columns数组中第一列、第一行对应的编码,可以这样做:

le.transform([df.get_value(0, df.columns[0])])

您在评论中遇到的问题比较复杂,但仍然可以解决:

le.fit([str(z) for z in set((x[0], y) for x in df.iteritems() for y in x[1])])

上面的代码执行以下操作:

  1. 对所有(列,行)对做唯一组合
  2. 将每一对表示为元组的字符串形式。这是一个变通办法,用来绕过LabelEncoder类不支持以元组作为类名的限制。
  3. 将这些新条目拟合到LabelEncoder上。

现在使用这种新模型要复杂一些。假设我们要提取在上一个示例中查找的同一项目的表示形式(df.columns中的第一列和第一行),我们可以这样做:

le.transform([str((df.columns[0], df.get_value(0, df.columns[0])))])

请记住,每个查找现在都是包含(列,行)的元组的字符串表示形式。

Assuming you are simply trying to get a sklearn.preprocessing.LabelEncoder() object that can be used to represent your columns, all you have to do is:

le.fit(df.columns)

In the above code you will have a unique number corresponding to each column. More precisely, you will have a 1:1 mapping of df.columns to le.transform(df.columns.get_values()). To get a column’s encoding, simply pass it to le.transform(...). As an example, the following will get the encoding for each column:

le.transform(df.columns.get_values())

Assuming you want to create a sklearn.preprocessing.LabelEncoder() object for all of your row labels you can do the following:

le.fit([y for x in df.get_values() for y in x])

In this case, you most likely have non-unique row labels (as shown in your question). To see what classes the encoder created you can do le.classes_. You’ll note that this should have the same elements as in set(y for x in df.get_values() for y in x). Once again to convert a row label to an encoded label use le.transform(...). As an example, if you want to retrieve the label for the first column in the df.columns array and the first row, you could do this:

le.transform([df.get_value(0, df.columns[0])])

The question you had in your comment is a bit more complicated, but can still be accomplished:

le.fit([str(z) for z in set((x[0], y) for x in df.iteritems() for y in x[1])])

The above code does the following:

  1. Make a unique combination of all of the pairs of (column, row)
  2. Represent each pair as a string version of the tuple. This is a workaround to overcome the LabelEncoder class not supporting tuples as a class name.
  3. Fits the new items to the LabelEncoder.

Now to use this new model it’s a bit more complicated. Assuming we want to extract the representation for the same item we looked up in the previous example (the first column in df.columns and the first row), we can do this:

le.transform([str((df.columns[0], df.get_value(0, df.columns[0])))])

Remember that each lookup is now a string representation of a tuple that contains the (column, row).


回答 6

不,LabelEncoder不这样做。它需要一维类标签数组,并生成一维数组。它旨在处理分类问题中的类标签,而不是任意数据,并且任何试图将其用于其他用途的尝试都将需要代码将实际问题转换为要解决的问题(并将解决方案还原到原始空间)。

No, LabelEncoder does not do this. It takes 1-d arrays of class labels and produces 1-d arrays. It’s designed to handle class labels in classification problems, not arbitrary data, and any attempt to force it into other uses will require code to transform the actual problem to the problem it solves (and the solution back to the original space).


回答 7

虽然这已经是一年半之后了,但我也需要能够一次.transform()多个pandas数据框列(并且也能对它们.inverse_transform())。这是对上面@PriceHardman出色建议的扩展:

class MultiColumnLabelEncoder(LabelEncoder):
    """
    Wraps sklearn LabelEncoder functionality for use on multiple columns of a
    pandas dataframe.

    """
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, dframe):
        """
        Fit label encoder to pandas columns.

        Access individual column classes via indexing `self.all_classes_`

        Access individual column encoders via indexing
        `self.all_encoders_`
        """
        # if columns are provided, iterate through and get `classes_`
        if self.columns is not None:
            # ndarray to hold LabelEncoder().classes_ for each
            # column; should match the shape of specified `columns`
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            for idx, column in enumerate(self.columns):
                # fit LabelEncoder to get `classes_` for the column
                le = LabelEncoder()
                le.fit(dframe.loc[:, column].values)
                # append the `classes_` to our ndarray container
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                # append this column's encoder
                self.all_encoders_[idx] = le
        else:
            # no columns specified; assume all are to be encoded
            self.columns = dframe.iloc[:, :].columns
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            for idx, column in enumerate(self.columns):
                le = LabelEncoder()
                le.fit(dframe.loc[:, column].values)
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                self.all_encoders_[idx] = le
        return self

    def fit_transform(self, dframe):
        """
        Fit label encoder and return encoded labels.

        Access individual column classes via indexing
        `self.all_classes_`

        Access individual column encoders via indexing
        `self.all_encoders_`

        Access individual column encoded labels via indexing
        `self.all_labels_`
        """
        # if columns are provided, iterate through and get `classes_`
        if self.columns is not None:
            # ndarray to hold LabelEncoder().classes_ for each
            # column; should match the shape of specified `columns`
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            self.all_labels_ = np.ndarray(shape=self.columns.shape,
                                          dtype=object)
            for idx, column in enumerate(self.columns):
                # instantiate LabelEncoder
                le = LabelEncoder()
                # fit and transform labels in the column
                dframe.loc[:, column] =\
                    le.fit_transform(dframe.loc[:, column].values)
                # append the `classes_` to our ndarray container
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                self.all_encoders_[idx] = le
                self.all_labels_[idx] = le
        else:
            # no columns specified; assume all are to be encoded
            self.columns = dframe.iloc[:, :].columns
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            for idx, column in enumerate(self.columns):
                le = LabelEncoder()
                dframe.loc[:, column] = le.fit_transform(
                        dframe.loc[:, column].values)
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                self.all_encoders_[idx] = le
        return dframe.loc[:, self.columns].values

    def transform(self, dframe):
        """
        Transform labels to normalized encoding.
        """
        if self.columns is not None:
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[
                    idx].transform(dframe.loc[:, column].values)
        else:
            self.columns = dframe.iloc[:, :].columns
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .transform(dframe.loc[:, column].values)
        return dframe.loc[:, self.columns].values

    def inverse_transform(self, dframe):
        """
        Transform labels back to original encoding.
        """
        if self.columns is not None:
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .inverse_transform(dframe.loc[:, column].values)
        else:
            self.columns = dframe.iloc[:, :].columns
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .inverse_transform(dframe.loc[:, column].values)
        return dframe.loc[:, self.columns].values

例:

如果dfdf_copy()是混合类型的pandas数据框,则可以通过以下方式将和MultiColumnLabelEncoder()应用于dtype=object列:

# get `object` columns
df_object_columns = df.iloc[:, :].select_dtypes(include=['object']).columns
df_copy_object_columns = df_copy.iloc[:, :].select_dtypes(include=['object']).columns

# instantiate `MultiColumnLabelEncoder`
mcle = MultiColumnLabelEncoder(columns=df_object_columns)

# fit to `df` data
mcle.fit(df)

# transform the `df` data
mcle.transform(df)

# returns output like below
array([[1, 0, 0, ..., 1, 1, 0],
       [0, 5, 1, ..., 1, 1, 2],
       [1, 1, 1, ..., 1, 1, 2],
       ..., 
       [3, 5, 1, ..., 1, 1, 2],

# transform `df_copy` data
mcle.transform(df_copy)

# returns output like below (assuming the respective columns 
# of `df_copy` contain the same unique values as that particular 
# column in `df`
array([[1, 0, 0, ..., 1, 1, 0],
       [0, 5, 1, ..., 1, 1, 2],
       [1, 1, 1, ..., 1, 1, 2],
       ..., 
       [3, 5, 1, ..., 1, 1, 2],

# inverse `df` data
mcle.inverse_transform(df)

# outputs data like below
array([['August', 'Friday', '2013', ..., 'N', 'N', 'CA'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['August', 'Monday', '2014', ..., 'N', 'N', 'NJ'],
       ..., 
       ['February', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['March', 'Tuesday', '2013', ..., 'N', 'N', 'NJ']], dtype=object)

# inverse `df_copy` data
mcle.inverse_transform(df_copy)

# outputs data like below
array([['August', 'Friday', '2013', ..., 'N', 'N', 'CA'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['August', 'Monday', '2014', ..., 'N', 'N', 'NJ'],
       ..., 
       ['February', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['March', 'Tuesday', '2013', ..., 'N', 'N', 'NJ']], dtype=object)

你可以通过索引访问用于拟合每一列的列类(classes)、列标签和列编码器:

mcle.all_classes_
mcle.all_encoders_
mcle.all_labels_

This is a year-and-a-half after the fact, but I too, needed to be able to .transform() multiple pandas dataframe columns at once (and be able to .inverse_transform() them as well). This expands upon the excellent suggestion of @PriceHardman above:

class MultiColumnLabelEncoder(LabelEncoder):
    """
    Wraps sklearn LabelEncoder functionality for use on multiple columns of a
    pandas dataframe.

    """
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, dframe):
        """
        Fit label encoder to pandas columns.

        Access individual column classes via indexing `self.all_classes_`

        Access individual column encoders via indexing
        `self.all_encoders_`
        """
        # if columns are provided, iterate through and get `classes_`
        if self.columns is not None:
            # ndarray to hold LabelEncoder().classes_ for each
            # column; should match the shape of specified `columns`
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            for idx, column in enumerate(self.columns):
                # fit LabelEncoder to get `classes_` for the column
                le = LabelEncoder()
                le.fit(dframe.loc[:, column].values)
                # append the `classes_` to our ndarray container
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                # append this column's encoder
                self.all_encoders_[idx] = le
        else:
            # no columns specified; assume all are to be encoded
            self.columns = dframe.iloc[:, :].columns
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            for idx, column in enumerate(self.columns):
                le = LabelEncoder()
                le.fit(dframe.loc[:, column].values)
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                self.all_encoders_[idx] = le
        return self

    def fit_transform(self, dframe):
        """
        Fit label encoder and return encoded labels.

        Access individual column classes via indexing
        `self.all_classes_`

        Access individual column encoders via indexing
        `self.all_encoders_`

        Access individual column encoded labels via indexing
        `self.all_labels_`
        """
        # if columns are provided, iterate through and get `classes_`
        if self.columns is not None:
            # ndarray to hold LabelEncoder().classes_ for each
            # column; should match the shape of specified `columns`
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            self.all_labels_ = np.ndarray(shape=self.columns.shape,
                                          dtype=object)
            for idx, column in enumerate(self.columns):
                # instantiate LabelEncoder
                le = LabelEncoder()
                # fit and transform labels in the column
                dframe.loc[:, column] =\
                    le.fit_transform(dframe.loc[:, column].values)
                # append the `classes_` to our ndarray container
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                self.all_encoders_[idx] = le
                self.all_labels_[idx] = le
        else:
            # no columns specified; assume all are to be encoded
            self.columns = dframe.iloc[:, :].columns
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            for idx, column in enumerate(self.columns):
                le = LabelEncoder()
                dframe.loc[:, column] = le.fit_transform(
                        dframe.loc[:, column].values)
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                  dtype=object))
                self.all_encoders_[idx] = le
        return dframe.loc[:, self.columns].values

    def transform(self, dframe):
        """
        Transform labels to normalized encoding.
        """
        if self.columns is not None:
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[
                    idx].transform(dframe.loc[:, column].values)
        else:
            self.columns = dframe.iloc[:, :].columns
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .transform(dframe.loc[:, column].values)
        return dframe.loc[:, self.columns].values

    def inverse_transform(self, dframe):
        """
        Transform labels back to original encoding.
        """
        if self.columns is not None:
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .inverse_transform(dframe.loc[:, column].values)
        else:
            self.columns = dframe.iloc[:, :].columns
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .inverse_transform(dframe.loc[:, column].values)
        return dframe.loc[:, self.columns].values

Example:

If df and df_copy are mixed-type pandas dataframes, you can apply the MultiColumnLabelEncoder() to the dtype=object columns in the following way:

# get `object` columns
df_object_columns = df.iloc[:, :].select_dtypes(include=['object']).columns
df_copy_object_columns = df_copy.iloc[:, :].select_dtypes(include=['object']).columns

# instantiate `MultiColumnLabelEncoder`
mcle = MultiColumnLabelEncoder(columns=df_object_columns)

# fit to `df` data
mcle.fit(df)

# transform the `df` data
mcle.transform(df)

# returns output like below
array([[1, 0, 0, ..., 1, 1, 0],
       [0, 5, 1, ..., 1, 1, 2],
       [1, 1, 1, ..., 1, 1, 2],
       ..., 
       [3, 5, 1, ..., 1, 1, 2],

# transform `df_copy` data
mcle.transform(df_copy)

# returns output like below (assuming the respective columns 
# of `df_copy` contain the same unique values as that particular 
# column in `df`
array([[1, 0, 0, ..., 1, 1, 0],
       [0, 5, 1, ..., 1, 1, 2],
       [1, 1, 1, ..., 1, 1, 2],
       ..., 
       [3, 5, 1, ..., 1, 1, 2],

# inverse `df` data
mcle.inverse_transform(df)

# outputs data like below
array([['August', 'Friday', '2013', ..., 'N', 'N', 'CA'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['August', 'Monday', '2014', ..., 'N', 'N', 'NJ'],
       ..., 
       ['February', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['March', 'Tuesday', '2013', ..., 'N', 'N', 'NJ']], dtype=object)

# inverse `df_copy` data
mcle.inverse_transform(df_copy)

# outputs data like below
array([['August', 'Friday', '2013', ..., 'N', 'N', 'CA'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['August', 'Monday', '2014', ..., 'N', 'N', 'NJ'],
       ..., 
       ['February', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['March', 'Tuesday', '2013', ..., 'N', 'N', 'NJ']], dtype=object)

You can access individual column classes, column labels, and column encoders used to fit each column via indexing:

mcle.all_classes_
mcle.all_encoders_
mcle.all_labels_


回答 8

针对@PriceHardman的解决方案下面提出的评论,我提出该类的如下版本:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import LabelEncoder
# 假设:`pdu` 是原作者代码库中的辅助校验模块(见文末链接),此处按原样保留
import pdu

class LabelEncodingColoumns(BaseEstimator, TransformerMixin):
    def __init__(self, cols=None):
        pdu._is_cols_input_valid(cols)
        self.cols = cols
        self.les = {col: LabelEncoder() for col in cols}
        self._is_fitted = False

    def transform(self, df, **transform_params):
        """
        Scaling ``cols`` of ``df`` using the fitting

        Parameters
        ----------
        df : DataFrame
            DataFrame to be preprocessed
        """
        if not self._is_fitted:
            raise NotFittedError("Fitting was not performed")
        pdu._is_cols_subset_of_df_cols(self.cols, df)

        df = df.copy()

        label_enc_dict = {}
        for col in self.cols:
            label_enc_dict[col] = self.les[col].transform(df[col])

        labelenc_cols = pd.DataFrame(label_enc_dict,
            # The index of the resulting DataFrame should be assigned and
            # equal to the one of the original DataFrame. Otherwise, upon
            # concatenation NaNs will be introduced.
            index=df.index
        )

        for col in self.cols:
            df[col] = labelenc_cols[col]
        return df

    def fit(self, df, y=None, **fit_params):
        """
        Fitting the preprocessing

        Parameters
        ----------
        df : DataFrame
            Data to use for fitting.
            In many cases, should be ``X_train``.
        """
        pdu._is_cols_subset_of_df_cols(self.cols, df)
        for col in self.cols:
            self.les[col].fit(df[col])
        self._is_fitted = True
        return self

此类在训练集上拟合编码器,并在转换时使用已拟合的版本。该代码的初始版本可以在此处找到。

Following up on the comments raised on the solution of @PriceHardman I would propose the following version of the class:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import LabelEncoder
# assumption: `pdu` is a helper module from the author's linked code, kept as-is
import pdu

class LabelEncodingColoumns(BaseEstimator, TransformerMixin):
    def __init__(self, cols=None):
        pdu._is_cols_input_valid(cols)
        self.cols = cols
        self.les = {col: LabelEncoder() for col in cols}
        self._is_fitted = False

    def transform(self, df, **transform_params):
        """
        Scaling ``cols`` of ``df`` using the fitting

        Parameters
        ----------
        df : DataFrame
            DataFrame to be preprocessed
        """
        if not self._is_fitted:
            raise NotFittedError("Fitting was not performed")
        pdu._is_cols_subset_of_df_cols(self.cols, df)

        df = df.copy()

        label_enc_dict = {}
        for col in self.cols:
            label_enc_dict[col] = self.les[col].transform(df[col])

        labelenc_cols = pd.DataFrame(label_enc_dict,
            # The index of the resulting DataFrame should be assigned and
            # equal to the one of the original DataFrame. Otherwise, upon
            # concatenation NaNs will be introduced.
            index=df.index
        )

        for col in self.cols:
            df[col] = labelenc_cols[col]
        return df

    def fit(self, df, y=None, **fit_params):
        """
        Fitting the preprocessing

        Parameters
        ----------
        df : DataFrame
            Data to use for fitting.
            In many cases, should be ``X_train``.
        """
        pdu._is_cols_subset_of_df_cols(self.cols, df)
        for col in self.cols:
            self.les[col].fit(df[col])
        self._is_fitted = True
        return self

This class fits the encoder on the training set and uses the fitted version when transforming. Initial version of the code can be found here.


Answer 9


A short way to LabelEncoder() multiple columns with a dict():

from sklearn.preprocessing import LabelEncoder
le_dict = {col: LabelEncoder() for col in columns}
for col in columns:
    df[col] = le_dict[col].fit_transform(df[col])  # assign back; fit_transform does not modify df in place

and you can use this le_dict to label-encode any other column:

df_another[col] = le_dict[col].transform(df_another[col])

Answer 10


It is possible to do this all in pandas directly; the task is well suited to a unique ability of the replace method.

First, let’s make a dictionary of dictionaries mapping the columns and their values to their new replacement values.

transform_dict = {}
for col in df.columns:
    cats = pd.Categorical(df[col]).categories
    d = {}
    for i, cat in enumerate(cats):
        d[cat] = i
    transform_dict[col] = d

transform_dict
{'location': {'New_York': 0, 'San_Diego': 1},
 'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
 'pets': {'cat': 0, 'dog': 1, 'monkey': 2}}

Since this will always be a one to one mapping, we can invert the inner dictionary to get a mapping of the new values back to the original.

inverse_transform_dict = {}
for col, d in transform_dict.items():
    inverse_transform_dict[col] = {v:k for k, v in d.items()}

inverse_transform_dict
{'location': {0: 'New_York', 1: 'San_Diego'},
 'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
 'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}

Now, we can use the unique ability of the replace method to take a nested dictionary of dictionaries, using the outer keys as the columns and the inner keys as the values we would like to replace.

df.replace(transform_dict)
   location  owner  pets
0         1      1     0
1         0      2     1
2         0      0     0
3         1      1     2
4         1      3     1
5         0      2     1

We can easily go back to the original by chaining the replace method again:

df.replace(transform_dict).replace(inverse_transform_dict)
    location     owner    pets
0  San_Diego     Champ     cat
1   New_York       Ron     dog
2   New_York     Brick     cat
3  San_Diego     Champ  monkey
4  San_Diego  Veronica     dog
5   New_York       Ron     dog

Answer 11


After a lot of searching and experimenting with some answers here and elsewhere, I think your answer is here:

pd.DataFrame(columns=df.columns, data=LabelEncoder().fit_transform(df.values.flatten()).reshape(df.shape))

This will preserve category names across columns:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame([['A','B','C','D','E','F','G','I','K','H'],
                   ['A','E','H','F','G','I','K','','',''],
                   ['A','C','I','F','H','G','','','','']], 
                  columns=['A1', 'A2', 'A3','A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10'])

pd.DataFrame(columns=df.columns, data=LabelEncoder().fit_transform(df.values.flatten()).reshape(df.shape))

    A1  A2  A3  A4  A5  A6  A7  A8  A9  A10
0   1   2   3   4   5   6   7   9   10  8
1   1   5   8   6   7   9   10  0   0   0
2   1   3   9   6   8   7   0   0   0   0

Answer 12


I checked the source code (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/label.py) of LabelEncoder. It is based on a set of numpy transformations, one of which is np.unique(), and that function only takes 1-d array input (correct me if I am wrong).

Very rough idea… first, identify which columns need a LabelEncoder, then loop through each column.

import pandas as pd

def cat_var(df): 
    """Identify categorical features. 

    Parameters
    ----------
    df: original df after missing operations 

    Returns
    -------
    cat_var_df: summary df with col index and col name for all categorical vars
    """
    col_type = df.dtypes
    col_names = list(df)

    cat_var_index = [i for i, x in enumerate(col_type) if x=='object']
    cat_var_name = [x for i, x in enumerate(col_names) if i in cat_var_index]

    cat_var_df = pd.DataFrame({'cat_ind': cat_var_index, 
                               'cat_name': cat_var_name})

    return cat_var_df



from sklearn.preprocessing import LabelEncoder 

def column_encoder(df, cat_var_list):
    """Encoding categorical features in the dataframe

    Parameters
    ----------
    df: input dataframe 
    cat_var_list: categorical feature index and name, from cat_var function

    Return
    ------
    df: new dataframe where categorical features are encoded
    label_list: classes_ attribute for all encoded features 
    """

    label_list = []
    cat_list = cat_var_list.loc[:, 'cat_name']  # use the summary produced by cat_var()

    for cat_feature in cat_list:
        le = LabelEncoder()
        le.fit(df.loc[:, cat_feature])
        label_list.append(list(le.classes_))
        df.loc[:, cat_feature] = le.transform(df.loc[:, cat_feature])

    return df, label_list

The returned df is the one after encoding, and label_list shows you what all those values mean in the corresponding column. This is a snippet from a data-processing script I wrote for work. Let me know if you think there could be any further improvement.

EDIT: Just want to mention here that the methods above work best on data frames with no missing values. I am not sure how they behave on a data frame that contains missing data. (I ran a missing-value handling procedure before executing the methods above.)
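A hedged usage sketch tying the two functions together (the DataFrame df is hypothetical):

cat_summary = cat_var(df)                                 # summary of the categorical columns
df_encoded, label_list = column_encoder(df, cat_summary)  # encode them in place
print(label_list[0])                                      # classes_ of the first encoded column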


Answer 13


Label encoding a single column and inverting the transform is easy; here is how to do it when there are multiple columns in Python:

from sklearn import preprocessing

def stringtocategory(dataset):
    '''
    @author puja.sharma
    @see The function label encodes the object type columns and gives the label
         encoded and inverse transform of the label encoded data
    @param dataset dataframe on whose columns the label encoding has to be done
    @return label encoded and inverse transform of the label encoded data.
    '''
    data_original = dataset[:]
    data_transformed = dataset[:]
    for y in dataset.columns:
        # check the dtype of the column; object type contains strings or chars
        if dataset[y].dtype == object:
            print("The string type features are: " + y)
            le = preprocessing.LabelEncoder()
            le.fit(dataset[y].unique())
            # label encoded data
            data_transformed[y] = le.transform(dataset[y])
            # inverse label transform data
            data_original[y] = le.inverse_transform(data_transformed[y])
    return data_transformed, data_original
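A hedged usage sketch (assuming a DataFrame df with some object-dtype columns):

df_encoded, df_roundtrip = stringtocategory(df)
print(df_roundtrip.equals(df))  # the inverse transform should recover the original strings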

Answer 14


If you have both numerical and categorical data in a dataframe, you can use the following; here X is my dataframe with both categorical and numerical variables:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

for i in range(0,X.shape[1]):
    if X.dtypes[i]=='object':
        X[X.columns[i]] = le.fit_transform(X[X.columns[i]])

Note: This technique is good if you are not interested in converting them back.


Answer 15


Using Neuraxle

TL;DR: you can use the FlattenForEach wrapper class to simply transform your df, like: FlattenForEach(LabelEncoder(), then_unflatten=True).fit_transform(df).

With this method, your label encoder will be able to fit and transform within a regular scikit-learn Pipeline. Let’s simply import:

from sklearn.preprocessing import LabelEncoder
from neuraxle.steps.column_transformer import ColumnTransformer
from neuraxle.steps.loop import FlattenForEach

Same shared encoder for columns:

Here is how one shared LabelEncoder will be applied on all the data to encode it:

    p = FlattenForEach(LabelEncoder(), then_unflatten=True)

Result:

    p, predicted_output = p.fit_transform(df.values)
    expected_output = np.array([
        [6, 7, 6, 8, 7, 7],
        [1, 3, 0, 1, 5, 3],
        [4, 2, 2, 4, 4, 2]
    ]).transpose()
    assert np.array_equal(predicted_output, expected_output)

Different encoders per column:

And here is how a first standalone LabelEncoder will be applied to the pets, and a second will be shared between the columns owner and location. So, to be precise, we have a mix of different and shared label encoders here:

    p = ColumnTransformer([
        # A different encoder will be used for column 0 with name "pets":
        (0, FlattenForEach(LabelEncoder(), then_unflatten=True)),
        # A shared encoder will be used for column 1 and 2, "owner" and "location":
        ([1, 2], FlattenForEach(LabelEncoder(), then_unflatten=True)),
    ], n_dimension=2)

Result:

    p, predicted_output = p.fit_transform(df.values)
    expected_output = np.array([
        [0, 1, 0, 2, 1, 1],
        [1, 3, 0, 1, 5, 3],
        [4, 2, 2, 4, 4, 2]
    ]).transpose()
    assert np.array_equal(predicted_output, expected_output)

Answer 16


Mainly used @Alexander's answer but had to make some changes:

cols_need_mapped = ['col1', 'col2']

mapper = {col: {cat: n for n, cat in enumerate(df[col].astype('category').cat.categories)}
          for col in cols_need_mapped}

for c in cols_need_mapped:
    df[c] = df[c].map(mapper[c])

Then, to re-use it in the future, you can just save the output to a JSON document and, when you need it, read it back in and use the .map() function as above.
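A hedged sketch of that round trip (the filename is hypothetical; it works here because the category keys are strings, which JSON requires):

import json

with open('category_mapper.json', 'w') as f:
    json.dump(mapper, f)       # persist the mapping

with open('category_mapper.json') as f:
    mapper = json.load(f)      # later: reload it

for c in cols_need_mapped:
    new_df[c] = new_df[c].map(mapper[c])  # apply to a new dataframe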


Answer 17


The problem is the shape of the data (a pd DataFrame) you are passing to the fit function. You have to pass a 1-d sequence (a single column), not a 2-d DataFrame.
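A minimal illustration of the point (the column name is hypothetical):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# le.fit(df)         # 2-d input (a whole DataFrame) raises ValueError
le.fit(df['owner'])  # a 1-d Series works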


Answer 18


import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.read_csv('.../train.csv')

# X = train.loc[:, ['waterpoint_type_group', 'status', 'waterpoint_type', 'source_class']].values
# Create one label encoder object per column and encode in place
def MultiLabelEncoder(columnlist, dataframe):
    for i in columnlist:
        labelencoder_X = LabelEncoder()
        dataframe[i] = labelencoder_X.fit_transform(dataframe[i])

columnlist = ['waterpoint_type_group', 'status', 'waterpoint_type', 'source_class', 'source_type']
MultiLabelEncoder(columnlist, train)

Here I am reading a CSV from a location; to the function I pass the list of columns I want to label-encode and the dataframe I want to apply it to.


Answer 19


How about this?

def MultiColumnLabelEncode(choice, columns, X, LabelEncoders=None):
    # 'encode' fits one LabelEncoder per column and returns the list of encoders,
    # so that 'decode' can later invert the transform with those same encoders
    # (the original version's decode branch referenced encoders that no longer existed)
    if choice == 'encode':
        LabelEncoders = [LabelEncoder() for _ in columns]
        for i, cols in enumerate(columns):
            X[:, cols] = LabelEncoders[i].fit_transform(X[:, cols])
        return LabelEncoders
    elif choice == 'decode':
        for i, cols in enumerate(columns):
            X[:, cols] = LabelEncoders[i].inverse_transform(X[:, cols])
    else:
        print('Please select correct parameter "choice". Available parameters: encode/decode')

It is not the most efficient, but it works and is super simple.
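A hedged usage sketch (assuming X is an object-dtype numpy array whose columns 0 and 2 are categorical):

encoders = MultiColumnLabelEncode('encode', [0, 2], X)  # encode in place, keep the encoders
MultiColumnLabelEncode('decode', [0, 2], X, encoders)   # later: restore the original labels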


How do I normalize an array in NumPy?

Question: How do I normalize an array in NumPy?


I would like to have the norm of one NumPy array. More specifically, I am looking for an equivalent version of this function

def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0: 
       return v
    return v / norm

Is there something like that in sklearn or numpy?

This function works in a situation where v is the 0 vector.


Answer 0


If you’re using scikit-learn you can use sklearn.preprocessing.normalize:

import numpy as np
from sklearn.preprocessing import normalize

x = np.random.rand(1000) * 10
norm1 = x / np.linalg.norm(x)
norm2 = normalize(x[:, np.newaxis], axis=0).ravel()
print(np.all(norm1 == norm2))
# True

Answer 1


I would agree that it would be nice if such a function were part of the included batteries, but as far as I know it isn't. Here is a version for arbitrary axes that gives optimal performance.

import numpy as np

def normalized(a, axis=-1, order=2):
    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
    l2[l2==0] = 1
    return a / np.expand_dims(l2, axis)

A = np.random.randn(3,3,3)
print(normalized(A,0))
print(normalized(A,1))
print(normalized(A,2))

print(normalized(np.arange(3)[:,None]))
print(normalized(np.arange(3)))

Answer 2


You can specify ord to get the L1 norm. To avoid division by zero I use eps, but that's maybe not great.

def normalize(v):
    norm = np.linalg.norm(v, ord=1)
    if norm == 0:
        norm = np.finfo(v.dtype).eps
    return v / norm

Answer 3


This might also work for you

import numpy as np
normalized_v = v / np.sqrt(np.sum(v**2))

but fails when v has length 0.


Answer 4


If you have multidimensional data and want each axis normalized to its max or its sum:

def normalize(_d, to_sum=True, copy=True):
    # d is a (n x dimension) np array
    d = _d if not copy else np.copy(_d)
    d -= np.min(d, axis=0)
    d /= (np.sum(d, axis=0) if to_sum else np.ptp(d, axis=0))
    return d

Uses numpy's peak-to-peak function np.ptp.

a = np.random.random((5, 3))

b = normalize(a, copy=False)
b.sum(axis=0) # array([1., 1., 1.]), the columns sum to 1

c = normalize(a, to_sum=False, copy=False)
c.max(axis=0) # array([1., 1., 1.]), the max of each column is 1

Answer 5


There is also the function unit_vector() to normalize vectors in the popular transformations module by Christoph Gohlke:

import transformations as trafo
import numpy as np

data = np.array([[1.0, 1.0, 0.0],
                 [1.0, 1.0, 1.0],
                 [1.0, 2.0, 3.0]])

print(trafo.unit_vector(data, axis=1))

Answer 6


You mentioned sci-kit learn, so I want to share another solution.

sci-kit learn MinMaxScaler

In sci-kit learn, there is an API called MinMaxScaler which lets you customize the value range as you like.

It also deals with NaN issues for us.

NaNs are treated as missing values: disregarded in fit, and maintained in transform. … see reference [1]

Code sample

The code is simple, just type

# Let's say X_train is your input dataframe
from sklearn.preprocessing import MinMaxScaler
# call MinMaxScaler object
min_max_scaler = MinMaxScaler()
# feed in a numpy array
X_train_norm = min_max_scaler.fit_transform(X_train.values)
# wrap it up if you need a dataframe
df = pd.DataFrame(X_train_norm)
Reference

Answer 7


Without sklearn and using just numpy: simply define a function.

Assuming that the rows are the variables and the columns the samples (axis=1):

import numpy as np

# Example array
X = np.array([[1,2,3],[4,5,6]])

def stdmtx(X):
    means = X.mean(axis=1)
    stds = X.std(axis=1, ddof=1)
    X = X - means[:, np.newaxis]
    X = X / stds[:, np.newaxis]
    return np.nan_to_num(X)

output:

X
array([[1, 2, 3],
       [4, 5, 6]])

stdmtx(X)
array([[-1.,  0.,  1.],
       [-1.,  0.,  1.]])


Answer 8


If you want to normalize n dimensional feature vectors stored in a 3D tensor, you could also use PyTorch:

import numpy as np
from torch import FloatTensor
from torch.nn.functional import normalize

vecs = np.random.rand(3, 16, 16, 16)
norm_vecs = normalize(FloatTensor(vecs), dim=0, eps=1e-16).numpy()

Answer 9


If you’re working with 3D vectors, you can do this concisely using the toolbelt vg. It’s a light layer on top of numpy and it supports single values and stacked vectors.

import numpy as np
import vg

x = np.random.rand(1000) * 10
norm1 = x / np.linalg.norm(x)
norm2 = vg.normalize(x)
print(np.all(norm1 == norm2))
# True

I created the library at my last startup, where it was motivated by uses like this: simple ideas which are way too verbose in NumPy.


Answer 10


If you don’t need utmost precision, your function can be reduced to:

v_norm = v / (np.linalg.norm(v) + 1e-16)

Answer 11


If you work with multidimensional arrays, the following fast solution is possible.

Say we have a 2D array that we want to normalize along the last axis, while some rows have zero norm.

import numpy as np
arr = np.array([
    [1, 2, 3], 
    [0, 0, 0],
    [5, 6, 7]
], dtype=float)  # np.float is removed in recent NumPy; plain float works

lengths = np.linalg.norm(arr, axis=-1)
print(lengths)  # [ 3.74165739  0.         10.48808848]
arr[lengths > 0] = arr[lengths > 0] / lengths[lengths > 0][:, np.newaxis]
print(arr)
# [[0.26726124 0.53452248 0.80178373]
# [0.         0.         0.        ]
# [0.47673129 0.57207755 0.66742381]]

Save classifier to disk in scikit-learn

Question: Save classifier to disk in scikit-learn


How do I save a trained Naive Bayes classifier to disk and use it to predict data?

I have the following sample program from the scikit-learn website:

from sklearn import datasets
iris = datasets.load_iris()
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print "Number of mislabeled points : %d" % (iris.target != y_pred).sum()

Answer 0


Classifiers are just objects that can be pickled and dumped like any other. To continue your example:

import pickle  # on Python 2 this module was cPickle

# save the classifier
with open('my_dumped_classifier.pkl', 'wb') as fid:
    pickle.dump(gnb, fid)

# load it again
with open('my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = pickle.load(fid)
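To close the loop with the question, a minimal sketch of predicting with the reloaded classifier (reusing the iris data from the example above):

y_pred = gnb_loaded.predict(iris.data)
print("Number of mislabeled points : %d" % (iris.target != y_pred).sum())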

Answer 1


You can also use joblib.dump and joblib.load, which are much more efficient at handling numerical arrays than the default Python pickler.

Joblib is included in scikit-learn:

>>> import joblib
>>> from sklearn.datasets import load_digits
>>> from sklearn.linear_model import SGDClassifier

>>> digits = load_digits()
>>> clf = SGDClassifier().fit(digits.data, digits.target)
>>> clf.score(digits.data, digits.target)  # evaluate training error
0.9526989426822482

>>> filename = '/tmp/digits_classifier.joblib.pkl'
>>> _ = joblib.dump(clf, filename, compress=9)

>>> clf2 = joblib.load(filename)
>>> clf2
SGDClassifier(alpha=0.0001, class_weight=None, epsilon=0.1, eta0=0.0,
       fit_intercept=True, learning_rate='optimal', loss='hinge', n_iter=5,
       n_jobs=1, penalty='l2', power_t=0.5, rho=0.85, seed=0,
       shuffle=False, verbose=0, warm_start=False)
>>> clf2.score(digits.data, digits.target)
0.9526989426822482

Edit: in Python 3.8+ it's now possible to use pickle for efficient pickling of objects with large numerical arrays as attributes, if you use pickle protocol 5 (which is not the default).
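A minimal sketch of that protocol-5 variant (the filename is hypothetical):

import pickle

with open('digits_classifier.pkl', 'wb') as f:
    pickle.dump(clf, f, protocol=5)  # protocol 5 (PEP 574) pickles large buffers efficiently; Python 3.8+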


Answer 2


What you are looking for is called model persistence in sklearn terms, and it is documented in the introduction and in the model persistence sections.

So you have initialized your classifier and trained it for a long time with

clf = some.classifier()
clf.fit(X, y)

After this you have two options:

1) Using Pickle

import pickle
# now you can save it to a file
with open('filename.pkl', 'wb') as f:
    pickle.dump(clf, f)

# and later you can load it
with open('filename.pkl', 'rb') as f:
    clf = pickle.load(f)

2) Using Joblib

from sklearn.externals import joblib  # deprecated since 0.21; in recent scikit-learn use `import joblib`
# now you can save it to a file
joblib.dump(clf, 'filename.pkl') 
# and later you can load it
clf = joblib.load('filename.pkl')

Once more, it is helpful to read the above-mentioned links.


Answer 3


In many cases, particularly with text classification, it is not enough to store just the classifier; you'll need to store the vectorizer as well so that you can vectorize your input in the future.

import pickle
with open('model.pkl', 'wb') as fout:
  pickle.dump((vectorizer, clf), fout)

future use case:

with open('model.pkl', 'rb') as fin:
  vectorizer, clf = pickle.load(fin)

X_new = vectorizer.transform(new_samples)
X_new_preds = clf.predict(X_new)

Before dumping the vectorizer, one can delete the stop_words_ property of vectorizer by:

vectorizer.stop_words_ = None

to make dumping more efficient. Also, if your classifier's parameters are sparse (as in most text classification examples), you can convert the parameters from dense to sparse, which will make a huge difference in terms of memory consumption, loading, and dumping. Sparsify the model by:

clf.sparsify()

This works automatically for SGDClassifier, but if you know your model is sparse (lots of zeros in clf.coef_) you can manually convert clf.coef_ into a CSR scipy sparse matrix by:

clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)

and then you can store it more efficiently.


Answer 4


sklearn estimators implement methods to make it easy for you to save the relevant trained properties of an estimator. Some estimators implement __getstate__ methods themselves, but others, like the GMM, just use the base implementation, which simply saves the object's inner dictionary:

def __getstate__(self):
    try:
        state = super(BaseEstimator, self).__getstate__()
    except AttributeError:
        state = self.__dict__.copy()

    if type(self).__module__.startswith('sklearn.'):
        return dict(state.items(), _sklearn_version=__version__)
    else:
        return state

The recommended method to save your model to disk is to use the pickle module:

from sklearn import datasets
from sklearn.svm import SVC
iris = datasets.load_iris()
X = iris.data[:100, :2]
y = iris.target[:100]
model = SVC()
model.fit(X,y)
import pickle
with open('mymodel','wb') as f:
    pickle.dump(model,f)

However, you should save additional data so you can retrain your model in the future, or suffer dire consequences (such as being locked into an old version of sklearn).

From the documentation:

In order to rebuild a similar model with future versions of scikit-learn, additional metadata should be saved along with the pickled model:

The training data, e.g. a reference to an immutable snapshot

The python source code used to generate the model

The versions of scikit-learn and its dependencies

The cross validation score obtained on the training data

This is especially true for ensemble estimators that rely on the tree.pyx module written in Cython (such as IsolationForest), since it creates a coupling to the implementation, which is not guaranteed to be stable between versions of sklearn. It has seen backwards-incompatible changes in the past.

If your models become very large and loading becomes a nuisance, you can also use the more efficient joblib. From the documentation:

In the specific case of the scikit, it may be more interesting to use joblib’s replacement of pickle (joblib.dump & joblib.load), which is more efficient on objects that carry large numpy arrays internally as is often the case for fitted scikit-learn estimators, but can only pickle to the disk and not to a string:
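A hedged sketch of persisting the metadata listed above alongside the pickled model (the filename and field values are illustrative; this is not a scikit-learn API):

import json
import sklearn

metadata = {
    'sklearn_version': sklearn.__version__,
    'training_data': 'reference to an immutable snapshot',  # placeholder
    'cv_score': 0.97,                                        # hypothetical score on the training data
}
with open('mymodel.meta.json', 'w') as f:
    json.dump(metadata, f)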


Answer 5


sklearn.externals.joblib has been deprecated since 0.21 and will be removed in v0.23:

/usr/local/lib/python3.7/site-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
warnings.warn(msg, category=FutureWarning)


Therefore, you need to install joblib:

pip install joblib

and finally write the model to disk:

import joblib
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier


digits = load_digits()
clf = SGDClassifier().fit(digits.data, digits.target)

with open('myClassifier.joblib.pkl', 'wb') as f:
    joblib.dump(clf, f, compress=9)

Now in order to read the dumped file all you need to run is:

with open('myClassifier.joblib.pkl', 'rb') as f:
    my_clf = joblib.load(f)