标签归档:machine-learning

是否可以使用scikit-learn K-Means聚类指定自己的距离函数?

问题:是否可以使用scikit-learn K-Means聚类指定自己的距离函数?

是否可以使用scikit-learn K-Means聚类指定自己的距离函数?

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?


回答 0

这是一个小型的kmean,使用scipy.spatial.distance或用户函数中的20多个距离中的 任意一个。
欢迎发表评论(到目前为止,只有一位用户,这还不够);特别是,您的N,dim,k公制是什么?

#!/usr/bin/env python
# kmeans.py using any of the 20-odd metrics in scipy.spatial.distance
# kmeanssample 2 pass, first sample sqrt(N)

from __future__ import division
import random
import numpy as np
from scipy.spatial.distance import cdist  # $scipy/spatial/distance.py
    # http://docs.scipy.org/doc/scipy/reference/spatial.html
from scipy.sparse import issparse  # $scipy/sparse/csr.py

__date__ = "2011-11-17 Nov denis"
    # X sparse, any cdist metric: real app ?
    # centres get dense rapidly, metrics in high dim hit distance whiteout
    # vs unsupervised / semi-supervised svm

#...............................................................................
def kmeans( X, centres, delta=.001, maxiter=10, metric="euclidean", p=2, verbose=1 ):
    """ centres, Xtocentre, distances = kmeans( X, initial centres ... )
    in:
        X N x dim  may be sparse
        centres k x dim: initial centres, e.g. random.sample( X, k )
        delta: relative error, iterate until the average distance to centres
            is within delta of the previous average distance
        maxiter
        metric: any of the 20-odd in scipy.spatial.distance
            "chebyshev" = max, "cityblock" = L1, "minkowski" with p=
            or a function( Xvec, centrevec ), e.g. Lqmetric below
        p: for minkowski metric -- local mod cdist for 0 < p < 1 too
        verbose: 0 silent, 2 prints running distances
    out:
        centres, k x dim
        Xtocentre: each X -> its nearest centre, ints N -> k
        distances, N
    see also: kmeanssample below, class Kmeans below.
    """
    if not issparse(X):
        X = np.asanyarray(X)  # ?
    centres = centres.todense() if issparse(centres) \
        else centres.copy()
    N, dim = X.shape
    k, cdim = centres.shape
    if dim != cdim:
        raise ValueError( "kmeans: X %s and centres %s must have the same number of columns" % (
            X.shape, centres.shape ))
    if verbose:
        print "kmeans: X %s  centres %s  delta=%.2g  maxiter=%d  metric=%s" % (
            X.shape, centres.shape, delta, maxiter, metric)
    allx = np.arange(N)
    prevdist = 0
    for jiter in range( 1, maxiter+1 ):
        D = cdist_sparse( X, centres, metric=metric, p=p )  # |X| x |centres|
        xtoc = D.argmin(axis=1)  # X -> nearest centre
        distances = D[allx,xtoc]
        avdist = distances.mean()  # median ?
        if verbose >= 2:
            print "kmeans: av |X - nearest centre| = %.4g" % avdist
        if (1 - delta) * prevdist <= avdist <= prevdist \
        or jiter == maxiter:
            break
        prevdist = avdist
        for jc in range(k):  # (1 pass in C)
            c = np.where( xtoc == jc )[0]
            if len(c) > 0:
                centres[jc] = X[c].mean( axis=0 )
    if verbose:
        print "kmeans: %d iterations  cluster sizes:" % jiter, np.bincount(xtoc)
    if verbose >= 2:
        r50 = np.zeros(k)
        r90 = np.zeros(k)
        for j in range(k):
            dist = distances[ xtoc == j ]
            if len(dist) > 0:
                r50[j], r90[j] = np.percentile( dist, (50, 90) )
        print "kmeans: cluster 50 % radius", r50.astype(int)
        print "kmeans: cluster 90 % radius", r90.astype(int)
            # scale L1 / dim, L2 / sqrt(dim) ?
    return centres, xtoc, distances

#...............................................................................
def kmeanssample( X, k, nsample=0, **kwargs ):
    """ 2-pass kmeans, fast for large N:
        1) kmeans a random sample of nsample ~ sqrt(N) from X
        2) full kmeans, starting from those centres
    """
        # merge w kmeans ? mttiw
        # v large N: sample N^1/2, N^1/2 of that
        # seed like sklearn ?
    N, dim = X.shape
    if nsample == 0:
        nsample = max( 2*np.sqrt(N), 10*k )
    Xsample = randomsample( X, int(nsample) )
    pass1centres = randomsample( X, int(k) )
    samplecentres = kmeans( Xsample, pass1centres, **kwargs )[0]
    return kmeans( X, samplecentres, **kwargs )

def cdist_sparse( X, Y, **kwargs ):
    """ -> |X| x |Y| cdist array, any cdist metric
        X or Y may be sparse -- best csr
    """
        # todense row at a time, v slow if both v sparse
    sxy = 2*issparse(X) + issparse(Y)
    if sxy == 0:
        return cdist( X, Y, **kwargs )
    d = np.empty( (X.shape[0], Y.shape[0]), np.float64 )
    if sxy == 2:
        for j, x in enumerate(X):
            d[j] = cdist( x.todense(), Y, **kwargs ) [0]
    elif sxy == 1:
        for k, y in enumerate(Y):
            d[:,k] = cdist( X, y.todense(), **kwargs ) [0]
    else:
        for j, x in enumerate(X):
            for k, y in enumerate(Y):
                d[j,k] = cdist( x.todense(), y.todense(), **kwargs ) [0]
    return d

def randomsample( X, n ):
    """ random.sample of the rows of X
        X may be sparse -- best csr
    """
    sampleix = random.sample( xrange( X.shape[0] ), int(n) )
    return X[sampleix]

def nearestcentres( X, centres, metric="euclidean", p=2 ):
    """ each X -> nearest centre, any metric
            euclidean2 (~ withinss) is more sensitive to outliers,
            cityblock (manhattan, L1) less sensitive
    """
    D = cdist( X, centres, metric=metric, p=p )  # |X| x |centres|
    return D.argmin(axis=1)

def Lqmetric( x, y=None, q=.5 ):
    # yes a metric, may increase weight of near matches; see ...
    return (np.abs(x - y) ** q) .mean() if y is not None \
        else (np.abs(x) ** q) .mean()

#...............................................................................
class Kmeans:
    """ km = Kmeans( X, k= or centres=, ... )
        in: either initial centres= for kmeans
            or k= [nsample=] for kmeanssample
        out: km.centres, km.Xtocentre, km.distances
        iterator:
            for jcentre, J in km:
                clustercentre = centres[jcentre]
                J indexes e.g. X[J], classes[J]
    """
    def __init__( self, X, k=0, centres=None, nsample=0, **kwargs ):
        self.X = X
        if centres is None:
            self.centres, self.Xtocentre, self.distances = kmeanssample(
                X, k=k, nsample=nsample, **kwargs )
        else:
            self.centres, self.Xtocentre, self.distances = kmeans(
                X, centres, **kwargs )

    def __iter__(self):
        for jc in range(len(self.centres)):
            yield jc, (self.Xtocentre == jc)

#...............................................................................
if __name__ == "__main__":
    import random
    import sys
    from time import time

    N = 10000
    dim = 10
    ncluster = 10
    kmsample = 100  # 0: random centres, > 0: kmeanssample
    kmdelta = .001
    kmiter = 10
    metric = "cityblock"  # "chebyshev" = max, "cityblock" L1,  Lqmetric
    seed = 1

    exec( "\n".join( sys.argv[1:] ))  # run this.py N= ...
    np.set_printoptions( 1, threshold=200, edgeitems=5, suppress=True )
    np.random.seed(seed)
    random.seed(seed)

    print "N %d  dim %d  ncluster %d  kmsample %d  metric %s" % (
        N, dim, ncluster, kmsample, metric)
    X = np.random.exponential( size=(N,dim) )
        # cf scikits-learn datasets/
    t0 = time()
    if kmsample > 0:
        centres, xtoc, dist = kmeanssample( X, ncluster, nsample=kmsample,
            delta=kmdelta, maxiter=kmiter, metric=metric, verbose=2 )
    else:
        randomcentres = randomsample( X, ncluster )
        centres, xtoc, dist = kmeans( X, randomcentres,
            delta=kmdelta, maxiter=kmiter, metric=metric, verbose=2 )
    print "%.0f msec" % ((time() - t0) * 1000)

    # also ~/py/np/kmeans/test-kmeans.py

2012年3月26日添加了一些注意事项:

1)对于余弦距离,首先将所有数据向量归一化为| X | = 1; 然后

cosinedistance( X, Y ) = 1 - X . Y = Euclidean distance |X - Y|^2 / 2

很快 对于位向量,请将规范与向量分开,而不是扩展为浮点数(尽管某些程序可能会为您扩展)。对于稀疏向量,说N,X的1%。Y应该花费时间O(2%N),空间O(N); 但我不知道哪个程序可以做到这一点。

2) Scikit学习集群 很好地概述了k均值,mini-batch-k均值…以及适用于scipy.sparse矩阵的代码。

3)务必在k均值之后检查群集大小。如果您期望群集大小大致相等,但它们出来了 [44 37 9 5 5] %……(令人头疼的声音)。

Here’s a small kmeans that uses any of the 20-odd distances in scipy.spatial.distance, or a user function.
Comments would be welcome (this has had only one user so far, not enough); in particular, what are your N, dim, k, metric ?

#!/usr/bin/env python
# kmeans.py using any of the 20-odd metrics in scipy.spatial.distance
# kmeanssample 2 pass, first sample sqrt(N)

from __future__ import division
import random
import numpy as np
from scipy.spatial.distance import cdist  # $scipy/spatial/distance.py
    # http://docs.scipy.org/doc/scipy/reference/spatial.html
from scipy.sparse import issparse  # $scipy/sparse/csr.py

__date__ = "2011-11-17 Nov denis"
    # X sparse, any cdist metric: real app ?
    # centres get dense rapidly, metrics in high dim hit distance whiteout
    # vs unsupervised / semi-supervised svm

#...............................................................................
def kmeans( X, centres, delta=.001, maxiter=10, metric="euclidean", p=2, verbose=1 ):
    """ centres, Xtocentre, distances = kmeans( X, initial centres ... )
    in:
        X N x dim  may be sparse
        centres k x dim: initial centres, e.g. random.sample( X, k )
        delta: relative error, iterate until the average distance to centres
            is within delta of the previous average distance
        maxiter
        metric: any of the 20-odd in scipy.spatial.distance
            "chebyshev" = max, "cityblock" = L1, "minkowski" with p=
            or a function( Xvec, centrevec ), e.g. Lqmetric below
        p: for minkowski metric -- local mod cdist for 0 < p < 1 too
        verbose: 0 silent, 2 prints running distances
    out:
        centres, k x dim
        Xtocentre: each X -> its nearest centre, ints N -> k
        distances, N
    see also: kmeanssample below, class Kmeans below.
    """
    if not issparse(X):
        X = np.asanyarray(X)  # ?
    centres = centres.todense() if issparse(centres) \
        else centres.copy()
    N, dim = X.shape
    k, cdim = centres.shape
    if dim != cdim:
        raise ValueError( "kmeans: X %s and centres %s must have the same number of columns" % (
            X.shape, centres.shape ))
    if verbose:
        print "kmeans: X %s  centres %s  delta=%.2g  maxiter=%d  metric=%s" % (
            X.shape, centres.shape, delta, maxiter, metric)
    allx = np.arange(N)
    prevdist = 0
    for jiter in range( 1, maxiter+1 ):
        D = cdist_sparse( X, centres, metric=metric, p=p )  # |X| x |centres|
        xtoc = D.argmin(axis=1)  # X -> nearest centre
        distances = D[allx,xtoc]
        avdist = distances.mean()  # median ?
        if verbose >= 2:
            print "kmeans: av |X - nearest centre| = %.4g" % avdist
        if (1 - delta) * prevdist <= avdist <= prevdist \
        or jiter == maxiter:
            break
        prevdist = avdist
        for jc in range(k):  # (1 pass in C)
            c = np.where( xtoc == jc )[0]
            if len(c) > 0:
                centres[jc] = X[c].mean( axis=0 )
    if verbose:
        print "kmeans: %d iterations  cluster sizes:" % jiter, np.bincount(xtoc)
    if verbose >= 2:
        r50 = np.zeros(k)
        r90 = np.zeros(k)
        for j in range(k):
            dist = distances[ xtoc == j ]
            if len(dist) > 0:
                r50[j], r90[j] = np.percentile( dist, (50, 90) )
        print "kmeans: cluster 50 % radius", r50.astype(int)
        print "kmeans: cluster 90 % radius", r90.astype(int)
            # scale L1 / dim, L2 / sqrt(dim) ?
    return centres, xtoc, distances

#...............................................................................
def kmeanssample( X, k, nsample=0, **kwargs ):
    """ 2-pass kmeans, fast for large N:
        1) kmeans a random sample of nsample ~ sqrt(N) from X
        2) full kmeans, starting from those centres
    """
        # merge w kmeans ? mttiw
        # v large N: sample N^1/2, N^1/2 of that
        # seed like sklearn ?
    N, dim = X.shape
    if nsample == 0:
        nsample = max( 2*np.sqrt(N), 10*k )
    Xsample = randomsample( X, int(nsample) )
    pass1centres = randomsample( X, int(k) )
    samplecentres = kmeans( Xsample, pass1centres, **kwargs )[0]
    return kmeans( X, samplecentres, **kwargs )

def cdist_sparse( X, Y, **kwargs ):
    """ -> |X| x |Y| cdist array, any cdist metric
        X or Y may be sparse -- best csr
    """
        # todense row at a time, v slow if both v sparse
    sxy = 2*issparse(X) + issparse(Y)
    if sxy == 0:
        return cdist( X, Y, **kwargs )
    d = np.empty( (X.shape[0], Y.shape[0]), np.float64 )
    if sxy == 2:
        for j, x in enumerate(X):
            d[j] = cdist( x.todense(), Y, **kwargs ) [0]
    elif sxy == 1:
        for k, y in enumerate(Y):
            d[:,k] = cdist( X, y.todense(), **kwargs ) [0]
    else:
        for j, x in enumerate(X):
            for k, y in enumerate(Y):
                d[j,k] = cdist( x.todense(), y.todense(), **kwargs ) [0]
    return d

def randomsample( X, n ):
    """ random.sample of the rows of X
        X may be sparse -- best csr
    """
    sampleix = random.sample( xrange( X.shape[0] ), int(n) )
    return X[sampleix]

def nearestcentres( X, centres, metric="euclidean", p=2 ):
    """ each X -> nearest centre, any metric
            euclidean2 (~ withinss) is more sensitive to outliers,
            cityblock (manhattan, L1) less sensitive
    """
    D = cdist( X, centres, metric=metric, p=p )  # |X| x |centres|
    return D.argmin(axis=1)

def Lqmetric( x, y=None, q=.5 ):
    # yes a metric, may increase weight of near matches; see ...
    return (np.abs(x - y) ** q) .mean() if y is not None \
        else (np.abs(x) ** q) .mean()

#...............................................................................
class Kmeans:
    """ km = Kmeans( X, k= or centres=, ... )
        in: either initial centres= for kmeans
            or k= [nsample=] for kmeanssample
        out: km.centres, km.Xtocentre, km.distances
        iterator:
            for jcentre, J in km:
                clustercentre = centres[jcentre]
                J indexes e.g. X[J], classes[J]
    """
    def __init__( self, X, k=0, centres=None, nsample=0, **kwargs ):
        self.X = X
        if centres is None:
            self.centres, self.Xtocentre, self.distances = kmeanssample(
                X, k=k, nsample=nsample, **kwargs )
        else:
            self.centres, self.Xtocentre, self.distances = kmeans(
                X, centres, **kwargs )

    def __iter__(self):
        for jc in range(len(self.centres)):
            yield jc, (self.Xtocentre == jc)

#...............................................................................
if __name__ == "__main__":
    import random
    import sys
    from time import time

    N = 10000
    dim = 10
    ncluster = 10
    kmsample = 100  # 0: random centres, > 0: kmeanssample
    kmdelta = .001
    kmiter = 10
    metric = "cityblock"  # "chebyshev" = max, "cityblock" L1,  Lqmetric
    seed = 1

    exec( "\n".join( sys.argv[1:] ))  # run this.py N= ...
    np.set_printoptions( 1, threshold=200, edgeitems=5, suppress=True )
    np.random.seed(seed)
    random.seed(seed)

    print "N %d  dim %d  ncluster %d  kmsample %d  metric %s" % (
        N, dim, ncluster, kmsample, metric)
    X = np.random.exponential( size=(N,dim) )
        # cf scikits-learn datasets/
    t0 = time()
    if kmsample > 0:
        centres, xtoc, dist = kmeanssample( X, ncluster, nsample=kmsample,
            delta=kmdelta, maxiter=kmiter, metric=metric, verbose=2 )
    else:
        randomcentres = randomsample( X, ncluster )
        centres, xtoc, dist = kmeans( X, randomcentres,
            delta=kmdelta, maxiter=kmiter, metric=metric, verbose=2 )
    print "%.0f msec" % ((time() - t0) * 1000)

    # also ~/py/np/kmeans/test-kmeans.py

Some notes added 26mar 2012:

1) for cosine distance, first normalize all the data vectors to |X| = 1; then

cosinedistance( X, Y ) = 1 - X . Y = Euclidean distance |X - Y|^2 / 2

is fast. For bit vectors, keep the norms separately from the vectors instead of expanding out to floats (although some programs may expand for you). For sparse vectors, say 1 % of N, X . Y should take time O( 2 % N ), space O(N); but I don’t know which programs do that.

2) Scikit-learn clustering gives an excellent overview of k-means, mini-batch-k-means … with code that works on scipy.sparse matrices.

3) Always check cluster sizes after k-means. If you’re expecting roughly equal-sized clusters, but they come out [44 37 9 5 5] % … (sound of head-scratching).


回答 1

不幸的是,没有:scikit-learn当前的k-means实现仅使用欧几里得距离。

将k均值扩展到其他距离并非易事,并且denis的上述回答并不是为其他度量实施k均值的正确方法。

Unfortunately no: scikit-learn current implementation of k-means only uses Euclidean distances.

It is not trivial to extend k-means to other distances and denis’ answer above is not the correct way to implement k-means for other metrics.


回答 2

只需在可以执行此操作的地方使用nltk即可,例如

from nltk.cluster.kmeans import KMeansClusterer
NUM_CLUSTERS = <choose a value>
data = <sparse matrix that you would normally give to scikit>.toarray()

kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(data, assign_clusters=True)

Just use nltk instead where you can do this, e.g.

from nltk.cluster.kmeans import KMeansClusterer
NUM_CLUSTERS = <choose a value>
data = <sparse matrix that you would normally give to scikit>.toarray()

kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(data, assign_clusters=True)

回答 3

是的,您可以使用差异度量功能;但是,根据定义,k均值聚类算法依赖于距每个聚类均值的eucldiean距离。

您可以使用其他指标,因此即使您仍在计算均值,也可以使用诸如马氏距离之类的值。

Yes you can use a difference metric function; however, by definition, the k-means clustering algorithm relies on the eucldiean distance from the mean of each cluster.

You could use a different metric, so even though you are still calculating the mean you could use something like the mahalnobis distance.


回答 4

pyclustering,它是python / C ++(非常快!),可让您指定自定义指标函数

from pyclustering.cluster.kmeans import kmeans
from pyclustering.utils.metric import type_metric, distance_metric

user_function = lambda point1, point2: point1[0] + point2[0] + 2
metric = distance_metric(type_metric.USER_DEFINED, func=user_function)

# create K-Means algorithm with specific distance metric
start_centers = [[4.7, 5.9], [5.7, 6.5]];
kmeans_instance = kmeans(sample, start_centers, metric=metric)

# run cluster analysis and obtain results
kmeans_instance.process()
clusters = kmeans_instance.get_clusters()

实际上,我还没有测试过此代码,而是从票证示例代码中将其拼凑在一起。

There is pyclustering which is python/C++ (so its fast!) and lets you specify a custom metric function

from pyclustering.cluster.kmeans import kmeans
from pyclustering.utils.metric import type_metric, distance_metric

user_function = lambda point1, point2: point1[0] + point2[0] + 2
metric = distance_metric(type_metric.USER_DEFINED, func=user_function)

# create K-Means algorithm with specific distance metric
start_centers = [[4.7, 5.9], [5.7, 6.5]];
kmeans_instance = kmeans(sample, start_centers, metric=metric)

# run cluster analysis and obtain results
kmeans_instance.process()
clusters = kmeans_instance.get_clusters()

Actually, i haven’t tested this code but cobbled it together from a ticket and example code.


回答 5

Spectral Python的k均值允许使用L1(曼哈顿)距离。

k-means of Spectral Python allows the use of L1 (Manhattan) distance.


回答 6

Sklearn Kmeans使用欧几里德距离。它没有指标参数。这就是说,如果你聚类的时间序列,你可以使用tslearnPython包时,你可以指定一个度量标准(dtwsoftdtweuclidean)。

Sklearn Kmeans uses the Euclidean distance. It has no metric parameter. This said, if you’re clustering time series, you can use the tslearn python package, when you can specify a metric (dtw, softdtw, euclidean).


如何在Python中进行热编码?

问题:如何在Python中进行热编码?

我有一个80%分类变量的机器学习分类问题。如果要使用一些分类器进行分类,是否必须使用一种热编码?我可以在没有编码的情况下将数据传递给分类器吗?

我正在尝试进行以下功能选择:

  1. 我读了火车文件:

    num_rows_to_read = 10000
    train_small = pd.read_csv("../../dataset/train.csv",   nrows=num_rows_to_read)
    
  2. 我将类别特征的类型更改为“类别”:

    non_categorial_features = ['orig_destination_distance',
                              'srch_adults_cnt',
                              'srch_children_cnt',
                              'srch_rm_cnt',
                              'cnt']
    
    for categorical_feature in list(train_small.columns):
        if categorical_feature not in non_categorial_features:
            train_small[categorical_feature] = train_small[categorical_feature].astype('category')
    
  3. 我使用一种热编码:

    train_small_with_dummies = pd.get_dummies(train_small, sparse=True)

问题是,尽管我使用的是坚固的机器,但第3部分经常卡住。

因此,没有一种热编码,我就无法进行任何特征选择来确定特征的重要性。

您有什么推荐的吗?

I have a machine learning classification problem with 80% categorical variables. Must I use one hot encoding if I want to use some classifier for the classification? Can i pass the data to a classifier without the encoding?

I am trying to do the following for feature selection:

  1. I read the train file:

    num_rows_to_read = 10000
    train_small = pd.read_csv("../../dataset/train.csv",   nrows=num_rows_to_read)
    
  2. I change the type of the categorical features to ‘category’:

    non_categorial_features = ['orig_destination_distance',
                              'srch_adults_cnt',
                              'srch_children_cnt',
                              'srch_rm_cnt',
                              'cnt']
    
    for categorical_feature in list(train_small.columns):
        if categorical_feature not in non_categorial_features:
            train_small[categorical_feature] = train_small[categorical_feature].astype('category')
    
  3. I use one hot encoding:

    train_small_with_dummies = pd.get_dummies(train_small, sparse=True)
    

The problem is that the 3’rd part often get stuck, although I am using a strong machine.

Thus, without the one hot encoding I can’t do any feature selection, for determining the importance of the features.

What do you recommend?


回答 0

方法1:您可以在pandas数据框上使用get_dummies。

范例1:

import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
Out[]: 
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
3  1.0  0.0  0.0

范例2:

下面将把给定的列转换为一个热门列。使用前缀具有多个虚拟变量。

import pandas as pd

df = pd.DataFrame({
          'A':['a','b','a'],
          'B':['b','a','c']
        })
df
Out[]: 
   A  B
0  a  b
1  b  a
2  a  c

# Get one hot encoding of columns B
one_hot = pd.get_dummies(df['B'])
# Drop column B as it is now encoded
df = df.drop('B',axis = 1)
# Join the encoded df
df = df.join(one_hot)
df  
Out[]: 
       A  a  b  c
    0  a  0  1  0
    1  b  1  0  0
    2  a  0  0  1

方法2:使用Scikit学习

给定一个具有三个特征和四个样本的数据集,我们让编码器找到每个特征的最大值,并将数据转换为二进制的一键编码。

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])   
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
   handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9], dtype=int32)
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

这是此示例的链接:http : //scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Approach 1: You can use pandas’ pd.get_dummies.

Example 1:

import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
Out[]: 
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
3  1.0  0.0  0.0

Example 2:

The following will transform a given column into one hot. Use prefix to have multiple dummies.

import pandas as pd
        
df = pd.DataFrame({
          'A':['a','b','a'],
          'B':['b','a','c']
        })
df
Out[]: 
   A  B
0  a  b
1  b  a
2  a  c

# Get one hot encoding of columns B
one_hot = pd.get_dummies(df['B'])
# Drop column B as it is now encoded
df = df.drop('B',axis = 1)
# Join the encoded df
df = df.join(one_hot)
df  
Out[]: 
       A  a  b  c
    0  a  0  1  0
    1  b  1  0  0
    2  a  0  0  1

Approach 2: Use Scikit-learn

Using a OneHotEncoder has the advantage of being able to fit on some training data and then transform on some other data using the same instance. We also have handle_unknown to further control what the encoder does with unseen data.

Given a dataset with three features and four samples, we let the encoder find the maximum value per feature and transform the data to a binary one-hot encoding.

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])   
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
   handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9], dtype=int32)
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

Here is the link for this example: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html


回答 1

使用Pandas进行基本的一键编码要容易得多。如果您正在寻找更多选项,可以使用scikit-learn

对于使用Pandas的基本一键编码,您只需将数据帧传递到get_dummies函数中。

例如,如果我有一个名为imdb_movies的数据

…并且我想对“额定值”列进行一次热编码,我只需要这样做:

pd.get_dummies(imdb_movies.Rated)

dataframe将为存在的每个“ 等级 ” 返回一个新的带有列的列,以及一个1或0,用于指定给定观察值的等级。

通常,我们希望将其作为原始文档的一部分dataframe。在这种情况下,我们只需使用“ column-binding ”将新的伪编码帧附加到原始帧即可。

我们可以使用Pandas concat函数进行列绑定:

rated_dummies = pd.get_dummies(imdb_movies.Rated)
pd.concat([imdb_movies, rated_dummies], axis=1)

现在,我们可以对全部数据进行分析dataframe

简单的功能

我建议您使自己成为实用工具,以快速完成此任务:

def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    return(res)

用法

encode_and_bind(imdb_movies, 'Rated')

结果

另外,按照@pmalbu注释,如果您希望该函数删除原始的feature_to_encode,请使用以下版本:

def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return(res) 

您可以同时对多个功能进行编码,如下所示:

features_to_encode = ['feature_1', 'feature_2', 'feature_3',
                      'feature_4']
for feature in features_to_encode:
    res = encode_and_bind(train_set, feature)

Much easier to use Pandas for basic one-hot encoding. If you’re looking for more options you can use scikit-learn.

For basic one-hot encoding with Pandas you pass your data frame into the get_dummies function.

For example, if I have a dataframe called imdb_movies:

…and I want to one-hot encode the Rated column, I do this:

pd.get_dummies(imdb_movies.Rated)

This returns a new dataframe with a column for every “level” of rating that exists, along with either a 1 or 0 specifying the presence of that rating for a given observation.

Usually, we want this to be part of the original dataframe. In this case, we attach our new dummy coded frame onto the original frame using “column-binding.

We can column-bind by using Pandas concat function:

rated_dummies = pd.get_dummies(imdb_movies.Rated)
pd.concat([imdb_movies, rated_dummies], axis=1)

We can now run an analysis on our full dataframe.

SIMPLE UTILITY FUNCTION

I would recommend making yourself a utility function to do this quickly:

def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    return(res)

Usage:

encode_and_bind(imdb_movies, 'Rated')

Result:

Also, as per @pmalbu comment, if you would like the function to remove the original feature_to_encode then use this version:

def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return(res) 

You can encode multiple features at the same time as follows:

features_to_encode = ['feature_1', 'feature_2', 'feature_3',
                      'feature_4']
for feature in features_to_encode:
    res = encode_and_bind(train_set, feature)

回答 2

您可以使用numpy.eye和使用数组元素选择机制来做到这一点:

import numpy as np
nb_classes = 6
data = [[2, 3, 4, 0]]

def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]

现在的返回值indices_to_one_hot(nb_classes, data)

array([[[ 0.,  0.,  1.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  1.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  1.,  0.],
        [ 1.,  0.,  0.,  0.,  0.,  0.]]])

使用.reshape(-1)可以确保您使用正确的标签格式(也可能使用[[2], [3], [4], [0]])。

You can do it with numpy.eye and a using the array element selection mechanism:

import numpy as np
nb_classes = 6
data = [[2, 3, 4, 0]]

def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]

The the return value of indices_to_one_hot(nb_classes, data) is now

array([[[ 0.,  0.,  1.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  1.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  1.,  0.],
        [ 1.,  0.,  0.,  0.,  0.,  0.]]])

The .reshape(-1) is there to make sure you have the right labels format (you might also have [[2], [3], [4], [0]]).


回答 3

首先,最简单的一种热编码方式是:使用Sklearn。

http://scikit-learn.org/stable/modules/generation/sklearn.preprocessing.OneHotEncoder.html

其次,我不认为使用pandas进行一次热编码就这么简单(不过未经证实)

在Pandas中为Python创建虚拟变量

最后,您是否有必要进行一次热编码?一种热编码以指数方式增加了功能数量,从而极大地增加了任何分类器或将要运行的任何其他对象的运行时间。尤其是当每个分类特征具有多个级别时。相反,您可以进行伪编码。

使用伪编码通常效果很好,运行时间和复杂性要少得多。一位明智的教授曾经告诉我,“少即是多”。

如果需要,这是我的自定义编码功能的代码。

from sklearn.preprocessing import LabelEncoder

#Auto encodes any dataframe column of type category or object.
def dummyEncode(df):
        columnsToEncode = list(df.select_dtypes(include=['category','object']))
        le = LabelEncoder()
        for feature in columnsToEncode:
            try:
                df[feature] = le.fit_transform(df[feature])
            except:
                print('Error encoding '+feature)
        return df

编辑:比较要更清楚:

一键编码:将n个级别转换为n-1列。

Index  Animal         Index  cat  mouse
  1     dog             1     0     0
  2     cat       -->   2     1     0
  3    mouse            3     0     1

如果分类功能中有许多不同的类型(或级别),则可以看到这将如何扩展您的内存。请记住,这只是一栏。

虚拟编码:

Index  Animal         Index  Animal
  1     dog             1      0   
  2     cat       -->   2      1 
  3    mouse            3      2

改为转换为数字表示形式。大大节省了功能空间,但以准确性为代价。

Firstly, easiest way to one hot encode: use Sklearn.

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Secondly, I don’t think using pandas to one hot encode is that simple (unconfirmed though)

Creating dummy variables in pandas for python

Lastly, is it necessary for you to one hot encode? One hot encoding exponentially increases the number of features, drastically increasing the run time of any classifier or anything else you are going to run. Especially when each categorical feature has many levels. Instead you can do dummy coding.

Using dummy encoding usually works well, for much less run time and complexity. A wise prof once told me, ‘Less is More’.

Here’s the code for my custom encoding function if you want.

from sklearn.preprocessing import LabelEncoder

#Auto encodes any dataframe column of type category or object.
def dummyEncode(df):
        columnsToEncode = list(df.select_dtypes(include=['category','object']))
        le = LabelEncoder()
        for feature in columnsToEncode:
            try:
                df[feature] = le.fit_transform(df[feature])
            except:
                print('Error encoding '+feature)
        return df

EDIT: Comparison to be clearer:

One-hot encoding: convert n levels to n-1 columns.

Index  Animal         Index  cat  mouse
  1     dog             1     0     0
  2     cat       -->   2     1     0
  3    mouse            3     0     1

You can see how this will explode your memory if you have many different types (or levels) in your categorical feature. Keep in mind, this is just ONE column.

Dummy Coding:

Index  Animal         Index  Animal
  1     dog             1      0   
  2     cat       -->   2      1 
  3    mouse            3      2

Convert to numerical representations instead. Greatly saves feature space, at the cost of a bit of accuracy.


回答 4

使用熊猫进行热编码非常简单:

def one_hot(df, cols):
    """
    @param df pandas DataFrame
    @param cols a list of columns to encode 
    @return a DataFrame with one-hot encoding
    """
    for each in cols:
        dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
        df = pd.concat([df, dummies], axis=1)
    return df

编辑:

使用sklearn的另一种方式one_hot LabelBinarizer

from sklearn.preprocessing import LabelBinarizer 
label_binarizer = LabelBinarizer()
label_binarizer.fit(all_your_labels_list) # need to be global or remembered to use it later

def one_hot_encode(x):
    """
    One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample Labels
    : return: Numpy array of one-hot encoded labels
    """
    return label_binarizer.transform(x)

One hot encoding with pandas is very easy:

def one_hot(df, cols):
    """
    @param df pandas DataFrame
    @param cols a list of columns to encode 
    @return a DataFrame with one-hot encoding
    """
    for each in cols:
        dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
        df = pd.concat([df, dummies], axis=1)
    return df

EDIT:

Another way to one_hot using sklearn’s LabelBinarizer :

from sklearn.preprocessing import LabelBinarizer 
label_binarizer = LabelBinarizer()
label_binarizer.fit(all_your_labels_list) # need to be global or remembered to use it later

def one_hot_encode(x):
    """
    One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample Labels
    : return: Numpy array of one-hot encoded labels
    """
    return label_binarizer.transform(x)

回答 5

您可以使用numpy.eye函数。

import numpy as np

def one_hot_encode(x, n_classes):
    """
    One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample Labels
    : return: Numpy array of one-hot encoded labels
     """
    return np.eye(n_classes)[x]

def main():
    list = [0,1,2,3,4,3,2,1,0]
    n_classes = 5
    one_hot_list = one_hot_encode(list, n_classes)
    print(one_hot_list)

if __name__ == "__main__":
    main()

结果

D:\Desktop>python test.py
[[ 1.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.]]

You can use numpy.eye function.

import numpy as np

def one_hot_encode(x, n_classes):
    """
    One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample Labels
    : return: Numpy array of one-hot encoded labels
     """
    return np.eye(n_classes)[x]

def main():
    list = [0,1,2,3,4,3,2,1,0]
    n_classes = 5
    one_hot_list = one_hot_encode(list, n_classes)
    print(one_hot_list)

if __name__ == "__main__":
    main()

Result

D:\Desktop>python test.py
[[ 1.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.]]

回答 6

pandas具有内置功能“ get_dummies”,可以对该特定列进行一次热编码。

一种热编码的行代码:

df=pd.concat([df,pd.get_dummies(df['column name'],prefix='column name')],axis=1).drop(['column name'],axis=1)

pandas as has inbuilt function “get_dummies” to get one hot encoding of that particular column/s.

one line code for one-hot-encoding:

df=pd.concat([df,pd.get_dummies(df['column name'],prefix='column name')],axis=1).drop(['column name'],axis=1)

回答 7

这是使用DictVectorizer和Pandas DataFrame.to_dict('records')方法的解决方案。

>>> import pandas as pd
>>> X = pd.DataFrame({'income': [100000,110000,90000,30000,14000,50000],
                      'country':['US', 'CAN', 'US', 'CAN', 'MEX', 'US'],
                      'race':['White', 'Black', 'Latino', 'White', 'White', 'Black']
                     })

>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer()
>>> qualitative_features = ['country','race']
>>> X_qual = v.fit_transform(X[qualitative_features].to_dict('records'))
>>> v.vocabulary_
{'country=CAN': 0,
 'country=MEX': 1,
 'country=US': 2,
 'race=Black': 3,
 'race=Latino': 4,
 'race=White': 5}

>>> X_qual.toarray()
array([[ 0.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  1.,  1.,  0.,  0.]])

Here is a solution using DictVectorizer and the Pandas DataFrame.to_dict('records') method.

>>> import pandas as pd
>>> X = pd.DataFrame({'income': [100000,110000,90000,30000,14000,50000],
                      'country':['US', 'CAN', 'US', 'CAN', 'MEX', 'US'],
                      'race':['White', 'Black', 'Latino', 'White', 'White', 'Black']
                     })

>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer()
>>> qualitative_features = ['country','race']
>>> X_qual = v.fit_transform(X[qualitative_features].to_dict('records'))
>>> v.vocabulary_
{'country=CAN': 0,
 'country=MEX': 1,
 'country=US': 2,
 'race=Black': 3,
 'race=Latino': 4,
 'race=White': 5}

>>> X_qual.toarray()
array([[ 0.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  1.,  1.,  0.,  0.]])

回答 8

一键编码比将值转换为指示符变量还需要更多一点。通常,机器学习过程要求您多次将此编码应用于验证或测试数据集,并将构造的模型应用于实时观察到的数据。您应该存储用于构造模型的映射(转换)。一个好的解决方案是使用DictVectorizeror LabelEncoder(后跟get_dummies。这是可以使用的函数:

def oneHotEncode2(df, le_dict = {}):
    if not le_dict:
        columnsToEncode = list(df.select_dtypes(include=['category','object']))
        train = True;
    else:
        columnsToEncode = le_dict.keys()   
        train = False;

    for feature in columnsToEncode:
        if train:
            le_dict[feature] = LabelEncoder()
        try:
            if train:
                df[feature] = le_dict[feature].fit_transform(df[feature])
            else:
                df[feature] = le_dict[feature].transform(df[feature])

            df = pd.concat([df, 
                              pd.get_dummies(df[feature]).rename(columns=lambda x: feature + '_' + str(x))], axis=1)
            df = df.drop(feature, axis=1)
        except:
            print('Error encoding '+feature)
            #df[feature]  = df[feature].convert_objects(convert_numeric='force')
            df[feature]  = df[feature].apply(pd.to_numeric, errors='coerce')
    return (df, le_dict)

这适用于pandas数据框,并为数据框的每一列创建并返回映射。因此,您可以这样称呼它:

train_data, le_dict = oneHotEncode2(train_data)

然后在测试数据上,通过传递训练返回的字典进行调用:

test_data, _ = oneHotEncode2(test_data, le_dict)

等效的方法是使用DictVectorizer。同一篇文章的相关文章在我的博客上。我在这里提到它,是因为它提供了这种方法背后的一些理由,而不仅仅是使用get_dummies 帖子 (公开:这是我自己的博客)。

One-hot encoding requires bit more than converting the values to indicator variables. Typically ML process requires you to apply this coding several times to validation or test data sets and applying the model you construct to real-time observed data. You should store the mapping (transform) that was used to construct the model. A good solution would use the DictVectorizer or LabelEncoder (followed by get_dummies. Here is a function that you can use:

def oneHotEncode2(df, le_dict = {}):
    if not le_dict:
        columnsToEncode = list(df.select_dtypes(include=['category','object']))
        train = True;
    else:
        columnsToEncode = le_dict.keys()   
        train = False;

    for feature in columnsToEncode:
        if train:
            le_dict[feature] = LabelEncoder()
        try:
            if train:
                df[feature] = le_dict[feature].fit_transform(df[feature])
            else:
                df[feature] = le_dict[feature].transform(df[feature])

            df = pd.concat([df, 
                              pd.get_dummies(df[feature]).rename(columns=lambda x: feature + '_' + str(x))], axis=1)
            df = df.drop(feature, axis=1)
        except:
            print('Error encoding '+feature)
            #df[feature]  = df[feature].convert_objects(convert_numeric='force')
            df[feature]  = df[feature].apply(pd.to_numeric, errors='coerce')
    return (df, le_dict)

This works on a pandas dataframe and for each column of the dataframe it creates and returns a mapping back. So you would call it like this:

train_data, le_dict = oneHotEncode2(train_data)

Then on the test data, the call is made by passing the dictionary returned back from training:

test_data, _ = oneHotEncode2(test_data, le_dict)

An equivalent method is to use DictVectorizer. A related post on the same is on my blog. I mention it here since it provides some reasoning behind this approach over simply using get_dummies post (disclosure: this is my own blog).


回答 9

您可以将数据传递给catboost分类器,而无需进行编码。Catboost通过执行一键式和目标扩展均值编码来自身处理分类变量。

You can pass the data to catboost classifier without encoding. Catboost handles categorical variables itself by performing one-hot and target expanding mean encoding.


回答 10

您也可以执行以下操作。请注意以下内容,您不必使用pd.concat

import pandas as pd 
# intialise data of lists. 
data = {'Color':['Red', 'Yellow', 'Red', 'Yellow'], 'Length':[20.1, 21.1, 19.1, 18.1],
       'Group':[1,2,1,2]} 

# Create DataFrame 
df = pd.DataFrame(data) 

for _c in df.select_dtypes(include=['object']).columns:
    print(_c)
    df[_c]  = pd.Categorical(df[_c])
df_transformed = pd.get_dummies(df)
df_transformed

您还可以将显式列更改为分类。例如,在这里我要更改ColorGroup

import pandas as pd 
# intialise data of lists. 
data = {'Color':['Red', 'Yellow', 'Red', 'Yellow'], 'Length':[20.1, 21.1, 19.1, 18.1],
       'Group':[1,2,1,2]} 

# Create DataFrame 
df = pd.DataFrame(data) 
columns_to_change = list(df.select_dtypes(include=['object']).columns)
columns_to_change.append('Group')
for _c in columns_to_change:
    print(_c)
    df[_c]  = pd.Categorical(df[_c])
df_transformed = pd.get_dummies(df)
df_transformed

You can do the following as well. Note for the below you don’t have to use pd.concat.

import pandas as pd 
# intialise data of lists. 
data = {'Color':['Red', 'Yellow', 'Red', 'Yellow'], 'Length':[20.1, 21.1, 19.1, 18.1],
       'Group':[1,2,1,2]} 

# Create DataFrame 
df = pd.DataFrame(data) 

for _c in df.select_dtypes(include=['object']).columns:
    print(_c)
    df[_c]  = pd.Categorical(df[_c])
df_transformed = pd.get_dummies(df)
df_transformed

You can also change explicit columns to categorical. For example, here I am changing the Color and Group

import pandas as pd 
# intialise data of lists. 
data = {'Color':['Red', 'Yellow', 'Red', 'Yellow'], 'Length':[20.1, 21.1, 19.1, 18.1],
       'Group':[1,2,1,2]} 

# Create DataFrame 
df = pd.DataFrame(data) 
columns_to_change = list(df.select_dtypes(include=['object']).columns)
columns_to_change.append('Group')
for _c in columns_to_change:
    print(_c)
    df[_c]  = pd.Categorical(df[_c])
df_transformed = pd.get_dummies(df)
df_transformed

回答 11

我知道我来晚了,但是以自动化方式对数据帧进行热编码的最简单方法是使用此功能:

def hot_encode(df):
    obj_df = df.select_dtypes(include=['object'])
    return pd.get_dummies(df, columns=obj_df.columns).values

I know I’m late to this party, but the simplest way to hot encode a dataframe in an automated way is to use this function:

def hot_encode(df):
    obj_df = df.select_dtypes(include=['object'])
    return pd.get_dummies(df, columns=obj_df.columns).values

回答 12

我在声学模型中使用了它:可能对您的模型有帮助。

def one_hot_encoding(x, n_out):
    x = x.astype(int)  
    shape = x.shape
    x = x.flatten()
    N = len(x)
    x_categ = np.zeros((N,n_out))
    x_categ[np.arange(N), x] = 1
    return x_categ.reshape((shape)+(n_out,))

I used this in my acoustic model: probably this helps in ur model.

def one_hot_encoding(x, n_out):
    x = x.astype(int)  
    shape = x.shape
    x = x.flatten()
    N = len(x)
    x_categ = np.zeros((N,n_out))
    x_categ[np.arange(N), x] = 1
    return x_categ.reshape((shape)+(n_out,))

回答 13

要添加其他问题,让我提供如何使用Numpy使用Python 2.0函数来实现它:

def one_hot(y_):
    # Function to encode output labels from number indexes 
    # e.g.: [[5], [0], [3]] --> [[0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]]

    y_ = y_.reshape(len(y_))
    n_values = np.max(y_) + 1
    return np.eye(n_values)[np.array(y_, dtype=np.int32)]  # Returns FLOATS

该行n_values = np.max(y_) + 1可能经过硬编码,以便在使用迷你批处理的情况下使用大量神经元。

使用此功能的演示项目/教程:https : //github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition

To add to other questions, let me provide how I did it with a Python 2.0 function using Numpy:

def one_hot(y_):
    # Function to encode output labels from number indexes 
    # e.g.: [[5], [0], [3]] --> [[0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]]

    y_ = y_.reshape(len(y_))
    n_values = np.max(y_) + 1
    return np.eye(n_values)[np.array(y_, dtype=np.int32)]  # Returns FLOATS

The line n_values = np.max(y_) + 1 could be hard-coded for you to use the good number of neurons in case you use mini-batches for example.

Demo project/tutorial where this function has been used: https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition


回答 14

这对我有用:

pandas.factorize( ['B', 'C', 'D', 'B'] )[0]

输出:

[0, 1, 2, 0]

This works for me:

pandas.factorize( ['B', 'C', 'D', 'B'] )[0]

Output:

[0, 1, 2, 0]

回答 15

它可以并且应该很容易:

class OneHotEncoder:
    def __init__(self,optionKeys):
        length=len(optionKeys)
        self.__dict__={optionKeys[j]:[0 if i!=j else 1 for i in range(length)] for j in range(length)}

用法:

ohe=OneHotEncoder(["A","B","C","D"])
print(ohe.A)
print(ohe.D)

It can and it should be easy as :

class OneHotEncoder:
    def __init__(self,optionKeys):
        length=len(optionKeys)
        self.__dict__={optionKeys[j]:[0 if i!=j else 1 for i in range(length)] for j in range(length)}

Usage :

ohe=OneHotEncoder(["A","B","C","D"])
print(ohe.A)
print(ohe.D)

回答 16

扩展@Martin Thoma的答案

def one_hot_encode(y):
    """Convert an iterable of indices to one-hot encoded labels."""
    y = y.flatten() # Sometimes not flattened vector is passed e.g (118,1) in these cases
    # the function ends up creating a tensor e.g. (118, 2, 1). flatten removes this issue
    nb_classes = len(np.unique(y)) # get the number of unique classes
    standardised_labels = dict(zip(np.unique(y), np.arange(nb_classes))) # get the class labels as a dictionary
    # which then is standardised. E.g imagine class labels are (4,7,9) if a vector of y containing 4,7 and 9 is
    # directly passed then np.eye(nb_classes)[4] or 7,9 throws an out of index error.
    # standardised labels fixes this issue by returning a dictionary;
    # standardised_labels = {4:0, 7:1, 9:2}. The values of the dictionary are mapped to keys in y array.
    # standardised_labels also removes the error that is raised if the labels are floats. E.g. 1.0; element
    # cannot be called by an integer index e.g y[1.0] - throws an index error.
    targets = np.vectorize(standardised_labels.get)(y) # map the dictionary values to array.
    return np.eye(nb_classes)[targets]

Expanding @Martin Thoma’s answer

def one_hot_encode(y):
    """Convert an iterable of indices to one-hot encoded labels."""
    y = y.flatten() # Sometimes not flattened vector is passed e.g (118,1) in these cases
    # the function ends up creating a tensor e.g. (118, 2, 1). flatten removes this issue
    nb_classes = len(np.unique(y)) # get the number of unique classes
    standardised_labels = dict(zip(np.unique(y), np.arange(nb_classes))) # get the class labels as a dictionary
    # which then is standardised. E.g imagine class labels are (4,7,9) if a vector of y containing 4,7 and 9 is
    # directly passed then np.eye(nb_classes)[4] or 7,9 throws an out of index error.
    # standardised labels fixes this issue by returning a dictionary;
    # standardised_labels = {4:0, 7:1, 9:2}. The values of the dictionary are mapped to keys in y array.
    # standardised_labels also removes the error that is raised if the labels are floats. E.g. 1.0; element
    # cannot be called by an integer index e.g y[1.0] - throws an index error.
    targets = np.vectorize(standardised_labels.get)(y) # map the dictionary values to array.
    return np.eye(nb_classes)[targets]

回答 17

简短答案

这是一个无需使用numpy,pandas或其他软件包即可进行一次热编码的函数。它需要一个整数,布尔值或字符串(可能还有其他类型)的列表。

import typing


def one_hot_encode(items: list) -> typing.List[list]:
    results = []
    # find the unique items (we want to unique items b/c duplicate items will have the same encoding)
    unique_items = list(set(items))
    # sort the unique items
    sorted_items = sorted(unique_items)
    # find how long the list of each item should be
    max_index = len(unique_items)

    for item in items:
        # create a list of zeros the appropriate length
        one_hot_encoded_result = [0 for i in range(0, max_index)]
        # find the index of the item
        one_hot_index = sorted_items.index(item)
        # change the zero at the index from the previous line to a one
        one_hot_encoded_result[one_hot_index] = 1
        # add the result
        results.append(one_hot_encoded_result)

    return results

例:

one_hot_encode([2, 1, 1, 2, 5, 3])

# [[0, 1, 0, 0],
#  [1, 0, 0, 0],
#  [1, 0, 0, 0],
#  [0, 1, 0, 0],
#  [0, 0, 0, 1],
#  [0, 0, 1, 0]]
one_hot_encode([True, False, True])

# [[0, 1], [1, 0], [0, 1]]
one_hot_encode(['a', 'b', 'c', 'a', 'e'])

# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1]]

长(长)答案

我知道这个问题已经有很多答案了,但是我注意到了两点。首先,大多数答案都使用numpy和/或pandas之类的软件包。这是一件好事。如果要编写生产代码,则可能应该使用健壮,快速的算法,例如numpy / pandas软件包中提供的算法。但是,出于教育的目的,我认为应该提供一个答案,该答案具有透明的算法,而不仅仅是其他人算法的实现。其次,我注意到许多答案没有提供可靠的一键编码实现,因为它们不满足以下要求之一。以下是一些有用,准确且健壮的一键编码功能的要求(如我所见):

一键编码功能必须:

  • 处理各种类型的列表(例如,整数,字符串,浮点数等)作为输入
  • 处理重复的输入列表
  • 返回与输入相对应的列表列表(顺序相同)
  • 返回列表列表,其中每个列表都尽可能短

我测试了这个问题的许多答案,但大多数都无法满足上述要求之一。

Short Answer

Here is a function to do one-hot-encoding without using numpy, pandas, or other packages. It takes a list of integers, booleans, or strings (and perhaps other types too).

import typing


def one_hot_encode(items: list) -> typing.List[list]:
    results = []
    # find the unique items (we want to unique items b/c duplicate items will have the same encoding)
    unique_items = list(set(items))
    # sort the unique items
    sorted_items = sorted(unique_items)
    # find how long the list of each item should be
    max_index = len(unique_items)

    for item in items:
        # create a list of zeros the appropriate length
        one_hot_encoded_result = [0 for i in range(0, max_index)]
        # find the index of the item
        one_hot_index = sorted_items.index(item)
        # change the zero at the index from the previous line to a one
        one_hot_encoded_result[one_hot_index] = 1
        # add the result
        results.append(one_hot_encoded_result)

    return results

Example:

one_hot_encode([2, 1, 1, 2, 5, 3])

# [[0, 1, 0, 0],
#  [1, 0, 0, 0],
#  [1, 0, 0, 0],
#  [0, 1, 0, 0],
#  [0, 0, 0, 1],
#  [0, 0, 1, 0]]
one_hot_encode([True, False, True])

# [[0, 1], [1, 0], [0, 1]]
one_hot_encode(['a', 'b', 'c', 'a', 'e'])

# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1]]

Long(er) Answer

I know there are already a lot of answers to this question, but I noticed two things. First, most of the answers use packages like numpy and/or pandas. And this is a good thing. If you are writing production code, you should probably be using robust, fast algorithms like those provided in the numpy/pandas packages. But, for the sake of education, I think someone should provide an answer which has a transparent algorithm and not just an implementation of someone else’s algorithm. Second, I noticed that many of the answers do not provide a robust implementation of one-hot encoding because they do not meet one of the requirements below. Below are some of the requirements (as I see them) for a useful, accurate, and robust one-hot encoding function:

A one-hot encoding function must:

  • handle list of various types (e.g. integers, strings, floats, etc.) as input
  • handle an input list with duplicates
  • return a list of lists corresponding (in the same order as) to the inputs
  • return a list of lists where each list is as short as possible

I tested many of the answers to this question and most of them fail on one of the requirements above.


回答 18

试试这个:

!pip install category_encoders
import category_encoders as ce

categorical_columns = [...the list of names of the columns you want to one-hot-encode ...]
encoder = ce.OneHotEncoder(cols=categorical_columns, use_cat_names=True)
df_train_encoded = encoder.fit_transform(df_train_small)

df_encoded.head()

生成的数据框df_train_encoded与原始数据框相同,但是现在将分类功能替换为它们的一键编码版本。

有关更多信息,请category_encoders 参见此处

Try this:

!pip install category_encoders
import category_encoders as ce

categorical_columns = [...the list of names of the columns you want to one-hot-encode ...]
encoder = ce.OneHotEncoder(cols=categorical_columns, use_cat_names=True)
df_train_encoded = encoder.fit_transform(df_train_small)

df_encoded.head()

The resulting dataframe df_train_encoded is the same as the original, but the categorical features are now replaced with their one-hot-encoded versions.

More information on category_encoders here.


回答 19

在这里,我尝试了这种方法:

import numpy as np
#converting to one_hot





def one_hot_encoder(value, datal):

    datal[value] = 1

    return datal


def _one_hot_values(labels_data):
    encoded = [0] * len(labels_data)

    for j, i in enumerate(labels_data):
        max_value = [0] * (np.max(labels_data) + 1)

        encoded[j] = one_hot_encoder(i, max_value)

    return np.array(encoded)

Here i tried with this approach :

import numpy as np
#converting to one_hot





def one_hot_encoder(value, datal):

    datal[value] = 1

    return datal


def _one_hot_values(labels_data):
    encoded = [0] * len(labels_data)

    for j, i in enumerate(labels_data):
        max_value = [0] * (np.max(labels_data) + 1)

        encoded[j] = one_hot_encoder(i, max_value)

    return np.array(encoded)

Python-sklearn.pipeline.Pipeline到底是什么?

问题:Python-sklearn.pipeline.Pipeline到底是什么?

我不知道如何sklearn.pipeline.Pipeline工作。

文档中有一些解释。例如,它们的意思是:

带有最终估算器的变换管线。

为了使我的问题更清楚,什么是steps?它们如何运作?

编辑

多亏了答案,我可以使我的问题更清楚:

当我调用管道并通过时,需要两个转换器和一个估计器,例如:

pipln = Pipeline([("trsfm1",transformer_1),
                  ("trsfm2",transformer_2),
                  ("estmtr",estimator)])

我叫这个怎么办?

pipln.fit()
OR
pipln.fit_transform()

我不知道估算器如何成为变压器以及如何装配变压器。

I can’t figure out how the sklearn.pipeline.Pipeline works exactly.

There are a few explanation in the doc. For example what do they mean by:

Pipeline of transforms with a final estimator.

To make my question clearer, what are steps? How do they work?

Edit

Thanks to the answers I can make my question clearer:

When I call pipeline and pass, as steps, two transformers and one estimator, e.g:

pipln = Pipeline([("trsfm1",transformer_1),
                  ("trsfm2",transformer_2),
                  ("estmtr",estimator)])

What happens when I call this?

pipln.fit()
OR
pipln.fit_transform()

I can’t figure out how an estimator can be a transformer and how a transformer can be fitted.


回答 0

scikit-learn中的Transformer-一些具有fit和transform方法或fit_transform方法的类。

预测器 -具有fit和预测方法或fit_predict方法的某些类。

管道只是一个抽象概念,它不是现有的ml算法。在ML任务中,通常需要在应用最终估计量之前对原始数据集执行不同变换的序列(查找特征集,生成新特征,仅选择一些良好特征)。

是管道使用的一个很好的例子。管道为您提供了所有3个转换步骤和最终估算器的单一界面。它在内部封装了转换器和预测变量,现在您可以执行以下操作:

    vect = CountVectorizer()
    tfidf = TfidfTransformer()
    clf = SGDClassifier()

    vX = vect.fit_transform(Xtrain)
    tfidfX = tfidf.fit_transform(vX)
    predicted = clf.fit_predict(tfidfX)

    # Now evaluate all steps on test set
    vX = vect.fit_transform(Xtest)
    tfidfX = tfidf.fit_transform(vX)
    predicted = clf.fit_predict(tfidfX)

只是:

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
predicted = pipeline.fit(Xtrain).predict(Xtrain)
# Now evaluate all steps on test set
predicted = pipeline.predict(Xtest)

使用管道,您可以轻松地针对该元估计器的每个步骤对一组参数执行网格搜索。如以上链接中所述。除了最后一个步骤以外的所有步骤都必须是转换步骤,最后一个步骤可以是变换器或预测值。 编辑答案:调用时pipln.fit()-管道中的每个变压器都将安装在先前变压器的输出上(从原始数据集获悉第一个变压器)。最后一个估计器可以是转换器或预测器,仅当您的最后一个估计器是转换器(可以实现fit_transform或分别转换和拟合方法)时,才可以在管道上调用fit_transform(),仅在以下情况下可以在管道上调用fit_predict()或dictate():您的最后一个估算器是预测器。因此,您无法调用fit_transform或在管道上进行转换,而最后一步是预测变量。

Transformer in scikit-learn – some class that have fit and transform method, or fit_transform method.

Predictor – some class that has fit and predict methods, or fit_predict method.

Pipeline is just an abstract notion, it’s not some existing ml algorithm. Often in ML tasks you need to perform sequence of different transformations (find set of features, generate new features, select only some good features) of raw dataset before applying final estimator.

Here is a good example of Pipeline usage. Pipeline gives you a single interface for all 3 steps of transformation and resulting estimator. It encapsulates transformers and predictors inside, and now you can do something like:

    vect = CountVectorizer()
    tfidf = TfidfTransformer()
    clf = SGDClassifier()

    vX = vect.fit_transform(Xtrain)
    tfidfX = tfidf.fit_transform(vX)
    predicted = clf.fit_predict(tfidfX)

    # Now evaluate all steps on test set
    vX = vect.fit_transform(Xtest)
    tfidfX = tfidf.fit_transform(vX)
    predicted = clf.fit_predict(tfidfX)

With just:

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
predicted = pipeline.fit(Xtrain).predict(Xtrain)
# Now evaluate all steps on test set
predicted = pipeline.predict(Xtest)

With pipelines you can easily perform a grid-search over set of parameters for each step of this meta-estimator. As described in the link above. All steps except last one must be transforms, last step can be transformer or predictor. Answer to edit: When you call pipln.fit() – each transformer inside pipeline will be fitted on outputs of previous transformer (First transformer is learned on raw dataset). Last estimator may be transformer or predictor, you can call fit_transform() on pipeline only if your last estimator is transformer (that implements fit_transform, or transform and fit methods separately), you can call fit_predict() or predict() on pipeline only if your last estimator is predictor. So you just can’t call fit_transform or transform on pipeline, last step of which is predictor.


回答 1

我认为M0rkHaV有正确的想法。Scikit学习的管道类是用于封装多个不同的变压器旁边的估计到一个对象,一个有用的工具,这样你只需要打电话给你一次(重要的方法fit()predict()等等)。让我们分解两个主要部分:

  1. 变形金刚是同时实现fit()和的类transform()。您可能熟悉一些sklearn预处理工具,例如TfidfVectorizerBinarizer。如果查看这些预处理工具的文档,就会发现它们实现了这两种方法。我觉得很酷的是,一些估算器也可以用作转换步骤,例如LinearSVC

  2. 估算器是同时实现fit()和的类predict()。您会发现许多分类器和回归模型都实现了这两种方法,因此您可以轻松地测试许多不同的模型。可以使用另一个转换器作为最终估计量(即,它不一定实现predict(),但肯定实现fit())。这意味着您不能打电话predict()

至于您的编辑:让我们来看一个基于文本的示例。使用LabelBinarizer,我们希望将标签列表转换为二进制值列表。

bin = LabelBinarizer()  #first we initialize

vec = ['cat', 'dog', 'dog', 'dog'] #we have our label list we want binarized

现在,当二进制化器适合某些数据时,它将具有一个称为的结构classes_,其中包含转换器“知道”的唯一类。如果不调用fit()Binarizer,则不知道数据的外观,因此调用transform()没有任何意义。如果在尝试拟合数据之前打印出类的列表,则为true。

print bin.classes_  

尝试此操作时出现以下错误:

AttributeError: 'LabelBinarizer' object has no attribute 'classes_'

但是,当您将二值化器放在vec列表中时:

bin.fit(vec)

然后再试一次

print bin.classes_

我得到以下内容:

['cat' 'dog']


print bin.transform(vec)

现在,在对vec对象调用transform之后,我们得到以下信息:

[[0]
 [1]
 [1]
 [1]]

至于用作转换器的估计器,让我们以DecisionTree分类器为特征提取器的示例。决策树之所以出色,有很多原因,但是出于我们的目的,重要的是决策树能够对那些认为对预测有用的要素进行排名。当你调用transform()一个决策树,它将把你的输入数据,并查找认为是最重要的特征。因此,您可以考虑将数据矩阵(n行乘m列)转换为较小的矩阵(n行乘k列),其中k列是决策树发现的k个最重要的特征。

I think that M0rkHaV has the right idea. Scikit-learn’s pipeline class is a useful tool for encapsulating multiple different transformers alongside an estimator into one object, so that you only have to call your important methods once (fit(), predict(), etc). Let’s break down the two major components:

  1. Transformers are classes that implement both fit() and transform(). You might be familiar with some of the sklearn preprocessing tools, like TfidfVectorizer and Binarizer. If you look at the docs for these preprocessing tools, you’ll see that they implement both of these methods. What I find pretty cool is that some estimators can also be used as transformation steps, e.g. LinearSVC!

  2. Estimators are classes that implement both fit() and predict(). You’ll find that many of the classifiers and regression models implement both these methods, and as such you can readily test many different models. It is possible to use another transformer as the final estimator (i.e., it doesn’t necessarily implement predict(), but definitely implements fit()). All this means is that you wouldn’t be able to call predict().

As for your edit: let’s go through a text-based example. Using LabelBinarizer, we want to turn a list of labels into a list of binary values.

bin = LabelBinarizer()  #first we initialize

vec = ['cat', 'dog', 'dog', 'dog'] #we have our label list we want binarized

Now, when the binarizer is fitted on some data, it will have a structure called classes_ that contains the unique classes that the transformer ‘knows’ about. Without calling fit() the binarizer has no idea what the data looks like, so calling transform() wouldn’t make any sense. This is true if you print out the list of classes before trying to fit the data.

print bin.classes_  

I get the following error when trying this:

AttributeError: 'LabelBinarizer' object has no attribute 'classes_'

But when you fit the binarizer on the vec list:

bin.fit(vec)

and try again

print bin.classes_

I get the following:

['cat' 'dog']


print bin.transform(vec)

And now, after calling transform on the vec object, we get the following:

[[0]
 [1]
 [1]
 [1]]

As for estimators being used as transformers, let us use the DecisionTree classifier as an example of a feature-extractor. Decision Trees are great for a lot of reasons, but for our purposes, what’s important is that they have the ability to rank features that the tree found useful for predicting. When you call transform() on a Decision Tree, it will take your input data and find what it thinks are the most important features. So you can think of it transforming your data matrix (n rows by m columns) into a smaller matrix (n rows by k columns), where the k columns are the k most important features that the Decision Tree found.


回答 2

ML算法通常处理表格数据。您可能需要在ML算法之前和之后对该数据进行预处理和后处理。管道是链接这些数据处理步骤的一种方式。

什么是ML管道,它们如何工作?

管道是转换数据的一系列步骤。它来自旧的“管道和过滤器”设计模式(例如,您可以想到带有管道“ |”的unix bash命令或重定向运算符“>”)。但是,管道是代码中的对象。因此,您可能为每个过滤器(又称为每个管道步骤)都有一个类,然后是另一个将这些步骤组合到最终管道中的类。一些管道可能将其他管道串联或并联组合,具有多个输入或输出,依此类推。我们喜欢将机器学习管道视为:

  • 管道和过滤器。管道的步骤处理数据,并且它们管理可以从数据中学到的内部状态。
  • 复合材料。管道可以嵌套:例如,整个管道可以视为另一个管道中的单个管道步骤。流水线步骤不一定是流水线,但根据定义,流水线本身至少是流水线步骤。
  • 有向无环图(DAG)。流水线步骤的输出可以发送到许多其他步骤,然后可以重新组合生成的输出,依此类推。旁注:尽管管道是非循环的,但它们可以一个接一个地处理多个项目,并且如果它们的状态发生变化(例如:每次使用fit_transform方法),那么它们可以被视为随着时间的流逝不断展开,保持其状态(例如RNN)。这是一种有趣的方式,可用于在生产中进行在线学习并在更多数据上对其进行培训时进行在线学习。

Scikit-Learn管道的方法

管道(或管道中的步骤)必须具有以下两种方法

  • 适合 ”以学习数据并获取状态(例如:神经网络的神经权重就是这种状态)
  • 转换 ”(或“预测”)以实际处理数据并生成预测。

也可以调用此方法来链接两者:

  • fit_transform ”可以拟合然后转换数据,但是要一次通过,当必须直接一个接一个地执行这两种方法时,可以进行潜在的代码优化。

sklearn.pipeline.Pipeline类的问题

Scikit-Learn的“管道和过滤器”设计模式非常漂亮。但是如何将其用于深度学习,AutoML和复杂的生产级管道?

Scikit-Learn于2007年首次发布,那是一个深度学习纪。但是,它是最著名和采用最广泛的机器学习库之一,并且仍在增长。最重要的是,它使用“管道和过滤器”设计模式作为软件体系结构样式-这就是Scikit-Learn如此出色的原因,此外它还提供了可供使用的算法。但是,在执行以下操作时会遇到很多问题,我们应该能够在2020年做到这一点:

  • 自动机器学习(AutoML),
  • 深度学习管道,
  • 更复杂的机器学习管道。

我们为那些Scikit-Learn问题找到的解决方案

当然,Scikit-Learn非常方便且结构精良。但是,它需要刷新。这是我们与Neuraxle的解决方案,使Scikit-Learn在现代计算项目中变得新鲜和可用!

通过Neuraxle提供的其他管道方法和功能

注意:如果管道的某个步骤不需要使用fit或transform方法之一,则它可以从NonFittableMixinNonTransformableMixin继承,以提供这些方法之一的默认实现而不执行任何操作。

首先,管道或其步骤还可以选择定义这些方法

  • setup ”,将在每个步骤中调用“ setup”方法。例如,如果某个步骤包含TensorFlow,PyTorch或Keras神经网络,则这些步骤可以创建它们的神经图,并在适合之前通过“设置”方法将它们注册到GPU。不建议在步骤的构造函数中直接创建图形,这有几个原因,例如,如果在自动机器学习算法中使用不同的超参数多次运行之前复制了这些步骤,然后自动为您搜索最佳的超参数。
  • 拆解 ”,与“设置”方法相反:它清除资源。

默认提供以下方法,使管理的超参数:

  • get_hyperparams ”将为您返回超参数的字典。如果您的管道包含更多的管道(嵌套管道),则超参数的键将用双下划线“ __”分隔符链接。
  • set_hyperparams ”将允许您以获取时的相同格式设置新的超参数。
  • get_hyperparams_space ”允许您获取超参数的空间,如果您定义了超参数的空间,则该空间不会为空。因此,这里与“ get_hyperparams”的唯一区别是,您将获得统计分布作为值而不是精确值。例如,层数的一个超参数可以是a RandInt(1, 3),表示1到3层。您可以调用.rvs()此dict随机选择一个值,并将其发送到“ set_hyperparams”以尝试对其进行训练。
  • set_hyperparams_space ”可用于使用与“ get_hyperparams_space ”中相同的超参数分布类来设置新空间。

有关建议的解决方案的更多信息,请阅读上面带有链接的大列表中的条目。

ML algorithms typically process tabular data. You may want to do preprocessing and post-processing of this data before and after your ML algorithm. A pipeline is a way to chain those data processing steps.

What are ML pipelines and how do they work?

A pipeline is a series of steps in which data is transformed. It comes from the old “pipe and filter” design pattern (for instance, you could think of unix bash commands with pipes “|” or redirect operators “>”). However, pipelines are objects in the code. Thus, you may have a class for each filter (a.k.a. each pipeline step), and then another class to combine those steps into the final pipeline. Some pipelines may combine other pipelines in series or in parallel, have multiple inputs or outputs, and so on. We like to view Machine Learning pipelines as:

  • Pipe and filters. The pipeline’s steps process data, and they manage their inner state which can be learned from the data.
  • Composites. Pipelines can be nested: for example a whole pipeline can be treated as a single pipeline step in another pipeline. A pipeline step is not necessarily a pipeline, but a pipeline is itself at least a pipeline step by definition.
  • Directed Acyclic Graphs (DAG). A pipeline step’s output may be sent to many other steps, and then the resulting outputs can be recombined, and so on. Side note: despite pipelines are acyclic, they can process multiple items one by one, and if their state change (e.g.: using the fit_transform method each time), then they can be viewed as recurrently unfolding through time, keeping their states (think like an RNN). That’s an interesting way to see pipelines for doing online learning when putting them in production and training them on more data.

Methods of a Scikit-Learn Pipeline

Pipelines (or steps in the pipeline) must have those two methods:

  • fit” to learn on the data and acquire state (e.g.: neural network’s neural weights are such state)
  • transform” (or “predict”) to actually process the data and generate a prediction.

It’s also possible to call this method to chain both:

  • fit_transform” to fit and then transform the data, but in one pass, which allows for potential code optimizations when the two methods must be done one after the other directly.

Problems of the sklearn.pipeline.Pipeline class

Scikit-Learn’s “pipe and filter” design pattern is simply beautiful. But how to use it for Deep Learning, AutoML, and complex production-level pipelines?

Scikit-Learn had its first release in 2007, which was a pre deep learning era. However, it’s one of the most known and adopted machine learning library, and is still growing. On top of all, it uses the Pipe and Filter design pattern as a software architectural style – it’s what makes Scikit-Learn so fabulous, added to the fact it provides algorithms ready for use. However, it has massive issues when it comes to do the following, which we should be able to do in 2020 already:

  • Automatic Machine Learning (AutoML),
  • Deep Learning Pipelines,
  • More complex Machine Learning pipelines.

Solutions that we’ve Found to Those Scikit-Learn’s Problems

For sure, Scikit-Learn is very convenient and well-built. However, it needs a refresh. Here are our solutions with Neuraxle to make Scikit-Learn fresh and useable within modern computing projects!

Additional pipeline methods and features offered through Neuraxle

Note: if a step of a pipeline doesn’t need to have one of the fit or transform methods, it could inherit from NonFittableMixin or NonTransformableMixin to be provided a default implementation of one of those methods to do nothing.

As a starter, it is possible for pipelines or their steps to also optionally define those methods:

  • setup” which will call the “setup” method on each of its step. For instance, if a step contains a TensorFlow, PyTorch, or Keras neural network, the steps could create their neural graphs and register them to the GPU in the “setup” method before fit. It is discouraged to create the graphs directly in the constructors of the steps for several reasons, such as if the steps are copied before running many times with different hyperparameters within an Automatic Machine Learning algorithm that searches for the best hyperparameters for you.
  • teardown”, which is the opposite of the “setup” method: it clears resources.

The following methods are provided by default to allow for managing hyperparameters:

  • get_hyperparams” will return you a dictionary of the hyperparameters. If your pipeline contains more pipelines (nested pipelines), then the hyperparameter’ keys are chained with double underscores “__” separators.
  • set_hyperparams” will allow you to set new hyperparameters in the same format of when you get them.
  • get_hyperparams_space” allows you to get the space of hyperparameter, which will be not empty if you defined one. So, the only difference with “get_hyperparams” here is that you’ll get statistic distributions as values instead of a precise value. For instance, one hyperparameter for the number of layers could be a RandInt(1, 3) which means 1 to 3 layers. You can call .rvs() on this dict to pick a value randomly and send it to “set_hyperparams” to try training on it.
  • set_hyperparams_space” can be used to set a new space using the same hyperparameter distribution classes as in “get_hyperparams_space”.

For more info on our suggested solutions, read the entries in the big list with links above.


Tensorflow后端的Keras能否被迫随意使用CPU或GPU?

问题:Tensorflow后端的Keras能否被迫随意使用CPU或GPU?

我安装了Tensorflow后端和CUDA的Keras。我有时想按需强迫Keras使用CPU。不用说在虚拟环境中安装单独的仅CPU的Tensorflow就能做到吗?如果可以,怎么办?如果后端是Theano,则可以设置标志,但是我还没有听说过可以通过Keras访问Tensorflow标志。

I have Keras installed with the Tensorflow backend and CUDA. I’d like to sometimes on demand force Keras to use CPU. Can this be done without say installing a separate CPU-only Tensorflow in a virtual environment? If so how? If the backend were Theano, the flags could be set, but I have not heard of Tensorflow flags accessible via Keras.


回答 0

如果要强制Keras使用CPU

方式1

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"] = ""

在导入Keras / Tensorflow之前。

方式二

运行脚本为

$ CUDA_VISIBLE_DEVICES="" ./your_keras_code.py

也可以看看

  1. https://github.com/keras-team/keras/issues/152
  2. https://github.com/fchollet/keras/issues/4613

If you want to force Keras to use CPU

Way 1

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"] = ""

before Keras / Tensorflow is imported.

Way 2

Run your script as

$ CUDA_VISIBLE_DEVICES="" ./your_keras_code.py

See also

  1. https://github.com/keras-team/keras/issues/152
  2. https://github.com/fchollet/keras/issues/4613

回答 1

一个相当分离的方法是使用

import tensorflow as tf
from keras import backend as K

num_cores = 4

if GPU:
    num_GPU = 1
    num_CPU = 1
if CPU:
    num_CPU = 1
    num_GPU = 0

config = tf.ConfigProto(intra_op_parallelism_threads=num_cores,
                        inter_op_parallelism_threads=num_cores, 
                        allow_soft_placement=True,
                        device_count = {'CPU' : num_CPU,
                                        'GPU' : num_GPU}
                       )

session = tf.Session(config=config)
K.set_session(session)

在此处,通过booleans GPUCPU,我们通过严格定义允许Tensorflow会话访问的GPU和CPU的数量来指示我们是否要使用GPU或CPU运行代码。变量num_GPUnum_CPU定义该值。num_cores然后通过intra_op_parallelism_threads和设置可供使用的CPU内核数inter_op_parallelism_threads

intra_op_parallelism_threads变量指示在计算图中单个节点中并行操作被允许使用(内部)的线程数。虽然inter_ops_parallelism_threads变量定义了跨计算图节点(并行)进行并行操作可访问的线程数。

allow_soft_placement 如果满足以下任一条件,则允许在CPU上运行操作:

  1. 该操作没有GPU实现

  2. 没有已知或注册的GPU设备

  3. 需要与CPU的其他输入一起放置

所有这些都在任何其他操作之前在我的类的构造函数中执行,并且可以与我使用的任何模型或其他代码完全分开。

注意:这要求tensorflow-gpucuda/ cudnn要安装,因为提供了使用GPU的选项。

参考:

A rather separable way of doing this is to use

import tensorflow as tf
from keras import backend as K

num_cores = 4

if GPU:
    num_GPU = 1
    num_CPU = 1
if CPU:
    num_CPU = 1
    num_GPU = 0

config = tf.ConfigProto(intra_op_parallelism_threads=num_cores,
                        inter_op_parallelism_threads=num_cores, 
                        allow_soft_placement=True,
                        device_count = {'CPU' : num_CPU,
                                        'GPU' : num_GPU}
                       )

session = tf.Session(config=config)
K.set_session(session)

Here, with booleans GPU and CPU, we indicate whether we would like to run our code with the GPU or CPU by rigidly defining the number of GPUs and CPUs the Tensorflow session is allowed to access. The variables num_GPU and num_CPU define this value. num_cores then sets the number of CPU cores available for usage via intra_op_parallelism_threads and inter_op_parallelism_threads.

The intra_op_parallelism_threads variable dictates the number of threads a parallel operation in a single node in the computation graph is allowed to use (intra). While the inter_ops_parallelism_threads variable defines the number of threads accessible for parallel operations across the nodes of the computation graph (inter).

allow_soft_placement allows for operations to be run on the CPU if any of the following criterion are met:

  1. there is no GPU implementation for the operation

  2. there are no GPU devices known or registered

  3. there is a need to co-locate with other inputs from the CPU

All of this is executed in the constructor of my class before any other operations, and is completely separable from any model or other code I use.

Note: This requires tensorflow-gpu and cuda/cudnn to be installed because the option is given to use a GPU.

Refs:


回答 2

这对我有用(win10),在导入keras之前放置:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

This worked for me (win10), place before you import keras:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

回答 3

只需导入tensortflow并使用keras,就这么简单。

import tensorflow as tf
# your code here
with tf.device('/gpu:0'):
    model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)

Just import tensortflow and use keras, it’s that easy.

import tensorflow as tf
# your code here
with tf.device('/gpu:0'):
    model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)

回答 4

按照keras 教程,您可以简单地使用与tf.device常规tensorflow中相同的作用域:

with tf.device('/gpu:0'):
    x = tf.placeholder(tf.float32, shape=(None, 20, 64))
    y = LSTM(32)(x)  # all ops in the LSTM layer will live on GPU:0

with tf.device('/cpu:0'):
    x = tf.placeholder(tf.float32, shape=(None, 20, 64))
    y = LSTM(32)(x)  # all ops in the LSTM layer will live on CPU:0

As per keras tutorial, you can simply use the same tf.device scope as in regular tensorflow:

with tf.device('/gpu:0'):
    x = tf.placeholder(tf.float32, shape=(None, 20, 64))
    y = LSTM(32)(x)  # all ops in the LSTM layer will live on GPU:0

with tf.device('/cpu:0'):
    x = tf.placeholder(tf.float32, shape=(None, 20, 64))
    y = LSTM(32)(x)  # all ops in the LSTM layer will live on CPU:0

回答 5

我只是花了一些时间弄清楚。Thoma的答案不完整。假设您的程序是test.py,您想使用gpu0来运行该程序,并使其他gpus保持空闲。

你应该写 CUDA_VISIBLE_DEVICES=0 python test.py

注意DEVICES不是DEVICE

I just spent some time figure it out. Thoma’s answer is not complete. Say your program is test.py, you want to use gpu0 to run this program, and keep other gpus free.

You should write CUDA_VISIBLE_DEVICES=0 python test.py

Notice it’s DEVICES not DEVICE


回答 6

对于从事PyCharm并强制使用CPU的人员,您可以在“运行/调试”配置的“环境变量”下添加以下行:

<OTHER_ENVIRONMENT_VARIABLES>;CUDA_VISIBLE_DEVICES=-1

For people working on PyCharm, and for forcing CPU, you can add the following line in the Run/Debug configuration, under Environment variables:

<OTHER_ENVIRONMENT_VARIABLES>;CUDA_VISIBLE_DEVICES=-1

如何在TensorFlow中应用梯度裁剪?

问题:如何在TensorFlow中应用梯度裁剪?

考虑示例代码

我想知道如何在RNN上的该网络上应用梯度剪切,而梯度可能会爆炸。

tf.clip_by_value(t, clip_value_min, clip_value_max, name=None)

这是一个可以使用的示例,但是在哪里介绍呢?在RNN中

    lstm_cell = rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0)
    # Split data because rnn cell needs a list of inputs for the RNN inner loop
    _X = tf.split(0, n_steps, _X) # n_steps
tf.clip_by_value(_X, -1, 1, name=None)

但这没有意义,因为张量_X是输入,而不是grad,要裁剪的内容是什么?

我是否需要为此定义自己的优化器,还是有一个更简单的选择?

Considering the example code.

I would like to know How to apply gradient clipping on this network on the RNN where there is a possibility of exploding gradients.

tf.clip_by_value(t, clip_value_min, clip_value_max, name=None)

This is an example that could be used but where do I introduce this ? In the def of RNN

    lstm_cell = rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0)
    # Split data because rnn cell needs a list of inputs for the RNN inner loop
    _X = tf.split(0, n_steps, _X) # n_steps
tf.clip_by_value(_X, -1, 1, name=None)

But this doesn’t make sense as the tensor _X is the input and not the grad what is to be clipped?

Do I have to define my own Optimizer for this or is there a simpler option?


回答 0

在计算梯度之后,但在应用梯度更新模型参数之前,需要进行梯度修剪。在您的示例中,这两种AdamOptimizer.minimize()方法均由该方法处理。

为了裁剪您的渐变,您需要按照TensorFlow API文档本节中的说明显式计算,裁剪和应用它们。具体来说,您需要minimize()用以下类似的方法代替对方法的调用:

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gvs = optimizer.compute_gradients(cost)
capped_gvs = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gvs]
train_op = optimizer.apply_gradients(capped_gvs)

Gradient clipping needs to happen after computing the gradients, but before applying them to update the model’s parameters. In your example, both of those things are handled by the AdamOptimizer.minimize() method.

In order to clip your gradients you’ll need to explicitly compute, clip, and apply them as described in this section in TensorFlow’s API documentation. Specifically you’ll need to substitute the call to the minimize() method with something like the following:

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gvs = optimizer.compute_gradients(cost)
capped_gvs = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gvs]
train_op = optimizer.apply_gradients(capped_gvs)

回答 1

尽管看起来很流行,但您可能希望通过其全局范数来裁剪整个渐变:

optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimize = optimizer.apply_gradients(zip(gradients, variables))

分别裁剪每个渐变矩阵会更改其相对比例,但是也可以:

optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients = [
    None if gradient is None else tf.clip_by_norm(gradient, 5.0)
    for gradient in gradients]
optimize = optimizer.apply_gradients(zip(gradients, variables))

在TensorFlow 2中,磁带计算梯度,优化器来自Keras,我们不需要存储更新操作,因为它会自动运行而不将其传递给会话:

optimizer = tf.keras.optimizers.Adam(1e-3)
# ...
with tf.GradientTape() as tape:
  loss = ...
variables = ...
gradients = tape.gradient(loss, variables)
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimizer.apply_gradients(zip(gradients, variables))

Despite what seems to be popular, you probably want to clip the whole gradient by its global norm:

optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimize = optimizer.apply_gradients(zip(gradients, variables))

Clipping each gradient matrix individually changes their relative scale but is also possible:

optimizer = tf.train.AdamOptimizer(1e-3)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients = [
    None if gradient is None else tf.clip_by_norm(gradient, 5.0)
    for gradient in gradients]
optimize = optimizer.apply_gradients(zip(gradients, variables))

In TensorFlow 2, a tape computes the gradients, the optimizers come from Keras, and we don’t need to store the update op because it runs automatically without passing it to a session:

optimizer = tf.keras.optimizers.Adam(1e-3)
# ...
with tf.GradientTape() as tape:
  loss = ...
variables = ...
gradients = tape.gradient(loss, variables)
gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
optimizer.apply_gradients(zip(gradients, variables))

回答 2

实际上在文档中对此做了正确解释。

调用minimum()既要计算梯度,又要将其应用于变量。如果要在应用渐变之前对其进行处理,则可以分三步使用优化器:

  • 使用compute_gradients()计算梯度。
  • 根据需要处理渐变。
  • 使用apply_gradients()应用处理后的渐变。

在他们提供的示例中,他们使用以下3个步骤:

# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable).  Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)

MyCapper是限制渐变的任何函数。有用的功能列表(除外tf.clip_by_value())在此处

This is actually properly explained in the documentation.:

Calling minimize() takes care of both computing the gradients and applying them to the variables. If you want to process the gradients before applying them you can instead use the optimizer in three steps:

  • Compute the gradients with compute_gradients().
  • Process the gradients as you wish.
  • Apply the processed gradients with apply_gradients().

And in the example they provide they use these 3 steps:

# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable).  Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)

Here MyCapper is any function that caps your gradient. The list of useful functions (other than tf.clip_by_value()) is here.


回答 3

对于那些想了解梯度裁剪的想法(按规范)的人:

每当梯度范数大于特定阈值时,我们都会修剪梯度范数,以使其保持在阈值之内。此阈值有时设置为5

令梯度为g,max_norm_threshold为j

现在,如果|| g || > j,我们这样做:

g =( j * g)/ || G ||

这是在 tf.clip_by_norm

For those who would like to understand the idea of gradient clipping (by norm):

Whenever the gradient norm is greater than a particular threshold, we clip the gradient norm so that it stays within the threshold. This threshold is sometimes set to 5.

Let the gradient be g and the max_norm_threshold be j.

Now, if ||g|| > j , we do:

g = ( j * g ) / ||g||

This is the implementation done in tf.clip_by_norm


回答 4

IMO最好的解决方案是用TF的估算器装饰器包装优化器tf.contrib.estimator.clip_gradients_by_norm

original_optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
optimizer = tf.contrib.estimator.clip_gradients_by_norm(original_optimizer, clip_norm=5.0)
train_op = optimizer.minimize(loss)

这样,您只需要定义一次,而不必在每次梯度计算后运行它。

文档:https : //www.tensorflow.org/api_docs/python/tf/contrib/estimator/clip_gradients_by_norm

IMO the best solution is wrapping your optimizer with TF’s estimator decorator tf.contrib.estimator.clip_gradients_by_norm:

original_optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
optimizer = tf.contrib.estimator.clip_gradients_by_norm(original_optimizer, clip_norm=5.0)
train_op = optimizer.minimize(loss)

This way you only have to define this once, and not run it after every gradients calculation.

Documentation: https://www.tensorflow.org/api_docs/python/tf/contrib/estimator/clip_gradients_by_norm


回答 5

梯度修剪基本上可以在梯度爆炸或消失的情况下起到帮助作用。说您的损失太高,将会导致指数梯度流经网络,可能导致Nan值。为了克服这个问题,我们将梯度裁剪在特定范围内(-1到1或根据条件的任何范围)。

clipped_value=tf.clip_by_value(grad, -range, +range), var) for grad, var in grads_and_vars

其中grads _and_vars是渐变对(您可以通过tf.compute_gradients计算)及其变量。

裁剪后,我们只需使用优化器即可应用其值。 optimizer.apply_gradients(clipped_value)

Gradient Clipping basically helps in case of exploding or vanishing gradients.Say your loss is too high which will result in exponential gradients to flow through the network which may result in Nan values . To overcome this we clip gradients within a specific range (-1 to 1 or any range as per condition) .

clipped_value=tf.clip_by_value(grad, -range, +range), var) for grad, var in grads_and_vars

where grads _and_vars are the pairs of gradients (which you calculate via tf.compute_gradients) and their variables they will be applied to.

After clipping we simply apply its value using an optimizer. optimizer.apply_gradients(clipped_value)


如何使用scikit learning计算多类案例的精度,召回率,准确性和f1-得分?

问题:如何使用scikit learning计算多类案例的精度,召回率,准确性和f1-得分?

我正在研究情绪分析问题,数据看起来像这样:

label instances
    5    1190
    4     838
    3     239
    1     204
    2     127

所以,我的数据是不平衡的,因为1190 instances标有5。对于使用scikit的SVC进行的分类Im 。问题是我不知道如何以正确的方式平衡我的数据,以便准确计算多类案例的精度,查全率,准确性和f1得分。因此,我尝试了以下方法:

第一:

    wclf = SVC(kernel='linear', C= 1, class_weight={1: 10})
    wclf.fit(X, y)
    weighted_prediction = wclf.predict(X_test)

print 'Accuracy:', accuracy_score(y_test, weighted_prediction)
print 'F1 score:', f1_score(y_test, weighted_prediction,average='weighted')
print 'Recall:', recall_score(y_test, weighted_prediction,
                              average='weighted')
print 'Precision:', precision_score(y_test, weighted_prediction,
                                    average='weighted')
print '\n clasification report:\n', classification_report(y_test, weighted_prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, weighted_prediction)

第二:

auto_wclf = SVC(kernel='linear', C= 1, class_weight='auto')
auto_wclf.fit(X, y)
auto_weighted_prediction = auto_wclf.predict(X_test)

print 'Accuracy:', accuracy_score(y_test, auto_weighted_prediction)

print 'F1 score:', f1_score(y_test, auto_weighted_prediction,
                            average='weighted')

print 'Recall:', recall_score(y_test, auto_weighted_prediction,
                              average='weighted')

print 'Precision:', precision_score(y_test, auto_weighted_prediction,
                                    average='weighted')

print '\n clasification report:\n', classification_report(y_test,auto_weighted_prediction)

print '\n confussion matrix:\n',confusion_matrix(y_test, auto_weighted_prediction)

第三:

clf = SVC(kernel='linear', C= 1)
clf.fit(X, y)
prediction = clf.predict(X_test)


from sklearn.metrics import precision_score, \
    recall_score, confusion_matrix, classification_report, \
    accuracy_score, f1_score

print 'Accuracy:', accuracy_score(y_test, prediction)
print 'F1 score:', f1_score(y_test, prediction)
print 'Recall:', recall_score(y_test, prediction)
print 'Precision:', precision_score(y_test, prediction)
print '\n clasification report:\n', classification_report(y_test,prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, prediction)


F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1082: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
 0.930416613529

但是,我收到这样的警告:

/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172:
DeprecationWarning: The default `weighted` averaging is deprecated,
and from version 0.18, use of precision, recall or F-score with 
multiclass or multilabel data or pos_label=None will result in an 
exception. Please set an explicit value for `average`, one of (None, 
'micro', 'macro', 'weighted', 'samples'). In cross validation use, for 
instance, scoring="f1_weighted" instead of scoring="f1"

如何正确处理我的不平衡数据,以便以正确的方式计算分类器的指标?

I’m working in a sentiment analysis problem the data looks like this:

label instances
    5    1190
    4     838
    3     239
    1     204
    2     127

So my data is unbalanced since 1190 instances are labeled with 5. For the classification Im using scikit’s SVC. The problem is I do not know how to balance my data in the right way in order to compute accurately the precision, recall, accuracy and f1-score for the multiclass case. So I tried the following approaches:

First:

    wclf = SVC(kernel='linear', C= 1, class_weight={1: 10})
    wclf.fit(X, y)
    weighted_prediction = wclf.predict(X_test)

print 'Accuracy:', accuracy_score(y_test, weighted_prediction)
print 'F1 score:', f1_score(y_test, weighted_prediction,average='weighted')
print 'Recall:', recall_score(y_test, weighted_prediction,
                              average='weighted')
print 'Precision:', precision_score(y_test, weighted_prediction,
                                    average='weighted')
print '\n clasification report:\n', classification_report(y_test, weighted_prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, weighted_prediction)

Second:

auto_wclf = SVC(kernel='linear', C= 1, class_weight='auto')
auto_wclf.fit(X, y)
auto_weighted_prediction = auto_wclf.predict(X_test)

print 'Accuracy:', accuracy_score(y_test, auto_weighted_prediction)

print 'F1 score:', f1_score(y_test, auto_weighted_prediction,
                            average='weighted')

print 'Recall:', recall_score(y_test, auto_weighted_prediction,
                              average='weighted')

print 'Precision:', precision_score(y_test, auto_weighted_prediction,
                                    average='weighted')

print '\n clasification report:\n', classification_report(y_test,auto_weighted_prediction)

print '\n confussion matrix:\n',confusion_matrix(y_test, auto_weighted_prediction)

Third:

clf = SVC(kernel='linear', C= 1)
clf.fit(X, y)
prediction = clf.predict(X_test)


from sklearn.metrics import precision_score, \
    recall_score, confusion_matrix, classification_report, \
    accuracy_score, f1_score

print 'Accuracy:', accuracy_score(y_test, prediction)
print 'F1 score:', f1_score(y_test, prediction)
print 'Recall:', recall_score(y_test, prediction)
print 'Precision:', precision_score(y_test, prediction)
print '\n clasification report:\n', classification_report(y_test,prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, prediction)


F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1082: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
 0.930416613529

However, Im getting warnings like this:

/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172:
DeprecationWarning: The default `weighted` averaging is deprecated,
and from version 0.18, use of precision, recall or F-score with 
multiclass or multilabel data or pos_label=None will result in an 
exception. Please set an explicit value for `average`, one of (None, 
'micro', 'macro', 'weighted', 'samples'). In cross validation use, for 
instance, scoring="f1_weighted" instead of scoring="f1"

How can I deal correctly with my unbalanced data in order to compute in the right way classifier’s metrics?


回答 0

我认为对于将哪些砝码用于什么有很多困惑。我不确定我是否确切知道让您感到困扰,所以我将涉及不同的话题,请耐心等待;)。

Class重量

来自class_weight参数的权重用于训练分类器。它们不会用于您正在使用的任何度量的计算中:使用不同的类别权重,数字会有所不同,仅仅是因为分类器不同。

基本上,在每个scikit-learn分类器中,都使用类权重来告诉您的模型,类的重要性。这意味着在训练过程中,分类器将付出更多的努力来对权重较高的类进行正确分类。
他们如何做到这一点是特定于算法的。如果您想了解有关SVC如何工作的详细信息,而该文档对您来说没有意义,请随时提及。

指标

有了分类器后,您想知道其效果如何。在这里,你可以使用你所提到的指标:accuracyrecall_scoref1_score

通常,当Class分布不平衡时,准确性被认为是较差的选择,因为它会给仅预测最频繁Class的模型打高分。

我不会详细说明所有这些指标,但是请注意,除之外accuracy,它们自然地应用于类级别:如您在print分类报告中所见,它们是为每个类定义的。他们依赖诸如true positives或的概念,这些概念false negative要求定义哪个类别是肯定的

             precision    recall  f1-score   support

          0       0.65      1.00      0.79        17
          1       0.57      0.75      0.65        16
          2       0.33      0.06      0.10        17
avg / total       0.52      0.60      0.51        50

警告

F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The 
default `weighted` averaging is deprecated, and from version 0.18, 
use of precision, recall or F-score with multiclass or multilabel data  
or pos_label=None will result in an exception. Please set an explicit 
value for `average`, one of (None, 'micro', 'macro', 'weighted', 
'samples'). In cross validation use, for instance, 
scoring="f1_weighted" instead of scoring="f1".

之所以收到此警告,是因为您使用的是f1得分,召回率和精确度,而未定义应如何计算它们!问题可以改写为:从以上分类报告中,您如何为f1分数输出一个全局数字?你可以:

  1. 取每个Class的f1分数的平均值:这就是avg / total上面的结果。也称为平均。
  2. 使用真实阳性/阴性阴性等的总计数来计算f1-分数(您将每个类别的真实阳性/阴性阴性的总数相加)。又名平均。
  3. 计算f1分数的加权平均值。使用'weighted'在scikit学习会由支持类的权衡F1评分:越要素类有,更重要的F1的得分这个类在计算中。

这是scikit-learn中的3个选项,警告是说您必须选择一个。因此,您必须average为score方法指定一个参数。

选择哪种方法取决于您如何衡量分类器的性能:例如,宏平均不考虑类的不平衡,并且类1的f1分数与类的f1分数一样重要5.但是,如果您使用加权平均,则对于第5类,您将变得更加重要。

这些指标中的整个参数规范目前在scikit-learn中尚不十分清楚,根据文档,它将在0.18版中变得更好。他们正在删除一些不明显的标准行为,并发出警告,以便开发人员注意到它。

计算分数

我要提到的最后一件事(如果您知道它,可以随时跳过它)是,分数只有在根据分类器从未见过的数据进行计算时才有意义。这一点非常重要,因为您获得的用于拟合分类器的数据得分都是完全不相关的。

这是使用的一种方法StratifiedShuffleSplit,它可以随机分配数据(经过改组后),以保留标签的分布。

from sklearn.datasets import make_classification
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

# We use a utility to generate artificial classification data.
X, y = make_classification(n_samples=100, n_informative=10, n_classes=3)
sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
for train_idx, test_idx in sss:
    X_train, X_test, y_train, y_test = X[train_idx], X[test_idx], y[train_idx], y[test_idx]
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)
    print(f1_score(y_test, y_pred, average="macro"))
    print(precision_score(y_test, y_pred, average="macro"))
    print(recall_score(y_test, y_pred, average="macro"))    

希望这可以帮助。

I think there is a lot of confusion about which weights are used for what. I am not sure I know precisely what bothers you so I am going to cover different topics, bear with me ;).

Class weights

The weights from the class_weight parameter are used to train the classifier. They are not used in the calculation of any of the metrics you are using: with different class weights, the numbers will be different simply because the classifier is different.

Basically in every scikit-learn classifier, the class weights are used to tell your model how important a class is. That means that during the training, the classifier will make extra efforts to classify properly the classes with high weights.
How they do that is algorithm-specific. If you want details about how it works for SVC and the doc does not make sense to you, feel free to mention it.

The metrics

Once you have a classifier, you want to know how well it is performing. Here you can use the metrics you mentioned: accuracy, recall_score, f1_score

Usually when the class distribution is unbalanced, accuracy is considered a poor choice as it gives high scores to models which just predict the most frequent class.

I will not detail all these metrics but note that, with the exception of accuracy, they are naturally applied at the class level: as you can see in this print of a classification report they are defined for each class. They rely on concepts such as true positives or false negative that require defining which class is the positive one.

             precision    recall  f1-score   support

          0       0.65      1.00      0.79        17
          1       0.57      0.75      0.65        16
          2       0.33      0.06      0.10        17
avg / total       0.52      0.60      0.51        50

The warning

F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The 
default `weighted` averaging is deprecated, and from version 0.18, 
use of precision, recall or F-score with multiclass or multilabel data  
or pos_label=None will result in an exception. Please set an explicit 
value for `average`, one of (None, 'micro', 'macro', 'weighted', 
'samples'). In cross validation use, for instance, 
scoring="f1_weighted" instead of scoring="f1".

You get this warning because you are using the f1-score, recall and precision without defining how they should be computed! The question could be rephrased: from the above classification report, how do you output one global number for the f1-score? You could:

  1. Take the average of the f1-score for each class: that’s the avg / total result above. It’s also called macro averaging.
  2. Compute the f1-score using the global count of true positives / false negatives, etc. (you sum the number of true positives / false negatives for each class). Aka micro averaging.
  3. Compute a weighted average of the f1-score. Using 'weighted' in scikit-learn will weigh the f1-score by the support of the class: the more elements a class has, the more important the f1-score for this class in the computation.

These are 3 of the options in scikit-learn, the warning is there to say you have to pick one. So you have to specify an average argument for the score method.

Which one you choose is up to how you want to measure the performance of the classifier: for instance macro-averaging does not take class imbalance into account and the f1-score of class 1 will be just as important as the f1-score of class 5. If you use weighted averaging however you’ll get more importance for the class 5.

The whole argument specification in these metrics is not super-clear in scikit-learn right now, it will get better in version 0.18 according to the docs. They are removing some non-obvious standard behavior and they are issuing warnings so that developers notice it.

Computing scores

Last thing I want to mention (feel free to skip it if you’re aware of it) is that scores are only meaningful if they are computed on data that the classifier has never seen. This is extremely important as any score you get on data that was used in fitting the classifier is completely irrelevant.

Here’s a way to do it using StratifiedShuffleSplit, which gives you a random splits of your data (after shuffling) that preserve the label distribution.

from sklearn.datasets import make_classification
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

# We use a utility to generate artificial classification data.
X, y = make_classification(n_samples=100, n_informative=10, n_classes=3)
sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
for train_idx, test_idx in sss:
    X_train, X_test, y_train, y_test = X[train_idx], X[test_idx], y[train_idx], y[test_idx]
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)
    print(f1_score(y_test, y_pred, average="macro"))
    print(precision_score(y_test, y_pred, average="macro"))
    print(recall_score(y_test, y_pred, average="macro"))    

Hope this helps.


回答 1

这里有很多非常详细的答案,但我认为您没有回答正确的问题。据我了解的问题,有两个问题:

  1. 我如何为多类问题评分?
  2. 我该如何处理不平衡的数据?

1。

可以将scikit-learn中的大多数计分函数用于多类问题和单类问题。例如:

from sklearn.metrics import precision_recall_fscore_support as score

predicted = [1,2,3,4,5,1,2,1,1,4,5] 
y_test = [1,2,3,4,5,1,2,1,1,4,1]

precision, recall, fscore, support = score(y_test, predicted)

print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))

这样,您最终得到每个类的有形和可解释的数字。

| Label | Precision | Recall | FScore | Support |
|-------|-----------|--------|--------|---------|
| 1     | 94%       | 83%    | 0.88   | 204     |
| 2     | 71%       | 50%    | 0.54   | 127     |
| ...   | ...       | ...    | ...    | ...     |
| 4     | 80%       | 98%    | 0.89   | 838     |
| 5     | 93%       | 81%    | 0.91   | 1190    |

然后…

2。

…您可以判断出不平衡的数据是否甚至是一个问题。如果代表较少的Class(第1类和第2类)的得分低于训练样本较多的Class(第4类和第5类)的得分,那么您就知道不平衡的数据实际上是个问题,您可以采取相应措施,例如在该线程的其他一些答案中进行了介绍。但是,如果要预测的数据中存在相同的类别分布,那么不平衡的训练数据可以很好地代表数据,因此,不平衡是一件好事。

Lot of very detailed answers here but I don’t think you are answering the right questions. As I understand the question, there are two concerns:

  1. How to I score a multiclass problem?
  2. How do I deal with unbalanced data?

1.

You can use most of the scoring functions in scikit-learn with both multiclass problem as with single class problems. Ex.:

from sklearn.metrics import precision_recall_fscore_support as score

predicted = [1,2,3,4,5,1,2,1,1,4,5] 
y_test = [1,2,3,4,5,1,2,1,1,4,1]

precision, recall, fscore, support = score(y_test, predicted)

print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))

This way you end up with tangible and interpretable numbers for each of the classes.

| Label | Precision | Recall | FScore | Support |
|-------|-----------|--------|--------|---------|
| 1     | 94%       | 83%    | 0.88   | 204     |
| 2     | 71%       | 50%    | 0.54   | 127     |
| ...   | ...       | ...    | ...    | ...     |
| 4     | 80%       | 98%    | 0.89   | 838     |
| 5     | 93%       | 81%    | 0.91   | 1190    |

Then…

2.

… you can tell if the unbalanced data is even a problem. If the scoring for the less represented classes (class 1 and 2) are lower than for the classes with more training samples (class 4 and 5) then you know that the unbalanced data is in fact a problem, and you can act accordingly, as described in some of the other answers in this thread. However, if the same class distribution is present in the data you want to predict on, your unbalanced training data is a good representative of the data, and hence, the unbalance is a good thing.


回答 2

提出的问题

回答“对于不平衡数据的多类别分类应使用什么度量”这一问题:Macro-F1-measure。也可以使用Macro Precision和Macro Recall,但是它们不像二进制分类那样容易解释,它们已经被合并到F量度中,并且多余的量度使方法比较,参数调整等复杂化。

微观平均对类不平衡很敏感:例如,如果您的方法对大多数常见标签都有效,而完全使其他标签混乱,则微观平均指标将显示出良好的结果。

加权平均不适用于不平衡数据,因为它通过标签计数加权。此外,它很难解释且不受欢迎:例如,在以下我非常建议仔细研究的非常详细的调查中,没有提及这种平均值:

Sokolova,Marina和Guy Lapalme。“对分类任务的绩效指标进行系统分析。” 信息处理与管理45.4(2009):427-437。

特定于应用程序的问题

但是,回到您的任务,我将研究2个主题:

  1. 通常用于您的特定任务的指标-它使(a)与他人比较您的方法,并了解您做错了什么;(b)不要自己探索这一方法并重用他人的发现;
  2. 方法的不同错误的成本-例如,您的应用程序的用例可能仅依赖于4星级和5星级审核-在这种情况下,好的指标应仅将这2个标签计算在内。

常用指标。 从文献资料中我可以推断出,有两个主要的评估指标:

  1. 精度,例如

Yu,April和Daryl Chang。“使用Yelp业务进行多类情感预测。”

链接)-请注意,作者使用的评级分布几乎相同,请参见图5。

庞波和李丽娟 “看见星星:利用阶级关系来进行与等级量表有关的情感分类。” 计算语言学协会第四十三届年会论文集。计算语言学协会,2005年。

链接

  1. MSE(或更不常见的是,平均绝对误差- -MAE)-例如,

Lee,Moontae和R.Grafe。“带有餐厅评论的多类情感分析。” CS N 224(2010)中的最终项目。

链接)-他们同时探讨准确性和MSE,并认为后者会更好

帕帕斯,尼古拉斯,Rue Marconi和Andrei Popescu-Belis。“解释星星:基于方面的情感分析的加权多实例学习。” 2014年自然语言处理经验方法会议论文集。EPFL-CONF-200899号。2014。

链接)-他们利用scikit-learn进行评估和基准评估,并声明其代码可用;但是,我找不到它,所以如果您需要它,请写信给作者,这本书是相当新的,似乎是用Python编写的。

不同错误的代价 如果您更关心避免出现大失误,例如将1星评价转换为5星评价或类似方法,请查看MSE;如果差异很重要,但不是那么重要,请尝试MAE,因为它不会使差异平方;否则保持准确性。

关于方法,而不是指标

尝试使用回归方法,例如SVR,因为它们通常胜过SVC或OVA SVM之类的多类分类器。

Posed question

Responding to the question ‘what metric should be used for multi-class classification with imbalanced data’: Macro-F1-measure. Macro Precision and Macro Recall can be also used, but they are not so easily interpretable as for binary classificaion, they are already incorporated into F-measure, and excess metrics complicate methods comparison, parameters tuning, and so on.

Micro averaging are sensitive to class imbalance: if your method, for example, works good for the most common labels and totally messes others, micro-averaged metrics show good results.

Weighting averaging isn’t well suited for imbalanced data, because it weights by counts of labels. Moreover, it is too hardly interpretable and unpopular: for instance, there is no mention of such an averaging in the following very detailed survey I strongly recommend to look through:

Sokolova, Marina, and Guy Lapalme. “A systematic analysis of performance measures for classification tasks.” Information Processing & Management 45.4 (2009): 427-437.

Application-specific question

However, returning to your task, I’d research 2 topics:

  1. metrics commonly used for your specific task – it lets (a) to compare your method with others and understand if you do something wrong, and (b) to not explore this by yourself and reuse someone else’s findings;
  2. cost of different errors of your methods – for example, use-case of your application may rely on 4- and 5-star reviewes only – in this case, good metric should count only these 2 labels.

Commonly used metrics. As I can infer after looking through literature, there are 2 main evaluation metrics:

  1. Accuracy, which is used, e.g. in

Yu, April, and Daryl Chang. “Multiclass Sentiment Prediction using Yelp Business.”

(link) – note that the authors work with almost the same distribution of ratings, see Figure 5.

Pang, Bo, and Lillian Lee. “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.” Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005.

(link)

  1. MSE (or, less often, Mean Absolute Error – MAE) – see, for example,

Lee, Moontae, and R. Grafe. “Multiclass sentiment analysis with restaurant reviews.” Final Projects from CS N 224 (2010).

(link) – they explore both accuracy and MSE, considering the latter to be better

Pappas, Nikolaos, Rue Marconi, and Andrei Popescu-Belis. “Explaining the Stars: Weighted Multiple-Instance Learning for Aspect-Based Sentiment Analysis.” Proceedings of the 2014 Conference on Empirical Methods In Natural Language Processing. No. EPFL-CONF-200899. 2014.

(link) – they utilize scikit-learn for evaluation and baseline approaches and state that their code is available; however, I can’t find it, so if you need it, write a letter to the authors, the work is pretty new and seems to be written in Python.

Cost of different errors. If you care more about avoiding gross blunders, e.g. assinging 1-star to 5-star review or something like that, look at MSE; if difference matters, but not so much, try MAE, since it doesn’t square diff; otherwise stay with Accuracy.

About approaches, not metrics

Try regression approaches, e.g. SVR, since they generally outperforms Multiclass classifiers like SVC or OVA SVM.


回答 3

首先,仅使用计数分析来判断您的数据是否不平衡会更加困难。例如:每1000个阳性观察中就有1个只是噪音,错误还是科学突破?你永远不会知道。
因此,最好使用所有可用的知识并明智地选择其状态。

好吧,如果真的不平衡怎么办?
再次-查看您的数据。有时您会发现一两个观察值乘以一百倍。有时创建这种虚假的一类观察很有用。
如果所有数据都是干净的,下一步是在预测模型中使用类权重。

那么多类指标呢?
根据我的经验,通常不会使用您的任何指标。有两个主要原因。
首先:与概率一起使用总是比采用可靠的预测更好(因为如果它们给同一个类,那么您还可以如何分别将0.9和0.6预测模型分开?)
其次:比较预测模型和构建新模型要容易得多仅取决于一项好的指标。
根据我的经验,我可以推荐loglossMSE(或均方误差)。

如何解决sklearn警告?
只是简单地(如yangjie所注意到的)average使用以下值之一覆盖参数:('micro'全局计算指标),'macro'(计算每个标签的指标)或'weighted'(与宏相同,但具有自动权重)。

f1_score(y_test, prediction, average='weighted')

在使用默认average值调用指标函数后发出所有警告,'binary'这不适用于多类预测。
祝你好运,并享受机器学习的乐趣!

编辑:
我发现了另一个回答者建议,建议改用我无法同意的回归方法(例如SVR)。据我所知,甚至没有多类回归。是的,多标签回归有很大的不同,是的,在某些情况下,有可能在回归和分类之间进行切换(如果类以某种方式排序),但这种情况很少见。

我建议(在scikit-learn范围内)尝试另一个非常强大的分类工具:梯度增强随机森林(我最喜欢),KNeighbors等。

之后,您可以计算预测之间的算术平均值或几何平均值,并且大多数时候您将获得更好的结果。

final_prediction = (KNNprediction * RFprediction) ** 0.5

First of all it’s a little bit harder using just counting analysis to tell if your data is unbalanced or not. For example: 1 in 1000 positive observation is just a noise, error or a breakthrough in science? You never know.
So it’s always better to use all your available knowledge and choice its status with all wise.

Okay, what if it’s really unbalanced?
Once again — look to your data. Sometimes you can find one or two observation multiplied by hundred times. Sometimes it’s useful to create this fake one-class-observations.
If all the data is clean next step is to use class weights in prediction model.

So what about multiclass metrics?
In my experience none of your metrics is usually used. There are two main reasons.
First: it’s always better to work with probabilities than with solid prediction (because how else could you separate models with 0.9 and 0.6 prediction if they both give you the same class?)
And second: it’s much easier to compare your prediction models and build new ones depending on only one good metric.
From my experience I could recommend logloss or MSE (or just mean squared error).

How to fix sklearn warnings?
Just simply (as yangjie noticed) overwrite average parameter with one of these values: 'micro' (calculate metrics globally), 'macro' (calculate metrics for each label) or 'weighted' (same as macro but with auto weights).

f1_score(y_test, prediction, average='weighted')

All your Warnings came after calling metrics functions with default average value 'binary' which is inappropriate for multiclass prediction.
Good luck and have fun with machine learning!

Edit:
I found another answerer recommendation to switch to regression approaches (e.g. SVR) with which I cannot agree. As far as I remember there is no even such a thing as multiclass regression. Yes there is multilabel regression which is far different and yes it’s possible in some cases switch between regression and classification (if classes somehow sorted) but it pretty rare.

What I would recommend (in scope of scikit-learn) is to try another very powerful classification tools: gradient boosting, random forest (my favorite), KNeighbors and many more.

After that you can calculate arithmetic or geometric mean between predictions and most of the time you’ll get even better result.

final_prediction = (KNNprediction * RFprediction) ** 0.5

谁能向我解释StandardScaler?

问题:谁能向我解释StandardScaler?

我无法理解网页StandardScaler的文档中sklearn

谁能简单地向我解释一下?

I am unable to understand the page of the StandardScaler in the documentation of sklearn.

Can anyone explain this to me in simple terms?


回答 0

背后的想法StandardScaler是,它将转换您的数据,使其分布的平均值为0,标准差为1。
对于多变量数据,这是按功能进行的(换句话说,独立于数据的每一列) 。
给定数据的分布,数据集中的每个值都将减去平均值,然后除以整个数据集(或多变量情况下的特征)的标准差。

The idea behind StandardScaler is that it will transform your data such that its distribution will have a mean value 0 and standard deviation of 1.
In case of multivariate data, this is done feature-wise (in other words independently for each column of the data).
Given the distribution of the data, each value in the dataset will have the mean value subtracted, and then divided by the standard deviation of the whole dataset (or feature in the multivariate case).


回答 1

简介:我假设您有一个矩阵X,其中每一行/每一都是一个样本/观测值,每一都是一个变量/特征sklearn顺便说一下,这是任何ML函数的预期输入- X.shape应该是[number_of_samples, number_of_features])。


方法的核心:主要思路是正常化/标准化,即μ = 0σ = 1你的功能/变量/列X单独之前应用任何机器学习模型。

StandardScaler()归一化的特征,即X的每一列中,独立地,使每个柱/特征/变量将具有μ = 0σ = 1


附言我在此页面上找到了最受欢迎的答案,这是错误的。我引用的是“数据集中的每个值都将减去样本平均值” –这既不正确也不正确。


另请参阅:如何以及为何使数据标准化:python教程


例:

from sklearn.preprocessing import StandardScaler
import numpy as np

# 4 samples/observations and 2 variables/features
data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

print(data)
[[0, 0],
 [1, 0],
 [0, 1],
 [1, 1]])

print(scaled_data)
[[-1. -1.]
 [ 1. -1.]
 [-1.  1.]
 [ 1.  1.]]

验证每个特征(列)的平均值是否为0:

scaled_data.mean(axis = 0)
array([0., 0.])

验证每个功能(列)的标准差为1:

scaled_data.std(axis = 0)
array([1., 1.])

数学:


UPDATE 08/2019:Concering输入参数with_meanwith_stdFalse/ True,我这里提供一个答案:之间StandardScaler差“with_std =伪或真”和“with_mean =伪或真”

Intro: I assume that you have a matrix X where each row/line is a sample/observation and each column is a variable/feature (this is the expected input for any sklearn ML function by the way — X.shape should be [number_of_samples, number_of_features]).


Core of method: The main idea is to normalize/standardize i.e. μ = 0 and σ = 1 your features/variables/columns of X, individually, before applying any machine learning model.

StandardScaler() will normalize the features i.e. each column of X, INDIVIDUALLY, so that each column/feature/variable will have μ = 0 and σ = 1.


P.S: I find the most upvoted answer on this page, wrong. I am quoting “each value in the dataset will have the sample mean value subtracted” — This is neither true nor correct.


See also: How and why to Standardize your data: A python tutorial


Example:

from sklearn.preprocessing import StandardScaler
import numpy as np

# 4 samples/observations and 2 variables/features
data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

print(data)
[[0, 0],
 [1, 0],
 [0, 1],
 [1, 1]])

print(scaled_data)
[[-1. -1.]
 [ 1. -1.]
 [-1.  1.]
 [ 1.  1.]]

Verify that the mean of each feature (column) is 0:

scaled_data.mean(axis = 0)
array([0., 0.])

Verify that the std of each feature (column) is 1:

scaled_data.std(axis = 0)
array([1., 1.])

The maths:


UPDATE 08/2020: Concerning the input parameters with_mean and with_std to False/True, I have provided an answer here: StandardScaler difference between “with_std=False or True” and “with_mean=False or True”


回答 2


回答 3

StandardScaler执行标准化任务。通常,数据集包含比例不同的变量。例如,一个Employee数据集将包含AGE列,其值的范围为20-70,而SALARY列的值的范围为10000-80000
由于这两列的规模不同,因此在构建机器学习模型时将它们标准化以具有相同的规模。

StandardScaler performs the task of Standardization. Usually a dataset contains variables that are different in scale. For e.g. an Employee dataset will contain AGE column with values on scale 20-70 and SALARY column with values on scale 10000-80000.
As these two columns are different in scale, they are Standardized to have common scale while building machine learning model.


回答 4

当您要比较对应于不同单位的数据时,这很有用。在这种情况下,您要删除单元。要以一致的方式处理所有数据,请以方差为单位且序列的均值为0的方式转换数据。

This is useful when you want to compare data that correspond to different units. In that case, you want to remove the units. To do that in a consistent way of all the data, you transform the data in a way that the variance is unitary and that the mean of the series is 0.


回答 5

上面的答案很好,但是我需要一个简单的例子来减轻过去的担忧。我想确保确实将每个专栏分开对待。现在,我可以放心了,无法找到引起我关注的示例。如上所述,所有列均按比例缩放。

import pandas as pd
import scipy.stats as ss
from sklearn.preprocessing import StandardScaler


data= [[1, 1, 1, 1, 1],[2, 5, 10, 50, 100],[3, 10, 20, 150, 200],[4, 15, 40, 200, 300]]

df = pd.DataFrame(data, columns=['N0', 'N1', 'N2', 'N3', 'N4']).astype('float64')

sc_X = StandardScaler()
df = sc_X.fit_transform(df)

num_cols = len(df[0,:])
for i in range(num_cols):
    col = df[:,i]
    col_stats = ss.describe(col)
    print(col_stats)

输出值

DescribeResult(nobs=4, minmax=(-1.3416407864998738, 1.3416407864998738), mean=0.0, variance=1.3333333333333333, skewness=0.0, kurtosis=-1.3599999999999999)
DescribeResult(nobs=4, minmax=(-1.2828087129930659, 1.3778315806221817), mean=-5.551115123125783e-17, variance=1.3333333333333337, skewness=0.11003776770595125, kurtosis=-1.394993095506219)
DescribeResult(nobs=4, minmax=(-1.155344148338584, 1.53471088361394), mean=0.0, variance=1.3333333333333333, skewness=0.48089217736510326, kurtosis=-1.1471008824318165)
DescribeResult(nobs=4, minmax=(-1.2604572012883055, 1.2668071116222517), mean=-5.551115123125783e-17, variance=1.3333333333333333, skewness=0.0056842140599118185, kurtosis=-1.6438177182479734)
DescribeResult(nobs=4, minmax=(-1.338945389819976, 1.3434309690153527), mean=5.551115123125783e-17, variance=1.3333333333333333, skewness=0.005374558840039456, kurtosis=-1.3619131970819205)

The answers above are great, but I needed a simple example to alleviate some concerns that I have had in the past. I wanted to make sure it was indeed treating each column separately. I am now reassured and can’t find what example had caused me concern. All columns ARE scaled separately as described by those above.

CODE

import pandas as pd
import scipy.stats as ss
from sklearn.preprocessing import StandardScaler


data= [[1, 1, 1, 1, 1],[2, 5, 10, 50, 100],[3, 10, 20, 150, 200],[4, 15, 40, 200, 300]]

df = pd.DataFrame(data, columns=['N0', 'N1', 'N2', 'N3', 'N4']).astype('float64')

sc_X = StandardScaler()
df = sc_X.fit_transform(df)

num_cols = len(df[0,:])
for i in range(num_cols):
    col = df[:,i]
    col_stats = ss.describe(col)
    print(col_stats)

OUTPUT

DescribeResult(nobs=4, minmax=(-1.3416407864998738, 1.3416407864998738), mean=0.0, variance=1.3333333333333333, skewness=0.0, kurtosis=-1.3599999999999999)
DescribeResult(nobs=4, minmax=(-1.2828087129930659, 1.3778315806221817), mean=-5.551115123125783e-17, variance=1.3333333333333337, skewness=0.11003776770595125, kurtosis=-1.394993095506219)
DescribeResult(nobs=4, minmax=(-1.155344148338584, 1.53471088361394), mean=0.0, variance=1.3333333333333333, skewness=0.48089217736510326, kurtosis=-1.1471008824318165)
DescribeResult(nobs=4, minmax=(-1.2604572012883055, 1.2668071116222517), mean=-5.551115123125783e-17, variance=1.3333333333333333, skewness=0.0056842140599118185, kurtosis=-1.6438177182479734)
DescribeResult(nobs=4, minmax=(-1.338945389819976, 1.3434309690153527), mean=5.551115123125783e-17, variance=1.3333333333333333, skewness=0.005374558840039456, kurtosis=-1.3619131970819205)

NOTE:

The scipy.stats module is correctly reporting the “sample” variance, which uses (n – 1) in the denominator. The “population” variance would use n in the denominator for the calculation of variance. To understand better, please see the code below that uses scaled data from the first column of the data set above:

Code

import scipy.stats as ss

sc_Data = [[-1.34164079], [-0.4472136], [0.4472136], [1.34164079]]
col_stats = ss.describe([-1.34164079, -0.4472136, 0.4472136, 1.34164079])
print(col_stats)
print()

mean_by_hand = 0
for row in sc_Data:
    for element in row:
        mean_by_hand += element
mean_by_hand /= 4

variance_by_hand = 0
for row in sc_Data:
    for element in row:
        variance_by_hand += (mean_by_hand - element)**2
sample_variance_by_hand = variance_by_hand / 3
sample_std_dev_by_hand = sample_variance_by_hand ** 0.5

pop_variance_by_hand = variance_by_hand / 4
pop_std_dev_by_hand = pop_variance_by_hand ** 0.5

print("Sample of Population Calcs:")
print(mean_by_hand, sample_variance_by_hand, sample_std_dev_by_hand, '\n')
print("Population Calcs:")
print(mean_by_hand, pop_variance_by_hand, pop_std_dev_by_hand)

Output

DescribeResult(nobs=4, minmax=(-1.34164079, 1.34164079), mean=0.0, variance=1.3333333422778562, skewness=0.0, kurtosis=-1.36000000429325)

Sample of Population Calcs:
0.0 1.3333333422778562 1.1547005422523435

Population Calcs:
0.0 1.000000006708392 1.000000003354196

回答 6

以下是一个简单的工作示例,用于解释标准化计算的工作原理。理论部分已经在其他答案中得到了很好的解释。

>>>import numpy as np
>>>data = [[6, 2], [4, 2], [6, 4], [8, 2]]
>>>a = np.array(data)

>>>np.std(a, axis=0)
array([1.41421356, 0.8660254 ])

>>>np.mean(a, axis=0)
array([6. , 2.5])

>>>from sklearn.preprocessing import StandardScaler
>>>scaler = StandardScaler()
>>>scaler.fit(data)
>>>print(scaler.mean_)

#Xchanged = (X−μ)/σ  WHERE σ is Standard Deviation and μ is mean
>>>z=scaler.transform(data)
>>>z

计算方式

正如您在输出中看到的,均值为[6。,2.5]和标准偏差为[1.41421356,0.8660254]

数据为(0,1)位置为2标准化=(2-2.5)/0.8660254 = -0.57735027

(1,0)位置的数据为4标准化=(4-6)/1.41421356 = -1.414

标准化后的结果

标准化后检查均值和标准偏差

注意:-2.77555756e-17非常接近0。

参考文献

  1. 将不同缩放器对数据的影响与离群值进行比较

  2. 标准化和标准化之间有什么区别?

  3. 使用sklearn StandardScaler缩放的数据均值不为零

Following is a simple working example to explain how standarization calculation works. The theory part is already well explained in other answers.

>>>import numpy as np
>>>data = [[6, 2], [4, 2], [6, 4], [8, 2]]
>>>a = np.array(data)

>>>np.std(a, axis=0)
array([1.41421356, 0.8660254 ])

>>>np.mean(a, axis=0)
array([6. , 2.5])

>>>from sklearn.preprocessing import StandardScaler
>>>scaler = StandardScaler()
>>>scaler.fit(data)
>>>print(scaler.mean_)

#Xchanged = (X−μ)/σ  WHERE σ is Standard Deviation and μ is mean
>>>z=scaler.transform(data)
>>>z

Calculation

As you can see in the output, mean is [6. , 2.5] and std deviation is [1.41421356, 0.8660254 ]

Data is (0,1) position is 2 Standardization = (2 – 2.5)/0.8660254 = -0.57735027

Data in (1,0) position is 4 Standardization = (4-6)/1.41421356 = -1.414

Result After Standardization

Check Mean and Std Deviation After Standardization

Note: -2.77555756e-17 is very close to 0.

References

  1. Compare the effect of different scalers on data with outliers

  2. What’s the difference between Normalization and Standardization?

  3. Mean of data scaled with sklearn StandardScaler is not zero


回答 7

应用后StandardScaler(),X中的每一列的平均值为0,标准差为1。

公式在此页面上被其他人列出。

基本原理:某些算法要求数据看起来像这样(请参阅sklearn docs)。

After applying StandardScaler(), each column in X will have mean of 0 and standard deviation of 1.

Formulas are listed by others on this page.

Rationale: some algorithms require data to look like this (see sklearn docs).


Recommenders-推荐系统的最佳实战教程

Recommenders现在在PyPI上,可以使用pip安装!此外,还有大量的错误修复和实用程序改进

您可以在此处找到PyPI页面:https://pypi.org/project/ms-recommenders/

您可以在此处找到软件包文档:https://microsoft-recommenders.readthedocs.io/en/latest/

引言

此存储库包含构建推荐系统的示例和最佳实践,以Jupyter笔记本形式提供。这些示例详细介绍了我们在五项关键任务上的学习:

中提供了几个实用程序reco_utils以支持常见任务,例如以不同算法预期的格式加载数据集、评估模型输出以及拆分训练/测试数据。其中包括几种最先进算法的实现,以便在您自己的应用程序中进行自学和自定义。请参阅reco_utils documentation

有关存储库的更详细概述,请参阅wiki page

快速入门

请参阅setup guide有关在本地设置计算机的更多详细信息,请访问data science virtual machine (DSVM)或打开Azure Databricks

推荐器软件包的安装已经过Python版本3.6和3.7的测试。建议在干净的环境中安装软件包及其依赖项(例如condavenv)

要在本地计算机上进行设置,请执行以下操作:

要安装核心实用程序、基于CPU的算法和相关性,请执行以下操作:

  1. 确保安装了编译所需的软件。在Linux上,可以通过添加构建基本依赖项来支持这一点:
sudo apt-get install -y build-essential

在Windows上,您需要Microsoft C++ Build Tools

  1. 从以下位置安装软件包PyPI
pip install --upgrade pip
pip install ms-recommenders[examples]
  1. 向Jupyter注册您的(CONDA或虚拟)环境:
python -m ipykernel install --user --name my_environment_name --display-name "Python (reco)"
  1. 启动Jupyter笔记本服务器
jupyter notebook
  1. 运行SAR Python CPU MovieLens笔记本在下面00_quick_start文件夹。确保将内核更改为“Python(Reco)”

有关安装软件包的其他选项(支持图形处理器、电光等)看见this guide

注意事项-TheAlternating Least Squares (ALS)笔记本电脑需要PySpark环境才能运行。请按照中的步骤操作setup guide在PySpark环境中运行这些笔记本。对于深度学习算法,建议使用GPU机器,并遵循setup guide要设置NVIDIA库,请执行以下操作

算法

下表列出了存储库中当前可用的推荐算法。当有不同的实施可用时,笔记本将链接在环境列下

算法 环境 类型 描述
交替最小二乘(ALS) PySpark 协同过滤 大数据集中显式或隐式反馈的矩阵分解算法,电光MLLib针对可扩展性和分布式计算能力进行了优化
关注异步奇异值分解(A2SVD)* Python CPU / Python GPU 协同过滤 一种基于顺序的算法,旨在利用注意力机制捕获长期和短期用户偏好
CONAC/贝叶斯个性化排名(BPR) Python CPU 协同过滤 隐式反馈预测项目排序的矩阵分解算法
角/双边变分自动编码器(BiVAE) Python CPU / Python GPU 协同过滤 二元数据的生成模型(例如,用户-项目交互)
卷积序列嵌入推荐(CASER) Python CPU / Python GPU 协同过滤 基于卷积的算法,旨在捕获用户的一般偏好和序列模式
深度知识感知网络(DKN)* Python CPU / Python GPU 基于内容的过滤 包含知识图和文章嵌入的深度学习算法,可提供强大的新闻或文章推荐
极深因式分解机(XDeepFM)* Python CPU / Python GPU 混血儿 基于深度学习的具有用户/项目特征的隐式和显式反馈算法
FastAI嵌入点偏置(FAST) Python CPU / Python GPU 协同过滤 面向用户和项目的带嵌入和偏向的通用算法
LightFM/混合矩阵分解 Python CPU 混血儿 隐式和显式反馈的混合矩阵分解算法
LightGBM/渐变增强树* Python CPU/PySpark 基于内容的过滤 基于内容问题的快速训练和低内存占用的梯度Boosting树算法
LightGCN Python CPU / Python GPU 协同过滤 一种深度学习算法,简化了隐式反馈预测GCN的设计
GeoIMC* Python CPU 混血儿 矩阵补全算法,考虑了用户和项目的特点,使用黎曼共轭梯度优化,遵循几何方法
GRU4Rec Python CPU / Python GPU 协同过滤 基于序列的算法,旨在使用递归神经网络捕获长期和短期用户偏好
多项式VAE Python CPU / Python GPU 协同过滤 预测用户/项目交互的产生式模型
具有长期和短期用户表示的神经推荐(LSTUR)* Python CPU / Python GPU 基于内容的过滤 具有长期和短期用户兴趣建模的神经推荐算法
基于注意力多视图学习(NAML)的神经推荐* Python CPU / Python GPU 基于内容的过滤 具有注意力多视角学习的神经推荐算法
神经协同过滤(NCF) Python CPU / Python GPU 协同过滤 一种性能增强的隐式反馈深度学习算法
具有个性化注意的神经推荐(NPA)* Python CPU / Python GPU 基于内容的过滤 基于个性化注意力网络的神经推荐算法
基于多头自我注意(NRMS)的神经推荐* Python CPU / Python GPU 基于内容的过滤 一种多头自关注的神经推荐算法
下一项推荐(NextItNet) Python CPU / Python GPU 协同过滤 基于扩张卷积和残差网络的序列模式捕获算法
受限Boltzmann机器(RBM) Python CPU / Python GPU 协同过滤 基于神经网络的显式或隐式反馈潜在概率分布学习算法
黎曼低秩矩阵完成(RLRMC)* Python CPU 协同过滤 基于黎曼共轭梯度优化的小内存矩阵分解算法
简单推荐算法(SAR)* Python CPU 协同过滤 基于相似度的隐式反馈数据集算法
短期和长期偏好综合推荐(SLI-REC)* Python CPU / Python GPU 协同过滤 基于顺序的算法,旨在使用注意机制、时间感知控制器和内容感知控制器捕获长期和短期用户偏好
多兴趣感知的序贯用户建模(SUM)* Python CPU / Python GPU 协同过滤 一种旨在捕捉用户多兴趣的增强型记忆网络顺序用户模型
标准VAE Python CPU / Python GPU 协同过滤 预测用户/项目交互的产生式模型
奇异值分解(SVD) Python CPU 协同过滤 预测不是很大数据集中显式评级反馈的矩阵分解算法
术语频率-文档频率反转(TF-IDF) Python CPU 基于内容的过滤 一种简单的基于相似度的文本数据内容推荐算法
Vowpal Wabbit(大众)* Python CPU (online training) 基于内容的过滤 快速在线学习算法,非常适合用户功能/上下文不断变化的场景
宽而深 Python CPU / Python GPU 混血儿 一种可记忆特征交互和泛化用户特征的深度学习算法
xLearch/factorization Machine(FM)&场感知FM(FFM) Python CPU 混血儿 利用用户/项目特征预测标签的快速高效内存算法

注意事项*指示由Microsoft发明/贡献的算法

独立的或正在孵化的算法和实用程序是contrib文件夹。这将包含可能不容易放入核心存储库的贡献,或者需要时间来重构或成熟代码并添加必要的测试

算法 环境 类型 描述
SARplus* PySpark 协同过滤 电光合成孔径雷达的优化实现

初步比较

我们提供一个benchmark notebook以说明如何评估和比较不同的算法。在本笔记本中,使用分层拆分将MovieLens数据集按75/25的比率拆分成训练/测试集。使用下面的每个协同过滤算法来训练推荐模型。我们利用文献中报道的经验参数值。here在对指标进行排名时,我们使用k=10(前十大推荐项目)。我们在标准NC6S_v2上运行比较Azure DSVM(6个vCPU、112 GB内存和1个P100 GPU)。电光肌萎缩侧索硬化症在本地独立模式下运行。在此表中,我们显示了在Movielens 100k上运行15个时期的算法的结果

算法 地图 nDCG@k 精度@k Recall@k RMSE Mae R2个 解释的差异
ALS 0.004732 0.044239 0.048462 0.017796 0.965038 0.753001 0.255647 0.251648
BiVAE 0.146126 0.475077 0.411771 0.219145 不适用 不适用 不适用 不适用
BPR 0.132478 0.441997 0.388229 0.212522 不适用 不适用 不适用 不适用
FastAI 0.025503 0.147866 0.130329 0.053824 0.943084 0.744337 0.285308 0.287671
LightGCN 0.088526 0.419846 0.379626 0.144336 不适用 不适用 不适用 不适用
NCF 0.107720 0.396118 0.347296 0.180775 不适用 不适用 不适用 不适用
SAR 0.110591 0.382461 0.330753 0.176385 1.253805 1.048484 -0.569363 0.030474
SVD 0.012873 0.095930 0.091198 0.032783 0.938681 0.742690 0.291967 0.291971

行为规范

本项目坚持Microsoft’s Open Source Code of Conduct为了给所有人营造一个欢迎和鼓舞人心的社区

贡献

这个项目欢迎大家提供意见和建议。在投稿之前,请查看我们的contribution guidelines

生成状态

这些测试是夜间构建,用于计算冒烟和集成测试。main是我们的主要分支机构staging是我们的发展分部。我们使用pytest中测试python实用程序reco_utilspapermill对于notebooks有关测试管道的更多信息,请参阅test documentation

DSVM构建状态

以下测试每天在Linux DSVM上运行。这些机器全天候运转。

构建类型 分支机构 状态 分支机构 状态
Linux CPU 主干道 试运行
Linux GPU 主干道 试运行
LINUX电光 主干道 试运行

相关项目

参考文献

  • A.Argyriou、M.González-Fierro和L.Zhang,“Microsoft推荐器:可投入生产的推荐系统的最佳实践(Best Practices for Production-Ready Recommendation Systems)”,万维网2020:台北国际万维网大会,2020年。在线提供:https://dl.acm.org/doi/abs/10.1145/3366424.3382692
  • 张立,吴涛,谢,A.Argyriou,M.González-Fierro,J.Lian,“建立规模化的可生产推荐系统”,ACM SIGKDD 2019年知识发现和数据挖掘大会(KDD 2019),2019年
  • S.Graham,J.K.Min,T.Wu,“微软推荐器:加速开发推荐器系统的工具”,RecSys‘19:第13届ACM推荐系统会议论文集,2019年。在线提供:https://dl.acm.org/doi/10.1145/3298689.3346967

Stanford-tensorflow-tutorials-此存储库包含斯坦福课程的代码示例


斯坦福-TensorFlow-教程

此存储库包含课程CS 20:深度学习研究的TensorFlow的代码示例。
它将随着课程的进展而更新。
详细的教学大纲和课堂讲稿可以在这里找到。here
对于本课程,我使用python3.6和TensorFlow 1.4.1

前一年课程的代码和备注请参见文件夹2017和网站https://web.stanford.edu/class/cs20si/2017

有关安装说明和依赖项列表,请参阅此存储库的安装文件夹

Nltk-NLTK源

自然语言工具包(NLTK)


NLTK–自然语言工具包–是一套开源的Python模块、数据集和教程,支持自然语言处理方面的研究和开发。NLTK需要Python版本3.5、3.6、3.7、3.8或3.9

有关文档,请访问nltk.org

贡献

你想为NLTK的发展做贡献吗?太棒了!请阅读CONTRIBUTING.md有关更多详细信息,请参阅

另请参阅how to contribute to NLTK

捐赠

你觉得这个工具包有帮助吗?请使用NLTK主页上的链接通过PayPal向项目捐款来支持NLTK开发

引用

如果您发表使用NLTK的作品,请引用NLTK书,如下所示:

Bird, Steven, Edward Loper and Ewan Klein (2009).
Natural Language Processing with Python.  O'Reilly Media Inc.

版权所有

版权所有(C)2001-2021年NLTK项目

有关许可证信息,请参阅LICENSE.txt

AUTHORS.md包含对NLTK做出贡献的每个人的列表

重新分布

  • NLTK源代码是在Apache2.0许可下分发的
  • NLTK文档在知识共享署名-非商业性-无衍生作品3.0美国许可下分发
  • NLTK语料库是根据自述文件中给出的条款为每个语料库提供的;所有语料库都可以再分发,并可用于非商业用途
  • 根据这些许可证的规定,NLTK可以自由再分发