How to extract the decision rules from a scikit-learn decision tree?

Can I extract the underlying decision rules (or ‘decision paths’) from a trained decision tree as a textual list?

Something like:

if A>0.4 then if B<0.2 then if C>0.8 then class='X'

Thanks for your help.


Answer 0

I believe that this answer is more correct than the other answers here:

from sklearn.tree import _tree

def tree_to_code(tree, feature_names):
    """Print the fitted tree as the source of a Python function."""
    tree_ = tree.tree_
    # Leaves have feature == _tree.TREE_UNDEFINED (-2), so guard the lookup.
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    print("def tree({}):".format(", ".join(feature_names)))

    def recurse(node, depth):
        indent = "  " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            # Internal node: samples with feature <= threshold go left.
            name = feature_name[node]
            threshold = tree_.threshold[node]
            print("{}if {} <= {}:".format(indent, name, threshold))
            recurse(tree_.children_left[node], depth + 1)
            print("{}else:  # if {} > {}".format(indent, name, threshold))
            recurse(tree_.children_right[node], depth + 1)
        else:
            # Leaf node: print the stored value array.
            print("{}return {}".format(indent, tree_.value[node]))

    recurse(0, 1)

This prints out a valid Python function. Here’s an example output for a tree that is trying to return its input, a number between 0 and 10.

def tree(f0):
  if f0 <= 6.0:
    if f0 <= 1.5:
      return [[ 0.]]
    else:  # if f0 > 1.5
      if f0 <= 4.5:
        if f0 <= 3.5:
          return [[ 3.]]
        else:  # if f0 > 3.5
          return [[ 4.]]
      else:  # if f0 > 4.5
        return [[ 5.]]
  else:  # if f0 > 6.0
    if f0 <= 8.5:
      if f0 <= 7.5:
        return [[ 7.]]
      else:  # if f0 > 7.5
        return [[ 8.]]
    else:  # if f0 > 8.5
      return [[ 9.]]

Here are some stumbling blocks that I see in other answers:

  1. Using tree_.threshold == -2 to decide whether a node is a leaf isn’t a good idea. What if it’s a real decision node with a threshold of -2? Instead, you should look at tree.feature or tree.children_*.
  2. The line features = [feature_names[i] for i in tree_.feature] crashes with my version of sklearn, because some values of tree.tree_.feature are -2 (specifically for leaf nodes).
  3. There is no need to have multiple if statements in the recursive function, just one is fine.
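To illustrate the first point, here is a minimal sketch of a leaf test based on the child arrays rather than the threshold. It assumes plain Python lists laid out like tree_.children_left and tree_.children_right, where -1 marks a missing child (the value of sklearn.tree._tree.TREE_LEAF). The arrays below are hand-written stand-ins, not output from a fitted model, and is_leaf is a hypothetical helper name.

```python
TREE_LEAF = -1  # sentinel scikit-learn stores in children_* for leaf nodes

# Hand-written stand-ins for tree_.children_left / children_right / threshold.
children_left = [1, -1, -1]
children_right = [2, -1, -1]
threshold = [-2.0, -2.0, -2.0]  # assumption: the root really splits at -2.0 here

def is_leaf(node):
    """A node is a leaf when it has no children."""
    return children_left[node] == TREE_LEAF and children_right[node] == TREE_LEAF

leaves_by_children = [n for n in range(len(children_left)) if is_leaf(n)]
leaves_by_threshold = [n for n, t in enumerate(threshold) if t == -2.0]
print(leaves_by_children)   # only the two real leaves
print(leaves_by_threshold)  # the root is wrongly reported as a leaf too
```

In this contrived tree the threshold test misclassifies the root, while the child-array test does not.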

Answer 1

I created my own function to extract the rules from the decision trees created by sklearn:

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# dummy data:
df = pd.DataFrame({'col1':[0,1,2,3],'col2':[3,4,5,6],'dv':[0,1,0,1]})

# create decision tree
dt = DecisionTreeClassifier(max_depth=5, min_samples_leaf=1)
dt.fit(df.iloc[:, :2], df.dv)  # .ix was removed from pandas; .iloc selects the two feature columns

This function starts with the leaf nodes (identified by -1 in the child arrays) and then recursively finds their parents. I call this a node’s ‘lineage’. Along the way, I grab the values I need to create if/then/else SAS logic:

def get_lineage(tree, feature_names):
    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    features  = [feature_names[i] for i in tree.tree_.feature]

    # get ids of leaf nodes (they have no left child)
    idx = np.argwhere(left == -1)[:, 0]

    def recurse(left, right, child, lineage=None):
        if lineage is None:
            lineage = [child]
        # find the parent and whether the child hangs off its left or right branch
        if child in left:
            parent = np.where(left == child)[0].item()
            split = 'l'
        else:
            parent = np.where(right == child)[0].item()
            split = 'r'

        lineage.append((parent, split, threshold[parent], features[parent]))

        if parent == 0:
            # reached the root: put the path in root-to-leaf order
            lineage.reverse()
            return lineage
        else:
            return recurse(left, right, parent, lineage)

    for child in idx:
        for node in recurse(left, right, child):
            print(node)

The sets of tuples below contain everything I need to create SAS if/then/else statements. I do not like using do blocks in SAS which is why I create logic describing a node’s entire path. The single integer after the tuples is the ID of the terminal node in a path. All of the preceding tuples combine to create that node.

In [1]: get_lineage(dt, df.columns)
(0, 'l', 0.5, 'col1')
1
(0, 'r', 0.5, 'col1')
(2, 'l', 4.5, 'col2')
3
(0, 'r', 0.5, 'col1')
(2, 'r', 4.5, 'col2')
(4, 'l', 2.5, 'col1')
5
(0, 'r', 0.5, 'col1')
(2, 'r', 4.5, 'col2')
(4, 'r', 2.5, 'col1')
6
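As a rough illustration of how such lineage tuples could be turned into flat if/then rules, the sketch below joins the conditions on each path with "and". The paths list is copied by hand from the output above. The helper name lineage_to_rule and the wording of the generated rule are assumptions for illustration, not part of the original answer and not checked against SAS syntax.

```python
# Paths copied by hand from the get_lineage output above: each path is a
# list of (parent, split, threshold, feature) tuples followed by the id of
# the leaf it ends in.
paths = [
    [(0, 'l', 0.5, 'col1'), 1],
    [(0, 'r', 0.5, 'col1'), (2, 'l', 4.5, 'col2'), 3],
    [(0, 'r', 0.5, 'col1'), (2, 'r', 4.5, 'col2'), (4, 'l', 2.5, 'col1'), 5],
    [(0, 'r', 0.5, 'col1'), (2, 'r', 4.5, 'col2'), (4, 'r', 2.5, 'col1'), 6],
]

def lineage_to_rule(path):
    """Join the conditions on one root-to-leaf path into a single rule string."""
    *steps, leaf = path
    conditions = []
    for _parent, split, threshold, feature in steps:
        op = '<=' if split == 'l' else '>'  # left branch means feature <= threshold
        conditions.append('{} {} {}'.format(feature, op, threshold))
    return 'if {} then node = {}'.format(' and '.join(conditions), leaf)

rules = [lineage_to_rule(p) for p in paths]
for rule in rules:
    print(rule)
```

Each rule describes a node's entire path, so no nested blocks are needed.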


Answer 2

I modified the code submitted by Zelazny7 to print some pseudocode:

def get_code(tree, feature_names):
    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    features  = [feature_names[i] for i in tree.tree_.feature]
    value     = tree.tree_.value

    def recurse(left, right, threshold, features, node):
        # scikit-learn stores -2 as the threshold of a leaf node
        if threshold[node] != -2:
            print("if ( " + features[node] + " <= " + str(threshold[node]) + " ) {")
            if left[node] != -1:
                recurse(left, right, threshold, features, left[node])
            print("} else {")
            if right[node] != -1:
                recurse(left, right, threshold, features, right[node])
            print("}")
        else:
            print("return " + str(value[node]))

    recurse(left, right, threshold, features, 0)

If you call get_code(dt, df.columns) on the same example, you will obtain:

if ( col1 <= 0.5 ) {
return [[ 1.  0.]]
} else {
if ( col2 <= 4.5 ) {
return [[ 0.  1.]]
} else {
if ( col1 <= 2.5 ) {
return [[ 1.  0.]]
} else {
return [[ 0.  1.]]
}
}
}

Answer 3

scikit-learn introduced a new function called export_text in version 0.21 (May 2019) to extract the rules from a tree; it is described in the scikit-learn documentation for sklearn.tree. It’s no longer necessary to create a custom function.

Once you’ve fit your model, you just need two lines of code. First, import export_text:

from sklearn.tree import export_text

Second, create an object that will contain your rules. To make the rules look more readable, use the feature_names argument and pass a list of your feature names. For example, if your model is called model and your features are named in a dataframe called X_train, you could create an object called tree_rules:

tree_rules = export_text(model, feature_names=list(X_train.columns))

Then just print or save tree_rules. Your output will look like this:

|--- Age <= 0.63
|   |--- EstimatedSalary <= 0.61
|   |   |--- Age <= -0.16
|   |   |   |--- class: 0
|   |   |--- Age >  -0.16
|   |   |   |--- EstimatedSalary <= -0.06
|   |   |   |   |--- class: 0
|   |   |   |--- EstimatedSalary >  -0.06
|   |   |   |   |--- EstimatedSalary <= 0.40
|   |   |   |   |   |--- EstimatedSalary <= 0.03
|   |   |   |   |   |   |--- class: 1
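For a self-contained version of the two steps above, here is a minimal sketch. It assumes a small classifier fitted on scikit-learn's bundled iris data as a stand-in for your own model; the hyperparameters are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small tree on the bundled iris data (a stand-in for your own model).
iris = load_iris()
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(iris.data, iris.target)

# feature_names must list one name per feature column, in column order.
tree_rules = export_text(model, feature_names=list(iris.feature_names))
print(tree_rules)
```

The result is an ordinary string, so it can be printed, logged, or written to a file.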

Answer 4

There is a new DecisionTreeClassifier method, decision_path, in the 0.18.0 release. The developers provide an extensive (well-documented) walkthrough.

The first section of code in the walkthrough that prints the tree structure seems to be OK. However, I modified the code in the second section to interrogate one sample. My changes are denoted with # <--.

Edit: The changes marked by # <-- in the code below have since been incorporated into the walkthrough after the errors were pointed out in pull requests #8653 and #10951. It’s much easier to follow along now.

sample_id = 0
node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                    node_indicator.indptr[sample_id + 1]]

print('Rules used to predict sample %s: ' % sample_id)
for node_id in node_index:

    if leave_id[sample_id] == node_id:  # <-- changed != to ==
        #continue # <-- comment out
        print("leaf node {} reached, no decision here".format(leave_id[sample_id])) # <--

    else: # <-- added else to iterate through decision nodes
        if (X_test[sample_id, feature[node_id]] <= threshold[node_id]):
            threshold_sign = "<="
        else:
            threshold_sign = ">"

        print("decision id node %s : (X[%s, %s] (= %s) %s %s)"
              % (node_id,
                 sample_id,
                 feature[node_id],
                 X_test[sample_id, feature[node_id]], # <-- changed i to sample_id
                 threshold_sign,
                 threshold[node_id]))

Rules used to predict sample 0: 
decision id node 0 : (X[0, 3] (= 2.4) > 0.800000011921)
decision id node 2 : (X[0, 2] (= 5.1) > 4.94999980927)
leaf node 4 reached, no decision here

Change the sample_id to see the decision paths for other samples. I haven’t asked the developers about these changes; they just seemed more intuitive when working through the example.
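The loop above relies on several names that the snippet does not define (node_indicator, leave_id, feature, threshold, X_test). A minimal sketch of how they might be set up follows, reusing the names from the snippet. The iris data, the train/test split, and the hyperparameters are assumptions chosen only to make it runnable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

estimator = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
estimator.fit(X_train, y_train)

# Per-node arrays read by the loop above.
feature = estimator.tree_.feature
threshold = estimator.tree_.threshold

# Sparse indicator matrix: row i marks the nodes that sample i passes through.
node_indicator = estimator.decision_path(X_test)
# Id of the leaf each sample ends up in.
leave_id = estimator.apply(X_test)

sample_id = 0
node_index = node_indicator.indices[node_indicator.indptr[sample_id]:
                                    node_indicator.indptr[sample_id + 1]]
print(node_index)  # node ids visited by sample 0, starting at the root
```

With these names in place, the rule-printing loop can be run as written.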


Answer 5

from io import StringIO
from sklearn import tree

out = StringIO()
# export_graphviz writes the DOT source into the buffer; do not reassign out
tree.export_graphviz(clf, out_file=out)
print(out.getvalue())

This prints the DOT source of a digraph named Tree. In addition, clf.tree_.feature and clf.tree_.value are the arrays of each node's splitting feature and each node's value, respectively. More details are available in the scikit-learn source code on GitHub.
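If you only need the DOT text, export_graphviz can also return it directly when out_file is None. The sketch below assumes a small classifier fitted on the bundled iris data purely for illustration; clf stands in for your own fitted tree.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# With out_file=None the DOT source is returned as a string.
dot_source = export_graphviz(clf, out_file=None)
print(dot_source.splitlines()[0])  # first line of the digraph

# Per-node arrays: index of the split feature (-2 on leaves) and node values.
print(clf.tree_.feature)
print(clf.tree_.value.shape)
```

Both arrays have one entry per node, indexed by node id.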


Answer 6

Just because everyone was so helpful, I’ll add a modification to Zelazny7’s and Daniele’s beautiful solutions. This one is for Python 2.7, with tabs to make it more readable:

def get_code(tree, feature_names, tabdepth=0):
    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    features  = [feature_names[i] for i in tree.tree_.feature]
    value = tree.tree_.value

    def recurse(left, right, threshold, features, node, tabdepth=0):
            if (threshold[node] != -2):
                    print '\t' * tabdepth,
                    print "if ( " + features[node] + " <= " + str(threshold[node]) + " ) {"
                    if left[node] != -1:
                            recurse (left, right, threshold, features,left[node], tabdepth+1)
                    print '\t' * tabdepth,
                    print "} else {"
                    if right[node] != -1:
                            recurse (left, right, threshold, features,right[node], tabdepth+1)
                    print '\t' * tabdepth,
                    print "}"
            else:
                    print '\t' * tabdepth,
                    print "return " + str(value[node])

    recurse(left, right, threshold, features, 0)

Answer 7

The code below is my approach, under Anaconda Python 2.7 plus the package “pydot-ng”, to making a PDF file with the decision rules. I hope it is helpful.

from sklearn import tree

clf = tree.DecisionTreeClassifier(max_leaf_nodes=n)
clf_ = clf.fit(X, data_y)

feature_names = X.columns
class_name = clf_.classes_.astype(int).astype(str)

def output_pdf(clf_, name):
    from sklearn import tree
    from sklearn.externals.six import StringIO
    import pydot_ng as pydot
    dot_data = StringIO()
    tree.export_graphviz(clf_, out_file=dot_data,
                         feature_names=feature_names,
                         class_names=class_name,
                         filled=True, rounded=True,
                         special_characters=True,
                         node_ids=True)
    graph = pydot.graph_from_dot_data(dot_data.getvalue())
    graph.write_pdf("%s.pdf"%name)

output_pdf(clf_, name='filename%s'%n)



Answer 8

I’ve been going through this, but I needed the rules to be written in this format:

if A>0.4 then if B<0.2 then if C>0.8 then class='X' 

So I adapted @paulkernfeld’s answer (thanks); you can customize it to your needs:

from sklearn.tree import _tree

def tree_to_code(tree, feature_names, Y=None):  # Y is unused; kept so existing calls still work
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    pathto = dict()  # node id -> conditions on the branch currently being followed
    k = 0            # running count of printed rules

    def recurse(node, parent):
        nonlocal k
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            # left branch: feature <= threshold
            s = "{} <= {}".format(name, threshold)
            pathto[node] = s if node == 0 else pathto[parent] + ' & ' + s
            recurse(tree_.children_left[node], node)
            # right branch: feature > threshold
            s = "{} > {}".format(name, threshold)
            pathto[node] = s if node == 0 else pathto[parent] + ' & ' + s
            recurse(tree_.children_right[node], node)
        else:
            # leaf: print the accumulated path and the leaf's value
            k += 1
            print(k, ')', pathto[parent], tree_.value[node])

    recurse(0, 0)
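For rules in exactly the "if A>0.4 then if B<0.2 then class='X'" shape, a variation on the same traversal can be sketched as follows. To keep it runnable without a fitted model, it works on hand-written lists laid out like the tree_ arrays (-1 for no child, -2 for no split feature). The values, feature_names, class_names, and the helper name rules_as_text are all assumptions for illustration.

```python
# Hand-written stand-ins for the tree_ arrays of a small fitted classifier.
children_left  = [1, -1, 3, -1, 5, -1, -1]
children_right = [2, -1, 4, -1, 6, -1, -1]
feature        = [0, -2, 1, -2, 0, -2, -2]
threshold      = [0.5, -2.0, 4.5, -2.0, 2.5, -2.0, -2.0]
value          = [[2, 2], [1, 0], [1, 2], [0, 1], [1, 1], [1, 0], [0, 1]]
feature_names  = ['col1', 'col2']
class_names    = ['0', '1']

def rules_as_text(node=0, prefix=''):
    """Return one "if ... then ... class='...'" line per leaf."""
    if children_left[node] == -1:  # leaf: report the majority class
        best = max(range(len(value[node])), key=lambda c: value[node][c])
        return ["{}class='{}'".format(prefix, class_names[best])]
    name, thr = feature_names[feature[node]], threshold[node]
    left = rules_as_text(children_left[node],
                         "{}if {}<={} then ".format(prefix, name, thr))
    right = rules_as_text(children_right[node],
                          "{}if {}>{} then ".format(prefix, name, thr))
    return left + right

rules = rules_as_text()
for rule in rules:
    print(rule)
```

With a real model, the lists would be replaced by the corresponding tree_ attributes and the classifier's classes_.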

Answer 9

Here is a way to translate the whole tree into a single (not necessarily too human-readable) python expression using the SKompiler library:

from skompiler import skompile
skompile(dtree.predict).to('python/code')

Answer 10

This builds on @paulkernfeld’s answer. If you have a dataframe X with your features and a target dataframe y with your responses, and you want to get an idea of which y value ended up in which node (and also want to plot it accordingly), you can do the following:

import matplotlib.cm as cm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def tree_to_code(tree, feature_names):
    from sklearn.tree import _tree
    # Build the source of get_cat(), which maps every row of X_tmp to a leaf id.
    codelines = []
    codelines.append('def get_cat(X_tmp):\n')
    codelines.append('   catout = []\n')
    codelines.append('   for row in range(0,X_tmp.shape[0]):\n')
    codelines.append('      Xin = X_tmp.iloc[row]\n')
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]

    def recurse(node, depth):
        indent = "      " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            codelines.append('{}if Xin["{}"] <= {}:\n'.format(indent, name, threshold))
            recurse(tree_.children_left[node], depth + 1)
            codelines.append('{}else:  # if Xin["{}"] > {}\n'.format(indent, name, threshold))
            recurse(tree_.children_right[node], depth + 1)
        else:
            # leaf: record its node id as the category
            codelines.append('{}mycat = {}\n'.format(indent, node))

    recurse(0, 1)
    codelines.append('      catout.append(mycat)\n')
    codelines.append('   return pd.DataFrame(catout,index=X_tmp.index,columns=["category"])\n')
    codelines.append('node_ids = get_cat(X)\n')
    return codelines

mycode = tree_to_code(clf, X.columns.values)

# now execute the generated function and obtain the dataframe with all nodes
exec(''.join(mycode))
node_ids = [int(x[0]) for x in node_ids.values]
node_ids2 = pd.DataFrame(node_ids)

print('make plot')
colors = cm.rainbow(np.linspace(0, 1, 1 + max(node_ids)))
for i in sorted(set(node_ids)):
    plt.plot(y[node_ids2.values == i], 'o', color=colors[i], label=str(i))
plt.title('y colored by node', fontsize=14)
plt.xlabel('my xlabel')
plt.ylabel(tagname)  # tagname (the y-axis label) is not defined in this snippet
plt.xticks(rotation=70)
plt.legend(loc='upper center', bbox_to_anchor=(0.5, 1.00), shadow=True, ncol=9)
plt.tight_layout()
plt.show()
plt.close()

Not the most elegant version, but it does the job.


Answer 11

This is the code you need.

I have modified the top-voted code so that it indents correctly in a Jupyter notebook under Python 3:

import numpy as np
from sklearn.tree import _tree

def tree_to_code(tree, feature_names):
    tree_ = tree.tree_
    feature_name = [feature_names[i] 
                    if i != _tree.TREE_UNDEFINED else "undefined!" 
                    for i in tree_.feature]
    print("def tree({}):".format(", ".join(feature_names)))

    def recurse(node, depth):
        indent = "    " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            print("{}if {} <= {}:".format(indent, name, threshold))
            recurse(tree_.children_left[node], depth + 1)
            print("{}else:  # if {} > {}".format(indent, name, threshold))
            recurse(tree_.children_right[node], depth + 1)
        else:
            print("{}return {}".format(indent, np.argmax(tree_.value[node])))

    recurse(0, 1)
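Note that the function above returns the position of the largest entry of tree_.value[node], which for a single-output classifier is an index into the classifier's classes_ array rather than the label itself. A small sketch of that mapping, with hand-written values assumed purely for illustration:

```python
# Hand-written stand-ins: classes_ as stored on a fitted classifier, and the
# per-class entries of one leaf (the inner row of tree_.value[node]).
classes_ = ['setosa', 'versicolor', 'virginica']
leaf_value = [0.0, 47.0, 1.0]

# Index of the largest entry (what np.argmax would return for this row).
best_index = max(range(len(leaf_value)), key=leaf_value.__getitem__)
label = classes_[best_index]
print(best_index, label)
```

With a real model, classes_[np.argmax(tree_.value[node])] would be the analogous lookup.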

Answer 12

Here is a function that prints the rules of a scikit-learn decision tree under Python 3, with offsets for conditional blocks to make the structure more readable:

def print_decision_tree(tree, feature_names=None, offset_unit='    '):
    '''Prints a textual representation of the rules of a decision tree
    tree: scikit-learn representation of tree
    feature_names: list of feature names. They are set to f0,f1,f2,... if not specified
    offset_unit: a string of offset of the conditional block'''

    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    value = tree.tree_.value
    if feature_names is None:
        features  = ['f%d'%i for i in tree.tree_.feature]
    else:
        features  = [feature_names[i] for i in tree.tree_.feature]        

    def recurse(left, right, threshold, features, node, depth=0):
            offset = offset_unit*depth
            if (threshold[node] != -2):
                    print(offset+"if ( " + features[node] + " <= " + str(threshold[node]) + " ) {")
                    if left[node] != -1:
                            recurse (left, right, threshold, features,left[node],depth+1)
                    print(offset+"} else {")
                    if right[node] != -1:
                            recurse (left, right, threshold, features,right[node],depth+1)
                    print(offset+"}")
            else:
                    print(offset+"return " + str(value[node]))

    recurse(left, right, threshold, features, 0,0)

Answer 13

You can also make the output more informative by printing which class each leaf belongs to, or even its output value.

def print_decision_tree(tree, feature_names, offset_unit='    '):
    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    value     = tree.tree_.value
    # Leaf nodes store a negative placeholder in tree_.feature, so skip the lookup for them.
    if feature_names is None:
        features = ['f%d' % i if i >= 0 else None for i in tree.tree_.feature]
    else:
        features = [feature_names[i] if i >= 0 else None for i in tree.tree_.feature]

    def recurse(node, depth=0):
        offset = offset_unit * depth
        if left[node] != right[node]:  # internal node; a leaf has both children set to -1
            print(offset + "if ( " + features[node] + " <= " + str(threshold[node]) + " ) {")
            recurse(left[node], depth + 1)
            print(offset + "} else {")
            recurse(right[node], depth + 1)
            print(offset + "}")
        else:
            # print(offset, value[node])

            # value[node] has shape (n_outputs, n_classes). This assumes a single output
            # and exactly two classes ordered (NO, YES), as in the original post.
            val_no, val_yes = value[node][0]

            if val_yes > val_no:
                print(offset, '\033[1m', "YES")
                print('\033[0m')
            elif val_no > val_yes:
                print(offset, '\033[1m', "NO")
                print('\033[0m')
            else:
                print(offset, '\033[1m', "Tie")
                print('\033[0m')

    recurse(0)


Answer 14

Here is my approach to extracting the decision rules in a form that can be used directly in SQL, so the data can be grouped by node. (Based on the approaches of previous posters.)

The result will be a series of CASE clauses that can be copied into an SQL statement, e.g.

SELECT COALESCE(CASE WHEN <conditions> THEN <NodeA> END,
                CASE WHEN <conditions> THEN <NodeB> END,
                ....) NodeName, *
FROM <table or view>


import pickle

feature_names = .............  # placeholder from the original post: your list of feature names
clf = pickle.loads(trained_model)  # trained_model: bytes holding a pickled, fitted decision tree
impurity = clf.tree_.impurity

Results = []  # one "case when ... then ..." string per leaf

def print_decision_tree(tree, feature_names):
    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    # Leaf nodes store a negative placeholder in tree_.feature, so skip the lookup for them.
    features  = [feature_names[i] if i >= 0 else None for i in tree.tree_.feature]

    n_nodes = len(left)
    LeftParents  = [-1] * n_nodes  # parent of node i if i is a left child, else -1
    RightParents = [-1] * n_nodes  # parent of node i if i is a right child, else -1
    ContsNode    = [""] * n_nodes  # the "( feature <= threshold )" test at node i
    Path         = [""] * n_nodes  # conjunction of tests leading to node i

    # Nodes are stored parent-before-child, so one pass in index order is enough.
    for i in range(n_nodes):  # i is the node id
        if LeftParents[i] >= 0:
            p = LeftParents[i]
            Path[i] = (Path[p] + " AND " if Path[p] else "") + ContsNode[p]
        if RightParents[i] >= 0:
            p = RightParents[i]
            Path[i] = (Path[p] + " AND" if Path[p] else "") + " not " + ContsNode[p]
        if left[i] == -1 and right[i] == -1:
            # Leaf: emit the clause, labelled with node id, impurity and (truncated) path
            Results.append(" case when  " + Path[i] + "  then '" + "{:4d}".format(i) + " " + "{:2.2f}".format(impurity[i]) + " " + Path[i][0:180] + "'")
        else:
            LeftParents[left[i]] = i
            RightParents[right[i]] = i
            ContsNode[i] = "( " + features[i] + " <= " + str(threshold[i]) + " ) "

print_decision_tree(clf, feature_names)
SqlOut = ""
for result in Results:
    SqlOut = SqlOut + result + " end," + chr(13) + chr(10)

Answer 15

Now you can use export_text.

from sklearn.tree import export_text

r = export_text(loan_tree, feature_names=(list(X_train.columns)))
print(r)

A complete example from the scikit-learn documentation:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text
iris = load_iris()
X = iris['data']
y = iris['target']
decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
decision_tree = decision_tree.fit(X, y)
r = export_text(decision_tree, feature_names=iris['feature_names'])
print(r)
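The question asked for one flat rule per leaf, which is a slightly different shape from the indented report above. A minimal sketch of that flattening follows. It assumes hand-written lists standing in for `clf.tree_.children_left`, `children_right`, `feature` and `threshold`; the `leaf_rules` name, the sample values and the `leaf_class` mapping are all hypothetical.

```python
# Hand-written stand-ins for a fitted tree's clf.tree_ arrays (hypothetical values).
# Node 0 splits on A, node 1 splits on B; nodes 2, 3 and 4 are leaves.
children_left  = [1, 2, -1, -1, -1]
children_right = [4, 3, -1, -1, -1]
feature        = [0, 1, -2, -2, -2]
threshold      = [0.4, 0.2, -2.0, -2.0, -2.0]
leaf_class     = {2: "X", 3: "Y", 4: "Z"}   # assumed label for each leaf
feature_names  = ["A", "B"]


def leaf_rules(node=0, conditions=()):
    """Return one "if ... then class=..." string per leaf, left to right."""
    if children_left[node] == -1:            # -1 marks a leaf
        path = " and ".join(conditions) or "True"
        return ["if {} then class='{}'".format(path, leaf_class[node])]
    name = feature_names[feature[node]]
    test = threshold[node]
    return (
        leaf_rules(children_left[node], conditions + ("{} <= {}".format(name, test),))
        + leaf_rules(children_right[node], conditions + ("{} > {}".format(name, test),))
    )


rules = leaf_rules()
for rule in rules:
    print(rule)
```

With a real classifier the lists would come from `clf.tree_`, and the leaf label could be taken as `np.argmax(clf.tree_.value[node])`, as an earlier answer does.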

Answer 16

Modified Zelazny7’s code to fetch SQL from the decision tree.

# SQL from decision tree
import numpy as np

def get_lineage(tree, feature_names):
    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    # Leaf nodes store a negative placeholder in tree_.feature, so skip the lookup for them
    features  = [feature_names[i] if i >= 0 else None for i in tree.tree_.feature]
    le = '<='
    g  = '>'
    # get ids of leaf nodes
    idx = np.argwhere(left == -1)[:, 0]

    def recurse(left, right, child, lineage=None):
        if lineage is None:
            lineage = [child]
        if child in left:
            parent = np.where(left == child)[0].item()
            split = 'l'
        else:
            parent = np.where(right == child)[0].item()
            split = 'r'
        lineage.append((parent, split, threshold[parent], features[parent]))
        if parent == 0:
            lineage.reverse()
            return lineage
        else:
            return recurse(left, right, parent, lineage)

    print('case ')
    for j, child in enumerate(idx):
        clause = ' when '
        for node in recurse(left, right, child):
            if not isinstance(node, tuple):
                continue  # the last entry is the leaf id itself, not a split
            sign = le if node[1] == 'l' else g
            clause = clause + node[3] + sign + str(node[2]) + ' and '
        clause = clause[:-4] + ' then ' + str(j)
        print(clause)
    print('else 99 end as clusters')

Answer 17

Apparently, a long time ago somebody already tried to add the following function to scikit-learn's official tree export functions (which basically only support export_graphviz):

def export_dict(tree, feature_names=None, max_depth=None) :
    """Export a decision tree in dict format.

Here is his full commit:

https://github.com/scikit-learn/scikit-learn/blob/79bdc8f711d0af225ed6be9fdb708cea9f98a910/sklearn/tree/export.py

I'm not exactly sure what happened to this commit, but you could also try to use that function.

I think this warrants a serious documentation request to the good people of scikit-learn to properly document the sklearn.tree.Tree API, which is the underlying tree structure that DecisionTreeClassifier exposes as its attribute tree_.


Answer 18

Just use the function from sklearn.tree like this:

from sklearn.tree import export_graphviz

export_graphviz(tree,
                out_file="tree.dot",
                feature_names=tree.columns)  # or just ["petal length", "petal width"]

Then look in your project folder for the file tree.dot, copy ALL of its content, paste it into http://www.webgraphviz.com/ and generate your graph :)


Answer 19

Thanks to @paulkerfeld for the wonderful solution. On top of it, for anyone who wants a serialized version of a tree: just use tree.threshold, tree.children_left, tree.children_right, tree.feature and tree.value. Since the leaves don't have splits, and hence no feature names or children, their placeholders in tree.feature and tree.children_* are _tree.TREE_UNDEFINED and _tree.TREE_LEAF. Every split is assigned a unique index by depth-first search.
Note that tree.value has shape [n, 1, 1] for a single-output regression tree like the one in that answer.
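To make the serialization idea concrete, here is a minimal sketch that round-trips those five arrays through JSON and predicts from the restored copy without scikit-learn. The array values are hypothetical and hand-written; with a real model you would call `.tolist()` on each NumPy array first. The `predict` helper is an illustration, not part of any library.

```python
import json

# Hypothetical arrays, shaped like the tree attributes listed above:
# node 0 is a split on feature 0, nodes 1 and 2 are leaves.
serialized = json.dumps({
    "children_left":  [1, -1, -1],
    "children_right": [2, -1, -1],
    "feature":        [0, -2, -2],
    "threshold":      [6.0, -2.0, -2.0],
    "value":          [[[4.5]], [[2.0]], [[8.0]]],   # shape [n, 1, 1]
})


def predict(tree, x):
    """Walk the serialized tree from the root to a leaf for one sample x."""
    node = 0
    while tree["children_left"][node] != -1:         # -1 marks a leaf
        if x[tree["feature"][node]] <= tree["threshold"][node]:
            node = tree["children_left"][node]
        else:
            node = tree["children_right"][node]
    return tree["value"][node][0][0]


restored = json.loads(serialized)
print(predict(restored, [3.0]))   # follows the left branch
print(predict(restored, [7.5]))   # follows the right branch
```

Storing plain lists keeps the file readable from any language, at the cost of losing everything else the estimator object carries.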


Answer 20

Here is a function that generates Python code from a decision tree by converting the output of export_text:

import string
from sklearn.tree import export_text

def export_py_code(tree, feature_names, max_depth=100, spacing=4):
    if spacing < 2:
        raise ValueError('spacing must be > 1')

    # Clean up feature names (for correctness)
    nums = string.digits
    alnums = string.ascii_letters + nums
    clean = lambda s: ''.join(c if c in alnums else '_' for c in s)
    features = [clean(x) for x in feature_names]
    features = ['_'+x if x[0] in nums else x for x in features if x]
    if len(set(features)) != len(feature_names):
        raise ValueError('invalid feature names')

    # First: export tree to text
    res = export_text(tree, feature_names=features, 
                        max_depth=max_depth,
                        decimals=6,
                        spacing=spacing-1)

    # Second: generate Python code from the text
    skip, dash = ' '*spacing, '-'*(spacing-1)
    code = 'def decision_tree({}):\n'.format(', '.join(features))
    for line in repr(tree).split('\n'):
        code += skip + "# " + line + '\n'
    for line in res.split('\n'):
        line = line.rstrip().replace('|',' ')
        if '<' in line or '>' in line:
            line, val = line.rsplit(maxsplit=1)
            line = line.replace(' ' + dash, 'if')
            line = '{} {:g}:'.format(line, float(val))
        else:
            line = line.replace(' {} class:'.format(dash), 'return')
        code += skip + line + '\n'

    return code

Sample usage:

res = export_py_code(tree, feature_names=names, spacing=4)
print (res)

Sample output:

def decision_tree(f1, f2, f3):
    # DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
    #                        max_features=None, max_leaf_nodes=None,
    #                        min_impurity_decrease=0.0, min_impurity_split=None,
    #                        min_samples_leaf=1, min_samples_split=2,
    #                        min_weight_fraction_leaf=0.0, presort=False,
    #                        random_state=42, splitter='best')
    if f1 <= 12.5:
        if f2 <= 17.5:
            if f1 <= 10.5:
                return 2
            if f1 > 10.5:
                return 3
        if f2 > 17.5:
            if f2 <= 22.5:
                return 1
            if f2 > 22.5:
                return 1
    if f1 > 12.5:
        if f1 <= 17.5:
            if f3 <= 23.5:
                return 2
            if f3 > 23.5:
                return 3
        if f1 > 17.5:
            if f1 <= 25:
                return 1
            if f1 > 25:
                return 2

The above example is generated with names = ['f'+str(j+1) for j in range(NUM_FEATURES)].

One handy feature is that it can generate a smaller file by reducing the spacing: just set spacing=2.
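Because `export_py_code` returns the function as a string, that string still has to be executed before the function can be called. A minimal sketch of one way to do that, assuming the source comes from your own trusted model; the `generated` string below is a hand-shortened, hypothetical stand-in for real output:

```python
# Hypothetical stand-in for the string returned by export_py_code(...)
generated = (
    "def decision_tree(f1, f2):\n"
    "    if f1 <= 12.5:\n"
    "        if f2 <= 17.5:\n"
    "            return 2\n"
    "        if f2 > 17.5:\n"
    "            return 1\n"
    "    if f1 > 12.5:\n"
    "        return 3\n"
)

namespace = {}
exec(generated, namespace)          # only do this with source you generated yourself
decision_tree = namespace["decision_tree"]

print(decision_tree(10, 20))        # takes the f1 <= 12.5, f2 > 17.5 branch
print(decision_tree(15, 0))         # takes the f1 > 12.5 branch
```

Writing the string to a `.py` file and importing it is an alternative that avoids `exec` and leaves a reviewable artifact.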


TensorFlow, why was python the chosen language?

Question: TensorFlow, why was python the chosen language?

I recently started studying deep learning and other ML techniques, and I started searching for frameworks that simplify the process of building a net and training it. Then I found TensorFlow. Having little experience in the field, it seems to me that speed is a big factor in making a big ML system, even more so when working with deep learning, so why was Python chosen by Google to make TensorFlow? Wouldn't it be better to build it on a language that can be compiled rather than interpreted?

What are the advantages of using Python over a language like C++ for machine learning?


Answer 0

The most important thing to realize about TensorFlow is that, for the most part, the core is not written in Python: It’s written in a combination of highly-optimized C++ and CUDA (Nvidia’s language for programming GPUs). Much of that happens, in turn, by using Eigen (a high-performance C++ and CUDA numerical library) and NVidia’s cuDNN (a very optimized DNN library for NVidia GPUs, for functions such as convolutions).

The model for TensorFlow is that the programmer uses “some language” (most likely Python!) to express the model. This model, written in the TensorFlow constructs such as:

h1 = tf.nn.relu(tf.matmul(l1, W1) + b1)
h2 = ...

is not actually executed when the Python is run. Instead, what’s actually created is a dataflow graph that says to take particular inputs, apply particular operations, supply the results as the inputs to other operations, and so on. This model is executed by fast C++ code, and for the most part, the data going between operations is never copied back to the Python code.

Then the programmer “drives” the execution of this model by pulling on nodes — for training, usually in Python, and for serving, sometimes in Python and sometimes in raw C++:

sess.run(eval_results)

This one Python (or C++) function call uses either an in-process call to C++ or, for the distributed version, an RPC to call into the C++ TensorFlow server and tell it to execute, and then copies back the results.
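The split between describing a computation and executing it can be illustrated with a toy model in plain Python. This is only a sketch of the idea: the `Node`, `Placeholder` and `run` names are hypothetical and are not TensorFlow's real classes, and the Python callables stand in for what would be C++ kernels.

```python
class Node:
    """One operation in a toy dataflow graph (not TensorFlow's real classes)."""

    def __init__(self, op, *inputs):
        self.op = op            # a Python callable standing in for a C++ kernel
        self.inputs = inputs    # upstream Node objects

    def __add__(self, other):
        return Node(lambda a, b: a + b, self, other)

    def __mul__(self, other):
        return Node(lambda a, b: a * b, self, other)


class Placeholder(Node):
    """An input whose value is only supplied when the graph is run."""

    def __init__(self, name):
        super().__init__(None)
        self.name = name


def run(node, feed):
    """Evaluate `node` by pulling on its inputs, loosely like sess.run."""
    if isinstance(node, Placeholder):
        return feed[node.name]
    args = [run(i, feed) for i in node.inputs]
    return node.op(*args)


x = Placeholder("x")
w = Placeholder("w")
b = Placeholder("b")
h = x * w + b          # only builds graph nodes; nothing is computed here

result = run(h, {"x": 2.0, "w": 3.0, "b": 1.0})
print(result)
```

In this toy the evaluation still happens in Python; the point of the design described above is that the equivalent of `run` lives in the C++ runtime, so intermediate values never need to cross back into the interpreter.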

So, with that said, let’s re-phrase the question: Why did TensorFlow choose Python as the first well-supported language for expressing and controlling the training of models?

The answer to that is simple: Python is probably the most comfortable language for a large range of data scientists and machine learning experts that is also easy to integrate with, and use to control, a C++ backend, while also being general, widely used both inside and outside of Google, and open source. Given that, with the basic model of TensorFlow, the performance of Python isn't that important, it was a natural fit. It's also a huge plus that NumPy makes it easy to do pre-processing in Python, also with high performance, before feeding the data into TensorFlow for the truly CPU-heavy things.

There’s also a bunch of complexity in expressing the model that isn’t used when executing it — shape inference (e.g., if you do matmul(A, B), what is the shape of the resulting data?) and automatic gradient computation. It turns out to have been nice to be able to express those in Python, though I think in the long term they’ll probably move to the C++ backend to make adding other languages easier.

(The hope, of course, is to support other languages in the future for creating and expressing models. It’s already quite straightforward to run inference using several other languages — C++ works now, someone from Facebook contributed Go bindings that we’re reviewing now, etc.)


Answer 1

TF is not written in Python. It is written in C++ (and uses high-performance numerical libraries and CUDA code), and you can check this by looking at its GitHub repository. So the core is not written in Python, but TF provides an interface to many other languages (Python, C++, Java, Go).

If you come from the data analysis world, you can think of it like numpy (not written in Python, but provides an interface to Python); if you are a web developer, think of it as a database (PostgreSQL, MySQL, which can be invoked from Java, Python, PHP).


The Python frontend (the language in which people write models in TF) is the most popular for many reasons. In my opinion the main reason is historical: the majority of ML users already use it (another popular choice is R), so if you do not provide an interface to Python, your library is most probably doomed to obscurity.


But being written in Python does not mean that your model is executed in Python. On the contrary, if you write your model the right way, Python is never executed during the evaluation of the TF graph (except for tf.py_func(), which exists for debugging and should be avoided in real models exactly because it is executed on Python's side).

This is different from, for example, numpy. If you do np.linalg.eig(np.matmul(A, np.transpose(A))) (which is eig(AA')), the operation will compute the transpose in some fast language (C++ or Fortran), return it to Python, take it from Python together with A, compute the multiplication in some fast language and return it to Python, then compute the eigenvalues and return them to Python. So although expensive operations like matmul and eig are calculated efficiently, you still lose time moving the results back and forth to Python. TF does not do this: once you have defined the graph, your tensors flow not in Python but in C++/CUDA/something else.


Answer 2

Python allows you to create extension modules using C and C++, interfacing with native code while still getting the advantages that Python gives you.

TensorFlow uses Python, yes, but it also contains large amounts of C++.

This allows a simpler interface for experimentation, with less mental overhead, in Python, while adding performance by programming the most important parts in C++.


Answer 3

The latest ratio, which you can check from here, shows that inside TensorFlow C++ takes ~50% of the code and Python takes ~40% of the code.

Both C++ and Python are official languages at Google, so it is no wonder that this is so. If I had to provide fast regression where C++ and Python are present…

C++ is inside the computational algebra, and Python is used for everything else, including the testing. Knowing how ubiquitous testing is today, it is no wonder that Python code contributes that much to TF.


Save classifier to disk in scikit-learn

Question: Save classifier to disk in scikit-learn

How do I save a trained Naive Bayes classifier to disk and use it to predict data?

I have the following sample program from the scikit-learn website:

from sklearn import datasets
iris = datasets.load_iris()
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print("Number of mislabeled points : %d" % (iris.target != y_pred).sum())

Answer 0

Classifiers are just objects that can be pickled and dumped like any other. To continue your example:

import pickle  # Python 3 replacement for cPickle
# save the classifier
with open('my_dumped_classifier.pkl', 'wb') as fid:
    pickle.dump(gnb, fid)

# load it again
with open('my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = pickle.load(fid)

Answer 1

You can also use joblib.dump and joblib.load, which are much more efficient at handling numerical arrays than the default Python pickler.

Joblib is installed alongside scikit-learn as a dependency:

>>> import joblib
>>> from sklearn.datasets import load_digits
>>> from sklearn.linear_model import SGDClassifier

>>> digits = load_digits()
>>> clf = SGDClassifier().fit(digits.data, digits.target)
>>> clf.score(digits.data, digits.target)  # evaluate training error
0.9526989426822482

>>> filename = '/tmp/digits_classifier.joblib.pkl'
>>> _ = joblib.dump(clf, filename, compress=9)

>>> clf2 = joblib.load(filename)
>>> clf2
SGDClassifier(alpha=0.0001, class_weight=None, epsilon=0.1, eta0=0.0,
       fit_intercept=True, learning_rate='optimal', loss='hinge', n_iter=5,
       n_jobs=1, penalty='l2', power_t=0.5, rho=0.85, seed=0,
       shuffle=False, verbose=0, warm_start=False)
>>> clf2.score(digits.data, digits.target)
0.9526989426822482

Edit: in Python 3.8+ it is now possible to use pickle for efficient pickling of objects with large numerical arrays as attributes, if you use pickle protocol 5 (which is not the default).
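As a rough illustration of the protocol note above, the sketch below round-trips a plain dictionary through pickle with protocol 5 requested explicitly. The `state` dictionary is a hypothetical stand-in for an estimator's fitted attributes; it is not a scikit-learn object.

```python
import pickle

# Hypothetical stand-in for an estimator's fitted attributes.
state = {"coef": [0.5, -1.25, 3.0], "classes": ["ham", "spam"]}

# Request protocol 5 explicitly (available from Python 3.8 onwards).
blob = pickle.dumps(state, protocol=5)

# Loading needs no protocol argument; it is read from the stream header.
restored = pickle.loads(blob)
print(restored == state)
```

The same `protocol=5` argument can be passed to `pickle.dump` when writing to a file.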


Answer 2

What you are looking for is called model persistence in sklearn terms, and it is documented in the introduction and in the model persistence sections.

So you have initialized your classifier and trained it for a long time with

clf = some.classifier()
clf.fit(X, y)

After this you have two options:

1) Using pickle

import pickle
# now you can save it to a file
with open('filename.pkl', 'wb') as f:
    pickle.dump(clf, f)

# and later you can load it
with open('filename.pkl', 'rb') as f:
    clf = pickle.load(f)

2) Using joblib

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
# now you can save it to a file
joblib.dump(clf, 'filename.pkl')
# and later you can load it
clf = joblib.load('filename.pkl')

Once again, it is helpful to read the above-mentioned links.


Answer 3

In many cases, particularly with text classification, it is not enough just to store the classifier; you will need to store the vectorizer as well so that you can vectorize your input in the future.

import pickle
with open('model.pkl', 'wb') as fout:
  pickle.dump((vectorizer, clf), fout)

Future use case:

with open('model.pkl', 'rb') as fin:
  vectorizer, clf = pickle.load(fin)

X_new = vectorizer.transform(new_samples)
X_new_preds = clf.predict(X_new)

Before dumping the vectorizer, one can delete the stop_words_ property of the vectorizer with:

vectorizer.stop_words_ = None

to make dumping more efficient. Also, if your classifier parameters are sparse (as in most text classification examples), you can convert the parameters from dense to sparse, which will make a huge difference in terms of memory consumption, loading and dumping. Sparsify the model with:

clf.sparsify()

This works automatically for SGDClassifier, but if you know your model is sparse (lots of zeros in clf.coef_), you can manually convert clf.coef_ into a CSR scipy sparse matrix with:

import scipy.sparse
clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)

and then you can store it more efficiently.
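To see why a sparse representation helps when dumping, here is a minimal standard-library sketch. It does not use scikit-learn or scipy: the `dense` list is a hypothetical coefficient vector, and the dictionary of non-zero entries only imitates what a real sparse matrix format does.

```python
import pickle

# Hypothetical dense coefficient vector: 10,000 weights, only 3 non-zero.
dense = [0.0] * 10_000
dense[7], dense[420], dense[9_001] = 1.5, -0.25, 3.0

# Sparse form: keep only the non-zero entries, keyed by their index.
sparse = {i: w for i, w in enumerate(dense) if w != 0.0}

# Compare how many bytes each form takes once pickled.
dense_bytes = len(pickle.dumps(dense))
sparse_bytes = len(pickle.dumps(sparse))
print(dense_bytes, sparse_bytes)

# Rebuild the dense vector to confirm that nothing was lost.
rebuilt = [sparse.get(i, 0.0) for i in range(len(dense))]
```

The saving grows with the share of zeros, which is why it matters most for text models with very large vocabularies.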


Answer 4

sklearn estimators implement methods to make it easy for you to save the relevant trained properties of an estimator. Some estimators implement __getstate__ methods themselves, but others, like the GMM, just use the base implementation, which simply saves the object's inner dictionary:

def __getstate__(self):
    try:
        state = super(BaseEstimator, self).__getstate__()
    except AttributeError:
        state = self.__dict__.copy()

    if type(self).__module__.startswith('sklearn.'):
        return dict(state.items(), _sklearn_version=__version__)
    else:
        return state

The recommended method to save your model to disk is to use the pickle module:

from sklearn import datasets
from sklearn.svm import SVC
iris = datasets.load_iris()
X = iris.data[:100, :2]
y = iris.target[:100]
model = SVC()
model.fit(X,y)
import pickle
with open('mymodel','wb') as f:
    pickle.dump(model,f)

However, you should save additional data so you can retrain your model in the future, or suffer dire consequences (such as being locked into an old version of sklearn).

From the documentation:

In order to rebuild a similar model with future versions of scikit-learn, additional metadata should be saved along the pickled model:

The training data, e.g. a reference to an immutable snapshot

The python source code used to generate the model

The versions of scikit-learn and its dependencies

The cross validation score obtained on the training data

This is especially true for ensemble estimators that rely on the tree.pyx module written in Cython (such as IsolationForest), since it creates a coupling to the implementation, which is not guaranteed to be stable between versions of sklearn. It has seen backwards-incompatible changes in the past.

If your models become very large and loading becomes a nuisance, you can also use the more efficient joblib. From the documentation:

In the specific case of scikit-learn, it may be more interesting to use joblib's replacement of pickle (joblib.dump & joblib.load), which is more efficient on objects that carry large numpy arrays internally, as is often the case for fitted scikit-learn estimators, but can only pickle to the disk and not to a string.
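One way to keep the metadata listed above next to the pickled model is a small JSON sidecar file. The sketch below is only an assumption about how that could look: the function name, the field names and the placeholder values (`iris-snapshot-001`, `train_svc.py`) are all hypothetical, and none of this is a scikit-learn API.

```python
import json
import sys
import tempfile
from pathlib import Path


def write_model_metadata(path, *, training_data_ref, source_file,
                         library_versions, cv_score):
    """Write a JSON sidecar describing how a pickled model was produced."""
    metadata = {
        "training_data": training_data_ref,    # reference to an immutable snapshot
        "source_file": source_file,            # script that generated the model
        "python_version": sys.version.split()[0],
        "library_versions": library_versions,  # e.g. scikit-learn and its dependencies
        "cv_score": cv_score,                  # cross-validation score on training data
    }
    Path(path).write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return metadata


with tempfile.TemporaryDirectory() as tmp:
    sidecar = Path(tmp) / "mymodel.meta.json"
    written = write_model_metadata(
        sidecar,
        training_data_ref="iris-snapshot-001",         # placeholder value
        source_file="train_svc.py",                    # placeholder value
        library_versions={"scikit-learn": "unknown"},  # fill in real versions
        cv_score=None,                                 # fill in the real score
    )
    loaded = json.loads(sidecar.read_text())
```

In real use the version strings would come from the installed packages rather than being typed in by hand.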


Answer 5

sklearn.externals.joblib has been deprecated since 0.21 and was removed in v0.23:

/usr/local/lib/python3.7/site-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
warnings.warn(msg, category=FutureWarning)

Therefore, you need to install joblib:

pip install joblib

and finally write the model to disk:

import joblib
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier


digits = load_digits()
clf = SGDClassifier().fit(digits.data, digits.target)

with open('myClassifier.joblib.pkl', 'wb') as f:
    joblib.dump(clf, f, compress=9)

Now, in order to read the dumped file, all you need to run is:

with open('myClassifier.joblib.pkl', 'rb') as f:
    my_clf = joblib.load(f)

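GamestonkTerminal - A free/open-source software alternative to the Bloomberg Terminal

Gamestonk Terminal is an awesome stock and cryptocurrency market terminal that was developed for fun, while I saw my GME shares tanking. But hey, I like the stock 💎🙌

How it's going:

Gamestonk Terminal provides a modern Python-based integrated environment for investment research that allows the average Joe retail trader to leverage state-of-the-art data science and machine learning technologies.

As a modern Python-based environment, GamestonkTerminal opens access to numerous Python data libraries in data science (Pandas, Numpy, Scipy, Jupyter), machine learning (Pytorch, TensorFlow, Sklearn, Flair), and data acquisition (Beautiful Soup and numerous third-party APIs).

Donation

Gamestonk Terminal is free and open-source software. This means that the entire codebase is public and free to use for any user.

Our small team has been working hard to provide as many updates as possible to the project. This is done outside working hours, often late at night, to improve the tool. Although we don't make any money from Gamestonk Terminal, we want to make sure that all our users get the most out of our software for their investments. As we continue building on this project, we would greatly appreciate any type of donation or support, so we can buy more coffee to fuel us even more!

This is our Patreon page: https://www.patreon.com/gamestonkterminal

There are many ways to help support GST. If you would like to help in a non-monetary way, joining our Discord and sharing the terminal with friends helps a lot. Thanks in advance, apes.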

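Getting Started

Install

If you'd like to see a video recording of the installation process, @JohnnyDankSeed has made one available here.

User @mchow01 has made a tutorial available on how to run the terminal on an Apple M1.

The project supports Python 3.7, 3.8 and 3.9.

Our current recommendation is to use this project with Anaconda's Python distribution, either the full Anaconda3 Latest or Miniconda3 Latest. Several features in this project utilize machine learning. The machine learning Python dependencies are optional. If you decide to add machine learning features at a later point, you will likely have a better user experience with Anaconda's Python distribution.

Starting the project

  1. Install Anaconda (it's on the AUR as anaconda or miniconda3!)

Confirm that you have it with: conda -V. The output should be something along the lines of: conda 4.9.2

  2. Install git
conda install -c anaconda git
  3. Clone the project
  • via HTTPS: git clone https://github.com/GamestonkTerminal/GamestonkTerminal.git
  • via SSH: git clone git@github.com:GamestonkTerminal/GamestonkTerminal.git
  4. Navigate into the project's folder
cd GamestonkTerminal/
  5. Create the environment

You can name the environment whatever you want. Although you could use names such as welikethestock, thisistheway or diamondhands, we recommend something simple and intuitive like gst, because this name will be used from now on.

conda env create -n gst --file build/conda/conda-3-8-env.yaml
  6. Activate the virtual environment
conda activate gst

Note: at the end, you can deactivate it with: conda deactivate

  7. Install the poetry dependencies
poetry install

If you are having trouble with poetry (e.g. on a Windows system), simply install requirements.txt with pip

pip install -r requirements.txt
  8. You're ready to Gamestonk it!
python terminal.py
  9. (Windows - optional) Speeding up the opening process in the future

After you've installed Gamestonk Terminal, you'll find a file named "Gamestonk Terminal.bat". You can use this file to open Gamestonk Terminal quicker. If you want, you can move this file to your desktop. If you run into issues while trying to run the batch file, edit the file and check that the directories match. This file assumes you used the default directories when installing.

Note: when you close the terminal and re-open it, the only command you need to re-run is conda activate gst before you call python terminal.py again.

Troubleshooting: if you are having trouble installing, check our newest troubleshoot page.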

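Advanced User Install - Machine Learning

If you are an advanced user and use other Python distributions, we have several requirements.txt documents that you can pick from to download the project dependencies.

If you are using conda, then instead of the build/conda/conda-3-8-env.yaml configuration file in step 5, use build/conda/conda-3-8-env-full.

Note: the requirements.txt files are tested and work for this project; however, they may contain slightly older versions. Hence, it is recommended that users set up a virtual Python environment prior to installing these. This allows the dependencies required by different projects to be kept in separate places.

If you would like to use the optional machine learning features:

  • Enable the prediction feature flag:
ENABLE_PREDICT = os.getenv("GTFF_ENABLE_PREDICT") or True
  • Install the optional ML feature dependencies:
poetry install -E prediction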
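A detail worth knowing about the `os.getenv(...) or True` pattern above: `os.getenv` returns a string, so any non-empty value, including the text "False", is truthy. The sketch below shows one possible way to parse such a flag into a real boolean. `env_flag` is a hypothetical helper written for this illustration, not part of the project.

```python
import os


def env_flag(name, default=True):
    """Interpret an environment variable as a boolean feature flag."""
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        return default  # unset or empty: fall back to the default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


os.environ["GTFF_ENABLE_PREDICT"] = "False"

# The original pattern yields the string "False", which is still truthy.
naive = os.getenv("GTFF_ENABLE_PREDICT") or True

# The helper yields an actual boolean.
parsed = env_flag("GTFF_ENABLE_PREDICT")

print(repr(naive), bool(naive), parsed)
```

With the original pattern, the feature can only be switched off by editing the flag in the source rather than through the environment.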

If you want to set up a docker image:

  • Build the docker: docker build .
  • Run it: docker run -it gamestonkterminal:dev

Note: the problem with docker is that it won't output matplotlib figures.

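Update Terminal

The terminal is constantly being updated with new features and bug fixes. To update your terminal and get the latest changes, run:

git pull

If this fails because you have modified some Python files and there is a conflict with the update, you can use:

git stash

Then re-run poetry install or pip install -r requirements.txt to get any new dependencies.

Once installation is finished, you're ready to gamestonk.

If you stashed your changes previously, you can un-stash them with:

git stash pop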

API Keys

The project is built around several different API calls, whether to access historical data or financials.

These are the cases where a key is necessary:

Once you have obtained these, don't forget to update config_terminal.py.

Alternatively, you can set them as the following environment variables:

Website                    Variables
Alpha Vantage              GT_API_KEY_ALPHAVANTAGE
Binance                    GT_API_BINANCE_KEY
                           GT_API_BINANCE_SECRET
CoinMarketCap              GT_CMC_API_KEY
DEGIRO                     GT_DG_USERNAME
                           GT_DG_PASSWORD
                           GT_DG_TOTP_SECRET
FRED                       GT_API_FRED_KEY
Financial Modeling Prep    GT_API_KEY_FINANCIALMODELINGPREP
Finnhub                    GT_API_FINNHUB_KEY
News                       GT_API_NEWS_TOKEN
Oanda                      GT_OANDA_TOKEN
                           GT_OANDA_ACCOUNT
Polygon                    GT_API_POLYGON_KEY
Quandl                     GT_API_KEY_QUANDL
Reddit                     GT_API_REDDIT_CLIENT_ID
                           GT_API_REDDIT_CLIENT_SECRET
                           GT_API_REDDIT_USERNAME
                           GT_API_REDDIT_USER_AGENT
                           GT_API_REDDIT_PASSWORD
Tradier                    GT_TRADIER_TOKEN
Twitter                    GT_API_TWITTER_KEY
                           GT_API_TWITTER_SECRET_KEY
                           GT_API_TWITTER_BEARER_TOKEN

Example:

export GT_API_REDDIT_USERNAME=SexyYear

Environment variables can also be set in a .env file at the top of the repo. This file is ignored by git, so your API keys will stay secret. The above example stored in .env would be:

GT_API_REDDIT_USERNAME=SexyYear

Note that GT_API_REDDIT_USER_AGENT is the name of the script that you set when obtaining the Reddit API key. Note also that a valid Alpha Vantage key is not necessary to get daily OHLC values.
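To make the two configuration routes above concrete, here is a minimal sketch of reading KEY=VALUE lines like those in a .env file and looking a key up with the process environment taking precedence. Both helper functions and the precedence rule are assumptions made for this illustration; this is not how the terminal itself loads its configuration.

```python
import os


def parse_env_lines(lines):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def lookup(name, dotenv_values, environ=os.environ):
    """Prefer the process environment, then fall back to the .env values."""
    return environ.get(name, dotenv_values.get(name))


# Lines as they might appear in a .env file at the top of the repo.
dotenv_values = parse_env_lines(["# Reddit", "GT_API_REDDIT_USERNAME=SexyYear", ""])

# An empty mapping stands in for an environment where the variable is not exported.
username = lookup("GT_API_REDDIT_USERNAME", dotenv_values, environ={})
print(username)
```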

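Usage

Start by loading a ticker of interest:

load -t GME

Once a ticker is loaded, the menu will expand to show all its submenus.

View the historical data of this stock:

view

Slice the historical data by loading the ticker and setting a starting point, e.g.

load -t GME -s 2020-06-04

Enter the technical analysis menu with

ta

and run an SMA with:

sma

However, imagine that you want to change the length of the window because you don't want to go long but do a swing trade, and therefore need a smaller window. Check which settings are available on the SMA command:

sma -h

Once you have seen that, set the parameters you want after their flags. In this case, to change the window length to 10, we would have to do:

sma -l 10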
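For readers unfamiliar with the indicator, the sketch below shows what a simple moving average with a given window length computes. It is a plain-Python illustration with made-up prices, not the terminal's own implementation.

```python
def simple_moving_average(prices, length):
    """Return the mean of each run of `length` consecutive prices."""
    if length <= 0:
        raise ValueError("length must be positive")
    averages = []
    window_sum = 0.0
    for i, price in enumerate(prices):
        window_sum += price
        if i >= length:
            window_sum -= prices[i - length]  # drop the price that left the window
        if i >= length - 1:
            averages.append(window_sum / length)
    return averages


closes = [10, 11, 12, 13, 14, 15]  # made-up closing prices
print(simple_moving_average(closes, 3))  # [11.0, 12.0, 13.0, 14.0]
```

A smaller window reacts faster to recent prices, which is why a shorter length suits a swing trade better than a long-term view.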

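Contributing

There are 3 main ways of contributing to this project.

For a 1h coding session where the architecture of the repo is explained while a new feature is added, check https://www.youtube.com/watch?v=9BMI9cleTTg

Become a Contributor 🦍

Recommended if you bought the dip and the share price keeps dipping. You may as well keep yourself busy while your stonks rise.

  1. Fork the project
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Install the pre-commit hooks by running: pre-commit install. Any time you commit a change, linters will be run automatically. If they make changes, you will have to re-commit.
  5. Push to your branch (git push origin feature/AmazingFeature)
  6. Open a pull request

Become a Karen 🤷

Recommended if you follow the buy high, sell low strategy.

We are interested in your view on which features would make you buy even higher and sell even lower.

Also, if this terminal has gotten you into trouble, don't forget to report a bug so that the team can fix it and keep the old ways.

Join the 🙌💎 Gang

If red is your favorite color, and you never sell at a loss.

Welcome to the club, and feel free to support the developers behind this amazing open-source project.

License

Distributed under the MIT License. See LICENSE for more information.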

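Disclaimer

"A few things I am not. I am not a cat. I am not an institutional investor, nor am I a hedge fund. I do not have clients and I do not provide personalized investment advice for fees or commissions." DFV

Trading in financial instruments involves high risks, including the risk of losing some or all of your investment amount, and may not be suitable for all investors. Before deciding to trade in financial instruments, you should be fully informed of the risks and costs associated with trading the financial markets, carefully consider your investment objectives, level of experience and risk appetite, and seek professional advice where needed. The data contained in GST is not necessarily accurate. GST and any provider of the data contained in this website will not accept liability for any loss or damage as a result of your trading or your reliance on the information displayed.

Contacts

Didier Rodrigues Lopes - dro.lopes@campus.fct.unl.pt

Artem Veremy - artem@veremey.net

James Maslek - jmaslek11@gmail.com

Feel free to share loss porn, memes or any questions at: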

Shout-outs to:

  • pll_llq, Chavithra and HINXX: working on a GUI using Qt
    • Get in touch on our #gui Discord channel
  • 1lluz10n, crspy and 马蒂亚兹: working on our landing page https://gamestonkterminal.netlify.app
  • Meghan Hone: managing the Twitter account
  • 阿罗坎人: by developing the Forex menu
  • Chavithra and Deel18: for the DEGIRO integration
  • 可追踪性3: by adding multiple preset screeners

Other contributors

cClauss, shadycuz, lolrenx, buzzCraft, 衣夹, arcutright, jperkins12, nodesocket, akx, sigaloid, pchaganti, danielorf, henrytdsimmons, rowanharley, sabujp, qTipTip, gmerrall, bfxavier, donno2048, noufal85, rmassoth, benkulbertis, ricleal-fugue

Acknowledgments

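TPOT - A Python automated machine learning tool that optimizes machine learning pipelines using genetic programming

TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your data science assistant. TPOT is a Python automated machine learning tool that optimizes machine learning pipelines using genetic programming.

TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.

An example machine learning pipeline

Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found, so you can tinker with the pipeline from there.

TPOT is built on top of scikit-learn, so all of the code it generates should look familiar... if you're familiar with scikit-learn, anyway.

TPOT is still under active development, and we encourage you to check back on this repository regularly for updates.

For further information about TPOT, please see the project documentation.

License

Please see the repository license for the licensing and usage information for TPOT.

Generally, we have licensed TPOT to make it as widely usable as possible.

Installation

We maintain the TPOT installation instructions in the documentation. TPOT requires a working installation of Python.

Usage

TPOT can be used on the command line or with Python code.

Click on the corresponding links to find more information on TPOT usage in the documentation.

Examples

Classification

Below is a minimal working example with the optical recognition of handwritten digits dataset.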

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')

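Running this code should discover a pipeline that achieves about 98% testing accuracy, and the corresponding Python code should be exported to the tpot_digits_pipeline.py file and look similar to the following: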

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import PolynomialFeatures
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: 0.9799428471757372
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    StackingEstimator(estimator=LogisticRegression(C=0.1, dual=False, penalty="l1")),
    RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=0.35000000000000003, min_samples_leaf=20, min_samples_split=19, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

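Regression

Similarly, TPOT can optimize pipelines for regression problems. Below is a minimal working example with the practice Boston housing prices dataset.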

from tpot import TPOTRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')

This should result in a pipeline that achieves a mean squared error (MSE) of about 12.77, and the Python code in tpot_boston_pipeline.py should look similar to:

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: -10.812040755234403
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    ExtraTreesRegressor(bootstrap=False, max_features=0.5, min_samples_leaf=2, min_samples_split=3, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
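The "Average CV score" comment in the exported pipeline is negative, while the text quotes a positive MSE. A likely explanation, stated here as an assumption, is that the regression score is reported as a negated mean squared error so that higher is always better. The sketch below shows the arithmetic on made-up numbers; it is plain Python, not TPOT or scikit-learn code.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 2.0]  # made-up targets
y_pred = [2.0, 5.0, 4.0]  # made-up predictions

mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4) / 3
neg_mse = -mse                            # "higher is better" form of the same number
print(mse, neg_mse)
```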

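Check the documentation for more examples and tutorials.

Contributing to TPOT

We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please file a new issue so we can discuss it.

Before submitting any contributions, please review our contribution guidelines.

Having problems or have questions about TPOT?

Please check the existing open and closed issues to see if your issue has already been attended to. If it hasn't, file a new issue on this repository so we can review your issue.

Citing TPOT

If you use TPOT in a scientific publication, please consider citing at least one of the following papers:

Trang T. Le, Weixuan Fu and Jason H. Moore (2020). Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics 36(1): 250-256.

BibTeX entry: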

@article{le2020scaling,
  title={Scaling tree-based automated machine learning to biomedical big data with a feature set selector},
  author={Le, Trang T and Fu, Weixuan and Moore, Jason H},
  journal={Bioinformatics},
  volume={36},
  number={1},
  pages={250--256},
  year={2020},
  publisher={Oxford University Press}
}

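Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, and Jason H. Moore (2016). Automating biomedical data science through tree-based pipeline optimization. Applications of Evolutionary Computation, pages 123-137.

BibTeX entry: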

@inbook{Olson2016EvoBio,
    author={Olson, Randal S. and Urbanowicz, Ryan J. and Andrews, Peter C. and Lavender, Nicole A. and Kidd, La Creis and Moore, Jason H.},
    editor={Squillero, Giovanni and Burelli, Paolo},
    chapter={Automating Biomedical Data Science Through Tree-Based Pipeline Optimization},
    title={Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 -- April 1, 2016, Proceedings, Part I},
    year={2016},
    publisher={Springer International Publishing},
    pages={123--137},
    isbn={978-3-319-31204-0},
    doi={10.1007/978-3-319-31204-0_9},
    url={http://dx.doi.org/10.1007/978-3-319-31204-0_9}
}

Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016). Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of GECCO 2016, pages 485-492.

BibTeX entry:

@inproceedings{OlsonGECCO2016,
    author = {Olson, Randal S. and Bartley, Nathan and Urbanowicz, Ryan J. and Moore, Jason H.},
    title = {Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science},
    booktitle = {Proceedings of the Genetic and Evolutionary Computation Conference 2016},
    series = {GECCO '16},
    year = {2016},
    isbn = {978-1-4503-4206-3},
    location = {Denver, Colorado, USA},
    pages = {485--492},
    numpages = {8},
    url = {http://doi.acm.org/10.1145/2908812.2908918},
    doi = {10.1145/2908812.2908918},
    acmid = {2908918},
    publisher = {ACM},
    address = {New York, NY, USA},
}

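Alternatively, you can cite the repository directly with the following DOI:

Support for TPOT

TPOT was developed in the Computational Genetics Lab at the University of Pennsylvania with funding from the NIH under grant R01 AI117694. We are incredibly grateful for the support of the NIH and the University of Pennsylvania during the development of this project.

The TPOT logo was designed by Todd Newmuis, who generously donated his time to the project.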

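AutoKeras - An AutoML library for deep learning

Official website: autokeras.com

AutoKeras: an AutoML system based on Keras. It is developed by DATA Lab at Texas A&M University. The goal of AutoKeras is to make machine learning accessible to everyone.

Learning resources

  • A short example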
import autokeras as ak

clf = ak.ImageClassifier()
clf.fit(x_train, y_train)
results = clf.predict(x_test)

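Installation

To install the package, please use pip as follows:

pip3 install autokeras

Please follow the installation guide for more details.

Note: currently, AutoKeras is only compatible with Python >= 3.5 and TensorFlow >= 2.3.0.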

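Community

Stay up-to-date

Twitter: you can also follow us on Twitter @autokeras for the latest news.

Emails: subscribe to our email list to receive announcements.

Questions and discussions

GitHub Discussions: ask your questions on our GitHub Discussions. It is a forum hosted on GitHub. We will monitor and answer the questions there.

Instant communications

Slack: Request an invitation. Use the #autokeras channel for communication.

QQ group: join our QQ group 1150366085. Password: akqqgroup

Online meetings: join the online meeting Google group. The calendar event will appear on your Google Calendar.

Contributing code

We are committed to keeping everything about AutoKeras open to the public. Everyone can easily join as a developer. Here is how we manage our project.

  • Triage the issues: we pick the critical issues to work on from GitHub issues. They will be added to this Project. Some of the issues will then be added to the milestones, which are used to plan the releases.
  • Assign the tasks: we assign the tasks to people during the online meetings.
  • Discuss: we can have discussions in multiple places. The code reviews are on GitHub. Questions can be asked in Slack or during the meetings.

Please join our Slack and send Haifeng Jin a message. Or drop by our online meetings and talk to us. We will help you get started!

Refer to our Contributing Guide to learn the best practices.

Thanks to all the contributors!

Donation

We accept financial support on Open Collective. Thanks to every backer for supporting us!

Cite this work

Haifeng Jin, Qingquan Song, and Xia Hu. "Auto-keras: An efficient neural architecture search system." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2019. (Download)

Biblatex entry: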

@inproceedings{jin2019auto,
  title={Auto-Keras: An Efficient Neural Architecture Search System},
  author={Jin, Haifeng and Song, Qingquan and Hu, Xia},
  booktitle={Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
  pages={1946--1956},
  year={2019},
  organization={ACM}
}

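Acknowledgements

The authors gratefully acknowledge the D3M program of the Defense Advanced Research Projects Agency (DARPA), administered through AFRL contract FA8750-17-2-0116, the Texas A&M College of Engineering, and Texas A&M University.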

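Numerical-linear-algebra - Free online textbook of Jupyter notebooks for the fast.ai Computational Linear Algebra course

Computational Linear Algebra for Coders

This course is focused on the question: how do we do matrix computations with acceptable speed and acceptable accuracy?

This course was taught in the University of San Francisco's Masters of Science in Analytics program, summer 2017 (for graduate students studying to become data scientists). The course is taught in Python with Jupyter Notebooks, using libraries such as Scikit-Learn and Numpy for most lessons, as well as Numba (a library that compiles Python to C for faster performance) and PyTorch (an alternative to Numpy for the GPU) in a few lessons.

Accompanying the notebooks is a playlist of lecture videos, available on YouTube. If you are ever confused by a lecture or it goes too quickly, check out the beginning of the next video, where I review concepts from the previous lecture, often explaining things from a new perspective or with different illustrations, and answer questions.

Getting Help

You can ask questions or share your thoughts and resources using the Computational Linear Algebra category on our fast.ai discussion forums.

Table of Contents

The following listing links to the notebooks in this repository, rendered through the nbviewer service. Topics covered:

0. Course Logistics (Video 1)

1. Why are we here? (Video 1)

We start with a high-level overview of some foundational concepts in numerical linear algebra.

2. Topic Modeling with NMF and SVD (Video 2 and Video 3)

We will use the newsgroups dataset to try to identify the topics of different posts. We use a term-document matrix that represents the frequency of the vocabulary in the documents. We factor it using NMF, and then with SVD.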
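As a small illustration of the term-document matrix mentioned above, the sketch below builds one from three toy sentences using only the standard library. The sentences and the whitespace tokenization are assumptions for this example; the course notebooks work with the newsgroups dataset and scikit-learn instead.

```python
from collections import Counter

# Three toy "documents" standing in for newsgroup posts.
docs = ["the cat sat", "the dog sat down", "cat and dog"]

# Naive tokenization: split on whitespace.
tokenized = [doc.split() for doc in docs]
counts = [Counter(words) for words in tokenized]

# Vocabulary in a fixed (alphabetical) order.
vocab = sorted({word for words in tokenized for word in words})

# One row per term, one column per document; entries are term frequencies.
matrix = [[doc_counts[term] for doc_counts in counts] for term in vocab]

for term, row in zip(vocab, matrix):
    print(f"{term:>5}", row)
```

A factorization such as NMF or SVD would then be applied to this matrix to find groups of terms that tend to occur together.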

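3. Background Removal with Robust PCA (Video 3, Video 4, and Video 5)

Another application of SVD is to identify the people and remove the background of a surveillance video. We will cover robust PCA, which uses randomized SVD. Randomized SVD, in turn, uses the LU factorization.

4. Compressed Sensing with Robust Regression (Video 6 and Video 7)

Compressed sensing is critical to allowing CT scans with lower radiation: the image can be reconstructed with less data. Here we will learn the technique and apply it to CT images.

5. Predicting Health Outcomes with Linear Regressions (Video 8)

6. How to Implement Linear Regression (Video 8)

7. PageRank with Eigen Decompositions (Video 9 and Video 10)

We have applied SVD to topic modeling, background removal, and linear regression. SVD is intimately connected to the eigen decomposition, so we will now learn how to calculate eigenvalues for a large matrix. We will use DBpedia data, a large dataset of Wikipedia links, because here the principal eigenvector gives the relative importance of different Wikipedia pages (this is the basic idea of Google's PageRank algorithm). We will look at 3 different methods for calculating eigenvectors, of increasing complexity (and increasing usefulness!).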
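To make the PageRank idea concrete, here is a minimal pure-Python sketch of the power method on a four-page toy graph. The graph, the damping factor of 0.85 and the fixed iteration count are assumptions chosen for this illustration; the course itself works with DBpedia data and covers more capable methods.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate the principal eigenvector of the link graph by power iteration."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy link graph: each key links to the pages in its list.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # C
```

Page C collects links from A, B and D, so it ends up with the largest share of the total rank, while D, which nothing links to, keeps only the baseline.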

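8. Implementing QR Factorization (Video 10)


Why is this course taught in such a weird order?

This course is structured with a top-down teaching method, which is different from how most math courses operate. Typically, in a bottom-up approach, you first learn all the separate components you will be using, and then you gradually build them up into more complex structures. The problems with this are that students often lose motivation, don't have a sense of the "big picture", and don't know what they'll need.

Harvard professor David Perkins has a book, Making Learning Whole, in which he uses baseball as an analogy. We don't require kids to memorize all the rules of baseball and understand all the technical details before we let them play the game. Rather, they start playing with just a general sense of it, and then gradually learn more rules and details as time goes on.

If you took the fast.ai deep learning course, that is what we used. You can hear more about my teaching philosophy in this blog post or this talk I gave at the San Francisco Machine Learning meetup.

All that to say, don't worry if you don't understand everything at first! You're not supposed to. We will start using some "black boxes" or matrix decompositions that haven't yet been explained, and then we'll dig into the lower-level details later.

To start, focus on what things DO, not what they ARE.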

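Pattern - Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization

Pattern is a web mining module for Python. It has tools for:
  • Data mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM parser
  • Natural language processing: part-of-speech taggers, n-gram search, sentiment analysis, WordNet
  • Machine learning: vector space model, clustering, classification (KNN, SVM, Perceptron)
  • Network analysis: graph centrality and visualization

It is well documented, thoroughly tested with 350+ unit tests, and comes bundled with 50+ examples. The source code is licensed under BSD.

Example

This example trains a classifier on adjectives mined from Twitter using Python 3. First, tweets that contain the hashtag #win or #fail are collected. For example: "$20 tip off a sweet little old lady today #win". The part-of-speech tags are then parsed, keeping only adjectives. Each tweet is transformed into a vector, a dictionary of adjective → count items, labeled WIN or FAIL. The classifier uses the vectors to learn which other tweets look more like WIN or more like FAIL.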

from pattern.web import Twitter
from pattern.en import tag
from pattern.vector import KNN, count

twitter, knn = Twitter(), KNN()

for i in range(1, 3):
    for tweet in twitter.search('#win OR #fail', start=i, count=100):
        s = tweet.text.lower()
        p = '#win' in s and 'WIN' or 'FAIL'
        v = tag(s)
        v = [word for word, pos in v if pos == 'JJ'] # JJ = adjective
        v = count(v) # {'sweet': 1}
        if v:
            knn.train(v, type=p)

print(knn.classify('sweet potato burger'))
print(knn.classify('stupid autocorrect'))

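Installation

Pattern supports Python 2.7 and Python 3.6. To install Pattern so that it is available in all your scripts, unzip the download and, from the command line, do:

cd pattern-3.6
python setup.py install

If you have pip, you can automatically download and install from the PyPI repository:

pip install pattern

If none of the above works, you can make Python aware of the module in three ways:

  • Put the pattern folder in the same folder as your script.
  • Put the pattern folder in the standard location for modules so it is available to all scripts:
    • c:\python36\Lib\site-packages\ (Windows),
    • /Library/Python/3.6/site-packages/ (Mac OS X),
    • /usr/lib/python3.6/site-packages/ (Unix).
  • Add the location of the module to sys.path in your script, before importing it:
MODULE = '/users/tom/desktop/pattern'
import sys
if MODULE not in sys.path:
    sys.path.append(MODULE)
from pattern.en import parsetree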

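Documentation

For documentation and examples, see the user documentation.

Version

3.6

License

BSD, see LICENSE.txt for further details.

Reference

De Smedt, T., Daelemans, W. (2012). Pattern for Python. Journal of Machine Learning Research, 13, 2031-2035.

Contribute

The source code is hosted on GitHub, and contributions or donations are welcomed.

Bundled dependencies

Pattern is bundled with the following data sets, algorithms and Python packages:

  • Brill tagger, Eric Brill
  • Brill tagger for Dutch, Jeroen Geertzen
  • Brill tagger for German, Gerold Schneider & Martin Volk
  • Brill tagger for Spanish, trained on Wikicorpus (Samuel Reese & Gemma Boleda et al.)
  • Brill tagger for French, trained on Lefff (Benoît Sagot & Lionel Clément et al.)
  • Brill tagger for Italian, mined from Wiktionary
  • English pluralization, Damian Conway
  • Spanish verb inflection, Fred Jehle
  • French verb inflection, Bob Salita
  • Graph JavaScript framework, Aslak Hellesoy & Dave Hoover
  • LIBSVM, Chih-Chung Chang & Chih-Jen Lin
  • LIBLINEAR, Rong-En Fan et al.
  • NetworkX centrality, Aric Hagberg, Dan Schult & Pieter Swart
  • Spelling corrector, Peter Norvig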

Acknowledgements

Authors:

Contributors (chronological):

  • 弗雷德里克·德·布莱泽
  • 杰森·维纳
  • 丹尼尔·弗里森
  • 杰伦·格尔岑
  • 托马斯·克伦贝兹
  • 肯·威廉姆斯
  • 彼得里斯·埃林斯(Peteris Erins)
  • 拉杰什·奈尔
  • F·德·斯梅德
  • Radim Řehůřek
  • 汤姆·洛雷多
  • 约翰·德博维斯
  • 托马斯·西里奥
  • 杰罗德·施耐德
  • 马丁·沃尔克
  • 塞缪尔·约瑟夫
  • 舒班舒·米什拉(Shubhanshu Mishra)
  • 罗伯特·埃尔韦尔
  • 弗雷德·杰尔
  • Antoine Mazières+Fabelier.org
  • Rémi de Zoeten+closealert.nl
  • 肯尼思·科赫(Kenneth Koch)
  • 延斯·格里沃拉
  • 法比奥·马菲亚
  • 史蒂文·洛里亚
  • 科林·莫尔特(Colin Molter)+tevizz.com
  • 彼得·布尔
  • 毛里齐奥·桑巴蒂
  • 旦福
  • 塞尔瓦托·迪·迪奥
  • 文森特·范·阿施
  • 弗雷德里克·埃尔韦特

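Machine-learning-course: 💬 a Machine Learning course with Python

Introduction

The purpose of this project is to provide a comprehensive and yet simple course in Machine Learning using Python.

Motivation

Machine Learning, as a tool for Artificial Intelligence, is one of the most widely adopted scientific fields. A considerable amount of literature has been published on Machine Learning. The purpose of this project is to present its most important aspects through a series of simple and yet comprehensive tutorials using Python. The tutorials are built on several well-known Machine Learning frameworks, such as Scikit-learn. In this project you will learn:

  • What is the definition of Machine Learning?
  • When did it start, and how is it trending?
  • What are the Machine Learning categories and subcategories?
  • What are the most widely used Machine Learning algorithms, and how are they implemented?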

Machine Learning

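Title                              Document
Introduction to Machine Learning   Overview

Machine Learning Basics

Title                        Code     Document
Linear Regression            Python   Tutorial
Overfitting / Underfitting   Python   Tutorial
Regularization               Python   Tutorial
Cross-Validation             Python   Tutorial

Supervised learning

Title                     Code     Document
Decision Trees            Python   Tutorial
K-Nearest Neighbors       Python   Tutorial
Naive Bayes               Python   Tutorial
Logistic Regression       Python   Tutorial
Support Vector Machines   Python   Tutorial

Unsupervised learning

Title                           Code     Document
Clustering                      Python   Tutorial
Principal Components Analysis   Python   Tutorial

Deep Learning

Title                           Code     Document
Neural Networks Overview        Python   Tutorial
Convolutional Neural Networks   Python   Tutorial
Autoencoders                    Python   Tutorial
Recurrent Neural Networks       Python   IPython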

Pull Request Process

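Please consider the following criteria in order to help us in a better way:

  1. The pull request is mainly expected to be a link suggestion.
  2. Please make sure your suggested resources are not obsolete or broken.
  3. Ensure any install or build dependencies are removed before the end of the layer when doing a build and creating a pull request.
  4. Add comments with details of changes to the interface, including new environment variables, exposed ports, useful file locations and container parameters.
  5. You may merge the pull request once you have the sign-off of at least one other developer. If you do not have permission to do that, you may ask the owner to merge it for you once you believe all checks have passed.

Final Note

We are looking forward to your kind feedback. Please help us to improve this open source project and make our work better. To contribute, please create a pull request and we will investigate it promptly. Once again, we appreciate your kind feedback and support.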

Developers

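Creator: Machine Learning Mindset [Blog, GitHub, Twitter]

Supervisor: Amirsina Torfi [GitHub, Personal Website, Linkedin]

Developers: Brendan Sherman*, James E Hopkins* [Linkedin], Zac Smith [Linkedin]

NOTE: This project has been developed as a capstone project offered by [CS 4624 Multimedia/ Hypertext course at Virginia Tech] and supervised and supported by [Machine Learning Mindset].

*: equally contributed

Citation

If you found this course useful, kindly consider citing it as below: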

@software{amirsina_torfi_2019_3585763,
  author       = {Amirsina Torfi and
                  Brendan Sherman and
                  Jay Hopkins and
                  Eric Wynn and
                  hokie45 and
                  Frederik De Bleser and
                  李明岳 and
                  Samuel Husso and
                  Alain},
  title        = {{machinelearningmindset/machine-learning-course:
                   Machine Learning with Python}},
  month        = dec,
  year         = 2019,
  publisher    = {Zenodo},
  version      = {1.0},
  doi          = {10.5281/zenodo.3585763},
  url          = {https://doi.org/10.5281/zenodo.3585763}
}

How to implement the Softmax function in Python

Question: How to implement the Softmax function in Python

In Udacity's deep learning class, the softmax of y_i is simply the exponential divided by the sum of the exponentials of the whole Y vector:

S(y_i) = e ^ y_i / sum_j(e ^ y_j)

where S(y_i) is the softmax function of y_i, e is the exponential, and j runs over the columns of the input vector Y.

I’ve tried the following:

import numpy as np

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

scores = [3.0, 1.0, 0.2]
print(softmax(scores))

which returns:

[ 0.8360188   0.11314284  0.05083836]

But the suggested solution was:

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

which produces the same output as the first implementation, even though the first implementation explicitly takes the difference of each column and the max and then divides by the sum.

Can someone show mathematically why? Is one correct and the other one wrong?

Are the implementations similar in terms of code and time complexity? Which is more efficient?


Answer 0

They’re both correct, but yours is preferred from the point of view of numerical stability.

You start with

e ^ (x - max(x)) / sum(e ^ (x - max(x)))

By using the fact that a^(b – c) = (a^b)/(a^c) we have

= e ^ x / (e ^ max(x) * sum(e ^ x / e ^ max(x)))

= e ^ x / sum(e ^ x)

Which is what the other answer says. You could replace max(x) with any variable and it would cancel out.
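
The cancellation can also be checked numerically. The following is a minimal sketch, not part of the original answer: the helper name softmax_1d and its shift parameter are invented for illustration, and it assumes a 1-D array of scores.

```python
import numpy as np

def softmax_1d(x, shift):
    """Softmax of a 1-D array after subtracting an arbitrary constant `shift`."""
    e_x = np.exp(np.asarray(x, dtype=float) - shift)
    return e_x / e_x.sum()

scores = [3.0, 1.0, 0.2]
baseline = softmax_1d(scores, 0.0)  # plain e^x / sum(e^x)

for shift in (max(scores), -2.5, 10.0):
    # Every shift yields the same probabilities, up to floating-point rounding.
    assert np.allclose(baseline, softmax_1d(scores, shift))

print(baseline)
```

Any constant cancels in exact arithmetic; max(x) is the usual choice only because it keeps the largest exponent at zero.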


Answer 1

(Well… much confusion here, both in the question and in the answers…)

To start with, the two solutions (i.e. yours and the suggested one) are not equivalent; they happen to be equivalent only for the special case of 1-D score arrays. You would have discovered it if you had tried also the 2-D score array in the Udacity quiz provided example.

Results-wise, the only actual difference between the two solutions is the axis=0 argument. To see that this is the case, let’s try your solution (your_softmax) and one where the only difference is the axis argument:

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# correct solution:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0) # only difference

As I said, for a 1-D score array, the results are indeed identical:

scores = [3.0, 1.0, 0.2]
print(your_softmax(scores))
# [ 0.8360188   0.11314284  0.05083836]
print(softmax(scores))
# [ 0.8360188   0.11314284  0.05083836]
your_softmax(scores) == softmax(scores)
# array([ True,  True,  True], dtype=bool)

Nevertheless, here are the results for the 2-D score array given in the Udacity quiz as a test example:

scores2D = np.array([[1, 2, 3, 6],
                     [2, 4, 5, 6],
                     [3, 8, 7, 6]])

print(your_softmax(scores2D))
# [[  4.89907947e-04   1.33170787e-03   3.61995731e-03   7.27087861e-02]
#  [  1.33170787e-03   9.84006416e-03   2.67480676e-02   7.27087861e-02]
#  [  3.61995731e-03   5.37249300e-01   1.97642972e-01   7.27087861e-02]]

print(softmax(scores2D))
# [[ 0.09003057  0.00242826  0.01587624  0.33333333]
#  [ 0.24472847  0.01794253  0.11731043  0.33333333]
#  [ 0.66524096  0.97962921  0.86681333  0.33333333]]

The results are different – the second one is indeed identical with the one expected in the Udacity quiz, where all columns indeed sum to 1, which is not the case with the first (wrong) result.

So, all the fuss was actually for an implementation detail – the axis argument. According to the numpy.sum documentation:

The default, axis=None, will sum all of the elements of the input array

while here we want to sum over the rows only (that is, down each column, so that every column sums to 1), hence axis=0. For a 1-D array, the sum along the only axis and the sum of all the elements are identical, hence your identical results in that case.

The axis issue aside, your implementation (i.e. your choice to subtract the max first) is actually better than the suggested solution! In fact, it is the recommended way of implementing the softmax function – see here for the justification (numeric stability, also pointed out by some other answers here).
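
To make the role of the axis argument concrete, here is a small sketch under stated assumptions: softmax_along is a hypothetical helper (not the quiz's reference solution) that takes the axis explicitly and uses keepdims=True so that both the max and the sum broadcast back against the input.

```python
import numpy as np

def softmax_along(x, axis):
    """Numerically stable softmax normalized along the given axis."""
    x = np.asarray(x, dtype=float)
    # keepdims=True keeps the reduced axis with length 1 so broadcasting works.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e_x = np.exp(shifted)
    return e_x / e_x.sum(axis=axis, keepdims=True)

scores2D = np.array([[1, 2, 3, 6],
                     [2, 4, 5, 6],
                     [3, 8, 7, 6]])

by_column = softmax_along(scores2D, axis=0)  # each column sums to 1
by_row = softmax_along(scores2D, axis=1)     # each row sums to 1

print(by_column.sum(axis=0))
print(by_row.sum(axis=1))
```

Which axis is right depends on the data layout: axis=0 matches the quiz, where each column is one set of scores, while axis=1 suits a batch stored one sample per row.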


Answer 2

So, this is really a comment on desertnaut's answer, but I can't comment on it yet because of my reputation. As he pointed out, your version is only correct if your input consists of a single sample. If your input consists of several samples, it is wrong. However, desertnaut's solution is also wrong. The problem is that he treats a 1-dimensional input as one sample of scores, but normalizes a 2-dimensional input down its columns, that is, across samples. Let me show this to you.

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# desertnaut solution (copied from his answer): 
def desertnaut_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0) # only difference

# my (correct) solution:
def softmax(z):
    assert len(z.shape) == 2
    s = np.max(z, axis=1)
    s = s[:, np.newaxis] # necessary step to do broadcasting
    e_x = np.exp(z - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis] # ditto
    return e_x / div

Let's take desertnaut's example:

x1 = np.array([[1, 2, 3, 6]]) # notice that we put the data into 2 dimensions(!)

This is the output:

your_softmax(x1)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

desertnaut_softmax(x1)
array([[ 1.,  1.,  1.,  1.]])

softmax(x1)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

You can see that desertnaut's version fails in this situation. (It would not if the input were just one-dimensional, like np.array([1, 2, 3, 6]).)

Let's now use 3 samples, since that's the reason why we use a 2-dimensional input. The following x2 is not the same as the one from desertnaut's example.

x2 = np.array([[1, 2, 3, 6],  # sample 1
               [2, 4, 5, 6],  # sample 2
               [1, 2, 3, 6]]) # sample 1 again(!)

This input consists of a batch with 3 samples. But sample one and three are essentially the same. We now expect 3 rows of softmax activations where the first should be the same as the third and also the same as our activation of x1!

your_softmax(x2)
array([[ 0.00183535,  0.00498899,  0.01356148,  0.27238963],
       [ 0.00498899,  0.03686393,  0.10020655,  0.27238963],
       [ 0.00183535,  0.00498899,  0.01356148,  0.27238963]])


desertnaut_softmax(x2)
array([[ 0.21194156,  0.10650698,  0.10650698,  0.33333333],
       [ 0.57611688,  0.78698604,  0.78698604,  0.33333333],
       [ 0.21194156,  0.10650698,  0.10650698,  0.33333333]])

softmax(x2)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047],
       [ 0.01203764,  0.08894682,  0.24178252,  0.65723302],
       [ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

I hope you can see that this is only the case with my solution.

softmax(x1) == softmax(x2)[0]
array([[ True,  True,  True,  True]], dtype=bool)

softmax(x1) == softmax(x2)[2]
array([[ True,  True,  True,  True]], dtype=bool)

Additionally, here are the results of TensorFlow's softmax implementation:

import tensorflow as tf
import numpy as np
batch = np.asarray([[1,2,3,6],[2,4,5,6],[1,2,3,6]])
x = tf.placeholder(tf.float32, shape=[None, 4])
y = tf.nn.softmax(x)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(y, feed_dict={x: batch})

And the result:

array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037045],
       [ 0.01203764,  0.08894681,  0.24178252,  0.657233  ],
       [ 0.00626879,  0.01704033,  0.04632042,  0.93037045]], dtype=float32)

Answer 3

I would say that while both are correct mathematically, implementation-wise, first one is better. When computing softmax, the intermediate values may become very large. Dividing two large numbers can be numerically unstable. These notes (from Stanford) mention a normalization trick which is essentially what you are doing.


Answer 4

sklearn also offers implementation of softmax

from sklearn.utils.extmath import softmax
import numpy as np

x = np.array([[ 0.50839931,  0.49767588,  0.51260159]])
softmax(x)

# output
array([[ 0.3340521 ,  0.33048906,  0.33545884]]) 

Answer 5

From a mathematical point of view, both sides are equal.

You can easily prove this. Let m = max(x). Your function softmax returns a vector whose i-th coordinate is equal to

e ^ (x_i - m) / sum_j(e ^ (x_j - m)) = (e ^ x_i / e ^ m) / (sum_j(e ^ x_j) / e ^ m) = e ^ x_i / sum_j(e ^ x_j)

Notice that this works for any m, because e ^ m != 0 for all (even complex) numbers m.

  • from computational complexity point of view they are also equivalent and both run in O(n) time, where n is the size of a vector.

  • From a numerical stability point of view, the first solution is preferred, because e^x grows very fast and overflows even for moderately large x (above roughly 709 in float64). Subtracting the maximum value gets rid of this overflow. To see this in practice, try feeding x = np.array([1000, 5]) into both of your functions. One returns the correct probabilities; the other overflows and returns nan.

  • your solution works only for vectors (Udacity quiz wants you to calculate it for matrices as well). In order to fix it you need to use sum(axis=0)
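
The overflow described in the bullets above can be reproduced with a short sketch. The names naive_softmax and stable_softmax are illustrative labels for the two variants from the question, restricted here to 1-D input; np.errstate only silences the warnings NumPy would otherwise print.

```python
import numpy as np

def naive_softmax(x):
    """e^x / sum(e^x) with no shift: overflows for large inputs."""
    e_x = np.exp(x)
    return e_x / e_x.sum()

def stable_softmax(x):
    """Same function, but the max is subtracted before exponentiating."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

x = np.array([1000.0, 5.0])

# exp(1000) overflows to inf, and inf / inf evaluates to nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(x)

stable = stable_softmax(x)

print(naive)
print(stable)
```

With the shift, the largest exponent is exactly 0, so the numerator never exceeds 1 and the denominator is at least 1.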


Answer 6

EDIT. As of version 1.2.0, scipy includes softmax as a special function:

https://scipy.github.io/devdocs/generated/scipy.special.softmax.html

I wrote a function applying the softmax over any axis:

def softmax(X, theta = 1.0, axis = None):
    """
    Compute the softmax of each element along an axis of X.

    Parameters
    ----------
    X: ND-Array. Probably should be floats. 
    theta (optional): float parameter, used as a multiplier
        prior to exponentiation. Default = 1.0
    axis (optional): axis to compute values along. Default is the 
        first non-singleton axis.

    Returns an array the same size as X. The result will sum to 1
    along the specified axis.
    """

    # make X at least 2d
    y = np.atleast_2d(X)

    # find axis
    if axis is None:
        axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)

    # multiply y against the theta parameter, 
    y = y * float(theta)

    # subtract the max for numerical stability
    y = y - np.expand_dims(np.max(y, axis = axis), axis)

    # exponentiate y
    y = np.exp(y)

    # take the sum along the specified axis
    ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)

    # finally: divide elementwise
    p = y / ax_sum

    # flatten if X was 1D
    if len(X.shape) == 1: p = p.flatten()

    return p

Subtracting the max, as other users described, is good practice. I wrote a detailed post about it here.
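
For readers who only need the SciPy function mentioned in the edit note, a brief usage sketch follows. It assumes SciPy 1.2.0 or later is installed; the array is the three-sample batch used in an earlier answer.

```python
import numpy as np
from scipy.special import softmax

x = np.array([[1.0, 2.0, 3.0, 6.0],
              [2.0, 4.0, 5.0, 6.0],
              [1.0, 2.0, 3.0, 6.0]])

per_row = softmax(x, axis=1)     # one distribution per row (sample)
per_column = softmax(x, axis=0)  # one distribution per column
over_all = softmax(x)            # axis=None: a single distribution over every element

print(per_row)
```

The default axis=None normalizes over the whole array, so batched input needs the axis passed explicitly.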


Answer 7

Here you can find out why they used - max.

From there:

“When you’re writing code for computing the Softmax function in practice, the intermediate terms may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick.”


Answer 8

A more concise version is:

def softmax(x):
    return np.exp(x) / np.exp(x).sum(axis=0)

Answer 9

To offer an alternative solution, consider the cases where your arguments are extremely large in magnitude such that exp(x) would underflow (in the negative case) or overflow (in the positive case). Here you want to remain in log space as long as possible, exponentiating only at the end where you can trust the result will be well-behaved.

import scipy.special as sc
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    return np.exp(x - sc.logsumexp(x))
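
The function above normalizes over the whole array. A possible batched variant is sketched below; softmax_logspace is a hypothetical name, and the axis and keepdims arguments of scipy.special.logsumexp are used to keep one normalizer per row.

```python
import numpy as np
from scipy.special import logsumexp

def softmax_logspace(x, axis=-1):
    """Softmax computed as exp(x - logsumexp(x)) along one axis."""
    x = np.asarray(x, dtype=float)
    # Stay in log space until the very end; keepdims=True allows broadcasting.
    log_probs = x - logsumexp(x, axis=axis, keepdims=True)
    return np.exp(log_probs)

# One row of very large and one row of very negative scores.
x = np.array([[1000.0, 1001.0],
              [-1000.0, -1001.0]])
probs = softmax_logspace(x, axis=1)
print(probs)
```

The intermediate log_probs array is also what a cross-entropy loss needs, so it can be returned directly when log-probabilities are wanted.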

Answer 10

I needed something compatible with the output of a dense layer from Tensorflow.

The solution from @desertnaut does not work in this case because I have batches of data. Therefore, I came up with another solution that should work in both cases:

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x)) # same code
    return e_x / e_x.sum(axis=axis, keepdims=True)

Results:

logits = np.asarray([
    [-0.0052024,  -0.00770216,  0.01360943, -0.008921], # 1
    [-0.0052024,  -0.00770216,  0.01360943, -0.008921]  # 2
])

print(softmax(logits))

#[[0.2492037  0.24858153 0.25393605 0.24827873]
# [0.2492037  0.24858153 0.25393605 0.24827873]]

Ref: Tensorflow softmax
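
One caveat worth illustrating: subtracting a single global max is mathematically fine, but rows far below that max can underflow to all zeros. The sketch below contrasts the two choices; softmax_global_max mirrors the code above, softmax_axis_max is a hypothetical variant, and the logits are made-up values chosen to trigger the problem.

```python
import numpy as np

def softmax_global_max(x, axis=-1):
    """Subtracts one max taken over the whole array."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=axis, keepdims=True)

def softmax_axis_max(x, axis=-1):
    """Subtracts a separate max for every slice along `axis`."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

logits = np.array([[1000.0, 1001.0],
                   [0.0, 1.0]])

# In the second row, exp(-1001) and exp(-1000) both underflow to 0, giving 0 / 0.
with np.errstate(invalid="ignore"):
    global_result = softmax_global_max(logits)

axis_result = softmax_axis_max(logits)

print(global_result)
print(axis_result)
```

For logits of similar magnitude across the batch, as in the example above, both versions agree.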


Answer 11

I would suggest this:

def softmax(z):
    z_norm=np.exp(z-np.max(z,axis=0,keepdims=True))
    return(np.divide(z_norm,np.sum(z_norm,axis=0,keepdims=True)))

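It works for a single sample (stochastic) as well as for a batch, provided each sample is stored as a column, since the normalization runs along axis=0.
For more detail see: https://medium.com/@ravish1729/analysis-of-softmax-function-ad058d6a564d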


Answer 12

To maintain numerical stability, max(x) should be subtracted. The following is the code for the softmax function:

def softmax(x):
    # Work on a float copy so the caller's array is not modified in place.
    x = np.array(x, dtype=float)
    if len(x.shape) > 1:
        # 2-D input: normalize each row separately.
        tmp = np.max(x, axis=1)
        x -= tmp.reshape((x.shape[0], 1))
        x = np.exp(x)
        tmp = np.sum(x, axis=1)
        x /= tmp.reshape((x.shape[0], 1))
    else:
        # 1-D input: normalize the whole vector.
        tmp = np.max(x)
        x -= tmp
        x = np.exp(x)
        tmp = np.sum(x)
        x /= tmp
    return x

Answer 13

Already answered in much detail in above answers. max is subtracted to avoid overflow. I am adding here one more implementation in python3.

import numpy as np
def softmax(x):
    mx = np.amax(x,axis=1,keepdims = True)
    x_exp = np.exp(x - mx)
    x_sum = np.sum(x_exp, axis = 1, keepdims = True)
    res = x_exp / x_sum
    return res

x = np.array([[3,2,4],[4,5,6]])
print(softmax(x))

Answer 14

Everybody seems to post their solution so I’ll post mine:

def softmax(x):
    e_x = np.exp(x.T - np.max(x, axis = -1))
    return (e_x / e_x.sum(axis=0)).T

I get the exact same results as the imported from sklearn:

from sklearn.utils.extmath import softmax

Answer 15

import tensorflow as tf
import numpy as np

def softmax(x):
    return (np.exp(x).T / np.exp(x).sum(axis=-1)).T

logits = np.array([[1, 2, 3], [3, 10, 1], [1, 2, 5], [4, 6.5, 1.2], [3, 6, 1]])

sess = tf.Session()
print(softmax(logits))
print(sess.run(tf.nn.softmax(logits)))
sess.close()

Answer 16

Based on all the responses and CS231n notes, allow me to summarise:

def softmax(x, axis):
    # Subtract the max into a new array so the caller's input is left untouched.
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

Usage:

x = np.array([[1, 0, 2,-1],
              [2, 4, 6, 8], 
              [3, 2, 1, 0]])
softmax(x, axis=1).round(2)

Output:

array([[0.24, 0.09, 0.64, 0.03],
       [0.  , 0.02, 0.12, 0.86],
       [0.64, 0.24, 0.09, 0.03]])

Answer 17

I would like to add a bit more understanding of the problem. Subtracting the max of the array is correct. But if you run the code in the other post, you will find that it does not give the right answer when the array is 2-D or higher-dimensional.

Here are some suggestions:

  1. To get the max, take it along the x-axis; you will get a 1-D array.
  2. Reshape your max array so that it broadcasts against the original shape.
  3. Use np.exp to get the exponential values.
  4. Use np.sum along the axis.
  5. Get the final results.

Follow these steps and you will get the correct answer in a vectorized way. Since this is related to college homework, I cannot post the exact code here, but I am happy to give more suggestions if you don't understand.
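
One possible reading of the five steps is sketched below. This is an illustration only, not the answerer's withheld code: the function name softmax_steps and the choice of the last axis as the default are assumptions.

```python
import numpy as np

def softmax_steps(x, axis=-1):
    """Softmax along `axis`, written out step by step."""
    x = np.asarray(x, dtype=float)
    # Step 1: max along the axis (this drops that axis).
    axis_max = np.max(x, axis=axis)
    # Step 2: put the axis back with length 1 so it broadcasts against x.
    axis_max = np.expand_dims(axis_max, axis)
    # Step 3: exponentiate the shifted values.
    e_x = np.exp(x - axis_max)
    # Step 4: sum along the same axis, again restoring it for broadcasting.
    totals = np.expand_dims(np.sum(e_x, axis=axis), axis)
    # Step 5: normalize.
    return e_x / totals

batch = np.array([[3, 2, 4],
                  [4, 5, 6]])
batch_probs = softmax_steps(batch)
vector_probs = softmax_steps([3.0, 1.0, 0.2])
print(batch_probs)
print(vector_probs)
```

Passing keepdims=True to np.max and np.sum would fold steps 1 and 2 (and the reshaping in step 4) into single calls.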


Answer 18

The purpose of the softmax function is to preserve the ratio of the vectors as opposed to squashing the end-points with a sigmoid as the values saturate (i.e. tend to +/- 1 (tanh) or from 0 to 1 (logistical)). This is because it preserves more information about the rate of change at the end-points and thus is more applicable to neural nets with 1-of-N Output Encoding (i.e. if we squashed the end-points it would be harder to differentiate the 1-of-N output class because we can’t tell which one is the “biggest” or “smallest” because they got squished.); also it makes the total output sum to 1, and the clear winner will be closer to 1 while other numbers that are close to each other will sum to 1/p, where p is the number of output neurons with similar values.

The purpose of subtracting the max value from the vector is that when you do e^y exponents you may get very high value that clips the float at the max value leading to a tie, which is not the case in this example. This becomes a BIG problem if you subtract the max value to make a negative number, then you have a negative exponent that rapidly shrinks the values altering the ratio, which is what occurred in poster’s question and yielded the incorrect answer.

The answer supplied by Udacity is HORRIBLY inefficient. The first thing we need to do is calculate e^y_j for all vector components, KEEP THOSE VALUES, then sum them up, and divide. Where Udacity messed up is they calculate e^y_j TWICE!!! Here is the correct answer:

def softmax(y):
    e_to_the_y_j = np.exp(y)
    return e_to_the_y_j / np.sum(e_to_the_y_j, axis=0)

Answer 19

Goal was to achieve similar results using Numpy and Tensorflow. The only change from original answer is axis parameter for np.sum api.

Initial approach : axis=0 – This however does not provide intended results when dimensions are N.

Modified approach: axis=len(e_x.shape)-1 – Always sum on the last dimension. This provides similar results as tensorflow’s softmax function.

def softmax_fn(input_array):
    """
    | **@author**: Prathyush SP
    |
    | Calculate Softmax for a given array
    :param input_array: Input Array
    :return: Softmax Score
    """
    e_x = np.exp(input_array - np.max(input_array))
    # keepdims=True keeps the summed axis so the division broadcasts correctly.
    return e_x / e_x.sum(axis=len(e_x.shape)-1, keepdims=True)

Answer 20

Here is a generalized solution using numpy, compared against tensorflow and scipy for correctness:

Data preparation:

import numpy as np

np.random.seed(2019)

batch_size = 1
n_items = 3
n_classes = 2
logits_np = np.random.rand(batch_size,n_items,n_classes).astype(np.float32)
print('logits_np.shape', logits_np.shape)
print('logits_np:')
print(logits_np)

Output:

logits_np.shape (1, 3, 2)
logits_np:
[[[0.9034822  0.3930805 ]
  [0.62397    0.6378774 ]
  [0.88049906 0.299172  ]]]

使用张量流的Softmax:

import tensorflow as tf

logits_tf = tf.convert_to_tensor(logits_np, np.float32)
scores_tf = tf.nn.softmax(logits_np, axis=-1)

print('logits_tf.shape', logits_tf.shape)
print('scores_tf.shape', scores_tf.shape)

with tf.Session() as sess:
    scores_np = sess.run(scores_tf)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np,axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

输出:

logits_tf.shape (1, 3, 2)
scores_tf.shape (1, 3, 2)
scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.4965232  0.5034768 ]
  [0.64137274 0.3586273 ]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

使用scipy的Softmax:

from scipy.special import softmax

scores_np = softmax(logits_np, axis=-1)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

输出:

scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.4965232  0.5034768 ]
  [0.6413727  0.35862732]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

使用numpy的Softmax(https://nolanbconaway.github.io/blog/2017/softmax-numpy):

def softmax(X, theta = 1.0, axis = None):
    """
    Compute the softmax of each element along an axis of X.

    Parameters
    ----------
    X: ND-Array. Probably should be floats.
    theta (optional): float parameter, used as a multiplier
        prior to exponentiation. Default = 1.0
    axis (optional): axis to compute values along. Default is the
        first non-singleton axis.

    Returns an array the same size as X. The result will sum to 1
    along the specified axis.
    """

    # make X at least 2d
    y = np.atleast_2d(X)

    # find axis
    if axis is None:
        axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)

    # multiply y against the theta parameter,
    y = y * float(theta)

    # subtract the max for numerical stability
    y = y - np.expand_dims(np.max(y, axis = axis), axis)

    # exponentiate y
    y = np.exp(y)

    # take the sum along the specified axis
    ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)

    # finally: divide elementwise
    p = y / ax_sum

    # flatten if X was 1D
    if len(X.shape) == 1: p = p.flatten()

    return p


scores_np = softmax(logits_np, axis=-1)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

输出:

scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.49652317 0.5034768 ]
  [0.64137274 0.3586273 ]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

Here is a generalized solution using NumPy, with a comparison against TensorFlow and SciPy to check correctness:

Data preparation:

import numpy as np

np.random.seed(2019)

batch_size = 1
n_items = 3
n_classes = 2
logits_np = np.random.rand(batch_size,n_items,n_classes).astype(np.float32)
print('logits_np.shape', logits_np.shape)
print('logits_np:')
print(logits_np)

Output:

logits_np.shape (1, 3, 2)
logits_np:
[[[0.9034822  0.3930805 ]
  [0.62397    0.6378774 ]
  [0.88049906 0.299172  ]]]

Softmax using tensorflow:

import tensorflow as tf

logits_tf = tf.convert_to_tensor(logits_np, np.float32)
scores_tf = tf.nn.softmax(logits_tf, axis=-1)

print('logits_tf.shape', logits_tf.shape)
print('scores_tf.shape', scores_tf.shape)

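# tf.Session is the TensorFlow 1.x API; under TensorFlow 2.x eager execution,
# scores_np = scores_tf.numpy() gives the same array.
with tf.Session() as sess:
    scores_np = sess.run(scores_tf)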

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np,axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

Output:

logits_tf.shape (1, 3, 2)
scores_tf.shape (1, 3, 2)
scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.4965232  0.5034768 ]
  [0.64137274 0.3586273 ]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

Softmax using scipy:

from scipy.special import softmax

scores_np = softmax(logits_np, axis=-1)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

Output:

scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.4965232  0.5034768 ]
  [0.6413727  0.35862732]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

Softmax using numpy (https://nolanbconaway.github.io/blog/2017/softmax-numpy):

def softmax(X, theta = 1.0, axis = None):
    """
    Compute the softmax of each element along an axis of X.

    Parameters
    ----------
    X: ND-Array. Probably should be floats.
    theta (optional): float parameter, used as a multiplier
        prior to exponentiation. Default = 1.0
    axis (optional): axis to compute values along. Default is the
        first non-singleton axis.

    Returns an array the same size as X. The result will sum to 1
    along the specified axis.
    """

    # make X at least 2d
    y = np.atleast_2d(X)

    # find axis
    if axis is None:
        axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)

    # multiply y against the theta parameter,
    y = y * float(theta)

    # subtract the max for numerical stability
    y = y - np.expand_dims(np.max(y, axis = axis), axis)

    # exponentiate y
    y = np.exp(y)

    # take the sum along the specified axis
    ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)

    # finally: divide elementwise
    p = y / ax_sum

    # flatten if X was 1D
    if len(X.shape) == 1: p = p.flatten()

    return p


scores_np = softmax(logits_np, axis=-1)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

Output:

scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.49652317 0.5034768 ]
  [0.64137274 0.3586273 ]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]
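The agreement between the three outputs above can also be checked without TensorFlow or SciPy installed. The sketch below copies the printed logits and scores and recomputes the softmax with a compact NumPy expression; that expression is an assumption standing in for the longer function above, not a replacement for it.

```python
import numpy as np

# Logits and expected scores copied from the printed output above.
logits = np.array([[[0.9034822, 0.3930805],
                    [0.62397, 0.6378774],
                    [0.88049906, 0.299172]]], dtype=np.float32)
expected = np.array([[[0.62490064, 0.37509936],
                      [0.4965232, 0.5034768],
                      [0.64137274, 0.3586273]]], dtype=np.float32)

# Compact softmax over the last axis, shifted by the per-row max.
shifted = logits - logits.max(axis=-1, keepdims=True)
scores = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(np.allclose(scores, expected, atol=1e-5))
print(np.allclose(scores.sum(axis=-1), 1.0))
```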

Answer 21

The softmax function is an activation function that turns numbers into probabilities that sum to one. It outputs a vector representing a probability distribution over a list of outcomes, and it is a core element of deep learning classification tasks.

The softmax function is used when there are multiple classes.

It is useful for finding the class with the highest probability.

It is typically used in the output layer, where we want a probability for each class of a given input.

Each output lies between 0 and 1.

Softmax turns the logits [2.0, 1.0, 0.1] into probabilities of roughly [0.7, 0.2, 0.1], which sum to 1. Logits are the raw scores output by the last layer of a neural network, before the activation is applied. To understand the softmax function, we must look at the output of the (n-1)th layer.

The softmax function is a smooth ("soft") approximation of the arg max function. Rather than returning the largest value from the input, it puts most of the probability mass at the position of the largest value.

For example:

Before softmax

X = [13, 31, 5]

After softmax

array([1.52299795e-08, 9.99999985e-01, 5.10908895e-12])
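The numbers quoted above can be reproduced with a few lines of NumPy. This is a minimal 1-D sketch; the function body is one common way to write softmax and is assumed here for illustration.

```python
import numpy as np

def softmax(x):
    # Minimal 1-D softmax, shifted by the max for numerical stability.
    x = np.asarray(x, dtype=float)
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

p = softmax([2.0, 1.0, 0.1])
print(np.round(p, 3))        # roughly [0.659 0.242 0.099]
print(p.sum())               # 1.0 up to floating-point rounding

q = softmax([13, 31, 5])
print(q)                     # almost all mass on the second entry
print(np.argmax(q))          # position of the largest input: 1
```

The second example shows the link to arg max: when one input is much larger than the rest, nearly all the probability lands on its position.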

Code:

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# correct solution:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)  # only difference
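To see where that one difference matters, the sketch below runs both normalizations on a 1-D vector and on a 2-D matrix whose columns are separate samples. The function names and sample values are illustrative assumptions.

```python
import numpy as np

def softmax_all(x):
    # Normalizes by the sum over every element (e_x.sum()).
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def softmax_cols(x):
    # Normalizes each column separately (e_x.sum(axis=0)).
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

v = np.array([13.0, 31.0, 5.0])
m = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

print(np.allclose(softmax_all(v), softmax_cols(v)))  # identical for 1-D input
print(softmax_all(m).sum())                          # whole matrix sums to 1
print(softmax_cols(m).sum(axis=0))                   # each column sums to 1
```

For 1-D input the two functions are interchangeable; for 2-D input only the axis=0 version gives one distribution per column.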