Question: How do I compute precision, recall, accuracy and F1-score for the multiclass case with scikit-learn?

I'm working on a sentiment analysis problem and the data looks like this:

label instances
    5    1190
    4     838
    3     239
    1     204
    2     127

So my data is unbalanced, since 1190 instances are labeled with 5. For the classification I'm using scikit-learn's SVC. The problem is that I do not know how to balance my data in the right way in order to compute the precision, recall, accuracy and F1-score for the multiclass case correctly. So I tried the following approaches:

First:

wclf = SVC(kernel='linear', C=1, class_weight={1: 10})
wclf.fit(X, y)
weighted_prediction = wclf.predict(X_test)

print 'Accuracy:', accuracy_score(y_test, weighted_prediction)
print 'F1 score:', f1_score(y_test, weighted_prediction, average='weighted')
print 'Recall:', recall_score(y_test, weighted_prediction,
                              average='weighted')
print 'Precision:', precision_score(y_test, weighted_prediction,
                                    average='weighted')
print '\n classification report:\n', classification_report(y_test, weighted_prediction)
print '\n confusion matrix:\n', confusion_matrix(y_test, weighted_prediction)

Second:

auto_wclf = SVC(kernel='linear', C=1, class_weight='auto')
auto_wclf.fit(X, y)
auto_weighted_prediction = auto_wclf.predict(X_test)

print 'Accuracy:', accuracy_score(y_test, auto_weighted_prediction)

print 'F1 score:', f1_score(y_test, auto_weighted_prediction,
                            average='weighted')

print 'Recall:', recall_score(y_test, auto_weighted_prediction,
                              average='weighted')

print 'Precision:', precision_score(y_test, auto_weighted_prediction,
                                    average='weighted')

print '\n classification report:\n', classification_report(y_test, auto_weighted_prediction)

print '\n confusion matrix:\n', confusion_matrix(y_test, auto_weighted_prediction)

Third:

clf = SVC(kernel='linear', C=1)
clf.fit(X, y)
prediction = clf.predict(X_test)

from sklearn.metrics import precision_score, \
    recall_score, confusion_matrix, classification_report, \
    accuracy_score, f1_score

print 'Accuracy:', accuracy_score(y_test, prediction)
print 'F1 score:', f1_score(y_test, prediction)
print 'Recall:', recall_score(y_test, prediction)
print 'Precision:', precision_score(y_test, prediction)
print '\n classification report:\n', classification_report(y_test, prediction)
print '\n confusion matrix:\n', confusion_matrix(y_test, prediction)


F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1082: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
 0.930416613529

However, I'm getting warnings like this:

/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172:
DeprecationWarning: The default `weighted` averaging is deprecated,
and from version 0.18, use of precision, recall or F-score with 
multiclass or multilabel data or pos_label=None will result in an 
exception. Please set an explicit value for `average`, one of (None, 
'micro', 'macro', 'weighted', 'samples'). In cross validation use, for 
instance, scoring="f1_weighted" instead of scoring="f1"

How can I deal correctly with my unbalanced data in order to compute the classifier's metrics in the right way?


Answer 0

I think there is a lot of confusion about which weights are used for what. I am not sure I know precisely what bothers you, so I am going to cover different topics; bear with me ;).

Class weights

The weights from the class_weight parameter are used to train the classifier. They are not used in the calculation of any of the metrics you are using: with different class weights, the numbers will be different simply because the classifier is different.

Basically, in every scikit-learn classifier, the class weights are used to tell your model how important a class is. That means that during training, the classifier will make extra efforts to properly classify the classes with high weights.
How they do that is algorithm-specific. If you want details about how it works for SVC and the docs do not make sense to you, feel free to mention it.
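
For instance, here is a minimal sketch of passing per-class weights to SVC, using the class counts from the question; the inverse-frequency rule below is just one illustrative choice (the class_weight='auto' option used in the question computes something similar automatically):

from sklearn.svm import SVC

# class counts from the question (label: number of instances)
counts = {1: 204, 2: 127, 3: 239, 4: 838, 5: 1190}
n_total = sum(counts.values())

# one simple heuristic: weight each class inversely to its frequency
class_weights = {label: float(n_total) / (len(counts) * n) for label, n in counts.items()}

wclf = SVC(kernel='linear', C=1, class_weight=class_weights)
# wclf.fit(X_train, y_train)  # X_train / y_train assumed to exist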

The metrics

Once you have a classifier, you want to know how well it is performing. Here you can use the metrics you mentioned: accuracy, recall_score, f1_score.

Usually when the class distribution is unbalanced, accuracy is considered a poor choice as it gives high scores to models which just predict the most frequent class.

I will not detail all these metrics, but note that, with the exception of accuracy, they are naturally applied at the class level: as you can see in this print of a classification report, they are defined for each class. They rely on concepts such as true positives or false negatives that require defining which class is the positive one.

             precision    recall  f1-score   support

          0       0.65      1.00      0.79        17
          1       0.57      0.75      0.65        16
          2       0.33      0.06      0.10        17
avg / total       0.52      0.60      0.51        50

The warning

F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The 
default `weighted` averaging is deprecated, and from version 0.18, 
use of precision, recall or F-score with multiclass or multilabel data  
or pos_label=None will result in an exception. Please set an explicit 
value for `average`, one of (None, 'micro', 'macro', 'weighted', 
'samples'). In cross validation use, for instance, 
scoring="f1_weighted" instead of scoring="f1".

You get this warning because you are using the f1-score, recall and precision without defining how they should be computed! The question could be rephrased: from the above classification report, how do you output one global number for the f1-score? You could:

  1. Take the average of the f1-score for each class: that's the avg / total result above. It's also called macro averaging.
  2. Compute the f1-score using the global count of true positives / false negatives, etc. (you sum the true positives / false negatives over all classes). This is known as micro averaging.
  3. Compute a weighted average of the f1-score. Using 'weighted' in scikit-learn will weigh the f1-score by the support of the class: the more elements a class has, the more important the f1-score of this class is in the computation.

These are 3 of the options in scikit-learn, and the warning is there to say you have to pick one. So you have to specify an average argument for the score function.
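
As a small sketch with made-up labels (toy arrays, just for illustration), here is how the choice of average changes the number returned by the same metric function:

from sklearn.metrics import f1_score

# toy ground truth and predictions, with labels 1-5 as in the question
y_true = [1, 1, 2, 2, 3, 4, 5, 5, 5, 5]
y_pred = [1, 2, 2, 2, 3, 5, 5, 5, 5, 4]

print(f1_score(y_true, y_pred, average=None))        # one F1 per class
print(f1_score(y_true, y_pred, average='macro'))     # unweighted mean of the per-class F1 scores
print(f1_score(y_true, y_pred, average='micro'))     # computed from the global TP/FP/FN counts
print(f1_score(y_true, y_pred, average='weighted'))  # per-class F1 weighted by support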

Which one you choose depends on how you want to measure the performance of the classifier: for instance, macro-averaging does not take class imbalance into account, so the f1-score of class 1 will be just as important as the f1-score of class 5. If you use weighted averaging, however, class 5 will carry more importance.

The whole argument specification for these metrics is not super clear in scikit-learn right now; according to the docs, it will get better in version 0.18. They are removing some non-obvious default behavior and issuing warnings so that developers notice it.

Computing scores

Last thing I want to mention (feel free to skip it if you’re aware of it) is that scores are only meaningful if they are computed on data that the classifier has never seen. This is extremely important as any score you get on data that was used in fitting the classifier is completely irrelevant.

Here's a way to do it using StratifiedShuffleSplit, which gives you random splits of your data (after shuffling) that preserve the label distribution.

from sklearn.datasets import make_classification
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

# We use a utility to generate artificial classification data.
X, y = make_classification(n_samples=100, n_informative=10, n_classes=3)
# The classifier was not defined in the original snippet; a linear SVC as in the question.
svc = SVC(kernel='linear', C=1)
sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
for train_idx, test_idx in sss:
    X_train, X_test, y_train, y_test = X[train_idx], X[test_idx], y[train_idx], y[test_idx]
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)
    print(f1_score(y_test, y_pred, average="macro"))
    print(precision_score(y_test, y_pred, average="macro"))
    print(recall_score(y_test, y_pred, average="macro"))

Hope this helps.


Answer 1

There are a lot of very detailed answers here, but I don't think you are answering the right questions. As I understand the question, there are two concerns:

  1. How do I score a multiclass problem?
  2. How do I deal with unbalanced data?

1.

You can use most of the scoring functions in scikit-learn with multiclass problems just as with binary problems. For example:

from sklearn.metrics import precision_recall_fscore_support as score

predicted = [1,2,3,4,5,1,2,1,1,4,5] 
y_test = [1,2,3,4,5,1,2,1,1,4,1]

precision, recall, fscore, support = score(y_test, predicted)

print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))

This way you end up with tangible and interpretable numbers for each of the classes.

| Label | Precision | Recall | FScore | Support |
|-------|-----------|--------|--------|---------|
| 1     | 94%       | 83%    | 0.88   | 204     |
| 2     | 71%       | 50%    | 0.54   | 127     |
| ...   | ...       | ...    | ...    | ...     |
| 4     | 80%       | 98%    | 0.89   | 838     |
| 5     | 93%       | 81%    | 0.91   | 1190    |

Then…

2.

… you can tell if the unbalanced data is even a problem. If the scores for the less represented classes (class 1 and 2) are lower than for the classes with more training samples (class 4 and 5), then you know that the unbalanced data is in fact a problem, and you can act accordingly, as described in some of the other answers in this thread. However, if the same class distribution is present in the data you want to predict on, your unbalanced training data is a good representative of that data, and hence the imbalance is a good thing.
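
A minimal sketch of that check, reusing the toy arrays and the per-class scores from above (the 0.75 threshold is an arbitrary choice for illustration):

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

predicted = [1, 2, 3, 4, 5, 1, 2, 1, 1, 4, 5]
y_test = [1, 2, 3, 4, 5, 1, 2, 1, 1, 4, 1]

precision, recall, fscore, support = precision_recall_fscore_support(y_test, predicted)
labels = np.unique(y_test)

# flag classes whose F1 falls well below the support-weighted average
weighted_f1 = np.average(fscore, weights=support)
for label, f1, sup in zip(labels, fscore, support):
    if f1 < 0.75 * weighted_f1:
        print('class %d (support %d) looks problematic: F1 = %.2f' % (label, sup, f1))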


Answer 2

Posed question

Responding to the question 'what metric should be used for multi-class classification with imbalanced data': Macro-F1-measure. Macro Precision and Macro Recall can also be used, but they are not as easily interpretable as for binary classification; they are already incorporated into the F-measure, and extra metrics complicate method comparison, parameter tuning, and so on.

Micro averaging is sensitive to class imbalance: if, for example, your method works well for the most common labels and totally messes up the others, micro-averaged metrics will still show good results.

Weighted averaging isn't well suited for imbalanced data, because it weights by label counts. Moreover, it is hard to interpret and unpopular: for instance, there is no mention of such averaging in the following very detailed survey, which I strongly recommend looking through:

Sokolova, Marina, and Guy Lapalme. “A systematic analysis of performance measures for classification tasks.” Information Processing & Management 45.4 (2009): 427-437.

Application-specific question

However, returning to your task, I’d research 2 topics:

  1. Metrics commonly used for your specific task – this lets you (a) compare your method with others and understand if you are doing something wrong, and (b) reuse someone else's findings instead of exploring everything yourself;
  2. The cost of different errors of your method – for example, the use-case of your application may rely on 4- and 5-star reviews only; in this case, a good metric should count only these 2 labels (see the sketch right after this list).
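
As a hedged sketch of that second point, the metric functions in scikit-learn accept a labels parameter, so you could restrict the score to the labels your use-case cares about (toy arrays, just for illustration):

from sklearn.metrics import f1_score

# toy true / predicted star ratings
y_true = [5, 5, 4, 4, 3, 2, 1, 5, 4, 4]
y_pred = [5, 4, 4, 4, 3, 1, 1, 5, 5, 4]

# only classes 4 and 5 enter the score
print(f1_score(y_true, y_pred, labels=[4, 5], average='macro'))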

Commonly used metrics. As far as I can infer from the literature, there are 2 main evaluation metrics:

  1. Accuracy, which is used, e.g. in

Yu, April, and Daryl Chang. “Multiclass Sentiment Prediction using Yelp Business.”

(link) – note that the authors work with almost the same distribution of ratings, see Figure 5.

Pang, Bo, and Lillian Lee. “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.” Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005.

(link)

  2. MSE (or, less often, Mean Absolute Error – MAE) – see, for example,

Lee, Moontae, and R. Grafe. “Multiclass sentiment analysis with restaurant reviews.” Final Projects from CS N 224 (2010).

(link) – they explore both accuracy and MSE, considering the latter to be better

Pappas, Nikolaos, Rue Marconi, and Andrei Popescu-Belis. “Explaining the Stars: Weighted Multiple-Instance Learning for Aspect-Based Sentiment Analysis.” Proceedings of the 2014 Conference on Empirical Methods In Natural Language Processing. No. EPFL-CONF-200899. 2014.

(link) – they utilize scikit-learn for evaluation and baseline approaches and state that their code is available; however, I can't find it, so if you need it, write to the authors; the work is pretty new and seems to be written in Python.

Cost of different errors. If you care more about avoiding gross blunders, e.g. assigning a 1-star rating to a 5-star review or something like that, look at MSE; if the difference matters, but not so much, try MAE, since it doesn't square the difference; otherwise stay with Accuracy.
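
A minimal sketch of those two error measures, treating the 1-5 star labels as plain numbers (toy arrays, just for illustration):

from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = [5, 5, 4, 3, 2, 1, 5, 4]
y_pred = [4, 5, 4, 1, 2, 2, 5, 5]

print(mean_squared_error(y_true, y_pred))   # squares the difference, so gross blunders dominate
print(mean_absolute_error(y_true, y_pred))  # penalizes all mistakes linearly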

About approaches, not metrics

Try regression approaches, e.g. SVR, since they generally outperform multiclass classifiers like SVC or OVA SVM.
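
A hedged sketch of that idea; the rounding/clipping back to the 1-5 scale is just one possible choice, and X_train, y_train and X_test are assumed to exist:

import numpy as np
from sklearn.svm import SVR

svr = SVR(kernel='linear', C=1)
svr.fit(X_train, y_train)      # y_train holds the 1-5 star ratings as numbers

raw_pred = svr.predict(X_test)
# map the continuous output back to the discrete 1-5 scale before computing classification metrics
star_pred = np.clip(np.rint(raw_pred), 1, 5).astype(int)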


Answer 3

First of all, it's a little bit harder to tell whether your data is unbalanced using counting analysis alone. For example: is 1 positive observation in 1000 just noise, an error, or a breakthrough in science? You never know.
So it's always better to use all your available knowledge and choose its status wisely.

Okay, what if it's really unbalanced?
Once again, look at your data. Sometimes you can find one or two observations multiplied a hundred times. Sometimes it's useful to create such fake single-class observations.
If all the data is clean, the next step is to use class weights in your prediction model.
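
A minimal sketch of that oversampling idea, using sklearn.utils.resample to duplicate minority-class rows; X and y are assumed to be the full feature matrix and labels as NumPy arrays, and the target size of 800 is an arbitrary choice for illustration:

import numpy as np
from sklearn.utils import resample

minority_mask = (y == 2)                      # class 2 has only 127 instances
X_min, y_min = X[minority_mask], y[minority_mask]

# sample with replacement until the class reaches the target size
X_up, y_up = resample(X_min, y_min, replace=True, n_samples=800, random_state=0)

X_balanced = np.vstack([X, X_up])
y_balanced = np.concatenate([y, y_up])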

So what about multiclass metrics?
In my experience none of those metrics is commonly used. There are two main reasons.
First: it's always better to work with probabilities than with hard predictions (because how else could you separate models that predict 0.9 and 0.6 if they both give you the same class?).
Second: it's much easier to compare your prediction models and build new ones when you depend on only one good metric.
From my experience I would recommend logloss or MSE (mean squared error).
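
A hedged sketch of scoring with probabilities instead of hard labels; note that SVC only exposes predict_proba when probability=True, and X_train, y_train, X_test, y_test are assumed to exist:

from sklearn.svm import SVC
from sklearn.metrics import log_loss

clf = SVC(kernel='linear', C=1, probability=True)   # probability=True enables predict_proba
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)                   # one probability per class for each sample
print(log_loss(y_test, proba))                      # lower is better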

How to fix the sklearn warnings?
Simply (as yangjie noticed) override the average parameter with one of these values: 'micro' (calculate metrics globally), 'macro' (calculate metrics for each label and average them) or 'weighted' (same as macro but weighted by class support).

f1_score(y_test, prediction, average='weighted')

All your warnings came from calling metric functions with the default average value ('binary'), which is inappropriate for multiclass prediction.
Good luck and have fun with machine learning!

Edit:
I found another answerer's recommendation to switch to regression approaches (e.g. SVR), with which I cannot agree. As far as I remember, there is not even such a thing as multiclass regression. Yes, there is multilabel regression, which is far different, and yes, in some cases it is possible to switch between regression and classification (if the classes are somehow ordered), but it is pretty rare.

What I would recommend (within the scope of scikit-learn) is to try other very powerful classification tools: gradient boosting, random forest (my favorite), KNeighbors and many more.

After that you can calculate the arithmetic or geometric mean of the predictions, and most of the time you'll get an even better result.

final_prediction = (KNNprediction * RFprediction) ** 0.5
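
A hedged sketch expanding that one-liner, taking the geometric mean of predicted class probabilities (rather than of the hard labels) from two classifiers; X_train, y_train and X_test are assumed to exist and the hyperparameters are illustrative:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

knn_proba = knn.predict_proba(X_test)
rf_proba = rf.predict_proba(X_test)

# geometric mean of the two probability estimates, then pick the most likely class
blended = np.sqrt(knn_proba * rf_proba)
final_prediction = knn.classes_[np.argmax(blended, axis=1)]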
