将分类器保存到scikit-learn中的磁盘

问题:将分类器保存到scikit-learn中的磁盘

如何保存经过训练的朴素贝叶斯分类器磁盘并用于预测数据?

我有来自scikit-learn网站的以下示例程序:

from sklearn import datasets
iris = datasets.load_iris()
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print "Number of mislabeled points : %d" % (iris.target != y_pred).sum()

How do I save a trained Naive Bayes classifier to disk and use it to predict data?

I have the following sample program from the scikit-learn website:

from sklearn import datasets
iris = datasets.load_iris()
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print "Number of mislabeled points : %d" % (iris.target != y_pred).sum()

回答 0

分类器只是可以像其他任何东西一样被腌制和倾倒的对象。继续您的示例:

import cPickle
# save the classifier
with open('my_dumped_classifier.pkl', 'wb') as fid:
    cPickle.dump(gnb, fid)    

# load it again
with open('my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = cPickle.load(fid)

Classifiers are just objects that can be pickled and dumped like any other. To continue your example:

import cPickle
# save the classifier
with open('my_dumped_classifier.pkl', 'wb') as fid:
    cPickle.dump(gnb, fid)    

# load it again
with open('my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = cPickle.load(fid)

回答 1

您还可以使用joblib.dumpjoblib.load,它们在处理数字数组方面比默认的python pickler效率更高。

Joblib包含在scikit-learn中:

>>> import joblib
>>> from sklearn.datasets import load_digits
>>> from sklearn.linear_model import SGDClassifier

>>> digits = load_digits()
>>> clf = SGDClassifier().fit(digits.data, digits.target)
>>> clf.score(digits.data, digits.target)  # evaluate training error
0.9526989426822482

>>> filename = '/tmp/digits_classifier.joblib.pkl'
>>> _ = joblib.dump(clf, filename, compress=9)

>>> clf2 = joblib.load(filename)
>>> clf2
SGDClassifier(alpha=0.0001, class_weight=None, epsilon=0.1, eta0=0.0,
       fit_intercept=True, learning_rate='optimal', loss='hinge', n_iter=5,
       n_jobs=1, penalty='l2', power_t=0.5, rho=0.85, seed=0,
       shuffle=False, verbose=0, warm_start=False)
>>> clf2.score(digits.data, digits.target)
0.9526989426822482

编辑:在Python 3.8+中,如果您使用pickle协议5(不是默认值),现在可以使用pickle对具有大数值数组的对象进行有效的酸洗作为属性。

You can also use joblib.dump and joblib.load which is much more efficient at handling numerical arrays than the default python pickler.

Joblib is included in scikit-learn:

>>> import joblib
>>> from sklearn.datasets import load_digits
>>> from sklearn.linear_model import SGDClassifier

>>> digits = load_digits()
>>> clf = SGDClassifier().fit(digits.data, digits.target)
>>> clf.score(digits.data, digits.target)  # evaluate training error
0.9526989426822482

>>> filename = '/tmp/digits_classifier.joblib.pkl'
>>> _ = joblib.dump(clf, filename, compress=9)

>>> clf2 = joblib.load(filename)
>>> clf2
SGDClassifier(alpha=0.0001, class_weight=None, epsilon=0.1, eta0=0.0,
       fit_intercept=True, learning_rate='optimal', loss='hinge', n_iter=5,
       n_jobs=1, penalty='l2', power_t=0.5, rho=0.85, seed=0,
       shuffle=False, verbose=0, warm_start=False)
>>> clf2.score(digits.data, digits.target)
0.9526989426822482

Edit: in Python 3.8+ it’s now possible to use pickle for efficient pickling of object with large numerical arrays as attributes if you use pickle protocol 5 (which is not the default).


回答 2

您正在寻找的内容被称为sklearn词中的模型持久性,并且在简介模型持久中都有记录部分中进行了记录。

因此,您已经初始化了分类器并使用

clf = some.classifier()
clf.fit(X, y)

之后,您有两个选择:

1)使用泡菜

import pickle
# now you can save it to a file
with open('filename.pkl', 'wb') as f:
    pickle.dump(clf, f)

# and later you can load it
with open('filename.pkl', 'rb') as f:
    clf = pickle.load(f)

2)使用Joblib

from sklearn.externals import joblib
# now you can save it to a file
joblib.dump(clf, 'filename.pkl') 
# and later you can load it
clf = joblib.load('filename.pkl')

再读一遍有助于阅读上述链接

What you are looking for is called Model persistence in sklearn words and it is documented in introduction and in model persistence sections.

So you have initialized your classifier and trained it for a long time with

clf = some.classifier()
clf.fit(X, y)

After this you have two options:

1) Using Pickle

import pickle
# now you can save it to a file
with open('filename.pkl', 'wb') as f:
    pickle.dump(clf, f)

# and later you can load it
with open('filename.pkl', 'rb') as f:
    clf = pickle.load(f)

2) Using Joblib

from sklearn.externals import joblib
# now you can save it to a file
joblib.dump(clf, 'filename.pkl') 
# and later you can load it
clf = joblib.load('filename.pkl')

One more time it is helpful to read the above-mentioned links


回答 3

在许多情况下,尤其是对于文本分类,仅存储分类器是不够的,但是您还需要存储矢量化器,以便将来可以对输入进行矢量化。

import pickle
with open('model.pkl', 'wb') as fout:
  pickle.dump((vectorizer, clf), fout)

未来用例:

with open('model.pkl', 'rb') as fin:
  vectorizer, clf = pickle.load(fin)

X_new = vectorizer.transform(new_samples)
X_new_preds = clf.predict(X_new)

在转储矢量化器之前,可以通过以下方式删除矢量化器的stop_words_属性:

vectorizer.stop_words_ = None

使倾销更有效率。同样,如果您的分类器参数稀疏(如大多数文本分类示例中一样),则可以将参数从密集转换为稀疏,这将在内存消耗,加载和转储方面产生巨大差异。通过以下方式稀疏模型:

clf.sparsify()

这对于SGDClassifier将自动工作,但是如果您知道模型稀疏(clf.coef_中为零),则可以通过以下方式将clf.coef_手动转换为csr scipy稀疏矩阵

clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)

然后您可以更有效地存储它。

In many cases, particularly with text classification it is not enough just to store the classifier but you’ll need to store the vectorizer as well so that you can vectorize your input in future.

import pickle
with open('model.pkl', 'wb') as fout:
  pickle.dump((vectorizer, clf), fout)

future use case:

with open('model.pkl', 'rb') as fin:
  vectorizer, clf = pickle.load(fin)

X_new = vectorizer.transform(new_samples)
X_new_preds = clf.predict(X_new)

Before dumping the vectorizer, one can delete the stop_words_ property of vectorizer by:

vectorizer.stop_words_ = None

to make dumping more efficient. Also if your classifier parameters is sparse (as in most text classification examples) you can convert the parameters from dense to sparse which will make a huge difference in terms of memory consumption, loading and dumping. Sparsify the model by:

clf.sparsify()

Which will automatically work for SGDClassifier but in case you know your model is sparse (lots of zeros in clf.coef_) then you can manually convert clf.coef_ into a csr scipy sparse matrix by:

clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)

and then you can store it more efficiently.


回答 4

sklearn估计器实现的方法使您可以轻松保存估计器的相关训练属性。一些估计器__getstate__自己实现方法,但是其他估计器,例如GMM仅使用基本实现,该实现只是将对象保存在内部字典中:

def __getstate__(self):
    try:
        state = super(BaseEstimator, self).__getstate__()
    except AttributeError:
        state = self.__dict__.copy()

    if type(self).__module__.startswith('sklearn.'):
        return dict(state.items(), _sklearn_version=__version__)
    else:
        return state

将模型保存到光盘的推荐方法是使用以下pickle模块:

from sklearn import datasets
from sklearn.svm import SVC
iris = datasets.load_iris()
X = iris.data[:100, :2]
y = iris.target[:100]
model = SVC()
model.fit(X,y)
import pickle
with open('mymodel','wb') as f:
    pickle.dump(model,f)

但是,您应该保存其他数据,以便将来可以重新训练模型,或遭受可怕的后果(例如被锁定在旧版本的sklearn中)

文档中

为了用将来的scikit-learn版本重建类似的模型,应该在腌制的模型中保存其他元数据:

训练数据,例如对不变快照的引用

用于生成模型的python源代码

scikit-learn的版本及其依赖项

在训练数据上获得的交叉验证分数

对于依赖于tree.pyx用Cython(例如IsolationForest)编写的模块的Ensemble估计器而言尤其如此,因为它会创建与实现的耦合,这不能保证sklearn版本之间的稳定性。在过去,它已经看到了不兼容的变化。

如果您的模型变得非常大并且加载变得很麻烦,那么您还可以使用更高效的joblib。从文档中:

在scikit的特定情况下,使用joblib替换picklejoblib.dumpjoblib.load)可能会更有趣,这对于内部装有大型numpy数组的对象更有效,就像装配的scikit-learn估计量通常那样,但只能腌制到磁盘而不是字符串:

sklearn estimators implement methods to make it easy for you to save relevant trained properties of an estimator. Some estimators implement __getstate__ methods themselves, but others, like the GMM just use the base implementation which simply saves the objects inner dictionary:

def __getstate__(self):
    try:
        state = super(BaseEstimator, self).__getstate__()
    except AttributeError:
        state = self.__dict__.copy()

    if type(self).__module__.startswith('sklearn.'):
        return dict(state.items(), _sklearn_version=__version__)
    else:
        return state

The recommended method to save your model to disc is to use the pickle module:

from sklearn import datasets
from sklearn.svm import SVC
iris = datasets.load_iris()
X = iris.data[:100, :2]
y = iris.target[:100]
model = SVC()
model.fit(X,y)
import pickle
with open('mymodel','wb') as f:
    pickle.dump(model,f)

However, you should save additional data so you can retrain your model in the future, or suffer dire consequences (such as being locked into an old version of sklearn).

From the documentation:

In order to rebuild a similar model with future versions of scikit-learn, additional metadata should be saved along the pickled model:

The training data, e.g. a reference to a immutable snapshot

The python source code used to generate the model

The versions of scikit-learn and its dependencies

The cross validation score obtained on the training data

This is especially true for Ensemble estimators that rely on the tree.pyx module written in Cython(such as IsolationForest), since it creates a coupling to the implementation, which is not guaranteed to be stable between versions of sklearn. It has seen backwards incompatible changes in the past.

If your models become very large and loading becomes a nuisance, you can also use the more efficient joblib. From the documentation:

In the specific case of the scikit, it may be more interesting to use joblib’s replacement of pickle (joblib.dump & joblib.load), which is more efficient on objects that carry large numpy arrays internally as is often the case for fitted scikit-learn estimators, but can only pickle to the disk and not to a string:


回答 5

sklearn.externals.joblib已被弃用,因为0.21,将在被删除v0.23

/usr/local/lib/python3.7/site-packages/sklearn/externals/joblib/ init .py:15:FutureWarning:sklearn.externals.joblib在0.21中已弃用,在0.23中将被删除。请直接从joblib导入此功能,可以通过以下方式安装该功能:pip install joblib。如果在加载腌制模型时出现此警告,则可能需要使用scikit-learn 0.21+重新序列化那些模型。
warnings.warn(msg,category = FutureWarning)


因此,您需要安装joblib

pip install joblib

最后将模型写入磁盘:

import joblib
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier


digits = load_digits()
clf = SGDClassifier().fit(digits.data, digits.target)

with open('myClassifier.joblib.pkl', 'wb') as f:
    joblib.dump(clf, f, compress=9)

现在,要读取转储的文件,您需要运行的是:

with open('myClassifier.joblib.pkl', 'rb') as f:
    my_clf = joblib.load(f)

sklearn.externals.joblib has been deprecated since 0.21 and will be removed in v0.23:

/usr/local/lib/python3.7/site-packages/sklearn/externals/joblib/init.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
warnings.warn(msg, category=FutureWarning)


Therefore, you need to install joblib:

pip install joblib

and finally write the model to disk:

import joblib
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier


digits = load_digits()
clf = SGDClassifier().fit(digits.data, digits.target)

with open('myClassifier.joblib.pkl', 'wb') as f:
    joblib.dump(clf, f, compress=9)

Now in order to read the dumped file all you need to run is:

with open('myClassifier.joblib.pkl', 'rb') as f:
    my_clf = joblib.load(f)