In scikit-learn, sklearn.decomposition.RandomizedPCA has two methods, transform and fit_transform. The descriptions of the two methods are as follows:
But what is the difference between them?
Answer
The .transform method is meant for when you have already computed PCA, i.e. if you have already called its .fit method.
In [12]: pc2 = RandomizedPCA(n_components=3)
In [13]: pc2.transform(X) # can't transform because it does not know how to do it.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-13-e3b6b8ea2aff> in <module>()
----> 1 pc2.transform(X)
/usr/local/lib/python3.4/dist-packages/sklearn/decomposition/pca.py in transform(self, X, y)
714 # XXX remove scipy.sparse support here in 0.16
715 X = atleast2d_or_csr(X)
--> 716 if self.mean_ is not None:
717 X = X - self.mean_
718
AttributeError: 'RandomizedPCA' object has no attribute 'mean_'
In [14]: pc2.fit_transform(X)
Out[14]:
array([[-1.38340578, -0.2935787 ],
       [-2.22189802,  0.25133484],
       [-3.6053038 , -0.04224385],
       [ 1.38340578,  0.2935787 ],
       [ 2.22189802, -0.25133484],
       [ 3.6053038 ,  0.04224385]])
So you want to fit RandomizedPCA first and then transform, like this:
In [20]: pca = RandomizedPCA(n_components=3)
In [21]: pca.fit(X)
Out[21]:
RandomizedPCA(copy=True, iterated_power=3, n_components=3, random_state=None,
       whiten=False)
In [22]: pca.transform(z)
Out[22]:
array([[ 2.76681156,  0.58715739],
       [ 1.92831932,  1.13207093],
       [ 0.54491354,  0.83849224],
       [ 5.53362311,  1.17431479],
       [ 6.37211535,  0.62940125],
       [ 7.75552113,  0.92297994]])
In particular, PCA's .transform applies the change of basis obtained from the PCA decomposition of the matrix X to the matrix z.
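For anyone reading this on a recent scikit-learn: RandomizedPCA was removed in version 0.20, and PCA(svd_solver="randomized") is the replacement. Here is a minimal sketch of the same fit-then-transform idea under that assumption. The arrays X and z are toy data I made up, not necessarily the ones from the session above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.exceptions import NotFittedError

# Toy data, assumed for illustration only.
X = np.array([[-1.0, -1.0], [-2.0, -1.0], [-3.0, -2.0],
              [1.0, 1.0], [2.0, 1.0], [3.0, 2.0]])
z = X + 4.0  # hypothetical new data with the same number of columns

pca = PCA(n_components=2, svd_solver="randomized", random_state=0)

# Calling transform before fit fails: nothing has been learned yet.
try:
    pca.transform(X)
    failed_before_fit = False
except NotFittedError:
    failed_before_fit = True

pca.fit(X)                       # learns mean_ and components_ from X
z_projected = pca.transform(z)   # reuses what was learned from X

# With the default whiten=False, transform is centering plus a change of basis.
z_manual = (z - pca.mean_) @ pca.components_.T
```

The last line is only there to show what transform does under the hood when whitening is off; in practice you would just call transform.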
fit(raw_documents[, y]): Learn a vocabulary dictionary of all tokens in the raw documents.
fit_transform(raw_documents[, y]): Learn the vocabulary dictionary and return the document-term matrix. This is equivalent to fit followed by transform, but more efficiently implemented.
transform(raw_documents): Transform documents to document-term matrix. Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.
Both fit_transform and transform return the same kind of object: a document-term matrix.
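The method descriptions above match a text vectorizer such as CountVectorizer, so here is a small sketch using it. The documents are made up for illustration. It shows that fit_transform learns the vocabulary and counts in one call, while transform only counts against the vocabulary that was already learned.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, for illustration only.
train_docs = ["the cat sat", "the dog sat down"]
test_docs = ["the cat ran"]

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learn vocabulary, then count
test_matrix = vectorizer.transform(test_docs)        # count using the learned vocabulary

# fit followed by transform gives the same matrix as fit_transform.
two_step_matrix = CountVectorizer().fit(train_docs).transform(train_docs)

# "ran" never appeared in the training documents, so it has no column.
unseen_word_known = "ran" in vectorizer.vocabulary_
```

Note that the test matrix has the same number of columns as the training matrix, because the columns come from the fitted vocabulary and not from the test documents.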
In layman’s terms, fit_transform means doing some calculation and then applying a transformation (say, calculating the column means of some data and then replacing the missing values with them). So for the training set, you need to both calculate and transform.
For the testing set, however, you reuse what was learned from the training set, so nothing needs to be calculated again; you only perform the transformation.
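The column-means example can be sketched with SimpleImputer. The arrays below are made up, and I am assuming the mean strategy is what is meant by "calculating the means of columns".

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data with missing values, for illustration only.
X_train = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, np.nan]])
X_test = np.array([[np.nan, np.nan],
                   [5.0, 40.0]])

imputer = SimpleImputer(strategy="mean")

# Training set: calculate the column means and fill the gaps with them.
X_train_filled = imputer.fit_transform(X_train)

# Testing set: no new calculation, the training means are reused.
X_test_filled = imputer.transform(X_test)

learned_means = imputer.statistics_  # column means computed from X_train only
```

The missing values in X_test are filled with the training means (2.0 and 15.0 here), not with anything computed from X_test itself.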
Imagine we are fitting a tokenizer: if we fit it on all of X, we are leaking testing data into the tokenizer. I have seen this error many times!
The correct approach is to fit ONLY with X_train, because you don’t know “your future data”, so you cannot use X_test for fitting anything!
Then you can transform your test data, but separately. That’s why there are different methods.
Final tip: X_train_transformed = model.fit_transform(X_train) is equivalent to:
X_train_transformed = model.fit(X_train).transform(X_train), but the first one is usually faster.
Note that what I call “model” will usually be a scaler, a tfidf transformer, another kind of vectorizer, a tokenizer…
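To make the tip concrete, here is a minimal sketch with StandardScaler standing in for “model”. The random data and the split parameters are my own assumptions; any transformer with fit and transform should follow the same pattern.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random data, assumed for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

model = StandardScaler()
X_train_transformed = model.fit_transform(X_train)  # fit ONLY on the training data
X_test_transformed = model.transform(X_test)        # test data is only transformed

# The two-step version gives the same result on the training data.
two_step = StandardScaler().fit(X_train).transform(X_train)
```

The scaler's mean_ comes from X_train alone, so the transformed test set will generally not have exactly zero mean, which is expected.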