Transformer in scikit-learn: a class that has fit and transform methods, or a fit_transform method.
Predictor: a class that has fit and predict methods, or a fit_predict method.
Pipeline is just an abstract notion; it is not an existing ML algorithm. In ML tasks you often need to perform a sequence of different transformations (find a set of features, generate new features, select only some good features) on the raw dataset before applying the final estimator.
Here is a good example of Pipeline usage.
Pipeline gives you a single interface for all 3 steps of transformation and the resulting estimator. It encapsulates transformers and predictors inside, so you can replace something like:
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = SGDClassifier()
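vX = vect.fit_transform(Xtrain)
tfidfX = tfidf.fit_transform(vX)
# ytrain holds the training labels (assumed to exist; a supervised classifier needs them)
clf.fit(tfidfX, ytrain)
predicted = clf.predict(tfidfX)
# Now evaluate all steps on the test set: only transform, do not refit
vX = vect.transform(Xtest)
tfidfX = tfidf.transform(vX)
predicted = clf.predict(tfidfX)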
With just:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
# ytrain holds the training labels (assumed to exist)
predicted = pipeline.fit(Xtrain, ytrain).predict(Xtrain)
# Now evaluate all steps on test set
predicted = pipeline.predict(Xtest)
With pipelines you can easily perform a grid search over a set of parameters for each step of this meta-estimator, as described in the link above. All steps except the last one must be transformers; the last step can be a transformer or a predictor.
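A minimal sketch of such a grid search, assuming a tiny made-up corpus and label list (neither comes from the original answer). Parameters of a step are addressed as the step name, two underscores, then the parameter name:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus (hypothetical data, not from the original answer).
docs = [
    "cheap pills buy now", "buy cheap watches now",
    "win money now", "cheap money offer",
    "meeting at noon today", "project status report",
    "lunch with the team", "report due today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Each key is <step name>__<parameter name>.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "clf__alpha": [1e-4, 1e-2],
}

# cv=2 only because the toy corpus is so small.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)
print(search.best_params_)
```

The fitted search object behaves like the best pipeline found, so search.predict(...) runs all steps in order.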
Answer to edit:
When you call pipeline.fit(), each transformer inside the pipeline is fitted on the output of the previous transformer (the first transformer is fitted on the raw dataset). The last estimator may be a transformer or a predictor. You can call fit_transform() on the pipeline only if the last estimator is a transformer (one that implements fit_transform, or transform and fit separately), and you can call fit_predict() or predict() on the pipeline only if the last estimator is a predictor. So you cannot call fit_transform or transform on a pipeline whose last step is a predictor.
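This rule can be checked without fitting anything. The sketch below assumes two throwaway pipelines (the step choices are illustrative only) and simply asks each one which methods it exposes:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Last step is a transformer.
ends_in_transformer = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
])

# Last step is a predictor.
ends_in_predictor = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# The pipeline only exposes the methods its final step supports.
for name, pipe in [("transformer last", ends_in_transformer),
                   ("predictor last", ends_in_predictor)]:
    print(name,
          "| transform:", hasattr(pipe, "transform"),
          "| predict:", hasattr(pipe, "predict"))
```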
I think that M0rkHaV has the right idea. Scikit-learn’s pipeline class is a useful tool for encapsulating multiple different transformers alongside an estimator into one object, so that you only have to call your important methods once (fit(), predict(), etc). Let’s break down the two major components:
Transformers are classes that implement both fit() and transform(). You might be familiar with some of the sklearn preprocessing tools, like TfidfVectorizer and Binarizer. If you look at the docs for these preprocessing tools, you’ll see that they implement both of these methods. What I find pretty cool is that some estimators can also be used as transformation steps, e.g. LinearSVC (in current scikit-learn releases this is done by wrapping the estimator in SelectFromModel)!
Estimators are classes that implement both fit() and predict(). You’ll find that many of the classifiers and regression models implement both these methods, and as such you can readily test many different models. It is possible to use another transformer as the final estimator (i.e., it doesn’t necessarily implement predict(), but definitely implements fit()). All this means is that you wouldn’t be able to call predict().
As for your edit: let’s go through a text-based example. Using LabelBinarizer, we want to turn a list of labels into a list of binary values.
from sklearn.preprocessing import LabelBinarizer
bin = LabelBinarizer() # first we initialize
vec = ['cat', 'dog', 'dog', 'dog'] # we have our label list we want binarized
Now, when the binarizer is fitted on some data, it will have a structure called classes_ that contains the unique classes that the transformer ‘knows’ about. Without calling fit() the binarizer has no idea what the data looks like, so calling transform() wouldn’t make any sense. You can see this if you try to print the list of classes before fitting the data.
print(bin.classes_)
I get the following error when trying this:
AttributeError: 'LabelBinarizer' object has no attribute 'classes_'
But when you fit the binarizer on the vec list:
bin.fit(vec)
and try again
print(bin.classes_)
I get the following:
['cat' 'dog']
print(bin.transform(vec))
And now, after calling transform on the vec object, we get the following:
[[0]
 [1]
 [1]
 [1]]
As for estimators being used as transformers, let us use the DecisionTree classifier as an example of a feature extractor. Decision Trees are great for a lot of reasons, but for our purposes, what’s important is that they can rank the features that the tree found useful for predicting. In older scikit-learn releases you could call transform() directly on a Decision Tree; in current releases the same behavior comes from wrapping the tree in SelectFromModel. Either way, it takes your input data and keeps what it considers the most important features. So you can think of it as transforming your data matrix (n rows by m columns) into a smaller matrix (n rows by k columns), where the k columns are the k most important features that the Decision Tree found.
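A minimal sketch of this idea with SelectFromModel, assuming the bundled iris dataset as stand-in data (the dataset choice is mine, not the answer's). By default the selector keeps the features whose importance is at least the mean importance:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 150 rows, 4 feature columns

# Wrap the tree so it can act as a transformer that keeps the
# features the fitted tree considers important.
selector = SelectFromModel(DecisionTreeClassifier(random_state=0))
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)
print(selector.get_support())  # boolean mask of the kept columns
```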
ML algorithms typically process tabular data. You may want to do preprocessing and post-processing of this data before and after your ML algorithm. A pipeline is a way to chain those data processing steps.
A pipeline is a series of steps in which data is transformed. It comes from the old “pipe and filter” design pattern (for instance, you could think of unix bash commands with pipes “|” or redirect operators “>”). However, pipelines are objects in the code. Thus, you may have a class for each filter (a.k.a. each pipeline step), and then another class to combine those steps into the final pipeline. Some pipelines may combine other pipelines in series or in parallel, have multiple inputs or outputs, and so on. We like to view Machine Learning pipelines as:
Pipe and filters. The pipeline’s steps process data, and they manage their inner state which can be learned from the data.
Composites. Pipelines can be nested: for example a whole pipeline can be treated as a single pipeline step in another pipeline. A pipeline step is not necessarily a pipeline, but a pipeline is itself at least a pipeline step by definition.
Directed Acyclic Graphs (DAG). A pipeline step’s output may be sent to many other steps, and then the resulting outputs can be recombined, and so on. Side note: although pipelines are acyclic, they can process multiple items one by one, and if their state changes (e.g. by using the fit_transform method each time), then they can be viewed as recurrently unfolding through time while keeping their states (think of an RNN). That’s an interesting way to see pipelines for doing online learning when putting them in production and training them on more data.
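The composite and branching views can be sketched with scikit-learn's own Pipeline and FeatureUnion. The data and the particular steps below are assumptions chosen only to show the nesting:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(20, 5)  # made-up data: 20 rows, 5 columns

# An inner pipeline that will be used as a single step elsewhere.
scale_then_reduce = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
])

# Two branches receive the same input; their outputs are concatenated.
branches = FeatureUnion([
    ("reduced", scale_then_reduce),  # contributes 2 columns
    ("minmax", MinMaxScaler()),      # contributes 5 columns
])

# The outer pipeline treats the whole branching block as one step.
outer = Pipeline([
    ("branches", branches),
    ("final_scale", StandardScaler()),
])

X_out = outer.fit_transform(X)
print(X_out.shape)
```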
Methods of a Scikit-Learn Pipeline
Pipelines (or steps in the pipeline) must have these two methods:
“fit” to learn from the data and acquire state (e.g. a neural network’s weights are such state)
“transform” (or “predict”) to actually process the data and generate an output or a prediction.
It’s also possible to call this method to chain both:
“fit_transform” to fit and then transform the data in one pass, which allows for potential code optimizations when the two methods must be called directly one after the other.
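As a small illustration of these methods, here is a hypothetical step (MeanCenterer is an invented name, not a scikit-learn class) that learns column means in fit and subtracts them in transform. Inheriting from TransformerMixin supplies a default fit_transform:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical step: subtracts the column means learned in fit()."""

    def fit(self, X, y=None):
        # The learned state is stored on the object.
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # Uses the state from fit(); it does not relearn anything.
        return np.asarray(X, dtype=float) - self.mean_


X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_new = np.array([[2.0, 20.0]])

centerer = MeanCenterer()
X_train_centered = centerer.fit_transform(X_train)  # fit, then transform
X_new_centered = centerer.transform(X_new)          # reuse the learned means
print(X_train_centered)
print(X_new_centered)
```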
Scikit-Learn’s “pipe and filter” design pattern is simply beautiful. But how to use it for Deep Learning, AutoML, and complex production-level pipelines?
Scikit-Learn started in 2007, in the pre deep learning era. However, it is one of the best-known and most widely adopted machine learning libraries, and it is still growing. On top of that, it uses the Pipe and Filter design pattern as a software architectural style, which, together with its ready-to-use algorithms, is what makes Scikit-Learn so fabulous. However, it has massive issues when it comes to doing the following, which we should already be able to do in 2020:
Automatic Machine Learning (AutoML),
Deep Learning Pipelines,
More complex Machine Learning pipelines.
Solutions That We’ve Found to Those Scikit-Learn Problems
For sure, Scikit-Learn is very convenient and well-built. However, it needs a refresh. Here are our solutions with Neuraxle to make Scikit-Learn fresh and useable within modern computing projects!
Additional pipeline methods and features offered through Neuraxle
Note: if a step of a pipeline doesn’t need to have one of the fit or transform methods, it could inherit from NonFittableMixin or NonTransformableMixin to be provided a default implementation of one of those methods to do nothing.
As a starter, it is possible for pipelines or their steps to also optionally define those methods:
“setup” which will call the “setup” method on each of its step. For instance, if a step contains a TensorFlow, PyTorch, or Keras neural network, the steps could create their neural graphs and register them to the GPU in the “setup” method before fit. It is discouraged to create the graphs directly in the constructors of the steps for several reasons, such as if the steps are copied before running many times with different hyperparameters within an Automatic Machine Learning algorithm that searches for the best hyperparameters for you.
“teardown”, which is the opposite of the “setup” method: it clears resources.
The following methods are provided by default to allow for managing hyperparameters:
“get_hyperparams” will return a dictionary of the hyperparameters. If your pipeline contains more pipelines (nested pipelines), then the hyperparameter keys are chained with double underscore “__” separators.
“set_hyperparams” will allow you to set new hyperparameters in the same format as the one you get them in.
“get_hyperparams_space” allows you to get the hyperparameter space, which will not be empty if you defined one. So, the only difference from “get_hyperparams” here is that you’ll get statistical distributions as values instead of precise values. For instance, one hyperparameter for the number of layers could be a RandInt(1, 3), which means 1 to 3 layers. You can call .rvs() on this dict to pick a value randomly and send it to “set_hyperparams” to try training on it.
“set_hyperparams_space” can be used to set a new space using the same hyperparameter distribution classes as in “get_hyperparams_space”.
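The double-underscore key convention also exists in plain scikit-learn, through get_params and set_params. The sketch below uses only scikit-learn and SciPy as an analogy; it is not Neuraxle's API, and the distribution and its bounds are assumptions chosen for illustration:

```python
from scipy.stats import randint
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", SGDClassifier()),
])

# Nested parameters are exposed as <step name>__<parameter name>.
params = pipeline.get_params()
print(params["clf__alpha"])

# Setting uses the same key format.
pipeline.set_params(clf__alpha=0.01)

# A search space holds distributions instead of concrete values.
# randint(500, 1501) draws integers from 500 to 1500 inclusive.
space = {"clf__max_iter": randint(500, 1501)}
sampled = {name: int(dist.rvs(random_state=0)) for name, dist in space.items()}
pipeline.set_params(**sampled)
print(sampled)
```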
For more info on our suggested solutions, read the entries in the big list with links above.
I’m trying to use scikit-learn’s LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I’d rather just have one big LabelEncoder object that works across all my columns of data.
Throwing the entire DataFrame into LabelEncoder creates the below error. Please bear in mind that I’m using dummy data here; in actuality I’m dealing with about 50 columns of string labeled data, so need a solution that doesn’t reference any columns by name.
Traceback (most recent call last):
  File "", line 1, in
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit
    y = column_or_1d(y, warn=True)
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (6, 3)
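The failure is easy to reproduce. The sketch below assumes a 6-row, 3-column frame of dummy data whose values match the categories shown in the answers; the exact error message wording differs between scikit-learn versions, but the exception type is ValueError:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Dummy data (assumed for illustration): 6 rows, 3 string columns.
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey", "dog", "dog"],
    "owner": ["Champ", "Ron", "Brick", "Champ", "Veronica", "Ron"],
    "location": ["San_Diego", "New_York", "New_York",
                 "San_Diego", "San_Diego", "New_York"],
})

# LabelEncoder expects a 1-d array, so a whole DataFrame is rejected.
try:
    LabelEncoder().fit(df)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# A single column is 1-d and works.
pets_codes = LabelEncoder().fit_transform(df["pets"])
print(pets_codes)
```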
Alternatively, OneHotEncoder can be applied to the DataFrame directly, as the OneHotEncoder now supports string input (since scikit-learn 0.20).
Applying OneHotEncoder only to certain columns is possible with the ColumnTransformer.
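A minimal sketch of that combination, assuming a small fruit table as toy data. The string columns are one-hot encoded and the numeric column is passed through unchanged; sparse_threshold=0 is set only so the result prints as a dense array:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

fruit_data = pd.DataFrame({
    "fruit": ["apple", "orange", "pear", "orange"],
    "color": ["red", "orange", "green", "green"],
    "weight": [5, 6, 3, 4],
})

encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["fruit", "color"])],
    remainder="passthrough",  # keep the untouched columns (here: weight)
    sparse_threshold=0,       # always return a dense array
)

encoded = encoder.fit_transform(fruit_data)
print(encoded.shape)  # 3 fruit columns + 3 color columns + weight
print(encoded)
```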
EDIT:
Since this answer is over a year old and has generated many upvotes (including a bounty), I should probably extend it further.
For inverse_transform and transform, you have to use a small hack.
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
d = defaultdict(LabelEncoder)
With this, you now retain each column’s LabelEncoder in a dictionary.
# Encoding the variable
fit = df.apply(lambda x: d[x.name].fit_transform(x))
# Inverse the encoded
fit.apply(lambda x: d[x.name].inverse_transform(x))
# Using the dictionary to label future data
df.apply(lambda x: d[x.name].transform(x))
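Put together as a runnable sketch, assuming the same kind of dummy data as in the question (the values and the new_rows frame are illustrative):

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey", "dog", "dog"],
    "owner": ["Champ", "Ron", "Brick", "Champ", "Veronica", "Ron"],
    "location": ["San_Diego", "New_York", "New_York",
                 "San_Diego", "San_Diego", "New_York"],
})

# One LabelEncoder per column, created on first access by column name.
d = defaultdict(LabelEncoder)

encoded = df.apply(lambda col: d[col.name].fit_transform(col))
decoded = encoded.apply(lambda col: d[col.name].inverse_transform(col))

# Future data is encoded with the already fitted encoders.
new_rows = pd.DataFrame({"pets": ["dog"], "owner": ["Brick"],
                         "location": ["New_York"]})
new_codes = new_rows.apply(lambda col: d[col.name].transform(col))

print(encoded)
print(new_codes)
```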
encoding_pipeline = Pipeline([
    ('encoding', MultiColumnLabelEncoder(columns=['fruit', 'color']))
    # add more pipeline steps as needed
])
encoding_pipeline.fit_transform(fruit_data)
As mentioned by larsmans, LabelEncoder() only takes a 1-d array as an argument. That said, it is quite easy to roll your own label encoder that operates on multiple columns of your choosing, and returns a transformed dataframe. My code here is based in part on Zac Stewart’s excellent blog post found here.
Creating a custom encoder involves simply creating a class that responds to the fit(), transform(), and fit_transform() methods. In your case, a good start might be something like this:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
# Create some toy data in a Pandas dataframe
fruit_data = pd.DataFrame({
    'fruit': ['apple', 'orange', 'pear', 'orange'],
    'color': ['red', 'orange', 'green', 'green'],
    'weight': [5, 6, 3, 4]
})

class MultiColumnLabelEncoder:
    def __init__(self, columns=None):
        self.columns = columns  # array of column names to encode

    def fit(self, X, y=None):
        return self  # not relevant here

    def transform(self, X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            # DataFrame.items() replaces iteritems(), which newer pandas removed
            for colname, col in output.items():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
Suppose we want to encode our two categorical attributes (fruit and color), while leaving the numeric attribute weight alone. We could do this as follows:
Passing it a dataframe consisting entirely of categorical variables and omitting the columns parameter will result in every column being encoded (which I believe is what you were originally looking for):
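Both usages described above can be sketched as follows. The class is restated in compact form only so the block runs on its own, and the two calls are my reading of the surrounding text rather than code preserved from the answer:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


class MultiColumnLabelEncoder:
    """Compact restatement of the class above, so this block is self-contained."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        output = X.copy()
        columns = self.columns if self.columns is not None else output.columns
        for col in columns:
            output[col] = LabelEncoder().fit_transform(output[col])
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


fruit_data = pd.DataFrame({
    "fruit": ["apple", "orange", "pear", "orange"],
    "color": ["red", "orange", "green", "green"],
    "weight": [5, 6, 3, 4],
})

# Encode only the two categorical columns; weight is left alone.
encoded_some = MultiColumnLabelEncoder(columns=["fruit", "color"]).fit_transform(fruit_data)

# Omit `columns` to encode every column of an all-categorical frame.
encoded_all = MultiColumnLabelEncoder().fit_transform(fruit_data.drop("weight", axis=1))

print(encoded_some)
print(encoded_all)
```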
You can convert the columns to categoricals and then get their codes. I used a dictionary comprehension below to apply this process to every column and wrap the result back into a dataframe of the same shape with identical indices and column names.
>>> pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)
   location  owner  pets
0         1      1     0
1         0      2     1
2         0      0     0
3         1      1     2
4         1      3     1
5         0      2     1
To create a mapping dictionary, you can use a dictionary comprehension to enumerate the categories:
>>> {col: {n: cat for n, cat in enumerate(df[col].astype('category').cat.categories)}
...  for col in df}
{'location': {0: 'New_York', 1: 'San_Diego'},
 'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
 'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}
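The mapping dictionary also gives a way back from codes to labels. The sketch below assumes dummy data consistent with the output shown in this answer; the decoding step with Series.map is my addition, not part of the original:

```python
import pandas as pd

df = pd.DataFrame({
    "location": ["San_Diego", "New_York", "New_York",
                 "San_Diego", "San_Diego", "New_York"],
    "owner": ["Champ", "Ron", "Brick", "Champ", "Veronica", "Ron"],
    "pets": ["cat", "dog", "cat", "monkey", "dog", "dog"],
})

# Integer codes per column, same shape and index as the input.
codes = pd.DataFrame(
    {col: df[col].astype("category").cat.codes for col in df},
    index=df.index,
)

# code -> label mapping for every column.
mapping = {
    col: dict(enumerate(df[col].astype("category").cat.categories))
    for col in df
}

# Decode by looking every code up in its column's mapping.
decoded = codes.apply(lambda s: s.map(mapping[s.name]))

print(codes)
print(decoded)
```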
This does not directly answer your question (for which Naputipulu Jon and PriceHardman have fantastic replies).
However, for the purpose of a few classification tasks etc. you could use
pandas.get_dummies(input_df)
This takes a dataframe with categorical data and returns a dataframe with binary values. The variable values are encoded into the column names of the resulting dataframe.
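A short sketch of what that looks like, assuming the same dummy data as in the question. Each distinct value becomes its own column, named after the original column and the value:

```python
import pandas as pd

input_df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey", "dog", "dog"],
    "owner": ["Champ", "Ron", "Brick", "Champ", "Veronica", "Ron"],
    "location": ["San_Diego", "New_York", "New_York",
                 "San_Diego", "San_Diego", "New_York"],
})

dummies = pd.get_dummies(input_df)

# 3 pets + 4 owners + 2 locations = 9 indicator columns.
print(dummies.columns.tolist())
print(dummies.shape)
```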
Assuming you are simply trying to get a sklearn.preprocessing.LabelEncoder() object that can be used to represent your columns, all you have to do is:
le = LabelEncoder()  # from sklearn.preprocessing
le.fit(df.columns)
In the above code you will have a unique number corresponding to each column.
More precisely, you will have a 1:1 mapping of df.columns to le.transform(df.columns.to_numpy()). To get a column’s encoding, simply pass it to le.transform(...). As an example, the following will get the encoding for each column:
le.transform(df.columns.to_numpy())
Assuming you want to create a sklearn.preprocessing.LabelEncoder() object for all of your row labels you can do the following:
le.fit([y for x in df.to_numpy() for y in x])
In this case, you most likely have non-unique row labels (as shown in your question). To see what classes the encoder created you can do le.classes_. You’ll note that this should have the same elements as in set(y for x in df.to_numpy() for y in x). Once again, to convert a row label to an encoded label use le.transform(...). As an example, if you want to retrieve the label for the first column in the df.columns array and the first row, you could do this:
le.transform([df.at[0, df.columns[0]]])
The question you had in your comment is a bit more complicated, but can still be accomplished:
le.fit([str(z) for z in set((x[0], y) for x in df.items() for y in x[1])])
The above code does the following:
Makes a unique combination of all of the pairs of (column, row).
Represents each pair as a string version of the tuple. This is a workaround for the LabelEncoder class not supporting tuples as class names.
Fits the new items to the LabelEncoder.
Now to use this new model it’s a bit more complicated. Assuming we want to extract the representation for the same item we looked up in the previous example (the first column in df.columns and the first row), we can do this:
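A sketch of that lookup, assuming dummy data like the question's. Because the encoder was fitted on string versions of (column, value) tuples, the lookup key has to be built the same way:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey", "dog", "dog"],
    "owner": ["Champ", "Ron", "Brick", "Champ", "Veronica", "Ron"],
    "location": ["San_Diego", "New_York", "New_York",
                 "San_Diego", "San_Diego", "New_York"],
})

# Fit on the string form of every unique (column, value) pair.
le = LabelEncoder()
pairs = {(col, value) for col, series in df.items() for value in series}
le.fit([str(pair) for pair in pairs])

# Look up the first row of the first column using the same string form.
first_col = df.columns[0]
key = str((first_col, df.at[0, first_col]))
code = le.transform([key])

print(key, "->", code[0])
```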
No, LabelEncoder does not do this. It takes 1-d arrays of class labels and produces 1-d arrays. It’s designed to handle class labels in classification problems, not arbitrary data, and any attempt to force it into other uses will require code to transform the actual problem to the problem it solves (and the solution back to the original space).
# get `object` columns
df_object_columns = df.iloc[:, :].select_dtypes(include=['object']).columns
df_copy_object_columns = df_copy.iloc[:, :].select_dtypes(include=['object']).columns
# instantiate `MultiColumnLabelEncoder`
mcle = MultiColumnLabelEncoder(columns=df_object_columns)
# fit to `df` data
mcle.fit(df)
# transform the `df` data
mcle.transform(df)
# returns output like below
array([[1, 0, 0, ..., 1, 1, 0],
       [0, 5, 1, ..., 1, 1, 2],
       [1, 1, 1, ..., 1, 1, 2],
       ...,
       [3, 5, 1, ..., 1, 1, 2],
# transform `df_copy` data
mcle.transform(df_copy)
# returns output like below (assuming the respective columns
# of `df_copy` contain the same unique values as that particular
# column in `df`)
array([[1, 0, 0, ..., 1, 1, 0],
       [0, 5, 1, ..., 1, 1, 2],
       [1, 1, 1, ..., 1, 1, 2],
       ...,
       [3, 5, 1, ..., 1, 1, 2],
# inverse `df` data
mcle.inverse_transform(df)
# outputs data like below
array([['August', 'Friday', '2013', ..., 'N', 'N', 'CA'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['August', 'Monday', '2014', ..., 'N', 'N', 'NJ'],
       ...,
       ['February', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['March', 'Tuesday', '2013', ..., 'N', 'N', 'NJ']], dtype=object)
# inverse `df_copy` data
mcle.inverse_transform(df_copy)
# outputs data like below
array([['August', 'Friday', '2013', ..., 'N', 'N', 'CA'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['August', 'Monday', '2014', ..., 'N', 'N', 'NJ'],
       ...,
       ['February', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['April', 'Tuesday', '2014', ..., 'N', 'N', 'NJ'],
       ['March', 'Tuesday', '2013', ..., 'N', 'N', 'NJ']], dtype=object)
This is a year-and-a-half after the fact, but I too, needed to be able to .transform() multiple pandas dataframe columns at once (and be able to .inverse_transform() them as well). This expands upon the excellent suggestion of @PriceHardman above:
import numpy as np
from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder(LabelEncoder):
    """
    Wraps sklearn LabelEncoder functionality for use on multiple columns of a
    pandas dataframe.
    """
    def __init__(self, columns=None):
        self.columns = columns
    def fit(self, dframe):
        """
        Fit label encoder to pandas columns.
        Access individual column classes via indexing `self.all_classes_`
        Access individual column encoders via indexing
        `self.all_encoders_`
        """
        # if columns are provided, iterate through and get `classes_`
        if self.columns is not None:
            # ndarray to hold LabelEncoder().classes_ for each
            # column; should match the shape of specified `columns`
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            for idx, column in enumerate(self.columns):
                # fit LabelEncoder to get `classes_` for the column
                le = LabelEncoder()
                le.fit(dframe.loc[:, column].values)
                # append the `classes_` to our ndarray container
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                   dtype=object))
                # append this column's encoder
                self.all_encoders_[idx] = le
        else:
            # no columns specified; assume all are to be encoded
            self.columns = dframe.iloc[:, :].columns
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            # container for the encoders; without it the assignment below fails
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            for idx, column in enumerate(self.columns):
                le = LabelEncoder()
                le.fit(dframe.loc[:, column].values)
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                   dtype=object))
                self.all_encoders_[idx] = le
        return self
    def fit_transform(self, dframe):
        """
        Fit label encoder and return encoded labels.
        Access individual column classes via indexing
        `self.all_classes_`
        Access individual column encoders via indexing
        `self.all_encoders_`
        Access individual column encoded labels via indexing
        `self.all_labels_`
        """
        # if columns are provided, iterate through and get `classes_`
        if self.columns is not None:
            # ndarray to hold LabelEncoder().classes_ for each
            # column; should match the shape of specified `columns`
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            self.all_labels_ = np.ndarray(shape=self.columns.shape,
                                          dtype=object)
            for idx, column in enumerate(self.columns):
                # instantiate LabelEncoder
                le = LabelEncoder()
                # fit and transform labels in the column
                labels = le.fit_transform(dframe.loc[:, column].values)
                dframe.loc[:, column] = labels
                # append the `classes_` to our ndarray container
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                   dtype=object))
                self.all_encoders_[idx] = le
                # keep the encoded labels (not the encoder) in `all_labels_`
                self.all_labels_[idx] = labels
        else:
            # no columns specified; assume all are to be encoded
            self.columns = dframe.iloc[:, :].columns
            self.all_classes_ = np.ndarray(shape=self.columns.shape,
                                           dtype=object)
            # container for the encoders; without it the assignment below fails
            self.all_encoders_ = np.ndarray(shape=self.columns.shape,
                                            dtype=object)
            for idx, column in enumerate(self.columns):
                le = LabelEncoder()
                dframe.loc[:, column] = le.fit_transform(
                    dframe.loc[:, column].values)
                self.all_classes_[idx] = (column,
                                          np.array(le.classes_.tolist(),
                                                   dtype=object))
                self.all_encoders_[idx] = le
        return dframe.loc[:, self.columns].values
    def transform(self, dframe):
        """
        Transform labels to normalized encoding.
        """
        if self.columns is not None:
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[
                    idx].transform(dframe.loc[:, column].values)
        else:
            self.columns = dframe.iloc[:, :].columns
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .transform(dframe.loc[:, column].values)
        return dframe.loc[:, self.columns].values
    def inverse_transform(self, dframe):
        """
        Transform labels back to original encoding.
        """
        if self.columns is not None:
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .inverse_transform(dframe.loc[:, column].values)
        else:
            self.columns = dframe.iloc[:, :].columns
            for idx, column in enumerate(self.columns):
                dframe.loc[:, column] = self.all_encoders_[idx]\
                    .inverse_transform(dframe.loc[:, column].values)
        return dframe.loc[:, self.columns].values
Example:
If df and df_copy are mixed-type pandas dataframes, you can apply the MultiColumnLabelEncoder() to the dtype=object columns in the following way:
Following up on the comments raised on the solution of @PriceHardman I would propose the following version of the class:
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import LabelEncoder
# `pdu` is the author's own validation helper module; it is not shown in this answer.

class LabelEncodingColoumns(BaseEstimator, TransformerMixin):
    def __init__(self, cols=None):
        pdu._is_cols_input_valid(cols)
        self.cols = cols
        self.les = {col: LabelEncoder() for col in cols}
        self._is_fitted = False

    def transform(self, df, **transform_params):
        """
        Label-encoding ``cols`` of ``df`` using the fitted encoders

        Parameters
        ----------
        df : DataFrame
            DataFrame to be preprocessed
        """
        if not self._is_fitted:
            raise NotFittedError("Fitting was not performed")
        pdu._is_cols_subset_of_df_cols(self.cols, df)
        df = df.copy()
        label_enc_dict = {}
        for col in self.cols:
            label_enc_dict[col] = self.les[col].transform(df[col])
        labelenc_cols = pd.DataFrame(
            label_enc_dict,
            # The index of the resulting DataFrame should be assigned and
            # equal to the one of the original DataFrame. Otherwise, upon
            # concatenation NaNs will be introduced.
            index=df.index
        )
        for col in self.cols:
            df[col] = labelenc_cols[col]
        return df

    def fit(self, df, y=None, **fit_params):
        """
        Fitting the preprocessing

        Parameters
        ----------
        df : DataFrame
            Data to use for fitting.
            In many cases, should be ``X_train``.
        """
        pdu._is_cols_subset_of_df_cols(self.cols, df)
        for col in self.cols:
            self.les[col].fit(df[col])
        self._is_fitted = True
        return self
This class fits the encoder on the training set and uses the fitted version when transforming. Initial version of the code can be found here.
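The fit-on-train, transform-on-test behavior can be tried with a simplified variant that drops the pdu validation helpers. ColumnLabelEncoder and the two small frames are hypothetical names and data invented for this sketch:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder


class ColumnLabelEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical simplified variant: one LabelEncoder per listed column."""

    def __init__(self, cols):
        self.cols = cols

    def fit(self, df, y=None):
        # Encoders are learned here and only here.
        self.encoders_ = {col: LabelEncoder().fit(df[col]) for col in self.cols}
        return self

    def transform(self, df):
        df = df.copy()  # leave the caller's frame untouched
        for col in self.cols:
            df[col] = self.encoders_[col].transform(df[col])
        return df


X_train = pd.DataFrame({"pets": ["cat", "dog", "monkey"], "age": [3, 5, 2]})
X_test = pd.DataFrame({"pets": ["monkey", "cat"], "age": [7, 1]})

encoder = ColumnLabelEncoder(cols=["pets"]).fit(X_train)
X_test_encoded = encoder.transform(X_test)  # uses the training-set mapping

print(X_test_encoded)
```

Because the mapping comes from the training data, the same label always receives the same code, whichever frame it appears in.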
A short way to LabelEncoder() multiple columns with a dict():
from sklearn.preprocessing import LabelEncoder
le_dict = {col: LabelEncoder() for col in columns}
for col in columns:
    le_dict[col].fit_transform(df[col])
and you can use this le_dict to label-encode the same columns of any other dataframe:
le_dict[col].transform(df_another[col])
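One caveat worth seeing in code: a fitted LabelEncoder rejects labels it did not see during fit. The sketch below assumes two small made-up frames, where df_another contains an owner that is missing from df:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"pets": ["cat", "dog", "monkey"],
                   "owner": ["Ron", "Champ", "Ron"]})
df_another = pd.DataFrame({"pets": ["dog", "cat"],
                           "owner": ["Champ", "Brick"]})
columns = ["pets", "owner"]

# Fit one encoder per column on the first frame.
le_dict = {col: LabelEncoder().fit(df[col]) for col in columns}

# Labels seen during fit are encoded consistently.
pets_codes = le_dict["pets"].transform(df_another["pets"])
print(pets_codes)

# "Brick" never appeared in df["owner"], so transform raises ValueError.
try:
    le_dict["owner"].transform(df_another["owner"])
    unseen_error = None
except ValueError as err:
    unseen_error = err
    print("ValueError:", err)
```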
Answer 10
It is possible to do all of this directly in pandas, and the task is well suited to a unique ability of the replace method.
First, let's make a dictionary of dictionaries that maps each column and its values to their new replacement values.
transform_dict = {}
for col in df.columns:
    cats = pd.Categorical(df[col]).categories
    d = {}
    for i, cat in enumerate(cats):
        d[cat] = i
    transform_dict[col] = d
transform_dict
{'location': {'New_York': 0, 'San_Diego': 1},
'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
'pets': {'cat': 0, 'dog': 1, 'monkey': 2}}
Since this will always be a one-to-one mapping, we can invert the inner dictionaries to get a mapping of the new values back to the originals.
inverse_transform_dict = {}
for col, d in transform_dict.items():
    inverse_transform_dict[col] = {v: k for k, v in d.items()}
inverse_transform_dict
{'location': {0: 'New_York', 1: 'San_Diego'},
'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}
Now we can use the unique ability of the replace method to take a nested dictionary, using the outer keys as the columns and the inner keys as the values we would like to replace.
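A minimal sketch of that replace call, assuming a small toy DataFrame whose values match the dictionaries printed above (the rows themselves are made up for illustration):

```python
import pandas as pd

# Toy data (assumed); object dtype keeps the behaviour the same across pandas versions.
df = pd.DataFrame(
    {
        "location": ["San_Diego", "New_York", "New_York", "San_Diego"],
        "owner": ["Champ", "Ron", "Brick", "Veronica"],
        "pets": ["cat", "dog", "cat", "monkey"],
    },
    dtype=object,
)

# Same mappings as built step by step above, written as comprehensions.
transform_dict = {
    col: {cat: i for i, cat in enumerate(pd.Categorical(df[col]).categories)}
    for col in df.columns
}
inverse_transform_dict = {
    col: {v: k for k, v in d.items()} for col, d in transform_dict.items()
}

# Outer keys select the columns, inner keys are the values to replace.
encoded = df.replace(transform_dict)
decoded = encoded.replace(inverse_transform_dict)
```

Depending on the pandas version, the encoded columns may come back as object dtype rather than integers, so an explicit astype(int) may be needed afterwards.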
A very rough idea:
First, identify which columns need a LabelEncoder, then loop through each of those columns.
def cat_var(df):
    """Identify categorical features.

    Parameters
    ----------
    df: original df after missing operations

    Returns
    -------
    cat_var_df: summary df with col index and col name for all categorical vars
    """
    col_type = df.dtypes
    col_names = list(df)
    cat_var_index = [i for i, x in enumerate(col_type) if x == 'object']
    cat_var_name = [x for i, x in enumerate(col_names) if i in cat_var_index]
    cat_var_df = pd.DataFrame({'cat_ind': cat_var_index,
                               'cat_name': cat_var_name})
    return cat_var_df

from sklearn.preprocessing import LabelEncoder

def column_encoder(df, cat_var_list):
    """Encoding categorical feature in the dataframe

    Parameters
    ----------
    df: input dataframe
    cat_var_list: categorical feature index and name, from cat_var function

    Return
    ------
    df: new dataframe where categorical features are encoded
    label_list: classes_ attribute for all encoded features
    """
    label_list = []
    cat_var_df = cat_var(df)
    cat_list = cat_var_df.loc[:, 'cat_name']
    for index, cat_feature in enumerate(cat_list):
        le = LabelEncoder()
        le.fit(df.loc[:, cat_feature])
        label_list.append(list(le.classes_))
        df.loc[:, cat_feature] = le.transform(df.loc[:, cat_feature])
    return df, label_list
The returned df is the encoded DataFrame, and label_list shows what the encoded values mean in each corresponding column.
This is a snippet from a data-processing script I wrote for work. Let me know if you think it could be improved further.
EDIT:
Note that the methods above work best on a DataFrame with no missing values. I am not sure how they behave on a DataFrame that contains missing data (I ran a missing-value handling step before executing the methods above).
Answer 13
Label encoding a single column and inverting the transform is easy. Here is how to do it in Python when there are multiple columns:
from sklearn import preprocessing

def stringtocategory(dataset):
    '''
    @author puja.sharma
    @see The function label encodes the object type columns and gives label encoded and inverse transform of the label encoded data
    @param dataset dataframe on whose columns the label encoding has to be done
    @return label encoded and inverse transform of the label encoded data.
    '''
    # Work on real copies so the input DataFrame is not modified
    data_original = dataset.copy()
    data_tranformed = dataset.copy()
    for y in dataset.columns:
        # check whether the column has object dtype (contains strings or chars)
        if dataset[y].dtype == object:
            print("The string type features are : " + y)
            le = preprocessing.LabelEncoder()
            le.fit(dataset[y].unique())
            # label encoded data
            data_tranformed[y] = le.transform(dataset[y])
            # inverse label transform data
            data_original[y] = le.inverse_transform(data_tranformed[y])
    return data_tranformed, data_original
If your DataFrame contains both numerical and categorical data, you can use the following. Here X is my DataFrame, which has both categorical and numerical variables:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for i in range(0, X.shape[1]):
    # .iloc makes the positional lookup explicit
    if X.dtypes.iloc[i] == 'object':
        X[X.columns[i]] = le.fit_transform(X[X.columns[i]])
Note: This technique is good if you are not interested in converting them back.
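The reason the values cannot be converted back is that the single encoder is refitted on every column, so only the classes of the last encoded column survive. A small sketch with assumed toy data illustrates this:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (assumed for illustration); the string columns use object dtype.
X = pd.DataFrame(
    {
        "pets": pd.Series(["cat", "dog"], dtype=object),
        "owner": pd.Series(["Ron", "Champ"], dtype=object),
        "age": [3, 5],
    }
)

le = LabelEncoder()
for col in X.columns:
    if X[col].dtype == object:
        # Each call refits the same encoder and overwrites le.classes_.
        X[col] = le.fit_transform(X[col])

# Only the classes of the last encoded column ("owner") are left.
remaining_classes = list(le.classes_)
```

Keeping one encoder per column, for example in a dict keyed by column name, would preserve every mapping.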
TL;DR: You can use the FlattenForEach wrapper class to simply transform your df like this: FlattenForEach(LabelEncoder(), then_unflatten=True).fit_transform(df).
With this method, your label encoder will be able to fit and transform within a regular scikit-learn Pipeline. Let’s simply import:
from sklearn.preprocessing import LabelEncoder
from neuraxle.steps.column_transformer import ColumnTransformer
from neuraxle.steps.loop import FlattenForEach
Same shared encoder for columns:
Here is how one shared LabelEncoder will be applied on all the data to encode it:
p = FlattenForEach(LabelEncoder(), then_unflatten=True)
And here is how a first standalone LabelEncoder is applied to the pets column while a second one is shared by the owner and location columns. To be precise, this is a mix of separate and shared label encoders:
p = ColumnTransformer([
    # A different encoder will be used for column 0 with name "pets":
    (0, FlattenForEach(LabelEncoder(), then_unflatten=True)),
    # A shared encoder will be used for column 1 and 2, "owner" and "location":
    ([1, 2], FlattenForEach(LabelEncoder(), then_unflatten=True)),
], n_dimension=2)
I mainly used @Alexander's answer but had to make some changes:
cols_need_mapped = ['col1', 'col2']
mapper = {col: {cat: n for n, cat in enumerate(df[col].astype('category').cat.categories)}
          for col in df[cols_need_mapped]}
for c in cols_need_mapped:
    df[c] = df[c].map(mapper[c])
Then, to reuse the mapping in the future, you can save it to a JSON document, read it back in when you need it, and apply it with the .map() function as I did above.
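A minimal sketch of that save-and-reload round trip using the standard json module. The file name label_mapper.json, the temporary directory, and the toy frames are assumptions made for illustration.

```python
import json
import os
import tempfile

import pandas as pd

# Toy data (assumed for illustration).
df = pd.DataFrame({"col1": ["a", "b", "a"], "col2": ["x", "y", "y"]})
cols_need_mapped = ["col1", "col2"]

mapper = {
    col: {cat: n for n, cat in enumerate(df[col].astype("category").cat.categories)}
    for col in cols_need_mapped
}

# Save the mapping; the path is a hypothetical location in the temp directory.
path = os.path.join(tempfile.gettempdir(), "label_mapper.json")
with open(path, "w") as f:
    json.dump(mapper, f)

# Later: read the mapping back and apply it to new data with .map().
with open(path) as f:
    loaded_mapper = json.load(f)

new_df = pd.DataFrame({"col1": ["b", "a"], "col2": ["y", "x"]})
for c in cols_need_mapped:
    new_df[c] = new_df[c].map(loaded_mapper[c])
```

JSON object keys are always strings, so this round trip is lossless only when the categories themselves are strings; numeric categories would need to be converted back after loading.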
The problem is the shape of the data (a pandas DataFrame) that you are passing to the fit function.
You have to pass a 1-D array-like, such as a single column.
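To illustrate the shape requirement with assumed toy data: passing a two-column DataFrame to LabelEncoder.fit raises a ValueError, while encoding one column at a time works.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (assumed for illustration).
df = pd.DataFrame({"pets": ["cat", "dog", "cat"], "owner": ["Ron", "Champ", "Ron"]})

# A 2-D input with more than one column is rejected.
try:
    LabelEncoder().fit(df)
    raised = False
except ValueError:
    raised = True

# Passing each column (a 1-D Series) separately works.
encoded = pd.DataFrame(
    {col: LabelEncoder().fit_transform(df[col]) for col in df.columns},
    index=df.index,
)
```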
Answer 18
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.read_csv('.../train.csv')
# X = train.loc[:, ['waterpoint_type_group', 'status', 'waterpoint_type', 'source_class']].values

# Create a label encoder object
def MultiLabelEncoder(columnlist, dataframe):
    for i in columnlist:
        labelencoder_X = LabelEncoder()
        dataframe[i] = labelencoder_X.fit_transform(dataframe[i])

columnlist = ['waterpoint_type_group', 'status', 'waterpoint_type', 'source_class', 'source_type']
MultiLabelEncoder(columnlist, train)
Here I read a CSV file from disk, and I pass the function the list of columns I want to label-encode along with the DataFrame to apply the encoding to.
Answer 19
How about this?
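from sklearn.preprocessing import LabelEncoder

def MultiColumnLabelEncode(choice, columns, X, encoders=None):
    # X is a 2-D NumPy array and columns holds the column indices to process.
    # The fitted encoders are returned by 'encode' and must be passed back in
    # for 'decode'; otherwise the original labels cannot be recovered.
    if choice == 'encode':
        encoders = [LabelEncoder() for _ in columns]
        for le, cols in zip(encoders, columns):
            X[:, cols] = le.fit_transform(X[:, cols])
    elif choice == 'decode':
        for le, cols in zip(encoders, columns):
            # Cast to int because an object array cannot index classes_
            X[:, cols] = le.inverse_transform(X[:, cols].astype(int))
    else:
        print('Please select correct parameter "choice". Available parameters: encode/decode')
    return encoders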
It is not the most efficient approach, but it is very simple.