I have a machine learning classification problem in which 80% of the variables are categorical. Must I use one-hot encoding if I want to use a classifier? Can I pass the data to a classifier without encoding it?
I am trying to do the following for feature selection:
I change the type of the categorical features to ‘category’:
non_categorial_features = ['orig_destination_distance',
                           'srch_adults_cnt',
                           'srch_children_cnt',
                           'srch_rm_cnt',
                           'cnt']
for categorical_feature in list(train_small.columns):
    if categorical_feature not in non_categorial_features:
        train_small[categorical_feature] = train_small[categorical_feature].astype('category')
The problem is that the third part often gets stuck, even though I am using a powerful machine.
Thus, without one-hot encoding I can’t do any feature selection to determine the importance of the features.
What do you recommend?
Answer 0
Approach 1: You can use get_dummies on a pandas DataFrame.
Example 1:
import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
Out[]:
a b c
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
Example 2:
The following transforms a given column into one-hot columns. Use the prefix argument to tell the dummy columns apart when you encode several columns.
import pandas as pd
df = pd.DataFrame({
'A':['a','b','a'],
'B':['b','a','c']
})
df
Out[]:
A B
0 a b
1 b a
2 a c
# Get one hot encoding of columns B
one_hot = pd.get_dummies(df['B'])
# Drop column B as it is now encoded
df = df.drop('B',axis = 1)
# Join the encoded df
df = df.join(one_hot)
df
Out[]:
A a b c
0 a 0 1 0
1 b 1 0 0
2 a 0 0 1
Approach 2: Use Scikit-learn
Using a OneHotEncoder has the advantage of being able to fit on some training data and then transform on some other data using the same instance. We also have handle_unknown to further control what the encoder does with unseen data.
Given a dataset with three features and four samples, we let the encoder find the distinct values of each feature and transform the data to a binary one-hot encoding.
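As a minimal sketch of the fit-then-transform workflow described above: the arrays X_train and X_new are made-up illustration data, not taken from the original answer. The encoder is fitted on four samples with three features and then applied to new rows, one of which contains a value it has not seen. With handle_unknown="ignore", the unseen value is encoded as all zeros for that feature instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: four samples, three integer-coded categorical features.
X_train = np.array([[0, 0, 3],
                    [1, 1, 0],
                    [0, 2, 1],
                    [1, 0, 2]])

# Fit once on the training data; unseen categories will be ignored at transform time.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_train)

# New data; the value 5 in the second feature was never seen during fit.
X_new = np.array([[0, 1, 1],
                  [1, 5, 3]])

# transform returns a sparse matrix by default, so convert it for display.
encoded = enc.transform(X_new).toarray()
print(enc.categories_)
print(encoded)
```

The three features have 2, 3 and 4 distinct training values, so each encoded row has 9 columns.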
It is much easier to use pandas for basic one-hot encoding. If you’re looking for more options you can use scikit-learn.
For basic one-hot encoding with Pandas you pass your data frame into the get_dummies function.
For example, if I have a dataframe called imdb_movies:
…and I want to one-hot encode the Rated column, I do this:
pd.get_dummies(imdb_movies.Rated)
This returns a new dataframe with a column for every “level” of rating that exists, along with either a 1 or 0 specifying the presence of that rating for a given observation.
Usually, we want this to be part of the original dataframe. In this case, we attach our new dummy-coded frame onto the original frame using “column-binding”.
We can column-bind by using the pandas concat function:
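The original snippet for this step is not preserved here, so the following is only a sketch of what column-binding with concat could look like. The contents of imdb_movies (the Title and Rated values) are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the imdb_movies dataframe mentioned above.
imdb_movies = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Rated": ["PG", "R", "PG"],
})

# One indicator column per rating level.
dummies = pd.get_dummies(imdb_movies["Rated"])

# axis=1 binds the two frames side by side (column-binding), aligned on the index.
imdb_movies_encoded = pd.concat([imdb_movies, dummies], axis=1)
print(imdb_movies_encoded)
```

The original Rated column is kept; drop it afterwards if the model should only see the indicator columns.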
You can do it with numpy.eye and the array element selection mechanism:
import numpy as np
nb_classes = 6
data = [[2, 3, 4, 0]]
def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]
The return value of indices_to_one_hot(data, nb_classes) is now:
array([[0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0.]])
Lastly, is it necessary for you to one-hot encode? One-hot encoding adds a column for every level of every categorical feature, which can drastically increase the number of features and the run time of any classifier or anything else you are going to run, especially when each categorical feature has many levels. Instead you can do what I call dummy coding here: replacing each level with an integer code (often called label or ordinal encoding).
Using dummy encoding usually works well, for much less run time and complexity. A wise prof once told me, ‘Less is More’.
Here’s the code for my custom encoding function if you want.
from sklearn.preprocessing import LabelEncoder
# Auto-encodes any dataframe column of type category or object.
def dummyEncode(df):
    columnsToEncode = list(df.select_dtypes(include=['category','object']))
    le = LabelEncoder()
    for feature in columnsToEncode:
        try:
            df[feature] = le.fit_transform(df[feature])
        except Exception:
            print('Error encoding '+feature)
    return df
EDIT: Comparison to be clearer:
One-hot encoding: convert n levels to n-1 columns (dog is the all-zeros reference level here; keeping a column for every level gives n columns).
Index  Animal         Index  cat  mouse
1      dog            1      0    0
2      cat     -->    2      1    0
3      mouse          3      0    1
You can see how this will explode your memory if you have many different types (or levels) in your categorical feature. Keep in mind, this is just ONE column.
Dummy Coding:
Index  Animal         Index  Animal
1      dog            1      0
2      cat     -->    2      1
3      mouse          3      2
Convert to numerical representations instead. This greatly saves feature space, at the cost of a bit of accuracy, since the integer codes imply an ordering between levels that may not exist.
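To make the column counts in the comparison above concrete, here is a small sketch using pandas only. The animals series mirrors the table; note that pandas sorts the levels alphabetically, so the integer codes it assigns (cat=0, dog=1, mouse=2) differ from the order-of-appearance codes shown in the table, and drop_first drops cat rather than dog.

```python
import pandas as pd

animals = pd.Series(["dog", "cat", "mouse"], name="Animal")

# Full one-hot encoding: one column per level (n columns).
full = pd.get_dummies(animals)

# Dropping the first (alphabetical) level leaves n-1 columns.
reduced = pd.get_dummies(animals, drop_first=True)

# Integer codes: a single column, whatever the number of levels.
codes = animals.astype("category").cat.codes

print(full.shape[1], reduced.shape[1])
print(codes.tolist())
```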
Answer 4
One-hot encoding with pandas is very easy:
import pandas as pd
def one_hot(df, cols):
    """
    @param df pandas DataFrame
    @param cols a list of columns to encode
    @return a DataFrame with one-hot encoding
    """
    for each in cols:
        dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
        df = pd.concat([df, dummies], axis=1)
    return df
EDIT:
Another way to one-hot encode, using sklearn’s LabelBinarizer:
from sklearn.preprocessing import LabelBinarizer
label_binarizer = LabelBinarizer()
label_binarizer.fit(all_your_labels_list) # need to be global or remembered to use it later
def one_hot_encode(x):
    """
    One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample Labels
    : return: Numpy array of one-hot encoded labels
    """
    return label_binarizer.transform(x)
Answer 5
You can use the numpy.eye function.
import numpy as np
def one_hot_encode(x, n_classes):
    """
    One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample Labels
    : return: Numpy array of one-hot encoded labels
    """
    return np.eye(n_classes)[x]
def main():
    # named labels rather than list, to avoid shadowing the built-in
    labels = [0,1,2,3,4,3,2,1,0]
    n_classes = 5
    one_hot_list = one_hot_encode(labels, n_classes)
    print(one_hot_list)
if __name__ == "__main__":
    main()
One-hot encoding requires a bit more than converting the values to indicator variables. A typical ML process requires you to apply this coding several times, to validation or test data sets, and to apply the model you construct to real-time observed data. You should store the mapping (transform) that was used to construct the model. A good solution would use the DictVectorizer or LabelEncoder (followed by get_dummies). Here is a function that you can use:
This works on a pandas dataframe and for each column of the dataframe it creates and returns a mapping back. So you would call it like this:
train_data, le_dict = oneHotEncode2(train_data)
Then on the test data, the call is made by passing the dictionary returned back from training:
test_data, _ = oneHotEncode2(test_data, le_dict)
An equivalent method is to use DictVectorizer. A related post on this is on my blog; I mention it here since it provides some reasoning behind this approach over simply using get_dummies (disclosure: this is my own blog).
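The function referred to in this answer is not preserved here. As a rough sketch of the same idea (store the mapping learned on the training data, then reuse it on the test data), the two helpers below are hypothetical names of my own, fit_categories and apply_one_hot, and use only pandas. They are not the author's oneHotEncode2.

```python
import pandas as pd

def fit_categories(df, cols):
    """Record the levels seen in the training data for each column (hypothetical helper)."""
    return {col: sorted(df[col].dropna().unique()) for col in cols}

def apply_one_hot(df, categories):
    """One-hot encode df using a stored mapping, so every dataset gets the same columns."""
    out = df.copy()
    for col, levels in categories.items():
        # Values outside the stored levels become NaN and encode as all zeros.
        out[col] = pd.Categorical(out[col], categories=levels)
    return pd.get_dummies(out, columns=list(categories))

# Made-up example data.
train = pd.DataFrame({"color": ["red", "blue", "red"], "n": [1, 2, 3]})
test = pd.DataFrame({"color": ["blue", "green"], "n": [4, 5]})

mapping = fit_categories(train, ["color"])
train_enc = apply_one_hot(train, mapping)
test_enc = apply_one_hot(test, mapping)
print(test_enc)
```

Because the levels are fixed at fit time, the test frame gets a color_red column even though red never occurs in it, and the unseen level green does not create a new column.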
You can pass the data to the CatBoost classifier without encoding. CatBoost handles categorical variables itself by performing one-hot encoding and expanding-mean target encoding.
Answer 10
You can also do the following. Note that you don’t have to use pd.concat.
import pandas as pd
# initialise data of lists.
data = {'Color': ['Red','Yellow','Red','Yellow'], 'Length': [20.1,21.1,19.1,18.1], 'Group': [1,2,1,2]}
# Create DataFrame
df = pd.DataFrame(data)
for _c in df.select_dtypes(include=['object']).columns:
    print(_c)
    df[_c] = pd.Categorical(df[_c])
df_transformed = pd.get_dummies(df)
df_transformed
You can also change explicit columns to categorical. For example, here I am changing Color and Group:
import pandas as pd
# initialise data of lists.
data = {'Color': ['Red','Yellow','Red','Yellow'], 'Length': [20.1,21.1,19.1,18.1], 'Group': [1,2,1,2]}
# Create DataFrame
df = pd.DataFrame(data)
columns_to_change = list(df.select_dtypes(include=['object']).columns)
columns_to_change.append('Group')
for _c in columns_to_change:
    print(_c)
    df[_c] = pd.Categorical(df[_c])
df_transformed = pd.get_dummies(df)
df_transformed
class OneHotEncoder:
    def __init__(self,optionKeys):
        length=len(optionKeys)
        self.__dict__={optionKeys[j]:[0 if i!=j else 1 for i in range(length)] for j in range(length)}
import numpy as np
def one_hot_encode(y):
    """Convert an iterable of indices to one-hot encoded labels."""
    # Sometimes a non-flattened vector is passed, e.g. shape (118, 1); in these cases
    # the function ends up creating a tensor, e.g. (118, 2, 1). flatten removes this issue.
    y = y.flatten()
    nb_classes = len(np.unique(y))  # get the number of unique classes
    # Get the class labels as a dictionary, which then is standardised. E.g. imagine the class
    # labels are (4, 7, 9): if a vector y containing 4, 7 and 9 is passed directly, then
    # np.eye(nb_classes)[4] (or 7, 9) throws an out-of-index error. standardised_labels fixes
    # this by returning a dictionary, standardised_labels = {4: 0, 7: 1, 9: 2}, whose values
    # replace the matching keys in the y array. It also removes the error that is raised if
    # the labels are floats, e.g. 1.0, since an element cannot be looked up by a float index.
    standardised_labels = dict(zip(np.unique(y), np.arange(nb_classes)))
    targets = np.vectorize(standardised_labels.get)(y)  # map the dictionary values to the array
    return np.eye(nb_classes)[targets]
Here is a function to do one-hot-encoding without using numpy, pandas, or other packages. It takes a list of integers, booleans, or strings (and perhaps other types too).
import typing
def one_hot_encode(items: list) -> typing.List[list]:
    results = []
    # find the unique items (we want unique items b/c duplicate items will have the same encoding)
    unique_items = list(set(items))
    # sort the unique items
    sorted_items = sorted(unique_items)
    # find how long the list of each item should be
    max_index = len(unique_items)
    for item in items:
        # create a list of zeros the appropriate length
        one_hot_encoded_result = [0 for i in range(0, max_index)]
        # find the index of the item
        one_hot_index = sorted_items.index(item)
        # change the zero at the index from the previous line to a one
        one_hot_encoded_result[one_hot_index] = 1
        # add the result
        results.append(one_hot_encoded_result)
    return results
I know there are already a lot of answers to this question, but I noticed two things. First, most of the answers use packages like numpy and/or pandas. And this is a good thing. If you are writing production code, you should probably be using robust, fast algorithms like those provided in the numpy/pandas packages. But, for the sake of education, I think someone should provide an answer which has a transparent algorithm and not just an implementation of someone else’s algorithm. Second, I noticed that many of the answers do not provide a robust implementation of one-hot encoding because they do not meet one of the requirements below. Below are some of the requirements (as I see them) for a useful, accurate, and robust one-hot encoding function:
A one-hot encoding function must:
handle lists of various types (e.g. integers, strings, floats, etc.) as input
handle an input list with duplicates
return a list of lists corresponding to the inputs, in the same order
return a list of lists where each list is as short as possible
I tested many of the answers to this question and most of them fail on one of the requirements above.
Answer 18
Try this:
!pip install category_encoders
import category_encoders as ce
categorical_columns = [...the list of names of the columns you want to one-hot-encode ...]
encoder = ce.OneHotEncoder(cols=categorical_columns, use_cat_names=True)
df_train_encoded = encoder.fit_transform(df_train_small)
df_train_encoded.head()
The resulting dataframe df_train_encoded is the same as the original, but the categorical features are now replaced with their one-hot-encoded versions.
Here num_classes stands for the number of classes you have. So if you have a vector with shape (10000,), this function transforms it to shape (10000, C), where C is num_classes. Note that a (the label vector) is zero-indexed, i.e. one_hot(np.array([0, 1]), 2) will give [[1, 0], [0, 1]].
numpy.eye(number of classes)[vector containing the labels]
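A short sketch of the one-liner above, with made-up labels, to show the shape change from (N,) to (N, C). The names labels and num_classes are placeholders chosen for this example.

```python
import numpy as np

labels = np.array([0, 2, 1, 2])   # zero-indexed class labels, shape (4,)
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing the identity with the label vector picks one row per label.
one_hot = np.eye(num_classes)[labels]

print(one_hot.shape)
print(one_hot)
```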
Answer 6
Here is a function that converts a 1-D vector to a 2-D one-hot array.
#!/usr/bin/env python
import numpy as np
def convertToOneHot(vector, num_classes=None):
    """
    Converts an input 1-D vector of integers into an output
    2-D array of one-hot vectors, where an i'th input value
    of j will set a '1' in the i'th row, j'th column of the
    output array.
    Example:
        v = np.array((1, 0, 4))
        one_hot_v = convertToOneHot(v)
        print(one_hot_v)
        [[0 1 0 0 0]
         [1 0 0 0 0]
         [0 0 0 0 1]]
    """
    assert isinstance(vector, np.ndarray)
    assert len(vector) > 0
    if num_classes is None:
        num_classes = np.max(vector) + 1
    else:
        assert num_classes > 0
        # strictly greater: the largest label is used as a column index
        assert num_classes > np.max(vector)
    result = np.zeros(shape=(len(vector), num_classes))
    result[np.arange(len(vector)), vector] = 1
    return result.astype(int)
Here is some example usage:
>>> a = np.array([1, 0, 3])
>>> convertToOneHot(a)
array([[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1]])
>>> convertToOneHot(a, num_classes=10)
array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]])
I think the short answer is no. For a more generic case in n dimensions, I came up with this:
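import numpy as np
# For 2-dimensional data, 4 values
a = np.array([[0, 1, 2], [3, 2, 1]])
z = np.zeros(list(a.shape) + [4])
# Index with a tuple; recent NumPy versions no longer accept a list of index arrays here
z[tuple(np.indices(z.shape[:-1])) + (a,)] = 1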
I am wondering if there is a better solution — I don’t like that I have to create those lists in the last two lines. Anyway, I did some measurements with timeit and it seems that the numpy-based (indices/arange) and the iterative versions perform about the same.
def onehottify(x, n=None, dtype=float):
    """1-hot encode x with the max value n (computed from data if n is None)."""
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    return np.eye(n, dtype=dtype)[x]
Also, here is a quick-and-dirty benchmark of this method and a method from the currently accepted answer by YXD (slightly changed, so that they offer the same API except that the latter works only with 1D ndarrays):
def onehottify_only_1d(x, n=None, dtype=float):
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    b = np.zeros((len(x), n), dtype=dtype)
    b[np.arange(len(x)), x] = 1
    return b
The latter method is ~35% faster (MacBook Pro 13 2015), but the former is more general:
>>> import numpy as np
>>> np.random.seed(42)
>>> a = np.random.randint(0, 9, size=(10_000,))
>>> a
array([6, 3, 7, ..., 5, 8, 6])
>>> %timeit onehottify(a, 10)
188 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit onehottify_only_1d(a, 10)
139 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I recently ran into the same kind of problem and found the solutions above, which turned out to be satisfying only if your numbers follow a certain pattern. For example, if you want to one-hot encode the following list:
all_good_list = [0,1,2,3,4]
go ahead, the solutions posted above already cover it. But consider this data:
problematic_list = [0,23,12,89,10]
If you encode it with the methods mentioned above, you will likely end up with 90 one-hot columns. This is because those answers include something like n = np.max(a)+1. I found a more generic solution that worked for me and wanted to share it with you:
import numpy as np
from sklearn.preprocessing import LabelBinarizer
sklb = LabelBinarizer()
a = np.asarray([1,2,44,3,2])
n = np.unique(a)
sklb.fit(n)
b = sklb.transform(a)
I hope this comes in handy for someone who has run into the same restrictions with the solutions above.
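For comparison, the same effect can be sketched with NumPy alone. This is a different technique from the LabelBinarizer answer, not part of it: np.unique with return_inverse=True maps each label to its position among the sorted unique labels, so the number of columns equals the number of distinct labels rather than the largest label plus one.

```python
import numpy as np

a = np.array([0, 23, 12, 89, 10])   # the "problematic" labels from above

# classes: sorted unique labels; inverse: index of each element of a within classes.
classes, inverse = np.unique(a, return_inverse=True)

# One column per distinct label (5 here, not 90).
one_hot = np.eye(len(classes), dtype=int)[inverse]

print(classes)
print(one_hot)

# The original labels can be recovered from the column positions.
recovered = classes[one_hot.argmax(axis=1)]
```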
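Answer 12
This type of encoding is usually applied to numpy arrays. If you are using a numpy array like this:
a = np.array([1, 0, 3])
then there is a very simple way to convert it to one-hot encoding:
out = (np.arange(4) == a[:, None]).astype(np.float32)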
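The one-liner relies on broadcasting. A runnable sketch of it follows; n_classes is a name introduced here for the literal 4 used above.

```python
import numpy as np

a = np.array([1, 0, 3])
n_classes = 4

# a[:, None] has shape (3, 1) and np.arange(n_classes) has shape (4,),
# so the comparison broadcasts to a (3, 4) boolean array that is True
# exactly where the column index equals the label.
mask = np.arange(n_classes) == a[:, None]
out = mask.astype(np.float32)

print(out)
```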
Here’s a dimensionality-independent standalone solution.
This will convert any N-dimensional array arr of nonnegative integers to a one-hot N+1-dimensional array one_hot, where one_hot[i_1,...,i_N,c] = 1 means arr[i_1,...,i_N] = c. You can recover the input via np.argmax(one_hot, -1)
import numpy as np
def expand_integer_grid(arr, n_classes):
    """
    :param arr: N dim array of size i_1, ..., i_N
    :param n_classes: C
    :returns: one-hot N+1 dim array of size i_1, ..., i_N, C
    :rtype: ndarray
    """
    one_hot = np.zeros(arr.shape + (n_classes,))
    axes_ranges = [range(arr.shape[i]) for i in range(arr.ndim)]
    flat_grids = [_.ravel() for _ in np.meshgrid(*axes_ranges, indexing='ij')]
    # index with a tuple; recent NumPy versions no longer accept a list of index arrays here
    one_hot[tuple(flat_grids + [arr.ravel()])] = 1
    assert((one_hot.sum(-1) == 1).all())
    assert(np.allclose(np.argmax(one_hot, -1), arr))
    return one_hot
Answer 18
Use the following code. It works best.
import numpy as np
def one_hot_encode(x):
    """
    argument
    - x: a list of labels
    return
    - one hot encoding matrix (number of labels, number of class)
    """
    # the number of classes is hard-coded to 10 here
    encoded = np.zeros((len(x), 10))
    for idx, val in enumerate(x):
        encoded[idx][val] = 1
    return encoded
Found it here. P.S. You don’t need to follow the link.