Question: Remap values in a pandas column using a dictionary
I have a dictionary which looks like this: di = {1: "A", 2: "B"}
I would like to apply it to the “col1” column of a dataframe similar to:
col1 col2
0 w a
1 1 2
2 2 NaN
to get:
col1 col2
0 w a
1 A 2
2 B NaN
How can I best do this? For some reason googling terms relating to this only shows me links about how to make columns from dicts and vice-versa :-/
Answer 0
You can use .replace. For example:
>>> df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}})
>>> di = {1: "A", 2: "B"}
>>> df
col1 col2
0 w a
1 1 2
2 2 NaN
>>> df.replace({"col1": di})
col1 col2
0 w a
1 A 2
2 B NaN
or directly on the Series, i.e. df["col1"].replace(di, inplace=True).
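If you prefer not to rely on inplace=True (replacing on a selected column in place may not propagate back to the dataframe in newer pandas versions with copy-on-write), assigning the result back works the same way. A minimal sketch on the question's data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': ['w', 1, 2], 'col2': ['a', 2, np.nan]})
di = {1: "A", 2: "B"}

# Equivalent to df.replace({"col1": di}), written as an explicit assignment
# so that only col1 is remapped and the result is stored back in the dataframe.
df["col1"] = df["col1"].replace(di)
print(df)
#   col1 col2
# 0    w    a
# 1    A    2
# 2    B  NaN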
Answer 1
map can be much faster than replace
If your dictionary has more than a couple of keys, using map can be much faster than replace. There are two versions of this approach, depending on whether your dictionary exhaustively maps all possible values (and also whether you want non-matches to keep their values or be converted to NaNs):
Exhaustive Mapping
In this case, the form is very simple:
df['col1'].map(di) # note: if the dictionary does not exhaustively map all
# entries then non-matched entries are changed to NaNs
Although map most commonly takes a function as its argument, it can alternatively take a dictionary or Series: see the documentation for pandas.Series.map.
Non-Exhaustive Mapping
If you have a non-exhaustive mapping and wish to retain the existing variables for non-matches, you can add fillna:
df['col1'].map(di).fillna(df['col1'])
as in @jpp’s answer here: Replace values in a pandas series via dictionary efficiently
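For illustration, a small sketch contrasting the two forms on the question's data (the plain map turns the unmatched 'w' into NaN, while the fillna form keeps it):

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': ['w', 1, 2], 'col2': ['a', 2, np.nan]})
di = {1: "A", 2: "B"}

# Exhaustive form: 'w' has no key in di, so it becomes NaN
print(df['col1'].map(di))                      # 0: NaN, 1: A, 2: B

# Non-exhaustive form: non-matches keep their original value
print(df['col1'].map(di).fillna(df['col1']))   # 0: w,   1: A, 2: B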
Benchmarks
Using the following data with pandas version 0.23.1:
di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H" }
df = pd.DataFrame({ 'col1': np.random.choice( range(1,9), 100000 ) })
and testing with %timeit, it appears that map is approximately 10x faster than replace.
Note that your speedup with map will vary with your data. The largest speedup appears to be with large dictionaries and exhaustive replaces. See @jpp's answer (linked above) for more extensive benchmarks and discussion.
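A sketch of how such a timing comparison could be reproduced with the standard timeit module (exact numbers depend on your pandas version and hardware; the ~10x figure above is from the answer's original run on 0.23.1):

import timeit

import numpy as np
import pandas as pd

di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H"}
df = pd.DataFrame({'col1': np.random.choice(range(1, 9), 100000)})

# Time both approaches over 100 runs each.
t_map = timeit.timeit(lambda: df['col1'].map(di), number=100)
t_replace = timeit.timeit(lambda: df['col1'].replace(di), number=100)
print("map:     %.3fs" % t_map)
print("replace: %.3fs" % t_replace)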
Answer 2
There is a bit of ambiguity in your question. There are at least three interpretations:
- the keys in di refer to index values
- the keys in di refer to df['col1'] values
- the keys in di refer to index locations (not the OP's question, but thrown in for fun)
Below is a solution for each case.
Case 1:
If the keys of di are meant to refer to index values, then you could use the update method:
df['col1'].update(pd.Series(di))
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['w', 10, 20],
'col2': ['a', 30, np.nan]},
index=[1,2,0])
# col1 col2
# 1 w a
# 2 10 30
# 0 20 NaN
di = {0: "A", 2: "B"}
# The value at the 0-index is mapped to 'A', the value at the 2-index is mapped to 'B'
df['col1'].update(pd.Series(di))
print(df)
yields
col1 col2
1 w a
2 B 30
0 A NaN
I've modified the values from your original post so it is clearer what update is doing. Note how the keys in di are associated with index values. The order of the index values (that is, the index locations) does not matter.
Case 2:
If the keys in di refer to df['col1'] values, then @DanAllan and @DSM show how to achieve this with replace:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['w', 10, 20],
'col2': ['a', 30, np.nan]},
index=[1,2,0])
print(df)
# col1 col2
# 1 w a
# 2 10 30
# 0 20 NaN
di = {10: "A", 20: "B"}
# The values 10 and 20 are replaced by 'A' and 'B'
df['col1'].replace(di, inplace=True)
print(df)
yields
col1 col2
1 w a
2 A 30
0 B NaN
Note how in this case the keys in di were changed to match values in df['col1'].
Case 3:
If the keys in di refer to index locations, then you could use
df['col1'].put(di.keys(), di.values())
since
df = pd.DataFrame({'col1':['w', 10, 20],
'col2': ['a', 30, np.nan]},
index=[1,2,0])
di = {0: "A", 2: "B"}
# The values at the 0 and 2 index locations are replaced by 'A' and 'B'
df['col1'].put(di.keys(), di.values())
print(df)
yields
col1 col2
1 A a
2 10 30
0 B NaN
Here, the first and third rows were altered, because the keys in di are 0 and 2, which with Python's 0-based indexing refer to the first and third locations.
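Note that Series.put has since been deprecated and removed in newer pandas releases; a hedged equivalent for the same positional update uses .iloc on the dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': ['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1, 2, 0])
di = {0: "A", 2: "B"}

# Positional (0-based) assignment: rows at locations 0 and 2 of col1 get 'A' and 'B'.
df.iloc[list(di.keys()), df.columns.get_loc('col1')] = list(di.values())
print(df)
#   col1 col2
# 1    A    a
# 2   10   30
# 0    B  NaN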
Answer 3
Adding to this question, if you ever have more than one column to remap in a dataframe:
def remap(data, dict_labels):
    """
    This function takes in a dictionary of labels, dict_labels,
    and replaces the (previously label-encoded) values with their strings.

    ex: dict_labels = {'col1': {1: 'A', 2: 'B'}}
    """
    for field, values in dict_labels.items():
        print("I am remapping %s" % field)
        data.replace({field: values}, inplace=True)
    print("DONE")

    return data
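A hypothetical usage sketch, assuming the dataframe from the question (the nested dictionary maps each column name to its own value mapping):

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': ['w', 1, 2], 'col2': ['a', 2, np.nan]})
dict_labels = {'col1': {1: 'A', 2: 'B'}}   # one inner dict per column to remap

df = remap(df, dict_labels)
print(df)
#   col1 col2
# 0    w    a
# 1    A    2
# 2    B  NaN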
Hope it can be useful to someone.
Cheers
Answer 4
DSM has the accepted answer, but the coding doesn’t seem to work for everyone. Here is one that works with the current version of pandas (0.23.4 as of 8/2018):
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 2, 3, 1],
'col2': ['negative', 'positive', 'neutral', 'neutral', 'positive']})
conversion_dict = {'negative': -1, 'neutral': 0, 'positive': 1}
df['converted_column'] = df['col2'].replace(conversion_dict)
print(df.head())
You’ll see it looks like:
col1 col2 converted_column
0 1 negative -1
1 2 positive 1
2 2 neutral 0
3 3 neutral 0
4 1 positive 1
The docs for pandas.DataFrame.replace are here.
Answer 5
Or do apply:
df['col1'].apply(lambda x: {1: "A", 2: "B"}.get(x,x))
Demo:
>>> df['col1']=df['col1'].apply(lambda x: {1: "A", 2: "B"}.get(x,x))
>>> df
  col1 col2
0    w    a
1    A    2
2    B  NaN
>>>
Answer 6
Given that map is faster than replace (@JohnE's solution), you need to be careful with non-exhaustive mappings where you intend to map specific values to NaN. The proper method in this case requires that you mask the Series when you .fillna, otherwise you undo the mapping to NaN.
import pandas as pd
import numpy as np
d = {'m': 'Male', 'f': 'Female', 'missing': np.NaN}
df = pd.DataFrame({'gender': ['m', 'f', 'missing', 'Male', 'U']})
keep_nan = [k for k,v in d.items() if pd.isnull(v)]
s = df['gender']
df['mapped'] = s.map(d).fillna(s.mask(s.isin(keep_nan)))
gender mapped
0 m Male
1 f Female
2 missing NaN
3 Male Male
4 U U
Answer 7
A nice complete solution that keeps a map of your class labels:
labels = features['col1'].unique()
labels_dict = dict(zip(labels, range(len(labels))))
features = features.replace({"col1": labels_dict})
This way, you can at any point refer to the original class label from labels_dict.
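For illustration, a small sketch (with made-up example labels) of what labels_dict contains and how to invert it to recover the original labels later:

import pandas as pd

features = pd.DataFrame({'col1': ['cat', 'dog', 'cat', 'bird']})

labels = features['col1'].unique()                   # ['cat', 'dog', 'bird'], in order of appearance
labels_dict = dict(zip(labels, range(len(labels))))  # {'cat': 0, 'dog': 1, 'bird': 2}
features = features.replace({"col1": labels_dict})

# Invert the mapping to go back from integer codes to the original labels.
inverse_dict = {v: k for k, v in labels_dict.items()}
features['col1_original'] = features['col1'].map(inverse_dict)
print(features)
#    col1 col1_original
# 0     0           cat
# 1     1           dog
# 2     0           cat
# 3     2          bird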
Answer 8
As an extension of what has been proposed by Nico Coallier (applying to multiple columns) and U10-Forward (using the apply style of methods), summarised into a one-liner, I propose:
df.loc[:,['col1','col2']].transform(lambda x: x.map(lambda x: {1: "A", 2: "B"}.get(x,x)))
The .transform() processes each column as a Series, contrary to .apply(), which passes the columns aggregated in a DataFrame. Consequently, you can apply the Series method map().
Finally, and I discovered this behaviour thanks to U10, you can use the whole Series in the .get() expression (unless I have misunderstood its behaviour and it processes the Series sequentially rather than element-wise). The .get(x, x) accounts for the values you did not mention in your mapping dictionary, which would otherwise be turned into NaN by the .map() method.
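A minimal runnable sketch of the one-liner, assuming a small dataframe where both columns should be remapped with the same dictionary:

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': ['w', 1, 2], 'col2': ['a', 2, np.nan]})
di = {1: "A", 2: "B"}

# transform hands each column to the lambda as a Series; map + dict.get leaves
# unmatched values (and NaN) untouched instead of turning them into NaN.
out = df.loc[:, ['col1', 'col2']].transform(lambda col: col.map(lambda x: di.get(x, x)))
print(out)
#   col1 col2
# 0    w    a
# 1    A    B
# 2    B  NaN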
Answer 9
A more native pandas approach is to apply a replace function as below:
import re

def multiple_replace(mapping, text):
    # Create a regular expression from the dictionary keys
    # (keys and values are assumed to be strings)
    regex = re.compile("(%s)" % "|".join(map(re.escape, mapping.keys())))
    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: mapping[mo.group(1)], text)
Once you have defined the function, you can apply it to your dataframe. Note that because this approach is regex-based, both the dictionary keys/values and the column values need to be strings:
di = {"1": "A", "2": "B"}
df['col1'] = df.apply(lambda row: multiple_replace(di, str(row['col1'])), axis=1)