Replacing column values in a pandas DataFrame

Question: Replacing column values in a pandas DataFrame

I’m trying to replace the values in one column of a dataframe. The column (‘female’) only contains the values ‘female’ and ‘male’.

I have tried the following:

w['female']['female']='1'
w['female']['male']='0' 

But I get back exactly the same data as before.

I would ideally like to get some output which resembles the following loop element-wise.

if w['female'] =='female':
    w['female'] = '1';
else:
    w['female'] = '0';

I’ve looked through the gotchas documentation (http://pandas.pydata.org/pandas-docs/stable/gotchas.html) but cannot figure out why nothing happens.

Any help will be appreciated.


Answer 0

If I understand right, you want something like this:

w['female'] = w['female'].map({'female': 1, 'male': 0})

(Here I convert the values to numbers instead of strings containing numbers. You can convert them to "1" and "0", if you really want, but I’m not sure why you’d want that.)

The reason your code doesn’t work is that using ['female'] on a column (the second 'female' in your w['female']['female']) doesn’t mean “select rows where the value is ‘female’”. It means “select rows where the index label is ‘female’”, and your DataFrame probably has no such rows.
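
A minimal sketch of the difference, assuming a small made-up frame w with the default integer index (the data is illustrative, not the asker's):

```python
import pandas as pd

w = pd.DataFrame({'female': ['female', 'male', 'female']})

# Indexing a Series with a string looks the string up among the index labels,
# not among the values; with the default 0..n-1 index there is no such label.
has_label = 'female' in w['female'].index

# map() translates each value through the dictionary instead.
w['female'] = w['female'].map({'female': 1, 'male': 0})

print(has_label)
print(w['female'].tolist())
```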


Answer 1

You can edit a subset of a dataframe by using loc:

df.loc[<row selection>, <column selection>]

In this case:

w.loc[w.female != 'female', 'female'] = 0
w.loc[w.female == 'female', 'female'] = 1
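
As a runnable sketch of the same idea on made-up data: the mask name is_female and the final astype(int) step are my own additions, on the assumption that an integer column is the goal.

```python
import pandas as pd

w = pd.DataFrame({'female': ['female', 'male', 'female']})

# Compute the row mask once, before any value is overwritten.
is_female = w['female'] == 'female'

w.loc[~is_female, 'female'] = 0
w.loc[is_female, 'female'] = 1

# The column started out holding strings, so convert it explicitly
# if an integer dtype is wanted.
w['female'] = w['female'].astype(int)

print(w['female'].tolist())
```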

Answer 2

w.female.replace(to_replace=dict(female=1, male=0), inplace=True)

See pandas.DataFrame.replace() docs.
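
The same replacement can also be written as an assignment. This sketch, on made-up data, does not depend on inplace=True reaching the frame through w.female, which can behave differently between pandas versions:

```python
import pandas as pd

w = pd.DataFrame({'female': ['female', 'male', 'female']})

# Series.replace returns a new Series; assigning it back updates the frame.
w['female'] = w['female'].replace({'female': 1, 'male': 0})

print(w['female'].tolist())
```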


Answer 3

Slight variation:

w.female.replace(['male', 'female'], [0, 1], inplace=True)

Answer 4

This should also work, although it relies on chained assignment, which pandas discourages and may warn about:

w.female[w.female == 'female'] = 1 
w.female[w.female == 'male']   = 0

Answer 5

You can also use apply with the dictionary's .get method:

w['female'] = w['female'].apply({'male':0, 'female':1}.get)

For example:

w = pd.DataFrame({'female':['female','male','female']})
print(w)

Dataframe w:

   female
0  female
1    male
2  female

Using apply to replace values from the dictionary:

w['female'] = w['female'].apply({'male':0, 'female':1}.get)
print(w)

Result:

   female
0       1
1       0
2       1 

Note: use apply with a dictionary only if the dictionary covers every value that can occur in the column; any value it does not cover ends up missing in the result.
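
One possible way around that limitation, sketched on made-up data, is to give .get a default. The names codes, strict and lenient are only illustrative.

```python
import pandas as pd

w = pd.DataFrame({'female': ['female', 'male', 'other']})
codes = {'male': 0, 'female': 1}

# dict.get returns None for keys it does not know, so 'other' comes back missing.
strict = w['female'].apply(codes.get)

# Passing the value itself as the default keeps unknown entries unchanged.
lenient = w['female'].apply(lambda value: codes.get(value, value))

print(strict.isna().tolist())
print(lenient.tolist())
```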


Answer 6

This is very compact:

w['female'][w['female'] == 'female']=1
w['female'][w['female'] == 'male']=0

Another good one (replace 'female' first, because the pattern 'male' also matches inside 'female'):

w['female'] = w['female'].replace(regex='female', value=1)
w['female'] = w['female'].replace(regex='male', value=0)
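
If the order of the two calls should not matter, the patterns can be anchored. This is a sketch of that variation on made-up data, not part of the original suggestion:

```python
import pandas as pd

w = pd.DataFrame({'female': ['female', 'male', 'female']})

# ^ and $ force a whole-string match, so 'male' cannot match inside 'female'.
w['female'] = w['female'].replace(regex=r'^male$', value=0)
w['female'] = w['female'].replace(regex=r'^female$', value=1)

print(w['female'].tolist())
```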

Answer 7

Alternatively there is the built-in function pd.get_dummies for these kinds of assignments:

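w['female'] = pd.get_dummies(w['female'])['female']

get_dummies gives you a data frame with one indicator column for each value that occurs in w[‘female’], each named after the value it encodes, and the line above keeps the ‘female’ one. With only two values a single column is enough, because the other can be inferred from it. Passing drop_first = True drops the first column in sorted order, which here would be ‘female’, leaving the ‘male’ indicator instead.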

This is especially useful if you have categorical variables with more than two possible values: the function creates as many dummy variables as are needed to distinguish between all cases. Take care then not to assign the entire data frame to a single column. Instead, if w[‘female’] could be ‘male’, ‘female’ or ‘neutral’, do something like this:

w = pd.concat([w, pd.get_dummies(w['female'], drop_first = True)], axis = 1)
w.drop('female', axis = 1, inplace = True)

Then you are left with two new columns that give you the dummy coding of ‘female’, and the column with the strings is gone.
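
A self-contained sketch of the three-value case, using made-up data. The dtype=int argument is my addition so that the indicators print as 0/1 regardless of the pandas version's default.

```python
import pandas as pd

w = pd.DataFrame({'female': ['male', 'female', 'neutral', 'female']})

# One indicator column per value, minus the first in sorted order ('female').
dummies = pd.get_dummies(w['female'], drop_first=True, dtype=int)

# Attach the indicators and drop the original string column.
w = pd.concat([w, dummies], axis=1).drop(columns='female')

print(w)
```

A row of all zeros in the remaining columns then stands for the dropped value, ‘female’.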


Answer 8

Using Series.map with Series.fillna

If your column contains more strings than only female and male, Series.map will fail in this case since it will return NaN for other values.

That’s why we have to chain it with fillna:

Example why .map fails:

df = pd.DataFrame({'female':['male', 'female', 'female', 'male', 'other', 'other']})

   female
0    male
1  female
2  female
3    male
4   other
5   other

df['female'].map({'female': '1', 'male': '0'})

0      0
1      1
2      1
3      0
4    NaN
5    NaN
Name: female, dtype: object

For the correct method, we chain map with fillna, so we fill the NaN with values from the original column:

df['female'].map({'female': '1', 'male': '0'}).fillna(df['female'])

0        0
1        1
2        1
3        0
4    other
5    other
Name: female, dtype: object
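
Before filling, it can help to see which values have no dictionary entry. A small sketch of such a check on the same example data follows; the names mapped and unmapped are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'female': ['male', 'female', 'female', 'male', 'other', 'other']})

mapped = df['female'].map({'female': '1', 'male': '0'})

# Rows that are NaN after map() are exactly the ones without a dictionary entry.
unmapped = df.loc[mapped.isna(), 'female'].unique()
print(list(unmapped))

# fillna() restores those rows from the original column, aligned on the index.
df['female'] = mapped.fillna(df['female'])
print(df['female'].tolist())
```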

Answer 9

There is also a function in pandas called factorize which you can use to automatically do this type of work. It converts labels to numbers: ['male', 'female', 'male'] -> [0, 1, 0]. See this answer for more information.
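
A short sketch of the call on the list from the example above. Note that the codes follow the order in which labels first appear, so which label gets 0 depends on the data.

```python
import pandas as pd

codes, uniques = pd.factorize(['male', 'female', 'male'])

# 'male' is 0 here only because it appears first in the input.
print(codes.tolist())
print(list(uniques))
```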


Answer 10

I think it is worth pointing out which type of object you get back in the methods suggested above: a Series or a DataFrame.

When you select a column with w.female or w['female'], you get back a Series; when you select it with a list, as in w[['female']], you get back a DataFrame. Both types have a .replace method.

With .loc or .iloc the result depends on the selector in the same way: w.loc[:, 'female'] returns a Series and w.loc[:, ['female']] returns a DataFrame. map with a dictionary, as used above, is a Series method, so make sure you are working with a Series before calling it.
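
A quick way to check which type a given selection returns, sketched on a made-up frame (the variable names are only illustrative):

```python
import pandas as pd

w = pd.DataFrame({'female': ['female', 'male', 'female']})

as_series = w['female']            # same result as w.female
as_frame = w[['female']]           # a list selector keeps the DataFrame shape
loc_series = w.loc[:, 'female']
loc_frame = w.loc[:, ['female']]

print(type(as_series).__name__, type(as_frame).__name__)
print(type(loc_series).__name__, type(loc_frame).__name__)

# Both classes expose a replace method.
print(hasattr(pd.Series, 'replace'), hasattr(pd.DataFrame, 'replace'))
```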