根据熊猫中的另一个值更改一个值

问题:根据熊猫中的另一个值更改一个值

我试图将我的Stata代码重新编程为Python,以提高速度,而我的方向是PANDAS。但是,我很难集中精力处理数据。

假设我要遍历列标题“ ID”中的所有值。如果该ID与特定数字匹配,那么我想更改两个相应的值FirstName和LastName。

在Stata中,它看起来像这样:

replace FirstName = "Matt" if ID==103
replace LastName =  "Jones" if ID==103

因此,这将替换FirstName中与Matt的ID == 103值相对应的所有值。

在PANDAS中,我正在尝试类似的方法

df = read_csv("test.csv")
for i in df['ID']:
    if i ==103:
          ...

不知道从这里去哪里。有任何想法吗?

I’m trying to reprogram my Stata code into Python for speed improvements, and I was pointed in the direction of PANDAS. I am, however, having a hard time wrapping my head around how to process the data.

Let’s say I want to iterate over all values in the column head ‘ID.’ If that ID matches a specific number, then I want to change two corresponding values FirstName and LastName.

In Stata it looks like this:

replace FirstName = "Matt" if ID==103
replace LastName =  "Jones" if ID==103

So this replaces all values in FirstName that correspond with values of ID == 103 to Matt.

In PANDAS, I’m trying something like this

df = read_csv("test.csv")
for i in df['ID']:
    if i ==103:
          ...

Not sure where to go from here. Any ideas?


回答 0

一种选择是使用Python的切片和索引功能来逻辑评估条件所在的位置并覆盖其中的数据。

假设您可以使用将数据直接加载到pandas其中,pandas.read_csv则以下代码可能对您有所帮助。

import pandas
df = pandas.read_csv("test.csv")
df.loc[df.ID == 103, 'FirstName'] = "Matt"
df.loc[df.ID == 103, 'LastName'] = "Jones"

如评论中所述,您也可以一次性完成对两列的分配:

df.loc[df.ID == 103, ['FirstName', 'LastName']] = 'Matt', 'Jones'

请注意,您需要pandas使用0.11或更高版本才能进行loc覆盖分配操作。


另一种方法是使用所谓的链式分配。这种行为的稳定性较差,因此不被认为是最佳解决方案(在文档中明确建议不要这样做),但了解以下信息将很有用:

import pandas
df = pandas.read_csv("test.csv")
df['FirstName'][df.ID == 103] = "Matt"
df['LastName'][df.ID == 103] = "Jones"

One option is to use Python’s slicing and indexing features to logically evaluate the places where your condition holds and overwrite the data there.

Assuming you can load your data directly into pandas with pandas.read_csv then the following code might be helpful for you.

import pandas
df = pandas.read_csv("test.csv")
df.loc[df.ID == 103, 'FirstName'] = "Matt"
df.loc[df.ID == 103, 'LastName'] = "Jones"

As mentioned in the comments, you can also do the assignment to both columns in one shot:

df.loc[df.ID == 103, ['FirstName', 'LastName']] = 'Matt', 'Jones'

Note that you’ll need pandas version 0.11 or newer to make use of loc for overwrite assignment operations.


Another way to do it is to use what is called chained assignment. The behavior of this is less stable and so it is not considered the best solution (it is explicitly discouraged in the docs), but it is useful to know about:

import pandas
df = pandas.read_csv("test.csv")
df['FirstName'][df.ID == 103] = "Matt"
df['LastName'][df.ID == 103] = "Jones"

回答 1

您可以使用map,它可以映射字典或自定义函数中的值。

假设这是您的df:

    ID First_Name Last_Name
0  103          a         b
1  104          c         d

创建字典:

fnames = {103: "Matt", 104: "Mr"}
lnames = {103: "Jones", 104: "X"}

和地图:

df['First_Name'] = df['ID'].map(fnames)
df['Last_Name'] = df['ID'].map(lnames)

结果将是:

    ID First_Name Last_Name
0  103       Matt     Jones
1  104         Mr         X

或使用自定义函数:

names = {103: ("Matt", "Jones"), 104: ("Mr", "X")}
df['First_Name'] = df['ID'].map(lambda x: names[x][0])

You can use map, it can map vales from a dictonairy or even a custom function.

Suppose this is your df:

    ID First_Name Last_Name
0  103          a         b
1  104          c         d

Create the dicts:

fnames = {103: "Matt", 104: "Mr"}
lnames = {103: "Jones", 104: "X"}

And map:

df['First_Name'] = df['ID'].map(fnames)
df['Last_Name'] = df['ID'].map(lnames)

The result will be:

    ID First_Name Last_Name
0  103       Matt     Jones
1  104         Mr         X

Or use a custom function:

names = {103: ("Matt", "Jones"), 104: ("Mr", "X")}
df['First_Name'] = df['ID'].map(lambda x: names[x][0])

回答 2

最初的问题是针对特定的狭窄用例。对于那些需要更通用答案的人,这里有一些示例:

使用其他列中的数据创建新列

给定以下数据框:

import pandas as pd
import numpy as np

df = pd.DataFrame([['dog', 'hound', 5],
                   ['cat', 'ragdoll', 1]],
                  columns=['animal', 'type', 'age'])

In[1]:
Out[1]:
  animal     type  age
----------------------
0    dog    hound    5
1    cat  ragdoll    1

下面,我们description通过使用+被系列覆盖的操作,添加一个新列作为其他列的串联。花式字符串格式,f字符串等在这里不起作用,因为这+适用于标量而不是“原始”值:

df['description'] = 'A ' + df.age.astype(str) + ' years old ' \
                    + df.type + ' ' + df.animal

In [2]: df
Out[2]:
  animal     type  age                description
-------------------------------------------------
0    dog    hound    5    A 5 years old hound dog
1    cat  ragdoll    1  A 1 years old ragdoll cat

我们获得1 years了猫(而不是1 year),它将在下面使用条件固定。

使用条件修改现有列

在这里,我们用animal其他列中的值替换原始列,并np.where根据的值设置条件子字符串age

# append 's' to 'age' if it's greater than 1
df.animal = df.animal + ", " + df.type + ", " + \
    df.age.astype(str) + " year" + np.where(df.age > 1, 's', '')

In [3]: df
Out[3]:
                 animal     type  age
-------------------------------------
0   dog, hound, 5 years    hound    5
1  cat, ragdoll, 1 year  ragdoll    1

使用条件修改多列

一种更灵活的方法是调用.apply()整个数据框而不是单个列:

def transform_row(r):
    r.animal = 'wild ' + r.type
    r.type = r.animal + ' creature'
    r.age = "{} year{}".format(r.age, r.age > 1 and 's' or '')
    return r

df.apply(transform_row, axis=1)

In[4]:
Out[4]:
         animal            type      age
----------------------------------------
0    wild hound    dog creature  5 years
1  wild ragdoll    cat creature   1 year

在上面的代码中,该transform_row(r)函数接受一个Series表示给定行的对象(用表示axis=1,默认值axis=0Series为每一列提供一个对象)。因为我们可以使用列名称访问行中的实际“原始”值,并且可以查看给定行/列中其他单元格的情况,所以这简化了处理。

The original question addresses a specific narrow use case. For those who need more generic answers here are some examples:

Creating a new column using data from other columns

Given the dataframe below:

import pandas as pd
import numpy as np

df = pd.DataFrame([['dog', 'hound', 5],
                   ['cat', 'ragdoll', 1]],
                  columns=['animal', 'type', 'age'])

In[1]:
Out[1]:
  animal     type  age
----------------------
0    dog    hound    5
1    cat  ragdoll    1

Below we are adding a new description column as a concatenation of other columns by using the + operation which is overridden for series. Fancy string formatting, f-strings etc won’t work here since the + applies to scalars and not ‘primitive’ values:

df['description'] = 'A ' + df.age.astype(str) + ' years old ' \
                    + df.type + ' ' + df.animal

In [2]: df
Out[2]:
  animal     type  age                description
-------------------------------------------------
0    dog    hound    5    A 5 years old hound dog
1    cat  ragdoll    1  A 1 years old ragdoll cat

We get 1 years for the cat (instead of 1 year) which we will be fixing below using conditionals.

Modifying an existing column with conditionals

Here we are replacing the original animal column with values from other columns, and using np.where to set a conditional substring based on the value of age:

# append 's' to 'age' if it's greater than 1
df.animal = df.animal + ", " + df.type + ", " + \
    df.age.astype(str) + " year" + np.where(df.age > 1, 's', '')

In [3]: df
Out[3]:
                 animal     type  age
-------------------------------------
0   dog, hound, 5 years    hound    5
1  cat, ragdoll, 1 year  ragdoll    1

Modifying multiple columns with conditionals

A more flexible approach is to call .apply() on an entire dataframe rather than on a single column:

def transform_row(r):
    r.animal = 'wild ' + r.type
    r.type = r.animal + ' creature'
    r.age = "{} year{}".format(r.age, r.age > 1 and 's' or '')
    return r

df.apply(transform_row, axis=1)

In[4]:
Out[4]:
         animal            type      age
----------------------------------------
0    wild hound    dog creature  5 years
1  wild ragdoll    cat creature   1 year

In the code above the transform_row(r) function takes a Series object representing a given row (indicated by axis=1, the default value of axis=0 will provide a Series object for each column). This simplifies processing since we can access the actual ‘primitive’ values in the row using the column names and have visibility of other cells in the given row/column.


回答 3

这个问题可能仍然经常被探访,因此值得为卡西斯先生的回答提供补充。可以对dict内置类进行子类化,以便为“缺失”键返回默认值。此机制对熊猫有效。但请参阅下文。

这样就可以避免关键错误。

>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> class SurnameMap(dict):
...     def __missing__(self, key):
...         return ''
...     
>>> surnamemap = SurnameMap()
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap[x])
>>> df
    ID  Surname
0  101  Mohanty
1  201         
2  301    Drake
3  401         

可以通过以下方式更简单地完成同一件事。getdict对象的方法使用’default’参数使得不必将dict子类化。

>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> surnamemap = {}
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap.get(x, ''))
>>> df
    ID  Surname
0  101  Mohanty
1  201         
2  301    Drake
3  401         

This question might still be visited often enough that it’s worth offering an addendum to Mr Kassies’ answer. The dict built-in class can be sub-classed so that a default is returned for ‘missing’ keys. This mechanism works well for pandas. But see below.

In this way it’s possible to avoid key errors.

>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> class SurnameMap(dict):
...     def __missing__(self, key):
...         return ''
...     
>>> surnamemap = SurnameMap()
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap[x])
>>> df
    ID  Surname
0  101  Mohanty
1  201         
2  301    Drake
3  401         

The same thing can be done more simply in the following way. The use of the ‘default’ argument for the get method of a dict object makes it unnecessary to subclass a dict.

>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> surnamemap = {}
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap.get(x, ''))
>>> df
    ID  Surname
0  101  Mohanty
1  201         
2  301    Drake
3  401