熊猫有条件地创建系列/数据框列

Question 1

I have a dataframe along the lines of the below:

    Type       Set
1    A          Z
2    B          Z           
3    B          X
4    C          Y

I want to add another column to the dataframe (or generate a series) of the same length as the dataframe (equal number of records/rows) which sets a colour 'green' if Set == 'Z' and 'red' if Set equals anything else.

What’s the best way to do this?

Question 2

If you only have two choices to select from:

df['color'] = np.where(df['Set']=='Z', 'green', 'red')

For example,

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)

yields

  Set Type  color
0   Z    A  green
1   Z    B  green
2   X    B    red
3   Y    C    red

If you have more than two conditions then use . For example, if you want color to be

yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
otherwise purple when (df['Type'] == 'B')
otherwise black,

then use

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
    (df['Set'] == 'Z') & (df['Type'] == 'A'),
    (df['Set'] == 'Z') & (df['Type'] == 'B'),
    (df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)

which yields

  Set Type   color
0   Z    A  yellow
1   Z    B    blue
2   X    B  purple
3   Y    C   black

Question 3

List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.

Example list comprehension:

df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]

%timeit tests:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
%timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

1000 loops, best of 3: 239 µs per loop
1000 loops, best of 3: 523 µs per loop
1000 loops, best of 3: 263 µs per loop

Question 4

Another way in which this could be achieved is

df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

Question 5

Here’s yet another way to skin this cat, using a dictionary to map new values onto the keys in the list:

def map_values(row, values_dict):
    return values_dict[row]

values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]})

df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args = (values_dict,))

What’s it look like:

df
Out[2]: 
  INDICATOR  VALUE  NEW_VALUE
0         A     10          1
1         B      9          2
2         C      8          3
3         D      7          4

This approach can be very powerful when you have many ifelse-type statements to make (i.e. many unique values to replace).

And of course you could always do this:

df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)

But that approach is more than three times as slow as the apply approach from above, on my machine.

And you could also do this, using dict.get:

df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]

Question 6

The following is slower than the approaches timed here, but we can compute the extra column based on the contents of more than one column, and more than two values can be computed for the extra column.

Simple example using just the “Set” column:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)

  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Example with more colours and more columns taken into account:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    elif row["Type"] == "C":
        return "blue"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)

  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C   blue

Edit (21/06/2019): Using plydata

It is also possible to use plydata to do this kind of things (this seems even slower than using assign and apply, though).

from plydata import define, if_else

Simple if_else:

df = define(df, color=if_else('Set=="Z"', '"red"', '"green"'))

print(df)

  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Nested if_else:

df = define(df, color=if_else(
    'Set=="Z"',
    '"red"',
    if_else('Type=="C"', '"green"', '"blue"')))

print(df)

  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B   blue
3   Y    C  green

Question 7

Maybe this has been possible with newer updates of Pandas (tested with pandas=1.0.5), but I think the following is the shortest and maybe best answer for the question, so far. You can use the .loc method and use one condition or several depending on your need.

Code Summary:

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
df.loc[(df['Set']=="Z"), 'Color'] = "green"

#practice!
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"

Explanation:

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))

# df so far: 
  Type Set  
0    A   Z 
1    B   Z 
2    B   X 
3    C   Y

add a ‘color’ column and set all values to “red”

df['Color'] = "red"

Apply your single condition:

df.loc[(df['Set']=="Z"), 'Color'] = "green"


# df: 
  Type Set  Color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

or multiple conditions if you want:

df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"

You can read on Pandas logical operators and conditional selection here: Logical operators for boolean indexing in Pandas

Question 8

One liner with .apply() method is following:

df['color'] = df['Set'].apply(lambda set_: 'green' if set_=='Z' else 'red')

After that, df data frame looks like this:

>>> print(df)
  Type Set  color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

Question 9

If you’re working with massive data, a memoized approach would be best:

# First create a dictionary of manually stored values
color_dict = {'Z':'red'}

# Second, build a dictionary of "other" values
color_dict_other = {x:'green' for x in df['Set'].unique() if x not in color_dict.keys()}

# Next, merge the two
color_dict.update(color_dict_other)

# Finally, map it to your column
df['color'] = df['Set'].map(color_dict)

This approach will be fastest when you have many repeated values. My general rule of thumb is to memoize when: data_size > 10**4 & n_distinct < data_size/4

E.x. Memoize in a case 10,000 rows with 2,500 or fewer distinct values.

熊猫有条件地创建系列/数据框列

问题：熊猫有条件地创建系列/数据框列

回答 0

回答 1

回答 2

回答 3

回答 4

编辑（21/06/2019）：使用plydata

Edit (21/06/2019): Using plydata

回答 5

回答 6

回答 7

排行榜展示

Python 情人节超强技能导出微信聊天记录生成词云

你不得不知道的python超级文献批量搜索下载工具

7行代码 Python热力图可视化分析缺失数据处理

Python 流程图 — 一键转化代码为流程图

Python 优化—算出每条语句执行时间

你的10W块放哪里能赚最多钱？

文章展示

在Django中使用Pylint

如何使用点“。” 访问字典成员？

Pip-Python软件包安装程序

以科学计数法显示小数

如何将所有模块加载到文件夹中？

将namedtuple转换成字典

熊猫有条件地创建系列/数据框列

问题：熊猫有条件地创建系列/数据框列

回答 0

回答 1

回答 2

回答 3

回答 4

编辑（21/06/2019）：使用plydata

Edit (21/06/2019): Using plydata

回答 5

回答 6

回答 7

相关文章

排行榜展示

文章展示