问题:从熊猫返回多列apply()
我有一个熊猫DataFrame ,df_test
。它包含一列“大小”,以字节为单位表示大小。我已经使用以下代码计算了KB,MB和GB:
df_test = pd.DataFrame([
{'dir': '/Users/uname1', 'size': 994933},
{'dir': '/Users/uname2', 'size': 109338711},
])
df_test['size_kb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0, grouping=True) + ' KB')
df_test['size_mb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0 ** 2, grouping=True) + ' MB')
df_test['size_gb'] = df_test['size'].astype(int).apply(lambda x: locale.format("%.1f", x / 1024.0 ** 3, grouping=True) + ' GB')
df_test
dir size size_kb size_mb size_gb
0 /Users/uname1 994933 971.6 KB 0.9 MB 0.0 GB
1 /Users/uname2 109338711 106,776.1 KB 104.3 MB 0.1 GB
[2 rows x 5 columns]
我已经运行了超过120,000行,并且根据%timeit,每列花费的时间约为2.97秒* 3 =〜9秒。
无论如何,我可以使它更快吗?例如,我是否可以代替一次套用并运行3次而不是一次返回一列,可以一次返回所有三列以将其插入回原始数据帧吗?
我发现的其他问题都希望采用多个值并返回一个值。我想要一个值并返回多列。
回答 0
这是一个古老的问题,但是为了完整起见,您可以从包含新数据的应用函数中返回一个Series,从而避免了需要迭代3次的麻烦。传递axis=1
到apply函数sizes
会将函数应用于数据框的每一行,并返回一个序列以添加到新的数据框。该系列s包含新值以及原始数据。
def sizes(s):
s['size_kb'] = locale.format("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
s['size_mb'] = locale.format("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
s['size_gb'] = locale.format("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
return s
df_test = df_test.append(rows_list)
df_test = df_test.apply(sizes, axis=1)
回答 1
使用apply和zip将比Series方式快3倍。
def sizes(s):
return locale.format("%.1f", s / 1024.0, grouping=True) + ' KB', \
locale.format("%.1f", s / 1024.0 ** 2, grouping=True) + ' MB', \
locale.format("%.1f", s / 1024.0 ** 3, grouping=True) + ' GB'
df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes))
测试结果为:
Separate df.apply():
100 loops, best of 3: 1.43 ms per loop
Return Series:
100 loops, best of 3: 2.61 ms per loop
Return tuple:
1000 loops, best of 3: 819 µs per loop
回答 2
当前的某些回复效果很好,但我想提供另一个也许更“惊慌”的选择。这适用于我当前的大熊猫0.23(不确定是否可以在以前的版本中使用):
import pandas as pd
df_test = pd.DataFrame([
{'dir': '/Users/uname1', 'size': 994933},
{'dir': '/Users/uname2', 'size': 109338711},
])
def sizes(s):
a = locale.format("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
b = locale.format("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
c = locale.format("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
return a, b, c
df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes, axis=1, result_type="expand")
注意,技巧在的result_type
参数上apply
,它将把结果扩展为DataFrame
,可以直接分配给新/旧列。
回答 3
只是另一种可读的方式。这段代码将添加三个新列及其值,并在apply函数中返回不使用任何参数的序列。
def sizes(s):
val_kb = locale.format("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
val_mb = locale.format("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
val_gb = locale.format("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
return pd.Series([val_kb,val_mb,val_gb],index=['size_kb','size_mb','size_gb'])
df[['size_kb','size_mb','size_gb']] = df.apply(lambda x: sizes(x) , axis=1)
来自以下的一般示例:https : //pandas.pydata.org/pandas-docs/stable/genic/pandas.DataFrame.apply.html
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
#foo bar
#0 1 2
#1 1 2
#2 1 2
回答 4
真的很酷的答案!感谢Jesse和jaumebonet!关于以下方面的一些观察:
zip(* ...
... result_type="expand")
尽管expand有点更优雅(平庸),但是zip至少快** 2倍。在下面这个简单的示例中,我的速度提高了4倍。
import pandas as pd
dat = [ [i, 10*i] for i in range(1000)]
df = pd.DataFrame(dat, columns = ["a","b"])
def add_and_sub(row):
add = row["a"] + row["b"]
sub = row["a"] - row["b"]
return add, sub
df[["add", "sub"]] = df.apply(add_and_sub, axis=1, result_type="expand")
# versus
df["add"], df["sub"] = zip(*df.apply(add_and_sub, axis=1))
回答 5
顶部答案之间的性能显著变化,和杰西&famaral42已经讨论过这一点,但它是值得分享的顶部答案之间的公平比较,并制定杰西的回答的一个微妙但重要的细节:在传递给说法功能,也会影响性能。
(Python 3.7.4,Pandas 1.0.3)
import pandas as pd
import locale
import timeit
def create_new_df_test():
df_test = pd.DataFrame([
{'dir': '/Users/uname1', 'size': 994933},
{'dir': '/Users/uname2', 'size': 109338711},
])
return df_test
def sizes_pass_series_return_series(series):
series['size_kb'] = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
series['size_mb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
series['size_gb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
return series
def sizes_pass_series_return_tuple(series):
a = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
b = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
c = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
return a, b, c
def sizes_pass_value_return_tuple(value):
a = locale.format_string("%.1f", value / 1024.0, grouping=True) + ' KB'
b = locale.format_string("%.1f", value / 1024.0 ** 2, grouping=True) + ' MB'
c = locale.format_string("%.1f", value / 1024.0 ** 3, grouping=True) + ' GB'
return a, b, c
结果如下:
# 1 - Accepted (Nels11 Answer) - (pass series, return series):
9.82 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 2 - Pandafied (jaumebonet Answer) - (pass series, return tuple):
2.34 ms ± 48.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 3 - Tuples (pass series, return tuple then zip):
1.36 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 4 - Tuples (Jesse Answer) - (pass value, return tuple then zip):
752 µs ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
请注意如何返回元组是最快的方法,但是什么传递中作为一个参数,也是影响性能。代码上的差异很细微,但性能却有显着提高。
测试#4(传递一个值)的速度是测试#3(传递一个值)的两倍,即使表面上执行的操作相同。
但是还有更多…
# 1a - Accepted (Nels11 Answer) - (pass series, return series, new columns exist):
3.23 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 2a - Pandafied (jaumebonet Answer) - (pass series, return tuple, new columns exist):
2.31 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 3a - Tuples (pass series, return tuple then zip, new columns exist):
1.36 ms ± 58.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 4a - Tuples (Jesse Answer) - (pass value, return tuple then zip, new columns exist):
694 µs ± 3.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
在某些情况下(#1a和#4a),将函数应用到已经存在输出列的DataFrame上比从函数创建它们更快。
这是运行测试的代码:
# Paste and run the following in ipython console. It will not work if you run it from a .py file.
print('\nAccepted Answer (pass series, return series, new columns dont exist):')
df_test = create_new_df_test()
%timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)
print('Accepted Answer (pass series, return series, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)
print('\nPandafied (pass series, return tuple, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")
print('Pandafied (pass series, return tuple, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")
print('\nTuples (pass series, return tuple then zip, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))
print('Tuples (pass series, return tuple then zip, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))
print('\nTuples (pass value, return tuple then zip, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))
print('Tuples (pass value, return tuple then zip, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))
回答 6
我相信1.1版会破坏此处最佳答案中建议的行为。
import pandas as pd
def test_func(row):
row['c'] = str(row['a']) + str(row['b'])
row['d'] = row['a'] + 1
return row
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['i', 'j', 'k']})
df.apply(test_func, axis=1)
上面的代码在pandas 1.1.0返回上运行:
a b c d
0 1 i 1i 2
1 1 i 1i 2
2 1 i 1i 2
在大熊猫1.0.5中返回:
a b c d
0 1 i 1i 2
1 2 j 2j 3
2 3 k 3k 4
我认为这是您所期望的。
不知道怎么的发行说明解释这种行为,然而由于解释这里通过复制他们复活旧的行为避免了原始行的突变。即:
def test_func(row):
row = row.copy() # <---- Avoid mutating the original reference
row['c'] = str(row['a']) + str(row['b'])
row['d'] = row['a'] + 1
return row
回答 7
通常,要返回多个值,这就是我要做的
def gimmeMultiple(group):
x1 = 1
x2 = 2
return array([[1, 2]])
def gimmeMultipleDf(group):
x1 = 1
x2 = 2
return pd.DataFrame(array([[1,2]]), columns=['x1', 'x2'])
df['size'].astype(int).apply(gimmeMultiple)
df['size'].astype(int).apply(gimmeMultipleDf)
明确地返回数据帧具有其优势,但有时并非必需。您可以查看apply()
收益,并使用这些功能;)
回答 8
它提供了一个新数据框,其中包含原始列的两列。
import pandas as pd
df = ...
df_with_two_columns = df.apply(lambda row:pd.Series([row['column_1'], row['column_2']], index=['column_1', 'column_2']),axis = 1)