熊猫：在数据框中创建两个新列，并使用从现有列中计算出的值

Question 1

I am working with the pandas library and I want to add two new columns to a dataframe df with n columns (n > 0).
These new columns result from the application of a function to one of the columns in the dataframe.

The function to apply is like:

def calculate(x):
    ...operate...
    return z, y

One method for creating a new column for a function returning only a value is:

df['new_col']) = df['column_A'].map(a_function)

So, what I want, and tried unsuccesfully (*), is something like:

(df['new_col_zetas'], df['new_col_ys']) = df['column_A'].map(calculate)

What the best way to accomplish this could be ? I scanned the documentation with no clue.

**df['column_A'].map(calculate) returns a pandas Series each item consisting of a tuple z, y. And trying to assign this to two dataframe columns produces a ValueError.*

Question 2

I’d just use zip:

In [1]: from pandas import *

In [2]: def calculate(x):
   ...:     return x*2, x*3
   ...: 

In [3]: df = DataFrame({'a': [1,2,3], 'b': [2,3,4]})

In [4]: df
Out[4]: 
   a  b
0  1  2
1  2  3
2  3  4

In [5]: df["A1"], df["A2"] = zip(*df["a"].map(calculate))

In [6]: df
Out[6]: 
   a  b  A1  A2
0  1  2   2   3
1  2  3   4   6
2  3  4   6   9

Question 3

The top answer is flawed in my opinion. Hopefully, no one is mass importing all of pandas into their namespace with from pandas import *. Also, the map method should be reserved for those times when passing it a dictionary or Series. It can take a function but this is what apply is used for.

So, if you must use the above approach, I would write it like this

df["A1"], df["A2"] = zip(*df["a"].apply(calculate))

There’s actually no reason to use zip here. You can simply do this:

df["A1"], df["A2"] = calculate(df['a'])

This second method is also much faster on larger DataFrames

df = pd.DataFrame({'a': [1,2,3] * 100000, 'b': [2,3,4] * 100000})

DataFrame created with 300,000 rows

%timeit df["A1"], df["A2"] = calculate(df['a'])
2.65 ms ± 92.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df["A1"], df["A2"] = zip(*df["a"].apply(calculate))
159 ms ± 5.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

60x faster than zip

In general, avoid using apply

Apply is generally not much faster than iterating over a Python list. Let’s test the performance of a for-loop to do the same thing as above

%%timeit
A1, A2 = [], []
for val in df['a']:
    A1.append(val**2)
    A2.append(val**3)

df['A1'] = A1
df['A2'] = A2

298 ms ± 7.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So this is twice as slow which isn’t a terrible performance regression, but if we cythonize the above, we get much better performance. Assuming, you are using ipython:

%load_ext cython

%%cython
cpdef power(vals):
    A1, A2 = [], []
    cdef double val
    for val in vals:
        A1.append(val**2)
        A2.append(val**3)

    return A1, A2

%timeit df['A1'], df['A2'] = power(df['a'])
72.7 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Directly assigning without apply

You can get even greater speed improvements if you use the direct vectorized operations.

%timeit df['A1'], df['A2'] = df['a'] ** 2, df['a'] ** 3
5.13 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This takes advantage of NumPy’s extremely fast vectorized operations instead of our loops. We now have a 30x speedup over the original.

The simplest speed test with `apply`

The above example should clearly show how slow apply can be, but just so its extra clear let’s look at the most basic example. Let’s square a Series of 10 million numbers with and without apply

s = pd.Series(np.random.rand(10000000))

%timeit s.apply(calc)
3.3 s ± 57.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Without apply is 50x faster

%timeit s ** 2
66 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

熊猫：在数据框中创建两个新列，并使用从现有列中计算出的值

问题：熊猫：在数据框中创建两个新列，并使用从现有列中计算出的值

回答 0

回答 1

通常，避免使用Apply

直接分配而不适用

最简单的速度测试 `apply`

In general, avoid using apply

Directly assigning without apply

The simplest speed test with `apply`

排行榜展示

Python 情人节超强技能导出微信聊天记录生成词云

你不得不知道的python超级文献批量搜索下载工具

7行代码 Python热力图可视化分析缺失数据处理

Python 流程图 — 一键转化代码为流程图

Python 优化—算出每条语句执行时间

你的10W块放哪里能赚最多钱？

文章展示

Python 短文本自动识别个体是否有自杀倾向

PyTorch中的“视图”方法如何工作？

如何更新SQLAlchemy行条目？

程序退出前做些事情

熊猫中的for循环真的不好吗？我什么时候应该在意？

子流程Popen和call有什么区别（我该如何使用它们）？

熊猫：在数据框中创建两个新列，并使用从现有列中计算出的值

问题：熊猫：在数据框中创建两个新列，并使用从现有列中计算出的值

回答 0

回答 1

通常，避免使用Apply

直接分配而不适用

最简单的速度测试 apply

In general, avoid using apply

Directly assigning without apply

The simplest speed test with apply

相关文章

排行榜展示

文章展示

最简单的速度测试 `apply`

The simplest speed test with `apply`