Question: Pandas convert dataframe to array of tuples
I have manipulated some data using pandas and now I want to carry out a batch save back to the database. This requires me to convert the dataframe into an array of tuples, with each tuple corresponding to a “row” of the dataframe.
My DataFrame looks something like:
In [182]: data_set
Out[182]:
index data_date data_1 data_2
0 14303 2012-02-17 24.75 25.03
1 12009 2012-02-16 25.00 25.07
2 11830 2012-02-15 24.99 25.15
3 6274 2012-02-14 24.68 25.05
4 2302 2012-02-13 24.62 24.77
5 14085 2012-02-10 24.38 24.61
I want to convert it to an array of tuples like:
[(datetime.date(2012,2,17),24.75,25.03),
(datetime.date(2012,2,16),25.00,25.07),
...etc. ]
Any suggestion on how I can efficiently do this?
Answer 0
How about:
subset = data_set[['data_date', 'data_1', 'data_2']]
tuples = [tuple(x) for x in subset.to_numpy()]
For pandas < 0.24, use:
tuples = [tuple(x) for x in subset.values]
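As a minimal sketch of the approach above: the question asks for datetime.date objects, so the example below assumes data_date is stored as a datetime64 column (the question does not say) and converts it with .dt.date before building the tuples. The sample frame only reproduces the first three rows, and the .dt.date step is an addition for illustration, not part of the original answer.

```python
import datetime
import pandas as pd

# Sample data mirroring the first three rows of the question's frame.
data_set = pd.DataFrame({
    "index": [14303, 12009, 11830],
    "data_date": pd.to_datetime(["2012-02-17", "2012-02-16", "2012-02-15"]),
    "data_1": [24.75, 25.00, 24.99],
    "data_2": [25.03, 25.07, 25.15],
})

subset = data_set[["data_date", "data_1", "data_2"]].copy()
# Assumption: turn pandas Timestamps into plain datetime.date objects first.
subset["data_date"] = subset["data_date"].dt.date

# One tuple per row, in column order.
tuples = [tuple(x) for x in subset.to_numpy()]
```

Without the .dt.date step, the first element of each tuple would be a pandas Timestamp rather than a datetime.date.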
Answer 1
list(data_set.itertuples(index=False))
As of pandas 0.17.1, the above will return a list of namedtuples.
If you want a list of ordinary tuples, pass name=None as an argument:
list(data_set.itertuples(index=False, name=None))
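To make the difference concrete, here is a small sketch on a two-row frame. The frame and the variable names named and plain are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"data_1": [24.75, 25.00], "data_2": [25.03, 25.07]})

# Default: each row is a namedtuple whose fields are the column names.
named = list(df.itertuples(index=False))

# name=None: each row is an ordinary tuple.
plain = list(df.itertuples(index=False, name=None))

first_value = named[0].data_1  # attribute access works on namedtuples
```

Namedtuples are still tuples, so both forms can be unpacked or indexed by position; name=None simply skips creating the namedtuple class.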
Answer 2
A generic way:
[tuple(x) for x in data_set.to_records(index=False)]
Answer 3
Motivation
Many data sets are large enough that we need to concern ourselves with speed/efficiency. So I offer this solution in that spirit. It happens to also be succinct.
For the sake of comparison, let's drop the index column:
df = data_set.drop(columns='index')
Solution
I'll propose the use of zip and map:
list(zip(*map(df.get, df)))
[('2012-02-17', 24.75, 25.03),
('2012-02-16', 25.0, 25.07),
('2012-02-15', 24.99, 25.15),
('2012-02-14', 24.68, 25.05),
('2012-02-13', 24.62, 24.77),
('2012-02-10', 24.38, 24.61)]
It is also flexible if we want to work with a specific subset of columns. We'll assume the columns we've already displayed are the subset we want.
list(zip(*map(df.get, ['data_date', 'data_1', 'data_2'])))
[('2012-02-17', 24.75, 25.03),
('2012-02-16', 25.0, 25.07),
('2012-02-15', 24.99, 25.15),
('2012-02-14', 24.68, 25.05),
('2012-02-13', 24.62, 24.77),
('2012-02-10', 24.38, 24.61)]
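The one-liner is compact, so it may help to see it taken apart. In this sketch (on a made-up two-column frame), iterating over a DataFrame yields its column labels, df.get(label) returns the matching column, and zip(*columns) transposes the columns into row tuples.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Iterating a DataFrame yields column labels, so map(df.get, df)
# produces one Series per column.
columns = list(map(df.get, df))

# zip(*columns) pairs up the i-th element of every column,
# which is exactly the i-th row.
rows = list(zip(*columns))
```

Passing an explicit list of labels instead of df as the second argument to map selects just those columns, as in the subset example above.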
What is Quicker?
It turns out records is quickest, followed by zipmap and iter_tuples, which converge asymptotically.
I'll use the simple_benchmark library, which I got from this post.
from simple_benchmark import BenchmarkBuilder
import numpy as np
import pandas as pd

b = BenchmarkBuilder()

def tuple_comp(df): return [tuple(x) for x in df.to_numpy()]
def iter_namedtuples(df): return list(df.itertuples(index=False))
def iter_tuples(df): return list(df.itertuples(index=False, name=None))
def records(df): return df.to_records(index=False).tolist()
def zipmap(df): return list(zip(*map(df.get, df)))

funcs = [tuple_comp, iter_namedtuples, iter_tuples, records, zipmap]
for func in funcs:
    b.add_function()(func)

def creator(n):
    # np.random is used because the original snippet never imported `random`
    return pd.DataFrame({"A": np.random.randint(n, size=n), "B": np.random.randint(n, size=n)})

@b.add_arguments('Rows in DataFrame')
def argument_provider():
    for n in (10 ** (np.arange(4, 11) / 2)).astype(int):
        yield n, creator(n)

r = b.run()
Check the results
r.to_pandas_dataframe().pipe(lambda d: d.div(d.min(1), 0))
        tuple_comp  iter_namedtuples  iter_tuples   records    zipmap
100       2.905662          6.626308     3.450741  1.469471  1.000000
316       4.612692          4.814433     2.375874  1.096352  1.000000
1000      6.513121          4.106426     1.958293  1.000000  1.316303
3162      8.446138          4.082161     1.808339  1.000000  1.533605
10000     8.424483          3.621461     1.651831  1.000000  1.558592
31622     7.813803          3.386592     1.586483  1.000000  1.515478
100000    7.050572          3.162426     1.499977  1.000000  1.480131
r.plot()
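Before reading too much into a timing comparison, it is worth confirming that the candidates return the same rows. The sketch below does that for four of the functions on a small random integer frame. It uses NumPy's default_rng with a fixed seed, which is a choice made here for repeatability and not part of the benchmark above.

```python
import numpy as np
import pandas as pd

def tuple_comp(df): return [tuple(x) for x in df.to_numpy()]
def iter_tuples(df): return list(df.itertuples(index=False, name=None))
def records(df): return df.to_records(index=False).tolist()
def zipmap(df): return list(zip(*map(df.get, df)))

# Small random frame; the seed is arbitrary and only makes the run repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.integers(100, size=50), "B": rng.integers(100, size=50)})

# Use one implementation as the reference and compare the others to it.
expected = iter_tuples(df)
results_match = all(f(df) == expected for f in (tuple_comp, records, zipmap))
```

The values compare equal, although the element types can differ between approaches (NumPy scalars versus Python ints), which may matter to some database drivers.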
Answer 4
Here's a vectorized approach (assuming the dataframe data_set is defined as df instead) that returns a list of tuples as shown:
>>> df.set_index(['data_date'])[['data_1', 'data_2']].to_records().tolist()
produces:
[(datetime.datetime(2012, 2, 17, 0, 0), 24.75, 25.03),
(datetime.datetime(2012, 2, 16, 0, 0), 25.0, 25.07),
(datetime.datetime(2012, 2, 15, 0, 0), 24.99, 25.15),
(datetime.datetime(2012, 2, 14, 0, 0), 24.68, 25.05),
(datetime.datetime(2012, 2, 13, 0, 0), 24.62, 24.77),
(datetime.datetime(2012, 2, 10, 0, 0), 24.38, 24.61)]
The idea of setting the datetime column as the index is to aid the conversion of each Timestamp value to its datetime.datetime equivalent, by making use of the convert_datetime64 argument of DataFrame.to_records, which does so for a DataFrame with a DatetimeIndex. Note that convert_datetime64 was deprecated in pandas 0.23 and removed in later releases, so the output shown above depends on the pandas version.
This returns a recarray, which can then be converted to a list using .tolist().
A more general solution, depending on the use case, would be:
df.to_records().tolist() # Supply index=False to exclude index
Answer 5
The most efficient and easy way:
list(data_set.to_records())
You can filter the columns you need before this call. Note that to_records() includes the index by default; pass index=False to exclude it.
Answer 6
This answer doesn't add anything that hasn't already been discussed, but here are some speed results. I think this should resolve questions that came up in the comments. All of these look like they are O(n), based on these three values.
TL;DR: tuples = list(df.itertuples(index=False, name=None)) and tuples = list(zip(*[df[c].values.tolist() for c in df])) are tied for the fastest.
I did a quick speed test on three of the suggestions here:
- The zip answer from @pirsquared:
tuples = list(zip(*[df[c].values.tolist() for c in df]))
- The accepted answer from @wes-mckinney:
tuples = [tuple(x) for x in df.values]
- The itertuples answer from @ksindi with the name=None suggestion from @Axel:
tuples = list(df.itertuples(index=False, name=None))
from numpy import random
import pandas as pd
def create_random_df(n):
    return pd.DataFrame({"A": random.randint(n, size=n), "B": random.randint(n, size=n)})
Small size:
df = create_random_df(10000)
%timeit tuples = list(zip(*[df[c].values.tolist() for c in df]))
%timeit tuples = [tuple(x) for x in df.values]
%timeit tuples = list(df.itertuples(index=False, name=None))
Gives:
1.66 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
15.5 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.74 ms ± 75.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Larger:
df = create_random_df(1000000)
%timeit tuples = list(zip(*[df[c].values.tolist() for c in df]))
%timeit tuples = [tuple(x) for x in df.values]
%timeit tuples = list(df.itertuples(index=False, name=None))
Gives:
202 ms ± 5.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.52 s ± 98.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
209 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
As much patience as I have:
df = create_random_df(10000000)
%timeit tuples = list(zip(*[df[c].values.tolist() for c in df]))
%timeit tuples = [tuple(x) for x in df.values]
%timeit tuples = list(df.itertuples(index=False, name=None))
Gives:
1.78 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
15.4 s ± 222 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.68 s ± 96.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The zip version and the itertuples version are within each other's confidence intervals. I suspect that they are doing the same thing under the hood.
These speed tests are probably irrelevant though. Pushing the limits of my computer’s memory doesn’t take a huge amount of time, and you really shouldn’t be doing this on a large data set. Working with those tuples after doing this will end up being really inefficient. It’s unlikely to be a major bottleneck in your code, so just stick with the version you think is most readable.
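Since the original goal was a batch save to a database, here is a rough sketch of how the resulting tuples might be consumed. It uses Python's built-in sqlite3 module with an in-memory database. The table name prices, its schema, and the string dates are all assumptions made for illustration; the question does not say which database or driver is in use.

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({
    "data_date": ["2012-02-17", "2012-02-16"],
    "data_1": [24.75, 25.00],
    "data_2": [25.03, 25.07],
})

# Plain tuples, one per row, in column order.
rows = list(df.itertuples(index=False, name=None))

# Hypothetical in-memory database and table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (data_date TEXT, data_1 REAL, data_2 REAL)")

# executemany runs the parameterized INSERT once per tuple.
conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
conn.close()
```

Other DB-API drivers follow the same executemany pattern, though the placeholder style (? versus %s) varies by driver.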
Answer 7
# Try this one:
tuples = list(zip(data_set["data_date"], data_set["data_1"], data_set["data_2"]))
print(tuples)
Answer 8
Converting a DataFrame into a list of tuples:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
print(df)
OUTPUT
col1 col2
0 1 4
1 2 5
2 3 6
records = df.to_records(index=False)
result = list(records)
print(result)
OUTPUT
[(1, 4), (2, 5), (3, 6)]
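One detail worth knowing: although the printed output looks like a list of tuples, list(records) actually holds numpy.record objects. The sketch below, using the same frame, contrasts that with .tolist(), which yields plain Python tuples. The variable names as_records and as_tuples are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
records = df.to_records(index=False)

# Iterating the record array yields numpy.record objects.
as_records = list(records)

# .tolist() converts every row to a plain tuple of Python scalars.
as_tuples = records.tolist()
```

If downstream code checks for tuple instances or expects native Python ints, .tolist() may be the safer choice.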
Answer 9
A more Pythonic way:
df = data_set[['data_date', 'data_1', 'data_2']]
list(map(tuple, df.values))  # in Python 3, map returns a lazy iterator, so wrap it in list()