How to read only the first few rows into a pandas DataFrame

Question: How to read only the first few rows into a pandas DataFrame

Is there a built-in way to use read_csv to read only the first n lines of a file without knowing the length of the lines ahead of time? I have a large file that takes a long time to read, and occasionally only want to use the first, say, 20 lines to get a sample of it (and prefer not to load the full thing and take the head of it).

If I knew the total number of lines I could do something like footer_lines = total_lines - n and pass this to the skipfooter keyword arg. My current solution is to manually grab the first n lines with python and StringIO it to pandas:

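import pandas as pd
from io import StringIO
from itertools import islice

n = 20
with open('big_file.csv', 'r') as f:
    # islice takes the first n lines; f.readlines(n) would treat n as a size hint, not a line count
    head = ''.join(islice(f, n))

df = pd.read_csv(StringIO(head))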

It’s not that bad, but is there a more concise, ‘pandasic’ (?) way to do it with keywords or something?

Answer 0

I think you can use the nrows parameter. From the docs:

nrows : int, default None

    Number of rows of file to read. Useful for reading pieces of large files

which seems to work. Using one of the standard large test files (988504479 bytes, 5344499 lines):

In [1]: import pandas as pd

In [2]: time z = pd.read_csv("P00000001-ALL.csv", nrows=20)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s

In [3]: len(z)
Out[3]: 20

In [4]: time z = pd.read_csv("P00000001-ALL.csv")
CPU times: user 27.63 s, sys: 1.92 s, total: 29.55 s
Wall time: 30.23 s
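
For anyone who wants to try this without a large file at hand, here is a minimal sketch of the same nrows call. The in-memory CSV and the column names id and value are made up for illustration and stand in for the real file; the only pandas feature it relies on is the nrows keyword quoted above.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large file: one header line plus 1,000 data rows.
csv_text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))

# nrows counts data rows, so the header line is not one of the 20.
sample = pd.read_csv(io.StringIO(csv_text), nrows=20)

print(sample.shape)
```

Note that nrows counts data rows after the header, whereas grabbing the first 20 lines of the file by hand yields 19 data rows when the first line is a header.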
