Question: How can I filter rows during load with the pandas read_csv function?
How can I filter which lines of a CSV get loaded into memory using pandas? This seems like an option one should find in read_csv. Am I missing something?
Example: we have a CSV with a timestamp column, and we’d like to load just the lines with a timestamp greater than a given constant.
Answer 0
There isn’t an option to filter the rows before the CSV file is loaded into a pandas object.
You can either load the file and then filter using df[df['field'] > constant], or, if you have a very large file and are worried about running out of memory, use an iterator and apply the filter as you concatenate chunks of the file, e.g.:
import pandas as pd
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])
You can vary the chunksize to suit your available memory. See here for more details.
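Tying this back to the question’s timestamp example, here is a minimal sketch of the same chunked approach, assuming a hypothetical file.csv with a timestamp column and an illustrative cutoff value (neither is from the original answer):
import pandas as pd

cutoff = pd.Timestamp('2021-01-01')  # illustrative cutoff, adjust to your data

# Parse the timestamp column while reading in chunks,
# keeping only the rows past the cutoff from each chunk.
chunks = pd.read_csv('file.csv', parse_dates=['timestamp'], chunksize=1000)
df = pd.concat(chunk[chunk['timestamp'] > cutoff] for chunk in chunks)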
Answer 1
I didn’t find a straightforward way to do it within the context of read_csv. However, read_csv returns a DataFrame, which can be filtered by selecting rows with a boolean vector, df[bool_vec]:
filtered = df[(df['timestamp'] > targettime)]
This selects all rows in df (assuming df is any DataFrame, such as the result of a read_csv call, that contains at least a datetime column timestamp) for which the values in the timestamp column are greater than the value of targettime. Similar question.
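A minimal end-to-end sketch of this approach, where the file name, the parse_dates option, and the targettime value are illustrative assumptions rather than part of the original answer:
import pandas as pd

# Hypothetical input file; parse the timestamp column as datetimes on load.
df = pd.read_csv('file.csv', parse_dates=['timestamp'])

targettime = pd.Timestamp('2020-06-01')  # illustrative cutoff
filtered = df[df['timestamp'] > targettime]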
Answer 2
If the filtered range is contiguous (as it usually is with time(stamp) filters), then the fastest solution is to hard-code the range of rows. Simply combine the skiprows=range(1, start_row) and nrows=end_row parameters. The import then takes seconds where the accepted solution would take minutes. A few initial experiments to find start_row are not a huge cost given the savings in import time. Note that the header row is kept by using range(1, ..).
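A minimal sketch of this idea, with an illustrative file name and row bounds; note that nrows is a count of rows to read after the skipped block, not an absolute end position:
import pandas as pd

start_row = 100000  # illustrative: first data row of the contiguous range
num_rows = 50000    # illustrative: how many rows the range spans

# range(1, start_row) skips data rows while leaving row 0 (the header) intact;
# nrows then limits how many rows are read after the skipped block.
df = pd.read_csv('file.csv', skiprows=range(1, start_row), nrows=num_rows)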
Answer 3
If you are on Linux, you can use grep.
# works on either Python 2 or Python 3
import subprocess
import pandas as pd
from time import time  # not needed, just for timing
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

def zgrep_data(f, string):
    '''grep multiple items; f is the filepath, string is what you are filtering for'''
    grep = 'grep'  # change to zgrep for gzipped files
    print('{} for {} from {}'.format(grep, string, f))
    start_time = time()
    if string == '':
        # an empty pattern matches every line, so the header is included
        out = subprocess.check_output([grep, string, f])
        grep_data = StringIO(out.decode())
        data = pd.read_csv(grep_data, sep=',', header=0)
    else:
        # grep drops the header line, so read only the first row to get the
        # column names; may need to change depending on how the data is stored
        columns = pd.read_csv(f, sep=',', nrows=1, header=None).values.tolist()[0]
        out = subprocess.check_output([grep, string, f])
        grep_data = StringIO(out.decode())
        data = pd.read_csv(grep_data, sep=',', names=columns, header=None)
    print('{} finished for {} - {} seconds'.format(grep, f, time() - start_time))
    return data
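For example, a hypothetical call that keeps only the lines containing a given timestamp prefix (the file name and pattern are illustrative):
data = zgrep_data('file.csv', '2017-01')
Note that grep matches anywhere in the line, not just in the timestamp column, so the pattern can also hit other fields.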
Answer 4
You can specify the nrows parameter.
import pandas as pd
df = pd.read_csv('file.csv', nrows=100)
This code works well in pandas version 0.20.3.