Filter a Pandas DataFrame on dates

Question: Filter a Pandas DataFrame on dates

I have a Pandas DataFrame with a ‘date’ column. Now I need to filter out all rows in the DataFrame that have dates outside of the next two months. Essentially, I only need to retain the rows that are within the next two months.

What is the best way to achieve this?


Answer 0

If the date column is the index, use .loc for label-based indexing or .iloc for positional indexing.

For example:

df.loc['2014-01-01':'2014-02-01']

See details here http://pandas.pydata.org/pandas-docs/stable/dsintro.html#indexing-selection

If the column is not the index you have two choices:

  1. Make it the index (either temporarily or permanently if it’s time-series data)
  2. df[(df['date'] > '2013-01-01') & (df['date'] < '2013-02-01')]

See here for the general explanation

Note: .ix is deprecated (and was removed in pandas 1.0).
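
As a rough sketch of both options applied to the original question (keeping only the next two months): the column names, the sample data, and the fixed start date standing in for "today" are assumptions made for illustration, not part of the answer above.

```python
import pandas as pd

# Hypothetical sample data: one row per week for about half a year.
start = pd.Timestamp("2014-01-01")  # assumed stand-in for "today"
df = pd.DataFrame({
    "date": pd.date_range(start, periods=26, freq="7D"),
    "value": range(26),
})

# End of the two-month window (2014-03-01 here).
end = start + pd.DateOffset(months=2)

# Option 2: boolean mask on a regular column.
mask = (df["date"] >= start) & (df["date"] < end)
by_mask = df[mask]

# Option 1: make the column the index, then slice by label.
# Label slices on a DatetimeIndex include both endpoints.
by_index = df.set_index("date").sort_index().loc[start:end]
```

With real data you would probably replace the fixed start with something like pd.Timestamp.today().normalize(), and decide whether the end of the window should be inclusive.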


Answer 1

In my experience the previous answer is not correct: you can't pass it a simple string; it needs to be a datetime object. So:

import datetime 
df.loc[datetime.date(year=2014,month=1,day=1):datetime.date(year=2014,month=2,day=1)]

Answer 2

And if your dates are standardized with the datetime package, you can simply use:

df[(df['date']>datetime.date(2016,1,1)) & (df['date']<datetime.date(2016,3,1))]  

To standardize your date strings with the datetime package, you can use this function:

import datetime
datetime.datetime.strptime
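
A minimal sketch of how strptime might be applied before filtering. The sample strings and the day/month/year format are assumptions for illustration; adjust the format string to match your own data.

```python
import datetime

import pandas as pd

# Hypothetical data: dates stored as day/month/year strings.
df = pd.DataFrame({"date": ["15/12/2015", "10/01/2016", "20/02/2016", "05/03/2016"]})

# Parse each string into a datetime.date using an explicit format.
df["date"] = df["date"].map(
    lambda s: datetime.datetime.strptime(s, "%d/%m/%Y").date()
)

# Same comparison as in the answer, now against real date objects.
filtered = df[(df["date"] > datetime.date(2016, 1, 1))
              & (df["date"] < datetime.date(2016, 3, 1))]
```

Note that this leaves the column with object dtype holding datetime.date values, which is what the comparison in the answer expects.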

Answer 3

If your datetime column has the Pandas datetime type (e.g. datetime64[ns]), you need a pd.Timestamp object for proper filtering, for example:

from datetime import date

import pandas as pd

value_to_check = pd.Timestamp(date.today().year, 1, 1)
filter_mask = df['date_column'] < value_to_check
filtered_df = df[filter_mask]

Answer 4

If the dates are in the index then simply:

df['20160101':'20160301']

Answer 5

You can use pd.Timestamp in a query through a local reference:

from datetime import datetime

import pandas as pd
import numpy as np

df = pd.DataFrame()
ts = pd.Timestamp  # local alias, referenced as @ts inside the query string

df['date'] = np.array(np.arange(10) + datetime.now().timestamp(), dtype='M8[s]')

print(df)
print(df.query('date > @ts("20190515T071320")'))

with the output

                 date
0 2019-05-15 07:13:16
1 2019-05-15 07:13:17
2 2019-05-15 07:13:18
3 2019-05-15 07:13:19
4 2019-05-15 07:13:20
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25


                 date
5 2019-05-15 07:13:21
6 2019-05-15 07:13:22
7 2019-05-15 07:13:23
8 2019-05-15 07:13:24
9 2019-05-15 07:13:25

Have a look at the pandas documentation for DataFrame.query, specifically the part about referencing local variables with the @ prefix. In this case we reference pd.Timestamp through the local alias ts so that we can supply a timestamp string.
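
The same @ mechanism also works with ordinary local variables, which can be handy for a date window. A small sketch, assuming hypothetical start and end values and a column named date:

```python
import pandas as pd

# Hypothetical data: ten consecutive days starting 2019-05-01.
df = pd.DataFrame({"date": pd.date_range("2019-05-01", periods=10, freq="D")})

start = pd.Timestamp("2019-05-03")
end = pd.Timestamp("2019-05-06")

# @start and @end refer to the local Python variables defined above.
selected = df.query("date >= @start and date < @end")
```

Here selected holds the rows for May 3, 4 and 5, since the upper bound is exclusive.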


Answer 6

When loading the CSV data file, set the date column as the index, as below, so that you can filter the data by a range of dates. This was not needed with the now-deprecated pd.DataFrame.from_csv().

If you just want to show the data for two months, January and February, e.g. 2020-01-01 to 2020-02-29, you can do this:

import pandas as pd
# parse_dates=True makes the index a DatetimeIndex rather than plain strings
mydata = pd.read_csv('mydata.csv', index_col='date', parse_dates=True)  # or its position, e.g. index_col=[0]
mydata.loc['2020-01-01':'2020-02-29']  # pulls all the columns
# if you just need one column, e.g. Cost:
mydata.loc['2020-01-01':'2020-02-29', 'Cost']

This was tested with Python 3.7. Hope you find it useful.
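
To try this without a file on disk, here is a self-contained sketch that reads the CSV from an in-memory string. The CSV contents and the Units column are made up for illustration; only the date and Cost names come from the answer.

```python
import io

import pandas as pd

# Hypothetical CSV contents, standing in for 'mydata.csv'.
csv_text = """date,Cost,Units
2019-12-31,10,1
2020-01-15,20,2
2020-02-29,30,3
2020-03-01,40,4
"""

# parse_dates=True asks pandas to parse the index column as dates.
mydata = pd.read_csv(io.StringIO(csv_text), index_col="date", parse_dates=True)

# String slices on a DatetimeIndex include both endpoints.
two_months = mydata.loc["2020-01-01":"2020-02-29"]
cost_only = mydata.loc["2020-01-01":"2020-02-29", "Cost"]
```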


Answer 7

How about using pyjanitor?

It has cool features.

After pip install pyjanitor

import janitor

df_filtered = df.filter_date(your_date_column_name, start_date, end_date)

Answer 8

The shortest way to filter your dataframe by date, supposing your date column has type datetime64[ns]:

# filter by single day
df = df[df['date'].dt.strftime('%Y-%m-%d') == '2014-01-01']

# filter by single month
df = df[df['date'].dt.strftime('%Y-%m') == '2014-01']

# filter by single year
df = df[df['date'].dt.strftime('%Y') == '2014']
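
As a possible alternative that avoids formatting every value as a string, the same three filters can be sketched with the .dt accessors. The sample data is hypothetical, and this is a variation on the answer rather than the answer's own method.

```python
import pandas as pd

# Hypothetical data with a datetime64[ns] column.
df = pd.DataFrame({
    "date": pd.to_datetime(["2013-12-31", "2014-01-01", "2014-01-20", "2014-02-01"])
})

# Single day: normalize() drops the time-of-day part before comparing.
single_day = df[df["date"].dt.normalize() == pd.Timestamp("2014-01-01")]

# Single month: compare year and month as integers.
single_month = df[(df["date"].dt.year == 2014) & (df["date"].dt.month == 1)]

# Single year.
single_year = df[df["date"].dt.year == 2014]
```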

Answer 9

I'm not allowed to write comments yet, so I'll write an answer, in case somebody reads all of them and reaches this one.

If the index of the dataset is a datetime and you want to filter just by month (for example), you can do the following:

df.loc[df.index.month == 3]

That will keep only the rows from March.
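
A small runnable sketch of the month filter, using a made-up five-day DatetimeIndex that straddles the end of February:

```python
import pandas as pd

# Hypothetical index: 2021-02-27 through 2021-03-03.
idx = pd.date_range("2021-02-27", periods=5, freq="D")
df = pd.DataFrame({"value": range(5)}, index=idx)

# df.index.month is an integer per row; == 3 gives a boolean mask.
march = df.loc[df.index.month == 3]
```

Keep in mind that a month comparison like this matches March of every year in the index, so add a year condition if the data spans several years.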


Answer 10

You could just select the time range by doing: df.loc['start_date':'end_date']


Answer 11

If you have already converted the strings to datetimes using pd.to_datetime, you can just use:

df = df[(df['Date']> "2018-01-01") & (df['Date']< "2019-07-01")]
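
A short sketch of the full round trip, with hypothetical sample strings. It also shows Series.between, which by default includes both bounds, in case an inclusive window is what you want.

```python
import pandas as pd

# Hypothetical data: dates stored as ISO-formatted strings.
df = pd.DataFrame({
    "Date": ["2017-12-31", "2018-01-01", "2018-06-15", "2019-07-01", "2019-08-01"]
})
df["Date"] = pd.to_datetime(df["Date"])

# Strict bounds, as in the answer: both endpoints are excluded.
strict = df[(df["Date"] > "2018-01-01") & (df["Date"] < "2019-07-01")]

# between() includes both endpoints by default.
inclusive = df[df["Date"].between("2018-01-01", "2019-07-01")]
```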