问题:熊猫可以自动识别日期吗?

今天,我感到惊讶的是,pandas在从数据文件中读取数据时能够识别值的类型:

df = pandas.read_csv('test.dat', delimiter=r"\s+", names=['col1','col2','col3'])

例如,可以通过以下方式检查它:

for i, r in df.iterrows():
    print type(r['col1']), type(r['col2']), type(r['col3'])

特别是整数,浮点数和字符串可以正确识别。但是,我有一列的日期采用以下格式:2013-6-4。这些日期被识别为字符串(而不是python日期对象)。有没有一种方法可以“学习”熊猫到公认的日期?

Today I was positively surprised by the fact that while reading data from a data file (for example) pandas is able to recognize types of values:

df = pandas.read_csv('test.dat', delimiter=r"\s+", names=['col1','col2','col3'])

For example it can be checked in this way:

for i, r in df.iterrows():
    print type(r['col1']), type(r['col2']), type(r['col3'])

In particular integer, floats and strings were recognized correctly. However, I have a column that has dates in the following format: 2013-6-4. These dates were recognized as strings (not as python date-objects). Is there a way to “learn” pandas to recognized dates?


回答 0

您应该添加parse_dates=True,或者parse_dates=['column name']在阅读时通常足以神奇地解析它。但是总有一些奇怪的格式需要手动定义。在这种情况下,您还可以添加日期解析器功能,这是最灵活的方法。

假设您的字符串中有一列“ datetime”,然后:

dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)

这样,您甚至可以将多个列合并为一个datetime列,从而将一个“ date”和一个“ time”列合并为一个“ datetime”列:

dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)

您可以在此页面strptimestrftime 找到指令(即用于不同格式的字母)。

You should add parse_dates=True, or parse_dates=['column name'] when reading, thats usually enough to magically parse it. But there are always weird formats which need to be defined manually. In such a case you can also add a date parser function, which is the most flexible way possible.

Suppose you have a column ‘datetime’ with your string, then:

from datetime import datetime
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)

This way you can even combine multiple columns into a single datetime column, this merges a ‘date’ and a ‘time’ column into a single ‘datetime’ column:

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)

You can find directives (i.e. the letters to be used for different formats) for strptime and strftime in this page.


回答 1

自@Rutger回答以来,熊猫界面可能已更改,但是在我使用的版本(0.15.2)中,该date_parser函数接收日期列表,而不是单个值。在这种情况下,他的代码应该这样更新:

dateparse = lambda dates: [pd.datetime.strptime(d, '%Y-%m-%d %H:%M:%S') for d in dates]

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)

Perhaps the pandas interface has changed since @Rutger answered, but in the version I’m using (0.15.2), the date_parser function receives a list of dates instead of a single value. In this case, his code should be updated like so:

dateparse = lambda dates: [pd.datetime.strptime(d, '%Y-%m-%d %H:%M:%S') for d in dates]

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)

回答 2

pandas read_csv方法非常适合解析日期。完整的文档位于http://pandas.pydata.org/pandas-docs/stable/genic/pandas.io.parsers.read_csv.html

您甚至可以在不同的列中包含不同的日期部分,并传递参数:

parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a
separate date column. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date
column. {‘foo : [1, 3]} -> parse columns 1, 3 as date and call result foo

默认的日期检测效果很好,但似乎偏向于北美日期格式。如果您住在其他地方,您可能偶尔会被结果所吸引。据我所知,2000年1月6日是美国的1月6日,而不是我居住的6月1日。如果使用了2000年6月23日这样的日期,它足够聪明地摆弄它们。不过,使用YYYYMMDD日期变化可能更安全。向熊猫开发者表示歉意,但是最近我还没有在当地进行测试。

您可以使用date_parser参数传递一个函数来转换格式。

date_parser : function
Function to use for converting a sequence of string columns to an array of datetime
instances. The default uses dateutil.parser.parser to do the conversion.

pandas read_csv method is great for parsing dates. Complete documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

you can even have the different date parts in different columns and pass the parameter:

parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a
separate date column. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date
column. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’

The default sensing of dates works great, but it seems to be biased towards north american Date formats. If you live elsewhere you might occasionally be caught by the results. As far as I can remember 1/6/2000 means 6 January in the USA as opposed to 1 Jun where I live. It is smart enough to swing them around if dates like 23/6/2000 are used. Probably safer to stay with YYYYMMDD variations of date though. Apologies to pandas developers,here but i have not tested it with local dates recently.

you can use the date_parser parameter to pass a function to convert your format.

date_parser : function
Function to use for converting a sequence of string columns to an array of datetime
instances. The default uses dateutil.parser.parser to do the conversion.

回答 3

您可以pandas.to_datetime()按照文档中的建议使用pandas.read_csv()

如果列或索引包含不可解析的日期,则整个列或索引将按原样作为对象数据类型返回。对于非标准的日期时间解析,请pd.to_datetime在之后使用pd.read_csv

演示:

>>> D = {'date': '2013-6-4'}
>>> df = pd.DataFrame(D, index=[0])
>>> df
       date
0  2013-6-4
>>> df.dtypes
date    object
dtype: object
>>> df['date'] = pd.to_datetime(df.date, format='%Y-%m-%d')
>>> df
        date
0 2013-06-04
>>> df.dtypes
date    datetime64[ns]
dtype: object

You could use pandas.to_datetime() as recommended in the documentation for pandas.read_csv():

If a column or index contains an unparseable date, the entire column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv.

Demo:

>>> D = {'date': '2013-6-4'}
>>> df = pd.DataFrame(D, index=[0])
>>> df
       date
0  2013-6-4
>>> df.dtypes
date    object
dtype: object
>>> df['date'] = pd.to_datetime(df.date, format='%Y-%m-%d')
>>> df
        date
0 2013-06-04
>>> df.dtypes
date    datetime64[ns]
dtype: object

回答 4

将两列合并为一个datetime列时,可接受的答案会产生错误(pandas版本0.20.3),因为这些列分别发送到date_parser函数。

以下作品:

def dateparse(d,t):
    dt = d + " " + t
    return pd.datetime.strptime(dt, '%d/%m/%Y %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)

When merging two columns into a single datetime column, the accepted answer generates an error (pandas version 0.20.3), since the columns are sent to the date_parser function separately.

The following works:

def dateparse(d,t):
    dt = d + " " + t
    return pd.datetime.strptime(dt, '%d/%m/%Y %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)

回答 5

是的-根据pandas.read_csv 文档

注意:存在iso8601格式日期的快速路径。

因此,如果您的csv有一个名为的列datetime,并且日期看起来像2013-01-01T01:01例如,运行此命令将使熊猫(我在v0.19.2上)自动获取日期和时间:

df = pd.read_csv('test.csv', parse_dates=['datetime'])

请注意,您需要显式传递parse_dates,否则将无法正常运行。

验证:

df.dtypes

您应该看到列的数据类型是 datetime64[ns]

Yes – according to the pandas.read_csv documentation:

Note: A fast-path exists for iso8601-formatted dates.

So if your csv has a column named datetime and the dates looks like 2013-01-01T01:01 for example, running this will make pandas (I’m on v0.19.2) pick up the date and time automatically:

df = pd.read_csv('test.csv', parse_dates=['datetime'])

Note that you need to explicitly pass parse_dates, it doesn’t work without.

Verify with:

df.dtypes

You should see the datatype of the column is datetime64[ns]


回答 6

如果性能对您很重要,请确保您有时间:

import sys
import timeit
import pandas as pd

print('Python %s on %s' % (sys.version, sys.platform))
print('Pandas version %s' % pd.__version__)

repeat = 3
numbers = 100

def time(statement, _setup=None):
    print (min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers)))

print("Format %m/%d/%y")
setup = """import pandas as pd
import io

data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,07/29/15
x2,07/29/15
x3,07/29/15
x4,07/30/15
x5,07/29/15
x6,07/29/15
x7,07/29/15
y7,08/05/15
x8,08/05/15
z3,08/05/15
''' * 100)"""

time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'date_parser=lambda x: pd.datetime.strptime(x, "%m/%d/%y")); data.seek(0)')

print("Format %Y-%m-%d %H:%M:%S")
setup = """import pandas as pd
import io

data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,2016-10-15 00:00:43
x2,2016-10-15 00:00:56
x3,2016-10-15 00:00:56
x4,2016-10-15 00:00:12
x5,2016-10-15 00:00:34
x6,2016-10-15 00:00:55
x7,2016-10-15 00:00:06
y7,2016-10-15 00:00:01
x8,2016-10-15 00:00:00
z3,2016-10-15 00:00:02
''' * 1000)"""

time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'date_parser=lambda x: pd.datetime.strptime(x, "%Y-%m-%d %H:%M:%S")); data.seek(0)')

印刷品:

Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28) 
[Clang 6.0 (clang-600.0.57)] on darwin
Pandas version 0.23.4
Format %m/%d/%y
0.19123052499999993
8.20691274
8.143124389
1.2384357139999977
Format %Y-%m-%d %H:%M:%S
0.5238807110000039
0.9202787830000005
0.9832778819999959
12.002349824999996

因此,与ISO8601格式的日期(%Y-%m-%d %H:%M:%S显然是一个ISO8601格式的日期,我猜的T 可以被丢弃,并用空格代替),你应该指定infer_datetime_format(不使更多常见的两种明显的差异),并通过自己的解析器只会破坏性能。另一方面,date_parser与标准日期格式相比确实有所不同。像往常一样,请务必先确定时间再进行优化。

If performance matters to you make sure you time:

import sys
import timeit
import pandas as pd

print('Python %s on %s' % (sys.version, sys.platform))
print('Pandas version %s' % pd.__version__)

repeat = 3
numbers = 100

def time(statement, _setup=None):
    print (min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers)))

print("Format %m/%d/%y")
setup = """import pandas as pd
import io

data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,07/29/15
x2,07/29/15
x3,07/29/15
x4,07/30/15
x5,07/29/15
x6,07/29/15
x7,07/29/15
y7,08/05/15
x8,08/05/15
z3,08/05/15
''' * 100)"""

time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'date_parser=lambda x: pd.datetime.strptime(x, "%m/%d/%y")); data.seek(0)')

print("Format %Y-%m-%d %H:%M:%S")
setup = """import pandas as pd
import io

data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,2016-10-15 00:00:43
x2,2016-10-15 00:00:56
x3,2016-10-15 00:00:56
x4,2016-10-15 00:00:12
x5,2016-10-15 00:00:34
x6,2016-10-15 00:00:55
x7,2016-10-15 00:00:06
y7,2016-10-15 00:00:01
x8,2016-10-15 00:00:00
z3,2016-10-15 00:00:02
''' * 1000)"""

time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'date_parser=lambda x: pd.datetime.strptime(x, "%Y-%m-%d %H:%M:%S")); data.seek(0)')

prints:

Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28) 
[Clang 6.0 (clang-600.0.57)] on darwin
Pandas version 0.23.4
Format %m/%d/%y
0.19123052499999993
8.20691274
8.143124389
1.2384357139999977
Format %Y-%m-%d %H:%M:%S
0.5238807110000039
0.9202787830000005
0.9832778819999959
12.002349824999996

So with iso8601-formatted date (%Y-%m-%d %H:%M:%S is apparently an iso8601-formatted date, I guess the T can be dropped and replaced by a space) you should not specify infer_datetime_format (which does not make a difference with more common ones either apparently) and passing your own parser in just cripples performance. On the other hand, date_parser does make a difference with not so standard day formats. Be sure to time before you optimize, as usual.


回答 7

加载csv文件中包含date列时,我们有两种方法可以使熊猫识别date列,即

  1. 熊猫通过arg明确识别格式 date_parser=mydateparser

  2. 熊猫通过AGR隐式识别格式 infer_datetime_format=True

一些日期列数据

18/01/18

18/02/02

这里我们不知道前两件事,可能是一个月或一天。因此,在这种情况下,我们必须使用方法1:-显式传递格式

    mydateparser = lambda x: pd.datetime.strptime(x, "%m/%d/%y")
    df = pd.read_csv(file_name, parse_dates=['date_col_name'],
date_parser=mydateparser)

方法2:-隐式或自动识别格式

df = pd.read_csv(file_name, parse_dates=[date_col_name],infer_datetime_format=True)

While loading csv file contain date column.We have two approach to to make pandas to recognize date column i.e

  1. Pandas explicit recognize the format by arg date_parser=mydateparser

  2. Pandas implicit recognize the format by agr infer_datetime_format=True

Some of the date column data

01/01/18

01/02/18

Here we don’t know the first two things It may be month or day. So in this case we have to use Method 1:- Explicit pass the format

    mydateparser = lambda x: pd.datetime.strptime(x, "%m/%d/%y")
    df = pd.read_csv(file_name, parse_dates=['date_col_name'],
date_parser=mydateparser)

Method 2:- Implicit or Automatically recognize the format

df = pd.read_csv(file_name, parse_dates=[date_col_name],infer_datetime_format=True)

声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。