熊猫中的datetime dtypes read_csv-Python 实用宝典

熊猫中的datetime dtypes read_csv

我正在读取具有多个datetime列的csv文件。我需要在读取文件时设置数据类型,但是日期时间似乎是个问题。例如: headers = ['col1', 'col2', 'col3', 'col4'] dtypes = ['datetime', 'datetime', 'str', 'float'] pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes) 运行时出现错误: TypeError:不了解数据类型“ datetime” 事后通过pandas.to_datetime()转换列不是一个选项,我不知道哪些列将是datetime对象。该信息可以更改,并且可以从通知我的dtypes列表的任何信息中获取。 另外,我尝试用numpy.genfromtxt加载csv文件,在该函数中设置dtypes,然后转换为pandas.dataframe,但它会使数据乱码。任何帮助是极大的赞赏!

问题:熊猫中的datetime dtypes read_csv

我正在读取具有多个datetime列的csv文件。我需要在读取文件时设置数据类型,但是日期时间似乎是个问题。例如:

headers = ['col1', 'col2', 'col3', 'col4']
dtypes = ['datetime', 'datetime', 'str', 'float']
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

运行时出现错误:

TypeError:不了解数据类型“ datetime”

事后通过pandas.to_datetime()转换列不是一个选项,我不知道哪些列将是datetime对象。该信息可以更改,并且可以从通知我的dtypes列表的任何信息中获取。

另外,我尝试用numpy.genfromtxt加载csv文件,在该函数中设置dtypes,然后转换为pandas.dataframe,但它会使数据乱码。任何帮助是极大的赞赏!

I'm reading in a csv file with multiple datetime columns. I'd need to set the data types upon reading in the file, but datetimes appear to be a problem. For instance:

headers = ['col1', 'col2', 'col3', 'col4']
dtypes = ['datetime', 'datetime', 'str', 'float']
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

When run gives a error:

TypeError: data type "datetime" not understood

Converting columns after the fact, via pandas.to_datetime() isn't an option I can't know which columns will be datetime objects. That information can change and comes from whatever informs my dtypes list.

Alternatively, I've tried to load the csv file with numpy.genfromtxt, set the dtypes in that function, and then convert to a pandas.dataframe but it garbles the data. Any help is greatly appreciated!


回答 0

为什么它不起作用

没有为read_csv设置datetime dtype,因为csv文件只能包含字符串,整数和浮点数。

将dtype设置为datetime将使熊猫将datetime解释为对象,这意味着您将以字符串结尾。

熊猫解决这个问题的方法

函数具有名为parse_dates

使用此功能,您可以使用默认date_parserdateutil.parser.parser)快速将字符串,浮点数或整数转换为日期时间

headers = ['col1', 'col2', 'col3', 'col4']
dtypes = {'col1': 'str', 'col2': 'str', 'col3': 'str', 'col4': 'float'}
parse_dates = ['col1', 'col2']
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes, parse_dates=parse_dates)

这将导致熊猫读取col1col2作为字符串,它们很可能是字符串(“ 2016-05-05”等),并且在读取字符串之后,每一列的date_parser都会对该字符串起作用,并返回该函数返回的任何内容。

定义自己的日期解析功能:

函数具有名为date_parser

将其设置为lambda函数将使该特定函数可用于日期解析。

GOTCHA警告

您必须为其提供功能,而不是功能的执行,因此这是正确的

date_parser = pd.datetools.to_datetime

这是不正确的

date_parser = pd.datetools.to_datetime()

熊猫0.22更新

pd.datetools.to_datetime 已移至 date_parser = pd.to_datetime

谢谢@stackoverYC

Why it does not work

There is no datetime dtype to be set for read_csv as csv files can only contain strings, integers and floats.

Setting a dtype to datetime will make pandas interpret the datetime as an object, meaning you will end up with a string.

Pandas way of solving this

The function has a keyword argument called parse_dates

Using this you can on the fly convert strings, floats or integers into datetimes using the default date_parser (dateutil.parser.parser)

headers = ['col1', 'col2', 'col3', 'col4']
dtypes = {'col1': 'str', 'col2': 'str', 'col3': 'str', 'col4': 'float'}
parse_dates = ['col1', 'col2']
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes, parse_dates=parse_dates)

This will cause pandas to read col1 and col2 as strings, which they most likely are ("2016-05-05" etc.) and after having read the string, the date_parser for each column will act upon that string and give back whatever that function returns.

Defining your own date parsing function:

The function also has a keyword argument called date_parser

Setting this to a lambda function will make that particular function be used for the parsing of the dates.

GOTCHA WARNING

You have to give it the function, not the execution of the function, thus this is Correct

date_parser = pd.datetools.to_datetime

This is incorrect:

date_parser = pd.datetools.to_datetime()

Pandas 0.22 Update

pd.datetools.to_datetime has been relocated to date_parser = pd.to_datetime

Thanks @stackoverYC


回答 1

有一个parse_dates参数read_csv可让您定义要视为日期或日期时间的列的名称:

date_cols = ['col1', 'col2']
pd.read_csv(file, sep='\t', header=None, names=headers, parse_dates=date_cols)

There is a parse_dates parameter for read_csv which allows you to define the names of the columns you want treated as dates or datetimes:

date_cols = ['col1', 'col2']
pd.read_csv(file, sep='\t', header=None, names=headers, parse_dates=date_cols)

回答 2

您可以尝试传递实际类型而不是字符串。

import pandas as pd
from datetime import datetime
headers = ['col1', 'col2', 'col3', 'col4'] 
dtypes = [datetime, datetime, str, float] 
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

但是,如果没有任何可修改的数据,将很难诊断出来。

实际上,您可能希望熊猫将日期解析为时间戳记,因此可能是:

pd.read_csv(file, sep='\t', header=None, names=headers, parse_dates=True)

You might try passing actual types instead of strings.

import pandas as pd
from datetime import datetime
headers = ['col1', 'col2', 'col3', 'col4'] 
dtypes = [datetime, datetime, str, float] 
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

But it's going to be really hard to diagnose this without any of your data to tinker with.

And really, you probably want pandas to parse the the dates into TimeStamps, so that might be:

pd.read_csv(file, sep='\t', header=None, names=headers, parse_dates=True)

回答 3

我尝试使用dtypes = [datetime,...]选项,但是

import pandas as pd
from datetime import datetime
headers = ['col1', 'col2', 'col3', 'col4'] 
dtypes = [datetime, datetime, str, float] 
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

我遇到以下错误:

TypeError: data type not understood

我唯一要做的更改是将datetime替换为datetime.datetime

import pandas as pd
from datetime import datetime
headers = ['col1', 'col2', 'col3', 'col4'] 
dtypes = [datetime.datetime, datetime.datetime, str, float] 
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

I tried using the dtypes=[datetime, ...] option, but

import pandas as pd
from datetime import datetime
headers = ['col1', 'col2', 'col3', 'col4'] 
dtypes = [datetime, datetime, str, float] 
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

I encountered the following error:

TypeError: data type not understood

The only change I had to make is to replace datetime with datetime.datetime

import pandas as pd
from datetime import datetime
headers = ['col1', 'col2', 'col3', 'col4'] 
dtypes = [datetime.datetime, datetime.datetime, str, float] 
pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes)

本文由 Python 实用宝典 作者:Python实用宝典 发表,其版权均为 Python 实用宝典 所有,文章内容系作者个人观点,不代表 Python 实用宝典 对观点赞同或支持。如需转载,请注明文章来源。
17

抱歉,评论已关闭!