Tag archive: series

Conditional replacement in pandas

Question: Conditional replacement in pandas

I have a DataFrame, and I want to replace the values in a particular column that exceed a value with zero. I had thought this was a way of achieving this:

df[df.my_channel > 20000].my_channel = 0

If I copy the channel into a new data frame it’s simple:

df2 = df.my_channel 

df2[df2 > 20000] = 0

This does exactly what I want, but seems not to work with the channel as part of the original DataFrame.


Answer 0

The .ix indexer works okay for pandas versions prior to 0.20.0, but since pandas 0.20.0 the .ix indexer is deprecated, so you should avoid using it. Instead, you can use the .loc or .iloc indexers. You can solve this problem with:

mask = df.my_channel > 20000
column_name = 'my_channel'
df.loc[mask, column_name] = 0

Or, in one line,

df.loc[df.my_channel > 20000, 'my_channel'] = 0

mask helps you to select the rows in which df.my_channel > 20000 is True, while df.loc[mask, column_name] = 0 sets the value 0 in the selected rows where mask holds, in the column whose name is column_name.

Update: In this case, you should use loc because if you use iloc, you will get a NotImplementedError telling you that iLocation based boolean indexing on an integer type is not available.
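
For concreteness, a minimal sketch of that difference (the frame and its values below are made up for illustration):

import pandas as pd

df = pd.DataFrame({'my_channel': [100, 25000, 300, 30000]})

mask = df.my_channel > 20000
df.loc[mask, 'my_channel'] = 0   # works: rows 1 and 3 become 0
# df.iloc[mask, 0] = 0           # fails: older pandas raises
#                                # NotImplementedError for a boolean
#                                # Series mask passed to iloc
print(df)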


Answer 1

Try

df.loc[df.my_channel > 20000, 'my_channel'] = 0

Note: Since v0.20.0, ix has been deprecated in favour of loc / iloc.


Answer 2

np.where function works as follows:

df['X'] = np.where(df['Y']>=50, 'yes', 'no')

In your case you would want:

import numpy as np
df['my_channel'] = np.where(df.my_channel > 20000, 0, df.my_channel)

Answer 3

The reason your original dataframe does not update is because chained indexing may cause you to modify a copy rather than a view of your dataframe. The docs give this advice:

When setting values in a pandas object, care must be taken to avoid what is called chained indexing.

You have a few alternatives:

loc + Boolean indexing

loc may be used for setting values and supports Boolean masks:

df.loc[df['my_channel'] > 20000, 'my_channel'] = 0

mask + Boolean indexing

You can assign to your series:

df['my_channel'] = df['my_channel'].mask(df['my_channel'] > 20000, 0)

Or you can update your series in place:

df['my_channel'].mask(df['my_channel'] > 20000, 0, inplace=True)

np.where + Boolean indexing

You can use NumPy by assigning your original series when your condition is not satisfied; however, the first two solutions are cleaner since they explicitly change only specified values.

df['my_channel'] = np.where(df['my_channel'] > 20000, 0, df['my_channel'])
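
To see the chained-indexing problem in action, a short sketch (values made up):

import pandas as pd

df = pd.DataFrame({'my_channel': [100, 25000, 300]})

# Chained indexing: df[...] may return a copy, so the assignment can be
# lost, and pandas typically emits a SettingWithCopyWarning.
df[df.my_channel > 20000].my_channel = 0
print(df.my_channel.tolist())   # [100, 25000, 300] -- unchanged

# A single .loc call is one indexing operation and modifies df itself.
df.loc[df.my_channel > 20000, 'my_channel'] = 0
print(df.my_channel.tolist())   # [100, 0, 300]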

Answer 4

I would use a lambda function on a Series of a DataFrame like this:

f = lambda x: 0 if x>100 else 1
df['my_column'] = df['my_column'].map(f)

I do not assert that this is an efficient way, but it works fine.


Answer 5

Try this:

df.my_channel = df.my_channel.where(df.my_channel <= 20000, other=0)

or

df.my_channel = df.my_channel.mask(df.my_channel > 20000, other=0)


Combine Date and Time columns using python pandas

Question: Combine Date and Time columns using python pandas

I have a pandas dataframe with the following columns;

Date              Time
01-06-2013      23:00:00
02-06-2013      01:00:00
02-06-2013      21:00:00
02-06-2013      22:00:00
02-06-2013      23:00:00
03-06-2013      01:00:00
03-06-2013      21:00:00
03-06-2013      22:00:00
03-06-2013      23:00:00
04-06-2013      01:00:00

How do I combine data['Date'] & data['Time'] to get the following? Is there a way of doing it using pd.to_datetime?

Date
01-06-2013 23:00:00
02-06-2013 01:00:00
02-06-2013 21:00:00
02-06-2013 22:00:00
02-06-2013 23:00:00
03-06-2013 01:00:00
03-06-2013 21:00:00
03-06-2013 22:00:00
03-06-2013 23:00:00
04-06-2013 01:00:00

Answer 0

It's worth mentioning that you may have been able to read this in directly, e.g. if you were using read_csv, by passing parse_dates=[['Date', 'Time']].

Assuming these are just strings you could simply add them together (with a space), allowing you to apply to_datetime:

In [11]: df['Date'] + ' ' + df['Time']
Out[11]:
0    01-06-2013 23:00:00
1    02-06-2013 01:00:00
2    02-06-2013 21:00:00
3    02-06-2013 22:00:00
4    02-06-2013 23:00:00
5    03-06-2013 01:00:00
6    03-06-2013 21:00:00
7    03-06-2013 22:00:00
8    03-06-2013 23:00:00
9    04-06-2013 01:00:00
dtype: object

In [12]: pd.to_datetime(df['Date'] + ' ' + df['Time'])
Out[12]:
0   2013-01-06 23:00:00
1   2013-02-06 01:00:00
2   2013-02-06 21:00:00
3   2013-02-06 22:00:00
4   2013-02-06 23:00:00
5   2013-03-06 01:00:00
6   2013-03-06 21:00:00
7   2013-03-06 22:00:00
8   2013-03-06 23:00:00
9   2013-04-06 01:00:00
dtype: datetime64[ns]

Note: surprisingly (for me), this works fine with NaNs being converted to NaT, but it is worth checking the conversion (perhaps using the errors='raise' argument).
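
A small hedged check of that note (made-up values; dayfirst=True is assumed from the sample data's day-first format):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': ['01-06-2013', np.nan],
                   'Time': ['23:00:00', '01:00:00']})

# NaN propagates through the string concatenation and becomes NaT.
combined = pd.to_datetime(df['Date'] + ' ' + df['Time'], dayfirst=True)
print(combined)
# 0   2013-06-01 23:00:00
# 1                   NaT
# dtype: datetime64[ns]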


Answer 1

The accepted answer works for columns that are of datatype string. For completeness: I came across this question when searching for how to do this when the columns have the datatypes date and time.

df.apply(lambda r : pd.datetime.combine(r['date_column_name'],r['time_column_name']),1)
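
Note that pd.datetime is an alias of the standard library's datetime.datetime and has been deprecated in newer pandas releases; an equivalent sketch using the standard library directly (the column names are the same placeholders as in the answer above):

import datetime

df.apply(lambda r: datetime.datetime.combine(r['date_column_name'],
                                             r['time_column_name']), axis=1)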

Answer 2

You can use this to merge date and time into the same column of dataframe.

import pandas as pd    
data_file = 'data.csv' #path of your file

Reading the .csv file with the merged column Date_Time:

data = pd.read_csv(data_file, parse_dates=[['Date', 'Time']]) 

You can use this line to also keep the two original columns:

data.set_index(['Date', 'Time'], drop=False)

Answer 3

You can cast the columns if the types are different (datetime and timestamp or str) and use to_datetime:

df.loc[:,'Date'] = pd.to_datetime(df.Date.astype(str)+' '+df.Time.astype(str))

Result:

0   2013-01-06 23:00:00
1   2013-02-06 01:00:00
2   2013-02-06 21:00:00
3   2013-02-06 22:00:00
4   2013-02-06 23:00:00
5   2013-03-06 01:00:00
6   2013-03-06 21:00:00
7   2013-03-06 22:00:00
8   2013-03-06 23:00:00
9   2013-04-06 01:00:00


Answer 4

I don’t have enough reputation to comment on jka.ne so:

I had to amend jka.ne’s line for it to work:

df.apply(lambda r : pd.datetime.combine(r['date_column_name'],r['time_column_name']).time(),1)

This might help others.

Also, I have tested a different approach, using replace instead of combine:

def combine_date_time(df, datecol, timecol):
    return df.apply(lambda row: row[datecol].replace(
                                hour=row[timecol].hour,
                                minute=row[timecol].minute),
                    axis=1)

which in the OP’s case would be:

combine_date_time(df, 'Date', 'Time')

I have timed both approaches for a relatively large dataset (>500,000 rows), and they both have similar runtimes, but using combine is faster (59s for replace vs 50s for combine).


Answer 5

The answer really depends on what your column types are. In my case, I had datetime and timedelta.

> df[['Date','Time']].dtypes
Date     datetime64[ns]
Time    timedelta64[ns]

If this is your case, then you just need to add the columns:

> df['Date'] + df['Time']

Answer 6

You can also convert to datetime without string concatenation, by combining datetime and timedelta objects. Combined with pd.DataFrame.pop, you can remove the source series simultaneously:

df['DateTime'] = pd.to_datetime(df.pop('Date')) + pd.to_timedelta(df.pop('Time'))

print(df)

             DateTime
0 2013-01-06 23:00:00
1 2013-02-06 01:00:00
2 2013-02-06 21:00:00
3 2013-02-06 22:00:00
4 2013-02-06 23:00:00
5 2013-03-06 01:00:00
6 2013-03-06 21:00:00
7 2013-03-06 22:00:00
8 2013-03-06 23:00:00
9 2013-04-06 01:00:00

print(df.dtypes)

DateTime    datetime64[ns]
dtype: object

Answer 7

First make sure to have the right data types:

df["Date"] = pd.to_datetime(df["Date"])
df["Time"] = pd.to_timedelta(df["Time"])

Then you can easily combine them:

df["DateTime"] = df["Date"] + df["Time"]

Answer 8

Use the combine function:

datetime.datetime.combine(date, time)
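
A minimal sketch of applying this row-wise, assuming the Date column holds datetime.date objects and the Time column holds datetime.time objects (the sample values are made up):

import datetime
import pandas as pd

df = pd.DataFrame({
    'Date': [datetime.date(2013, 6, 1), datetime.date(2013, 6, 2)],
    'Time': [datetime.time(23, 0), datetime.time(1, 0)],
})

# Combine each date with its time into a single datetime per row.
df['DateTime'] = df.apply(
    lambda row: datetime.datetime.combine(row['Date'], row['Time']),
    axis=1)
print(df['DateTime'])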

Answer 9

My dataset had 1-second resolution data for a few days, and parsing by the methods suggested here was very slow. Instead I used:

dates = pandas.to_datetime(df.Date, cache=True)
times = pandas.to_timedelta(df.Time)
datetimes  = dates + times

Note that the use of cache=True makes parsing the dates very efficient, since there are only a couple of unique dates in my files; this would not be true for a combined date and time column.


Answer 10

DATA:

<TICKER>,<PER>,<DATE>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>
SPFB.RTS,1,20190103,100100,106580.0000000,107260.0000000,106570.0000000,107230.0000000,3726

CODE:

data.columns = ['ticker', 'per', 'date', 'time', 'open', 'high', 'low', 'close', 'vol']    
data['datetime'] = pd.to_datetime(data.date.astype(str) + ' ' + data.time.astype(str), format='%Y%m%d %H%M%S')

Convert pandas DataFrame to Series

Question: Convert pandas DataFrame to Series

I’m somewhat new to pandas. I have a pandas data frame that is 1 row by 23 columns.

I want to convert this into a series. I'm wondering what the most pythonic way to do this is?

I’ve tried pd.Series(myResults) but it complains ValueError: cannot copy sequence with size 23 to array axis with dimension 1. It’s not smart enough to realize it’s still a “vector” in math terms.

Thanks!


Answer 0

It’s not smart enough to realize it’s still a “vector” in math terms.

Say rather that it’s smart enough to recognize a difference in dimensionality. :-)

I think the simplest thing you can do is select that row positionally using iloc, which gives you a Series with the columns as the new index and the values as the values:

>>> df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])
>>> df
   a0  a1  a2  a3  a4
0   0   1   2   3   4
>>> df.iloc[0]
a0    0
a1    1
a2    2
a3    3
a4    4
Name: 0, dtype: int64
>>> type(_)
<class 'pandas.core.series.Series'>

Answer 1

You can transpose the single-row dataframe (which still results in a dataframe) and then squeeze the results into a series (the inverse of to_frame).

df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])

>>> df.T.squeeze()  # Or more simply, df.squeeze() for a single row dataframe.
a0    0
a1    1
a2    2
a3    3
a4    4
Name: 0, dtype: int64

Note: To accommodate the point raised by @IanS (even though it is not in the OP's question), test for the dataframe's size. I am assuming that df is a dataframe, but the edge cases are an empty dataframe, a dataframe of shape (1, 1), and a dataframe with more than one row, in which case the user should implement their desired functionality.

if df.empty:
    # Empty dataframe, so convert to empty Series.
    result = pd.Series()
elif df.shape == (1, 1):
    # DataFrame with one value, so convert to series with appropriate index.
    result = pd.Series(df.iat[0, 0], index=df.columns)
elif len(df) == 1:
    # Convert to series per OP's question.
    result = df.T.squeeze()
else:
    # Dataframe with multiple rows.  Implement desired behavior.
    pass

This can also be simplified along the lines of the answer provided by @themachinist.

if len(df) > 1:
    # Dataframe with multiple rows.  Implement desired behavior.
    pass
else:
    result = pd.Series() if df.empty else df.iloc[0, :]

Answer 2

You can retrieve the series through slicing your dataframe using one of these two methods:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html

import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randn(1,8))

series1=df.iloc[0,:]
type(series1)
pandas.core.series.Series

Answer 3

Another way –

Suppose myResult is the dataFrame that contains your data in the form of 1 col and 23 rows

# label your columns by passing a list of names
myResult.columns = ['firstCol']

# fetch the column in this way, which will return you a series
myResult = myResult['firstCol']

print(type(myResult))

In a similar fashion, you can get a series from a Dataframe with multiple columns.


Answer 4

You can also use stack()

df = pd.DataFrame([list(range(5))], columns=["a{}".format(i) for i in range(5)])

After you run df, then run:

df.stack()

You obtain your dataframe as a series.


Answer 5

data = pd.DataFrame({"a":[1,2,3,34],"b":[5,6,7,8]})
new_data = pd.melt(data)
new_data.set_index("variable", inplace=True)

This gives a dataframe whose index holds the column names of data, with all the data present in the "value" column.


Strings in a DataFrame, but dtype is object

Question: Strings in a DataFrame, but dtype is object

Why does Pandas tell me that I have objects, although every item in the selected column is a string — even after explicit conversion?

This is my DataFrame:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56992 entries, 0 to 56991
Data columns (total 7 columns):
id            56992  non-null values
attr1         56992  non-null values
attr2         56992  non-null values
attr3         56992  non-null values
attr4         56992  non-null values
attr5         56992  non-null values
attr6         56992  non-null values
dtypes: int64(2), object(5)

Five of them are dtype object. I explicitly convert those objects to strings:

for c in df.columns:
    if df[c].dtype == object:
        print "convert ", df[c].name, " to string"
        df[c] = df[c].astype(str)

Then, df["attr2"] still has dtype object, although type(df["attr2"].ix[0]) reveals str, which is correct.

Pandas distinguishes between int64 and float64 and object. What is the logic behind it when there is no dtype str? Why is a str covered by object?


Answer 0

The dtype object comes from NumPy; it describes the type of the elements in an ndarray. Every element in an ndarray must have the same size in bytes. For int64 and float64, that is 8 bytes. But for strings, the length is not fixed. So instead of saving the bytes of the strings in the ndarray directly, Pandas uses an object ndarray, which saves pointers to objects; because of this, the dtype of this kind of ndarray is object.

Here is an example:

  • the int64 array contains 4 int64 values.
  • the object array contains 4 pointers to 3 string objects.
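
A small sketch makes the fixed-size-versus-pointer distinction visible (the itemsize figures assume a 64-bit build of NumPy):

import numpy as np

ints = np.array([1, 2, 3, 4], dtype='int64')            # fixed 8-byte elements
objs = np.array(['hello', 'x', 'hello'], dtype=object)  # pointers to objects

print(ints.dtype, ints.itemsize)   # int64 8
print(objs.dtype, objs.itemsize)   # object 8 (the size of a pointer)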


Answer 1

The accepted answer is good. Just wanted to provide an answer which referenced the documentation. The documentation says:

Pandas uses the object dtype for storing strings.

As the leading comment says “Don’t worry about it; it’s supposed to be like this.” (Although the accepted answer did a great job explaining the “why”; strings are variable-length)

But for strings, the length of the string is not fixed.


Answer 2

@HYRY's answer is great. I just want to provide a little more context.

Arrays store data as contiguous, fixed-size memory blocks. The combination of these properties together is what makes arrays lightning fast for data access. For example, consider how your computer might store an array of 32-bit integers, [3,0,1].

If you ask your computer to fetch the 3rd element in the array, it’ll start at the beginning and then jump across 64 bits to get to the 3rd element. Knowing exactly how many bits to jump across is what makes arrays fast.

Now consider the sequence of strings ['hello', 'i', 'am', 'a', 'banana']. Strings are objects that vary in size, so if you tried to store them in contiguous memory blocks, it’d end up looking like this.

Now your computer doesn’t have a fast way to access a randomly requested element. The key to overcoming this is to use pointers. Basically, store each string in some random memory location, and fill the array with the memory address of each string. (Memory addresses are just integers.) So now, things look like this

Now, if you ask your computer to fetch the 3rd element, just as before, it can jump across 64 bits (assuming the memory addresses are 32-bit integers) and then make one extra step to go fetch the string.

The challenge for NumPy is that there’s no guarantee the pointers are actually pointing to strings. That’s why it reports the dtype as ‘object’.

Shamelessly gonna plug my own blog article where I originally discussed this.


Answer 3

As of version 1.0.0 (January 2020), pandas has introduced, as an experimental feature, first-class support for string types through pandas.StringDtype.

While you’ll still be seeing object by default, the new type can be used by specifying a dtype of pd.StringDtype or simply 'string':

>>> pd.Series(['abc', None, 'def'])
0     abc
1    None
2     def
dtype: object
>>> pd.Series(['abc', None, 'def'], dtype=pd.StringDtype())
0     abc
1    <NA>
2     def
dtype: string
>>> pd.Series(['abc', None, 'def']).astype('string')
0     abc
1    <NA>
2     def
dtype: string

Selecting by label in pandas sometimes returns a Series, sometimes a DataFrame

Question: Selecting by label in pandas sometimes returns a Series, sometimes a DataFrame

In Pandas, when I select a label that only has one entry in the index I get back a Series, but when I select an entry that has more than one entry I get back a data frame.

Why is that? Is there a way to ensure I always get back a data frame?

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

In [3]: type(df.loc[3])
Out[3]: pandas.core.frame.DataFrame

In [4]: type(df.loc[1])
Out[4]: pandas.core.series.Series

Answer 0

Granted that the behavior is inconsistent, but I think it’s easy to imagine cases where this is convenient. Anyway, to get a DataFrame every time, just pass a list to loc. There are other ways, but in my opinion this is the cleanest.

In [2]: type(df.loc[[3]])
Out[2]: pandas.core.frame.DataFrame

In [3]: type(df.loc[[1]])
Out[3]: pandas.core.frame.DataFrame

Answer 1

Your index contains the label 3 three times. For this reason df.loc[3] will return a dataframe.

The reason is that you don't specify the column. So df.loc[3] selects three rows across all columns (here there is only column 0), while df.loc[3, 0] will return a Series. E.g. df.loc[1:2] also returns a dataframe, because you slice the rows.

Selecting a single row (as df.loc[1]) returns a Series with the column names as the index.

If you want to be sure to always have a DataFrame, you can slice like df.loc[1:1]. Another option is boolean indexing (df.loc[df.index==1]) or the take method (df.take([0]), but this uses locations, not labels!).
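
A short sketch summarising these options on the OP's frame:

import pandas as pd

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

print(type(df.loc[3]))              # DataFrame: label 3 matches three rows
print(type(df.loc[1]))              # Series: a single row
print(type(df.loc[1:1]))            # DataFrame: slicing keeps the frame
print(type(df.loc[df.index == 1]))  # DataFrame: boolean indexing
print(type(df.take([0])))           # DataFrame: positional take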


Answer 2

The TLDR

When using loc

df.loc[:] = Dataframe

df.loc[int] = Series for a single row label (it is a DataFrame only when that label occurs more than once in the index)

df.loc[:, ["col_name"]] = Dataframe

df.loc[:, "col_name"] = Series

Not using loc

df["col_name"] = Series

df[["col_name"]] = Dataframe


Answer 3

Use df['columnName'] to get a Series and df[['columnName']] to get a Dataframe.


Answer 4

You wrote in a comment to joris’ answer:

“I don’t understand the design decision for single rows to get converted into a series – why not a data frame with one row?”

A single row isn't converted into a Series.
It IS a Series: No, I don't think so, in fact; see the edit.

The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Panel is a container for DataFrame objects. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.

http://pandas.pydata.org/pandas-docs/stable/overview.html#why-more-than-1-data-structure

The data model of Pandas objects has been chosen like that. The reason certainly lies in the fact that it ensures some advantages I don't know of (I don't fully understand the last sentence of the citation; maybe it's the reason).

.

Edit: I don't agree with myself anymore

A DataFrame can't be composed of elements that would be Series, because the following code gives the same type "Series" for a row as for a column:

import pandas as pd

df = pd.DataFrame(data=[11,12,13], index=[2, 3, 3])

print '-------- df -------------'
print df

print '\n------- df.loc[2] --------'
print df.loc[2]
print 'type(df.loc[1]) : ',type(df.loc[2])

print '\n--------- df[0] ----------'
print df[0]
print 'type(df[0]) : ',type(df[0])

result

-------- df -------------
    0
2  11
3  12
3  13

------- df.loc[2] --------
0    11
Name: 2, dtype: int64
type(df.loc[1]) :  <class 'pandas.core.series.Series'>

--------- df[0] ----------
2    11
3    12
3    13
Name: 0, dtype: int64
type(df[0]) :  <class 'pandas.core.series.Series'>

So, it makes no sense to pretend that a DataFrame is composed of Series, because what would these Series be supposed to be: columns or rows? Stupid question and vision.

.

Then what is a DataFrame?

In the previous version of this answer, I asked this question, trying to find the answer to the Why is that? part of the OP's question and the similar interrogation single rows to get converted into a series - why not a data frame with one row? in one of his comments,
while the Is there a way to ensure I always get back a data frame? part has been answered by Dan Allan.

Then, as the Pandas docs cited above say that the pandas data structures are best seen as containers of lower dimensional data, it seemed to me that the understanding of the why would be found in the characteristics of the nature of DataFrame structures.

However, I realized that this cited advice must not be taken as a precise description of the nature of Pandas’ data structures.
This advice doesn’t mean that a DataFrame is a container of Series.
It expresses that the mental representation of a DataFrame as a container of Series (either rows or columns, according to the option considered at one moment of reasoning) is a good way to consider DataFrames, even if it isn't strictly the case in reality. "Good" meaning that this vision enables one to use DataFrames efficiently. That's all.

.

Then what is a DataFrame object?

The DataFrame class produces instances that have a particular structure originating in the NDFrame base class, itself derived from the PandasContainer base class, which is also a parent class of the Series class.
Note that this is correct for Pandas up to version 0.12. In the upcoming version 0.13, Series will derive from the NDFrame class only.

# with pandas 0.12

from pandas import Series
print 'Series  :\n',Series
print 'Series.__bases__  :\n',Series.__bases__

from pandas import DataFrame
print '\nDataFrame  :\n',DataFrame
print 'DataFrame.__bases__  :\n',DataFrame.__bases__

print '\n-------------------'

from pandas.core.generic import NDFrame
print '\nNDFrame.__bases__  :\n',NDFrame.__bases__

from pandas.core.generic import PandasContainer
print '\nPandasContainer.__bases__  :\n',PandasContainer.__bases__

from pandas.core.base import PandasObject
print '\nPandasObject.__bases__  :\n',PandasObject.__bases__

from pandas.core.base import StringMixin
print '\nStringMixin.__bases__  :\n',StringMixin.__bases__

result

Series  :
<class 'pandas.core.series.Series'>
Series.__bases__  :
(<class 'pandas.core.generic.PandasContainer'>, <type 'numpy.ndarray'>)

DataFrame  :
<class 'pandas.core.frame.DataFrame'>
DataFrame.__bases__  :
(<class 'pandas.core.generic.NDFrame'>,)

-------------------

NDFrame.__bases__  :
(<class 'pandas.core.generic.PandasContainer'>,)

PandasContainer.__bases__  :
(<class 'pandas.core.base.PandasObject'>,)

PandasObject.__bases__  :
(<class 'pandas.core.base.StringMixin'>,)

StringMixin.__bases__  :
(<type 'object'>,)

So my understanding is now that a DataFrame instance has certain methods that have been crafted in order to control the way data are extracted from rows and columns.

The ways these extraction methods work are described on this page: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing
In it we find the method given by Dan Allan as well as other methods.

Why have these extraction methods been crafted as they were?
That's certainly because they have been appraised as the ones giving the best possibilities and ease in data analysis.
It's precisely what is expressed in this sentence:

The best way to think about the pandas data structures is as flexible containers for lower dimensional data.

The why of the extraction of data from a DataFrame instance doesn't lie in its structure; it lies in the why of this structure. I guess that the structure and functioning of the Pandas data structures have been chiseled to be as intellectually intuitive as possible, and that to understand the details, one must read Wes McKinney's blog.


Answer 5

If the objective is to get a subset of the data set using the index, it is best to avoid using loc or iloc. Instead you should use syntax similar to this:

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
result = df[df.index == 3] 
isinstance(result, pd.DataFrame) # True

result = df[df.index == 1]
isinstance(result, pd.DataFrame) # True

Answer 6

If you also select on the index of the dataframe, then the result can be either a DataFrame or a Series (with a column list, df.loc[index, [column]]), or it can be a Series or a scalar (with a plain column label, df.loc[index, column]).

This function ensures that you always get a list from your selection (if the df, index and column are valid):

def get_list_from_df_column(df, index, column):
    df_or_series = df.loc[index,[column]] 
    # df.loc[index,column] is also possible and returns a series or a scalar
    if isinstance(df_or_series, pd.Series):
        resulting_list = df_or_series.tolist() #get list from series
    else:
        resulting_list = df_or_series[column].tolist() 
        # use the column key to get a series from the dataframe
    return(resulting_list)

How to get the first column of a pandas DataFrame as a Series?

Question: How to get the first column of a pandas DataFrame as a Series?

I tried:

x=pandas.DataFrame(...)
s = x.take([0], axis=1)

And s gets a DataFrame, not a Series.


Answer 0

>>> import pandas as pd
>>> df = pd.DataFrame({'x' : [1, 2, 3, 4], 'y' : [4, 5, 6, 7]})
>>> df
   x  y
0  1  4
1  2  5
2  3  6
3  4  7
>>> s = df.ix[:,0]
>>> type(s)
<class 'pandas.core.series.Series'>
>>>

===========================================================================

UPDATE

If you’re reading this after June 2017, ix has been deprecated in pandas 0.20.2, so don’t use it. Use loc or iloc instead. See comments and other answers to this question.


Answer 1

You can get the first column as a Series by following code:

x[x.columns[0]]

Answer 2

From v0.11+, … use df.iloc.

In [7]: df.iloc[:,0]
Out[7]: 
0    1
1    2
2    3
3    4
Name: x, dtype: int64

Answer 3

Isn’t this the simplest way?

By column name:

In [20]: df = pd.DataFrame({'x' : [1, 2, 3, 4], 'y' : [4, 5, 6, 7]})
In [21]: df
Out[21]:
    x   y
0   1   4
1   2   5
2   3   6
3   4   7

In [23]: df.x
Out[23]:
0    1
1    2
2    3
3    4
Name: x, dtype: int64

In [24]: type(df.x)
Out[24]:
pandas.core.series.Series

Answer 4

This works great when you want to load a series from a csv file:

x = pd.read_csv('x.csv', index_col=False, names=['x'],header=None).iloc[:,0]
print(type(x))
print(x.head(10))


<class 'pandas.core.series.Series'>
0    110.96
1    119.40
2    135.89
3    152.32
4    192.91
5    177.20
6    181.16
7    177.30
8    200.13
9    235.41
Name: x, dtype: float64

Answer 5

df[df.columns[i]]

where i is the position/number of the column (starting from 0).

So, i = 0 is for the first column.

You can also get the last column using i = -1


Keep only the date part when using pandas.to_datetime

Question: Keep only the date part when using pandas.to_datetime

I use pandas.to_datetime to parse the dates in my data. Pandas by default represents the dates with datetime64[ns] even though the dates are all daily only. I wonder whether there is an elegant/clever way to convert the dates to datetime.date or datetime64[D] so that, when I write the data to CSV, the dates are not appended with 00:00:00. I know I can convert the type manually element-by-element:

[dt.to_datetime().date() for dt in df.dates]

But this is really slow since I have many rows and it sort of defeats the purpose of using pandas.to_datetime. Is there a way to convert the dtype of the entire column at once? Or alternatively, does pandas.to_datetime support a precision specification so that I can get rid of the time part while working with daily data?


Answer 0

Since version 0.15.0 this can now be easily done using .dt to access just the date component:

df['just_date'] = df['dates'].dt.date

The above returns a datetime.date dtype; if you want to have a datetime64, you can just normalize the time component to midnight, which sets all the values to 00:00:00:

df['normalised_date'] = df['dates'].dt.normalize()

This keeps the dtype as datetime64, but the display shows just the date value.


Answer 1

Simple Solution:

df['date_only'] = df['date_time_column'].dt.date

Answer 2

While I upvoted EdChum's answer, which is the most direct answer to the question the OP posed, it does not really solve the performance problem (it still relies on python datetime objects, and hence any operation on them will not be vectorized – that is, it will be slow).

A better performing alternative is to use df['dates'].dt.floor('d'). Strictly speaking, it does not “keep only date part”, since it just sets the time to 00:00:00. But it does work as desired by the OP when, for instance:

  • printing to screen
  • saving to csv
  • using the column to groupby

… and it is much more efficient, since the operation is vectorized.

EDIT: in fact, the answer the OP would have preferred is probably "recent versions of pandas do not write the time to csv if it is 00:00:00 for all observations".
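
A quick comparison of the two approaches (the timestamps below are made up):

import pandas as pd

s = pd.to_datetime(pd.Series(['2015-01-08 22:44:09',
                              '2015-01-09 05:00:00']))

floored = s.dt.floor('d')   # stays datetime64[ns]; time set to 00:00:00
as_date = s.dt.date         # object dtype holding datetime.date (slower)

print(floored.dtype)  # datetime64[ns]
print(as_date.dtype)  # object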


Answer 3

Pandas DatetimeIndex and Series have a method called normalize that does exactly what you want.

You can read more about it in this answer.

It can be used as ser.dt.normalize()


Answer 4

Pandas v0.13+: Use to_csv with date_format parameter

Avoid, where possible, converting your datetime64[ns] series to an object dtype series of datetime.date objects. The latter, often constructed using pd.Series.dt.date, is stored as an array of pointers and is inefficient relative to a pure NumPy-based series.

Since your concern is format when writing to CSV, just use the date_format parameter of to_csv. For example:

df.to_csv(filename, date_format='%Y-%m-%d')

See Python’s strftime directives for formatting conventions.


Answer 5

This is a simple way to extract the date:

import pandas as pd

d='2015-01-08 22:44:09' 
date=pd.to_datetime(d).date()
print(date)

Answer 6

Converting to datetime64[D]:

df.dates.values.astype('M8[D]')

Though re-assigning that to a DataFrame col will revert it back to [ns].

If you wanted actual datetime.date:

import datetime
import numpy as np

dt = pd.DatetimeIndex(df.dates)
dates = np.array([datetime.date(*date_tuple) for date_tuple in zip(dt.year, dt.month, dt.day)])

Answer 7

Just giving a more up-to-date answer in case someone sees this old post.

Adding “utc=False” when converting to datetime will remove the timezone component and keep only the date in a datetime64[ns] data type.

pd.to_datetime(df['Date'], utc=False)

You will be able to save it in excel without getting the error “ValueError: Excel does not support datetimes with timezones. Please ensure that datetimes are timezone unaware before writing to Excel.”


Answer 8

I wanted to be able to change the type for a set of columns in a data frame and then remove the time, keeping only the day. round(), floor(), and ceil() all work:

df[date_columns] = df[date_columns].apply(pd.to_datetime)
df[date_columns] = df[date_columns].apply(lambda t: t.dt.floor('d'))

Combine two Series into a DataFrame in pandas

Question: Combine two Series into a DataFrame in pandas

I have two Series s1 and s2 with the same (non-consecutive) indices. How do I combine s1 and s2 to be two columns in a DataFrame and keep one of the indices as a third column?


Answer 0

I think concat is a nice way to do this. If they are present, it uses the name attributes of the Series as the columns (otherwise it simply numbers them):

In [1]: s1 = pd.Series([1, 2], index=['A', 'B'], name='s1')

In [2]: s2 = pd.Series([3, 4], index=['A', 'B'], name='s2')

In [3]: pd.concat([s1, s2], axis=1)
Out[3]:
   s1  s2
A   1   3
B   2   4

In [4]: pd.concat([s1, s2], axis=1).reset_index()
Out[4]:
  index  s1  s2
0     A   1   3
1     B   2   4

Note: This extends to more than 2 Series.


Answer 1

Why don’t you just use .to_frame if both have the same indexes?

>= v0.23

a.to_frame().join(b)

< v0.23

a.to_frame().join(b.to_frame())
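
A brief sketch of this approach; note that the Series need name attributes, which become the column names (the names and values here are made up):

import pandas as pd

a = pd.Series([1, 2], index=['A', 'B'], name='s1')
b = pd.Series([3, 4], index=['A', 'B'], name='s2')

df = a.to_frame().join(b)   # a becomes a one-column frame; b joins on index
print(df)
#    s1  s2
# A   1   3
# B   2   4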

Answer 2

Pandas will automatically align these passed-in series and create the joint index. They happen to be the same here. reset_index moves the index to a column.

In [2]: s1 = Series(randn(5),index=[1,2,4,5,6])

In [4]: s2 = Series(randn(5),index=[1,2,4,5,6])

In [8]: DataFrame(dict(s1 = s1, s2 = s2)).reset_index()
Out[8]: 
   index        s1        s2
0      1 -0.176143  0.128635
1      2 -1.286470  0.908497
2      4 -0.995881  0.528050
3      5  0.402241  0.458870
4      6  0.380457  0.072251

Answer 3

Example code:

a = pd.Series([1,2,3,4], index=[7,2,8,9])
b = pd.Series([5,6,7,8], index=[7,2,8,9])
data = pd.DataFrame({'a': a,'b':b, 'idx_col':a.index})

Pandas allows you to create a DataFrame from a dict with Series as the values and the column names as the keys. When it finds a Series as a value, it uses the Series index as part of the DataFrame index. This data alignment is one of the main perks of Pandas. Consequently, unless you have other needs, the freshly created DataFrame has duplicated values. In the above example, data['idx_col'] has the same data as data.index.


Answer 4

If I may answer this.

The fundamental idea behind converting a series to a data frame is to understand that

1. At a conceptual level, every column in a data frame is a series.

2. And, every column name is a key name that maps to a series.

If you keep the above two concepts in mind, you can think of many ways to convert a series to a data frame. One easy solution is like this:

Create two series here

import pandas as pd

series_1 = pd.Series(list(range(10)))

series_2 = pd.Series(list(range(20,30)))

Create an empty data frame with just the desired column names

df = pd.DataFrame(columns=['Column_name#1', 'Column_name#2'])

Put series value inside data frame using mapping concept

df['Column_name#1'] = series_1

df['Column_name#2'] = series_2

Check results now

df.head(5)

Answer 5

Not sure I fully understand your question, but is this what you want to do?

pd.DataFrame(data=dict(s1=s1, s2=s2), index=s1.index)

(index=s1.index is not even necessary here)


Answer 6

A simplification of the solution based on join():

df = a.to_frame().join(b)

Answer 7

I used pandas to convert my numpy array or series to a dataframe, then added the additional column by the key 'prediction'. If you need the dataframe converted back to a list, use values.tolist():

output=pd.DataFrame(X_test)
output['prediction']=y_pred

list=output.values.tolist()
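
For context, X_test and y_pred are assumed to come from some model; a self-contained sketch with made-up values:

import numpy as np
import pandas as pd

X_test = np.array([[1.0, 2.0], [3.0, 4.0]])   # assumed feature matrix
y_pred = np.array([0, 1])                      # assumed predictions

output = pd.DataFrame(X_test)
output['prediction'] = y_pred

rows = output.values.tolist()   # values upcast to float in the 2D array
print(rows)   # [[1.0, 2.0, 0.0], [3.0, 4.0, 1.0]]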