Tag Archives: pandas

Pandas selecting by label sometimes returns a Series, sometimes a DataFrame

Question: Pandas selecting by label sometimes returns a Series, sometimes a DataFrame

In Pandas, when I select a label that has only one entry in the index I get back a Series, but when I select a label that has more than one entry I get back a DataFrame.

Why is that? Is there a way to ensure I always get back a DataFrame?

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

In [3]: type(df.loc[3])
Out[3]: pandas.core.frame.DataFrame

In [4]: type(df.loc[1])
Out[4]: pandas.core.series.Series

Answer 0

Granted, the behavior is inconsistent, but I think it's easy to imagine cases where this is convenient. Anyway, to get a DataFrame every time, just pass a list to loc. There are other ways, but in my opinion this is the cleanest.

In [2]: type(df.loc[[3]])
Out[2]: pandas.core.frame.DataFrame

In [3]: type(df.loc[[1]])
Out[3]: pandas.core.frame.DataFrame

Answer 1

You have an index with three entries labeled 3. For this reason df.loc[3] will return a dataframe.

The reason is that you don't specify the column. So df.loc[3] selects the three matching rows of all columns (which is column 0), while df.loc[3,0] will return a Series. E.g. df.loc[1:2] also returns a dataframe, because you slice the rows.

Selecting a single row (as df.loc[1]) returns a Series with the column names as the index.

If you want to be sure to always have a DataFrame, you can slice like df.loc[1:1]. Another option is boolean indexing (df.loc[df.index==1]) or the take method (df.take([0]), but this uses positions, not labels!); a sketch comparing these follows.
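A minimal sketch comparing these DataFrame-preserving options on the question's frame:

import pandas as pd

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

print(type(df.loc[1:1]))            # slicing rows -> DataFrame
print(type(df.loc[df.index == 1]))  # boolean indexing -> DataFrame
print(type(df.take([0])))           # take: positional, not label-based -> DataFrame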


Answer 2

The TLDR

When using loc:

df.loc[:] = DataFrame

df.loc[int] = Series when the label is unique (with the column names as its index), and DataFrame when the label occurs more than once in the index

df.loc[:, ["col_name"]] = DataFrame

df.loc[:, "col_name"] = Series

Not using loc:

df["col_name"] = Series

df[["col_name"]] = DataFrame
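A quick sketch that checks these rules on a small made-up two-column frame:

import pandas as pd

df = pd.DataFrame({"col_name": [1, 2, 3], "other": [4, 5, 6]})

print(type(df.loc[:]))                # DataFrame
print(type(df.loc[0]))                # Series (unique label; column names become the index)
print(type(df.loc[:, ["col_name"]]))  # DataFrame
print(type(df.loc[:, "col_name"]))    # Series
print(type(df["col_name"]))           # Series
print(type(df[["col_name"]]))         # DataFrame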


Answer 3

Use df['columnName'] to get a Series and df[['columnName']] to get a DataFrame.


Answer 4

You wrote in a comment to joris’ answer:

“I don’t understand the design decision for single rows to get converted into a series – why not a data frame with one row?”

A single row isn't converted into a Series.
It IS a Series: no, I don't think so, in fact; see the edit below.

The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Panel is a container for DataFrame objects. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.

http://pandas.pydata.org/pandas-docs/stable/overview.html#why-more-than-1-data-structure

The data model of Pandas objects has been chosen like that. The reason certainly lies in the fact that it ensures some advantages I don't know about (I don't fully understand the last sentence of the citation; maybe that's the reason).

Edit: I don't agree with myself

A DataFrame can't be composed of elements that would be Series, because the following code gives the same type, Series, for a row as well as for a column:

import pandas as pd

df = pd.DataFrame(data=[11,12,13], index=[2, 3, 3])

print '-------- df -------------'
print df

print '\n------- df.loc[2] --------'
print df.loc[2]
print 'type(df.loc[2]) : ',type(df.loc[2])

print '\n--------- df[0] ----------'
print df[0]
print 'type(df[0]) : ',type(df[0])

result

-------- df -------------
    0
2  11
3  12
3  13

------- df.loc[2] --------
0    11
Name: 2, dtype: int64
type(df.loc[2]) :  <class 'pandas.core.series.Series'>

--------- df[0] ----------
2    11
3    12
3    13
Name: 0, dtype: int64
type(df[0]) :  <class 'pandas.core.series.Series'>

So, it makes no sense to pretend that a DataFrame is composed of Series, because what would those Series be supposed to be: columns or rows? The question itself is ill-posed, and so is that vision.

Then what is a DataFrame ?

In the previous version of this answer, I asked this question, trying to find the answer to the Why is that? part of the OP's question and to the similar interrogation single rows to get converted into a series - why not a data frame with one row? in one of his comments, while the Is there a way to ensure I always get back a data frame? part was answered by Dan Allan.

Then, as the Pandas docs cited above say that the pandas data structures are best seen as containers of lower dimensional data, it seemed to me that the understanding of the why would be found in the characteristics of the nature of DataFrame structures.

However, I realized that this cited advice must not be taken as a precise description of the nature of Pandas' data structures.
This advice doesn't mean that a DataFrame is a container of Series.
It expresses that the mental representation of a DataFrame as a container of Series (either rows or columns, according to the option considered at a given moment of reasoning) is a good way to think of DataFrames, even if it isn't strictly the case in reality. "Good" meaning that this vision makes it possible to use DataFrames efficiently. That's all.

Then what is a DataFrame object ?

The DataFrame class produces instances that have a particular structure originating in the NDFrame base class, itself derived from the PandasContainer base class, which is also a parent class of the Series class.
Note that this is correct for Pandas up to version 0.12. In the upcoming version 0.13, Series will also derive from the NDFrame class only.

# with pandas 0.12

from pandas import Series
print 'Series  :\n',Series
print 'Series.__bases__  :\n',Series.__bases__

from pandas import DataFrame
print '\nDataFrame  :\n',DataFrame
print 'DataFrame.__bases__  :\n',DataFrame.__bases__

print '\n-------------------'

from pandas.core.generic import NDFrame
print '\nNDFrame.__bases__  :\n',NDFrame.__bases__

from pandas.core.generic import PandasContainer
print '\nPandasContainer.__bases__  :\n',PandasContainer.__bases__

from pandas.core.base import PandasObject
print '\nPandasObject.__bases__  :\n',PandasObject.__bases__

from pandas.core.base import StringMixin
print '\nStringMixin.__bases__  :\n',StringMixin.__bases__

result

Series  :
<class 'pandas.core.series.Series'>
Series.__bases__  :
(<class 'pandas.core.generic.PandasContainer'>, <type 'numpy.ndarray'>)

DataFrame  :
<class 'pandas.core.frame.DataFrame'>
DataFrame.__bases__  :
(<class 'pandas.core.generic.NDFrame'>,)

-------------------

NDFrame.__bases__  :
(<class 'pandas.core.generic.PandasContainer'>,)

PandasContainer.__bases__  :
(<class 'pandas.core.base.PandasObject'>,)

PandasObject.__bases__  :
(<class 'pandas.core.base.StringMixin'>,)

StringMixin.__bases__  :
(<type 'object'>,)

So my understanding is now that a DataFrame instance has certain methods that have been crafted in order to control the way data are extracted from rows and columns.

The way these extraction methods work is described on this page: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing
There we find the method given by Dan Allan, among others.

Why have these extraction methods been crafted the way they are?
Certainly because they were appraised as the ones giving the best possibilities and ease in data analysis.
It's precisely what is expressed in this sentence:

The best way to think about the pandas data structures is as flexible containers for lower dimensional data.

The why of the extraction of data from a DataFrame instance doesn't lie in its structure; it lies in the why of this structure. I guess that the structure and functioning of the Pandas data structures have been chiseled in order to be as intellectually intuitive as possible, and that to understand the details, one must read Wes McKinney's blog.


Answer 5

If the objective is to get a subset of the data set using the index, it is best to avoid using loc or iloc. Instead, you should use syntax similar to this:

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
result = df[df.index == 3] 
isinstance(result, pd.DataFrame) # True

result = df[df.index == 1]
isinstance(result, pd.DataFrame) # True

Answer 6

If you also select on the index of the dataframe, then the result can be either a DataFrame or a Series, or it can be a scalar (single value).

This function ensures that you always get a list from your selection (if the df, index and column are valid):

def get_list_from_df_column(df, index, column):
    df_or_series = df.loc[index,[column]] 
    # df.loc[index,column] is also possible and returns a series or a scalar
    if isinstance(df_or_series, pd.Series):
        resulting_list = df_or_series.tolist() #get list from series
    else:
        resulting_list = df_or_series[column].tolist() 
        # use the column key to get a series from the dataframe
    return resulting_list
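A small usage sketch for the function above, on a made-up frame with a duplicated index label:

import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30]}, index=[1, 3, 3])

print(get_list_from_df_column(df, 1, "value"))  # [10]      -> Series branch (single row)
print(get_list_from_df_column(df, 3, "value"))  # [20, 30]  -> DataFrame branch (duplicated label)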

How do I replace text in a column of a pandas dataframe?

Question: How do I replace text in a column of a pandas dataframe?

I have a column in my dataframe like this:

range
"(2,30)"
"(50,290)"
"(400,1000)"
... 

and I want to replace the comma , with a dash -. I'm currently using this method, but nothing changes.

org_info_exc['range'].replace(',', '-', inplace=True)

Can anybody help?


Answer 0

Use the vectorised str method replace:

In [30]:

df['range'] = df['range'].str.replace(',','-')
df
Out[30]:
      range
0    (2-30)
1  (50-290)

EDIT

So if we look at what you tried and why it didn’t work:

df['range'].replace(',','-',inplace=True)

From the docs we see this description:

str or regex: str: string exactly matching to_replace will be replaced with value

So because the string values do not match exactly, no replacement occurs; compare with the following:

In [43]:

df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
Out[43]:
0    (2,30)
1         -
Name: range, dtype: object

here we get an exact match on the second row and the replacement occurs.
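As an aside, Series.replace can also do substring substitution if you pass regex=True, which switches it from exact matching to regex replacement; a minimal sketch with the same df:

df['range'] = df['range'].replace(',', '-', regex=True)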


Answer 1

For anyone else arriving here from a Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column): Pandas has a built-in replace method available on a dataframe object.

df.replace(',', '-', regex=True)

Source: Docs


Answer 2

Replace all spaces with underscores in the column names:

data.columns = data.columns.str.replace(' ', '_', regex=True)

Answer 3

In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:

import re

chars_to_remove = ['.', '-', '(', ')']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'

df['string_col'] = df['string_col'].str.replace(regular_expression, '', regex=True)

Add a column with the number of days between dates in a pandas DataFrame

Question: Add a column with the number of days between dates in a pandas DataFrame

I want to subtract dates in ‘A’ from dates in ‘B’ and add a new column with the difference.

df
          A        B
one 2014-01-01  2014-02-28 
two 2014-02-03  2014-03-01

I’ve tried the following, but get an error when I try to include this in a for loop…

import datetime
date1=df['A'][0]
date2=df['B'][0]
mdate1 = datetime.datetime.strptime(date1, "%Y-%m-%d").date()
rdate1 = datetime.datetime.strptime(date2, "%Y-%m-%d").date()
delta =  (mdate1 - rdate1).days
print delta

What should I do?


Answer 0

Assuming these are datetime columns (if they're not, apply to_datetime), you can just subtract them:

df['A'] = pd.to_datetime(df['A'])
df['B'] = pd.to_datetime(df['B'])

In [11]: df.dtypes  # if already datetime64 you don't need to use to_datetime
Out[11]:
A    datetime64[ns]
B    datetime64[ns]
dtype: object

In [12]: df['A'] - df['B']
Out[12]:
one   -58 days
two   -26 days
dtype: timedelta64[ns]

In [13]: df['C'] = df['A'] - df['B']

In [14]: df
Out[14]:
             A          B        C
one 2014-01-01 2014-02-28 -58 days
two 2014-02-03 2014-03-01 -26 days

Note: ensure you're using a recent version of pandas (e.g. 0.13.1); this may not work in older versions.


Answer 1

To remove the 'days' text element, you can also make use of the .dt accessor for Series: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.html

So,

df[['A','B']] = df[['A','B']].apply(pd.to_datetime) #if conversion required
df['C'] = (df['B'] - df['A']).dt.days

which returns:

             A          B   C
one 2014-01-01 2014-02-28  58
two 2014-02-03 2014-03-01  26

Answer 2

A list comprehension is your best bet for the most Pythonic (and fastest) way to do this:

[int(i.days) for i in (df.B - df.A)]
  1. i will return the timedelta (e.g. '-58 days')
  2. i.days will return this value as a long integer (e.g. -58L)
  3. int(i.days) will give you the -58 you seek.

If your columns aren't in datetime format, convert them first; the short syntax is: df.A = pd.to_datetime(df.A)


Answer 3

How about this:

times['days_since'] = max(list(df.index.values))  
times['days_since'] = times['days_since'] - times['months']  
times

What is the difference between pandas.qcut and pandas.cut?

Question: What is the difference between pandas.qcut and pandas.cut?

The documentation says:

http://pandas.pydata.org/pandas-docs/dev/basics.html

“Continuous values can be discretized using the cut (bins based on values) and qcut (bins based on sample quantiles) functions”

Sounds very abstract to me… I can see the differences in the example below but what does qcut (sample quantile) actually do/mean? When would you use qcut versus cut?

Thanks.

factors = np.random.randn(30)

In [11]:
pd.cut(factors, 5)
Out[11]:
[(-0.411, 0.575], (-0.411, 0.575], (-0.411, 0.575], (-0.411, 0.575], (0.575, 1.561], ..., (-0.411, 0.575], (-1.397, -0.411], (0.575, 1.561], (-2.388, -1.397], (-0.411, 0.575]]
Length: 30
Categories (5, object): [(-2.388, -1.397] < (-1.397, -0.411] < (-0.411, 0.575] < (0.575, 1.561] < (1.561, 2.547]]

In [14]:
pd.qcut(factors, 5)
Out[14]:
[(-0.348, 0.0899], (-0.348, 0.0899], (0.0899, 1.19], (0.0899, 1.19], (0.0899, 1.19], ..., (0.0899, 1.19], (-1.137, -0.348], (1.19, 2.547], [-2.383, -1.137], (-0.348, 0.0899]]
Length: 30
Categories (5, object): [[-2.383, -1.137] < (-1.137, -0.348] < (-0.348, 0.0899] < (0.0899, 1.19] < (1.19, 2.547]]

Answer 0

To begin, note that quantiles is just the most general term for things like percentiles, quartiles, and medians. You specified five bins in your example, so you are asking qcut for quintiles.

So, when you ask for quintiles with qcut, the bins will be chosen so that you have the same number of records in each bin. You have 30 records, so you should have 6 in each bin (your output should look like this, although the breakpoints will differ due to the random draw):

pd.qcut(factors, 5).value_counts()

[-2.578, -0.829]    6
(-0.829, -0.36]     6
(-0.36, 0.366]      6
(0.366, 0.868]      6
(0.868, 2.617]      6

Conversely, for cut you will see something more uneven:

pd.cut(factors, 5).value_counts()

(-2.583, -1.539]    5
(-1.539, -0.5]      5
(-0.5, 0.539]       9
(0.539, 1.578]      9
(1.578, 2.617]      2

That’s because cut will choose the bins to be evenly spaced according to the values themselves and not the frequency of those values. Hence, because you drew from a random normal, you’ll see higher frequencies in the inner bins and fewer in the outer. This is essentially going to be a tabular form of a histogram (which you would expect to be fairly bell shaped with 30 records).


Answer 1
  • The cut command creates equispaced bins, but the frequency of samples is unequal in each bin.
  • The qcut command creates unequal-size bins, but the frequency of samples is equal in each bin.


    >>> x=np.array([24,  7,  2, 25, 22, 29])
    >>> x
    array([24,  7,  2, 25, 22, 29])

    >>> pd.cut(x,3).value_counts()  # bins have an equal width of 9
    (2, 11.0]        2
    (11.0, 20.0]     0
    (20.0, 29.0]     4

    >>> pd.qcut(x,3).value_counts()  # equal frequency of 2 in each bin
    (1.999, 17.0]     2
    (17.0, 24.333]    2
    (24.333, 29.0]    2

Answer 2

So qcut ensures an equal number of values in each bin, even if the values cluster in the sample space. This means you are less likely to have one bin full of data with very close values and another bin with 0 values. In general, it's better sampling.


Answer 3

pd.qcut divides the sorted elements of an array into bins so that each bin receives roughly (number of elements) / (number of bins) elements, filling the bins serially with equal counts.

pd.cut divides the value range (max - min) into (number of bins) equal-width intervals and then assigns each element to the interval its value falls in.
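A small sketch to inspect the bin edges each function computes; retbins=True returns the edges alongside the binned data:

import numpy as np
import pandas as pd

x = np.array([24, 7, 2, 25, 22, 29])

_, cut_edges = pd.cut(x, 3, retbins=True)
_, qcut_edges = pd.qcut(x, 3, retbins=True)

print(cut_edges)   # equal-width edges spanning min..max, roughly [1.97, 11., 20., 29.]
print(qcut_edges)  # quantile edges: unequal widths, equal counts per bin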


How can I filter lines on load in the Pandas read_csv function?

Question: How can I filter lines on load in the Pandas read_csv function?

How can I filter which lines of a CSV to be loaded into memory using pandas? This seems like an option that one should find in read_csv. Am I missing something?

Example: we have a CSV with a timestamp column, and we'd like to load just the lines with a timestamp greater than a given constant.


Answer 0

There isn’t an option to filter the rows before the CSV file is loaded into a pandas object.

You can either load the file and then filter using df[df['field'] > constant], or, if you have a very large file and are worried about running out of memory, use an iterator and apply the filter as you concatenate chunks of your file, e.g.:

import pandas as pd
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])

You can vary the chunksize to suit your available memory. See here for more details.


Answer 1

I didn't find a straightforward way to do it within the context of read_csv. However, read_csv returns a DataFrame, which can be filtered by selecting rows by boolean vector df[bool_vec]:

filtered = df[(df['timestamp'] > targettime)]

This is selecting all rows in df (assuming df is any DataFrame, such as the result of a read_csv call, that at least contains a datetime column timestamp) for which the values in the timestamp column are greater than the value of targettime. Similar question.


Answer 2

If the filtered range is contiguous (as it usually is with time(stamp) filters), then the fastest solution is to hard-code the range of rows. Simply combine skiprows=range(1, start_row) with the nrows parameter. The import then takes seconds where the accepted solution would take minutes. A few experiments to find the initial start_row are not a huge cost given the savings on import times. Notice we keep the header row by starting the range at 1; a sketch follows.
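Assuming hypothetical start_row and end_row values found by experiment:

import pandas as pd

start_row, end_row = 1_000_000, 2_000_000  # hypothetical boundaries of the wanted block

df = pd.read_csv(
    'file.csv',
    skiprows=range(1, start_row),  # keep row 0 (the header), skip everything up to the block
    nrows=end_row - start_row,     # read only the contiguous block
)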


Answer 3

If you are on Linux, you can use grep.

# works on either Python 2 or Python 3
import subprocess
import pandas as pd
from time import time  # not needed, just for timing
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO


def zgrep_data(f, string):
    '''grep multiple items f is filepath, string is what you are filtering for'''

    grep = 'grep' # change to zgrep for gzipped files
    print('{} for {} from {}'.format(grep,string,f))
    start_time = time()
    if string == '':
        out = subprocess.check_output([grep, string, f])
        grep_data = StringIO(out)
        data = pd.read_csv(grep_data, sep=',', header=0)

    else:
        # read only the first row to get the columns. May need to change depending on 
        # how the data is stored
        columns = pd.read_csv(f, sep=',', nrows=1, header=None).values.tolist()[0]    

        out = subprocess.check_output([grep, string, f])
        grep_data = StringIO(out)

        data = pd.read_csv(grep_data, sep=',', names=columns, header=None)

    print('{} finished for {} - {} seconds'.format(grep,f,time()-start_time))
    return data

Answer 4

You can specify the nrows parameter.

import pandas as pd
df = pd.read_csv('file.csv', nrows=100)

This code works well in version 0.20.3.


Python: Convert timedelta to int in a dataframe

Question: Python: Convert timedelta to int in a dataframe

I would like to create a column in a pandas data frame that is an integer representation of the number of days in a timedelta column. Is it possible to use ‘datetime.days’ or do I need to do something more manual?

timedelta column

7 days, 23:29:00

day integer column

7


Answer 0

Use the dt.days attribute. Access this attribute via:

timedelta_series.dt.days

You can also get the seconds and microseconds attributes in the same way.
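A minimal sketch on the question's example value:

import pandas as pd

td = pd.Series([pd.Timedelta('7 days 23:29:00')])

print(td.dt.days)     # 0    7
print(td.dt.seconds)  # 0    84540  (the 23:29:00 part, in seconds)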


Answer 1

You could do this, where td is your series of timedeltas. The division converts the nanosecond deltas into day deltas, and the conversion to int truncates to whole days.

import numpy as np

(td / np.timedelta64(1, 'D')).astype(int)

Answer 2

Timedelta objects have read-only instance attributes .days, .seconds, and .microseconds.


Answer 3

If the question isn't just "how to access an integer form of the timedelta?" but "how to convert the timedelta column in the dataframe to an int?", the answer might be a little different. In addition to the .dt.days accessor, you need either df.astype or pd.to_numeric.

Either of these options should help:

df['tdColumn'] = pd.to_numeric(df['tdColumn'].dt.days, downcast='integer')

or

df['tdColumn'] = df['tdColumn'].dt.days.astype('int16')

python pandas dataframe to dictionary

Question: python pandas dataframe to dictionary

I have a two-column dataframe and intend to convert it to a Python dictionary: the first column will be the keys and the second the values. Thank you in advance.

Dataframe:

    id    value
0    0     10.2
1    1      5.7
2    2      7.4

Answer 0

See the docs for to_dict. You can use it like this:

df.set_index('id').to_dict()

And if you have only one value column, to avoid the column name also becoming a level in the dict (in this case you actually use Series.to_dict()):

df.set_index('id')['value'].to_dict()
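Applied to the question's frame, a quick sketch of both variants:

import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2], 'value': [10.2, 5.7, 7.4]})

print(df.set_index('id')['value'].to_dict())
# {0: 10.2, 1: 5.7, 2: 7.4}

print(df.set_index('id').to_dict())
# {'value': {0: 10.2, 1: 5.7, 2: 7.4}}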

Answer 1

mydict = dict(zip(df.id, df.value))

Answer 2

If you want a simple way to preserve duplicates, you could use groupby:

>>> ptest = pd.DataFrame([['a',1],['a',2],['b',3]], columns=['id', 'value']) 
>>> ptest
  id  value
0  a      1
1  a      2
2  b      3
>>> {k: g["value"].tolist() for k,g in ptest.groupby("id")}
{'a': [1, 2], 'b': [3]}

Answer 3

The answers by joris in this thread and by punchagan in the duplicated thread are very elegant, however they will not give correct results if the column used for the keys contains any duplicated value.

For example:

>>> ptest = pd.DataFrame([['a',1],['a',2],['b',3]], columns=['id', 'value']) 
>>> ptest
  id  value
0  a      1
1  a      2
2  b      3

# note that in both cases the association a->1 is lost:
>>> ptest.set_index('id')['value'].to_dict()
{'a': 2, 'b': 3}
>>> dict(zip(ptest.id, ptest.value))
{'a': 2, 'b': 3}

If you have duplicated entries and do not want to lose them, you can use this ugly but working code:

>>> mydict = {}
>>> for x in range(len(ptest)):
...     currentid = ptest.iloc[x,0]
...     currentvalue = ptest.iloc[x,1]
...     mydict.setdefault(currentid, [])
...     mydict[currentid].append(currentvalue)
>>> mydict
{'a': [1, 2], 'b': [3]}

Answer 4

Simplest solution:

df.set_index('id').T.to_dict('records')

Example:

df= pd.DataFrame([['a',1],['a',2],['b',3]], columns=['id','value'])
df.set_index('id').T.to_dict('records')

If you have multiple value columns, like val1, val2, val3, etc., and you want them as lists, then use the code below:

df.set_index('id').T.to_dict('list')

Answer 5

in some versions the code below might not work

mydict = dict(zip(df.id, df.value))

so make it explicit

id_=df.id.values
value=df.value.values
mydict=dict(zip(id_,value))

Note: I used id_ because the word id shadows a Python built-in.


Answer 6

You can use a dict comprehension:

my_dict = {row[0]: row[1] for row in df.values}

Answer 7

Another (slightly shorter) solution for not losing duplicate entries:

>>> ptest = pd.DataFrame([['a',1],['a',2],['b',3]], columns=['id','value'])
>>> ptest
  id  value
0  a      1
1  a      2
2  b      3

>>> pdict = dict()
>>> for i in ptest['id'].unique().tolist():
...     ptest_slice = ptest[ptest['id'] == i]
...     pdict[i] = ptest_slice['value'].tolist()
...

>>> pdict
{'b': [3], 'a': [1, 2]}

Answer 8

You need a list as a dictionary value. This code will do the trick.

from collections import defaultdict
mydict = defaultdict(list)
for k, v in zip(df.id.values,df.value.values):
    mydict[k].append(v)
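A quick usage sketch with the duplicated-key frame from the earlier answers:

import pandas as pd
from collections import defaultdict

df = pd.DataFrame([['a', 1], ['a', 2], ['b', 3]], columns=['id', 'value'])

mydict = defaultdict(list)
for k, v in zip(df.id.values, df.value.values):
    mydict[k].append(v)

print(dict(mydict))  # {'a': [1, 2], 'b': [3]}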

Answer 9

I found this question while trying to make a dictionary out of three columns of a pandas dataframe. In my case the dataframe has columns A, B and C (let’s say A and B are the geographical coordinates of longitude and latitude and C the country region/state/etc, which is more or less the case).

I wanted a dictionary with each pair of A,B values (dictionary key) matching the value of C (dictionary value) in the corresponding row (each pair of A,B values is guaranteed to be unique due to previous filtering, but it is possible to have the same value of C for different pairs of A,B values in this context), so I did:

mydict = dict(zip(zip(df['A'],df['B']), df['C']))

Using pandas to_dict() also works:

mydict = df.set_index(['A','B']).to_dict(orient='dict')['C']

(none of the columns A or B were used as index before executing the line creating the dictionary)

Both approaches are fast (less than one second on a dataframe with 85k rows, on a 5-year-old fast dual-core laptop).

The reasons I’m posting this:

  1. for those who need this kind of solution
  2. if someone knows a faster executing solution (e.g., for millions of rows), I’d appreciate a reply.

Answer 10

This is my solution: a basic loop that collects, for each unique key, the list of values.

def get_dict_from_pd(df, key_col, row_col):
    result = dict()
    for i in set(df[key_col].values):
        is_i = df[key_col] == i
        result[i] = list(df[is_i][row_col].values)
    return result


Answer 11

This is my solution:

import pandas as pd
df = pd.read_excel('dic.xlsx')
df_T = df.set_index('id').T
dic = df_T.to_dict('records')
print(dic)

Convert a row of a Pandas DataFrame into column headers

Question: Convert a row of a Pandas DataFrame into column headers

The data I have to work with is a bit messy. It has header names inside its data. How can I choose a row from an existing pandas dataframe and make it (rename it to) a column header?

I want to do something like:

header = df[df['old_header_name1'] == 'new_header_name1']

df.columns = header

Answer 0

In [21]: df = pd.DataFrame([(1,2,3), ('foo','bar','baz'), (4,5,6)])

In [22]: df
Out[22]: 
     0    1    2
0    1    2    3
1  foo  bar  baz
2    4    5    6

Set the column labels to equal the values in the 2nd row (index location 1):

In [23]: df.columns = df.iloc[1]

If the index has unique labels, you can drop the 2nd row using:

In [24]: df.drop(df.index[1])
Out[24]: 
1 foo bar baz
0   1   2   3
2   4   5   6

If the index is not unique, you could use:

In [133]: df.iloc[pd.RangeIndex(len(df)).drop(1)]
Out[133]: 
1 foo bar baz
0   1   2   3
2   4   5   6

Using df.drop(df.index[1]) removes all rows with the same label as the second row. Because non-unique indexes can lead to stumbling blocks (or potential bugs) like this, it’s often better to take care that the index is unique (even though Pandas does not require it).


Answer 1

This works (pandas v0.19.2):

df.rename(columns=df.iloc[0])
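Note that rename only relabels the columns; the header row itself remains in the data. A sketch that also drops it (assuming a unique index):

df = df.rename(columns=df.iloc[0]).drop(df.index[0])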

Answer 2

It would be easier to recreate the data frame. This would also interpret the column types from scratch.

headers = df.iloc[0]
new_df  = pd.DataFrame(df.values[1:], columns=headers)

Answer 3

You can specify the row index in the read_csv or read_html constructors via the header parameter, which represents "Row number(s) to use as the column names, and the start of the data". This has the advantage of automatically dropping all the preceding rows, which supposedly are junk.

import pandas as pd
from io import StringIO

In[1]
    csv = '''junk1, junk2, junk3, junk4, junk5
    junk1, junk2, junk3, junk4, junk5
    pears, apples, lemons, plums, other
    40, 50, 61, 72, 85
    '''

    df = pd.read_csv(StringIO(csv), header=2)
    print(df)

Out[1]
       pears   apples   lemons   plums   other
    0     40       50       61      72      85

How do I release memory used by a pandas DataFrame?

Question: How do I release memory used by a pandas DataFrame?

I have a really large csv file that I opened in pandas as follows….

import pandas
df = pandas.read_csv('large_txt_file.txt')

Once I do this my memory usage increases by 2GB, which is expected because this file contains millions of rows. My problem comes when I need to release this memory. I ran….

del df

However, my memory usage did not drop. Is this the wrong approach to release memory used by a pandas data frame? If it is, what is the proper way?


Answer 0

Reducing memory usage in Python is difficult, because Python does not actually release memory back to the operating system. If you delete objects, then the memory is available to new Python objects, but not free()‘d back to the system (see this question).

If you stick to numeric numpy arrays, those are freed, but boxed objects are not.

>>> import os, psutil, numpy as np
>>> def usage():
...     process = psutil.Process(os.getpid())
...     return process.get_memory_info()[0] / float(2 ** 20)
... 
>>> usage() # initial memory usage
27.5 

>>> arr = np.arange(10 ** 8) # create a large array without boxing
>>> usage()
790.46875
>>> del arr
>>> usage()
27.52734375 # numpy just free()'d the array

>>> arr = np.arange(10 ** 8, dtype='O') # create lots of objects
>>> usage()
3135.109375
>>> del arr
>>> usage()
2372.16796875  # numpy frees the array, but python keeps the heap big

Reducing the Number of Dataframes

Python keeps our memory at its high watermark, but we can reduce the total number of dataframes we create. When modifying your dataframe, prefer inplace=True so you don't create copies.

Another common gotcha is holding on to copies of previously created dataframes in ipython:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'foo': [1,2,3,4]})

In [3]: df + 1
Out[3]: 
   foo
0    2
1    3
2    4
3    5

In [4]: df + 2
Out[4]: 
   foo
0    3
1    4
2    5
3    6

In [5]: Out # Still has all our temporary DataFrame objects!
Out[5]: 
{3:    foo
 0    2
 1    3
 2    4
 3    5, 4:    foo
 0    3
 1    4
 2    5
 3    6}

You can fix this by typing %reset Out to clear your history. Alternatively, you can adjust how much history ipython keeps with ipython --cache-size=5 (default is 1000).

Reducing Dataframe Size

Wherever possible, avoid using object dtypes.

>>> df.dtypes
foo    float64 # 8 bytes per value
bar      int64 # 8 bytes per value
baz     object # at least 48 bytes per value, often more

Values with an object dtype are boxed, which means the numpy array just contains a pointer and you have a full Python object on the heap for every value in your dataframe. This includes strings.

Whilst numpy supports fixed-size strings in arrays, pandas does not (it has caused user confusion). This can make a significant difference:

>>> import numpy as np
>>> arr = np.array(['foo', 'bar', 'baz'])
>>> arr.dtype
dtype('S3')
>>> arr.nbytes
9

>>> import sys; import pandas as pd
>>> s = pd.Series(['foo', 'bar', 'baz'])
>>> s.dtype
dtype('O')
>>> sum(sys.getsizeof(x) for x in s)
120

You may want to avoid using string columns, or find a way of representing string data as numbers.

If you have a dataframe that contains many repeated values (NaN is very common), then you can use a sparse data structure to reduce memory usage:

>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo    float64
dtypes: float64(1)
memory usage: 605.5 MB

>>> df1.shape
(39681584, 1)

>>> df1.foo.isnull().sum() * 100. / len(df1)
20.628483479893344 # so 20% of values are NaN

>>> df1.to_sparse().info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo    float64
dtypes: float64(1)
memory usage: 543.0 MB

Viewing Memory Usage

You can view the memory usage (docs):

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 14 columns):
...
dtypes: datetime64[ns](1), float64(8), int64(1), object(4)
memory usage: 4.4+ GB

As of pandas 0.17.1, you can also do df.info(memory_usage='deep') to see memory usage including objects.


Answer 1

As noted in the comments, there are some things to try: gc.collect (@EdChum) may clear stuff, for example. At least from my experience, these things sometimes work and often don’t.

There is one thing that always works, however, because it is done at the OS, not language, level.

Suppose you have a function that creates an intermediate huge DataFrame, and returns a smaller result (which might also be a DataFrame):

def huge_intermediate_calc(something):
    ...
    huge_df = pd.DataFrame(...)
    ...
    return some_aggregate

Then if you do something like

import multiprocessing

result = multiprocessing.Pool(1).map(huge_intermediate_calc, [something_])[0]

Then the function is executed in a different process. When that process completes, the OS reclaims all the resources it used. There's really nothing Python, pandas, or the garbage collector can do to stop that.


Answer 2

This solves the problem of releasing the memory for me!!!

import gc
import pandas as pd

del [[df_1,df_2]]
gc.collect()
df_1=pd.DataFrame()
df_2=pd.DataFrame()

The dataframes are explicitly set to empty in the statements above.

First, all references to the dataframes are deleted, so they are no longer reachable from Python; then the garbage collector is invoked explicitly (gc.collect()); and finally the names are rebound to empty dataframes.

How the garbage collector works is explained well at https://stackify.com/python-garbage-collection/


Answer 3

df will not be deleted by del df if there are any other references to it at the time of deletion. So you need to delete all references to it for the memory to be released.

So all the names bound to df should be deleted to trigger garbage collection.

Use objgraph to check what is holding onto the objects.
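A minimal sketch of such a check, assuming the third-party objgraph package is installed (pip install objgraph) and Graphviz is available to render the graph:

import objgraph
import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3]})
extra_ref = df  # a second reference that would keep the frame alive after `del df`

# render the chain of references that keeps the frame alive
objgraph.show_backrefs([df], max_depth=3, filename='df_backrefs.png')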


Answer 4

It seems there is an issue with glibc that affects the memory allocation in Pandas: https://github.com/pandas-dev/pandas/issues/2659

The monkey patch detailed on this issue has resolved the problem for me:

# monkeypatches.py

# Solving memory leak problem in pandas
# https://github.com/pandas-dev/pandas/issues/2659#issuecomment-12021083
import sys  # needed for the file=sys.stderr prints below
import pandas as pd
from ctypes import cdll, CDLL
try:
    cdll.LoadLibrary("libc.so.6")
    libc = CDLL("libc.so.6")
    libc.malloc_trim(0)
except (OSError, AttributeError):
    libc = None

__old_del = getattr(pd.DataFrame, '__del__', None)

def __new_del(self):
    if __old_del:
        __old_del(self)
    libc.malloc_trim(0)

if libc:
    print('Applying monkeypatch for pd.DataFrame.__del__', file=sys.stderr)
    pd.DataFrame.__del__ = __new_del
else:
    print('Skipping monkeypatch for pd.DataFrame.__del__: libc or malloc_trim() not found', file=sys.stderr)

Converting unix time into readable dates in a pandas dataframe

Question: Converting unix time into readable dates in a pandas dataframe

I have a dataframe with unix times and prices in it. I want to convert the index column so that it shows in human readable dates.

So for instance I have date as 1349633705 in the index column but I’d want it to show as 10/07/2012 (or at least 10/07/2012 18:15).

For some context, here is the code I’m working with and what I’ve tried already:

import json
import urllib2
from datetime import datetime
response = urllib2.urlopen('http://blockchain.info/charts/market-price?&format=json')
data = json.load(response)   
df = DataFrame(data['values'])
df.columns = ["date","price"]
#convert dates 
df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d"))
df.index = df.date   

As you can see I’m using df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d")) here which doesn’t work since I’m working with integers, not strings. I think I need to use datetime.date.fromtimestamp but I’m not quite sure how to apply this to the whole of df.date.

Thanks.


Answer 0

These appear to be seconds since epoch.

In [20]: df = DataFrame(data['values'])

In [21]: df.columns = ["date","price"]

In [22]: df
Out[22]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 358 entries, 0 to 357
Data columns (total 2 columns):
date     358  non-null values
price    358  non-null values
dtypes: float64(1), int64(1)

In [23]: df.head()
Out[23]: 
         date  price
0  1349720105  12.08
1  1349806505  12.35
2  1349892905  12.15
3  1349979305  12.19
4  1350065705  12.15
In [25]: df['date'] = pd.to_datetime(df['date'],unit='s')

In [26]: df.head()
Out[26]: 
                 date  price
0 2012-10-08 18:15:05  12.08
1 2012-10-09 18:15:05  12.35
2 2012-10-10 18:15:05  12.15
3 2012-10-11 18:15:05  12.19
4 2012-10-12 18:15:05  12.15

In [27]: df.dtypes
Out[27]: 
date     datetime64[ns]
price           float64
dtype: object

Answer 1

If you try using:

df[DATE_FIELD] = pd.to_datetime(df[DATE_FIELD], unit='s')

and receive an error:

“pandas.tslib.OutOfBoundsDatetime: cannot convert input with unit ‘s'”

This means the DATE_FIELD is not specified in seconds.

In my case, it was milliseconds since the epoch.

The conversion worked using below:

df[DATE_FIELD] = pd.to_datetime(df[DATE_FIELD], unit='ms')

Answer 2

Assuming we imported pandas as pd and df is our dataframe

pd.to_datetime(df['date'], unit='s')

works for me.
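For instance, a quick check on the timestamp from the question:

import pandas as pd

print(pd.to_datetime(1349633705, unit='s'))
# 2012-10-07 18:15:05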


Answer 3

Alternatively, by changing a line of the above code:

# df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d"))
df.date = df.date.apply(lambda d: datetime.datetime.fromtimestamp(int(d)).strftime('%Y-%m-%d'))

It should also work.