Tag archives: dataframe

Pandas selection by label sometimes returns a Series, sometimes a DataFrame

Question: Pandas selection by label sometimes returns a Series, sometimes a DataFrame

In Pandas, when I select a label that has only one entry in the index I get back a Series, but when I select a label that has more than one entry I get back a DataFrame.

Why is that? Is there a way to ensure I always get back a data frame?

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

In [3]: type(df.loc[3])
Out[3]: pandas.core.frame.DataFrame

In [4]: type(df.loc[1])
Out[4]: pandas.core.series.Series

Answer 0

Granted, the behavior is inconsistent, but I think it's easy to imagine cases where this is convenient. Anyway, to get a DataFrame every time, just pass a list to loc. There are other ways, but in my opinion this is the cleanest.

In [2]: type(df.loc[[3]])
Out[2]: pandas.core.frame.DataFrame

In [3]: type(df.loc[[1]])
Out[3]: pandas.core.frame.DataFrame

Answer 1

You have an index with three entries labeled 3. For this reason df.loc[3] will return a DataFrame.

The reason is that you don't specify the column. So df.loc[3] selects three items across all columns (there is only column 0), while df.loc[3,0] will return a Series. E.g. df.loc[1:2] also returns a DataFrame, because you slice the rows.

Selecting a single row (as df.loc[1]) returns a Series with the column names as the index.

If you want to be sure to always have a DataFrame, you can slice like df.loc[1:1]. Another option is boolean indexing (df.loc[df.index==1]) or the take method (df.take([0]), but note this uses positions, not labels!).
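
A quick sketch of these three options, reusing the df from the question (standard pandas behavior, shown here for illustration):

import pandas as pd

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

# Slicing a label range always yields a DataFrame, even for a single row.
print(type(df.loc[1:1]))            # <class 'pandas.core.frame.DataFrame'>

# Boolean indexing on the index also yields a DataFrame.
print(type(df.loc[df.index == 1]))  # <class 'pandas.core.frame.DataFrame'>

# take() yields a DataFrame too, but remember it works on positions, not labels.
print(type(df.take([0])))           # <class 'pandas.core.frame.DataFrame'>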


Answer 2

The TLDR

When using loc

df.loc[:] = Dataframe

df.loc[int] = Series for a unique row label (and a DataFrame if that label occurs more than once in the index)

df.loc[:, ["col_name"]] = Dataframe

df.loc[:, "col_name"] = Series

Not using loc

df["col_name"] = Series

df[["col_name"]] = Dataframe
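
A short check of these rules (the frame and column names are just placeholders):

import pandas as pd

df = pd.DataFrame({"col_name": [10, 20], "other": [1, 2]})

print(type(df.loc[:]))                # DataFrame
print(type(df.loc[0]))                # Series (0 is a unique row label)
print(type(df.loc[:, ["col_name"]]))  # DataFrame
print(type(df.loc[:, "col_name"]))    # Series
print(type(df["col_name"]))           # Series
print(type(df[["col_name"]]))         # DataFrame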


Answer 3

Use df['columnName'] to get a Series and df[['columnName']] to get a Dataframe.


Answer 4

You wrote in a comment to joris’ answer:

“I don’t understand the design decision for single rows to get converted into a series – why not a data frame with one row?”

A single row isn't converted into a Series.
It IS a Series (no, I don't think so, in fact; see the edit below).

The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Panel is a container for DataFrame objects. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.

http://pandas.pydata.org/pandas-docs/stable/overview.html#why-more-than-1-data-structure

The data model of Pandas objects has been chosen like that. The reason certainly lies in the fact that it ensures some advantages I don't know of (I don't fully understand the last sentence of the citation; maybe that's the reason).


Edit: I don't agree with myself

A DataFrame can't be composed of elements that would be Series, because the following code gives the same type, Series, for a row as well as for a column:

import pandas as pd

df = pd.DataFrame(data=[11, 12, 13], index=[2, 3, 3])

print('-------- df -------------')
print(df)

print('\n------- df.loc[2] --------')
print(df.loc[2])
print('type(df.loc[2]) : ', type(df.loc[2]))

print('\n--------- df[0] ----------')
print(df[0])
print('type(df[0]) : ', type(df[0]))

result

-------- df -------------
    0
2  11
3  12
3  13

------- df.loc[2] --------
0    11
Name: 2, dtype: int64
type(df.loc[2]) :  <class 'pandas.core.series.Series'>

--------- df[0] ----------
2    11
3    12
3    13
Name: 0, dtype: int64
type(df[0]) :  <class 'pandas.core.series.Series'>

So, it makes no sense to pretend that a DataFrame is composed of Series, because what would those Series be supposed to be: columns or rows? Stupid question and vision.


Then what is a DataFrame?

In the previous version of this answer, I asked this question, trying to find the answer to the Why is that? part of the OP's question and the similar interrogation single rows to get converted into a series - why not a data frame with one row? in one of his comments,
while the Is there a way to ensure I always get back a data frame? part has been answered by Dan Allan.

Then, as the Pandas docs cited above say that the pandas data structures are best seen as containers of lower dimensional data, it seemed to me that the understanding of the why would be found in the characteristics of the nature of DataFrame structures.

However, I realized that this cited advice must not be taken as a precise description of the nature of Pandas’ data structures.
This advice doesn’t mean that a DataFrame is a container of Series.
It expresses that the mental representation of a DataFrame as a container of Series (either rows or columns, according to the option considered at a given moment of reasoning) is a good way to consider DataFrames, even if it isn't strictly the case in reality. "Good" meaning that this vision makes it possible to use DataFrames efficiently. That's all.


Then what is a DataFrame object?

The DataFrame class produces instances that have a particular structure originating in the NDFrame base class, itself derived from the PandasContainer base class, which is also a parent class of the Series class.
Note that this is correct for Pandas up to version 0.12. In the upcoming version 0.13, Series will also derive from the NDFrame class only.

# with pandas 0.12

from pandas import Series
print 'Series  :\n',Series
print 'Series.__bases__  :\n',Series.__bases__

from pandas import DataFrame
print '\nDataFrame  :\n',DataFrame
print 'DataFrame.__bases__  :\n',DataFrame.__bases__

print '\n-------------------'

from pandas.core.generic import NDFrame
print '\nNDFrame.__bases__  :\n',NDFrame.__bases__

from pandas.core.generic import PandasContainer
print '\nPandasContainer.__bases__  :\n',PandasContainer.__bases__

from pandas.core.base import PandasObject
print '\nPandasObject.__bases__  :\n',PandasObject.__bases__

from pandas.core.base import StringMixin
print '\nStringMixin.__bases__  :\n',StringMixin.__bases__

result

Series  :
<class 'pandas.core.series.Series'>
Series.__bases__  :
(<class 'pandas.core.generic.PandasContainer'>, <type 'numpy.ndarray'>)

DataFrame  :
<class 'pandas.core.frame.DataFrame'>
DataFrame.__bases__  :
(<class 'pandas.core.generic.NDFrame'>,)

-------------------

NDFrame.__bases__  :
(<class 'pandas.core.generic.PandasContainer'>,)

PandasContainer.__bases__  :
(<class 'pandas.core.base.PandasObject'>,)

PandasObject.__bases__  :
(<class 'pandas.core.base.StringMixin'>,)

StringMixin.__bases__  :
(<type 'object'>,)

So my understanding is now that a DataFrame instance has certain methods that have been crafted in order to control the way data are extracted from rows and columns.

The ways these extraction methods work are described on this page: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing
There we find the method given by Dan Allan, along with other methods.

Why have these extraction methods been crafted the way they were?
That's certainly because they were judged to give the best possibilities and ease for data analysis.
It's precisely what is expressed in this sentence:

The best way to think about the pandas data structures is as flexible containers for lower dimensional data.

The why of the extraction of data from a DataFrame instance doesn't lie in its structure; it lies in the why of this structure. I guess that the structure and functioning of Pandas' data structures have been chiseled so as to be as intellectually intuitive as possible, and that to understand the details, one must read the blog of Wes McKinney.


Answer 5

If the objective is to get a subset of the data set using the index, it is best to avoid using loc or iloc. Instead you should use syntax similar to this :

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
result = df[df.index == 3] 
isinstance(result, pd.DataFrame) # True

result = df[df.index == 1]
isinstance(result, pd.DataFrame) # True

Answer 6

If you also select on the index of the dataframe, then the result can be either a DataFrame or a Series, or it can be a scalar (a single value).

This function ensures that you always get a list from your selection (if the df, index and column are valid):

def get_list_from_df_column(df, index, column):
    df_or_series = df.loc[index, [column]]
    # df.loc[index, column] is also possible and returns a series or a scalar
    if isinstance(df_or_series, pd.Series):
        resulting_list = df_or_series.tolist()  # get list from series
    else:
        resulting_list = df_or_series[column].tolist()
        # use the column key to get a series from the dataframe
    return resulting_list
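
A hypothetical usage example for this helper, with a small frame whose index contains a duplicated label:

import pandas as pd

df = pd.DataFrame({"col": [10, 20, 30]}, index=[1, 2, 2])

# Duplicated label: df.loc[2, ['col']] is a DataFrame, so the else branch runs.
print(get_list_from_df_column(df, 2, "col"))  # [20, 30]

# Unique label: the selection is a one-element Series.
print(get_list_from_df_column(df, 1, "col"))  # [10]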

How to replace text in a column of a pandas dataframe?

Question: How to replace text in a column of a pandas dataframe?

I have a column in my dataframe like this:

range
"(2,30)"
"(50,290)"
"(400,1000)"
... 

and I want to replace the comma (,) with a dash (-). I'm currently using this method but nothing is changed.

org_info_exc['range'].replace(',', '-', inplace=True)

Can anybody help?


Answer 0

Use the vectorised str method replace:

In [30]:

df['range'] = df['range'].str.replace(',','-')
df
Out[30]:
      range
0    (2-30)
1  (50-290)

EDIT

So if we look at what you tried and why it didn’t work:

df['range'].replace(',','-',inplace=True)

from the docs we see this desc:

str or regex: str: string exactly matching to_replace will be replaced with value

So because the str values do not match, no replacement occurs, compare with the following:

In [43]:

df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
Out[43]:
0    (2,30)
1         -
Name: range, dtype: object

here we get an exact match on the second row and the replacement occurs.
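
As a side note (my addition, not part of the original answer): Series.replace itself can do substring replacement if you opt in to regex matching, which makes the original attempt work:

df['range'] = df['range'].replace(',', '-', regex=True)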


Answer 1

For anyone else arriving here from Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP’s ‘range’ column): Pandas has a built in replace method available on a dataframe object.

df.replace(',', '-', regex=True)

Source: Docs


Answer 2

Replace all spaces with underscores in the column names:

data.columns = data.columns.str.replace(' ', '_', regex=True)

Answer 3

In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:

import re
chars_to_remove = ['.', '-', '(', ')', '']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'

df['string_col'].str.replace(regular_expression, '', regex=True)

Convert a row of a Pandas DataFrame into column headers

Question: Convert a row of a Pandas DataFrame into column headers

The data I have to work with is a bit messy. It has header names inside of its data. How can I choose a row from an existing pandas dataframe and make it (rename it to) a column header?

I want to do something like:

header = df[df['old_header_name1'] == 'new_header_name1']

df.columns = header

Answer 0

In [21]: df = pd.DataFrame([(1,2,3), ('foo','bar','baz'), (4,5,6)])

In [22]: df
Out[22]: 
     0    1    2
0    1    2    3
1  foo  bar  baz
2    4    5    6

Set the column labels to equal the values in the 2nd row (index location 1):

In [23]: df.columns = df.iloc[1]

If the index has unique labels, you can drop the 2nd row using:

In [24]: df.drop(df.index[1])
Out[24]: 
1 foo bar baz
0   1   2   3
2   4   5   6

If the index is not unique, you could use:

In [133]: df.iloc[pd.RangeIndex(len(df)).drop(1)]
Out[133]: 
1 foo bar baz
0   1   2   3
2   4   5   6

Using df.drop(df.index[1]) removes all rows with the same label as the second row. Because non-unique indexes can lead to stumbling blocks (or potential bugs) like this, it’s often better to take care that the index is unique (even though Pandas does not require it).


Answer 1

This works (pandas v0.19.2):

df.rename(columns=df.iloc[0])
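
To also drop the row that supplied the header values, one possible follow-up (assuming a unique index) is:

df = df.rename(columns=df.iloc[0]).drop(df.index[0])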

Answer 2

It would be easier to recreate the data frame. This would also interpret the columns types from scratch.

headers = df.iloc[0]
new_df  = pd.DataFrame(df.values[1:], columns=headers)

Answer 3

You can specify the row index in the read_csv or read_html constructors via the header parameter, which represents "Row number(s) to use as the column names, and the start of the data". This has the advantage of automatically dropping all the preceding rows, which are supposedly junk.

import pandas as pd
from io import StringIO

In[1]
    csv = '''junk1, junk2, junk3, junk4, junk5
    junk1, junk2, junk3, junk4, junk5
    pears, apples, lemons, plums, other
    40, 50, 61, 72, 85
    '''

    df = pd.read_csv(StringIO(csv), header=2)
    print(df)

Out[1]
       pears   apples   lemons   plums   other
    0     40       50       61      72      85

Convert unix time to a readable date in a pandas dataframe

Question: Convert unix time to a readable date in a pandas dataframe

I have a dataframe with unix times and prices in it. I want to convert the index column so that it shows in human readable dates.

So for instance I have date as 1349633705 in the index column but I’d want it to show as 10/07/2012 (or at least 10/07/2012 18:15).

For some context, here is the code I’m working with and what I’ve tried already:

import json
import urllib2
from datetime import datetime
response = urllib2.urlopen('http://blockchain.info/charts/market-price?&format=json')
data = json.load(response)   
df = DataFrame(data['values'])
df.columns = ["date","price"]
#convert dates 
df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d"))
df.index = df.date   

As you can see I’m using df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d")) here which doesn’t work since I’m working with integers, not strings. I think I need to use datetime.date.fromtimestamp but I’m not quite sure how to apply this to the whole of df.date.

Thanks.


Answer 0

These appear to be seconds since epoch.

In [20]: df = DataFrame(data['values'])

In [21]: df.columns = ["date","price"]

In [22]: df
Out[22]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 358 entries, 0 to 357
Data columns (total 2 columns):
date     358  non-null values
price    358  non-null values
dtypes: float64(1), int64(1)

In [23]: df.head()
Out[23]: 
         date  price
0  1349720105  12.08
1  1349806505  12.35
2  1349892905  12.15
3  1349979305  12.19
4  1350065705  12.15
In [25]: df['date'] = pd.to_datetime(df['date'],unit='s')

In [26]: df.head()
Out[26]: 
                 date  price
0 2012-10-08 18:15:05  12.08
1 2012-10-09 18:15:05  12.35
2 2012-10-10 18:15:05  12.15
3 2012-10-11 18:15:05  12.19
4 2012-10-12 18:15:05  12.15

In [27]: df.dtypes
Out[27]: 
date     datetime64[ns]
price           float64
dtype: object

Answer 1

If you try using:

df[DATE_FIELD] = pd.to_datetime(df[DATE_FIELD], unit='s')

and receive an error:

"pandas.tslib.OutOfBoundsDatetime: cannot convert input with unit 's'"

This means the DATE_FIELD is not specified in seconds.

In my case, the epoch time was in milliseconds.

The conversion worked using the following:

df[DATE_FIELD] = pd.to_datetime(df[DATE_FIELD], unit='ms')

Answer 2

Assuming we imported pandas as pd and df is our dataframe

pd.to_datetime(df['date'], unit='s')

works for me.


Answer 3

Alternatively, by changing a line of the above code:

# df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d"))
df.date = df.date.apply(lambda d: datetime.fromtimestamp(int(d)).strftime('%Y-%m-%d'))

It should also work.
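
A vectorized alternative (my sketch, not part of the original answer) avoids the Python-level apply by formatting the converted datetimes directly:

df['date'] = pd.to_datetime(df['date'], unit='s').dt.strftime('%Y-%m-%d')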


Extract column value based on another column in a pandas dataframe

Question: Extract column value based on another column in a pandas dataframe

I am kind of getting stuck on extracting the value of one variable conditional on another variable. For example, the following dataframe:

A  B
p1 1
p1 2
p3 3
p2 4

How can I get the value of A when B=3? Every time I extracted the value of A, I got an object, not a string.


Answer 0

You could use loc to get a Series that satisfies your condition, and then iloc to get the first element:

In [2]: df
Out[2]:
    A  B
0  p1  1
1  p1  2
2  p3  3
3  p2  4

In [3]: df.loc[df['B'] == 3, 'A']
Out[3]:
2    p3
Name: A, dtype: object

In [4]: df.loc[df['B'] == 3, 'A'].iloc[0]
Out[4]: 'p3'

Answer 1

You can try query, which is less typing:

df.query('B==3')['A']

Answer 2

df[df['B']==3]['A'], assuming df is your pandas.DataFrame.


Answer 3

Use df[df['B']==3]['A'].values if you just want the item itself without the brackets


Answer 4

male_avgtip=(tips_data.loc[tips_data['sex'] == 'Male', 'tip']).mean()

I have also used this kind of conditional selection and extraction for my assignments.


Get the total of a Pandas column

Question: Get the total of a Pandas column

Target

I have a Pandas data frame, as shown below, with multiple columns and would like to get the total of the column MyColumn.


Data Frame df:

print df

           X           MyColumn  Y              Z   
0          A           84        13.0           69.0   
1          B           76         77.0          127.0   
2          C           28         69.0           16.0   
3          D           28         28.0           31.0   
4          E           19         20.0           85.0   
5          F           84        193.0           70.0   

My attempt:

I have attempted to get the sum of the column using groupby and .sum():

Total = df.groupby['MyColumn'].sum()

print Total

This causes the following error:

TypeError: 'instancemethod' object has no attribute '__getitem__'

Expected Output

I'd have expected the output to be as follows:

319

Or alternatively, I would like df to be edited with a new row entitled TOTAL containing the total:

           X           MyColumn  Y              Z   
0          A           84        13.0           69.0   
1          B           76         77.0          127.0   
2          C           28         69.0           16.0   
3          D           28         28.0           31.0   
4          E           19         20.0           85.0   
5          F           84        193.0           70.0   
TOTAL                  319

Answer 0

You should use sum:

Total = df['MyColumn'].sum()
print (Total)
319

Then use loc with a Series; in that case the index should be set to the name of the specific column you need to sum:

df.loc['Total'] = pd.Series(df['MyColumn'].sum(), index = ['MyColumn'])
print (df)
         X  MyColumn      Y      Z
0        A      84.0   13.0   69.0
1        B      76.0   77.0  127.0
2        C      28.0   69.0   16.0
3        D      28.0   28.0   31.0
4        E      19.0   20.0   85.0
5        F      84.0  193.0   70.0
Total  NaN     319.0    NaN    NaN

because if you pass scalar, the values of all rows will be filled:

df.loc['Total'] = df['MyColumn'].sum()
print (df)
         X  MyColumn      Y      Z
0        A        84   13.0   69.0
1        B        76   77.0  127.0
2        C        28   69.0   16.0
3        D        28   28.0   31.0
4        E        19   20.0   85.0
5        F        84  193.0   70.0
Total  319       319  319.0  319.0

Two other solutions use at and ix; see their application below:

df.at['Total', 'MyColumn'] = df['MyColumn'].sum()
print (df)
         X  MyColumn      Y      Z
0        A      84.0   13.0   69.0
1        B      76.0   77.0  127.0
2        C      28.0   69.0   16.0
3        D      28.0   28.0   31.0
4        E      19.0   20.0   85.0
5        F      84.0  193.0   70.0
Total  NaN     319.0    NaN    NaN

df.ix['Total', 'MyColumn'] = df['MyColumn'].sum()
print (df)
         X  MyColumn      Y      Z
0        A      84.0   13.0   69.0
1        B      76.0   77.0  127.0
2        C      28.0   69.0   16.0
3        D      28.0   28.0   31.0
4        E      19.0   20.0   85.0
5        F      84.0  193.0   70.0
Total  NaN     319.0    NaN    NaN

Note: Since Pandas v0.20, ix has been deprecated. Use loc or iloc instead.


Answer 1

Another option you can go with here:

df.loc["Total", "MyColumn"] = df.MyColumn.sum()

#         X  MyColumn      Y       Z
#0        A     84.0    13.0    69.0
#1        B     76.0    77.0   127.0
#2        C     28.0    69.0    16.0
#3        D     28.0    28.0    31.0
#4        E     19.0    20.0    85.0
#5        F     84.0   193.0    70.0
#Total  NaN    319.0     NaN     NaN

You can also use append() method:

df.append(pd.DataFrame(df.MyColumn.sum(), index = ["Total"], columns=["MyColumn"]))


Update:

In case you need to append sum for all numeric columns, you can do one of the followings:

Use append to do this in a functional manner (doesn’t change the original data frame):

# select numeric columns and calculate the sums
sums = df.select_dtypes(pd.np.number).sum().rename('total')

# append sums to the data frame
df.append(sums)
#         X  MyColumn      Y      Z
#0        A      84.0   13.0   69.0
#1        B      76.0   77.0  127.0
#2        C      28.0   69.0   16.0
#3        D      28.0   28.0   31.0
#4        E      19.0   20.0   85.0
#5        F      84.0  193.0   70.0
#total  NaN     319.0  400.0  398.0

Use loc to mutate data frame in place:

df.loc['total'] = df.select_dtypes(pd.np.number).sum()
df
#         X  MyColumn      Y      Z
#0        A      84.0   13.0   69.0
#1        B      76.0   77.0  127.0
#2        C      28.0   69.0   16.0
#3        D      28.0   28.0   31.0
#4        E      19.0   20.0   85.0
#5        F      84.0  193.0   70.0
#total  NaN     638.0  800.0  796.0
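
Note that recent pandas versions have removed both pd.np and DataFrame.append; a rough equivalent of the functional variant (my sketch) is:

import numpy as np
import pandas as pd

# import numpy directly and use concat instead of append
sums = df.select_dtypes(np.number).sum().rename('total')
df = pd.concat([df, sums.to_frame().T])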

Answer 2

Similar to getting the length of a dataframe, len(df), the following worked for pandas and blaze:

Total = sum(df['MyColumn'])

or alternatively

Total = sum(df.MyColumn)
print(Total)

Answer 3

There are two ways to sum a column:

dataset = pd.read_csv("data.csv")

1: sum(dataset.Column_name)

2: dataset['Column_Name'].sum()

If there is any issue with this, please correct me.


Answer 4

As another option, you can do something like the below:

Group   Valuation   amount
    0   BKB Tube    156
    1   BKB Tube    143
    2   BKB Tube    67
    3   BAC Tube    176
    4   BAC Tube    39
    5   JDK Tube    75
    6   JDK Tube    35
    7   JDK Tube    155
    8   ETH Tube    38
    9   ETH Tube    56

You can use the script below for the data above:

import pandas as pd    
data = pd.read_csv("daata1.csv")
bytreatment = data.groupby('Group')
bytreatment['amount'].sum()

Append a list or Series as a row to a pandas DataFrame?

Question: Append a list or Series as a row to a pandas DataFrame?

So I have initialized an empty pandas DataFrame and I would like to iteratively append lists (or Series) as rows in this DataFrame. What is the best way of doing this?


Answer 0

Sometimes it's easier to do all the appending outside of pandas, then just create the DataFrame in one shot.

>>> import pandas as pd
>>> simple_list=[['a','b']]
>>> simple_list.append(['e','f'])
>>> df=pd.DataFrame(simple_list,columns=['col1','col2'])
   col1 col2
0    a    b
1    e    f

Answer 1

df = pd.DataFrame(columns=list("ABC"))
df.loc[len(df)] = [1,2,3]

Answer 2

Here’s a simple and dumb solution:

>>> import pandas as pd
>>> df = pd.DataFrame()
>>> df = df.append({'foo':1, 'bar':2}, ignore_index=True)

Answer 3

Could you do something like this?

>>> import pandas as pd
>>> df = pd.DataFrame(columns=['col1', 'col2'])
>>> df = df.append(pd.Series(['a', 'b'], index=['col1','col2']), ignore_index=True)
>>> df = df.append(pd.Series(['d', 'e'], index=['col1','col2']), ignore_index=True) 
>>> df
  col1 col2
0    a    b
1    d    e

Does anyone have a more elegant solution?


Answer 4

Following on from Mike Chirico's answer… if you want to append a list after the dataframe is already populated…

>>> list = [['f','g']]
>>> df = df.append(pd.DataFrame(list, columns=['col1','col2']),ignore_index=True)
>>> df
  col1 col2
0    a    b
1    d    e
2    f    g

Answer 5

If you want to add a Series and use the Series’ index as columns of the DataFrame, you only need to append the Series between brackets:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame()

In [3]: row=pd.Series([1,2,3],["A","B","C"])

In [4]: row
Out[4]: 
A    1
B    2
C    3
dtype: int64

In [5]: df.append([row],ignore_index=True)
Out[5]: 
   A  B  C
0  1  2  3

[1 rows x 3 columns]

Without ignore_index=True you don't get a proper index.


Answer 6

Here’s a function that, given an already created dataframe, will append a list as a new row. This should probably have error catchers thrown in, but if you know exactly what you’re adding then it shouldn’t be an issue.

import pandas as pd
import numpy as np

def addRow(df,ls):
    """
    Given a dataframe and a list, append the list as a new row to the dataframe.

    :param df: <DataFrame> The original dataframe
    :param ls: <list> The new row to be added
    :return: <DataFrame> The dataframe with the newly appended row
    """

    numEl = len(ls)

    newRow = pd.DataFrame(np.array(ls).reshape(1,numEl), columns = list(df.columns))

    df = df.append(newRow, ignore_index=True)

    return df
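
A hypothetical usage example (note that np.array coerces mixed types to a common dtype, and DataFrame.append requires pandas older than 2.0):

df = pd.DataFrame(columns=['a', 'b'])
df = addRow(df, [1, 2])
print(df)
#    a  b
# 0  1  2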

Answer 7

Converting the list to a data frame within the append function works, also when applied in a loop

import pandas as pd
mylist = [1,2,3]
df = pd.DataFrame()
df = df.append(pd.DataFrame(data=[mylist]))

Answer 8

simply use loc:

>>> df
     A  B  C
one  1  2  3
>>> df.loc["two"] = [4,5,6]
>>> df
     A  B  C
one  1  2  3
two  4  5  6

Answer 9

As mentioned here – https://kite.com/python/answers/how-to-append-a-list-as-a-row-to-a-pandas-dataframe-in-python, you'll need to first convert the list to a Series, then append the Series to the dataframe.

df = pd.DataFrame([[1, 2], [3, 4]], columns = ["a", "b"])
to_append = [5, 6]
a_series = pd.Series(to_append, index = df.columns)
df = df.append(a_series, ignore_index=True)

Answer 10

The simplest way:

my_list = [1,2,3,4,5]
df['new_column'] = pd.Series(my_list).values

Edit:

Don't forget that the length of the new list should be the same as the length of the corresponding DataFrame.


Convert a column in a pandas dataframe from int to string

Question: Convert a column in a pandas dataframe from int to string

I have a dataframe in pandas with mixed int and str data columns. I first want to concatenate the columns within the dataframe. To do that I have to convert an int column to str. I've tried to do it as follows:

mtrx['X.3'] = mtrx.to_string(columns = ['X.3'])

or

mtrx['X.3'] = mtrx['X.3'].astype(str)

but in both cases it’s not working and I’m getting an error saying “cannot concatenate ‘str’ and ‘int’ objects”. Concatenating two str columns is working perfectly fine.


Answer 0

In [16]: df = DataFrame(np.arange(10).reshape(5,2),columns=list('AB'))

In [17]: df
Out[17]: 
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

In [18]: df.dtypes
Out[18]: 
A    int64
B    int64
dtype: object

Convert a series

In [19]: df['A'].apply(str)
Out[19]: 
0    0
1    2
2    4
3    6
4    8
Name: A, dtype: object

In [20]: df['A'].apply(str)[0]
Out[20]: '0'

Don’t forget to assign the result back:

df['A'] = df['A'].apply(str)

Convert the whole frame

In [21]: df.applymap(str)
Out[21]: 
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

In [22]: df.applymap(str).iloc[0,0]
Out[22]: '0'

df = df.applymap(str)

Answer 1

Change data type of DataFrame column:

To int:

df.column_name = df.column_name.astype(np.int64)

To str:

df.column_name = df.column_name.astype(str)


Answer 2

Warning: Both solutions given ( astype() and apply() ) do not preserve NULL values in either the nan or the None form.

import pandas as pd
import numpy as np

df = pd.DataFrame([None,'string',np.nan,42], index=[0,1,2,3], columns=['A'])

df1 = df['A'].astype(str)
df2 =  df['A'].apply(str)

print(df.isnull())
print(df1.isnull())
print(df2.isnull())

I believe this is fixed by the implementation of to_string()
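
One possible workaround (my addition, not part of the original answer) is to cast first and then mask the missing positions back to None, since astype(str) turns them into the literal strings 'None' and 'nan':

df3 = df['A'].astype(str).where(df['A'].notnull(), None)
print(df3.isnull())  # True again for the None and np.nan rows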


Answer 3

Use the following code:

df.column_name = df.column_name.astype('str')

Answer 4

Just for an additional reference.

All of the above answers will work in the case of a data frame. But if you are using a lambda while creating or modifying a column, this won't work, because there the value is considered an int attribute instead of a pandas Series. You have to use str(target_attribute) to make it a string. Please refer to the example below.

def add_zero_in_prefix(df):
    if df['Hour'] < 10:
        return '0' + str(df['Hour'])
    return str(df['Hour'])  # hours of 10 or more need no leading zero

data['str_hr'] = data.apply(add_zero_in_prefix, axis=1)

Drop columns whose names contain a specific string from a pandas DataFrame

Question: Drop columns whose names contain a specific string from a pandas DataFrame

I have a pandas dataframe with the following column names:

Result1, Test1, Result2, Test2, Result3, Test3, etc…

I want to drop all the columns whose name contains the word "Test". The number of such columns is not static but depends on a previous function.

How can I do that?


Answer 0

import pandas as pd

import numpy as np

array = np.random.random((2, 4))

df = pd.DataFrame(array, columns=('Test1', 'toto', 'test2', 'riri'))

print(df)

      Test1      toto     test2      riri
0  0.923249  0.572528  0.845464  0.144891
1  0.020438  0.332540  0.144455  0.741412

cols = [c for c in df.columns if c.lower()[:4] != 'test']

df = df[cols]

print(df)
       toto      riri
0  0.572528  0.144891
1  0.332540  0.741412

Answer 1

Here is one way to do this:

df = df[df.columns.drop(list(df.filter(regex='Test')))]

Answer 2

Cheaper, Faster, and Idiomatic: str.contains

In recent versions of pandas, you can use string methods on the index and columns. Here, str.startswith seems like a good fit.

To remove all columns starting with a given substring:

df.columns.str.startswith('Test')
# array([ True, False, False, False])

df.loc[:,~df.columns.str.startswith('Test')]

  toto test2 riri
0    x     x    x
1    x     x    x

For case-insensitive matching, you can use regex-based matching with str.contains with an SOL anchor:

df.columns.str.contains('^test', case=False)
# array([ True, False,  True, False])

df.loc[:,~df.columns.str.contains('^test', case=False)] 

  toto riri
0    x    x
1    x    x

If mixed types are a possibility, specify na=False as well.
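
For example (a minimal sketch):

df.loc[:, ~df.columns.str.contains('^test', case=False, na=False)]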


Answer 3

You can select just the columns you DO want using filter:

import pandas as pd
import numpy as np

data2 = [{'test2': 1, 'result1': 2}, {'test': 5, 'result34': 10, 'c': 20}]

df = pd.DataFrame(data2)

df

    c   result1     result34    test    test2
0   NaN     2.0     NaN     NaN     1.0
1   20.0    NaN     10.0    5.0     NaN

Now filter

df.filter(like='result',axis=1)

Get:

   result1  result34
0   2.0     NaN
1   NaN     10.0

Answer 4

This can be done neatly in one line with:

df = df.drop(df.filter(regex='Test').columns, axis=1)

Answer 5

Use the DataFrame.select method:

In [38]: df = DataFrame({'Test1': randn(10), 'Test2': randn(10), 'awesome': randn(10)})

In [39]: df.select(lambda x: not re.search('Test\d+', x), axis=1)
Out[39]:
   awesome
0    1.215
1    1.247
2    0.142
3    0.169
4    0.137
5   -0.971
6    0.736
7    0.214
8    0.111
9   -0.214
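
Note: DataFrame.select was deprecated in pandas 0.21 and later removed. On modern versions a boolean mask gives the same result, e.g. (my sketch, with re imported):

df.loc[:, [not re.search(r'Test\d+', x) for x in df.columns]]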

Answer 6

This method does everything in place. Many of the other answers create copies and are not as efficient:

df.drop(df.columns[df.columns.str.contains('Test')], axis=1, inplace=True)


Answer 7

Don’t drop. Catch the opposite of what you want.

df = df.filter(regex='^((?!badword).)*$')

Answer 8

The shortest way to do it is:

resdf = df.filter(like='Test',axis=1)

Answer 9

Solution when dropping a list of column names containing regex. I prefer this approach because I’m frequently editing the drop list. Uses a negative filter regex for the drop list.

drop_column_names = ['A','B.+','C.*']
drop_columns_regex = '^(?!(?:'+'|'.join(drop_column_names)+')$)'
print('Dropping columns:',', '.join([c for c in df.columns if re.search(drop_columns_regex,c)]))
df = df.filter(regex=drop_columns_regex,axis=1)

What is dtype('O') in pandas?

Question: What is dtype('O') in pandas?

I have a dataframe in pandas and I’m trying to figure out what the types of its values are. I am unsure what the type is of column 'Test'. However, when I run myFrame['Test'].dtype, I get;

dtype('O')

What does this mean?


Answer 0

It means:

'O'     (Python) objects

Source.

The first character specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters. The item size must correspond to an existing type, or an error will be raised. The supported kinds are:

'b'       boolean
'i'       (signed) integer
'u'       unsigned integer
'f'       floating-point
'c'       complex-floating point
'O'       (Python) objects
'S', 'a'  (byte-)string
'U'       Unicode
'V'       raw data (void)

Another answer helps if you need to check types.
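
As a small illustration, the kind character is available on any dtype object:

import pandas as pd

s = pd.Series(['a', 'b'])
print(s.dtype)       # object
print(s.dtype.kind)  # 'O', the kind character from the table above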


Answer 1

When you see dtype('O') inside a dataframe, it means a pandas string.

What is dtype?

Something that belongs to pandas or numpy, or both, or something else? If we examine pandas code:

df = pd.DataFrame({'float': [1.0],
                    'int': [1],
                    'datetime': [pd.Timestamp('20180310')],
                    'string': ['foo']})
print(df)
print(df['float'].dtype,df['int'].dtype,df['datetime'].dtype,df['string'].dtype)
df['string'].dtype

It will output like this:

   float  int   datetime string    
0    1.0    1 2018-03-10    foo
---
float64 int64 datetime64[ns] object
---
dtype('O')

You can interpret the last one as the Pandas dtype('O'), or a Pandas object, which is the Python type string; this corresponds to the NumPy string_ or unicode_ types.

Pandas dtype    Python type     NumPy type          Usage
object          str             string_, unicode_   Text

Just as Don Quixote rides his mount, Pandas rides on NumPy; NumPy understands the underlying architecture of your system and uses the numpy.dtype class for that.

A data type object is an instance of the numpy.dtype class, which describes the data type more precisely, including:

  • Type of the data (integer, float, Python object, etc.)
  • Size of the data (how many bytes is in e.g. the integer)
  • Byte order of the data (little-endian or big-endian)
  • If the data type is structured, an aggregate of other data types (e.g., describing an array item consisting of an integer and a float)
  • What are the names of the “fields” of the structure
  • What is the data-type of each field
  • Which part of the memory block each field takes
  • If the data type is a sub-array, what is its shape and data type

In the context of this question, dtype belongs to both pandas and numpy, and in particular dtype('O') means we expect a string.


Here is some code for testing, with an explanation. Suppose we have the dataset as a dictionary:

import pandas as pd
import numpy as np
from pandas import Timestamp

data={'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'date': {0: Timestamp('2018-12-12 00:00:00'), 1: Timestamp('2018-12-12 00:00:00'), 2: Timestamp('2018-12-12 00:00:00'), 3: Timestamp('2018-12-12 00:00:00'), 4: Timestamp('2018-12-12 00:00:00')}, 'role': {0: 'Support', 1: 'Marketing', 2: 'Business Development', 3: 'Sales', 4: 'Engineering'}, 'num': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567}, 'fnum': {0: 3.14, 1: 2.14, 2: -0.14, 3: 41.3, 4: 3.14}}
df = pd.DataFrame.from_dict(data) #now we have a dataframe

print(df)
print(df.dtypes)

The last lines will examine the dataframe and note the output:

   id       date                  role  num   fnum
0   1 2018-12-12               Support  123   3.14
1   2 2018-12-12             Marketing  234   2.14
2   3 2018-12-12  Business Development  345  -0.14
3   4 2018-12-12                 Sales  456  41.30
4   5 2018-12-12           Engineering  567   3.14
id               int64
date    datetime64[ns]
role            object
num              int64
fnum           float64
dtype: object

All kinds of different dtypes:

df.iloc[1,:] = np.nan
df.iloc[2,:] = None

But if we try to set np.nan or None, this will not affect the original column dtype. The output will be like this:

print(df)
print(df.dtypes)

    id       date         role    num   fnum
0  1.0 2018-12-12      Support  123.0   3.14
1  NaN        NaT          NaN    NaN    NaN
2  NaN        NaT         None    NaN    NaN
3  4.0 2018-12-12        Sales  456.0  41.30
4  5.0 2018-12-12  Engineering  567.0   3.14
id             float64
date    datetime64[ns]
role            object
num            float64
fnum           float64
dtype: object

So np.nan or None will not change the column dtype, unless we set all of the column's rows to np.nan or None. In that case the column will become float64 or object respectively.

You may also try setting single rows:

df.iloc[3,:] = 0 # will convert datetime to object only
df.iloc[4,:] = '' # will convert all columns to object

And note here that if we set a string inside a non-string column, it will become string or object dtype.


Answer 2

It means “a python object”, i.e. not one of the builtin scalar types supported by numpy.

np.array([object()]).dtype
=> dtype('O')

Answer 3

‘O’ stands for object.

# Loading a csv file as a dataframe
import pandas as pd
train_df = pd.read_csv('train.csv')
col_name = 'Name of Employee'

# Checking the datatype of the column
train_df[col_name].dtype

# Instead try printing the same thing
print(train_df[col_name].dtype)

The first line returns: dtype('O')

The line with the print statement returns the following: object