Tag Archives: pandas

Converting JSON to a pandas DataFrame

Question: Converting JSON to a pandas DataFrame

What I am trying to do is extract elevation data from a google maps API along a path specified by latitude and longitude coordinates as follows:

from urllib2 import Request, urlopen
import json

path1 = '42.974049,-81.205203|42.974298,-81.195755'
request=Request('http://maps.googleapis.com/maps/api/elevation/json?locations='+path1+'&sensor=false')
response = urlopen(request)
elevations = response.read()

This gives me data that looks like this:

elevations.splitlines()

['{',
 '   "results" : [',
 '      {',
 '         "elevation" : 243.3462677001953,',
 '         "location" : {',
 '            "lat" : 42.974049,',
 '            "lng" : -81.205203',
 '         },',
 '         "resolution" : 19.08790397644043',
 '      },',
 '      {',
 '         "elevation" : 244.1318664550781,',
 '         "location" : {',
 '            "lat" : 42.974298,',
 '            "lng" : -81.19575500000001',
 '         },',
 '         "resolution" : 19.08790397644043',
 '      }',
 '   ],',
 '   "status" : "OK"',
 '}']

When I put this into a DataFrame:

pd.read_json(elevations)

here is what I get:

[screenshot: a DataFrame whose cells still contain the nested dicts]

And here is what I want:

[screenshot: a flat DataFrame with elevation, latitude and longitude columns]

I’m not sure if this is possible, but mainly what I am looking for is a way to put the elevation, latitude and longitude data together in a pandas DataFrame (it doesn’t have to have fancy multiline headers).

If anyone can help or give some advice on working with this data, that would be great! If you can’t tell, I haven’t worked much with JSON data before…

EDIT:

This method isn’t all that attractive but seems to work:

data = json.loads(elevations)
lat,lng,el = [],[],[]
for result in data['results']:
    lat.append(result[u'location'][u'lat'])
    lng.append(result[u'location'][u'lng'])
    el.append(result[u'elevation'])
df = pd.DataFrame([lat,lng,el]).T

This ends up as a DataFrame with latitude, longitude and elevation columns.


Answer 0

I found a quick and easy solution to what I wanted, using json_normalize(), included in pandas 1.0.1.

from urllib2 import Request, urlopen  # Python 2 only; on Python 3, see the later answers using requests
import json

import pandas as pd    

path1 = '42.974049,-81.205203|42.974298,-81.195755'
request=Request('http://maps.googleapis.com/maps/api/elevation/json?locations='+path1+'&sensor=false')
response = urlopen(request)
elevations = response.read()
data = json.loads(elevations)
df = pd.json_normalize(data['results'])

This gives a nice flattened dataframe with the json data that I got from the Google Maps API.
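
For the JSON above, the flattened column names should come out with dot-separated paths for the nested fields, along the lines of this sketch (not verified output; see also Answer 9 below for turning such names into a MultiIndex):

df.columns
# Index(['elevation', 'resolution', 'location.lat', 'location.lng'], dtype='object')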


Answer 1

Check this snippet out.

import json
import pandas as pd

# reading the JSON data using json.load()
file = 'data.json'
with open(file) as train_file:
    dict_train = json.load(train_file)

# converting the json dataset from a dictionary to a dataframe
train = pd.DataFrame.from_dict(dict_train, orient='index')
train.reset_index(level=0, inplace=True)

Hope it helps :)


Answer 2

You could first load your JSON data into a Python dictionary:

data = json.loads(elevations)

Then modify the data on the fly:

for result in data['results']:
    result[u'lat']=result[u'location'][u'lat']
    result[u'lng']=result[u'location'][u'lng']
    del result[u'location']

Rebuild the JSON string:

elevations = json.dumps(data)

Finally:

pd.read_json(elevations)

You can also probably avoid dumping the data back into a string; I assume pandas can directly create a DataFrame from a dictionary (I haven’t used it in a long time :p).
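
For completeness, here is a minimal sketch of that dictionary-only variant (assuming elevations is the JSON string from the question): pd.DataFrame accepts a list of flat dicts directly, so the dumps/read_json round trip can be dropped.

import json
import pandas as pd

data = json.loads(elevations)
for result in data['results']:
    result['lat'] = result['location']['lat']
    result['lng'] = result['location']['lng']
    del result['location']

df = pd.DataFrame(data['results'])  # each flat dict becomes one row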


Answer 3

Just a new version of the accepted answer, since Python 3.x does not support urllib2:

from requests import request
from pandas.io.json import json_normalize

path1 = '42.974049,-81.205203|42.974298,-81.195755'
response = request(url='http://maps.googleapis.com/maps/api/elevation/json?locations='+path1+'&sensor=false', method='get')
# response.json() already returns a parsed dict, so no json.loads() is needed
data = response.json()
json_normalize(data['results'])

Answer 4

The problem is that you have several columns in the data frame that contain dicts with smaller dicts inside them. Useful JSON is often heavily nested. I have been writing small functions that pull the info I want out into a new column. That way I have it in the format that I want to use (a vectorized variant is sketched after the snippet).

for row in range(len(data)):
    #First I load the dict (one at a time)
    n = data.loc[row,'dict_column']
    #Now I make a new column that pulls out the data that I want.
    data.loc[row,'new_column'] = n.get('key')
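
The same extraction can be vectorized with Series.apply instead of a row loop (a sketch; dict_column, new_column and key are the hypothetical names from the loop above):

data['new_column'] = data['dict_column'].apply(lambda d: d.get('key'))  # pull 'key' out of every dict in one pass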

Answer 5

Optimization of the accepted answer:

The accepted answer has some functioning problems, so I want to share my code that does not rely on urllib2:

import requests
from pandas import json_normalize
url = 'https://www.energidataservice.dk/proxy/api/datastore_search?resource_id=nordpoolmarket&limit=5'

response = requests.get(url)
dictr = response.json()
recs = dictr['result']['records']
df = json_normalize(recs)
print(df)

Output:

        _id                    HourUTC               HourDK  ... ElbasAveragePriceEUR  ElbasMaxPriceEUR  ElbasMinPriceEUR
0    264028  2019-01-01T00:00:00+00:00  2019-01-01T01:00:00  ...                  NaN               NaN               NaN
1    138428  2017-09-03T15:00:00+00:00  2017-09-03T17:00:00  ...                33.28              33.4              32.0
2    138429  2017-09-03T16:00:00+00:00  2017-09-03T18:00:00  ...                35.20              35.7              34.9
3    138430  2017-09-03T17:00:00+00:00  2017-09-03T19:00:00  ...                37.50              37.8              37.3
4    138431  2017-09-03T18:00:00+00:00  2017-09-03T20:00:00  ...                39.65              42.9              35.3
..      ...                        ...                  ...  ...                  ...               ...               ...
995  139290  2017-10-09T13:00:00+00:00  2017-10-09T15:00:00  ...                38.40              38.4              38.4
996  139291  2017-10-09T14:00:00+00:00  2017-10-09T16:00:00  ...                41.90              44.3              33.9
997  139292  2017-10-09T15:00:00+00:00  2017-10-09T17:00:00  ...                46.26              49.5              41.4
998  139293  2017-10-09T16:00:00+00:00  2017-10-09T18:00:00  ...                56.22              58.5              49.1
999  139294  2017-10-09T17:00:00+00:00  2017-10-09T19:00:00  ...                56.71              65.4              42.2 

PS: the API is for Danish electricity prices.


Answer 6

Here is a small utility class that converts JSON to a DataFrame and back; hope you find it helpful.

# -*- coding: utf-8 -*-
from pandas.io.json import json_normalize

class DFConverter:

    #Converts the input JSON to a DataFrame
    def convertToDF(self,dfJSON):
        return(json_normalize(dfJSON))

    #Converts the input DataFrame to JSON 
    def convertToJSON(self, df):
        resultJSON = df.to_json(orient='records')
        return(resultJSON)

Answer 7

billmanH’s solution helped me, but it didn’t work until I switched from:

n = data.loc[row,'json_column']

to:

n = data.iloc[[row]]['json_column']

Here’s the rest of it; converting to a dictionary is helpful for working with JSON data.

import json

for row in range(len(data)):
    n = data.iloc[[row]]['json_column'].item()
    jsonDict = json.loads(n)
    if ('mykey' in jsonDict):
        display(jsonDict['mykey'])  # display() is a Jupyter/IPython built-in

Answer 8

#Use this small trick to make the data JSON-interpretable,
#since your data is not directly parsed by json.loads()

>>> import json
>>> f=open("sampledata.txt","r+")
>>> data = f.read()
>>> for x in data.split("\n"):
...     strlist = "["+x+"]"
...     datalist=json.loads(strlist)
...     for y in datalist:
...             print(type(y))
...             print(y)
...
...
<type 'dict'>
{u'0': [[10.8, 36.0], {u'10': 0, u'1': 0, u'0': 0, u'3': 0, u'2': 0, u'5': 0, u'4': 0, u'7': 0, u'6': 0, u'9': 0, u'8': 0}]}
<type 'dict'>
{u'1': [[10.8, 36.1], {u'10': 0, u'1': 0, u'0': 0, u'3': 0, u'2': 0, u'5': 0, u'4': 0, u'7': 0, u'6': 0, u'9': 0, u'8': 0}]}
<type 'dict'>
{u'2': [[10.8, 36.2], {u'10': 0, u'1': 0, u'0': 0, u'3': 0, u'2': 0, u'5': 0, u'4': 0, u'7': 0, u'6': 0, u'9': 0, u'8': 0}]}
<type 'dict'>
{u'3': [[10.8, 36.300000000000004], {u'10': 0, u'1': 0, u'0': 0, u'3': 0, u'2': 0, u'5': 0, u'4': 0, u'7': 0, u'6': 0, u'9': 0, u'8': 0}]}
<type 'dict'>
{u'4': [[10.8, 36.4], {u'10': 0, u'1': 0, u'0': 0, u'3': 0, u'2': 0, u'5': 0, u'4': 0, u'7': 0, u'6': 0, u'9': 0, u'8': 0}]}
<type 'dict'>
{u'5': [[10.8, 36.5], {u'10': 0, u'1': 0, u'0': 0, u'3': 0, u'2': 0, u'5': 0, u'4': 0, u'7': 0, u'6': 0, u'9': 0, u'8': 0}]}
<type 'dict'>
{u'6': [[10.8, 36.6], {u'10': 0, u'1': 0, u'0': 0, u'3': 0, u'2': 0, u'5': 0, u'4': 0, u'7': 0, u'6': 0, u'9': 0, u'8': 0}]}
<type 'dict'>
{u'7': [[10.8, 36.7], {u'10': 0, u'1': 0, u'0': 0, u'3': 0, u'2': 0, u'5': 0, u'4': 0, u'7': 0, u'6': 0, u'9': 0, u'8': 0}]}
<type 'dict'>
{u'8': [[10.8, 36.800000000000004], {u'1': 0, u'0': 0, u'3': 0, u'2': 0, u'5': 0, u'4': 0, u'7': 0, u'6': 0, u'9': 0, u'8': 0}]}
<type 'dict'>
{u'9': [[10.8, 36.9], {u'1': 0, u'0': 0, u'3': 0, u'2': 0, u'5': 0, u'4': 0, u'7': 0, u'6': 0, u'9': 0, u'8': 0}]}



Answer 9

Once you have the flattened DataFrame obtained by the accepted answer, you can make the columns a MultiIndex (“fancy multiline header”) like this:

df.columns = pd.MultiIndex.from_tuples([tuple(c.split('.')) for c in df.columns])

How to get the first column of a pandas DataFrame as a Series?

Question: How to get the first column of a pandas DataFrame as a Series?

I tried:

x=pandas.DataFrame(...)
s = x.take([0], axis=1)

And s ends up as a DataFrame, not a Series.


Answer 0

>>> import pandas as pd
>>> df = pd.DataFrame({'x' : [1, 2, 3, 4], 'y' : [4, 5, 6, 7]})
>>> df
   x  y
0  1  4
1  2  5
2  3  6
3  4  7
>>> s = df.ix[:,0]
>>> type(s)
<class 'pandas.core.series.Series'>
>>>

===========================================================================

UPDATE

If you’re reading this after June 2017, ix has been deprecated in pandas 0.20.2, so don’t use it. Use loc or iloc instead. See comments and other answers to this question.
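
The loc/iloc equivalents of the deprecated df.ix[:, 0] above are (a sketch):

s = df.iloc[:, 0]              # first column by integer position
s = df.loc[:, df.columns[0]]   # first column by label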


Answer 1

You can get the first column as a Series with the following code:

x[x.columns[0]]

Answer 2

From v0.11+, … use df.iloc.

In [7]: df.iloc[:,0]
Out[7]: 
0    1
1    2
2    3
3    4
Name: x, dtype: int64

Answer 3

Isn’t this the simplest way?

By column name:

In [20]: df = pd.DataFrame({'x' : [1, 2, 3, 4], 'y' : [4, 5, 6, 7]})
In [21]: df
Out[21]:
    x   y
0   1   4
1   2   5
2   3   6
3   4   7

In [23]: df.x
Out[23]:
0    1
1    2
2    3
3    4
Name: x, dtype: int64

In [24]: type(df.x)
Out[24]:
pandas.core.series.Series

Answer 4

This works great when you want to load a Series from a csv file:

x = pd.read_csv('x.csv', index_col=False, names=['x'],header=None).iloc[:,0]
print(type(x))
print(x.head(10))


<class 'pandas.core.series.Series'>
0    110.96
1    119.40
2    135.89
3    152.32
4    192.91
5    177.20
6    181.16
7    177.30
8    200.13
9    235.41
Name: x, dtype: float64

Answer 5

df[df.columns[i]]

where i is the position/number of the column (starting from 0).

So, i = 0 is for the first column.

You can also get the last column using i = -1.


Renaming a pandas DataFrame index

Question: Renaming a pandas DataFrame index

I have a csv file without a header, with a DateTime index. I want to rename the index and the column name, but with df.rename() only the column name gets renamed. Bug? I’m on version 0.12.0.

In [2]: df = pd.read_csv(r'D:\Data\DataTimeSeries_csv//seriesSM.csv', header=None, parse_dates=[[0]], index_col=[0] )

In [3]: df.head()
Out[3]: 
                   1
0                   
2002-06-18  0.112000
2002-06-22  0.190333
2002-06-26  0.134000
2002-06-30  0.093000
2002-07-04  0.098667

In [4]: df.rename(index={0:'Date'}, columns={1:'SM'}, inplace=True)

In [5]: df.head()
Out[5]: 
                  SM
0                   
2002-06-18  0.112000
2002-06-22  0.190333
2002-06-26  0.134000
2002-06-30  0.093000
2002-07-04  0.098667

Answer 0

The rename method takes a dictionary for the index which applies to index values.
You want to rename the index level’s name:

df.index.names = ['Date']

A good way to think about this is that columns and index are the same type of object (Index or MultiIndex), and you can interchange the two via transpose.

This is a little bit confusing since the index names have a similar meaning to columns, so here are some more examples:

In [1]: df = pd.DataFrame([[1, 2, 3], [4, 5 ,6]], columns=list('ABC'))

In [2]: df
Out[2]: 
   A  B  C
0  1  2  3
1  4  5  6

In [3]: df1 = df.set_index('A')

In [4]: df1
Out[4]: 
   B  C
A      
1  2  3
4  5  6

You can see the rename on the index, which can change the value 1:

In [5]: df1.rename(index={1: 'a'})
Out[5]: 
   B  C
A      
a  2  3
4  5  6

In [6]: df1.rename(columns={'B': 'BB'})
Out[6]: 
   BB  C
A       
1   2  3
4   5  6

Whilst renaming the level names:

In [7]: df1.index.names = ['index']
        df1.columns.names = ['column']

Note: this attribute is just a list, and you could do the renaming as a list comprehension/map; a one-liner sketch follows the output below.

In [8]: df1
Out[8]: 
column  B  C
index       
1       2  3
4       5  6
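
Since .names behaves like a list, a rename-by-function can be a one-liner (a sketch, reusing df1 from above):

# e.g. upper-case every level name, leaving unnamed levels untouched
df1.index.names = [n.upper() if n else n for n in df1.index.names]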

Answer 1

The currently selected answer does not mention the rename_axis method which can be used to rename the index and column levels.


Pandas has some quirkiness when it comes to renaming the levels of the index. There is also a new DataFrame method rename_axis available to change the index level names.

Let’s take a look at a DataFrame

df = pd.DataFrame({'age':[30, 2, 12],
                       'color':['blue', 'green', 'red'],
                       'food':['Steak', 'Lamb', 'Mango'],
                       'height':[165, 70, 120],
                       'score':[4.6, 8.3, 9.0],
                       'state':['NY', 'TX', 'FL']},
                       index = ['Jane', 'Nick', 'Aaron'])

[image: the example DataFrame]

This DataFrame has one level for each of the row and column indexes. Both the row and column index have no name. Let’s change the row index level name to ‘names’.

df.rename_axis('names')

[image: the DataFrame with its row index named 'names']

The rename_axis method also has the ability to change the column level names by changing the axis parameter:

df.rename_axis('names').rename_axis('attributes', axis='columns')

[image: the DataFrame with row index name 'names' and column-level name 'attributes']

If you set the index from some of the columns, then those column names become the new index level names. Let’s append two index levels to our original DataFrame:

df1 = df.set_index(['state', 'color'], append=True)
df1

[image: the DataFrame with a three-level row index]

Notice how the original index has no name. We can still use rename_axis, but we need to pass it a list of the same length as the number of index levels.

df1.rename_axis(['names', None, 'Colors'])

[image: the DataFrame with index level names 'names', None and 'Colors']

You can use None to effectively delete the index level names.


Series work similarly but with some differences

Let’s create a Series with three index levels

s = df.set_index(['state', 'color'], append=True)['food']
s

       state  color
Jane   NY     blue     Steak
Nick   TX     green     Lamb
Aaron  FL     red      Mango
Name: food, dtype: object

We can use rename_axis similarly to how we did with DataFrames:

s.rename_axis(['Names','States','Colors'])

Names  States  Colors
Jane   NY      blue      Steak
Nick   TX      green      Lamb
Aaron  FL      red       Mango
Name: food, dtype: object

Notice that there is an extra piece of metadata below the Series called Name. When creating a Series from a DataFrame, this attribute is set to the column name.

We can pass a string name to the rename method to change it:

s.rename('FOOOOOD')

       state  color
Jane   NY     blue     Steak
Nick   TX     green     Lamb
Aaron  FL     red      Mango
Name: FOOOOOD, dtype: object

DataFrames do not have this attribute, and in fact will raise an exception if used like this:

df.rename('my dataframe')
TypeError: 'str' object is not callable

Prior to pandas 0.21, you could have used rename_axis to rename the values in the index and columns. It has been deprecated, so don’t do this.
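
As a side note, in more recent pandas (0.24+) rename_axis also accepts dict-like or function mappers per axis for renaming the level names themselves, for example (a sketch, reusing df1 from above):

df1.rename_axis(index={'state': 'State'})  # rename just the 'state' index level name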


Answer 2

For newer pandas versions

df.index = df.index.rename('new name')

or

df.index.rename('new name', inplace=True)

The latter is required if a data frame should retain all its properties.


Answer 3

In Pandas version 0.13 and greater the index level names are immutable (type FrozenList) and can no longer be set directly. You must first use Index.rename() to apply the new index level names to the Index and then use DataFrame.reindex() to apply the new index to the DataFrame. Examples:

For Pandas version < 0.13

df.index.names = ['Date']

For Pandas version >= 0.13

df = df.reindex(df.index.rename(['Date']))

Answer 4

You can also use Index.set_names as follows:

In [25]: x = pd.DataFrame({'year':[1,1,1,1,2,2,2,2],
   ....:                   'country':['A','A','B','B','A','A','B','B'],
   ....:                   'prod':[1,2,1,2,1,2,1,2],
   ....:                   'val':[10,20,15,25,20,30,25,35]})

In [26]: x = x.set_index(['year','country','prod']).squeeze()

In [27]: x
Out[27]: 
year  country  prod
1     A        1       10
               2       20
      B        1       15
               2       25
2     A        1       20
               2       30
      B        1       25
               2       35
Name: val, dtype: int64
In [28]: x.index = x.index.set_names('foo', level=1)

In [29]: x
Out[29]: 
year  foo  prod
1     A    1       10
           2       20
      B    1       15
           2       25
2     A    1       20
           2       30
      B    1       25
           2       35
Name: val, dtype: int64

Answer 5

If you want to use the same mapping for renaming both columns and index you can do:

mapping = {0:'Date', 1:'SM'}
df.index.names = list(map(lambda name: mapping.get(name, name), df.index.names))
df.rename(columns=mapping, inplace=True)

Answer 6

df.index.rename('new name', inplace=True)

is the only one that does the job for me (pandas 0.22.0).
Without inplace=True, the name of the index is not set in my case.


Answer 7

You can use the index and columns attributes of pandas.DataFrame. NOTE: the number of elements in the list must match the number of rows/columns.

#       A   B   C
# ONE   11  12  13
# TWO   21  22  23
# THREE 31  32  33

df.index = [1, 2, 3]
df.columns = ['a', 'b', 'c']
print(df)

#     a   b   c
# 1  11  12  13
# 2  21  22  23
# 3  31  32  33

pandas: Looking up the list of sheets in an Excel file

Question: pandas: Looking up the list of sheets in an Excel file

The new version of Pandas uses the following interface to load Excel files:

read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])

but what if I don’t know the sheets that are available?

For example, I am working with Excel files that have the following sheets:

Data 1, Data 2 …, Data N, foo, bar

but I don’t know N a priori.

Is there any way to get the list of sheets from an excel document in Pandas?


Answer 0

You can still use the ExcelFile class (and the sheet_names attribute):

xl = pd.ExcelFile('foo.xls')

xl.sheet_names  # see all sheet names

xl.parse(sheet_name)  # read a specific sheet to DataFrame

See the docs for parse for more options…


Answer 1

You should explicitly specify the second parameter (sheet name) as None, like this:

df = pandas.read_excel("/yourPath/FileName.xlsx", None)

df then holds all sheets as a dictionary of DataFrames; you can verify that by running this:

df.keys()

The result looks like this:

[u'201610', u'201601', u'201701', u'201702', u'201703', u'201704', u'201705', u'201706', u'201612', u'fund', u'201603', u'201602', u'201605', u'201607', u'201606', u'201608', u'201512', u'201611', u'201604']

Please refer to the pandas docs for more details: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
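
In recent pandas versions the same call is usually written with the explicit keyword argument (a sketch):

import pandas as pd

dfs = pd.read_excel("/yourPath/FileName.xlsx", sheet_name=None)  # dict of sheet name -> DataFrame
sheet_names = list(dfs.keys())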


Answer 2

This is the fastest way I have found, inspired by @divingTobi’s answer. All the answers based on xlrd, openpyxl or pandas are slow for me, as they all load the whole file first.

from zipfile import ZipFile
from bs4 import BeautifulSoup  # you also need to install "lxml" for the XML parser

with ZipFile(file) as zipped_file:  # 'file' is the path to the .xlsx file
    summary = zipped_file.open(r'xl/workbook.xml').read()
soup = BeautifulSoup(summary, "xml")
sheets = [sheet.get("name") for sheet in soup.find_all("sheet")]


Answer 3

Building on @dhwanil_shah’s answer, you do not need to extract the whole file. With zf.open it is possible to read from the zipped file directly.

import xml.etree.ElementTree as ET
import zipfile

def xlsxSheets(f):
    zf = zipfile.ZipFile(f)

    f = zf.open(r'xl/workbook.xml')

    l = f.readline()
    l = f.readline()
    root = ET.fromstring(l)
    sheets=[]
    for c in root.findall('{http://schemas.openxmlformats.org/spreadsheetml/2006/main}sheets/*'):
        sheets.append(c.attrib['name'])
    return sheets

The two consecutive readlines are ugly, but the content is only in the second line of the text. No need to parse the whole file.

This solution seems to be much faster than the read_excel version, and most likely also faster than the full extract version.


Answer 4

I have tried xlrd, pandas, openpyxl and other such libraries, and all of them seem to take exponentially longer as the file size increases, because they read the entire file. The other solutions mentioned above that used ‘on_demand’ did not work for me. If you just want to get the sheet names initially, the following function works for xlsx files.

import os
import shutil
import zipfile

import xmltodict

def get_sheet_details(file_path):
    sheets = []
    file_name = os.path.splitext(os.path.split(file_path)[-1])[0]
    # Make a temporary directory with the file name
    # (settings.MEDIA_ROOT is Django-specific; any scratch directory works)
    directory_to_extract_to = os.path.join(settings.MEDIA_ROOT, file_name)
    os.mkdir(directory_to_extract_to)

    # Extract the xlsx file as it is just a zip file
    zip_ref = zipfile.ZipFile(file_path, 'r')
    zip_ref.extractall(directory_to_extract_to)
    zip_ref.close()

    # Open the workbook.xml which is very light and only has meta data, get sheets from it
    path_to_workbook = os.path.join(directory_to_extract_to, 'xl', 'workbook.xml')
    with open(path_to_workbook, 'r') as f:
        xml = f.read()
        dictionary = xmltodict.parse(xml)
        for sheet in dictionary['workbook']['sheets']['sheet']:
            sheet_details = {
                'id': sheet['@sheetId'],
                'name': sheet['@name']
            }
            sheets.append(sheet_details)

    # Delete the extracted files directory
    shutil.rmtree(directory_to_extract_to)
    return sheets

Since all xlsx files are basically zipped files, we extract the underlying xml data and read the sheet names from the workbook directly, which takes a fraction of a second compared to the library functions.

Benchmarking: (On a 6mb xlsx file with 4 sheets)
Pandas, xlrd: 12 seconds
openpyxl: 24 seconds
Proposed method: 0.4 seconds

Since my requirement was just reading the sheet names, the unnecessary overhead of reading the entire file was bugging me, so I took this route instead.


Answer 5

from openpyxl import load_workbook

sheets = load_workbook(excel_file, read_only=True).sheetnames

For a 5MB Excel file I’m working with, load_workbook without the read_only flag took 8.24s. With the read_only flag it only took 39.6 ms. If you still want to use an Excel library and not drop to an xml solution, that’s much faster than the methods that parse the whole file.


Search for “does-not-contain” on a DataFrame in pandas

Question: Search for “does-not-contain” on a DataFrame in pandas

I’ve done some searching and can figure out how to filter a dataframe with df["col"].str.contains(word); however, I’m wondering if there is a way to do the reverse: filter a dataframe by that set’s complement, e.g. to the effect of !(df["col"].str.contains(word)).

Can this be done through a DataFrame method?


Answer 0

You can use the invert (~) operator (which acts like a not for boolean data):

new_df = df[~df["col"].str.contains(word)]

where new_df is the copy returned by the right-hand side.

contains also accepts a regular expression…


If the above throws a ValueError, the reason is likely because you have mixed datatypes, so use na=False:

new_df = df[~df["col"].str.contains(word, na=False)]

Or,

new_df = df[df["col"].str.contains(word) == False]
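
A quick self-contained demonstration of the inverted filter (a sketch with made-up data):

import pandas as pd

df = pd.DataFrame({'col': ['apple pie', 'banana', 'cherry pie']})
print(df[~df['col'].str.contains('pie')])  # keeps only the 'banana' row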

Answer 1

I was having trouble with the not (~) symbol as well, so here’s another way from another StackOverflow thread:

df[df["col"].str.contains('this|that')==False]

Answer 2

You can use apply and a lambda to select rows where a column contains anything in a list. For your scenario (an isin-based alternative is sketched after the snippet):

df[df["col"].apply(lambda x:x not in [word1,word2,word3])]
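
Note that this tests whole cell values rather than substrings; the same exact-match filter can be written without apply using isin (a sketch):

new_df = df[~df["col"].isin([word1, word2, word3])]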

Answer 3

I had to get rid of the NULL values before using the command recommended by Andy above. An example:

df = pd.DataFrame(index = [0, 1, 2], columns=['first', 'second', 'third'])
df.ix[:, 'first'] = 'myword'
df.ix[0, 'second'] = 'myword'
df.ix[2, 'second'] = 'myword'
df.ix[1, 'third'] = 'myword'
df

    first   second  third
0   myword  myword   NaN
1   myword  NaN      myword 
2   myword  myword   NaN

Now running the command:

~df["second"].str.contains(word)

I get the following error:

TypeError: bad operand type for unary ~: 'float'

I got rid of the NULL values using dropna() or fillna() first and retried the command with no problem.


Answer 4

I hope the answers above already help; I am adding a framework to find multiple words and negate those from the DataFrame.

Here 'word1','word2','word3','word4' is the list of patterns to search for,

df is the DataFrame, and

column_a is a column name from the DataFrame df:

Search_for_These_values = ['word1','word2','word3','word4'] 

pattern = '|'.join(Search_for_These_values)

result = df.loc[~df['column_a'].str.contains(pattern, case=False)]

Answer 5

In addition to nanselm2’s answer, you can use 0 instead of False:

df["col"].str.contains(word)==0

Replacing column values in a pandas DataFrame

Question: Replacing column values in a pandas DataFrame

I’m trying to replace the values in one column of a dataframe. The column (‘female’) only contains the values ‘female’ and ‘male’.

I have tried the following:

w['female']['female']='1'
w['female']['male']='0' 

But I receive the exact same copy of the previous results.

I would ideally like to get some output which resembles the following loop element-wise.

if w['female'] =='female':
    w['female'] = '1';
else:
    w['female'] = '0';

I’ve looked through the gotchas documentation (http://pandas.pydata.org/pandas-docs/stable/gotchas.html) but cannot figure out why nothing happens.

Any help will be appreciated.


Answer 0

If I understand right, you want something like this:

w['female'] = w['female'].map({'female': 1, 'male': 0})

(Here I convert the values to numbers instead of strings containing numbers. You can convert them to "1" and "0", if you really want, but I’m not sure why you’d want that.)

The reason your code doesn’t work is because using ['female'] on a column (the second 'female' in your w['female']['female']) doesn’t mean “select rows where the value is ‘female'”. It means to select rows where the index is ‘female’, of which there may not be any in your DataFrame.


Answer 1

You can edit a subset of a dataframe by using loc:

df.loc[<row selection>, <column selection>]

In this case:

w.loc[w.female != 'female', 'female'] = 0
w.loc[w.female == 'female', 'female'] = 1
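
For this binary case there is also a vectorized one-liner built on the boolean comparison itself (a sketch):

w['female'] = (w['female'] == 'female').astype(int)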

Answer 2

w.female.replace(to_replace=dict(female=1, male=0), inplace=True)

See pandas.DataFrame.replace() docs.


Answer 3

Slight variation:

w.female.replace(['male', 'female'], [1, 0], inplace=True)

Answer 4

This should also work:

w.female[w.female == 'female'] = 1 
w.female[w.female == 'male']   = 0

Answer 5

You can also use apply with .get, i.e.:

w['female'] = w['female'].apply({'male':0, 'female':1}.get)

w = pd.DataFrame({'female':['female','male','female']})
print(w)

Dataframe w:

   female
0  female
1    male
2  female

Using apply to replace values from the dictionary:

w['female'] = w['female'].apply({'male':0, 'female':1}.get)
print(w)

Result:

   female
0       1
1       0
2       1 

Note: apply with a dictionary should be used only if all the possible values of the column are defined in the dictionary; otherwise .get returns None for values that are not in it.


Answer 6

This is very compact:

w['female'][w['female'] == 'female']=1
w['female'][w['female'] == 'male']=0

Another good one:

w['female'] = w['female'].replace(regex='female', value=1)
w['female'] = w['female'].replace(regex='male', value=0)

Answer 7

Alternatively there is the built-in function pd.get_dummies for these kinds of assignments:

w['female'] = pd.get_dummies(w['female'],drop_first = True)

This gives you a data frame with two columns, one for each value that occurs in w[‘female’], of which you drop the first (because you can infer it from the one that is left). The new column is automatically named as the string that you replaced.

This is especially useful if you have categorical variables with more than two possible values. This function creates as many dummy variables as needed to distinguish between all cases. Be careful then that you don’t assign the entire data frame to a single column; instead, if w[‘female’] could be ‘male’, ‘female’ or ‘neutral’, do something like this:

w = pd.concat([w, pd.get_dummies(w['female'], drop_first = True)], axis = 1)
w.drop('female', axis = 1, inplace = True)

Then you are left with two new columns giving you the dummy coding of ‘female’ and you got rid of the column with the strings.


Answer 8

Using Series.map with Series.fillna

If your column contains strings other than just female and male, Series.map will fail in this case, since it returns NaN for any other value.

That’s why we have to chain it with fillna.

An example of why .map alone fails:

df = pd.DataFrame({'female':['male', 'female', 'female', 'male', 'other', 'other']})

   female
0    male
1  female
2  female
3    male
4   other
5   other
df['female'].map({'female': '1', 'male': '0'})

0      0
1      1
2      1
3      0
4    NaN
5    NaN
Name: female, dtype: object

For the correct method, we chain map with fillna, so we fill the NaN with values from the original column:

df['female'].map({'female': '1', 'male': '0'}).fillna(df['female'])

0        0
1        1
2        1
3        0
4    other
5    other
Name: female, dtype: object

Answer 9

There is also a function in pandas called factorize which you can use to automatically do this type of work. It converts labels to numbers: ['male', 'female', 'male'] -> [0, 1, 0]. See this answer for more information.
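
A short illustration of factorize (a sketch; it returns both the integer codes and the unique labels):

import pandas as pd

codes, uniques = pd.factorize(['male', 'female', 'male'])
print(codes)    # [0 1 0]
print(uniques)  # ['male' 'female']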


Answer 10

I think it should be pointed out which type of object you get back from the methods suggested above: a Series or a DataFrame.

When you get a column with w.female or w[[2]] (where, suppose, 2 is the number of your column) you’ll get back a DataFrame. So in this case you can use DataFrame methods like .replace.

When you use .loc or .iloc you get back a Series, so there you should use methods like apply, map and so on.


How to replace NaNs with the previous values in a pandas DataFrame?

Question: How to replace NaNs with the previous values in a pandas DataFrame?

Suppose I have a DataFrame with some NaNs:

>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df
    0   1   2
0   1   2   3
1   4 NaN NaN
2 NaN NaN   9

What I need to do is replace every NaN with the first non-NaN value in the same column above it. It is assumed that the first row will never contain a NaN. So for the previous example the result would be

   0  1  2
0  1  2  3
1  4  2  3
2  4  2  9

I could just loop through the whole DataFrame column-by-column, element-by-element, and set the values directly, but is there an easy (ideally loop-free) way of achieving this?


Answer 0

You could use the fillna method on the DataFrame and specify the method as ffill (forward fill):

>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df.fillna(method='ffill')
   0  1  2
0  1  2  3
1  4  2  3
2  4  2  9

This method…

propagate[s] last valid observation forward to next valid

To go the opposite way, there’s also a bfill method.

This method doesn’t modify the DataFrame inplace – you’ll need to rebind the returned DataFrame to a variable or else specify inplace=True:

df.fillna(method='ffill', inplace=True)

Answer 1

The accepted answer is perfect. I had a related but slightly different situation where I had to forward fill, but only within groups. In case someone has the same need, know that fillna works on a DataFrameGroupBy object.

>>> from numpy import nan
>>> example = pd.DataFrame({'number':[0,1,2,nan,4,nan,6,7,8,9],'name':list('aaabbbcccc')})
>>> example
  name  number
0    a     0.0
1    a     1.0
2    a     2.0
3    b     NaN
4    b     4.0
5    b     NaN
6    c     6.0
7    c     7.0
8    c     8.0
9    c     9.0
>>> example.groupby('name')['number'].fillna(method='ffill') # fill in row 5 but not row 3
0    0.0
1    1.0
2    2.0
3    NaN
4    4.0
5    4.0
6    6.0
7    7.0
8    8.0
9    9.0
Name: number, dtype: float64
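
The same group-wise forward fill can also be written with the dedicated GroupBy method (a sketch):

example.groupby('name')['number'].ffill()  # equivalent to the fillna(method='ffill') call above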

Answer 2

You can use pandas.DataFrame.fillna with the method='ffill' option. 'ffill' stands for ‘forward fill’ and will propagate last valid observation forward. The alternative is 'bfill' which works the same way, but backwards.

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df = df.fillna(method='ffill')

print(df)
#   0  1  2
#0  1  2  3
#1  4  2  3
#2  4  2  9

There is also a direct synonym function for this, pandas.DataFrame.ffill, to make things simpler.


Answer 3

One thing that I noticed when trying this solution is that if you have N/A at the start or the end of the array, ffill and bfill don’t quite work. You need both.

In [224]: df = pd.DataFrame([None, 1, 2, 3, None, 4, 5, 6, None])

In [225]: df.ffill()
Out[225]:
     0
0  NaN
1  1.0
...
7  6.0
8  6.0

In [226]: df.bfill()
Out[226]:
     0
0  1.0
1  1.0
...
7  6.0
8  NaN

In [227]: df.bfill().ffill()
Out[227]:
     0
0  1.0
1  1.0
...
7  6.0
8  6.0

回答 4

ffill 现在有自己的方法 pd.DataFrame.ffill

df.ffill()

     0    1    2
0  1.0  2.0  3.0
1  4.0  2.0  3.0
2  4.0  2.0  9.0

ffill now has its own method pd.DataFrame.ffill

df.ffill()

     0    1    2
0  1.0  2.0  3.0
1  4.0  2.0  3.0
2  4.0  2.0  9.0

回答 5

仅一列版本

  • 用最后一个有效值填充 NaN
df[column_name].fillna(method='ffill', inplace=True)
  • 用下一个有效值填充 NaN
df[column_name].fillna(method='backfill', inplace=True)

Only one column version

  • Fill NAN with last valid value
df[column_name].fillna(method='ffill', inplace=True)
  • Fill NAN with next valid value
df[column_name].fillna(method='backfill', inplace=True)

回答 6

同意使用 ffill 方法,但补充一点额外信息:您可以使用关键字参数 limit 来限制向前填充的行数。

>>> import pandas as pd    
>>> df = pd.DataFrame([[1, 2, 3], [None, None, 6], [None, None, 9]])

>>> df
     0    1   2
0  1.0  2.0   3
1  NaN  NaN   6
2  NaN  NaN   9

>>> df[1].fillna(method='ffill', inplace=True)
>>> df
     0    1    2
0  1.0  2.0    3
1  NaN  2.0    6
2  NaN  2.0    9

现在带有limit关键字参数

>>> df[0].fillna(method='ffill', limit=1, inplace=True)

>>> df
     0    1  2
0  1.0  2.0  3
1  1.0  2.0  6
2  NaN  2.0  9

Just agreeing with ffill method, but one extra info is that you can limit the forward fill with keyword argument limit.

>>> import pandas as pd    
>>> df = pd.DataFrame([[1, 2, 3], [None, None, 6], [None, None, 9]])

>>> df
     0    1   2
0  1.0  2.0   3
1  NaN  NaN   6
2  NaN  NaN   9

>>> df[1].fillna(method='ffill', inplace=True)
>>> df
     0    1    2
0  1.0  2.0    3
1  NaN  2.0    6
2  NaN  2.0    9

Now with limit keyword argument

>>> df[0].fillna(method='ffill', limit=1, inplace=True)

>>> df
     0    1  2
0  1.0  2.0  3
1  1.0  2.0  6
2  NaN  2.0  9

回答 7

就我而言,我们有来自不同设备的时间序列,但是某些设备在一段时间内无法发送任何值。因此,我们应该为每个设备和时间段创建NA值,然后再执行fillna。

df = pd.DataFrame([["device1", 1, 'first val of device1'], ["device2", 2, 'first val of device2'], ["device3", 3, 'first val of device3']])
df.pivot(index=1, columns=0, values=2).fillna(method='ffill').unstack().reset_index(name='value')

结果:

        0   1   value
0   device1     1   first val of device1
1   device1     2   first val of device1
2   device1     3   first val of device1
3   device2     1   None
4   device2     2   first val of device2
5   device2     3   first val of device2
6   device3     1   None
7   device3     2   None
8   device3     3   first val of device3

In my case, we have time series from different devices but some devices could not send any value during some period. So we should create NA values for every device and time period and after that do fillna.

df = pd.DataFrame([["device1", 1, 'first val of device1'], ["device2", 2, 'first val of device2'], ["device3", 3, 'first val of device3']])
df.pivot(index=1, columns=0, values=2).fillna(method='ffill').unstack().reset_index(name='value')

Result:

        0   1   value
0   device1     1   first val of device1
1   device1     2   first val of device1
2   device1     3   first val of device1
3   device2     1   None
4   device2     2   first val of device2
5   device2     3   first val of device2
6   device3     1   None
7   device3     2   None
8   device3     3   first val of device3

回答 8

您可以使用 fillna 来删除或替换 NaN 值。

NaN 移除

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])

df.fillna(method='ffill')
     0    1    2
0  1.0  2.0  3.0
1  4.0  2.0  3.0
2  4.0  2.0  9.0

NaN 替换

df.fillna(0) # replace NaN with 0 (or any value you choose)
     0    1    2
0  1.0  2.0  3.0
1  4.0  0.0  0.0
2  0.0  0.0  9.0

参考pandas.DataFrame.fillna

You can use fillna to remove or replace NaN values.

NaN Remove

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])

df.fillna(method='ffill')
     0    1    2
0  1.0  2.0  3.0
1  4.0  2.0  3.0
2  4.0  2.0  9.0

NaN Replace

df.fillna(0) # replace NaN with 0 (or any value you choose)
     0    1    2
0  1.0  2.0  3.0
1  4.0  0.0  0.0
2  0.0  0.0  9.0

Reference pandas.DataFrame.fillna


检查pandas数据框索引中是否存在值

问题:检查pandas数据框索引中是否存在值

我敢肯定有一个显而易见的方法可以做到这一点,但我现在想不出任何简洁的办法。

基本上,不是引发异常,而是希望得到 True 或 False,以查看某个值是否存在于 pandas df 的索引中。

import pandas as pd
df = pd.DataFrame({'test':[1,2,3,4]}, index=['a','b','c','d'])
df.loc['g']  # (should give False)

我目前可行的做法如下:

sum(df.index == 'g')

I am sure there is an obvious way to do this but cant think of anything slick right now.

Basically instead of raising exception I would like to get True or False to see if a value exists in pandas df index.

import pandas as pd
df = pd.DataFrame({'test':[1,2,3,4]}, index=['a','b','c','d'])
df.loc['g']  # (should give False)

What I have working now is the following

sum(df.index == 'g')

回答 0

这应该可以解决问题

'g' in df.index

This should do the trick

'g' in df.index
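
A quick runnable check, using the frame from the question (a minimal sketch):

import pandas as pd

df = pd.DataFrame({'test': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])
print('g' in df.index)  # False
print('a' in df.index)  # True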

回答 1

仅供参考(这正是我当时在寻找的东西):您可以通过附加 .values 来测试某个值是否存在于值或索引中,例如

g in df.<your selected field>.values
g in df.index.values

我发现附加 .values 以获得简单的列表或 ndarray,会使存在性或 “in” 检查与其他 python 工具配合得更顺畅。只是想把这一点分享给大家。

Just for reference as it was something I was looking for, you can test for presence within the values or the index by appending the “.values” method, e.g.

g in df.<your selected field>.values
g in df.index.values

I find that adding the “.values” to get a simple list or ndarray out makes exist or “in” checks run more smoothly with the other python tools. Just thought I’d toss that out there for people.


回答 2

多索引的工作方式与单索引略有不同。这是多索引数据框的一些方法。

df = pd.DataFrame({'col1': ['a', 'b','c', 'd'], 'col2': ['X','X','Y', 'Y'], 'col3': [1, 2, 3, 4]}, columns=['col1', 'col2', 'col3'])
df = df.set_index(['col1', 'col2'])

in df.index 在检查单个索引值时,仅适用于第一级。

'a' in df.index     # True
'X' in df.index     # False

检查df.index.levels其他级别。

'a' in df.index.levels[0] # True
'X' in df.index.levels[1] # True

在 df.index 中检查索引组合的元组:

('a', 'X') in df.index  # True
('a', 'Y') in df.index  # False

Multi index works a little different from single index. Here are some methods for multi-indexed dataframe.

df = pd.DataFrame({'col1': ['a', 'b','c', 'd'], 'col2': ['X','X','Y', 'Y'], 'col3': [1, 2, 3, 4]}, columns=['col1', 'col2', 'col3'])
df = df.set_index(['col1', 'col2'])

in df.index works for the first level only when checking single index value.

'a' in df.index     # True
'X' in df.index     # False

Check df.index.levels for other levels.

'a' in df.index.levels[0] # True
'X' in df.index.levels[1] # True

Check in df.index for an index combination tuple.

('a', 'X') in df.index  # True
('a', 'Y') in df.index  # False

回答 3

与DataFrame:df_data

>>> df_data
  id   name  value
0  a  ampha      1
1  b   beta      2
2  c     ce      3

我试过了:

>>> getattr(df_data, 'value').isin([1]).any()
True
>>> getattr(df_data, 'value').isin(['1']).any()
True

但:

>>> 1 in getattr(df_data, 'value')
True
>>> '1' in getattr(df_data, 'value')
False

很有趣:D

with DataFrame: df_data

>>> df_data
  id   name  value
0  a  ampha      1
1  b   beta      2
2  c     ce      3

I tried:

>>> getattr(df_data, 'value').isin([1]).any()
True
>>> getattr(df_data, 'value').isin(['1']).any()
True

but:

>>> 1 in getattr(df_data, 'value')
True
>>> '1' in getattr(df_data, 'value')
False

So fun :D
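
The surprise above comes from how membership works on a Series: the in operator tests the index labels, not the values, while .values (or isin) looks at the data itself. In the example, 1 in getattr(df_data, 'value') is True only because 1 happens to be one of the default integer index labels. A minimal sketch of the difference:

import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(1 in s)         # False -- 1 is not an index label here
print('a' in s)       # True  -- 'a' is an index label
print(1 in s.values)  # True  -- membership test on the actual values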


回答 4

import pandas

df = pandas.DataFrame({'g':[1]}, index=['isStop'])

#df.loc['g']

if 'g' in df.index:
    print("find g")

if 'isStop' in df.index:
    print("find isStop")

回答 5

下面的代码不会打印布尔值,而是允许按索引对数据框取子集……我知道这可能不是解决问题的最高效方法,但是我(1)喜欢这种写法的可读性,并且(2)您可以轻松地选取 df1 索引存在于 df2 中的行:

df3 = df1[df1.index.isin(df2.index)]

或df2中不存在df1索引的地方…

df3 = df1[~df1.index.isin(df2.index)]

Code below does not print boolean, but allows for dataframe subsetting by index… I understand this is likely not the most efficient way to solve the problem, but I (1) like the way this reads and (2) you can easily subset where df1 index exists in df2:

df3 = df1[df1.index.isin(df2.index)]

or where df1 index does not exist in df2…

df3 = df1[~df1.index.isin(df2.index)]
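
A small self-contained sketch of the subsetting above, with two hypothetical frames df1 and df2:

import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({'y': [10, 20]}, index=['b', 'c'])

print(df1[df1.index.isin(df2.index)])   # rows 'b' and 'c'
print(df1[~df1.index.isin(df2.index)])  # row 'a'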

标准化熊猫数据框的列

问题:标准化熊猫数据框的列

我在熊猫中有一个数据框,其中每一列都有不同的值范围。例如:

df:

A     B   C
1000  10  0.5
765   5   0.35
800   7   0.09

知道如何将每个值介于0和1之间的数据框的列标准化吗?

我想要的输出是:

A     B    C
1     1    1
0.765 0.5  0.7
0.8   0.7  0.18(which is 0.09/0.5)

I have a dataframe in pandas where each column has different value range. For example:

df:

A     B   C
1000  10  0.5
765   5   0.35
800   7   0.09

Any idea how I can normalize the columns of this dataframe where each value is between 0 and 1?

My desired output is:

A     B    C
1     1    1
0.765 0.5  0.7
0.8   0.7  0.18(which is 0.09/0.5)

回答 0

您可以使用软件包sklearn及其关联的预处理实用程序来规范化数据。

import pandas as pd
from sklearn import preprocessing

x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)

有关更多信息,请参见有关预处理数据的scikit-learn 文档:将特征缩放到一定范围。

You can use the package sklearn and its associated preprocessing utilities to normalize the data.

import pandas as pd
from sklearn import preprocessing

x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)

For more information look at the scikit-learn documentation on preprocessing data: scaling features to a range.
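
Note that fit_transform returns a plain NumPy array, so the rebuilt DataFrame above loses the original column names and index. A minimal sketch of how to keep them, assuming the df from the question:

import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'A': [1000, 765, 800],
                   'B': [10, 5, 7],
                   'C': [0.5, 0.35, 0.09]})
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df.values)
# pass the original labels back in explicitly
df_scaled = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)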


回答 1

使用Pandas的一种简单方法:(这里我要使用均值归一化)

normalized_df=(df-df.mean())/df.std()

使用最小-最大规格化:

normalized_df=(df-df.min())/(df.max()-df.min())

编辑:为回应一些疑问,需要说明的是,在上面的代码中,pandas 会自动按列应用该函数。

one easy way by using Pandas: (here I want to use mean normalization)

normalized_df=(df-df.mean())/df.std()

to use min-max normalization:

normalized_df=(df-df.min())/(df.max()-df.min())

Edit: To address some concerns, need to say that Pandas automatically applies column-wise function in the code above.
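
A minimal sketch applying both variants to the frame from the question; pandas broadcasts the column-wise statistics automatically, so no loop or apply is needed:

import pandas as pd

df = pd.DataFrame({'A': [1000, 765, 800],
                   'B': [10, 5, 7],
                   'C': [0.5, 0.35, 0.09]})

# mean normalization: zero mean, unit (sample) standard deviation per column
standardized = (df - df.mean()) / df.std()

# min-max normalization: each column rescaled so its min is 0 and its max is 1
normalized = (df - df.min()) / (df.max() - df.min())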


回答 2

根据这篇文章:https : //stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range

您可以执行以下操作:

def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

您无需担心您的值是负数还是正数,这些值会很好地分布在 0 和 1 之间。

Based on this post: https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range

You can do the following:

def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

You don’t need to stay worrying about whether your values are negative or positive. And the values should be nicely spread out between 0 and 1.


回答 3

您的问题实际上是作用在列上的简单转换:

def f(s):
    return s/s.max()

frame.apply(f, axis=0)

或更简洁:

   frame.apply(lambda x: x/x.max(), axis=0)

Your problem is actually a simple transform acting on the columns:

def f(s):
    return s/s.max()

frame.apply(f, axis=0)

Or even more terse:

   frame.apply(lambda x: x/x.max(), axis=0)

回答 4

如果您喜欢使用 sklearn 包,则可以借助 pandas 的 loc 保留列名和索引名,如下所示:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
scaled_values = scaler.fit_transform(df) 
df.loc[:,:] = scaled_values

If you like using the sklearn package, you can keep the column and index names by using pandas loc like so:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
scaled_values = scaler.fit_transform(df) 
df.loc[:,:] = scaled_values

回答 5

简单即美:

df["A"] = df["A"] / df["A"].max()
df["B"] = df["B"] / df["B"].max()
df["C"] = df["C"] / df["C"].max()

Simple is Beautiful:

df["A"] = df["A"] / df["A"].max()
df["B"] = df["B"] / df["B"].max()
df["C"] = df["C"] / df["C"].max()

回答 6

您可以创建要标准化的列的列表

column_names_to_normalize = ['A', 'E', 'G', 'sadasdsd', 'lol']
min_max_scaler = preprocessing.MinMaxScaler()  # from sklearn import preprocessing
x = df[column_names_to_normalize].values
x_scaled = min_max_scaler.fit_transform(x)
df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index = df.index)
df[column_names_to_normalize] = df_temp

现在,您的Pandas Dataframe仅在您想要的列上进行了标准化


但是,如果您想要相反的效果——选出一个不想规范化的列的列表——您可以简单地创建包含所有列的列表,再从中删除不需要的列:

column_names_to_not_normalize = ['B', 'J', 'K']
column_names_to_normalize = [x for x in list(df) if x not in column_names_to_not_normalize ]

You can create a list of columns that you want to normalize

column_names_to_normalize = ['A', 'E', 'G', 'sadasdsd', 'lol']
min_max_scaler = preprocessing.MinMaxScaler()  # from sklearn import preprocessing
x = df[column_names_to_normalize].values
x_scaled = min_max_scaler.fit_transform(x)
df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index = df.index)
df[column_names_to_normalize] = df_temp

Your Pandas Dataframe is now normalized only at the columns you want


However, if you want the opposite, select a list of columns that you DON’T want to normalize, you can simply create a list of all columns and remove that non desired ones

column_names_to_not_normalize = ['B', 'J', 'K']
column_names_to_normalize = [x for x in list(df) if x not in column_names_to_not_normalize ]

回答 7

我认为在熊猫中做到这一点的更好方法是

df = df/df.max().astype(np.float64)

编辑:如果数据框中出现负数,则应改用

df = df/df.loc[df.abs().idxmax()].astype(np.float64)

I think that a better way to do that in pandas is just

df = df/df.max().astype(np.float64)

Edit If in your data frame negative numbers are present you should use instead

df = df/df.loc[df.abs().idxmax()].astype(np.float64)

回答 8

Sandman和Praveen给出的解决方案非常好。唯一的问题是,如果数据框的其他列中有类别变量,则此方法将需要进行一些调整。

我针对此类问题的解决方案如下:

 from sklearn import preprocessing
 x = pd.concat([df.Numerical1, df.Numerical2, df.Numerical3], axis=1)
 min_max_scaler = preprocessing.MinMaxScaler()
 x_scaled = min_max_scaler.fit_transform(x)
 x_new = pd.DataFrame(x_scaled)
 df = pd.concat([df.Categoricals, x_new], axis=1)

The solution given by Sandman and Praveen works very well. The only problem is that if you have categorical variables in other columns of your data frame, this method will need some adjustments.

My solution to this type of issue is following:

 from sklearn import preprocessing
 x = pd.concat([df.Numerical1, df.Numerical2, df.Numerical3], axis=1)
 min_max_scaler = preprocessing.MinMaxScaler()
 x_scaled = min_max_scaler.fit_transform(x)
 x_new = pd.DataFrame(x_scaled)
 df = pd.concat([df.Categoricals, x_new], axis=1)

回答 9

python中不同标准化的示例。

作为参考,请参阅以下维基百科文章:https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation

示例数据

import pandas as pd
df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
               'C':list('abc')
             })
print(df)
   A    B  C
0  1  100  a
1  2  300  b
2  3  500  c

使用熊猫进行归一化(给出无偏估计)

归一化时,我们只需减去平均值并除以标准差即可。

df.iloc[:,0:-1] = df.iloc[:,0:-1].apply(lambda x: (x-x.mean())/ x.std(), axis=0)
print(df)
     A    B  C
0 -1.0 -1.0  a
1  0.0  0.0  b
2  1.0  1.0  c

使用 sklearn 进行归一化(给出有偏估计,与 pandas 不同)

如果您用 sklearn 做同样的事情,将获得不同的输出!

import pandas as pd

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()


df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
               'C':list('abc')
             })
df.iloc[:,0:-1] = scaler.fit_transform(df.iloc[:,0:-1].to_numpy())
print(df)
          A         B  C
0 -1.224745 -1.224745  a
1  0.000000  0.000000  b
2  1.224745  1.224745  c

sklearn 的有偏估计会削弱机器学习的效果吗?

没有。

sklearn.preprocessing.scale 的官方文档指出,使用有偏估计量不太可能影响机器学习算法的性能,因此我们可以放心使用。

From official documentation:
We use a biased estimator for the standard deviation,
equivalent to numpy.std(x, ddof=0). 
Note that the choice of ddof is unlikely to affect model performance.

MinMax Scaling怎么样?

MinMax缩放中没有标准偏差计算。因此,在熊猫和scikit-learn中,结果都是相同的。

import pandas as pd
df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
             })
(df - df.min()) / (df.max() - df.min())
     A    B
0  0.0  0.0
1  0.5  0.5
2  1.0  1.0


# Using sklearn
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
arr_scaled = scaler.fit_transform(df) 

print(arr_scaled)
[[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]

df_scaled = pd.DataFrame(arr_scaled, columns=df.columns,index=df.index)
print(df_scaled)
     A    B
0  0.0  0.0
1  0.5  0.5
2  1.0  1.0

Example of different standardizations in python.

For reference look at this wikipedia article: Unbiased Estimation of Standard Deviation

Example Data

import pandas as pd
df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
               'C':list('abc')
             })
print(df)
   A    B  C
0  1  100  a
1  2  300  b
2  3  500  c

Normalization using pandas (Gives unbiased estimates)

When normalizing we simply subtract the mean and divide by standard deviation.

df.iloc[:,0:-1] = df.iloc[:,0:-1].apply(lambda x: (x-x.mean())/ x.std(), axis=0)
print(df)
     A    B  C
0 -1.0 -1.0  a
1  0.0  0.0  b
2  1.0  1.0  c

Normalization using sklearn (Gives biased estimates, different from pandas)

If you do the same thing with sklearn you will get DIFFERENT output!

import pandas as pd

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()


df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
               'C':list('abc')
             })
df.iloc[:,0:-1] = scaler.fit_transform(df.iloc[:,0:-1].to_numpy())
print(df)
          A         B  C
0 -1.224745 -1.224745  a
1  0.000000  0.000000  b
2  1.224745  1.224745  c

Does Biased estimates of sklearn makes Machine Learning Less Powerful?

NO.

The official documentation of sklearn.preprocessing.scale states that using biased estimator is UNLIKELY to affect the performance of machine learning algorithms and we can safely use them.

From official documentation:

We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.

What about MinMax Scaling?

There is no Standard Deviation calculation in MinMax scaling. So the result is same in both pandas and scikit-learn.

import pandas as pd
df = pd.DataFrame({
               'A':[1,2,3],
               'B':[100,300,500],
             })
(df - df.min()) / (df.max() - df.min())
     A    B
0  0.0  0.0
1  0.5  0.5
2  1.0  1.0


# Using sklearn
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
arr_scaled = scaler.fit_transform(df) 

print(arr_scaled)
[[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]

df_scaled = pd.DataFrame(arr_scaled, columns=df.columns,index=df.index)
print(df_scaled)
     A    B
0  0.0  0.0
1  0.5  0.5
2  1.0  1.0

回答 10

您可能希望对某些列进行规范化,而保持其他列不变,例如某些回归任务中数据标签列或类别列需要保持不变。因此,我建议您使用这种 pythonic 的方式(它是 @shg 和 @Cina 答案的组合):

features_to_normalize = ['A', 'B', 'C']
# could be ['A','B'] 

df[features_to_normalize] = df[features_to_normalize].apply(lambda x:(x-x.min()) / (x.max()-x.min()))

You might want to have some of columns being normalized and the others be unchanged like some of regression tasks which data labels or categorical columns are unchanged So I suggest you this pythonic way (It’s a combination of @shg and @Cina answers ):

features_to_normalize = ['A', 'B', 'C']
# could be ['A','B'] 

df[features_to_normalize] = df[features_to_normalize].apply(lambda x:(x-x.min()) / (x.max()-x.min()))

回答 11

这只是简单的数学,答案就像下面这样简单。

normed_df = (df - df.min()) / (df.max() - df.min())

It is only simple mathematics. The answer should be as simple as below.

normed_df = (df - df.min()) / (df.max() - df.min())

回答 12

def normalize(x):
    try:
        x = x/np.linalg.norm(x,ord=1)
        return x
    except :
        raise
data = pd.DataFrame.apply(data,normalize)

根据 pandas 的文档,DataFrame 结构可以将一个操作(函数)应用于自身。

DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)

沿DataFrame的输入轴应用功能。传递给函数的对象是Series对象,其索引为DataFrame的索引(axis = 0)或列(axis = 1)。返回类型取决于传递的函数是否聚合,或者取决于DataFrame为空时的reduce参数。

您可以应用自定义函数来操作DataFrame。

def normalize(x):
    try:
        x = x/np.linalg.norm(x,ord=1)
        return x
    except :
        raise
data = pd.DataFrame.apply(data,normalize)

From the document of pandas,DataFrame structure can apply an operation (function) to itself .

DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)

Applies function along input axis of DataFrame. Objects passed to functions are Series objects having index either the DataFrame’s index (axis=0) or the columns (axis=1). Return type depends on whether passed function aggregates, or the reduce argument if the DataFrame is empty.

You can apply a custom function to operate the DataFrame .


回答 13

以下函数计算Z分数:

def standardization(dataset):
  """ Standardization of numeric fields, where all values will have mean of zero 
  and standard deviation of one. (z-score)

  Args:
    dataset: A `Pandas.Dataframe` 
  """
  dtypes = list(zip(dataset.dtypes.index, map(str, dataset.dtypes)))
  # Normalize numeric columns.
  for column, dtype in dtypes:
      if dtype == 'float32':
          dataset[column] -= dataset[column].mean()
          dataset[column] /= dataset[column].std()
  return dataset

The following function calculates the Z score:

def standardization(dataset):
  """ Standardization of numeric fields, where all values will have mean of zero 
  and standard deviation of one. (z-score)

  Args:
    dataset: A `Pandas.Dataframe` 
  """
  dtypes = list(zip(dataset.dtypes.index, map(str, dataset.dtypes)))
  # Normalize numeric columns.
  for column, dtype in dtypes:
      if dtype == 'float32':
          dataset[column] -= dataset[column].mean()
          dataset[column] /= dataset[column].std()
  return dataset

回答 14

这是使用列表推导按列进行的方式:

[df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min())) for col in df.columns]

This is how you do it column-wise using list comprehension:

[df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min())) for col in df.columns]

回答 15

您可以通过这种方式简单地使用 pandas.DataFrame.transform 函数:

df.transform(lambda x: x/x.max())

You can simply use the pandas.DataFrame.transform function in this way:

df.transform(lambda x: x/x.max())

回答 16

df_normalized = df / df.max(axis=0)

回答 17

您可以一行完成

DF_test = DF_test.sub(DF_test.mean(axis=0), axis=1)/DF_test.mean(axis=0)

它先计算每一列的均值,然后从该列的每一行中减去相应列的均值(特定列的均值只从该列的行中减去),再除以该列的均值。最终我们得到的是归一化后的数据集。

You can do this in one line

DF_test = DF_test.sub(DF_test.mean(axis=0), axis=1)/DF_test.mean(axis=0)

it takes the mean of each column, then subtracts that column’s mean from every row of that column, and divides by the mean. Finally, what we get is the normalized data set.


回答 18

熊猫默认情况下会按列进行规范化。试试下面的代码。

X= pd.read_csv('.\\data.csv')
X = (X-X.min())/(X.max()-X.min())

输出值将在0到1的范围内。

Pandas does column wise normalization by default. Try the code below.

X= pd.read_csv('.\\data.csv')
X = (X-X.min())/(X.max()-X.min())

The output values will be in range of 0 and 1.


如何制作好的可复制熊猫实例

问题:如何制作好的可复制熊猫实例

在 SO 上花了相当多的时间关注这两个标签之后,我的印象是 pandas 问题不太可能包含可复现的数据。在鼓励这一点方面,R 社区一直做得很好;多亏了像这样的指南,新人能够得到一些帮助来组织这些示例。能够阅读这些指南并带着可复现数据回来提问的人,通常更容易得到问题的答案。

我们如何为pandas问题创建良好的可复制示例?可以将简单的数据框放在一起,例如:

import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'], 
                   'income': [40000, 50000, 42000]})

但是许多示例数据集需要更复杂的结构,例如:

  • datetime 索引或数据
  • 多个类别变量(是否存在与 R 的 expand.grid() 等效的函数?该函数可产生给定变量的所有可能组合)
  • MultiIndex或Panel数据

对于难以用几行代码模拟的数据集,是否有与 R 的 dput() 等效的功能,可让您生成可复制粘贴的代码来重新生成数据结构?

Having spent a decent amount of time watching both the and tags on SO, the impression that I get is that pandas questions are less likely to contain reproducible data. This is something that the R community has been pretty good about encouraging, and thanks to guides like this, newcomers are able to get some help on putting together these examples. People who are able to read these guides and come back with reproducible data will often have much better luck getting answers to their questions.

How can we create good reproducible examples for pandas questions? Simple dataframes can be put together, e.g.:

import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'], 
                   'income': [40000, 50000, 42000]})

But many example datasets need more complicated structure, e.g.:

  • datetime indices or data
  • Multiple categorical variables (is there an equivalent to R’s expand.grid() function, which produces all possible combinations of some given variables?)
  • MultiIndex or Panel data

For datasets that are hard to mock up using a few lines of code, is there an equivalent to R’s dput() that allows you to generate copy-pasteable code to regenerate your datastructure?


回答 0

注意:这里的想法对于 Stack Overflow 来说相当通用,实际上适用于所有提问。

免责声明:写一个好问题是困难的。

好:

  • 确实包含小的*示例DataFrame,作为可运行代码:

    In [1]: df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

或者使用 pd.read_clipboard(sep='\s\s+') 使其“可复制粘贴”;您可以将文本格式化为 Stack Overflow 的代码高亮并使用 Ctrl+K(或在每行前添加四个空格),或者在代码上方和下方各放置三个波浪号,代码本身无需缩进:

    In [2]: df
    Out[2]: 
       A  B
    0  1  2
    1  1  3
    2  4  6
    

请自己测试一下 pd.read_clipboard(sep='\s\s+')。

    * 我说的“小”是认真的:绝大多数示例 DataFrame 可以少于 6 行(需要引用),而且我敢打赌我能用 5 行做到。您能用 df = df.head() 重现错误吗?如果不能,就试着拼凑一个能展示您所遇问题的小 DataFrame。

    * 所有规则都有例外,一个明显的例外是性能问题(这种情况下一定要使用 %timeit,可能还有 %prun),此时您应该生成大数据(考虑使用 np.random.seed,这样我们会得到完全相同的数据帧):df = pd.DataFrame(np.random.randn(100000000, 10))。话虽如此,“帮我把这段代码变快”严格来说并不属于本站的主题……

  • 写出您想要的结果(与上面类似)

    In [3]: iwantthis
    Out[3]: 
       A  B
    0  1  5
    1  4  6

    解释数字的来源:5是A为1的行的B列之和。

  • 确实显示您尝试过的代码

    In [4]: df.groupby('A').sum()
    Out[4]: 
       B
    A   
    1  5
    4  6

    但是,请说出不正确的地方:A列位于索引中,而不是列中。

  • 确实表明您已经做过一些研究(搜索文档、搜索 StackOverflow),并给出摘要:

    sum的文档字符串仅声明“计算组值之和”

groupby 文档没有为此给出任何示例。

    撇开:这里的答案是使用df.groupby('A', as_index=False).sum()

  • 如果“时间戳”列与问题相关(例如,您正在做重采样之类的操作),请明确说明,并对它们应用 pd.to_datetime 以确保万无一失**。

    df['date'] = pd.to_datetime(df['date']) # this column ought to be date..

    ** 有时这就是问题本身:它们是字符串。

坏处:

  • 不要包含我们无法复制粘贴的 MultiIndex(请参见上文)。这在一定程度上要归咎于 pandas 的默认显示方式,但无论如何都很烦人:

    In [11]: df
    Out[11]:
         C
    A B   
    1 2  3
      2  6

正确的方法是提供一个普通的 DataFrame,外加一个 set_index 调用:

    In [12]: df = pd.DataFrame([[1, 2, 3], [1, 2, 6]], columns=['A', 'B', 'C']).set_index(['A', 'B'])
    
    In [13]: df
    Out[13]: 
         C
    A B   
    1 2  3
      2  6
  • 在给出期望结果时,请解释它到底是什么:

       B
    A   
    1  1
    5  0

请具体说明这些数字是如何得出的(它们是什么)……并仔细核对它们是正确的。

  • 如果您的代码抛出错误,请包括整个堆栈跟踪信息(如果噪声太大,可以稍后编辑)。显示行号(以及代码所针对的行)。

丑陋的:

  • 不要链接到我们无权访问的CSV(理想情况下根本不要链接到外部源…)

    df = pd.read_csv('my_secret_file.csv')  # ideally with lots of parsing options

我们明白,大多数数据是专有的:请编造类似的数据,看看能否重现问题(小规模即可)。

  • 不要用语言含糊地描述情况,比如说您有一个“大”DataFrame,顺带提及一些列名(并确保不提它们的 dtype);不要在脱离实际上下文就毫无意义的细节上大书特书。大概没人会读到本段末尾。

    杂文不好,用小例子更容易。

  • 在讨论您的实际问题之前,请勿包含10+(100+ ??)行数据处理。

    拜托,我们在日常工作中会看到足够多的东西。我们想提供帮助,但不是这样
删掉冗长的铺垫,只在引起麻烦的那一步展示相关的 DataFrame(或其精简版本)。

无论如何,祝您学习Python,NumPy和Pandas玩得开心!

Note: The ideas here are pretty generic for Stack Overflow, indeed for questions in general.

Disclaimer: Writing a good question is HARD.

The Good:

  • do include small* example DataFrame, either as runnable code:

    In [1]: df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
    

    or make it “copy and pasteable” using pd.read_clipboard(sep='\s\s+'), you can format the text for Stack Overflow highlight and use Ctrl+K (or prepend four spaces to each line), or place three tildes above and below your code with your code unindented:

    In [2]: df
    Out[2]: 
       A  B
    0  1  2
    1  1  3
    2  4  6
    

    test pd.read_clipboard(sep='\s\s+') yourself.

    * I really do mean small, the vast majority of example DataFrames could be fewer than 6 rows [citation needed], and I bet I can do it in 5 rows. Can you reproduce the error with df = df.head(), if not fiddle around to see if you can make up a small DataFrame which exhibits the issue you are facing.

    * Every rule has an exception, the obvious one is for performance issues (in which case definitely use %timeit and possibly %prun), where you should generate (consider using np.random.seed so we have the exact same frame): df = pd.DataFrame(np.random.randn(100000000, 10)). Saying that, “make this code fast for me” is not strictly on topic for the site…

  • write out the outcome you desire (similarly to above)

    In [3]: iwantthis
    Out[3]: 
       A  B
    0  1  5
    1  4  6
    

    Explain what the numbers come from: the 5 is sum of the B column for the rows where A is 1.

  • do show the code you’ve tried:

    In [4]: df.groupby('A').sum()
    Out[4]: 
       B
    A   
    1  5
    4  6
    

    But say what’s incorrect: the A column is in the index rather than a column.

  • do show you’ve done some research (search the docs, search StackOverflow), give a summary:

    The docstring for sum simply states “Compute sum of group values”

    The groupby docs don’t give any examples for this.

    Aside: the answer here is to use df.groupby('A', as_index=False).sum().

  • if it’s relevant that you have Timestamp columns, e.g. you’re resampling or something, then be explicit and apply pd.to_datetime to them for good measure**.

    df['date'] = pd.to_datetime(df['date']) # this column ought to be date..
    

    ** Sometimes this is the issue itself: they were strings.

The Bad:

  • don’t include a MultiIndex, which we can’t copy and paste (see above), this is kind of a grievance with pandas default display but nonetheless annoying:

    In [11]: df
    Out[11]:
         C
    A B   
    1 2  3
      2  6
    

    The correct way is to include an ordinary DataFrame with a set_index call:

    In [12]: df = pd.DataFrame([[1, 2, 3], [1, 2, 6]], columns=['A', 'B', 'C']).set_index(['A', 'B'])
    
    In [13]: df
    Out[13]: 
         C
    A B   
    1 2  3
      2  6
    
  • do provide insight to what it is when giving the outcome you want:

       B
    A   
    1  1
    5  0
    

    Be specific about how you got the numbers (what are they)… double check they’re correct.

  • If your code throws an error, do include the entire stack trace (this can be edited out later if it’s too noisy). Show the line number (and the corresponding line of your code which it’s raising against).

The Ugly:

  • don’t link to a csv we don’t have access to (ideally don’t link to an external source at all…)

    df = pd.read_csv('my_secret_file.csv')  # ideally with lots of parsing options
    

    Most data is proprietary we get that: Make up similar data and see if you can reproduce the problem (something small).

  • don’t explain the situation vaguely in words, like you have a DataFrame which is “large”, mention some of the column names in passing (be sure not to mention their dtypes). Try and go into lots of detail about something which is completely meaningless without seeing the actual context. Presumably no one is even going to read to the end of this paragraph.

    Essays are bad, it’s easier with small examples.

  • don’t include 10+ (100+??) lines of data munging before getting to your actual question.

    Please, we see enough of this in our day jobs. We want to help, but not like this….
    Cut the intro, and just show the relevant DataFrames (or small versions of them) in the step which is causing you trouble.

Anyways, have fun learning Python, NumPy and Pandas!


回答 1

如何创建样本数据集

这主要是为了扩展 @AndyHayden 的答案,提供一些如何创建示例数据框的例子。pandas 和(特别是)numpy 为此提供了多种工具,因此您通常只需几行代码,就可以为几乎任何真实数据集创建一个合理的仿制品。

导入numpy和pandas之后,如果您希望人们能够准确地复制数据和结果,请确保提供随机种子。

import numpy as np
import pandas as pd

np.random.seed(123)

“大杂烩”(kitchen sink)示例

这是一个示例,显示了您可以执行的各种操作。各种有用的示例数据框都可以从其中的一个子集创建:

df = pd.DataFrame({ 

    # some ways to create random data
    'a':np.random.randn(6),
    'b':np.random.choice( [5,7,np.nan], 6),
    'c':np.random.choice( ['panda','python','shark'], 6),

    # some ways to create systematic groups for indexing or groupby
    # this is similar to r's expand.grid(), see note 2 below
    'd':np.repeat( range(3), 2 ),
    'e':np.tile(   range(2), 3 ),

    # a date range and set of random dates
    'f':pd.date_range('1/1/2011', periods=6, freq='D'),
    'g':np.random.choice( pd.date_range('1/1/2011', periods=365, 
                          freq='D'), 6, replace=False) 
    })

这将生成:

          a   b       c  d  e          f          g
0 -1.085631 NaN   panda  0  0 2011-01-01 2011-08-12
1  0.997345   7   shark  0  1 2011-01-02 2011-11-10
2  0.282978   5   panda  1  0 2011-01-03 2011-10-30
3 -1.506295   7  python  1  1 2011-01-04 2011-09-07
4 -0.578600 NaN   shark  2  0 2011-01-05 2011-02-27
5  1.651437   7  python  2  1 2011-01-06 2011-02-03

一些注意事项:

  1. np.repeat 和 np.tile(列 d 和 e)对于以非常规则的方式创建分组和索引非常有用。对于两列的情况,它们可以轻松复刻 R 的 expand.grid(),而且在只提供所有组合的一个子集方面更加灵活。但是,对于 3 列或更多列,语法很快会变得笨拙。
  2. 要更直接地替代 R 的 expand.grid(),请参阅 pandas cookbook 中的 itertools 解决方案,或此处展示的 np.meshgrid 解决方案。它们支持任意数量的维度。
  3. np.random.choice 能做的事情很多。例如,在 g 列中,我们从 2011 年的日期里随机选择了 6 个。此外,通过设置 replace=False,我们可以确保这些日期是唯一的——如果我们想将其用作具有唯一值的索引,这非常方便。

假股市数据

除了获取上述代码的子集之外,您还可以进一步结合这些技术来执行几乎所有操作。例如,这是一个简短的示例,它结合np.tiledate_range创建涵盖相同日期的4只股票的样本报价数据:

stocks = pd.DataFrame({ 
    'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
    'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
    'price':(np.random.randn(100).cumsum() + 10) })

现在,我们有了一个具有100行的示例数据集(每个代码25个日期),但是仅用了4行就可以做到,这使得其他所有人都可以轻松复制而无需复制和粘贴100行代码。然后,如果可以帮助解释您的问题,则可以显示数据的子集:

>>> stocks.head(5)

        date      price ticker
0 2011-01-01   9.497412   aapl
1 2011-01-02  10.261908   aapl
2 2011-01-03   9.438538   aapl
3 2011-01-04   9.515958   aapl
4 2011-01-05   7.554070   aapl

>>> stocks.groupby('ticker').head(2)

         date      price ticker
0  2011-01-01   9.497412   aapl
1  2011-01-02  10.261908   aapl
25 2011-01-01   8.277772   goog
26 2011-01-02   7.714916   goog
50 2011-01-01   5.613023   yhoo
51 2011-01-02   6.397686   yhoo
75 2011-01-01  11.736584   msft
76 2011-01-02  11.944519   msft

How to create sample datasets

This is to mainly to expand on @AndyHayden’s answer by providing examples of how you can create sample dataframes. Pandas and (especially) numpy give you a variety of tools for this such that you can generally create a reasonable facsimile of any real dataset with just a few lines of code.

After importing numpy and pandas, be sure to provide a random seed if you want folks to be able to exactly reproduce your data and results.

import numpy as np
import pandas as pd

np.random.seed(123)

A kitchen sink example

Here’s an example showing a variety of things you can do. All kinds of useful sample dataframes could be created from a subset of this:

df = pd.DataFrame({ 

    # some ways to create random data
    'a':np.random.randn(6),
    'b':np.random.choice( [5,7,np.nan], 6),
    'c':np.random.choice( ['panda','python','shark'], 6),

    # some ways to create systematic groups for indexing or groupby
    # this is similar to r's expand.grid(), see note 2 below
    'd':np.repeat( range(3), 2 ),
    'e':np.tile(   range(2), 3 ),

    # a date range and set of random dates
    'f':pd.date_range('1/1/2011', periods=6, freq='D'),
    'g':np.random.choice( pd.date_range('1/1/2011', periods=365, 
                          freq='D'), 6, replace=False) 
    })

This produces:

          a   b       c  d  e          f          g
0 -1.085631 NaN   panda  0  0 2011-01-01 2011-08-12
1  0.997345   7   shark  0  1 2011-01-02 2011-11-10
2  0.282978   5   panda  1  0 2011-01-03 2011-10-30
3 -1.506295   7  python  1  1 2011-01-04 2011-09-07
4 -0.578600 NaN   shark  2  0 2011-01-05 2011-02-27
5  1.651437   7  python  2  1 2011-01-06 2011-02-03

Some notes:

  1. np.repeat and np.tile (columns d and e) are very useful for creating groups and indices in a very regular way. For 2 columns, this can be used to easily duplicate r’s expand.grid() but is also more flexible in ability to provide a subset of all permutations. However, for 3 or more columns the syntax quickly becomes unwieldy.
  2. For a more direct replacement for r’s expand.grid() see the itertools solution in the pandas cookbook or the np.meshgrid solution shown here (a small sketch follows after these notes). Those will allow any number of dimensions.
  3. You can do quite a bit with np.random.choice. For example, in column g, we have a random selection of 6 dates from 2011. Additionally, by setting replace=False we can assure these dates are unique — very handy if we want to use this as an index with unique values.
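
For reference, here is a minimal expand.grid-style helper along the lines of the itertools recipe mentioned in note 2; expand_grid is a hypothetical name, not a pandas function:

import itertools
import pandas as pd

def expand_grid(**kwargs):
    """Build a DataFrame with one row per combination of the given sequences."""
    columns = list(kwargs)
    rows = itertools.product(*kwargs.values())
    return pd.DataFrame.from_records(rows, columns=columns)

grid = expand_grid(species=['panda', 'python', 'shark'], year=[2011, 2012])
# 6 rows: every (species, year) combination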

Fake stock market data

In addition to taking subsets of the above code, you can further combine the techniques to do just about anything. For example, here’s a short example that combines np.tile and date_range to create sample ticker data for 4 stocks covering the same dates:

stocks = pd.DataFrame({ 
    'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
    'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
    'price':(np.random.randn(100).cumsum() + 10) })

Now we have a sample dataset with 100 lines (25 dates per ticker), but we have only used 4 lines to do it, making it easy for everyone else to reproduce without copying and pasting 100 lines of code. You can then display subsets of the data if it helps to explain your question:

>>> stocks.head(5)

        date      price ticker
0 2011-01-01   9.497412   aapl
1 2011-01-02  10.261908   aapl
2 2011-01-03   9.438538   aapl
3 2011-01-04   9.515958   aapl
4 2011-01-05   7.554070   aapl

>>> stocks.groupby('ticker').head(2)

         date      price ticker
0  2011-01-01   9.497412   aapl
1  2011-01-02  10.261908   aapl
25 2011-01-01   8.277772   goog
26 2011-01-02   7.714916   goog
50 2011-01-01   5.613023   yhoo
51 2011-01-02   6.397686   yhoo
75 2011-01-01  11.736584   msft
76 2011-01-02  11.944519   msft

回答 2

回答者日记

对于提问,我最好的建议是揣摩回答问题者的心理。作为其中一员,我可以说说为什么我会回答某些问题,而不回答其他问题。

动机

我出于以下几个原因而愿意回答问题

  1. 对我来说,Stackoverflow.com是非常宝贵的资源。我想回馈。
  2. 在回馈的过程中,我发现这个网站是比以前更强大的资源。回答问题对我来说是一种学习经历,而我喜欢学习。请读读这个回答以及另一位资深用户的评论,这样的互动让我感到高兴。
  3. 我喜欢积分!
  4. 参见#3。
  5. 我喜欢有趣的问题。

我所有最纯粹的动机固然很好,但无论回答 1 个问题还是 30 个,我都能获得那种满足感。驱使我选择回答哪些问题的,很大一部分因素是分数最大化。

我也会在有趣的问题上花时间,但这种情况少之又少,而且对需要解决无趣问题的提问者没有帮助。让我回答您问题的最好办法,是把问题像端在盘子里一样送到我面前,让我花尽可能少的力气就能回答。如果我同时看到两个问题,其中一个附有可以复制粘贴、能创建我所需全部变量的代码……我就选那个!有时间的话,我也许会再回来看另一个。

主要建议

使人们易于回答问题。

  • 提供创建所需变量的代码。
  • 最小化该代码。如果我在看帖子时眼神呆滞,那我将继续下一个问题,或者回到我正在做的其他事情。
  • 考虑一下您要问的内容并做到具体。我们想看看您做了什么,因为自然语言(英语)不准确且令人困惑。您尝试过的代码示例有助于解决自然语言描述中的不一致问题。
  • 请显示您的期望!!!我必须坐下来尝试一下。如果不尝试一些事情,我几乎永远不会知道问题的答案。如果我看不到您要查找的示例,则可能会跳过这个问题,因为我不想猜测。

您的声誉不仅仅是您的声誉。

我喜欢积分(上面提到过)。但这些积分并不真正是我的声誉。我真正的声誉,是本站其他人对我看法的融合。我努力做到公平诚实,希望其他人能看到这一点。对于提问者而言,这意味着我们会记住提问者的行为。如果您不采纳答案、不给好的答案点赞,我会记得。如果您的行为让我反感或让我欣赏,我也会记得。这同样会影响我选择回答哪些问题。


无论如何,我可能可以继续,但是我会饶恕所有真正读过这篇文章的人。

Diary of an Answerer

My best advice for asking questions would be to play on the psychology of the people who answer questions. Being one of those people, I can give insight into why I answer certain questions and why I don’t answer others.

Motivations

I’m motivated to answer questions for several reasons

  1. Stackoverflow.com has been a tremendously valuable resource to me. I wanted to give back.
  2. In my efforts to give back, I’ve found this site to be an even more powerful resource than before. Answering questions is a learning experience for me and I like to learn. Read this answer and comment from another vet. This kind of interaction makes me happy.
  3. I like points!
  4. See #3.
  5. I like interesting problems.

All my purest intentions are great and all, but I get that satisfaction if I answer 1 question or 30. What drives my choices for which questions to answer has a huge component of point maximization.

I’ll also spend time on interesting problems but that is few and far between and doesn’t help an asker who needs a solution to a non-interesting question. Your best bet to get me to answer a question is to serve that question up on a platter ripe for me to answer it with as little effort as possible. If I’m looking at two questions and one has code I can copy paste to create all the variables I need… I’m taking that one! I’ll come back to the other one if I have time, maybe.

Main Advice

Make it easy for the people answering questions.

  • Provide code that creates variables that are needed.
  • Minimize that code. If my eyes glaze over as I look at the post, I’m on to the next question or getting back to whatever else I’m doing.
  • Think about what you’re asking and be specific. We want to see what you’ve done because natural languages (English) are inexact and confusing. Code samples of what you’ve tried help resolve inconsistencies in a natural language description.
  • PLEASE show what you expect!!! I have to sit down and try things. I almost never know the answer to a question without trying some things out. If I don’t see an example of what you’re looking for, I might pass on the question because I don’t feel like guessing.

Your reputation is more than just your reputation.

I like points (I mentioned that above). But those points aren’t really really my reputation. My real reputation is an amalgamation of what others on the site think of me. I strive to be fair and honest and I hope others can see that. What that means for an asker is, we remember the behaviors of askers. If you don’t select answers and upvote good answers, I remember. If you behave in ways I don’t like or in ways I do like, I remember. This also plays into which questions I’ll answer.


Anyway, I can probably go on, but I’ll spare all of you who actually read this.


回答 3

挑战

回答 SO 问题最具挑战性的方面之一,是重现问题(包括数据)所花费的时间。没有清晰的数据重现方法的问题不太可能得到回答。既然您花时间写了问题,并且有一个希望获得帮助的难题,那么您可以提供其他人能够用来帮助解决问题的数据,从而轻松地帮到自己。

@Andy提供的有关编写良好熊猫问题的说明是一个很好的起点。有关更多信息,请参阅如何提问以及如何创建最小,完整和可验证的示例

请事先明确说明您的问题。 花时间写完您的问题和任何示例代码后,请尝试阅读并为您的读者提供一个“执行摘要”,其中概述了问题并清楚地陈述了问题。

原始问题

我有这个数据…

我想做这个…

我希望我的结果看起来像这样…

但是,当我尝试执行[this]时,出现以下问题…

我试图通过[this]和[that]找到解决方案。

我如何解决它?

根据所提供的数据量,示例代码和错误堆栈,读者需要走很长一段路才能理解问题所在。尝试重新陈述问题,使问题本身排在最前面,然后提供必要的详细信息。

修改后的问题

问题:我该如何做[这件事]?

我试图通过[this]和[that]找到解决方案。

当我尝试执行此操作时,出现以下问题…

我希望最终结果看起来像这样…

这是一些可以重现我的问题的最小代码…

这里是如何重新创建示例数据的方法: df = pd.DataFrame({'A': [...], 'B': [...], ...})

如果需要,请提供样本数据!!!

有时只需要 DataFrame 的开头或结尾即可。您也可以使用 @JohnE 提出的方法来创建更大的、其他人可以复现的数据集。用他的示例生成一个 100 行的股票价格 DataFrame:

stocks = pd.DataFrame({ 
    'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
    'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
    'price':(np.random.randn(100).cumsum() + 10) })

如果这是您的实际数据,则可能只需要按以下方式包括数据框的头部和/或尾部(请确保匿名所有敏感数据):

>>> stocks.head(5).to_dict()
{'date': {0: Timestamp('2011-01-01 00:00:00'),
  1: Timestamp('2011-01-01 00:00:00'),
  2: Timestamp('2011-01-01 00:00:00'),
  3: Timestamp('2011-01-01 00:00:00'),
  4: Timestamp('2011-01-02 00:00:00')},
 'price': {0: 10.284260107718254,
  1: 11.930300761831457,
  2: 10.93741046217319,
  3: 10.884574289565609,
  4: 11.78005850418319},
 'ticker': {0: 'aapl', 1: 'aapl', 2: 'aapl', 3: 'aapl', 4: 'aapl'}}

>>> pd.concat([stocks.head(), stocks.tail()], ignore_index=True).to_dict()
{'date': {0: Timestamp('2011-01-01 00:00:00'),
  1: Timestamp('2011-01-01 00:00:00'),
  2: Timestamp('2011-01-01 00:00:00'),
  3: Timestamp('2011-01-01 00:00:00'),
  4: Timestamp('2011-01-02 00:00:00'),
  5: Timestamp('2011-01-24 00:00:00'),
  6: Timestamp('2011-01-25 00:00:00'),
  7: Timestamp('2011-01-25 00:00:00'),
  8: Timestamp('2011-01-25 00:00:00'),
  9: Timestamp('2011-01-25 00:00:00')},
 'price': {0: 10.284260107718254,
  1: 11.930300761831457,
  2: 10.93741046217319,
  3: 10.884574289565609,
  4: 11.78005850418319,
  5: 10.017209045035006,
  6: 10.57090128181566,
  7: 11.442792747870204,
  8: 11.592953372130493,
  9: 12.864146419530938},
 'ticker': {0: 'aapl',
  1: 'aapl',
  2: 'aapl',
  3: 'aapl',
  4: 'aapl',
  5: 'msft',
  6: 'msft',
  7: 'msft',
  8: 'msft',
  9: 'msft'}}

您可能还需要提供DataFrame的描述(仅使用相关列)。这使得其他人更容易检查每一列的数据类型并识别其他常见错误(例如,日期为字符串vs. datetime64 vs.对象):

stocks.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 3 columns):
date      100 non-null datetime64[ns]
price     100 non-null float64
ticker    100 non-null object
dtypes: datetime64[ns](1), float64(1), object(1)

注意:如果您的DataFrame有一个MultiIndex:

如果您的 DataFrame 具有多重索引,则必须先重置索引,然后再调用 to_dict。之后,您需要使用 set_index 重新创建索引:

# MultiIndex example.  First create a MultiIndex DataFrame.
df = stocks.set_index(['date', 'ticker'])
>>> df
                       price
date       ticker           
2011-01-01 aapl    10.284260
           aapl    11.930301
           aapl    10.937410
           aapl    10.884574
2011-01-02 aapl    11.780059
...

# After resetting the index and passing the DataFrame to `to_dict`, make sure to use 
# `set_index` to restore the original MultiIndex.  This DataFrame can then be restored.

d = df.reset_index().to_dict()
df_new = pd.DataFrame(d).set_index(['date', 'ticker'])
>>> df_new.head()
                       price
date       ticker           
2011-01-01 aapl    10.284260
           aapl    11.930301
           aapl    10.937410
           aapl    10.884574
2011-01-02 aapl    11.780059

The Challenge

One of the most challenging aspects of responding to SO questions is the time it takes to recreate the problem (including the data). Questions which don’t have a clear way to reproduce the data are less likely to be answered. Given that you are taking the time to write a question and you have an issue that you’d like help with, you can easily help yourself by providing data that others can then use to help solve your problem.

The instructions provided by @Andy for writing good Pandas questions are an excellent place to start. For more information, refer to how to ask and how to create Minimal, Complete, and Verifiable examples.

Please clearly state your question upfront. After taking the time to write your question and any sample code, try to read it and provide an ‘Executive Summary’ for your reader which summarizes the problem and clearly states the question.

Original question:

I have this data…

I want to do this…

I want my result to look like this…

However, when I try to do [this], I get the following problem…

I’ve tried to find solutions by doing [this] and [that].

How do I fix it?

Depending on the amount of data, sample code and error stacks provided, the reader needs to go a long way before understanding what the problem is. Try restating your question so that the question itself is on top, and then provide the necessary details.

Revised Question:

Question: How can I do [this]?

I’ve tried to find solutions by doing [this] and [that].

When I’ve tried to do [this], I get the following problem…

I’d like my final results to look like this…

Here is some minimal code that can reproduce my problem…

And here is how to recreate my sample data: df = pd.DataFrame({'A': [...], 'B': [...], ...})

PROVIDE SAMPLE DATA IF NEEDED!!!

Sometimes just the head or tail of the DataFrame is all that is needed. You can also use the methods proposed by @JohnE to create larger datasets that can be reproduced by others. Using his example to generate a 100 row DataFrame of stock prices:

stocks = pd.DataFrame({ 
    'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
    'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
    'price':(np.random.randn(100).cumsum() + 10) })

If this was your actual data, you may just want to include the head and/or tail of the dataframe as follows (be sure to anonymize any sensitive data):

>>> stocks.head(5).to_dict()
{'date': {0: Timestamp('2011-01-01 00:00:00'),
  1: Timestamp('2011-01-01 00:00:00'),
  2: Timestamp('2011-01-01 00:00:00'),
  3: Timestamp('2011-01-01 00:00:00'),
  4: Timestamp('2011-01-02 00:00:00')},
 'price': {0: 10.284260107718254,
  1: 11.930300761831457,
  2: 10.93741046217319,
  3: 10.884574289565609,
  4: 11.78005850418319},
 'ticker': {0: 'aapl', 1: 'aapl', 2: 'aapl', 3: 'aapl', 4: 'aapl'}}

>>> pd.concat([stocks.head(), stocks.tail()], ignore_index=True).to_dict()
{'date': {0: Timestamp('2011-01-01 00:00:00'),
  1: Timestamp('2011-01-01 00:00:00'),
  2: Timestamp('2011-01-01 00:00:00'),
  3: Timestamp('2011-01-01 00:00:00'),
  4: Timestamp('2011-01-02 00:00:00'),
  5: Timestamp('2011-01-24 00:00:00'),
  6: Timestamp('2011-01-25 00:00:00'),
  7: Timestamp('2011-01-25 00:00:00'),
  8: Timestamp('2011-01-25 00:00:00'),
  9: Timestamp('2011-01-25 00:00:00')},
 'price': {0: 10.284260107718254,
  1: 11.930300761831457,
  2: 10.93741046217319,
  3: 10.884574289565609,
  4: 11.78005850418319,
  5: 10.017209045035006,
  6: 10.57090128181566,
  7: 11.442792747870204,
  8: 11.592953372130493,
  9: 12.864146419530938},
 'ticker': {0: 'aapl',
  1: 'aapl',
  2: 'aapl',
  3: 'aapl',
  4: 'aapl',
  5: 'msft',
  6: 'msft',
  7: 'msft',
  8: 'msft',
  9: 'msft'}}
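
Anyone reading the question can then rebuild the frame by pasting that dict straight into the constructor — a minimal sketch using a two-row excerpt of the dump above:

import pandas as pd
from pandas import Timestamp  # the dump spells dates as Timestamp(...)

d = {'date': {0: Timestamp('2011-01-01 00:00:00'), 1: Timestamp('2011-01-01 00:00:00')},
     'price': {0: 10.284260107718254, 1: 11.930300761831457},
     'ticker': {0: 'aapl', 1: 'aapl'}}
df = pd.DataFrame(d)  # columns date, price, ticker; index 0, 1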

You may also want to provide a description of the DataFrame (using only the relevant columns). This makes it easier for others to check the data types of each column and identify other common errors (e.g. dates as string vs. datetime64 vs. object):

stocks.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 3 columns):
date      100 non-null datetime64[ns]
price     100 non-null float64
ticker    100 non-null object
dtypes: datetime64[ns](1), float64(1), object(1)

NOTE: If your DataFrame has a MultiIndex:

If your DataFrame has a multiindex, you must first reset before calling to_dict. You then need to recreate the index using set_index:

# MultiIndex example.  First create a MultiIndex DataFrame.
df = stocks.set_index(['date', 'ticker'])
>>> df
                       price
date       ticker           
2011-01-01 aapl    10.284260
           aapl    11.930301
           aapl    10.937410
           aapl    10.884574
2011-01-02 aapl    11.780059
...

# After resetting the index and passing the DataFrame to `to_dict`, make sure to use 
# `set_index` to restore the original MultiIndex.  This DataFrame can then be restored.

d = df.reset_index().to_dict()
df_new = pd.DataFrame(d).set_index(['date', 'ticker'])
>>> df_new.head()
                       price
date       ticker           
2011-01-01 aapl    10.284260
           aapl    11.930301
           aapl    10.937410
           aapl    10.884574
2011-01-02 aapl    11.780059

回答 4

这是我为 Pandas DataFrame 编写的 dput 版本——dput 是 R 中用于生成可复现报告的标准工具。对于更复杂的数据框,它可能会失败,但在简单情况下,它似乎可以完成任务:

import pandas as pd
def dput(x):
    if isinstance(x,pd.Series):
        return "pd.Series(%s,dtype='%s',index=pd.%s)" % (list(x),x.dtype,x.index)
    if isinstance(x,pd.DataFrame):
        return "pd.DataFrame({" + ", ".join([
            "'%s': %s" % (c,dput(x[c])) for c in x.columns]) + (
                "}, index=pd.%s)" % (x.index))
    raise NotImplementedError("dput",type(x),x)

现在,

df = pd.DataFrame({'a':[1,2,3,4,2,1,3,1]})
assert df.equals(eval(dput(df)))
du = pd.get_dummies(df.a,"foo")
assert du.equals(eval(dput(du)))
di = df
di.index = list('abcdefgh')
assert di.equals(eval(dput(di)))

请注意,这产生的输出比 DataFrame.to_dict 详细得多,例如

pd.DataFrame({
  'foo_1':pd.Series([1, 0, 0, 0, 0, 1, 0, 1],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1)),
  'foo_2':pd.Series([0, 1, 0, 0, 1, 0, 0, 0],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1)),
  'foo_3':pd.Series([0, 0, 1, 0, 0, 0, 1, 0],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1)),
  'foo_4':pd.Series([0, 0, 0, 1, 0, 0, 0, 0],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1))},
  index=pd.RangeIndex(start=0, stop=8, step=1))

对比

{'foo_1': {0: 1, 1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 0, 7: 1}, 
 'foo_2': {0: 0, 1: 1, 2: 0, 3: 0, 4: 1, 5: 0, 6: 0, 7: 0}, 
 'foo_3': {0: 0, 1: 0, 2: 1, 3: 0, 4: 0, 5: 0, 6: 1, 7: 0}, 
 'foo_4': {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0}}

以上第一段是 dput 的输出,第二段是 to_dict 的输出,两者都对应上面的 du;但 dput 保留了列类型。例如,在上述测试案例中,

du.equals(pd.DataFrame(du.to_dict()))
==> False

因为 du.dtypes 是 uint8,而 pd.DataFrame(du.to_dict()).dtypes 是 int64。

Here is my version of dput – the standard R tool to produce reproducible reports – for Pandas DataFrames. It will probably fail for more complex frames, but it seems to do the job in simple cases:

import pandas as pd
def dput(x):
    if isinstance(x,pd.Series):
        return "pd.Series(%s,dtype='%s',index=pd.%s)" % (list(x),x.dtype,x.index)
    if isinstance(x,pd.DataFrame):
        return "pd.DataFrame({" + ", ".join([
            "'%s': %s" % (c,dput(x[c])) for c in x.columns]) + (
                "}, index=pd.%s)" % (x.index))
    raise NotImplementedError("dput",type(x),x)

now,

df = pd.DataFrame({'a':[1,2,3,4,2,1,3,1]})
assert df.equals(eval(dput(df)))
du = pd.get_dummies(df.a,"foo")
assert du.equals(eval(dput(du)))
di = df
di.index = list('abcdefgh')
assert di.equals(eval(dput(di)))

Note that this produces a much more verbose output than DataFrame.to_dict, e.g.,

pd.DataFrame({
  'foo_1':pd.Series([1, 0, 0, 0, 0, 1, 0, 1],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1)),
  'foo_2':pd.Series([0, 1, 0, 0, 1, 0, 0, 0],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1)),
  'foo_3':pd.Series([0, 0, 1, 0, 0, 0, 1, 0],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1)),
  'foo_4':pd.Series([0, 0, 0, 1, 0, 0, 0, 0],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1))},
  index=pd.RangeIndex(start=0, stop=8, step=1))

vs

{'foo_1': {0: 1, 1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 0, 7: 1}, 
 'foo_2': {0: 0, 1: 1, 2: 0, 3: 0, 4: 1, 5: 0, 6: 0, 7: 0}, 
 'foo_3': {0: 0, 1: 0, 2: 1, 3: 0, 4: 0, 5: 0, 6: 1, 7: 0}, 
 'foo_4': {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0}}

for du above, but it preserves column types. E.g., in the above test case,

du.equals(pd.DataFrame(du.to_dict()))
==> False

because du.dtypes is uint8 and pd.DataFrame(du.to_dict()).dtypes is int64.
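
One way to work around that dtype loss when sharing data via to_dict is to restate the dtypes explicitly after rebuilding — a sketch, assuming the du from the example above:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 2, 1, 3, 1]})
du = pd.get_dummies(df.a, "foo")

# rebuild from the dict dump, then reapply the original column dtypes
rebuilt = pd.DataFrame(du.to_dict()).astype(du.dtypes.to_dict())
assert rebuilt.equals(du)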