Tag Archives: pandas

Pandas selecting by label sometimes returns a Series, sometimes a DataFrame

Question: Pandas selecting by label sometimes returns a Series, sometimes a DataFrame

In Pandas, when I select a label that has only one entry in the index I get back a Series, but when I select a label that has more than one entry I get back a DataFrame.

Why is that? Is there a way to ensure I always get back a DataFrame?

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

In [3]: type(df.loc[3])
Out[3]: pandas.core.frame.DataFrame

In [4]: type(df.loc[1])
Out[4]: pandas.core.series.Series

Answer 0

Granted, the behavior is inconsistent, but I think it's easy to imagine cases where this is convenient. Anyway, to get a DataFrame every time, just pass a list to loc. There are other ways, but in my opinion this is the cleanest.

In [2]: type(df.loc[[3]])
Out[2]: pandas.core.frame.DataFrame

In [3]: type(df.loc[[1]])
Out[3]: pandas.core.frame.DataFrame

Answer 1

You have an index with three entries labeled 3. For this reason df.loc[3] will return a dataframe.

The reason is that you don't specify the column. So df.loc[3] selects the three matching rows of all columns (which is column 0), while df.loc[3,0] will return a Series. E.g. df.loc[1:2] also returns a dataframe, because you slice the rows.

Selecting a single row (as df.loc[1]) returns a Series with the column names as the index.

If you want to be sure to always have a DataFrame, you can slice like df.loc[1:1]. Another option is boolean indexing (df.loc[df.index==1]) or the take method (df.take([0]), but this uses positions, not labels!); a sketch comparing these follows.
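A minimal sketch comparing these DataFrame-preserving options on the question's frame:

import pandas as pd

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

print(type(df.loc[1:1]))            # slicing rows -> DataFrame
print(type(df.loc[df.index == 1]))  # boolean indexing -> DataFrame
print(type(df.take([0])))           # take: positional, not label-based -> DataFrame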


Answer 2

The TLDR

When using loc:

df.loc[:] = DataFrame

df.loc[int] = Series when the label is unique (with the column names as its index), and DataFrame when the label occurs more than once in the index

df.loc[:, ["col_name"]] = DataFrame

df.loc[:, "col_name"] = Series

Not using loc:

df["col_name"] = Series

df[["col_name"]] = DataFrame
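A quick sketch that checks these rules on a small made-up two-column frame:

import pandas as pd

df = pd.DataFrame({"col_name": [1, 2, 3], "other": [4, 5, 6]})

print(type(df.loc[:]))                # DataFrame
print(type(df.loc[0]))                # Series (unique label; column names become the index)
print(type(df.loc[:, ["col_name"]]))  # DataFrame
print(type(df.loc[:, "col_name"]))    # Series
print(type(df["col_name"]))           # Series
print(type(df[["col_name"]]))         # DataFrame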


Answer 3

Use df['columnName'] to get a Series and df[['columnName']] to get a DataFrame.


Answer 4

You wrote in a comment to joris’ answer:

“I don’t understand the design decision for single rows to get converted into a series – why not a data frame with one row?”

A single row isn't converted into a Series.
It IS a Series: no, I don't think so, in fact; see the edit below.

The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Panel is a container for DataFrame objects. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.

http://pandas.pydata.org/pandas-docs/stable/overview.html#why-more-than-1-data-structure

The data model of Pandas objects has been chosen like that. The reason certainly lies in the fact that it ensures some advantages I don't know about (I don't fully understand the last sentence of the citation; maybe that's the reason).

Edit: I don't agree with myself

A DataFrame can't be composed of elements that would be Series, because the following code gives the same type, Series, for a row as well as for a column:

import pandas as pd

df = pd.DataFrame(data=[11,12,13], index=[2, 3, 3])

print '-------- df -------------'
print df

print '\n------- df.loc[2] --------'
print df.loc[2]
print 'type(df.loc[2]) : ',type(df.loc[2])

print '\n--------- df[0] ----------'
print df[0]
print 'type(df[0]) : ',type(df[0])

result

-------- df -------------
    0
2  11
3  12
3  13

------- df.loc[2] --------
0    11
Name: 2, dtype: int64
type(df.loc[2]) :  <class 'pandas.core.series.Series'>

--------- df[0] ----------
2    11
3    12
3    13
Name: 0, dtype: int64
type(df[0]) :  <class 'pandas.core.series.Series'>

So, it makes no sense to pretend that a DataFrame is composed of Series, because what would those Series be supposed to be: columns or rows? The question itself is ill-posed, and so is that vision.

Then what is a DataFrame ?

In the previous version of this answer, I asked this question, trying to find the answer to the Why is that? part of the OP's question and to the similar interrogation single rows to get converted into a series - why not a data frame with one row? in one of his comments, while the Is there a way to ensure I always get back a data frame? part was answered by Dan Allan.

Then, as the Pandas docs cited above say that the pandas data structures are best seen as containers of lower dimensional data, it seemed to me that the understanding of the why would be found in the characteristics of the nature of DataFrame structures.

However, I realized that this cited advice must not be taken as a precise description of the nature of Pandas' data structures.
This advice doesn't mean that a DataFrame is a container of Series.
It expresses that the mental representation of a DataFrame as a container of Series (either rows or columns, according to the option considered at a given moment of reasoning) is a good way to think of DataFrames, even if it isn't strictly the case in reality. "Good" meaning that this vision makes it possible to use DataFrames efficiently. That's all.

Then what is a DataFrame object ?

The DataFrame class produces instances that have a particular structure originating in the NDFrame base class, itself derived from the PandasContainer base class, which is also a parent class of the Series class.
Note that this is correct for Pandas up to version 0.12. In the upcoming version 0.13, Series will also derive from the NDFrame class only.

# with pandas 0.12

from pandas import Series
print 'Series  :\n',Series
print 'Series.__bases__  :\n',Series.__bases__

from pandas import DataFrame
print '\nDataFrame  :\n',DataFrame
print 'DataFrame.__bases__  :\n',DataFrame.__bases__

print '\n-------------------'

from pandas.core.generic import NDFrame
print '\nNDFrame.__bases__  :\n',NDFrame.__bases__

from pandas.core.generic import PandasContainer
print '\nPandasContainer.__bases__  :\n',PandasContainer.__bases__

from pandas.core.base import PandasObject
print '\nPandasObject.__bases__  :\n',PandasObject.__bases__

from pandas.core.base import StringMixin
print '\nStringMixin.__bases__  :\n',StringMixin.__bases__

result

Series  :
<class 'pandas.core.series.Series'>
Series.__bases__  :
(<class 'pandas.core.generic.PandasContainer'>, <type 'numpy.ndarray'>)

DataFrame  :
<class 'pandas.core.frame.DataFrame'>
DataFrame.__bases__  :
(<class 'pandas.core.generic.NDFrame'>,)

-------------------

NDFrame.__bases__  :
(<class 'pandas.core.generic.PandasContainer'>,)

PandasContainer.__bases__  :
(<class 'pandas.core.base.PandasObject'>,)

PandasObject.__bases__  :
(<class 'pandas.core.base.StringMixin'>,)

StringMixin.__bases__  :
(<type 'object'>,)

So my understanding is now that a DataFrame instance has certain methods that have been crafted in order to control the way data are extracted from rows and columns.

The way these extraction methods work is described on this page: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing
There we find the method given by Dan Allan, among others.

Why have these extraction methods been crafted the way they are?
Certainly because they were appraised as the ones giving the best possibilities and ease in data analysis.
It's precisely what is expressed in this sentence:

The best way to think about the pandas data structures is as flexible containers for lower dimensional data.

The why of the extraction of data from a DataFrame instance doesn't lie in its structure; it lies in the why of this structure. I guess that the structure and functioning of the Pandas data structures have been chiseled in order to be as intellectually intuitive as possible, and that to understand the details, one must read Wes McKinney's blog.


Answer 5

If the objective is to get a subset of the data set using the index, it is best to avoid using loc or iloc. Instead, you should use syntax similar to this:

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
result = df[df.index == 3] 
isinstance(result, pd.DataFrame) # True

result = df[df.index == 1]
isinstance(result, pd.DataFrame) # True

Answer 6

If you also select on the index of the dataframe, then the result can be either a DataFrame or a Series, or it can be a scalar (single value).

This function ensures that you always get a list from your selection (if the df, index and column are valid):

def get_list_from_df_column(df, index, column):
    df_or_series = df.loc[index,[column]] 
    # df.loc[index,column] is also possible and returns a series or a scalar
    if isinstance(df_or_series, pd.Series):
        resulting_list = df_or_series.tolist() #get list from series
    else:
        resulting_list = df_or_series[column].tolist() 
        # use the column key to get a series from the dataframe
    return resulting_list
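A small usage sketch for the function above, on a made-up frame with a duplicated index label:

import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30]}, index=[1, 3, 3])

print(get_list_from_df_column(df, 1, "value"))  # [10]      -> Series branch (single row)
print(get_list_from_df_column(df, 3, "value"))  # [20, 30]  -> DataFrame branch (duplicated label)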

How do I replace text in a column of a pandas dataframe?

Question: How do I replace text in a column of a pandas dataframe?

I have a column in my dataframe like this:

range
"(2,30)"
"(50,290)"
"(400,1000)"
... 

and I want to replace the comma , with a dash -. I'm currently using this method, but nothing changes.

org_info_exc['range'].replace(',', '-', inplace=True)

Can anybody help?


Answer 0

Use the vectorised str method replace:

In [30]:

df['range'] = df['range'].str.replace(',','-')
df
Out[30]:
      range
0    (2-30)
1  (50-290)

EDIT

So if we look at what you tried and why it didn’t work:

df['range'].replace(',','-',inplace=True)

From the docs we see this description:

str or regex: str: string exactly matching to_replace will be replaced with value

So because the string values do not match exactly, no replacement occurs; compare with the following:

In [43]:

df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
Out[43]:
0    (2,30)
1         -
Name: range, dtype: object

here we get an exact match on the second row and the replacement occurs.
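As an aside, Series.replace can also do substring substitution if you pass regex=True, which switches it from exact matching to regex replacement; a minimal sketch with the same df:

df['range'] = df['range'].replace(',', '-', regex=True)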


Answer 1

For anyone else arriving here from a Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column): Pandas has a built-in replace method available on a dataframe object.

df.replace(',', '-', regex=True)

Source: Docs


Answer 2

Replace all spaces with underscores in the column names:

data.columns = data.columns.str.replace(' ', '_', regex=True)

Answer 3

In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:

import re

chars_to_remove = ['.', '-', '(', ')']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'

df['string_col'] = df['string_col'].str.replace(regular_expression, '', regex=True)

Add a column with the number of days between dates in a pandas DataFrame

Question: Add a column with the number of days between dates in a pandas DataFrame

I want to subtract dates in ‘A’ from dates in ‘B’ and add a new column with the difference.

df
          A        B
one 2014-01-01  2014-02-28 
two 2014-02-03  2014-03-01

I’ve tried the following, but get an error when I try to include this in a for loop…

import datetime
date1=df['A'][0]
date2=df['B'][0]
mdate1 = datetime.datetime.strptime(date1, "%Y-%m-%d").date()
rdate1 = datetime.datetime.strptime(date2, "%Y-%m-%d").date()
delta =  (mdate1 - rdate1).days
print delta

What should I do?


Answer 0

Assuming these are datetime columns (if they're not, apply to_datetime), you can just subtract them:

df['A'] = pd.to_datetime(df['A'])
df['B'] = pd.to_datetime(df['B'])

In [11]: df.dtypes  # if already datetime64 you don't need to use to_datetime
Out[11]:
A    datetime64[ns]
B    datetime64[ns]
dtype: object

In [12]: df['A'] - df['B']
Out[12]:
one   -58 days
two   -26 days
dtype: timedelta64[ns]

In [13]: df['C'] = df['A'] - df['B']

In [14]: df
Out[14]:
             A          B        C
one 2014-01-01 2014-02-28 -58 days
two 2014-02-03 2014-03-01 -26 days

Note: ensure you're using a recent version of pandas (e.g. 0.13.1); this may not work in older versions.


Answer 1

To remove the 'days' text element, you can also make use of the .dt accessor for Series: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.html

So,

df[['A','B']] = df[['A','B']].apply(pd.to_datetime) #if conversion required
df['C'] = (df['B'] - df['A']).dt.days

which returns:

             A          B   C
one 2014-01-01 2014-02-28  58
two 2014-02-03 2014-03-01  26

Answer 2

A list comprehension is your best bet for the most Pythonic (and fastest) way to do this:

[int(i.days) for i in (df.B - df.A)]
  1. i will return the timedelta (e.g. '-58 days')
  2. i.days will return this value as a long integer (e.g. -58L)
  3. int(i.days) will give you the -58 you seek.

If your columns aren't in datetime format, convert them first; the short syntax is: df.A = pd.to_datetime(df.A)


Answer 3

How about this:

times['days_since'] = max(list(df.index.values))  
times['days_since'] = times['days_since'] - times['months']  
times

What is the difference between pandas.qcut and pandas.cut?

Question: What is the difference between pandas.qcut and pandas.cut?

The documentation says:

http://pandas.pydata.org/pandas-docs/dev/basics.html

“Continuous values can be discretized using the cut (bins based on values) and qcut (bins based on sample quantiles) functions”

Sounds very abstract to me… I can see the differences in the example below but what does qcut (sample quantile) actually do/mean? When would you use qcut versus cut?

Thanks.

factors = np.random.randn(30)

In [11]:
pd.cut(factors, 5)
Out[11]:
[(-0.411, 0.575], (-0.411, 0.575], (-0.411, 0.575], (-0.411, 0.575], (0.575, 1.561], ..., (-0.411, 0.575], (-1.397, -0.411], (0.575, 1.561], (-2.388, -1.397], (-0.411, 0.575]]
Length: 30
Categories (5, object): [(-2.388, -1.397] < (-1.397, -0.411] < (-0.411, 0.575] < (0.575, 1.561] < (1.561, 2.547]]

In [14]:
pd.qcut(factors, 5)
Out[14]:
[(-0.348, 0.0899], (-0.348, 0.0899], (0.0899, 1.19], (0.0899, 1.19], (0.0899, 1.19], ..., (0.0899, 1.19], (-1.137, -0.348], (1.19, 2.547], [-2.383, -1.137], (-0.348, 0.0899]]
Length: 30
Categories (5, object): [[-2.383, -1.137] < (-1.137, -0.348] < (-0.348, 0.0899] < (0.0899, 1.19] < (1.19, 2.547]]

Answer 0

To begin, note that quantiles is just the most general term for things like percentiles, quartiles, and medians. You specified five bins in your example, so you are asking qcut for quintiles.

So, when you ask for quintiles with qcut, the bins will be chosen so that you have the same number of records in each bin. You have 30 records, so you should have 6 in each bin (your output should look like this, although the breakpoints will differ due to the random draw):

pd.qcut(factors, 5).value_counts()

[-2.578, -0.829]    6
(-0.829, -0.36]     6
(-0.36, 0.366]      6
(0.366, 0.868]      6
(0.868, 2.617]      6

Conversely, for cut you will see something more uneven:

pd.cut(factors, 5).value_counts()

(-2.583, -1.539]    5
(-1.539, -0.5]      5
(-0.5, 0.539]       9
(0.539, 1.578]      9
(1.578, 2.617]      2

That’s because cut will choose the bins to be evenly spaced according to the values themselves and not the frequency of those values. Hence, because you drew from a random normal, you’ll see higher frequencies in the inner bins and fewer in the outer. This is essentially going to be a tabular form of a histogram (which you would expect to be fairly bell shaped with 30 records).


Answer 1
  • The cut command creates equispaced bins, but the frequency of samples is unequal in each bin.
  • The qcut command creates unequal-size bins, but the frequency of samples is equal in each bin.


    >>> x=np.array([24,  7,  2, 25, 22, 29])
    >>> x
    array([24,  7,  2, 25, 22, 29])

    >>> pd.cut(x,3).value_counts()  # bins have an equal width of 9
    (2, 11.0]        2
    (11.0, 20.0]     0
    (20.0, 29.0]     4

    >>> pd.qcut(x,3).value_counts()  # equal frequency of 2 in each bin
    (1.999, 17.0]     2
    (17.0, 24.333]    2
    (24.333, 29.0]    2

Answer 2

So qcut ensures an equal number of values in each bin, even if the values cluster in the sample space. This means you are less likely to have one bin full of data with very close values and another bin with 0 values. In general, it's better sampling.


Answer 3

pd.qcut divides the sorted elements of an array into bins so that each bin receives roughly (number of elements) / (number of bins) elements, filling the bins serially with equal counts.

pd.cut divides the value range (max - min) into (number of bins) equal-width intervals and then assigns each element to the interval its value falls in.
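A small sketch to inspect the bin edges each function computes; retbins=True returns the edges alongside the binned data:

import numpy as np
import pandas as pd

x = np.array([24, 7, 2, 25, 22, 29])

_, cut_edges = pd.cut(x, 3, retbins=True)
_, qcut_edges = pd.qcut(x, 3, retbins=True)

print(cut_edges)   # equal-width edges spanning min..max, roughly [1.97, 11., 20., 29.]
print(qcut_edges)  # quantile edges: unequal widths, equal counts per bin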


How can I filter lines on load in the Pandas read_csv function?

Question: How can I filter lines on load in the Pandas read_csv function?

How can I filter which lines of a CSV to be loaded into memory using pandas? This seems like an option that one should find in read_csv. Am I missing something?

Example: we have a CSV with a timestamp column, and we'd like to load just the lines with a timestamp greater than a given constant.


Answer 0

There isn’t an option to filter the rows before the CSV file is loaded into a pandas object.

You can either load the file and then filter using df[df['field'] > constant], or, if you have a very large file and are worried about running out of memory, use an iterator and apply the filter as you concatenate chunks of your file, e.g.:

import pandas as pd
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])

You can vary the chunksize to suit your available memory. See here for more details.


Answer 1

I didn't find a straightforward way to do it within the context of read_csv. However, read_csv returns a DataFrame, which can be filtered by selecting rows by boolean vector df[bool_vec]:

filtered = df[(df['timestamp'] > targettime)]

This is selecting all rows in df (assuming df is any DataFrame, such as the result of a read_csv call, that at least contains a datetime column timestamp) for which the values in the timestamp column are greater than the value of targettime. Similar question.


Answer 2

If the filtered range is contiguous (as it usually is with time(stamp) filters), then the fastest solution is to hard-code the range of rows. Simply combine skiprows=range(1, start_row) with the nrows parameter. The import then takes seconds where the accepted solution would take minutes. A few experiments to find the initial start_row are not a huge cost given the savings on import times. Notice we keep the header row by starting the range at 1; a sketch follows.
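Assuming hypothetical start_row and end_row values found by experiment:

import pandas as pd

start_row, end_row = 1_000_000, 2_000_000  # hypothetical boundaries of the wanted block

df = pd.read_csv(
    'file.csv',
    skiprows=range(1, start_row),  # keep row 0 (the header), skip everything up to the block
    nrows=end_row - start_row,     # read only the contiguous block
)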


Answer 3

If you are on Linux, you can use grep.

# works on either Python 2 or Python 3
import subprocess
import pandas as pd
from time import time  # not needed, just for timing
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO


def zgrep_data(f, string):
    '''grep multiple items f is filepath, string is what you are filtering for'''

    grep = 'grep' # change to zgrep for gzipped files
    print('{} for {} from {}'.format(grep,string,f))
    start_time = time()
    if string == '':
        out = subprocess.check_output([grep, string, f])
        grep_data = StringIO(out)
        data = pd.read_csv(grep_data, sep=',', header=0)

    else:
        # read only the first row to get the columns. May need to change depending on 
        # how the data is stored
        columns = pd.read_csv(f, sep=',', nrows=1, header=None).values.tolist()[0]    

        out = subprocess.check_output([grep, string, f])
        grep_data = StringIO(out)

        data = pd.read_csv(grep_data, sep=',', names=columns, header=None)

    print('{} finished for {} - {} seconds'.format(grep,f,time()-start_time))
    return data

Answer 4

You can specify the nrows parameter.

import pandas as pd
df = pd.read_csv('file.csv', nrows=100)

This code works well in version 0.20.3.


Python: Convert timedelta to int in a dataframe

Question: Python: Convert timedelta to int in a dataframe

I would like to create a column in a pandas data frame that is an integer representation of the number of days in a timedelta column. Is it possible to use ‘datetime.days’ or do I need to do something more manual?

timedelta column

7 days, 23:29:00

day integer column

7


Answer 0

Use the dt.days attribute. Access this attribute via:

timedelta_series.dt.days

You can also get the seconds and microseconds attributes in the same way.
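A minimal sketch on the question's example value:

import pandas as pd

td = pd.Series([pd.Timedelta('7 days 23:29:00')])

print(td.dt.days)     # 0    7
print(td.dt.seconds)  # 0    84540  (the 23:29:00 part, in seconds)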


Answer 1

You could do this, where td is your series of timedeltas. The division converts the nanosecond deltas into day deltas, and the conversion to int truncates to whole days.

import numpy as np

(td / np.timedelta64(1, 'D')).astype(int)

Answer 2

Timedelta objects have read-only instance attributes .days, .seconds, and .microseconds.


Answer 3

If the question isn't just "how to access an integer form of the timedelta?" but "how to convert the timedelta column in the dataframe to an int?", the answer might be a little different. In addition to the .dt.days accessor, you need either df.astype or pd.to_numeric.

Either of these options should help:

df['tdColumn'] = pd.to_numeric(df['tdColumn'].dt.days, downcast='integer')

or

df['tdColumn'] = df['tdColumn'].dt.days.astype('int16')

python pandas dataframe to dictionary

Question: python pandas dataframe to dictionary

I have a two-column dataframe and intend to convert it to a Python dictionary: the first column will be the keys and the second the values. Thank you in advance.

Dataframe:

    id    value
0    0     10.2
1    1      5.7
2    2      7.4

Answer 0

See the docs for to_dict. You can use it like this:

df.set_index('id').to_dict()

And if you have only one value column, to avoid the column name also becoming a level in the dict (in this case you actually use Series.to_dict()):

df.set_index('id')['value'].to_dict()
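Applied to the question's frame, a quick sketch of both variants:

import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2], 'value': [10.2, 5.7, 7.4]})

print(df.set_index('id')['value'].to_dict())
# {0: 10.2, 1: 5.7, 2: 7.4}

print(df.set_index('id').to_dict())
# {'value': {0: 10.2, 1: 5.7, 2: 7.4}}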

Answer 1

mydict = dict(zip(df.id, df.value))

Answer 2

If you want a simple way to preserve duplicates, you could use groupby:

>>> ptest = pd.DataFrame([['a',1],['a',2],['b',3]], columns=['id', 'value']) 
>>> ptest
  id  value
0  a      1
1  a      2
2  b      3
>>> {k: g["value"].tolist() for k,g in ptest.groupby("id")}
{'a': [1, 2], 'b': [3]}

Answer 3

The answers by joris in this thread and by punchagan in the duplicated thread are very elegant, however they will not give correct results if the column used for the keys contains any duplicated value.

For example:

>>> ptest = pd.DataFrame([['a',1],['a',2],['b',3]], columns=['id', 'value']) 
>>> ptest
  id  value
0  a      1
1  a      2
2  b      3

# note that in both cases the association a->1 is lost:
>>> ptest.set_index('id')['value'].to_dict()
{'a': 2, 'b': 3}
>>> dict(zip(ptest.id, ptest.value))
{'a': 2, 'b': 3}

If you have duplicated entries and do not want to lose them, you can use this ugly but working code:

>>> mydict = {}
>>> for x in range(len(ptest)):
...     currentid = ptest.iloc[x,0]
...     currentvalue = ptest.iloc[x,1]
...     mydict.setdefault(currentid, [])
...     mydict[currentid].append(currentvalue)
>>> mydict
{'a': [1, 2], 'b': [3]}

Answer 4

Simplest solution:

df.set_index('id').T.to_dict('records')

Example:

df= pd.DataFrame([['a',1],['a',2],['b',3]], columns=['id','value'])
df.set_index('id').T.to_dict('records')

If you have multiple value columns, like val1, val2, val3, etc., and you want them as lists, then use the code below:

df.set_index('id').T.to_dict('list')

Answer 5

in some versions the code below might not work

mydict = dict(zip(df.id, df.value))

so make it explicit

id_=df.id.values
value=df.value.values
mydict=dict(zip(id_,value))

Note: I used id_ because the word id shadows a Python built-in.


Answer 6

You can use a dict comprehension:

my_dict = {row[0]: row[1] for row in df.values}

Answer 7

Another (slightly shorter) solution for not losing duplicate entries:

>>> ptest = pd.DataFrame([['a',1],['a',2],['b',3]], columns=['id','value'])
>>> ptest
  id  value
0  a      1
1  a      2
2  b      3

>>> pdict = dict()
>>> for i in ptest['id'].unique().tolist():
...     ptest_slice = ptest[ptest['id'] == i]
...     pdict[i] = ptest_slice['value'].tolist()
...

>>> pdict
{'b': [3], 'a': [1, 2]}

Answer 8

You need a list as a dictionary value. This code will do the trick.

from collections import defaultdict
mydict = defaultdict(list)
for k, v in zip(df.id.values,df.value.values):
    mydict[k].append(v)
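A quick usage sketch with the duplicated-key frame from the earlier answers:

import pandas as pd
from collections import defaultdict

df = pd.DataFrame([['a', 1], ['a', 2], ['b', 3]], columns=['id', 'value'])

mydict = defaultdict(list)
for k, v in zip(df.id.values, df.value.values):
    mydict[k].append(v)

print(dict(mydict))  # {'a': [1, 2], 'b': [3]}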

Answer 9

I found this question while trying to make a dictionary out of three columns of a pandas dataframe. In my case the dataframe has columns A, B and C (let’s say A and B are the geographical coordinates of longitude and latitude and C the country region/state/etc, which is more or less the case).

I wanted a dictionary with each pair of A,B values (dictionary key) matching the value of C (dictionary value) in the corresponding row (each pair of A,B values is guaranteed to be unique due to previous filtering, but it is possible to have the same value of C for different pairs of A,B values in this context), so I did:

mydict = dict(zip(zip(df['A'],df['B']), df['C']))

Using pandas to_dict() also works:

mydict = df.set_index(['A','B']).to_dict(orient='dict')['C']

(none of the columns A or B were used as index before executing the line creating the dictionary)

Both approaches are fast (less than one second on a dataframe with 85k rows, on a 5-year-old fast dual-core laptop).

The reasons I’m posting this:

  1. for those who need this kind of solution
  2. if someone knows a faster executing solution (e.g., for millions of rows), I’d appreciate a reply.

Answer 10

This is my solution: a basic loop that collects, for each unique key, the list of values.

def get_dict_from_pd(df, key_col, row_col):
    result = dict()
    for i in set(df[key_col].values):
        is_i = df[key_col] == i
        result[i] = list(df[is_i][row_col].values)
    return result


Answer 11

This is my solution:

import pandas as pd
df = pd.read_excel('dic.xlsx')
df_T = df.set_index('id').T
dic = df_T.to_dict('records')
print(dic)

Convert a row of a Pandas DataFrame into column headers

Question: Convert a row of a Pandas DataFrame into column headers

The data I have to work with is a bit messy. It has header names inside its data. How can I choose a row from an existing pandas dataframe and make it (rename it to) a column header?

I want to do something like:

header = df[df['old_header_name1'] == 'new_header_name1']

df.columns = header

Answer 0

In [21]: df = pd.DataFrame([(1,2,3), ('foo','bar','baz'), (4,5,6)])

In [22]: df
Out[22]: 
     0    1    2
0    1    2    3
1  foo  bar  baz
2    4    5    6

Set the column labels to equal the values in the 2nd row (index location 1):

In [23]: df.columns = df.iloc[1]

If the index has unique labels, you can drop the 2nd row using:

In [24]: df.drop(df.index[1])
Out[24]: 
1 foo bar baz
0   1   2   3
2   4   5   6

If the index is not unique, you could use:

In [133]: df.iloc[pd.RangeIndex(len(df)).drop(1)]
Out[133]: 
1 foo bar baz
0   1   2   3
2   4   5   6

Using df.drop(df.index[1]) removes all rows with the same label as the second row. Because non-unique indexes can lead to stumbling blocks (or potential bugs) like this, it’s often better to take care that the index is unique (even though Pandas does not require it).


Answer 1

This works (pandas v0.19.2):

df.rename(columns=df.iloc[0])
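Note that rename only relabels the columns; the header row itself remains in the data. A sketch that also drops it (assuming a unique index):

df = df.rename(columns=df.iloc[0]).drop(df.index[0])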

Answer 2

It would be easier to recreate the data frame. This would also interpret the column types from scratch.

headers = df.iloc[0]
new_df  = pd.DataFrame(df.values[1:], columns=headers)

Answer 3

You can specify the row index in the read_csv or read_html constructors via the header parameter, which represents "Row number(s) to use as the column names, and the start of the data". This has the advantage of automatically dropping all the preceding rows, which supposedly are junk.

import pandas as pd
from io import StringIO

In[1]
    csv = '''junk1, junk2, junk3, junk4, junk5
    junk1, junk2, junk3, junk4, junk5
    pears, apples, lemons, plums, other
    40, 50, 61, 72, 85
    '''

    df = pd.read_csv(StringIO(csv), header=2)
    print(df)

Out[1]
       pears   apples   lemons   plums   other
    0     40       50       61      72      85

How do I release memory used by a pandas DataFrame?

Question: How do I release memory used by a pandas DataFrame?

I have a really large csv file that I opened in pandas as follows….

import pandas
df = pandas.read_csv('large_txt_file.txt')

Once I do this my memory usage increases by 2GB, which is expected because this file contains millions of rows. My problem comes when I need to release this memory. I ran….

del df

However, my memory usage did not drop. Is this the wrong approach to release memory used by a pandas data frame? If it is, what is the proper way?


Answer 0

Reducing memory usage in Python is difficult, because Python does not actually release memory back to the operating system. If you delete objects, then the memory is available to new Python objects, but not free()‘d back to the system (see this question).

If you stick to numeric numpy arrays, those are freed, but boxed objects are not.

>>> import os, psutil, numpy as np
>>> def usage():
...     process = psutil.Process(os.getpid())
...     return process.get_memory_info()[0] / float(2 ** 20)
... 
>>> usage() # initial memory usage
27.5 

>>> arr = np.arange(10 ** 8) # create a large array without boxing
>>> usage()
790.46875
>>> del arr
>>> usage()
27.52734375 # numpy just free()'d the array

>>> arr = np.arange(10 ** 8, dtype='O') # create lots of objects
>>> usage()
3135.109375
>>> del arr
>>> usage()
2372.16796875  # numpy frees the array, but python keeps the heap big

Reducing the Number of Dataframes

Python keeps our memory at its high watermark, but we can reduce the total number of dataframes we create. When modifying your dataframe, prefer inplace=True so you don't create copies.

Another common gotcha is holding on to copies of previously created dataframes in ipython:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'foo': [1,2,3,4]})

In [3]: df + 1
Out[3]: 
   foo
0    2
1    3
2    4
3    5

In [4]: df + 2
Out[4]: 
   foo
0    3
1    4
2    5
3    6

In [5]: Out # Still has all our temporary DataFrame objects!
Out[5]: 
{3:    foo
 0    2
 1    3
 2    4
 3    5, 4:    foo
 0    3
 1    4
 2    5
 3    6}

You can fix this by typing %reset Out to clear your history. Alternatively, you can adjust how much history ipython keeps with ipython --cache-size=5 (default is 1000).

Reducing Dataframe Size

Wherever possible, avoid using object dtypes.

>>> df.dtypes
foo    float64 # 8 bytes per value
bar      int64 # 8 bytes per value
baz     object # at least 48 bytes per value, often more

Values with an object dtype are boxed, which means the numpy array just contains a pointer and you have a full Python object on the heap for every value in your dataframe. This includes strings.

Whilst numpy supports fixed-size strings in arrays, pandas does not (it has caused user confusion). This can make a significant difference:

>>> import numpy as np
>>> arr = np.array(['foo', 'bar', 'baz'])
>>> arr.dtype
dtype('S3')
>>> arr.nbytes
9

>>> import sys; import pandas as pd
>>> s = pd.Series(['foo', 'bar', 'baz'])
>>> s.dtype
dtype('O')
>>> sum(sys.getsizeof(x) for x in s)
120

You may want to avoid using string columns, or find a way of representing string data as numbers.

If you have a dataframe that contains many repeated values (NaN is very common), then you can use a sparse data structure to reduce memory usage:

>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo    float64
dtypes: float64(1)
memory usage: 605.5 MB

>>> df1.shape
(39681584, 1)

>>> df1.foo.isnull().sum() * 100. / len(df1)
20.628483479893344 # so 20% of values are NaN

>>> df1.to_sparse().info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo    float64
dtypes: float64(1)
memory usage: 543.0 MB

Viewing Memory Usage

You can view the memory usage (docs):

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 14 columns):
...
dtypes: datetime64[ns](1), float64(8), int64(1), object(4)
memory usage: 4.4+ GB

As of pandas 0.17.1, you can also do df.info(memory_usage='deep') to see memory usage including objects.


Answer 1

As noted in the comments, there are some things to try: gc.collect (@EdChum) may clear stuff, for example. At least from my experience, these things sometimes work and often don’t.

There is one thing that always works, however, because it is done at the OS, not language, level.

Suppose you have a function that creates an intermediate huge DataFrame, and returns a smaller result (which might also be a DataFrame):

def huge_intermediate_calc(something):
    ...
    huge_df = pd.DataFrame(...)
    ...
    return some_aggregate

Then if you do something like

import multiprocessing

result = multiprocessing.Pool(1).map(huge_intermediate_calc, [something_])[0]

Then the function is executed in a different process. When that process completes, the OS reclaims all the resources it used. There's really nothing Python, pandas, or the garbage collector can do to stop that.


Answer 2

This solves the problem of releasing the memory for me!!!

import gc
import pandas as pd

del [[df_1,df_2]]
gc.collect()
df_1=pd.DataFrame()
df_2=pd.DataFrame()

The dataframes are explicitly set to empty in the statements above.

First, all references to the dataframes are deleted, so they are no longer reachable from Python; then the garbage collector is invoked explicitly (gc.collect()); and finally the names are rebound to empty dataframes.

How the garbage collector works is explained well at https://stackify.com/python-garbage-collection/


Answer 3

df will not be deleted by del df if there are any other references to it at the time of deletion. So you need to delete all references to it for the memory to be released.

So all the names bound to df should be deleted to trigger garbage collection.

Use objgraph to check what is holding onto the objects.
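A minimal sketch of such a check, assuming the third-party objgraph package is installed (pip install objgraph) and Graphviz is available to render the graph:

import objgraph
import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3]})
extra_ref = df  # a second reference that would keep the frame alive after `del df`

# render the chain of references that keeps the frame alive
objgraph.show_backrefs([df], max_depth=3, filename='df_backrefs.png')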


Answer 4

It seems there is an issue with glibc that affects the memory allocation in Pandas: https://github.com/pandas-dev/pandas/issues/2659

The monkey patch detailed on this issue has resolved the problem for me:

# monkeypatches.py

# Solving memory leak problem in pandas
# https://github.com/pandas-dev/pandas/issues/2659#issuecomment-12021083
import sys  # needed for the file=sys.stderr prints below
import pandas as pd
from ctypes import cdll, CDLL
try:
    cdll.LoadLibrary("libc.so.6")
    libc = CDLL("libc.so.6")
    libc.malloc_trim(0)
except (OSError, AttributeError):
    libc = None

__old_del = getattr(pd.DataFrame, '__del__', None)

def __new_del(self):
    if __old_del:
        __old_del(self)
    libc.malloc_trim(0)

if libc:
    print('Applying monkeypatch for pd.DataFrame.__del__', file=sys.stderr)
    pd.DataFrame.__del__ = __new_del
else:
    print('Skipping monkeypatch for pd.DataFrame.__del__: libc or malloc_trim() not found', file=sys.stderr)

Converting unix time into readable dates in a pandas dataframe

Question: Converting unix time into readable dates in a pandas dataframe

I have a dataframe with unix times and prices in it. I want to convert the index column so that it shows in human readable dates.

So for instance I have date as 1349633705 in the index column but I’d want it to show as 10/07/2012 (or at least 10/07/2012 18:15).

For some context, here is the code I’m working with and what I’ve tried already:

import json
import urllib2
from datetime import datetime
response = urllib2.urlopen('http://blockchain.info/charts/market-price?&format=json')
data = json.load(response)   
df = DataFrame(data['values'])
df.columns = ["date","price"]
#convert dates 
df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d"))
df.index = df.date   

As you can see I’m using df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d")) here which doesn’t work since I’m working with integers, not strings. I think I need to use datetime.date.fromtimestamp but I’m not quite sure how to apply this to the whole of df.date.

Thanks.


Answer 0

These appear to be seconds since epoch.

In [20]: df = DataFrame(data['values'])

In [21]: df.columns = ["date","price"]

In [22]: df
Out[22]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 358 entries, 0 to 357
Data columns (total 2 columns):
date     358  non-null values
price    358  non-null values
dtypes: float64(1), int64(1)

In [23]: df.head()
Out[23]: 
         date  price
0  1349720105  12.08
1  1349806505  12.35
2  1349892905  12.15
3  1349979305  12.19
4  1350065705  12.15
In [25]: df['date'] = pd.to_datetime(df['date'],unit='s')

In [26]: df.head()
Out[26]: 
                 date  price
0 2012-10-08 18:15:05  12.08
1 2012-10-09 18:15:05  12.35
2 2012-10-10 18:15:05  12.15
3 2012-10-11 18:15:05  12.19
4 2012-10-12 18:15:05  12.15

In [27]: df.dtypes
Out[27]: 
date     datetime64[ns]
price           float64
dtype: object

Answer 1

If you try using:

df[DATE_FIELD] = pd.to_datetime(df[DATE_FIELD], unit='s')

and receive an error:

“pandas.tslib.OutOfBoundsDatetime: cannot convert input with unit ‘s'”

This means the DATE_FIELD is not specified in seconds.

In my case, it was milliseconds since the epoch.

The conversion worked using below:

df[DATE_FIELD] = pd.to_datetime(df[DATE_FIELD], unit='ms')

Answer 2

Assuming we imported pandas as pd and df is our dataframe

pd.to_datetime(df['date'], unit='s')

works for me.
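For instance, a quick check on the timestamp from the question:

import pandas as pd

print(pd.to_datetime(1349633705, unit='s'))
# 2012-10-07 18:15:05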


Answer 3

Alternatively, by changing a line of the above code:

# df.date = df.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d"))
df.date = df.date.apply(lambda d: datetime.datetime.fromtimestamp(int(d)).strftime('%Y-%m-%d'))

It should also work.