Python 实用宝典

Question 1

When selecting a sub dataframe from a parent dataframe, I noticed that some programmers make a copy of the data frame using the .copy() method. For example,

X = my_dataframe[features_list].copy()

…instead of just

X = my_dataframe[features_list]

Why are they making a copy of the data frame? What will happen if I don’t make a copy?

Question 2

This expands on Paul’s answer. In Pandas, indexing a DataFrame returns a reference to the initial DataFrame. Thus, changing the subset will change the initial DataFrame. Thus, you’d want to use the copy if you want to make sure the initial DataFrame shouldn’t change. Consider the following code:

df = DataFrame({'x': [1,2]})
df_sub = df[0:1]
df_sub.x = -1
print(df)

You’ll get:

x
0 -1
1  2

In contrast, the following leaves df unchanged:

df_sub_copy = df[0:1].copy()
df_sub_copy.x = -1

Question 3

Because if you don’t make a copy then the indices can still be manipulated elsewhere even if you assign the dataFrame to a different name.

For example:

df2 = df
func1(df2)
func2(df)

func1 can modify df by modifying df2, so to avoid that:

df2 = df.copy()
func1(df2)
func2(df)

Question 4

It’s necessary to mention that returning copy or view depends on kind of indexing.

The pandas documentation says:

Returning a view versus a copy

The rules about when a view on the data is returned are entirely dependent on NumPy. Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy. With single label / scalar indexing and slicing, e.g. df.ix[3:6] or df.ix[:, ‘A’], a view will be returned.

Question 5

The primary purpose is to avoid chained indexing and eliminate the SettingWithCopyWarning.

Here chained indexing is something like dfc['A'][0] = 111

The document said chained indexing should be avoided in Returning a view versus a copy. Here is a slightly modified example from that document:

In [1]: import pandas as pd

In [2]: dfc = pd.DataFrame({'A':['aaa','bbb','ccc'],'B':[1,2,3]})

In [3]: dfc
Out[3]:
    A   B
0   aaa 1
1   bbb 2
2   ccc 3

In [4]: aColumn = dfc['A']

In [5]: aColumn[0] = 111
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

In [6]: dfc
Out[6]:
    A   B
0   111 1
1   bbb 2
2   ccc 3

Here the aColumn is a view and not a copy from the original DataFrame, so modifying aColumn will cause the original dfc be modified too. Next, if we index the row first:

In [7]: zero_row = dfc.loc[0]

In [8]: zero_row['A'] = 222
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

In [9]: dfc
Out[9]:
    A   B
0   111 1
1   bbb 2
2   ccc 3

This time zero_row is a copy, so the original dfc is not modified.

From these two examples above, we see it’s ambiguous whether or not you want to change the original DataFrame. This is especially dangerous if you write something like the following:

In [10]: dfc.loc[0]['A'] = 333
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

In [11]: dfc
Out[11]:
    A   B
0   111 1
1   bbb 2
2   ccc 3

This time it didn’t work at all. Here we wanted to change dfc, but we actually modified an intermediate value dfc.loc[0] that is a copy and is discarded immediately. It’s very hard to predict whether the intermediate value like dfc.loc[0] or dfc['A'] is a view or a copy, so it’s not guaranteed whether or not original DataFrame will be updated. That’s why chained indexing should be avoided, and pandas generates the SettingWithCopyWarning for this kind of chained indexing update.

Now is the use of .copy(). To eliminate the warning, make a copy to express your intention explicitly:

In [12]: zero_row_copy = dfc.loc[0].copy()

In [13]: zero_row_copy['A'] = 444 # This time no warning

Since you are modifying a copy, you know the original dfc will never change and you are not expecting it to change. Your expectation matches the behavior, then the SettingWithCopyWarning disappears.

Note, If you do want to modify the original DataFrame, the document suggests you use loc:

In [14]: dfc.loc[0,'A'] = 555

In [15]: dfc
Out[15]:
    A   B
0   555 1
1   bbb 2
2   ccc 3

Question 6

In general it is safer to work on copies than on original data frames, except when you know that you won’t be needing the original anymore and want to proceed with the manipulated version. Normally, you would still have some use for the original data frame to compare with the manipulated version, etc. Therefore, most people work on copies and merge at the end.

Question 7

Assumed you have data frame as below

df1
     A    B    C    D
4 -1.0 -1.0 -1.0 -1.0
5 -1.0 -1.0 -1.0 -1.0
6 -1.0 -1.0 -1.0 -1.0
6 -1.0 -1.0 -1.0 -1.0

When you would like create another df2 which is identical to df1, without copy

df2=df1
df2
     A    B    C    D
4 -1.0 -1.0 -1.0 -1.0
5 -1.0 -1.0 -1.0 -1.0
6 -1.0 -1.0 -1.0 -1.0
6 -1.0 -1.0 -1.0 -1.0

And would like modify the df2 value only as below

df2.iloc[0,0]='changed'

df2
         A    B    C    D
4  changed -1.0 -1.0 -1.0
5       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0

At the same time the df1 is changed as well

df1
         A    B    C    D
4  changed -1.0 -1.0 -1.0
5       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0

Since two df as same object, we can check it by using the id

id(df1)
140367679979600
id(df2)
140367679979600

So they as same object and one change another one will pass the same value as well.

If we add the copy, and now df1 and df2 are considered as different object, if we do the same change to one of them the other will not change.

df2=df1.copy()
id(df1)
140367679979600
id(df2)
140367674641232

df1.iloc[0,0]='changedback'
df2
         A    B    C    D
4  changed -1.0 -1.0 -1.0
5       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0

Good to mention, when you subset the original dataframe, it is safe to add the copy as well in order to avoid the SettingWithCopyWarning

Question 8

I have a pandas DataFrame with 4 columns and I want to create a new DataFrame that only has three of the columns. This question is similar to: Extracting specific columns from a data frame but for pandas not R. The following code does not work, raises an error, and is certainly not the pandasnic way to do it.

import pandas as pd
old = pd.DataFrame({'A' : [4,5], 'B' : [10,20], 'C' : [100,50], 'D' : [-30,-50]})
new = pd.DataFrame(zip(old.A, old.C, old.D)) # raises TypeError: data argument can't be an iterator

What is the pandasnic way to do it?

Question 9

There is a way of doing this and it actually looks similar to R

new = old[['A', 'C', 'D']].copy()

Here you are just selecting the columns you want from the original data frame and creating a variable for those. If you want to modify the new dataframe at all you’ll probably want to use .copy() to avoid a SettingWithCopyWarning.

An alternative method is to use filter which will create a copy by default:

new = old.filter(['A','B','D'], axis=1)

Finally, depending on the number of columns in your original dataframe, it might be more succinct to express this using a drop (this will also create a copy by default):

new = old.drop('B', axis=1)

Question 10

The easiest way is

new = old[['A','C','D']]

.

Question 11

Another simpler way seems to be:

new = pd.DataFrame([old.A, old.B, old.C]).transpose()

where old.column_name will give you a series. Make a list of all the column-series you want to retain and pass it to the DataFrame constructor. We need to do a transpose to adjust the shape.

In [14]:pd.DataFrame([old.A, old.B, old.C]).transpose()
Out[14]: 
   A   B    C
0  4  10  100
1  5  20   50

Question 12

Generic functional form

def select_columns(data_frame, column_names):
    new_frame = data_frame.loc[:, column_names]
    return new_frame

Specific for your problem above

selected_columns = ['A', 'C', 'D']
new = select_columns(old, selected_columns)

Question 13

If you want to have a new data frame then:

import pandas as pd
old = pd.DataFrame({'A' : [4,5], 'B' : [10,20], 'C' : [100,50], 'D' : [-30,-50]})
new=  old[['A', 'C', 'D']]

Question 14

As far as I can tell, you don’t necessarily need to specify the axis when using the filter function.

new = old.filter(['A','B','D'])

returns the same dataframe as

new = old.filter(['A','B','D'], axis=1)

Question 15

columns by index:

# selected column index: 1, 6, 7
new = old.iloc[: , [1, 6, 7]].copy()

Question 16

我正在将电子表格的内容读入熊猫。DataNitro具有一种返回矩形单元格选择作为列表列表的方法。所以

table = Cell("A1").table

给

table = [['Heading1', 'Heading2'], [1 , 2], [3, 4]]

headers = table.pop(0) # gives the headers as list and leaves data

我正忙着编写代码来翻译此内容，但我猜想它是如此简单，因此必须有方法可以做到这一点。无法在文档中找到它。任何可以简化此方法的指针？

Question 17

I am reading contents of a spreadsheet into pandas. DataNitro has a method that returns a rectangular selection of cells as a list of lists. So

table = Cell("A1").table

gives

table = [['Heading1', 'Heading2'], [1 , 2], [3, 4]]

headers = table.pop(0) # gives the headers as list and leaves data

I am busy writing code to translate this, but my guess is that it is such a simple use that there must be method to do this. Cant seem to find it in documentation. Any pointers to the method that would simplify this?

Question 18

pd.DataFrame直接调用构造函数：

df = pd.DataFrame(table, columns=headers)
df

   Heading1  Heading2
0         1         2
1         3         4

Question 19

Call the pd.DataFrame constructor directly:

df = pd.DataFrame(table, columns=headers)
df

   Heading1  Heading2
0         1         2
1         3         4

Question 20

使用上面EdChum解释的方法，列表中的值显示为行。要在DataFrame中将列表的值显示为列，只需使用transpose（）如下：

table = [[1 , 2], [3, 4]]
df = DataFrame(table)
df = df.transpose()
df.columns = ['Heading1', 'Heading2']

输出为：

      Heading1  Heading2
0         1        3
1         2        4

Question 21

With approach explained by EdChum above, the values in the list are shown as rows. To show the values of lists as columns in DataFrame instead, simply use transpose() as following:

table = [[1 , 2], [3, 4]]
df = DataFrame(table)
df = df.transpose()
df.columns = ['Heading1', 'Heading2']

The output then is:

      Heading1  Heading2
0         1        3
1         2        4

Question 22

即使没有pop列表，我们也可以set_index

pd.DataFrame(table).T.set_index(0).T
Out[11]: 
0 Heading1 Heading2
1        1        2
2        3        4

更新资料 from_records

table = [['Heading1', 'Heading2'], [1 , 2], [3, 4]]

pd.DataFrame.from_records(table[1:],columns=table[0])
Out[58]: 
   Heading1  Heading2
0         1         2
1         3         4

Question 23

Even without pop the list we can do with set_index

pd.DataFrame(table).T.set_index(0).T
Out[11]: 
0 Heading1 Heading2
1        1        2
2        3        4

Update from_records

table = [['Heading1', 'Heading2'], [1 , 2], [3, 4]]

pd.DataFrame.from_records(table[1:],columns=table[0])
Out[58]: 
   Heading1  Heading2
0         1         2
1         3         4

Question 24

I’ve got a dataframe called data. How would I rename the only one column header? For example gdp to log(gdp)?

data =
    y  gdp  cap
0   1    2    5
1   2    3    9
2   8    7    2
3   3    4    7
4   6    7    7
5   4    8    3
6   8    2    8
7   9    9   10
8   6    6    4
9  10   10    7

Question 25

data.rename(columns={'gdp':'log(gdp)'}, inplace=True)

The rename show that it accepts a dict as a param for columns so you just pass a dict with a single entry.

Also see related

Question 26

A much faster implementation would be to use list-comprehension if you need to rename a single column.

df.columns = ['log(gdp)' if x=='gdp' else x for x in df.columns]

If the need arises to rename multiple columns, either use conditional expressions like:

df.columns = ['log(gdp)' if x=='gdp' else 'cap_mod' if x=='cap' else x for x in df.columns]

Or, construct a mapping using a dictionary and perform the list-comprehension with it’s get operation by setting default value as the old name:

col_dict = {'gdp': 'log(gdp)', 'cap': 'cap_mod'}   ## key→old name, value→new name

df.columns = [col_dict.get(x, x) for x in df.columns]

Timings:

%%timeit
df.rename(columns={'gdp':'log(gdp)'}, inplace=True)
10000 loops, best of 3: 168 µs per loop

%%timeit
df.columns = ['log(gdp)' if x=='gdp' else x for x in df.columns]
10000 loops, best of 3: 58.5 µs per loop

Question 27

How do I rename a specific column in pandas?

From v0.24+, to rename one (or more) columns at a time,

DataFrame.rename() with axis=1 or axis='columns' (the axis argument was introduced in v0.21.
Index.str.replace() for string/regex based replacement.

If you need to rename ALL columns at once,

DataFrame.set_axis() method with axis=1. Pass a list-like sequence. Options are available for in-place modification as well.

`rename` with `axis=1`

df = pd.DataFrame('x', columns=['y', 'gdp', 'cap'], index=range(5))
df

   y gdp cap
0  x   x   x
1  x   x   x
2  x   x   x
3  x   x   x
4  x   x   x

With 0.21+, you can now specify an axis parameter with rename:

df.rename({'gdp':'log(gdp)'}, axis=1)
# df.rename({'gdp':'log(gdp)'}, axis='columns')
    
   y log(gdp) cap
0  x        x   x
1  x        x   x
2  x        x   x
3  x        x   x
4  x        x   x

(Note that rename is not in-place by default, so you will need to assign the result back.)

This addition has been made to improve consistency with the rest of the API. The new axis argument is analogous to the columns parameter—they do the same thing.

df.rename(columns={'gdp': 'log(gdp)'})

   y log(gdp) cap
0  x        x   x
1  x        x   x
2  x        x   x
3  x        x   x
4  x        x   x

rename also accepts a callback that is called once for each column.

df.rename(lambda x: x[0], axis=1)
# df.rename(lambda x: x[0], axis='columns')

   y  g  c
0  x  x  x
1  x  x  x
2  x  x  x
3  x  x  x
4  x  x  x

For this specific scenario, you would want to use

df.rename(lambda x: 'log(gdp)' if x == 'gdp' else x, axis=1)

`Index.str.replace`

Similar to replace method of strings in python, pandas Index and Series (object dtype only) define a (“vectorized”) str.replace method for string and regex-based replacement.

df.columns = df.columns.str.replace('gdp', 'log(gdp)')
df
 
   y log(gdp) cap
0  x        x   x
1  x        x   x
2  x        x   x
3  x        x   x
4  x        x   x

The advantage of this over the other methods is that str.replace supports regex (enabled by default). See the docs for more information.

Passing a list to `set_axis` with `axis=1`

Call set_axis with a list of header(s). The list must be equal in length to the columns/index size. set_axis mutates the original DataFrame by default, but you can specify inplace=False to return a modified copy.

df.set_axis(['cap', 'log(gdp)', 'y'], axis=1, inplace=False)
# df.set_axis(['cap', 'log(gdp)', 'y'], axis='columns', inplace=False)

  cap log(gdp)  y
0   x        x  x
1   x        x  x
2   x        x  x
3   x        x  x
4   x        x  x

Note: In future releases, inplace will default to True.

Method Chaining
Why choose set_axis when we already have an efficient way of assigning columns with df.columns = ...? As shown by Ted Petrou in [this answer],(https://stackoverflow.com/a/46912050/4909087) set_axis is useful when trying to chain methods.

Compare

# new for pandas 0.21+
df.some_method1()
  .some_method2()
  .set_axis()
  .some_method3()

Versus

# old way
df1 = df.some_method1()
        .some_method2()
df1.columns = columns
df1.some_method3()

The former is more natural and free flowing syntax.

Question 28

There are at least five different ways to rename specific columns in pandas, and I have listed them below along with links to the original answers. I also timed these methods and found them to perform about the same (though YMMV depending on your data set and scenario). The test case below is to rename columns A M N Z to A2 M2 N2 Z2 in a dataframe with columns A to Z containing a million rows.

# Import required modules
import numpy as np
import pandas as pd
import timeit

# Create sample data
df = pd.DataFrame(np.random.randint(0,9999,size=(1000000, 26)), columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))

# Standard way - https://stackoverflow.com/a/19758398/452587
def method_1():
    df_renamed = df.rename(columns={'A': 'A2', 'M': 'M2', 'N': 'N2', 'Z': 'Z2'})

# Lambda function - https://stackoverflow.com/a/16770353/452587
def method_2():
    df_renamed = df.rename(columns=lambda x: x + '2' if x in ['A', 'M', 'N', 'Z'] else x)

# Mapping function - https://stackoverflow.com/a/19758398/452587
def rename_some(x):
    if x=='A' or x=='M' or x=='N' or x=='Z':
        return x + '2'
    return x
def method_3():
    df_renamed = df.rename(columns=rename_some)

# Dictionary comprehension - https://stackoverflow.com/a/58143182/452587
def method_4():
    df_renamed = df.rename(columns={col: col + '2' for col in df.columns[
        np.asarray([i for i, col in enumerate(df.columns) if 'A' in col or 'M' in col or 'N' in col or 'Z' in col])
    ]})

# Dictionary comprehension - https://stackoverflow.com/a/38101084/452587
def method_5():
    df_renamed = df.rename(columns=dict(zip(df[['A', 'M', 'N', 'Z']], ['A2', 'M2', 'N2', 'Z2'])))

print('Method 1:', timeit.timeit(method_1, number=10))
print('Method 2:', timeit.timeit(method_2, number=10))
print('Method 3:', timeit.timeit(method_3, number=10))
print('Method 4:', timeit.timeit(method_4, number=10))
print('Method 5:', timeit.timeit(method_5, number=10))

Output:

Method 1: 3.650640267
Method 2: 3.163998427
Method 3: 2.998530871
Method 4: 2.9918436889999995
Method 5: 3.2436501520000007

Use the method that is most intuitive to you and easiest for you to implement in your application.

Question 29

I need to delete the first three rows of a dataframe in pandas.

I know df.ix[:-1] would remove the last row, but I can’t figure out how to remove first n rows.

Question 30

Use iloc:

df = df.iloc[3:]

will give you a new df without the first three rows.

Question 31

I’ve got a pandas DataFrame filled mostly with real numbers, but there is a few nan values in it as well.

How can I replace the nans with averages of columns where they are?

This question is very similar to this one: numpy array: replace nan values with average of columns but, unfortunately, the solution given there doesn’t work for a pandas DataFrame.

Question 32

You can simply use DataFrame.fillna to fill the nan‘s directly:

In [27]: df 
Out[27]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3       NaN -2.027325  1.533582
4       NaN       NaN  0.461821
5 -0.788073       NaN       NaN
6 -0.916080 -0.612343       NaN
7 -0.887858  1.033826       NaN
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

In [28]: df.mean()
Out[28]: 
A   -0.151121
B   -0.231291
C   -0.530307
dtype: float64

In [29]: df.fillna(df.mean())
Out[29]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325  1.533582
4 -0.151121 -0.231291  0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858  1.033826 -0.530307
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

The docstring of fillna says that value should be a scalar or a dict, however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().

Question 33

Try:

sub2['income'].fillna((sub2['income'].mean()), inplace=True)

Question 34

In [16]: df = DataFrame(np.random.randn(10,3))

In [17]: df.iloc[3:5,0] = np.nan

In [18]: df.iloc[4:6,1] = np.nan

In [19]: df.iloc[5:8,2] = np.nan

In [20]: df
Out[20]: 
          0         1         2
0  1.148272  0.227366 -2.368136
1 -0.820823  1.071471 -0.784713
2  0.157913  0.602857  0.665034
3       NaN -0.985188 -0.324136
4       NaN       NaN  0.238512
5  0.769657       NaN       NaN
6  0.141951  0.326064       NaN
7 -1.694475 -0.523440       NaN
8  0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794

In [22]: df.mean()
Out[22]: 
0   -0.251534
1   -0.040622
2   -0.841219
dtype: float64

Apply per-column the mean of that columns and fill

In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]: 
          0         1         2
0  1.148272  0.227366 -2.368136
1 -0.820823  1.071471 -0.784713
2  0.157913  0.602857  0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622  0.238512
5  0.769657 -0.040622 -0.841219
6  0.141951  0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8  0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794

Question 35

# To read data from csv file
Dataset = pd.read_csv('Data.csv')

X = Dataset.iloc[:, :-1].values

# To calculate mean use imputer class
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

Question 36

If you want to impute missing values with mean and you want to go column by column, then this will only impute with the mean of that column. This might be a little more readable.

sub2['income'] = sub2['income'].fillna((sub2['income'].mean()))

Question 37

Directly use df.fillna(df.mean()) to fill all the null value with mean

If you want to fill null value with mean of that column then you can use this

suppose x=df['Item_Weight'] here Item_Weight is column name

here we are assigning (fill null values of x with mean of x into x)

df['Item_Weight'] = df['Item_Weight'].fillna((df['Item_Weight'].mean()))

If you want to fill null value with some string then use

here Outlet_size is column name

df.Outlet_Size = df.Outlet_Size.fillna('Missing')

Question 38

Another option besides those above is:

df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))

It’s less elegant than previous responses for mean, but it could be shorter if you desire to replace nulls by some other column function.

Question 39

Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column

Say your DataFrame is df and you have one column called nr_items. This is: df['nr_items']

If you want to replace the NaN values of your column df['nr_items'] with the mean of the column:

Use method .fillna():

mean_value=df['nr_items'].mean() df['nr_item_ave']=df['nr_items'].fillna(mean_value)

I have created a new df column called nr_item_ave to store the new column with the NaN values replaced by the mean value of the column.

You should be careful when using the mean. If you have outliers is more recommendable to use the median

Question 40

using sklearn library preprocessing class

from sklearn.impute import SimpleImputer
missingvalues = SimpleImputer(missing_values = np.nan, strategy = 'mean', axis = 0)
missingvalues = missingvalues.fit(x[:,1:3])
x[:,1:3] = missingvalues.transform(x[:,1:3])

Note: In the recent version parameter missing_values value change to np.nan from NaN

Question 41

I am getting a ValueError: cannot reindex from a duplicate axis when I am trying to set an index to a certain value. I tried to reproduce this with a simple example, but I could not do it.

Here is my session inside of ipdb trace. I have a DataFrame with string index, and integer columns, float values. However when I try to create sum index for sum of all columns I am getting ValueError: cannot reindex from a duplicate axis error. I created a small DataFrame with the same characteristics, but was not able to reproduce the problem, what could I be missing?

I don’t really understand what ValueError: cannot reindex from a duplicate axismeans, what does this error message mean? Maybe this will help me diagnose the problem, and this is most answerable part of my question.

ipdb> type(affinity_matrix)
<class 'pandas.core.frame.DataFrame'>
ipdb> affinity_matrix.shape
(333, 10)
ipdb> affinity_matrix.columns
Int64Index([9315684, 9315597, 9316591, 9320520, 9321163, 9320615, 9321187, 9319487, 9319467, 9320484], dtype='int64')
ipdb> affinity_matrix.index
Index([u'001', u'002', u'003', u'004', u'005', u'008', u'009', u'010', u'011', u'014', u'015', u'016', u'018', u'020', u'021', u'022', u'024', u'025', u'026', u'027', u'028', u'029', u'030', u'032', u'033', u'034', u'035', u'036', u'039', u'040', u'041', u'042', u'043', u'044', u'045', u'047', u'047', u'048', u'050', u'053', u'054', u'055', u'056', u'057', u'058', u'059', u'060', u'061', u'062', u'063', u'065', u'067', u'068', u'069', u'070', u'071', u'072', u'073', u'074', u'075', u'076', u'077', u'078', u'080', u'082', u'083', u'084', u'085', u'086', u'089', u'090', u'091', u'092', u'093', u'094', u'095', u'096', u'097', u'098', u'100', u'101', u'103', u'104', u'105', u'106', u'107', u'108', u'109', u'110', u'111', u'112', u'113', u'114', u'115', u'116', u'117', u'118', u'119', u'121', u'122', ...], dtype='object')

ipdb> affinity_matrix.values.dtype
dtype('float64')
ipdb> 'sums' in affinity_matrix.index
False

Here is the error:

ipdb> affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0)
*** ValueError: cannot reindex from a duplicate axis

I tried to reproduce this with a simple example, but I failed

In [32]: import pandas as pd

In [33]: import numpy as np

In [34]: a = np.arange(35).reshape(5,7)

In [35]: df = pd.DataFrame(a, ['x', 'y', 'u', 'z', 'w'], range(10, 17))

In [36]: df.values.dtype
Out[36]: dtype('int64')

In [37]: df.loc['sums'] = df.sum(axis=0)

In [38]: df
Out[38]: 
      10  11  12  13  14  15   16
x      0   1   2   3   4   5    6
y      7   8   9  10  11  12   13
u     14  15  16  17  18  19   20
z     21  22  23  24  25  26   27
w     28  29  30  31  32  33   34
sums  70  75  80  85  90  95  100

Question 42

This error usually rises when you join / assign to a column when the index has duplicate values. Since you are assigning to a row, I suspect that there is a duplicate value in affinity_matrix.columns, perhaps not shown in your question.

Question 43

As others have said, you’ve probably got duplicate values in your original index. To find them do this:

df[df.index.duplicated()]

Question 44

Indices with duplicate values often arise if you create a DataFrame by concatenating other DataFrames. IF you don’t care about preserving the values of your index, and you want them to be unique values, when you concatenate the the data, set ignore_index=True.

Alternatively, to overwrite your current index with a new one, instead of using df.reindex(), set:

df.index = new_index

Question 45

For people who are still struggling with this error, it can also happen if you accidentally create a duplicate column with the same name. Remove duplicate columns like so:

df = df.loc[:,~df.columns.duplicated()]

Question 46

Simply skip the error using .values at the end.

affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0).values

Question 47

I came across this error today when I wanted to add a new column like this

df_temp['REMARK_TYPE'] = df.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)

I wanted to process the REMARK column of df_temp to return 1 or 0. However I typed wrong variable with df. And it returned error like this:

----> 1 df_temp['REMARK_TYPE'] = df.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)

/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in __setitem__(self, key, value)
   2417         else:
   2418             # set column
-> 2419             self._set_item(key, value)
   2420 
   2421     def _setitem_slice(self, key, value):

/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in _set_item(self, key, value)
   2483 
   2484         self._ensure_valid_index(value)
-> 2485         value = self._sanitize_column(key, value)
   2486         NDFrame._set_item(self, key, value)
   2487 

/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in _sanitize_column(self, key, value, broadcast)
   2633 
   2634         if isinstance(value, Series):
-> 2635             value = reindexer(value)
   2636 
   2637         elif isinstance(value, DataFrame):

/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in reindexer(value)
   2625                     # duplicate axis
   2626                     if not value.index.is_unique:
-> 2627                         raise e
   2628 
   2629                     # other

ValueError: cannot reindex from a duplicate axis

As you can see it, the right code should be

df_temp['REMARK_TYPE'] = df_temp.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)

Because df and df_temp have a different number of rows. So it returned ValueError: cannot reindex from a duplicate axis.

Hope you can understand it and my answer can help other people to debug their code.

Question 48

In my case, this error popped up not because of duplicate values, but because I attempted to join a shorter Series to a Dataframe: both had the same index, but the Series had fewer rows (missing the top few). The following worked for my purposes:

df.head()
                          SensA
date                           
2018-04-03 13:54:47.274   -0.45
2018-04-03 13:55:46.484   -0.42
2018-04-03 13:56:56.235   -0.37
2018-04-03 13:57:57.207   -0.34
2018-04-03 13:59:34.636   -0.33

series.head()
date
2018-04-03 14:09:36.577    62.2
2018-04-03 14:10:28.138    63.5
2018-04-03 14:11:27.400    63.1
2018-04-03 14:12:39.623    62.6
2018-04-03 14:13:27.310    62.5
Name: SensA_rrT, dtype: float64

df = series.to_frame().combine_first(df)

df.head(10)
                          SensA  SensA_rrT
date                           
2018-04-03 13:54:47.274   -0.45        NaN
2018-04-03 13:55:46.484   -0.42        NaN
2018-04-03 13:56:56.235   -0.37        NaN
2018-04-03 13:57:57.207   -0.34        NaN
2018-04-03 13:59:34.636   -0.33        NaN
2018-04-03 14:00:34.565   -0.33        NaN
2018-04-03 14:01:19.994   -0.37        NaN
2018-04-03 14:02:29.636   -0.34        NaN
2018-04-03 14:03:31.599   -0.32        NaN
2018-04-03 14:04:30.779   -0.33        NaN
2018-04-03 14:05:31.733   -0.35        NaN
2018-04-03 14:06:33.290   -0.38        NaN
2018-04-03 14:07:37.459   -0.39        NaN
2018-04-03 14:08:36.361   -0.36        NaN
2018-04-03 14:09:36.577   -0.37       62.2

Question 49

I wasted couple of hours on the same issue. In my case, I had to reset_index() of a dataframe before using apply function. Before merging, or looking up from another indexed dataset, you need to reset the index as 1 dataset can have only 1 Index.

Question 50

Simple Fix that Worked for Me

Run df.reset_index(inplace=True) before grouping.

Thank you to this github comment for the solution.

Question 51

I have a pandas data frame with two columns. I need to change the values of the first column without affecting the second one and get back the whole data frame with just first column values changed. How can I do that using apply in pandas?

Question 52

Given a sample dataframe df as:

a,b
1,2
2,3
3,4
4,5

what you want is:

df['a'] = df['a'].apply(lambda x: x + 1)

that returns:

Question 53

For a single column better to use map(), like this:

df = pd.DataFrame([{'a': 15, 'b': 15, 'c': 5}, {'a': 20, 'b': 10, 'c': 7}, {'a': 25, 'b': 30, 'c': 9}])

    a   b  c
0  15  15  5
1  20  10  7
2  25  30  9



df['a'] = df['a'].map(lambda a: a / 2.)

      a   b  c
0   7.5  15  5
1  10.0  10  7
2  12.5  30  9

Question 54

You don’t need a function at all. You can work on a whole column directly.

Example data:

>>> df = pd.DataFrame({'a': [100, 1000], 'b': [200, 2000], 'c': [300, 3000]})
>>> df

      a     b     c
0   100   200   300
1  1000  2000  3000

Half all the values in column a:

>>> df.a = df.a / 2
>>> df

     a     b     c
0   50   200   300
1  500  2000  3000

Question 55

Although the given responses are correct, they modify the initial data frame, which is not always desirable (and, given the OP asked for examples “using apply“, it might be they wanted a version that returns a new data frame, as apply does).

This is possible using assign: it is valid to assign to existing columns, as the documentation states (emphasis is mine):

Assign new columns to a DataFrame.

Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.

In short:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame([{'a': 15, 'b': 15, 'c': 5}, {'a': 20, 'b': 10, 'c': 7}, {'a': 25, 'b': 30, 'c': 9}])

In [3]: df.assign(a=lambda df: df.a / 2)
Out[3]: 
      a   b  c
0   7.5  15  5
1  10.0  10  7
2  12.5  30  9

In [4]: df
Out[4]: 
    a   b  c
0  15  15  5
1  20  10  7
2  25  30  9

Note that the function will be passed the whole dataframe, not only the column you want to modify, so you will need to make sure you select the right column in your lambda.

Question 56

If you are really concerned about the execution speed of your apply function and you have a huge dataset to work on, you could use swifter to make faster execution, here is an example for swifter on pandas dataframe:

import pandas as pd
import swifter

def fnc(m):
    return m*3+4

df = pd.DataFrame({"m": [1,2,3,4,5,6], "c": [1,1,1,1,1,1], "x":[5,3,6,2,6,1]})

# apply a self created function to a single column in pandas
df["y"] = df.m.swifter.apply(fnc)

This will enable your all CPU cores to compute the result hence it will be much faster than normal apply functions. Try and let me know if it become useful for you.

Question 57

Let me try a complex computation using datetime and considering nulls or empty spaces. I am reducing 30 years on a datetime column and using apply method as well as lambda and converting datetime format. Line if x != '' else x will take care of all empty spaces or nulls accordingly.

df['Date'] = df['Date'].fillna('')
df['Date'] = df['Date'].apply(lambda x : ((datetime.datetime.strptime(str(x), '%m/%d/%Y') - datetime.timedelta(days=30*365)).strftime('%Y%m%d')) if x != '' else x)

Question 58

Suppose I have a dataframe with columns a, b and c, I want to sort the dataframe by column b in ascending order, and by column c in descending order, how do I do this?

Question 59

As of the 0.17.0 release, the sort method was deprecated in favor of sort_values. sort was completely removed in the 0.20.0 release. The arguments (and results) remain the same:

df.sort_values(['a', 'b'], ascending=[True, False])

You can use the ascending argument of sort:

df.sort(['a', 'b'], ascending=[True, False])

For example:

In [11]: df1 = pd.DataFrame(np.random.randint(1, 5, (10,2)), columns=['a','b'])

In [12]: df1.sort(['a', 'b'], ascending=[True, False])
Out[12]:
   a  b
2  1  4
7  1  3
1  1  2
3  1  2
4  3  2
6  4  4
0  4  3
9  4  3
5  4  1
8  4  1

As commented by @renadeen

Sort isn’t in place by default! So you should assign result of the sort method to a variable or add inplace=True to method call.

that is, if you want to reuse df1 as a sorted DataFrame:

df1 = df1.sort(['a', 'b'], ascending=[True, False])

or

df1.sort(['a', 'b'], ascending=[True, False], inplace=True)

Question 60

As of pandas 0.17.0, DataFrame.sort() is deprecated, and set to be removed in a future version of pandas. The way to sort a dataframe by its values is now is DataFrame.sort_values

As such, the answer to your question would now be

df.sort_values(['b', 'c'], ascending=[True, False], inplace=True)

Question 61

For large dataframes of numeric data, you may see a significant performance improvement via numpy.lexsort, which performs an indirect sort using a sequence of keys:

import pandas as pd
import numpy as np

np.random.seed(0)

df1 = pd.DataFrame(np.random.randint(1, 5, (10,2)), columns=['a','b'])
df1 = pd.concat([df1]*100000)

def pdsort(df1):
    return df1.sort_values(['a', 'b'], ascending=[True, False])

def lex(df1):
    arr = df1.values
    return pd.DataFrame(arr[np.lexsort((-arr[:, 1], arr[:, 0]))])

assert (pdsort(df1).values == lex(df1).values).all()

%timeit pdsort(df1)  # 193 ms per loop
%timeit lex(df1)     # 143 ms per loop

One peculiarity is that the defined sorting order with numpy.lexsort is reversed: (-'b', 'a') sorts by series a first. We negate series b to reflect we want this series in descending order.

Be aware that np.lexsort only sorts with numeric values, while pd.DataFrame.sort_values works with either string or numeric values. Using np.lexsort with strings will give: TypeError: bad operand type for unary -: 'str'.

Question 62

How do I get the index column name in python pandas? Here’s an example dataframe:

             Column 1
Index Title          
Apples              1
Oranges             2
Puppies             3
Ducks               4

What I’m trying to do is get/set the dataframe index title. Here is what i tried:

import pandas as pd
data = {'Column 1'     : [1., 2., 3., 4.],
        'Index Title'  : ["Apples", "Oranges", "Puppies", "Ducks"]}
df = pd.DataFrame(data)
df.index = df["Index Title"]
del df["Index Title"]
print df

Anyone know how to do this?

Question 63

You can just get/set the index via its name property

In [7]: df.index.name
Out[7]: 'Index Title'

In [8]: df.index.name = 'foo'

In [9]: df.index.name
Out[9]: 'foo'

In [10]: df
Out[10]: 
         Column 1
foo              
Apples          1
Oranges         2
Puppies         3
Ducks           4

Question 64

You can use rename_axis, for removing set to None:

d = {'Index Title': ['Apples', 'Oranges', 'Puppies', 'Ducks'],'Column 1': [1.0, 2.0, 3.0, 4.0]}
df = pd.DataFrame(d).set_index('Index Title')
print (df)
             Column 1
Index Title          
Apples            1.0
Oranges           2.0
Puppies           3.0
Ducks             4.0

print (df.index.name)
Index Title

print (df.columns.name)
None

The new functionality works well in method chains.

df = df.rename_axis('foo')
print (df)
         Column 1
foo              
Apples        1.0
Oranges       2.0
Puppies       3.0
Ducks         4.0

You can also rename column names with parameter axis:

d = {'Index Title': ['Apples', 'Oranges', 'Puppies', 'Ducks'],'Column 1': [1.0, 2.0, 3.0, 4.0]}
df = pd.DataFrame(d).set_index('Index Title').rename_axis('Col Name', axis=1)
print (df)
Col Name     Column 1
Index Title          
Apples            1.0
Oranges           2.0
Puppies           3.0
Ducks             4.0

print (df.index.name)
Index Title

print (df.columns.name)
Col Name

print df.rename_axis('foo').rename_axis("bar", axis="columns")
bar      Column 1
foo              
Apples        1.0
Oranges       2.0
Puppies       3.0
Ducks         4.0

print df.rename_axis('foo').rename_axis("bar", axis=1)
bar      Column 1
foo              
Apples        1.0
Oranges       2.0
Puppies       3.0
Ducks         4.0

From version pandas 0.24.0+ is possible use parameter index and columns:

df = df.rename_axis(index='foo', columns="bar")
print (df)
bar      Column 1
foo              
Apples        1.0
Oranges       2.0
Puppies       3.0
Ducks         4.0

Removing index and columns names means set it to None:

df = df.rename_axis(index=None, columns=None)
print (df)
         Column 1
Apples        1.0
Oranges       2.0
Puppies       3.0
Ducks         4.0

If MultiIndex in index only:

mux = pd.MultiIndex.from_arrays([['Apples', 'Oranges', 'Puppies', 'Ducks'],
                                  list('abcd')], 
                                  names=['index name 1','index name 1'])


df = pd.DataFrame(np.random.randint(10, size=(4,6)), 
                  index=mux, 
                  columns=list('ABCDEF')).rename_axis('col name', axis=1)
print (df)
col name                   A  B  C  D  E  F
index name 1 index name 1                  
Apples       a             5  4  0  5  2  2
Oranges      b             5  8  2  5  9  9
Puppies      c             7  6  0  7  8  3
Ducks        d             6  5  0  1  6  0

print (df.index.name)
None

print (df.columns.name)
col name

print (df.index.names)
['index name 1', 'index name 1']

print (df.columns.names)
['col name']

df1 = df.rename_axis(('foo','bar'))
print (df1)
col name     A  B  C  D  E  F
foo     bar                  
Apples  a    5  4  0  5  2  2
Oranges b    5  8  2  5  9  9
Puppies c    7  6  0  7  8  3
Ducks   d    6  5  0  1  6  0

df2 = df.rename_axis('baz', axis=1)
print (df2)
baz                        A  B  C  D  E  F
index name 1 index name 1                  
Apples       a             5  4  0  5  2  2
Oranges      b             5  8  2  5  9  9
Puppies      c             7  6  0  7  8  3
Ducks        d             6  5  0  1  6  0

df2 = df.rename_axis(index=('foo','bar'), columns='baz')
print (df2)
baz          A  B  C  D  E  F
foo     bar                  
Apples  a    5  4  0  5  2  2
Oranges b    5  8  2  5  9  9
Puppies c    7  6  0  7  8  3
Ducks   d    6  5  0  1  6  0

Removing index and columns names means set it to None:

df2 = df.rename_axis(index=(None,None), columns=None)
print (df2)

           A  B  C  D  E  F
Apples  a  6  9  9  5  4  6
Oranges b  2  6  7  4  3  5
Puppies c  6  3  6  3  5  1
Ducks   d  4  9  1  3  0  5

For MultiIndex in index and columns is necessary working with .names instead .name and set by list or tuples:

mux1 = pd.MultiIndex.from_arrays([['Apples', 'Oranges', 'Puppies', 'Ducks'],
                                  list('abcd')], 
                                  names=['index name 1','index name 1'])


mux2 = pd.MultiIndex.from_product([list('ABC'),
                                  list('XY')], 
                                  names=['col name 1','col name 2'])

df = pd.DataFrame(np.random.randint(10, size=(4,6)), index=mux1, columns=mux2)
print (df)
col name 1                 A     B     C   
col name 2                 X  Y  X  Y  X  Y
index name 1 index name 1                  
Apples       a             2  9  4  7  0  3
Oranges      b             9  0  6  0  9  4
Puppies      c             2  4  6  1  4  4
Ducks        d             6  6  7  1  2  8

Plural is necessary for check/set values:

print (df.index.name)
None

print (df.columns.name)
None

print (df.index.names)
['index name 1', 'index name 1']

print (df.columns.names)
['col name 1', 'col name 2']

df1 = df.rename_axis(('foo','bar'))
print (df1)
col name 1   A     B     C   
col name 2   X  Y  X  Y  X  Y
foo     bar                  
Apples  a    2  9  4  7  0  3
Oranges b    9  0  6  0  9  4
Puppies c    2  4  6  1  4  4
Ducks   d    6  6  7  1  2  8

df2 = df.rename_axis(('baz','bak'), axis=1)
print (df2)
baz                        A     B     C   
bak                        X  Y  X  Y  X  Y
index name 1 index name 1                  
Apples       a             2  9  4  7  0  3
Oranges      b             9  0  6  0  9  4
Puppies      c             2  4  6  1  4  4
Ducks        d             6  6  7  1  2  8

df2 = df.rename_axis(index=('foo','bar'), columns=('baz','bak'))
print (df2)
baz          A     B     C   
bak          X  Y  X  Y  X  Y
foo     bar                  
Apples  a    2  9  4  7  0  3
Oranges b    9  0  6  0  9  4
Puppies c    2  4  6  1  4  4
Ducks   d    6  6  7  1  2  8

Removing index and columns names means set it to None:

df2 = df.rename_axis(index=(None,None), columns=(None,None))
print (df2)

           A     B     C   
           X  Y  X  Y  X  Y
Apples  a  2  0  2  5  2  0
Oranges b  1  7  5  5  4  8
Puppies c  2  4  6  3  6  5
Ducks   d  9  6  3  9  7  0

And @Jeff solution:

df.index.names = ['foo','bar']
df.columns.names = ['baz','bak']
print (df)

baz          A     B     C   
bak          X  Y  X  Y  X  Y
foo     bar                  
Apples  a    3  4  7  3  3  3
Oranges b    1  2  5  8  1  0
Puppies c    9  6  3  9  6  3
Ducks   d    3  2  1  0  1  0

问题：我为什么要在熊猫中复制数据框

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

问题：将特定的选定列提取到新DataFrame中作为副本

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

问题：将列表列表导入Pandas DataFrame

回答 0

回答 1

回答 2

问题：重命名熊猫中的特定列

回答 0

回答 1

回答 2

如何重命名熊猫中的特定列？

rename 与 axis=1

传递一个列表，set_axis与axis=1

How do I rename a specific column in pandas?

rename with axis=1

Passing a list to set_axis with axis=1

回答 3

问题：删除熊猫中数据框的前三行

回答 0

问题：pandas DataFrame：用列的平均值替换nan值

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

问题：“ ValueError：无法从重复轴重新索引”是什么意思？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

问题：熊猫：如何对单个列使用apply（）函数？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

问题：如何按两列或更多列对python pandas中的dataFrame进行排序？

回答 0

回答 1

回答 2

问题：熊猫索引列标题或名称

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

有趣好用的Python教程

`rename` 与 `axis=1`

传递一个列表，`set_axis`与`axis=1`

`rename` with `axis=1`

Passing a list to `set_axis` with `axis=1`