numpy.random.randint accepts a third argument (size), in which you can specify the shape of the output array; the newer Generator.integers API works the same way. You can use this to create your DataFrame:
import numpy as np
import pandas as pd
rng = np.random.default_rng()
df = pd.DataFrame(rng.integers(0,100, size=(100,4)), columns=list('ABCD'))
df
     A   B   C   D
0   58  96  82  24
1    2  13  35  36
2   67  79  22  78
3   81  65  77  94
4   73  67   0  96
..  ..  ..  ..  ..
95  76  32  28  51
96  33  68  54  77
97  76  43  57  43
98  34  64  12  57
99  81  77  32  50

100 rows × 4 columns
And if you want to produce a column containing the name of the column with the maximum value but considering only a subset of columns then you use a variation of @ajcr’s answer:
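For that subset variation, a minimal sketch (the column names and the new column name Max are taken from the example below, just for illustration):

df['Max'] = df[['Communications', 'Business']].idxmax(axis=1)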
You could use apply on the DataFrame and get the argmax() of each row via axis=1:
In [144]: df.apply(lambda x: x.argmax(), axis=1)
Out[144]:
0 Communications
1 Business
2 Communications
3 Communications
4 Business
dtype: object
Here’s a benchmark comparing how slow the apply method is relative to idxmax() for len(df) ~ 20K:
In [146]: %timeit df.apply(lambda x: x.argmax(), axis=1)
1 loops, best of 3: 479 ms per loop
In [147]: %timeit df.idxmax(axis=1)
10 loops, best of 3: 47.3 ms per loop
import pandas as pd
df ={'col_1':[0,1,2,3],'col_2':[4,5,6,7]}
df = pd.DataFrame(df)
df[['column_new_1','column_new_2','column_new_3']]=[np.nan,'dogs',3]#thought this would work here...
I’m new to pandas and trying to figure out how to add multiple columns to pandas simultaneously. Any help here is appreciated. Ideally I would like to do this in one step rather than multiple repeated steps…
import numpy as np
import pandas as pd

df = {'col_1': [0, 1, 2, 3],
      'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)

df[['column_new_1', 'column_new_2', 'column_new_3']] = [np.nan, 'dogs', 3]  # thought this would work here...
I would have expected your syntax to work too. The problem arises because when you create new columns with the column-list syntax (df[[new1, new2]] = ...), pandas requires that the right hand side be a DataFrame (note that it doesn’t actually matter if the columns of the DataFrame have the same names as the columns you are creating).
Your syntax works fine for assigning scalar values to existing columns, and pandas is also happy to assign scalar values to a new column using the single-column syntax (df[new1] = ...). So the solution is either to convert this into several single-column assignments, or create a suitable DataFrame for the right-hand side.
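A short sketch of both routes, reusing the column names from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': [0, 1, 2, 3], 'col_2': [4, 5, 6, 7]})

# Route 1: several single-column assignments
df['column_new_1'] = np.nan
df['column_new_2'] = 'dogs'
df['column_new_3'] = 3

# Route 2: build a suitable DataFrame for the right-hand side
rhs = pd.DataFrame([[np.nan, 'dogs', 3]] * len(df),
                   columns=['column_new_1', 'column_new_2', 'column_new_3'],
                   index=df.index)
df[['column_new_1', 'column_new_2', 'column_new_3']] = rhs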
5) Using a dict is a more “natural” way to create the new data frame than the previous two, but the new columns will be sorted alphabetically (at least before Python 3.6 or 3.7):
I like this variant on @zero’s answer a lot, but like the previous one, the new columns will always be sorted alphabetically, at least with early versions of Python:
In [128]: df
Out[128]:
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
In [129]: pd.concat([df, pd.DataFrame(columns=['column_new_1', 'column_new_2', 'column_new_3'])])
Out[129]:
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0.0 4.0 NaN NaN NaN
1 1.0 5.0 NaN NaN NaN
2 2.0 6.0 NaN NaN NaN
3 3.0 7.0 NaN NaN NaN
I’m not quite sure what you wanted to do with [np.nan, 'dogs', 3]. Maybe you now want to set them as default values?
In [142]: df1 = pd.concat([df, pd.DataFrame(columns=['column_new_1', 'column_new_2', 'column_new_3'])])
In [143]: df1[['column_new_1', 'column_new_2', 'column_new_3']] = [np.nan, 'dogs', 3]
In [144]: df1
Out[144]:
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0.0 4.0 NaN dogs 3
1 1.0 5.0 NaN dogs 3
2 2.0 6.0 NaN dogs 3
3 3.0 7.0 NaN dogs 3
Answer 3
Using a list comprehension with pd.DataFrame and pd.concat:

pd.concat(
    [df,
     pd.DataFrame([[np.nan, 'dogs', 3] for _ in range(df.shape[0])],
                  df.index, ['column_new_1', 'column_new_2', 'column_new_3'])],
    axis=1)
You can pass a list of columns to [] to select columns in that order.
If a column is not contained in the DataFrame, an exception will be raised.
Multiple columns can also be set in this manner.
You may find this useful for applying a transform (in-place) to a subset of the columns.
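A small sketch of these points, assuming the col_1/col_2 frame from the question:

df[['col_2', 'col_1']]                               # select columns, in that order
df[['col_1', 'col_2']] = df[['col_1', 'col_2']] * 2  # transform a subset in place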
If you just want to add empty new columns, reindex will do the job
df
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
df.reindex(list(df) + ['column_new_1', 'column_new_2', 'column_new_3'], axis=1)
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0 4 NaN NaN NaN
1 1 5 NaN NaN NaN
2 2 6 NaN NaN NaN
3 3 7 NaN NaN NaN
df['d'] = df.apply(rowFunc, axis=1)
>>> df
a b c d
0 1 2 3 7
1 4 5 6 34
Awesome! Now what if I want to incorporate the index into my function?
The index of any given row in this DataFrame before adding d would be Index([u'a', u'b', u'c', u'd'], dtype='object'), but I want the 0 and 1. So I can’t just access row.index.
I know I could create a temporary column in the table where I store the index, but I’m wondering if it is stored in the row object somewhere.
To access the index in this case you access the name attribute:
In [182]:
df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
def rowFunc(row):
return row['a'] + row['b'] * row['c']
def rowIndex(row):
return row.name
df['d'] = df.apply(rowFunc, axis=1)
df['rowIndex'] = df.apply(rowIndex, axis=1)
df
Out[182]:
a b c d rowIndex
0 1 2 3 7 0
1 4 5 6 34 1
Note that if this is really what you are trying to do, the following works and is much faster:
In [198]:
df['d'] = df['a'] + df['b'] * df['c']
df
Out[198]:
a b c d
0 1 2 3 7
1 4 5 6 34
In [199]:
%timeit df['a'] + df['b'] * df['c']
%timeit df.apply(rowIndex, axis=1)
10000 loops, best of 3: 163 µs per loop
1000 loops, best of 3: 286 µs per loop
EDIT
Looking at this question 3+ years later, you could just do:
In[15]:
df['d'],df['rowIndex'] = df['a'] + df['b'] * df['c'], df.index
df
Out[15]:
a b c d rowIndex
0 1 2 3 7 0
1 4 5 6 34 1
but assuming it isn’t as trivial as this, whatever your rowFunc is really doing, you should look to use the vectorised functions, and then use them against the df index:
In[16]:
df['newCol'] = df['a'] + df['b'] + df['c'] + df.index
df
Out[16]:
a b c d rowIndex newCol
0 1 2 3 7 0 6
1 4 5 6 34 1 16
Answer 1
Either:

1. with row.name inside an apply(..., axis=1) call:

df = pandas.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'], index=['x','y'])

   a  b  c
x  1  2  3
y  4  5  6

df.apply(lambda row: row.name, axis=1)

x    x
y    y
To answer the original question: yes, you can access the index value of a row in apply(). It is available under the key name and requires that you specify axis=1 (because the lambda processes the columns of a row and not the rows of a column).
Working example (pandas 0.23.4):
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
>>> df.set_index('a', inplace=True)
>>> df
b c
a
1 2 3
4 5 6
>>> df['index_x10'] = df.apply(lambda row: 10*row.name, axis=1)
>>> df
b c index_x10
a
1 2 3 10
4 5 6 40
Is there any function that would be the equivalent of a combination of df.isin() and df[col].str.contains()?
For example, say I have the series
s = pd.Series(['cat','hat','dog','fog','pet']), and I want to find all places where s contains any of ['og', 'at']. I would want to get everything but 'pet'.
I have a solution, but it’s rather inelegant:
searchfor = ['og', 'at']
found = [s.str.contains(x) for x in searchfor]
result = pd.DataFrame(found)
result.any()
One option is just to use the regex | character to try to match each of the substrings in the words in your Series s (still using str.contains).
You can construct the regex by joining the words in searchfor with |:
>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0 cat
1 hat
2 dog
3 fog
dtype: object
As @AndyHayden noted in the comments below, take care if your substrings have special characters such as $ and ^ which you want to match literally. These characters have specific meanings in the context of regular expressions and will affect the matching.
You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:
>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']
The strings in this new list will match each character literally when used with str.contains.
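For instance, a sketch of feeding the escaped list back into the joining approach above (s2 is an illustrative series):

>>> s2 = pd.Series(['$money gone', 'x^y = z', 'plain'])
>>> s2[s2.str.contains('|'.join(safe_matches))]
0    $money gone
1        x^y = z
dtype: object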
You can use str.contains alone with a regex pattern using OR (|):
s[s.str.contains('og|at')]
Or you could add the series to a dataframe then use str.contains:
df = pd.DataFrame(s)
df[s.str.contains('og|at')]
Output:
0 cat
1 hat
2 dog
3 fog
Answer 2
Here is a one-line lambda that also works:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Input:

searchfor = ['og', 'at']
df = pd.DataFrame([('cat', 1000.0), ('hat', 2000000.0), ('dog', 1000.0), ('fog', 330000.0), ('pet', 330000.0)],
                  columns=['col1', 'col2'])

  col1       col2
0  cat     1000.0
1  hat  2000000.0
2  dog     1000.0
3  fog   330000.0
4  pet   330000.0

Applying the lambda:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Output:

  col1       col2  TrueFalse
0  cat     1000.0          1
1  hat  2000000.0          1
2  dog     1000.0          1
3  fog   330000.0          1
4  pet   330000.0          0
I’m confused about the rules Pandas uses when deciding that a selection from a dataframe is a copy of the original dataframe, or a view on the original.
I understand that a query returns a copy so that something like
foo = df.query('2 < index <= 5')
foo.loc[:,'E'] = 40
will have no effect on the original dataframe, df. I also understand that scalar or named slices return a view, so that assignments to these, such as
df.iloc[3] = 70
or
df.ix[1,'B':'E'] = 222
will change df. But I’m lost when it comes to more complicated cases. For example,
df[df.C <= df.B] = 7654321
changes df, but
df[df.C <= df.B].ix[:,'B':'E']
does not.
Is there a simple rule that Pandas is using that I’m just missing? What’s going on in these specific cases; and in particular, how do I change all values (or a subset of values) in a dataframe that satisfy a particular query (as I’m attempting to do in the last example above)?
Note: This is not the same as this question; and I have read the documentation, but am not enlightened by it. I’ve also read through the “Related” questions on this topic, but I’m still missing the simple rule Pandas is using, and how I’d apply it to — for example — modify the values (or a subset of values) in a dataframe that satisfy a particular query.
If inplace=True is provided, it will modify in-place; only some operations support this
An indexer that sets, e.g. .loc/.iloc/.iat/.at will set inplace.
An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be, which is why this is not reliable). This is mainly for efficiency. (The example above uses .query; that will always return a copy as it is evaluated by numexpr.)
An indexer that gets on a multiple-dtyped object is always a copy.
Your example of chained indexing
df[df.C <= df.B].loc[:,'B':'E']
is not guaranteed to work (and thus you should never do this).
Instead do:
df.loc[df.C <= df.B, 'B':'E']
as this is faster and will always work
The chained indexing is 2 separate python operations and thus cannot be reliably intercepted by pandas (you will oftentimes get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed to, offer a much fuller explanation.
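For the assignment case asked about in the question, the same single .loc expression can be used on the left-hand side (a sketch):

df.loc[df.C <= df.B, 'B':'E'] = 7654321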
I would like to append a string to the start of each value in a said column of a pandas dataframe (elegantly).
I already figured out how to kind-of do this and I am currently using:
This seems one hell of an inelegant thing to do – do you know any other way (which maybe also adds the character to rows where that column is 0 or NaN)?
In case this is yet unclear, I would like to turn:
col
1 a
2 0
into:
col
1 stra
2 str0
Answer 0

df['col'] = 'str' + df['col'].astype(str)

Example:

>>> df = pd.DataFrame({'col': ['a', 0]})
>>> df
  col
0   a
1   0
>>> df['col'] = 'str' + df['col'].astype(str)
>>> df
    col
0  stra
1  str0

Timing comparison:

df = pd.DataFrame({'col': ['a', 0] * 200000})

%timeit df['col'].apply(lambda x: f"str{x}")
117 ms ± 451 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit 'str' + df['col'].astype(str)
112 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using format, however, is indeed much slower:

%timeit df['col'].apply(lambda x: "{}{}".format('str', x))
185 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
As an alternative, you can also use an apply combined with format (or better with f-strings) which I find slightly more readable if one e.g. also wants to add a suffix or manipulate the element itself:
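The snippet for that alternative isn’t shown above; a minimal sketch of what it could look like (the suffix is illustrative):

df['col'] = df['col'].apply(lambda x: f"str{x}")
# or with a suffix as well:
df['col'] = df['col'].apply(lambda x: f"str{x}_suffix")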
Assuming df has a unique index, this gives the row with the maximum value:
In [34]: df.loc[df['Value'].idxmax()]
Out[34]:
Country US
Place Kansas
Value 894
Name: 7
Note that idxmax returns index labels. So if the DataFrame has duplicates in the index, the label may not uniquely identify the row, so df.loc may return more than one row.
Therefore, if df does not have a unique index, you must make the index unique before proceeding as above. Depending on the DataFrame, sometimes you can use stack or set_index to make the index unique. Or, you can simply reset the index (so the rows become renumbered, starting at 0):
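For example, a sketch of the reset-index route, using the Country/Place/Value frame from above:

df = df.reset_index()          # rows renumbered 0..n-1; the old index becomes a column
df.loc[df['Value'].idxmax()]   # idxmax now identifies a unique row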
I think the easiest way to return a row with the maximum value is by getting its index. argmax() can be used to return the index of the row with the largest value.
index = df.Value.argmax()
Now the index could be used to get the features for that particular row:
df.iloc[df.Value.argmax(), 0:2]
Use the index attribute of DataFrame. Note that I don’t type all the rows in the example.
In [14]: df = data.groupby(['Country','Place'])['Value'].max()
In [15]: df.index
Out[15]:
MultiIndex
[Spain Manchester, UK London, US Mchigan, NewYork]
In [16]: df.index[0]
Out[16]: ('Spain', 'Manchester')
In [17]: df.index[1]
Out[17]: ('UK', 'London')
You can also get the value by that index:
In [21]: for index in df.index:
    ...:     print(index, df[index])
    ...:
('Spain', 'Manchester') 512
('UK', 'London') 778
('US', 'Mchigan') 854
('US', 'NewYork') 562
Edit
Sorry for misunderstanding what you want; try the following:
In [52]: s = data.max()
In [53]: print('%s, %s, %s' % (s['Country'], s['Place'], s['Value']))
US, NewYork, 854
I encountered a similar error while trying to import data using pandas. The first column in my dataset had spaces before the start of the words. I removed the spaces and it worked like a charm!
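If you would rather strip such spaces programmatically after reading, a sketch (first_col stands in for whatever your leading column is called):

import pandas as pd

df = pd.read_csv('data.csv')
df.columns = df.columns.str.strip()            # stray spaces in the headers
df['first_col'] = df['first_col'].str.strip()  # stray spaces in the values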
Good question and answer, but they only handle one column with a list (in my answer the self-defined function will work for multiple columns; also, the accepted answer uses apply, which is the most time-consuming approach and is not recommended; for more, check When should I ever want to use pandas apply() in my code?)
I know object dtype columns make the data hard to convert with pandas functions. When I received data like this, the first thing that came to mind was to ‘flatten’ or unnest the columns.
I am using pandas and python functions for this type of question. If you are worried about the speed of the above solutions, check user3483203’s answer, since it uses numpy and most of the time numpy is faster. I recommend Cython or numba if speed matters.
Method 0 [pandas >= 0.25]
Starting from pandas 0.25, if you only need to explode one column, you can use the pandas.DataFrame.explode function:
df.explode('B')
A B
0 1 1
1 1 2
0 2 1
1 2 2
Given a dataframe with an empty list or a NaN in the column: an empty list will not cause an issue, but a NaN will need to be filled with a list.
df = pd.DataFrame({'A': [1, 2, 3, 4],'B': [[1, 2], [1, 2], [], np.nan]})
df.B = df.B.fillna({i: [] for i in df.index}) # replace NaN with []
df.explode('B')
A B
0 1 1
0 1 2
1 2 1
1 2 2
2 3 NaN
3 4 NaN
Method 1: apply + pd.Series (easy to understand, but not recommended in terms of performance)
df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
Out[463]:
A B
0 1 1
1 1 2
0 2 1
1 2 2
Method 2
Using repeat with the DataFrame constructor to re-create your dataframe (good performance, but not good with multiple columns):
df = pd.DataFrame({'A': df.A.repeat(df.B.str.len()), 'B': np.concatenate(df.B.values)})
df
Out[465]:
A B
0 1 1
0 1 2
1 2 1
1 2 2
Method 2.1
Suppose that besides A we have A.1, ..., A.n. If we still use Method 2 above, it is hard to re-create the columns one by one.
Solution: join or merge with the index after ‘unnest’-ing the single column.
s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
s.join(df.drop('B', axis=1), how='left')
Out[477]:
B A
0 1 1
0 2 1
1 1 2
1 2 2
If you need the column order exactly the same as before, add reindex at the end.
Method 3
Using a list comprehension with the DataFrame constructor:

pd.DataFrame([[x] + [z] for x, y in df.values for z in y], columns=df.columns)
Out[488]:
A B
0 1 1
1 1 2
2 2 1
3 2 2
If more than two columns, use
s = pd.DataFrame([[x] + [z] for x, y in zip(df.index, df.B) for z in y])
s.merge(df, left_on=0, right_index=True)
Out[491]:
0 1 A B
0 0 1 1 [1, 2]
1 0 2 1 [1, 2]
2 1 1 2 [1, 2]
3 1 2 2 [1, 2]
Method 4
using reindex or loc
df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]:
A B
0 1 1
0 1 2
1 2 1
1 2 2
#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))
Method 5
when the list only contains unique values:
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]:
B A
0 1 1
1 2 1
2 3 2
3 4 2
Method 6
using numpy for high performance:
newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
pd.DataFrame(data=newvalues[0],columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Method 7
using the base functions itertools.cycle and chain: a pure Python solution, just for fun
from itertools import cycle,chain
l=df.values.tolist()
l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Generalizing to multiple columns
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
df
Out[592]:
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4]
Self-defined function:
def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, axis=1), how='left')
unnesting(df,['B','C'])
Out[609]:
B C A
0 1 1 1
0 2 2 1
1 3 3 2
1 4 4 2
Column-wise Unnesting
All the methods above deal with vertical unnesting and exploding. If you do need to expand the list horizontally, check the pd.DataFrame constructor:
df.join(pd.DataFrame(df.B.tolist(),index=df.index).add_prefix('B_'))
Out[33]:
A B C B_0 B_1
0 1 [1, 2] [1, 2] 1 2
1 2 [3, 4] [3, 4] 3 4
Updated function
def unnesting(df, explode, axis):
    if axis == 1:
        idx = df.index.repeat(df[explode[0]].str.len())
        df1 = pd.concat([
            pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
        df1.index = idx
        return df1.join(df.drop(explode, axis=1), how='left')
    else:
        df1 = pd.concat([
            pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(explode, axis=1), how='left')
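A quick usage sketch of the updated function (note its convention: axis=1 performs the vertical unnest shown earlier, anything else the horizontal expansion):

unnesting(df, ['B', 'C'], axis=1)   # long format: one row per list element
unnesting(df, ['B', 'C'], axis=0)   # wide format: columns B0, B1, C0, C1 joined to A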
If all of the sublists in the other column are the same length, numpy can be an efficient option here:
vals = np.array(df.B.values.tolist())
a = np.repeat(df.A, vals.shape[1])
pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Option 2
If the sublists have different length, you need an additional step:
vals = df.B.values.tolist()
rs = [len(r) for r in vals]
a = np.repeat(df.A, rs)
pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Option 3
I took a shot at generalizing this to flatten N columns and tile M columns; I’ll work later on making it more efficient. Given:

   A          B          C  D
0  1     [1, 2]  [1, 2, 3]  A
1  2  [1, 2, 3]     [1, 2]  B
2  3        [1]     [1, 2]  C

def unnest(df, tile, explode):
    vals = df[explode].sum(1)
    rs = [len(r) for r in vals]
    a = np.repeat(df[tile].values, rs, axis=0)
    b = np.concatenate(vals.values)
    d = np.column_stack((a, b))
    return pd.DataFrame(d, columns=tile + ['_'.join(explode)])

unnest(df, ['A', 'D'], ['B', 'C'])

    A  D B_C
0   1  A   1
1   1  A   2
2   1  A   1
3   1  A   2
4   1  A   3
5   2  B   1
6   2  B   2
7   2  B   3
8   2  B   1
9   2  B   2
10  3  C   1
11  3  C   1
12  3  C   2

Functions

def wen1(df):
    return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

def wen2(df):
    return pd.DataFrame({'A': df.A.repeat(df.B.str.len()), 'B': np.concatenate(df.B.values)})

def wen3(df):
    s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
    return s.join(df.drop('B', axis=1), how='left')

def wen4(df):
    return pd.DataFrame([[x] + [z] for x, y in df.values for z in y], columns=df.columns)

def chris1(df):
    vals = np.array(df.B.values.tolist())
    a = np.repeat(df.A, vals.shape[1])
    return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

def chris2(df):
    vals = df.B.values.tolist()
    rs = [len(r) for r in vals]
    a = np.repeat(df.A.values, rs)
    return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)

Timings

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit

res = pd.DataFrame(
    index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
    columns=[10, 50, 100, 500, 1000, 5000, 10000],
    dtype=float)

for f in res.index:
    for c in res.columns:
        df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
        df = pd.concat([df] * c)
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
df[['B1', 'B2']] = pd.DataFrame([*df['B']])   # if values.tolist() is too boring

(pd.wide_to_long(df.drop('B', axis=1), 'B', 'A', '')
   .reset_index(level=1, drop=True)
   .reset_index())
Because sublist lengths usually differ, and join/merge is far more computationally expensive, I retested the methods for different-length sublists and more normal columns.
MultiIndex should also be an easier way to write, with nearly the same performance as the numpy way.
Surprisingly, in my implementation the comprehension way has the best performance.
def stack(df):
return df.set_index(['A', 'C']).B.apply(pd.Series).stack()
def comprehension(df):
return pd.DataFrame([x + [z] for x, y in zip(df[['A', 'C']].values.tolist(), df.B) for z in y])
def multiindex(df):
return pd.DataFrame(np.concatenate(df.B.values), index=df.set_index(['A', 'C']).index.repeat(df.B.str.len()))
def array(df):
return pd.DataFrame(
np.column_stack((
np.repeat(df[['A', 'C']].values, df.B.str.len(), axis=0),
np.concatenate(df.B.values)
))
)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit
res = pd.DataFrame(
index=[
'stack',
'comprehension',
'multiindex',
'array',
],
columns=[1000, 2000, 5000, 10000, 20000, 50000],
dtype=float
)
for f in res.index:
for c in res.columns:
df = pd.DataFrame({'A': list('abc'), 'C': list('def'), 'B': [['g', 'h', 'i'], ['j', 'k'], ['l']]})
df = pd.concat([df] * c)
stmt = '{}(df)'.format(f)
setp = 'from __main__ import df, {}'.format(f)
res.at[f, c] = timeit(stmt, setp, number=20)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
The actual explosion is performed in 3 lines. The rest is cosmetics (multi column explosion, handling of strings instead of lists in the explosion column, …).
import pandas as pd
import numpy as np
df=pd.DataFrame( {'A': ['A1','A2','A3'],
'B': ['B1','B2','B3'],
'C': [ ['C1.1','C1.2'],['C2.1','C2.2'],'C3'],
'columnD': [ 'D1',['D2.1','D2.2', 'D2.3'],['D3.1','D3.2']],
})
print('df',df, sep='\n')
def dfListExplode(df, explodeKeys):
    if not isinstance(explodeKeys, list):
        explodeKeys = [explodeKeys]
    # recursive handling of explodeKeys
    if len(explodeKeys) == 0:
        return df
    elif len(explodeKeys) == 1:
        explodeKey = explodeKeys[0]
    else:
        return dfListExplode(dfListExplode(df, explodeKeys[:1]), explodeKeys[1:])
    # perform explosion/unnesting for key: explodeKey
    dfPrep = df[explodeKey].apply(lambda x: x if isinstance(x, list) else [x])  # casts all elements to a list
    dfIndExpl = pd.DataFrame([[x] + [z] for x, y in zip(dfPrep.index, dfPrep.values) for z in y],
                             columns=['explodedIndex', explodeKey])
    dfMerged = dfIndExpl.merge(df.drop(explodeKey, axis=1), left_on='explodedIndex', right_index=True)
    dfReind = dfMerged.reindex(columns=list(df))
    return dfReind
dfExpl=dfListExplode(df,['C','columnD'])
print('dfExpl',dfExpl, sep='\n')
When the lengths are the same, it is easy for us to assume that the varying elements coincide and should be “zipped” together.
A B C
0 1 [1, 2] [1, 2] # Typical to assume these should be zipped [(1, 1), (2, 2)]
1 2 [3, 4] [3, 4, 5]
However, the assumption gets challenged when we see objects of different lengths. Should we “zip”? If so, how do we handle the excess in one of the objects? Or maybe we want the product of all of the objects; this will get big fast, but might be what is wanted.
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4, 5] # is this [(3, 3), (4, 4), (None, 5)]?
OR
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4, 5] # is this [(3, 3), (3, 4), (3, 5), (4, 3), (4, 4), (4, 5)]
The Function
This function gracefully handles zip or product based on a parameter and, when zipping, pads to the length of the longest object via zip_longest.
from itertools import zip_longest, product
def xplode(df, explode, zipped=True):
method = zip_longest if zipped else product
rest = {*df} - {*explode}
zipped = zip(zip(*map(df.get, rest)), zip(*map(df.get, explode)))
tups = [tup + exploded
for tup, pre in zipped
for exploded in method(*pre)]
return pd.DataFrame(tups, columns=[*rest, *explode])[[*df]]
Zipped
xplode(df, ['B', 'C'])
A B C
0 1 1.0 1
1 1 2.0 2
2 2 3.0 3
3 2 4.0 4
4 2 NaN 5
Product

xplode(df, ['B', 'C'], zipped=False)

   A  B  C
0  1  1  1
1  1  1  2
2  1  2  1
3  1  2  2
4  2  3  3
5  2  3  4
6  2  3  5
7  2  4  3
8  2  4  4
9  2  4  5

New setup
Modifying the example:

df = pd.DataFrame({
'A': [1, 2],
'B': [[1, 2], [3, 4]],
'C': 'C',
'D': [[1, 2], [3, 4, 5]],
'E': [('X', 'Y', 'Z'), ('W',)]
})
df
A B C D E
0 1 [1, 2] C [1, 2] (X, Y, Z)
1 2 [3, 4] C [3, 4, 5] (W,)
Zipped
xplode(df, ['B', 'D', 'E'])
A B C D E
0 1 1.0 C 1.0 X
1 1 2.0 C 2.0 Y
2 1 NaN C NaN Z
3 2 3.0 C 3.0 W
4 2 4.0 C 4.0 None
5 2 NaN C 5.0 None
Product
xplode(df, ['B', 'D', 'E'], zipped=False)
A B C D E
0 1 1 C 1 X
1 1 1 C 1 Y
2 1 1 C 1 Z
3 1 1 C 2 X
4 1 1 C 2 Y
5 1 1 C 2 Z
6 1 2 C 1 X
7 1 2 C 1 Y
8 1 2 C 1 Z
9 1 2 C 2 X
10 1 2 C 2 Y
11 1 2 C 2 Z
12 2 3 C 3 W
13 2 3 C 4 W
14 2 3 C 5 W
15 2 4 C 3 W
16 2 4 C 4 W
17 2 4 C 5 W
Any opinions on this method I thought of? Or is doing both concat and melt considered too “expensive”?
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})
out = pd.concat([df.loc[:,'A'],(df.B.apply(pd.Series))], axis=1, sort=False)
out = out.set_index('A').stack().droplevel(level=1).reset_index().rename(columns={0:"B"})
A B
0 1 1
1 1 2
2 2 1
3 2 2
You can implement this as a one-liner if you don’t wish to create an intermediate object:
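A sketch of the chained version (the same operations as above, without the intermediate name):

out = (pd.concat([df.loc[:, 'A'], df.B.apply(pd.Series)], axis=1, sort=False)
         .set_index('A')
         .stack()
         .droplevel(level=1)
         .reset_index()
         .rename(columns={0: 'B'}))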
# Here's the answer to the related question in:
# https://stackoverflow.com/q/56708671/11426125
# initial dataframe
df12=pd.DataFrame({'Date':['2007-12-03','2008-09-07'],'names':
[['Peter','Alex'],['Donald','Stan']]})
# convert dataframe to array for indexing list values (names)
a = np.array(df12.values)
# create a new, dataframe with dimensions for unnested
b = np.ndarray(shape = (4,2))
df2 = pd.DataFrame(b, columns = ["Date", "names"], dtype = str)
# implement loops to assign date/name values as required
i = range(len(a[0]))
j = range(len(a[0]))
for x in i:
for y in j:
df2.iat[2*x+y, 0] = a[x][0]
df2.iat[2*x+y, 1] = a[x][1][y]
# set Date column as Index
df2.Date=pd.to_datetime(df2.Date)
df2.index=df2.Date
df2.drop('Date',axis=1,inplace =True)
I have another good way to solve this when you have more than one column to explode.
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]], 'C':[[1,2,3],[1,2,3]]})
print(df)
A B C
0 1 [1, 2] [1, 2, 3]
1 2 [1, 2] [1, 2, 3]
I want to explode the columns B and C: first I explode B, then C. Then I drop B and C from the original df. After that I do an index join on the 3 dfs.
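A minimal sketch of that procedure (assuming pandas >= 0.25 for Series.explode); note that the index join yields the per-row combinations of B and C:

b = df['B'].explode()
c = df['C'].explode()
out = df.drop(columns=['B', 'C']).join(b).join(c)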
If you are in Jupyter notebook, you could run the following code to interactively display the dataframe in a well formatted table.
This answer builds on the to_html(‘temp.html’) answer above, but instead of creating a file displays the well formatted table directly in the notebook:
from IPython.display import display, HTML
display(HTML(df.to_html()))
You can use prettytable to render the table as text. The trick is to convert the data_frame to an in-memory csv file and have prettytable read it. Here’s the code:
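The snippet itself is missing above; a minimal sketch of the described trick, writing the frame into an in-memory CSV buffer and letting prettytable read it back:

from io import StringIO

import pandas as pd
from prettytable import from_csv

df = pd.DataFrame({'A': [1, 2], 'B': ['a', 'b']})

buf = StringIO()
df.to_csv(buf)   # serialize the frame as CSV into the buffer
buf.seek(0)      # rewind so prettytable reads from the start
print(from_csv(buf))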
I used Ofer’s answer for a while and found it great in most cases. Unfortunately, due to inconsistencies between pandas’s to_csv and prettytable‘s from_csv, I had to use prettytable in a different way.
One failure case is a dataframe containing commas:
pd.DataFrame({'A': [1, 2], 'B': ['a,', 'b']})
Prettytable raises an error of the form:
Error: Could not determine delimiter
The following function handles this case:
from prettytable import PrettyTable

def format_for_print(df):
    table = PrettyTable([''] + list(df.columns))
    for row in df.itertuples():
        table.add_row(row)
    return str(table)
If you don’t care about the index, use:
def format_for_print2(df):
    table = PrettyTable(list(df.columns))
    for row in df.itertuples():
        table.add_row(row[1:])
    return str(table)
Following up on Mark’s answer, if you’re not using Jupyter for some reason, e.g. you want to do some quick testing on the console, you can use the DataFrame.to_string method, which has worked since at least pandas 0.12.
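For example:

print(df.to_string())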
I wanted a paper printout of a dataframe, but I also wanted to add some results and comments on the same page.
I worked through the above and could not get what I wanted. I ended up using
file.write(df1.to_csv()) and file.write(",,,blah,,,,,,blah") statements to get my extras on the page.
When I opened the csv file, it went straight to a spreadsheet, which printed everything in the right place and format.