Python 实用宝典

Question 1

I want to create a new column in a pandas data frame by applying a function to two existing columns. Following this answer I’ve been able to create a new column when I only need one column as an argument:

import pandas as pd
df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})

def fx(x):
    return x * x

print(df)
df['newcolumn'] = df.A.apply(fx)
print(df)

However, I cannot figure out how to do the same thing when the function requires multiple arguments. For example, how do I create a new column by passing column A and column B to the function below?

def fxy(x, y):
    return x * y

Question 2

Alternatively, you can use numpy underlying function:

>>> import numpy as np
>>> df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})
>>> df['new_column'] = np.multiply(df['A'], df['B'])
>>> df
    A   B  new_column
0  10  20         200
1  20  30         600
2  30  10         300

or vectorize arbitrary function in general case:

>>> def fx(x, y):
...     return x*y
...
>>> df['new_column'] = np.vectorize(fx)(df['A'], df['B'])
>>> df
    A   B  new_column
0  10  20         200
1  20  30         600
2  30  10         300

Question 3

You can go with @greenAfrican example, if it’s possible for you to rewrite your function. But if you don’t want to rewrite your function, you can wrap it into anonymous function inside apply, like this:

>>> def fxy(x, y):
...     return x * y

>>> df['newcolumn'] = df.apply(lambda x: fxy(x['A'], x['B']), axis=1)
>>> df
    A   B  newcolumn
0  10  20        200
1  20  30        600
2  30  10        300

Question 4

This solves the problem:

df['newcolumn'] = df.A * df.B

You could also do:

def fab(row):
  return row['A'] * row['B']

df['newcolumn'] = df.apply(fab, axis=1)

Question 5

If you need to create multiple columns at once:

Create the dataframe:

import pandas as pd
df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})

Create the function:

def fab(row):                                                  
    return row['A'] * row['B'], row['A'] + row['B']

Assign the new columns:

df['newcolumn'], df['newcolumn2'] = zip(*df.apply(fab, axis=1))

Question 6

One more dict style clean syntax:

df["new_column"] = df.apply(lambda x: x["A"] * x["B"], axis = 1)

or,

df["new_column"] = df["A"] * df["B"]

Question 7

I would like to display a pandas dataframe with a given format using print() and the IPython display(). For example:

df = pd.DataFrame([123.4567, 234.5678, 345.6789, 456.7890],
                  index=['foo','bar','baz','quux'],
                  columns=['cost'])
print df

         cost
foo   123.4567
bar   234.5678
baz   345.6789
quux  456.7890

I would like to somehow coerce this into printing

         cost
foo   $123.46
bar   $234.57
baz   $345.68
quux  $456.79

without having to modify the data itself or create a copy, just change the way it is displayed.

How can I do this?

Question 8

import pandas as pd
pd.options.display.float_format = '${:,.2f}'.format
df = pd.DataFrame([123.4567, 234.5678, 345.6789, 456.7890],
                  index=['foo','bar','baz','quux'],
                  columns=['cost'])
print(df)

yields

        cost
foo  $123.46
bar  $234.57
baz  $345.68
quux $456.79

but this only works if you want every float to be formatted with a dollar sign.

Otherwise, if you want dollar formatting for some floats only, then I think you’ll have to pre-modify the dataframe (converting those floats to strings):

import pandas as pd
df = pd.DataFrame([123.4567, 234.5678, 345.6789, 456.7890],
                  index=['foo','bar','baz','quux'],
                  columns=['cost'])
df['foo'] = df['cost']
df['cost'] = df['cost'].map('${:,.2f}'.format)
print(df)

yields

         cost       foo
foo   $123.46  123.4567
bar   $234.57  234.5678
baz   $345.68  345.6789
quux  $456.79  456.7890

Question 9

If you don’t want to modify the dataframe, you could use a custom formatter for that column.

import pandas as pd
pd.options.display.float_format = '${:,.2f}'.format
df = pd.DataFrame([123.4567, 234.5678, 345.6789, 456.7890],
                  index=['foo','bar','baz','quux'],
                  columns=['cost'])


print df.to_string(formatters={'cost':'${:,.2f}'.format})

yields

        cost
foo  $123.46
bar  $234.57
baz  $345.68
quux $456.79

Question 10

As of Pandas 0.17 there is now a styling system which essentially provides formatted views of a DataFrame using Python format strings:

import pandas as pd
import numpy as np

constants = pd.DataFrame([('pi',np.pi),('e',np.e)],
                   columns=['name','value'])
C = constants.style.format({'name': '~~ {} ~~', 'value':'--> {:15.10f} <--'})
C

which displays

This is a view object; the DataFrame itself does not change formatting, but updates in the DataFrame are reflected in the view:

constants.name = ['pie','eek']
C

However it appears to have some limitations:

Adding new rows and/or columns in-place seems to cause inconsistency in the styled view (doesn’t add row/column labels):
```
constants.loc[2] = dict(name='bogus', value=123.456)
constants['comment'] = ['fee','fie','fo']
constants
```

which looks ok but:

Formatting works only for values, not index entries:

constants = pd.DataFrame([('pi',np.pi),('e',np.e)],
               columns=['name','value'])
constants.set_index('name',inplace=True)
C = constants.style.format({'name': '~~ {} ~~', 'value':'--> {:15.10f} <--'})
C

Question 11

Similar to unutbu above, you could also use applymap as follows:

import pandas as pd
df = pd.DataFrame([123.4567, 234.5678, 345.6789, 456.7890],
                  index=['foo','bar','baz','quux'],
                  columns=['cost'])

df = df.applymap("${0:.2f}".format)

Question 12

I like using pandas.apply() with python format().

import pandas as pd
s = pd.Series([1.357, 1.489, 2.333333])

make_float = lambda x: "${:,.2f}".format(x)
s.apply(make_float)

Also, it can be easily used with multiple columns…

df = pd.concat([s, s * 2], axis=1)

make_floats = lambda row: "${:,.2f}, ${:,.3f}".format(row[0], row[1])
df.apply(make_floats, axis=1)

Question 13

You can also set locale to your region and set float_format to use a currency format. This will automatically set $ sign for currency in USA.

import locale

locale.setlocale(locale.LC_ALL, "en_US.UTF-8")

pd.set_option("float_format", locale.currency)

df = pd.DataFrame(
    [123.4567, 234.5678, 345.6789, 456.7890],
    index=["foo", "bar", "baz", "quux"],
    columns=["cost"],
)
print(df)

        cost
foo  $123.46
bar  $234.57
baz  $345.68
quux $456.79

Question 14

summary:


    df = pd.DataFrame({'money': [100.456, 200.789], 'share': ['100,000', '200,000']})
    print(df)
    print(df.to_string(formatters={'money': '${:,.2f}'.format}))
    for col_name in ('share',):
        df[col_name] = df[col_name].map(lambda p: int(p.replace(',', '')))
    print(df)
    """
        money    share
    0  100.456  100,000
    1  200.789  200,000

        money    share
    0 $100.46  100,000
    1 $200.79  200,000

         money   share
    0  100.456  100000
    1  200.789  200000
    """

Question 15

I have the following DataFrame:

customer    item1      item2    item3
1           apple      milk     tomato
2           water      orange   potato
3           juice      mango    chips

which I want to translate it to list of dictionaries per row

rows = [{'customer': 1, 'item1': 'apple', 'item2': 'milk', 'item3': 'tomato'},
    {'customer': 2, 'item1': 'water', 'item2': 'orange', 'item3': 'potato'},
    {'customer': 3, 'item1': 'juice', 'item2': 'mango', 'item3': 'chips'}]

Question 16

Edit

As John Galt mentions in his answer , you should probably instead use df.to_dict('records'). It’s faster than transposing manually.

In [20]: timeit df.T.to_dict().values()
1000 loops, best of 3: 395 µs per loop

In [21]: timeit df.to_dict('records')
10000 loops, best of 3: 53 µs per loop

Original answer

Use df.T.to_dict().values(), like below:

In [1]: df
Out[1]:
   customer  item1   item2   item3
0         1  apple    milk  tomato
1         2  water  orange  potato
2         3  juice   mango   chips

In [2]: df.T.to_dict().values()
Out[2]:
[{'customer': 1.0, 'item1': 'apple', 'item2': 'milk', 'item3': 'tomato'},
 {'customer': 2.0, 'item1': 'water', 'item2': 'orange', 'item3': 'potato'},
 {'customer': 3.0, 'item1': 'juice', 'item2': 'mango', 'item3': 'chips'}]

Question 17

Use df.to_dict('records') — gives the output without having to transpose externally.

In [2]: df.to_dict('records')
Out[2]:
[{'customer': 1L, 'item1': 'apple', 'item2': 'milk', 'item3': 'tomato'},
 {'customer': 2L, 'item1': 'water', 'item2': 'orange', 'item3': 'potato'},
 {'customer': 3L, 'item1': 'juice', 'item2': 'mango', 'item3': 'chips'}]

Question 18

As an extension to John Galt’s answer –

For the following DataFrame,

   customer  item1   item2   item3
0         1  apple    milk  tomato
1         2  water  orange  potato
2         3  juice   mango   chips

If you want to get a list of dictionaries including the index values, you can do something like,

df.to_dict('index')

Which outputs a dictionary of dictionaries where keys of the parent dictionary are index values. In this particular case,

{0: {'customer': 1, 'item1': 'apple', 'item2': 'milk', 'item3': 'tomato'},
 1: {'customer': 2, 'item1': 'water', 'item2': 'orange', 'item3': 'potato'},
 2: {'customer': 3, 'item1': 'juice', 'item2': 'mango', 'item3': 'chips'}}

Question 19

If you are interested in only selecting one column this will work.

df[["item1"]].to_dict("records")

The below will NOT work and produces a TypeError: unsupported type: . I believe this is because it is trying to convert a series to a dict and not a Data Frame to a dict.

df["item1"].to_dict("records")

I had a requirement to only select one column and convert it to a list of dicts with the column name as the key and was stuck on this for a bit so figured I’d share.

Question 20

I have a large spreadsheet file (.xlsx) that I’m processing using python pandas. It happens that I need data from two tabs in that large file. One of the tabs has a ton of data and the other is just a few square cells.

When I use pd.read_excel() on any worksheet, it looks to me like the whole file is loaded (not just the worksheet I’m interested in). So when I use the method twice (once for each sheet), I effectively have to suffer the whole workbook being read in twice (even though we’re only using the specified sheet).

Am I using it wrong or is it just limited in this way?

Thank you!

Question 21

Try pd.ExcelFile:

xls = pd.ExcelFile('path_to_file.xls')
df1 = pd.read_excel(xls, 'Sheet1')
df2 = pd.read_excel(xls, 'Sheet2')

As noted by @HaPsantran, the entire Excel file is read in during the ExcelFile() call (there doesn’t appear to be a way around this). This merely saves you from having to read the same file in each time you want to access a new sheet.

Note that the sheet_name argument to pd.read_excel() can be the name of the sheet (as above), an integer specifying the sheet number (eg 0, 1, etc), a list of sheet names or indices, or None. If a list is provided, it returns a dictionary where the keys are the sheet names/indices and the values are the data frames. The default is to simply return the first sheet (ie, sheet_name=0).

If None is specified, all sheets are returned, as a {sheet_name:dataframe} dictionary.

Question 22

There are a few options:

Read all sheets directly into an ordered dictionary.

import pandas as pd

# for pandas version >= 0.21.0
sheet_to_df_map = pd.read_excel(file_name, sheet_name=None)

# for pandas version < 0.21.0
sheet_to_df_map = pd.read_excel(file_name, sheetname=None)

Read the first sheet directly into dataframe

df = pd.read_excel('excel_file_path.xls')
# this will read the first sheet into df

Read the excel file and get a list of sheets. Then chose and load the sheets.

xls = pd.ExcelFile('excel_file_path.xls')

# Now you can list all sheets in the file
xls.sheet_names
# ['house', 'house_extra', ...]

# to read just one sheet to dataframe:
df = pd.read_excel(file_name, sheetname="house")

Read all sheets and store it in a dictionary. Same as first but more explicit.

# to read all sheets to a map
sheet_to_df_map = {}
for sheet_name in xls.sheet_names:
    sheet_to_df_map[sheet_name] = xls.parse(sheet_name)
    # you can also use sheet_index [0,1,2..] instead of sheet name.

Thanks @ihightower for pointing it out way to read all sheets and @toto_tico for pointing out the version issue.

sheetname : string, int, mixed list of strings/ints, or None, default 0 Deprecated since version 0.21.0: Use sheet_name instead Source Link

Question 23

You can also use the index for the sheet:

xls = pd.ExcelFile('path_to_file.xls')
sheet1 = xls.parse(0)

will give the first worksheet. for the second worksheet:

sheet2 = xls.parse(1)

Question 24

You could also specify the sheet name as a parameter:

data_file = pd.read_excel('path_to_file.xls', sheet_name="sheet_name")

will upload only the sheet "sheet_name".

Question 25

pd.read_excel('filename.xlsx')

by default read the first sheet of workbook.

pd.read_excel('filename.xlsx', sheet_name = 'sheetname')

read the specific sheet of workbook and

pd.read_excel('filename.xlsx', sheet_name = None)

read all the worksheets from excel to pandas dataframe as a type of OrderedDict means nested dataframes, all the worksheets as dataframes collected inside dataframe and it’s type is OrderedDict.

Question 26

Yes unfortunately it will always load the full file. If you’re doing this repeatedly probably best to extract the sheets to separate CSVs and then load separately. You can automate that process with d6tstack which also adds additional features like checking if all the columns are equal across all sheets or multiple Excel files.

import d6tstack
c = d6tstack.convert_xls.XLStoCSVMultiSheet('multisheet.xlsx')
c.convert_all() # ['multisheet-Sheet1.csv','multisheet-Sheet2.csv']

See d6tstack Excel examples

Question 27

If you have saved the excel file in the same folder as your python program (relative paths) then you just need to mention sheet number along with file name.

Example:

 data = pd.read_excel("wt_vs_ht.xlsx", "Sheet2")
 print(data)
 x = data.Height
 y = data.Weight
 plt.plot(x,y,'x')
 plt.show()

Question 28

How do I take multiple lists and put them as different columns in a python dataframe? I tried this solution but had some trouble.

Attempt 1:

Have three lists, and zip them together and use that res = zip(lst1,lst2,lst3)
Yields just one column

Attempt 2:

percentile_list = pd.DataFrame({'lst1Tite' : [lst1],
                                'lst2Tite' : [lst2],
                                'lst3Tite' : [lst3] }, 
                                columns=['lst1Tite','lst1Tite', 'lst1Tite'])

yields either one row by 3 columns (the way above) or if I transpose it is 3 rows and 1 column

How do I get a 100 row (length of each independent list) by 3 column (three lists) pandas dataframe?

Question 29

I think you’re almost there, try removing the extra square brackets around the lst‘s (Also you don’t need to specify the column names when you’re creating a dataframe from a dict like this):

import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
percentile_list = pd.DataFrame(
    {'lst1Title': lst1,
     'lst2Title': lst2,
     'lst3Title': lst3
    })

percentile_list
    lst1Title  lst2Title  lst3Title
0          0         0         0
1          1         1         1
2          2         2         2
3          3         3         3
4          4         4         4
5          5         5         5
6          6         6         6
...

If you need a more performant solution you can use np.column_stack rather than zip as in your first attempt, this has around a 2x speedup on the example here, however comes at bit of a cost of readability in my opinion:

import numpy as np
percentile_list = pd.DataFrame(np.column_stack([lst1, lst2, lst3]), 
                               columns=['lst1Title', 'lst2Title', 'lst3Title'])

Question 30

Adding to Aditya Guru‘s answer here. There is no need of using map. You can do it simply by:

pd.DataFrame(list(zip(lst1, lst2, lst3)))

This will set the column’s names as 0,1,2. To set your own column names, you can pass the keyword argument columns to the method above.

pd.DataFrame(list(zip(lst1, lst2, lst3)),
              columns=['lst1_title','lst2_title', 'lst3_title'])

Question 31

Just adding that using the first approach it can be done as –

pd.DataFrame(list(map(list, zip(lst1,lst2,lst3))))

Question 32

Adding one more scalable solution.

lists = [lst1, lst2, lst3, lst4]
df = pd.concat([pd.Series(x) for x in lists], axis=1)

Question 33

Adding to above answers, we can create on the fly

df= pd.DataFrame()
list1 = list(range(10))
list2 = list(range(10,20))
df['list1'] = list1
df['list2'] = list2
print(df)

hope it helps !

Question 34

@oopsi used pd.concat() but didn’t include the column names. You could do the following, which, unlike the first solution in the accepted answer, gives you control over the column order (avoids dicts, which are unordered):

import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)

s1=pd.Series(lst1,name='lst1Title')
s2=pd.Series(lst2,name='lst2Title')
s3=pd.Series(lst3 ,name='lst3Title')
percentile_list = pd.concat([s1,s2,s3], axis=1)

percentile_list
Out[2]: 
    lst1Title  lst2Title  lst3Title
0           0          0          0
1           1          1          1
2           2          2          2
3           3          3          3
4           4          4          4
5           5          5          5
6           6          6          6
7           7          7          7
8           8          8          8
...

Question 35

There are several ways to create a dataframe from multiple lists.

list1=[1,2,3,4]
list2=[5,6,7,8]
list3=[9,10,11,12]

pd.DataFrame({'list1':list1, 'list2':list2, 'list3'=list3})
pd.DataFrame(data=zip(list1,list2,list3),columns=['list1','list2','list3'])

Question 36

you can simple use this following code

train_data['labels']= train_data[["LABEL1","LABEL1","LABEL2","LABEL3","LABEL4","LABEL5","LABEL6","LABEL7"]].values.tolist()
train_df = pd.DataFrame(train_data, columns=['text','labels'])

Question 37

I have a dataframe where some cells contain lists of multiple values. Rather than storing multiple values in a cell, I’d like to expand the dataframe so that each item in the list gets its own row (with the same values in all other columns). So if I have:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'trial_num': [1, 2, 3, 1, 2, 3],
     'subject': [1, 1, 1, 2, 2, 2],
     'samples': [list(np.random.randn(3).round(2)) for i in range(6)]
    }
)

df
Out[10]: 
                 samples  subject  trial_num
0    [0.57, -0.83, 1.44]        1          1
1    [-0.01, 1.13, 0.36]        1          2
2   [1.18, -1.46, -0.94]        1          3
3  [-0.08, -4.22, -2.05]        2          1
4     [0.72, 0.79, 0.53]        2          2
5    [0.4, -0.32, -0.13]        2          3

How do I convert to long form, e.g.:

   subject  trial_num  sample  sample_num
0        1          1    0.57           0
1        1          1   -0.83           1
2        1          1    1.44           2
3        1          2   -0.01           0
4        1          2    1.13           1
5        1          2    0.36           2
6        1          3    1.18           0
# etc.

The index is not important, it’s OK to set existing columns as the index and the final ordering isn’t important.

Question 38

lst_col = 'samples'

r = pd.DataFrame({
      col:np.repeat(df[col].values, df[lst_col].str.len())
      for col in df.columns.drop(lst_col)}
    ).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]

Result:

In [103]: r
Out[103]:
    samples  subject  trial_num
0      0.10        1          1
1     -0.20        1          1
2      0.05        1          1
3      0.25        1          2
4      1.32        1          2
5     -0.17        1          2
6      0.64        1          3
7     -0.22        1          3
8     -0.71        1          3
9     -0.03        2          1
10    -0.65        2          1
11     0.76        2          1
12     1.77        2          2
13     0.89        2          2
14     0.65        2          2
15    -0.98        2          3
16     0.65        2          3
17    -0.30        2          3

PS here you may find a bit more generic solution

UPDATE: some explanations: IMO the easiest way to understand this code is to try to execute it step-by-step:

in the following line we are repeating values in one column N times where N – is the length of the corresponding list:

In [10]: np.repeat(df['trial_num'].values, df[lst_col].str.len())
Out[10]: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int64)

this can be generalized for all columns, containing scalar values:

In [11]: pd.DataFrame({
    ...:           col:np.repeat(df[col].values, df[lst_col].str.len())
    ...:           for col in df.columns.drop(lst_col)}
    ...:         )
Out[11]:
    trial_num  subject
0           1        1
1           1        1
2           1        1
3           2        1
4           2        1
5           2        1
6           3        1
..        ...      ...
11          1        2
12          2        2
13          2        2
14          2        2
15          3        2
16          3        2
17          3        2

[18 rows x 2 columns]

using np.concatenate() we can flatten all values in the list column (samples) and get a 1D vector:

In [12]: np.concatenate(df[lst_col].values)
Out[12]: array([-1.04, -0.58, -1.32,  0.82, -0.59, -0.34,  0.25,  2.09,  0.12,  0.83, -0.88,  0.68,  0.55, -0.56,  0.65, -0.04,  0.36, -0.31])

putting all this together:

In [13]: pd.DataFrame({
    ...:           col:np.repeat(df[col].values, df[lst_col].str.len())
    ...:           for col in df.columns.drop(lst_col)}
    ...:         ).assign(**{lst_col:np.concatenate(df[lst_col].values)})
Out[13]:
    trial_num  subject  samples
0           1        1    -1.04
1           1        1    -0.58
2           1        1    -1.32
3           2        1     0.82
4           2        1    -0.59
5           2        1    -0.34
6           3        1     0.25
..        ...      ...      ...
11          1        2     0.68
12          2        2     0.55
13          2        2    -0.56
14          2        2     0.65
15          3        2    -0.04
16          3        2     0.36
17          3        2    -0.31

[18 rows x 3 columns]

using pd.DataFrame()[df.columns] will guarantee that we are selecting columns in the original order…

Question 39

A bit longer than I expected:

>>> df
                samples  subject  trial_num
0  [-0.07, -2.9, -2.44]        1          1
1   [-1.52, -0.35, 0.1]        1          2
2  [-0.17, 0.57, -0.65]        1          3
3  [-0.82, -1.06, 0.47]        2          1
4   [0.79, 1.35, -0.09]        2          2
5   [1.17, 1.14, -1.79]        2          3
>>>
>>> s = df.apply(lambda x: pd.Series(x['samples']),axis=1).stack().reset_index(level=1, drop=True)
>>> s.name = 'sample'
>>>
>>> df.drop('samples', axis=1).join(s)
   subject  trial_num  sample
0        1          1   -0.07
0        1          1   -2.90
0        1          1   -2.44
1        1          2   -1.52
1        1          2   -0.35
1        1          2    0.10
2        1          3   -0.17
2        1          3    0.57
2        1          3   -0.65
3        2          1   -0.82
3        2          1   -1.06
3        2          1    0.47
4        2          2    0.79
4        2          2    1.35
4        2          2   -0.09
5        2          3    1.17
5        2          3    1.14
5        2          3   -1.79

If you want sequential index, you can apply reset_index(drop=True) to the result.

update:

>>> res = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack()
>>> res = res.reset_index()
>>> res.columns = ['subject','trial_num','sample_num','sample']
>>> res
    subject  trial_num  sample_num  sample
0         1          1           0    1.89
1         1          1           1   -2.92
2         1          1           2    0.34
3         1          2           0    0.85
4         1          2           1    0.24
5         1          2           2    0.72
6         1          3           0   -0.96
7         1          3           1   -2.72
8         1          3           2   -0.11
9         2          1           0   -1.33
10        2          1           1    3.13
11        2          1           2   -0.65
12        2          2           0    0.10
13        2          2           1    0.65
14        2          2           2    0.15
15        2          3           0    0.64
16        2          3           1   -0.10
17        2          3           2   -0.76

Question 40

Pandas >= 0.25

Series and DataFrame methods define a .explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column.

df = pd.DataFrame({
    'var1': [['a', 'b', 'c'], ['d', 'e',], [], np.nan], 
    'var2': [1, 2, 3, 4]
})
df
        var1  var2
0  [a, b, c]     1
1     [d, e]     2
2         []     3
3        NaN     4

df.explode('var1')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
2  NaN     3  # empty list converted to NaN
3  NaN     4  # NaN entry preserved as-is

# to reset the index to be monotonically increasing...
df.explode('var1').reset_index(drop=True)

  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5  NaN     3
6  NaN     4

Note that this also handles mixed columns of lists and scalars, as well as empty lists and NaNs appropriately (this is a drawback of repeat-based solutions).

However, you should note that explode only works on a single column (for now).

P.S.: if you are looking to explode a column of strings, you need to split on a separator first, then use explode. See this (very much) related answer by me.

Question 41

you can also use pd.concat and pd.melt for this:

>>> objs = [df, pd.DataFrame(df['samples'].tolist())]
>>> pd.concat(objs, axis=1).drop('samples', axis=1)
   subject  trial_num     0     1     2
0        1          1 -0.49 -1.00  0.44
1        1          2 -0.28  1.48  2.01
2        1          3 -0.52 -1.84  0.02
3        2          1  1.23 -1.36 -1.06
4        2          2  0.54  0.18  0.51
5        2          3 -2.18 -0.13 -1.35
>>> pd.melt(_, var_name='sample_num', value_name='sample', 
...         value_vars=[0, 1, 2], id_vars=['subject', 'trial_num'])
    subject  trial_num sample_num  sample
0         1          1          0   -0.49
1         1          2          0   -0.28
2         1          3          0   -0.52
3         2          1          0    1.23
4         2          2          0    0.54
5         2          3          0   -2.18
6         1          1          1   -1.00
7         1          2          1    1.48
8         1          3          1   -1.84
9         2          1          1   -1.36
10        2          2          1    0.18
11        2          3          1   -0.13
12        1          1          2    0.44
13        1          2          2    2.01
14        1          3          2    0.02
15        2          1          2   -1.06
16        2          2          2    0.51
17        2          3          2   -1.35

last, if you need you can sort base on the first the first three columns.

Question 42

Trying to work through Roman Pekar’s solution step-by-step to understand it better, I came up with my own solution, which uses melt to avoid some of the confusing stacking and index resetting. I can’t say that it’s obviously a clearer solution though:

items_as_cols = df.apply(lambda x: pd.Series(x['samples']), axis=1)
# Keep original df index as a column so it's retained after melt
items_as_cols['orig_index'] = items_as_cols.index

melted_items = pd.melt(items_as_cols, id_vars='orig_index', 
                       var_name='sample_num', value_name='sample')
melted_items.set_index('orig_index', inplace=True)

df.merge(melted_items, left_index=True, right_index=True)

Output (obviously we can drop the original samples column now):

                 samples  subject  trial_num sample_num  sample
0    [1.84, 1.05, -0.66]        1          1          0    1.84
0    [1.84, 1.05, -0.66]        1          1          1    1.05
0    [1.84, 1.05, -0.66]        1          1          2   -0.66
1    [-0.24, -0.9, 0.65]        1          2          0   -0.24
1    [-0.24, -0.9, 0.65]        1          2          1   -0.90
1    [-0.24, -0.9, 0.65]        1          2          2    0.65
2    [1.15, -0.87, -1.1]        1          3          0    1.15
2    [1.15, -0.87, -1.1]        1          3          1   -0.87
2    [1.15, -0.87, -1.1]        1          3          2   -1.10
3   [-0.8, -0.62, -0.68]        2          1          0   -0.80
3   [-0.8, -0.62, -0.68]        2          1          1   -0.62
3   [-0.8, -0.62, -0.68]        2          1          2   -0.68
4    [0.91, -0.47, 1.43]        2          2          0    0.91
4    [0.91, -0.47, 1.43]        2          2          1   -0.47
4    [0.91, -0.47, 1.43]        2          2          2    1.43
5  [-1.14, -0.24, -0.91]        2          3          0   -1.14
5  [-1.14, -0.24, -0.91]        2          3          1   -0.24
5  [-1.14, -0.24, -0.91]        2          3          2   -0.91

Question 43

For those looking for a version of Roman Pekar’s answer that avoids manual column naming:

column_to_explode = 'samples'
res = (df
       .set_index([x for x in df.columns if x != column_to_explode])[column_to_explode]
       .apply(pd.Series)
       .stack()
       .reset_index())
res = res.rename(columns={
          res.columns[-2]:'exploded_{}_index'.format(column_to_explode),
          res.columns[-1]: '{}_exploded'.format(column_to_explode)})

Question 44

I found the easiest way was to:

Convert the samples column into a DataFrame
Joining with the original df
Melting

Shown here:

    df.samples.apply(lambda x: pd.Series(x)).join(df).\
melt(['subject','trial_num'],[0,1,2],var_name='sample')

        subject  trial_num sample  value
    0         1          1      0  -0.24
    1         1          2      0   0.14
    2         1          3      0  -0.67
    3         2          1      0  -1.52
    4         2          2      0  -0.00
    5         2          3      0  -1.73
    6         1          1      1  -0.70
    7         1          2      1  -0.70
    8         1          3      1  -0.29
    9         2          1      1  -0.70
    10        2          2      1  -0.72
    11        2          3      1   1.30
    12        1          1      2  -0.55
    13        1          2      2   0.10
    14        1          3      2  -0.44
    15        2          1      2   0.13
    16        2          2      2  -1.44
    17        2          3      2   0.73

It’s worth noting that this may have only worked because each trial has the same number of samples (3). Something more clever may be necessary for trials of different sample sizes.

Question 45

Very late answer but I want to add this:

A fast solution using vanilla Python that also takes care of the sample_num column in OP’s example. On my own large dataset with over 10 million rows and a result with 28 million rows this only takes about 38 seconds. The accepted solution completely breaks down with that amount of data and leads to a memory error on my system that has 128GB of RAM.

df = df.reset_index(drop=True)
lstcol = df.lstcol.values
lstcollist = []
indexlist = []
countlist = []
for ii in range(len(lstcol)):
    lstcollist.extend(lstcol[ii])
    indexlist.extend([ii]*len(lstcol[ii]))
    countlist.extend([jj for jj in range(len(lstcol[ii]))])
df = pd.merge(df.drop("lstcol",axis=1),pd.DataFrame({"lstcol":lstcollist,"lstcol_num":countlist},
index=indexlist),left_index=True,right_index=True).reset_index(drop=True)

Question 46

Also very late, but here is an answer from Karvy1 that worked well for me if you don’t have pandas >=0.25 version: https://stackoverflow.com/a/52511166/10740287

For the example above you may write:

data = [(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples]
data = pd.DataFrame(data, columns=['subject', 'trial_num', 'samples'])

Speed test:

%timeit data = pd.DataFrame([(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples], columns=['subject', 'trial_num', 'samples'])

1.33 ms ± 74.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit data = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack().reset_index()

4.9 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit data = pd.DataFrame({col:np.repeat(df[col].values, df['samples'].str.len())for col in df.columns.drop('samples')}).assign(**{'samples':np.concatenate(df['samples'].values)})

1.38 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Question 47

import pandas as pd
df = pd.DataFrame([{'Product': 'Coke', 'Prices': [100,123,101,105,99,94,98]},{'Product': 'Pepsi', 'Prices': [101,104,104,101,99,99,99]}])
print(df)
df = df.assign(Prices=df.Prices.str.split(',')).explode('Prices')
print(df)

Try this in pandas >=0.25 version

Question 48

I am trying to highlight exactly what changed between two dataframes.

Suppose I have two Python Pandas dataframes:

"StudentRoster Jan-1":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                Graduated
113  Zoe    4.12                     True       

"StudentRoster Jan-2":
id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                Graduated
113  Zoe    4.12                     False                On vacation

My goal is to output an HTML table that:

Identifies rows that have changed (could be int, float, boolean, string)

Outputs rows with same, OLD and NEW values (ideally into an HTML table) so the consumer can clearly see what changed between two dataframes:

"StudentRoster Difference Jan-1 - Jan-2":  
id   Name   score                    isEnrolled           Comment
112  Nick   was 1.11| now 1.21       False                Graduated
113  Zoe    4.12                     was True | now False was "" | now   "On   vacation"

I suppose I could do a row by row and column by column comparison, but is there an easier way?

Question 49

The first part is similar to Constantine, you can get the boolean of which rows are empty*:

In [21]: ne = (df1 != df2).any(1)

In [22]: ne
Out[22]:
0    False
1     True
2     True
dtype: bool

Then we can see which entries have changed:

In [23]: ne_stacked = (df1 != df2).stack()

In [24]: changed = ne_stacked[ne_stacked]

In [25]: changed.index.names = ['id', 'col']

In [26]: changed
Out[26]:
id  col
1   score         True
2   isEnrolled    True
    Comment       True
dtype: bool

Here the first entry is the index and the second the columns which has been changed.

In [27]: difference_locations = np.where(df1 != df2)

In [28]: changed_from = df1.values[difference_locations]

In [29]: changed_to = df2.values[difference_locations]

In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
Out[30]:
               from           to
id col
1  score       1.11         1.21
2  isEnrolled  True        False
   Comment     None  On vacation

* Note: it’s important that df1 and df2 share the same index here. To overcome this ambiguity, you can ensure you only look at the shared labels using df1.index & df2.index, but I think I’ll leave that as an exercise.

Question 50

Highlighting the difference between two DataFrames

It is possible to use the DataFrame style property to highlight the background color of the cells where there is a difference.

Using the example data from the original question

The first step is to concatenate the DataFrames horizontally with the concat function and distinguish each frame with the keys parameter:

df_all = pd.concat([df.set_index('id'), df2.set_index('id')], 
                   axis='columns', keys=['First', 'Second'])
df_all

It’s probably easier to swap the column levels and put the same column names next to each other:

df_final = df_all.swaplevel(axis='columns')[df.columns[1:]]
df_final

Now, its much easier to spot the differences in the frames. But, we can go further and use the style property to highlight the cells that are different. We define a custom function to do this which you can see in this part of the documentation.

def highlight_diff(data, color='yellow'):
    attr = 'background-color: {}'.format(color)
    other = data.xs('First', axis='columns', level=-1)
    return pd.DataFrame(np.where(data.ne(other, level=0), attr, ''),
                        index=data.index, columns=data.columns)

df_final.style.apply(highlight_diff, axis=None)

This will highlight cells that both have missing values. You can either fill them or provide extra logic so that they don’t get highlighted.

Question 51

This answer simply extends @Andy Hayden’s, making it resilient to when numeric fields are nan, and wrapping it up into a function.

import pandas as pd
import numpy as np


def diff_pd(df1, df2):
    """Identify differences between two pandas DataFrames"""
    assert (df1.columns == df2.columns).all(), \
        "DataFrame column names are different"
    if any(df1.dtypes != df2.dtypes):
        "Data Types are different, trying to convert"
        df2 = df2.astype(df1.dtypes)
    if df1.equals(df2):
        return None
    else:
        # need to account for np.nan != np.nan returning True
        diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
        ne_stacked = diff_mask.stack()
        changed = ne_stacked[ne_stacked]
        changed.index.names = ['id', 'col']
        difference_locations = np.where(diff_mask)
        changed_from = df1.values[difference_locations]
        changed_to = df2.values[difference_locations]
        return pd.DataFrame({'from': changed_from, 'to': changed_to},
                            index=changed.index)

So with your data (slightly edited to have a NaN in the score column):

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.11                     False                "Graduated"
113  Zoe    NaN                     True                  " "
""")
DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.21                     False                "Graduated"
113  Zoe    NaN                     False                "On vacation" """)
df1 = pd.read_table(DF1, sep='\s+', index_col='id')
df2 = pd.read_table(DF2, sep='\s+', index_col='id')
diff_pd(df1, df2)

Output:

                from           to
id  col                          
112 score       1.11         1.21
113 isEnrolled  True        False
    Comment           On vacation

Question 52

import pandas as pd
import io

texts = ['''\
id   Name   score                    isEnrolled                        Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.11                     False                           Graduated
113  Zoe    4.12                     True       ''',

         '''\
id   Name   score                    isEnrolled                        Comment
111  Jack   2.17                     True                 He was late to class
112  Nick   1.21                     False                           Graduated
113  Zoe    4.12                     False                         On vacation''']


df1 = pd.read_fwf(io.StringIO(texts[0]), widths=[5,7,25,21,20])
df2 = pd.read_fwf(io.StringIO(texts[1]), widths=[5,7,25,21,20])
df = pd.concat([df1,df2]) 

print(df)
#     id  Name  score isEnrolled               Comment
# 0  111  Jack   2.17       True  He was late to class
# 1  112  Nick   1.11      False             Graduated
# 2  113   Zoe   4.12       True                   NaN
# 0  111  Jack   2.17       True  He was late to class
# 1  112  Nick   1.21      False             Graduated
# 2  113   Zoe   4.12      False           On vacation

df.set_index(['id', 'Name'], inplace=True)
print(df)
#           score isEnrolled               Comment
# id  Name                                        
# 111 Jack   2.17       True  He was late to class
# 112 Nick   1.11      False             Graduated
# 113 Zoe    4.12       True                   NaN
# 111 Jack   2.17       True  He was late to class
# 112 Nick   1.21      False             Graduated
# 113 Zoe    4.12      False           On vacation

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

changes = df.groupby(level=['id', 'Name']).agg(report_diff)
print(changes)

prints

                score    isEnrolled               Comment
id  Name                                                 
111 Jack         2.17          True  He was late to class
112 Nick  1.11 | 1.21         False             Graduated
113 Zoe          4.12  True | False     nan | On vacation

Question 53

I have faced this issue, but found an answer before finding this post :

Based on unutbu’s answer, load your data…

import pandas as pd
import io

texts = ['''\
id   Name   score                    isEnrolled                       Date
111  Jack                            True              2013-05-01 12:00:00
112  Nick   1.11                     False             2013-05-12 15:05:23
     Zoe    4.12                     True                                  ''',

         '''\
id   Name   score                    isEnrolled                       Date
111  Jack   2.17                     True              2013-05-01 12:00:00
112  Nick   1.21                     False                                
     Zoe    4.12                     False             2013-05-01 12:00:00''']


df1 = pd.read_fwf(io.StringIO(texts[0]), widths=[5,7,25,17,20], parse_dates=[4])
df2 = pd.read_fwf(io.StringIO(texts[1]), widths=[5,7,25,17,20], parse_dates=[4])

…define your diff function…

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

Then you can simply use a Panel to conclude :

my_panel = pd.Panel(dict(df1=df1,df2=df2))
print my_panel.apply(report_diff, axis=0)

#          id  Name        score    isEnrolled                       Date
#0        111  Jack   nan | 2.17          True        2013-05-01 12:00:00
#1        112  Nick  1.11 | 1.21         False  2013-05-12 15:05:23 | NaT
#2  nan | nan   Zoe         4.12  True | False  NaT | 2013-05-01 12:00:00

By the way, if you’re in IPython Notebook, you may like to use a colored diff function to give colors depending whether cells are different, equal or left/right null :

from IPython.display import HTML
pd.options.display.max_colwidth = 500  # You need this, otherwise pandas
#                          will limit your HTML strings to 50 characters

def report_diff(x):
    if x[0]==x[1]:
        return unicode(x[0].__str__())
    elif pd.isnull(x[0]) and pd.isnull(x[1]):
        return u'<table style="background-color:#00ff00;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', 'nan')
    elif pd.isnull(x[0]) and ~pd.isnull(x[1]):
        return u'<table style="background-color:#ffff00;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', x[1])
    elif ~pd.isnull(x[0]) and pd.isnull(x[1]):
        return u'<table style="background-color:#0000ff;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0],'nan')
    else:
        return u'<table style="background-color:#ff0000;font-weight:bold;">'+\
            '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0], x[1])

HTML(my_panel.apply(report_diff, axis=0).to_html(escape=False))

Question 54

If your two dataframes have the same ids in them, then finding out what changed is actually pretty easy. Just doing frame1 != frame2 will give you a boolean DataFrame where each True is data that has changed. From that, you could easily get the index of each changed row by doing changedids = frame1.index[np.any(frame1 != frame2,axis=1)].

Question 55

A different approach using concat and drop_duplicates:

import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.11                     False                "Graduated"
113  Zoe    NaN                     True                  " "
""")
DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
111  Jack   2.17                     True                 "He was late to class"
112  Nick   1.21                     False                "Graduated"
113  Zoe    NaN                     False                "On vacation" """)

df1 = pd.read_table(DF1, sep='\s+', index_col='id')
df2 = pd.read_table(DF2, sep='\s+', index_col='id')
#%%
dictionary = {1:df1,2:df2}
df=pd.concat(dictionary)
df.drop_duplicates(keep=False)

Output:

       Name  score isEnrolled      Comment
  id                                      
1 112  Nick   1.11      False    Graduated
  113   Zoe    NaN       True             
2 112  Nick   1.21      False    Graduated
  113   Zoe    NaN      False  On vacation

Question 56

After fiddling around with @journois’s answer, I was able to get it to work using MultiIndex instead of Panel due to Panel’s deprication.

First, create some dummy data:

df1 = pd.DataFrame({
    'id': ['111', '222', '333', '444', '555'],
    'let': ['a', 'b', 'c', 'd', 'e'],
    'num': ['1', '2', '3', '4', '5']
})
df2 = pd.DataFrame({
    'id': ['111', '222', '333', '444', '666'],
    'let': ['a', 'b', 'c', 'D', 'f'],
    'num': ['1', '2', 'Three', '4', '6'],
})

Then, define your diff function, in this case I’ll use the one from his answer report_diff stays the same:

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

Then, I’m going to concatenate the data into a MultiIndex dataframe:

df_all = pd.concat(
    [df1.set_index('id'), df2.set_index('id')], 
    axis='columns', 
    keys=['df1', 'df2'],
    join='outer'
)
df_all = df_all.swaplevel(axis='columns')[df1.columns[1:]]

And finally I’m going to apply the report_diff down each column group:

df_final.groupby(level=0, axis=1).apply(lambda frame: frame.apply(report_diff, axis=1))

This outputs:

         let        num
111        a          1
222        b          2
333        c  3 | Three
444    d | D          4
555  e | nan    5 | nan
666  nan | f    nan | 6

And that is all!

Question 57

Extending answer of @cge, which is pretty cool for more readability of result:

a[a != b][np.any(a != b, axis=1)].join(pd.DataFrame('a<->b', index=a.index, columns=['a<=>b'])).join(
        b[a != b][np.any(a != b, axis=1)]
        ,rsuffix='_b', how='outer'
).fillna('')

Full demonstration example:

import numpy as np, pandas as pd

a = pd.DataFrame(np.random.randn(7,3), columns=list('ABC'))
b = a.copy()
b.iloc[0,2] = np.nan
b.iloc[1,0] = 7
b.iloc[3,1] = 77
b.iloc[4,2] = 777

a[a != b][np.any(a != b, axis=1)].join(pd.DataFrame('a<->b', index=a.index, columns=['a<=>b'])).join(
        b[a != b][np.any(a != b, axis=1)]
        ,rsuffix='_b', how='outer'
).fillna('')

Question 58

Here is another way using select and merge:

In [6]: # first lets create some dummy dataframes with some column(s) different
   ...: df1 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': range(20,25)})
   ...: df2 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': [20] + list(range(101,105))})


In [7]: df1
Out[7]:
   a   b   c
0 -5  10  20
1 -4  11  21
2 -3  12  22
3 -2  13  23
4 -1  14  24


In [8]: df2
Out[8]:
   a   b    c
0 -5  10   20
1 -4  11  101
2 -3  12  102
3 -2  13  103
4 -1  14  104


In [10]: # make condition over the columns you want to comapre
    ...: condition = df1['c'] != df2['c']
    ...:
    ...: # select rows from each dataframe where the condition holds
    ...: diff1 = df1[condition]
    ...: diff2 = df2[condition]


In [11]: # merge the selected rows (dataframes) with some suffixes (optional)
    ...: diff1.merge(diff2, on=['a','b'], suffixes=('_before', '_after'))
Out[11]:
   a   b  c_before  c_after
0 -4  11        21      101
1 -3  12        22      102
2 -2  13        23      103
3 -1  14        24      104

Here is the same thing from a Jupyter screenshot:

Question 59

pandas >= 1.1: `DataFrame.compare`

With pandas 1.1, you could essentially replicate Ted Petrou’s output with a single function call. Example taken from the docs:

pd.__version__
# '1.1.0'

df1.compare(df2)

  score       isEnrolled       Comment             
   self other       self other    self        other
1  1.11  1.21        NaN   NaN     NaN          NaN
2   NaN   NaN        1.0   0.0     NaN  On vacation

Here, “self” refers to the LHS dataFrame, while “other” is the RHS DataFrame. By default, equal values are replaced with NaNs so you can focus on just the diffs. If you want to show values that are equal as well, use

df1.compare(df2, keep_equal=True, keep_shape=True) 

  score       isEnrolled           Comment             
   self other       self  other       self        other
1  1.11  1.21      False  False  Graduated    Graduated
2  4.12  4.12       True  False        NaN  On vacation

You can also change the axis of comparison using align_axis:

df1.compare(df2, align_axis='index')

         score  isEnrolled      Comment
1 self    1.11         NaN          NaN
  other   1.21         NaN          NaN
2 self     NaN         1.0          NaN
  other    NaN         0.0  On vacation

This compares values row-wise, instead of column-wise.

Question 60

A function that finds asymmetrical difference between two data frames is implemented below: (Based on set difference for pandas) GIST: https://gist.github.com/oneryalcin/68cf25f536a25e65f0b3c84f9c118e03

def diff_df(df1, df2, how="left"):
    """
      Find Difference of rows for given two dataframes
      this function is not symmetric, means
            diff(x, y) != diff(y, x)
      however
            diff(x, y, how='left') == diff(y, x, how='right')

      Ref: https://stackoverflow.com/questions/18180763/set-difference-for-pandas/40209800#40209800
    """
    if (df1.columns != df2.columns).any():
        raise ValueError("Two dataframe columns must match")

    if df1.equals(df2):
        return None
    elif how == 'right':
        return pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
    elif how == 'left':
        return pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
    else:
        raise ValueError('how parameter supports only "left" or "right keywords"')

Example:

df1 = pd.DataFrame(d1)
Out[1]: 
                Comment  Name  isEnrolled  score
0  He was late to class  Jack        True   2.17
1             Graduated  Nick       False   1.11
2                         Zoe        True   4.12


df2 = pd.DataFrame(d2)

Out[2]: 
                Comment  Name  isEnrolled  score
0  He was late to class  Jack        True   2.17
1           On vacation   Zoe        True   4.12

diff_df(df1, df2)
Out[3]: 
     Comment  Name  isEnrolled  score
1  Graduated  Nick       False   1.11
2              Zoe        True   4.12

diff_df(df2, df1)
Out[4]: 
       Comment Name  isEnrolled  score
1  On vacation  Zoe        True   4.12

# This gives the same result as above
diff_df(df1, df2, how='right')
Out[22]: 
       Comment Name  isEnrolled  score
1  On vacation  Zoe        True   4.12

Question 61

import pandas as pd
import numpy as np

df = pd.read_excel('D:\\HARISH\\DATA SCIENCE\\1 MY Training\\SAMPLE DATA & projs\\CRICKET DATA\\IPL PLAYER LIST\\IPL PLAYER LIST _ harish.xlsx')


df1= srh = df[df['TEAM'].str.contains("SRH")]
df2 = csk = df[df['TEAM'].str.contains("CSK")]   

srh = srh.iloc[:,0:2]
csk = csk.iloc[:,0:2]

csk = csk.reset_index(drop=True)
csk

srh = srh.reset_index(drop=True)
srh

new = pd.concat([srh, csk], axis=1)

new.head()

** 
               PLAYER     TYPE           PLAYER         TYPE

0        David Warner  Batsman    ...        MS Dhoni      Captain

1  Bhuvaneshwar Kumar   Bowler  ...    Ravindra Jadeja  All-Rounder

2       Manish Pandey  Batsman   ...   Suresh Raina  All-Rounder

3   Rashid Khan Arman   Bowler     ...   Kedar Jadhav  All-Rounder

4      Shikhar Dhawan  Batsman    ....    Dwayne Bravo  All-Rounder

Question 62

How can one modify the format for the output from a groupby operation in pandas that produces scientific notation for very large numbers?

I know how to do string formatting in python but I’m at a loss when it comes to applying it here.

df1.groupby('dept')['data1'].sum()

dept
value1       1.192433e+08
value2       1.293066e+08
value3       1.077142e+08

This suppresses the scientific notation if I convert to string but now I’m just wondering how to string format and add decimals.

sum_sales_dept.astype(str)

Question 63

Granted, the answer I linked in the comments is not very helpful. You can specify your own string converter like so.

In [25]: pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [28]: Series(np.random.randn(3))*1000000000
Out[28]: 
0    -757322420.605
1   -1436160588.997
2   -1235116117.064
dtype: float64

I’m not sure if that’s the preferred way to do this, but it works.

Converting numbers to strings purely for aesthetic purposes seems like a bad idea, but if you have a good reason, this is one way:

In [6]: Series(np.random.randn(3)).apply(lambda x: '%.3f' % x)
Out[6]: 
0     0.026
1    -0.482
2    -0.694
dtype: object

Question 64

Here is another way of doing it, similar to Dan Allan’s answer but without the lambda function:

>>> pd.options.display.float_format = '{:.2f}'.format
>>> Series(np.random.randn(3))
0    0.41
1    0.99
2    0.10

or

>>> pd.set_option('display.float_format', '{:.2f}'.format)

问题：应用具有多个参数的函数以创建新的pandas列

回答 0

回答 1

回答 2

回答 3

回答 4

问题：如何使用列的格式字符串显示浮点数的pandas DataFrame？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

问题：Pandas DataFrame到字典列表

回答 0

编辑

原始答案

Edit

Original answer

回答 1

回答 2

回答 3

问题：使用Pandas对同一工作簿的多个工作表进行pd.read_excel（）

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

问题：将多个列表放入数据框

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

问题：列表的熊猫列，为每个列表元素创建一行

回答 0

回答 1

回答 2

熊猫> = 0.25

Pandas >= 0.25

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

问题：比较两个DataFrame并并排输出它们的差异

回答 0

回答 1

突出显示两个DataFrame之间的差异

Highlighting the difference between two DataFrames

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

大熊猫> = 1.1： DataFrame.compare

pandas >= 1.1: DataFrame.compare

回答 11

回答 12

问题：从Python熊猫聚合结果格式化/抑制科学计数法

回答 0

回答 1

回答 2

回答 3

回答 4

问题：熊猫在每个组中获得最高的n条记录

回答 0

回答 1

大熊猫> = 1.1： `DataFrame.compare`

pandas >= 1.1: `DataFrame.compare`