Tag Archives: dataframe

Can pandas automatically recognize dates?

Question: Can pandas automatically recognize dates?


Today I was positively surprised by the fact that while reading data from a data file (for example) pandas is able to recognize types of values:

df = pandas.read_csv('test.dat', delimiter=r"\s+", names=['col1','col2','col3'])

For example it can be checked in this way:

for i, r in df.iterrows():
    print type(r['col1']), type(r['col2']), type(r['col3'])

In particular, integers, floats and strings were recognized correctly. However, I have a column that has dates in the following format: 2013-6-4. These dates were recognized as strings (not as python date-objects). Is there a way to “teach” pandas to recognize dates?


Answer 0


You should add parse_dates=True, or parse_dates=['column name'] when reading; that’s usually enough to magically parse it. But there are always weird formats which need to be defined manually. In such a case you can also add a date parser function, which is the most flexible way possible.

Suppose you have a column ‘datetime’ with your string, then:

from datetime import datetime
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)

This way you can even combine multiple columns into a single datetime column; the following merges a ‘date’ and a ‘time’ column into a single ‘datetime’ column:

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)

You can find the directives (i.e. the letters to be used for different formats) for strptime and strftime on this page.
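
As a quick illustration of those directives (a hedged sketch with a made-up value, not from the original answer): a nonstandard date such as 4-Jun-2013 maps onto the day, abbreviated-month-name and four-digit-year directives.

from datetime import datetime

# %d = day of month, %b = abbreviated month name, %Y = 4-digit year
datetime.strptime('4-Jun-2013', '%d-%b-%Y')  # -> datetime.datetime(2013, 6, 4, 0, 0)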


Answer 1


Perhaps the pandas interface has changed since @Rutger answered, but in the version I’m using (0.15.2), the date_parser function receives a list of dates instead of a single value. In this case, his code should be updated like so:

dateparse = lambda dates: [pd.datetime.strptime(d, '%Y-%m-%d %H:%M:%S') for d in dates]

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)

Answer 2


The pandas read_csv method is great for parsing dates. Complete documentation is at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

You can even have the different date parts in different columns and pass the parameter:

parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a
separate date column. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date
column. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’

The default sensing of dates works great, but it seems to be biased towards North American date formats. If you live elsewhere you might occasionally be caught out by the results. As far as I can remember, 1/6/2000 means 6 January in the USA as opposed to 1 June where I live. It is smart enough to swing them around if dates like 23/6/2000 are used. It is probably safer to stay with YYYYMMDD variations of date, though. Apologies to the pandas developers, but I have not tested it with local dates recently.
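
For that day-first ambiguity, read_csv also exposes a dayfirst flag; here is a minimal sketch with inline data (the sample CSV is invented for illustration):

import io
import pandas as pd

csv = io.StringIO("when\n01/06/2000\n23/06/2000\n")
# dayfirst=True makes 01/06/2000 parse as 1 June rather than 6 January.
df = pd.read_csv(csv, parse_dates=['when'], dayfirst=True)
print(df['when'])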

You can use the date_parser parameter to pass a function to convert your format.

date_parser : function
Function to use for converting a sequence of string columns to an array of datetime
instances. The default uses dateutil.parser.parser to do the conversion.

Answer 3


You could use pandas.to_datetime() as recommended in the documentation for pandas.read_csv():

If a column or index contains an unparseable date, the entire column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv.

Demo:

>>> D = {'date': '2013-6-4'}
>>> df = pd.DataFrame(D, index=[0])
>>> df
       date
0  2013-6-4
>>> df.dtypes
date    object
dtype: object
>>> df['date'] = pd.to_datetime(df.date, format='%Y-%m-%d')
>>> df
        date
0 2013-06-04
>>> df.dtypes
date    datetime64[ns]
dtype: object

Answer 4


When merging two columns into a single datetime column, the accepted answer generates an error (pandas version 0.20.3), since the columns are sent to the date_parser function separately.

The following works:

def dateparse(d,t):
    dt = d + " " + t
    return pd.datetime.strptime(dt, '%d/%m/%Y %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)

Answer 5


Yes – according to the pandas.read_csv documentation:

Note: A fast-path exists for iso8601-formatted dates.

So if your csv has a column named datetime and the dates look like 2013-01-01T01:01 for example, running this will make pandas (I’m on v0.19.2) pick up the date and time automatically:

df = pd.read_csv('test.csv', parse_dates=['datetime'])

Note that you need to explicitly pass parse_dates; it doesn’t work without it.

Verify with:

df.dtypes

You should see that the datatype of the column is datetime64[ns].


Answer 6


If performance matters to you, make sure you time it:

import sys
import timeit
import pandas as pd

print('Python %s on %s' % (sys.version, sys.platform))
print('Pandas version %s' % pd.__version__)

repeat = 3
numbers = 100

def time(statement, _setup=None):
    print (min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers)))

print("Format %m/%d/%y")
setup = """import pandas as pd
import io

data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,07/29/15
x2,07/29/15
x3,07/29/15
x4,07/30/15
x5,07/29/15
x6,07/29/15
x7,07/29/15
y7,08/05/15
x8,08/05/15
z3,08/05/15
''' * 100)"""

time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'date_parser=lambda x: pd.datetime.strptime(x, "%m/%d/%y")); data.seek(0)')

print("Format %Y-%m-%d %H:%M:%S")
setup = """import pandas as pd
import io

data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,2016-10-15 00:00:43
x2,2016-10-15 00:00:56
x3,2016-10-15 00:00:56
x4,2016-10-15 00:00:12
x5,2016-10-15 00:00:34
x6,2016-10-15 00:00:55
x7,2016-10-15 00:00:06
y7,2016-10-15 00:00:01
x8,2016-10-15 00:00:00
z3,2016-10-15 00:00:02
''' * 1000)"""

time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
     'date_parser=lambda x: pd.datetime.strptime(x, "%Y-%m-%d %H:%M:%S")); data.seek(0)')

prints:

Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28) 
[Clang 6.0 (clang-600.0.57)] on darwin
Pandas version 0.23.4
Format %m/%d/%y
0.19123052499999993
8.20691274
8.143124389
1.2384357139999977
Format %Y-%m-%d %H:%M:%S
0.5238807110000039
0.9202787830000005
0.9832778819999959
12.002349824999996

So with an iso8601-formatted date (%Y-%m-%d %H:%M:%S is apparently an iso8601-formatted date; I guess the T can be dropped and replaced by a space) you should not specify infer_datetime_format (which apparently makes no difference with the more common formats either), and passing your own parser just cripples performance. On the other hand, date_parser does make a difference with less standard date formats. As usual, be sure to time before you optimize.


Answer 7


When loading a CSV file that contains a date column, we have two approaches to make pandas recognize the date column, i.e.

  1. Pandas explicitly recognizes the format via the arg date_parser=mydateparser

  2. Pandas implicitly recognizes the format via the arg infer_datetime_format=True

Some of the date column data:

01/01/18

01/02/18

Here we cannot tell from the first two digits whether they are the month or the day. So in this case we have to use Method 1: explicitly pass the format

mydateparser = lambda x: pd.datetime.strptime(x, "%m/%d/%y")
df = pd.read_csv(file_name, parse_dates=['date_col_name'], date_parser=mydateparser)

Method 2: implicitly (automatically) recognize the format

df = pd.read_csv(file_name, parse_dates=['date_col_name'], infer_datetime_format=True)

Create an empty pandas DataFrame with only column names

Question: Create an empty pandas DataFrame with only column names


I have a dynamic DataFrame which works fine, but when there are no data to be added into the DataFrame I get an error. And therefore I need a solution to create an empty DataFrame with only the column names.

For now I have something like this:

df = pd.DataFrame(columns=COLUMN_NAMES) # Note that there is no row data inserted.

PS: It is important that the column names still appear in the DataFrame.

But when I use it like this I get something like that as a result:

Index([], dtype='object')
Empty DataFrame

The “Empty DataFrame” part is good! But instead of the Index thing I need to still display the columns.

Edit:

An important thing that I found out: I am converting this DataFrame to a PDF using Jinja2, so I’m calling a method to first output it to HTML like this:

df.to_html()

This is where the columns get lost I think.

Edit2: In general, I followed this example: http://pbpython.com/pdf-reports.html. The css is also from the link. That’s what I do to send the dataframe to the PDF:

env = Environment(loader=FileSystemLoader('.'))
template = env.get_template("pdf_report_template.html")
template_vars = {"my_dataframe": df.to_html()}

html_out = template.render(template_vars)
HTML(string=html_out).write_pdf("my_pdf.pdf", stylesheets=["pdf_report_style.css"])

Edit3:

If I print out the dataframe right after creation I get the following:

[0 rows x 9 columns]
Empty DataFrame
Columns: [column_a, column_b, column_c, column_d, 
column_e, column_f, column_g, 
column_h, column_i]
Index: []

That seems reasonable, but if I print out the template_vars:

'my_dataframe': '<table border="1" class="dataframe">\n  <tbody>\n    <tr>\n      <td>Index([], dtype=\'object\')</td>\n      <td>Empty DataFrame</td>\n    </tr>\n  </tbody>\n</table>'

And it seems that the columns are missing already.

E4: If I print out the following:

print(df.to_html())

I get the following result already:

<table border="1" class="dataframe">
  <tbody>
    <tr>
      <td>Index([], dtype='object')</td>
      <td>Empty DataFrame</td>
    </tr>
  </tbody>
</table>

Answer 0


You can create an empty DataFrame with either column names or an Index:

In [4]: import pandas as pd
In [5]: df = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
In [6]: df
Out[6]:
Empty DataFrame
Columns: [A, B, C, D, E, F, G]
Index: []

Or

In [7]: df = pd.DataFrame(index=range(1,10))
In [8]: df
Out[8]:
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5, 6, 7, 8, 9]

Edit: Even after your amendment with the .to_html, I can’t reproduce. This:

df = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
df.to_html('test.html')

Produces:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>A</th>
      <th>B</th>
      <th>C</th>
      <th>D</th>
      <th>E</th>
      <th>F</th>
      <th>G</th>
    </tr>
  </thead>
  <tbody>
  </tbody>
</table>

Answer 1


Are you looking for something like this?

COLUMN_NAMES = ['A','B','C','D','E','F','G']
df = pd.DataFrame(columns=COLUMN_NAMES)
df.columns

Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')

Answer 2


df.to_html() has a columns parameter.

Just pass the columns into the to_html() method.

df.to_html(columns=['A','B','C','D','E','F','G'])
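
A hedged sketch of how that fits the Jinja2 flow from the question (variable names are hypothetical; whether the explicit column list is required depends on the pandas version exhibiting the problem):

import pandas as pd

df = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
# Passing the column labels explicitly encourages to_html to render the <thead>.
html_table = df.to_html(columns=df.columns.tolist())
print(html_table)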

Replace blank values (white space) with NaN in pandas

Question: Replace blank values (white space) with NaN in pandas


I want to find all values in a Pandas dataframe that contain whitespace (any arbitrary amount) and replace those values with NaNs.

Any ideas how this can be improved?

Basically I want to turn this:

                   A    B    C
2000-01-01 -0.532681  foo    0
2000-01-02  1.490752  bar    1
2000-01-03 -1.387326  foo    2
2000-01-04  0.814772  baz     
2000-01-05 -0.222552         4
2000-01-06 -1.176781  qux     

Into this:

                   A     B     C
2000-01-01 -0.532681   foo     0
2000-01-02  1.490752   bar     1
2000-01-03 -1.387326   foo     2
2000-01-04  0.814772   baz   NaN
2000-01-05 -0.222552   NaN     4
2000-01-06 -1.176781   qux   NaN

I’ve managed to do it with the code below, but man is it ugly. It’s not Pythonic and I’m sure it’s not the most efficient use of pandas either. I loop through each column and do boolean replacement against a column mask generated by applying a function that does a regex search of each value, matching on whitespace.

import re

for i in df.columns:
    df[i][df[i].apply(lambda i: True if re.search(r'^\s*$', str(i)) else False)]=None

It could be optimized a bit by only iterating through fields that could contain empty strings:

if df[i].dtype == np.dtype('object')

But that’s not much of an improvement

And finally, this code sets the target strings to None, which works with Pandas’ functions like fillna(), but it would be nice for completeness if I could actually insert a NaN directly instead of None.
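
For the last point, a NaN can in fact be inserted directly; a small sketch (assuming column B holds the strings, as in the example above):

import numpy as np

# str.match flags the all-whitespace cells; assigning np.nan inserts a real NaN rather than None.
df.loc[df['B'].str.match(r'^\s*$', na=False), 'B'] = np.nan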


Answer 0


I think df.replace() does the job, since pandas 0.13:

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],     
    [-0.222552, '   ', 4],
    [-1.176781,  'qux', '  '],         
], columns='A B C'.split(), index=pd.date_range('2000-01-01','2000-01-06'))

# replace field that's entirely space (or empty) with NaN
print(df.replace(r'^\s*$', np.nan, regex=True))

Produces:

                   A    B   C
2000-01-01 -0.532681  foo   0
2000-01-02  1.490752  bar   1
2000-01-03 -1.387326  foo   2
2000-01-04  0.814772  baz NaN
2000-01-05 -0.222552  NaN   4
2000-01-06 -1.176781  qux NaN

As Temak pointed out, use df.replace(r'^\s+$', np.nan, regex=True) in case your valid data contains white spaces.


Answer 1


If you want to replace empty strings and records containing only spaces, the correct answer is:

df = df.replace(r'^\s*$', np.nan, regex=True)

The accepted answer

df.replace(r'\s+', np.nan, regex=True)

does not replace an empty string! You can try it yourself with the given example, slightly updated:

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'fo o', 2],
    [0.814772, 'baz', ' '],     
    [-0.222552, '   ', 4],
    [-1.176781,  'qux', ''],         
], columns='A B C'.split(), index=pd.date_range('2000-01-01','2000-01-06'))

Note also that ‘fo o’ is not replaced with NaN, though it contains a space. Further note that a simple:

df.replace(r'', np.NaN)

does not work either – try it out.


Answer 2


How about:

d = d.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)

The applymap function applies a function to every cell of the dataframe.


Answer 3


I would do this:

df = df.apply(lambda x: x.str.strip()).replace('', np.nan)

or

df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x).replace('', np.nan)

You can strip all str, then replace empty str with np.nan.


Answer 4


Simplest of all solutions:

df = df.replace(r'^\s+$', np.nan, regex=True)

Answer 5


If you are importing the data from a CSV file, it can be as simple as this:

df = pd.read_csv(file_csv, na_values=' ')

This will create the data frame and also replace blank values with NaN.
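
A runnable sketch of the same idea with inline data (the tiny CSV is invented for illustration):

import io
import pandas as pd

csv = io.StringIO("A,B\n1, \n2,x\n")
# The lone space in column B becomes NaN instead of the string ' '.
df = pd.read_csv(csv, na_values=' ')
print(df)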


Answer 6


For a very fast and simple solution where you check equality against a single value, you can use the mask method.

df.mask(df == ' ')
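
The mask call above only matches that one exact value; a hedged extension for arbitrary runs of whitespace (using applymap, trading speed for generality):

import numpy as np

# Build a boolean mask that is True wherever a cell is a string made up only of whitespace.
df = df.mask(df.applymap(lambda v: isinstance(v, str) and v.strip() == ''))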

Answer 7


These are all close to the right answer, but I wouldn’t say any solve the problem while remaining most readable to others reading your code. I’d say that answer is a combination of BrenBarn’s answer and tuomasttik’s comment below that answer. BrenBarn’s answer utilizes the isspace builtin, but does not support removing empty strings, as the OP requested, and I would tend to attribute that as the standard use case of replacing strings with null.

I rewrote it with .apply, so you can call it on a pd.Series or pd.DataFrame.


Python 3:

To replace empty strings or strings of entirely spaces:

df = df.apply(lambda x: np.nan if isinstance(x, str) and (x.isspace() or not x) else x)

To replace strings of entirely spaces:

df = df.apply(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)

To use this in Python 2, you’ll need to replace str with basestring.

Python 2:

To replace empty strings or strings of entirely spaces:

df = df.apply(lambda x: np.nan if isinstance(x, basestring) and (x.isspace() or not x) else x)

To replace strings of entirely spaces:

df = df.apply(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)

Answer 8


This worked for me. When I imported my csv file I added na_values=' '. Spaces are not included in the default NaN values.

df = pd.read_csv(filepath, na_values=' ')


Answer 9


You can also use a filter to do it.

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, '   ', 4],
    [-1.176781, 'qux', '  '],
], columns=['A', 'B', 'C'])

# Mask cells that are empty or contain only whitespace, then assign NaN through the filter.
df[df.apply(lambda col: col.astype(str).str.strip() == '')] = np.nan

Answer 10

print(df.isnull().sum()) # check the number of null values in each column

modifiedDf = df.fillna("NaN") # replace empty/null values with the string "NaN"

# modifiedDf = df.dropna() # remove rows with empty values

print(modifiedDf.isnull().sum()) # check the number of null values in each column

Answer 11


This is not an elegant solution, but what does seem to work is saving to XLSX and then importing it back. The other solutions on this page did not work for me, unsure why.

data.to_excel(filepath, index=False)
data = pd.read_excel(filepath)

Split dictionaries/lists in a pandas column into separate columns

Question: Split dictionaries/lists in a pandas column into separate columns


I have data saved in a postgreSQL database. I am querying this data using Python2.7 and turning it into a Pandas DataFrame. However, the last column of this dataframe has a dictionary (or list?) of values within it. The DataFrame looks like this:

[1] df
Station ID     Pollutants
8809           {"a": "46", "b": "3", "c": "12"}
8810           {"a": "36", "b": "5", "c": "8"}
8811           {"b": "2", "c": "7"}
8812           {"c": "11"}
8813           {"a": "82", "c": "15"}

I need to split this column into separate columns so that the DataFrame looks like this:

[2] df2
Station ID     a      b       c
8809           46     3       12
8810           36     5       8
8811           NaN    2       7
8812           NaN    NaN     11
8813           82     NaN     15

The major issue I’m having is that the lists are not the same length. But all of the lists only contain up to the same 3 values: a, b, and c. And they always appear in the same order (a first, b second, c third).

The following code USED to work and return exactly what I wanted (df2).

[3] df 
[4] objs = [df, pandas.DataFrame(df['Pollutant Levels'].tolist()).iloc[:, :3]]
[5] df2 = pandas.concat(objs, axis=1).drop('Pollutant Levels', axis=1)
[6] print(df2)

I was running this code just last week and it was working fine. But now my code is broken and I get this error from line [4]:

IndexError: out-of-bounds on slice (end) 

I made no changes to the code but am now getting the error. I feel this is due to my method not being robust or proper.

Any suggestions or guidance on how to split this column of lists into separate columns would be super appreciated!

EDIT: I think the .tolist() and .apply methods are not working on my code because it is one Unicode string, i.e.:

#My data format 
u{'a': '1', 'b': '2', 'c': '3'}

#and not
{u'a': '1', u'b': '2', u'c': '3'}

The data is imported from the postgreSQL database in this format. Any help or ideas with this issue? Is there a way to convert the Unicode?
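
One hedged way to deal with that (a sketch, not from the original thread): if each cell is a plain dict literal such as {'a': '1', 'b': '2'}, ast.literal_eval parses it safely, after which the dict-splitting answers below apply.

from ast import literal_eval

# literal_eval, unlike eval, only accepts Python literals, so it is safe on untrusted text.
df['Pollutants'] = df['Pollutants'].apply(literal_eval)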


Answer 0


To convert the string to an actual dict, you can do df['Pollutant Levels'].map(eval). Afterwards, the solution below can be used to convert the dict to different columns.


Using a small example, you can use .apply(pd.Series):

In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})

In [3]: df
Out[3]:
   a                   b
0  1           {u'c': 1}
1  2           {u'd': 3}
2  3  {u'c': 5, u'd': 6}

In [4]: df['b'].apply(pd.Series)
Out[4]:
     c    d
0  1.0  NaN
1  NaN  3.0
2  5.0  6.0

To combine it with the rest of the dataframe, you can concat the other columns with the above result:

In [7]: pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
Out[7]:
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0

Using your code, this also works if I leave out the iloc part:

In [15]: pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
Out[15]:
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0

Answer 1


I know the question is quite old, but I got here searching for answers. There is actually a better (and faster) way now of doing this using json_normalize:

import pandas as pd

df2 = pd.json_normalize(df['Pollutant Levels'])

This avoids costly apply functions…
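
A small usage sketch with made-up data: json_normalize returns a fresh RangeIndex, so re-attach the original index before joining the flattened columns back.

import pandas as pd

df = pd.DataFrame({
    'Station ID': [8809, 8810],
    'Pollutants': [{'a': '46', 'b': '3'}, {'c': '11'}],
})

flat = pd.json_normalize(df['Pollutants'].tolist()).set_index(df.index)
result = df.drop(columns='Pollutants').join(flat)
print(result)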


Answer 2


Try this: the data returned from SQL has to be converted into a dict. (Or could it be that "Pollutant Levels" is now "Pollutants"?)

   StationID                   Pollutants
0       8809  {"a":"46","b":"3","c":"12"}
1       8810   {"a":"36","b":"5","c":"8"}
2       8811            {"b":"2","c":"7"}
3       8812                   {"c":"11"}
4       8813          {"a":"82","c":"15"}


df2["Pollutants"] = df2["Pollutants"].apply(lambda x : dict(eval(x)) )
df3 = df2["Pollutants"].apply(pd.Series )

    a    b   c
0   46    3  12
1   36    5   8
2  NaN    2   7
3  NaN  NaN  11
4   82  NaN  15


result = pd.concat([df, df3], axis=1).drop('Pollutants', axis=1)
result

   StationID    a    b   c
0       8809   46    3  12
1       8810   36    5   8
2       8811  NaN    2   7
3       8812  NaN  NaN  11
4       8813   82  NaN  15

Answer 3


Merlin’s answer is better and super easy, but we don’t need a lambda function. The evaluation of the dictionary can be safely skipped in either of the following two ways, as illustrated below:

Way 1: Two steps

# step 1: convert the `Pollutants` column to Pandas dataframe series
df_pol_ps = data_df['Pollutants'].apply(pd.Series)

df_pol_ps:
    a   b   c
0   46  3   12
1   36  5   8
2   NaN 2   7
3   NaN NaN 11
4   82  NaN 15

# step 2: concat columns `a, b, c` and drop/remove the `Pollutants` 
df_final = pd.concat([df, df_pol_ps], axis = 1).drop('Pollutants', axis = 1)

df_final:
    StationID   a   b   c
0   8809    46  3   12
1   8810    36  5   8
2   8811    NaN 2   7
3   8812    NaN NaN 11
4   8813    82  NaN 15

Way 2: The above two steps can be combined in one go:

df_final = pd.concat([df, df['Pollutants'].apply(pd.Series)], axis = 1).drop('Pollutants', axis = 1)

df_final:
    StationID   a   b   c
0   8809    46  3   12
1   8810    36  5   8
2   8811    NaN 2   7
3   8812    NaN NaN 11
4   8813    82  NaN 15

Answer 4


I strongly recommend this method for extracting the column ‘Pollutants’:

df_pollutants = pd.DataFrame(df['Pollutants'].values.tolist(), index=df.index)

it’s much faster than

df_pollutants = df['Pollutants'].apply(pd.Series)

when the size of df is giant.


Answer 5


You can use join with pop + tolist. Performance is comparable to concat with drop + tolist, but some may find this syntax cleaner:

res = df.join(pd.DataFrame(df.pop('b').tolist()))

Benchmarking with other methods:

df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})

def joris1(df):
    return pd.concat([df.drop('b', axis=1), df['b'].apply(pd.Series)], axis=1)

def joris2(df):
    return pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)

def jpp(df):
    return df.join(pd.DataFrame(df.pop('b').tolist()))

df = pd.concat([df]*1000, ignore_index=True)

%timeit joris1(df.copy())  # 1.33 s per loop
%timeit joris2(df.copy())  # 7.42 ms per loop
%timeit jpp(df.copy())     # 7.68 ms per loop

Answer 6


A one-line solution is the following:

>>> df = pd.concat([df['Station ID'], df['Pollutants'].apply(pd.Series)], axis=1)
>>> print(df)
   Station ID    a    b   c
0        8809   46    3  12
1        8810   36    5   8
2        8811  NaN    2   7
3        8812  NaN  NaN  11
4        8813   82  NaN  15

Answer 7


my_df = pd.DataFrame.from_dict(my_dict, orient='index', columns=['my_col'])

.. would have parsed the dict properly (putting each dict key into a separate df column, and key values into df rows), so the dicts would not get squashed into a single column in the first place.
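
A minimal sketch of that idea (my_dict is hypothetical, and the columns argument is omitted so each inner key becomes its own column):

import pandas as pd

my_dict = {8809: {'a': '46', 'b': '3'}, 8810: {'c': '11'}}
# orient='index' turns each outer key into a row label and each inner key into a column.
df = pd.DataFrame.from_dict(my_dict, orient='index')
print(df)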


Answer 8


I’ve concatenated those steps in a method; you have to pass only the dataframe and the column which contains the dict to expand:

from typing import Dict

import pandas as pd


def expand_dataframe(dw: pd.DataFrame, column_to_expand: str) -> pd.DataFrame:
    """
    dw: DataFrame with some column which contains a dict to expand
        into columns
    column_to_expand: String with the column name of dw
    """
    import json

    def convert_to_dict(sequence: str) -> Dict:
        json_acceptable_string = sequence.replace("'", "\"")
        return json.loads(json_acceptable_string)

    expanded_dataframe = pd.concat([dw.drop([column_to_expand], axis=1),
                                    dw[column_to_expand]
                                    .apply(convert_to_dict)
                                    .apply(pd.Series)],
                                   axis=1)
    return expanded_dataframe

Answer 9

df = pd.concat([df['a'], df.b.apply(pd.Series)], axis=1)

Select rows in a pandas MultiIndex DataFrame

Question: Select rows in a pandas MultiIndex DataFrame


What are the most common pandas ways to select/filter rows of a dataframe whose index is a MultiIndex?

  • Slicing based on a single value/label
  • Slicing based on multiple labels from one or more levels
  • Filtering on boolean conditions and expressions
  • Which methods are applicable in what circumstances

Assumptions for simplicity:

  1. input dataframe does not have duplicate index keys
  2. input dataframe below only has two levels. (Most solutions shown here generalize to N levels)

Example input:

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])

df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15

Question 1: Selecting a Single Item

How do I select rows having “a” in level “one”?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

Additionally, how would I be able to drop level “one” in the output?

     col
two     
t      0
u      1
v      2
w      3

Question 1b
How do I slice all rows with value “t” on level “two”?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

Question 2: Selecting Multiple Values in a Level

How can I select rows corresponding to items “b” and “d” in level “one”?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

Question 2b
How would I get all values corresponding to “t” and “w” in level “two”?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

Question 3: Slicing a Single Cross Section (x, y)

How do I retrieve a cross section, i.e., a single row having specific values for the index from df? Specifically, how do I retrieve the cross section of ('c', 'u'), given by

         col
one two     
c   u      9

Question 4: Slicing Multiple Cross Sections [(a, b), (c, d), ...]

How do I select the two rows corresponding to ('c', 'u'), and ('a', 'w')?

         col
one two     
c   u      9
a   w      3

Question 5: One Item Sliced per Level

How can I retrieve all rows corresponding to “a” in level “one” or “t” in level “two”?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

Question 6: Arbitrary Slicing

How can I slice specific cross sections? For “a” and “b”, I would like to select all rows with sub-levels “u” and “v”, and for “d”, I would like to select rows with sub-level “w”.

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

Question 7 will use a unique setup consisting of a numeric level:

np.random.seed(0)
mux2 = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    np.random.choice(10, size=16)
], names=['one', 'two'])

df2 = pd.DataFrame({'col': np.arange(len(mux2))}, mux2)

         col
one two     
a   5      0
    0      1
    3      2
    3      3
b   7      4
    9      5
    3      6
    5      7
    2      8
c   4      9
    7     10
d   6     11
    8     12
    8     13
    1     14
    6     15

Question 7: Filtering by numeric inequality on individual levels of the multiindex

How do I get all rows where values in level “two” are greater than 5?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

Note: This post will not go through how to create MultiIndexes, how to perform assignment operations on them, or any performance related discussions (these are separate topics for another time).


Answer 0

MultiIndex / Advanced Indexing

Note
This post will be structured in the following manner:

  1. The questions put forth in the OP will be addressed, one by one
  2. For each question, one or more methods applicable to solving this problem and getting the expected result will be demonstrated.

Notes (much like this one) will be included for readers interested in learning about additional functionality, implementation details, and other info cursory to the topic at hand. These notes have been compiled through scouring the docs and uncovering various obscure features, and from my own (admittedly limited) experience.

All code samples have been created and tested on pandas v0.23.4, python3.7. If something is not clear, or factually incorrect, or if you did not find a solution applicable to your use case, please feel free to suggest an edit, request clarification in the comments, or open a new question, as applicable.

Here is an introduction to some common idioms (henceforth referred to as the Four Idioms) we will be frequently re-visiting:

  1. DataFrame.loc – A general solution for selection by label (+ pd.IndexSlice for more complex applications involving slices)

  2. DataFrame.xs – Extract a particular cross section from a Series/DataFrame.

  3. DataFrame.query – Specify slicing and/or filtering operations dynamically (i.e., as an expression that is evaluated dynamically). Is more applicable to some scenarios than others. Also see this section of the docs for querying on MultiIndexes.

  4. Boolean indexing with a mask generated using MultiIndex.get_level_values (often in conjunction with Index.isin, especially when filtering with multiple values). This is also quite useful in some circumstances.

It will be beneficial to look at the various slicing and filtering problems in terms of the Four Idioms to gain a better understanding of what can be applied to a given situation. It is very important to understand that not all of the idioms will work equally well (if at all) in every circumstance. If an idiom has not been listed as a potential solution to a problem below, that means that idiom cannot be applied to that problem effectively.


Question 1

How do I select rows having “a” in level “one”?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

You can use loc, as a general purpose solution applicable to most situations:

df.loc[['a']]

At this point, if you get

TypeError: Expected tuple, got str

that means you are using an older version of pandas. Consider upgrading! Otherwise, use df.loc[('a', slice(None)), :].

Alternatively, you can use xs here, since we are extracting a single cross section. Note the level and axis arguments (reasonable defaults can be assumed here).

df.xs('a', level=0, axis=0, drop_level=False)
# df.xs('a', drop_level=False)

Here, the drop_level=False argument is needed to prevent xs from dropping level “one” (the level we sliced on) in the result.

Another option here is using query:

df.query("one == 'a'")

If the index did not have a name, you would need to change your query string to "ilevel_0 == 'a'".

Finally, using get_level_values:

df[df.index.get_level_values('one') == 'a']
# If your levels are unnamed, or if you need to select by position (not label),
# df[df.index.get_level_values(0) == 'a']

Additionally, how would I be able to drop level “one” in the output?

     col
two     
t      0
u      1
v      2
w      3

This can easily be done using either

df.loc['a'] # Notice the single string argument instead of the list.

Or,

df.xs('a', level=0, axis=0, drop_level=True)
# df.xs('a')

Note that we can omit the drop_level argument (it is assumed to be True by default).

Note
You may notice that a filtered DataFrame may still have all the levels, even if they do not show when printing the DataFrame out. For example,

v = df.loc[['a']]
print(v)
         col
one two     
a   t      0
    u      1
    v      2
    w      3

print(v.index)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

You can get rid of these levels using MultiIndex.remove_unused_levels:

v.index = v.index.remove_unused_levels()

print(v.index)
MultiIndex(levels=[['a'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

Question 1b

How do I slice all rows with value “t” on level “two”?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

Intuitively, you would want something involving slice():

df.loc[(slice(None), 't'), :]

It Just Works!™ But it is clunky. We can facilitate a more natural slicing syntax using the pd.IndexSlice API here.

idx = pd.IndexSlice
df.loc[idx[:, 't'], :]

This is much cleaner.

Note
Why is the trailing slice : across the columns required? This is because loc can be used to select and slice along both axes (axis=0 and axis=1). Without explicitly making it clear which axis the slicing is to be done on, the operation becomes ambiguous. See the big red box in the documentation on slicing.

If you want to remove any shade of ambiguity, loc accepts an axis parameter:

df.loc(axis=0)[pd.IndexSlice[:, 't']]

Without the axis parameter (i.e., just by doing df.loc[pd.IndexSlice[:, 't']]), slicing is assumed to be on the columns, and a KeyError will be raised in this circumstance.

This is documented in slicers. For the purposes of this post, however, we will explicitly specify all axes.

With xs, it is

df.xs('t', axis=0, level=1, drop_level=False)

With query, it is

df.query("two == 't'")
# Or, if the second level has no name, 
# df.query("ilevel_1 == 't'") 

And finally, with get_level_values, you can do

df[df.index.get_level_values('two') == 't']
# Or, to perform selection by position/integer,
# df[df.index.get_level_values(1) == 't']

All to the same effect.
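
As a quick sanity check, a sketch assuming the df built in the question above: the four idioms for this slice return identical frames.

import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])
df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

a = df.loc[pd.IndexSlice[:, 't'], :]
b = df.xs('t', axis=0, level=1, drop_level=False)
c = df.query("two == 't'")
d = df[df.index.get_level_values('two') == 't']
assert a.equals(b) and b.equals(c) and c.equals(d)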


问题2

如何在级别“ one”中选择与项目“ b”和“ d”相对应的行?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

使用loc,可以通过指定列表以类似方式完成此操作。

df.loc[['b', 'd']]

要解决选择“ b”和“ d”的上述问题,您还可以使用query

items = ['b', 'd']
df.query("one in @items")
# df.query("one == @items", parser='pandas')
# df.query("one in ['b', 'd']")
# df.query("one == ['b', 'd']", parser='pandas')

注意:
是的,默认解析器为'pandas',但重要的是要突出此语法不是python。Pandas解析器从表达式生成的解析树略有不同。这样做是为了使某些操作更直观地指定。有关更多信息,请阅读我关于 使用pd.eval()在熊猫中进行动态表达评估的文章

并且,用get_level_values+ Index.isin

df[df.index.get_level_values("one").isin(['b', 'd'])]

问题2b

我如何获得与“第二”级别中的“ t”和“ w”相对应的所有值?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

使用loc只有与结合使用才有可能pd.IndexSlice

df.loc[pd.IndexSlice[:, ['t', 'w']], :] 

中的第一个冒号:pd.IndexSlice[:, ['t', 'w']]指跨越第一级。随着要查询的级别的深度增加,您将需要指定更多的切片,每个级别将切片一个。您不需要指定多层次超越但被切片的一个。

使用query,这是

items = ['t', 'w']
df.query("two in @items")
# df.query("two == @items", parser='pandas') 
# df.query("two in ['t', 'w']")
# df.query("two == ['t', 'w']", parser='pandas')

get_level_valuesIndex.isin(类似于上面):

df[df.index.get_level_values('two').isin(['t', 'w'])]

问题3

如何检索横截面,即具有从中为索引指定值的单行df?具体来说,我如何检索的横截面('c', 'u'),由

         col
one two     
c   u      9

loc通过指定键元组来使用:

df.loc[('c', 'u'), :]

要么,

df.loc[pd.IndexSlice[('c', 'u')]]

注意
此时,您可能会遇到PerformanceWarning如下所示的:

PerformanceWarning: indexing past lexsort depth may impact performance.

这仅表示您的索引未排序。大熊猫取决于要进行最佳搜索和检索的索引(在这种情况下,按字典顺序排序,因为我们正在处理字符串值)。一种快速的解决方法是使用预先对DataFrame进行排序DataFrame.sort_index。如果您计划一并执行多个此类查询,那么从性能角度来看,这是特别理想的:

df_sort = df.sort_index()
df_sort.loc[('c', 'u')]

您还可以MultiIndex.is_lexsorted()用来检查索引是否已排序。此函数返回TrueFalse相应地。您可以调用此函数来确定是否需要其他排序步骤。

使用xs,这再次简单地传递了一个元组作为第一个参数,而所有其他参数都设置为其适当的默认值:

df.xs(('c', 'u'))

使用query,事情变得有些笨拙:

df.query("one == 'c' and two == 'u'")

您现在可以看到,这将很难一概而论。但是对于这个特定问题仍然可以。

跨多个级别的访问get_level_values仍然可以使用,但不建议这样做:

m1 = (df.index.get_level_values('one') == 'c')
m2 = (df.index.get_level_values('two') == 'u')
df[m1 & m2]

问题4

如何选择与('c', 'u')和相对应的两行('a', 'w')

         col
one two     
c   u      9
a   w      3

使用loc,这仍然很简单:

df.loc[[('c', 'u'), ('a', 'w')]]
# df.loc[pd.IndexSlice[[('c', 'u'), ('a', 'w')]]]

使用query,您将需要通过遍历横截面和层次来动态生成查询字符串:

cses = [('c', 'u'), ('a', 'w')]
levels = ['one', 'two']
# This is a useful check to make in advance.
assert all(len(levels) == len(cs) for cs in cses) 

query = '(' + ') or ('.join([
    ' and '.join([f"({l} == {repr(c)})" for l, c in zip(levels, cs)]) 
    for cs in cses
]) + ')'

print(query)
# ((one == 'c') and (two == 'u')) or ((one == 'a') and (two == 'w'))

df.query(query)

100%不推荐!但是有可能。


问题5

如何检索与“一级”中的“ a”或“二级”中的“ t”相对应的所有行?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

loc在确保正确性仍保持代码清晰的同时,这实际上很难做到。df.loc[pd.IndexSlice['a', 't']]不正确,将其解释为df.loc[pd.IndexSlice[('a', 't')]](即选择横截面)。您可能会想到一种解决方案,pd.concat可以单独处理每个标签:

pd.concat([
    df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])

         col
one two     
a   t      0
    u      1
    v      2
    w      3
    t      0   # Does this look right to you? No, it isn't!
b   t      4
    t      8
d   t     12

但是您会注意到其中一行是重复的。这是因为该行同时满足两个切片条件,因此出现了两次。您将需要做

v = pd.concat([
        df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])
v[~v.index.duplicated()]

但是,如果您的DataFrame固有地包含重复的索引(所需),那么它将不会保留它们。 使用时要格外小心

使用query,这非常简单:

df.query("one == 'a' or two == 't'")

使用get_level_values,这仍然很简单,但并不那么优雅:

m1 = (df.index.get_level_values('one') == 'a')
m2 = (df.index.get_level_values('two') == 't')
df[m1 | m2] 

问题6

如何切特定的横截面?对于“ a”和“ b”,我想选择子级别为“ u”和“ v”的所有行,对于“ d”,我想选择子级别为“ w”的行。

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

我添加了这是一个特殊情况,以帮助理解四个惯用语的用法-这是其中的任何一个都无法有效工作的情况,因为切片非常具体,并且没有遵循任何实际模式。

通常,像这样的切片问题将需要将键列表显式传递给loc。一种方法是:

keys = [('a', 'u'), ('a', 'v'), ('b', 'u'), ('b', 'v'), ('d', 'w')]
df.loc[keys, :]

如果要保存一些键入内容,您将认识到存在一种切片“ a”,“ b”及其子级别的模式,因此我们可以将切片任务分为两部分和concat结果:

pd.concat([
     df.loc[(('a', 'b'), ('u', 'v')), :], 
     df.loc[('d', 'w'), :]
   ], axis=0)

“a”和“b”的切片规范(('a', 'b'), ('u', 'v'))稍微清晰一些,因为两个键所索引的子级别是相同的。


问题7

如何获得级别“two”中的值大于5的所有行?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

这可以用query做到:

df2.query("two > 5")

get_level_values

df2[df2.index.get_level_values('two') > 5]

注意
与此示例类似,我们可以使用这些构造基于任意条件进行过滤。一般而言,要记住loc和xs专门用于基于标签的索引,而query和get_level_values则有助于构建用于过滤的一般条件掩码。


奖金问题

如果我需要对MultiIndex列进行切片怎么办?

实际上,此处的大多数解决方案只需稍作更改也同样适用于列。考虑:

np.random.seed(0)
mux3 = pd.MultiIndex.from_product([
        list('ABCD'), list('efgh')
], names=['one','two'])

df3 = pd.DataFrame(np.random.choice(10, (3, len(mux3))), columns=mux3)
print(df3)

one  A           B           C           D         
two  e  f  g  h  e  f  g  h  e  f  g  h  e  f  g  h
0    5  0  3  3  7  9  3  5  2  4  7  6  8  8  1  6
1    7  7  8  1  5  9  8  9  4  3  0  3  5  0  2  3
2    8  1  3  3  3  7  0  1  9  9  0  4  7  3  2  7

这些是您需要对四个习惯用法进行的以下更改,才能使它们与列一起使用。

  1. 要用loc切片,请使用

    df3.loc[:, ....] # Notice how we slice across the index with `:`. 

    要么,

    df3.loc[:, pd.IndexSlice[...]]
  2. 要恰当地使用xs,只需额外传递参数axis=1。

  3. 您可以使用df.columns.get_level_values直接访问列级别的值。然后,您需要执行类似这样的操作

    df.loc[:, {condition}] 

    其中{condition}代表使用columns.get_level_values构建的某些条件。

  4. 要使用query,您唯一的选择是转置,查询索引并再次转置:

    df3.T.query(...).T

    不建议这样做;请使用其他三个选项之一。

MultiIndex / Advanced Indexing

Note
This post will be structured in the following manner:

  1. The questions put forth in the OP will be addressed, one by one
  2. For each question, one or more methods applicable to solving this problem and getting the expected result will be demonstrated.

Notes (much like this one) will be included for readers interested in learning about additional functionality, implementation details, and other info cursory to the topic at hand. These notes have been compiled through scouring the docs and uncovering various obscure features, and from my own (admittedly limited) experience.

All code samples have been created and tested on pandas v0.23.4, Python 3.7. If something is not clear, or factually incorrect, or if you did not find a solution applicable to your use case, please feel free to suggest an edit, request clarification in the comments, or open a new question, as applicable.

Here is an introduction to some common idioms (henceforth referred to as the Four Idioms) we will be frequently re-visiting:

  1. DataFrame.loc – A general solution for selection by label (+ pd.IndexSlice for more complex applications involving slices)

  2. DataFrame.xs – Extract a particular cross section from a Series/DataFrame.

  3. DataFrame.query – Specify slicing and/or filtering operations dynamically (i.e., as an expression that is evaluated dynamically). This is more applicable to some scenarios than others. Also see this section of the docs for querying on MultiIndexes.

  4. Boolean indexing with a mask generated using MultiIndex.get_level_values (often in conjunction with Index.isin, especially when filtering with multiple values). This is also quite useful in some circumstances.

It will be beneficial to look at the various slicing and filtering problems in terms of the Four Idioms to gain a better understanding of what can be applied to a given situation. It is very important to understand that not all of the idioms will work equally well (if at all) in every circumstance. If an idiom has not been listed as a potential solution to a problem below, that means that idiom cannot be applied to that problem effectively.
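
Before diving in, a setup note: the df used throughout is defined in the question, not in this answer. For readers following along, here is a minimal sketch that reconstructs it (and df2, used in Question 7) from the outputs shown; the exact construction in the original post may differ:

import numpy as np
import pandas as pd

# Reconstructed (assumed) setup: 16 rows, two string levels "one" and "two"
mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])
df = pd.DataFrame({'col': np.arange(len(mux))}, index=mux)

# df2 (Question 7) is assumed to replace level "two" with small integers
mux2 = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    np.random.choice(10, size=16)
], names=['one', 'two'])
df2 = pd.DataFrame({'col': np.arange(len(mux2))}, index=mux2)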


Question 1

How do I select rows having “a” in level “one”?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

You can use loc, as a general purpose solution applicable to most situations:

df.loc[['a']]

At this point, if you get

TypeError: Expected tuple, got str

That means you’re using an older version of pandas. Consider upgrading! Otherwise, use df.loc[('a', slice(None)), :].

Alternatively, you can use xs here, since we are extracting a single cross section. Note the levels and axis arguments (reasonable defaults can be assumed here).

df.xs('a', level=0, axis=0, drop_level=False)
# df.xs('a', drop_level=False)

Here, the drop_level=False argument is needed to prevent xs from dropping level “one” in the result (the level we sliced on).

Yet another option here is using query:

df.query("one == 'a'")

If the index did not have a name, you would need to change your query string to be "ilevel_0 == 'a'".

Finally, using get_level_values:

df[df.index.get_level_values('one') == 'a']
# If your levels are unnamed, or if you need to select by position (not label),
# df[df.index.get_level_values(0) == 'a']

Additionally, how would I be able to drop level “one” in the output?

     col
two     
t      0
u      1
v      2
w      3

This can be easily done using either

df.loc['a'] # Notice the single string argument instead of the list.

Or,

df.xs('a', level=0, axis=0, drop_level=True)
# df.xs('a')

Notice that we can omit the drop_level argument (it is assumed to be True by default).

Note
You may notice that a filtered DataFrame may still have all the levels, even if they do not show when printing the DataFrame out. For example,

v = df.loc[['a']]
print(v)
         col
one two     
a   t      0
    u      1
    v      2
    w      3

print(v.index)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

You can get rid of these levels using MultiIndex.remove_unused_levels:

v.index = v.index.remove_unused_levels()
print(v.index)
MultiIndex(levels=[['a'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

Question 1b

How do I slice all rows with value “t” on level “two”?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

Intuitively, you would want something involving slice():

df.loc[(slice(None), 't'), :]

It Just Works!™ But it is clunky. We can facilitate a more natural slicing syntax using the pd.IndexSlice API here.

idx = pd.IndexSlice
df.loc[idx[:, 't'], :]

This is much, much cleaner.

Note
Why is the trailing slice : across the columns required? This is because loc can be used to select and slice along both axes (axis=0 or axis=1). Without explicitly making it clear which axis the slicing is to be done on, the operation becomes ambiguous. See the big red box in the documentation on slicing.

If you want to remove any shade of ambiguity, loc accepts an axis parameter:

df.loc(axis=0)[pd.IndexSlice[:, 't']]

Without the axis parameter (i.e., just by doing df.loc[pd.IndexSlice[:, 't']]), slicing is assumed to be on the columns, and a KeyError will be raised in this circumstance.

This is documented in slicers. For the purpose of this post, however, we will explicitly specify all axes.

With xs, it is

df.xs('t', axis=0, level=1, drop_level=False)

With query, it is

df.query("two == 't'")
# Or, if the second level has no name,
# df.query("ilevel_1 == 't'")

And finally, with get_level_values, you may do

df[df.index.get_level_values('two') == 't']
# Or, to perform selection by position/integer,
# df[df.index.get_level_values(1) == 't']

All to the same effect.


Question 2

How can I select rows corresponding to items “b” and “d” in level “one”?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

Using loc, this is done in a similar fashion by specifying a list.

df.loc[['b', 'd']]

To solve the above problem of selecting “b” and “d”, you can also use query:

items = ['b', 'd']
df.query("one in @items")
# df.query("one == @items", parser='pandas')
# df.query("one in ['b', 'd']")
# df.query("one == ['b', 'd']", parser='pandas')

Note
Yes, the default parser is 'pandas', but it is important to highlight this syntax isn’t conventionally python. The Pandas parser generates a slightly different parse tree from the expression. This is done to make some operations more intuitive to specify. For more information, please read my post on Dynamic Expression Evaluation in pandas using pd.eval().

And, with get_level_values + Index.isin:

df[df.index.get_level_values("one").isin(['b', 'd'])]

Question 2b

How would I get all values corresponding to “t” and “w” in level “two”?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

With loc, this is possible only in conjunction with pd.IndexSlice.

df.loc[pd.IndexSlice[:, ['t', 'w']], :] 

The first colon : in pd.IndexSlice[:, ['t', 'w']] means to slice across the first level. As the depth of the level being queried increases, you will need to specify more slices, one per level being sliced across. You will not need to specify more levels beyond the one being sliced, however.
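
As a sketch of that rule, consider a hypothetical three-level index (idx3 and d3 are my own names, not from the original post): slicing level "two" requires a slice for level "one", but nothing at all for level "three".

idx3 = pd.MultiIndex.from_product([['a', 'b'], ['t', 'u'], [1, 2]],
                                  names=['one', 'two', 'three'])
d3 = pd.DataFrame({'col': range(len(idx3))}, index=idx3)
d3.loc[pd.IndexSlice[:, ['t']], :]   # level "three" is simply left unspecified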

With query, this is

items = ['t', 'w']
df.query("two in @items")
# df.query("two == @items", parser='pandas') 
# df.query("two in ['t', 'w']")
# df.query("two == ['t', 'w']", parser='pandas')

With get_level_values and Index.isin (similar to above):

df[df.index.get_level_values('two').isin(['t', 'w'])]

Question 3

How do I retrieve a cross section, i.e., a single row having specific values for the index, from df? Specifically, how do I retrieve the cross section of ('c', 'u'), given by

         col
one two     
c   u      9

Use loc by specifying a tuple of keys:

df.loc[('c', 'u'), :]

Or,

df.loc[pd.IndexSlice[('c', 'u')]]

Note
At this point, you may run into a PerformanceWarning that looks like this:

PerformanceWarning: indexing past lexsort depth may impact performance.

This just means that your index is not sorted. pandas depends on the index being sorted (in this case, lexicographically, since we are dealing with string values) for optimal search and retrieval. A quick fix would be to sort your DataFrame in advance using DataFrame.sort_index. This is especially desirable from a performance standpoint if you plan on doing multiple such queries in tandem:

df_sort = df.sort_index()
df_sort.loc[('c', 'u')]

You can also use MultiIndex.is_lexsorted() to check whether the index is sorted or not. This function returns True or False accordingly. You can call this function to determine whether an additional sorting step is required or not.
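
A hedged sketch of that check-then-sort pattern (note that is_lexsorted was deprecated in later pandas versions, where df.index.is_monotonic_increasing is the usual substitute):

if not df.index.is_lexsorted():   # returns True or False
    df = df.sort_index()
df.loc[('c', 'u')]                # now a fast, warning-free lookup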

With xs, this is again simply passing a single tuple as the first argument, with all other arguments set to their appropriate defaults:

df.xs(('c', 'u'))

With query, things become a bit clunky:

df.query("one == 'c' and two == 'u'")

You can see now that this is going to be relatively difficult to generalize. But it is still OK for this particular problem.

With accesses spanning multiple levels, get_level_values can still be used, but is not recommended:

m1 = (df.index.get_level_values('one') == 'c')
m2 = (df.index.get_level_values('two') == 'u')
df[m1 & m2]

Question 4

How do I select the two rows corresponding to ('c', 'u'), and ('a', 'w')?

         col
one two     
c   u      9
a   w      3

With loc, this is still as simple as:

df.loc[[('c', 'u'), ('a', 'w')]]
# df.loc[pd.IndexSlice[[('c', 'u'), ('a', 'w')]]]

With query, you will need to dynamically generate a query string by iterating over your cross sections and levels:

cses = [('c', 'u'), ('a', 'w')]
levels = ['one', 'two']
# This is a useful check to make in advance.
assert all(len(levels) == len(cs) for cs in cses) 

query = '(' + ') or ('.join([
    ' and '.join([f"({l} == {repr(c)})" for l, c in zip(levels, cs)]) 
    for cs in cses
]) + ')'

print(query)
# ((one == 'c') and (two == 'u')) or ((one == 'a') and (two == 'w'))

df.query(query)

100% DO NOT RECOMMEND! But it is possible.

What if I have multiple levels?
One option in this scenario would be to use droplevel to drop the levels you’re not checking, then use isin to test membership, and then boolean index on the final result.

df[df.index.droplevel(unused_level).isin([('c', 'u'), ('a', 'w')])]
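
For instance, on the hypothetical three-level frame d3 introduced under Question 2b, you would drop the level you are not checking before testing membership (a sketch, not from the original answer):

d3[d3.index.droplevel('three').isin([('a', 't'), ('b', 'u')])]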

Question 5

How can I retrieve all rows corresponding to “a” in level “one” or “t” in level “two”?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

This is actually very difficult to do with loc while ensuring correctness and still maintaining code clarity. df.loc[pd.IndexSlice['a', 't']] is incorrect; it is interpreted as df.loc[pd.IndexSlice[('a', 't')]] (i.e., selecting a cross section). You may think of a solution with pd.concat to handle each label separately:

pd.concat([
    df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])

         col
one two     
a   t      0
    u      1
    v      2
    w      3
    t      0   # Does this look right to you? No, it isn't!
b   t      4
    t      8
d   t     12

But you’ll notice one of the rows is duplicated. This is because that row satisfied both slicing conditions, and so appeared twice. You will instead need to do

v = pd.concat([
        df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])
v[~v.index.duplicated()]

But if your DataFrame inherently contains duplicate indices (that you want), then this will not retain them. Use with extreme caution.

With query, this is stupidly simple:

df.query("one == 'a' or two == 't'")

With get_level_values, this is still simple, but not as elegant:

m1 = (df.index.get_level_values('one') == 'a')
m2 = (df.index.get_level_values('two') == 't')
df[m1 | m2] 

Question 6

How can I slice specific cross sections? For “a” and “b”, I would like to select all rows with sub-levels “u” and “v”, and for “d”, I would like to select rows with sub-level “w”.

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

This is a special case that I’ve added to help understand the applicability of the Four Idioms—this is one case where none of them will work effectively, since the slicing is very specific, and does not follow any real pattern.

Usually, slicing problems like this will require explicitly passing a list of keys to loc. One way of doing this is with:

keys = [('a', 'u'), ('a', 'v'), ('b', 'u'), ('b', 'v'), ('d', 'w')]
df.loc[keys, :]

If you want to save some typing, you will recognise that there is a pattern to slicing “a”, “b” and its sublevels, so we can separate the slicing task into two portions and concat the result:

pd.concat([
     df.loc[(('a', 'b'), ('u', 'v')), :], 
     df.loc[('d', 'w'), :]
   ], axis=0)

The slicing specification for “a” and “b”, (('a', 'b'), ('u', 'v')), is slightly cleaner because the sub-levels being indexed are the same for both keys.
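
If the specification is naturally a mapping from first-level keys to allowed sub-levels, one way to avoid writing every tuple by hand is to generate the key list (spec is a hypothetical name; this is a sketch, not from the original answer):

spec = {'a': ['u', 'v'], 'b': ['u', 'v'], 'd': ['w']}
keys = [(one, two) for one, subs in spec.items() for two in subs]
df.loc[keys, :]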


Question 7

How do I get all rows where values in level “two” are greater than 5?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

This can be done using query,

df2.query("two > 5")

And get_level_values.

df2[df2.index.get_level_values('two') > 5]

Note
Similar to this example, we can filter based on any arbitrary condition using these constructs. In general, it is useful to remember that loc and xs are specifically for label-based indexing, while query and get_level_values are helpful for building general conditional masks for filtering.
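
As a small sketch of such arbitrary conditions, index-level masks compose freely with column-based ones via & and |:

m = df2.index.get_level_values('two') > 5   # condition on the index level
df2[m & (df2['col'] < 12)]                  # combined with a column condition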


Bonus Question

What if I need to slice a MultiIndex column?

Actually, most solutions here are applicable to columns as well, with minor changes. Consider:

np.random.seed(0)
mux3 = pd.MultiIndex.from_product([
        list('ABCD'), list('efgh')
], names=['one','two'])

df3 = pd.DataFrame(np.random.choice(10, (3, len(mux3))), columns=mux3)
print(df3)

one  A           B           C           D         
two  e  f  g  h  e  f  g  h  e  f  g  h  e  f  g  h
0    5  0  3  3  7  9  3  5  2  4  7  6  8  8  1  6
1    7  7  8  1  5  9  8  9  4  3  0  3  5  0  2  3
2    8  1  3  3  3  7  0  1  9  9  0  4  7  3  2  7

These are the following changes you will need to make to the Four Idioms to have them working with columns.

  1. To slice with loc, use

     df3.loc[:, ....] # Notice how we slice across the index with `:`. 
    

    or,

     df3.loc[:, pd.IndexSlice[...]]
    
  2. To use xs as appropriate, just pass an argument axis=1.

  3. You can access the column level values directly using df.columns.get_level_values. You will then need to do something like

     df.loc[:, {condition}] 
    

    Where {condition} represents some condition built using columns.get_level_values.

  4. To use query, your only option is to transpose, query on the index, and transpose again:

     df3.T.query(...).T
    

    Not recommended; use one of the other 3 options. (A concrete sketch follows below.)
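
For completeness, here is a concrete sketch of option 4 on the df3 above, selecting all columns whose level "two" is "e":

df3.T.query("two == 'e'").T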


回答 1

最近,我遇到一个用例:一个3级以上的多级索引数据框,用上面的任何解决方案我都无法得到想要的结果。上面的解决方案很可能确实适用于我的用例,我也尝试了几种,但在可用的时间内没能让它们奏效。

我距离专家还很远,但我偶然发现了一个未在上面综合答案中列出的解决方案。我不保证这些解决方案在任何方面是最佳的。

这是另一种方法,可以得到与上面问题6稍有不同的结果(其他问题可能也适用)。

我特别在寻找:

  1. 一种从一个索引级别选择两个以上的值,从另一个索引级别选择一个值的方法,以及
  2. 在数据帧输出中保留上一操作的索引值的方法。

有一个麻烦之处(不过完全可以解决):

  1. 索引未命名。

在下面的玩具数据帧上:

    index = pd.MultiIndex.from_product([['a','b'],
                               ['stock1','stock2','stock3'],
                               ['price','volume','velocity']])

    df = pd.DataFrame([1,2,3,4,5,6,7,8,9,
                      10,11,12,13,14,15,16,17,18], 
                       index)

                        0
    a stock1 price      1
             volume     2
             velocity   3
      stock2 price      4
             volume     5
             velocity   6
      stock3 price      7
             volume     8
             velocity   9
    b stock1 price     10
             volume    11
             velocity  12
      stock2 price     13
             volume    14
             velocity  15
      stock3 price     16
             volume    17
             velocity  18

当然,使用下面的代码是可行的:

    df.xs(('stock1', 'velocity'), level=(1,2))

        0
    a   3
    b  12

但是我想要一个不同的结果,所以获得该结果的方法是:

   df.iloc[df.index.isin(['stock1'], level=1) & 
           df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
    b stock1 velocity  12

如果我想从一个级别获得两个以上的值,并从另一个级别获得一个(或2个以上)值:

    df.iloc[df.index.isin(['stock1','stock3'], level=1) & 
            df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
      stock3 velocity   9
    b stock1 velocity  12
      stock3 velocity  18

上面的方法可能有点笨拙,但我发现它满足了我的需求,而且额外的好处是对我来说更容易理解和阅读。

Recently I came across a use case where I had a 3+ level multi-index dataframe in which I couldn't make any of the solutions above produce the results I was looking for. It's quite possible that the above solutions do of course work for my use case, and I tried several; however, I was unable to get them to work in the time I had available.

I am far from an expert, but I stumbled across a solution that was not listed in the comprehensive answers above. I offer no guarantee that the solutions are in any way optimal.

This is a different way to get a slightly different result to Question #6 above. (and likely other questions as well)

Specifically I was looking for:

  1. A way to choose two+ values from one level of the index and a single value from another level of the index, and
  2. A way to leave the index values from the previous operation in the dataframe output.

As a monkey wrench in the gears (however totally fixable):

  1. The indexes were unnamed.

On the toy dataframe below:

    index = pd.MultiIndex.from_product([['a','b'],
                               ['stock1','stock2','stock3'],
                               ['price','volume','velocity']])

    df = pd.DataFrame([1,2,3,4,5,6,7,8,9,
                      10,11,12,13,14,15,16,17,18], 
                       index)

                        0
    a stock1 price      1
             volume     2
             velocity   3
      stock2 price      4
             volume     5
             velocity   6
      stock3 price      7
             volume     8
             velocity   9
    b stock1 price     10
             volume    11
             velocity  12
      stock2 price     13
             volume    14
             velocity  15
      stock3 price     16
             volume    17
             velocity  18

Using the below works, of course:

    df.xs(('stock1', 'velocity'), level=(1,2))

        0
    a   3
    b  12

But I wanted a different result, so my method to get that result was:

   df.iloc[df.index.isin(['stock1'], level=1) & 
           df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
    b stock1 velocity  12

And if I wanted two+ values from one level and a single (or 2+) value from another level:

    df.iloc[df.index.isin(['stock1','stock3'], level=1) & 
            df.index.isin(['velocity'], level=2)] 

                        0
    a stock1 velocity   3
      stock3 velocity   9
    b stock1 velocity  12
      stock3 velocity  18

The above method is probably a bit clunky, however I found it filled my needs and as a bonus was easier for me to understand and read.
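
The same isin-per-level pattern extends to any number of levels; a sketch on the toy frame above, adding a condition on the unnamed first level as well:

df.iloc[df.index.isin(['a'], level=0) &
        df.index.isin(['stock1', 'stock3'], level=1) &
        df.index.isin(['price'], level=2)]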


熊猫数据框fillna()仅存在一些列

问题:熊猫数据框fillna()仅存在一些列

我试图仅对某些列子集,用0填充Pandas数据框中的空值。

当我做:

import pandas as pd
df = pd.DataFrame(data={'a':[1,2,3,None],'b':[4,5,None,6],'c':[None,None,7,8]})
print df
df.fillna(value=0, inplace=True)
print df

输出:

     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  NaN  7.0
3  NaN  6.0  8.0
     a    b    c
0  1.0  4.0  0.0
1  2.0  5.0  0.0
2  3.0  0.0  7.0
3  0.0  6.0  8.0

它将每一个None都替换成了0。我想要做的是,只替换列a和b中的None,而不替换c中的。

最好的方法是什么?

I am trying to fill none values in a Pandas dataframe with 0’s for only some subset of columns.

When I do:

import pandas as pd
df = pd.DataFrame(data={'a':[1,2,3,None],'b':[4,5,None,6],'c':[None,None,7,8]})
print df
df.fillna(value=0, inplace=True)
print df

The output:

     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  NaN  7.0
3  NaN  6.0  8.0
     a    b    c
0  1.0  4.0  0.0
1  2.0  5.0  0.0
2  3.0  0.0  7.0
3  0.0  6.0  8.0

It replaces every None with 0‘s. What I want to do is, only replace Nones in columns a and b, but not c.

What is the best way of doing this?


回答 0

您可以选择所需的列并通过分配来完成:

df[['a', 'b']] = df[['a','b']].fillna(value=0)

结果输出与预期的一样:

     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  0.0  7.0
3  0.0  6.0  8.0

You can select your desired columns and do it by assignment:

df[['a', 'b']] = df[['a','b']].fillna(value=0)

The resulting output is as expected:

     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  0.0  7.0
3  0.0  6.0  8.0

回答 1

您可以使用dict,通过fillna为不同的列填充不同的值:

df.fillna({'a':0,'b':0})
Out[829]: 
     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  0.0  7.0
3  0.0  6.0  8.0

之后将其赋值回去

df=df.fillna({'a':0,'b':0})
df
Out[831]: 
     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  0.0  7.0
3  0.0  6.0  8.0

You can using dict , fillna with different value for different column

df.fillna({'a':0,'b':0})
Out[829]: 
     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  0.0  7.0
3  0.0  6.0  8.0

After assign it back

df=df.fillna({'a':0,'b':0})
df
Out[831]: 
     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  0.0  7.0
3  0.0  6.0  8.0

回答 2

您可以将Wen的解决方案与inplace=True结合使用,以避免复制对象:

df.fillna({'a':0, 'b':0}, inplace=True)
print(df)

生成:

     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  0.0  7.0
3  0.0  6.0  8.0

You can avoid making a copy of the object using Wen’s solution and inplace=True:

df.fillna({'a':0, 'b':0}, inplace=True)
print(df)

Which yields:

     a    b    c
0  1.0  4.0  NaN
1  2.0  5.0  NaN
2  3.0  0.0  7.0
3  0.0  6.0  8.0

回答 3

这是您可以在一行中完成所有操作的方法:

df[['a', 'b']].fillna(value=0, inplace=True)

细分:df[['a', 'b']]选择要填充NaN值的列,value=0告诉它用零填充NaN,inplace=True则使更改直接生效,而无需复制该对象。

Here’s how you can do it all in one line:

df[['a', 'b']].fillna(value=0, inplace=True)

Breakdown: df[['a', 'b']] selects the columns you want to fill NaN values for, value=0 tells it to fill NaNs with zero, and inplace=True will make the changes permanent, without having to make a copy of the object.


回答 4

使用最上面的答案会产生有关更改df切片副本的警告。假设您还有其他列,执行此操作的更好方法是传递字典:
df.fillna({'A': 'NA', 'B': 'NA'}, inplace=True)

using the top answer produces a warning about making changes to a copy of a df slice. Assuming that you have other columns, a better way to do this is to pass a dictionary:
df.fillna({'A': 'NA', 'B': 'NA'}, inplace=True)


回答 5

或类似的东西:

df.loc[df['a'].isnull(),'a']=0
df.loc[df['b'].isnull(),'b']=0

如果还有更多:

for i in your_list:
    df.loc[df[i].isnull(),i]=0

Or something like:

df.loc[df['a'].isnull(),'a']=0
df.loc[df['b'].isnull(),'b']=0

and if there is more:

for i in your_list:
    df.loc[df[i].isnull(),i]=0
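
A loop-free variant of the same idea, as a sketch, builds the fillna dict from your_list:

df.fillna({col: 0 for col in your_list}, inplace=True)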

回答 6

有时,此语法无法正常工作:

df[['col1','col2']] = df[['col1','col2']].fillna()

请改用以下内容:

df['col1','col2']

Sometimes this syntax wont work:

df[['col1','col2']] = df[['col1','col2']].fillna()

Use the following instead:

df['col1','col2']

如何在Pandas数据框中查找哪些列包含任何NaN值

问题:如何在Pandas数据框中查找哪些列包含任何NaN值

给定一个pandas数据框,其中可能散布着一些NaN值:

问题:如何确定哪些列包含NaN值?特别是,可以获取包含NaN的列名称的列表吗?

Given a pandas dataframe containing possible NaN values scattered here and there:

Question: How do I determine which columns contain NaN values? In particular, can I get a list of the column names containing NaNs?


回答 0

更新:使用熊猫0.22.0

较新的Pandas版本具有新的方法‘DataFrame.isna()’‘DataFrame.notna()’

In [71]: df
Out[71]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1

In [72]: df.isna().any()
Out[72]:
a     True
b     True
c    False
dtype: bool

作为列列表:

In [74]: df.columns[df.isna().any()].tolist()
Out[74]: ['a', 'b']

选择这些列(至少包含一个NaN值):

In [73]: df.loc[:, df.isna().any()]
Out[73]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0

旧答案:

尝试使用isnull()

In [97]: df
Out[97]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1

In [98]: pd.isnull(df).sum() > 0
Out[98]:
a     True
b     True
c    False
dtype: bool

或作为@root建议的更清晰的版本:

In [5]: df.isnull().any()
Out[5]:
a     True
b     True
c    False
dtype: bool

In [7]: df.columns[df.isnull().any()].tolist()
Out[7]: ['a', 'b']

选择一个子集-所有列至少包含一个NaN值:

In [31]: df.loc[:, df.isnull().any()]
Out[31]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0

UPDATE: using Pandas 0.22.0

Newer Pandas versions have new methods ‘DataFrame.isna()’ and ‘DataFrame.notna()’

In [71]: df
Out[71]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1

In [72]: df.isna().any()
Out[72]:
a     True
b     True
c    False
dtype: bool

as list of columns:

In [74]: df.columns[df.isna().any()].tolist()
Out[74]: ['a', 'b']

to select those columns (containing at least one NaN value):

In [73]: df.loc[:, df.isna().any()]
Out[73]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0

OLD answer:

Try to use isnull():

In [97]: df
Out[97]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1

In [98]: pd.isnull(df).sum() > 0
Out[98]:
a     True
b     True
c    False
dtype: bool

or as @root proposed clearer version:

In [5]: df.isnull().any()
Out[5]:
a     True
b     True
c    False
dtype: bool

In [7]: df.columns[df.isnull().any()].tolist()
Out[7]: ['a', 'b']

to select a subset – all columns containing at least one NaN value:

In [31]: df.loc[:, df.isnull().any()]
Out[31]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0

回答 1

您可以使用df.isnull().sum()。它显示了所有列以及每个功能的总NaN。

You can use df.isnull().sum(). It shows all columns and the total NaNs of each feature.


回答 2

我遇到的问题是列太多,无法在屏幕上逐一目视检查,因此可以用一个简短的列表推导式来筛选并返回有问题的列:

nan_cols = [i for i in df.columns if df[i].isnull().any()]

如果这对任何人有帮助

I had a problem where I had too many columns to visually inspect on the screen, so a short list comp that filters and returns the offending columns is

nan_cols = [i for i in df.columns if df[i].isnull().any()]

if that’s helpful to anyone


回答 3

在具有大量列的数据集中,最好查看有多少列包含空值而有多少列不包含空值。

print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))

print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))

print("Total no. of columns in the dataframe")
print(len(df.columns))

例如,在我的数据框中,它包含82列,其中19列至少包含一个空值。

此外,您还可以根据哪些列/行含有更多空值来自动删除它们。下面的代码可以智能地做到这一点:

df = df.drop(df.columns[df.isna().sum()>len(df.columns)],axis = 1)
df = df.dropna(axis = 0).reset_index(drop=True)

注意:上面的代码删除了所有空值。如果需要空值,请先处理它们。

In datasets having a large number of columns it's even better to see how many columns contain null values and how many don't.

print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))

print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))

print("Total no. of columns in the dataframe")
print(len(df.columns))

For example in my dataframe it contained 82 columns, of which 19 contained at least one null value.

Further, you can also automatically remove cols and rows depending on which have more null values.
Here is the code which does this intelligently:

df = df.drop(df.columns[df.isna().sum()>len(df.columns)],axis = 1)
df = df.dropna(axis = 0).reset_index(drop=True)

Note: Above code removes all of your null values. If you want null values, process them before.


回答 4

我使用以下三行代码来打印出包含至少一个空值的列名:

for column in dataframe:
    if dataframe[column].isnull().any():
       print('{0} has {1} null values'.format(column, dataframe[column].isnull().sum()))

I use these three lines of code to print out the column names which contain at least one null value:

for column in dataframe:
    if dataframe[column].isnull().any():
       print('{0} has {1} null values'.format(column, dataframe[column].isnull().sum()))

回答 5

这两个都应该起作用:

df.isnull().sum()
df.isna().sum()

DataFrame的方法isna()和isnull()完全相同。

注意:空字符串''被视为False(不视为NA)

Both of these should work:

df.isnull().sum()
df.isna().sum()

DataFrame methods isna() or isnull() are completely identical.

Note: Empty strings '' are considered False (not considered NA)


回答 6

这对我有用

1.用于获取具有至少1个空值的列。(列名)

data.columns[data.isnull().any()]

2.用于获取具有count且具有至少1个空值的Columns。

data[data.columns[data.isnull().any()]].isnull().sum()

[可选] 3.用于获取空计数的百分比。

data[data.columns[data.isnull().any()]].isnull().sum() * 100 / data.shape[0]

This worked for me,

1. For getting Columns having at least 1 null value. (column names)

data.columns[data.isnull().any()]

2. For getting Columns with count, with having at least 1 null value.

data[data.columns[data.isnull().any()]].isnull().sum()

[Optional] 3. For getting percentage of the null count.

data[data.columns[data.isnull().any()]].isnull().sum() * 100 / data.shape[0]
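
A sketch that combines the count and the percentage into one summary table (null_summary is my own name, not from the answer):

null_summary = pd.DataFrame({
    'n_null': data.isnull().sum(),
    'pct_null': data.isnull().mean() * 100,
})
null_summary[null_summary['n_null'] > 0]   # keep only columns with nulls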

如何将pandas DataFrame的第一列作为系列?

问题:如何将pandas DataFrame的第一列作为系列?

我试过了:

x=pandas.DataFrame(...)
s = x.take([0], axis=1)

s获取一个DataFrame,而不是一个Series。

I tried:

x=pandas.DataFrame(...)
s = x.take([0], axis=1)

And s gets a DataFrame, not a Series.


回答 0

>>> import pandas as pd
>>> df = pd.DataFrame({'x' : [1, 2, 3, 4], 'y' : [4, 5, 6, 7]})
>>> df
   x  y
0  1  4
1  2  5
2  3  6
3  4  7
>>> s = df.ix[:,0]
>>> type(s)
<class 'pandas.core.series.Series'>
>>>

===========================================================================

更新

如果您在2017年6月之后阅读本文:ix已在pandas 0.20.2中被弃用,因此请不要使用它。请改用loc或iloc。请参阅对此问题的评论和其他答案。

>>> import pandas as pd
>>> df = pd.DataFrame({'x' : [1, 2, 3, 4], 'y' : [4, 5, 6, 7]})
>>> df
   x  y
0  1  4
1  2  5
2  3  6
3  4  7
>>> s = df.ix[:,0]
>>> type(s)
<class 'pandas.core.series.Series'>
>>>

===========================================================================

UPDATE

If you’re reading this after June 2017, ix has been deprecated in pandas 0.20.2, so don’t use it. Use loc or iloc instead. See comments and other answers to this question.


回答 1

您可以通过以下代码将第一列作为系列:

x[x.columns[0]]

You can get the first column as a Series by following code:

x[x.columns[0]]

回答 2

从v0.11 +开始,…使用df.iloc

In [7]: df.iloc[:,0]
Out[7]: 
0    1
1    2
2    3
3    4
Name: x, dtype: int64

From v0.11+, … use df.iloc.

In [7]: df.iloc[:,0]
Out[7]: 
0    1
1    2
2    3
3    4
Name: x, dtype: int64

回答 3

这不是最简单的方法吗?

按列名:

In [20]: df = pd.DataFrame({'x' : [1, 2, 3, 4], 'y' : [4, 5, 6, 7]})
In [21]: df
Out[21]:
    x   y
0   1   4
1   2   5
2   3   6
3   4   7

In [23]: df.x
Out[23]:
0    1
1    2
2    3
3    4
Name: x, dtype: int64

In [24]: type(df.x)
Out[24]:
pandas.core.series.Series

Isn’t this the simplest way?

By column name:

In [20]: df = pd.DataFrame({'x' : [1, 2, 3, 4], 'y' : [4, 5, 6, 7]})
In [21]: df
Out[21]:
    x   y
0   1   4
1   2   5
2   3   6
3   4   7

In [23]: df.x
Out[23]:
0    1
1    2
2    3
3    4
Name: x, dtype: int64

In [24]: type(df.x)
Out[24]:
pandas.core.series.Series

回答 4

当您要从csv文件加载系列时,这非常有用

x = pd.read_csv('x.csv', index_col=False, names=['x'],header=None).iloc[:,0]
print(type(x))
print(x.head(10))


<class 'pandas.core.series.Series'>
0    110.96
1    119.40
2    135.89
3    152.32
4    192.91
5    177.20
6    181.16
7    177.30
8    200.13
9    235.41
Name: x, dtype: float64

This works great when you want to load a series from a csv file

x = pd.read_csv('x.csv', index_col=False, names=['x'],header=None).iloc[:,0]
print(type(x))
print(x.head(10))


<class 'pandas.core.series.Series'>
0    110.96
1    119.40
2    135.89
3    152.32
4    192.91
5    177.20
6    181.16
7    177.30
8    200.13
9    235.41
Name: x, dtype: float64

回答 5

df[df.columns[i]]

其中i是列的位置/编号(从0开始)。

因此,i = 0是第一列。

您也可以使用i = -1来获取最后一列。

df[df.columns[i]]

where i is the position/number of the column(starting from 0).

So, i = 0 is for the first column.

You can also get the last column using i = -1


重命名熊猫DataFrame索引

问题:重命名熊猫DataFrame索引

我有一个没有标题行的csv文件,带有DateTime索引。我想重命名索引和列名,但使用df.rename()只重命名了列名。这是bug吗?我用的是0.12.0版本。

In [2]: df = pd.read_csv(r'D:\Data\DataTimeSeries_csv//seriesSM.csv', header=None, parse_dates=[[0]], index_col=[0] )

In [3]: df.head()
Out[3]: 
                   1
0                   
2002-06-18  0.112000
2002-06-22  0.190333
2002-06-26  0.134000
2002-06-30  0.093000
2002-07-04  0.098667

In [4]: df.rename(index={0:'Date'}, columns={1:'SM'}, inplace=True)

In [5]: df.head()
Out[5]: 
                  SM
0                   
2002-06-18  0.112000
2002-06-22  0.190333
2002-06-26  0.134000
2002-06-30  0.093000
2002-07-04  0.098667

I’ve a csv file without header, with a DateTime index. I want to rename the index and column name, but with df.rename() only the column name is renamed. Bug? I’m on version 0.12.0

In [2]: df = pd.read_csv(r'D:\Data\DataTimeSeries_csv//seriesSM.csv', header=None, parse_dates=[[0]], index_col=[0] )

In [3]: df.head()
Out[3]: 
                   1
0                   
2002-06-18  0.112000
2002-06-22  0.190333
2002-06-26  0.134000
2002-06-30  0.093000
2002-07-04  0.098667

In [4]: df.rename(index={0:'Date'}, columns={1:'SM'}, inplace=True)

In [5]: df.head()
Out[5]: 
                  SM
0                   
2002-06-18  0.112000
2002-06-22  0.190333
2002-06-26  0.134000
2002-06-30  0.093000
2002-07-04  0.098667

回答 0

rename方法接受的字典作用于索引的值。
您想要的是重命名索引级别的名称:

df.index.names = ['Date']

考虑这一点的一种好方法是,列和索引是同一类型的对象(IndexMultiIndex),您可以通过转置来互换二者。

这有点令人困惑,因为索引名称与列具有相似的含义,因此这里有更多示例:

In [1]: df = pd.DataFrame([[1, 2, 3], [4, 5 ,6]], columns=list('ABC'))

In [2]: df
Out[2]: 
   A  B  C
0  1  2  3
1  4  5  6

In [3]: df1 = df.set_index('A')

In [4]: df1
Out[4]: 
   B  C
A      
1  2  3
4  5  6

您可以看到对索引的重命名,它可以更改值1:

In [5]: df1.rename(index={1: 'a'})
Out[5]: 
   B  C
A      
a  2  3
4  5  6

In [6]: df1.rename(columns={'B': 'BB'})
Out[6]: 
   BB  C
A       
1   2  3
4   5  6

重命名级别名称时:

In [7]: df1.index.names = ['index']
        df1.columns.names = ['column']

注意:此属性只是一个列表,您可以通过列表推导式/map来完成重命名。

In [8]: df1
Out[8]: 
column  B  C
index       
1       2  3
4       5  6

The rename method takes a dictionary for the index which applies to index values.
You want to rename to index level’s name:

df.index.names = ['Date']

A good way to think about this is that columns and index are the same type of object (Index or MultiIndex), and you can interchange the two via transpose.

This is a little bit confusing since the index names have a similar meaning to columns, so here are some more examples:

In [1]: df = pd.DataFrame([[1, 2, 3], [4, 5 ,6]], columns=list('ABC'))

In [2]: df
Out[2]: 
   A  B  C
0  1  2  3
1  4  5  6

In [3]: df1 = df.set_index('A')

In [4]: df1
Out[4]: 
   B  C
A      
1  2  3
4  5  6

You can see the rename on the index, which can change the value 1:

In [5]: df1.rename(index={1: 'a'})
Out[5]: 
   B  C
A      
a  2  3
4  5  6

In [6]: df1.rename(columns={'B': 'BB'})
Out[6]: 
   BB  C
A       
1   2  3
4   5  6

Whilst renaming the level names:

In [7]: df1.index.names = ['index']
        df1.columns.names = ['column']

Note: this attribute is just a list, and you could do the renaming as a list comprehension/map.

In [8]: df1
Out[8]: 
column  B  C
index       
1       2  3
4       5  6

回答 1

当前被选中的答案没有提及rename_axis方法,它可用于重命名索引和列的级别名称。


重命名索引级别时,Pandas具有一些古怪之处。还有一个新的DataFrame方法rename_axis可用于更改索引级别名称。

让我们看一下DataFrame

df = pd.DataFrame({'age':[30, 2, 12],
                       'color':['blue', 'green', 'red'],
                       'food':['Steak', 'Lamb', 'Mango'],
                       'height':[165, 70, 120],
                       'score':[4.6, 8.3, 9.0],
                       'state':['NY', 'TX', 'FL']},
                       index = ['Jane', 'Nick', 'Aaron'])

在此处输入图片说明

此DataFrame的行索引和列索引各有一个级别,而且都没有名称。让我们将行索引级别的名称更改为“names”。

df.rename_axis('names')

在此处输入图片说明

rename_axis方法还可以通过更改axis参数来更改列级别名称:

df.rename_axis('names').rename_axis('attributes', axis='columns')

在此处输入图片说明

如果使用某些列设置索引,则列名称将成为新的索引级别名称。让我们将索引级别附加到原始DataFrame上:

df1 = df.set_index(['state', 'color'], append=True)
df1

在此处输入图片说明

注意原始索引没有名称。我们仍然可以使用rename_axis,但需要向它传递一个长度与索引级别数相同的列表。

df1.rename_axis(['names', None, 'Colors'])

在此处输入图片说明

您可以None用来有效地删除索引级别名称。


系列工作类似,但有所不同

让我们创建一个具有三个索引级别的系列

s = df.set_index(['state', 'color'], append=True)['food']
s

       state  color
Jane   NY     blue     Steak
Nick   TX     green     Lamb
Aaron  FL     red      Mango
Name: food, dtype: object

我们可以像在DataFrame上一样使用rename_axis

s.rename_axis(['Names','States','Colors'])

Names  States  Colors
Jane   NY      blue      Steak
Nick   TX      green      Lamb
Aaron  FL      red       Mango
Name: food, dtype: object

请注意,该系列下面还有一个元数据,称为 Name。从DataFrame创建系列时,此属性设置为列名。

我们可以将字符串名称传递给rename方法以进行更改

s.rename('FOOOOOD')

       state  color
Jane   NY     blue     Steak
Nick   TX     green     Lamb
Aaron  FL     red      Mango
Name: FOOOOOD, dtype: object

DataFrame没有此属性,实际上如果这样使用会引发异常

df.rename('my dataframe')
TypeError: 'str' object is not callable

在pandas 0.21之前,您可以使用rename_axis来重命名索引和列中的值。该用法已被弃用,所以不要这样做。

The currently selected answer does not mention the rename_axis method which can be used to rename the index and column levels.


Pandas has some quirkiness when it comes to renaming the levels of the index. There is also a new DataFrame method rename_axis available to change the index level names.

Let’s take a look at a DataFrame

df = pd.DataFrame({'age':[30, 2, 12],
                       'color':['blue', 'green', 'red'],
                       'food':['Steak', 'Lamb', 'Mango'],
                       'height':[165, 70, 120],
                       'score':[4.6, 8.3, 9.0],
                       'state':['NY', 'TX', 'FL']},
                       index = ['Jane', 'Nick', 'Aaron'])

enter image description here

This DataFrame has one level for each of the row and column indexes. Both the row and column index have no name. Let’s change the row index level name to ‘names’.

df.rename_axis('names')

enter image description here

The rename_axis method also has the ability to change the column level names by changing the axis parameter:

df.rename_axis('names').rename_axis('attributes', axis='columns')

enter image description here

If you set the index with some of the columns, then the column name will become the new index level name. Let's append index levels to our original DataFrame:

df1 = df.set_index(['state', 'color'], append=True)
df1

enter image description here

Notice how the original index has no name. We can still use rename_axis but need to pass it a list the same length as the number of index levels.

df1.rename_axis(['names', None, 'Colors'])

enter image description here

You can use None to effectively delete the index level names.


Series work similarly but with some differences

Let’s create a Series with three index levels

s = df.set_index(['state', 'color'], append=True)['food']
s

       state  color
Jane   NY     blue     Steak
Nick   TX     green     Lamb
Aaron  FL     red      Mango
Name: food, dtype: object

We can use rename_axis similarly to how we did with DataFrames

s.rename_axis(['Names','States','Colors'])

Names  States  Colors
Jane   NY      blue      Steak
Nick   TX      green      Lamb
Aaron  FL      red       Mango
Name: food, dtype: object

Notice that there is an extra piece of metadata below the Series called Name. When creating a Series from a DataFrame, this attribute is set to the column name.

We can pass a string name to the rename method to change it

s.rename('FOOOOOD')

       state  color
Jane   NY     blue     Steak
Nick   TX     green     Lamb
Aaron  FL     red      Mango
Name: FOOOOOD, dtype: object

DataFrames do not have this attribute and in fact will raise an exception if used like this

df.rename('my dataframe')
TypeError: 'str' object is not callable

Prior to pandas 0.21, you could have used rename_axis to rename the values in the index and columns. It has been deprecated so don’t do this


回答 2

对于较新的pandas版本

df.index = df.index.rename('new name')

要么

df.index.rename('new name', inplace=True)

如果数据框应保留其所有属性,则需要后者

For newer pandas versions

df.index = df.index.rename('new name')

or

df.index.rename('new name', inplace=True)

The latter is required if a data frame should retain all its properties.


回答 3

在Pandas 0.13及更高版本中,索引级别名称是不可变的(类型FrozenList),不能再直接设置。您必须首先使用Index.rename()将新的索引级别名称应用到Index,然后再使用DataFrame.reindex()将新的索引应用到DataFrame。例子:

对于熊猫版本<0.13

df.index.names = ['Date']

对于熊猫版本> = 0.13

df = df.reindex(df.index.rename(['Date']))

In Pandas version 0.13 and greater the index level names are immutable (type FrozenList) and can no longer be set directly. You must first use Index.rename() to apply the new index level names to the Index and then use DataFrame.reindex() to apply the new index to the DataFrame. Examples:

For Pandas version < 0.13

df.index.names = ['Date']

For Pandas version >= 0.13

df = df.reindex(df.index.rename(['Date']))

回答 4

您还可以如下使用Index.set_names:

In [25]: x = pd.DataFrame({'year':[1,1,1,1,2,2,2,2],
   ....:                   'country':['A','A','B','B','A','A','B','B'],
   ....:                   'prod':[1,2,1,2,1,2,1,2],
   ....:                   'val':[10,20,15,25,20,30,25,35]})

In [26]: x = x.set_index(['year','country','prod']).squeeze()

In [27]: x
Out[27]: 
year  country  prod
1     A        1       10
               2       20
      B        1       15
               2       25
2     A        1       20
               2       30
      B        1       25
               2       35
Name: val, dtype: int64
In [28]: x.index = x.index.set_names('foo', level=1)

In [29]: x
Out[29]: 
year  foo  prod
1     A    1       10
           2       20
      B    1       15
           2       25
2     A    1       20
           2       30
      B    1       25
           2       35
Name: val, dtype: int64

You can also use Index.set_names as follows:

In [25]: x = pd.DataFrame({'year':[1,1,1,1,2,2,2,2],
   ....:                   'country':['A','A','B','B','A','A','B','B'],
   ....:                   'prod':[1,2,1,2,1,2,1,2],
   ....:                   'val':[10,20,15,25,20,30,25,35]})

In [26]: x = x.set_index(['year','country','prod']).squeeze()

In [27]: x
Out[27]: 
year  country  prod
1     A        1       10
               2       20
      B        1       15
               2       25
2     A        1       20
               2       30
      B        1       25
               2       35
Name: val, dtype: int64
In [28]: x.index = x.index.set_names('foo', level=1)

In [29]: x
Out[29]: 
year  foo  prod
1     A    1       10
           2       20
      B    1       15
           2       25
2     A    1       20
           2       30
      B    1       25
           2       35
Name: val, dtype: int64

回答 5

如果要使用相同的映射来重命名列和索引,则可以执行以下操作:

mapping = {0:'Date', 1:'SM'}
df.index.names = list(map(lambda name: mapping.get(name, name), df.index.names))
df.rename(columns=mapping, inplace=True)

If you want to use the same mapping for renaming both columns and index you can do:

mapping = {0:'Date', 1:'SM'}
df.index.names = list(map(lambda name: mapping.get(name, name), df.index.names))
df.rename(columns=mapping, inplace=True)

回答 6

df.index.rename('new name', inplace=True)

是唯一对我有效的方法(pandas 0.22.0)。
如果没有inplace = True,则在我的情况下不会设置索引名称。

df.index.rename('new name', inplace=True)

Is the only one that does the job for me (pandas 0.22.0).
Without the inplace=True, the name of the index is not set in my case.


回答 7

您可以使用pandas.DataFrame的index和columns属性。注意:列表元素的数量必须与行/列的数量匹配。

#       A   B   C
# ONE   11  12  13
# TWO   21  22  23
# THREE 31  32  33

df.index = [1, 2, 3]
df.columns = ['a', 'b', 'c']
print(df)

#     a   b   c
# 1  11  12  13
# 2  21  22  23
# 3  31  32  33

you can use index and columns attributes of pandas.DataFrame. NOTE: number of elements of list must match the number of rows/columns.

#       A   B   C
# ONE   11  12  13
# TWO   21  22  23
# THREE 31  32  33

df.index = [1, 2, 3]
df.columns = ['a', 'b', 'c']
print(df)

#     a   b   c
# 1  11  12  13
# 2  21  22  23
# 3  31  32  33
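
Tying the answers back to the original question, a sketch on newer pandas (the integer column key 1 and the names 'Date'/'SM' are taken from the question):

df = df.rename_axis('Date')          # name the index level
df = df.rename(columns={1: 'SM'})    # rename the data column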

如何用熊猫DataFrame中的先前值替换NaN?

问题:如何用熊猫DataFrame中的先前值替换NaN?

假设我有一个带有NaNs 的DataFrame :

>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df
    0   1   2
0   1   2   3
1   4 NaN NaN
2 NaN NaN   9

我需要做的是,将每个NaN替换为它上方同一列中的第一个非NaN值。假设第一行永远不会包含NaN。因此,对于前面的示例,结果将是

   0  1  2
0  1  2  3
1  4  2  3
2  4  2  9

我可以逐列、逐元素地遍历整个DataFrame并直接设置值,但是否有一种简单的方法(最好不用循环)来实现呢?

Suppose I have a DataFrame with some NaNs:

>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df
    0   1   2
0   1   2   3
1   4 NaN NaN
2 NaN NaN   9

What I need to do is replace every NaN with the first non-NaN value in the same column above it. It is assumed that the first row will never contain a NaN. So for the previous example the result would be

   0  1  2
0  1  2  3
1  4  2  3
2  4  2  9

I can just loop through the whole DataFrame column-by-column, element-by-element and set the values directly, but is there an easy (optimally a loop-free) way of achieving this?


回答 0

您可以在DataFrame上使用fillna方法,并将method指定为ffill(向前填充):

>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df.fillna(method='ffill')
   0  1  2
0  1  2  3
1  4  2  3
2  4  2  9

这个方法

将上一个有效观察结果传播到下一个有效观察结果

要反向填充,还有一个bfill方法。

此方法不会就地修改DataFrame;您需要将返回的DataFrame重新绑定到变量,或者指定inplace=True:

df.fillna(method='ffill', inplace=True)

You could use the fillna method on the DataFrame and specify the method as ffill (forward fill):

>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df.fillna(method='ffill')
   0  1  2
0  1  2  3
1  4  2  3
2  4  2  9

This method…

propagate[s] last valid observation forward to next valid

To go the opposite way, there’s also a bfill method.

This method doesn’t modify the DataFrame inplace – you’ll need to rebind the returned DataFrame to a variable or else specify inplace=True:

df.fillna(method='ffill', inplace=True)

回答 1

公认的答案是完美的。我遇到了一个相关但略有不同的情况:我必须向前填充,但只能在分组内进行。如果有人有相同的需求,请注意fillna也可用于DataFrameGroupBy对象。

>>> from numpy import nan
>>> example = pd.DataFrame({'number':[0,1,2,nan,4,nan,6,7,8,9],'name':list('aaabbbcccc')})
>>> example
  name  number
0    a     0.0
1    a     1.0
2    a     2.0
3    b     NaN
4    b     4.0
5    b     NaN
6    c     6.0
7    c     7.0
8    c     8.0
9    c     9.0
>>> example.groupby('name')['number'].fillna(method='ffill') # fill in row 5 but not row 3
0    0.0
1    1.0
2    2.0
3    NaN
4    4.0
5    4.0
6    6.0
7    7.0
8    8.0
9    9.0
Name: number, dtype: float64

The accepted answer is perfect. I had a related but slightly different situation where I had to fill in forward but only within groups. In case someone has the same need, know that fillna works on a DataFrameGroupBy object.

>>> from numpy import nan
>>> example = pd.DataFrame({'number':[0,1,2,nan,4,nan,6,7,8,9],'name':list('aaabbbcccc')})
>>> example
  name  number
0    a     0.0
1    a     1.0
2    a     2.0
3    b     NaN
4    b     4.0
5    b     NaN
6    c     6.0
7    c     7.0
8    c     8.0
9    c     9.0
>>> example.groupby('name')['number'].fillna(method='ffill') # fill in row 5 but not row 3
0    0.0
1    1.0
2    2.0
3    NaN
4    4.0
5    4.0
6    6.0
7    7.0
8    8.0
9    9.0
Name: number, dtype: float64
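
On newer pandas the same per-group forward fill can be written with the groupby-level ffill method directly; a sketch:

example['number'] = example.groupby('name')['number'].ffill()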

回答 2

您可以使用带method='ffill'选项的pandas.DataFrame.fillna。'ffill'代表“向前填充”,会将最后一个有效观测值向前传播。替代方法是'bfill',工作方式相同,但方向相反。

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df = df.fillna(method='ffill')

print(df)
#   0  1  2
#0  1  2  3
#1  4  2  3
#2  4  2  9

为此还有一个直接的同义函数pandas.DataFrame.ffill,可以让操作更简单。

You can use pandas.DataFrame.fillna with the method='ffill' option. 'ffill' stands for ‘forward fill’ and will propagate last valid observation forward. The alternative is 'bfill' which works the same way, but backwards.

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df = df.fillna(method='ffill')

print(df)
#   0  1  2
#0  1  2  3
#1  4  2  3
#2  4  2  9

There is also a direct synonym function for this, pandas.DataFrame.ffill, to make things simpler.


回答 3

我在尝试此解决方案时注意到一件事:如果数组的开头或结尾有N/A,单独的ffill或bfill都无法完全填充。两者都需要。

In [224]: df = pd.DataFrame([None, 1, 2, 3, None, 4, 5, 6, None])

In [225]: df.ffill()
Out[225]:
     0
0  NaN
1  1.0
...
7  6.0
8  6.0

In [226]: df.bfill()
Out[226]:
     0
0  1.0
1  1.0
...
7  6.0
8  NaN

In [227]: df.bfill().ffill()
Out[227]:
     0
0  1.0
1  1.0
...
7  6.0
8  6.0

One thing that I noticed when trying this solution is that if you have N/A at the start or the end of the array, ffill and bfill don’t quite work. You need both.

In [224]: df = pd.DataFrame([None, 1, 2, 3, None, 4, 5, 6, None])

In [225]: df.ffill()
Out[225]:
     0
0  NaN
1  1.0
...
7  6.0
8  6.0

In [226]: df.bfill()
Out[226]:
     0
0  1.0
1  1.0
...
7  6.0
8  NaN

In [227]: df.bfill().ffill()
Out[227]:
     0
0  1.0
1  1.0
...
7  6.0
8  6.0

回答 4

ffill 现在有自己的方法 pd.DataFrame.ffill

df.ffill()

     0    1    2
0  1.0  2.0  3.0
1  4.0  2.0  3.0
2  4.0  2.0  9.0

ffill now has its own method pd.DataFrame.ffill

df.ffill()

     0    1    2
0  1.0  2.0  3.0
1  4.0  2.0  3.0
2  4.0  2.0  9.0

回答 5

仅针对单列的版本

  • 用最后一个有效值填充NaN
df[column_name].fillna(method='ffill', inplace=True)
  • 用下一个有效值填充NaN
df[column_name].fillna(method='backfill', inplace=True)

Only one column version

  • Fill NAN with last valid value
df[column_name].fillna(method='ffill', inplace=True)
  • Fill NAN with next valid value
df[column_name].fillna(method='backfill', inplace=True)

回答 6

赞同ffill方法,但补充一点额外信息:您可以使用关键字参数limit来限制向前填充的范围。

>>> import pandas as pd    
>>> df = pd.DataFrame([[1, 2, 3], [None, None, 6], [None, None, 9]])

>>> df
     0    1   2
0  1.0  2.0   3
1  NaN  NaN   6
2  NaN  NaN   9

>>> df[1].fillna(method='ffill', inplace=True)
>>> df
     0    1    2
0  1.0  2.0    3
1  NaN  2.0    6
2  NaN  2.0    9

现在带有limit关键字参数

>>> df[0].fillna(method='ffill', limit=1, inplace=True)

>>> df
     0    1  2
0  1.0  2.0  3
1  1.0  2.0  6
2  NaN  2.0  9

Just agreeing with the ffill method, but one extra bit of info is that you can limit the forward fill with the keyword argument limit.

>>> import pandas as pd    
>>> df = pd.DataFrame([[1, 2, 3], [None, None, 6], [None, None, 9]])

>>> df
     0    1   2
0  1.0  2.0   3
1  NaN  NaN   6
2  NaN  NaN   9

>>> df[1].fillna(method='ffill', inplace=True)
>>> df
     0    1    2
0  1.0  2.0    3
1  NaN  2.0    6
2  NaN  2.0    9

Now with limit keyword argument

>>> df[0].fillna(method='ffill', limit=1, inplace=True)

>>> df
     0    1  2
0  1.0  2.0  3
1  1.0  2.0  6
2  NaN  2.0  9

回答 7

就我而言,我们有来自不同设备的时间序列,但是某些设备在一段时间内无法发送任何值。因此,我们应该为每个设备和时间段创建NA值,然后再执行fillna。

df = pd.DataFrame([["device1", 1, 'first val of device1'], ["device2", 2, 'first val of device2'], ["device3", 3, 'first val of device3']])
df.pivot(index=1, columns=0, values=2).fillna(method='ffill').unstack().reset_index(name='value')

结果:

        0   1   value
0   device1     1   first val of device1
1   device1     2   first val of device1
2   device1     3   first val of device1
3   device2     1   None
4   device2     2   first val of device2
5   device2     3   first val of device2
6   device3     1   None
7   device3     2   None
8   device3     3   first val of device3

In my case, we have time series from different devices but some devices could not send any value during some period. So we should create NA values for every device and time period and after that do fillna.

df = pd.DataFrame([["device1", 1, 'first val of device1'], ["device2", 2, 'first val of device2'], ["device3", 3, 'first val of device3']])
df.pivot(index=1, columns=0, values=2).fillna(method='ffill').unstack().reset_index(name='value')

Result:

        0   1   value
0   device1     1   first val of device1
1   device1     2   first val of device1
2   device1     3   first val of device1
3   device2     1   None
4   device2     2   first val of device2
5   device2     3   first val of device2
6   device3     1   None
7   device3     2   None
8   device3     3   first val of device3

回答 8

您可以使用fillna来移除或替换NaN值。

NaN 移除

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])

df.fillna(method='ffill')
     0    1    2
0  1.0  2.0  3.0
1  4.0  2.0  3.0
2  4.0  2.0  9.0

NaN 替换

df.fillna(0) # 0 means What Value you want to replace 
     0    1    2
0  1.0  2.0  3.0
1  4.0  0.0  0.0
2  0.0  0.0  9.0

参考pandas.DataFrame.fillna

You can use fillna to remove or replace NaN values.

NaN Remove

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])

df.fillna(method='ffill')
     0    1    2
0  1.0  2.0  3.0
1  4.0  2.0  3.0
2  4.0  2.0  9.0

NaN Replace

df.fillna(0) # 0 means What Value you want to replace 
     0    1    2
0  1.0  2.0  3.0
1  4.0  0.0  0.0
2  0.0  0.0  9.0

Reference pandas.DataFrame.fillna