Python 实用宝典

Question 1

I have a dataframe along the lines of the below:

    Type       Set
1    A          Z
2    B          Z           
3    B          X
4    C          Y

I want to add another column to the dataframe (or generate a series) of the same length as the dataframe (equal number of records/rows) which sets a colour 'green' if Set == 'Z' and 'red' if Set equals anything else.

What’s the best way to do this?

Question 2

If you only have two choices to select from:

df['color'] = np.where(df['Set']=='Z', 'green', 'red')

For example,

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)

yields

  Set Type  color
0   Z    A  green
1   Z    B  green
2   X    B    red
3   Y    C    red

If you have more than two conditions then use np.select. For example, if you want color to be

yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
otherwise purple when (df['Type'] == 'B')
otherwise black,

then use

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
    (df['Set'] == 'Z') & (df['Type'] == 'A'),
    (df['Set'] == 'Z') & (df['Type'] == 'B'),
    (df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)

which yields

  Set Type   color
0   Z    A  yellow
1   Z    B    blue
2   X    B  purple
3   Y    C   black

Question 3

List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.

Example list comprehension:

df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]

%timeit tests:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
%timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

1000 loops, best of 3: 239 µs per loop
1000 loops, best of 3: 523 µs per loop
1000 loops, best of 3: 263 µs per loop

Question 4

Another way in which this could be achieved is

df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

Question 5

Here’s yet another way to skin this cat, using a dictionary to map new values onto the keys in the list:

def map_values(row, values_dict):
    return values_dict[row]

values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]})

df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args = (values_dict,))

What’s it look like:

df
Out[2]: 
  INDICATOR  VALUE  NEW_VALUE
0         A     10          1
1         B      9          2
2         C      8          3
3         D      7          4

This approach can be very powerful when you have many ifelse-type statements to make (i.e. many unique values to replace).

And of course you could always do this:

df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)

But that approach is more than three times as slow as the apply approach from above, on my machine.

And you could also do this, using dict.get:

df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]

Question 6

The following is slower than the approaches timed here, but we can compute the extra column based on the contents of more than one column, and more than two values can be computed for the extra column.

Simple example using just the “Set” column:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)

  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Example with more colours and more columns taken into account:

def set_color(row):
    if row["Set"] == "Z":
        return "red"
    elif row["Type"] == "C":
        return "blue"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)

  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C   blue

Edit (21/06/2019): Using plydata

It is also possible to use plydata to do this kind of things (this seems even slower than using assign and apply, though).

from plydata import define, if_else

Simple if_else:

df = define(df, color=if_else('Set=="Z"', '"red"', '"green"'))

print(df)

  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Nested if_else:

df = define(df, color=if_else(
    'Set=="Z"',
    '"red"',
    if_else('Type=="C"', '"green"', '"blue"')))

print(df)

  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B   blue
3   Y    C  green

Question 7

Maybe this has been possible with newer updates of Pandas (tested with pandas=1.0.5), but I think the following is the shortest and maybe best answer for the question, so far. You can use the .loc method and use one condition or several depending on your need.

Code Summary:

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
df.loc[(df['Set']=="Z"), 'Color'] = "green"

#practice!
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"

Explanation:

df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))

# df so far: 
  Type Set  
0    A   Z 
1    B   Z 
2    B   X 
3    C   Y

add a ‘color’ column and set all values to “red”

df['Color'] = "red"

Apply your single condition:

df.loc[(df['Set']=="Z"), 'Color'] = "green"


# df: 
  Type Set  Color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

or multiple conditions if you want:

df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"

You can read on Pandas logical operators and conditional selection here: Logical operators for boolean indexing in Pandas

Question 8

One liner with .apply() method is following:

df['color'] = df['Set'].apply(lambda set_: 'green' if set_=='Z' else 'red')

After that, df data frame looks like this:

>>> print(df)
  Type Set  color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

Question 9

If you’re working with massive data, a memoized approach would be best:

# First create a dictionary of manually stored values
color_dict = {'Z':'red'}

# Second, build a dictionary of "other" values
color_dict_other = {x:'green' for x in df['Set'].unique() if x not in color_dict.keys()}

# Next, merge the two
color_dict.update(color_dict_other)

# Finally, map it to your column
df['color'] = df['Set'].map(color_dict)

This approach will be fastest when you have many repeated values. My general rule of thumb is to memoize when: data_size > 10**4 & n_distinct < data_size/4

E.x. Memoize in a case 10,000 rows with 2,500 or fewer distinct values.

Question 10

I have a pandas DataFrame and I want to delete rows from it where the length of the string in a particular column is greater than 2.

I expect to be able to do this (per this answer):

df[(len(df['column name']) < 2)]

but I just get the error:

KeyError: u'no item named False'

What am I doing wrong?

(Note: I know I can use df.dropna() to get rid of rows that contain any NaN, but I didn’t see how to remove rows based on a conditional expression.)

Question 11

When you do len(df['column name']) you are just getting one number, namely the number of rows in the DataFrame (i.e., the length of the column itself). If you want to apply len to each element in the column, use df['column name'].map(len). So try

df[df['column name'].map(len) < 2]

Question 12

To directly answer this question’s original title “How to delete rows from a pandas DataFrame based on a conditional expression” (which I understand is not necessarily the OP’s problem but could help other users coming across this question) one way to do this is to use the drop method:

df = df.drop(some labels)

df = df.drop(df[<some boolean condition>].index)

Example

To remove all rows where column ‘score’ is < 50:

df = df.drop(df[df.score < 50].index)

In place version (as pointed out in comments)

df.drop(df[df.score < 50].index, inplace=True)

Multiple conditions

(see Boolean Indexing)

The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.

To remove all rows where column ‘score’ is < 50 and > 20

df = df.drop(df[(df.score < 50) & (df.score > 20)].index)

Question 13

You can assign the DataFrame to a filtered version of itself:

df = df[df.score > 50]

This is faster than drop:

%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test = test[test.x < 0]
# 54.5 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test.drop(test[test.x > 0].index, inplace=True)
# 201 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test = test.drop(test[test.x > 0].index)
# 194 ms ± 7.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Question 14

I will expand on @User’s generic solution to provide a drop free alternative. This is for folks directed here based on the question’s title (not OP ‘s problem)

Say you want to delete all rows with negative values. One liner solution is:-

df = df[(df > 0).all(axis=1)]

Step by step Explanation:–

Let’s generate a 5×5 random normal distribution data frame

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,5), columns=list('ABCDE'))
      A         B         C         D         E
0  1.764052  0.400157  0.978738  2.240893  1.867558
1 -0.977278  0.950088 -0.151357 -0.103219  0.410599
2  0.144044  1.454274  0.761038  0.121675  0.443863
3  0.333674  1.494079 -0.205158  0.313068 -0.854096
4 -2.552990  0.653619  0.864436 -0.742165  2.269755

Let the condition be deleting negatives. A boolean df satisfying the condition:-

df > 0
      A     B      C      D      E
0   True  True   True   True   True
1  False  True  False  False   True
2   True  True   True   True   True
3   True  True  False   True  False
4  False  True   True  False   True

A boolean series for all rows satisfying the condition Note if any element in the row fails the condition the row is marked false

(df > 0).all(axis=1)
0     True
1    False
2     True
3    False
4    False
dtype: bool

Finally filter out rows from data frame based on the condition

df[(df > 0).all(axis=1)]
      A         B         C         D         E
0  1.764052  0.400157  0.978738  2.240893  1.867558
2  0.144044  1.454274  0.761038  0.121675  0.443863

You can assign it back to df to actually delete vs filter ing done above
df = df[(df > 0).all(axis=1)]

This can easily be extended to filter out rows containing NaN s (non numeric entries):-
df = df[(~df.isnull()).all(axis=1)]

This can also be simplified for cases like: Delete all rows where column E is negative

df = df[(df.E>0)]

I would like to end with some profiling stats on why @User’s drop solution is slower than raw column based filtration:-

%timeit df_new = df[(df.E>0)]
345 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dft.drop(dft[dft.E < 0].index, inplace=True)
890 µs ± 94.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

A column is basically a Series i.e a NumPy array, it can be indexed without any cost. For folks interested in how the underlying memory organization plays into execution speed here is a great Link on Speeding up Pandas:

Question 15

In pandas you can do str.len with your boundary and using the Boolean result to filter it .

df[df['column name'].str.len().lt(2)]

Question 16

If you want to drop rows of data frame on the basis of some complicated condition on the column value then writing that in the way shown above can be complicated. I have the following simpler solution which always works. Let us assume that you want to drop the column with ‘header’ so get that column in a list first.

text_data = df['name'].tolist()

now apply some function on the every element of the list and put that in a panda series:

text_length = pd.Series([func(t) for t in text_data])

in my case I was just trying to get the number of tokens:

text_length = pd.Series([len(t.split()) for t in text_data])

now add one extra column with the above series in the data frame:

df = df.assign(text_length = text_length .values)

now we can apply condition on the new column such as:

df = df[df.text_length  >  10]

def pass_filter(df, label, length, pass_type):

    text_data = df[label].tolist()

    text_length = pd.Series([len(t.split()) for t in text_data])

    df = df.assign(text_length = text_length .values)

    if pass_type == 'high':
        df = df[df.text_length  >  length]

    if pass_type == 'low':
        df = df[df.text_length  <  length]

    df = df.drop(columns=['text_length'])

    return df

Question 17

This seems like a ridiculously easy question… but I’m not seeing the easy answer I was expecting.

So, how do I get the value at an nth row of a given column in Pandas? (I am particularly interested in the first row, but would be interested in a more general practice as well).

For example, let’s say I want to pull the 1.2 value in Btime as a variable.

Whats the right way to do this?

df_test =

  ATime   X   Y   Z   Btime  C   D   E
0    1.2  2  15   2    1.2  12  25  12
1    1.4  3  12   1    1.3  13  22  11
2    1.5  1  10   6    1.4  11  20  16
3    1.6  2   9  10    1.7  12  29  12
4    1.9  1   1   9    1.9  11  21  19
5    2.0  0   0   0    2.0   8  10  11
6    2.4  0   0   0    2.4  10  12  15

Question 18

To select the ith row, use iloc:

In [31]: df_test.iloc[0]
Out[31]: 
ATime     1.2
X         2.0
Y        15.0
Z         2.0
Btime     1.2
C        12.0
D        25.0
E        12.0
Name: 0, dtype: float64

To select the ith value in the Btime column you could use:

In [30]: df_test['Btime'].iloc[0]
Out[30]: 1.2

There is a difference between `df_test['Btime'].iloc[0]` (recommended) and `df_test.iloc[0]['Btime']`:

DataFrames store data in column-based blocks (where each block has a single dtype). If you select by column first, a view can be returned (which is quicker than returning a copy) and the original dtype is preserved. In contrast, if you select by row first, and if the DataFrame has columns of different dtypes, then Pandas copies the data into a new Series of object dtype. So selecting columns is a bit faster than selecting rows. Thus, although df_test.iloc[0]['Btime'] works, df_test['Btime'].iloc[0] is a little bit more efficient.

There is a big difference between the two when it comes to assignment. df_test['Btime'].iloc[0] = x affects df_test, but df_test.iloc[0]['Btime'] may not. See below for an explanation of why. Because a subtle difference in the order of indexing makes a big difference in behavior, it is better to use single indexing assignment:

df.iloc[0, df.columns.get_loc('Btime')] = x

`df.iloc[0, df.columns.get_loc('Btime')] = x` (recommended):

The recommended way to assign new values to a DataFrame is to avoid chained indexing, and instead use the method shown by andrew,

df.loc[df.index[n], 'Btime'] = x

or

df.iloc[n, df.columns.get_loc('Btime')] = x

The latter method is a bit faster, because df.loc has to convert the row and column labels to positional indices, so there is a little less conversion necessary if you use df.iloc instead.

`df['Btime'].iloc[0] = x` works, but is not recommended:

Although this works, it is taking advantage of the way DataFrames are currently implemented. There is no guarantee that Pandas has to work this way in the future. In particular, it is taking advantage of the fact that (currently) df['Btime'] always returns a view (not a copy) so df['Btime'].iloc[n] = x can be used to assign a new value at the nth location of the Btime column of df.

Since Pandas makes no explicit guarantees about when indexers return a view versus a copy, assignments that use chained indexing generally always raise a SettingWithCopyWarning even though in this case the assignment succeeds in modifying df:

In [22]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [24]: df['bar'] = 100
In [25]: df['bar'].iloc[0] = 99
/home/unutbu/data/binky/bin/ipython:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)

In [26]: df
Out[26]: 
  foo  bar
0   A   99  <-- assignment succeeded
2   B  100
1   C  100

`df.iloc[0]['Btime'] = x` does not work:

In contrast, assignment with df.iloc[0]['bar'] = 123 does not work because df.iloc[0] is returning a copy:

In [66]: df.iloc[0]['bar'] = 123
/home/unutbu/data/binky/bin/ipython:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In [67]: df
Out[67]: 
  foo  bar
0   A   99  <-- assignment failed
2   B  100
1   C  100

Warning: I had previously suggested df_test.ix[i, 'Btime']. But this is not guaranteed to give you the ith value since ix tries to index by label before trying to index by position. So if the DataFrame has an integer index which is not in sorted order starting at 0, then using ix[i] will return the row labeled i rather than the ith row. For example,

In [1]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])

In [2]: df
Out[2]: 
  foo
0   A
2   B
1   C

In [4]: df.ix[1, 'foo']
Out[4]: 'C'

Question 19

Note that the answer from @unutbu will be correct until you want to set the value to something new, then it will not work if your dataframe is a view.

In [4]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [5]: df['bar'] = 100
In [6]: df['bar'].iloc[0] = 99
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas-0.16.0_19_g8d2818e-py2.7-macosx-10.9-x86_64.egg/pandas/core/indexing.py:118: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)

Another approach that will consistently work with both setting and getting is:

In [7]: df.loc[df.index[0], 'foo']
Out[7]: 'A'
In [8]: df.loc[df.index[0], 'bar'] = 99
In [9]: df
Out[9]:
  foo  bar
0   A   99
2   B  100
1   C  100

Question 20

Another way to do this:

first_value = df['Btime'].values[0]

This way seems to be faster than using .iloc:

In [1]: %timeit -n 1000 df['Btime'].values[20]
5.82 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [2]: %timeit -n 1000 df['Btime'].iloc[20]
29.2 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Question 21

df.iloc[0].head(1) – First data set only from entire first row.
df.iloc[0] – Entire First row in column.

Question 22

In a general way, if you want to pick up the first N rows from the J column from pandas dataframe the best way to do this is:

data = dataframe[0:N][:,J]

Question 23

To get e.g the value from column ‘test’ and row 1 it works like

df[['test']].values[0][0]

as only df[['test']].values[0] gives back a array

Question 24

Another way of getting the first row and preserving the index:

x = df.first('d') # Returns the first day. '3d' gives first three days.

Question 25

I have a Python dictionary like the following:

{u'2012-06-08': 388,
 u'2012-06-09': 388,
 u'2012-06-10': 388,
 u'2012-06-11': 389,
 u'2012-06-12': 389,
 u'2012-06-13': 389,
 u'2012-06-14': 389,
 u'2012-06-15': 389,
 u'2012-06-16': 389,
 u'2012-06-17': 389,
 u'2012-06-18': 390,
 u'2012-06-19': 390,
 u'2012-06-20': 390,
 u'2012-06-21': 390,
 u'2012-06-22': 390,
 u'2012-06-23': 390,
 u'2012-06-24': 390,
 u'2012-06-25': 391,
 u'2012-06-26': 391,
 u'2012-06-27': 391,
 u'2012-06-28': 391,
 u'2012-06-29': 391,
 u'2012-06-30': 391,
 u'2012-07-01': 391,
 u'2012-07-02': 392,
 u'2012-07-03': 392,
 u'2012-07-04': 392,
 u'2012-07-05': 392,
 u'2012-07-06': 392}

The keys are Unicode dates and the values are integers. I would like to convert this into a pandas dataframe by having the dates and their corresponding values as two separate columns. Example: col1: Dates col2: DateValue (the dates are still Unicode and datevalues are still integers)

     Date         DateValue
0    2012-07-01    391
1    2012-07-02    392
2    2012-07-03    392
.    2012-07-04    392
.    ...           ...
.    ...           ...

Any help in this direction would be much appreciated. I am unable to find resources on the pandas docs to help me with this.

I know one solution might be to convert each key-value pair in this dict, into a dict so the entire structure becomes a dict of dicts, and then we can add each row individually to the dataframe. But I want to know if there is an easier way and a more direct way to do this.

So far I have tried converting the dict into a series object but this doesn’t seem to maintain the relationship between the columns:

s  = Series(my_dict,index=my_dict.keys())

Question 26

The error here, is since calling the DataFrame constructor with scalar values (where it expects values to be a list/dict/… i.e. have multiple columns):

pd.DataFrame(d)
ValueError: If using all scalar values, you must must pass an index

You could take the items from the dictionary (i.e. the key-value pairs):

In [11]: pd.DataFrame(d.items())  # or list(d.items()) in python 3
Out[11]:
             0    1
0   2012-07-02  392
1   2012-07-06  392
2   2012-06-29  391
3   2012-06-28  391
...

In [12]: pd.DataFrame(d.items(), columns=['Date', 'DateValue'])
Out[12]:
          Date  DateValue
0   2012-07-02        392
1   2012-07-06        392
2   2012-06-29        391

But I think it makes more sense to pass the Series constructor:

In [21]: s = pd.Series(d, name='DateValue')
Out[21]:
2012-06-08    388
2012-06-09    388
2012-06-10    388

In [22]: s.index.name = 'Date'

In [23]: s.reset_index()
Out[23]:
          Date  DateValue
0   2012-06-08        388
1   2012-06-09        388
2   2012-06-10        388

Question 27

When converting a dictionary into a pandas dataframe where you want the keys to be the columns of said dataframe and the values to be the row values, you can do simply put brackets around the dictionary like this:

>>> dict_ = {'key 1': 'value 1', 'key 2': 'value 2', 'key 3': 'value 3'}
>>> pd.DataFrame([dict_])

    key 1     key 2     key 3
0   value 1   value 2   value 3

It’s saved me some headaches so I hope it helps someone out there!

EDIT: In the pandas docs one option for the data parameter in the DataFrame constructor is a list of dictionaries. Here we’re passing a list with one dictionary in it.

Question 28

As explained on another answer using pandas.DataFrame() directly here will not act as you think.

What you can do is use pandas.DataFrame.from_dict with orient='index':

In[7]: pandas.DataFrame.from_dict({u'2012-06-08': 388,
 u'2012-06-09': 388,
 u'2012-06-10': 388,
 u'2012-06-11': 389,
 u'2012-06-12': 389,
 .....
 u'2012-07-05': 392,
 u'2012-07-06': 392}, orient='index', columns=['foo'])
Out[7]: 
            foo
2012-06-08  388
2012-06-09  388
2012-06-10  388
2012-06-11  389
2012-06-12  389
........
2012-07-05  392
2012-07-06  392

Question 29

Pass the items of the dictionary to the DataFrame constructor, and give the column names. After that parse the Date column to get Timestamp values.

Note the difference between python 2.x and 3.x:

In python 2.x:

df = pd.DataFrame(data.items(), columns=['Date', 'DateValue'])
df['Date'] = pd.to_datetime(df['Date'])

In Python 3.x: (requiring an additional ‘list’)

df = pd.DataFrame(list(data.items()), columns=['Date', 'DateValue'])
df['Date'] = pd.to_datetime(df['Date'])

Question 30

p.s. in particular, I’ve found Row-Oriented examples helpful; since often that how records are stored externally.

https://pbpython.com/pandas-list-dict.html

Question 31

Pandas have built-in function for conversion of dict to data frame.

pd.DataFrame.from_dict(dictionaryObject,orient=’index’)

For your data you can convert it like below:

import pandas as pd
your_dict={u'2012-06-08': 388,
 u'2012-06-09': 388,
 u'2012-06-10': 388,
 u'2012-06-11': 389,
 u'2012-06-12': 389,
 u'2012-06-13': 389,
 u'2012-06-14': 389,
 u'2012-06-15': 389,
 u'2012-06-16': 389,
 u'2012-06-17': 389,
 u'2012-06-18': 390,
 u'2012-06-19': 390,
 u'2012-06-20': 390,
 u'2012-06-21': 390,
 u'2012-06-22': 390,
 u'2012-06-23': 390,
 u'2012-06-24': 390,
 u'2012-06-25': 391,
 u'2012-06-26': 391,
 u'2012-06-27': 391,
 u'2012-06-28': 391,
 u'2012-06-29': 391,
 u'2012-06-30': 391,
 u'2012-07-01': 391,
 u'2012-07-02': 392,
 u'2012-07-03': 392,
 u'2012-07-04': 392,
 u'2012-07-05': 392,
 u'2012-07-06': 392}

your_df_from_dict=pd.DataFrame.from_dict(your_dict,orient='index')
print(your_df_from_dict)

Question 32

pd.DataFrame({'date' : dict_dates.keys() , 'date_value' : dict_dates.values() })

Question 33

You can also just pass the keys and values of the dictionary to the new dataframe, like so:

import pandas as pd

myDict = {<the_dict_from_your_example>]
df = pd.DataFrame()
df['Date'] = myDict.keys()
df['DateValue'] = myDict.values()

Question 34

In my case I wanted keys and values of a dict to be columns and values of DataFrame. So the only thing that worked for me was:

data = {'adjust_power': 'y', 'af_policy_r_submix_prio_adjust': '[null]', 'af_rf_info': '[null]', 'bat_ac': '3500', 'bat_capacity': '75'} 

columns = list(data.keys())
values = list(data.values())
arr_len = len(values)

pd.DataFrame(np.array(values, dtype=object).reshape(1, arr_len), columns=columns)

Question 35

This is what worked for me, since I wanted to have a separate index column

df = pd.DataFrame.from_dict(some_dict, orient="index").reset_index()
df.columns = ['A', 'B']

Question 36

Accepts a dict as argument and returns a dataframe with the keys of the dict as index and values as a column.

def dict_to_df(d):
    df=pd.DataFrame(d.items())
    df.set_index(0, inplace=True)
    return df

Question 37

This is how it worked for me :

df= pd.DataFrame([d.keys(), d.values()]).T
df.columns= ['keys', 'values']  # call them whatever you like

I hope this helps

Question 38

d = {'Date': list(yourDict.keys()),'Date_Values': list(yourDict.values())}
df = pandas.DataFrame(data=d)

If you don’t encapsulate yourDict.keys() inside of list() , then you will end up with all of your keys and values being placed in every row of every column. Like this:

Date \ 0 (2012-06-08, 2012-06-09, 2012-06-10, 2012-06-1... 1 (2012-06-08, 2012-06-09, 2012-06-10, 2012-06-1... 2 (2012-06-08, 2012-06-09, 2012-06-10, 2012-06-1... 3 (2012-06-08, 2012-06-09, 2012-06-10, 2012-06-1... 4 (2012-06-08, 2012-06-09, 2012-06-10, 2012-06-1...

But by adding list() then the result looks like this:

Date Date_Values 0 2012-06-08 388 1 2012-06-09 388 2 2012-06-10 388 3 2012-06-11 389 4 2012-06-12 389 ...

Question 39

I have run into this several times and have an example dictionary that I created from a function get_max_Path(), and it returns the sample dictionary:

{2: 0.3097502930247044, 3: 0.4413177909384636, 4: 0.5197224051562838, 5: 0.5717654946470984, 6: 0.6063959031223476, 7: 0.6365209824708223, 8: 0.655918861281035, 9: 0.680844386645206}

To convert this to a dataframe, I ran the following:

df = pd.DataFrame.from_dict(get_max_path(2), orient = 'index').reset_index()

Returns a simple two column dataframe with a separate index:

index 0 0 2 0.309750 1 3 0.441318

Just rename the columns using f.rename(columns={'index': 'Column1', 0: 'Column2'}, inplace=True)

Question 40

I think that you can make some changes in your data format when you create dictionary, then you can easily convert it to DataFrame:

input:

a={'Dates':['2012-06-08','2012-06-10'],'Date_value':[388,389]}

output:

{'Date_value': [388, 389], 'Dates': ['2012-06-08', '2012-06-10']}

input:

aframe=DataFrame(a)

output: will be your DataFrame

You just need to use some text editing in somewhere like Sublime or maybe Excel.

Question 41

How to check whether a pandas DataFrame is empty? In my case I want to print some message in terminal if the DataFrame is empty.

Question 42

You can use the attribute df.empty to check whether it’s empty or not:

if df.empty:
    print('DataFrame is empty!')

Source: Pandas Documentation

Question 43

I use the len function. It’s much faster than empty. len(df.index) is even faster.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD'))

def empty(df):
    return df.empty

def lenz(df):
    return len(df) == 0

def lenzi(df):
    return len(df.index) == 0

'''
%timeit empty(df)
%timeit lenz(df)
%timeit lenzi(df)

10000 loops, best of 3: 13.9 µs per loop
100000 loops, best of 3: 2.34 µs per loop
1000000 loops, best of 3: 695 ns per loop

len on index seems to be faster
'''

Question 44

I prefer going the long route. These are the checks I follow to avoid using a try-except clause –

check if variable is not None
then check if its a dataframe and
make sure its not empty

Here, DATA is the suspect variable –

DATA is not None and isinstance(DATA, pd.DataFrame) and not DATA.empty

Question 45

To see if a dataframe is empty, I argue that one should test for the length of a dataframe’s columns index:

if len(df.columns) == 0: 1

Reason:

According to the Pandas Reference API, there is a distinction between:

an empty dataframe with 0 rows and 0 columns
an empty dataframe with rows containing NaN hence at least 1 column

Arguably, they are not the same. The other answers are imprecise in that df.empty, len(df), or len(df.index) make no distinction and return index is 0 and empty is True in both cases.

Examples

Example 1: An empty dataframe with 0 rows and 0 columns

In [1]: import pandas as pd
        df1 = pd.DataFrame()
        df1
Out[1]: Empty DataFrame
        Columns: []
        Index: []

In [2]: len(df1.index)  # or len(df1)
Out[2]: 0

In [3]: df1.empty
Out[3]: True

Example 2: A dataframe which is emptied to 0 rows but still retains n columns

In [4]: df2 = pd.DataFrame({'AA' : [1, 2, 3], 'BB' : [11, 22, 33]})
        df2
Out[4]:    AA  BB
        0   1  11
        1   2  22
        2   3  33

In [5]: df2 = df2[df2['AA'] == 5]
        df2
Out[5]: Empty DataFrame
        Columns: [AA, BB]
        Index: []

In [6]: len(df2.index)  # or len(df2)
Out[6]: 0

In [7]: df2.empty
Out[7]: True

Now, building on the previous examples, in which the index is 0 and empty is True. When reading the length of the columns index for the first loaded dataframe df1, it returns 0 columns to prove that it is indeed empty.

In [8]: len(df1.columns)
Out[8]: 0

In [9]: len(df2.columns)
Out[9]: 2

Critically, while the second dataframe df2 contains no data, it is not completely empty because it returns the amount of empty columns that persist.

Why it matters

Let’s add a new column to these dataframes to understand the implications:

# As expected, the empty column displays 1 series
In [10]: df1['CC'] = [111, 222, 333]
         df1
Out[10]:    CC
         0 111
         1 222
         2 333
In [11]: len(df1.columns)
Out[11]: 1

# Note the persisting series with rows containing `NaN` values in df2
In [12]: df2['CC'] = [111, 222, 333]
         df2
Out[12]:    AA  BB   CC
         0 NaN NaN  111
         1 NaN NaN  222
         2 NaN NaN  333
In [13]: len(df2.columns)
Out[13]: 3

It is evident that the original columns in df2 have re-surfaced. Therefore, it is prudent to instead read the length of the columns index with len(pandas.core.frame.DataFrame.columns) to see if a dataframe is empty.

Practical solution

# New dataframe df
In [1]: df = pd.DataFrame({'AA' : [1, 2, 3], 'BB' : [11, 22, 33]})
        df
Out[1]:    AA  BB
        0   1  11
        1   2  22
        2   3  33

# This data manipulation approach results in an empty df
# because of a subset of values that are not available (`NaN`)
In [2]: df = df[df['AA'] == 5]
        df
Out[2]: Empty DataFrame
        Columns: [AA, BB]
        Index: []

# NOTE: the df is empty, BUT the columns are persistent
In [3]: len(df.columns)
Out[3]: 2

# And accordingly, the other answers on this page
In [4]: len(df.index)  # or len(df)
Out[4]: 0

In [5]: df.empty
Out[5]: True

# SOLUTION: conditionally check for empty columns
In [6]: if len(df.columns) != 0:  # <--- here
            # Do something, e.g. 
            # drop any columns containing rows with `NaN`
            # to make the df really empty
            df = df.dropna(how='all', axis=1)
        df
Out[6]: Empty DataFrame
        Columns: []
        Index: []

# Testing shows it is indeed empty now
In [7]: len(df.columns)
Out[7]: 0

Adding a new data series works as expected without the re-surfacing of empty columns (factually, without any series that were containing rows with only NaN):

In [8]: df['CC'] = [111, 222, 333]
         df
Out[8]:    CC
         0 111
         1 222
         2 333
In [9]: len(df.columns)
Out[9]: 1

Question 46

1) If a DataFrame has got Nan and Non Null values and you want to find whether the DataFrame
is empty or not then try this code.
2) when this situation can happen? 
This situation happens when a single function is used to plot more than one DataFrame 
which are passed as parameter.In such a situation the function try to plot the data even 
when a DataFrame is empty and thus plot an empty figure!.
It will make sense if simply display 'DataFrame has no data' message.
3) why? 
if a DataFrame is empty(i.e. contain no data at all.Mind you DataFrame with Nan values 
is considered non empty) then it is desirable not to plot but put out a message :
Suppose we have two DataFrames df1 and df2.
The function myfunc takes any DataFrame(df1 and df2 in this case) and print a message 
if a DataFrame is empty(instead of plotting):

df1                     df2
col1 col2           col1 col2 
Nan   2              Nan  Nan 
2     Nan            Nan  Nan

and the function:

def myfunc(df):
  if (df.count().sum())>0: ##count the total number of non Nan values.Equal to 0 if DataFrame is empty
     print('not empty')
     df.plot(kind='barh')
  else:
     display a message instead of plotting if it is empty
     print('empty')

Question 47

How do I convert a numpy.datetime64 object to a datetime.datetime (or Timestamp)?

In the following code, I create a datetime, timestamp and datetime64 objects.

import datetime
import numpy as np
import pandas as pd
dt = datetime.datetime(2012, 5, 1)
# A strange way to extract a Timestamp object, there's surely a better way?
ts = pd.DatetimeIndex([dt])[0]
dt64 = np.datetime64(dt)

In [7]: dt
Out[7]: datetime.datetime(2012, 5, 1, 0, 0)

In [8]: ts
Out[8]: <Timestamp: 2012-05-01 00:00:00>

In [9]: dt64
Out[9]: numpy.datetime64('2012-05-01T01:00:00.000000+0100')

Note: it’s easy to get the datetime from the Timestamp:

In [10]: ts.to_datetime()
Out[10]: datetime.datetime(2012, 5, 1, 0, 0)

But how do we extract the datetime or Timestamp from a numpy.datetime64 (dt64)?

.

Update: a somewhat nasty example in my dataset (perhaps the motivating example) seems to be:

dt64 = numpy.datetime64('2002-06-28T01:00:00.000000000+0100')

which should be datetime.datetime(2002, 6, 28, 1, 0), and not a long (!) (1025222400000000000L)…

Question 48

To convert numpy.datetime64 to datetime object that represents time in UTC on numpy-1.8:

>>> from datetime import datetime
>>> import numpy as np
>>> dt = datetime.utcnow()
>>> dt
datetime.datetime(2012, 12, 4, 19, 51, 25, 362455)
>>> dt64 = np.datetime64(dt)
>>> ts = (dt64 - np.datetime64('1970-01-01T00:00:00Z')) / np.timedelta64(1, 's')
>>> ts
1354650685.3624549
>>> datetime.utcfromtimestamp(ts)
datetime.datetime(2012, 12, 4, 19, 51, 25, 362455)
>>> np.__version__
'1.8.0.dev-7b75899'

The above example assumes that a naive datetime object is interpreted by np.datetime64 as time in UTC.

To convert datetime to np.datetime64 and back (numpy-1.6):

>>> np.datetime64(datetime.utcnow()).astype(datetime)
datetime.datetime(2012, 12, 4, 13, 34, 52, 827542)

It works both on a single np.datetime64 object and a numpy array of np.datetime64.

Think of np.datetime64 the same way you would about np.int8, np.int16, etc and apply the same methods to convert beetween Python objects such as int, datetime and corresponding numpy objects.

Your “nasty example” works correctly:

>>> from datetime import datetime
>>> import numpy 
>>> numpy.datetime64('2002-06-28T01:00:00.000000000+0100').astype(datetime)
datetime.datetime(2002, 6, 28, 0, 0)
>>> numpy.__version__
'1.6.2' # current version available via pip install numpy

I can reproduce the long value on numpy-1.8.0 installed as:

pip install git+https://github.com/numpy/numpy.git#egg=numpy-dev

The same example:

>>> from datetime import datetime
>>> import numpy
>>> numpy.datetime64('2002-06-28T01:00:00.000000000+0100').astype(datetime)
1025222400000000000L
>>> numpy.__version__
'1.8.0.dev-7b75899'

It returns long because for numpy.datetime64 type .astype(datetime) is equivalent to .astype(object) that returns Python integer (long) on numpy-1.8.

To get datetime object you could:

>>> dt64.dtype
dtype('<M8[ns]')
>>> ns = 1e-9 # number of seconds in a nanosecond
>>> datetime.utcfromtimestamp(dt64.astype(int) * ns)
datetime.datetime(2002, 6, 28, 0, 0)

To get datetime64 that uses seconds directly:

>>> dt64 = numpy.datetime64('2002-06-28T01:00:00.000000000+0100', 's')
>>> dt64.dtype
dtype('<M8[s]')
>>> datetime.utcfromtimestamp(dt64.astype(int))
datetime.datetime(2002, 6, 28, 0, 0)

The numpy docs say that the datetime API is experimental and may change in future numpy versions.

Question 49

You can just use the pd.Timestamp constructor. The following diagram may be useful for this and related questions.

Conversions between time representations

Question 50

Welcome to hell.

You can just pass a datetime64 object to pandas.Timestamp:

In [16]: Timestamp(numpy.datetime64('2012-05-01T01:00:00.000000'))
Out[16]: <Timestamp: 2012-05-01 01:00:00>

I noticed that this doesn’t work right though in NumPy 1.6.1:

numpy.datetime64('2012-05-01T01:00:00.000000+0100')

Also, pandas.to_datetime can be used (this is off of the dev version, haven’t checked v0.9.1):

In [24]: pandas.to_datetime('2012-05-01T01:00:00.000000+0100')
Out[24]: datetime.datetime(2012, 5, 1, 1, 0, tzinfo=tzoffset(None, 3600))

Question 51

I think there could be a more consolidated effort in an answer to better explain the relationship between Python’s datetime module, numpy’s datetime64/timedelta64 and pandas’ Timestamp/Timedelta objects.

The datetime standard library of Python

The datetime standard library has four main objects

time – only time, measured in hours, minutes, seconds and microseconds
date – only year, month and day
datetime – All components of time and date
timedelta – An amount of time with maximum unit of days

Create these four objects

>>> import datetime
>>> datetime.time(hour=4, minute=3, second=10, microsecond=7199)
datetime.time(4, 3, 10, 7199)

>>> datetime.date(year=2017, month=10, day=24)
datetime.date(2017, 10, 24)

>>> datetime.datetime(year=2017, month=10, day=24, hour=4, minute=3, second=10, microsecond=7199)
datetime.datetime(2017, 10, 24, 4, 3, 10, 7199)

>>> datetime.timedelta(days=3, minutes = 55)
datetime.timedelta(3, 3300)

>>> # add timedelta to datetime
>>> datetime.timedelta(days=3, minutes = 55) + \
    datetime.datetime(year=2017, month=10, day=24, hour=4, minute=3, second=10, microsecond=7199)
datetime.datetime(2017, 10, 27, 4, 58, 10, 7199)

NumPy’s datetime64 and timedelta64 objects

NumPy has no separate date and time objects, just a single datetime64 object to represent a single moment in time. The datetime module’s datetime object has microsecond precision (one-millionth of a second). NumPy’s datetime64 object allows you to set its precision from hours all the way to attoseconds (10 ^ -18). It’s constructor is more flexible and can take a variety of inputs.

Construct NumPy’s datetime64 and timedelta64 objects

Pass an integer with a string for the units. See all units here. It gets converted to that many units after the UNIX epoch: Jan 1, 1970

>>> np.datetime64(5, 'ns') 
numpy.datetime64('1970-01-01T00:00:00.000000005')

>>> np.datetime64(1508887504, 's')
numpy.datetime64('2017-10-24T23:25:04')

You can also use strings as long as they are in ISO 8601 format.

>>> np.datetime64('2017-10-24')
numpy.datetime64('2017-10-24')

Timedeltas have a single unit

>>> np.timedelta64(5, 'D') # 5 days
>>> np.timedelta64(10, 'h') 10 hours

Can also create them by subtracting two datetime64 objects

>>> np.datetime64('2017-10-24T05:30:45.67') - np.datetime64('2017-10-22T12:35:40.123')
numpy.timedelta64(147305547,'ms')

Pandas Timestamp and Timedelta build much more functionality on top of NumPy

A pandas Timestamp is a moment in time very similar to a datetime but with much more functionality. You can construct them with either pd.Timestamp or pd.to_datetime.

>>> pd.Timestamp(1239.1238934) #defautls to nanoseconds
Timestamp('1970-01-01 00:00:00.000001239')

>>> pd.Timestamp(1239.1238934, unit='D') # change units
Timestamp('1973-05-24 02:58:24.355200')

>>> pd.Timestamp('2017-10-24 05') # partial strings work
Timestamp('2017-10-24 05:00:00')

pd.to_datetime works very similarly (with a few more options) and can convert a list of strings into Timestamps.

>>> pd.to_datetime('2017-10-24 05')
Timestamp('2017-10-24 05:00:00')

>>> pd.to_datetime(['2017-1-1', '2017-1-2'])
DatetimeIndex(['2017-01-01', '2017-01-02'], dtype='datetime64[ns]', freq=None)

Converting Python datetime to datetime64 and Timestamp

>>> dt = datetime.datetime(year=2017, month=10, day=24, hour=4, 
                   minute=3, second=10, microsecond=7199)
>>> np.datetime64(dt)
numpy.datetime64('2017-10-24T04:03:10.007199')

>>> pd.Timestamp(dt) # or pd.to_datetime(dt)
Timestamp('2017-10-24 04:03:10.007199')

Converting numpy datetime64 to datetime and Timestamp

>>> dt64 = np.datetime64('2017-10-24 05:34:20.123456')
>>> unix_epoch = np.datetime64(0, 's')
>>> one_second = np.timedelta64(1, 's')
>>> seconds_since_epoch = (dt64 - unix_epoch) / one_second
>>> seconds_since_epoch
1508823260.123456

>>> datetime.datetime.utcfromtimestamp(seconds_since_epoch)
>>> datetime.datetime(2017, 10, 24, 5, 34, 20, 123456)

Convert to Timestamp

>>> pd.Timestamp(dt64)
Timestamp('2017-10-24 05:34:20.123456')

Convert from Timestamp to datetime and datetime64

This is quite easy as pandas timestamps are very powerful

>>> ts = pd.Timestamp('2017-10-24 04:24:33.654321')

>>> ts.to_pydatetime()   # Python's datetime
datetime.datetime(2017, 10, 24, 4, 24, 33, 654321)

>>> ts.to_datetime64()
numpy.datetime64('2017-10-24T04:24:33.654321000')

Question 52

>>> dt64.tolist()
datetime.datetime(2012, 5, 1, 0, 0)

For DatetimeIndex, the tolist returns a list of datetime objects. For a single datetime64 object it returns a single datetime object.

Question 53

If you want to convert an entire pandas series of datetimes to regular python datetimes, you can also use .to_pydatetime().

pd.date_range('20110101','20110102',freq='H').to_pydatetime()

> [datetime.datetime(2011, 1, 1, 0, 0) datetime.datetime(2011, 1, 1, 1, 0)
   datetime.datetime(2011, 1, 1, 2, 0) datetime.datetime(2011, 1, 1, 3, 0)
   ....

It also supports timezones:

pd.date_range('20110101','20110102',freq='H').tz_localize('UTC').tz_convert('Australia/Sydney').to_pydatetime()

[ datetime.datetime(2011, 1, 1, 11, 0, tzinfo=<DstTzInfo 'Australia/Sydney' EST+11:00:00 DST>)
 datetime.datetime(2011, 1, 1, 12, 0, tzinfo=<DstTzInfo 'Australia/Sydney' EST+11:00:00 DST>)
....

NOTE: If you are operating on a Pandas Series you cannot call to_pydatetime() on the entire series. You will need to call .to_pydatetime() on each individual datetime64 using a list comprehension or something similar:

datetimes = [val.to_pydatetime() for val in df.problem_datetime_column]

Question 54

One option is to use str, and then to_datetime (or similar):

In [11]: str(dt64)
Out[11]: '2012-05-01T01:00:00.000000+0100'

In [12]: pd.to_datetime(str(dt64))
Out[12]: datetime.datetime(2012, 5, 1, 1, 0, tzinfo=tzoffset(None, 3600))

Note: it is not equal to dt because it’s become “offset-aware”:

In [13]: pd.to_datetime(str(dt64)).replace(tzinfo=None)
Out[13]: datetime.datetime(2012, 5, 1, 1, 0)

This seems inelegant.

.

Update: this can deal with the “nasty example”:

In [21]: dt64 = numpy.datetime64('2002-06-28T01:00:00.000000000+0100')

In [22]: pd.to_datetime(str(dt64)).replace(tzinfo=None)
Out[22]: datetime.datetime(2002, 6, 28, 1, 0)

Question 55

This post has been up for 4 years and I still struggled with this conversion problem – so the issue is still active in 2017 in some sense. I was somewhat shocked that the numpy documentation does not readily offer a simple conversion algorithm but that’s another story.

I have come across another way to do the conversion that only involves modules numpy and datetime, it does not require pandas to be imported which seems to me to be a lot of code to import for such a simple conversion. I noticed that datetime64.astype(datetime.datetime) will return a datetime.datetime object if the original datetime64 is in micro-second units while other units return an integer timestamp. I use module xarray for data I/O from Netcdf files which uses the datetime64 in nanosecond units making the conversion fail unless you first convert to micro-second units. Here is the example conversion code,

import numpy as np
import datetime

def convert_datetime64_to_datetime( usert: np.datetime64 )->datetime.datetime:
    t = np.datetime64( usert, 'us').astype(datetime.datetime)
return t

Its only tested on my machine, which is Python 3.6 with a recent 2017 Anaconda distribution. I have only looked at scalar conversion and have not checked array based conversions although I’m guessing it will be good. Nor have I looked at the numpy datetime64 source code to see if the operation makes sense or not.

Question 56

I’ve come back to this answer more times than I can count, so I decided to throw together a quick little class, which converts a Numpy datetime64 value to Python datetime value. I hope it helps others out there.

from datetime import datetime
import pandas as pd

class NumpyConverter(object):
    @classmethod
    def to_datetime(cls, dt64, tzinfo=None):
        """
        Converts a Numpy datetime64 to a Python datetime.
        :param dt64: A Numpy datetime64 variable
        :type dt64: numpy.datetime64
        :param tzinfo: The timezone the date / time value is in
        :type tzinfo: pytz.timezone
        :return: A Python datetime variable
        :rtype: datetime
        """
        ts = pd.to_datetime(dt64)
        if tzinfo is not None:
            return datetime(ts.year, ts.month, ts.day, ts.hour, ts.minute, ts.second, tzinfo=tzinfo)
        return datetime(ts.year, ts.month, ts.day, ts.hour, ts.minute, ts.second)

I’m gonna keep this in my tool bag, something tells me I’ll need it again.

Question 57

import numpy as np
import pandas as pd 

def np64toDate(np64):
    return pd.to_datetime(str(np64)).replace(tzinfo=None).to_datetime()

use this function to get pythons native datetime object

Question 58

Some solutions work well for me but numpy will deprecate some parameters. The solution that work better for me is to read the date as a pandas datetime and excract explicitly the year, month and day of a pandas object. The following code works for the most common situation.

def format_dates(dates):
    dt = pd.to_datetime(dates)
    try: return [datetime.date(x.year, x.month, x.day) for x in dt]    
    except TypeError: return datetime.date(dt.year, dt.month, dt.day)

Question 59

indeed, all of these datetime types can be difficult, and potentially problematic (must keep careful track of timezone information). here’s what i have done, though i admit that i am concerned that at least part of it is “not by design”. also, this can be made a bit more compact as needed. starting with a numpy.datetime64 dt_a:

dt_a

numpy.datetime64(‘2015-04-24T23:11:26.270000-0700’)

dt_a1 = dt_a.tolist() # yields a datetime object in UTC, but without tzinfo

dt_a1

datetime.datetime(2015, 4, 25, 6, 11, 26, 270000)

# now, make your "aware" datetime:

dt_a2=datetime.datetime(*list(dt_a1.timetuple()[:6]) + [dt_a1.microsecond], tzinfo=pytz.timezone(‘UTC’))

… and of course, that can be compressed into one line as needed.

Question 60

I am using pandas as a db substitute as I have multiple databases (oracle, mssql, etc) and I am unable to make a sequence of commands to a SQL equivalent.

I have a table loaded in a DataFrame with some columns:

YEARMONTH, CLIENTCODE, SIZE, .... etc etc

In SQL, to count the amount of different clients per year would be:

SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;

And the result would be

201301    5000
201302    13245

How can I do that in pandas?

Question 61

I believe this is what you want:

table.groupby('YEARMONTH').CLIENTCODE.nunique()

Example:

In [2]: table
Out[2]: 
   CLIENTCODE  YEARMONTH
0           1     201301
1           1     201301
2           2     201301
3           1     201302
4           2     201302
5           2     201302
6           3     201302

In [3]: table.groupby('YEARMONTH').CLIENTCODE.nunique()
Out[3]: 
YEARMONTH
201301       2
201302       3

Question 62

Here is another method, much simple, lets say your dataframe name is daat and column name is YEARMONTH

daat.YEARMONTH.value_counts()

Question 63

Interestingly enough, very often len(unique()) is a few times (3x-15x) faster than nunique().

Question 64

Using crosstab, this will return more information than groupby nunique

pd.crosstab(df.YEARMONTH,df.CLIENTCODE)
Out[196]: 
CLIENTCODE  1  2  3
YEARMONTH          
201301      2  1  0
201302      1  2  1

After a little bit modify ,yield the result

pd.crosstab(df.YEARMONTH,df.CLIENTCODE).ne(0).sum(1)
Out[197]: 
YEARMONTH
201301    2
201302    3
dtype: int64

问题：熊猫有条件地创建系列/数据框列

回答 0

回答 1

回答 2

回答 3

回答 4

编辑（21/06/2019）：使用plydata

Edit (21/06/2019): Using plydata

回答 5

回答 6

回答 7

问题：根据涉及len（string）的条件表达式从pandas DataFrame删除行，从而给出KeyError

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

问题：熊猫-获取给定列的第一行值

回答 0

df_test['Btime'].iloc[0]（推荐）和之间有区别df_test.iloc[0]['Btime']：

df.iloc[0, df.columns.get_loc('Btime')] = x （推荐的）：

df['Btime'].iloc[0] = x 可行，但不建议：

df.iloc[0]['Btime'] = x 不起作用：

There is a difference between df_test['Btime'].iloc[0] (recommended) and df_test.iloc[0]['Btime']:

df.iloc[0, df.columns.get_loc('Btime')] = x (recommended):

df['Btime'].iloc[0] = x works, but is not recommended:

df.iloc[0]['Btime'] = x does not work:

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

问题：将Python字典转换为数据框

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

回答 12

回答 13

回答 14

问题：如何检查pandas DataFrame是否为空？

回答 0

回答 1

回答 2

回答 3

Reason:

Examples

Why it matters

Practical solution

回答 4

问题：在datetime，Timestamp和datetime64之间转换

回答 0

回答 1

回答 2

回答 3

Python的日期时间标准库

创建这四个对象

NumPy的datetime64和timedelta64对象

构造NumPy的datetime64和timedelta64对象

Pandas Timestamp和Timedelta在NumPy之上构建了更多功能

将Python datetime转换为datetime64和Timestamp

将numpy datetime64转换为datetime和Timestamp

从时间戳转换为datetime和datetime64

The datetime standard library of Python

Create these four objects

NumPy’s datetime64 and timedelta64 objects

Construct NumPy’s datetime64 and timedelta64 objects

Pandas Timestamp and Timedelta build much more functionality on top of NumPy

Converting Python datetime to datetime64 and Timestamp

Converting numpy datetime64 to datetime and Timestamp

`df_test['Btime'].iloc[0]`（推荐）和之间有区别`df_test.iloc[0]['Btime']`：

`df.iloc[0, df.columns.get_loc('Btime')] = x` （推荐的）：

`df['Btime'].iloc[0] = x` 可行，但不建议：

`df.iloc[0]['Btime'] = x` 不起作用：

There is a difference between `df_test['Btime'].iloc[0]` (recommended) and `df_test.iloc[0]['Btime']`:

`df.iloc[0, df.columns.get_loc('Btime')] = x` (recommended):

`df['Btime'].iloc[0] = x` works, but is not recommended:

`df.iloc[0]['Btime'] = x` does not work: