This won’t win a code golf competition, and it borrows from the previous answers, but it clearly shows how the key is added and how the join works. It creates two new data frames from lists, then adds the key on which to do the Cartesian product.
My use case was that I needed a list of all store IDs for each week in my list. So I created a list of all the weeks I wanted, then a list of all the store IDs I wanted to map against them.
The merge I chose is left, but it is semantically the same as inner in this setup. You can see this in the documentation on merging, which states that it performs a Cartesian product if a key combination appears more than once in both tables, which is exactly what we set up.
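For concreteness, a minimal sketch of that week/store setup (the week and store values below are made up for illustration):
import pandas as pd

weeks = pd.DataFrame({'week': ['2021-01-04', '2021-01-11']})    # hypothetical weeks
stores = pd.DataFrame({'store_id': [101, 102, 103]})            # hypothetical store IDs

# Add the same constant key to both sides, merge on it, then drop it;
# every week gets paired with every store ID.
result = (weeks.assign(key=0)
               .merge(stores.assign(key=0), on='key', how='left')
               .drop('key', axis=1))
print(result)  # 2 weeks x 3 stores = 6 rows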
In [46]: a = pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])
In [47]: b = pd.DataFrame(np.random.rand(5, 3), columns=["d", "e", "f"])
In [48]: cartesian(a, b)
Out[48]:
           a         b         c         d         e         f
0   0.436480  0.068491  0.260292  0.991311  0.064167  0.715142
1   0.436480  0.068491  0.260292  0.101777  0.840464  0.760616
2   0.436480  0.068491  0.260292  0.655391  0.289537  0.391893
3   0.436480  0.068491  0.260292  0.383729  0.061811  0.773627
4   0.436480  0.068491  0.260292  0.575711  0.995151  0.804567
5   0.469578  0.052932  0.633394  0.991311  0.064167  0.715142
6   0.469578  0.052932  0.633394  0.101777  0.840464  0.760616
7   0.469578  0.052932  0.633394  0.655391  0.289537  0.391893
8   0.469578  0.052932  0.633394  0.383729  0.061811  0.773627
9   0.469578  0.052932  0.633394  0.575711  0.995151  0.804567
10  0.466813  0.224062  0.218994  0.991311  0.064167  0.715142
11  0.466813  0.224062  0.218994  0.101777  0.840464  0.760616
12  0.466813  0.224062  0.218994  0.655391  0.289537  0.391893
13  0.466813  0.224062  0.218994  0.383729  0.061811  0.773627
14  0.466813  0.224062  0.218994  0.575711  0.995151  0.804567
15  0.831365  0.273890  0.130410  0.991311  0.064167  0.715142
16  0.831365  0.273890  0.130410  0.101777  0.840464  0.760616
17  0.831365  0.273890  0.130410  0.655391  0.289537  0.391893
18  0.831365  0.273890  0.130410  0.383729  0.061811  0.773627
19  0.831365  0.273890  0.130410  0.575711  0.995151  0.804567
20  0.447640  0.848283  0.627224  0.991311  0.064167  0.715142
21  0.447640  0.848283  0.627224  0.101777  0.840464  0.760616
22  0.447640  0.848283  0.627224  0.655391  0.289537  0.391893
23  0.447640  0.848283  0.627224  0.383729  0.061811  0.773627
24  0.447640  0.848283  0.627224  0.575711  0.995151  0.804567
As an alternative, one can rely on the cartesian product provided by itertools: itertools.product, which avoids creating a temporary key or modifying the index:
import numpy as np
import pandas as pd
import itertools
def cartesian(df1, df2):
    rows = itertools.product(df1.iterrows(), df2.iterrows())
    # Note: Series.append was removed in pandas 2.0; use pd.concat([left, right]) there.
    df = pd.DataFrame(left.append(right) for (_, left), (_, right) in rows)
    return df.reset_index(drop=True)
Here is a helper function to perform a simple Cartesian product of two data frames. The internal logic uses a temporary join key and avoids mangling any columns that happen to be named "key" in either input.
import pandas as pd
def cartesian(df1, df2):
"""Determine Cartesian product of two data frames."""
key = 'key'
while key in df1.columns or key in df2.columns:
key = '_' + key
key_d = {key: 0}
return pd.merge(
df1.assign(**key_d), df2.assign(**key_d), on=key).drop(key, axis=1)
# Two data frames, where the first happens to have a 'key' column
df1 = pd.DataFrame({'number':[1, 2], 'key':[3, 4]})
df2 = pd.DataFrame({'digit': [5, 6]})
cartesian(df1, df2)
shows:
number key digit
0 1 3 5
1 1 3 6
2 2 4 5
3 2 4 6
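As a side note, pandas 1.2+ also has a built-in cross join, so the temporary key is not strictly needed anymore; a minimal sketch on the same two frames:
import pandas as pd

df1 = pd.DataFrame({'number': [1, 2], 'key': [3, 4]})
df2 = pd.DataFrame({'digit': [5, 6]})

# how='cross' pairs every row of df1 with every row of df2.
pd.merge(df1, df2, how='cross')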
Answer 8
You can take the Cartesian product of df1.col1 and df2.col3 first, and then merge back to df1 to get col2.
Here is a generic Cartesian product function that takes a dictionary of lists:
def cartesian_product(d):
    index = pd.MultiIndex.from_product(d.values(), names=d.keys())
    return pd.DataFrame(index=index).reset_index()
I find the pandas MultiIndex to be the best tool for the job. If you have a list of lists, lists_list, call pd.MultiIndex.from_product(lists_list) and iterate over the result (or use it as a DataFrame index).
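A minimal sketch of that idea (the example lists are placeholders):
import pandas as pd

lists_list = [[1, 2], ['a', 'b', 'c']]

# from_product builds every combination of the input lists.
idx = pd.MultiIndex.from_product(lists_list, names=['num', 'letter'])
combos = pd.DataFrame(index=idx).reset_index()   # one row per combination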
You can use loc with a Series; in that case the index of the Series should be set to the name of the specific column you need to sum:
df.loc['Total'] = pd.Series(df['MyColumn'].sum(), index = ['MyColumn'])
print (df)
X MyColumn Y Z
0 A 84.0 13.0 69.0
1 B 76.0 77.0 127.0
2 C 28.0 69.0 16.0
3 D 28.0 28.0 31.0
4 E 19.0 20.0 85.0
5 F 84.0 193.0 70.0
Total NaN 319.0 NaN NaN
because if you pass a scalar, every value of the new row will be filled with it:
df.loc['Total'] = df['MyColumn'].sum()
print (df)
X MyColumn Y Z
0 A 84 13.0 69.0
1 B 76 77.0 127.0
2 C 28 69.0 16.0
3 D 28 28.0 31.0
4 E 19 20.0 85.0
5 F 84 193.0 70.0
Total 319 319 319.0 319.0
Two other solutions use at and ix; see their application below:
df.at['Total', 'MyColumn'] = df['MyColumn'].sum()
print (df)
X MyColumn Y Z
0 A 84.0 13.0 69.0
1 B 76.0 77.0 127.0
2 C 28.0 69.0 16.0
3 D 28.0 28.0 31.0
4 E 19.0 20.0 85.0
5 F 84.0 193.0 70.0
Total NaN 319.0 NaN NaN
df.ix['Total', 'MyColumn'] = df['MyColumn'].sum()
print (df)
X MyColumn Y Z
0 A 84.0 13.0 69.0
1 B 76.0 77.0 127.0
2 C 28.0 69.0 16.0
3 D 28.0 28.0 31.0
4 E 19.0 20.0 85.0
5 F 84.0 193.0 70.0
Total NaN 319.0 NaN NaN
Note: Since Pandas v0.20, ix has been deprecated. Use loc or iloc instead.
Answer 1
Another option you can use here:
df.loc["Total","MyColumn"]= df.MyColumn.sum()# X MyColumn Y Z#0 A 84.0 13.0 69.0#1 B 76.0 77.0 127.0#2 C 28.0 69.0 16.0#3 D 28.0 28.0 31.0#4 E 19.0 20.0 85.0#5 F 84.0 193.0 70.0#Total NaN 319.0 NaN NaN
您也可以使用append()方法:
df.append(pd.DataFrame(df.MyColumn.sum(), index =["Total"], columns=["MyColumn"]))
更新:
如果需要为所有数字列追加总和,则可以执行以下操作之一:
用于append以功能性方式执行此操作(不更改原始数据帧):
# select numeric columns and calculate the sums
sums = df.select_dtypes(pd.np.number).sum().rename('total')# append sums to the data frame
df.append(sums)# X MyColumn Y Z#0 A 84.0 13.0 69.0#1 B 76.0 77.0 127.0#2 C 28.0 69.0 16.0#3 D 28.0 28.0 31.0#4 E 19.0 20.0 85.0#5 F 84.0 193.0 70.0#total NaN 319.0 400.0 398.0
用于loc在适当位置更改数据框:
df.loc['total']= df.select_dtypes(pd.np.number).sum()
df
# X MyColumn Y Z#0 A 84.0 13.0 69.0#1 B 76.0 77.0 127.0#2 C 28.0 69.0 16.0#3 D 28.0 28.0 31.0#4 E 19.0 20.0 85.0#5 F 84.0 193.0 70.0#total NaN 638.0 800.0 796.0
df.loc["Total", "MyColumn"] = df.MyColumn.sum()
# X MyColumn Y Z
#0 A 84.0 13.0 69.0
#1 B 76.0 77.0 127.0
#2 C 28.0 69.0 16.0
#3 D 28.0 28.0 31.0
#4 E 19.0 20.0 85.0
#5 F 84.0 193.0 70.0
#Total NaN 319.0 NaN NaN
You can also use the append() method:
df.append(pd.DataFrame(df.MyColumn.sum(), index = ["Total"], columns=["MyColumn"]))
Update:
In case you need to append sums for all numeric columns, you can do one of the following:
Use append to do this in a functional manner (doesn’t change the original data frame):
# select numeric columns and calculate the sums
sums = df.select_dtypes(pd.np.number).sum().rename('total')  # pd.np was removed in newer pandas; use np.number with import numpy as np
# append sums to the data frame
df.append(sums)
# X MyColumn Y Z
#0 A 84.0 13.0 69.0
#1 B 76.0 77.0 127.0
#2 C 28.0 69.0 16.0
#3 D 28.0 28.0 31.0
#4 E 19.0 20.0 85.0
#5 F 84.0 193.0 70.0
#total NaN 319.0 400.0 398.0
Use loc to mutate data frame in place:
df.loc['total'] = df.select_dtypes(pd.np.number).sum()
df
# X MyColumn Y Z
#0 A 84.0 13.0 69.0
#1 B 76.0 77.0 127.0
#2 C 28.0 69.0 16.0
#3 D 28.0 28.0 31.0
#4 E 19.0 20.0 85.0
#5 F 84.0 193.0 70.0
#total NaN 638.0 800.0 796.0
So I have initialized an empty pandas DataFrame and I would like to iteratively append lists (or Series) as rows in this DataFrame. What is the best way of doing this?
Sometimes it’s easier to do all the appending outside of pandas, then just create the DataFrame in one shot.
>>> import pandas as pd
>>> simple_list=[['a','b']]
>>> simple_list.append(['e','f'])
>>> df=pd.DataFrame(simple_list,columns=['col1','col2'])
col1 col2
0 a b
1 e f
If you want to add a Series and use the Series’ index as columns of the DataFrame, you only need to append the Series between brackets:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame()
In [3]: row=pd.Series([1,2,3],["A","B","C"])
In [4]: row
Out[4]:
A 1
B 2
C 3
dtype: int64
In [5]: df.append([row],ignore_index=True)
Out[5]:
A B C
0 1 2 3
[1 rows x 3 columns]
Without the ignore_index=True you don’t get a proper index.
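Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a rough pd.concat equivalent of the snippet above would be:
import pandas as pd

df = pd.DataFrame()
row = pd.Series([1, 2, 3], index=["A", "B", "C"])

# Turn the Series into a one-row DataFrame (its index becomes the columns),
# then concatenate; ignore_index=True renumbers the rows.
df = pd.concat([df, row.to_frame().T], ignore_index=True)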
Here’s a function that, given an already created dataframe, will append a list as a new row. It should probably have some error handling thrown in, but if you know exactly what you’re adding then it shouldn’t be an issue.
import pandas as pd
import numpy as np
def addRow(df,ls):
"""
Given a dataframe and a list, append the list as a new row to the dataframe.
:param df: <DataFrame> The original dataframe
:param ls: <list> The new row to be added
:return: <DataFrame> The dataframe with the newly appended row
"""
numEl = len(ls)
newRow = pd.DataFrame(np.array(ls).reshape(1,numEl), columns = list(df.columns))
df = df.append(newRow, ignore_index=True)
return df
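Usage might look like this (the column names are made up; note that np.array(ls) coerces the whole row to a single dtype):
df = pd.DataFrame(columns=['a', 'b', 'c'])
df = addRow(df, [1, 2, 3])
df = addRow(df, [4, 5, 6])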
I’m trying to reprogram my Stata code into Python for speed improvements, and I was pointed in the direction of PANDAS. I am, however, having a hard time wrapping my head around how to process the data.
Let’s say I want to iterate over all values in the column head ‘ID.’ If that ID matches a specific number, then I want to change two corresponding values FirstName and LastName.
In Stata it looks like this:
replace FirstName = "Matt" if ID==103
replace LastName = "Jones" if ID==103
So this replaces all values in FirstName that correspond with values of ID == 103 to Matt.
In PANDAS, I’m trying something like this
df = read_csv("test.csv")
for i in df['ID']:
if i ==103:
...
Note that you’ll need pandas version 0.11 or newer to make use of loc for overwrite assignment operations.
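For the question’s example, a loc-based overwrite looks roughly like this:
# Select the rows where ID is 103 and overwrite both name columns.
df.loc[df['ID'] == 103, 'FirstName'] = 'Matt'
df.loc[df['ID'] == 103, 'LastName'] = 'Jones'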
Another way to do it is to use what is called chained assignment. The behavior of this is less stable and so it is not considered the best solution (it is explicitly discouraged in the docs), but it is useful to know about:
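For illustration only, chained assignment on the same example would look something like this; it can raise SettingWithCopyWarning and may silently fail to modify the original frame, which is why the docs discourage it:
# Chained indexing: select the column first, then index into it.
df['FirstName'][df['ID'] == 103] = 'Matt'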
The original question addresses a specific, narrow use case. For those who need more generic answers, here are some examples:
Creating a new column using data from other columns
Given the dataframe below:
import pandas as pd
import numpy as np
df = pd.DataFrame([['dog', 'hound', 5],
['cat', 'ragdoll', 1]],
columns=['animal', 'type', 'age'])
In[1]:
Out[1]:
animal type age
----------------------
0 dog hound 5
1 cat ragdoll 1
Below we are adding a new description column as a concatenation of other columns, using the + operator, which is overloaded for Series. Fancy string formatting (f-strings, etc.) won’t work here, since those operate on scalar values rather than on whole Series; the overloaded + does:
df['description'] = 'A ' + df.age.astype(str) + ' years old ' \
+ df.type + ' ' + df.animal
In [2]: df
Out[2]:
animal type age description
-------------------------------------------------
0 dog hound 5 A 5 years old hound dog
1 cat ragdoll 1 A 1 years old ragdoll cat
We get 1 years for the cat (instead of 1 year) which we will be fixing below using conditionals.
Modifying an existing column with conditionals
Here we are replacing the original animal column with values from other columns, and using np.where to set a conditional substring based on the value of age:
# append 's' to 'age' if it's greater than 1
df.animal = df.animal + ", " + df.type + ", " + \
df.age.astype(str) + " year" + np.where(df.age > 1, 's', '')
In [3]: df
Out[3]:
animal type age
-------------------------------------
0 dog, hound, 5 years hound 5
1 cat, ragdoll, 1 year ragdoll 1
Modifying multiple columns with conditionals
A more flexible approach is to call .apply() on an entire dataframe rather than on a single column:
def transform_row(r):
r.animal = 'wild ' + r.type
r.type = r.animal + ' creature'
r.age = "{} year{}".format(r.age, r.age > 1 and 's' or '')
return r
df.apply(transform_row, axis=1)
In[4]:
Out[4]:
animal type age
----------------------------------------
0 wild hound dog creature 5 years
1 wild ragdoll cat creature 1 year
In the code above, the transform_row(r) function takes a Series object representing a given row (indicated by axis=1; the default value of axis=0 would instead provide a Series object for each column). This simplifies processing, since we can access the actual ‘primitive’ values in the row using the column names and have visibility of other cells in the given row/column.
This question might still be visited often enough that it’s worth offering an addendum to Mr Kassies’ answer. The dict built-in class can be sub-classed so that a default is returned for ‘missing’ keys. This mechanism works well for pandas. But see below.
The same thing can be done more simply in the following way. The use of the ‘default’ argument for the get method of a dict object makes it unnecessary to subclass a dict.
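A minimal sketch of that idea, with a made-up mapping dict d:
import pandas as pd

d = {'cat': 'feline', 'dog': 'canine'}       # hypothetical mapping
s = pd.Series(['cat', 'dog', 'mouse'])

# dict.get's second argument supplies the default for missing keys,
# so no dict subclass is needed.
s.map(lambda x: d.get(x, 'unknown'))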
df.index // 2
# Int64Index([0, 0, 1, 1, 2], dtype='int64')
df.groupby(df.index // 2).first()
# Alternatively,
# df.groupby(df.index // 2).head(1)
a b c
0 x x x
1 x x x
2 x x x
The index is floor-divided by the stride (2, in this case). If the index is non-numeric, instead do
# df.groupby(np.arange(len(df)) // 2).first()
df.groupby(pd.RangeIndex(len(df)) // 2).first()
a b c
0 x x x
1 x x x
2 x x x
Answer 3
I had a similar requirement, but I wanted the nth item within a particular group. This is how I solved it:
groups = data.groupby(['group_key'])
selection = groups['index_col'].apply(lambda x: x % 3 == 0)
subset = data[selection]
I have a dataframe in pandas with mixed int and str data columns. I want to concatenate first the columns within the dataframe. To do that I have to convert an int column to str.
I’ve tried to do as follows:
mtrx['X.3'] = mtrx.to_string(columns = ['X.3'])
or
mtrx['X.3'] = mtrx['X.3'].astype(str)
but in both cases it’s not working and I’m getting an error saying “cannot concatenate ‘str’ and ‘int’ objects”. Concatenating two str columns is working perfectly fine.
Answer 0
In [16]: df = DataFrame(np.arange(10).reshape(5,2), columns=list('AB'))
In [17]: df
Out[17]:
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
In [18]: df.dtypes
Out[18]:
A    int64
B    int64
dtype: object
Convert a series:
In [19]: df['A'].apply(str)
Out[19]:
0    0
1    2
2    4
3    6
4    8
Name: A, dtype: object
In [20]: df['A'].apply(str)[0]
Out[20]: '0'
Don't forget to assign the result back:
df['A'] = df['A'].apply(str)
Convert the whole frame:
In [21]: df.applymap(str)
Out[21]:
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
In [22]: df.applymap(str).iloc[0, 0]
Out[22]: '0'
All of the above approaches work on a data frame. But if you are using a lambda while creating or modifying a column, they won’t apply, because there the value is treated as an int attribute rather than a pandas Series. You have to use str(target_attribute) to make it a string. See the example below.
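A hedged sketch of what is meant, with made-up column names:
import pandas as pd

df = pd.DataFrame({'X.3': [1, 2], 'name': ['a', 'b']})

# Inside a row-wise lambda each value is a plain Python scalar,
# so wrap the int in str() before concatenating.
df['combined'] = df.apply(lambda row: str(row['X.3']) + '_' + row['name'], axis=1)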
I want to drop all the columns whose name contains the word “Test”. The numbers of such columns is not static but depends on a previous function.
How can I do that?
Answer 0
import pandas as pd
import numpy as np
array = np.random.random((2, 4))
df = pd.DataFrame(array, columns=('Test1', 'toto', 'test2', 'riri'))
print df
      Test1      toto     test2      riri
0  0.923249  0.572528  0.845464  0.144891
1  0.020438  0.332540  0.144455  0.741412
cols = [c for c in df.columns if c.lower()[:4] != 'test']
df = df[cols]
print df
       toto      riri
0  0.572528  0.144891
1  0.332540  0.741412
In recent versions of pandas, you can use string methods on the index and columns. Here, str.startswith seems like a good fit.
To remove all columns starting with a given substring:
df.columns.str.startswith('Test')
# array([ True, False, False, False])
df.loc[:,~df.columns.str.startswith('Test')]
toto test2 riri
0 x x x
1 x x x
For case-insensitive matching, you can use regex-based matching with str.contains and a start-of-line (^) anchor:
df.columns.str.contains('^test', case=False)
# array([ True, False, True, False])
df.loc[:,~df.columns.str.contains('^test', case=False)]
toto riri
0 x x
1 x x
If mixed types are a possibility, specify na=False as well.
You can filter out the columns you DO want using ‘filter’
import pandas as pd
import numpy as np
data2 = [{'test2': 1, 'result1': 2}, {'test': 5, 'result34': 10, 'c': 20}]
df = pd.DataFrame(data2)
df
c result1 result34 test test2
0 NaN 2.0 NaN NaN 1.0
1 20.0 NaN 10.0 5.0 NaN
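The filter call itself might then look like this for the question at hand (keeping only columns that do not contain 'test', case-insensitively); this is one possible pattern, not the only one:
# (?i) makes the regex case-insensitive; the negative lookahead keeps
# only labels that do not contain 'test' anywhere.
df.filter(regex=r'(?i)^(?!.*test)')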
Here is a solution for when you are dropping a list of column names that may contain regexes. I prefer this approach because I’m frequently editing the drop list. It uses a negative lookahead filter regex built from the drop list.
import re
drop_column_names = ['A', 'B.+', 'C.*']
drop_columns_regex = '^(?!(?:' + '|'.join(drop_column_names) + ')$)'
print('Dropping columns:', ', '.join([c for c in df.columns if not re.search(drop_columns_regex, c)]))
df = df.filter(regex=drop_columns_regex, axis=1)
Are for loops really “bad”? If not, in what situation(s) would they be better than using a more conventional “vectorized” approach? [1]
I am familiar with the concept of “vectorization”, and how pandas employs vectorized techniques to speed up computation. Vectorized functions broadcast operations over the entire series or DataFrame to achieve speedups much greater than conventionally iterating over the data.
However, I am quite surprised to see a lot of code (including from answers on Stack Overflow) offering solutions to problems that involve looping through data using for loops and list comprehensions. The documentation and API say that loops are “bad”, and that one should “never” iterate over arrays, series, or DataFrames. So, how come I sometimes see users suggesting loop-based solutions?
[1] While it is true that the question sounds somewhat broad, the truth is that there are very specific situations when for loops are usually better than conventionally iterating over data. This post aims to capture this for posterity.
TLDR; No, for loops are not blanket “bad”, at least, not always. It is probably more accurate to say that some vectorized operations are slower than iterating, versus saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:
When your data is small (…depending on what you’re doing),
When dealing with object/mixed dtypes
When using the str/regex accessor functions
Let’s examine these situations individually.
Iteration v/s Vectorization on Small Data
Pandas follows a “Convention Over Configuration” approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.
When a pandas function is called, the following things (among others) must be handled internally by the function to ensure it works:
Index/axis alignment
Handling mixed datatypes
Handling missing data
Almost every function will have to deal with these to varying extents, and this presents an overhead. The overhead is less for numeric functions (for example, Series.add), while it is more pronounced for string functions (for example, Series.str.replace).
for loops, on the other hand, are faster than you think. What’s even better is that list comprehensions (which create lists through for loops) are faster still, as they are optimized iterative mechanisms for list creation.
List comprehensions follow the pattern
[f(x) for x in seq]
Where seq is a pandas series or DataFrame column. Or, when operating over multiple columns,
[f(x, y) for x, y in zip(seq1, seq2)]
Where seq1 and seq2 are columns.
Numeric Comparison
Consider a simple boolean indexing operation. The list comprehension method has been timed against Series.ne (!=) and query. Here are the functions:
# Boolean indexing with Numeric value comparison.
df[df.A != df.B] # vectorized !=
df.query('A != B') # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]] # list comp
For simplicity, I have used the perfplot package to run all the timeit tests in this post. The timings for the operations above are below:
The list comprehension outperforms query for moderately sized N, and even outperforms the vectorized not equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.
Note
It is worth mentioning that much of the benefit of list comprehensions comes from not having to worry about index alignment, but this also means that if your code depends on index alignment, this approach will break. In some cases, vectorised operations over the underlying NumPy arrays can be considered as bringing in the “best of both worlds”, allowing for vectorisation without all the unneeded overhead of the pandas functions. This means that you can rewrite the operation above as
df[df.A.values != df.B.values]
Which outperforms both the pandas and list comprehension equivalents:
NumPy vectorization is out of the scope of this post, but it is definitely worth considering, if performance matters.
Value Counts
Taking another example – this time, with another vanilla python construct that is faster than a for loop – collections.Counter. A common requirement is to compute the value counts and return the result as a dictionary. This is done with value_counts, np.unique, and Counter:
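The three kernels being compared (reproduced from the code snippets in the appendix below):
from collections import Counter
import numpy as np

ser.value_counts(sort=False).to_dict()           # value_counts
dict(zip(*np.unique(ser, return_counts=True)))   # np.unique
Counter(ser)                                     # Counter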
The results are more pronounced: Counter wins out over both vectorized methods for a larger range of small N (up to ~3500).
Note
More trivia (courtesy @user2357112): the Counter is implemented with a C accelerator, so while it still has to work with Python objects instead of the underlying C datatypes, it is still faster than a for loop. Python power!
Of course, the takeaway here is that the performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don’t give you the performance you need, there are always Cython and Numba. Let’s add this test into the mix.
from numba import njit, prange
@njit(parallel=True)
def get_mask(x, y):
result = [False] * len(x)
for i in prange(len(x)):
result[i] = x[i] != y[i]
return np.array(result)
df[get_mask(df.A.values, df.B.values)] # numba
Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.
Operations with Mixed/object dtypes
String-based Comparison
Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.
# Boolean indexing with string value comparison.
df[df.A != df.B] # vectorized !=
df.query('A != B') # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]] # list comp
So, what changed? The thing to note here is that string operations are inherently difficult to vectorize. Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.
Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.
When it comes to operations on mutable/complex objects, there is no comparison. List comprehension outperforms all operations involving dicts and lists.
Accessing Dictionary Value(s) by Key
Here are timings for two operations that extract a value from a column of dictionaries: map and the list comprehension. The setup is in the Appendix, under the heading “Code Snippets”.
# Dictionary value extraction.
ser.map(operator.itemgetter('value')) # map
pd.Series([x.get('value') for x in ser]) # list comprehension
Positional List Indexing
Timings for operations that extract the 0th element from a column of lists (handling exceptions): map, the str accessor, and list comprehensions:
# List positional indexing.
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan

ser.map(get_0th)  # map
ser.str[0] # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]) # list comp
pd.Series([get_0th(x) for x in ser]) # list comp safe
Note
If the index matters, you would want to do:
pd.Series([...], index=ser.index)
When reconstructing the series.
List Flattening
A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure python is here.
# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True) # stack
pd.Series(list(chain.from_iterable(ser.tolist()))) # itertools.chain
pd.Series([y for x in ser for y in x]) # nested list comp
Both itertools.chain.from_iterable and the nested list comprehension are pure python constructs, and scale much better than the stack solution.
These timings are a strong indication of the fact that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.
Lastly, the applicability of these solutions depends widely on your data. So, the best thing to do would be to test these operations on your data before deciding what to go with. Notice how I have not timed apply on these solutions, because it would skew the graph (yes, it’s that slow).
Regex Operations, and .str Accessor Methods
Pandas can apply regex operations such as str.contains, str.extract, and str.extractall, as well as other “vectorized” string operations (such as str.split, str.find, str.translate, and so on) on string columns. These functions are slower than list comprehensions, and are meant to be more convenience functions than anything else.
It is usually much faster to pre-compile a regex pattern and iterate over your data with re.compile (also see Is it worth using Python’s re.compile?). The list comp equivalent to str.contains looks something like this:
p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])
Or,
ser2 = ser[[bool(p.search(x)) for x in ser]]
If you need to handle NaNs, you can do something like
ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]
The list comp equivalent to str.extract (without groups) will look something like:
df['col2'] = [p.search(x).group(0) for x in df['col']]
If you need to handle no-matches and NaNs, you can use a custom function (still faster!):
def matcher(x):
m = p.search(str(x))
if m:
return m.group(0)
return np.nan
df['col2'] = [matcher(x) for x in df['col']]
The matcher function is very extensible. It can be fitted to return a list for each capture group, as needed; just query the group or groups attribute of the match object.
For str.extractall, change p.search to p.findall.
String Extraction
Consider a simple filtering operation. The idea is to extract 4 digits if they are preceded by an upper case letter.
# Extracting strings.
p = re.compile(r'(?<=[A-Z])(\d{4})')
def matcher(x):
m = p.search(x)
if m:
return m.group(0)
return np.nan
ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False) # str.extract
pd.Series([matcher(x) for x in ser]) # list comprehension
More Examples
Full disclosure – I am the author (in part or whole) of these posts listed below.
As shown from the examples above, iteration shines when working with small rows of DataFrames, mixed datatypes, and regular expressions.
The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payout is worth the effort.
The “vectorized” functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.
Another side note, certain string operations deal with constraints that favour the use of NumPy. Here are two examples where careful NumPy vectorization outperforms python:
Additionally, sometimes just operating on the underlying arrays via .values as opposed to on the Series or DataFrames can offer a healthy enough speedup for most usual scenarios (see the Note in the Numeric Comparison section above). So, for example df[df.A.values != df.B.values] would show instant performance boosts over df[df.A != df.B]. Using .values may not be appropriate in every situation, but it is a useful hack to know.
As mentioned above, it’s up to you to decide whether these solutions are worth the trouble of implementing.
Appendix: Code Snippets
import perfplot
import operator
import pandas as pd
import numpy as np
import re
from collections import Counter
from itertools import chain
# Boolean indexing with Numeric value comparison.
perfplot.show(
setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B']),
kernels=[
lambda df: df[df.A != df.B],
lambda df: df.query('A != B'),
lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
lambda df: df[get_mask(df.A.values, df.B.values)]
],
labels=['vectorized !=', 'query (numexpr)', 'list comp', 'numba'],
n_range=[2**k for k in range(0, 15)],
xlabel='N'
)
# Value Counts comparison.
perfplot.show(
setup=lambda n: pd.Series(np.random.choice(1000, n)),
kernels=[
lambda ser: ser.value_counts(sort=False).to_dict(),
lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
lambda ser: Counter(ser),
],
labels=['value_counts', 'np.unique', 'Counter'],
n_range=[2**k for k in range(0, 15)],
xlabel='N',
equality_check=lambda x, y: dict(x) == dict(y)
)
# Boolean indexing with string value comparison.
perfplot.show(
setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B'], dtype=str),
kernels=[
lambda df: df[df.A != df.B],
lambda df: df.query('A != B'),
lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
],
labels=['vectorized !=', 'query (numexpr)', 'list comp'],
n_range=[2**k for k in range(0, 15)],
xlabel='N',
equality_check=None
)
# Dictionary value extraction.
ser1 = pd.Series([{'key': 'abc', 'value': 123}, {'key': 'xyz', 'value': 456}])
perfplot.show(
setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
kernels=[
lambda ser: ser.map(operator.itemgetter('value')),
lambda ser: pd.Series([x.get('value') for x in ser]),
],
labels=['map', 'list comprehension'],
n_range=[2**k for k in range(0, 15)],
xlabel='N',
equality_check=None
)
# List positional indexing.
ser2 = pd.Series([['a', 'b', 'c'], [1, 2], []])
perfplot.show(
setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
kernels=[
lambda ser: ser.map(get_0th),
lambda ser: ser.str[0],
lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
lambda ser: pd.Series([get_0th(x) for x in ser]),
],
labels=['map', 'str accessor', 'list comprehension', 'list comp safe'],
n_range=[2**k for k in range(0, 15)],
xlabel='N',
equality_check=None
)
# Nested list flattening.
ser3 = pd.Series([['a', 'b', 'c'], ['d', 'e'], ['f', 'g']])
perfplot.show(
setup=lambda n: pd.concat([ser3] * n, ignore_index=True),
kernels=[
lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
lambda ser: pd.Series([y for x in ser for y in x]),
],
labels=['stack', 'itertools.chain', 'nested list comp'],
n_range=[2**k for k in range(0, 15)],
xlabel='N',
equality_check=None
)
# Extracting strings.
ser4 = pd.Series(['foo xyz', 'test A1234', 'D3345 xtz'])
perfplot.show(
setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
kernels=[
lambda ser: ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False),
lambda ser: pd.Series([matcher(x) for x in ser])
],
labels=['str.extract', 'list comprehension'],
n_range=[2**k for k in range(0, 15)],
xlabel='N',
equality_check=None
)
I can do count(distinct hID) in Qlik to come up with count of 5 for unique hID. How do I do that in python using a pandas dataframe? Or maybe a numpy array? Similarly, if were to do count(hID) I will get 8 in Qlik. What is the equivalent way to do it in pandas?
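Assuming the column is named 'hID', the pandas equivalents are roughly:
df['hID'].nunique()   # distinct count, like count(distinct hID)
df['hID'].count()     # non-null count, like count(hID)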
I have a dataframe in pandas and I’m trying to figure out what the types of its values are. I am unsure what the type is of column 'Test'. However, when I run myFrame['Test'].dtype, I get dtype('O').
The first character specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters. The item size must correspond to an existing type, or an error will be raised. The supported kinds are:
'b' boolean
'i' (signed) integer
'u' unsigned integer
'f' floating-point
'c' complex-floating point
'O' (Python) objects
'S', 'a' (byte-)string
'U' Unicode
'V' raw data (void)
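A quick illustrative check of these kind codes (a small sketch, assuming default pandas dtypes):
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b'])
s.dtype                     # dtype('O')
s.dtype.kind                # 'O' -- the kind character from the list above
np.dtype('float64').kind    # 'f'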
You can interpret the last one as the pandas dtype('O'), i.e. a pandas object, which here is the Python type str; this corresponds to the NumPy string_ or unicode_ types.
Pandas dtype    Python type    NumPy type           Usage
object          str            string_, unicode_    Text
Just as Don Quixote rides on a donkey, pandas rides on NumPy, and NumPy understands the underlying architecture of your system and uses the numpy.dtype class for that.
A data type object is an instance of the numpy.dtype class, which describes the data type more precisely, including:
Type of the data (integer, float, Python object, etc.)
Size of the data (how many bytes is in e.g. the integer)
Byte order of the data (little-endian or big-endian)
If the data type is structured, an aggregate of other data types, (e.g., describing an array item consisting of an integer and a float)
What are the names of the “fields” of the structure
What is the data-type of each field
Which part of the memory block each field takes
If the data type is a sub-array, what is its shape and data type
In the context of this question, dtype belongs to both pandas and NumPy, and in particular dtype('O') means we expect strings (Python objects).
Here is some code for testing with explanation:
If we have the dataset as a dictionary:
import pandas as pd
import numpy as np
from pandas import Timestamp
data = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
        'date': {0: Timestamp('2018-12-12 00:00:00'), 1: Timestamp('2018-12-12 00:00:00'),
                 2: Timestamp('2018-12-12 00:00:00'), 3: Timestamp('2018-12-12 00:00:00'),
                 4: Timestamp('2018-12-12 00:00:00')},
        'role': {0: 'Support', 1: 'Marketing', 2: 'Business Development', 3: 'Sales', 4: 'Engineering'},
        'num': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567},
        'fnum': {0: 3.14, 1: 2.14, 2: -0.14, 3: 41.3, 4: 3.14}}
df = pd.DataFrame.from_dict(data)  # now we have a dataframe
print(df)
print(df.dtypes)
The last lines examine the dataframe; note the output:
id date role num fnum
0 1 2018-12-12 Support 123 3.14
1 2 2018-12-12 Marketing 234 2.14
2 3 2018-12-12 Business Development 345 -0.14
3 4 2018-12-12 Sales 456 41.30
4 5 2018-12-12 Engineering 567 3.14
id int64
date datetime64[ns]
role object
num int64
fnum float64
dtype: object
All kinds of different dtypes
df.iloc[1,:] = np.nan
df.iloc[2,:] = None
But if we try to set np.nan or None this will not affect the original column dtype. The output will be like this:
print(df)
print(df.dtypes)
id date role num fnum
0 1.0 2018-12-12 Support 123.0 3.14
1 NaN NaT NaN NaN NaN
2 NaN NaT None NaN NaN
3 4.0 2018-12-12 Sales 456.0 41.30
4 5.0 2018-12-12 Engineering 567.0 3.14
id float64
date datetime64[ns]
role object
num float64
fnum float64
dtype: object
So np.nan or None will not change the column’s dtype, unless we set all rows of the column to np.nan or None. In that case the column will become float64 or object, respectively.
You may also try setting single rows:
df.iloc[3,:] = 0 # will convert datetime to object only
df.iloc[4,:] = '' # will convert all columns to object
And note that if we set a string inside a non-string column, it will become string or object dtype.
It means “a python object”, i.e. not one of the builtin scalar types supported by numpy.
np.array([object()]).dtype
=> dtype('O')
Answer 3
“O” stands for object.
#Loading a csv file as a dataframe
import pandas as pd
train_df = pd.read_csv('train.csv')
col_name = 'Name of Employee'
#Checking the datatype of column name
train_df[col_name].dtype
#Instead try printing the same thing
print train_df[col_name].dtype
The first line returns: dtype('O')
The line with the print statement returns the following: object