Can you tell me when to use these vectorization methods with basic examples?
I see that map is a Series method whereas the rest are DataFrame methods. I got confused about apply and applymap methods though. Why do we have two methods for applying a function to a DataFrame? Again, simple examples which illustrate the usage would be great!
Straight from Wes McKinney's Python for Data Analysis book, pg. 132 (I highly recommend this book):
Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s apply method does exactly this:
In [116]: frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [117]: frame
Out[117]:
b d e
Utah -0.029638 1.081563 1.280300
Ohio 0.647747 0.831136 -1.549481
Texas 0.513416 -0.884417 0.195343
Oregon -0.485454 -0.477388 -0.309548
In [118]: f = lambda x: x.max() - x.min()
In [119]: frame.apply(f)
Out[119]:
b 1.133201
d 1.965980
e 2.829781
dtype: float64
Many of the most common array statistics (like sum and mean) are DataFrame methods,
so using apply is not necessary.
Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in frame. You can do this with applymap:
In [120]: format = lambda x: '%.2f' % x
In [121]: frame.applymap(format)
Out[121]:
b d e
Utah -0.03 1.08 1.28
Ohio 0.65 0.83 -1.55
Texas 0.51 -0.88 0.20
Oregon -0.49 -0.48 -0.31
The reason for the name applymap is that Series has a map method for applying an element-wise function:
In [122]: frame['e'].map(format)
Out[122]:
Utah 1.28
Ohio -1.55
Texas 0.20
Oregon -0.31
Name: e, dtype: object
Summing up, apply works on a row / column basis of a DataFrame, applymap works element-wise on a DataFrame, and map works element-wise on a Series.
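The three behaviours in that summary can be checked side by side. This is a minimal sketch rather than anything from the book: the fixed values and the `elementwise` helper are illustrative assumptions. The helper exists only because `DataFrame.applymap` was renamed `DataFrame.map` in pandas 2.1, so it picks whichever name is available.

```python
import pandas as pd

frame = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]],
    columns=list("bde"),
    index=["Utah", "Ohio"],
)

# DataFrame.applymap was renamed DataFrame.map in pandas 2.1; use whichever exists.
elementwise = pd.DataFrame.map if hasattr(pd.DataFrame, "map") else pd.DataFrame.applymap

# apply: the function receives a whole column (a Series) and reduces it to a scalar.
col_range = frame.apply(lambda col: col.max() - col.min())

# apply with axis=1: the function receives a whole row instead.
row_range = frame.apply(lambda row: row.max() - row.min(), axis=1)

# applymap: the function receives one cell at a time.
as_text = elementwise(frame, lambda x: f"{x:.2f}")

# Series.map: the function receives one element of a single column at a time.
e_text = frame["e"].map(lambda x: f"{x:.2f}")

print(col_range, row_range, as_text, e_text, sep="\n\n")
```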
apply also works elementwise but is suited to more complex operations and aggregation. The behaviour and return value depend on the function.
Fourth major difference (the most important one): USE CASE
map is meant for mapping values from one domain to another, so is optimised for performance (e.g., df['A'].map({1:'a', 2:'b', 3:'c'}))
applymap is good for elementwise transformations across multiple rows/columns (e.g., df[['A', 'B', 'C']].applymap(str.strip))
apply is for applying any function that cannot be vectorised (e.g., df['sentences'].apply(nltk.sent_tokenize))
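The three use cases above might look like this in practice. It is a sketch under assumptions: the column contents are made up, `str.split` stands in for a real tokenizer such as `nltk.sent_tokenize` so the example has no extra dependency, and the `elementwise` helper only papers over the `applymap` to `DataFrame.map` rename in pandas 2.1.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": [" x ", "y ", " z"],
        "C": [" a b", "c ", " d e f "],
    }
)

# map: translate values from one domain to another through a lookup table.
letters = df["A"].map({1: "a", 2: "b", 3: "c"})

# applymap: the same element-wise cleanup on every cell of several columns.
# (applymap was renamed DataFrame.map in pandas 2.1; use whichever exists.)
elementwise = pd.DataFrame.map if hasattr(pd.DataFrame, "map") else pd.DataFrame.applymap
stripped = elementwise(df[["B", "C"]], str.strip)

# apply: an arbitrary Python function that has no vectorised equivalent.
tokens = stripped["C"].apply(str.split)

print(letters.tolist(), stripped, tokens.tolist(), sep="\n")
```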
Summarising
Footnotes
map, when passed a dictionary or Series, maps elements based on the keys of that dictionary or the index of that Series. Values with no matching key become NaN in the output.
applymap has been optimised for some operations in more recent versions. You will find applymap slightly faster than apply in some cases. My suggestion is to test them both and use whichever works better.
map is optimised for elementwise mappings and transformations. Operations that involve dictionaries or Series let pandas use faster code paths for better performance.
Series.apply returns a scalar for aggregating operations, and a Series otherwise. Similarly for DataFrame.apply. Note that apply also has fast paths when called with certain NumPy functions such as mean, sum, etc.
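The first footnote is easy to verify. A small sketch with made-up values, showing a dictionary lookup and the equivalent Series lookup; in both cases the value 4 has no matching key and comes back as NaN.

```python
import pandas as pd

s = pd.Series([1, 2, 4], name="A")

# Keys 1 and 2 are found; 4 has no entry in the dictionary, so it becomes NaN.
mapped = s.map({1: "a", 2: "b", 3: "c"})

# A Series works as the lookup table too: its index plays the role of the keys.
lookup = pd.Series(["a", "b", "c"], index=[1, 2, 3])
mapped_from_series = s.map(lookup)

print(mapped)
print(mapped_from_series)
```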
There’s great information in these answers, but I’m adding my own to clearly summarize which methods work array-wise versus element-wise. jeremiahbuddha mostly did this but did not mention Series.apply. I don’t have the rep to comment.
DataFrame.apply operates on entire rows or columns at a time.
DataFrame.applymap, Series.apply, and Series.map operate on one element at a time.
There is a lot of overlap between the capabilities of Series.apply and Series.map, meaning that either one will work in most cases. They do have some slight differences though, some of which were discussed in osa’s answer.
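A few of those differences can be shown concretely. This sketch is not taken from the other answers; the `scale` function and the sample values are hypothetical. It relies only on documented parameters: `Series.map` accepts a dict and `na_action`, and `Series.apply` accepts `args`.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, None])

# With a plain function, the two methods give the same result.
via_map = s.map(lambda x: x * 10)
via_apply = s.apply(lambda x: x * 10)

# Only map accepts a dict (or Series) as a lookup table.
labels = s.map({1.0: "one", 2.0: "two"})

# apply forwards extra positional arguments to the function.
def scale(x, factor):
    return x * factor

scaled = s.apply(scale, args=(3,))

# map can skip missing values instead of passing them to the function.
skipped = s.map(lambda x: f"{x:.0f}", na_action="ignore")

print(via_map.equals(via_apply))
print(labels.tolist(), scaled.tolist(), skipped.tolist(), sep="\n")
```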
@jeremiahbuddha mentioned that apply works on rows/columns, while applymap works element-wise. But it seems you can still use apply for element-wise computation:
frame.apply(np.sqrt)
Out[102]:
b d e
Utah NaN 1.435159 NaN
Ohio 1.098164 0.510594 0.729748
Texas NaN 0.456436 0.697337
Oregon 0.359079 NaN NaN
frame.applymap(np.sqrt)
Out[103]:
b d e
Utah NaN 1.435159 NaN
Ohio 1.098164 0.510594 0.729748
Texas NaN 0.456436 0.697337
Oregon 0.359079 NaN NaN
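One way to see why the two calls agree: apply still hands the function a whole column, and np.sqrt happens to be a vectorised NumPy function that returns a column of the same length. The sketch below uses a hypothetical `spy` wrapper and made-up values to record what apply actually passes in.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"b": [1.0, 4.0], "d": [9.0, 16.0]})

seen = []

def spy(col):
    # Record the type of the argument apply passes in, then take the square root.
    seen.append(type(col).__name__)
    return np.sqrt(col)

via_apply = frame.apply(spy)

# NumPy ufuncs also accept the DataFrame directly, with no apply at all.
direct = np.sqrt(frame)

print(seen)
print(via_apply.equals(direct))
```

So the result is element-wise, but the function was called per column, not per cell.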
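Just wanted to point out, as I struggled with this for a bit:
def f(x):
    if x < 0:
        x = 0
    elif x > 100000:
        x = 100000
    return x

df.applymap(f)  # returns a new DataFrame; df itself is not modified
df.describe()  # so this still describes the original values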
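A likely stumbling block with this snippet is that applymap returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a name. A minimal sketch with made-up values; `DataFrame.clip` is shown as a vectorised alternative for this particular capping function, and the `elementwise` helper only handles the `applymap` to `DataFrame.map` rename in pandas 2.1.

```python
import pandas as pd

df = pd.DataFrame({"x": [-5, 50, 200000]})

def f(x):
    if x < 0:
        x = 0
    elif x > 100000:
        x = 100000
    return x

# applymap was renamed DataFrame.map in pandas 2.1; use whichever exists.
elementwise = pd.DataFrame.map if hasattr(pd.DataFrame, "map") else pd.DataFrame.applymap

# The call returns a new DataFrame; df keeps its original values.
result = elementwise(df, f)

# For simple capping, clip does the same job without a Python-level function.
clipped = df.clip(lower=0, upper=100000)

print(df["x"].tolist(), result["x"].tolist(), clipped["x"].tolist())
```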
Probably the simplest explanation of the difference between apply and applymap:
apply takes a whole column as its parameter and assigns the result back to that column.
applymap takes a single cell value as its parameter and assigns the result back to that cell.
NB: if apply returns a single value, you get that value in place of the column, so you end up with just one row (a Series) instead of a matrix.
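The note about single return values can be illustrated with two calls on the same frame. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# The function returns a column of the same length: the result keeps the 3x2 shape.
same_shape = df.apply(lambda col: col * 2)

# The function returns one value per column: the result collapses to a Series.
collapsed = df.apply(lambda col: col.sum())

print(same_shape)
print(collapsed)
```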
My understanding, from the function's point of view:
If the function needs to compare values within a column or row, use apply,
e.g. lambda x: x.max() - x.mean().
If the function is to be applied to each element:
1> If a single column/row is selected, use apply
2> If it applies to the entire DataFrame, use applymap
majority = lambda x: x > 17
df2['legal_drinker'] = df2['age'].apply(majority)

def times10(x):
    if type(x) is int:
        x *= 10
    return x

df2.applymap(times10)
In [3]: frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [4]: frame
Out[4]:
               b         d         e
Utah    0.129885 -0.475957 -0.207679
Ohio   -2.978331 -1.015918  0.784675
Texas  -0.256689 -0.226366  2.262588
Oregon  2.605526  1.139105 -0.927518
In [5]: myformat = lambda x: f'{x:.2f}'
In [6]: frame.d.map(myformat)
Out[6]:
Utah      -0.48
Ohio      -1.02
Texas     -0.23
Oregon     1.14
Name: d, dtype: object
In [7]: frame.d.apply(myformat)
Out[7]:
Utah      -0.48
Ohio      -1.02
Texas     -0.23
Oregon     1.14
Name: d, dtype: object
In [8]: frame.applymap(myformat)
Out[8]:
            b      d      e
Utah     0.13  -0.48  -0.21
Ohio    -2.98  -1.02   0.78
Texas   -0.26  -0.23   2.26
Oregon   2.61   1.14  -0.93
In [9]: frame.apply(lambda x: x.apply(myformat))
Out[9]:
            b      d      e
Utah     0.13  -0.48  -0.21
Ohio    -2.98  -1.02   0.78
Texas   -0.26  -0.23   2.26
Oregon   2.61   1.14  -0.93
In [10]: myfunc = lambda x: x**2
In [11]: frame.applymap(myfunc)
Out[11]:
               b         d         e
Utah    0.016870  0.226535  0.043131
Ohio    8.870453  1.032089  0.615714
Texas   0.065889  0.051242  5.119305
Oregon  6.788766  1.297560  0.860289
In [12]: frame.apply(myfunc)
Out[12]:
               b         d         e
Utah    0.016870  0.226535  0.043131
Ohio    8.870453  1.032089  0.615714
Texas   0.065889  0.051242  5.119305
Oregon  6.788766  1.297560  0.860289
The following example shows apply and applymap applied to a DataFrame.
The map function is something you apply to a Series only. You cannot apply map to a DataFrame in older pandas versions (pandas 2.1 added DataFrame.map as the new name for applymap).
The thing to remember is that apply can do anything applymap can, but apply has eXtra options.
The X factor options are axis and result_type, where result_type only works when axis=1 (for columns).
import numpy as np
from pandas import DataFrame

df = DataFrame(1, columns=list('abc'),
               index=list('1234'))
print(df)
f = lambda x: np.log(x)
print(df.applymap(f))  # element-wise over the whole DataFrame
print(np.log(df))  # same result: NumPy ufuncs accept a DataFrame directly
print(df.applymap(np.sum))  # no reduction: np.sum of a single cell is the cell itself
# apply takes extra options (axis, result_type) that applymap does not
print(df.apply(f))  # same as applymap, because np.log works on a whole column
print(df.apply(sum, axis=1))  # reduces each row to a scalar
print(df.apply(np.log, axis=1))  # element-wise function, so nothing is reduced
print(df.apply(lambda x: [1, 2, 3], axis=1, result_type='expand'))  # each returned list is expanded into columns
As a side note, the Series map method should not be confused with Python's built-in map function.
The first is called on a Series to map its values; the second applies a function to every item of any iterable.
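A short sketch of that distinction, using made-up values: Series.map returns a new Series and keeps the index, while the built-in map returns a lazy iterator over the values and knows nothing about the index.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["x", "y", "z"])

# Series.map: returns a Series with the original index preserved.
doubled_series = s.map(lambda v: v * 2)

# Built-in map: returns a lazy iterator over the values; the index is discarded.
doubled_iter = map(lambda v: v * 2, s)
doubled_list = list(doubled_iter)

print(doubled_series)
print(doubled_list)
```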
Lastly, don't confuse the DataFrame apply method with the groupby apply method.
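To make the contrast concrete, here is a minimal sketch with a hypothetical two-team table: DataFrame.apply calls the function once per column, while groupby(...).apply calls it once per group.

```python
import pandas as pd

df = pd.DataFrame({"team": ["x", "x", "y"], "score": [1, 2, 10]})

# DataFrame.apply: the function sees the whole 'score' column once.
per_column = df[["score"]].apply(lambda col: col.max() - col.min())

# groupby apply: the function sees the scores of each team separately.
per_group = df.groupby("team")["score"].apply(lambda g: g.max() - g.min())

print(per_column)
print(per_group)
```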