You could just vectorize the function and then apply it directly to a Numpy array each time you need it:
import numpy as np
def f(x):
    return x * x + 3 * x - 2 if x > 0 else x * 5 + 8
f = np.vectorize(f) # or use a different name if you want to keep the original f
result_array = f(A) # if A is your Numpy array
It’s probably better to specify an explicit output type directly when vectorizing:
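For example, a minimal sketch (using the same f as above; otypes takes a list of output dtypes, so NumPy does not have to call the function once just to infer the result type):
import numpy as np

def f(x):
    return x * x + 3 * x - 2 if x > 0 else x * 5 + 8

f_vec = np.vectorize(f, otypes=[float])  # explicit output dtype
result_array = f_vec(A)  # A being your NumPy array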
I believe I have found a better solution. The idea is to convert the function into a NumPy universal function (ufunc – see the documentation), which can exercise parallel computation under the hood.
One can write a customised ufunc in C, which is surely more efficient, or create one by invoking np.frompyfunc, which is a built-in factory method. After testing, this is more efficient than np.vectorize:
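A minimal sketch of the np.frompyfunc route (note that frompyfunc returns object-dtype arrays, so a cast back to a numeric dtype is usually wanted):
import numpy as np

def f(x):
    return x * x + 3 * x - 2 if x > 0 else x * 5 + 8

f_ufunc = np.frompyfunc(f, 1, 1)         # 1 input argument, 1 output
result_array = f_ufunc(A).astype(float)  # A being your NumPy array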
When the 2d-array (or nd-array) is C- or F-contiguous, then this task of mapping a function onto a 2d-array is practically the same as the task of mapping a function onto a 1d-array – we just have to view it that way, e.g. via np.ravel(A,'K').
Possible solutions for 1d-arrays have been discussed, for example, here.
However, when the memory of the 2d-array isn’t contiguous, the situation is a little more complicated, because one would like to avoid possible cache misses if the axes are handled in the wrong order.
NumPy already has machinery in place to process the axes in the best possible order. One possibility to use this machinery is np.vectorize. However, NumPy’s documentation on np.vectorize states that it is “provided primarily for convenience, not for performance” – a slow Python function stays a slow Python function, with all the associated overhead! Another issue is its huge memory consumption – see for example this SO-post.
When one wants to have a performance of a C-function but to use numpy’s machinery, a good solution is to use numba for creation of ufuncs, for example:
# runtime-generated C-function as ufunc
import numba as nb

@nb.vectorize(target="cpu")
def nb_vf(x):
    return x + 2*x*x + 4*x*x*x
It easily beats np.vectorize, and it also beats the case where the same function is performed via NumPy array multiplication/addition, i.e.
# numpy-functionality
def f(x):
    return x + 2*x*x + 4*x*x*x

# python-function as ufunc
import numpy as np
vf = np.vectorize(f)
vf.__name__ = "vf"
See the appendix of this answer for the time-measurement code.
Numba’s version (green) is about 100 times faster than the Python function (i.e. np.vectorize), which is not surprising. But it is also about 10 times faster than the NumPy functionality, because numba’s version doesn’t need intermediate arrays and thus uses the cache more efficiently.
While numba’s ufunc approach is a good trade-off between usability and performance, it is still not the best we can do. Yet there is no silver bullet or approach that is best for every task – one has to understand what the limitations are and how they can be mitigated.
For example, for transcendental functions (e.g. exp, sin, cos) numba doesn’t provide any advantage over NumPy’s np.exp, because no temporary arrays are created – the main source of the speed-up. However, my Anaconda installation utilizes Intel’s VML for vectors bigger than 8192 elements – and it just cannot do that if the memory is not contiguous. So it might be better to copy the elements to contiguous memory in order to be able to use Intel’s VML:
import numba as nb

@nb.vectorize(target="cpu")
def nb_vexp(x):
    return np.exp(x)

def np_copy_exp(x):
    copy = np.ravel(x, 'K')
    return np.exp(copy).reshape(x.shape)
For a fair comparison, I have switched off VML’s parallelization (see the code in the appendix):
As one can see, once VML kicks in, the overhead of copying is more than compensated. Yet once the data becomes too big for the L3 cache, the advantage is minimal, as the task becomes memory-bandwidth-bound once again.
On the other hand, numba could use Intel’s SVML as well, as explained in this post:
from llvmlite import binding
# set before import
binding.set_option('SVML', '-vector-library=SVML')
import numba as nb
@nb.vectorize(target="cpu")
def nb_vexp_svml(x):
return np.exp(x)
and using VML with parallelization yields:
numba’s version has less overhead, but for some sizes VML beats SVML even despite the additional copying overhead – which isn’t a big surprise, as numba’s ufuncs aren’t parallelized.
Listings:
A. comparison of polynomial function:
import perfplot

perfplot.show(
    setup=lambda n: np.random.rand(n, n)[::2, ::2],
    n_range=[2**k for k in range(0, 12)],
    kernels=[
        f,
        vf,
        nb_vf,
    ],
    logx=True,
    logy=True,
    xlabel='len(x)'
)
B. comparison of exp:
import perfplot
import numexpr as ne  # using ne is the easiest way to set vml_num_threads
ne.set_vml_num_threads(1)

perfplot.show(
    setup=lambda n: np.random.rand(n, n)[::2, ::2],
    n_range=[2**k for k in range(0, 12)],
    kernels=[
        nb_vexp,
        np.exp,
        np_copy_exp,
    ],
    logx=True,
    logy=True,
    xlabel='len(x)',
)
All the above answers compare well, but this is for when you need to use a custom function for mapping, you have a numpy.ndarray, and you need to retain the shape of the array.
I have compared just two approaches, but both retain the shape of the ndarray. I used an array with 1 million entries for the comparison. Here I use the square function. I am presenting the general case for an n-dimensional array; for a two-dimensional array, just build the iterator for 2D.
import numpy, time

def A(e):
    return e * e

def timeit():
    y = numpy.arange(1000000)

    now = time.time()
    numpy.array([A(x) for x in y.reshape(-1)]).reshape(y.shape)
    print(time.time() - now)

    now = time.time()
    numpy.fromiter((A(x) for x in y.reshape(-1)), y.dtype).reshape(y.shape)
    print(time.time() - now)

    now = time.time()
    numpy.square(y)
    print(time.time() - now)
Output
>>> timeit()
1.162431240081787 # list comprehension and then building numpy array
1.0775556564331055 # from numpy.fromiter
0.002948284149169922 # using inbuilt function
Here you can clearly see that numpy.fromiter uses the user-provided square function – you can use any function of your choice. If your function depends on i, j, i.e. the indices of the array, iterate over the size of the array, as in for ind in range(arr.size), and use numpy.unravel_index to get i, j, ... from your 1D index and the shape of the array.
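A minimal sketch of that index-dependent case (the function g below is hypothetical):
import numpy as np

arr = np.arange(12).reshape(3, 4)
out = np.empty_like(arr)

def g(i, j, value):  # hypothetical function that depends on the indices
    return value + 10 * i + j

for ind in range(arr.size):
    i, j = np.unravel_index(ind, arr.shape)  # recover (i, j) from the flat index
    out[i, j] = g(i, j, arr[i, j])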
This answer is inspired by my answer to another question here.
Are for loops really “bad”? If not, in what situation(s) would they be better than using a more conventional “vectorized” approach?¹
I am familiar with the concept of “vectorization”, and how pandas employs vectorized techniques to speed up computation. Vectorized functions broadcast operations over the entire series or DataFrame to achieve speedups much greater than conventionally iterating over the data.
However, I am quite surprised to see a lot of code (including from answers on Stack Overflow) offering solutions to problems that involve looping through data using for loops and list comprehensions. The documentation and API say that loops are “bad”, and that one should “never” iterate over arrays, series, or DataFrames. So, how come I sometimes see users suggesting loop-based solutions?
1 – While it is true that the question sounds somewhat broad, the truth is that there are very specific situations when for loops are usually better than conventionally iterating over data. This post aims to capture this for posterity.
TLDR; No, for loops are not blanket “bad”, at least, not always. It is probably more accurate to say that some vectorized operations are slower than iterating, versus saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:
When your data is small (…depending on what you’re doing),
When dealing with object/mixed dtypes
When using the str/regex accessor functions
Let’s examine these situations individually.
Iteration vs. Vectorization on Small Data
Pandas follows a “Convention Over Configuration” approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.
When a pandas function is called, the following things (among others) must be handled internally by the function, to ensure that things work:
Index/axis alignment
Handling mixed datatypes
Handling missing data
Almost every function will have to deal with these to varying extents, and this presents an overhead. The overhead is less for numeric functions (for example, Series.add), while it is more pronounced for string functions (for example, Series.str.replace).
for loops, on the other hand, are faster than you think. What’s even better is that list comprehensions (which create lists through for loops) are faster still, as they are optimized iterative mechanisms for list creation.
List comprehensions follow the pattern
[f(x) for x in seq]
Where seq is a pandas series or DataFrame column. Or, when operating over multiple columns,
[f(x, y) for x, y in zip(seq1, seq2)]
Where seq1 and seq2 are columns.
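As a concrete (hypothetical) illustration of the two-column pattern:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 2, 1]})
mask = [x != y for x, y in zip(df.A, df.B)]   # [True, False, True]
df[mask]                                      # rows where A != B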
Numeric Comparison
Consider a simple boolean indexing operation. The list comprehension method has been timed against Series.ne (!=) and query. Here are the functions:
# Boolean indexing with Numeric value comparison.
df[df.A != df.B] # vectorized !=
df.query('A != B') # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]] # list comp
For simplicity, I have used the perfplot package to run all the timeit tests in this post. The timings for the operations above are below:
The list comprehension outperforms query for moderately sized N, and even outperforms the vectorized not equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.
Note
It is worth mentioning that much of the benefit of list comprehensions comes from not having to worry about index alignment, but this means that if your code depends on index alignment, this approach will break. In some cases, vectorised operations over the underlying NumPy arrays can be considered as bringing in the “best of both worlds”, allowing for vectorisation without all the unneeded overhead of the pandas functions. This means that you can rewrite the operation above as
df[df.A.values != df.B.values]
Which outperforms both the pandas and list comprehension equivalents:
NumPy vectorization is out of the scope of this post, but it is definitely worth considering, if performance matters.
Value Counts
Taking another example – this time, with another vanilla python construct that is faster than a for loop – collections.Counter. A common requirement is to compute the value counts and return the result as a dictionary. This is done with value_counts, np.unique, and Counter:
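For reference, the three kernels being compared look like this (ser being a Series of integers, as in the appendix setup):
from collections import Counter
import numpy as np

ser.value_counts(sort=False).to_dict()           # value_counts
dict(zip(*np.unique(ser, return_counts=True)))   # np.unique
Counter(ser)                                     # Counter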
The results are more pronounced here: Counter wins out over both vectorized methods for a larger range of small N (~3500).
Note
More trivia (courtesy @user2357112): Counter is implemented with a C accelerator, so while it still has to work with Python objects instead of the underlying C datatypes, it is still faster than a for loop. Python power!
Of course, the take away from here is that the performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don’t give you the performance you need, there is always cython and numba. Let’s add this test into the mix.
from numba import njit, prange
@njit(parallel=True)
def get_mask(x, y):
result = [False] * len(x)
for i in prange(len(x)):
result[i] = x[i] != y[i]
return np.array(result)
df[get_mask(df.A.values, df.B.values)] # numba
Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.
Operations with Mixed/object dtypes
String-based Comparison
Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.
# Boolean indexing with string value comparison.
df[df.A != df.B] # vectorized !=
df.query('A != B') # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]] # list comp
So, what changed? The thing to note here is that string operations are inherently difficult to vectorize. Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.
Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.
When it comes to operations on mutable/complex objects, there is no comparison. List comprehension outperforms all operations involving dicts and lists.
Accessing Dictionary Value(s) by Key
Here are timings for two operations that extract a value from a column of dictionaries: map and the list comprehension. The setup is in the Appendix, under the heading “Code Snippets”.
# Dictionary value extraction.
ser.map(operator.itemgetter('value')) # map
pd.Series([x.get('value') for x in ser]) # list comprehension
Positional List Indexing
Timings for 3 operations that extract the 0th element from a column of lists (handling exceptions): map, the str accessor, and the list comprehension:
ser.map(get_0th) # map
ser.str[0] # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]) # list comp
pd.Series([get_0th(x) for x in ser]) # list comp safe
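For reference, the get_0th helper assumed by the map and “list comp safe” kernels above (a small function that falls back to NaN for empty lists and non-list values):
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan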
Note
If the index matters, you would want to do:
pd.Series([...], index=ser.index)
When reconstructing the series.
List Flattening
A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure python is here.
# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True) # stack
pd.Series(list(chain.from_iterable(ser.tolist()))) # itertools.chain
pd.Series([y for x in ser for y in x]) # nested list comp
Both itertools.chain.from_iterable and the nested list comprehension are pure python constructs, and scale much better than the stack solution.
These timings are a strong indication that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.
Lastly, the applicability of these solutions depends largely on your data. So, the best thing to do would be to test these operations on your data before deciding what to go with. Notice how I have not timed apply on these solutions, because it would skew the graph (yes, it’s that slow).
Regex Operations, and .str Accessor Methods
Pandas can apply regex operations such as str.contains, str.extract, and str.extractall, as well as other “vectorized” string operations (such as str.split, str.find, str.translate, and so on) on string columns. These functions are slower than list comprehensions and are meant to be convenience functions more than anything else.
It is usually much faster to pre-compile a regex pattern with re.compile and iterate over your data yourself (also see Is it worth using Python’s re.compile?). The list comp equivalent to str.contains looks something like this:
p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])
Or,
ser2 = ser[[bool(p.search(x)) for x in ser]]
If you need to handle NaNs, you can do something like
ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]
The list comp equivalent to str.extract (without groups) will look something like:
df['col2'] = [p.search(x).group(0) for x in df['col']]
If you need to handle no-matches and NaNs, you can use a custom function (still faster!):
def matcher(x):
m = p.search(str(x))
if m:
return m.group(0)
return np.nan
df['col2'] = [matcher(x) for x in df['col']]
The matcher function is very extensible. It can be fitted to return a list for each capture group, as needed. Just extract and query the group or groups attribute of the match object.
For str.extractall, change p.search to p.findall.
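A minimal sketch of that variant, assuming the pre-compiled pattern p and the Series ser from above (p.findall returns every match per string rather than just the first):
pd.Series([p.findall(x) for x in ser])   # list-comp analogue of str.extractall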
String Extraction
Consider a simple filtering operation. The idea is to extract 4 digits if they are preceded by an upper case letter.
# Extracting strings.
p = re.compile(r'(?<=[A-Z])(\d{4})')
def matcher(x):
m = p.search(x)
if m:
return m.group(0)
return np.nan
ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False) # str.extract
pd.Series([matcher(x) for x in ser]) # list comprehension
More Examples
Full disclosure – I am the author (in part or whole) of these posts listed below.
As shown from the examples above, iteration shines when working with small rows of DataFrames, mixed datatypes, and regular expressions.
The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payout is worth the effort.
The “vectorized” functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.
Another side note, certain string operations deal with constraints that favour the use of NumPy. Here are two examples where careful NumPy vectorization outperforms python:
Additionally, sometimes just operating on the underlying arrays via .values as opposed to on the Series or DataFrames can offer a healthy enough speedup for most usual scenarios (see the Note in the Numeric Comparison section above). So, for example df[df.A.values != df.B.values] would show instant performance boosts over df[df.A != df.B]. Using .values may not be appropriate in every situation, but it is a useful hack to know.
As mentioned above, it’s up to you to decide whether these solutions are worth the trouble of implementing.
Appendix: Code Snippets
import perfplot
import operator
import pandas as pd
import numpy as np
import re
from collections import Counter
from itertools import chain
# Boolean indexing with Numeric value comparison.
perfplot.show(
setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B']),
kernels=[
lambda df: df[df.A != df.B],
lambda df: df.query('A != B'),
lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
lambda df: df[get_mask(df.A.values, df.B.values)]
],
labels=['vectorized !=', 'query (numexpr)', 'list comp', 'numba'],
n_range=[2**k for k in range(0, 15)],
xlabel='N'
)
# Value Counts comparison.
perfplot.show(
setup=lambda n: pd.Series(np.random.choice(1000, n)),
kernels=[
lambda ser: ser.value_counts(sort=False).to_dict(),
lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
lambda ser: Counter(ser),
],
labels=['value_counts', 'np.unique', 'Counter'],
n_range=[2**k for k in range(0, 15)],
xlabel='N',
equality_check=lambda x, y: dict(x) == dict(y)
)
# Boolean indexing with string value comparison.
perfplot.show(
setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B'], dtype=str),
kernels=[
lambda df: df[df.A != df.B],
lambda df: df.query('A != B'),
lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
],
labels=['vectorized !=', 'query (numexpr)', 'list comp'],
n_range=[2**k for k in range(0, 15)],
xlabel='N',
equality_check=None
)
# Dictionary value extraction.
ser1 = pd.Series([{'key': 'abc', 'value': 123}, {'key': 'xyz', 'value': 456}])
perfplot.show(
setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
kernels=[
lambda ser: ser.map(operator.itemgetter('value')),
lambda ser: pd.Series([x.get('value') for x in ser]),
],
labels=['map', 'list comprehension'],
n_range=[2**k for k in range(0, 15)],
xlabel='N',
equality_check=None
)
# List positional indexing.
ser2 = pd.Series([['a', 'b', 'c'], [1, 2], []])
perfplot.show(
setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
kernels=[
lambda ser: ser.map(get_0th),
lambda ser: ser.str[0],
lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
lambda ser: pd.Series([get_0th(x) for x in ser]),
],
labels=['map', 'str accessor', 'list comprehension', 'list comp safe'],
n_range=[2**k for k in range(0, 15)],
xlabel='N',
equality_check=None
)
# Nested list flattening.
ser3 = pd.Series([['a', 'b', 'c'], ['d', 'e'], ['f', 'g']])
perfplot.show(
    setup=lambda n: pd.concat([ser3] * n, ignore_index=True),
kernels=[
lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
lambda ser: pd.Series([y for x in ser for y in x]),
],
labels=['stack', 'itertools.chain', 'nested list comp'],
n_range=[2**k for k in range(0, 15)],
xlabel='N',
equality_check=None
)
# Extracting strings.
ser4 = pd.Series(['foo xyz', 'test A1234', 'D3345 xtz'])
perfplot.show(
setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
kernels=[
lambda ser: ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False),
lambda ser: pd.Series([matcher(x) for x in ser])
],
labels=['str.extract', 'list comprehension'],
n_range=[2**k for k in range(0, 15)],
xlabel='N',
equality_check=None
)
Can you tell me when to use these vectorization methods with basic examples?
I see that map is a Series method whereas the rest are DataFrame methods. I got confused about apply and applymap methods though. Why do we have two methods for applying a function to a DataFrame? Again, simple examples which illustrate the usage would be great!
Straight from Wes McKinney’s Python for Data Analysis book, pg. 132 (I highly recommend this book):
Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s apply method does exactly this:
In [116]: frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [117]: frame
Out[117]:
b d e
Utah -0.029638 1.081563 1.280300
Ohio 0.647747 0.831136 -1.549481
Texas 0.513416 -0.884417 0.195343
Oregon -0.485454 -0.477388 -0.309548
In [118]: f = lambda x: x.max() - x.min()
In [119]: frame.apply(f)
Out[119]:
b 1.133201
d 1.965980
e 2.829781
dtype: float64
Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in frame. You can do this with applymap:
In [120]: format = lambda x: '%.2f' % x
In [121]: frame.applymap(format)
Out[121]:
b d e
Utah -0.03 1.08 1.28
Ohio 0.65 0.83 -1.55
Texas 0.51 -0.88 0.20
Oregon -0.49 -0.48 -0.31
The reason for the name applymap is that Series has a map method for applying an element-wise function:
In [122]: frame['e'].map(format)
Out[122]:
Utah 1.28
Ohio -1.55
Texas 0.20
Oregon -0.31
Name: e, dtype: object
Summing up, apply works on a row / column basis of a DataFrame, applymap works element-wise on a DataFrame, and map works element-wise on a Series.
apply also works elementwise but is suited to more complex operations and aggregation. The behaviour and return value depends on the function.
Fourth major difference (the most important one): USE CASE
map is meant for mapping values from one domain to another, so is optimised for performance (e.g., df['A'].map({1:'a', 2:'b', 3:'c'}))
applymap is good for elementwise transformations across multiple rows/columns (e.g., df[['A', 'B', 'C']].applymap(str.strip))
apply is for applying any function that cannot be vectorised (e.g., df['sentences'].apply(nltk.sent_tokenize))
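A minimal sketch consolidating the three use cases above, on hypothetical data (the nltk example is replaced by a plain lambda here):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['  x ', ' y', 'z ']})

df['A'].map({1: 'a', 2: 'b', 3: 'c'})   # map: value-to-value lookup
df[['B']].applymap(str.strip)           # applymap: elementwise transform over columns
df['A'].apply(lambda x: [x, x ** 2])    # apply: arbitrary, hard-to-vectorise logic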
Summarising
Footnotes
1. map when passed a dictionary/Series will map elements based on the keys in that dictionary/Series. Missing values will be recorded as NaN in the output.
2. applymap in more recent versions has been optimised for some operations. You will find applymap slightly faster than apply in some cases. My suggestion is to test them both and use whatever works better.
3. map is optimised for elementwise mappings and transformations. Operations that involve dictionaries or Series will enable pandas to use faster code paths for better performance.
4. Series.apply returns a scalar for aggregating operations, and a Series otherwise. Similarly for DataFrame.apply. Note that apply also has fastpaths when called with certain NumPy functions such as mean, sum, etc.
There’s great information in these answers, but I’m adding my own to clearly summarize which methods work array-wise versus element-wise. jeremiahbuddha mostly did this but did not mention Series.apply. I don’t have the rep to comment.
DataFrame.apply operates on entire rows or columns at a time.
DataFrame.applymap, Series.apply, and Series.map operate on one
element at time.
There is a lot of overlap between the capabilities of Series.apply and Series.map, meaning that either one will work in most cases. They do have some slight differences though, some of which were discussed in osa’s answer.
@jeremiahbuddha mentioned that apply works on row/columns, while applymap works element-wise. But it seems you can still use apply for element-wise computation….
frame.apply(np.sqrt)
Out[102]:
b d e
Utah NaN 1.435159 NaN
Ohio 1.098164 0.510594 0.729748
Texas NaN 0.456436 0.697337
Oregon 0.359079 NaN NaN
frame.applymap(np.sqrt)
Out[103]:
b d e
Utah NaN 1.435159 NaN
Ohio 1.098164 0.510594 0.729748
Texas NaN 0.456436 0.697337
Oregon 0.359079 NaN NaN
Just wanted to point this out, because I struggled with this for a bit:
def f(x):if x <0:
x =0elif x >100000:
x =100000return x
df.applymap(f)
df.describe()
Probably the simplest explanation of the difference between apply and applymap:
apply takes the whole column as a parameter and then assigns the result to this column.
applymap takes the separate cell value as a parameter and assigns the result back to this cell.
NB: If apply returns a single value, you will have this value instead of the column after assigning, and will eventually end up with just a row instead of a matrix.
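A tiny sketch of that behaviour on a hypothetical DataFrame:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

df.apply(sum)                   # one value per column -> Series a: 3, b: 7
df.applymap(lambda x: x * 2)    # same shape as df, every cell doubled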
My understanding, from a functional point of view:
If the function has variables that need to be compared within a column/row, use apply,
e.g.: lambda x: x.max() - x.mean().
If the function is to be applied to each element:
1> if it is applied to a single column/row, use apply;
2> if it is applied to the entire DataFrame, use applymap.
majority = lambda x : x > 17
df2['legal_drinker'] = df2['age'].apply(majority)
def times10(x):
    if type(x) is int:
        x *= 10
    return x

df2.applymap(times10)
In [3]: frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [4]: frame
Out[4]:
               b         d         e
Utah    0.129885 -0.475957 -0.207679
Ohio   -2.978331 -1.015918  0.784675
Texas  -0.256689 -0.226366  2.262588
Oregon  2.605526  1.139105 -0.927518

In [5]: myformat = lambda x: f'{x:.2f}'

In [6]: frame.d.map(myformat)
Out[6]:
Utah      -0.48
Ohio      -1.02
Texas     -0.23
Oregon     1.14
Name: d, dtype: object

In [7]: frame.d.apply(myformat)
Out[7]:
Utah      -0.48
Ohio      -1.02
Texas     -0.23
Oregon     1.14
Name: d, dtype: object

In [8]: frame.applymap(myformat)
Out[8]:
            b      d      e
Utah     0.13  -0.48  -0.21
Ohio    -2.98  -1.02   0.78
Texas   -0.26  -0.23   2.26
Oregon   2.61   1.14  -0.93

In [9]: frame.apply(lambda x: x.apply(myformat))
Out[9]:
            b      d      e
Utah     0.13  -0.48  -0.21
Ohio    -2.98  -1.02   0.78
Texas   -0.26  -0.23   2.26
Oregon   2.61   1.14  -0.93

In [10]: myfunc = lambda x: x**2

In [11]: frame.applymap(myfunc)
Out[11]:
               b         d         e
Utah    0.016870  0.226535  0.043131
Ohio    8.870453  1.032089  0.615714
Texas   0.065889  0.051242  5.119305
Oregon  6.788766  1.297560  0.860289

In [12]: frame.apply(myfunc)
Out[12]:
               b         d         e
Utah    0.016870  0.226535  0.043131
Ohio    8.870453  1.032089  0.615714
Texas   0.065889  0.051242  5.119305
Oregon  6.788766  1.297560  0.860289
The following example shows apply and applymap applied to a DataFrame.
The map function is something you apply on a Series only. You cannot apply map on a DataFrame.
The thing to remember is that apply can do anything applymap can, but apply has eXtra options.
The X factor options are: axis and result_type where result_type only works when axis=1 (for columns).
import numpy as np
from pandas import DataFrame

df = DataFrame(1, columns=list('abc'),
               index=list('1234'))
print(df)
f = lambda x: np.log(x)
print(df.applymap(f)) # apply to the whole dataframe
print(np.log(df)) # applied to the whole dataframe
print(df.applymap(np.sum)) # reducing can be applied for rows only
# apply can take different options (vs. applymap cannot)
print(df.apply(f)) # same as applymap
print(df.apply(sum, axis=1)) # reducing example
print(df.apply(np.log, axis=1)) # cannot reduce
print(df.apply(lambda x: [1, 2, 3], axis=1, result_type='expand')) # expand result
As a side note, the Series map function should not be confused with the Python map function.
The first one is applied on a Series to map the values, and the second one to every item of an iterable.
Lastly, don’t confuse the DataFrame apply method with the GroupBy apply method.
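A minimal sketch of that last distinction, on a hypothetical DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['x', 'x', 'y'], 'val': [1, 2, 3]})

df[['val']].apply(np.sum)                # DataFrame.apply: one result per column
df.groupby('key')['val'].apply(np.sum)   # GroupBy.apply: one result per group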