Python 实用宝典

Question 1

Lets suppose I have a list like this:

mylist = ["a","b","c","d"]

To get the values printed along with their index I can use Python’s enumerate function like this

>>> for i,j in enumerate(mylist):
...     print i,j
...
0 a
1 b
2 c
3 d
>>>

Now, when I try to use it inside a list comprehension it gives me this error

>>> [i,j for i,j in enumerate(mylist)]
  File "<stdin>", line 1
    [i,j for i,j in enumerate(mylist)]
           ^
SyntaxError: invalid syntax

So, my question is: what is the correct way of using enumerate inside list comprehension?

Question 2

Try this:

[(i, j) for i, j in enumerate(mylist)]

You need to put i,j inside a tuple for the list comprehension to work. Alternatively, given that enumerate() already returns a tuple, you can return it directly without unpacking it first:

[pair for pair in enumerate(mylist)]

Either way, the result that gets returned is as expected:

> [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

Question 3

Just to be really clear, this has nothing to do with enumerate and everything to do with list comprehension syntax.

This list comprehension returns a list of tuples:

[(i,j) for i in range(3) for j in 'abc']

this a list of dicts:

[{i:j} for i in range(3) for j in 'abc']

a list of lists:

[[i,j] for i in range(3) for j in 'abc']

a syntax error:

[i,j for i in range(3) for j in 'abc']

Which is inconsistent (IMHO) and confusing with dictionary comprehensions syntax:

>>> {i:j for i,j in enumerate('abcdef')}
{0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f'}

And a set of tuples:

>>> {(i,j) for i,j in enumerate('abcdef')}
set([(0, 'a'), (4, 'e'), (1, 'b'), (2, 'c'), (5, 'f'), (3, 'd')])

As Óscar López stated, you can just pass the enumerate tuple directly:

>>> [t for t in enumerate('abcdef') ] 
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f')]

Question 4

Or, if you don’t insist on using a list comprehension:

>>> mylist = ["a","b","c","d"]
>>> list(enumerate(mylist))
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

Question 5

If you’re using long lists, it appears the list comprehension’s faster, not to mention more readable.

~$ python -mtimeit -s"mylist = ['a','b','c','d']" "list(enumerate(mylist))"
1000000 loops, best of 3: 1.61 usec per loop
~$ python -mtimeit -s"mylist = ['a','b','c','d']" "[(i, j) for i, j in enumerate(mylist)]"
1000000 loops, best of 3: 0.978 usec per loop
~$ python -mtimeit -s"mylist = ['a','b','c','d']" "[t for t in enumerate(mylist)]"
1000000 loops, best of 3: 0.767 usec per loop

Question 6

Here’s a way to do it:

>>> mylist = ['a', 'b', 'c', 'd']
>>> [item for item in enumerate(mylist)]
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

Alternatively, you can do:

>>> [(i, j) for i, j in enumerate(mylist)]
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

The reason you got an error was that you were missing the () around i and j to make it a tuple.

Question 7

Be explicit about the tuples.

[(i, j) for (i, j) in enumerate(mylist)]

Question 8

All great answer guys. I know the question here is specific to enumeration but how about something like this, just another perspective

from itertools import izip, count
a = ["5", "6", "1", "2"]
tupleList = list( izip( count(), a ) )
print(tupleList)

It becomes more powerful, if one has to iterate multiple lists in parallel in terms of performance. Just a thought

a = ["5", "6", "1", "2"]
b = ["a", "b", "c", "d"]
tupleList = list( izip( count(), a, b ) )
print(tupleList)

Question 9

To read some text file, in C or Pascal, I always use the following snippets to read the data until EOF:

while not eof do begin
  readline(a);
  do_something;
end;

Thus, I wonder how can I do this simple and fast in Python?

Question 10

Loop over the file to read lines:

with open('somefile') as openfileobject:
    for line in openfileobject:
        do_something()

File objects are iterable and yield lines until EOF. Using the file object as an iterable uses a buffer to ensure performant reads.

You can do the same with the stdin (no need to use raw_input():

import sys

for line in sys.stdin:
    do_something()

To complete the picture, binary reads can be done with:

from functools import partial

with open('somefile', 'rb') as openfileobject:
    for chunk in iter(partial(openfileobject.read, 1024), b''):
        do_something()

where chunk will contain up to 1024 bytes at a time from the file, and iteration stops when openfileobject.read(1024) starts returning empty byte strings.

Question 11

You can imitate the C idiom in Python.

To read a buffer up to max_size number of bytes, you can do this:

with open(filename, 'rb') as f:
    while True:
        buf = f.read(max_size)
        if not buf:
            break
        process(buf)

Or, a text file line by line:

# warning -- not idiomatic Python! See below...
with open(filename, 'rb') as f:
    while True:
        line = f.readline()
        if not line:
            break
        process(line)

You need to use while True / break construct since there is no eof test in Python other than the lack of bytes returned from a read.

In C, you might have:

while ((ch != '\n') && (ch != EOF)) {
   // read the next ch and add to a buffer
   // ..
}

However, you cannot have this in Python:

 while (line = f.readline()):
     # syntax error

because assignments are not allowed in expressions in Python (although recent versions of Python can mimic this using assignment expressions, see below).

It is certainly more idiomatic in Python to do this:

# THIS IS IDIOMATIC Python. Do this:
with open('somefile') as f:
    for line in f:
        process(line)

Update: Since Python 3.8 you may also use assignment expressions:

 while line := f.readline():
     process(line)

Question 12

The Python idiom for opening a file and reading it line-by-line is:

with open('filename') as f:
    for line in f:
        do_something(line)

The file will be automatically closed at the end of the above code (the with construct takes care of that).

Finally, it is worth noting that line will preserve the trailing newline. This can be easily removed using:

line = line.rstrip()

Question 13

You can use below code snippet to read line by line, till end of file

line = obj.readline()
while(line != ''):

    # Do Something

    line = obj.readline()

Question 14

While there are suggestions above for “doing it the python way”, if one wants to really have a logic based on EOF, then I suppose using exception handling is the way to do it —

try:
    line = raw_input()
    ... whatever needs to be done incase of no EOF ...
except EOFError:
    ... whatever needs to be done incase of EOF ...

Example:

$ echo test | python -c "while True: print raw_input()"
test
Traceback (most recent call last):
  File "<string>", line 1, in <module> 
EOFError: EOF when reading a line

Or press Ctrl-Z at a raw_input() prompt (Windows, Ctrl-Z Linux)

Question 15

You can use the following code snippet. readlines() reads in the whole file at once and splits it by line.

line = obj.readlines()

Question 16

In addition to @dawg’s great answer, the equivalent solution using walrus operator (Python >= 3.8):

with open(filename, 'rb') as f:
    while buf := f.read(max_size):
        process(buf)

Question 17

我有以下数据框：

 Index_Date    A    B    C    D
 ===============================
 2015-01-31    10   10   Nan  10
 2015-02-01     2    3   Nan  22 
 2015-02-02    10   60   Nan  280
 2015-02-03    10   100   Nan  250

要求：

 Index_Date    A    B    C    D
 ===============================
 2015-01-31    10   10   10   10
 2015-02-01     2    3   23   22
 2015-02-02    10   60   290  280
 2015-02-03    10   100  3000 250

Column C导出用于2015-01-31通过取value的D。

然后，我需要使用value的C用于2015-01-31通过和乘法value的A上2015-02-01添加B。

我尝试使用，apply并shift使用if else，这会导致出现关键错误。

Question 18

I have the following dataframe:

 Index_Date    A    B    C    D
 ===============================
 2015-01-31    10   10   Nan  10
 2015-02-01     2    3   Nan  22 
 2015-02-02    10   60   Nan  280
 2015-02-03    10   100   Nan  250

Require:

 Index_Date    A    B    C    D
 ===============================
 2015-01-31    10   10   10   10
 2015-02-01     2    3   23   22
 2015-02-02    10   60   290  280
 2015-02-03    10   100  3000 250

Column C is derived for 2015-01-31 by taking value of D.

Then I need to use the value of C for 2015-01-31 and multiply by the value of A on 2015-02-01 and add B.

I have attempted an apply and a shift using an if else by this gives a key error.

Question 19

首先，创建派生值：

df.loc[0, 'C'] = df.loc[0, 'D']

然后遍历其余行并填充计算出的值：

for i in range(1, len(df)):
    df.loc[i, 'C'] = df.loc[i-1, 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']


  Index_Date   A   B    C    D
0 2015-01-31  10  10   10   10
1 2015-02-01   2   3   23   22
2 2015-02-02  10  60  290  280

Question 20

First, create the derived value:

df.loc[0, 'C'] = df.loc[0, 'D']

Then iterate through the remaining rows and fill the calculated values:

for i in range(1, len(df)):
    df.loc[i, 'C'] = df.loc[i-1, 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']


  Index_Date   A   B    C    D
0 2015-01-31  10  10   10   10
1 2015-02-01   2   3   23   22
2 2015-02-02  10  60  290  280

Question 21

给定一列数字：

lst = []
cols = ['A']
for a in range(100, 105):
    lst.append([a])
df = pd.DataFrame(lst, columns=cols, index=range(5))
df

    A
0   100
1   101
2   102
3   103
4   104

您可以使用shift引用上一行：

df['Change'] = df.A - df.A.shift(1)
df

    A   Change
0   100 NaN
1   101 1.0
2   102 1.0
3   103 1.0
4   104 1.0

Question 22

Given a column of numbers:

lst = []
cols = ['A']
for a in range(100, 105):
    lst.append([a])
df = pd.DataFrame(lst, columns=cols, index=range(5))
df

    A
0   100
1   101
2   102
3   103
4   104

You can reference the previous row with shift:

df['Change'] = df.A - df.A.shift(1)
df

    A   Change
0   100 NaN
1   101 1.0
2   102 1.0
3   103 1.0
4   104 1.0

Question 23

`numba`

对于不可矢量化的递归计算numba，使用JIT编译并与较低级别的对象配合使用，通常会带来较大的性能改进。您只需要定义一个常规for循环并使用decorator@njit或（对于旧版本）@jit(nopython=True)：

对于合理大小的数据帧，与常规for循环相比，这可以将性能提高约30倍：

from numba import jit

@jit(nopython=True)
def calculator_nb(a, b, d):
    res = np.empty(d.shape)
    res[0] = d[0]
    for i in range(1, res.shape[0]):
        res[i] = res[i-1] * a[i] + b[i]
    return res

df['C'] = calculator_nb(*df[list('ABD')].values.T)

n = 10**5
df = pd.concat([df]*n, ignore_index=True)

# benchmarking on Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3, Numba 0.30.1
# calculator() is same as calculator_nb() but without @jit decorator
%timeit calculator_nb(*df[list('ABD')].values.T)  # 14.1 ms per loop
%timeit calculator(*df[list('ABD')].values.T)     # 444 ms per loop

Question 24

`numba`

For recursive calculations which are not vectorisable, numba, which uses JIT-compilation and works with lower level objects, often yields large performance improvements. You need only define a regular for loop and use the decorator @njit or (for older versions) @jit(nopython=True):

For a reasonable size dataframe, this gives a ~30x performance improvement versus a regular for loop:

from numba import jit

@jit(nopython=True)
def calculator_nb(a, b, d):
    res = np.empty(d.shape)
    res[0] = d[0]
    for i in range(1, res.shape[0]):
        res[i] = res[i-1] * a[i] + b[i]
    return res

df['C'] = calculator_nb(*df[list('ABD')].values.T)

n = 10**5
df = pd.concat([df]*n, ignore_index=True)

# benchmarking on Python 3.6.0, Pandas 0.19.2, NumPy 1.11.3, Numba 0.30.1
# calculator() is same as calculator_nb() but without @jit decorator
%timeit calculator_nb(*df[list('ABD')].values.T)  # 14.1 ms per loop
%timeit calculator(*df[list('ABD')].values.T)     # 444 ms per loop

Question 25

在numpy数组上应用递归函数将比当前答案更快。

df = pd.DataFrame(np.repeat(np.arange(2, 6),3).reshape(4,3), columns=['A', 'B', 'D'])
new = [df.D.values[0]]
for i in range(1, len(df.index)):
    new.append(new[i-1]*df.A.values[i]+df.B.values[i])
df['C'] = new

输出量

      A  B  D    C
   0  1  1  1    1
   1  2  2  2    4
   2  3  3  3   15
   3  4  4  4   64
   4  5  5  5  325

Question 26

Applying the recursive function on numpy arrays will be faster than the current answer.

df = pd.DataFrame(np.repeat(np.arange(2, 6),3).reshape(4,3), columns=['A', 'B', 'D'])
new = [df.D.values[0]]
for i in range(1, len(df.index)):
    new.append(new[i-1]*df.A.values[i]+df.B.values[i])
df['C'] = new

Output

      A  B  D    C
   0  1  1  1    1
   1  2  2  2    4
   2  3  3  3   15
   3  4  4  4   64
   4  5  5  5  325

Question 27

尽管问这个问题已经有一段时间了，但我还是会发表我的答案，希望对大家有所帮助。

免责声明：我知道此解决方案不是标准的，但我认为它很好用。

import pandas as pd
import numpy as np

data = np.array([[10, 2, 10, 10],
                 [10, 3, 60, 100],
                 [np.nan] * 4,
                 [10, 22, 280, 250]]).T
idx = pd.date_range('20150131', end='20150203')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df
               A    B     C    D
 =================================
 2015-01-31    10   10    NaN  10
 2015-02-01    2    3     NaN  22 
 2015-02-02    10   60    NaN  280
 2015-02-03    10   100   NaN  250

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)
df
               A    B     C     D
 =================================
 2015-01-31    10   10    10    10
 2015-02-01    2    3     23    22 
 2015-02-02    10   60    290   280
 2015-02-03    10   100   3000  250

因此，基本上，我们使用applyfrom from pandas和全局变量的帮助来跟踪先前的计算值。

for循环时间比较：

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

df.loc['2015-01-31', 'C'] = df.loc['2015-01-31', 'D']

%%timeit
for i in df.loc['2015-02-01':].index.date:
    df.loc[i, 'C'] = df.loc[(i - pd.DateOffset(days=1)).date(), 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']

每个循环3.2 s±114毫秒（平均±标准偏差，共运行7次，每个循环1次）

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value

%%timeit
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)

每个循环1.82 s±64.4 ms（平均±标准偏差，共7次运行，每个循环1次）

因此平均快0.57倍。

Question 28

Although it has been a while since this question was asked, I will post my answer hoping it helps somebody.

Disclaimer: I know this solution is not standard, but I think it works well.

import pandas as pd
import numpy as np

data = np.array([[10, 2, 10, 10],
                 [10, 3, 60, 100],
                 [np.nan] * 4,
                 [10, 22, 280, 250]]).T
idx = pd.date_range('20150131', end='20150203')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df
               A    B     C    D
 =================================
 2015-01-31    10   10    NaN  10
 2015-02-01    2    3     NaN  22 
 2015-02-02    10   60    NaN  280
 2015-02-03    10   100   NaN  250

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)
df
               A    B     C     D
 =================================
 2015-01-31    10   10    10    10
 2015-02-01    2    3     23    22 
 2015-02-02    10   60    290   280
 2015-02-03    10   100   3000  250

So basically we use a apply from pandas and the help of a global variable that keeps track of the previous calculated value.

Time comparison with a for loop:

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

df.loc['2015-01-31', 'C'] = df.loc['2015-01-31', 'D']

%%timeit
for i in df.loc['2015-02-01':].index.date:
    df.loc[i, 'C'] = df.loc[(i - pd.DateOffset(days=1)).date(), 'C'] * df.loc[i, 'A'] + df.loc[i, 'B']

3.2 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

data = np.random.random(size=(1000, 4))
idx = pd.date_range('20150131', end='20171026')
df = pd.DataFrame(data=data, columns=list('ABCD'), index=idx)
df.C = np.nan

def calculate(mul, add):
    global value
    value = value * mul + add
    return value

value = df.loc['2015-01-31', 'D']
df.loc['2015-01-31', 'C'] = value

%%timeit
df.loc['2015-02-01':, 'C'] = df.loc['2015-02-01':].apply(lambda row: calculate(*row[['A', 'B']]), axis=1)

1.82 s ± 64.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So 0.57 times faster on average.

Question 29

通常，避免显式循环的关键是在rowindex-1 == rowindex上联接（合并）数据框的2个实例。

然后，您将拥有一个包含r和r-1行的大数据框，可以在其中执行df.apply（）函数。

但是，创建大型数据集的开销可能抵消了并行处理的好处。

马丁

Question 30

In general, the key to avoiding an explicit loop would be to join (merge) 2 instances of the dataframe on rowindex-1==rowindex.

Then you would have a big dataframe containing rows of r and r-1, from where you could do a df.apply() function.

However the overhead of creating the large dataset may offset the benefits of parallel processing…

HTH Martin

Question 31

Are for loops really “bad”? If not, in what situation(s) would they be better than using a more conventional “vectorized” approach?¹

I am familiar with the concept of “vectorization”, and how pandas employs vectorized techniques to speed up computation. Vectorized functions broadcast operations over the entire series or DataFrame to achieve speedups much greater than conventionally iterating over the data.

However, I am quite surprised to see a lot of code (including from answers on Stack Overflow) offering solutions to problems that involve looping through data using for loops and list comprehensions. The documentation and API say that loops are “bad”, and that one should “never” iterate over arrays, series, or DataFrames. So, how come I sometimes see users suggesting loop-based solutions?

_{1 – While it is true that the question sounds somewhat broad, the truth is that there are very specific situations when for loops are usually better than conventionally iterating over data. This post aims to capture this for posterity.}

Question 32

TLDR; No, for loops are not blanket “bad”, at least, not always. It is probably more accurate to say that some vectorized operations are slower than iterating, versus saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:

When your data is small (…depending on what you’re doing),
When dealing with object/mixed dtypes
When using the str/regex accessor functions

Let’s examine these situations individually.

Iteration v/s Vectorization on Small Data

Pandas follows a “Convention Over Configuration” approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.

When a pandas function is called, the following things (among others) must internally be handled by the function, to ensure working

Index/axis alignment
Handling mixed datatypes
Handling missing data

Almost every function will have to deal with these to varying extents, and this presents an overhead. The overhead is less for numeric functions (for example, Series.add), while it is more pronounced for string functions (for example, Series.str.replace).

for loops, on the other hand, are faster then you think. What’s even better is list comprehensions (which create lists through for loops) are even faster as they are optimized iterative mechanisms for list creation.

List comprehensions follow the pattern

[f(x) for x in seq]

Where seq is a pandas series or DataFrame column. Or, when operating over multiple columns,

[f(x, y) for x, y in zip(seq1, seq2)]

Where seq1 and seq2 are columns.

Numeric Comparison
Consider a simple boolean indexing operation. The list comprehension method has been timed against Series.ne (!=) and query. Here are the functions:

# Boolean indexing with Numeric value comparison.
df[df.A != df.B]                            # vectorized !=
df.query('A != B')                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

For simplicity, I have used the perfplot package to run all the timeit tests in this post. The timings for the operations above are below:

The list comprehension outperforms query for moderately sized N, and even outperforms the vectorized not equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.

Note
It is worth mentioning that much of the benefit of list comprehension come from not having to worry about the index alignment, but this means that if your code is dependent on indexing alignment, this will break. In some cases, vectorised operations over the underlying NumPy arrays can be considered as bringing in the “best of both worlds”, allowing for vectorisation without all the unneeded overhead of the pandas functions. This means that you can rewrite the operation above as
df[df.A.values != df.B.values]
Which outperforms both the pandas and list comprehension equivalents:

NumPy vectorization is out of the scope of this post, but it is definitely worth considering, if performance matters.

Value Counts
Taking another example – this time, with another vanilla python construct that is faster than a for loop – collections.Counter. A common requirement is to compute the value counts and return the result as a dictionary. This is done with value_counts, np.unique, and Counter:

# Value Counts comparison.
ser.value_counts(sort=False).to_dict()           # value_counts
dict(zip(*np.unique(ser, return_counts=True)))   # np.unique
Counter(ser)                                     # Counter

The results are more pronounced, Counter wins out over both vectorized methods for a larger range of small N (~3500).

Note
More trivia (courtesy @user2357112). The Counter is implemented with a C accelerator, so while it still has to work with python objects instead of the underlying C datatypes, it is still faster than a for loop. Python power!

Of course, the take away from here is that the performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don’t give you the performance you need, there is always cython and numba. Let’s add this test into the mix.

from numba import njit, prange

@njit(parallel=True)
def get_mask(x, y):
    result = [False] * len(x)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]

    return np.array(result)

df[get_mask(df.A.values, df.B.values)] # numba

Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.

Operations with Mixed/`object` dtypes

String-based Comparison
Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.

# Boolean indexing with string value comparison.
df[df.A != df.B]                            # vectorized !=
df.query('A != B')                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

So, what changed? The thing to note here is that string operations are inherently difficult to vectorize. Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.

Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.

When it comes to operations on mutable/complex objects, there is no comparison. List comprehension outperforms all operations involving dicts and lists.

Accessing Dictionary Value(s) by Key
Here are timings for two operations that extract a value from a column of dictionaries: map and the list comprehension. The setup is in the Appendix, under the heading “Code Snippets”.

# Dictionary value extraction.
ser.map(operator.itemgetter('value'))     # map
pd.Series([x.get('value') for x in ser])  # list comprehension

Positional List Indexing
Timings for 3 operations that extract the 0th element from a list of columns (handling exceptions), map, str.get accessor method, and the list comprehension:

# List positional indexing. 
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan

ser.map(get_0th)                                          # map
ser.str[0]                                                # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser])  # list comp
pd.Series([get_0th(x) for x in ser])                      # list comp safe

Note
If the index matters, you would want to do:
pd.Series([...], index=ser.index)
When reconstructing the series.

List Flattening
A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure python is here.

# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True)  # stack
pd.Series(list(chain.from_iterable(ser.tolist())))         # itertools.chain
pd.Series([y for x in ser for y in x])                     # nested list comp

Both itertools.chain.from_iterable and the nested list comprehension are pure python constructs, and scale much better than the stack solution.

These timings are a strong indication of the fact that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.

Lastly, the applicability of these solutions depend widely on your data. So, the best thing to do would be to test these operations on your data before deciding what to go with. Notice how I have not timed apply on these solutions, because it would skew the graph (yes, it’s that slow).

Regex Operations, and `.str` Accessor Methods

Pandas can apply regex operations such as str.contains, str.extract, and str.extractall, as well as other “vectorized” string operations (such as str.split, str.find,str.translate`, and so on) on string columns. These functions are slower than list comprehensions, and are meant to be more convenience functions than anything else.

It is usually much faster to pre-compile a regex pattern and iterate over your data with re.compile (also see Is it worth using Python’s re.compile?). The list comp equivalent to str.contains looks something like this:

p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])

Or,

ser2 = ser[[bool(p.search(x)) for x in ser]]

If you need to handle NaNs, you can do something like

ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]

The list comp equivalent to str.extract (without groups) will look something like:

df['col2'] = [p.search(x).group(0) for x in df['col']]

If you need to handle no-matches and NaNs, you can use a custom function (still faster!):

def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

df['col2'] = [matcher(x) for x in df['col']]

The matcher function is very extensible. It can be fitted to return a list for each capture group, as needed. Just extract query the group or groups attribute of the matcher object.

For str.extractall, change p.search to p.findall.

String Extraction
Consider a simple filtering operation. The idea is to extract 4 digits if it is preceded by an upper case letter.

# Extracting strings.
p = re.compile(r'(?<=[A-Z])(\d{4})')
def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan

ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False)   #  str.extract
pd.Series([matcher(x) for x in ser])                  #  list comprehension

More Examples
Full disclosure – I am the author (in part or whole) of these posts listed below.

Conclusion

As shown from the examples above, iteration shines when working with small rows of DataFrames, mixed datatypes, and regular expressions.

The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payout is worth the effort.

The “vectorized” functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.

Another side note, certain string operations deal with constraints that favour the use of NumPy. Here are two examples where careful NumPy vectorization outperforms python:

Additionally, sometimes just operating on the underlying arrays via .values as opposed to on the Series or DataFrames can offer a healthy enough speedup for most usual scenarios (see the Note in the Numeric Comparison section above). So, for example df[df.A.values != df.B.values] would show instant performance boosts over df[df.A != df.B]. Using .values may not be appropriate in every situation, but it is a useful hack to know.

As mentioned above, it’s up to you to decide whether these solutions are worth the trouble of implementing.

Appendix: Code Snippets

import perfplot  
import operator 
import pandas as pd
import numpy as np
import re

from collections import Counter
from itertools import chain

# Boolean indexing with Numeric value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B']),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
        lambda df: df[get_mask(df.A.values, df.B.values)]
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp', 'numba'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N'
)

# Value Counts comparison.
perfplot.show(
    setup=lambda n: pd.Series(np.random.choice(1000, n)),
    kernels=[
        lambda ser: ser.value_counts(sort=False).to_dict(),
        lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
        lambda ser: Counter(ser),
    ],
    labels=['value_counts', 'np.unique', 'Counter'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=lambda x, y: dict(x) == dict(y)
)

# Boolean indexing with string value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B'], dtype=str),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query('A != B'),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=['vectorized !=', 'query (numexpr)', 'list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Dictionary value extraction.
ser1 = pd.Series([{'key': 'abc', 'value': 123}, {'key': 'xyz', 'value': 456}])
perfplot.show(
    setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(operator.itemgetter('value')),
        lambda ser: pd.Series([x.get('value') for x in ser]),
    ],
    labels=['map', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# List positional indexing. 
ser2 = pd.Series([['a', 'b', 'c'], [1, 2], []])        
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(get_0th),
        lambda ser: ser.str[0],
        lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
        lambda ser: pd.Series([get_0th(x) for x in ser]),
    ],
    labels=['map', 'str accessor', 'list comprehension', 'list comp safe'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

# Nested list flattening.
ser3 = pd.Series([['a', 'b', 'c'], ['d', 'e'], ['f', 'g']])
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
        lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
        lambda ser: pd.Series([y for x in ser for y in x]),
    ],
    labels=['stack', 'itertools.chain', 'nested list comp'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',    
    equality_check=None

)

# Extracting strings.
ser4 = pd.Series(['foo xyz', 'test A1234', 'D3345 xtz'])
perfplot.show(
    setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.str.extract(r'(?<=[A-Z])(\d{4})', expand=False),
        lambda ser: pd.Series([matcher(x) for x in ser])
    ],
    labels=['str.extract', 'list comprehension'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N',
    equality_check=None
)

Question 33

In short

for loop + iterrows is extremely slow. Overhead isn’t significant on ~1k rows, but noticeable on 10k+ rows.
for loop + itertuples is much faster than iterrows or apply.
vectorization is usually much faster than itertuples

Benchmark

Question 34

I have a python object with several attributes and methods. I want to iterate over object attributes.

class my_python_obj(object):
    attr1='a'
    attr2='b'
    attr3='c'

    def method1(self, etc, etc):
        #Statements

I want to generate a dictionary containing all of the objects attributes and their current values, but I want to do it in a dynamic way (so if later I add another attribute I don’t have to remember to update my function as well).

In php variables can be used as keys, but objects in python are unsuscriptable and if I use the dot notation for this it creates a new attribute with the name of my var, which is not my intent.

Just to make things clearer:

def to_dict(self):
    '''this is what I already have'''
    d={}
    d["attr1"]= self.attr1
    d["attr2"]= self.attr2
    d["attr3"]= self.attr3
    return d

·

def to_dict(self):
    '''this is what I want to do'''
    d={}
    for v in my_python_obj.attributes:
        d[v] = self.v
    return d

Update: With attributes I mean only the variables of this object, not the methods.

Question 35

Assuming you have a class such as

>>> class Cls(object):
...     foo = 1
...     bar = 'hello'
...     def func(self):
...         return 'call me'
...
>>> obj = Cls()

calling dir on the object gives you back all the attributes of that object, including python special attributes. Although some object attributes are callable, such as methods.

>>> dir(obj)
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'bar', 'foo', 'func']

You can always filter out the special methods by using a list comprehension.

>>> [a for a in dir(obj) if not a.startswith('__')]
['bar', 'foo', 'func']

or if you prefer map/filters.

>>> filter(lambda a: not a.startswith('__'), dir(obj))
['bar', 'foo', 'func']

If you want to filter out the methods, you can use the builtin callable as a check.

>>> [a for a in dir(obj) if not a.startswith('__') and not callable(getattr(obj, a))]
['bar', 'foo']

You could also inspect the difference between your class and its instance object using.

>>> set(dir(Cls)) - set(dir(object))
set(['__module__', 'bar', 'func', '__dict__', 'foo', '__weakref__'])

Question 36

in general put a __iter__ method in your class and iterate through the object attributes or put this mixin class in your class.

class IterMixin(object):
    def __iter__(self):
        for attr, value in self.__dict__.iteritems():
            yield attr, value

Your class:

>>> class YourClass(IterMixin): pass
...
>>> yc = YourClass()
>>> yc.one = range(15)
>>> yc.two = 'test'
>>> dict(yc)
{'one': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], 'two': 'test'}

Question 37

As mentioned in some of the answers/comments already, Python objects already store a dictionary of their attributes (methods aren’t included). This can be accessed as __dict__, but the better way is to use vars (the output is the same, though). Note that modifying this dictionary will modify the attributes on the instance! This can be useful, but also means you should be careful with how you use this dictionary. Here’s a quick example:

class A():
    def __init__(self, x=3, y=2, z=5):
        self.x = x
        self._y = y
        self.__z__ = z

    def f(self):
        pass

a = A()
print(vars(a))
# {'x': 3, '_y': 2, '__z__': 5}
# all of the attributes of `a` but no methods!

# note how the dictionary is always up-to-date
a.x = 10
print(vars(a))
# {'x': 10, '_y': 2, '__z__': 5}

# modifying the dictionary modifies the instance attribute
vars(a)["_y"] = 20
print(vars(a))
# {'x': 10, '_y': 20, '__z__': 5}

Using dir(a) is an odd, if not outright bad, approach to this problem. It’s good if you really needed to iterate over all attributes and methods of the class (including the special methods like __init__). However, this doesn’t seem to be what you want, and even the accepted answer goes about this poorly by applying some brittle filtering to try to remove methods and leave just the attributes; you can see how this would fail for the class A defined above.

(using __dict__ has been done in a couple of answers, but they all define unnecessary methods instead of using it directly. Only a comment suggests to use vars).

Question 38

Objects in python store their atributes (including functions) in a dict called __dict__. You can (but generally shouldn’t) use this to access the attributes directly. If you just want a list, you can also call dir(obj), which returns an iterable with all the attribute names, which you could then pass to getattr.

However, needing to do anything with the names of the variables is usually bad design. Why not keep them in a collection?

class Foo(object):
    def __init__(self, **values):
        self.special_values = values

You can then iterate over the keys with for key in obj.special_values:

Question 39

class someclass:
        x=1
        y=2
        z=3
        def __init__(self):
           self.current_idx = 0
           self.items = ["x","y","z"]
        def next(self):
            if self.current_idx < len(self.items):
                self.current_idx += 1
                k = self.items[self.current_idx-1]
                return (k,getattr(self,k))
            else:
                raise StopIteration
        def __iter__(self):
           return self

then just call it as an iterable

s=someclass()
for k,v in s:
    print k,"=",v

Question 40

The correct answer to this is that you shouldn’t. If you want this type of thing either just use a dict, or you’ll need to explicitly add attributes to some container. You can automate that by learning about decorators.

In particular, by the way, method1 in your example is just as good of an attribute.

Question 41

For python 3.6

class SomeClass:

    def attr_list(self, should_print=False):

        items = self.__dict__.items()
        if should_print:
            [print(f"attribute: {k}    value: {v}") for k, v in items]

        return items

Question 42

For all the pythonian zealots out there I’m sure Johan Cleeze would approve of your dogmatism ;). I’m leaving this answer keep demeriting it It actually makes me more confidant. Leave a comment you chickens!

For python 3.6

class SomeClass:

    def attr_list1(self, should_print=False):

        for k in self.__dict__.keys():
            v = self.__dict__.__getitem__(k)
            if should_print:
                print(f"attr: {k}    value: {v}")

    def attr_list(self, should_print=False):

        b = [(k, v) for k, v in self.__dict__.items()]
        if should_print:
            [print(f"attr: {a[0]}    value: {a[1]}") for a in b]
        return b

Question 43

Consider:

>>> lst = iter([1,2,3])
>>> next(lst)
1
>>> next(lst)
2

So, advancing the iterator is, as expected, handled by mutating that same object.

This being the case, I would expect:

a = iter(list(range(10)))
for i in a:
   print(i)
   next(a)

to skip every second element: the call to next should advance the iterator once, then the implicit call made by the loop should advance it a second time – and the result of this second call would be assigned to i.

It doesn’t. The loop prints all of the items in the list, without skipping any.

My first thought was that this might happen because the loop calls iter on what it is passed, and this might give an independent iterator – this isn’t the case, as we have iter(a) is a.

So, why does next not appear to advance the iterator in this case?

Question 44

What you see is the interpreter echoing back the return value of next() in addition to i being printed each iteration:

>>> a = iter(list(range(10)))
>>> for i in a:
...    print(i)
...    next(a)
... 
0
1
2
3
4
5
6
7
8
9

So 0 is the output of print(i), 1 the return value from next(), echoed by the interactive interpreter, etc. There are just 5 iterations, each iteration resulting in 2 lines being written to the terminal.

If you assign the output of next() things work as expected:

>>> a = iter(list(range(10)))
>>> for i in a:
...    print(i)
...    _ = next(a)
... 
0
2
4
6
8

or print extra information to differentiate the print() output from the interactive interpreter echo:

>>> a = iter(list(range(10)))
>>> for i in a:
...    print('Printing: {}'.format(i))
...    next(a)
... 
Printing: 0
1
Printing: 2
3
Printing: 4
5
Printing: 6
7
Printing: 8
9

In other words, next() is working as expected, but because it returns the next value from the iterator, echoed by the interactive interpreter, you are led to believe that the loop has its own iterator copy somehow.

Question 45

What is happening is that next(a) returns the next value of a, which is printed to the console because it is not affected.

What you can do is affect a variable with this value:

>>> a = iter(list(range(10)))
>>> for i in a:
...    print(i)
...    b=next(a)
...
0
2
4
6
8

Question 46

I find the existing answers a little confusing, because they only indirectly indicate the essential mystifying thing in the code example: both* the “print i” and the “next(a)” are causing their results to be printed.

Since they’re printing alternating elements of the original sequence, and it’s unexpected that the “next(a)” statement is printing, it appears as if the “print i” statement is printing all the values.

In that light, it becomes more clear that assigning the result of “next(a)” to a variable inhibits the printing of its’ result, so that just the alternate values that the “i” loop variable are printed. Similarly, making the “print” statement emit something more distinctive disambiguates it, as well.

(One of the existing answers refutes the others because that answer is having the example code evaluated as a block, so that the interpreter is not reporting the intermediate values for “next(a)”.)

The beguiling thing in answering questions, in general, is being explicit about what is obvious once you know the answer. It can be elusive. Likewise critiquing answers once you understand them. It’s interesting…

Question 47

Something is wrong with your Python/Computer.

a = iter(list(range(10)))
for i in a:
   print(i)
   next(a)

>>> 
0
2
4
6
8

Works like expected.

Tested in Python 2.7 and in Python 3+ . Works properly in both

Question 48

For those who still do not understand.

>>> a = iter(list(range(10)))
>>> for i in a:
...    print(i)
...    next(a)
... 
0 # print(i) printed this
1 # next(a) printed this
2 # print(i) printed this
3 # next(a) printed this
4 # print(i) printed this
5 # next(a) printed this
6 # print(i) printed this
7 # next(a) printed this
8 # print(i) printed this
9 # next(a) printed this

As others have already said, next increases the iterator by 1 as expected. Assigning its returned value to a variable doesn’t magically changes its behaviour.

Question 49

It behaves the way you want if called as a function:

>>> def test():
...     a = iter(list(range(10)))
...     for i in a:
...         print(i)
...         next(a)
... 
>>> test()
0
2
4
6
8

Question 50

I have a generator that generates a series, for example:

def triangle_nums():
    '''Generates a series of triangle numbers'''
    tn = 0
    counter = 1
    while True:
        tn += counter
        yield tn
        counter += + 1

In Python 2 I am able to make the following calls:

g = triangle_nums()  # get the generator
g.next()             # get the next value

however in Python 3 if I execute the same two lines of code I get the following error:

AttributeError: 'generator' object has no attribute 'next'

but, the loop iterator syntax does work in Python 3

for n in triangle_nums():
    if not exit_cond:
       do_something()...

I haven’t been able to find anything yet that explains this difference in behavior for Python 3.

Question 51

g.next() has been renamed to g.__next__(). The reason for this is consistency: special methods like __init__() and __del__() all have double underscores (or “dunder” in the current vernacular), and .next() was one of the few exceptions to that rule. This was fixed in Python 3.0. [*]

But instead of calling g.__next__(), use next(g).

[*] There are other special attributes that have gotten this fix; func_name, is now __name__, etc.

Question 52

Try:

next(g)

Check out this neat table that shows the differences in syntax between 2 and 3 when it comes to this.

Question 53

If your code must run under Python2 and Python3, use the 2to3 six library like this:

import six

six.next(g)  # on PY2K: 'g.next()' and onPY3K: 'next(g)'

问题：Python使用枚举内部列表理解

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

问题：在Python中，“虽然不是EOF”的完美替代品是什么？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

问题：当在apply中也计算出先前值时，Pandas中有没有一种方法可以使用dataframe.apply中的先前行值？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

问题：熊猫中的for循环真的不好吗？我什么时候应该在意？

回答 0

小数据上的迭代v / s矢量化

混合/ objectdtype操作

正则表达式操作和访问器.str方法

结论

附录：代码段

Iteration v/s Vectorization on Small Data

Operations with Mixed/object dtypes

Regex Operations, and .str Accessor Methods

Conclusion

Appendix: Code Snippets

回答 1

问题：遍历python中的对象属性

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

问题：Python列表迭代器行为和next（iterator）

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

问题：在Python 3中是否可以看到generator.next（）？

回答 0

回答 1

回答 2

问题：在Python中遍历一系列日期

笔记

样本输出

Notes

Sample Output

回答 0

与John Machin讨论后更新：

Update after discussion with John Machin:

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

回答 12

回答 13

回答 14

回答 15

回答 16

混合/ `object`dtype操作

正则表达式操作和访问器`.str`方法

Operations with Mixed/`object` dtypes

Regex Operations, and `.str` Accessor Methods