This question was asked seven years ago, but with the latest versions I am using (numpy 1.13 and Python 3) the same approach still works for adding a row to a matrix. Remember to wrap the second argument in double brackets; otherwise it raises a dimension error.
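The answer's code is not shown in this excerpt; a minimal sketch of what it likely refers to, assuming np.append along axis 0:

import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])
# Double brackets make the new row 2-D (shape (1, 3)); passing [7, 8, 9]
# directly raises a dimension mismatch error.
A = np.append(A, [[7, 8, 9]], axis=0)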
If no calculations are necessary after every row, it’s much quicker to add rows in python, then convert to numpy. Here are timing tests using python 3.6 vs. numpy 1.14, adding 100 rows, one at a time:
import numpy as np
from time import perf_counter, sleep

def time_it():
    # Compare performance of two methods for adding rows to a numpy array
    py_array = [[0, 1, 2], [0, 2, 0]]
    py_row = [4, 5, 6]

    numpy_array = np.array(py_array)
    numpy_row = np.array([4, 5, 6])

    n_loops = 100

    start_clock = perf_counter()
    for count in range(0, n_loops):
        numpy_array = np.vstack([numpy_array, numpy_row])  # 5.8 micros
    duration = perf_counter() - start_clock
    print('numpy 1.14 takes {:.3f} micros per row'.format(duration * 1e6 / n_loops))

    start_clock = perf_counter()
    for count in range(0, n_loops):
        py_array.append(py_row)  # .15 micros
    numpy_array = np.array(py_array)  # 43.9 micros, once after the loop
    duration = perf_counter() - start_clock
    print('python 3.6 takes {:.3f} micros per row'.format(duration * 1e6 / n_loops))

    sleep(15)
time_it() prints:
numpy 1.14 takes 5.971 micros per row
python 3.6 takes 0.694 micros per row
So, the simple solution to the original question, from seven years ago, is to use vstack() to add a new row after converting the row to a numpy array. But a more realistic solution should consider vstack’s poor performance under those circumstances. If you don’t need to run data analysis on the array after every addition, it is better to buffer the new rows to a python list of rows (a list of lists, really), and add them as a group to the numpy array using vstack() before doing any data analysis.
If you can do the construction in a single operation, then something like the vstack-with-fancy-indexing answer is a fine approach. But if your condition is more complicated or your rows come in on the fly, you may want to grow the array. In fact the numpythonic way to do something like this – dynamically grow an array – is to dynamically grow a list:
A = np.array([[1, 2, 3], [4, 5, 6]])
Alist = [r for r in A]
for i in range(100):
    newrow = np.arange(3) + i
    if i % 5:
        Alist.append(newrow)
A = np.array(Alist)
del Alist
Lists are highly optimized for this kind of access pattern; you don’t have convenient numpy multidimensional indexing while in list form, but for as long as you’re appending it’s hard to do better than a list of row arrays.
Answer 7
I use the faster np.vstack, for example:
import numpy as np
input_array=np.array([1,2,3])
new_row= np.array([4,5,6])
new_array=np.vstack([input_array, new_row])
Now I want to iterate over the rows of this frame. For every row I want to be able to access its elements (values in cells) by the name of the columns. For example:
for row in df.rows:
    print(row['c1'], row['c2'])
Is it possible to do that in pandas?
I found this similar question. But it does not give me the answer I need. For example, it is suggested there to use:
for date, row in df.T.iteritems():
or
for row in df.iterrows():
But I do not understand what the row object is and how I can work with it.
How to iterate over rows in a DataFrame in Pandas?
Answer: DON’T*!
Iteration in pandas is an anti-pattern, and is something you should only do when you have exhausted every other option. You should not use any function with “iter” in its name for more than a few thousand rows or you will have to get used to a lot of waiting.
iterrows and itertuples (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.
Appeal to Authority
The docs page on iteration has a huge red warning box that says:
Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed […].
* It’s actually a little more complicated than “don’t”. df.iterrows() is the correct answer to this question, but “vectorize your ops” is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you’re not sure whether you need an iterative solution, you probably don’t. PS: To know more about my rationale for writing this answer, skip to the very bottom.
A good number of basic operations and computations are “vectorised” by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.
If none exists, feel free to write your own using custom cython extensions.
List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you’re trying to perform an elementwise transformation on your data. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common pandas tasks.
The formula is simple,
# iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df['col']]
# iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]
# iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[['col1', ...,'coln']].to_numpy()]
# iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df['col1'], ..., df['coln'])]
If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw python.
Caveats
List comprehensions assume that your data is easy to work with – what that means is your data types are consistent and you don’t have NaNs, but this cannot always be guaranteed.
The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
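For instance, a hedged sketch of explicit NaN handling inside a list comprehension (f and 'col' are placeholders, as above):

result = [f(x) if pd.notna(x) else None for x in df['col']]  # handle NaNs explicitly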
When dealing with mixed data types you should iterate over zip(df['A'], df['B'], ...) instead of df[['A', 'B']].to_numpy(), as the latter implicitly upcasts data to the most common type. For example, if A is numeric and B is string, to_numpy() will cast the entire array to object, which may not be what you want. Fortunately, zipping your columns together is the most straightforward workaround to this.
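A quick demonstration of the upcast with two numeric columns:

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [0.5, 1.5]})
print(df[['A', 'B']].to_numpy().dtype)  # float64 -- the integer column was upcast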
* YMMV for the reasons outlined in the Caveats section above.
An Obvious Example
Let’s demonstrate the difference with a simple example of adding two pandas columns A + B. This is a vectorizable operation, so it will be easy to contrast the performance of the methods discussed above.
I should mention, however, that it isn’t always this cut and dry. Sometimes the answer to “what is the best method for an operation” is “it depends on your data”. My advice is to test out different approaches on your data before settling on one.
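As a rough sketch of such a comparison (the random data here is illustrative; exact timings depend on your machine and data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.randint(0, 100, 10000),
                   'B': np.random.randint(0, 100, 10000)})

result_vec = df['A'] + df['B']                                    # vectorized: runs in compiled code
result_lc = pd.Series([a + b for a, b in zip(df['A'], df['B'])])  # list comprehension: raw-python speed
result_iter = pd.Series([row['A'] + row['B']
                         for _, row in df.iterrows()])            # iterrows: builds a Series per row, slowest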
When should I ever want to use pandas apply() in my code? – apply is slow (but not as slow as the iter* family. There are, however, situations where one can (or should) consider apply as a serious alternative, especially in some GroupBy operations).
* Pandas string methods are “vectorized” in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.
Why I Wrote this Answer
A common trend I notice from new users is to ask questions of the form “how can I iterate over my df to do X?”, showing code that calls iterrows() inside a for loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning whether iteration is the right thing to do.
The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I’m not trying to start a war of iteration vs vectorization, but I want new users to be informed when developing solutions to their problems with this library.
for row in df.itertuples(index=True, name='Pandas'):
    print(row.c1, row.c2)
itertuples() is supposed to be faster than iterrows().
But be aware, according to the docs (pandas 0.24.2 at the moment):
iterrows: dtype might not match from row to row
Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally much faster than iterrows()
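A small illustration of the dtype point (assuming one int and one float column):

import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [1.5]})
_, row = next(df.iterrows())
print(row['a'])  # 1.0 -- the int was upcast to float inside the row Series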
iterrows: Do not modify rows
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
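For instance, a small illustration of the renaming (the column names here are made up):

import pandas as pd

df = pd.DataFrame({'My Col': [1], 'b': [2]})
for row in df.itertuples():
    print(row)  # Pandas(Index=0, _1=1, b=2) -- 'My Col' is not a valid
                # Python identifier, so it is replaced by positional _1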
You should use df.iterrows(), though iterating row-by-row is not especially efficient, since Series objects have to be created.
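A minimal sketch using the question’s column names:

for index, row in df.iterrows():
    print(row['c1'], row['c2'])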
Answer 4
While iterrows() is a good option, sometimes itertuples() can be much faster:
import pandas as pd
from numpy.random import randn, randint

df = pd.DataFrame({'a': randn(1000), 'b': randn(1000),
                   'N': randint(100, 1000, (1000)), 'x': 'x'})

%timeit [row.a * 2 for idx, row in df.iterrows()]
# => 10 loops, best of 3: 50.3 ms per loop

%timeit [row[1] * 2 for row in df.itertuples()]
# => 1000 loops, best of 3: 541 µs per loop
I was looking for how to iterate over rows AND columns and ended up here, so:
for i, row in df.iterrows():
    for j, column in row.items():  # Series.iteritems() was removed in pandas 2.0
        print(column)
Answer 8
You can write your own iterator that implements namedtuple
from collections import namedtuple

def myiter(d, cols=None):
    if cols is None:
        v = d.values.tolist()
        cols = d.columns.values.tolist()
    else:
        j = [d.columns.get_loc(c) for c in cols]
        v = d.values[:, j].tolist()

    n = namedtuple('MyTuple', cols)

    for line in iter(v):
        yield n(*line)
This is directly comparable to pd.DataFrame.itertuples. I’m aiming at performing the same task with more efficiency.
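A hypothetical usage sketch (the DataFrame and column names are illustrative):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
for row in myiter(df, cols=['a', 'b']):
    print(row.a, row.b)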
If you really have to iterate a pandas dataframe, you will probably want to avoid using iterrows(). There are different methods and the usual iterrows() is far from being the best. itertuples() can be 100 times faster.
In short:
As a general rule, use df.itertuples(name=None), in particular when you have a fixed number of columns and fewer than 255 columns. See point (3).
Otherwise, use df.itertuples(), except if your columns have special characters such as spaces or ‘-‘. See point (2).
It is possible to use itertuples() even if your dataframe has strange columns, by using the last example. See point (4).
Only use iterrows() if you cannot use the previous solutions. See point (1).
Different methods to iterate over rows in a pandas dataframe:
Generate a random dataframe with a million rows and 4 columns:
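The generation code is missing from this excerpt; a plausible reconstruction, consistent with the output shown below (integer values 0–99, columns A through D, one million rows), would be (time is imported here because the timing blocks that follow use it):

import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)),
                  columns=list('ABCD'))
print(df)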
1) The usual iterrows() is convenient but damn slow:
start_time = time.perf_counter()  # time.clock() in the original; it was removed in Python 3.8
result = 0
for _, row in df.iterrows():
    result += max(row['B'], row['C'])
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))
2) The default itertuples() is already much faster, but it doesn’t work with column names such as My Col-Name is very Strange (you should avoid this method if your columns are repeated or if a column name cannot simply be converted to a Python variable name):
start_time = time.perf_counter()
result = 0
for row in df.itertuples(index=False):
    result += max(row.B, row.C)
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
3) The default itertuples() using name=None is even faster but not really convenient as you have to define a variable per column.
start_time = time.perf_counter()
result = 0
for (_, col1, col2, col3, col4) in df.itertuples(name=None):
    result += max(col2, col3)
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
4) Finally, the named itertuples() is slower than the previous point but you do not have to define a variable per column and it works with column names such as My Col-Name is very Strange.
start_time = time.perf_counter()
result = 0
for row in df.itertuples(index=False):
    result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])
total_elapsed_time = round(time.perf_counter() - start_time, 2)
print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))
Output:
A B C D
0 41 63 42 23
1 54 9 24 65
2 15 34 10 9
3 39 94 82 97
4 4 88 79 54
... .. .. .. ..
999995 48 27 4 25
999996 16 51 34 28
999997 1 39 61 14
999998 66 51 27 70
999999 51 53 47 99
[1000000 rows x 4 columns]
1. Iterrows done in 104.96 seconds, result = 66151519
2. Named Itertuples done in 1.26 seconds, result = 66151519
3. Itertuples done in 0.94 seconds, result = 66151519
4. Polyvalent Itertuples working even with special characters in the column name done in 2.94 seconds, result = 66151519
for ind in df.index:
    print(df['c1'][ind], df['c2'][ind])
Answer 12
Sometimes a useful pattern is:
# Borrowing @KutalmisB df example
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])

# The to_dict call results in a list of dicts,
# where each row_dict is a dictionary with k:v pairs of column:value for that row
for row_dict in df.to_dict(orient='records'):
    print(row_dict)
There is a way to iterate through rows while getting a DataFrame in return, not a Series. I don’t see anyone mentioning that you can pass the index as a list for the row to be returned as a DataFrame:
for i in range(len(df)):
    row = df.iloc[[i]]
Note the usage of double brackets. This returns a DataFrame with a single row.
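A quick check of the difference:

print(type(df.iloc[0]))    # <class 'pandas.core.series.Series'>
print(type(df.iloc[[0]]))  # <class 'pandas.core.frame.DataFrame'>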
For both viewing and modifying values, I would use iterrows(). In a for loop and by using tuple unpacking (see the example: i, row), I use the row for only viewing the value and use i with the loc method when I want to modify values. As stated in previous answers, here you should not modify something you are iterating over.
for i, row in df.iterrows():
    df_column_A = df.loc[i, 'A']
    if df_column_A == 'Old_Value':
        df.loc[i, 'A'] = 'New_value'  # assign back through the DataFrame;
                                      # rebinding df_column_A would not modify df
Here the row in the loop is a copy of that row, and not a view of it. Therefore, you should NOT write something like row['A'] = 'New_Value', it will not modify the DataFrame. However, you can use i and loc and specify the DataFrame to do the work.
I know I’m late to the answering party, but I just wanted to add to @cs95’s answer above, which I believe should be the accepted answer. In his answer, he shows that pandas vectorization far outperforms other pandas methods for computing stuff with dataframes.
I wanted to add that if you first convert the dataframe to a numpy array and then use vectorization, it’s even faster than pandas dataframe vectorization, (and that includes the time to turn it back into a dataframe series).
If you add the following functions to @cs95’s benchmark code, this becomes pretty evident:
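The functions themselves are not reproduced in this excerpt. A minimal sketch of the idea, reusing the A + B example from earlier (the function name is illustrative, not from the original benchmark):

import numpy as np
import pandas as pd

def np_vectorization(df):
    # Drop down to raw numpy arrays, add them, then wrap the result back
    # into a Series, so the conversion cost is included in the comparison.
    return pd.Series(df['A'].to_numpy() + df['B'].to_numpy(), index=df.index)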
You can also do numpy indexing for even greater speed ups. It’s not really iterating but works much better than iteration for certain applications.
subset = row['c1'][0:5]
all = row['c1'][:]
You may also want to cast it to an array. These indexes/selections are supposed to act like NumPy arrays already, but I ran into issues and needed to cast:
np.asarray(all)
imgs[:] = cv2.resize(imgs[:], (224,224) ) #resize every image in an hdf5 file
Answer 18
There are many ways to iterate over the rows in a pandas dataframe. One very simple and intuitive way is:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
print(df)
for i in range(df.shape[0]):
    # For printing the second column
    print(df.iloc[i, 1])
    # For printing more than one column
    print(df.iloc[i, [0, 2]])
Answer 19
This example uses iloc to isolate each value in the data frame.
import pandas as pd

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
mjr = pd.DataFrame({'a': a, 'b': b})
size = mjr.shape

for i in range(size[0]):
    for j in range(size[1]):
        print(mjr.iloc[i, j])
Some libraries (e.g. a Java interop library that I use) require values to be passed in a row at a time, for example, if streaming data. To replicate the streaming nature, I ‘stream’ my dataframe values one by one; I wrote the class below, which comes in handy from time to time.
from typing import Dict, List

class DataFrameReader:
    def __init__(self, df):
        self._df = df
        self._row = None
        self._columns = df.columns.tolist()
        self.reset()
        self.row_index = 0

    def __getattr__(self, key):
        return self.__getitem__(key)

    def read(self) -> bool:
        self._row = next(self._iterator, None)
        self.row_index += 1
        return self._row is not None

    def columns(self):
        return self._columns

    def reset(self) -> None:
        self._iterator = self._df.itertuples()

    def get_index(self):
        return self._row[0]

    def index(self):
        return self._row[0]

    def to_dict(self, columns: List[str] = None):
        return self.row(columns=columns)

    def tolist(self, cols) -> List[object]:
        return [self.__getitem__(c) for c in cols]

    def row(self, columns: List[str] = None) -> Dict[str, object]:
        cols = set(self._columns if columns is None else columns)
        return {c: self.__getitem__(c) for c in self._columns if c in cols}

    def __getitem__(self, key) -> object:
        # the df index of the row is at index 0
        try:
            if type(key) is list:
                ix = [self._columns.index(k) + 1 for k in key]  # index each k, not the list itself
            else:
                ix = self._columns.index(key) + 1
            return self._row[ix]
        except BaseException:
            return None

    def __next__(self) -> 'DataFrameReader':
        if self.read():
            return self
        else:
            raise StopIteration

    def __iter__(self) -> 'DataFrameReader':
        return self
Which can be used:
for row in DataFrameReader(df):
    print(row.my_column_name)
    print(row.to_dict())
    print(row['my_column_name'])
    print(row.tolist())
It preserves the value/name mapping for the rows being iterated. Obviously, it is a lot slower than using apply and Cython as indicated above, but it is necessary in some circumstances.