Updating a pandas DataFrame while iterating row by row

Question: Updating a pandas DataFrame while iterating row by row

I have a pandas data frame that looks like this (it's a pretty big one):

           date      exer exp     ifor         mat  
1092  2014-03-17  American   M  528.205  2014-04-19 
1093  2014-03-17  American   M  528.205  2014-04-19 
1094  2014-03-17  American   M  528.205  2014-04-19 
1095  2014-03-17  American   M  528.205  2014-04-19    
1096  2014-03-17  American   M  528.205  2014-05-17 

Now I would like to iterate row by row. As I go through each row, the value of ifor in that row can change depending on some conditions, and to evaluate them I need to look up another dataframe.

How do I update the dataframe as I iterate? I tried a few things, but none of them worked.

for i, row in df.iterrows():
    if <something>:
        row['ifor'] = x
    else:
        row['ifor'] = y

    df.ix[i]['ifor'] = x

None of these approaches seem to work. I don’t see the values updated in the dataframe.


Answer 0

You can assign values in the loop using df.set_value:

for i, row in df.iterrows():
    ifor_val = something
    if <condition>:
        ifor_val = something_else
    df.set_value(i,'ifor',ifor_val)

If you don’t need the row values you could simply iterate over the indices of df, but I kept the original for-loop in case you need the row value for something not shown here.

Update

df.set_value() has been deprecated since version 0.21.0 (and was removed in pandas 1.0); use the df.at[] accessor instead:

for i, row in df.iterrows():
    ifor_val = something
    if <condition>:
        ifor_val = something_else
    df.at[i,'ifor'] = ifor_val
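
As a concrete illustration, here is a minimal runnable sketch of the df.at loop. The sample data, the `rates` lookup frame, and the condition are hypothetical stand-ins for the placeholders above, not part of the original question.

```python
import pandas as pd

# Hypothetical sample data modelled on the question's frame.
df = pd.DataFrame(
    {
        "exer": ["American", "American", "European"],
        "ifor": [528.205, 528.205, 528.205],
    },
    index=[1092, 1093, 1094],
)

# Hypothetical lookup frame: a replacement ifor value per exercise style.
rates = pd.DataFrame({"ifor": [530.0]}, index=["European"])

for i, row in df.iterrows():
    # Keep the current value unless the lookup frame has an entry for this row.
    ifor_val = row["ifor"]
    if row["exer"] in rates.index:
        ifor_val = rates.at[row["exer"], "ifor"]
    df.at[i, "ifor"] = ifor_val  # writes into df itself, not into the row copy

print(df)
```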

Answer 1

A pandas DataFrame object should be thought of as a Series of Series. In other words, you should think of it in terms of columns. This matters because when you use pd.DataFrame.iterrows you are iterating through rows as Series. But these are not the Series that the data frame is storing, so they are new Series that are created for you while you iterate. That implies that when you attempt to assign to them, those edits won't end up reflected in the original data frame.
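
A small sketch can make this visible. The two-row frame below is made up for illustration; it has mixed column types, so each row handed out by iterrows is a freshly built Series.

```python
import pandas as pd

df = pd.DataFrame({"exp": ["M", "M"], "ifor": [528.205, 528.205]})

for i, row in df.iterrows():
    row["ifor"] = 0.0  # modifies the temporary row Series only

after_iterrows = df["ifor"].tolist()  # the frame itself is unchanged

for i in df.index:
    df.at[i, "ifor"] = 0.0  # .at writes into df directly

after_at = df["ifor"].tolist()

print(after_iterrows)
print(after_at)
```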

Ok, now that that is out of the way: What do we do?

Suggestions prior to this post include:

  1. pd.DataFrame.set_value is deprecated as of Pandas version 0.21
  2. pd.DataFrame.ix is deprecated
  3. pd.DataFrame.loc is fine, but it is built to handle array indexers as well, and for a single cell you can do better

My recommendation
Use pd.DataFrame.at

for i in df.index:
    if <something>:
        df.at[i, 'ifor'] = x
    else:
        df.at[i, 'ifor'] = y

You can even change this to:

for i in df.index:
    df.at[i, 'ifor'] = x if <something> else y

Response to comment

And what if I need to use the value of the previous row for the if condition?

for i in range(1, len(df) + 1):
    j = df.columns.get_loc('ifor')
    if <something>:
        df.iat[i - 1, j] = x
    else:
        df.iat[i - 1, j] = y
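
For a runnable variant that actually reads the previous row, here is a sketch under assumed data and an assumed rule (never let a value fall below the one in the row before it). Both the sample values and the rule are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"ifor": [1.0, 5.0, 2.0, 7.0]})
j = df.columns.get_loc("ifor")  # integer position of the column, computed once

# Start at position 1 so that position i - 1 always exists.
for i in range(1, len(df)):
    prev = df.iat[i - 1, j]
    # Hypothetical rule: raise the value to the previous row's value if lower.
    if df.iat[i, j] < prev:
        df.iat[i, j] = prev

print(df["ifor"].tolist())
```

Because iat works on integer positions, this does not depend on what the index labels look like.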

Answer 2

One method you can use is itertuples(). It iterates over DataFrame rows as namedtuples, with the index value as the first element of the tuple, and it is much faster than iterrows(). With itertuples(), each row carries its Index in the DataFrame, and you can use at (or loc) to set the value.

for row in df.itertuples():
    if <something>:
        df.at[row.Index, 'ifor'] = x
    else:
        df.at[row.Index, 'ifor'] = y

    # Slower alternative for the same assignment:
    # df.loc[row.Index, 'ifor'] = x

In most cases, itertuples() is faster than iat or at.

Thanks @SantiStSupery, using .at is much faster than loc.
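
Here is a self-contained sketch of the itertuples() pattern. The sample frame, the condition on `exp`, and the values 1.0 and 2.0 are hypothetical placeholders for `<something>`, x and y.

```python
import pandas as pd

df = pd.DataFrame(
    {"exp": ["M", "W", "M"], "ifor": [528.205, 528.205, 528.205]},
    index=[1092, 1093, 1094],
)

for row in df.itertuples():
    # row.Index is the index label; columns are attributes (row.exp, row.ifor).
    df.at[row.Index, "ifor"] = 1.0 if row.exp == "M" else 2.0

print(df)
```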


Answer 3

You should assign the value with df.ix[i, 'exp']=X or df.loc[i, 'exp']=X instead of df.ix[i]['ifor'] = x. (df.ix has since been deprecated and removed from pandas, so prefer df.loc.)

Otherwise you are assigning to a temporary object produced by chained indexing, and should get a warning:

-c:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_index,col_indexer] = value instead

But certainly, the loop would probably be better replaced by some vectorized algorithm that makes full use of the DataFrame, as @Phillip Cloud suggested.
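
As a rough sketch of what a vectorized replacement could look like, assuming the condition can be written as a boolean expression over whole columns (the data, condition, and values below are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"exp": ["M", "W", "M"], "ifor": [528.205, 528.205, 528.205]})

# Choose between two values for every row at once.
condition = df["exp"] == "M"
df["ifor"] = np.where(condition, 1.0, 2.0)

# Or change only the matching rows and leave the others untouched.
df.loc[df["exp"] == "W", "ifor"] = 528.205

print(df)
```

Whether this applies depends on the real condition; logic that depends on earlier iterations may still need a loop.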


Answer 4

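Well, if you are going to iterate anyhow, why not use the simplest method of all, df['Column'].values[i]? (This relies on .values returning a writable view of the column, which pandas does not guarantee in every version.)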

df['Column'] = ''

for i in range(len(df)):
    df['Column'].values[i] = something/update/new_value

Or, if you want to compare the new values with the old ones or anything like that, why not store them in a list and assign the list to the column at the end?

mylist, df['Column'] = [], ''

for <condition>:
    mylist.append(something/update/new_value)

df['Column'] = mylist
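
A runnable sketch of the list approach, using a made-up frame and a made-up rule (cap each value at 500):

```python
import pandas as pd

df = pd.DataFrame({"ifor": [528.205, 100.0, 528.205]})

new_values = []
for old in df["ifor"]:
    # Hypothetical rule: cap each value at 500.
    new_values.append(min(old, 500.0))

# A single assignment at the end; the list length must equal len(df).
df["Column"] = new_values

print(df)
```

The old values stay available in df["ifor"], so old and new can be compared side by side.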

Answer 5

for i, row in df.iterrows():
    if <something>:
        df.at[i, 'ifor'] = x
    else:
        df.at[i, 'ifor'] = y

Answer 6

It's better to use a lambda function with df.apply():

df["ifor"] = df.apply(lambda x: {value} if {condition} else x["ifor"], axis=1)
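
Filled in with a hypothetical condition and value, the template above could look like this sketch (the sample frame is made up):

```python
import pandas as pd

df = pd.DataFrame({"exp": ["M", "W"], "ifor": [528.205, 528.205]})

# Replace ifor with 0.0 where exp is "W"; keep the existing value otherwise.
df["ifor"] = df.apply(
    lambda row: 0.0 if row["exp"] == "W" else row["ifor"],
    axis=1,
)

print(df)
```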

Answer 7

Increment the MAX number from a column. For example:

df1 = [sort_ID, Column1,Column2]
print(df1)

My output:

Sort_ID Column1 Column2
12         a    e
45         b    f
65         c    g
78         d    h

MAX = df1['Sort_ID'].max() #This returns my Max Number 

Now, I need to create a column in df2 and fill it with values that increment from MAX.

Sort_ID Column1 Column2
79      a1       e1
80      b1       f1
81      c1       g1
82      d1       h1

Note: df2 will initially contain only Column1 and Column2. We need the Sort_ID column to be created, with values that increment from the MAX of df1.
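
One possible way to do this without a loop is sketched below. The frames are rebuilt from the tables above, and `max_id` is just an illustrative variable name.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"Sort_ID": [12, 45, 65, 78], "Column1": list("abcd"), "Column2": list("efgh")}
)
df2 = pd.DataFrame(
    {"Column1": ["a1", "b1", "c1", "d1"], "Column2": ["e1", "f1", "g1", "h1"]}
)

max_id = int(df1["Sort_ID"].max())  # largest existing Sort_ID in df1

# Insert Sort_ID as the first column, numbering from max_id + 1 onwards.
df2.insert(0, "Sort_ID", list(range(max_id + 1, max_id + 1 + len(df2))))

print(df2)
```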