如何从熊猫数据框中删除行列表?

问题:如何从熊猫数据框中删除行列表?

我有一个数据框df:

>>> df
                  sales  discount  net_sales    cogs
STK_ID RPT_Date                                     
600141 20060331   2.709       NaN      2.709   2.245
       20060630   6.590       NaN      6.590   5.291
       20060930  10.103       NaN     10.103   7.981
       20061231  15.915       NaN     15.915  12.686
       20070331   3.196       NaN      3.196   2.710
       20070630   7.907       NaN      7.907   6.459

然后,我想删除具有列表中指示的某些序列号的行,假设此时留在这里[1,2,4],

                  sales  discount  net_sales    cogs
STK_ID RPT_Date                                     
600141 20060331   2.709       NaN      2.709   2.245
       20061231  15.915       NaN     15.915  12.686
       20070630   7.907       NaN      7.907   6.459

如何或什么功能可以做到这一点?

I have a dataframe df :

>>> df
                  sales  discount  net_sales    cogs
STK_ID RPT_Date                                     
600141 20060331   2.709       NaN      2.709   2.245
       20060630   6.590       NaN      6.590   5.291
       20060930  10.103       NaN     10.103   7.981
       20061231  15.915       NaN     15.915  12.686
       20070331   3.196       NaN      3.196   2.710
       20070630   7.907       NaN      7.907   6.459

Then I want to drop rows with certain sequence numbers which indicated in a list, suppose here is [1,2,4], then left:

                  sales  discount  net_sales    cogs
STK_ID RPT_Date                                     
600141 20060331   2.709       NaN      2.709   2.245
       20061231  15.915       NaN     15.915  12.686
       20070630   7.907       NaN      7.907   6.459

How or what function can do that ?


回答 0

使用DataFrame.drop并将其传递给一系列索引标签:

In [65]: df
Out[65]: 
       one  two
one      1    4
two      2    3
three    3    2
four     4    1


In [66]: df.drop(df.index[[1,3]])
Out[66]: 
       one  two
one      1    4
three    3    2

Use DataFrame.drop and pass it a Series of index labels:

In [65]: df
Out[65]: 
       one  two
one      1    4
two      2    3
three    3    2
four     4    1


In [66]: df.drop(df.index[[1,3]])
Out[66]: 
       one  two
one      1    4
three    3    2

回答 1

请注意,当您要插入时,使用“ inplace”命令可能很重要。

df.drop(df.index[[1,3]], inplace=True)

因为您的原始问题没有返回任何内容,所以应使用此命令。 http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.drop.html

Note that it may be important to use the “inplace” command when you want to do the drop in line.

df.drop(df.index[[1,3]], inplace=True)

Because your original question is not returning anything, this command should be used. http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.drop.html


回答 2

如果DataFrame很大,并且要删除的行数也很大,那么按索引df.drop(df.index[])进行简单的删除将花费太多时间。

在我的情况下,我有一个带的浮点数的多索引DataFrame 100M rows x 3 cols,我需要从中删除10k行。与直觉相反,我发现最快的方法是take其余行。

让我们indexes_to_drop作为要放置的位置索引数组([1, 2, 4]在问题中)。

indexes_to_keep = set(range(df.shape[0])) - set(indexes_to_drop)
df_sliced = df.take(list(indexes_to_keep))

就我而言,这花费了20.5s,而简单的df.drop花费5min 27s了很多内存。所得的DataFrame是相同的。

If the DataFrame is huge, and the number of rows to drop is large as well, then simple drop by index df.drop(df.index[]) takes too much time.

In my case, I have a multi-indexed DataFrame of floats with 100M rows x 3 cols, and I need to remove 10k rows from it. The fastest method I found is, quite counterintuitively, to take the remaining rows.

Let indexes_to_drop be an array of positional indexes to drop ([1, 2, 4] in the question).

indexes_to_keep = set(range(df.shape[0])) - set(indexes_to_drop)
df_sliced = df.take(list(indexes_to_keep))

In my case this took 20.5s, while the simple df.drop took 5min 27s and consumed a lot of memory. The resulting DataFrame is the same.


回答 3

您还可以传递给DataFrame.drop标签本身(而不是索引标签系列):

In[17]: df
Out[17]: 
            a         b         c         d         e
one  0.456558 -2.536432  0.216279 -1.305855 -0.121635
two -1.015127 -0.445133  1.867681  2.179392  0.518801

In[18]: df.drop('one')
Out[18]: 
            a         b         c         d         e
two -1.015127 -0.445133  1.867681  2.179392  0.518801

等效于:

In[19]: df.drop(df.index[[0]])
Out[19]: 
            a         b         c         d         e
two -1.015127 -0.445133  1.867681  2.179392  0.518801

You can also pass to DataFrame.drop the label itself (instead of Series of index labels):

In[17]: df
Out[17]: 
            a         b         c         d         e
one  0.456558 -2.536432  0.216279 -1.305855 -0.121635
two -1.015127 -0.445133  1.867681  2.179392  0.518801

In[18]: df.drop('one')
Out[18]: 
            a         b         c         d         e
two -1.015127 -0.445133  1.867681  2.179392  0.518801

Which is equivalent to:

In[19]: df.drop(df.index[[0]])
Out[19]: 
            a         b         c         d         e
two -1.015127 -0.445133  1.867681  2.179392  0.518801

回答 4

我以一种简单的方式解决了这一问题-仅需两个步骤。

步骤1:首先形成包含不需要的行/数据的数据框。

步骤2:使用此不需要的数据框的索引从原始数据框删除行。

例:

假设您有一个数据框df,其中包括“ Age”的整数列,该列是整数。现在假设您要删除所有以“年龄”为负数的行。

步骤1:df_age_negative = df [df [‘Age’] <0]

步骤2:df = df.drop(df_age_negative.index,axis = 0)

希望这会更简单并且对您有所帮助。

I solved this in a simpler way – just in 2 steps.

  1. Make a dataframe with unwanted rows/data.

  2. Use the index of this unwanted dataframe to drop the rows from the original dataframe.

Example:
Suppose you have a dataframe df which as many columns including ‘Age’ which is an integer. Now let’s say you want to drop all the rows with ‘Age’ as negative number.

df_age_negative = df[ df['Age'] < 0 ] # Step 1
df = df.drop(df_age_negative.index, axis=0) # Step 2

Hope this is much simpler and helps you.


回答 5

如果要删除具有index的行x,我将执行以下操作:

df = df[df.index != x]

如果我想删除多个索引(例如,这些索引在list中unwanted_indices),则可以执行以下操作:

desired_indices = [i for i in len(df.index) if i not in unwanted_indices]
desired_df = df.iloc[desired_indices]

If I want to drop a row which has let’s say index x, I would do the following:

df = df[df.index != x]

If I would want to drop multiple indices (say these indices are in the list unwanted_indices), I would do:

desired_indices = [i for i in len(df.index) if i not in unwanted_indices]
desired_df = df.iloc[desired_indices]

回答 6

我想展示一些具体的例子。假设您在某些行中有许多重复的条目。如果您有字符串条目,则可以轻松地使用字符串方法来查找所有要删除的索引。

ind_drop = df[df['column_of_strings'].apply(lambda x: x.startswith('Keyword'))].index

现在使用它们的索引删除这些行

new_df = df.drop(ind_drop)

Here is a bit specific example, I would like to show. Say you have many duplicate entries in some of your rows. If you have string entries you could easily use string methods to find all indexes to drop.

ind_drop = df[df['column_of_strings'].apply(lambda x: x.startswith('Keyword'))].index

And now to drop those rows using their indexes

new_df = df.drop(ind_drop)

回答 7

在对@ theodros-zelleke的答案的评论中,@ j-jones询问了如果索引不是唯一的怎么办。我不得不处理这种情况。我要做的是在我叫drop()la 之前重命名索引中的重复项:

dropped_indexes = <determine-indexes-to-drop>
df.index = rename_duplicates(df.index)
df.drop(df.index[dropped_indexes], inplace=True)

rename_duplicates()我定义的函数在哪里,它通过了index元素并重命名了重复项。我使用了与pd.read_csv()在列上相同的重命名模式,即,"%s.%d" % (name, count)其中name行的名称和count它以前发生过的次数。

In a comment to @theodros-zelleke’s answer, @j-jones asked about what to do if the index is not unique. I had to deal with such a situation. What I did was to rename the duplicates in the index before I called drop(), a la:

dropped_indexes = <determine-indexes-to-drop>
df.index = rename_duplicates(df.index)
df.drop(df.index[dropped_indexes], inplace=True)

where rename_duplicates() is a function I defined that went through the elements of index and renamed the duplicates. I used the same renaming pattern as pd.read_csv() uses on columns, i.e., "%s.%d" % (name, count), where name is the name of the row and count is how many times it has occurred previously.


回答 8

如上所述,从布尔值确定索引,例如

df[df['column'].isin(values)].index

与使用此方法确定索引相比,可能会占用更多的内存

pd.Index(np.where(df['column'].isin(values))[0])

像这样应用

df.drop(pd.Index(np.where(df['column'].isin(values))[0]), inplace = True)

当处理大数据帧和有限的内存时,此方法很有用。

Determining the index from the boolean as described above e.g.

df[df['column'].isin(values)].index

can be more memory intensive than determining the index using this method

pd.Index(np.where(df['column'].isin(values))[0])

applied like so

df.drop(pd.Index(np.where(df['column'].isin(values))[0]), inplace = True)

This method is useful when dealing with large dataframes and limited memory.


回答 9

仅使用索引arg删除行:

df.drop(index = 2, inplace = True)

对于多行:

df.drop(index=[1,3], inplace = True)

Use only the Index arg to drop row:-

df.drop(index = 2, inplace = True)

For multiple rows:-

df.drop(index=[1,3], inplace = True)

回答 10

考虑一个示例数据框

df =     
index    column1
0           00
1           10
2           20
3           30

我们要删除第二和第三索引行。

方法1:

df = df.drop(df.index[2,3])
 or 
df.drop(df.index[2,3],inplace=True)
print(df)

df =     
index    column1
0           00
3           30

 #This approach removes the rows as we wanted but the index remains unordered

方法2

df.drop(df.index[2,3],inplace=True,ignore_index=True)
print(df)
df =     
index    column1
0           00
1           30
#This approach removes the rows as we wanted and resets the index. 

Consider an example dataframe

df =     
index    column1
0           00
1           10
2           20
3           30

we want to drop 2nd and 3rd index rows.

Approach 1:

df = df.drop(df.index[2,3])
 or 
df.drop(df.index[2,3],inplace=True)
print(df)

df =     
index    column1
0           00
3           30

 #This approach removes the rows as we wanted but the index remains unordered

Approach 2

df.drop(df.index[2,3],inplace=True,ignore_index=True)
print(df)
df =     
index    column1
0           00
1           30
#This approach removes the rows as we wanted and resets the index.