Question: Drop rows with all zeros in a pandas data frame
I can use the pandas dropna() functionality to remove rows with some or all columns set as NA's. Is there an equivalent function for dropping rows with all columns having value 0?
P kt b tt mky depth
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 1.1 3 4.5 2.3 9.0
In this example, we would like to drop the first 4 rows from the data frame.
thanks!
Answer 0
It turns out this can be nicely expressed in a vectorized fashion:
> df = pd.DataFrame({'a':[0,0,1,1], 'b':[0,1,0,1]})
> df = df[(df.T != 0).any()]
> df
a b
1 0 1
2 1 0
3 1 1
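As far as I can tell, the transpose is needed because DataFrame.any() reduces over the index (axis 0) by default, so transposing first turns each original row into a column. A minimal sketch on the same toy frame (the names nonzero_rows and kept are only illustrative, not from the answer):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1]})

# After transposing, each original row is a column, so the default
# any() (axis=0) yields one boolean per original row.
nonzero_rows = (df.T != 0).any()

# The boolean Series is indexed by the original row labels,
# so it can be used directly as a row mask.
kept = df[nonzero_rows]
print(kept)
```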
Answer 1
One-liner. No transpose needed:
df.loc[~(df==0).all(axis=1)]
And for those who like symmetry, this also works…
df.loc[(df!=0).any(axis=1)]
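The two forms are the same mask written two ways ("not all zero" versus "any non-zero"). A quick sketch to check that on a toy frame; the variable names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1]})

not_all_zero = ~(df == 0).all(axis=1)  # "it is not the case that every value is 0"
any_nonzero = (df != 0).any(axis=1)    # "at least one value is not 0"

# Both masks should agree row for row.
same = not_all_zero.equals(any_nonzero)
result = df.loc[any_nonzero]
print(same)
print(result)
```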
Answer 2
I look up this question about once a month and always have to dig out the best answer from the comments:
df.loc[(df!=0).any(axis=1)]
Thanks Dan Allan!
Answer 3
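Replace the zeros with nan, then drop the rows in which every entry is nan. After that, replace nan with zeros again. Note that any nan values already present in the data also end up as zeros, and integer columns are converted to float.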
import numpy as np
df = df.replace(0, np.nan)
df = df.dropna(how='all', axis=0)
df = df.replace(np.nan, 0)
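One thing to watch with this round trip: it cannot tell original zeros from original nan values. A small sketch with made-up data that already contains a nan (the frame and the name out are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Row 0 is all zeros; row 1 has a genuine missing value in 'b'.
df = pd.DataFrame({'a': [0, 1, 1], 'b': [0, np.nan, 2]})

out = df.replace(0, np.nan)           # zeros become nan
out = out.dropna(how='all', axis=0)   # drop rows where every entry is nan
out = out.replace(np.nan, 0)          # nan (including the original one) becomes 0

print(out)
print(out.dtypes)  # 'a' started as integers but is float after the round trip
```

If the data can contain real missing values, one of the boolean-mask answers is probably the safer choice.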
Answer 4
I think this solution is the shortest, although it only checks a single column (replace 'ColName' with your own column name):
df = df[df['ColName'] != 0]
Answer 5
A couple of solutions I found helpful while looking this up, especially for larger data sets:
df[(df.sum(axis=1) != 0)] # 30% faster
df[df.values.sum(axis=1) != 0] # 3X faster
Continuing with the example from @U2EF1:
In [88]: df = pd.DataFrame({'a':[0,0,1,1], 'b':[0,1,0,1]})
In [91]: %timeit df[(df.T != 0).any()]
1000 loops, best of 3: 686 µs per loop
In [92]: df[(df.sum(axis=1) != 0)]
Out[92]:
a b
1 0 1
2 1 0
3 1 1
In [95]: %timeit df[(df.sum(axis=1) != 0)]
1000 loops, best of 3: 495 µs per loop
In [96]: %timeit df[df.values.sum(axis=1) != 0]
1000 loops, best of 3: 217 µs per loop
On a larger dataset:
In [119]: bdf = pd.DataFrame(np.random.randint(0,2,size=(10000,4)))
In [120]: %timeit bdf[(bdf.T != 0).any()]
1000 loops, best of 3: 1.63 ms per loop
In [121]: %timeit bdf[(bdf.sum(axis=1) != 0)]
1000 loops, best of 3: 1.09 ms per loop
In [122]: %timeit bdf[bdf.values.sum(axis=1) != 0]
1000 loops, best of 3: 517 µs per loop
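A caveat worth checking before using the sum-based versions: they assume a row sums to zero only when every value is zero, which breaks down once negative values are present. A minimal sketch with made-up data (variable names are illustrative):

```python
import pandas as pd

# Row 1 is not all zeros, but its values cancel out.
df = pd.DataFrame({'a': [0, 1, 1], 'b': [0, -1, 1]})

by_sum = df[df.sum(axis=1) != 0]            # also drops row 1, since 1 + (-1) == 0
by_abs_sum = df[df.abs().sum(axis=1) != 0]  # keeps row 1
by_any = df[(df != 0).any(axis=1)]          # keeps row 1

print(by_sum)
print(by_abs_sum)
```

For non-negative data such as the 0/1 frames timed above, the plain sum gives the same result as the mask-based versions.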
Answer 6
import pandas as pd
df = pd.DataFrame({'a' : [0,0,1], 'b' : [0,0,-1]})
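temp = df.abs().sum(axis=1) == 0  # True for rows whose values are all zero
df = df.loc[~temp]  # keep the other rows; drop() expects index labels, not a boolean mask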
Result:
>>> df
a b
2 1 -1
Answer 7
You can use a quick lambda function to check if all the values in a given row are 0. Then you can use the result of applying that lambda as a way to choose only the rows that match or don't match that condition:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3),
index=['one', 'two', 'three', 'four', 'five'],
columns=list('abc'))
df.loc[['one', 'three']] = 0
print(df)
print(df.loc[~df.apply(lambda row: (row == 0).all(), axis=1)])
Yields:
a b c
one 0.000000 0.000000 0.000000
two 2.240893 1.867558 -0.977278
three 0.000000 0.000000 0.000000
four 0.410599 0.144044 1.454274
five 0.761038 0.121675 0.443863
[5 rows x 3 columns]
a b c
two 2.240893 1.867558 -0.977278
four 0.410599 0.144044 1.454274
five 0.761038 0.121675 0.443863
[3 rows x 3 columns]
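For comparison, here is a sketch showing that the row-wise apply gives the same rows as the vectorized mask from the other answers, which is usually the faster option on large frames. The names via_apply and via_mask are mine, not from the answer:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5, 3),
                  index=['one', 'two', 'three', 'four', 'five'],
                  columns=list('abc'))
df.loc[['one', 'three']] = 0

# Row-by-row check with a lambda, as in the answer above.
via_apply = df.loc[~df.apply(lambda row: (row == 0).all(), axis=1)]

# Same condition expressed on the whole frame at once.
via_mask = df.loc[~(df == 0).all(axis=1)]

print(via_apply.equals(via_mask))
```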
Answer 8
Another alternative:
# Is there anything in this row non-zero?
# df != 0 --> which entries are non-zero? T/F
# (df != 0).any(axis=1) --> is 'any' entry non-zero row-wise? True for rows with at least one non-zero entry.
# df.loc[nonzero_mask, :] --> mask your rows to only show the rows which contain a non-zero entry.
# .shape to confirm a subset.
nonzero_mask = (df != 0).any(axis=1)  # Is there anything in this row non-zero?
df.loc[nonzero_mask, :].shape
Answer 9
For me this code: df.loc[(df!=0).any(axis=0)]
did not work. It returned the exact dataset.
Instead, I used df.loc[:, (df!=0).any(axis=0)]
which dropped all the columns containing only 0 values in the dataset.
Using .all() in place of .any()
dropped all the columns that contain any zero value in my dataset.
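Since this answer is about columns rather than rows, a short sketch may help show the difference between the two reductions. The frame and variable names here are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0], 'b': [0, 1, 2], 'c': [3, 4, 5]})

# axis=0 reduces down each column, so the mask has one entry per column
# and belongs in the column position of .loc.
no_all_zero_cols = df.loc[:, (df != 0).any(axis=0)]  # drops 'a' (all zeros)
no_zero_cols = df.loc[:, (df != 0).all(axis=0)]      # drops 'a' and 'b' (contain a zero)

print(list(no_all_zero_cols.columns))
print(list(no_zero_cols.columns))
```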
Answer 10
df = df[~(df[['kt', 'b', 'tt', 'mky', 'depth']] == 0).all(axis=1)]
Try this command; it works perfectly.
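Restricting the check to named columns matters if P from the question is an ordinary column rather than the index, because P is never zero. A sketch under that assumption (value_cols, whole and subset are illustrative names):

```python
import pandas as pd

# The question's data, assuming P is a regular column.
df = pd.DataFrame({
    'P': [1, 2, 3, 4, 5],
    'kt': [0, 0, 0, 0, 1.1],
    'b': [0, 0, 0, 0, 3],
    'tt': [0, 0, 0, 0, 4.5],
    'mky': [0, 0, 0, 0, 2.3],
    'depth': [0, 0, 0, 0, 9.0],
})

value_cols = ['kt', 'b', 'tt', 'mky', 'depth']

whole = df[~(df == 0).all(axis=1)]               # P is non-zero, so nothing is dropped
subset = df[~(df[value_cols] == 0).all(axis=1)]  # only the value columns are tested

print(len(whole), len(subset))
```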
Answer 11
To drop all rows that contain a 0 in any column:
new_df = df[df != 0].dropna()
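Broken into steps, this expression seems to work by masking zeros to nan and then letting dropna() (which by default drops any row containing a nan) remove those rows. A small sketch with made-up data; the intermediate name masked is only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 0, 4]})

masked = df[df != 0]      # entries equal to 0 become nan
new_df = masked.dropna()  # rows with at least one nan are removed

print(masked)
print(new_df)
```

Note this is stricter than the original question, which only asks to drop rows where every column is 0.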