How to filter rows in pandas by regex - Python 实用宝典

Question: How to filter rows in pandas by regex

I would like to cleanly filter a dataframe using regex on one of the columns.

For a contrived example:

In [210]: foo = pd.DataFrame({'a' : [1,2,3,4], 'b' : ['hi', 'foo', 'fat', 'cat']})
In [211]: foo
Out[211]: 
   a    b
0  1   hi
1  2  foo
2  3  fat
3  4  cat

I want to filter the rows to those that start with f using a regex. First go:

In [213]: foo.b.str.match('f.*')
Out[213]: 
0    []
1    ()
2    ()
3    []

That’s not too terribly useful. However this will get me my boolean index:

In [226]: foo.b.str.match('(f.*)').str.len() > 0
Out[226]: 
0    False
1     True
2     True
3    False
Name: b

So I could then do my restriction by:

In [229]: foo[foo.b.str.match('(f.*)').str.len() > 0]
Out[229]: 
   a    b
1  2  foo
2  3  fat

That makes me artificially put a group into the regex though, and seems like maybe not the clean way to go. Is there a better way to do this?

Answer 0

Use contains instead:

In [10]: foo.b.str.contains('^f')
Out[10]: 
0    False
1     True
2     True
3    False
Name: b, dtype: bool
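
To go from the boolean result to the filtered frame, the mask can be used directly as an index. A minimal sketch, reusing the foo frame from the question (the names mask and filtered are just for illustration):

```python
import pandas as pd

foo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['hi', 'foo', 'fat', 'cat']})

# str.contains returns a boolean Series; '^f' anchors the pattern
# at the start of each string, so no capture group is needed.
mask = foo['b'].str.contains('^f')

# Indexing the frame with the mask keeps only the matching rows.
filtered = foo[mask]
print(filtered)
```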

Answer 1

There is already a string handling function Series.str.startswith(). You should try foo[foo.b.str.startswith('f')].

Result:

    a   b
1   2   foo
2   3   fat

I think this is what you expect.

Alternatively, you can use contains with the regex option. For example:

foo[foo.b.str.contains('oo', regex=True, na=False)]

Result:

    a   b
1   2   foo

na=False prevents errors when the column contains NaN, None, or other missing values: those rows are treated as non-matching.
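
A small sketch of the na=False case. The frame df below is made up for illustration (it is not from the question) and has one missing value in column b:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing entry in the string column.
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', np.nan, 'fat']})

# Without na=False the mask can hold a missing value for row 1
# (depending on the pandas version and dtype), which is not usable
# for boolean indexing. na=False makes that row count as "no match".
mask = df['b'].str.contains('oo', regex=True, na=False)

result = df[mask]
print(result)
```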

Answer 2

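Searching multiple columns of a dataframe (combine the boolean masks with &; the path is written as a raw string with escaped backslashes so that \t is not read as a tab):

frame[frame.filename.str.match('.*' + MetaData + '.*') & frame.file_path.str.match(r'C:\\test\\test\.txt')]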
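
Here is a self-contained sketch of the same idea. The frame contents and the value of MetaData are invented for illustration, and re.escape is used as one way (my choice, not part of the answer) to turn the literal Windows path into a pattern in which the backslashes and the dot match literally:

```python
import re
import pandas as pd

# Hypothetical data; only the column names come from the answer.
frame = pd.DataFrame({
    'filename': ['report_meta.csv', 'notes.txt', 'meta_dump.csv'],
    'file_path': [r'C:\test\test.txt', r'C:\test\test.txt', r'C:\other\x.txt'],
})
MetaData = 'meta'  # assumed substring to look for in the file name

# Escape the literal path so regex metacharacters are matched as-is.
path_pattern = re.escape(r'C:\test\test.txt')

# Each condition is a boolean Series; & combines them row by row.
mask = (frame['filename'].str.match('.*' + MetaData + '.*')
        & frame['file_path'].str.match(path_pattern))

result = frame[mask]
print(result)
```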

Answer 3

It may be a bit late, but this is now easier to do in pandas by calling Series.str.match, which returns a boolean result. The docs explain the difference between match, fullmatch and contains.

Note that in order to use the results for indexing, set the na=False argument (or True if you want to include NaN values in the results).
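
To make the difference concrete, here is a rough comparison on a made-up Series (the data and variable names are mine). It assumes a pandas version that has Series.str.fullmatch, which as far as I know means 1.1 or later:

```python
import pandas as pd

# Hypothetical values, including one missing entry.
s = pd.Series(['foo', 'food', 'a fox', None])

# contains: the pattern may match anywhere in the string.
anywhere = s.str.contains('fo+', na=False)

# match: the pattern must match at the start of the string.
at_start = s.str.match('fo+', na=False)

# fullmatch: the pattern must match the entire string.
whole = s.str.fullmatch('fo+', na=False)

print(pd.DataFrame({'value': s, 'contains': anywhere,
                    'match': at_start, 'fullmatch': whole}))
```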

Answer 4

Thanks for the great answer, @user3136169. Here is an example of how that might be done while also removing NoneType values.

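import re

def regex_filter(val):
    # `regex` is assumed to be a pattern defined in the enclosing scope.
    # None and empty strings are falsy, so they are filtered out.
    if val:
        return re.search(regex, val) is not None
    return False

df_filtered = df[df['col'].apply(regex_filter)]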

You can also pass the regex as an argument:

def regex_filter(val, myregex):
    ...

df_filtered = df[df['col'].apply(regex_filter, myregex=myregex)]
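
Putting the pieces together, a runnable sketch might look like this. The sample frame and the pattern r'^f' are assumptions for illustration, and I check isinstance(val, str) instead of truthiness so that NaN (a float) is excluded along with None:

```python
import re
import pandas as pd

def regex_filter(val, myregex):
    # Only strings can be searched; None, NaN and other types are dropped.
    if isinstance(val, str):
        return re.search(myregex, val) is not None
    return False

# Hypothetical data with a None in the column being filtered.
df = pd.DataFrame({'col': ['foo', None, 'fat', 'cat']})

# Extra keyword arguments to Series.apply are forwarded to the function.
df_filtered = df[df['col'].apply(regex_filter, myregex=r'^f')]
print(df_filtered)
```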

Answer 5

Write a Boolean function that checks the regex and use apply on the column

foo[foo['b'].apply(regex_function)]
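
For completeness, one possible regex_function, assuming the column holds only strings (the precompiled pattern and its name are my own choices, not part of the answer):

```python
import re
import pandas as pd

# Compile once; Pattern.match only matches at the start of the string.
STARTS_WITH_F = re.compile(r'f')

def regex_function(val):
    # Convert the match object (or None) into a plain boolean.
    return bool(STARTS_WITH_F.match(val))

foo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['hi', 'foo', 'fat', 'cat']})
result = foo[foo['b'].apply(regex_function)]
print(result)
```

Note that apply calls the Python function once per row, so on large frames the vectorized .str methods are usually the better fit.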

Answer 6

Using str indexing:

foo[foo.b.str[0]=='f']
Out[18]: 
   a    b
1  2  foo
2  3  fat
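
This only covers the fixed-prefix case, not a general regex. A quick sketch on made-up data (the frame df is hypothetical) suggesting it also copes with missing or empty values, assuming the default object or str dtype, since .str[0] yields NaN for those rows and the comparison with 'f' is then False:

```python
import pandas as pd

# Hypothetical data with a missing value and an empty string.
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', None, '']})

# .str[0] takes the first character of each string;
# rows without a first character compare as not equal to 'f'.
mask = df['b'].str[0] == 'f'

result = df[mask]
print(result)
```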
