Python 实用宝典

Question 1

I would like to cleanly filter a dataframe using regex on one of the columns.

For a contrived example:

In [210]: foo = pd.DataFrame({'a' : [1,2,3,4], 'b' : ['hi', 'foo', 'fat', 'cat']})
In [211]: foo
Out[211]: 
   a    b
0  1   hi
1  2  foo
2  3  fat
3  4  cat

I want to filter the rows to those that start with f using a regex. First go:

In [213]: foo.b.str.match('f.*')
Out[213]: 
0    []
1    ()
2    ()
3    []

That’s not too terribly useful. However this will get me my boolean index:

In [226]: foo.b.str.match('(f.*)').str.len() > 0
Out[226]: 
0    False
1     True
2     True
3    False
Name: b

So I could then do my restriction by:

In [229]: foo[foo.b.str.match('(f.*)').str.len() > 0]
Out[229]: 
   a    b
1  2  foo
2  3  fat

That makes me artificially put a group into the regex though, and seems like maybe not the clean way to go. Is there a better way to do this?

Question 2

Use contains instead:

In [10]: df.b.str.contains('^f')
Out[10]: 
0    False
1     True
2     True
3    False
Name: b, dtype: bool

Question 3

There is already a string handling function Series.str.startswith(). You should try foo[foo.b.str.startswith('f')].

Result:

    a   b
1   2   foo
2   3   fat

I think what you expect.

Alternatively you can use contains with regex option. For example:

foo[foo.b.str.contains('oo', regex= True, na=False)]

Result:

    a   b
1   2   foo

na=False is to prevent Errors in case there is nan, null etc. values

Question 4

Multiple column search with dataframe:

frame[frame.filename.str.match('*.'+MetaData+'.*') & frame.file_path.str.match('C:\test\test.txt')]

Question 5

It may be a bit late, but this is now easier to do in Pandas by calling Series.str.match. The docs explain the difference between match, fullmatch and contains.

Note that in order to use the results for indexing, set the na=False argument (or True if you want to include NANs in the results).

Question 6

Thanks for the great answer @user3136169, here is an example of how that might be done also removing NoneType values.

def regex_filter(val):
    if val:
        mo = re.search(regex,val)
        if mo:
            return True
        else:
            return False
    else:
        return False

df_filtered = df[df['col'].apply(regex_filter)]

Also you can also add regex as an arg:

def regex_filter(val,myregex):
    ...

df_filtered = df[df['col'].apply(res_regex_filter,regex=myregex)]

Question 7

Write a Boolean function that checks the regex and use apply on the column

foo[foo['b'].apply(regex_function)]

Question 8

Using str slice

foo[foo.b.str[0]=='f']
Out[18]: 
   a    b
1  2  foo
2  3  fat

Python 实用宝典

如何通过正则表达式过滤熊猫中的行

问题：如何通过正则表达式过滤熊猫中的行

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

有趣好用的Python教程