Python non-greedy regex

Question: Python non-greedy regex

How do I write a Python regex like "(.*)" such that, given "a (b) c (d) e", Python matches "b" instead of "b) c (d"?

I know that I can use "[^)]" instead of ".", but I’m looking for a more general solution that keeps my regex a little cleaner. Is there any way to tell Python “hey, match this as soon as possible”?


Answer 0

You seek the all-powerful *?

From the docs, Greedy versus Non-Greedy

the non-greedy qualifiers *?, +?, ??, or {m,n}? […] match as little text as possible.
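
As a small illustration of that quote, the sketch below runs each greedy qualifier next to its non-greedy counterpart. The sample string "aaaa" and the variable names are made up for this example and are not from the docs.

```python
import re

text = "aaaa"  # made-up sample input

# Each pair is (greedy pattern, non-greedy pattern).
pairs = [
    (r"a*", r"a*?"),
    (r"a+", r"a+?"),
    (r"a?", r"a??"),
    (r"a{2,3}", r"a{2,3}?"),
]

# Map each pattern to the text it matches at the start of the string.
results = {}
for greedy, lazy in pairs:
    results[greedy] = re.match(greedy, text).group()
    results[lazy] = re.match(lazy, text).group()

for pattern, matched in results.items():
    print(f"{pattern!r:12} -> {matched!r}")
```

Note that a non-greedy qualifier with nothing after it matches the minimum it is allowed to, which for *? and ?? is the empty string.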


Answer 1

>>> import re
>>> x = "a (b) c (d) e"
>>> re.search(r"\(.*\)", x).group()
'(b) c (d)'
>>> re.search(r"\(.*?\)", x).group()
'(b)'

According to the docs:

The ‘*’, ‘+’, and ‘?’ qualifiers are all greedy; they match as much text as possible. Sometimes this behavior isn’t desired; if the RE <.*> is matched against ‘<H1>title</H1>’, it will match the entire string, and not just ‘<H1>’. Adding ‘?’ after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only ‘<H1>’.
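
The example from the docs can be checked directly. This is just a quick sketch; the variable names are mine.

```python
import re

html = "<H1>title</H1>"

# Greedy: .* runs to the last '>' in the string.
greedy = re.match(r"<.*>", html).group()

# Non-greedy: .*? stops at the first '>' that lets the match succeed.
lazy = re.match(r"<.*?>", html).group()

print(greedy)  # <H1>title</H1>
print(lazy)    # <H1>
```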


Answer 2

Would not \\(.*?\\) work? That is the non-greedy syntax.
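
For what it's worth, the doubled backslashes only matter if the pattern is written as an ordinary (non-raw) string literal. A short sketch, assuming the question's sample string, showing that both spellings give the same pattern:

```python
import re

x = "a (b) c (d) e"

escaped = "\\(.*?\\)"  # ordinary string literal, backslashes doubled
raw = r"\(.*?\)"       # raw string literal, no doubling needed

# Both literals produce the same seven-character pattern string.
same_pattern = escaped == raw

first_match = re.search(escaped, x).group()
print(same_pattern, first_match)
```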


Answer 3

As the others have said, using the ? modifier on the * quantifier will solve your immediate problem, but be careful: you are starting to stray into areas where regexes stop working and you need a parser instead. For instance, the string “(foo (bar)) baz” will cause you problems.
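
To make the problem concrete, here is a rough sketch. The non-greedy pattern stops at the first closing parenthesis, so it cuts the nested group short. The helper first_balanced_group is a hypothetical name of my own, and it is only a minimal depth counter, not a full parser.

```python
import re

s = "(foo (bar)) baz"

# The non-greedy regex stops at the first ')', which belongs to the inner group.
lazy = re.search(r"\((.*?)\)", s).group(1)
print(lazy)  # foo (bar


def first_balanced_group(text):
    """Return the contents of the first balanced (...) group, or None."""
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i + 1  # contents begin after the outermost '('
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                return text[start:i]
    return None  # no group, or the parentheses never balanced


print(first_balanced_group(s))  # foo (bar)
```

Counting depth by hand is about the simplest thing that handles nesting; anything more involved (quoting, escapes, several bracket types) probably calls for a real parser.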


Answer 4

Using an ungreedy match is a good start, but I’d also suggest that you reconsider any use of .* at all. What about this?

groups = re.search(r"\([^)]*\)", x)
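
A quick comparison of the two approaches, as a sketch. On the question's input they agree; one place they differ is newlines, since "." does not match a newline by default while "[^)]" does. The multi-line sample string is made up for illustration.

```python
import re

x = "a (b) c (d) e"

negated = re.findall(r"\([^)]*\)", x)
lazy = re.findall(r"\(.*?\)", x)
print(negated, lazy)

# Made-up input with a newline inside the parentheses.
multiline = "a (b\nc) d"
lazy_multiline = re.findall(r"\(.*?\)", multiline)
negated_multiline = re.findall(r"\([^)]*\)", multiline)
print(lazy_multiline, negated_multiline)
```

Passing re.DOTALL would let the non-greedy version cross newlines as well.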

Answer 5

Do you want it to match “(b)”? Do as Zitrax and Paolo have suggested. Do you want it to match “b”? Do

>>> import re
>>> x = "a (b) c (d) e"
>>> re.search(r"\((.*?)\)", x).group(1)
'b'
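
If you want every parenthesized piece rather than only the first, the same capturing group works with re.findall, which returns the group contents when the pattern has exactly one group. A small sketch:

```python
import re

x = "a (b) c (d) e"

# search() finds the first match; group(1) is the text inside the parentheses.
first = re.search(r"\((.*?)\)", x).group(1)

# findall() returns the captured group for every match.
all_inner = re.findall(r"\((.*?)\)", x)

print(first, all_inner)
```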

Answer 6

To start with, I do not suggest using “*” in regexes. Yes, I know, it is the most used repetition operator, but it is nevertheless a bad idea. This is because, while it does match any amount of repetition for that character, “any” includes 0, which is usually something you want to throw a syntax error for, not accept. Instead, I suggest using the + sign, which matches one or more repetitions. What’s more, from what I can see, you are dealing with fixed-length parenthesized expressions. As a result, you can probably use the {m,n} syntax (written without a space after the comma) to specify the desired length.
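
To make the difference concrete, here is a small sketch comparing *, + and a bounded {m,n} repeat. The sample strings and the 1-to-3 bound are made up for illustration.

```python
import re

samples = ["()", "(b)", "(abc)", "(abcdef)"]

# '*' accepts zero or more characters between the parentheses.
star = [bool(re.fullmatch(r"\([^)]*\)", s)) for s in samples]

# '+' requires at least one character, so "()" is rejected.
plus = [bool(re.fullmatch(r"\([^)]+\)", s)) for s in samples]

# '{1,3}' requires between one and three characters.
bounded = [bool(re.fullmatch(r"\([^)]{1,3}\)", s)) for s in samples]

for s, a, b, c in zip(samples, star, plus, bounded):
    print(f"{s!r:12} star={a} plus={b} bounded={c}")
```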

However, if you really do need non-greedy repetition, I suggest consulting the all-powerful ?. When placed at the end of any regex repetition specifier, it forces that part of the regex to match the least amount of text possible.

That being said, I would be very careful with the ?, as it, like the Sonic Screwdriver in Doctor Who, has a tendency to do, how should I put it, “slightly” undesired things if not carefully calibrated. For example, it would identify ((1) (note the lack of a second rparen) as a match.