Regular expression matching a multiline block of text

Question: Regular expression matching a multiline block of text

I’m having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is as follows (‘\n’ is a newline):

some Varying TEXT\n
\n
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
[more of the above, ending with a newline]\n
[yep, there is a variable number of lines here]\n
\n
(repeat the above a few hundred times).

I’d like to capture two things: the ‘some_Varying_TEXT’ part, and all of the lines of uppercase text that come two lines below it, in a single capture (I can strip out the newline characters later). I’ve tried a few approaches:

re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts
re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines

and a lot of variations on these, with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can’t seem to catch the 4-5 lines of uppercase text. I’d like match.group(1) to be some_Varying_Text and group(2) to be line1+line2+line3+etc until the empty line is encountered.

If anyone’s curious, it’s supposed to be a sequence of amino acids that make up a protein.


Answer 0

Try this:

re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

I think your biggest problem is that you’re expecting the ^ and $ anchors to match linefeeds, but they don’t. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.
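
As a quick check of the pattern above, here is a minimal sketch run against a made-up sample (the header names and sequences are invented for illustration). It relies on the dot not matching a newline, so the repeated \n.+ stops at the first blank line.

```python
import re

# Invented sample data: a header line, a blank line, then sequence lines.
text = (
    "some_Varying_TEXT\n"
    "\n"
    "DSJFKDAFJKDAF\n"
    "GATACAACATAGG\n"
    "\n"
    "other_Varying_TEXT\n"
    "\n"
    "AAAA\n"
    "BBBB\n"
)

block = re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

records = []
for m in block.finditer(text):
    header = m.group(1)
    # group(2) still contains the newlines between lines; drop them here.
    sequence = m.group(2).replace("\n", "")
    records.append((header, sequence))

print(records)
```

Each tuple pairs a header with its sequence lines joined into one string.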

Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren’t certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:

re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
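
One caveat that seems worth checking before relying on the broader pattern: in Python's re module the dot excludes only \n, so on \r\n text it can consume the \r of a blank line and the match may run past the end of the block. The sketch below (sample data invented) compares the pattern as given with a variant that swaps the dot for [^\r\n]. That substitution is my assumption, not part of the answer.

```python
import re

# Invented sample with Windows-style line endings.
text = "H1\r\n\r\nAAA\r\nBBB\r\n\r\nH2\r\n\r\nCCC\r\n"

NEWLINE = r"(?:\n|\r\n?)"

# The pattern as given above: '.' also matches '\r'.
as_given = re.compile(
    r"^(.+)" + NEWLINE + r"((?:" + NEWLINE + r".+)+)", re.MULTILINE
)

# Variant (an assumption, not from the answer): exclude '\r' as well as '\n'.
strict = re.compile(
    r"^([^\r\n]+)" + NEWLINE + r"((?:" + NEWLINE + r"[^\r\n]+)+)", re.MULTILINE
)


def parse(pattern, data):
    """Return (header, sequence) pairs with line breaks removed from the sequence."""
    return [
        (m.group(1), re.sub(r"[\r\n]+", "", m.group(2)))
        for m in pattern.finditer(data)
    ]


print(len(as_given.findall(text)))  # both blocks end up in a single match
print(parse(strict, text))
```

If the text comes from a file opened in text mode with Python 3's default settings, line endings are already translated to \n on read, so the simpler pattern is usually enough in that case.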

BTW, you don’t want to use the DOTALL modifier here; you’re relying on the fact that the dot matches everything except newlines.
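
To see why DOTALL gets in the way, here is a small sketch on invented sample data. With DOTALL the greedy dot is free to cross blank lines, so the two blocks are no longer matched separately.

```python
import re

# Invented sample: two header/sequence blocks separated by blank lines.
text = "H1\n\nAAA\nBBB\n\nH2\n\nCCC\nDDD\n"

pattern = r"^(.+)\n((?:\n.+)+)"

without_dotall = re.findall(pattern, text, re.MULTILINE)
with_dotall = re.findall(pattern, text, re.MULTILINE | re.DOTALL)

print(without_dotall)    # one tuple per block
print(len(with_dotall))  # the greedy dot now crosses blank lines
```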


Answer 1

This will work:

>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...   title, sequence = match.groups()
...   title = title.strip()
...   sequence = rx_blanks.sub("",sequence)
...   print("Title:", title)
...   print("Sequence:", sequence)
...   print()
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK

Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW

Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)

  • The first character (^) means “starting at the beginning of a line”. Be aware that it does not match the newline itself (same for $: it means “just before a newline”, but it does not match the newline itself).
  • Then (.+?)\n\n means “match as few characters as possible (any character except a newline is allowed) until you reach two newlines”. The result (without the newlines) is put in the first group.
  • [A-Z]+\n means “match as many upper case letters as possible until you reach a newline”. This defines what I will call a textline.
  • ((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
  • You could add a final \n in the regular expression if you want to enforce a double newline at the end.
  • Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
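
As a rough illustration of the point about adding a final \n, the sketch below compares the two patterns on invented sample text whose last block is not followed by a blank line. The names lenient and strict are mine.

```python
import re

# Invented sample; note that the last block is NOT followed by a blank line.
text = "Title 1\n\nAAA\nBBB\n\nTitle 2\n\nCCC\nDDD\n"

lenient = re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)", re.MULTILINE)
# Same pattern with a final \n, so a block must end with a blank line.
strict = re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)\n", re.MULTILINE)

lenient_titles = [m.group(1) for m in lenient.finditer(text)]
strict_titles = [m.group(1) for m in strict.finditer(text)]

print(lenient_titles)
print(strict_titles)
```

The stricter pattern skips the last block here, so whether to add the final \n depends on whether the input is guaranteed to end with a blank line.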

Answer 2

If each file only has one sequence of amino acids, I wouldn’t use regular expressions at all. Just something like this:

def read_amino_acid_sequence(path):
    with open(path) as sequence_file:
        title = sequence_file.readline() # read 1st line
        aminoacid_sequence = sequence_file.read() # read the rest

    # some cleanup, if necessary
    title = title.strip() # remove trailing white spaces and newline
    aminoacid_sequence = aminoacid_sequence.replace(" ","").replace("\n","")
    return title, aminoacid_sequence

Answer 3

Find:

^>([^\n\r]+)[\n\r]([A-Z\n\r]+)

\1 = some_varying_text

\2 = lines of all CAPS

Edit (proof that this works):

text = """> some_Varying_TEXT

DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF
GATACAACATAGGATACA
GGGGGAAAAAAAATTTTTTTTT
CCCCAAAA

> some_Varying_TEXT2

DJASDFHKJFHKSDHF
HHASGDFTERYTERE
GAGAGAGAGAG
PPPPPAAAAAAAAAAAAAAAP
"""

import re

regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(text)]

for m in matches:
    print('Name: %s\nSequence:%s' % (m[0], m[1]))
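
Note that the second group captured by this pattern still contains the line breaks around and between the sequence lines. A minimal sketch of one way to tidy that up, using invented sample data and a records dictionary of my own naming:

```python
import re

# Invented sample in the same shape as the text above.
text = "> name_one\n\nGATACA\nCCCCAAAA\n\n> name_two\n\nGAGAGA\n"

regex = re.compile(r"^>([^\n\r]+)[\n\r]([A-Z\n\r]+)", re.MULTILINE)

records = {}
for name, raw_sequence in regex.findall(text):
    # split() with no arguments splits on any whitespace run, including
    # \n and \r, so joining the pieces removes every line break.
    records[name.strip()] = "".join(raw_sequence.split())

print(records)
```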

Answer 4

The following is a regular expression matching a multiline block of text:

import re
# startText and endText stand for the literal delimiters; text is the string to search
result = re.findall(r'(startText)(.+)((?:\n.+)+)(endText)', text)
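
Here is a small sketch of how that pattern behaves, with BEGIN and END as hypothetical stand-ins for startText and endText and invented sample strings. It also shows one limitation that seems worth knowing about: the end delimiter is not found when it sits alone on its own line.

```python
import re

# BEGIN and END are hypothetical delimiters standing in for startText/endText.
pattern = re.compile(r"(BEGIN)(.+)((?:\n.+)+)(END)")

sample = "BEGIN first line\nsecond line\nthird line END\n"
matches = pattern.findall(sample)
print(matches)

# With END on a line of its own there is no match, because each
# repetition of \n.+ needs at least one character before END.
own_line = "BEGIN first line\nsecond line\nEND\n"
own_line_matches = pattern.findall(own_line)
print(own_line_matches)
```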

Answer 5

My preference.

lineIter = iter(aFile)  # aFile is an already-open file object
for line in lineIter:
    if line.startswith(">"):
        someVaryingText = line
        break
assert len(next(lineIter).strip()) == 0  # the line after the header must be blank
acids = []
for line in lineIter:
    if len(line.strip()) == 0:
        break
    acids.append(line)

At this point you have someVaryingText as a string, and the acids as a list of strings. You can do "".join( acids ) to make a single string (each line still ends with its newline, so strip those first if you want one unbroken sequence).

I find this less frustrating (and more flexible) than multiline regexes.
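
Since the question mentions a few hundred repeated blocks, the same line-by-line idea can be wrapped in a generator. This is only a sketch under stated assumptions: the function name iter_records is mine, each record is assumed to be a ">" header line, a blank line, then sequence lines, and the line after the header is consumed without being checked.

```python
import io


def iter_records(lines):
    """Yield (header, sequence) pairs from an iterable of lines.

    Assumes each record is a '>' header line, a blank line, then
    sequence lines up to the next blank line or end of input.
    """
    line_iter = iter(lines)
    for line in line_iter:
        if not line.startswith(">"):
            continue  # skip anything before the next header
        header = line[1:].strip()
        next(line_iter, None)  # consume the blank line after the header
        acids = []
        for seq_line in line_iter:
            if not seq_line.strip():
                break  # a blank line ends this record
            acids.append(seq_line.strip())
        yield header, "".join(acids)


# io.StringIO stands in for an open file here.
sample = io.StringIO("> first\n\nGATACA\nCCCC\n\n> second\n\nGAGA\n")
print(list(iter_records(sample)))
```

Because it reads one line at a time, this should cope with large files without loading them fully into memory.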