Extract part of a regex match

Question: Extract part of a regex match

I want a regular expression to extract the title from an HTML page. Currently I have this:

title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '') 

Is there a regular expression to extract just the contents of <title> so I don’t have to remove the tags?


Answer 0

Use parentheses ( ) in the regular expression to create a capturing group, and group(1) in Python to retrieve the captured string (re.search returns None if it finds no match, so don’t call group() on its result directly):

title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)

if title_search:
    title = title_search.group(1)
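
One caveat with this pattern: `.` does not match newlines by default and `.*` is greedy, so a title spread over several lines is missed, and a line containing two closing tags is over-matched. A minimal sketch of a helper that handles both cases; the name extract_title and the whitespace stripping are illustrative choices, not part of the original answer:

```python
import re

# Hypothetical helper, for illustration only.
# (.*?) is non-greedy, and re.DOTALL lets '.' match newlines.
TITLE_RE = re.compile(r'<title>(.*?)</title>', re.IGNORECASE | re.DOTALL)

def extract_title(html):
    """Return the text between <title> and </title>, or None if absent."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    return match.group(1).strip()

print(extract_title('<html><TITLE>\n  Hello\n</TITLE></html>'))  # Hello
print(extract_title('<p>no title here</p>'))                      # None
```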

Answer 1

Note that starting with Python 3.8, which introduced assignment expressions (PEP 572, the := operator), it’s possible to improve a bit on Krzysztof Krasoń’s solution by capturing the match result as a variable directly within the if condition and reusing it in the condition’s body:

import re

pattern = '<title>(.*)</title>'
text = '<title>hello</title>'
# match is bound and tested for truthiness in a single expression
if match := re.search(pattern, text, re.IGNORECASE):
    title = match.group(1)
    print(title)  # hello

Answer 2

Try using capturing groups:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

Answer 3

re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)

Answer 4

May I recommend Beautiful Soup? It is a very good library for parsing an entire HTML document.

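from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
titleName = soup.title.string  # .string is the title text; .name would only give the tag name 'title'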

Answer 5

Try:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

Answer 6

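The snippets provided so far do not cope with exceptions: when nothing matches, re.search returns None, and calling .group(1) on it raises AttributeError. May I suggest: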

getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]

This returns the first captured group of the match, or an empty string by default if the pattern has not been found.
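
A more explicit equivalent of the one-liner may be easier to read. A sketch, where first_group and its default and flags parameters are hypothetical names introduced here, and the pattern is assumed to contain at least one capturing group:

```python
import re

def first_group(pattern, text, default="", flags=0):
    """Return group 1 of the first match, or `default` when nothing matches."""
    match = re.search(pattern, text, flags)
    return match.group(1) if match else default

print(first_group(r"<title>(.*)</title>", "<TITLE>hello</TITLE>", flags=re.IGNORECASE))  # hello
print(repr(first_group(r"<title>(.*)</title>", "no tags here")))  # ''
```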


Answer 7

I’d think this should suffice:

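import re
# [^<]* matches any run of characters other than '<', newlines included
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
match = pattern.search(text)  # match.group(1) holds the title text if a match was found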

… assuming that your text (HTML) is in a variable named “text.”

This also assumes that there are not other HTML tags which can be legally embedded inside of an HTML TITLE tag and no way to legally embed any other < character within such a container/block.
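
To make that assumption concrete, here is a small sketch with made-up inputs showing what the [^<]* pattern does when the title contains a literal < and when it contains an escaped one:

```python
import re

pattern = re.compile(r'<title>([^<]*)</title>', re.IGNORECASE)

print(pattern.search('<title>Plain text</title>').group(1))  # Plain text
# A stray '<' inside the title breaks the assumption: there is no match at all.
print(pattern.search('<title>a < b</title>'))                 # None
# Written as an entity, the text matches but is returned undecoded.
print(pattern.search('<title>a &lt; b</title>').group(1))     # a &lt; b
```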

However

Don’t use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you’re going to write a full parser, which would be a lot of extra work when various HTML, SGML and XML parsers are already in the standard libraries.)
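
As an illustration of the standard-library route, here is a minimal sketch built on html.parser.HTMLParser from Python 3. The class TitleParser and the function title_from_html are hypothetical names introduced for this example, and it only collects the text of the first title element:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case, so <TITLE> arrives as 'title'.
        if tag == 'title' and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return ''.join(self._parts).strip() if self._done else None

def title_from_html(html):
    """Return the text of the first <title> element, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title

print(title_from_html('<html><head><TITLE>Hello</TITLE></head></html>'))  # Hello
print(title_from_html('<p>no title here</p>'))                            # None
```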

If you’re handling “real world” tag soup HTML (which frequently does not conform to any SGML/XML validator), then use the BeautifulSoup package. It isn’t in the standard libraries (yet) but is widely recommended for this purpose.

Another option is lxml, which is written for properly structured (standards-conformant) HTML. It has an option to fall back to using BeautifulSoup as a parser: ElementSoup.