I want a regular expression to extract the title from an HTML page. Currently I have this:

Is there a regular expression to extract just the contents of <title> so I don’t have to remove the tags?

Answer 0

Use ( ) in the regexp and group(1) in Python to retrieve the captured string. re.search returns None if it finds no match, so don’t call group() on the result directly:

title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)

if title_search:
    title = title_search.group(1)
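
One caveat with the pattern above: `.*` is greedy and does not match newlines, so a title that spans several lines, or a `<title>` tag that carries attributes, would be missed. A minimal sketch that handles both cases, assuming a hypothetical helper name `extract_title` and that a non-greedy match with `re.DOTALL` suits your input:

```python
import re

# Hypothetical helper, not part of the answer above.
# \b[^>]* tolerates attributes on the opening tag, (.*?) is non-greedy,
# and re.DOTALL lets "." match newlines inside the title.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def extract_title(html, default=None):
    """Return the stripped title text, or `default` when no title is found."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else default
```

Returning a default instead of raising keeps the None check in one place, which is the same concern the answer raises about calling group() directly.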

Answer 1

Note that starting with Python 3.8 and the introduction of assignment expressions (PEP 572, the := operator), it’s possible to improve a bit on Krzysztof Krasoń’s solution by capturing the match result as a variable directly within the if condition and reusing it in the condition’s body:

# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
  title = match.group(1)
# hello

Answer 2


Try using capturing groups:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

Answer 3

re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)

Answer 4

May I recommend Beautiful Soup? It is a very good library for parsing an entire HTML document.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
titleName = soup.title.string

Answer 5


title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

Answer 6

The provided pieces of code do not cope with exceptions. May I suggest:

getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]

This returns the first captured group, or an empty string by default if the pattern has not been found.

Answer 7


I’d think this should suffice:

import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)

… assuming that your text (HTML) is in a variable named “text.”

This also assumes that there are not other HTML tags which can be legally embedded inside of an HTML TITLE tag and no way to legally embed any other < character within such a container/block.


Don’t use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you’re going to write a full parser, which would be a lot of extra work when various HTML, SGML and XML parsers are already in the standard libraries.)

If you’re handling “real world” tag soup HTML (which frequently does not conform to any SGML/XML validator), then use the BeautifulSoup package. It isn’t in the standard libraries (yet) but is widely recommended for this purpose.

Another option is lxml … which is written for properly structured (standards-conformant) HTML. But it has an option to fall back to using BeautifulSoup as a parser: ElementSoup.
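
The standard library’s html.parser module is one such parser. A minimal sketch that pulls out the title without regular expressions, assuming the hypothetical names `TitleParser` and `get_title` (neither appears in the answers above):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # None when no title text was collected.
        return "".join(self._parts).strip() or None

def get_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.title
```

This is only a sketch for reasonably well-formed input; for tag soup, the BeautifulSoup recommendation above still applies.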





Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the body text (article) and maybe even a few tab names here and there. I have tried the suggestion in this SO question that returns lots of <script> tags and html comments which I don’t want. I can’t figure out the arguments I need for the function findAll() in order to just get the visible texts on a webpage.

So, how should I find all visible text excluding scripts, comments, css etc.?

Answer 0


Try this:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # collect every text node, then keep only the visible ones
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))

Answer 1


The approved answer from @jbochi does not work for me. The str() function call raises an exception because it cannot encode the non-ascii characters in the BeautifulSoup element. Here is a more succinct way to filter the example web page to visible text.

from bs4 import BeautifulSoup

html = open('21storm.html').read()
soup = BeautifulSoup(html, 'html.parser')
# remove elements whose content is never rendered
for s in soup(['style', 'script', '[document]', 'head', 'title']):
    s.extract()
visible_text = soup.get_text()

Answer 2

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.yahoo.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
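
The cleanup steps at the end do not depend on Beautiful Soup. Isolating them in a function (the name `normalize_text` and the sample string are made up for illustration) shows what they do to a small input:

```python
def normalize_text(text):
    """Strip each line, split on double spaces, and drop empty chunks."""
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)

sample = "  Top story  \n\n\n   Sports  Weather   \n"
print(normalize_text(sample))
# Top story
# Sports
# Weather
```

Splitting on two spaces is a heuristic: it separates items that were laid out side by side on one line, while leaving ordinary single-spaced sentences intact.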


Answer 3

I completely respect using Beautiful Soup to get rendered content, but it may not be the ideal package for acquiring the rendered content on a page.

I had a similar problem: getting the rendered content, that is, the content visible in a typical browser. In particular, I had many perhaps atypical cases to work with, such as the simple example below. In this case the non-displayable tag is nested in a style tag and is not visible in many browsers that I have checked. Other variations exist, such as defining a class that sets display to none and then using this class for the div.

  <title>  Title here</title>

    lots of text here <p> <br>
    <h1> even headings </h1>

    <style type="text/css"> 
        <div > this will not be visible </div> 

One solution posted above is:

html = Utilities.ReadFile('simple.html')
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)
visible_texts = filter(visible, texts)

[u'\n', u'\n', u'\n\n        lots of text here ', u' ', u'\n', u' even headings ', u'\n', u' this will not be visible ', u'\n', u'\n']

This solution certainly has applications in many cases and generally does the job quite well, but for the HTML posted above it retains the text that is not rendered. After searching SO, a couple of solutions came up: “BeautifulSoup get_text does not strip all tags and JavaScript” and “Rendered HTML to plain text using Python”.

I tried both of these solutions, html2text and nltk.clean_html, and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, the speeds depend heavily on the contents of the data…

One answer here from @Helge was about using nltk, of all things.

import nltk

%timeit nltk.clean_html(html)
# was returning 153 us per loop

It worked really well to return a string with rendered html. This nltk module was faster than even html2text, though perhaps html2text is more robust.

betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
# 3.09 ms per loop

Answer 4

If you care about performance, here’s another more efficient way:

import re

INVISIBLE_ELEMS = ('style', 'script', 'head', 'title')
RE_SPACES = re.compile(r'\s{3,}')

def visible_texts(soup):
    """ get visible text from a document """
    text = ' '.join([
        s for s in soup.strings
        if s.parent.name not in INVISIBLE_ELEMS
    ])
    # collapse runs of three or more whitespace characters to two spaces.
    return RE_SPACES.sub('  ', text)

soup.strings is an iterator, and it returns NavigableString objects, so you can check the parent’s tag name directly without going through multiple loops.

Answer 5

The title is inside an <nyt_headline> tag, which is nested inside an <h1> tag and a <div> tag with id “article”.

soup.findAll('nyt_headline', limit=1)

Should work.

The article body is inside an <nyt_text> tag, which is nested inside a <div> tag with id “articleBody”. Inside the <nyt_text> element, the text itself is contained within <p> tags. Images are not within those <p> tags. It’s difficult for me to experiment with the syntax, but I expect a working scrape to look something like this.

text = soup.findAll('nyt_text', limit=1)[0]

Answer 6

While I would completely suggest using Beautiful Soup in general, if anyone is looking to display the visible parts of malformed HTML (e.g. where you have just a segment or a line of a web page) for whatever reason, the following will remove content between < and > tags:
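
A minimal sketch of that idea, assuming a hypothetical helper named `strip_tags` and accepting that a regex will mangle text containing a literal < or > character:

```python
import re

# Matches anything that looks like a tag: "<", then any non-">" characters, then ">".
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(fragment):
    """Remove anything that looks like <...> from an HTML fragment."""
    return TAG_RE.sub("", fragment)

print(strip_tags("<p>Hello <b>world</b></p> and more"))  # Hello world and more
```

Note that this only deletes the tags themselves; the contents of script or style elements would remain in the output.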

Answer 7

tag = <Parent_Tag_that_contains_the_data>
soup = BeautifulSoup(tag, 'html.parser')

for i in soup.stripped_strings:
    print(repr(i))

With BeautifulSoup, this is the easiest way, with the least code, to get just the strings, without empty lines and other clutter.

Answer 8

The simplest way to handle this case is by using getattr(). You can adapt this example to your needs:

from bs4 import BeautifulSoup

This finds the text "3.7" inside the tag object <span class="ratingsContent">3.7</span> when it exists, and defaults to None when it does not.

getattr(object, name[, default])

Return the value of the named attribute of object. name must be a string. If the string is the name of one of the object’s attributes, the result is the value of that attribute. For example, getattr(x, 'foobar') is equivalent to x.foobar. If the named attribute does not exist, default is returned if provided, otherwise, AttributeError is raised.
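
The default-value behaviour can be seen without any parsing at all. In this sketch, `text_or_default` is a hypothetical helper and a `SimpleNamespace` stands in for a matched tag; `None` is what Beautiful Soup’s find() returns when nothing matches:

```python
from types import SimpleNamespace

def text_or_default(node, default=None):
    """Return node.text, or `default` when node is None or has no .text."""
    return getattr(node, "text", default)

found = SimpleNamespace(text="3.7")  # stand-in for a matched tag
missing = None                       # result of a search that found nothing

print(text_or_default(found))    # 3.7
print(text_or_default(missing))  # None
```

Because None has no text attribute, getattr falls through to the default instead of raising AttributeError.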

Answer 9

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
import re
import ssl

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    # skip text nodes that begin with a newline
    if re.match(r"[\n]+", str(element)):
        return False
    return True

def text_from_html(url):
    # note: certificate verification is disabled for this request
    body = urllib.request.urlopen(url, context=ssl._create_unverified_context()).read()
    soup = BeautifulSoup(body, "lxml")
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    text = u",".join(t.strip() for t in visible_texts)
    text = text.strip()
    text = text.split(',')
    clean_text = ''
    for sen in text:
        if sen:
            sen = sen.strip()
            clean_text += sen + ','
    return clean_text

url = 'http://www.nytimes.com/2009/12/21/us/21storm.html'