标签归档:html-content-extraction

提取正则表达式匹配项的一部分

问题:提取正则表达式匹配项的一部分

我想要一个正则表达式从HTML页面提取标题。目前我有这个:

title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '') 

是否有一个正则表达式仅提取<title>的内容,所以我不必删除标签?

I want a regular expression to extract the title from a HTML page. Currently I have this:

title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '') 

Is there a regular expression to extract just the contents of <title> so I don’t have to remove the tags?


回答 0

( )在正则表达式和group(1)python中检索捕获的字符串(re.search将返回None如果没有找到结果,所以不要用group()直接):

title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)

if title_search:
    title = title_search.group(1)

Use ( ) in regexp and group(1) in python to retrieve the captured string (re.search will return None if it doesn’t find the result, so don’t use group() directly):

title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)

if title_search:
    title = title_search.group(1)

回答 1

请注意,通过开始Python 3.8并引入赋值表达式(PEP 572):=运算符),可以通过在if条件中直接将匹配结果捕获为变量并将其在条件体内重复使用,从而对KrzysztofKrasoń解决方案进行一些改进:

# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
  title = match.group(1)
# hello

Note that starting Python 3.8, and the introduction of assignment expressions (PEP 572) (:= operator), it’s possible to improve a bit on Krzysztof Krasoń’s solution by capturing the match result directly within the if condition as a variable and re-use it in the condition’s body:

# pattern = '<title>(.*)</title>'
# text = '<title>hello</title>'
if match := re.search(pattern, text, re.IGNORECASE):
  title = match.group(1)
# hello

回答 2

尝试使用捕获组:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

Try using capturing groups:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

回答 3

re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)
re.search('<title>(.*)</title>', s, re.IGNORECASE).group(1)

回答 4

我可以推荐你去美丽汤。汤是一个很好的库,可以解析您的所有html文档。

soup = BeatifulSoup(html_doc)
titleName = soup.title.name

May I recommend you to Beautiful Soup. Soup is a very good lib to parse all of your html document.

soup = BeatifulSoup(html_doc)
titleName = soup.title.name

回答 5

尝试:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

Try:

title = re.search('<title>(.*)</title>', html, re.IGNORECASE).group(1)

回答 6

提供的代码段不能应付Exceptions 我的建议

getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]

如果未找到模式或第一个匹配项,则默认情况下返回空字符串。

The provided pieces of code do not cope with Exceptions May I suggest

getattr(re.search(r"<title>(.*)</title>", s, re.IGNORECASE), 'groups', lambda:[u""])()[0]

This returns an empty string by default if the pattern has not been found, or the first match.


回答 7

我认为这足够了:

#!python
import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)

…假设您的文本(HTML)位于名为“ text”的变量中。

这也假定没有其他HTML标记可以合法地嵌入HTML TITLE标记内部,并且没有办法合法地将任何其他<字符嵌入这样的容器/块中。

但是

不要在Python中使用正则表达式进行HTML解析。使用HTML解析器!(除非您要编写完整的解析器,否则当标准库中已经包含各种HTML,SGML和XML解析器时,这将是一项额外的工作。

如果您处理“真实世界” 标记汤 HTML(通常不符合任何SGML / XML验证器),请使用BeautifulSoup包。它尚未出现在标准库中,但为此目的广泛建议使用。

另一个选项是:lxml …,它是为结构正确(符合标准的HTML)编写的。但是它可以选择退回到使用BeautifulSoup作为解析器:ElementSoup

I’d think this should suffice:

#!python
import re
pattern = re.compile(r'<title>([^<]*)</title>', re.MULTILINE|re.IGNORECASE)
pattern.search(text)

… assuming that your text (HTML) is in a variable named “text.”

This also assumes that there are not other HTML tags which can be legally embedded inside of an HTML TITLE tag and no way to legally embed any other < character within such a container/block.

However

Don’t use regular expressions for HTML parsing in Python. Use an HTML parser! (Unless you’re going to write a full parser, which would be a of extra work when various HTML, SGML and XML parsers are already in the standard libraries.

If your handling “real world” tag soup HTML (which is frequently non-conforming to any SGML/XML validator) then use the BeautifulSoup package. It isn’t in the standard libraries (yet) but is wide recommended for this purpose.

Another option is: lxml … which is written for properly structured (standards conformant) HTML. But it has an option to fallback to using BeautifulSoup as a parser: ElementSoup.


BeautifulSoup抓取可见网页文本

问题:BeautifulSoup抓取可见网页文本

基本上,我想使用BeautifulSoup来严格抓取网页上的可见文本。例如,此网页是我的测试用例。我主要想获取正文文本(文章),甚至在这里和那里甚至几个标签名称。我尝试了这个SO问题中的建议,该建议返回很多<script>我不想要的标签和html注释。我无法弄清楚该函数所需的参数findAll(),以便仅获取网页上的可见文本。

那么,我应该如何查找除脚本,注释,CSS等之外的所有可见文本?

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the body text (article) and maybe even a few tab names here and there. I have tried the suggestion in this SO question that returns lots of <script> tags and html comments which I don’t want. I can’t figure out the arguments I need for the function findAll() in order to just get the visible texts on a webpage.

So, how should I find all visible text excluding scripts, comments, css etc.?


回答 0

试试这个:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))

Try this:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))

回答 1

@jbochi批准的答案对我不起作用。str()函数调用会引发异常,因为它无法对BeautifulSoup元素中的非ASCII字符进行编码。这是将示例网页过滤为可见文本的一种更为简洁的方法。

html = open('21storm.html').read()
soup = BeautifulSoup(html)
[s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
visible_text = soup.getText()

The approved answer from @jbochi does not work for me. The str() function call raises an exception because it cannot encode the non-ascii characters in the BeautifulSoup element. Here is a more succinct way to filter the example web page to visible text.

html = open('21storm.html').read()
soup = BeautifulSoup(html)
[s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
visible_text = soup.getText()

回答 2

import urllib
from bs4 import BeautifulSoup

url = "https://www.yahoo.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text.encode('utf-8'))
import urllib
from bs4 import BeautifulSoup

url = "https://www.yahoo.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text.encode('utf-8'))

回答 3

我完全尊重使用Beautiful Soup获取呈现的内容,但是它可能不是获取页面上呈现的内容的理想软件包。

我遇到了类似的问题,无法获取渲染的内容或典型浏览器中的可见内容。特别是,在下面的一个简单示例中,我可能有许多非典型案例。在这种情况下,不可显示的标签嵌套在样式标签中,在我检查过的许多浏览器中都不可见。存在其他变体,例如将类标签设置显示定义为无。然后将此类用于div。

<html>
  <title>  Title here</title>

  <body>

    lots of text here <p> <br>
    <h1> even headings </h1>

    <style type="text/css"> 
        <div > this will not be visible </div> 
    </style>


  </body>

</html>

上面发布的一种解决方案是:

html = Utilities.ReadFile('simple.html')
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)
visible_texts = filter(visible, texts)
print(visible_texts)


[u'\n', u'\n', u'\n\n        lots of text here ', u' ', u'\n', u' even headings ', u'\n', u' this will not be visible ', u'\n', u'\n']

该解决方案当然在许多情况下都有应用程序,并且通常可以很好地完成工作,但是在上面发布的html中,它保留了未呈现的文本。经过搜索之后,这里出现了一些解决方案,BeautifulSoup get_text不会剥离所有标签和JavaScript ,这里是使用Python将HTML渲染为纯文本的方式

我尝试了这两种解决方案:html2text和nltk.clean_html,并且对计时结果感到惊讶,因此认为它们值得后代的答案。当然,速度很大程度上取决于数据的内容。

@Helge的一个答案是关于使用nltk的所有东西。

import nltk

%timeit nltk.clean_html(html)
was returning 153 us per loop

返回带有呈现的html的字符串的效果很好。这个nltk模块甚至比html2text还要快,尽管html2text可能更健壮。

betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
%3.09 ms per loop

I completely respect using Beautiful Soup to get rendered content, but it may not be the ideal package for acquiring the rendered content on a page.

I had a similar problem to get rendered content, or the visible content in a typical browser. In particular I had many perhaps atypical cases to work with such a simple example below. In this case the non displayable tag is nested in a style tag, and is not visible in many browsers that I have checked. Other variations exist such as defining a class tag setting display to none. Then using this class for the div.

<html>
  <title>  Title here</title>

  <body>

    lots of text here <p> <br>
    <h1> even headings </h1>

    <style type="text/css"> 
        <div > this will not be visible </div> 
    </style>


  </body>

</html>

One solution posted above is:

html = Utilities.ReadFile('simple.html')
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)
visible_texts = filter(visible, texts)
print(visible_texts)


[u'\n', u'\n', u'\n\n        lots of text here ', u' ', u'\n', u' even headings ', u'\n', u' this will not be visible ', u'\n', u'\n']

This solution certainly has applications in many cases and does the job quite well generally but in the html posted above it retains the text that is not rendered. After searching SO a couple solutions came up here BeautifulSoup get_text does not strip all tags and JavaScript and here Rendered HTML to plain text using Python

I tried both these solutions: html2text and nltk.clean_html and was surprised by the timing results so thought they warranted an answer for posterity. Of course, the speeds highly depend on the contents of the data…

One answer here from @Helge was about using nltk of all things.

import nltk

%timeit nltk.clean_html(html)
was returning 153 us per loop

It worked really well to return a string with rendered html. This nltk module was faster than even html2text, though perhaps html2text is more robust.

betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
%3.09 ms per loop

回答 4

如果您关心性能,这是另一种更有效的方法:

import re

INVISIBLE_ELEMS = ('style', 'script', 'head', 'title')
RE_SPACES = re.compile(r'\s{3,}')

def visible_texts(soup):
    """ get visible text from a document """
    text = ' '.join([
        s for s in soup.strings
        if s.parent.name not in INVISIBLE_ELEMS
    ])
    # collapse multiple spaces to two spaces.
    return RE_SPACES.sub('  ', text)

soup.strings是一个迭代器,它返回,NavigableString以便您可以直接检查父级的标记名,而无需经历多个循环。

If you care about performance, here’s another more efficient way:

import re

INVISIBLE_ELEMS = ('style', 'script', 'head', 'title')
RE_SPACES = re.compile(r'\s{3,}')

def visible_texts(soup):
    """ get visible text from a document """
    text = ' '.join([
        s for s in soup.strings
        if s.parent.name not in INVISIBLE_ELEMS
    ])
    # collapse multiple spaces to two spaces.
    return RE_SPACES.sub('  ', text)

soup.strings is an iterator, and it returns NavigableString so that you can check the parent’s tag name directly, without going through multiple loops.


回答 5

标题位于<nyt_headline>标签内,该标签嵌套在<h1>标签和<div>ID为“ article” 的标签内。

soup.findAll('nyt_headline', limit=1)

应该管用。

文章正文位于<nyt_text>标记内,该标记嵌套在<div>ID为“ articleBody” 的标记内。在<nyt_text> 元素内部,文本本身包含在<p> 标签中。图片不在这些<p>标签内。对我来说,尝试语法很难,但是我希望工作的草稿看起来像这样。

text = soup.findAll('nyt_text', limit=1)[0]
text.findAll('p')

The title is inside an <nyt_headline> tag, which is nested inside an <h1> tag and a <div> tag with id “article”.

soup.findAll('nyt_headline', limit=1)

Should work.

The article body is inside an <nyt_text> tag, which is nested inside a <div> tag with id “articleBody”. Inside the <nyt_text> element, the text itself is contained within <p> tags. Images are not within those <p> tags. It’s difficult for me to experiment with the syntax, but I expect a working scrape to look something like this.

text = soup.findAll('nyt_text', limit=1)[0]
text.findAll('p')

回答 6

虽然,我会完全建议一般使用精美的汤,但是,如果有人希望显示格式错误的html的可见部分(例如,您只有网页的一段或一行),无论出于何种原因,以下内容将删除<>标签之间的内容:

import re   ## only use with malformed html - this is not efficient
def display_visible_html_using_re(text):             
    return(re.sub("(\<.*?\>)", "",text))

While, i would completely suggest using beautiful-soup in general, if anyone is looking to display the visible parts of a malformed html (e.g. where you have just a segment or line of a web-page) for whatever-reason, the the following will remove content between < and > tags:

import re   ## only use with malformed html - this is not efficient
def display_visible_html_using_re(text):             
    return(re.sub("(\<.*?\>)", "",text))

回答 7

使用BeautifulSoup是最简单的方法,只需较少的代码即可获取字符串,而不会出现空行和废话。

tag = <Parent_Tag_that_contains_the_data>
soup = BeautifulSoup(tag, 'html.parser')

for i in soup.stripped_strings:
    print repr(i)

Using BeautifulSoup the easiest way with less code to just get the strings, without empty lines and crap.

tag = <Parent_Tag_that_contains_the_data>
soup = BeautifulSoup(tag, 'html.parser')

for i in soup.stripped_strings:
    print repr(i)

回答 8

处理这种情况的最简单方法是使用getattr()。您可以根据需要调整此示例:

from bs4 import BeautifulSoup

source_html = """
<span class="ratingsDisplay">
    <a class="ratingNumber" href="https://www.youtube.com/watch?v=oHg5SJYRHA0" target="_blank" rel="noopener">
        <span class="ratingsContent">3.7</span>
    </a>
</span>
"""

soup = BeautifulSoup(source_html, "lxml")
my_ratings = getattr(soup.find('span', {"class": "ratingsContent"}), "text", None)
print(my_ratings)

如果存在,它将"3.7"在标记对象中找到文本元素,<span class="ratingsContent">3.7</span>但是默认为NoneType不存在时。

getattr(object, name[, default])

返回对象的命名属性的值。名称必须是字符串。如果字符串是对象属性之一的名称,则结果是该属性的值。例如,getattr(x,’foobar’)等同于x.foobar。如果命名属性不存在,则返回默认值(如果提供),否则引发AttributeError。

The simplest way to handle this case is by using getattr(). You can adapt this example to your needs:

from bs4 import BeautifulSoup

source_html = """
<span class="ratingsDisplay">
    <a class="ratingNumber" href="https://www.youtube.com/watch?v=oHg5SJYRHA0" target="_blank" rel="noopener">
        <span class="ratingsContent">3.7</span>
    </a>
</span>
"""

soup = BeautifulSoup(source_html, "lxml")
my_ratings = getattr(soup.find('span', {"class": "ratingsContent"}), "text", None)
print(my_ratings)

This will find the text element,"3.7", within the tag object <span class="ratingsContent">3.7</span> when it exists, however, default to NoneType when it does not.

getattr(object, name[, default])

Return the value of the named attribute of object. name must be a string. If the string is the name of one of the object’s attributes, the result is the value of that attribute. For example, getattr(x, ‘foobar’) is equivalent to x.foobar. If the named attribute does not exist, default is returned if provided, otherwise, AttributeError is raised.


回答 9

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
import re
import ssl

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    if re.match(r"[\n]+",str(element)): return False
    return True
def text_from_html(url):
    body = urllib.request.urlopen(url,context=ssl._create_unverified_context()).read()
    soup = BeautifulSoup(body ,"lxml")
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    text = u",".join(t.strip() for t in visible_texts)
    text = text.lstrip().rstrip()
    text = text.split(',')
    clean_text = ''
    for sen in text:
        if sen:
            sen = sen.rstrip().lstrip()
            clean_text += sen+','
    return clean_text
url = 'http://www.nytimes.com/2009/12/21/us/21storm.html'
print(text_from_html(url))
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
import re
import ssl

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    if re.match(r"[\n]+",str(element)): return False
    return True
def text_from_html(url):
    body = urllib.request.urlopen(url,context=ssl._create_unverified_context()).read()
    soup = BeautifulSoup(body ,"lxml")
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    text = u",".join(t.strip() for t in visible_texts)
    text = text.lstrip().rstrip()
    text = text.split(',')
    clean_text = ''
    for sen in text:
        if sen:
            sen = sen.rstrip().lstrip()
            clean_text += sen+','
    return clean_text
url = 'http://www.nytimes.com/2009/12/21/us/21storm.html'
print(text_from_html(url))