I’m working in Python and using Flask. When I run my main Python file on my computer, it works perfectly, but when I activate venv and run the Flask Python file in the terminal, it says that my main Python file has “No Module Named bs4.” Any comments or advice is greatly appreciated.
Activate the virtualenv, and then install BeautifulSoup4:
$ pip install BeautifulSoup4
When you installed bs4 with easy_install, you installed it system-wide. So your system python can import it, but not your virtualenv python.
If you do not need bs4 to be installed in your system python path, uninstall it and keep it in your virtualenv.
A lot of tutorials/references were written for Python 2 and tell you to use pip install somename. If you’re using Python 3 you want to change that to pip3 install somename.
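As a quick sanity check (the paths shown are just illustrative), you can confirm that the activated virtualenv's interpreter is the one doing the installing:
$ source venv/bin/activate
$ which python          # should point inside the venv, e.g. /path/to/project/venv/bin/python
$ python -m pip install beautifulsoup4   # installs into that same interpreter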
I want to make a website that shows a comparison of Amazon and eBay product prices.
Which of these will work better, and why? I am somewhat familiar with BeautifulSoup but not so much with the Scrapy crawler.
Scrapy is a web spider or web scraper framework. You give Scrapy a root URL to start crawling from, then you can specify constraints on how many URLs you want to crawl and fetch, etc. It is a complete framework for web scraping or crawling.
While
BeautifulSoup is a parsing library; paired with something that fetches pages for you (urllib, requests), it does a pretty good job of pulling content from a URL and lets you parse certain parts of it without any hassle. It only processes the content of the URL that you give it and then stops. It does not crawl unless you manually put it inside an infinite loop with certain criteria.
In simple words, with Beautiful Soup you can build something similar to Scrapy.
Beautiful Soup is a library while Scrapy is a complete framework.
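To make that contrast concrete, here is a minimal sketch (not from the original answers; the root URL and the "stay on this domain" rule are illustrative) of the kind of hand-rolled crawl loop you would have to write around Beautiful Soup to approximate what Scrapy gives you out of the box:
import urllib.parse
from collections import deque

import requests
from bs4 import BeautifulSoup

seen = set()
queue = deque(["https://example.com/"])   # illustrative root URL

while queue:
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # ... extract whatever data you need from `soup` here ...
    for a in soup.find_all("a", href=True):
        link = urllib.parse.urljoin(url, a["href"])
        if link.startswith("https://example.com/") and link not in seen:
            queue.append(link)
Scrapy handles the queueing, deduplication, politeness, and concurrency for you; with Beautiful Soup you own all of that code yourself.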
I think both are good… I'm doing a project right now that uses both. First I scrape all the pages using Scrapy and save them in a MongoDB collection using its pipelines, also downloading the images that exist on each page.
After that I use BeautifulSoup4 for post-processing, where I must change attribute values and get some special tags.
If you don't know which product pages you want, a good tool will be Scrapy, since you can use its crawlers to run over the whole Amazon/eBay website looking for products without writing an explicit for loop.
Take a look at the Scrapy documentation; it's very simple to use.
The way I do it is to use the eBay/Amazon APIs rather than Scrapy, and then parse the results using BeautifulSoup.
The APIs give you an official way of getting the same data that you would have got from a Scrapy crawler, with no need to worry about hiding your identity, messing about with proxies, etc.
Scrapy
It is a web scraping framework which comes with tons of goodies that make scraping easier, so that we can focus on the crawling logic only. Some of my favourite things Scrapy takes care of for us are below.
Feed exports: It basically allows us to save data in various formats like CSV, JSON, JSON Lines and XML.
Asynchronous scraping: Scrapy uses the Twisted framework, which gives us the power to visit multiple URLs at once, with each request processed in a non-blocking way (basically, we don't have to wait for one request to finish before sending another).
Selectors: This is where we can compare Scrapy with Beautiful Soup. Selectors are what allow us to select particular data from the webpage (a heading, a certain div with a class name, etc.). Scrapy uses lxml for parsing, which is much faster than Beautiful Soup.
Setting proxies, user agents, headers, etc.: Scrapy allows us to set and rotate proxies and other headers dynamically.
Item pipelines: Pipelines enable us to process data after extraction. For example, we can configure a pipeline to push data to your MySQL server.
Cookies: Scrapy automatically handles cookies for us.
etc.
TLDR: Scrapy is a framework that provides everything one might need to build large-scale crawlers. It provides various features that hide the complexity of crawling the web, so one can simply start writing web crawlers without worrying about the setup burden (a minimal spider sketch follows below).
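As a rough illustration, here is a minimal spider sketch (not taken from any of the answers; the site structure, spider name, and CSS selectors are hypothetical) showing how the selector, crawling, and export concerns from the list above live in one class:
import scrapy

class PriceSpider(scrapy.Spider):
    name = "price_spider"
    start_urls = ["https://example.com/products"]   # hypothetical listing page

    def parse(self, response):
        # Selectors (CSS here, XPath also works) pull out the data we care about.
        for product in response.css("div.product"):
            yield {
                "title": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
            }
        # Follow pagination; Scrapy schedules these requests asynchronously.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Running it with scrapy crawl price_spider -o prices.json would use the feed exports mentioned above to write the yielded items to JSON.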
Beautiful soup
Beautiful Soup is a Python package for parsing HTML and XML documents. So with Beautiful Soup you can parse a webpage that has already been downloaded. BS4 is very popular and has been around for a long time. Unlike Scrapy, you cannot use Beautiful Soup alone to make crawlers. You will need other libraries like requests or urllib to make crawlers with bs4. Again, this means you would need to manage the list of URLs being crawled and to be crawled, handle cookies, manage proxies, handle errors, and create your own functions to push data to CSV, JSON, XML, etc. If you want to speed things up, you will have to use other libraries like multiprocessing.
To sum up:
Scrapy is a rich framework that you can use to start writing crawlers without any hassle.
Beautiful Soup is a library that you can use to parse a webpage. It cannot be used on its own to scrape the web.
You should definitely use Scrapy for your Amazon and eBay product price comparison website. You could build a database of URLs, run the crawler every day (cron jobs, or Celery for scheduling crawls), and update the prices in your database. This way your website will always pull from the database, and the crawler and the database will act as independent components (see the scheduling sketch below).
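If you go the Celery route, a daily schedule might look roughly like the sketch below (the broker URL, task module path, and crawl invocation are all hypothetical; how you actually launch the spider depends on your setup, e.g. a subprocess or scrapyd):
from celery import Celery
from celery.schedules import crontab

app = Celery("price_updater", broker="redis://localhost:6379/0")   # hypothetical broker

@app.task
def run_price_crawl():
    # Kick off the Scrapy crawl here and write the scraped prices to the database.
    pass

app.conf.beat_schedule = {
    "daily-price-crawl": {
        "task": "tasks.run_price_crawl",        # task path depends on where this module lives
        "schedule": crontab(hour=3, minute=0),  # every day at 03:00
    },
}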
BeautifulSoup is a library that lets you extract information from a web page.
Scrapy, on the other hand, is a framework which does the above and many more things you will probably need in your scraping project, such as pipelines for saving data.
Using Scrapy you can save tons of code and start with structured programming; if you don't like any of Scrapy's pre-written methods, BeautifulSoup can be used in place of a Scrapy method.
Big projects take advantage of both.
Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the body text (article) and maybe even a few tab names here and there. I have tried the suggestion in this SO question that returns lots of <script> tags and html comments which I don’t want. I can’t figure out the arguments I need for the function findAll() in order to just get the visible texts on a webpage.
So, how should I find all visible text excluding scripts, comments, css etc.?
Answer 0
Try this:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))
The approved answer from @jbochi does not work for me. The str() function call raises an exception because it cannot encode the non-ascii characters in the BeautifulSoup element. Here is a more succinct way to filter the example web page to visible text.
html = open('21storm.html').read()
soup = BeautifulSoup(html)
[s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
visible_text = soup.getText()
Answer 2
import urllib
from bs4 import BeautifulSoup
url = "https://www.yahoo.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style"]):
script.extract() # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text.encode('utf-8'))
I completely respect using Beautiful Soup to get rendered content, but it may not be the ideal package for acquiring the rendered content of a page.
I had a similar problem getting rendered content, i.e. the content visible in a typical browser. In particular, I had many perhaps-atypical cases to work with, such as the simple example below. In this case the non-displayable tag is nested in a style tag, and is not visible in many browsers that I have checked. Other variations exist, such as defining a class that sets display to none and then using that class for the div.
<html>
<title> Title here</title>
<body>
lots of text here <p> <br>
<h1> even headings </h1>
<style type="text/css">
<div > this will not be visible </div>
</style>
</body>
</html>
One solution posted above is:
html = Utilities.ReadFile('simple.html')
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)
visible_texts = filter(visible, texts)
print(visible_texts)
[u'\n', u'\n', u'\n\n lots of text here ', u' ', u'\n', u' even headings ', u'\n', u' this will not be visible ', u'\n', u'\n']
I tried both these solutions: html2text and nltk.clean_html and was surprised by the timing results so thought they warranted an answer for posterity. Of course, the speeds highly depend on the contents of the data…
One answer here from @Helge was about using nltk of all things.
import nltk
%timeit nltk.clean_html(html)
was returning 153 us per loop
It worked really well to return a string with rendered html. This nltk module was faster than even html2text, though perhaps html2text is more robust.
betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
3.09 ms per loop
Answer 4
If you care about performance, here’s another more efficient way:
import re
INVISIBLE_ELEMS = ('style', 'script', 'head', 'title')
RE_SPACES = re.compile(r'\s{3,}')
def visible_texts(soup):
""" get visible text from a document """
text = ' '.join([
s for s in soup.strings
if s.parent.name not in INVISIBLE_ELEMS
])
# collapse multiple spaces to two spaces.
return RE_SPACES.sub(' ', text)
soup.strings is an iterator, and it returns NavigableString so that you can check the parent’s tag name directly, without going through multiple loops.
The title is inside an <nyt_headline> tag, which is nested inside an <h1> tag and a <div> tag with id “article”.
soup.findAll('nyt_headline', limit=1)
Should work.
The article body is inside an <nyt_text> tag, which is nested inside a <div> tag with id “articleBody”. Inside the <nyt_text> element, the text itself is contained within <p> tags. Images are not within those <p> tags. It’s difficult for me to experiment with the syntax, but I expect a working scrape to look something like this.
text = soup.findAll('nyt_text', limit=1)[0]
text.findAll('p')
While I would completely suggest using Beautiful Soup in general, if anyone is looking to display the visible parts of malformed HTML (e.g. where you have just a segment or a line of a web page) for whatever reason, the following will remove content between < and > tags:
import re  # only use with malformed html - this is not efficient

def display_visible_html_using_re(text):
    return re.sub(r"(\<.*?\>)", "", text)
Answer 7
Using BeautifulSoup is the easiest way, with the least code, to get the strings without blank lines and junk.
tag = <Parent_Tag_that_contains_the_data>
soup = BeautifulSoup(tag, 'html.parser')
for i in soup.stripped_strings:
    print repr(i)
This will find the text element, "3.7", within the tag object <span class="ratingsContent">3.7</span> when it exists, and default to None when it does not.
Return the value of the named attribute of object. name must be a string. If the string is the name of one of the object’s attributes, the result is the value of that attribute. For example, getattr(x, ‘foobar’) is equivalent to x.foobar. If the named attribute does not exist, default is returned if provided, otherwise, AttributeError is raised.
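The code this answer refers to did not survive extraction, but a minimal sketch of the pattern being described (the class name comes from the quoted tag; the variable names are illustrative) looks like this:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<span class="ratingsContent">3.7</span>', 'html.parser')
# getattr returns the tag's .text when find() matches, and the supplied default (None)
# when find() returns None, instead of raising an AttributeError.
rating = getattr(soup.find('span', class_='ratingsContent'), 'text', None)
print(rating)   # "3.7", or None if the span is absent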
Answer 9
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
import re
import ssl
def tag_visible(element):
if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
return False
if isinstance(element, Comment):
return False
if re.match(r"[\n]+",str(element)): return False
return True
def text_from_html(url):
body = urllib.request.urlopen(url,context=ssl._create_unverified_context()).read()
soup = BeautifulSoup(body ,"lxml")
texts = soup.findAll(text=True)
visible_texts = filter(tag_visible, texts)
text = u",".join(t.strip() for t in visible_texts)
text = text.lstrip().rstrip()
text = text.split(',')
clean_text = ''
for sen in text:
if sen:
sen = sen.rstrip().lstrip()
clean_text += sen+','
return clean_text
url = 'http://www.nytimes.com/2009/12/21/us/21storm.html'
print(text_from_html(url))
I am trying to extract the content of a single “value” attribute in a specific “input” tag on a webpage. I use the following code:
import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)
inputTag = soup.findAll(attrs={"name" : "stainfo"})
output = inputTag['value']
print str(output)
I get a TypeError: list indices must be integers, not str
even though from the BeautifulSoup documentation I understand that strings should not be a problem here… but I am no specialist and I may have misunderstood.
Any suggestion is greatly appreciated!
Thanks in advance.
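For reference, the usual fix here (a sketch, assuming the page really does contain an input named stainfo) is to index into the list that findAll returns before asking for the attribute, or to use find, which returns a single tag:
inputTag = soup.findAll(attrs={"name": "stainfo"})
output = inputTag[0]['value']    # findAll returns a list, so take an element first

# or, equivalently:
inputTag = soup.find(attrs={"name": "stainfo"})
output = inputTag['value']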
Processing repElem...
Attribute id = 11
Attribute name = Joe
Processing repElem...
Attribute id = 12
Attribute name = Mary
Answer 2
If you want to retrieve multiple values of attributes from the source above, you can use findAll and a list comprehension to get everything you need:
import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)
inputTags = soup.findAll(attrs={"name" : "stainfo"})
### You may be able to do findAll("input", attrs={"name" : "stainfo"})
output = [x["value"] for x in inputTags]  # "value" is the attribute we want; x["stainfo"] would raise a KeyError
print output
### This will print a list of the values.
Answer 3
I would actually suggest a time-saving approach, assuming that you know which kind of tag has those attributes.
Suppose a tag xyz has an attribute named "staininfo":
full_tag = soup.findAll("xyz")
And note that full_tag is a list:
for each_tag in full_tag:
    staininfo_attrb_value = each_tag["staininfo"]
    print staininfo_attrb_value
import requests
from bs4 import BeautifulSoup
import csv
url = "http://58.68.130.147/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
get_details = soup.find_all("input", attrs={"name":"stainfo"})
for val in get_details:
get_val = val["value"]
print(get_val)
Answer 5
I am using this with Beautifulsoup 4.8.1 to get the value of all class attributes of certain elements:
from bs4 import BeautifulSoup
html = "<td class='val1'/><td col='1'/><td class='val2' />"
bsoup = BeautifulSoup(html, 'html.parser')
for td in bsoup.find_all('td'):
if td.has_attr('class'):
print(td['class'][0])
It's important to note that the attribute key retrieves a list even when the attribute has only a single value.
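A quick illustration of that point (hypothetical markup): class is treated as a multi-valued attribute and always comes back as a list, while an ordinary attribute comes back as a plain string:
from bs4 import BeautifulSoup

td = BeautifulSoup("<td class='val1 highlight' col='1'/>", 'html.parser').td
print(td['class'])   # ['val1', 'highlight'] - a list, even for a single class
print(td['col'])     # '1' - a plain string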
Now in the above code we can use findAll to get tags and information related to them, but I want to use xpath. Is it possible to use xpath with BeautifulSoup? If possible, can anyone please provide me an example code so that it will be more helpful?
Nope, BeautifulSoup, by itself, does not support XPath expressions.
An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode where it’ll try and parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe is faster.
Once you’ve parsed your document into an lxml tree, you can use the .xpath() method to search for elements.
try:
# Python 2
from urllib2 import urlopen
except ImportError:
from urllib.request import urlopen
from lxml import etree
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)
Note that in the above example I passed the response object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests library, you want to set stream=True and pass in the response.raw object after enabling transparent transport decompression:
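The code block this sentence introduces appears to have been lost in extraction; based on the description above, the requests variant would look roughly like this:
import lxml.html
import requests

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True    # enable transparent transport decompression
tree = lxml.html.parse(response.raw)  # parse straight from the stream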
Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:
from lxml.cssselect import CSSSelector
td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
# Do something with these table cells.
Coming full circle: BeautifulSoup itself does have very complete CSS selector support:
for cell in soup.select('table#foobar td.empformbody'):
# Do something with these table cells.
As others have said, BeautifulSoup doesn’t have xpath support. There are probably a number of ways to get something from an xpath, including using Selenium. However, here’s a solution that works in either Python 2 or 3:
from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')
print('Buyers: ', buyers)
print('Prices: ', prices)
I've searched through their docs and it seems there is no XPath option. Also, as you can see here on a similar question on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion would be: no, there is no XPath parsing available.
Answer 5
When you use lxml, it's all simple:
tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')
As you can see, this does not support sub-tags, so I removed the "/@href" part.
Answer 6
Maybe you can try the following, without XPath:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<html>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
'''
# What XPath can do, so can it
doc = SimplifiedDoc(html)
# The result is the same as doc.getElementByTag('body').getElementByTag('div').getElementByTag('h1').text
print (doc.body.div.h1.text)
print (doc.div.h1.text)
print (doc.h1.text) # Shorter paths will be faster
print (doc.div.getChildren())
print (doc.div.getChildren('p'))
Answer 7
from lxml import etree
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('path of your localfile.html'),'html.parser')
dom = etree.HTML(str(soup))
print dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()')
The above uses a combination of the Soup object with lxml, so you can extract values using XPath.
This is a pretty old thread, but there is a work-around solution now, which may not have been in BeautifulSoup at the time.
Here is an example of what I did. I use the “requests” module to read an RSS feed and get its text content in a variable called “rss_text”. With that, I run it thru BeautifulSoup, search for the xpath /rss/channel/title, and retrieve its contents. It’s not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.
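The snippet this answer refers to was also lost in extraction; a minimal sketch of walking a basic path such as /rss/channel/title with BeautifulSoup's dotted tag access (the feed URL is illustrative, and the 'xml' feature requires lxml to be installed) would be:
import requests
from bs4 import BeautifulSoup

rss_text = requests.get('https://example.com/feed.rss').text   # illustrative feed URL
rss_obj = BeautifulSoup(rss_text, 'xml')
title = rss_obj.rss.channel.title.get_text()   # follows the path /rss/channel/title
print(title)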
Why does this NOT return the <div id="articlebody"> ... </div> tags and stuff in between? It returns nothing. And I know for a fact it exists because I’m staring right at it from
soup.prettify()
soup.find("div", { "id" : "articlebody" }) also does not work.
(EDIT: I found that BeautifulSoup wasn’t correctly parsing my page, which probably meant the page I was trying to parse isn’t properly formatted in SGML or whatever)
If you need to specify the element’s type, you can add a type selector before the id selector:
soup.select('div#articlebody')
The .select() method will return a collection of elements, which means that it would return the same results as the following .find_all() method example:
soup.find_all('div', id="articlebody")
# or
soup.find_all(id="articlebody")
If you only want to select a single element, then you could just use the .find() method:
soup.find('div', id="articlebody")
# or
soup.find(id="articlebody")
I think there is a problem when the 'div' tags are too deeply nested. I am trying to parse some contacts from a Facebook HTML file, and BeautifulSoup is not able to find "div" tags with class "fcontent".
This happens with other classes as well. When I search for divs in general, it returns only those that are not so deeply nested.
The HTML source code can be any page from Facebook of the friends list of a friend of yours (not one of your own friends). If someone can test it and give some advice I would really appreciate it.
This is my code, where I just try to print the number of tags “div” with class “fcontent”:
from BeautifulSoup import BeautifulSoup
f = open('/Users/myUserName/Desktop/contacts.html')
soup = BeautifulSoup(f)
list = soup.findAll('div', attrs={'class':'fcontent'})
print len(list)
The id attribute is always unique within a page, which means you can use it directly without even specifying the element type. Therefore, it is a plus point if your elements have it when you parse through the content.
Here’s a short snippet using the SoupStrainer class in BeautifulSoup:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer
http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')
for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
if link.has_attr('href'):
print(link['href'])
The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios:
Edit: Note that I used the SoupStrainer class because it’s a bit more efficient (memory and speed wise), if you know what you’re parsing in advance.
Answer 1
For completeness' sake, the BeautifulSoup 4 version, making use of the encoding supplied by the server as well:
from bs4 import BeautifulSoup
import urllib.request
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))
for link in soup.find_all('a', href=True):
print(link['href'])
or the Python 2 version:
from bs4 import BeautifulSoup
import urllib2
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().getparam('charset'))
for link in soup.find_all('a', href=True):
print link['href']
and a version using the requests library, which as written will work in both Python 2 and 3:
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("http://www.gpsbasecamp.com/national-parks")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
for link in soup.find_all('a', href=True):
print(link['href'])
The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.
BeautifulSoup 3 stopped development in March 2012; new projects really should use BeautifulSoup 4, always.
Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can inform BeautifulSoup of the characterset found in the HTTP response headers to assist in decoding, but this can be wrong and conflicting with a <meta> header info found in the HTML itself, which is why the above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding() to make sure that such embedded encoding hints win over a misconfigured server.
With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* mimetype, even if no characterset was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.
Others have recommended BeautifulSoup, but it’s much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It’s much, much faster than BeautifulSoup, and it even handles “broken” HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don’t want to learn the lxml API.
There’s no reason to use BeautifulSoup anymore, unless you’re on Google App Engine or something where anything not purely Python isn’t allowed.
lxml.html also supports CSS3 selectors so this sort of thing is trivial.
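For instance (a sketch in the same Python 2 style as the example that follows; lxml's cssselect() method needs the cssselect package installed):
import urllib
import lxml.html

connection = urllib.urlopen('http://www.nytimes.com')
dom = lxml.html.fromstring(connection.read())
for a in dom.cssselect('a'):   # CSS selector instead of the XPath used below
    print a.get('href')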
An example with lxml and xpath would look like this:
import urllib
import lxml.html
connection = urllib.urlopen('http://www.nytimes.com')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'): # select the url in href for all a tags(links)
print link
Answer 3
import urllib2
import BeautifulSoup
request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a'):
if 'national-park' in a['href']:
print 'found a url with national-park in the link'
Answer 4
The following code retrieves all the links available on a webpage using urllib2 and BeautifulSoup4:
import urllib2
from bs4 import BeautifulSoup
url = urllib2.urlopen("http://www.espncricinfo.com/").read()
soup = BeautifulSoup(url)
for line in soup.find_all('a'):
print(line.get('href'))
Answer 5
Under the hood, BeautifulSoup now uses lxml. Requests, lxml and list comprehensions make a killer combo.
import requests
import lxml.html
dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)
[x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]
In the list comp, the "if '//' and 'url.com' not in x" is a simple way to scrub the URL list of the site's 'internal' navigation URLs, etc.
Answer 6
just for getting the links, without B.soup and regex:
import urllib2
url="http://www.somewhere.com"
page=urllib2.urlopen(url)
data=page.read().split("</a>")
tag="<a href=\""
endtag="\">"
for item in data:
if "<a href" in item:
try:
ind = item.index(tag)
item=item[ind+len(tag):]
end=item.index(endtag)
except: pass
else:
print item[:end]
for more complex operations, of course BSoup is still preferred.
Answer 7
This script does what you're looking for, but it also resolves relative links to absolute links.
import urllib
import lxml.html
import urlparse
def get_dom(url):
connection = urllib.urlopen(url)
return lxml.html.fromstring(connection.read())
def get_links(url):
return resolve_links((link for link in get_dom(url).xpath('//a/@href')))
def guess_root(links):
for link in links:
if link.startswith('http'):
parsed_link = urlparse.urlparse(link)
scheme = parsed_link.scheme + '://'
netloc = parsed_link.netloc
return scheme + netloc
def resolve_links(links):
root = guess_root(links)
for link in links:
if not link.startswith('http'):
link = urlparse.urljoin(root, link)
yield link
for link in get_links('http://www.google.com'):
print link
To find all the links, we will in this example use the urllib2 module together with the re module.
One of the most powerful functions in the re module is re.findall(). While re.search() is used to find the first match for a pattern, re.findall() finds all the matches and returns them as a list of strings, with each string representing one match.
import urllib2
import re
# connect to a URL (set url to the page you want to scan)
url = "http://www.somewhere.com"
website = urllib2.urlopen(url)
#read html code
html = website.read()
#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)
print links
Answer 9
Why not use regular expressions:
import urllib2
import re
url ="http://www.somewhere.com"
page = urllib2.urlopen(url)
page = page.read()
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)

for link in links:
    print('href: %s, HTML text: %s' % (link[0], link[1]))
Links can live in a variety of attributes, so you can pass a list of those attributes to select.
For example, with the src and href attributes (here I am using the starts-with operator ^ to specify that either of these attribute values starts with http; you can tailor this as required):
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://stackoverflow.com/')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]') ]
print(links)
Here's an example using @ars' accepted answer and the BeautifulSoup4, requests, and wget modules to handle the downloads.
import requests
import wget
import os
from bs4 import BeautifulSoup, SoupStrainer
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'
response = requests.get(url)
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
if link.has_attr('href'):
if file_type in link['href']:
full_path = url + link['href']
wget.download(full_path)
Answer 12
I found the answer by @Blairg23 to work, after the following correction (covering the scenario where it failed to work correctly):
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = urlparse.urljoin(url, link['href'])  # the urlparse module needs to be imported
            wget.download(full_path)
For Python 3:
urllib.parse.urljoin has to be used in order to obtain the full URL instead.
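A self-contained Python 3 sketch of the same correction (reusing the URL and file type from the answer being corrected; treat it as illustrative):
from urllib.parse import urljoin   # Python 3 home of urljoin

import requests
import wget
from bs4 import BeautifulSoup, SoupStrainer

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'
response = requests.get(url)

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href') and file_type in link['href']:
        wget.download(urljoin(url, link['href']))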
BeautifulSoup's own parser can be slow. It might be more feasible to use lxml, which is capable of parsing directly from a URL (with some limitations mentioned below).
import lxml.html
doc = lxml.html.parse(url)
links = doc.xpath('//a[@href]')
for link in links:
print link.attrib['href']
The code above will return the links as is, and in most cases they would be relative links or absolute from the site root. Since my use case was to only extract a certain type of links, below is a version that converts the links to full URLs and which optionally accepts a glob pattern like *.mp3. It won’t handle single and double dots in the relative paths though, but so far I didn’t have the need for it. If you need to parse URL fragments containing ../ or ./ then urlparse.urljoin might come in handy.
NOTE: Direct lxml url parsing doesn’t handle loading from https and doesn’t do redirects, so for this reason the version below is using urllib2 + lxml.
#!/usr/bin/env python
import sys
import urllib2
import urlparse
import lxml.html
import fnmatch
try:
import urltools as urltools
except ImportError:
sys.stderr.write('To normalize URLs run: `pip install urltools --user`')
urltools = None
def get_host(url):
p = urlparse.urlparse(url)
return "{}://{}".format(p.scheme, p.netloc)
if __name__ == '__main__':
url = sys.argv[1]
host = get_host(url)
glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'
doc = lxml.html.parse(urllib2.urlopen(url))
links = doc.xpath('//a[@href]')
for link in links:
href = link.attrib['href']
if fnmatch.fnmatch(href, glob_patt):
            if not href.startswith(('http://', 'https://', 'ftp://')):
if href.startswith('/'):
href = host + href
else:
parent_url = url.rsplit('/', 1)[0]
href = urlparse.urljoin(parent_url, href)
if urltools:
href = urltools.normalize(href)
print href
import urllib2
from bs4 import BeautifulSoup
a=urllib2.urlopen('http://dir.yahoo.com')
code=a.read()
soup=BeautifulSoup(code)
links=soup.findAll("a")
#To get href part alone
print links[0].attrs['href']
Answer 15
There can be many duplicate links together with both external and internal links. To differentiate between the two and just get unique links using sets:
# Python 3.
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup
url = "http://www.espncricinfo.com/"
resp = urllib.request.urlopen(url)
# Get server encoding per recommendation of Martijn Pieters.
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
external_links = set()
internal_links = set()
for line in soup.find_all('a'):
link = line.get('href')
if not link:
continue
if link.startswith('http'):
external_links.add(link)
else:
internal_links.add(link)
# Depending on usage, full internal links may be preferred.
full_internal_links = {
urllib.parse.urljoin(url, internal_link)
for internal_link in internal_links
}
# Print all unique external and full internal links.
for link in external_links.union(full_internal_links):
print(link)
...
soup = BeautifulSoup(html, "lxml")
File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 152, in __init__
% ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
The above outputs on my Terminal. I am on Mac OS 10.7.x. I have Python 2.7.1, and followed this tutorial to get Beautiful Soup and lxml, which both installed successfully and work with a separate test file located here. In the Python script that causes this error, I have included this line:
from pageCrawler import comparePages
And in the pageCrawler file I have included the following two lines:
from bs4 import BeautifulSoup
from urllib2 import urlopen
Any help in figuring out what the problem is and how it can be solved would much be appreciated.
I have a suspicion that this is related to the parser that BS will use to read the HTML. The documentation is here, but if you're like me (on OSX) you might be stuck with something that requires a bit of work:
You’ll notice that in the BS4 documentation page above, they point out that by default BS4 will use the Python built-in HTML parser. Assuming you are in OSX, the Apple-bundled version of Python is 2.7.2 which is not lenient for character formatting. I hit this same problem, so I upgraded my version of Python to work around it. Doing this in a virtualenv will minimize disruption to other projects.
If doing that sounds like a pain, you can switch over to the LXML parser:
pip install lxml
And then try:
soup = BeautifulSoup(html, "lxml")
Depending on your scenario, that might be good enough. I found this annoying enough to warrant upgrading my version of Python. Using virtualenv, you can migrate your packages fairly easily.
Although BeautifulSoup supports the HTML parser by default, if you want to use any other third-party Python parser you need to install that external parser (like lxml).
soup_object = BeautifulSoup(markup, "html.parser")  # Python's built-in HTML parser
But if you don't specify any parser as a parameter, you will get a warning that no parser was specified.
soup_object = BeautifulSoup(markup)  # warning
To use any other external parser, you need to install it and then specify it, like:
pip install lxml
soup_object = BeautifulSoup(markup, 'lxml')  # C-dependent parser
External parsers have C and Python dependencies, which may have some advantages and disadvantages.
Answer 7
I encountered the same issue. I found the reason was that I had a slightly outdated Python six package.
>>> import html5lib
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/html5lib/__init__.py", line 16, in <module>
from .html5parser import HTMLParser, parse, parseFragment
File "/usr/local/lib/python2.7/site-packages/html5lib/html5parser.py", line 2, in <module>
from six import with_metaclass, viewkeys, PY3
ImportError: cannot import name viewkeys
The error is coming from the parser you are using. In general, if you have an HTML file/code then you need to use html5lib (documentation can be found here), and in case you have an XML file/data then you need to use lxml (documentation can be found here). You can use lxml for HTML file/code as well, but sometimes it gives an error like the one above. So it is better to choose the package wisely based on the type of data/file. You can also use the built-in html.parser module, but this also sometimes does not work.
For more details on when to use which package, you can see the details here.
Leaving the parser parameter blank will result in a warning that the best available parser is being used.
soup = BeautifulSoup(html)
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
I’m trying to scrape a website, but it gives me an error.
I’m using the following code:
import urllib.request
from bs4 import BeautifulSoup
get = urllib.request.urlopen("https://www.website.com/")
html = get.read()
soup = BeautifulSoup(html)
print(soup)
And I’m getting the following error:
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 70924-70950: character maps to <undefined>
In Python 3.7, and running Windows 10 this worked (I am not sure whether it will work on other platforms and/or other versions of Python)
Replacing this line:
with open('filename', 'w') as f:
With this:
with open('filename', 'w', encoding='utf-8') as f:
The reason why it works is that the encoding is changed to UTF-8 when using the file, so characters in UTF-8 can be written out as text, instead of an error being raised when a UTF-8 character that is not supported by the current encoding is encountered.
While saving the response of a GET request, the same error was thrown on Python 3.7 on Windows 10. The response received from the URL was UTF-8 encoded, so it is always recommended to check the encoding so the same encoding can be passed when writing the file; this avoids a trivial issue that really kills a lot of time in production.
import requests
resp = requests.get('https://en.wikipedia.org/wiki/NIFTY_50')
print(resp.encoding)
with open ('NiftyList.txt', 'w') as f:
f.write(resp.text)
When I added encoding="utf-8" to the open command, it saved the file with the correct response:
with open ('NiftyList.txt', 'w', encoding="utf-8") as f:
f.write(resp.text)
I faced the same issue with the encoding, which occurs when you try to print it, read/write it, or open it. As others mentioned above, adding .encode("utf-8") will help if you are trying to print it:
soup.encode("utf-8")
If you are trying to open scraped data and maybe write it into a file, then open the file with encoding="utf-8":
with open(filename_csv, 'w', newline='', encoding="utf-8") as csv_file: