Difference between BeautifulSoup and Scrapy crawler?

Question: Difference between BeautifulSoup and Scrapy crawler?

I want to build a website that compares Amazon and eBay product prices. Which of these will work better, and why? I am somewhat familiar with BeautifulSoup but not so much with the Scrapy crawler.


Answer 0

Scrapy is a web-spider, or web scraper, framework. You give Scrapy a root URL to start crawling, and you can then specify constraints such as how many URLs to crawl and fetch. It is a complete framework for web scraping and crawling.

BeautifulSoup, by contrast, is a parsing library. It does not download pages itself: you fetch the HTML with an HTTP client such as requests or urllib, and BeautifulSoup lets you pull out the parts you need without any hassle. It handles only the page you give it and then stops. It does not crawl unless you write the loop that follows links yourself.

In simple words, with Beautiful Soup you can build something similar to Scrapy. Beautiful Soup is a library, while Scrapy is a complete framework.
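To make the "write the loop yourself" point concrete, here is a minimal sketch of a hand-rolled crawler around BeautifulSoup. The `crawl` function, the `fetch` callable, and the `example.test` pages are all made up for illustration; in real use `fetch` might wrap something like `requests.get(url).text`.

```python
from collections import deque
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl: BeautifulSoup only parses, the loop does the crawling.

    `fetch` is a hypothetical callable that maps a URL to its HTML (or None).
    """
    frontier = deque([start_url])   # URLs still to visit
    seen = {start_url}              # URLs already queued, to avoid revisiting
    titles = {}

    while frontier and len(titles) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        soup = BeautifulSoup(html, "html.parser")
        titles[url] = soup.title.get_text(strip=True) if soup.title else ""

        # Queue every link on the page that has not been seen yet.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return titles


# Two in-memory pages stand in for a real site so the sketch runs offline.
PAGES = {
    "http://example.test/": '<title>Home</title><a href="/item">Item</a>',
    "http://example.test/item": '<title>Item</title><a href="/">Back</a>',
}

result = crawl("http://example.test/", PAGES.get)
print(result)
```

The queue, the seen-set, and the page limit are exactly the bookkeeping that Scrapy would otherwise do for you.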



Answer 1

I think both are good. I am working on a project right now that uses both. First I scrape all the pages with Scrapy and save them to a MongoDB collection through its pipelines, also downloading the images on each page. After that I use BeautifulSoup4 for post-processing, where I have to change attribute values and pull out some special tags.

If you don't know in advance which product pages you want, Scrapy is a good tool, since you can use its crawlers to run through the whole Amazon/eBay site looking for products without writing an explicit for loop.

Take a look at the Scrapy documentation; it's very simple to use.


Answer 2

Both are used to parse data.

Scrapy:

  • Scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.
  • It has some limitations when data comes from JavaScript or is loaded dynamically; we can overcome them with packages like splash, selenium, etc.

BeautifulSoup:

  • Beautiful Soup is a Python library for pulling data out of HTML and XML files.
  • It does not run JavaScript itself. We can use it on JavaScript-driven or dynamically loading pages once a tool such as selenium has rendered the HTML.

Scrapy with BeautifulSoup is one of the best combos we can work with for scraping static content and, with a rendering tool added, dynamic content too.
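One caveat worth checking for yourself: BeautifulSoup only sees the HTML text it is given and never executes scripts. The small sketch below uses a made-up page in which one price is in the markup and the other would be filled in by JavaScript in a browser.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one price is in the HTML, the other is set by JavaScript.
html = """
<span id="static-price">19.99</span>
<span id="dynamic-price"></span>
<script>document.getElementById("dynamic-price").textContent = "24.99";</script>
"""

soup = BeautifulSoup(html, "html.parser")
static_price = soup.find(id="static-price").get_text()
dynamic_price = soup.find(id="dynamic-price").get_text()

print(repr(static_price))    # '19.99'
print(repr(dynamic_price))   # '' because the script was never run
```

To get the second value, something has to render the page first (a headless browser, for instance) and hand the resulting HTML to the parser.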


Answer 3

The way I do it is to use the eBay/Amazon APIs rather than Scrapy, and then parse the results using BeautifulSoup.

The APIs give you an official way of getting the same data that you would have got from a Scrapy crawler, with no need to worry about hiding your identity, messing about with proxies, etc.


Answer 4

Scrapy:

Scrapy is a web scraping framework that comes with tons of goodies that make scraping easier, so that we can focus on the crawling logic only. Some of my favourite things Scrapy takes care of for us are below.

  • Feed exports: These let us save data in various formats, like CSV, JSON, JSON Lines, and XML.
  • Asynchronous scraping: Scrapy is built on the Twisted framework, which lets us visit multiple URLs at once, with each request processed in a non-blocking way (basically, we don't have to wait for one request to finish before sending another).
  • Selectors: This is where we can compare Scrapy with Beautiful Soup. Selectors let us pick particular data out of the webpage, such as a heading or a certain div with a class name. Scrapy's selectors are built on lxml, which is generally much faster than parsing with Beautiful Soup.
  • Setting proxies, user agents, headers, etc.: Scrapy lets us set and rotate proxies and other headers dynamically.
  • Item pipelines: Pipelines let us process data after extraction. For example, we can configure a pipeline to push data to a MySQL server.
  • Cookies: Scrapy automatically handles cookies for us.

etc.

TL;DR: Scrapy is a framework that provides everything one might need to build large-scale crawls. It provides various features that hide the complexity of crawling the web. One can simply start writing web crawlers without worrying about the setup burden.

Beautiful Soup:

Beautiful Soup is a Python package for parsing HTML and XML documents. So with Beautiful Soup you can parse a webpage that has already been downloaded. BS4 is very popular and long-established. Unlike Scrapy, you cannot use Beautiful Soup alone to make crawlers. You will need other libraries, like requests or urllib, to make crawlers with bs4. Again, this means you would need to manage the lists of URLs already crawled and still to be crawled, handle cookies, manage proxies, handle errors, and write your own functions to push data to CSV, JSON, XML, etc. If you want to speed things up, you will have to use other libraries, like multiprocessing.

To sum up:

  • Scrapy is a rich framework that you can use to start writing crawlers without any hassle.
  • Beautiful Soup is a library that you can use to parse a webpage. It cannot be used alone to scrape the web.

You should definitely use Scrapy for your Amazon and eBay product price comparison website. You could build a database of URLs, run the crawler every day (cron jobs or Celery for scheduling crawls), and update the prices in your database. This way your website will always pull from the database, and the crawler and the database will act as separate components.
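As a rough illustration of the "crawler updates the database, website reads from it" idea, here is a sketch of an item pipeline that stores prices. It swaps in SQLite from the standard library instead of MySQL so that it runs anywhere; the class name, table layout, and item fields (`url`, `price`) are assumptions. The hook names (`open_spider`, `process_item`, `close_spider`) follow Scrapy's documented pipeline interface, but check the docs for your version.

```python
import sqlite3


class SQLitePricePipeline:
    """Scrapy-style item pipeline: a plain class with the hooks Scrapy calls."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path   # hypothetical setting; a file path in real use

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS prices (url TEXT PRIMARY KEY, price REAL)"
        )

    def process_item(self, item, spider):
        # Insert the price, or overwrite the stored one when a URL is re-crawled.
        self.conn.execute(
            "INSERT OR REPLACE INTO prices (url, price) VALUES (?, ?)",
            (item["url"], float(item["price"])),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()


# Offline check: call the hooks by hand, as Scrapy would during a crawl.
pipeline = SQLitePricePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"url": "http://example.test/mug", "price": "4.50"}, spider=None)
pipeline.process_item({"url": "http://example.test/mug", "price": "3.99"}, spider=None)
rows = pipeline.conn.execute("SELECT url, price FROM prices").fetchall()
pipeline.close_spider(spider=None)
print(rows)
```

In a Scrapy project a class like this would be switched on through the `ITEM_PIPELINES` setting. Keying the table on the URL means each daily crawl simply refreshes the stored price.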


Answer 5

BeautifulSoup is a library that lets you extract information from a web page.

Scrapy, on the other hand, is a framework that does the above and many more things you will probably need in a scraping project, like pipelines for saving data.

You can check this blog to get started with Scrapy: https://www.inkoop.io/blog/web-scraping-using-python-and-scrapy/


Answer 6

Using Scrapy you can save tons of code and start with structured programming. If you don't like any of Scrapy's pre-written methods, BeautifulSoup can be used in their place. Big projects take advantage of both.


Answer 7

The differences are many, and the choice of any tool or technology depends on individual needs.

A few major differences are:

  1. BeautifulSoup is comparatively easier to learn than Scrapy.
  2. The extensions, support, and community are larger for Scrapy than for BeautifulSoup.
  3. Scrapy should be considered a spider, while BeautifulSoup is a parser.