Question: How to pass a user-defined argument to a Scrapy spider
I am trying to pass a user-defined argument to a Scrapy spider. Can anyone suggest how to do that?
I read about a parameter -a somewhere but have no idea how to use it.
Answer 0
Spider arguments are passed in the crawl command using the -a option. For example:
scrapy crawl myspider -a category=electronics -a domain=system
Spiders can access arguments as attributes:
class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category='', **kwargs):
        self.start_urls = [f'http://www.example.com/{category}']  # py36
        super().__init__(**kwargs)  # python3

    def parse(self, response):
        self.log(self.domain)  # system
Taken from the Scrapy doc: http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments
Update 2013: Add second argument
Update 2015: Adjust wording
Update 2016: Use newer base class and add super, thanks @Birla
Update 2017: Use Python3 super
# previously
super(MySpider, self).__init__(**kwargs) # python2
Update 2018: As @eLRuLL points out, spiders can access arguments as attributes
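The 2018 note works because the base Spider's __init__ copies every keyword argument onto the instance. A minimal sketch of that mechanism, using a hypothetical FakeSpider stand-in rather than the real scrapy.Spider class:

```python
# Hypothetical stand-in for scrapy.Spider: each -a key=value pair from
# the command line arrives as a keyword argument and is set as an attribute.
class FakeSpider:
    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)

# "scrapy crawl myspider -a category=electronics -a domain=system"
# is roughly equivalent to:
spider = FakeSpider(category='electronics', domain='system')
print(spider.category)  # electronics
print(spider.domain)    # system
```

This is why the parse method above can read self.domain even though __init__ never assigns it.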
Answer 1
Previous answers were correct, but you don't have to declare the constructor (__init__) every time you want to code a Scrapy spider; you can just specify the parameters as before:
scrapy crawl myspider -a parameter1=value1 -a parameter2=value2
and in your spider code you can just use them as spider arguments:
class MySpider(Spider):
    name = 'myspider'
    ...

    def parse(self, response):
        ...
        if self.parameter1 == 'value1':
            # this is True

        # or also
        if getattr(self, 'parameter2') == 'value2':
            # this is also True
And it just works.
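One caveat: if an argument was not passed on the command line, plain attribute access raises AttributeError, while getattr with a default does not. A small plain-Python sketch of the difference (FakeSpider here is a hypothetical stand-in, not Scrapy's class):

```python
# Hypothetical stand-in: pretend only parameter1 was passed via -a.
class FakeSpider:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

spider = FakeSpider(parameter1='value1')

print(spider.parameter1)                          # value1
print(getattr(spider, 'parameter2', 'fallback'))  # fallback
print(hasattr(spider, 'parameter2'))              # False
```

So prefer getattr with a default whenever an argument is optional.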
Answer 2
To pass arguments with the crawl command:
scrapy crawl myspider -a category='mycategory' -a domain='example.com'
To pass arguments to run on scrapyd, replace -a with -d:
curl http://your.ip.address.here:port/schedule.json -d spider=myspider -d category='mycategory' -d domain='example.com'
The spider will receive arguments in its constructor.
class MySpider(Spider):
    name = "myspider"

    def __init__(self, category='', domain='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.category = category
        self.domain = domain
Scrapy sets all the arguments as spider attributes, so you can skip the __init__ method completely. Be careful to use the getattr method for reading those attributes so your code does not break.
class MySpider(Spider):
    name = "myspider"
    start_urls = ('https://httpbin.org/ip',)

    def parse(self, response):
        print(getattr(self, 'category', ''))
        print(getattr(self, 'domain', ''))
Answer 3
Spider arguments are passed while running the crawl command using the -a option. For example, if I want to pass a domain name as an argument to my spider, I will do this:
scrapy crawl myspider -a domain="http://www.example.com"
And receive the argument in the spider's constructor:
class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, domain='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [domain]
        # …
It will work. :)
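The constructor wiring can be checked without running a crawl by instantiating the spider directly; here a minimal hypothetical base class stands in for BaseSpider:

```python
# Hypothetical minimal base class standing in for BaseSpider.
class Base:
    def __init__(self, *args, **kwargs):
        pass

class MySpider(Base):
    name = 'myspider'

    def __init__(self, domain='', *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [domain]

# Mirrors: scrapy crawl myspider -a domain="http://www.example.com"
spider = MySpider(domain='http://www.example.com')
print(spider.start_urls)  # ['http://www.example.com']
```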
Answer 4
Alternatively, we can use ScrapyD, which exposes an API through which we can pass the start_urls and spider name. ScrapyD has APIs to stop/start/check the status of/list the spiders.
pip install scrapyd scrapyd-deploy
scrapyd
scrapyd-deploy local -p default
scrapyd-deploy
will deploy the spider in the form of an egg into the daemon, and it even maintains versions of the spider. While starting the spider, you can mention which version of the spider to use.
class MySpider(CrawlSpider):
    name = 'testspider'

    def __init__(self, start_urls, *args, **kwargs):
        self.start_urls = start_urls.split('|')
        super().__init__(*args, **kwargs)
curl http://localhost:6800/schedule.json -d project=default -d spider=testspider -d start_urls="https://www.anyurl...|https://www.anyurl2"
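Since schedule.json delivers start_urls as a single string, the spider's constructor splits it on the pipe character. The split step in isolation:

```python
# The -d start_urls value arrives as one pipe-separated string;
# split('|') recovers the individual URLs.
raw = "https://www.anyurl...|https://www.anyurl2"
start_urls = raw.split('|')
print(start_urls)  # ['https://www.anyurl...', 'https://www.anyurl2']
```

Any separator works as long as it cannot appear inside a URL; '|' is a common choice for exactly that reason.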
An added advantage is that you can build your own UI to accept the URL and other parameters from the user, and schedule a task using the above ScrapyD schedule API.
Refer to the ScrapyD API documentation for more details.