Tag Archives: scrapy

Difference between BeautifulSoup and Scrapy crawler?

Question: Difference between BeautifulSoup and Scrapy crawler?

I want to make a website that shows a comparison between Amazon and eBay product prices. Which of these will work better, and why? I am somewhat familiar with BeautifulSoup, but not so much with the Scrapy crawler.


Answer 0

Scrapy is a web-spider or web-scraper framework. You give Scrapy a root URL to start crawling, then you can specify constraints on how many URLs you want to crawl and fetch, etc. It is a complete framework for web scraping or crawling.

While

BeautifulSoup is a parsing library which also does a pretty good job of fetching content from a URL and allows you to parse certain parts of it without any hassle. It only fetches the contents of the URL that you give it and then stops. It does not crawl unless you manually put it inside an infinite loop with certain criteria.

In simple words, with Beautiful Soup you can build something similar to Scrapy. Beautiful Soup is a library while Scrapy is a complete framework.

Source
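
To make the contrast concrete, here is a minimal sketch of the library-style workflow described above, assuming the requests and beautifulsoup4 packages are installed; the URL and the .price selector are hypothetical placeholders. It fetches one page, parses it, and stops:

import requests
from bs4 import BeautifulSoup

# One fetch, one parse -- BeautifulSoup does not crawl on its own.
response = requests.get("http://www.example.com/some-product-page")
soup = BeautifulSoup(response.text, "html.parser")

# Extract the parts you care about (hypothetical .price elements).
for price_tag in soup.select(".price"):
    print(price_tag.get_text(strip=True))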


Answer 1

I think both are good. I'm doing a project right now that uses both. First I scrape all the pages using Scrapy and save them in a MongoDB collection using its pipelines, also downloading the images that exist on each page. After that I use BeautifulSoup4 for post-processing, where I must change attribute values and get some special tags.

If you don't know which product pages you want, a good tool will be Scrapy, since you can use its crawlers to run over the whole Amazon/eBay website looking for the products without writing an explicit for loop.

Take a look at the Scrapy documentation; it's very simple to use.
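
As a rough illustration of the post-processing step described above, here is a minimal sketch; it assumes the HTML was already saved to disk by the Scrapy stage, and the file name, attribute rewrite, and class name are hypothetical:

from bs4 import BeautifulSoup

# Load a page that the Scrapy stage already downloaded (hypothetical path).
with open("page_saved_by_scrapy.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Change attribute values, e.g. point image sources at local copies.
for img in soup.find_all("img"):
    if img.get("src"):
        img["src"] = "images/" + img["src"].split("/")[-1]

# Get some special tags.
special_tags = soup.find_all("span", class_="special")
print(len(special_tags), "special tags found")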


Answer 2

Both are used to parse data.

Scrapy:

  • Scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.
  • But it has some limitations when data comes from JavaScript or is loaded dynamically; we can overcome that by using packages like Splash, Selenium, etc.

BeautifulSoup:

  • Beautiful Soup is a Python library for pulling data out of HTML and XML files.

  • We can use this package to get data out of JavaScript-driven or dynamically loaded pages.

Scrapy with BeautifulSoup is one of the best combos we can work with for scraping static and dynamic content.
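
A minimal sketch of that combo, with Scrapy doing the fetching and BeautifulSoup the parsing (the URL and tag names are placeholders):

import scrapy
from bs4 import BeautifulSoup

class ComboSpider(scrapy.Spider):
    name = "combo_example"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        # Hand the downloaded HTML to BeautifulSoup for parsing.
        soup = BeautifulSoup(response.text, "html.parser")
        for heading in soup.find_all("h2"):
            yield {"heading": heading.get_text(strip=True)}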


Answer 3

The way I do it is to use the eBay/Amazon APIs rather than scrapy, and then parse the results using BeautifulSoup.

The APIs give you an official way of getting the same data that you would have got from a scrapy crawler, with no need to worry about hiding your identity, messing about with proxies, etc.
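
As a rough sketch of that approach (the endpoint, parameters, and tag names below are purely hypothetical placeholders; consult the real eBay/Amazon API documentation for the actual ones):

import requests
from bs4 import BeautifulSoup

# Hypothetical product-search endpoint standing in for a real API call.
resp = requests.get("https://api.example.com/products",
                    params={"keywords": "laptop"})

# Many such APIs return XML; the "xml" parser requires lxml to be installed.
soup = BeautifulSoup(resp.text, "xml")
for item in soup.find_all("item"):
    print(item.find("title").get_text(), item.find("price").get_text())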


Answer 4

Scrapy: It is a web-scraping framework which comes with tons of goodies that make scraping easier, so that we can focus on the crawling logic only. Some of my favourite things Scrapy takes care of for us are below.

  • Feed exports: It basically allows us to save data in various formats like CSV, JSON, JSON lines and XML.
  • Asynchronous scraping: Scrapy uses the Twisted framework, which gives us the power to visit multiple URLs at once, where each request is processed in a non-blocking way (basically we don't have to wait for one request to finish before sending another).
  • Selectors: This is where we can compare Scrapy with Beautiful Soup. Selectors are what allow us to select particular data from the webpage (a heading, a certain div with a class name, etc.). Scrapy uses lxml for parsing, which is much faster than Beautiful Soup.
  • Setting proxy, user agent, headers etc.: Scrapy allows us to set and rotate proxies and other headers dynamically.

  • Item Pipelines: Pipelines enable us to process data after extraction. For example, we can configure a pipeline to push data to your MySQL server (see the sketch after this list).

  • Cookies: scrapy automatically handles cookies for us.

etc.
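
A minimal item-pipeline sketch for the MySQL example above, assuming the pymysql package and an existing items table with a price column (both hypothetical); it would be enabled through the ITEM_PIPELINES setting:

import pymysql

class MySQLPipeline:
    def open_spider(self, spider):
        # Connection details are placeholders.
        self.conn = pymysql.connect(host="localhost", user="user",
                                    password="secret", database="scraping")

    def process_item(self, item, spider):
        with self.conn.cursor() as cursor:
            cursor.execute("INSERT INTO items (price) VALUES (%s)",
                           (item.get("price"),))
        self.conn.commit()
        return item  # hand the item on to any later pipelines

    def close_spider(self, spider):
        self.conn.close()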

TLDR: Scrapy is a framework that provides everything one might need to build large-scale crawlers. It provides various features that hide the complexity of crawling the web. One can simply start writing web crawlers without worrying about the setup burden.

Beautiful Soup: Beautiful Soup is a Python package for parsing HTML and XML documents. So with Beautiful Soup you can parse a webpage that has already been downloaded. BS4 is very popular and old. Unlike Scrapy, you cannot use Beautiful Soup alone to make crawlers. You will need other libraries like requests, urllib, etc. to make crawlers with bs4. Again, this means you would need to manage the list of URLs being crawled and to be crawled, handle cookies, manage proxies, handle errors, and create your own functions to push data to CSV, JSON, XML, etc. If you want to speed things up, you will have to use other libraries like multiprocessing.

To sum up.

  • Scrapy is a rich framework that you can use to start writing crawlers without any hassle.

  • Beautiful Soup is a library that you can use to parse a webpage. It cannot be used alone to scrape the web.

You should definitely use Scrapy for your Amazon and eBay product-price comparison website. You could build a database of URLs and run the crawler every day (cron jobs, or Celery for scheduling crawls) and update the prices in your database. This way your website will always pull from the database, and the crawler and database will act as separate components.


Answer 5

BeautifulSoup is a library that lets you extract information from a web page.

Scrapy, on the other hand, is a framework, which does the above and many more things you will probably need in your scraping project, like pipelines for saving data.

You can check this blog to get started with Scrapy https://www.inkoop.io/blog/web-scraping-using-python-and-scrapy/


Answer 6

Using Scrapy you can save tons of code and start with structured programming. If you don't like any of Scrapy's pre-written methods, BeautifulSoup can be used in place of the Scrapy method. A big project takes advantage of both.


Answer 7

The differences are many and selection of any tool/technology depends on individual needs.

Few major differences are:

  1. BeautifulSoup is comparatively easier to learn than Scrapy.
  2. The extensions, support, and community are larger for Scrapy than for BeautifulSoup.
  3. Scrapy should be considered a spider, while BeautifulSoup is a parser.

How to debug a Scrapy project using PyCharm

Question: How to debug a Scrapy project using PyCharm

I am working on Scrapy 0.20 with Python 2.7. I found that PyCharm has a good Python debugger, and I want to test my Scrapy spiders using it. Does anyone know how to do that?

What I have tried

Actually I tried to run the spider as a script. As a result, I built that script. Then I tried to add my Scrapy project to PyCharm like this:
File->Setting->Project structure->Add content root.

But I don't know what else I have to do.


Answer 0

The scrapy command is a Python script, which means you can start it from inside PyCharm.

When you examine the scrapy binary (which scrapy) you will notice that this is actually a python script:

#!/usr/bin/python

from scrapy.cmdline import execute
execute()

This means that a command like scrapy crawl IcecatCrawler can also be executed like this: python /Library/Python/2.7/site-packages/scrapy/cmdline.py crawl IcecatCrawler

Try to find the scrapy.cmdline package. In my case the location was here: /Library/Python/2.7/site-packages/scrapy/cmdline.py

Create a run/debug configuration inside PyCharm with that script as the script. Fill the script parameters with the scrapy command and spider, in this case crawl IcecatCrawler.

Like this:

Put your breakpoints anywhere in your crawling code and it should work™.


Answer 1

You just need to do this.

Create a Python file in the crawler folder of your project. I used main.py.

  • Project
    • Crawler
      • Crawler
        • Spiders
      • main.py
      • scrapy.cfg

Inside your main.py, put the code below.

from scrapy import cmdline    
cmdline.execute("scrapy crawl spider".split())

And you need to create a “Run Configuration” to run your main.py.

This way, if you put a breakpoint in your code, execution will stop there.


Answer 2

As of 2018.1 this became a lot easier. You can now select Module name in your project’s Run/Debug Configuration. Set this to scrapy.cmdline and the Working directory to the root dir of the scrapy project (the one with settings.py in it).

Like so:

Now you can add breakpoints to debug your code.
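
A sketch of what that configuration might contain (the spider name is a hypothetical placeholder):

Module name:       scrapy.cmdline
Parameters:        crawl myspider
Working directory: /path/to/your/scrapy/project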


Answer 3

I am running scrapy in a virtualenv with Python 3.5.0 and setting the “script” parameter to /path_to_project_env/env/bin/scrapy solved the issue for me.


Answer 4

IntelliJ IDEA also works.

Create main.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy import cmdline

def main(name):
    if name:
        cmdline.execute(name.split())

if __name__ == '__main__':
    print('[*] beginning main thread')
    name = "scrapy crawl stack"
    #name = "scrapy crawl spa"
    main(name)
    print('[*] main thread exited')
    print('main stop====================================================')

Shown below:


Answer 5

To add a bit to the accepted answer, after almost an hour I found I had to select the correct Run Configuration from the dropdown list (near the center of the icon toolbar), then click the Debug button in order to get it to work. Hope this helps!


Answer 6

I am also using PyCharm, but I am not using its built-in debugging features.

For debugging I am using ipdb. I set up a keyboard shortcut to insert import ipdb; ipdb.set_trace() on any line I want the break point to happen.

Then I can type n to execute the next statement, s to step into a function, type any object name to see its value, alter the execution environment, type c to continue execution…

This is very flexible, works in environments other than PyCharm, where you don’t control the execution environment.

Just run pip install ipdb in your virtual environment and place import ipdb; ipdb.set_trace() on the line where you want execution to pause.
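
A minimal sketch of what that looks like inside a spider (the spider name and URL are placeholders):

import scrapy

class DebugSpider(scrapy.Spider):
    name = "debug_example"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        import ipdb; ipdb.set_trace()  # execution pauses here
        title = response.css("title::text").extract_first()
        yield {"title": title}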

UPDATE

You can also pip install pdbpp and use the standard import pdb; pdb.set_trace() instead of ipdb. PDB++ is nicer in my opinion.


Answer 7

According to the documentation https://doc.scrapy.org/en/latest/topics/practices.html

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

Answer 8

I use this simple script:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

process.crawl('your_spider_name')
process.start()

Answer 9

Extending @Rodrigo's version of the answer, I added this script, and now I can set the spider name from the run configuration instead of changing it in the string.

import sys
from scrapy import cmdline

cmdline.execute(f"scrapy crawl {sys.argv[1]}".split())

How to pass a user-defined argument to a Scrapy spider

Question: How to pass a user-defined argument to a Scrapy spider

I am trying to pass a user-defined argument to a Scrapy spider. Can anyone suggest how to do that?

I read about a parameter -a somewhere but have no idea how to use it.


Answer 0

Spider arguments are passed in the crawl command using the -a option. For example:

scrapy crawl myspider -a category=electronics -a domain=system

Spiders can access arguments as attributes:

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category='', **kwargs):
        self.start_urls = [f'http://www.example.com/{category}']  # py36
        super().__init__(**kwargs)  # python3

    def parse(self, response):
        self.log(self.domain)  # system

Taken from the Scrapy doc: http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments

Update 2013: Add second argument

Update 2015: Adjust wording

Update 2016: Use newer base class and add super, thanks @Birla

Update 2017: Use Python3 super

# previously
super(MySpider, self).__init__(**kwargs)  # python2

Update 2018: As @eLRuLL points out, spiders can access arguments as attributes


Answer 1

Previous answers were correct, but you don't have to declare the constructor (__init__) every time you want to code a Scrapy spider; you can just specify the parameters as before:

scrapy crawl myspider -a parameter1=value1 -a parameter2=value2

and in your spider code you can just use them as spider arguments:

class MySpider(Spider):
    name = 'myspider'
    ...
    def parse(self, response):
        ...
        if self.parameter1 == value1:
            # this is True

        # or also (the attribute name must be a string)
        if getattr(self, 'parameter2') == value2:
            # this is also True

And it just works.


Answer 2

To pass arguments with the crawl command:

scrapy crawl myspider -a category='mycategory' -a domain='example.com'

To pass arguments to run on scrapyd, replace -a with -d:

curl http://your.ip.address.here:port/schedule.json -d spider=myspider -d category='mycategory' -d domain='example.com'

The spider will receive arguments in its constructor.


class MySpider(Spider):
    name = "myspider"

    def __init__(self, category='', domain='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.category = category
        self.domain = domain

Scrapy puts all the arguments on the spider as attributes, so you can skip the init method completely. Beware: use the getattr method for getting those attributes, so that your code does not break.


class MySpider(Spider):
    name = "myspider"
    start_urls = ('https://httpbin.org/ip',)

    def parse(self, response):
        print(getattr(self, 'category', ''))
        print(getattr(self, 'domain', ''))


Answer 3

Spider arguments are passed while running the crawl command using the -a option. For example, if I want to pass a domain name as an argument to my spider, then I will do this:

scrapy crawl myspider -a domain="http://www.example.com"

And receive the arguments in the spider's constructor:

class MySpider(BaseSpider):
    name = 'myspider'
    def __init__(self, domain='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [domain]
        #

it will work :)


Answer 4

Alternatively, we can use ScrapyD, which exposes an API where we can pass the start_urls and the spider name. ScrapyD has APIs to stop/start/status/list the spiders.

pip install scrapyd scrapyd-deploy
scrapyd
scrapyd-deploy local -p default

scrapyd-deploy will deploy the spider in the form of an egg into the daemon, and it even maintains versions of the spider. While starting the spider you can mention which version of the spider to use.

class MySpider(CrawlSpider):
    name = 'testspider'

    def __init__(self, start_urls, *args, **kwargs):
        self.start_urls = start_urls.split('|')
        super().__init__(*args, **kwargs)

curl http://localhost:6800/schedule.json -d project=default -d spider=testspider -d start_urls="https://www.anyurl...|https://www.anyurl2"

An added advantage is that you can build your own UI to accept the URL and other params from the user, and schedule a task using the above scrapyd schedule API.

Refer to the scrapyd API documentation for more details.


Can scrapy be used to scrape dynamic content from websites that are using AJAX?

Question: Can scrapy be used to scrape dynamic content from websites that are using AJAX?

I have recently been learning Python and am dipping my hand into building a web-scraper. It’s nothing fancy at all; its only purpose is to get the data off of a betting website and have this data put into Excel.

Most of the issues are solvable and I'm having a good little mess around. However, I'm hitting a massive hurdle over one issue. If a site loads a table of horses and lists current betting prices, this information is not in any source file. The clue is that this data is live sometimes, with the numbers being updated obviously from some remote server. The HTML on my PC simply has a hole where their servers are pushing through all the interesting data that I need.

Now my experience with dynamic web content is low, so this thing is something I’m having trouble getting my head around.

I think Java or JavaScript is key; this pops up often.

The scraper is simply an odds comparison engine. Some sites have APIs, but I need this for those that don't. I'm using the scrapy library with Python 2.7.

I do apologize if this question is too open-ended. In short, my question is: how can scrapy be used to scrape this dynamic data so that I can use it? So that I can scrape this betting odds data in real-time?


Answer 0

WebKit-based browsers (like Google Chrome or Safari) have built-in developer tools. In Chrome you can open them via Menu->Tools->Developer Tools. The Network tab allows you to see all information about every request and response:

At the bottom of the picture you can see that I've filtered the requests down to XHR: these are requests made by JavaScript code.

Tip: the log is cleared every time you load a page; the black dot button at the bottom of the picture will preserve the log.

After analyzing requests and responses you can simulate these requests from your web-crawler and extract valuable data. In many cases it will be easier to get your data than parsing HTML, because that data does not contain presentation logic and is formatted to be accessed by javascript code.

Firefox has a similar extension, called Firebug. Some will argue that Firebug is even more powerful, but I like the simplicity of WebKit.
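
A minimal sketch of that workflow in Scrapy, assuming a purely hypothetical JSON endpoint discovered in the Network tab (substitute the real URL, headers, and parameters you observed):

import json

import scrapy

class OddsSpider(scrapy.Spider):
    name = "odds_example"
    # The XHR endpoint found in the developer tools (placeholder URL).
    start_urls = ["http://www.example-bookmaker.com/ajax/odds?page=1"]

    def parse(self, response):
        data = json.loads(response.body)
        for row in data.get("odds", []):
            yield {"horse": row.get("horse"), "price": row.get("price")}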


Answer 1

Here is a simple example of scrapy with an AJAX request. Let's look at the site rubin-kazan.ru.

All messages are loaded with an AJAX request. My goal is to fetch these messages with all their attributes (author, date, …):

When I analyze the source code of the page I can't see all these messages, because the web page uses AJAX technology. But I can, with Firebug from Mozilla Firefox (or an equivalent tool in other browsers), analyze the HTTP request that generates the messages on the web page:

It doesn't reload the whole page but only the parts of the page that contain messages. For this purpose I click an arbitrary page number at the bottom:

And I observe the HTTP request that is responsible for the message body:

After finishing, I analyze the headers of the request (I must note that I'll extract this URL from the source page's var section; see the code below):

And the form data content of the request (the HTTP method is “Post”):

And the content of response, which is a JSON file:

Which presents all the information I’m looking for.

From now on, I must implement all this knowledge in scrapy. Let's define the spider for this purpose:

import re

from scrapy.http import FormRequest
from scrapy.spider import BaseSpider

class spider(BaseSpider):
    name = 'RubiGuesst'
    start_urls = ['http://www.rubin-kazan.ru/guestbook.html']

    def parse(self, response):
        url_list_gb_messages = re.search(r'url_list_gb_messages="(.*)"', response.body).group(1)
        page = 0  # the first AJAX page; increment to fetch later pages
        yield FormRequest('http://www.rubin-kazan.ru' + url_list_gb_messages, callback=self.RubiGuessItem,
                          formdata={'page': str(page + 1), 'uid': ''})

    def RubiGuessItem(self, response):
        json_file = response.body

In the parse function I have the response to the first request. In RubiGuessItem I have the JSON file with all the information.


Answer 2

Many times when crawling we run into problems where content that is rendered on the page is generated with JavaScript, and therefore scrapy is unable to crawl it (e.g. AJAX requests, jQuery craziness).

However, if you use Scrapy along with the web testing framework Selenium, then we are able to crawl anything displayed in a normal web browser.

Some things to note:

  • You must have the Python version of Selenium RC installed for this to work, and you must have set up Selenium properly. Also, this is just a template crawler. You could get much crazier and more advanced with things, but I just wanted to show the basic idea. As the code stands now, you will be doing two requests for any given URL. One request is made by Scrapy and the other is made by Selenium. I am sure there are ways around this so that you could possibly just make Selenium do the one and only request, but I did not bother to implement that, and by doing two requests you get to crawl the page with Scrapy too.

  • This is quite powerful because now you have the entire rendered DOM available for you to crawl, and you can still use all the nice crawling features in Scrapy. This will make for slower crawling of course, but depending on how much you need the rendered DOM it might be worth the wait.

    import time

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    from scrapy.item import Item
    
    from selenium import selenium
    
    class SeleniumSpider(CrawlSpider):
        name = "SeleniumSpider"
        start_urls = ["http://www.domain.com"]
    
        rules = (
            Rule(SgmlLinkExtractor(allow=('\.html', )), callback='parse_page',follow=True),
        )
    
        def __init__(self):
            CrawlSpider.__init__(self)
            self.verificationErrors = []
            self.selenium = selenium("localhost", 4444, "*chrome", "http://www.domain.com")
            self.selenium.start()
    
        def __del__(self):
            self.selenium.stop()
            print self.verificationErrors
            CrawlSpider.__del__(self)
    
        def parse_page(self, response):
            item = Item()
    
            hxs = HtmlXPathSelector(response)
            #Do some XPath selection with Scrapy
            hxs.select('//div').extract()
    
            sel = self.selenium
            sel.open(response.url)
    
            #Wait for javscript to load in Selenium
            time.sleep(2.5)
    
            #Do some crawling of javascript created content with Selenium
            sel.get_text("//div")
            yield item
    
    # Snippet imported from snippets.scrapy.org (which no longer works)
    # author: wynbennett
    # date  : Jun 21, 2011
    

Reference: http://snipplr.com/view/66998/


Answer 3

Another solution would be to implement a download handler or download handler middleware (see the scrapy docs for more information on downloader middleware). The following is an example class using Selenium with the headless PhantomJS webdriver:

1) Define the class within the middlewares.py script.

from selenium import webdriver
from scrapy.http import HtmlResponse

class JsDownload(object):

    @check_spider_middleware
    def process_request(self, request, spider):
        driver = webdriver.PhantomJS(executable_path='D:\phantomjs.exe')
        driver.get(request.url)
        return HtmlResponse(request.url, encoding='utf-8', body=driver.page_source.encode('utf-8'))

2) Add the JsDownload() class to the DOWNLOADER_MIDDLEWARES variable within settings.py:

DOWNLOADER_MIDDLEWARES = {'MyProj.middleware.MiddleWareModule.MiddleWareClass': 500}

3) Integrate the HTMLResponse within your_spider.py. Decoding the response body will get you the desired output.

class Spider(CrawlSpider):
    # define unique name of spider
    name = "spider"

    start_urls = ["https://www.url.de"] 

    def parse(self, response):
        # initialize items
        item = CrawlerItem()

        # store data as items
        item["js_enabled"] = response.body.decode("utf-8") 

Optional Addon:
I wanted the ability to tell different spiders which middleware to use so I implemented this wrapper:

import functools

from scrapy import log

def check_spider_middleware(method):
    @functools.wraps(method)
    def wrapper(self, request, spider):
        msg = '%%s %s middleware step' % (self.__class__.__name__,)
        if self.__class__ in spider.middleware:
            spider.log(msg % 'executing', level=log.DEBUG)
            return method(self, request, spider)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return None
    return wrapper

For the wrapper to work, all spiders must have at minimum:

middleware = set([])

To include a middleware:

middleware = set([MyProj.middleware.ModuleName.ClassName])

Advantage:
The main advantage of implementing it this way rather than in the spider is that you only end up making one request. In A T's solution, for example, the download handler processes the request and then hands off the response to the spider. The spider then makes a brand-new request in its parse_page function: that's two requests for the same content.


Answer 4

I was using a custom downloader middleware, but wasn’t very happy with it, as I didn’t manage to make the cache work with it.

A better approach was to implement a custom download handler.

There is a working example here. It looks like this:

# encoding: utf-8
from __future__ import unicode_literals

from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.responsetypes import responsetypes
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from six.moves import queue
from twisted.internet import defer, threads
from twisted.python.failure import Failure


class PhantomJSDownloadHandler(object):

    def __init__(self, settings):
        self.options = settings.get('PHANTOMJS_OPTIONS', {})

        max_run = settings.get('PHANTOMJS_MAXRUN', 10)
        self.sem = defer.DeferredSemaphore(max_run)
        self.queue = queue.LifoQueue(max_run)

        SignalManager(dispatcher.Any).connect(self._close, signal=signals.spider_closed)

    def download_request(self, request, spider):
        """use semaphore to guard a phantomjs pool"""
        return self.sem.run(self._wait_request, request, spider)

    def _wait_request(self, request, spider):
        try:
            driver = self.queue.get_nowait()
        except queue.Empty:
            driver = webdriver.PhantomJS(**self.options)

        driver.get(request.url)
        # ghostdriver won't response when switch window until page is loaded
        dfd = threads.deferToThread(lambda: driver.switch_to.window(driver.current_window_handle))
        dfd.addCallback(self._response, driver, spider)
        return dfd

    def _response(self, _, driver, spider):
        body = driver.execute_script("return document.documentElement.innerHTML")
        if body.startswith("<head></head>"):  # cannot access response header in Selenium
            body = driver.execute_script("return document.documentElement.textContent")
        url = driver.current_url
        respcls = responsetypes.from_args(url=url, body=body[:100].encode('utf8'))
        resp = respcls(url=url, body=body, encoding="utf-8")

        response_failed = getattr(spider, "response_failed", None)
        if response_failed and callable(response_failed) and response_failed(resp, driver):
            driver.close()
            return defer.fail(Failure())
        else:
            self.queue.put(driver)
            return defer.succeed(resp)

    def _close(self):
        while not self.queue.empty():
            driver = self.queue.get_nowait()
            driver.close()

Suppose your scraper is called “scraper”. If you put the mentioned code inside a file called handlers.py at the root of the “scraper” folder, then you could add this to your settings.py:

DOWNLOAD_HANDLERS = {
    'http': 'scraper.handlers.PhantomJSDownloadHandler',
    'https': 'scraper.handlers.PhantomJSDownloadHandler',
}

And voilà: the JS-parsed DOM, with scrapy caching, retries, etc.


Answer 5

how can scrapy be used to scrape this dynamic data so that I can use it?

I wonder why no one has posted the solution using Scrapy only.

Check out the blog post from the Scrapy team, SCRAPING INFINITE SCROLLING PAGES. The example scrapes the http://spidyquotes.herokuapp.com/scroll website, which uses infinite scrolling.

The idea is to use the Developer Tools of your browser, notice the AJAX requests, and then, based on that information, create the requests for Scrapy:

import json
import scrapy


class SpidyQuotesSpider(scrapy.Spider):
    name = 'spidyquotes'
    quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes?page=%s'
    start_urls = [quotes_base_url % 1]
    download_delay = 1.5

    def parse(self, response):
        data = json.loads(response.body)
        for item in data.get('quotes', []):
            yield {
                'text': item.get('text'),
                'author': item.get('author', {}).get('name'),
                'tags': item.get('tags'),
            }
        if data['has_next']:
            next_page = data['page'] + 1
            yield scrapy.Request(self.quotes_base_url % next_page)

Answer 6

Yes, Scrapy can scrape dynamic websites, i.e. websites that are rendered through JavaScript.

There are two approaches to scraping these kinds of websites.

First,

you can use splash to render the JavaScript code and then parse the rendered HTML. You can find the doc and project here: Scrapy splash, git

Second,

as everyone is stating, by monitoring the network calls: yes, you can find the API call that fetches the data, and mocking that call in your scrapy spider might help you to get the desired data.
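
For the first approach, a minimal scrapy-splash sketch might look like this; it assumes a Splash instance is running on localhost:8050 and that the plugin's SPLASH_URL and middleware settings are configured as described in its README:

import scrapy
from scrapy_splash import SplashRequest

class JsSpider(scrapy.Spider):
    name = "js_example"

    def start_requests(self):
        # Render the page in Splash, giving JavaScript time to run.
        yield SplashRequest("http://www.example.com", self.parse,
                            args={"wait": 2})

    def parse(self, response):
        # response.text now contains the JavaScript-rendered HTML.
        yield {"title": response.css("title::text").extract_first()}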


Answer 7

I handle the AJAX requests by using Selenium and the Firefox web driver. It is not that fast if you need the crawler as a daemon, but much better than any manual solution. I wrote a short tutorial here for reference.


Getting “OSError: [Errno 1] Operation not permitted” when installing Scrapy in OSX 10.11 (El Capitan) (System Integrity Protection)

Question: Getting “OSError: [Errno 1] Operation not permitted” when installing Scrapy in OSX 10.11 (El Capitan) (System Integrity Protection)

I’m trying to install Scrapy Python framework in OSX 10.11 (El Capitan) via pip. The installation script downloads the required modules and at some point returns the following error:

OSError: [Errno 1] Operation not permitted: '/tmp/pip-nIfswi-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six-1.4.1-py2.7.egg-info'

I’ve tried to deactivate the rootless feature in OSX 10.11 with the command:

sudo nvram boot-args="rootless=0";sudo reboot

but I still get the same error when the machine reboots.

Any clue or idea from my fellow StackExchangers?

If it helps, the full script output is the following:

sudo -s pip install scrapy
Collecting scrapy
  Downloading Scrapy-1.0.2-py2-none-any.whl (290kB)
    100% |████████████████████████████████| 290kB 345kB/s 
Requirement already satisfied (use --upgrade to upgrade): cssselect>=0.9 in /Library/Python/2.7/site-packages (from scrapy)
Requirement already satisfied (use --upgrade to upgrade): queuelib in /Library/Python/2.7/site-packages (from scrapy)
Requirement already satisfied (use --upgrade to upgrade): pyOpenSSL in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from scrapy)
Collecting w3lib>=1.8.0 (from scrapy)
  Downloading w3lib-1.12.0-py2.py3-none-any.whl
Collecting lxml (from scrapy)
  Downloading lxml-3.4.4.tar.gz (3.5MB)
    100% |████████████████████████████████| 3.5MB 112kB/s 
Collecting Twisted>=10.0.0 (from scrapy)
  Downloading Twisted-15.3.0.tar.bz2 (4.4MB)
    100% |████████████████████████████████| 4.4MB 94kB/s 
Collecting six>=1.5.2 (from scrapy)
  Downloading six-1.9.0-py2.py3-none-any.whl
Requirement already satisfied (use --upgrade to upgrade): zope.interface>=3.6.0 in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from Twisted>=10.0.0->scrapy)
Requirement already satisfied (use --upgrade to upgrade): setuptools in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from zope.interface>=3.6.0->Twisted>=10.0.0->scrapy)
Installing collected packages: six, w3lib, lxml, Twisted, scrapy
  Found existing installation: six 1.4.1
    DEPRECATION: Uninstalling a distutils installed project (six) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
    Uninstalling six-1.4.1:
Exception:
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/pip-7.1.0-py2.7.egg/pip/basecommand.py", line 223, in main
status = self.run(options, args)
  File "/Library/Python/2.7/site-packages/pip-7.1.0-py2.7.egg/pip/commands/install.py", line 299, in run
root=options.root_path,
  File "/Library/Python/2.7/site-packages/pip-7.1.0-py2.7.egg/pip/req/req_set.py", line 640, in install
requirement.uninstall(auto_confirm=True)
  File "/Library/Python/2.7/site-packages/pip-7.1.0-py2.7.egg/pip/req/req_install.py", line 726, in uninstall
paths_to_remove.remove(auto_confirm)
  File "/Library/Python/2.7/site-packages/pip-7.1.0-py2.7.egg/pip/req/req_uninstall.py", line 125, in remove
renames(path, new_path)
  File "/Library/Python/2.7/site-packages/pip-7.1.0-py2.7.egg/pip/utils/__init__.py", line 314, in renames
shutil.move(old, new)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 302, in move
copy2(src, real_dst)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 131, in copy2
copystat(src, dst)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 103, in copystat
os.chflags(dst, st.st_flags)
OSError: [Errno 1] Operation not permitted: '/tmp/pip-nIfswi-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six-1.4.1-py2.7.egg-info'

Answer 0

I also think it’s absolutely not necessary to start hacking OS X.

I was able to solve it by doing a

brew install python

It seems that using the python/pip that comes with the new El Capitan has some issues.


Answer 1

pip install --ignore-installed six

Would do the trick.

Source: github.com/pypa/pip/issues/3165


Answer 2

As the other answers said, it’s because of the new System Integrity Protection, but I believe the other answers are overcomplicated.

If you're only going to use that package as the current user, you should be able to install it just fine, without the need to disable SIP, by using the --user flag. Like this:

sudo pip install --user packagename

Answer 3

The highly voted answers didn't work for me; they seem to work for El Capitan users. But for macOS Sierra users, try the following steps:

  1. brew install python
  2. sudo pip install --user <package name>

Answer 4

Warnings

I would suggest very strongly against modifying the system Python on Mac; there are numerous issues that can occur.

Your particular error shows that the installer has issues resolving the dependencies for Scrapy without impacting the current Python installation. The system uses Python for a number of essential tasks, so it’s important to keep the system installation stable and as originally installed by Apple.

I would also exhaust all other possibilities before bypassing built-in security.

Package Manager Solutions:

Please look into a Python virtualization tool such as virtualenv first; this will allow you to experiment safely.
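
A minimal sketch of that route, assuming virtualenv is already installed (pip install virtualenv):

virtualenv scrapy_env            # create an isolated environment
source scrapy_env/bin/activate   # activate it in the current shell
pip install scrapy               # installs into the env, not the system Python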

Another useful tool to use languages and software without conflicting with your Mac OS is Homebrew. Like MacPorts or Fink, Homebrew is a package manager for Mac, and is useful for safely trying lots of other languages and tools.

“Roll your own” Software Installs:

If you don’t like the package manager approach, you could use the /usr/local path or create an /opt/local directory for installing an alternate Python installation and fix up your paths in your .bashrc. Note that you’ll have to enable root for these solutions.

How to do it anyway:

If you absolutely must disable the security check (and I sincerely hope it’s for something other than messing with the system languages and resources), you can disable it temporarily and re-enable it using some of the techniques in this post on how to Disable System Integrity-Protection.
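To make the virtualenv route above concrete, here is a minimal sketch; the environment name scrapy-env is hypothetical, and this assumes virtualenv is already installed (for example via pip install --user virtualenv):

virtualenv scrapy-env             # create an isolated Python environment
source scrapy-env/bin/activate    # activate it for this shell session
pip install scrapy                # installs into the env, not the system Python
deactivate                        # leave the environment when done

Nothing here touches the Apple-provided Python, which is the whole point of the recommendation.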


回答 5

这对我有用:

   sudo pip install scrapy --ignore-installed six

This did the trick for me:

   sudo pip install scrapy --ignore-installed six

回答 6

您应禁用“系统完整性保护”,这是El Capitan的一项新功能。

首先,您应该在终端上运行无根配置命令

# nvram boot-args="rootless=0"
# reboot

然后,您应该在恢复分区的终端(恢复操作系统)上运行以下命令

# csrutil disable
# reboot

我就是这样解决问题的。我不确定第一部分是否必要,您可以自行尝试。

– 警告

一切正常后,您应该再次启用SIP。

只需再次重新启动进入恢复模式并在终端中运行

# csrutil enable

csrutil:配置系统完整性保护

You should disable “System Integrity Protection” which is a new feature in El Capitan.

First, you should run the command for rootless config on your terminal

# nvram boot-args="rootless=0"
# reboot

Then, you should run the command below on recovery partition’s terminal (Recovery OS)

# csrutil disable
# reboot

I’ve just solved my problem like that. I’m not sure that the first part is necessary. Try as you like.

–WARNING

You should enable SIP again after everything works;

Simply reboot again into Recovery Mode and run in terminal

# csrutil enable

csrutil: Configuring System Integrity Protection
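Either way, you can confirm the current state from a normal (non-recovery) terminal before and after the change:

csrutil status

On a protected system this prints something like "System Integrity Protection status: enabled."; the exact wording can vary between OS versions.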


回答 7

我尝试在El Capitan上通过pip安装AWS,但出现了这个错误:

OSError: [Errno 1] Operation not permitted: '/var/folders/wm/jhnj0g_s16gb36y8kwvrgm7h0000gp/T/pip-wTnb_D-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six-1.4.1-py2.7.egg-info'

我在这里找到答案

sudo -H pip install awscli --upgrade --ignore-installed six

这个对我有用 :)

I tried to install AWS via pip in El Capitan but this error appear

OSError: [Errno 1] Operation not permitted: '/var/folders/wm/jhnj0g_s16gb36y8kwvrgm7h0000gp/T/pip-wTnb_D-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six-1.4.1-py2.7.egg-info'

I found the answer here

sudo -H pip install awscli --upgrade --ignore-installed six

It works for me :)


回答 8

我在MacOS Sierra上遇到了相同的错误。我按照以下步骤操作,成功安装了scrapy软件包。

1. sudo pip install --ignore-installed six
2. sudo pip install --ignore-installed scrapy

MacBook-Air:~ shree$ scrapy version
Scrapy 1.4.0

I was getting the same error on my MacOS Sierra. I followed these steps and was able to successfully install the scrapy package.

1. sudo pip install --ignore-installed six
2. sudo pip install --ignore-installed scrapy

MacBook-Air:~ shree$ scrapy version
Scrapy 1.4.0

回答 9

这对我有用。

sudo pip install --ignore-installed scrapy

This did the trick for me.

sudo pip install --ignore-installed scrapy


回答 10

尝试了一些答案的组合,最终成功了:

sudo -H pip install --upgrade --ignore-installed awsebcli

干杯

Tried a combination of some answers and this eventually worked:

sudo -H pip install --upgrade --ignore-installed awsebcli

Cheers


回答 11

再次安装python:

brew install python

然后再试一次:

sudo pip install scrapy

对我有效,希望能帮到你。

install python again:

brew install python

try it again:

sudo pip install scrapy

works for me, hope it can help


回答 12

重新启动Mac -> 在开机提示音后按住“Command + R” -> 打开OS X实用工具 -> 打开终端并键入“csrutil disable” -> 重启OS X -> 打开终端并用“csrutil status”检查状态

Restart Mac -> hold down “Command + R” after the startup chime -> Opens OS X Utilities -> Open Terminal and type “csrutil disable” -> Reboot OS X -> Open Terminal and check “csrutil status”


回答 13

这个命令可以很好地工作:D

sudo -H pip install --upgrade package_name --ignore-installed six

This command would work perfectly fine :D

sudo -H pip install --upgrade package_name --ignore-installed six


回答 14

如果您尝试使用pip而不是pip3在python2文件夹中安装python3 lib,有时可能会实现这种行为。

Sometimes such behavior may be achieved if you try to install python3 lib in python2 folder using pip instead of pip3.


回答 15

  1. 关闭SIP(系统完整性保护):重启,开机时按住Command + R进入恢复模式,然后打开终端执行:csrutil disable,再重启

2.

sudo C_INCLUDE_PATH=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include/libxml2:/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include/libxml2/libxml:/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include pip install scrapy --ignore-installed six

3. 然后删除旧的six并重新安装:sudo rm -rf /Library/Python/2.7/site-packages/six* sudo rm -rf /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six* sudo pip install six

4. 然后恢复设置:csrutil enable,并重启

现在scrapy可以正常工作了

  1. — close SIP(system Integrity Protection) — then reboot, use command +R to enter debug mode, then select terminal: csrutil disable reboot

2.

sudo C_INCLUDE_PATH=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include/libxml2:/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include/libxml2/libxml:/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include pip install scrapy --ignore-installed six

3. — then remove old six, install it again sudo rm -rf /Library/Python/2.7/site-packages/six* sudo rm -rf /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six* sudo pip install six

4. — then set it back csrutil enable reboot

— scrapy works now


回答 16

它对我有用:

pip install scrapy --user -U

It worked for me:

pip install scrapy --user -U

回答 17

我在其他地方缺少依赖项,因此我为项目安装了其他要求,如下所示:

pip install --user -r requirements.txt

I was missing a dependency somewhere else along the line, so I installed the other requirements for the project like this:

pip install --user -r requirements.txt
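For readers unfamiliar with the format: requirements.txt is a plain-text list of packages, one per line, optionally pinned to versions. The entries below are purely illustrative and not from the original poster's project:

six>=1.4.1
lxml>=3.4.0
Scrapy==1.4.0

When called with --user -r, pip resolves and installs each line into your per-user site-packages.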

无法在Mac OS X 10.9上安装Lxml

问题:无法在Mac OS X 10.9上安装Lxml

我想安装Lxml,以便随后可以安装Scrapy。

今天更新Mac之后,我无法重新安装lxml,并出现以下错误:

In file included from src/lxml/lxml.etree.c:314:
/private/tmp/pip_build_root/lxml/src/lxml/includes/etree_defs.h:9:10: fatal error: 'libxml/xmlversion.h' file not found
#include "libxml/xmlversion.h"
         ^
1 error generated.
error: command 'cc' failed with exit status 1

我尝试使用brew安装libxml2和libxslt,两者都安装良好,但仍然无法安装lxml。

上次安装时,我需要在Xcode上启用开发人员工具,但自从更新到Xcode 5后,不再提供该选项。

有人知道我需要做什么吗?

I want to install Lxml so I can then install Scrapy.

When I updated my Mac today it wouldn’t let me reinstall lxml, I get the following error:

In file included from src/lxml/lxml.etree.c:314:
/private/tmp/pip_build_root/lxml/src/lxml/includes/etree_defs.h:9:10: fatal error: 'libxml/xmlversion.h' file not found
#include "libxml/xmlversion.h"
         ^
1 error generated.
error: command 'cc' failed with exit status 1

I have tried using brew to install libxml2 and libxslt, both installed fine but I still cannot install lxml.

Last time I was installing I needed to enable the developer tools on Xcode, but since it’s updated to Xcode 5 it doesn’t give me that option anymore.

Does anyone know what I need to do?


回答 0

您应该为xcode安装或升级命令行工具。在终端上尝试一下:

xcode-select --install

You should install or upgrade the command line tools for Xcode. Try this in a terminal:

xcode-select --install
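To verify the tools actually landed, xcode-select can print the active developer directory; a quick check:

xcode-select -p

Expect a path such as /Library/Developer/CommandLineTools or /Applications/Xcode.app/Contents/Developer; an error here means the install did not complete.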

回答 1

我在Yosemite上通过brew安装并链接libxml2和libxslt解决了这个问题:

brew install libxml2
brew install libxslt
brew link libxml2 --force
brew link libxslt --force

如果用此方法解决了问题,但之后又再次出现,则可能需要在运行以上四行之前先执行:

brew unlink libxml2
brew unlink libxslt

如果您在Homebrew上遇到权限错误,尤其是在El Capitan上,则这是一个有用的文档。本质上,无论OS X是什么版本,都请尝试运行:

sudo chown -R $(whoami):admin /usr/local

I solved this issue on Yosemite by both installing and linking libxml2 and libxslt through brew:

brew install libxml2
brew install libxslt
brew link libxml2 --force
brew link libxslt --force

If you have solved the problem using this method but it pops up again at a later time, you might need to run this before the four lines above:

brew unlink libxml2
brew unlink libxslt

If you are having permission errors with Homebrew, especially on El Capitan, this is a helpful document. In essence, regardless of OS X version, try running:

sudo chown -R $(whoami):admin /usr/local
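As a sanity check after linking, the xml2-config helper that ships with libxml2 should now resolve to the brewed copy; a quick diagnostic:

xml2-config --version   # the libxml2 version lxml will compile against
xml2-config --cflags    # include path, which should point under /usr/local

If these still report the system paths, the brew link step did not take effect.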

回答 2

您可以通过在命令行上运行此命令来解决问题:

 STATIC_DEPS=true pip install lxml

它确实帮到了我。详见文档中的说明。

You may solve your problem by running this on the commandline:

 STATIC_DEPS=true pip install lxml

It sure helped me. Explanations are in the docs.
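For background: STATIC_DEPS=true tells lxml's build to download libxml2 and libxslt sources and compile them statically into the extension, sidestepping broken system copies. The docs also describe pinning the downloaded versions; treat the variable names and version numbers below as assumptions to verify against http://lxml.de/installation.html:

STATIC_DEPS=true LIBXML2_VERSION=2.9.1 LIBXSLT_VERSION=1.1.28 pip install lxml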


回答 3

我尝试了上面的大多数解决方案,但没有一个对我有用。我正在运行Yosemite 10.10,唯一适用于我的解决方案是在终端中键入以下内容:

sudo CPATH=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml2 CFLAGS=-Qunused-arguments CPPFLAGS=-Qunused-arguments pip install lxml

编辑:如果您使用virtualenv,则不需要开头的sudo。

I tried most of the solutions above, but none of them worked for me. I’m running Yosemite 10.10, the only solution that worked for me was to type this in the terminal:

sudo CPATH=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml2 CFLAGS=-Qunused-arguments CPPFLAGS=-Qunused-arguments pip install lxml

EDIT: If you are using virtualenv, the sudo in beginning is not needed.


回答 4

这个问题也困扰了我一段时间。我不太了解python distutils等的内部机制,但这里的include路径是错误的。在python lxml的开发者做出正确修复之前,我先用下面这个丑陋的hack顶着。

sudo ln -s  /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml2/libxml/ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml

This has been bothering me as well for a while. I don’t know the internals enough about python distutils etc, but the include path here is wrong. I made the following ugly hack to hold me over until the python lxml people can do the proper fix.

sudo ln -s  /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml2/libxml/ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml

回答 5

全局安装… OS X 10.9.2

xcode-select --install
sudo easy_install pip
sudo CPATH=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml2 CFLAGS=-Qunused-arguments CPPFLAGS=-Qunused-arguments pip install lxml

Installing globally… OS X 10.9.2

xcode-select --install
sudo easy_install pip
sudo CPATH=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml2 CFLAGS=-Qunused-arguments CPPFLAGS=-Qunused-arguments pip install lxml

回答 6

http://lxml.de/installation.html上的安装说明指出:

为了加快测试环境(例如,在持续集成服务器上)的构建速度,请通过设置CFLAGS环境变量来禁用C编译器优化:

CFLAGS="-O0" pip install lxml

The installation instructions on http://lxml.de/installation.html explain:

To speed up the build in test environments, e.g. on a continuous integration server, disable the C compiler optimisations by setting the CFLAGS environment variable:

CFLAGS="-O0" pip install lxml

回答 7

上面这些方法在10.9.2上对我都不起作用,因为编译会因以下错误而失败:

clang: error: unknown argument: '-mno-fused-madd' 

这实际上导致最干净的解决方案(请参见[1]中的更多详细信息):

export CFLAGS=-Qunused-arguments
export CPPFLAGS=-Qunused-arguments

pip install lxml

如果是全局安装,则改用:

sudo pip install lxml

[1] clang错误:未知参数:'-mno-fused-madd'(python软件包安装失败)

None of the above worked for me on 10.9.2, as compilation bails out with following error:

clang: error: unknown argument: '-mno-fused-madd' 

Which actually lead to cleanest solution (see more details in [1]):

export CFLAGS=-Qunused-arguments
export CPPFLAGS=-Qunused-arguments

pip install lxml

or following if installing globally

sudo pip install lxml

[1] clang error: unknown argument: ‘-mno-fused-madd’ (python package installation failure)


回答 8

xcode-select --install
sudo easy_install pip
sudo pip install lxml
xcode-select --install
sudo easy_install pip
sudo pip install lxml

回答 9

我在Yosemite上通过运行以下命令解决了这个问题:

xcode-select --install #this may take several minutes.
pip install lxml

I solved this issue on Yosemite by running the following commands:

xcode-select --install #this may take several minutes.
pip install lxml

回答 10

使用Homebrew时,libxml2不会被链接(以免干扰系统自带的libxml2),因此必须给pip一点提示才能找到它。

用bash:

LDFLAGS=-L`brew --prefix libxml2`/lib CPPFLAGS=-I`brew --prefix libxml2`/include/libxml2 pip install --user lxml

用fish:

env LDFLAGS=-L(brew --prefix libxml2)/lib CPPFLAGS=-I(brew --prefix libxml2)/include/libxml2 pip install --user lxml

With homebrew, libxml2 is hidden to not interfere with the system libxml2, so pip must be helped a little in order to find it.

With bash:

LDFLAGS=-L`brew --prefix libxml2`/lib CPPFLAGS=-I`brew --prefix libxml2`/include/libxml2 pip install --user lxml

With fish:

env LDFLAGS=-L(brew --prefix libxml2)/lib CPPFLAGS=-I(brew --prefix libxml2)/include/libxml2 pip install --user lxml
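Whichever shell you use, a quick way to confirm the build linked against a working libxml2 is to import the compiled module and print the version tuples lxml.etree exposes:

python -c "import lxml.etree as e; print(e.LXML_VERSION, e.LIBXML_VERSION)"

If the import itself fails, the flags above did not reach the compiler.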

回答 11

OSX 10.9.2

sudo env ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future STATIC_DEPS=true pip install lxml

OSX 10.9.2

sudo env ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future STATIC_DEPS=true pip install lxml

回答 12

我尝试了此页面上的所有答案,但没有一个对我有用。我运行的是OS X 10.9.2。

但下面这个命令绝对有效……非常顺利:

ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future pip install lxml

I tried all the answers on this page, none of them worked for me. I’m running OS X Version 10.9.2

But this definitely works….like a charm:

ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future pip install lxml


回答 13

不幸的是,由于我已经装有最新版本,xcode-select --install对我不起作用。

说来奇怪,但我通过打开XCode并接受条款与条件解决了这个问题。之后重新运行pip install lxml不再报任何错误。

Unfortunately xcode-select --install did not work for me as I already had the latest version.

It’s very strange but I solved the issue by opening XCode and accepting the Terms & Conditions. Re-running pip install lxml returned no errors after.


回答 14

从pip(lxml 3.6.4)成功安装后,导入lxml.etree模块时出现错误。

我一直在不停地搜索以将其安装为scrapy的必要条件,并尝试了所有选项,但最终这对我有用(mac osx 10.11 python 2.7):

$ STATIC_DEPS=true sudo easy_install-2.7 "lxml==2.3.5"

较旧的lxml版本似乎可以与etree模块一起使用。

Pip有时会忽略软件包的指定版本,例如当pip缓存中已有较新版本时,因此这里改用easy_install。'-2.7'对应python版本;如果要为python 3.x安装,请省略它。

After successful install from pip (lxml 3.6.4) I was getting an error when importing the lxml.etree module.

I was searching endlessly to install this as a requisite for scrapy, and tried all the options, but finally this worked for me (mac osx 10.11 python 2.7):

$ STATIC_DEPS=true sudo easy_install-2.7 "lxml==2.3.5"

The older version of lxml seem to work with etree module.

Pip can often ignore the specified version of a package, for example when you have the newer version in the pip cache, thus the easy_install. The '-2.7' option is for python version, omit this if you are installing for python 3.x.
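If you would rather stay on pip, the cache-shadowing problem described above can also be bypassed directly; --no-cache-dir makes pip ignore its local cache when resolving a pinned version (a sketch, assuming a pip recent enough to support the flag):

pip install --no-cache-dir "lxml==2.3.5"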


回答 15

就我而言,必须先关闭Kaspersky Antivirus,然后才能通过以下命令安装lxml:

pip install lxml

In my case, I must shutdown Kaspersky Antivirus before installing lxml by:

pip install lxml

回答 16

我正在使用OSX 10.9.2,但出现相同的错误。

在这个特定版本的OSX上,安装XCode命令行工具并不能解决问题。

我认为解决此问题的更好方法是使用以下命令进行安装:

$ CPATH=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml2 pip install lxml

这类似于jdkoftinoff的修复程序,但不会永久更改系统。

I am using OSX 10.9.2 and I get the same error.

Installation of the XCode command line tools does not help for this particular version of OSX.

I think a better approach to fix this is to install with the following command:

$ CPATH=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml2 pip install lxml

This is similar to jdkoftinoff’s fix, but does not alter your system in a permanent way.


回答 17

我遇到了同样的问题,经过几天的工作,我在OS X 10.9.4上使用Python 3.4.1解决了这个问题。

这是我的解决方案,

根据lxml.de上的lxml安装说明:

MacPorts上提供了lxml的移植版本。可以尝试类似port install py25-lxml的命令。

如果您没有MacPorts,请从MacPort.org安装它,这很容易。您可能还需要编译器;要安装XCode编译工具,请使用xcode-select --install。

首先,我通过sudo port selfupdate将port更新到最新版本。

然后输入sudo port install libxml2,几分钟后应该就能看到libxml2安装成功。安装lxml可能还需要libxslt;要安装libxslt,请使用:sudo port install libxslt。

现在,只需键入pip install lxml,它应该可以正常工作。

I met the same question and after days of working I resolved this problem on my OS X 10.9.4, with Python 3.4.1.

Here’s my solution,

According to installing lxml from lxml.de,

A macport of lxml is available. Try something like port install py25-lxml

If you do not have MacPort, install it from MacPort.org. It’s quite easy. You may also need a compiler, to install XCode compiling tools, use xcode-select --install

Firstly I updated my port to the latest version via sudo port selfupdate,

Then I just typed sudo port install libxml2 and several minutes later you should see libxml2 installed successfully. You may also need libxslt to install lxml. To install libxslt, use: sudo port install libxslt.

Now, just type pip install lxml, it should work fine.
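Condensed into one runnable sequence (per the steps in this answer; the final pip step succeeds only if the MacPorts headers are picked up, as the author reports):

sudo port selfupdate                 # bring the ports tree up to date
sudo port install libxml2 libxslt    # the C libraries lxml compiles against
pip install lxml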


回答 18

在编译之前,将xmlversion.h的路径添加到您的环境中。

$ export INCLUDE=$INCLUDE:/private/tmp/pip_build_root/lxml/src/lxml/

但是请确保我提供的路径中包含xmlversion.h文件。然后,

$ python setup.py install

before compiling add the path that to xmlversion.h into your environment.

$ export INCLUDE=$INCLUDE:/private/tmp/pip_build_root/lxml/src/lxml/

But make sure the path I’ve provided has the xmlversion.h file located inside. Then,

$ python setup.py install

回答 19

pip对我不起作用。我访问了 https://pypi.python.org/pypi/lxml/2.3 并下载了macosx .egg文件:

https://pypi.python.org/packages/2.7/l/lxml/lxml-2.3-py2.7-macosx-10.6-intel.egg#md5=52322e4698d68800c6b6aedb0dbe5f34

然后使用命令行easy_install安装.egg文件。

pip did not work for me. I went to https://pypi.python.org/pypi/lxml/2.3 and downloaded the macosx .egg file:

https://pypi.python.org/packages/2.7/l/lxml/lxml-2.3-py2.7-macosx-10.6-intel.egg#md5=52322e4698d68800c6b6aedb0dbe5f34

Then used command line easy_install to install the .egg file.


回答 20

这篇文章链接到一个对我有效的解决方案,针对Mac OS X 10.9上Python3、lxml出现“Symbol not found: _lzma_auto_decoder”错误的问题。

hth


回答 21

在一番抓耳挠腮、咬牙切齿之后,我用pip卸载了xcode并运行:

easy_install lxml

一切都很好。

After much tearing of the hair and gnashing of the teeth, I uninstalled xcode with pip and ran:

easy_install lxml

And all was well.


回答 22

尝试:

% STATIC_DEPS=true pip install lxml

或者:

% STATIC_DEPS=true sudo pip install lxml

有用!

Try:

% STATIC_DEPS=true pip install lxml

Or:

% STATIC_DEPS=true sudo pip install lxml

It works!