Tag Archives: scrapy

Difference between BeautifulSoup and Scrapy crawler?

Question: Difference between BeautifulSoup and Scrapy crawler?

I want to make a website that shows a comparison between Amazon and eBay product prices. Which of these will work better, and why? I am somewhat familiar with BeautifulSoup, but not so much with the Scrapy crawler.


Answer 0

Scrapy is a web-spider or web-scraper framework. You give Scrapy a root URL to start crawling, then you can specify constraints on how many URLs you want to crawl and fetch, etc. It is a complete framework for web scraping or crawling.

While

BeautifulSoup is a parsing library which also does a pretty good job of fetching content from a URL and allows you to parse certain parts of it without any hassle. It only fetches the contents of the URL that you give it and then stops. It does not crawl unless you manually put it inside an infinite loop with certain criteria.

In simple words, with Beautiful Soup you can build something similar to Scrapy. Beautiful Soup is a library while Scrapy is a complete framework.

Source
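
To make the contrast concrete, here is a minimal sketch of the library-style workflow described above, assuming the requests and beautifulsoup4 packages are installed; the URL and the .price selector are hypothetical placeholders. It fetches one page, parses it, and stops:

import requests
from bs4 import BeautifulSoup

# One fetch, one parse -- BeautifulSoup does not crawl on its own.
response = requests.get("http://www.example.com/some-product-page")
soup = BeautifulSoup(response.text, "html.parser")

# Extract the parts you care about (hypothetical .price elements).
for price_tag in soup.select(".price"):
    print(price_tag.get_text(strip=True))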


Answer 1

I think both are good. I'm doing a project right now that uses both. First I scrape all the pages using Scrapy and save them in a MongoDB collection using its pipelines, also downloading the images that exist on each page. After that I use BeautifulSoup4 for post-processing, where I must change attribute values and get some special tags.

If you don't know which product pages you want, a good tool will be Scrapy, since you can use its crawlers to run over the whole Amazon/eBay website looking for the products without writing an explicit for loop.

Take a look at the Scrapy documentation; it's very simple to use.
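
As a rough illustration of the post-processing step described above, here is a minimal sketch; it assumes the HTML was already saved to disk by the Scrapy stage, and the file name, attribute rewrite, and class name are hypothetical:

from bs4 import BeautifulSoup

# Load a page that the Scrapy stage already downloaded (hypothetical path).
with open("page_saved_by_scrapy.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Change attribute values, e.g. point image sources at local copies.
for img in soup.find_all("img"):
    if img.get("src"):
        img["src"] = "images/" + img["src"].split("/")[-1]

# Get some special tags.
special_tags = soup.find_all("span", class_="special")
print(len(special_tags), "special tags found")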


Answer 2

Both are used to parse data.

Scrapy:

  • Scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.
  • But it has some limitations when data comes from JavaScript or is loaded dynamically; we can overcome that by using packages like Splash, Selenium, etc.

BeautifulSoup:

  • Beautiful Soup is a Python library for pulling data out of HTML and XML files.

  • We can use this package to get data out of JavaScript-driven or dynamically loaded pages.

Scrapy with BeautifulSoup is one of the best combos we can work with for scraping static and dynamic content.
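
A minimal sketch of that combo, with Scrapy doing the fetching and BeautifulSoup the parsing (the URL and tag names are placeholders):

import scrapy
from bs4 import BeautifulSoup

class ComboSpider(scrapy.Spider):
    name = "combo_example"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        # Hand the downloaded HTML to BeautifulSoup for parsing.
        soup = BeautifulSoup(response.text, "html.parser")
        for heading in soup.find_all("h2"):
            yield {"heading": heading.get_text(strip=True)}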


Answer 3

The way I do it is to use the eBay/Amazon APIs rather than scrapy, and then parse the results using BeautifulSoup.

The APIs give you an official way of getting the same data that you would have got from a scrapy crawler, with no need to worry about hiding your identity, messing about with proxies, etc.
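
As a rough sketch of that approach (the endpoint, parameters, and tag names below are purely hypothetical placeholders; consult the real eBay/Amazon API documentation for the actual ones):

import requests
from bs4 import BeautifulSoup

# Hypothetical product-search endpoint standing in for a real API call.
resp = requests.get("https://api.example.com/products",
                    params={"keywords": "laptop"})

# Many such APIs return XML; the "xml" parser requires lxml to be installed.
soup = BeautifulSoup(resp.text, "xml")
for item in soup.find_all("item"):
    print(item.find("title").get_text(), item.find("price").get_text())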


Answer 4

Scrapy: It is a web-scraping framework which comes with tons of goodies that make scraping easier, so that we can focus on the crawling logic only. Some of my favourite things Scrapy takes care of for us are below.

  • Feed exports: It basically allows us to save data in various formats like CSV, JSON, JSON lines and XML.
  • Asynchronous scraping: Scrapy uses the Twisted framework, which gives us the power to visit multiple URLs at once, where each request is processed in a non-blocking way (basically we don't have to wait for one request to finish before sending another).
  • Selectors: This is where we can compare Scrapy with Beautiful Soup. Selectors are what allow us to select particular data from the webpage (a heading, a certain div with a class name, etc.). Scrapy uses lxml for parsing, which is much faster than Beautiful Soup.
  • Setting proxy, user agent, headers etc.: Scrapy allows us to set and rotate proxies and other headers dynamically.

  • Item Pipelines: Pipelines enable us to process data after extraction. For example, we can configure a pipeline to push data to your MySQL server (see the sketch after this list).

  • Cookies: scrapy automatically handles cookies for us.

etc.
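
A minimal item-pipeline sketch for the MySQL example above, assuming the pymysql package and an existing items table with a price column (both hypothetical); it would be enabled through the ITEM_PIPELINES setting:

import pymysql

class MySQLPipeline:
    def open_spider(self, spider):
        # Connection details are placeholders.
        self.conn = pymysql.connect(host="localhost", user="user",
                                    password="secret", database="scraping")

    def process_item(self, item, spider):
        with self.conn.cursor() as cursor:
            cursor.execute("INSERT INTO items (price) VALUES (%s)",
                           (item.get("price"),))
        self.conn.commit()
        return item  # hand the item on to any later pipelines

    def close_spider(self, spider):
        self.conn.close()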

TLDR: Scrapy is a framework that provides everything one might need to build large-scale crawlers. It provides various features that hide the complexity of crawling the web. One can simply start writing web crawlers without worrying about the setup burden.

Beautiful Soup: Beautiful Soup is a Python package for parsing HTML and XML documents. So with Beautiful Soup you can parse a webpage that has already been downloaded. BS4 is very popular and old. Unlike Scrapy, you cannot use Beautiful Soup alone to make crawlers. You will need other libraries like requests, urllib, etc. to make crawlers with bs4. Again, this means you would need to manage the list of URLs being crawled and to be crawled, handle cookies, manage proxies, handle errors, and create your own functions to push data to CSV, JSON, XML, etc. If you want to speed things up, you will have to use other libraries like multiprocessing.

To sum up.

  • Scrapy is a rich framework that you can use to start writing crawlers without any hassle.

  • Beautiful Soup is a library that you can use to parse a webpage. It cannot be used alone to scrape the web.

You should definitely use Scrapy for your Amazon and eBay product-price comparison website. You could build a database of URLs and run the crawler every day (cron jobs, or Celery for scheduling crawls) and update the prices in your database. This way your website will always pull from the database, and the crawler and database will act as separate components.


Answer 5

BeautifulSoup is a library that lets you extract information from a web page.

Scrapy, on the other hand, is a framework, which does the above and many more things you will probably need in your scraping project, like pipelines for saving data.

You can check this blog to get started with Scrapy https://www.inkoop.io/blog/web-scraping-using-python-and-scrapy/


Answer 6

Using Scrapy you can save tons of code and start with structured programming. If you don't like any of Scrapy's pre-written methods, BeautifulSoup can be used in place of the Scrapy method. A big project takes advantage of both.


Answer 7

The differences are many and selection of any tool/technology depends on individual needs.

Few major differences are:

  1. BeautifulSoup is comparatively easier to learn than Scrapy.
  2. The extensions, support, and community are larger for Scrapy than for BeautifulSoup.
  3. Scrapy should be considered a spider, while BeautifulSoup is a parser.

How to debug a Scrapy project using PyCharm

Question: How to debug a Scrapy project using PyCharm

I am working on Scrapy 0.20 with Python 2.7. I found that PyCharm has a good Python debugger, and I want to test my Scrapy spiders using it. Does anyone know how to do that?

What I have tried

Actually I tried to run the spider as a script. As a result, I built that script. Then I tried to add my Scrapy project to PyCharm like this:
File->Setting->Project structure->Add content root.

But I don't know what else I have to do.


Answer 0

The scrapy command is a Python script, which means you can start it from inside PyCharm.

When you examine the scrapy binary (which scrapy) you will notice that this is actually a python script:

#!/usr/bin/python

from scrapy.cmdline import execute
execute()

This means that a command like scrapy crawl IcecatCrawler can also be executed like this: python /Library/Python/2.7/site-packages/scrapy/cmdline.py crawl IcecatCrawler

Try to find the scrapy.cmdline package. In my case the location was here: /Library/Python/2.7/site-packages/scrapy/cmdline.py

Create a run/debug configuration inside PyCharm with that script as the script. Fill the script parameters with the scrapy command and spider, in this case crawl IcecatCrawler.

Like this:

Put your breakpoints anywhere in your crawling code and it should work™.


Answer 1

You just need to do this.

Create a Python file in the crawler folder of your project. I used main.py.

  • Project
    • Crawler
      • Crawler
        • Spiders
      • main.py
      • scrapy.cfg

Inside your main.py, put the code below.

from scrapy import cmdline    
cmdline.execute("scrapy crawl spider".split())

And you need to create a “Run Configuration” to run your main.py.

This way, if you put a breakpoint in your code, execution will stop there.


Answer 2

As of 2018.1 this became a lot easier. You can now select Module name in your project’s Run/Debug Configuration. Set this to scrapy.cmdline and the Working directory to the root dir of the scrapy project (the one with settings.py in it).

Like so:

Now you can add breakpoints to debug your code.
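
A sketch of what that configuration might contain (the spider name is a hypothetical placeholder):

Module name:       scrapy.cmdline
Parameters:        crawl myspider
Working directory: /path/to/your/scrapy/project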


Answer 3

I am running scrapy in a virtualenv with Python 3.5.0 and setting the “script” parameter to /path_to_project_env/env/bin/scrapy solved the issue for me.


Answer 4

IntelliJ IDEA also works.

Create main.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy import cmdline

def main(name):
    if name:
        cmdline.execute(name.split())

if __name__ == '__main__':
    print('[*] beginning main thread')
    name = "scrapy crawl stack"
    #name = "scrapy crawl spa"
    main(name)
    print('[*] main thread exited')
    print('main stop====================================================')

Shown below:


Answer 5

To add a bit to the accepted answer, after almost an hour I found I had to select the correct Run Configuration from the dropdown list (near the center of the icon toolbar), then click the Debug button in order to get it to work. Hope this helps!


Answer 6

I am also using PyCharm, but I am not using its built-in debugging features.

For debugging I am using ipdb. I set up a keyboard shortcut to insert import ipdb; ipdb.set_trace() on any line I want the break point to happen.

Then I can type n to execute the next statement, s to step into a function, type any object name to see its value, alter the execution environment, type c to continue execution…

This is very flexible, works in environments other than PyCharm, where you don’t control the execution environment.

Just run pip install ipdb in your virtual environment and place import ipdb; ipdb.set_trace() on the line where you want execution to pause.
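
A minimal sketch of what that looks like inside a spider (the spider name and URL are placeholders):

import scrapy

class DebugSpider(scrapy.Spider):
    name = "debug_example"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        import ipdb; ipdb.set_trace()  # execution pauses here
        title = response.css("title::text").extract_first()
        yield {"title": title}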

UPDATE

You can also pip install pdbpp and use the standard import pdb; pdb.set_trace() instead of ipdb. PDB++ is nicer in my opinion.


Answer 7

According to the documentation https://doc.scrapy.org/en/latest/topics/practices.html

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

Answer 8

I use this simple script:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

process.crawl('your_spider_name')
process.start()

Answer 9

Extending @Rodrigo's version of the answer, I added this script, and now I can set the spider name from the run configuration instead of changing it in the string.

import sys
from scrapy import cmdline

cmdline.execute(f"scrapy crawl {sys.argv[1]}".split())

How to pass a user-defined argument to a Scrapy spider

Question: How to pass a user-defined argument to a Scrapy spider

I am trying to pass a user-defined argument to a Scrapy spider. Can anyone suggest how to do that?

I read about a parameter -a somewhere but have no idea how to use it.


Answer 0

Spider arguments are passed in the crawl command using the -a option. For example:

scrapy crawl myspider -a category=electronics -a domain=system

Spiders can access arguments as attributes:

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category='', **kwargs):
        self.start_urls = [f'http://www.example.com/{category}']  # py36
        super().__init__(**kwargs)  # python3

    def parse(self, response):
        self.log(self.domain)  # system

Taken from the Scrapy doc: http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments

Update 2013: Add second argument

Update 2015: Adjust wording

Update 2016: Use newer base class and add super, thanks @Birla

Update 2017: Use Python3 super

# previously
super(MySpider, self).__init__(**kwargs)  # python2

Update 2018: As @eLRuLL points out, spiders can access arguments as attributes


Answer 1

Previous answers were correct, but you don't have to declare the constructor (__init__) every time you want to code a Scrapy spider; you can just specify the parameters as before:

scrapy crawl myspider -a parameter1=value1 -a parameter2=value2

and in your spider code you can just use them as spider arguments:

class MySpider(Spider):
    name = 'myspider'
    ...
    def parse(self, response):
        ...
        if self.parameter1 == value1:
            # this is True

        # or also (the attribute name must be a string)
        if getattr(self, 'parameter2') == value2:
            # this is also True

And it just works.


Answer 2

To pass arguments with the crawl command:

scrapy crawl myspider -a category='mycategory' -a domain='example.com'

To pass arguments to run on scrapyd, replace -a with -d:

curl http://your.ip.address.here:port/schedule.json -d spider=myspider -d category='mycategory' -d domain='example.com'

The spider will receive arguments in its constructor.


class MySpider(Spider):
    name = "myspider"

    def __init__(self, category='', domain='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.category = category
        self.domain = domain

Scrapy puts all the arguments on the spider as attributes, so you can skip the init method completely. Beware: use the getattr method for getting those attributes, so that your code does not break.


class MySpider(Spider):
    name = "myspider"
    start_urls = ('https://httpbin.org/ip',)

    def parse(self, response):
        print(getattr(self, 'category', ''))
        print(getattr(self, 'domain', ''))


Answer 3

Spider arguments are passed while running the crawl command using the -a option. For example, if I want to pass a domain name as an argument to my spider, then I will do this:

scrapy crawl myspider -a domain="http://www.example.com"

And receive the arguments in the spider's constructor:

class MySpider(BaseSpider):
    name = 'myspider'
    def __init__(self, domain='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [domain]
        #

it will work :)


Answer 4

Alternatively, we can use ScrapyD, which exposes an API where we can pass the start_urls and the spider name. ScrapyD has APIs to stop/start/status/list the spiders.

pip install scrapyd scrapyd-deploy
scrapyd
scrapyd-deploy local -p default

scrapyd-deploy will deploy the spider in the form of an egg into the daemon, and it even maintains versions of the spider. While starting the spider you can mention which version of the spider to use.

class MySpider(CrawlSpider):
    name = 'testspider'

    def __init__(self, start_urls, *args, **kwargs):
        self.start_urls = start_urls.split('|')
        super().__init__(*args, **kwargs)

curl http://localhost:6800/schedule.json -d project=default -d spider=testspider -d start_urls="https://www.anyurl...|https://www.anyurl2"

An added advantage is that you can build your own UI to accept the URL and other params from the user, and schedule a task using the above scrapyd schedule API.

Refer to the scrapyd API documentation for more details.


Can scrapy be used to scrape dynamic content from websites that are using AJAX?

Question: Can scrapy be used to scrape dynamic content from websites that are using AJAX?

I have recently been learning Python and am dipping my hand into building a web-scraper. It’s nothing fancy at all; its only purpose is to get the data off of a betting website and have this data put into Excel.

Most of the issues are solvable and I'm having a good little mess around. However, I'm hitting a massive hurdle over one issue. If a site loads a table of horses and lists current betting prices, this information is not in any source file. The clue is that this data is live sometimes, with the numbers being updated obviously from some remote server. The HTML on my PC simply has a hole where their servers are pushing through all the interesting data that I need.

Now my experience with dynamic web content is low, so this thing is something I’m having trouble getting my head around.

I think Java or JavaScript is key; this pops up often.

The scraper is simply an odds comparison engine. Some sites have APIs, but I need this for those that don't. I'm using the scrapy library with Python 2.7.

I do apologize if this question is too open-ended. In short, my question is: how can scrapy be used to scrape this dynamic data so that I can use it? So that I can scrape this betting odds data in real-time?


Answer 0

WebKit-based browsers (like Google Chrome or Safari) have built-in developer tools. In Chrome you can open them via Menu->Tools->Developer Tools. The Network tab allows you to see all information about every request and response:

At the bottom of the picture you can see that I've filtered the requests down to XHR: these are requests made by JavaScript code.

Tip: the log is cleared every time you load a page; the black dot button at the bottom of the picture will preserve the log.

After analyzing requests and responses you can simulate these requests from your web-crawler and extract valuable data. In many cases it will be easier to get your data than parsing HTML, because that data does not contain presentation logic and is formatted to be accessed by javascript code.

Firefox has a similar extension, called Firebug. Some will argue that Firebug is even more powerful, but I like the simplicity of WebKit.
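
A minimal sketch of that workflow in Scrapy, assuming a purely hypothetical JSON endpoint discovered in the Network tab (substitute the real URL, headers, and parameters you observed):

import json

import scrapy

class OddsSpider(scrapy.Spider):
    name = "odds_example"
    # The XHR endpoint found in the developer tools (placeholder URL).
    start_urls = ["http://www.example-bookmaker.com/ajax/odds?page=1"]

    def parse(self, response):
        data = json.loads(response.body)
        for row in data.get("odds", []):
            yield {"horse": row.get("horse"), "price": row.get("price")}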


Answer 1

Here is a simple example of scrapy with an AJAX request. Let's look at the site rubin-kazan.ru.

All messages are loaded with an AJAX request. My goal is to fetch these messages with all their attributes (author, date, …):

When I analyze the source code of the page I can't see all these messages, because the web page uses AJAX technology. But I can, with Firebug from Mozilla Firefox (or an equivalent tool in other browsers), analyze the HTTP request that generates the messages on the web page:

It doesn't reload the whole page but only the parts of the page that contain messages. For this purpose I click an arbitrary page number at the bottom:

And I observe the HTTP request that is responsible for the message body:

After finishing, I analyze the headers of the request (I must note that I'll extract this URL from the source page's var section; see the code below):

And the form data content of the request (the HTTP method is “Post”):

And the content of response, which is a JSON file:

Which presents all the information I’m looking for.

From now on, I must implement all this knowledge in scrapy. Let's define the spider for this purpose:

import re

from scrapy.http import FormRequest
from scrapy.spider import BaseSpider

class spider(BaseSpider):
    name = 'RubiGuesst'
    start_urls = ['http://www.rubin-kazan.ru/guestbook.html']

    def parse(self, response):
        url_list_gb_messages = re.search(r'url_list_gb_messages="(.*)"', response.body).group(1)
        page = 0  # the first AJAX page; increment to fetch later pages
        yield FormRequest('http://www.rubin-kazan.ru' + url_list_gb_messages, callback=self.RubiGuessItem,
                          formdata={'page': str(page + 1), 'uid': ''})

    def RubiGuessItem(self, response):
        json_file = response.body

In the parse function I have the response to the first request. In RubiGuessItem I have the JSON file with all the information.


Answer 2

Many times when crawling we run into problems where content that is rendered on the page is generated with JavaScript, and therefore scrapy is unable to crawl it (e.g. AJAX requests, jQuery craziness).

However, if you use Scrapy along with the web testing framework Selenium, then we are able to crawl anything displayed in a normal web browser.

Some things to note:

  • You must have the Python version of Selenium RC installed for this to work, and you must have set up Selenium properly. Also, this is just a template crawler. You could get much crazier and more advanced with things, but I just wanted to show the basic idea. As the code stands now, you will be doing two requests for any given URL. One request is made by Scrapy and the other is made by Selenium. I am sure there are ways around this so that you could possibly just make Selenium do the one and only request, but I did not bother to implement that, and by doing two requests you get to crawl the page with Scrapy too.

  • This is quite powerful because now you have the entire rendered DOM available for you to crawl, and you can still use all the nice crawling features in Scrapy. This will make for slower crawling of course, but depending on how much you need the rendered DOM it might be worth the wait.

    import time

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    from scrapy.item import Item
    
    from selenium import selenium
    
    class SeleniumSpider(CrawlSpider):
        name = "SeleniumSpider"
        start_urls = ["http://www.domain.com"]
    
        rules = (
            Rule(SgmlLinkExtractor(allow=('\.html', )), callback='parse_page',follow=True),
        )
    
        def __init__(self):
            CrawlSpider.__init__(self)
            self.verificationErrors = []
            self.selenium = selenium("localhost", 4444, "*chrome", "http://www.domain.com")
            self.selenium.start()
    
        def __del__(self):
            self.selenium.stop()
            print self.verificationErrors
            CrawlSpider.__del__(self)
    
        def parse_page(self, response):
            item = Item()
    
            hxs = HtmlXPathSelector(response)
            #Do some XPath selection with Scrapy
            hxs.select('//div').extract()
    
            sel = self.selenium
            sel.open(response.url)
    
            #Wait for javscript to load in Selenium
            time.sleep(2.5)
    
            #Do some crawling of javascript created content with Selenium
            sel.get_text("//div")
            yield item
    
    # Snippet imported from snippets.scrapy.org (which no longer works)
    # author: wynbennett
    # date  : Jun 21, 2011
    

Reference: http://snipplr.com/view/66998/


Answer 3

Another solution would be to implement a download handler or download handler middleware (see the scrapy docs for more information on downloader middleware). The following is an example class using Selenium with the headless PhantomJS webdriver:

1) Define the class within the middlewares.py script.

from selenium import webdriver
from scrapy.http import HtmlResponse

class JsDownload(object):

    @check_spider_middleware
    def process_request(self, request, spider):
        driver = webdriver.PhantomJS(executable_path='D:\phantomjs.exe')
        driver.get(request.url)
        return HtmlResponse(request.url, encoding='utf-8', body=driver.page_source.encode('utf-8'))

2) Add the JsDownload() class to the DOWNLOADER_MIDDLEWARES variable within settings.py:

DOWNLOADER_MIDDLEWARES = {'MyProj.middleware.MiddleWareModule.MiddleWareClass': 500}

3) Integrate the HTMLResponse within your_spider.py. Decoding the response body will get you the desired output.

class Spider(CrawlSpider):
    # define unique name of spider
    name = "spider"

    start_urls = ["https://www.url.de"] 

    def parse(self, response):
        # initialize items
        item = CrawlerItem()

        # store data as items
        item["js_enabled"] = response.body.decode("utf-8") 

Optional Addon:
I wanted the ability to tell different spiders which middleware to use so I implemented this wrapper:

import functools

from scrapy import log

def check_spider_middleware(method):
    @functools.wraps(method)
    def wrapper(self, request, spider):
        msg = '%%s %s middleware step' % (self.__class__.__name__,)
        if self.__class__ in spider.middleware:
            spider.log(msg % 'executing', level=log.DEBUG)
            return method(self, request, spider)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return None
    return wrapper

For the wrapper to work, all spiders must have at minimum:

middleware = set([])

To include a middleware:

middleware = set([MyProj.middleware.ModuleName.ClassName])

Advantage:
The main advantage of implementing it this way rather than in the spider is that you only end up making one request. In A T's solution, for example, the download handler processes the request and then hands off the response to the spider. The spider then makes a brand-new request in its parse_page function: that's two requests for the same content.


Answer 4

I was using a custom downloader middleware, but wasn’t very happy with it, as I didn’t manage to make the cache work with it.

A better approach was to implement a custom download handler.

There is a working example here. It looks like this:

# encoding: utf-8
from __future__ import unicode_literals

from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.responsetypes import responsetypes
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from six.moves import queue
from twisted.internet import defer, threads
from twisted.python.failure import Failure


class PhantomJSDownloadHandler(object):

    def __init__(self, settings):
        self.options = settings.get('PHANTOMJS_OPTIONS', {})

        max_run = settings.get('PHANTOMJS_MAXRUN', 10)
        self.sem = defer.DeferredSemaphore(max_run)
        self.queue = queue.LifoQueue(max_run)

        SignalManager(dispatcher.Any).connect(self._close, signal=signals.spider_closed)

    def download_request(self, request, spider):
        """use semaphore to guard a phantomjs pool"""
        return self.sem.run(self._wait_request, request, spider)

    def _wait_request(self, request, spider):
        try:
            driver = self.queue.get_nowait()
        except queue.Empty:
            driver = webdriver.PhantomJS(**self.options)

        driver.get(request.url)
        # ghostdriver won't response when switch window until page is loaded
        dfd = threads.deferToThread(lambda: driver.switch_to.window(driver.current_window_handle))
        dfd.addCallback(self._response, driver, spider)
        return dfd

    def _response(self, _, driver, spider):
        body = driver.execute_script("return document.documentElement.innerHTML")
        if body.startswith("<head></head>"):  # cannot access response header in Selenium
            body = driver.execute_script("return document.documentElement.textContent")
        url = driver.current_url
        respcls = responsetypes.from_args(url=url, body=body[:100].encode('utf8'))
        resp = respcls(url=url, body=body, encoding="utf-8")

        response_failed = getattr(spider, "response_failed", None)
        if response_failed and callable(response_failed) and response_failed(resp, driver):
            driver.close()
            return defer.fail(Failure())
        else:
            self.queue.put(driver)
            return defer.succeed(resp)

    def _close(self):
        while not self.queue.empty():
            driver = self.queue.get_nowait()
            driver.close()

Suppose your scraper is called “scraper”. If you put the mentioned code inside a file called handlers.py at the root of the “scraper” folder, then you could add this to your settings.py:

DOWNLOAD_HANDLERS = {
    'http': 'scraper.handlers.PhantomJSDownloadHandler',
    'https': 'scraper.handlers.PhantomJSDownloadHandler',
}

And voilà: the JS-parsed DOM, with scrapy caching, retries, etc.


Answer 5

how can scrapy be used to scrape this dynamic data so that I can use it?

I wonder why no one has posted the solution using Scrapy only.

Check out the blog post from the Scrapy team, SCRAPING INFINITE SCROLLING PAGES. The example scrapes the http://spidyquotes.herokuapp.com/scroll website, which uses infinite scrolling.

The idea is to use the Developer Tools of your browser, notice the AJAX requests, and then, based on that information, create the requests for Scrapy:

import json
import scrapy


class SpidyQuotesSpider(scrapy.Spider):
    name = 'spidyquotes'
    quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes?page=%s'
    start_urls = [quotes_base_url % 1]
    download_delay = 1.5

    def parse(self, response):
        data = json.loads(response.body)
        for item in data.get('quotes', []):
            yield {
                'text': item.get('text'),
                'author': item.get('author', {}).get('name'),
                'tags': item.get('tags'),
            }
        if data['has_next']:
            next_page = data['page'] + 1
            yield scrapy.Request(self.quotes_base_url % next_page)

Answer 6

Yes, Scrapy can scrape dynamic websites, i.e. websites that are rendered through JavaScript.

There are two approaches to scraping these kinds of websites.

First,

you can use splash to render the JavaScript code and then parse the rendered HTML. You can find the doc and project here: Scrapy splash, git

Second,

as everyone is stating, by monitoring the network calls: yes, you can find the API call that fetches the data, and mocking that call in your scrapy spider might help you to get the desired data.
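
For the first approach, a minimal scrapy-splash sketch might look like this; it assumes a Splash instance is running on localhost:8050 and that the plugin's SPLASH_URL and middleware settings are configured as described in its README:

import scrapy
from scrapy_splash import SplashRequest

class JsSpider(scrapy.Spider):
    name = "js_example"

    def start_requests(self):
        # Render the page in Splash, giving JavaScript time to run.
        yield SplashRequest("http://www.example.com", self.parse,
                            args={"wait": 2})

    def parse(self, response):
        # response.text now contains the JavaScript-rendered HTML.
        yield {"title": response.css("title::text").extract_first()}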


Answer 7

I handle the AJAX requests by using Selenium and the Firefox web driver. It is not that fast if you need the crawler as a daemon, but much better than any manual solution. I wrote a short tutorial here for reference.


Getting “OSError: [Errno 1] Operation not permitted” when installing Scrapy in OSX 10.11 (El Capitan) (System Integrity Protection)

Question: Getting “OSError: [Errno 1] Operation not permitted” when installing Scrapy in OSX 10.11 (El Capitan) (System Integrity Protection)

I’m trying to install Scrapy Python framework in OSX 10.11 (El Capitan) via pip. The installation script downloads the required modules and at some point returns the following error:

OSError: [Errno 1] Operation not permitted: '/tmp/pip-nIfswi-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six-1.4.1-py2.7.egg-info'

I’ve tried to deactivate the rootless feature in OSX 10.11 with the command:

sudo nvram boot-args="rootless=0";sudo reboot

but I still get the same error when the machine reboots.

Any clue or idea from my fellow StackExchangers?

If it helps, the full script output is the following:

sudo -s pip install scrapy
Collecting scrapy
  Downloading Scrapy-1.0.2-py2-none-any.whl (290kB)
    100% |████████████████████████████████| 290kB 345kB/s 
Requirement already satisfied (use --upgrade to upgrade): cssselect>=0.9 in /Library/Python/2.7/site-packages (from scrapy)
Requirement already satisfied (use --upgrade to upgrade): queuelib in /Library/Python/2.7/site-packages (from scrapy)
Requirement already satisfied (use --upgrade to upgrade): pyOpenSSL in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from scrapy)
Collecting w3lib>=1.8.0 (from scrapy)
  Downloading w3lib-1.12.0-py2.py3-none-any.whl
Collecting lxml (from scrapy)
  Downloading lxml-3.4.4.tar.gz (3.5MB)
    100% |████████████████████████████████| 3.5MB 112kB/s 
Collecting Twisted>=10.0.0 (from scrapy)
  Downloading Twisted-15.3.0.tar.bz2 (4.4MB)
    100% |████████████████████████████████| 4.4MB 94kB/s 
Collecting six>=1.5.2 (from scrapy)
  Downloading six-1.9.0-py2.py3-none-any.whl
Requirement already satisfied (use --upgrade to upgrade): zope.interface>=3.6.0 in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from Twisted>=10.0.0->scrapy)
Requirement already satisfied (use --upgrade to upgrade): setuptools in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from zope.interface>=3.6.0->Twisted>=10.0.0->scrapy)
Installing collected packages: six, w3lib, lxml, Twisted, scrapy
  Found existing installation: six 1.4.1
    DEPRECATION: Uninstalling a distutils installed project (six) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
    Uninstalling six-1.4.1:
Exception:
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/pip-7.1.0-py2.7.egg/pip/basecommand.py", line 223, in main
status = self.run(options, args)
  File "/Library/Python/2.7/site-packages/pip-7.1.0-py2.7.egg/pip/commands/install.py", line 299, in run
root=options.root_path,
  File "/Library/Python/2.7/site-packages/pip-7.1.0-py2.7.egg/pip/req/req_set.py", line 640, in install
requirement.uninstall(auto_confirm=True)
  File "/Library/Python/2.7/site-packages/pip-7.1.0-py2.7.egg/pip/req/req_install.py", line 726, in uninstall
paths_to_remove.remove(auto_confirm)
  File "/Library/Python/2.7/site-packages/pip-7.1.0-py2.7.egg/pip/req/req_uninstall.py", line 125, in remove
renames(path, new_path)
  File "/Library/Python/2.7/site-packages/pip-7.1.0-py2.7.egg/pip/utils/__init__.py", line 314, in renames
shutil.move(old, new)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 302, in move
copy2(src, real_dst)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 131, in copy2
copystat(src, dst)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 103, in copystat
os.chflags(dst, st.st_flags)
OSError: [Errno 1] Operation not permitted: '/tmp/pip-nIfswi-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six-1.4.1-py2.7.egg-info'

Answer 0

I also think it’s absolutely not necessary to start hacking OS X.

I was able to solve it by doing a

brew install python

It seems that using the python/pip that comes with the new El Capitan has some issues.


Answer 1

pip install --ignore-installed six

Would do the trick.

Source: github.com/pypa/pip/issues/3165


Answer 2

As the other answers said, it’s because of the new System Integrity Protection, but I believe the other answers are overcomplicated.

If you're only going to use that package as the current user, you should be able to install it just fine, without the need to disable SIP, by using the --user flag. Like this:

sudo pip install --user packagename

Answer 3

The highly voted answers didn't work for me; they seem to work for El Capitan users. But for macOS Sierra users, try the following steps:

  1. brew install python
  2. sudo pip install --user <package name>

Answer 4

Warnings

I would suggest very strongly against modifying the system Python on Mac; there are numerous issues that can occur.

Your particular error shows that the installer has issues resolving the dependencies for Scrapy without impacting the current Python installation. The system uses Python for a number of essential tasks, so it’s important to keep the system installation stable and as originally installed by Apple.

I would also exhaust all other possibilities before bypassing built-in security.

Package Manager Solutions:

Please look into a Python virtualization tool such as virtualenv first; this will allow you to experiment safely.
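
A minimal sketch of that route, assuming virtualenv is already installed (pip install virtualenv):

virtualenv scrapy_env            # create an isolated environment
source scrapy_env/bin/activate   # activate it in the current shell
pip install scrapy               # installs into the env, not the system Python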

Another useful tool to use languages and software without conflicting with your Mac OS is Homebrew. Like MacPorts or Fink, Homebrew is a package manager for Mac, and is useful for safely trying lots of other languages and tools.

“Roll your own” Software Installs:

If you don’t like the package manager approach, you could use the /usr/local path or create an /opt/local directory for installing an alternate Python installation and fix up your paths in your .bashrc. Note that you’ll have to enable root for these solutions.

How to do it anyway:

If you absolutely must disable the security check (and I sincerely hope it’s for something other than messing with the system languages and resources), you can disable it temporarily and re-enable it using some of the techniques in this post on how to Disable System Integrity-Protection.
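To make the virtualenv route above concrete, here is a minimal sketch; the environment name scrapy-env is hypothetical, and this assumes virtualenv is already installed (for example via pip install --user virtualenv):

virtualenv scrapy-env             # create an isolated Python environment
source scrapy-env/bin/activate    # activate it for this shell session
pip install scrapy                # installs into the env, not the system Python
deactivate                        # leave the environment when done

Nothing here touches the Apple-provided Python, which is the whole point of the recommendation.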


回答 5

这对我有用:

   sudo pip install scrapy --ignore-installed six

This did the trick for me:

   sudo pip install scrapy --ignore-installed six

回答 6

您应禁用“系统完整性保护”,这是El Capitan的一项新功能。

首先,您应该在终端上运行无根配置命令

# nvram boot-args="rootless=0"
# reboot

然后,您应该在恢复分区的终端(恢复操作系统)上运行以下命令

# csrutil disable
# reboot

我就是这样解决问题的。我不确定第一部分是否必要,您可以自行尝试。

– 警告

一切正常后,您应该再次启用SIP。

只需再次重新启动进入恢复模式并在终端中运行

# csrutil enable

csrutil:配置系统完整性保护

You should disable “System Integrity Protection” which is a new feature in El Capitan.

First, you should run the command for rootless config on your terminal

# nvram boot-args="rootless=0"
# reboot

Then, you should run the command below on recovery partition’s terminal (Recovery OS)

# csrutil disable
# reboot

I’ve just solved my problem like that. I’m not sure that the first part is necessary. Try as you like.

–WARNING

You should enable SIP again after everything works;

Simply reboot again into Recovery Mode and run in terminal

# csrutil enable

csrutil: Configuring System Integrity Protection
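Either way, you can confirm the current state from a normal (non-recovery) terminal before and after the change:

csrutil status

On a protected system this prints something like "System Integrity Protection status: enabled."; the exact wording can vary between OS versions.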


回答 7

我尝试在El Capitan上通过pip安装AWS,但出现了这个错误:

OSError: [Errno 1] Operation not permitted: '/var/folders/wm/jhnj0g_s16gb36y8kwvrgm7h0000gp/T/pip-wTnb_D-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six-1.4.1-py2.7.egg-info'

我在这里找到答案

sudo -H pip install awscli --upgrade --ignore-installed six

这个对我有用 :)

I tried to install AWS via pip in El Capitan but this error appear

OSError: [Errno 1] Operation not permitted: '/var/folders/wm/jhnj0g_s16gb36y8kwvrgm7h0000gp/T/pip-wTnb_D-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six-1.4.1-py2.7.egg-info'

I found the answer here

sudo -H pip install awscli --upgrade --ignore-installed six

It works for me :)


回答 8

我在MacOS Sierra上遇到了相同的错误。我按照以下步骤操作,成功安装了scrapy软件包。

1. sudo pip install --ignore-installed six
2. sudo pip install --ignore-installed scrapy

MacBook-Air:~ shree$ scrapy version
Scrapy 1.4.0

I was getting the same error on my MacOS Sierra. I followed these steps and was able to successfully install the scrapy package.

1. sudo pip install --ignore-installed six
2. sudo pip install --ignore-installed scrapy

MacBook-Air:~ shree$ scrapy version
Scrapy 1.4.0

回答 9

这对我有用。

sudo pip install --ignore-installed scrapy

This did the trick for me.

sudo pip install --ignore-installed scrapy


回答 10

尝试了一些答案的组合,最终成功了:

sudo -H pip install --upgrade --ignore-installed awsebcli

干杯

Tried a combination of some answers and this eventually worked:

sudo -H pip install --upgrade --ignore-installed awsebcli

Cheers


回答 11

再次安装python:

brew install python

然后再试一次:

sudo pip install scrapy

对我有效,希望能帮到你。

install python again:

brew install python

try it again:

sudo pip install scrapy

works for me, hope it can help


回答 12

重新启动Mac -> 在开机提示音后按住“Command + R” -> 打开OS X实用工具 -> 打开终端并键入“csrutil disable” -> 重启OS X -> 打开终端并用“csrutil status”检查状态

Restart Mac -> hold down “Command + R” after the startup chime -> Opens OS X Utilities -> Open Terminal and type “csrutil disable” -> Reboot OS X -> Open Terminal and check “csrutil status”


回答 13

这个命令可以很好地工作:D

sudo -H pip install --upgrade package_name --ignore-installed six

This command would work perfectly fine :D

sudo -H pip install --upgrade package_name --ignore-installed six


回答 14

如果您尝试使用pip而不是pip3在python2文件夹中安装python3 lib,有时可能会实现这种行为。

Sometimes such behavior may be achieved if you try to install python3 lib in python2 folder using pip instead of pip3.


回答 15

  1. 关闭SIP(系统完整性保护):重启,开机时按住Command + R进入恢复模式,然后打开终端执行:csrutil disable,再重启

2.

sudo C_INCLUDE_PATH=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include/libxml2:/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include/libxml2/libxml:/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include pip install scrapy --ignore-installed six

3. 然后删除旧的six并重新安装:sudo rm -rf /Library/Python/2.7/site-packages/six* sudo rm -rf /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six* sudo pip install six

4. 然后恢复设置:csrutil enable,并重启

现在scrapy可以正常工作了

  1. — close SIP(system Integrity Protection) — then reboot, use command +R to enter debug mode, then select terminal: csrutil disable reboot

2.

sudo C_INCLUDE_PATH=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include/libxml2:/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include/libxml2/libxml:/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include pip install scrapy --ignore-installed six

3. — then remove old six, install it again sudo rm -rf /Library/Python/2.7/site-packages/six* sudo rm -rf /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six* sudo pip install six

4. — then set it back csrutil enable reboot

— scrapy works now


回答 16

它对我有用:

pip install scrapy --user -U

It worked for me:

pip install scrapy --user -U

回答 17

我在其他地方缺少依赖项,因此我为项目安装了其他要求,如下所示:

pip install --user -r requirements.txt

I was missing a dependency somewhere else along the line, so I installed the other requirements for the project like this:

pip install --user -r requirements.txt
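For readers unfamiliar with the format: requirements.txt is a plain-text list of packages, one per line, optionally pinned to versions. The entries below are purely illustrative and not from the original poster's project:

six>=1.4.1
lxml>=3.4.0
Scrapy==1.4.0

When called with --user -r, pip resolves and installs each line into your per-user site-packages.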

无法在Mac OS X 10.9上安装Lxml

问题:无法在Mac OS X 10.9上安装Lxml

我想安装Lxml,以便随后可以安装Scrapy。

今天更新Mac之后,我无法重新安装lxml,并出现以下错误:

In file included from src/lxml/lxml.etree.c:314:
/private/tmp/pip_build_root/lxml/src/lxml/includes/etree_defs.h:9:10: fatal error: 'libxml/xmlversion.h' file not found
#include "libxml/xmlversion.h"
         ^
1 error generated.
error: command 'cc' failed with exit status 1

我尝试使用brew安装libxml2和libxslt,两者都安装良好,但仍然无法安装lxml。

上次安装时,我需要在Xcode上启用开发人员工具,但自从更新到Xcode 5后,不再提供该选项。

有人知道我需要做什么吗?

I want to install Lxml so I can then install Scrapy.

When I updated my Mac today it wouldn’t let me reinstall lxml, I get the following error:

In file included from src/lxml/lxml.etree.c:314:
/private/tmp/pip_build_root/lxml/src/lxml/includes/etree_defs.h:9:10: fatal error: 'libxml/xmlversion.h' file not found
#include "libxml/xmlversion.h"
         ^
1 error generated.
error: command 'cc' failed with exit status 1

I have tried using brew to install libxml2 and libxslt, both installed fine but I still cannot install lxml.

Last time I was installing I needed to enable the developer tools on Xcode, but since it’s updated to Xcode 5 it doesn’t give me that option anymore.

Does anyone know what I need to do?


回答 0

您应该为xcode安装或升级命令行工具。在终端上尝试一下:

xcode-select --install

You should install or upgrade the command line tools for Xcode. Try this in a terminal:

xcode-select --install
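To verify the tools actually landed, xcode-select can print the active developer directory; a quick check:

xcode-select -p

Expect a path such as /Library/Developer/CommandLineTools or /Applications/Xcode.app/Contents/Developer; an error here means the install did not complete.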

回答 1

我在Yosemite上通过brew安装并链接libxml2和libxslt解决了这个问题:

brew install libxml2
brew install libxslt
brew link libxml2 --force
brew link libxslt --force

如果用此方法解决了问题,但之后又再次出现,则可能需要在运行以上四行之前先执行:

brew unlink libxml2
brew unlink libxslt

如果您在Homebrew上遇到权限错误,尤其是在El Capitan上,则这是一个有用的文档。本质上,无论OS X是什么版本,都请尝试运行:

sudo chown -R $(whoami):admin /usr/local

I solved this issue on Yosemite by both installing and linking libxml2 and libxslt through brew:

brew install libxml2
brew install libxslt
brew link libxml2 --force
brew link libxslt --force

If you have solved the problem using this method but it pops up again at a later time, you might need to run this before the four lines above:

brew unlink libxml2
brew unlink libxslt

If you are having permission errors with Homebrew, especially on El Capitan, this is a helpful document. In essence, regardless of OS X version, try running:

sudo chown -R $(whoami):admin /usr/local
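As a sanity check after linking, the xml2-config helper that ships with libxml2 should now resolve to the brewed copy; a quick diagnostic:

xml2-config --version   # the libxml2 version lxml will compile against
xml2-config --cflags    # include path, which should point under /usr/local

If these still report the system paths, the brew link step did not take effect.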

回答 2

您可以通过在命令行上运行此命令来解决问题:

 STATIC_DEPS=true pip install lxml

它确实帮到了我。详见文档中的说明。

You may solve your problem by running this on the commandline:

 STATIC_DEPS=true pip install lxml

It sure helped me. Explanations are in the docs.
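For background: STATIC_DEPS=true tells lxml's build to download libxml2 and libxslt sources and compile them statically into the extension, sidestepping broken system copies. The docs also describe pinning the downloaded versions; treat the variable names and version numbers below as assumptions to verify against http://lxml.de/installation.html:

STATIC_DEPS=true LIBXML2_VERSION=2.9.1 LIBXSLT_VERSION=1.1.28 pip install lxml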


回答 3

我尝试了上面的大多数解决方案,但没有一个对我有用。我正在运行Yosemite 10.10,唯一适用于我的解决方案是在终端中键入以下内容:

sudo CPATH=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml2 CFLAGS=-Qunused-arguments CPPFLAGS=-Qunused-arguments pip install lxml

编辑:如果您使用virtualenv,则不需要开头的sudo。

I tried most of the solutions above, but none of them worked for me. I’m running Yosemite 10.10, the only solution that worked for me was to type this in the terminal:

sudo CPATH=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml2 CFLAGS=-Qunused-arguments CPPFLAGS=-Qunused-arguments pip install lxml

EDIT: If you are using virtualenv, the sudo in beginning is not needed.


回答 4

这个问题也困扰了我一段时间。我不太了解python distutils等的内部机制,但这里的include路径是错误的。在python lxml的开发者做出正确修复之前,我先用下面这个丑陋的hack顶着。

sudo ln -s  /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml2/libxml/ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml

This has been bothering me as well for a while. I don’t know the internals enough about python distutils etc, but the include path here is wrong. I made the following ugly hack to hold me over until the python lxml people can do the proper fix.

sudo ln -s  /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml2/libxml/ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml

回答 5

全局安装… OS X 10.9.2

xcode-select --install
sudo easy_install pip
sudo CPATH=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml2 CFLAGS=-Qunused-arguments CPPFLAGS=-Qunused-arguments pip install lxml

Installing globally… OS X 10.9.2

xcode-select --install
sudo easy_install pip
sudo CPATH=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml2 CFLAGS=-Qunused-arguments CPPFLAGS=-Qunused-arguments pip install lxml

回答 6

http://lxml.de/installation.html上的安装说明指出:

为了加快测试环境(例如,在持续集成服务器上)的构建速度,请通过设置CFLAGS环境变量来禁用C编译器优化:

CFLAGS="-O0" pip install lxml

The installation instructions on http://lxml.de/installation.html explain:

To speed up the build in test environments, e.g. on a continuous integration server, disable the C compiler optimisations by setting the CFLAGS environment variable:

CFLAGS="-O0" pip install lxml

回答 7

上面这些方法在10.9.2上对我都不起作用,因为编译会因以下错误而失败:

clang: error: unknown argument: '-mno-fused-madd' 

这实际上导致最干净的解决方案(请参见[1]中的更多详细信息):

export CFLAGS=-Qunused-arguments
export CPPFLAGS=-Qunused-arguments

pip install lxml

如果是全局安装,则改用:

sudo pip install lxml

[1] clang错误:未知参数:'-mno-fused-madd'(python软件包安装失败)

None of the above worked for me on 10.9.2, as compilation bails out with following error:

clang: error: unknown argument: '-mno-fused-madd' 

Which actually lead to cleanest solution (see more details in [1]):

export CFLAGS=-Qunused-arguments
export CPPFLAGS=-Qunused-arguments

pip install lxml

or following if installing globally

sudo pip install lxml

[1] clang error: unknown argument: ‘-mno-fused-madd’ (python package installation failure)


回答 8

xcode-select --install
sudo easy_install pip
sudo pip install lxml
xcode-select --install
sudo easy_install pip
sudo pip install lxml

回答 9

我在Yosemite上通过运行以下命令解决了这个问题:

xcode-select --install #this may take several minutes.
pip install lxml

I solved this issue on Yosemite by running the following commands:

xcode-select --install #this may take several minutes.
pip install lxml

回答 10

使用Homebrew时,libxml2不会被链接(以免干扰系统自带的libxml2),因此必须给pip一点提示才能找到它。

用bash:

LDFLAGS=-L`brew --prefix libxml2`/lib CPPFLAGS=-I`brew --prefix libxml2`/include/libxml2 pip install --user lxml

用fish:

env LDFLAGS=-L(brew --prefix libxml2)/lib CPPFLAGS=-I(brew --prefix libxml2)/include/libxml2 pip install --user lxml

With homebrew, libxml2 is hidden to not interfere with the system libxml2, so pip must be helped a little in order to find it.

With bash:

LDFLAGS=-L`brew --prefix libxml2`/lib CPPFLAGS=-I`brew --prefix libxml2`/include/libxml2 pip install --user lxml

With fish:

env LDFLAGS=-L(brew --prefix libxml2)/lib CPPFLAGS=-I(brew --prefix libxml2)/include/libxml2 pip install --user lxml
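Whichever shell you use, a quick way to confirm the build linked against a working libxml2 is to import the compiled module and print the version tuples lxml.etree exposes:

python -c "import lxml.etree as e; print(e.LXML_VERSION, e.LIBXML_VERSION)"

If the import itself fails, the flags above did not reach the compiler.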

回答 11

OSX 10.9.2

sudo env ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future STATIC_DEPS=true pip install lxml

OSX 10.9.2

sudo env ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future STATIC_DEPS=true pip install lxml

回答 12

我尝试了此页面上的所有答案,但没有一个对我有用。我运行的是OS X 10.9.2。

但下面这个命令绝对有效……非常顺利:

ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future pip install lxml

I tried all the answers on this page, none of them worked for me. I’m running OS X Version 10.9.2

But this definitely works….like a charm:

ARCHFLAGS=-Wno-error=unused-command-line-argument-hard-error-in-future pip install lxml


回答 13

不幸的是,由于我已经装有最新版本,xcode-select --install对我不起作用。

说来奇怪,但我通过打开XCode并接受条款与条件解决了这个问题。之后重新运行pip install lxml不再报任何错误。

Unfortunately xcode-select --install did not work for me as I already had the latest version.

It’s very strange but I solved the issue by opening XCode and accepting the Terms & Conditions. Re-running pip install lxml returned no errors after.


回答 14

从pip(lxml 3.6.4)成功安装后,导入lxml.etree模块时出现错误。

我一直在不停地搜索以将其安装为scrapy的必要条件,并尝试了所有选项,但最终这对我有用(mac osx 10.11 python 2.7):

$ STATIC_DEPS=true sudo easy_install-2.7 "lxml==2.3.5"

较旧的lxml版本似乎可以与etree模块一起使用。

Pip有时会忽略软件包的指定版本,例如当pip缓存中已有较新版本时,因此这里改用easy_install。'-2.7'对应python版本;如果要为python 3.x安装,请省略它。

After successful install from pip (lxml 3.6.4) I was getting an error when importing the lxml.etree module.

I was searching endlessly to install this as a requisite for scrapy, and tried all the options, but finally this worked for me (mac osx 10.11 python 2.7):

$ STATIC_DEPS=true sudo easy_install-2.7 "lxml==2.3.5"

The older version of lxml seem to work with etree module.

Pip can often ignore the specified version of a package, for example when you have the newer version in the pip cache, thus the easy_install. The '-2.7' option is for python version, omit this if you are installing for python 3.x.
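If you would rather stay on pip, the cache-shadowing problem described above can also be bypassed directly; --no-cache-dir makes pip ignore its local cache when resolving a pinned version (a sketch, assuming a pip recent enough to support the flag):

pip install --no-cache-dir "lxml==2.3.5"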


回答 15

就我而言,必须先关闭Kaspersky Antivirus,然后才能通过以下命令安装lxml:

pip install lxml

In my case, I must shutdown Kaspersky Antivirus before installing lxml by:

pip install lxml

回答 16

我正在使用OSX 10.9.2,但出现相同的错误。

在这个特定版本的OSX上,安装XCode命令行工具并不能解决问题。

我认为解决此问题的更好方法是使用以下命令进行安装:

$ CPATH=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml2 pip install lxml

这类似于jdkoftinoff的修复程序,但不会永久更改系统。

I am using OSX 10.9.2 and I get the same error.

Installation of the XCode command line tools does not help for this particular version of OSX.

I think a better approach to fix this is to install with the following command:

$ CPATH=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/libxml2 pip install lxml

This is similar to jdkoftinoff’s fix, but does not alter your system in a permanent way.


回答 17

我遇到了同样的问题,经过几天的工作,我在OS X 10.9.4上使用Python 3.4.1解决了这个问题。

这是我的解决方案,

根据lxml.de上的lxml安装说明:

MacPorts上提供了lxml的移植版本。可以尝试类似port install py25-lxml的命令。

如果您没有MacPorts,请从MacPort.org安装它,这很容易。您可能还需要编译器;要安装XCode编译工具,请使用xcode-select --install。

首先,我通过sudo port selfupdate将port更新到最新版本。

然后输入sudo port install libxml2,几分钟后应该就能看到libxml2安装成功。安装lxml可能还需要libxslt;要安装libxslt,请使用:sudo port install libxslt。

现在,只需键入pip install lxml,它应该可以正常工作。

I met the same question and after days of working I resolved this problem on my OS X 10.9.4, with Python 3.4.1.

Here’s my solution,

According to installing lxml from lxml.de,

A macport of lxml is available. Try something like port install py25-lxml

If you do not have MacPort, install it from MacPort.org. It’s quite easy. You may also need a compiler, to install XCode compiling tools, use xcode-select --install

Firstly I updated my port to the latest version via sudo port selfupdate,

Then I just typed sudo port install libxml2 and several minutes later you should see libxml2 installed successfully. You may also need libxslt to install lxml. To install libxslt, use: sudo port install libxslt.

Now, just type pip install lxml, it should work fine.
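Condensed into one runnable sequence (per the steps in this answer; the final pip step succeeds only if the MacPorts headers are picked up, as the author reports):

sudo port selfupdate                 # bring the ports tree up to date
sudo port install libxml2 libxslt    # the C libraries lxml compiles against
pip install lxml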


回答 18

在编译之前,将xmlversion.h的路径添加到您的环境中。

$ export INCLUDE=$INCLUDE:/private/tmp/pip_build_root/lxml/src/lxml/

但是请确保我提供的路径中包含xmlversion.h文件。然后,

$ python setup.py install

before compiling add the path that to xmlversion.h into your environment.

$ export INCLUDE=$INCLUDE:/private/tmp/pip_build_root/lxml/src/lxml/

But make sure the path I’ve provided has the xmlversion.h file located inside. Then,

$ python setup.py install

回答 19

pip对我不起作用。我访问了 https://pypi.python.org/pypi/lxml/2.3 并下载了macosx .egg文件:

https://pypi.python.org/packages/2.7/l/lxml/lxml-2.3-py2.7-macosx-10.6-intel.egg#md5=52322e4698d68800c6b6aedb0dbe5f34

然后使用命令行easy_install安装.egg文件。

pip did not work for me. I went to https://pypi.python.org/pypi/lxml/2.3 and downloaded the macosx .egg file:

https://pypi.python.org/packages/2.7/l/lxml/lxml-2.3-py2.7-macosx-10.6-intel.egg#md5=52322e4698d68800c6b6aedb0dbe5f34

Then used command line easy_install to install the .egg file.


回答 20

这篇文章链接到一个对我有效的解决方案,针对Mac OS X 10.9上Python3、lxml出现“Symbol not found: _lzma_auto_decoder”错误的问题。

hth


回答 21

在一番抓耳挠腮、咬牙切齿之后,我用pip卸载了xcode并运行:

easy_install lxml

一切都很好。

After much tearing of the hair and gnashing of the teeth, I uninstalled xcode with pip and ran:

easy_install lxml

And all was well.


回答 22

尝试:

% STATIC_DEPS=true pip install lxml

或者:

% STATIC_DEPS=true sudo pip install lxml

有用!

Try:

% STATIC_DEPS=true pip install lxml

Or:

% STATIC_DEPS=true sudo pip install lxml

It works!