I want to make a website that compares product prices between Amazon and eBay.
Which of these will work better, and why? I am somewhat familiar with BeautifulSoup but not so much with the Scrapy crawler.
Scrapy is a web-spider, or web-scraper, framework. You give Scrapy a root URL to start crawling from, then you can specify constraints on how many URLs you want to crawl and fetch, and so on. It is a complete framework for web scraping and crawling.
BeautifulSoup, on the other hand, is a parsing library. It does a pretty good job of pulling content out of a page and lets you parse certain parts of it without any hassle. But it only processes the contents of the URL that you give it and then stops; it does not crawl unless you manually put it inside a loop with certain criteria.
In simple words, with Beautiful Soup you can build something similar to Scrapy.
Beautiful Soup is a library while Scrapy is a complete framework.
I think both are good; I'm doing a project right now that uses both. First I scrape all the pages using Scrapy and save them to a MongoDB collection using its pipelines, also downloading the images that exist on each page.
After that I use BeautifulSoup4 for post-processing, where I have to change attribute values and pull out some special tags.
If you don't know which product pages you want, Scrapy is a good tool, since you can use its crawlers to walk the whole Amazon/eBay site looking for products without writing an explicit for loop.
Take a look at the Scrapy documentation; it's very simple to use.
The way I do it is to use the eBay/Amazon APIs rather than Scrapy, and then parse the results using BeautifulSoup.
The APIs give you an official way of getting the same data that you would have got from a Scrapy crawler, with no need to worry about hiding your identity, messing about with proxies, and so on.
Scrapy
It is a web-scraping framework which comes with tons of goodies that make scraping easier, so that we can focus on the crawling logic only. Some of my favourite things Scrapy takes care of for us are below.
Feed exports: this basically allows us to save data in various formats like CSV, JSON, JSON lines and XML.
Asynchronous scraping: Scrapy uses the Twisted framework, which gives us the power to visit multiple URLs at once, with each request processed in a non-blocking way (basically we don't have to wait for one request to finish before sending another).
Selectors: this is where we can compare Scrapy with Beautiful Soup. Selectors are what allow us to select particular data from the webpage (a heading, a certain div with a class name, and so on). Scrapy uses lxml for parsing, which is much faster than Beautiful Soup.
Setting proxies, user agents, headers, etc.: Scrapy allows us to set and rotate proxies and other headers dynamically.
Item pipelines: pipelines enable us to process data after extraction. For example, we can configure a pipeline to push data to your MySQL server.
Cookies: Scrapy automatically handles cookies for us.
And so on.
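To make this concrete, here is a minimal sketch of a spider that uses Scrapy's selectors and follows pagination; the site URL, CSS selectors and field names are placeholders, not a real site. Saving the output in one of the feed-export formats is then just a matter of running, for example, scrapy runspider prices_spider.py -o prices.json.

import scrapy

class PriceSpider(scrapy.Spider):
    # hypothetical spider: name, URL and selectors are illustrative only
    name = "prices"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        # Selectors: pick out each product block and extract fields from it
        for product in response.css("div.product"):
            yield {
                "title": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
            }
        # follow the pagination link; Scrapy schedules requests asynchronously
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)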
TL;DR: Scrapy is a framework that provides everything one might need to build large-scale crawlers. It provides various features that hide the complexity of crawling the web, so one can simply start writing web crawlers without worrying about the setup burden.
Beautiful soup
Beautiful Soup is a Python package for parsing HTML and XML documents, so with Beautiful Soup you can parse a webpage that has already been downloaded. BS4 is very popular and mature. Unlike Scrapy, you cannot use Beautiful Soup on its own to make crawlers; you will need other libraries like requests or urllib to fetch pages. That also means you need to manage the list of URLs being crawled and to be crawled, handle cookies, manage proxies, handle errors, and write your own functions to push data to CSV, JSON, XML, and so on. If you want to speed things up, you will have to use other libraries like multiprocessing.
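As a rough sketch of that division of labour (the URL and the markup are placeholders), fetching and parsing a single page with requests plus BeautifulSoup looks like this:

import requests
from bs4 import BeautifulSoup

# fetch one page ourselves; BeautifulSoup only parses what we hand it
response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.text, "html.parser")

# extract every product title on the page (hypothetical markup)
for heading in soup.find_all("h2", class_="product-title"):
    print(heading.get_text(strip=True))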
To sum up:
Scrapy is a rich framework that you can use to start writing crawlers without any hassle.
Beautiful Soup is a library that you can use to parse a webpage. It cannot be used on its own to scrape the web.
You should definitely use Scrapy for your Amazon and eBay product price comparison website. You could build a database of URLs, run the crawler every day (cron jobs or Celery for scheduling crawls), and update the prices in your database. This way your website always pulls from the database, and the crawler and database act as independent components.
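For the cron variant, a hypothetical crontab entry (the project path and spider name are made up) could look like this:

# run the price crawler every day at 03:00 and append the output to a log
0 3 * * * cd /path/to/price_project && scrapy crawl price_spider >> /var/log/price_crawl.log 2>&1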
BeautifulSoup is a library that lets you extract information from a web page.
Scrapy, on the other hand, is a framework that does the above and many more things you will probably need in your scraping project, like pipelines for saving data.
Using Scrapy you can save tons of code and start with a structured program, and if you don't like any of Scrapy's pre-written methods, BeautifulSoup can be used in place of a Scrapy method.
A big project takes advantage of both.
I am working with Scrapy 0.20 and Python 2.7. I found that PyCharm has a good Python debugger, and I want to test my Scrapy spiders with it. Does anyone know how to do that?
What I have tried
I tried running the spider as a script, so I built such a script. Then I tried to add my Scrapy project to PyCharm as a module, like this:
The scrapy command is a Python script, which means you can start it from inside PyCharm.
When you examine the scrapy binary (run which scrapy) you will notice that it is actually a Python script:
#!/usr/bin/python
from scrapy.cmdline import execute
execute()
This means that a command like scrapy crawl IcecatCrawler can also be executed like this: python /Library/Python/2.7/site-packages/scrapy/cmdline.py crawl IcecatCrawler
Try to find the scrapy.cmdline package.
In my case the location was here: /Library/Python/2.7/site-packages/scrapy/cmdline.py
Create a run/debug configuration inside PyCharm using that script as the script to run. Fill in the script parameters with the scrapy command and spider; in this case, crawl IcecatCrawler.
Like this:
Put your breakpoints anywhere in your crawling code and it should work™.
Answer 1
You only need to do the following.
Create a Python file in the crawler folder of your project. I used main.py.
project
    crawler
        crawler
            spiders
            ...
        main.py
        scrapy.cfg
Inside your main.py, put the code below.
from scrapy import cmdline
cmdline.execute("scrapy crawl spider".split())
As of 2018.1 this became a lot easier. You can now select Module name in your project’s Run/Debug Configuration. Set this to scrapy.cmdline and the Working directory to the root dir of the scrapy project (the one with settings.py in it).
I am running scrapy in a virtualenv with Python 3.5.0 and setting the “script” parameter to /path_to_project_env/env/bin/scrapy solved the issue for me.
Answer 4
IntelliJ IDEA also works.
Create main.py:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
from scrapy import cmdline

def main(name):
    if name:
        cmdline.execute(name.split())

if __name__ == '__main__':
    print('[*] beginning main thread')
    name = "scrapy crawl stack"
    # name = "scrapy crawl spa"
    main(name)
    print('[*] main thread exited')
    print('main stop====================================================')
To add a bit to the accepted answer, after almost an hour I found I had to select the correct Run Configuration from the dropdown list (near the center of the icon toolbar), then click the Debug button in order to get it to work. Hope this helps!
I am also using PyCharm, but I am not using its built-in debugging features.
For debugging I am using ipdb. I set up a keyboard shortcut to insert import ipdb; ipdb.set_trace() on any line where I want the breakpoint to happen.
Then I can type n to execute the next statement, s to step into a function, type any object name to see its value, alter the execution environment, or type c to continue execution.
This is very flexible, and it works in environments other than PyCharm where you don't control the execution environment.
Just type in your virtual environment pip install ipdb and place import ipdb; ipdb.set_trace() on a line where you want the execution to pause.
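As a minimal sketch (the spider name and URL are placeholders), a breakpoint inside a Scrapy callback looks like this; when the crawl reaches it, you get an interactive ipdb prompt in the terminal:

import scrapy

class DebugSpider(scrapy.Spider):
    name = "debug_example"  # hypothetical spider name
    start_urls = ["https://example.com"]

    def parse(self, response):
        import ipdb; ipdb.set_trace()  # execution pauses here; inspect `response` interactively
        yield {"title": response.css("title::text").get()}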
UPDATE
You can also pip install pdbpp and use the standard import pdb; pdb.set_trace() instead of ipdb. PDB++ is nicer in my opinion.
Another option is to run the spider from a plain Python script with CrawlerProcess; you can then run or debug that script directly in PyCharm:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished
Answer 8
I use the following simple script:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
process.crawl('your_spider_name')
process.start()
Previous answers were correct, but you don't have to declare the constructor (__init__) every time you want to code a Scrapy spider; you can just specify the parameters as before:
scrapy crawl myspider -a parameter1=value1 -a parameter2=value2
and in your spider code you can just use them as spider arguments:
class MySpider(Spider):
    name = 'myspider'
    ...

    def parse(self, response):
        ...
        if self.parameter1 == value1:
            # this is True
            ...
        # or also
        if getattr(self, parameter2) == value2:
            # this is also True
            ...
And it just works.
Answer 2
Pass the arguments in the crawl command:
scrapy crawl myspider -a category='mycategory' -a domain='example.com'
Scrapy exposes all the arguments as spider attributes, so you can skip the init method completely. Be careful to use the getattr method for reading those attributes so your code does not break.
Spider arguments are passed while running the crawl command using the -a option. For example, if I want to pass a domain name as an argument to my spider, then I will do this:
scrapy crawl myspider -a domain="http://www.example.com"
And receive the arguments in the spider's constructor:
class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, domain='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [domain]
Alternatively, we can use Scrapyd, which exposes an API where we can pass the start_url and spider name. Scrapyd has APIs to stop/start/check the status of/list the spiders.
pip install scrapyd scrapyd-deploy
scrapyd
scrapyd-deploy local -p default
scrapyd-deploy deploys the spider in the form of an egg into the daemon, and it even maintains versions of the spider. While starting the spider you can mention which version of the spider to use.
class MySpider(CrawlSpider):
    def __init__(self, start_urls, *args, **kwargs):
        self.start_urls = start_urls.split('|')
        super().__init__(*args, **kwargs)

    name = 'testspider'
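As a sketch of how that spider could then be started through Scrapyd's schedule.json endpoint (the URLs are placeholders; the project name default comes from the deploy command above), the pipe-separated start_urls are passed as an ordinary spider argument:

# schedule the spider via the Scrapyd HTTP API, passing start_urls as a spider argument
curl http://localhost:6800/schedule.json \
     -d project=default \
     -d spider=testspider \
     -d start_urls="https://www.example.com/page1|https://www.example.com/page2"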
I have recently been learning Python and am dipping my hand into building a web-scraper. It’s nothing fancy at all; its only purpose is to get the data off of a betting website and have this data put into Excel.
Most of the issues are solvable and I'm having a good little mess around. However, I'm hitting a massive hurdle over one issue. When a site loads a table of horses and lists current betting prices, this information is not in any source file. The clue is that this data is sometimes live, with the numbers obviously being updated from some remote server. The HTML on my PC simply has a hole where their servers are pushing through all the interesting data that I need.
Now my experience with dynamic web content is low, so this thing is something I’m having trouble getting my head around.
I think Java or JavaScript is key; this pops up often.
The scraper is simply an odds-comparison engine. Some sites have APIs, but I need this for those that don't. I'm using the Scrapy library with Python 2.7.
I do apologize if this question is too open-ended. In short, my question is: how can Scrapy be used to scrape this dynamic data so that I can use it? So that I can scrape this betting-odds data in real time?
WebKit-based browsers (like Google Chrome or Safari) have built-in developer tools. In Chrome you can open them via Menu -> Tools -> Developer Tools. The Network tab allows you to see all the information about every request and response:
At the bottom of the picture you can see that I've filtered the requests down to XHR; these are requests made by JavaScript code.
Tip: the log is cleared every time you load a page; the black dot button at the bottom of the panel will preserve the log.
After analyzing the requests and responses you can simulate those requests from your web crawler and extract valuable data. In many cases it will be easier to get your data this way than by parsing HTML, because that data does not contain presentation logic and is formatted to be accessed by JavaScript code.
Firefox has a similar extension called Firebug. Some will argue that Firebug is even more powerful, but I like the simplicity of WebKit.
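As a sketch of replaying such an XHR directly from a spider (the endpoint URL and JSON field names are invented for illustration), you can request the JSON endpoint and parse it instead of the HTML page:

import json

import scrapy

class OddsSpider(scrapy.Spider):
    # hypothetical spider: the API endpoint and field names are placeholders
    name = "odds"
    start_urls = ["https://example.com/api/odds?race=42"]

    def parse(self, response):
        data = json.loads(response.text)
        for runner in data.get("runners", []):
            yield {"horse": runner.get("name"), "price": runner.get("price")}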
Here is a simple example of Scrapy with an AJAX request. Let's look at the site rubin-kazan.ru.
All messages are loaded with an AJAX request. My goal is to fetch these messages with all their attributes (author, date, …):
When I analyze the source code of the page I can't see all these messages, because the web page uses AJAX technology. But I can use Firebug from Mozilla Firefox (or an equivalent tool in other browsers) to analyze the HTTP request that generates the messages on the web page:
It doesn't reload the whole page, only the parts of the page that contain messages. For this purpose I click an arbitrary page number at the bottom:
And I observe the HTTP request that is responsible for the message body:
When it finishes, I analyze the headers of the request (I should note that I will extract this URL from the var section of the page source; see the code below):
And the form-data content of the request (the HTTP method is POST):
And the content of the response, which is JSON:
This presents all the information I'm looking for.
From now on, I must implement all this knowledge in Scrapy. Let's define the spider for this purpose:
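The spider code itself is missing from this copy of the answer, so the following is only a sketch of the idea under stated assumptions: the AJAX URL, form fields and JSON structure are placeholders, not the actual rubin-kazan.ru endpoint.

import json

import scrapy

class MessagesSpider(scrapy.Spider):
    # hypothetical reconstruction of the approach: POST to the AJAX endpoint and parse the JSON
    name = "messages"
    ajax_url = "https://example.com/ajax/messages"  # placeholder for the URL found with Firebug

    def start_requests(self):
        yield scrapy.FormRequest(
            self.ajax_url,
            formdata={"page": "1"},  # placeholder for the form data observed in the request
            callback=self.parse_messages,
        )

    def parse_messages(self, response):
        for msg in json.loads(response.text).get("messages", []):
            yield {"author": msg.get("author"), "date": msg.get("date"), "text": msg.get("text")}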
Many times when crawling we run into problems where content that is rendered on the page is generated with JavaScript, and therefore Scrapy is unable to crawl it (e.g. AJAX requests, jQuery craziness).
However, if you use Scrapy along with the web-testing framework Selenium, then we are able to crawl anything displayed in a normal web browser.
Some things to note:
You must have the Python version of Selenium RC installed for this to work, and you must have set up Selenium properly. Also, this is just a template crawler. You could get much crazier and more advanced with things, but I just wanted to show the basic idea. As the code stands now, you will be making two requests for any given URL. One request is made by Scrapy and the other by Selenium. I am sure there are ways around this so that you could possibly make Selenium do the one and only request, but I did not bother to implement that, and by doing two requests you get to crawl the page with Scrapy too.
This is quite powerful, because now you have the entire rendered DOM available for you to crawl and you can still use all the nice crawling features in Scrapy. This will make for slower crawling, of course, but depending on how much you need the rendered DOM it might be worth the wait.
import time  # needed for the time.sleep call below

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item  # the snippet below instantiates a bare Item
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from selenium import selenium
class SeleniumSpider(CrawlSpider):
    name = "SeleniumSpider"
    start_urls = ["http://www.domain.com"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\.html', )), callback='parse_page', follow=True),
    )

    def __init__(self):
        CrawlSpider.__init__(self)
        self.verificationErrors = []
        self.selenium = selenium("localhost", 4444, "*chrome", "http://www.domain.com")
        self.selenium.start()

    def __del__(self):
        self.selenium.stop()
        print self.verificationErrors
        CrawlSpider.__del__(self)

    def parse_page(self, response):
        item = Item()

        hxs = HtmlXPathSelector(response)
        # Do some XPath selection with Scrapy
        hxs.select('//div').extract()

        sel = self.selenium
        sel.open(response.url)

        # Wait for javascript to load in Selenium
        time.sleep(2.5)

        # Do some crawling of javascript-created content with Selenium
        sel.get_text("//div")
        yield item
# Snippet imported from snippets.scrapy.org (which no longer works)
# author: wynbennett
# date : Jun 21, 2011
Another solution would be to implement a download handler or download-handler middleware (see the Scrapy docs for more information on downloader middleware). The following is an example class using Selenium with a headless PhantomJS webdriver:
1) Define the class within the middlewares.py script.
from selenium import webdriver
from scrapy.http import HtmlResponse
class JsDownload(object):

    @check_spider_middleware
    def process_request(self, request, spider):
        driver = webdriver.PhantomJS(executable_path='D:\phantomjs.exe')
        driver.get(request.url)
        return HtmlResponse(request.url, encoding='utf-8', body=driver.page_source.encode('utf-8'))
2) Add the JsDownload() class to the DOWNLOADER_MIDDLEWARES setting within settings.py:
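The settings snippet itself is not included here; a minimal sketch (the module path assumes JsDownload lives in your_project/middlewares.py, and the priority value is arbitrary) would be:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'your_project.middlewares.JsDownload': 543,  # hypothetical module path and priority
}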
3) Integrate the HTMLResponse within your_spider.py. Decoding the response body will get you the desired output.
class Spider(CrawlSpider):
    # define unique name of spider
    name = "spider"
    start_urls = ["https://www.url.de"]

    def parse(self, response):
        # initialize items
        item = CrawlerItem()
        # store data as items
        item["js_enabled"] = response.body.decode("utf-8")
Optional Addon:
I wanted the ability to tell different spiders which middleware to use so I implemented this wrapper:
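The wrapper code is missing from this copy of the answer; the sketch below is only an assumption about how it might look: spiders declare a middleware set, and the decorator applied to process_request above runs the method only when the spider has opted in.

import functools

def check_spider_middleware(method):
    # hypothetical reconstruction: only apply the middleware if the spider opted in
    @functools.wraps(method)
    def wrapper(self, request, spider):
        if self.__class__ in getattr(spider, 'middleware', set()):
            return method(self, request, spider)
        return None  # returning None lets Scrapy carry on with the normal download
    return wrapper

A spider would then opt in with something like middleware = {JsDownload} as a class attribute.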
Advantage:
The main advantage of implementing it this way rather than in the spider is that you only end up making one request. In A T's solution, for example, the download handler processes the request and then hands the response off to the spider. The spider then makes a brand-new request in its parse_page function; that's two requests for the same content.
I was using a custom downloader middleware, but wasn’t very happy with it, as I didn’t manage to make the cache work with it.
A better approach was to implement a custom download handler.
There is a working example here. It looks like this:
# encoding: utf-8
from __future__ import unicode_literals
from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.responsetypes import responsetypes
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from six.moves import queue
from twisted.internet import defer, threads
from twisted.python.failure import Failure
class PhantomJSDownloadHandler(object):

    def __init__(self, settings):
        self.options = settings.get('PHANTOMJS_OPTIONS', {})

        max_run = settings.get('PHANTOMJS_MAXRUN', 10)
        self.sem = defer.DeferredSemaphore(max_run)
        self.queue = queue.LifoQueue(max_run)

        SignalManager(dispatcher.Any).connect(self._close, signal=signals.spider_closed)

    def download_request(self, request, spider):
        """use semaphore to guard a phantomjs pool"""
        return self.sem.run(self._wait_request, request, spider)

    def _wait_request(self, request, spider):
        try:
            driver = self.queue.get_nowait()
        except queue.Empty:
            driver = webdriver.PhantomJS(**self.options)

        driver.get(request.url)
        # ghostdriver won't response when switch window until page is loaded
        dfd = threads.deferToThread(lambda: driver.switch_to.window(driver.current_window_handle))
        dfd.addCallback(self._response, driver, spider)
        return dfd

    def _response(self, _, driver, spider):
        body = driver.execute_script("return document.documentElement.innerHTML")
        if body.startswith("<head></head>"):  # cannot access response header in Selenium
            body = driver.execute_script("return document.documentElement.textContent")
        url = driver.current_url
        respcls = responsetypes.from_args(url=url, body=body[:100].encode('utf8'))
        resp = respcls(url=url, body=body, encoding="utf-8")

        response_failed = getattr(spider, "response_failed", None)
        if response_failed and callable(response_failed) and response_failed(resp, driver):
            driver.close()
            return defer.fail(Failure())
        else:
            self.queue.put(driver)
            return defer.succeed(resp)

    def _close(self):
        while not self.queue.empty():
            driver = self.queue.get_nowait()
            driver.close()
Suppose your scraper is called "scraper". If you put the code above inside a file called handlers.py at the root of the "scraper" folder, then you could add this to your settings.py:
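The settings snippet is missing here; using the answer's own naming ("scraper" project, handlers.py), registering a download handler would look roughly like this:

# settings.py
DOWNLOAD_HANDLERS = {
    'http': 'scraper.handlers.PhantomJSDownloadHandler',
    'https': 'scraper.handlers.PhantomJSDownloadHandler',
}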
Yes, Scrapy can scrape dynamic websites, i.e. websites that are rendered through JavaScript.
There are two approaches to scraping these kinds of websites.
First,
you can use Splash to render the JavaScript code and then parse the rendered HTML.
You can find the docs and project here: Scrapy splash, git.
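As a sketch of the Splash approach (this assumes the scrapy-splash package is installed and a Splash instance is configured as described in its docs; the URL and spider name are placeholders):

import scrapy
from scrapy_splash import SplashRequest

class JsSpider(scrapy.Spider):
    name = "js_example"  # hypothetical spider name

    def start_requests(self):
        # ask Splash to render the page (waiting 2 seconds for scripts) before parsing
        yield SplashRequest("https://example.com", self.parse, args={"wait": 2})

    def parse(self, response):
        # response.text now contains the JavaScript-rendered HTML
        yield {"title": response.css("title::text").get()}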
Second,
As everyone is stating, by monitoring the network calls you can find the API call that fetches the data, and mocking that call in your Scrapy spider may help you get the desired data.
I handle the AJAX requests by using Selenium and the Firefox web driver. It is not that fast if you need the crawler to run as a daemon, but it is much better than any manual solution. I wrote a short tutorial here for reference.
I'm trying to install the Scrapy Python framework on OS X 10.11 (El Capitan) via pip. The installation script downloads the required modules and at some point returns the following error:
OSError: [Errno 1] Operation not permitted: '/tmp/pip-nIfswi-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six-1.4.1-py2.7.egg-info'
I’ve tried to deactivate the rootless feature in OSX 10.11 with the command:
sudo nvram boot-args="rootless=0";sudo reboot
but I still get the same error when the machine reboots.
Any clue or idea from my fellow StackExchangers?
If it helps, the full script output is the following:
sudo -s pip install scrapy
Collecting scrapy
Downloading Scrapy-1.0.2-py2-none-any.whl (290kB)
100% |████████████████████████████████| 290kB 345kB/s
Requirement already satisfied (use --upgrade to upgrade): cssselect>=0.9 in /Library/Python/2.7/site-packages (from scrapy)
Requirement already satisfied (use --upgrade to upgrade): queuelib in /Library/Python/2.7/site-packages (from scrapy)
Requirement already satisfied (use --upgrade to upgrade): pyOpenSSL in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from scrapy)
Collecting w3lib>=1.8.0 (from scrapy)
Downloading w3lib-1.12.0-py2.py3-none-any.whl
Collecting lxml (from scrapy)
Downloading lxml-3.4.4.tar.gz (3.5MB)
100% |████████████████████████████████| 3.5MB 112kB/s
Collecting Twisted>=10.0.0 (from scrapy)
Downloading Twisted-15.3.0.tar.bz2 (4.4MB)
100% |████████████████████████████████| 4.4MB 94kB/s
Collecting six>=1.5.2 (from scrapy)
Downloading six-1.9.0-py2.py3-none-any.whl
Requirement already satisfied (use --upgrade to upgrade): zope.interface>=3.6.0 in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from Twisted>=10.0.0->scrapy)
Requirement already satisfied (use --upgrade to upgrade): setuptools in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from zope.interface>=3.6.0->Twisted>=10.0.0->scrapy)
Installing collected packages: six, w3lib, lxml, Twisted, scrapy
Found existing installation: six 1.4.1
DEPRECATION: Uninstalling a distutils installed project (six) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
Uninstalling six-1.4.1:
Exception:
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/pip-7.1.0-py2.7.egg/pip/basecommand.py", line 223, in main
status = self.run(options, args)
File "/Library/Python/2.7/site-packages/pip-7.1.0-py2.7.egg/pip/commands/install.py", line 299, in run
root=options.root_path,
File "/Library/Python/2.7/site-packages/pip-7.1.0-py2.7.egg/pip/req/req_set.py", line 640, in install
requirement.uninstall(auto_confirm=True)
File "/Library/Python/2.7/site-packages/pip-7.1.0-py2.7.egg/pip/req/req_install.py", line 726, in uninstall
paths_to_remove.remove(auto_confirm)
File "/Library/Python/2.7/site-packages/pip-7.1.0-py2.7.egg/pip/req/req_uninstall.py", line 125, in remove
renames(path, new_path)
File "/Library/Python/2.7/site-packages/pip-7.1.0-py2.7.egg/pip/utils/__init__.py", line 314, in renames
shutil.move(old, new)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 302, in move
copy2(src, real_dst)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 131, in copy2
copystat(src, dst)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 103, in copystat
os.chflags(dst, st.st_flags)
OSError: [Errno 1] Operation not permitted: '/tmp/pip-nIfswi-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six-1.4.1-py2.7.egg-info'
As the other answers said, it’s because of the new System Integrity Protection, but I believe the other answers are overcomplicated.
If you're only going to use that package as the current user, you should be able to install it just fine, without needing to disable SIP, by using the --user flag. Like this:
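Presumably along these lines (the exact invocation depends on which pip you are using):

pip install --user scrapy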
I would advise very strongly against modifying the system Python on the Mac; numerous issues can occur.
Your particular error shows that the installer has issues resolving the dependencies for Scrapy without impacting the current Python installation. The system uses Python for a number of essential tasks, so it’s important to keep the system installation stable and as originally installed by Apple.
I would also exhaust all other possibilities before bypassing built in security.
Package Manager Solutions:
Please look into a Python virtualization tool such as virtualenv first; this will allow you to experiment safely.
Another useful tool to use languages and software without conflicting with your Mac OS is Homebrew. Like MacPorts or Fink, Homebrew is a package manager for Mac, and is useful for safely trying lots of other languages and tools.
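For example, a minimal virtualenv workflow (the environment path is arbitrary) keeps Scrapy entirely out of the system Python:

pip install --user virtualenv        # install virtualenv for the current user only
virtualenv ~/venvs/scrapy-env        # create an isolated environment
source ~/venvs/scrapy-env/bin/activate
pip install scrapy                   # installs into the virtualenv, not the system Python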
“Roll your own” Software Installs:
If you don’t like the package manager approach, you could use the /usr/local path or create an /opt/local directory for installing an alternate Python installation and fix up your paths in your .bashrc. Note that you’ll have to enable root for these solutions.
How to do it anyway:
If you absolutely must disable the security check (and I sincerely hope it’s for something other than messing with the system languages and resources), you can disable it temporarily and re-enable it using some of the techniques in this post on how to Disable System Integrity-Protection.
Restart Mac -> hold down “Command + R” after the startup chime -> Opens OS X Utilities -> Open Terminal and type “csrutil disable” -> Reboot OS X -> Open Terminal and check “csrutil status”
1. Disable SIP (System Integrity Protection): reboot, hold Command+R to enter recovery mode, open Terminal, and run:
csrutil disable
reboot
2. Reinstall Scrapy, pointing the compiler at the SDK's libxml2 headers and ignoring the already-installed six:
sudo C_INCLUDE_PATH=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include/libxml2:/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include/libxml2/libxml:/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include pip install scrapy --ignore-installed six
3. Then remove the old six and install it again:
sudo rm -rf /Library/Python/2.7/site-packages/six*
sudo rm -rf /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six*
sudo pip install six
I want to install lxml so I can then install Scrapy.
When I updated my Mac today it wouldn't let me reinstall lxml; I get the following error:
In file included from src/lxml/lxml.etree.c:314:
/private/tmp/pip_build_root/lxml/src/lxml/includes/etree_defs.h:9:10: fatal error: 'libxml/xmlversion.h' file not found
#include "libxml/xmlversion.h"
^
1 error generated.
error: command 'cc' failed with exit status 1
I have tried using brew to install libxml2 and libxslt; both installed fine, but I still cannot install lxml.
Last time I was installing it I needed to enable the developer tools in Xcode, but since it updated to Xcode 5 it doesn't give me that option anymore.
I solved this issue on Yosemite by both installing and linking libxml2 and libxslt through brew:
brew install libxml2
brew install libxslt
brew link libxml2 --force
brew link libxslt --force
If you have solved the problem using this method but it pops up again at a later time, you might need to run this before the four lines above:
brew unlink libxml2
brew unlink libxslt
If you are having permission errors with Homebrew, especially on El Capitan, this is a helpful document. In essence, regardless of OS X version, try running:
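The command itself is missing from this copy; the usual fix that document describes (an assumption on my part, not a quote from the original answer) is to take ownership of Homebrew's prefix:

sudo chown -R $(whoami) /usr/local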
I tried most of the solutions above, but none of them worked for me. I'm running Yosemite 10.10, and the only solution that worked for me was to type this in the terminal:
This has been bothering me as well for a while. I don't know enough about the internals of Python distutils etc., but the include path here is wrong. I made the following ugly hack to hold me over until the Python lxml people provide the proper fix.
To speed up the build in test environments, e.g. on a continuous integration server, disable the C compiler optimisations by setting the CFLAGS environment variable:
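The command is missing from this copy; based on the lxml documentation's suggestion (treat the exact invocation as an assumption), it amounts to something like:

CFLAGS="-O0" pip install lxml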
After a successful install from pip (lxml 3.6.4) I was getting an error when importing the lxml.etree module.
I was searching endlessly for a way to install this as a prerequisite for Scrapy and tried all the options, but finally this worked for me (Mac OS X 10.11, Python 2.7):
The older version of lxml seems to work with the etree module.
Pip can often ignore the specified version of a package, for example when you have a newer version in the pip cache, hence the easy_install. The '-2.7' option is for the Python version; omit it if you are installing for Python 3.x.
A MacPorts package of lxml is available. Try something like port install py25-lxml.
If you do not have MacPorts, install it from MacPorts.org. It's quite easy. You may also need a compiler; to install the Xcode command-line tools, use xcode-select --install.
First I updated my ports to the latest version via sudo port selfupdate.
Then I just typed sudo port install libxml2, and several minutes later libxml2 was installed successfully. You will probably also need libxslt to install lxml; to install libxslt, use sudo port install libxslt.
Now just type pip install lxml; it should work fine.
Answer 18
Before compiling, add the path to xmlversion.h to your environment:
$ set INCLUDE=$INCLUDE:/private/tmp/pip_build_root/lxml/src/lxml/