I have recently been learning Python and am dipping my hand into building a web-scraper. It’s nothing fancy at all; its only purpose is to get the data off of a betting website and have this data put into Excel.
Most of the issues are solvable and I'm having a good little mess around. However, I'm hitting a massive hurdle over one issue. If a site loads a table of horses and lists current betting prices, this information is not in any source file. The clue is that this data is live sometimes, with the numbers obviously being updated from some remote server. The HTML on my PC simply has a hole where their servers are pushing through all the interesting data that I need.
Now my experience with dynamic web content is low, so this thing is something I’m having trouble getting my head around.
I think Java or JavaScript is a key; this pops up often.
The scraper is simply an odds comparison engine. Some sites have APIs, but I need this for those that don't. I'm using the scrapy library with Python 2.7.
I do apologize if this question is too open-ended. In short, my question is: how can scrapy be used to scrape this dynamic data so that I can use it? So that I can scrape this betting odds data in real-time?
WebKit-based browsers (like Google Chrome or Safari) have built-in developer tools. In Chrome you can open them via Menu->Tools->Developer Tools. The Network tab allows you to see all information about every request and response:
At the bottom of the picture you can see that I've filtered the requests down to XHR – these are requests made by JavaScript code.
Tip: the log is cleared every time you load a page; the black dot button at the bottom of the picture will preserve the log.
After analyzing requests and responses you can simulate these requests from your web crawler and extract valuable data. In many cases it will be easier to get your data this way than by parsing the HTML, because that data does not contain presentation logic and is formatted to be accessed by JavaScript code.
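For example, a minimal sketch of replaying such a request from a Scrapy spider could look like this (the endpoint URL and the JSON keys are hypothetical placeholders for whatever your own analysis turns up):

import json

import scrapy


class OddsSpider(scrapy.Spider):
    name = "odds"
    # Placeholder: the XHR URL copied from the browser's Network tab
    start_urls = ["http://www.example.com/ajax/odds.json"]

    def parse(self, response):
        # The endpoint returns JSON, so parse it directly instead of HTML
        data = json.loads(response.body)
        for runner in data.get("runners", []):  # placeholder key
            yield {"horse": runner.get("name"), "price": runner.get("price")}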
Firefox has a similar extension; it is called Firebug. Some will argue that Firebug is even more powerful, but I like the simplicity of WebKit.
Here is a simple example of scrapy with an AJAX request. Let's look at the site rubin-kazan.ru.
All messages are loaded with an AJAX request. My goal is to fetch these messages with all their attributes (author, date, …):
When I analyze the source code of the page I can't see all these messages, because the web page uses AJAX technology. But I can use Firebug from Mozilla Firefox (or an equivalent tool in other browsers) to analyze the HTTP request that generates the messages on the web page:
It doesn't reload the whole page, but only the parts of the page that contain messages. For this purpose I click an arbitrary page number at the bottom:
And I observe the HTTP request that is responsible for the message body:
After finishing, I analyze the headers of the request (I must note that I'll extract this URL from the var section of the source page; see the code below):
And the form data content of the request (the HTTP method is “Post”):
And the content of the response, which is JSON:
Which presents all the information I’m looking for.
From now on, I must implement all this knowledge in scrapy. Let's define the spider for this purpose:
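A minimal sketch of such a spider, using the current Scrapy API, might look like this (the AJAX URL, form fields and JSON keys are placeholders for the values found with Firebug):

import json

import scrapy
from scrapy.http import FormRequest


class RubinKazanSpider(scrapy.Spider):
    name = "rubin_kazan"
    start_urls = ["http://www.rubin-kazan.ru/"]

    def parse(self, response):
        # In practice, extract the AJAX URL from the "var" section of the page
        # source; the value below is only a placeholder.
        ajax_url = "http://www.rubin-kazan.ru/ajax/messages"  # placeholder
        yield FormRequest(
            ajax_url,
            formdata={"page": "1"},  # placeholder form field
            callback=self.parse_messages,
        )

    def parse_messages(self, response):
        # The response is JSON, so parse it directly instead of HTML.
        for msg in json.loads(response.body).get("messages", []):  # placeholder key
            yield {"author": msg.get("author"), "date": msg.get("date")}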
Many times when crawling we run into problems where content that is rendered on the page is generated with JavaScript, and therefore scrapy is unable to crawl it (e.g. AJAX requests, jQuery craziness).
However, if you use Scrapy along with the web testing framework Selenium, then you can crawl anything displayed in a normal web browser.
Some things to note:
You must have the Python version of Selenium RC installed for this to work, and you must have set up Selenium properly. Also, this is just a template crawler. You could get much crazier and more advanced with things, but I just wanted to show the basic idea. As the code stands now, you will be making two requests for any given URL: one request is made by Scrapy and the other is made by Selenium. I am sure there are ways around this so that you could possibly just make Selenium do the one and only request, but I did not bother to implement that, and by doing two requests you get to crawl the page with Scrapy too.
This is quite powerful, because now you have the entire rendered DOM available for you to crawl and you can still use all the nice crawling features in Scrapy. This will make for slower crawling of course, but depending on how much you need the rendered DOM it might be worth the wait.
import time

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.item import Item
from selenium import selenium


class SeleniumSpider(CrawlSpider):
    name = "SeleniumSpider"
    start_urls = ["http://www.domain.com"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\.html', )), callback='parse_page', follow=True),
    )

    def __init__(self):
        CrawlSpider.__init__(self)
        self.verificationErrors = []
        # Selenium RC server is expected to be running on localhost:4444
        self.selenium = selenium("localhost", 4444, "*chrome", "http://www.domain.com")
        self.selenium.start()

    def __del__(self):
        self.selenium.stop()
        print self.verificationErrors
        CrawlSpider.__del__(self)

    def parse_page(self, response):
        item = Item()
        hxs = HtmlXPathSelector(response)
        # Do some XPath selection with Scrapy
        hxs.select('//div').extract()
        sel = self.selenium
        sel.open(response.url)
        # Wait for javascript to load in Selenium
        time.sleep(2.5)
        # Do some crawling of javascript created content with Selenium
        sel.get_text("//div")
        yield item

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: wynbennett
# date  : Jun 21, 2011
Another solution would be to implement a download handler or download handler middleware (see the scrapy docs for more information on downloader middleware). The following is an example class using Selenium with the headless PhantomJS webdriver:
1) Define the class within the middlewares.py script.
from selenium import webdriver
from scrapy.http import HtmlResponse


class JsDownload(object):

    @check_spider_middleware
    def process_request(self, request, spider):
        driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs.exe')
        driver.get(request.url)
        return HtmlResponse(request.url, encoding='utf-8', body=driver.page_source.encode('utf-8'))
2) Add the JsDownload() class to the DOWNLOADER_MIDDLEWARES setting within settings.py:
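For example (the project/module path and the priority value below are placeholders):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'yourproject.middlewares.JsDownload': 543,  # placeholder path and priority
}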
3) Integrate the HtmlResponse within your_spider.py. Decoding the response body will get you the desired output (the CrawlerItem import path below is a placeholder for your own items module).
from scrapy.contrib.spiders import CrawlSpider
from yourproject.items import CrawlerItem  # placeholder: your project's items module


class Spider(CrawlSpider):
    # define unique name of spider
    name = "spider"
    start_urls = ["https://www.url.de"]

    def parse(self, response):
        # initialize items
        item = CrawlerItem()
        # store data as items
        item["js_enabled"] = response.body.decode("utf-8")
        yield item
Optional Addon:
I wanted the ability to tell different spiders which middleware to use so I implemented this wrapper:
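The wrapper itself isn't shown above; a minimal sketch of the idea, assuming each spider lists the middleware classes it wants applied in a `middleware` attribute, could look like this:

import functools


def check_spider_middleware(method):
    # Sketch: run the wrapped middleware method only if the spider has opted
    # in by listing this middleware class in its `middleware` attribute.
    @functools.wraps(method)
    def wrapper(self, request, spider):
        if self.__class__ in getattr(spider, 'middleware', set()):
            return method(self, request, spider)
        return None  # otherwise let Scrapy's normal download handling take over
    return wrapper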
Advantage:
The main advantage of implementing it this way rather than in the spider is that you only end up making one request. In A T's solution, for example, the download handler processes the request and then hands off the response to the spider. The spider then makes a brand new request in its parse_page function; that's two requests for the same content.
I was using a custom downloader middleware, but wasn’t very happy with it, as I didn’t manage to make the cache work with it.
A better approach was to implement a custom download handler.
There is a working example here. It looks like this:
# encoding: utf-8
from __future__ import unicode_literals
from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.responsetypes import responsetypes
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from six.moves import queue
from twisted.internet import defer, threads
from twisted.python.failure import Failure
class PhantomJSDownloadHandler(object):

    def __init__(self, settings):
        self.options = settings.get('PHANTOMJS_OPTIONS', {})

        max_run = settings.get('PHANTOMJS_MAXRUN', 10)
        self.sem = defer.DeferredSemaphore(max_run)
        self.queue = queue.LifoQueue(max_run)

        SignalManager(dispatcher.Any).connect(self._close, signal=signals.spider_closed)

    def download_request(self, request, spider):
        """use semaphore to guard a phantomjs pool"""
        return self.sem.run(self._wait_request, request, spider)

    def _wait_request(self, request, spider):
        try:
            driver = self.queue.get_nowait()
        except queue.Empty:
            driver = webdriver.PhantomJS(**self.options)

        driver.get(request.url)
        # ghostdriver won't response when switch window until page is loaded
        dfd = threads.deferToThread(
            lambda: driver.switch_to.window(driver.current_window_handle))
        dfd.addCallback(self._response, driver, spider)
        return dfd

    def _response(self, _, driver, spider):
        body = driver.execute_script("return document.documentElement.innerHTML")
        if body.startswith("<head></head>"):  # cannot access response header in Selenium
            body = driver.execute_script("return document.documentElement.textContent")
        url = driver.current_url
        respcls = responsetypes.from_args(url=url, body=body[:100].encode('utf8'))
        resp = respcls(url=url, body=body, encoding="utf-8")

        response_failed = getattr(spider, "response_failed", None)
        if response_failed and callable(response_failed) and response_failed(resp, driver):
            driver.close()
            return defer.fail(Failure())
        else:
            self.queue.put(driver)
            return defer.succeed(resp)

    def _close(self):
        while not self.queue.empty():
            driver = self.queue.get_nowait()
            driver.close()
Suppose your scraper is called "scraper". If you put the mentioned code inside a file called handlers.py at the root of the "scraper" folder, then you could add to your settings.py:
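For example (assuming the handler class shown above lives in scraper/handlers.py):

# settings.py
DOWNLOAD_HANDLERS = {
    'http': 'scraper.handlers.PhantomJSDownloadHandler',
    'https': 'scraper.handlers.PhantomJSDownloadHandler',
}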
Yes, Scrapy can scrape dynamic websites, that is, websites that are rendered through JavaScript.
There are two approaches to scraping these kinds of websites.
First,
you can use splash to render the JavaScript code and then parse the rendered HTML (a minimal sketch follows after the second approach).
You can find the doc and project here: Scrapy splash, git
Second,
As everyone is stating, by monitoring the network calls you can find the API call that fetches the data, and mocking that call in your scrapy spider might help you get the desired data.
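Here is a minimal sketch of the first (Splash) approach, assuming a Splash instance is running (e.g. via Docker on localhost:8050) and the scrapy-splash settings from its README are in place; the target site and selector are only examples:

import scrapy
from scrapy_splash import SplashRequest


class QuotesJsSpider(scrapy.Spider):
    name = "quotes_js"

    def start_requests(self):
        # Render the page in Splash, waiting briefly for the JavaScript to run.
        yield SplashRequest(
            "http://quotes.toscrape.com/js/",  # example JavaScript-rendered page
            self.parse,
            args={"wait": 1},
        )

    def parse(self, response):
        # The response now contains the JavaScript-rendered HTML.
        for quote in response.css("div.quote span.text::text").extract():
            yield {"quote": quote}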
I handle the AJAX requests by using Selenium and the Firefox web driver. It is not that fast if you need the crawler as a daemon, but it is much better than any manual solution. I wrote a short tutorial here for reference.
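A minimal sketch of that approach (not the tutorial itself): let Firefox execute the JavaScript, then hand the rendered HTML to Scrapy's selector; the URL and XPath are placeholders:

from scrapy.selector import Selector
from selenium import webdriver

# Recent Selenium/Firefox combinations also need geckodriver on the PATH.
driver = webdriver.Firefox()
try:
    driver.get("http://www.example.com/")  # placeholder URL
    sel = Selector(text=driver.page_source)
    # Placeholder XPath: pick whatever the rendered DOM actually contains.
    print(sel.xpath("//td[@class='price']/text()").extract())
finally:
    driver.quit()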
I'd like to grab daily sunrise/sunset times from a web site. Is it possible to scrape web content with Python? What are the modules used? Is there any tutorial available?
Use urllib2 in combination with the brilliant BeautifulSoup library:
import urllib2
from BeautifulSoup import BeautifulSoup
# or if you're using BeautifulSoup4:
# from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://example.com').read())
for row in soup('table', {'class': 'spad'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string
    # will print date and sunrise
Scrapy crawling is faster than mechanize because it uses asynchronous operations (on top of Twisted).
Scrapy has better and faster support for parsing (x)html on top of libxml2.
Scrapy is a mature framework with full unicode support; it handles redirections, gzipped responses, odd encodings, an integrated http cache, etc.
Once you are into Scrapy, you can write a spider in less than 5 minutes that downloads images, creates thumbnails and exports the extracted data directly to CSV or JSON.
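As a rough sketch of that last point (the URL and selectors are examples only), a spider that yields items can be exported straight to JSON or CSV via Scrapy's feed exports:

# Run with:  scrapy runspider sun_spider.py -o sunrise.json   (or .csv)
import scrapy


class SunSpider(scrapy.Spider):
    name = "sun"
    start_urls = ["http://www.example.com/"]  # placeholder URL

    def parse(self, response):
        for row in response.css("table.spad tbody tr"):  # placeholder selector
            tds = row.css("td::text").extract()
            if len(tds) >= 2:
                yield {"date": tds[0], "sunrise": tds[1]}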
I would strongly suggest checking out pyquery. It uses jquery-like (aka css-like) syntax which makes things really easy for those coming from that background.
For your case, it would be something like:
from pyquery import *
html = PyQuery(url='http://www.example.com/')
trs = html('table.spad tbody tr')
for tr in trs:
    tds = tr.getchildren()
    print tds[1].text, tds[2].text
Output:
5:16 AM 9:28 PM
5:15 AM 9:30 PM
5:13 AM 9:31 PM
5:12 AM 9:33 PM
5:11 AM 9:34 PM
5:10 AM 9:35 PM
5:09 AM 9:37 PM
I use a combination of Scrapemark (finding URLs – py2) and httplib2 (downloading images – py2+3). scrapemark.py has 500 lines of code but relies on regular expressions, so it may not be that fast; I did not test it.
I know I have come late to the party, but I have a nice suggestion for you.
Using BeautifulSoup has already been suggested; I would rather prefer using CSS selectors to scrape the data inside the HTML. (In the snippet below, header and timeout_time are example values.)
import random
import time

import requests
from bs4 import BeautifulSoup

# Example values for the variables the snippet relies on
header = [{"User-Agent": "Mozilla/5.0"}]   # list of headers to pick from at random
timeout_time = 10                          # request timeout in seconds


# This is a method that scrapes a URL and, if it doesn't get scraped, waits for
# 20 seconds and then tries again. (I use it because my internet connection
# sometimes gets disconnected.)
def tryAgain(passed_url):
    try:
        page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
        return page
    except Exception:
        while 1:
            print("Trying again the URL:")
            print(passed_url)
            try:
                page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
                print("-------------------------------------")
                print("---- URL was successfully scraped ---")
                print("-------------------------------------")
                return page
            except Exception:
                time.sleep(20)
                continue


main_url = "http://www.example.com"
main_page_html = tryAgain(main_url)
main_page_soup = BeautifulSoup(main_page_html)

# Scrape all TDs from TRs inside Table
for tr in main_page_soup.select("table.class_of_table"):
    for td in tr.select("td#id"):
        print(td.text)
        # For anchors inside TD
        print(td.select("a")[0].text)
        # Value of Href attribute
        print(td.select("a")[0]["href"])
If we think of getting the names of items from any specific category, then we can do that by specifying the class name of that category using a CSS selector:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://www.flipkart.com/').text, "lxml")
for link in soup.select('div._2kSfQ4'):
    print(link.text)

These are partial search results:

Puma, USPA, Adidas & more Up to 70% Off Men's Shoes
Shirts, T-Shirts... Under ₹599 For Men
Nike, UCB, Adidas & more Under ₹999 Men's Sandals, Slippers
Philips & more Starting ₹99 LED Bulbs & Emergency Lights
Here is a simple web crawler. I used BeautifulSoup, and we will search for all the links (anchors) whose class name is _3NFO0d. I used Flipkart.com; it is an online retailing store.
import requests
from bs4 import BeautifulSoup


def crawl_flipkart():
    url = 'https://www.flipkart.com/'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "lxml")
    for link in soup.findAll('a', {'class': '_3NFO0d'}):
        href = link.get('href')
        print(href)


crawl_flipkart()
Python has good options for scraping the web. The best framework-based one is scrapy. It can be a little tricky for beginners, so here is a little help.
1. Install Python above 3.5 (lower versions down to 2.7 will work).
2. Create an environment in conda (I did this).
3. Install scrapy at a location and run it from there.
4. Scrapy shell will give you an interactive interface to test your code.
5. scrapy startproject projectname will create a framework.
6. scrapy genspider spidername will create a spider. You can create as many spiders as you want. While doing this, make sure you are inside the project directory.
The easier option is to use requests and Beautiful Soup. Before starting, spend an hour going through the documentation; it will solve most of your doubts. BS4 offers a wide range of parsers that you can opt for. Use a user-agent and sleep to make scraping easier. BS4 returns a bs.tag, so use variable[0]. If there is JS running, you won't be able to scrape using requests and bs4 directly. You could get the API link, then parse the JSON to get the information you need, or try selenium.
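For instance, a small sketch of that requests/BS4 approach with a User-Agent header and a polite sleep between requests (the URL, selector and pagination are placeholders):

import time

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (compatible; my-scraper/0.1)"}  # example UA

for page in range(1, 4):  # placeholder pagination
    resp = requests.get("http://www.example.com/page/%d" % page,
                        headers=headers, timeout=10)
    soup = BeautifulSoup(resp.text, "lxml")
    for row in soup.select("table.spad tbody tr"):  # placeholder selector
        tds = row.find_all("td")
        if len(tds) >= 2:
            print(tds[0].get_text(), tds[1].get_text())
    time.sleep(2)  # be polite between requests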