Tag archive: screen-scraping

Can scrapy be used to scrape dynamic content from websites that are using AJAX?

Question: Can scrapy be used to scrape dynamic content from websites that are using AJAX?

I have recently been learning Python and am dipping my hand into building a web-scraper. It’s nothing fancy at all; its only purpose is to get the data off of a betting website and have this data put into Excel.

Most of the issues are solvable and I'm having a good little mess around. However, I'm hitting a massive hurdle over one issue. If a site loads a table of horses and lists current betting prices, this information is not in any source file. The clue is that this data is live sometimes, with the numbers being updated obviously from some remote server. The HTML on my PC simply has a hole where their servers are pushing through all the interesting data that I need.

Now my experience with dynamic web content is low, so this thing is something I’m having trouble getting my head around.

I think Java or JavaScript is a key; this pops up often.

The scraper is simply an odds comparison engine. Some sites have APIs, but I need this for those that don't. I'm using the scrapy library with Python 2.7.

I do apologize if this question is too open-ended. In short, my question is: how can scrapy be used to scrape this dynamic data so that I can use it? So that I can scrape this betting odds data in real-time?


Answer 0

Webkit-based browsers (like Google Chrome or Safari) have built-in developer tools. In Chrome you can open them via Menu->Tools->Developer Tools. The Network tab allows you to see all the information about every request and response:

At the bottom of the picture you can see that I've filtered the requests down to XHR; these are requests made by JavaScript code.

Tip: the log is cleared every time you load a page; the black dot button at the bottom of the picture will preserve the log.

After analyzing the requests and responses, you can simulate these requests from your web crawler and extract valuable data. In many cases it will be easier to get your data this way than by parsing the HTML, because that data does not contain presentation logic and is formatted to be accessed by JavaScript code.
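
For instance, once the Network tab shows that the odds arrive from a JSON endpoint, a spider can request that endpoint directly instead of the HTML page. The following is only a sketch: the URL and the 'runners'/'price' field names are hypothetical and must be replaced with whatever the real XHR request and response use.

import json

import scrapy


class OddsSpider(scrapy.Spider):
    """Replays the XHR request found in the Network tab instead of parsing the HTML page."""
    name = 'odds'
    # Hypothetical endpoint spotted in the Network tab; replace it with the real XHR URL.
    start_urls = ['http://www.example.com/ajax/odds?race=1']

    def parse(self, response):
        data = json.loads(response.body)           # the XHR usually returns plain JSON
        for runner in data.get('runners', []):     # 'runners' and 'price' are made-up field names
            yield {
                'horse': runner.get('name'),
                'price': runner.get('price'),
            }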

Firefox has a similar extension called Firebug. Some will argue that Firebug is even more powerful, but I like the simplicity of Webkit.


Answer 1

Here is a simple example of scrapy with an AJAX request. Let's look at the site rubin-kazan.ru.

All messages are loaded with an AJAX request. My goal is to fetch these messages with all their attributes (author, date, …):

When I analyze the source code of the page I can't see all these messages, because the web page uses AJAX technology. But I can use Firebug from Mozilla Firefox (or an equivalent tool in other browsers) to analyze the HTTP requests that generate the messages on the web page:

It doesn't reload the whole page, only the parts of the page that contain messages. For this purpose I click an arbitrary page number at the bottom:

And I observe the HTTP request that is responsible for the message body:

When that's done, I analyze the headers of the request (note that I extract this URL from a var section of the source page; see the code below):

And the form data content of the request (the HTTP method is "POST"):

And the content of the response, which is a JSON file:

Which presents all the information I’m looking for.

From here on, I must implement all this knowledge in scrapy. Let's define the spider for this purpose:

import re

from scrapy.http import FormRequest
from scrapy.spider import BaseSpider  # older Scrapy versions; newer ones use scrapy.Spider


class spider(BaseSpider):
    name = 'RubiGuesst'
    start_urls = ['http://www.rubin-kazan.ru/guestbook.html']

    def parse(self, response):
        # The AJAX URL is embedded in a JavaScript variable on the guestbook page.
        url_list_gb_messages = re.search(r'url_list_gb_messages="(.*)"', response.body).group(1)
        # POST for the first page of messages (increase 'page' to paginate).
        yield FormRequest('http://www.rubin-kazan.ru' + url_list_gb_messages, callback=self.RubiGuessItem,
                          formdata={'page': '1', 'uid': ''})

    def RubiGuessItem(self, response):
        # The response body is the JSON document that contains the messages.
        json_file = response.body

In the parse function I have the response to the first request. In RubiGuessItem I have the JSON file with all the information.


Answer 2

Many times when crawling we run into problems where content that is rendered on the page is generated with JavaScript, and therefore scrapy is unable to crawl it (e.g. AJAX requests, jQuery craziness).

However, if you use Scrapy along with the web testing framework Selenium, then you are able to crawl anything displayed in a normal web browser.

Some things to note:

  • You must have the Python version of Selenium RC installed for this to work, and you must have set up Selenium properly. Also, this is just a template crawler. You could get much crazier and more advanced with things, but I just wanted to show the basic idea. As the code stands now, you will be making two requests for any given URL: one request is made by Scrapy and the other is made by Selenium. I am sure there are ways around this so that you could make Selenium do the one and only request, but I did not bother to implement that, and by doing two requests you get to crawl the page with Scrapy too.

  • This is quite powerful because now you have the entire rendered DOM available to crawl, and you can still use all the nice crawling features in Scrapy. This will make for slower crawling of course, but depending on how much you need the rendered DOM it might be worth the wait.

    import time

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    from scrapy.item import Item

    from selenium import selenium
    
    class SeleniumSpider(CrawlSpider):
        name = "SeleniumSpider"
        start_urls = ["http://www.domain.com"]
    
        rules = (
            Rule(SgmlLinkExtractor(allow=('\.html', )), callback='parse_page',follow=True),
        )
    
        def __init__(self):
            CrawlSpider.__init__(self)
            self.verificationErrors = []
            self.selenium = selenium("localhost", 4444, "*chrome", "http://www.domain.com")
            self.selenium.start()
    
        def __del__(self):
            self.selenium.stop()
            print self.verificationErrors
            CrawlSpider.__del__(self)
    
        def parse_page(self, response):
            item = Item()
    
            hxs = HtmlXPathSelector(response)
            #Do some XPath selection with Scrapy
            hxs.select('//div').extract()
    
            sel = self.selenium
            sel.open(response.url)
    
            #Wait for javscript to load in Selenium
            time.sleep(2.5)
    
            #Do some crawling of javascript created content with Selenium
            sel.get_text("//div")
            yield item
    
    # Snippet imported from snippets.scrapy.org (which no longer works)
    # author: wynbennett
    # date  : Jun 21, 2011
    

Reference: http://snipplr.com/view/66998/


Answer 3

Another solution would be to implement a download handler or download handler middleware (see the scrapy docs for more information on downloader middleware). The following is an example class using Selenium with the headless PhantomJS webdriver:

1) Define the class within the middlewares.py script:

from scrapy.http import HtmlResponse
from selenium import webdriver

# Note: check_spider_middleware is the optional decorator shown further down;
# in middlewares.py it has to be defined above this class.

class JsDownload(object):

    @check_spider_middleware
    def process_request(self, request, spider):
        # Render the page with PhantomJS and hand the rendered HTML back to Scrapy.
        driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs.exe')
        driver.get(request.url)
        return HtmlResponse(request.url, encoding='utf-8', body=driver.page_source.encode('utf-8'))

2) Add the JsDownload() class to the DOWNLOADER_MIDDLEWARES setting within settings.py (the dotted path below is a placeholder; it must point at your own project, module and class):

DOWNLOADER_MIDDLEWARES = {'MyProj.middleware.MiddleWareModule.MiddleWareClass': 500}

3) Integrate the HtmlResponse within your_spider.py. Decoding the response body will get you the desired output.

from scrapy.contrib.spiders import CrawlSpider

from MyProj.items import CrawlerItem  # CrawlerItem lives in the project's items.py (MyProj is the placeholder project name)


class Spider(CrawlSpider):
    # define unique name of spider
    name = "spider"

    start_urls = ["https://www.url.de"]

    def parse(self, response):
        # initialize items
        item = CrawlerItem()

        # store the rendered, decoded page body on the item
        item["js_enabled"] = response.body.decode("utf-8")
        yield item

Optional Addon:
I wanted the ability to tell different spiders which middleware to use so I implemented this wrapper:

import functools

from scrapy import log


def check_spider_middleware(method):
    @functools.wraps(method)
    def wrapper(self, request, spider):
        msg = '%%s %s middleware step' % (self.__class__.__name__,)
        if self.__class__ in spider.middleware:
            spider.log(msg % 'executing', level=log.DEBUG)
            return method(self, request, spider)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return None

    return wrapper

For the wrapper to work, all spiders must have at minimum:

middleware = set([])

To include a middleware:

middleware = set([MyProj.middleware.ModuleName.ClassName])

Advantage:
The main advantage of implementing it this way rather than in the spider is that you only end up making one request. In A T's solution, for example, the download handler processes the request and then hands the response off to the spider. The spider then makes a brand new request in its parse_page function; that's two requests for the same content.


Answer 4

I was using a custom downloader middleware, but wasn’t very happy with it, as I didn’t manage to make the cache work with it.

A better approach was to implement a custom download handler.

There is a working example here. It looks like this:

# encoding: utf-8
from __future__ import unicode_literals

from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.responsetypes import responsetypes
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from six.moves import queue
from twisted.internet import defer, threads
from twisted.python.failure import Failure


class PhantomJSDownloadHandler(object):

    def __init__(self, settings):
        self.options = settings.get('PHANTOMJS_OPTIONS', {})

        max_run = settings.get('PHANTOMJS_MAXRUN', 10)
        self.sem = defer.DeferredSemaphore(max_run)
        self.queue = queue.LifoQueue(max_run)

        SignalManager(dispatcher.Any).connect(self._close, signal=signals.spider_closed)

    def download_request(self, request, spider):
        """use semaphore to guard a phantomjs pool"""
        return self.sem.run(self._wait_request, request, spider)

    def _wait_request(self, request, spider):
        try:
            driver = self.queue.get_nowait()
        except queue.Empty:
            driver = webdriver.PhantomJS(**self.options)

        driver.get(request.url)
        # ghostdriver won't respond to the window switch until the page has loaded
        dfd = threads.deferToThread(lambda: driver.switch_to.window(driver.current_window_handle))
        dfd.addCallback(self._response, driver, spider)
        return dfd

    def _response(self, _, driver, spider):
        body = driver.execute_script("return document.documentElement.innerHTML")
        if body.startswith("<head></head>"):  # cannot access response header in Selenium
            body = driver.execute_script("return document.documentElement.textContent")
        url = driver.current_url
        respcls = responsetypes.from_args(url=url, body=body[:100].encode('utf8'))
        resp = respcls(url=url, body=body, encoding="utf-8")

        response_failed = getattr(spider, "response_failed", None)
        if response_failed and callable(response_failed) and response_failed(resp, driver):
            driver.close()
            return defer.fail(Failure())
        else:
            self.queue.put(driver)
            return defer.succeed(resp)

    def _close(self):
        while not self.queue.empty():
            driver = self.queue.get_nowait()
            driver.close()

Suppose your scraper is called "scraper". If you put the code above in a file called handlers.py at the root of the "scraper" folder, then you could add this to your settings.py:

DOWNLOAD_HANDLERS = {
    'http': 'scraper.handlers.PhantomJSDownloadHandler',
    'https': 'scraper.handlers.PhantomJSDownloadHandler',
}

And voilà: the JS-parsed DOM, with the scrapy cache, retries, etc.


Answer 5

how can scrapy be used to scrape this dynamic data so that I can use it?

I wonder why no one has posted the solution using Scrapy only.

Check out the blog post from the Scrapy team, SCRAPING INFINITE SCROLLING PAGES. The example scrapes the http://spidyquotes.herokuapp.com/scroll website, which uses infinite scrolling.

The idea is to use your browser's Developer Tools to notice the AJAX requests, then create the requests for Scrapy based on that information:

import json
import scrapy


class SpidyQuotesSpider(scrapy.Spider):
    name = 'spidyquotes'
    quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes?page=%s'
    start_urls = [quotes_base_url % 1]
    download_delay = 1.5

    def parse(self, response):
        data = json.loads(response.body)
        for item in data.get('quotes', []):
            yield {
                'text': item.get('text'),
                'author': item.get('author', {}).get('name'),
                'tags': item.get('tags'),
            }
        if data['has_next']:
            next_page = data['page'] + 1
            yield scrapy.Request(self.quotes_base_url % next_page)

Answer 6

Yes, Scrapy can scrape dynamic websites, i.e. websites that are rendered through JavaScript.

There are two approaches to scraping these kinds of websites.

First,

You can use Splash to render the JavaScript code and then parse the rendered HTML. You can find the docs and the project here: Scrapy splash, git.
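
A rough sketch of this first approach, assuming the scrapy-splash plugin is installed, a Splash instance is running, and the SPLASH_URL and middleware settings from the scrapy-splash README have been added to settings.py (the URL and selector below are placeholders):

import scrapy
from scrapy_splash import SplashRequest  # provided by the scrapy-splash package


class JsRenderedSpider(scrapy.Spider):
    name = 'js_rendered'

    def start_requests(self):
        # 'wait' gives the page time to run its JavaScript before Splash returns the rendered HTML.
        yield SplashRequest('http://www.example.com', self.parse, args={'wait': 1.0})

    def parse(self, response):
        # response.body now holds the rendered HTML, so ordinary selectors work.
        for title in response.xpath('//h1/text()').extract():
            yield {'title': title}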

Second,

As everyone else is saying, by monitoring the network calls you can find the API call that fetches the data, and mocking that call in your scrapy spider may help you get the data you want.


Answer 7

I handle the AJAX requests by using Selenium and the Firefox web driver. It is not that fast if you need the crawler running as a daemon, but much better than any manual solution. I wrote a short tutorial here for reference.
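
A minimal sketch of that approach (the URL and the fixed wait are placeholders; production code would rather wait explicitly for the element it needs):

import time

from selenium import webdriver


def fetch_rendered_html(url, wait=2.0):
    """Load a page in Firefox via Selenium and return the JavaScript-rendered HTML."""
    driver = webdriver.Firefox()       # newer Selenium versions also need geckodriver on the PATH
    try:
        driver.get(url)
        time.sleep(wait)               # crude wait for the AJAX content to finish loading
        return driver.page_source      # the rendered DOM, ready for your usual parser
    finally:
        driver.quit()


html = fetch_rendered_html('http://www.example.com')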


Web scraping with Python [closed]

Question: Web scraping with Python [closed]

I'd like to grab daily sunrise/sunset times from a web site. Is it possible to scrape web content with Python? What modules are used? Is there any tutorial available?


Answer 0

Use urllib2 in combination with the brilliant BeautifulSoup library:

import urllib2
from BeautifulSoup import BeautifulSoup
# or if you're using BeautifulSoup4:
# from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://example.com').read())

for row in soup('table', {'class': 'spad'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string
    # will print date and sunrise

Answer 1

I’d really recommend Scrapy.

Quote from a deleted answer:

  • Scrapy crawling is faster than mechanize because it uses asynchronous operations (on top of Twisted).
  • Scrapy has better and faster support for parsing (x)html, on top of libxml2.
  • Scrapy is a mature framework with full Unicode support; it handles redirections, gzipped responses, odd encodings, has an integrated HTTP cache, and so on.
  • Once you are into Scrapy, you can write a spider in less than 5 minutes that downloads images, creates thumbnails and exports the extracted data directly to CSV or JSON (a minimal sketch is shown below).
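
As a rough illustration (not part of the quoted answer), a minimal Scrapy spider for the sunrise/sunset question might look like this, assuming the same 'spad' table markup used in the other answers:

import scrapy


class SunTimesSpider(scrapy.Spider):
    """Scrapes the day, sunrise and sunset columns from the assumed 'spad' table."""
    name = 'suntimes'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        for row in response.xpath('//table[@class="spad"]/tbody/tr'):
            cells = row.xpath('td/text()').extract()
            if len(cells) >= 3:
                yield {'day': cells[0], 'sunrise': cells[1], 'sunset': cells[2]}

Running it with scrapy runspider suntimes.py -o suntimes.json writes the extracted rows straight to JSON; using -o suntimes.csv gives CSV instead.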

Answer 2

I've collected scripts from my web scraping work into this Bitbucket library.

Example script for your case:

from webscraping import download, xpath
D = download.Download()

html = D.get('http://example.com')
for row in xpath.search(html, '//table[@class="spad"]/tbody/tr'):
    cols = xpath.search(row, '/td')
    print 'Sunrise: %s, Sunset: %s' % (cols[1], cols[2])

Output:

Sunrise: 08:39, Sunset: 16:08
Sunrise: 08:39, Sunset: 16:09
Sunrise: 08:39, Sunset: 16:10
Sunrise: 08:40, Sunset: 16:10
Sunrise: 08:40, Sunset: 16:11
Sunrise: 08:40, Sunset: 16:12
Sunrise: 08:40, Sunset: 16:13

Answer 3

I would strongly suggest checking out pyquery. It uses jQuery-like (aka CSS-like) syntax, which makes things really easy for those coming from that background.

For your case, it would be something like:

from pyquery import PyQuery

html = PyQuery(url='http://www.example.com/')
trs = html('table.spad tbody tr')

for tr in trs:
  tds = tr.getchildren()
  print tds[1].text, tds[2].text

Output:

5:16 AM 9:28 PM
5:15 AM 9:30 PM
5:13 AM 9:31 PM
5:12 AM 9:33 PM
5:11 AM 9:34 PM
5:10 AM 9:35 PM
5:09 AM 9:37 PM

Answer 4

You can use urllib2 to make the HTTP requests, and then you’ll have web content.

You can get it like this:

import urllib2
response = urllib2.urlopen('http://example.com')
html = response.read()

Beautiful Soup is a python HTML parser that is supposed to be good for screen scraping.

In particular, here is their tutorial on parsing an HTML document.
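
Put together, a rough sketch of the two pieces (assuming the same 'spad' table markup used in the other answers; adjust the selector to the real page) might look like this:

import urllib2

from bs4 import BeautifulSoup

# Fetch the page, then hand the HTML to Beautiful Soup for parsing.
html = urllib2.urlopen('http://example.com').read()
soup = BeautifulSoup(html)

# The 'spad' table is an assumption carried over from the other answers.
for row in soup.find('table', {'class': 'spad'}).tbody.find_all('tr'):
    cells = row.find_all('td')
    print cells[0].get_text(), cells[1].get_text()  # date and sunrise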

Good luck!


Answer 5

I use a combination of Scrapemark (finding URLs – py2) and httplib2 (downloading images – py2+3). scrapemark.py has 500 lines of code but relies on regular expressions, so it may not be that fast; I did not test it.

Example for scraping your website:

import sys
from pprint import pprint
from scrapemark import scrape

pprint(scrape("""
    <table class="spad">
        <tbody>
            {*
                <tr>
                    <td>{{[].day}}</td>
                    <td>{{[].sunrise}}</td>
                    <td>{{[].sunset}}</td>
                    {# ... #}
                </tr>
            *}
        </tbody>
    </table>
""", url=sys.argv[1] ))

Usage:

python2 sunscraper.py http://www.example.com/

Result:

[{'day': u'1. Dez 2012', 'sunrise': u'08:18', 'sunset': u'16:10'},
 {'day': u'2. Dez 2012', 'sunrise': u'08:19', 'sunset': u'16:10'},
 {'day': u'3. Dez 2012', 'sunrise': u'08:21', 'sunset': u'16:09'},
 {'day': u'4. Dez 2012', 'sunrise': u'08:22', 'sunset': u'16:09'},
 {'day': u'5. Dez 2012', 'sunrise': u'08:23', 'sunset': u'16:08'},
 {'day': u'6. Dez 2012', 'sunrise': u'08:25', 'sunset': u'16:08'},
 {'day': u'7. Dez 2012', 'sunrise': u'08:26', 'sunset': u'16:07'}]

Answer 6

Make your life easier by using CSS Selectors

I know I have come late to the party, but I have a nice suggestion for you.

Using BeautifulSoup has already been suggested; I would rather prefer using CSS Selectors to scrape the data inside the HTML:

import random
import time

import requests
from bs4 import BeautifulSoup

# Placeholder values: supply your own pool of User-Agent headers and a timeout.
header = [{'User-Agent': 'Mozilla/5.0'}]
timeout_time = 30

# This method scrapes a URL; if the request fails, it waits 20 seconds and tries again.
# (I use it because my internet connection sometimes gets disconnected.)
def tryAgain(passed_url):
    try:
        page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
        return page
    except Exception:
        while 1:
            print("Trying again the URL:")
            print(passed_url)
            try:
                page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
                print("-------------------------------------")
                print("---- URL was successfully scraped ---")
                print("-------------------------------------")
                return page
            except Exception:
                time.sleep(20)
                continue

main_url = "http://www.example.com"

main_page_html = tryAgain(main_url)
main_page_soup = BeautifulSoup(main_page_html)

# Scrape all TDs from TRs inside the table
for tr in main_page_soup.select("table.class_of_table"):
    for td in tr.select("td#id"):
        print(td.text)
        # For anchors inside the TD
        print(td.select("a")[0].text)
        # Value of the href attribute
        print(td.select("a")[0]["href"])

Answer 7

If we want to get the names of items from any specific category, we can do that by specifying the class name of that category using a CSS selector:

import requests ; from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://www.flipkart.com/').text, "lxml")
for link in soup.select('div._2kSfQ4'):
    print(link.text)

These are partial search results:

Puma, USPA, Adidas & moreUp to 70% OffMen's Shoes
Shirts, T-Shirts...Under ₹599For Men
Nike, UCB, Adidas & moreUnder ₹999Men's Sandals, Slippers
Philips & moreStarting ₹99LED Bulbs & Emergency Lights

Answer 8

Here is a simple web crawler. I used BeautifulSoup, and we will search for all the links (anchors) whose class name is _3NFO0d. I used Flipkart.com, which is an online retail store.

import requests
from bs4 import BeautifulSoup
def crawl_flipkart():
    url = 'https://www.flipkart.com/'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "lxml")
    for link in soup.findAll('a', {'class': '_3NFO0d'}):
        href = link.get('href')
        print(href)

crawl_flipkart()

Answer 9

Python has good options for scraping the web. The best one with a framework is scrapy. It can be a little tricky for beginners, so here is a little help.
1. Install Python 3.5 or above (lower versions down to 2.7 will also work).
2. Create an environment in conda (I did this).
3. Install scrapy in a location and run it from there.
4. Scrapy shell will give you an interactive interface to test your code.
5. Scrapy startproject projectname will create a framework.
6. Scrapy genspider spidername will create a spider; you can create as many spiders as you want. While doing this, make sure you are inside the project directory. (A rough sketch of the generated spider template is shown below.)
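
As a rough illustration of step 6 (the exact boilerplate varies between Scrapy versions, so treat this as an approximation rather than the canonical template), scrapy genspider produces a spider skeleton along these lines:

import scrapy


class SpidernameSpider(scrapy.Spider):
    name = 'spidername'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Extraction logic goes here: yield items or follow further links.
        pass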


The easier option is to use requests and Beautiful Soup. Before starting, give an hour of your time to going through the documentation; it will resolve most of your doubts. BS4 offers a wide range of parsers that you can opt for. Use a user-agent and sleep to make scraping easier. BS4 returns a bs.tag, so use variable[0]. If there is JS running, you won't be able to scrape using requests and bs4 directly. You could get the API link and then parse the JSON to get the information you need, or try selenium.
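
A small sketch of the requests + Beautiful Soup approach described above, with a User-Agent header and a polite sleep between requests; the URL, header value and selector are placeholders:

import time

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (compatible; my-scraper)'}  # identify your client

response = requests.get('http://example.com', headers=headers, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

for row in soup.select('table tr'):        # adjust the selector to the page you are scraping
    print(row.get_text(strip=True))

time.sleep(1)  # be gentle between requests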