pyspider: A Powerful Python Web Crawler System

July 15, 2021 · Python实用宝典

A powerful spider (web crawler) system written in Python.

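Write scripts in Python
Powerful WebUI with a script editor, task monitor, project manager, and result viewer
MySQL, MongoDB, Redis, SQLite, Elasticsearch, and PostgreSQL (via SQLAlchemy) as database backends
RabbitMQ, Redis, and Kombu as message queues
Task priority, retry, periodic scheduling, recrawl by age, and more
Distributed architecture, crawling of JavaScript pages, support for Python 2.{6,7} and 3.{3,4,5,6}, and more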
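
The scheduling features in the list above (task priority, retry, recrawl by age) can be pictured with a small standard-library sketch. This is a conceptual toy, not pyspider's scheduler: the class `ToyScheduler`, its methods, and the `flaky_fetch` function are all hypothetical names invented for illustration.

```python
import heapq
import itertools


class ToyScheduler:
    """Conceptual sketch only; this is NOT pyspider's scheduler."""

    def __init__(self, age_seconds, max_retries=3):
        self.age_seconds = age_seconds   # how long a fetched page counts as fresh
        self.max_retries = max_retries   # retries allowed after the first failure
        self._heap = []                  # entries: (-priority, seq, url, retries_used)
        self._seq = itertools.count()    # tie-breaker: FIFO within one priority
        self._last_fetched = {}          # url -> time of last successful fetch

    def submit(self, url, now, priority=0, retries_used=0):
        """Queue a URL unless it was fetched less than age_seconds ago."""
        last = self._last_fetched.get(url)
        if last is not None and now - last < self.age_seconds:
            return False
        heapq.heappush(self._heap, (-priority, next(self._seq), url, retries_used))
        return True

    def run(self, fetch, now):
        """Drain the queue, highest priority first, retrying failed fetches."""
        done = []
        while self._heap:
            neg_priority, _, url, retries_used = heapq.heappop(self._heap)
            if fetch(url):
                self._last_fetched[url] = now
                done.append(url)
            elif retries_used < self.max_retries:
                self.submit(url, now, priority=-neg_priority,
                            retries_used=retries_used + 1)
        return done


attempts = {}


def flaky_fetch(url):
    # Hypothetical fetcher: "b" fails on its first attempt, everything else succeeds.
    attempts[url] = attempts.get(url, 0) + 1
    return not (url == "b" and attempts[url] == 1)


scheduler = ToyScheduler(age_seconds=10 * 24 * 60 * 60)
scheduler.submit("a", now=0, priority=1)
scheduler.submit("b", now=0, priority=5)
first_pass = scheduler.run(flaky_fetch, now=0)    # "b" goes first and is retried once
resubmitted = scheduler.submit("a", now=60)       # still fresh, so it is skipped
print(first_pass, resubmitted)
```

In pyspider itself these behaviors are configured per task (for example the `age` setting used in the sample code) rather than implemented by hand like this.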

Tutorial: http://docs.pyspider.org/en/latest/tutorial/
Documentation: http://docs.pyspider.org/
Release notes: https://github.com/binux/pyspider/releases

Sample Code

from pyspider.libs.base_handler import *


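class Handler(BaseHandler):
    # Project-wide defaults applied to every self.crawl() call (empty here).
    crawl_config = {
    }

    # Entry point: the scheduler runs on_start once every 24 hours.
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    # A fetched page is treated as fresh for 10 days and is not recrawled before then.
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow every link whose href starts with "http".
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # The returned dict is stored as the result for this page.
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }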
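
To make the link filter in `index_page` concrete: the selector `a[href^="http"]` matches anchor tags whose `href` begins with `http`, so relative links and anchors without an `href` are skipped. The sketch below reproduces that filter with only the Python standard library. It is not how pyspider does it (there, `response.doc` is a PyQuery object); the class name `AbsoluteLinkCollector` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class AbsoluteLinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with "http"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Same idea as the CSS selector a[href^="http"]: prefix match on href.
        if href and href.startswith("http"):
            self.links.append(href)


# Hypothetical page content, used only to exercise the filter.
sample_html = """
<a href="http://example.com/docs">Docs</a>
<a href="/about">About</a>
<a href="https://example.com/blog">Blog</a>
<a name="top">No href</a>
"""

collector = AbsoluteLinkCollector()
collector.feed(sample_html)
links = collector.links
print(links)
```

Only the two absolute links are collected; `/about` and the anchor without an `href` are ignored.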

Installation

pip install pyspider
Run the command pyspider, then visit http://localhost:5000/

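WARNING: By default the WebUI is open to the public, and it can be used to execute any command, which may harm your system. Use it only on an internal network, or enable need-auth for the WebUI.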
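
One possible way to act on this warning is to give pyspider a JSON config file whose webui section sets a username, a password, and need-auth. The sketch below only generates such a file's contents; the helper `build_webui_config` and the credentials are placeholders, and the option names should be checked against the documentation for your installed version.

```python
import json


def build_webui_config(username, password):
    """Return a config dict that asks the WebUI to require a login.

    Hypothetical helper for illustration; the key names mirror the
    webui command-line options (username, password, need-auth).
    """
    return {
        "webui": {
            "username": username,
            "password": password,
            "need-auth": True,
        }
    }


# Placeholder credentials: replace them before real use.
config_text = json.dumps(build_webui_config("admin", "change-me"), indent=2)
print(config_text)

# To use it, save the printed text to a file (for example config.json)
# and start pyspider with:  pyspider -c config.json
```

Password protection does not replace network isolation; keeping the WebUI off public interfaces is still the safer default.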

Quickstart: http://docs.pyspider.org/en/latest/Quickstart/

Contribute

Use it
Open an issue or send a pull request
User Group
Chinese Q&A (中文问答)

TODO

v0.4.0

A visual scraping interface like portia

License

Licensed under the Apache License, Version 2.0
