Question: How to pass a user-defined argument to a Scrapy spider
I am trying to pass a user-defined argument to a Scrapy spider. Can anyone suggest how to do that?
I read about a parameter -a somewhere but have no idea how to use it.
Answer 0
Spider arguments are passed in the crawl command using the -a option. For example:
scrapy crawl myspider -a category=electronics -a domain=system
Spiders can access arguments as attributes:
class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category='', **kwargs):
        self.start_urls = [f'http://www.example.com/{category}']  # py36
        super().__init__(**kwargs)  # python3

    def parse(self, response):
        self.log(self.domain)  # system
Taken from the Scrapy doc: http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments
Update 2013: Add second argument
Update 2015: Adjust wording
Update 2016: Use newer base class and add super, thanks @Birla
Update 2017: Use Python3 super
# previously
super(MySpider, self).__init__(**kwargs) # python2
Update 2018: As @eLRuLL points out, spiders can access arguments as attributes
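The 2018 note works because the base Spider's __init__ copies every keyword argument onto the instance. A minimal sketch of that mechanism, using a hypothetical FakeSpider stand-in rather than the real scrapy.Spider class:

```python
# Hypothetical stand-in for scrapy.Spider: each -a key=value pair from
# the command line arrives as a keyword argument and is set as an attribute.
class FakeSpider:
    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)

# "scrapy crawl myspider -a category=electronics -a domain=system"
# is roughly equivalent to:
spider = FakeSpider(category='electronics', domain='system')
print(spider.category)  # electronics
print(spider.domain)    # system
```

This is why the parse method above can read self.domain even though __init__ never assigns it.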
Answer 1
Previous answers were correct, but you don't have to declare the constructor (__init__) every time you want to code a Scrapy spider; you can just specify the parameters as before:
scrapy crawl myspider -a parameter1=value1 -a parameter2=value2
and in your spider code you can just use them as spider arguments:
class MySpider(Spider):
    name = 'myspider'
    ...

    def parse(self, response):
        ...
        if self.parameter1 == 'value1':
            # this is True

        # or also
        if getattr(self, 'parameter2') == 'value2':
            # this is also True
And it just works.
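One caveat: if an argument was not passed on the command line, plain attribute access raises AttributeError, while getattr with a default does not. A small plain-Python sketch of the difference (FakeSpider here is a hypothetical stand-in, not Scrapy's class):

```python
# Hypothetical stand-in: pretend only parameter1 was passed via -a.
class FakeSpider:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

spider = FakeSpider(parameter1='value1')

print(spider.parameter1)                          # value1
print(getattr(spider, 'parameter2', 'fallback'))  # fallback
print(hasattr(spider, 'parameter2'))              # False
```

So prefer getattr with a default whenever an argument is optional.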
Answer 2
To pass arguments with the crawl command:
scrapy crawl myspider -a category='mycategory' -a domain='example.com'
To pass arguments to run on scrapyd, replace -a with -d:
curl http://your.ip.address.here:port/schedule.json -d spider=myspider -d category='mycategory' -d domain='example.com'
The spider will receive arguments in its constructor.
class MySpider(Spider):
    name = "myspider"

    def __init__(self, category='', domain='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.category = category
        self.domain = domain
Scrapy sets all the arguments as spider attributes, so you can skip the __init__ method completely. Be careful to use the getattr method for reading those attributes so your code does not break.
class MySpider(Spider):
    name = "myspider"
    start_urls = ('https://httpbin.org/ip',)

    def parse(self, response):
        print(getattr(self, 'category', ''))
        print(getattr(self, 'domain', ''))
Answer 3
Spider arguments are passed while running the crawl command using the -a option. For example, if I want to pass a domain name as an argument to my spider, I will do this:
scrapy crawl myspider -a domain="http://www.example.com"
And receive the argument in the spider's constructor:
class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, domain='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [domain]
        # …
It will work. :)
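The constructor wiring can be checked without running a crawl by instantiating the spider directly; here a minimal hypothetical base class stands in for BaseSpider:

```python
# Hypothetical minimal base class standing in for BaseSpider.
class Base:
    def __init__(self, *args, **kwargs):
        pass

class MySpider(Base):
    name = 'myspider'

    def __init__(self, domain='', *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [domain]

# Mirrors: scrapy crawl myspider -a domain="http://www.example.com"
spider = MySpider(domain='http://www.example.com')
print(spider.start_urls)  # ['http://www.example.com']
```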
Answer 4
Alternatively, we can use ScrapyD, which exposes an API through which we can pass the start_urls and spider name. ScrapyD has APIs to stop/start/check the status of/list the spiders.
pip install scrapyd scrapyd-deploy
scrapyd
scrapyd-deploy local -p default
scrapyd-deploy
will deploy the spider in the form of an egg into the daemon, and it even maintains versions of the spider. While starting the spider, you can mention which version of the spider to use.
class MySpider(CrawlSpider):
    name = 'testspider'

    def __init__(self, start_urls, *args, **kwargs):
        self.start_urls = start_urls.split('|')
        super().__init__(*args, **kwargs)
curl http://localhost:6800/schedule.json -d project=default -d spider=testspider -d start_urls="https://www.anyurl...|https://www.anyurl2"
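Since schedule.json delivers start_urls as a single string, the spider's constructor splits it on the pipe character. The split step in isolation:

```python
# The -d start_urls value arrives as one pipe-separated string;
# split('|') recovers the individual URLs.
raw = "https://www.anyurl...|https://www.anyurl2"
start_urls = raw.split('|')
print(start_urls)  # ['https://www.anyurl...', 'https://www.anyurl2']
```

Any separator works as long as it cannot appear inside a URL; '|' is a common choice for exactly that reason.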
An added advantage is that you can build your own UI to accept the URL and other parameters from the user, and schedule a task using the above ScrapyD schedule API.
Refer to the ScrapyD API documentation for more details.