How to debug a Scrapy project using PyCharm

Question: How to debug a Scrapy project using PyCharm


I am working on Scrapy 0.20 with Python 2.7. I found that PyCharm has a good Python debugger, and I want to use it to test my Scrapy spiders. Does anyone know how to do that?

What I have tried

Actually I tried to run the spider as a script. As a result, I built that script. Then I tried to add my Scrapy project to PyCharm like this:
File->Setting->Project structure->Add content root.

But I don’t know what else I have to do.


Answer 0


The scrapy command is a Python script, which means you can start it from inside PyCharm.

When you examine the scrapy binary (which scrapy) you will notice that it is actually a Python script:

#!/usr/bin/python

from scrapy.cmdline import execute
execute()

This means that a command like scrapy crawl IcecatCrawler can also be executed like this: python /Library/Python/2.7/site-packages/scrapy/cmdline.py crawl IcecatCrawler

Try to find the scrapy.cmdline package. In my case the location was here: /Library/Python/2.7/site-packages/scrapy/cmdline.py
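If you are not sure where the file lives on your machine, a quick way to locate it is a minimal sketch like the one below, run with the same interpreter your project uses:

# Prints the location of scrapy/cmdline.py for the current interpreter
import scrapy.cmdline
print(scrapy.cmdline.__file__)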

Create a run/debug configuration inside PyCharm that uses that file as the script. Fill the script parameters with the scrapy command and spider name, in this case crawl IcecatCrawler.


Put your breakpoints anywhere in your crawling code and it should work™.


Answer 1


You just need to do this.

Create a Python file in the crawler folder of your project. I used main.py.

  • Project
    • Crawler
      • Crawler
        • Spiders
      • main.py
      • scrapy.cfg

Inside your main.py, put the code below.

from scrapy import cmdline

# Replace "spider" with the name of the spider you want to debug
cmdline.execute("scrapy crawl spider".split())

And you need to create a “Run Configuration” in PyCharm to run your main.py. Note that main.py sits next to scrapy.cfg: cmdline.execute() looks for the project settings starting from the working directory, so the run configuration’s working directory should be that folder.

Doing this, if you put a breakpoint in your code, execution will stop there.


Answer 2


As of 2018.1 this became a lot easier. You can now select Module name in your project’s Run/Debug Configuration. Set this to scrapy.cmdline and the Working directory to the root dir of the scrapy project (the one with settings.py in it).


Now you can add breakpoints to debug your code.
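For reference, a run configuration with Module name set to scrapy.cmdline simply runs that module as the program, so the script parameters presumably carry the same arguments you would pass on the command line. A minimal sketch of the equivalent in-process call (myspider is a hypothetical spider name; run it from the project root so the settings are found):

# Rough equivalent of the "Module name: scrapy.cmdline" configuration:
# run the Scrapy CLI entry point with an explicit argument list.
import sys
from scrapy.cmdline import execute

sys.argv = ["scrapy", "crawl", "myspider"]  # hypothetical spider name
execute()  # uses sys.argv when called without arguments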


Answer 3


I am running Scrapy in a virtualenv with Python 3.5.0, and setting the “script” parameter to /path_to_project_env/env/bin/scrapy solved the issue for me.


Answer 4


IntelliJ IDEA also works.

Create main.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy import cmdline


def main(name):
    if name:
        cmdline.execute(name.split())


if __name__ == '__main__':
    print('[*] beginning main thread')
    # Pick the crawl command for the spider you want to debug
    name = "scrapy crawl stack"
    #name = "scrapy crawl spa"
    main(name)
    print('[*] main thread exited')
    print('main stop====================================================')



Answer 5


To add a bit to the accepted answer, after almost an hour I found I had to select the correct Run Configuration from the dropdown list (near the center of the icon toolbar), then click the Debug button in order to get it to work. Hope this helps!


Answer 6


I am also using PyCharm, but I am not using its built-in debugging features.

For debugging I am using ipdb. I set up a keyboard shortcut to insert import ipdb; ipdb.set_trace() on any line I want the break point to happen.

Then I can type n to execute the next statement, s to step into a function, type any object name to see its value, alter the execution environment, and type c to continue execution…

This is very flexible, and it works in environments other than PyCharm where you don’t control the execution environment.

Just run pip install ipdb in your virtual environment and place import ipdb; ipdb.set_trace() on the line where you want execution to pause.
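For example, a breakpoint inside a spider callback might look like this (a minimal sketch; the spider name and start URL are hypothetical):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"                      # hypothetical spider name
    start_urls = ["http://example.com"]   # hypothetical start URL

    def parse(self, response):
        # Execution pauses here as soon as the first response is parsed
        import ipdb; ipdb.set_trace()
        self.log("visited %s" % response.url)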

UPDATE

You can also pip install pdbpp and use the standard import pdb; pdb.set_trace() instead of ipdb. PDB++ is nicer in my opinion.


Answer 7


According to the documentation at https://doc.scrapy.org/en/latest/topics/practices.html, you can run your spider from a script:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
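If you point a PyCharm run/debug configuration at a script like this, the spider runs inside the same process as the script, so breakpoints set in your spider’s callbacks will be hit.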

Answer 8


I use this simple script:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# get_project_settings() loads your project's settings.py,
# so run this from (or set the working directory to) the project root
process = CrawlerProcess(get_project_settings())

process.crawl('your_spider_name')  # the spider's "name" attribute
process.start()

Answer 9


Extending @Rodrigo’s version of the answer, I added this script, and now I can set the spider name from the run configuration instead of changing it in the string.

import sys
from scrapy import cmdline

# The spider name is taken from the first script parameter of the run configuration
cmdline.execute(f"scrapy crawl {sys.argv[1]}".split())
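For example, entering quotes (a hypothetical spider name) in the Parameters field of the run/debug configuration makes this equivalent to running scrapy crawl quotes.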