Difference between BeautifulSoup and Scrapy crawler?

Question: Difference between BeautifulSoup and Scrapy crawler?

I want to make a website that shows a comparison between Amazon and eBay product prices. Which of these will work better, and why? I am somewhat familiar with BeautifulSoup but not so much with the Scrapy crawler.


Answer 0

Scrapy is a web spider or web scraper framework. You give Scrapy a root URL to start crawling, then you can specify constraints on how many URLs you want to crawl and fetch, etc. It is a complete framework for web scraping or crawling.

While

BeautifulSoup is a parsing library which also does a pretty good job of fetching contents from a URL and allows you to parse certain parts of them without any hassle. It only fetches the contents of the URL that you give it and then stops. It does not crawl unless you manually put it inside an infinite loop with certain criteria.

In simple words, with Beautiful Soup you can build something similar to Scrapy. Beautiful Soup is a library while Scrapy is a complete framework.

Source
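
To make the contrast concrete, here is a hedged sketch of both styles (example.com is a placeholder URL and the selectors are illustrative; assumes requests and beautifulsoup4 are installed):

# Library style: you drive the fetching and any looping yourself.
import requests
from bs4 import BeautifulSoup

html = requests.get("http://example.com").text
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)  # parses what you ask for, then stops

And the same starting point as a Scrapy spider (assumes a reasonably recent Scrapy):

# Framework style: Scrapy drives the crawl from a root URL.
# Run with: scrapy runspider yourfile.py
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}
        # Follow links so the crawl continues; Scrapy schedules them.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, self.parse)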


Answer 1

I think both are good… I'm doing a project right now that uses both. First I scrape all the pages using Scrapy and save them to a MongoDB collection using its pipelines, also downloading the images that exist on each page. After that I use BeautifulSoup4 for post-processing, where I must change attribute values and get some special tags.

If you don't know which product pages you want, a good tool will be Scrapy, since you can use its crawlers to run over the whole Amazon/eBay website looking for products without writing an explicit for loop.

Take a look at the scrapy documentation, it’s very simple to use.


Answer 2

Both are used to parse data.

Scrapy:

  • Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.
  • But it has some limitations when data comes from JavaScript or is loaded dynamically; we can overcome that by using packages like Splash, Selenium, etc.

BeautifulSoup:

  • Beautiful Soup is a Python library for pulling data out of HTML and XML files.

  • We can use this package to get data from JavaScript or dynamically loaded pages.

Scrapy with BeautifulSoup is one of the best combos we can work with for scraping static and dynamic contents.
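
A hedged sketch of that combination (the div.product selector and the name field are made up for illustration; example.com is a placeholder):

import scrapy
from bs4 import BeautifulSoup

class ComboSpider(scrapy.Spider):
    name = "combo"
    start_urls = ["http://example.com"]

    def parse(self, response):
        # Scrapy handles the crawling and scheduling; BeautifulSoup parses.
        soup = BeautifulSoup(response.text, "html.parser")
        for node in soup.select("div.product"):
            yield {"name": node.get_text(strip=True)}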


Answer 3

The way I do it is to use the eBay/Amazon APIs rather than scrapy, and then parse the results using BeautifulSoup.

The APIs give you an official way of getting the same data that you would have got from a Scrapy crawler, with no need to worry about hiding your identity, messing about with proxies, etc.


Answer 4

Scrapy It is a web scraping framework which comes with tons of goodies that make scraping easier, so that we can focus on the crawling logic only. Some of my favourite things Scrapy takes care of for us are below.

  • Feed exports: It basically allows us to save data in various formats like CSV, JSON, JSON lines and XML.
  • Asynchronous scraping: Scrapy uses the Twisted framework, which gives us the power to visit multiple URLs at once, where each request is processed in a non-blocking way (basically, we don't have to wait for one request to finish before sending another).
  • Selectors: This is where we can compare Scrapy with Beautiful Soup. Selectors are what allow us to select particular data from the webpage (like a heading, or a certain div with a class name, etc.). Scrapy uses lxml for parsing, which is much faster than Beautiful Soup.
  • Setting proxies, user agents, headers, etc.: Scrapy allows us to set and rotate proxies and other headers dynamically.

  • Item Pipelines: Pipelines enable us to process data after extraction; see the sketch after this list. For example, we can configure a pipeline to push data to your MySQL server.

  • Cookies: Scrapy automatically handles cookies for us.

etc.
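
As a hedged illustration of the Item Pipelines point above (PriceCleanupPipeline and the price field are hypothetical names; the settings line follows Scrapy's documented ITEM_PIPELINES convention):

# In settings.py (assumed project layout):
# ITEM_PIPELINES = {"myproject.pipelines.PriceCleanupPipeline": 300}

class PriceCleanupPipeline:
    """Hypothetical pipeline that normalises a 'price' field."""

    def process_item(self, item, spider):
        # Called once for every item the spider yields.
        item["price"] = float(str(item["price"]).lstrip("$"))
        return item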

TLDR: Scrapy is a framework that provides everything one might need to build large-scale crawlers. It provides various features that hide the complexity of crawling the web. One can simply start writing web crawlers without worrying about the setup burden.

Beautiful Soup Beautiful Soup is a Python package for parsing HTML and XML documents. So with Beautiful Soup you can parse a webpage that has already been downloaded. BS4 is very popular and old. Unlike Scrapy, you cannot use Beautiful Soup alone to make crawlers. You will need other libraries like requests, urllib, etc. to make crawlers with BS4. Again, this means you would need to manage the list of URLs being crawled and to be crawled, handle cookies, manage proxies, handle errors, and create your own functions to push data to CSV, JSON, XML, etc. If you want to speed things up, then you will have to use other libraries like multiprocessing.

To sum up.

  • Scrapy is a rich framework that you can use to start writing crawlers without any hassle.

  • Beautiful Soup is a library that you can use to parse a webpage. It cannot be used alone to scrape the web.

You should definitely use Scrapy for your Amazon and eBay product price comparison website. You could build a database of URLs, run the crawler every day (cron jobs, or Celery for scheduling crawls), and update the prices in your database. This way your website will always pull from the database, and the crawler and database will act as separate components.


Answer 5

BeautifulSoup is a library that lets you extract information from a web page.

Scrapy on the other hand is a framework, which does the above and many more things you probably need in your scraping project, like pipelines for saving data.

You can check this blog to get started with Scrapy https://www.inkoop.io/blog/web-scraping-using-python-and-scrapy/


Answer 6

Using Scrapy you can save tons of code and start with structured programming. If you don't like any of Scrapy's pre-written methods, then BeautifulSoup can be used in place of a Scrapy method. A big project takes advantage of both.


Answer 7

The differences are many and selection of any tool/technology depends on individual needs.

Few major differences are:

  1. BeautifulSoup is comparatively easier to learn than Scrapy.
  2. The extensions, support, and community are larger for Scrapy than for BeautifulSoup.
  3. Scrapy should be considered a Spider while BeautifulSoup is a Parser.

如何使用python / matplotlib为3D图设置“相机位置”?

问题:如何使用python / matplotlib为3D图设置“相机位置”?

我正在学习如何使用mplot3d生成漂亮的3d数据图,到目前为止我还很高兴。我现在想做的是旋转表面的动画效果。为此,我需要为3D投影设置相机位置。我猜这一定是可能的,因为在交互使用matplotlib时,可以使用鼠标旋转表面。但是如何从脚本执行此操作?我在mpl_toolkits.mplot3d.proj3d中发现了很多转换,但是我找不到如何使用这些转换的目的,也没有找到任何尝试的示例。

I’m learning how to use mplot3d to produce nice plots of 3d data and I’m pretty happy so far. What I am trying to do at the moment is a little animation of a rotating surface. For that purpose, I need to set a camera position for the 3D projection. I guess this must be possible since a surface can be rotated using the mouse when using matplotlib interactively. But how can I do this from a script? I found a lot of transforms in mpl_toolkits.mplot3d.proj3d but I could not find out how to use these for my purpose and I didn’t find any example for what I’m trying to do.


Answer 0

By “camera position,” it sounds like you want to adjust the elevation and the azimuth angle that you use to view the 3D plot. You can set this with ax.view_init. I've used the below script to first create the plot, then I determined a good elevation, or elev, from which to view my plot. I then adjusted the azimuth angle, or azim, to vary the full 360 degrees around my plot, saving the figure at each instance (and noting the azimuth angle as I saved each plot). For a more complicated camera pan, you can adjust both the elevation and the angle to achieve the desired effect.

    from mpl_toolkits.mplot3d import Axes3D
    import matplotlib.pyplot as plt

    fig = plt.figure()
    ax = Axes3D(fig)
    # xx, yy, zz are your data arrays
    ax.scatter(xx, yy, zz, marker='o', s=20, c="goldenrod", alpha=0.6)
    for ii in range(0, 360, 1):  # xrange in the original is Python 2 only
        ax.view_init(elev=10., azim=ii)
        plt.savefig("movie%d.png" % ii)
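
If you want a single animation file instead of hundreds of PNG frames, matplotlib's animation module can drive view_init the same way. A minimal sketch, assuming a recent matplotlib and Pillow installed for the GIF writer (the data is random placeholder data):

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib import animation

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    xx, yy, zz = np.random.rand(3, 100)  # placeholder data
    ax.scatter(xx, yy, zz, marker="o", s=20, c="goldenrod", alpha=0.6)

    def update(angle):
        # one degree of azimuth per frame, at a fixed elevation
        ax.view_init(elev=10.0, azim=angle)
        return ()

    anim = animation.FuncAnimation(fig, update, frames=360, interval=50)
    anim.save("rotation.gif", writer="pillow")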

Answer 1

What would be handy would be to apply the camera position to a new plot. So I plot, then move the plot around with the mouse, changing the distance. Then I try to replicate the view, including the distance, on another plot. I find that ax1.get_axes() gets me an object with the old .azim and .elev.

IN PYTHON…

axx=ax1.get_axes()
azm=axx.azim
ele=axx.elev
dst=axx.dist       # ALWAYS GIVES 10
#dst=ax1.axes.dist # ALWAYS GIVES 10
#dst=ax1.dist      # ALWAYS GIVES 10

Later 3d graph…

ax2.view_init(elev=ele, azim=azm) #Works!
ax2.dist=dst                       # works but always 10 from axx

EDIT 1… OK, camera position is the wrong way of thinking concerning the .dist value. It rides on top of everything as a kind of hacky scalar multiplier for the whole graph.

This works for the magnification/zoom of the view:

xlm=ax1.get_xlim3d() #These are two tuples
ylm=ax1.get_ylim3d() #we use them in the next
zlm=ax1.get_zlim3d() #graph to reproduce the magnification from mousing
axx=ax1.get_axes()
azm=axx.azim
ele=axx.elev

Later Graph…

ax2.view_init(elev=ele, azim=azm) #Reproduce view
ax2.set_xlim3d(xlm[0],xlm[1])     #Reproduce magnification
ax2.set_ylim3d(ylm[0],ylm[1])     #...
ax2.set_zlim3d(zlm[0],zlm[1])     #...

Common use-cases for pickle in Python

Question: Common use-cases for pickle in Python

I’ve looked at the pickle documentation, but I don’t understand where pickle is useful.

What are some common use-cases for pickle?


Answer 0

Some uses that I have come across:

1) saving a program’s state data to disk so that it can carry on where it left off when restarted (persistence)

2) sending python data over a TCP connection in a multi-core or distributed system (marshalling)

3) storing python objects in a database

4) converting an arbitrary python object to a string so that it can be used as a dictionary key (e.g. for caching & memoization).

There are some issues with the last one – two identical objects can be pickled and result in different strings – or even the same object pickled twice can have different representations. This is because the pickle can include reference count information.

To emphasise @lunaryorn’s comment – you should never unpickle a string from an untrusted source, since a carefully crafted pickle could execute arbitrary code on your system. For example see https://blog.nelhage.com/2011/03/exploiting-pickle/
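
As a minimal, hedged sketch of use case 4 (memoization), with the caveat above that pickling is not a reliable identity test:

import pickle

_cache = {}

def memoized_call(func, *args):
    # Use the pickled arguments as a dictionary key (use case 4).
    # Caveat: identical objects can pickle to different strings,
    # so treat this as best-effort caching, not strict identity.
    key = (func.__name__, pickle.dumps(args))
    if key not in _cache:
        _cache[key] = func(*args)
    return _cache[key]

print(memoized_call(sum, [1, 2, 3]))  # 6, computed once and then cached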


Answer 1

Minimal roundtrip example:

>>> import pickle
>>> class Anon(object): pass  # a throwaway class so the example runs
...
>>> a = Anon()
>>> a.foo = 'bar'
>>> pickled = pickle.dumps(a)
>>> unpickled = pickle.loads(pickled)
>>> unpickled.foo
'bar'

Edit: but as for the question of real-world examples of pickling, perhaps the most advanced use of pickling (you’d have to dig quite deep into the source) is ZODB: http://svn.zope.org/

Otherwise, PyPI mentions several: http://pypi.python.org/pypi?:action=search&term=pickle&submit=search

I have personally seen several examples of pickled objects being sent over the network as an easy to use network transfer protocol.


Answer 2

Pickling is absolutely necessary for distributed and parallel computing.

Say you wanted to do a parallel map-reduce with multiprocessing (or across cluster nodes with pyina), then you need to make sure the function you want to have mapped across the parallel resources will pickle. If it doesn’t pickle, you can’t send it to the other resources on another process, computer, etc. Also see here for a good example.

To do this, I use dill, which can serialize almost anything in python. Dill also has some good tools for helping you understand what is causing your pickling to fail when your code fails.

And, yes, people use pickling to save the state of a calculation, or your IPython session, or whatever.
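
For instance, a plain lambda has no importable name, so the standard pickle module refuses it, while dill handles it. A minimal sketch, assuming dill is installed:

import dill  # pip install dill

square = lambda x: x * x       # pickle.dumps(square) would raise PicklingError
payload = dill.dumps(square)   # dill serializes the function itself
restored = dill.loads(payload)
print(restored(4))             # 16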


Answer 3

I have used it in one of my projects. If the app was terminated during its work (it did a lengthy task and processed lots of data), I needed to save the whole data structure and reload it after the app was run again. I used cPickle for this, as speed was crucial and the size of the data was really big.


Answer 4

Pickle is like “Save As..” and “Open..” for your data structures and classes. Let’s say I want to save my data structures so that it is persistent between program runs.

Saving:

import pickle
from collections import defaultdict  # used in the loading snippet below

with open("save.p", "wb") as f:
    pickle.dump(myStuff, f)

Loading:

try:
    with open("save.p", "rb") as f:
        myStuff = pickle.load(f)
except (FileNotFoundError, pickle.UnpicklingError):  # avoid a bare except
    myStuff = defaultdict(dict)

Now I don’t have to build myStuff from scratch all over again, and I can just pick(le) up from where I left off.


Answer 5

For a beginner (as was the case with me) it's really hard to understand why to use pickle in the first place when reading the official documentation. Maybe that's because the docs imply that you already know the whole purpose of serialization. Only after reading a general description of serialization did I understand the reason for this module and its common use cases. Broad explanations of serialization that disregard any particular programming language may also help: https://stackoverflow.com/a/14482962/4383472, What is serialization?, https://stackoverflow.com/a/3984483/4383472


Answer 6

To add a real-world example: The Sphinx documentation tool for Python uses pickle to cache parsed documents and cross-references between documents, to speed up subsequent builds of the documentation.


Answer 7

I can tell you the uses I use it for and have seen it used for:

  • Game profile saves
  • Game data saves like lives and health
  • Previous records of, say, numbers input to a program

Those are the ones I use it for, at least.


Answer 8

I used pickling while web scraping one website where I wanted to store more than 8000k URLs and process them as fast as possible, so I used pickling because its output quality is very high.

You can easily get back to each URL, and to where you stopped, even using the job directory keyword, and fetch the URL details very quickly to resume the process.


How can I safely save my secret keys and passwords in my version control system?

Question: How can I safely save my secret keys and passwords in my version control system?

I keep important settings like the hostnames and ports of development and production servers in my version control system. But I know that it’s bad practice to keep secrets (like private keys and database passwords) in a VCS repository.

But passwords–like any other setting–seem like they should be versioned. So what is the proper way to keep passwords version controlled?

I imagine it would involve keeping the secrets in their own “secrets settings” file and having that file encrypted and version controlled. But what technologies? And how to do this properly? Is there a better way entirely to go about it?


I ask the question generally, but in my specific instance I would like to store secret keys and passwords for a Django/Python site using git and github.

Also, an ideal solution would do something magical when I push/pull with git–e.g., if the encrypted passwords file changes a script is run which asks for a password and decrypts it into place.


EDIT: For clarity, I am asking about where to store production secrets.

Answer 0

You’re exactly right to want to encrypt your sensitive settings file while still maintaining the file in version control. As you mention, the best solution would be one in which Git will transparently encrypt certain sensitive files when you push them so that locally (i.e. on any machine which has your certificate) you can use the settings file, but Git or Dropbox or whoever is storing your files under VC does not have the ability to read the information in plaintext.

Tutorial on Transparent Encryption/Decryption during Push/Pull

This gist https://gist.github.com/873637 shows a tutorial on how to use Git's smudge/clean filter driver with openssl to transparently encrypt pushed files. You just need to do some initial setup.

Summary of How it Works

You’ll basically be creating a .gitencrypt folder containing 3 bash scripts,

clean_filter_openssl 
smudge_filter_openssl 
diff_filter_openssl 

which are used by Git for decryption, encryption, and supporting Git diff. A master passphrase and salt (fixed!) are defined inside these scripts, and you MUST ensure that .gitencrypt is never actually pushed. Example clean_filter_openssl script:

#!/bin/bash

SALT_FIXED=<your-salt> # 24 or less hex characters
PASS_FIXED=<your-passphrase>

openssl enc -base64 -aes-256-ecb -S $SALT_FIXED -k $PASS_FIXED

Similar for smudge_filter_open_ssl and diff_filter_oepnssl. See Gist.

Your repo with sensitive information should have a .gitattributes file (unencrypted and included in the repo) which references the .gitencrypt directory (which contains everything Git needs to encrypt/decrypt the project transparently) and which is present on your local machine.

.gitattributes contents:

* filter=openssl diff=openssl
[merge]
    renormalize = true

Finally, you will also need to add the following content to your .git/config file

[filter "openssl"]
    smudge = ~/.gitencrypt/smudge_filter_openssl
    clean = ~/.gitencrypt/clean_filter_openssl
[diff "openssl"]
    textconv = ~/.gitencrypt/diff_filter_openssl

Now, when you push the repository containing your sensitive information to a remote repository, the files will be transparently encrypted. When you pull from a local machine which has the .gitencrypt directory (containing your passphrase), the files will be transparently decrypted.

Notes

I should note that this tutorial does not describe a way to only encrypt your sensitive settings file. This will transparently encrypt the entire repository that is pushed to the remote VC host and decrypt the entire repository so it is entirely decrypted locally. To achieve the behavior you want, you could place sensitive files for one or many projects in one sensitive_settings_repo. You could investigate how this transparent encryption technique works with Git submodules http://git-scm.com/book/en/Git-Tools-Submodules if you really need the sensitive files to be in the same repository.

The use of a fixed passphrase could theoretically lead to brute-force vulnerabilities if attackers had access to many encrypted repos/files. IMO, the probability of this is very low. As a note at the bottom of this tutorial mentions, not using a fixed passphrase will result in local versions of a repo on different machines always showing that changes have occurred with ‘git status’.


Answer 1

Heroku pushes the use of environment variables for settings and secret keys:

The traditional approach for handling such config vars is to put them under source – in a properties file of some sort. This is an error-prone process, and is especially complicated for open source apps which often have to maintain separate (and private) branches with app-specific configurations.

A better solution is to use environment variables, and keep the keys out of the code. On a traditional host or working locally you can set environment vars in your bashrc. On Heroku, you use config vars.

With Foreman and .env files Heroku provide an enviable toolchain to export, import and synchronise environment variables.


Personally, I believe it's wrong to save secret keys alongside code. It's fundamentally inconsistent with source control, because the keys are for services extrinsic to the code. The one boon would be that a developer can clone HEAD and run the application without any setup. However, suppose a developer checks out a historic revision of the code. Their copy will include last year's database password, so the application will fail against today's database.

With the Heroku method above, a developer can check out last year's app, configure it with today's keys, and run it successfully against today's database.
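
A minimal sketch of reading such config vars in Python (DATABASE_URL and SECRET_KEY are hypothetical variable names you would set in your shell, .bashrc, or Heroku config vars):

import os

DATABASE_URL = os.environ["DATABASE_URL"]         # fail loudly if unset
SECRET_KEY = os.environ.get("SECRET_KEY", "dev")  # or fall back for development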


Answer 2

The cleanest way in my opinion is to use environment variables. You won’t have to deal with .dist files for example, and the project state on the production environment would be the same as your local machine’s.

I recommend reading The Twelve-Factor App‘s config chapter, the others too if you’re interested.


Answer 3

An option would be to put project-bound credentials into an encrypted container (TrueCrypt or KeePass) and push it.

Update as answer from my comment below:

Interesting question btw. I just found this: github.com/shadowhand/git-encrypt which looks very promising for automatic encryption


Answer 4

I suggest using configuration files for that, and not versioning them.

You can however version examples of the files.

I don’t see any problem of sharing development settings. By definition it should contain no valuable data.


Answer 5

BlackBox was recently released by StackExchange and while I have yet to use it, it seems to exactly address the problems and support the features requested in this question.

From the description on https://github.com/StackExchange/blackbox:

Safely store secrets in a VCS repo (i.e. Git or Mercurial). These commands make it easy for you to GPG-encrypt specific files in a repo so they are "encrypted at rest" in your repository. However, the scripts make it easy to decrypt them when you need to view or edit them, and to decrypt them for use in production.


Answer 6

Since asking this question I have settled on a solution, which I use when developing small applications with a small team of people.

git-crypt

git-crypt uses GPG to transparently encrypt files when their names match certain patterns. For instance, if you add to your .gitattributes file…

*.secret.* filter=git-crypt diff=git-crypt

…then a file like config.secret.json will always be pushed to remote repos with encryption, but remain unencrypted on your local file system.

If you want to add a new GPG key (a person) to your repo that can decrypt the protected files, run git-crypt add-gpg-user <gpg_user_key>. This creates a new commit. The new user will be able to decrypt subsequent commits.


Answer 7

I ask the question generally, but in my specific instance I would like to store secret keys and passwords for a Django/Python site using git and github.

No, just don’t, even if it’s your private repo and you never intend to share it, don’t.

You should create a local_settings.py, put it on the VCS ignore list, and in your settings.py do something like:

from local_settings import DATABASES, SECRET_KEY
# The import itself binds DATABASES and SECRET_KEY;
# no re-assignment is needed after it.

If your secret settings are that changeable, I am inclined to say you're doing something wrong.


Answer 8

EDIT: I assume you want to keep track of your previous password versions, say for a script that would prevent password reuse, etc.

I think GnuPG is the best way to go; it's already used in one git-related project (git-annex) to encrypt repository contents stored on cloud services. GnuPG (GNU PGP) provides very strong key-based encryption.

  1. You keep a key on your local machine.
  2. You add ‘mypassword’ to ignored files.
  3. On pre-commit hook you encrypt the mypassword file into the mypassword.gpg file tracked by git and add it to the commit.
  4. On post-merge hook you just decrypt mypassword.gpg into mypassword.

Now, if your 'mypassword' file did not change, then encrypting it will result in the same ciphertext and it won't be added to the index (no redundancy). The slightest modification of mypassword results in a radically different ciphertext, and mypassword.gpg in the staging area differs a lot from the one in the repository, and thus will be added to the commit. Even if the attacker gets hold of your gpg key, he still needs to brute-force the password. If the attacker gets access to the remote repository with the ciphertext, he can compare a bunch of ciphertexts, but their number won't be sufficient to give him any non-negligible advantage.

Later on you can use .gitattributes to provide on-the-fly decryption for quick git diff of your password.

Also, you can have separate keys for different types of passwords, etc.


Answer 9

Usually, I separate passwords out into a config file, and make a .dist version of it:

/yourapp
    main.py
    default.cfg.dist

And when I run main.py, I put the real passwords into a copy named default.cfg.

P.S. When you work with git or hg, you can ignore *.cfg files by adding them to .gitignore or .hgignore.


Answer 10

Provide a way to override the config

This is the best way to manage a set of sane defaults for the config you check in, without requiring the config to be complete or to contain things like hostnames and credentials. There are a few ways to override default configs.

Environment variables (as others have already mentioned) are one way of doing it.

The best way is to look for an external config file that overrides the default config values. This allows you to manage the external configs via a configuration management system like Chef, Puppet or Cfengine. Configuration management is the standard answer for the management of configs separate from the codebase so you don’t have to do a release to update the config on a single host or a group of hosts.

FYI: Encrypting creds is not always a best practice, especially in a place with limited resources. It may be the case that encrypting creds will gain you no additional risk mitigation and simply add an unnecessary layer of complexity. Make sure you do the proper analysis before making a decision.


Answer 11

Encrypt the passwords file, using for example GPG. Add the keys on your local machine and on your server. Decrypt the file and put it outside your repo folders.

I use a passwords.conf located in my home folder. On every deploy this file gets updated.


Answer 12

No, private keys and passwords do not fall under revision control. There is no reason to burden everyone with read access to your repository with knowledge of sensitive service credentials used in production, when most likely not all of them should have access to those services.

Starting with Django 1.4, your Django projects ship with a project.wsgi module that defines the application object, and it's a perfect place to start enforcing the use of a project.local settings module that contains site-specific configuration.

This settings module is ignored from revision control, but its presence is required when running your project instance as a WSGI application, as is typical in production environments. This is how it should look:

import os

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "project.local")

# This application object is used by the development server
# as well as any WSGI server configured to use this file.
from django.core.wsgi import get_wsgi_application
application = get_wsgi_application()

Now you can have a local.py module whose owner and group can be configured so that only authorized personnel and the Django processes can read the file's contents.


Answer 13

If you need VCS for your secrets you should at least keep them in a second repository, separated from your actual code. That way you can give your team members access to the source code repository and they won't see your credentials. Furthermore, host this repository somewhere else (e.g. on your own server with an encrypted filesystem, not on GitHub), and for checking it out to the production system you could use something like git-submodule.


Answer 14

Another approach could be to completely avoid saving secrets in version control systems and instead use a tool like Vault from HashiCorp, a secret storage with key rolling and auditing, with an API and embedded encryption.


Answer 15

This is what I do:

  • Keep all secrets as env vars in $HOME/.secrets (go-r perms) that $HOME/.bashrc sources (this way, if you open .bashrc in front of someone, they won't see the secrets)
  • Configuration files are stored in VCS as templates, e.g. config.properties is stored as config.properties.tmpl
  • The template files contain placeholders for the secrets, such as:

    my.password=##MY_PASSWORD##

  • On application deployment, a script is run that transforms the template file into the target file, replacing placeholders with the values of environment variables, e.g. changing ##MY_PASSWORD## to the value of $MY_PASSWORD (see the sketch below).
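
A minimal Python sketch of that deployment step, keeping the file names and placeholder style of the example above:

import os

# Replace each ##NAME## placeholder with the matching environment variable.
with open("config.properties.tmpl") as tmpl:
    text = tmpl.read()

text = text.replace("##MY_PASSWORD##", os.environ["MY_PASSWORD"])

with open("config.properties", "w") as out:
    out.write(text)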


Answer 16

You could use EncFS if your system provides that. Thus you could keep your encrypted data as a subfolder of your repository, while providing your application a decrypted view to the data mounted aside. As the encryption is transparent, no special operations are needed on pull or push.

It would however need the EncFS folders to be mounted, which could be done by your application based on a password stored elsewhere, outside the versioned folders (e.g. in environment variables).


How to check if a word is an English word with Python?

Question: How to check if a word is an English word with Python?

I want to check in a Python program if a word is in the English dictionary.

I believe nltk wordnet interface might be the way to go but I have no clue how to use it for such a simple task.

def is_english_word(word):
    pass # how do I implement is_english_word?

is_english_word(token.lower())

In the future, I might want to check if the singular form of a word is in the dictionary (e.g., properties -> property -> english word). How would I achieve that?


Answer 0

For (much) more power and flexibility, use a dedicated spellchecking library like PyEnchant. There’s a tutorial, or you could just dive straight in:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>

PyEnchant comes with a few dictionaries (en_GB, en_US, de_DE, fr_FR), but can use any of the OpenOffice ones if you want more languages.

There appears to be a pluralisation library called inflect, but I’ve no idea whether it’s any good.
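
For the singular-form part of the question, one option is NLTK's WordNet lemmatizer, which reduces a noun to its lemma before the dictionary check. A minimal sketch, assuming nltk is installed (the WordNet data is a one-time download):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("properties"))  # 'property'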


Answer 1

It won't work well with WordNet, because WordNet does not contain all English words. Another possibility based on NLTK, without enchant, is NLTK's words corpus:

>>> from nltk.corpus import words
>>> "would" in words.words()
True
>>> "could" in words.words()
True
>>> "should" in words.words()
True
>>> "I" in words.words()
True
>>> "you" in words.words()
True

Answer 2

Using NLTK:

from nltk.corpus import wordnet

if not wordnet.synsets(word_to_test):
  #Not an English Word
else:
  #English Word

You should refer to this article if you have trouble installing wordnet or want to try other approaches.


Answer 3

Using a set to store the word list, because looking words up in it will be faster:

with open("english_words.txt") as word_file:
    english_words = set(word.strip().lower() for word in word_file)

def is_english_word(word):
    return word.lower() in english_words

print(is_english_word("ham"))  # should be true if you have a good english_words.txt

To answer the second part of the question, the plurals would already be in a good word list, but if you wanted to specifically exclude those from the list for some reason, you could indeed write a function to handle it. But English pluralization rules are tricky enough that I’d just include the plurals in the word list to begin with.

As to where to find English word lists, I found several just by Googling “English word list”. Here is one: http://www.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt You could Google for British or American English if you want specifically one of those dialects.


Answer 4

For a faster NLTK-based solution you could hash the set of words to avoid a linear search.

from nltk.corpus import words as nltk_words

# create this hash table once, outside the function,
# because you only need to do it once
dictionary = dict.fromkeys(nltk_words.words(), None)

def is_english_word(word):
    try:
        x = dictionary[word]
        return True
    except KeyError:
        return False

Answer 5

I find that there are 3 package-based solutions to this problem: pyenchant, wordnet and a corpus (self-defined or from NLTK). PyEnchant couldn't be installed easily on win64 with py3. WordNet doesn't work very well because its corpus isn't complete. So for me, I chose the solution answered by @Sadik, and used set(words.words()) to speed it up.

First:

pip3 install nltk
python3

import nltk
nltk.download('words')

Then:

from nltk.corpus import words
setofwords = set(words.words())

print("hello" in setofwords)
>>True

Answer 6

With pyEnchant.checker SpellChecker:

from enchant.checker import SpellChecker

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    return False if ((len(errors) > 4) or len(quote.split()) < 3) else True

print(is_in_english('“办理美国加州州立大学圣贝纳迪诺分校高仿成绩单Q/V2166384296加州州立大学圣贝纳迪诺分校学历学位认证'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))

> False
> True

Answer 7

For a semantic web approach, you could run a SPARQL query against WordNet in RDF format. Basically, just use the urllib module to issue a GET request and return the results in JSON format, parsed using Python's json module. If it's not an English word you'll get no results.

As another idea, you could query Wiktionary’s API.
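
A hedged sketch of the Wiktionary idea using the standard MediaWiki query API (note that Wiktionary also has entries for non-English words, and the "missing" flag is an assumption about the MediaWiki response format):

import json
import urllib.parse
import urllib.request

def in_wiktionary(word):
    url = ("https://en.wiktionary.org/w/api.php"
           "?action=query&format=json&titles="
           + urllib.parse.quote(word))
    with urllib.request.urlopen(url) as resp:
        pages = json.load(resp)["query"]["pages"]
    # Pages that do not exist come back flagged as "missing".
    return not any("missing" in page for page in pages.values())

print(in_wiktionary("hello"))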


Answer 8

For All Linux/Unix Users

If your OS uses the Linux kernel, there is a simple way to get all the words from the English/American dictionary. In the directory /usr/share/dict you have a words file. There are also more specific american-english and british-english files. These contain all of the words in that specific language. You can access this from every programming language, which is why I thought you might want to know about it.

Now, for Python-specific users, the Python code below should assign the list words to have the value of every single word:

import re

file = open("/usr/share/dict/words", "r")
words = re.sub(r"[^\w]", " ", file.read()).split()
file.close()

def is_word(word):
    # note: `words` is a list; wrap it in set() for fast repeated lookups
    return word.lower() in words

is_word("tarts") ## Returns true
is_word("jwiefjiojrfiorj") ## Returns False

Hope this helps!!!


Pandas unique values across multiple columns

Question: Pandas unique values across multiple columns

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
                   'Col3': np.random.random(5)})

What is the best way to return the unique values of ‘Col1’ and ‘Col2’?

The desired output is

'Bob', 'Joe', 'Bill', 'Mary', 'Steve'

Answer 0

pd.unique returns the unique values from an input array, or DataFrame column or index.

The input to this function needs to be one-dimensional, so multiple columns will need to be combined. The simplest way is to select the columns you want and then view the values in a flattened NumPy array. The whole operation looks like this:

>>> pd.unique(df[['Col1', 'Col2']].values.ravel('K'))
array(['Bob', 'Joe', 'Bill', 'Mary', 'Steve'], dtype=object)

Note that ravel() is an array method that returns a view (if possible) of a multidimensional array. The argument 'K' tells the method to flatten the array in the order the elements are stored in memory (pandas typically stores underlying arrays in Fortran-contiguous order; columns before rows). This can be significantly faster than using the method's default 'C' order.


An alternative way is to select the columns and pass them to np.unique:

>>> np.unique(df[['Col1', 'Col2']].values)
array(['Bill', 'Bob', 'Joe', 'Mary', 'Steve'], dtype=object)

There is no need to use ravel() here as the method handles multidimensional arrays. Even so, this is likely to be slower than pd.unique as it uses a sort-based algorithm rather than a hashtable to identify unique values.

The difference in speed is significant for larger DataFrames (especially if there are only a handful of unique values):

>>> df1 = pd.concat([df]*100000, ignore_index=True) # DataFrame with 500000 rows
>>> %timeit np.unique(df1[['Col1', 'Col2']].values)
1 loop, best of 3: 1.12 s per loop

>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel('K'))
10 loops, best of 3: 38.9 ms per loop

>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel()) # ravel using C order
10 loops, best of 3: 49.9 ms per loop

Answer 1

I have set up a DataFrame with a few simple strings in its columns:

>>> df
   a  b
0  a  g
1  b  h
2  d  a
3  e  e

You can concatenate the columns you are interested in and call unique function:

>>> pandas.concat([df['a'], df['b']]).unique()
array(['a', 'b', 'd', 'e', 'g', 'h'], dtype=object)

Answer 2

In [5]: set(df.Col1).union(set(df.Col2))
Out[5]: {'Bill', 'Bob', 'Joe', 'Mary', 'Steve'}

Or:

set(df.Col1) | set(df.Col2)

Answer 3

An updated solution using numpy v1.13+ requires specifying the axis in np.unique if using multiple columns, otherwise the array is implicitly flattened.

import numpy as np

np.unique(df[['col1', 'col2']], axis=0)

This change was introduced Nov 2016: https://github.com/numpy/numpy/commit/1f764dbff7c496d6636dc0430f083ada9ff4e4be


Answer 4

Non-pandas solution: using set().

import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1' : ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
              'Col2' : ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
               'Col3' : np.random.random(5)})

print df

print set(df.Col1.append(df.Col2).values)

Output:

   Col1   Col2      Col3
0   Bob    Joe  0.201079
1   Joe  Steve  0.703279
2  Bill    Bob  0.722724
3  Mary    Bob  0.093912
4   Joe  Steve  0.766027
set(['Steve', 'Bob', 'Bill', 'Joe', 'Mary'])

Answer 5

For those of us that love all things pandas, apply, and of course lambda functions:

df['Col3'] = df[['Col1', 'Col2']].apply(lambda x: ''.join(x), axis=1)

Answer 6

Here's another way:


import numpy as np
set(np.concatenate(df.values))

Answer 7

list(set(df[['Col1', 'Col2']].as_matrix().reshape((1,-1)).tolist()[0]))

The output will be ['Mary', 'Joe', 'Steve', 'Bob', 'Bill']


ImportError: No module named dateutil.parser

Question: ImportError: No module named dateutil.parser

I am receiving the following error when importing pandas in a Python program:

monas-mbp:book mona$ sudo pip install python-dateutil
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python
Cleaning up...
monas-mbp:book mona$ python t1.py
No module named dateutil.parser
Traceback (most recent call last):
  File "t1.py", line 4, in <module>
    import pandas as pd
  File "/Library/Python/2.7/site-packages/pandas/__init__.py", line 6, in <module>
    from . import hashtable, tslib, lib
  File "tslib.pyx", line 31, in init pandas.tslib (pandas/tslib.c:48782)
ImportError: No module named dateutil.parser

Here is the program as well:

import codecs 
from math import sqrt
import numpy as np
import pandas as pd

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,
                      "Norah Jones": 4.5, "Phoenix": 5.0,
                      "Slightly Stoopid": 1.5,
                      "The Strokes": 2.5, "Vampire Weekend": 2.0},

         "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5,
                 "Deadmau5": 4.0, "Phoenix": 2.0,
                 "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},

         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,
                  "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
                  "Slightly Stoopid": 1.0},

         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,
                 "Deadmau5": 4.5, "Phoenix": 3.0,
                 "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                 "Vampire Weekend": 2.0},

         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
                    "Norah Jones": 4.0, "The Strokes": 4.0,
                    "Vampire Weekend": 1.0},

         "Jordyn":  {"Broken Bells": 4.5, "Deadmau5": 4.0,
                     "Norah Jones": 5.0, "Phoenix": 5.0,
                     "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                     "Vampire Weekend": 4.0},

         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0,
                 "Norah Jones": 3.0, "Phoenix": 5.0,
                 "Slightly Stoopid": 4.0, "The Strokes": 5.0},

         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
                      "Phoenix": 4.0, "Slightly Stoopid": 2.5,
                      "The Strokes": 3.0}
        }



class recommender:

    def __init__(self, data, k=1, metric='pearson', n=5):
        """ initialize recommender
        currently, if data is dictionary the recommender is initialized
        to it.
        For all other data types of data, no initialization occurs
        k is the k value for k nearest neighbor
        metric is which distance formula to use
        n is the maximum number of recommendations to make"""
        self.k = k
        self.n = n
        self.username2id = {}
        self.userid2name = {}
        self.productid2name = {}
        # for some reason I want to save the name of the metric
        self.metric = metric
        if self.metric == 'pearson':
            self.fn = self.pearson
        #
        # if data is dictionary set recommender data to it
        #
        if type(data).__name__ == 'dict':
            self.data = data

    def convertProductID2name(self, id):
        """Given product id number return product name"""
        if id in self.productid2name:
            return self.productid2name[id]
        else:
            return id


    def userRatings(self, id, n):
        """Return n top ratings for user with id"""
        print ("Ratings for " + self.userid2name[id])
        ratings = self.data[id]
        print(len(ratings))
        ratings = list(ratings.items())
        ratings = [(self.convertProductID2name(k), v)
                   for (k, v) in ratings]
        # finally sort and return
        ratings.sort(key=lambda artistTuple: artistTuple[1],
                     reverse = True)
        ratings = ratings[:n]
        for rating in ratings:
            print("%s\t%i" % (rating[0], rating[1]))




    def loadBookDB(self, path=''):
        """loads the BX book dataset. Path is where the BX files are
        located"""
        self.data = {}
        i = 0
        #
        # First load book ratings into self.data
        #
        f = codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            user = fields[0].strip('"')
            book = fields[1].strip('"')
            rating = int(fields[2].strip().strip('"'))
            if user in self.data:
                currentRatings = self.data[user]
            else:
                currentRatings = {}
            currentRatings[book] = rating
            self.data[user] = currentRatings
        f.close()
        #
        # Now load books into self.productid2name
        # Books contains isbn, title, and author among other fields
        #
        f = codecs.open(path + "BX-Books.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            isbn = fields[0].strip('"')
            title = fields[1].strip('"')
            author = fields[2].strip().strip('"')
            title = title + ' by ' + author
            self.productid2name[isbn] = title
        f.close()
        #
        #  Now load user info into both self.userid2name and
        #  self.username2id
        #
        f = codecs.open(path + "BX-Users.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #print(line)
            #separate line into fields
            fields = line.split(';')
            userid = fields[0].strip('"')
            location = fields[1].strip('"')
            if len(fields) > 3:
                age = fields[2].strip().strip('"')
            else:
                age = 'NULL'
            if age != 'NULL':
                value = location + '  (age: ' + age + ')'
            else:
                value = location
            self.userid2name[userid] = value
            self.username2id[location] = userid
        f.close()
        print(i)


    def pearson(self, rating1, rating2):
        sum_xy = 0
        sum_x = 0
        sum_y = 0
        sum_x2 = 0
        sum_y2 = 0
        n = 0
        for key in rating1:
            if key in rating2:
                n += 1
                x = rating1[key]
                y = rating2[key]
                sum_xy += x * y
                sum_x += x
                sum_y += y
                sum_x2 += pow(x, 2)
                sum_y2 += pow(y, 2)
        if n == 0:
            return 0
        # now compute denominator
        denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n)
                       * sqrt(sum_y2 - pow(sum_y, 2) / n))
        if denominator == 0:
            return 0
        else:
            return (sum_xy - (sum_x * sum_y) / n) / denominator


    def computeNearestNeighbor(self, username):
        """creates a sorted list of users based on their distance to
        username"""
        distances = []
        for instance in self.data:
            if instance != username:
                distance = self.fn(self.data[username],
                                   self.data[instance])
                distances.append((instance, distance))
        # sort based on distance -- closest first
        distances.sort(key=lambda artistTuple: artistTuple[1],
                       reverse=True)
        return distances

    def recommend(self, user):
       """Give list of recommendations"""
       recommendations = {}
       # first get list of users  ordered by nearness
       nearest = self.computeNearestNeighbor(user)
       #
       # now get the ratings for the user
       #
       userRatings = self.data[user]
       #
       # determine the total distance
       totalDistance = 0.0
       for i in range(self.k):
          totalDistance += nearest[i][1]
       # now iterate through the k nearest neighbors
       # accumulating their ratings
       for i in range(self.k):
          # compute slice of pie 
          weight = nearest[i][1] / totalDistance
          # get the name of the person
          name = nearest[i][0]
          # get the ratings for this person
          neighborRatings = self.data[name]
          # get the name of the person
          # now find bands neighbor rated that user didn't
          for artist in neighborRatings:
             if not artist in userRatings:
                if artist not in recommendations:
                   recommendations[artist] = (neighborRatings[artist]
                                              * weight)
                else:
                   recommendations[artist] = (recommendations[artist]
                                              + neighborRatings[artist]
                                              * weight)
       # now make list from dictionary
       recommendations = list(recommendations.items())
       recommendations = [(self.convertProductID2name(k), v)
                          for (k, v) in recommendations]
       # finally sort and return
       recommendations.sort(key=lambda artistTuple: artistTuple[1],
                            reverse = True)
       # Return the first n items
       return recommendations[:self.n]

r = recommender(users)
# The author implementation
r.loadBookDB('/Users/mona/Downloads/BX-Dump/')

ratings = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Book-Ratings.csv', sep=";", quotechar="\"", escapechar="\\")
books = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Books.csv', sep=";", quotechar="\"", escapechar="\\")
users = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Users.csv', sep=";", quotechar="\"", escapechar="\\")



pivot_rating = ratings.pivot(index='User-ID', columns='ISBN', values='Book-Rating')

I am receiving the following error when importing pandas in a Python program:

monas-mbp:book mona$ sudo pip install python-dateutil
Requirement already satisfied (use --upgrade to upgrade): python-dateutil in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python
Cleaning up...
monas-mbp:book mona$ python t1.py
No module named dateutil.parser
Traceback (most recent call last):
  File "t1.py", line 4, in <module>
    import pandas as pd
  File "/Library/Python/2.7/site-packages/pandas/__init__.py", line 6, in <module>
    from . import hashtable, tslib, lib
  File "tslib.pyx", line 31, in init pandas.tslib (pandas/tslib.c:48782)
ImportError: No module named dateutil.parser

Also here’s the program:

import codecs 
from math import sqrt
import numpy as np
import pandas as pd

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,
                      "Norah Jones": 4.5, "Phoenix": 5.0,
                      "Slightly Stoopid": 1.5,
                      "The Strokes": 2.5, "Vampire Weekend": 2.0},

         "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5,
                 "Deadmau5": 4.0, "Phoenix": 2.0,
                 "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},

         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,
                  "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
                  "Slightly Stoopid": 1.0},

         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,
                 "Deadmau5": 4.5, "Phoenix": 3.0,
                 "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                 "Vampire Weekend": 2.0},

         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
                    "Norah Jones": 4.0, "The Strokes": 4.0,
                    "Vampire Weekend": 1.0},

         "Jordyn":  {"Broken Bells": 4.5, "Deadmau5": 4.0,
                     "Norah Jones": 5.0, "Phoenix": 5.0,
                     "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                     "Vampire Weekend": 4.0},

         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0,
                 "Norah Jones": 3.0, "Phoenix": 5.0,
                 "Slightly Stoopid": 4.0, "The Strokes": 5.0},

         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
                      "Phoenix": 4.0, "Slightly Stoopid": 2.5,
                      "The Strokes": 3.0}
        }



class recommender:

    def __init__(self, data, k=1, metric='pearson', n=5):
        """ initialize recommender
        currently, if data is dictionary the recommender is initialized
        to it.
        For all other data types of data, no initialization occurs
        k is the k value for k nearest neighbor
        metric is which distance formula to use
        n is the maximum number of recommendations to make"""
        self.k = k
        self.n = n
        self.username2id = {}
        self.userid2name = {}
        self.productid2name = {}
        # for some reason I want to save the name of the metric
        self.metric = metric
        if self.metric == 'pearson':
            self.fn = self.pearson
        #
        # if data is dictionary set recommender data to it
        #
        if type(data).__name__ == 'dict':
            self.data = data

    def convertProductID2name(self, id):
        """Given product id number return product name"""
        if id in self.productid2name:
            return self.productid2name[id]
        else:
            return id


    def userRatings(self, id, n):
        """Return n top ratings for user with id"""
        print ("Ratings for " + self.userid2name[id])
        ratings = self.data[id]
        print(len(ratings))
        ratings = list(ratings.items())
        ratings = [(self.convertProductID2name(k), v)
                   for (k, v) in ratings]
        # finally sort and return
        ratings.sort(key=lambda artistTuple: artistTuple[1],
                     reverse = True)
        ratings = ratings[:n]
        for rating in ratings:
            print("%s\t%i" % (rating[0], rating[1]))




    def loadBookDB(self, path=''):
        """loads the BX book dataset. Path is where the BX files are
        located"""
        self.data = {}
        i = 0
        #
        # First load book ratings into self.data
        #
        f = codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            user = fields[0].strip('"')
            book = fields[1].strip('"')
            rating = int(fields[2].strip().strip('"'))
            if user in self.data:
                currentRatings = self.data[user]
            else:
                currentRatings = {}
            currentRatings[book] = rating
            self.data[user] = currentRatings
        f.close()
        #
        # Now load books into self.productid2name
        # Books contains isbn, title, and author among other fields
        #
        f = codecs.open(path + "BX-Books.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            isbn = fields[0].strip('"')
            title = fields[1].strip('"')
            author = fields[2].strip().strip('"')
            title = title + ' by ' + author
            self.productid2name[isbn] = title
        f.close()
        #
        #  Now load user info into both self.userid2name and
        #  self.username2id
        #
        f = codecs.open(path + "BX-Users.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #print(line)
            #separate line into fields
            fields = line.split(';')
            userid = fields[0].strip('"')
            location = fields[1].strip('"')
            if len(fields) > 3:
                age = fields[2].strip().strip('"')
            else:
                age = 'NULL'
            if age != 'NULL':
                value = location + '  (age: ' + age + ')'
            else:
                value = location
            self.userid2name[userid] = value
            self.username2id[location] = userid
        f.close()
        print(i)


    def pearson(self, rating1, rating2):
        sum_xy = 0
        sum_x = 0
        sum_y = 0
        sum_x2 = 0
        sum_y2 = 0
        n = 0
        for key in rating1:
            if key in rating2:
                n += 1
                x = rating1[key]
                y = rating2[key]
                sum_xy += x * y
                sum_x += x
                sum_y += y
                sum_x2 += pow(x, 2)
                sum_y2 += pow(y, 2)
        if n == 0:
            return 0
        # now compute denominator
        denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n)
                       * sqrt(sum_y2 - pow(sum_y, 2) / n))
        if denominator == 0:
            return 0
        else:
            return (sum_xy - (sum_x * sum_y) / n) / denominator


    def computeNearestNeighbor(self, username):
        """creates a sorted list of users based on their distance to
        username"""
        distances = []
        for instance in self.data:
            if instance != username:
                distance = self.fn(self.data[username],
                                   self.data[instance])
                distances.append((instance, distance))
        # sort based on distance -- closest first
        distances.sort(key=lambda artistTuple: artistTuple[1],
                       reverse=True)
        return distances

    def recommend(self, user):
       """Give list of recommendations"""
       recommendations = {}
       # first get list of users  ordered by nearness
       nearest = self.computeNearestNeighbor(user)
       #
       # now get the ratings for the user
       #
       userRatings = self.data[user]
       #
       # determine the total distance
       totalDistance = 0.0
       for i in range(self.k):
          totalDistance += nearest[i][1]
       # now iterate through the k nearest neighbors
       # accumulating their ratings
       for i in range(self.k):
          # compute slice of pie 
          weight = nearest[i][1] / totalDistance
          # get the name of the person
          name = nearest[i][0]
          # get the ratings for this person
          neighborRatings = self.data[name]
          # get the name of the person
          # now find bands neighbor rated that user didn't
          for artist in neighborRatings:
             if not artist in userRatings:
                if artist not in recommendations:
                   recommendations[artist] = (neighborRatings[artist]
                                              * weight)
                else:
                   recommendations[artist] = (recommendations[artist]
                                              + neighborRatings[artist]
                                              * weight)
       # now make list from dictionary
       recommendations = list(recommendations.items())
       recommendations = [(self.convertProductID2name(k), v)
                          for (k, v) in recommendations]
       # finally sort and return
       recommendations.sort(key=lambda artistTuple: artistTuple[1],
                            reverse = True)
       # Return the first n items
       return recommendations[:self.n]

r = recommender(users)
# The author implementation
r.loadBookDB('/Users/mona/Downloads/BX-Dump/')

ratings = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Book-Ratings.csv', sep=";", quotechar="\"", escapechar="\\")
books = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Books.csv', sep=";", quotechar="\"", escapechar="\\")
users = pd.read_csv('/Users/danialt/BX-CSV-Dump/BX-Users.csv', sep=";", quotechar="\"", escapechar="\\")



pivot_rating = ratings.pivot(index='User-ID', columns='ISBN', values='Book-Rating')

Answer 0

On Ubuntu you may need to install the package manager pip first:

sudo apt-get install python-pip

Then install the python-dateutil package with:

sudo pip install python-dateutil
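
After installing, a quick sanity check from the shell confirms that the module resolves (a sketch, assuming a standard install):

python -c "from dateutil import parser; print(parser.parse('2014-01-01'))"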

Answer 1

You can find the dateutil package at https://pypi.python.org/pypi/python-dateutil. Extract it somewhere and run the command:

python setup.py install

It worked for me!


Answer 2

For Python 3:

pip3 install python-dateutil

Answer 3

For Python 3 on Ubuntu, use:

sudo apt-get install python3-dateutil

Answer 4

If you’re using a virtualenv, make sure that you are running pip from within the virtualenv.

$ which pip
/Library/Frameworks/Python.framework/Versions/Current/bin/pip
$ find . -name pip -print
./flask/bin/pip
./flask/lib/python2.7/site-packages/pip
$ ./flask/bin/pip install python-dateutil

Answer 5

None of the solutions worked for me. If you are using pip, do:

pip install pycrypto==2.6.1


Answer 6

On Ubuntu 18.04, for Python 2:

sudo apt-get install python-dateutil

Answer 7

I had the same issue on macOS, and installing python-dateutil worked for me.


Answer 8

If you are using Pipenv, you may need to add this to your Pipfile:

[packages]
python-dateutil = "*"
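
Equivalently, you can let Pipenv add the entry for you from the command line:

pipenv install python-dateutil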

Importing variables from another file?

Question: Importing variables from another file?

How can I import variables from one file to another?

Example: file1 has the variables x1 and x2; how do I pass them to file2?

How can I import all of the variables from one to another?


Answer 0

from file1 import *  

will import all objects and methods in file1.


Answer 1

Import file1 inside file2:

To import all variables from file1 without flooding file2’s namespace, use:

import file1

#now use file1.x1, file1.x2, ... to access those variables

To import all variables from file1 into file2's namespace (not recommended):

from file1 import *
#now use x1, x2..

From the docs:

While it is valid to use from module import * at module level it is usually a bad idea. For one, this loses an important property Python otherwise has — you can know where each toplevel name is defined by a simple “search” function in your favourite editor. You also open yourself to trouble in the future, if some module grows additional functions or classes.


Answer 2

Best to import x1 and x2 explicitly:

from file1 import x1, x2

This allows you to avoid unnecessary namespace conflicts with variables and functions from file1 while working in file2.

But if you really want, you can import all the variables:

from file1 import * 

Answer 3

Actually, it is not really the same to import a variable with:

from file1 import x1
print(x1)

and

import file1
print(file1.x1)

Although at import time x1 and file1.x1 have the same value, they are not the same variable. For instance, call a function in file1 that modifies x1 and then try to print the variable from the main file: you will not see the modified value.
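
A minimal sketch of that behavior, using a hypothetical pair of files:

# file1.py
x1 = 5

def modify_x1():
    global x1
    x1 = 10  # rebinds the module-level name

# main.py
from file1 import x1, modify_x1
import file1

modify_x1()
print(x1)        # 5  -- the imported name still points at the old object
print(file1.x1)  # 10 -- attribute access sees the rebound module global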


Answer 4

Marc's response is correct. Actually, you can print the memory address of the variables with print(hex(id(libvar))) and see that, once the module rebinds the name, the addresses are different.

# mylib.py
libvar = None
def lib_method():
    global libvar
    libvar = 42  # rebind the module-level name
    print(hex(id(libvar)))

# myapp.py
from mylib import libvar, lib_method
import mylib

lib_method()                  # id of the new object bound to mylib.libvar
print(hex(id(libvar)))        # id of the original None object
print(hex(id(mylib.libvar)))  # same id as printed inside lib_method

Answer 5

script1.py

title="Hello world"

script2.py is where we use script1's variable:

Method 1:

import script1
print(script1.title)

Method 2:

from script1 import title
print(title)

Answer 6

In Python you can access the contents of other files as if they were a library, which is more convenient than in Java or other OOP-based languages. This is really cool.

This makes it easy to import a file's contents and process them however you like, and it is one of the main reasons Python is such a popular language for data science and machine learning.

In my project structure, I access variables such as API links and secret keys from a .env file.

General Structure:

from <File-Name> import *
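
A common way to load such a .env file is the python-dotenv package; a minimal sketch, assuming a hypothetical .env entry named API_KEY:

# pip install python-dotenv
import os
from dotenv import load_dotenv

load_dotenv()                   # reads key=value pairs from .env into os.environ
api_key = os.getenv('API_KEY')  # 'API_KEY' is a hypothetical entry name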

Answer 7

first.py:

a=5

second.py:

import first
print(first.a)

The result will be 5.


How to debug a Flask application

Question: How to debug a Flask application

How are you meant to debug errors in Flask? Print to the console? Flash messages to the page? Or is there a more powerful option available to figure out what’s happening when something goes wrong?


Answer 0

Running the app in development mode will show an interactive traceback and console in the browser when there is an error. To run in development mode, set the FLASK_ENV=development environment variable then use the flask run command (remember to point FLASK_APP to your app as well).

For Linux, Mac, Linux Subsystem for Windows, Git Bash on Windows, etc.:

export FLASK_APP=myapp
export FLASK_ENV=development
flask run

For Windows CMD, use set instead of export:

set FLASK_ENV=development

For PowerShell, use $env:

$env:FLASK_ENV = "development"

Prior to Flask 1.0, this was controlled by the FLASK_DEBUG=1 environment variable instead.

If you’re using the app.run() method instead of the flask run command, pass debug=True to enable debug mode.

Tracebacks are also printed to the terminal running the server, regardless of development mode.

If you’re using PyCharm, VS Code, etc., you can take advantage of its debugger to step through the code with breakpoints. The run configuration can point to a script calling app.run(debug=True, use_reloader=False), or point it at the venv/bin/flask script and use it as you would from the command line. You can leave the reloader disabled, but a reload will kill the debugging context and you will have to catch a breakpoint again.

You can also use pdb, pudb, or another terminal debugger by calling set_trace in the view where you want to start debugging.
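
A minimal sketch of that, assuming a hypothetical route:

import pdb
from flask import Flask

app = Flask(__name__)

@app.route('/debug-me')
def debug_me():
    result = {'answer': 42}
    pdb.set_trace()  # execution pauses here; inspect `result` in the server terminal
    return str(result['answer'])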


Be sure not to use too-broad except blocks. Surrounding all your code with a catch-all try... except... will silence the error you want to debug. It’s unnecessary in general, since Flask will already handle exceptions by showing the debugger or a 500 error and printing the traceback to the console.


Answer 1

You can use app.run(debug=True) for the Werkzeug debugger (edit: as mentioned below, and I should have known).


Answer 2

1.1.x文档中,您可以通过将环境变量导出到Shell提示符来启用调试模式:

export FLASK_APP=/daemon/api/views.py  # path to app
export FLASK_DEBUG=1
python -m flask run --host=0.0.0.0

From the 1.1.x documentation, you can enable debug mode by exporting an environment variable to your shell prompt:

export FLASK_APP=/daemon/api/views.py  # path to app
export FLASK_DEBUG=1
python -m flask run --host=0.0.0.0

Answer 3

One can also use the Flask Debug Toolbar extension to get more detailed information embedded in rendered pages.

from flask import Flask
from flask_debugtoolbar import DebugToolbarExtension
import logging

app = Flask(__name__)
app.debug = True
app.secret_key = 'development key'

toolbar = DebugToolbarExtension(app)

@app.route('/')
def index():
    logging.warning("See this message in Flask Debug Toolbar!")
    return "<html><body></body></html>"

Start the application as follows:

FLASK_APP=main.py FLASK_DEBUG=1 flask run

Answer 4

If you’re using Visual Studio Code, replace

app.run(debug=True)

with

app.run()

It appears that turning on the internal debugger disables the VS Code debugger.


Answer 5

If you want to debug your Flask app, just go to the folder where the Flask app lives. Don't forget to activate your virtual environment, then paste the following lines into the console, changing "mainfilename" to the name of your Flask main file.

export FLASK_APP="mainfilename.py"
export FLASK_DEBUG=1
python -m flask run --host=0.0.0.0

After you enable the debugger for your Flask app, almost every error will be printed to the console or to the browser window. If you want to figure out what's happening, you can use simple print statements, or console.log() for JavaScript code.


Answer 6

Install python-dotenv in your virtual environment.

Create a .flaskenv in your project root. By project root, I mean the folder that contains your app.py file.

Inside this file write the following:

FLASK_APP=myapp 
FLASK_ENV=development

Now issue the following command:

flask run

Answer 7

To activate debug mode in Flask, simply type set FLASK_DEBUG=1 in CMD on Windows, or export FLASK_DEBUG=1 in a Linux terminal, then restart your app and you are good to go!


Answer 8

Quick tip: if you use PyCharm, go to Edit Configurations => Configurations, enable the FLASK_DEBUG checkbox, and restart the run.


Answer 9

Use loggers and print statements in the development environment; you can use Sentry for production environments.


Answer 10

For Windows users:

Open Powershell and cd into your project directory.

Use these commands in PowerShell; the variants given for other shells won't work in PowerShell.

$env:FLASK_APP = "app"  
$env:FLASK_ENV = "development"

Answer 11

If you are running it locally and want to be able to step through the code:

python -m pdb script.py


SQLAlchemy: difference between engine, connection and session

Question: SQLAlchemy: difference between engine, connection and session

I use SQLAlchemy and there are at least three entities: engine, session and connection, all of which have an execute method, so if I, e.g., want to select all records from table I can do this

engine.execute(select([table])).fetchall()

and this

connection.execute(select([table])).fetchall()

and even this

session.execute(select([table])).fetchall()

– the results will be the same.

As I understand it, if someone uses engine.execute it creates a connection, opens a session (Alchemy takes care of it for you) and executes the query. But is there a global difference between these three ways of performing such a task?


Answer 0

A one-line overview:

The behavior of execute() is the same in all cases, but they are 3 different methods, on the Engine, Connection, and Session classes.

What exactly is execute():

To understand the behavior of execute() we need to look into the Executable class. Executable is a superclass for all "statement" types of objects, including select(), delete(), update(), insert(), text() – in the simplest words possible, an Executable is a SQL expression construct supported in SQLAlchemy.

In all cases the execute() method takes the SQL text or a constructed SQL expression, i.e. any of the variety of SQL expression constructs supported in SQLAlchemy, and returns the query results (a ResultProxy – wraps a DB-API cursor object to provide easier access to row columns).


To clarify it further (only for conceptual clarification, not a recommended approach):

In addition to Engine.execute() (connectionless execution), Connection.execute(), and Session.execute(), it is also possible to use execute() directly on any Executable construct. The Executable class has its own implementation of execute() – per the official documentation, the one-line description of what execute() does is "Compile and execute this Executable". In this case we need to explicitly bind the Executable (the SQL expression construct) to a Connection object or an Engine object (which implicitly gets a Connection object), so execute() will know where to execute the SQL.

The following example demonstrates it well – Given a table as below:

from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

meta = MetaData()
users_table = Table('users', meta,
    Column('id', Integer, primary_key=True),
    Column('name', String(50)))

Explicit execution i.e. Connection.execute() – passing the SQL text or constructed SQL expression to the execute() method of Connection:

engine = create_engine('sqlite:///file.db')
connection = engine.connect()
result = connection.execute(users_table.select())
for row in result:
    # ....
connection.close()

Explicit connectionless execution i.e. Engine.execute() – passing the SQL text or constructed SQL expression directly to the execute() method of Engine:

engine = create_engine('sqlite:///file.db')
result = engine.execute(users_table.select())
for row in result:
    # ....
result.close()

Implicit execution, i.e. Executable.execute(), is also connectionless, and calls the execute() method of the Executable; that is, it calls the execute() method directly on the SQL expression construct (an instance of Executable) itself.

engine = create_engine('sqlite:///file.db')
meta.bind = engine
result = users_table.select().execute()
for row in result:
    # ....
result.close()

Note: the implicit execution example is stated for the purpose of clarification only – this way of executing is strongly discouraged, as per the docs:

“implicit execution” is a very old usage pattern that in most cases is more confusing than it is helpful, and its usage is discouraged. Both patterns seem to encourage the overuse of expedient “short cuts” in application design which lead to problems later on.


Your questions:

As I understand if someone use engine.execute it creates connection, opens session (Alchemy cares about it for you) and executes query.

You're right about the part "if someone use engine.execute it creates connection", but not about "opens session (Alchemy cares about it for you) and executes query" – using Engine.execute() and Connection.execute() is (almost) one and the same thing. Formally, the Connection object gets created implicitly in the former case, while in the latter case we explicitly instantiate it. What really happens in this case is:

`Engine` object (instantiated via `create_engine()`) -> `Connection` object (instantiated via `engine_instance.connect()`) -> `connection.execute({*SQL expression*})`

But is there a global difference between these three ways of performing such task?

At the DB layer it's exactly the same thing: all of them execute SQL (a text expression or one of the various SQL expression constructs). From the application's point of view there are two options:

  • Direct execution – Using Engine.execute() or Connection.execute()
  • Using sessions – efficiently handles transactions as a single unit of work, with ease, via session.add(), session.rollback(), session.commit(), session.close(). It is the way to interact with the DB when using the ORM, i.e. mapped tables. Provides an identity_map for instantly getting already-accessed or newly created/added objects during a single request.

Session.execute() ultimately uses Connection.execute() in order to execute the SQL statement. Using the Session object is the SQLAlchemy ORM's recommended way for an application to interact with the database.

An excerpt from the docs:

It's important to note that when using the SQLAlchemy ORM, these objects are not generally accessed; instead, the Session object is used as the interface to the database. However, for applications that are built around direct usage of textual SQL statements and/or SQL expression constructs without involvement by the ORM's higher level management services, the Engine and Connection are king (and queen?) – read on.


Answer 1

Nabeel's answer covers a lot of details and is helpful, but I found it confusing to follow. Since this is currently the first Google result for this issue, I'm adding my understanding of it for future people who find this question:

Running .execute()

As the OP and Nabeel Ahmed both note, when executing a plain SELECT * FROM tablename, there's no difference in the result provided.

The differences between these three objects do become important depending on the context that the SELECT statement is used in or, more commonly, when you want to do other things like INSERT, DELETE, etc.

When to use Engine, Connection, Session generally

  • Engine is the lowest-level object used by SQLAlchemy. It maintains a pool of connections available for use whenever the application needs to talk to the database. .execute() is a convenience method that first calls conn = engine.connect(close_with_result=True) and then conn.execute(). The close_with_result parameter means the connection is closed automatically. (I'm slightly paraphrasing the source code, but it's essentially true.) edit: Here's the source code for engine.execute

    You can use engine to execute raw SQL.

    result = engine.execute('SELECT * FROM tablename;')
    #what engine.execute() is doing under the hood
    conn = engine.connect(close_with_result=True)
    result = conn.execute('SELECT * FROM tablename;')
    
    #after you iterate over the results, the result and connection get closed
    for row in result:
        print(row['columnname'])
    
    #or you can explicitly close the result, which also closes the connection
    result.close()
    

    This is covered in the docs under basic usage.

  • Connection is (as we saw above) the thing that actually does the work of executing a SQL query. You should use it whenever you want greater control over the attributes of the connection, when it gets closed, etc. A very important example of this is a Transaction, which lets you decide when to commit your changes to the database. In normal use, changes are autocommitted. With the use of transactions, you could (for example) run several different SQL statements, and if something goes wrong with one of them you could undo all the changes at once.

    connection = engine.connect()
    trans = connection.begin()
    try:
        connection.execute("INSERT INTO films VALUES ('Comedy', '82 minutes');")
        connection.execute("INSERT INTO datalog VALUES ('added a comedy');")
        trans.commit()
    except:
        trans.rollback()
        raise
    

    This would let you undo both changes if one failed, like if you forgot to create the datalog table.

    So if you're executing raw SQL code and need control, use connections.

  • Sessions are used for the Object Relationship Management (ORM) aspect of SQLAlchemy (in fact you can see this from how they’re imported: from sqlalchemy.orm import sessionmaker). They use connections and transactions under the hood to run their automatically-generated SQL statements. .execute() is a convenience function that passes through to whatever the session is bound to (usually an engine, but can be a connection).

    If you’re using the ORM functionality, use session; if you’re only doing straight SQL queries not bound to objects, you’re probably better off using connections directly.
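
A minimal sketch of typical session usage, assuming a hypothetical ORM-mapped class User:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///file.db')
Session = sessionmaker(bind=engine)
session = Session()

try:
    session.add(User(name='Ed'))  # User is a hypothetical mapped class
    session.commit()              # flushes and commits the unit of work
except:
    session.rollback()
    raise
finally:
    session.close()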


Answer 2

Here is an example of running DCL (Data Control Language) such as GRANT

def grantAccess(db, tb, user):
  import sqlalchemy as SA
  import psycopg2  # DB-API driver named in the URL below

  # username, password, host and port are assumed to be defined in an enclosing scope
  url = "{d}+{driver}://{u}:{p}@{h}:{port}/{db}".\
            format(d="redshift",
            driver='psycopg2',
            u=username,
            p=password,
            h=host,
            port=port,
            db=db)
  engine = SA.create_engine(url)
  cnn = engine.connect()
  trans = cnn.begin()
  strSQL = "GRANT SELECT on table " + tb + " to " + user + " ;"
  try:
      cnn.execute(strSQL)
      trans.commit()
  except:
      trans.rollback()
      raise
