Tag archive: Python

Is there something like RStudio for Python? [closed]

Question: Is there something like RStudio for Python? [closed]

In RStudio, you can run parts of the code in the editor window, and the results appear in the console.

You can also do cool stuff like choosing whether to run everything up to the cursor, everything after the cursor, or just the part that you selected, and so on. There are hotkeys for all of that.

It’s like a step above the interactive shell in Python. There you can use readline to go back to previous individual lines, but it has no “concept” of what a function is, what a section of code is, etc.

Is there a tool like that for Python? Or do you have some sort of similar workaround that you use, say, in vim?


Answer 0

IPython Notebooks are awesome. Here’s another, newer browser-based tool I’ve recently discovered: Rodeo. My impression is that it better supports an RStudio-like workflow.


Answer 1

Jupyter Notebook (previously known as IPython Notebook) is a really cool project for interactive data manipulation in Python (and other languages, including R). It lets you interactively code and document what you’re doing in one interface and later save it as a:

  • notebook (.ipynb)
  • script (a .py file containing only the source code)
  • static HTML (and therefore PDF as well)

You can even share your notebooks online with others using the nbviewer service, where people publish whole books. Furthermore, GitHub renders your .ipynb files. You can publish your Jupyter Notebooks as reproducible research articles on Authorea. For collaborative editing by multiple users, check out Google Colab, which is built on top of Jupyter.

The default Jupyter Notebook starts a web application locally (or you deploy it to a server), and you use it from your browser. As Ryan also mentioned in his answer, Rodeo is an interface more similar to RStudio, built on top of the Jupyter kernel.

JupyterLab is a newer take on the UI that allows more flexibility in how you edit your notebooks, control interactive widgets and even run commands in terminal emulators.

There’s also a Qt console for IPython, a similar project with inline plots, which is a desktop application.

Jupyter is a normal Python package and can be installed with pip install jupyter. To get all the scientific libraries running on your computer, however, it might be easier to try the official Jupyter Docker containers. For example, assuming your notebooks are in ~/code/jupyter, you can run the container as:

docker run -it --rm -p 8888:8888 -v ~/code/jupyter:/home/jovyan/work jupyter/datascience-notebook

Answer 2

Spyder, or install Python(x,y). It is great.

If you are new to Python, you can install the free Anaconda distribution (http://continuum.io/downloads.html), which will install Spyder for you, as well as Python 2.7 and IPython. Spyder is very similar to RStudio.


Answer 3

Check out Rodeo from Yhat if you’re looking for something like RStudio for Python.

Rodeo has:

  • a text editor (uses Atom under the hood)
  • Vim / Emacs modes
  • an IPython console
  • autocomplete
  • docstrings
  • the ability to see plots, dataframes and variables

Answer 4

You might want to look into JupyterLab (the next generation of Jupyter Notebooks): https://github.com/jupyter/jupyterlab.

JupyterLab aims to create a more desktop-like experience on the Web.

Update: As of March 2018, JupyterLab is in beta. “The beta releases are suitable for general usage. For JupyterLab extension developers, the extension APIs will continue to evolve until the 1.0 release. Eventually, JupyterLab will replace the classic Jupyter Notebook after JupyterLab reaches 1.0.”

To run JupyterLab as a desktop application, see christopherroach.com/articles/jupyterlab-desktop-app (thanks to PatrickT).

Here’s a quick preview:

You can arrange a notebook next to a graphical console atop a terminal that is monitoring the system, while keeping the file manager on the left.

For more details see: https://blog.jupyter.org/2016/07/14/jupyter-lab-alpha/ and here: http://www.techatbloomberg.com/blog/inside-the-collaboration-that-built-the-open-source-jupyterlab-project/.


Answer 5

PyCharm is a really decent IDE. From what I have seen so far, it is the most similar to RStudio. Another nice feature is that it lets you install new Python libraries in a fashion similar to RStudio (which otherwise can be a nightmare). There is now a free ‘community’ edition.


Answer 6

I think it is worthwhile to mention that RStudio v1.1.359 Preview has been released. It has a terminal feature that can be used for Python.

Download is available here

Documentation is available here


Answer 7

Spyder is what you need! https://code.google.com/p/spyderlib/
Spyder (previously known as Pydee) is a powerful interactive development environment for the Python language with advanced editing, interactive testing, debugging and introspection features.


Answer 8

For a nicer interactive shell for Python, have a look at DreamPie. It’s not really an IDE though (as RStudio seems to be?).


Answer 9

Wing IDE, and probably also other Python IDEs like PyCharm and PyDev, have features like this. In Wing you can either select and execute code in the integrated Python Shell or, if you’re debugging something, interact with the paused debug program in a shell (called the Debug Probe). There is also special support for matplotlib, in case you’re using that, so that you can work with plots interactively.


Python mysqldb: Library not loaded: libmysqlclient.18.dylib

Question: Python mysqldb: Library not loaded: libmysqlclient.18.dylib

I just compiled and installed MySQLdb for Python 2.7 on Mac OS X 10.6. I created a simple test file that imports it:

import MySQLdb as mysql

First, this line is underlined in red and the tooltip says “Unresolved import”. Then I tried to run the following simple Python code:

import MySQLdb as mysql

def main():
    conn = mysql.connect(charset="utf8", use_unicode=True, host="localhost", user="root", passwd="", db="")

if __name__ == '__main__':
    main()

When executing it, I get the following error message:

Traceback (most recent call last):
  File "/path/to/project/Python/src/cvdv/TestMySQLdb.py", line 4, in <module>
    import MySQLdb as mysql
  File "build/bdist.macosx-10.6-intel/egg/MySQLdb/__init__.py", line 19, in <module>
    \namespace cvdv
  File "build/bdist.macosx-10.6-intel/egg/_mysql.py", line 7, in <module>
  File "build/bdist.macosx-10.6-intel/egg/_mysql.py", line 6, in __bootstrap__
ImportError: dlopen(/Users/toom/.python-eggs/MySQL_python-1.2.3-py2.7-macosx-10.6-intel.egg-tmp/_mysql.so, 2): Library not loaded: libmysqlclient.18.dylib
  Referenced from: /Users/toom/.python-eggs/MySQL_python-1.2.3-py2.7-macosx-10.6-intel.egg-tmp/_mysql.so
  Reason: image not found

What might be the solution to my problem?

EDIT: Actually I found out that the library lies in /usr/local/mysql/lib. So I need to tell my pydev eclipse version where to find it. Where do I set this?


Answer 0

I solved the problem by creating a symbolic link to the library.

The actual library resides in

/usr/local/mysql/lib

And then I created a symbolic link in

/usr/lib

using the command:

sudo ln -s /usr/local/mysql/lib/libmysqlclient.18.dylib /usr/lib/libmysqlclient.18.dylib

so that I have the following mapping:

ls -l libmysqlclient.18.dylib 
lrwxr-xr-x  1 root  wheel  44 16 Jul 14:01 libmysqlclient.18.dylib -> /usr/local/mysql/lib/libmysqlclient.18.dylib

That was it. After that, everything worked fine.

EDIT:

Note that since OS X El Capitan, System Integrity Protection (SIP, also known as “rootless”) prevents you from creating links in /usr/lib/. You could disable SIP by following these instructions, but you can create a link in /usr/local/lib/ instead:

sudo ln -s /usr/local/mysql/lib/libmysqlclient.18.dylib /usr/local/lib/libmysqlclient.18.dylib
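
Before creating a link, it can help to confirm which directory actually contains the library file. The following is only a minimal sketch and not part of the original answer: the helper name find_library and the list of candidate directories are assumptions chosen for illustration.

```python
import os

def find_library(name, directories):
    """Return the full paths under `directories` where a file called `name` exists."""
    return [
        os.path.join(d, name)
        for d in directories
        if os.path.exists(os.path.join(d, name))
    ]

if __name__ == "__main__":
    # The directory list is an assumption; adjust it to your MySQL installation.
    candidates = ["/usr/lib", "/usr/local/lib", "/usr/local/mysql/lib"]
    print(find_library("libmysqlclient.18.dylib", candidates))
```

If the library only shows up under /usr/local/mysql/lib, that is the path to use as the link target in the ln -s command.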

What is the maximum float in Python?

Question: What is the maximum float in Python?

I think the maximum integer in Python is available by calling sys.maxint.

What is the maximum float or long in Python?


Answer 0

For float, have a look at sys.float_info:

>>> import sys
>>> sys.float_info
sys.floatinfo(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308,
min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53,
epsilon=2.2204460492503131e-16, radix=2, rounds=1)

Specifically, sys.float_info.max:

>>> sys.float_info.max
1.7976931348623157e+308

If that’s not big enough, there’s always positive infinity:

>>> infinity = float("inf")
>>> infinity
inf
>>> infinity / 10000
inf

The long type has unlimited precision, so I think you’re only limited by available memory.
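
To see these limits in action, here is a small sketch written for Python 3, where the old long type has been merged into int. It is an illustration added for clarity rather than part of the original answer, and the variable names are made up for the example.

```python
import math
import sys

largest = sys.float_info.max
print(largest)  # the largest finite float

# Multiplying past the largest finite value overflows to infinity
overflowed = largest * 2
print(math.isinf(overflowed))

# Converting a too-large integer to float raises instead of returning inf
try:
    float(10 ** 400)
except OverflowError as exc:
    print("OverflowError:", exc)

# Integers themselves are arbitrary precision
big = 10 ** 400
print(len(str(big)))  # number of decimal digits
```

The contrast between the silent overflow to inf and the OverflowError on conversion is worth keeping in mind when mixing very large integers with floats.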


Answer 1

sys.maxint is not the largest integer supported by Python. It’s the largest integer supported by Python’s regular integer type.
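
In Python 3, sys.maxint no longer exists; the closest counterpart is sys.maxsize, and plain int values simply keep growing past it. A minimal sketch of that behaviour, assuming Python 3 (the variable names are illustrative only):

```python
import sys

# sys.maxsize is the largest value a container index can take
limit = sys.maxsize
beyond = limit + 1

print(type(beyond).__name__)  # still a plain int
print(beyond > limit)
```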


Answer 2

If you are using NumPy, you can use the dtype float128 and get a maximum float of about 1.19e+4932:

>>> import numpy as np
>>> np.finfo(np.float128)
finfo(resolution=1e-18, min=-1.18973149536e+4932, max=1.18973149536e+4932, dtype=float128)

Experiences with Python Git modules? [closed]

Question: Experiences with Python Git modules? [closed]

What are people’s experiences with any of the Git modules for Python? (I know of GitPython, PyGit, and Dulwich – feel free to mention others if you know of them.)

I am writing a program which will have to interact (add, delete, commit) with a Git repository, but I have no experience with Git, so one of the things I’m looking for is ease of use and understanding with regard to Git.

The other things I’m primarily interested in are maturity and completeness of the library, a reasonable lack of bugs, continued development, and helpfulness of the documentation and developers.

If you think of something else I might want or need to know, please feel free to mention it.


Answer 0

While this question was asked a while ago and I don’t know the state of the libraries at that point, it is worth mentioning for searchers that GitPython does a good job of abstracting the command line tools so that you don’t need to use subprocess. There are some useful built-in abstractions that you can use, but for everything else you can do things like:

import git
repo = git.Repo('/home/me/repodir')
print(repo.git.status())
# checkout and track a remote branch
print(repo.git.checkout('origin/somebranch', b='somebranch'))
# add a file
print(repo.git.add('somefile'))
# commit
print(repo.git.commit(m='my commit message'))
# now we are one commit ahead
print(repo.git.status())

Everything else in GitPython just makes it easier to navigate. I’m fairly well satisfied with this library and appreciate that it is a wrapper on the underlying git tools.

UPDATE: I’ve switched to using the sh module for not just git but most command-line utilities I need in Python. To replicate the above I would do this instead:

import sh
git = sh.git.bake(_cwd='/home/me/repodir')
print(git.status())
# create and switch to a new branch
print(git.checkout('-b', 'somebranch'))
# add a file
print(git.add('somefile'))
# commit
print(git.commit(m='my commit message'))
# now we are one commit ahead
print(git.status())

Answer 1

I thought I would answer my own question, since I’m taking a different path than suggested in the answers. Nonetheless, thanks to those who answered.

First, a brief synopsis of my experiences with GitPython, PyGit, and Dulwich:

  • GitPython: After downloading, I got this imported and the appropriate object initialized. However, trying to do what was suggested in the tutorial led to errors. Lacking more documentation, I turned elsewhere.
  • PyGit: This would not even import, and I could find no documentation.
  • Dulwich: Seems to be the most promising (at least for what I wanted and saw). I made some progress with it, more than with GitPython, since its egg comes with Python source. However, after a while, I decided it may just be easier to try what I did.

Also, StGit looks interesting, but I would need the functionality extracted into a separate module and do not want to wait for that to happen right now.

In (much) less time than I spent trying to get the three modules above working, I managed to get git commands working via the subprocess module, e.g.

import subprocess

def gitAdd(fileName, repoDir):
    cmd = ['git', 'add', fileName]
    p = subprocess.Popen(cmd, cwd=repoDir)
    p.wait()

gitAdd('exampleFile.txt', '/usr/local/example_git_repo_dir')

This isn’t fully incorporated into my program yet, but I’m not anticipating a problem, except maybe speed (since I’ll be processing hundreds or even thousands of files at times).

Maybe I just didn’t have the patience to get things going with Dulwich or GitPython. That said, I’m hopeful the modules will get more development and be more useful soon.
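
The subprocess approach above can be generalised a little. What follows is only a sketch under stated assumptions, not the author’s actual code: the helper names run and git are hypothetical, it targets Python 3.7 or later, and it uses subprocess.run with check=True so that a failing command raises an exception instead of passing silently.

```python
import subprocess
import sys

def run(cmd, cwd=None):
    """Run `cmd` (a list of arguments) in `cwd` and return its stdout as text.

    Raises subprocess.CalledProcessError if the command exits with a non-zero status.
    """
    result = subprocess.run(
        cmd,
        cwd=cwd,
        check=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    return result.stdout

def git(repo_dir, *args):
    """Hypothetical convenience wrapper: run `git <args>` inside `repo_dir`."""
    return run(["git"] + list(args), cwd=repo_dir)

if __name__ == "__main__":
    # Demonstrated with the Python interpreter so the example runs without a repository
    print(run([sys.executable, "-c", "print('ok')"]))
```

With this in place, the earlier call would become git('/usr/local/example_git_repo_dir', 'add', 'exampleFile.txt'). Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.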


Answer 2

I’d recommend pygit2 – it uses the excellent libgit2 bindings.


Answer 3

This is a pretty old question, and while looking for Git libraries, I found one that was made this year (2013) called Gittle.

It worked great for me (where the others I tried were flaky), and seems to cover most of the common actions.

Some examples from the README:

from gittle import Gittle

# Clone a repository
repo_path = '/tmp/gittle_bare'
repo_url = 'git://github.com/FriendCode/gittle.git'
repo = Gittle.clone(repo_url, repo_path)

# Stage multiple files
repo.stage(['other1.txt', 'other2.txt'])

# Do the commit
repo.commit(name="Samy Pesse", email="samy@friendco.de", message="This is a commit")

# Authentication with RSA private key
key_file = open('/Users/Me/keys/rsa/private_rsa')
repo.auth(pkey=key_file)

# Do push
repo.push()

Answer 4

Maybe it helps to know that Bazaar and Mercurial both use Dulwich for their Git interoperability.

Dulwich is probably different from the others in the sense that it’s a reimplementation of Git in Python. The others might just be wrappers around Git’s commands (so they could be simpler to use from a high-level point of view: commit/add/delete), which probably means their API is very close to Git’s command line, so you’ll need to gain experience with Git.


Answer 5

An updated answer reflecting changed times:

GitPython is currently the easiest to use. It wraps many git plumbing commands and has a pluggable object database (Dulwich being one of them), and if a command isn’t implemented, it provides an easy API for shelling out to the command line. For example:

from git import Repo

repo = Repo('.')
repo.git.checkout(b='new_branch')

This calls:

bash$ git checkout -b new_branch

Dulwich is also good but much lower level. It’s somewhat painful to use because it requires operating on git objects at the plumbing level and doesn’t have the nice porcelain that you’d normally want. However, if you plan on modifying any parts of git, or use git-receive-pack and git-upload-pack, you need to use Dulwich.


Answer 6

For the sake of completeness, http://github.com/alex/pyvcs/ is an abstraction layer for all DVCSs. It uses Dulwich, but provides interop with the other DVCSs.


Answer 7

Here’s a really quick implementation of “git status”:

import os
from subprocess import PIPE, Popen

repoDir = '/Users/foo/project'

def command(x):
    # Run the command and return its stdout decoded as text
    return Popen(x.split(' '), stdout=PIPE).communicate()[0].decode()

def rm_empty(L): return [l for l in L if l]

def getUntracked():
    os.chdir(repoDir)
    status = command("git status")
    if "# Untracked files:" in status:
        untf = status.split("# Untracked files:")[1][1:].split("\n")
        return rm_empty([x[2:] for x in untf if x.strip() != "#" and x.startswith("#\t")])
    else:
        return []

def getNew():
    os.chdir(repoDir)
    status = command("git status").split("\n")
    return [x[14:] for x in status if x.startswith("#\tnew file:   ")]

def getModified():
    os.chdir(repoDir)
    status = command("git status").split("\n")
    return [x[14:] for x in status if x.startswith("#\tmodified:   ")]

print("Untracked:")
print( getUntracked() )
print("New:")
print( getNew() )
print("Modified:")
print( getModified() )
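
Parsing the human-readable output of git status is fragile, because its wording and layout differ between Git versions. A sketch of an alternative, offered as an assumption rather than as part of the original answer: git status --porcelain prints one entry per line as a two-character status code, a space, and the path. The function name parse_porcelain and the sample text are made up for illustration, and the sketch only handles untracked, added and modified entries (renames and quoted paths are ignored).

```python
def parse_porcelain(text):
    """Group paths from `git status --porcelain` output into untracked, new and modified."""
    result = {"untracked": [], "new": [], "modified": []}
    for line in text.splitlines():
        if len(line) < 4:
            continue  # skip blank or malformed lines
        code, path = line[:2], line[3:]
        if code == "??":
            result["untracked"].append(path)
        elif code[0] == "A":
            result["new"].append(path)
        elif "M" in code:
            result["modified"].append(path)
    return result

# Hand-written sample in porcelain format, standing in for real command output
sample = "?? notes.txt\nA  added.py\n M changed.py\nM  staged.py\n"
print(parse_porcelain(sample))
```

In a real script, the text argument would come from running git status --porcelain in the repository directory, for example through subprocess.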

Answer 8

PTBNL’s answer works perfectly for me. I extended it a little for Windows users.

import time
import subprocess
def gitAdd(fileName, repoDir):
    cmd = 'git add ' + fileName
    pipe = subprocess.Popen(cmd, shell=True, cwd=repoDir,stdout = subprocess.PIPE,stderr = subprocess.PIPE )
    (out, error) = pipe.communicate()
    print(out, error)
    pipe.wait()
    return 

def gitCommit(commitMessage, repoDir):
    cmd = 'git commit -am "%s"'%commitMessage
    pipe = subprocess.Popen(cmd, shell=True, cwd=repoDir,stdout = subprocess.PIPE,stderr = subprocess.PIPE )
    (out, error) = pipe.communicate()
    print(out, error)
    pipe.wait()
    return 
def gitPush(repoDir):
    cmd = 'git push '
    pipe = subprocess.Popen(cmd, shell=True, cwd=repoDir,stdout = subprocess.PIPE,stderr = subprocess.PIPE )
    (out, error) = pipe.communicate()
    pipe.wait()
    return 

temp=time.localtime(time.time())
uploaddate= str(temp[0])+'_'+str(temp[1])+'_'+str(temp[2])+'_'+str(temp[3])+'_'+str(temp[4])

repoDir='d:\\c_Billy\\vfat\\Programming\\Projector\\billyccm' # your git repository; on Windows, use double backslashes in the path
gitAdd('.',repoDir )
gitCommit(uploaddate, repoDir)
gitPush(repoDir)

Answer 9

The git interaction library part of StGit is actually pretty good. However, it isn’t broken out as a separate package, but if there is sufficient interest, I’m sure that can be fixed.

It has very nice abstractions for representing commits, trees, etc., and for creating new commits and trees.


Answer 10

For the record, none of the aforementioned Git Python libraries seem to contain a “git status” equivalent, which is really the only thing I would want, since dealing with the rest of the git commands via subprocess is so easy.


Sorting a Python list by two fields

Question: Sorting a Python list by two fields

I have the following list created from a sorted csv:

list1 = sorted(csv1, key=operator.itemgetter(1))

I would actually like to sort the list by two criteria: first by the value in field 1 and then by the value in field 2. How do I do this?


Answer 0

Like this:

import operator
list1 = sorted(csv1, key=operator.itemgetter(1, 2))
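
To make the effect concrete, here is a small worked example. The rows are hypothetical stand-ins for csv1, invented for illustration; each row is (id, field 1, field 2), and itemgetter(1, 2) builds the key (field 1, field 2) for every row.

```python
import operator

# Hypothetical rows standing in for csv1: (id, field 1, field 2)
csv1 = [
    ("a", "x", 2),
    ("b", "y", 1),
    ("c", "x", 1),
    ("d", "y", 0),
]

# Rows are compared by field 1 first; field 2 breaks ties
list1 = sorted(csv1, key=operator.itemgetter(1, 2))
for row in list1:
    print(row)
```

The rows with field 1 equal to "x" come first, ordered by field 2, followed by the "y" rows ordered the same way.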

Answer 1

No need to import anything when using lambda functions.
The following sorts list by the first element in ascending order, then by the second element in descending order (because of the negation).

sorted(list, key=lambda x: (x[0], -x[1]))
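
A short sketch of what this key does, using made-up sample tuples. Because the key negates the second element, rows that tie on the first element are ordered from the largest second element to the smallest; this trick only works when the second element is numeric.

```python
# Sample data invented for illustration
rows = [(1, 5), (0, 2), (1, 9), (0, 7)]

# Ascending by the first element, descending by the second
ordered = sorted(rows, key=lambda x: (x[0], -x[1]))
print(ordered)
```

Dropping the minus sign gives a plain ascending sort on both elements.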

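Answer 2

Python has a stable sort, so provided that performance isn’t an issue, the simplest way is to sort by field 2 and then sort again by field 1.

That will give you the result you want. The only catch is that if it is a big list (or you want to sort it often), calling sort twice might be an unacceptable overhead.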

list1 = sorted(csv1, key=operator.itemgetter(2))
list1 = sorted(list1, key=operator.itemgetter(1))

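Doing it this way also makes it easy to handle the situation where you want some of the columns reverse sorted: just include the ‘reverse=True’ parameter when necessary.

Otherwise you can pass multiple parameters to itemgetter or manually build a tuple. That is probably going to be faster, but it doesn’t generalise well if some of the columns need to be reverse sorted (numeric columns can still be reversed by negating them, but that stops the sort being stable).

So if you don’t need any columns reverse sorted, go for multiple arguments to itemgetter. If you might, and the columns aren’t numeric or you want to keep the sort stable, go for multiple consecutive sorts.

Edit: For the commenters who have problems understanding how this answers the original question, here is an example that shows exactly how the stable nature of the sorting ensures we can do separate sorts on each key and end up with data sorted on multiple criteria: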

DATA = [
    ('Jones', 'Jane', 58),
    ('Smith', 'Anne', 30),
    ('Jones', 'Fred', 30),
    ('Smith', 'John', 60),
    ('Smith', 'Fred', 30),
    ('Jones', 'Anne', 30),
    ('Smith', 'Jane', 58),
    ('Smith', 'Twin2', 3),
    ('Jones', 'John', 60),
    ('Smith', 'Twin1', 3),
    ('Jones', 'Twin1', 3),
    ('Jones', 'Twin2', 3)
]

# Sort by Surname, Age DESCENDING, Firstname
print("Initial data in random order")
for d in DATA:
    print("{:10s} {:10s} {}".format(*d))

print('''
First we sort by first name, after this pass all
Twin1 come before Twin2 and Anne comes before Fred''')
DATA.sort(key=lambda row: row[1])

for d in DATA:
    print("{:10s} {:10s} {}".format(*d))

print('''
Second pass: sort by age in descending order.
Note that after this pass rows are sorted by age but
Twin1/Twin2 and Anne/Fred pairs are still in correct
firstname order.''')
DATA.sort(key=lambda row: row[2], reverse=True)
for d in DATA:
    print("{:10s} {:10s} {}".format(*d))

print('''
Final pass sorts the Jones from the Smiths.
Within each family members are sorted by age but equal
age members are sorted by first name.
''')
DATA.sort(key=lambda row: row[0])
for d in DATA:
    print("{:10s} {:10s} {}".format(*d))

This is a runnable example, but to save people running it, the output is:

Initial data in random order
Jones      Jane       58
Smith      Anne       30
Jones      Fred       30
Smith      John       60
Smith      Fred       30
Jones      Anne       30
Smith      Jane       58
Smith      Twin2      3
Jones      John       60
Smith      Twin1      3
Jones      Twin1      3
Jones      Twin2      3

First we sort by first name, after this pass all
Twin1 come before Twin2 and Anne comes before Fred
Smith      Anne       30
Jones      Anne       30
Jones      Fred       30
Smith      Fred       30
Jones      Jane       58
Smith      Jane       58
Smith      John       60
Jones      John       60
Smith      Twin1      3
Jones      Twin1      3
Smith      Twin2      3
Jones      Twin2      3

Second pass: sort by age in descending order.
Note that after this pass rows are sorted by age but
Twin1/Twin2 and Anne/Fred pairs are still in correct
firstname order.
Smith      John       60
Jones      John       60
Jones      Jane       58
Smith      Jane       58
Smith      Anne       30
Jones      Anne       30
Jones      Fred       30
Smith      Fred       30
Smith      Twin1      3
Jones      Twin1      3
Smith      Twin2      3
Jones      Twin2      3

Final pass sorts the Jones from the Smiths.
Within each family members are sorted by age but equal
age members are sorted by first name.

Jones      John       60
Jones      Jane       58
Jones      Anne       30
Jones      Fred       30
Jones      Twin1      3
Jones      Twin2      3
Smith      John       60
Smith      Jane       58
Smith      Anne       30
Smith      Fred       30
Smith      Twin1      3
Smith      Twin2      3

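Note in particular how, in the second step, the reverse=True parameter keeps the first names in order, whereas simply sorting and then reversing the list would lose the desired order for the third sort key.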

Python has a stable sort, so provided that performance isn’t an issue, the simplest way is to sort by field 2 and then sort again by field 1.

That will give you the result you want. The only catch is that if it is a big list (or you want to sort it often), calling sort twice might be an unacceptable overhead.

import operator

list1 = sorted(csv1, key=operator.itemgetter(2))
list1 = sorted(list1, key=operator.itemgetter(1))

Doing it this way also makes it easy to handle the situation where you want some of the columns reverse sorted: just include the ‘reverse=True’ parameter when necessary.

Otherwise you can pass multiple parameters to itemgetter or manually build a tuple. That is probably going to be faster, but it doesn’t generalise well if some of the columns need to be reverse sorted (numeric columns can still be reversed by negating them, but that trick is not available for strings and other non-numeric keys).

So if you don’t need any columns reverse sorted, go for multiple arguments to itemgetter. If you might, and the columns aren’t numeric, go for multiple consecutive sorts.
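As a minimal sketch of the two approaches described above (the `rows` data and its column meanings are made up for illustration), a single sort on a tuple key gives the same order as two stable passes, and the two-pass form lets one column be reversed on its own:

```python
from operator import itemgetter

# Hypothetical rows: (name, group, tag)
rows = [("b", 2, "x"), ("a", 2, "y"), ("c", 1, "z"), ("a", 1, "w")]

# One pass with a tuple key: column 1 first, then column 0.
single_pass = sorted(rows, key=itemgetter(1, 0))

# Two stable passes: least significant key first, most significant last.
two_pass = sorted(rows, key=itemgetter(0))
two_pass = sorted(two_pass, key=itemgetter(1))

# Mixed directions: column 1 descending, column 0 ascending within ties.
mixed = sorted(rows, key=itemgetter(0))
mixed = sorted(mixed, key=itemgetter(1), reverse=True)

print(single_pass == two_pass)
print(mixed)
```

The order of the passes matters: the last sort applied is the primary key, because each later stable sort only preserves the earlier ordering among equal keys.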

Edit: For the commenters who have problems understanding how this answers the original question, here is an example that shows exactly how the stable nature of the sorting ensures we can do separate sorts on each key and end up with data sorted on multiple criteria:

DATA = [
    ('Jones', 'Jane', 58),
    ('Smith', 'Anne', 30),
    ('Jones', 'Fred', 30),
    ('Smith', 'John', 60),
    ('Smith', 'Fred', 30),
    ('Jones', 'Anne', 30),
    ('Smith', 'Jane', 58),
    ('Smith', 'Twin2', 3),
    ('Jones', 'John', 60),
    ('Smith', 'Twin1', 3),
    ('Jones', 'Twin1', 3),
    ('Jones', 'Twin2', 3)
]

# Sort by Surname, Age DESCENDING, Firstname
print("Initial data in random order")
for d in DATA:
    print("{:10s} {:10s} {}".format(*d))

print('''
First we sort by first name, after this pass all
Twin1 come before Twin2 and Anne comes before Fred''')
DATA.sort(key=lambda row: row[1])

for d in DATA:
    print("{:10s} {:10s} {}".format(*d))

print('''
Second pass: sort by age in descending order.
Note that after this pass rows are sorted by age but
Twin1/Twin2 and Anne/Fred pairs are still in correct
firstname order.''')
DATA.sort(key=lambda row: row[2], reverse=True)
for d in DATA:
    print("{:10s} {:10s} {}".format(*d))

print('''
Final pass sorts the Jones from the Smiths.
Within each family members are sorted by age but equal
age members are sorted by first name.
''')
DATA.sort(key=lambda row: row[0])
for d in DATA:
    print("{:10s} {:10s} {}".format(*d))

This is a runnable example, but to save people running it the output is:

Initial data in random order
Jones      Jane       58
Smith      Anne       30
Jones      Fred       30
Smith      John       60
Smith      Fred       30
Jones      Anne       30
Smith      Jane       58
Smith      Twin2      3
Jones      John       60
Smith      Twin1      3
Jones      Twin1      3
Jones      Twin2      3

First we sort by first name, after this pass all
Twin1 come before Twin2 and Anne comes before Fred
Smith      Anne       30
Jones      Anne       30
Jones      Fred       30
Smith      Fred       30
Jones      Jane       58
Smith      Jane       58
Smith      John       60
Jones      John       60
Smith      Twin1      3
Jones      Twin1      3
Smith      Twin2      3
Jones      Twin2      3

Second pass: sort by age in descending order.
Note that after this pass rows are sorted by age but
Twin1/Twin2 and Anne/Fred pairs are still in correct
firstname order.
Smith      John       60
Jones      John       60
Jones      Jane       58
Smith      Jane       58
Smith      Anne       30
Jones      Anne       30
Jones      Fred       30
Smith      Fred       30
Smith      Twin1      3
Jones      Twin1      3
Smith      Twin2      3
Jones      Twin2      3

Final pass sorts the Jones from the Smiths.
Within each family members are sorted by age but equal
age members are sorted by first name.

Jones      John       60
Jones      Jane       58
Jones      Anne       30
Jones      Fred       30
Jones      Twin1      3
Jones      Twin2      3
Smith      John       60
Smith      Jane       58
Smith      Anne       30
Smith      Fred       30
Smith      Twin1      3
Smith      Twin2      3

Note in particular how in the second step the reverse=True parameter keeps the firstnames in order whereas simply sorting then reversing the list would lose the desired order for the third sort key.
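
The difference between reverse=True and sorting then reversing can be seen on a tiny list. This is only an illustrative sketch; the `people` data is made up and starts out in first-name order:

```python
# Hypothetical data, already sorted by first name.
people = [("Anne", 30), ("Fred", 30), ("John", 60)]

# reverse=True keeps equal-age rows in their existing (first-name) order.
stable_desc = sorted(people, key=lambda p: p[1], reverse=True)

# Sorting ascending and then reversing the list flips the ties as well.
sort_then_reverse = sorted(people, key=lambda p: p[1])[::-1]

print(stable_desc)        # Anne still precedes Fred
print(sort_then_reverse)  # Fred now precedes Anne
```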


Answer 3

list1 = sorted(csv1, key=lambda x: (x[1], x[2]) )

Answer 4

employees.sort(key = lambda x:x[1])
employees.sort(key = lambda x:x[0])

We can also call .sort with a lambda twice, because Python’s sort is in place and stable. This first sorts the list by the second element, x[1]. Then it sorts by the first element, x[0] (highest priority). Each element x of employees is laid out as:

x[0] = Employee's Name
x[1] = Employee's Salary

This is equivalent to doing the following: employees.sort(key = lambda x:(x[0], x[1]))


Answer 5

In ascending order you can use:

sorted_data= sorted(non_sorted_data, key=lambda k: (k[1],k[0]))

or in descending order you can use:

sorted_data= sorted(non_sorted_data, key=lambda k: (k[1],k[0]),reverse=True)

Answer 6

The following sorts a list of dicts in descending order, using salary as the first key and age as the second:

d=[{'salary':123,'age':23},{'salary':123,'age':25}]
d=sorted(d, key=lambda i: (i['salary'], i['age']),reverse=True)

Output: [{'salary': 123, 'age': 25}, {'salary': 123, 'age': 23}]
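
Note that reverse=True reverses every key in the tuple at once. If only one key should be descending, one option for numeric fields is to negate that key. This is a small sketch, with a made-up `staff` list:

```python
# Hypothetical records for illustration.
staff = [
    {"salary": 123, "age": 25},
    {"salary": 200, "age": 40},
    {"salary": 123, "age": 23},
]

# Salary descending (negated key), age ascending within equal salaries.
ordered = sorted(staff, key=lambda d: (-d["salary"], d["age"]))

for row in ordered:
    print(row)
```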


Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

Question: Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?


Answer 0

Here’s a small kmeans that uses any of the 20-odd distances in scipy.spatial.distance, or a user function.
Comments would be welcome (this has had only one user so far, not enough); in particular, what are your N, dim, k, metric?

#!/usr/bin/env python
# kmeans.py using any of the 20-odd metrics in scipy.spatial.distance
# kmeanssample 2 pass, first sample sqrt(N)

from __future__ import division
import random
import numpy as np
from scipy.spatial.distance import cdist  # $scipy/spatial/distance.py
    # http://docs.scipy.org/doc/scipy/reference/spatial.html
from scipy.sparse import issparse  # $scipy/sparse/csr.py

__date__ = "2011-11-17 Nov denis"
    # X sparse, any cdist metric: real app ?
    # centres get dense rapidly, metrics in high dim hit distance whiteout
    # vs unsupervised / semi-supervised svm

#...............................................................................
def kmeans( X, centres, delta=.001, maxiter=10, metric="euclidean", p=2, verbose=1 ):
    """ centres, Xtocentre, distances = kmeans( X, initial centres ... )
    in:
        X N x dim  may be sparse
        centres k x dim: initial centres, e.g. random.sample( X, k )
        delta: relative error, iterate until the average distance to centres
            is within delta of the previous average distance
        maxiter
        metric: any of the 20-odd in scipy.spatial.distance
            "chebyshev" = max, "cityblock" = L1, "minkowski" with p=
            or a function( Xvec, centrevec ), e.g. Lqmetric below
        p: for minkowski metric -- local mod cdist for 0 < p < 1 too
        verbose: 0 silent, 2 prints running distances
    out:
        centres, k x dim
        Xtocentre: each X -> its nearest centre, ints N -> k
        distances, N
    see also: kmeanssample below, class Kmeans below.
    """
    if not issparse(X):
        X = np.asanyarray(X)  # ?
    centres = centres.todense() if issparse(centres) \
        else centres.copy()
    N, dim = X.shape
    k, cdim = centres.shape
    if dim != cdim:
        raise ValueError( "kmeans: X %s and centres %s must have the same number of columns" % (
            X.shape, centres.shape ))
    if verbose:
        print("kmeans: X %s  centres %s  delta=%.2g  maxiter=%d  metric=%s" % (
            X.shape, centres.shape, delta, maxiter, metric))
    allx = np.arange(N)
    prevdist = 0
    for jiter in range( 1, maxiter+1 ):
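        # p is only meaningful for the Minkowski metric, so pass it only then
        pkw = {"p": p} if metric == "minkowski" else {}
        D = cdist_sparse( X, centres, metric=metric, **pkw )  # |X| x |centres|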
        xtoc = D.argmin(axis=1)  # X -> nearest centre
        distances = D[allx,xtoc]
        avdist = distances.mean()  # median ?
        if verbose >= 2:
            print("kmeans: av |X - nearest centre| = %.4g" % avdist)
        if (1 - delta) * prevdist <= avdist <= prevdist \
        or jiter == maxiter:
            break
        prevdist = avdist
        for jc in range(k):  # (1 pass in C)
            c = np.where( xtoc == jc )[0]
            if len(c) > 0:
                centres[jc] = X[c].mean( axis=0 )
    if verbose:
        print("kmeans: %d iterations  cluster sizes: %s" % (jiter, np.bincount(xtoc)))
    if verbose >= 2:
        r50 = np.zeros(k)
        r90 = np.zeros(k)
        for j in range(k):
            dist = distances[ xtoc == j ]
            if len(dist) > 0:
                r50[j], r90[j] = np.percentile( dist, (50, 90) )
        print("kmeans: cluster 50 %% radius %s" % (r50.astype(int),))
        print("kmeans: cluster 90 %% radius %s" % (r90.astype(int),))
            # scale L1 / dim, L2 / sqrt(dim) ?
    return centres, xtoc, distances

#...............................................................................
def kmeanssample( X, k, nsample=0, **kwargs ):
    """ 2-pass kmeans, fast for large N:
        1) kmeans a random sample of nsample ~ sqrt(N) from X
        2) full kmeans, starting from those centres
    """
        # merge w kmeans ? mttiw
        # v large N: sample N^1/2, N^1/2 of that
        # seed like sklearn ?
    N, dim = X.shape
    if nsample == 0:
        nsample = max( 2*np.sqrt(N), 10*k )
    Xsample = randomsample( X, int(nsample) )
    pass1centres = randomsample( X, int(k) )
    samplecentres = kmeans( Xsample, pass1centres, **kwargs )[0]
    return kmeans( X, samplecentres, **kwargs )

def cdist_sparse( X, Y, **kwargs ):
    """ -> |X| x |Y| cdist array, any cdist metric
        X or Y may be sparse -- best csr
    """
        # todense row at a time, v slow if both v sparse
    sxy = 2*issparse(X) + issparse(Y)
    if sxy == 0:
        return cdist( X, Y, **kwargs )
    d = np.empty( (X.shape[0], Y.shape[0]), np.float64 )
    if sxy == 2:
        for j, x in enumerate(X):
            d[j] = cdist( x.todense(), Y, **kwargs ) [0]
    elif sxy == 1:
        for k, y in enumerate(Y):
            # cdist returns an N x 1 array here; take its column, not its first row
            d[:,k] = cdist( X, y.todense(), **kwargs ) [:,0]
    else:
        for j, x in enumerate(X):
            for k, y in enumerate(Y):
                d[j,k] = cdist( x.todense(), y.todense(), **kwargs ) [0,0]
    return d

def randomsample( X, n ):
    """ random.sample of the rows of X
        X may be sparse -- best csr
    """
    sampleix = random.sample( range( X.shape[0] ), int(n) )
    return X[sampleix]

def nearestcentres( X, centres, metric="euclidean", p=2 ):
    """ each X -> nearest centre, any metric
            euclidean2 (~ withinss) is more sensitive to outliers,
            cityblock (manhattan, L1) less sensitive
    """
    pkw = {"p": p} if metric == "minkowski" else {}  # p applies only to Minkowski
    D = cdist( X, centres, metric=metric, **pkw )  # |X| x |centres|
    return D.argmin(axis=1)

def Lqmetric( x, y=None, q=.5 ):
    # yes a metric, may increase weight of near matches; see ...
    return (np.abs(x - y) ** q) .mean() if y is not None \
        else (np.abs(x) ** q) .mean()

#...............................................................................
class Kmeans:
    """ km = Kmeans( X, k= or centres=, ... )
        in: either initial centres= for kmeans
            or k= [nsample=] for kmeanssample
        out: km.centres, km.Xtocentre, km.distances
        iterator:
            for jcentre, J in km:
                clustercentre = centres[jcentre]
                J indexes e.g. X[J], classes[J]
    """
    def __init__( self, X, k=0, centres=None, nsample=0, **kwargs ):
        self.X = X
        if centres is None:
            self.centres, self.Xtocentre, self.distances = kmeanssample(
                X, k=k, nsample=nsample, **kwargs )
        else:
            self.centres, self.Xtocentre, self.distances = kmeans(
                X, centres, **kwargs )

    def __iter__(self):
        for jc in range(len(self.centres)):
            yield jc, (self.Xtocentre == jc)

#...............................................................................
if __name__ == "__main__":
    import random
    import sys
    from time import time

    N = 10000
    dim = 10
    ncluster = 10
    kmsample = 100  # 0: random centres, > 0: kmeanssample
    kmdelta = .001
    kmiter = 10
    metric = "cityblock"  # "chebyshev" = max, "cityblock" L1,  Lqmetric
    seed = 1

    exec( "\n".join( sys.argv[1:] ))  # run this.py N= ...
    np.set_printoptions( 1, threshold=200, edgeitems=5, suppress=True )
    np.random.seed(seed)
    random.seed(seed)

    print("N %d  dim %d  ncluster %d  kmsample %d  metric %s" % (
        N, dim, ncluster, kmsample, metric))
    X = np.random.exponential( size=(N,dim) )
        # cf scikits-learn datasets/
    t0 = time()
    if kmsample > 0:
        centres, xtoc, dist = kmeanssample( X, ncluster, nsample=kmsample,
            delta=kmdelta, maxiter=kmiter, metric=metric, verbose=2 )
    else:
        randomcentres = randomsample( X, ncluster )
        centres, xtoc, dist = kmeans( X, randomcentres,
            delta=kmdelta, maxiter=kmiter, metric=metric, verbose=2 )
    print("%.0f msec" % ((time() - t0) * 1000))

    # also ~/py/np/kmeans/test-kmeans.py

Some notes added 26 Mar 2012:

1) For cosine distance, first normalize all the data vectors to |X| = 1; then

cosinedistance( X, Y ) = 1 - X . Y = Euclidean distance |X - Y|^2 / 2

is fast to compute. For bit vectors, keep the norms separately from the vectors instead of expanding out to floats (although some programs may expand for you). For sparse vectors, say with 1 % of the N entries nonzero, X . Y should take time O( 2 % N ), space O(N); but I don’t know which programs do that.

2) Scikit-learn clustering gives an excellent overview of k-means, mini-batch-k-means … with code that works on scipy.sparse matrices.

3) Always check cluster sizes after k-means. If you’re expecting roughly equal-sized clusters, but they come out [44 37 9 5 5] % … (sound of head-scratching).
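
The identity in note 1 can be checked numerically. The following is only a sketch on random data (the array shape and seed are arbitrary assumptions), not part of the kmeans code above:

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed for reproducibility
X = rng.normal(size=(5, 3))              # 5 made-up vectors in 3 dimensions
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize rows to |X| = 1

# Cosine distance between every pair of unit vectors: 1 - X . Y
cos_dist = 1.0 - X @ X.T

# Squared Euclidean distance between every pair: |X - Y|^2
diff = X[:, None, :] - X[None, :, :]
sq_euclid = (diff ** 2).sum(axis=-1)

# For unit vectors, |X - Y|^2 = 2 - 2 X . Y, so the two agree up to a factor of 2.
print(np.allclose(cos_dist, sq_euclid / 2))
```

Because the two distances differ only by a constant factor on normalized data, they rank the centres identically, so nearest-centre assignments are the same under either one.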


Answer 1

Unfortunately no: scikit-learn’s current implementation of k-means only uses Euclidean distances.

It is not trivial to extend k-means to other distances, and denis’ answer above is not the correct way to implement k-means for other metrics.


Answer 2

Just use nltk instead, which lets you do this, e.g.

from nltk.cluster.kmeans import KMeansClusterer
from nltk.cluster.util import cosine_distance

NUM_CLUSTERS = <choose a value>
data = <sparse matrix that you would normally give to scikit>.toarray()

kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(data, assign_clusters=True)

Answer 3

Yes, you can use a different metric function; however, by definition, the k-means clustering algorithm relies on the Euclidean distance from the mean of each cluster.

You could use a different metric, so even though you are still calculating the mean you could use something like the Mahalanobis distance.


Answer 4

There is pyclustering, which is Python/C++ (so it’s fast!) and lets you specify a custom metric function:

from pyclustering.cluster.kmeans import kmeans
from pyclustering.utils.metric import type_metric, distance_metric

user_function = lambda point1, point2: point1[0] + point2[0] + 2
metric = distance_metric(type_metric.USER_DEFINED, func=user_function)

# create K-Means algorithm with specific distance metric
# (`sample` is your own data, a list of points; it is not defined in this snippet)
start_centers = [[4.7, 5.9], [5.7, 6.5]]
kmeans_instance = kmeans(sample, start_centers, metric=metric)

# run cluster analysis and obtain results
kmeans_instance.process()
clusters = kmeans_instance.get_clusters()

Actually, I haven’t tested this code, but cobbled it together from a ticket and example code.


Answer 5

The k-means in Spectral Python allows the use of L1 (Manhattan) distance.


Answer 6

Sklearn KMeans uses the Euclidean distance. It has no metric parameter. That said, if you’re clustering time series, you can use the tslearn Python package, where you can specify a metric (dtw, softdtw, euclidean).


Usage of sys.stdout.flush() method

Question: Usage of sys.stdout.flush() method

What does sys.stdout.flush() do?


Answer 0

Python’s standard out is buffered (meaning that it collects some of the data “written” to standard out before it writes it to the terminal). Calling sys.stdout.flush() forces it to “flush” the buffer, meaning that it will write everything in the buffer to the terminal, even if normally it would wait before doing so.

Here’s some good information about (un)buffered I/O and why it’s useful:
http://en.wikipedia.org/wiki/Data_buffer
Buffered vs unbuffered IO


Answer 1

Consider the following simple Python script:

import time
import sys

for i in range(5):
    print(i, end=" ")
    #sys.stdout.flush()
    time.sleep(1)

This is designed to print one number every second for five seconds, but if you run it as it is now (depending on your default system buffering) you may not see any output until the script completes, and then all at once you will see 0 1 2 3 4 printed to the screen.

This is because the output is being buffered, and unless you flush sys.stdout after each print you won’t see the output immediately. Remove the comment from the sys.stdout.flush() line to see the difference.
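
In Python 3, print() also accepts a flush keyword argument, which has the same effect as calling sys.stdout.flush() after each print. As a small sketch (the `countdown` helper and its parameters are made up for illustration):

```python
import sys
import time

def countdown(n, delay=0.0, out=sys.stdout):
    """Print 0..n-1 on one line, flushing the stream after every number."""
    for i in range(n):
        # flush=True pushes the buffered text out immediately
        print(i, end=" ", file=out, flush=True)
        time.sleep(delay)
    print(file=out)  # finish the line

countdown(5, delay=0.2)
```

Each number should appear as soon as it is printed rather than all at once when the script ends.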


Answer 2

As I understand it, whenever we execute print statements the output is written to a buffer, and we see the output on screen when the buffer gets flushed (cleared). By default the buffer is flushed when the program exits, but we can also flush the buffer manually by using the sys.stdout.flush() statement in the program. In the code below, the buffer is flushed when the value of i reaches 5.

You can see this by executing the code below.

chiru@online:~$ cat flush.py
import time
import sys

for i in range(10):
    print i,
    if i == 5:
        print "Flushing buffer"
        sys.stdout.flush()
    time.sleep(1)

for i in range(10):
    print i,
    if i == 5:
        print "Flushing buffer"
        sys.stdout.flush()
chiru@online:~$ python flush.py 
0 1 2 3 4 5 Flushing buffer
6 7 8 9 0 1 2 3 4 5 Flushing buffer
6 7 8 9

Answer 3

import sys

for x in range(10000):
    # "\r" returns to the start of the line, so each value overwrites the last
    print("HAPPY >> %s <<\r" % x, end="")
    sys.stdout.flush()

Answer 4

As I understand it, sys.stdout.flush() pushes out all the data that has been buffered up to that point to the file object. When using stdout, data is stored in buffer memory (for some time, or until the memory fills up) before it gets written to the terminal. Calling flush() forces the buffer to be emptied and written to the terminal even before the buffer is full.


What is the difference between json.load() and json.loads() functions

Question: What is the difference between json.load() and json.loads() functions

In Python, what is the difference between json.load() and json.loads()?

I guess that the load() function must be used with a file object (so I need to use a context manager), while the loads() function takes the path to the file as a string. It is a bit confusing.

Does the letter “s” in json.loads() stand for string?

Thanks a lot for your answers!


Answer 0

Yes, s stands for string. The json.loads function does not take the file path, but the file contents as a string. Look at the documentation at https://docs.python.org/2/library/json.html!


Answer 1

Just going to add a simple example to what everyone has explained,

json.load()

json.load can deserialize a file itself, i.e. it accepts a file object. For example,

import json

# open a JSON file for reading and print its content using json.load
with open("/xyz/json_data.json", "r") as content:
  print(json.load(content))

will output,

{u'event': {u'id': u'5206c7e2-da67-42da-9341-6ea403c632c7', u'name': u'Sufiyan Ghori'}}

If I use json.loads to open a file instead,

# you cannot use json.loads on file object
with open("json_data.json", "r") as content:
  print(json.loads(content))

I would get this error:

TypeError: expected string or buffer

json.loads()

json.loads() deserializes a string.

So in order to use json.loads, I have to pass it the content of the file using the read() function.

Using content.read() with json.loads() returns the content of the file:

with open("json_data.json", "r") as content:
  print(json.loads(content.read()))

Output,

{u'event': {u'id': u'5206c7e2-da67-42da-9341-6ea403c632c7', u'name': u'Sufiyan Ghori'}}

That’s because the type of content.read() is a string, i.e. <type 'str'>.

If I use json.load() with content.read(), I will get an error,

with open("json_data.json", "r") as content:
  print(json.load(content.read()))

Gives,

AttributeError: ‘str’ object has no attribute ‘read’

So, now you know that json.load deserializes a file and json.loads deserializes a string.

Another example,

sys.stdin returns a file object, so if I do print(json.load(sys.stdin)), I will get the actual JSON data,

cat json_data.json | ./test.py

{u'event': {u'id': u'5206c7e2-da67-42da-9341-6ea403c632c7', u'name': u'Sufiyan Ghori'}}

If I want to use json.loads(), I would do print(json.loads(sys.stdin.read())) instead.
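
The outputs above come from Python 2. As a rough Python 3 sketch of the same distinction, the snippet below uses io.StringIO as a stand-in for an open file (an assumption, so that no file on disk is needed); the sample document and variable names are made up for illustration.

```python
import io
import json

text = '{"event": {"id": 1, "name": "example"}}'

# json.loads takes the JSON document itself (str, bytes or bytearray).
from_string = json.loads(text)

# json.load takes any object with a .read() method; StringIO plays the file here.
fake_file = io.StringIO(text)
from_file = json.load(fake_file)

print(from_string == from_file)

# Passing the wrong kind of argument fails in each direction.
try:
    json.loads(io.StringIO(text))
except TypeError as exc:
    loads_error = exc          # a file-like object is not str/bytes/bytearray

try:
    json.load(text)
except AttributeError as exc:
    load_error = exc           # a str has no .read() method

print(type(loads_error).__name__, type(load_error).__name__)
```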


Answer 2

Documentation is quite clear: https://docs.python.org/2/library/json.html

json.load(fp[, encoding[, cls[, object_hook[, parse_float[, parse_int[, parse_constant[, object_pairs_hook[, **kw]]]]]]]])

Deserialize fp (a .read()-supporting file-like object containing a JSON document) to a Python object using this conversion table.

json.loads(s[, encoding[, cls[, object_hook[, parse_float[, parse_int[, parse_constant[, object_pairs_hook[, **kw]]]]]]]])

Deserialize s (a str or unicode instance containing a JSON document) to a Python object using this conversion table.

So load is for a file, and loads is for a string.


Answer 3

QUICK ANSWER (very simplified!)

json.load() takes a FILE

json.load() expects a file (file object), e.g. a file you previously opened from a path such as 'files/example.json'.


json.loads() takes a STRING

json.loads() expects a (valid) JSON string – i.e. {"foo": "bar"}


EXAMPLES

Assuming you have a file example.json with this content: {"key_1": 1, "key_2": "foo", "Key_3": null}

>>> import json
>>> file = open("example.json")

>>> type(file)
<class '_io.TextIOWrapper'>

>>> file
<_io.TextIOWrapper name='example.json' mode='r' encoding='UTF-8'>

>>> json.load(file)
{'key_1': 1, 'key_2': 'foo', 'Key_3': None}

>>> json.loads(file)
Traceback (most recent call last):
  File "/usr/local/python/Versions/3.7/lib/python3.7/json/__init__.py", line 341, in loads
TypeError: the JSON object must be str, bytes or bytearray, not TextIOWrapper


>>> string = '{"foo": "bar"}'

>>> type(string)
<class 'str'>

>>> string
'{"foo": "bar"}'

>>> json.loads(string)
{'foo': 'bar'}

>>> json.load(string)
Traceback (most recent call last):
  File "/usr/local/python/Versions/3.7/lib/python3.7/json/__init__.py", line 293, in load
    return loads(fp.read(),
AttributeError: 'str' object has no attribute 'read'

Answer 4

The json.load() method (without “s” in “load”) can read a file directly:

import json
with open('strings.json') as f:
    d = json.load(f)
    print(d)

The json.loads() method is used for string arguments only.

import json

person = '{"name": "Bob", "languages": ["English", "French"]}'
print(type(person))
# Output: <class 'str'>

person_dict = json.loads(person)
print(person_dict)
# Output: {'name': 'Bob', 'languages': ['English', 'French']}

print(type(person_dict))
# Output: <class 'dict'>

Here we can see that loads() takes a string (type str) as input and returns a dictionary.


Answer 5

In Python 3.7.7, json.load is defined as follows in the CPython source code:

def load(fp, *, cls=None, object_hook=None, parse_float=None,
        parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):

    return loads(fp.read(),
        cls=cls, object_hook=object_hook,
        parse_float=parse_float, parse_int=parse_int,
        parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)

json.load actually calls json.loads and uses fp.read() as the first argument.

So if your code is:

with open (file) as fp:
    s = fp.read()
    json.loads(s)

It is equivalent to this:

with open (file) as fp:
    json.load(fp)

But if you need to limit how much is read from the file, e.g. fp.read(10), or the string/bytes you want to deserialize do not come from a file, you should use json.loads().

As for json.loads(), it deserializes not only strings but also bytes. If s is bytes or bytearray, it is decoded to a string first. You can also see this in the source code.

def loads(s, *, encoding=None, cls=None, object_hook=None, parse_float=None,
        parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
    """Deserialize ``s`` (a ``str``, ``bytes`` or ``bytearray`` instance
    containing a JSON document) to a Python object.

    ...

    """
    if isinstance(s, str):
        if s.startswith('\ufeff'):
            raise JSONDecodeError("Unexpected UTF-8 BOM (decode using utf-8-sig)",
                                  s, 0)
    else:
        if not isinstance(s, (bytes, bytearray)):
            raise TypeError(f'the JSON object must be str, bytes or bytearray, '
                            f'not {s.__class__.__name__}')
        s = s.decode(detect_encoding(s), 'surrogatepass')
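
The two branches in the source above can be exercised directly. This is a small sketch, assuming Python 3.6 or newer (where json.loads accepts bytes); the sample document and the variable names are illustrative only.

```python
import codecs
import json

text = '{"name": "caf\u00e9"}'   # a JSON document containing a non-ASCII character

from_str = json.loads(text)
from_bytes = json.loads(text.encode("utf-8"))          # bytes are decoded first
from_bytearray = json.loads(bytearray(text, "utf-8"))  # so are bytearrays

# In bytes input a UTF-8 BOM is detected and skipped during decoding ...
with_bom = json.loads(codecs.BOM_UTF8 + text.encode("utf-8"))

# ... whereas a str that starts with a BOM hits the JSONDecodeError branch.
try:
    json.loads("\ufeff" + text)
    bom_in_str_rejected = False
except json.JSONDecodeError:
    bom_in_str_rejected = True

print(from_str == from_bytes == from_bytearray == with_bom, bom_in_str_rejected)
```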


Removing white space around a saved image in matplotlib

Question: Removing white space around a saved image in matplotlib

I need to take an image and save it after some processing. The figure looks fine when I display it, but after saving it there is some white space around the saved image. I have tried the 'tight' option of the savefig method, but that did not work either. The code:

  import matplotlib.image as mpimg
  import matplotlib.pyplot as plt

  fig = plt.figure(1)
  img = mpimg.imread(path)
  plt.imshow(img)
  ax=fig.add_subplot(1,1,1)

  extent = ax.get_window_extent().transformed(fig.dpi_scale_trans.inverted())
  plt.savefig('1.png', bbox_inches=extent)

  plt.axis('off') 
  plt.show()

I am trying to draw a basic graph on a figure using NetworkX and save it. I realized that it works without the graph, but when I add a graph I get white space around the saved image:

import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import networkx as nx

G = nx.Graph()
G.add_node(1)
G.add_node(2)
G.add_node(3)
G.add_edge(1,3)
G.add_edge(1,2)
pos = {1:[100,120], 2:[200,300], 3:[50,75]}

fig = plt.figure(1)
img = mpimg.imread("C:\\images\\1.jpg")
plt.imshow(img)
ax=fig.add_subplot(1,1,1)

nx.draw(G, pos=pos)

extent = ax.get_window_extent().transformed(fig.dpi_scale_trans.inverted())
plt.savefig('1.png', bbox_inches = extent)

plt.axis('off') 
plt.show()

Answer 0

I cannot claim I know exactly why or how my “solution” works, but this is what I had to do when I wanted to plot the outline of a couple of aerofoil sections — without white margins — to a PDF file. (Note that I used matplotlib inside an IPython notebook, with the -pylab flag.)

plt.gca().set_axis_off()
plt.subplots_adjust(top = 1, bottom = 0, right = 1, left = 0, 
            hspace = 0, wspace = 0)
plt.margins(0,0)
plt.gca().xaxis.set_major_locator(plt.NullLocator())
plt.gca().yaxis.set_major_locator(plt.NullLocator())
plt.savefig("filename.pdf", bbox_inches = 'tight',
    pad_inches = 0)

I have tried deactivating different parts of this, but that always led to a white margin somewhere. You may even have to modify this to keep fat lines near the limits of the figure from being shaved off by the lack of margins.


Answer 1

You can remove the white space padding by setting bbox_inches="tight" in savefig:

plt.savefig("test.png",bbox_inches='tight')

You have to pass the bbox_inches argument as a string; perhaps this is why it didn’t work for you earlier.


Possible duplicates:

Matplotlib plots: removing axis, legends and white spaces

How to set the margins for a matplotlib figure?

Reduce left and right margins in matplotlib plot


Answer 2

After trying the above answers with no success (and a slew of other stack posts) what finally worked for me was just

plt.gca().set_axis_off()
plt.subplots_adjust(top = 1, bottom = 0, right = 1, left = 0, 
            hspace = 0, wspace = 0)
plt.margins(0,0)
plt.savefig("myfig.pdf")

Importantly this does not include the bbox or padding arguments.


Answer 3

I found something from Arvind Pereira (http://robotics.usc.edu/~ampereir/wordpress/?p=626) that seemed to work for me:

plt.savefig(filename, transparent = True, bbox_inches = 'tight', pad_inches = 0)

Answer 4

The following function incorporates johannes-s's answer above. I have tested it with plt.figure and plt.subplots() with multiple axes, and it works nicely.

def save(filepath, fig=None):
    r'''Save the current figure with no whitespace.
    Example filepath: "myfig.png" or r"C:\myfig.pdf"
    '''
    import matplotlib.pyplot as plt
    if fig is None:
        fig = plt.gcf()

    # adjust the figure being saved, not whichever figure happens to be current
    fig.subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=0, hspace=0)
    for ax in fig.axes:
        ax.axis('off')
        ax.margins(0, 0)
        ax.xaxis.set_major_locator(plt.NullLocator())
        ax.yaxis.set_major_locator(plt.NullLocator())
    fig.savefig(filepath, pad_inches=0, bbox_inches='tight')

Answer 5

I found that the following code works perfectly for the job.

fig = plt.figure(figsize=[6,6])
ax = fig.add_subplot(111)
ax.imshow(data)
ax.axes.get_xaxis().set_visible(False)
ax.axes.get_yaxis().set_visible(False)
ax.set_frame_on(False)
plt.savefig('data.png', dpi=400, bbox_inches='tight',pad_inches=0)

Answer 6

I followed this sequence and it worked like a charm.

plt.axis("off")
im = plt.imshow(image_array, interpolation='nearest')  # image_array: your image data
im.axes.get_xaxis().set_visible(False)
im.axes.get_yaxis().set_visible(False)
plt.savefig('destination_path.pdf',
    bbox_inches='tight', pad_inches=0, format='pdf', dpi=1200)

Answer 7

For anyone who wants to work in pixels rather than inches, this will work.

In addition to the usual imports, you will also need

from matplotlib.transforms import Bbox

Then you can use the following:

my_dpi = 100 # Good default - doesn't really matter

# Size of output in pixels
h = 224
w = 224

fig, ax = plt.subplots(1, figsize=(w/my_dpi, h/my_dpi), dpi=my_dpi)

ax.set_position([0, 0, 1, 1]) # Critical!

# Do some stuff
ax.imshow(img)
ax.imshow(heatmap) # 4-channel RGBA
ax.plot([50, 100, 150], [50, 100, 150], color="red")

ax.axis("off")

fig.savefig("saved_img.png",
            bbox_inches=Bbox([[0, 0], [w/my_dpi, h/my_dpi]]),
            dpi=my_dpi)
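
One way to check the pixel arithmetic is to read the size back from the saved PNG. The sketch below is a variation on the idea above, not the answer's exact code: it assumes the headless Agg backend, random placeholder image data, a temporary output path, and axes created with fig.add_axes([0, 0, 1, 1]) instead of a Bbox passed to savefig.

```python
import os
import struct
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

dpi = 100
width_px, height_px = 300, 200
img = np.random.rand(height_px, width_px)  # placeholder image data

# Figure size in inches is pixels divided by dpi.
fig = plt.figure(figsize=(width_px / dpi, height_px / dpi), dpi=dpi)
ax = fig.add_axes([0, 0, 1, 1])   # axes cover the entire figure, no margins
ax.imshow(img, aspect="auto")
ax.axis("off")

out_path = os.path.join(tempfile.mkdtemp(), "no_border.png")
fig.savefig(out_path, dpi=dpi)
plt.close(fig)

# A PNG stores width and height as big-endian 32-bit integers at bytes 16..23.
with open(out_path, "rb") as fh:
    header = fh.read(24)
saved_size = struct.unpack(">II", header[16:24])
print(saved_size)
```

If the saved size matches (width_px, height_px), the figure was written without extra padding.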


Answer 8

A much simpler approach I found is to use plt.imsave:

import matplotlib.pyplot as plt
arr = plt.imread(path)
plt.imsave('test.png', arr)

Answer 9

You may try this. It solved my issue.

import matplotlib.image as mpimg
img = mpimg.imread("src.png")
cmap = "gray"  # assumption: any colormap name; only applied to 2-D (grayscale) arrays
mpimg.imsave("out.png", img, cmap=cmap)

Answer 10

The most straightforward method is to use plt.tight_layout, which is actually preferable because it doesn’t do unnecessary cropping when using plt.savefig:

import matplotlib.pyplot as plt
plt.plot([1,2,3], [1,2,3])
plt.tight_layout(pad=0)
plt.savefig('plot.png')

However, this may not be preferable for complex plots that modify the figure. Refer to the top answers that use plt.subplots_adjust if that’s the case.


Answer 11

This works for me when saving a numpy array plotted with imshow to a file:

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10,10))
plt.imshow(img) # your image here
plt.axis("off")
plt.subplots_adjust(top = 1, bottom = 0, right = 1, left = 0, 
        hspace = 0, wspace = 0)
plt.savefig("example2.png", bbox_inches='tight', pad_inches=0, dpi=100)
plt.show()

Calculate difference in keys contained in two Python dictionaries

Question: Calculate difference in keys contained in two Python dictionaries

Suppose I have two Python dictionaries – dictA and dictB. I need to find out if there are any keys which are present in dictB but not in dictA. What is the fastest way to go about it?

Should I convert the dictionary keys into sets and go from there?

Interested in knowing your thoughts…


Thanks for your responses.

Apologies for not stating my question properly. My scenario is this: dictA may be the same as dictB, may be missing some keys compared to dictB, or the values of some keys may differ, in which case they have to be set to the dictA key’s value.

The problem is that the dictionary has no standard structure and can have values that are themselves dicts of dicts.

Say

dictA={'key1':a, 'key2':b, 'key3':{'key11':cc, 'key12':dd}, 'key4':{'key111':{....}}}
dictB={'key1':a, 'key2:':newb, 'key3':{'key11':cc, 'key12':newdd, 'key13':ee}.......

So ‘key2’ value has to be reset to the new value and ‘key13’ has to be added inside the dict. The key value does not have a fixed format. It can be a simple value or a dict or a dict of dict.


Answer 0

You can use set operations on the keys:

diff = set(dictB.keys()) - set(dictA.keys())
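
In Python 3, dictionary key views already support set operations, so the explicit set() conversion is optional. A minimal sketch, using made-up sample dictionaries dict_a and dict_b:

```python
dict_a = {"a": 1, "b": 1, "c": 0}
dict_b = {"a": 1, "b": 2, "d": 0}

# Key views behave like sets, so set operators work on them directly.
only_in_b = dict_b.keys() - dict_a.keys()
only_in_a = dict_a.keys() - dict_b.keys()
in_both = dict_a.keys() & dict_b.keys()

# Keys present in both dictionaries whose values differ.
changed = {key for key in in_both if dict_a[key] != dict_b[key]}

print(only_in_b, only_in_a, changed)
```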

Here is a class to find all the possibilities: what was added, what was removed, which key-value pairs are the same, and which key-value pairs are changed.

class DictDiffer(object):
    """
    Calculate the difference between two dictionaries as:
    (1) items added
    (2) items removed
    (3) keys same in both but changed values
    (4) keys same in both and unchanged values
    """
    def __init__(self, current_dict, past_dict):
        self.current_dict, self.past_dict = current_dict, past_dict
        self.set_current, self.set_past = set(current_dict.keys()), set(past_dict.keys())
        self.intersect = self.set_current.intersection(self.set_past)
    def added(self):
        return self.set_current - self.intersect 
    def removed(self):
        return self.set_past - self.intersect 
    def changed(self):
        return set(o for o in self.intersect if self.past_dict[o] != self.current_dict[o])
    def unchanged(self):
        return set(o for o in self.intersect if self.past_dict[o] == self.current_dict[o])

Here is some sample output (Python 2 syntax):

>>> a = {'a': 1, 'b': 1, 'c': 0}
>>> b = {'a': 1, 'b': 2, 'd': 0}
>>> d = DictDiffer(b, a)
>>> print "Added:", d.added()
Added: set(['d'])
>>> print "Removed:", d.removed()
Removed: set(['c'])
>>> print "Changed:", d.changed()
Changed: set(['b'])
>>> print "Unchanged:", d.unchanged()
Unchanged: set(['a'])

Available as a github repo: https://github.com/hughdbrown/dictdiffer
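
The class above compares only top-level keys. For nested dictionaries like those in the question, the same idea can be applied recursively. The following is a rough standard-library sketch rather than part of the linked repository; the function name diff_nested, the path format, and the string values standing in for the question's undefined a, b, cc and so on are all assumptions.

```python
def diff_nested(old, new, path=""):
    """Yield (kind, path, old_value, new_value) tuples for nested dicts."""
    for key in old.keys() | new.keys():
        here = f"{path}/{key}"
        if key not in old:
            yield ("added", here, None, new[key])
        elif key not in new:
            yield ("removed", here, old[key], None)
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            # Both sides are dicts: descend and compare their contents.
            yield from diff_nested(old[key], new[key], here)
        elif old[key] != new[key]:
            yield ("changed", here, old[key], new[key])


dict_a = {"key1": "a", "key2": "b", "key3": {"key11": "cc", "key12": "dd"}}
dict_b = {"key1": "a", "key2": "newb",
          "key3": {"key11": "cc", "key12": "newdd", "key13": "ee"}}

# Sort for a stable order; paths are unique, so values are never compared.
changes = sorted(diff_nested(dict_a, dict_b))
for change in changes:
    print(change)
```

Each reported path can then be used to decide which value to copy from one dictionary into the other.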


Answer 1

In case you want the difference recursively, I have written a package for Python: https://github.com/seperman/deepdiff

Installation

Install from PyPi:

pip install deepdiff

Example usage

Importing

>>> from deepdiff import DeepDiff
>>> from pprint import pprint
>>> from __future__ import print_function # In case running on Python 2

Same object returns empty

>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = t1
>>> print(DeepDiff(t1, t2))
{}

Type of an item has changed

>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = {1:1, 2:"2", 3:3}
>>> pprint(DeepDiff(t1, t2), indent=2)
{ 'type_changes': { 'root[2]': { 'newtype': <class 'str'>,
                                 'newvalue': '2',
                                 'oldtype': <class 'int'>,
                                 'oldvalue': 2}}}

Value of an item has changed

>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = {1:1, 2:4, 3:3}
>>> pprint(DeepDiff(t1, t2), indent=2)
{'values_changed': {'root[2]': {'newvalue': 4, 'oldvalue': 2}}}

Item added and/or removed

>>> t1 = {1:1, 2:2, 3:3, 4:4}
>>> t2 = {1:1, 2:4, 3:3, 5:5, 6:6}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff)
{'dic_item_added': ['root[5]', 'root[6]'],
 'dic_item_removed': ['root[4]'],
 'values_changed': {'root[2]': {'newvalue': 4, 'oldvalue': 2}}}

String difference

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world"}}
>>> t2 = {1:1, 2:4, 3:3, 4:{"a":"hello", "b":"world!"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'values_changed': { 'root[2]': {'newvalue': 4, 'oldvalue': 2},
                      "root[4]['b']": { 'newvalue': 'world!',
                                        'oldvalue': 'world'}}}

String difference 2

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world!\nGoodbye!\n1\n2\nEnd"}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world\n1\n2\nEnd"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'values_changed': { "root[4]['b']": { 'diff': '--- \n'
                                                '+++ \n'
                                                '@@ -1,5 +1,4 @@\n'
                                                '-world!\n'
                                                '-Goodbye!\n'
                                                '+world\n'
                                                ' 1\n'
                                                ' 2\n'
                                                ' End',
                                        'newvalue': 'world\n1\n2\nEnd',
                                        'oldvalue': 'world!\n'
                                                    'Goodbye!\n'
                                                    '1\n'
                                                    '2\n'
                                                    'End'}}}

>>> 
>>> print (ddiff['values_changed']["root[4]['b']"]["diff"])
--- 
+++ 
@@ -1,5 +1,4 @@
-world!
-Goodbye!
+world
 1
 2
 End

Type change

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world\n\n\nEnd"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'type_changes': { "root[4]['b']": { 'newtype': <class 'str'>,
                                      'newvalue': 'world\n\n\nEnd',
                                      'oldtype': <class 'list'>,
                                      'oldvalue': [1, 2, 3]}}}

List difference

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3, 4]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{'iterable_item_removed': {"root[4]['b'][2]": 3, "root[4]['b'][3]": 4}}

List difference 2:

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 3, 2, 3]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'iterable_item_added': {"root[4]['b'][3]": 3},
  'values_changed': { "root[4]['b'][1]": {'newvalue': 3, 'oldvalue': 2},
                      "root[4]['b'][2]": {'newvalue': 2, 'oldvalue': 3}}}

List difference ignoring order or duplicates: (with the same dictionaries as above)

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 3, 2, 3]}}
>>> ddiff = DeepDiff(t1, t2, ignore_order=True)
>>> print (ddiff)
{}

List that contains dictionary:

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:1, 2:2}]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:3}]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'dic_item_removed': ["root[4]['b'][2][2]"],
  'values_changed': {"root[4]['b'][2][1]": {'newvalue': 3, 'oldvalue': 1}}}

Sets:

>>> t1 = {1, 2, 8}
>>> t2 = {1, 2, 3, 5}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (DeepDiff(t1, t2))
{'set_item_added': ['root[3]', 'root[5]'], 'set_item_removed': ['root[8]']}

Named Tuples:

>>> from collections import namedtuple
>>> Point = namedtuple('Point', ['x', 'y'])
>>> t1 = Point(x=11, y=22)
>>> t2 = Point(x=11, y=23)
>>> pprint (DeepDiff(t1, t2))
{'values_changed': {'root.y': {'newvalue': 23, 'oldvalue': 22}}}

Custom objects:

>>> class ClassA(object):
...     a = 1
...     def __init__(self, b):
...         self.b = b
... 
>>> t1 = ClassA(1)
>>> t2 = ClassA(2)
>>> 
>>> pprint(DeepDiff(t1, t2))
{'values_changed': {'root.b': {'newvalue': 2, 'oldvalue': 1}}}

Object attribute added:

>>> t2.c = "new attribute"
>>> pprint(DeepDiff(t1, t2))
{'attribute_added': ['root.c'],
 'values_changed': {'root.b': {'newvalue': 2, 'oldvalue': 1}}}

Answer 2

Not sure whether it's “fast” or not, but normally one can do this:

dicta = {"a":1,"b":2,"c":3,"d":4}
dictb = {"a":1,"d":2}
for key in dicta:
    if key not in dictb:
        print(key)

Answer 3

As Alex Martelli wrote, if you simply want to check if any key in B is not in A, any(True for k in dictB if k not in dictA) would be the way to go.

To find the keys that are missing:

diff = set(dictB)-set(dictA) #sets

C:\Dokumente und Einstellungen\thc>python -m timeit -s "dictA = dict(zip(range(1000),range(1000))); dictB = dict(zip(range(0,2000,2),range(1000)))" "diff=set(dictB)-set(dictA)"
10000 loops, best of 3: 107 usec per loop

diff = [ k for k in dictB if k not in dictA ] #lc

C:\Dokumente und Einstellungen\thc>python -m timeit -s "dictA = dict(zip(range(1000),range(1000))); dictB = dict(zip(range(0,2000,2),range(1000)))" "diff=[ k for k in dictB if k not in dictA ]"
10000 loops, best of 3: 95.9 usec per loop

So those two solutions are pretty much the same speed.
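
The same comparison can be reproduced from inside Python with the standard timeit module. This is a minimal sketch, assuming the same two dictionaries as in the command lines above; the helper names keys_via_sets and keys_via_comprehension are illustrative, and the timings will vary by machine and Python version.

```python
import timeit

# Same test data as in the command-line runs above.
dictA = dict(zip(range(1000), range(1000)))
dictB = dict(zip(range(0, 2000, 2), range(1000)))


def keys_via_sets():
    # Build two sets and subtract them.
    return set(dictB) - set(dictA)


def keys_via_comprehension():
    # Scan dictB once, testing membership in dictA.
    return [k for k in dictB if k not in dictA]


if __name__ == "__main__":
    for func in (keys_via_sets, keys_via_comprehension):
        seconds = timeit.timeit(func, number=1000)
        print(f"{func.__name__}: {seconds / 1000 * 1e6:.1f} usec per loop")
```

Both helpers find the same keys; only the container type of the result differs (a set versus a list).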


Answer 4

If you really mean exactly what you say (that you only need to find out IF “there are any keys” in B and not in A, not WHICH ONES might those be if any), the fastest way should be:

if any(True for k in dictB if k not in dictA): ...

If you actually need to find out WHICH KEYS, if any, are in B and not in A, and not just “IF” there are such keys, then existing answers are quite appropriate (but I do suggest more precision in future questions if that’s indeed what you mean;-).
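
To illustrate the difference between the two questions, here is a small sketch with made-up dictionaries (the names and values are purely illustrative). any() stops at the first key of dictB that is missing from dictA, whereas the comprehension collects every such key.

```python
dictA = {"a": 1, "b": 2}
dictB = {"a": 1, "b": 2, "c": 3, "d": 4}

# IF: is there at least one key in dictB that dictA lacks?
has_extra = any(True for k in dictB if k not in dictA)

# WHICH: collect all keys of dictB that dictA lacks.
extra_keys = [k for k in dictB if k not in dictA]

print(has_extra)   # True
print(extra_keys)  # ['c', 'd']
```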


Answer 5

Use set():

set(dictA.keys()).intersection(dictB.keys())
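
Note that intersection() returns the keys the two dictionaries share. A short sketch with illustrative data, showing it alongside difference() for the keys found in only one of them:

```python
dictA = {"a": 1, "b": 2, "c": 3}
dictB = {"b": 20, "c": 30, "d": 40}

common = set(dictA.keys()).intersection(dictB.keys())   # keys in both
only_in_b = set(dictB.keys()).difference(dictA.keys())  # keys missing from dictA

print(sorted(common))     # ['b', 'c']
print(sorted(only_in_b))  # ['d']
```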

Answer 6

The top answer by hughdbrown suggests using set difference, which is definitely the best approach:

diff = set(dictb.keys()) - set(dicta.keys())

The problem with this code is that it builds two lists just to create two sets, so it’s wasting 4N time and 2N space. It’s also a bit more complicated than it needs to be.

Usually, this is not a big deal, but if it is:

diff = dictb.keys() - dicta
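
A short sketch of why this works in Python 3 (the sample dictionaries are illustrative): keys() returns a set-like view, and its - operator accepts any iterable on the right-hand side, including a dict, so no intermediate list or set has to be built by hand.

```python
dicta = {"a": 1, "d": 2}
dictb = {"a": 1, "b": 2, "c": 3, "d": 4}

# dict.keys() is a set-like view in Python 3; the result is a plain set.
diff = dictb.keys() - dicta

print(sorted(diff))  # ['b', 'c']
```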

Python 2

In Python 2, keys() returns a list of the keys, not a KeysView. So you have to ask for viewkeys() directly.

diff = dictb.viewkeys() - dicta

For dual-version 2.7/3.x code, you’re hopefully using six or something similar, so you can use six.viewkeys(dictb):

diff = six.viewkeys(dictb) - dicta

In 2.4-2.6, there is no KeysView. But you can at least cut the cost from 4N to N by building your left set directly out of an iterator, instead of building a list first:

diff = set(dictb).difference(dicta)

Items

I have a dictA which can be the same as dictB or may have some keys missing as compared to dictB or else the value of some keys might be different

So you really don’t need to compare the keys, but the items. An ItemsView is only a Set if the values are hashable, like strings. If they are, it’s easy:

diff = dictb.items() - dicta.items()
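
A small sketch of the items comparison in Python 3, using illustrative data with hashable values. Each differing or missing entry of dicta shows up as a (key, value) pair taken from dictb. If a value is unhashable (a list or a nested dict, say), the subtraction raises TypeError.

```python
dicta = {"a": 1, "b": 2}
dictb = {"a": 1, "b": 5, "c": 3}

# Pairs present in dictb but not in dicta: changed values and new keys.
diff = dictb.items() - dicta.items()
print(sorted(diff))  # [('b', 5), ('c', 3)]

# With unhashable values the set operation fails.
try:
    {"a": [1]}.items() - {"a": [2]}.items()
except TypeError as exc:
    print("unhashable values:", exc)
```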

Recursive diff

Although the question isn’t directly asking for a recursive diff, some of the example values are dicts, and it appears the expected output does recursively diff them. There are already multiple answers here showing how to do that.


Answer 7

There is another question on Stack Overflow about this topic, and I have to admit that a simple solution is explained there: the datadiff library for Python helps print the difference between two dictionaries.


Answer 8

Here’s a way that will work, allows for keys that evaluate to False, and still uses a generator expression to fall out early if possible. It’s not exceptionally pretty though.

any(map(lambda x: True, (k for k in b if k not in a)))

EDIT:

THC4k posted a reply to my comment on another answer. Here’s a better, prettier way to do the above:

any(True for k in b if k not in a)

Not sure how that never crossed my mind…


Answer 9

This is an old question and asks a little bit less than what I needed so this answer actually solves more than this question asks. The answers in this question helped me solve the following:

  1. (asked) Record differences between two dictionaries
  2. Merge differences from #1 into base dictionary
  3. (asked) Merge differences between two dictionaries (treat dictionary #2 as if it were a diff dictionary)
  4. Try to detect item movements as well as changes
  5. (asked) Do all of this recursively

All this combined with JSON makes for a pretty powerful configuration storage support.

The solution (also on github):

from collections import OrderedDict
from pprint import pprint


class izipDestinationMatching(object):
    __slots__ = ("attr", "value", "index")

    def __init__(self, attr, value, index):
        self.attr, self.value, self.index = attr, value, index

    def __repr__(self):
        return "izip_destination_matching: found match by '%s' = '%s' @ %d" % (self.attr, self.value, self.index)


def izip_destination(a, b, attrs, addMarker=True):
    """
    Returns zipped lists, but final size is equal to b with (if shorter) a padded with nulls
    Additionally also tries to find item reallocations by searching child dicts (if they are dicts) for attribute, listed in attrs)
    When addMarker == False (patching), final size will be the longer of a, b
    """
    for idx, item in enumerate(b):
        try:
            attr = next((x for x in attrs if x in item), None)  # See if the item has any of the ID attributes
            match, matchIdx = next(((orgItm, idx) for idx, orgItm in enumerate(a) if attr in orgItm and orgItm[attr] == item[attr]), (None, None)) if attr else (None, None)
            if match and matchIdx != idx and addMarker: item[izipDestinationMatching] = izipDestinationMatching(attr, item[attr], matchIdx)
        except:
            match = None
        yield (match if match else a[idx] if len(a) > idx else None), item
    if not addMarker and len(a) > len(b):
        for item in a[len(b) - len(a):]:
            yield item, item


def dictdiff(a, b, searchAttrs=[]):
    """
    returns a dictionary which represents difference from a to b
    the return dict is as short as possible:
      equal items are removed
      added / changed items are listed
      removed items are listed with value=None
    Also processes list values where the resulting list size will match that of b.
    It can also search said list items (that are dicts) for identity values to detect changed positions.
      In case such identity value is found, it is kept so that it can be re-found during the merge phase
    @param a: original dict
    @param b: new dict
    @param searchAttrs: list of strings (keys to search for in sub-dicts)
    @return: dict / list / whatever input is
    """
    if not (isinstance(a, dict) and isinstance(b, dict)):
        if isinstance(a, list) and isinstance(b, list):
            return [dictdiff(v1, v2, searchAttrs) for v1, v2 in izip_destination(a, b, searchAttrs)]
        return b
    res = OrderedDict()
    if izipDestinationMatching in b:
        keepKey = b[izipDestinationMatching].attr
        del b[izipDestinationMatching]
    else:
        keepKey = izipDestinationMatching
    for key in sorted(set(a) | set(b)):  # union of keys; works on Python 2 and 3
        v1 = a.get(key, None)
        v2 = b.get(key, None)
        if keepKey == key or v1 != v2: res[key] = dictdiff(v1, v2, searchAttrs)
    if len(res) <= 1: res = dict(res)  # This is only here for pretty print (OrderedDict doesn't pprint nicely)
    return res


def dictmerge(a, b, searchAttrs=[]):
    """
    Returns a dictionary which merges differences recorded in b to base dictionary a
    Also processes list values where the resulting list size will match that of a
    It can also search said list items (that are dicts) for identity values to detect changed positions
    @param a: original dict
    @param b: diff dict to patch into a
    @param searchAttrs: list of strings (keys to search for in sub-dicts)
    @return: dict / list / whatever input is
    """
    if not (isinstance(a, dict) and isinstance(b, dict)):
        if isinstance(a, list) and isinstance(b, list):
            return [dictmerge(v1, v2, searchAttrs) for v1, v2 in izip_destination(a, b, searchAttrs, False)]
        return b
    res = OrderedDict()
    for key in sorted(set(a) | set(b)):  # union of keys; works on Python 2 and 3
        v1 = a.get(key, None)
        v2 = b.get(key, None)
        #print "processing", key, v1, v2, key not in b, dictmerge(v1, v2)
        if v2 is not None: res[key] = dictmerge(v1, v2, searchAttrs)
        elif key not in b: res[key] = v1
    if len(res) <= 1: res = dict(res)  # This is only here for pretty print (OrderedDict doesn't pprint nicely)
    return res

Answer 10

What about the standard library (comparing the full objects)?

PyDev->new PyDev Module->Module: unittest

import unittest


class Test(unittest.TestCase):


    def testName(self):
        d1 = {1:1, 2:2}
        d2 = {1:1, 2:2}
        self.maxDiff = None  # sometimes useful: show the full diff on failure
        self.assertDictEqual(d1, d2)

if __name__ == "__main__":
    #import sys;sys.argv = ['', 'Test.testName']

    unittest.main()

Answer 11

If on Python 2.7 (dict.viewkeys() was removed in Python 3):

# update different values in dictB
# I would assume only dictA should be updated,
# but the question specifies otherwise

for k in dictA.viewkeys() & dictB.viewkeys():
    if dictA[k] != dictB[k]:
        dictB[k]= dictA[k]

# add missing keys to dictA

dictA.update( (k,dictB[k]) for k in dictB.viewkeys() - dictA.viewkeys() )
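
On Python 3, keys() already returns a set-like view, so the same logic might be written as below. This is a sketch with illustrative sample dictionaries, following the same update direction as the code above.

```python
dictA = {"a": 1, "b": 2}
dictB = {"a": 1, "b": 3, "c": 4}

# Overwrite differing values in dictB with the values from dictA.
for k in dictA.keys() & dictB.keys():
    if dictA[k] != dictB[k]:
        dictB[k] = dictA[k]

# Add keys that only dictB has to dictA.
dictA.update((k, dictB[k]) for k in dictB.keys() - dictA.keys())

print(dictA == dictB)  # True
```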

Answer 12

Here is a solution for deep-comparing the keys of two dictionaries:

def compareDictKeys(dict1, dict2):
  if type(dict1) != dict or type(dict2) != dict:
      return False

  keys1, keys2 = dict1.keys(), dict2.keys()
  diff = set(keys1) - set(keys2) or set(keys2) - set(keys1)

  if not diff:
      for key in keys1:
          if (type(dict1[key]) == dict or type(dict2[key]) == dict) and not compareDictKeys(dict1[key], dict2[key]):
              diff = True
              break

  return not diff

Answer 13

Here’s a solution that can compare more than two dicts:

def diff_dict(dicts, default=None):
    diff_dict = {}
    # set().union(*dicts) collects every key and works on Python 2 and 3
    for k in set().union(*dicts):
        # we can just use "values = [d.get(k, default) ..." below if 
        # we don't care that d1[k]=default and d2[k]=missing will
        # be treated as equal
        if any(k not in d for d in dicts):
            diff_dict[k] = [d.get(k, default) for d in dicts]
        else:
            values = [d[k] for d in dicts]
            if any(v != values[0] for v in values):
                diff_dict[k] = values
    return diff_dict

Usage example:

import matplotlib.pyplot as plt
diff_dict([plt.rcParams, plt.rcParamsDefault, plt.matplotlib.rcParamsOrig])

Answer 14

My recipe for the symmetric difference between two dictionaries:

def find_dict_diffs(dict1, dict2):
    missing = r'N\A'  # placeholder shown for a key that one dict lacks
    unequal_keys = []
    unequal_keys.extend(set(dict1.keys()).symmetric_difference(set(dict2.keys())))
    for k in dict1.keys():
        if dict1.get(k, missing) != dict2.get(k, missing):
            unequal_keys.append(k)
    if unequal_keys:
        print('param', 'dict1\t', 'dict2')
        for k in set(unequal_keys):
            print(str(k)+'\t'+dict1.get(k, missing)+'\t '+dict2.get(k, missing))
    else:
        print('Dicts are equal')

dict1 = {1:'a', 2:'b', 3:'c', 4:'d', 5:'e'}
dict2 = {1:'b', 2:'a', 3:'c', 4:'d', 6:'f'}

find_dict_diffs(dict1, dict2)

And the result is:

param   dict1   dict2
1       a       b
2       b       a
5       e       N\A
6       N\A     f

Answer 15

As mentioned in other answers, unittest produces some nice output for comparing dicts, but in this example we don’t want to have to build a whole test first.

Scraping the unittest source, it looks like you can get a fair solution with just this:

import difflib
import pprint

def diff_dicts(a, b):
    if a == b:
        return ''
    return '\n'.join(
        difflib.ndiff(pprint.pformat(a, width=30).splitlines(),
                      pprint.pformat(b, width=30).splitlines())
    )

so

dictA = dict(zip(range(7), map(ord, 'python')))
dictB = {0: 112, 1: 'spam', 2: [1,2,3], 3: 104, 4: 111}
print(diff_dicts(dictA, dictB))

Results in:

  {0: 112,
-  1: 121,
-  2: 116,
+  1: 'spam',
+  2: [1, 2, 3],
   3: 104,
-  4: 111,
?        ^

+  4: 111}
?        ^

-  5: 110}

Where:

  • '-' indicates key/values in the first but not the second dict
  • '+' indicates key/values in the second but not the first dict

As in unittest, the only caveat is that the final mapping can be reported as a difference because of the trailing comma/bracket.


Answer 16

@Maxx has an excellent answer: use the unittest tools provided by Python.

import unittest


class Test(unittest.TestCase):
    def runTest(self):
        pass

    def testDict(self, d1, d2, maxDiff=None):
        self.maxDiff = maxDiff
        self.assertDictEqual(d1, d2)

Then, anywhere in your code you can call:

try:
    Test().testDict(dict1, dict2)
except Exception as e:
    print(e)

The resulting output looks like the output of diff: the dictionaries are pretty-printed, with + or - prefixed to each line that differs.


Answer 17

Not sure if this is still relevant, but I came across this problem. In my situation I just needed to return a dictionary of the changes for all nested dictionaries, and so on. I could not find a good solution out there, so I ended up writing a simple function to do this. Hope this helps.
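
The function itself is not included in the answer. As a stand-in, here is a minimal sketch of what such a function might look like; nested_changes is a hypothetical name, and the conventions (removed keys are reported as None, unchanged keys are omitted) are assumptions, not the author's code.

```python
def nested_changes(old, new):
    """Return a dict of what changed going from old to new, recursing into nested dicts.

    Hypothetical helper: unchanged keys are omitted and removed keys map to None.
    """
    changes = {}
    for key in old.keys() | new.keys():
        if key not in new:
            changes[key] = None  # key was removed
        elif key not in old:
            changes[key] = new[key]  # key was added
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            sub = nested_changes(old[key], new[key])
            if sub:  # keep only sub-dicts that actually changed
                changes[key] = sub
        elif old[key] != new[key]:
            changes[key] = new[key]  # value was replaced
    return changes


old = {"a": 1, "b": {"x": 1, "y": 2}, "c": 3}
new = {"a": 1, "b": {"x": 1, "y": 5}, "d": 4}
print(nested_changes(old, new) == {"b": {"y": 5}, "c": None, "d": 4})  # True
```

One limitation of this convention: a key that was removed cannot be told apart from a key whose new value is None. A dedicated sentinel object would avoid that if it matters.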


Answer 18

If you want a built-in solution for a full comparison with arbitrary dict structures, @Maxx’s answer is a good start.

import unittest

test = unittest.TestCase()
test.assertEqual(dictA, dictB)
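
When the dictionaries differ, assertEqual raises AssertionError, and the exception message carries the diff. A minimal sketch on Python 3 with illustrative dictionaries:

```python
import unittest

dictA = {"a": 1, "b": 2}
dictB = {"a": 1, "b": 3}

test = unittest.TestCase()
test.maxDiff = None  # do not truncate long diffs

try:
    test.assertEqual(dictA, dictB)
    message = ""
except AssertionError as exc:
    message = str(exc)

print(message)
```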

Answer 19

Based on ghostdog74’s answer,

dicta = {"a":1,"d":2}
dictb = {"a":5,"d":2}

for value in dicta.values():
    if value not in dictb.values():
        print(value)

This prints each value of dicta that does not appear among the values of dictb.


Answer 20

Try this to find the intersection, i.e. the keys that are in both dictionaries. If you want the keys not found in the second dictionary, just use not in instead:

intersect = filter(lambda x, dictB=dictB.keys(): x in dictB, dictA.keys())