Python 实用宝典

Question 1

In RStudio, you can run parts of code in the code editing window, and the results appear in the console.

You can also do cool stuff like selecting whether you want everything up to the cursor to run, or everything after the cursor, or just the part that you selected, and so on. And there are hot keys for all that stuff.

It’s like a step above the interactive shell in Python — there you can use readline to go back to previous individual lines, but it doesn’t have any “concept” of what a function is, a section of code, etc.

Is there a tool like that for Python? Or, do you have some sort of similar workaround that you use, say, in vim?

Question 2

IPython Notebooks are awesome. Here’s another, newer browser-based tool I’ve recently discovered: Rodeo. My impression is that it seems to better support an RStudio-like workflow.

Question 3

Jupyter Notebook (previously known as IPython notebook) is a really cool project for interactive data manipulation in Python (and other languages, including R). It basically allows you to interactively code and document what you’re doing in one interface and later on save it as a:

notebook (.ipynb)
script (a .py file including only the source code)
static html (and therefore pdf as well)

You can even share your notebooks online with others using the nbviewer service, where people publish whole books. Furthermore, GitHub renders your .ipynb files. You can publish your Jupyter Notebooks as reproducible research articles on Authorea. For collaborative editing by multiple users, check out Google Colab built on top of Jupyter.

The default Jupyter Notebook version starts a web application locally (or you deploy it to a server) and you use it from your browser. As Ryan also mentioned in his answer, Rodeo is an interface more similar to RStudio built on top of the Jupyter kernel.

JupyterLab is a newer take on the UI allowing for more flexibility in how you edit your notebooks, control interactive widgets and even run commands in terminal emulators.

There’s also a Qt console for IPython, a similar project with inline plots, which is a desktop application.

Jupyter is a normal Python package and can be installed using pip install jupyter. To get all the scientific libraries running on your computer, however, it might be easier to try the official Jupyter Docker containers. For example, assuming your notebooks are in ~/code/jupyter, you can run the container as:

docker run -it --rm -p 8888:8888 -v ~/code/jupyter:/home/jovyan/work jupyter/datascience-notebook

Question 4

spyder or install python(x,y). it is great.

If you are new to Python, you can install the free Anaconda distribution (http://continuum.io/downloads.html), which will install Spyder for you, as well as Python 2.7 and IPython. Spyder is very similar to RStudio.

Question 5

Check out Rodeo from Yhat if you’re looking for something like RStudio for Python.

Rodeo has:

text editor (uses Atom under the hood)
Vim / Emacs mode
an IPython console
autocomplete
docstrings
ability to see plots, dataframes, variables

Question 6

You might want to look into JupyterLab (the next generation of Jupyter Notbooks): https://github.com/jupyter/jupyterlab.

JupyterLab aims to create a more desktop-like experience on the Web.

Update: As of March 2018 JupyterLab is in beta. “The beta releases are suitable for general usage. For JupyterLab extension developers, the extension APIs will continue to evolve until the 1.0 release. Eventually, JupyterLab will replace the classic Jupyter Notebook after JupyterLab reaches 1.0.“

To run Jupyter Lab as a Desktop Application, see christopherroach.com/articles/jupyterlab-desktop-app (Thanks to PatrickT).

Here’s a quick preview:

You can arrange a notebook next to a graphical console atop a terminal that is monitoring the system, while keeping the file manager on the left:

For more details see: https://blog.jupyter.org/2016/07/14/jupyter-lab-alpha/ and here: http://www.techatbloomberg.com/blog/inside-the-collaboration-that-built-the-open-source-jupyterlab-project/.

Question 7

Pycharm is a really decent IDE. From what I have seen so far it is the most similar to Rstudio. Another nice piece is that it allows you to install new Python libraries in a fashion similar to Rstudio (which otherwise can be a nightmare). There is now a free ‘community’ edition.

Question 8

I think it is worth while to mention that RStudio v1.1.359 Preview is released. It has terminal feature that can be used for Python.

Download is available here

Documentation is available here

Question 9

spyder is you need! https://code.google.com/p/spyderlib/
Spyder (previously known as Pydee) is a powerful interactive development environment for the Python language with advanced editing, interactive testing, debugging and introspection features

Question 10

For a nicer interactive shell for Python, have a look at DreamPie. It’s not really an IDE though (as RStudio seems to be?)

Question 11

Wing IDE, and probably also other Python IDEs like PyCharm and PyDev have features like this. In Wing you can either select and execute code in the integrated Python Shell or if you’re debugging something you can interact with the paused debug program in a shell (called the Debug Probe). There is also special support for matplotlib, in case you’re using that, so that you can work with plots interactively.

Question 12

I just compiled and installed mysqldb for python 2.7 on my mac os 10.6. I created a simple test file that imports

import MySQLdb as mysql

Firstly, this command is red underlined and the info tells me “Unresolved import”. Then I tried to run the following simple python code

import MySQLdb as mysql

def main():
    conn = mysql.connect( charset="utf8", use_unicode=True, host="localhost",user="root", passwd="",db="" )

if __name__ == '__main__'():
    main()

When executing it I get the following error message

Traceback (most recent call last):
  File "/path/to/project/Python/src/cvdv/TestMySQLdb.py", line 4, in <module>
    import MySQLdb as mysql
  File "build/bdist.macosx-10.6-intel/egg/MySQLdb/__init__.py", line 19, in <module>
    \namespace cvdv
  File "build/bdist.macosx-10.6-intel/egg/_mysql.py", line 7, in <module>
  File "build/bdist.macosx-10.6-intel/egg/_mysql.py", line 6, in __bootstrap__
ImportError: dlopen(/Users/toom/.python-eggs/MySQL_python-1.2.3-py2.7-macosx-10.6-intel.egg-tmp/_mysql.so, 2): Library not loaded: libmysqlclient.18.dylib
  Referenced from: /Users/toom/.python-eggs/MySQL_python-1.2.3-py2.7-macosx-10.6-intel.egg-tmp/_mysql.so
  Reason: image not found

What might be the solution to my problem?

EDIT: Actually I found out that the library lies in /usr/local/mysql/lib. So I need to tell my pydev eclipse version where to find it. Where do I set this?

Question 13

I solved the problem by creating a symbolic link to the library. I.e.

The actual library resides in

/usr/local/mysql/lib

And then I created a symbolic link in

/usr/lib

Using the command:

sudo ln -s /usr/local/mysql/lib/libmysqlclient.18.dylib /usr/lib/libmysqlclient.18.dylib

so that I have the following mapping:

ls -l libmysqlclient.18.dylib 
lrwxr-xr-x  1 root  wheel  44 16 Jul 14:01 libmysqlclient.18.dylib -> /usr/local/mysql/lib/libmysqlclient.18.dylib

That was it. After that everything worked fine.

EDIT:

Notice, that since MacOS El Capitan the System Integrity Protection (SIP, also known as “rootless”) will prevent you from creating links in /usr/lib/. You could disable SIP by following these instructions, but you can create a link in /usr/local/lib/ instead:

sudo ln -s /usr/local/mysql/lib/libmysqlclient.18.dylib /usr/local/lib/libmysqlclient.18.dylib

Question 14

I think the maximum integer in python is available by calling sys.maxint.

What is the maximum float or long in Python?

Question 15

For float have a look at sys.float_info:

>>> import sys
>>> sys.float_info
sys.floatinfo(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2
250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsil
on=2.2204460492503131e-16, radix=2, rounds=1)

Specifically, sys.float_info.max:

>>> sys.float_info.max
1.7976931348623157e+308

If that’s not big enough, there’s always positive infinity:

>>> infinity = float("inf")
>>> infinity
inf
>>> infinity / 10000
inf

The long type has unlimited precision, so I think you’re only limited by available memory.

Question 16

sys.maxint is not the largest integer supported by python. It’s the largest integer supported by python’s regular integer type.

Question 17

If you are using numpy, you can use dtype ‘float128‘ and get a max float of 10e+4931

>>> np.finfo(np.float128)
finfo(resolution=1e-18, min=-1.18973149536e+4932, max=1.18973149536e+4932, dtype=float128)

Question 18

What are people’s experiences with any of the Git modules for Python? (I know of GitPython, PyGit, and Dulwich – feel free to mention others if you know of them.)

I am writing a program which will have to interact (add, delete, commit) with a Git repository, but have no experience with Git, so one of the things I’m looking for is ease of use/understanding with regards to Git.

The other things I’m primarily interested in are maturity and completeness of the library, a reasonable lack of bugs, continued development, and helpfulness of the documentation and developers.

If you think of something else I might want/need to know, please feel free to mention it.

Question 19

While this question was asked a while ago and I don’t know the state of the libraries at that point, it is worth mentioning for searchers that GitPython does a good job of abstracting the command line tools so that you don’t need to use subprocess. There are some useful built in abstractions that you can use, but for everything else you can do things like:

import git
repo = git.Repo( '/home/me/repodir' )
print repo.git.status()
# checkout and track a remote branch
print repo.git.checkout( 'origin/somebranch', b='somebranch' )
# add a file
print repo.git.add( 'somefile' )
# commit
print repo.git.commit( m='my commit message' )
# now we are one commit ahead
print repo.git.status()

Everything else in GitPython just makes it easier to navigate. I’m fairly well satisfied with this library and appreciate that it is a wrapper on the underlying git tools.

UPDATE: I’ve switched to using the sh module for not just git but most commandline utilities I need in python. To replicate the above I would do this instead:

import sh
git = sh.git.bake(_cwd='/home/me/repodir')
print git.status()
# checkout and track a remote branch
print git.checkout('-b', 'somebranch')
# add a file
print git.add('somefile')
# commit
print git.commit(m='my commit message')
# now we are one commit ahead
print git.status()

Question 20

I thought I would answer my own question, since I’m taking a different path than suggested in the answers. Nonetheless, thanks to those who answered.

First, a brief synopsis of my experiences with GitPython, PyGit, and Dulwich:

GitPython: After downloading, I got this imported and the appropriate object initialized. However, trying to do what was suggested in the tutorial led to errors. Lacking more documentation, I turned elsewhere.
PyGit: This would not even import, and I could find no documentation.
Dulwich: Seems to be the most promising (at least for what I wanted and saw). I made some progress with it, more than with GitPython, since its egg comes with Python source. However, after a while, I decided it may just be easier to try what I did.

Also, StGit looks interesting, but I would need the functionality extracted into a separate module and do not want wait for that to happen right now.

In (much) less time than I spent trying to get the three modules above working, I managed to get git commands working via the subprocess module, e.g.

def gitAdd(fileName, repoDir):
    cmd = ['git', 'add', fileName]
    p = subprocess.Popen(cmd, cwd=repoDir)
    p.wait()

gitAdd('exampleFile.txt', '/usr/local/example_git_repo_dir')

This isn’t fully incorporated into my program yet, but I’m not anticipating a problem, except maybe speed (since I’ll be processing hundreds or even thousands of files at times).

Maybe I just didn’t have the patience to get things going with Dulwich or GitPython. That said, I’m hopeful the modules will get more development and be more useful soon.

Question 21

I’d recommend pygit2 – it uses the excellent libgit2 bindings

Question 22

This is a pretty old question, and while looking for Git libraries, I found one that was made this year (2013) called Gittle.

It worked great for me (where the others I tried were flaky), and seems to cover most of the common actions.

Some examples from the README:

from gittle import Gittle

# Clone a repository
repo_path = '/tmp/gittle_bare'
repo_url = 'git://github.com/FriendCode/gittle.git'
repo = Gittle.clone(repo_url, repo_path)

# Stage multiple files
repo.stage(['other1.txt', 'other2.txt'])

# Do the commit
repo.commit(name="Samy Pesse", email="samy@friendco.de", message="This is a commit")

# Authentication with RSA private key
key_file = open('/Users/Me/keys/rsa/private_rsa')
repo.auth(pkey=key_file)

# Do push
repo.push()

Question 23

Maybe it helps, but Bazaar and Mercurial are both using dulwich for their Git interoperability.

Dulwich is probably different than the other in the sense that’s it’s a reimplementation of git in python. The other might just be a wrapper around Git’s commands (so it could be simpler to use from a high level point of view: commit/add/delete), it probably means their API is very close to git’s command line so you’ll need to gain experience with Git.

Question 24

An updated answer reflecting changed times:

GitPython currently is the easiest to use. It supports wrapping of many git plumbing commands and has pluggable object database (dulwich being one of them), and if a command isn’t implemented, provides an easy api for shelling out to the command line. For example:

repo = Repo('.')
repo.checkout(b='new_branch')

This calls:

bash$ git checkout -b new_branch

Dulwich is also good but much lower level. It’s somewhat of a pain to use because it requires operating on git objects at the plumbing level and doesn’t have nice porcelain that you’d normally want to do. However, if you plan on modifying any parts of git, or use git-receive-pack and git-upload-pack, you need to use dulwich.

Question 25

For the sake of completeness, http://github.com/alex/pyvcs/ is an abstraction layer for all dvcs’s. It uses dulwich, but provides interop with the other dvcs’s.

Question 26

Here’s a really quick implementation of “git status”:

import os
import string
from subprocess import *

repoDir = '/Users/foo/project'

def command(x):
    return str(Popen(x.split(' '), stdout=PIPE).communicate()[0])

def rm_empty(L): return [l for l in L if (l and l!="")]

def getUntracked():
    os.chdir(repoDir)
    status = command("git status")
    if "# Untracked files:" in status:
        untf = status.split("# Untracked files:")[1][1:].split("\n")
        return rm_empty([x[2:] for x in untf if string.strip(x) != "#" and x.startswith("#\t")])
    else:
        return []

def getNew():
    os.chdir(repoDir)
    status = command("git status").split("\n")
    return [x[14:] for x in status if x.startswith("#\tnew file:   ")]

def getModified():
    os.chdir(repoDir)
    status = command("git status").split("\n")
    return [x[14:] for x in status if x.startswith("#\tmodified:   ")]

print("Untracked:")
print( getUntracked() )
print("New:")
print( getNew() )
print("Modified:")
print( getModified() )

Question 27

PTBNL’s Answer is quite perfect for me. I make a little more for Windows user.

import time
import subprocess
def gitAdd(fileName, repoDir):
    cmd = 'git add ' + fileName
    pipe = subprocess.Popen(cmd, shell=True, cwd=repoDir,stdout = subprocess.PIPE,stderr = subprocess.PIPE )
    (out, error) = pipe.communicate()
    print out,error
    pipe.wait()
    return 

def gitCommit(commitMessage, repoDir):
    cmd = 'git commit -am "%s"'%commitMessage
    pipe = subprocess.Popen(cmd, shell=True, cwd=repoDir,stdout = subprocess.PIPE,stderr = subprocess.PIPE )
    (out, error) = pipe.communicate()
    print out,error
    pipe.wait()
    return 
def gitPush(repoDir):
    cmd = 'git push '
    pipe = subprocess.Popen(cmd, shell=True, cwd=repoDir,stdout = subprocess.PIPE,stderr = subprocess.PIPE )
    (out, error) = pipe.communicate()
    pipe.wait()
    return 

temp=time.localtime(time.time())
uploaddate= str(temp[0])+'_'+str(temp[1])+'_'+str(temp[2])+'_'+str(temp[3])+'_'+str(temp[4])

repoDir='d:\\c_Billy\\vfat\\Programming\\Projector\\billyccm' # your git repository , windows your need to use double backslash for right directory.
gitAdd('.',repoDir )
gitCommit(uploaddate, repoDir)
gitPush(repoDir)

Question 28

The git interaction library part of StGit is actually pretty good. However, it isn’t broken out as a separate package but if there is sufficient interest, I’m sure that can be fixed.

It has very nice abstractions for representing commits, trees etc, and for creating new commits and trees.

Question 29

For the record, none of the aforementioned Git Python libraries seem to contain a “git status” equivalent, which is really the only thing I would want since dealing with the rest of the git commands via subprocess is so easy.

Question 30

I have the following list created from a sorted csv

list1 = sorted(csv1, key=operator.itemgetter(1))

I would actually like to sort the list by two criteria: first by the value in field 1 and then by the value in field 2. How do I do this?

Question 31

like this:

import operator
list1 = sorted(csv1, key=operator.itemgetter(1, 2))

Question 32

No need to import anything when using lambda functions.
The following sorts list by the first element, then by the second element.

sorted(list, key=lambda x: (x[0], -x[1]))

Question 33

Python has a stable sort, so provided that performance isn’t an issue the simplest way is to sort it by field 2 and then sort it again by field 1.

That will give you the result you want, the only catch is that if it is a big list (or you want to sort it often) calling sort twice might be an unacceptable overhead.

list1 = sorted(csv1, key=operator.itemgetter(2))
list1 = sorted(list1, key=operator.itemgetter(1))

Doing it this way also makes it easy to handle the situation where you want some of the columns reverse sorted, just include the ‘reverse=True’ parameter when necessary.

Otherwise you can pass multiple parameters to itemgetter or manually build a tuple. That is probably going to be faster, but has the problem that it doesn’t generalise well if some of the columns want to be reverse sorted (numeric columns can still be reversed by negating them but that stops the sort being stable).

So if you don’t need any columns reverse sorted, go for multiple arguments to itemgetter, if you might, and the columns aren’t numeric or you want to keep the sort stable go for multiple consecutive sorts.

Edit: For the commenters who have problems understanding how this answers the original question, here is an example that shows exactly how the stable nature of the sorting ensures we can do separate sorts on each key and end up with data sorted on multiple criteria:

DATA = [
    ('Jones', 'Jane', 58),
    ('Smith', 'Anne', 30),
    ('Jones', 'Fred', 30),
    ('Smith', 'John', 60),
    ('Smith', 'Fred', 30),
    ('Jones', 'Anne', 30),
    ('Smith', 'Jane', 58),
    ('Smith', 'Twin2', 3),
    ('Jones', 'John', 60),
    ('Smith', 'Twin1', 3),
    ('Jones', 'Twin1', 3),
    ('Jones', 'Twin2', 3)
]

# Sort by Surname, Age DESCENDING, Firstname
print("Initial data in random order")
for d in DATA:
    print("{:10s} {:10s} {}".format(*d))

print('''
First we sort by first name, after this pass all
Twin1 come before Twin2 and Anne comes before Fred''')
DATA.sort(key=lambda row: row[1])

for d in DATA:
    print("{:10s} {:10s} {}".format(*d))

print('''
Second pass: sort by age in descending order.
Note that after this pass rows are sorted by age but
Twin1/Twin2 and Anne/Fred pairs are still in correct
firstname order.''')
DATA.sort(key=lambda row: row[2], reverse=True)
for d in DATA:
    print("{:10s} {:10s} {}".format(*d))

print('''
Final pass sorts the Jones from the Smiths.
Within each family members are sorted by age but equal
age members are sorted by first name.
''')
DATA.sort(key=lambda row: row[0])
for d in DATA:
    print("{:10s} {:10s} {}".format(*d))

This is a runnable example, but to save people running it the output is:

Initial data in random order
Jones      Jane       58
Smith      Anne       30
Jones      Fred       30
Smith      John       60
Smith      Fred       30
Jones      Anne       30
Smith      Jane       58
Smith      Twin2      3
Jones      John       60
Smith      Twin1      3
Jones      Twin1      3
Jones      Twin2      3

First we sort by first name, after this pass all
Twin1 come before Twin2 and Anne comes before Fred
Smith      Anne       30
Jones      Anne       30
Jones      Fred       30
Smith      Fred       30
Jones      Jane       58
Smith      Jane       58
Smith      John       60
Jones      John       60
Smith      Twin1      3
Jones      Twin1      3
Smith      Twin2      3
Jones      Twin2      3

Second pass: sort by age in descending order.
Note that after this pass rows are sorted by age but
Twin1/Twin2 and Anne/Fred pairs are still in correct
firstname order.
Smith      John       60
Jones      John       60
Jones      Jane       58
Smith      Jane       58
Smith      Anne       30
Jones      Anne       30
Jones      Fred       30
Smith      Fred       30
Smith      Twin1      3
Jones      Twin1      3
Smith      Twin2      3
Jones      Twin2      3

Final pass sorts the Jones from the Smiths.
Within each family members are sorted by age but equal
age members are sorted by first name.

Jones      John       60
Jones      Jane       58
Jones      Anne       30
Jones      Fred       30
Jones      Twin1      3
Jones      Twin2      3
Smith      John       60
Smith      Jane       58
Smith      Anne       30
Smith      Fred       30
Smith      Twin1      3
Smith      Twin2      3

Note in particular how in the second step the reverse=True parameter keeps the firstnames in order whereas simply sorting then reversing the list would lose the desired order for the third sort key.

Question 34

list1 = sorted(csv1, key=lambda x: (x[1], x[2]) )

Question 35

employees.sort(key = lambda x:x[1])
employees.sort(key = lambda x:x[0])

We can also use .sort with lambda 2 times because python sort is in place and stable. This will first sort the list according to the second element, x[1]. Then, it will sort the first element, x[0] (highest priority).

employees[0] = Employee's Name
employees[1] = Employee's Salary

This is equivalent to doing the following: employees.sort(key = lambda x:(x[0], x[1]))

Question 36

In ascending order you can use:

sorted_data= sorted(non_sorted_data, key=lambda k: (k[1],k[0]))

or in descending order you can use:

sorted_data= sorted(non_sorted_data, key=lambda k: (k[1],k[0]),reverse=True)

Question 37

Sorting list of dicts using below will sort list in descending order on first column as salary and second column as age

d=[{'salary':123,'age':23},{'salary':123,'age':25}]
d=sorted(d, key=lambda i: (i['salary'], i['age']),reverse=True)

Output: [{‘salary’: 123, ‘age’: 25}, {‘salary’: 123, ‘age’: 23}]

Question 38

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

Question 39

Here’s a small kmeans that uses any of the 20-odd distances in scipy.spatial.distance, or a user function.
Comments would be welcome (this has had only one user so far, not enough); in particular, what are your N, dim, k, metric ?

#!/usr/bin/env python
# kmeans.py using any of the 20-odd metrics in scipy.spatial.distance
# kmeanssample 2 pass, first sample sqrt(N)

from __future__ import division
import random
import numpy as np
from scipy.spatial.distance import cdist  # $scipy/spatial/distance.py
    # http://docs.scipy.org/doc/scipy/reference/spatial.html
from scipy.sparse import issparse  # $scipy/sparse/csr.py

__date__ = "2011-11-17 Nov denis"
    # X sparse, any cdist metric: real app ?
    # centres get dense rapidly, metrics in high dim hit distance whiteout
    # vs unsupervised / semi-supervised svm

#...............................................................................
def kmeans( X, centres, delta=.001, maxiter=10, metric="euclidean", p=2, verbose=1 ):
    """ centres, Xtocentre, distances = kmeans( X, initial centres ... )
    in:
        X N x dim  may be sparse
        centres k x dim: initial centres, e.g. random.sample( X, k )
        delta: relative error, iterate until the average distance to centres
            is within delta of the previous average distance
        maxiter
        metric: any of the 20-odd in scipy.spatial.distance
            "chebyshev" = max, "cityblock" = L1, "minkowski" with p=
            or a function( Xvec, centrevec ), e.g. Lqmetric below
        p: for minkowski metric -- local mod cdist for 0 < p < 1 too
        verbose: 0 silent, 2 prints running distances
    out:
        centres, k x dim
        Xtocentre: each X -> its nearest centre, ints N -> k
        distances, N
    see also: kmeanssample below, class Kmeans below.
    """
    if not issparse(X):
        X = np.asanyarray(X)  # ?
    centres = centres.todense() if issparse(centres) \
        else centres.copy()
    N, dim = X.shape
    k, cdim = centres.shape
    if dim != cdim:
        raise ValueError( "kmeans: X %s and centres %s must have the same number of columns" % (
            X.shape, centres.shape ))
    if verbose:
        print "kmeans: X %s  centres %s  delta=%.2g  maxiter=%d  metric=%s" % (
            X.shape, centres.shape, delta, maxiter, metric)
    allx = np.arange(N)
    prevdist = 0
    for jiter in range( 1, maxiter+1 ):
        D = cdist_sparse( X, centres, metric=metric, p=p )  # |X| x |centres|
        xtoc = D.argmin(axis=1)  # X -> nearest centre
        distances = D[allx,xtoc]
        avdist = distances.mean()  # median ?
        if verbose >= 2:
            print "kmeans: av |X - nearest centre| = %.4g" % avdist
        if (1 - delta) * prevdist <= avdist <= prevdist \
        or jiter == maxiter:
            break
        prevdist = avdist
        for jc in range(k):  # (1 pass in C)
            c = np.where( xtoc == jc )[0]
            if len(c) > 0:
                centres[jc] = X[c].mean( axis=0 )
    if verbose:
        print "kmeans: %d iterations  cluster sizes:" % jiter, np.bincount(xtoc)
    if verbose >= 2:
        r50 = np.zeros(k)
        r90 = np.zeros(k)
        for j in range(k):
            dist = distances[ xtoc == j ]
            if len(dist) > 0:
                r50[j], r90[j] = np.percentile( dist, (50, 90) )
        print "kmeans: cluster 50 % radius", r50.astype(int)
        print "kmeans: cluster 90 % radius", r90.astype(int)
            # scale L1 / dim, L2 / sqrt(dim) ?
    return centres, xtoc, distances

#...............................................................................
def kmeanssample( X, k, nsample=0, **kwargs ):
    """ 2-pass kmeans, fast for large N:
        1) kmeans a random sample of nsample ~ sqrt(N) from X
        2) full kmeans, starting from those centres
    """
        # merge w kmeans ? mttiw
        # v large N: sample N^1/2, N^1/2 of that
        # seed like sklearn ?
    N, dim = X.shape
    if nsample == 0:
        nsample = max( 2*np.sqrt(N), 10*k )
    Xsample = randomsample( X, int(nsample) )
    pass1centres = randomsample( X, int(k) )
    samplecentres = kmeans( Xsample, pass1centres, **kwargs )[0]
    return kmeans( X, samplecentres, **kwargs )

def cdist_sparse( X, Y, **kwargs ):
    """ -> |X| x |Y| cdist array, any cdist metric
        X or Y may be sparse -- best csr
    """
        # todense row at a time, v slow if both v sparse
    sxy = 2*issparse(X) + issparse(Y)
    if sxy == 0:
        return cdist( X, Y, **kwargs )
    d = np.empty( (X.shape[0], Y.shape[0]), np.float64 )
    if sxy == 2:
        for j, x in enumerate(X):
            d[j] = cdist( x.todense(), Y, **kwargs ) [0]
    elif sxy == 1:
        for k, y in enumerate(Y):
            d[:,k] = cdist( X, y.todense(), **kwargs ) [0]
    else:
        for j, x in enumerate(X):
            for k, y in enumerate(Y):
                d[j,k] = cdist( x.todense(), y.todense(), **kwargs ) [0]
    return d

def randomsample( X, n ):
    """ random.sample of the rows of X
        X may be sparse -- best csr
    """
    sampleix = random.sample( xrange( X.shape[0] ), int(n) )
    return X[sampleix]

def nearestcentres( X, centres, metric="euclidean", p=2 ):
    """ each X -> nearest centre, any metric
            euclidean2 (~ withinss) is more sensitive to outliers,
            cityblock (manhattan, L1) less sensitive
    """
    D = cdist( X, centres, metric=metric, p=p )  # |X| x |centres|
    return D.argmin(axis=1)

def Lqmetric( x, y=None, q=.5 ):
    # yes a metric, may increase weight of near matches; see ...
    return (np.abs(x - y) ** q) .mean() if y is not None \
        else (np.abs(x) ** q) .mean()

#...............................................................................
class Kmeans:
    """ km = Kmeans( X, k= or centres=, ... )
        in: either initial centres= for kmeans
            or k= [nsample=] for kmeanssample
        out: km.centres, km.Xtocentre, km.distances
        iterator:
            for jcentre, J in km:
                clustercentre = centres[jcentre]
                J indexes e.g. X[J], classes[J]
    """
    def __init__( self, X, k=0, centres=None, nsample=0, **kwargs ):
        self.X = X
        if centres is None:
            self.centres, self.Xtocentre, self.distances = kmeanssample(
                X, k=k, nsample=nsample, **kwargs )
        else:
            self.centres, self.Xtocentre, self.distances = kmeans(
                X, centres, **kwargs )

    def __iter__(self):
        for jc in range(len(self.centres)):
            yield jc, (self.Xtocentre == jc)

#...............................................................................
if __name__ == "__main__":
    import random
    import sys
    from time import time

    N = 10000
    dim = 10
    ncluster = 10
    kmsample = 100  # 0: random centres, > 0: kmeanssample
    kmdelta = .001
    kmiter = 10
    metric = "cityblock"  # "chebyshev" = max, "cityblock" L1,  Lqmetric
    seed = 1

    exec( "\n".join( sys.argv[1:] ))  # run this.py N= ...
    np.set_printoptions( 1, threshold=200, edgeitems=5, suppress=True )
    np.random.seed(seed)
    random.seed(seed)

    print "N %d  dim %d  ncluster %d  kmsample %d  metric %s" % (
        N, dim, ncluster, kmsample, metric)
    X = np.random.exponential( size=(N,dim) )
        # cf scikits-learn datasets/
    t0 = time()
    if kmsample > 0:
        centres, xtoc, dist = kmeanssample( X, ncluster, nsample=kmsample,
            delta=kmdelta, maxiter=kmiter, metric=metric, verbose=2 )
    else:
        randomcentres = randomsample( X, ncluster )
        centres, xtoc, dist = kmeans( X, randomcentres,
            delta=kmdelta, maxiter=kmiter, metric=metric, verbose=2 )
    print "%.0f msec" % ((time() - t0) * 1000)

    # also ~/py/np/kmeans/test-kmeans.py

Some notes added 26mar 2012:

1) for cosine distance, first normalize all the data vectors to |X| = 1; then

cosinedistance( X, Y ) = 1 - X . Y = Euclidean distance |X - Y|^2 / 2

is fast. For bit vectors, keep the norms separately from the vectors instead of expanding out to floats (although some programs may expand for you). For sparse vectors, say 1 % of N, X . Y should take time O( 2 % N ), space O(N); but I don’t know which programs do that.

2) Scikit-learn clustering gives an excellent overview of k-means, mini-batch-k-means … with code that works on scipy.sparse matrices.

3) Always check cluster sizes after k-means. If you’re expecting roughly equal-sized clusters, but they come out [44 37 9 5 5] % … (sound of head-scratching).

Question 40

Unfortunately no: scikit-learn current implementation of k-means only uses Euclidean distances.

It is not trivial to extend k-means to other distances and denis’ answer above is not the correct way to implement k-means for other metrics.

Question 41

Just use nltk instead where you can do this, e.g.

from nltk.cluster.kmeans import KMeansClusterer
NUM_CLUSTERS = <choose a value>
data = <sparse matrix that you would normally give to scikit>.toarray()

kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(data, assign_clusters=True)

Question 42

Yes you can use a difference metric function; however, by definition, the k-means clustering algorithm relies on the eucldiean distance from the mean of each cluster.

You could use a different metric, so even though you are still calculating the mean you could use something like the mahalnobis distance.

Question 43

There is pyclustering which is python/C++ (so its fast!) and lets you specify a custom metric function

from pyclustering.cluster.kmeans import kmeans
from pyclustering.utils.metric import type_metric, distance_metric

user_function = lambda point1, point2: point1[0] + point2[0] + 2
metric = distance_metric(type_metric.USER_DEFINED, func=user_function)

# create K-Means algorithm with specific distance metric
start_centers = [[4.7, 5.9], [5.7, 6.5]];
kmeans_instance = kmeans(sample, start_centers, metric=metric)

# run cluster analysis and obtain results
kmeans_instance.process()
clusters = kmeans_instance.get_clusters()

Actually, i haven’t tested this code but cobbled it together from a ticket and example code.

Question 44

k-means of Spectral Python allows the use of L1 (Manhattan) distance.

Question 45

Sklearn Kmeans uses the Euclidean distance. It has no metric parameter. This said, if you’re clustering time series, you can use the tslearn python package, when you can specify a metric (dtw, softdtw, euclidean).

Question 46

What does sys.stdout.flush() do?

Question 47

Python’s standard out is buffered (meaning that it collects some of the data “written” to standard out before it writes it to the terminal). Calling sys.stdout.flush() forces it to “flush” the buffer, meaning that it will write everything in the buffer to the terminal, even if normally it would wait before doing so.

Here’s some good information about (un)buffered I/O and why it’s useful:
http://en.wikipedia.org/wiki/Data_buffer
Buffered vs unbuffered IO

Question 48

Consider the following simple Python script:

import time
import sys

for i in range(5):
    print(i),
    #sys.stdout.flush()
    time.sleep(1)

This is designed to print one number every second for five seconds, but if you run it as it is now (depending on your default system buffering) you may not see any output until the script completes, and then all at once you will see 0 1 2 3 4 printed to the screen.

This is because the output is being buffered, and unless you flush sys.stdout after each print you won’t see the output immediately. Remove the comment from the sys.stdout.flush() line to see the difference.

Question 49

As per my understanding, When ever we execute print statements output will be written to buffer. And we will see the output on screen when buffer get flushed(cleared). By default buffer will be flushed when program exits. BUT WE CAN ALSO FLUSH THE BUFFER MANUALLY by using “sys.stdout.flush()” statement in the program. In the below code buffer will be flushed when value of i reaches 5.

You can understand by executing the below code.

chiru@online:~$ cat flush.py

import time
import sys

for i in range(10):
    print i
    if i == 5:
        print "Flushing buffer"
        sys.stdout.flush()
    time.sleep(1)

for i in range(10):
    print i,
    if i == 5:
        print "Flushing buffer"
        sys.stdout.flush()

chiru@online:~$ python flush.py 
0 1 2 3 4 5 Flushing buffer
6 7 8 9 0 1 2 3 4 5 Flushing buffer
6 7 8 9

Question 50

import sys
for x in range(10000):
    print "HAPPY >> %s <<\r" % str(x),
    sys.stdout.flush()

Question 51

As per my understanding sys.stdout.flush() pushes out all the data that has been buffered to that point to a file object. While using stdout, data is stored in buffer memory (for some time or until the memory gets filled) before it gets written to terminal. Using flush() forces to empty the buffer and write to terminal even before buffer has empty space.

Question 52

In Python, what is the difference between json.load() and json.loads()?

I guess that the load() function must be used with a file object (I need thus to use a context manager) while the loads() function take the path to the file as a string. It is a bit confusing.

Does the letter “s” in json.loads() stand for string?

Thanks a lot for your answers!

Question 53

Yes, s stands for string. The json.loads function does not take the file path, but the file contents as a string. Look at the documentation at https://docs.python.org/2/library/json.html!

Question 54

Just going to add a simple example to what everyone has explained,

json.load()

json.load can deserialize a file itself i.e. it accepts a file object, for example,

# open a json file for reading and print content using json.load
with open("/xyz/json_data.json", "r") as content:
  print(json.load(content))

will output,

{u'event': {u'id': u'5206c7e2-da67-42da-9341-6ea403c632c7', u'name': u'Sufiyan Ghori'}}

If I use json.loads to open a file instead,

# you cannot use json.loads on file object
with open("json_data.json", "r") as content:
  print(json.loads(content))

I would get this error:

TypeError: expected string or buffer

json.loads()

json.loads() deserialize string.

So in order to use json.loads I will have to pass the content of the file using read() function, for example,

using content.read() with json.loads() return content of the file,

with open("json_data.json", "r") as content:
  print(json.loads(content.read()))

Output,

{u'event': {u'id': u'5206c7e2-da67-42da-9341-6ea403c632c7', u'name': u'Sufiyan Ghori'}}

That’s because type of content.read() is string, i.e. <type 'str'>

If I use json.load() with content.read(), I will get error,

with open("json_data.json", "r") as content:
  print(json.load(content.read()))

Gives,

AttributeError: ‘str’ object has no attribute ‘read’

So, now you know json.load deserialze file and json.loads deserialize a string.

Another example,

sys.stdin return file object, so if i do print(json.load(sys.stdin)), I will get actual json data,

cat json_data.json | ./test.py

{u'event': {u'id': u'5206c7e2-da67-42da-9341-6ea403c632c7', u'name': u'Sufiyan Ghori'}}

If I want to use json.loads(), I would do print(json.loads(sys.stdin.read())) instead.

Question 55

Documentation is quite clear: https://docs.python.org/2/library/json.html

json.load(fp[, encoding[, cls[, object_hook[, parse_float[, parse_int[, parse_constant[, object_pairs_hook[, **kw]]]]]]]])

Deserialize fp (a .read()-supporting file-like object containing a JSON document) to a Python object using this conversion table.

json.loads(s[, encoding[, cls[, object_hook[, parse_float[, parse_int[, parse_constant[, object_pairs_hook[, **kw]]]]]]]])

Deserialize s (a str or unicode instance containing a JSON document) to a Python object using this conversion table.

So load is for a file, loads for a string

Question 56

QUICK ANSWER (very simplified!)

json.load() takes a FILE

json.load() expects a file (file object) – e.g. a file you opened before given by filepath like 'files/example.json'.

json.loads() takes a STRING

json.loads() expects a (valid) JSON string – i.e. {"foo": "bar"}

EXAMPLES

Assuming you have a file example.json with this content: { “key_1”: 1, “key_2”: “foo”, “Key_3”: null }

>>> import json
>>> file = open("example.json")

>>> type(file)
<class '_io.TextIOWrapper'>

>>> file
<_io.TextIOWrapper name='example.json' mode='r' encoding='UTF-8'>

>>> json.load(file)
{'key_1': 1, 'key_2': 'foo', 'Key_3': None}

>>> json.loads(file)
Traceback (most recent call last):
  File "/usr/local/python/Versions/3.7/lib/python3.7/json/__init__.py", line 341, in loads
TypeError: the JSON object must be str, bytes or bytearray, not TextIOWrapper


>>> string = '{"foo": "bar"}'

>>> type(string)
<class 'str'>

>>> string
'{"foo": "bar"}'

>>> json.loads(string)
{'foo': 'bar'}

>>> json.load(string)
Traceback (most recent call last):
  File "/usr/local/python/Versions/3.7/lib/python3.7/json/__init__.py", line 293, in load
    return loads(fp.read(),
AttributeError: 'str' object has no attribute 'read'

Question 57

The json.load() method (without “s” in “load”) can read a file directly:

import json
with open('strings.json') as f:
    d = json.load(f)
    print(d)

json.loads() method, which is used for string arguments only.

import json

person = '{"name": "Bob", "languages": ["English", "Fench"]}'
print(type(person))
# Output : <type 'str'>

person_dict = json.loads(person)
print( person_dict)
# Output: {'name': 'Bob', 'languages': ['English', 'Fench']}

print(type(person_dict))
# Output : <type 'dict'>

Here , we can see after using loads() takes a string ( type(str) ) as a input and return dictionary.

Question 58

In python3.7.7, the definition of json.load is as below according to cpython source code:

def load(fp, *, cls=None, object_hook=None, parse_float=None,
        parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):

    return loads(fp.read(),
        cls=cls, object_hook=object_hook,
        parse_float=parse_float, parse_int=parse_int,
        parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)

json.load actually calls json.loads and use fp.read() as the first argument.

So if your code is:

with open (file) as fp:
    s = fp.read()
    json.loads(s)

It’s the same to do this:

with open (file) as fp:
    json.load(fp)

But if you need to specify the bytes reading from the file as like fp.read(10) or the string/bytes you want to deserialize is not from file, you should use json.loads()

As for json.loads(), it not only deserialize string but also bytes. If s is bytes or bytearray, it will be decoded to string first. You can also find it in the source code.

def loads(s, *, encoding=None, cls=None, object_hook=None, parse_float=None,
        parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):
    """Deserialize ``s`` (a ``str``, ``bytes`` or ``bytearray`` instance
    containing a JSON document) to a Python object.

    ...

    """
    if isinstance(s, str):
        if s.startswith('\ufeff'):
            raise JSONDecodeError("Unexpected UTF-8 BOM (decode using utf-8-sig)",
                                  s, 0)
    else:
        if not isinstance(s, (bytes, bytearray)):
            raise TypeError(f'the JSON object must be str, bytes or bytearray, '
                            f'not {s.__class__.__name__}')
        s = s.decode(detect_encoding(s), 'surrogatepass')

Question 59

I need to take an image and save it after some process. The figure looks fine when I display it, but after saving the figure, I got some white space around the saved image. I have tried the 'tight' option for savefig method, did not work either. The code:

  import matplotlib.image as mpimg
  import matplotlib.pyplot as plt

  fig = plt.figure(1)
  img = mpimg.imread(path)
  plt.imshow(img)
  ax=fig.add_subplot(1,1,1)

  extent = ax.get_window_extent().transformed(fig.dpi_scale_trans.inverted())
  plt.savefig('1.png', bbox_inches=extent)

  plt.axis('off') 
  plt.show()

I am trying to draw a basic graph by using NetworkX on a figure and save it. I realized that without graph it works, but when added a graph I get white space around the saved image;

import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import networkx as nx

G = nx.Graph()
G.add_node(1)
G.add_node(2)
G.add_node(3)
G.add_edge(1,3)
G.add_edge(1,2)
pos = {1:[100,120], 2:[200,300], 3:[50,75]}

fig = plt.figure(1)
img = mpimg.imread("C:\\images\\1.jpg")
plt.imshow(img)
ax=fig.add_subplot(1,1,1)

nx.draw(G, pos=pos)

extent = ax.get_window_extent().transformed(fig.dpi_scale_trans.inverted())
plt.savefig('1.png', bbox_inches = extent)

plt.axis('off') 
plt.show()

Question 60

I cannot claim I know exactly why or how my “solution” works, but this is what I had to do when I wanted to plot the outline of a couple of aerofoil sections — without white margins — to a PDF file. (Note that I used matplotlib inside an IPython notebook, with the -pylab flag.)

plt.gca().set_axis_off()
plt.subplots_adjust(top = 1, bottom = 0, right = 1, left = 0, 
            hspace = 0, wspace = 0)
plt.margins(0,0)
plt.gca().xaxis.set_major_locator(plt.NullLocator())
plt.gca().yaxis.set_major_locator(plt.NullLocator())
plt.savefig("filename.pdf", bbox_inches = 'tight',
    pad_inches = 0)

I have tried to deactivate different parts of this, but this always lead to a white margin somewhere. You may even have modify this to keep fat lines near the limits of the figure from being shaved by the lack of margins.

Question 61

You can remove the white space padding by setting bbox_inches="tight" in savefig:

plt.savefig("test.png",bbox_inches='tight')

You’ll have to put the argument to bbox_inches as a string, perhaps this is why it didn’t work earlier for you.

Possible duplicates:

Matplotlib plots: removing axis, legends and white spaces

How to set the margins for a matplotlib figure?

Reduce left and right margins in matplotlib plot

Question 62

After trying the above answers with no success (and a slew of other stack posts) what finally worked for me was just

plt.gca().set_axis_off()
plt.subplots_adjust(top = 1, bottom = 0, right = 1, left = 0, 
            hspace = 0, wspace = 0)
plt.margins(0,0)
plt.savefig("myfig.pdf")

Importantly this does not include the bbox or padding arguments.

Question 63

I found something from Arvind Pereira (http://robotics.usc.edu/~ampereir/wordpress/?p=626) and seemed to work for me:

plt.savefig(filename, transparent = True, bbox_inches = 'tight', pad_inches = 0)

Question 64

The following function incorporates johannes-s answer above. I have tested it with plt.figure and plt.subplots() with multiple axes, and it works nicely.

def save(filepath, fig=None):
    '''Save the current image with no whitespace
    Example filepath: "myfig.png" or r"C:\myfig.pdf" 
    '''
    import matplotlib.pyplot as plt
    if not fig:
        fig = plt.gcf()

    plt.subplots_adjust(0,0,1,1,0,0)
    for ax in fig.axes:
        ax.axis('off')
        ax.margins(0,0)
        ax.xaxis.set_major_locator(plt.NullLocator())
        ax.yaxis.set_major_locator(plt.NullLocator())
    fig.savefig(filepath, pad_inches = 0, bbox_inches='tight')

问题：是否有类似RStudio for Python的东西？[关闭]

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

问题：Python mysqldb：库未加载：libmysqlclient.18.dylib

回答 0

问题：Python中的最大浮点数是多少？

回答 0

回答 1

回答 2

问题：Python Git模块的经验？[关闭]

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

问题：按两个字段对Python列表进行排序

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

问题：是否可以使用scikit-learn K-Means聚类指定自己的距离函数？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

问题：sys.stdout.flush（）方法的用法

回答 0

回答 1

回答 2

回答 3

回答 4

问题：json.load（）和json.loads（）函数有什么区别

回答 0

回答 1

回答 2

回答 3

快速解答（非常简化！）

json.load（）需要一个文件

json.loads（）需要一个STRING

例子

QUICK ANSWER (very simplified!)

json.load() takes a FILE

json.loads() takes a STRING

EXAMPLES

回答 4

回答 5

问题：在matplotlib中删除已保存图像周围的空白

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

问题：计算两个Python字典中包含的键的差异