Python 实用宝典

Question 1

This thread discusses how to get the name of a function as a string in Python: How to get a function name as a string?

How can I do the same for a variable? As opposed to functions, Python variables do not have the __name__ attribute.

In other words, if I have a variable such as:

foo = dict()
foo['bar'] = 2

I am looking for a function/attribute, e.g. retrieve_name() in order to create a DataFrame in Pandas from this list, where the column names are given by the names of the actual dictionaries:

# List of dictionaries for my DataFrame
list_of_dicts = [n_jobs, users, queues, priorities]
columns = [retrieve_name(d) for d in list_of_dicts]

Question 2

Using the python-varname package, you can easily retrieve the name of the variables

https://github.com/pwwang/python-varname

In your case, you can do:

from varname import Wrapper

foo = Wrapper(dict())

# foo.name == 'foo'
# foo.value == {}
foo.value['bar'] = 2

For list comprehension part, you can do:

n_jobs = Wrapper(<original_value>) 
users = Wrapper(<original_value>) 
queues = Wrapper(<original_value>) 
priorities = Wrapper(<original_value>) 

list_of_dicts = [n_jobs, users, queues, priorities]
columns = [d.name for d in list_of_dicts]
# ['n_jobs', 'users', 'queues', 'priorities']
# REMEMBER that you have to access the <original_value> by d.value

You can also try to retrieve the variable name DIRECTLY:

from varname import nameof

foo = dict()

fooname = nameof(foo)
# fooname == 'foo'

Note that this is working in this case as you expected:

n_jobs = <original_value>
d = n_jobs

nameof(d) # will return d, instead of n_jobs
# nameof only works directly with the variable

I am the author of this package. Please let me know if you have any questions or you can submit issues on Github.

Question 3

The only objects in Python that have canonical names are modules, functions, and classes, and of course there is no guarantee that this canonical name has any meaning in any namespace after the function or class has been defined or the module imported. These names can also be modified after the objects are created so they may not always be particularly trustworthy.

What you want to do is not possible without recursively walking the tree of named objects; a name is a one-way reference to an object. A common or garden-variety Python object contains no references to its names. Imagine if every integer, every dict, every list, every Boolean needed to maintain a list of strings that represented names that referred to it! It would be an implementation nightmare, with little benefit to the programmer.

Question 4

Even if variable values don’t point back to the name, you have access to the list of every assigned variable and its value, so I’m astounded that only one person suggested looping through there to look for your var name.

Someone mentioned on that answer that you might have to walk the stack and check everyone’s locals and globals to find foo, but if foo is assigned in the scope where you’re calling this retrieve_name function, you can use inspect‘s current frame to get you all of those local variables.

My explanation might be a little bit too wordy (maybe I should’ve used a “foo” less words), but here’s how it would look in code (Note that if there is more than one variable assigned to the same value, you will get both of those variable names):

import inspect

x,y,z = 1,2,3

def retrieve_name(var):
    callers_local_vars = inspect.currentframe().f_back.f_locals.items()
    return [var_name for var_name, var_val in callers_local_vars if var_val is var]

print retrieve_name(y)

If you’re calling this function from another function, something like:

def foo(bar):
    return retrieve_name(bar)

foo(baz)

And you want the baz instead of bar, you’ll just need to go back a scope further. This can be done by adding an extra .f_back in the caller_local_vars initialization.

See an example here: ideone

Question 5

With Python 3.8 one can simply use f-string debugging feature:

>>> foo = dict()
>>> f'{foo=}'.split('=')[0]
'foo'

Question 6

On python3, this function will get the outer most name in the stack:

import inspect


def retrieve_name(var):
        """
        Gets the name of var. Does it from the out most frame inner-wards.
        :param var: variable to get name from.
        :return: string
        """
        for fi in reversed(inspect.stack()):
            names = [var_name for var_name, var_val in fi.frame.f_locals.items() if var_val is var]
            if len(names) > 0:
                return names[0]

It is useful anywhere on the code. Traverses the reversed stack looking for the first match.

Question 7

I don’t believe this is possible. Consider the following example:

>>> a = []
>>> b = a
>>> id(a)
140031712435664
>>> id(b)
140031712435664

The a and b point to the same object, but the object can’t know what variables point to it.

Question 8

def name(**variables):
    return [x for x in variables]

It’s used like this:

name(variable=variable)

Question 9

Here’s one approach. I wouldn’t recommend this for anything important, because it’ll be quite brittle. But it can be done.

Create a function that uses the inspect module to find the source code that called it. Then you can parse the source code to identify the variable names that you want to retrieve. For example, here’s a function called autodict that takes a list of variables and returns a dictionary mapping variable names to their values. E.g.:

x = 'foo'
y = 'bar'
d = autodict(x, y)
print d

Would give:

{'x': 'foo', 'y': 'bar'}

Inspecting the source code itself is better than searching through the locals() or globals() because the latter approach doesn’t tell you which of the variables are the ones you want.

At any rate, here’s the code:

def autodict(*args):
    get_rid_of = ['autodict(', ',', ')', '\n']
    calling_code = inspect.getouterframes(inspect.currentframe())[1][4][0]
    calling_code = calling_code[calling_code.index('autodict'):]
    for garbage in get_rid_of:
        calling_code = calling_code.replace(garbage, '')
    var_names, var_values = calling_code.split(), args
    dyn_dict = {var_name: var_value for var_name, var_value in
                zip(var_names, var_values)}
    return dyn_dict

The action happens in the line with inspect.getouterframes, which returns the string within the code that called autodict.

The obvious downside to this sort of magic is that it makes assumptions about how the source code is structured. And of course, it won’t work at all if it’s run inside the interpreter.

Question 10

I wrote the package sorcery to do this kind of magic robustly. You can write:

from sorcery import dict_of

columns = dict_of(n_jobs, users, queues, priorities)

and pass that to the dataframe constructor. It’s equivalent to:

columns = dict(n_jobs=n_jobs, users=users, queues=queues, priorities=priorities)

Question 11

>>> locals()['foo']
{}
>>> globals()['foo']
{}

If you wanted to write your own function, it could be done such that you could check for a variable defined in locals then check globals. If nothing is found you could compare on id() to see if the variable points to the same location in memory.

If your variable is in a class, you could use className.dict.keys() or vars(self) to see if your variable has been defined.

Question 12

In Python, the def and class keywords will bind a specific name to the object they define (function or class). Similarly, modules are given a name by virtue of being called something specific in the filesystem. In all three cases, there’s an obvious way to assign a “canonical” name to the object in question.

However, for other kinds of objects, such a canonical name may simply not exist. For example, consider the elements of a list. The elements in the list are not individually named, and it is entirely possible that the only way to refer to them in a program is by using list indices on the containing list. If such a list of objects was passed into your function, you could not possibly assign meaningful identifiers to the values.

Python doesn’t save the name on the left hand side of an assignment into the assigned object because:

It would require figuring out which name was “canonical” among multiple conflicting objects,
It would make no sense for objects which are never assigned to an explicit variable name,
It would be extremely inefficient,
Literally no other language in existence does that.

So, for example, functions defined using lambda will always have the “name” <lambda>, rather than a specific function name.

The best approach would be simply to ask the caller to pass in an (optional) list of names. If typing the '...','...' is too cumbersome, you could accept e.g. a single string containing a comma-separated list of names (like namedtuple does).

Question 13

>> my_var = 5
>> my_var_name = [ k for k,v in locals().items() if v == my_var][0]
>> my_var_name 
'my_var'

locals() – Return a dictionary containing the current scope’s local variables. by iterating through this dictionary we can check the key which has a value equal to the defined variable, just extracting the key will give us the text of variable in string format.

from (after a bit changes) https://www.tutorialspoint.com/How-to-get-a-variable-name-as-a-string-in-Python

Question 14

I think it’s so difficult to do this in Python because of the simple fact that you never will not know the name of the variable you’re using. So, in his example, you could do:

Instead of:

list_of_dicts = [n_jobs, users, queues, priorities]

dict_of_dicts = {"n_jobs" : n_jobs, "users" : users, "queues" : queues, "priorities" : priorities}

Question 15

just another way to do this based on the content of input variable:

(it returns the name of the first variable that matches to the input variable, otherwise None. One can modify it to get all variable names which are having the same content as input variable)

def retrieve_name(x, Vars=vars()):
    for k in Vars:
        if type(x) == type(Vars[k]):
            if x is Vars[k]:
                return k
    return None

Question 16

This function will print variable name with its value:

import inspect

def print_this(var):
    callers_local_vars = inspect.currentframe().f_back.f_locals.items()
    print(str([k for k, v in callers_local_vars if v is var][0])+': '+str(var))

***Input & Function call:***
my_var = 10

print_this(my_var)

***Output**:*
my_var: 10

Question 17

I have a method, and while not the most efficient…it works! (and it doesn’t involve any fancy modules).

Basically it compares your Variable’s ID to globals() Variables’ IDs, then returns the match’s name.

def getVariableName(variable, globalVariables=globals().copy()):
    """ Get Variable Name as String by comparing its ID to globals() Variables' IDs

        args:
            variable(var): Variable to find name for (Obviously this variable has to exist)

        kwargs:
            globalVariables(dict): Copy of the globals() dict (Adding to Kwargs allows this function to work properly when imported from another .py)
    """
    for globalVariable in globalVariables:
        if id(variable) == id(globalVariables[globalVariable]): # If our Variable's ID matches this Global Variable's ID...
            return globalVariable # Return its name from the Globals() dict

Question 18

If the goal is to help you keep track of your variables, you can write a simple function that labels the variable and returns its value and type. For example, suppose i_f=3.01 and you round it to an integer called i_n to use in a code, and then need a string i_s that will go into a report.

def whatis(string, x):
    print(string+' value=',repr(x),type(x))
    return string+' value='+repr(x)+repr(type(x))
i_f=3.01
i_n=int(i_f)
i_s=str(i_n)
i_l=[i_f, i_n, i_s]
i_u=(i_f, i_n, i_s)

## make report that identifies all types
report='\n'+20*'#'+'\nThis is the report:\n'
report+= whatis('i_f ',i_f)+'\n'
report+=whatis('i_n ',i_n)+'\n'
report+=whatis('i_s ',i_s)+'\n'
report+=whatis('i_l ',i_l)+'\n'
report+=whatis('i_u ',i_u)+'\n'
print(report)

This prints to the window at each call for debugging purposes and also yields a string for the written report. The only downside is that you have to type the variable twice each time you call the function.

I am a Python newbie and found this very useful way to log my efforts as I program and try to cope with all the objects in Python. One flaw is that whatis() fails if it calls a function described outside the procedure where it is used. For example, int(i_f) was a valid function call only because the int function is known to Python. You could call whatis() using int(i_f**2), but if for some strange reason you choose to define a function called int_squared it must be declared inside the procedure where whatis() is used.

Question 19

Maybe this could be useful:

def Retriever(bar):
    return (list(globals().keys()))[list(map(lambda x: id(x), list(globals().values()))).index(id(bar))]

The function goes through the list of IDs of values from the global scope (the namespace could be edited), finds the index of the wanted/required var or function based on its ID, and then returns the name from the list of global names based on the acquired index.

Question 20

Following method will not return the name of variable but using this method you can create data frame easily if variable is available in global scope.

class CustomDict(dict):
    def __add__(self, other):
        return CustomDict({**self, **other})

class GlobalBase(type):
    def __getattr__(cls, key):
        return CustomDict({key: globals()[key]})

    def __getitem__(cls, keys):
        return CustomDict({key: globals()[key] for key in keys})

class G(metaclass=GlobalBase):
    pass

x, y, z = 0, 1, 2

print('method 1:', G['x', 'y', 'z']) # Outcome: method 1: {'x': 0, 'y': 1, 'z': 2}
print('method 2:', G.x + G.y + G.z) # Outcome: method 2: {'x': 0, 'y': 1, 'z': 2}

A = [0, 1]
B = [1, 2]
pd.DataFrame(G.A + G.B) # It will return a data frame with A and B columns

Question 21

For constants, you can use an enum, which supports retrieving its name.

Question 22

I try to get name from inspect locals, but it cann’t process var likes a[1], b.val. After it, I got a new idea — get var name from the code, and I try it succ! code like below:

#direct get from called function code
def retrieve_name_ex(var):
    stacks = inspect.stack()
    try:
        func = stacks[0].function
        code = stacks[1].code_context[0]
        s = code.index(func)
        s = code.index("(", s + len(func)) + 1
        e = code.index(")", s)
        return code[s:e].strip()
    except:
        return ""

Question 23

You can try the following to retrieve the name of a function you defined (does not work for built-in functions though):

import re
def retrieve_name(func):
    return re.match("<function\s+(\w+)\s+at.*", str(func)).group(1)

def foo(x):
    return x**2

print(retrieve_name(foo))
# foo

Question 24

In RStudio, you can run parts of code in the code editing window, and the results appear in the console.

You can also do cool stuff like selecting whether you want everything up to the cursor to run, or everything after the cursor, or just the part that you selected, and so on. And there are hot keys for all that stuff.

It’s like a step above the interactive shell in Python — there you can use readline to go back to previous individual lines, but it doesn’t have any “concept” of what a function is, a section of code, etc.

Is there a tool like that for Python? Or, do you have some sort of similar workaround that you use, say, in vim?

Question 25

IPython Notebooks are awesome. Here’s another, newer browser-based tool I’ve recently discovered: Rodeo. My impression is that it seems to better support an RStudio-like workflow.

Rodeo screenshot

Question 26

Jupyter Notebook (previously known as IPython notebook) is a really cool project for interactive data manipulation in Python (and other languages, including R). It basically allows you to interactively code and document what you’re doing in one interface and later on save it as a:

notebook (.ipynb)
script (a .py file including only the source code)
static html (and therefore pdf as well)

You can even share your notebooks online with others using the nbviewer service, where people publish whole books. Furthermore, GitHub renders your .ipynb files. You can publish your Jupyter Notebooks as reproducible research articles on Authorea. For collaborative editing by multiple users, check out Google Colab built on top of Jupyter.

The default Jupyter Notebook version starts a web application locally (or you deploy it to a server) and you use it from your browser. As Ryan also mentioned in his answer, Rodeo is an interface more similar to RStudio built on top of the Jupyter kernel.

JupyterLab is a newer take on the UI allowing for more flexibility in how you edit your notebooks, control interactive widgets and even run commands in terminal emulators.

There’s also a Qt console for IPython, a similar project with inline plots, which is a desktop application.

Jupyter is a normal Python package and can be installed using pip install jupyter. To get all the scientific libraries running on your computer, however, it might be easier to try the official Jupyter Docker containers. For example, assuming your notebooks are in ~/code/jupyter, you can run the container as:

docker run -it --rm -p 8888:8888 -v ~/code/jupyter:/home/jovyan/work jupyter/datascience-notebook

Question 27

spyder or install python(x,y). it is great.

If you are new to Python, you can install the free Anaconda distribution (http://continuum.io/downloads.html), which will install Spyder for you, as well as Python 2.7 and IPython. Spyder is very similar to RStudio.

Question 28

Check out Rodeo from Yhat if you’re looking for something like RStudio for Python.

Rodeo has:

text editor (uses Atom under the hood)
Vim / Emacs mode
an IPython console
autocomplete
docstrings
ability to see plots, dataframes, variables

Question 29

You might want to look into JupyterLab (the next generation of Jupyter Notbooks): https://github.com/jupyter/jupyterlab.

JupyterLab aims to create a more desktop-like experience on the Web.

Update: As of March 2018 JupyterLab is in beta. “The beta releases are suitable for general usage. For JupyterLab extension developers, the extension APIs will continue to evolve until the 1.0 release. Eventually, JupyterLab will replace the classic Jupyter Notebook after JupyterLab reaches 1.0.“

To run Jupyter Lab as a Desktop Application, see christopherroach.com/articles/jupyterlab-desktop-app (Thanks to PatrickT).

Here’s a quick preview:

You can arrange a notebook next to a graphical console atop a terminal that is monitoring the system, while keeping the file manager on the left:

For more details see: https://blog.jupyter.org/2016/07/14/jupyter-lab-alpha/ and here: http://www.techatbloomberg.com/blog/inside-the-collaboration-that-built-the-open-source-jupyterlab-project/.

Question 30

Pycharm is a really decent IDE. From what I have seen so far it is the most similar to Rstudio. Another nice piece is that it allows you to install new Python libraries in a fashion similar to Rstudio (which otherwise can be a nightmare). There is now a free ‘community’ edition.

enter image description here

Question 31

I think it is worth while to mention that RStudio v1.1.359 Preview is released. It has terminal feature that can be used for Python.

Download is available here

Documentation is available here

Question 32

spyder is you need! https://code.google.com/p/spyderlib/
Spyder (previously known as Pydee) is a powerful interactive development environment for the Python language with advanced editing, interactive testing, debugging and introspection features

Question 33

For a nicer interactive shell for Python, have a look at DreamPie. It’s not really an IDE though (as RStudio seems to be?)

Question 34

Wing IDE, and probably also other Python IDEs like PyCharm and PyDev have features like this. In Wing you can either select and execute code in the integrated Python Shell or if you’re debugging something you can interact with the paused debug program in a shell (called the Debug Probe). There is also special support for matplotlib, in case you’re using that, so that you can work with plots interactively.

Question 35

I just compiled and installed mysqldb for python 2.7 on my mac os 10.6. I created a simple test file that imports

import MySQLdb as mysql

Firstly, this command is red underlined and the info tells me “Unresolved import”. Then I tried to run the following simple python code

import MySQLdb as mysql

def main():
    conn = mysql.connect( charset="utf8", use_unicode=True, host="localhost",user="root", passwd="",db="" )

if __name__ == '__main__'():
    main()

When executing it I get the following error message

Traceback (most recent call last):
  File "/path/to/project/Python/src/cvdv/TestMySQLdb.py", line 4, in <module>
    import MySQLdb as mysql
  File "build/bdist.macosx-10.6-intel/egg/MySQLdb/__init__.py", line 19, in <module>
    \namespace cvdv
  File "build/bdist.macosx-10.6-intel/egg/_mysql.py", line 7, in <module>
  File "build/bdist.macosx-10.6-intel/egg/_mysql.py", line 6, in __bootstrap__
ImportError: dlopen(/Users/toom/.python-eggs/MySQL_python-1.2.3-py2.7-macosx-10.6-intel.egg-tmp/_mysql.so, 2): Library not loaded: libmysqlclient.18.dylib
  Referenced from: /Users/toom/.python-eggs/MySQL_python-1.2.3-py2.7-macosx-10.6-intel.egg-tmp/_mysql.so
  Reason: image not found

What might be the solution to my problem?

EDIT: Actually I found out that the library lies in /usr/local/mysql/lib. So I need to tell my pydev eclipse version where to find it. Where do I set this?

Question 36

I solved the problem by creating a symbolic link to the library. I.e.

The actual library resides in

/usr/local/mysql/lib

And then I created a symbolic link in

/usr/lib

Using the command:

sudo ln -s /usr/local/mysql/lib/libmysqlclient.18.dylib /usr/lib/libmysqlclient.18.dylib

so that I have the following mapping:

ls -l libmysqlclient.18.dylib 
lrwxr-xr-x  1 root  wheel  44 16 Jul 14:01 libmysqlclient.18.dylib -> /usr/local/mysql/lib/libmysqlclient.18.dylib

That was it. After that everything worked fine.

EDIT:

Notice, that since MacOS El Capitan the System Integrity Protection (SIP, also known as “rootless”) will prevent you from creating links in /usr/lib/. You could disable SIP by following these instructions, but you can create a link in /usr/local/lib/ instead:

sudo ln -s /usr/local/mysql/lib/libmysqlclient.18.dylib /usr/local/lib/libmysqlclient.18.dylib

Question 37

I think the maximum integer in python is available by calling sys.maxint.

What is the maximum float or long in Python?

Question 38

For float have a look at sys.float_info:

>>> import sys
>>> sys.float_info
sys.floatinfo(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2
250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsil
on=2.2204460492503131e-16, radix=2, rounds=1)

Specifically, sys.float_info.max:

>>> sys.float_info.max
1.7976931348623157e+308

If that’s not big enough, there’s always positive infinity:

>>> infinity = float("inf")
>>> infinity
inf
>>> infinity / 10000
inf

The long type has unlimited precision, so I think you’re only limited by available memory.

Question 39

sys.maxint is not the largest integer supported by python. It’s the largest integer supported by python’s regular integer type.

Question 40

If you are using numpy, you can use dtype ‘float128‘ and get a max float of 10e+4931

>>> np.finfo(np.float128)
finfo(resolution=1e-18, min=-1.18973149536e+4932, max=1.18973149536e+4932, dtype=float128)

Question 41

I have the following list created from a sorted csv

list1 = sorted(csv1, key=operator.itemgetter(1))

I would actually like to sort the list by two criteria: first by the value in field 1 and then by the value in field 2. How do I do this?

Question 42

like this:

import operator
list1 = sorted(csv1, key=operator.itemgetter(1, 2))

Question 43

No need to import anything when using lambda functions.
The following sorts list by the first element, then by the second element.

sorted(list, key=lambda x: (x[0], -x[1]))

Question 44

Python has a stable sort, so provided that performance isn’t an issue the simplest way is to sort it by field 2 and then sort it again by field 1.

That will give you the result you want, the only catch is that if it is a big list (or you want to sort it often) calling sort twice might be an unacceptable overhead.

list1 = sorted(csv1, key=operator.itemgetter(2))
list1 = sorted(list1, key=operator.itemgetter(1))

Doing it this way also makes it easy to handle the situation where you want some of the columns reverse sorted, just include the ‘reverse=True’ parameter when necessary.

Otherwise you can pass multiple parameters to itemgetter or manually build a tuple. That is probably going to be faster, but has the problem that it doesn’t generalise well if some of the columns want to be reverse sorted (numeric columns can still be reversed by negating them but that stops the sort being stable).

So if you don’t need any columns reverse sorted, go for multiple arguments to itemgetter, if you might, and the columns aren’t numeric or you want to keep the sort stable go for multiple consecutive sorts.

Edit: For the commenters who have problems understanding how this answers the original question, here is an example that shows exactly how the stable nature of the sorting ensures we can do separate sorts on each key and end up with data sorted on multiple criteria:

DATA = [
    ('Jones', 'Jane', 58),
    ('Smith', 'Anne', 30),
    ('Jones', 'Fred', 30),
    ('Smith', 'John', 60),
    ('Smith', 'Fred', 30),
    ('Jones', 'Anne', 30),
    ('Smith', 'Jane', 58),
    ('Smith', 'Twin2', 3),
    ('Jones', 'John', 60),
    ('Smith', 'Twin1', 3),
    ('Jones', 'Twin1', 3),
    ('Jones', 'Twin2', 3)
]

# Sort by Surname, Age DESCENDING, Firstname
print("Initial data in random order")
for d in DATA:
    print("{:10s} {:10s} {}".format(*d))

print('''
First we sort by first name, after this pass all
Twin1 come before Twin2 and Anne comes before Fred''')
DATA.sort(key=lambda row: row[1])

for d in DATA:
    print("{:10s} {:10s} {}".format(*d))

print('''
Second pass: sort by age in descending order.
Note that after this pass rows are sorted by age but
Twin1/Twin2 and Anne/Fred pairs are still in correct
firstname order.''')
DATA.sort(key=lambda row: row[2], reverse=True)
for d in DATA:
    print("{:10s} {:10s} {}".format(*d))

print('''
Final pass sorts the Jones from the Smiths.
Within each family members are sorted by age but equal
age members are sorted by first name.
''')
DATA.sort(key=lambda row: row[0])
for d in DATA:
    print("{:10s} {:10s} {}".format(*d))

This is a runnable example, but to save people running it the output is:

Initial data in random order
Jones      Jane       58
Smith      Anne       30
Jones      Fred       30
Smith      John       60
Smith      Fred       30
Jones      Anne       30
Smith      Jane       58
Smith      Twin2      3
Jones      John       60
Smith      Twin1      3
Jones      Twin1      3
Jones      Twin2      3

First we sort by first name, after this pass all
Twin1 come before Twin2 and Anne comes before Fred
Smith      Anne       30
Jones      Anne       30
Jones      Fred       30
Smith      Fred       30
Jones      Jane       58
Smith      Jane       58
Smith      John       60
Jones      John       60
Smith      Twin1      3
Jones      Twin1      3
Smith      Twin2      3
Jones      Twin2      3

Second pass: sort by age in descending order.
Note that after this pass rows are sorted by age but
Twin1/Twin2 and Anne/Fred pairs are still in correct
firstname order.
Smith      John       60
Jones      John       60
Jones      Jane       58
Smith      Jane       58
Smith      Anne       30
Jones      Anne       30
Jones      Fred       30
Smith      Fred       30
Smith      Twin1      3
Jones      Twin1      3
Smith      Twin2      3
Jones      Twin2      3

Final pass sorts the Jones from the Smiths.
Within each family members are sorted by age but equal
age members are sorted by first name.

Jones      John       60
Jones      Jane       58
Jones      Anne       30
Jones      Fred       30
Jones      Twin1      3
Jones      Twin2      3
Smith      John       60
Smith      Jane       58
Smith      Anne       30
Smith      Fred       30
Smith      Twin1      3
Smith      Twin2      3

Note in particular how in the second step the reverse=True parameter keeps the firstnames in order whereas simply sorting then reversing the list would lose the desired order for the third sort key.

Question 45

list1 = sorted(csv1, key=lambda x: (x[1], x[2]) )

Question 46

employees.sort(key = lambda x:x[1])
employees.sort(key = lambda x:x[0])

We can also use .sort with lambda 2 times because python sort is in place and stable. This will first sort the list according to the second element, x[1]. Then, it will sort the first element, x[0] (highest priority).

employees[0] = Employee's Name
employees[1] = Employee's Salary

This is equivalent to doing the following: employees.sort(key = lambda x:(x[0], x[1]))

Question 47

In ascending order you can use:

sorted_data= sorted(non_sorted_data, key=lambda k: (k[1],k[0]))

or in descending order you can use:

sorted_data= sorted(non_sorted_data, key=lambda k: (k[1],k[0]),reverse=True)

Question 48

Sorting list of dicts using below will sort list in descending order on first column as salary and second column as age

d=[{'salary':123,'age':23},{'salary':123,'age':25}]
d=sorted(d, key=lambda i: (i['salary'], i['age']),reverse=True)

Output: [{‘salary’: 123, ‘age’: 25}, {‘salary’: 123, ‘age’: 23}]

Question 49

What are people’s experiences with any of the Git modules for Python? (I know of GitPython, PyGit, and Dulwich – feel free to mention others if you know of them.)

I am writing a program which will have to interact (add, delete, commit) with a Git repository, but have no experience with Git, so one of the things I’m looking for is ease of use/understanding with regards to Git.

The other things I’m primarily interested in are maturity and completeness of the library, a reasonable lack of bugs, continued development, and helpfulness of the documentation and developers.

If you think of something else I might want/need to know, please feel free to mention it.

Question 50

While this question was asked a while ago and I don’t know the state of the libraries at that point, it is worth mentioning for searchers that GitPython does a good job of abstracting the command line tools so that you don’t need to use subprocess. There are some useful built in abstractions that you can use, but for everything else you can do things like:

import git
repo = git.Repo( '/home/me/repodir' )
print repo.git.status()
# checkout and track a remote branch
print repo.git.checkout( 'origin/somebranch', b='somebranch' )
# add a file
print repo.git.add( 'somefile' )
# commit
print repo.git.commit( m='my commit message' )
# now we are one commit ahead
print repo.git.status()

Everything else in GitPython just makes it easier to navigate. I’m fairly well satisfied with this library and appreciate that it is a wrapper on the underlying git tools.

UPDATE: I’ve switched to using the sh module for not just git but most commandline utilities I need in python. To replicate the above I would do this instead:

import sh
git = sh.git.bake(_cwd='/home/me/repodir')
print git.status()
# checkout and track a remote branch
print git.checkout('-b', 'somebranch')
# add a file
print git.add('somefile')
# commit
print git.commit(m='my commit message')
# now we are one commit ahead
print git.status()

Question 51

I thought I would answer my own question, since I’m taking a different path than suggested in the answers. Nonetheless, thanks to those who answered.

First, a brief synopsis of my experiences with GitPython, PyGit, and Dulwich:

GitPython: After downloading, I got this imported and the appropriate object initialized. However, trying to do what was suggested in the tutorial led to errors. Lacking more documentation, I turned elsewhere.
PyGit: This would not even import, and I could find no documentation.
Dulwich: Seems to be the most promising (at least for what I wanted and saw). I made some progress with it, more than with GitPython, since its egg comes with Python source. However, after a while, I decided it may just be easier to try what I did.

Also, StGit looks interesting, but I would need the functionality extracted into a separate module and do not want wait for that to happen right now.

In (much) less time than I spent trying to get the three modules above working, I managed to get git commands working via the subprocess module, e.g.

def gitAdd(fileName, repoDir):
    cmd = ['git', 'add', fileName]
    p = subprocess.Popen(cmd, cwd=repoDir)
    p.wait()

gitAdd('exampleFile.txt', '/usr/local/example_git_repo_dir')

This isn’t fully incorporated into my program yet, but I’m not anticipating a problem, except maybe speed (since I’ll be processing hundreds or even thousands of files at times).

Maybe I just didn’t have the patience to get things going with Dulwich or GitPython. That said, I’m hopeful the modules will get more development and be more useful soon.

Question 52

I’d recommend pygit2 – it uses the excellent libgit2 bindings

Question 53

This is a pretty old question, and while looking for Git libraries, I found one that was made this year (2013) called Gittle.

It worked great for me (where the others I tried were flaky), and seems to cover most of the common actions.

Some examples from the README:

from gittle import Gittle

# Clone a repository
repo_path = '/tmp/gittle_bare'
repo_url = 'git://github.com/FriendCode/gittle.git'
repo = Gittle.clone(repo_url, repo_path)

# Stage multiple files
repo.stage(['other1.txt', 'other2.txt'])

# Do the commit
repo.commit(name="Samy Pesse", email="samy@friendco.de", message="This is a commit")

# Authentication with RSA private key
key_file = open('/Users/Me/keys/rsa/private_rsa')
repo.auth(pkey=key_file)

# Do push
repo.push()

Question 54

Maybe it helps, but Bazaar and Mercurial are both using dulwich for their Git interoperability.

Dulwich is probably different than the other in the sense that’s it’s a reimplementation of git in python. The other might just be a wrapper around Git’s commands (so it could be simpler to use from a high level point of view: commit/add/delete), it probably means their API is very close to git’s command line so you’ll need to gain experience with Git.

Question 55

An updated answer reflecting changed times:

GitPython currently is the easiest to use. It supports wrapping of many git plumbing commands and has pluggable object database (dulwich being one of them), and if a command isn’t implemented, provides an easy api for shelling out to the command line. For example:

repo = Repo('.')
repo.checkout(b='new_branch')

This calls:

bash$ git checkout -b new_branch

Dulwich is also good but much lower level. It’s somewhat of a pain to use because it requires operating on git objects at the plumbing level and doesn’t have nice porcelain that you’d normally want to do. However, if you plan on modifying any parts of git, or use git-receive-pack and git-upload-pack, you need to use dulwich.

Question 56

For the sake of completeness, http://github.com/alex/pyvcs/ is an abstraction layer for all dvcs’s. It uses dulwich, but provides interop with the other dvcs’s.

Question 57

Here’s a really quick implementation of “git status”:

import os
import string
from subprocess import *

repoDir = '/Users/foo/project'

def command(x):
    return str(Popen(x.split(' '), stdout=PIPE).communicate()[0])

def rm_empty(L): return [l for l in L if (l and l!="")]

def getUntracked():
    os.chdir(repoDir)
    status = command("git status")
    if "# Untracked files:" in status:
        untf = status.split("# Untracked files:")[1][1:].split("\n")
        return rm_empty([x[2:] for x in untf if string.strip(x) != "#" and x.startswith("#\t")])
    else:
        return []

def getNew():
    os.chdir(repoDir)
    status = command("git status").split("\n")
    return [x[14:] for x in status if x.startswith("#\tnew file:   ")]

def getModified():
    os.chdir(repoDir)
    status = command("git status").split("\n")
    return [x[14:] for x in status if x.startswith("#\tmodified:   ")]

print("Untracked:")
print( getUntracked() )
print("New:")
print( getNew() )
print("Modified:")
print( getModified() )

Question 58

PTBNL’s Answer is quite perfect for me. I make a little more for Windows user.

import time
import subprocess
def gitAdd(fileName, repoDir):
    cmd = 'git add ' + fileName
    pipe = subprocess.Popen(cmd, shell=True, cwd=repoDir,stdout = subprocess.PIPE,stderr = subprocess.PIPE )
    (out, error) = pipe.communicate()
    print out,error
    pipe.wait()
    return 

def gitCommit(commitMessage, repoDir):
    cmd = 'git commit -am "%s"'%commitMessage
    pipe = subprocess.Popen(cmd, shell=True, cwd=repoDir,stdout = subprocess.PIPE,stderr = subprocess.PIPE )
    (out, error) = pipe.communicate()
    print out,error
    pipe.wait()
    return 
def gitPush(repoDir):
    cmd = 'git push '
    pipe = subprocess.Popen(cmd, shell=True, cwd=repoDir,stdout = subprocess.PIPE,stderr = subprocess.PIPE )
    (out, error) = pipe.communicate()
    pipe.wait()
    return 

temp=time.localtime(time.time())
uploaddate= str(temp[0])+'_'+str(temp[1])+'_'+str(temp[2])+'_'+str(temp[3])+'_'+str(temp[4])

repoDir='d:\\c_Billy\\vfat\\Programming\\Projector\\billyccm' # your git repository , windows your need to use double backslash for right directory.
gitAdd('.',repoDir )
gitCommit(uploaddate, repoDir)
gitPush(repoDir)

Question 59

The git interaction library part of StGit is actually pretty good. However, it isn’t broken out as a separate package but if there is sufficient interest, I’m sure that can be fixed.

It has very nice abstractions for representing commits, trees etc, and for creating new commits and trees.

Question 60

For the record, none of the aforementioned Git Python libraries seem to contain a “git status” equivalent, which is really the only thing I would want since dealing with the rest of the git commands via subprocess is so easy.

Question 61

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

Question 62

Here’s a small kmeans that uses any of the 20-odd distances in scipy.spatial.distance, or a user function.
Comments would be welcome (this has had only one user so far, not enough); in particular, what are your N, dim, k, metric ?

#!/usr/bin/env python
# kmeans.py using any of the 20-odd metrics in scipy.spatial.distance
# kmeanssample 2 pass, first sample sqrt(N)

from __future__ import division
import random
import numpy as np
from scipy.spatial.distance import cdist  # $scipy/spatial/distance.py
    # http://docs.scipy.org/doc/scipy/reference/spatial.html
from scipy.sparse import issparse  # $scipy/sparse/csr.py

__date__ = "2011-11-17 Nov denis"
    # X sparse, any cdist metric: real app ?
    # centres get dense rapidly, metrics in high dim hit distance whiteout
    # vs unsupervised / semi-supervised svm

#...............................................................................
def kmeans( X, centres, delta=.001, maxiter=10, metric="euclidean", p=2, verbose=1 ):
    """ centres, Xtocentre, distances = kmeans( X, initial centres ... )
    in:
        X N x dim  may be sparse
        centres k x dim: initial centres, e.g. random.sample( X, k )
        delta: relative error, iterate until the average distance to centres
            is within delta of the previous average distance
        maxiter
        metric: any of the 20-odd in scipy.spatial.distance
            "chebyshev" = max, "cityblock" = L1, "minkowski" with p=
            or a function( Xvec, centrevec ), e.g. Lqmetric below
        p: for minkowski metric -- local mod cdist for 0 < p < 1 too
        verbose: 0 silent, 2 prints running distances
    out:
        centres, k x dim
        Xtocentre: each X -> its nearest centre, ints N -> k
        distances, N
    see also: kmeanssample below, class Kmeans below.
    """
    if not issparse(X):
        X = np.asanyarray(X)  # ?
    centres = centres.todense() if issparse(centres) \
        else centres.copy()
    N, dim = X.shape
    k, cdim = centres.shape
    if dim != cdim:
        raise ValueError( "kmeans: X %s and centres %s must have the same number of columns" % (
            X.shape, centres.shape ))
    if verbose:
        print "kmeans: X %s  centres %s  delta=%.2g  maxiter=%d  metric=%s" % (
            X.shape, centres.shape, delta, maxiter, metric)
    allx = np.arange(N)
    prevdist = 0
    for jiter in range( 1, maxiter+1 ):
        D = cdist_sparse( X, centres, metric=metric, p=p )  # |X| x |centres|
        xtoc = D.argmin(axis=1)  # X -> nearest centre
        distances = D[allx,xtoc]
        avdist = distances.mean()  # median ?
        if verbose >= 2:
            print "kmeans: av |X - nearest centre| = %.4g" % avdist
        if (1 - delta) * prevdist <= avdist <= prevdist \
        or jiter == maxiter:
            break
        prevdist = avdist
        for jc in range(k):  # (1 pass in C)
            c = np.where( xtoc == jc )[0]
            if len(c) > 0:
                centres[jc] = X[c].mean( axis=0 )
    if verbose:
        print "kmeans: %d iterations  cluster sizes:" % jiter, np.bincount(xtoc)
    if verbose >= 2:
        r50 = np.zeros(k)
        r90 = np.zeros(k)
        for j in range(k):
            dist = distances[ xtoc == j ]
            if len(dist) > 0:
                r50[j], r90[j] = np.percentile( dist, (50, 90) )
        print "kmeans: cluster 50 % radius", r50.astype(int)
        print "kmeans: cluster 90 % radius", r90.astype(int)
            # scale L1 / dim, L2 / sqrt(dim) ?
    return centres, xtoc, distances

#...............................................................................
def kmeanssample( X, k, nsample=0, **kwargs ):
    """ 2-pass kmeans, fast for large N:
        1) kmeans a random sample of nsample ~ sqrt(N) from X
        2) full kmeans, starting from those centres
    """
        # merge w kmeans ? mttiw
        # v large N: sample N^1/2, N^1/2 of that
        # seed like sklearn ?
    N, dim = X.shape
    if nsample == 0:
        nsample = max( 2*np.sqrt(N), 10*k )
    Xsample = randomsample( X, int(nsample) )
    pass1centres = randomsample( X, int(k) )
    samplecentres = kmeans( Xsample, pass1centres, **kwargs )[0]
    return kmeans( X, samplecentres, **kwargs )

def cdist_sparse( X, Y, **kwargs ):
    """ -> |X| x |Y| cdist array, any cdist metric
        X or Y may be sparse -- best csr
    """
        # todense row at a time, v slow if both v sparse
    sxy = 2*issparse(X) + issparse(Y)
    if sxy == 0:
        return cdist( X, Y, **kwargs )
    d = np.empty( (X.shape[0], Y.shape[0]), np.float64 )
    if sxy == 2:
        for j, x in enumerate(X):
            d[j] = cdist( x.todense(), Y, **kwargs ) [0]
    elif sxy == 1:
        for k, y in enumerate(Y):
            d[:,k] = cdist( X, y.todense(), **kwargs ) [0]
    else:
        for j, x in enumerate(X):
            for k, y in enumerate(Y):
                d[j,k] = cdist( x.todense(), y.todense(), **kwargs ) [0]
    return d

def randomsample( X, n ):
    """ random.sample of the rows of X
        X may be sparse -- best csr
    """
    sampleix = random.sample( xrange( X.shape[0] ), int(n) )
    return X[sampleix]

def nearestcentres( X, centres, metric="euclidean", p=2 ):
    """ each X -> nearest centre, any metric
            euclidean2 (~ withinss) is more sensitive to outliers,
            cityblock (manhattan, L1) less sensitive
    """
    D = cdist( X, centres, metric=metric, p=p )  # |X| x |centres|
    return D.argmin(axis=1)

def Lqmetric( x, y=None, q=.5 ):
    # yes a metric, may increase weight of near matches; see ...
    return (np.abs(x - y) ** q) .mean() if y is not None \
        else (np.abs(x) ** q) .mean()

#...............................................................................
class Kmeans:
    """ km = Kmeans( X, k= or centres=, ... )
        in: either initial centres= for kmeans
            or k= [nsample=] for kmeanssample
        out: km.centres, km.Xtocentre, km.distances
        iterator:
            for jcentre, J in km:
                clustercentre = centres[jcentre]
                J indexes e.g. X[J], classes[J]
    """
    def __init__( self, X, k=0, centres=None, nsample=0, **kwargs ):
        self.X = X
        if centres is None:
            self.centres, self.Xtocentre, self.distances = kmeanssample(
                X, k=k, nsample=nsample, **kwargs )
        else:
            self.centres, self.Xtocentre, self.distances = kmeans(
                X, centres, **kwargs )

    def __iter__(self):
        for jc in range(len(self.centres)):
            yield jc, (self.Xtocentre == jc)

#...............................................................................
if __name__ == "__main__":
    import random
    import sys
    from time import time

    N = 10000
    dim = 10
    ncluster = 10
    kmsample = 100  # 0: random centres, > 0: kmeanssample
    kmdelta = .001
    kmiter = 10
    metric = "cityblock"  # "chebyshev" = max, "cityblock" L1,  Lqmetric
    seed = 1

    exec( "\n".join( sys.argv[1:] ))  # run this.py N= ...
    np.set_printoptions( 1, threshold=200, edgeitems=5, suppress=True )
    np.random.seed(seed)
    random.seed(seed)

    print "N %d  dim %d  ncluster %d  kmsample %d  metric %s" % (
        N, dim, ncluster, kmsample, metric)
    X = np.random.exponential( size=(N,dim) )
        # cf scikits-learn datasets/
    t0 = time()
    if kmsample > 0:
        centres, xtoc, dist = kmeanssample( X, ncluster, nsample=kmsample,
            delta=kmdelta, maxiter=kmiter, metric=metric, verbose=2 )
    else:
        randomcentres = randomsample( X, ncluster )
        centres, xtoc, dist = kmeans( X, randomcentres,
            delta=kmdelta, maxiter=kmiter, metric=metric, verbose=2 )
    print "%.0f msec" % ((time() - t0) * 1000)

    # also ~/py/np/kmeans/test-kmeans.py

Some notes added 26mar 2012:

1) for cosine distance, first normalize all the data vectors to |X| = 1; then

cosinedistance( X, Y ) = 1 - X . Y = Euclidean distance |X - Y|^2 / 2

is fast. For bit vectors, keep the norms separately from the vectors instead of expanding out to floats (although some programs may expand for you). For sparse vectors, say 1 % of N, X . Y should take time O( 2 % N ), space O(N); but I don’t know which programs do that.

2) Scikit-learn clustering gives an excellent overview of k-means, mini-batch-k-means … with code that works on scipy.sparse matrices.

3) Always check cluster sizes after k-means. If you’re expecting roughly equal-sized clusters, but they come out [44 37 9 5 5] % … (sound of head-scratching).

Question 63

Unfortunately no: scikit-learn current implementation of k-means only uses Euclidean distances.

It is not trivial to extend k-means to other distances and denis’ answer above is not the correct way to implement k-means for other metrics.

Question 64

Just use nltk instead where you can do this, e.g.

from nltk.cluster.kmeans import KMeansClusterer
NUM_CLUSTERS = <choose a value>
data = <sparse matrix that you would normally give to scikit>.toarray()

kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance, repeats=25)
assigned_clusters = kclusterer.cluster(data, assign_clusters=True)

问题：将变量名作为字符串获取

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

回答 12

回答 13

回答 14

回答 15

回答 16

回答 17

回答 18

回答 19

回答 20

回答 21

问题：是否有类似RStudio for Python的东西？[关闭]

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

问题：Python mysqldb：库未加载：libmysqlclient.18.dylib

回答 0

问题：Python中的最大浮点数是多少？

回答 0

回答 1

回答 2

问题：按两个字段对Python列表进行排序

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

问题：Python Git模块的经验？[关闭]

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

问题：是否可以使用scikit-learn K-Means聚类指定自己的距离函数？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

问题：sys.stdout.flush（）方法的用法

回答 0

回答 1

回答 2

回答 3

回答 4

问题：json.load（）和json.loads（）函数有什么区别

回答 0

回答 1

回答 2

回答 3

快速解答（非常简化！）