Tag Archives: Python

Threads in a PyQt application: Use Qt threads or Python threads?

Question: Threads in a PyQt application: Use Qt threads or Python threads?


I’m writing a GUI application that regularly retrieves data through a web connection. Since this retrieval takes a while, this causes the UI to be unresponsive during the retrieval process (it cannot be split into smaller parts). This is why I’d like to outsource the web connection to a separate worker thread.

[Yes, I know, now I have two problems.]

Anyway, the application uses PyQt4, so I’d like to know which is the better choice: use Qt’s threads, or use the Python threading module? What are the advantages/disadvantages of each? Or do you have a totally different suggestion?

Edit (re bounty): While the solution in my particular case will probably be using a non-blocking network request, as Jeff Ober and Lukáš Lalinský suggested (so basically leaving the concurrency problems to the networking implementation), I’d still like a more in-depth answer to the general question:

What are the advantages and disadvantages of using PyQt4’s (i.e. Qt’s) threads over native Python threads (from the threading module)?


Edit 2: Thanks, all, for your answers. Although there’s no 100% agreement, there seems to be widespread consensus that the answer is “use Qt”, since the advantage of that is integration with the rest of the library, while causing no real disadvantages.

For anyone looking to choose between the two threading implementations, I highly recommend they read all the answers provided here, including the PyQt mailing list thread that abbot links to.

There were several answers I considered for the bounty; in the end I chose abbot’s for the very relevant external reference; it was, however, a close call.

Thanks again.


Answer 0


This was discussed not too long ago on the PyQt mailing list. Quoting Giovanni Bajo’s comments on the subject:

It’s mostly the same. The main difference is that QThreads are better integrated with Qt (asynchronous signals/slots, event loop, etc.). Also, you can’t use Qt from a Python thread (you can’t, for instance, post an event to the main thread through QApplication.postEvent): you need a QThread for that to work.

A general rule of thumb might be to use QThreads if you’re going to interact somehow with Qt, and use Python threads otherwise.

And an earlier comment on this subject from PyQt’s author: “they are both wrappers around the same native thread implementations”. And both implementations use the GIL in the same way.
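As an illustration of that integration, here is a minimal sketch (assuming PyQt4 with new-style signals; the URL and the slot name are made up for the example) of a worker QThread that performs the blocking download and hands the result back to the GUI thread via a signal:

import urllib2
from PyQt4 import QtCore

class DownloadThread(QtCore.QThread):
    # custom signal; emitting it from the worker delivers a queued,
    # thread-safe call into the GUI thread
    data_ready = QtCore.pyqtSignal(str)

    def __init__(self, url, parent=None):
        super(DownloadThread, self).__init__(parent)
        self.url = url

    def run(self):
        # blocking I/O is fine here: it runs in the worker thread
        data = urllib2.urlopen(self.url).read()
        self.data_ready.emit(data)

# In the GUI thread:
#   thread = DownloadThread('http://example.com/data')
#   thread.data_ready.connect(on_data_ready)  # queued connection across threads
#   thread.start()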


Answer 1


Python’s threads will be simpler and safer, and since this is an I/O-based application, they are able to bypass the GIL. That said, have you considered non-blocking I/O using Twisted, or non-blocking sockets with select?

EDIT: more on threads

Python threads

Python’s threads are system threads. However, Python uses a global interpreter lock (GIL) to ensure that the interpreter only ever executes bytecode in one thread at a time, switching threads after a fixed-size block of bytecode instructions. Luckily, Python releases the GIL during input/output operations, making threads useful for simulating non-blocking I/O.

Important caveat: This can be misleading, since the number of byte-code instructions does not correspond to the number of lines in a program. Even a single assignment may not be atomic in Python, so a mutex lock is necessary for any block of code that must be executed atomically, even with the GIL.
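For instance, a minimal sketch of that caveat (the names are made up): even counter += 1 compiles to several bytecode instructions, so two threads can interleave between the read and the write unless you guard it with a lock:

import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    # counter += 1 is a read-modify-write, hence not atomic: guard it
    with lock:
        counter += 1

threads = [threading.Thread(target=increment) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print counter  # reliably 10 only because of the lock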

QT threads

When Python hands off control to a 3rd party compiled module, it releases the GIL. It becomes the responsibility of the module to ensure atomicity where required. When control is passed back, Python will use the GIL. This can make using 3rd party libraries in conjunction with threads confusing. It is even more difficult to use an external threading library because it adds uncertainty as to where and when control is in the hands of the module vs the interpreter.

QT threads operate with the GIL released. QT threads are able to execute QT library code (and other compiled module code that does not acquire the GIL) concurrently. However, the Python code executed within the context of a QT thread still acquires the GIL, and now you have to manage two sets of logic for locking your code.

In the end, both QT threads and Python threads are wrappers around system threads. Python threads are marginally safer to use, since those parts that are not written in Python (implicitly using the GIL) use the GIL in any case (although the caveat above still applies.)

Non-blocking I/O

Threads add extraordinary complexity to your application, especially when dealing with the already complex interaction between the Python interpreter and compiled module code. While many find event-based programming difficult to follow, event-based, non-blocking I/O is often much less difficult to reason about than threads.

With asynchronous I/O, you can always be sure that, for each open descriptor, the path of execution is consistent and orderly. There are, obviously, issues that must be addressed, such as what to do when code depending on one open channel further depends on the results of code to be called when another open channel returns data.

One nice solution for event-based, non-blocking I/O is the new Diesel library. It is restricted to Linux at the moment, but it is extraordinarily fast and quite elegant.

It is also worth your time to learn pyevent, a wrapper around the wonderful libevent library, which provides a basic framework for event-based programming using the fastest available method for your system (determined at compile time).
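For comparison, here is a minimal sketch of the plain sockets/select approach mentioned at the top of this answer, using only the standard library (an echo server; the port and buffer size are arbitrary):

import select
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setblocking(0)                 # never block in accept()
server.bind(('localhost', 8000))
server.listen(5)

inputs = [server]
while inputs:
    # block only inside select(), never in the socket calls themselves
    readable, _, _ = select.select(inputs, [], [])
    for s in readable:
        if s is server:
            conn, _ = s.accept()
            conn.setblocking(0)
            inputs.append(conn)
        else:
            data = s.recv(1024)
            if data:
                s.send(data)          # echo the data back
            else:                     # empty read means the peer closed
                inputs.remove(s)
                s.close()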


Answer 2


The advantage of QThread is that it’s integrated with the rest of the Qt library. That is, thread-aware methods in Qt will need to know in which thread they run, and to move objects between threads, you will need to use QThread. Another useful feature is running your own event loop in a thread.

If you are accessing an HTTP server, you should consider QNetworkAccessManager.
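That last suggestion often removes the need for a worker thread entirely. A minimal sketch (assuming PyQt4; the URL and handler are made up): QNetworkAccessManager performs the request asynchronously on the Qt event loop and signals when the reply is ready:

from PyQt4 import QtCore, QtNetwork

manager = QtNetwork.QNetworkAccessManager()

def handle_reply(reply):
    # called back on the GUI thread once the response has arrived
    print reply.readAll()
    reply.deleteLater()

manager.finished.connect(handle_reply)
manager.get(QtNetwork.QNetworkRequest(QtCore.QUrl('http://example.com/data')))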


Answer 3


I asked myself the same question when I was working on PyTalk.

If you are using Qt, you need to use QThread to be able to use the Qt framework, and especially the signal/slot system.

With the signal/slot engine, you will be able to talk from one thread to another and with every part of your project.

Moreover, there is no real performance question about this choice, since both are C++ bindings.

Here is my experience of PyQt and threads.

I encourage you to use QThread.


Answer 4


Jeff has some good points. Only the main thread can do any GUI updates. If you do need to update the GUI from within a thread, Qt 4’s queued connection signals make it easy to send data across threads, and they will automatically be used if you’re using QThread; I’m not sure if they will be if you’re using Python threads, although it’s easy to add a parameter to connect().


Answer 5


I can’t really recommend either, but I can try describing differences between CPython and Qt threads.

First of all, CPython threads do not run concurrently, at least not Python code. Yes, they do create system threads for each Python thread; however, only the thread currently holding the Global Interpreter Lock is allowed to run (C extensions and FFI code might bypass it, but Python bytecode is not executed while a thread doesn’t hold the GIL).

On the other hand, we have Qt threads, which are basically a common layer over system threads, have no Global Interpreter Lock, and are thus capable of running concurrently. I’m not sure how PyQt deals with it; however, unless your Qt threads call Python code, they should be able to run concurrently (bar the various extra locks that might be implemented in various structures).

For extra fine-tuning, you can modify the number of bytecode instructions that are interpreted before switching ownership of the GIL. Lower values mean more context switching (and possibly higher responsiveness) but lower performance per individual thread (context switches have their cost; if you try switching every few instructions, it doesn’t help speed).
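In CPython 2 this knob is sys.setcheckinterval (a sketch; 100 instructions is the default):

import sys

print sys.getcheckinterval()  # default: 100 bytecode instructions
# Switch threads more often: potentially snappier GUI,
# but more context-switch overhead per individual thread.
sys.setcheckinterval(20)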

Hope it helps with your problems :)


Answer 6


I can’t comment on the exact differences between Python and PyQt threads, but I’ve been doing what you’re attempting to do using QThread and QNetworkAccessManager, and making sure to call QApplication.processEvents() while the thread is alive. If GUI responsiveness is really the issue you’re trying to solve, the latter will help.


Have the same README both in Markdown and reStructuredText

Question: Have the same README both in Markdown and reStructuredText


I have a project hosted on GitHub. For this I have written my README using the Markdown syntax in order to have it nicely formatted on GitHub.

As my project is in Python I also plan to upload it to PyPi. The syntax used for READMEs on PyPi is reStructuredText.

I would like to avoid having to handle two READMEs containing roughly the same content; so I searched for a markdown to RST (or the other way around) translator, but couldn’t find any.

The other solution I see is to perform a markdown/HTML and then an HTML/RST translation. I found some resources for this here and here, so I guess it should be possible.

Would you have any idea that could fit better with what I want to do?


Answer 0


I would recommend Pandoc, the “swiss-army knife for converting files from one markup format into another” (check out the diagram of supported conversions at the bottom of the page, it is quite impressive). Pandoc allows markdown to reStructuredText translation directly. There is also an online editor here which lets you try it out, so you could simply use the online editor to convert your README files.
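For a one-off conversion, it is a single shell command (a sketch, assuming pandoc is on your PATH):

pandoc README.md --from=markdown --to=rst -o README.rst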


Answer 1


As @Chris suggested, you can use Pandoc to convert Markdown to RST. This can be simply automated using the pypandoc module and some magic in setup.py:

from setuptools import setup
try:
    from pypandoc import convert
    read_md = lambda f: convert(f, 'rst')
except ImportError:
    print("warning: pypandoc module not found, could not convert Markdown to RST")
    read_md = lambda f: open(f, 'r').read()

setup(
    # name, version, ...
    long_description=read_md('README.md'),
    install_requires=[]
)

This will automatically convert README.md to RST for the long description used on PyPI. When pypandoc is not available, it just reads README.md without the conversion, so that others aren’t forced to install pypandoc when they just want to build the module, not upload it to PyPI.

So you can write in Markdown as usual and not care about the RST mess anymore. ;)


Answer 2


2019 Update

The PyPI Warehouse now supports rendering Markdown as well! You just need to update your package configuration and add long_description_content_type='text/markdown' to it, e.g.:

setup(
    name='an_example_package',
    # other arguments omitted
    long_description=long_description,
    long_description_content_type='text/markdown'
)

Therefore, there is no need to keep the README in two formats any longer.

You can find more information about it in the documentation.

Old answer:

The Markup library used by GitHub supports reStructuredText. This means you can write a README.rst file.

They even support syntax-specific color highlighting using the code and code-block directives (Example).


Answer 3


PyPI now supports Markdown for long descriptions!

In setup.py, set long_description to a Markdown string, add long_description_content_type="text/markdown" and make sure you’re using recent tooling (setuptools 38.6.0+, twine 1.11+).

See Dustin Ingram’s blog post for more details.


Answer 4


For my requirements I didn’t want to install Pandoc on my computer. I used Docverter, a document conversion server with an HTTP interface that uses Pandoc for this.

import requests
r = requests.post(url='http://c.docverter.com/convert',
                  data={'to':'rst','from':'markdown'},
                  files={'input_files[]':open('README.md','rb')})
if r.ok:
    print r.content

Answer 5


You might also be interested in the fact that it is possible to write in a common subset so that your document comes out the same way when rendered as markdown or rendered as reStructuredText: https://gist.github.com/dupuy/1855764


Answer 6


I ran into this problem and solved it with the two following bash scripts.

Note that I have LaTeX bundled into my Markdown.

#!/usr/bin/env bash

if [ $# -lt 1 ]; then
  echo "$0 file.md"
  exit;
fi

filename=$(basename "$1")
extension="${filename##*.}"
filename="${filename%.*}"

if [ "$extension" = "md" ]; then
  rst=".rst"
  pandoc $1 -o $filename$rst
fi

It’s also useful to convert to HTML. md2html:

#!/usr/bin/env bash

if [ $# -lt 1 ]; then
  echo "$0 file.md <style.css>"
  exit;
fi

filename=$(basename "$1")
extension="${filename##*.}"
filename="${filename%.*}"

if [ "$extension" = "md" ]; then
  html=".html"
  if [ -z $2 ]; then
    # if no css
    pandoc -s -S --mathjax --highlight-style pygments $1 -o $filename$html
  else
    pandoc -s -S --mathjax --highlight-style pygments -c $2 $1 -o $filename$html
  fi
fi

I hope that helps.


Answer 7


Using the pandoc tool suggested by others, I created a md2rst utility to create the rst files. Even though this solution means you have both an md and an rst, it seemed to be the least invasive and would allow for whatever future markdown support is added. I prefer it over altering setup.py, and maybe you would as well:

#!/usr/bin/env python

'''
Recursively and destructively creates a .rst file for all Markdown
files in the target directory and below.

Created to deal with PyPa without changing anything in setup based on
the idea that getting proper Markdown support later is worth waiting
for rather than forcing a pandoc dependency in sample packages and such.

Vote for
(https://bitbucket.org/pypa/pypi/issue/148/support-markdown-for-readmes)

'''

import sys, os, re

markdown_sufs = ('.md','.markdown','.mkd')
markdown_regx = '\.(md|markdown|mkd)$'

target = '.'
if len(sys.argv) >= 2: target = sys.argv[1]

md_files = []
for root, dirnames, filenames in os.walk(target):
    for name in filenames:
        if name.endswith(markdown_sufs):
            md_files.append(os.path.join(root, name))

for md in md_files:
    bare = re.sub(markdown_regx,'',md)
    cmd='pandoc --from=markdown --to=rst "{}" -o "{}.rst"'
    print(cmd.format(md,bare))
    os.system(cmd.format(md,bare))

Flask vs webapp2 (for Google App Engine)

Question: Flask vs webapp2 (for Google App Engine)


I’m starting a new Google App Engine application and am currently considering two frameworks: Flask and webapp2. I’m rather satisfied with the built-in webapp framework that I used for my previous App Engine application, so I think webapp2 will be even better and I won’t have any problems with it.

However, there are a lot of good reviews of Flask; I really like its approach and everything that I’ve read so far in the documentation, and I want to try it out. But I’m a bit concerned about the limitations that I could face down the road with Flask.

So, the question is: do you know of any problems, performance issues, or limitations (e.g. routing system, built-in authorization mechanism, etc.) that Flask could bring into a Google App Engine application? By “problem” I mean something that I can’t work around in several lines of code (or any reasonable amount of code and effort), or something that is completely impossible.

And as a follow-up question: are there any killer features in Flask that you think could blow my mind and make me use it despite any problems that I can face?


Answer 0


Disclaimer: I’m the author of tipfy and webapp2.

A big advantage of sticking with webapp (or its natural evolution, webapp2) is that you don’t have to create your own versions for existing SDK handlers for your framework of your choice.

For example, deferred uses a webapp handler. To use it in a pure Flask view, using werkzeug.Request and werkzeug.Response, you’ll need to implement deferred for it (like I did here for tipfy).

The same happens for other handlers: blobstore (Werkzeug still doesn’t support range requests, so you’ll need to use WebOb even if you create your own handler — see tipfy.appengine.blobstore), mail, XMPP and so on, or others that are included in the SDK in the future.

And the same happens for libraries created with App Engine in mind, like ProtoRPC, which is based on webapp and would need a port or adapter to work with other frameworks, if you don’t want to mix webapp and your-framework-of-choice handlers in the same app.

So, even if you choose a different framework, you’ll end up a) using webapp in some special cases, or b) having to create and maintain your own versions of specific SDK handlers or features, if you use them.

I much prefer Werkzeug over WebOb, but after over a year of porting and maintaining versions of the SDK handlers that work natively with tipfy, I realized that this is a lost cause: to support GAE for the long term, it is best to stay close to webapp/WebOb. It makes support for SDK libraries a breeze, maintenance becomes a lot easier, it is more future-proof as new libraries and SDK features will work out of the box, and there’s the benefit of a large community working around the same App Engine tools.

A specific webapp2 defense is summarized here. Add to that the fact that webapp2 can be used outside of App Engine and is easily customized to look like popular micro-frameworks, and you have a good set of compelling reasons to go for it. Also, webapp2 has a big chance of being included in a future SDK release (this is extra-official, don’t quote me :-), which would push it forward and bring new developers and contributions.

That said, I’m a big fan of Werkzeug and the Pocoo guys and borrowed a lot from Flask and others (web.py, Tornado), but — and, you know, I’m biased — the above webapp2 benefits should be taken into account.


Answer 1


Your question is extremely broad, but there appear to be no big problems using Flask on Google App Engine.

This mailing list thread links to several templates:

http://flask.pocoo.org/mailinglist/archive/2011/3/27/google-app-engine/#4f95bab1627a24922c60ad1d0a0a8e44

And here is a tutorial specific to the Flask / App Engine combination:

http://www.franciscosouza.com/2010/08/flying-with-flask-on-google-app-engine/

Also, see App Engine – Difficulty Accessing Twitter Data – Flask, Flask message flashing fails across redirects, and How do I manage third-party Python libraries with Google App Engine? (virtualenv? pip?) for issues people have had with Flask and Google App Engine.


Answer 2


For me the decision for webapp2 was easy when I discovered that flask is not an object-oriented framework (from the beginning), while webapp2 is a purely object-oriented framework. webapp2 uses method-based dispatching as the standard for all RequestHandlers (that is what the flask documentation calls it, and flask has implemented it since v0.7 in MethodViews). While in flask MethodViews are an add-on, they are a core design principle for webapp2, so your software design will look different depending on which framework you use. Both frameworks nowadays use jinja2 templates and are fairly identical in features.

I prefer to add security checks to a base-class RequestHandler and inherit from it. This is also good for utility functions, etc. As you can see in link [3], for example, you can override methods to prevent dispatching a request.

If you are an OO person, or if you need to design a REST server, I would recommend webapp2. If you prefer simple functions with decorators as handlers for multiple request types, or you are uncomfortable with OO inheritance, then choose flask. I think both frameworks avoid the complexity and dependencies of much bigger frameworks like Pyramid. A minimal sketch of webapp2’s dispatching is shown after the links below.

  1. http://flask.pocoo.org/docs/0.10/views/#method-based-dispatching
  2. https://webapp-improved.appspot.com/guide/handlers.html
  3. https://webapp-improved.appspot.com/guide/handlers.html#overriding-dispatch
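For reference, here is a minimal sketch of webapp2’s method-based dispatching (a made-up hello-world handler, assuming the webapp2 package):

import webapp2

class HelloHandler(webapp2.RequestHandler):
    def get(self):
        # GET requests are dispatched to get(), POST requests to post(), etc.
        self.response.write('Hello, webapp2!')

app = webapp2.WSGIApplication([('/', HelloHandler)], debug=True)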

Answer 3


I think Google App Engine officially supports the Flask framework. There is sample code and a tutorial here -> https://console.developers.google.com/start/appengine?_ga=1.36257892.596387946.1427891855


Answer 4


I didn’t try webapp2, and I found that tipfy was a bit difficult to use since it required setup scripts and builds that configure your Python installation to something other than the default. For these and other reasons I haven’t made my largest project depend on a framework; I use the plain webapp instead and add the library called beaker to get session capability. Django already has built-in translations for words common to many use cases, so when building a localized application Django was the right choice for my largest project. The two other frameworks I actually deployed with projects to a production environment were GAEframework.com and web2py, and generally it seems that adding a framework which changes its template engine can lead to incompatibilities between old and new versions.

So my experience is that I’m reluctant to add a framework to my projects unless it solves the more advanced use cases (file upload, multi-auth, and an admin UI are three examples of more advanced use cases that no framework for GAE handles well at the moment).


TemplateDoesNotExist - Django error

Question: TemplateDoesNotExist - Django error


I’m using Django REST Framework, and I keep getting an error:

Exception Type: TemplateDoesNotExist
Exception Value: rest_framework/api.html

I don’t know where I’m going wrong. This is the first time I’m trying my hand at REST Framework. This is the code.

views.py

import socket, json
from modules.data.models import *
from modules.utils import *
from rest_framework import status
from rest_framework.decorators import api_view
from rest_framework.response import Response
from modules.actions.serializers import ActionSerializer


@api_view(['POST'])
@check_field_exists_wrapper("installation")
def api_actions(request, format = None):

    action_type = request.POST['action_type']
    if action_type == "Shutdown" : 
        send_message = '1'
        print "Shutting Down the system..."
    elif action_type == "Enable" : 
        send_message = '1'
        print "Enabling the system..."
    elif action_type == "Disable" : 
        send_message = '1'
        print "Disabling the system..."
    elif action_type == "Restart" : 
        send_message = '1'
        print "Restarting the system..."

    if action_type in ["Shutdown", "Enable", "Disable"] : PORT = 6000
    else : PORT = 6100

    controllers_list = Controller.objects.filter(installation_id = kwargs['installation_id'])

    for controller_obj in controllers_list:
        ip = controller_obj.ip
        try:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.connect((ip, PORT))
            s.send(send_message)
            s.close()
        except Exception as e:
            print("Exception when sending " + action_type +" command: "+str(e))

    return Response(status = status.HTTP_200_OK)

models.py

class Controller(models.Model):
    id = models.IntegerField(primary_key = True)
    name = models.CharField(max_length = 255, unique = True)
    ip = models.CharField(max_length = 255, unique = True)
    installation_id = models.ForeignKey('Installation')

serializers.py

from django.forms import widgets
from rest_framework import serializers
from modules.data.models import *

class ActionSerializer(serializers.ModelSerializer):
    class Meta:
        model = Controller
        fields = ('id', 'name', 'ip', 'installation_id')

urls.py

from django.conf.urls import patterns, url
from rest_framework.urlpatterns import format_suffix_patterns

urlpatterns = patterns('modules.actions.views',
    url(r'^$','api_actions',name='api_actions'),
)

Answer 0


Make sure you have rest_framework listed in your settings.py INSTALLED_APPS.
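That is, the relevant fragment of settings.py should look something like this (a sketch; the surrounding apps are placeholders):

INSTALLED_APPS = (
    'django.contrib.admin',
    # ... your other apps ...
    'rest_framework',
)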


Answer 1


For me, rest_framework/api.html was actually missing on the filesystem due to a corrupt installation or some other unknown reason. Reinstalling djangorestframework fixed the problem:

$ pip install --upgrade djangorestframework

Answer 2


Please note that the DRF attempts to return data in the same format that was requested. From your browser, this is most likely HTML. To specify an alternative response, use the ?format= parameter. For example: ?format=json.

The TemplateDoesNotExist error occurs most commonly when you are visiting an API endpoint in your browser and you do not have the rest_framework included in your list of installed apps, as described by other respondents.

If you do not have DRF included in your list of apps, but don’t want to use the HTML Admin DRF page, try using an alternative format to ‘side-step’ this error message.

More info from the docs here: http://www.django-rest-framework.org/topics/browsable-api/#formats


Answer 3


Not your case, but another possible reason is customized loaders for Django. For example, if you have this in your settings (since Django 1.8):

TEMPLATES = [
{
    ...
    'OPTIONS': {
        'context_processors': [
            'django.template.context_processors.debug',
            'django.template.context_processors.request',
            'django.contrib.auth.context_processors.auth',
            'django.contrib.messages.context_processors.messages'
        ],
        'loaders': [
            'django.template.loaders.filesystem.Loader',
        ],
        ...
    }
}]

Django will not look at application folders for templates, because for that you have to explicitly add django.template.loaders.app_directories.Loader to loaders.

Notice that, by default, django.template.loaders.app_directories.Loader is included in loaders.
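That is, the loaders list above would become (a sketch):

'loaders': [
    'django.template.loaders.filesystem.Loader',
    'django.template.loaders.app_directories.Loader',  # also search app template dirs
],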


Answer 4


I ran into the same error message. In my case, it was due to setting the backend to Jinja2. In my settings file:

TEMPLATES = [
{
    'BACKEND': 'django.template.backends.jinja2.Jinja2',
...

Changing this back to the default fixed the problem:

TEMPLATES = [
{
    'BACKEND': 'django.template.backends.django.DjangoTemplates',
...

Still not sure if there is a way to use the Jinja2 backend with rest_framework.


Efficiently updating a database using the SQLAlchemy ORM

Question: Efficiently updating a database using the SQLAlchemy ORM


I’m starting a new application and looking at using an ORM — in particular, SQLAlchemy.

Say I’ve got a column ‘foo’ in my database and I want to increment it. In straight sqlite, this is easy:

db = sqlite3.connect('mydata.sqlitedb')
cur = db.cursor()
cur.execute('update table stuff set foo = foo + 1')

I figured out the SQLAlchemy SQL-builder equivalent:

engine = sqlalchemy.create_engine('sqlite:///mydata.sqlitedb')
md = sqlalchemy.MetaData(engine)
table = sqlalchemy.Table('stuff', md, autoload=True)
upd = table.update(values={table.c.foo:table.c.foo+1})
engine.execute(upd)

This is slightly slower, but there’s not much in it.

Here’s my best guess for a SQLAlchemy ORM approach:

# snip definition of Stuff class made using declarative_base
# snip creation of session object
for c in session.query(Stuff):
    c.foo = c.foo + 1
session.flush()
session.commit()

This does the right thing, but it takes just under fifty times as long as the other two approaches. I presume that’s because it has to bring all the data into memory before it can work with it.

Is there any way to generate the efficient SQL using SQLAlchemy’s ORM? Or using any other python ORM? Or should I just go back to writing the SQL by hand?


Answer 0


SQLAlchemy’s ORM is meant to be used together with the SQL layer, not hide it. But you do have to keep one or two things in mind when using the ORM and plain SQL in the same transaction. Basically, from one side, ORM data modifications will only hit the database when you flush the changes from your session. From the other side, SQL data manipulation statements don’t affect the objects that are in your session.

So if you say

for c in session.query(Stuff).all():
    c.foo = c.foo+1
session.commit()

it will do what it says: go fetch all the objects from the database, modify all the objects, and then, when it’s time to flush the changes to the database, update the rows one by one.

Instead you should do this:

session.execute(update(stuff_table, values={stuff_table.c.foo: stuff_table.c.foo + 1}))
session.commit()

This will execute as one query, as you would expect, and because (at least with the default session configuration) the session expires all of its data on commit, you don’t have any stale data issues.

In the almost-released 0.5 series you could also use this method for updating:

session.query(Stuff).update({Stuff.foo: Stuff.foo + 1})
session.commit()

That will basically run the same SQL statement as the previous snippet, but also select the changed rows and expire any stale data in the session. If you know you aren’t using any session data after the update you could also add synchronize_session=False to the update statement and get rid of that select.
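That is (a sketch of the same update with synchronization disabled):

session.query(Stuff).update({Stuff.foo: Stuff.foo + 1}, synchronize_session=False)
session.commit()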


Answer 1


session.query(Clients).filter(Clients.id == client_id_list).update({'status': status})
session.commit()

Try this =)


Answer 2


There are several ways to UPDATE using sqlalchemy

1) for c in session.query(Stuff).all():
       c.foo += 1
   session.commit()

2) session.query(Stuff).\
       update({"foo": (Stuff.foo + 1)})
   session.commit()

3) conn = engine.connect()
   # update() lives on the Table; values() takes keyword arguments
   stmt = Stuff.__table__.update().\
       values(foo=Stuff.foo + 1)
   conn.execute(stmt)

Answer 3


Here’s an example of how to solve the same problem without having to map the fields manually:

from sqlalchemy import Column, ForeignKey, Integer, String, Date, DateTime, text, create_engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from sqlalchemy.orm.attributes import InstrumentedAttribute

engine = create_engine('postgres://postgres@localhost:5432/database')
session = sessionmaker()
session.configure(bind=engine)

Base = declarative_base()


class Media(Base):
  __tablename__ = 'media'
  id = Column(Integer, primary_key=True)
  title = Column(String, nullable=False)
  slug = Column(String, nullable=False)
  type = Column(String, nullable=False)

  def update(self):
    s = session()
    mapped_values = {}
    for item in Media.__dict__.iteritems():
      field_name = item[0]
      field_type = item[1]
      is_column = isinstance(field_type, InstrumentedAttribute)
      if is_column:
        mapped_values[field_name] = getattr(self, field_name)

    s.query(Media).filter(Media.id == self.id).update(mapped_values)
    s.commit()

So to update a Media instance, you can do something like this:

media = Media(id=123, title="Titular Line", slug="titular-line", type="movie")
media.update()

Answer 4


Without testing, I’d try:

for c in session.query(Stuff).all():
     c.foo = c.foo+1
session.commit()

(IIRC, commit() works without flush()).

I’ve found that at times doing a large query and then iterating in python can be up to 2 orders of magnitude faster than lots of queries. I assume that iterating over the query object is less efficient than iterating over a list generated by the all() method of the query object.

[Please note comment below – this did not speed things up at all].


Answer 5


If it is because of the overhead in terms of creating objects, then it probably can’t be sped up at all with SA.

If it is because it is loading up related objects, then you might be able to do something with lazy loading. Are there lots of objects being created due to references? (I.e., getting a Company object also gets all of the related People objects.)


Remove duplicates from a list of lists

Question: Remove duplicates from a list of lists


I have a list of lists in Python:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

And I want to remove duplicate elements from it. If it were a normal list, not a list of lists, I could use set. But unfortunately a list is not hashable, so I can’t make a set of lists, only of tuples. So I could turn all the lists into tuples, use set, and convert back to lists. But this isn’t fast.

How can this be done in the most efficient way?

The result for the above list should be:

k = [[5, 6, 2], [1, 2], [3], [4]]

I don’t care about preserving order.

Note: this question is similar but not quite what I need. I searched SO but didn’t find an exact duplicate.


Benchmarking:

import itertools, time


class Timer(object):
    def __init__(self, name=None):
        self.name = name

    def __enter__(self):
        self.tstart = time.time()

    def __exit__(self, type, value, traceback):
        if self.name:
            print '[%s]' % self.name,
        print 'Elapsed: %s' % (time.time() - self.tstart)


k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [5, 2], [6], [8], [9]] * 5
N = 100000

print len(k)

with Timer('set'):
    for i in xrange(N):
        kt = [tuple(i) for i in k]
        skt = set(kt)
        kk = [list(i) for i in skt]


with Timer('sort'):
    for i in xrange(N):
        ks = sorted(k)
        dedup = [ks[i] for i in xrange(len(ks)) if i == 0 or ks[i] != ks[i-1]]


with Timer('groupby'):
    for i in xrange(N):
        k = sorted(k)
        dedup = list(k for k, _ in itertools.groupby(k))

with Timer('loop in'):
    for i in xrange(N):
        new_k = []
        for elem in k:
            if elem not in new_k:
                new_k.append(elem)

The “loop in” (quadratic) method is the fastest of all for short lists. For long lists it’s faster than everything except the groupby method. Does this make sense?

For the short list (the one in the code), 100000 iterations:

[set] Elapsed: 1.3900001049
[sort] Elapsed: 0.891000032425
[groupby] Elapsed: 0.780999898911
[loop in] Elapsed: 0.578000068665

For the longer list (the one in the code duplicated 5 times):

[set] Elapsed: 3.68700003624
[sort] Elapsed: 3.43799996376
[groupby] Elapsed: 1.03099989891
[loop in] Elapsed: 1.85900020599

Answer 0

>>> k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
>>> import itertools
>>> k.sort()
>>> list(k for k,_ in itertools.groupby(k))
[[1, 2], [3], [4], [5, 6, 2]]

itertools often offers the fastest and most powerful solutions to this kind of problem, and is well worth getting intimately familiar with!-)

Edit: as I mention in a comment, normal optimization efforts are focused on large inputs (the big-O approach) because it’s so much easier that it offers good returns on efforts. But sometimes (essentially for “tragically crucial bottlenecks” in deep inner loops of code that’s pushing the boundaries of performance limits) one may need to go into much more detail, providing probability distributions, deciding which performance measures to optimize (maybe the upper bound or the 90th centile is more important than an average or median, depending on one’s apps), performing possibly-heuristic checks at the start to pick different algorithms depending on input data characteristics, and so forth.

Careful measurements of “point” performance (code A vs code B for a specific input) are a part of this extremely costly process, and standard library module timeit helps here. However, it’s easier to use it at a shell prompt. For example, here’s a short module to showcase the general approach for this problem, save it as nodup.py:

import itertools

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

def doset(k, map=map, list=list, set=set, tuple=tuple):
  return map(list, set(map(tuple, k)))

def dosort(k, sorted=sorted, xrange=xrange, len=len):
  ks = sorted(k)
  return [ks[i] for i in xrange(len(ks)) if i == 0 or ks[i] != ks[i-1]]

def dogroupby(k, sorted=sorted, groupby=itertools.groupby, list=list):
  ks = sorted(k)
  return [i for i, _ in itertools.groupby(ks)]

def donewk(k):
  newk = []
  for i in k:
    if i not in newk:
      newk.append(i)
  return newk

# sanity check that all functions compute the same result and don't alter k
if __name__ == '__main__':
  savek = list(k)
  for f in doset, dosort, dogroupby, donewk:
    resk = f(k)
    assert k == savek
    print '%10s %s' % (f.__name__, sorted(resk))

Note the sanity check (performed when you just do python nodup.py) and the basic hoisting technique (make constant global names local to each function for speed) to put things on equal footing.

Now we can run checks on the tiny example list:

$ python -mtimeit -s'import nodup' 'nodup.doset(nodup.k)'
100000 loops, best of 3: 11.7 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dosort(nodup.k)'
100000 loops, best of 3: 9.68 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dogroupby(nodup.k)'
100000 loops, best of 3: 8.74 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.donewk(nodup.k)'
100000 loops, best of 3: 4.44 usec per loop

confirming that the quadratic approach has small-enough constants to make it attractive for tiny lists with few duplicated values. With a short list without duplicates:

$ python -mtimeit -s'import nodup' 'nodup.donewk([[i] for i in range(12)])'
10000 loops, best of 3: 25.4 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dogroupby([[i] for i in range(12)])'
10000 loops, best of 3: 23.7 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.doset([[i] for i in range(12)])'
10000 loops, best of 3: 31.3 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dosort([[i] for i in range(12)])'
10000 loops, best of 3: 25 usec per loop

the quadratic approach isn’t bad, but the sort and groupby ones are better. Etc, etc.

If (as the obsession with performance suggests) this operation is at a core inner loop of your pushing-the-boundaries application, it’s worth trying the same set of tests on other representative input samples, possibly detecting some simple measure that could heuristically let you pick one or the other approach (but the measure must be fast, of course).

It’s also well worth considering keeping a different representation for k — why does it have to be a list of lists rather than a set of tuples in the first place? If the duplicate removal task is frequent, and profiling shows it to be the program’s performance bottleneck, keeping a set of tuples all the time and getting a list of lists from it only if and where needed, might be faster overall, for example.
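
To make that last suggestion concrete, here is a minimal sketch (my own illustration, not part of the original answer) of keeping the data as a set of tuples at all times and materializing a list of lists only on demand:

k = {(1, 2), (4,), (5, 6, 2), (3,)}   # canonical duplicate-free storage

k.add((1, 2))                         # re-adding a duplicate is a no-op

as_lists = [list(t) for t in k]       # build the list of lists only where needed
print(as_lists)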


回答 1

手动执行此操作,创建一个新k列表并添加到目前为止未找到的条目:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
new_k = []
for elem in k:
    if elem not in new_k:
        new_k.append(elem)
k = new_k
print k
# prints [[1, 2], [4], [5, 6, 2], [3]]

易于理解,并且保留了每个元素第一次出现的顺序(如果需要的话,这很有用),但是由于要为每个元素搜索整个new_k,我猜其复杂度是二次的。

Doing it manually, creating a new k list and adding entries not found so far:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
new_k = []
for elem in k:
    if elem not in new_k:
        new_k.append(elem)
k = new_k
print k
# prints [[1, 2], [4], [5, 6, 2], [3]]

Simple to comprehend, and you preserve the order of the first occurrence of each element should that be useful, but I guess it’s quadratic in complexity as you’re searching the whole of new_k for each element.


回答 2

>>> k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
>>> k = sorted(k)
>>> k
[[1, 2], [1, 2], [3], [4], [4], [5, 6, 2]]
>>> dedup = [k[i] for i in range(len(k)) if i == 0 or k[i] != k[i-1]]
>>> dedup
[[1, 2], [3], [4], [5, 6, 2]]

我不知道它是否一定更快,但是您不必使用元组和集合。

>>> k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
>>> k = sorted(k)
>>> k
[[1, 2], [1, 2], [3], [4], [4], [5, 6, 2]]
>>> dedup = [k[i] for i in range(len(k)) if i == 0 or k[i] != k[i-1]]
>>> dedup
[[1, 2], [3], [4], [5, 6, 2]]

I don’t know if it’s necessarily faster, but you don’t have to use tuples and sets.


回答 3

到目前为止,所有基于set的解决方案都需要在迭代之前创建一个完整的set。

通过迭代列表的列表并把元素添加到一个“seen”set中,可以使其变为惰性求值并同时保留顺序:仅当某个列表在该跟踪set中找不到时才产生它。

此unique_everseen食谱可在itertools docs中找到,也可以在第三方toolz库中使用:

from toolz import unique

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

# lazy iterator
res = map(list, unique(map(tuple, k)))

print(list(res))

[[1, 2], [4], [5, 6, 2], [3]]

请注意,tuple转换是必需的,因为列表不可散列。

All the set-related solutions to this problem thus far require creating an entire set before iteration.

It is possible to make this lazy, and at the same time preserve order, by iterating the list of lists and adding to a “seen” set. Then only yield a list if it is not found in this tracker set.

This unique_everseen recipe is available in the itertools docs. It’s also available in the 3rd party toolz library:

from toolz import unique

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

# lazy iterator
res = map(list, unique(map(tuple, k)))

print(list(res))

[[1, 2], [4], [5, 6, 2], [3]]

Note that tuple conversion is necessary because lists are not hashable.
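
For reference, here is a simplified version of that unique_everseen recipe (adapted from the itertools docs; treat the details as an illustration rather than a verbatim copy), which avoids the third-party dependency:

def unique_everseen(iterable, key=None):
    # remember what has been seen so far; yield each element lazily on first sight
    seen = set()
    for element in iterable:
        marker = key(element) if key is not None else element
        if marker not in seen:
            seen.add(marker)
            yield element

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
print([list(t) for t in unique_everseen(k, key=tuple)])
# [[1, 2], [4], [5, 6, 2], [3]]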


回答 4

甚至您的“长”列表也很短。另外,您选择它们时是否考虑了与实际数据匹配?性能将随这些数据的实际形态而变化。例如,您将一个简短列表一遍又一遍地重复以构成更长的列表,这意味着二次解法在您的基准测试中表现为线性,但在现实中并非如此。

对于真正大的列表,set代码是最好的选择:它是线性的(尽管比较耗空间)。sort和groupby方法为O(n log n),逐项循环的方法显然是二次的,所以您知道当n变得很大时它们各自如何扩展。如果这就是您要分析的数据的实际大小,那么谁在乎呢?它实在太小了。

顺便说一句,如果我不构造中间列表来创建set,就会看到明显的加速,也就是说,如果我将

kt = [tuple(i) for i in k]
skt = set(kt)

换成

skt = set(tuple(i) for i in k)

真正的解决方案可能取决于更多信息:您确定列表列表确实是您所需要的表示形式吗?

Even your “long” list is pretty short. Also, did you choose them to match the actual data? Performance will vary with what these data actually look like. For example, you have a short list repeated over and over to make a longer list. This means that the quadratic solution is linear in your benchmarks, but not in reality.

For actually-large lists, the set code is your best bet—it’s linear (although space-hungry). The sort and groupby methods are O(n log n) and the looping method is obviously quadratic, so you know how these will scale as n gets really big. If this is the real size of the data you are analyzing, then who cares? It’s tiny.

Incidentally, I’m seeing a noticeable speedup if I don’t form an intermediate list to make the set, that is to say if I replace

kt = [tuple(i) for i in k]
skt = set(kt)

with

skt = set(tuple(i) for i in k)

The real solution may depend on more information: Are you sure that a list of lists is really the representation you need?


回答 5

可以使用元组列表和{}(集合推导式)来删除重复项:

>>> [list(tupl) for tupl in {tuple(item) for item in k }]
[[1, 2], [5, 6, 2], [3], [4]]
>>> 

A list of tuples and {} (a set comprehension) can be used to remove duplicates:

>>> [list(tupl) for tupl in {tuple(item) for item in k }]
[[1, 2], [5, 6, 2], [3], [4]]
>>> 

回答 6

创建一个以元组为键的字典,然后打印键。

  • 创建以元组为键,索引为值的字典
  • 打印字典键列表

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

dict_tuple = {tuple(item): index for index, item in enumerate(k)}

print [list(itm) for itm in dict_tuple.keys()]

# prints [[1, 2], [5, 6, 2], [3], [4]]

Create a dictionary with tuple as the key, and print the keys.

  • create dictionary with tuple as key and index as value
  • print list of keys of dictionary

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

dict_tuple = {tuple(item): index for index, item in enumerate(k)}

print [list(itm) for itm in dict_tuple.keys()]

# prints [[1, 2], [5, 6, 2], [3], [4]]
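
As a side note (my addition, assuming Python 3.7+, where plain dicts preserve insertion order), the same dict idea collapses to a one-liner that also keeps first-seen order:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

# dict.fromkeys keeps only the first occurrence of each tuple key, in order
dedup = [list(t) for t in dict.fromkeys(map(tuple, k))]
print(dedup)  # [[1, 2], [4], [5, 6, 2], [3]]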

回答 7

这应该可以工作。

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

k_cleaned = []
for ele in k:
    if set(ele) not in [set(x) for x in k_cleaned]:
        k_cleaned.append(ele)
print(k_cleaned)

# output: [[1, 2], [4], [5, 6, 2], [3]]

This should work.

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

k_cleaned = []
for ele in k:
    if set(ele) not in [set(x) for x in k_cleaned]:
        k_cleaned.append(ele)
print(k_cleaned)

# output: [[1, 2], [4], [5, 6, 2], [3]]

回答 8

奇怪的是,以上答案删除了“重复项”,但是如果我也想删除重复的值怎么办?以下内容应该有用,并且不会在内存中创建新对象!

def dictRemoveDuplicates():
    a = [[1,'somevalue1'],[1,'somevalue2'],[2,'somevalue1'],[3,'somevalue4'],[5,'somevalue5'],[5,'somevalue1'],[5,'somevalue1'],[5,'somevalue8'],[6,'somevalue9'],[6,'somevalue0'],[6,'somevalue1'],[7,'somevalue7']]

    print(a)
    temp = 0
    position = -1
    for pageNo, item in a:
        position += 1
        if pageNo != temp:
            # first entry with a new key: remember it and move on
            temp = pageNo
            continue
        else:
            # a repeated key: zero out this entry and the one before it
            a[position] = 0
            a[position - 1] = 0
    a = [x for x in a if x != 0]
    print(a)

dictRemoveDuplicates()

输出是:

[[1, 'somevalue1'], [1, 'somevalue2'], [2, 'somevalue1'], [3, 'somevalue4'], [5, 'somevalue5'], [5, 'somevalue1'], [5, 'somevalue1'], [5, 'somevalue8'], [6, 'somevalue9'], [6, 'somevalue0'], [6, 'somevalue1'], [7, 'somevalue7']]
[[2, 'somevalue1'], [3, 'somevalue4'], [7, 'somevalue7']]

Strangely, the answers above removes the ‘duplicates’ but what if I want to remove the duplicated value also?? The following should be useful and does not create a new object in memory!

def dictRemoveDuplicates():
    a = [[1,'somevalue1'],[1,'somevalue2'],[2,'somevalue1'],[3,'somevalue4'],[5,'somevalue5'],[5,'somevalue1'],[5,'somevalue1'],[5,'somevalue8'],[6,'somevalue9'],[6,'somevalue0'],[6,'somevalue1'],[7,'somevalue7']]

    print(a)
    temp = 0
    position = -1
    for pageNo, item in a:
        position += 1
        if pageNo != temp:
            # first entry with a new key: remember it and move on
            temp = pageNo
            continue
        else:
            # a repeated key: zero out this entry and the one before it
            a[position] = 0
            a[position - 1] = 0
    a = [x for x in a if x != 0]
    print(a)

dictRemoveDuplicates()

and the o/p is:

[[1, 'somevalue1'], [1, 'somevalue2'], [2, 'somevalue1'], [3, 'somevalue4'], [5, 'somevalue5'], [5, 'somevalue1'], [5, 'somevalue1'], [5, 'somevalue8'], [6, 'somevalue9'], [6, 'somevalue0'], [6, 'somevalue1'], [7, 'somevalue7']]
[[2, 'somevalue1'], [3, 'somevalue4'], [7, 'somevalue7']]

回答 9

另一个可能更通用,更简单的解决方案是创建一个由对象的字符串版本作为键的字典,并在最后获取values():

>>> dict([(unicode(a),a) for a in [["A", "A"], ["A", "A"], ["A", "B"]]]).values()
[['A', 'B'], ['A', 'A']]

要注意的是,这仅适用于字符串表示形式是足够好的唯一键的对象(对于大多数本机对象而言都是如此)。

Another probably more generic and simpler solution is to create a dictionary keyed by the string version of the objects and getting the values() at the end:

>>> dict([(unicode(a),a) for a in [["A", "A"], ["A", "A"], ["A", "B"]]]).values()
[['A', 'B'], ['A', 'A']]

The catch is that this only works for objects whose string representation is a good-enough unique key (which is true for most native objects).


python3和python3m可执行文件之间的区别

问题:python3和python3m可执行文件之间的区别

/usr/bin/python3 和 /usr/bin/python3m 这两个可执行文件之间有什么区别?

我在Ubuntu 13.04上观察到了它们,但是Google建议它们也存在于其他发行版中。

这两个文件具有相同的md5sum,但似乎不是符号链接或硬链接;ls -li 显示这两个文件的inode编号不同,而且用 find -xdev -samefile /usr/bin/python3.3 测试也没有返回任何其他文件。

有人在AskUbuntu上问了类似的问题,但我想更多地了解这两个文件之间的区别。

What is the difference between the /usr/bin/python3 and /usr/bin/python3m executables?

I am observing them on Ubuntu 13.04, but Google suggests that they exist on other distributions too.

The two files have the same md5sum, but do not seem to be symbolic links or hard links; the two files have different inode numbers returned by ls -li and testing find -xdev -samefile /usr/bin/python3.3 does not return any other files.

Someone asked a similar question on AskUbuntu, but I wanted to find out more about the difference between the two files.


回答 0

这要归功于chepner,他指出我其实已经有了指向答案的链接。

Python实现可以在文件名标签中适当地包含其他标志。例如,在POSIX系统上,这些标志也将有助于文件名:

--with-pydebug(标志:d)

--with-pymalloc(标志:m)

--with-wide-unicode(标志:u)

通过PEP 3149

关于m标志,这是Pymalloc的含义:

Pymalloc是由Vladimir Marangozov编写的专用对象分配器,它是Python 2.1中新增的一项功能。Pymalloc旨在比系统malloc()更快,并且对于Python程序典型的分配模式而言,具有较少的内存开销。分配器使用C的malloc()函数获取较大的内存池,然后从这些池执行较小的内存请求。

通过Python 2.3的新功能

最后,这两个文件可能在某些系统上被硬链接。虽然两个文件在我的Ubuntu 13.04系统上具有不同的inode编号(因此是不同的文件),但两年前comp.lang.python帖子显示它们曾经被硬链接过。

Credit for this goes to chepner for pointing out that I already had the link to the solution.

Python implementations MAY include additional flags in the file name tag as appropriate. For example, on POSIX systems these flags will also contribute to the file name:

--with-pydebug (flag: d)

--with-pymalloc (flag: m)

--with-wide-unicode (flag: u)

via PEP 3149.

Regarding the m flag specifically, this is what Pymalloc is:

Pymalloc, a specialized object allocator written by Vladimir Marangozov, was a feature added to Python 2.1. Pymalloc is intended to be faster than the system malloc() and to have less memory overhead for allocation patterns typical of Python programs. The allocator uses C’s malloc() function to get large pools of memory and then fulfills smaller memory requests from these pools.

via What’s New in Python 2.3

Finally, the two files may be hardlinked on some systems. While the two files have different inode numbers on my Ubuntu 13.04 system (thus are different files), a comp.lang.python post from two years ago shows that they once were hardlinked.
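
If you want to check which of these flags your own interpreter was built with, Python 3.2+ on POSIX exposes them directly (a quick sketch; the output varies by build, and newer builds may report an empty string since the m flag was dropped in Python 3.8):

import sys
import sysconfig

print(sys.abiflags)                          # e.g. 'm' for a pymalloc build
print(sysconfig.get_config_var('ABIFLAGS'))  # the same information from the build config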


在Java中调用Python?

问题:在Java中调用Python?

我想知道是否可以使用jython从Java代码调用python函数,还是仅用于从python调用Java代码?

I am wondering if it is possible to call python functions from java code using jython, or is it only for calling java code from python?


回答 0

Jython:适用于Java平台的Python - http://www.jython.org/index.html

您可以使用Jython从Java代码轻松调用python函数,只要您的python代码本身能在jython下运行,即没有使用某些不受支持的C扩展。

如果这对您有用,那肯定是您能得到的最简单的解决方案。否则,您可以使用来自新的Java6解释器支持的org.python.util.PythonInterpreter。

一个随手写的简单例子,但我希望它能工作:(为简洁起见未做错误检查)

PythonInterpreter interpreter = new PythonInterpreter();
interpreter.exec("import sys\nsys.path.append('pathToModules if they are not there by default')\nimport yourModule");
// execute a function that takes a string and returns a string
PyObject someFunc = interpreter.get("funcName");
PyObject result = someFunc.__call__(new PyString("Test!"));
String realResult = (String) result.__tojava__(String.class);

Jython: Python for the Java Platform – http://www.jython.org/index.html

You can easily call python functions from Java code with Jython. That is as long as your python code itself runs under jython, i.e. doesn’t use some c-extensions that aren’t supported.

If that works for you, it’s certainly the simplest solution you can get. Otherwise you can use org.python.util.PythonInterpreter from the new Java6 interpreter support.

A simple example from the top of my head – but should work I hope: (no error checking done for brevity)

PythonInterpreter interpreter = new PythonInterpreter();
interpreter.exec("import sys\nsys.path.append('pathToModules if they are not there by default')\nimport yourModule");
// execute a function that takes a string and returns a string
PyObject someFunc = interpreter.get("funcName");
PyObject result = someFunc.__call__(new PyString("Test!"));
String realResult = (String) result.__tojava__(String.class);

回答 1

嘿,我想我会输入我的答案,尽管已经很晚了。我想首先要考虑一些重要的事情,即您希望在java和python之间建立多强的连接。

首先 ,您是否只想调用函数,或者您是否真的希望python代码更改Java对象中的数据?这个非常重要。如果您只想调用带或不带参数的python代码,那并不是很难。如果您的参数是基元,那么它将变得更加容易。但是,如果您想让Java类在python中实现成员函数,这些成员函数会更改java对象的数据,那么这并不是那么容易或直接的。

其次,我们谈论的是cpython,还是jython就行?我会说cpython才是王道!我认为这正是python如此强大的原因:拥有如此高的抽象层次,却能在需要时访问c、c++。想象一下如果Java也能这样。如果jython就够用,这个问题甚至不值得问,因为那样的话一切都很容易。

因此,我使用以下方法,并从容易到困难列出了它们:

Java到Jython

优点:轻而易举。实际引用Java对象

缺点:没有CPython,非常慢!

从Java使用Jython非常简单,如果这真的够用,那就太好了。但是它非常慢,而且没有cpython!没有cpython的日子值得过吗?我不这么认为!您可以轻松地用python代码为您的java对象实现成员函数。

通过Pyro从Java到Jython到CPython

Pyro是python的远程对象模块。您在cpython解释器上有一些对象,您可以向其发送通过序列化传输的对象,也可以通过此方法返回对象。请注意,如果您从jython发送一个序列化的python对象,然后调用某些函数来更改其成员中的数据,那么您将在java中看不到这些更改。您只需要记住从pyro发送回想要的数据。我相信这是进入cpython的最简单方法!您不需要任何jni或jna或swig或…。您不需要了解任何c或c ++。酷吧?

优点:访问cpython,不像以下方法那样困难

缺点:无法直接从python更改java对象的成员数据。有点间接,(jython是中间人)。

通过JNI/JNA/SWIG从Java到C/C++,再通过嵌入式解释器到Python(也许使用BOOST库?)

天哪,这种方法不适合胆小的人。我可以告诉您,我花了很长时间才用一种体面的方法实现它。您想这么做的主要原因是,可以运行对java对象拥有完全控制权的cpython代码。在决定尝试让java(像黑猩猩)和python(像马)杂交之前,有一些重大事项需要考虑。首先,如果解释器崩溃,您的程序也就完了!另外别让我吐槽并发问题!此外还有大量样板代码;我相信我已经找到了将样板最小化的最佳配置,但仍然很多!那么该怎么做:把C++当作您的中间人,您的对象实际上就是c++对象!很好,现在您知道了。只需像您的程序是用cpp而不是java编写的那样编写您的对象,其中包含您想从两个世界访问的数据。然后,您可以使用名为swig的包装器生成器(http://www.swig.org/Doc1.3/Java.html)使Java可以访问它,并编译一个dll,在java中通过System.load(此处为dll名称)加载。先让这一步工作起来,然后再进行困难的部分!要使用python,您需要嵌入一个解释器。首先,我建议先写一些hello解释器程序,或参考教程“在C/C++中嵌入python”。一旦可以工作,就该让马和黑猩猩跳舞了!您可以通过[boost][3]将c++对象发送给python。我知道我没有给你鱼,只是告诉你在哪里能找到鱼。编译时需要注意以下几点。

编译boost时,您需要编译一个共享库。并且您需要包含并链接jdk中所需的内容,即jawt.lib、jvm.lib(启动应用程序时,您的路径中还需要客户端jvm.dll),以及python27.lib之类和boost_python-vc100-mt-1_55.lib。然后包含Python/include、jdk/include、boost,并且只使用共享库(dll),否则boost会闹脾气。是的,我知道,有很多环节可能出岔子。所以请确保一步一步地把每件事做好,然后再把它们组合起来。

Hey I thought I would enter my answer to this even though its late. I think there are some important things to consider first with how strong you wish to have the linking between java and python.

Firstly Do you only want to call functions or do you actually want python code to change the data in your java objects? This is very important. If you only want to call some python code with or without arguments, then that is not very difficult. If your arguments are primitives it makes it even more easy. However if you want to have java class implement member functions in python, which change the data of the java object, then this is not so easy or straight forward.

Secondly are we talking cpython or will jython do? I would say cpython is where its at! I would advocate this is why python is so kool! Having such high abstractions however access to c,c++ when needed. Imagine if you could have that in java. This question is not even worth asking if jython is ok because then it is easy anyway.

So I have played with the following methods, and listed them from easy to difficult:

Java to Jython

Advantages: Trivially easy. Have actual references to java objects

Disadvantages: No CPython, Extremely Slow!

Jython from java is so easy, and if this is really enough then great. However it is very slow and no cpython! Is life worth living without cpython? I don’t think so! You can easily have python code implementing your member functions for your java objects.

Java to Jython to CPython via Pyro

Pyro is the remote object module for python. You have some object on a cpython interpreter, and you can send it objects which are transferred via serialization and it can also return objects via this method. Note that if you send a serialized python object from jython and then call some functions which change the data in its members, then you will not see those changes in java. You just need to remember to send back the data which you want from pyro. This I believe is the easiest way to get to cpython! You do not need any jni or jna or swig or …. You don’t need to know any c, or c++. kool huh?

Advantages: Access to cpython, not as difficult as following methods

Disadvantages: Cannot change the member data of java objects directly from python. Is somewhat indirect, (jython is middle man).

Java to C/C++ via JNI/JNA/SWIG to Python via Embedded interpreter (maybe using BOOST Libraries?)

OMG this method is not for the faint of heart. And I can tell you it has taken me very long to achieve this with a decent method. The main reason you would want to do this is so that you can run cpython code which has full rein over your java objects. There are major things to consider before deciding to try and breed java (which is like a chimp) with python (which is like a horse). Firstly, if you crash the interpreter, that’s lights out for your program! And don’t get me started on concurrency issues! In addition, there is a lot of boilerplate; I believe I have found the best configuration to minimize it, but it is still a lot! So how to go about this: consider that C++ is your middle man; your objects are actually c++ objects! Good that you know that now. Just write your object as if your program were in cpp, not java, with the data you want to access from both worlds. Then you can use the wrapper generator called swig (http://www.swig.org/Doc1.3/Java.html) to make this accessible to java and compile a dll which you load with System.load(dll name here) in java. Get this working first, then move on to the hard part! To get to python you need to embed an interpreter. Firstly I suggest doing some hello interpreter programs or this tutorial Embedding Python in C/C++. Once you have that working, it’s time to make the horse and the monkey dance! You can send your c++ object to python via [boost][3]. I know I have not given you the fish, merely told you where to find the fish. Some pointers to note for this when compiling.

When you compile boost you will need to compile a shared library. And you need to include and link to the stuff you need from the jdk, i.e. jawt.lib, jvm.lib (you will also need the client jvm.dll in your path when launching the application), as well as python27.lib or whatever, and boost_python-vc100-mt-1_55.lib. Then include Python/include, jdk/include, boost, and only use shared libraries (dlls), otherwise boost has a teary. And yeah, full on, I know. There are so many ways in which this can go sour. So make sure you get each thing done block by block, then put them together.


回答 2

在Java中包含python代码并不明智。用flask或其他Web框架包装您的python代码,使其成为微服务。使您的Java程序能够调用此微服务(例如,通过REST)。

相信我,这很简单,可以为您节省很多问题。而且代码是松散耦合的,因此它们是可伸缩的。

于2020年3月24日更新:根据@stx的评论,上述方法不适用于客户端和服务器之间的海量数据传输。这是我推荐的另一种方法:使用Rust连接Python和Java(也可以使用C / C ++)。 https://medium.com/@shmulikamar/https-medium-com-shmulikamar-connecting-python-and-java-with-rust-11c256a1dfb0

It’s not smart to have python code inside java. Wrap your python code with flask or other web framework to make it as a microservice. Make your java program able to call this microservice (e.g. via REST).

Believe me, this is much simpler and will save you tons of issues. And the code is loosely coupled, so it is scalable.

Updated on Mar 24th 2020: According to @stx’s comment, the above approach is not suitable for massive data transfer between client and server. Here is another approach I recommended: Connecting Python and Java with Rust(C/C++ also ok). https://medium.com/@shmulikamar/https-medium-com-shmulikamar-connecting-python-and-java-with-rust-11c256a1dfb0
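
To illustrate the microservice route, here is a minimal sketch of the Python side (assuming Flask is installed; the endpoint name and payload shape are invented for the example). The Java side would then issue an ordinary HTTP POST to it:

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/compute', methods=['POST'])  # hypothetical endpoint
def compute():
    data = request.get_json()
    result = sum(data['values'])          # stand-in for the real Python logic
    return jsonify({'result': result})

if __name__ == '__main__':
    app.run(port=5000)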


回答 3

有几个答案提到您可以使用JNI或JNA来访问cpython,但我不建议您从头开始,因为已经有了用于从java访问cpython的开源库。例如:

Several of the answers mention that you can use JNI or JNA to access cpython but I would not recommend starting from scratch because there are already open source libraries for accessing cpython from java. For example:


回答 4

这里是一个库,可让您一次编写python脚本并确定在运行时使用哪种集成方法(Jython,CPython / PyPy(通过Jep和Py4j)):

https://github.com/subes/invesdwin-context-python

由于每种方法都有其自身的优点/缺点,如链接中所述。

Here a library that lets you write your python scripts once and decide which integration method (Jython, CPython/PyPy via Jep and Py4j) to use at runtime:

https://github.com/subes/invesdwin-context-python

Since each method has its own benefits/drawbacks as explained in the link.


回答 5

这取决于您所说的python函数是什么意思。如果它们是用cpython编写的,则不能直接调用,必须使用JNI;但如果它们是用Jython编写的,就可以轻松地从Java调用它们,因为jython最终会生成Java字节码。

现在,当我说用cpython或jython编写时,这没有多大意义,因为python是python,并且除非您使用依赖于cpython或java的特定库,否则大多数代码都可以在两种实现上运行。

请参阅此处如何在Java中使用Python解释器。

It depends on what do you mean by python functions? if they were written in cpython you can not directly call them you will have to use JNI, but if they were written in Jython you can easily call them from java, as jython ultimately generates java byte code.

Now when I say written in cpython or jython it doesn’t make much sense because python is python and most code will run on both implementations unless you are using specific libraries which relies on cpython or java.

see here how to use Python interpreter in Java.


回答 6

根据您的要求,诸如XML-RPC之类的选项可能会很有用,它几乎可以用于以任何支持该协议的语言远程调用函数。

Depending on your requirements, options like XML-RPC could be useful; it can be used to remotely call functions in virtually any language supporting the protocol.
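
For concreteness, a minimal sketch of the Python side using only the standard library (xmlrpc.server in Python 3; the exposed function is just an example). Any Java XML-RPC client, such as Apache XML-RPC, could then call it:

from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):
    # example function callable from any XML-RPC client, including Java
    return a + b

server = SimpleXMLRPCServer(('localhost', 8000))
server.register_function(add, 'add')
server.serve_forever()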


回答 7

GraalVM是一个不错的选择。我已经用GraalVM完成过Java + Javascript组合的微服务设计(在Java中通过反射调用Javascript)。他们最近增加了对python的支持,我想尝试一下,尤其是考虑到其社区这些年来已经发展得如此壮大。

GraalVM is a good choice. I’ve done Java+Javascript combination with GraalVM for microservice design (Java with Javascript reflection). They recently added support for python, I’d give it a try especially with how big its community has grown over the years.


回答 8

您可以使用Java Native Interface从Java调用任何语言

You can call any language from java using Java Native Interface


回答 9

Jython有一些限制:

有许多差异。首先,Jython程序不能使用用C编写的CPython扩展模块。这些模块通常具有扩展名为.so,.pyd或.dll的文件。如果要使用这样的模块,则应寻找用纯Python或Java编写的等效模块。尽管在技术上支持此类扩展是可行的-IronPython这样做-在Jython中尚无计划这样做。

使用Jython将我的Python脚本作为JAR文件分发吗?

您只需使用Runtime或ProcessBuilder从Java调用python脚本(或bash或Perl脚本),然后将输出传递回Java:

在Java中运行bash shell脚本

在Java中运行命令行

java runtime.getruntime()从执行命令行程序获取输出

Jython has some limitations:

There are a number of differences. First, Jython programs cannot use CPython extension modules written in C. These modules usually have files with the extension .so, .pyd or .dll. If you want to use such a module, you should look for an equivalent written in pure Python or Java. Although it is technically feasible to support such extensions – IronPython does so – there are no plans to do so in Jython.

Distributing my Python scripts as JAR files with Jython?

you can simply call python scripts (or bash or Perl scripts) from Java using Runtime or ProcessBuilder and pass output back to Java:

Running a bash shell script in java

Running Command Line in Java

java runtime.getruntime() getting output from executing a command line program


回答 10

这里对当前的选项给出了相当不错的概述,其中一些已在其他答案中提到。在他们决定实现Python 3.x之前,Jython都不可用,而许多其他项目都是从python一侧出发、希望访问java的。不过仍有一些选择;举一个尚未被提到的:gRPC。

This gives a pretty good overview of the current options, some of which are named in other answers. Jython is not usable until they decide to implement Python 3.x, and many of the other projects are coming from the python side and want to access java. But there are a few options still; to name something which has not been named yet: gRPC.


如何在Python的日志记录工具中添加自定义日志级别

问题:如何在Python的日志记录工具中添加自定义日志级别

我想为我的应用程序使用loglevel TRACE(5),因为我认为这debug()还不够。另外log(5, msg)不是我想要的。如何将自定义日志级别添加到Python记录器?

我有mylogger.py以下内容:

import logging

@property
def log(obj):
    myLogger = logging.getLogger(obj.__class__.__name__)
    return myLogger

在我的代码中,我通过以下方式使用它:

class ExampleClass(object):
    from mylogger import log

    def __init__(self):
        '''The constructor with the logger'''
        self.log.debug("Init runs")

现在我想打电话 self.log.trace("foo bar")

在此先感谢您的帮助。

编辑(2016年12月8日):我更改了pfa的公认答案,即IMHO,这是基于Eric S的非常好的建议的出色解决方案。

I’d like to have loglevel TRACE (5) for my application, as I don’t think that debug() is sufficient. Additionally log(5, msg) isn’t what I want. How can I add a custom loglevel to a Python logger?

I’ve a mylogger.py with the following content:

import logging

@property
def log(obj):
    myLogger = logging.getLogger(obj.__class__.__name__)
    return myLogger

In my code I use it in the following way:

class ExampleClass(object):
    from mylogger import log

    def __init__(self):
        '''The constructor with the logger'''
        self.log.debug("Init runs")

Now I’d like to call self.log.trace("foo bar")

Thanks in advance for your help.

Edit (Dec 8th 2016): I changed the accepted answer to pfa’s which is, IMHO, an excellent solution based on the very good proposal from Eric S.


回答 0

@Eric S.

Eric S.的回答很好,但是我通过实验得知,无论日志级别设置为什么,这都会导致以新调试级别记录的消息总是被打印出来。因此,如果您的新级别号是9,而您调用了setLevel(50),则较低级别的消息将被错误地打印。

为了防止这种情况的发生,您需要在“ debugv”函数内的另一行检查是否实际启用了相关日志记录级别。

修正后的示例,会检查相关日志级别是否已启用:

import logging
DEBUG_LEVELV_NUM = 9 
logging.addLevelName(DEBUG_LEVELV_NUM, "DEBUGV")
def debugv(self, message, *args, **kws):
    if self.isEnabledFor(DEBUG_LEVELV_NUM):
        # Yes, logger takes its '*args' as 'args'.
        self._log(DEBUG_LEVELV_NUM, message, args, **kws) 
logging.Logger.debugv = debugv

如果查看Python 2.7的logging/__init__.py中class Logger的代码,就会发现这正是所有标准日志函数(.critical、.debug等)的做法。

由于声誉不足,我显然无法回复其他人的回答……希望Eric看到后会更新他的帖子。=)

@Eric S.

Eric S.’s answer is excellent, but I learned by experimentation that this will always cause messages logged at the new debug level to be printed — regardless of what the log level is set to. So if you make a new level number of 9, if you call setLevel(50), the lower level messages will erroneously be printed.

To prevent that from happening, you need another line inside the “debugv” function to check if the logging level in question is actually enabled.

Fixed example that checks if the logging level is enabled:

import logging
DEBUG_LEVELV_NUM = 9 
logging.addLevelName(DEBUG_LEVELV_NUM, "DEBUGV")
def debugv(self, message, *args, **kws):
    if self.isEnabledFor(DEBUG_LEVELV_NUM):
        # Yes, logger takes its '*args' as 'args'.
        self._log(DEBUG_LEVELV_NUM, message, args, **kws) 
logging.Logger.debugv = debugv

If you look at the code for class Logger in logging.__init__.py for Python 2.7, this is what all the standard log functions do (.critical, .debug, etc.).

I apparently can’t post replies to others’ answers for lack of reputation… hopefully Eric will update his post if he sees this. =)


回答 1

我采用了“避免看到lambda”的那个答案,并且不得不修改添加log_at_my_log_level的位置。我也遇到了Paul提到的问题:我认为那样行不通。难道不需要把logger作为log_at_my_log_level的第一个参数吗?下面这样对我有用:

import logging
DEBUG_LEVELV_NUM = 9 
logging.addLevelName(DEBUG_LEVELV_NUM, "DEBUGV")
def debugv(self, message, *args, **kws):
    # Yes, logger takes its '*args' as 'args'.
    self._log(DEBUG_LEVELV_NUM, message, args, **kws) 
logging.Logger.debugv = debugv

I took the “avoid seeing lambda” answer and had to modify where the log_at_my_log_level was being added. I too saw the problem that Paul did; I don’t think that version works. Don’t you need logger as the first arg in log_at_my_log_level? This worked for me:

import logging
DEBUG_LEVELV_NUM = 9 
logging.addLevelName(DEBUG_LEVELV_NUM, "DEBUGV")
def debugv(self, message, *args, **kws):
    # Yes, logger takes its '*args' as 'args'.
    self._log(DEBUG_LEVELV_NUM, message, args, **kws) 
logging.Logger.debugv = debugv

回答 2

将所有现有答案与大量使用经验相结合,我想我已经列出了确保完全无缝使用新级别所需做的所有事情。下面的步骤假定您要添加一个名为TRACE、值为logging.DEBUG - 5 == 5的新级别:

  1. 需要调用logging.addLevelName(logging.DEBUG - 5, 'TRACE'),在内部注册新级别,以便可以按名称引用它。
  2. 为了保持一致性,需要将新级别作为属性添加到logging模块自身:logging.TRACE = logging.DEBUG - 5。
  3. 需要向logging模块添加一个名为trace的方法,其行为应与debug、info等一致。
  4. 需要向当前配置的记录器类添加一个名为trace的方法。由于不能100%保证它就是logging.Logger,请改用logging.getLoggerClass()。

下面的方法说明了所有步骤:

def addLoggingLevel(levelName, levelNum, methodName=None):
    """
    Comprehensively adds a new logging level to the `logging` module and the
    currently configured logging class.

    `levelName` becomes an attribute of the `logging` module with the value
    `levelNum`. `methodName` becomes a convenience method for both `logging`
    itself and the class returned by `logging.getLoggerClass()` (usually just
    `logging.Logger`). If `methodName` is not specified, `levelName.lower()` is
    used.

    To avoid accidental clobberings of existing attributes, this method will
    raise an `AttributeError` if the level name is already an attribute of the
    `logging` module or if the method name is already present.

    Example
    -------
    >>> addLoggingLevel('TRACE', logging.DEBUG - 5)
    >>> logging.getLogger(__name__).setLevel("TRACE")
    >>> logging.getLogger(__name__).trace('that worked')
    >>> logging.trace('so did this')
    >>> logging.TRACE
    5

    """
    if not methodName:
        methodName = levelName.lower()

    if hasattr(logging, levelName):
       raise AttributeError('{} already defined in logging module'.format(levelName))
    if hasattr(logging, methodName):
       raise AttributeError('{} already defined in logging module'.format(methodName))
    if hasattr(logging.getLoggerClass(), methodName):
       raise AttributeError('{} already defined in logger class'.format(methodName))

    # This method was inspired by the answers to Stack Overflow post
    # http://stackoverflow.com/q/2183233/2988730, especially
    # http://stackoverflow.com/a/13638084/2988730
    def logForLevel(self, message, *args, **kwargs):
        if self.isEnabledFor(levelNum):
            self._log(levelNum, message, args, **kwargs)
    def logToRoot(message, *args, **kwargs):
        logging.log(levelNum, message, *args, **kwargs)

    logging.addLevelName(levelNum, levelName)
    setattr(logging, levelName, levelNum)
    setattr(logging.getLoggerClass(), methodName, logForLevel)
    setattr(logging, methodName, logToRoot)

Combining all of the existing answers with a bunch of usage experience, I think that I have come up with a list of all the things that need to be done to ensure completely seamless usage of the new level. The steps below assume that you are adding a new level TRACE with value logging.DEBUG - 5 == 5:

  1. logging.addLevelName(logging.DEBUG - 5, 'TRACE') needs to be invoked to get the new level registered internally so that it can be referenced by name.
  2. The new level needs to be added as an attribute to logging itself for consistency: logging.TRACE = logging.DEBUG - 5.
  3. A method called trace needs to be added to the logging module. It should behave just like debug, info, etc.
  4. A method called trace needs to be added to the currently configured logger class. Since this is not 100% guaranteed to be logging.Logger, use logging.getLoggerClass() instead.

All the steps are illustrated in the method below:

def addLoggingLevel(levelName, levelNum, methodName=None):
    """
    Comprehensively adds a new logging level to the `logging` module and the
    currently configured logging class.

    `levelName` becomes an attribute of the `logging` module with the value
    `levelNum`. `methodName` becomes a convenience method for both `logging`
    itself and the class returned by `logging.getLoggerClass()` (usually just
    `logging.Logger`). If `methodName` is not specified, `levelName.lower()` is
    used.

    To avoid accidental clobberings of existing attributes, this method will
    raise an `AttributeError` if the level name is already an attribute of the
    `logging` module or if the method name is already present.

    Example
    -------
    >>> addLoggingLevel('TRACE', logging.DEBUG - 5)
    >>> logging.getLogger(__name__).setLevel("TRACE")
    >>> logging.getLogger(__name__).trace('that worked')
    >>> logging.trace('so did this')
    >>> logging.TRACE
    5

    """
    if not methodName:
        methodName = levelName.lower()

    if hasattr(logging, levelName):
       raise AttributeError('{} already defined in logging module'.format(levelName))
    if hasattr(logging, methodName):
       raise AttributeError('{} already defined in logging module'.format(methodName))
    if hasattr(logging.getLoggerClass(), methodName):
       raise AttributeError('{} already defined in logger class'.format(methodName))

    # This method was inspired by the answers to Stack Overflow post
    # http://stackoverflow.com/q/2183233/2988730, especially
    # http://stackoverflow.com/a/13638084/2988730
    def logForLevel(self, message, *args, **kwargs):
        if self.isEnabledFor(levelNum):
            self._log(levelNum, message, args, **kwargs)
    def logToRoot(message, *args, **kwargs):
        logging.log(levelNum, message, *args, **kwargs)

    logging.addLevelName(levelNum, levelName)
    setattr(logging, levelName, levelNum)
    setattr(logging.getLoggerClass(), methodName, logForLevel)
    setattr(logging, methodName, logToRoot)

回答 3

这个问题比较老,但是我只是处理相同的主题,并发现了一种与已经提到的类似的方法,对我来说似乎更干净。这已经在3.4上进行了测试,因此我不确定所使用的方法是否在较早的版本中存在:

from logging import getLoggerClass, addLevelName, setLoggerClass, NOTSET

VERBOSE = 5

class MyLogger(getLoggerClass()):
    def __init__(self, name, level=NOTSET):
        super().__init__(name, level)

        addLevelName(VERBOSE, "VERBOSE")

    def verbose(self, msg, *args, **kwargs):
        if self.isEnabledFor(VERBOSE):
            self._log(VERBOSE, msg, args, **kwargs)

setLoggerClass(MyLogger)

This question is rather old, but I just dealt with the same topic and found a way similiar to those already mentioned which appears a little cleaner to me. This was tested on 3.4, so I’m not sure whether the methods used exist in older versions:

from logging import getLoggerClass, addLevelName, setLoggerClass, NOTSET

VERBOSE = 5

class MyLogger(getLoggerClass()):
    def __init__(self, name, level=NOTSET):
        super().__init__(name, level)

        addLevelName(VERBOSE, "VERBOSE")

    def verbose(self, msg, *args, **kwargs):
        if self.isEnabledFor(VERBOSE):
            self._log(VERBOSE, msg, args, **kwargs)

setLoggerClass(MyLogger)
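
A short usage sketch (my addition): the key point is that setLoggerClass must run before the first getLogger call for the loggers you care about:

import logging
# MyLogger defined and registered via setLoggerClass as above

logging.basicConfig(level=VERBOSE)
log = logging.getLogger(__name__)   # returns a MyLogger instance
log.verbose('a VERBOSE-level message')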

回答 4

是谁开创了使用内部方法(self._log)这种坏做法,为什么每个答案都基于它?!pythonic的解决方案是改用self.log,这样您就不必弄乱任何内部实现:

import logging

SUBDEBUG = 5
logging.addLevelName(SUBDEBUG, 'SUBDEBUG')

def subdebug(self, message, *args, **kws):
    self.log(SUBDEBUG, message, *args, **kws) 
logging.Logger.subdebug = subdebug

logging.basicConfig()
l = logging.getLogger()
l.setLevel(SUBDEBUG)
l.subdebug('test')
l.setLevel(logging.DEBUG)
l.subdebug('test')

Who started the bad practice of using internal methods (self._log) and why is each answer based on that?! The pythonic solution would be to use self.log instead so you don’t have to mess with any internal stuff:

import logging

SUBDEBUG = 5
logging.addLevelName(SUBDEBUG, 'SUBDEBUG')

def subdebug(self, message, *args, **kws):
    self.log(SUBDEBUG, message, *args, **kws) 
logging.Logger.subdebug = subdebug

logging.basicConfig()
l = logging.getLogger()
l.setLevel(SUBDEBUG)
l.subdebug('test')
l.setLevel(logging.DEBUG)
l.subdebug('test')

回答 5

我发现直接在logger对象上创建一个传递给log()函数的新属性更容易。我认为logger模块正是为此提供了addLevelName()和log()。因此不需要子类,也不需要新方法。

import logging

@property
def log(obj):
    logging.addLevelName(5, 'TRACE')
    myLogger = logging.getLogger(obj.__class__.__name__)
    setattr(myLogger, 'trace', lambda *args: myLogger.log(5, *args))
    return myLogger

现在

mylogger.trace('This is a trace message')

应该能按预期工作。

I find it easier to create a new attribute for the logger object that passes the log() function. I think the logger module provides the addLevelName() and the log() for this very reason. Thus no subclasses or new method needed.

import logging

@property
def log(obj):
    logging.addLevelName(5, 'TRACE')
    myLogger = logging.getLogger(obj.__class__.__name__)
    setattr(myLogger, 'trace', lambda *args: myLogger.log(5, *args))
    return myLogger

now

mylogger.trace('This is a trace message')

should work as expected.


回答 6

虽然我们已经有了很多正确的答案,但我认为以下做法更加pythonic:

import logging

from functools import partial, partialmethod

logging.TRACE = 5
logging.addLevelName(logging.TRACE, 'TRACE')
logging.Logger.trace = partialmethod(logging.Logger.log, logging.TRACE)
logging.trace = partial(logging.log, logging.TRACE)

如果要在代码上使用mypy,建议添加# type: ignore以抑制关于添加属性的警告。

While we have already plenty of correct answers, the following is in my opinion more pythonic:

import logging

from functools import partial, partialmethod

logging.TRACE = 5
logging.addLevelName(logging.TRACE, 'TRACE')
logging.Logger.trace = partialmethod(logging.Logger.log, logging.TRACE)
logging.trace = partial(logging.log, logging.TRACE)

If you want to use mypy on your code, it is recommended to add # type: ignore to suppress warnings from adding attribute.
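
A brief usage check for the snippet above (my addition, assuming those four lines have already been executed):

import logging

logging.basicConfig(level=logging.TRACE)
log = logging.getLogger(__name__)
log.trace('via the Logger.trace partialmethod')  # bound call to Logger.log(TRACE, ...)
logging.trace('via the module-level partial')    # logs through the root logger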


回答 7

我认为您必须子类化Logger类,并添加一个名为trace的方法,它基本上以低于DEBUG的级别调用Logger.log。我还没有尝试过,但这就是文档所指示的。

I think you’ll have to subclass the Logger class and add a method called trace which basically calls Logger.log with a level lower than DEBUG. I haven’t tried this but this is what the docs indicate.


回答 8

创建自定义记录器的提示:

  1. 不要使用_log,而使用log(这样就不必检查isEnabledFor)
  2. 应该由logging模块来创建自定义记录器的实例,因为getLogger中有一些魔法,因此您需要通过setLoggerClass来设置类
  3. 如果您不存储任何内容,则无需为记录器类定义__init__
# Lower than debug which is 10
TRACE = 5
class MyLogger(logging.Logger):
    def trace(self, msg, *args, **kwargs):
        self.log(TRACE, msg, *args, **kwargs)

使用此记录器时,请先调用setLoggerClass(MyLogger),使其成为getLogger返回的默认记录器:

logging.setLoggerClass(MyLogger)
log = logging.getLogger(__name__)
# ...
log.trace("something specific")

您需要在handler和log自身上调用setFormatter、setHandler以及setLevel(TRACE),才能真正看到这个低级别的trace输出

Tips for creating a custom logger:

  1. Do not use _log, use log (you don’t have to check isEnabledFor)
  2. the logging module should be the one creating instance of the custom logger since it does some magic in getLogger, so you will need to set the class via setLoggerClass
  3. You do not need to define __init__ for the logger, class if you are not storing anything
# Lower than debug which is 10
TRACE = 5
class MyLogger(logging.Logger):
    def trace(self, msg, *args, **kwargs):
        self.log(TRACE, msg, *args, **kwargs)

When calling this logger use setLoggerClass(MyLogger) to make this the default logger from getLogger

logging.setLoggerClass(MyLogger)
log = logging.getLogger(__name__)
# ...
log.trace("something specific")

You will need to setFormatter, setHandler, and setLevel(TRACE) on the handler and on the log itself to actually see this low level trace


回答 9

这对我有用:

import logging
logging.basicConfig(
    format='  %(levelname)-8.8s %(funcName)s: %(message)s',
)
logging.NOTE = 32  # positive yet important
logging.addLevelName(logging.NOTE, 'NOTE')      # new level
logging.addLevelName(logging.CRITICAL, 'FATAL') # rename existing

log = logging.getLogger(__name__)
log.note = lambda msg, *args: log._log(logging.NOTE, msg, args)
log.note('school\'s out for summer! %s', 'dude')
log.fatal('file not found.')

lambda / funcName问题已通过@marqueed指出的logger._log解决。我认为使用lambda看起来更干净一些,但是缺点是它不能接受关键字参数。我自己从来没有用过,所以没什么大不了的。

  NOTE     setup: school's out for summer! dude
  FATAL    setup: file not found.

This worked for me:

import logging
logging.basicConfig(
    format='  %(levelname)-8.8s %(funcName)s: %(message)s',
)
logging.NOTE = 32  # positive yet important
logging.addLevelName(logging.NOTE, 'NOTE')      # new level
logging.addLevelName(logging.CRITICAL, 'FATAL') # rename existing

log = logging.getLogger(__name__)
log.note = lambda msg, *args: log._log(logging.NOTE, msg, args)
log.note('school\'s out for summer! %s', 'dude')
log.fatal('file not found.')

The lambda/funcName issue is fixed with logger._log as @marqueed pointed out. I think using lambda looks a bit cleaner, but the drawback is that it can’t take keyword arguments. I’ve never used that myself, so no biggie.

  NOTE     setup: school's out for summer! dude
  FATAL    setup: file not found.

回答 10

以我的经验,这是对楼主问题的完整解决方案……为了避免看到“lambda”作为发出消息的函数,需要更深入一些:

MY_LEVEL_NUM = 25
logging.addLevelName(MY_LEVEL_NUM, "MY_LEVEL_NAME")
def log_at_my_log_level(self, message, *args, **kws):
    # Yes, logger takes its '*args' as 'args'.
    self._log(MY_LEVEL_NUM, message, args, **kws)
logger.log_at_my_log_level = log_at_my_log_level

我从未尝试过使用独立的记录器类,但我认为基本思想是相同的(使用_log)。

In my experience, this is the full solution to the op’s problem… to avoid seeing “lambda” as the function in which the message is emitted, go deeper:

MY_LEVEL_NUM = 25
logging.addLevelName(MY_LEVEL_NUM, "MY_LEVEL_NAME")
def log_at_my_log_level(self, message, *args, **kws):
    # Yes, logger takes its '*args' as 'args'.
    self._log(MY_LEVEL_NUM, message, args, **kws)
logger.log_at_my_log_level = log_at_my_log_level

I’ve never tried working with a standalone logger class, but I think the basic idea is the same (use _log).


回答 11

在Mad Physicist(疯狂物理学家)的示例基础上补充,以使文件名和行号正确无误:

def logToRoot(message, *args, **kwargs):
    if logging.root.isEnabledFor(levelNum):
        logging.root._log(levelNum, message, args, **kwargs)

Addition to Mad Physicist’s example, to get the file name and line number correct:

def logToRoot(message, *args, **kwargs):
    if logging.root.isEnabledFor(levelNum):
        logging.root._log(levelNum, message, args, **kwargs)

回答 12

基于置顶的答案,我写了一个可以自动创建新日志级别的小方法:

def set_custom_logging_levels(config={}):
    """
        Assign custom levels for logging
            config: is a dict, like
            {
                'EVENT_NAME': EVENT_LEVEL_NUM,
            }
        EVENT_LEVEL_NUM must not collide with a level the logging module already has:
        logging.DEBUG       = 10
        logging.INFO        = 20
        logging.WARNING     = 30
        logging.ERROR       = 40
        logging.CRITICAL    = 50
    """
    assert isinstance(config, dict), "Configuration must be a dict"

    def get_level_func(level_name, level_num):
        def _blank(self, message, *args, **kws):
            if self.isEnabledFor(level_num):
                # Yes, logger takes its '*args' as 'args'.
                self._log(level_num, message, args, **kws) 
        _blank.__name__ = level_name.lower()
        return _blank

    for level_name, level_num in config.items():
        logging.addLevelName(level_num, level_name.upper())
        setattr(logging.Logger, level_name.lower(), get_level_func(level_name, level_num))

配置可能像这样:

new_log_levels = {
    # level_num is in logging.INFO section, that's why it 21, 22, etc..
    "FOO":      21,
    "BAR":      22,
}

Based on the pinned answer, I wrote a little method which automatically creates new logging levels:

def set_custom_logging_levels(config={}):
    """
        Assign custom levels for logging
            config: is a dict, like
            {
                'EVENT_NAME': EVENT_LEVEL_NUM,
            }
        EVENT_LEVEL_NUM must not collide with a level the logging module already has:
        logging.DEBUG       = 10
        logging.INFO        = 20
        logging.WARNING     = 30
        logging.ERROR       = 40
        logging.CRITICAL    = 50
    """
    assert isinstance(config, dict), "Configuration must be a dict"

    def get_level_func(level_name, level_num):
        def _blank(self, message, *args, **kws):
            if self.isEnabledFor(level_num):
                # Yes, logger takes its '*args' as 'args'.
                self._log(level_num, message, args, **kws) 
        _blank.__name__ = level_name.lower()
        return _blank

    for level_name, level_num in config.items():
        logging.addLevelName(level_num, level_name.upper())
        setattr(logging.Logger, level_name.lower(), get_level_func(level_name, level_num))

config may look something like this:

new_log_levels = {
    # level_num is in logging.INFO section, that's why it 21, 22, etc..
    "FOO":      21,
    "BAR":      22,
}

回答 13

作为向Logger类添加额外方法的替代方案,我建议直接使用Logger.log(level, msg)方法。

import logging

TRACE = 5
logging.addLevelName(TRACE, 'TRACE')
FORMAT = '%(levelname)s:%(name)s:%(lineno)d:%(message)s'


logging.basicConfig(format=FORMAT)
l = logging.getLogger()
l.setLevel(TRACE)
l.log(TRACE, 'trace message')
l.setLevel(logging.DEBUG)
l.log(TRACE, 'disabled trace message')

As alternative to adding an extra method to the Logger class I would recommend using the Logger.log(level, msg) method.

import logging

TRACE = 5
logging.addLevelName(TRACE, 'TRACE')
FORMAT = '%(levelname)s:%(name)s:%(lineno)d:%(message)s'


logging.basicConfig(format=FORMAT)
l = logging.getLogger()
l.setLevel(TRACE)
l.log(TRACE, 'trace message')
l.setLevel(logging.DEBUG)
l.log(TRACE, 'disabled trace message')

回答 14

我很困惑; 至少在python 3.5中,它可以正常工作:

import logging


TRACE = 5
"""more detail than debug"""

logging.basicConfig()
logging.addLevelName(TRACE,"TRACE")
logger = logging.getLogger('')
logger.debug("n")
logger.setLevel(logging.DEBUG)
logger.debug("y1")
logger.log(TRACE,"n")
logger.setLevel(TRACE)
logger.log(TRACE,"y2")
    

输出:

DEBUG:root:y1

TRACE:root:y2

I’m confused; with python 3.5, at least, it just works:

import logging


TRACE = 5
"""more detail than debug"""

logging.basicConfig()
logging.addLevelName(TRACE,"TRACE")
logger = logging.getLogger('')
logger.debug("n")
logger.setLevel(logging.DEBUG)
logger.debug("y1")
logger.log(TRACE,"n")
logger.setLevel(TRACE)
logger.log(TRACE,"y2")
    

output:

DEBUG:root:y1

TRACE:root:y2


回答 15

万一有人想要一种自动的方式来动态地向日志记录模块(或其副本)添加新的日志记录级别,我创建了此函数,扩展了@pfa的答案:

def add_level(log_name,custom_log_module=None,log_num=None,
                log_call=None,
                   lower_than=None, higher_than=None, same_as=None,
              verbose=True):
    '''
    Function to dynamically add a new log level to a given custom logging module.
    <custom_log_module>: the logging module. If not provided, then a copy of
        <logging> module is used
    <log_name>: the logging level name
    <log_num>: the logging level num. If not provided, then function checks
        <lower_than>,<higher_than> and <same_as>, at the order mentioned.
        One of those three parameters must hold a string of an already existent
        logging level name.
    In case a level is overwritten and <verbose> is True, then a message in WARNING
        level of the custom logging module is established.
    '''
    if custom_log_module is None:
        import imp
        custom_log_module = imp.load_module('custom_log_module',
                                            *imp.find_module('logging'))
    log_name = log_name.upper()
    def cust_log(par, message, *args, **kws):
        # Yes, logger takes its '*args' as 'args'.
        if par.isEnabledFor(log_num):
            par._log(log_num, message, args, **kws)
    available_level_nums = [key for key in custom_log_module._levelNames
                            if isinstance(key,int)]

    available_levels = {key:custom_log_module._levelNames[key]
                             for key in custom_log_module._levelNames
                            if isinstance(key,str)}
    if log_num is None:
        try:
            if lower_than is not None:
                log_num = available_levels[lower_than]-1
            elif higher_than is not None:
                log_num = available_levels[higher_than]+1
            elif same_as is not None:
                log_num = available_levels[same_as]
            else:
                raise Exception('Information about the '+
                                'log_num should be provided')
        except KeyError:
            raise Exception('Non existent logging level name')
    if log_num in available_level_nums and verbose:
        custom_log_module.warn('Changing ' +
                                  custom_log_module._levelNames[log_num] +
                                  ' to '+log_name)
    custom_log_module.addLevelName(log_num, log_name)

    if log_call is None:
        log_call = log_name.lower()

    setattr(custom_log_module.Logger, log_call, cust_log)
    return custom_log_module

In case anyone wants an automated way to add a new logging level to the logging module (or a copy of it) dynamically, I have created this function, expanding @pfa’s answer:

def add_level(log_name,custom_log_module=None,log_num=None,
                log_call=None,
                   lower_than=None, higher_than=None, same_as=None,
              verbose=True):
    '''
    Function to dynamically add a new log level to a given custom logging module.
    <custom_log_module>: the logging module. If not provided, then a copy of
        <logging> module is used
    <log_name>: the logging level name
    <log_num>: the logging level num. If not provided, then function checks
        <lower_than>,<higher_than> and <same_as>, at the order mentioned.
        One of those three parameters must hold a string of an already existent
        logging level name.
    In case a level is overwritten and <verbose> is True, then a message in WARNING
        level of the custom logging module is established.
    '''
    if custom_log_module is None:
        import imp
        custom_log_module = imp.load_module('custom_log_module',
                                            *imp.find_module('logging'))
    log_name = log_name.upper()
    def cust_log(par, message, *args, **kws):
        # Yes, logger takes its '*args' as 'args'.
        if par.isEnabledFor(log_num):
            par._log(log_num, message, args, **kws)
    available_level_nums = [key for key in custom_log_module._levelNames
                            if isinstance(key,int)]

    available_levels = {key:custom_log_module._levelNames[key]
                             for key in custom_log_module._levelNames
                            if isinstance(key,str)}
    if log_num is None:
        try:
            if lower_than is not None:
                log_num = available_levels[lower_than]-1
            elif higher_than is not None:
                log_num = available_levels[higher_than]+1
            elif same_as is not None:
                log_num = available_levels[same_as]
            else:
                raise Exception('Information about the '+
                                'log_num should be provided')
        except KeyError:
            raise Exception('Non existent logging level name')
    if log_num in available_level_nums and verbose:
        custom_log_module.warn('Changing ' +
                                  custom_log_module._levelNames[log_num] +
                                  ' to '+log_name)
    custom_log_module.addLevelName(log_num, log_name)

    if log_call is None:
        log_call = log_name.lower()

    setattr(custom_log_module.Logger, log_call, cust_log)
    return custom_log_module

如何消除matplotlib中子图之间的间隙?

问题:如何消除matplotlib中子图之间的间隙?

下面的代码在子图之间产生了间隙。如何消除子图之间的间隙,使图像形成一个紧密的网格?

import matplotlib.pyplot as plt

for i in range(16):
    i = i + 1
    ax1 = plt.subplot(4, 4, i)
    plt.axis('on')
    ax1.set_xticklabels([])
    ax1.set_yticklabels([])
    ax1.set_aspect('equal')
    plt.subplots_adjust(wspace=None, hspace=None)
plt.show()

The code below produces gaps between the subplots. How do I remove the gaps between the subplots and make the image a tight grid?

import matplotlib.pyplot as plt

for i in range(16):
    i = i + 1
    ax1 = plt.subplot(4, 4, i)
    plt.axis('on')
    ax1.set_xticklabels([])
    ax1.set_yticklabels([])
    ax1.set_aspect('equal')
    plt.subplots_adjust(wspace=None, hspace=None)
plt.show()

回答 0

您可以使用gridspec来控制轴之间的间距。这里有更多信息

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

plt.figure(figsize = (4,4))
gs1 = gridspec.GridSpec(4, 4)
gs1.update(wspace=0.025, hspace=0.05) # set the spacing between axes. 

for i in range(16):
   # i = i + 1 # grid spec indexes from 0
    ax1 = plt.subplot(gs1[i])
    plt.axis('on')
    ax1.set_xticklabels([])
    ax1.set_yticklabels([])
    ax1.set_aspect('equal')

plt.show()

You can use gridspec to control the spacing between axes. There’s more information here.

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

plt.figure(figsize = (4,4))
gs1 = gridspec.GridSpec(4, 4)
gs1.update(wspace=0.025, hspace=0.05) # set the spacing between axes. 

for i in range(16):
   # i = i + 1 # grid spec indexes from 0
    ax1 = plt.subplot(gs1[i])
    plt.axis('on')
    ax1.set_xticklabels([])
    ax1.set_yticklabels([])
    ax1.set_aspect('equal')

plt.show()


回答 1

问题在于使用了aspect='equal',它阻止子图拉伸到任意纵横比并填满所有空白空间。

通常,这可以工作:

import matplotlib.pyplot as plt

ax = [plt.subplot(2,2,i+1) for i in range(4)]

for a in ax:
    a.set_xticklabels([])
    a.set_yticklabels([])

plt.subplots_adjust(wspace=0, hspace=0)

结果是这样的:

但是,使用aspect='equal',如以下代码所示:

import matplotlib.pyplot as plt

ax = [plt.subplot(2,2,i+1) for i in range(4)]

for a in ax:
    a.set_xticklabels([])
    a.set_yticklabels([])
    a.set_aspect('equal')

plt.subplots_adjust(wspace=0, hspace=0)

这是我们得到的:

第二种情况的区别在于,您已将x轴和y轴强制设置为具有相同数量的单位/像素。由于默认情况下轴从0变为1(即在绘制任何东西之前),因此使用aspect='equal'强制每个轴为正方形。由于该图不是正方形,因此pyplot会在水平轴之间增加额外的间距。

要解决此问题,可以将图形设置为具有正确的宽高比。我们将在这里使用面向对象的pyplot接口,我认为它通常是更好的:

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8,8)) # Notice the equal aspect ratio
ax = [fig.add_subplot(2,2,i+1) for i in range(4)]

for a in ax:
    a.set_xticklabels([])
    a.set_yticklabels([])
    a.set_aspect('equal')

fig.subplots_adjust(wspace=0, hspace=0)

结果如下:

The problem is the use of aspect='equal', which prevents the subplots from stretching to an arbitrary aspect ratio and filling up all the empty space.

Normally, this would work:

import matplotlib.pyplot as plt

ax = [plt.subplot(2,2,i+1) for i in range(4)]

for a in ax:
    a.set_xticklabels([])
    a.set_yticklabels([])

plt.subplots_adjust(wspace=0, hspace=0)

The result is this:

However, with aspect='equal', as in the following code:

import matplotlib.pyplot as plt

ax = [plt.subplot(2,2,i+1) for i in range(4)]

for a in ax:
    a.set_xticklabels([])
    a.set_yticklabels([])
    a.set_aspect('equal')

plt.subplots_adjust(wspace=0, hspace=0)

This is what we get:

The difference in this second case is that you’ve forced the x- and y-axes to have the same number of units/pixel. Since the axes go from 0 to 1 by default (i.e., before you plot anything), using aspect='equal' forces each axis to be a square. Since the figure is not a square, pyplot adds in extra spacing between the axes horizontally.

To get around this problem, you can set your figure to have the correct aspect ratio. We’re going to use the object-oriented pyplot interface here, which I consider to be superior in general:

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8,8)) # Notice the equal aspect ratio
ax = [fig.add_subplot(2,2,i+1) for i in range(4)]

for a in ax:
    a.set_xticklabels([])
    a.set_yticklabels([])
    a.set_aspect('equal')

fig.subplots_adjust(wspace=0, hspace=0)

Here’s the result:


回答 2

在不完全依赖gridspec的情况下,也可以通过将wspace和hspace设置为零来消除间隙:

import matplotlib.pyplot as plt

plt.clf()
f, axarr = plt.subplots(4, 4, gridspec_kw = {'wspace':0, 'hspace':0})

for i, ax in enumerate(f.axes):
    ax.grid('on', linestyle='--')
    ax.set_xticklabels([])
    ax.set_yticklabels([])

plt.show()
plt.close()

结果如下:

Without resorting to gridspec entirely, the following might also be used to remove the gaps by setting wspace and hspace to zero:

import matplotlib.pyplot as plt

plt.clf()
f, axarr = plt.subplots(4, 4, gridspec_kw = {'wspace':0, 'hspace':0})

for i, ax in enumerate(f.axes):
    ax.grid('on', linestyle='--')
    ax.set_xticklabels([])
    ax.set_yticklabels([])

plt.show()
plt.close()

Resulting in:


回答 3

你试过了plt.tight_layout()吗?

使用plt.tight_layout()的效果与不使用它的对比:

或者:类似这样的东西(使用add_axes

left=[0.1,0.3,0.5,0.7]
width=[0.2,0.2, 0.2, 0.2]
rectLS=[]
for x in left:
   for y in left:
       rectLS.append([x, y, 0.2, 0.2])
axLS=[]
fig=plt.figure()
axLS.append(fig.add_axes(rectLS[0]))
for i in [1,2,3]:
     axLS.append(fig.add_axes(rectLS[i],sharey=axLS[-1]))    
axLS.append(fig.add_axes(rectLS[4]))
for i in [1,2,3]:
     axLS.append(fig.add_axes(rectLS[i+4],sharex=axLS[i],sharey=axLS[-1]))
axLS.append(fig.add_axes(rectLS[8]))
for i in [5,6,7]:
     axLS.append(fig.add_axes(rectLS[i+4],sharex=axLS[i],sharey=axLS[-1]))     
axLS.append(fig.add_axes(rectLS[12]))
for i in [9,10,11]:
     axLS.append(fig.add_axes(rectLS[i+4],sharex=axLS[i],sharey=axLS[-1]))

如果您不需要共享轴,则只需 axLS=map(fig.add_axes, rectLS)

Have you tried plt.tight_layout()?

with plt.tight_layout(), and without it:

Or: something like this (use add_axes)

left=[0.1,0.3,0.5,0.7]
width=[0.2,0.2, 0.2, 0.2]
rectLS=[]
for x in left:
   for y in left:
       rectLS.append([x, y, 0.2, 0.2])
axLS=[]
fig=plt.figure()
axLS.append(fig.add_axes(rectLS[0]))
for i in [1,2,3]:
     axLS.append(fig.add_axes(rectLS[i],sharey=axLS[-1]))    
axLS.append(fig.add_axes(rectLS[4]))
for i in [1,2,3]:
     axLS.append(fig.add_axes(rectLS[i+4],sharex=axLS[i],sharey=axLS[-1]))
axLS.append(fig.add_axes(rectLS[8]))
for i in [5,6,7]:
     axLS.append(fig.add_axes(rectLS[i+4],sharex=axLS[i],sharey=axLS[-1]))     
axLS.append(fig.add_axes(rectLS[12]))
for i in [9,10,11]:
     axLS.append(fig.add_axes(rectLS[i+4],sharex=axLS[i],sharey=axLS[-1]))

If you don’t need to share axes, then simply axLS=map(fig.add_axes, rectLS)


回答 4

对于最新的matplotlib版本,您可能需要尝试Constrained Layout。但是它对plt.subplot()不起作用,因此您需要改用plt.subplots():

fig, axs = plt.subplots(4, 4, constrained_layout=True)

With recent matplotlib versions you might want to try Constrained Layout. This does not work with plt.subplot() however, so you need to use plt.subplots() instead:

fig, axs = plt.subplots(4, 4, constrained_layout=True)
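
A slightly fuller sketch (my addition; constrained layout keeps some padding by default, and set_constrained_layout_pads lets you shrink it; the exact behavior depends on your matplotlib version):

import matplotlib.pyplot as plt

fig, axs = plt.subplots(4, 4, constrained_layout=True)
fig.set_constrained_layout_pads(w_pad=0, h_pad=0, wspace=0, hspace=0)

for ax in axs.flat:
    ax.set_xticklabels([])
    ax.set_yticklabels([])
    ax.set_aspect('equal')

plt.show()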