Python 实用宝典

Question 1

I am setting out to do a side project that has the goal of translating code from one programming language to another. The languages I am starting with are PHP and Python (Python to PHP should be easier to start with), but ideally I would be able to add other languages with (relative) ease. The plan is:

This is geared towards web development. The original and target code will be be sitting on top of frameworks (which I will also have to write). These frameworks will embrace an MVC design pattern and follow strict coding conventions. This should make translation somewhat easier.
I am also looking at IOC and dependency injection, as they might make the translation process easier and less error prone.
I’ll make use of Python’s parser module, which lets me fiddle with the Abstract Syntax Tree. Apparently the closest I can get with PHP is token_get_all(), which is a start.
From then on I can build the AST, symbol tables and control flow.

Then I believe I can start outputting code. I don’t need a perfect translation. I’ll still have to review the generated code and fix problems. Ideally the translator should flag problematic translations.

Before you ask “What the hell is the point of this?” The answer is… It’ll be an interesting learning experience. If you have any insights on how to make this less daunting, please let me know.

EDIT:

I am more interested in knowing what kinds of patterns I could enforce on the code to make it easier to translate (ie: IoC, SOA ?) the code than how to do the translation.

Question 2

I’ve been building tools (DMS Software Reengineering Toolkit) to do general purpose program manipulation (with language translation being a special case) since 1995, supported by a strong team of computer scientists. DMS provides generic parsing, AST building, symbol tables, control and data flow analysis, application of translation rules, regeneration of source text with comments, etc., all parameterized by explicit definitions of computer languages.

The amount of machinery you need to do this well is vast (especially if you want to be able to do this for multiple languages in a general way), and then you need reliable parsers for languages with unreliable definitions (PHP is perfect example of this).

There’s nothing wrong with you thinking about building a language-to-language translator or attempting it, but I think you’ll find this a much bigger task for real languages than you expect. We have some 100 man-years invested in just DMS, and another 6-12 months in each “reliable” language definition (including the one we painfully built for PHP), much more for nasty languages such as C++. It will be a “hell of a learning experience”; it has been for us. (You might find the technical Papers section at the above website interesting to jump start that learning).

People often attempt to build some kind of generalized machinery by starting with some piece of technology with which they are familiar, that does a part of the job. (Python ASTs are great example). The good news, is that part of the job is done. The bad news is that machinery has a zillion assumptions built into it, most of which you won’t discover until you try to wrestle it into doing something else. At that point you find out the machinery is wired to do what it originally does, and will really, really resist your attempt to make it do something else. (I suspect trying to get the Python AST to model PHP is going to be a lot of fun).

The reason I started to build DMS originally was to build foundations that had very few such assumptions built in. It has some that give us headaches. So far, no black holes. (The hardest part of my job over the last 15 years is to try to prevent such assumptions from creeping in).

Lots of folks also make the mistake of assuming that if they can parse (and perhaps get an AST), they are well on the way to doing something complicated. One of the hard lessons is that you need symbol tables and flow analysis to do good program analysis or transformation. ASTs are necessary but not sufficient. This is the reason that Aho&Ullman’s compiler book doesn’t stop at chapter 2. (The OP has this right in that he is planning to build additional machinery beyond the AST). For more on this topic, see Life After Parsing.

The remark about “I don’t need a perfect translation” is troublesome. What weak translators do is convert the “easy” 80% of the code, leaving the hard 20% to do by hand. If the application you intend to convert are pretty small, and you only intend to convert it once well, then that 20% is OK. If you want to convert many applications (or even the same one with minor changes over time), this is not nice. If you attempt to convert 100K SLOC then 20% is 20,000 original lines of code that are hard to translate, understand and modify in the context of another 80,000 lines of translated program you already don’t understand. That takes a huge amount of effort. At the million line level, this is simply impossible in practice. (Amazingly there are people that distrust automated tools and insist on translating million line systems by hand; that’s even harder and they normally find out painfully with long time delays, high costs and often outright failure.)

What you have to shoot for to translate large-scale systems is high nineties percentage conversion rates, or it is likely that you can’t complete the manual part of the translation activity.

Another key consideration is size of code to be translated. It takes a lot of energy to build a working, robust translator, even with good tools. While it seems sexy and cool to build a translator instead of simply doing a manual conversion, for small code bases (e.g., up to about 100K SLOC in our experience) the economics simply don’t justify it. Nobody likes this answer, but if you really have to translate just 10K SLOC of code, you are probably better off just biting the bullet and doing it. And yes, that’s painful.

I consider our tools to be extremely good (but then, I’m pretty biased). And it is still very hard to build a good translator; it takes us about 1.5-2 man-years and we know how to use our tools. The difference is that with this much machinery, we succeed considerably more often than we fail.

Question 3

My answer will address the specific task of parsing Python in order to translate it to another language, and not the higher-level aspects which Ira addressed well in his answer.

In short: do not use the parser module, there’s an easier way.

The ast module, available since Python 2.6 is much more suitable for your needs, since it gives you a ready-made AST to work with. I’ve written an article on this last year, but in short, use the parse method of ast to parse Python source code into an AST. The parser module will give you a parse tree, not an AST. Be wary of the difference.

Now, since Python’s ASTs are quite detailed, given an AST the front-end job isn’t terribly hard. I suppose you can have a simple prototype for some parts of the functionality ready quite quickly. However, getting to a complete solution will take more time, mainly because the semantics of the languages are different. A simple subset of the language (functions, basic types and so on) can be readily translated, but once you get into the more complex layers, you’ll need heavy machinery to emulate one language’s core in another. For example consider Python’s generators and list comprehensions which don’t exist in PHP (to my best knowledge, which is admittedly poor when PHP is involved).

To give you one final tip, consider the 2to3 tool created by the Python devs to translate Python 2 code to Python 3 code. Front-end-wise, it has most of the elements you need to translate Python to something. However, since the cores of Python 2 and 3 are similar, no emulation machinery is required there.

Question 4

Writing a translator isn’t impossible, especially considering that Joel’s Intern did it over a summer.

If you want to do one language, it’s easy. If you want to do more, it’s a little more difficult, but not too much. The hardest part is that, while any turing complete language can do what another turing complete language does, built-in data types can change what a language does phenomenally.

For instance:

word = 'This is not a word'
print word[::-2]

takes a lot of C++ code to duplicate (ok, well you can do it fairly short with some looping constructs, but still).

That’s a bit of an aside, I guess.

Have you ever written a tokenizer/parser based on a language grammar? You’ll probably want to learn how to do that if you haven’t, because that’s the main part of this project. What I would do is come up with a basic Turing complete syntax – something fairly similar to Python bytecode. Then you create a lexer/parser that takes a language grammar (perhaps using BNF), and based on the grammar, compiles the language into your intermediate language. Then what you’ll want to do is do the reverse – create a parser from your language into target languages based on the grammar.

The most obvious problem I see is that at first you’ll probably create horribly inefficient code, especially in more powerful* languages like Python.

But if you do it this way then you’ll probably be able to figure out ways to optimize the output as you go along. To summarize:

read provided grammar
compile program into intermediate (but also Turing complete) syntax
compile intermediate program into final language (based on provided grammar)
…?
Profit!(?)

*by powerful I mean that this takes 4 lines:

myinput = raw_input("Enter something: ")
print myinput.replace('a', 'A')
print sum(ord(c) for c in myinput)
print myinput[::-1]

Show me another language that can do something like that in 4 lines, and I’ll show you a language that’s as powerful as Python.

Question 5

There are a couple answers telling you not to bother. Well, how helpful is that? You want to learn? You can learn. This is compilation. It just so happens that your target language isn’t machine code, but another high-level language. This is done all the time.

There’s a relatively easy way to get started. First, go get http://sourceforge.net/projects/lime-php/ (if you want to work in PHP) or some such and go through the example code. Next, you can write a lexical analyzer using a sequence of regular expressions and feed tokens to the parser you generate. Your semantic actions can either output code directly in another language or build up some data structure (think objects, man) that you can massage and traverse to generate output code.

You’re lucky with PHP and Python because in many respects they are the same language as each other, but with different syntax. The hard part is getting over the semantic differences between the grammar forms and data structures. For example, Python has lists and dictionaries, while PHP only has assoc arrays.

The “learner” approach is to build something that works OK for a restricted subset of the language (such as only print statements, simple math, and variable assignment), and then progressively remove limitations. That’s basically what the “big” guys in the field all did.

Oh, and since you don’t have static types in Python, it might be best to write and rely on PHP functions like “python_add” which adds numbers, strings, or objects according to the way Python does it.

Obviously, this can get much bigger if you let it.

Question 6

I will second @EliBendersky point of view regarding using ast.parse instead of parser (which I did not know about before). I also warmly recommend you to review his blog. I used ast.parse to do Python->JavaScript translator (@https://bitbucket.org/amirouche/pythonium). I’ve come up with Pythonium design by somewhat reviewing other implementations and trying them on my own. I forked Pythonium from https://github.com/PythonJS/PythonJS which I also started, It’s actually a complete rewrite . The overall design is inspired from PyPy and http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-89-1.pdf paper.

Everything I tried, from beginning to the best solution, even if it looks like Pythonium marketing it really isn’t (don’t hesitate to tell me if something doesn’t seem correct to the netiquette):

Implement Python semantic in Plain Old JavaScript using prototype inheritance: AFAIK it’s impossible to implement Python multiple inheritance using JS prototype object system. I did try to do it using other tricks later (cf. getattribute). As far as I know there is no implementation of Python multiple inheritance in JavaScript, the best that exists is Single inhertance + mixins and I’m not sure they handle diamond inheritance. Kind of similar to Skulpt but without google clojure.
I tried with Google clojure, just like Skulpt (compiler) instead of actually reading Skulpt code #fail. Anyway because of JS prototype based object system still impossible. Creating binding was very very difficult, you need to write JavaScript and a lot of boilerplate code (cf. https://github.com/skulpt/skulpt/issues/50 where I am the ghost). At that time there was no clear way to integrate the binding in the build system. I think that Skulpt is a library and you just have to include your .py files in the html to be executed, no compilation phase required to be done by the developer.
Tried pyjaco (compiler) but creating bindings (calling Javascript code from Python code) was very difficult, there was too much boilerplate code to create every time. Now I think pyjaco is the one that more near Pythonium. pyjaco is written in Python (ast.parse too) but a lot is written in JavaScript and it use prototype inheritance.

I never actually succeed at running Pyjamas #fail and never tried to read the code #fail again. But in my mind PyJamas was doing API->API tranlation (or framework to framework) and not Python to JavaScript translation. The JavaScript framework consume data that is already in the page or data from the server. Python code is only “plumbing”. After that I discovered that pyjamas was actually a real python->js translator.

Still I think it’s possible to do API->API (or framework->framework) translation and that’s basicly what I do in Pythonium but at lower level. Probably Pyjamas use the same algorithm as Pythonium…

Then I discovered brython fully written in Javascript like Skulpt, no need for compilation and lot of fluff… but written in JavaScript.

Since the initial line written in the course of this project, I knew about PyPy, even the JavaScript backend for PyPy. Yep, you can, if you find it, directly generate a Python interpreter in JavaScript from PyPy. People say, it was a disaster. I read no where why. But I think the reason is that the intermediate language they use to implement the interpreter, RPython, is a subset of Python tailored to be translated to C (and maybe asm). Ira Baxter says you always make assumptions when you build something and probably you fine tune it to be the best at what it’s meant to do in the case of PyPy: Python->C translation. Those assumptions might not be relevant in another context worse they can infere overhead otherwise said direct translation will most likely always be better.

Having the interpreter written in Python sounded like a (very) good idea. But I was more interested in a compiler for performance reasons also it’s actually more easy to compile Python to JavaScript than interpret it.

I started PythonJS with the idea of putting together a subset of Python that I could easily translate to JavaScript. At first I didn’t even bother to implement OO system because of past experience. The subset of Python that I achieved to translate to JavaScript are:

function with full parameters semantic both in definition and calling. This is the part I am most proud of.
while/if/elif/else
Python types were converted to JavaScript types (there is no python types of any kind)
for could iterate over Javascript arrays only (for a in array)
Transparent access to JavaScript: if you write Array in the Python code it will be translated to Array in javascript. This is the biggest achievement in terms of usability over its competitors.
You can pass function defined in Python source to javascript functions. Default arguments will be taken into account.
It add has special function called new which is translated to JavaScript new e.g: new(Python)(1, 2, spam, “egg”) is translated to “new Python(1, 2, spam, “egg”).
“var” are automatically handled by the translator. (very nice finding from Brett (PythonJS contributor).
global keyword
closures
lambdas
list comprehensions
imports are supported via requirejs
single class inheritance + mixin via classyjs

This seems like a lot but actually very narrow compared to full blown semantic of Python. It’s really JavaScript with a Python syntax.

The generated JS is perfect ie. there is no overhead, it can not be improved in terms of performance by further editing it. If you can improve the generated code, you can do it from the Python source file too. Also, the compiler did not rely on any JS tricks that you can find in .js written by http://superherojs.com/, so it’s very readable.

The direct descendant of this part of PythonJS is the Pythonium Veloce mode. The full implementation can be found @ https://bitbucket.org/amirouche/pythonium/src/33898da731ee2d768ced392f1c369afd746c25d7/pythonium/veloce/veloce.py?at=master 793 SLOC + around 100 SLOC of shared code with the other translator.

An adapted version of pystones.py can be translated in Veloce mode cf. https://bitbucket.org/amirouche/pythonium/src/33898da731ee2d768ced392f1c369afd746c25d7/pystone/?at=master

After having setup basic Python->JavaScript translation I choosed another path to translate full Python to JavaScript. The way of glib doing object oriented class based code except the target language is JS so you have access to arrays, map-like objects and many other tricks and all that part was written in Python. IIRC there is no javascript code written by in Pythonium translator. Getting single inheritance is not difficult here are the difficult parts making Pythonium fully compliant with Python:

spam.egg in Python is always translated to getattribute(spam, "egg") I did not profile this in particular but I think that where it loose a lot of time and I’m not sure I can improve upon it with asm.js or anything else.
method resolution order: even with the algorithm written in Python, translating it to Python Veloce compatible code was a big endeavour.
getattributre: the actual getattribute resolution algorithm is kind of tricky and it still doesn’t support data descriptors
metaclass class based: I know where to plug the code, but still…
last bu not least: some_callable(…) is always transalted to “call(some_callable)”. AFAIK the translator doesn’t use inference at all, so every time you do a call you need to check which kind of object it is to call it they way it’s meant to be called.

This part is factored in https://bitbucket.org/amirouche/pythonium/src/33898da731ee2d768ced392f1c369afd746c25d7/pythonium/compliant/runtime.py?at=master It’s written in Python compatible with Python Veloce.

The actual compliant translator https://bitbucket.org/amirouche/pythonium/src/33898da731ee2d768ced392f1c369afd746c25d7/pythonium/compliant/compliant.py?at=master doesn’t generate JavaScript code directly and most importantly doesn’t do ast->ast transformation. I tried the ast->ast thing and ast even if nicer than cst is not nice to work with even with ast.NodeTransformer and more importantly I don’t need to do ast->ast.

Doing python ast to python ast in my case at least would maybe be a performance improvement since I sometime inspect the content of a block before generating the code associated with it, for instance:

var/global: to be able to var something I must know what I need to and not to var. Instead of generating a block tracking which variable are created in a given block and inserting it on top of the generated function block I just look for revelant variable assignation when I enter the block before actually visiting the child node to generate the associated code.
yield, generators have, as of yet, a special syntax in JS, so I need to know which Python function is a generator when I want to write the “var my_generator = function”

So I don’t really visit each node once for each phase of the translation.

The overall process can be described as:

Python source code -> Python ast -> Python source code compatible with Veloce mode -> Python ast -> JavaScript source code

Python builtins are written in Python code (!), IIRC there is a few restrictions related to bootstraping types, but you have access to everything that can translate Pythonium in compliant mode. Have a look at https://bitbucket.org/amirouche/pythonium/src/33898da731ee2d768ced392f1c369afd746c25d7/pythonium/compliant/builtins/?at=master

Reading JS code generated from pythonium compliant can be understood but source maps will greatly help.

The valuable advice I can give you in the light of this experience are kind old farts:

extensively review the subject both in literature and existing projects closed source or free. When I reviewed the different existing projects I should have given it way more time and motivation.
ask questions! If I knew beforehand that PyPy backend was useless because of the overhead due to C/Javascript semantic mismatch. I would maybe had Pythonium idea way before 6 month ago maybe 3 years ago.
know what you want to do, have a target. For this project I had different objectives: pratice a bit a javascript, learn more of Python and be able to write Python code that would run in the browser (more and that below).
failure is experience
a small step is a step
start small
dream big
do demos
iterate

With Python Veloce mode only, I’m very happy! But along the way I discovered that what I was really looking for was liberating me and others from Javascript but more importantly being able to create in a comfortable way. This lead me to Scheme, DSL, Models and eventually domain specific models (cf. http://dsmforum.org/).

About what Ira Baxter response:

The estimations are not helpful at all. I took me more or less 6 month of free time for both PythonJS and Pythonium. So I can expect more from full time 6 month. I think we all know what 100 man-year in an enterprise context can mean and not mean at all…

When someone says something is hard or more often impossible, I answer that “it only takes time to find a solution for a problem that is impossible” otherwise said nothing is impossible except if it’s proven impossible in this case a math proof…

If it’s not proven impossible then it leaves room for imagination:

finding a proof proving it’s impossible

and

If it is impossible there may be an “inferior” problem that can have a solution.

or

if it’s not impossible, finding a solution

It’s not just optimistic thinking. When I started Python->Javascript everybody was saying it was impossible. PyPy impossible. Metaclasses too hard. etc… I think that the only revolution that brings PyPy over Scheme->C paper (which is 25 years old) is some automatic JIT generation (based hints written in the RPython interpreter I think).

Most people that say that a thing is “hard” or “impossible” don’t provide the reasons. C++ is hard to parse? I know that, still they are (free) C++ parser. Evil is in the detail? I know that. Saying it’s impossible alone is not helpful, It’s even worse than “not helpful” it’s discouraging, and some people mean to discourage others. I heard about this question via https://stackoverflow.com/questions/22621164/how-to-automatically-generate-a-parser-code-to-code-translator-from-a-corpus.

What would be perfection for you? That’s how you define next goal and maybe reach the overall goal.

I am more interested in knowing what kinds of patterns I could enforce on the code to make it easier to translate (ie: IoC, SOA ?) the code than how to do the translation.

I see no patterns that can not be translated from one language to another language at least in a less than perfect way. Since language to language translation is possible, you’d better aim for this first. Since, I think according to http://en.wikipedia.org/wiki/Graph_isomorphism_problem, translation between two computer languages is a tree or DAG isomorphism. Even if we already know that they are both turing complete, so…

Framework->Framework which I better visualize as API->API translation might still be something that you might keep in mind as a way to improve the generated code. E.g: Prolog as very specific syntax but still you can do Prolog like computation by describing the same graph in Python… If I was to implement a Prolog to Python translator I wouldn’t implement unification in Python but in a C library and come up with a “Python syntax” that is very readable for a Pythonist. In the end, syntax is only “painting” for which we give a meaning (that’s why I started scheme). Evil is in the detail of the language and I’m not talking about the syntax. The concepts that are used in the language getattribute hook (you can live without it) but required VM features like tail-recursion optimisation can be difficult to deal with. You don’t care if the initial program doesn’t use tail recursion and even if there is no tail recursion in the target language you can emulate it using greenlets/event loop.

For target and source languages, look for:

Big and specific ideas
Tiny and common shared ideas

From this will emerge:

Things that are easy to translate
Things that are difficult to translate

You will also probably be able to know what will be translated to fast and slow code.

There is also the question of the stdlib or any library but there is no clear answer, it depends of your goals.

Idiomatic code or readable generated code have also solutions…

Targeting a platform like PHP is much more easy than targeting browsers since you can provide C-implementation of slow and/or critical path.

Given you first project is translating Python to PHP, at least for the PHP3 subset I know of, customising veloce.py is your best bet. If you can implement veloce.py for PHP then probably you will be able to run the compliant mode… Also if you can translate PHP to the subset of PHP you can generate with php_veloce.py it means that you can translate PHP to the subset of Python that veloce.py can consume which would mean that you can translate PHP to Javascript. Just saying…

You can also have a look at those libraries:

Also you might be interested by this blog post (and comments): https://www.rfk.id.au/blog/entry/pypy-js-poc-jit/

This Google Tech Talk from Ira Baxter is interesting https://www.youtube.com/watch?v=C-_dw9iEzhA

Question 7

You could take a look at the Vala compiler, which translates Vala (a C#-like language) into C.

Question 8

I want to programmatically edit python source code. Basically I want to read a .py file, generate the AST, and then write back the modified python source code (i.e. another .py file).

There are ways to parse/compile python source code using standard python modules, such as ast or compiler. However, I don’t think any of them support ways to modify the source code (e.g. delete this function declaration) and then write back the modifying python source code.

UPDATE: The reason I want to do this is I’d like to write a Mutation testing library for python, mostly by deleting statements / expressions, rerunning tests and seeing what breaks.

Question 9

Pythoscope does this to the test cases it automatically generates as does the 2to3 tool for python 2.6 (it converts python 2.x source into python 3.x source).

Both these tools uses the lib2to3 library which is a implementation of the python parser/compiler machinery that can preserve comments in source when it’s round tripped from source -> AST -> source.

The rope project may meet your needs if you want to do more refactoring like transforms.

The ast module is your other option, and there’s an older example of how to “unparse” syntax trees back into code (using the parser module). But the ast module is more useful when doing an AST transform on code that is then transformed into a code object.

The redbaron project also may be a good fit (ht Xavier Combelle)

Question 10

The builtin ast module doesn’t seem to have a method to convert back to source. However, the codegen module here provides a pretty printer for the ast that would enable you do do so. eg.

import ast
import codegen

expr="""
def foo():
   print("hello world")
"""
p=ast.parse(expr)

p.body[0].body = [ ast.parse("return 42").body[0] ] # Replace function body with "return 42"

print(codegen.to_source(p))

This will print:

def foo():
    return 42

Note that you may lose the exact formatting and comments, as these are not preserved.

However, you may not need to. If all you require is to execute the replaced AST, you can do so simply by calling compile() on the ast, and execing the resulting code object.

Question 11

In a different answer I suggested using the astor package, but I have since found a more up-to-date AST un-parsing package called astunparse:

>>> import ast
>>> import astunparse
>>> print(astunparse.unparse(ast.parse('def foo(x): return 2 * x')))


def foo(x):
    return (2 * x)

I have tested this on Python 3.5.

Question 12

You might not need to re-generate source code. That’s a bit dangerous for me to say, of course, since you have not actually explained why you think you need to generate a .py file full of code; but:

If you want to generate a .py file that people will actually use, maybe so that they can fill out a form and get a useful .py file to insert into their project, then you don’t want to change it into an AST and back because you’ll lose ~~all formatting (think of the blank lines that make Python so readable by grouping related sets of lines together)~~ (ast nodes have lineno and col_offset attributes) comments. Instead, you’ll probably want to use a templating engine (the Django template language, for example, is designed to make templating even text files easy) to customize the .py file, or else use Rick Copeland’s MetaPython extension.
If you are trying to make a change during compilation of a module, note that you don’t have to go all the way back to text; you can just compile the AST directly instead of turning it back into a .py file.
But in almost any and every case, you are probably trying to do something dynamic that a language like Python actually makes very easy, without writing new .py files! If you expand your question to let us know what you actually want to accomplish, new .py files will probably not be involved in the answer at all; I have seen hundreds of Python projects doing hundreds of real-world things, and not a single one of them needed to ever writer a .py file. So, I must admit, I’m a bit of a skeptic that you’ve found the first good use-case. :-)

Update: now that you’ve explained what you’re trying to do, I’d be tempted to just operate on the AST anyway. You will want to mutate by removing, not lines of a file (which could result in half-statements that simply die with a SyntaxError), but whole statements — and what better place to do that than in the AST?

Question 13

Parsing and modifying the code structure is certainly possible with the help of ast module and I will show it in an example in a moment. However, writing back the modified source code is not possible with ast module alone. There are other modules available for this job such as one here.

NOTE: Example below can be treated as an introductory tutorial on the usage of ast module but a more comprehensive guide on using ast module is available here at Green Tree snakes tutorial and official documentation on ast module.

Introduction to ast:

>>> import ast
>>> tree = ast.parse("print 'Hello Python!!'")
>>> exec(compile(tree, filename="<ast>", mode="exec"))
Hello Python!!

You can parse the python code (represented in string) by simply calling the API ast.parse(). This returns the handle to Abstract Syntax Tree (AST) structure. Interestingly you can compile back this structure and execute it as shown above.

Another very useful API is ast.dump() which dumps the whole AST in a string form. It can be used to inspect the tree structure and is very helpful in debugging. For example,

On Python 2.7:

>>> import ast
>>> tree = ast.parse("print 'Hello Python!!'")
>>> ast.dump(tree)
"Module(body=[Print(dest=None, values=[Str(s='Hello Python!!')], nl=True)])"

On Python 3.5:

>>> import ast
>>> tree = ast.parse("print ('Hello Python!!')")
>>> ast.dump(tree)
"Module(body=[Expr(value=Call(func=Name(id='print', ctx=Load()), args=[Str(s='Hello Python!!')], keywords=[]))])"

Notice the difference in syntax for print statement in Python 2.7 vs. Python 3.5 and the difference in type of AST node in respective trees.

How to modify code using ast:

Now, let’s a have a look at an example of modification of python code by ast module. The main tool for modifying AST structure is ast.NodeTransformer class. Whenever one needs to modify the AST, he/she needs to subclass from it and write Node Transformation(s) accordingly.

For our example, let’s try to write a simple utility which transforms the Python 2 , print statements to Python 3 function calls.

Print statement to Fun call converter utility: print2to3.py:

#!/usr/bin/env python
'''
This utility converts the python (2.7) statements to Python 3 alike function calls before running the code.

USAGE:
     python print2to3.py <filename>
'''
import ast
import sys

class P2to3(ast.NodeTransformer):
    def visit_Print(self, node):
        new_node = ast.Expr(value=ast.Call(func=ast.Name(id='print', ctx=ast.Load()),
            args=node.values,
            keywords=[], starargs=None, kwargs=None))
        ast.copy_location(new_node, node)
        return new_node

def main(filename=None):
    if not filename:
        return

    with open(filename, 'r') as fp:
        data = fp.readlines()
    data = ''.join(data)
    tree = ast.parse(data)

    print "Converting python 2 print statements to Python 3 function calls"
    print "-" * 35
    P2to3().visit(tree)
    ast.fix_missing_locations(tree)
    # print ast.dump(tree)

    exec(compile(tree, filename="p23", mode="exec"))

if __name__ == '__main__':
    if len(sys.argv) <=1:
        print ("\nUSAGE:\n\t print2to3.py <filename>")
        sys.exit(1)
    else:
        main(sys.argv[1])

This utility can be tried on small example file, such as one below, and it should work fine.

Test Input file : py2.py

class A(object):
    def __init__(self):
        pass

def good():
    print "I am good"

main = good

if __name__ == '__main__':
    print "I am in main"
    main()

Please note that above transformation is only for ast tutorial purpose and in real case scenario one will have to look at all different scenarios such as print " x is %s" % ("Hello Python").

Question 14

I’ve created recently quite stable (core is really well tested) and extensible piece of code which generates code from ast tree: https://github.com/paluh/code-formatter .

I’m using my project as a base for a small vim plugin (which I’m using every day), so my goal is to generate really nice and readable python code.

P.S. I’ve tried to extend codegen but it’s architecture is based on ast.NodeVisitor interface, so formatters (visitor_ methods) are just functions. I’ve found this structure quite limiting and hard to optimize (in case of long and nested expressions it’s easier to keep objects tree and cache some partial results – in other way you can hit exponential complexity if you want to search for best layout). BUT codegen as every piece of mitsuhiko’s work (which I’ve read) is very well written and concise.

Question 15

One of the other answers recommends codegen, which seems to have been superceded by astor. The version of astor on PyPI (version 0.5 as of this writing) seems to be a little outdated as well, so you can install the development version of astor as follows.

pip install git+https://github.com/berkerpeksag/astor.git#egg=astor

Then you can use astor.to_source to convert a Python AST to human-readable Python source code:

>>> import ast
>>> import astor
>>> print(astor.to_source(ast.parse('def foo(x): return 2 * x')))
def foo(x):
    return 2 * x

I have tested this on Python 3.5.

Question 16

If you are looking at this in 2019, then you can use this libcst package. It has syntax similar to ast. This works like a charm, and preserve the code structure. It’s basically helpful for the project where you have to preserve comments, whitespace, newline etc.

If you don’t need to care about the preserving comments, whitespace and others, then the combination of ast and astor works well.

Question 17

We had a similar need, which wasn’t solved by other answers here. So we created a library for this, ASTTokens, which takes an AST tree produced with the ast or astroid modules, and marks it with the ranges of text in the original source code.

It doesn’t do modifications of code directly, but that’s not hard to add on top, since it does tell you the range of text you need to modify.

For example, this wraps a function call in WRAP(...), preserving comments and everything else:

example = """
def foo(): # Test
  '''My func'''
  log("hello world")  # Print
"""

import ast, asttokens
atok = asttokens.ASTTokens(example, parse=True)

call = next(n for n in ast.walk(atok.tree) if isinstance(n, ast.Call))
start, end = atok.get_text_range(call)
print(atok.text[:start] + ('WRAP(%s)' % atok.text[start:end])  + atok.text[end:])

Produces:

def foo(): # Test
  '''My func'''
  WRAP(log("hello world"))  # Print

Hope this helps!

Question 18

A Program Transformation System is a tool that parses source text, builds ASTs, allows you to modify them using source-to-source transformations (“if you see this pattern, replace it by that pattern”). Such tools are ideal for doing mutation of existing source codes, which are just “if you see this pattern, replace by a pattern variant”.

Of course, you need a program transformation engine that can parse the language of interest to you, and still do the pattern-directed transformations. Our DMS Software Reengineering Toolkit is a system that can do that, and handles Python, and a variety of other languages.

See this SO answer for an example of a DMS-parsed AST for Python capturing comments accurately. DMS can make changes to the AST, and regenerate valid text, including the comments. You can ask it to prettyprint the AST, using its own formatting conventions (you can changes these), or do “fidelity printing”, which uses the original line and column information to maximally preserve the original layout (some change in layout where new code is inserted is unavoidable).

To implement a “mutation” rule for Python with DMS, you could write the following:

rule mutate_addition(s:sum, p:product):sum->sum =
  " \s + \p " -> " \s - \p"
 if mutate_this_place(s);

This rule replace “+” with “-” in a syntactically correct way; it operates on the AST and thus won’t touch strings or comments that happen to look right. The extra condition on “mutate_this_place” is to let you control how often this occurs; you don’t want to mutate every place in the program.

You’d obviously want a bunch more rules like this that detect various code structures, and replace them by the mutated versions. DMS is happy to apply a set of rules. The mutated AST is then prettyprinted.

Question 19

I used to use baron for this, but have now switched to parso because it’s up to date with modern python. It works great.

I also needed this for a mutation tester. It’s really quite simple to make one with parso, check out my code at https://github.com/boxed/mutmut

Question 20

Suppose I have an if statement with a return. From the efficiency perspective, should I use

if(A > B):
    return A+1
return A-1

or

if(A > B):
    return A+1
else:
    return A-1

Should I prefer one or another when using a compiled language (C) or a scripted one (Python)?

Question 21

Since the return statement terminates the execution of the current function, the two forms are equivalent (although the second one is arguably more readable than the first).

The efficiency of both forms is comparable, the underlying machine code has to perform a jump if the if condition is false anyway.

Note that Python supports a syntax that allows you to use only one return statement in your case:

return A+1 if A > B else A-1

Question 22

From Chromium’s style guide:

Don’t use else after return:

# Bad
if (foo)
  return 1
else
  return 2

# Good
if (foo)
  return 1
return 2

return 1 if foo else 2

Question 23

Regarding coding style:

Most coding standards no matter language ban multiple return statements from a single function as bad practice.

(Although personally I would say there are several cases where multiple return statements do make sense: text/data protocol parsers, functions with extensive error handling etc)

The consensus from all those industry coding standards is that the expression should be written as:

int result;

if(A > B)
{
  result = A+1;
}
else
{
  result = A-1;
}
return result;

Regarding efficiency:

The above example and the two examples in the question are all completely equivalent in terms of efficiency. The machine code in all these cases have to compare A > B, then branch to either the A+1 or the A-1 calculation, then store the result of that in a CPU register or on the stack.

EDIT :

Sources:

MISRA-C:2004 rule 14.7, which in turn cites…:
IEC 61508-3. Part 3, table B.9.
IEC 61508-7. C.2.9.

Question 24

With any sensible compiler, you should observe no difference; they should be compiled to identical machine code as they’re equivalent.

Question 25

This is a question of style (or preference) since the interpreter does not care. Personally I would try not to make the final statement of a function which returns a value at an indent level other than the function base. The else in example 1 obscures, if only slightly, where the end of the function is.

By preference I use:

return A+1 if (A > B) else A-1

As it obeys both the good convention of having a single return statement as the last statement in the function (as already mentioned) and the good functional programming paradigm of avoiding imperative style intermediate results.

For more complex functions I prefer to break the function into multiple sub-functions to avoid premature returns if possible. Otherwise I revert to using an imperative style variable called rval. I try not to use multiple return statements unless the function is trivial or the return statement before the end is as a result of an error. Returning prematurely highlights the fact that you cannot go on. For complex functions that are designed to branch off into multiple subfunctions I try to code them as case statements (driven by a dict for instance).

Some posters have mentioned speed of operation. Speed of Run-time is secondary for me since if you need speed of execution Python is not the best language to use. I use Python as its the efficiency of coding (i.e. writing error free code) that matters to me.

Question 26

I personally avoid else blocks when possible. See the Anti-if Campaign

Also, they don’t charge ‘extra’ for the line, you know :p

“Simple is better than complex” & “Readability is king”

delta = 1 if (A > B) else -1
return A + delta

Question 27

Version A is simpler and that’s why I would use it.

And if you turn on all compiler warnings in Java you will get a warning on the second Version because it is unnecesarry and turns up code complexity.

Question 28

I know the question is tagged python, but it mentions dynamic languages so thought I should mention that in ruby the if statement actually has a return type so you can do something like

def foo
  rv = if (A > B)
         A+1
       else
         A-1
       end
  return rv 
end

Or because it also has implicit return simply

def foo 
  if (A>B)
    A+1
  else 
    A-1
  end
end

which gets around the style issue of not having multiple returns quite nicely.

Question 29

I’m trying to get a better understanding of the difference. I’ve found a lot of explanations online, but they tend towards the abstract differences rather than the practical implications.

Most of my programming experiences has been with CPython (dynamic, interpreted), and Java (static, compiled). However, I understand that there are other kinds of interpreted and compiled languages. Aside from the fact that executable files can be distributed from programs written in compiled languages, are there any advantages/disadvantages to each type? Oftentimes, I hear people arguing that interpreted languages can be used interactively, but I believe that compiled languages can have interactive implementations as well, correct?

Question 30

A compiled language is one where the program, once compiled, is expressed in the instructions of the target machine. For example, an addition “+” operation in your source code could be translated directly to the “ADD” instruction in machine code.

An interpreted language is one where the instructions are not directly executed by the target machine, but instead read and executed by some other program (which normally is written in the language of the native machine). For example, the same “+” operation would be recognised by the interpreter at run time, which would then call its own “add(a,b)” function with the appropriate arguments, which would then execute the machine code “ADD” instruction.

You can do anything that you can do in an interpreted language in a compiled language and vice-versa – they are both Turing complete. Both however have advantages and disadvantages for implementation and use.

I’m going to completely generalise (purists forgive me!) but, roughly, here are the advantages of compiled languages:

Faster performance by directly using the native code of the target machine
Opportunity to apply quite powerful optimisations during the compile stage

And here are the advantages of interpreted languages:

Easier to implement (writing good compilers is very hard!!)
No need to run a compilation stage: can execute code directly “on the fly”
Can be more convenient for dynamic languages

Note that modern techniques such as bytecode compilation add some extra complexity – what happens here is that the compiler targets a “virtual machine” which is not the same as the underlying hardware. These virtual machine instructions can then be compiled again at a later stage to get native code (e.g. as done by the Java JVM JIT compiler).

Question 31

A language itself is neither compiled nor interpreted, only a specific implementation of a language is. Java is a perfect example. There is a bytecode-based platform (the JVM), a native compiler (gcj) and an interpeter for a superset of Java (bsh). So what is Java now? Bytecode-compiled, native-compiled or interpreted?

Other languages, which are compiled as well as interpreted, are Scala, Haskell or Ocaml. Each of these languages has an interactive interpreter, as well as a compiler to byte-code or native machine code.

So generally categorizing languages by “compiled” and “interpreted” doesn’t make much sense.

Question 32

Start thinking in terms of a: blast from the past

Once upon a time, long long ago, there lived in the land of computing interpreters and compilers. All kinds of fuss ensued over the merits of one over the other. The general opinion at that time was something along the lines of:

Interpreter: Fast to develop (edit and run). Slow to execute because each statement had to be interpreted into machine code every time it was executed (think of what this meant for a loop executed thousands of times).
Compiler: Slow to develop (edit, compile, link and run. The compile/link steps could take serious time). Fast to execute. The whole program was already in native machine code.

A one or two order of magnitude difference in the runtime performance existed between an interpreted program and a compiled program. Other distinguishing points, run-time mutability of the code for example, were also of some interest but the major distinction revolved around the run-time performance issues.

Today the landscape has evolved to such an extent that the compiled/interpreted distinction is pretty much irrelevant. Many compiled languages call upon run-time services that are not completely machine code based. Also, most interpreted languages are “compiled” into byte-code before execution. Byte-code interpreters can be very efficient and rival some compiler generated code from an execution speed point of view.

The classic difference is that compilers generated native machine code, interpreters read source code and generated machine code on the fly using some sort of run-time system. Today there are very few classic interpreters left – almost all of them compile into byte-code (or some other semi-compiled state) which then runs on a virtual “machine”.

Question 33

The extreme and simple cases:

A compiler will produce a binary executable in the target machine’s native executable format. This binary file contains all required resources except for system libraries; it’s ready to run with no further preparation and processing and it runs like lightning because the code is the native code for the CPU on the target machine.
An interpreter will present the user with a prompt in a loop where he can enter statements or code, and upon hitting RUN or the equivalent the interpreter will examine, scan, parse and interpretatively execute each line until the program runs to a stopping point or an error. Because each line is treated on its own and the interpreter doesn’t “learn” anything from having seen the line before, the effort of converting human-readable language to machine instructions is incurred every time for every line, so it’s dog slow. On the bright side, the user can inspect and otherwise interact with his program in all kinds of ways: Changing variables, changing code, running in trace or debug modes… whatever.

With those out of the way, let me explain that life ain’t so simple any more. For instance,

Many interpreters will pre-compile the code they’re given so the translation step doesn’t have to be repeated again and again.
Some compilers compile not to CPU-specific machine instructions but to bytecode, a kind of artificial machine code for a ficticious machine. This makes the compiled program a bit more portable, but requires a bytecode interpreter on every target system.
The bytecode interpreters (I’m looking at Java here) recently tend to re-compile the bytecode they get for the CPU of the target section just before execution (called JIT). To save time, this is often only done for code that runs often (hotspots).
Some systems that look and act like interpreters (Clojure, for instance) compile any code they get, immediately, but allow interactive access to the program’s environment. That’s basically the convenience of interpreters with the speed of binary compilation.
Some compilers don’t really compile, they just pre-digest and compress code. I heard a while back that’s how Perl works. So sometimes the compiler is just doing a bit of the work and most of it is still interpretation.

In the end, these days, interpreting vs. compiling is a trade-off, with time spent (once) compiling often being rewarded by better runtime performance, but an interpretative environment giving more opportunities for interaction. Compiling vs. interpreting is mostly a matter of how the work of “understanding” the program is divided up between different processes, and the line is a bit blurry these days as languages and products try to offer the best of both worlds.

Question 34

From http://www.quora.com/What-is-the-difference-between-compiled-and-interpreted-programming-languages

There is no difference, because “compiled programming language” and “interpreted programming language” aren’t meaningful concepts. Any programming language, and I really mean any, can be interpreted or compiled. Thus, interpretation and compilation are implementation techniques, not attributes of languages.

Interpretation is a technique whereby another program, the interpreter, performs operations on behalf of the program being interpreted in order to run it. If you can imagine reading a program and doing what it says to do step-by-step, say on a piece of scratch paper, that’s just what an interpreter does as well. A common reason to interpret a program is that interpreters are relatively easy to write. Another reason is that an interpreter can monitor what a program tries to do as it runs, to enforce a policy, say, for security.

Compilation is a technique whereby a program written in one language (the “source language”) is translated into a program in another language (the “object language”), which hopefully means the same thing as the original program. While doing the translation, it is common for the compiler to also try to transform the program in ways that will make the object program faster (without changing its meaning!). A common reason to compile a program is that there’s some good way to run programs in the object language quickly and without the overhead of interpreting the source language along the way.

You may have guessed, based on the above definitions, that these two implementation techniques are not mutually exclusive, and may even be complementary. Traditionally, the object language of a compiler was machine code or something similar, which refers to any number of programming languages understood by particular computer CPUs. The machine code would then run “on the metal” (though one might see, if one looks closely enough, that the “metal” works a lot like an interpreter). Today, however, it’s very common to use a compiler to generate object code that is meant to be interpreted—for example, this is how Java used to (and sometimes still does) work. There are compilers that translate other languages to JavaScript, which is then often run in a web browser, which might interpret the JavaScript, or compile it a virtual machine or native code. We also have interpreters for machine code, which can be used to emulate one kind of hardware on another. Or, one might use a compiler to generate object code that is then the source code for another compiler, which might even compile code in memory just in time for it to run, which in turn . . . you get the idea. There are many ways to combine these concepts.

Question 35

The biggest advantage of interpreted source code over compiled source code is PORTABILITY.

If your source code is compiled, you need to compile a different executable for each type of processor and/or platform that you want your program to run on (e.g. one for Windows x86, one for Windows x64, one for Linux x64, and so on). Furthermore, unless your code is completely standards compliant and does not use any platform-specific functions/libraries, you will actually need to write and maintain multiple code bases!

If your source code is interpreted, you only need to write it once and it can be interpreted and executed by an appropriate interpreter on any platform! It’s portable! Note that an interpreter itself is an executable program that is written and compiled for a specific platform.

An advantage of compiled code is that it hides the source code from the end user (which might be intellectual property) because instead of deploying the original human-readable source code, you deploy an obscure binary executable file.

Question 36

A compiler and an interpreter do the same job: translating a programming language to another pgoramming language, usually closer to the hardware, often direct executable machine code.

Traditionally, “compiled” means that this translation happens all in one go, is done by a developer, and the resulting executable is distributed to users. Pure example: C++. Compilation usually takes pretty long and tries to do lots of expensive optmization so that the resulting executable runs faster. End users don’t have the tools and knowledge to compile stuff themselves, and the executable often has to run on a variety of hardware, so you can’t do many hardware-specific optimizations. During development, the separate compilation step means a longer feedback cycle.

Traditionally, “interpreted” means that the translation happens “on the fly”, when the user wants to run the program. Pure example: vanilla PHP. A naive interpreter has to parse and translate every piece of code every time it runs, which makes it very slow. It can’t do complex, costly optimizations because they’d take longer than the time saved in execution. But it can fully use the capabilities of the hardware it runs on. The lack of a separrate compilation step reduces feedback time during development.

But nowadays “compiled vs. interpreted” is not a black-or-white issue, there are shades in between. Naive, simple interpreters are pretty much extinct. Many languages use a two-step process where the high-level code is translated to a platform-independant bytecode (which is much faster to interpret). Then you have “just in time compilers” which compile code at most once per program run, sometimes cache results, and even intelligently decide to interpret code that’s run rarely, and do powerful optimizations for code that runs a lot. During development, debuggers are capable of switching code inside a running program even for traditionally compiled languages.

Question 37

First, a clarification, Java is not fully static-compiled and linked in the way C++. It is compiled into bytecode, which is then interpreted by a JVM. The JVM can go and do just-in-time compilation to the native machine language, but doesn’t have to do it.

More to the point: I think interactivity is the main practical difference. Since everything is interpreted, you can take a small excerpt of code, parse and run it against the current state of the environment. Thus, if you had already executed code that initialized a variable, you would have access to that variable, etc. It really lends itself way to things like the functional style.

Interpretation, however, costs a lot, especially when you have a large system with a lot of references and context. By definition, it is wasteful because identical code may have to be interpreted and optimized twice (although most runtimes have some caching and optimizations for that). Still, you pay a runtime cost and often need a runtime environment. You are also less likely to see complex interprocedural optimizations because at present their performance is not sufficiently interactive.

Therefore, for large systems that are not going to change much, and for certain languages, it makes more sense to precompile and prelink everything, do all the optimizations that you can do. This ends up with a very lean runtime that is already optimized for the target machine.

As for generating executbles, that has little to do with it, IMHO. You can often create an executable from a compiled language. But you can also create an executable from an interpreted language, except that the interpreter and runtime is already packaged in the exectuable and hidden from you. This means that you generally still pay the runtime costs (although I am sure that for some language there are ways to translate everything to a tree executable).

I disagree that all languages could be made interactive. Certain languages, like C, are so tied to the machine and the entire link structure that I’m not sure you can build a meaningful fully-fledged interactive version

Question 38

It’s rather difficult to give a practical answer because the difference is about the language definition itself. It’s possible to build an interpreter for every compiled language, but it’s not possible to build an compiler for every interpreted language. It’s very much about the formal definition of a language. So that theoretical informatics stuff noboby likes at university.

Question 39

An interpreted language such as Python is one where the source code is converted to machine code and then executed each time the program runs. This is different from a compiled language such as C, where the source code is only converted to machine code once – the resulting machine code is then executed each time the program runs.

Question 40

Compile is the process of creating an executable program from code written in a compiled programming language. Compiling allows the computer to run and understand the program without the need of the programming software used to create it. When a program is compiled it is often compiled for a specific platform (e.g. IBM platform) that works with IBM compatible computers, but not other platforms (e.g. Apple platform). The first compiler was developed by Grace Hopper while working on the Harvard Mark I computer. Today, most high-level languages will include their own compiler or have toolkits available that can be used to compile the program. A good example of a compiler used with Java is Eclipse and an example of a compiler used with C and C++ is the gcc command. Depending on how big the program is it should take a few seconds or minutes to compile and if no errors are encountered while being compiled an executable file is created.check this information

Question 41

Short (un-precise) definition:

Compiled language: Entire program is translated to machine code at once, then the machine code is run by the CPU.

Interpreted language: Program is read line-by-line and as soon as a line is read the machine instructions for that line are executed by the CPU.

But really, few languages these days are purely compiled or purely interpreted, it often is a mix. For a more detailed description with pictures, see this thread:

What is the difference between compilation and interpretation?

Or my later blog post:

https://orangejuiceliberationfront.com/the-difference-between-compiler-and-interpreter/

问题：我可以对代码执行哪种模式以使其更容易转换为另一种编程语言？[关闭]

编辑：

EDIT:

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

问题：解析一个.py文件，读取AST，对其进行修改，然后写回修改后的源代码

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

问题：使用if-return-return或if-else-return更有效吗？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

问题：编译语言与口译语言

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

有趣好用的Python教程