Python 实用宝典

Question 1

This sort of question has been asked before in varying degrees, but I feel it has not been answered in a concise way and so I ask it again.

I want to run a script in Python. Let’s say it’s this:

if __name__ == '__main__':
    with open(sys.argv[1], 'r') as f:
        s = f.read()
    print s

Which gets a file location, reads it, then prints its contents. Not so complicated.

Okay, so how do I run this in C#?

This is what I have now:

    private void run_cmd(string cmd, string args)
    {
        ProcessStartInfo start = new ProcessStartInfo();
        start.FileName = cmd;
        start.Arguments = args;
        start.UseShellExecute = false;
        start.RedirectStandardOutput = true;
        using (Process process = Process.Start(start))
        {
            using (StreamReader reader = process.StandardOutput)
            {
                string result = reader.ReadToEnd();
                Console.Write(result);
            }
        }
    }

When I pass the code.py location as cmd and the filename location as args it doesn’t work. I was told I should pass python.exe as the cmd, and then code.py filename as the args.

I have been looking for a while now and can only find people suggesting to use IronPython or such. But there must be a way to call a Python script from C#.

Some clarification:

I need to run it from C#, I need to capture the output, and I can’t use IronPython or anything else. Whatever hack you have will be fine.

P.S.: The actual Python code I’m running is much more complex than this, and it returns output which I need in C#, and the C# code will be constantly calling the Python code.

Pretend this is my code:

    private void get_vals()
    {
        for (int i = 0; i < 100; i++)
        {
            run_cmd("code.py", i);
        }
    }

Question 2

The reason it isn’t working is because you have UseShellExecute = false.

If you don’t use the shell, you will have to supply the complete path to the python executable as FileName, and build the Arguments string to supply both your script and the file you want to read.

Also note, that you can’t RedirectStandardOutput unless UseShellExecute = false.

I’m not quite sure how the argument string should be formatted for python, but you will need something like this:

private void run_cmd(string cmd, string args)
{
     ProcessStartInfo start = new ProcessStartInfo();
     start.FileName = "my/full/path/to/python.exe";
     start.Arguments = string.Format("{0} {1}", cmd, args);
     start.UseShellExecute = false;
     start.RedirectStandardOutput = true;
     using(Process process = Process.Start(start))
     {
         using(StreamReader reader = process.StandardOutput)
         {
             string result = reader.ReadToEnd();
             Console.Write(result);
         }
     }
}

Question 3

If you’re willing to use IronPython, you can execute scripts directly in C#:

using IronPython.Hosting;
using Microsoft.Scripting.Hosting;

private static void doPython()
{
    ScriptEngine engine = Python.CreateEngine();
    engine.ExecuteFile(@"test.py");
}

Get IronPython here.

Question 4

Execute Python script from C

Create a C# project and write the following code.

using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
using System.Windows.Forms;
namespace WindowsFormsApplication1
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            run_cmd();
        }

        private void run_cmd()
        {

            string fileName = @"C:\sample_script.py";

            Process p = new Process();
            p.StartInfo = new ProcessStartInfo(@"C:\Python27\python.exe", fileName)
            {
                RedirectStandardOutput = true,
                UseShellExecute = false,
                CreateNoWindow = true
            };
            p.Start();

            string output = p.StandardOutput.ReadToEnd();
            p.WaitForExit();

            Console.WriteLine(output);

            Console.ReadLine();

        }
    }
}

Python sample_script

print "Python C# Test"

You will see the ‘Python C# Test’ in the console of C#.

Question 5

I ran into the same problem and Master Morality’s answer didn’t do it for me. The following, which is based on the previous answer, worked:

private void run_cmd(string cmd, string args)
{
 ProcessStartInfo start = new ProcessStartInfo();
 start.FileName = cmd;//cmd is full path to python.exe
 start.Arguments = args;//args is path to .py file and any cmd line args
 start.UseShellExecute = false;
 start.RedirectStandardOutput = true;
 using(Process process = Process.Start(start))
 {
     using(StreamReader reader = process.StandardOutput)
     {
         string result = reader.ReadToEnd();
         Console.Write(result);
     }
 }
}

As an example, cmd would be @C:/Python26/python.exe and args would be C://Python26//test.py 100 if you wanted to execute test.py with cmd line argument 100. Note that the path the .py file does not have the @ symbol.

Question 6

Just also to draw your attention to this:

https://code.msdn.microsoft.com/windowsdesktop/C-and-Python-interprocess-171378ee

It works great.

Question 7

Set WorkingDirectory or specify the full path of the python script in the Argument

ProcessStartInfo start = new ProcessStartInfo();
start.FileName = "C:\\Python27\\python.exe";
//start.WorkingDirectory = @"D:\script";
start.Arguments = string.Format("D:\\script\\test.py -a {0} -b {1} ", "some param", "some other param");
start.UseShellExecute = false;
start.RedirectStandardOutput = true;
using (Process process = Process.Start(start))
{
    using (StreamReader reader = process.StandardOutput)
    {
        string result = reader.ReadToEnd();
        Console.Write(result);
    }
}

Question 8

Actually its pretty easy to make integration between Csharp (VS) and Python with IronPython. It’s not that much complex… As Chris Dunaway already said in answer section I started to build this inegration for my own project. N its pretty simple. Just follow these steps N you will get your results.

step 1 : Open VS and create new empty ConsoleApp project.

step 2 : Go to tools –> NuGet Package Manager –> Package Manager Console.

step 3 : After this open this link in your browser and copy the NuGet Command. Link: https://www.nuget.org/packages/IronPython/2.7.9

step 4 : After opening the above link copy the PM>Install-Package IronPython -Version 2.7.9 command and paste it in NuGet Console in VS. It will install the supportive packages.

step 5 : This is my code that I have used to run a .py file stored in my Python.exe directory.

using IronPython.Hosting;//for DLHE
using Microsoft.Scripting.Hosting;//provides scripting abilities comparable to batch files
using System;
using System.Diagnostics;
using System.IO;
using System.Net;
using System.Net.Sockets;
class Hi
{
private static void Main(string []args)
{
Process process = new Process(); //to make a process call
ScriptEngine engine = Python.CreateEngine(); //For Engine to initiate the script
engine.ExecuteFile(@"C:\Users\daulmalik\AppData\Local\Programs\Python\Python37\p1.py");//Path of my .py file that I would like to see running in console after running my .cs file from VS.//process.StandardInput.Flush();
process.StandardInput.Close();//to close
process.WaitForExit();//to hold the process i.e. cmd screen as output
}
}

step 6 : save and execute the code

Question 9

I am having problems with stdin/stout – when payload size exceeds several kilobytes it hangs. I need to call Python functions not only with some short arguments, but with a custom payload that could be big.

A while ago, I wrote a virtual actor library that allows to distribute task on different machines via Redis. To call Python code, I added functionality to listen for messages from Python, process them and return results back to .NET. Here is a brief description of how it works.

It works on a single machine as well, but requires a Redis instance. Redis adds some reliability guarantees – payload is stored until a worked acknowledges completion. If a worked dies, the payload is returned to a job queue and then is reprocessed by another worker.

Question 10

The question I’m about to ask seems to be a duplicate of Python’s use of __new__ and __init__?, but regardless, it’s still unclear to me exactly what the practical difference between __new__ and __init__ is.

Before you rush to tell me that __new__ is for creating objects and __init__ is for initializing objects, let me be clear: I get that. In fact, that distinction is quite natural to me, since I have experience in C++ where we have placement new, which similarly separates object allocation from initialization.

The Python C API tutorial explains it like this:

The new member is responsible for creating (as opposed to initializing) objects of the type. It is exposed in Python as the __new__() method. … One reason to implement a new method is to assure the initial values of instance variables.

So, yeah – I get what __new__ does, but despite this, I still don’t understand why it’s useful in Python. The example given says that __new__ might be useful if you want to “assure the initial values of instance variables”. Well, isn’t that exactly what __init__ will do?

In the C API tutorial, an example is shown where a new Type (called a “Noddy”) is created, and the Type’s __new__ function is defined. The Noddy type contains a string member called first, and this string member is initialized to an empty string like so:

static PyObject * Noddy_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
{
    .....

    self->first = PyString_FromString("");
    if (self->first == NULL)
    {
       Py_DECREF(self);
       return NULL;
    }

    .....
}

Note that without the __new__ method defined here, we’d have to use PyType_GenericNew, which simply initializes all of the instance variable members to NULL. So the only benefit of the __new__ method is that the instance variable will start out as an empty string, as opposed to NULL. But why is this ever useful, since if we cared about making sure our instance variables are initialized to some default value, we could have just done that in the __init__ method?

Question 11

The difference mainly arises with mutable vs immutable types.

__new__ accepts a type as the first argument, and (usually) returns a new instance of that type. Thus it is suitable for use with both mutable and immutable types.

__init__ accepts an instance as the first argument and modifies the attributes of that instance. This is inappropriate for an immutable type, as it would allow them to be modified after creation by calling obj.__init__(*args).

Compare the behaviour of tuple and list:

>>> x = (1, 2)
>>> x
(1, 2)
>>> x.__init__([3, 4])
>>> x # tuple.__init__ does nothing
(1, 2)
>>> y = [1, 2]
>>> y
[1, 2]
>>> y.__init__([3, 4])
>>> y # list.__init__ reinitialises the object
[3, 4]

As to why they’re separate (aside from simple historical reasons): __new__ methods require a bunch of boilerplate to get right (the initial object creation, and then remembering to return the object at the end). __init__ methods, by contrast, are dead simple, since you just set whatever attributes you need to set.

Aside from __init__ methods being easier to write, and the mutable vs immutable distinction noted above, the separation can also be exploited to make calling the parent class __init__ in subclasses optional by setting up any absolutely required instance invariants in __new__. This is generally a dubious practice though – it’s usually clearer to just call the parent class __init__ methods as necessary.

Question 12

There are probably other uses for __new__ but there’s one really obvious one: You can’t subclass an immutable type without using __new__. So for example, say you wanted to create a subclass of tuple that can contain only integral values between 0 and size.

class ModularTuple(tuple):
    def __new__(cls, tup, size=100):
        tup = (int(x) % size for x in tup)
        return super(ModularTuple, cls).__new__(cls, tup)

You simply can’t do this with __init__ — if you tried to modify self in __init__, the interpreter would complain that you’re trying to modify an immutable object.

Question 13

__new__() can return objects of types other than the class it’s bound to. __init__() only initializes an existing instance of the class.

>>> class C(object):
...   def __new__(cls):
...     return 5
...
>>> c = C()
>>> print type(c)
<type 'int'>
>>> print c
5

Question 14

Not a complete answer but perhaps something that illustrates the difference.

__new__ will always get called when an object has to be created. There are some situations where __init__ will not get called. One example is when you unpickle objects from a pickle file, they will get allocated (__new__) but not initialised (__init__).

Question 15

Just want to add a word about the intent (as opposed to the behavior) of defining __new__ versus __init__.

I came across this question (among others) when I was trying to understand the best way to define a class factory. I realized that one of the ways in which __new__ is conceptually different from __init__ is the fact that the benefit of __new__ is exactly what was stated in the question:

So the only benefit of the __new__ method is that the instance variable will start out as an empty string, as opposed to NULL. But why is this ever useful, since if we cared about making sure our instance variables are initialized to some default value, we could have just done that in the __init__ method?

Considering the stated scenario, we care about the initial values of the instance variables when the instance is in reality a class itself. So, if we are dynamically creating a class object at runtime and we need to define/control something special about the subsequent instances of this class being created, we would define these conditions/properties in a __new__ method of a metaclass.

I was confused about this until I actually thought about the application of the concept rather than just the meaning of it. Here’s an example that would hopefully make the difference clear:

a = Shape(sides=3, base=2, height=12)
b = Shape(sides=4, length=2)
print(a.area())
print(b.area())

# I want `a` and `b` to be an instances of either of 'Square' or 'Triangle'
# depending on number of sides and also the `.area()` method to do the right
# thing. How do I do that without creating a Shape class with all the
# methods having a bunch of `if`s ? Here is one possibility

class Shape:
    def __new__(cls, sides, *args, **kwargs):
        if sides == 3:
            return Triangle(*args, **kwargs)
        else:
            return Square(*args, **kwargs)

class Triangle:
    def __init__(self, base, height):
        self.base = base
        self.height = height

    def area(self):
        return (self.base * self.height) / 2

class Square:
    def __init__(self, length):
        self.length = length

    def area(self):
        return self.length*self.length

Note this is just an demonstartive example. There are multiple ways to get a solution without resorting to a class factory approach like above and even if we do choose to implelent the solution in this manner, there are a little caveats left out for sake of brevity (for instance, declaring the metaclass explicitly)

If you are creating a regular class (a.k.a a non-metaclass), then __new__ doesn’t really make sense unless it is special case like the mutable versus immutable scenario in ncoghlan’s answer answer (which is essentially a more specific example of the concept of defining the initial values/properties of the class/type being created via __new__ to be then initialized via __init__).

Question 16

I’ve got some example Python code that I need to mimic in C++. I do not require any specific solution (such as co-routine based yield solutions, although they would be acceptable answers as well), I simply need to reproduce the semantics in some manner.

Python

This is a basic sequence generator, clearly too large to store a materialized version.

def pair_sequence():
    for i in range(2**32):
        for j in range(2**32):
            yield (i, j)

The goal is to maintain two instances of the sequence above, and iterate over them in semi-lockstep, but in chunks. In the example below the first_pass uses the sequence of pairs to initialize the buffer, and the second_pass regenerates the same exact sequence and processes the buffer again.

def run():
    seq1 = pair_sequence()
    seq2 = pair_sequence()

    buffer = [0] * 1000
    first_pass(seq1, buffer)
    second_pass(seq2, buffer)
    ... repeat ...

C++

The only thing I can find for a solution in C++ is to mimic yield with C++ coroutines, but I haven’t found any good reference on how to do this. I’m also interested in alternative (non general) solutions for this problem. I do not have enough memory budget to keep a copy of the sequence between passes.

Question 17

Generators exist in C++, just under another name: Input Iterators. For example, reading from std::cin is similar to having a generator of char.

You simply need to understand what a generator does:

there is a blob of data: the local variables define a state
there is an init method
there is a “next” method
there is a way to signal termination

In your trivial example, it’s easy enough. Conceptually:

struct State { unsigned i, j; };

State make();

void next(State&);

bool isDone(State const&);

Of course, we wrap this as a proper class:

class PairSequence:
    // (implicit aliases)
    public std::iterator<
        std::input_iterator_tag,
        std::pair<unsigned, unsigned>
    >
{
  // C++03
  typedef void (PairSequence::*BoolLike)();
  void non_comparable();
public:
  // C++11 (explicit aliases)
  using iterator_category = std::input_iterator_tag;
  using value_type = std::pair<unsigned, unsigned>;
  using reference = value_type const&;
  using pointer = value_type const*;
  using difference_type = ptrdiff_t;

  // C++03 (explicit aliases)
  typedef std::input_iterator_tag iterator_category;
  typedef std::pair<unsigned, unsigned> value_type;
  typedef value_type const& reference;
  typedef value_type const* pointer;
  typedef ptrdiff_t difference_type;

  PairSequence(): done(false) {}

  // C++11
  explicit operator bool() const { return !done; }

  // C++03
  // Safe Bool idiom
  operator BoolLike() const {
    return done ? 0 : &PairSequence::non_comparable;
  }

  reference operator*() const { return ij; }
  pointer operator->() const { return &ij; }

  PairSequence& operator++() {
    static unsigned const Max = std::numeric_limts<unsigned>::max();

    assert(!done);

    if (ij.second != Max) { ++ij.second; return *this; }
    if (ij.first != Max) { ij.second = 0; ++ij.first; return *this; }

    done = true;
    return *this;
  }

  PairSequence operator++(int) {
    PairSequence const tmp(*this);
    ++*this;
    return tmp;
  }

private:
  bool done;
  value_type ij;
};

So hum yeah… might be that C++ is a tad more verbose :)

Question 18

In C++ there are iterators, but implementing an iterator isn’t straightforward: one has to consult the iterator concepts and carefully design the new iterator class to implement them. Thankfully, Boost has an iterator_facade template which should help implementing the iterators and iterator-compatible generators.

Sometimes a stackless coroutine can be used to implement an iterator.

P.S. See also this article which mentions both a switch hack by Christopher M. Kohlhoff and Boost.Coroutine by Oliver Kowalke. Oliver Kowalke’s work is a followup on Boost.Coroutine by Giovanni P. Deretta.

P.S. I think you can also write a kind of generator with lambdas:

std::function<int()> generator = []{
  int i = 0;
  return [=]() mutable {
    return i < 10 ? i++ : -1;
  };
}();
int ret = 0; while ((ret = generator()) != -1) std::cout << "generator: " << ret << std::endl;

Or with a functor:

struct generator_t {
  int i = 0;
  int operator() () {
    return i < 10 ? i++ : -1;
  }
} generator;
int ret = 0; while ((ret = generator()) != -1) std::cout << "generator: " << ret << std::endl;

P.S. Here’s a generator implemented with the Mordor coroutines:

#include <iostream>
using std::cout; using std::endl;
#include <mordor/coroutine.h>
using Mordor::Coroutine; using Mordor::Fiber;

void testMordor() {
  Coroutine<int> coro ([](Coroutine<int>& self) {
    int i = 0; while (i < 9) self.yield (i++);
  });
  for (int i = coro.call(); coro.state() != Fiber::TERM; i = coro.call()) cout << i << endl;
}

Question 19

Since Boost.Coroutine2 now supports it very well (I found it because I wanted to solve exactly the same yield problem), I am posting the C++ code that matches your original intention:

#include <stdint.h>
#include <iostream>
#include <memory>
#include <boost/coroutine2/all.hpp>

typedef boost::coroutines2::coroutine<std::pair<uint16_t, uint16_t>> coro_t;

void pair_sequence(coro_t::push_type& yield)
{
    uint16_t i = 0;
    uint16_t j = 0;
    for (;;) {
        for (;;) {
            yield(std::make_pair(i, j));
            if (++j == 0)
                break;
        }
        if (++i == 0)
            break;
    }
}

int main()
{
    coro_t::pull_type seq(boost::coroutines2::fixedsize_stack(),
                          pair_sequence);
    for (auto pair : seq) {
        print_pair(pair);
    }
    //while (seq) {
    //    print_pair(seq.get());
    //    seq();
    //}
}

In this example, pair_sequence does not take additional arguments. If it needs to, std::bind or a lambda should be used to generate a function object that takes only one argument (of push_type), when it is passed to the coro_t::pull_type constructor.

Question 20

All answers that involve writing your own iterator are completely wrong. Such answers entirely miss the point of Python generators (one of the language’s greatest and unique features). The most important thing about generators is that execution picks up where it left off. This does not happen to iterators. Instead, you must manually store state information such that when operator++ or operator* is called anew, the right information is in place at the very beginning of the next function call. This is why writing your own C++ iterator is a gigantic pain; whereas, generators are elegant, and easy to read+write.

I don’t think there is a good analog for Python generators in native C++, at least not yet (there is a rummor that yield will land in C++17). You can get something similarish by resorting to third-party (e.g. Yongwei’s Boost suggestion), or rolling your own.

I would say the closest thing in native C++ is threads. A thread can maintain a suspended set of local variables, and can continue execution where it left off, very much like generators, but you need to roll a little bit of additional infrastructure to support communication between the generator object and its caller. E.g.

// Infrastructure

template <typename Element>
class Channel { ... };

// Application

using IntPair = std::pair<int, int>;

void yield_pairs(int end_i, int end_j, Channel<IntPair>* out) {
  for (int i = 0; i < end_i; ++i) {
    for (int j = 0; j < end_j; ++j) {
      out->send(IntPair{i, j});  // "yield"
    }
  }
  out->close();
}

void MyApp() {
  Channel<IntPair> pairs;
  std::thread generator(yield_pairs, 32, 32, &pairs);
  for (IntPair pair : pairs) {
    UsePair(pair);
  }
  generator.join();
}

This solution has several downsides though:

Threads are “expensive”. Most people would consider this to be an “extravagant” use of threads, especially when your generator is so simple.
There are a couple of clean up actions that you need to remember. These could be automated, but you’d need even more infrastructure, which again, is likely to be seen as “too extravagant”. Anyway, the clean ups that you need are:
1. out->close()
2. generator.join()
This does not allow you to stop generator. You could make some modifications to add that ability, but it adds clutter to the code. It would never be as clean as Python’s yield statement.
In addition to 2, there are other bits of boilerplate that are needed each time you want to “instantiate” a generator object:
1. Channel* out parameter
2. Additional variables in main: pairs, generator

Question 21

You should probably check generators in std::experimental in Visual Studio 2015 e.g: https://blogs.msdn.microsoft.com/vcblog/2014/11/12/resumable-functions-in-c/

I think it’s exactly what you are looking for. Overall generators should be available in C++17 as this is only experimental Microsoft VC feature.

Question 22

If you only need to do this for a relatively small number of specific generators, you can implement each as a class, where the member data is equivalent to the local variables of the Python generator function. Then you have a next function that returns the next thing the generator would yield, updating the internal state as it does so.

This is basically similar to how Python generators are implemented, I believe. The major difference being they can remember an offset into the bytecode for the generator function as part of the “internal state”, which means the generators can be written as loops containing yields. You would have to instead calculate the next value from the previous. In the case of your pair_sequence, that’s pretty trivial. It may not be for complex generators.

You also need some way of indicating termination. If what you’re returning is “pointer-like”, and NULL should not be a valid yieldable value you could use a NULL pointer as a termination indicator. Otherwise you need an out-of-band signal.

Question 23

Something like this is very similar:

struct pair_sequence
{
    typedef pair<unsigned int, unsigned int> result_type;
    static const unsigned int limit = numeric_limits<unsigned int>::max()

    pair_sequence() : i(0), j(0) {}

    result_type operator()()
    {
        result_type r(i, j);
        if(j < limit) j++;
        else if(i < limit)
        {
          j = 0;
          i++;
        }
        else throw out_of_range("end of iteration");
    }

    private:
        unsigned int i;
        unsigned int j;
}

Using the operator() is only a question of what you want to do with this generator, you could also build it as a stream and make sure it adapts to an istream_iterator, for example.

Question 24

Using range-v3:

#include <iostream>
#include <tuple>
#include <range/v3/all.hpp>

using namespace std;
using namespace ranges;

auto generator = [x = view::iota(0) | view::take(3)] {
    return view::cartesian_product(x, x);
};

int main () {
    for (auto x : generator()) {
        cout << get<0>(x) << ", " << get<1>(x) << endl;
    }

    return 0;
}

Question 25

Something like this:

Example use:

using ull = unsigned long long;

auto main() -> int {
    for (ull val : range_t<ull>(100)) {
        std::cout << val << std::endl;
    }

    return 0;
}

Will print the numbers from 0 to 99

Question 26

Well, today I also was looking for easy collection implementation under C++11. Actually I was disappointed, because everything I found is too far from things like python generators, or C# yield operator… or too complicated.

The purpose is to make collection which will emit its items only when it is required.

I wanted it to be like this:

auto emitter = on_range<int>(a, b).yield(
    [](int i) {
         /* do something with i */
         return i * 2;
    });

I found this post, IMHO best answer was about boost.coroutine2, by Yongwei Wu. Since it is the nearest to what author wanted.

It is worth learning boost couroutines.. And I’ll perhaps do on weekends. But so far I’m using my very small implementation. Hope it helps to someone else.

Below is example of use, and then implementation.

Example.cpp

#include <iostream>
#include "Generator.h"
int main() {
    typedef std::pair<int, int> res_t;

    auto emitter = Generator<res_t, int>::on_range(0, 3)
        .yield([](int i) {
            return std::make_pair(i, i * i);
        });

    for (auto kv : emitter) {
        std::cout << kv.first << "^2 = " << kv.second << std::endl;
    }

    return 0;
}

Generator.h

template<typename ResTy, typename IndexTy>
struct yield_function{
    typedef std::function<ResTy(IndexTy)> type;
};

template<typename ResTy, typename IndexTy>
class YieldConstIterator {
public:
    typedef IndexTy index_t;
    typedef ResTy res_t;
    typedef typename yield_function<res_t, index_t>::type yield_function_t;

    typedef YieldConstIterator<ResTy, IndexTy> mytype_t;
    typedef ResTy value_type;

    YieldConstIterator(index_t index, yield_function_t yieldFunction) :
            mIndex(index),
            mYieldFunction(yieldFunction) {}

    mytype_t &operator++() {
        ++mIndex;
        return *this;
    }

    const value_type operator*() const {
        return mYieldFunction(mIndex);
    }

    bool operator!=(const mytype_t &r) const {
        return mIndex != r.mIndex;
    }

protected:

    index_t mIndex;
    yield_function_t mYieldFunction;
};

template<typename ResTy, typename IndexTy>
class YieldIterator : public YieldConstIterator<ResTy, IndexTy> {
public:

    typedef YieldConstIterator<ResTy, IndexTy> parent_t;

    typedef IndexTy index_t;
    typedef ResTy res_t;
    typedef typename yield_function<res_t, index_t>::type yield_function_t;
    typedef ResTy value_type;

    YieldIterator(index_t index, yield_function_t yieldFunction) :
            parent_t(index, yieldFunction) {}

    value_type operator*() {
        return parent_t::mYieldFunction(parent_t::mIndex);
    }
};

template<typename IndexTy>
struct Range {
public:
    typedef IndexTy index_t;
    typedef Range<IndexTy> mytype_t;

    index_t begin;
    index_t end;
};

template<typename ResTy, typename IndexTy>
class GeneratorCollection {
public:

    typedef Range<IndexTy> range_t;

    typedef IndexTy index_t;
    typedef ResTy res_t;
    typedef typename yield_function<res_t, index_t>::type yield_function_t;
    typedef YieldIterator<ResTy, IndexTy> iterator;
    typedef YieldConstIterator<ResTy, IndexTy> const_iterator;

    GeneratorCollection(range_t range, const yield_function_t &yieldF) :
            mRange(range),
            mYieldFunction(yieldF) {}

    iterator begin() {
        return iterator(mRange.begin, mYieldFunction);
    }

    iterator end() {
        return iterator(mRange.end, mYieldFunction);
    }

    const_iterator begin() const {
        return const_iterator(mRange.begin, mYieldFunction);
    }

    const_iterator end() const {
        return const_iterator(mRange.end, mYieldFunction);
    }

private:
    range_t mRange;
    yield_function_t mYieldFunction;
};

template<typename ResTy, typename IndexTy>
class Generator {
public:
    typedef IndexTy index_t;
    typedef ResTy res_t;
    typedef typename yield_function<res_t, index_t>::type yield_function_t;

    typedef Generator<ResTy, IndexTy> mytype_t;
    typedef Range<IndexTy> parent_t;
    typedef GeneratorCollection<ResTy, IndexTy> finalized_emitter_t;
    typedef  Range<IndexTy> range_t;

protected:
    Generator(range_t range) : mRange(range) {}
public:
    static mytype_t on_range(index_t begin, index_t end) {
        return mytype_t({ begin, end });
    }

    finalized_emitter_t yield(yield_function_t f) {
        return finalized_emitter_t(mRange, f);
    }
protected:

    range_t mRange;
};

Question 27

This answer works in C (and hence i think works in c++ too)

#include <stdio.h>

const uint64_t MAX = 1ll<<32;

typedef struct {
    uint64_t i, j;
} Pair;

Pair* generate_pairs()
{
    static uint64_t i = 0;
    static uint64_t j = 0;
    
    Pair p = {i,j};
    if(j++ < MAX)
    {
        return &p;
    }
        else if(++i < MAX)
    {
        p.i++;
        p.j = 0;
        j = 0;
        return &p;
    }
    else
    {
        return NULL;
    }
}

int main()
{
    while(1)
    {
        Pair *p = generate_pairs();
        if(p != NULL)
        {
            //printf("%d,%d\n",p->i,p->j);
        }
        else
        {
            //printf("end");
            break;
        }
    }
    return 0;
}

This is simple, non object-oriented way to mimic a generator. This worked as expected for me.

Question 28

Just as a function simulates the concept of a stack, generators simulate the concept of a queue. The rest is semantics.

As a side note, you can always simulate a queue with a stack by using a stack of operations instead of data. What that practically means is that you can implement a queue-like behavior by returning a pair, the second value of which either has the next function to be called or indicates that we are out of values. But this is more general than what yield vs return does. It allows to simulate a queue of any values rather than homogeneous values that you expect from a generator, but without keeping a full internal queue.

More specifically, since C++ does not have a natural abstraction for a queue, you need to use constructs which implement a queue internally. So the answer which gave the example with iterators is a decent implementation of the concept.

What this practically means is that you can implement something with bare-bones queue functionality if you just want something quick and then consume queue’s values just as you would consume values yielded from a generator.

Question 29

I’m scripting the checkout, build, distribution, test, and commit cycle for a large C++ solution that is using Monotone, CMake, Visual Studio Express 2008, and custom tests.

All of the other parts seem pretty straight-forward, but I don’t see how to compile the Visual Studio solution without getting the GUI.

The script is written in Python, but an answer that would allow me to just make a call to: os.system would do.

Question 30

I know of two ways to do it.

Method 1
The first method (which I prefer) is to use msbuild:

msbuild project.sln /Flags...

Method 2
You can also run:

vcexpress project.sln /build /Flags...

The vcexpress option returns immediately and does not print any output. I suppose that might be what you want for a script.

Note that DevEnv is not distributed with Visual Studio Express 2008 (I spent a lot of time trying to figure that out when I first had a similar issue).

So, the end result might be:

os.system("msbuild project.sln /p:Configuration=Debug")

You’ll also want to make sure your environment variables are correct, as msbuild and vcexpress are not by default on the system path. Either start the Visual Studio build environment and run your script from there, or modify the paths in Python (with os.putenv).

Question 31

MSBuild usually works, but I’ve run into difficulties before. You may have better luck with

devenv YourSolution.sln /Build

Question 32

To be honest I have to add my 2 cents.

You can do it with msbuild.exe. There are many version of the msbuild.exe.

C:\Windows\Microsoft.NET\Framework64\v2.0.50727\msbuild.exe C:\Windows\Microsoft.NET\Framework64\v3.5\msbuild.exe C:\Windows\Microsoft.NET\Framework64\v4.0.30319\msbuild.exe
C:\Windows\Microsoft.NET\Framework\v2.0.50727\msbuild.exe C:\Windows\Microsoft.NET\Framework\v3.5\msbuild.exe C:\Windows\Microsoft.NET\Framework\v4.0.30319\msbuild.exe

Use version you need. Basically you have to use the last one.

C:\Windows\Microsoft.NET\Framework64\v4.0.30319\msbuild.exe

So how to do it.

Run the COMMAND window
Input the path to msbuild.exe

C:\Windows\Microsoft.NET\Framework64\v4.0.30319\msbuild.exe

Input the path to the project solution like

“C:\Users\Clark.Kent\Documents\visual studio 2012\Projects\WpfApplication1\WpfApplication1.sln”

Add any flags you need after the solution path.
Press ENTER

Note you can get help about all possible flags like

C:\Windows\Microsoft.NET\Framework64\v4.0.30319\msbuild.exe /help

Question 33

Using msbuild as pointed out by others worked for me but I needed to do a bit more than just that. First of all, msbuild needs to have access to the compiler. This can be done by running:

"C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\vcvarsall.bat"

Then msbuild was not in my $PATH so I had to run it via its explicit path:

"C:\Windows\Microsoft.NET\Framework64\v4.0.30319\MSBuild.exe" myproj.sln

Lastly, my project was making use of some variables like $(VisualStudioDir). It seems those do not get set by msbuild so I had to set them manually via the /property option:

"C:\Windows\Microsoft.NET\Framework64\v4.0.30319\MSBuild.exe" /property:VisualStudioDir="C:\Users\Administrator\Documents\Visual Studio 2013" myproj.sln

That line then finally allowed me to compile my project.

Bonus: it seems that the command line tools do not require a registration after 30 days of using them like the “free” GUI-based Visual Studio Community edition does. With the Microsoft registration requirement in place, that version is hardly free. Free-as-in-facebook if anything…

Question 34

MSBuild is your friend.

msbuild "C:\path to solution\project.sln"

Question 35

DEVENV works well in many cases, but on a WIXPROJ to build my WIX installer, all I got is “CATASTROPHIC” error in the Out log.

This works: MSBUILD /Path/PROJECT.WIXPROJ /t:Build /p:Configuration=Release

Question 36

I would like to write a program that makes extensive use of BLAS and LAPACK linear algebra functionalities. Since performance is an issue I did some benchmarking and would like know, if the approach I took is legitimate.

I have, so to speak, three contestants and want to test their performance with a simple matrix-matrix multiplication. The contestants are:

Numpy, making use only of the functionality of dot.
Python, calling the BLAS functionalities through a shared object.
C++, calling the BLAS functionalities through a shared object.

Scenario

I implemented a matrix-matrix multiplication for different dimensions i. i runs from 5 to 500 with an increment of 5 and the matricies m1 and m2 are set up like this:

m1 = numpy.random.rand(i,i).astype(numpy.float32)
m2 = numpy.random.rand(i,i).astype(numpy.float32)

1. Numpy

The code used looks like this:

tNumpy = timeit.Timer("numpy.dot(m1, m2)", "import numpy; from __main__ import m1, m2")
rNumpy.append((i, tNumpy.repeat(20, 1)))

2. Python, calling BLAS through a shared object

With the function

_blaslib = ctypes.cdll.LoadLibrary("libblas.so")
def Mul(m1, m2, i, r):

    no_trans = c_char("n")
    n = c_int(i)
    one = c_float(1.0)
    zero = c_float(0.0)

    _blaslib.sgemm_(byref(no_trans), byref(no_trans), byref(n), byref(n), byref(n), 
            byref(one), m1.ctypes.data_as(ctypes.c_void_p), byref(n), 
            m2.ctypes.data_as(ctypes.c_void_p), byref(n), byref(zero), 
            r.ctypes.data_as(ctypes.c_void_p), byref(n))

the test code looks like this:

r = numpy.zeros((i,i), numpy.float32)
tBlas = timeit.Timer("Mul(m1, m2, i, r)", "import numpy; from __main__ import i, m1, m2, r, Mul")
rBlas.append((i, tBlas.repeat(20, 1)))

3. c++, calling BLAS through a shared object

Now the c++ code naturally is a little longer so I reduce the information to a minimum.
I load the function with

void* handle = dlopen("libblas.so", RTLD_LAZY);
void* Func = dlsym(handle, "sgemm_");

I measure the time with gettimeofday like this:

gettimeofday(&start, NULL);
f(&no_trans, &no_trans, &dim, &dim, &dim, &one, A, &dim, B, &dim, &zero, Return, &dim);
gettimeofday(&end, NULL);
dTimes[j] = CalcTime(start, end);

where j is a loop running 20 times. I calculate the time passed with

double CalcTime(timeval start, timeval end)
{
double factor = 1000000;
return (((double)end.tv_sec) * factor + ((double)end.tv_usec) - (((double)start.tv_sec) * factor + ((double)start.tv_usec))) / factor;
}

Results

The result is shown in the plot below:

Questions

Do you think my approach is fair, or are there some unnecessary overheads I can avoid?
Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.
Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?

Download

The complete benchmark can be downloaded here. (J.F. Sebastian made that link possible^^)

Question 37

I’ve run your benchmark. There is no difference between C++ and numpy on my machine:

Do you think my approach is fair, or are there some unnecessary overheads I can avoid?

It seems fair due to there is no difference in results.

Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.

No.

Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?

Make sure that numpy uses optimized version of BLAS/LAPACK libraries on your system.

Question 38

UPDATE (30.07.2014):

I re-run the the benchmark on our new HPC. Both the hardware as well as the software stack changed from the setup in the original answer.

I put the results in a google spreadsheet (contains also the results from the original answer).

Hardware

Our HPC has two different nodes one with Intel Sandy Bridge CPUs and one with the newer Ivy Bridge CPUs:

Sandy (MKL, OpenBLAS, ATLAS):

CPU: 2 x 16 Intel(R) Xeon(R) E2560 Sandy Bridge @ 2.00GHz (16 Cores)
RAM: 64 GB

Ivy (MKL, OpenBLAS, ATLAS):

CPU: 2 x 20 Intel(R) Xeon(R) E2680 V2 Ivy Bridge @ 2.80GHz (20 Cores, with HT = 40 Cores)
RAM: 256 GB

Software

The software stack is for both nodes the sam. Instead of GotoBLAS2, OpenBLAS is used and there is also a multi-threaded ATLAS BLAS that is set to 8 threads (hardcoded).

OS: Suse
Intel Compiler: ictce-5.3.0
Numpy: 1.8.0
OpenBLAS: 0.2.6
ATLAS:: 3.8.4

Dot-Product Benchmark

Benchmark-code is the same as below. However for the new machines I also ran the benchmark for matrix sizes 5000 and 8000.
The table below includes the benchmark results from the original answer (renamed: MKL –> Nehalem MKL, Netlib Blas –> Nehalem Netlib BLAS, etc)

Single threaded performance:

Multi threaded performance (8 threads):

Threads vs Matrix size (Ivy Bridge MKL):

Benchmark Suite

Single threaded performance:

Multi threaded (8 threads) performance:

Conclusion

The new benchmark results are similar to the ones in the original answer. OpenBLAS and MKL perform on the same level, with the exception of Eigenvalue test. The Eigenvalue test performs only reasonably well on OpenBLAS in single threaded mode. In multi-threaded mode the performance is worse.

The “Matrix size vs threads chart” also show that although MKL as well as OpenBLAS generally scale well with number of cores/threads,it depends on the size of the matrix. For small matrices adding more cores won’t improve performance very much.

There is also approximately 30% performance increase from Sandy Bridge to Ivy Bridge which might be either due to higher clock rate (+ 0.8 Ghz) and/or better architecture.

Original Answer (04.10.2011):

Some time ago I had to optimize some linear algebra calculations/algorithms which were written in python using numpy and BLAS so I benchmarked/tested different numpy/BLAS configurations.

Specifically I tested:

Numpy with ATLAS
Numpy with GotoBlas2 (1.13)
Numpy with MKL (11.1/073)
Numpy with Accelerate Framework (Mac OS X)

I did run two different benchmarks:

simple dot product of matrices with different sizes
Benchmark suite which can be found here.

Here are my results:

Machines

Linux (MKL, ATLAS, No-MKL, GotoBlas2):

OS: Ubuntu Lucid 10.4 64 Bit.
CPU: 2 x 4 Intel(R) Xeon(R) E5504 @ 2.00GHz (8 Cores)
RAM: 24 GB
Intel Compiler: 11.1/073
Scipy: 0.8
Numpy: 1.5

Mac Book Pro (Accelerate Framework):

OS: Mac OS X Snow Leopard (10.6)
CPU: 1 Intel Core 2 Duo 2.93 Ghz (2 Cores)
RAM: 4 GB
Scipy: 0.7
Numpy: 1.3

Mac Server (Accelerate Framework):

OS: Mac OS X Snow Leopard Server (10.6)
CPU: 4 X Intel(R) Xeon(R) E5520 @ 2.26 Ghz (8 Cores)
RAM: 4 GB
Scipy: 0.8
Numpy: 1.5.1

Dot product benchmark

Code:

import numpy as np
a = np.random.random_sample((size,size))
b = np.random.random_sample((size,size))
%timeit np.dot(a,b)

Results:

    System        |  size = 1000  | size = 2000 | size = 3000 |
netlib BLAS       |  1350 ms      |   10900 ms  |  39200 ms   |    
ATLAS (1 CPU)     |   314 ms      |    2560 ms  |   8700 ms   |     
MKL (1 CPUs)      |   268 ms      |    2110 ms  |   7120 ms   |
MKL (2 CPUs)      |    -          |       -     |   3660 ms   |
MKL (8 CPUs)      |    39 ms      |     319 ms  |   1000 ms   |
GotoBlas2 (1 CPU) |   266 ms      |    2100 ms  |   7280 ms   |
GotoBlas2 (2 CPUs)|   139 ms      |    1009 ms  |   3690 ms   |
GotoBlas2 (8 CPUs)|    54 ms      |     389 ms  |   1250 ms   |
Mac OS X (1 CPU)  |   143 ms      |    1060 ms  |   3605 ms   |
Mac Server (1 CPU)|    92 ms      |     714 ms  |   2130 ms   |

Benchmark Suite

Code:
For additional information about the benchmark suite see here.

Results:

    System        | eigenvalues   |    svd   |   det  |   inv   |   dot   |
netlib BLAS       |  1688 ms      | 13102 ms | 438 ms | 2155 ms | 3522 ms |
ATLAS (1 CPU)     |   1210 ms     |  5897 ms | 170 ms |  560 ms |  893 ms |
MKL (1 CPUs)      |   691 ms      |  4475 ms | 141 ms |  450 ms |  736 ms |
MKL (2 CPUs)      |   552 ms      |  2718 ms |  96 ms |  267 ms |  423 ms |
MKL (8 CPUs)      |   525 ms      |  1679 ms |  60 ms |  137 ms |  197 ms |  
GotoBlas2 (1 CPU) |  2124 ms      |  4636 ms | 147 ms |  456 ms |  743 ms |
GotoBlas2 (2 CPUs)|  1560 ms      |  3278 ms | 116 ms |  295 ms |  460 ms |
GotoBlas2 (8 CPUs)|   741 ms      |  2914 ms |  82 ms |  262 ms |  192 ms |
Mac OS X (1 CPU)  |   948 ms      |  4339 ms | 151 ms |  318 ms |  566 ms |
Mac Server (1 CPU)|  1033 ms      |  3645 ms |  99 ms |  232 ms |  342 ms |

Installation

Installation of MKL included installing the complete Intel Compiler Suite which is pretty straight forward. However because of some bugs/issues configuring and compiling numpy with MKL support was a bit of a hassle.

GotoBlas2 is a small package which can be easily compiled as a shared library. However because of a bug you have to re-create the shared library after building it in order to use it with numpy.
In addition to this building it for multiple target plattform didn’t work for some reason. So I had to create an .so file for each platform for which i want to have an optimized libgoto2.so file.

If you install numpy from Ubuntu’s repository it will automatically install and configure numpy to use ATLAS. Installing ATLAS from source can take some time and requires some additional steps (fortran, etc).

If you install numpy on a Mac OS X machine with Fink or Mac Ports it will either configure numpy to use ATLAS or Apple’s Accelerate Framework. You can check by either running ldd on the numpy.core._dotblas file or calling numpy.show_config().

Conclusions

MKL performs best closely followed by GotoBlas2.
In the eigenvalue test GotoBlas2 performs surprisingly worse than expected. Not sure why this is the case.
Apple’s Accelerate Framework performs really good especially in single threaded mode (compared to the other BLAS implementations).

Both GotoBlas2 and MKL scale very well with number of threads. So if you have to deal with big matrices running it on multiple threads will help a lot.

In any case don’t use the default netlib blas implementation because it is way too slow for any serious computational work.

On our cluster I also installed AMD’s ACML and performance was similar to MKL and GotoBlas2. I don’t have any numbers tough.

I personally would recommend to use GotoBlas2 because it’s easier to install and it’s free.

If you want to code in C++/C also check out Eigen3 which is supposed to outperform MKL/GotoBlas2 in some cases and is also pretty easy to use.

Question 39

Here’s another benchmark (on Linux, just type make): http://dl.dropbox.com/u/5453551/blas_call_benchmark.zip

http://dl.dropbox.com/u/5453551/blas_call_benchmark.png

I do not see essentially any difference between the different methods for large matrices, between Numpy, Ctypes and Fortran. (Fortran instead of C++ — and if this matters, your benchmark is probably broken.)

~~Your CalcTime function in C++ seems to have a sign error. ... + ((double)start.tv_usec)) should be instead ... - ((double)start.tv_usec)).~~ Perhaps your benchmark also has other bugs, e.g., comparing between different BLAS libraries, or different BLAS settings such as number of threads, or between real time and CPU time?

EDIT: failed to count the braces in the CalcTime function — it’s OK.

As a guideline: if you do a benchmark, please always post all the code somewhere. Commenting on benchmarks, especially when surprising, without having the full code is usually not productive.

To find out which BLAS Numpy is linked against, do:

$ python
Python 2.7.2+ (default, Aug 16 2011, 07:24:41) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy.core._dotblas
>>> numpy.core._dotblas.__file__
'/usr/lib/pymodules/python2.7/numpy/core/_dotblas.so'
>>> 
$ ldd /usr/lib/pymodules/python2.7/numpy/core/_dotblas.so
    linux-vdso.so.1 =>  (0x00007fff5ebff000)
    libblas.so.3gf => /usr/lib/libblas.so.3gf (0x00007fbe618b3000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbe61514000)

UPDATE: If you can’t import numpy.core._dotblas, your Numpy is using its internal fallback copy of BLAS, which is slower, and not meant to be used in performance computing! The reply from @Woltan below indicates that this is the explanation for the difference he/she sees in Numpy vs. Ctypes+BLAS.

To fix the situation, you need either ATLAS or MKL — check these instructions: http://scipy.org/Installing_SciPy/Linux Most Linux distributions ship with ATLAS, so the best option is to install their libatlas-dev package (name may vary).

Question 40

Given the rigor you’ve shown with your analysis, I’m surprised by the results thus far. I put this as an ‘answer’ but only because it’s too long for a comment and does provide a possibility (though I expect you’ve considered it).

I would’ve thought the numpy/python approach wouldn’t add much overhead for a matrix of reasonable complexity, since as the complexity increases, the proportion that python participates in should be small. I’m more interested in the results on the right hand side of the graph, but orders of magnitude discrepancy shown there would be disturbing.

I wonder if you’re using the best algorithms that numpy can leverage. From the compilation guide for linux:

“Build FFTW (3.1.2): SciPy Versions >= 0.7 and Numpy >= 1.2: Because of license, configuration, and maintenance issues support for FFTW was removed in versions of SciPy >= 0.7 and NumPy >= 1.2. Instead now uses a built-in version of fftpack. There are a couple ways to take advantage of the speed of FFTW if necessary for your analysis. Downgrade to a Numpy/Scipy version that includes support. Install or create your own wrapper of FFTW. See http://developer.berlios.de/projects/pyfftw/ as an un-endorsed example.”

Did you compile numpy with mkl? (http://software.intel.com/en-us/articles/intel-mkl/). If you’re running on linux, the instructions for compiling numpy with mkl are here: http://www.scipy.org/Installing_SciPy/Linux#head-7ce43956a69ec51c6f2cedd894a4715d5bfff974 (in spite of url). The key part is:

[mkl]
library_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64
include_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/include
mkl_libs = mkl_intel_lp64,mkl_intel_thread,mkl_core

If you’re on windows, you can obtain a compiled binary with mkl, (and also obtain pyfftw, and many other related algorithms) at: http://www.lfd.uci.edu/~gohlke/pythonlibs/, with a debt of gratitude to Christoph Gohlke at the Laboratory for Fluorescence Dynamics, UC Irvine.

Caveat, in either case, there are many licensing issues and so on to be aware of, but the intel page explains those. Again, I imagine you’ve considered this, but if you meet the licensing requirements (which on linux is very easy to do), this would speed up the numpy part a great deal relative to using a simple automatic build, without even FFTW. I’ll be interested to follow this thread and see what others think. Regardless, excellent rigor and excellent question. Thanks for posting it.

Question 41

is it possible to convert a Python program to C/C++?

I need to implement a couple of algorithms, and I’m not sure if the performance gap is big enough to justify all the pain I’d go through when doing it in C/C++ (which I’m not good at). I thought about writing one simple algorithm and benchmark it against such a converted solution. If that alone is significantly faster than the Python version, then I’ll have no other choice than doing it in C/C++.

Question 42

Yes. Look at Cython. It does just that: Converts Python to C for speedups.

Question 43

If the C variant needs x hours less, then I’d invest that time in letting the algorithms run longer/again

“invest” isn’t the right word here.

Build a working implementation in Python. You’ll finish this long before you’d finish a C version.
Measure performance with the Python profiler. Fix any problems you find. Change data structures and algorithms as necessary to really do this properly. You’ll finish this long before you finish the first version in C.
If it’s still too slow, manually translate the well-designed and carefully constructed Python into C.

Because of the way hindsight works, doing the second version from existing Python (with existing unit tests, and with existing profiling data) will still be faster than trying to do the C code from scratch.

This quote is important.

Thompson’s Rule for First-Time Telescope Makers
It is faster to make a four-inch mirror and then a six-inch mirror than to make a six-inch mirror.

Bill McKeenan
Wang Institute

Question 44

Shed Skin is “a (restricted) Python-to-C++ compiler”.

Question 45

Just came across this new tool in hacker news.

From their page – “Nuitka is a good replacement for the Python interpreter and compiles every construct that CPython 2.6, 2.7, 3.2 and 3.3 offer. It translates the Python into a C++ program that then uses “libpython” to execute in the same way as CPython does, in a very compatible way.”

Question 46

Another option – to convert to C++ besides Shed Skin – is Pythran.

To quote High Performance Python by Micha Gorelick and Ian Ozsvald:

Pythran is a Python-to-C++ compiler for a subset of Python that includes partial numpy support. It acts a little like Numba and Cython—you annotate a function’s arguments, and then it takes over with further type annotation and code specialization. It takes advantage of vectorization possibilities and of OpenMP-based parallelization possibilities. It runs using Python 2.7 only.

One very interesting feature of Pythran is that it will attempt to automatically spot parallelization opportunities (e.g., if you’re using a map), and turn this into parallel code without requiring extra effort from you. You can also specify parallel sections using pragma omp > directives; in this respect, it feels very similar to Cython’s OpenMP support.

Behind the scenes, Pythran will take both normal Python and numpy code and attempt to aggressively compile them into very fast C++—even faster than the results of Cython.

You should note that this project is young, and you may encounter bugs; you should also note that the development team are very friendly and tend to fix bugs in a matter of hours.

Question 47

I know this is an older thread but I wanted to give what I think to be helpful information.

I personally use PyPy which is really easy to install using pip. I interchangeably use Python/PyPy interpreter, you don’t need to change your code at all and I’ve found it to be roughly 40x faster than the standard python interpreter (Either Python 2x or 3x). I use pyCharm Community Edition to manage my code and I love it.

I like writing code in python as I think it lets you focus more on the task than the language, which is a huge plus for me. And if you need it to be even faster, you can always compile to a binary for Windows, Linux, or Mac (not straight forward but possible with other tools). From my experience, I get about 3.5x speedup over PyPy when compiling, meaning 140x faster than python. PyPy is available for Python 3x and 2x code and again if you use an IDE like PyCharm you can interchange between say PyPy, Cython, and Python very easily (takes a little of initial learning and setup though).

Some people may argue with me on this one, but I find PyPy to be faster than Cython. But they’re both great choices though.

Edit: I’d like to make another quick note about compiling: when you compile, the resulting binary is much bigger than your python script as it builds all dependencies into it, etc. But then you get a few distinct benefits: speed!, now the app will work on any machine (depending on which OS you compiled for, if not all. lol) without Python or libraries, it also obfuscates your code and is technically ‘production’ ready (to a degree). Some compilers also generate C code, which I haven’t really looked at or seen if it’s useful or just gibberish. Good luck.

Hope that helps.

Question 48

I realize that an answer on a quite new solution is missing. If Numpy is used in the code, I would advice to try Pythran:

http://pythran.readthedocs.io/

For the functions I tried, Pythran gives extremely good results. The resulting functions are as fast as well written Fortran code (or only slightly slower) and a little bit faster than the (quite optimized) Cython solution.

The advantage compared to Cython is that you just have to use Pythran on the Python function optimized for Numpy, meaning that you do not have to expand the loops and add types for all variables in the loop. Pythran takes its time to analyse the code so it understands the operations on numpy.ndarray.

It is also a huge advantage compared to Numba or other projects based on just-in-time compilation for which (to my knowledge), you have to expand the loops to be really efficient. And then the code with the loops becomes very very inefficient using only CPython and Numpy…

A drawback of Pythran: no classes! But since only the functions that really need to be optimized have to be compiled, it is not very annoying.

Another point: Pythran supports well (and very easily) OpenMP parallelism. But I don’t think mpi4py is supported…

Question 49

http://code.google.com/p/py2c/ looks like a possibility – they also mention on their site: Cython, Shedskin and RPython and confirm that they are converting Python code to pure C/C++ which is much faster than C/C++ riddled with Python API calls. Note: I haven’t tried it but I am going to..

Question 50

I recently started studying deep learning and other ML techniques, and I started searching for frameworks that simplify the process of build a net and training it, then I found TensorFlow, having little experience in the field, for me, it seems that speed is a big factor for making a big ML system even more if working with deep learning, so why python was chosen by Google to make TensorFlow? Wouldn’t it be better to make it over an language that can be compiled and not interpreted?

What are the advantages of using Python over a language like C++ for machine learning?

Question 51

The most important thing to realize about TensorFlow is that, for the most part, the core is not written in Python: It’s written in a combination of highly-optimized C++ and CUDA (Nvidia’s language for programming GPUs). Much of that happens, in turn, by using Eigen (a high-performance C++ and CUDA numerical library) and NVidia’s cuDNN (a very optimized DNN library for NVidia GPUs, for functions such as convolutions).

The model for TensorFlow is that the programmer uses “some language” (most likely Python!) to express the model. This model, written in the TensorFlow constructs such as:

h1 = tf.nn.relu(tf.matmul(l1, W1) + b1)
h2 = ...

is not actually executed when the Python is run. Instead, what’s actually created is a dataflow graph that says to take particular inputs, apply particular operations, supply the results as the inputs to other operations, and so on. This model is executed by fast C++ code, and for the most part, the data going between operations is never copied back to the Python code.

Then the programmer “drives” the execution of this model by pulling on nodes — for training, usually in Python, and for serving, sometimes in Python and sometimes in raw C++:

sess.run(eval_results)

This one Python (or C++ function call) uses either an in-process call to C++ or an RPC for the distributed version to call into the C++ TensorFlow server to tell it to execute, and then copies back the results.

So, with that said, let’s re-phrase the question: Why did TensorFlow choose Python as the first well-supported language for expressing and controlling the training of models?

The answer to that is simple: Python is probably the most comfortable language for a large range of data scientists and machine learning experts that’s also that easy to integrate and have control a C++ backend, while also being general, widely-used both inside and outside of Google, and open source. Given that with the basic model of TensorFlow, the performance of Python isn’t that important, it was a natural fit. It’s also a huge plus that NumPy makes it easy to do pre-processing in Python — also with high performance — before feeding it in to TensorFlow for the truly CPU-heavy things.

There’s also a bunch of complexity in expressing the model that isn’t used when executing it — shape inference (e.g., if you do matmul(A, B), what is the shape of the resulting data?) and automatic gradient computation. It turns out to have been nice to be able to express those in Python, though I think in the long term they’ll probably move to the C++ backend to make adding other languages easier.

(The hope, of course, is to support other languages in the future for creating and expressing models. It’s already quite straightforward to run inference using several other languages — C++ works now, someone from Facebook contributed Go bindings that we’re reviewing now, etc.)

Question 52

TF is not written in python. It is written in C++ (and uses high-performant numerical libraries and CUDA code) and you can check this by looking at their github. So the core is written not in python but TF provide an interface to many other languages (python, C++, Java, Go)

If you come from a data analysis world, you can think about it like numpy (not written in python, but provides an interface to Python) or if you are a web-developer – think about it as a database (PostgreSQL, MySQL, which can be invoked from Java, Python, PHP)

Python frontend (the language in which people write models in TF) is the most popular due to many reasons. In my opinion the main reason is historical: majority of ML users already use it (another popular choice is R) so if you will not provide an interface to python, your library is most probably doomed to obscurity.

But being written in python does not mean that your model is executed in python. On the contrary, if you written your model in the right way Python is never executed during the evaluation of the TF graph (except of tf.py_func(), which exists for debugging and should be avoided in real model exactly because it is executed on Python’s side).

This is different from for example numpy. For example if you do np.linalg.eig(np.matmul(A, np.transpose(A)) (which is eig(AA')), the operation will compute transpose in some fast language (C++ or fortran), return it to python, take it from python together with A, and compute a multiplication in some fast language and return it to python, then compute eigenvalues and return it to python. So nonetheless expensive operations like matmul and eig are calculated efficiently, you still lose time by moving the results to python back and force. TF does not do it, once you defined the graph your tensors flow not in python but in C++/CUDA/something else.

Question 53

Python allows you to create extension modules using C and C++, interfacing with native code, and still getting the advantages that Python gives you.

TensorFlow uses Python, yes, but it also contains large amounts of C++.

This allows a simpler interface for experimentation with less human-thought overhead with Python, and add performance by programming the most important parts in C++.

Question 54

The latest ratio you can check from here shows inside TensorFlow C++ takes ~50% of code, and Python takes ~40% of code.

Both C++ and Python are the official languages at Google so there is no wonder why this is so. If I would have to provide fast regression where C++ and Python are present…

C++ is inside the computational algebra, and Python is used for everything else including for the testing. Knowing how ubiquitous the testing is today it is no wonder why Python code contributes that much to TF.

Question 55

Suppose I have an if statement with a return. From the efficiency perspective, should I use

if(A > B):
    return A+1
return A-1

or

if(A > B):
    return A+1
else:
    return A-1

Should I prefer one or another when using a compiled language (C) or a scripted one (Python)?

Question 56

Since the return statement terminates the execution of the current function, the two forms are equivalent (although the second one is arguably more readable than the first).

The efficiency of both forms is comparable, the underlying machine code has to perform a jump if the if condition is false anyway.

Note that Python supports a syntax that allows you to use only one return statement in your case:

return A+1 if A > B else A-1

Question 57

From Chromium’s style guide:

Don’t use else after return:

# Bad
if (foo)
  return 1
else
  return 2

# Good
if (foo)
  return 1
return 2

return 1 if foo else 2

Question 58

Regarding coding style:

Most coding standards no matter language ban multiple return statements from a single function as bad practice.

(Although personally I would say there are several cases where multiple return statements do make sense: text/data protocol parsers, functions with extensive error handling etc)

The consensus from all those industry coding standards is that the expression should be written as:

int result;

if(A > B)
{
  result = A+1;
}
else
{
  result = A-1;
}
return result;

Regarding efficiency:

The above example and the two examples in the question are all completely equivalent in terms of efficiency. The machine code in all these cases have to compare A > B, then branch to either the A+1 or the A-1 calculation, then store the result of that in a CPU register or on the stack.

EDIT :

Sources:

MISRA-C:2004 rule 14.7, which in turn cites…:
IEC 61508-3. Part 3, table B.9.
IEC 61508-7. C.2.9.

Question 59

With any sensible compiler, you should observe no difference; they should be compiled to identical machine code as they’re equivalent.

Question 60

This is a question of style (or preference) since the interpreter does not care. Personally I would try not to make the final statement of a function which returns a value at an indent level other than the function base. The else in example 1 obscures, if only slightly, where the end of the function is.

By preference I use:

return A+1 if (A > B) else A-1

As it obeys both the good convention of having a single return statement as the last statement in the function (as already mentioned) and the good functional programming paradigm of avoiding imperative style intermediate results.

For more complex functions I prefer to break the function into multiple sub-functions to avoid premature returns if possible. Otherwise I revert to using an imperative style variable called rval. I try not to use multiple return statements unless the function is trivial or the return statement before the end is as a result of an error. Returning prematurely highlights the fact that you cannot go on. For complex functions that are designed to branch off into multiple subfunctions I try to code them as case statements (driven by a dict for instance).

Some posters have mentioned speed of operation. Speed of Run-time is secondary for me since if you need speed of execution Python is not the best language to use. I use Python as its the efficiency of coding (i.e. writing error free code) that matters to me.

Question 61

I personally avoid else blocks when possible. See the Anti-if Campaign

Also, they don’t charge ‘extra’ for the line, you know :p

“Simple is better than complex” & “Readability is king”

delta = 1 if (A > B) else -1
return A + delta

Question 62

Version A is simpler and that’s why I would use it.

And if you turn on all compiler warnings in Java you will get a warning on the second Version because it is unnecesarry and turns up code complexity.

Question 63

I know the question is tagged python, but it mentions dynamic languages so thought I should mention that in ruby the if statement actually has a return type so you can do something like

def foo
  rv = if (A > B)
         A+1
       else
         A-1
       end
  return rv 
end

Or because it also has implicit return simply

def foo 
  if (A>B)
    A+1
  else 
    A-1
  end
end

which gets around the style issue of not having multiple returns quite nicely.

Question 64

I think I understand strong typing, but every time I look for examples for what is weak typing I end up finding examples of programming languages that simply coerce/convert types automatically.

For instance, in this article named Typing: Strong vs. Weak, Static vs. Dynamic says that Python is strongly typed because you get an exception if you try to:

Python

1 + "1"
Traceback (most recent call last):
File "", line 1, in ? 
TypeError: unsupported operand type(s) for +: 'int' and 'str'

However, such thing is possible in Java and in C#, and we do not consider them weakly typed just for that.

Java

  int a = 10;
  String b = "b";
  String result = a + b;
  System.out.println(result);

C#

int a = 10;
string b = "b";
string c = a + b;
Console.WriteLine(c);

In this another article named Weakly Type Languages the author says that Perl is weakly typed simply because I can concatenate a string to a number and viceversa without any explicit conversion.

Perl

$a=10;
$b="a";
$c=$a.$b;
print $c; #10a

So the same example makes Perl weakly typed, but not Java and C#?.

Gee, this is confusing

The authors seem to imply that a language that prevents the application of certain operations on values of different types is strongly typed and the contrary means weakly typed.

Therefore, at some point I have felt prompted to believe that if a language provides a lot of automatic conversions or coercion between types (as perl) may end up being considered weakly typed, whereas other languages that provide only a few conversions may end up being considered strongly typed.

I am inclined to believe, though, that I must be wrong in this interepretation, I just do not know why or how to explain it.

So, my questions are:

What does it really mean for a language to be truly weakly typed?
Could you mention any good examples of weakly typing that are not related to automatic conversion/automatic coercion done by the language?
Can a language be weakly typed and strongly typed at the same time?

问题：如何从C＃运行Python脚本？

回答 0

回答 1

回答 2

从C执行Python脚本

Python sample_script

Execute Python script from C

Python sample_script

回答 3

回答 4

回答 5

回答 6

回答 7

问题：Python（和Python C API）：__new__与__init__

回答 0

回答 1

回答 2

回答 3

回答 4

问题：与Python生成器模式等效的C ++

Python

C ++

Python

C++

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

问题：如何从命令行编译Visual Studio项目？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

问题：基准测试（使用BLAS的python与c ++）和（numpy）

情境

1.脾气暴躁

2. Python，通过共享库调用BLAS

3. c ++，通过共享库调用BLAS

结果

问题

下载

Scenario

1. Numpy

2. Python, calling BLAS through a shared object

3. c++, calling BLAS through a shared object

Results

Questions

Download

回答 0

回答 1

更新（30.07.2014）：

硬件

软件

点产品基准

基准套件

结论

原始答案（04.10.2011）：

机器

点产品基准

基准套件

安装

结论

UPDATE (30.07.2014):

Hardware

Software

Dot-Product Benchmark

Benchmark Suite

Conclusion

Original Answer (04.10.2011):

Machines

问题：Python（和Python C API）：new与init