Tag Archives: Python

In-place type conversion of a NumPy array

Question: In-place type conversion of a NumPy array

Given a NumPy array of int32, how do I convert it to float32 in place? So basically, I would like to do

a = a.astype(numpy.float32)

without copying the array. It is big.

The reason for doing this is that I have two algorithms for the computation of a. One of them returns an array of int32, the other returns an array of float32 (and this is inherent to the two different algorithms). All further computations assume that a is an array of float32.

Currently I do the conversion in a C function called via ctypes. Is there a way to do this in Python?


Answer 0

You can make a view with a different dtype, and then copy in-place into the view:

import numpy as np
x = np.arange(10, dtype='int32')
y = x.view('float32')
y[:] = x

print(y)

yields

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.], dtype=float32)

To show the conversion was in-place, note that copying from x to y altered x:

print(x)

prints

array([         0, 1065353216, 1073741824, 1077936128, 1082130432,
       1084227584, 1086324736, 1088421888, 1090519040, 1091567616])
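
To double-check that the view trick really avoids a copy, here is a small sanity check (my own addition, not from the answer; np.shares_memory is available in NumPy 1.11+):

import numpy as np

x = np.arange(10, dtype='int32')
y = x.view('float32')
y[:] = x

# x and y are two dtype views of the same block of memory, so no new buffer was allocated
print(np.shares_memory(x, y))  # True
print(y.base is x)             # True: y is a view onto x's buffer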

Answer 1

Update: This function only avoids copy if it can, hence this is not the correct answer for this question. unutbu’s answer is the right one.


a = a.astype(numpy.float32, copy=False)

numpy astype has a copy flag. Why shouldn't we use it?
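
A small illustration (my own, not from the answer) of why the flag does not help here: copy=False only skips the copy when the requested dtype and layout already match, so an int32 to float32 conversion still allocates a new array:

import numpy as np

a = np.arange(10, dtype=np.int32)

b = a.astype(np.float32, copy=False)   # dtype differs, so a copy is made anyway
print(np.shares_memory(a, b))          # False

c = a.astype(np.int32, copy=False)     # same dtype: no copy, the original array is returned
print(c is a)                          # True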


Answer 2

You can change the array type without converting like this:

a.dtype = numpy.float32

but first you have to change all the integers to something that will be interpreted as the corresponding float. A very slow way to do this would be to use python’s struct module like this:

import struct

def toi(i):
    return struct.unpack('i',struct.pack('f',float(i)))[0]

…applied to each member of your array.

But perhaps a faster way would be to utilize numpy’s ctypeslib tools (which I am unfamiliar with)

– edit –

Since ctypeslib doesn't seem to work, I would proceed with the conversion using the typical numpy.astype method, but in block sizes that are within your memory limits:

a[0:10000] = a[0:10000].astype('float32').view('int32')

…then change the dtype when done.

Here is a function that accomplishes the task for any compatible dtypes (only works for dtypes with same-sized items) and handles arbitrarily-shaped arrays with user-control over block size:

import numpy

def astype_inplace(a, dtype, blocksize=10000):
    oldtype = a.dtype
    newtype = numpy.dtype(dtype)
    assert oldtype.itemsize == newtype.itemsize
    for idx in xrange(0, a.size, blocksize):
        a.flat[idx:idx + blocksize] = \
            a.flat[idx:idx + blocksize].astype(newtype).view(oldtype)
    a.dtype = newtype

a = numpy.random.randint(100,size=100).reshape((10,10))
print a
astype_inplace(a, 'float32')
print a

Answer 3

import numpy as np
arr_float = np.arange(10, dtype=np.float32)
arr_int = arr_float.view(np.int32)

use view() and parameter ‘dtype’ to change the array in place.


Answer 4

Use this:

In [105]: a
Out[105]: 
array([[15, 30, 88, 31, 33],
       [53, 38, 54, 47, 56],
       [67,  2, 74, 10, 16],
       [86, 33, 15, 51, 32],
       [32, 47, 76, 15, 81]], dtype=int32)

In [106]: float32(a)
Out[106]: 
array([[ 15.,  30.,  88.,  31.,  33.],
       [ 53.,  38.,  54.,  47.,  56.],
       [ 67.,   2.,  74.,  10.,  16.],
       [ 86.,  33.,  15.,  51.,  32.],
       [ 32.,  47.,  76.,  15.,  81.]], dtype=float32)

Answer 5

a = np.subtract(a, 0., dtype=np.float32)


Combining node.js and Python

Question: Combining node.js and Python

Node.js is a perfect match for our web project, but there are a few computational tasks for which we would prefer Python. We also already have Python code for them. We are highly concerned about speed: what is the most elegant way to call a Python "worker" from node.js in an asynchronous, non-blocking way?


Answer 0

For communication between the node.js and Python processes, I would use Unix sockets if both run on the same server, and TCP/IP sockets otherwise. For the marshaling protocol I would take JSON or protocol buffers. If threaded Python turns out to be a bottleneck, consider using Twisted Python, which provides the same event-driven concurrency as node.js does.

If you feel adventurous, learn clojure (clojurescript, clojure-py) and you'll get the same language that runs and interoperates with existing code on Java, JavaScript (node.js included), the CLR and Python. And you get a superb marshalling protocol by simply using clojure data structures.
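
As a rough sketch of the socket approach (my own illustration, not from the answer): a long-running Python worker that accepts newline-delimited JSON requests over a Unix socket. The socket path and the {"numbers": [...]} request format are made up; the node.js side would connect with net.createConnection and receive the reply asynchronously.

import json
import socketserver

SOCKET_PATH = "/tmp/py_worker.sock"   # hypothetical path; remove any stale socket file before binding

class WorkerHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # one JSON object per line in, one JSON object per line out
        for line in self.rfile:
            request = json.loads(line)
            result = {"result": sum(request.get("numbers", []))}   # placeholder computation
            self.wfile.write((json.dumps(result) + "\n").encode("utf-8"))

if __name__ == "__main__":
    server = socketserver.UnixStreamServer(SOCKET_PATH, WorkerHandler)
    server.serve_forever()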


Answer 1

This sounds like a scenario where zeroMQ would be a good fit. It’s a messaging framework that’s similar to using TCP or Unix sockets, but it’s much more robust (http://zguide.zeromq.org/py:all)

There's a library that uses zeroMQ to provide an RPC framework that works pretty well. It's called zeroRPC (http://www.zerorpc.io/). Here's the hello world.

Python “Hello x” server:

import zerorpc

class HelloRPC(object):
    '''pass the method a name, it replies "Hello name!"'''
    def hello(self, name):
        return "Hello, {0}!".format(name)

def main():
    s = zerorpc.Server(HelloRPC())
    s.bind("tcp://*:4242")
    s.run()

if __name__ == "__main__" : main()

And the node.js client:

var zerorpc = require("zerorpc");

var client = new zerorpc.Client();
client.connect("tcp://127.0.0.1:4242");
//calls the method on the python object
client.invoke("hello", "World", function(error, reply, streaming) {
    if(error){
        console.log("ERROR: ", error);
    }
    console.log(reply);
});

Or vice-versa, node.js server:

var zerorpc = require("zerorpc");

var server = new zerorpc.Server({
    hello: function(name, reply) {
        reply(null, "Hello, " + name, false);
    }
});

server.bind("tcp://0.0.0.0:4242");

And the python client

import zerorpc, sys

c = zerorpc.Client()
c.connect("tcp://127.0.0.1:4242")
name = sys.argv[1] if len(sys.argv) > 1 else "dude"
print c.hello(name)

Answer 2

If you arrange to have your Python worker in a separate process (either long-running server-type process or a spawned child on demand), your communication with it will be asynchronous on the node.js side. UNIX/TCP sockets and stdin/out/err communication are inherently async in node.


Answer 3

I’d consider also Apache Thrift http://thrift.apache.org/

It can bridge between several programming languages, is highly efficient and has support for async or sync calls. See full features here http://thrift.apache.org/docs/features/

The multi-language support can be useful for future plans; for example, if you later want to do part of the computational task in C++, it's very easy to add it to the mix using Thrift.


Answer 4

I’ve had a lot of success using thoonk.js along with thoonk.py. Thoonk leverages Redis (in-memory key-value store) to give you feed (think publish/subscribe), queue and job patterns for communication.

Why is this better than unix sockets or direct tcp sockets? Overall performance may decrease a little, but Thoonk provides a really simple API that saves you from having to deal with sockets manually. Thoonk also makes it trivial to implement a distributed computing model that lets you scale your python workers to increase performance, since you just spin up new instances of your python workers and connect them to the same redis server.
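
For reference, the underlying Redis pub/sub pattern looks roughly like this with plain redis-py (this is not Thoonk's own API, only a sketch of what it builds on; the channel name "jobs" is made up):

import redis

r = redis.Redis()          # assumes a Redis server on localhost
pubsub = r.pubsub()
pubsub.subscribe("jobs")

# a node.js publisher (or another Python process) would publish to the same
# channel, e.g. r.publish("jobs", '{"task": "..."}')
for message in pubsub.listen():
    if message["type"] == "message":
        print("got job:", message["data"])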


Answer 5

I'd recommend using a work queue, for example the excellent Gearman, which will provide you with a great way to dispatch background jobs and asynchronously get their results once they're processed.

The advantage of this, used heavily at Digg (among many others), is that it provides a strong, scalable and robust way for workers in any language to talk to clients in any language.


Answer 6

Update 2019

There are several ways to achieve this and here is the list in increasing order of complexity

  1. Python Shell, you will write streams to the python console and it will write back to you
  2. Redis Pub Sub, you can have a channel listening in Python while your node js publisher pushes data
  3. Websocket connection where Node acts as the client and Python acts as the server or vice-versa
  4. API connection with Express/Flask/Tornado etc working separately with an API endpoint exposed for the other to query

Approach 1: Python Shell (the simplest approach)

source.js file

const ps = require('python-shell')
// very important to add -u option since our python script runs infinitely
var options = {
    pythonPath: '/Users/zup/.local/share/virtualenvs/python_shell_test-TJN5lQez/bin/python',
    pythonOptions: ['-u'], // get print results in real-time
    // make sure you use an absolute path for scriptPath
    scriptPath: "./subscriber/",
    // args: ['value1', 'value2', 'value3'],
    mode: 'json'
};

const shell = new ps.PythonShell("destination.py", options);

function generateArray() {
    const list = []
    for (let i = 0; i < 1000; i++) {
        list.push(Math.random() * 1000)
    }
    return list
}

setInterval(() => {
    shell.send(generateArray())
}, 1000);

shell.on("message", message => {
    console.log(message);
})

destination.py file

import datetime
import sys
import time
import numpy
import talib
import timeit
import json
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

size = 1000
p = 100
o = numpy.random.random(size)
h = numpy.random.random(size)
l = numpy.random.random(size)
c = numpy.random.random(size)
v = numpy.random.random(size)

def get_indicators(values):
    # Return the RSI of the values sent from node.js
    numpy_values = numpy.array(values, dtype=numpy.double) 
    return talib.func.RSI(numpy_values, 14)

for line in sys.stdin:
    l = json.loads(line)
    print(get_indicators(l))
    # Without this step the output may not be immediately available in node
    sys.stdout.flush()

Notes: Make a folder called subscriber at the same level as the source.js file and put destination.py inside it. Don't forget to change the pythonPath to point at your own virtualenv environment.
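
For completeness, here is a minimal sketch of approach 3 from the list above (Python acting as the WebSocket server), assuming the third-party websockets package; the port and the JSON message format are made up, and the node.js side could use any WebSocket client:

import asyncio
import json

import websockets

async def handler(ws):
    async for message in ws:
        values = json.loads(message)               # expects a JSON list of numbers
        await ws.send(json.dumps(sum(values)))     # placeholder for the real computation

async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()                     # run forever

if __name__ == "__main__":
    asyncio.run(main())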


What are the drawbacks of Stackless Python? [closed]

Question: What are the drawbacks of Stackless Python? [closed]

I’ve been reading recently about Stackless Python and it seems to have many advantages compared with vanilla cPython. It has all those cool features like infinite recursion, microthreads, continuations, etc. and at the same time is faster than cPython (around 10%, if the Python wiki is to be believed) and compatible with it (at least versions 2.5, 2.6 and 3.0).

All this looks almost too good to be true. However, TANSTAAFL, I don't see much enthusiasm for Stackless in the Python community, and PEP 219 has never come to realization. Why is that? What are the drawbacks of Stackless? What skeletons are hidden in Stackless' closet?

(I know Stackless doesn’t offer real concurrency, just an easier way of programming in the concurrent way. It doesn’t really bother me.)


Answer 0

I don’t know where that “Stackless is 10% faster” on the Wiki came from, but then again I’ve never tried to measure those performance numbers. I can’t think of what Stackless does to make a difference that big.

Stackless is an amazing tool with several organizational/political problems.

The first comes from history. Christian Tismer started talking about what eventually became Stackless about 10 years ago. He had an idea of what he wanted, but had a hard time explaining what he was doing and why people should use it. This is partially because his background didn't include CS training regarding ideas like coroutines, and because his presentations and discussions are very implementation oriented, which makes it hard for anyone not already hip-deep in continuations to understand how to use it as a solution to their problems.

For that reason, the initial documentation was poor. There were some descriptions of how to use it, with the best from third-party contributors. At PyCon 2007 I gave a talk on “Using Stackless” which went over quite well, according to the PyCon survey numbers. Richard Tew has done a great job collecting these, updating stackless.com, and maintaining the distribution when new Python releases comes up. He’s an employee of CCP Games, developers of EVE Online, which uses Stackless as an essential part of their gaming system.

CCP games is also the biggest real-world example people use when they talk about Stackless. The main tutorial for Stackless is Grant Olson’s “Introduction to Concurrent Programming with Stackless Python“, which is also game oriented. I think this gives people a skewed idea that Stackless is games-oriented, when it’s more that games are more easily continuation oriented.

Another difficulty has been the source code. In its original form it required changes to many parts of Python, which made Guido van Rossum, the Python lead, wary. Part of the reason, I think, was support for call/cc that was later removed as being “too much like supporting a goto when there are better higher-level forms.” I’m not certain about this history, so just read this paragraph as “Stackless used to require too many changes.”

Later releases didn’t require the changes, and Tismer continued to push for its inclusion in Python. While there was some consideration, the official stance (as far as I know) is that CPython is not only a Python implementation but it’s meant as a reference implementation, and it won’t include Stackless functionality because it can’t be implemented by Jython or Iron Python.

There are absolutely no plans for “significant changes to the code base“. That quote and reference hyperlink from Arafangion’s (see the comment) are from roughly 2000/2001. The structural changes have long been done, and it’s what I mentioned above. Stackless as it is now is stable and mature, with only minor tweaks to the code base over the last several years.

One final limitation with Stackless – there is no strong advocate for Stackless. Tismer is now deeply involved with PyPy, which is an implementation of Python for Python. He has implemented the Stackless functionality in PyPy and considers it much superior to Stackless itself, and feels that PyPy is the way of the future. Tew maintains Stackless but he isn’t interested in advocacy. I considered being in that role, but couldn’t see how I could make an income from it.

Though if you want training in Stackless, feel free to contact me! :)


Answer 1

It took quite a while to find this discussion. At that time I was not on PyPy but had a two-year affair with psyco, until health stopped this all quite abruptly. I'm now active again and designing an alternative approach, which I will present at EuroPython 2012.

Most of Andrew's statements are correct. Some minor additions:

Stackless was significantly faster than CPython, 10 years ago, because I optimized the interpreter loop. At that time, Guido was not ready for that. A few years later, people did similar optimizations and even more and better ones, which makes Stackless a little bit slower, as expected.

On inclusion: well, in the beginning I was very pushy and convinced that Stackless is the way to go. Later, when it was almost possible to get included, I lost interest in that and preferred to let it stay this way, partially out of frustration, partially to keep control of Stackless.

The arguments like “other implementations cannot do it” felt always lame to me, as there are other examples where this argument could also be used. I thought I better forget about that and stay in good friendship with Guido, having my own distro.

Meanwhile things are changing again. I'm working on PyPy and Stackless as an extension; I will talk about that some time later.

Cheers — Chris


Answer 2

If I recall correctly, Stackless was slated for inclusion into the official CPython, but the author of stackless told the CPython folks not to do so, because he planned to make some significant changes to the code base – presumably he wanted the integration done later when the project was more mature.


Answer 3

I’m also interested in the answers here. I’ve played a bit with Stackless and it looks like it would be a good solid addition to standard Python.

PEP 219 does mention potential difficulties with calling Python code from C code, if Python wants to change to a different stack. There would need to be ways to detect and prevent this (to avoid trashing the C stack). I think this is tractable though, so I’m also wondering why Stackless must stand on its own.


Why does multiprocessing use only a single core after I import numpy?

Question: Why does multiprocessing use only a single core after I import numpy?

I am not sure whether this counts more as an OS issue, but I thought I would ask here in case anyone has some insight from the Python end of things.

I’ve been trying to parallelise a CPU-heavy for loop using joblib, but I find that instead of each worker process being assigned to a different core, I end up with all of them being assigned to the same core and no performance gain.

Here’s a very trivial example…

from joblib import Parallel,delayed
import numpy as np

def testfunc(data):
    # some very boneheaded CPU work
    for nn in xrange(1000):
        for ii in data[0,:]:
            for jj in data[1,:]:
                ii*jj

def run(niter=10):
    data = (np.random.randn(2,100) for ii in xrange(niter))
    pool = Parallel(n_jobs=-1,verbose=1,pre_dispatch='all')
    results = pool(delayed(testfunc)(dd) for dd in data)

if __name__ == '__main__':
    run()

…and here’s what I see in htop while this script is running:

I’m running Ubuntu 12.10 (3.5.0-26) on a laptop with 4 cores. Clearly joblib.Parallel is spawning separate processes for the different workers, but is there any way that I can make these processes execute on different cores?


Answer 0

After some more googling I found the answer here.

It turns out that certain Python modules (numpy, scipy, tables, pandas, skimage…) mess with core affinity on import. As far as I can tell, this problem seems to be specifically caused by them linking against multithreaded OpenBLAS libraries.

A workaround is to reset the task affinity using

os.system("taskset -p 0xff %d" % os.getpid())

With this line pasted in after the module imports, my example now runs on all cores.

My experience so far has been that this doesn't seem to have any negative effect on numpy's performance, although this is probably machine- and task-specific.
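
For clarity, here is a minimal sketch (my own, based on the question's example) of where the workaround line goes; note that the 0xff mask assumes at most 8 logical cores, so adjust it (or use os.sched_setaffinity) on bigger machines:

import os

import numpy as np                               # importing numpy may reset the CPU affinity
from joblib import Parallel, delayed

os.system("taskset -p 0xff %d" % os.getpid())    # restore affinity to all cores

def testfunc(data):
    for nn in range(1000):
        for ii in data[0, :]:
            for jj in data[1, :]:
                ii * jj

if __name__ == '__main__':
    data = (np.random.randn(2, 100) for ii in range(10))
    Parallel(n_jobs=-1, verbose=1)(delayed(testfunc)(dd) for dd in data)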

Update:

There are also two ways to disable the CPU affinity-resetting behaviour of OpenBLAS itself. At run-time you can use the environment variable OPENBLAS_MAIN_FREE (or GOTOBLAS_MAIN_FREE), for example

OPENBLAS_MAIN_FREE=1 python myscript.py

Or alternatively, if you’re compiling OpenBLAS from source you can permanently disable it at build-time by editing the Makefile.rule to contain the line

NO_AFFINITY=1

Answer 1

Python 3 now exposes the methods to directly set the affinity

>>> import os
>>> os.sched_getaffinity(0)
{0, 1, 2, 3}
>>> os.sched_setaffinity(0, {1, 3})
>>> os.sched_getaffinity(0)
{1, 3}
>>> x = {i for i in range(10)}
>>> x
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> os.sched_setaffinity(0, x)
>>> os.sched_getaffinity(0)
{0, 1, 2, 3}
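
A minimal sketch (my own illustration) of applying this to the multiprocessing case: let each worker undo whatever affinity mask it inherited from the parent process:

import multiprocessing as mp
import os

def reset_affinity():
    # allow this worker to run on every core again
    os.sched_setaffinity(0, range(os.cpu_count()))

def work(x):
    return sum(i * i for i in range(x))

if __name__ == '__main__':
    with mp.Pool(initializer=reset_affinity) as pool:
        print(pool.map(work, [100000] * 8))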

Answer 2

This seems to be a common issue with Python on Ubuntu, and is not specific to joblib.

I suggest experimenting with CPU affinity (taskset).
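
For example, the whole script can be launched pinned to an explicit set of cores (adjust the core list to your machine):

taskset -c 0-3 python myscript.py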


Speed up millions of regex replacements in Python 3

Question: Speed up millions of regex replacements in Python 3

I’m using Python 3.5.2

I have two lists

  • a list of about 750,000 “sentences” (long strings)
  • a list of about 20,000 “words” that I would like to delete from my 750,000 sentences

So, I have to loop through 750,000 sentences and perform about 20,000 replacements, but ONLY if my words are actually “words” and are not part of a larger string of characters.

I am doing this by pre-compiling my words so that they are flanked by the \b metacharacter

compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]

Then I loop through my “sentences”

import re

for sentence in sentences:
  for word in compiled_words:
    sentence = re.sub(word, "", sentence)
  # put sentence into a growing list

This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

  • Is there a way to use the str.replace method (which I believe is faster), while still requiring that replacements only happen at word boundaries?

  • Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping over re.sub if the length of my word is greater than the length of my sentence, but it's not much of an improvement.

Thank you for any suggestions.


Answer 0

One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b".

Because re relies on C code to do the actual matching, the savings can be dramatic.

As @pvg pointed out in the comments, it also benefits from single pass matching.

If your words are not regex, Eric’s answer is faster.
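
A minimal sketch of building such a combined pattern (my own illustration; re.escape guards against words that contain regex metacharacters, and the word list is made up):

import re

words = ["word1", "word2", "word3"]     # stand-in for the ~20,000 banned words
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")

sentences = ["word1 is here", "nothing to remove", "word3word stays intact"]
print([pattern.sub("", s) for s in sentences])
# [' is here', 'nothing to remove', 'word3word stays intact']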


Answer 1

TLDR

Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP’s, it’s approximately 2000 times faster than the accepted answer.

If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.

Theory

If your sentences aren’t humongous strings, it’s probably feasible to process many more than 50 per second.

If you save all the banned words into a set, it will be very fast to check if another word is included in that set.

Pack the logic into a function, give this function as argument to re.sub and you’re done!

Code

import re
with open('/usr/share/dict/american-english') as wordbook:
    banned_words = set(word.strip().lower() for word in wordbook)


def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
             "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000

word_pattern = re.compile('\w+')

for sentence in sentences:
    sentence = word_pattern.sub(delete_banned_words, sentence)

Converted sentences are:

' .  !
  .
GiraffeElephantBoat
sfgsdg sdwerha aswertwe

Note that:

  • the search is case-insensitive (thanks to lower())
  • replacing a word with "" might leave two spaces (as in your code)
  • With python3, \w+ also matches accented characters (e.g. "ångström").
  • Any non-word character (tab, space, newline, marks, …) will stay untouched.

Performance

There are a million sentences, banned_words has almost 100000 words and the script runs in less than 7s.

In comparison, Liteye’s answer needed 160s for 10 thousand sentences.

With n being the total amount of words and m the amount of banned words, OP's and Liteye's code are O(n*m).

In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).

Regex union test

What’s the complexity of a regex search with a '\b(word1|word2|...|wordN)\b' pattern? Is it O(N) or O(1)?

It’s pretty hard to grasp the way the regex engine works, so let’s write a simple test.

This code extracts 10**i random English words into a list. It creates the corresponding regex union, and tests it with different words:

  • one is clearly not a word (it begins with #)
  • one is the first word in the list
  • one is the last word in the list
  • one looks like a word but isn’t


import re
import timeit
import random

with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook]
    random.shuffle(english_words)

print("First 10 words :")
print(english_words[:10])

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", english_words[0]),
    ("Last word", english_words[-1]),
    ("Almost a word", "couldbeaword")
]


def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nUnion of %d words" % 10**exp)
    union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %-17s : %.1fms" % (description, time))

It outputs:

First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']

Union of 10 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 0.7ms
  Almost a word     : 0.7ms

Union of 100 words
  Surely not a word : 0.7ms
  First word        : 1.1ms
  Last word         : 1.2ms
  Almost a word     : 1.2ms

Union of 1000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 9.6ms
  Almost a word     : 10.1ms

Union of 10000 words
  Surely not a word : 1.4ms
  First word        : 1.8ms
  Last word         : 96.3ms
  Almost a word     : 116.6ms

Union of 100000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 1227.1ms
  Almost a word     : 1404.1ms

So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b' pattern has:

  • O(1) best case
  • O(n/2) average case, which is still O(n)
  • O(n) worst case

These results are consistent with a simple loop search.

A much faster alternative to a regex union is to create the regex pattern from a trie.


Answer 2

TLDR

Use this method if you want the fastest regex-based solution. For a dataset similar to the OP’s, it’s approximately 1000 times faster than the accepted answer.

If you don’t care about regex, use this set-based version, which is 2000 times faster than a regex union.

Optimized Regex with Trie

A simple Regex union approach becomes slow with many banned words, because the regex engine doesn’t do a very good job of optimizing the pattern.

It’s possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie or regex aren’t really human-readable, but they do allow for very fast lookup and match.

Example

['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']

The list is converted to a trie:

{
    'f': {
        'o': {
            'o': {
                'x': {
                    'a': {
                        'r': {
                            '': 1
                        }
                    }
                },
                'b': {
                    'a': {
                        'r': {
                            '': 1
                        },
                        'h': {
                            '': 1
                        }
                    }
                },
                'z': {
                    'a': {
                        '': 1,
                        'p': {
                            '': 1
                        }
                    }
                }
            }
        }
    }
}

And then to this regex pattern:

r"\bfoo(?:ba[hr]|xar|zap?)\b"

The huge advantage is that to test if zoo matches, the regex engine only needs to compare the first character (it doesn't match), instead of trying the 5 words. This preprocessing is overkill for 5 words, but it shows promising results for many thousands of words.

Note that (?:) non-capturing groups are used because:

  • foobar|baz would match foobar or baz, but not foobaz
  • foo(bar|baz) would save unneeded information to a capturing group

Code

Here’s a slightly modified gist, which we can use as a trie.py library:

import re


class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

Test

Here’s a small test (the same as this one):

# Encoding: utf-8
import re
import timeit
import random
from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
    random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))

It outputs:

TrieRegex of 10 words
  Surely not a word : 0.3ms
  First word : 0.4ms
  Last word : 0.5ms
  Almost a word : 0.5ms

TrieRegex of 100 words
  Surely not a word : 0.3ms
  First word : 0.5ms
  Last word : 0.9ms
  Almost a word : 0.6ms

TrieRegex of 1000 words
  Surely not a word : 0.3ms
  First word : 0.7ms
  Last word : 0.9ms
  Almost a word : 1.1ms

TrieRegex of 10000 words
  Surely not a word : 0.1ms
  First word : 1.0ms
  Last word : 1.2ms
  Almost a word : 1.2ms

TrieRegex of 100000 words
  Surely not a word : 0.3ms
  First word : 1.2ms
  Last word : 0.9ms
  Almost a word : 1.6ms

For info, the regex begins like this:

(?:a(?:(?:\’s|a(?:\’s|chen|liyah(?:\’s)?|r(?:dvark(?:(?:\’s|s))?|on))|b(?:\’s|a(?:c(?:us(?:(?:\’s|es))?|[ik])|ft|lone(?:(?:\’s|s))?|ndon(?:(?:ed|ing|ment(?:\’s)?|s))?|s(?:e(?:(?:ment(?:\’s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:\’s)?|[ds]))?|ing|toir(?:(?:\’s|s))?))|b(?:as(?:id)?|e(?:ss(?:(?:\’s|es))?|y(?:(?:\’s|s))?)|ot(?:(?:\’s|t(?:\’s)?|s))?|reviat(?:e[ds]?|i(?:ng|on(?:(?:\’s|s))?))|y(?:\’s)?|\é(?:(?:\’s|s))?)|d(?:icat(?:e[ds]?|i(?:ng|on(?:(?:\’s|s))?))|om(?:en(?:(?:\’s|s))?|inal)|u(?:ct(?:(?:ed|i(?:ng|on(?:(?:\’s|s))?)|or(?:(?:\’s|s))?|s))?|l(?:\’s)?))|e(?:(?:\’s|am|l(?:(?:\’s|ard|son(?:\’s)?))?|r(?:deen(?:\’s)?|nathy(?:\’s)?|ra(?:nt|tion(?:(?:\’s|s))?))|t(?:(?:t(?:e(?:r(?:(?:\’s|s))?|d)|ing|or(?:(?:\’s|s))?)|s))?|yance(?:\’s)?|d))?|hor(?:(?:r(?:e(?:n(?:ce(?:\’s)?|t)|d)|ing)|s))?|i(?:d(?:e[ds]?|ing|jan(?:\’s)?)|gail|l(?:ene|it(?:ies|y(?:\’s)?)))|j(?:ect(?:ly)?|ur(?:ation(?:(?:\’s|s))?|e[ds]?|ing))|l(?:a(?:tive(?:(?:\’s|s))?|ze)|e(?:(?:st|r))?|oom|ution(?:(?:\’s|s))?|y)|m\’s|n(?:e(?:gat(?:e[ds]?|i(?:ng|on(?:\’s)?))|r(?:\’s)?)|ormal(?:(?:it(?:ies|y(?:\’s)?)|ly))?)|o(?:ard|de(?:(?:\’s|s))?|li(?:sh(?:(?:e[ds]|ing))?|tion(?:(?:\’s|ist(?:(?:\’s|s))?))?)|mina(?:bl[ey]|t(?:e[ds]?|i(?:ng|on(?:(?:\’s|s))?)))|r(?:igin(?:al(?:(?:\’s|s))?|e(?:(?:\’s|s))?)|t(?:(?:ed|i(?:ng|on(?:(?:\’s|ist(?:(?:\’s|s))?|s))?|ve)|s))?)|u(?:nd(?:(?:ed|ing|s))?|t)|ve(?:(?:\’s|board))?)|r(?:a(?:cadabra(?:\’s)?|d(?:e[ds]?|ing)|ham(?:\’s)?|m(?:(?:\’s|s))?|si(?:on(?:(?:\’s|s))?|ve(?:(?:\’s|ly|ness(?:\’s)?|s))?))|east|idg(?:e(?:(?:ment(?:(?:\’s|s))?|[ds]))?|ing|ment(?:(?:\’s|s))?)|o(?:ad|gat(?:e[ds]?|i(?:ng|on(?:(?:\’s|s))?)))|upt(?:(?:e(?:st|r)|ly|ness(?:\’s)?))?)|s(?:alom|c(?:ess(?:(?:\’s|e[ds]|ing))?|issa(?:(?:\’s|[es]))?|ond(?:(?:ed|ing|s))?)|en(?:ce(?:(?:\’s|s))?|t(?:(?:e(?:e(?:(?:\’s|ism(?:\’s)?|s))?|d)|ing|ly|s))?)|inth(?:(?:\’s|e(?:\’s)?))?|o(?:l(?:ut(?:e(?:(?:\’s|ly|st?))?|i(?:on(?:\’s)?|sm(?:\’s)?))|v(?:e[ds]?|ing))|r(?:b(?:(?:e(?:n(?:cy(?:\’s)?|t(?:(?:\’s|s))?)|d)|ing|s))?|pti…

It’s really unreadable, but for a list of 100000 banned words, this Trie regex is 1000 times faster than a simple regex union!

Here’s a diagram of the complete trie, exported with trie-python-graphviz and graphviz twopi:


Answer 3

One thing you might want to try is pre-processing the sentences to encode the word boundaries. Basically turn each sentence into a list of words by splitting on word boundaries.

This should be faster, because to process a sentence, you just have to step through each of the words and check if it’s a match.

Currently the regex search is having to go through the entire string again each time, looking for word boundaries, and then “discarding” the result of this work before the next pass.
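
A minimal sketch of that idea (my own illustration; splitting with a capturing group keeps the separators, so the sentence can be rebuilt afterwards):

import re

banned = {"word1", "word2"}            # banned words, lowercased
splitter = re.compile(r"(\W+)")        # capture the separators too

def clean(sentence):
    parts = splitter.split(sentence)   # words and separators alternate
    return "".join("" if part.lower() in banned else part for part in parts)

print(clean("word1 stays? no, word1word stays."))
# ' stays? no, word1word stays.'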


Answer 4

Well, here's a quick and easy solution, with a test set.

Winning strategy:

re.sub("\w+", repl, sentence) searches for words.

"repl" can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search and replace.

This is the simplest and fastest solution (see function replace4 in the example code below).

Second best

The idea is to split the sentences into words, using re.split, while conserving the separators to reconstruct the sentences later. Then, replacements are done with a simple dict lookup.

(See function replace3 in the example code below.)

Timings for the example functions:

replace1: 0.62 sentences/s
replace2: 7.43 sentences/s
replace3: 48498.03 sentences/s
replace4: 61374.97 sentences/s (...and 240.000/s with PyPy)

…and code:

#! /bin/env python3
# -*- coding: utf-8

import time, random, re

def replace1( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns:
            sentence = re.sub( "\\b"+search+"\\b", repl, sentence )

def replace2( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns_comp:
            sentence = re.sub( search, repl, sentence )

def replace3( sentences ):
    pd = patterns_dict.get
    for n, sentence in enumerate( sentences ):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)

        # and... done.
        sentence = "".join( pd(w,w) for w in words )

        #~ print( n, sentence )

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w,w)

    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)



# Build test set
test_words = [ ("word%d" % _) for _ in range(50000) ]
test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]

# Create search and replace patterns
patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
patterns_dict = dict( patterns )
patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]


def test( func, num ):
    t = time.time()
    func( test_sentences[:num] )
    print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)))

print( "Sentences", len(test_sentences) )
print( "Words    ", len(test_words) )

test( replace1, 1 )
test( replace2, 10 )
test( replace3, 1000 )
test( replace4, 1000 )

编辑:检查是否传递小写的句子列表并编辑repl时,您也可以忽略小写

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w.lower(),w)

Well, here’s a quick and easy solution, with test set.

Winning strategy:

re.sub(“\w+”,repl,sentence) searches for words.

“repl” can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search and replace.

This is the simplest and fastest solution (see function replace4 in example code below).

Second best

The idea is to split the sentences into words, using re.split, while conserving the separators to reconstruct the sentences later. Then, replacements are done with a simple dict lookup.

(see function replace3 in example code below).

Timings for example functions:

replace1: 0.62 sentences/s
replace2: 7.43 sentences/s
replace3: 48498.03 sentences/s
replace4: 61374.97 sentences/s (...and 240.000/s with PyPy)

…and code:

#! /bin/env python3
# -*- coding: utf-8

import time, random, re

def replace1( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns:
            sentence = re.sub( "\\b"+search+"\\b", repl, sentence )

def replace2( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns_comp:
            sentence = re.sub( search, repl, sentence )

def replace3( sentences ):
    pd = patterns_dict.get
    for n, sentence in enumerate( sentences ):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)

        # and... done.
        sentence = "".join( pd(w,w) for w in words )

        #~ print( n, sentence )

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w,w)

    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)



# Build test set
test_words = [ ("word%d" % _) for _ in range(50000) ]
test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]

# Create search and replace patterns
patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
patterns_dict = dict( patterns )
patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]


def test( func, num ):
    t = time.time()
    func( test_sentences[:num] )
    print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)))

print( "Sentences", len(test_sentences) )
print( "Words    ", len(test_words) )

test( replace1, 1 )
test( replace2, 10 )
test( replace3, 1000 )
test( replace4, 1000 )

Edit: You can also ignore lowercase when checking if you pass a lowercase list of Sentences and edit repl

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w.lower(),w)

回答 5

也许Python不是这里的正确工具。这是Unix工具链中的一个

sed G file         |
tr ' ' '\n'        |
grep -vf blacklist |
awk -v RS= -v OFS=' ' '{$1=$1}1'

假设您的黑名单文件已经过预处理,并添加了字边界。步骤是:将文件转换为双倍行距,将每个句子拆分为每行一个单词,从文件中批量删除黑名单单词,然后合并回行。

这应该至少快一个数量级。

用于从单词中预处理黑名单文件(每行一个单词)

sed 's/.*/\\b&\\b/' words > blacklist

Perhaps Python is not the right tool here. Here is one with the Unix toolchain

sed G file         |
tr ' ' '\n'        |
grep -vf blacklist |
awk -v RS= -v OFS=' ' '{$1=$1}1'

assuming your blacklist file is preprocessed with the word boundaries added. The steps are: convert the file to double spaced, split each sentence to one word per line, mass delete the blacklist words from the file, and merge back the lines.

This should run at least an order of magnitude faster.

For preprocessing the blacklist file from words (one word per line)

sed 's/.*/\\b&\\b/' words > blacklist

回答 6

这个怎么样:

#!/usr/bin/env python3

from __future__ import unicode_literals, print_function
import re
import time
import io

def replace_sentences_1(sentences, banned_words):
    # faster on CPython, but does not use \b as the word separator
    # so result is slightly different than replace_sentences_2()
    def filter_sentence(sentence):
        words = WORD_SPLITTER.split(sentence)
        words_iter = iter(words)
        for word in words_iter:
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word
            yield next(words_iter) # yield the word separator

    WORD_SPLITTER = re.compile(r'(\W+)')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


def replace_sentences_2(sentences, banned_words):
    # slower on CPython, uses \b as separator
    def filter_sentence(sentence):
        boundaries = WORD_BOUNDARY.finditer(sentence)
        current_boundary = 0
        while True:
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            yield sentence[last_word_boundary:current_boundary] # yield the separators
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            word = sentence[last_word_boundary:current_boundary]
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word

    WORD_BOUNDARY = re.compile(r'\b')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
    output.write(sentence.encode('utf-8'))
    output.write(b' .')
print('time:', time.time() - start)

这些解决方案在单词边界处拆分,并在一个集合(set)中查找每个单词。它们应该比用re.sub做单词交替替换(Liteyes的解决方案)更快,因为这些方案的复杂度是O(n)(n为输入大小),这得益于摊销O(1)的集合查找;而使用正则表达式交替形式会导致正则引擎必须在每个字符处检查单词匹配,而不仅仅是在单词边界处。我的解决方案格外小心地保留了原始文本中使用的空白(即不压缩空白,并保留制表符、换行符和其他空白字符),但如果您决定不在意这些,从输出中删除它们应该非常简单。

我在corpus.txt上进行了测试,它是从Gutenberg Project下载的多本电子书的串联;banned_words.txt则是从Ubuntu的单词表(/usr/share/dict/american-english)中随机选取的20000个单词。处理862462个句子大约需要30秒(在PyPy上约为其一半时间)。我把句子定义为以“. ”分隔的任何内容。

$ # replace_sentences_1()
$ python3 filter_words.py 
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py 
number of sentences: 862462
time: 15.9370770454

$ # replace_sentences_2()
$ python3 filter_words.py 
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py 
number of sentences: 862462
time: 13.1190629005

PyPy特别受益于第二种方法,而CPython在第一种方法上表现更好。上面的代码在Python 2和Python 3上都可以使用。

How about this:

#!/usr/bin/env python3

from __future__ import unicode_literals, print_function
import re
import time
import io

def replace_sentences_1(sentences, banned_words):
    # faster on CPython, but does not use \b as the word separator
    # so result is slightly different than replace_sentences_2()
    def filter_sentence(sentence):
        words = WORD_SPLITTER.split(sentence)
        words_iter = iter(words)
        for word in words_iter:
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word
            yield next(words_iter) # yield the word separator

    WORD_SPLITTER = re.compile(r'(\W+)')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


def replace_sentences_2(sentences, banned_words):
    # slower on CPython, uses \b as separator
    def filter_sentence(sentence):
        boundaries = WORD_BOUNDARY.finditer(sentence)
        current_boundary = 0
        while True:
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            yield sentence[last_word_boundary:current_boundary] # yield the separators
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            word = sentence[last_word_boundary:current_boundary]
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word

    WORD_BOUNDARY = re.compile(r'\b')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
    output.write(sentence.encode('utf-8'))
    output.write(b' .')
print('time:', time.time() - start)

These solutions split on word boundaries and look up each word in a set. They should be faster than re.sub with word alternations (Liteyes’ solution), as these solutions are O(n), where n is the size of the input, thanks to the amortized O(1) set lookup, while using regex alternations would cause the regex engine to have to check for word matches at every character rather than just at word boundaries. My solution takes extra care to preserve the whitespace that was used in the original text (i.e. it doesn’t compress whitespace and preserves tabs, newlines, and other whitespace characters), but if you decide that you don’t care about it, it should be fairly straightforward to remove them from the output.

I tested on corpus.txt, which is a concatenation of multiple eBooks downloaded from the Gutenberg Project, and banned_words.txt is 20000 words randomly picked from Ubuntu’s wordlist (/usr/share/dict/american-english). It takes around 30 seconds to process 862462 sentences (and half of that on PyPy). I’ve defined sentences as anything separated by “. “.

$ # replace_sentences_1()
$ python3 filter_words.py 
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py 
number of sentences: 862462
time: 15.9370770454

$ # replace_sentences_2()
$ python3 filter_words.py 
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py 
number of sentences: 862462
time: 13.1190629005

PyPy particularly benefit more from the second approach, while CPython fared better on the first approach. The above code should work on both Python 2 and 3.


回答 7

实用方法

下述解决方案使用大量内存将所有文本存储在同一字符串中,并降低了复杂度。如果RAM是一个问题,请在使用前三思。

使用join/ split技巧,您可以完全避免循环,从而可以加快算法的速度。

  • 用特殊分隔符连接句子,这些特殊分隔符不包含在句子中:
  • merged_sentences = ' * '.join(sentences)

  • 使用|“或”正则表达式语句为需要从句子中摆脱的所有单词编译一个正则表达式:
  • regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag

  • 用已编译的正则表达式对单词下标,并用特殊的分隔符将其拆分回单独的句子:
  • clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')

    性能

    "".join复杂度为O(n)。这是非常直观的,但是无论如何都会有一个简短的报价来源:

    for (i = 0; i < seqlen; i++) {
        [...]
        sz += PyUnicode_GET_LENGTH(item);

    因此,使用join/split后的复杂度为O(words) + 2*O(sentences),仍然是线性复杂度;而初始方法为2*O(N²)。


    顺便说一句,不要使用多线程。GIL将阻止每个操作,因为您的任务严格地受CPU限制,因此GIL没有机会被释放,但是每个线程将同时发送滴答声,这会导致额外的工作量,甚至导致操作达到无穷大。

    Practical approach

    The solution described below uses a lot of memory to store all the text in a single string and to reduce the complexity level. If RAM is an issue, think twice before using it.

    With a join/split trick you can avoid explicit loops altogether, which should speed up the algorithm.

  • Concatenate the sentences with a special delimiter which is not contained in any of the sentences:
  • merged_sentences = ' * '.join(sentences)
    

  • Compile a single regex for all the words you need to remove from the sentences, using the | (“or”) regex operator:
  • regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag
    

  • Substitute the words using the compiled regex, then split on the special delimiter character back into separate sentences (a combined sketch follows at the end of this answer):
  • clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')
    

    Performance

    "".join complexity is O(n). This is pretty intuitive but anyway there is a shortened quotation from a source:

    for (i = 0; i < seqlen; i++) {
        [...]
        sz += PyUnicode_GET_LENGTH(item);
    

    Therefore with join/split you have O(words) + 2*O(sentences), which is still linear complexity, vs 2*O(N²) with the initial approach.


    B.t.w., don’t use multithreading here. The GIL will block each operation because your task is strictly CPU bound, so the GIL has no chance to be released; each thread will contend for it concurrently, which causes extra work and can make the operation take far longer.
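    Putting the three steps above together, a minimal end-to-end sketch (the words and sentences are made-up examples; the delimiter must not occur in the real data):

    import re

    sentences = ["word1 hello word2", "nothing to remove"]   # made-up input
    words = ["word1", "word2"]                               # made-up banned words

    merged_sentences = ' * '.join(sentences)
    regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I)
    clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')

    print(clean_sentences)   # [' hello ', 'nothing to remove']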


    回答 8

    将所有句子连接到一个文档中。使用Aho-Corasick算法的任何实现(这里是)来查找所有“不好”的单词。遍历文件,替换每个坏词,更新后跟的发现词的偏移量等。

    Concatenate all your sentences into one document. Use any implementation of the Aho-Corasick algorithm (here’s one) to locate all your “bad” words. Traverse the file, replacing each bad word, updating the offsets of found words that follow etc.
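    As a rough sketch, using the third-party pyahocorasick package (just one possible implementation; the words and text below are made up, and note that plain Aho-Corasick matches substrings, so add word-boundary checks if you need whole words only):

    import ahocorasick   # pip install pyahocorasick

    bad_words = ["word1", "word2"]       # made-up example
    text = "word1 hello word2 world"     # made-up document

    A = ahocorasick.Automaton()
    for w in bad_words:
        A.add_word(w, w)
    A.make_automaton()

    # A.iter() yields (end_index, value); turn matches into spans and cut them out.
    spans = [(end - len(w) + 1, end + 1) for end, w in A.iter(text)]
    pieces, last = [], 0
    for start, end in spans:
        pieces.append(text[last:start])
        last = end
    pieces.append(text[last:])
    print("".join(pieces))               # " hello  world"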


    在python脚本中隐藏密码(仅用于不安全的混淆)

    问题:在python脚本中隐藏密码(仅用于不安全的混淆)

    我有一个python脚本正在创建ODBC连接。ODBC连接是使用连接字符串生成的。在此连接字符串中,我必须包含此连接的用户名和密码。

    有没有一种简便的方法来隐藏文件中的此密码(只是在我编辑文件时没人能读取该密码)?

    I have got a python script which is creating an ODBC connection. The ODBC connection is generated with a connection string. In this connection string I have to include the username and password for this connection.

    Is there an easy way to obscure this password in the file (just that nobody can read the password when I’m editing the file) ?


    回答 0

    Base64编码在标准库中,并且可以阻止肩膀冲浪者:

    >>> import base64
    >>>  print(base64.b64encode("password".encode("utf-8")))
    cGFzc3dvcmQ=
    >>> print(base64.b64decode("cGFzc3dvcmQ=").decode("utf-8"))
    password
    

    Base64 encoding is in the standard library and will do to stop shoulder surfers:

    >>> import base64
    >>>  print(base64.b64encode("password".encode("utf-8")))
    cGFzc3dvcmQ=
    >>> print(base64.b64decode("cGFzc3dvcmQ=").decode("utf-8"))
    password
    

    回答 1

    当您需要为远程登录指定密码时,Douglas F Shearer’s是Unix中公认的解决方案。
    您添加--password-from-file选项来指定路径并从文件中读取纯文本。
    然后,该文件可以位于受操作系统保护的用户自己的区域中。它还允许不同的用户自动选择自己的文件。

    对于不允许脚本用户知道的密码-您可以使用高级权限运行脚本,并让该root / admin用户拥有密码文件。

    Douglas F Shearer’s is the generally approved solution in Unix when you need to specify a password for a remote login.
    You add a --password-from-file option to specify the path and read plaintext from a file.
    The file can then be in the user’s own area protected by the operating system. It also allows different users to automatically pick up their own file.

    For passwords that the user of the script isn’t allowed to know – you can run the script with elevated permissions and have the password file owned by that root/admin user.
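    A minimal sketch of that pattern (the option and file names here are made up): read the plaintext password from a path given on the command line, and keep that file readable only by its owner (e.g. chmod 600).

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--password-from-file', dest='password_file', required=True)
    args = parser.parse_args()

    with open(args.password_file) as f:
        password = f.read().strip()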


    回答 2

    这是一个简单的方法:

    1. 创建一个python模块-我们称之为peekaboo.py。
    2. 在peekaboo.py中,同时包含密码和需要该密码的任何代码
    3. 通过导入此模块(通过python命令行等)创建一个编译版本-peekaboo.pyc。
    4. 现在,删除peekaboo.py。
    5. 现在,您可以仅依靠peekaboo.pyc来愉快地导入peekaboo。由于peekaboo.pyc是字节编译的,因此临时用户无法读取。

    尽管它容易受到py_to_pyc反编译器的攻击,但它应该比base64解码更加安全。

    Here is a simple method:

    1. Create a python module – let’s call it peekaboo.py.
    2. In peekaboo.py, include both the password and any code needing that password
    3. Create a compiled version – peekaboo.pyc – by importing this module (via python commandline, etc…).
    4. Now, delete peekaboo.py.
    5. You can now happily import peekaboo relying only on peekaboo.pyc. Since peekaboo.pyc is byte compiled it is not readable to the casual user.

    This should be a bit more secure than base64 decoding – although it is vulnerable to a py_to_pyc decompiler.
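    A rough illustration of steps 1-4 (the module contents are made up; on Python 3 the compiled file is written under __pycache__ with an interpreter tag, so you may need to move/rename it to peekaboo.pyc next to your code for a source-less import):

    # peekaboo.py  (made-up contents)
    PASSWORD = "s3cret"

    def get_password():
        return PASSWORD

    Compile it and remove the source:

    $ python -m py_compile peekaboo.py
    $ rm peekaboo.py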


    回答 3

    如果您在Unix系统上工作,请利用标准Python库中的netrc模块。它从单独的文本文件(.netrc)中读取密码,该文件的格式在此处描述

    这是一个小用法示例:

    import netrc
    
    # Define which host in the .netrc file to use
    HOST = 'mailcluster.loopia.se'
    
    # Read from the .netrc file in your home directory
    secrets = netrc.netrc()
    username, account, password = secrets.authenticators( HOST )
    
    print username, password
    

    If you are working on a Unix system, take advantage of the netrc module in the standard Python library. It reads passwords from a separate text file (.netrc), which has the format decribed here.

    Here is a small usage example:

    import netrc
    
    # Define which host in the .netrc file to use
    HOST = 'mailcluster.loopia.se'
    
    # Read from the .netrc file in your home directory
    secrets = netrc.netrc()
    username, account, password = secrets.authenticators( HOST )
    
    print username, password
    

    回答 4

    假设用户无法在运行时提供用户名和密码,最好的解决方案可能是单独的源文件,其中仅包含导入到您的主代码中的用户名和密码的变量初始化。仅在凭据更改时才需要编辑此文件。否则,如果您只担心具有平均记忆的冲浪者,那么base 64编码可能是最简单的解决方案。ROT13太容易手动解码,不区分大小写,并且在加密状态下保留了太多含义。在python脚本之外对您的密码和用户ID进行编码。让他在运行时对脚本进行解码以供使用。

    为自动化任务提供脚本凭证始终是一个冒险的建议。您的脚本应具有其自己的凭据,并且所使用的帐户应完全不需要访问权限。至少密码应该是长且相当随机的。

    The best solution, assuming the username and password can’t be given at runtime by the user, is probably a separate source file containing only variable initialization for the username and password that is imported into your main code. This file would only need editing when the credentials change. Otherwise, if you’re only worried about shoulder surfers with average memories, base 64 encoding is probably the easiest solution. ROT13 is just too easy to decode manually, isn’t case sensitive and retains too much meaning in its encrypted state. Encode your password and user id outside the python script. Have the script decode at runtime for use.

    Giving scripts credentials for automated tasks is always a risky proposal. Your script should have its own credentials and the account it uses should have no access other than exactly what is necessary. At least the password should be long and rather random.


    回答 5

    如何从脚本外部的文件中导入用户名和密码?这样,即使有人掌握了该脚本,他们也不会自动获得密码。

    How about importing the username and password from a file external to the script? That way even if someone got hold of the script, they wouldn’t automatically get the password.
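    A minimal sketch of that idea (file and variable names are made up): keep the credentials in a separate module that is excluded from version control and has restrictive file permissions, and import it.

    # credentials.py  (made-up example; keep it out of version control, chmod 600)
    #   username = "fred"
    #   password = "s3cret"

    # main script
    from credentials import username, password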


    回答 6

    base64是满足您简单需求的方法。无需导入任何内容:

    >>> 'your string'.encode('base64')
    'eW91ciBzdHJpbmc=\n'
    >>> _.decode('base64')
    'your string'
    

    base64 is the way to go for your simple needs. There is no need to import anything:

    >>> 'your string'.encode('base64')
    'eW91ciBzdHJpbmc=\n'
    >>> _.decode('base64')
    'your string'
    

    回答 7

    对于python3混淆,使用base64方式有所不同:

    import base64
    base64.b64encode(b'PasswordStringAsStreamOfBytes')

    导致

    b'UGFzc3dvcmRTdHJpbmdBc1N0cmVhbU9mQnl0ZXM='

    注意非正式的字符串表示形式,实际的字符串用引号引起来

    并解码回原始字符串

    base64.b64decode(b'UGFzc3dvcmRTdHJpbmdBc1N0cmVhbU9mQnl0ZXM=')
    b'PasswordStringAsStreamOfBytes'

    在需要字符串对象的地方使用此结果,可以翻译字节对象

    repr = base64.b64decode(b'UGFzc3dvcmRTdHJpbmdBc1N0cmVhbU9mQnl0ZXM=')
    secret = repr.decode('utf-8')
    print(secret)

    有关python3如何处理字节(以及相应的字符串)的更多信息,请参见官方文档

    for python3 obfuscation using base64 is done differently:

    import base64
    base64.b64encode(b'PasswordStringAsStreamOfBytes')
    

    which results in

    b'UGFzc3dvcmRTdHJpbmdBc1N0cmVhbU9mQnl0ZXM='
    

    note the informal string representation, the actual string is in quotes

    and decoding back to the original string

    base64.b64decode(b'UGFzc3dvcmRTdHJpbmdBc1N0cmVhbU9mQnl0ZXM=')
    b'PasswordStringAsStreamOfBytes'
    

    to use this result where string objects are required the bytes object can be translated

    repr = base64.b64decode(b'UGFzc3dvcmRTdHJpbmdBc1N0cmVhbU9mQnl0ZXM=')
    secret = repr.decode('utf-8')
    print(secret)
    

    for more information on how python3 handles bytes (and strings accordingly) please see the official documentation.


    回答 8

    这是一个很常见的问题。通常,您能做的最好的就是

    A)创建某种ceasar密码函数来进行编码/解码(但不是rot13)或

    B)首选方法是在程序可及的范围内使用加密密钥对密码进行编码/解码。您可以在其中使用文件保护来保护访问密钥。

    如果您的应用程序作为服务/守护程序(例如Web服务器)运行,则可以将密钥放入密码保护的密钥库中,并在服务启动过程中输入密码。管理员需要重新启动您的应用程序,但是您对配置密码的保护非常好。

    This is a pretty common problem. Typically the best you can do is to either

    A) create some kind of Caesar cipher function to encode/decode (just not rot13) or

    B) the preferred method is to use an encryption key, within reach of your program, to encode/decode the password. Here you can use file protection to protect access to the key.

    Along those lines, if your app runs as a service/daemon (like a webserver) you can put your key into a password-protected keystore with the password input as part of the service startup. It’ll take an admin to restart your app, but you will have really good protection for your configuration passwords.


    回答 9

    您的操作系统可能提供了用于安全加密数据的工具。例如,在Windows上有DPAPI(数据保护API)。为什么不在第一次运行时要求用户提供其凭据,然后将其松散加密以进行后续运行?

    Your operating system probably provides facilities for encrypting data securely. For instance, on Windows there is DPAPI (data protection API). Why not ask the user for their credentials the first time you run then squirrel them away encrypted for subsequent runs?


    回答 10

    更多本地化的方式,而不是将身份验证/密码/用户名转换为加密的详细信息。FTPLIB只是示例。“ pass.csv ”是csv文件名

    将密码保存为CSV格式,如下所示:

    用户名

    用户密码

    (无列标题)

    读取CSV并将其保存到列表中。

    使用列表元素作为认证详细信息。

    完整代码。

    import os
    import ftplib
    import csv 
    cred_detail = []
    os.chdir("Folder where the csv file is stored")
    for row in csv.reader(open("pass.csv","rb")):       
            cred_detail.append(row)
    ftp = ftplib.FTP('server_name',cred_detail[0][0],cred_detail[1][0])

    A more homegrown approach, rather than converting authentication/passwords/username to encrypted details. FTPLIB is just the example. “pass.csv” is the csv file name.

    Save password in CSV like below :

    user_name

    user_password

    (With no column heading)

    Reading the CSV and saving it to a list.

    Using list elements as authentication details.

    Full code.

    import os
    import ftplib
    import csv 
    cred_detail = []
    os.chdir("Folder where the csv file is stored")
    for row in csv.reader(open("pass.csv","rb")):       
            cred_detail.append(row)
    ftp = ftplib.FTP('server_name',cred_detail[0][0],cred_detail[1][0])
    

    回答 11

    这是我的摘录。您基本上是将函数导入或复制到代码中。如果加密文件不存在,getCredentials将创建该加密文件并返回命令,updateCredential将更新。

    import os
    
    def getCredentials():
        import base64
    
        splitter='<PC+,DFS/-SHQ.R'
        directory='C:\\PCT'
    
        if not os.path.exists(directory):
            os.makedirs(directory)
    
        try:
            with open(directory+'\\Credentials.txt', 'r') as file:
                cred = file.read()
                file.close()
        except:
            print('I could not file the credentials file. \nSo I dont keep asking you for your email and password everytime you run me, I will be saving an encrypted file at {}.\n'.format(directory))
    
            lanid = base64.b64encode(bytes(input('   LanID: '), encoding='utf-8')).decode('utf-8')  
            email = base64.b64encode(bytes(input('   eMail: '), encoding='utf-8')).decode('utf-8')
            password = base64.b64encode(bytes(input('   PassW: '), encoding='utf-8')).decode('utf-8')
            cred = lanid+splitter+email+splitter+password
            with open(directory+'\\Credentials.txt','w+') as file:
                file.write(cred)
                file.close()
    
        return {'lanid':base64.b64decode(bytes(cred.split(splitter)[0], encoding='utf-8')).decode('utf-8'),
                'email':base64.b64decode(bytes(cred.split(splitter)[1], encoding='utf-8')).decode('utf-8'),
                'password':base64.b64decode(bytes(cred.split(splitter)[2], encoding='utf-8')).decode('utf-8')}
    
    def updateCredentials():
        import base64
    
        splitter='<PC+,DFS/-SHQ.R'
        directory='C:\\PCT'
    
        if not os.path.exists(directory):
            os.makedirs(directory)
    
        print('I will be saving an encrypted file at {}.\n'.format(directory))
    
        lanid = base64.b64encode(bytes(input('   LanID: '), encoding='utf-8')).decode('utf-8')  
        email = base64.b64encode(bytes(input('   eMail: '), encoding='utf-8')).decode('utf-8')
        password = base64.b64encode(bytes(input('   PassW: '), encoding='utf-8')).decode('utf-8')
        cred = lanid+splitter+email+splitter+password
        with open(directory+'\\Credentials.txt','w+') as file:
            file.write(cred)
            file.close()
    
    cred = getCredentials()
    
    updateCredentials()

    Here is my snippet for such a thing. You basically import or copy the functions into your code. getCredentials will create the encrypted file if it does not exist and return a dictionary, and updateCredentials will update it.

    import os
    
    def getCredentials():
        import base64
    
        splitter='<PC+,DFS/-SHQ.R'
        directory='C:\\PCT'
    
        if not os.path.exists(directory):
            os.makedirs(directory)
    
        try:
            with open(directory+'\\Credentials.txt', 'r') as file:
                cred = file.read()
                file.close()
        except:
            print('I could not file the credentials file. \nSo I dont keep asking you for your email and password everytime you run me, I will be saving an encrypted file at {}.\n'.format(directory))
    
            lanid = base64.b64encode(bytes(input('   LanID: '), encoding='utf-8')).decode('utf-8')  
            email = base64.b64encode(bytes(input('   eMail: '), encoding='utf-8')).decode('utf-8')
            password = base64.b64encode(bytes(input('   PassW: '), encoding='utf-8')).decode('utf-8')
            cred = lanid+splitter+email+splitter+password
            with open(directory+'\\Credentials.txt','w+') as file:
                file.write(cred)
                file.close()
    
        return {'lanid':base64.b64decode(bytes(cred.split(splitter)[0], encoding='utf-8')).decode('utf-8'),
                'email':base64.b64decode(bytes(cred.split(splitter)[1], encoding='utf-8')).decode('utf-8'),
                'password':base64.b64decode(bytes(cred.split(splitter)[2], encoding='utf-8')).decode('utf-8')}
    
    def updateCredentials():
        import base64
    
        splitter='<PC+,DFS/-SHQ.R'
        directory='C:\\PCT'
    
        if not os.path.exists(directory):
            os.makedirs(directory)
    
        print('I will be saving an encrypted file at {}.\n'.format(directory))
    
        lanid = base64.b64encode(bytes(input('   LanID: '), encoding='utf-8')).decode('utf-8')  
        email = base64.b64encode(bytes(input('   eMail: '), encoding='utf-8')).decode('utf-8')
        password = base64.b64encode(bytes(input('   PassW: '), encoding='utf-8')).decode('utf-8')
        cred = lanid+splitter+email+splitter+password
        with open(directory+'\\Credentials.txt','w+') as file:
            file.write(cred)
            file.close()
    
    cred = getCredentials()
    
    updateCredentials()
    

    回答 12

    将配置信息放置在加密的配置文件中。使用键在代码中查询此信息。将该密钥放在每个环境的单独文件中,不要将其与代码一起存储。

    Place the configuration information in an encrypted config file. Query this info in your code using a key. Place this key in a separate file per environment, and don’t store it with your code.


    回答 13

    你知道坑吗?

    https://pypi.python.org/pypi/pit(仅适用于py2(0.3版))

    https://github.com/yoshiori/pit(它将在py3上运行(当前版本0.4))

    test.py

    from pit import Pit
    
    config = Pit.get('section-name', {'require': {
        'username': 'DEFAULT STRING',
        'password': 'DEFAULT STRING',
        }})
    print(config)

    跑:

    $ python test.py
    {'password': 'my-password', 'username': 'my-name'}

    ~/.pit/default.yml:

    section-name:
      password: my-password
      username: my-name

    Do you know pit?

    https://pypi.python.org/pypi/pit (py2 only (version 0.3))

    https://github.com/yoshiori/pit (it will work on py3 (current version 0.4))

    test.py

    from pit import Pit
    
    config = Pit.get('section-name', {'require': {
        'username': 'DEFAULT STRING',
        'password': 'DEFAULT STRING',
        }})
    print(config)
    

    Run:

    $ python test.py
    {'password': 'my-password', 'username': 'my-name'}
    

    ~/.pit/default.yml:

    section-name:
      password: my-password
      username: my-name
    

    回答 14

    如果在Windows上运行,则可以考虑使用win32crypt库。它允许运行脚本的用户存储和检索受保护的数据(键,密码),因此,密码永远不会以明文或混淆格式存储在代码中。我不确定其他平台是否有等效的实现,因此如果严格使用win32crypt,您的代码将无法移植。

    我相信可以在这里获得该模块:http://timgolden.me.uk/pywin32-docs/win32crypt.html

    If running on Windows, you could consider using win32crypt library. It allows storage and retrieval of protected data (keys, passwords) by the user that is running the script, thus passwords are never stored in clear text or obfuscated format in your code. I am not sure if there is an equivalent implementation for other platforms, so with the strict use of win32crypt your code is not portable.

    I believe the module can be obtained here: http://timgolden.me.uk/pywin32-docs/win32crypt.html


    回答 15

    我执行此操作的方法如下:

    在python shell上:

    >>> from cryptography.fernet import Fernet
    >>> key = Fernet.generate_key()
    >>> print(key)
    b'B8XBLJDiroM3N2nCBuUlzPL06AmfV4XkPJ5OKsPZbC4='
    >>> cipher = Fernet(key)
    >>> password = "thepassword".encode('utf-8')
    >>> token = cipher.encrypt(password)
    >>> print(token)
    b'gAAAAABe_TUP82q1zMR9SZw1LpawRLHjgNLdUOmW31RApwASzeo4qWSZ52ZBYpSrb1kUeXNFoX0tyhe7kWuudNs2Iy7vUwaY7Q=='

    然后,使用以下代码创建一个模块:

    from cryptography.fernet import Fernet
    
    # you store the key and the token
    key = b'B8XBLJDiroM3N2nCBuUlzPL06AmfV4XkPJ5OKsPZbC4='
    token = b'gAAAAABe_TUP82q1zMR9SZw1LpawRLHjgNLdUOmW31RApwASzeo4qWSZ52ZBYpSrb1kUeXNFoX0tyhe7kWuudNs2Iy7vUwaY7Q=='
    
    # create a cipher and decrypt when you need your password
    cipher = Fernet(key)
    
    mypassword = cipher.decrypt(token).decode('utf-8')

    完成此操作后,您可以直接导入mypassword,也可以导入令牌和密码以根据需要进行解密。

    显然,这种方法有一些缺点。如果某人同时拥有令牌和密钥(就像他们拥有脚本一样),则他们可以轻松解密。但是,它的确模糊不清,如果您编译代码(使用Nuitka之类的代码),则至少您的密码不会在十六进制编辑器中显示为纯文本。

    A way that I have done this is as follows:

    At the python shell:

    >>> from cryptography.fernet import Fernet
    >>> key = Fernet.generate_key()
    >>> print(key)
    b'B8XBLJDiroM3N2nCBuUlzPL06AmfV4XkPJ5OKsPZbC4='
    >>> cipher = Fernet(key)
    >>> password = "thepassword".encode('utf-8')
    >>> token = cipher.encrypt(password)
    >>> print(token)
    b'gAAAAABe_TUP82q1zMR9SZw1LpawRLHjgNLdUOmW31RApwASzeo4qWSZ52ZBYpSrb1kUeXNFoX0tyhe7kWuudNs2Iy7vUwaY7Q=='
    

    Then, create a module with the following code:

    from cryptography.fernet import Fernet
    
    # you store the key and the token
    key = b'B8XBLJDiroM3N2nCBuUlzPL06AmfV4XkPJ5OKsPZbC4='
    token = b'gAAAAABe_TUP82q1zMR9SZw1LpawRLHjgNLdUOmW31RApwASzeo4qWSZ52ZBYpSrb1kUeXNFoX0tyhe7kWuudNs2Iy7vUwaY7Q=='
    
    # create a cipher and decrypt when you need your password
    cipher = Fernet(key)
    
    mypassword = cipher.decrypt(token).decode('utf-8')
    

    Once you’ve done this, you can either import mypassword directly or you can import the token and cipher to decrypt as needed.

    Obviously, there are some shortcomings to this approach. If someone has both the token and the key (as they would if they have the script), they can decrypt easily. However it does obfuscate, and if you compile the code (with something like Nuitka) at least your password won’t appear as plain text in a hex editor.


    回答 16

    这并不能完全回答您的问题,但却是相关的。我本来想添加评论,但不允许。我一直在处理同一问题,因此我们决定使用Jenkins将脚本公开给用户。这使我们可以将数据库凭据存储在单独的文件中,该文件在服务器上已加密并受保护,并且非管理员无法访问。它还为我们提供了一些创建UI和限制执行的捷径。

    This doesn’t precisely answer your question, but it’s related. I was going to add as a comment but wasn’t allowed. I’ve been dealing with this same issue, and we have decided to expose the script to the users using Jenkins. This allows us to store the db credentials in a separate file that is encrypted and secured on a server and not accessible to non-admins. It also allows us a bit of a shortcut to creating a UI, and throttling execution.


    回答 17

    您还可以考虑将密码存储在脚本外部并在运行时提供密码的可能性

    例如fred.py

    import os
    username = 'fred'
    password = os.environ.get('PASSWORD', '')
    print(username, password)

    可以像

    $ PASSWORD=password123 python fred.py
    fred password123

    可以通过使用base64(如上所述),在代码中使用不太明显的名称以及使实际密码与代码之间的距离进一步达到“通过模糊性实现安全性” 的目的。

    如果代码位于存储库中,通常将机密存储在存储库之外很有用,因此可以将其添加到~/.bashrc(或添加到Vault或启动脚本中,…)

    export SURNAME=cGFzc3dvcmQxMjM=

    并更改fred.py

    import os
    import base64
    name = 'fred'
    surname = base64.b64decode(os.environ.get('SURNAME', '')).decode('utf-8')
    print(name, surname)

    然后重新登录并

    $ python fred.py
    fred password123

    You could also consider the possibility of storing the password outside the script, and supplying it at runtime

    e.g. fred.py

    import os
    username = 'fred'
    password = os.environ.get('PASSWORD', '')
    print(username, password)
    

    which can be run like

    $ PASSWORD=password123 python fred.py
    fred password123
    

    Extra layers of “security through obscurity” can be achieved by using base64 (as suggested above), using less obvious names in the code and further distancing the actual password from the code.

    If the code is in a repository, it is often useful to store secrets outside it, so one could add this to ~/.bashrc (or to a vault, or a launch script, …)

    export SURNAME=cGFzc3dvcmQxMjM=
    

    and change fred.py to

    import os
    import base64
    name = 'fred'
    surname = base64.b64decode(os.environ.get('SURNAME', '')).decode('utf-8')
    print(name, surname)
    

    then re-login and

    $ python fred.py
    fred password123
    

    回答 18

    为什么不拥有简单的异或?

    优点:

    • 看起来像二进制数据
    • 任何人都无法在不知道键的情况下读取它(即使它是一个字符)

    我到了可以识别普通单词和rot13的简单b64字符串的地步。Xor会让它变得更加困难。

    Why not have a simple xor?

    Advantages:

    • looks like binary data
    • no one can read it without knowing the key (even if it’s a single char)

    I get to the point where I recognize simple b64 strings for common words and rot13 as well. Xor would make it much harder.
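    A minimal sketch of the idea (the key and the password are made-up examples; this is obfuscation, not real security): XOR each byte with a repeating key, and apply the same function again to get the original back.

    from itertools import cycle

    def xor_obfuscate(data: bytes, key: bytes) -> bytes:
        return bytes(b ^ k for b, k in zip(data, cycle(key)))

    token = xor_obfuscate(b"password123", b"mykey")   # looks like binary data
    print(token)
    print(xor_obfuscate(token, b"mykey"))             # b'password123'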


    回答 19

    import base64
    print(base64.b64encode("password".encode("utf-8")))
    print(base64.b64decode(b'cGFzc3dvcmQ='.decode("utf-8")))
    import base64
    print(base64.b64encode("password".encode("utf-8")))
    print(base64.b64decode(b'cGFzc3dvcmQ='.decode("utf-8")))
    

    回答 20

    在网上有几种用Python编写的ROT13实用程序-只是谷歌搜索它们。ROT13离线编码字符串,将其复制到源中,然后在传输点解码。

    但这确实是薄弱的保护…

    There are several ROT13 utilities written in Python on the ‘Net — just google for them. ROT13 encode the string offline, copy it into the source, decode at point of transmission.

    But this is really weak protection…


    从IPython Notebook中的日志记录模块获取输出

    问题:从IPython Notebook中的日志记录模块获取输出

    当我在IPython Notebook中运行以下命令时,看不到任何输出:

    import logging
    logging.basicConfig(level=logging.DEBUG)
    logging.debug("test")
    

    有人知道怎么做,这样我才能在笔记本中看到“测试”消息吗?

    When I running the following inside IPython Notebook I don’t see any output:

    import logging
    logging.basicConfig(level=logging.DEBUG)
    logging.debug("test")
    

    Anyone know how to make it so I can see the “test” message inside the notebook?


    回答 0

    请尝试以下操作:

    import logging
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    logging.debug("test")
    

    根据logging.basicConfig

    通过创建带有默认Formatter的StreamHandler并将其添加到根记录器,对记录系统进行基本配置。如果没有为根记录器定义处理程序,则debug(),info(),warning(),error()和critical()函数将自动调用basicConfig()。

    如果根记录器已经为其配置了处理程序,则此功能不执行任何操作。

    似乎ipython笔记本在某处调用basicConfig(或设置处理程序)。

    Try following:

    import logging
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    logging.debug("test")
    

    According to logging.basicConfig:

    Does basic configuration for the logging system by creating a StreamHandler with a default Formatter and adding it to the root logger. The functions debug(), info(), warning(), error() and critical() will call basicConfig() automatically if no handlers are defined for the root logger.

    This function does nothing if the root logger already has handlers configured for it.

    It seems like ipython notebook call basicConfig (or set handler) somewhere.


    回答 1

    如果仍要使用basicConfig,请像这样重新加载日志记录模块

    from importlib import reload  # Not needed in Python 2
    import logging
    reload(logging)
    logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.DEBUG, datefmt='%I:%M:%S')
    

    If you still want to use basicConfig, reload the logging module like this

    from importlib import reload  # Not needed in Python 2
    import logging
    reload(logging)
    logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.DEBUG, datefmt='%I:%M:%S')
    

    回答 2

    我的理解是IPython会话开始记录日志,因此basicConfig不起作用。这是对我有用的设置(我希望这看起来不太好,因为我想将其用于几乎所有笔记本电脑):

    import logging
    logger = logging.getLogger()
    fhandler = logging.FileHandler(filename='mylog.log', mode='a')
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    fhandler.setFormatter(formatter)
    logger.addHandler(fhandler)
    logger.setLevel(logging.DEBUG)
    

    现在,当我运行时:

    logging.error('hello!')
    logging.debug('This is a debug message')
    logging.info('this is an info message')
    logging.warning('tbllalfhldfhd, warning.')
    

    我在与笔记本相同的目录中得到一个“ mylog.log”文件,其中包含:

    2015-01-28 09:49:25,026 - root - ERROR - hello!
    2015-01-28 09:49:25,028 - root - DEBUG - This is a debug message
    2015-01-28 09:49:25,029 - root - INFO - this is an info message
    2015-01-28 09:49:25,032 - root - WARNING - tbllalfhldfhd, warning.
    

    请注意,如果您在不重新启动IPython会话的情况下重新运行它,则会将重复的条目写入文件,因为现在将定义两个文件处理程序

    My understanding is that the IPython session starts up logging so basicConfig doesn’t work. Here is the setup that works for me (I wish this was not so gross looking since I want to use it for almost all my notebooks):

    import logging
    logger = logging.getLogger()
    fhandler = logging.FileHandler(filename='mylog.log', mode='a')
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    fhandler.setFormatter(formatter)
    logger.addHandler(fhandler)
    logger.setLevel(logging.DEBUG)
    

    Now when I run:

    logging.error('hello!')
    logging.debug('This is a debug message')
    logging.info('this is an info message')
    logging.warning('tbllalfhldfhd, warning.')
    

    I get a “mylog.log” file in the same directory as my notebook that contains:

    2015-01-28 09:49:25,026 - root - ERROR - hello!
    2015-01-28 09:49:25,028 - root - DEBUG - This is a debug message
    2015-01-28 09:49:25,029 - root - INFO - this is an info message
    2015-01-28 09:49:25,032 - root - WARNING - tbllalfhldfhd, warning.
    

    Note that if you rerun this without restarting the IPython session it will write duplicate entries to the file since there would now be two file handlers defined


    回答 3

    请记住,stderr是logging模块的默认流,因此在IPython和Jupyter笔记本中,除非将流配置为stdout,否则可能看不到任何内容:

    import logging
    import sys
    
    logging.basicConfig(format='%(asctime)s | %(levelname)s : %(message)s',
                         level=logging.INFO, stream=sys.stdout)
    
    logging.info('Hello world!')
    

    Bear in mind that stderr is the default stream for the logging module, so in IPython and Jupyter notebooks you might not see anything unless you configure the stream to stdout:

    import logging
    import sys
    
    logging.basicConfig(format='%(asctime)s | %(levelname)s : %(message)s',
                         level=logging.INFO, stream=sys.stdout)
    
    logging.info('Hello world!')
    

    回答 4

    现在对我有用的(Jupyter,笔记本服务器是:5.4.1,IPython 7.0.1)

    import logging
    logging.basicConfig()
    logger = logging.getLogger('Something')
    logger.setLevel(logging.DEBUG)
    

    现在,我可以使用记录器来打印信息,否则,我只会看到默认级别(logging.WARNING)或更高级别的消息。

    What worked for me now (Jupyter, notebook server is: 5.4.1, IPython 7.0.1)

    import logging
    logging.basicConfig()
    logger = logging.getLogger('Something')
    logger.setLevel(logging.DEBUG)
    

    Now I can use logger to print info, otherwise I would see only message from the default level (logging.WARNING) or above.


    回答 5

    您可以通过运行 %config Application.log_level="INFO" 来配置日志记录

    有关更多信息,请参见IPython内核选项。

    You can configure logging by running %config Application.log_level="INFO"

    For more information, see IPython kernel options


    回答 6

    我设置了一个记录器,既要写入文件,也希望它显示在笔记本中。事实证明,添加文件处理程序会清除默认的流处理程序。

    logger = logging.getLogger()
    
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    
    # Setup file handler
    fhandler  = logging.FileHandler('my.log')
    fhandler.setLevel(logging.DEBUG)
    fhandler.setFormatter(formatter)
    
    # Configure stream handler for the cells
    chandler = logging.StreamHandler()
    chandler.setLevel(logging.DEBUG)
    chandler.setFormatter(formatter)
    
    # Add both handlers
    logger.addHandler(fhandler)
    logger.addHandler(chandler)
    logger.setLevel(logging.DEBUG)
    
    # Show the handlers
    logger.handlers
    
    # Log Something
    logger.info("Test info")
    logger.debug("Test debug")
    logger.error("Test error")

    I set up a logger to write to a file and I also wanted it to show up in the notebook. Turns out adding a file handler clears out the default stream handler.

    logger = logging.getLogger()
    
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    
    # Setup file handler
    fhandler  = logging.FileHandler('my.log')
    fhandler.setLevel(logging.DEBUG)
    fhandler.setFormatter(formatter)
    
    # Configure stream handler for the cells
    chandler = logging.StreamHandler()
    chandler.setLevel(logging.DEBUG)
    chandler.setFormatter(formatter)
    
    # Add both handlers
    logger.addHandler(fhandler)
    logger.addHandler(chandler)
    logger.setLevel(logging.DEBUG)
    
    # Show the handlers
    logger.handlers
    
    # Log Something
    logger.info("Test info")
    logger.debug("Test debug")
    logger.error("Test error")
    

    回答 7

    似乎适用于ipython / jupyter早期版本的解决方案不再起作用。

    这是适用于ipython 7.9.0的有效解决方案(也已通过jupyter服务器6.0.2测试):

    import logging
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    logging.debug("test message")
    
    DEBUG:root:test message

    It seems that solutions that worked for older versions of ipython/jupyter no longer work.

    Here is a working solution for ipython 7.9.0 (also tested with jupyter server 6.0.2):

    import logging
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    logging.debug("test message")
    
    DEBUG:root:test message
    

    根据布尔值列表过滤列表

    问题:根据布尔值列表过滤列表

    我有一个值列表,需要根据布尔值列表中的值进行过滤:

    list_a = [1, 2, 4, 6]
    filter = [True, False, True, False]
    

    我使用以下行生成一个新的过滤列表:

    filtered_list = [i for indx,i in enumerate(list_a) if filter[indx] == True]

    结果是:

    print filtered_list
    [1,4]
    

    这条线工作正常,但是(对我而言)看起来有些过分了,我想知道是否有更简单的方法来实现这一目标。


    忠告

    以下答案提供了两个好的建议:

    1 - 不要像我一样把列表命名为filter,因为它是内置函数。

    2 - 不要像我那样与True进行比较(if filter[idx]==True),因为这是不必要的。只需使用if filter[idx]就足够了。

    I have a list of values which I need to filter given the values in a list of booleans:

    list_a = [1, 2, 4, 6]
    filter = [True, False, True, False]
    

    I generate a new filtered list with the following line:

    filtered_list = [i for indx,i in enumerate(list_a) if filter[indx] == True]
    

    which results in:

    print filtered_list
    [1,4]
    

    The line works but looks (to me) a bit overkill and I was wondering if there was a simpler way to achieve the same.


    Advices

    Summary of two good advices given in the answers below:

    1- Don’t name a list filter like I did because it is a built-in function.

    2- Don’t compare things to True like I did with if filter[idx]==True.. since it’s unnecessary. Just using if filter[idx] is enough.


    回答 0

    您正在寻找itertools.compress

    >>> from itertools import compress
    >>> list_a = [1, 2, 4, 6]
    >>> fil = [True, False, True, False]
    >>> list(compress(list_a, fil))
    [1, 4]
    

    时序比较(py3.x):

    >>> list_a = [1, 2, 4, 6]
    >>> fil = [True, False, True, False]
    >>> %timeit list(compress(list_a, fil))
    100000 loops, best of 3: 2.58 us per loop
    >>> %timeit [i for (i, v) in zip(list_a, fil) if v]  #winner
    100000 loops, best of 3: 1.98 us per loop
    
    >>> list_a = [1, 2, 4, 6]*100
    >>> fil = [True, False, True, False]*100
    >>> %timeit list(compress(list_a, fil))              #winner
    10000 loops, best of 3: 24.3 us per loop
    >>> %timeit [i for (i, v) in zip(list_a, fil) if v]
    10000 loops, best of 3: 82 us per loop
    
    >>> list_a = [1, 2, 4, 6]*10000
    >>> fil = [True, False, True, False]*10000
    >>> %timeit list(compress(list_a, fil))              #winner
    1000 loops, best of 3: 1.66 ms per loop
    >>> %timeit [i for (i, v) in zip(list_a, fil) if v] 
    100 loops, best of 3: 7.65 ms per loop
    

    不要filter用作变量名,它是一个内置函数。

    You’re looking for itertools.compress:

    >>> from itertools import compress
    >>> list_a = [1, 2, 4, 6]
    >>> fil = [True, False, True, False]
    >>> list(compress(list_a, fil))
    [1, 4]
    

    Timing comparisons(py3.x):

    >>> list_a = [1, 2, 4, 6]
    >>> fil = [True, False, True, False]
    >>> %timeit list(compress(list_a, fil))
    100000 loops, best of 3: 2.58 us per loop
    >>> %timeit [i for (i, v) in zip(list_a, fil) if v]  #winner
    100000 loops, best of 3: 1.98 us per loop
    
    >>> list_a = [1, 2, 4, 6]*100
    >>> fil = [True, False, True, False]*100
    >>> %timeit list(compress(list_a, fil))              #winner
    10000 loops, best of 3: 24.3 us per loop
    >>> %timeit [i for (i, v) in zip(list_a, fil) if v]
    10000 loops, best of 3: 82 us per loop
    
    >>> list_a = [1, 2, 4, 6]*10000
    >>> fil = [True, False, True, False]*10000
    >>> %timeit list(compress(list_a, fil))              #winner
    1000 loops, best of 3: 1.66 ms per loop
    >>> %timeit [i for (i, v) in zip(list_a, fil) if v] 
    100 loops, best of 3: 7.65 ms per loop
    

    Don’t use filter as a variable name, it is a built-in function.


    回答 1

    像这样:

    filtered_list = [i for (i, v) in zip(list_a, filter) if v]

    使用zip是在多个序列上并行迭代的pythonic方式,无需任何索引。这里假设两个序列的长度相同(最短的一个用完后zip就停止)。对这种简单情况使用itertools有点过分了……

    在这个示例中,您真正应该停止做的一件事是把东西与True进行比较,这通常是不必要的。与其写if filter[idx]==True: ...,您可以简单地写if filter[idx]: ...。

    Like so:

    filtered_list = [i for (i, v) in zip(list_a, filter) if v]
    

    Using zip is the pythonic way to iterate over multiple sequences in parallel, without needing any indexing. This assumes both sequences have the same length (zip stops after the shortest runs out). Using itertools for such a simple case is a bit overkill …

    One thing you do in your example you should really stop doing is comparing things to True, this is usually not necessary. Instead of if filter[idx]==True: ..., you can simply write if filter[idx]: ....


    回答 2

    使用numpy:

    In [128]: list_a = np.array([1, 2, 4, 6])
    In [129]: filter = np.array([True, False, True, False])
    In [130]: list_a[filter]
    
    Out[130]: array([1, 4])
    

    或者,如果list_a可以是一个numpy数组但不能过滤,请查看Alex Szatmary的答案

    Numpy通常也可以大大提高速度

    In [133]: list_a = [1, 2, 4, 6]*10000
    In [134]: fil = [True, False, True, False]*10000
    In [135]: list_a_np = np.array(list_a)
    In [136]: fil_np = np.array(fil)
    
    In [139]: %timeit list(itertools.compress(list_a, fil))
    1000 loops, best of 3: 625 us per loop
    
    In [140]: %timeit list_a_np[fil_np]
    10000 loops, best of 3: 173 us per loop
    

    With numpy:

    In [128]: list_a = np.array([1, 2, 4, 6])
    In [129]: filter = np.array([True, False, True, False])
    In [130]: list_a[filter]
    
    Out[130]: array([1, 4])
    

    or see Alex Szatmary’s answer if list_a can be a numpy array but not filter

    Numpy usually gives you a big speed boost as well

    In [133]: list_a = [1, 2, 4, 6]*10000
    In [134]: fil = [True, False, True, False]*10000
    In [135]: list_a_np = np.array(list_a)
    In [136]: fil_np = np.array(fil)
    
    In [139]: %timeit list(itertools.compress(list_a, fil))
    1000 loops, best of 3: 625 us per loop
    
    In [140]: %timeit list_a_np[fil_np]
    10000 loops, best of 3: 173 us per loop
    

    回答 3

    为此,请使用numpy,即,如果您有一个数组a,而不是list_a

    a = np.array([1, 2, 4, 6])
    my_filter = np.array([True, False, True, False], dtype=bool)
    a[my_filter]
    > array([1, 4])
    

    To do this using numpy, ie, if you have an array, a, instead of list_a:

    a = np.array([1, 2, 4, 6])
    my_filter = np.array([True, False, True, False], dtype=bool)
    a[my_filter]
    > array([1, 4])
    

    回答 4

    filtered_list = [list_a[i] for i in range(len(list_a)) if filter[i]]
    filtered_list = [list_a[i] for i in range(len(list_a)) if filter[i]]
    

    回答 5

    使用python 3,您可以list_a[filter]用来获取True值。要获得False价值,请使用list_a[~filter]

    With python 3 you can use list_a[filter] to get True values. To get False values use list_a[~filter]


    Python中的高性能模糊字符串比较,使用Levenshtein或difflib

    问题:Python中的高性能模糊字符串比较,使用Levenshtein或difflib

    我正在进行临床消息标准化(拼写检查),其中我对照900,000个单词的医学词典检查每个给定的单词。我更关心时间的复杂性/性能。

    我想进行模糊字符串比较,但是不确定使用哪个库。

    选项1:

    import Levenshtein
    Levenshtein.ratio('hello world', 'hello')
    
    Result: 0.625

    选项2:

    import difflib
    difflib.SequenceMatcher(None, 'hello world', 'hello').ratio()
    
    Result: 0.625

    在此示例中,两者给出相同的答案。您是否认为在这种情况下两者表现都一样?

    I am doing clinical message normalization (spell check) in which I check each given word against 900,000 word medical dictionary. I am more concern about the time complexity/performance.

    I want to do fuzzy string comparison, but I’m not sure which library to use.

    Option 1:

    import Levenshtein
    Levenshtein.ratio('hello world', 'hello')
    
    Result: 0.625
    

    Option 2:

    import difflib
    difflib.SequenceMatcher(None, 'hello world', 'hello').ratio()
    
    Result: 0.625
    

    In this example both give the same answer. Do you think both perform alike in this case?


    回答 0

    如果您想对Levenshtein和Difflib的相似性进行快速的视觉比较,我计算了约230万本书的书名:

    import codecs, difflib, Levenshtein, distance
    
    with codecs.open("titles.tsv","r","utf-8") as f:
        title_list = f.read().split("\n")[:-1]
    
        for row in title_list:
    
            sr      = row.lower().split("\t")
    
            diffl   = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
            lev     = Levenshtein.ratio(sr[3], sr[4]) 
            sor     = 1 - distance.sorensen(sr[3], sr[4])
            jac     = 1 - distance.jaccard(sr[3], sr[4])
    
            print diffl, lev, sor, jac

    然后,我用R绘制结果:

    出于好奇,我还比较了Difflib,Levenshtein,Sørensen和Jaccard相似度值:

    library(ggplot2)
    require(GGally)
    
    difflib <- read.table("similarity_measures.txt", sep = " ")
    colnames(difflib) <- c("difflib", "levenshtein", "sorensen", "jaccard")
    
    ggpairs(difflib)

    结果:

    Difflib / Levenshtein的相似性确实很有趣。

    2018编辑:如果您要识别相似的字符串,还可以查看minhashing-这里有一个很棒的概述。Minhashing在线性时间内在大型文本集合中发现相似之处非常了不起。我的实验室在这里组装了一个应用程序,该应用程序使用minhashing检测和可视化文本重用:https//github.com/YaleDHLab/intertext

    In case you’re interested in a quick visual comparison of Levenshtein and Difflib similarity, I calculated both for ~2.3 million book titles:

    import codecs, difflib, Levenshtein, distance
    
    with codecs.open("titles.tsv","r","utf-8") as f:
        title_list = f.read().split("\n")[:-1]
    
        for row in title_list:
    
            sr      = row.lower().split("\t")
    
            diffl   = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
            lev     = Levenshtein.ratio(sr[3], sr[4]) 
            sor     = 1 - distance.sorensen(sr[3], sr[4])
            jac     = 1 - distance.jaccard(sr[3], sr[4])
    
            print diffl, lev, sor, jac
    

    I then plotted the results with R:

    Strictly for the curious, I also compared the Difflib, Levenshtein, Sørensen, and Jaccard similarity values:

    library(ggplot2)
    require(GGally)
    
    difflib <- read.table("similarity_measures.txt", sep = " ")
    colnames(difflib) <- c("difflib", "levenshtein", "sorensen", "jaccard")
    
    ggpairs(difflib)
    

    Result:

    The Difflib / Levenshtein similarity really is quite interesting.

    2018 edit: If you’re working on identifying similar strings, you could also check out minhashing–there’s a great overview here. Minhashing is amazing at finding similarities in large text collections in linear time. My lab put together an app that detects and visualizes text reuse using minhashing here: https://github.com/YaleDHLab/intertext


    回答 1

    • difflib.SequenceMatcher使用Ratcliff / Obershelp算法,计算匹配字符的加倍数除以两个字符串中的字符总数。

    • Levenshtein使用Levenshtein算法,它计算将一个字符串转换为另一个字符串所需的最少编辑次数

    复杂

    SequenceMatcher是最坏情况下的二次时间,其预期情况下的行为以复杂的方式取决于序列共有多少个元素。(从这里

    Levenshtein为O(m * n),其中n和m是两个输入字符串的长度。

    性能

    根据Levenshtein模块的源代码:Levenshtein与difflib(SequenceMatcher)有一些重叠。它仅支持字符串,不支持任意序列类型,但另一方面,它要快得多。

    • difflib.SequenceMatcher uses the Ratcliff/Obershelp algorithm: it computes the doubled number of matching characters divided by the total number of characters in the two strings.

    • Levenshtein uses the Levenshtein algorithm: it computes the minimum number of edits needed to transform one string into the other.

    Complexity

    SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common. (from here)

    Levenshtein is O(m*n), where n and m are the length of the two input strings.

    Performance

    According to the source code of the Levenshtein module: Levenshtein has some overlap with difflib (SequenceMatcher). It supports only strings, not arbitrary sequence types, but on the other hand it’s much faster.
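    A quick micro-benchmark sketch (assuming the third-party python-Levenshtein package is installed; absolute numbers depend on the machine, but Levenshtein.ratio is typically much faster per call than difflib for short strings):

    import timeit

    setup = "import difflib, Levenshtein; a, b = 'hello world', 'hello'"

    print(timeit.timeit("difflib.SequenceMatcher(None, a, b).ratio()", setup=setup, number=100000))
    print(timeit.timeit("Levenshtein.ratio(a, b)", setup=setup, number=100000))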


    __getattr__在模块上

    问题:__getattr__在模块上

    如何在模块上实现与类上的__getattr__等效的功能?

    当调用模块的静态定义的属性中不存在的函数时,我希望在该模块中创建一个类的实例,并使用与该模块上的属性查找失败相同的名称调用该方法。

    class A(object):
        def salutation(self, accusative):
            print "hello", accusative
    
    # note this function is intentionally on the module, and not the class above
    def __getattr__(mod, name):
        return getattr(A(), name)
    
    if __name__ == "__main__":
        # i hope here to have my __getattr__ function above invoked, since
        # salutation does not exist in the current namespace
        salutation("world")

    这使:

    matt@stanley:~/Desktop$ python getattrmod.py 
    Traceback (most recent call last):
      File "getattrmod.py", line 9, in <module>
        salutation("world")
    NameError: name 'salutation' is not defined

    How can I implement the equivalent of a __getattr__ on a class, on a module?

    Example

    When calling a function that does not exist in a module’s statically defined attributes, I wish to create an instance of a class in that module, and invoke the method on it with the same name as failed in the attribute lookup on the module.

    class A(object):
        def salutation(self, accusative):
            print "hello", accusative
    
    # note this function is intentionally on the module, and not the class above
    def __getattr__(mod, name):
        return getattr(A(), name)
    
    if __name__ == "__main__":
        # i hope here to have my __getattr__ function above invoked, since
        # salutation does not exist in the current namespace
        salutation("world")
    

    Which gives:

    matt@stanley:~/Desktop$ python getattrmod.py 
    Traceback (most recent call last):
      File "getattrmod.py", line 9, in <module>
        salutation("world")
    NameError: name 'salutation' is not defined
    

    回答 0

    不久前,Guido 宣布新式类上的所有特殊方法查找都会绕过 __getattr__ 和 __getattribute__。Dunder 方法以前在模块上也能工作;例如,在这些技巧失效之前,你只需定义 __enter__ 和 __exit__ 就可以把模块当作上下文管理器使用。

    最近,一些历史功能卷土重来,其中就包括模块级的 __getattr__,因此不再需要原先的 hack(在导入时用一个类的实例替换 sys.modules 中的模块)。

    在 Python 3.7+ 中,只需使用一种显而易见的方法:要自定义模块上的属性访问,请在模块级别定义一个 __getattr__ 函数,它接受一个参数(属性名称),并返回计算出的值或引发 AttributeError:

    # my_module.py
    
    from typing import Any

    def __getattr__(name: str) -> Any:
        ...

    这也允许挂钩 “from” 导入,也就是说,你可以为诸如 from my_module import whatever 这样的语句返回动态生成的对象。

    与此相关,除了模块级 __getattr__,你还可以在模块级别定义一个 __dir__ 函数来响应 dir(my_module)。有关详细信息,请参见 PEP 562。

    A while ago, Guido declared that all special method lookups on new-style classes bypass __getattr__ and __getattribute__. Dunder methods had previously worked on modules – you could, for example, use a module as a context manager simply by defining __enter__ and __exit__, before those tricks broke.

    Recently some historical features have made a comeback, the module __getattr__ among them, and so the existing hack (a module replacing itself with a class in sys.modules at import time) should be no longer necessary.

    In Python 3.7+, you just use the one obvious way. To customize attribute access on a module, define a __getattr__ function at the module level which should accept one argument (name of attribute), and return the computed value or raise an AttributeError:

    # my_module.py
    
    from typing import Any

    def __getattr__(name: str) -> Any:
        ...
    

    This will also allow hooks into “from” imports, i.e. you can return dynamically generated objects for statements such as from my_module import whatever.

    On a related note, along with the module getattr you may also define a __dir__ function at module level to respond to dir(my_module). See PEP 562 for details.
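    Applied to the question’s example, a minimal Python 3.7+ sketch (the file name and fallback behavior are just illustrative) might look like this:

    # getattrmod.py
    from typing import Any

    class A:
        def salutation(self, accusative):
            print("hello", accusative)

    def __getattr__(name: str) -> Any:
        # Invoked only when normal lookup of an attribute on the module fails.
        return getattr(A(), name)

    and from another module:

    import getattrmod

    getattrmod.salutation("world")   # prints "hello world"

    One caveat worth noting: the module-level __getattr__ only hooks attribute access on the module object (getattrmod.salutation or from getattrmod import salutation); a bare salutation("world") inside getattrmod.py itself would still raise NameError, because plain name lookup in the module’s own namespace does not go through it.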


    回答 1

    您在这里遇到两个基本问题:

    1. __xxx__ 方法只在类上查找
    2. TypeError: can't set attributes of built-in/extension type 'module'

    (1)表示任何解决方案还必须跟踪正在检查的模块,否则每个模块将具有实例替换行为;(2)表示(1)甚至是不可能的……至少不是直接的。

    幸运的是,sys.modules 对放进去的东西并不挑剔,所以包装器是可行的,但只对模块访问有效(即 import somemodule; somemodule.salutation('world'));对于同一模块内部的访问,你基本上必须把方法从替换类中取出并添加到 globals() 里,要么通过类上的自定义方法(我喜欢用 .export()),要么通过一个通用函数(例如已经作为答案列出的那些)。需要记住一点:如果包装器每次都创建新实例,而 globals 方案不会,那么两者的行为会有微妙差别。哦,而且两者不能同时使用,只能二选一。


    更新资料

    来自 Guido van Rossum:

    实际上,有一种偶尔会用到并被推荐的 hack:模块可以定义一个具有所需功能的类,然后在最后用该类的实例替换 sys.modules 中的自身(如果你坚持,也可以用类本身,但通常没那么有用)。例如:

    # module foo.py
    
    import sys
    
    class Foo:
        def funct1(self, <args>): <code>
        def funct2(self, <args>): <code>
    
    sys.modules[__name__] = Foo()

    之所以可行,是因为导入机制主动支持这个 hack:在加载完成后的最后一步,它会把真正的模块从 sys.modules 中取出来。(这绝非偶然。这个 hack 很久以前就被提出,我们觉得值得在导入机制中支持它。)

    因此,完成所需操作的既定方法是:在模块中创建一个类,并在模块的最后用该类的实例替换 sys.modules[__name__];现在您就可以根据需要使用 __getattr__ / __setattr__ / __getattribute__ 了。


    注意1:如果您使用此功能,则在进行sys.modules分配时,模块中的所有其他内容(例如全局变量,其他函数等)都会丢失-因此请确保所需的所有内容都在替换类之内。

    注意 2:要支持 from module import *,您必须在类中定义 __all__;例如:

    class Foo:
        def funct1(self, <args>): <code>
        def funct2(self, <args>): <code>
        __all__ = list(set(vars().keys()) - {'__module__', '__qualname__'})

    根据您的 Python 版本,可能还有其他名称需要从 __all__ 中排除。如果不需要 Python 2 兼容性,可以省略 set()。

    There are two basic problems you are running into here:

    1. __xxx__ methods are only looked up on the class
    2. TypeError: can't set attributes of built-in/extension type 'module'

    (1) means any solution would have to also keep track of which module was being examined, otherwise every module would then have the instance-substitution behavior; and (2) means that (1) isn’t even possible… at least not directly.

    Fortunately, sys.modules is not picky about what goes there so a wrapper will work, but only for module access (i.e. import somemodule; somemodule.salutation('world')); for same-module access you pretty much have to yank the methods from the substitution class and add them to globals(), either with a custom method on the class (I like using .export()) or with a generic function (such as those already listed as answers). One thing to keep in mind: if the wrapper is creating a new instance each time, and the globals solution is not, you end up with subtly different behavior. Oh, and you don’t get to use both at the same time — it’s one or the other.


    Update

    From Guido van Rossum:

    There is actually a hack that is occasionally used and recommended: a module can define a class with the desired functionality, and then at the end, replace itself in sys.modules with an instance of that class (or with the class, if you insist, but that’s generally less useful). E.g.:

    # module foo.py
    
    import sys
    
    class Foo:
        def funct1(self, <args>): <code>
        def funct2(self, <args>): <code>
    
    sys.modules[__name__] = Foo()
    

    This works because the import machinery is actively enabling this hack, and as its final step pulls the actual module out of sys.modules, after loading it. (This is no accident. The hack was proposed long ago and we decided we liked it enough to support it in the import machinery.)

    So the established way to accomplish what you want is to create a single class in your module, and as the last act of the module replace sys.modules[__name__] with an instance of your class — and now you can play with __getattr__/__setattr__/__getattribute__ as needed.


    Note 1: If you use this functionality then anything else in the module, such as globals, other functions, etc., will be lost when the sys.modules assignment is made — so make sure everything needed is inside the replacement class.

    Note 2: To support from module import * you must have __all__ defined in the class; for example:

    class Foo:
        def funct1(self, <args>): <code>
        def funct2(self, <args>): <code>
        __all__ = list(set(vars().keys()) - {'__module__', '__qualname__'})
    

    Depending on your Python version, there may be other names to omit from __all__. The set() can be omitted if Python 2 compatibility is not needed.
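    Put together for the question’s salutation example, a minimal sketch of this replacement trick could look like the following (class and file names are just illustrative, and, per Note 1 above, everything the module should expose lives on the class):

    # getattrmod.py
    import sys

    class _ModuleInterface:
        def salutation(self, accusative):
            print("hello", accusative)

        def __getattr__(self, name):
            # Custom fallback for attributes that are not defined above.
            raise AttributeError("no such attribute: " + name)

    # The final act of the module: replace it with an instance of the class.
    sys.modules[__name__] = _ModuleInterface()

    After import getattrmod, a call such as getattrmod.salutation("world") is dispatched through the instance, so __getattr__ / __setattr__ / __getattribute__ behave exactly as they would on any other object.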


    回答 2

    这是一个技巧,但是您可以使用一个类包装模块:

    class Wrapper(object):
      def __init__(self, wrapped):
        self.wrapped = wrapped
      def __getattr__(self, name):
        # Perform custom logic here
        try:
          return getattr(self.wrapped, name)
        except AttributeError:
          return 'default' # Some sensible default
    
    sys.modules[__name__] = Wrapper(sys.modules[__name__])

    This is a hack, but you can wrap the module with a class:

    class Wrapper(object):
      def __init__(self, wrapped):
        self.wrapped = wrapped
      def __getattr__(self, name):
        # Perform custom logic here
        try:
          return getattr(self.wrapped, name)
        except AttributeError:
          return 'default' # Some sensible default
    
    sys.modules[__name__] = Wrapper(sys.modules[__name__])
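    Used from another module, the wrapped version keeps delegating existing attributes and falls back to the default for unknown ones; assuming the code above lives in a hypothetical wrapped_module.py:

    import wrapped_module

    print(wrapped_module.Wrapper)          # real attribute, delegated to the original module
    print(wrapped_module.does_not_exist)   # missing attribute, falls back to 'default'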
    

    回答 3

    我们通常不那样做。

    我们要做的就是这个。

    class A(object):
    ....
    
    # The implicit global instance
    a= A()
    
    def salutation( *arg, **kw ):
        a.salutation( *arg, **kw )

    为什么?使隐式全局实例可见。

    例如,查看random模块,该模块创建一个隐式全局实例,以稍微简化您需要“简单”随机数生成器的用例。

    We don’t usually do it that way.

    What we do is this.

    class A(object):
    ....
    
    # The implicit global instance
    a= A()
    
    def salutation( *arg, **kw ):
        a.salutation( *arg, **kw )
    

    Why? So that the implicit global instance is visible.

    For examples, look at the random module, which creates an implicit global instance to slightly simplify the use cases where you want a “simple” random number generator.
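    A small self-contained sketch of this pattern, in the same spirit as random (which binds its module-level functions to a hidden Random() instance); the module and class names here are made up for illustration:

    # greetings.py
    class Greeter:
        def __init__(self, greeting="hello"):
            self.greeting = greeting

        def salutation(self, accusative):
            print(self.greeting, accusative)

    # The implicit global instance, visible to callers who want to
    # inspect it, reconfigure it, or build their own Greeter instead.
    _inst = Greeter()

    def salutation(*args, **kwargs):
        return _inst.salutation(*args, **kwargs)

    Callers then simply write import greetings; greetings.salutation("world"), with no sys.modules tricks involved.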


    回答 4

    与 @Håvard S 提出的方案类似,当我需要在模块上实现一些魔术(例如 __getattr__)时,我会定义一个继承 types.ModuleType 的新类,并把它的实例放进 sys.modules(通常替换掉定义这个自定义 ModuleType 的那个模块)。

    请参阅 Werkzeug 的主 __init__.py 文件,其中有一个相当健壮的实现。

    Similar to what @Håvard S proposed, in a case where I needed to implement some magic on a module (like __getattr__), I would define a new class that inherits from types.ModuleType and put that in sys.modules (probably replacing the module where my custom ModuleType was defined).

    See the main __init__.py file of Werkzeug for a fairly robust implementation of this.
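    A rough sketch of the idea (not Werkzeug’s actual code): on reasonably recent CPython you can either construct such a module object and place it in sys.modules, or simply reassign the existing module’s __class__, which has the advantage of keeping the module’s globals intact:

    # my_module.py
    import sys
    import types

    class _MagicModule(types.ModuleType):
        def __getattr__(self, name):
            # Only called when normal lookup fails, so ordinary
            # globals defined in the module keep working as before.
            return "dynamically computed: " + name

    sys.modules[__name__].__class__ = _MagicModule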


    回答 5

    这有点 hack,但是……

    import types
    
    class A(object):
        def salutation(self, accusative):
            print "hello", accusative
    
        def farewell(self, greeting, accusative):
             print greeting, accusative
    
    def AddGlobalAttribute(classname, methodname):
        print "Adding " + classname + "." + methodname + "()"
        def genericFunction(*args):
            return globals()[classname]().__getattribute__(methodname)(*args)
        globals()[methodname] = genericFunction
    
    # set up the global namespace
    
    x = 0   # X and Y are here to add them implicitly to globals, so
    y = 0   # globals does not change as we iterate over it.
    
    toAdd = []
    
    def isCallableMethod(classname, methodname):
        someclass = globals()[classname]()
        something = someclass.__getattribute__(methodname)
        return callable(something)
    
    
    for x in globals():
        print "Looking at", x
        if isinstance(globals()[x], (types.ClassType, type)):
            print "Found Class:", x
            for y in dir(globals()[x]):
                if y.find("__") == -1: # hack to ignore default methods
                    if isCallableMethod(x,y):
                        if y not in globals(): # don't override existing global names
                            toAdd.append((x,y))
    
    
    for x in toAdd:
        AddGlobalAttribute(*x)
    
    
    if __name__ == "__main__":
        salutation("world")
        farewell("goodbye", "world")

    这段代码通过遍历全局命名空间中的所有对象来工作。如果某一项是类,就遍历该类的属性;如果属性可调用,就把它作为函数添加到全局命名空间。

    它忽略所有包含 “__” 的属性。

    我不会在生产代码中使用它,但是它应该可以帮助您入门。

    This is hackish, but…

    import types
    
    class A(object):
        def salutation(self, accusative):
            print "hello", accusative
    
        def farewell(self, greeting, accusative):
             print greeting, accusative
    
    def AddGlobalAttribute(classname, methodname):
        print "Adding " + classname + "." + methodname + "()"
        def genericFunction(*args):
            return globals()[classname]().__getattribute__(methodname)(*args)
        globals()[methodname] = genericFunction
    
    # set up the global namespace
    
    x = 0   # X and Y are here to add them implicitly to globals, so
    y = 0   # globals does not change as we iterate over it.
    
    toAdd = []
    
    def isCallableMethod(classname, methodname):
        someclass = globals()[classname]()
        something = someclass.__getattribute__(methodname)
        return callable(something)
    
    
    for x in globals():
        print "Looking at", x
        if isinstance(globals()[x], (types.ClassType, type)):
            print "Found Class:", x
            for y in dir(globals()[x]):
                if y.find("__") == -1: # hack to ignore default methods
                    if isCallableMethod(x,y):
                        if y not in globals(): # don't override existing global names
                            toAdd.append((x,y))
    
    
    for x in toAdd:
        AddGlobalAttribute(*x)
    
    
    if __name__ == "__main__":
        salutation("world")
        farewell("goodbye", "world")
    

    This works by iterating over all the objects in the global namespace. If the item is a class, it iterates over the class attributes. If the attribute is callable it adds it to the global namespace as a function.

    It ignores all attributes which contain “__”.

    I wouldn’t use this in production code, but it should get you started.


    回答 6

    这是我自己的一点微薄贡献,在 @Håvard S 获得高票的答案上稍加修饰,但更显式一些(因此 @S.Lott 也许能够接受,尽管对 OP 来说可能还不够好):

    import sys
    
    class A(object):
        def salutation(self, accusative):
            print "hello", accusative
    
    class Wrapper(object):
        def __init__(self, wrapped):
            self.wrapped = wrapped
    
        def __getattr__(self, name):
            try:
                return getattr(self.wrapped, name)
            except AttributeError:
                return getattr(A(), name)
    
    _globals = sys.modules[__name__] = Wrapper(sys.modules[__name__])
    
    if __name__ == "__main__":
        _globals.salutation("world")

    Here’s my own humble contribution — a slight embellishment of @Håvard S’s highly rated answer, but a bit more explicit (so it might be acceptable to @S.Lott, even though probably not good enough for the OP):

    import sys
    
    class A(object):
        def salutation(self, accusative):
            print "hello", accusative
    
    class Wrapper(object):
        def __init__(self, wrapped):
            self.wrapped = wrapped
    
        def __getattr__(self, name):
            try:
                return getattr(self.wrapped, name)
            except AttributeError:
                return getattr(A(), name)
    
    _globals = sys.modules[__name__] = Wrapper(sys.modules[__name__])
    
    if __name__ == "__main__":
        _globals.salutation("world")
    

    回答 7

    创建一个包含您的类的模块文件,导入该模块,然后在刚导入的模块上运行 getattr。您也可以使用 __import__ 进行动态导入,并从 sys.modules 中取出该模块。

    这是您的模块some_module.py

    class Foo(object):
        pass
    
    class Bar(object):
        pass

    在另一个模块中:

    import some_module
    
    Foo = getattr(some_module, 'Foo')

    动态地执行此操作:

    import sys
    
    __import__('some_module')
    mod = sys.modules['some_module']
    Foo = getattr(mod, 'Foo')

    Create your module file that has your classes. Import the module. Run getattr on the module you just imported. You can do a dynamic import using __import__ and pull the module from sys.modules.

    Here’s your module some_module.py:

    class Foo(object):
        pass
    
    class Bar(object):
        pass
    

    And in another module:

    import some_module
    
    Foo = getattr(some_module, 'Foo')
    

    Doing this dynamically:

    import sys
    
    __import__('some_module')
    mod = sys.modules['some_module']
    Foo = getattr(mod, 'Foo')
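    As a side note, importlib.import_module is generally preferred over calling __import__ directly for dynamic imports; an equivalent of the dynamic version above would be roughly:

    import importlib

    mod = importlib.import_module('some_module')
    Foo = getattr(mod, 'Foo')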