Given a NumPy array of int32, how do I convert it to float32 in place? So basically, I would like to do
a = a.astype(numpy.float32)
without copying the array. It is big.
The reason for doing this is that I have two algorithms for the computation of a. One of them returns an array of int32, the other returns an array of float32 (and this is inherent to the two different algorithms). All further computations assume that a is an array of float32.
Currently I do the conversion in a C function called via ctypes. Is there a way to do this in Python?
You can create a view with a different dtype and then copy in-place into the view:
import numpy as np
x = np.arange(10, dtype='int32')
y = x.view('float32')
y[:] = x
print(y)
You can change the array type without converting like this:
a.dtype = numpy.float32
but first you have to change all the integers to something that will be interpreted as the corresponding float. A very slow way to do this would be to use Python's struct module.
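A rough sketch of that approach (the byte-level handling here is my own reconstruction, not the original snippet):

import struct
import numpy as np

a = np.arange(10, dtype=np.int32)
buf = a.data.cast('B')                    # raw, writable byte view of the array's buffer
for i in range(a.size):
    value = float(a.flat[i])              # read the int32 before its bytes are overwritten
    struct.pack_into('=f', buf, 4 * i, value)
del buf                                   # release the memoryview before changing the dtype
a.dtype = np.float32                      # reinterpret the same buffer as float32
print(a)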
But perhaps a faster way would be to utilize numpy’s ctypeslib tools (which I am unfamiliar with)
– edit –
Since ctypeslib doesn't seem to work, I would proceed with the conversion using the typical numpy.astype method, but do it in block sizes that are within your memory limits:
Here is a function that accomplishes the task for any compatible dtypes (it only works for dtypes with same-sized items) and handles arbitrarily-shaped arrays with user control over the block size:
import numpy

def astype_inplace(a, dtype, blocksize=10000):
    oldtype = a.dtype
    newtype = numpy.dtype(dtype)
    # the in-place trick only works when both dtypes have the same item size
    assert oldtype.itemsize == newtype.itemsize
    for idx in range(0, a.size, blocksize):
        a.flat[idx:idx + blocksize] = \
            a.flat[idx:idx + blocksize].astype(newtype).view(oldtype)
    a.dtype = newtype

a = numpy.random.randint(100, size=100).reshape((10, 10))
print(a)
astype_inplace(a, 'float32')
print(a)
Node.js is a perfect match for our web project, but there are a few computational tasks for which we would prefer Python. We also already have Python code for them.
We are highly concerned about speed. What is the most elegant way to call a Python "worker" from node.js in an asynchronous, non-blocking way?
For communication between node.js and the Python server, I would use Unix sockets if both processes run on the same server and TCP/IP sockets otherwise. For the marshalling protocol I would take JSON or protocol buffers. If threaded Python turns out to be a bottleneck, consider using Twisted Python, which provides the same event-driven concurrency as node.js.
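A rough sketch of the Python side of that setup (the handler name, port and toy computation are assumptions for illustration, not from the answer); the node.js side would connect with its built-in net module and get non-blocking callbacks for free:

import json
import socketserver

class WorkerHandler(socketserver.StreamRequestHandler):
    def handle(self):
        for line in self.rfile:                       # one JSON request per line
            request = json.loads(line)
            result = request["x"] ** 2                # stand-in for the real computation
            self.wfile.write((json.dumps({"result": result}) + "\n").encode())

if __name__ == "__main__":
    # TCP shown here; swap in socketserver.UnixStreamServer for same-host setups
    with socketserver.TCPServer(("127.0.0.1", 9999), WorkerHandler) as server:
        server.serve_forever()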
If you feel adventurous, learn Clojure (ClojureScript, clojure-py) and you'll get the same language that runs and interoperates with existing code on Java, JavaScript (node.js included), the CLR and Python. And you get a superb marshalling protocol simply by using Clojure data structures.
This sounds like a scenario where zeroMQ would be a good fit. It’s a messaging framework that’s similar to using TCP or Unix sockets, but it’s much more robust (http://zguide.zeromq.org/py:all)
There’s a library that uses zeroMQ to provide a RPC framework that works pretty well. It’s called zeroRPC (http://www.zerorpc.io/). Here’s the hello world.
Python “Hello x” server:
import zerorpc
class HelloRPC(object):
    '''pass the method a name, it replies "Hello name!"'''
    def hello(self, name):
        return "Hello, {0}!".format(name)

def main():
    s = zerorpc.Server(HelloRPC())
    s.bind("tcp://*:4242")
    s.run()

if __name__ == "__main__":
    main()
And the node.js client:
var zerorpc = require("zerorpc");
var client = new zerorpc.Client();
client.connect("tcp://127.0.0.1:4242");
//calls the method on the python object
client.invoke("hello", "World", function(error, reply, streaming) {
    if (error) {
        console.log("ERROR: ", error);
    }
    console.log(reply);
});
Or vice-versa, node.js server:
var zerorpc = require("zerorpc");
var server = new zerorpc.Server({
    hello: function(name, reply) {
        reply(null, "Hello, " + name, false);
    }
});
server.bind("tcp://0.0.0.0:4242");
And the python client
import zerorpc, sys
c = zerorpc.Client()
c.connect("tcp://127.0.0.1:4242")
name = sys.argv[1] if len(sys.argv) > 1 else "dude"
print c.hello(name)
If you arrange to have your Python worker in a separate process (either long-running server-type process or a spawned child on demand), your communication with it will be asynchronous on the node.js side. UNIX/TCP sockets and stdin/out/err communication are inherently async in node.
I would also look at Apache Thrift (http://thrift.apache.org/). It can bridge between several programming languages, is highly efficient, and has support for async or sync calls. See the full features here: http://thrift.apache.org/docs/features/
The multi-language support can be useful for future plans; for example, if you later want to do part of the computational task in C++, it's very easy to add it to the mix using Thrift.
I’ve had a lot of success using thoonk.js along with thoonk.py. Thoonk leverages Redis (in-memory key-value store) to give you feed (think publish/subscribe), queue and job patterns for communication.
Why is this better than unix sockets or direct tcp sockets? Overall performance may be decreased a little, but Thoonk provides a really simple API that spares you from dealing with a socket manually. Thoonk also makes it trivial to implement a distributed computing model that allows you to scale your python workers to increase performance, since you just spin up new instances of your python workers and connect them to the same redis server.
I'd recommend using a work queue such as the excellent Gearman, which will provide you with a great way to dispatch background jobs and asynchronously get their results once they're processed.
The advantage of this approach, used heavily at Digg (among many others), is that it provides a strong, scalable and robust way for workers in any language to speak with clients in any language.
There are several ways to achieve this; here is the list in increasing order of complexity:
1. Python Shell: you will write streams to the Python console and it will write back to you
2. Redis Pub/Sub: you can have a channel listening in Python while your node.js publisher pushes data (a Python-side sketch appears after Approach 1 below)
3. Websocket connection, where Node acts as the client and Python acts as the server, or vice-versa
4. API connection, with Express/Flask/Tornado etc. working separately with an API endpoint exposed for the other to query
Approach 1: Python Shell (the simplest approach)
source.js file
const ps = require('python-shell')

// very important to add -u option since our python script runs infinitely
var options = {
    pythonPath: '/Users/zup/.local/share/virtualenvs/python_shell_test-TJN5lQez/bin/python',
    pythonOptions: ['-u'], // get print results in real-time
    // make sure you use an absolute path for scriptPath
    scriptPath: "./subscriber/",
    // args: ['value1', 'value2', 'value3'],
    mode: 'json'
};

const shell = new ps.PythonShell("destination.py", options);

function generateArray() {
    const list = []
    for (let i = 0; i < 1000; i++) {
        list.push(Math.random() * 1000)
    }
    return list
}

setInterval(() => {
    shell.send(generateArray())
}, 1000);

shell.on("message", message => {
    console.log(message);
})
destination.py file
import datetime
import sys
import time
import numpy
import talib
import timeit
import json
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
size = 1000
p = 100
o = numpy.random.random(size)
h = numpy.random.random(size)
l = numpy.random.random(size)
c = numpy.random.random(size)
v = numpy.random.random(size)
def get_indicators(values):
    # Return the RSI of the values sent from node.js
    numpy_values = numpy.array(values, dtype=numpy.double)
    return talib.func.RSI(numpy_values, 14)

for line in sys.stdin:
    l = json.loads(line)
    print(get_indicators(l))
    # Without this step the output may not be immediately available in node
    sys.stdout.flush()
Notes: Make a folder called subscriber at the same level as the source.js file and put destination.py inside it. Don't forget to change your virtualenv environment.
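For comparison, here is a rough sketch of the Python side of Approach 2 (Redis Pub/Sub), assuming the third-party redis package and a local Redis server; the channel names and the toy computation are illustrative, and the node.js process would publish JSON payloads to "tasks" and subscribe to "results":

import json
import redis

r = redis.Redis(host="localhost", port=6379)
pubsub = r.pubsub()
pubsub.subscribe("tasks")

for message in pubsub.listen():
    if message["type"] != "message":
        continue                               # skip the subscribe confirmation
    values = json.loads(message["data"])
    result = sum(values) / len(values)         # stand-in for the real computation
    r.publish("results", json.dumps(result))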
I've been reading recently about Stackless Python and it seems to have many advantages compared with vanilla CPython. It has all those cool features like infinite recursion, microthreads, continuations, etc., and at the same time is faster than CPython (around 10%, if the Python wiki is to be believed) and compatible with it (at least versions 2.5, 2.6 and 3.0).
All this looks almost too good to be true. However, TANSTAAFL: I don't see much enthusiasm for Stackless among the Python community, and PEP 219 has never come into realization. Why is that? What are the drawbacks of Stackless? What skeletons are hidden in Stackless' closet?
(I know Stackless doesn’t offer real concurrency, just an easier way of programming in the concurrent way. It doesn’t really bother me.)
I don’t know where that “Stackless is 10% faster” on the Wiki came from, but then again I’ve never tried to measure those performance numbers. I can’t think of what Stackless does to make a difference that big.
Stackless is an amazing tool with several organizational/political problems.
The first comes from history. Christian Tismer started talking about what eventually became Stackless about 10 years ago. He had an idea of what he wanted, but had a hard time explaining what he was doing and why people should use it. This is partially because his background didn't include the CS training regarding ideas like coroutines, and because his presentations and discussions are very implementation oriented, which makes it hard for anyone not already hip-deep in continuations to understand how to use it as a solution to their problems.
For that reason, the initial documentation was poor. There were some descriptions of how to use it, with the best coming from third-party contributors. At PyCon 2007 I gave a talk on "Using Stackless" which went over quite well, according to the PyCon survey numbers. Richard Tew has done a great job collecting these, updating stackless.com, and maintaining the distribution when new Python releases come out. He's an employee of CCP Games, developers of EVE Online, which uses Stackless as an essential part of their gaming system.
CCP Games is also the biggest real-world example people use when they talk about Stackless. The main tutorial for Stackless is Grant Olson's "Introduction to Concurrent Programming with Stackless Python", which is also game oriented. I think this gives people a skewed idea that Stackless is games-oriented, when it's more that games are more easily continuation oriented.
Another difficulty has been the source code. In its original form it required changes to many parts of Python, which made Guido van Rossum, the Python lead, wary. Part of the reason, I think, was support for call/cc that was later removed as being “too much like supporting a goto when there are better higher-level forms.” I’m not certain about this history, so just read this paragraph as “Stackless used to require too many changes.”
Later releases didn’t require the changes, and Tismer continued to push for its inclusion in Python. While there was some consideration, the official stance (as far as I know) is that CPython is not only a Python implementation but it’s meant as a reference implementation, and it won’t include Stackless functionality because it can’t be implemented by Jython or Iron Python.
There are absolutely no plans for “significant changes to the code base“. That quote and reference hyperlink from Arafangion’s (see the comment) are from roughly 2000/2001. The structural changes have long been done, and it’s what I mentioned above. Stackless as it is now is stable and mature, with only minor tweaks to the code base over the last several years.
One final limitation with Stackless – there is no strong advocate for Stackless. Tismer is now deeply involved with PyPy, which is an implementation of Python in Python. He has implemented the Stackless functionality in PyPy and considers it much superior to Stackless itself, and feels that PyPy is the way of the future. Tew maintains Stackless but he isn't interested in advocacy. I considered being in that role, but couldn't see how I could make an income from it.
Though if you want training in Stackless, feel free to contact me! :)
It took me quite a while to find this discussion. At that time I was not on PyPy but had a two-year affair with Psyco, until health stopped all this quite abruptly. I'm now active again and designing an alternative approach – I will present it at EuroPython 2012.
Most of Andrew's statements are correct. Some minor additions:
Stackless was significantly faster than CPython, 10 years ago, because I optimized the interpreter loop. At that time, Guido was not ready for that. A few years later, people did similar optimizations and even more and better ones, which makes Stackless a little bit slower, as expected.
On inclusion: well, in the beginning I was very pushy and convinced that Stackless was the way to go. Later, when it was almost possible to get it included, I lost interest in that and preferred to let it stay this way, partially out of frustration, partially to keep control of Stackless.
The arguments like “other implementations cannot do it” felt always lame to me, as there are other examples where this argument could also be used. I thought I better forget about that and stay in good friendship with Guido, having my own distro.
Meanwhile things are changing again. I'm working on PyPy and Stackless as an extension. I will talk about that some time later.
If I recall correctly, Stackless was slated for inclusion into the official CPython, but the author of Stackless told the CPython folks not to do so, because he planned to do some significant changes to the code base – presumably he wanted the integration done later when the project was more mature.
I’m also interested in the answers here. I’ve played a bit with Stackless and it looks like it would be a good solid addition to standard Python.
PEP 219 does mention potential difficulties with calling Python code from C code, if Python wants to change to a different stack. There would need to be ways to detect and prevent this (to avoid trashing the C stack). I think this is tractable though, so I’m also wondering why Stackless must stand on its own.
I am not sure whether this counts more as an OS issue, but I thought I would ask here in case anyone has some insight from the Python end of things.
I’ve been trying to parallelise a CPU-heavy for loop using joblib, but I find that instead of each worker process being assigned to a different core, I end up with all of them being assigned to the same core and no performance gain.
Here’s a very trivial example…
from joblib import Parallel, delayed
import numpy as np

def testfunc(data):
    # some very boneheaded CPU work
    for nn in xrange(1000):
        for ii in data[0, :]:
            for jj in data[1, :]:
                ii*jj

def run(niter=10):
    data = (np.random.randn(2, 100) for ii in xrange(niter))
    pool = Parallel(n_jobs=-1, verbose=1, pre_dispatch='all')
    results = pool(delayed(testfunc)(dd) for dd in data)

if __name__ == '__main__':
    run()
…and here’s what I see in htop while this script is running:
I’m running Ubuntu 12.10 (3.5.0-26) on a laptop with 4 cores. Clearly joblib.Parallel is spawning separate processes for the different workers, but is there any way that I can make these processes execute on different cores?
It turns out that certain Python modules (numpy, scipy, tables, pandas, skimage…) mess with core affinity on import. As far as I can tell, this problem seems to be specifically caused by them linking against multithreaded OpenBLAS libraries.
A workaround is to reset the task affinity using
os.system("taskset -p 0xff %d" % os.getpid())
With this line pasted in after the module imports, my example now runs on all cores. My experience so far has been that this doesn't seem to have any negative effect on numpy's performance, although this is probably machine- and task-specific.
Update:
There are also two ways to disable the CPU affinity-resetting behaviour of OpenBLAS itself. At run-time you can use the environment variable OPENBLAS_MAIN_FREE (or GOTOBLAS_MAIN_FREE), for example
OPENBLAS_MAIN_FREE=1 python myscript.py
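If you prefer not to touch the shell invocation, the same variable can, as far as I can tell, also be set from inside the script, as long as that happens before numpy (and thus OpenBLAS) is imported; a minimal sketch:

import os
os.environ["OPENBLAS_MAIN_FREE"] = "1"   # must be set before OpenBLAS is loaded

import numpy as np                       # imported only after the variable is in place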
Or alternatively, if you’re compiling OpenBLAS from source you can permanently disable it at build-time by editing the Makefile.rule to contain the line
On more recent Python (3.3+, on Linux) you can also query and set the process affinity directly from the standard library with os.sched_getaffinity and os.sched_setaffinity:

>>> import os
>>> os.sched_getaffinity(0)
{0, 1, 2, 3}
>>> os.sched_setaffinity(0, {1, 3})
>>> os.sched_getaffinity(0)
{1, 3}
>>> x = {i for i in range(10)}
>>> x
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> os.sched_setaffinity(0, x)
>>> os.sched_getaffinity(0)
{0, 1, 2, 3}
I have a list of about 750,000 "sentences" (long strings), and a list of about 20,000 "words" that I would like to delete from my 750,000 sentences.
So, I have to loop through 750,000 sentences and perform about 20,000 replacements, but ONLY if my words are actually “words” and are not part of a larger string of characters.
I am doing this by pre-compiling my words so that they are flanked by the \b metacharacter
compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]
Then I loop through my “sentences”
import re

for sentence in sentences:
    for word in compiled_words:
        sentence = re.sub(word, "", sentence)
    # put sentence into a growing list
This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.
Is there a way to use the str.replace method (which I believe is faster), while still requiring that replacements only happen at word boundaries?
Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping re.sub when the length of my word is greater than the length of my sentence, but it's not much of an improvement.
Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP’s, it’s approximately 2000 times faster than the accepted answer.
If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.
Theory
If your sentences aren’t humongous strings, it’s probably feasible to process many more than 50 per second.
If you save all the banned words into a set, it will be very fast to check if another word is included in that set.
Pack the logic into a function, give this function as argument to re.sub and you’re done!
Code
import re

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = set(word.strip().lower() for word in wordbook)

def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
             "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000

word_pattern = re.compile(r'\w+')

for sentence in sentences:
    sentence = word_pattern.sub(delete_banned_words, sentence)
the search is case-insensitive (thanks to lower())
replacing a word with "" might leave two spaces (as in your code)
With Python 3, \w+ also matches accented characters (e.g. "ångström").
Any non-word character (tab, space, newline, marks, …) will stay untouched.
Performance
There are a million sentences, banned_words has almost 100000 words and the script runs in less than 7s.
In comparison, Liteye’s answer needed 160s for 10 thousand sentences.
With n being the total amount of words and m the amount of banned words, the OP's and Liteye's code are O(n*m).
In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).
Regex union test
What’s the complexity of a regex search with a '\b(word1|word2|...|wordN)\b' pattern? Is it O(N) or O(1)?
It’s pretty hard to grasp the way the regex engine works, so let’s write a simple test.
This code extracts 10**i random english words into a list. It creates the corresponding regex union, and tests it with different words :
one is clearly not a word (it begins with #)
one is the first word in the list
one is the last word in the list
one looks like a word but isn’t
import re
import timeit
import random

with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook]
random.shuffle(english_words)

print("First 10 words :")
print(english_words[:10])

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", english_words[0]),
    ("Last word", english_words[-1]),
    ("Almost a word", "couldbeaword")
]

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nUnion of %d words" % 10**exp)
    union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %-17s : %.1fms" % (description, time))
It outputs:
First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']
Union of 10 words
Surely not a word : 0.7ms
First word : 0.8ms
Last word : 0.7ms
Almost a word : 0.7ms
Union of 100 words
Surely not a word : 0.7ms
First word : 1.1ms
Last word : 1.2ms
Almost a word : 1.2ms
Union of 1000 words
Surely not a word : 0.7ms
First word : 0.8ms
Last word : 9.6ms
Almost a word : 10.1ms
Union of 10000 words
Surely not a word : 1.4ms
First word : 1.8ms
Last word : 96.3ms
Almost a word : 116.6ms
Union of 100000 words
Surely not a word : 0.7ms
First word : 0.8ms
Last word : 1227.1ms
Almost a word : 1404.1ms
So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b' pattern has:
O(1) best case
O(n/2) average case, which is still O(n)
O(n) worst case
These results are consistent with a simple loop search.
Use this method if you want the fastest regex-based solution. For a dataset similar to the OP’s, it’s approximately 1000 times faster than the accepted answer.
If you don’t care about regex, use this set-based version, which is 2000 times faster than a regex union.
It’s possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie or regex aren’t really human-readable, but they do allow for very fast lookup and match.
The huge advantage is that to test if zoo matches, the regex engine only needs to compare the first character (it doesn't match), instead of trying each of the 5 words in an alternation. It's preprocessing overkill for 5 words, but it shows promising results for many thousands of words.
Note that a non-capturing group (?:...) is used, because a pattern like foo(bar|baz) would save unneeded information to a capturing group.
Code
Here’s a slightly modified gist, which we can use as a trie.py library:
import re
class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())
# Encoding: utf-8
import re
import timeit
import random
from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))
It outputs:
TrieRegex of 10 words
Surely not a word : 0.3ms
First word : 0.4ms
Last word : 0.5ms
Almost a word : 0.5ms
TrieRegex of 100 words
Surely not a word : 0.3ms
First word : 0.5ms
Last word : 0.9ms
Almost a word : 0.6ms
TrieRegex of 1000 words
Surely not a word : 0.3ms
First word : 0.7ms
Last word : 0.9ms
Almost a word : 1.1ms
TrieRegex of 10000 words
Surely not a word : 0.1ms
First word : 1.0ms
Last word : 1.2ms
Almost a word : 1.2ms
TrieRegex of 100000 words
Surely not a word : 0.3ms
First word : 1.2ms
Last word : 0.9ms
Almost a word : 1.6ms
One thing you might want to try is pre-processing the sentences to encode the word boundaries. Basically turn each sentence into a list of words by splitting on word boundaries.
This should be faster, because to process a sentence, you just have to step through each of the words and check if it’s a match.
Currently the regex search is having to go through the entire string again each time, looking for word boundaries, and then “discarding” the result of this work before the next pass.
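A minimal sketch of that idea (the banned-word set and splitting pattern here are illustrative assumptions, not the answer's code):

import re

banned = {"word1", "word2"}             # hypothetical banned words
splitter = re.compile(r"(\W+)")         # the capturing group keeps the separators

def strip_banned(sentence):
    parts = splitter.split(sentence)    # alternating words and separators
    return "".join("" if part.lower() in banned else part for part in parts)

print(strip_banned("word1 and word2 are removed, other words stay."))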
#! /bin/env python3
# -*- coding: utf-8
import time, random, re

def replace1( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns:
            sentence = re.sub( "\\b"+search+"\\b", repl, sentence )

def replace2( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns_comp:
            sentence = re.sub( search, repl, sentence )

def replace3( sentences ):
    pd = patterns_dict.get
    for n, sentence in enumerate( sentences ):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)
        # and... done.
        sentence = "".join( pd(w, w) for w in words )
        #~ print( n, sentence )

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w, w)
    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)

# Build test set
test_words = [ ("word%d" % _) for _ in range(50000) ]
test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]

# Create search and replace patterns
patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
patterns_dict = dict( patterns )
patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]

def test( func, num ):
    t = time.time()
    func( test_sentences[:num] )
    print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)) )

print( "Sentences", len(test_sentences) )
print( "Words    ", len(test_words) )

test( replace1, 1 )
test( replace2, 10 )
test( replace3, 1000 )
test( replace4, 1000 )
Well, here's a quick and easy solution, with a test set.
Winning strategy:
re.sub(r"\w+", repl, sentence) searches for words.
"repl" can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search and replace.
This is the simplest and fastest solution (see function replace4 in the example code above).
Second best:
The idea is to split the sentences into words, using re.split, while conserving the separators to reconstruct the sentences later. Then, replacements are done with a simple dict lookup (function replace3 above).
Perhaps Python is not the right tool here. Here is an approach with the Unix toolchain:
sed G file |
tr ' ' '\n' |
grep -vf blacklist |
awk -v RS= -v OFS=' ' '{$1=$1}1'
assuming your blacklist file is preprocessed with the word boundaries added. The steps are: convert the file to double spaced, split each sentence to one word per line, mass delete the blacklist words from the file, and merge back the lines.
This should run at least an order of magnitude faster.
For preprocessing the blacklist file from words (one word per line)
sed 's/.*/\\b&\\b/' words > blacklist
How about this:
#!/usr/bin/env python3
from __future__ import unicode_literals, print_function
import re
import time
import io

def replace_sentences_1(sentences, banned_words):
    # faster on CPython, but does not use \b as the word separator
    # so result is slightly different than replace_sentences_2()
    def filter_sentence(sentence):
        words = WORD_SPLITTER.split(sentence)
        words_iter = iter(words)
        for word in words_iter:
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word
            yield next(words_iter) # yield the word separator

    WORD_SPLITTER = re.compile(r'(\W+)')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))

def replace_sentences_2(sentences, banned_words):
    # slower on CPython, uses \b as separator
    def filter_sentence(sentence):
        boundaries = WORD_BOUNDARY.finditer(sentence)
        current_boundary = 0
        while True:
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            yield sentence[last_word_boundary:current_boundary] # yield the separators
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            word = sentence[last_word_boundary:current_boundary]
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word

    WORD_BOUNDARY = re.compile(r'\b')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))

corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
    output.write(sentence.encode('utf-8'))
    output.write(b' .')
print('time:', time.time() - start)
These solutions split on word boundaries and look up each word in a set. They should be faster than re.sub with word alternates (Liteye's solution), as these solutions are O(n), where n is the size of the input, thanks to the amortized O(1) set lookup, while using regex alternates would cause the regex engine to check for word matches at every character rather than just at word boundaries. My solutions take extra care to preserve the whitespace that was used in the original text (i.e. they don't compress whitespace and they preserve tabs, newlines, and other whitespace characters), but if you decide that you don't care about it, it should be fairly straightforward to remove them from the output.
I tested on corpus.txt, which is a concatenation of multiple eBooks downloaded from the Gutenberg Project, and banned_words.txt is 20000 words randomly picked from Ubuntu’s wordlist (/usr/share/dict/american-english). It takes around 30 seconds to process 862462 sentences (and half of that on PyPy). I’ve defined sentences as anything separated by “. “.
$ # replace_sentences_1()
$ python3 filter_words.py
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py
number of sentences: 862462
time: 15.9370770454
$ # replace_sentences_2()
$ python3 filter_words.py
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py
number of sentences: 862462
time: 13.1190629005
PyPy particularly benefits more from the second approach, while CPython fared better with the first approach. The above code should work on both Python 2 and 3.
The solution described below uses a lot of memory to store all the text in one string and to reduce the complexity level. If RAM is an issue, think twice before using it.
With join/split tricks you can avoid loops altogether, which should speed up the algorithm.
Concatenate the sentences with a special delimiter which is not contained in the sentences:
merged_sentences = ' * '.join(sentences)
Compile a single regex for all the words you need to remove from the sentences, using the | ("or") regex alternation:
regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag
Substitute the words with the compiled regex and split the result by the special delimiter character back into separate sentences (a sketch of this step follows below):
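A sketch of that last step, reusing merged_sentences and regex from the two snippets above:

# strip all banned words in a single pass, then split back into sentences
cleaned = regex.sub("", merged_sentences)
cleaned_sentences = cleaned.split(' * ')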
"".join complexity is O(n). This is pretty intuitive but anyway there is a shortened quotation from a source:
for (i = 0; i < seqlen; i++) {
[...]
sz += PyUnicode_GET_LENGTH(item);
Therefore with join/split you have O(words) + 2*O(sentences), which is still linear complexity vs 2*O(N²) with the initial approach.
By the way, don't use multithreading. The GIL will block each operation because your task is strictly CPU-bound, so the GIL has no chance to be released, while each thread will send ticks concurrently, which causes extra effort and can even drive the operation to infinity.
Concatenate all your sentences into one document. Use any implementation of the Aho-Corasick algorithm (here’s one) to locate all your “bad” words. Traverse the file, replacing each bad word, updating the offsets of found words that follow etc.
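A rough sketch using the third-party pyahocorasick package (just one possible implementation; the answer doesn't prescribe a library, and note this matches substrings, so a word-boundary check would still be needed when replacing):

import ahocorasick

banned = ["apple", "banana"]              # hypothetical banned words
automaton = ahocorasick.Automaton()
for word in banned:
    automaton.add_word(word, word)
automaton.make_automaton()

text = "I ate an apple and a banana today."
for end_index, word in automaton.iter(text):
    start_index = end_index - len(word) + 1
    print(start_index, end_index, word)   # offsets of every occurrence, found in one pass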
I have got a python script which is creating an ODBC connection. The ODBC connection is generated with a connection string. In this connection string I have to include the username and password for this connection.
Is there an easy way to obscure this password in the file (just that nobody can read the password when I’m editing the file) ?
Douglas F Shearer's answer is the generally approved solution in Unix when you need to specify a password for a remote login.
You add a --password-from-file option to specify the path and read the plaintext password from a file.
The file can then be in the user’s own area protected by the operating system.
It also allows different users to automatically pick up their own own file.
For passwords that the user of the script isn't allowed to know, you can run the script with elevated permissions and have the password file owned by that root/admin user.
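A minimal sketch of the idea (the option name, default path and permission scheme are assumptions for illustration):

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--password-from-file", default=os.path.expanduser("~/.myscript_password"))
args = parser.parse_args()

# the file lives in the user's own area and should be readable only by them (chmod 600)
with open(args.password_from_file) as f:
    password = f.read().strip()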
If you are working on a Unix system, take advantage of the netrc module in the standard Python library. It reads passwords from a separate text file (.netrc), which has the format described here.
Here is a small usage example:
import netrc
# Define which host in the .netrc file to use
HOST = 'mailcluster.loopia.se'
# Read from the .netrc file in your home directory
secrets = netrc.netrc()
username, account, password = secrets.authenticators( HOST )
print username, password
The best solution, assuming the username and password can't be given at runtime by the user, is probably a separate source file containing only variable initialization for the username and password that is imported into your main code. This file would only need editing when the credentials change. Otherwise, if you're only worried about shoulder surfers with average memories, base64 encoding is probably the easiest solution. ROT13 is just too easy to decode manually, isn't case sensitive and retains too much meaning in its encrypted state. Encode your password and user id outside the python script. Have the script decode at runtime for use.
Giving scripts credentials for automated tasks is always a risky proposal. Your script should have its own credentials and the account it uses should have no access other than exactly what is necessary. At least the password should be long and rather random.
How about importing the username and password from a file external to the script? That way even if someone got hold of the script, they wouldn’t automatically get the password.
This is a pretty common problem. Typically the best you can do is to either
A) create some kind of Caesar cipher function to encode/decode (just not rot13)
or
B) the preferred method is to use an encryption key, within reach of your program, to encode/decode the password. You can then use file protection to protect access to the key.
Along those lines, if your app runs as a service/daemon (like a webserver) you can put your key into a password-protected keystore with the password input as part of the service startup. It'll take an admin to restart your app, but you will have really good protection for your configuration passwords.
Your operating system probably provides facilities for encrypting data securely. For instance, on Windows there is DPAPI (data protection API). Why not ask the user for their credentials the first time you run then squirrel them away encrypted for subsequent runs?
A more homegrown approach, rather than converting authentication / passwords / usernames into encrypted details. ftplib is just the example.
"pass.csv" is the csv file name.
Save the password in the CSV like below:
user_name
user_password
(with no column heading)
Read the CSV and save it to a list.
Use the list elements as authentication details.
Full code:
import os
import ftplib
import csv
cred_detail = []
os.chdir("Folder where the csv file is stored")
for row in csv.reader(open("pass.csv","rb")):
cred_detail.append(row)
ftp = ftplib.FTP('server_name',cred_detail[0][0],cred_detail[1][0])
Here is my snippet for such a thing. You basically import or copy the function into your code. getCredentials will create the encrypted file if it does not exist and return a dictionary, and updateCredentials will update it.
import os

def getCredentials():
    import base64
    splitter = '<PC+,DFS/-SHQ.R'
    directory = 'C:\\PCT'
    if not os.path.exists(directory):
        os.makedirs(directory)
    try:
        with open(directory+'\\Credentials.txt', 'r') as file:
            cred = file.read()
            file.close()
    except:
        print('I could not file the credentials file. \nSo I dont keep asking you for your email and password everytime you run me, I will be saving an encrypted file at {}.\n'.format(directory))
        lanid = base64.b64encode(bytes(input(' LanID: '), encoding='utf-8')).decode('utf-8')
        email = base64.b64encode(bytes(input(' eMail: '), encoding='utf-8')).decode('utf-8')
        password = base64.b64encode(bytes(input(' PassW: '), encoding='utf-8')).decode('utf-8')
        cred = lanid+splitter+email+splitter+password
        with open(directory+'\\Credentials.txt', 'w+') as file:
            file.write(cred)
            file.close()
    return {'lanid': base64.b64decode(bytes(cred.split(splitter)[0], encoding='utf-8')).decode('utf-8'),
            'email': base64.b64decode(bytes(cred.split(splitter)[1], encoding='utf-8')).decode('utf-8'),
            'password': base64.b64decode(bytes(cred.split(splitter)[2], encoding='utf-8')).decode('utf-8')}

def updateCredentials():
    import base64
    splitter = '<PC+,DFS/-SHQ.R'
    directory = 'C:\\PCT'
    if not os.path.exists(directory):
        os.makedirs(directory)
    print('I will be saving an encrypted file at {}.\n'.format(directory))
    lanid = base64.b64encode(bytes(input(' LanID: '), encoding='utf-8')).decode('utf-8')
    email = base64.b64encode(bytes(input(' eMail: '), encoding='utf-8')).decode('utf-8')
    password = base64.b64encode(bytes(input(' PassW: '), encoding='utf-8')).decode('utf-8')
    cred = lanid+splitter+email+splitter+password
    with open(directory+'\\Credentials.txt', 'w+') as file:
        file.write(cred)
        file.close()

cred = getCredentials()
updateCredentials()
Place the configuration information in an encrypted config file. Query this info in your code using a key. Place this key in a separate file per environment, and don't store it with your code.
If running on Windows, you could consider using win32crypt library. It allows storage and retrieval of protected data (keys, passwords) by the user that is running the script, thus passwords are never stored in clear text or obfuscated format in your code. I am not sure if there is an equivalent implementation for other platforms, so with the strict use of win32crypt your code is not portable.
from cryptography.fernet import Fernet
# you store the key and the token
key = b'B8XBLJDiroM3N2nCBuUlzPL06AmfV4XkPJ5OKsPZbC4='
token = b'gAAAAABe_TUP82q1zMR9SZw1LpawRLHjgNLdUOmW31RApwASzeo4qWSZ52ZBYpSrb1kUeXNFoX0tyhe7kWuudNs2Iy7vUwaY7Q=='
# create a cipher and decrypt when you need your password
cipher = Fernet(key)
mypassword = cipher.decrypt(token).decode('utf-8')
Once you’ve done this, you can either import mypassword directly or you can import the token and cipher to decrypt as needed.
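For completeness, a sketch of how a key and token like the ones above could be produced in the first place (a one-time step, run separately from the script that uses them):

from cryptography.fernet import Fernet

key = Fernet.generate_key()                      # keep this somewhere safer than the script
token = Fernet(key).encrypt(b"my-secret-password")
print(key.decode())
print(token.decode())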
Obviously, there are some shortcomings to this approach. If someone has both the token and the key (as they would if they have the script), they can decrypt easily. However it does obfuscate, and if you compile the code (with something like Nuitka) at least your password won’t appear as plain text in a hex editor.
This doesn’t precisely answer your question, but it’s related. I was going to add as a comment but wasn’t allowed.
I’ve been dealing with this same issue, and we have decided to expose the script to the users using Jenkins. This allows us to store the db credentials in a separate file that is encrypted and secured on a server and not accessible to non-admins.
It also allows us a bit of a shortcut to creating a UI, and throttling execution.
You could also consider the possibility of storing the password outside the script and supplying it at runtime.
For example, fred.py:

import os
username = 'fred'
password = os.environ.get('PASSWORD', '')
print(username, password)

which can be run like:

$ PASSWORD=password123 python fred.py
fred password123
Extra layers of “security through obscurity” can be achieved by using base64 (as suggested above), using less obvious names in the code and further distancing the actual password from the code.
If the code is in a repository, it is often useful to store secrets outside it, so one could add this to ~/.bashrc (or to a vault, or a launch script, …)
export SURNAME=cGFzc3dvcmQxMjM=
and change fred.py to
import os
import base64
name = 'fred'
surname = base64.b64decode(os.environ.get('SURNAME', '')).decode('utf-8')
print(name, surname)
There are several ROT13 utilities written in Python on the ‘Net — just google for them. ROT13 encode the string offline, copy it into the source, decode at point of transmission.
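No external utility is strictly needed, though; the standard codecs module can do ROT13 directly (a small sketch):

import codecs

encoded = codecs.encode("s3cr3t-password", "rot_13")   # store this string in the source
print(encoded)
print(codecs.decode(encoded, "rot_13"))                # decode at the point of use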
Does basic configuration for the logging system by creating a
StreamHandler with a default Formatter and adding it to the root
logger. The functions debug(), info(), warning(), error() and
critical() will call basicConfig() automatically if no handlers are
defined for the root logger.
This function does nothing if the root logger already has handlers
configured for it.
It seems like the IPython notebook calls basicConfig (or sets a handler) somewhere.
If you still want to use basicConfig, reload the logging module like this:
from importlib import reload  # Not needed in Python 2
import logging
reload(logging)
logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.DEBUG, datefmt='%I:%M:%S')
My understanding is that the IPython session starts up logging so basicConfig doesn’t work. Here is the setup that works for me (I wish this was not so gross looking since I want to use it for almost all my notebooks):
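The setup code itself is not shown above; a minimal sketch that would produce log lines in the format of the output below (the handler and formatter details are my reconstruction, not necessarily the author's exact code):

import logging

logger = logging.getLogger()
fhandler = logging.FileHandler(filename='mylog.log', mode='a')
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fhandler.setFormatter(formatter)
logger.addHandler(fhandler)
logger.setLevel(logging.DEBUG)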
logging.error('hello!')
logging.debug('This is a debug message')
logging.info('this is an info message')
logging.warning('tbllalfhldfhd, warning.')
I get a “mylog.log” file in the same directory as my notebook that contains:
2015-01-28 09:49:25,026 - root - ERROR - hello!
2015-01-28 09:49:25,028 - root - DEBUG - This is a debug message
2015-01-28 09:49:25,029 - root - INFO - this is an info message
2015-01-28 09:49:25,032 - root - WARNING - tbllalfhldfhd, warning.
Note that if you rerun this without restarting the IPython session it will write duplicate entries to the file since there would now be two file handlers defined
Bear in mind that stderr is the default stream for the logging module, so in IPython and Jupyter notebooks you might not see anything unless you configure the stream to stdout:
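For example (a minimal sketch; combine with the reload trick above if a handler is already installed):

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.info("now visible in the notebook output")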
>>> list_a = [1, 2, 4, 6]
>>> fil = [True, False, True, False]
>>> %timeit list(compress(list_a, fil))
100000 loops, best of 3: 2.58 us per loop
>>> %timeit [i for (i, v) in zip(list_a, fil) if v] #winner
100000 loops, best of 3: 1.98 us per loop
>>> list_a = [1, 2, 4, 6]*100
>>> fil = [True, False, True, False]*100
>>> %timeit list(compress(list_a, fil)) #winner
10000 loops, best of 3: 24.3 us per loop
>>> %timeit [i for (i, v) in zip(list_a, fil) if v]
10000 loops, best of 3: 82 us per loop
>>> list_a = [1, 2, 4, 6]*10000
>>> fil = [True, False, True, False]*10000
>>> %timeit list(compress(list_a, fil)) #winner
1000 loops, best of 3: 1.66 ms per loop
>>> %timeit [i for (i, v) in zip(list_a, fil) if v]
100 loops, best of 3: 7.65 ms per loop
Don’t use filter as a variable name, it is a built-in function.
filtered_list = [i for (i, v) in zip(list_a, filter) if v]
Using zip is the pythonic way to iterate over multiple sequences in parallel, without needing any indexing. This assumes both sequences have the same length (zip stops after the shortest runs out). Using itertools for such a simple case is a bit overkill …
One thing you do in your example you should really stop doing is comparing things to True, this is usually not necessary. Instead of if filter[idx]==True: ..., you can simply write if filter[idx]: ....
In [128]: list_a = np.array([1, 2, 4, 6])
In [129]: filter = np.array([True, False, True, False])
In [130]: list_a[filter]
Out[130]: array([1, 4])
or see Alex Szatmary’s answer if list_a can be a numpy array but not filter
Numpy usually gives you a big speed boost as well
In [133]: list_a = [1, 2, 4, 6]*10000
In [134]: fil = [True, False, True, False]*10000
In [135]: list_a_np = np.array(list_a)
In [136]: fil_np = np.array(fil)
In [139]: %timeit list(itertools.compress(list_a, fil))
1000 loops, best of 3: 625 us per loop
In [140]: %timeit list_a_np[fil_np]
10000 loops, best of 3: 173 us per loop
For this, use numpy; that is, if you have an array a instead of list_a:

a = np.array([1, 2, 4, 6])
my_filter = np.array([True, False, True, False], dtype=bool)
a[my_filter]
> array([1, 4])
I am doing clinical message normalization (spell check) in which I check each given word against a 900,000-word medical dictionary. I am more concerned about the time complexity/performance.
I want to do fuzzy string comparison, but I’m not sure which library to use.
In case you’re interested in a quick visual comparison of Levenshtein and Difflib similarity, I calculated both for ~2.3 million book titles:
import codecs, difflib, Levenshtein, distance

with codecs.open("titles.tsv", "r", "utf-8") as f:
    title_list = f.read().split("\n")[:-1]

    for row in title_list:
        sr = row.lower().split("\t")
        diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
        lev = Levenshtein.ratio(sr[3], sr[4])
        sor = 1 - distance.sorensen(sr[3], sr[4])
        jac = 1 - distance.jaccard(sr[3], sr[4])
        print diffl, lev, sor, jac
I then plotted the results with R:
Strictly for the curious, I also compared the Difflib, Levenshtein, Sørensen, and Jaccard similarity values:
The Difflib / Levenshtein similarity really is quite interesting.
2018 edit: If you’re working on identifying similar strings, you could also check out minhashing–there’s a great overview here. Minhashing is amazing at finding similarities in large text collections in linear time. My lab put together an app that detects and visualizes text reuse using minhashing here: https://github.com/YaleDHLab/intertext
difflib.SequenceMatcher uses the Ratcliff/Obershelp algorithm: it computes the doubled number of matching characters divided by the total number of characters in the two strings.
Levenshtein uses the Levenshtein algorithm: it computes the minimum number of edits needed to transform one string into the other.
Complexity
SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common. (from here)
Levenshtein is O(m*n), where m and n are the lengths of the two input strings.
Performance
According to the source code of the Levenshtein module:
Levenshtein has some overlap with difflib (SequenceMatcher). It supports only strings, not arbitrary sequence types, but on the other hand it’s much faster.
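To see the two measures side by side on a single pair of strings, a quick sketch (assuming the python-Levenshtein package is installed; difflib is in the standard library):
import difflib
import Levenshtein

s1, s2 = "hypertension", "hypotension"

# Ratcliff/Obershelp similarity from the standard library
print(difflib.SequenceMatcher(None, s1, s2).ratio())

# Levenshtein similarity and raw edit distance
print(Levenshtein.ratio(s1, s2))
print(Levenshtein.distance(s1, s2))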
How can implement the equivalent of a __getattr__ on a class, on a module?
Example
When calling a function that does not exist in a module’s statically defined attributes, I wish to create an instance of a class in that module, and invoke on it the method with the same name as the one that failed in the attribute lookup on the module.
class A(object):
def salutation(self, accusative):
print "hello", accusative
# note this function is intentionally on the module, and not the class above
def __getattr__(mod, name):
return getattr(A(), name)
if __name__ == "__main__":
# i hope here to have my __getattr__ function above invoked, since
# salutation does not exist in the current namespace
salutation("world")
Which gives:
matt@stanley:~/Desktop$ python getattrmod.py
Traceback (most recent call last):
File "getattrmod.py", line 9, in <module>
salutation("world")
NameError: name 'salutation' is not defined
Recently some historical features have made a comeback, the module __getattr__ among them, so the existing hack (a module replacing itself with a class in sys.modules at import time) should no longer be necessary.
In Python 3.7+, you just use the one obvious way. To customize attribute access on a module, define a __getattr__ function at the module level which should accept one argument (name of attribute), and return the computed value or raise an AttributeError:
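A minimal sketch of such a module, reusing class A from the question (the module name is illustrative):
# mymodule.py
class A(object):
    def salutation(self, accusative):
        print("hello", accusative)

def __getattr__(name):
    # called only when normal attribute lookup on the module fails
    return getattr(A(), name)
With that in place, import mymodule; mymodule.salutation("world") works; note that a bare salutation("world") inside the module itself still raises NameError, because the hook only fires on attribute access, not on global name lookup.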
This will also allow hooks into “from” imports, i.e. you can return dynamically generated objects for statements such as from my_module import whatever.
On a related note, along with the module getattr you may also define a __dir__ function at module level to respond to dir(my_module). See PEP 562 for details.
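The __dir__ hook follows the same pattern; a brief sketch, continuing the mymodule example above:
def __dir__():
    # the names the module really defines plus the ones served dynamically via class A
    return sorted(set(list(globals()) + [n for n in dir(A) if not n.startswith("_")]))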
Answer 1
There are two basic problems you are running into here:
__xxx__ methods are only looked up on the class
TypeError: can't set attributes of built-in/extension type 'module'
(1) means any solution would have to also keep track of which module was being examined, otherwise every module would then have the instance-substitution behavior; and (2) means that (1) isn’t even possible… at least not directly.
Fortunately, sys.modules is not picky about what goes there, so a wrapper will work, but only for module access (i.e. import somemodule; somemodule.salutation('world')). For same-module access you pretty much have to yank the methods from the substitution class and add them to globals(), either with a custom method on the class (I like using .export()) or with a generic function (such as those already listed as answers). One thing to keep in mind: if the wrapper is creating a new instance each time and the globals solution is not, you end up with subtly different behavior. Oh, and you don’t get to use both at the same time; it’s one or the other.
There is actually a hack that is occasionally used and recommended: a
module can define a class with the desired functionality, and then at
the end, replace itself in sys.modules with an instance of that class
(or with the class, if you insist, but that’s generally less useful).
E.g.:
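A minimal sketch of such a self-replacing module (the module and class names here are illustrative):
# getattrmod.py
import sys

class _ModuleInterface(object):
    def salutation(self, accusative):
        print("hello", accusative)

# The final act of the module body: replace this module in sys.modules
# with an instance of the class, so attribute access goes through it.
sys.modules[__name__] = _ModuleInterface()
After import getattrmod, a call like getattrmod.salutation("world") is dispatched to the instance.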
This works because the import machinery is actively enabling this
hack, and as its final step pulls the actual module out of
sys.modules, after loading it. (This is no accident. The hack was
proposed long ago and we decided we liked it enough to support it in the
import machinery.)
So the established way to accomplish what you want is to create a single class in your module, and as the last act of the module replace sys.modules[__name__] with an instance of your class — and now you can play with __getattr__/__setattr__/__getattribute__ as needed.
Note 1: If you use this functionality then anything else in the module, such as globals, other functions, etc., will be lost when the sys.modules assignment is made — so make sure everything needed is inside the replacement class.
Note 2: To support from module import * you must have __all__ defined in the class; for example:
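A sketch of what that could look like inside the replacement class (names are illustrative):
class _ModuleInterface(object):
    __all__ = ['salutation']          # consulted by "from getattrmod import *"

    def salutation(self, accusative):
        print("hello", accusative)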
class A(object):
    ...

# The implicit global instance
a = A()

def salutation(*arg, **kw):
    a.salutation(*arg, **kw)
Why? So that the implicit global instance is visible.
For examples, look at the random module, which creates an implicit global instance to slightly simplify the use cases where you want a “simple” random number generator.
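Stripped down, the pattern in random is roughly this (a simplified sketch, not the actual source):
class Random(object):
    def seed(self, a=None):
        ...  # initialize the generator state
    def random(self):
        ...  # return the next float in [0.0, 1.0)

# one hidden, shared instance; the module-level functions are its bound methods
_inst = Random()
seed = _inst.seed
random = _inst.random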
Similar to what @Håvard S proposed, in a case where I needed to implement some magic on a module (like __getattr__), I would define a new class that inherits from types.ModuleType and put that in sys.modules (probably replacing the module where my custom ModuleType was defined).
See the main __init__.py file of Werkzeug for a fairly robust implementation of this.
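In outline, that approach could look like this (a sketch with illustrative names; keeping a reference to the original module prevents its globals from being cleaned up):
import sys
import types

class A(object):
    def salutation(self, accusative):
        print("hello", accusative)

class _MagicModule(types.ModuleType):
    def __getattr__(self, name):
        # only called when normal lookup on the module instance fails
        return getattr(A(), name)

_magic = _MagicModule(__name__)
_magic.__dict__.update(sys.modules[__name__].__dict__)
_magic._original_module = sys.modules[__name__]
sys.modules[__name__] = _magic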
Answer 5
This is a bit hacky, but…
import types
class A(object):
def salutation(self, accusative):
print "hello", accusative
def farewell(self, greeting, accusative):
print greeting, accusative
def AddGlobalAttribute(classname, methodname):
print "Adding " + classname + "." + methodname + "()"
def genericFunction(*args):
return globals()[classname]().__getattribute__(methodname)(*args)
globals()[methodname] = genericFunction
# set up the global namespace
x = 0 # X and Y are here to add them implicitly to globals, so
y = 0 # globals does not change as we iterate over it.
toAdd = []
def isCallableMethod(classname, methodname):
someclass = globals()[classname]()
something = someclass.__getattribute__(methodname)
return callable(something)
for x in globals():
print "Looking at", x
if isinstance(globals()[x], (types.ClassType, type)):
print "Found Class:", x
for y in dir(globals()[x]):
if y.find("__") == -1: # hack to ignore default methods
if isCallableMethod(x,y):
if y not in globals(): # don't override existing global names
toAdd.append((x,y))
for x in toAdd:
AddGlobalAttribute(*x)
if __name__ == "__main__":
salutation("world")
farewell("goodbye", "world")
This works by iterating over all the objects in the global namespace. If the item is a class, it iterates over the class attributes. If the attribute is callable, it adds it to the global namespace as a function.
It ignores all attributes that contain “__”.
I wouldn’t use this in production code, but it should get you started.
Here’s my own humble contribution — a slight embellishment of @Håvard S’s highly rated answer, but a bit more explicit (so it might be acceptable to @S.Lott, even though probably not good enough for the OP):
Create your module file that has your classes. Import the module. Run getattr on the module you just imported. You can do a dynamic import using __import__ and pull the module from sys.modules.
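A compact sketch of that recipe (mymodule and salutation are placeholder names):
import sys

# static import
import mymodule
getattr(mymodule, "salutation")("world")

# dynamic import by name
__import__("mymodule")
mod = sys.modules["mymodule"]
getattr(mod, "salutation")("world")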