In-place type conversion of a NumPy array

Question: In-place type conversion of a NumPy array

Given a NumPy array of int32, how do I convert it to float32 in place? So basically, I would like to do

a = a.astype(numpy.float32)

without copying the array. It is big.

The reason for doing this is that I have two algorithms for the computation of a. One of them returns an array of int32, the other returns an array of float32 (and this is inherent to the two different algorithms). All further computations assume that a is an array of float32.

Currently I do the conversion in a C function called via ctypes. Is there a way to do this in Python?


Answer 0

You can make a view with a different dtype, and then copy in-place into the view:

import numpy as np
x = np.arange(10, dtype='int32')
y = x.view('float32')
y[:] = x

print(y)

yields

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.], dtype=float32)

To show the conversion was in-place, note that copying from x to y altered x:

print(x)

prints

array([         0, 1065353216, 1073741824, 1077936128, 1082130432,
       1084227584, 1086324736, 1088421888, 1090519040, 1091567616])
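
The same trick can be wrapped in a small helper. This is only a sketch based on the answer above; the function name and the itemsize check are additions for illustration, not part of the original answer:

import numpy as np

def convert_inplace(a, new_dtype):
    # Reinterpret a's buffer as new_dtype and write the converted values back
    # into that same buffer. Only meaningful when both dtypes have equal itemsize.
    new_dtype = np.dtype(new_dtype)
    if a.dtype.itemsize != new_dtype.itemsize:
        raise ValueError("itemsizes must match for an in-place reinterpretation")
    view = a.view(new_dtype)   # same memory, different interpretation
    view[:] = a                # element-wise cast into the shared buffer
    return view

x = np.arange(10, dtype=np.int32)
y = convert_inplace(x, np.float32)
print(y)   # [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]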

Answer 1

Update: This function only avoids copy if it can, hence this is not the correct answer for this question. unutbu’s answer is the right one.


a = a.astype(numpy.float32, copy=False)

numpy's astype has a copy flag. Why shouldn't we use it?
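
A quick check (an illustration, not from the answer) shows why: with copy=False, astype still has to allocate a new array whenever the target dtype differs, so the original buffer is not reused:

import numpy as np

a = np.arange(5, dtype=np.int32)
b = a.astype(np.float32, copy=False)   # dtype differs, so a copy is made anyway
print(np.shares_memory(a, b))          # False
c = b.astype(np.float32, copy=False)   # dtype already matches: no copy
print(c is b)                          # True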


Answer 2

You can change the array type without converting like this:

a.dtype = numpy.float32

but first you have to change all the integers to something that will be interpreted as the corresponding float. A very slow way to do this would be to use python’s struct module like this:

def toi(i):
    return struct.unpack('i',struct.pack('f',float(i)))[0]

…applied to each member of your array.

But perhaps a faster way would be to utilize numpy’s ctypeslib tools (which I am unfamiliar with)

– edit –

Since ctypeslib doesnt seem to work, then I would proceed with the conversion with the typical numpy.astype method, but proceed in block sizes that are within your memory limits:

a[0:10000] = a[0:10000].astype('float32').view('int32')

…then change the dtype when done.

Here is a function that accomplishes the task for any compatible dtypes (only works for dtypes with same-sized items) and handles arbitrarily-shaped arrays with user-control over block size:

import numpy

def astype_inplace(a, dtype, blocksize=10000):
    oldtype = a.dtype
    newtype = numpy.dtype(dtype)
    assert oldtype.itemsize == newtype.itemsize
    for idx in xrange(0, a.size, blocksize):
        a.flat[idx:idx + blocksize] = \
            a.flat[idx:idx + blocksize].astype(newtype).view(oldtype)
    a.dtype = newtype

a = numpy.random.randint(100,size=100).reshape((10,10))
print a
astype_inplace(a, 'float32')
print a

Answer 3

import numpy as np
arr_float = np.arange(10, dtype=np.float32)
arr_int = arr_float.view(np.int32)

Use view() with a dtype argument to change how the array's buffer is interpreted in place; note that this reinterprets the stored bits rather than converting the values.
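
A quick contrast (added for illustration) between reinterpreting and converting: view keeps the bytes and only changes how they are read, while astype changes the values:

import numpy as np

arr_float = np.arange(3, dtype=np.float32)
print(arr_float.view(np.int32))     # [         0 1065353216 1073741824]  raw bit patterns
print(arr_float.astype(np.int32))   # [0 1 2]                             converted values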


Answer 4

Use this:

In [105]: a
Out[105]: 
array([[15, 30, 88, 31, 33],
       [53, 38, 54, 47, 56],
       [67,  2, 74, 10, 16],
       [86, 33, 15, 51, 32],
       [32, 47, 76, 15, 81]], dtype=int32)

In [106]: float32(a)
Out[106]: 
array([[ 15.,  30.,  88.,  31.,  33.],
       [ 53.,  38.,  54.,  47.,  56.],
       [ 67.,   2.,  74.,  10.,  16.],
       [ 86.,  33.,  15.,  51.,  32.],
       [ 32.,  47.,  76.,  15.,  81.]], dtype=float32)

Answer 5

a = np.subtract(a, 0., dtype=np.float32)
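
Note (an observation added here, not part of the answer): this performs the cast in a single ufunc call, but as written it still allocates a new float32 array rather than reusing a's buffer:

import numpy as np

a = np.arange(5, dtype=np.int32)
b = np.subtract(a, 0., dtype=np.float32)
print(b.dtype, np.shares_memory(a, b))   # float32 False -- a new buffer is allocated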


Combining node.js and Python

Question: Combining node.js and Python

Node.js is a perfect match for our web project, but there are a few computational tasks for which we would prefer Python. We also already have Python code for them. We are highly concerned about speed: what is the most elegant way to call a Python “worker” from node.js in an asynchronous, non-blocking way?


Answer 0

For communication between the node.js and Python processes, I would use Unix sockets if both run on the same server, and TCP/IP sockets otherwise. For the marshalling protocol I would take JSON or Protocol Buffers. If threaded Python turns out to be a bottleneck, consider using Twisted Python, which provides the same event-driven concurrency as node.js does.

If you feel adventurous, learn Clojure (clojurescript, clojure-py) and you'll get the same language that runs and interoperates with existing code on Java, JavaScript (node.js included), the CLR and Python. And you get a superb marshalling protocol simply by using Clojure data structures.
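
As a rough sketch of the sockets-plus-JSON idea (the socket path, the one-JSON-object-per-line framing and the toy computation are invented for illustration), the Python side could be a small asyncio server that node.js connects to with net.createConnection:

import asyncio
import json

async def handle(reader, writer):
    # One JSON object per line: read a request, do the work, write a JSON reply.
    async for line in reader:
        request = json.loads(line)
        reply = {"result": sum(request.get("numbers", []))}   # stand-in for the real computation
        writer.write((json.dumps(reply) + "\n").encode())
        await writer.drain()
    writer.close()

async def main():
    # node.js would connect to the same path, e.g. net.createConnection("/tmp/pyworker.sock")
    server = await asyncio.start_unix_server(handle, path="/tmp/pyworker.sock")
    async with server:
        await server.serve_forever()

asyncio.run(main())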


Answer 1

This sounds like a scenario where zeroMQ would be a good fit. It’s a messaging framework that’s similar to using TCP or Unix sockets, but it’s much more robust (http://zguide.zeromq.org/py:all)

There’s a library that uses zeroMQ to provide an RPC framework that works pretty well. It’s called zeroRPC (http://www.zerorpc.io/). Here’s the hello world.

Python “Hello x” server:

import zerorpc

class HelloRPC(object):
    '''pass the method a name, it replies "Hello name!"'''
    def hello(self, name):
        return "Hello, {0}!".format(name)

def main():
    s = zerorpc.Server(HelloRPC())
    s.bind("tcp://*:4242")
    s.run()

if __name__ == "__main__" : main()

And the node.js client:

var zerorpc = require("zerorpc");

var client = new zerorpc.Client();
client.connect("tcp://127.0.0.1:4242");
//calls the method on the python object
client.invoke("hello", "World", function(error, reply, streaming) {
    if(error){
        console.log("ERROR: ", error);
    }
    console.log(reply);
});

Or vice-versa, node.js server:

var zerorpc = require("zerorpc");

var server = new zerorpc.Server({
    hello: function(name, reply) {
        reply(null, "Hello, " + name, false);
    }
});

server.bind("tcp://0.0.0.0:4242");

And the python client

import zerorpc, sys

c = zerorpc.Client()
c.connect("tcp://127.0.0.1:4242")
name = sys.argv[1] if len(sys.argv) > 1 else "dude"
print c.hello(name)

Answer 2

If you arrange to have your Python worker in a separate process (either a long-running server-type process or a child spawned on demand), your communication with it will be asynchronous on the node.js side. UNIX/TCP sockets and stdin/out/err communication are inherently async in Node.


Answer 3

I’d also consider Apache Thrift: http://thrift.apache.org/

It can bridge between several programming languages, is highly efficient, and has support for async or sync calls. See the full feature list here: http://thrift.apache.org/docs/features/

The multi-language support can be useful for future plans; for example, if you later want to do part of the computational task in C++, it’s very easy to add it to the mix using Thrift.


Answer 4

I’ve had a lot of success using thoonk.js along with thoonk.py. Thoonk leverages Redis (in-memory key-value store) to give you feed (think publish/subscribe), queue and job patterns for communication.

Why is this better than unix sockets or direct tcp sockets? Overall performance may be decreased a little, however Thoonk provides a really simple API that simplifies having to manually deal with a socket. Thoonk also helps make it really trivial to implement a distributed computing model that allows you to scale your python workers to increase performance, since you just spin up new instances of your python workers and connect them to the same redis server.


Answer 5

I’d recommend using a work queue, for example the excellent Gearman, which will provide you with a great way to dispatch background jobs and asynchronously get their results once they’re processed.

The advantage of this approach, used heavily at Digg (among many others), is that it provides a strong, scalable and robust way for workers in any language to speak with clients in any language.


Answer 6

Update 2019

There are several ways to achieve this, and here is the list in increasing order of complexity:

  1. Python Shell, you will write streams to the python console and it will write back to you
  2. Redis Pub Sub, you can have a channel listening in Python while your node js publisher pushes data
  3. Websocket connection where Node acts as the client and Python acts as the server or vice-versa
  4. API connection with Express/Flask/Tornado etc working separately with an API endpoint exposed for the other to query

Approach 1: Python Shell (the simplest approach)

source.js file

const ps = require('python-shell')
// very important to add -u option since our python script runs infinitely
var options = {
    pythonPath: '/Users/zup/.local/share/virtualenvs/python_shell_test-TJN5lQez/bin/python',
    pythonOptions: ['-u'], // get print results in real-time
    // make sure you use an absolute path for scriptPath
    scriptPath: "./subscriber/",
    // args: ['value1', 'value2', 'value3'],
    mode: 'json'
};

const shell = new ps.PythonShell("destination.py", options);

function generateArray() {
    const list = []
    for (let i = 0; i < 1000; i++) {
        list.push(Math.random() * 1000)
    }
    return list
}

setInterval(() => {
    shell.send(generateArray())
}, 1000);

shell.on("message", message => {
    console.log(message);
})

destination.py file

import datetime
import sys
import time
import numpy
import talib
import timeit
import json
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

size = 1000
p = 100
o = numpy.random.random(size)
h = numpy.random.random(size)
l = numpy.random.random(size)
c = numpy.random.random(size)
v = numpy.random.random(size)

def get_indicators(values):
    # Return the RSI of the values sent from node.js
    numpy_values = numpy.array(values, dtype=numpy.double) 
    return talib.func.RSI(numpy_values, 14)

for line in sys.stdin:
    l = json.loads(line)
    print(get_indicators(l))
    # Without this step the output may not be immediately available in node
    sys.stdout.flush()

Notes: make a folder called subscriber at the same level as the source.js file and put destination.py inside it. Don’t forget to change the pythonPath to point to your own virtualenv.
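
For completeness, approach 4 from the list above could look roughly like this on the Python side (a sketch assuming Flask; the route name and payload shape are invented for illustration), with node.js calling the endpoint via fetch or axios:

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/compute", methods=["POST"])
def compute():
    # node.js would POST JSON here, e.g. {"values": [1.0, 2.5, 3.0]}
    values = request.get_json().get("values", [])
    return jsonify({"mean": sum(values) / len(values) if values else 0.0})

if __name__ == "__main__":
    app.run(port=5000)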


What are the drawbacks of Stackless Python? [closed]

Question: What are the drawbacks of Stackless Python? [closed]

I’ve been reading recently about Stackless Python and it seems to have many advantages compared with vanilla cPython. It has all those cool features like infinite recursion, microthreads, continuations, etc. and at the same time is faster than cPython (around 10%, if the Python wiki is to be believed) and compatible with it (at least versions 2.5, 2.6 and 3.0).

All this looks almost too good to be true. However, TANSTAAFL, I don’t see much enthusiasm for Stackless in the Python community, and PEP 219 has never come to realization. Why is that? What are the drawbacks of Stackless? What skeletons are hidden in Stackless’ closet?

(I know Stackless doesn’t offer real concurrency, just an easier way of programming in the concurrent way. It doesn’t really bother me.)


Answer 0

I don’t know where that “Stackless is 10% faster” on the Wiki came from, but then again I’ve never tried to measure those performance numbers. I can’t think of what Stackless does to make a difference that big.

Stackless is an amazing tool with several organizational/political problems.

The first comes from history. Christian Tismer started talking about what eventually became Stackless about 10 years ago. He had an idea of what he wanted, but had a hard time explaining what he was doing and why people should use it. This is partially because his background didn’t have the CS training regarding ideas like coroutines, and because his presentations and discussions are very implementation-oriented, which makes it hard for anyone not already hip-deep in continuations to understand how to use it as a solution to their problems.

For that reason, the initial documentation was poor. There were some descriptions of how to use it, with the best from third-party contributors. At PyCon 2007 I gave a talk on “Using Stackless” which went over quite well, according to the PyCon survey numbers. Richard Tew has done a great job collecting these, updating stackless.com, and maintaining the distribution when new Python releases comes up. He’s an employee of CCP Games, developers of EVE Online, which uses Stackless as an essential part of their gaming system.

CCP Games is also the biggest real-world example people use when they talk about Stackless. The main tutorial for Stackless is Grant Olson’s "Introduction to Concurrent Programming with Stackless Python", which is also game-oriented. I think this gives people a skewed idea that Stackless is games-oriented, when it’s more that games are more easily continuation-oriented.

Another difficulty has been the source code. In its original form it required changes to many parts of Python, which made Guido van Rossum, the Python lead, wary. Part of the reason, I think, was support for call/cc that was later removed as being “too much like supporting a goto when there are better higher-level forms.” I’m not certain about this history, so just read this paragraph as “Stackless used to require too many changes.”

Later releases didn’t require the changes, and Tismer continued to push for its inclusion in Python. While there was some consideration, the official stance (as far as I know) is that CPython is not only a Python implementation but it’s meant as a reference implementation, and it won’t include Stackless functionality because it can’t be implemented by Jython or Iron Python.

There are absolutely no plans for “significant changes to the code base“. That quote and reference hyperlink from Arafangion’s (see the comment) are from roughly 2000/2001. The structural changes have long been done, and it’s what I mentioned above. Stackless as it is now is stable and mature, with only minor tweaks to the code base over the last several years.

One final limitation with Stackless – there is no strong advocate for Stackless. Tismer is now deeply involved with PyPy, which is an implementation of Python for Python. He has implemented the Stackless functionality in PyPy and considers it much superior to Stackless itself, and feels that PyPy is the way of the future. Tew maintains Stackless but he isn’t interested in advocacy. I considered being in that role, but couldn’t see how I could make an income from it.

Though if you want training in Stackless, feel free to contact me! :)


Answer 1

It took quite a long time to find this discussion. At that time I was not on PyPy but had a 2-year affair with psyco, until health stopped this all quite abruptly. I’m now active again and designing an alternative approach; I will present it at EuroPython 2012.

Most of Andrew’s statements are correct. Some minor additions:

Stackless was significantly faster than CPython, 10 years ago, because I optimized the interpreter loop. At that time, Guido was not ready for that. A few years later, people did similar optimizations and even more and better ones, which makes Stackless a little bit slower, as expected.

On inclusion: well, in the beginning I was very pushy and convinced that Stackless is the way to go. Later, when it was almost possible to get included, I lost interest in that and preferred to let it stay this way, partially out of frustration, partially to keep control of Stackless.

The arguments like “other implementations cannot do it” felt always lame to me, as there are other examples where this argument could also be used. I thought I better forget about that and stay in good friendship with Guido, having my own distro.

Meanwhile things are changing again. I’m working on PyPy and Stackless as an extension. Will talk about that some time later.

Cheers — Chris


Answer 2

If I recall correctly, Stackless was slated for inclusion into the official CPython, but the author of Stackless told the CPython folks not to do so, because he planned to do some significant changes to the code base; presumably he wanted the integration done later, when the project was more mature.


Answer 3

I’m also interested in the answers here. I’ve played a bit with Stackless and it looks like it would be a good solid addition to standard Python.

PEP 219 does mention potential difficulties with calling Python code from C code, if Python wants to change to a different stack. There would need to be ways to detect and prevent this (to avoid trashing the C stack). I think this is tractable though, so I’m also wondering why Stackless must stand on its own.


Why does multiprocessing use only a single core after importing numpy?

Question: Why does multiprocessing use only a single core after importing numpy?

I am not sure whether this counts more as an OS issue, but I thought I would ask here in case anyone has some insight from the Python end of things.

I’ve been trying to parallelise a CPU-heavy for loop using joblib, but I find that instead of each worker process being assigned to a different core, I end up with all of them being assigned to the same core and no performance gain.

Here’s a very trivial example…

from joblib import Parallel,delayed
import numpy as np

def testfunc(data):
    # some very boneheaded CPU work
    for nn in xrange(1000):
        for ii in data[0,:]:
            for jj in data[1,:]:
                ii*jj

def run(niter=10):
    data = (np.random.randn(2,100) for ii in xrange(niter))
    pool = Parallel(n_jobs=-1,verbose=1,pre_dispatch='all')
    results = pool(delayed(testfunc)(dd) for dd in data)

if __name__ == '__main__':
    run()

…and here’s what I see in htop while this script is running:

I’m running Ubuntu 12.10 (3.5.0-26) on a laptop with 4 cores. Clearly joblib.Parallel is spawning separate processes for the different workers, but is there any way that I can make these processes execute on different cores?


Answer 0

After some more googling I found the answer here.

It turns out that certain Python modules (numpy, scipy, tables, pandas, skimage…) mess with core affinity on import. As far as I can tell, this problem seems to be specifically caused by them linking against multithreaded OpenBLAS libraries.

A workaround is to reset the task affinity using

os.system("taskset -p 0xff %d" % os.getpid())

With this line pasted in after the module imports, my example now runs on all cores:

My experience so far has been that this doesn’t seem to have any negative effect on numpy’s performance, although this is probably machine- and task-specific.

Update:

There are also two ways to disable the CPU affinity-resetting behaviour of OpenBLAS itself. At run-time you can use the environment variable OPENBLAS_MAIN_FREE (or GOTOBLAS_MAIN_FREE), for example

OPENBLAS_MAIN_FREE=1 python myscript.py

Or alternatively, if you’re compiling OpenBLAS from source you can permanently disable it at build-time by editing the Makefile.rule to contain the line

NO_AFFINITY=1
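
If you would rather keep everything in the script, the same environment variable can also be set from Python before numpy (and therefore OpenBLAS) is loaded. This is a sketch of that idea rather than something stated above; it only works if it runs before the first numpy import:

import os

# Must be set before numpy/OpenBLAS is loaded, because the library reads the
# variable when it initialises.
os.environ["OPENBLAS_MAIN_FREE"] = "1"

import numpy as np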

Answer 1

Python 3 now exposes the methods to directly set the affinity

>>> import os
>>> os.sched_getaffinity(0)
{0, 1, 2, 3}
>>> os.sched_setaffinity(0, {1, 3})
>>> os.sched_getaffinity(0)
{1, 3}
>>> x = {i for i in range(10)}
>>> x
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> os.sched_setaffinity(0, x)
>>> os.sched_getaffinity(0)
{0, 1, 2, 3}
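
Combined with the previous answer, this gives a pure-Python alternative to taskset: after the heavy imports, reset the affinity to every core the machine reports. A sketch (Linux-only, since the sched_* functions are not available on all platforms):

import os
import numpy as np   # importing numpy/OpenBLAS may have clobbered the affinity

os.sched_setaffinity(0, range(os.cpu_count()))   # pin back to all cores
print(os.sched_getaffinity(0))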

Answer 2

This seems to be a common problem with Python on Ubuntu, and is not specific to joblib.

I suggest experimenting with CPU affinity (taskset).


Speed up millions of regex replacements in Python 3

Question: Speed up millions of regex replacements in Python 3

I’m using Python 3.5.2

I have two lists

  • a list of about 750,000 “sentences” (long strings)
  • a list of about 20,000 “words” that I would like to delete from my 750,000 sentences

So, I have to loop through 750,000 sentences and perform about 20,000 replacements, but ONLY if my words are actually “words” and are not part of a larger string of characters.

I am doing this by pre-compiling my words so that they are flanked by the \b metacharacter

compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]

Then I loop through my “sentences”

import re

for sentence in sentences:
  for word in compiled_words:
    sentence = re.sub(word, "", sentence)
  # put sentence into a growing list

This nested loop is processing about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

  • Is there a way to using the str.replace method (which I believe is faster), but still requiring that replacements only happen at word boundaries?

  • Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping over re.sub if the length of my word is > than the length of my sentence, but it’s not much of an improvement.

Thank you for any suggestions.


Answer 0

One thing you can try is to compile one single pattern like "\b(word1|word2|word3)\b".

Because re relies on C code to do the actual matching, the savings can be dramatic.

As @pvg pointed out in the comments, it also benefits from single pass matching.

If your words are not regex, Eric’s answer is faster.
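
A minimal sketch of that idea (the variable names are made up; re.escape is used in case any of the words contain regex metacharacters):

import re

banned_words = ["word1", "word2", "word3"]         # stand-in for the 20,000 words
pattern = re.compile(r"\b(" + "|".join(map(re.escape, banned_words)) + r")\b")

sentences = ["word1 and word2 are removed", "word42 is left alone"]
cleaned = [pattern.sub("", s) for s in sentences]  # one pass over each sentence
print(cleaned)   # [' and  are removed', 'word42 is left alone']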


Answer 1

TLDR

Use this method (with set lookup) if you want the fastest solution. For a dataset similar to the OP’s, it’s approximately 2000 times faster than the accepted answer.

If you insist on using a regex for lookup, use this trie-based version, which is still 1000 times faster than a regex union.

Theory

If your sentences aren’t humongous strings, it’s probably feasible to process many more than 50 per second.

If you save all the banned words into a set, it will be very fast to check if another word is included in that set.

Pack the logic into a function, give this function as argument to re.sub and you’re done!

Code

import re
with open('/usr/share/dict/american-english') as wordbook:
    banned_words = set(word.strip().lower() for word in wordbook)


def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = ["I'm eric. Welcome here!", "Another boring sentence.",
             "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000

word_pattern = re.compile(r'\w+')

for sentence in sentences:
    sentence = word_pattern.sub(delete_banned_words, sentence)

Converted sentences are:

' .  !
  .
GiraffeElephantBoat
sfgsdg sdwerha aswertwe

Note that:

  • the search is case-insensitive (thanks to lower())
  • replacing a word with "" might leave two spaces (as in your code)
  • With python3, \w+ also matches accented characters (e.g. "ångström").
  • Any non-word character (tab, space, newline, marks, …) will stay untouched.

Performance

There are a million sentences, banned_words has almost 100000 words and the script runs in less than 7s.

In comparison, Liteye’s answer needed 160s for 10 thousand sentences.

With n being the total number of words and m the number of banned words, the OP’s and Liteye’s code are O(n*m).

In comparison, my code should run in O(n+m). Considering that there are many more sentences than banned words, the algorithm becomes O(n).

Regex union test

What’s the complexity of a regex search with a '\b(word1|word2|...|wordN)\b' pattern? Is it O(N) or O(1)?

It’s pretty hard to grasp the way the regex engine works, so let’s write a simple test.

This code extracts 10**i random English words into a list. It creates the corresponding regex union and tests it with different words:

  • one is clearly not a word (it begins with #)
  • one is the first word in the list
  • one is the last word in the list
  • one looks like a word but isn’t


import re
import timeit
import random

with open('/usr/share/dict/american-english') as wordbook:
    english_words = [word.strip().lower() for word in wordbook]
    random.shuffle(english_words)

print("First 10 words :")
print(english_words[:10])

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", english_words[0]),
    ("Last word", english_words[-1]),
    ("Almost a word", "couldbeaword")
]


def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nUnion of %d words" % 10**exp)
    union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp]))
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %-17s : %.1fms" % (description, time))

It outputs:

First 10 words :
["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime']

Union of 10 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 0.7ms
  Almost a word     : 0.7ms

Union of 100 words
  Surely not a word : 0.7ms
  First word        : 1.1ms
  Last word         : 1.2ms
  Almost a word     : 1.2ms

Union of 1000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 9.6ms
  Almost a word     : 10.1ms

Union of 10000 words
  Surely not a word : 1.4ms
  First word        : 1.8ms
  Last word         : 96.3ms
  Almost a word     : 116.6ms

Union of 100000 words
  Surely not a word : 0.7ms
  First word        : 0.8ms
  Last word         : 1227.1ms
  Almost a word     : 1404.1ms

So it looks like the search for a single word with a '\b(word1|word2|...|wordN)\b' pattern has:

  • O(1) best case
  • O(n/2) average case, which is still O(n)
  • O(n) worst case

These results are consistent with a simple loop search.

A much faster alternative to a regex union is to create the regex pattern from a trie.


Answer 2

TLDR

Use this method if you want the fastest regex-based solution. For a dataset similar to the OP’s, it’s approximately 1000 times faster than the accepted answer.

If you don’t care about regex, use this set-based version, which is 2000 times faster than a regex union.

Optimized Regex with Trie

A simple Regex union approach becomes slow with many banned words, because the regex engine doesn’t do a very good job of optimizing the pattern.

It’s possible to create a Trie with all the banned words and write the corresponding regex. The resulting trie or regex aren’t really human-readable, but they do allow for very fast lookup and match.

Example

['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']

The list is converted to a trie:

{
    'f': {
        'o': {
            'o': {
                'x': {
                    'a': {
                        'r': {
                            '': 1
                        }
                    }
                },
                'b': {
                    'a': {
                        'r': {
                            '': 1
                        },
                        'h': {
                            '': 1
                        }
                    }
                },
                'z': {
                    'a': {
                        '': 1,
                        'p': {
                            '': 1
                        }
                    }
                }
            }
        }
    }
}

And then to this regex pattern:

r"\bfoo(?:ba[hr]|xar|zap?)\b"

The huge advantage is that to test if zoo matches, the regex engine only needs to compare the first character (it doesn’t match), instead of trying the 5 words. It’s preprocessing overkill for 5 words, but it shows promising results for many thousands of words.

Note that (?:) non-capturing groups are used because:

  • foobar|baz would match foobar and baz, but not foobaz
  • foo(bar|baz) would save unneeded info to a capturing group

Code

Here’s a slightly modified gist, which we can use as a trie.py library:

import re


class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())
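
A tiny usage example (added here, not part of the gist) showing how the class turns a word list into one pattern that can then be used with re.sub:

import re
from trie import Trie

trie = Trie()
for word in ['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']:
    trie.add(word)

union = re.compile(r"\b" + trie.pattern() + r"\b")
print(union.pattern)                         # \bfoo(?:ba[hr]|xar|zap?)\b
print(union.sub("", "foobar says fooza"))    # -> ' says '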

Test

Here’s a small test (the same as this one):

# Encoding: utf-8
import re
import timeit
import random
from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
    random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))

It outputs:

TrieRegex of 10 words
  Surely not a word : 0.3ms
  First word : 0.4ms
  Last word : 0.5ms
  Almost a word : 0.5ms

TrieRegex of 100 words
  Surely not a word : 0.3ms
  First word : 0.5ms
  Last word : 0.9ms
  Almost a word : 0.6ms

TrieRegex of 1000 words
  Surely not a word : 0.3ms
  First word : 0.7ms
  Last word : 0.9ms
  Almost a word : 1.1ms

TrieRegex of 10000 words
  Surely not a word : 0.1ms
  First word : 1.0ms
  Last word : 1.2ms
  Almost a word : 1.2ms

TrieRegex of 100000 words
  Surely not a word : 0.3ms
  First word : 1.2ms
  Last word : 0.9ms
  Almost a word : 1.6ms

For info, the regex begins like this:

(?:a(?:(?:\’s|a(?:\’s|chen|liyah(?:\’s)?|r(?:dvark(?:(?:\’s|s))?|on))|b(?:\’s|a(?:c(?:us(?:(?:\’s|es))?|[ik])|ft|lone(?:(?:\’s|s))?|ndon(?:(?:ed|ing|ment(?:\’s)?|s))?|s(?:e(?:(?:ment(?:\’s)?|[ds]))?|h(?:(?:e[ds]|ing))?|ing)|t(?:e(?:(?:ment(?:\’s)?|[ds]))?|ing|toir(?:(?:\’s|s))?))|b(?:as(?:id)?|e(?:ss(?:(?:\’s|es))?|y(?:(?:\’s|s))?)|ot(?:(?:\’s|t(?:\’s)?|s))?|reviat(?:e[ds]?|i(?:ng|on(?:(?:\’s|s))?))|y(?:\’s)?|\é(?:(?:\’s|s))?)|d(?:icat(?:e[ds]?|i(?:ng|on(?:(?:\’s|s))?))|om(?:en(?:(?:\’s|s))?|inal)|u(?:ct(?:(?:ed|i(?:ng|on(?:(?:\’s|s))?)|or(?:(?:\’s|s))?|s))?|l(?:\’s)?))|e(?:(?:\’s|am|l(?:(?:\’s|ard|son(?:\’s)?))?|r(?:deen(?:\’s)?|nathy(?:\’s)?|ra(?:nt|tion(?:(?:\’s|s))?))|t(?:(?:t(?:e(?:r(?:(?:\’s|s))?|d)|ing|or(?:(?:\’s|s))?)|s))?|yance(?:\’s)?|d))?|hor(?:(?:r(?:e(?:n(?:ce(?:\’s)?|t)|d)|ing)|s))?|i(?:d(?:e[ds]?|ing|jan(?:\’s)?)|gail|l(?:ene|it(?:ies|y(?:\’s)?)))|j(?:ect(?:ly)?|ur(?:ation(?:(?:\’s|s))?|e[ds]?|ing))|l(?:a(?:tive(?:(?:\’s|s))?|ze)|e(?:(?:st|r))?|oom|ution(?:(?:\’s|s))?|y)|m\’s|n(?:e(?:gat(?:e[ds]?|i(?:ng|on(?:\’s)?))|r(?:\’s)?)|ormal(?:(?:it(?:ies|y(?:\’s)?)|ly))?)|o(?:ard|de(?:(?:\’s|s))?|li(?:sh(?:(?:e[ds]|ing))?|tion(?:(?:\’s|ist(?:(?:\’s|s))?))?)|mina(?:bl[ey]|t(?:e[ds]?|i(?:ng|on(?:(?:\’s|s))?)))|r(?:igin(?:al(?:(?:\’s|s))?|e(?:(?:\’s|s))?)|t(?:(?:ed|i(?:ng|on(?:(?:\’s|ist(?:(?:\’s|s))?|s))?|ve)|s))?)|u(?:nd(?:(?:ed|ing|s))?|t)|ve(?:(?:\’s|board))?)|r(?:a(?:cadabra(?:\’s)?|d(?:e[ds]?|ing)|ham(?:\’s)?|m(?:(?:\’s|s))?|si(?:on(?:(?:\’s|s))?|ve(?:(?:\’s|ly|ness(?:\’s)?|s))?))|east|idg(?:e(?:(?:ment(?:(?:\’s|s))?|[ds]))?|ing|ment(?:(?:\’s|s))?)|o(?:ad|gat(?:e[ds]?|i(?:ng|on(?:(?:\’s|s))?)))|upt(?:(?:e(?:st|r)|ly|ness(?:\’s)?))?)|s(?:alom|c(?:ess(?:(?:\’s|e[ds]|ing))?|issa(?:(?:\’s|[es]))?|ond(?:(?:ed|ing|s))?)|en(?:ce(?:(?:\’s|s))?|t(?:(?:e(?:e(?:(?:\’s|ism(?:\’s)?|s))?|d)|ing|ly|s))?)|inth(?:(?:\’s|e(?:\’s)?))?|o(?:l(?:ut(?:e(?:(?:\’s|ly|st?))?|i(?:on(?:\’s)?|sm(?:\’s)?))|v(?:e[ds]?|ing))|r(?:b(?:(?:e(?:n(?:cy(?:\’s)?|t(?:(?:\’s|s))?)|d)|ing|s))?|pti…

It’s really unreadable, but for a list of 100000 banned words, this Trie regex is 1000 times faster than a simple regex union!

Here’s a diagram of the complete trie, exported with trie-python-graphviz and graphviz twopi:


Answer 3

One thing you might want to try is pre-processing the sentences to encode the word boundaries. Basically turn each sentence into a list of words by splitting on word boundaries.

This should be faster, because to process a sentence, you just have to step through each of the words and check if it’s a match.

Currently the regex search is having to go through the entire string again each time, looking for word boundaries, and then “discarding” the result of this work before the next pass.
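
A sketch of what that pre-processing could look like (an illustration of the idea, keeping the separators so the sentence can be rebuilt afterwards):

import re

banned = {"cat", "dog"}                      # stand-in for the 20,000 words
sentence = "the cat sat on the dog's mat"

# Split on runs of non-word characters, keeping the separators.
tokens = re.split(r"(\W+)", sentence)
cleaned = "".join("" if token in banned else token for token in tokens)
print(cleaned)   # "the  sat on the 's mat"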


Answer 4

好吧,这是一个快速简单的解决方案,带有测试仪。

取胜策略:

re.sub(“ \ w +”,repl,sentence)搜索单词。

“ repl”可以是可调用的。我使用了一个执行字典查找的函数,该字典包含要搜索和替换的单词。

这是最简单,最快的解决方案(请参见下面的示例代码中的函数replace4)。

次好的

想法是使用re.split将句子拆分为单词,同时保留分隔符以稍后重建句子。然后,通过简单的字典查找完成替换。

(请参见下面的示例代码中的函数replace3)。

功能示例的时间:

replace1: 0.62 sentences/s
replace2: 7.43 sentences/s
replace3: 48498.03 sentences/s
replace4: 61374.97 sentences/s (...and 240.000/s with PyPy)

…和代码:

#! /bin/env python3
# -*- coding: utf-8

import time, random, re

def replace1( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns:
            sentence = re.sub( "\\b"+search+"\\b", repl, sentence )

def replace2( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns_comp:
            sentence = re.sub( search, repl, sentence )

def replace3( sentences ):
    pd = patterns_dict.get
    for n, sentence in enumerate( sentences ):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)

        # and... done.
        sentence = "".join( pd(w,w) for w in words )

        #~ print( n, sentence )

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w,w)

    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)



# Build test set
test_words = [ ("word%d" % _) for _ in range(50000) ]
test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]

# Create search and replace patterns
patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
patterns_dict = dict( patterns )
patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]


def test( func, num ):
    t = time.time()
    func( test_sentences[:num] )
    print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)))

print( "Sentences", len(test_sentences) )
print( "Words    ", len(test_words) )

test( replace1, 1 )
test( replace2, 10 )
test( replace3, 1000 )
test( replace4, 1000 )

编辑:如果您传入的是全小写的句子列表,并按如下方式修改 repl,还可以忽略大小写

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w.lower(),w)

Well, here’s a quick and easy solution, with test set.

Winning strategy:

re.sub(“\w+”,repl,sentence) searches for words.

“repl” can be a callable. I used a function that performs a dict lookup, and the dict contains the words to search and replace.

This is the simplest and fastest solution (see function replace4 in example code below).

Second best

The idea is to split the sentences into words, using re.split, while conserving the separators to reconstruct the sentences later. Then, replacements are done with a simple dict lookup.

(see function replace3 in example code below).

Timings for example functions:

replace1: 0.62 sentences/s
replace2: 7.43 sentences/s
replace3: 48498.03 sentences/s
replace4: 61374.97 sentences/s (...and 240.000/s with PyPy)

…and code:

#! /bin/env python3
# -*- coding: utf-8

import time, random, re

def replace1( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns:
            sentence = re.sub( "\\b"+search+"\\b", repl, sentence )

def replace2( sentences ):
    for n, sentence in enumerate( sentences ):
        for search, repl in patterns_comp:
            sentence = re.sub( search, repl, sentence )

def replace3( sentences ):
    pd = patterns_dict.get
    for n, sentence in enumerate( sentences ):
        #~ print( n, sentence )
        # Split the sentence on non-word characters.
        # Note: () in split patterns ensure the non-word characters ARE kept
        # and returned in the result list, so we don't mangle the sentence.
        # If ALL separators are spaces, use string.split instead or something.
        # Example:
        #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf")
        #~ ['ab', ' ', 'céé', '? . ', 'd2eéf']
        words = re.split(r"([^\w]+)", sentence)

        # and... done.
        sentence = "".join( pd(w,w) for w in words )

        #~ print( n, sentence )

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w,w)

    for n, sentence in enumerate( sentences ):
        sentence = re.sub(r"\w+", repl, sentence)



# Build test set
test_words = [ ("word%d" % _) for _ in range(50000) ]
test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ]

# Create search and replace patterns
patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ]
patterns_dict = dict( patterns )
patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ]


def test( func, num ):
    t = time.time()
    func( test_sentences[:num] )
    print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t)))

print( "Sentences", len(test_sentences) )
print( "Words    ", len(test_words) )

test( replace1, 1 )
test( replace2, 10 )
test( replace3, 1000 )
test( replace4, 1000 )

Edit: You can also ignore case, if you pass in a lowercased list of sentences and edit repl

def replace4( sentences ):
    pd = patterns_dict.get
    def repl(m):
        w = m.group()
        return pd(w.lower(),w)

回答 5

也许Python不是这里的正确工具。这是Unix工具链中的一个

sed G file         |
tr ' ' '\n'        |
grep -vf blacklist |
awk -v RS= -v OFS=' ' '{$1=$1}1'

假设您的黑名单文件已经过预处理,并添加了字边界。步骤是:将文件转换为双倍行距,将每个句子拆分为每行一个单词,从文件中批量删除黑名单单词,然后合并回行。

这应该至少快一个数量级。

用于从单词中预处理黑名单文件(每行一个单词)

sed 's/.*/\\b&\\b/' words > blacklist

Perhaps Python is not the right tool here. Here is one with the Unix toolchain

sed G file         |
tr ' ' '\n'        |
grep -vf blacklist |
awk -v RS= -v OFS=' ' '{$1=$1}1'

assuming your blacklist file is preprocessed with the word boundaries added. The steps are: convert the file to double spaced, split each sentence to one word per line, mass delete the blacklist words from the file, and merge back the lines.

This should run at least an order of magnitude faster.

For preprocessing the blacklist file from words (one word per line)

sed 's/.*/\\b&\\b/' words > blacklist
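
A rough Python equivalent of that preprocessing one-liner, if you would rather not shell out (file names are the same placeholders as above):

# Wrap every word from "words" (one per line) in \b anchors and write "blacklist".
with open("words") as src, open("blacklist", "w") as dst:
    for line in src:
        word = line.strip()
        if word:
            dst.write(r"\b" + word + r"\b" + "\n")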

回答 6

这个怎么样:

#!/usr/bin/env python3

from __future__ import unicode_literals, print_function
import re
import time
import io

def replace_sentences_1(sentences, banned_words):
    # faster on CPython, but does not use \b as the word separator
    # so result is slightly different than replace_sentences_2()
    def filter_sentence(sentence):
        words = WORD_SPLITTER.split(sentence)
        words_iter = iter(words)
        for word in words_iter:
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word
            yield next(words_iter) # yield the word separator

    WORD_SPLITTER = re.compile(r'(\W+)')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


def replace_sentences_2(sentences, banned_words):
    # slower on CPython, uses \b as separator
    def filter_sentence(sentence):
        boundaries = WORD_BOUNDARY.finditer(sentence)
        current_boundary = 0
        while True:
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            yield sentence[last_word_boundary:current_boundary] # yield the separators
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            word = sentence[last_word_boundary:current_boundary]
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word

    WORD_BOUNDARY = re.compile(r'\b')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
    output.write(sentence.encode('utf-8'))
    output.write(b' .')
print('time:', time.time() - start)

这些解决方案在单词边界处拆分,并在集合中查找每个单词。由于集合查找的摊销复杂度为 O(1),它们整体是 O(n)(n 为输入大小),因此应该比 re.sub 的单词替换方案(Liteyes 的解决方案)更快;使用正则表达式的多选一分支会让 regex 引擎在每个字符处检查单词是否匹配,而不仅仅是在单词边界处。我的解决方案格外小心地保留了原始文本中使用的空白(即不压缩空格,并保留制表符、换行符和其他空白字符),但如果您不在意这一点,从输出中删除它们也应该非常简单。

我在 corpus.txt 上进行了测试,它是从 Gutenberg Project 下载的多本电子书的串联;banned_words.txt 是从 Ubuntu 的单词表(/usr/share/dict/american-english)中随机选择的 20000 个单词。处理 862462 个句子大约需要 30 秒(在 PyPy 上约为一半时间)。我将句子定义为以“. ”分隔的任何内容。

$ # replace_sentences_1()
$ python3 filter_words.py 
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py 
number of sentences: 862462
time: 15.9370770454

$ # replace_sentences_2()
$ python3 filter_words.py 
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py 
number of sentences: 862462
time: 13.1190629005

PyPy特别受益于第二种方法,而CPython在第一种方法上表现更好。上面的代码在Python 2和Python 3上都可以使用。

How about this:

#!/usr/bin/env python3

from __future__ import unicode_literals, print_function
import re
import time
import io

def replace_sentences_1(sentences, banned_words):
    # faster on CPython, but does not use \b as the word separator
    # so result is slightly different than replace_sentences_2()
    def filter_sentence(sentence):
        words = WORD_SPLITTER.split(sentence)
        words_iter = iter(words)
        for word in words_iter:
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word
            yield next(words_iter) # yield the word separator

    WORD_SPLITTER = re.compile(r'(\W+)')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


def replace_sentences_2(sentences, banned_words):
    # slower on CPython, uses \b as separator
    def filter_sentence(sentence):
        boundaries = WORD_BOUNDARY.finditer(sentence)
        current_boundary = 0
        while True:
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            yield sentence[last_word_boundary:current_boundary] # yield the separators
            last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
            word = sentence[last_word_boundary:current_boundary]
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word

    WORD_BOUNDARY = re.compile(r'\b')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))


corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
    output.write(sentence.encode('utf-8'))
    output.write(b' .')
print('time:', time.time() - start)

These solutions split on word boundaries and look up each word in a set. They should be faster than re.sub with word alternates (Liteyes' solution): these solutions are O(n), where n is the size of the input, thanks to the amortized O(1) set lookup, while using regex alternates would cause the regex engine to have to check for word matches at every character rather than just at word boundaries. My solution takes extra care to preserve the whitespace that was used in the original text (i.e. it doesn't compress whitespace and preserves tabs, newlines, and other whitespace characters), but if you decide that you don't care about it, it should be fairly straightforward to remove them from the output.

I tested on corpus.txt, which is a concatenation of multiple eBooks downloaded from the Gutenberg Project, and banned_words.txt is 20000 words randomly picked from Ubuntu’s wordlist (/usr/share/dict/american-english). It takes around 30 seconds to process 862462 sentences (and half of that on PyPy). I’ve defined sentences as anything separated by “. “.

$ # replace_sentences_1()
$ python3 filter_words.py 
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py 
number of sentences: 862462
time: 15.9370770454

$ # replace_sentences_2()
$ python3 filter_words.py 
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py 
number of sentences: 862462
time: 13.1190629005

PyPy particularly benefit more from the second approach, while CPython fared better on the first approach. The above code should work on both Python 2 and 3.


回答 7

实用方法

下述解决方案使用大量内存将所有文本存储在同一字符串中,并降低了复杂度。如果RAM是一个问题,请在使用前三思。

使用join/ split技巧,您可以完全避免循环,从而可以加快算法的速度。

  • 用特殊分隔符连接句子,这些特殊分隔符不包含在句子中:
  • merged_sentences = ' * '.join(sentences)

  • 使用|“或”正则表达式语句为需要从句子中摆脱的所有单词编译一个正则表达式:
  • regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag

  • 用已编译的正则表达式对单词下标,并用特殊的分隔符将其拆分回单独的句子:
  • clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')

    性能

    "".join复杂度为O(n)。这是非常直观的,但是无论如何都会有一个简短的报价来源:

    for (i = 0; i < seqlen; i++) {
        [...]
        sz += PyUnicode_GET_LENGTH(item);

    因此,使用 join/split 的复杂度为 O(words) + 2*O(sentences),仍然是线性的,而初始方法为 2*O(N²)。


    顺便说一句,不要使用多线程。GIL将阻止每个操作,因为您的任务严格地受CPU限制,因此GIL没有机会被释放,但是每个线程将同时发送滴答声,这会导致额外的工作量,甚至导致操作达到无穷大。

    Practical approach

    A solution described below uses a lot of memory to store all the text at the same string and to reduce complexity level. If RAM is an issue think twice before use it.

    With join/split tricks you can avoid loops at all which should speed up the algorithm.

  • Concatenate a sentences with a special delimeter which is not contained by the sentences:
  • merged_sentences = ' * '.join(sentences)
    

  • Compile a single regex for all the words you need to rid from the sentences using | “or” regex statement:
  • regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag
    

  • Subscript the words with the compiled regex and split it by the special delimiter character back to separated sentences:
  • clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')
    

    Performance

    "".join complexity is O(n). This is pretty intuitive but anyway there is a shortened quotation from a source:

    for (i = 0; i < seqlen; i++) {
        [...]
        sz += PyUnicode_GET_LENGTH(item);
    

    Therefore with join/split you have O(words) + 2*O(sentences), which is still linear complexity vs 2*O(N²) with the initial approach.


    By the way, don't use multithreading. The GIL will block each operation because your task is strictly CPU bound, so the GIL has no chance to be released; instead, each thread will contend for it concurrently, which causes extra work and can even slow the operation down drastically.


    回答 8

    将所有句子连接到一个文档中。使用 Aho-Corasick 算法的任何实现(这里有一个)来查找所有“坏”单词。遍历文件,替换每个坏词,并更新其后续已发现单词的偏移量,等等。

    Concatenate all your sentences into one document. Use any implementation of the Aho-Corasick algorithm (here’s one) to locate all your “bad” words. Traverse the file, replacing each bad word, updating the offsets of found words that follow etc.
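
    For example, with the third-party pyahocorasick package (my own choice for illustration; the answer does not name a specific implementation), a sketch of that idea could look like this:

    import ahocorasick  # pip install pyahocorasick

    bad_words = ["rotten", "awful"]                  # hypothetical banned words
    text = "one rotten apple spoils the awful barrel"

    # Build the automaton once; matching is then linear in the length of the text.
    automaton = ahocorasick.Automaton()
    for word in bad_words:
        automaton.add_word(word, word)
    automaton.make_automaton()

    # Blank out every match (note: this matches substrings, not whole words,
    # so real code would also check the surrounding characters).
    result = list(text)
    for end_index, word in automaton.iter(text):
        start_index = end_index - len(word) + 1
        result[start_index:end_index + 1] = [""] * len(word)
    print("".join(result))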


    在python脚本中隐藏密码(仅用于不安全的混淆)

    问题:在python脚本中隐藏密码(仅用于不安全的混淆)

    我有一个python脚本正在创建ODBC连接。ODBC连接是使用连接字符串生成的。在此连接字符串中,我必须包含此连接的用户名和密码。

    有没有一种简便的方法来隐藏文件中的此密码(只是在我编辑文件时没人能读取该密码)?

    I have got a python script which is creating an ODBC connection. The ODBC connection is generated with a connection string. In this connection string I have to include the username and password for this connection.

    Is there an easy way to obscure this password in the file (just that nobody can read the password when I’m editing the file) ?


    回答 0

    Base64编码在标准库中,并且可以阻止肩膀冲浪者:

    >>> import base64
    >>> print(base64.b64encode("password".encode("utf-8")))
    cGFzc3dvcmQ=
    >>> print(base64.b64decode("cGFzc3dvcmQ=").decode("utf-8"))
    password
    

    Base64 encoding is in the standard library and will do to stop shoulder surfers:

    >>> import base64
    >>>  print(base64.b64encode("password".encode("utf-8")))
    cGFzc3dvcmQ=
    >>> print(base64.b64decode("cGFzc3dvcmQ=").decode("utf-8"))
    password
    

    回答 1

    当您需要为远程登录指定密码时,Douglas F Shearer 的方案是 Unix 中公认的解决方案。
    您可以添加 --password-from-file 选项来指定路径,并从文件中读取明文密码。
    然后,该文件可以位于受操作系统保护的、属于用户自己的区域中。它还允许不同的用户自动选用各自的文件。

    对于不允许脚本使用者知道的密码:您可以用提升的权限运行脚本,并让密码文件归该 root/admin 用户所有。

    Douglas F Shearer's approach is the generally approved solution in Unix when you need to specify a password for a remote login.
    You add a --password-from-file option to specify the path and read the plaintext password from a file.
    The file can then be in the user's own area, protected by the operating system. It also allows different users to automatically pick up their own file.

    For passwords that the user of the script isn't allowed to know, you can run the script with elevated permissions and have the password file owned by that root/admin user.
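
    A minimal sketch of that approach (the path and file name here are made up for illustration):

    def read_password(path="/etc/myapp/db_password"):
        # The file should be readable only by the account that runs the script,
        # e.g. created once and locked down with: chmod 600 /etc/myapp/db_password
        with open(path) as f:
            return f.read().strip()

    password = read_password()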


    回答 2

    这是一个简单的方法:

    1. 创建一个python模块-我们称之为peekaboo.py。
    2. 在peekaboo.py中,同时包含密码和需要该密码的任何代码
    3. 通过导入此模块(通过python命令行等)创建一个编译版本-peekaboo.pyc。
    4. 现在,删除peekaboo.py。
    5. 现在,您可以仅依靠peekaboo.pyc来愉快地导入peekaboo。由于peekaboo.pyc是字节编译的,因此临时用户无法读取。

    尽管它容易受到py_to_pyc反编译器的攻击,但它应该比base64解码更加安全。

    Here is a simple method:

    1. Create a python module – let’s call it peekaboo.py.
    2. In peekaboo.py, include both the password and any code needing that password
    3. Create a compiled version – peekaboo.pyc – by importing this module (via python commandline, etc…).
    4. Now, delete peekaboo.py.
    5. You can now happily import peekaboo relying only on peekaboo.pyc. Since peekaboo.pyc is byte compiled it is not readable to the casual user.

    This should be a bit more secure than base64 decoding – although it is vulnerable to a py_to_pyc decompiler.
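
    If you prefer, the compiled file in step 3 can be produced explicitly with the standard py_compile module instead of by importing; on Python 3 the cache normally goes to __pycache__, so a sketch that writes the .pyc next to the source (assuming a local peekaboo.py exists) would be:

    import py_compile

    # Write the byte-compiled file next to the source instead of into __pycache__.
    py_compile.compile("peekaboo.py", cfile="peekaboo.pyc")
    # peekaboo.py can then be deleted; "import peekaboo" keeps working as long as
    # peekaboo.pyc stays on the import path (a so-called sourceless import).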


    回答 3

    如果您在Unix系统上工作,请利用标准Python库中的netrc模块。它从单独的文本文件(.netrc)中读取密码,该文件的格式在此处描述

    这是一个小用法示例:

    import netrc
    
    # Define which host in the .netrc file to use
    HOST = 'mailcluster.loopia.se'
    
    # Read from the .netrc file in your home directory
    secrets = netrc.netrc()
    username, account, password = secrets.authenticators( HOST )
    
    print username, password
    

    If you are working on a Unix system, take advantage of the netrc module in the standard Python library. It reads passwords from a separate text file (.netrc), which has the format described here.

    Here is a small usage example:

    import netrc
    
    # Define which host in the .netrc file to use
    HOST = 'mailcluster.loopia.se'
    
    # Read from the .netrc file in your home directory
    secrets = netrc.netrc()
    username, account, password = secrets.authenticators( HOST )
    
    print username, password
    

    回答 4

    假设用户无法在运行时提供用户名和密码,最好的解决方案可能是一个单独的源文件,其中仅包含用户名和密码的变量初始化,并将其导入到您的主代码中。仅在凭据更改时才需要编辑此文件。否则,如果您只担心记性一般的旁观者,base64 编码可能是最简单的解决方案。ROT13 太容易手动解码,不区分大小写,并且在其加密状态下保留了太多含义。请在 python 脚本之外对您的密码和用户 ID 进行编码,让脚本在运行时解码以供使用。

    为自动化任务提供脚本凭证始终是一个冒险的建议。您的脚本应具有其自己的凭据,并且所使用的帐户应完全不需要访问权限。至少密码应该是长且相当随机的。

    The best solution, assuming the username and password can't be given at runtime by the user, is probably a separate source file containing only variable initialization for the username and password that is imported into your main code. This file would only need editing when the credentials change. Otherwise, if you're only worried about shoulder surfers with average memories, base 64 encoding is probably the easiest solution. ROT13 is just too easy to decode manually, isn't case sensitive and retains too much meaning in its encrypted state. Encode your password and user id outside the python script. Have the script decode it at runtime for use.

    Giving scripts credentials for automated tasks is always a risky proposal. Your script should have its own credentials and the account it uses should have no access other than exactly what is necessary. At least the password should be long and rather random.
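
    A sketch of that separate-source-file idea (module and variable names are hypothetical):

    # credentials.py -- kept out of version control, readable only by the right user
    username = "fred"
    password = "s3cret"

    # main script
    from credentials import username, password

    connection_string = "UID=%s;PWD=%s" % (username, password)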


    回答 5

    如何从脚本外部的文件中导入用户名和密码?这样,即使有人掌握了该脚本,他们也不会自动获得密码。

    How about importing the username and password from a file external to the script? That way even if someone got hold of the script, they wouldn’t automatically get the password.


    回答 6

    base64是满足您简单需求的方法。无需导入任何内容:

    >>> 'your string'.encode('base64')
    'eW91ciBzdHJpbmc=\n'
    >>> _.decode('base64')
    'your string'
    

    base64 is the way to go for your simple needs. There is no need to import anything:

    >>> 'your string'.encode('base64')
    'eW91ciBzdHJpbmc=\n'
    >>> _.decode('base64')
    'your string'
    

    回答 7

    对于python3混淆,使用base64方式有所不同:

    import base64
    base64.b64encode(b'PasswordStringAsStreamOfBytes')

    导致

    b'UGFzc3dvcmRTdHJpbmdBc1N0cmVhbU9mQnl0ZXM='

    注意非正式的字符串表示形式,实际的字符串用引号引起来

    并解码回原始字符串

    base64.b64decode(b'UGFzc3dvcmRTdHJpbmdBc1N0cmVhbU9mQnl0ZXM=')
    b'PasswordStringAsStreamOfBytes'

    在需要字符串对象的地方使用此结果,可以翻译字节对象

    repr = base64.b64decode(b'UGFzc3dvcmRTdHJpbmdBc1N0cmVhbU9mQnl0ZXM=')
    secret = repr.decode('utf-8')
    print(secret)

    有关python3如何处理字节(以及相应的字符串)的更多信息,请参见官方文档

    for python3 obfuscation using base64 is done differently:

    import base64
    base64.b64encode(b'PasswordStringAsStreamOfBytes')
    

    which results in

    b'UGFzc3dvcmRTdHJpbmdBc1N0cmVhbU9mQnl0ZXM='
    

    note the informal string representation, the actual string is in quotes

    and decoding back to the original string

    base64.b64decode(b'UGFzc3dvcmRTdHJpbmdBc1N0cmVhbU9mQnl0ZXM=')
    b'PasswordStringAsStreamOfBytes'
    

    to use this result where string objects are required the bytes object can be translated

    repr = base64.b64decode(b'UGFzc3dvcmRTdHJpbmdBc1N0cmVhbU9mQnl0ZXM=')
    secret = repr.decode('utf-8')
    print(secret)
    

    for more information on how python3 handles bytes (and strings accordingly) please see the official documentation.


    回答 8

    这是一个很常见的问题。通常,您能做的最好的就是

    A)创建某种ceasar密码函数来进行编码/解码(但不是rot13)或

    B)首选方法是在程序可及的范围内使用加密密钥对密码进行编码/解码。您可以在其中使用文件保护来保护访问密钥。

    如果您的应用程序作为服务/守护程序(例如Web服务器)运行,则可以将密钥放入密码保护的密钥库中,并在服务启动过程中输入密码。管理员需要重新启动您的应用程序,但是您对配置密码的保护非常好。

    This is a pretty common problem. Typically the best you can do is to either

    A) create some kind of ceasar cipher function to encode/decode (just not rot13) or

    B) the preferred method is to use an encryption key, within reach of your program, encode/decode the password. In which you can use file protection to protect access the key.

    Along those lines, if your app runs as a service/daemon (like a webserver) you can put your key into a password-protected keystore with the password input as part of the service startup. It'll take an admin to restart your app, but you will have really good protection for your configuration passwords.


    回答 9

    您的操作系统可能提供了用于安全加密数据的工具。例如,在Windows上有DPAPI(数据保护API)。为什么不在第一次运行时要求用户提供其凭据,然后将其松散加密以进行后续运行?

    Your operating system probably provides facilities for encrypting data securely. For instance, on Windows there is DPAPI (data protection API). Why not ask the user for their credentials the first time you run then squirrel them away encrypted for subsequent runs?


    回答 10

    一种更偏向自制的做法,而不是把身份验证/密码/用户名转换成加密信息。FTPLIB 只是示例。“pass.csv”是 csv 文件名。

    将密码保存为CSV格式,如下所示:

    用户名

    用户密码

    (无列标题)

    读取CSV并将其保存到列表中。

    使用列表元素作为认证详细信息。

    完整代码。

    import os
    import ftplib
    import csv 
    cred_detail = []
    os.chdir("Folder where the csv file is stored")
    for row in csv.reader(open("pass.csv","rb")):       
            cred_detail.append(row)
    ftp = ftplib.FTP('server_name',cred_detail[0][0],cred_detail[1][0])

    A more homegrown approach, rather than converting authentication / passwords / username to encrypted details. FTPLIB is just the example. "pass.csv" is the csv file name

    Save password in CSV like below :

    user_name

    user_password

    (With no column heading)

    Reading the CSV and saving it to a list.

    Using list elements as authentication details.

    Full code.

    import os
    import ftplib
    import csv 
    cred_detail = []
    os.chdir("Folder where the csv file is stored")
    for row in csv.reader(open("pass.csv","rb")):       
            cred_detail.append(row)
    ftp = ftplib.FTP('server_name',cred_detail[0][0],cred_detail[1][0])
    

    回答 11

    这是我针对这类需求的代码片段。您基本上是将函数导入或复制到代码中。如果加密文件不存在,getCredentials 将创建该文件并返回一个字典,updateCredentials 则会更新它。

    import os
    
    def getCredentials():
        import base64
    
        splitter='<PC+,DFS/-SHQ.R'
        directory='C:\\PCT'
    
        if not os.path.exists(directory):
            os.makedirs(directory)
    
        try:
            with open(directory+'\\Credentials.txt', 'r') as file:
                cred = file.read()
                file.close()
        except:
            print('I could not file the credentials file. \nSo I dont keep asking you for your email and password everytime you run me, I will be saving an encrypted file at {}.\n'.format(directory))
    
            lanid = base64.b64encode(bytes(input('   LanID: '), encoding='utf-8')).decode('utf-8')  
            email = base64.b64encode(bytes(input('   eMail: '), encoding='utf-8')).decode('utf-8')
            password = base64.b64encode(bytes(input('   PassW: '), encoding='utf-8')).decode('utf-8')
            cred = lanid+splitter+email+splitter+password
            with open(directory+'\\Credentials.txt','w+') as file:
                file.write(cred)
                file.close()
    
        return {'lanid':base64.b64decode(bytes(cred.split(splitter)[0], encoding='utf-8')).decode('utf-8'),
                'email':base64.b64decode(bytes(cred.split(splitter)[1], encoding='utf-8')).decode('utf-8'),
                'password':base64.b64decode(bytes(cred.split(splitter)[2], encoding='utf-8')).decode('utf-8')}
    
    def updateCredentials():
        import base64
    
        splitter='<PC+,DFS/-SHQ.R'
        directory='C:\\PCT'
    
        if not os.path.exists(directory):
            os.makedirs(directory)
    
        print('I will be saving an encrypted file at {}.\n'.format(directory))
    
        lanid = base64.b64encode(bytes(input('   LanID: '), encoding='utf-8')).decode('utf-8')  
        email = base64.b64encode(bytes(input('   eMail: '), encoding='utf-8')).decode('utf-8')
        password = base64.b64encode(bytes(input('   PassW: '), encoding='utf-8')).decode('utf-8')
        cred = lanid+splitter+email+splitter+password
        with open(directory+'\\Credentials.txt','w+') as file:
            file.write(cred)
            file.close()
    
    cred = getCredentials()
    
    updateCredentials()

    Here is my snippet for such a thing. You basically import or copy the function into your code. getCredentials will create the encrypted file if it does not exist and return a dictionary, and updateCredentials will update it.

    import os
    
    def getCredentials():
        import base64
    
        splitter='<PC+,DFS/-SHQ.R'
        directory='C:\\PCT'
    
        if not os.path.exists(directory):
            os.makedirs(directory)
    
        try:
            with open(directory+'\\Credentials.txt', 'r') as file:
                cred = file.read()
                file.close()
        except:
            print('I could not file the credentials file. \nSo I dont keep asking you for your email and password everytime you run me, I will be saving an encrypted file at {}.\n'.format(directory))
    
            lanid = base64.b64encode(bytes(input('   LanID: '), encoding='utf-8')).decode('utf-8')  
            email = base64.b64encode(bytes(input('   eMail: '), encoding='utf-8')).decode('utf-8')
            password = base64.b64encode(bytes(input('   PassW: '), encoding='utf-8')).decode('utf-8')
            cred = lanid+splitter+email+splitter+password
            with open(directory+'\\Credentials.txt','w+') as file:
                file.write(cred)
                file.close()
    
        return {'lanid':base64.b64decode(bytes(cred.split(splitter)[0], encoding='utf-8')).decode('utf-8'),
                'email':base64.b64decode(bytes(cred.split(splitter)[1], encoding='utf-8')).decode('utf-8'),
                'password':base64.b64decode(bytes(cred.split(splitter)[2], encoding='utf-8')).decode('utf-8')}
    
    def updateCredentials():
        import base64
    
        splitter='<PC+,DFS/-SHQ.R'
        directory='C:\\PCT'
    
        if not os.path.exists(directory):
            os.makedirs(directory)
    
        print('I will be saving an encrypted file at {}.\n'.format(directory))
    
        lanid = base64.b64encode(bytes(input('   LanID: '), encoding='utf-8')).decode('utf-8')  
        email = base64.b64encode(bytes(input('   eMail: '), encoding='utf-8')).decode('utf-8')
        password = base64.b64encode(bytes(input('   PassW: '), encoding='utf-8')).decode('utf-8')
        cred = lanid+splitter+email+splitter+password
        with open(directory+'\\Credentials.txt','w+') as file:
            file.write(cred)
            file.close()
    
    cred = getCredentials()
    
    updateCredentials()
    

    回答 12

    将配置信息放置在加密的配置文件中。使用键在代码中查询此信息。将该密钥放在每个环境的单独文件中,不要将其与代码一起存储。

    Place the configuration information in a encrypted config file. Query this info in your code using an key. Place this key in a separate file per environment, and don’t store it with your code.


    回答 13

    你知道 Pit 吗?

    https://pypi.python.org/pypi/pit(仅适用于py2(0.3版))

    https://github.com/yoshiori/pit(它将在py3上运行(当前版本0.4))

    test.py

    from pit import Pit
    
    config = Pit.get('section-name', {'require': {
        'username': 'DEFAULT STRING',
        'password': 'DEFAULT STRING',
        }})
    print(config)

    运行:

    $ python test.py
    {'password': 'my-password', 'username': 'my-name'}

    ~/.pit/default.yml:

    section-name:
      password: my-password
      username: my-name

    Do you know pit?

    https://pypi.python.org/pypi/pit (py2 only (version 0.3))

    https://github.com/yoshiori/pit (it will work on py3 (current version 0.4))

    test.py

    from pit import Pit
    
    config = Pit.get('section-name', {'require': {
        'username': 'DEFAULT STRING',
        'password': 'DEFAULT STRING',
        }})
    print(config)
    

    Run:

    $ python test.py
    {'password': 'my-password', 'username': 'my-name'}
    

    ~/.pit/default.yml:

    section-name:
      password: my-password
      username: my-name
    

    回答 14

    如果在Windows上运行,则可以考虑使用win32crypt库。它允许运行脚本的用户存储和检索受保护的数据(键,密码),因此,密码永远不会以明文或混淆格式存储在代码中。我不确定其他平台是否有等效的实现,因此如果严格使用win32crypt,您的代码将无法移植。

    我相信可以在这里获得该模块:http : //timgolden.me.uk/pywin32-docs/win32crypt.html

    If running on Windows, you could consider using win32crypt library. It allows storage and retrieval of protected data (keys, passwords) by the user that is running the script, thus passwords are never stored in clear text or obfuscated format in your code. I am not sure if there is an equivalent implementation for other platforms, so with the strict use of win32crypt your code is not portable.

    I believe the module can be obtained here: http://timgolden.me.uk/pywin32-docs/win32crypt.html
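
    As a very rough sketch of how that might look with pywin32 (treat the exact call signatures as an assumption and check the win32crypt documentation before relying on this):

    import win32crypt

    # Protect the secret with the current Windows user's DPAPI key.
    blob = win32crypt.CryptProtectData(b"mypassword", "db password", None, None, None, 0)

    # Later, only the same user account can unprotect it.
    description, plaintext = win32crypt.CryptUnprotectData(blob, None, None, None, 0)
    print(plaintext.decode("utf-8"))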


    回答 15

    我执行此操作的方法如下:

    在python shell上:

    >>> from cryptography.fernet import Fernet
    >>> key = Fernet.generate_key()
    >>> print(key)
    b'B8XBLJDiroM3N2nCBuUlzPL06AmfV4XkPJ5OKsPZbC4='
    >>> cipher = Fernet(key)
    >>> password = "thepassword".encode('utf-8')
    >>> token = cipher.encrypt(password)
    >>> print(token)
    b'gAAAAABe_TUP82q1zMR9SZw1LpawRLHjgNLdUOmW31RApwASzeo4qWSZ52ZBYpSrb1kUeXNFoX0tyhe7kWuudNs2Iy7vUwaY7Q=='

    然后,使用以下代码创建一个模块:

    from cryptography.fernet import Fernet
    
    # you store the key and the token
    key = b'B8XBLJDiroM3N2nCBuUlzPL06AmfV4XkPJ5OKsPZbC4='
    token = b'gAAAAABe_TUP82q1zMR9SZw1LpawRLHjgNLdUOmW31RApwASzeo4qWSZ52ZBYpSrb1kUeXNFoX0tyhe7kWuudNs2Iy7vUwaY7Q=='
    
    # create a cipher and decrypt when you need your password
    cipher = Fernet(key)
    
    mypassword = cipher.decrypt(token).decode('utf-8')

    完成此操作后,您可以直接导入mypassword,也可以导入令牌和密码以根据需要进行解密。

    显然,这种方法有一些缺点。如果某人同时拥有令牌和密钥(就像他们拥有脚本一样),则他们可以轻松解密。但是,它的确模糊不清,如果您编译代码(使用Nuitka之类的代码),则至少您的密码不会在十六进制编辑器中显示为纯文本。

    A way that I have done this is as follows:

    At the python shell:

    >>> from cryptography.fernet import Fernet
    >>> key = Fernet.generate_key()
    >>> print(key)
    b'B8XBLJDiroM3N2nCBuUlzPL06AmfV4XkPJ5OKsPZbC4='
    >>> cipher = Fernet(key)
    >>> password = "thepassword".encode('utf-8')
    >>> token = cipher.encrypt(password)
    >>> print(token)
    b'gAAAAABe_TUP82q1zMR9SZw1LpawRLHjgNLdUOmW31RApwASzeo4qWSZ52ZBYpSrb1kUeXNFoX0tyhe7kWuudNs2Iy7vUwaY7Q=='
    

    Then, create a module with the following code:

    from cryptography.fernet import Fernet
    
    # you store the key and the token
    key = b'B8XBLJDiroM3N2nCBuUlzPL06AmfV4XkPJ5OKsPZbC4='
    token = b'gAAAAABe_TUP82q1zMR9SZw1LpawRLHjgNLdUOmW31RApwASzeo4qWSZ52ZBYpSrb1kUeXNFoX0tyhe7kWuudNs2Iy7vUwaY7Q=='
    
    # create a cipher and decrypt when you need your password
    cipher = Fernet(key)
    
    mypassword = cipher.decrypt(token).decode('utf-8')
    

    Once you’ve done this, you can either import mypassword directly or you can import the token and cipher to decrypt as needed.

    Obviously, there are some shortcomings to this approach. If someone has both the token and the key (as they would if they have the script), they can decrypt easily. However it does obfuscate, and if you compile the code (with something like Nuitka) at least your password won’t appear as plain text in a hex editor.


    回答 16

    这并不能完全回答您的问题,但却是相关的。我本来想添加评论,但不允许。我一直在处理同一问题,因此我们决定使用Jenkins将脚本公开给用户。这使我们可以将数据库凭据存储在单独的文件中,该文件在服务器上已加密并受保护,并且非管理员无法访问。它还为我们提供了一些创建UI和限制执行的捷径。

    This doesn’t precisely answer your question, but it’s related. I was going to add as a comment but wasn’t allowed. I’ve been dealing with this same issue, and we have decided to expose the script to the users using Jenkins. This allows us to store the db credentials in a separate file that is encrypted and secured on a server and not accessible to non-admins. It also allows us a bit of a shortcut to creating a UI, and throttling execution.


    回答 17

    您还可以考虑将密码存储在脚本外部并在运行时提供密码的可能性

    例如fred.py

    import os
    username = 'fred'
    password = os.environ.get('PASSWORD', '')
    print(username, password)

    可以像

    $ PASSWORD=password123 python fred.py
    fred password123

    可以通过使用base64(如上所述),在代码中使用不太明显的名称以及使实际密码与代码之间的距离进一步达到“通过模糊性实现安全性” 的目的。

    如果代码位于存储库中,通常将机密存储在存储库之外很有用,因此可以将其添加到~/.bashrc(或添加到Vault或启动脚本中,…)

    export SURNAME=cGFzc3dvcmQxMjM=

    并更改fred.py

    import os
    import base64
    name = 'fred'
    surname = base64.b64decode(os.environ.get('SURNAME', '')).decode('utf-8')
    print(name, surname)

    然后重新登录并

    $ python fred.py
    fred password123

    You could also consider the possibility of storing the password outside the script, and supplying it at runtime

    e.g. fred.py

    import os
    username = 'fred'
    password = os.environ.get('PASSWORD', '')
    print(username, password)
    

    which can be run like

    $ PASSWORD=password123 python fred.py
    fred password123
    

    Extra layers of “security through obscurity” can be achieved by using base64 (as suggested above), using less obvious names in the code and further distancing the actual password from the code.

    If the code is in a repository, it is often useful to store secrets outside it, so one could add this to ~/.bashrc (or to a vault, or a launch script, …)

    export SURNAME=cGFzc3dvcmQxMjM=
    

    and change fred.py to

    import os
    import base64
    name = 'fred'
    surname = base64.b64decode(os.environ.get('SURNAME', '')).decode('utf-8')
    print(name, surname)
    

    then re-login and

    $ python fred.py
    fred password123
    

    回答 18

    为什么不拥有简单的异或?

    优点:

    • 看起来像二进制数据
    • 任何人都无法在不知道键的情况下读取它(即使它是一个字符)

    我到了可以识别普通单词和rot13的简单b64字符串的地步。Xor会让它变得更加困难。

    Why not have a simple xor?

    Advantages:

    • looks like binary data
    • noone can read it without knowing the key (even if it’s a single char)

    I get to the point where I recognize simple b64 strings for common words and rot13 as well. Xor would make it much harder.
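
    A tiny sketch of what such an xor obfuscation could look like (the key and password are made-up values):

    def xor_bytes(data: bytes, key: bytes) -> bytes:
        # Repeating-key xor; applying it twice with the same key restores the data.
        return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

    key = b"k"                                    # even a single-char key "works"
    obfuscated = xor_bytes(b"password123", key)   # store this blob in the script
    print(xor_bytes(obfuscated, key).decode())    # password123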


    回答 19

    import base64
    print(base64.b64encode("password".encode("utf-8")))
    print(base64.b64decode(b'cGFzc3dvcmQ='.decode("utf-8")))
    

    回答 20

    在网上有几种用Python编写的ROT13实用程序-只是谷歌搜索它们。ROT13离线编码字符串,将其复制到源中,然后在传输点解码。

    但这确实是薄弱的保护…

    There are several ROT13 utilities written in Python on the ‘Net — just google for them. ROT13 encode the string offline, copy it into the source, decode at point of transmission.

    But this is really weak protection…
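
    For what it's worth, recent Python does not even need a separate utility; the codecs module ships a rot_13 codec, so a minimal sketch is:

    import codecs

    obfuscated = codecs.encode("password", "rot_13")   # 'cnffjbeq'
    print(codecs.decode(obfuscated, "rot_13"))         # 'password'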


    从IPython Notebook中的日志记录模块获取输出

    问题:从IPython Notebook中的日志记录模块获取输出

    当我在IPython Notebook中运行以下命令时,看不到任何输出:

    import logging
    logging.basicConfig(level=logging.DEBUG)
    logging.debug("test")
    

    有人知道怎么做,这样我才能在笔记本中看到“测试”消息吗?

    When I running the following inside IPython Notebook I don’t see any output:

    import logging
    logging.basicConfig(level=logging.DEBUG)
    logging.debug("test")
    

    Anyone know how to make it so I can see the “test” message inside the notebook?


    回答 0

    请尝试以下操作:

    import logging
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    logging.debug("test")
    

    根据logging.basicConfig

    通过创建带有默认Formatter的StreamHandler并将其添加到根记录器,对记录系统进行基本配置。如果没有为根记录器定义处理程序,则debug(),info(),warning(),error()和critical()函数将自动调用basicConfig()。

    如果根记录器已经为其配置了处理程序,则此功能不执行任何操作。

    似乎ipython笔记本在某处调用basicConfig(或设置处理程序)。

    Try following:

    import logging
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    logging.debug("test")
    

    According to logging.basicConfig:

    Does basic configuration for the logging system by creating a StreamHandler with a default Formatter and adding it to the root logger. The functions debug(), info(), warning(), error() and critical() will call basicConfig() automatically if no handlers are defined for the root logger.

    This function does nothing if the root logger already has handlers configured for it.

    It seems like the IPython notebook calls basicConfig (or sets a handler) somewhere.
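
    One way to check whether that is what is happening in your own session is to look at the root logger's handlers before calling basicConfig (nothing IPython-specific here, just the standard logging API):

    import logging

    root = logging.getLogger()
    print(root.handlers)        # often non-empty inside IPython/Jupyter

    # If a handler is already attached, basicConfig() silently does nothing,
    # which is why setting the level on the root logger directly works instead.
    root.setLevel(logging.DEBUG)
    logging.debug("test")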


    回答 1

    如果仍要使用basicConfig,请像这样重新加载日志记录模块

    from importlib import reload  # Not needed in Python 2
    import logging
    reload(logging)
    logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.DEBUG, datefmt='%I:%M:%S')
    

    If you still want to use basicConfig, reload the logging module like this

    from importlib import reload  # Not needed in Python 2
    import logging
    reload(logging)
    logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.DEBUG, datefmt='%I:%M:%S')
    

    回答 2

    我的理解是 IPython 会话在启动时就已经配置了日志记录,因此 basicConfig 不起作用。下面是对我有效的设置(我真希望它不要这么难看,因为我想在几乎所有笔记本中使用它):

    import logging
    logger = logging.getLogger()
    fhandler = logging.FileHandler(filename='mylog.log', mode='a')
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    fhandler.setFormatter(formatter)
    logger.addHandler(fhandler)
    logger.setLevel(logging.DEBUG)
    

    现在,当我运行时:

    logging.error('hello!')
    logging.debug('This is a debug message')
    logging.info('this is an info message')
    logging.warning('tbllalfhldfhd, warning.')
    

    我在与笔记本相同的目录中得到一个“ mylog.log”文件,其中包含:

    2015-01-28 09:49:25,026 - root - ERROR - hello!
    2015-01-28 09:49:25,028 - root - DEBUG - This is a debug message
    2015-01-28 09:49:25,029 - root - INFO - this is an info message
    2015-01-28 09:49:25,032 - root - WARNING - tbllalfhldfhd, warning.
    

    请注意,如果您在不重新启动IPython会话的情况下重新运行它,则会将重复的条目写入文件,因为现在将定义两个文件处理程序

    My understanding is that the IPython session starts up logging so basicConfig doesn’t work. Here is the setup that works for me (I wish this was not so gross looking since I want to use it for almost all my notebooks):

    import logging
    logger = logging.getLogger()
    fhandler = logging.FileHandler(filename='mylog.log', mode='a')
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    fhandler.setFormatter(formatter)
    logger.addHandler(fhandler)
    logger.setLevel(logging.DEBUG)
    

    Now when I run:

    logging.error('hello!')
    logging.debug('This is a debug message')
    logging.info('this is an info message')
    logging.warning('tbllalfhldfhd, warning.')
    

    I get a “mylog.log” file in the same directory as my notebook that contains:

    2015-01-28 09:49:25,026 - root - ERROR - hello!
    2015-01-28 09:49:25,028 - root - DEBUG - This is a debug message
    2015-01-28 09:49:25,029 - root - INFO - this is an info message
    2015-01-28 09:49:25,032 - root - WARNING - tbllalfhldfhd, warning.
    

    Note that if you rerun this without restarting the IPython session it will write duplicate entries to the file since there would now be two file handlers defined
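
    One way to avoid those duplicate entries when re-running the cell is to drop any previously added file handlers first, e.g. (a sketch based on the setup above):

    import logging

    logger = logging.getLogger()

    # Remove file handlers left over from earlier runs of this cell.
    for handler in list(logger.handlers):
        if isinstance(handler, logging.FileHandler):
            logger.removeHandler(handler)

    fhandler = logging.FileHandler(filename='mylog.log', mode='a')
    fhandler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
    logger.addHandler(fhandler)
    logger.setLevel(logging.DEBUG)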


    回答 3

    请记住,stderr是logging模块的默认流,因此在IPython和Jupyter笔记本中,除非将流配置为stdout,否则可能看不到任何内容:

    import logging
    import sys
    
    logging.basicConfig(format='%(asctime)s | %(levelname)s : %(message)s',
                         level=logging.INFO, stream=sys.stdout)
    
    logging.info('Hello world!')
    

    Bear in mind that stderr is the default stream for the logging module, so in IPython and Jupyter notebooks you might not see anything unless you configure the stream to stdout:

    import logging
    import sys
    
    logging.basicConfig(format='%(asctime)s | %(levelname)s : %(message)s',
                         level=logging.INFO, stream=sys.stdout)
    
    logging.info('Hello world!')
    

    回答 4

    现在对我有用的(Jupyter,笔记本服务器是:5.4.1,IPython 7.0.1)

    import logging
    logging.basicConfig()
    logger = logging.getLogger('Something')
    logger.setLevel(logging.DEBUG)
    

    现在,我可以使用记录器来打印信息,否则,我只会看到默认级别(logging.WARNING)或更高级别的消息。

    What worked for me now (Jupyter, notebook server is: 5.4.1, IPython 7.0.1)

    import logging
    logging.basicConfig()
    logger = logging.getLogger('Something')
    logger.setLevel(logging.DEBUG)
    

    Now I can use logger to print info, otherwise I would see only message from the default level (logging.WARNING) or above.


    回答 5

    您可以通过运行 %config Application.log_level="INFO" 来配置日志记录

    有关更多信息,请参见IPython内核选项。

    You can configure logging by running %config Application.log_level="INFO"

    For more information, see IPython kernel options


    回答 6

    我设置了一个写入文件的记录器,同时希望日志也显示在笔记本中。事实证明,添加文件处理程序会清除默认的流处理程序。

    logger = logging.getLogger()
    
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    
    # Setup file handler
    fhandler  = logging.FileHandler('my.log')
    fhandler.setLevel(logging.DEBUG)
    fhandler.setFormatter(formatter)
    
    # Configure stream handler for the cells
    chandler = logging.StreamHandler()
    chandler.setLevel(logging.DEBUG)
    chandler.setFormatter(formatter)
    
    # Add both handlers
    logger.addHandler(fhandler)
    logger.addHandler(chandler)
    logger.setLevel(logging.DEBUG)
    
    # Show the handlers
    logger.handlers
    
    # Log Something
    logger.info("Test info")
    logger.debug("Test debug")
    logger.error("Test error")

    I set up a logger to write to a file, and I also wanted the output to show up in the notebook. It turns out that adding a file handler clears out the default stream handler.

    logger = logging.getLogger()
    
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    
    # Setup file handler
    fhandler  = logging.FileHandler('my.log')
    fhandler.setLevel(logging.DEBUG)
    fhandler.setFormatter(formatter)
    
    # Configure stream handler for the cells
    chandler = logging.StreamHandler()
    chandler.setLevel(logging.DEBUG)
    chandler.setFormatter(formatter)
    
    # Add both handlers
    logger.addHandler(fhandler)
    logger.addHandler(chandler)
    logger.setLevel(logging.DEBUG)
    
    # Show the handlers
    logger.handlers
    
    # Log Something
    logger.info("Test info")
    logger.debug("Test debug")
    logger.error("Test error")
    

    回答 7

    似乎适用于ipython / jupyter早期版本的解决方案不再起作用。

    这是适用于ipython 7.9.0的有效解决方案(也已通过jupyter服务器6.0.2测试):

    import logging
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    logging.debug("test message")
    
    DEBUG:root:test message

    It seems that solutions that worked for older versions of ipython/jupyter no longer work.

    Here is a working solution for ipython 7.9.0 (also tested with jupyter server 6.0.2):

    import logging
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    logging.debug("test message")
    
    DEBUG:root:test message
    

    根据布尔值列表过滤列表

    问题:根据布尔值列表过滤列表

    我有一个值列表,需要根据布尔值列表中的值进行过滤:

    list_a = [1, 2, 4, 6]
    filter = [True, False, True, False]
    

    我使用以下行生成一个新的过滤列表:

    filtered_list = [i for indx,i in enumerate(list_a) if filter[indx] == True]

    结果是:

    print filtered_list
    [1,4]
    

    这行代码可以正常工作,但(在我看来)有些小题大做,我想知道是否有更简单的方法来实现同样的目标。


    忠告

    以下答案提供了两个好的建议:

    1- 不要像我一样把列表命名为 filter,因为它是内置函数。

    2- 不要像我那样写 if filter[idx]==True 去和 True 比较,因为这是不必要的,只需使用 if filter[idx] 就足够了。

    I have a list of values which I need to filter given the values in a list of booleans:

    list_a = [1, 2, 4, 6]
    filter = [True, False, True, False]
    

    I generate a new filtered list with the following line:

    filtered_list = [i for indx,i in enumerate(list_a) if filter[indx] == True]
    

    which results in:

    print filtered_list
    [1,4]
    

    The line works but looks (to me) a bit overkill and I was wondering if there was a simpler way to achieve the same.


    Advices

    Summary of two good advices given in the answers below:

    1- Don’t name a list filter like I did because it is a built-in function.

    2- Don’t compare things to True like I did with if filter[idx]==True.. since it’s unnecessary. Just using if filter[idx] is enough.


    回答 0

    您正在寻找itertools.compress

    >>> from itertools import compress
    >>> list_a = [1, 2, 4, 6]
    >>> fil = [True, False, True, False]
    >>> list(compress(list_a, fil))
    [1, 4]
    

    时序比较(py3.x):

    >>> list_a = [1, 2, 4, 6]
    >>> fil = [True, False, True, False]
    >>> %timeit list(compress(list_a, fil))
    100000 loops, best of 3: 2.58 us per loop
    >>> %timeit [i for (i, v) in zip(list_a, fil) if v]  #winner
    100000 loops, best of 3: 1.98 us per loop
    
    >>> list_a = [1, 2, 4, 6]*100
    >>> fil = [True, False, True, False]*100
    >>> %timeit list(compress(list_a, fil))              #winner
    10000 loops, best of 3: 24.3 us per loop
    >>> %timeit [i for (i, v) in zip(list_a, fil) if v]
    10000 loops, best of 3: 82 us per loop
    
    >>> list_a = [1, 2, 4, 6]*10000
    >>> fil = [True, False, True, False]*10000
    >>> %timeit list(compress(list_a, fil))              #winner
    1000 loops, best of 3: 1.66 ms per loop
    >>> %timeit [i for (i, v) in zip(list_a, fil) if v] 
    100 loops, best of 3: 7.65 ms per loop
    

    不要把 filter 用作变量名,它是一个内置函数。

    You’re looking for itertools.compress:

    >>> from itertools import compress
    >>> list_a = [1, 2, 4, 6]
    >>> fil = [True, False, True, False]
    >>> list(compress(list_a, fil))
    [1, 4]
    

    Timing comparisons(py3.x):

    >>> list_a = [1, 2, 4, 6]
    >>> fil = [True, False, True, False]
    >>> %timeit list(compress(list_a, fil))
    100000 loops, best of 3: 2.58 us per loop
    >>> %timeit [i for (i, v) in zip(list_a, fil) if v]  #winner
    100000 loops, best of 3: 1.98 us per loop
    
    >>> list_a = [1, 2, 4, 6]*100
    >>> fil = [True, False, True, False]*100
    >>> %timeit list(compress(list_a, fil))              #winner
    10000 loops, best of 3: 24.3 us per loop
    >>> %timeit [i for (i, v) in zip(list_a, fil) if v]
    10000 loops, best of 3: 82 us per loop
    
    >>> list_a = [1, 2, 4, 6]*10000
    >>> fil = [True, False, True, False]*10000
    >>> %timeit list(compress(list_a, fil))              #winner
    1000 loops, best of 3: 1.66 ms per loop
    >>> %timeit [i for (i, v) in zip(list_a, fil) if v] 
    100 loops, best of 3: 7.65 ms per loop
    

    Don’t use filter as a variable name, it is a built-in function.


    回答 1

    像这样:

    filtered_list = [i for (i, v) in zip(list_a, filter) if v]

    使用 zip 是在多个序列上并行迭代的 pythonic 方式,无需任何索引。这里假设两个序列长度相同(最短的序列耗尽后 zip 就会停止)。对这种简单情况使用 itertools 有点大材小用……

    在您的示例中,真正应该停止做的一件事是把值和 True 进行比较,这通常没有必要。无需写 if filter[idx]==True: ...,只需写 if filter[idx]: ... 即可。

    Like so:

    filtered_list = [i for (i, v) in zip(list_a, filter) if v]
    

    Using zip is the pythonic way to iterate over multiple sequences in parallel, without needing any indexing. This assumes both sequences have the same length (zip stops after the shortest runs out). Using itertools for such a simple case is a bit overkill …

    One thing you do in your example you should really stop doing is comparing things to True, this is usually not necessary. Instead of if filter[idx]==True: ..., you can simply write if filter[idx]: ....


    回答 2

    使用numpy:

    In [128]: list_a = np.array([1, 2, 4, 6])
    In [129]: filter = np.array([True, False, True, False])
    In [130]: list_a[filter]
    
    Out[130]: array([1, 4])
    

    或者,如果 list_a 可以是 numpy 数组而 filter 不能,请参见 Alex Szatmary 的答案

    Numpy通常也可以大大提高速度

    In [133]: list_a = [1, 2, 4, 6]*10000
    In [134]: fil = [True, False, True, False]*10000
    In [135]: list_a_np = np.array(list_a)
    In [136]: fil_np = np.array(fil)
    
    In [139]: %timeit list(itertools.compress(list_a, fil))
    1000 loops, best of 3: 625 us per loop
    
    In [140]: %timeit list_a_np[fil_np]
    10000 loops, best of 3: 173 us per loop
    

    With numpy:

    In [128]: list_a = np.array([1, 2, 4, 6])
    In [129]: filter = np.array([True, False, True, False])
    In [130]: list_a[filter]
    
    Out[130]: array([1, 4])
    

    or see Alex Szatmary’s answer if list_a can be a numpy array but not filter

    Numpy usually gives you a big speed boost as well

    In [133]: list_a = [1, 2, 4, 6]*10000
    In [134]: fil = [True, False, True, False]*10000
    In [135]: list_a_np = np.array(list_a)
    In [136]: fil_np = np.array(fil)
    
    In [139]: %timeit list(itertools.compress(list_a, fil))
    1000 loops, best of 3: 625 us per loop
    
    In [140]: %timeit list_a_np[fil_np]
    10000 loops, best of 3: 173 us per loop
    

    回答 3

    为此,请使用numpy,即,如果您有一个数组a,而不是list_a

    a = np.array([1, 2, 4, 6])
    my_filter = np.array([True, False, True, False], dtype=bool)
    a[my_filter]
    > array([1, 4])
    

    To do this using numpy, ie, if you have an array, a, instead of list_a:

    a = np.array([1, 2, 4, 6])
    my_filter = np.array([True, False, True, False], dtype=bool)
    a[my_filter]
    > array([1, 4])
    

    回答 4

    filtered_list = [list_a[i] for i in range(len(list_a)) if filter[i]]
    filtered_list = [list_a[i] for i in range(len(list_a)) if filter[i]]
    

    回答 5

    在 python 3 中,您可以用 list_a[filter] 获取对应 True 的值。要获取对应 False 的值,请使用 list_a[~filter]

    With python 3 you can use list_a[filter] to get True values. To get False values use list_a[~filter]


    Python中的高性能模糊字符串比较,使用Levenshtein或difflib

    问题:Python中的高性能模糊字符串比较,使用Levenshtein或difflib

    我正在进行临床消息标准化(拼写检查),其中我对照900,000个单词的医学词典检查每个给定的单词。我更关心时间的复杂性/性能。

    我想进行模糊字符串比较,但是不确定使用哪个库。

    选项1:

    import Levenshtein
    Levenshtein.ratio('hello world', 'hello')
    
    Result: 0.625

    选项2:

    import difflib
    difflib.SequenceMatcher(None, 'hello world', 'hello').ratio()
    
    Result: 0.625

    在此示例中,两者给出相同的答案。您是否认为在这种情况下两者表现都一样?

    I am doing clinical message normalization (spell check) in which I check each given word against 900,000 word medical dictionary. I am more concern about the time complexity/performance.

    I want to do fuzzy string comparison, but I’m not sure which library to use.

    Option 1:

    import Levenshtein
    Levenshtein.ratio('hello world', 'hello')
    
    Result: 0.625
    

    Option 2:

    import difflib
    difflib.SequenceMatcher(None, 'hello world', 'hello').ratio()
    
    Result: 0.625
    

    In this example both give the same answer. Do you think both perform alike in this case?


    回答 0

    如果您想对Levenshtein和Difflib的相似性进行快速的视觉比较,我计算了约230万本书的书名:

    import codecs, difflib, Levenshtein, distance
    
    with codecs.open("titles.tsv","r","utf-8") as f:
        title_list = f.read().split("\n")[:-1]
    
        for row in title_list:
    
            sr      = row.lower().split("\t")
    
            diffl   = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
            lev     = Levenshtein.ratio(sr[3], sr[4]) 
            sor     = 1 - distance.sorensen(sr[3], sr[4])
            jac     = 1 - distance.jaccard(sr[3], sr[4])
    
            print diffl, lev, sor, jac

    然后,我用R绘制结果:

    出于好奇,我还比较了Difflib,Levenshtein,Sørensen和Jaccard相似度值:

    library(ggplot2)
    require(GGally)
    
    difflib <- read.table("similarity_measures.txt", sep = " ")
    colnames(difflib) <- c("difflib", "levenshtein", "sorensen", "jaccard")
    
    ggpairs(difflib)

    结果:

    Difflib / Levenshtein的相似性确实很有趣。

    2018编辑:如果您要识别相似的字符串,还可以了解一下 minhashing,这里有一个很棒的概述。Minhashing 能在线性时间内在大型文本集合中发现相似之处,非常出色。我的实验室构建了一个使用 minhashing 检测并可视化文本重用的应用程序:https://github.com/YaleDHLab/intertext

    In case you’re interested in a quick visual comparison of Levenshtein and Difflib similarity, I calculated both for ~2.3 million book titles:

    import codecs, difflib, Levenshtein, distance
    
    with codecs.open("titles.tsv","r","utf-8") as f:
        title_list = f.read().split("\n")[:-1]
    
        for row in title_list:
    
            sr      = row.lower().split("\t")
    
            diffl   = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
            lev     = Levenshtein.ratio(sr[3], sr[4]) 
            sor     = 1 - distance.sorensen(sr[3], sr[4])
            jac     = 1 - distance.jaccard(sr[3], sr[4])
    
            print diffl, lev, sor, jac
    

    I then plotted the results with R:

    Strictly for the curious, I also compared the Difflib, Levenshtein, Sørensen, and Jaccard similarity values:

    library(ggplot2)
    require(GGally)
    
    difflib <- read.table("similarity_measures.txt", sep = " ")
    colnames(difflib) <- c("difflib", "levenshtein", "sorensen", "jaccard")
    
    ggpairs(difflib)
    

    Result:

    The Difflib / Levenshtein similarity really is quite interesting.

    2018 edit: If you’re working on identifying similar strings, you could also check out minhashing–there’s a great overview here. Minhashing is amazing at finding similarities in large text collections in linear time. My lab put together an app that detects and visualizes text reuse using minhashing here: https://github.com/YaleDHLab/intertext


    回答 1

    • difflib.SequenceMatcher 使用 Ratcliff/Obershelp 算法,计算匹配字符数的两倍除以两个字符串中的字符总数。

    • Levenshtein使用Levenshtein算法,它计算将一个字符串转换为另一个字符串所需的最少编辑次数

    复杂

    SequenceMatcher 在最坏情况下是二次时间,其预期情况下的行为以复杂的方式取决于两个序列有多少共同元素。(出处见此处)

    Levenshtein为O(m * n),其中n和m是两个输入字符串的长度。

    性能

    根据Levenshtein模块的源代码:Levenshtein与difflib(SequenceMatcher)有一些重叠。它仅支持字符串,不支持任意序列类型,但另一方面,它要快得多。

    • difflib.SequenceMatcher uses the Ratcliff/Obershelp algorithm it computes the doubled number of matching characters divided by the total number of characters in the two strings.

    • Levenshtein uses Levenshtein algorithm it computes the minimum number of edits needed to transform one string into the other

    Complexity

    SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common. (from here)

    Levenshtein is O(m*n), where n and m are the length of the two input strings.

    Performance

    According to the source code of the Levenshtein module : Levenshtein has a some overlap with difflib (SequenceMatcher). It supports only strings, not arbitrary sequence types, but on the other hand it’s much faster.
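
    Given the 900,000-word dictionary, it is worth timing both on your own data rather than trusting general claims; a minimal sketch (assumes the python-Levenshtein package is installed):

    import timeit

    setup = "import difflib, Levenshtein; a, b = 'hello world', 'hello'"
    print(timeit.timeit("Levenshtein.ratio(a, b)", setup=setup, number=100000))
    print(timeit.timeit("difflib.SequenceMatcher(None, a, b).ratio()", setup=setup, number=100000))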


    __getattr__在模块上

    问题:__getattr__在模块上

    如何在模块上实现与类的 __getattr__ 等效的功能?

    当调用模块的静态定义的属性中不存在的函数时,我希望在该模块中创建一个类的实例,并使用与该模块上的属性查找失败相同的名称调用该方法。

    class A(object):
        def salutation(self, accusative):
            print "hello", accusative
    
    # note this function is intentionally on the module, and not the class above
    def __getattr__(mod, name):
        return getattr(A(), name)
    
    if __name__ == "__main__":
        # i hope here to have my __getattr__ function above invoked, since
        # salutation does not exist in the current namespace
        salutation("world")

    这使:

    matt@stanley:~/Desktop$ python getattrmod.py 
    Traceback (most recent call last):
      File "getattrmod.py", line 9, in <module>
        salutation("world")
    NameError: name 'salutation' is not defined

    How can I implement the equivalent of a class's __getattr__ on a module?

    Example

    When calling a function that does not exist in a module’s statically defined attributes, I wish to create an instance of a class in that module, and invoke the method on it with the same name as failed in the attribute lookup on the module.

    class A(object):
        def salutation(self, accusative):
            print "hello", accusative
    
    # note this function is intentionally on the module, and not the class above
    def __getattr__(mod, name):
        return getattr(A(), name)
    
    if __name__ == "__main__":
        # i hope here to have my __getattr__ function above invoked, since
        # salutation does not exist in the current namespace
        salutation("world")
    

    Which gives:

    matt@stanley:~/Desktop$ python getattrmod.py 
    Traceback (most recent call last):
      File "getattrmod.py", line 9, in <module>
        salutation("world")
    NameError: name 'salutation' is not defined
    

    回答 0

    不久前,Guido 宣布新式类上的所有特殊方法查找都会绕过 __getattr__ 和 __getattribute__。在这些技巧失效之前,dunder 方法曾经对模块有效:例如,只需定义 __enter__ 和 __exit__,就可以把模块用作上下文管理器。

    最近,其中一些历史功能卷土重来,模块级 __getattr__ 就是其中之一,因此不再需要原来的 hack(在导入时用一个类实例替换 sys.modules 中的模块)。

    在 Python 3.7+ 中,只需使用一种显而易见的方法:要自定义模块上的属性访问,请在模块级别定义一个 __getattr__ 函数,它接受一个参数(属性名称),并返回计算出的值或引发 AttributeError:

    # my_module.py
    
    def __getattr__(name: str) -> Any:
        ...

    这也允许挂钩 “from” 导入,即你可以为诸如 from my_module import whatever 之类的语句返回动态生成的对象。

    与此相关的是,除了模块级 __getattr__ 之外,您还可以在模块级别定义一个 __dir__ 函数来响应 dir(my_module)。有关详细信息,请参见 PEP 562。

    A while ago, Guido declared that all special method lookups on new-style classes bypass __getattr__ and __getattribute__. Dunder methods had previously worked on modules – you could, for example, use a module as a context manager simply by defining __enter__ and __exit__, before those tricks broke.

    Recently some historical features have made a comeback, the module __getattr__ among them, and so the existing hack (a module replacing itself with a class in sys.modules at import time) should no longer be necessary.

    In Python 3.7+, you just use the one obvious way. To customize attribute access on a module, define a __getattr__ function at the module level which should accept one argument (name of attribute), and return the computed value or raise an AttributeError:

    # my_module.py
    
    def __getattr__(name: str) -> Any:
        ...
    

    This will also allow hooks into “from” imports, i.e. you can return dynamically generated objects for statements such as from my_module import whatever.

    On a related note, along with the module __getattr__ you may also define a __dir__ function at module level to respond to dir(my_module). See PEP 562 for details.
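
    As a concrete illustration, here is a minimal runnable sketch of the PEP 562 mechanism; the module name my_module and the DEFAULTS table are made up for this example:

    # my_module.py (Python 3.7+)
    from typing import Any

    DEFAULTS = {"answer": 42}

    def __getattr__(name: str) -> Any:
        # Called only when normal module attribute lookup fails.
        try:
            return DEFAULTS[name]
        except KeyError:
            raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

    def __dir__():
        # Make the dynamic names show up in dir(my_module).
        return sorted(list(globals()) + list(DEFAULTS))

    With this in place, import my_module; my_module.answer returns 42, and from my_module import answer works as well, while unknown names still raise AttributeError.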


    回答 1

    您在这里遇到两个基本问题:

    1. __xxx__ 方法只在类上查找
    2. TypeError: can't set attributes of built-in/extension type 'module'

    (1)表示任何解决方案还必须跟踪正在检查的模块,否则每个模块将具有实例替换行为;(2)表示(1)甚至是不可能的……至少不是直接的。

    幸运的是,sys.modules 对放进去的是什么并不挑剔,因此可以使用一个包装器,但它只对模块访问有效(即 import somemodule; somemodule.salutation('world'));对于同一模块内部的访问,您几乎必须把方法从替换类中取出并加入 globals(),要么通过类上的自定义方法(我喜欢用 .export()),要么通过一个通用函数(例如已经作为答案列出的那些)。要记住一件事:如果包装器每次都创建一个新实例,而 globals 方案不会,那么两者的行为会有细微差别。哦,而且您不能同时使用两者,只能二选一。


    更新资料

    来自 Guido van Rossum:

    实际上,有一种偶尔会被使用并推荐的 hack:模块可以定义一个具有所需功能的类,然后在最后,用该类的实例替换 sys.modules 中的自身(如果你坚持,也可以用类本身,但这通常用处不大)。例如:

    # module foo.py
    
    import sys
    
    class Foo:
        def funct1(self, <args>): <code>
        def funct2(self, <args>): <code>
    
    sys.modules[__name__] = Foo()

    之所以可行,是因为导入机制在主动支持这个 hack,并且在加载的最后一步会把实际的模块对象从 sys.modules 中取出。(这绝非偶然。这个 hack 很早以前就被提出,我们觉得足够喜欢它,于是在导入机制中提供了支持。)

    因此,完成所需操作的既定方法是:在模块中创建一个类,并在模块的最后一步用该类的实例替换 sys.modules[__name__],现在您就可以根据需要使用 __getattr__/__setattr__/__getattribute__ 了。


    注意1:如果您使用此功能,则在进行sys.modules分配时,模块中的所有其他内容(例如全局变量,其他函数等)都会丢失-因此请确保所需的所有内容都在替换类之内。

    注意2:要支持 from module import *,您必须在类中定义 __all__;例如:

    class Foo:
        def funct1(self, <args>): <code>
        def funct2(self, <args>): <code>
        __all__ = list(set(vars().keys()) - {'__module__', '__qualname__'})

    根据您的 Python 版本,可能还需要从 __all__ 中省略其他名称。如果不需要 Python 2 兼容性,可以省略 set()。

    There are two basic problems you are running into here:

    1. __xxx__ methods are only looked up on the class
    2. TypeError: can't set attributes of built-in/extension type 'module'

    (1) means any solution would have to also keep track of which module was being examined, otherwise every module would then have the instance-substitution behavior; and (2) means that (1) isn’t even possible… at least not directly.

    Fortunately, sys.modules is not picky about what goes there so a wrapper will work, but only for module access (i.e. import somemodule; somemodule.salutation('world')); for same-module access you pretty much have to yank the methods from the substitution class and add them to globals(), either with a custom method on the class (I like using .export()) or with a generic function (such as those already listed as answers). One thing to keep in mind: if the wrapper is creating a new instance each time, and the globals solution is not, you end up with subtly different behavior. Oh, and you don’t get to use both at the same time — it’s one or the other.


    Update

    From Guido van Rossum:

    There is actually a hack that is occasionally used and recommended: a module can define a class with the desired functionality, and then at the end, replace itself in sys.modules with an instance of that class (or with the class, if you insist, but that’s generally less useful). E.g.:

    # module foo.py
    
    import sys
    
    class Foo:
        def funct1(self, <args>): <code>
        def funct2(self, <args>): <code>
    
    sys.modules[__name__] = Foo()
    

    This works because the import machinery is actively enabling this hack, and as its final step pulls the actual module out of sys.modules, after loading it. (This is no accident. The hack was proposed long ago and we decided we liked it enough to support it in the import machinery.)

    So the established way to accomplish what you want is to create a single class in your module, and as the last act of the module replace sys.modules[__name__] with an instance of your class — and now you can play with __getattr__/__setattr__/__getattribute__ as needed.


    Note 1: If you use this functionality then anything else in the module, such as globals, other functions, etc., will be lost when the sys.modules assignment is made — so make sure everything needed is inside the replacement class.

    Note 2: To support from module import * you must have __all__ defined in the class; for example:

    class Foo:
        def funct1(self, <args>): <code>
        def funct2(self, <args>): <code>
        __all__ = list(set(vars().keys()) - {'__module__', '__qualname__'})
    

    Depending on your Python version, there may be other names to omit from __all__. The set() can be omitted if Python 2 compatibility is not needed.
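
    To make the pattern concrete, here is a small runnable sketch of the sys.modules replacement (Python 3 syntax); the module name foo and the fallback behaviour are illustrative only:

    # foo.py
    import sys

    class _Foo(object):
        def salutation(self, accusative):
            print("hello", accusative)

        def __getattr__(self, name):
            # Any attribute not found on the instance ends up here, so the
            # "module" can now compute attributes dynamically if it wants to.
            raise AttributeError(f"module 'foo' has no attribute {name!r}")

    # Must be the last statement: globals defined after this point are lost
    # to importers, as Note 1 above explains.
    sys.modules[__name__] = _Foo()

    A client can then simply do import foo; foo.salutation("world").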


    回答 2

    这是一个技巧,但是您可以使用一个类包装模块:

    import sys

    class Wrapper(object):
      def __init__(self, wrapped):
        self.wrapped = wrapped
      def __getattr__(self, name):
        # Perform custom logic here
        try:
          return getattr(self.wrapped, name)
        except AttributeError:
          return 'default' # Some sensible default
    
    sys.modules[__name__] = Wrapper(sys.modules[__name__])

    This is a hack, but you can wrap the module with a class:

    import sys

    class Wrapper(object):
      def __init__(self, wrapped):
        self.wrapped = wrapped
      def __getattr__(self, name):
        # Perform custom logic here
        try:
          return getattr(self.wrapped, name)
        except AttributeError:
          return 'default' # Some sensible default
    
    sys.modules[__name__] = Wrapper(sys.modules[__name__])
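
    With the wrapper in place (assuming the file above is called wrapped_module.py, a name made up here), client code sees the fallback transparently:

    import wrapped_module

    print(wrapped_module.some_missing_name)  # prints 'default'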
    

    回答 3

    我们通常不那样做。

    我们要做的就是这个。

    class A(object):
    ....
    
    # The implicit global instance
    a= A()
    
    def salutation( *arg, **kw ):
        a.salutation( *arg, **kw )

    为什么?使隐式全局实例可见。

    例如,查看random模块,该模块创建一个隐式全局实例,以稍微简化您需要“简单”随机数生成器的用例。

    We don’t usually do it that way.

    What we do is this.

    class A(object):
    ....
    
    # The implicit global instance
    a= A()
    
    def salutation( *arg, **kw ):
        a.salutation( *arg, **kw )
    

    Why? So that the implicit global instance is visible.

    For examples, look at the random module, which creates an implicit global instance to slightly simplify the use cases where you want a “simple” random number generator.
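
    For comparison, a minimal runnable version of this pattern (the class and function names mirror the snippet above and are otherwise arbitrary) looks like this in Python 3:

    # a_module.py
    class A(object):
        def salutation(self, accusative):
            print("hello", accusative)

    # The implicit, but perfectly visible, global instance.
    _inst = A()

    def salutation(*args, **kwargs):
        return _inst.salutation(*args, **kwargs)

    This is essentially what the standard random module does: module-level functions such as random.random() delegate to a hidden random.Random() instance.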


    回答 4

    与 @Håvard S 提出的方案类似,当我需要在模块上实现一些魔法(例如 __getattr__)时,我会定义一个继承自 types.ModuleType 的新类,并把它放进 sys.modules(可能会替换掉定义这个自定义 ModuleType 的那个模块)。

    请参阅 Werkzeug 的主 __init__.py 文件,那里有一个相当健壮的实现。

    Similar to what @Håvard S proposed, in a case where I needed to implement some magic on a module (like __getattr__), I would define a new class that inherits from types.ModuleType and put that in sys.modules (probably replacing the module where my custom ModuleType was defined).

    See the main __init__.py file of Werkzeug for a fairly robust implementation of this.
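
    A minimal sketch of that approach, using the __class__ reassignment that modern Python (3.5+) allows on module objects so the module keeps its own globals; the attribute name lazy_value is invented for the example:

    # mod.py
    import sys
    import types

    class _MagicModule(types.ModuleType):
        def __getattr__(self, name):
            # Only called for names not found on the module itself.
            if name == "lazy_value":
                return 42
            raise AttributeError(
                f"module {self.__name__!r} has no attribute {name!r}")

    sys.modules[__name__].__class__ = _MagicModule

    After import mod, mod.lazy_value evaluates to 42 while normal attributes behave exactly as before.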


    回答 5

    这有点像 hack,但是…

    import types
    
    class A(object):
        def salutation(self, accusative):
            print "hello", accusative
    
        def farewell(self, greeting, accusative):
             print greeting, accusative
    
    def AddGlobalAttribute(classname, methodname):
        print "Adding " + classname + "." + methodname + "()"
        def genericFunction(*args):
            return globals()[classname]().__getattribute__(methodname)(*args)
        globals()[methodname] = genericFunction
    
    # set up the global namespace
    
    x = 0   # X and Y are here to add them implicitly to globals, so
    y = 0   # globals does not change as we iterate over it.
    
    toAdd = []
    
    def isCallableMethod(classname, methodname):
        someclass = globals()[classname]()
        something = someclass.__getattribute__(methodname)
        return callable(something)
    
    
    for x in globals():
        print "Looking at", x
        if isinstance(globals()[x], (types.ClassType, type)):
            print "Found Class:", x
            for y in dir(globals()[x]):
                if y.find("__") == -1: # hack to ignore default methods
                    if isCallableMethod(x,y):
                        if y not in globals(): # don't override existing global names
                            toAdd.append((x,y))
    
    
    for x in toAdd:
        AddGlobalAttribute(*x)
    
    
    if __name__ == "__main__":
        salutation("world")
        farewell("goodbye", "world")

    它的原理是遍历全局命名空间中的所有对象。如果某一项是类,则遍历该类的属性。如果属性是可调用的,就把它作为函数添加到全局命名空间中。

    它会忽略所有包含“__”的属性。

    我不会在生产代码中使用它,但是它应该可以帮助您入门。

    This is hackish, but…

    import types
    
    class A(object):
        def salutation(self, accusative):
            print "hello", accusative
    
        def farewell(self, greeting, accusative):
             print greeting, accusative
    
    def AddGlobalAttribute(classname, methodname):
        print "Adding " + classname + "." + methodname + "()"
        def genericFunction(*args):
            return globals()[classname]().__getattribute__(methodname)(*args)
        globals()[methodname] = genericFunction
    
    # set up the global namespace
    
    x = 0   # X and Y are here to add them implicitly to globals, so
    y = 0   # globals does not change as we iterate over it.
    
    toAdd = []
    
    def isCallableMethod(classname, methodname):
        someclass = globals()[classname]()
        something = someclass.__getattribute__(methodname)
        return callable(something)
    
    
    for x in globals():
        print "Looking at", x
        if isinstance(globals()[x], (types.ClassType, type)):
            print "Found Class:", x
            for y in dir(globals()[x]):
                if y.find("__") == -1: # hack to ignore default methods
                    if isCallableMethod(x,y):
                        if y not in globals(): # don't override existing global names
                            toAdd.append((x,y))
    
    
    for x in toAdd:
        AddGlobalAttribute(*x)
    
    
    if __name__ == "__main__":
        salutation("world")
        farewell("goodbye", "world")
    

    This works by iterating over all the objects in the global namespace. If the item is a class, it iterates over the class attributes. If the attribute is callable it adds it to the global namespace as a function.

    It ignores all attributes which contain “__”.

    I wouldn’t use this in production code, but it should get you started.


    回答 6

    这是我自己微不足道的贡献:对 @Håvard S 获得高票的答案稍作修饰,但更明确一些(因此 @S.Lott 也许可以接受,尽管对 OP 来说可能还不够好):

    import sys
    
    class A(object):
        def salutation(self, accusative):
            print "hello", accusative
    
    class Wrapper(object):
        def __init__(self, wrapped):
            self.wrapped = wrapped
    
        def __getattr__(self, name):
            try:
                return getattr(self.wrapped, name)
            except AttributeError:
                return getattr(A(), name)
    
    _globals = sys.modules[__name__] = Wrapper(sys.modules[__name__])
    
    if __name__ == "__main__":
        _globals.salutation("world")

    Here’s my own humble contribution — a slight embellishment of @Håvard S’s highly rated answer, but a bit more explicit (so it might be acceptable to @S.Lott, even though probably not good enough for the OP):

    import sys
    
    class A(object):
        def salutation(self, accusative):
            print "hello", accusative
    
    class Wrapper(object):
        def __init__(self, wrapped):
            self.wrapped = wrapped
    
        def __getattr__(self, name):
            try:
                return getattr(self.wrapped, name)
            except AttributeError:
                return getattr(A(), name)
    
    _globals = sys.modules[__name__] = Wrapper(sys.modules[__name__])
    
    if __name__ == "__main__":
        _globals.salutation("world")
    

    回答 7

    创建一个包含您的类的模块文件。导入该模块。在刚导入的模块上运行 getattr。您可以使用 __import__ 进行动态导入,然后从 sys.modules 中取出该模块。

    这是您的模块some_module.py

    class Foo(object):
        pass
    
    class Bar(object):
        pass

    在另一个模块中:

    import some_module
    
    Foo = getattr(some_module, 'Foo')

    动态地执行此操作:

    import sys
    
    __import__('some_module')
    mod = sys.modules['some_module']
    Foo = getattr(mod, 'Foo')

    Create your module file that has your classes. Import the module. Run getattr on the module you just imported. You can do a dynamic import using __import__ and pull the module from sys.modules.

    Here’s your module some_module.py:

    class Foo(object):
        pass
    
    class Bar(object):
        pass
    

    And in another module:

    import some_module
    
    Foo = getattr(some_module, 'Foo')
    

    Doing this dynamically:

    import sys
    
    __import__('some_module')
    mod = sys.modules['some_module']
    Foo = getattr(mod, 'Foo')
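
    In current Python, importlib.import_module is the recommended wrapper around __import__ for this kind of dynamic lookup; a brief sketch (still using the hypothetical some_module from above):

    import importlib

    mod = importlib.import_module('some_module')
    Foo = getattr(mod, 'Foo')
    instance = Foo()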
    
