Tag archive: Python

How can I pass data from Flask to JavaScript in a template?

Question: How can I pass data from Flask to JavaScript in a template?

My app makes a call to an API that returns a dictionary. I want to pass information from this dict to JavaScript in the view. I am using the Google Maps API in the JS, specifically, so I’d like to pass it a list of tuples with the long/lat information. I know that render_template will pass these variables to the view so they can be used in HTML, but how could I pass them to JavaScript in the template?

from flask import Flask
from flask import render_template

app = Flask(__name__)

import foo_api

api = foo_api.API('API KEY')

@app.route('/')
def get_data():
    event = api.call(get_event, arg0, arg1)
    geocode = event['latitude'], event['longitude']
    return render_template('get_data.html', geocode=geocode)

Answer 0

You can use {{ variable }} anywhere in your template, not just in the HTML part. So this should work:

<html>
<head>
  <script>
    var someJavaScriptVar = '{{ geocode[1] }}';
  </script>
</head>
<body>
  <p>Hello World</p>
  <button onclick="alert('Geocode: {{ geocode[0] }} ' + someJavaScriptVar)" />
</body>
</html>

Think of it as a two-stage process: first, Jinja (the template engine Flask uses) generates your text output. That output is sent to the user's browser, which executes whatever JavaScript it contains. If you want your Flask variable to be available in JavaScript as an array, you have to generate an array definition in your output:

<html>
  <head>
    <script>
      var myGeocode = ['{{ geocode[0] }}', '{{ geocode[1] }}'];
    </script>
  </head>
  <body>
    <p>Hello World</p>
    <button onclick="alert('Geocode: ' + myGeocode[0] + ' ' + myGeocode[1])" />
  </body>
</html>

Jinja also offers more advanced constructs, so you can shorten it with the join filter (which, unlike Python's str.join, also accepts non-string items such as floats):

<html>
<head>
  <script>
    var myGeocode = [{{ geocode|join(', ') }}];
  </script>
</head>
<body>
  <p>Hello World</p>
  <button onclick="alert('Geocode: ' + myGeocode[0] + ' ' + myGeocode[1])" />
</body>
</html>

You can also use for loops, if statements, and much more; see the Jinja2 documentation for details.

Also, have a look at Ford's answer, which points out the tojson filter that Flask adds to Jinja2's standard set of filters.

Edit Nov 2018: tojson is now included in Jinja2's standard set of filters.


Answer 1

The ideal way to go about getting pretty much any Python object into a JavaScript object is to use JSON. JSON is great as a format for transfer between systems, but sometimes we forget that it stands for JavaScript Object Notation. This means that injecting JSON into the template is the same as injecting JavaScript code that describes the object.

Flask provides a Jinja filter for this: tojson dumps the structure to a JSON string and marks it safe so that Jinja does not autoescape it.

<html>
  <head>
    <script>
      var myGeocode = {{ geocode|tojson }};
    </script>
  </head>
  <body>
    <p>Hello World</p>
    <button onclick="alert('Geocode: ' + myGeocode[0] + ' ' + myGeocode[1])" />
  </body>
</html>

This works for any Python structure that is JSON serializable:

# Python (in the view)
python_data = {
    'some_list': [4, 5, 6],
    'nested_dict': {'foo': 7, 'bar': 'a string'}
}

// JavaScript (in the template)
var data = {{ python_data|tojson }};
alert('Data: ' + data.some_list[1] + ' ' + data.nested_dict.foo +
      ' ' + data.nested_dict.bar);
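
To see roughly what the filter writes into the page, its behaviour can be approximated with the standard library alone. This is only a sketch, not Flask's or Jinja's actual implementation: to_js_literal is a hypothetical helper name, the coordinates are made up, and the escaping of <, >, & and ' is an assumption modelled on what tojson is documented to do so that the output is safe inside script tags and single-quoted attributes.

```python
import json


def to_js_literal(value):
    """Approximate the tojson filter: JSON text with HTML-sensitive characters escaped.

    Hypothetical helper for illustration; in a template use {{ value|tojson }}.
    """
    text = json.dumps(value)
    # Escape characters that could end a <script> block or an attribute value.
    for char, escaped in (("<", "\\u003c"), (">", "\\u003e"),
                          ("&", "\\u0026"), ("'", "\\u0027")):
        text = text.replace(char, escaped)
    return text


geocode = (52.52, 13.405)  # made-up coordinates
python_data = {"some_list": [4, 5, 6],
               "nested_dict": {"foo": 7, "bar": "a string"}}

# Each printed line is valid JavaScript that could appear in the rendered page.
print("var myGeocode = " + to_js_literal(geocode) + ";")
print("var data = " + to_js_literal(python_data) + ";")
print("var risky = " + to_js_literal("</script>") + ";")
```

The Python tuple comes out as a JavaScript array, and the escaped text still decodes to the original value, which is why the template can assign it to a variable directly.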

Answer 2

Using a data attribute on an HTML element avoids having to use inline scripting, which in turn means you can use stricter CSP rules for increased security.

Specify a data attribute like so:

<div id="mydiv" data-geocode='{{ geocode|tojson }}'>...</div>

Then access it in a static JavaScript file like so:

// Raw JavaScript
var geocode = JSON.parse(document.getElementById("mydiv").dataset.geocode);

// jQuery (.data() already parses JSON-looking attribute values, so no JSON.parse is needed)
var geocode = $("#mydiv").data("geocode");

Answer 3

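Alternatively, you could add an endpoint that returns your variable as JSON:

from flask import jsonify

@app.route("/api/geocode")
def geo_code():
    # geocode must be computed or looked up here, as in the view above
    return jsonify(geocode)

Then request it from JavaScript:

fetch('/api/geocode')
  .then((res) => res.json())  // parse the JSON body
  .then((geocode) => { console.log(geocode); });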

Answer 4

Here is another solution for those who want to pass variables to a script that Flask serves as a separate file. I only managed to get this working by defining the variables in an inline script first and then loading the external script, as follows:

    <script>
    var myfileuri = "/static/my_csv.csv"
    var mytableid = 'mytable';
    </script>
    <script type="text/javascript" src="/static/test123.js"></script>

Putting Jinja variables directly in test123.js does not work and produces an error, because static files are served as-is and never pass through the template engine.


Answer 5

Working answers have already been given, but I want to add a check that acts as a fail-safe in case the Flask variable is not available. When you use:

var myVariable = {{ flaskvar | tojson }};

and an error causes the variable not to exist, the resulting errors may produce unexpected results. To avoid this:

{% if flaskvar is defined and flaskvar %}
var myVariable = {{ flaskvar | tojson }};
{% endif %}

Answer 6

<script>
    const geocodeArr = JSON.parse('{{ geocode | tojson }}');
    console.log(geocodeArr);
</script>

This uses Jinja2 to turn the geocode tuple into a JSON string, and then JavaScript's JSON.parse turns that string into a JavaScript array.


Answer 7

Well, I have a tricky method for this job. The idea is as follows.

Put some invisible HTML tags such as <label>, <p>, or <input> in the HTML body and follow a naming pattern: use the list index in the tag's id and store the list value in the tag's name attribute.

Here I have two lists of the same length, maintenance_next[] and maintenance_block_time[]. I want to pass the data of both lists to JavaScript using Flask. So I create invisible label tags, build each id from the list index, and set each name attribute to the value at that index.

{% for i in range(maintenance_next|length) %}
<label id="maintenance_next_{{i}}" name="{{maintenance_next[i]}}" style="display: none;"></label>
<label id="maintenance_block_time_{{i}}" name="{{maintenance_block_time[i]}}" style="display: none;"></label>
{% endfor %}

After this, I retrieve the data in javascript using some simple javascript operation.

<script>
var total_len = {{ maintenance_next|length }};  // number of label pairs rendered above

for (var i = 0; i < total_len; i++) {
    var tm1 = document.getElementById("maintenance_next_" + i).getAttribute("name");
    var tm2 = document.getElementById("maintenance_block_time_" + i).getAttribute("name");
    
    //Do what you need to do with tm1 and tm2.
    
    console.log(tm1);
    console.log(tm2);
}
</script>

Answer 8

Some JavaScript files come from the web or from a library rather than being written by you. Such code often reads its variables from the URL query string, like this:

var queryString = document.location.search.substring(1);
var params = PDFViewerApplication.parseQueryString(queryString);
var file = 'file' in params ? params.file : DEFAULT_URL;

This method leaves the JS files unchanged (keeping them independent) and still passes the variable correctly, here as the file parameter of the query string.


How to create a trie in Python

Question: How to create a trie in Python

I'm interested in tries and DAWGs (directed acyclic word graphs) and I've been reading a lot about them, but I don't understand what the output trie or DAWG file should look like.

  • Should a trie be an object of nested dictionaries, where each letter branches into further letters and so on?
  • Would a lookup performed on such a dictionary be fast if there are 100k or 500k entries?
  • How do I implement word blocks consisting of more than one word separated by - or a space?
  • How do I link the prefix or suffix of a word to another part of the structure? (for a DAWG)

I want to understand the best output structure in order to figure out how to create and use one.

I would also appreciate knowing what the output of a DAWG should be, along with that of a trie.

I do not want to see graphical representations with bubbles linked to each other; I want to know what the output object looks like once a set of words has been turned into a trie or DAWG.


Answer 0

Unwind is essentially correct that there are many different ways to implement a trie; and for a large, scalable trie, nested dictionaries might become cumbersome — or at least space inefficient. But since you’re just getting started, I think that’s the easiest approach; you could code up a simple trie in just a few lines. First, a function to construct the trie:

>>> _end = '_end_'
>>> 
>>> def make_trie(*words):
...     root = dict()
...     for word in words:
...         current_dict = root
...         for letter in word:
...             current_dict = current_dict.setdefault(letter, {})
...         current_dict[_end] = _end
...     return root
... 
>>> make_trie('foo', 'bar', 'baz', 'barz')
{'b': {'a': {'r': {'_end_': '_end_', 'z': {'_end_': '_end_'}}, 
             'z': {'_end_': '_end_'}}}, 
 'f': {'o': {'o': {'_end_': '_end_'}}}}

If you’re not familiar with setdefault, it simply looks up a key in the dictionary (here, letter or _end). If the key is present, it returns the associated value; if not, it assigns a default value to that key and returns the value ({} or _end). (It’s like a version of get that also updates the dictionary.)

Next, a function to test whether the word is in the trie:

>>> def in_trie(trie, word):
...     current_dict = trie
...     for letter in word:
...         if letter not in current_dict:
...             return False
...         current_dict = current_dict[letter]
...     return _end in current_dict
... 
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'baz')
True
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barz')
True
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'barzz')
False
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'bart')
False
>>> in_trie(make_trie('foo', 'bar', 'baz', 'barz'), 'ba')
False

I’ll leave insertion and removal to you as an exercise.
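
For readers who want to check their work, one possible solution to that exercise is sketched here. It is not the answer author's code: insert_word and remove_word are hypothetical names, and the choice to prune branches that become empty after a removal is an assumption about what removal should do.

```python
_end = '_end_'


def insert_word(trie, word):
    """Add word to a nested-dict trie in place."""
    node = trie
    for letter in word:
        node = node.setdefault(letter, {})
    node[_end] = _end


def remove_word(trie, word):
    """Remove word if present and prune empty branches. Returns True if removed."""
    path = [trie]  # nodes visited, root first
    for letter in word:
        node = path[-1].get(letter)
        if node is None:
            return False
        path.append(node)
    if _end not in path[-1]:
        return False  # word is only a prefix of other words
    del path[-1][_end]
    # Walk back up, deleting child dicts that no longer hold anything.
    steps = zip(reversed(path[:-1]), reversed(word), reversed(path[1:]))
    for parent, letter, child in steps:
        if child:
            break  # still used by another word
        del parent[letter]
    return True


trie = {}
for w in ('foo', 'bar', 'baz', 'barz'):
    insert_word(trie, w)
remove_word(trie, 'barz')
remove_word(trie, 'foo')
print(trie)
# {'b': {'a': {'r': {'_end_': '_end_'}, 'z': {'_end_': '_end_'}}}}
```

Removing 'barz' deletes only the trailing 'z' node because 'bar' is still a word, while removing 'foo' clears the whole 'f' branch.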

Of course, Unwind’s suggestion wouldn’t be much harder. There might be a slight speed disadvantage in that finding the correct sub-node would require a linear search. But the search would be limited to the number of possible characters — 27 if we include _end. Also, there’s nothing to be gained by creating a massive list of nodes and accessing them by index as he suggests; you might as well just nest the lists.

Finally, I’ll add that creating a directed acyclic word graph (DAWG) would be a bit more complex, because you have to detect situations in which your current word shares a suffix with another word in the structure. In fact, this can get rather complex, depending on how you want to structure the DAWG! You may have to learn some stuff about Levenshtein distance to get it right.


Answer 1

Have a look at this:

https://github.com/kmike/marisa-trie

Static memory-efficient Trie structures for Python (2.x and 3.x).

String data in a MARISA-trie may take up to 50x-100x less memory than in a standard Python dict; the raw lookup speed is comparable; trie also provides fast advanced methods like prefix search.

Based on marisa-trie C++ library.

Here’s a blog post from a company using marisa trie successfully:
https://www.repustate.com/blog/sharing-large-data-structure-across-processes-python/

At Repustate, much of our data models we use in our text analysis can be represented as simple key-value pairs, or dictionaries in Python lingo. In our particular case, our dictionaries are massive, a few hundred MB each, and they need to be accessed constantly. In fact for a given HTTP request, 4 or 5 models might be accessed, each doing 20-30 lookups. So the problem we face is how do we keep things fast for the client as well as light as possible for the server.

I found this package, marisa tries, which is a Python wrapper around a C++ implementation of a marisa trie. “Marisa” is an acronym for Matching Algorithm with Recursively Implemented StorAge. What’s great about marisa tries is the storage mechanism really shrinks how much memory you need. The author of the Python plugin claimed 50-100X reduction in size – our experience is similar.

What’s great about the marisa trie package is that the underlying trie structure can be written to disk and then read in via a memory mapped object. With a memory mapped marisa trie, all of our requirements are now met. Our server’s memory usage went down dramatically, by about 40%, and our performance was unchanged from when we used Python’s dictionary implementation.

There are also a couple of pure-Python implementations, though unless you're on a restricted platform you'd want to use the C++-backed implementation above for best performance.


Answer 2

Here is a list of python packages that implement Trie:

  • marisa-trie – a C++ based implementation.
  • python-trie – a simple pure python implementation.
  • PyTrie – a more advanced pure python implementation.
  • pygtrie – a pure python implementation by Google.
  • datrie – a double array trie implementation based on libdatrie.

Answer 3

Modified from senderle's method (above). I found that Python's defaultdict is ideal for creating a trie or a prefix tree.

from collections import defaultdict

class Trie:
    """
    Implement a trie with insert, search, and startsWith methods.
    """
    def __init__(self):
        # With no default_factory, defaultdict() behaves like a plain dict.
        self.root = defaultdict()

    # @param {string} word
    # @return {void}
    # Inserts a word into the trie.
    def insert(self, word):
        current = self.root
        for letter in word:
            current = current.setdefault(letter, {})
        current.setdefault("_end")

    # @param {string} word
    # @return {boolean}
    # Returns if the word is in the trie.
    def search(self, word):
        current = self.root
        for letter in word:
            if letter not in current:
                return False
            current = current[letter]
        if "_end" in current:
            return True
        return False

    # @param {string} prefix
    # @return {boolean}
    # Returns if there is any word in the trie
    # that starts with the given prefix.
    def startsWith(self, prefix):
        current = self.root
        for letter in prefix:
            if letter not in current:
                return False
            current = current[letter]
        return True

# Now test the class

test = Trie()
test.insert('helloworld')
test.insert('ilikeapple')
test.insert('helloz')

print(test.search('hello'))       # False: 'hello' is only a prefix
print(test.startsWith('hello'))   # True
print(test.search('ilikeapple'))  # True

Answer 4

There’s no “should”; it’s up to you. Various implementations will have different performance characteristics, take various amounts of time to implement, understand, and get right. This is typical for software development as a whole, in my opinion.

I would probably first try having a global list of all trie nodes so far created, and representing the child-pointers in each node as a list of indices into the global list. Having a dictionary just to represent the child linking feels too heavy-weight, to me.
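
A minimal sketch of that layout might look like the following. It is one reading of the description, not the answer author's code: ListTrie and its node format ([is_word, children], where children holds (letter, index) pairs) are assumptions made for illustration.

```python
class ListTrie:
    """Trie stored as one flat list of nodes; children are indices into it."""

    def __init__(self):
        # Each node is [is_word, children]; children is a list of
        # (letter, node_index) pairs that is searched linearly.
        self.nodes = [[False, []]]  # index 0 is the root

    def _child(self, index, letter):
        for child_letter, child_index in self.nodes[index][1]:
            if child_letter == letter:
                return child_index
        return None

    def insert(self, word):
        index = 0
        for letter in word:
            nxt = self._child(index, letter)
            if nxt is None:
                self.nodes.append([False, []])
                nxt = len(self.nodes) - 1
                self.nodes[index][1].append((letter, nxt))
            index = nxt
        self.nodes[index][0] = True

    def __contains__(self, word):
        index = 0
        for letter in word:
            index = self._child(index, letter)
            if index is None:
                return False
        return self.nodes[index][0]


t = ListTrie()
for w in ('foo', 'bar', 'baz', 'barz'):
    t.insert(w)
print('baz' in t, 'ba' in t)  # True False
print(len(t.nodes))           # 9: the root plus one node per distinct prefix
```

Because nodes refer to each other by index rather than by object reference, the whole structure is a list of small lists that is straightforward to serialize.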


Answer 5

If you want a TRIE implemented as a Python class, here is something I wrote after reading about them:

class Trie:

    def __init__(self):
        self.__final = False
        self.__nodes = {}

    def __repr__(self):
        return 'Trie<len={}, final={}>'.format(len(self), self.__final)

    def __getstate__(self):
        return self.__final, self.__nodes

    def __setstate__(self, state):
        self.__final, self.__nodes = state

    def __len__(self):
        return len(self.__nodes)

    def __bool__(self):
        return self.__final

    def __contains__(self, array):
        try:
            return self[array]
        except KeyError:
            return False

    def __iter__(self):
        yield self
        for node in self.__nodes.values():
            yield from node

    def __getitem__(self, array):
        return self.__get(array, False)

    def create(self, array):
        self.__get(array, True).__final = True

    def read(self):
        yield from self.__read([])

    def update(self, array):
        self[array].__final = True

    def delete(self, array):
        self[array].__final = False

    def prune(self):
        for key, value in tuple(self.__nodes.items()):
            if not value.prune():
                del self.__nodes[key]
        if not len(self):
            self.delete([])
        return self

    def __get(self, array, create):
        if array:
            head, *tail = array
            if create and head not in self.__nodes:
                self.__nodes[head] = Trie()
            return self.__nodes[head].__get(tail, create)
        return self

    def __read(self, name):
        if self.__final:
            yield name
        for key, value in self.__nodes.items():
            yield from value.__read(name + [key])

Answer 6

This version uses recursion:

import pprint
from collections import deque

inp = input("Enter a sentence to show as trie\n")
# split() without arguments skips empty words caused by repeated spaces
words = inp.split()
trie = {}


def trie_recursion(trie_ds, word):
    try:
        letter = word.popleft()
        out = trie_recursion(trie_ds.get(letter, {}), word)
    except IndexError:
        # End of the word
        return {}

    # Don't update if the letter is already present
    if letter not in trie_ds:
        trie_ds[letter] = out

    return trie_ds

for word in words:
    # Go through each word
    trie = trie_recursion(trie, deque(word))

pprint.pprint(trie)

Output:

$ python trie.py
Enter a sentence to show as trie
foo bar baz fun
{
  'b': {
    'a': {
      'r': {},
      'z': {}
    }
  },
  'f': {
    'o': {
      'o': {}
    },
    'u': {
      'n': {}
    }
  }
}

Answer 7

from collections import defaultdict

Define Trie:

_trie = lambda: defaultdict(_trie)

Create Trie:

trie = _trie()
for s in ["cat", "bat", "rat", "cam"]:
    curr = trie
    for c in s:
        curr = curr[c]
    curr.setdefault("_end")

Lookup:

def word_exist(trie, word):
    curr = trie
    for w in word:
        if w not in curr:
            return False
        curr = curr[w]
    return '_end' in curr

Test:

print(word_exist(trie, 'cam'))
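
The same structure can also answer prefix queries. The sketch below is an illustrative extension rather than part of the answer: words_with_prefix is a hypothetical name, and it relies on the "_end" marker used above.

```python
from collections import defaultdict

_trie = lambda: defaultdict(_trie)


def words_with_prefix(trie, prefix):
    """Yield every stored word that starts with prefix."""
    curr = trie
    for c in prefix:
        if c not in curr:  # membership test, so no node is created by accident
            return
        curr = curr[c]
    stack = [(curr, prefix)]
    while stack:
        node, word = stack.pop()
        for key, child in node.items():
            if key == "_end":
                yield word
            else:
                stack.append((child, word + key))


trie = _trie()
for s in ["cat", "bat", "rat", "cam"]:
    curr = trie
    for c in s:
        curr = curr[c]
    curr.setdefault("_end")

print(sorted(words_with_prefix(trie, "ca")))  # ['cam', 'cat']
```

Checking membership with `in` before descending matters here, because indexing a missing key on a defaultdict would silently add an empty branch to the trie.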

Answer 8

class Trie:
    def __init__(self):
        # Per-instance root; a class-level dict would be shared by every Trie.
        self.head = {}

    def add(self,word):

        cur = self.head
        for ch in word:
            if ch not in cur:
                cur[ch] = {}
            cur = cur[ch]
        cur['*'] = True

    def search(self,word):
        cur = self.head
        for ch in word:
            if ch not in cur:
                return False
            cur = cur[ch]

        if '*' in cur:
            return True
        else:
            return False
    def printf(self):
        print (self.head)

dictionary = Trie()
dictionary.add("hi")
#dictionary.add("hello")
#dictionary.add("eye")
#dictionary.add("hey")


print(dictionary.search("hi"))
print(dictionary.search("hello"))
print(dictionary.search("hel"))
print(dictionary.search("he"))
dictionary.printf()

Output:

True
False
False
False
{'h': {'i': {'*': True}}}

Answer 9

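Python class for a trie

A trie can store a string in O(L) time, where L is the length of the string, so inserting N strings takes O(NL). A string can likewise be searched for or deleted in O(L).

It can be cloned from https://github.com/Parikshit22/pytrie.git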

class Node:
    def __init__(self):
        self.children = [None]*26
        self.isend = False
        
class trie:
    def __init__(self,):
        self.__root = Node()
        
    def __len__(self,):
        return len(self.search_byprefix(''))
    
    def __str__(self):
        ll =  self.search_byprefix('')
        string = ''
        for i in ll:
            string+=i
            string+='\n'
        return string
        
    def chartoint(self,character):
        return ord(character)-ord('a')
    
    def remove(self,string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                raise ValueError("Keyword doesn't exist in trie")
        if ptr.isend is not True:
            raise ValueError("Keyword doesn't exist in trie")
        ptr.isend = False
        return
    
    def insert(self,string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                ptr.children[i] = Node()
                ptr = ptr.children[i]
        ptr.isend = True
        
    def search(self,string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                return False
        if ptr.isend is not True:
            return False
        return True
    
    def __getall(self,ptr,key,key_list):
        if ptr is None:
            key_list.append(key)
            return
        if ptr.isend==True:
            key_list.append(key)
        for i in range(26):
            if ptr.children[i]  is not None:
                self.__getall(ptr.children[i],key+chr(ord('a')+i),key_list)
        
    def search_byprefix(self,key):
        ptr = self.__root
        key_list = []
        length = len(key)
        for idx in range(length):
            i = self.chartoint(key[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                return None
        
        self.__getall(ptr,key,key_list)
        return key_list
        

t = trie()
t.insert("shubham")
t.insert("shubhi")
t.insert("minhaj")
t.insert("parikshit")
t.insert("pari")
t.insert("shubh")
t.insert("minakshi")
print(t.search("minhaj"))
print(t.search("shubhk"))
print(t.search_byprefix('m'))
print(len(t))
print(t.remove("minhaj"))
print(t)

代码输入

正确
错误
[‘minakshi’,’minhaj’]
7
minakshi
minhajsir
pari
parikshit
shubh
shubham
shubhi

Python Class for Trie


A trie can store a string in O(L) time, where L is the length of the string, so inserting N strings takes O(NL). A string can likewise be searched for in O(L), and the same goes for deletion.

It can be cloned from https://github.com/Parikshit22/pytrie.git

class Node:
    def __init__(self):
        self.children = [None]*26
        self.isend = False
        
class trie:
    def __init__(self,):
        self.__root = Node()
        
    def __len__(self,):
        return len(self.search_byprefix(''))
    
    def __str__(self):
        ll =  self.search_byprefix('')
        string = ''
        for i in ll:
            string+=i
            string+='\n'
        return string
        
    def chartoint(self,character):
        return ord(character)-ord('a')
    
    def remove(self,string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                raise ValueError("Keyword doesn't exist in trie")
        if ptr.isend is not True:
            raise ValueError("Keyword doesn't exist in trie")
        ptr.isend = False
        return
    
    def insert(self,string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                ptr.children[i] = Node()
                ptr = ptr.children[i]
        ptr.isend = True
        
    def search(self,string):
        ptr = self.__root
        length = len(string)
        for idx in range(length):
            i = self.chartoint(string[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                return False
        if ptr.isend is not True:
            return False
        return True
    
    def __getall(self,ptr,key,key_list):
        if ptr is None:
            key_list.append(key)
            return
        if ptr.isend==True:
            key_list.append(key)
        for i in range(26):
            if ptr.children[i]  is not None:
                self.__getall(ptr.children[i],key+chr(ord('a')+i),key_list)
        
    def search_byprefix(self,key):
        ptr = self.__root
        key_list = []
        length = len(key)
        for idx in range(length):
            i = self.chartoint(key[idx])
            if ptr.children[i] is not None:
                ptr = ptr.children[i]
            else:
                return None
        
        self.__getall(ptr,key,key_list)
        return key_list
        

t = trie()
t.insert("shubham")
t.insert("shubhi")
t.insert("minhaj")
t.insert("parikshit")
t.insert("pari")
t.insert("shubh")
t.insert("minakshi")
print(t.search("minhaj"))
print(t.search("shubhk"))
print(t.search_byprefix('m'))
print(len(t))
print(t.remove("minhaj"))
print(t)

Code output

True
False
['minakshi', 'minhaj']
7
minakshi
minhajsir
pari
parikshit
shubh
shubham
shubhi


Calling Java from Python

Question: Calling Java from Python

What is the best way to call Java from Python? (Jython and RPC are not an option for me.)

I’ve heard of JCC: http://pypi.python.org/pypi/JCC/1.9, a C++ code generator for calling Java from C++/Python. But this requires compiling every possible call; I would prefer another solution.

I’ve heard about JPype: http://jpype.sourceforge.net/ (tutorial: http://www.slideshare.net/onyame/mixing-python-and-java)

import jpype
jvm_path = "path/to/jvm.dll"  # placeholder: location of the JVM library
jpype.startJVM(jvm_path, "-ea")
javaPackage = jpype.JPackage("JavaPackageName")
javaClass = javaPackage.JavaClassName
javaObject = javaClass()
javaObject.JavaMethodName()
jpype.shutdownJVM()

This looks like what I need. However, the last release is from Jan 2009 and I see people failing to compile JPype.

Is JPype a dead project?

Are there any other alternatives?

Regards, David


Answer 0

Here is my summary of this problem: 5 Ways of Calling Java from Python

http://baojie.org/blog/2014/06/16/call-java-from-python/ (cached)

Short answer: JPype works pretty well and is proven in many projects (such as python-boilerpipe), but Pyjnius is faster and simpler than JPype.

I have tried Pyjnius/Jnius, JCC, javabridge, JPype and Py4J.

Py4J is a bit hard to use, as you need to start a gateway, adding another layer of fragility.


Answer 1

You could also use Py4J. There is an example on the front page and lots of documentation, but essentially, you just call Java methods from your Python code as if they were Python methods:

from py4j.java_gateway import JavaGateway
gateway = JavaGateway()                        # connect to the JVM
java_object = gateway.jvm.mypackage.MyClass()  # invoke constructor
other_object = java_object.doThat()
other_object.doThis(1,'abc')
gateway.jvm.java.lang.System.out.println('Hello World!') # call a static method

As opposed to Jython, one part of Py4J runs in the Python VM so it is always “up to date” with the latest version of Python and you can use libraries that do not run well on Jython (e.g., lxml). The other part runs in the Java VM you want to call.

The communication is done through sockets instead of JNI and Py4J has its own protocol (to optimize certain cases, to manage memory, etc.)

Disclaimer: I am the author of Py4J


Answer 2

Pyjnius.

Docs: http://pyjnius.readthedocs.org/en/latest/

Github: https://github.com/kivy/pyjnius

From the github page:

A Python module to access Java classes as Python classes using JNI.

PyJNIus is a “Work In Progress”.

Quick overview

>>> from jnius import autoclass
>>> autoclass('java.lang.System').out.println('Hello world')
Hello world

>>> Stack = autoclass('java.util.Stack')
>>> stack = Stack()
>>> stack.push('hello')
>>> stack.push('world')
>>> print stack.pop()
world
>>> print stack.pop()
hello

Answer 3

I’m on OSX 10.10.2, and succeeded in using JPype.

I ran into installation problems with Jnius (others have too). Javabridge installed but gave mysterious errors when I tried to use it. Py4J has the inconvenience of having to start a Gateway server in Java first, and JCC wouldn’t install. Finally, JPype ended up working. There’s a maintained fork of JPype on GitHub. It has the major advantages that (a) it installs properly and (b) it can very efficiently convert Java arrays to NumPy arrays (np_arr = java_arr[:]).

The installation process was:

git clone https://github.com/originell/jpype.git
cd jpype
python setup.py install

And you should be able to import jpype

The following demo worked:

import jpype as jp
jp.startJVM(jp.getDefaultJVMPath(), "-ea")
jp.java.lang.System.out.println("hello world")
jp.shutdownJVM() 

When I tried calling my own java code, I had to first compile (javac ./blah/HelloWorldJPype.java), and I had to change the JVM path from the default (otherwise you’ll get inexplicable “class not found” errors). For me, this meant changing the startJVM command to:

jp.startJVM('/Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/MacOS/libjli.dylib', "-ea")
c = jp.JClass('blah.HelloWorldJPype')  
# Where my java class file is in ./blah/HelloWorldJPype.class
...

Answer 4

If you’re in Python 3, there’s a fork of JPype called JPype1-py3

pip install JPype1-py3

This works for me on OSX / Python 3.4.3. (You may need to export JAVA_HOME=/Library/Java/JavaVirtualMachines/your-java-version)

from jpype import *
startJVM(getDefaultJVMPath(), "-ea")
java.lang.System.out.println("hello world")
shutdownJVM()

Answer 5

I’ve been integrating a lot of stuff into Python lately, including Java. The most robust method I’ve found is to use IKVM and a C# wrapper.

IKVM has a neat little application that allows you to take any Java JAR, and convert it directly to .Net DLL. It simply translates the JVM bytecode to CLR bytecode. See http://sourceforge.net/p/ikvm/wiki/Ikvmc/ for details.

The converted library behaves just like a native C# library, and you can use it without needing the JVM. You can then create a C# DLL wrapper project, and add a reference to the converted DLL.

You can now create some wrapper stubs that call the methods that you want to expose, and mark those methods as DllExport. See https://stackoverflow.com/a/29854281/1977538 for details.

The wrapper DLL acts just like a native C library, with the exported methods looking just like exported C methods. You can connect to them using ctypes as usual.

I’ve tried it with Python 2.7, but it should work with 3.0 as well. It works on Windows and on Linux.

If you happen to use C#, then this is probably the best approach to try when integrating almost anything into python.


Answer 6

I’m just beginning to use JPype 0.5.4.2 (July 2011) and it looks like it’s working nicely…
I’m on Xubuntu 10.04


Answer 7

I’m assuming that if you can get from C++ to Java then you are all set. I’ve seen a product of the kind you mention work well. As it happens the one we used was CodeMesh. I’m not specifically endorsing this vendor, or making any statement about their product’s relative quality, but I have seen it work in quite a high volume scenario.

Generally, I would recommend keeping away from direct integration via JNI if at all possible. A simple REST service approach or a queue-based architecture will tend to be simpler to develop and diagnose. You can get quite decent performance if you use such decoupled technologies carefully.


Answer 8

In my own experience of trying to run some Java code from within Python, in a manner similar to how Python code can be run from within Java code, I was unable to find a straightforward method.

My solution was to run the Java code as BeanShell scripts: after editing the Java code in a temporary file to add the appropriate packages and variables, I call the BeanShell interpreter as a shell command from within my Python code.

If this is helpful in any way, I am glad to share more details of my solution.
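
The approach described in this answer (write the code to a temporary file, then hand that file to an external interpreter as a shell command) could be sketched roughly as follows. This is only an illustration: `run_script` is a hypothetical helper, and the BeanShell command line in the usage note is an assumption about the local installation, not something taken from the answer.

```python
import os
import subprocess
import tempfile


def run_script(interpreter_cmd, source, suffix=".bsh"):
    """Write `source` to a temporary file and run it with `interpreter_cmd`.

    `interpreter_cmd` is a list such as ["java", "bsh.Interpreter"]
    (an assumption; it depends on how BeanShell is installed).
    Returns the captured standard output as text.
    """
    # delete=False so the file is closed before the child process opens it.
    with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as tmp:
        tmp.write(source)
        path = tmp.name
    try:
        result = subprocess.run(
            interpreter_cmd + [path],
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError on a non-zero exit status
        )
        return result.stdout
    finally:
        os.remove(path)
```

With BeanShell on the classpath, a call might look like `run_script(["java", "bsh.Interpreter"], 'print("hello");')`; adapt the command to your own setup.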


Any reason not to use '+' to concatenate two strings?

Question: Any reason not to use '+' to concatenate two strings?

A common antipattern in Python is to concatenate a sequence of strings using + in a loop. This is bad because the Python interpreter has to create a new string object for each iteration, and it ends up taking quadratic time. (Recent versions of CPython can apparently optimize this in some cases, but other implementations can’t, so programmers are discouraged from relying on this.) ''.join is the right way to do this.

However, I’ve heard it said (including here on Stack Overflow) that you should never, ever use + for string concatenation, but instead always use ''.join or a format string. I don’t understand why this is the case if you’re only concatenating two strings. If my understanding is correct, it shouldn’t take quadratic time, and I think a + b is cleaner and more readable than either ''.join((a, b)) or '%s%s' % (a, b).

Is it good practice to use + to concatenate two strings? Or is there a problem I’m not aware of?


Answer 0

There is nothing wrong with concatenating two strings with +. Indeed, it’s easier to read than ''.join([a, b]).

You are right, though, that concatenating more than 2 strings with + is an O(n^2) operation (compared to O(n) for join) and thus becomes inefficient. However, this has nothing to do with using a loop. Even a + b + c + ... is O(n^2), the reason being that each concatenation produces a new string.

CPython 2.4 and above try to mitigate that, but it’s still advisable to use join when concatenating more than 2 strings.
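
A minimal sketch of the two approaches for many strings, written for Python 3. The helper names `concat_with_plus` and `concat_with_join` are made up for illustration. Both build the same result, and `timeit` can be used to compare them on a given interpreter; no timings are claimed here, since they vary by version and implementation.

```python
import timeit


def concat_with_plus(parts):
    """Build the result by repeated +; each step may copy the string so far."""
    result = ""
    for part in parts:
        result = result + part
    return result


def concat_with_join(parts):
    """Build the result in a single pass with str.join."""
    return "".join(parts)


if __name__ == "__main__":
    parts = ["spam"] * 10_000
    assert concat_with_plus(parts) == concat_with_join(parts)
    for func in (concat_with_plus, concat_with_join):
        seconds = timeit.timeit(lambda: func(parts), number=100)
        print(f"{func.__name__}: {seconds:.4f} s for 100 runs")
```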


Answer 1

The plus operator is a perfectly fine solution for concatenating two Python strings. But if you keep adding more than two strings (n > 25), you might want to consider something else.

The ''.join([a, b, c]) trick is a performance optimization.


Answer 2

The assumption that one should never, ever use + for string concatenation, but instead always use ''.join, may be a myth. It is true that using + creates unnecessary temporary copies of immutable string objects, but the other, less often quoted fact is that calling join in a loop generally adds the overhead of a function call. Let's take your example.

Create two lists, one from the linked SO question and another, bigger, fabricated one:

>>> import cProfile, random, timeit
>>> myl1 = ['A','B','C','D','E','F']
>>> myl2=[chr(random.randint(65,90)) for i in range(0,10000)]

Let's create two functions, UseJoin and UsePlus, that use join and + respectively.

>>> def UsePlus():
    return [myl[i] + myl[i + 1] for i in range(0,len(myl), 2)]

>>> def UseJoin():
    return [''.join((myl[i],myl[i + 1])) for i in range(0,len(myl), 2)]

Let's run timeit with the first list:

>>> myl=myl1
>>> t1=timeit.Timer("UsePlus()","from __main__ import UsePlus")
>>> t2=timeit.Timer("UseJoin()","from __main__ import UseJoin")
>>> print "%.2f usec/pass" % (1000000 * t1.timeit(number=100000)/100000)
2.48 usec/pass
>>> print "%.2f usec/pass" % (1000000 * t2.timeit(number=100000)/100000)
2.61 usec/pass
>>> 

They have almost the same runtime.

Let's use cProfile with the larger list:

>>> myl=myl2
>>> cProfile.run("UsePlus()")
         5 function calls in 0.001 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.001    0.001 <pyshell#1376>:1(UsePlus)
        1    0.000    0.000    0.001    0.001 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {range}


>>> cProfile.run("UseJoin()")
         5005 function calls in 0.029 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.015    0.015    0.029    0.029 <pyshell#1388>:1(UseJoin)
        1    0.000    0.000    0.029    0.029 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     5000    0.014    0.000    0.014    0.000 {method 'join' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {range}

It looks like using join results in extra function calls, which could add to the overhead.

Now, coming back to the question: should one discourage the use of + in favour of join in all cases?

I believe not. Two things should be taken into consideration:

  1. The length of the strings in question
  2. The number of concatenation operations

And of course, in development, premature optimization is evil.
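
The transcript in this answer is from Python 2. A rough Python 3 equivalent of the same experiment might look like the sketch below. `pairs_with_plus` and `pairs_with_join` are illustrative names, and the list is passed as an argument instead of being read from a global. The range stops one short of the end, so an odd trailing item is ignored.

```python
import random
import timeit


def pairs_with_plus(items):
    """Concatenate neighbouring items two at a time with +."""
    return [items[i] + items[i + 1] for i in range(0, len(items) - 1, 2)]


def pairs_with_join(items):
    """Concatenate neighbouring items two at a time with str.join."""
    return ["".join((items[i], items[i + 1])) for i in range(0, len(items) - 1, 2)]


if __name__ == "__main__":
    letters = [chr(random.randint(65, 90)) for _ in range(10_000)]
    for func in (pairs_with_plus, pairs_with_join):
        seconds = timeit.timeit(lambda: func(letters), number=1_000)
        print(f"{func.__name__}: {seconds:.3f} s for 1000 runs")
```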


Answer 3

When working with multiple people, it’s sometimes difficult to know exactly what’s happening. Using a format string instead of concatenation can avoid one particular annoyance that’s happened a whole ton of times to us:

Say, a function requires an argument, and you write it expecting to get a string:

In [1]: def foo(zeta):
   ...:     print 'bar: ' + zeta

In [2]: foo('bang')
bar: bang

So, this function may be used pretty often throughout the code. Your coworkers may know exactly what it does, but not necessarily be fully up-to-speed on the internals, and may not know that the function expects a string. And so they may end up with this:

In [3]: foo(23)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

/home/izkata/<ipython console> in <module>()

/home/izkata/<ipython console> in foo(zeta)

TypeError: cannot concatenate 'str' and 'int' objects

There would be no problem if you just used a format string:

In [1]: def foo(zeta):
   ...:     print 'bar: %s' % zeta
   ...:     
   ...:     

In [2]: foo('bang')
bar: bang

In [3]: foo(23)
bar: 23

The same is true for all types of objects that define __str__, which may be passed in as well:

In [1]: from datetime import date

In [2]: zeta = date(2012, 4, 15)

In [3]: print 'bar: ' + zeta
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

/home/izkata/<ipython console> in <module>()

TypeError: cannot concatenate 'str' and 'datetime.date' objects

In [4]: print 'bar: %s' % zeta
bar: 2012-04-15

So yes: if you can use a format string, do it and take advantage of what Python has to offer.
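
The same point in Python 3 syntax, as a small sketch. `foo_concat` and `foo_format` are illustrative names, and they return the string instead of printing it so the behaviour is easy to check.

```python
from datetime import date


def foo_concat(zeta):
    """Works only when zeta is already a str."""
    return "bar: " + zeta


def foo_format(zeta):
    """Works for any object, because %s formatting calls str() on it."""
    # The one-element tuple keeps this correct even if zeta is itself a tuple.
    return "bar: %s" % (zeta,)


if __name__ == "__main__":
    print(foo_format("bang"))
    print(foo_format(23))
    print(foo_format(date(2012, 4, 15)))
    try:
        foo_concat(23)
    except TypeError as exc:
        print("concatenation failed:", exc)
```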


Answer 4

I have done a quick test:

import sys

# 's' rather than 'str', to avoid shadowing the built-in type
s = e = "a xxxxxxxxxx very xxxxxxxxxx long xxxxxxxxxx string xxxxxxxxxx\n"

for i in range(int(sys.argv[1])):
    s = s + e

and timed it:

mslade@mickpc:/binks/micks/ruby/tests$ time python /binks/micks/junk/strings.py  8000000
8000000 times

real    0m2.165s
user    0m1.620s
sys     0m0.540s
mslade@mickpc:/binks/micks/ruby/tests$ time python /binks/micks/junk/strings.py  16000000
16000000 times

real    0m4.360s
user    0m3.480s
sys     0m0.870s

There is apparently an optimisation for the a = a + b case. It does not exhibit O(n^2) time as one might suspect.

So at least in terms of performance, using + is fine.


Answer 5

According to the Python docs, using str.join() gives you consistent performance across the various implementations of Python. Although CPython optimizes away the quadratic behavior of s = s + t, other Python implementations may not.

CPython implementation detail: If s and t are both strings, some Python implementations such as CPython can usually perform an in-place optimization for assignments of the form s = s + t or s += t. When applicable, this optimization makes quadratic run-time much less likely. This optimization is both version and implementation dependent. For performance sensitive code, it is preferable to use the str.join() method which assures consistent linear concatenation performance across versions and implementations.

Sequence Types in the Python docs (see footnote [6])


Answer 6

I use the following with Python 3.8:

string4 = f'{string1}{string2}{string3}'

Answer 7

''.join([a, b]) is a better solution than +.

That is because code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).

The in-place optimization for statements of the form a += b or a = a + b is fragile even in CPython and isn’t present at all in implementations that don’t use refcounting (reference counting is a technique of storing the number of references, pointers, or handles to a resource such as an object, block of memory, disk space or other resource).

https://www.python.org/dev/peps/pep-0008/#programming-recommendations


Shared-memory objects in multiprocessing

Question: Shared-memory objects in multiprocessing

Suppose I have a large in-memory NumPy array and a function func that takes this giant array as input (together with some other parameters). func can be run in parallel with different parameters. For example:

def func(arr, param):
    # do stuff to arr, param

# build array arr

pool = Pool(processes = 6)
results = [pool.apply_async(func, [arr, param]) for param in all_params]
output = [res.get() for res in results]

If I use the multiprocessing library, then that giant array will be copied multiple times into different processes.

Is there a way to let different processes share the same array? This array object is read-only and will never be modified.

What’s more complicated, if arr is not an array, but an arbitrary python object, is there a way to share it?

[EDITED]

I read the answer but I am still a bit confused. Since fork() is copy-on-write, we should not incur any additional cost when spawning new processes with the Python multiprocessing library. But the following code suggests there is a huge overhead:

from multiprocessing import Pool
import numpy as np
import time

def f(arr):
    return len(arr)

t = time.time()
arr = np.arange(10000000)
print("construct array = ", time.time() - t)


pool = Pool(processes = 6)

t = time.time()
res = pool.apply_async(f, [arr,])
res.get()
print("multiprocessing overhead = ", time.time() - t)

output (and by the way, the cost increases as the size of the array increases, so I suspect there is still overhead related to memory copying):

construct array =  0.0178790092468
multiprocessing overhead =  0.252444982529

Why is there such huge overhead, if we didn’t copy the array? And what part does the shared memory save me?


Answer 0

If you use an operating system that uses copy-on-write fork() semantics (like any common unix), then as long as you never alter your data structure it will be available to all child processes without taking up additional memory. You will not have to do anything special (except make absolutely sure you don’t alter the object).

The most efficient thing you can do for your problem would be to pack your array into an efficient array structure (using numpy or array), place that in shared memory, wrap it with multiprocessing.Array, and pass that to your functions. This answer shows how to do that.

If you want a writeable shared object, then you will need to wrap it with some kind of synchronization or locking. multiprocessing provides two methods of doing this: one using shared memory (suitable for simple values, arrays, or ctypes) or a Manager proxy, where one process holds the memory and a manager arbitrates access to it from other processes (even over a network).

The Manager approach can be used with arbitrary Python objects, but will be slower than the equivalent using shared memory because the objects need to be serialized/deserialized and sent between processes.

There is a wealth of parallel processing libraries and approaches available in Python. multiprocessing is an excellent and well-rounded library, but if you have special needs, one of the other approaches may be better.


Answer 1

I ran into the same problem and wrote a little shared-memory utility class to work around it.

I’m using multiprocessing.RawArray (lock-free), and access to the arrays is not synchronized at all, so be careful not to shoot yourself in the foot.

With this solution I get speedups by a factor of approximately 3 on a quad-core i7.

Here’s the code. Feel free to use and improve it, and please report back any bugs.

'''
Created on 14.05.2013

@author: martin
'''

import multiprocessing
import ctypes
import numpy as np

class SharedNumpyMemManagerError(Exception):
    pass

'''
Singleton Pattern
'''
class SharedNumpyMemManager:    

    _initSize = 1024

    _instance = None

    def __new__(cls, *args, **kwargs):
        if not cls._instance:
            cls._instance = super(SharedNumpyMemManager, cls).__new__(
                                cls, *args, **kwargs)
        return cls._instance        

    def __init__(self):
        self.lock = multiprocessing.Lock()
        self.cur = 0
        self.cnt = 0
        self.shared_arrays = [None] * SharedNumpyMemManager._initSize

    def __createArray(self, dimensions, ctype=ctypes.c_double):

        self.lock.acquire()

        # double size if necessary
        if (self.cnt >= len(self.shared_arrays)):
            self.shared_arrays = self.shared_arrays + [None] * len(self.shared_arrays)

        # next handle
        self.__getNextFreeHdl()        

        # create array in shared memory segment
        shared_array_base = multiprocessing.RawArray(ctype, np.prod(dimensions))

        # convert to numpy array via ctypeslib
        self.shared_arrays[self.cur] = np.ctypeslib.as_array(shared_array_base)

        # reshape to the requested dimensions;
        # the result is a view on the same shared memory, not a copy
        self.shared_arrays[self.cur] = self.shared_arrays[self.cur].reshape(dimensions)

        # update cnt
        self.cnt += 1

        self.lock.release()

        # return handle to the shared memory numpy array
        return self.cur

    def __getNextFreeHdl(self):
        orgCur = self.cur
        while self.shared_arrays[self.cur] is not None:
            self.cur = (self.cur + 1) % len(self.shared_arrays)
            if orgCur == self.cur:
                raise SharedNumpyMemManagerError('Max Number of Shared Numpy Arrays Exceeded!')

    def __freeArray(self, hdl):
        self.lock.acquire()
        # set reference to None
        if self.shared_arrays[hdl] is not None: # consider multiple calls to free
            self.shared_arrays[hdl] = None
            self.cnt -= 1
        self.lock.release()

    def __getArray(self, i):
        return self.shared_arrays[i]

    @staticmethod
    def getInstance():
        if not SharedNumpyMemManager._instance:
            SharedNumpyMemManager._instance = SharedNumpyMemManager()
        return SharedNumpyMemManager._instance

    @staticmethod
    def createArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__createArray(*args, **kwargs)

    @staticmethod
    def getArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__getArray(*args, **kwargs)

    @staticmethod    
    def freeArray(*args, **kwargs):
        return SharedNumpyMemManager.getInstance().__freeArray(*args, **kwargs)

# Init Singleton on module load
SharedNumpyMemManager.getInstance()

if __name__ == '__main__':

    import timeit

    N_PROC = 8
    INNER_LOOP = 10000
    N = 1000

    def propagate(t):
        i, shm_hdl, evidence = t
        a = SharedNumpyMemManager.getArray(shm_hdl)
        for j in range(INNER_LOOP):
            a[i] = i

    class Parallel_Dummy_PF:

        def __init__(self, N):
            self.N = N
            self.arrayHdl = SharedNumpyMemManager.createArray(self.N, ctype=ctypes.c_double)            
            self.pool = multiprocessing.Pool(processes=N_PROC)

        def update_par(self, evidence):
            self.pool.map(propagate, zip(range(self.N), [self.arrayHdl] * self.N, [evidence] * self.N))

        def update_seq(self, evidence):
            for i in range(self.N):
                propagate((i, self.arrayHdl, evidence))

        def getArray(self):
            return SharedNumpyMemManager.getArray(self.arrayHdl)

    def parallelExec():
        pf = Parallel_Dummy_PF(N)
        print(pf.getArray())
        pf.update_par(5)
        print(pf.getArray())

    def sequentialExec():
        pf = Parallel_Dummy_PF(N)
        print(pf.getArray())
        pf.update_seq(5)
        print(pf.getArray())

    t1 = timeit.Timer("sequentialExec()", "from __main__ import sequentialExec")
    t2 = timeit.Timer("parallelExec()", "from __main__ import parallelExec")

    print("Sequential: ", t1.timeit(number=1))    
    print("Parallel: ", t2.timeit(number=1))

Answer 2

This is the intended use case for Ray, which is a library for parallel and distributed Python. Under the hood, it serializes objects using the Apache Arrow data layout (which is a zero-copy format) and stores them in a shared-memory object store so they can be accessed by multiple processes without creating copies.

The code would look like the following.

import numpy as np
import ray

ray.init()

@ray.remote
def func(array, param):
    # Do stuff.
    return 1

array = np.ones(10**6)
# Store the array in the shared memory object store once
# so it is not copied multiple times.
array_id = ray.put(array)

result_ids = [func.remote(array_id, i) for i in range(4)]
output = ray.get(result_ids)

If you don’t call ray.put then the array will still be stored in shared memory, but that will be done once per invocation of func, which is not what you want.

Note that this will work not only for arrays but also for objects that contain arrays, e.g., dictionaries mapping ints to arrays as below.

You can compare the performance of serialization in Ray versus pickle by running the following in IPython.

import numpy as np
import pickle
import ray

ray.init()

x = {i: np.ones(10**7) for i in range(20)}

# Time Ray.
%time x_id = ray.put(x)  # 2.4s
%time new_x = ray.get(x_id)  # 0.00073s

# Time pickle.
%time serialized = pickle.dumps(x)  # 2.6s
%time deserialized = pickle.loads(serialized)  # 1.9s

Serialization with Ray is only slightly faster than pickle, but deserialization is 1000x faster because of the use of shared memory (this number will of course depend on the object).

See the Ray documentation. You can read more about fast serialization using Ray and Arrow. Note I’m one of the Ray developers.


Answer 3

Like Robert Nishihara mentioned, Apache Arrow makes this easy, specifically with the Plasma in-memory object store, which is what Ray is built on.

I made brain-plasma specifically for this reason – fast loading and reloading of big objects in a Flask app. It’s a shared-memory object namespace for Apache Arrow-serializable objects, including pickle‘d bytestrings generated by pickle.dumps(...).

The key difference from Ray and Plasma is that brain-plasma keeps track of object IDs for you. Any processes, threads, or programs running locally can share the variables’ values by calling the name from any Brain object.

$ pip install brain-plasma
$ plasma_store -m 10000000 -s /tmp/plasma

from brain_plasma import Brain
brain = Brain(path='/tmp/plasma')

brain['a'] = [1]*10000

brain['a']
# >>> [1,1,1,1,...]

How to escape os.system() calls?

Question: How to escape os.system() calls?

When using os.system() it’s often necessary to escape filenames and other arguments passed as parameters to commands. How can I do this? Preferably something that would work on multiple operating systems/shells but in particular for bash.

I’m currently doing the following, but am sure there must be a library function for this, or at least a more elegant/robust/efficient option:

def sh_escape(s):
   return s.replace("(","\\(").replace(")","\\)").replace(" ","\\ ")

os.system("cat %s | grep something | sort > %s" 
          % (sh_escape(in_filename), 
             sh_escape(out_filename)))

Edit: I’ve accepted the simple answer of using quotes; I don’t know why I didn’t think of that. I guess it’s because I came from Windows, where ' and " behave a little differently.

Regarding security, I understand the concern, but, in this case, I’m interested in a quick and easy solution which os.system() provides, and the source of the strings is either not user-generated or at least entered by a trusted user (me).


Answer 0

This is what I use:

def shellquote(s):
    return "'" + s.replace("'", "'\\''") + "'"

The shell will always accept a quoted filename and remove the surrounding quotes before passing it to the program in question. Notably, this avoids problems with filenames that contain spaces or any other kind of nasty shell metacharacter.

Update: If you are using Python 3.3 or later, use shlex.quote instead of rolling your own.
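
To make the update concrete, here is a small sketch of the pipeline from the question rebuilt with shlex.quote. The helper name build_pipeline is invented for this example, and the resulting string is meant for a POSIX shell such as bash.

```python
import shlex


def build_pipeline(in_filename, out_filename):
    """Return a shell command string with both filenames safely quoted."""
    return "cat {} | grep something | sort > {}".format(
        shlex.quote(in_filename),
        shlex.quote(out_filename),
    )


# Spaces and parentheses in the input name are protected by single quotes;
# a name made only of safe characters is left unquoted.
print(build_pipeline("my file (1).txt", "out.txt"))
```

The returned string could then be handed to os.system() as in the question.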


Answer 1

shlex.quote() does what you want since Python 3.3.

(Use pipes.quote to support both python 2 and python 3)


Answer 2

Perhaps you have a specific reason for using os.system(). But if not you should probably be using the subprocess module. You can specify the pipes directly and avoid using the shell.

The following is from PEP324:

Replacing shell pipe line
-------------------------

output=`dmesg | grep hda`
==>
p1 = Popen(["dmesg"], stdout=PIPE)
p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
output = p2.communicate()[0]
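
The PEP 324 snippet above can be tried without dmesg or grep. The sketch below follows the same Popen chaining but, as an assumption for portability, uses two small Python child processes as stand-ins for the real commands; grep_pipeline is a made-up name.

```python
import subprocess
import sys


def grep_pipeline(needle):
    """Pipe one child's stdout into another without involving a shell."""
    # Stand-in for `dmesg`: prints a few fake device lines.
    producer = [sys.executable, "-c",
                "print('hda1'); print('sdb1'); print('hda2')"]
    # Stand-in for `grep`: keeps stdin lines that contain argv[1].
    consumer = [sys.executable, "-c",
                "import sys\n"
                "for line in sys.stdin:\n"
                "    if sys.argv[1] in line:\n"
                "        sys.stdout.write(line)",
                needle]
    p1 = subprocess.Popen(producer, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(consumer, stdin=p1.stdout, stdout=subprocess.PIPE)
    p1.stdout.close()  # so p1 can get SIGPIPE if p2 exits first
    output = p2.communicate()[0]
    p1.wait()
    return output.decode().split()


if __name__ == "__main__":
    print(grep_pipeline("hda"))
```

Because every argument is a separate list element, nothing needs to be escaped.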

Answer 3

Maybe subprocess.list2cmdline is a better shot?


Answer 4

Note that pipes.quote is actually broken in Python 2.5 and Python 3.1 and not safe to use: it doesn’t handle zero-length arguments.

>>> from pipes import quote
>>> args = ['arg1', '', 'arg3']
>>> print 'mycommand %s' % (' '.join(quote(arg) for arg in args))
mycommand arg1  arg3

See Python issue 7476; it has been fixed in Python 2.6 and 3.2 and newer.
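
For comparison, a quick check of the same zero-length case with shlex.quote on a current Python 3. The helper build_command is hypothetical and only joins the quoted pieces.

```python
import shlex


def build_command(program, args):
    """Join a program name and its arguments into one shell-safe string."""
    return " ".join([program] + [shlex.quote(arg) for arg in args])


cmd = build_command("mycommand", ["arg1", "", "arg3"])
print(cmd)               # the empty argument survives as ''
print(shlex.split(cmd))  # splitting again recovers all three arguments
```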


Answer 5

Notice: This is an answer for Python 2.7.x.

According to the source, pipes.quote() is a way to “Reliably quote a string as a single argument for /bin/sh“. (Although it is deprecated since version 2.7 and finally exposed publicly in Python 3.3 as the shlex.quote() function.)

On the other hand, subprocess.list2cmdline() is a way to “Translate a sequence of arguments into a command line string, using the same rules as the MS C runtime“.

Putting the two together gives a platform-independent way of quoting strings for command lines:

import sys
mswindows = (sys.platform == "win32")

if mswindows:
    from subprocess import list2cmdline
    quote_args = list2cmdline
else:
    # POSIX
    from pipes import quote

    def quote_args(seq):
        return ' '.join(quote(arg) for arg in seq)

Usage:

# Quote a single argument
print quote_args(['my argument'])

# Quote multiple arguments
my_args = ['This', 'is', 'my arguments']
print quote_args(my_args)
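
The code above targets Python 2.7. A possible Python 3 adaptation is sketched below, assuming shlex.quote as the replacement for pipes.quote (the pipes module was removed in Python 3.13). Note that subprocess.list2cmdline is an undocumented helper, so relying on it is a judgment call.

```python
import shlex
import subprocess
import sys


def quote_args(seq):
    """Quote a sequence of arguments for the current platform's shell."""
    if sys.platform == "win32":
        # Same rules as the MS C runtime (undocumented helper).
        return subprocess.list2cmdline(seq)
    # POSIX: quote each argument for /bin/sh.
    return " ".join(shlex.quote(arg) for arg in seq)


print(quote_args(["my argument"]))
print(quote_args(["This", "is", "my arguments"]))
```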

Answer 6

I believe that os.system just invokes whatever command shell is configured for the user, so I don’t think you can do it in a platform independent way. My command shell could be anything from bash, emacs, ruby, or even quake3. Some of these programs aren’t expecting the kind of arguments you are passing to them and even if they did there is no guarantee they do their escaping the same way.


Answer 7

The function I use is:

def quote_argument(argument):
    return '"%s"' % (
        argument
        .replace('\\', '\\\\')
        .replace('"', '\\"')
        .replace('$', '\\$')
        .replace('`', '\\`')
    )

That is, I always enclose the argument in double quotes and then backslash-escape the only characters that are special inside double quotes.


Answer 8

If you do use the system command, I would try to whitelist what goes into the os.system() call. For example:

import os
import re

clean_user_input = re.sub("[^a-zA-Z]", "", user_input)
os.system("ls %s" % (clean_user_input))

The subprocess module is a better option, and I would recommend avoiding anything like os.system/subprocess wherever possible.


Answer 9

The real answer is: Don’t use os.system() in the first place. Use subprocess.call instead and supply the unescaped arguments.
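
A minimal sketch of what "supply the unescaped arguments" looks like in practice. It uses subprocess.run, the newer sibling of subprocess.call, and a Python child process as a stand-in for a real command; echo_argument is an invented name.

```python
import subprocess
import sys


def echo_argument(arg):
    """Pass arg to a child process as-is and return what the child received."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; sys.stdout.write(sys.argv[1])", arg],
        capture_output=True,  # requires Python 3.7+
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Spaces, parentheses, quotes and $ need no escaping: no shell is involved.
    print(echo_argument("my file (1) $HOME 'x'.txt"))
```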


Python list as a parameter in a SQL query

Question: Python list as a parameter in a SQL query

I have a python list, say l

l = [1,5,8]

I want to write a sql query to get the data for all the elements of the list, say

select name from students where id = |IN THE LIST l|

How do I accomplish this?


Answer 0

Answers so far have been templating the values into a plain SQL string. That’s absolutely fine for integers, but if we wanted to do it for strings we get the escaping issue.

Here’s a variant using a parameterised query that would work for both:

placeholder= '?' # For SQLite. See DBAPI paramstyle.
placeholders= ', '.join(placeholder for unused in l)
query= 'SELECT name FROM students WHERE id IN (%s)' % placeholders
cursor.execute(query, l)
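
To try the parameterised approach end to end, here is a self-contained sketch against an in-memory SQLite database. The table contents and the helper name names_for_ids are made up for illustration; the early return for an empty list is an added assumption, since IN () is not valid SQL in most databases.

```python
import sqlite3


def names_for_ids(conn, ids):
    """Return student names for the given ids using one '?' per value."""
    if not ids:
        return []  # avoid generating "IN ()"
    placeholders = ", ".join("?" for _ in ids)
    query = "SELECT name FROM students WHERE id IN (%s) ORDER BY id" % placeholders
    return [row[0] for row in conn.execute(query, list(ids))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [(1, "Ann"), (5, "O'Brien"), (8, "Cy"), (9, "Dee")],
)
print(names_for_ids(conn, [1, 5, 8]))
```

Only the placeholders are formatted into the SQL text; the values themselves are bound by the driver, so a name like O'Brien needs no manual escaping.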

Answer 1

The easiest way is to turn the list into a tuple first:

t = tuple(l)
query = "select name from students where id IN {}".format(t)

Answer 2

Don’t complicate it; the solution for this is simple.

l = [1,5,8]

l = tuple(l)

params = {'l': l}

cursor.execute('SELECT * FROM table where id in %(l)s',params)

I hope this helped !!!


Answer 3

The SQL you want is

select name from students where id in (1, 5, 8)

If you want to construct this from the python you could use

l = [1, 5, 8]
sql_query = 'select name from students where id in (' + ','.join(map(str, l)) + ')'

The map function will transform the list into a list of strings that can be glued together by commas using the str.join method.

Alternatively:

l = [1, 5, 8]
sql_query = 'select name from students where id in (' + ','.join((str(n) for n in l)) + ')'

if you prefer generator expressions to the map function.

UPDATE: S. Lott mentions in the comments that the Python SQLite bindings don’t support sequences. In that case, you might want

select name from students where id = 1 or id = 5 or id = 8

Generated by

sql_query = 'select name from students where ' + ' or '.join(('id = ' + str(n) for n in l))

Answer 4

Use str.join to combine the list values, separated by commas, and use the format operator to form the query string.

myquery = "select name from students where id in (%s)" % ",".join(map(str,mylist))

(Thanks, blair-conrad)


Answer 5

I like bobince’s answer:

placeholder= '?' # For SQLite. See DBAPI paramstyle.
placeholders= ', '.join(placeholder for unused in l)
query= 'SELECT name FROM students WHERE id IN (%s)' % placeholders
cursor.execute(query, l)

But I noticed this:

placeholders= ', '.join(placeholder for unused in l)

Can be replaced with:

placeholders= ', '.join(placeholder*len(l))

I find this more direct if less clever and less general. Here l is required to have a length (i.e. refer to an object that defines a __len__ method), which shouldn’t be a problem. But placeholder must also be a single character. To support a multi-character placeholder use:

placeholders= ', '.join([placeholder]*len(l))
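
A short check of the point about multi-character placeholders; placeholders_for is a hypothetical helper wrapping the last variant.

```python
def placeholders_for(values, placeholder="?"):
    """Build a comma-separated placeholder list, one entry per value."""
    return ", ".join([placeholder] * len(values))


l = [1, 5, 8]
print(placeholders_for(l))            # single-character placeholder
print(placeholders_for(l, "%s"))      # multi-character placeholder stays intact
print(", ".join("%s" * len(l)))       # string repetition splits it per character
```

The third line shows why the string-repetition form only suits single-character placeholders: joining a string iterates over its characters.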

Answer 6

A fix for @umounted’s answer, which breaks with a one-element tuple because (1,) is not valid SQL:

>>> random_ids = [1234,123,54,56,57,58,78,91]
>>> cursor.execute("create table test (id)")
>>> for item in random_ids:
    cursor.execute("insert into test values (%d)" % item)
>>> sublist = [56,57,58]
>>> cursor.execute("select id from test where id in %s" % str(tuple(sublist)).replace(',)',')'))
>>> a = cursor.fetchall()
>>> a
[(56,), (57,), (58,)]

Other solution for sql string:

cursor.execute("select id from test where id in (%s)" % ('"'+'", "'.join(l)+'"'))

Answer 7

placeholders= ', '.join("'{"+str(i)+"}'" for i in range(len(l)))
query="select name from students where id in (%s)"%placeholders
query=query.format(*l)
cursor.execute(query)

This should solve your problem.


Answer 8

If you’re using PostgreSQL with the Psycopg2 library you can let its tuple adaption do all the escaping and string interpolation for you, e.g:

ids = [1,2,3]
cur.execute(
  "SELECT * FROM foo WHERE id IN %s",
  [tuple(ids)])

That is, just make sure that you’re passing the IN parameter as a tuple. If it’s a list, you can use the = ANY array syntax:

cur.execute(
  "SELECT * FROM foo WHERE id = ANY (%s)",
  [list(ids)])

Note that both of these get turned into the same query plan, so you should just use whichever is easier: if your values come in a tuple, use the former; if they’re stored in a list, use the latter.


Answer 9

For example, if you want the sql query:

select name from students where id in (1, 5, 8)

What about:

my_list = [1, 5, 8]
cur.execute("select name from students where id in %s" % repr(my_list).replace('[','(').replace(']',')') )

Answer 10

A simpler solution:

lst = [1, 2, 3, 'a', 'b', 'c']

query = f"""SELECT * FROM table WHERE column IN ({str(lst)[1:-1]})"""

Answer 11

l = [1] # or [1,2,3]

query = "SELECT * FROM table WHERE id IN :l"
params = {'l' : tuple(l)}
cursor.execute(query, params)

The :var notation seems simpler. (Python 3.7)


Answer 12

This uses parameter substitution and takes care of the single value list case:

l = [1,5,8]

get_operator = lambda x: '=' if len(x) == 1 else 'IN'
get_value = lambda x: int(x[0]) if len(x) == 1 else x

query = 'SELECT * FROM table where id ' + get_operator(l) + ' %s'

cursor.execute(query, (get_value(l),))

What is the difference between Conda and Anaconda?

Question: What is the difference between Conda and Anaconda?

Post-question update:

See Introduction to Conda for more details.


The problem:

I first installed Anaconda on my Ubuntu machine at ~/anaconda. When I tried to update Anaconda, the documentation from Continuum Analytics said I should use the following commands:

conda update conda
conda update anaconda

Then I realized that I did not have conda installed, so I installed it using the documentation from here.

After conda was installed, I ran conda update anaconda and got the following error:

Error: package ‘anaconda’ is not installed in /home/xiang/miniconda

It appears conda is assuming my anaconda is installed under /home/xiang/miniconda which is NOT true.

The questions:

  1. What are the differences between conda and anaconda?
  2. How can I tell conda where my anaconda is installed?

Answer 0

conda is the package manager. Anaconda is a set of about a hundred packages including conda, numpy, scipy, ipython notebook, and so on.

You installed Miniconda, which is a smaller alternative to Anaconda that is just conda and its dependencies, not those listed above.

Once you have Miniconda, you can easily install Anaconda into it with conda install anaconda.


Answer 1

Brief

conda is both a command line tool, and a python package.

Miniconda installer = Python + conda

Anaconda installer = Python + conda + meta package anaconda

meta Python pkg anaconda = about 160 other Python packages for daily use in data science

Anaconda installer = Miniconda installer + conda install anaconda

Detail

conda is an environment manager and a package manager. It means the tool itself. conda makes it possible to

  • install package with conda install flake8
  • create an environment with any version of Python with conda create -n myenv python=3.6

conda is not a binary command; it is a Python package. To make conda work, you have to create a Python environment and install the conda package into it. This is where the Anaconda installer and the Miniconda installer come in.

The Miniconda installer installs Python and the conda package. The Anaconda installer not only does what Miniconda does, it also installs a meta Python package named anaconda for you.

Meta packages are packages that do NOT contain actual software and simply depend on other packages to be installed.

The actual 160+ Python packages included in the anaconda package are listed in info/recipe/meta.yaml in its source files.

package:
    name: anaconda
    version: '2019.07'
build:
    ignore_run_exports:
        - '*'
    number: '0'
    pin_depends: strict
    string: py36_0
requirements:
    build:
        - python 3.6.8 haf84260_0
    is_meta_pkg:
        - true
    run:
        - alabaster 0.7.12 py36_0
        - anaconda-client 1.7.2 py36_0
        - anaconda-project 0.8.3 py_0
        # ...
        - beautifulsoup4 4.7.1 py36_1
        # ...
        - curl 7.65.2 ha441bb4_0
        # ...
        - hdf5 1.10.4 hfa1e0ec_0
        # ...
        - ipykernel 5.1.1 py36h39e3cac_0
        - ipython 7.6.1 py36h39e3cac_0
        - ipython_genutils 0.2.0 py36h241746c_0
        - ipywidgets 7.5.0 py_0
        # ...
        - jupyter 1.0.0 py36_7
        - jupyter_client 5.3.1 py_0
        - jupyter_console 6.0.0 py36_0
        - jupyter_core 4.5.0 py_0
        - jupyterlab 1.0.2 py36hf63ae98_0
        - jupyterlab_server 1.0.0 py_0
        # ...
        - matplotlib 3.1.0 py36h54f8f79_0
        # ...
        - mkl 2019.4 233
        - mkl-service 2.0.2 py36h1de35cc_0
        - mkl_fft 1.0.12 py36h5e564d8_0
        - mkl_random 1.0.2 py36h27c97d8_0
        # ...
        - nltk 3.4.4 py36_0
        # ...
        - numpy 1.16.4 py36hacdab7b_0
        - numpy-base 1.16.4 py36h6575580_0
        - numpydoc 0.9.1 py_0
        # ...
        - pandas 0.24.2 py36h0a44026_0
        - pandoc 2.2.3.2 0
        # ...
        - pillow 6.1.0 py36hb68e598_0
        # ...
        - pyqt 5.9.2 py36h655552a_2
        # ...
        - qt 5.9.7 h468cd18_1
        - qtawesome 0.5.7 py36_1
        - qtconsole 4.5.1 py_0
        - qtpy 1.8.0 py_0
        # ...
        - requests 2.22.0 py36_0
        # ...
        - sphinx 2.1.2 py_0
        - sphinxcontrib 1.0 py36_1
        - sphinxcontrib-applehelp 1.0.1 py_0
        - sphinxcontrib-devhelp 1.0.1 py_0
        - sphinxcontrib-htmlhelp 1.0.2 py_0
        - sphinxcontrib-jsmath 1.0.1 py_0
        - sphinxcontrib-qthelp 1.0.2 py_0
        - sphinxcontrib-serializinghtml 1.1.3 py_0
        - sphinxcontrib-websupport 1.1.2 py_0
        - spyder 3.3.6 py36_0
        - spyder-kernels 0.5.1 py36_0
        # ...

The pre-installed packages from the anaconda meta package are mainly for web scraping and data science, such as requests, beautifulsoup, numpy, and nltk.
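To make the "meta package is just a list of dependencies" point concrete, here is a rough sketch that pulls the package names out of the run section of a meta.yaml-style text. It uses plain string handling rather than a YAML parser, the function name `run_requirements` is made up for this example, and it assumes the simple layout shown in the listing above.

```python
def run_requirements(meta_yaml_text):
    """Collect the package names listed under 'run:' in a meta.yaml-style text."""
    names = []
    in_run = False
    for line in meta_yaml_text.splitlines():
        stripped = line.strip()
        if stripped.endswith(":"):
            # Any key line switches sections; only 'run:' turns collection on.
            in_run = stripped == "run:"
        elif in_run and stripped.startswith("- "):
            # Entries look like "- numpy 1.16.4 py36hacdab7b_0"; keep the name.
            names.append(stripped[2:].split()[0])
    return names


sample = """\
requirements:
    build:
        - python 3.6.8 haf84260_0
    run:
        - alabaster 0.7.12 py36_0
        # ...
        - numpy 1.16.4 py36hacdab7b_0
"""
print(run_requirements(sample))  # ['alabaster', 'numpy']
```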


Find the most frequent number in a numpy vector

Question: Find the most frequent number in a numpy vector

Suppose I have the following list in python:

a = [1,2,3,1,2,1,1,1,3,2,2,1]

How can I find the most frequent number in this list in a neat way?


Answer 0

If your list contains only non-negative ints, you should take a look at numpy.bincount:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html

and then probably use np.argmax:

import numpy as np

a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])
counts = np.bincount(a)
print(np.argmax(counts))

For a more complicated list (that perhaps contains negative numbers or non-integer values), you can use np.histogram in a similar way. Alternatively, if you just want to work in python without using numpy, collections.Counter is a good way of handling this sort of data.

from collections import Counter
a = [1,2,3,1,2,1,1,1,3,2,2,1]
b = Counter(a)
print(b.most_common(1))

Answer 1

You may use

(values,counts) = np.unique(a,return_counts=True)
ind=np.argmax(counts)
print(values[ind])  # prints the most frequent element

If some element is as frequent as another one, this code will return only the first element.


Answer 2

If you’re willing to use SciPy:

>>> from scipy.stats import mode
>>> mode([1,2,3,1,2,1,1,1,3,2,2,1])
(array([ 1.]), array([ 6.]))
>>> most_frequent = mode([1,2,3,1,2,1,1,1,3,2,2,1])[0][0]
>>> most_frequent
1.0

Answer 3

Performance (using IPython) of some of the solutions found here:

>>> # small array
>>> a = [12,3,65,33,12,3,123,888000]
>>> 
>>> import collections
>>> collections.Counter(a).most_common()[0][0]
3
>>> %timeit collections.Counter(a).most_common()[0][0]
100000 loops, best of 3: 11.3 µs per loop
>>> 
>>> import numpy
>>> numpy.bincount(a).argmax()
3
>>> %timeit numpy.bincount(a).argmax()
100 loops, best of 3: 2.84 ms per loop
>>> 
>>> import scipy.stats
>>> scipy.stats.mode(a)[0][0]
3.0
>>> %timeit scipy.stats.mode(a)[0][0]
10000 loops, best of 3: 172 µs per loop
>>> 
>>> from collections import defaultdict
>>> def jjc(l):
...     d = defaultdict(int)
...     for i in l:
...         d[i] += 1
...     return sorted(d.items(), key=lambda x: x[1], reverse=True)[0]
... 
>>> jjc(a)[0]
3
>>> %timeit jjc(a)[0]
100000 loops, best of 3: 5.58 µs per loop
>>> 
>>> max(map(lambda val: (a.count(val), val), set(a)))[1]
12
>>> %timeit max(map(lambda val: (a.count(val), val), set(a)))[1]
100000 loops, best of 3: 4.11 µs per loop
>>> 

For small arrays like the one in the question, ‘max’ with ‘set’ is the fastest.

According to @David Sanders, if you increase the array size to something like 100,000 elements, the “max w/set” algorithm ends up being the worst by far whereas the “numpy bincount” method is the best.
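The timings above depend heavily on the input size, so it is worth re-running the comparison on data shaped like your own. The following is only a sketch of how that could be done with the standard timeit module; the three helper names, the array size, and the random seed are arbitrary choices made for this example, and no particular timings are implied.

```python
import timeit
from collections import Counter

import numpy as np


def mode_counter(values):
    return Counter(values).most_common(1)[0][0]


def mode_bincount(values):
    return int(np.bincount(values).argmax())


def mode_max_set(values):
    return max(set(values), key=values.count)


rng = np.random.default_rng(0)
# 2000 random ints plus 500 extra sevens, so that the mode is unambiguous.
data = rng.integers(0, 50, size=2000).tolist() + [7] * 500

for fn in (mode_counter, mode_bincount, mode_max_set):
    seconds = timeit.timeit(lambda: fn(data), number=5)
    print(f"{fn.__name__}: result={fn(data)}, {seconds / 5 * 1e3:.3f} ms per call")
```

Growing `size` (and the range of values) should show where each approach starts to fall behind.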


Answer 4

Also, if you want to get the most frequent value (positive or negative) without loading any modules, you can use the following code:

lVals = [1,2,3,1,2,1,1,1,3,2,2,1]
print(max(map(lambda val: (lVals.count(val), val), set(lVals))))  # prints the (count, value) pair

Answer 5

While most of the answers above are useful, if you: 1) need to support values other than non-negative integers (e.g. floats or negative integers ;-)), 2) aren’t on Python 2.7 (which collections.Counter requires), and 3) prefer not to add a dependency on scipy (or even numpy) to your code, then a pure Python 2.6 solution that is O(n log n) (i.e., efficient) is just this:

from collections import defaultdict

a = [1,2,3,1,2,1,1,1,3,2,2,1]

d = defaultdict(int)
for i in a:
  d[i] += 1
most_frequent = sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]

Answer 6

I like the solution by JoshAdel.

But there is just one catch.

The np.bincount() solution only works on non-negative integers.

If you have strings, the collections.Counter solution will work for you.
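As a small illustration of the string case, the sketch below applies collections.Counter to a made-up list of words:

```python
from collections import Counter

words = ["spam", "eggs", "spam", "ham", "eggs", "spam"]

# most_common(1) returns a one-element list of (value, count) pairs.
value, count = Counter(words).most_common(1)[0]
print(value, count)  # spam 3
```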


Answer 7

Expanding on this method: when finding the mode of the data, you may also need the index of the value in the original array, for example to see how far it is from the center of the distribution.

(_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
index = idx[np.argmax(counts)]
mode = a[index]

Remember to discard the mode when more than one value shares the maximum count, i.e. when np.sum(counts == counts.max()) > 1.
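One way to combine the index lookup with the tie check is sketched below. The function name `mode_with_index` and the choice to return None for an ambiguous mode are assumptions made for this example, not part of the answer.

```python
import numpy as np


def mode_with_index(a):
    """Return (mode, index of its first occurrence), or None when values tie."""
    a = np.asarray(a)
    values, first_idx, counts = np.unique(a, return_index=True, return_counts=True)
    if np.sum(counts == counts.max()) > 1:
        return None  # ambiguous: more than one value has the top count
    index = first_idx[np.argmax(counts)]
    return a[index].item(), int(index)


print(mode_with_index([3, 1, 2, 1, 1]))  # (1, 1)
print(mode_with_index([1, 2, 1, 2]))     # None
```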


Answer 8

In Python 3 the following should work:

max(set(a), key=lambda x: a.count(x))

Answer 9

Starting in Python 3.4, the standard library includes the statistics.mode function to return the single most common data point.

from statistics import mode

mode([1, 2, 3, 1, 2, 1, 1, 1, 3, 2, 2, 1])
# 1

If there are multiple modes with the same frequency, statistics.mode returns the first one encountered (as of Python 3.8; earlier versions raise StatisticsError).


Starting in Python 3.8, the statistics.multimode function returns a list of the most frequently occurring values in the order they were first encountered:

from statistics import multimode

multimode([1, 2, 3, 1, 2])
# [1, 2]

Answer 10

Here is a general solution that may be applied along an axis, regardless of values, using purely numpy. I’ve also found that this is much faster than scipy.stats.mode if there are a lot of unique values.

import numpy

def mode(ndarray, axis=0):
    # Check inputs
    ndarray = numpy.asarray(ndarray)
    ndim = ndarray.ndim
    if ndarray.size == 1:
        return (ndarray[0], 1)
    elif ndarray.size == 0:
        raise Exception('Cannot compute mode on empty array')
    try:
        axis = range(ndarray.ndim)[axis]
    except:
        raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))

    # If array is 1-D and numpy version is >= 1.9, numpy.unique will suffice
    numpy_version = tuple(int(part) for part in numpy.__version__.split('.')[:2])
    if ndim == 1 and numpy_version >= (1, 9):
        modals, counts = numpy.unique(ndarray, return_counts=True)
        index = numpy.argmax(counts)
        return modals[index], counts[index]

    # Sort array
    sort = numpy.sort(ndarray, axis=axis)
    # Create array to transpose along the axis and get padding shape
    transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
    shape = list(sort.shape)
    shape[axis] = 1
    # Create a boolean array along strides of unique values
    strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                 numpy.diff(sort, axis=axis) == 0,
                                 numpy.zeros(shape=shape, dtype='bool')],
                                axis=axis).transpose(transpose).ravel()
    # Count the stride lengths
    counts = numpy.cumsum(strides)
    counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
    counts[strides] = 0
    # Get shape of padded counts and slice to return to the original shape
    shape = numpy.array(sort.shape)
    shape[axis] += 1
    shape = shape[transpose]
    slices = [slice(None)] * ndim
    slices[axis] = slice(1, None)
    # Reshape and compute final counts (index with a tuple, as current numpy requires)
    counts = counts.reshape(shape).transpose(transpose)[tuple(slices)] + 1

    # Find maximum counts and return modals/counts
    slices = [slice(None, i) for i in sort.shape]
    del slices[axis]
    index = list(numpy.ogrid[tuple(slices)])
    index.insert(axis, numpy.argmax(counts, axis=axis))
    return sort[tuple(index)], counts[tuple(index)]

Answer 11

I was recently working on a project that used collections.Counter (which tortured me).

In my opinion, the Counter in collections performs very badly. It’s just a class wrapping dict().

What’s worse, if you use cProfile to profile its methods, you will see a lot of ‘__missing__’ and ‘__instancecheck__’ calls eating up the time.

Be careful when using its most_common(), because every call invokes a sort, which makes it extremely slow. If you use most_common(x), it invokes a heap sort, which is also slow.

Btw, numpy’s bincount also has a problem: if you use np.bincount([1,2,4000000]), you will get an array with 4000001 elements.
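The bincount caveat is easy to check: the result has one slot for every integer from 0 up to the largest value, however few distinct values there are. A short sketch, with np.unique shown as one alternative that only stores the values that actually occur (the sample array is made up):

```python
import numpy as np

sparse = np.array([1, 2, 4000000, 2])

# bincount allocates one counter per integer from 0 to the maximum value.
print(np.bincount(sparse).size)

# np.unique keeps only the distinct values and their counts.
values, counts = np.unique(sparse, return_counts=True)
print(values[np.argmax(counts)])
```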


Is there a Python caching library?

Question: Is there a Python caching library?

I’m looking for a Python caching library but can’t find anything so far. I need a simple dict-like interface where I can set keys and their expiration and get them back cached. Sort of something like:

cache.get(myfunction, duration=300)

which will give me the item from the cache if it exists or call the function and store it if it doesn’t or has expired. Does anyone know something like this?


Answer 0


Answer 1

Since Python 3.2, you can use the @lru_cache decorator from the functools library. It’s a Least Recently Used cache, so there is no expiration time for the items in it, but as a quick hack it’s very useful.

from functools import lru_cache

@lru_cache(maxsize=256)
def f(x):
  return x*x

for x in range(20):
  print(f(x))
for x in range(20):
  print(f(x))
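Since the question asks for expiration, one common workaround is to feed lru_cache an extra argument that changes once per time window, so that older entries simply stop being hit. The sketch below is one possible way to do that, not a functools feature: `ttl_cache` and its `clock` parameter are hypothetical names, and entries expire at window boundaries rather than exactly `seconds` after they were stored.

```python
import time
from functools import lru_cache


def ttl_cache(seconds, maxsize=256, clock=time.monotonic):
    """Hypothetical helper: an lru_cache whose entries go stale per time window."""
    def decorator(fn):
        @lru_cache(maxsize=maxsize)
        def cached(time_bucket, *args, **kwargs):
            # time_bucket only takes part in the cache key; fn never sees it.
            return fn(*args, **kwargs)

        def wrapper(*args, **kwargs):
            # Every call inside the same window maps to the same bucket number.
            bucket = int(clock() // seconds)
            return cached(bucket, *args, **kwargs)

        return wrapper
    return decorator


# A fake clock makes the behaviour easy to observe without sleeping.
now = [0.0]
calls = []


@ttl_cache(seconds=300, clock=lambda: now[0])
def square(x):
    calls.append(x)
    return x * x


square(4)
square(4)        # served from the cache
now[0] = 301.0
square(4)        # new window, so the function runs again
print(calls)     # [4, 4]
```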

Answer 2

You might also take a look at the Memoize decorator. You could probably get it to do what you want without too much modification.


Answer 3

Joblib https://joblib.readthedocs.io supports caching functions in the Memoize pattern. Mostly, the idea is to cache computationally expensive functions.

>>> from joblib import Memory
>>> mem = Memory(location='/tmp/joblib')
>>> import numpy as np
>>> square = mem.cache(np.square)
>>> 
>>> a = np.vander(np.arange(3)).astype(float)
>>> b = square(a)                                   
________________________________________________________________________________
[Memory] Calling square...
square(array([[ 0.,  0.,  1.],
       [ 1.,  1.,  1.],
       [ 4.,  2.,  1.]]))
___________________________________________________________square - 0...s, 0.0min

>>> c = square(a)

You can also do fancy things like using the @memory.cache decorator on functions. The documentation is here: https://joblib.readthedocs.io/en/latest/generated/joblib.Memory.html


Answer 4

No one has mentioned shelve yet. https://docs.python.org/2/library/shelve.html

It isn’t memcached, but looks much simpler and might fit your need.
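As a rough idea of how shelve could cover the expiration requirement from the question, the sketch below stores each value together with an expiry timestamp. The function name `cached_call`, the entry layout, and the temporary file location are all assumptions made for this example.

```python
import os
import shelve
import tempfile
import time


def cached_call(path, key, fn, duration):
    """Return the stored value for key, recomputing it once it has expired."""
    with shelve.open(path) as db:
        entry = db.get(key)
        if entry is None or entry["expires"] < time.time():
            # Missing or stale: call fn and store the result with a new expiry.
            entry = {"value": fn(), "expires": time.time() + duration}
            db[key] = entry
        return entry["value"]


path = os.path.join(tempfile.mkdtemp(), "cache")
print(cached_call(path, "answer", lambda: 42, duration=300))  # computed and stored
print(cached_call(path, "answer", lambda: 0, duration=300))   # read back from disk
```

Because the shelf lives on disk, the cached values survive restarts of the program, which an in-memory dict does not offer.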


Answer 5

I think the python memcached API is the prevalent tool, but I haven’t used it myself and am not sure whether it supports the features you need.


Answer 6

import time

class CachedItem(object):
    def __init__(self, key, value, duration=60):
        self.key = key
        self.value = value
        self.duration = duration
        self.timeStamp = time.time()

    def __repr__(self):
        # Expiry is fixed at creation time: timestamp plus duration
        return '<CachedItem {%s:%s} expires at: %s>' % (self.key, self.value, self.timeStamp + self.duration)

class CachedDict(dict):

    def get(self, key, fn, duration):
        # Recompute when the key is missing or its entry has expired
        if key not in self \
            or self[key].timeStamp + self[key].duration < time.time():
                print('adding new value')
                o = fn(key)
                self[key] = CachedItem(key, o, duration)
        else:
            print('loading from cache')

        return self[key].value



if __name__ == '__main__':

    fn = lambda key: 'value of %s  is None' % key

    ci = CachedItem('a', 12)
    print(ci)
    cd = CachedDict()
    print(cd.get('a', fn, 5))
    time.sleep(2)
    print(cd.get('a', fn, 6))
    print(cd.get('b', fn, 6))
    time.sleep(2)
    print(cd.get('a', fn, 7))
    print(cd.get('b', fn, 7))

Answer 7

You can use my simple solution to the problem. It is really straightforward, nothing fancy:

class MemCache(dict):
    def __init__(self, fn):
        dict.__init__(self)
        self.__fn = fn

    def __getitem__(self, item):
        if item not in self:
            dict.__setitem__(self, item, self.__fn(item))
        return dict.__getitem__(self, item)

mc = MemCache(lambda x: x*x)

for x in range(10):
    print(mc[x])

for x in range(10):
    print(mc[x])

It does lack expiration functionality, but you can easily extend it by specifying a particular rule in the MemCache constructor.

Hopefully the code is self-explanatory, but if not: the cache is passed a translation function as one of its constructor parameters, and that function is then used to generate the cached output for a given input.

Hope it helps
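The expiration rule mentioned above could look roughly like the following sketch. The class name `ExpiringMemCache`, the `ttl` and `clock` parameters, and the per-key timestamp dictionary are assumptions made for this example; the answer itself does not prescribe a design.

```python
import time


class ExpiringMemCache(dict):
    """Hypothetical MemCache variant: values are recomputed after ttl seconds."""

    def __init__(self, fn, ttl, clock=time.monotonic):
        dict.__init__(self)
        self._fn = fn
        self._ttl = ttl
        self._clock = clock
        self._stored_at = {}  # key -> time at which the value was computed

    def __getitem__(self, item):
        now = self._clock()
        # Keys that were never stored count as infinitely old.
        expired = now - self._stored_at.get(item, float("-inf")) > self._ttl
        if item not in self or expired:
            dict.__setitem__(self, item, self._fn(item))
            self._stored_at[item] = now
        return dict.__getitem__(self, item)


# A fake clock shows the expiry without having to sleep.
now = [0.0]
calls = []


def square(x):
    calls.append(x)
    return x * x


mc = ExpiringMemCache(square, ttl=60, clock=lambda: now[0])
mc[3]
mc[3]            # cache hit, square is not called again
now[0] = 61.0
mc[3]            # older than ttl, so it is recomputed
print(calls)     # [3, 3]
```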


Answer 8

Try redis. It is one of the cleanest and easiest solutions for applications to share data in an atomic way, or if you have some kind of web server platform. It’s very easy to set up; you will need a Python redis client: http://pypi.python.org/pypi/redis


Answer 9

Look at gocept.cache on PyPI; it manages timeouts.


Answer 10

This project aims to provide “Caching for humans” (seems like it’s fairly unknown though)

Some info from the project page:

Installation

pip install cache

Usage:

import pylibmc
from cache import Cache
from time import sleep

backend = pylibmc.Client(["127.0.0.1"])

cache = Cache(backend)

@cache("mykey")
def some_expensive_method():
    sleep(10)
    return 42

# writes 42 to the cache
some_expensive_method()

# reads 42 from the cache
some_expensive_method()

# re-calculates and writes 42 to the cache
some_expensive_method.refresh()

# get the cached value or throw an error
# (unless default= was passed to @cache(...))
some_expensive_method.cached()

Answer 11

Look at bda.cache http://pypi.python.org/pypi/bda.cache – uses ZCA and is tested with zope and bfg.


Answer 12

keyring is the best python caching library. You can use

keyring.set_password("service","jsonkey",json_res)

json_res= keyring.get_password("service","jsonkey")

keyring.delete_password("service","jsonkey")