Python ElementTree module: How to ignore the namespace of XML files to locate matching elements when using the methods “find” and “findall”

Question: Python ElementTree module: How to ignore the namespace of XML files to locate matching elements when using the methods “find” and “findall”

I want to use the “findall” method of the ElementTree module to locate some elements in a source XML file.

However, the source XML file (test.xml) has a namespace. Here is a truncated part of the file as a sample:

<?xml version="1.0" encoding="iso-8859-1"?>
<XML_HEADER xmlns="http://www.test.com">
    <TYPE>Updates</TYPE>
    <DATE>9/26/2012 10:30:34 AM</DATE>
    <COPYRIGHT_NOTICE>All Rights Reserved.</COPYRIGHT_NOTICE>
    <LICENSE>newlicense.htm</LICENSE>
    <DEAL_LEVEL>
        <PAID_OFF>N</PAID_OFF>
    </DEAL_LEVEL>
</XML_HEADER>

The sample Python code is below:

from xml.etree import ElementTree as ET
tree = ET.parse(r"test.xml")
el1 = tree.findall("DEAL_LEVEL/PAID_OFF") # returns [] (no match)
el2 = tree.findall("{http://www.test.com}DEAL_LEVEL/{http://www.test.com}PAID_OFF") # returns [<Element '{http://www.test.com}PAID_OFF' at 0xb78b90>]

Although this works, it is very inconvenient to add the namespace “{http://www.test.com}” in front of each tag.

How can I ignore the namespace when using methods such as “find” and “findall”?


Answer 0

Instead of modifying the XML document itself, it’s best to parse it and then modify the tags in the result. This way you can handle multiple namespaces and namespace aliases:

from io import StringIO  # for Python 2 import from StringIO instead
import xml.etree.ElementTree as ET

# instead of ET.fromstring(xml)
it = ET.iterparse(StringIO(xml))
for _, el in it:
    prefix, has_namespace, postfix = el.tag.partition('}')
    if has_namespace:
        el.tag = postfix  # strip all namespaces
root = it.root

This is based on the discussion here: http://bugs.python.org/issue18304

Update: rpartition instead of partition makes sure you get the tag name in postfix even if there is no namespace. Thus you could condense it:

for _, el in it:
    _, _, el.tag = el.tag.rpartition('}') # strip ns
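
As a rough end-to-end sketch of the approach above, the snippet below parses a small document modeled on the question's test.xml, strips the namespace from every tag, and then runs the plain findall() from the question. The XML string and the helper name parse_without_namespaces are assumptions made for illustration, not part of the original answer.

```python
from io import StringIO
import xml.etree.ElementTree as ET

# Sample document modeled on the question's test.xml (assumed content).
XML = """<XML_HEADER xmlns="http://www.test.com">
    <TYPE>Updates</TYPE>
    <DEAL_LEVEL>
        <PAID_OFF>N</PAID_OFF>
    </DEAL_LEVEL>
</XML_HEADER>"""


def parse_without_namespaces(xml_text):
    """Parse xml_text and drop the '{uri}' prefix from every element tag."""
    it = ET.iterparse(StringIO(xml_text))
    for _, el in it:
        # rpartition leaves the tag unchanged when it contains no '}'
        _, _, el.tag = el.tag.rpartition('}')
    return it.root


root = parse_without_namespaces(XML)
paid_off = root.findall("DEAL_LEVEL/PAID_OFF")
print([el.text for el in paid_off])  # ['N']
```

Because the tags are rewritten after parsing, the namespace information is lost in the resulting tree, which matters if the tree is later written back out.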

Answer 1

If you remove the xmlns attribute from the xml before parsing it then there won’t be a namespace prepended to each tag in the tree.

import re

xmlstring = re.sub(' xmlns="[^"]+"', '', xmlstring, count=1)
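
A minimal sketch of how this might be used together with ET.fromstring(), assuming a sample document modeled on the question's test.xml. Note that this regular expression only removes a double-quoted default xmlns declaration; prefixed declarations such as xmlns:foo are left alone.

```python
import re
import xml.etree.ElementTree as ET

# Sample document modeled on the question's test.xml (assumed content).
xmlstring = """<XML_HEADER xmlns="http://www.test.com">
    <DEAL_LEVEL>
        <PAID_OFF>N</PAID_OFF>
    </DEAL_LEVEL>
</XML_HEADER>"""

# Remove only the first default-namespace declaration, as in the answer.
stripped = re.sub(' xmlns="[^"]+"', '', xmlstring, count=1)

root = ET.fromstring(stripped)
print(root.tag)                               # XML_HEADER
print(root.find("DEAL_LEVEL/PAID_OFF").text)  # N
```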

Answer 2

The answers so far explicitly put the namespace value in the script. For a more generic solution, I would rather extract the namespace from the XML:

import re
def get_namespace(element):
    # Return the '{uri}' prefix of the element's tag, or '' if it has none
    m = re.match(r'\{.*\}', element.tag)
    return m.group(0) if m else ''

And use it in the find method:

namespace = get_namespace(tree.getroot())
print(tree.find('./{0}parent/{0}version'.format(namespace)).text)

Answer 3

Here’s an extension to nonagon’s answer, which also strips namespaces off attributes:

from io import StringIO  # on Python 2: from StringIO import StringIO
import xml.etree.ElementTree as ET

# instead of ET.fromstring(xml)
it = ET.iterparse(StringIO(xml))
for _, el in it:
    if '}' in el.tag:
        el.tag = el.tag.split('}', 1)[1]  # strip all namespaces
    for at in list(el.attrib.keys()): # strip namespaces of attributes too
        if '}' in at:
            newat = at.split('}', 1)[1]
            el.attrib[newat] = el.attrib[at]
            del el.attrib[at]
root = it.root

UPDATE: added list() so the iterator works (needed for Python 3)
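
To see the attribute handling in action, here is a small self-contained sketch. The sample document, its namespace URIs, and the helper name strip_namespaces are made up for illustration.

```python
from io import StringIO
import xml.etree.ElementTree as ET

# Hypothetical document with a default namespace and a prefixed attribute.
XML = (
    '<root xmlns="http://www.test.com" xmlns:x="http://example.com/x">'
    '<item x:id="1">N</item>'
    '</root>'
)


def strip_namespaces(xml_text):
    """Parse xml_text, removing '{uri}' from element tags and attribute names."""
    it = ET.iterparse(StringIO(xml_text))
    for _, el in it:
        _, _, el.tag = el.tag.rpartition('}')
        # Copy the keys first, because the dict is modified inside the loop
        for name in list(el.attrib):
            if '}' in name:
                el.attrib[name.rpartition('}')[2]] = el.attrib.pop(name)
    return it.root


root = strip_namespaces(XML)
item = root.find("item")
print(item.tag, item.attrib)  # item {'id': '1'}
```

If two attributes differ only by namespace, the later one overwrites the earlier one after stripping, so this is only safe when local attribute names are unique.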


Answer 4

Improving on the answer by ericspod:

Instead of changing the parse mode globally we can wrap this in an object supporting the with construct.

from xml.parsers import expat

class DisableXmlNamespaces:
    def __enter__(self):
        self.oldcreate = expat.ParserCreate
        expat.ParserCreate = lambda encoding, sep: self.oldcreate(encoding, None)
    def __exit__(self, type, value, traceback):
        expat.ParserCreate = self.oldcreate

This can then be used as follows:

import xml.etree.ElementTree as ET
with DisableXmlNamespaces():
    tree = ET.parse("test.xml")

The beauty of this way is that it does not change any behaviour for unrelated code outside the with block. I ended up creating this after getting errors in unrelated libraries after using the version by ericspod which also happened to use expat.


Answer 5

You can use the elegant string formatting construct as well:

ns='http://www.test.com'
el2 = tree.findall("{%s}DEAL_LEVEL/{%s}PAID_OFF" %(ns,ns))

Or, if you’re sure that PAID_OFF appears at only one level of the tree:

el2 = tree.findall(".//{%s}PAID_OFF" % ns)
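
A related option, sketched here as an alternative rather than as part of the answer above, is the namespaces argument that find() and findall() accept: it maps a short prefix to the full URI so the URI does not have to be spelled out in every path step. The prefix "t" and the inline sample document are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Inline sample modeled on the question's test.xml (assumed content).
root = ET.fromstring(
    '<XML_HEADER xmlns="http://www.test.com">'
    '<DEAL_LEVEL><PAID_OFF>N</PAID_OFF></DEAL_LEVEL>'
    '</XML_HEADER>'
)

# "t" is an arbitrary prefix chosen for this sketch; only the URI has to match.
ns = {"t": "http://www.test.com"}
matches = root.findall("t:DEAL_LEVEL/t:PAID_OFF", namespaces=ns)
print([el.text for el in matches])  # ['N']
```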

Answer 6

If you’re using ElementTree and not cElementTree you can force Expat to ignore namespace processing by replacing ParserCreate():

from xml.parsers import expat
oldcreate = expat.ParserCreate
expat.ParserCreate = lambda encoding, sep: oldcreate(encoding, None)

ElementTree uses Expat by calling ParserCreate() but offers no option to omit the namespace separator string. The code above causes the separator to be ignored, but be warned that this could break other things.


Answer 7

I might be late to this, but I don’t think re.sub is a good solution.

However, overriding xml.parsers.expat does not work on Python 3.x.

The main culprit is xml/etree/ElementTree.py; see the bottom of its source code:

# Import the C accelerators
try:
    # Element is going to be shadowed by the C implementation. We need to keep
    # the Python version of it accessible for some "creative" by external code
    # (see tests)
    _Element_Py = Element

    # Element, SubElement, ParseError, TreeBuilder, XMLParser
    from _elementtree import *
except ImportError:
    pass

Which is kinda sad.

The solution is to get rid of it first.

import _elementtree
try:
    del _elementtree.XMLParser
except AttributeError:
    # in case deleted twice
    pass
else:
    from xml.parsers import expat  # NOQA: F811
    oldcreate = expat.ParserCreate
    expat.ParserCreate = lambda encoding, sep: oldcreate(encoding, None)

Tested on Python 3.6.

The try statement is useful in case you reload or import a module twice somewhere in your code and get strange errors such as

  • maximum recursion depth exceeded
  • AttributeError: XMLParser

By the way, the etree source code looks really messy.


Answer 8

Let’s combine nonagon’s answer with mzjn’s answer to a related question:

from pathlib import Path
from typing import Dict, Tuple
import xml.etree.ElementTree as ET

def parse_xml(xml_path: Path) -> Tuple[ET.Element, Dict[str, str]]:
    xml_iter = ET.iterparse(xml_path, events=["start-ns"])
    xml_namespaces = dict(prefix_namespace_pair for _, prefix_namespace_pair in xml_iter)
    return xml_iter.root, xml_namespaces

Using this function we:

  1. Create an iterator to get both namespaces and a parsed tree object.

  2. Iterate over the created iterator to get the namespaces dict that we can later pass to each find() or findall() call, as suggested by iMom0.

  3. Return the parsed tree’s root element object and namespaces.

I think this is the best approach all around, as it involves no manipulation of either the source XML or the resulting parsed xml.etree.ElementTree output.

I’d also like to credit barny’s answer with providing an essential piece of this puzzle (that you can get the parsed root from the iterator). Until then I actually traversed the XML tree twice in my application (once to get the namespaces, and a second time for the root).
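
A possible way to use this function is sketched below. It writes a sample document (modeled on the question's test.xml) to a temporary file, collects the namespaces, and passes them to findall(). The default namespace is reported under the empty prefix, so this sketch renames it to "default"; that name is an arbitrary choice made here, not something the answer prescribes.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict, Tuple


def parse_xml(xml_path: Path) -> Tuple[ET.Element, Dict[str, str]]:
    xml_iter = ET.iterparse(xml_path, events=["start-ns"])
    xml_namespaces = dict(pair for _, pair in xml_iter)
    return xml_iter.root, xml_namespaces


# Sample document modeled on the question's test.xml (assumed content).
SAMPLE = (
    '<XML_HEADER xmlns="http://www.test.com">'
    '<DEAL_LEVEL><PAID_OFF>N</PAID_OFF></DEAL_LEVEL>'
    '</XML_HEADER>'
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "test.xml"
    path.write_text(SAMPLE, encoding="utf-8")
    root, namespaces = parse_xml(path)

print(namespaces)  # {'': 'http://www.test.com'}

# Give the default namespace a usable prefix ("default" is an arbitrary name).
ns = {(prefix or "default"): uri for prefix, uri in namespaces.items()}
matches = root.findall("default:DEAL_LEVEL/default:PAID_OFF", ns)
print([el.text for el in matches])  # ['N']
```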


Python speed testing - Time Difference - milliseconds

Question: Python speed testing - Time Difference - milliseconds

What is the proper way to compare 2 times in Python in order to speed test a section of code? I tried reading the API docs. I’m not sure I understand the timedelta thing.

So far I have this code:

from datetime import datetime

tstart = datetime.now()
print(tstart)

# code to speed test

tend = datetime.now()
print(tend)
# what am I missing?
# I'd like to print the time diff here

Answer 0

datetime.timedelta is just the difference between two datetimes … so it’s like a period of time, in days / seconds / microseconds

>>> import datetime
>>> a = datetime.datetime.now()
>>> b = datetime.datetime.now()
>>> c = b - a

>>> c
datetime.timedelta(0, 4, 316543)
>>> c.days
0
>>> c.seconds
4
>>> c.microseconds
316543

Be aware that c.microseconds only returns the microseconds portion of the timedelta! For timing purposes always use c.total_seconds().

You can do all sorts of maths with datetime.timedelta, eg:

>>> c / 10
datetime.timedelta(0, 0, 431654)

It might be more useful to look at CPU time instead of wall-clock time, though that is operating-system dependent. On Unix-like systems, check out the ‘time’ command.
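
The difference between microseconds and total_seconds() is easy to miss, so here is a small sketch using a fixed timedelta that mirrors the value in the interpreter session above (an assumed value, not a real measurement):

```python
from datetime import timedelta

# Fixed value mirroring the session above (assumed, not measured).
c = timedelta(seconds=4, microseconds=316543)

print(c.microseconds)             # 316543: only the sub-second part
print(c.total_seconds())          # 4.316543: the whole duration in seconds
print(c.total_seconds() * 1000)   # the whole duration in milliseconds
```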


Answer 1

Since Python 2.7 there’s the timedelta.total_seconds() method. So, to get the elapsed milliseconds:

>>> import datetime
>>> a = datetime.datetime.now()
>>> b = datetime.datetime.now()
>>> delta = b - a
>>> print(delta)
0:00:05.077263
>>> int(delta.total_seconds() * 1000) # milliseconds
5077

Answer 2

You might want to use the timeit module instead.
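
A minimal sketch of what that could look like, assuming a placeholder function work() that stands in for the code being measured:

```python
import timeit


def work():
    # Hypothetical stand-in for the code under test
    return sum(i * i for i in range(1000))


runs = 1000
total = timeit.timeit(work, number=runs)  # total seconds for all runs
print(f"{total / runs * 1000:.4f} ms per call")
```

timeit runs the callable many times and reports the total, which smooths out the noise a single start/stop measurement would show.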


Answer 3

You could also use:

import time

start = time.clock()
do_something()
end = time.clock()
print("%.2gs" % (end - start))

Or you could use the Python profilers.
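
Note that time.clock() was deprecated in Python 3.3 and removed in Python 3.8. On current Python versions, time.perf_counter() (wall-clock) and time.process_time() (CPU time) are the usual replacements. A rough sketch, with do_something() as a hypothetical placeholder workload:

```python
import time


def do_something():
    # Hypothetical workload standing in for the code being timed
    return sum(range(100_000))


wall_start = time.perf_counter()
cpu_start = time.process_time()
do_something()
wall_ms = (time.perf_counter() - wall_start) * 1000
cpu_ms = (time.process_time() - cpu_start) * 1000

print(f"wall: {wall_ms:.3f} ms, cpu: {cpu_ms:.3f} ms")
```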


Answer 4

I know this is late, but I actually really like using:

import time
start = time.time()

##### your timed code here ... #####

print("Process time: " + str(time.time() - start))

time.time() gives you seconds since the epoch. Because this is a standardized time in seconds, you can simply subtract the start time from the end time to get the process time (in seconds). time.clock() is good for benchmarking, but I have found it kind of useless if you want to know how long your process took. For example, it’s much more intuitive to say “my process takes 10 seconds” than it is to say “my process takes 10 processor clock units”.

>>> start = time.time(); sum([each**8.3 for each in range(1,100000)]) ; print (time.time() - start)
3.4001404476250935e+45
0.0637760162354
>>> start = time.clock(); sum([each**8.3 for each in range(1,100000)]) ; print (time.clock() - start)
3.4001404476250935e+45
0.05

In the first example above, you are shown a time of 0.05 for time.clock() vs 0.06377 for time.time()

>>> start = time.clock(); time.sleep(1) ; print("process time: " + str(time.clock() - start))
process time: 0.0
>>> start = time.time(); time.sleep(1) ; print("process time: " + str(time.time() - start))
process time: 1.00111794472

In the second example, somehow the processor time shows “0” even though the process slept for a second. time.time() correctly shows a little more than 1 second.


Answer 5

The following code should display the time delta:

from datetime import datetime

tstart = datetime.now()

# code to speed test

tend = datetime.now()
print(tend - tstart)

Answer 6

You could simply print the difference:

print(tend - tstart)

Answer 7

I am not a Python programmer, but I do know how to use Google and here’s what I found: you use the “-” operator. To complete your code:

from datetime import datetime

tstart = datetime.now()

# code to speed test

tend = datetime.now()
print(tend - tstart)

Additionally, note that strftime() formats datetime objects, not the difference between them; a timedelta has no strftime() method, so format the result with str() or total_seconds() however you like.


Answer 8

time.time() / datetime is good for quick use, but is not always 100% precise. For that reason, I like to use one of the std lib profilers (especially hotshot) to find out what’s what.


Answer 9

You may want to look into the profile modules. You’ll get a better read out of where your slowdowns are, and much of your work will be full-on automated.
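
As one possible illustration, the standard-library cProfile and pstats modules can be combined as below. The function slow_sum() is a made-up workload used only to give the profiler something to measure.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Hypothetical workload used only to give the profiler something to measure
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Collect the report in a string buffer, sorted by cumulative time
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report).sort_stats("cumulative")
stats.print_stats(5)  # keep only the five most expensive entries
print(report.getvalue())
```

The report lists call counts and per-function times, which usually points at the slow spot faster than manual timestamps do.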


Answer 10

Here is a custom function that mimics Matlab’s/Octave’s tic toc functions.

Example of use:

time_var = time_me()  # get a variable with the current timestamp

... run operation ...

time_me(time_var)  # print the time difference (e.g. '5 seconds 821.12314 ms')

Function:

import math
import time

def time_me(*arg):
    if len(arg) != 0:
        # Called with a start timestamp: print the time elapsed since then
        elapsed_time = time.time() - arg[0]
        hours = math.floor(elapsed_time / (60 * 60))
        elapsed_time = elapsed_time - hours * (60 * 60)
        minutes = math.floor(elapsed_time / 60)
        elapsed_time = elapsed_time - minutes * 60
        seconds = math.floor(elapsed_time)
        elapsed_time = elapsed_time - seconds
        ms = elapsed_time * 1000
        if hours != 0:
            print("%d hours %d minutes %d seconds" % (hours, minutes, seconds))
        elif minutes != 0:
            print("%d minutes %d seconds" % (minutes, seconds))
        else:
            print("%d seconds %f ms" % (seconds, ms))
    else:
        # Called without arguments: return the current timestamp
        return time.time()

Answer 11

You could use timeit like this to test a script named module.py:

$ python -mtimeit -s 'import module'

Answer 12

Arrow: Better dates & times for Python

import arrow
start_time = arrow.utcnow()
end_time = arrow.utcnow()
(end_time - start_time).total_seconds()  # seconds
(end_time - start_time).total_seconds() * 1000  # milliseconds

How to normalize a NumPy array to within a certain range?

Question: How to normalize a NumPy array to within a certain range?

After doing some processing on an audio or image array, it needs to be normalized within a range before it can be written back to a file. This can be done like so:

# Normalize audio channels to between -1.0 and +1.0
audio[:,0] = audio[:,0]/abs(audio[:,0]).max()
audio[:,1] = audio[:,1]/abs(audio[:,1]).max()

# Normalize image to between 0 and 255
image = image/(image.max()/255.0)

Is there a less verbose, more convenient function for doing this? matplotlib.colors.Normalize() doesn’t seem to be related.


Answer 0

audio /= np.max(np.abs(audio),axis=0)
image *= (255.0/image.max())


Using /= and *= allows you to eliminate an intermediate temporary array, thus saving some memory. Multiplication is less expensive than division, so

image *= 255.0/image.max()    # Uses 1 division and image.size multiplications

is marginally faster than

image /= image.max()/255.0    # Uses 1+image.size divisions

Since we are using basic numpy methods here, I think this is about as efficient a solution in numpy as can be.


In-place operations do not change the dtype of the container array. Since the desired normalized values are floats, the audio and image arrays need to have a floating-point dtype before the in-place operations are performed. If they are not already of floating-point dtype, you’ll need to convert them using astype. For example,

image = image.astype('float64')
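
Putting the pieces together, a small sketch with made-up sample arrays (the values below are assumptions chosen so the results are easy to check by eye):

```python
import numpy as np

# Made-up stereo audio (rows are samples, columns are channels).
audio = np.array([[1.0, -4.0], [-2.0, 2.0]])
# Made-up integer image; it must be converted before in-place float math.
image = np.array([[0, 50], [100, 200]], dtype=np.uint8)

# Scale each channel by its own peak so values fall within [-1.0, +1.0]
audio /= np.max(np.abs(audio), axis=0)

# Convert to float first, then stretch so the maximum becomes 255
image = image.astype('float64')
image *= 255.0 / image.max()

print(audio)
print(image)
```

Skipping the astype() call would make the in-place multiplication fail on the uint8 array, because the float result cannot be stored back into an integer array.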

Answer 1

If the array contains both positive and negative data, I’d go with:

import numpy as np

a = np.random.rand(3,2)

# Normalised [0,1]
b = (a - np.min(a))/np.ptp(a)

# Normalised [0,255] as integer: don't forget the parentheses before astype(int)
c = (255*(a - np.min(a))/np.ptp(a)).astype(int)        

# Normalised [-1,1]
d = 2.*(a - np.min(a))/np.ptp(a)-1

If the array contains nan, one solution could be to just remove them as:

def nan_ptp(a):
    return np.ptp(a[np.isfinite(a)])

b = (a - np.nanmin(a))/nan_ptp(a)

However, depending on the context, you might want to treat nan differently, e.g. interpolate the value, replace it with a constant such as 0, or raise an error.

Finally, worth mentioning even if it’s not OP’s question, standardization:

e = (a - np.mean(a)) / np.std(a)

Answer 2

You can also rescale using sklearn. The advantages are that you can normalize the standard deviation in addition to mean-centering the data, and that you can do this along either axis, by features or by records.

from sklearn.preprocessing import scale
X = scale( X, axis=0, with_mean=True, with_std=True, copy=True )

The keyword arguments axis, with_mean, and with_std are self-explanatory and are shown in their default state. The argument copy performs the operation in-place if it is set to False. See the scikit-learn documentation for sklearn.preprocessing.scale.


Answer 3

You can use the in-place (“i”, as in idiv, imul) versions of the operators, and it doesn’t look half bad:

image /= (image.max()/255.0)

For the other case you can write a function to normalize a 2-D array by columns:

def normalize_columns(arr):
    rows, cols = arr.shape
    for col in range(cols):  # use xrange on Python 2
        arr[:,col] /= abs(arr[:,col]).max()

Answer 4

You are trying to min-max scale the values of audio between -1 and +1 and image between 0 and 255.

Using sklearn.preprocessing.minmax_scale should easily solve your problem.

e.g.:

from sklearn.preprocessing import minmax_scale

audio_scaled = minmax_scale(audio, feature_range=(-1,1))

and

shape = image.shape
image_scaled = minmax_scale(image.ravel(), feature_range=(0,255)).reshape(shape)

Note: this is not to be confused with the operation that scales the norm (length) of a vector to a certain value (usually 1), which is also commonly referred to as normalization.


Answer 5

A simple solution is using the scalers offered by the sklearn.preprocessing library.

import sklearn.preprocessing as sk
scaler = sk.MinMaxScaler(feature_range=(0, 250))
scaler = scaler.fit(X)
X_scaled = scaler.transform(X)
# Checking reconstruction
X_rec = scaler.inverse_transform(X_scaled)

The error X_rec-X will be zero (up to floating-point rounding). You can adjust the feature_range for your needs, or even use a standard scaler, sk.StandardScaler().


Answer 6

I tried following this, and got the error

TypeError: ufunc 'true_divide' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule ''same_kind''

The numpy array I was trying to normalize was an integer array. It seems that NumPy versions after 1.10 no longer perform this type cast implicitly for in-place operations, so you have to use numpy.true_divide() to resolve that.

import numpy as np
arr = np.array(img)
arr = np.true_divide(arr,[255.0],out=None)

img was a PIL.Image object.


Display a float with two decimal places in Python

Question: Display a float with two decimal places in Python

I have a function taking float arguments (generally integers or decimals with one significant digit), and I need to output the values in a string with two decimal places (5 -> 5.00, 5.5 -> 5.50, etc). How can I do this in Python?


Answer 0

You could use the string formatting operator for that:

>>> '%.2f' % 1.234
'1.23'
>>> '%.2f' % 5.0
'5.00'

The result of the operator is a string, so you can store it in a variable, print etc.


Answer 1

Since this post might be here for a while, let’s also point out the Python 3 syntax:

"{:.2f}".format(5)

Answer 2

f-string formatting:

This was new in Python 3.6. The string is placed in quotation marks as usual, prefixed with f in the same way you would prefix a raw string with r. You then place whatever you want to insert (variables, numbers, expressions) inside braces, as in f'some string text with a {variable} or {number} within that text', and Python evaluates it as with previous string formatting methods, except that this method is much more readable.

>>> foobar = 3.141592
>>> print(f'My number is {foobar:.2f} - look at the nice rounding!')

My number is 3.14 - look at the nice rounding!

You can see in this example we format with decimal places in similar fashion to previous string formatting methods.

NB foobar can be a number, variable, or even an expression, e.g. f'{3*my_func(3.14):.2f}'.

Going forward, with new code I prefer f-strings over common %s or str.format() methods as f-strings can be far more readable, and are often much faster.
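
For comparison, here is a short sketch showing that the formatting styles mentioned in this thread produce the same string for the same format specification. The variable names are chosen only for this example.

```python
value = 5.5

percent_style = '%.2f' % value          # printf-style operator
format_method = '{:.2f}'.format(value)  # str.format()
builtin_format = format(value, '.2f')   # format() built-in
f_string = f'{value:.2f}'               # f-string (Python 3.6+)

print(percent_style, format_method, builtin_format, f_string)
# 5.50 5.50 5.50 5.50
```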


Answer 3

String formatting:

print("%.2f" % 5)

Answer 4

Using python string formatting.

>>> "%0.2f" % 3
'3.00'

Answer 5

String Formatting:

a = 6.789809823
print('%.2f' %a)

OR

print ("{0:.2f}".format(a)) 

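The round() function can also be used:

print(round(a, 2))

The good thing about round() is that we can store the result in another variable and use it for other purposes. Note that round() returns a number rather than a string, so round(5.0, 2) prints as 5.0, not 5.00.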

b = round(a, 2)
print(b)

Answer 6

Shortest Python 3 syntax:

n = 5
print(f'{n:.2f}')

Answer 7

If you want the formatted value itself as a string, rather than only printing it, use format():

Format it to 2 decimal places:

format(value, '.2f')

example:

>>> format(5.00000, '.2f')
'5.00'

Answer 8

I know it is an old question, but I was struggling to find the answer myself. Here is what I came up with:

Python 3:

>>> num_dict = {'num': 0.123, 'num2': 0.127}
>>> "{0[num]:.2f}_{0[num2]:.2f}".format(num_dict)
'0.12_0.13'

Answer 9

Using Python 3 syntax:

print('%.2f' % number)

Answer 10

If you want to get a floating point value limited to two decimal places at the time of calling input, try this:

a = float(format(float(input()), '.2f'))   # if you enter 3.1415 for 'a'
print(a)                                   # 3.14 will be printed

Difference between Python datetime vs time modules

Question: Difference between Python datetime vs time modules

I am trying to figure out the differences between the datetime and time modules, and what each should be used for.

I know that datetime provides both dates and time. What is the use of the time module?

Examples would be appreciated and differences concerning timezones would especially be of interest.


Answer 0

The time module is principally for working with Unix time stamps, expressed as a floating point number taken to be seconds since the Unix epoch. The datetime module can support many of the same operations, but provides a more object-oriented set of types, and also has some limited support for time zones.
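
To make the contrast concrete, here is a small sketch using only the standard library (the variable names are illustrative). It takes a Unix time stamp from the time module, turns it into a timezone-aware datetime object, and converts it back:

```python
import time
from datetime import datetime, timezone

# time module: a plain float, seconds since the Unix epoch
stamp = time.time()

# datetime module: an object with named fields; passing tz=timezone.utc
# makes the result timezone-aware instead of naive local time
moment = datetime.fromtimestamp(stamp, tz=timezone.utc)
print(moment.year, moment.month, moment.day)
print(moment.isoformat())

# An aware datetime converts back to the same time stamp
# (up to microsecond rounding)
round_trip = moment.timestamp()
print(abs(round_trip - stamp) < 1e-3)
```

The float is convenient for measuring and storing instants, while the datetime object is easier to inspect, format, and do calendar arithmetic with.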


Answer 1

Stick to time to prevent DST ambiguity.

Use exclusively the system time module instead of the datetime module to prevent ambiguity issues with daylight savings time (DST).

Conversion to any time format, including local time, is pretty easy:

import time
t = time.time()

time.strftime('%Y-%m-%d %H:%M %Z', time.localtime(t))
'2019-05-27 12:03 CEST'

time.strftime('%Y-%m-%d %H:%M %Z', time.gmtime(t))
'2019-05-27 10:03 GMT'

time.time() is a floating point number representing the time in seconds since the system epoch. time.time() is ideal for unambiguous time stamping.

If the system additionally runs the network time protocol (NTP) dæmon, one ends up with a pretty solid time base.

Here is the documentation of the time module.


Answer 2

The time module can be used when you just need the time of a particular record. For example, if you have a separate table/file for the transactions of each day, then you would just need the time. However, the time datatype is usually used to store the time difference between two points in time.

This can also be done using datetime, but if we are only dealing with time for a particular day, then the time module can be used.

Datetime is used to store a particular date and time for a record, as in a rental agency, where the due date would be a datetime datatype.


Answer 3

If you are interested in timezones, you should consider the use of pytz.


NameError: global name 'unicode' is not defined - in Python 3

Question: NameError: global name 'unicode' is not defined - in Python 3

I am trying to use a Python package called bidi. In a module in this package (algorithm.py) there are some lines that give me error, although it is part of the package.

Here are the lines:

# utf-8 ? we need unicode
if isinstance(unicode_or_str, unicode):
    text = unicode_or_str
    decoded = False
else:
    text = unicode_or_str.decode(encoding)
    decoded = True

and here is the error message:

Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    bidi_text = get_display(reshaped_text)
  File "C:\Python33\lib\site-packages\python_bidi-0.3.4-py3.3.egg\bidi\algorithm.py",   line 602, in get_display
    if isinstance(unicode_or_str, unicode):
NameError: global name 'unicode' is not defined

How should I re-write this part of the code so it works in Python 3? Also, if anyone has used the bidi package with Python 3, please let me know whether you have found similar problems. I appreciate your help.


Answer 0

Python 3 renamed the unicode type to str; the old str type has been replaced by bytes.

if isinstance(unicode_or_str, str):
    text = unicode_or_str
    decoded = False
else:
    text = unicode_or_str.decode(encoding)
    decoded = True

You may want to read the Python 3 porting HOWTO for more such details. There is also Lennart Regebro’s Porting to Python 3: An in-depth guide, free online.

Last but not least, you could just try to use the 2to3 tool to see how that translates the code for you.
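
As a self-contained illustration of the str/bytes split in Python 3, the snippet from the question can be wrapped in a small function. This is only a sketch: the helper name to_text and the default encoding are assumptions made for the example, not part of the bidi package.

```python
def to_text(unicode_or_str, encoding='utf-8'):
    """Return (text, decoded) for a str or bytes input (hypothetical helper)."""
    if isinstance(unicode_or_str, str):
        # Already text in Python 3: nothing to decode
        return unicode_or_str, False
    # bytes (the Python 3 counterpart of the old str) must be decoded
    return unicode_or_str.decode(encoding), True


print(to_text('abc'))
print(to_text('café'.encode('utf-8')))
```

Here the first call returns the text unchanged with decoded set to False, and the second decodes the bytes and reports True.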


Answer 1

If, like me, you need the script to keep working on both Python 2 and 3, this might help:

import sys
if sys.version_info[0] >= 3:
    unicode = str

and can then just do for example

foo = unicode.lower(foo)

Answer 2

You can use the six library to support both Python 2 and 3:

import six
if isinstance(value, six.string_types):
    handle_string(value)

Answer 3

Assuming you are using Python 3: str is Unicode by default, so replace the unicode type with str.

if isinstance(unicode_or_str, str):    # unicode replaced with str
    text = unicode_or_str
    decoded = False

How to update SQLAlchemy row entry?

Question: How to update SQLAlchemy row entry?

Assume table has three columns: username, password and no_of_logins.

When user tries to login, it’s checked for an entry with a query like

user = User.query.filter_by(username=form.username.data).first()

If password matches, he proceeds further. What I would like to do is count how many times the user logged in. Thus whenever he successfully logs in, I would like to increment the no_of_logins field and store it back to the user table. I’m not sure how to run update query with SqlAlchemy.


Answer 0

user.no_of_logins += 1
session.commit()

Answer 1

There are several ways to UPDATE using sqlalchemy:

1) user.no_of_logins += 1
   session.commit()

2) session.query(User).\
       filter(User.username == form.username.data).\
       update({"no_of_logins": (User.no_of_logins + 1)})
   session.commit()

3) conn = engine.connect()
   # update() is a Table method; for a declarative model use its __table__
   stmt = User.__table__.update().\
       values(no_of_logins=(User.no_of_logins + 1)).\
       where(User.username == form.username.data)
   conn.execute(stmt)

4) setattr(user, 'no_of_logins', user.no_of_logins + 1)
   session.commit()

Answer 2

Examples to clarify the important issue in accepted answer’s comments

I didn’t understand it until I played around with it myself, so I figured there would be others who were confused as well. Say you are working on the user whose id == 6 and whose no_of_logins == 30 when you start.

# 1 (bad)
user.no_of_logins += 1
# result: UPDATE user SET no_of_logins = 31 WHERE user.id = 6

# 2 (bad)
user.no_of_logins = user.no_of_logins + 1
# result: UPDATE user SET no_of_logins = 31 WHERE user.id = 6

# 3 (bad)
setattr(user, 'no_of_logins', user.no_of_logins + 1)
# result: UPDATE user SET no_of_logins = 31 WHERE user.id = 6

# 4 (ok)
user.no_of_logins = User.no_of_logins + 1
# result: UPDATE user SET no_of_logins = no_of_logins + 1 WHERE user.id = 6

# 5 (ok)
setattr(user, 'no_of_logins', User.no_of_logins + 1)
# result: UPDATE user SET no_of_logins = no_of_logins + 1 WHERE user.id = 6

The point

By referencing the class instead of the instance, you can get SQLAlchemy to be smarter about incrementing, getting it to happen on the database side instead of the Python side. Doing it within the database is better since it’s less vulnerable to data corruption (e.g. two clients attempt to increment at the same time with a net result of only one increment instead of two). I assume it’s possible to do the incrementing in Python if you set locks or bump up the isolation level, but why bother if you don’t have to?

A caveat

If you are going to increment twice via code that produces SQL like SET no_of_logins = no_of_logins + 1, then you will need to commit or at least flush in between increments, or else you will only get one increment in total:

# 6 (bad)
user.no_of_logins = User.no_of_logins + 1
user.no_of_logins = User.no_of_logins + 1
session.commit()
# result: UPDATE user SET no_of_logins = no_of_logins + 1 WHERE user.id = 6

# 7 (ok)
user.no_of_logins = User.no_of_logins + 1
session.flush()
# result: UPDATE user SET no_of_logins = no_of_logins + 1 WHERE user.id = 6
user.no_of_logins = User.no_of_logins + 1
session.commit()
# result: UPDATE user SET no_of_logins = no_of_logins + 1 WHERE user.id = 6
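
The lost-update problem described above can be reproduced without SQLAlchemy. The sketch below uses the standard-library sqlite3 module with an in-memory database (a stand-in chosen for the example, not the setup from the question) and issues the two kinds of UPDATE statements by hand. The two "clients" are simulated sequentially on one connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, no_of_logins INTEGER)")
conn.execute("INSERT INTO user (id, no_of_logins) VALUES (6, 30)")


def current_logins():
    """Read the counter for the user with id 6."""
    return conn.execute(
        "SELECT no_of_logins FROM user WHERE id = 6").fetchone()[0]


# Python-side increment: both clients read 30, then each writes 31
seen_by_a = current_logins()
seen_by_b = current_logins()
conn.execute("UPDATE user SET no_of_logins = ? WHERE id = 6", (seen_by_a + 1,))
conn.execute("UPDATE user SET no_of_logins = ? WHERE id = 6", (seen_by_b + 1,))
python_side = current_logins()  # one of the two increments is lost

# Database-side increment: each statement adds to whatever is stored
conn.execute("UPDATE user SET no_of_logins = 30 WHERE id = 6")  # reset
conn.execute("UPDATE user SET no_of_logins = no_of_logins + 1 WHERE id = 6")
conn.execute("UPDATE user SET no_of_logins = no_of_logins + 1 WHERE id = 6")
db_side = current_logins()

print(python_side, db_side)  # prints: 31 32
conn.close()
```

The first pair of statements corresponds to the "bad" cases, where the new value is computed in Python; the second pair corresponds to the "ok" cases, where the arithmetic happens inside the UPDATE itself.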

Answer 3

With the statement user = User.query.filter_by(username=form.username.data).first() you will get the specified user in the user variable.

Now you can change the value of the object's attribute, e.g. user.no_of_logins += 1, and save the changes with the session's commit method.


Answer 4

I wrote a Telegram bot and had some problems updating rows. Use this example if you have a model:

def update_state(chat_id, state):
    try:
        value = Users.query.filter(Users.chat_id == str(chat_id)).first()
        value.state = str(state)
        db.session.flush()
        db.session.commit()
        #db.session.close()
    except Exception:
        print('Error in def update_state')

Why use db.session.flush()? See the question "SQLAlchemy: What's the difference between flush() and commit()?"


What is the most efficient way to create a dictionary of two pandas Dataframe columns?

Question: What is the most efficient way to create a dictionary of two pandas Dataframe columns?

What is the most efficient way to organise the following pandas Dataframe:

data =

Position    Letter
1           a
2           b
3           c
4           d
5           e

into a dictionary like alphabet = {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}?


Answer 0

In [9]: pd.Series(df.Letter.values,index=df.Position).to_dict()
Out[9]: {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}

Speed comparison (using Wouter's method)

In [6]: df = pd.DataFrame(np.random.randint(0,10,10000).reshape(5000,2),columns=list('AB'))

In [7]: %timeit dict(zip(df.A,df.B))
1000 loops, best of 3: 1.27 ms per loop

In [8]: %timeit pd.Series(df.A.values,index=df.B).to_dict()
1000 loops, best of 3: 987 us per loop

Answer 1

I found a faster way to solve the problem, at least on realistically large datasets using: df.set_index(KEY).to_dict()[VALUE]

Proof on 50,000 rows:

df = pd.DataFrame(np.random.randint(32, 120, 100000).reshape(50000,2),columns=list('AB'))
df['A'] = df['A'].apply(chr)

%timeit dict(zip(df.A,df.B))
%timeit pd.Series(df.A.values,index=df.B).to_dict()
%timeit df.set_index('A').to_dict()['B']

Output:

100 loops, best of 3: 7.04 ms per loop  # WouterOvermeire
100 loops, best of 3: 9.83 ms per loop  # Jeff
100 loops, best of 3: 4.28 ms per loop  # Kikohs (me)
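
For reference, the three approaches timed above can be checked against the small frame from the question. This is a sketch assuming pandas is installed. It only confirms that the three give the same mapping and makes no claim about which is faster on a given pandas version.

```python
import pandas as pd

df = pd.DataFrame({'Position': [1, 2, 3, 4, 5],
                   'Letter': ['a', 'b', 'c', 'd', 'e']})

# 1. Zip the two columns together
via_zip = dict(zip(df['Position'], df['Letter']))

# 2. Build a Series indexed by the key column, then convert it
via_series = pd.Series(df['Letter'].values, index=df['Position']).to_dict()

# 3. Move the key column into the index, convert, and pick the value column
via_set_index = df.set_index('Position').to_dict()['Letter']

print(via_zip == via_series == via_set_index)
```

Timings are best measured on your own data, since the results quoted in this thread differ between pandas and Python versions.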

Answer 2

In Python 3.6 the fastest way is still the WouterOvermeire one. Kikohs’ proposal is slower than the other two options.

import timeit

setup = '''
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(32, 120, 100000).reshape(50000,2),columns=list('AB'))
df['A'] = df['A'].apply(chr)
'''

timeit.Timer('dict(zip(df.A,df.B))', setup=setup).repeat(7,500)
timeit.Timer('pd.Series(df.A.values,index=df.B).to_dict()', setup=setup).repeat(7,500)
timeit.Timer('df.set_index("A").to_dict()["B"]', setup=setup).repeat(7,500)

Results:

1.1214002349999777 s  # WouterOvermeire
1.1922008498571748 s  # Jeff
1.7034366211428602 s  # Kikohs

Answer 3

TL;DR

>>> import pandas as pd
>>> df = pd.DataFrame({'Position':[1,2,3,4,5], 'Letter':['a', 'b', 'c', 'd', 'e']})
>>> dict(sorted(df.values.tolist())) # Sort of sorted... 
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
>>> from collections import OrderedDict
>>> OrderedDict(df.values.tolist())
OrderedDict([('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)])

In Long

Explaining solution: dict(sorted(df.values.tolist()))

Given:

df = pd.DataFrame({'Position':[1,2,3,4,5], 'Letter':['a', 'b', 'c', 'd', 'e']})

[out]:

 Letter Position
0   a   1
1   b   2
2   c   3
3   d   4
4   e   5

Try:

# Get the values out to a 2-D numpy array, 
df.values

[out]:

array([['a', 1],
       ['b', 2],
       ['c', 3],
       ['d', 4],
       ['e', 5]], dtype=object)

Then optionally:

# Dump it into a list so that you can sort it using `sorted()`
sorted(df.values.tolist()) # Sort by key

Or:

# Sort by value:
from operator import itemgetter
sorted(df.values.tolist(), key=itemgetter(1))

[out]:

[['a', 1], ['b', 2], ['c', 3], ['d', 4], ['e', 5]]

Lastly, cast the list of list of 2 elements into a dict.

dict(sorted(df.values.tolist())) 

[out]:

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

Related

Answering @sbradbio comment:

If there are multiple values for a specific key and you would like to keep all of them, this is not the most efficient but the most intuitive way:

from collections import defaultdict
import pandas as pd

multivalue_dict = defaultdict(list)

df = pd.DataFrame({'Position':[1,2,4,4,4], 'Letter':['a', 'b', 'd', 'e', 'f']})

for idx,row in df.iterrows():
    multivalue_dict[row['Position']].append(row['Letter'])

[out]:

>>> print(multivalue_dict)
defaultdict(list, {1: ['a'], 2: ['b'], 4: ['d', 'e', 'f']})
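
As an alternative to the explicit loop, the same grouping can be expressed with groupby. This is offered as a sketch assuming pandas is installed, not as a measured improvement; it returns a plain dict rather than a defaultdict.

```python
import pandas as pd

df = pd.DataFrame({'Position': [1, 2, 4, 4, 4],
                   'Letter': ['a', 'b', 'd', 'e', 'f']})

# Group the letters by position and collect each group into a list
multivalue = df.groupby('Position')['Letter'].apply(list).to_dict()

print(multivalue)
```

Each key maps to the list of letters that share that position, in their original row order.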

Pandas join issue: columns overlap but no suffix specified

Question: Pandas join issue: columns overlap but no suffix specified

I have following 2 data frames:

df_a =

     mukey  DI  PI
0   100000  35  14
1  1000005  44  14
2  1000006  44  14
3  1000007  43  13
4  1000008  43  13

df_b = 
    mukey  niccdcd
0  190236        4
1  190237        6
2  190238        7
3  190239        4
4  190240        7

When I try to join these 2 dataframes:

join_df = df_a.join(df_b,on='mukey',how='left')

I get the error:

*** ValueError: columns overlap but no suffix specified: Index([u'mukey'], dtype='object')

Why is this so? The dataframes do have common ‘mukey’ values.


Answer 0

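The error on the snippet of data you posted is a little cryptic. join matches df_a's mukey column against the index of df_b, not against df_b's mukey column, so both mukey columns end up in the result and pandas requires you to supply a suffix for the left and right-hand side to tell them apart. Because none of the values match here, the right-hand columns come out as NaN: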

In [173]:

df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')
Out[173]:
       mukey_left  DI  PI  mukey_right  niccdcd
index                                          
0          100000  35  14          NaN      NaN
1         1000005  44  14          NaN      NaN
2         1000006  44  14          NaN      NaN
3         1000007  43  13          NaN      NaN
4         1000008  43  13          NaN      NaN

merge works because it doesn’t have this restriction:

In [176]:

df_a.merge(df_b, on='mukey', how='left')
Out[176]:
     mukey  DI  PI  niccdcd
0   100000  35  14      NaN
1  1000005  44  14      NaN
2  1000006  44  14      NaN
3  1000007  43  13      NaN
4  1000008  43  13      NaN

Answer 1

The .join() function uses the index of the DataFrame passed as its argument, so you should call set_index on that DataFrame or use the .merge function instead.

Here are two examples that should work in your case:

join_df = df_a.join(df_b.set_index('mukey'), on='mukey', how='left')

or

join_df = df_a.merge(df_b, on='mukey', how='left')
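
The two fixes can be tried on a small example. The sketch below assumes pandas is installed and uses made-up rows for df_b, chosen so that two keys match (the frames in the question share no mukey values); it shows the failing call, then both working variants.

```python
import pandas as pd

df_a = pd.DataFrame({'mukey': [100000, 1000005, 1000006],
                     'DI': [35, 44, 44],
                     'PI': [14, 14, 14]})
# Hypothetical data: two of these keys also occur in df_a
df_b = pd.DataFrame({'mukey': [100000, 1000006, 190240],
                     'niccdcd': [4, 7, 7]})

# Joining as-is fails because both frames carry a 'mukey' column
error_raised = False
try:
    df_a.join(df_b, on='mukey', how='left')
except ValueError as exc:
    error_raised = True
    print(exc)

# Fix 1: make 'mukey' the index of the right-hand frame before joining
joined = df_a.join(df_b.set_index('mukey'), on='mukey', how='left')

# Fix 2: merge on the column directly
merged = df_a.merge(df_b, on='mukey', how='left')

print(joined)
print(merged)
```

Both results contain a single mukey column, with niccdcd filled in where the key exists in df_b and NaN elsewhere.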

Answer 2

This error indicates that the two tables have one or more columns with the same name. The error message translates to: "I can see the same column in both tables but you haven't told me to rename either before bringing one of them in."

You either want to delete one of the columns before bringing it in from the other using del df['column name'], or use lsuffix to rename the original column, or rsuffix to rename the one that is being brought in.

df_a.join(df_b, on='mukey', how='left', lsuffix='_left', rsuffix='_right')

Answer 3

join is mainly meant for joining on the index, not on column names. So change the overlapping column names in the two dataframes and then try to join; they will be joined. Otherwise this error is raised.


How to load a tsv file into a Pandas DataFrame?

Question: How to load a tsv file into a Pandas DataFrame?

I’m new to python and pandas. I’m trying to get a tsv file loaded into a pandas DataFrame.

This is what I’m trying and the error I’m getting:

>>> df1 = DataFrame(csv.reader(open('c:/~/trainSetRel3.txt'), delimiter='\t'))

Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    df1 = DataFrame(csv.reader(open('c:/~/trainSetRel3.txt'), delimiter='\t'))
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 318, in __init__
    raise PandasError('DataFrame constructor not properly called!')
PandasError: DataFrame constructor not properly called!

Answer 0

Note: As of pandas 0.17.0, from_csv is discouraged (it was later removed in pandas 1.0); use pd.read_csv instead.

The documentation lists a .from_csv function that appears to do what you want:

DataFrame.from_csv('c:/~/trainSetRel3.txt', sep='\t')

If you have a header, you can pass header=0.

DataFrame.from_csv('c:/~/trainSetRel3.txt', sep='\t', header=0)

Answer 1

As of pandas 0.17.0, from_csv is discouraged.

Use pd.read_csv(fpath, sep='\t') or pd.read_table(fpath).
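
A quick way to try this without a file on disk is to read tab-separated text from memory. The sketch below assumes pandas is installed; the column names and rows are invented for the example.

```python
import io

import pandas as pd

# Made-up tab-separated content standing in for a .tsv file
tsv_text = "id\tname\tscore\n1\talice\t3.5\n2\tbob\t4.0\n"

# sep='\t' tells read_csv to split fields on tabs instead of commas
df = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(df)
print(df.dtypes)
```

With a real file, the io.StringIO object is simply replaced by the file path.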


Answer 2

Use read_table(filepath). The default separator is a tab.


Answer 3

Try this

df = pd.read_csv("rating-data.tsv",sep='\t')
df.head()

You just need to set the sep parameter.


Answer 4

Open the file, save it as .csv, and then apply

df = pd.read_csv('apps.csv', sep='\t')

For any other format, just change the sep argument.


Answer 5

df = pd.read_csv('filename.csv', sep='\t', header=0)

You can load the tsv file directly into a pandas data frame by specifying the delimiter and header.

