标签归档:elementtree

Python ElementTree模块:使用方法“ find”,“ findall”时,如何忽略XML文件的命名空间以找到匹配的元素

问题:Python ElementTree模块:使用方法“ find”,“ findall”时,如何忽略XML文件的命名空间以找到匹配的元素

我想使用“ findall”方法在ElementTree模块中找到源xml文件的某些元素。

但是,源xml文件(test.xml)具有命名空间。我截断一部分xml文件作为示例:

<?xml version="1.0" encoding="iso-8859-1"?>
<XML_HEADER xmlns="http://www.test.com">
    <TYPE>Updates</TYPE>
    <DATE>9/26/2012 10:30:34 AM</DATE>
    <COPYRIGHT_NOTICE>All Rights Reserved.</COPYRIGHT_NOTICE>
    <LICENSE>newlicense.htm</LICENSE>
    <DEAL_LEVEL>
        <PAID_OFF>N</PAID_OFF>
        </DEAL_LEVEL>
</XML_HEADER>

示例python代码如下:

from xml.etree import ElementTree as ET
tree = ET.parse(r"test.xml")
el1 = tree.findall("DEAL_LEVEL/PAID_OFF") # Return None
el2 = tree.findall("{http://www.test.com}DEAL_LEVEL/{http://www.test.com}PAID_OFF") # Return <Element '{http://www.test.com}DEAL_LEVEL/PAID_OFF' at 0xb78b90>

尽管它可以工作,但是因为有一个命名空间“ {http://www.test.com}”,但是在每个标签前面添加一个命名空间非常不方便。

使用“ find”,“ findall”等方法时,如何忽略命名空间?

I want to use the method of “findall” to locate some elements of the source xml file in the ElementTree module.

However, the source xml file (test.xml) has namespace. I truncate part of xml file as sample:

<?xml version="1.0" encoding="iso-8859-1"?>
<XML_HEADER xmlns="http://www.test.com">
    <TYPE>Updates</TYPE>
    <DATE>9/26/2012 10:30:34 AM</DATE>
    <COPYRIGHT_NOTICE>All Rights Reserved.</COPYRIGHT_NOTICE>
    <LICENSE>newlicense.htm</LICENSE>
    <DEAL_LEVEL>
        <PAID_OFF>N</PAID_OFF>
        </DEAL_LEVEL>
</XML_HEADER>

The sample python code is below:

from xml.etree import ElementTree as ET
tree = ET.parse(r"test.xml")
el1 = tree.findall("DEAL_LEVEL/PAID_OFF") # Return None
el2 = tree.findall("{http://www.test.com}DEAL_LEVEL/{http://www.test.com}PAID_OFF") # Return <Element '{http://www.test.com}DEAL_LEVEL/PAID_OFF' at 0xb78b90>

Although it can works, because there is a namespace “{http://www.test.com}”, it’s very inconvenient to add a namespace in front of each tag.

How can I ignore the namespace when using the method of “find”, “findall” and so on?


回答 0

最好不要解析XML文档本身,而是先解析它,然后修改结果中的标记。这样,您可以处理多个命名空间和命名空间别名:

from io import StringIO  # for Python 2 import from StringIO instead
import xml.etree.ElementTree as ET

# instead of ET.fromstring(xml)
it = ET.iterparse(StringIO(xml))
for _, el in it:
    prefix, has_namespace, postfix = el.tag.partition('}')
    if has_namespace:
        el.tag = postfix  # strip all namespaces
root = it.root

这是基于此处的讨论:http : //bugs.python.org/issue18304

更新: rpartition而不是partition确保你得到的标签名postfix,即使没有命名空间。因此,您可以将其压缩:

for _, el in it:
    _, _, el.tag = el.tag.rpartition('}') # strip ns

Instead of modifying the XML document itself, it’s best to parse it and then modify the tags in the result. This way you can handle multiple namespaces and namespace aliases:

from io import StringIO  # for Python 2 import from StringIO instead
import xml.etree.ElementTree as ET

# instead of ET.fromstring(xml)
it = ET.iterparse(StringIO(xml))
for _, el in it:
    prefix, has_namespace, postfix = el.tag.partition('}')
    if has_namespace:
        el.tag = postfix  # strip all namespaces
root = it.root

This is based on the discussion here: http://bugs.python.org/issue18304

Update: rpartition instead of partition makes sure you get the tag name in postfix even if there is no namespace. Thus you could condense it:

for _, el in it:
    _, _, el.tag = el.tag.rpartition('}') # strip ns

回答 1

如果您在解析前从xml中删除xmlns属性,则树中的每个标记都将没有命名空间。

import re

xmlstring = re.sub(' xmlns="[^"]+"', '', xmlstring, count=1)

If you remove the xmlns attribute from the xml before parsing it then there won’t be a namespace prepended to each tag in the tree.

import re

xmlstring = re.sub(' xmlns="[^"]+"', '', xmlstring, count=1)

回答 2

到目前为止,答案明确地将命名空间值放在脚本中。对于更通用的解决方案,我宁愿从xml中提取命名空间:

import re
def get_namespace(element):
  m = re.match('\{.*\}', element.tag)
  return m.group(0) if m else ''

并在查找方法中使用它:

namespace = get_namespace(tree.getroot())
print tree.find('./{0}parent/{0}version'.format(namespace)).text

The answers so far explicitely put the namespace value in the script. For a more generic solution, I would rather extract the namespace from the xml:

import re
def get_namespace(element):
  m = re.match('\{.*\}', element.tag)
  return m.group(0) if m else ''

And use it in find method:

namespace = get_namespace(tree.getroot())
print tree.find('./{0}parent/{0}version'.format(namespace)).text

回答 3

这是对nonagon答案的扩展,它也剥离了命名空间的属性:

from StringIO import StringIO
import xml.etree.ElementTree as ET

# instead of ET.fromstring(xml)
it = ET.iterparse(StringIO(xml))
for _, el in it:
    if '}' in el.tag:
        el.tag = el.tag.split('}', 1)[1]  # strip all namespaces
    for at in list(el.attrib.keys()): # strip namespaces of attributes too
        if '}' in at:
            newat = at.split('}', 1)[1]
            el.attrib[newat] = el.attrib[at]
            del el.attrib[at]
root = it.root

UPDATE:已添加,list()以便迭代器可以工作(Python 3所需)

Here’s an extension to nonagon’s answer, which also strips namespaces off attributes:

from StringIO import StringIO
import xml.etree.ElementTree as ET

# instead of ET.fromstring(xml)
it = ET.iterparse(StringIO(xml))
for _, el in it:
    if '}' in el.tag:
        el.tag = el.tag.split('}', 1)[1]  # strip all namespaces
    for at in list(el.attrib.keys()): # strip namespaces of attributes too
        if '}' in at:
            newat = at.split('}', 1)[1]
            el.attrib[newat] = el.attrib[at]
            del el.attrib[at]
root = it.root

UPDATE: added list() so the iterator works (needed for Python 3)


回答 4

改善ericspod的答案:

无需全局更改解析模式,我们可以将其包装在支持with构造的对象中。

from xml.parsers import expat

class DisableXmlNamespaces:
    def __enter__(self):
            self.oldcreate = expat.ParserCreate
            expat.ParserCreate = lambda encoding, sep: self.oldcreate(encoding, None)
    def __exit__(self, type, value, traceback):
            expat.ParserCreate = self.oldcreate

然后可以按如下方式使用

import xml.etree.ElementTree as ET
with DisableXmlNamespaces():
     tree = ET.parse("test.xml")

这种方式的优点在于,它不会更改with块之外无关代码的任何行为。我使用了ericspod的版本(在此同时也使用了expat)在不相关的库中出现错误之后,最终创建了该代码。

Improving on the answer by ericspod:

Instead of changing the parse mode globally we can wrap this in an object supporting the with construct.

from xml.parsers import expat

class DisableXmlNamespaces:
    def __enter__(self):
            self.oldcreate = expat.ParserCreate
            expat.ParserCreate = lambda encoding, sep: self.oldcreate(encoding, None)
    def __exit__(self, type, value, traceback):
            expat.ParserCreate = self.oldcreate

This can then be used as follows

import xml.etree.ElementTree as ET
with DisableXmlNamespaces():
     tree = ET.parse("test.xml")

The beauty of this way is that it does not change any behaviour for unrelated code outside the with block. I ended up creating this after getting errors in unrelated libraries after using the version by ericspod which also happened to use expat.


回答 5

您也可以使用优雅的字符串格式构造:

ns='http://www.test.com'
el2 = tree.findall("{%s}DEAL_LEVEL/{%s}PAID_OFF" %(ns,ns))

或者,如果您确定PAID_OFF仅出现在树的一级中:

el2 = tree.findall(".//{%s}PAID_OFF" % ns)

You can use the elegant string formatting construct as well:

ns='http://www.test.com'
el2 = tree.findall("{%s}DEAL_LEVEL/{%s}PAID_OFF" %(ns,ns))

or, if you’re sure that PAID_OFF only appears in one level in tree:

el2 = tree.findall(".//{%s}PAID_OFF" % ns)

回答 6

如果不使用ElementTree,则cElementTree可以通过替换来强制Expat忽略命名空间处理ParserCreate()

from xml.parsers import expat
oldcreate = expat.ParserCreate
expat.ParserCreate = lambda encoding, sep: oldcreate(encoding, None)

ElementTree尝试通过调用来使用Expat,ParserCreate()但没有提供不提供命名空间分隔符字符串的选项,以上代码将导致其被忽略,但被警告可能会破坏其他情况。

If you’re using ElementTree and not cElementTree you can force Expat to ignore namespace processing by replacing ParserCreate():

from xml.parsers import expat
oldcreate = expat.ParserCreate
expat.ParserCreate = lambda encoding, sep: oldcreate(encoding, None)

ElementTree tries to use Expat by calling ParserCreate() but provides no option to not provide a namespace separator string, the above code will cause it to be ignore but be warned this could break other things.


回答 7

我为此可能会迟到,但我认为这re.sub不是一个好的解决方案。

但是,该重写xml.parsers.expat不适用于Python 3.x版本,

罪魁祸首是xml/etree/ElementTree.py源代码的底部

# Import the C accelerators
try:
    # Element is going to be shadowed by the C implementation. We need to keep
    # the Python version of it accessible for some "creative" by external code
    # (see tests)
    _Element_Py = Element

    # Element, SubElement, ParseError, TreeBuilder, XMLParser
    from _elementtree import *
except ImportError:
    pass

真是可悲。

解决的办法是先摆脱它。

import _elementtree
try:
    del _elementtree.XMLParser
except AttributeError:
    # in case deleted twice
    pass
else:
    from xml.parsers import expat  # NOQA: F811
    oldcreate = expat.ParserCreate
    expat.ParserCreate = lambda encoding, sep: oldcreate(encoding, None)

在Python 3.6上测试。

try如果在代码的某处重新加载或导入模块两次而遇到一些奇怪的错误,例如try 语句,则很有用

  • 超过最大递归深度
  • AttributeError:XMLParser

顺便说一句,etree源代码看起来真的很乱。

I might be late for this but I dont think re.sub is a good solution.

However the rewrite xml.parsers.expat does not work for Python 3.x versions,

The main culprit is the xml/etree/ElementTree.py see bottom of the source code

# Import the C accelerators
try:
    # Element is going to be shadowed by the C implementation. We need to keep
    # the Python version of it accessible for some "creative" by external code
    # (see tests)
    _Element_Py = Element

    # Element, SubElement, ParseError, TreeBuilder, XMLParser
    from _elementtree import *
except ImportError:
    pass

Which is kinda sad.

The solution is to get rid of it first.

import _elementtree
try:
    del _elementtree.XMLParser
except AttributeError:
    # in case deleted twice
    pass
else:
    from xml.parsers import expat  # NOQA: F811
    oldcreate = expat.ParserCreate
    expat.ParserCreate = lambda encoding, sep: oldcreate(encoding, None)

Tested on Python 3.6.

Try try statement is useful in case somewhere in your code you reload or import a module twice you get some strange errors like

  • maximum recursion depth exceeded
  • AttributeError: XMLParser

btw damn the etree source code looks really messy.


回答 8

让我们结合nonagon的答案mzjn对一个相关问题的答案

def parse_xml(xml_path: Path) -> Tuple[ET.Element, Dict[str, str]]:
    xml_iter = ET.iterparse(xml_path, events=["start-ns"])
    xml_namespaces = dict(prefix_namespace_pair for _, prefix_namespace_pair in xml_iter)
    return xml_iter.root, xml_namespaces

使用此功能,我们:

  1. 创建一个迭代器以获取命名空间和已解析的树对象

  2. 遍历创建的迭代器以获取命名空间命令,我们以后可以传入每个命名空间find()findall()调用iMom0的命名空间。

  3. 返回解析树的根元素对象和命名空间。

我认为这是最好的方法,因为无论源XML还是解析后的xml.etree.ElementTree输出都不会受到任何操纵。

我还要感谢Barny的回答,因为它提供了这个难题的重要组成部分(您可以从迭代器获得解析的根)。在此之前,我实际上在应用程序中遍历了两次XML树(一次获取命名空间,第二次获取根)。

Let’s combine nonagon’s answer with mzjn’s answer to a related question:

def parse_xml(xml_path: Path) -> Tuple[ET.Element, Dict[str, str]]:
    xml_iter = ET.iterparse(xml_path, events=["start-ns"])
    xml_namespaces = dict(prefix_namespace_pair for _, prefix_namespace_pair in xml_iter)
    return xml_iter.root, xml_namespaces

Using this function we:

  1. Create an iterator to get both namespaces and a parsed tree object.

  2. Iterate over the created iterator to get the namespaces dict that we can later pass in each find() or findall() call as sugested by iMom0.

  3. Return the parsed tree’s root element object and namespaces.

I think this is the best approach all around as there’s no manipulation either of a source XML or resulting parsed xml.etree.ElementTree output whatsoever involved.

I’d like also to credit barny’s answer with providing an essential piece of this puzzle (that you can get the parsed root from the iterator). Until that I actually traversed XML tree twice in my application (once to get namespaces, second for a root).


通过’ElementTree’在Python中使用命名空间解析XML

问题:通过’ElementTree’在Python中使用命名空间解析XML

我有以下要使用Python解析的XML ElementTree

<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">

    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
        <rdfs:comment xml:lang="en">
          a group of sports teams that compete against each other
          in Basketball
        </rdfs:comment>
    </owl:Class>

</rdf:RDF>

我想找到所有owl:Class标签,然后提取其中所有rdfs:label实例的值。我正在使用以下代码:

tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')

由于命名空间的原因,出现以下错误。

SyntaxError: prefix 'owl' not found in prefix map

我尝试阅读http://effbot.org/zone/element-namespaces.htm上的文档,但由于上述XML具有多个嵌套的命名空间,因此仍然无法正常工作。

请让我知道如何更改代码以查找所有owl:Class标签。

I have the following XML which I want to parse using Python’s ElementTree:

<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">

    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
        <rdfs:comment xml:lang="en">
          a group of sports teams that compete against each other
          in Basketball
        </rdfs:comment>
    </owl:Class>

</rdf:RDF>

I want to find all owl:Class tags and then extract the value of all rdfs:label instances inside them. I am using the following code:

tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')

Because of the namespace, I am getting the following error.

SyntaxError: prefix 'owl' not found in prefix map

I tried reading the document at http://effbot.org/zone/element-namespaces.htm but I am still not able to get this working since the above XML has multiple nested namespaces.

Kindly let me know how to change the code to find all the owl:Class tags.


回答 0

ElementTree对命名空间不太聪明。你需要给的.find()findall()iterfind()方法的明确的命名空间字典。这没有很好的记录:

namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed

root.findall('owl:Class', namespaces)

namespaces您传入的参数中查找前缀。这意味着您可以使用任何喜欢的命名空间前缀;API会分开owl:一部分,在namespaces字典中查找相应的命名空间URL ,然后更改搜索以查找XPath表达式{http://www.w3.org/2002/07/owl}Class。当然,您也可以自己使用相同的语法:

root.findall('{http://www.w3.org/2002/07/owl#}Class')

如果可以切换到lxml,那就更好了;该库支持相同的ElementTree API,但会在.nsmap元素的属性中为您收集命名空间。

ElementTree is not too smart about namespaces. You need to give the .find(), findall() and iterfind() methods an explicit namespace dictionary. This is not documented very well:

namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed

root.findall('owl:Class', namespaces)

Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the owl: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.w3.org/2002/07/owl}Class instead. You can use the same syntax yourself too of course:

root.findall('{http://www.w3.org/2002/07/owl#}Class')

If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap attribute on elements.


回答 1

这是使用lxml来执行此操作的方法,而不必对命名空间进行硬编码或对其进行扫描(如Martijn Pieters所述):

from lxml import etree
tree = etree.parse("filename")
root = tree.getroot()
root.findall('owl:Class', root.nsmap)

更新

5年后,我仍然遇到这个问题的变体。如上所述,lxml可以提供帮助,但并非在每种情况下都可以。评论者在合并文档时可能会对此技术有个正确的认识,但我认为大多数人都很难仅搜索文档。

这是另一种情况以及我的处理方式:

<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
<Tag2>content</Tag2></Tag1>

不带前缀的xmlns意味着未加前缀的标签将获得此默认命名空间。这意味着当您搜索Tag2时,需要包括命名空间才能找到它。但是,lxml创建了一个以None为键的nsmap条目,我找不到搜索它的方法。所以,我像这样创建了一个新的命名空间字典

namespaces = {}
# response uses a default namespace, and tags don't mention it
# create a new ns map using an identifier of our choice
for k,v in root.nsmap.iteritems():
    if not k:
        namespaces['myprefix'] = v
e = root.find('myprefix:Tag2', namespaces)

Here’s how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):

from lxml import etree
tree = etree.parse("filename")
root = tree.getroot()
root.findall('owl:Class', root.nsmap)

UPDATE:

5 years later I’m still running into variations of this issue. lxml helps as I showed above, but not in every case. The commenters may have a valid point regarding this technique when it comes merging documents, but I think most people are having difficulty simply searching documents.

Here’s another case and how I handled it:

<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
<Tag2>content</Tag2></Tag1>

xmlns without a prefix means that unprefixed tags get this default namespace. This means when you search for Tag2, you need to include the namespace to find it. However, lxml creates an nsmap entry with None as the key, and I couldn’t find a way to search for it. So, I created a new namespace dictionary like this

namespaces = {}
# response uses a default namespace, and tags don't mention it
# create a new ns map using an identifier of our choice
for k,v in root.nsmap.iteritems():
    if not k:
        namespaces['myprefix'] = v
e = root.find('myprefix:Tag2', namespaces)

回答 2

注意:这是对Python的ElementTree标准库有用的答案,而无需使用硬编码的命名空间。

要从XML数据提取命名空间的前缀和URI,可以使用ElementTree.iterparse函数,仅解析命名空间启动事件(start-ns):

>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
...     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
...     xmlns:owl="http://www.w3.org/2002/07/owl#"
...     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
...     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
...     xmlns="http://dbpedia.org/ontology/">
... 
...     <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
...         <rdfs:label xml:lang="en">basketball league</rdfs:label>
...         <rdfs:comment xml:lang="en">
...           a group of sports teams that compete against each other
...           in Basketball
...         </rdfs:comment>
...     </owl:Class>
... 
... </rdf:RDF>'''
>>> my_namespaces = dict([
...     node for _, node in ElementTree.iterparse(
...         StringIO(my_schema), events=['start-ns']
...     )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
 'owl': 'http://www.w3.org/2002/07/owl#',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
 'xsd': 'http://www.w3.org/2001/XMLSchema#'}

然后可以将字典作为参数传递给搜索功能:

root.findall('owl:Class', my_namespaces)

Note: This is an answer useful for Python’s ElementTree standard library without using hardcoded namespaces.

To extract namespace’s prefixes and URI from XML data you can use ElementTree.iterparse function, parsing only namespace start events (start-ns):

>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
...     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
...     xmlns:owl="http://www.w3.org/2002/07/owl#"
...     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
...     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
...     xmlns="http://dbpedia.org/ontology/">
... 
...     <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
...         <rdfs:label xml:lang="en">basketball league</rdfs:label>
...         <rdfs:comment xml:lang="en">
...           a group of sports teams that compete against each other
...           in Basketball
...         </rdfs:comment>
...     </owl:Class>
... 
... </rdf:RDF>'''
>>> my_namespaces = dict([
...     node for _, node in ElementTree.iterparse(
...         StringIO(my_schema), events=['start-ns']
...     )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
 'owl': 'http://www.w3.org/2002/07/owl#',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
 'xsd': 'http://www.w3.org/2001/XMLSchema#'}

Then the dictionary can be passed as argument to the search functions:

root.findall('owl:Class', my_namespaces)

回答 3

我一直在使用与此类似的代码,并发现它总是值得阅读文档…像往常一样!

findall()将只查找当前标签的直接子元素。所以,不是全部。

尝试使代码与以下代码一起使用可能会值得您投入,尤其是在处理大型而复杂的xml文件时,还包括子子元素(等)。如果您自己了解xml中元素的位置,那么我想就可以了!只是认为这值得记住。

root.iter()

参考:https : //docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements “ Element.findall()仅查找带有标签的元素,这些标签是当前元素的直接子元素。 Element.find()查找带有特定标记的第一个子元素,然后Element.text访问元素的文本内容。Element.get()访问元素的属性:“

I’ve been using similar code to this and have found it’s always worth reading the documentation… as usual!

findall() will only find elements which are direct children of the current tag. So, not really ALL.

It might be worth your while trying to get your code working with the following, especially if you’re dealing with big and complex xml files so that that sub-sub-elements (etc.) are also included. If you know yourself where elements are in your xml, then I suppose it’ll be fine! Just thought this was worth remembering.

root.iter()

ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements “Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element’s text content. Element.get() accesses the element’s attributes:”


回答 4

以其命名空间格式获取命名空间,例如 {myNameSpace},可以执行以下操作:

root = tree.getroot()
ns = re.match(r'{.*}', root.tag).group(0)

这样,您可以稍后在代码中使用它来查找节点,例如使用字符串插值(Python 3)。

link = root.find(f"{ns}link")

To get the namespace in its namespace format, e.g. {myNameSpace}, you can do the following:

root = tree.getroot()
ns = re.match(r'{.*}', root.tag).group(0)

This way, you can use it later on in your code to find nodes, e.g using string interpolation (Python 3).

link = root.find(f"{ns}link")

回答 5

我的解决方案基于@Martijn Pieters的评论:

register_namespace 仅影响序列化,不影响搜索。

因此,这里的技巧是使用不同的字典进行序列化和搜索。

namespaces = {
    '': 'http://www.example.com/default-schema',
    'spec': 'http://www.example.com/specialized-schema',
}

现在,注册所有命名空间以进行解析和写入:

for name, value in namespaces.iteritems():
    ET.register_namespace(name, value)

对于搜索(find()findall()iterfind())我们需要一个非空前缀。向这些函数传递一个修改后的字典(这里我修改了原始字典,但这必须在注册了命名空间之后才能进行)。

self.namespaces['default'] = self.namespaces['']

现在,该find()系列的功能可以与default前缀一起使用:

print root.find('default:myelem', namespaces)

tree.write(destination)

默认命名空间中的元素不使用任何前缀。

My solution is based on @Martijn Pieters’ comment:

register_namespace only influences serialisation, not search.

So the trick here is to use different dictionaries for serialization and for searching.

namespaces = {
    '': 'http://www.example.com/default-schema',
    'spec': 'http://www.example.com/specialized-schema',
}

Now, register all namespaces for parsing and writing:

for name, value in namespaces.iteritems():
    ET.register_namespace(name, value)

For searching (find(), findall(), iterfind()) we need a non-empty prefix. Pass these functions a modified dictionary (here I modify the original dictionary, but this must be made only after the namespaces are registered).

self.namespaces['default'] = self.namespaces['']

Now, the functions from the find() family can be used with the default prefix:

print root.find('default:myelem', namespaces)

but

tree.write(destination)

does not use any prefixes for elements in the default namespace.