问题:通过’ElementTree’在Python中使用命名空间解析XML
我有以下要使用Python解析的XML ElementTree
:
<rdf:RDF xml:base="http://dbpedia.org/ontology/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns="http://dbpedia.org/ontology/">
<owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
<rdfs:label xml:lang="en">basketball league</rdfs:label>
<rdfs:comment xml:lang="en">
a group of sports teams that compete against each other
in Basketball
</rdfs:comment>
</owl:Class>
</rdf:RDF>
我想找到所有owl:Class
标签,然后提取其中所有rdfs:label
实例的值。我正在使用以下代码:
tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')
由于命名空间的原因,出现以下错误。
SyntaxError: prefix 'owl' not found in prefix map
我尝试阅读http://effbot.org/zone/element-namespaces.htm上的文档,但由于上述XML具有多个嵌套的命名空间,因此仍然无法正常工作。
请让我知道如何更改代码以查找所有owl:Class
标签。
I have the following XML which I want to parse using Python’s ElementTree
:
<rdf:RDF xml:base="http://dbpedia.org/ontology/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns="http://dbpedia.org/ontology/">
<owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
<rdfs:label xml:lang="en">basketball league</rdfs:label>
<rdfs:comment xml:lang="en">
a group of sports teams that compete against each other
in Basketball
</rdfs:comment>
</owl:Class>
</rdf:RDF>
I want to find all owl:Class
tags and then extract the value of all rdfs:label
instances inside them. I am using the following code:
tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')
Because of the namespace, I am getting the following error.
SyntaxError: prefix 'owl' not found in prefix map
I tried reading the document at http://effbot.org/zone/element-namespaces.htm but I am still not able to get this working since the above XML has multiple nested namespaces.
Kindly let me know how to change the code to find all the owl:Class
tags.
回答 0
ElementTree对命名空间不太聪明。你需要给的.find()
,findall()
和iterfind()
方法的明确的命名空间字典。这没有很好的记录:
namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed
root.findall('owl:Class', namespaces)
仅在namespaces
您传入的参数中查找前缀。这意味着您可以使用任何喜欢的命名空间前缀;API会分开owl:
一部分,在namespaces
字典中查找相应的命名空间URL ,然后更改搜索以查找XPath表达式{http://www.w3.org/2002/07/owl}Class
。当然,您也可以自己使用相同的语法:
root.findall('{http://www.w3.org/2002/07/owl#}Class')
如果可以切换到lxml
库,那就更好了;该库支持相同的ElementTree API,但会在.nsmap
元素的属性中为您收集命名空间。
ElementTree is not too smart about namespaces. You need to give the .find()
, findall()
and iterfind()
methods an explicit namespace dictionary. This is not documented very well:
namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed
root.findall('owl:Class', namespaces)
Prefixes are only looked up in the namespaces
parameter you pass in. This means you can use any namespace prefix you like; the API splits off the owl:
part, looks up the corresponding namespace URL in the namespaces
dictionary, then changes the search to look for the XPath expression {http://www.w3.org/2002/07/owl}Class
instead. You can use the same syntax yourself too of course:
root.findall('{http://www.w3.org/2002/07/owl#}Class')
If you can switch to the lxml
library things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap
attribute on elements.
回答 1
这是使用lxml来执行此操作的方法,而不必对命名空间进行硬编码或对其进行扫描(如Martijn Pieters所述):
from lxml import etree
tree = etree.parse("filename")
root = tree.getroot()
root.findall('owl:Class', root.nsmap)
更新:
5年后,我仍然遇到这个问题的变体。如上所述,lxml可以提供帮助,但并非在每种情况下都可以。评论者在合并文档时可能会对此技术有个正确的认识,但我认为大多数人都很难仅搜索文档。
这是另一种情况以及我的处理方式:
<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
<Tag2>content</Tag2></Tag1>
不带前缀的xmlns意味着未加前缀的标签将获得此默认命名空间。这意味着当您搜索Tag2时,需要包括命名空间才能找到它。但是,lxml创建了一个以None为键的nsmap条目,我找不到搜索它的方法。所以,我像这样创建了一个新的命名空间字典
namespaces = {}
# response uses a default namespace, and tags don't mention it
# create a new ns map using an identifier of our choice
for k,v in root.nsmap.iteritems():
if not k:
namespaces['myprefix'] = v
e = root.find('myprefix:Tag2', namespaces)
Here’s how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):
from lxml import etree
tree = etree.parse("filename")
root = tree.getroot()
root.findall('owl:Class', root.nsmap)
UPDATE:
5 years later I’m still running into variations of this issue. lxml helps as I showed above, but not in every case. The commenters may have a valid point regarding this technique when it comes merging documents, but I think most people are having difficulty simply searching documents.
Here’s another case and how I handled it:
<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
<Tag2>content</Tag2></Tag1>
xmlns without a prefix means that unprefixed tags get this default namespace. This means when you search for Tag2, you need to include the namespace to find it. However, lxml creates an nsmap entry with None as the key, and I couldn’t find a way to search for it. So, I created a new namespace dictionary like this
namespaces = {}
# response uses a default namespace, and tags don't mention it
# create a new ns map using an identifier of our choice
for k,v in root.nsmap.iteritems():
if not k:
namespaces['myprefix'] = v
e = root.find('myprefix:Tag2', namespaces)
回答 2
注意:这是对Python的ElementTree标准库有用的答案,而无需使用硬编码的命名空间。
要从XML数据提取命名空间的前缀和URI,可以使用ElementTree.iterparse
函数,仅解析命名空间启动事件(start-ns):
>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
... xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
... xmlns:owl="http://www.w3.org/2002/07/owl#"
... xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
... xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
... xmlns="http://dbpedia.org/ontology/">
...
... <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
... <rdfs:label xml:lang="en">basketball league</rdfs:label>
... <rdfs:comment xml:lang="en">
... a group of sports teams that compete against each other
... in Basketball
... </rdfs:comment>
... </owl:Class>
...
... </rdf:RDF>'''
>>> my_namespaces = dict([
... node for _, node in ElementTree.iterparse(
... StringIO(my_schema), events=['start-ns']
... )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
'owl': 'http://www.w3.org/2002/07/owl#',
'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
'xsd': 'http://www.w3.org/2001/XMLSchema#'}
然后可以将字典作为参数传递给搜索功能:
root.findall('owl:Class', my_namespaces)
Note: This is an answer useful for Python’s ElementTree standard library without using hardcoded namespaces.
To extract namespace’s prefixes and URI from XML data you can use ElementTree.iterparse
function, parsing only namespace start events (start-ns):
>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
... xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
... xmlns:owl="http://www.w3.org/2002/07/owl#"
... xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
... xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
... xmlns="http://dbpedia.org/ontology/">
...
... <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
... <rdfs:label xml:lang="en">basketball league</rdfs:label>
... <rdfs:comment xml:lang="en">
... a group of sports teams that compete against each other
... in Basketball
... </rdfs:comment>
... </owl:Class>
...
... </rdf:RDF>'''
>>> my_namespaces = dict([
... node for _, node in ElementTree.iterparse(
... StringIO(my_schema), events=['start-ns']
... )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
'owl': 'http://www.w3.org/2002/07/owl#',
'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
'xsd': 'http://www.w3.org/2001/XMLSchema#'}
Then the dictionary can be passed as argument to the search functions:
root.findall('owl:Class', my_namespaces)
回答 3
I’ve been using similar code to this and have found it’s always worth reading the documentation… as usual!
findall() will only find elements which are direct children of the current tag. So, not really ALL.
It might be worth your while trying to get your code working with the following, especially if you’re dealing with big and complex xml files so that that sub-sub-elements (etc.) are also included.
If you know yourself where elements are in your xml, then I suppose it’ll be fine! Just thought this was worth remembering.
root.iter()
ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements
“Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element’s text content. Element.get() accesses the element’s attributes:”
回答 4
以其命名空间格式获取命名空间,例如 {myNameSpace}
,可以执行以下操作:
root = tree.getroot()
ns = re.match(r'{.*}', root.tag).group(0)
这样,您可以稍后在代码中使用它来查找节点,例如使用字符串插值(Python 3)。
link = root.find(f"{ns}link")
To get the namespace in its namespace format, e.g. {myNameSpace}
, you can do the following:
root = tree.getroot()
ns = re.match(r'{.*}', root.tag).group(0)
This way, you can use it later on in your code to find nodes, e.g using string interpolation (Python 3).
link = root.find(f"{ns}link")
回答 5
我的解决方案基于@Martijn Pieters的评论:
register_namespace
仅影响序列化,不影响搜索。
因此,这里的技巧是使用不同的字典进行序列化和搜索。
namespaces = {
'': 'http://www.example.com/default-schema',
'spec': 'http://www.example.com/specialized-schema',
}
现在,注册所有命名空间以进行解析和写入:
for name, value in namespaces.iteritems():
ET.register_namespace(name, value)
对于搜索(find()
,findall()
,iterfind()
)我们需要一个非空前缀。向这些函数传递一个修改后的字典(这里我修改了原始字典,但这必须在注册了命名空间之后才能进行)。
self.namespaces['default'] = self.namespaces['']
现在,该find()
系列的功能可以与default
前缀一起使用:
print root.find('default:myelem', namespaces)
但
tree.write(destination)
默认命名空间中的元素不使用任何前缀。
My solution is based on @Martijn Pieters’ comment:
register_namespace
only influences serialisation, not search.
So the trick here is to use different dictionaries for serialization and for searching.
namespaces = {
'': 'http://www.example.com/default-schema',
'spec': 'http://www.example.com/specialized-schema',
}
Now, register all namespaces for parsing and writing:
for name, value in namespaces.iteritems():
ET.register_namespace(name, value)
For searching (find()
, findall()
, iterfind()
) we need a non-empty prefix. Pass these functions a modified dictionary (here I modify the original dictionary, but this must be made only after the namespaces are registered).
self.namespaces['default'] = self.namespaces['']
Now, the functions from the find()
family can be used with the default
prefix:
print root.find('default:myelem', namespaces)
but
tree.write(destination)
does not use any prefixes for elements in the default namespace.