Tag archive: xml-parsing

Parsing XML with namespaces in Python via 'ElementTree'

Question: Parsing XML with namespaces in Python via 'ElementTree'


I have the following XML which I want to parse using Python’s ElementTree:

<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">

    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
        <rdfs:comment xml:lang="en">
          a group of sports teams that compete against each other
          in Basketball
        </rdfs:comment>
    </owl:Class>

</rdf:RDF>

I want to find all owl:Class tags and then extract the value of all rdfs:label instances inside them. I am using the following code:

import xml.etree.ElementTree as ET

tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')

Because of the namespace, I am getting the following error.

SyntaxError: prefix 'owl' not found in prefix map

I tried reading the document at http://effbot.org/zone/element-namespaces.htm but I am still not able to get this working since the above XML has multiple nested namespaces.

Kindly let me know how to change the code to find all the owl:Class tags.


Answer 0


ElementTree is not too smart about namespaces. You need to give the .find(), findall() and iterfind() methods an explicit namespace dictionary. This is not documented very well:

namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed

root.findall('owl:Class', namespaces)

Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the owl: part, looks up the corresponding namespace URL in the namespaces dictionary, and then changes the search to look for the XPath expression {http://www.w3.org/2002/07/owl#}Class instead. You can of course use the same syntax yourself:

root.findall('{http://www.w3.org/2002/07/owl#}Class')

If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap attribute on elements.
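
Putting this together for the question's actual goal, here is a minimal sketch (reusing the question's "filename" placeholder) that finds every owl:Class and prints the text of each rdfs:label inside it:

import xml.etree.ElementTree as ET

namespaces = {
    'owl': 'http://www.w3.org/2002/07/owl#',
    'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
}

tree = ET.parse("filename")
root = tree.getroot()

# Each owl:Class may contain one or more rdfs:label children
for cls in root.findall('owl:Class', namespaces):
    for label in cls.findall('rdfs:label', namespaces):
        print(label.text)  # e.g. 'basketball league'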


Answer 1


Here’s how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):

from lxml import etree
tree = etree.parse("filename")
root = tree.getroot()
root.findall('owl:Class', root.nsmap)

UPDATE:

5 years later I'm still running into variations of this issue. lxml helps as I showed above, but not in every case. The commenters may have a valid point regarding this technique when it comes to merging documents, but I think most people simply have difficulty searching documents.

Here’s another case and how I handled it:

<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
<Tag2>content</Tag2></Tag1>

xmlns without a prefix means that unprefixed tags get this default namespace. This means that when you search for Tag2, you need to include the namespace to find it. However, lxml creates an nsmap entry with None as the key, and I couldn't find a way to search for it. So, I created a new namespace dictionary like this:

namespaces = {}
# The response uses a default namespace, and its tags don't mention it;
# create a new namespace map using an identifier of our choice.
for k, v in root.nsmap.items():  # use .iteritems() on Python 2
    if not k:
        namespaces['myprefix'] = v
e = root.find('myprefix:Tag2', namespaces)
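
A more compact variant of the same remapping, assuming the same root as above, is a dict comprehension that swaps the None key for a prefix of your choice:

# Replace the None key (the default namespace) with a usable prefix
namespaces = {k if k else 'myprefix': v for k, v in root.nsmap.items()}
e = root.find('myprefix:Tag2', namespaces)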

Answer 2


Note: This answer is useful for Python's ElementTree standard library without hard-coding the namespaces.

To extract the namespace prefixes and URIs from the XML data, you can use the ElementTree.iterparse function and parse only the namespace start events (start-ns):

>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
...     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
...     xmlns:owl="http://www.w3.org/2002/07/owl#"
...     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
...     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
...     xmlns="http://dbpedia.org/ontology/">
... 
...     <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
...         <rdfs:label xml:lang="en">basketball league</rdfs:label>
...         <rdfs:comment xml:lang="en">
...           a group of sports teams that compete against each other
...           in Basketball
...         </rdfs:comment>
...     </owl:Class>
... 
... </rdf:RDF>'''
>>> my_namespaces = dict([
...     node for _, node in ElementTree.iterparse(
...         StringIO(my_schema), events=['start-ns']
...     )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
 'owl': 'http://www.w3.org/2002/07/owl#',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
 'xsd': 'http://www.w3.org/2001/XMLSchema#'}

The dictionary can then be passed as an argument to the search functions:

root.findall('owl:Class', my_namespaces)
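
The same two steps also work against a file on disk. A sketch (with a hypothetical "filename") that reads the file twice, once for the namespace events and once for the tree:

from xml.etree import ElementTree

# Pass 1: harvest prefix -> URI pairs from the start-ns events
my_namespaces = dict(
    node for _, node in ElementTree.iterparse("filename", events=['start-ns'])
)

# Pass 2: parse the tree and search with the harvested mappings
root = ElementTree.parse("filename").getroot()
for cls in root.findall('owl:Class', my_namespaces):
    print(cls.tag)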

Answer 3


I’ve been using similar code to this and have found it’s always worth reading the documentation… as usual!

findall() will only find elements which are direct children of the current tag. So, not really all of them.

It might be worth your while trying to get your code working with the following, especially if you're dealing with big and complex XML files, so that sub-sub-elements (etc.) are also included. If you already know where the elements are in your XML, then I suppose it'll be fine! I just thought this was worth remembering.

root.iter()

ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements “Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element’s text content. Element.get() accesses the element’s attributes:”
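
Note that iter() takes a fully qualified {uri}tag name rather than a namespaces dictionary, so a recursive search for the question's owl:Class elements might look like this sketch:

# iter() walks the entire subtree, so matches are found at any depth
for cls in root.iter('{http://www.w3.org/2002/07/owl#}Class'):
    print(cls.tag)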


Answer 4


To get the namespace in its namespace format, e.g. {myNameSpace}, you can do the following:

import re

root = tree.getroot()
ns = re.match(r'{.*}', root.tag).group(0)

This way, you can use it later on in your code to find nodes, e.g. using string interpolation (Python 3):

link = root.find(f"{ns}link")
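
As a self-contained illustration of the trick (the namespace URI and tag names here are made up):

import re
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<feed xmlns='http://example.com/ns'><link>here</link></feed>"
)
ns = re.match(r'{.*}', root.tag).group(0)  # '{http://example.com/ns}'
print(root.find(f"{ns}link").text)  # 'here'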

Answer 5


My solution is based on @Martijn Pieters’ comment:

register_namespace only influences serialisation, not search.

So the trick here is to use different dictionaries for serialization and for searching.

namespaces = {
    '': 'http://www.example.com/default-schema',
    'spec': 'http://www.example.com/specialized-schema',
}

Now, register all namespaces for parsing and writing:

for name, value in namespaces.items():  # use .iteritems() on Python 2
    ET.register_namespace(name, value)

For searching (find(), findall(), iterfind()) we need a non-empty prefix. Pass these functions a modified dictionary (here I modify the original dictionary, but this must be done only after the namespaces are registered):

namespaces['default'] = namespaces['']

Now, the functions from the find() family can be used with the default prefix:

print(root.find('default:myelem', namespaces))

but

tree.write(destination)

does not use any prefixes for elements in the default namespace.
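
Putting the pieces together, a complete round-trip might look like this sketch, where "source.xml" and "destination.xml" are hypothetical file names:

import xml.etree.ElementTree as ET

namespaces = {
    '': 'http://www.example.com/default-schema',
    'spec': 'http://www.example.com/specialized-schema',
}

# Register first, so write() serializes default-namespace elements unprefixed
for name, value in namespaces.items():
    ET.register_namespace(name, value)

# Alias the default namespace under a non-empty prefix for searching
namespaces['default'] = namespaces['']

tree = ET.parse("source.xml")
root = tree.getroot()
print(root.find('default:myelem', namespaces))

tree.write("destination.xml")  # default-namespace elements carry no prefix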


Parsing HTML with Python

Question: Parsing HTML with Python


I’m looking for an HTML Parser module for Python that can help me get the tags in the form of Python lists/dictionaries/objects.

If I have a document of the form:

<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>

then it should give me a way to access the nested tags via the name or id of the HTML tag so that I can basically ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.

If you’ve used Firefox’s “Inspect element” feature (view HTML) you would know that it gives you all the tags in a nice nested manner like a tree.

I’d prefer a built-in module but that might be asking a little too much.


I went through a lot of questions on Stack Overflow and a few blogs on the internet, and most of them suggest BeautifulSoup, lxml, or HTMLParser, but few of them detail the functionality; they simply end up as a debate over which one is faster/more efficient.


Answer 0


So that I can ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.

try:
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3
except ImportError:
    from bs4 import BeautifulSoup  # BeautifulSoup 4

# the HTML code you've written above
html = """<html><head>Heading</head><body attr1='val1'>
<div class='container'><div id='class'>Something here</div>
<div>Something else</div></div></body></html>"""

parsed_html = BeautifulSoup(html)
print(parsed_html.body.find('div', attrs={'class': 'container'}).text)

You don't need performance descriptions, I guess – just read how BeautifulSoup works and look at its official documentation.
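
If you are on BeautifulSoup 4, a slightly more current sketch passes an explicit parser and uses the class_ keyword (reusing the html string from the snippet above):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # or 'lxml' if it is installed
div = soup.body.find('div', class_='container')
print(div.get_text(' ', strip=True))  # 'Something here Something else'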


Answer 1


I guess what you’re looking for is pyquery:

pyquery: a jquery-like library for python.

An example of what you want might look like this:

from pyquery import PyQuery

html = "..."  # your HTML code goes here
pq = PyQuery(html)
tag = pq('div#id')  # or tag = pq('div.class')
print(tag.text())

And it uses the same selectors as Firefox’s or Chrome’s inspect element. For example:

The inspected element selector is 'div#mw-head.noprint'. So in pyquery, you just need to pass this selector:

pq('div#mw-head.noprint')
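
Applied to the question's HTML, a minimal pyquery sketch might look like this:

from pyquery import PyQuery

html = """<html><body attr1='val1'><div class='container'>
<div id='class'>Something here</div><div>Something else</div>
</div></body></html>"""

pq = PyQuery(html)
print(pq('body div.container').text())  # roughly 'Something here Something else'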

Answer 2


Here you can read more about different HTML parsers in Python and their performance. Even though the article is a bit dated, it still gives you a good overview.

Python HTML parser performance

I'd recommend BeautifulSoup even though it isn't built in, simply because it's so easy to work with for these kinds of tasks. For example:

import urllib2  # Python 2 only
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

page = urllib2.urlopen('http://www.google.com/')
soup = BeautifulSoup(page)

x = soup.body.find('div', attrs={'class' : 'container'}).text
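
A rough Python 3 equivalent, assuming BeautifulSoup 4 (the URL is only a placeholder; google.com has no div with class 'container', hence the guard):

from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen('http://www.google.com/')  # placeholder URL
soup = BeautifulSoup(page, 'html.parser')

div = soup.body.find('div', attrs={'class': 'container'})
x = div.text if div else None  # the page may not contain such a div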

Answer 3


Compared to the other parser libraries, lxml is extremely fast.

And with cssselect it’s quite easy to use for scraping HTML pages too:

from lxml.html import parse

doc = parse('http://www.google.com').getroot()
for link in doc.cssselect('a'):  # requires the cssselect package
    print('%s: %s' % (link.text_content(), link.get('href')))

lxml.html Documentation
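
Against the question's document, the same cssselect approach might read as follows; html_doc stands in for the markup shown in the question:

from lxml.html import fromstring

html_doc = """<html><body attr1='val1'><div class='container'>
<div id='class'>Something here</div><div>Something else</div>
</div></body></html>"""

doc = fromstring(html_doc)
for div in doc.cssselect('body div.container div'):
    print(div.text_content())  # 'Something here', then 'Something else'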


Answer 4


I recommend lxml for parsing HTML. See “Parsing HTML” (on the lxml site).

In my experience, Beautiful Soup messes up on some complex HTML. I believe that is because Beautiful Soup is not a parser but rather a very good string analyzer.
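
As a minimal sketch of that recommendation (the markup here is made up):

from lxml import html

tree = html.fromstring(
    "<html><body><div class='container'>Something here</div></body></html>"
)
print(tree.xpath("//div[@class='container']/text()"))  # ['Something here']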


Answer 5


I recommend using justext library:

https://github.com/miso-belica/jusText

Usage (Python 2):

import requests
import justext

response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print paragraph.text

Python 3:

import requests
import justext

response = requests.get("http://bbc.com/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print (paragraph.text)

Answer 6


I would use EHP

https://github.com/iogf/ehp

Here it is:

from ehp import *

doc = '''<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>
'''

html = Html()
dom = html.feed(doc)
for ind in dom.find('div', ('class', 'container')):
    print(ind.text())

Output:

Something here
Something else