How to use XPath in Python?

Question: How to use XPath in Python?

What are the libraries that support XPath? Is there a full implementation? How is the library used? Where is its website?


Answer 0

libxml2 has a number of advantages:

  1. Compliance with the spec
  2. Active development and community participation
  3. Speed. This is really a Python wrapper around a C implementation.
  4. Ubiquity. The libxml2 library is pervasive and thus well tested.

Downsides include:

  1. Compliance with the spec. It’s strict. Things like default namespace handling are easier in other libraries.
  2. Use of native code. This can be a pain depending on how your application is distributed or deployed. RPMs are available that ease some of this pain.
  3. Manual resource handling. Note in the sample below the calls to freeDoc() and xpathFreeContext(). This is not very Pythonic.

If you are doing simple path selection, stick with ElementTree (which is included in Python 2.5). If you need full spec compliance or raw speed and can cope with the distribution of native code, go with libxml2.

Sample of libxml2 XPath Use


import sys

import libxml2

doc = libxml2.parseFile("tst.xml")
ctxt = doc.xpathNewContext()
res = ctxt.xpathEval("//*")  # select every element in the document
if len(res) != 2:
    print("xpath query: wrong node set size")
    sys.exit(1)
if res[0].name != "doc" or res[1].name != "foo":
    print("xpath query: wrong node set value")
    sys.exit(1)
# The document and the XPath context must be freed explicitly
doc.freeDoc()
ctxt.xpathFreeContext()

Sample of ElementTree XPath Use


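from xml.etree.ElementTree import ElementTree

mydoc = ElementTree(file='tst.xml')
# Paths are evaluated relative to the root element
for e in mydoc.findall('./foo/bar'):
    # get() returns the attribute value as a plain string
    print(e.get('title'))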
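The default-namespace point above is easiest to see with a small case. What follows is a minimal sketch using only the standard-library ElementTree; the XML snippet, the namespace URI, and the prefix `d` are made up for illustration. It shows that unprefixed names match nothing once a default namespace is declared, and that passing a prefix mapping to findall() is one way to handle this.

```python
import xml.etree.ElementTree as ET

# Hypothetical document that declares a default namespace.
XML = """<doc xmlns="http://example.com/ns">
  <foo><bar title="first"/><bar title="second"/></foo>
</doc>"""

root = ET.fromstring(XML)

# Unprefixed names match nothing: every element lives in the namespace.
unqualified = root.findall('./foo/bar')

# Map an arbitrary prefix to the namespace URI and use it in the path.
ns = {'d': 'http://example.com/ns'}
titles = [e.get('title') for e in root.findall('./d:foo/d:bar', ns)]

print(len(unqualified))
print(titles)
```

The attribute is read without a prefix because unprefixed attributes do not inherit the default namespace.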


Answer 1

The lxml package supports XPath. It seems to work pretty well, although I’ve had some trouble with the self:: axis. There’s also Amara, but I haven’t used it personally.


Answer 2

Sounds like an lxml advertisement in here. ;) ElementTree is included in the standard library. In 2.6 and below its XPath support is pretty weak, but in 2.7+ it is much improved:

import xml.etree.ElementTree as ET

filename = 'tst.xml'  # placeholder path to your XML file
root = ET.parse(filename).getroot()
result = ''

for elem in root.findall('.//child/grandchild'):
    # How to make decisions based on attributes even in 2.6:
    if elem.attrib.get('name') == 'foo':
        result = elem.text
        break
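On 2.7 and later, the attribute test can probably move into the path itself. This is a small sketch with an inline document invented for the example; it uses the `[@attrib='value']` predicate that the standard-library ElementTree accepts in those versions.

```python
import xml.etree.ElementTree as ET

# Made-up sample document matching the child/grandchild layout above.
XML = """<root>
  <child><grandchild name="foo">hello</grandchild></child>
  <child><grandchild name="bar">world</grandchild></child>
</root>"""

root = ET.fromstring(XML)

# find() returns the first match or None, so no explicit loop is needed.
match = root.find(".//child/grandchild[@name='foo']")
result = match.text if match is not None else ''
print(result)
```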

Answer 3

Use lxml. lxml uses the full power of libxml2 and libxslt, but wraps them in more “Pythonic” bindings than the Python bindings that are native to those libraries. As such, it gets the full XPath 1.0 implementation. Native ElementTree supports a limited subset of XPath, although it may be good enough for your needs.
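To get a feel for where the "limited subset" ends, here is a rough sketch using only the standard-library ElementTree on a made-up document. An attribute-equality predicate works, while an XPath function such as contains() is rejected with a SyntaxError in the versions I am aware of. The list comprehension is just one plain-Python fallback, not a substitute for a full XPath 1.0 engine.

```python
import xml.etree.ElementTree as ET

# Made-up sample data for illustration.
root = ET.fromstring('<items><item name="alpha"/><item name="beta"/></items>')

# Supported by ElementTree: attribute-equality predicate.
exact = root.findall(".//item[@name='alpha']")

# Not supported by ElementTree: XPath functions such as contains().
try:
    root.findall(".//item[contains(@name, 'alp')]")
    contains_supported = True
except SyntaxError:
    contains_supported = False

# Plain-Python fallback for the same substring filter.
partial = [e for e in root.iter('item') if 'alp' in e.get('name', '')]

print(len(exact), contains_supported, [e.get('name') for e in partial])
```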


Answer 4

Another option is py-dom-xpath. It works seamlessly with minidom and is pure Python, so it works on App Engine.

from xml.dom import minidom
import xpath

doc = minidom.parse('foo.xml')  # any minidom document works
xpath.find('//item', doc)

Answer 5

You can use:

PyXML:

from xml.dom.ext.reader import Sax2
from xml import xpath

doc = Sax2.FromXmlFile('foo.xml').documentElement
for url in xpath.Evaluate('//@Url', doc):
    print(url.value)

libxml2:

import libxml2

doc = libxml2.parseFile('foo.xml')
for url in doc.xpathEval('//@Url'):
    print(url.content)
doc.freeDoc()  # libxml2 documents must be freed explicitly
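If neither library is installed, a rough standard-library approximation of the same `//@Url` query might look like the sketch below. ElementTree cannot select attribute nodes directly, so this selects the elements that carry the attribute and then reads it. The sample document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; only two elements carry a Url attribute.
XML = """<links>
  <link Url="http://example.com/a"/>
  <group><link Url="http://example.com/b"/></group>
  <link/>
</links>"""

root = ET.fromstring(XML)

# './/*[@Url]' matches every descendant that has a Url attribute.
# The root element itself is not checked, unlike a true //@Url query.
urls = [e.get('Url') for e in root.findall('.//*[@Url]')]

for url in urls:
    print(url)
```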

Answer 6

The latest version of ElementTree supports XPath pretty well. Not being an XPath expert, I can’t say for sure whether the implementation is complete, but it has satisfied most of my needs when working in Python. I’ve also used lxml and PyXML, and I find etree nice because it’s a standard module.

NOTE: I’ve since found lxml, and for me it’s definitely the best XML lib out there for Python. It does XPath nicely as well (though again perhaps not a full implementation).


Answer 7

You can use the simple soupparser from lxml.

Example:

from lxml.html.soupparser import fromstring

tree = fromstring("<a>Find me!</a>")
print(tree.xpath("//a/text()"))

Answer 8

If you want the power of XPath combined with the ability to also use CSS selectors at any point, you can use parsel:

>>> from parsel import Selector
>>> sel = Selector(text=u"""<html>
        <body>
            <h1>Hello, Parsel!</h1>
            <ul>
                <li><a href="http://example.com">Link 1</a></li>
                <li><a href="http://scrapy.org">Link 2</a></li>
            </ul>
        </body>
        </html>""")
>>>
>>> sel.css('h1::text').extract_first()
'Hello, Parsel!'
>>> sel.xpath('//h1/text()').extract_first()
'Hello, Parsel!'

Answer 9

Another library is 4Suite: http://sourceforge.net/projects/foursuite/

I do not know how spec-compliant it is, but it has worked very well for my use. It looks abandoned.


Answer 10

PyXML works well.

You didn’t say what platform you’re using; however, if you’re on Ubuntu you can get it with sudo apt-get install python-xml. I’m sure other Linux distros have it as well.

If you’re on a Mac, xpath is already installed but not immediately accessible. You can set PY_USE_XMLPLUS in your environment, or do it the Python way before you import xml.xpath:

import os
import sys

if sys.platform.startswith('darwin'):
    os.environ['PY_USE_XMLPLUS'] = '1'

In the worst case you may have to build it yourself. This package is no longer maintained but still builds fine and works with modern 2.x Pythons. Basic docs are here.


Answer 11

If you are going to need it for HTML:

import lxml.html as html

root = html.fromstring(string)  # string holds the HTML source text
root.xpath('//meta')