标签归档:dom

使用Python最小化获取元素值

问题:使用Python最小化获取元素值

我正在使用Python创建Eve Online API的GUI前端。

我已经成功地从他们的服务器中提取了XML数据。

我正在尝试从名为“名称”的节点获取值:

from xml.dom.minidom import parse
dom = parse("C:\\eve.xml")
name = dom.getElementsByTagName('name')
print name

这似乎找到了节点,但是输出如下:

[<DOM Element: name at 0x11e6d28>]

我如何获得它来打印节点的值?

I am creating a GUI frontend for the Eve Online API in Python.

I have successfully pulled the XML data from their server.

I am trying to grab the value from a node called “name”:

from xml.dom.minidom import parse
dom = parse("C:\\eve.xml")
name = dom.getElementsByTagName('name')
print name

This seems to find the node, but the output is below:

[<DOM Element: name at 0x11e6d28>]

How could I get it to print the value of the node?


回答 0

应该只是

name[0].firstChild.nodeValue

It should just be

name[0].firstChild.nodeValue

回答 1

如果这是您想要的文字部分,可能是这样的。

from xml.dom.minidom import parse
dom = parse("C:\\eve.xml")
name = dom.getElementsByTagName('name')

print " ".join(t.nodeValue for t in name[0].childNodes if t.nodeType == t.TEXT_NODE)

节点的文本部分本身被视为一个节点,它作为您要的节点的子节点。因此,您将需要遍历其所有子节点,并找到所有作为文本节点的子节点。一个节点可以有多个文本节点。例如。

<name>
  blabla
  <somestuff>asdf</somestuff>
  znylpx
</name>

您同时需要“ blabla”和“ znylpx”;因此是“” .join()。您可能要用换行符代替空格,或者什么也不要。

Probably something like this if it’s the text part you want…

from xml.dom.minidom import parse
dom = parse("C:\\eve.xml")
name = dom.getElementsByTagName('name')

print " ".join(t.nodeValue for t in name[0].childNodes if t.nodeType == t.TEXT_NODE)

The text part of a node is considered a node in itself placed as a child-node of the one you asked for. Thus you will want to go through all its children and find all child nodes that are text nodes. A node can have several text nodes; eg.

<name>
  blabla
  <somestuff>asdf</somestuff>
  znylpx
</name>

You want both ‘blabla’ and ‘znylpx’; hence the ” “.join(). You might want to replace the space with a newline or so, or perhaps by nothing.


回答 2

你可以用这样的东西。它对我有用

doc = parse('C:\\eve.xml')
my_node_list = doc.getElementsByTagName("name")
my_n_node = my_node_list[0]
my_child = my_n_node.firstChild
my_text = my_child.data 
print my_text

you can use something like this.It worked out for me

doc = parse('C:\\eve.xml')
my_node_list = doc.getElementsByTagName("name")
my_n_node = my_node_list[0]
my_child = my_n_node.firstChild
my_text = my_child.data 
print my_text

回答 3

我知道这个问题现在已经很老了,但我认为您使用ElementTree可能会更轻松

from xml.etree import ElementTree as ET
import datetime

f = ET.XML(data)

for element in f:
    if element.tag == "currentTime":
        # Handle time data was pulled
        currentTime = datetime.datetime.strptime(element.text, "%Y-%m-%d %H:%M:%S")
    if element.tag == "cachedUntil":
        # Handle time until next allowed update
        cachedUntil = datetime.datetime.strptime(element.text, "%Y-%m-%d %H:%M:%S")
    if element.tag == "result":
        # Process list of skills
        pass

我知道这不是超级特定的,但是我只是发现了,到目前为止,让我的头脑比最小化要容易得多(因为很多节点本质上都是空白)。

例如,您可以将标签名称和实际文本放在一起,就像您可能期望的那样:

>>> element[0]
<Element currentTime at 40984d0>
>>> element[0].tag
'currentTime'
>>> element[0].text
'2010-04-12 02:45:45'e

I know this question is pretty old now, but I thought you might have an easier time with ElementTree

from xml.etree import ElementTree as ET
import datetime

f = ET.XML(data)

for element in f:
    if element.tag == "currentTime":
        # Handle time data was pulled
        currentTime = datetime.datetime.strptime(element.text, "%Y-%m-%d %H:%M:%S")
    if element.tag == "cachedUntil":
        # Handle time until next allowed update
        cachedUntil = datetime.datetime.strptime(element.text, "%Y-%m-%d %H:%M:%S")
    if element.tag == "result":
        # Process list of skills
        pass

I know that’s not super specific, but I just discovered it, and so far it’s a lot easier to get my head around than the minidom (since so many nodes are essentially white space).

For instance, you have the tag name and the actual text together, just as you’d probably expect:

>>> element[0]
<Element currentTime at 40984d0>
>>> element[0].tag
'currentTime'
>>> element[0].text
'2010-04-12 02:45:45'e

回答 4

上面的答案是正确的,即:

name[0].firstChild.nodeValue

但是,对我来说,和其他人一样,我的价值更进一步。

name[0].firstChild.firstChild.nodeValue

为了找到这个,我使用了以下内容:

def scandown( elements, indent ):
    for el in elements:
        print("   " * indent + "nodeName: " + str(el.nodeName) )
        print("   " * indent + "nodeValue: " + str(el.nodeValue) )
        print("   " * indent + "childNodes: " + str(el.childNodes) )
        scandown(el.childNodes, indent + 1)

scandown( doc.getElementsByTagName('text'), 0 )

对使用Inkscape创建的简单SVG文件运行此命令,这给了我:

nodeName: text
nodeValue: None
childNodes: [<DOM Element: tspan at 0x10392c6d0>]
   nodeName: tspan
   nodeValue: None
   childNodes: [<DOM Text node "'MY STRING'">]
      nodeName: #text
      nodeValue: MY STRING
      childNodes: ()
nodeName: text
nodeValue: None
childNodes: [<DOM Element: tspan at 0x10392c800>]
   nodeName: tspan
   nodeValue: None
   childNodes: [<DOM Text node "'MY WORDS'">]
      nodeName: #text
      nodeValue: MY WORDS
      childNodes: ()

我使用xml.dom.minidom,此页面MiniDom Python解释了各个字段

The above answer is correct, namely:

name[0].firstChild.nodeValue

However for me, like others, my value was further down the tree:

name[0].firstChild.firstChild.nodeValue

To find this I used the following:

def scandown( elements, indent ):
    for el in elements:
        print("   " * indent + "nodeName: " + str(el.nodeName) )
        print("   " * indent + "nodeValue: " + str(el.nodeValue) )
        print("   " * indent + "childNodes: " + str(el.childNodes) )
        scandown(el.childNodes, indent + 1)

scandown( doc.getElementsByTagName('text'), 0 )

Running this for my simple SVG file created with Inkscape this gave me:

nodeName: text
nodeValue: None
childNodes: [<DOM Element: tspan at 0x10392c6d0>]
   nodeName: tspan
   nodeValue: None
   childNodes: [<DOM Text node "'MY STRING'">]
      nodeName: #text
      nodeValue: MY STRING
      childNodes: ()
nodeName: text
nodeValue: None
childNodes: [<DOM Element: tspan at 0x10392c800>]
   nodeName: tspan
   nodeValue: None
   childNodes: [<DOM Text node "'MY WORDS'">]
      nodeName: #text
      nodeValue: MY WORDS
      childNodes: ()

I used xml.dom.minidom, the various fields are explained on this page, MiniDom Python.


回答 5

我有一个类似的案例,对我有用的是:

name.firstChild.childNodes [0] .data

XML应该很简单,实际上确实如此,我不知道为什么python的小巧性使它如此复杂…但是它是如何制作的

I had a similar case, what worked for me was:

name.firstChild.childNodes[0].data

XML is supposed to be simple and it really is and I don’t know why python’s minidom did it so complicated… but it’s how it’s made


回答 6

这是Henrik对于多个节点的稍作修改的答案(即,当getElementsByTagName返回多个实例时)

images = xml.getElementsByTagName("imageUrl")
for i in images:
    print " ".join(t.nodeValue for t in i.childNodes if t.nodeType == t.TEXT_NODE)

Here is a slightly modified answer of Henrik’s for multiple nodes (ie. when getElementsByTagName returns more than one instance)

images = xml.getElementsByTagName("imageUrl")
for i in images:
    print " ".join(t.nodeValue for t in i.childNodes if t.nodeType == t.TEXT_NODE)

回答 7

问题已经回答,我的贡献在于澄清了一件事,可能会使初学者感到困惑:

使用了一些建议的正确答案firstChild.datafirstChild.nodeValue而使用了其他答案。如果您想知道它们之间的区别是什么,您应该记住它们做相同的事情,因为nodeValue它只是的别名data

我的陈述的引用可以作为对minidom源代码的注释

nodeValue是的别名data

The question has been answered, my contribution consists in clarifying one thing that may confuse beginners:

Some of the suggested and correct answers used firstChild.data and others used firstChild.nodeValue instead. In case you are wondering what is the different between them, you should remember they do the same thing because nodeValue is just an alias for data.

The reference to my statement can be found as a comment on the source code of minidom:

#nodeValue is an alias for data


回答 8

它是一棵树,可能有嵌套的元素。尝试:

def innerText(self, sep=''):
    t = ""
    for curNode in self.childNodes:
        if (curNode.nodeType == Node.TEXT_NODE):
            t += sep + curNode.nodeValue
        elif (curNode.nodeType == Node.ELEMENT_NODE):
            t += sep + curNode.innerText(sep=sep)
    return t

It’s a tree, and there may be nested elements. Try:

def innerText(self, sep=''):
    t = ""
    for curNode in self.childNodes:
        if (curNode.nodeType == Node.TEXT_NODE):
            t += sep + curNode.nodeValue
        elif (curNode.nodeType == Node.ELEMENT_NODE):
            t += sep + curNode.innerText(sep=sep)
    return t

如何在Python中使用Xpath?

问题:如何在Python中使用Xpath?

有哪些支持Xpath的库?是否有完整的实现?图书馆如何使用?它的网站在哪里?

What are the libraries that support XPath? Is there a full implementation? How is the library used? Where is its website?


回答 0

libxml2具有许多优点:

  1. 符合规范
  2. 积极发展和社区参与
  3. 速度。这实际上是围绕C实现的python包装器。
  4. 无处不在。libxml2库无处不在,因此经过了充分的测试。

缺点包括:

  1. 符合规范。严格 在其他库中,诸如默认命名空间处理之类的事情会更容易。
  2. 使用本机代码。这可能会很麻烦,具体取决于您的应用程序的分发/部署方式。可使用RPM来减轻这种痛苦。
  3. 手动资源处理。请注意下面的示例中对freeDoc()和xpathFreeContext()的调用。这不是很Pythonic。

如果您要进行简单的路径选择,请坚持使用ElementTree(Python 2.5附带)。如果您需要完全符合规范或原始速度并且可以应付本机代码的分发,请使用libxml2。

libxml2 XPath使用示例


import libxml2

doc = libxml2.parseFile("tst.xml")
ctxt = doc.xpathNewContext()
res = ctxt.xpathEval("//*")
if len(res) != 2:
    print "xpath query: wrong node set size"
    sys.exit(1)
if res[0].name != "doc" or res[1].name != "foo":
    print "xpath query: wrong node set value"
    sys.exit(1)
doc.freeDoc()
ctxt.xpathFreeContext()

ElementTree XPath使用示例


from elementtree.ElementTree import ElementTree
mydoc = ElementTree(file='tst.xml')
for e in mydoc.findall('/foo/bar'):
    print e.get('title').text

libxml2 has a number of advantages:

  1. Compliance to the spec
  2. Active development and a community participation
  3. Speed. This is really a python wrapper around a C implementation.
  4. Ubiquity. The libxml2 library is pervasive and thus well tested.

Downsides include:

  1. Compliance to the spec. It’s strict. Things like default namespace handling are easier in other libraries.
  2. Use of native code. This can be a pain depending on your how your application is distributed / deployed. RPMs are available that ease some of this pain.
  3. Manual resource handling. Note in the sample below the calls to freeDoc() and xpathFreeContext(). This is not very Pythonic.

If you are doing simple path selection, stick with ElementTree ( which is included in Python 2.5 ). If you need full spec compliance or raw speed and can cope with the distribution of native code, go with libxml2.

Sample of libxml2 XPath Use


import libxml2

doc = libxml2.parseFile("tst.xml")
ctxt = doc.xpathNewContext()
res = ctxt.xpathEval("//*")
if len(res) != 2:
    print "xpath query: wrong node set size"
    sys.exit(1)
if res[0].name != "doc" or res[1].name != "foo":
    print "xpath query: wrong node set value"
    sys.exit(1)
doc.freeDoc()
ctxt.xpathFreeContext()

Sample of ElementTree XPath Use


from elementtree.ElementTree import ElementTree
mydoc = ElementTree(file='tst.xml')
for e in mydoc.findall('/foo/bar'):
    print e.get('title').text


回答 1

LXML包支持XPath。尽管我在self ::轴上遇到了一些麻烦,但它似乎工作得很好。还有Amara,但是我还没有亲自使用过。

The lxml package supports xpath. It seems to work pretty well, although I’ve had some trouble with the self:: axis. There’s also Amara, but I haven’t used it personally.


回答 2

在这里听起来像lxml广告。;)ElementTree包含在std库中。在2.6及以下版本中,它的xpath相当弱,但在2.7+中则大大改善了

import xml.etree.ElementTree as ET
root = ET.parse(filename)
result = ''

for elem in root.findall('.//child/grandchild'):
    # How to make decisions based on attributes even in 2.6:
    if elem.attrib.get('name') == 'foo':
        result = elem.text
        break

Sounds like an lxml advertisement in here. ;) ElementTree is included in the std library. Under 2.6 and below its xpath is pretty weak, but in 2.7+ much improved:

import xml.etree.ElementTree as ET
root = ET.parse(filename)
result = ''

for elem in root.findall('.//child/grandchild'):
    # How to make decisions based on attributes even in 2.6:
    if elem.attrib.get('name') == 'foo':
        result = elem.text
        break

回答 3

使用LXML。LXML充分利用了libxml2和libxslt的功能,但是将它们包装在比这些库中固有的Python绑定更多的“ Pythonic”绑定中。这样,它将获得完整的XPath 1.0实现。本机ElemenTree支持XPath的有限子集,尽管它可能足以满足您的需求。

Use LXML. LXML uses the full power of libxml2 and libxslt, but wraps them in more “Pythonic” bindings than the Python bindings that are native to those libraries. As such, it gets the full XPath 1.0 implementation. Native ElemenTree supports a limited subset of XPath, although it may be good enough for your needs.


回答 4

另一个选项是py-dom-xpath,它可以与minidom无缝协作,并且是纯Python,因此可以在appengine上运行。

import xpath
xpath.find('//item', doc)

Another option is py-dom-xpath, it works seamlessly with minidom and is pure Python so works on appengine.

import xpath
xpath.find('//item', doc)

回答 5

您可以使用:

PyXML

from xml.dom.ext.reader import Sax2
from xml import xpath
doc = Sax2.FromXmlFile('foo.xml').documentElement
for url in xpath.Evaluate('//@Url', doc):
  print url.value

libxml2

import libxml2
doc = libxml2.parseFile('foo.xml')
for url in doc.xpathEval('//@Url'):
  print url.content

You can use:

PyXML:

from xml.dom.ext.reader import Sax2
from xml import xpath
doc = Sax2.FromXmlFile('foo.xml').documentElement
for url in xpath.Evaluate('//@Url', doc):
  print url.value

libxml2:

import libxml2
doc = libxml2.parseFile('foo.xml')
for url in doc.xpathEval('//@Url'):
  print url.content

回答 6

最新版本的elementtree很好地支持XPath。我不是XPath专家,我不能肯定地说实现是否完整,但是在使用Python时它可以满足我的大多数需求。我也使用了lxml和PyXML,我发现etree很不错,因为它是一个标准模块。

注意:从那以后我就找到了lxml,对我来说,它绝对是Python最好的XML库。它也很好地完成了XPath(尽管可能不是完整的实现)。

The latest version of elementtree supports XPath pretty well. Not being an XPath expert I can’t say for sure if the implementation is full but it has satisfied most of my needs when working in Python. I’ve also use lxml and PyXML and I find etree nice because it’s a standard module.

NOTE: I’ve since found lxml and for me it’s definitely the best XML lib out there for Python. It does XPath nicely as well (though again perhaps not a full implementation).


回答 7

您可以使用简单soupparserlxml

例:

from lxml.html.soupparser import fromstring

tree = fromstring("<a>Find me!</a>")
print tree.xpath("//a/text()")

You can use the simple soupparser from lxml

Example:

from lxml.html.soupparser import fromstring

tree = fromstring("<a>Find me!</a>")
print tree.xpath("//a/text()")

回答 8

如果您希望同时拥有XPATH的功能和使用CSS的能力,则可以使用parsel

>>> from parsel import Selector
>>> sel = Selector(text=u"""<html>
        <body>
            <h1>Hello, Parsel!</h1>
            <ul>
                <li><a href="http://example.com">Link 1</a></li>
                <li><a href="http://scrapy.org">Link 2</a></li>
            </ul
        </body>
        </html>""")
>>>
>>> sel.css('h1::text').extract_first()
'Hello, Parsel!'
>>> sel.xpath('//h1/text()').extract_first()
'Hello, Parsel!'

If you want to have the power of XPATH combined with the ability to also use CSS at any point you can use parsel:

>>> from parsel import Selector
>>> sel = Selector(text=u"""<html>
        <body>
            <h1>Hello, Parsel!</h1>
            <ul>
                <li><a href="http://example.com">Link 1</a></li>
                <li><a href="http://scrapy.org">Link 2</a></li>
            </ul
        </body>
        </html>""")
>>>
>>> sel.css('h1::text').extract_first()
'Hello, Parsel!'
>>> sel.xpath('//h1/text()').extract_first()
'Hello, Parsel!'

回答 9

另一个库是4Suite:http//sourceforge.net/projects/foursuite/

我不知道它是如何符合规范的。但这对我来说非常有效。它看起来被遗弃了。

Another library is 4Suite: http://sourceforge.net/projects/foursuite/

I do not know how spec-compliant it is. But it has worked very well for my use. It looks abandoned.


回答 10

PyXML运作良好。

您没有说要使用什么平台,但是如果您使用的是Ubuntu,则可以使用sudo apt-get install python-xml。我敢肯定其他Linux发行版也有。

如果您使用的是Mac,则xpath已安装但无法立即访问。可以PY_USE_XMLPLUS在导入xml.xpath之前在您的环境中进行设置或以Python方式进行设置:

if sys.platform.startswith('darwin'):
    os.environ['PY_USE_XMLPLUS'] = '1'

在最坏的情况下,您可能必须自己构建它。该软件包不再维护,但仍然可以正常运行,并且可以与现代2.x Python一起使用。基本文档在这里

PyXML works well.

You didn’t say what platform you’re using, however if you’re on Ubuntu you can get it with sudo apt-get install python-xml. I’m sure other Linux distros have it as well.

If you’re on a Mac, xpath is already installed but not immediately accessible. You can set PY_USE_XMLPLUS in your environment or do it the Python way before you import xml.xpath:

if sys.platform.startswith('darwin'):
    os.environ['PY_USE_XMLPLUS'] = '1'

In the worst case you may have to build it yourself. This package is no longer maintained but still builds fine and works with modern 2.x Pythons. Basic docs are here.


回答 11

如果您需要html

import lxml.html as html
root  = html.fromstring(string)
root.xpath('//meta')

If you are going to need it for html:

import lxml.html as html
root  = html.fromstring(string)
root.xpath('//meta')