Tag archive: xpath

pandas read_xml() method test strategies

Question: pandas read_xml() method test strategies

Currently, the pandas I/O tools do not provide a read_xml() method or its counterpart, to_xml(). However, read_json shows that tree-like structures can be imported into dataframes, and read_html shows the same for markup formats.

If the pandas team does consider such a read_xml method for a future pandas version, what implementation would they pursue: parsing with the built-in xml.etree.ElementTree and its iterfind() or iterparse() functions, or with the third-party module lxml and its XPath 1.0 and XSLT 1.0 methods?

Below are my test runs for four method types on a simple, flat, element-centric XML input. All are set up for generalized parsing of any second-level children of the root, and each method should yield exactly the same pandas dataframe. All but the last call pd.DataFrame() on a list of dictionaries. The XSLT method transforms the XML to CSV text, which is wrapped in StringIO() and passed to pd.read_csv().
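As a rough illustration of the two routes described above, here is a minimal sketch on a two-record sample. It is an assumption-laden stand-in: the inline XML is a shortened copy of the input, and the standard-library csv module plays the role of both the XSLT transform and pd.read_csv(), so that the sketch runs without lxml or pandas.

```python
import csv
import io
import xml.etree.ElementTree as et

# Shortened, hypothetical sample in the same shape as the input data.
XML = """<stackoverflow>
  <topusers><user>jezrael</user><year_rep>5,740</year_rep><tag1>pandas</tag1></topusers>
  <topusers><user>VonC</user><year_rep>5,577</year_rep><tag1>git</tag1></topusers>
</stackoverflow>"""

root = et.fromstring(XML)

# Route 1: one dict per second-level child, keyed by the grandchildren's tags.
rows = [{child.tag: child.text for child in row} for row in root]

# Route 2: flatten to CSV text first. QUOTE_ALL keeps "5,740" in one field,
# which is the same reason the XSLT wraps every value in double quotes.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Reading the CSV text back gives the same records as route 1.
rows_from_csv = list(csv.DictReader(io.StringIO(csv_text)))
print(csv_text)
```

With pandas installed, pd.DataFrame(rows) and pd.read_csv(io.StringIO(csv_text)) would be the corresponding calls. Note that read_csv infers column types, so the two dataframes may differ in dtype even when the text values match.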

Question (multi-part)

  • PERFORMANCE: How do you explain that iterparse, often recommended for larger files because the file is parsed iteratively, is the slower method? Is it partly due to the if logic checks?

  • MEMORY: Does CPU memory usage correlate with timings in I/O calls? XSLT and XPath 1.0 tend not to scale well with larger XML documents, as the entire file must be read into memory to be parsed.

  • STRATEGY: Is a list of dictionaries an optimal strategy for the DataFrame() call? See these interesting answers: a generator version and an iterwalk user-defined version. Both upcast lists to a dataframe.
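One way to probe the memory and strategy questions is to compare a full-tree parse with a streaming generator under tracemalloc. The following is only a sketch under stated assumptions: the 2,000 synthetic records, the row_tag parameter, and the helper names are hypothetical, and tracemalloc only sees allocations made through Python's allocators, so the numbers are indicative at best.

```python
import io
import tracemalloc
import xml.etree.ElementTree as et

# Hypothetical stand-in for a larger Input.xml: 2,000 flat records in memory.
RECORD = "<topusers><user>u{0}</user><year_rep>{0}</year_rep><tag1>python</tag1></topusers>"
XML = ("<stackoverflow>" + "".join(RECORD.format(n) for n in range(2000))
       + "</stackoverflow>").encode("utf-8")


def rows_full_tree(source):
    """Build the whole tree first, then walk the second-level children."""
    root = et.parse(source).getroot()
    return [{child.tag: child.text for child in row} for row in root]


def rows_streaming(source, row_tag="topusers"):
    """Yield one dict per finished row element, then empty that element."""
    for _event, el in et.iterparse(source, events=("end",)):
        if el.tag == row_tag:
            yield {child.tag: child.text for child in el}
            # Drop the row's children; the emptied element itself
            # stays attached to the root.
            el.clear()


def peak_bytes(func):
    """Return (result, peak traced memory in bytes) for a zero-argument callable."""
    tracemalloc.start()
    result = func()
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak


full_rows, full_peak = peak_bytes(lambda: rows_full_tree(io.BytesIO(XML)))
stream_rows, stream_peak = peak_bytes(lambda: list(rows_streaming(io.BytesIO(XML))))
print(f"full tree: {full_peak:,} bytes peak; streaming: {stream_peak:,} bytes peak")
```

Both functions should return the same list of dictionaries; only the peak memory differs. Since the final list of rows is held in memory either way, any saving comes from not keeping the parsed tree alongside it.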

Input Data (Stack Overflow's current top users by year, among whom our pandas friends are included)

<?xml version="1.0" encoding="utf-8"?>
<stackoverflow>
  <topusers>
    <user>Gordon Linoff</user>
    <link>http://www.stackoverflow.com//users/1144035/gordon-linoff</link>
    <location>New York, United States</location>
    <year_rep>5,985</year_rep>
    <total_rep>499,408</total_rep>
    <tag1>sql</tag1>
    <tag2>sql-server</tag2>
    <tag3>mysql</tag3>
  </topusers>
  <topusers>
    <user>Günter Zöchbauer</user>
    <link>http://www.stackoverflow.com//users/217408/g%c3%bcnter-z%c3%b6chbauer</link>
    <location>Linz, Austria</location>
    <year_rep>5,835</year_rep>
    <total_rep>154,439</total_rep>
    <tag1>angular2</tag1>
    <tag2>typescript</tag2>
    <tag3>javascript</tag3>
  </topusers>
  <topusers>
    <user>jezrael</user>
    <link>http://www.stackoverflow.com//users/2901002/jezrael</link>
    <location>Bratislava, Slovakia</location>
    <year_rep>5,740</year_rep>
    <total_rep>83,237</total_rep>
    <tag1>pandas</tag1>
    <tag2>python</tag2>
    <tag3>dataframe</tag3>
  </topusers>
  <topusers>
    <user>VonC</user>
    <link>http://www.stackoverflow.com//users/6309/vonc</link>
    <location>France</location>
    <year_rep>5,577</year_rep>
    <total_rep>651,397</total_rep>
    <tag1>git</tag1>
    <tag2>github</tag2>
    <tag3>docker</tag3>
  </topusers>
  <topusers>
    <user>Martijn Pieters</user>
    <link>http://www.stackoverflow.com//users/100297/martijn-pieters</link>
    <location>Cambridge, United Kingdom</location>
    <year_rep>5,337</year_rep>
    <total_rep>525,176</total_rep>
    <tag1>python</tag1>
    <tag2>python-3.x</tag2>
    <tag3>python-2.7</tag3>
  </topusers>
  <topusers>
    <user>T.J. Crowder</user>
    <link>http://www.stackoverflow.com//users/157247/t-j-crowder</link>
    <location>United Kingdom</location>
    <year_rep>5,258</year_rep>
    <total_rep>508,310</total_rep>
    <tag1>javascript</tag1>
    <tag2>jquery</tag2>
    <tag3>java</tag3>
  </topusers>
  <topusers>
    <user>akrun</user>
    <link>http://www.stackoverflow.com//users/3732271/akrun</link>
    <location></location>
    <year_rep>5,188</year_rep>
    <total_rep>229,553</total_rep>
    <tag1>r</tag1>
    <tag2>dplyr</tag2>
    <tag3>dataframe</tag3>
  </topusers>
  <topusers>
    <user>Wiktor Stribiżew</user>
    <link>http://www.stackoverflow.com//users/3832970/wiktor-stribi%c5%bcew</link>
    <location>Warsaw, Poland</location>
    <year_rep>4,948</year_rep>
    <total_rep>158,134</total_rep>
    <tag1>regex</tag1>
    <tag2>javascript</tag2>
    <tag3>c#</tag3>
  </topusers>
  <topusers>
    <user>Darin Dimitrov</user>
    <link>http://www.stackoverflow.com//users/29407/darin-dimitrov</link>
    <location>Sofia, Bulgaria</location>
    <year_rep>4,936</year_rep>
    <total_rep>709,683</total_rep>
    <tag1>c#</tag1>
    <tag2>asp.net-mvc</tag2>
    <tag3>asp.net-mvc-3</tag3>
  </topusers>
  <topusers>
    <user>Eric Duminil</user>
    <link>http://www.stackoverflow.com//users/6419007/eric-duminil</link>
    <location></location>
    <year_rep>4,854</year_rep>
    <total_rep>12,557</total_rep>
    <tag1>ruby</tag1>
    <tag2>ruby-on-rails</tag2>
    <tag3>arrays</tag3>
  </topusers>
  <topusers>
    <user>alecxe</user>
    <link>http://www.stackoverflow.com//users/771848/alecxe</link>
    <location>New York, United States</location>
    <year_rep>4,723</year_rep>
    <total_rep>233,368</total_rep>
    <tag1>python</tag1>
    <tag2>selenium</tag2>
    <tag3>protractor</tag3>
  </topusers>
  <topusers>
    <user>Jean-François Fabre</user>
    <link>http://www.stackoverflow.com//users/6451573/jean-fran%c3%a7ois-fabre</link>
    <location>Toulouse, France</location>
    <year_rep>4,526</year_rep>
    <total_rep>30,027</total_rep>
    <tag1>python</tag1>
    <tag2>python-3.x</tag2>
    <tag3>python-2.7</tag3>
  </topusers>
  <topusers>
    <user>piRSquared</user>
    <link>http://www.stackoverflow.com//users/2336654/pirsquared</link>
    <location>Bellevue, WA, United States</location>
    <year_rep>4,482</year_rep>
    <total_rep>41,183</total_rep>
    <tag1>pandas</tag1>
    <tag2>python</tag2>
    <tag3>dataframe</tag3>
  </topusers>
  <topusers>
    <user>CommonsWare</user>
    <link>http://www.stackoverflow.com//users/115145/commonsware</link>
    <location>Who Wants to Know?</location>
    <year_rep>4,475</year_rep>
    <total_rep>616,135</total_rep>
    <tag1>android</tag1>
    <tag2>java</tag2>
    <tag3>android-intent</tag3>
  </topusers>
  <topusers>
    <user>Quentin</user>
    <link>http://www.stackoverflow.com//users/19068/quentin</link>
    <location>United Kingdom</location>
    <year_rep>4,464</year_rep>
    <total_rep>509,365</total_rep>
    <tag1>javascript</tag1>
    <tag2>html</tag2>
    <tag3>css</tag3>
  </topusers>
  <topusers>
    <user>Jon Skeet</user>
    <link>http://www.stackoverflow.com//users/22656/jon-skeet</link>
    <location>Reading, United Kingdom</location>
    <year_rep>4,348</year_rep>
    <total_rep>921,690</total_rep>
    <tag1>c#</tag1>
    <tag2>java</tag2>
    <tag3>.net</tag3>
  </topusers>
  <topusers>
    <user>Felix Kling</user>
    <link>http://www.stackoverflow.com//users/218196/felix-kling</link>
    <location>Sunnyvale, CA</location>
    <year_rep>4,324</year_rep>
    <total_rep>411,535</total_rep>
    <tag1>javascript</tag1>
    <tag2>jquery</tag2>
    <tag3>asynchronous</tag3>
  </topusers>
  <topusers>
    <user>matt</user>
    <link>http://www.stackoverflow.com//users/341994/matt</link>
    <location></location>
    <year_rep>4,313</year_rep>
    <total_rep>220,515</total_rep>
    <tag1>swift</tag1>
    <tag2>ios</tag2>
    <tag3>xcode</tag3>
  </topusers>
  <topusers>
    <user>Psidom</user>
    <link>http://www.stackoverflow.com//users/4983450/psidom</link>
    <location>Atlanta, GA, United States</location>
    <year_rep>4,236</year_rep>
    <total_rep>36,950</total_rep>
    <tag1>python</tag1>
    <tag2>pandas</tag2>
    <tag3>r</tag3>
  </topusers>
  <topusers>
    <user>Martin R</user>
    <link>http://www.stackoverflow.com//users/1187415/martin-r</link>
    <location>Germany</location>
    <year_rep>4,195</year_rep>
    <total_rep>269,380</total_rep>
    <tag1>swift</tag1>
    <tag2>ios</tag2>
    <tag3>swift3</tag3>
  </topusers>
  <topusers>
    <user>Barmar</user>
    <link>http://www.stackoverflow.com//users/1491895/barmar</link>
    <location>Arlington, MA</location>
    <year_rep>4,179</year_rep>
    <total_rep>289,989</total_rep>
    <tag1>javascript</tag1>
    <tag2>php</tag2>
    <tag3>jquery</tag3>
  </topusers>
  <topusers>
    <user>Alexey Mezenin</user>
    <link>http://www.stackoverflow.com//users/1227923/alexey-mezenin</link>
    <location>??????</location>
    <year_rep>4,142</year_rep>
    <total_rep>31,602</total_rep>
    <tag1>laravel</tag1>
    <tag2>php</tag2>
    <tag3>laravel-5.3</tag3>
  </topusers>
  <topusers>
    <user>BalusC</user>
    <link>http://www.stackoverflow.com//users/157882/balusc</link>
    <location>Amsterdam, Netherlands</location>
    <year_rep>4,046</year_rep>
    <total_rep>703,046</total_rep>
    <tag1>java</tag1>
    <tag2>jsf</tag2>
    <tag3>servlets</tag3>
  </topusers>
  <topusers>
    <user>GurV</user>
    <link>http://www.stackoverflow.com//users/6348498/gurv</link>
    <location></location>
    <year_rep>4,016</year_rep>
    <total_rep>7,932</total_rep>
    <tag1>sql</tag1>
    <tag2>mysql</tag2>
    <tag3>sql-server</tag3>
  </topusers>
  <topusers>
    <user>Nina Scholz</user>
    <link>http://www.stackoverflow.com//users/1447675/nina-scholz</link>
    <location>Berlin, Deutschland</location>
    <year_rep>3,950</year_rep>
    <total_rep>61,135</total_rep>
    <tag1>javascript</tag1>
    <tag2>arrays</tag2>
    <tag3>object</tag3>
  </topusers>
  <topusers>
    <user>JB Nizet</user>
    <link>http://www.stackoverflow.com//users/571407/jb-nizet</link>
    <location>Saint-Etienne, France</location>
    <year_rep>3,923</year_rep>
    <total_rep>418,780</total_rep>
    <tag1>java</tag1>
    <tag2>hibernate</tag2>
    <tag3>java-8</tag3>
  </topusers>
  <topusers>
    <user>Frank van Puffelen</user>
    <link>http://www.stackoverflow.com//users/209103/frank-van-puffelen</link>
    <location>San Francisco, CA</location>
    <year_rep>3,920</year_rep>
    <total_rep>86,520</total_rep>
    <tag1>firebase</tag1>
    <tag2>firebase-database</tag2>
    <tag3>android</tag3>
  </topusers>
  <topusers>
    <user>dasblinkenlight</user>
    <link>http://www.stackoverflow.com//users/335858/dasblinkenlight</link>
    <location>United States</location>
    <year_rep>3,886</year_rep>
    <total_rep>475,813</total_rep>
    <tag1>c#</tag1>
    <tag2>java</tag2>
    <tag3>c++</tag3>
  </topusers>
  <topusers>
    <user>Tim Biegeleisen</user>
    <link>http://www.stackoverflow.com//users/1863229/tim-biegeleisen</link>
    <location>Singapore</location>
    <year_rep>3,814</year_rep>
    <total_rep>77,211</total_rep>
    <tag1>sql</tag1>
    <tag2>mysql</tag2>
    <tag3>java</tag3>
  </topusers>
  <topusers>
    <user>Greg Hewgill</user>
    <link>http://www.stackoverflow.com//users/893/greg-hewgill</link>
    <location>Christchurch, New Zealand</location>
    <year_rep>3,796</year_rep>
    <total_rep>529,137</total_rep>
    <tag1>git</tag1>
    <tag2>python</tag2>
    <tag3>git-pull</tag3>
  </topusers>
  <topusers>
    <user>unutbu</user>
    <link>http://www.stackoverflow.com//users/190597/unutbu</link>
    <location></location>
    <year_rep>3,735</year_rep>
    <total_rep>401,595</total_rep>
    <tag1>python</tag1>
    <tag2>pandas</tag2>
    <tag3>numpy</tag3>
  </topusers>
  <topusers>
    <user>Hans Passant</user>
    <link>http://www.stackoverflow.com//users/17034/hans-passant</link>
    <location>Madison, WI</location>
    <year_rep>3,688</year_rep>
    <total_rep>672,118</total_rep>
    <tag1>c#</tag1>
    <tag2>.net</tag2>
    <tag3>winforms</tag3>
  </topusers>
  <topusers>
    <user>Jonathan Leffler</user>
    <link>http://www.stackoverflow.com//users/15168/jonathan-leffler</link>
    <location>California, USA</location>
    <year_rep>3,649</year_rep>
    <total_rep>455,157</total_rep>
    <tag1>c</tag1>
    <tag2>bash</tag2>
    <tag3>unix</tag3>
  </topusers>
  <topusers>
    <user>paxdiablo</user>
    <link>http://www.stackoverflow.com//users/14860/paxdiablo</link>
    <location></location>
    <year_rep>3,636</year_rep>
    <total_rep>507,043</total_rep>
    <tag1>c</tag1>
    <tag2>c++</tag2>
    <tag3>bash</tag3>
  </topusers>
  <topusers>
    <user>Pranav C Balan</user>
    <link>http://www.stackoverflow.com//users/3037257/pranav-c-balan</link>
    <location>Ramanthali, Kannur, Kerala, India</location>
    <year_rep>3,604</year_rep>
    <total_rep>64,476</total_rep>
    <tag1>javascript</tag1>
    <tag2>jquery</tag2>
    <tag3>html</tag3>
  </topusers>
  <topusers>
    <user>Suragch</user>
    <link>http://www.stackoverflow.com//users/3681880/suragch</link>
    <location>Hohhot, China</location>
    <year_rep>3,580</year_rep>
    <total_rep>71,032</total_rep>
    <tag1>swift</tag1>
    <tag2>ios</tag2>
    <tag3>android</tag3>
  </topusers>
</stackoverflow>

Python Methods

import xml.etree.ElementTree as et
import pandas as pd
from io import StringIO
from lxml import etree as lxet

def read_xml_iterfind():
    tree = et.parse('Input.xml')

    data = []
    inner = {}
    for el in tree.iterfind('./*'):
        for i in el.iterfind('*'):
            inner[i.tag] = i.text
        data.append(inner)
        inner = {}

    df = pd.DataFrame(data)

def read_xml_iterparse():
    data = []
    inner = {}
    i = 1
    for (ev, el) in et.iterparse(path):
        if i <= 2:
           first_tag = el.tag

        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)            
            inner = {}

        if el.text is not None and len(el.text.strip()) > 0:
            inner[el.tag] = el.text
    i += 1

    df = pd.DataFrame(data)    

def read_xml_lxml_xpath():     
    tree = lxet.parse('Input.xml')

    data = []
    inner = {}
    for el in tree.xpath('/*/*'):
        for i in el:
            inner[i.tag] = i.text
        data.append(inner)
        inner = {}

    df = pd.DataFrame(data)

def read_xml_lxml_xsl():     
    xml = lxet.parse('Input.xml')

    xslstr = '''
    <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output version="1.0" encoding="UTF-8" indent="yes"  method="text"/>
        <xsl:strip-space elements="*"/>

        <!-- HEADERS -->
        <xsl:template match = "/*">
            <xsl:for-each select="*[1]/*">
              <xsl:value-of select="local-name()" />
                <xsl:choose>
                   <xsl:when test="position() != last()">
                      <xsl:text>,</xsl:text>
                   </xsl:when>
                   <xsl:otherwise>
                      <xsl:text>&#xa;</xsl:text>
                   </xsl:otherwise>                              
                </xsl:choose>   
            </xsl:for-each>
            <xsl:apply-templates/>
        </xsl:template>

        <!-- DATA ROWS (COMMA-SEPARATED) -->
        <xsl:template match="/*/*" priority="2">    
            <xsl:for-each select="*">
              <xsl:if test="position() = 1">
                   <xsl:text>&quot;</xsl:text>
              </xsl:if>
              <xsl:value-of select="." />
                <xsl:choose>
                   <xsl:when test="position() != last()">
                      <xsl:text>&quot;,&quot;</xsl:text>
                   </xsl:when>
                   <xsl:otherwise>
                      <xsl:text>&quot;&#xa;</xsl:text>
                   </xsl:otherwise>                              
                </xsl:choose>
            </xsl:for-each>
        </xsl:template>

    </xsl:transform>
    '''
    xsl = lxet.fromstring(xslstr)

    transform = lxet.XSLT(xsl)
    newdom = transform(xml)

    df = pd.read_csv(StringIO(str(newdom)))

Timings (with the current XML and with an XML holding 25 times the children, i.e., 900 Stack Overflow user records)

# SHORTER FILE
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_iterfind()'
100 loops, best of 3: 3.87 msec per loop

python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_iterparse()'
100 loops, best of 3: 5.5 msec per loop

python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_lxml_xpath()'
100 loops, best of 3: 3.86 msec per loop

python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_lxml_xsl()'
100 loops, best of 3: 5.68 msec per loop

# LARGER FILE
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_iterfind()'
100 loops, best of 3: 36 msec per loop

python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_iterparse()'
100 loops, best of 3: 78.9 msec per loop

python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_lxml_xpath()'
100 loops, best of 3: 32.7 msec per loop

python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_lxml_xsl()'
100 loops, best of 3: 51.4 msec per loop
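
The same best-of-three measurement can also be taken from inside Python with the standard timeit module. This is only a sketch: parse_rows and the generated XML string are hypothetical stand-ins for the read_xml_* functions and the test file, which are not reproduced here.

```python
import timeit
import xml.etree.ElementTree as et

# Hypothetical stand-in data: 100 flat rows with two child elements each.
XML = "<root>" + "<row><a>1</a><b>2</b></row>" * 100 + "</root>"

def parse_rows():
    # Stand-in for the read_xml_* functions being compared.
    root = et.fromstring(XML)
    return [{child.tag: child.text for child in row} for row in root]

# Three repeats of 100 calls each; report the best repeat, per call.
timings = timeit.repeat(parse_rows, repeat=3, number=100)
best_ms = min(timings) / 100 * 1000
print(f"100 loops, best of 3: {best_ms:.3g} msec per loop")
```

Taking the minimum of the repeats mirrors what the command-line tool reports as "best of 3".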

Answer 0

PERFORMANCE: How do you explain iterparse being slower, even though it is often recommended for larger files because the file is parsed iteratively? Is it partly due to the if logic checks?

I would assume that more Python code makes it slower, as the Python code is evaluated on every iteration. Have you tried a JIT compiler like PyPy?

If I remove i and use first_tag only, it seems to be quite a bit faster, so yes, it is partly due to the if logic checks:

def read_xml_iterparse2(path):
    data = []
    inner = {}
    first_tag = None
    for (ev, el) in et.iterparse(path):
        if not first_tag:
           first_tag = el.tag

        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)            
            inner = {}

        if el.text is not None and len(el.text.strip()) > 0:
            inner[el.tag] = el.text

    df = pd.DataFrame(data)    

%timeit read_xml_iterparse(path)
# 10 loops, best of 5: 33 ms per loop
%timeit read_xml_iterparse2(path)
# 10 loops, best of 5: 23 ms per loop

I wasn’t sure I understood the purpose of the last if check, but I’m also not sure why you would want to lose whitespace-only elements. Removing the last if consistently shaves off a little bit of time:

def read_xml_iterparse3(path):
    data = []
    inner = {}
    first_tag = None
    for (ev, el) in et.iterparse(path):
        if not first_tag:
           first_tag = el.tag

        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)            
            inner = {}

        inner[el.tag] = el.text

    df = pd.DataFrame(data)    

%timeit read_xml_iterparse(path)
# 10 loops, best of 5: 34.4 ms per loop
%timeit read_xml_iterparse2(path)
# 10 loops, best of 5: 24.5 ms per loop
%timeit read_xml_iterparse3(path)
# 10 loops, best of 5: 20.9 ms per loop

Now, with or without those performance improvements, your iterparse version seems to produce an extra-large dataframe. Here is what seems to be a working, fast version:

def read_xml_iterparse5(path):
    data = []
    inner = {}
    for (ev, el) in et.iterparse(path):
        # /ending parents trigger a new row, and in our case .text is \n followed by spaces.  it would be more reliable to pass 'topusers' to our read_xml_iterparse5 as the .tag to check
        if el.text and el.text[0] == '\n':
            # ignore /stackoverflow
            if inner:
                data.append(inner)
                inner = {}
        else:
            inner[el.tag] = el.text

    return pd.DataFrame(data)    

print(read_xml_iterfind(path).shape)
# (900, 8)
print(read_xml_iterparse(path).shape)
# (7050, 8)
print(read_xml_lxml_xpath(path).shape)
# (900, 8)
print(read_xml_lxml_xsl(path).shape)
# (900, 8)
print(read_xml_iterparse5(path).shape)
# (900, 8)
%timeit read_xml_iterparse5(path)
# 10 loops, best of 5: 20.6 ms per loop
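
The comment in read_xml_iterparse5 suggests passing the row tag explicitly instead of relying on the whitespace in .text. Below is a minimal sketch of that idea, assuming flat row elements. The name iterparse_rows and the two-record sample are illustrative only, and the function returns the list of dicts that would then be handed to pd.DataFrame().

```python
import io
import xml.etree.ElementTree as et

def iterparse_rows(source, row_tag):
    """Collect one dict per <row_tag> element; the tag is supplied by the caller."""
    rows = []
    for event, el in et.iterparse(source, events=("end",)):
        if el.tag == row_tag:
            # The row's children are complete once its end event fires.
            rows.append({child.tag: child.text for child in el})
            el.clear()  # drop the children that have already been copied
    return rows

# Illustrative sample in the same shape as the input data.
sample = io.BytesIO(
    b"<stackoverflow>"
    b"<topusers><user>jezrael</user><tag1>pandas</tag1></topusers>"
    b"<topusers><user>VonC</user><tag1>git</tag1></topusers>"
    b"</stackoverflow>"
)
rows = iterparse_rows(sample, "topusers")
print(rows)
```

Matching on the tag keeps the row boundary independent of how the file happens to be indented.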

MEMORY: Does CPU memory correlate with timings in I/O calls? XSLT and XPath 1.0 tend not to scale well with larger XML documents, as the entire file must be read into memory to be parsed.

I’m not totally sure what you mean by “I/O calls” but if your document is small enough to fit in cache, then everything will be much faster as it won’t evict many other items from the cache.

STRATEGY: Is a list of dictionaries an optimal strategy for the DataFrame() call? See these interesting answers: a generator version and an iterwalk user-defined version. Both upcast lists to a dataframe.

The lists use less memory, so depending on how many columns you have, it could make a noticeable difference. Of course, this then requires your XML tags to be in a consistent order, which they do appear to be. The DataFrame() call would also need to do less work, as it doesn’t have to look up keys in the dict on every row to figure out which column each value belongs to.
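
A sketch of the list-based alternative, assuming every row element carries its children in the same order. The helper name rows_as_lists and the sample XML are made up for illustration, and the pandas call appears only in a comment.

```python
import xml.etree.ElementTree as et

def rows_as_lists(xml_text):
    root = et.fromstring(xml_text)
    # Column names come from the first row only; later rows are assumed to match.
    columns = [child.tag for child in root[0]] if len(root) else []
    rows = [[child.text for child in row] for row in root]
    return columns, rows

columns, rows = rows_as_lists(
    "<stackoverflow>"
    "<topusers><user>jezrael</user><tag1>pandas</tag1></topusers>"
    "<topusers><user>VonC</user><tag1>git</tag1></topusers>"
    "</stackoverflow>"
)
print(columns)
print(rows)
# With pandas installed, the frame would then be built as:
# df = pd.DataFrame(rows, columns=columns)
```

If the tag order can vary between rows, the dict-based version is the safer choice, since the positional lists would silently put values in the wrong columns.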


Can we use XPath with BeautifulSoup?

Question: Can we use XPath with BeautifulSoup?

I am using BeautifulSoup to scrape a URL and I have the following code:

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td',attrs={'class':'empformbody'})

In the above code we can use findAll to get tags and the information related to them, but I want to use XPath. Is it possible to use XPath with BeautifulSoup? If so, could anyone please provide some example code?


Answer 0

Nope, BeautifulSoup, by itself, does not support XPath expressions.

An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode where it’ll try and parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe is faster.

Once you’ve parsed your document into an lxml tree, you can use the .xpath() method to search for elements.

try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen
from lxml import etree

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)

There is also a dedicated lxml.html module with additional functionality.

Note that in the above example I passed the response object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests library, you want to set stream=True and pass in the response.raw object after enabling transparent transport decompression:

import lxml.html
import requests

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True
tree = lxml.html.parse(response.raw)

Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:

from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.
    pass

Coming full circle: BeautifulSoup itself does have very complete CSS selector support:

for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.
    pass

Answer 1

I can confirm that there is no XPath support within Beautiful Soup.


Answer 2

As others have said, BeautifulSoup doesn’t have xpath support. There are probably a number of ways to get something from an xpath, including using Selenium. However, here’s a solution that works in either Python 2 or 3:

from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')

print('Buyers: ', buyers)
print('Prices: ', prices)

I used this as a reference.


Answer 3

BeautifulSoup has a function named findNext that searches forward from the current element (starting with its children), so:

father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a') 

The code above can roughly imitate the following XPath:

div[@class="class_value"]/div[@id="id_value"]//a

Answer 4

I’ve searched through their docs and it seems there is no XPath option. Also, as you can see here in a similar question on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion would be: no, there is no XPath parsing available.


Answer 5

With lxml it is all simple:

tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')

But with BeautifulSoup (bs4) it is simple too:

  • first, remove “//” and “@”
  • second, add a star before “=”

Try this magic:

soup = BeautifulSoup(html, "lxml")
i_need_element = soup.select('a[class*="shared-components"]')

As you can see, this does not support the sub-tag step, so I removed the “/@href” part.


Answer 6

Maybe you can try the following, without XPath:

from simplified_scrapy.simplified_doc import SimplifiedDoc 
html = '''
<html>
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
'''
# What XPath can do, this can do too
doc = SimplifiedDoc(html)
# The result is the same as doc.getElementByTag('body').getElementByTag('div').getElementByTag('h1').text
print (doc.body.div.h1.text)
print (doc.div.h1.text)
print (doc.h1.text) # Shorter paths will be faster
print (doc.div.getChildren())
print (doc.div.getChildren('p'))

Answer 7

from lxml import etree
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('path of your localfile.html'),'html.parser')
dom = etree.HTML(str(soup))
print(dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()'))

The above combines a Soup object with lxml, so that the value can be extracted using XPath.


Answer 8

This is a pretty old thread, but there is a work-around solution now, which may not have been in BeautifulSoup at the time.

Here is an example of what I did. I use the “requests” module to read an RSS feed and get its text content in a variable called “rss_text”. I then run it through BeautifulSoup, search for the XPath /rss/channel/title, and retrieve its contents. It’s not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.

from bs4 import BeautifulSoup
rss_obj = BeautifulSoup(rss_text, 'xml')
cls.title = rss_obj.rss.channel.title.get_text()
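
For comparison, a basic path like /rss/channel/title can also be resolved with the standard library’s ElementTree in place of BeautifulSoup. This is a sketch of a different technique from the one above, and the rss_text value here is a made-up sample, not a real feed.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed standing in for the downloaded rss_text.
rss_text = (
    '<rss version="2.0"><channel>'
    "<title>Example Feed</title>"
    "<item><title>First post</title></item>"
    "</channel></rss>"
)

root = ET.fromstring(rss_text)          # root is the <rss> element itself
title = root.findtext("channel/title")  # path is relative to <rss>
print(title)
```

Because the path is relative to the root element, the leading /rss step is implied, and item titles deeper in the tree are not matched.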

How to use XPath in Python?

Question: How to use XPath in Python?

What are the libraries that support XPath? Is there a full implementation? How is the library used? Where is its website?


Answer 0

libxml2 has a number of advantages:

  1. Compliance with the spec
  2. Active development and community participation
  3. Speed. This is really a Python wrapper around a C implementation.
  4. Ubiquity. The libxml2 library is pervasive and thus well tested.

Downsides include:

  1. Compliance with the spec. It’s strict. Things like default namespace handling are easier in other libraries.
  2. Use of native code. This can be a pain depending on how your application is distributed / deployed. RPMs are available that ease some of this pain.
  3. Manual resource handling. Note in the sample below the calls to freeDoc() and xpathFreeContext(). This is not very Pythonic.

If you are doing simple path selection, stick with ElementTree (which is included in Python 2.5). If you need full spec compliance or raw speed and can cope with the distribution of native code, go with libxml2.

Sample of libxml2 XPath Use


import sys

import libxml2

doc = libxml2.parseFile("tst.xml")
ctxt = doc.xpathNewContext()
res = ctxt.xpathEval("//*")
if len(res) != 2:
    print("xpath query: wrong node set size")
    sys.exit(1)
if res[0].name != "doc" or res[1].name != "foo":
    print("xpath query: wrong node set value")
    sys.exit(1)
# libxml2 objects have to be freed manually
doc.freeDoc()
ctxt.xpathFreeContext()

Sample of ElementTree XPath Use


from xml.etree.ElementTree import ElementTree  # in the standard library since Python 2.5
mydoc = ElementTree(file='tst.xml')
for e in mydoc.findall('./foo/bar'):
    # get() already returns the attribute value as a string
    print(e.get('title'))


Answer 1

The lxml package supports xpath. It seems to work pretty well, although I’ve had some trouble with the self:: axis. There’s also Amara, but I haven’t used it personally.


Answer 2

Sounds like an lxml advertisement in here. ;) ElementTree is included in the standard library. In Python 2.6 and below its XPath support is pretty weak, but in 2.7+ it is much improved:

import xml.etree.ElementTree as ET
root = ET.parse(filename)
result = ''

for elem in root.findall('.//child/grandchild'):
    # How to make decisions based on attributes even in 2.6:
    if elem.attrib.get('name') == 'foo':
        result = elem.text
        break
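
In 2.7 and later the attribute test can be moved into the path itself, since ElementTree accepts [@attrib='value'] predicates. A small sketch with made-up sample XML:

```python
import xml.etree.ElementTree as ET

# Made-up document in the same child/grandchild shape as the loop above.
root = ET.fromstring(
    "<root>"
    '<child><grandchild name="bar">first</grandchild></child>'
    '<child><grandchild name="foo">second</grandchild></child>'
    "</root>"
)

# find() returns the first match, or None if nothing matches.
match = root.find(".//child/grandchild[@name='foo']")
result = match.text if match is not None else ''
print(result)
```

This replaces the explicit loop and break, at the cost of no longer running on 2.6.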

Answer 3

Use lxml. lxml uses the full power of libxml2 and libxslt, but wraps them in more “Pythonic” bindings than the Python bindings that are native to those libraries. As such, it gets the full XPath 1.0 implementation. Native ElementTree supports a limited subset of XPath, although it may be good enough for your needs.


Answer 4

Another option is py-dom-xpath; it works seamlessly with minidom and is pure Python, so it works on App Engine.

import xpath
xpath.find('//item', doc)

Answer 5

You can use:

PyXML:

from xml.dom.ext.reader import Sax2
from xml import xpath
doc = Sax2.FromXmlFile('foo.xml').documentElement
for url in xpath.Evaluate('//@Url', doc):
  print url.value

libxml2:

import libxml2
doc = libxml2.parseFile('foo.xml')
for url in doc.xpathEval('//@Url'):
  print url.content

Answer 6

The latest version of elementtree supports XPath pretty well. Not being an XPath expert, I can’t say for sure whether the implementation is complete, but it has satisfied most of my needs when working in Python. I’ve also used lxml and PyXML, and I find etree nice because it’s a standard module.

NOTE: I’ve since found lxml and for me it’s definitely the best XML lib out there for Python. It does XPath nicely as well (though again perhaps not a full implementation).


Answer 7

You can use the simple soupparser from lxml.

Example:

from lxml.html.soupparser import fromstring

tree = fromstring("<a>Find me!</a>")
print(tree.xpath("//a/text()"))

Answer 8

If you want the power of XPath combined with the ability to also use CSS at any point, you can use parsel:

>>> from parsel import Selector
>>> sel = Selector(text=u"""<html>
        <body>
            <h1>Hello, Parsel!</h1>
            <ul>
                <li><a href="http://example.com">Link 1</a></li>
                <li><a href="http://scrapy.org">Link 2</a></li>
            </ul>
        </body>
        </html>""")
>>>
>>> sel.css('h1::text').extract_first()
'Hello, Parsel!'
>>> sel.xpath('//h1/text()').extract_first()
'Hello, Parsel!'

Answer 9

Another library is 4Suite: http://sourceforge.net/projects/foursuite/

I do not know how spec-compliant it is. But it has worked very well for my use. It looks abandoned.


Answer 10

PyXML works well.

You didn’t say what platform you’re using, however if you’re on Ubuntu you can get it with sudo apt-get install python-xml. I’m sure other Linux distros have it as well.

If you’re on a Mac, xpath is already installed but not immediately accessible. You can set PY_USE_XMLPLUS in your environment or do it the Python way before you import xml.xpath:

import os
import sys

if sys.platform.startswith('darwin'):
    os.environ['PY_USE_XMLPLUS'] = '1'

In the worst case you may have to build it yourself. This package is no longer maintained but still builds fine and works with modern 2.x Pythons. Basic docs are here.


Answer 11

If you are going to need it for html:

import lxml.html as html
root  = html.fromstring(string)
root.xpath('//meta')