Question: Beautiful Soup and extracting a div and its contents by ID
soup.find("tagName", { "id" : "articlebody" })
Why does this NOT return the <div id="articlebody"> ... </div> tags and the stuff in between? It returns nothing, and I know for a fact the div exists because I'm staring right at it in the output of soup.prettify().
soup.find("div", { "id" : "articlebody" }) also does not work.
(EDIT: I found that BeautifulSoup wasn't correctly parsing my page, which probably means the page I was trying to parse isn't properly formatted SGML or whatever.)
Answer 0
You should post your example document, because the code works fine:
>>> import BeautifulSoup
>>> soup = BeautifulSoup.BeautifulSoup('<html><body><div id="articlebody"> ... </div></body></html>')
>>> soup.find("div", {"id": "articlebody"})
<div id="articlebody"> ... </div>
Finding <div>s inside <div>s works as well:
>>> soup = BeautifulSoup.BeautifulSoup('<html><body><div><div id="articlebody"> ... </div></div></body></html>')
>>> soup.find("div", {"id": "articlebody"})
<div id="articlebody"> ... </div>
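The session above uses the old BeautifulSoup 3 import. As a rough equivalent for Beautiful Soup 4, assuming the bs4 package is installed and using Python's built-in "html.parser", the same nested lookup might look like this (the sample document is made up):

```python
from bs4 import BeautifulSoup

# Made-up sample document with the target div nested inside another div.
html = '<html><body><div><div id="articlebody"> ... </div></div></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Same call shape as in the answer: tag name plus a dict of attributes.
article = soup.find("div", {"id": "articlebody"})
print(article)
```

If this works on a small sample but not on the real page, the page markup or the parser is the more likely suspect.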
Answer 1
To find an element by its id:
div = soup.find(id="articlebody")
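Note that find() returns None when nothing matches, so it helps to check the result before using it. A minimal sketch, assuming Beautiful Soup 4 (bs4) and a made-up document; the id "no-such-id" is purely illustrative:

```python
from bs4 import BeautifulSoup

html = '<html><body><div id="articlebody"><p>Hello</p></div></body></html>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find(id="articlebody")
missing = soup.find(id="no-such-id")  # hypothetical id that is not in the document

# Guard against None before touching the tag.
if div is not None:
    print(div.get_text())
print(missing)
```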
Answer 2
Beautiful Soup 4 supports most CSS selectors with the .select() method, so you can use an id selector such as:
soup.select('#articlebody')
If you need to specify the element's type, you can add a type selector before the id selector:
soup.select('div#articlebody')
The .select() method returns a collection of elements, which means it returns the same results as the following .find_all() examples:
soup.find_all('div', id="articlebody")
# or
soup.find_all(id="articlebody")
If you only want to select a single element, you can just use the .find() method:
soup.find('div', id="articlebody")
# or
soup.find(id="articlebody")
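To make the comparison concrete, here is a small sketch showing that these calls agree on a made-up document. It assumes Beautiful Soup 4 (bs4) with its CSS selector support available; select_one() is used here as the single-element counterpart of select():

```python
from bs4 import BeautifulSoup

html = (
    '<html><body>'
    '<div id="articlebody"><p>text</p></div>'
    '<div id="other"></div>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

by_select = soup.select("div#articlebody")            # collection of matches
by_find_all = soup.find_all("div", id="articlebody")  # collection of matches
single = soup.find("div", id="articlebody")           # first match or None
single_css = soup.select_one("#articlebody")          # first match or None

print(len(by_select), len(by_find_all))
print(by_select[0] is single, single is single_css)
```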
Answer 3
I think there is a problem when the div tags are nested too deeply. I am trying to parse some contacts from a Facebook HTML file, and BeautifulSoup is not able to find div tags with class "fcontent".
This happens with other classes as well. When I search for divs in general, it returns only those that are not deeply nested.
The HTML source can be any Facebook page showing the friends list of a friend of yours (not your own friends list). If someone can test it and give some advice, I would really appreciate it.
This is my code, where I just try to print the number of div tags with class "fcontent":
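# BeautifulSoup 3 import, as in the original post
from BeautifulSoup import BeautifulSoup

with open('/Users/myUserName/Desktop/contacts.html') as f:
    soup = BeautifulSoup(f)

# Named "divs" rather than "list" to avoid shadowing the built-in.
divs = soup.findAll('div', attrs={'class': 'fcontent'})
print(len(divs))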
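For what it is worth, nesting depth on its own does not seem to stop Beautiful Soup 4 from finding an element. A quick check, assuming bs4 with the built-in "html.parser" and a synthetic document (not a real Facebook page; the depth of 50 is arbitrary):

```python
from bs4 import BeautifulSoup

# Build a synthetic document: one target div wrapped in 50 plain divs.
depth = 50
html = "<div>" * depth + '<div class="fcontent">contact</div>' + "</div>" * depth
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all("div", class_="fcontent")
print(len(matches))
```

If a count like this comes back as zero on a real page, malformed markup or the choice of parser is a more likely culprit than the nesting itself.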
Answer 4
Most probably the default BeautifulSoup parser is the problem. Switch to a different parser, such as 'lxml', and try again.
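One way to try this is to run the same lookup under each parser that happens to be installed and compare the results. The sketch below assumes Beautiful Soup 4 (bs4); lxml and html5lib are optional third-party packages, so parsers that are missing are simply skipped. The broken markup is made up for illustration:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Made-up, slightly broken markup: the <p> is never closed.
html = '<html><body><div id="articlebody"><p>text</div></body></html>'

found = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(html, parser)
    except FeatureNotFound:
        # This parser library is not installed; skip it.
        continue
    found[parser] = soup.find(id="articlebody") is not None

print(found)
```

On a real page, a parser for which the lookup fails while others succeed is a good hint that the markup is tripping that parser up.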
Answer 5
In the BeautifulSoup source, this line allows divs to be nested within divs, so the concern you raised in lukas' comment wouldn't be valid.
NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
What I think you need to do is specify the attrs you want, such as:
source.find('div', attrs={'id':'articlebody'})
Answer 6
Have you tried soup.findAll("div", {"id": "articlebody"})?
It sounds crazy, but if you're scraping stuff from the wild, you can't rule out multiple divs...
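A small sketch of that check, written against Beautiful Soup 4 (bs4), where the method is spelled find_all(). The page below is made up and deliberately reuses the same id twice:

```python
from bs4 import BeautifulSoup

# Made-up invalid page that reuses the same id on two divs.
html = '<div id="articlebody">first</div><div id="articlebody">second</div>'
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all("div", {"id": "articlebody"})
print(len(matches))
for tag in matches:
    print(tag.get_text())
```

find() would return only the first of these, so counting the matches is a cheap way to see whether duplicate ids are in play.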
Answer 7
I used:
soup.findAll('tag', attrs={'attrname':"attrvalue"})
as my syntax for find/findAll. That said, unless there are other optional parameters between the tag and the attribute list, this shouldn't behave any differently.
Answer 8
This happened to me too while trying to scrape Google.
I ended up using pyquery.
Install:
pip install pyquery
Use:
from pyquery import PyQuery
pq = PyQuery('<html><body><div id="articlebody"> ... </div></body></html>')
tag = pq('div#articlebody')
Answer 9
Here is a code fragment:
soup = BeautifulSoup(open("index.html"))
titleList = soup.findAll('title')
divList = soup.findAll('div', attrs={ "class" : "article story"})
As you can see, I find all title tags and then all div tags with class="article story".
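One caveat with a multi-word class value like "article story": in Beautiful Soup 4, passing the whole string matches only that exact class string, in that order, while a CSS selector matches elements carrying both classes in any order. A sketch under those assumptions, using bs4 and made-up markup:

```python
from bs4 import BeautifulSoup

html = (
    '<div class="article story">A</div>'
    '<div class="story article">B</div>'
    '<div class="article">C</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Exact class string: only the div whose attribute reads "article story".
exact = soup.find_all("div", attrs={"class": "article story"})

# CSS selector: any div that has both classes, regardless of order.
both = soup.select("div.article.story")

# Single class name: any div that has "article" among its classes.
any_article = soup.find_all("div", class_="article")

print(len(exact), len(both), len(any_article))
```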
Answer 10
The id attribute is supposed to be unique within a document, which means you can search for it directly without even specifying the element type. So it is a plus if the elements you want to parse have one.
divEle = soup.find(id = "articlebody")