I’m working in Python and using Flask. When I run my main Python file on my computer, it works perfectly, but when I activate venv and run the Flask Python file in the terminal, it says that my main Python file has “No Module Named bs4.” Any comments or advice is greatly appreciated.
Activate the virtualenv, and then install BeautifulSoup4:
$ pip install BeautifulSoup4
When you installed bs4 with easy_install, you installed it system-wide. So your system python can import it, but not your virtualenv python.
If you do not need bs4 to be installed in your system python path, uninstall it and keep it in your virtualenv.
A lot of tutorials/references were written for Python 2 and tell you to use pip install somename. If you’re using Python 3 you want to change that to pip3 install somename.
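As a quick sanity check (the paths shown are just illustrative), you can confirm that the activated virtualenv's interpreter is the one doing the installing:
$ source venv/bin/activate
$ which python          # should point inside the venv, e.g. /path/to/project/venv/bin/python
$ python -m pip install beautifulsoup4   # installs into that same interpreter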
I want to make a website that shows a comparison of Amazon and eBay product prices.
Which of these will work better, and why? I am somewhat familiar with BeautifulSoup but not so much with the Scrapy crawler.
Scrapy is a web spider or web scraper framework. You give Scrapy a root URL to start crawling from, then you can specify constraints on how many URLs you want to crawl and fetch, etc. It is a complete framework for web scraping or crawling.
While
BeautifulSoup is a parsing library; paired with something that fetches pages for you (urllib, requests), it does a pretty good job of pulling content from a URL and lets you parse certain parts of it without any hassle. It only processes the content of the URL that you give it and then stops. It does not crawl unless you manually put it inside an infinite loop with certain criteria.
In simple words, with Beautiful Soup you can build something similar to Scrapy.
Beautiful Soup is a library while Scrapy is a complete framework.
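To make that contrast concrete, here is a minimal sketch (not from the original answers; the root URL and the "stay on this domain" rule are illustrative) of the kind of hand-rolled crawl loop you would have to write around Beautiful Soup to approximate what Scrapy gives you out of the box:
import urllib.parse
from collections import deque

import requests
from bs4 import BeautifulSoup

seen = set()
queue = deque(["https://example.com/"])   # illustrative root URL

while queue:
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # ... extract whatever data you need from `soup` here ...
    for a in soup.find_all("a", href=True):
        link = urllib.parse.urljoin(url, a["href"])
        if link.startswith("https://example.com/") and link not in seen:
            queue.append(link)
Scrapy handles the queueing, deduplication, politeness, and concurrency for you; with Beautiful Soup you own all of that code yourself.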
I think both are good… I'm doing a project right now that uses both. First I scrape all the pages using Scrapy and save them in a MongoDB collection using its pipelines, also downloading the images that exist on each page.
After that I use BeautifulSoup4 for post-processing, where I must change attribute values and get some special tags.
If you don't know which product pages you want, a good tool will be Scrapy, since you can use its crawlers to run over the whole Amazon/eBay website looking for products without writing an explicit for loop.
Take a look at the Scrapy documentation; it's very simple to use.
The way I do it is to use the eBay/Amazon APIs rather than Scrapy, and then parse the results using BeautifulSoup.
The APIs give you an official way of getting the same data that you would have got from a Scrapy crawler, with no need to worry about hiding your identity, messing about with proxies, etc.
Scrapy
It is a web scraping framework which comes with tons of goodies that make scraping easier, so that we can focus on the crawling logic only. Some of my favourite things Scrapy takes care of for us are below.
Feed exports: It basically allows us to save data in various formats like CSV, JSON, JSON Lines and XML.
Asynchronous scraping: Scrapy uses the Twisted framework, which gives us the power to visit multiple URLs at once, with each request processed in a non-blocking way (basically, we don't have to wait for one request to finish before sending another).
Selectors: This is where we can compare Scrapy with Beautiful Soup. Selectors are what allow us to select particular data from the webpage (a heading, a certain div with a class name, etc.). Scrapy uses lxml for parsing, which is much faster than Beautiful Soup.
Setting proxies, user agents, headers, etc.: Scrapy allows us to set and rotate proxies and other headers dynamically.
Item pipelines: Pipelines enable us to process data after extraction. For example, we can configure a pipeline to push data to your MySQL server.
Cookies: Scrapy automatically handles cookies for us.
etc.
TLDR: Scrapy is a framework that provides everything one might need to build large-scale crawlers. It provides various features that hide the complexity of crawling the web, so one can simply start writing web crawlers without worrying about the setup burden (a minimal spider sketch follows below).
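As a rough illustration, here is a minimal spider sketch (not taken from any of the answers; the site structure, spider name, and CSS selectors are hypothetical) showing how the selector, crawling, and export concerns from the list above live in one class:
import scrapy

class PriceSpider(scrapy.Spider):
    name = "price_spider"
    start_urls = ["https://example.com/products"]   # hypothetical listing page

    def parse(self, response):
        # Selectors (CSS here, XPath also works) pull out the data we care about.
        for product in response.css("div.product"):
            yield {
                "title": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
            }
        # Follow pagination; Scrapy schedules these requests asynchronously.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Running it with scrapy crawl price_spider -o prices.json would use the feed exports mentioned above to write the yielded items to JSON.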
Beautiful soup
Beautiful Soup is a Python package for parsing HTML and XML documents. So with Beautiful Soup you can parse a webpage that has already been downloaded. BS4 is very popular and has been around for a long time. Unlike Scrapy, you cannot use Beautiful Soup alone to make crawlers. You will need other libraries like requests or urllib to make crawlers with bs4. Again, this means you would need to manage the list of URLs being crawled and to be crawled, handle cookies, manage proxies, handle errors, and create your own functions to push data to CSV, JSON, XML, etc. If you want to speed things up, you will have to use other libraries like multiprocessing.
To sum up:
Scrapy is a rich framework that you can use to start writing crawlers without any hassle.
Beautiful Soup is a library that you can use to parse a webpage. It cannot be used on its own to scrape the web.
You should definitely use Scrapy for your Amazon and eBay product price comparison website. You could build a database of URLs, run the crawler every day (cron jobs, or Celery for scheduling crawls), and update the prices in your database. This way your website will always pull from the database, and the crawler and the database will act as independent components (see the scheduling sketch below).
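If you go the Celery route, a daily schedule might look roughly like the sketch below (the broker URL, task module path, and crawl invocation are all hypothetical; how you actually launch the spider depends on your setup, e.g. a subprocess or scrapyd):
from celery import Celery
from celery.schedules import crontab

app = Celery("price_updater", broker="redis://localhost:6379/0")   # hypothetical broker

@app.task
def run_price_crawl():
    # Kick off the Scrapy crawl here and write the scraped prices to the database.
    pass

app.conf.beat_schedule = {
    "daily-price-crawl": {
        "task": "tasks.run_price_crawl",        # task path depends on where this module lives
        "schedule": crontab(hour=3, minute=0),  # every day at 03:00
    },
}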
BeautifulSoup is a library that lets you extract information from a web page.
Scrapy, on the other hand, is a framework which does the above and many more things you will probably need in your scraping project, such as pipelines for saving data.
Using Scrapy you can save tons of code and start with structured programming; if you don't like any of Scrapy's pre-written methods, BeautifulSoup can be used in place of a Scrapy method.
Big projects take advantage of both.
Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the body text (article) and maybe even a few tab names here and there. I have tried the suggestion in this SO question that returns lots of <script> tags and html comments which I don’t want. I can’t figure out the arguments I need for the function findAll() in order to just get the visible texts on a webpage.
So, how should I find all visible text excluding scripts, comments, css etc.?
Answer 0
Try this:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))
The approved answer from @jbochi does not work for me. The str() function call raises an exception because it cannot encode the non-ascii characters in the BeautifulSoup element. Here is a more succinct way to filter the example web page to visible text.
html = open('21storm.html').read()
soup = BeautifulSoup(html)
[s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
visible_text = soup.getText()
Answer 2
import urllib
from bs4 import BeautifulSoup
url = "https://www.yahoo.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style"]):
script.extract() # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text.encode('utf-8'))
I completely respect using Beautiful Soup to get rendered content, but it may not be the ideal package for acquiring the rendered content of a page.
I had a similar problem getting rendered content, i.e. the content visible in a typical browser. In particular, I had many perhaps-atypical cases to work with, such as the simple example below. In this case the non-displayable tag is nested in a style tag, and is not visible in many browsers that I have checked. Other variations exist, such as defining a class that sets display to none and then using that class for the div.
<html>
<title> Title here</title>
<body>
lots of text here <p> <br>
<h1> even headings </h1>
<style type="text/css">
<div > this will not be visible </div>
</style>
</body>
</html>
One solution posted above is:
html = Utilities.ReadFile('simple.html')
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)
visible_texts = filter(visible, texts)
print(visible_texts)
[u'\n', u'\n', u'\n\n lots of text here ', u' ', u'\n', u' even headings ', u'\n', u' this will not be visible ', u'\n', u'\n']
I tried both these solutions: html2text and nltk.clean_html and was surprised by the timing results so thought they warranted an answer for posterity. Of course, the speeds highly depend on the contents of the data…
One answer here from @Helge was about using nltk of all things.
import nltk
%timeit nltk.clean_html(html)
was returning 153 us per loop
It worked really well to return a string with rendered html. This nltk module was faster than even html2text, though perhaps html2text is more robust.
betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
3.09 ms per loop
Answer 4
If you care about performance, here’s another more efficient way:
import re
INVISIBLE_ELEMS = ('style', 'script', 'head', 'title')
RE_SPACES = re.compile(r'\s{3,}')
def visible_texts(soup):
""" get visible text from a document """
text = ' '.join([
s for s in soup.strings
if s.parent.name not in INVISIBLE_ELEMS
])
# collapse multiple spaces to two spaces.
return RE_SPACES.sub(' ', text)
soup.strings is an iterator, and it returns NavigableString so that you can check the parent’s tag name directly, without going through multiple loops.
The title is inside an <nyt_headline> tag, which is nested inside an <h1> tag and a <div> tag with id “article”.
soup.findAll('nyt_headline', limit=1)
Should work.
The article body is inside an <nyt_text> tag, which is nested inside a <div> tag with id “articleBody”. Inside the <nyt_text> element, the text itself is contained within <p> tags. Images are not within those <p> tags. It’s difficult for me to experiment with the syntax, but I expect a working scrape to look something like this.
text = soup.findAll('nyt_text', limit=1)[0]
text.findAll('p')
While I would completely suggest using Beautiful Soup in general, if anyone is looking to display the visible parts of malformed HTML (e.g. where you have just a segment or a line of a web page) for whatever reason, the following will remove content between < and > tags:
import re  # only use with malformed html - this is not efficient

def display_visible_html_using_re(text):
    return re.sub(r"(\<.*?\>)", "", text)
Answer 7
Using BeautifulSoup is the easiest way, with the least code, to get the strings without blank lines and junk.
tag = <Parent_Tag_that_contains_the_data>
soup = BeautifulSoup(tag, 'html.parser')
for i in soup.stripped_strings:
    print repr(i)
This will find the text element, "3.7", within the tag object <span class="ratingsContent">3.7</span> when it exists, and default to None when it does not.
Return the value of the named attribute of object. name must be a string. If the string is the name of one of the object’s attributes, the result is the value of that attribute. For example, getattr(x, ‘foobar’) is equivalent to x.foobar. If the named attribute does not exist, default is returned if provided, otherwise, AttributeError is raised.
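The code this answer refers to did not survive extraction, but a minimal sketch of the pattern being described (the class name comes from the quoted tag; the variable names are illustrative) looks like this:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<span class="ratingsContent">3.7</span>', 'html.parser')
# getattr returns the tag's .text when find() matches, and the supplied default (None)
# when find() returns None, instead of raising an AttributeError.
rating = getattr(soup.find('span', class_='ratingsContent'), 'text', None)
print(rating)   # "3.7", or None if the span is absent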
Answer 9
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request
import re
import ssl
def tag_visible(element):
if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
return False
if isinstance(element, Comment):
return False
if re.match(r"[\n]+",str(element)): return False
return True
def text_from_html(url):
body = urllib.request.urlopen(url,context=ssl._create_unverified_context()).read()
soup = BeautifulSoup(body ,"lxml")
texts = soup.findAll(text=True)
visible_texts = filter(tag_visible, texts)
text = u",".join(t.strip() for t in visible_texts)
text = text.lstrip().rstrip()
text = text.split(',')
clean_text = ''
for sen in text:
if sen:
sen = sen.rstrip().lstrip()
clean_text += sen+','
return clean_text
url = 'http://www.nytimes.com/2009/12/21/us/21storm.html'
print(text_from_html(url))
I am trying to extract the content of a single “value” attribute in a specific “input” tag on a webpage. I use the following code:
import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)
inputTag = soup.findAll(attrs={"name" : "stainfo"})
output = inputTag['value']
print str(output)
I get a TypeError: list indices must be integers, not str
even though from the BeautifulSoup documentation I understand that strings should not be a problem here… but I am no specialist and I may have misunderstood.
Any suggestion is greatly appreciated!
Thanks in advance.
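For reference, the usual fix here (a sketch, assuming the page really does contain an input named stainfo) is to index into the list that findAll returns before asking for the attribute, or to use find, which returns a single tag:
inputTag = soup.findAll(attrs={"name": "stainfo"})
output = inputTag[0]['value']    # findAll returns a list, so take an element first

# or, equivalently:
inputTag = soup.find(attrs={"name": "stainfo"})
output = inputTag['value']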
Processing repElem...
Attribute id = 11
Attribute name = Joe
Processing repElem...
Attribute id = 12
Attribute name = Mary
Answer 2
If you want to retrieve multiple values of attributes from the source above, you can use findAll and a list comprehension to get everything you need:
import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)
inputTags = soup.findAll(attrs={"name" : "stainfo"})
### You may be able to do findAll("input", attrs={"name" : "stainfo"})
output = [x["value"] for x in inputTags]  # "value" is the attribute we want; x["stainfo"] would raise a KeyError
print output
### This will print a list of the values.
Answer 3
I would actually suggest a time-saving approach, assuming that you know which kind of tag has those attributes.
Suppose a tag xyz has an attribute named "staininfo":
full_tag = soup.findAll("xyz")
And note that full_tag is a list:
for each_tag in full_tag:
    staininfo_attrb_value = each_tag["staininfo"]
    print staininfo_attrb_value
import requests
from bs4 import BeautifulSoup
import csv
url = "http://58.68.130.147/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
get_details = soup.find_all("input", attrs={"name":"stainfo"})
for val in get_details:
get_val = val["value"]
print(get_val)
Answer 5
I am using this with Beautifulsoup 4.8.1 to get the value of all class attributes of certain elements:
from bs4 import BeautifulSoup
html = "<td class='val1'/><td col='1'/><td class='val2' />"
bsoup = BeautifulSoup(html, 'html.parser')
for td in bsoup.find_all('td'):
if td.has_attr('class'):
print(td['class'][0])
It's important to note that the attribute key retrieves a list even when the attribute has only a single value.
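A quick illustration of that point (hypothetical markup): class is treated as a multi-valued attribute and always comes back as a list, while an ordinary attribute comes back as a plain string:
from bs4 import BeautifulSoup

td = BeautifulSoup("<td class='val1 highlight' col='1'/>", 'html.parser').td
print(td['class'])   # ['val1', 'highlight'] - a list, even for a single class
print(td['col'])     # '1' - a plain string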
Now in the above code we can use findAll to get tags and information related to them, but I want to use xpath. Is it possible to use xpath with BeautifulSoup? If possible, can anyone please provide me an example code so that it will be more helpful?
Nope, BeautifulSoup, by itself, does not support XPath expressions.
An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode where it’ll try and parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe is faster.
Once you’ve parsed your document into an lxml tree, you can use the .xpath() method to search for elements.
try:
# Python 2
from urllib2 import urlopen
except ImportError:
from urllib.request import urlopen
from lxml import etree
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)
Note that in the above example I passed the response object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests library, you want to set stream=True and pass in the response.raw object after enabling transparent transport decompression:
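The code block this sentence introduces appears to have been lost in extraction; based on the description above, the requests variant would look roughly like this:
import lxml.html
import requests

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True    # enable transparent transport decompression
tree = lxml.html.parse(response.raw)  # parse straight from the stream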
Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:
from lxml.cssselect import CSSSelector
td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
# Do something with these table cells.
Coming full circle: BeautifulSoup itself does have very complete CSS selector support:
for cell in soup.select('table#foobar td.empformbody'):
# Do something with these table cells.
As others have said, BeautifulSoup doesn’t have xpath support. There are probably a number of ways to get something from an xpath, including using Selenium. However, here’s a solution that works in either Python 2 or 3:
from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')
print('Buyers: ', buyers)
print('Prices: ', prices)
I've searched through their docs and it seems there is no XPath option. Also, as you can see here on a similar question on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion would be: no, there is no XPath parsing available.
Answer 5
When you use lxml, it's all simple:
tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')
As you can see, this does not support sub-tags, so I removed the "/@href" part.
Answer 6
Maybe you can try the following, without XPath:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<html>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
'''
# What XPath can do, so can it
doc = SimplifiedDoc(html)
# The result is the same as doc.getElementByTag('body').getElementByTag('div').getElementByTag('h1').text
print (doc.body.div.h1.text)
print (doc.div.h1.text)
print (doc.h1.text) # Shorter paths will be faster
print (doc.div.getChildren())
print (doc.div.getChildren('p'))
Answer 7
from lxml import etree
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('path of your localfile.html'),'html.parser')
dom = etree.HTML(str(soup))
print dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()')
The above uses a combination of the Soup object with lxml, so you can extract values using XPath.
This is a pretty old thread, but there is a work-around solution now, which may not have been in BeautifulSoup at the time.
Here is an example of what I did. I use the “requests” module to read an RSS feed and get its text content in a variable called “rss_text”. With that, I run it thru BeautifulSoup, search for the xpath /rss/channel/title, and retrieve its contents. It’s not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.
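The snippet this answer refers to was also lost in extraction; a minimal sketch of walking a basic path such as /rss/channel/title with BeautifulSoup's dotted tag access (the feed URL is illustrative, and the 'xml' feature requires lxml to be installed) would be:
import requests
from bs4 import BeautifulSoup

rss_text = requests.get('https://example.com/feed.rss').text   # illustrative feed URL
rss_obj = BeautifulSoup(rss_text, 'xml')
title = rss_obj.rss.channel.title.get_text()   # follows the path /rss/channel/title
print(title)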
Why does this NOT return the <div id="articlebody"> ... </div> tags and stuff in between? It returns nothing. And I know for a fact it exists because I’m staring right at it from
soup.prettify()
soup.find("div", { "id" : "articlebody" }) also does not work.
(EDIT: I found that BeautifulSoup wasn’t correctly parsing my page, which probably meant the page I was trying to parse isn’t properly formatted in SGML or whatever)
If you need to specify the element’s type, you can add a type selector before the id selector:
soup.select('div#articlebody')
The .select() method will return a collection of elements, which means that it would return the same results as the following .find_all() method example:
soup.find_all('div', id="articlebody")
# or
soup.find_all(id="articlebody")
If you only want to select a single element, then you could just use the .find() method:
soup.find('div', id="articlebody")
# or
soup.find(id="articlebody")
I think there is a problem when the 'div' tags are too deeply nested. I am trying to parse some contacts from a Facebook HTML file, and BeautifulSoup is not able to find "div" tags with class "fcontent".
This happens with other classes as well. When I search for divs in general, it returns only those that are not so deeply nested.
The HTML source code can be any page from Facebook of the friends list of a friend of yours (not one of your own friends). If someone can test it and give some advice I would really appreciate it.
This is my code, where I just try to print the number of tags “div” with class “fcontent”:
from BeautifulSoup import BeautifulSoup
f = open('/Users/myUserName/Desktop/contacts.html')
soup = BeautifulSoup(f)
list = soup.findAll('div', attrs={'class':'fcontent'})
print len(list)
The id attribute is always unique within a page, which means you can use it directly without even specifying the element type. Therefore, it is a plus point if your elements have it when you parse through the content.
Here’s a short snippet using the SoupStrainer class in BeautifulSoup:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer
http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')
for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
if link.has_attr('href'):
print(link['href'])
The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios:
Edit: Note that I used the SoupStrainer class because it’s a bit more efficient (memory and speed wise), if you know what you’re parsing in advance.
Answer 1
For completeness' sake, the BeautifulSoup 4 version, making use of the encoding supplied by the server as well:
from bs4 import BeautifulSoup
import urllib.request
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))
for link in soup.find_all('a', href=True):
print(link['href'])
or the Python 2 version:
from bs4 import BeautifulSoup
import urllib2
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().getparam('charset'))
for link in soup.find_all('a', href=True):
print link['href']
and a version using the requests library, which as written will work in both Python 2 and 3:
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("http://www.gpsbasecamp.com/national-parks")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
for link in soup.find_all('a', href=True):
print(link['href'])
The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.
BeautifulSoup 3 stopped development in March 2012; new projects really should use BeautifulSoup 4, always.
Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can inform BeautifulSoup of the characterset found in the HTTP response headers to assist in decoding, but this can be wrong and conflicting with a <meta> header info found in the HTML itself, which is why the above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding() to make sure that such embedded encoding hints win over a misconfigured server.
With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* mimetype, even if no characterset was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.
Others have recommended BeautifulSoup, but it’s much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It’s much, much faster than BeautifulSoup, and it even handles “broken” HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don’t want to learn the lxml API.
There’s no reason to use BeautifulSoup anymore, unless you’re on Google App Engine or something where anything not purely Python isn’t allowed.
lxml.html also supports CSS3 selectors so this sort of thing is trivial.
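For instance (a sketch in the same Python 2 style as the example that follows; lxml's cssselect() method needs the cssselect package installed):
import urllib
import lxml.html

connection = urllib.urlopen('http://www.nytimes.com')
dom = lxml.html.fromstring(connection.read())
for a in dom.cssselect('a'):   # CSS selector instead of the XPath used below
    print a.get('href')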
An example with lxml and xpath would look like this:
import urllib
import lxml.html
connection = urllib.urlopen('http://www.nytimes.com')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'): # select the url in href for all a tags(links)
print link
Answer 3
import urllib2
import BeautifulSoup
request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a'):
if 'national-park' in a['href']:
print 'found a url with national-park in the link'
Answer 4
The following code retrieves all the links available on a webpage using urllib2 and BeautifulSoup4:
import urllib2
from bs4 import BeautifulSoup
url = urllib2.urlopen("http://www.espncricinfo.com/").read()
soup = BeautifulSoup(url)
for line in soup.find_all('a'):
print(line.get('href'))
Answer 5
Under the hood, BeautifulSoup now uses lxml. Requests, lxml and list comprehensions make a killer combo.
import requests
import lxml.html
dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)
[x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]
In the list comp, the "if '//' and 'url.com' not in x" is a simple way to scrub the URL list of the site's 'internal' navigation URLs, etc.
Answer 6
just for getting the links, without B.soup and regex:
import urllib2
url="http://www.somewhere.com"
page=urllib2.urlopen(url)
data=page.read().split("</a>")
tag="<a href=\""
endtag="\">"
for item in data:
if "<a href" in item:
try:
ind = item.index(tag)
item=item[ind+len(tag):]
end=item.index(endtag)
except: pass
else:
print item[:end]
for more complex operations, of course BSoup is still preferred.
Answer 7
This script does what you're looking for, but it also resolves relative links to absolute links.
import urllib
import lxml.html
import urlparse
def get_dom(url):
connection = urllib.urlopen(url)
return lxml.html.fromstring(connection.read())
def get_links(url):
return resolve_links((link for link in get_dom(url).xpath('//a/@href')))
def guess_root(links):
for link in links:
if link.startswith('http'):
parsed_link = urlparse.urlparse(link)
scheme = parsed_link.scheme + '://'
netloc = parsed_link.netloc
return scheme + netloc
def resolve_links(links):
root = guess_root(links)
for link in links:
if not link.startswith('http'):
link = urlparse.urljoin(root, link)
yield link
for link in get_links('http://www.google.com'):
print link
To find all the links, we will in this example use the urllib2 module together with the re module.
One of the most powerful functions in the re module is re.findall(). While re.search() is used to find the first match for a pattern, re.findall() finds all the matches and returns them as a list of strings, with each string representing one match.
import urllib2
import re
# connect to a URL (set url to the page you want to scan)
url = "http://www.somewhere.com"
website = urllib2.urlopen(url)
#read html code
html = website.read()
#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)
print links
Answer 9
Why not use regular expressions:
import urllib2
import re
url ="http://www.somewhere.com"
page = urllib2.urlopen(url)
page = page.read()
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)

for link in links:
    print('href: %s, HTML text: %s' % (link[0], link[1]))
Links can live in a variety of attributes, so you can pass a list of those attributes to select.
For example, with the src and href attributes (here I am using the starts-with operator ^ to specify that either of these attribute values starts with http; you can tailor this as required):
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://stackoverflow.com/')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]') ]
print(links)
Here's an example using @ars' accepted answer and the BeautifulSoup4, requests, and wget modules to handle the downloads.
import requests
import wget
import os
from bs4 import BeautifulSoup, SoupStrainer
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'
response = requests.get(url)
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
if link.has_attr('href'):
if file_type in link['href']:
full_path = url + link['href']
wget.download(full_path)
Answer 12
I found the answer by @Blairg23 to work, after the following correction (covering the scenario where it failed to work correctly):
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = urlparse.urljoin(url, link['href'])  # the urlparse module needs to be imported
            wget.download(full_path)
For Python 3:
urllib.parse.urljoin has to be used in order to obtain the full URL instead.
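A self-contained Python 3 sketch of the same correction (reusing the URL and file type from the answer being corrected; treat it as illustrative):
from urllib.parse import urljoin   # Python 3 home of urljoin

import requests
import wget
from bs4 import BeautifulSoup, SoupStrainer

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'
response = requests.get(url)

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href') and file_type in link['href']:
        wget.download(urljoin(url, link['href']))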
BeautifulSoup's own parser can be slow. It might be more feasible to use lxml, which is capable of parsing directly from a URL (with some limitations mentioned below).
import lxml.html
doc = lxml.html.parse(url)
links = doc.xpath('//a[@href]')
for link in links:
print link.attrib['href']
The code above will return the links as is, and in most cases they would be relative links or absolute from the site root. Since my use case was to only extract a certain type of links, below is a version that converts the links to full URLs and which optionally accepts a glob pattern like *.mp3. It won’t handle single and double dots in the relative paths though, but so far I didn’t have the need for it. If you need to parse URL fragments containing ../ or ./ then urlparse.urljoin might come in handy.
NOTE: Direct lxml url parsing doesn’t handle loading from https and doesn’t do redirects, so for this reason the version below is using urllib2 + lxml.
#!/usr/bin/env python
import sys
import urllib2
import urlparse
import lxml.html
import fnmatch
try:
import urltools as urltools
except ImportError:
sys.stderr.write('To normalize URLs run: `pip install urltools --user`')
urltools = None
def get_host(url):
p = urlparse.urlparse(url)
return "{}://{}".format(p.scheme, p.netloc)
if __name__ == '__main__':
url = sys.argv[1]
host = get_host(url)
glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'
doc = lxml.html.parse(urllib2.urlopen(url))
links = doc.xpath('//a[@href]')
for link in links:
href = link.attrib['href']
if fnmatch.fnmatch(href, glob_patt):
            if not href.startswith(('http://', 'https://', 'ftp://')):
if href.startswith('/'):
href = host + href
else:
parent_url = url.rsplit('/', 1)[0]
href = urlparse.urljoin(parent_url, href)
if urltools:
href = urltools.normalize(href)
print href
import urllib2
from bs4 import BeautifulSoup
a=urllib2.urlopen('http://dir.yahoo.com')
code=a.read()
soup=BeautifulSoup(code)
links=soup.findAll("a")
#To get href part alone
print links[0].attrs['href']
Answer 15
There can be many duplicate links together with both external and internal links. To differentiate between the two and just get unique links using sets:
# Python 3.
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup
url = "http://www.espncricinfo.com/"
resp = urllib.request.urlopen(url)
# Get server encoding per recommendation of Martijn Pieters.
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
external_links = set()
internal_links = set()
for line in soup.find_all('a'):
link = line.get('href')
if not link:
continue
if link.startswith('http'):
external_links.add(link)
else:
internal_links.add(link)
# Depending on usage, full internal links may be preferred.
full_internal_links = {
urllib.parse.urljoin(url, internal_link)
for internal_link in internal_links
}
# Print all unique external and full internal links.
for link in external_links.union(full_internal_links):
print(link)
...
soup = BeautifulSoup(html, "lxml")
File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 152, in __init__
% ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
The above outputs on my Terminal. I am on Mac OS 10.7.x. I have Python 2.7.1, and followed this tutorial to get Beautiful Soup and lxml, which both installed successfully and work with a separate test file located here. In the Python script that causes this error, I have included this line:
from pageCrawler import comparePages
And in the pageCrawler file I have included the following two lines:
from bs4 import BeautifulSoup
from urllib2 import urlopen
Any help in figuring out what the problem is and how it can be solved would much be appreciated.
I have a suspicion that this is related to the parser that BS will use to read the HTML. The documentation is here, but if you're like me (on OSX) you might be stuck with something that requires a bit of work:
You’ll notice that in the BS4 documentation page above, they point out that by default BS4 will use the Python built-in HTML parser. Assuming you are in OSX, the Apple-bundled version of Python is 2.7.2 which is not lenient for character formatting. I hit this same problem, so I upgraded my version of Python to work around it. Doing this in a virtualenv will minimize disruption to other projects.
If doing that sounds like a pain, you can switch over to the LXML parser:
pip install lxml
And then try:
soup = BeautifulSoup(html, "lxml")
Depending on your scenario, that might be good enough. I found this annoying enough to warrant upgrading my version of Python. Using virtualenv, you can migrate your packages fairly easily.
Although BeautifulSoup supports the HTML parser by default, if you want to use any other third-party Python parser you need to install that external parser (like lxml).
soup_object = BeautifulSoup(markup, "html.parser")  # Python's built-in HTML parser
But if you don't specify any parser as a parameter, you will get a warning that no parser was specified.
soup_object = BeautifulSoup(markup)  # warning
To use any other external parser, you need to install it and then specify it, like:
pip install lxml
soup_object = BeautifulSoup(markup, 'lxml')  # C-dependent parser
External parsers have C and Python dependencies, which may have some advantages and disadvantages.
Answer 7
I encountered the same issue. I found the reason was that I had a slightly outdated Python six package.
>>> import html5lib
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/html5lib/__init__.py", line 16, in <module>
from .html5parser import HTMLParser, parse, parseFragment
File "/usr/local/lib/python2.7/site-packages/html5lib/html5parser.py", line 2, in <module>
from six import with_metaclass, viewkeys, PY3
ImportError: cannot import name viewkeys
The error is coming from the parser you are using. In general, if you have an HTML file/code then you need to use html5lib (documentation can be found here), and in case you have an XML file/data then you need to use lxml (documentation can be found here). You can use lxml for HTML file/code as well, but sometimes it gives an error like the one above. So it is better to choose the package wisely based on the type of data/file. You can also use the built-in html.parser module, but this also sometimes does not work.
For more details on when to use which package, you can see the details here.
Leaving the parser parameter blank will result in a warning that the best available parser is being used.
soup = BeautifulSoup(html)
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
I’m trying to scrape a website, but it gives me an error.
I’m using the following code:
import urllib.request
from bs4 import BeautifulSoup
get = urllib.request.urlopen("https://www.website.com/")
html = get.read()
soup = BeautifulSoup(html)
print(soup)
And I’m getting the following error:
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 70924-70950: character maps to <undefined>
In Python 3.7, and running Windows 10 this worked (I am not sure whether it will work on other platforms and/or other versions of Python)
Replacing this line:
with open('filename', 'w') as f:
With this:
with open('filename', 'w', encoding='utf-8') as f:
The reason why it works is that the encoding is changed to UTF-8 when using the file, so characters in UTF-8 can be written out as text, instead of an error being raised when a UTF-8 character that is not supported by the current encoding is encountered.
While saving the response of a GET request, the same error was thrown on Python 3.7 on Windows 10. The response received from the URL was UTF-8 encoded, so it is always recommended to check the encoding so the same encoding can be passed when writing the file; this avoids a trivial issue that really kills a lot of time in production.
import requests
resp = requests.get('https://en.wikipedia.org/wiki/NIFTY_50')
print(resp.encoding)
with open ('NiftyList.txt', 'w') as f:
f.write(resp.text)
When I added encoding="utf-8" to the open command, it saved the file with the correct response:
with open ('NiftyList.txt', 'w', encoding="utf-8") as f:
f.write(resp.text)
I faced the same issue with the encoding, which occurs when you try to print it, read/write it, or open it. As others mentioned above, adding .encode("utf-8") will help if you are trying to print it:
soup.encode("utf-8")
If you are trying to open scraped data and maybe write it into a file, then open the file with encoding="utf-8":
with open(filename_csv, 'w', newline='', encoding="utf-8") as csv_file: