I want to get the content from the website below. If I use a browser like Firefox or Chrome I can get the real page I want, but if I use the Python requests package (or the wget command) it returns a totally different HTML page. I assume the developer of the website has put some blocks in place for this, so the question is:
How do I fake a browser visit by using python requests or command wget?
from fake_useragent import UserAgent
import requests

ua = UserAgent()
print(ua.chrome)
header = {'User-Agent': str(ua.chrome)}
print(header)
url = "https://www.hybrid-analysis.com/recent-submissions?filter=file&sort=^timestamp"
htmlContent = requests.get(url, headers=header)
print(htmlContent)
Output:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17
{'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
<Response [200]>
The root of the answer is that the person asking the question needs a JavaScript interpreter to get what they are after. What I have found is that I am often able to get all of the information I want from a website as JSON, before it is interpreted by JavaScript. This has saved me a ton of time compared with parsing HTML and hoping each page is in the same format.
So when you get a response from a website using requests, really look at the HTML/text, because you might find the JavaScript's JSON in the footer, ready to be parsed.
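For illustration, here is a minimal sketch of that approach; the URL and the window.__DATA__ variable name are hypothetical, since every site embeds its JSON differently:

import json
import re
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get('https://www.example.com/page', headers=headers).text

# look for a JSON blob assigned to a script variable (hypothetical name)
match = re.search(r'window\.__DATA__\s*=\s*(\{.*?\});', html, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    print(list(data.keys()))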
I utilized BeautifulSoup to allow me to parse any website for images. If you will be doing much web scraping (or intend to use my tool), I suggest you run sudo pip install beautifulsoup4. Information on BeautifulSoup is available here.
For convenience here is my code:
from bs4 import BeautifulSoup
from urllib2 import urlopen
import urllib
# use this image scraper from the location that
#you want to save scraped images to
def make_soup(url):
html = urlopen(url).read()
return BeautifulSoup(html)
def get_images(url):
soup = make_soup(url)
#this makes a list of bs4 element tags
images = [img for img in soup.findAll('img')]
    print (str(len(images)) + " images found.")
print 'Downloading images to current working directory.'
#compile our unicode list of image links
image_links = [each.get('src') for each in images]
for each in image_links:
filename=each.split('/')[-1]
urllib.urlretrieve(each, filename)
return image_links
#a standard call looks like this
#get_images('http://www.wookmark.com')
from urllib.error import HTTPError
from urllib.request import urlretrieve
try:
urlretrieve(image_url, image_local_path)
except FileNotFoundError as err:
print(err) # something wrong with local path
except HTTPError as err:
print(err) # something wrong with url
Or, if the additional requirement of requests is acceptable and it is an http(s) URL:
def load_requests(source_url, sink_path):
"""
Load a file from an URL (e.g. http).
Parameters
----------
source_url : str
Where to load the file from.
sink_path : str
Where the loaded file is stored.
"""
import requests
r = requests.get(source_url, stream=True)
if r.status_code == 200:
with open(sink_path, 'wb') as f:
for chunk in r:
f.write(chunk)
I made a script expanding on Yup.'s script. I fixed some things. It will now bypass 403: Forbidden problems. It won't crash when an image fails to be retrieved. It tries to avoid corrupted previews. It gets the right absolute URLs. It gives out more information. It can be run with an argument from the command line.
# getem.py
# python2 script to download all images in a given url
# use: python getem.py http://url.where.images.are
from bs4 import BeautifulSoup
import urllib2
import shutil
import requests
from urlparse import urljoin
import sys
import time
def make_soup(url):
req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"})
html = urllib2.urlopen(req)
return BeautifulSoup(html, 'html.parser')
def get_images(url):
soup = make_soup(url)
images = [img for img in soup.findAll('img')]
print (str(len(images)) + " images found.")
print 'Downloading images to current working directory.'
image_links = [each.get('src') for each in images]
for each in image_links:
try:
filename = each.strip().split('/')[-1].strip()
src = urljoin(url, each)
print 'Getting: ' + filename
response = requests.get(src, stream=True)
# delay to avoid corrupted previews
time.sleep(1)
with open(filename, 'wb') as out_file:
shutil.copyfileobj(response.raw, out_file)
except:
            print ' An error occurred. Continuing.'
print 'Done.'
if __name__ == '__main__':
url = sys.argv[1]
get_images(url)
Answer 7
Using the requests library
import requests
import shutil, os

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}

currentDir = os.getcwd()
path = os.path.join(currentDir, 'Images')  # saving images to Images folder

def ImageDl(url):
    attempts = 0
    while attempts < 5:  # retry 5 times
        try:
            filename = url.split('/')[-1]
            r = requests.get(url, headers=headers, stream=True, timeout=5)
            if r.status_code == 200:
                with open(os.path.join(path, filename), 'wb') as f:
                    r.raw.decode_content = True
                    shutil.copyfileobj(r.raw, f)
                print(filename)
                break
        except Exception as e:
            attempts += 1
            print(e)

ImageDl(url)
# getem.py
# python3 script to download all images in a given url
# use: python getem.py http://url.where.images.are
from bs4 import BeautifulSoup
import urllib.request
import shutil
import requests
from urllib.parse import urljoin
import sys
import time
def make_soup(url):
req = urllib.request.Request(url, headers={'User-Agent' : "Magic Browser"})
html = urllib.request.urlopen(req)
return BeautifulSoup(html, 'html.parser')
def get_images(url):
soup = make_soup(url)
images = [img for img in soup.findAll('img')]
print (str(len(images)) + " images found.")
print('Downloading images to current working directory.')
image_links = [each.get('src') for each in images]
for each in image_links:
try:
filename = each.strip().split('/')[-1].strip()
src = urljoin(url, each)
print('Getting: ' + filename)
response = requests.get(src, stream=True)
# delay to avoid corrupted previews
time.sleep(1)
with open(filename, 'wb') as out_file:
shutil.copyfileobj(response.raw, out_file)
except:
            print(' An error occurred. Continuing.')
print('Done.')
if __name__ == '__main__':
get_images('http://www.wookmark.com')
Answer 10
Something a bit fresh for Python 3, using Requests:
Comments are in the code. It's a ready-to-use function.
import requests
from os import path
def get_image(image_url):
"""
Get image based on url.
:return: Image name if everything OK, False otherwise
"""
image_name = path.split(image_url)[1]
try:
image = requests.get(image_url)
    except OSError:  # A little too wide, but works OK with no additional imports needed. Catches all connection problems
return False
if image.status_code == 200: # we could have retrieved error page
base_dir = path.join(path.dirname(path.realpath(__file__)), "images") # Use your own path or "" to use current working directory. Folder must exist.
with open(path.join(base_dir, image_name), "wb") as f:
f.write(image.content)
return image_name
get_image("https://apod.nasddfda.gov/apod/image/2003/S106_Mishra_1947.jpg")
import requests

img_data = requests.get('https://apod.nasa.gov/apod/image/1701/potw1636aN159_HST_2048.jpg').content
with open('file_name.jpg', 'wb') as handler:
    handler.write(img_data)
Here’s a short snippet using the SoupStrainer class in BeautifulSoup:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer
http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')
for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
if link.has_attr('href'):
print(link['href'])
The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios.
Edit: Note that I used the SoupStrainer class because it's a bit more efficient (memory- and speed-wise), if you know what you're parsing in advance.
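As a small illustration of that point, a minimal sketch: the strainer can filter on attributes as well as tag names, so only the tags you care about are ever parsed (note that html5lib ignores parse_only; use html.parser or lxml):

import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
response, content = http.request('http://www.nytimes.com')

# parse nothing but <a> tags that actually carry an href
only_links = SoupStrainer('a', href=True)
soup = BeautifulSoup(content, 'html.parser', parse_only=only_links)
for link in soup.find_all('a'):
    print(link['href'])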
Answer 1
For completeness sake, the BeautifulSoup 4 version, making use of the encoding supplied by the server as well:
from bs4 import BeautifulSoup
import urllib.request
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))
for link in soup.find_all('a', href=True):
print(link['href'])
or the Python 2 version:
from bs4 import BeautifulSoup
import urllib2
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().getparam('charset'))
for link in soup.find_all('a', href=True):
print link['href']
and a version using the requests library, which as written will work in both Python 2 and 3:
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("http://www.gpsbasecamp.com/national-parks")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
for link in soup.find_all('a', href=True):
print(link['href'])
The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.
BeautifulSoup 3 stopped development in March 2012; new projects really should use BeautifulSoup 4, always.
Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can inform BeautifulSoup of the character set found in the HTTP response headers to assist in decoding, but this can be wrong and conflict with the <meta> header info found in the HTML itself, which is why the above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding() to make sure that such embedded encoding hints win over a misconfigured server.
With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* mimetype, even if no character set was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.
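A minimal sketch of that last point, assuming you want requests itself to guess the encoding from the body rather than fall back to Latin-1 (the snippet above sidesteps the issue entirely by passing resp.content, the raw bytes, to BeautifulSoup):

import requests

resp = requests.get("http://www.gpsbasecamp.com/national-parks")
if 'charset' not in resp.headers.get('content-type', '').lower():
    # no charset declared; let requests detect it from the body instead of assuming Latin-1
    resp.encoding = resp.apparent_encoding
text = resp.text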
Others have recommended BeautifulSoup, but it’s much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It’s much, much faster than BeautifulSoup, and it even handles “broken” HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don’t want to learn the lxml API.
There’s no reason to use BeautifulSoup anymore, unless you’re on Google App Engine or something where anything not purely Python isn’t allowed.
lxml.html also supports CSS3 selectors so this sort of thing is trivial.
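For instance, a rough CSS-selector equivalent of the XPath example below, as a minimal sketch (this assumes the cssselect package is installed, which lxml's .cssselect() method relies on):

import requests
import lxml.html

dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)
for a in dom.cssselect('a[href]'):  # same elements as the XPath '//a[@href]'
    print(a.get('href'))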
An example with lxml and xpath would look like this:
import urllib
import lxml.html
connection = urllib.urlopen('http://www.nytimes.com')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'): # select the url in href for all a tags(links)
print link
Answer 3
import urllib2
import BeautifulSoup
request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a'):
if 'national-park' in a['href']:
print 'found a url with national-park in the link'
Answer 4
The following code is to retrieve all the links available in a webpage using urllib2 and BeautifulSoup4:
import urllib2
from bs4 import BeautifulSoup
url = urllib2.urlopen("http://www.espncricinfo.com/").read()
soup = BeautifulSoup(url)
for line in soup.find_all('a'):
print(line.get('href'))
Answer 5
Under the hood BeautifulSoup now uses lxml. Requests, lxml & list comprehensions makes a killer combo.
import requests
import lxml.html
dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)
[x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]
In the list comp, the "if '//' in x and 'nytimes.com' not in x" condition is a simple method to scrub the URL list of the site's 'internal' navigation URLs, etc.
Answer 6
just for getting the links, without B.soup and regex:
import urllib2
url="http://www.somewhere.com"
page=urllib2.urlopen(url)
data=page.read().split("</a>")
tag="<a href=\""
endtag="\">"
for item in data:
if "<a href" in item:
try:
ind = item.index(tag)
item=item[ind+len(tag):]
end=item.index(endtag)
except: pass
else:
print item[:end]
for more complex operations, of course BSoup is still preferred.
Answer 7
This script does what you're looking for, but it also resolves the relative links to absolute links.
import urllib
import lxml.html
import urlparse
def get_dom(url):
connection = urllib.urlopen(url)
return lxml.html.fromstring(connection.read())
def get_links(url):
return resolve_links((link for link in get_dom(url).xpath('//a/@href')))
def guess_root(links):
for link in links:
if link.startswith('http'):
parsed_link = urlparse.urlparse(link)
scheme = parsed_link.scheme + '://'
netloc = parsed_link.netloc
return scheme + netloc
def resolve_links(links):
root = guess_root(links)
for link in links:
if not link.startswith('http'):
link = urlparse.urljoin(root, link)
yield link
for link in get_links('http://www.google.com'):
print link
To find all the links, in this example we will use the urllib2 module together with the re module.
One of the most powerful functions in the re module is re.findall(). While re.search() is used to find the first match for a pattern, re.findall() finds all the matches and returns them as a list of strings, with each string representing one match.
import urllib2
import re
#connect to a URL
website = urllib2.urlopen(url)
#read html code
html = website.read()
#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)
print links
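To make the re.search() versus re.findall() distinction above concrete, here is a small self-contained sketch (the sample HTML string is made up for illustration):

import re

html = '<a href="http://example.com">one</a> <a href="https://example.org">two</a>'

# re.search: first match only, returned as a match object (or None)
first = re.search('"((http|ftp)s?://.*?)"', html)
print(first.group(1))  # http://example.com

# re.findall: every match; because the pattern has two groups, each match is a tuple
print(re.findall('"((http|ftp)s?://.*?)"', html))
# [('http://example.com', 'http'), ('https://example.org', 'http')]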
Answer 9
Why not use regular expressions:
import urllib2
import re
url ="http://www.somewhere.com"
page = urllib2.urlopen(url)
page = page.read()
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)for link in links:print('href: %s, HTML text: %s'%(link[0], link[1]))
Links can be within a variety of attributes, so you could pass a list of those attributes to select.
For example, with the src and href attributes (here I am using the starts-with ^ operator to specify that either of these attribute values starts with http; you can tailor this as required):
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://stackoverflow.com/')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]') ]
print(links)
Here's an example using @ars' accepted answer and the BeautifulSoup4, requests, and wget modules to handle the downloads.
import requests
import wget
import os
from bs4 import BeautifulSoup, SoupStrainer
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'
response = requests.get(url)
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
if link.has_attr('href'):
if file_type in link['href']:
full_path = url + link['href']
wget.download(full_path)
Answer 12
I found the answer by @Blairg23 working, after the following correction (covering the scenario where it failed to work correctly):
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
if link.has_attr('href'):
if file_type in link['href']:
            full_path = urlparse.urljoin(url, link['href'])  # the urlparse module needs to be imported
wget.download(full_path)
For Python 3:
urllib.parse.urljoin has to be used in order to obtain the full URL instead.
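For instance, a minimal sketch of urllib.parse.urljoin in Python 3, reusing the archive URL from the example above (the file name is made up for illustration):

from urllib.parse import urljoin

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
# a relative name resolves against the trailing directory of `url`
print(urljoin(url, 'some_file.tar.gz'))
# a leading slash resolves against the site root instead
print(urljoin(url, '/other/path/file.tar.gz'))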
BeautifulSoup's own parser can be slow. It might be more feasible to use lxml, which is capable of parsing directly from a URL (with some limitations mentioned below).
import lxml.html
doc = lxml.html.parse(url)
links = doc.xpath('//a[@href]')
for link in links:
print link.attrib['href']
The code above will return the links as is, and in most cases they would be relative links or absolute links from the site root. Since my use case was to extract only a certain type of link, below is a version that converts the links to full URLs and which optionally accepts a glob pattern like *.mp3. It won't handle single and double dots in the relative paths, though, but so far I haven't had the need for it. If you need to parse URL fragments containing ../ or ./ then urlparse.urljoin might come in handy.
NOTE: Direct lxml url parsing doesn’t handle loading from https and doesn’t do redirects, so for this reason the version below is using urllib2 + lxml.
#!/usr/bin/env python
import sys
import urllib2
import urlparse
import lxml.html
import fnmatch
try:
import urltools as urltools
except ImportError:
sys.stderr.write('To normalize URLs run: `pip install urltools --user`')
urltools = None
def get_host(url):
p = urlparse.urlparse(url)
return "{}://{}".format(p.scheme, p.netloc)
if __name__ == '__main__':
url = sys.argv[1]
host = get_host(url)
glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'
doc = lxml.html.parse(urllib2.urlopen(url))
links = doc.xpath('//a[@href]')
for link in links:
href = link.attrib['href']
if fnmatch.fnmatch(href, glob_patt):
            if not href.startswith(('http://', 'https://', 'ftp://')):
if href.startswith('/'):
href = host + href
else:
parent_url = url.rsplit('/', 1)[0]
href = urlparse.urljoin(parent_url, href)
if urltools:
href = urltools.normalize(href)
print href
import urllib2
from bs4 import BeautifulSoup
a=urllib2.urlopen('http://dir.yahoo.com')
code=a.read()
soup=BeautifulSoup(code)
links=soup.findAll("a")
#To get href part alone
print links[0].attrs['href']
Answer 15
There can be many duplicate links together with both external and internal links. To differentiate between the two and just get unique links using sets:
# Python 3.
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup
url = "http://www.espncricinfo.com/"
resp = urllib.request.urlopen(url)
# Get server encoding per recommendation of Martijn Pieters.
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
external_links = set()
internal_links = set()
for line in soup.find_all('a'):
link = line.get('href')
if not link:
continue
if link.startswith('http'):
external_links.add(link)
else:
internal_links.add(link)
# Depending on usage, full internal links may be preferred.
full_internal_links = {
urllib.parse.urljoin(url, internal_link)
for internal_link in internal_links
}
# Print all unique external and full internal links.
for link in external_links.union(full_internal_links):
print(link)
<select id="fruits01" class="select" name="fruits">
  <option value="0">Choose your fruits:</option>
  <option value="1">Banana</option>
  <option value="2">Mango</option>
</select>
Selenium provides a convenient Select class to work with select -> option constructs:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
driver = webdriver.Firefox()
driver.get('url')
select = Select(driver.find_element_by_id('fruits01'))
# select by visible text
select.select_by_visible_text('Banana')
# select by value
select.select_by_value('1')
First you need to import the Select class, and then you need to create an instance of the Select class.
After creating the instance of the Select class, you can perform select methods on that instance to select an option from the dropdown list.
Here is the code
from selenium.webdriver.support.select import Select
select_fr = Select(driver.find_element_by_id("fruits01"))
select_fr.select_by_index(0)
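Select also offers selection by value and by visible text. A minimal sketch using the <select id="fruits01"> markup shown earlier ('url' is a placeholder for the page containing it):

from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()
driver.get('url')  # the page containing the <select id="fruits01"> markup above

select_fr = Select(driver.find_element_by_id("fruits01"))
select_fr.select_by_index(1)                # second <option> (0-based), i.e. Banana
select_fr.select_by_value("1")              # <option value="1">, also Banana
select_fr.select_by_visible_text("Banana")  # match on the text shown to the user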
I tried many things, but my dropdown was inside a table and I was not able to perform a simple select operation. Only the solution below worked. Here I am highlighting the dropdown element and pressing the down arrow until I get the desired value:
#identify the drop down element
elem = browser.find_element_by_name(objectVal)
for option in elem.find_elements_by_tag_name('option'):
if option.text == value:
break
else:
ARROW_DOWN = u'\ue015'
elem.send_keys(ARROW_DOWN)
Answer 5
You do not need to click anything. Find the element using XPath or whichever method you choose, and then use send keys.
For example, with this HTML:
<select id="fruits01"class="select" name="fruits"><option value="0">Choose your fruits:</option><option value="1">Banana</option><option value="2">Mango</option></select>
In this way you can select all the options in any dropdowns.
driver.get("https://www.spectrapremium.com/en/aftermarket/north-america")
print( "The title is : " + driver.title)
inputs = Select(driver.find_element_by_css_selector('#year'))
input1 = len(inputs.options)
for items in range(input1):
inputs.select_by_index(items)
time.sleep(1)
The best way is to use the selenium.webdriver.support.ui.Select class to work with dropdown selections, but sometimes it does not work as expected due to design issues or other problems in the HTML.
In this type of situation you can also, as an alternate solution, use execute_script() as below:
option_visible_text = "Banana"
select = driver.find_element_by_id("fruits01")
#now use this to select option from dropdown by visible text
driver.execute_script("var select = arguments[0]; for(var i = 0; i < select.options.length; i++){ if(select.options[i].text == arguments[1]){ select.options[i].selected = true; } }", select, option_visible_text);
Answer 11
As per the HTML provided:
<select id="fruits01"class="select" name="fruits"><option value="0">Choose your fruits:</option><option value="1">Banana</option><option value="2">Mango</option></select>
To select an <option> element from an html-select menu you have to use the Select class. Moreover, as you have to interact with the drop-down menu, you have to induce WebDriverWait for element_to_be_clickable().
To select the <option> with text as Mango from the dropdown, you can use either of the following locator strategies:
Using the ID attribute and the select_by_visible_text() method:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
select = Select(WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, "fruits01"))))
select.select_by_visible_text("Mango")
Using XPATH and the select_by_index() method:
select = Select(WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//select[@class='select' and @name='fruits']"))))
select.select_by_index(2)
Answer 12
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.Select;

public class ListBoxMultiple {

    public static void main(String[] args) throws InterruptedException {
// TODO Auto-generated method stub
System.setProperty("webdriver.chrome.driver", "./drivers/chromedriver.exe");
WebDriver driver=new ChromeDriver();
driver.get("file:///C:/Users/Amitabh/Desktop/hotel2.html");//open the website
driver.manage().window().maximize();
WebElement hotel = driver.findElement(By.id("maarya"));//get the element
Select sel=new Select(hotel);//for handling list box
//isMultiple
if(sel.isMultiple()){
System.out.println("it is multi select list");
}
else{
System.out.println("it is single select list");
}
//select option
sel.selectByIndex(1);// you can select by index values
sel.selectByValue("p");//you can select by value
sel.selectByVisibleText("Fish");// you can also select by visible text of the options
//deselect option but this is possible only in case of multiple lists
Thread.sleep(1000);
sel.deselectByIndex(1);
sel.deselectAll();
//getOptions
List<WebElement> options = sel.getOptions();
int count=options.size();
System.out.println("Total options: "+count);
for(WebElement opt:options){ // getting text of every elements
String text=opt.getText();
System.out.println(text);
}
//select all options
for(int i=0;i<count;i++){
sel.selectByIndex(i);
Thread.sleep(1000);
}
driver.quit();
    }
}
I'm having trouble parsing HTML elements with a "class" attribute using BeautifulSoup. The code looks like this:
soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs:
if (div["class"] == "stylelistrow"):
print div
I get an error on the same line “after” the script finishes.
File "./beautifulcoding.py", line 130, in getlanguage
if (div["class"] == "stylelistrow"):
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 599, in __getitem__
return self._getAttrMap()[key]
KeyError: 'class'
To be clear, this selects only the p tags that have both the strikeout and body classes.
To find tags matching any of a set of classes (the union, not the intersection), you can give a list to the class_ keyword argument (as of 4.1.2):
soup = BeautifulSoup(sdata)
class_list = ["stylelistrow"] # can add any other classes to this list.
# will find any divs with any names in class_list:
mydivs = soup.find_all('div', class_=class_list)
Also note that findAll has been renamed from the camelCase to the more Pythonic find_all.
This works for me to access the class attribute (on BeautifulSoup 4, contrary to what the documentation says). The KeyError comes from a list being returned, not a dictionary.
for hit in soup.findAll(name='span'):
print hit.contents[1]['class']
for div in mydivs:
try:
clazz = div["class"]
except KeyError:
clazz = ""
if (clazz == "stylelistrow"):
print div
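Since bs4 returns the class attribute as a list and raises KeyError when it is absent, a membership test with .get() handles both cases at once; a minimal sketch:

from bs4 import BeautifulSoup

soup = BeautifulSoup(sdata)  # sdata: the HTML string from the question, assumed defined
for div in soup.find_all('div'):
    # .get() returns a default instead of raising KeyError,
    # and in bs4 "class" is a list of class names
    if "stylelistrow" in div.get("class", []):
        print(div)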
Answer 12
Alternatively, we can use lxml; it supports XPath and is very fast!
from lxml import html, etree
attr = html.fromstring(html_text)  # passing the raw html
handles = attr.xpath('//div[@class="stylelistrow"]')  # xpath expression to find that specific class
for each in handles:
    print(etree.tostring(each))  # printing the html as a string
Answer 13
This should work:
soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs:
    if div.find(class_="stylelistrow"):
        print div
In other answers, findAll is used on the soup object itself, but I needed a way to do a find by class name on objects nested inside a specific element extracted from the object I obtained after doing findAll.
If you are trying to do a search inside nested HTML elements to get objects by class name, try below –
# parse html
page_soup = soup(web_page.read(), "html.parser")
# filter out items matching class name
all_songs = page_soup.findAll("li", "song_item")
# traverse through all_songs
for song in all_songs:
# get text out of span element matching class 'song_name'
# doing a 'find' by class name within a specific song element taken out of 'all_songs' collection
song.find("span", "song_name").text
Points to note:
I'm not explicitly defining the search to be on the 'class' attribute, as in findAll("li", {"class": "song_item"}), since it's the only attribute I'm searching on, and BeautifulSoup will by default search the class attribute if you don't explicitly tell it which attribute you want to match on.
When you do a findAll or find, the resulting object is of class bs4.element.ResultSet which is a subclass of list. You can utilize all methods of ResultSet, inside any number of nested elements (as long as they are of type ResultSet) to do a find or find all.
Replace 'totalcount' with your class name and 'span' with the tag you are looking for. Also, if your class contains multiple names separated by spaces, just choose one and use it.
P.S. This finds the first element with the given criteria. If you want to find all elements, then replace 'find' with 'find_all'.
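The snippet this answer refers to is not included above; here is a minimal sketch of what such a find-by-class call looks like, with 'span' and 'totalcount' taken from the surrounding text:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc: your page HTML, assumed already fetched

first = soup.find('span', class_='totalcount')      # first matching element (or None)
every = soup.find_all('span', class_='totalcount')  # all matching elements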