Tag archive: web-scraping

How to fake a browser visit using Python requests?

Question: How to fake a browser visit using Python requests?


I want to get the content from the website below. If I use a browser like Firefox or Chrome, I can get the real website page I want, but if I use the Python requests package (or the wget command), it returns a totally different HTML page. I think the developer of the website made some blocks for this, so the question is:

How do I fake a browser visit by using Python requests or the wget command?

http://www.ichangtou.com/#company:data_000008.html


Answer 0


Provide a User-Agent header:

import requests

url = 'http://www.ichangtou.com/#company:data_000008.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
print(response.content)

FYI, here is a list of User-Agent strings for different browsers:


As a side note, there is a pretty useful third-party package called fake-useragent that provides a nice abstraction layer over user agents:

fake-useragent

Up to date simple useragent faker with real world database

Demo:

>>> from fake_useragent import UserAgent
>>> ua = UserAgent()
>>> ua.chrome
u'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
>>> ua.random
u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36'

Answer 1


If this question is still valid:

I used fake-useragent.

How to use it:

from fake_useragent import UserAgent
import requests


ua = UserAgent()
print(ua.chrome)
header = {'User-Agent':str(ua.chrome)}
print(header)
url = "https://www.hybrid-analysis.com/recent-submissions?filter=file&sort=^timestamp"
htmlContent = requests.get(url, headers=header)
print(htmlContent)

Output:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17
{'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
<Response [200]>

Answer 2


Try doing this, using Firefox as a fake user agent (moreover, it's a good starter script for web scraping with cookies):

#!/usr/bin/env python2
# -*- coding: utf8 -*-
# vim:ts=4:sw=4


import cookielib, urllib2, sys

def doIt(uri):
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    # set the fake User-Agent on the opener *before* opening the URL
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    page = opener.open(uri)
    print page.read()

for i in sys.argv[1:]:
    doIt(i)

USAGE:

python script.py "http://www.ichangtou.com/#company:data_000008.html"

Answer 3


The root of the answer is that the person asking the question needs a JavaScript interpreter to get what they are after. What I have found is that I am able to get all of the information I wanted from a website as JSON before it is interpreted by JavaScript. This has saved me a ton of time compared with parsing HTML and hoping each web page is in the same format.

So when you get a response from a website using requests, really look at the HTML/text, because you might find the JSON used by the JavaScript in the footer, ready to be parsed.
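A minimal sketch of that idea, looking for a JSON blob embedded in the fetched page. The pageData variable name here is hypothetical; inspect the real page source to find the actual assignment to extract:

import json
import re

import requests

url = 'http://www.ichangtou.com/#company:data_000008.html'
headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get(url, headers=headers).text

# 'pageData' is a placeholder name; adjust the pattern to what the page really embeds
match = re.search(r'var\s+pageData\s*=\s*(\{.*?\});', html, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    print(data)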


How do I save an image locally using Python when I already know the URL address?

Question: How do I save an image locally using Python when I already know the URL address?


I know the URL of an image on Internet.

e.g. http://www.digimouth.com/news/media/2011/09/google-logo.jpg, which contains the logo of Google.

Now, how can I download this image using Python without actually opening the URL in a browser and saving the file manually?


Answer 0


Python 2

Here is a more straightforward way if all you want to do is save it as a file:

import urllib

urllib.urlretrieve("http://www.digimouth.com/news/media/2011/09/google-logo.jpg", "local-filename.jpg")

The second argument is the local path where the file should be saved.

Python 3

As SergO suggested the code below should work with Python 3.

import urllib.request

urllib.request.urlretrieve("http://www.digimouth.com/news/media/2011/09/google-logo.jpg", "local-filename.jpg")

Answer 1


import urllib
resource = urllib.urlopen("http://www.digimouth.com/news/media/2011/09/google-logo.jpg")
output = open("file01.jpg","wb")
output.write(resource.read())
output.close()

file01.jpg will contain your image.


Answer 2


I wrote a script that does just this, and it is available on my github for your use.

I utilized BeautifulSoup to allow me to parse any website for images. If you will be doing much web scraping (or intend to use my tool), I suggest you pip install beautifulsoup4 (the bs4 package imported below). Information on BeautifulSoup is available here.

For convenience here is my code:

from bs4 import BeautifulSoup
from urllib2 import urlopen
import urllib

# use this image scraper from the location that 
#you want to save scraped images to

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html)

def get_images(url):
    soup = make_soup(url)
    #this makes a list of bs4 element tags
    images = [img for img in soup.findAll('img')]
    print (str(len(images)) + " images found.")
    print 'Downloading images to current working directory.'
    #compile our unicode list of image links
    image_links = [each.get('src') for each in images]
    for each in image_links:
        filename=each.split('/')[-1]
        urllib.urlretrieve(each, filename)
    return image_links

#a standard call looks like this
#get_images('http://www.wookmark.com')

Answer 3


This can be done with requests. Load the page and dump the binary content to a file.

import os
import requests

url = 'https://apod.nasa.gov/apod/image/1701/potw1636aN159_HST_2048.jpg'
page = requests.get(url)

f_ext = os.path.splitext(url)[-1]
f_name = 'img{}'.format(f_ext)
with open(f_name, 'wb') as f:
    f.write(page.content)

Answer 4


Python 3

urllib.request — Extensible library for opening URLs

from urllib.error import HTTPError
from urllib.request import urlretrieve

try:
    urlretrieve(image_url, image_local_path)
except FileNotFoundError as err:
    print(err)   # something wrong with local path
except HTTPError as err:
    print(err)  # something wrong with url
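A usage sketch, filling in the image_url and image_local_path placeholders with the URL and filename from the question:

image_url = "http://www.digimouth.com/news/media/2011/09/google-logo.jpg"
image_local_path = "local-filename.jpg"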

Answer 5


A solution which works with Python 2 and Python 3:

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

url = "http://www.digimouth.com/news/media/2011/09/google-logo.jpg"
urlretrieve(url, "local-filename.jpg")

or, if the additional requirement of requests is acceptable and if it is a http(s) URL:

def load_requests(source_url, sink_path):
    """
    Load a file from an URL (e.g. http).

    Parameters
    ----------
    source_url : str
        Where to load the file from.
    sink_path : str
        Where the loaded file is stored.
    """
    import requests
    r = requests.get(source_url, stream=True)
    if r.status_code == 200:
        with open(sink_path, 'wb') as f:
            for chunk in r:
                f.write(chunk)
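A usage sketch with the URL from the question:

load_requests("http://www.digimouth.com/news/media/2011/09/google-logo.jpg", "local-filename.jpg")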

Answer 6


I made a script expanding on Yup.'s script. I fixed some things. It will now bypass 403 Forbidden problems. It won't crash when an image fails to be retrieved. It tries to avoid corrupted previews. It gets the right absolute URLs. It gives out more information. It can be run with an argument from the command line.

# getem.py
# python2 script to download all images in a given url
# use: python getem.py http://url.where.images.are

from bs4 import BeautifulSoup
import urllib2
import shutil
import requests
from urlparse import urljoin
import sys
import time

def make_soup(url):
    req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"}) 
    html = urllib2.urlopen(req)
    return BeautifulSoup(html, 'html.parser')

def get_images(url):
    soup = make_soup(url)
    images = [img for img in soup.findAll('img')]
    print (str(len(images)) + " images found.")
    print 'Downloading images to current working directory.'
    image_links = [each.get('src') for each in images]
    for each in image_links:
        try:
            filename = each.strip().split('/')[-1].strip()
            src = urljoin(url, each)
            print 'Getting: ' + filename
            response = requests.get(src, stream=True)
            # delay to avoid corrupted previews
            time.sleep(1)
            with open(filename, 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)
        except:
            print '  An error occured. Continuing.'
    print 'Done.'

if __name__ == '__main__':
    url = sys.argv[1]
    get_images(url)

Answer 7


Using the requests library

import requests
import shutil,os

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
currentDir = os.getcwd()
path = os.path.join(currentDir, 'Images')  # saving images to an "Images" folder
os.makedirs(path, exist_ok=True)           # make sure the folder exists

def ImageDl(url):
    attempts = 0
    while attempts < 5:  # retry 5 times
        try:
            filename = url.split('/')[-1]
            r = requests.get(url, headers=headers, stream=True, timeout=5)
            if r.status_code == 200:
                with open(os.path.join(path, filename), 'wb') as f:
                    r.raw.decode_content = True
                    shutil.copyfileobj(r.raw, f)
            print(filename)
            break
        except Exception as e:
            attempts += 1
            print(e)


ImageDl("http://www.digimouth.com/news/media/2011/09/google-logo.jpg")  # example URL from the question

Answer 8


This is a very short answer.

import urllib
urllib.urlretrieve("http://photogallery.sandesh.com/Picture.aspx?AlubumId=422040", "Abc.jpg")

Answer 9


Version for Python 3

I adjusted the code of @madprops for Python 3

# getem.py
# python3 version of the script above to download all images in a given url

from bs4 import BeautifulSoup
import urllib.request
import shutil
import requests
from urllib.parse import urljoin
import sys
import time

def make_soup(url):
    req = urllib.request.Request(url, headers={'User-Agent' : "Magic Browser"}) 
    html = urllib.request.urlopen(req)
    return BeautifulSoup(html, 'html.parser')

def get_images(url):
    soup = make_soup(url)
    images = [img for img in soup.findAll('img')]
    print (str(len(images)) + " images found.")
    print('Downloading images to current working directory.')
    image_links = [each.get('src') for each in images]
    for each in image_links:
        try:
            filename = each.strip().split('/')[-1].strip()
            src = urljoin(url, each)
            print('Getting: ' + filename)
            response = requests.get(src, stream=True)
            # delay to avoid corrupted previews
            time.sleep(1)
            with open(filename, 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)
        except:
            print('  An error occured. Continuing.')
    print('Done.')

if __name__ == '__main__':
    get_images('http://www.wookmark.com')

Answer 10


Something fresh for Python 3 using Requests:

Comments are in the code. It's a ready-to-use function.


import requests
from os import path

def get_image(image_url):
    """
    Get image based on url.
    :return: Image name if everything OK, False otherwise
    """
    image_name = path.split(image_url)[1]
    try:
        image = requests.get(image_url)
    except OSError:  # A little too wide, but works OK; no additional imports needed. Catches all connection problems
        return False
    if image.status_code == 200:  # we could have retrieved error page
        base_dir = path.join(path.dirname(path.realpath(__file__)), "images") # Use your own path or "" to use current working directory. Folder must exist.
        with open(path.join(base_dir, image_name), "wb") as f:
            f.write(image.content)
        return image_name

get_image("https://apod.nasddfda.gov/apod/image/2003/S106_Mishra_1947.jpg")


Answer 11


Late answer, but for python>=3.6 you can use dload, i.e.:

import dload
dload.save("http://www.digimouth.com/news/media/2011/09/google-logo.jpg")

if you need the image as bytes, use:

img_bytes = dload.bytes("http://www.digimouth.com/news/media/2011/09/google-logo.jpg")

install using pip3 install dload


Answer 12

import requests

img_data = requests.get('https://apod.nasa.gov/apod/image/1701/potw1636aN159_HST_2048.jpg').content

with open('file_name.jpg', 'wb') as handler:
    handler.write(img_data)

Retrieve links from a web page using python and BeautifulSoup

Question: Retrieve links from a web page using python and BeautifulSoup


How can I retrieve the links of a webpage and copy the url address of the links using Python?


Answer 0


Here’s a short snippet using the SoupStrainer class in BeautifulSoup:

import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])

The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Edit: Note that I used the SoupStrainer class because it’s a bit more efficient (memory and speed wise), if you know what you’re parsing in advance.
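As a small variation on the snippet above, the strainer can be narrowed to only <a> tags that actually carry an href (a sketch reusing the response content already fetched with httplib2):

from bs4 import BeautifulSoup, SoupStrainer

links_only = SoupStrainer('a', href=True)
soup = BeautifulSoup(response, 'html.parser', parse_only=links_only)
print([a['href'] for a in soup.find_all('a')])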


Answer 1


For completeness sake, the BeautifulSoup 4 version, making use of the encoding supplied by the server as well:

from bs4 import BeautifulSoup
import urllib.request

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=True):
    print(link['href'])

or the Python 2 version:

from bs4 import BeautifulSoup
import urllib2

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().getparam('charset'))

for link in soup.find_all('a', href=True):
    print link['href']

and a version using the requests library, which as written will work in both Python 2 and 3:

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("http://www.gpsbasecamp.com/national-parks")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

for link in soup.find_all('a', href=True):
    print(link['href'])

The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.

BeautifulSoup 3 stopped development in March 2012; new projects really should use BeautifulSoup 4, always.

Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can inform BeautifulSoup of the characterset found in the HTTP response headers to assist in decoding, but this can be wrong and conflicting with a <meta> header info found in the HTML itself, which is why the above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding() to make sure that such embedded encoding hints win over a misconfigured server.

With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* mimetype, even if no characterset was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.
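A minimal sketch of that header check using requests alone: when the server did not declare a charset, fall back to requests' own detection (apparent_encoding) instead of the Latin-1 default, rather than the EncodingDetector route above:

import requests

resp = requests.get("http://www.gpsbasecamp.com/national-parks")
declared = 'charset' in resp.headers.get('content-type', '').lower()
if not declared:
    # no charset in the Content-Type header: don't trust the Latin-1 default
    resp.encoding = resp.apparent_encoding
print(resp.encoding)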


Answer 2


Others have recommended BeautifulSoup, but it’s much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It’s much, much faster than BeautifulSoup, and it even handles “broken” HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don’t want to learn the lxml API.

Ian Bicking agrees.

There’s no reason to use BeautifulSoup anymore, unless you’re on Google App Engine or something where anything not purely Python isn’t allowed.

lxml.html also supports CSS3 selectors so this sort of thing is trivial.

An example with lxml and xpath would look like this:

import urllib
import lxml.html
connection = urllib.urlopen('http://www.nytimes.com')

dom =  lxml.html.fromstring(connection.read())

for link in dom.xpath('//a/@href'): # select the url in href for all a tags(links)
    print link
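For instance, a minimal sketch of the CSS selector route mentioned above (this assumes the separate cssselect package is installed alongside lxml):

import requests
import lxml.html

dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)
for a in dom.cssselect('a[href]'):  # CSS selector instead of XPath
    print(a.get('href'))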

Answer 3

import urllib2
import BeautifulSoup

request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a'):
  if 'national-park' in a['href']:
    print 'found a url with national-park in the link'

Answer 4


The following code retrieves all the links available in a webpage using urllib2 and BeautifulSoup4:

import urllib2
from bs4 import BeautifulSoup

url = urllib2.urlopen("http://www.espncricinfo.com/").read()
soup = BeautifulSoup(url)

for line in soup.find_all('a'):
    print(line.get('href'))

Answer 5


Under the hood BeautifulSoup now uses lxml. Requests, lxml & list comprehensions makes a killer combo.

import requests
import lxml.html

dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)

[x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]

In the list comprehension, the "if '//' in x and 'url.com' not in x" condition is a simple method to scrub the URL list of the site's 'internal' navigation URLs, etc.


Answer 6


Just for getting the links, without BeautifulSoup or regex:

import urllib2
url="http://www.somewhere.com"
page=urllib2.urlopen(url)
data=page.read().split("</a>")
tag="<a href=\""
endtag="\">"
for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item=item[ind+len(tag):]
            end=item.index(endtag)
        except: pass
        else:
            print item[:end]

for more complex operations, of course BSoup is still preferred.


Answer 7


This script does what you're looking for, but it also resolves the relative links to absolute links.

import urllib
import lxml.html
import urlparse

def get_dom(url):
    connection = urllib.urlopen(url)
    return lxml.html.fromstring(connection.read())

def get_links(url):
    return resolve_links((link for link in get_dom(url).xpath('//a/@href')))

def guess_root(links):
    for link in links:
        if link.startswith('http'):
            parsed_link = urlparse.urlparse(link)
            scheme = parsed_link.scheme + '://'
            netloc = parsed_link.netloc
            return scheme + netloc

def resolve_links(links):
    root = guess_root(links)
    for link in links:
        if not link.startswith('http'):
            link = urlparse.urljoin(root, link)
        yield link  

for link in get_links('http://www.google.com'):
    print link

Answer 8


To find all the links, in this example we will use the urllib2 module together with the re module. One of the most powerful functions in the re module is re.findall(). While re.search() is used to find the first match for a pattern, re.findall() finds all the matches and returns them as a list of strings, with each string representing one match.

import urllib2
import re

# connect to a URL (set this to the page you want to scan)
url = "http://www.somewhere.com"
website = urllib2.urlopen(url)

# read html code
html = website.read()

#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)

print links
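Note that because the pattern contains two capture groups, re.findall() returns (url, protocol) tuples here; a small sketch that prints just the URLs from the links list above:

for url, protocol in links:
    print(url)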

Answer 9


Why not use regular expressions:

import urllib2
import re
url = "http://www.somewhere.com"
page = urllib2.urlopen(url)
page = page.read()
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
    print('href: %s, HTML text: %s' % (link[0], link[1]))

Answer 10


Links can be within a variety of attributes, so you could pass a list of those attributes to select.

For example, using the src and href attributes (here I am using the starts-with ^ operator to specify that either of these attribute values starts with http). You can tailor this as required.

from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://stackoverflow.com/')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]') ]
print(links)

Attribute = value selectors

[attr^=value]

Represents elements with an attribute name of attr whose value is prefixed (preceded) by value.


Answer 11


Here's an example using @ars' accepted answer and the BeautifulSoup4, requests, and wget modules to handle the downloads.

import requests
import wget
import os

from bs4 import BeautifulSoup, SoupStrainer

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'

response = requests.get(url)

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = url + link['href']
            wget.download(full_path)

Answer 12


I found the answer by @Blairg23 working, after the following correction (covering the scenario where it failed to work correctly):

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path =urlparse.urljoin(url , link['href']) #module urlparse need to be imported
            wget.download(full_path)

For Python 3:

urllib.parse.urljoin has to be used in order to obtain the full URL instead.
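For clarity, a minimal Python 3 sketch of that join, reusing the url and link names from the snippet above:

from urllib.parse import urljoin

full_path = urljoin(url, link['href'])  # resolves relative hrefs against the base url
wget.download(full_path)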


Answer 13


BeautifulSoup's own parser can be slow. It might be more feasible to use lxml, which is capable of parsing directly from a URL (with some limitations mentioned below).

import lxml.html

doc = lxml.html.parse(url)

links = doc.xpath('//a[@href]')

for link in links:
    print link.attrib['href']

The code above will return the links as is, and in most cases they would be relative links or absolute from the site root. Since my use case was to only extract a certain type of links, below is a version that converts the links to full URLs and which optionally accepts a glob pattern like *.mp3. It won’t handle single and double dots in the relative paths though, but so far I didn’t have the need for it. If you need to parse URL fragments containing ../ or ./ then urlparse.urljoin might come in handy.

NOTE: Direct lxml url parsing doesn’t handle loading from https and doesn’t do redirects, so for this reason the version below is using urllib2 + lxml.

#!/usr/bin/env python
import sys
import urllib2
import urlparse
import lxml.html
import fnmatch

try:
    import urltools as urltools
except ImportError:
    sys.stderr.write('To normalize URLs run: `pip install urltools --user`')
    urltools = None


def get_host(url):
    p = urlparse.urlparse(url)
    return "{}://{}".format(p.scheme, p.netloc)


if __name__ == '__main__':
    url = sys.argv[1]
    host = get_host(url)
    glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'

    doc = lxml.html.parse(urllib2.urlopen(url))
    links = doc.xpath('//a[@href]')

    for link in links:
        href = link.attrib['href']

        if fnmatch.fnmatch(href, glob_patt):

            if not href.startswith(('http://', 'https://', 'ftp://')):

                if href.startswith('/'):
                    href = host + href
                else:
                    parent_url = url.rsplit('/', 1)[0]
                    href = urlparse.urljoin(parent_url, href)

                    if urltools:
                        href = urltools.normalize(href)

            print href

The usage is as follows:

getlinks.py http://stackoverflow.com/a/37758066/191246
getlinks.py http://stackoverflow.com/a/37758066/191246 "*users*"
getlinks.py http://fakedomain.mu/somepage.html "*.mp3"

Answer 14

import urllib2
from bs4 import BeautifulSoup
a=urllib2.urlopen('http://dir.yahoo.com')
code=a.read()
soup=BeautifulSoup(code)
links=soup.findAll("a")
#To get href part alone
print links[0].attrs['href']

Answer 15


There can be many duplicate links together with both external and internal links. To differentiate between the two and just get unique links using sets:

# Python 3.
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup

url = "http://www.espncricinfo.com/"
resp = urllib.request.urlopen(url)
# Get server encoding per recommendation of Martijn Pieters.
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))  
external_links = set()
internal_links = set()
for line in soup.find_all('a'):
    link = line.get('href')
    if not link:
        continue
    if link.startswith('http'):
        external_links.add(link)
    else:
        internal_links.add(link)

# Depending on usage, full internal links may be preferred.
full_internal_links = {
    urllib.parse.urljoin(url, internal_link) 
    for internal_link in internal_links
}

# Print all unique external and full internal links.
for link in external_links.union(full_internal_links):
    print(link)

How to select a drop-down menu value using Selenium with Python?

Question: How to select a drop-down menu value using Selenium with Python?


I need to select an element from a drop-down menu.

For example:

<select id="fruits01" class="select" name="fruits">
  <option value="0">Choose your fruits:</option>
  <option value="1">Banana</option>
  <option value="2">Mango</option>
</select>

1) First I have to click on it. I do this:

inputElementFruits = driver.find_element_by_xpath("//select[id='fruits']").click()

2) After that I have to select the desired element, let's say Mango.

I tried to do it with inputElementFruits.send_keys(...) but it did not work.


Answer 0


Unless your click is firing some kind of ajax call to populate your list, you don’t actually need to execute the click.

Just find the element and then enumerate the options, selecting the option(s) you want.

Here is an example:

from selenium import webdriver
b = webdriver.Firefox()
b.find_element_by_xpath("//select[@name='element_name']/option[text()='option_text']").click()

You can read more in:
https://sqa.stackexchange.com/questions/1355/unable-to-select-an-option-using-seleniums-python-webdriver
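As a sketch of the "enumerate the options" idea, using the id and option text from the question's HTML (the page URL is a placeholder):

from selenium import webdriver

b = webdriver.Firefox()
b.get('http://example.com/page-with-the-select')  # placeholder URL

select_elem = b.find_element_by_id('fruits01')
for option in select_elem.find_elements_by_tag_name('option'):
    if option.text == 'Mango':
        option.click()
        break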


Answer 1


Selenium provides a convenient Select class to work with select -> option constructs:

from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()
driver.get('url')

select = Select(driver.find_element_by_id('fruits01'))

# select by visible text
select.select_by_visible_text('Banana')

# select by value 
select.select_by_value('1')

See also:


Answer 2


First you need to import the Select class and then create an instance of it. After creating the instance, you can call its select methods to choose an option from the dropdown list. Here is the code:

from selenium.webdriver.support.select import Select

select_fr = Select(driver.find_element_by_id("fruits01"))
select_fr.select_by_index(0)

Answer 3


I hope this code will help you.

from selenium.webdriver.support.ui import Select

dropdown element with id

ddelement= Select(driver.find_element_by_id('id_of_element'))

dropdown element with xpath

ddelement= Select(driver.find_element_by_xpath('xpath_of_element'))

dropdown element with css selector

ddelement= Select(driver.find_element_by_css_selector('css_selector_of_element'))

Selecting ‘Banana’ from a dropdown

  1. Using the index of the dropdown

ddelement.select_by_index(1)

  2. Using the value of the dropdown

ddelement.select_by_value('1')

  3. You can match the text which is displayed in the drop down.

ddelement.select_by_visible_text('Banana')


Answer 4


I tried a lot of things, but my drop down was inside a table and I was not able to perform a simple select operation. Only the solution below worked. Here I am highlighting the drop down element and pressing the down arrow until I get the desired value:

        #identify the drop down element
        elem = browser.find_element_by_name(objectVal)
        for option in elem.find_elements_by_tag_name('option'):
            if option.text == value:
                break

            else:
                ARROW_DOWN = u'\ue015'
                elem.send_keys(ARROW_DOWN)

Answer 5


You don't have to click anything. Find the element by XPath (or however you choose) and then use send_keys.

For your example: HTML:

<select id="fruits01" class="select" name="fruits">
    <option value="0">Choose your fruits:</option>
    <option value="1">Banana</option>
    <option value="2">Mango</option>
</select>

Python:

fruit_field = browser.find_element_by_xpath("//input[@name='fruits']")
fruit_field.send_keys("Mango")

That’s it.


Answer 6


You can use a CSS selector combination as well:

driver.find_element_by_css_selector("#fruits01 [value='1']").click()

Change the 1 in the attribute = value css selector to the value corresponding with the desired fruit.
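For instance, with the question's HTML, selecting Mango (value 2) would be:

driver.find_element_by_css_selector("#fruits01 [value='2']").click()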


Answer 7


from selenium.webdriver.support.ui import Select
driver = webdriver.Ie(".\\IEDriverServer.exe")
driver.get("https://test.com")
select = Select(driver.find_element_by_xpath("""//input[@name='n_name']"""))
select.select_by_index(2)

It will work fine


Answer 8


It works with option value:

from selenium import webdriver
b = webdriver.Firefox()
b.find_element_by_xpath("//select[@class='class_name']/option[@value='option_value']").click()

Answer 9


In this way you can select all the options in any dropdown.

import time

from selenium.webdriver.support.ui import Select

# assumes `driver` has already been created, e.g. driver = webdriver.Chrome()
driver.get("https://www.spectrapremium.com/en/aftermarket/north-america")

print( "The title is  : " + driver.title)

inputs = Select(driver.find_element_by_css_selector('#year'))

input1 = len(inputs.options)

for items in range(input1):

    inputs.select_by_index(items)
    time.sleep(1)

Answer 10


The best way is to use the selenium.webdriver.support.ui.Select class to work with dropdown selections, but sometimes it does not work as expected due to design issues or other problems in the HTML.

In this type of situation you can also use execute_script() as an alternative solution, as below:

option_visible_text = "Banana"
select = driver.find_element_by_id("fruits01")

#now use this to select option from dropdown by visible text 
driver.execute_script("var select = arguments[0]; for(var i = 0; i < select.options.length; i++){ if(select.options[i].text == arguments[1]){ select.options[i].selected = true; } }", select, option_visible_text);

Answer 11


As per the HTML provided:

<select id="fruits01" class="select" name="fruits">
  <option value="0">Choose your fruits:</option>
  <option value="1">Banana</option>
  <option value="2">Mango</option>
</select>

To select an <option> element from the menu you have to use the Select class. Moreover, as you have to interact with the drop-down menu, you have to induce WebDriverWait for the element_to_be_clickable().

To select the <option> with text as Mango from the dropdown, you can use either of the following Locator Strategies:

  • Using ID attribute and select_by_visible_text() method:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import Select
    
    select = Select(WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, "fruits01"))))
    select.select_by_visible_text("Mango")
    
  • Using CSS-SELECTOR and select_by_value() method:

    select = Select(WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "select.select[name='fruits']"))))
    select.select_by_value("2")
    
  • Using XPATH and select_by_index() method:

    select = Select(WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//select[@class='select' and @name='fruits']"))))
    select.select_by_index(2)
    

Answer 12

  1. 项目清单

公共类ListBoxMultiple {

public static void main(String[] args) throws InterruptedException {
    // TODO Auto-generated method stub
    System.setProperty("webdriver.chrome.driver", "./drivers/chromedriver.exe");
    WebDriver driver=new ChromeDriver();
    driver.get("file:///C:/Users/Amitabh/Desktop/hotel2.html");//open the website
    driver.manage().window().maximize();


    WebElement hotel = driver.findElement(By.id("maarya"));//get the element

    Select sel=new Select(hotel);//for handling list box
    //isMultiple
    if(sel.isMultiple()){
        System.out.println("it is multi select list");
    }
    else{
        System.out.println("it is single select list");
    }
    //select option
    sel.selectByIndex(1);// you can select by index values
    sel.selectByValue("p");//you can select by value
    sel.selectByVisibleText("Fish");// you can also select by visible text of the options
    //deselect option but this is possible only in case of multiple lists
    Thread.sleep(1000);
    sel.deselectByIndex(1);
    sel.deselectAll();

    //getOptions
    List<WebElement> options = sel.getOptions();

    int count=options.size();
    System.out.println("Total options: "+count);

    for(WebElement opt:options){ // getting the text of every element
        String text=opt.getText();
        System.out.println(text);
        }

    //select all options
    for(int i=0;i<count;i++){
        sel.selectByIndex(i);
        Thread.sleep(1000);
    }

    driver.quit();

}

}

  1. List item

public class ListBoxMultiple {

public static void main(String[] args) throws InterruptedException {
    // TODO Auto-generated method stub
    System.setProperty("webdriver.chrome.driver", "./drivers/chromedriver.exe");
    WebDriver driver=new ChromeDriver();
    driver.get("file:///C:/Users/Amitabh/Desktop/hotel2.html");//open the website
    driver.manage().window().maximize();


    WebElement hotel = driver.findElement(By.id("maarya"));//get the element

    Select sel=new Select(hotel);//for handling list box
    //isMultiple
    if(sel.isMultiple()){
        System.out.println("it is multi select list");
    }
    else{
        System.out.println("it is single select list");
    }
    //select option
    sel.selectByIndex(1);// you can select by index values
    sel.selectByValue("p");//you can select by value
    sel.selectByVisibleText("Fish");// you can also select by visible text of the options
    //deselect option but this is possible only in case of multiple lists
    Thread.sleep(1000);
    sel.deselectByIndex(1);
    sel.deselectAll();

    //getOptions
    List<WebElement> options = sel.getOptions();

    int count=options.size();
    System.out.println("Total options: "+count);

    for(WebElement opt:options){ // getting the text of every element
        String text=opt.getText();
        System.out.println(text);
        }

    //select all options
    for(int i=0;i<count;i++){
        sel.selectByIndex(i);
        Thread.sleep(1000);
    }

    driver.quit();

}

}
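
For readers following the Python answers elsewhere on this page, roughly the same flow can be sketched with Selenium's Python bindings. This is a loose translation, not a drop-in equivalent: it assumes the same local hotel2.html file and <select id="maarya"> element as the Java example, that chromedriver is on PATH, and it uses the older find_element_by_id style seen in the other answers here.

from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get("file:///C:/Users/Amitabh/Desktop/hotel2.html")  # open the local page
driver.maximize_window()

sel = Select(driver.find_element_by_id("maarya"))  # wrap the list box

# is_multiple tells you whether deselect* calls are allowed
print("multi select list" if sel.is_multiple else "single select list")

sel.select_by_index(1)              # select by index
sel.select_by_value("p")            # select by the value attribute
sel.select_by_visible_text("Fish")  # select by visible text

if sel.is_multiple:                 # deselecting only works on multi-selects
    sel.deselect_by_index(1)
    sel.deselect_all()

options = sel.options               # all <option> elements
print("Total options:", len(options))
for opt in options:
    print(opt.text)

# select every option in turn
for i in range(len(options)):
    sel.select_by_index(i)

driver.quit()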


如何按类别查找元素

问题:如何按类别查找元素

我在使用Beautifulsoup解析具有"class"属性的HTML元素时遇到了麻烦。代码看起来像这样

soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs: 
    if (div["class"] == "stylelistrow"):
        print div

脚本完成后的同一行出现错误。

File "./beautifulcoding.py", line 130, in getlanguage
  if (div["class"] == "stylelistrow"):
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 599, in __getitem__
   return self._getAttrMap()[key]
KeyError: 'class'

我如何摆脱这个错误?

I’m having trouble parsing HTML elements with “class” attribute using Beautifulsoup. The code looks like this

soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs: 
    if (div["class"] == "stylelistrow"):
        print div

I get an error on the same line “after” the script finishes.

File "./beautifulcoding.py", line 130, in getlanguage
  if (div["class"] == "stylelistrow"):
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 599, in __getitem__
   return self._getAttrMap()[key]
KeyError: 'class'

How do I get rid of this error?


回答 0

您可以使用BS3优化搜索以仅找到具有给定类的那些div:

mydivs = soup.findAll("div", {"class": "stylelistrow"})

You can refine your search to only find those divs with a given class using BS3:

mydivs = soup.findAll("div", {"class": "stylelistrow"})

回答 1

从文档中:

从Beautiful Soup 4.1.2开始,您可以使用关键字参数class_通过CSS类进行搜索:

soup.find_all("a", class_="sister")

在这种情况下将是:

soup.find_all("div", class_="stylelistrow")

它也适用于:

soup.find_all("div", class_="stylelistrowone stylelistrowtwo")

From the documentation:

As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:

soup.find_all("a", class_="sister")

Which in this case would be:

soup.find_all("div", class_="stylelistrow")

It would also work for:

soup.find_all("div", class_="stylelistrowone stylelistrowtwo")
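
A minimal, self-contained illustration of the class_ keyword; the HTML snippet and class names are invented for the example:

from bs4 import BeautifulSoup

sdata = """
<div class="stylelistrow">first</div>
<div class="stylelistrow button">second</div>
<div class="other">third</div>
"""
soup = BeautifulSoup(sdata, "html.parser")

# class_ matches any tag that carries "stylelistrow" among its classes
for div in soup.find_all("div", class_="stylelistrow"):
    print(div.get_text())  # prints "first" and "second"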

回答 2

更新(2016):在最新版本的beautifulsoup中,方法"findAll"已重命名为"find_all"。链接到官方文档。

因此答案将是

soup.find_all("html_element", class_="your_class_name")

Update: 2016 In the latest version of beautifulsoup, the method ‘findAll’ has been renamed to ‘find_all’. Link to official documentation

Hence the answer will be

soup.find_all("html_element", class_="your_class_name")

回答 3

特定于BeautifulSoup 3:

soup.findAll('div',
             {'class': lambda x: x 
                       and 'stylelistrow' in x.split()
             }
            )

将找到所有这些:

<div class="stylelistrow">
<div class="stylelistrow button">
<div class="button stylelistrow">

Specific to BeautifulSoup 3:

soup.findAll('div',
             {'class': lambda x: x 
                       and 'stylelistrow' in x.split()
             }
            )

Will find all of these:

<div class="stylelistrow">
<div class="stylelistrow button">
<div class="button stylelistrow">

回答 4

直接的方法是:

soup = BeautifulSoup(sdata)
for each_div in soup.findAll('div',{'class':'stylelist'}):
    print each_div

请注意findAll的大小写,不是findall。

A straightforward way would be:

soup = BeautifulSoup(sdata)
for each_div in soup.findAll('div',{'class':'stylelist'}):
    print each_div

Make sure you take note of the casing of findAll; it's not findall.


回答 5

如何按类别查找元素

我在使用Beautifulsoup解析具有"class"属性的html元素时遇到了麻烦。

您可以轻松地按单个类查找,但如果要按两个类的交集查找,则会稍微困难一些。

文档(添加重点):

如果要搜索与两个或多个 CSS类匹配的标签,则应使用CSS选择器:

css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]

为了清楚起见,此操作仅选择既是删除线又是正文类的p标签。

要匹配一组类中的任意一个(即并集而不是交集),可以给class_关键字参数提供一个列表(从4.1.2开始):

soup = BeautifulSoup(sdata)
class_list = ["stylelistrow"] # can add any other classes to this list.
# will find any divs with any names in class_list:
mydivs = soup.find_all('div', class_=class_list) 

还要注意,findAll已从驼峰式命名重命名为更加Pythonic的find_all。

How to find elements by class

I’m having trouble parsing html elements with “class” attribute using Beautifulsoup.

You can easily find by one class, but if you want to find by the intersection of two classes, it’s a little more difficult,

From the documentation (emphasis added):

If you want to search for tags that match two or more CSS classes, you should use a CSS selector:

css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]

To be clear, this selects only the p tags that are both strikeout and body class.

To match any one of a set of classes (not the intersection, but the union), you can give a list to the class_ keyword argument (as of 4.1.2):

soup = BeautifulSoup(sdata)
class_list = ["stylelistrow"] # can add any other classes to this list.
# will find any divs with any names in class_list:
mydivs = soup.find_all('div', class_=class_list) 

Also note that findAll has been renamed from the camelCase to the more Pythonic find_all.
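
A short sketch of the intersection-versus-union distinction above, using invented HTML:

from bs4 import BeautifulSoup

html = """
<p class="body strikeout">both</p>
<p class="body">body only</p>
<p class="strikeout">strikeout only</p>
"""
soup = BeautifulSoup(html, "html.parser")

# intersection: tags carrying BOTH classes (CSS selector)
print([p.get_text() for p in soup.select("p.strikeout.body")])
# ['both']

# union: tags carrying ANY class from the list (class_ accepts a list)
print([p.get_text() for p in soup.find_all("p", class_=["body", "strikeout"])])
# ['both', 'body only', 'strikeout only']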


回答 6

CSS选择器

单班第一场比赛

soup.select_one('.stylelistrow')

比赛清单

soup.select('.stylelistrow')

复合类(即AND另一类)

soup.select_one('.stylelistrow.otherclassname')
soup.select('.stylelistrow.otherclassname')

复合类名称中的空格(例如class="stylelistrow otherclassname")用"."代替。您可以继续添加类。

类列表(OR,匹配其中任意一个存在的类)

soup.select_one('.stylelistrow, .otherclassname')
soup.select('.stylelistrow, .otherclassname')

bs4 4.7.1 +

innerText包含字符串的特定类

soup.select_one('.stylelistrow:contains("some string")')
soup.select('.stylelistrow:contains("some string")')

具有特定子元素(例如a标签)的特定类

soup.select_one('.stylelistrow:has(a)')
soup.select('.stylelistrow:has(a)')

CSS selectors

single class first match

soup.select_one('.stylelistrow')

list of matches

soup.select('.stylelistrow')

compound class (i.e. AND another class)

soup.select_one('.stylelistrow.otherclassname')
soup.select('.stylelistrow.otherclassname')

Spaces in compound class names e.g. class = stylelistrow otherclassname are replaced with “.”. You can continue to add classes.

list of classes (OR – match whichever is present)

soup.select_one('.stylelistrow, .otherclassname')
soup.select('.stylelistrow, .otherclassname')

bs4 4.7.1 +

Specific class whose innerText contains a string

soup.select_one('.stylelistrow:contains("some string")')
soup.select('.stylelistrow:contains("some string")')

Specific class which has a certain child element e.g. a tag

soup.select_one('.stylelistrow:has(a)')
soup.select('.stylelistrow:has(a)')
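
A hedged usage sketch of some of the selectors above, with invented HTML. Note that :has() needs bs4 4.7.1+ (soupsieve), and in newer soupsieve releases :contains() is spelled :-soup-contains(), so the exact form depends on the installed version:

from bs4 import BeautifulSoup

html = """
<div class="stylelistrow otherclassname">some string <a href="#">link</a></div>
<div class="stylelistrow">plain row</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select_one(".stylelistrow").get_text())    # first match only
print(len(soup.select(".stylelistrow")))              # all matches -> 2
print(soup.select(".stylelistrow.otherclassname"))    # AND: must have both classes
print(soup.select(".stylelistrow, .otherclassname"))  # OR: either class
print(soup.select(".stylelistrow:has(a)"))            # rows containing an <a> tag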

回答 7

从BeautifulSoup 4+开始,

如果您只有一个类名,则只需将类名作为参数传递即可:

mydivs = soup.find_all('div', 'class_name')

或者,如果您有多个类名,只需将类名列表作为参数传递即可:

mydivs = soup.find_all('div', ['class1', 'class2'])

As of BeautifulSoup 4+,

If you have a single class name, you can just pass the class name as a parameter, like:

mydivs = soup.find_all('div', 'class_name')

Or if you have more than one class name, just pass the list of class names as a parameter, like:

mydivs = soup.find_all('div', ['class1', 'class2'])

回答 8

尝试首先检查div是否具有class属性,如下所示:

soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs:
    # get() returns None when the attribute is missing, avoiding the KeyError
    if div.get("class") == "stylelistrow":
        print div

Try to check if the div has a class attribute first, like this:

soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs:
    # get() returns None when the attribute is missing, avoiding the KeyError
    if div.get("class") == "stylelistrow":
        print div
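
One caveat worth adding here: in BeautifulSoup 4 the class attribute is multi-valued, so div["class"] is a list of class names rather than a string. A BS4-style sketch of the same idea:

# BS4 sketch: compare against the list of classes, not a single string
for div in mydivs:
    if "stylelistrow" in div.get("class", []):  # get() avoids the KeyError
        print(div)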

回答 9

这对我来说可以访问class属性(在beautifulsoup 4上,与文档所说的相反)。出现KeyError是因为返回的是列表而不是字典。

for hit in soup.findAll(name='span'):
    print hit.contents[1]['class']

This works for me to access the class attribute (on beautifulsoup 4, contrary to what the documentation says). The KeyError comes from a list being returned, not a dictionary.

for hit in soup.findAll(name='span'):
    print hit.contents[1]['class']

回答 10

以下对我有用

a_tag = soup.find_all("div",class_='full tabpublist')

the following worked for me

a_tag = soup.find_all("div",class_='full tabpublist')

回答 11

这对我有用:

for div in mydivs:
    try:
        clazz = div["class"]
    except KeyError:
        clazz = ""
    if (clazz == "stylelistrow"):
        print div

This worked for me:

for div in mydivs:
    try:
        clazz = div["class"]
    except KeyError:
        clazz = ""
    if (clazz == "stylelistrow"):
        print div

回答 12

或者,我们可以使用lxml,它支持xpath并且非常快!

from lxml import html, etree 

attr = html.fromstring(html_text)#passing the raw html
handles = attr.xpath('//div[@class="stylelistrow"]')#xpath expression to find that specific class

for each in handles:
    print(etree.tostring(each))#printing the html as string

Alternatively, we can use lxml; it supports XPath and is very fast!

from lxml import html, etree 

attr = html.fromstring(html_text)#passing the raw html
handles = attr.xpath('//div[@class="stylelistrow"]')#xpath expression to find that specific class

for each in handles:
    print(etree.tostring(each))#printing the html as string
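
One caveat with this XPath: @class="stylelistrow" matches only elements whose class attribute is exactly that string, so <div class="stylelistrow button"> would be missed. A common token-matching idiom (a sketch, assuming the same html_text variable as above) is:

from lxml import html

attr = html.fromstring(html_text)

# match 'stylelistrow' as one whitespace-separated token inside @class
handles = attr.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' stylelistrow ')]"
)

# lxml can also use CSS selectors directly (requires the cssselect package):
# handles = attr.cssselect("div.stylelistrow")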

回答 13

这样应该可以:

soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs: 
    if div.find(class_="stylelistrow"):
        print div

This should work:

soup = BeautifulSoup(sdata)
mydivs = soup.findAll('div')
for div in mydivs: 
    if div.find(class_="stylelistrow"):
        print div

回答 14

其他答案对我不起作用。

在其他答案中,findAll是直接用在soup对象本身上的;而我需要的是,在先前findAll得到的结果中提取出的某个特定元素内部,再按类名进行查找。

如果您要在嵌套的HTML元素中进行搜索以按类名获取对象,请尝试以下操作-

# parse html
page_soup = soup(web_page.read(), "html.parser")

# filter out items matching class name
all_songs = page_soup.findAll("li", "song_item")

# traverse through all_songs
for song in all_songs:

    # get text out of span element matching class 'song_name'
    # doing a 'find' by class name within a specific song element taken out of 'all_songs' collection
    song.find("span", "song_name").text

注意事项:

  1. 我没有像findAll("li", {"class": "song_item"})那样明确指定要在'class'属性上搜索,因为它是我要搜索的唯一属性;如果您不明确指出要按哪个属性查找,默认情况下就会按class属性搜索。

  2. 当您执行findAll或find时,得到的对象是bs4.element.ResultSet类型,它是list的子类。只要对象类型是ResultSet,您就可以在任意层级的嵌套元素上继续使用它的所有方法来进行find或find all。

  3. 我的BS4版本-4.9.1,Python版本-3.8.1

Other answers did not work for me.

In other answers the findAll is being used on the soup object itself, but I needed a way to do a find by class name on objects inside a specific element extracted from the object I obtained after doing findAll.

If you are trying to do a search inside nested HTML elements to get objects by class name, try below –

# parse html
page_soup = soup(web_page.read(), "html.parser")

# filter out items matching class name
all_songs = page_soup.findAll("li", "song_item")

# traverse through all_songs
for song in all_songs:

    # get text out of span element matching class 'song_name'
    # doing a 'find' by class name within a specific song element taken out of 'all_songs' collection
    song.find("span", "song_name").text

Points to note:

  1. I’m not explicitly defining the search to be on the ‘class’ attribute, as in findAll("li", {"class": "song_item"}), since it’s the only attribute I’m searching on, and by default it will search the class attribute if you don’t explicitly say which attribute you want to find on.

  2. When you do a findAll or find, the resulting object is of class bs4.element.ResultSet which is a subclass of list. You can utilize all methods of ResultSet, inside any number of nested elements (as long as they are of type ResultSet) to do a find or find all.

  3. My BS4 version – 4.9.1, Python version – 3.8.1
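
A self-contained sketch of that nested-lookup pattern; the HTML, tag names and class names below are invented stand-ins for the real page:

from bs4 import BeautifulSoup

html = """
<ul>
  <li class="song_item"><span class="song_name">Song A</span></li>
  <li class="song_item"><span class="song_name">Song B</span></li>
</ul>
"""
page_soup = BeautifulSoup(html, "html.parser")

all_songs = page_soup.findAll("li", "song_item")      # filter <li> by class
for song in all_songs:
    # a second, narrower find *inside* each element returned by findAll
    print(song.find("span", "song_name").text)        # Song A, Song B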


回答 15

以下应该工作

soup.find('span', attrs={'class':'totalcount'})

将"totalcount"替换为您的类名,并将"span"替换为您要查找的标签。另外,如果您的class包含多个以空格分隔的名称,只需选择其中一个使用即可。

P.S. 这将找到符合给定条件的第一个元素。如果要查找所有元素,请将"find"替换为"find_all"。

The following should work

soup.find('span', attrs={'class':'totalcount'})

replace ‘totalcount’ with your class name and ‘span’ with the tag you are looking for. Also, if your class contains multiple names separated by spaces, just choose one and use it.

P.S. This finds the first element with given criteria. If you want to find all elements then replace ‘find’ with ‘find_all’.
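
A brief, invented example of the attrs form and of the find versus find_all difference mentioned above:

from bs4 import BeautifulSoup

html = '<span class="totalcount">250</span><span class="totalcount">300</span>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find("span", attrs={"class": "totalcount"}).text)  # first match: "250"
print([s.text for s in soup.find_all("span", attrs={"class": "totalcount"})])
# all matches: ['250', '300']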