I’d like to grab daily sunrise/sunset times from a website. Is it possible to scrape web content with Python? What modules are used? Is there a tutorial available?
Use urllib2 in combination with the brilliant BeautifulSoup library:
import urllib2
from BeautifulSoup import BeautifulSoup
# or if you're using BeautifulSoup4:
# from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://example.com').read())
for row in soup('table', {'class': 'spad'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string
    # will print date and sunrise
Scrapy crawling is faster than mechanize because it uses asynchronous operations (on top of Twisted).
Scrapy has better and faster support for parsing (x)html, built on top of libxml2.
Scrapy is a mature framework with full Unicode support; it handles redirections, gzipped responses, odd encodings, has an integrated HTTP cache, etc.
Once you are into Scrapy, you can write a spider in less than 5 minutes that downloads images, creates thumbnails, and exports the extracted data directly to CSV or JSON.
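For a rough idea, a minimal spider for the same kind of sunrise table could look like this (the URL and the spad table class are just borrowed from the BeautifulSoup example above, so treat them as placeholders):
import scrapy

class SunriseSpider(scrapy.Spider):
    name = "sunrise"
    # Placeholder start URL; point it at the page that holds the table
    start_urls = ["http://example.com"]

    def parse(self, response):
        # Each row of the table with class "spad" is expected to hold date, sunrise, sunset
        for row in response.css("table.spad tbody tr"):
            cells = row.css("td::text").getall()
            yield {"date": cells[0], "sunrise": cells[1]}
Run it with scrapy runspider sunrise_spider.py -o sunrise.json and Scrapy writes the yielded items straight to JSON.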
I would strongly suggest checking out pyquery. It uses jQuery-like (i.e. CSS-like) syntax, which makes things really easy for those coming from that background.
For your case, it would be something like:
from pyquery import PyQuery

html = PyQuery(url='http://www.example.com/')
trs = html('table.spad tbody tr')
for tr in trs:
    tds = tr.getchildren()
    print tds[1].text, tds[2].text
Output:
5:16 AM 9:28 PM
5:15 AM 9:30 PM
5:13 AM 9:31 PM
5:12 AM 9:33 PM
5:11 AM 9:34 PM
5:10 AM 9:35 PM
5:09 AM 9:37 PM
I use a combination of Scrapemark (finding URLs – py2) and httplib2 (downloading images – py2+3). scrapemark.py is about 500 lines of code, but it relies on regular expressions, so it may not be that fast; I did not test it.
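For the image-downloading part, a bare-bones httplib2 sketch could look like this (the image URL and the cache directory are placeholders):
import httplib2

# Passing a directory name gives httplib2 an on-disk HTTP cache
http = httplib2.Http(".cache")
response, content = http.request("http://www.example.com/some-image.jpg", "GET")

# 'content' is the raw response body as bytes, so just write it out
with open("some-image.jpg", "wb") as f:
    f.write(content)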
I know I have come late to the party, but I have a nice suggestion for you.
Using BeautifulSoup has already been suggested; I would rather prefer using CSS selectors to scrape data inside the HTML:
import requests
import random
import time
from bs4 import BeautifulSoup

# Example request headers and timeout (placeholder values; adjust to your needs)
header = [{"User-Agent": "Mozilla/5.0"}]
timeout_time = 10

# This method scrapes a URL; if the request fails, it waits 20 seconds and then
# tries again. (I use it because my internet connection sometimes gets disconnected.)
def tryAgain(passed_url):
    try:
        page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
        return page
    except Exception:
        while 1:
            print("Trying again the URL:")
            print(passed_url)
            try:
                page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
                print("-------------------------------------")
                print("---- URL was successfully scraped ---")
                print("-------------------------------------")
                return page
            except Exception:
                time.sleep(20)
                continue

main_url = "http://www.example.com"
main_page_html = tryAgain(main_url)
main_page_soup = BeautifulSoup(main_page_html, "html.parser")

# Scrape all TDs from TRs inside the table
for tr in main_page_soup.select("table.class_of_table"):
    for td in tr.select("td#id"):
        print(td.text)
        # For anchors inside the TD
        print(td.select("a")[0].text)
        # Value of the href attribute
        print(td.select("a")[0]["href"])
If we want to get the names of items from any specific category, we can do that by specifying the class name of that category using a CSS selector:
import requests ; from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://www.flipkart.com/').text, "lxml")
for link in soup.select('div._2kSfQ4'):
    print(link.text)
Here is part of the search results:
Puma, USPA, Adidas & more Up to 70% Off Men's Shoes
Shirts, T-Shirts... Under ₹599 For Men
Nike, UCB, Adidas & more Under ₹999 Men's Sandals, Slippers
Philips & more Starting ₹99 LED Bulbs & Emergency Lights
Here is a simple web crawler. I used BeautifulSoup, and we search for all the links (anchors) whose class name is _3NFO0d. I used Flipkart.com, which is an online retailing store.
import requests
from bs4 import BeautifulSoup

def crawl_flipkart():
    url = 'https://www.flipkart.com/'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "lxml")
    for link in soup.findAll('a', {'class': '_3NFO0d'}):
        href = link.get('href')
        print(href)

crawl_flipkart()
Python has good options for scraping the web. The best framework-based option is Scrapy. It can be a little tricky for beginners, so here is a little help.
1. Install Python 3.5 or above (older versions down to 2.7 will also work).
2. Create an environment in conda (I did this).
3. Install Scrapy in that environment and run it from there.
4. scrapy shell will give you an interactive interface to test your code.
5. scrapy startproject projectname will create a project skeleton.
6. scrapy genspider spidername will create a spider. You can create as many spiders as you want. While doing this, make sure you are inside the project directory.
The easier option is to use requests and Beautiful Soup. Before starting, spend an hour going through the documentation; it will resolve most of your doubts. BS4 offers a wide range of parsers you can opt for. Use a user-agent header and sleep between requests to make scraping easier. BS4 methods return lists of Tag objects, so use variable[0] to get the first match. If there is JavaScript running on the page, you won't be able to scrape it using requests and BS4 directly; you could find the API link and parse the JSON to get the information you need, or try Selenium.
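Putting those tips together, a minimal requests + BS4 sketch could look like this (the URL, the table class, and the user-agent string are placeholders):
import time
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # send a browser-like user-agent
response = requests.get("http://example.com", headers=headers, timeout=10)

# "html.parser" ships with Python; "lxml" or "html5lib" can be swapped in if installed
soup = BeautifulSoup(response.text, "html.parser")

# select() returns a list of Tag objects, so loop over it or index into it
for row in soup.select("table.spad tbody tr"):
    cells = [td.get_text(strip=True) for td in row.select("td")]
    print(cells)

time.sleep(2)  # pause between requests to stay polite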