>>> import requests
>>> r = requests.get('http://httpbin.org/status/404')
>>> r.status_code
404
If you want requests to raise an exception for error codes (4xx or 5xx), call r.raise_for_status():
>>> r = requests.get('http://httpbin.org/status/404')
>>> r.raise_for_status()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "requests/models.py", line 664, in raise_for_status
raise http_error
requests.exceptions.HTTPError: 404 Client Error: NOT FOUND
>>> r = requests.get('http://httpbin.org/status/200')
>>> r.raise_for_status()
>>> # no exception raised.
You can also test the response object in a boolean context; if the status code is not an error code (4xx or 5xx), it is considered ‘true’:
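For example:

>>> r = requests.get('http://httpbin.org/status/200')
>>> if r:
...     print('Success!')
... else:
...     print('An error occurred.')
...
Success!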
I want to get the content from the website below. If I use a browser like Firefox or Chrome, I can get the real website page I want, but if I use the Python requests package (or the wget command) to get it, it returns a totally different HTML page. I think the developer of the website has put some blocks in place for this, so the question is:
How do I fake a browser visit by using python requests or command wget?
from fake_useragent import UserAgent
import requests

ua = UserAgent()
print(ua.chrome)
header = {'User-Agent': str(ua.chrome)}
print(header)
url = "https://www.hybrid-analysis.com/recent-submissions?filter=file&sort=^timestamp"
htmlContent = requests.get(url, headers=header)
print(htmlContent)

Output:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17
{'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
<Response [200]>
The root of the answer is that the person asking the question needs a JavaScript interpreter to get what they are after. What I have found is that I am able to get all of the information I wanted from a website as JSON before it was interpreted by JavaScript. This has saved me a ton of time that would otherwise be spent parsing HTML and hoping each web page is in the same format.
So when you get a response from a website using requests, really look at the html/text, because you might find the JavaScript's JSON in the footer, ready to be parsed.
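A minimal sketch of that idea (the script id "__DATA__" is hypothetical; inspect the actual page source to find where the JSON lives):

import json
import re
import requests

html = requests.get('https://example.com/some-page').text

# look for a JSON blob embedded in a <script> tag near the end of the page
match = re.search(r'<script id="__DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    print(list(data))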
I want to do parallel HTTP request tasks in asyncio, but I find that python-requests would block the event loop of asyncio. I've found aiohttp, but it couldn't provide the service of an HTTP request using an HTTP proxy.
So I want to know if there’s a way to do asynchronous http requests with the help of asyncio.
To use requests (or any other blocking libraries) with asyncio, you can use BaseEventLoop.run_in_executor to run a function in another thread and yield from it to get the result. For example:
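A minimal sketch of that pattern (using await, which plays the role of yield from in current Python; the URLs are just placeholders):

import asyncio
import requests

async def main():
    loop = asyncio.get_event_loop()
    # run the blocking requests.get calls in the default thread pool executor
    future1 = loop.run_in_executor(None, requests.get, 'http://www.google.com')
    future2 = loop.run_in_executor(None, requests.get, 'http://www.google.co.uk')
    response1 = await future1
    response2 = await future2
    print(response1.status_code, response2.status_code)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())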
Requests does not currently support asyncio and there are no plans to provide such support. It’s likely that you could implement a custom “Transport Adapter” (as discussed here) that knows how to use asyncio.
If I find myself with some time it’s something I might actually look into, but I can’t promise anything.
To minimize the total completion time, we could increase the size of the thread pool to match the number of requests we have to make. Luckily, this is easy to do as we will see next. The code listing below is an example of how to make twenty asynchronous HTTP requests with a thread pool of twenty worker threads:
# Example 3: asynchronous requests with larger thread pool
import asyncio
import concurrent.futures
import requests
async def main():
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        loop = asyncio.get_event_loop()
        futures = [
            loop.run_in_executor(
                executor,
                requests.get,
                'http://example.org/'
            )
            for i in range(20)
        ]
        for response in await asyncio.gather(*futures):
            pass

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
I’m performing a simple task of uploading a file using Python requests library. I searched Stack Overflow and no one seemed to have the same problem, namely, that the file is not received by the server:
I’m filling the value of ‘upload_file’ keyword with my filename, because if I leave it blank, it says
Error - You must select a file to upload!
And now I get
File file.txt of size bytes is uploaded successfully!
Query service results: There were 0 lines.
Which comes up only if the file is empty. So I’m stuck as to how to send my file successfully. I know that the file works because if I go to this website and manually fill in the form it returns a nice list of matched objects, which is what I’m after. I’d really appreciate all hints.
Some other threads related (but not answering my problem):
You can use a tuple for the files mapping value, with between 2 and 4 elements, if you need more control. The first element is the filename, followed by the contents, and an optional content-type header value and an optional mapping of additional headers:
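A sketch of that form, reusing the question's upload_file field (the filename, content type and endpoint are placeholders):

import requests

files = {'upload_file': ('custom-name.txt', open('file.txt', 'rb'), 'text/plain')}
r = requests.post('http://example.com/upload', files=files)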
This sets an alternative filename and content type, leaving out the optional headers.
If you mean for the whole POST body to be taken from a file (with no other fields specified), then don't use the files parameter; just post the file directly as data. You may then want to set a Content-Type header too, as none will be set otherwise. See Python requests – POST data from a file.
(2018) The new Python requests library has simplified this process; we can use the 'files' parameter to signal that we want to upload a multipart-encoded file:
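For instance (httpbin.org/post simply echoes the upload back; the filename is a placeholder):

import requests

url = 'http://httpbin.org/post'
files = {'file': open('report.xls', 'rb')}
r = requests.post(url, files=files)
print(r.text)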
If you want to upload a single file with the Python requests library, note that requests supports streaming uploads, which allow you to send large files or streams without reading them into memory:
with open('massive-body', 'rb') as f:
requests.post('http://some.url/streamed', data=f)
Server Side
Then store the file on the server.py side so that the stream is saved to a file without being loaded into memory. Below is a sketch using Flask file uploads.
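A minimal sketch of the server side under that assumption, using Flask's request.stream so the body is written to disk in chunks (the route and target path are placeholders):

from flask import Flask, request

app = Flask(__name__)

@app.route('/streamed', methods=['POST'])
def streamed_upload():
    # read the raw request body in chunks and write it straight to disk,
    # so the whole upload never has to fit in memory at once
    with open('/tmp/uploaded-file', 'wb') as out:
        while True:
            chunk = request.stream.read(16384)
            if not chunk:
                break
            out.write(chunk)
    return 'OK'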
Or use werkzeug Form Data Parsing, as mentioned in a fix for the issue of "large file uploads eating up memory", in order to avoid using memory inefficiently on large file uploads (e.g. a 22 GiB file in ~60 seconds, with memory usage constant at about 13 MiB).
I am using the terrific Python Requests library. I notice that the fine documentation has many examples of how to do something without explaining the why. For instance, both r.text and r.content are shown as examples of how to get the server response. But where is it explained what these properties do? For instance, when would I choose one over the other? I see that r.text returns a unicode object sometimes, and I suppose that there would be a difference for a non-text response. But where is all this documented? Note that the linked document does state:
You can also access the response body as bytes, for non-text requests:
But then it goes on to show an example of a text response! I can only suppose that the quote above means to say non-text responses instead of non-text requests, as a non-text request does not make sense in HTTP.
In short, where is the proper documentation of the library, as opposed to the (excellent) tutorial on the Python Requests site?
I am using the requests module (version 0.10.0 with Python 2.5).
I have figured out how to submit data to a login form on a website and retrieve the session key, but I can’t see an obvious way to use this session key in subsequent requests.
Can someone fill in the ellipsis in the code below or suggest another approach?
You can create a persistent session using:

s = requests.Session()

After that, continue with your requests as you would:
s.post('https://localhost/login.py', login_data)
#logged in! cookies saved for future requests.
r2 = s.get('https://localhost/profile_data.json', ...)
#cookies sent automatically!
#do whatever, s will keep your cookies intact :)
The other answers help to understand how to maintain such a session. Additionally, I want to provide a class which keeps the session maintained over different runs of a script (with a cache file). This means a proper "login" is only performed when required (timeout, or no session exists in the cache). It also supports proxy settings over subsequent calls to 'get' or 'post'.
It is tested with Python 3.
Use it as a basis for your own code. The following snippets are released with GPL v3.
import pickle
import datetime
import os
from urllib.parse import urlparse
import requests
class MyLoginSession:
"""
a class which handles and saves login sessions. It also keeps track of proxy settings.
It does also maintine a cache-file for restoring session data from earlier
script executions.
"""
def __init__(self,
loginUrl,
loginData,
loginTestUrl,
loginTestString,
sessionFileAppendix = '_session.dat',
maxSessionTimeSeconds = 30 * 60,
proxies = None,
userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1',
debug = True,
forceLogin = False,
**kwargs):
"""
save some information needed to login the session
you'll have to provide 'loginTestString' which will be looked for in the
responses html to make sure, you've properly been logged in
'proxies' is of format { 'https' : 'https://user:pass@server:port', 'http' : ...
'loginData' will be sent as post data (dictionary of id : value).
'maxSessionTimeSeconds' will be used to determine when to re-login.
"""
urlData = urlparse(loginUrl)
self.proxies = proxies
self.loginData = loginData
self.loginUrl = loginUrl
self.loginTestUrl = loginTestUrl
self.maxSessionTime = maxSessionTimeSeconds
self.sessionFile = urlData.netloc + sessionFileAppendix
self.userAgent = userAgent
self.loginTestString = loginTestString
self.debug = debug
self.login(forceLogin, **kwargs)
def modification_date(self, filename):
"""
return last file modification date as datetime object
"""
t = os.path.getmtime(filename)
return datetime.datetime.fromtimestamp(t)
def login(self, forceLogin = False, **kwargs):
"""
login to a session. Try to read last saved session from cache file. If this fails
do proper login. If the last cache access was too old, also perform a proper login.
Always updates session cache file.
"""
wasReadFromCache = False
if self.debug:
print('loading or generating session...')
if os.path.exists(self.sessionFile) and not forceLogin:
time = self.modification_date(self.sessionFile)
# only load if file less than 30 minutes old
lastModification = (datetime.datetime.now() - time).seconds
if lastModification < self.maxSessionTime:
with open(self.sessionFile, "rb") as f:
self.session = pickle.load(f)
wasReadFromCache = True
if self.debug:
print("loaded session from cache (last access %ds ago) "
% lastModification)
if not wasReadFromCache:
self.session = requests.Session()
self.session.headers.update({'user-agent' : self.userAgent})
res = self.session.post(self.loginUrl, data = self.loginData,
proxies = self.proxies, **kwargs)
if self.debug:
print('created new session with login' )
self.saveSessionToCache()
# test login
res = self.session.get(self.loginTestUrl)
if res.text.lower().find(self.loginTestString.lower()) < 0:
raise Exception("could not log into provided site '%s'"
" (did not find successful login string)"
% self.loginUrl)
def saveSessionToCache(self):
"""
save session to a cache file
"""
# always save (to update timeout)
with open(self.sessionFile, "wb") as f:
pickle.dump(self.session, f)
if self.debug:
print('updated session cache-file %s' % self.sessionFile)
def retrieveContent(self, url, method = "get", postData = None, **kwargs):
"""
return the content of the url with respect to the session.
If 'method' is not 'get', the url will be called with 'postData'
as a post request.
"""
if method == 'get':
res = self.session.get(url , proxies = self.proxies, **kwargs)
else:
res = self.session.post(url , data = postData, proxies = self.proxies, **kwargs)
# the session has been updated on the server, so also update in cache
self.saveSessionToCache()
return res
A code snippet for using the above class may look like this:
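if __name__ == "__main__":
    # proxies = {'https' : 'https://user:pass@server:port',
    #            'http'  : 'http://user:pass@server:port'}

    loginData = {'user' : 'usr', 'password' : 'pwd'}
    loginUrl = 'https://...'
    loginTestUrl = 'https://...'
    successStr = 'Hello Tom'
    s = MyLoginSession(loginUrl, loginData, loginTestUrl, successStr,
                       #proxies = proxies
                       )

    res = s.retrieveContent('https://....')
    print(res.text)

    # if, for instance, login via JSON values required try this:
    s = MyLoginSession(loginUrl, None, loginTestUrl, successStr,
                       #proxies = proxies,
                       json = loginData)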
import urllib2
import urllib
from cookielib import CookieJar
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# input-type values from the html form
formdata = { "username" : username, "password": password, "form-id" : "1234" }
data_encoded = urllib.urlencode(formdata)
response = opener.open("https://page.com/login.php", data_encoded)
content = response.read()
EDIT:
I see I’ve gotten a few downvotes for my answer, but no explaining comments. I’m guessing it’s because I’m referring to the urllib libraries instead of requests. I do that because the OP asks for help with requests or for someone to suggest another approach.
The documentation says that get takes an optional cookies argument, allowing you to specify the cookies to send:
From the documentation:
>>> url = 'http://httpbin.org/cookies'
>>> cookies = dict(cookies_are='working')
>>> r = requests.get(url, cookies=cookies)
>>> r.text
'{"cookies": {"cookies_are": "working"}}'
Upon trying all the answers above, I found that using “RequestsCookieJar” instead of the regular CookieJar for subsequent requests fixed my problem.
import requests
import json
# The Login URL
authUrl = 'https://whatever.com/login'
# The subsequent URL
testUrl = 'https://whatever.com/someEndpoint'
# Logout URL
testlogoutUrl = 'https://whatever.com/logout'
# Whatever you are posting
login_data = {'formPosted':'1',
'login_email':'me@example.com',
'password':'pw'
}
# The Authentication token or any other data that we will receive from the Authentication Request.
token = ''
# Post the login Request
loginRequest = requests.post(authUrl, login_data)
print("{}".format(loginRequest.text))
# Save the request content to your variable. In this case I needed a field called token.
token = str(json.loads(loginRequest.content)['token']) # or ['access_token']
print("{}".format(token))
# Verify Successful login
print("{}".format(loginRequest.status_code))
# Create your Requests Cookie Jar for your subsequent requests and add the cookie
jar = requests.cookies.RequestsCookieJar()
jar.set('LWSSO_COOKIE_KEY', token)
# Execute your next request(s) with the Request Cookie Jar set
r = requests.get(testUrl, cookies=jar)
print("R.TEXT: {}".format(r.text))
print("R.STCD: {}".format(r.status_code))
# Execute your logout request(s) with the Request Cookie Jar set
r = requests.delete(testlogoutUrl, cookies=jar)
print("R.TEXT: {}".format(r.text)) # should show "Request Not Authorized"
print("R.STCD: {}".format(r.status_code)) # should show 401
import os
import pickle
from urllib.parse import urljoin, urlparse

import requests
login = 'my@email.com'
password = 'secret'
# Assuming two cookies are used for persistent login.
# (Find it by tracing the login process)
persistentCookieNames = ['sessionId', 'profileId']
URL = 'http://example.com'
urlData = urlparse(URL)
cookieFile = urlData.netloc + '.cookie'
signinUrl = urljoin(URL, "/signin")
with requests.Session() as session:
try:
with open(cookieFile, 'rb') as f:
print("Loading cookies...")
session.cookies.update(pickle.load(f))
except Exception:
# If could not load cookies from file, get the new ones by login in
print("Login in...")
post = session.post(
signinUrl,
data={
'email': login,
'password': password,
}
)
try:
with open(cookieFile, 'wb') as f:
jar = requests.cookies.RequestsCookieJar()
for cookie in session.cookies:
if cookie.name in persistentCookieNames:
jar.set_cookie(cookie)
pickle.dump(jar, f)
except Exception as e:
os.remove(cookieFile)
raise(e)
MyPage = urljoin(URL, "/mypage")
page = session.get(MyPage)
I like very much the requests package and its comfortable way to handle JSON responses.
Unfortunately, I did not understand if I can also process XML responses. Has anybody experience how to handle XML responses with the requests package? Is it necessary to include another package for the XML decoding?
requests does not handle parsing XML responses, no. XML responses are much more complex in nature than JSON responses; how you'd serialize XML data into Python structures is not nearly as straightforward.
Python comes with built-in XML parsers. I recommend you use the ElementTree API:
import requests
from xml.etree import ElementTree
response = requests.get(url)
tree = ElementTree.fromstring(response.content)
or, if the response is particularly large, use an incremental approach:
response = requests.get(url, stream=True)
# if the server sent a Gzip or Deflate compressed response, decompress
# as we read the raw stream:
response.raw.decode_content = True
events = ElementTree.iterparse(response.raw)
for event, elem in events:
# do something with `elem`
The external lxml project builds on the same API to give you more features and power still.
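For instance, the equivalent with lxml would look like this (a sketch, assuming lxml is installed):

import requests
from lxml import etree

response = requests.get(url)
tree = etree.fromstring(response.content)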
I am trying to post a request to log in to a website using the Requests module in Python, but it's not really working. I'm new to this… so I can't figure out if I should make my username and password cookies, or use some type of HTTP authorization thing I found (??).
from pyquery import PyQuery
import requests
url = 'http://www.locationary.com/home/index2.jsp'
So now, I think I’m supposed to use “post” and cookies….
ck = {'inUserName': 'USERNAME/EMAIL', 'inUserPass': 'PASSWORD'}
r = requests.post(url, cookies=ck)
content = r.text
q = PyQuery(content)
title = q("title").text()
print title
I have a feeling that I’m doing the cookies thing wrong…I don’t know.
If it doesn’t log in correctly, the title of the home page should come out to “Locationary.com” and if it does, it should be “Home Page.”
If you could maybe explain a few things about requests and cookies to me and help me out with this, I would greatly appreciate it. :D
Thanks.
…It still didn’t really work yet. Okay…so this is what the home page HTML says before you log in:
So I think I’m doing it right, but the output is still “Locationary.com”
2nd EDIT:
I want to be able to stay logged in for a long time and whenever I request a page under that domain, I want the content to show up as if I were logged in.
I know you’ve found another solution, but for those like me who find this question, looking for the same thing, it can be achieved with requests as follows:
Firstly, as Marcus did, check the source of the login form to get three pieces of information – the url that the form posts to, and the name attributes of the username and password fields. In his example, they are inUserName and inUserPass.
Once you’ve got that, you can use a requests.Session() instance to make a post request to the login url with your login details as a payload. Making requests from a session instance is essentially the same as using requests normally, it simply adds persistence, allowing you to store and use cookies etc.
Assuming your login attempt was successful, you can simply use the session instance to make further requests to the site. The cookie that identifies you will be used to authorise the requests.
Example
import requests
# Fill in your details here to be posted to the login form.
payload = {
'inUserName': 'username',
'inUserPass': 'password'
}
# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
p = s.post('LOGIN_URL', data=payload)
# print the html returned or something more intelligent to see if it's a successful login page.
print p.text
# An authorised request.
r = s.get('A protected web page url')
print r.text
# etc...
Let me try to make it simple. Suppose the URL of the site is http://example.com/ and suppose you need to sign up by filling in a username and password. We go to the login page, say http://example.com/login.php, view its source code and search for the action URL; it will be in a form tag, something like
Find out the names of the inputs used on the website's form for usernames <...name=username.../> and passwords <...name=password../> and replace them in the script below. Also replace the URL to point at the desired site to log into.
The use of disable_warnings(InsecureRequestWarning) will silence any output from the script when trying to log into sites with unverified SSL certificates.
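A minimal sketch of such a script (the login URL and field names are placeholders to replace as described above):

import requests
from urllib3 import disable_warnings
from urllib3.exceptions import InsecureRequestWarning

# silence warnings when logging into sites with unverified SSL certificates
disable_warnings(InsecureRequestWarning)

LOGIN_URL = 'https://example.com/login.php'   # the form's action URL
payload = {
    'username': 'my_user',      # key must match the name of the username <input>
    'password': 'my_password',  # key must match the name of the password <input>
}

with requests.Session() as session:
    response = session.post(LOGIN_URL, data=payload, verify=False)
    print(response.status_code, response.url)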
Extra:
To run this script from the command line on a UNIX based system place it in a directory, i.e. home/scripts and add this directory to your path in ~/.bash_profile or a similar file used by the terminal.
The requests.Session() solution assisted with logging into a form with CSRF Protection (as used in Flask-WTF forms). Check if a csrf_token is required as a hidden field and add it to the payload with the username and password:
import requests
from bs4 import BeautifulSoup
payload = {
'email': 'email@example.com',
'password': 'passw0rd'
}
with requests.Session() as sess:
res = sess.get(server_name + '/signin')
signin = BeautifulSoup(res._content, 'html.parser')
payload['csrf_token'] = signin.find('input', id='csrf_token')['value']
res = sess.post(server_name + '/auth/login', data=payload)
I am using Python Requests. I need to debug some OAuth activity, and for that I would like it to log all requests being performed. I could get this information with ngrep, but unfortunately it is not possible to grep https connections (which are needed for OAuth).
How can I activate logging of all URLs (+ parameters) that Requests is accessing?
Depending on the exact version of urllib3, the following messages are logged:
INFO: Redirects
WARN: Connection pool full (if this happens often increase the connection pool size)
WARN: Failed to parse headers (response headers with invalid format)
WARN: Retrying the connection
WARN: Certificate did not match expected hostname
WARN: Received response with both Content-Length and Transfer-Encoding, when processing a chunked response
DEBUG: New connections (HTTP or HTTPS)
DEBUG: Dropped connections
DEBUG: Connection details: method, path, HTTP version, status code and response length
DEBUG: Retry count increments
This doesn't include headers or bodies. urllib3 uses the http.client.HTTPConnection class to do the grunt work, but that class doesn't support logging; it can normally only be configured to print to stdout. However, you can rig it to send all debug information to logging instead by introducing an alternative print name into that module:
import logging
import http.client
httpclient_logger = logging.getLogger("http.client")
def httpclient_logging_patch(level=logging.DEBUG):
"""Enable HTTPConnection debug logging to the logging framework"""
def httpclient_log(*args):
httpclient_logger.log(level, " ".join(args))
# mask the print() built-in in the http.client module to use
# logging instead
http.client.print = httpclient_log
# enable debugging
http.client.HTTPConnection.debuglevel = 1
Calling httpclient_logging_patch() causes http.client connections to output all debug information to a standard logger, so it is picked up by logging.basicConfig():
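For example, a minimal way to wire it up (the URL is arbitrary):

import logging
import requests

logging.basicConfig(level=logging.DEBUG)
httpclient_logging_patch()

requests.get('https://httpbin.org/get')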
You will see the REQUEST, including HEADERS and DATA, and RESPONSE with HEADERS but without DATA. The only thing missing will be the response.body which is not logged.
Note that only urllib3 actually uses the Python logging system:
requests: no
http.client.HTTPConnection: no
urllib3: yes
Sure, you can extract debug messages from HTTPConnection by setting:
HTTPConnection.debuglevel = 1
but these outputs are merely emitted via the print statement. To prove this, simply grep the Python 3.7 client.py source code and view the print statements yourself (thanks @Yohann):
Presumably redirecting stdout in some way might work to shoe-horn stdout into the logging system and potentially capture to e.g. a log file.
Choose the ‘urllib3‘ logger not ‘requests.packages.urllib3‘
To capture urllib3 debug information through the Python 3 logging system, contrary to much advice on the internet, and as @MikeSmith points out, you won’t have much luck intercepting:
Remember this output uses print and not the Python logging system, and thus cannot be captured using a traditional logging stream or file handler (though it may be possible to capture output to a file by redirecting stdout).
Combine the two above – maximise all possible logging to console
To maximise all possible logging, you must settle for console/stdout output with this:
import requests
import logging
from http.client import HTTPConnection # py3
log = logging.getLogger('urllib3')
log.setLevel(logging.DEBUG)
# logging from urllib3 to console
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
log.addHandler(ch)
# print statements from `http.client.HTTPConnection` to console/stdout
HTTPConnection.debuglevel = 1
requests.get('http://httpbin.org/')
‘urllib3’ is the logger to get now (no longer ‘requests.packages.urllib3’). Basic logging will still happen without setting http.client.HTTPConnection.debuglevel
Having a script, or even a subsystem of an application, for debugging a network protocol, you want to see exactly what the request-response pairs are, including effective URLs, headers, payloads and the status. And it's typically impractical to instrument individual requests all over the place. At the same time there are performance considerations that suggest using a single (or a few specialised) requests.Session, so the following assumes that suggestion is followed.
requests supports so-called event hooks (as of 2.23 there's actually only the response hook). It's basically an event listener, and the event is emitted before returning control from requests.request. At this moment both the request and the response are fully defined, hence they can be logged.
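A sketch of such a hook attached to a session (the logger name 'httplogger' matches what the formatter shown below expects):

import logging
import requests

logger = logging.getLogger('httplogger')

def logRoundtrip(response, *args, **kwargs):
    # attach the fully-built request and response objects to the log record
    extra = {'req': response.request, 'res': response}
    logger.debug('HTTP roundtrip', extra=extra)

session = requests.Session()
session.hooks['response'].append(logRoundtrip)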
That’s basically how to log all HTTP round-trips of a session.
Formatting HTTP round-trip log records
For the logging above to be useful there can be specialised logging formatter that understands req and res extras on logging records. It can look like this:
import textwrap
class HttpFormatter(logging.Formatter):
def _formatHeaders(self, d):
return '\n'.join(f'{k}: {v}' for k, v in d.items())
def formatMessage(self, record):
result = super().formatMessage(record)
if record.name == 'httplogger':
result += textwrap.dedent('''
---------------- request ----------------
{req.method} {req.url}
{reqhdrs}
{req.body}
---------------- response ----------------
{res.status_code} {res.reason} {res.url}
{reshdrs}
{res.text}
''').format(
req=record.req,
res=record.res,
reqhdrs=self._formatHeaders(record.req.headers),
reshdrs=self._formatHeaders(record.res.headers),
)
return result
formatter = HttpFormatter('{asctime} {levelname} {name} {message}', style='{')
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logging.basicConfig(level=logging.DEBUG, handlers=[handler])
Now if you do some requests using the session, like:
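session.get('https://httpbin.org/user-agent')
session.get('https://httpbin.org/status/200')

each round-trip will be logged and rendered by the HttpFormatter above (the two endpoints here are arbitrary examples).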
When you have a lot of queries, having a simple UI and a way to filter records comes in handy. I'll show how to use Chronologer for that (of which I'm the author).
First, the hook has to be rewritten to produce records that logging can serialise when sending over the wire. It can look like this:
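A sketch of that idea (the exact fields are an assumption; anything the logging handler can serialise works):

def logRoundtrip(response, *args, **kwargs):
    # copy plain, serialisable fields out of the request/response objects
    extra = {
        'req': {
            'method': response.request.method,
            'url': response.request.url,
            'headers': dict(response.request.headers),
            'body': response.request.body,
        },
        'res': {
            'code': response.status_code,
            'reason': response.reason,
            'url': response.url,
            'headers': dict(response.headers),
            'body': response.text,
        },
    }
    logger.debug('HTTP roundtrip', extra=extra)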
Now if you open http://localhost:8080/ (use “logger” for username and empty password for the basic auth popup) and click “Open” button, you should see something like:
The response.history attribute is a list of responses that led to the final URL, which can be found in response.url.
response = requests.get(someurl)
if response.history:
print("Request was redirected")
for resp in response.history:
print(resp.status_code, resp.url)
print("Final destination:")
print(response.status_code, response.url)
else:
print("Request was not redirected")
This is answering a slightly different question, but since I got stuck on this myself, I hope it might be useful for someone else.
If you want to use allow_redirects=False and get directly to the first redirect object, rather than following a chain of them, and you just want to get the redirect location directly out of the 302 response object, then r.url won’t work. Instead, it’s the “Location” header:
r = requests.get('http://github.com/', allow_redirects=False)
r.status_code # 302
r.url # http://github.com, not https.
r.headers['Location'] # https://github.com/ -- the redirect destination