Tag archive: python-requests

Requests - how to tell if you got a 404

Question: Requests - how to tell if you got a 404

I’m using the Requests library and accessing a website to gather data from it with the following code:

r = requests.get(url)

I want to add error testing for when an improper URL is entered and a 404 error is returned. If I intentionally enter an invalid URL, when I do this:

print r

I get this:

<Response [404]>

EDIT:

I want to know how to test for that. The object type is still the same. When I do r.content or r.text, I simply get the HTML of a custom 404 page.


Answer 0

Look at the r.status_code attribute:

if r.status_code == 404:
    # A 404 was issued.

Demo:

>>> import requests
>>> r = requests.get('http://httpbin.org/status/404')
>>> r.status_code
404

If you want requests to raise an exception for error codes (4xx or 5xx), call r.raise_for_status():

>>> r = requests.get('http://httpbin.org/status/404')
>>> r.raise_for_status()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "requests/models.py", line 664, in raise_for_status
    raise http_error
requests.exceptions.HTTPError: 404 Client Error: NOT FOUND
>>> r = requests.get('http://httpbin.org/status/200')
>>> r.raise_for_status()
>>> # no exception raised.

You can also test the response object in a boolean context; if the status code is not an error code (4xx or 5xx), it is considered ‘true’:

if r:
    # successful response

If you want to be more explicit, use if r.ok:.

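Putting the pieces above together, a minimal sketch of how the original snippet could handle a bad URL (the URL below is just the httpbin demo endpoint used in the answer):

import requests

url = 'http://httpbin.org/status/404'  # demo endpoint from the examples above
r = requests.get(url)

if not r.ok:                      # same as checking for a 4xx/5xx status code
    print('Request failed with status', r.status_code)

try:
    r.raise_for_status()          # raises requests.exceptions.HTTPError for 4xx/5xx
except requests.exceptions.HTTPError as err:
    print('Error:', err)          # e.g. "404 Client Error: NOT FOUND for url: ..."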

How to fake a browser visit using Python requests?

Question: How to fake a browser visit using Python requests?

I want to get the content from the website below. If I use a browser like Firefox or Chrome, I can get the real website page I want, but if I use the Python requests package (or the wget command) to get it, it returns a totally different HTML page. I thought the developers of the website had made some blocks for this, so the question is:

How do I fake a browser visit by using python requests or command wget?

http://www.ichangtou.com/#company:data_000008.html


Answer 0

Provide a User-Agent header:

import requests

url = 'http://www.ichangtou.com/#company:data_000008.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
print(response.content)

FYI, here is a list of User-Agent strings for different browsers:


As a side note, there is a pretty useful third-party package called fake-useragent that provides a nice abstraction layer over user agents:

fake-useragent

Up to date simple useragent faker with real world database

Demo:

>>> from fake_useragent import UserAgent
>>> ua = UserAgent()
>>> ua.chrome
u'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
>>> ua.random
u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36'

Answer 1

If this question is still valid: I used the fake-useragent package.

How to use it:

from fake_useragent import UserAgent
import requests


ua = UserAgent()
print(ua.chrome)
header = {'User-Agent':str(ua.chrome)}
print(header)
url = "https://www.hybrid-analysis.com/recent-submissions?filter=file&sort=^timestamp"
htmlContent = requests.get(url, headers=header)
print(htmlContent)

Output:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17
{'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
<Response [200]>

Answer 2

Try doing this, using Firefox as a fake user agent (moreover, it's a good starter script for web scraping with cookies):

#!/usr/bin/env python2
# -*- coding: utf8 -*-
# vim:ts=4:sw=4


import cookielib, urllib2, sys

def doIt(uri):
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    # set the fake User-Agent on the opener before making the request
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    page = opener.open(uri)
    print page.read()

for i in sys.argv[1:]:
    doIt(i)

USAGE:

python script.py "http://www.ichangtou.com/#company:data_000008.html"

Answer 3

The root of the answer is that the person asking the question needs a JavaScript interpreter to get what they are after. What I have found is that I was able to get all of the information I wanted from a website as JSON, before it was interpreted by JavaScript. This has saved me a ton of time compared to parsing HTML while hoping each webpage is in the same format.

So when you get a response from a website using requests, really look at the HTML/text, because you might find the JavaScript's JSON in the footer, ready to be parsed.

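As a rough illustration of that idea (the URL and the pageData variable name below are hypothetical, not taken from the original question), you could pull such embedded JSON out of the returned HTML like this:

import json
import re
import requests

resp = requests.get('http://example.com/some-page')  # hypothetical page

# Assume the page embeds its data as:  var pageData = {...};
match = re.search(r'var\s+pageData\s*=\s*(\{.*?\});', resp.text, re.DOTALL)
if match:
    data = json.loads(match.group(1))   # now a regular Python dict
    print(list(data.keys()))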

How to use requests with asyncio?

Question: How to use requests with asyncio?

I want to do parallel http request tasks in asyncio, but I find that python-requests would block the event loop of asyncio. I’ve found aiohttp but it couldn’t provide the service of http request using a http proxy.

So I want to know if there’s a way to do asynchronous http requests with the help of asyncio.


Answer 0

To use requests (or any other blocking libraries) with asyncio, you can use BaseEventLoop.run_in_executor to run a function in another thread and yield from it to get the result. For example:

import asyncio
import requests

@asyncio.coroutine
def main():
    loop = asyncio.get_event_loop()
    future1 = loop.run_in_executor(None, requests.get, 'http://www.google.com')
    future2 = loop.run_in_executor(None, requests.get, 'http://www.google.co.uk')
    response1 = yield from future1
    response2 = yield from future2
    print(response1.text)
    print(response2.text)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

This will get both responses in parallel.

With Python 3.5 you can use the new async/await syntax:

import asyncio
import requests

async def main():
    loop = asyncio.get_event_loop()
    future1 = loop.run_in_executor(None, requests.get, 'http://www.google.com')
    future2 = loop.run_in_executor(None, requests.get, 'http://www.google.co.uk')
    response1 = await future1
    response2 = await future2
    print(response1.text)
    print(response2.text)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

See PEP0492 for more.

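As a side note not taken from the answer above: on Python 3.9+ the same "run blocking requests calls in a thread" idea can be written more compactly with asyncio.to_thread; a minimal sketch:

import asyncio
import requests

async def main():
    # each to_thread call runs requests.get in the default thread pool
    response1, response2 = await asyncio.gather(
        asyncio.to_thread(requests.get, 'http://www.google.com'),
        asyncio.to_thread(requests.get, 'http://www.google.co.uk'),
    )
    print(response1.status_code, response2.status_code)

asyncio.run(main())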

Answer 1

aiohttp can be used with HTTP proxy already:

import asyncio
import aiohttp


@asyncio.coroutine
def do_request():
    proxy_url = 'http://localhost:8118'  # your proxy address
    response = yield from aiohttp.request(
        'GET', 'http://google.com',
        proxy=proxy_url,
    )
    return response

loop = asyncio.get_event_loop()
loop.run_until_complete(do_request())

Answer 2

The answers above are still using the old Python 3.4 style coroutines. Here is what you would write if you are on Python 3.5+.

aiohttp now supports HTTP proxies:

import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = [
            'http://python.org',
            'https://google.com',
            'http://yifei.me'
        ]
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(fetch(session, url))
        htmls = await asyncio.gather(*tasks)
        for html in htmls:
            print(html[:100])

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

Answer 3

Requests does not currently support asyncio and there are no plans to provide such support. It’s likely that you could implement a custom “Transport Adapter” (as discussed here) that knows how to use asyncio.

If I find myself with some time it’s something I might actually look into, but I can’t promise anything.


Answer 4

There is a good example of async/await loops and threading in an article by Pimin Konstantin Kefaloukos, Easy parallel HTTP requests with Python and asyncio:

To minimize the total completion time, we could increase the size of the thread pool to match the number of requests we have to make. Luckily, this is easy to do as we will see next. The code listing below is an example of how to make twenty asynchronous HTTP requests with a thread pool of twenty worker threads:

# Example 3: asynchronous requests with larger thread pool
import asyncio
import concurrent.futures
import requests

async def main():

    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:

        loop = asyncio.get_event_loop()
        futures = [
            loop.run_in_executor(
                executor, 
                requests.get, 
                'http://example.org/'
            )
            for i in range(20)
        ]
        for response in await asyncio.gather(*futures):
            pass


loop = asyncio.get_event_loop()
loop.run_until_complete(main())

Python requests file upload

Question: Python requests file upload

I’m performing a simple task of uploading a file using Python requests library. I searched Stack Overflow and no one seemed to have the same problem, namely, that the file is not received by the server:

import requests
url='http://nesssi.cacr.caltech.edu/cgi-bin/getmulticonedb_release2.cgi/post'
files={'files': open('file.txt','rb')}
values={'upload_file' : 'file.txt' , 'DB':'photcat' , 'OUT':'csv' , 'SHORT':'short'}
r=requests.post(url,files=files,data=values)

I’m filling the value of ‘upload_file’ keyword with my filename, because if I leave it blank, it says

Error - You must select a file to upload!

And now I get

File  file.txt  of size    bytes is  uploaded successfully!
Query service results:  There were 0 lines.

Which comes up only if the file is empty. So I’m stuck as to how to send my file successfully. I know that the file works because if I go to this website and manually fill in the form it returns a nice list of matched objects, which is what I’m after. I’d really appreciate all hints.

Some other threads related (but not answering my problem):


Answer 0

If upload_file is meant to be the file, use:

files = {'upload_file': open('file.txt','rb')}
values = {'DB': 'photcat', 'OUT': 'csv', 'SHORT': 'short'}

r = requests.post(url, files=files, data=values)

and requests will send a multi-part form POST body with the upload_file field set to the contents of the file.txt file.

The filename will be included in the mime header for the specific field:

>>> import requests
>>> open('file.txt', 'wb')  # create an empty demo file
<_io.BufferedWriter name='file.txt'>
>>> files = {'upload_file': open('file.txt', 'rb')}
>>> print(requests.Request('POST', 'http://example.com', files=files).prepare().body.decode('ascii'))
--c226ce13d09842658ffbd31e0563c6bd
Content-Disposition: form-data; name="upload_file"; filename="file.txt"


--c226ce13d09842658ffbd31e0563c6bd--

Note the filename="file.txt" parameter.

You can use a tuple for the files mapping value, with between 2 and 4 elements, if you need more control. The first element is the filename, followed by the contents, and an optional content-type header value and an optional mapping of additional headers:

files = {'upload_file': ('foobar.txt', open('file.txt','rb'), 'text/x-spam')}

This sets an alternative filename and content type, leaving out the optional headers.

If you are meaning the whole POST body to be taken from a file (with no other fields specified), then don’t use the files parameter, just post the file directly as data. You then may want to set a Content-Type header too, as none will be set otherwise. See Python requests – POST data from a file.

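A small sketch of that last case, posting a file's bytes directly as the request body (httpbin.org/post is just a demo endpoint, and the Content-Type value is an assumption about what the server expects):

import requests

url = 'http://httpbin.org/post'  # demo endpoint
with open('file.txt', 'rb') as f:
    r = requests.post(url, data=f, headers={'Content-Type': 'text/plain'})
print(r.status_code)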

Answer 1

(2018) The newer python-requests library has simplified this process; we can use the 'files' argument to signal that we want to upload a multipart-encoded file:

url = 'http://httpbin.org/post'
files = {'file': open('report.xls', 'rb')}

r = requests.post(url, files=files)
r.text

Answer 2

Client Upload

If you want to upload a single file with Python requests library, then requests lib supports streaming uploads, which allow you to send large files or streams without reading into memory.

with open('massive-body', 'rb') as f:
    requests.post('http://some.url/streamed', data=f)

Server Side

Then store the file on the server.py side such that the stream is saved to a file without loading it into memory. The following is an example using Flask file uploads.

# assumes a Flask app context: from flask import Flask, request; import os; app = Flask(__name__)
@app.route("/upload", methods=['POST'])
def upload_file():
    from werkzeug.datastructures import FileStorage
    # 'filename' must come from your own app logic (e.g. a query parameter or header)
    FileStorage(request.stream).save(os.path.join(app.config['UPLOAD_FOLDER'], filename))
    return 'OK', 200

Or use werkzeug Form Data Parsing, as mentioned in a fix for the issue of "large file uploads eating up memory", in order to avoid using memory inefficiently on large file uploads (e.g. a 22 GiB file in ~60 seconds, with memory usage constant at about 13 MiB).

@app.route("/upload", methods=['POST'])
def upload_file():
    def custom_stream_factory(total_content_length, filename, content_type, content_length=None):
        import tempfile
        tmpfile = tempfile.NamedTemporaryFile('wb+', prefix='flaskapp', suffix='.nc')
        app.logger.info("start receiving file ... filename => " + str(tmpfile.name))
        return tmpfile

    import werkzeug, flask
    stream, form, files = werkzeug.formparser.parse_form_data(flask.request.environ, stream_factory=custom_stream_factory)
    for fil in files.values():
        app.logger.info(" ".join(["saved form name", fil.name, "submitted as", fil.filename, "to temporary file", fil.stream.name]))
        # Do whatever with stored file at `fil.stream.name`
    return 'OK', 200

Answer 3

In Ubuntu you can apply this approach: save the file at some (temporary) location, then open it and send it to the API.

# assumes a Django context: f1 is an uploaded file, token is a JWT string, url is the API endpoint
import os
import requests
from django.core.files.base import ContentFile
from django.core.files.storage import default_storage

path = default_storage.save('static/tmp/' + f1.name, ContentFile(f1.read()))
path12 = os.path.join(os.getcwd(), "static/tmp/" + f1.name)
data = {}  # anything else you want to pass along with the file
file1 = open(path12, 'rb')
header = {"Content-Disposition": "attachment; filename=" + f1.name, "Authorization": "JWT " + token}
res = requests.post(url, data=data, files={'file': file1}, headers=header)

What is the difference between 'content' and 'text'

Question: What is the difference between 'content' and 'text'

I am using the terrific Python Requests library. I notice that the fine documentation has many examples of how to do something without explaining the why. For instance, both r.text and r.content are shown as examples of how to get the server response. But where is it explained what these properties do? For instance, when would I choose one over the other? I see that r.text returns a unicode object sometimes, and I suppose that there would be a difference for a non-text response. But where is all this documented? Note that the linked document does state:

You can also access the response body as bytes, for non-text requests:

But then it goes on to show an example of a text response! I can only suppose that the quote above means to say non-text responses instead of non-text requests, as a non-text request does not make sense in HTTP.

In short, where is the proper documentation of the library, as opposed to the (excellent) tutorial on the Python Requests site?


Answer 0

The requests.Response class documentation has more details:

r.text is the content of the response in Unicode, and r.content is the content of the response in bytes.

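A small sketch showing the difference in practice (httpbin.org/get is only a demo endpoint):

import requests

r = requests.get('http://httpbin.org/get')

raw = r.content   # bytes, exactly as received over the wire
text = r.text     # str, decoded from those bytes using r.encoding (or a guessed encoding)

print(type(raw), type(text))   # <class 'bytes'> <class 'str'>
print(r.encoding)              # whichever encoding requests decided to use for r.text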

Answer 1

It seems clear from the documentation what r.content is for:

You can also access the response body as bytes, for non-text requests:

 >>> r.content

If you read further down the page, it addresses, for example, the case of an image file.

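For example, for a binary response such as an image you would use r.content and write the raw bytes to disk; a minimal sketch with a hypothetical URL:

import requests

r = requests.get('http://example.com/logo.png')   # hypothetical image URL
with open('logo.png', 'wb') as f:
    f.write(r.content)   # raw bytes, written to disk unchanged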

Python requests and persistent sessions

Question: Python requests and persistent sessions

I am using the requests module (version 0.10.0 with Python 2.5). I have figured out how to submit data to a login form on a website and retrieve the session key, but I can’t see an obvious way to use this session key in subsequent requests. Can someone fill in the ellipsis in the code below or suggest another approach?

>>> import requests
>>> login_data =  {'formPosted':'1', 'login_email':'me@example.com', 'password':'pw'}
>>> r = requests.post('https://localhost/login.py', login_data)
>>> 
>>> r.text
u'You are being redirected <a href="profilePage?_ck=1349394964">here</a>'
>>> r.cookies
{'session_id_myapp': '127-0-0-1-825ff22a-6ed1-453b-aebc-5d3cf2987065'}
>>> 
>>> r2 = requests.get('https://localhost/profile_data.json', ...)

Answer 0

You can easily create a persistent session using:

s = requests.Session()

After that, continue with your requests as you would:

s.post('https://localhost/login.py', login_data)
#logged in! cookies saved for future requests.
r2 = s.get('https://localhost/profile_data.json', ...)
#cookies sent automatically!
#do whatever, s will keep your cookies intact :)

For more about sessions: https://requests.kennethreitz.org/en/master/user/advanced/#session-objects


Answer 1

The other answers help to understand how to maintain such a session. Additionally, I want to provide a class which keeps the session maintained over different runs of a script (with a cache file). This means a proper "login" is only performed when required (timeout, or no session exists in the cache). It also supports proxy settings over subsequent calls to 'get' or 'post'.

It is tested with Python 3.

Use it as a basis for your own code. The following snippets are released under GPL v3.

import pickle
import datetime
import os
from urllib.parse import urlparse
import requests    

class MyLoginSession:
    """
    a class which handles and saves login sessions. It also keeps track of proxy settings.
    It also maintains a cache file for restoring session data from earlier
    script executions.
    """
    def __init__(self,
                 loginUrl,
                 loginData,
                 loginTestUrl,
                 loginTestString,
                 sessionFileAppendix = '_session.dat',
                 maxSessionTimeSeconds = 30 * 60,
                 proxies = None,
                 userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1',
                 debug = True,
                 forceLogin = False,
                 **kwargs):
        """
        save some information needed to login the session

        you'll have to provide 'loginTestString' which will be looked for in the
        responses html to make sure, you've properly been logged in

        'proxies' is of format { 'https' : 'https://user:pass@server:port', 'http' : ...
        'loginData' will be sent as post data (dictionary of id : value).
        'maxSessionTimeSeconds' will be used to determine when to re-login.
        """
        urlData = urlparse(loginUrl)

        self.proxies = proxies
        self.loginData = loginData
        self.loginUrl = loginUrl
        self.loginTestUrl = loginTestUrl
        self.maxSessionTime = maxSessionTimeSeconds
        self.sessionFile = urlData.netloc + sessionFileAppendix
        self.userAgent = userAgent
        self.loginTestString = loginTestString
        self.debug = debug

        self.login(forceLogin, **kwargs)

    def modification_date(self, filename):
        """
        return last file modification date as datetime object
        """
        t = os.path.getmtime(filename)
        return datetime.datetime.fromtimestamp(t)

    def login(self, forceLogin = False, **kwargs):
        """
        login to a session. Try to read last saved session from cache file. If this fails
        do proper login. If the last cache access was too old, also perform a proper login.
        Always updates session cache file.
        """
        wasReadFromCache = False
        if self.debug:
            print('loading or generating session...')
        if os.path.exists(self.sessionFile) and not forceLogin:
            time = self.modification_date(self.sessionFile)         

            # only load if file less than 30 minutes old
            lastModification = (datetime.datetime.now() - time).seconds
            if lastModification < self.maxSessionTime:
                with open(self.sessionFile, "rb") as f:
                    self.session = pickle.load(f)
                    wasReadFromCache = True
                    if self.debug:
                        print("loaded session from cache (last access %ds ago) "
                              % lastModification)
        if not wasReadFromCache:
            self.session = requests.Session()
            self.session.headers.update({'user-agent' : self.userAgent})
            res = self.session.post(self.loginUrl, data = self.loginData, 
                                    proxies = self.proxies, **kwargs)

            if self.debug:
                print('created new session with login' )
            self.saveSessionToCache()

        # test login
        res = self.session.get(self.loginTestUrl)
        if res.text.lower().find(self.loginTestString.lower()) < 0:
            raise Exception("could not log into provided site '%s'"
                            " (did not find successful login string)"
                            % self.loginUrl)

    def saveSessionToCache(self):
        """
        save session to a cache file
        """
        # always save (to update timeout)
        with open(self.sessionFile, "wb") as f:
            pickle.dump(self.session, f)
            if self.debug:
                print('updated session cache-file %s' % self.sessionFile)

    def retrieveContent(self, url, method = "get", postData = None, **kwargs):
        """
        return the content of the url with respect to the session.

        If 'method' is not 'get', the url will be called with 'postData'
        as a post request.
        """
        if method == 'get':
            res = self.session.get(url , proxies = self.proxies, **kwargs)
        else:
            res = self.session.post(url , data = postData, proxies = self.proxies, **kwargs)

        # the session has been updated on the server, so also update in cache
        self.saveSessionToCache()            

        return res

A code snippet for using the above class may look like this:

if __name__ == "__main__":
    # proxies = {'https' : 'https://user:pass@server:port',
    #           'http' : 'http://user:pass@server:port'}

    loginData = {'user' : 'usr',
                 'password' :  'pwd'}

    loginUrl = 'https://...'
    loginTestUrl = 'https://...'
    successStr = 'Hello Tom'
    s = MyLoginSession(loginUrl, loginData, loginTestUrl, successStr, 
                       #proxies = proxies
                       )

    res = s.retrieveContent('https://....')
    print(res.text)

    # if, for instance, login via JSON values required try this:
    s = MyLoginSession(loginUrl, None, loginTestUrl, successStr, 
                       #proxies = proxies,
                       json = loginData)

Answer 2

Check out my answer in this similar question:

python: urllib2 how to send cookie with urlopen request

import urllib2
import urllib
from cookielib import CookieJar

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# input-type values from the html form
formdata = { "username" : username, "password": password, "form-id" : "1234" }
data_encoded = urllib.urlencode(formdata)
response = opener.open("https://page.com/login.php", data_encoded)
content = response.read()

EDIT:

I see I’ve gotten a few downvotes for my answer, but no explaining comments. I’m guessing it’s because I’m referring to the urllib libraries instead of requests. I do that because the OP asks for help with requests or for someone to suggest another approach.


Answer 3

The documentation says that get takes in an optional cookies argument allowing you to specify cookies to use:

from the docs:

>>> url = 'http://httpbin.org/cookies'
>>> cookies = dict(cookies_are='working')

>>> r = requests.get(url, cookies=cookies)
>>> r.text
'{"cookies": {"cookies_are": "working"}}'

http://docs.python-requests.org/en/latest/user/quickstart/#cookies


Answer 4

Upon trying all the answers above, I found that using “RequestsCookieJar” instead of the regular CookieJar for subsequent requests fixed my problem.

import requests
import json

# The Login URL
authUrl = 'https://whatever.com/login'

# The subsequent URL
testUrl = 'https://whatever.com/someEndpoint'

# Logout URL
testlogoutUrl = 'https://whatever.com/logout'

# Whatever you are posting
login_data =  {'formPosted':'1', 
               'login_email':'me@example.com', 
               'password':'pw'
               }

# The Authentication token or any other data that we will receive from the Authentication Request. 
token = ''

# Post the login Request
loginRequest = requests.post(authUrl, login_data)
print("{}".format(loginRequest.text))

# Save the request content to your variable. In this case I needed a field called token. 
token = str(json.loads(loginRequest.content)['token'])  # or ['access_token']
print("{}".format(token))

# Verify Successful login
print("{}".format(loginRequest.status_code))

# Create your Requests Cookie Jar for your subsequent requests and add the cookie
jar = requests.cookies.RequestsCookieJar()
jar.set('LWSSO_COOKIE_KEY', token)

# Execute your next request(s) with the Request Cookie Jar set
r = requests.get(testUrl, cookies=jar)
print("R.TEXT: {}".format(r.text))
print("R.STCD: {}".format(r.status_code))

# Execute your logout request(s) with the Request Cookie Jar set
r = requests.delete(testlogoutUrl, cookies=jar)
print("R.TEXT: {}".format(r.text))  # should show "Request Not Authorized"
print("R.STCD: {}".format(r.status_code))  # should show 401

Answer 5

A snippet to retrieve password-protected JSON data:

import requests

username = "my_user_name"
password = "my_super_secret"
url = "https://www.my_base_url.com"
the_page_i_want = "/my_json_data_page"

session = requests.Session()
# retrieve cookie value
resp = session.get(url+'/login')
csrf_token = resp.cookies['csrftoken']
# login, add referer
resp = session.post(url+"/login",
                  data={
                      'username': username,
                      'password': password,
                      'csrfmiddlewaretoken': csrf_token,
                      'next': the_page_i_want,
                  },
                  headers=dict(Referer=url+"/login"))
print(resp.json())

Answer 6

Save only required cookies and reuse them.

import os
import pickle
import requests
from urllib.parse import urljoin, urlparse

login = 'my@email.com'
password = 'secret'
# Assuming two cookies are used for persistent login.
# (Find it by tracing the login process)
persistentCookieNames = ['sessionId', 'profileId']
URL = 'http://example.com'
urlData = urlparse(URL)
cookieFile = urlData.netloc + '.cookie'
signinUrl = urljoin(URL, "/signin")
with requests.Session() as session:
    try:
        with open(cookieFile, 'rb') as f:
            print("Loading cookies...")
            session.cookies.update(pickle.load(f))
    except Exception:
        # If cookies could not be loaded from the file, get new ones by logging in
        print("Login in...")
        post = session.post(
            signinUrl,
            data={
                'email': login,
                'password': password,
            }
        )
        try:
            with open(cookieFile, 'wb') as f:
                jar = requests.cookies.RequestsCookieJar()
                for cookie in session.cookies:
                    if cookie.name in persistentCookieNames:
                        jar.set_cookie(cookie)
                pickle.dump(jar, f)
        except Exception as e:
            os.remove(cookieFile)
            raise(e)
    MyPage = urljoin(URL, "/mypage")
    page = session.get(MyPage)

Answer 7

This will work for you in Python:

# Call JIRA API with HTTPBasicAuth
import json
import requests
from requests.auth import HTTPBasicAuth

JIRA_EMAIL = "****"
JIRA_TOKEN = "****"
BASE_URL = "https://****.atlassian.net"
API_URL = "/rest/api/3/serverInfo"

API_URL = BASE_URL+API_URL

BASIC_AUTH = HTTPBasicAuth(JIRA_EMAIL, JIRA_TOKEN)
HEADERS = {'Content-Type' : 'application/json;charset=iso-8859-1'}

response = requests.get(
    API_URL,
    headers=HEADERS,
    auth=BASIC_AUTH
)

print(json.dumps(json.loads(response.text), sort_keys=True, indent=4, separators=(",", ": ")))

Python requests package: handling XML responses

Question: Python requests package: handling XML responses

I like the requests package very much, including its comfortable way of handling JSON responses.

Unfortunately, I could not figure out whether I can also process XML responses. Does anybody have experience with how to handle XML responses with the requests package? Is it necessary to include another package for the XML decoding?


Answer 0

requests does not handle parsing XML responses, no. XML responses are much more complex in nature than JSON responses; how you'd serialize XML data into Python structures is not nearly as straightforward.

Python comes with built-in XML parsers. I recommend you use the ElementTree API:

import requests
from xml.etree import ElementTree

response = requests.get(url)

tree = ElementTree.fromstring(response.content)

or, if the response is particularly large, use an incremental approach:

response = requests.get(url, stream=True)
# if the server sent a Gzip or Deflate compressed response, decompress
# as we read the raw stream:
response.raw.decode_content = True

events = ElementTree.iterparse(response.raw)
for event, elem in events:
    # do something with `elem`

The external lxml project builds on the same API to give you more features and power still.

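A minimal sketch of the lxml variant (pip install lxml; the URL and the XPath expression are placeholders, not taken from the original answer):

import requests
from lxml import etree

url = 'http://example.com/data.xml'   # placeholder endpoint
response = requests.get(url)

tree = etree.fromstring(response.content)
# lxml adds conveniences such as full XPath support:
titles = tree.xpath('//title/text()')
print(titles)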

How to "log in" to a website using Python's Requests module?

Question: How to "log in" to a website using Python's Requests module?

I am trying to post a request to log in to a website using the Requests module in Python, but it's not really working. I'm new to this... so I can't figure out if I should make my Username and Password cookies or some type of HTTP authorization thing I found (??).

from pyquery import PyQuery
import requests

url = 'http://www.locationary.com/home/index2.jsp'

So now, I think I’m supposed to use “post” and cookies….

ck = {'inUserName': 'USERNAME/EMAIL', 'inUserPass': 'PASSWORD'}

r = requests.post(url, cookies=ck)

content = r.text

q = PyQuery(content)

title = q("title").text()

print title

I have a feeling that I’m doing the cookies thing wrong…I don’t know.

If it doesn’t log in correctly, the title of the home page should come out to “Locationary.com” and if it does, it should be “Home Page.”

If you could maybe explain a few things about requests and cookies to me and help me out with this, I would greatly appreciate it. :D

Thanks.

…It still didn’t really work yet. Okay…so this is what the home page HTML says before you log in:

</td><td><img src="http://www.locationary.com/img/LocationaryImgs/icons/txt_email.gif">    </td>
<td><input class="Data_Entry_Field_Login" type="text" name="inUserName" id="inUserName"  size="25"></td>
<td><img src="http://www.locationary.com/img/LocationaryImgs/icons/txt_password.gif"> </td>
<td><input  class="Data_Entry_Field_Login"  type="password" name="inUserPass"     id="inUserPass"></td>

So I think I’m doing it right, but the output is still “Locationary.com”

2nd EDIT:

I want to be able to stay logged in for a long time and whenever I request a page under that domain, I want the content to show up as if I were logged in.


Answer 0

If the information you want is on the page you are directed to immediately after login…

Let's call your ck variable payload instead, like in the python-requests docs:

payload = {'inUserName': 'USERNAME/EMAIL', 'inUserPass': 'PASSWORD'}
url = 'http://www.locationary.com/home/index2.jsp'
requests.post(url, data=payload)

Otherwise…

See https://stackoverflow.com/a/17633072/111362 below.


Answer 1

I know you’ve found another solution, but for those like me who find this question, looking for the same thing, it can be achieved with requests as follows:

Firstly, as Marcus did, check the source of the login form to get three pieces of information – the url that the form posts to, and the name attributes of the username and password fields. In his example, they are inUserName and inUserPass.

Once you’ve got that, you can use a requests.Session() instance to make a post request to the login url with your login details as a payload. Making requests from a session instance is essentially the same as using requests normally, it simply adds persistence, allowing you to store and use cookies etc.

Assuming your login attempt was successful, you can simply use the session instance to make further requests to the site. The cookie that identifies you will be used to authorise the requests.

Example

import requests

# Fill in your details here to be posted to the login form.
payload = {
    'inUserName': 'username',
    'inUserPass': 'password'
}

# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    p = s.post('LOGIN_URL', data=payload)
    # print the html returned or something more intelligent to see if it's a successful login page.
    print p.text

    # An authorised request.
    r = s.get('A protected web page url')
    print r.text
        # etc...
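
Regarding the asker's second edit (staying logged in for a long time): one option, sketched here under the assumption that the site keeps the login in ordinary session cookies and that a hypothetical cookies.txt file is acceptable, is to persist the session's cookie jar to disk and reload it on the next run:

import os
import requests
from http.cookiejar import LWPCookieJar

session = requests.Session()
# Keep cookies in an LWP-format file so they survive between runs.
session.cookies = LWPCookieJar('cookies.txt')

if os.path.exists('cookies.txt'):
    session.cookies.load(ignore_discard=True)
else:
    # Log in once; 'LOGIN_URL' and the field names are placeholders.
    session.post('LOGIN_URL', data={'inUserName': 'username', 'inUserPass': 'password'})
    session.cookies.save(ignore_discard=True)

# Later requests under the same domain reuse the stored cookies.
r = session.get('A protected web page url')

How long you stay logged in still depends on the expiry the server puts on its cookies; once they expire, simply repeat the login step.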

回答 2

让我尝试简化一下:假设该站点的 URL 是 http://example.com/,并且您需要填写用户名和密码来登录,于是我们打开登录页面 http://example.com/login.php,查看其源代码并查找 action 网址,它会出现在 form 标签中,例如

 <form name="loginform" method="post" action="userinfo.php">

现在用 userinfo.php 拼出绝对 URL,即 http://example.com/userinfo.php,然后运行一个简单的 python 脚本

import requests
url = 'http://example.com/userinfo.php'
values = {'username': 'user',
          'password': 'pass'}

r = requests.post(url, data=values)
print r.content

我希望这有一天能对某人有所帮助。

Let me try to make it simple, suppose URL of the site is http://example.com/ and let’s suppose you need to sign up by filling username and password, so we go to the login page say http://example.com/login.php now and view it’s source code and search for the action URL it will be in form tag something like

 <form name="loginform" method="post" action="userinfo.php">

now take userinfo.php to make absolute URL which will be ‘http://example.com/userinfo.php‘, now run a simple python script

import requests
url = 'http://example.com/userinfo.php'
values = {'username': 'user',
          'password': 'pass'}

r = requests.post(url, data=values)
print r.content

I Hope that this helps someone somewhere someday.


回答 3

找出网站表单上用于用户名 <...name=username.../> 和密码 <...name=password../> 的输入名称,并在下面的脚本中替换它们。另外,把 URL 替换为您想要登录的目标站点。

login.py

#!/usr/bin/env python

import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
payload = { 'username': 'user@email.com', 'password': 'blahblahsecretpassw0rd' }
url = 'https://website.com/login.html'
requests.post(url, data=payload, verify=False)

使用 disable_warnings(InsecureRequestWarning) 可以在尝试登录使用未验证 SSL 证书的站点时静默脚本的相关警告输出。

额外:

要在基于 UNIX 的系统上从命令行运行此脚本,请将其放到某个目录中(例如 home/scripts),并在 ~/.bash_profile 或终端使用的类似文件中把该目录加入 PATH。

# Custom scripts
export CUSTOM_SCRIPTS=home/scripts
export PATH=$CUSTOM_SCRIPTS:$PATH

然后为 home/scripts/login.py 中的这个 python 脚本创建一个链接

ln -s ~/home/scripts/login.py ~/home/scripts/login

关闭您的终端,启动一个新终端,运行 login

Find out the names of the inputs used on the website's form for the username <...name=username.../> and password <...name=password../> and replace them in the script below. Also replace the URL to point at the desired site to log into.

login.py

#!/usr/bin/env python

import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
payload = { 'username': 'user@email.com', 'password': 'blahblahsecretpassw0rd' }
url = 'https://website.com/login.html'
requests.post(url, data=payload, verify=False)

The use of disable_warnings(InsecureRequestWarning) will silence any output from the script when trying to log into sites with unverified SSL certificates.

Extra:

To run this script from the command line on a UNIX based system place it in a directory, i.e. home/scripts and add this directory to your path in ~/.bash_profile or a similar file used by the terminal.

# Custom scripts
export CUSTOM_SCRIPTS=home/scripts
export PATH=$CUSTOM_SCRIPTS:$PATH

Then create a link to this python script inside home/scripts/login.py

ln -s ~/home/scripts/login.py ~/home/scripts/login

Close your terminal, start a new one, run login


回答 4

requests.Session() 这种解决方案有助于登录带有 CSRF 保护的表单(例如 Flask-WTF 表单中所用的那种)。检查是否需要 csrf_token 作为隐藏字段,如果需要,就把它与用户名和密码一起加入有效负载:

import requests
from bs4 import BeautifulSoup

payload = {
    'email': 'email@example.com',
    'password': 'passw0rd'
}     

with requests.Session() as sess:
    res = sess.get(server_name + '/signin')
    signin = BeautifulSoup(res._content, 'html.parser')
    payload['csrf_token'] = signin.find('input', id='csrf_token')['value']
    res = sess.post(server_name + '/auth/login', data=payload)

The requests.Session() solution assisted with logging into a form with CSRF Protection (as used in Flask-WTF forms). Check if a csrf_token is required as a hidden field and add it to the payload with the username and password:

import requests
from bs4 import BeautifulSoup

payload = {
    'email': 'email@example.com',
    'password': 'passw0rd'
}     

with requests.Session() as sess:
    res = sess.get(server_name + '/signin')
    signin = BeautifulSoup(res._content, 'html.parser')
    payload['csrf_token'] = signin.find('input', id='csrf_token')['value']
    res = sess.post(server_name + '/auth/login', data=payload)

记录来自python-requests模块的所有请求

问题:记录来自python-requests模块的所有请求

我正在使用 python Requests。我需要调试一些 OAuth 活动,为此我希望它记录所有正在执行的请求。我本可以用 ngrep 获得这些信息,但遗憾的是无法 grep https 连接(而 OAuth 需要 https)。

如何激活Requests正在访问的所有URL(+参数)的日志记录?

I am using python Requests. I need to debug some OAuth activity, and for that I would like it to log all requests being performed. I could get this information with ngrep, but unfortunately it is not possible to grep https connections (which are needed for OAuth)

How can I activate logging of all URLs (+ parameters) that Requests is accessing?


回答 0

底层的 urllib3 库会通过 logging 模块记录所有新连接和 URL(但不记录 POST 正文)。对于 GET 请求,这应该足够了:

import logging

logging.basicConfig(level=logging.DEBUG)

这为您提供了最详细的日志记录选项;有关如何配置日志记录级别和目标的更多详细信息,请参见日志记录操作指南。

简短的演示:

>>> import requests
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>> r = requests.get('http://httpbin.org/get?foo=bar&baz=python')
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): httpbin.org:80
DEBUG:urllib3.connectionpool:http://httpbin.org:80 "GET /get?foo=bar&baz=python HTTP/1.1" 200 366

根据urllib3的确切版本,将记录以下消息:

  • INFO:重定向
  • WARN:连接池已满(如果经常发生这种情况,请增大连接池)
  • WARN:无法解析标头(格式无效的响应标头)
  • WARN:重试连接
  • WARN:证书与预期的主机名不匹配
  • WARN:处理分块响应时,收到了同时带有 Content-Length 和 Transfer-Encoding 的响应
  • DEBUG:新连接(HTTP或HTTPS)
  • DEBUG:断开的连接
  • DEBUG:连接详细信息:方法,路径,HTTP版本,状态代码和响应长度
  • DEBUG:重试计数增量

这不包括标头或正文。urllib3 使用 http.client.HTTPConnection 类来完成底层工作,但该类不支持日志记录,通常只能配置为打印到 stdout。不过,您可以在该模块中引入一个替代的 print 名称,让它把所有调试信息改为发送到 logging:

import logging
import http.client

httpclient_logger = logging.getLogger("http.client")

def httpclient_logging_patch(level=logging.DEBUG):
    """Enable HTTPConnection debug logging to the logging framework"""

    def httpclient_log(*args):
        httpclient_logger.log(level, " ".join(args))

    # mask the print() built-in in the http.client module to use
    # logging instead
    http.client.print = httpclient_log
    # enable debugging
    http.client.HTTPConnection.debuglevel = 1

调用 httpclient_logging_patch() 会让 http.client 连接把所有调试信息输出到一个标准记录器,因此可以被 logging.basicConfig() 拾取:

>>> httpclient_logging_patch()
>>> r = requests.get('http://httpbin.org/get?foo=bar&baz=python')
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): httpbin.org:80
DEBUG:http.client:send: b'GET /get?foo=bar&baz=python HTTP/1.1\r\nHost: httpbin.org\r\nUser-Agent: python-requests/2.22.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
DEBUG:http.client:reply: 'HTTP/1.1 200 OK\r\n'
DEBUG:http.client:header: Date: Tue, 04 Feb 2020 13:36:53 GMT
DEBUG:http.client:header: Content-Type: application/json
DEBUG:http.client:header: Content-Length: 366
DEBUG:http.client:header: Connection: keep-alive
DEBUG:http.client:header: Server: gunicorn/19.9.0
DEBUG:http.client:header: Access-Control-Allow-Origin: *
DEBUG:http.client:header: Access-Control-Allow-Credentials: true
DEBUG:urllib3.connectionpool:http://httpbin.org:80 "GET /get?foo=bar&baz=python HTTP/1.1" 200 366

The underlying urllib3 library logs all new connections and URLs with the logging module, but not POST bodies. For GET requests this should be enough:

import logging

logging.basicConfig(level=logging.DEBUG)

which gives you the most verbose logging option; see the logging HOWTO for more details on how to configure logging levels and destinations.

Short demo:

>>> import requests
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>> r = requests.get('http://httpbin.org/get?foo=bar&baz=python')
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): httpbin.org:80
DEBUG:urllib3.connectionpool:http://httpbin.org:80 "GET /get?foo=bar&baz=python HTTP/1.1" 200 366

Depending on the exact version of urllib3, the following messages are logged:

  • INFO: Redirects
  • WARN: Connection pool full (if this happens often increase the connection pool size)
  • WARN: Failed to parse headers (response headers with invalid format)
  • WARN: Retrying the connection
  • WARN: Certificate did not match expected hostname
  • WARN: Received response with both Content-Length and Transfer-Encoding, when processing a chunked response
  • DEBUG: New connections (HTTP or HTTPS)
  • DEBUG: Dropped connections
  • DEBUG: Connection details: method, path, HTTP version, status code and response length
  • DEBUG: Retry count increments

This doesn’t include headers or bodies. urllib3 uses the http.client.HTTPConnection class to do the grunt-work, but that class doesn’t support logging, it can normally only be configured to print to stdout. However, you can rig it to send all debug information to logging instead by introducing an alternative print name into that module:

import logging
import http.client

httpclient_logger = logging.getLogger("http.client")

def httpclient_logging_patch(level=logging.DEBUG):
    """Enable HTTPConnection debug logging to the logging framework"""

    def httpclient_log(*args):
        httpclient_logger.log(level, " ".join(args))

    # mask the print() built-in in the http.client module to use
    # logging instead
    http.client.print = httpclient_log
    # enable debugging
    http.client.HTTPConnection.debuglevel = 1

Calling httpclient_logging_patch() causes http.client connections to output all debug information to a standard logger, and so are picked up by logging.basicConfig():

>>> httpclient_logging_patch()
>>> r = requests.get('http://httpbin.org/get?foo=bar&baz=python')
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): httpbin.org:80
DEBUG:http.client:send: b'GET /get?foo=bar&baz=python HTTP/1.1\r\nHost: httpbin.org\r\nUser-Agent: python-requests/2.22.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
DEBUG:http.client:reply: 'HTTP/1.1 200 OK\r\n'
DEBUG:http.client:header: Date: Tue, 04 Feb 2020 13:36:53 GMT
DEBUG:http.client:header: Content-Type: application/json
DEBUG:http.client:header: Content-Length: 366
DEBUG:http.client:header: Connection: keep-alive
DEBUG:http.client:header: Server: gunicorn/19.9.0
DEBUG:http.client:header: Access-Control-Allow-Origin: *
DEBUG:http.client:header: Access-Control-Allow-Credentials: true
DEBUG:urllib3.connectionpool:http://httpbin.org:80 "GET /get?foo=bar&baz=python HTTP/1.1" 200 366

回答 1

您需要在 httplib 层级启用调试(requests → urllib3 → httplib)。

下面是一些用于开关(..._on() 和 ..._off())或临时启用它的函数:

import logging
import contextlib
try:
    from http.client import HTTPConnection # py3
except ImportError:
    from httplib import HTTPConnection # py2

def debug_requests_on():
    '''Switches on logging of the requests module.'''
    HTTPConnection.debuglevel = 1

    logging.basicConfig()
    logging.getLogger().setLevel(logging.DEBUG)
    requests_log = logging.getLogger("requests.packages.urllib3")
    requests_log.setLevel(logging.DEBUG)
    requests_log.propagate = True

def debug_requests_off():
    '''Switches off logging of the requests module, might be some side-effects'''
    HTTPConnection.debuglevel = 0

    root_logger = logging.getLogger()
    root_logger.setLevel(logging.WARNING)
    root_logger.handlers = []
    requests_log = logging.getLogger("requests.packages.urllib3")
    requests_log.setLevel(logging.WARNING)
    requests_log.propagate = False

@contextlib.contextmanager
def debug_requests():
    '''Use with 'with'!'''
    debug_requests_on()
    yield
    debug_requests_off()

演示用途:

>>> requests.get('http://httpbin.org/')
<Response [200]>

>>> debug_requests_on()
>>> requests.get('http://httpbin.org/')
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): httpbin.org
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 12150
send: 'GET / HTTP/1.1\r\nHost: httpbin.org\r\nConnection: keep-alive\r\nAccept-
Encoding: gzip, deflate\r\nAccept: */*\r\nUser-Agent: python-requests/2.11.1\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx
...
<Response [200]>

>>> debug_requests_off()
>>> requests.get('http://httpbin.org/')
<Response [200]>

>>> with debug_requests():
...     requests.get('http://httpbin.org/')
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): httpbin.org
...
<Response [200]>

您将看到包括 HEADERS 和 DATA 在内的 REQUEST,以及带有 HEADERS 但没有 DATA 的 RESPONSE。唯一缺少的是 response.body,它不会被记录。

来源

You need to enable debugging at httplib level (requestsurllib3httplib).

Here’s some functions to both toggle (..._on() and ..._off()) or temporarily have it on:

import logging
import contextlib
try:
    from http.client import HTTPConnection # py3
except ImportError:
    from httplib import HTTPConnection # py2

def debug_requests_on():
    '''Switches on logging of the requests module.'''
    HTTPConnection.debuglevel = 1

    logging.basicConfig()
    logging.getLogger().setLevel(logging.DEBUG)
    requests_log = logging.getLogger("requests.packages.urllib3")
    requests_log.setLevel(logging.DEBUG)
    requests_log.propagate = True

def debug_requests_off():
    '''Switches off logging of the requests module, might be some side-effects'''
    HTTPConnection.debuglevel = 0

    root_logger = logging.getLogger()
    root_logger.setLevel(logging.WARNING)
    root_logger.handlers = []
    requests_log = logging.getLogger("requests.packages.urllib3")
    requests_log.setLevel(logging.WARNING)
    requests_log.propagate = False

@contextlib.contextmanager
def debug_requests():
    '''Use with 'with'!'''
    debug_requests_on()
    yield
    debug_requests_off()

Demo use:

>>> requests.get('http://httpbin.org/')
<Response [200]>

>>> debug_requests_on()
>>> requests.get('http://httpbin.org/')
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): httpbin.org
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 12150
send: 'GET / HTTP/1.1\r\nHost: httpbin.org\r\nConnection: keep-alive\r\nAccept-
Encoding: gzip, deflate\r\nAccept: */*\r\nUser-Agent: python-requests/2.11.1\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: nginx
...
<Response [200]>

>>> debug_requests_off()
>>> requests.get('http://httpbin.org/')
<Response [200]>

>>> with debug_requests():
...     requests.get('http://httpbin.org/')
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): httpbin.org
...
<Response [200]>

You will see the REQUEST, including HEADERS and DATA, and RESPONSE with HEADERS but without DATA. The only thing missing will be the response.body which is not logged.

Source
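
If the missing response body matters to you, one workaround, sketched here and developed much further in a later answer, is a response hook on a requests.Session that logs the body separately:

import logging
import requests

logging.basicConfig(level=logging.DEBUG)
body_log = logging.getLogger('httpbody')

def log_body(response, *args, **kwargs):
    # Log the decoded body of every response made through this session.
    body_log.debug(response.text)

s = requests.Session()
s.hooks['response'].append(log_body)
s.get('http://httpbin.org/get')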


回答 2

对于使用python 3+的用户

import requests
import logging
import http.client

http.client.HTTPConnection.debuglevel = 1

logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("requests.packages.urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True

For those using python 3+

import requests
import logging
import http.client

http.client.HTTPConnection.debuglevel = 1

logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("requests.packages.urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True

回答 3

当尝试让 Python 日志系统(import logging)发出底层调试日志消息时,我惊讶地发现,在下面这条调用链中:

requests --> urllib3 --> http.client.HTTPConnection

实际上只有 urllib3 使用 Python logging 系统:

  • requests 没有
  • http.client.HTTPConnection 没有
  • urllib3 使用

当然,您可以通过如下设置从 HTTPConnection 中提取调试消息:

HTTPConnection.debuglevel = 1

但是这些输出仅通过print语句发出。为了证明这一点,只需grep Python 3.7 client.py源代码并自己查看打印语句(感谢@Yohann):

curl https://raw.githubusercontent.com/python/cpython/3.7/Lib/http/client.py |grep -A1 debuglevel

大概可以通过某种方式重定向 stdout,把这些打印输出接入日志系统,从而有可能把它们捕获到例如日志文件中。

选择 'urllib3' 记录器,而不是 'requests.packages.urllib3'

要通过 Python 3 logging 系统捕获 urllib3 的调试信息,与互联网上的许多建议相反(正如 @MikeSmith 所指出的),用下面这种方式拦截不会有什么运气:

log = logging.getLogger('requests.packages.urllib3')

相反,您需要:

log = logging.getLogger('urllib3')

将 urllib3 的调试信息写入日志文件

下面这段代码使用 Python logging 系统把 urllib3 的工作记录到日志文件中:

import requests
import logging
from http.client import HTTPConnection  # py3

# log = logging.getLogger('requests.packages.urllib3')  # useless
log = logging.getLogger('urllib3')  # works

log.setLevel(logging.DEBUG)  # needed
fh = logging.FileHandler("requests.log")
log.addHandler(fh)

requests.get('http://httpbin.org/')

结果:

Starting new HTTP connection (1): httpbin.org:80
http://httpbin.org:80 "GET / HTTP/1.1" 200 3168

启用 HTTPConnection.debuglevel 的 print() 语句

如果您设定 HTTPConnection.debuglevel = 1

from http.client import HTTPConnection  # py3
HTTPConnection.debuglevel = 1
requests.get('http://httpbin.org/')

您将获得包含更多有用底层信息的 print 语句输出:

send: b'GET / HTTP/1.1\r\nHost: httpbin.org\r\nUser-Agent: python- 
requests/2.22.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Access-Control-Allow-Credentials header: Access-Control-Allow-Origin 
header: Content-Encoding header: Content-Type header: Date header: ...

请记住,此输出使用的是 print 而不是 Python logging 系统,因此无法用传统的 logging 流处理程序或文件处理程序捕获(尽管可以通过重定向 stdout 把输出捕获到文件中)。

结合以上两者-将所有可能的日志记录最大化到控制台

为了最大化所有可能的日志记录,您必须接受如下的控制台 / stdout 输出方式:

import requests
import logging
from http.client import HTTPConnection  # py3

log = logging.getLogger('urllib3')
log.setLevel(logging.DEBUG)

# logging from urllib3 to console
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
log.addHandler(ch)

# print statements from `http.client.HTTPConnection` to console/stdout
HTTPConnection.debuglevel = 1

requests.get('http://httpbin.org/')

提供完整的输出范围:

Starting new HTTP connection (1): httpbin.org:80
send: b'GET / HTTP/1.1\r\nHost: httpbin.org\r\nUser-Agent: python-requests/2.22.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
http://httpbin.org:80 "GET / HTTP/1.1" 200 3168
header: Access-Control-Allow-Credentials header: Access-Control-Allow-Origin 
header: Content-Encoding header: ...

When trying to get the Python logging system (import logging) to emit low level debug log messages, it surprised me to discover that given:

requests --> urllib3 --> http.client.HTTPConnection

that only urllib3 actually uses the Python logging system:

  • requests no
  • http.client.HTTPConnection no
  • urllib3 yes

Sure, you can extract debug messages from HTTPConnection by setting:

HTTPConnection.debuglevel = 1

but these outputs are merely emitted via the print statement. To prove this, simply grep the Python 3.7 client.py source code and view the print statements yourself (thanks @Yohann):

curl https://raw.githubusercontent.com/python/cpython/3.7/Lib/http/client.py |grep -A1 debuglevel

Presumably redirecting stdout in some way might work to shoe-horn stdout into the logging system and potentially capture to e.g. a log file.
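
One way to act on that idea, given here only as a rough sketch rather than as part of the original answer, is to wrap the call in contextlib.redirect_stdout with a small file-like object that forwards each completed line to a logger:

import contextlib
import io
import logging

import requests
from http.client import HTTPConnection

class StdoutToLogger(io.TextIOBase):
    """Forward complete printed lines to a logger."""
    def __init__(self, logger, level=logging.DEBUG):
        self.logger = logger
        self.level = level
        self._buffer = ''

    def write(self, text):
        # print() emits a line in several chunks, so buffer until a newline.
        self._buffer += text
        while '\n' in self._buffer:
            line, self._buffer = self._buffer.split('\n', 1)
            if line.strip():
                self.logger.log(self.level, line)
        return len(text)

logging.basicConfig(level=logging.DEBUG)  # handler writes to stderr, so no feedback loop
HTTPConnection.debuglevel = 1

with contextlib.redirect_stdout(StdoutToLogger(logging.getLogger('http.client'))):
    requests.get('http://httpbin.org/get')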

Choose the ‘urllib3‘ logger not ‘requests.packages.urllib3

To capture urllib3 debug information through the Python 3 logging system, contrary to much advice on the internet, and as @MikeSmith points out, you won’t have much luck intercepting:

log = logging.getLogger('requests.packages.urllib3')

instead you need to:

log = logging.getLogger('urllib3')

Debugging urllib3 to a log file

Here is some code which logs urllib3 workings to a log file using the Python logging system:

import requests
import logging
from http.client import HTTPConnection  # py3

# log = logging.getLogger('requests.packages.urllib3')  # useless
log = logging.getLogger('urllib3')  # works

log.setLevel(logging.DEBUG)  # needed
fh = logging.FileHandler("requests.log")
log.addHandler(fh)

requests.get('http://httpbin.org/')

the result:

Starting new HTTP connection (1): httpbin.org:80
http://httpbin.org:80 "GET / HTTP/1.1" 200 3168

Enabling the HTTPConnection.debuglevel print() statements

If you set HTTPConnection.debuglevel = 1

from http.client import HTTPConnection  # py3
HTTPConnection.debuglevel = 1
requests.get('http://httpbin.org/')

you’ll get the print statement output of additional juicy low level info:

send: b'GET / HTTP/1.1\r\nHost: httpbin.org\r\nUser-Agent: python- 
requests/2.22.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Access-Control-Allow-Credentials header: Access-Control-Allow-Origin 
header: Content-Encoding header: Content-Type header: Date header: ...

Remember this output uses print and not the Python logging system, and thus cannot be captured using a traditional logging stream or file handler (though it may be possible to capture output to a file by redirecting stdout).

Combine the two above – maximise all possible logging to console

To maximise all possible logging, you must settle for console/stdout output with this:

import requests
import logging
from http.client import HTTPConnection  # py3

log = logging.getLogger('urllib3')
log.setLevel(logging.DEBUG)

# logging from urllib3 to console
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
log.addHandler(ch)

# print statements from `http.client.HTTPConnection` to console/stdout
HTTPConnection.debuglevel = 1

requests.get('http://httpbin.org/')

giving the full range of output:

Starting new HTTP connection (1): httpbin.org:80
send: b'GET / HTTP/1.1\r\nHost: httpbin.org\r\nUser-Agent: python-requests/2.22.0\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
http://httpbin.org:80 "GET / HTTP/1.1" 200 3168
header: Access-Control-Allow-Credentials header: Access-Control-Allow-Origin 
header: Content-Encoding header: ...

回答 4

我正在使用python 3.4,请求2.19.1:

“urllib3” 是现在应该获取的记录器(不再是 “requests.packages.urllib3”)。即使不设置 http.client.HTTPConnection.debuglevel,基本日志记录仍会发生。

I’m using python 3.4, requests 2.19.1:

‘urllib3’ is the logger to get now (no longer ‘requests.packages.urllib3’). Basic logging will still happen without setting http.client.HTTPConnection.debuglevel

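For illustration, a minimal sketch consistent with this answer (logger name 'urllib3', no debuglevel needed) could look like:

import logging
import requests

logging.basicConfig(level=logging.DEBUG)
logging.getLogger('urllib3').setLevel(logging.DEBUG)

requests.get('http://httpbin.org/get')
# DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): httpbin.org:80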

回答 5

在为网络协议调试编写脚本、甚至构建应用程序的子系统时,人们希望确切地看到请求-响应对,包括实际生效的 URL、标头、有效负载和状态。而在代码各处为单个请求逐一添加检测通常并不现实。同时,出于性能方面的考虑,建议使用单个(或少量专门的)requests.Session,因此以下内容假定遵循了该建议。

requests 支持所谓的事件挂钩(从 2.23 起实际上只有 response 挂钩)。它本质上是一个事件侦听器,该事件在 requests.request 返回控制权之前发出。此时请求和响应都已完全确定,因此可以记录下来。

import logging

import requests


logger = logging.getLogger('httplogger')

def logRoundtrip(response, *args, **kwargs):
    extra = {'req': response.request, 'res': response}
    logger.debug('HTTP roundtrip', extra=extra)

session = requests.Session()
session.hooks['response'].append(logRoundtrip)

基本上,这就是记录会话的所有HTTP往返的方式。

格式化HTTP往返日志记录

为了让上面的日志记录真正有用,可以编写一个专门的日志格式化程序,让它理解日志记录上的 req 和 res 附加信息。它看起来可能像这样:

import textwrap

class HttpFormatter(logging.Formatter):   

    def _formatHeaders(self, d):
        return '\n'.join(f'{k}: {v}' for k, v in d.items())

    def formatMessage(self, record):
        result = super().formatMessage(record)
        if record.name == 'httplogger':
            result += textwrap.dedent('''
                ---------------- request ----------------
                {req.method} {req.url}
                {reqhdrs}

                {req.body}
                ---------------- response ----------------
                {res.status_code} {res.reason} {res.url}
                {reshdrs}

                {res.text}
            ''').format(
                req=record.req,
                res=record.res,
                reqhdrs=self._formatHeaders(record.req.headers),
                reshdrs=self._formatHeaders(record.res.headers),
            )

        return result

formatter = HttpFormatter('{asctime} {levelname} {name} {message}', style='{')
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logging.basicConfig(level=logging.DEBUG, handlers=[handler])

现在,如果您用该 session 发出一些请求,例如:

session.get('https://httpbin.org/user-agent')
session.get('https://httpbin.org/status/200')

stderr 的输出将如下所示。

2020-05-14 22:10:13,224 DEBUG urllib3.connectionpool Starting new HTTPS connection (1): httpbin.org:443
2020-05-14 22:10:13,695 DEBUG urllib3.connectionpool https://httpbin.org:443 "GET /user-agent HTTP/1.1" 200 45
2020-05-14 22:10:13,698 DEBUG httplogger HTTP roundtrip
---------------- request ----------------
GET https://httpbin.org/user-agent
User-Agent: python-requests/2.23.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive

None
---------------- response ----------------
200 OK https://httpbin.org/user-agent
Date: Thu, 14 May 2020 20:10:13 GMT
Content-Type: application/json
Content-Length: 45
Connection: keep-alive
Server: gunicorn/19.9.0
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true

{
  "user-agent": "python-requests/2.23.0"
}


2020-05-14 22:10:13,814 DEBUG urllib3.connectionpool https://httpbin.org:443 "GET /status/200 HTTP/1.1" 200 0
2020-05-14 22:10:13,818 DEBUG httplogger HTTP roundtrip
---------------- request ----------------
GET https://httpbin.org/status/200
User-Agent: python-requests/2.23.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive

None
---------------- response ----------------
200 OK https://httpbin.org/status/200
Date: Thu, 14 May 2020 20:10:13 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 0
Connection: keep-alive
Server: gunicorn/19.9.0
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true

GUI方式

当查询很多时,有一个简单的 UI 和筛选记录的方法会很方便。下面演示如何用 Chronologer(我是其作者)来做到这一点。

首先,需要重写该钩子,让它产生 logging 在通过网络发送时能够序列化的记录。它看起来可能像这样:

def logRoundtrip(response, *args, **kwargs): 
    extra = {
        'req': {
            'method': response.request.method,
            'url': response.request.url,
            'headers': response.request.headers,
            'body': response.request.body,
        }, 
        'res': {
            'code': response.status_code,
            'reason': response.reason,
            'url': response.url,
            'headers': response.headers,
            'body': response.text
        },
    }
    logger.debug('HTTP roundtrip', extra=extra)

session = requests.Session()
session.hooks['response'].append(logRoundtrip)

其次,必须调整日志配置,改用 logging.handlers.HTTPHandler(Chronologer 能理解这种格式)。

import logging.handlers

chrono = logging.handlers.HTTPHandler(
  'localhost:8080', '/api/v1/record', 'POST', credentials=('logger', ''))
handlers = [logging.StreamHandler(), chrono]
logging.basicConfig(level=logging.DEBUG, handlers=handlers)

最后,运行Chronologer实例。例如使用Docker:

docker run --rm -it -p 8080:8080 -v /tmp/db \
    -e CHRONOLOGER_STORAGE_DSN=sqlite:////tmp/db/chrono.sqlite \
    -e CHRONOLOGER_SECRET=example \
    -e CHRONOLOGER_ROLES="basic-reader query-reader writer" \
    saaj/chronologer \
    python -m chronologer -e production serve -u www-data -g www-data -m

并再次运行请求:

session.get('https://httpbin.org/user-agent')
session.get('https://httpbin.org/status/200')

流处理程序将生成:

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): httpbin.org:443
DEBUG:urllib3.connectionpool:https://httpbin.org:443 "GET /user-agent HTTP/1.1" 200 45
DEBUG:httplogger:HTTP roundtrip
DEBUG:urllib3.connectionpool:https://httpbin.org:443 "GET /status/200 HTTP/1.1" 200 0
DEBUG:httplogger:HTTP roundtrip

现在,如果您打开 http://localhost:8080/(在基本认证弹出窗口中用户名填 “logger”,密码留空),然后单击 “Open” 按钮,您应该会看到类似以下内容:

When you have a script, or even a subsystem of an application, for network protocol debugging, you want to see exactly what the request-response pairs are, including effective URLs, headers, payloads and status. And it's typically impractical to instrument individual requests all over the place. At the same time there are performance considerations that suggest using a single (or a few specialised) requests.Session, so the following assumes that the suggestion is followed.

requests supports so-called event hooks (as of 2.23 there's actually only the response hook). It's basically an event listener, and the event is emitted before returning control from requests.request. At this moment both request and response are fully defined, hence can be logged.

import logging

import requests


logger = logging.getLogger('httplogger')

def logRoundtrip(response, *args, **kwargs):
    extra = {'req': response.request, 'res': response}
    logger.debug('HTTP roundtrip', extra=extra)

session = requests.Session()
session.hooks['response'].append(logRoundtrip)

That’s basically how to log all HTTP round-trips of a session.

Formatting HTTP round-trip log records

For the logging above to be useful there can be specialised logging formatter that understands req and res extras on logging records. It can look like this:

import textwrap

class HttpFormatter(logging.Formatter):   

    def _formatHeaders(self, d):
        return '\n'.join(f'{k}: {v}' for k, v in d.items())

    def formatMessage(self, record):
        result = super().formatMessage(record)
        if record.name == 'httplogger':
            result += textwrap.dedent('''
                ---------------- request ----------------
                {req.method} {req.url}
                {reqhdrs}

                {req.body}
                ---------------- response ----------------
                {res.status_code} {res.reason} {res.url}
                {reshdrs}

                {res.text}
            ''').format(
                req=record.req,
                res=record.res,
                reqhdrs=self._formatHeaders(record.req.headers),
                reshdrs=self._formatHeaders(record.res.headers),
            )

        return result

formatter = HttpFormatter('{asctime} {levelname} {name} {message}', style='{')
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logging.basicConfig(level=logging.DEBUG, handlers=[handler])

Now if you do some requests using the session, like:

session.get('https://httpbin.org/user-agent')
session.get('https://httpbin.org/status/200')

The output to stderr will look as follows.

2020-05-14 22:10:13,224 DEBUG urllib3.connectionpool Starting new HTTPS connection (1): httpbin.org:443
2020-05-14 22:10:13,695 DEBUG urllib3.connectionpool https://httpbin.org:443 "GET /user-agent HTTP/1.1" 200 45
2020-05-14 22:10:13,698 DEBUG httplogger HTTP roundtrip
---------------- request ----------------
GET https://httpbin.org/user-agent
User-Agent: python-requests/2.23.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive

None
---------------- response ----------------
200 OK https://httpbin.org/user-agent
Date: Thu, 14 May 2020 20:10:13 GMT
Content-Type: application/json
Content-Length: 45
Connection: keep-alive
Server: gunicorn/19.9.0
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true

{
  "user-agent": "python-requests/2.23.0"
}


2020-05-14 22:10:13,814 DEBUG urllib3.connectionpool https://httpbin.org:443 "GET /status/200 HTTP/1.1" 200 0
2020-05-14 22:10:13,818 DEBUG httplogger HTTP roundtrip
---------------- request ----------------
GET https://httpbin.org/status/200
User-Agent: python-requests/2.23.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive

None
---------------- response ----------------
200 OK https://httpbin.org/status/200
Date: Thu, 14 May 2020 20:10:13 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 0
Connection: keep-alive
Server: gunicorn/19.9.0
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true

A GUI way

When you have a lot of queries, having a simple UI and a way to filter records comes in handy. I'll show how to use Chronologer for that (which I'm the author of).

First, the hook has to be rewritten to produce records that logging can serialise when sending over the wire. It can look like this:

def logRoundtrip(response, *args, **kwargs): 
    extra = {
        'req': {
            'method': response.request.method,
            'url': response.request.url,
            'headers': response.request.headers,
            'body': response.request.body,
        }, 
        'res': {
            'code': response.status_code,
            'reason': response.reason,
            'url': response.url,
            'headers': response.headers,
            'body': response.text
        },
    }
    logger.debug('HTTP roundtrip', extra=extra)

session = requests.Session()
session.hooks['response'].append(logRoundtrip)

Second, logging configuration has to be adapted to use logging.handlers.HTTPHandler (which Chronologer understands).

import logging.handlers

chrono = logging.handlers.HTTPHandler(
  'localhost:8080', '/api/v1/record', 'POST', credentials=('logger', ''))
handlers = [logging.StreamHandler(), chrono]
logging.basicConfig(level=logging.DEBUG, handlers=handlers)

Finally, run Chronologer instance. e.g. using Docker:

docker run --rm -it -p 8080:8080 -v /tmp/db \
    -e CHRONOLOGER_STORAGE_DSN=sqlite:////tmp/db/chrono.sqlite \
    -e CHRONOLOGER_SECRET=example \
    -e CHRONOLOGER_ROLES="basic-reader query-reader writer" \
    saaj/chronologer \
    python -m chronologer -e production serve -u www-data -g www-data -m

And run the requests again:

session.get('https://httpbin.org/user-agent')
session.get('https://httpbin.org/status/200')

The stream handler will produce:

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): httpbin.org:443
DEBUG:urllib3.connectionpool:https://httpbin.org:443 "GET /user-agent HTTP/1.1" 200 45
DEBUG:httplogger:HTTP roundtrip
DEBUG:urllib3.connectionpool:https://httpbin.org:443 "GET /status/200 HTTP/1.1" 200 0
DEBUG:httplogger:HTTP roundtrip

Now if you open http://localhost:8080/ (use “logger” for username and empty password for the basic auth popup) and click “Open” button, you should see something like:


Python请求库重定向新网址

问题:Python请求库重定向新网址

我一直在浏览Python Requests文档,但是看不到我要实现的功能。

在我的脚本中,我正在设置allow_redirects=True

我想知道页面是否已重定向到其他内容,新的URL是什么。

例如,如果起始URL为: www.google.com/redirect

最终的URL是 www.google.co.uk/redirected

我如何获得该URL?

I’ve been looking through the Python Requests documentation but I cannot see any functionality for what I am trying to achieve.

In my script I am setting allow_redirects=True.

I would like to know if the page has been redirected to something else, what is the new URL.

For example, if the start URL was: www.google.com/redirect

And the final URL is www.google.co.uk/redirected

How do I get that URL?


回答 0

您正在寻找请求历史记录

response.history 属性是通往最终 URL 的一系列响应组成的列表,而最终 URL 可以在 response.url 中找到。

response = requests.get(someurl)
if response.history:
    print("Request was redirected")
    for resp in response.history:
        print(resp.status_code, resp.url)
    print("Final destination:")
    print(response.status_code, response.url)
else:
    print("Request was not redirected")

演示:

>>> import requests
>>> response = requests.get('http://httpbin.org/redirect/3')
>>> response.history
(<Response [302]>, <Response [302]>, <Response [302]>)
>>> for resp in response.history:
...     print(resp.status_code, resp.url)
... 
302 http://httpbin.org/redirect/3
302 http://httpbin.org/redirect/2
302 http://httpbin.org/redirect/1
>>> print(response.status_code, response.url)
200 http://httpbin.org/get

You are looking for the request history.

The response.history attribute is a list of responses that led to the final URL, which can be found in response.url.

response = requests.get(someurl)
if response.history:
    print("Request was redirected")
    for resp in response.history:
        print(resp.status_code, resp.url)
    print("Final destination:")
    print(response.status_code, response.url)
else:
    print("Request was not redirected")

Demo:

>>> import requests
>>> response = requests.get('http://httpbin.org/redirect/3')
>>> response.history
(<Response [302]>, <Response [302]>, <Response [302]>)
>>> for resp in response.history:
...     print(resp.status_code, resp.url)
... 
302 http://httpbin.org/redirect/3
302 http://httpbin.org/redirect/2
302 http://httpbin.org/redirect/1
>>> print(response.status_code, response.url)
200 http://httpbin.org/get

回答 1

这回答的是一个稍有不同的问题,但由于我自己也曾被它困住,希望它对遇到同样情况的人有用。

如果您想使用 allow_redirects=False,直接拿到第一个重定向响应而不是跟随整条重定向链,并且只想从 302 响应对象中直接取出重定向位置,那么 r.url 是不够的。重定向目标在 “Location” 标头里:

r = requests.get('http://github.com/', allow_redirects=False)
r.status_code  # 302
r.url  # http://github.com, not https.
r.headers['Location']  # https://github.com/ -- the redirect destination

This is answering a slightly different question, but since I got stuck on this myself, I hope it might be useful for someone else.

If you want to use allow_redirects=False and get directly to the first redirect object, rather than following a chain of them, and you just want to get the redirect location directly out of the 302 response object, then r.url won’t work. Instead, it’s the “Location” header:

r = requests.get('http://github.com/', allow_redirects=False)
r.status_code  # 302
r.url  # http://github.com, not https.
r.headers['Location']  # https://github.com/ -- the redirect destination
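
If you want the whole chain while keeping allow_redirects=False, a small loop, given here as a sketch rather than as part of the original answer, can follow the Location headers yourself:

import requests

url = 'http://github.com/'
while True:
    r = requests.get(url, allow_redirects=False)
    print(r.status_code, url)
    if r.status_code not in (301, 302, 303, 307, 308):
        break
    # Location may be relative, so resolve it against the current URL.
    url = requests.compat.urljoin(url, r.headers['Location'])

This is roughly what requests does internally when allow_redirects=True; the intermediate responses then end up in response.history.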

回答 2

文档中有这样一段:https://requests.readthedocs.io/zh/master/user/quickstart/#redirection-and-history

import requests

r = requests.get('http://www.github.com')
r.url
#returns https://www.github.com instead of the http page you asked for 

the documentation has this blurb https://requests.readthedocs.io/en/master/user/quickstart/#redirection-and-history

import requests

r = requests.get('http://www.github.com')
r.url
#returns https://www.github.com instead of the http page you asked for 

回答 3

我觉得在处理 URL 重定向时,用 requests.head 代替 requests.get 调用会更安全,参见这里的 GitHub issue:

r = requests.head(url, allow_redirects=True)
print(r.url)

I think requests.head instead of requests.get will be safer to call when handling url redirects; check the github issue here:

r = requests.head(url, allow_redirects=True)
print(r.url)

回答 4

对于python3.5,您可以使用以下代码:

import urllib.request
res = urllib.request.urlopen(starturl)
finalurl = res.geturl()
print(finalurl)

For python3.5, you can use the following code:

import urllib.request
res = urllib.request.urlopen(starturl)
finalurl = res.geturl()
print(finalurl)