Tag Archives: urllib

How do I percent-encode URL parameters in Python?

Question: How do I percent-encode URL parameters in Python?

If I do

url = "http://example.com?p=" + urllib.quote(query)
  1. It doesn’t encode / to %2F (breaks OAuth normalization)
  2. It doesn’t handle Unicode (it throws an exception)

Is there a better library?


Answer 0

Python 2

From the docs:

urllib.quote(string[, safe])

Replace special characters in string using the %xx escape. Letters, digits, and the characters ‘_.-‘ are never quoted. By default, this function is intended for quoting the path section of the URL. The optional safe parameter specifies additional characters that should not be quoted — its default value is ‘/’

That means passing '' (the empty string) for safe will solve your first issue:

>>> urllib.quote('/test')
'/test'
>>> urllib.quote('/test', safe='')
'%2Ftest'

About the second issue, there is a bug report about it here. Apparently it was fixed in Python 3. You can work around it by encoding as UTF-8, like this:

>>> query = urllib.quote(u"Müller".encode('utf8'))
>>> print urllib.unquote(query).decode('utf8')
Müller

By the way, have a look at urlencode.

Python 3

The same, except replace urllib.quote with urllib.parse.quote.
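
Putting both fixes together in Python 3 (a minimal sketch; the sample string is arbitrary):

>>> from urllib.parse import quote
>>> quote('/El Niño/', safe='')
'%2FEl%20Ni%C3%B1o%2F'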


Answer 1

In Python 3, urllib.quote has been moved to urllib.parse.quote and it does handle unicode by default.

>>> from urllib.parse import quote
>>> quote('/test')
'/test'
>>> quote('/test', safe='')
'%2Ftest'
>>> quote('/El Niño/')
'/El%20Ni%C3%B1o/'

Answer 2

My answer is similar to Paolo’s answer.

I think the requests module is much better. It is based on urllib3. You can try this:

>>> from requests.utils import quote
>>> quote('/test')
'/test'
>>> quote('/test', safe='')
'%2Ftest'

Answer 3

If you’re using Django, you can use urlquote:

>>> from django.utils.http import urlquote
>>> urlquote(u"Müller")
u'M%C3%BCller'

Note that changes to Python since this answer was published mean that this is now a legacy wrapper. From the Django 2.1 source code for django.utils.http:

A legacy compatibility wrapper to Python's urllib.parse.quote() function.
(was used for unicode handling on Python 2)
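
Since urlquote is now just a legacy wrapper, the modern equivalent is plain urllib.parse.quote (a Python 3 sketch):

>>> from urllib.parse import quote
>>> quote("Müller")
'M%C3%BCller'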

Answer 4

It is better to use urlencode here. There is not much difference for a single parameter, but IMHO it makes the code clearer. (A function called quote_plus looks confusing, especially to those coming from other languages!)

In [21]: query='lskdfj/sdfkjdf/ksdfj skfj'

In [22]: val=34

In [23]: from urllib.parse import urlencode

In [24]: encoded = urlencode(dict(p=query,val=val))

In [25]: print(f"http://example.com?{encoded}")
http://example.com?p=lskdfj%2Fsdfkjdf%2Fksdfj+skfj&val=34

Docs

urlencode: https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlencode

quote_plus: https://docs.python.org/3/library/urllib.parse.html#urllib.parse.quote_plus


How do I urlencode a query string in Python?

Question: How do I urlencode a query string in Python?

I am trying to urlencode this string before I submit.

queryString = 'eventName=' + evt.fields["eventName"] + '&' + 'eventDescription=' + evt.fields["eventDescription"]; 

Answer 0

You need to pass your parameters into urlencode() as either a mapping (dict), or a sequence of 2-tuples, like:

>>> import urllib
>>> f = { 'eventName' : 'myEvent', 'eventDescription' : 'cool event'}
>>> urllib.urlencode(f)
'eventName=myEvent&eventDescription=cool+event'

Python 3 or above

Use:

>>> urllib.parse.urlencode(f)
'eventName=myEvent&eventDescription=cool+event'

Note that this does not do url encoding in the commonly used sense (look at the output). For that use urllib.parse.quote_plus.
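
For example, quote_plus percent-encodes a single string (turning spaces into +) rather than building key=value pairs (a quick sketch):

>>> from urllib.parse import quote_plus
>>> quote_plus('cool event/one')
'cool+event%2Fone'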


Answer 1

Python 2

What you’re looking for is urllib.quote_plus:

>>> urllib.quote_plus('string_of_characters_like_these:$#@=?%^Q^$')
'string_of_characters_like_these%3A%24%23%40%3D%3F%25%5EQ%5E%24'

Python 3

In Python 3, the urllib package has been broken into smaller components. You’ll use urllib.parse.quote_plus (note the parse child module)

import urllib.parse
urllib.parse.quote_plus(...)

Answer 2

Try requests instead of urllib and you don’t need to bother with urlencode!

import requests
requests.get('http://youraddress.com', params=evt.fields)

EDIT:

If you need ordered name-value pairs or multiple values for a name then set params like so:

params=[('name1','value11'), ('name1','value12'), ('name2','value21'), ...]

instead of using a dictionary.
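
To see the URL requests would build without making a network call, you can prepare the request yourself (a sketch using requests’ PreparedRequest helper):

>>> from requests.models import PreparedRequest
>>> req = PreparedRequest()
>>> req.prepare_url('http://youraddress.com', [('name1', 'value11'), ('name1', 'value12')])
>>> req.url
'http://youraddress.com/?name1=value11&name1=value12'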


Answer 3

Context

  • Python (version 2.7.2 )

Problem

  • You want to generate a urlencoded query string.
  • You have a dictionary or object containing the name-value pairs.
  • You want to be able to control the output ordering of the name-value pairs.

Solution

  • urllib.urlencode
  • urllib.quote_plus

Pitfalls

Example

The following is a complete solution, including how to deal with some pitfalls.

### ********************
## init python (version 2.7.2 )
import urllib

### ********************
## first setup a dictionary of name-value pairs
dict_name_value_pairs = {
  "bravo"   : "True != False",
  "alpha"   : "http://www.example.com",
  "charlie" : "hello world",
  "delta"   : "1234567 !@#$%^&*",
  "echo"    : "user@example.com",
  }

### ********************
## setup an exact ordering for the name-value pairs
ary_ordered_names = []
ary_ordered_names.append('alpha')
ary_ordered_names.append('bravo')
ary_ordered_names.append('charlie')
ary_ordered_names.append('delta')
ary_ordered_names.append('echo')

### ********************
## show the output results
if('NO we DO NOT care about the ordering of name-value pairs'):
  queryString  = urllib.urlencode(dict_name_value_pairs)
  print queryString 
  """
  echo=user%40example.com&bravo=True+%21%3D+False&delta=1234567+%21%40%23%24%25%5E%26%2A&charlie=hello+world&alpha=http%3A%2F%2Fwww.example.com
  """

if('YES we DO care about the ordering of name-value pairs'):
  queryString  = "&".join( [ item+'='+urllib.quote_plus(dict_name_value_pairs[item]) for item in ary_ordered_names ] )
  print queryString
  """
  alpha=http%3A%2F%2Fwww.example.com&bravo=True+%21%3D+False&charlie=hello+world&delta=1234567+%21%40%23%24%25%5E%26%2A&echo=user%40example.com
  """ 

Answer 5

Try this:

urllib.pathname2url(stringToURLEncode)

urlencode won’t work because it only works on dictionaries. quote_plus didn’t produce the correct output.


Answer 6

Note that urllib.urlencode does not always do the trick. The problem is that some services care about the order of arguments, which is lost when you create a dictionary. For such cases, urllib.quote_plus is better, as Ricky suggested.
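
Alternatively, urlencode itself preserves order if you pass it a sequence of 2-tuples instead of a dict (a quick Python 2 sketch):

>>> import urllib
>>> urllib.urlencode([('b', '2'), ('a', '1')])
'b=2&a=1'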


Answer 7

In Python 3, this worked for me:

import urllib

urllib.parse.quote(query)

Answer 8

For future reference (e.g., for Python 3):

>>> import urllib.request as req
>>> query = 'eventName=theEvent&eventDescription=testDesc'
>>> req.pathname2url(query)
'eventName%3DtheEvent%26eventDescription%3DtestDesc'

Answer 9

For use in scripts/programs which need to support both python 2 and 3, the six module provides quote and urlencode functions:

>>> from six.moves.urllib.parse import urlencode, quote
>>> data = {'some': 'query', 'for': 'encoding'}
>>> urlencode(data)
'some=query&for=encoding'
>>> url = '/some/url/with spaces and %;!<>&'
>>> quote(url)
'/some/url/with%20spaces%20and%20%25%3B%21%3C%3E%26'

Answer 10

If urllib.parse.urlencode() is giving you errors, then try the urllib3 module.

The syntax is as follows:

import urllib3
urllib3.request.urlencode({"user" : "john" }) 

Answer 11

Another thing that might not have been mentioned already is that urllib.urlencode() will encode empty values in the dictionary as the string None instead of omitting that parameter. I don’t know if this is typically desired or not, but it did not fit my use case, hence I had to use quote_plus.
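
A quick demonstration of the behavior (Python 3 spelling; in Python 2 the same function lives directly in urllib):

>>> from urllib.parse import urlencode
>>> urlencode({'a': None, 'b': 'x'})
'a=None&b=x'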


Answer 12

For Python 3, urllib3 works properly; you can use it as follows, as per its official docs:

import urllib3

pageSize = 100  # assumed example value; this name was left undefined

http = urllib3.PoolManager()
response = http.request(
     'GET',
     'https://api.prylabs.net/eth/v1alpha1/beacon/attestations',
     fields={  # here fields are the query params
          'epoch': 1234,
          'pageSize': pageSize
      }
 )
data = response.data.decode('UTF-8')

How do I download a file over HTTP using Python?

Question: How do I download a file over HTTP using Python?

I have a small utility that I use to download an MP3 file from a website on a schedule and then builds/updates a podcast XML file which I’ve added to iTunes.

The text processing that creates/updates the XML file is written in Python. However, I use wget inside a Windows .bat file to download the actual MP3 file. I would prefer to have the entire utility written in Python.

I struggled to find a way to actually download the file in Python, thus why I resorted to using wget.

So, how do I download the file using Python?


Answer 0

In Python 2, use urllib2 which comes with the standard library.

import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()

This is the most basic way to use the library, minus any error handling. You can also do more complex stuff such as changing headers. The documentation can be found here.
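
For instance, custom headers go on a Request object (a brief Python 2 sketch; the User-Agent value is arbitrary):

import urllib2

request = urllib2.Request('http://www.example.com/', headers={'User-Agent': 'my-downloader'})
html = urllib2.urlopen(request).read()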


Answer 1

One more, using urlretrieve:

import urllib
urllib.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")

(for Python 3+ use import urllib.request and urllib.request.urlretrieve)

Yet another one, with a “progressbar”

import urllib2

url = "http://download.thinkbroadband.com/10MB.zip"

file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break

    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,

f.close()

Answer 2

In 2012, use the python requests library

>>> import requests
>>> 
>>> url = "http://download.thinkbroadband.com/10MB.zip"
>>> r = requests.get(url)
>>> print len(r.content)
10485760

You can run pip install requests to get it.

Requests has many advantages over the alternatives because the API is much simpler. This is especially true if you have to do authentication. urllib and urllib2 are pretty unintuitive and painful in this case.
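
For example, HTTP Basic authentication with requests is a one-liner (a sketch; the URL and credentials are placeholders):

import requests

r = requests.get('https://example.com/protected', auth=('user', 'pass'))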


2015-12-30

People have expressed admiration for the progress bar. It’s cool, sure. There are several off-the-shelf solutions now, including tqdm:

from tqdm import tqdm
import requests

url = "http://download.thinkbroadband.com/10MB.zip"
response = requests.get(url, stream=True)

with open("10MB", "wb") as handle:
    for data in tqdm(response.iter_content()):
        handle.write(data)

This is essentially the implementation @kvance described 30 months ago.


Answer 3

import urllib2
mp3file = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")
with open('test.mp3','wb') as output:
  output.write(mp3file.read())

The wb in open('test.mp3','wb') opens a file (and erases any existing file) in binary mode so you can save data with it instead of just text.


Answer 4

Python 3

  • urllib.request.urlopen

    import urllib.request
    response = urllib.request.urlopen('http://www.example.com/')
    html = response.read()
    
  • urllib.request.urlretrieve

    import urllib.request
    urllib.request.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
    

    Note: According to the documentation, urllib.request.urlretrieve is a “legacy interface” and “might become deprecated in the future” (thanks gerrit)

Python 2

  • urllib2.urlopen (thanks Corey)

    import urllib2
    response = urllib2.urlopen('http://www.example.com/')
    html = response.read()
    
  • urllib.urlretrieve (thanks PabloG)

    import urllib
    urllib.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
    

Answer 5

Use the wget module:

import wget
wget.download('url')
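
The module also accepts an output path (a sketch; the URL and path are placeholders):

import wget

filename = wget.download('http://example.com/file.zip', out='/tmp/file.zip')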

Answer 6

An improved version of PabloG’s code for Python 2/3:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import ( division, absolute_import, print_function, unicode_literals )

import sys, os, tempfile, logging

if sys.version_info >= (3,):
    import urllib.request as urllib2
    import urllib.parse as urlparse
else:
    import urllib2
    import urlparse

def download_file(url, dest=None):
    """ 
    Download and save a file specified by url to dest directory,
    """
    u = urllib2.urlopen(url)

    scheme, netloc, path, query, fragment = urlparse.urlsplit(url)
    filename = os.path.basename(path)
    if not filename:
        filename = 'downloaded.file'
    if dest:
        filename = os.path.join(dest, filename)

    with open(filename, 'wb') as f:
        meta = u.info()
        meta_func = meta.getheaders if hasattr(meta, 'getheaders') else meta.get_all
        meta_length = meta_func("Content-Length")
        file_size = None
        if meta_length:
            file_size = int(meta_length[0])
        print("Downloading: {0} Bytes: {1}".format(url, file_size))

        file_size_dl = 0
        block_sz = 8192
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break

            file_size_dl += len(buffer)
            f.write(buffer)

            status = "{0:16}".format(file_size_dl)
            if file_size:
                status += "   [{0:6.2f}%]".format(file_size_dl * 100 / file_size)
            status += chr(13)
            print(status, end="")
        print()

    return filename

if __name__ == "__main__":  # Only run if this file is called directly
    print("Testing with 10MB download")
    url = "http://download.thinkbroadband.com/10MB.zip"
    filename = download_file(url)
    print(filename)

Answer 7

A simple yet Python 2 & Python 3 compatible way comes with the six library:

from six.moves import urllib
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")

Answer 8

import requests
def download(url):
    get_response = requests.get(url,stream=True)
    file_name  = url.split("/")[-1]
    with open(file_name, 'wb') as f:
        for chunk in get_response.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)


download("https://example.com/example.jpg")

Answer 9

I wrote the wget library in pure Python just for this purpose. As of version 2.0, it is urlretrieve pumped up with these features.


Answer 10

Following are the most commonly used calls for downloading files in python:

  1. urllib.urlretrieve ('url_to_file', file_name)

  2. urllib2.urlopen('url_to_file')

  3. requests.get(url)

  4. wget.download('url', file_name)

Note: urlopen and urlretrieve have been found to perform relatively badly when downloading large files (size > 500 MB). requests.get stores the whole file in memory until the download is complete.
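
A common workaround for the memory issue (a sketch; URL and filename are placeholders) is to stream the response with requests and write it in chunks:

import requests

url = "http://download.thinkbroadband.com/10MB.zip"
with requests.get(url, stream=True) as r:
    with open("10MB.zip", "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)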


Answer 11

I agree with Corey, urllib2 is more complete than urllib and should likely be the module used if you want to do more complex things, but to make the answers more complete, urllib is a simpler module if you want just the basics:

import urllib
response = urllib.urlopen('http://www.example.com/sound.mp3')
mp3 = response.read()

Will work fine. Or, if you don’t want to deal with the “response” object you can call read() directly:

import urllib
mp3 = urllib.urlopen('http://www.example.com/sound.mp3').read()

Answer 12

In Python 3 you can use the urllib3 and shutil libraries. Install them using pip or pip3 (depending on whether Python 3 is the default):

pip3 install urllib3 shutil

Then run this code

import urllib.request
import shutil

url = "http://www.somewebsite.com/something.pdf"
output_file = "save_this_name.pdf"
with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

Note that you install urllib3 but use urllib.request in the code.


Answer 13

You can get the progress feedback with urlretrieve as well:

import sys
import urllib

def report(blocknr, blocksize, size):
    current = blocknr*blocksize
    sys.stdout.write("\r{0:.2f}%".format(100.0*current/size))

def downloadFile(url):
    print "\n",url
    fname = url.split('/')[-1]
    print fname
    urllib.urlretrieve(url, fname, report)

Answer 14

If you have wget installed, you can use parallel_sync.

pip install parallel_sync

from parallel_sync import wget
urls = ['http://something.png', 'http://somthing.tar.gz', 'http://somthing.zip']
wget.download('/tmp', urls)
# or a single file:
wget.download('/tmp', urls[0], filenames='x.zip', extract=True)

Doc: https://pythonhosted.org/parallel_sync/pages/examples.html

This is pretty powerful. It can download files in parallel, retry upon failure, and it can even download files on a remote machine.


Answer 15

If speed matters to you, I ran a small performance test of the urllib and wget modules; for wget I tried once with a status bar and once without. I used three different 500MB files to test with (different files, to eliminate the chance that there is some caching going on under the hood). Tested on a Debian machine, with Python 2.

First, these are the results (they are similar in different runs):

$ python wget_test.py 
urlretrive_test : starting
urlretrive_test : 6.56
==============
wget_no_bar_test : starting
wget_no_bar_test : 7.20
==============
wget_with_bar_test : starting
100% [......................................................................] 541335552 / 541335552
wget_with_bar_test : 50.49
==============

The way I performed the test is by using the “profile” decorator. This is the full code:

import wget
import urllib
import time
from functools import wraps

def profile(func):
    @wraps(func)
    def inner(*args):
        print func.__name__, ": starting"
        start = time.time()
        ret = func(*args)
        end = time.time()
        print func.__name__, ": {:.2f}".format(end - start)
        return ret
    return inner

url1 = 'http://host.com/500a.iso'
url2 = 'http://host.com/500b.iso'
url3 = 'http://host.com/500c.iso'

def do_nothing(*args):
    pass

@profile
def urlretrive_test(url):
    return urllib.urlretrieve(url)

@profile
def wget_no_bar_test(url):
    return wget.download(url, out='/tmp/', bar=do_nothing)

@profile
def wget_with_bar_test(url):
    return wget.download(url, out='/tmp/')

urlretrive_test(url1)
print '=============='
time.sleep(1)

wget_no_bar_test(url2)
print '=============='
time.sleep(1)

wget_with_bar_test(url3)
print '=============='
time.sleep(1)

urllib seems to be the fastest


Answer 16

Just for the sake of completeness, it is also possible to call any program for retrieving files using the subprocess package. Programs dedicated to retrieving files are more powerful than Python functions like urlretrieve. For example, wget can download directories recursively (-r), can deal with FTP, redirects, and HTTP proxies, can avoid re-downloading existing files (-nc), and aria2 can do multi-connection downloads, which can potentially speed up your downloads.

import subprocess
subprocess.check_output(['wget', '-O', 'example_output_file.html', 'https://example.com'])

In Jupyter Notebook, one can also call programs directly with the ! syntax:

!wget -O example_output_file.html https://example.com

Answer 17

Source code can be:

import urllib
sock = urllib.urlopen("http://diveintopython.org/")
htmlSource = sock.read()                            
sock.close()                                        
print htmlSource  

Answer 18

You can use PycURL on Python 2 and 3.

import pycurl

FILE_DEST = 'pycurl.html'
FILE_SRC = 'http://pycurl.io/'

with open(FILE_DEST, 'wb') as f:
    c = pycurl.Curl()
    c.setopt(c.URL, FILE_SRC)
    c.setopt(c.WRITEDATA, f)
    c.perform()
    c.close()

Answer 19

I wrote the following, which works in vanilla Python 2 or Python 3.


import sys
try:
    import urllib.request
    python3 = True
except ImportError:
    import urllib2
    python3 = False


def progress_callback_simple(downloaded,total):
    sys.stdout.write(
        "\r" +
        (len(str(total))-len(str(downloaded)))*" " + str(downloaded) + "/%d"%total +
        " [%3.2f%%]"%(100.0*float(downloaded)/float(total))
    )
    sys.stdout.flush()

def download(srcurl, dstfilepath, progress_callback=None, block_size=8192):
    def _download_helper(response, out_file, file_size):
        if progress_callback!=None: progress_callback(0,file_size)
        if block_size == None:
            buffer = response.read()
            out_file.write(buffer)

            if progress_callback!=None: progress_callback(file_size,file_size)
        else:
            file_size_dl = 0
            while True:
                buffer = response.read(block_size)
                if not buffer: break

                file_size_dl += len(buffer)
                out_file.write(buffer)

                if progress_callback!=None: progress_callback(file_size_dl,file_size)
    with open(dstfilepath,"wb") as out_file:
        if python3:
            with urllib.request.urlopen(srcurl) as response:
                file_size = int(response.getheader("Content-Length"))
                _download_helper(response,out_file,file_size)
        else:
            response = urllib2.urlopen(srcurl)
            meta = response.info()
            file_size = int(meta.getheaders("Content-Length")[0])
            _download_helper(response,out_file,file_size)

import traceback
try:
    download(
        "https://geometrian.com/data/programming/projects/glLib/glLib%20Reloaded%200.5.9/0.5.9.zip",
        "output.zip",
        progress_callback_simple
    )
except:
    traceback.print_exc()
    input()

Notes:

  • Supports a “progress bar” callback.
  • Download is a 4 MB test .zip from my website.

Answer 20

This may be a little late, but I saw PabloG’s code and couldn’t help adding an os.system('cls') to make it look AWESOME! Check it out:

    import urllib2,os

    url = "http://download.thinkbroadband.com/10MB.zip"

    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)
    os.system('cls')
    file_size_dl = 0
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break

        file_size_dl += len(buffer)
        f.write(buffer)
        status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        status = status + chr(8)*(len(status)+1)
        print status,

    f.close()

If running in an environment other than Windows, you will have to use something other than ‘cls’. On Mac OS X and Linux it should be ‘clear’.


Answer 21

urlretrieve and requests.get are simple, but reality often is not. I have fetched data from a couple of sites, including text and images, and the two above probably solve most tasks. But for a more universal solution I suggest using urlopen. As it is included in the Python 3 standard library, your code can run on any machine that runs Python 3 without pre-installing site packages.

import urllib.request

url = 'http://www.example.com/sound.mp3'  # assumed example values: these names
filename = 'sound.mp3'                    # were left undefined in the snippet
buffer_size = 8192
headers = {'User-Agent': 'Mozilla/5.0'}   # e.g. to avoid HTTP 403 from picky servers

url_request = urllib.request.Request(url, headers=headers)
url_connect = urllib.request.urlopen(url_request)

#remember to open file in bytes mode
with open(filename, 'wb') as f:
    while True:
        buffer = url_connect.read(buffer_size)
        if not buffer: break

        #an integer value of size of written data
        data_wrote = f.write(buffer)

#you could probably use with-open-as manner
url_connect.close()

This answer provides a solution to HTTP 403 Forbidden when downloading a file over HTTP using Python. I have tried only the requests and urllib modules; other modules may provide something better, but this is the one I used to solve most of the problems.


Answer 22

Late answer, but for python>=3.6 you can use:

import dload
dload.save(url)

Install dload with:

pip3 install dload

Answer 23

I wanted to download all the files from a webpage. I tried wget but it was failing, so I decided on the Python route, and I found this thread.

After reading it, I have made a little command line application, soupget, expanding on the excellent answers of PabloG and Stan and adding some useful options.

It uses BeautifulSoup to collect all the URLs of the page and then downloads the ones with the desired extension(s). Finally, it can download multiple files in parallel.

Here it is:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from __future__ import (division, absolute_import, print_function, unicode_literals)
import sys, os, argparse
from bs4 import BeautifulSoup

# --- insert Stan's script here ---
# if sys.version_info >= (3,): 
#...
#...
# def download_file(url, dest=None): 
#...
#...

# --- new stuff ---
def collect_all_url(page_url, extensions):
    """
    Recovers all links in page_url checking for all the desired extensions
    """
    conn = urllib2.urlopen(page_url)
    html = conn.read()
    soup = BeautifulSoup(html, 'lxml')
    links = soup.find_all('a')

    results = []    
    for tag in links:
        link = tag.get('href', None)
        if link is not None: 
            for e in extensions:
                if e in link:
                    # Fallback for badly defined links
                    # checks for missing scheme or netloc
                    if bool(urlparse.urlparse(link).scheme) and bool(urlparse.urlparse(link).netloc):
                        results.append(link)
                    else:
                        new_url=urlparse.urljoin(page_url,link)                        
                        results.append(new_url)
    return results

if __name__ == "__main__":  # Only run if this file is called directly
    # Command line arguments
    parser = argparse.ArgumentParser(
        description='Download all files from a webpage.')
    parser.add_argument(
        '-u', '--url', 
        help='Page url to request')
    parser.add_argument(
        '-e', '--ext', 
        nargs='+',
        help='Extension(s) to find')    
    parser.add_argument(
        '-d', '--dest', 
        default=None,
        help='Destination where to save the files')
    parser.add_argument(
        '-p', '--par', 
        action='store_true', default=False, 
        help="Turns on parallel download")
    args = parser.parse_args()

    # Recover files to download
    all_links = collect_all_url(args.url, args.ext)

    # Download
    if not args.par:
        for l in all_links:
            try:
                filename = download_file(l, args.dest)
                print(l)
            except Exception as e:
                print("Error while downloading: {}".format(e))
    else:
        from multiprocessing.pool import ThreadPool
        results = ThreadPool(10).imap_unordered(
            lambda x: download_file(x, args.dest), all_links)
        for p in results:
            print(p)

An example of its usage is:

python3 soupget.py -p -e <list of extensions> -d <destination_folder> -u <target_webpage>

And an actual example if you want to see it in action:

python3 soupget.py -p -e .xlsx .pdf .csv -u https://healthdata.gov/dataset/chemicals-cosmetics

What are the differences between the urllib, urllib2, urllib3 and requests modules?

Question: What are the differences between the urllib, urllib2, urllib3 and requests modules?

In Python, what are the differences between the urllib, urllib2, urllib3 and requests modules? Why are there four? They seem to do the same thing…


Answer 0

I know it’s been said already, but I’d highly recommend the requests Python package.

If you’ve used languages other than Python, you’re probably thinking urllib and urllib2 are easy to use, not much code, and highly capable; that’s how I used to think. But the requests package is so unbelievably useful and short that everyone should be using it.

First, it supports a fully RESTful API, and is as easy as:

import requests

resp = requests.get('http://www.mywebsite.com/user')
resp = requests.post('http://www.mywebsite.com/user')
resp = requests.put('http://www.mywebsite.com/user/put')
resp = requests.delete('http://www.mywebsite.com/user/delete')

Regardless of whether it’s GET or POST, you never have to encode parameters again; it simply takes a dictionary as an argument and is good to go:

userdata = {"firstname": "John", "lastname": "Doe", "password": "jdoe123"}
resp = requests.post('http://www.mywebsite.com/user', data=userdata)

Plus it even has a built-in JSON decoder (again, I know json.loads() isn’t a lot more to write, but this sure is convenient):

resp.json()

Or if your response data is just text, use:

resp.text

This is just the tip of the iceberg. This is the list of features from the requests site:

  • International Domains and URLs
  • Keep-Alive & Connection Pooling
  • Sessions with Cookie Persistence
  • Browser-style SSL Verification
  • Basic/Digest Authentication
  • Elegant Key/Value Cookies
  • Automatic Decompression
  • Unicode Response Bodies
  • Multipart File Uploads
  • Connection Timeouts
  • .netrc support
  • Python 2.6—3.4
  • Thread-safe.

Answer 1

urllib2 provides some extra functionality, namely the urlopen() function can allow you to specify headers (normally you’d have had to use httplib in the past, which is far more verbose.) More importantly though, urllib2 provides the Request class, which allows for a more declarative approach to doing a request:

r = Request(url='http://www.mysite.com')
r.add_header('User-Agent', 'awesome fetcher')
r.add_data(urllib.urlencode({'foo': 'bar'}))
response = urlopen(r)

Note that urlencode() is only in urllib, not urllib2.

There are also handlers for implementing more advanced URL support in urllib2. The short answer is, unless you’re working with legacy code, you probably want to use the URL opener from urllib2, but you still need to import urllib for some of the utility functions.
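
For example, the handler mechanism lets you build an opener with cookie handling (a brief Python 2 sketch; the URL is a placeholder):

import cookielib, urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
response = opener.open('http://www.mysite.com')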

Bonus answer: With Google App Engine, you can use any of httplib, urllib or urllib2, but all of them are just wrappers for Google’s URL Fetch API. That is, you are still subject to the same limitations such as ports, protocols, and the length of the response allowed. You can use the core of the libraries as you would expect for retrieving HTTP URLs, though.


Answer 2

urllib and urllib2 are both Python modules that do URL request related stuff but offer different functionalities.

1) urllib2 can accept a Request object to set the headers for a URL request, urllib accepts only a URL.

2) urllib provides the urlencode method which is used for the generation of GET query strings, urllib2 doesn’t have such a function. This is one of the reasons why urllib is often used along with urllib2.

Requests – Requests is a simple, easy-to-use HTTP library written in Python.

1) Python Requests encodes the parameters automatically, so you just pass them as simple arguments, unlike urllib, where you need to use the urllib.urlencode() method to encode the parameters before passing them.

2) It automatically decodes the response into Unicode.

3) Requests also has far more convenient error handling. If your authentication fails, urllib2 raises a urllib2.URLError, while Requests returns a normal response object, as expected. All you have to do to see whether the request was successful is check the boolean response.ok.
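
For instance (a sketch; the URL is a placeholder):

import requests

resp = requests.get('http://www.mywebsite.com/user')
if resp.ok:                      # True for status codes below 400
    print(resp.text)
else:
    print(resp.status_code)      # or call resp.raise_for_status() to get an exception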


Answer 3

One considerable difference is about porting Python 2 to Python 3. urllib2 does not exist for Python 3, and its methods were ported to urllib. So if you are using it heavily and want to migrate to Python 3 in the future, consider using urllib. However, the 2to3 tool will automatically do most of the work for you.
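
Roughly, the rewrite 2to3 performs looks like this (a sketch of the fixer’s effect; the exact import list it emits may differ):

# Python 2
import urllib2
html = urllib2.urlopen('http://www.example.com/').read()

# After running 2to3 (Python 3)
import urllib.request, urllib.error
html = urllib.request.urlopen('http://www.example.com/').read()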


Answer 4

Just to add to the existing answers, I don’t see anyone mentioning that Python requests is not a native library. If you are OK with adding dependencies, then requests is fine. However, if you are trying to avoid adding dependencies, urllib is a native Python library that is already available to you.


Answer 5

I like the urllib.urlencode function, and it doesn’t appear to exist in urllib2.

>>> urllib.urlencode({'abc':'d f', 'def': '-!2'})
'abc=d+f&def=-%212'

Answer 6

To get the content of a URL:

try: # Try importing requests first.
    import requests
except ImportError:
    try: # Try importing Python3 urllib
        import urllib.request
    except ImportError: # Fall back to Python2 urllib
        import urllib


def get_content(url):
    try:  # Using requests.
        return requests.get(url).content # requests.get returns requests.models.Response.
    except NameError:
        try: # Using Python3 urllib.
            with urllib.request.urlopen(url) as response:
                return response.read() # urlopen returns http.client.HTTPResponse.
        except AttributeError: # Using Python2 urllib.
            return urllib.urlopen(url).read() # urlopen returns an instance.

It’s hard to write code that handles the responses across Python 2, Python 3, and an optional requests dependency, because the urlopen() functions and the requests.get() function return different types:

  • Python3 urllib.request.urlopen() returns a http.client.HTTPResponse
  • Python2 urllib.urlopen(url) returns an instance
  • requests.get(url) returns a requests.models.Response

Answer 7

You should generally use urllib2, since this makes things a bit easier at times by accepting Request objects, and it will also raise a URLError on protocol errors. With Google App Engine though, you can’t use either. You have to use the URL Fetch API that Google provides in its sandboxed Python environment.


Answer 8

A key point that I find missing in the above answers is that urllib returns an object of type <class 'http.client.HTTPResponse'>, whereas requests returns <class 'requests.models.Response'>.

Due to this, the read() method can be used with urllib but not with requests.

P.S.: requests is already rich with so many methods that it hardly needs one more like read() ;>