Python: Ignore 'Incorrect padding' error when base64 decoding

Question: Python: Ignore 'Incorrect padding' error when base64 decoding

I have some data that is base64 encoded that I want to convert back to binary even if there is a padding error in it. If I use

base64.decodestring(b64_string)

it raises an ‘Incorrect padding’ error. Is there another way?

UPDATE: Thanks for all the feedback. To be honest, all the methods mentioned sounded a bit hit and miss so I decided to try openssl. The following command worked a treat:

openssl enc -d -base64 -in b64string -out binary_data

Answer 0

As said in other responses, there are various ways in which base64 data could be corrupted.

However, as Wikipedia says, removing the padding (the ‘=’ characters at the end of base64 encoded data) is “lossless”:

From a theoretical point of view, the padding character is not needed, since the number of missing bytes can be calculated from the number of Base64 digits.

So if this is really the only thing “wrong” with your base64 data, the padding can just be added back. I came up with this to be able to parse “data” URLs in WeasyPrint, some of which were base64 without padding:

import base64
import re

def decode_base64(data, altchars=b'+/'):
    """Decode base64, padding being optional.

    :param data: Base64 data as an ASCII byte string
    :returns: The decoded byte string.

    """
    # normalize: drop everything outside the alphabet, including any '='
    data = re.sub(rb'[^a-zA-Z0-9%s]+' % re.escape(altchars), b'', data)
    missing_padding = len(data) % 4
    if missing_padding:
        data += b'=' * (4 - missing_padding)
    return base64.b64decode(data, altchars)

Tests for this function: weasyprint/tests/test_css.py#L68
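
To illustrate the Wikipedia quote above, here is a minimal sketch (not part of the original answer) showing that the stripped padding can be recomputed from the number of base64 digits alone. The helper name unpadded_roundtrip is made up for this example.

```python
import base64

def unpadded_roundtrip(raw: bytes) -> bytes:
    """Hypothetical helper: encode, strip the '=' padding, then restore it and decode."""
    stripped = base64.b64encode(raw).rstrip(b"=")
    # Every 4 digits carry 3 bytes; a tail of 2 or 3 digits carries 1 or 2 bytes.
    expected_len = len(stripped) * 3 // 4
    restored = base64.b64decode(stripped + b"=" * (-len(stripped) % 4))
    assert len(restored) == expected_len
    return restored

for sample in (b"", b"a", b"ab", b"abc", b"abcd"):
    assert unpadded_roundtrip(sample) == sample
```

This only holds when the missing padding is the sole problem; characters lost from the middle of the data cannot be recovered this way.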


Answer 1

Just add padding as required. Heed Michael’s warning, however.

b64_string += "=" * ((4 - len(b64_string) % 4) % 4) #ugh

Answer 2

It seems you just need to add padding to your bytes before decoding. There are many other answers on this question, but I want to point out that (at least in Python 3.x) base64.b64decode will truncate any extra padding, provided there is enough in the first place.

So, something like: b'abc=' works just as well as b'abc==' (as does b'abc=====').

What this means is that you can just add the maximum number of padding characters that you would ever need—which is three (b'===')—and base64 will truncate any unnecessary ones.

This lets you write:

base64.b64decode(s + b'===')

which is simpler than:

base64.b64decode(s + b'=' * (-len(s) % 4))
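
A short sketch of this behaviour, assuming the lenient default of base64.b64decode (validate=False) in CPython 3; with validate=True the surplus padding should be rejected instead. The helper name decode_overpadded is hypothetical.

```python
import base64
import binascii

def decode_overpadded(s: bytes) -> bytes:
    """Hypothetical helper: append the maximum padding and let the decoder drop the excess."""
    return base64.b64decode(s + b"===")

assert decode_overpadded(b"YQ") == b"a"       # needed two '='
assert decode_overpadded(b"YWI") == b"ab"     # needed one '='
assert decode_overpadded(b"YWJj") == b"abc"   # needed none

# A length of 1 modulo 4 is never valid base64, so no amount of padding helps.
try:
    decode_overpadded(b"YWJjZ")
except binascii.Error as exc:
    print("still invalid:", exc)
```

The trade-off is that this leans on lenient decoder behaviour rather than on the arithmetic, so the explicit -len(s) % 4 form may be the safer choice if strict validation is ever turned on.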

Answer 3

“Incorrect padding” can mean not only “missing padding” but also (believe it or not) “incorrect padding”.

If suggested “adding padding” methods don’t work, try removing some trailing bytes:

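import base64
import binascii

lens = len(strg)
# Drop the trailing partial group, or the last full group of 4 if there is none.
lenx = lens - (lens % 4 if lens % 4 else 4)
try:
    result = base64.b64decode(strg[:lenx])
except binascii.Error:
    result = None  # still undecodable: the damage is not only at the end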

Update: Any fiddling around adding padding or removing possibly bad bytes from the end should be done AFTER removing any whitespace, otherwise length calculations will be upset.

It would be a good idea if you showed us a (short) sample of the data that you need to recover. Edit your question and copy/paste the result of print(repr(sample)).

Update 2: It is possible that the encoding has been done in a URL-safe manner. If this is the case, you will be able to see minus and underscore characters in your data, and you should be able to decode it by using base64.b64decode(strg, '-_').

If you can’t see minus and underscore characters in your data, but can see plus and slash characters, then you have some other problem, and may need the add-padding or remove-cruft tricks.

If you can see none of minus, underscore, plus and slash in your data, then you need to determine the two alternate characters; they’ll be the ones that aren’t in [A-Za-z0-9]. Then you’ll need to experiment to see which order they need to be used in the 2nd arg of base64.b64decode()

Update 3: If your data is “company confidential”:
(a) you should say so up front
(b) we can explore other avenues in understanding the problem, which is highly likely to be related to what characters are used instead of + and / in the encoding alphabet, or by other formatting or extraneous characters.

One such avenue would be to examine what non-“standard” characters are in your data, e.g.

from collections import defaultdict
import string

d = defaultdict(int)
s = set(string.ascii_letters + string.digits)
for c in your_data:  # your_data is assumed to be a str
    if c not in s:
        d[c] += 1
print(d)
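
The decision process from Update 2 could be sketched roughly as follows. This is only an illustration: guess_altchars and decode_guessing_alphabet are hypothetical names, and the sketch assumes the data is a str whose only problems are the alphabet, whitespace, and missing padding.

```python
import base64
import string
from collections import Counter

ALNUM = set(string.ascii_letters + string.digits)

def guess_altchars(text: str):
    """Hypothetical helper: guess the two non-alphanumeric alphabet characters."""
    extras = Counter(
        c for c in text if c not in ALNUM and c != "=" and not c.isspace()
    )
    if set(extras) <= set("+/"):
        return "+/"   # standard alphabet (or no special characters at all)
    if set(extras) <= set("-_"):
        return "-_"   # URL-safe alphabet
    return None       # unfamiliar characters: inspect the data by hand

def decode_guessing_alphabet(text: str) -> bytes:
    altchars = guess_altchars(text)
    if altchars is None:
        raise ValueError("cannot tell which alphabet was used")
    cleaned = "".join(text.split())          # remove whitespace first
    cleaned += "=" * (-len(cleaned) % 4)     # then restore the padding
    return base64.b64decode(cleaned, altchars)

print(decode_guessing_alphabet("-_8"))   # b'\xfb\xff'
```

Whitespace is removed before the length calculation, as the first update recommends.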

Answer 4

Use

string += '=' * (-len(string) % 4)  # restore stripped '='s

Credit goes to a comment somewhere here.

>>> import base64
>>> enc = base64.b64encode(b'1')
>>> enc
b'MQ=='
>>> base64.b64decode(enc)
b'1'
>>> enc = enc.rstrip(b'=')
>>> enc
b'MQ'
>>> base64.b64decode(enc)
...
binascii.Error: Incorrect padding
>>> base64.b64decode(enc + b'=' * (-len(enc) % 4))
b'1'
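
Why -len(s) % 4 gives the right count: Python's % always returns a non-negative result for a positive modulus, so -n % 4 is the distance from n up to the next multiple of 4. A minimal check, with pad as a made-up helper name:

```python
import base64

def pad(s: str) -> str:
    """Hypothetical helper: restore stripped '=' padding."""
    return s + "=" * (-len(s) % 4)

# -n % 4 matches the longer spelling (4 - n % 4) % 4 for every length.
for n in range(8):
    assert -n % 4 == (4 - n % 4) % 4

print([-n % 4 for n in range(4)])     # [0, 3, 2, 1]
print(base64.b64decode(pad("MQ")))    # b'1'
```

Note that a remainder of 1 yields three '=' characters, which still does not make the string decodable; that case signals missing data rather than missing padding.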

Answer 5

If there’s a padding error it probably means your string is corrupted; the length of a base64-encoded string should be a multiple of four. You can try adding the padding character (=) yourself to make the length a multiple of four, but it should already have that unless something is wrong.


Answer 6

Check the documentation of the data source you’re trying to decode. Is it possible that you meant to use base64.urlsafe_b64decode(s) instead of base64.b64decode(s)? That’s one reason you might have seen this error message.

Decode string s using a URL-safe alphabet, which substitutes - instead of + and _ instead of / in the standard Base64 alphabet.

This is for example the case for various Google APIs, like Google’s Identity Toolkit and Gmail payloads.
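
A small sketch of the difference between the two alphabets. The helper urlsafe_decode_unpadded is hypothetical and assumes the only other problem is stripped padding, which is common for URL-safe tokens.

```python
import base64

raw = b"\xfb\xff"
standard = base64.b64encode(raw)          # b'+/8='
urlsafe = base64.urlsafe_b64encode(raw)   # b'-_8='

# The URL-safe decoder understands '-' and '_' ...
assert base64.urlsafe_b64decode(urlsafe) == raw
# ... which is equivalent to passing altchars to b64decode.
assert base64.b64decode(urlsafe, altchars=b"-_") == raw

def urlsafe_decode_unpadded(s: bytes) -> bytes:
    """Hypothetical helper: URL-safe decode with the padding restored first."""
    return base64.urlsafe_b64decode(s + b"=" * (-len(s) % 4))

assert urlsafe_decode_unpadded(b"-_8") == raw
```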


Answer 7

Adding the padding is rather… fiddly. Here’s the function I wrote with the help of the comments in this thread as well as the wiki page for base64 (it’s surprisingly helpful) https://en.wikipedia.org/wiki/Base64#Padding.

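import base64
import binascii
import logging

def base64_decode(s):
    """Add missing padding to string and return the decoded base64 bytes."""
    log = logging.getLogger()
    if isinstance(s, bytes):
        s = s.decode('ascii')  # str(b'...') would keep the b'' wrapper
    s = str(s).strip()
    try:
        return base64.b64decode(s)
    except binascii.Error:  # Python 3 raises binascii.Error, not TypeError
        padding = len(s) % 4
        if padding == 1:
            # A remainder of 1 can never be fixed by padding alone.
            log.error("Invalid base64 string: {}".format(s))
            return b''
        elif padding == 2:
            s += '=='
        elif padding == 3:
            s += '='
        return base64.b64decode(s)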

Answer 8

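If you are trying to decode a web image, you can simply use base64.urlsafe_b64decode(data). It handles the URL-safe alphabet (- and _ in place of + and /). Note that it does not add missing padding, so any stripped = characters still have to be restored first.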


Answer 9

There are two ways to correct the input data described here or, more specifically and in line with the OP, to make the b64decode method of Python’s base64 module process the input data into something without raising an uncaught exception:

  1. Append == to the end of the input data and call base64.b64decode(…)
  2. If that raises an exception, then

    i. Catch it via try/except,

    ii. Strip (or right-strip) any = characters from the input data (N.B. this may not be necessary),

    iii. Append A== to the input data (A== through P== will work),

    iv. Call base64.b64decode(…) with those A==-appended input data

Either Item 1 or Item 2 above will then produce a decoded result.
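
The two items above can be sketched as follows. This is an illustrative reading of the steps, not the code from the linked repository, and the function name decode_despite_missing_chars is made up.

```python
import base64
import binascii

def decode_despite_missing_chars(data: bytes) -> bytes:
    """Hypothetical helper following the two items listed above."""
    try:
        # Item 1: two '=' are enough whenever len(data) % 4 is 0, 2 or 3.
        return base64.b64decode(data + b"==")
    except binascii.Error:
        # Item 2: a dangling single character cannot be decoded on its own,
        # so complete it with 'A' (six zero bits) before padding.
        return base64.b64decode(data.rstrip(b"=") + b"A==")

print(decode_despite_missing_chars(b"YWJj"))    # b'abc'
print(decode_despite_missing_chars(b"YWJjZ"))   # b'abcd' (last byte partly guessed)
```

In the second call the final byte is built from the six bits of Z plus two invented zero bits, so it may not match what was originally encoded.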

Caveats

This does not guarantee that the decoded result will be what was originally encoded, but it will (sometimes?) give the OP enough to work with:

“Even with corruption I want to get back to the binary because I can still get some useful info from the ASN.1 stream.”

See What we know and Assumptions below.

TL;DR

From some quick tests of base64.b64decode(…):

  1. It appears to ignore non-[A-Za-z0-9+/] characters. That includes =s, unless they are the last character(s) in a parsed group of four, in which case the =s terminate the decoding (a=b=c=d= gives the same result as abc=, and a==b==c== gives the same result as ab==).

  2. It also appears that everything after the point where base64.b64decode(…) terminates decoding (e.g. at an = that is the fourth character in a group) is ignored.

As noted in several comments above, the input data needs zero, one, or two =s of padding at the end when the [number of parsed characters to that point modulo 4] value is 0, 3, or 2, respectively. So, from the two observations above, appending two or more =s to the input data will correct any [Incorrect padding] problems in those cases.

HOWEVER, decoding cannot handle the case where the [total number of parsed characters modulo 4] is 1, because it takes at least two encoded characters to represent the first decoded byte in a group of three decoded bytes. In uncorrupted encoded input data, this [N modulo 4]=1 case never happens, but as the OP stated that characters may be missing, it could happen here. That is why simply appending =s will not always work, and why appending A== will work when appending == does not. N.B. The choice of [A] is all but arbitrary: it adds only cleared (zero) bits to the decoded data, which may or may not be correct, but the goal here is not correctness but getting base64.b64decode(…) to complete without exceptions.

What we know from the OP and especially subsequent comments is

  • It is suspected that there are missing data (characters) in the Base64-encoded input data
  • The Base64 encoding uses the standard 64 place-values plus padding: A-Z; a-z; 0-9; +; /; = is padding. This is confirmed, or at least suggested, by the fact that openssl enc ... works.

Assumptions

  • The input data contain only 7-bit ASCII data
  • The only kind of corruption is missing encoded input data
  • The OP does not care about decoded output data at any point after that corresponding to any missing encoded input data

Github

Here is a wrapper to implement this solution:

https://github.com/drbitboy/missing_b64


Answer 10

The incorrect padding error can occur because metadata is sometimes present in the encoded string. If your string looks something like ‘data:image/png;base64,…base 64 stuff….’, then you need to remove the first part before decoding it.

For example, if you have a base64-encoded image string, try the snippet below.

from PIL import Image
from io import BytesIO
from base64 import b64decode
imagestr = 'data:image/png;base64,...base 64 stuff....'
im = Image.open(BytesIO(b64decode(imagestr.split(',')[1])))
im.save("image.png")
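
For data that is not an image, the same idea works with the standard library alone. A minimal sketch, assuming the input is a "data:" URL whose header ends in ";base64"; the helper decode_data_url is a made-up name.

```python
import base64

def decode_data_url(url: str) -> bytes:
    """Hypothetical helper: decode the payload of a base64 'data:' URL."""
    header, sep, payload = url.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    payload += "=" * (-len(payload) % 4)   # tolerate stripped padding
    return base64.b64decode(payload)

print(decode_data_url("data:text/plain;base64,aGVsbG8="))   # b'hello'
```

Checking the header explicitly avoids silently decoding the "data:image/png;base64" prefix as if it were payload.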

Answer 11

Simply append “=” characters until the length is a multiple of 4 before you try decoding the target string value. Something like:

# pad with "=" until the length is a multiple of 4
while len(value) % 4 != 0:
    value = value + "="
req_str = base64.b64decode(value)

Answer 12

In case this error came from a web server: try URL-encoding your POST value. I was POSTing via “curl” and discovered that I wasn’t URL-encoding my base64 value, so characters like “+” were not escaped, and the web server’s URL-decode logic converted each + to a space.

“+” is a valid base64 character and perhaps the only character that gets mangled by an unexpected URL-decode.
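
The effect can be reproduced locally with urllib.parse. This sketch assumes the server applies form-style decoding (where + means space), which is what unquote_plus models.

```python
import base64
import binascii
from urllib.parse import quote, unquote_plus

b64_value = base64.b64encode(b"\xfb\xff").decode("ascii")   # '+/8='

# Sent unescaped, a form decoder turns '+' into a space ...
mangled = unquote_plus(b64_value)                            # ' /8='
try:
    base64.b64decode(mangled)
except binascii.Error as exc:
    print("server side fails:", exc)

# ... while percent-encoding the value first survives the round trip.
escaped = quote(b64_value, safe="")                          # '%2B%2F8%3D'
assert unquote_plus(escaped) == b64_value
```

With curl, the --data-urlencode option performs this escaping for you.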


Answer 13

In my case I faced that error while parsing an email. I got the attachment as a base64 string and extracted it via re.search. It turned out there was a strange additional substring at the end.

dHJhaWxlcgo8PCAvU2l6ZSAxNSAvUm9vdCAxIDAgUiAvSW5mbyAyIDAgUgovSUQgWyhcMDAyXDMz
MHtPcFwyNTZbezU/VzheXDM0MXFcMzExKShcMDAyXDMzMHtPcFwyNTZbezU/VzheXDM0MXFcMzEx
KV0KPj4Kc3RhcnR4cmVmCjY3MDEKJSVFT0YK

--_=ic0008m4wtZ4TqBFd+sXC8--

When I deleted --_=ic0008m4wtZ4TqBFd+sXC8-- and stripped the string, the parsing was fixed.

So my advice is to make sure that you are decoding a correct base64 string.


Answer 14

You should use

base64.b64decode(b64_string, ' /')

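By default, the altchars are '+/'. Passing ' /' makes the decoder treat a space as '+', which helps when the '+' characters in your data were turned into spaces.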


Answer 15

I ran into this problem as well and nothing worked. I finally managed to find the solution which works for me. I had zipped content in base64 and this happened to 1 out of a million records…

This is a version of the solution suggested by Simon Sapin.

If len(data) % 4 is 3, I remove the last 3 characters.

Instead of “0gA1RD5L/9AUGtH9MzAwAAA==”

We get “0gA1RD5L/9AUGtH9MzAwAA”

missing_padding = len(data) % 4
if missing_padding == 3:
    data = data[0:-3]
elif missing_padding != 0:
    print("Missing padding : " + str(missing_padding))
    data += '=' * (4 - missing_padding)
data_decoded = base64.b64decode(data)

According to the answer “Trailing As in base64”, the reason is nulls. But I still have no idea why the encoder messes this up…