问题:在Python 3中将字符串转换为字节的最佳方法?

TypeError的答案中可以看出,有两种不同的方法可以将字符串转换为字节:’str’不支持缓冲区接口

以下哪种方法更好或更Pythonic?还是仅仅是个人喜好问题?

b = bytes(mystring, 'utf-8')

b = mystring.encode('utf-8')

There appear to be two different ways to convert a string to bytes, as seen in the answers to TypeError: ‘str’ does not support the buffer interface

Which of these methods would be better or more Pythonic? Or is it just a matter of personal preference?

b = bytes(mystring, 'utf-8')

b = mystring.encode('utf-8')

回答 0

如果您查看的文档bytes,它将指向bytearray

bytearray([源[,编码[,错误]]])

返回一个新的字节数组。字节数组类型是一个可变的整数序列,范围为0 <= x <256。它具有可变序列类型中介绍的大多数可变序列的常用方法,以及字节类型具有的大多数方法,请参见字节和。字节数组方法。

可选的source参数可以通过几种不同的方式用于初始化数组:

如果是字符串,则还必须提供编码(以及可选的错误)参数;然后,bytearray()使用str.encode()将字符串转换为字节。

如果它是整数,则数组将具有该大小,并将使用空字节初始化。

如果它是符合缓冲区接口的对象,则该对象的只读缓冲区将用于初始化bytes数组。

如果是可迭代的,则它必须是0 <= x <256范围内的整数的可迭代对象,这些整数用作数组的初始内容。

没有参数,将创建大小为0的数组。

所以 bytes除了编码字符串以外,还可以做更多的事情。这是Pythonic的用法,它允许您使用有意义的任何类型的源参数来调用构造函数。

对于编码字符串,我认为它some_string.encode(encoding)比使用构造函数更具Pythonic性,因为它是最易于记录的文档-“使用此字符串并使用此编码对其进行编码”比bytes(some_string, encoding) – -当您使用构造函数。

编辑:我检查了Python源。如果您将unicode字符串传递给bytes使用CPython,它将调用PyUnicode_AsEncodedString,它是encode; 的实现。因此,如果您自称,则只是跳过了间接级别encode

另外,请参见Serdalis的评论- unicode_string.encode(encoding)也是Python 风格的,因为它的反函数为byte_string.decode(encoding),对称性很好。

If you look at the docs for bytes, it points you to bytearray:

bytearray([source[, encoding[, errors]]])

Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256. It has most of the usual methods of mutable sequences, described in Mutable Sequence Types, as well as most methods that the bytes type has, see Bytes and Byte Array Methods.

The optional source parameter can be used to initialize the array in a few different ways:

If it is a string, you must also give the encoding (and optionally, errors) parameters; bytearray() then converts the string to bytes using str.encode().

If it is an integer, the array will have that size and will be initialized with null bytes.

If it is an object conforming to the buffer interface, a read-only buffer of the object will be used to initialize the bytes array.

If it is an iterable, it must be an iterable of integers in the range 0 <= x < 256, which are used as the initial contents of the array.

Without an argument, an array of size 0 is created.

So bytes can do much more than just encode a string. It’s Pythonic that it would allow you to call the constructor with any type of source parameter that makes sense.

For encoding a string, I think that some_string.encode(encoding) is more Pythonic than using the constructor, because it is the most self documenting — “take this string and encode it with this encoding” is clearer than bytes(some_string, encoding) — there is no explicit verb when you use the constructor.

Edit: I checked the Python source. If you pass a unicode string to bytes using CPython, it calls PyUnicode_AsEncodedString, which is the implementation of encode; so you’re just skipping a level of indirection if you call encode yourself.

Also, see Serdalis’ comment — unicode_string.encode(encoding) is also more Pythonic because its inverse is byte_string.decode(encoding) and symmetry is nice.


回答 1

比想像的要容易:

my_str = "hello world"
my_str_as_bytes = str.encode(my_str)
type(my_str_as_bytes) # ensure it is byte representation
my_decoded_str = my_str_as_bytes.decode()
type(my_decoded_str) # ensure it is string representation

It’s easier than it is thought:

my_str = "hello world"
my_str_as_bytes = str.encode(my_str)
type(my_str_as_bytes) # ensure it is byte representation
my_decoded_str = my_str_as_bytes.decode()
type(my_decoded_str) # ensure it is string representation

回答 2

绝对最好的办法既不是2,但第3位。自Python 3.0以来,第一个参数默认默认值。因此,最好的方法是 'utf-8'

b = mystring.encode()

这也将更快,因为默认参数的结果不是"utf-8"C代码中的字符串,而是NULL,它的检查快得多!

以下是一些时间安排:

In [1]: %timeit -r 10 'abc'.encode('utf-8')
The slowest run took 38.07 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 183 ns per loop

In [2]: %timeit -r 10 'abc'.encode()
The slowest run took 27.34 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 137 ns per loop

尽管发出警告,但重复运行后时间仍然非常稳定-偏差仅为〜2%。


encode()不带参数使用不兼容Python 2,因为在Python 2中,默认字符编码为ASCII

>>> 'äöä'.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

The absolutely best way is neither of the 2, but the 3rd. The first parameter to defaults to 'utf-8' ever since Python 3.0. Thus the best way is

b = mystring.encode()

This will also be faster, because the default argument results not in the string "utf-8" in the C code, but NULL, which is much faster to check!

Here be some timings:

In [1]: %timeit -r 10 'abc'.encode('utf-8')
The slowest run took 38.07 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 183 ns per loop

In [2]: %timeit -r 10 'abc'.encode()
The slowest run took 27.34 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 137 ns per loop

Despite the warning the times were very stable after repeated runs – the deviation was just ~2 per cent.


Using encode() without an argument is not Python 2 compatible, as in Python 2 the default character encoding is ASCII.

>>> 'äöä'.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。