Question: What is the difference between a string and a byte string?
I am working with a library which returns a byte string and I need to convert this to a string.
Although I’m not sure what the difference is – if any.
Answer 0
Assuming Python 3 (in Python 2, this difference is a little less well-defined): a string is a sequence of characters, i.e. Unicode code points; these are an abstract concept and can’t be directly stored on disk. A byte string is a sequence of, unsurprisingly, bytes: things that can be stored on disk. The mapping between them is an encoding. There are quite a lot of these (and infinitely many are possible), and you need to know which one applies in a particular case in order to do the conversion, since a different encoding may map the same bytes to a different string:
>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-16')
'蓏콯캁澽苏'
>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-8')
'τoρνoς'
Once you know which one to use, you can use the .decode() method of the byte string to get the right character string from it, as above. For completeness, the .encode() method of a character string goes the opposite way:
>>> 'τoρνoς'.encode('utf-8')
b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'
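To make the "you need to know which encoding applies" point concrete, here is a minimal sketch of what happens when the guess is wrong. The variable names are only illustrative; the calls are the standard bytes.decode() with its errors parameter.

```python
# The same bytes as in the examples above.
data = b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'

# Decoding with the encoding that was actually used recovers the text.
text = data.decode('utf-8')

# Latin-1 maps every possible byte to some character, so decoding never
# fails, but the result is mojibake rather than the original text.
wrong = data.decode('latin-1')

# ASCII has no meaning for bytes >= 0x80, so a strict decode raises.
try:
    data.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# errors='replace' substitutes U+FFFD for every undecodable byte
# instead of raising; the original information is lost.
lossy = data.decode('ascii', errors='replace')

print(text, repr(wrong), ascii_failed, repr(lossy))
```

A wrong encoding can therefore fail loudly (ASCII) or silently produce garbage (Latin-1, UTF-16), which is why the encoding has to be known rather than guessed.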
Answer 1
The only thing that a computer can store is bytes.
To store anything in a computer, you must first encode it, i.e. convert it to bytes. For example:
- If you want to store music, you must first encode it using MP3, WAV, etc.
- If you want to store a picture, you must first encode it using PNG, JPEG, etc.
- If you want to store text, you must first encode it using ASCII, UTF-8, etc.
MP3, WAV, PNG, JPEG, ASCII and UTF-8 are examples of encodings. An encoding is a format to represent audio, images, text, etc. in bytes.
In Python, a byte string is just that: a sequence of bytes. It isn’t human-readable. Under the hood, everything must be converted to a byte string before it can be stored in a computer.
On the other hand, a character string, often just called a “string”, is a sequence of characters. It is human-readable. A character string can’t be directly stored in a computer; it has to be encoded first (converted into a byte string). There are multiple encodings through which a character string can be converted into a byte string, such as ASCII and UTF-8.
'I am a string'.encode('ASCII')
The above Python code will encode the string 'I am a string' using the ASCII encoding. The result will be a byte string. If you print it, Python will represent it as b'I am a string'. Remember, however, that byte strings aren’t human-readable; it’s just that Python displays bytes in the printable ASCII range as the corresponding characters when you print them (other bytes appear as \x escapes). In Python, a byte string is represented by a b, followed by the byte string’s ASCII representation.
A byte string can be decoded back into a character string, if you know the encoding that was used to encode it.
b'I am a string'.decode('ASCII')
The above code will return the original string 'I am a string'.
Encoding and decoding are inverse operations. Everything must be encoded before it can be written to disk, and it must be decoded before it can be read by a human.
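As a rough illustration of the round trip described above, the sketch below encodes and decodes the same string, then shows what happens with a character that ASCII cannot represent. The sample string 'caf\u00e9' is my own addition, not part of the original answer.

```python
text = 'I am a string'

# str -> bytes
encoded = text.encode('ascii')
print(repr(encoded))      # b'I am a string'
print(list(encoded[:4]))  # the underlying byte values: [73, 32, 97, 109]

# bytes -> str, using the same encoding
round_trip = encoded.decode('ascii')

# ASCII only covers 128 characters, so other text needs another encoding.
accented = 'caf\u00e9'  # "café"
try:
    accented.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 handles it; the non-ASCII character becomes two bytes,
# which repr() shows as \x escapes.
utf8 = accented.encode('utf-8')
print(repr(utf8))         # b'caf\xc3\xa9'
```

Indexing or iterating over a bytes object yields integers, which is a quick way to see that the b'...' display is only a convenience.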
Answer 2
Note: I will focus mostly on Python 3, since Python 2 is very close to its end of life.
In Python 3
bytes consists of sequences of 8-bit unsigned values, while str consists of sequences of Unicode code points that represent textual characters from human languages.
>>> # bytes
>>> b = b'h\x65llo'
>>> type(b)
<class 'bytes'>
>>> list(b)
[104, 101, 108, 108, 111]
>>> print(b)
b'hello'
>>>
>>> # str
>>> s = 'nai\u0308ve'
>>> type(s)
<class 'str'>
>>> list(s)
['n', 'a', 'i', '̈', 'v', 'e']
>>> print(s)
naïve
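The str example above contains six code points even though it prints as five visible characters, because the diaeresis is a separate combining code point. The sketch below uses the standard unicodedata module to show this; normalization goes slightly beyond what the answer covers, so treat it as a side note.

```python
import unicodedata

s = 'nai\u0308ve'  # 'i' followed by U+0308 COMBINING DIAERESIS

# NFC normalization composes 'i' + U+0308 into the single code point U+00EF.
composed = unicodedata.normalize('NFC', s)

print(len(s), len(composed))  # 6 5
print(s == composed)          # False: different code point sequences

# The number of bytes depends on the code points and on the encoding.
print(len(s.encode('utf-8')), len(composed.encode('utf-8')))  # 7 6
```

So len() on a str counts code points, while len() on the encoded bytes counts bytes, and neither necessarily matches the number of characters a reader sees.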
Even though bytes and str seem to work the same way, their instances are not compatible with each other, i.e. bytes and str instances can’t be used together with operators like > and +. In addition, keep in mind that comparing bytes and str instances for equality, i.e. using ==, will always evaluate to False, even when they contain exactly the same characters.
>>> # concatenation
>>> b'hi' + b'bye' # this is possible
b'hibye'
>>> 'hi' + 'bye' # this is also possible
'hibye'
>>> b'hi' + 'bye' # this will fail
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat str to bytes
>>> 'hi' + b'bye' # this will also fail
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "bytes") to str
>>>
>>> # comparison
>>> b'red' > b'blue' # this is possible
True
>>> 'red'> 'blue' # this is also possible
True
>>> b'red' > 'blue' # you can't compare bytes with str
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: '>' not supported between instances of 'bytes' and 'str'
>>> 'red' > b'blue' # you can't compare str with bytes
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: '>' not supported between instances of 'str' and 'bytes'
>>> b'blue' == 'red' # equality between str and bytes always evaluates to False
False
>>> b'blue' == 'blue' # equality between str and bytes always evaluates to False
False
Another issue when dealing with bytes and str arises when working with file objects returned by the open built-in function. On one hand, if you want to read or write binary data from/to a file, always open the file in a binary mode like 'rb' or 'wb'. On the other hand, if you want to read or write Unicode data from/to a file, be aware of the default encoding of your computer, and if necessary pass the encoding parameter to avoid surprises.
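A small sketch of the two file modes follows. The file name example.txt and the sample text are made up for illustration; the file is written to a temporary directory so that nothing is left behind.

```python
import os
import tempfile

text = 'na\u00efve'  # "naïve" with a single precomposed code point

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.txt')  # hypothetical file name

    # Text mode: we pass str, and Python encodes it using `encoding`.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Binary mode: we get the raw bytes exactly as stored on disk.
    with open(path, 'rb') as f:
        raw = f.read()

    # Text mode again: Python decodes the bytes back into str.
    with open(path, 'r', encoding='utf-8') as f:
        decoded = f.read()

print(repr(raw))      # b'na\xc3\xafve'
print(repr(decoded))  # 'naïve'
```

Passing encoding explicitly means the result does not depend on the platform default.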
In Python 2
str consists of sequences of 8-bit values, while unicode consists of sequences of Unicode characters. One thing to keep in mind is that str and unicode can be used together with operators if str only consists of 7-bit ASCII characters.
It might be useful to use helper functions to convert between str and unicode in Python 2, and between bytes and str in Python 3.
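One possible shape for such helpers in Python 3 is sketched below. The names to_str and to_bytes and the UTF-8 default are my own assumptions, not something the answer specifies.

```python
def to_str(value, encoding='utf-8'):
    """Return `value` as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # assumed to be str already


def to_bytes(value, encoding='utf-8'):
    """Return `value` as bytes, encoding it first if it is str."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value  # assumed to be bytes already


print(repr(to_str(b'hello')))   # 'hello'
print(repr(to_str('hello')))    # 'hello'
print(repr(to_bytes('hello')))  # b'hello'
print(repr(to_bytes(b'hello'))) # b'hello'
```

Calling these at the boundaries of a program (reading input, writing output) lets the rest of the code work with a single type.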
Answer 3
From What is Unicode:
Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one.
……
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
So when a computer represents a string, it identifies each character by its unique Unicode number, and these numbers are held in memory. But you can’t directly write the string to disk or transmit it over a network as Unicode numbers, because those numbers are abstract integers with no defined byte layout. You have to encode the string into a byte string, for example with UTF-8. UTF-8 is a character encoding capable of encoding every Unicode character, and it stores characters as bytes. The encoded string can be used everywhere, because UTF-8 is supported nearly everywhere. When you open a UTF-8 encoded text file from another system, your computer decodes it and displays the characters identified by their Unicode numbers. When a browser receives UTF-8 encoded string data from the network, it decodes the data into a string (assuming the browser is using UTF-8 encoding) and displays the string.
In Python 3, you can convert between strings and byte strings:
>>> print('中文'.encode('utf-8'))
b'\xe4\xb8\xad\xe6\x96\x87'
>>> print(b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8'))
中文
In short, strings are for displaying text to humans on a computer, and byte strings are for storing data on disk and transmitting it.
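To see the "unique number per character" idea directly, the built-ins ord() and chr() convert between a character and its Unicode number. This is only a small sketch using the same sample text as above.

```python
text = '中文'

# Each character has a unique Unicode number (code point).
code_points = [ord(ch) for ch in text]
print(code_points)                    # [20013, 25991]
print([hex(n) for n in code_points])  # ['0x4e2d', '0x6587']

# chr() goes from the number back to the character.
rebuilt = ''.join(chr(n) for n in code_points)

# UTF-8 turns each of these two code points into three bytes.
encoded = text.encode('utf-8')
print(len(text), len(encoded))        # 2 6
```

The code points are what the string "is"; the six bytes are just one way of writing those two numbers down.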
Answer 4
Unicode is an agreed-upon format for the binary representation of characters and various kinds of formatting (e.g. lower case/upper case, new line, carriage return), and other “things” (e.g. emojis). A computer is no less capable of storing a Unicode representation (a series of bits), whether in memory or in a file, than it is of storing an ASCII representation (a different series of bits), or any other representation (series of bits).
For communication to take place, the parties to the communication must agree on what representation will be used.
Because Unicode seeks to represent all the possible characters (and other “things”) used in inter-human and inter-computer communication, it requires a greater number of bits for the representation of many characters (or things) than other systems of representation that seek to represent a more limited set of characters/things. To “simplify,” and perhaps to accommodate historical usage, Unicode representation is almost exclusively converted to some other system of representation (e.g. ASCII) for the purpose of storing characters in files.
It is not the case that Unicode cannot be used for storing characters in files, or transmitting them through any communications channel, simply that it is not.
The term “string” is not precisely defined. “String,” in its common usage, refers to a sequence of characters/things. In a computer, those characters may be stored in any one of many different bit-by-bit representations. A “byte string” is a sequence of characters stored using a representation built from eight-bit units (eight bits being referred to as a byte). Since, these days, computers use the Unicode system (characters represented by a variable number of bytes) to store characters in memory, and byte strings (sequences of bytes produced by an encoding) to store characters to files, a conversion must be used before characters represented in memory can be moved into storage in files.
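The point that the same characters can be stored in different bit-by-bit representations can be checked in a few lines. The sketch below encodes two sample characters (my own choice) with three standard encodings and compares how many bytes each needs.

```python
samples = ['A', '\u20ac']  # the letter A and the euro sign
encodings = ['utf-8', 'utf-16-le', 'utf-32-le']

# Number of bytes each character occupies in each encoding.
sizes = {
    (ch, enc): len(ch.encode(enc))
    for ch in samples
    for enc in encodings
}

for (ch, enc), n in sizes.items():
    print(f'{ch!r} in {enc}: {n} byte(s)')
```

Both sides of a communication have to agree on which of these representations is in use, as the answer says, because the byte sequences differ even when the characters are the same.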
Answer 5
Let’s take a simple one-character string, 'š', and encode it into a sequence of bytes:
>>> 'š'.encode('utf-8')
b'\xc5\xa1'
For the purpose of this example let’s display the sequence of bytes in its binary form:
>>> bin(int(b'\xc5\xa1'.hex(), 16))
'0b1100010110100001'
Now it is generally not possible to decode the information back without knowing how it was encoded. Only if you know that the utf-8 text encoding was used can you follow the algorithm for decoding utf-8 and recover the original string:
11000101 10100001
   ^^^^^   ^^^^^^
   00101   100001
You can display the binary number 101100001 back as a string:
>>> chr(int('101100001', 2))
'š'
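The bit manipulation above can be written out as a short function. This is only a sketch for the two-byte case shown here (the function name is made up), not a full UTF-8 decoder; in real code you would simply call .decode('utf-8').

```python
def decode_two_byte_utf8(pair):
    """Decode a single two-byte UTF-8 sequence by hand (illustration only)."""
    first, second = pair  # iterating over bytes yields integers

    # A two-byte sequence has the form 110xxxxx 10xxxxxx.
    # (Overlong encodings are not rejected in this simplified sketch.)
    if first >> 5 != 0b110 or second >> 6 != 0b10:
        raise ValueError('not a two-byte UTF-8 sequence')

    # Keep the 5 payload bits of the first byte and the 6 of the second,
    # then join them into one number: the Unicode code point.
    code_point = ((first & 0b00011111) << 6) | (second & 0b00111111)
    return chr(code_point)


result = decode_two_byte_utf8(b'\xc5\xa1')
print(result)                             # š
print(result == b'\xc5\xa1'.decode('utf-8'))  # True
```

The prefix bits 110 and 10 carry no character data; they only tell the decoder how the bytes are grouped.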
Answer 6
The Python language includes str and bytes as standard “Built-in Types”. In other words, they are both classes. I don’t think it’s worthwhile trying to rationalize why Python has been implemented this way.
Having said that, str and bytes are very similar to one another. Both share most of the same methods. The following methods are unique to the str class:
casefold
encode
format
format_map
isdecimal
isidentifier
isnumeric
isprintable
The following methods are unique to the bytes class:
decode
fromhex
hex
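Lists like the ones above can be reproduced with dir() and a set difference. The exact output may differ between Python versions, so the sketch below prints whatever the running interpreter reports rather than assuming a fixed result; public_names is a hypothetical helper.

```python
def public_names(cls):
    """Names of attributes on `cls` that do not start with an underscore."""
    return {name for name in dir(cls) if not name.startswith('_')}


str_only = sorted(public_names(str) - public_names(bytes))
bytes_only = sorted(public_names(bytes) - public_names(str))

print('only on str:  ', str_only)
print('only on bytes:', bytes_only)
```

Notably, encode appears only on str and decode only on bytes, which matches the direction of the two conversions.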