Python str vs unicode types

Question: Python str vs unicode types

Working with Python 2.7, I'm wondering what real advantage there is in using the type unicode instead of str, as both of them seem to be able to hold Unicode strings. Is there any special reason apart from being able to set Unicode codes in unicode strings using the escape character \?

Executing a module with:

# -*- coding: utf-8 -*-

a = 'á'
ua = u'á'
print a, ua

Results in: á, á

EDIT:

More testing using the Python shell:

>>> a = 'á'
>>> a
'\xc3\xa1'
>>> ua = u'á'
>>> ua
u'\xe1'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> ua
u'\xe1'

So, the unicode string seems to be encoded using latin1 instead of utf-8 and the raw string is encoded using utf-8? I’m even more confused now! :S


Answer 0

unicode is meant to handle text. Text is a sequence of code points, which may be bigger than a single byte. Text can be encoded in a specific encoding to represent the text as raw bytes (e.g. utf-8, latin-1, ...).

Note that unicode is not encoded! The internal representation used by python is an implementation detail, and you shouldn’t care about it as long as it is able to represent the code points you want.

On the contrary, str in Python 2 is a plain sequence of bytes. It does not represent text!

You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.
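
For example, the same text encodes to quite different byte sequences under different encodings (utf-16-be is just an arbitrary third example here):

>>> u'à'.encode('utf-8')
'\xc3\xa0'
>>> u'à'.encode('latin1')
'\xe0'
>>> u'à'.encode('utf-16-be')
'\x00\xe0'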

Note: In Python 3, unicode was renamed to str and there is a new bytes type for a plain sequence of bytes.
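
A minimal sketch of how the same distinction looks in Python 3, where text literals are str and encoded data is bytes:

>>> len('à')                  # Python 3: str is a sequence of code points
1
>>> 'à'.encode('utf-8')       # encoding yields a bytes object
b'\xc3\xa0'
>>> len('à'.encode('utf-8'))
2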

Some differences that you can see:

>>> len(u'à')  # a single code point
1
>>> len('à')   # by default utf-8 -> takes two bytes
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1'))  # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8')  # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # it cannot understand the latin1 byte
�

Note that using str you have lower-level control over the individual bytes of a specific encoding representation, while using unicode you can only operate at the code-point level. For example, you can do:

>>> 'àèìòù'
'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
>>> print 'àèìòù'.replace('\xa8', '')
à�ìòù

What was valid UTF-8 before isn't anymore. Using a unicode string, you cannot operate in such a way that the resulting string isn't valid Unicode text. You can remove a code point, replace a code point with a different one, and so on, but you cannot mess with the internal representation.
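
As a quick illustration, the same kind of edit done at the code-point level keeps the text valid (assuming, as above, a UTF-8 terminal):

>>> print u'àèìòù'.replace(u'è', u'')  # remove a whole code point, not a raw byte
àìòù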


Answer 1

Unicode and encodings are completely different, unrelated things.

Unicode

Assigns a numeric ID to each character:

  • 0x41 → A
  • 0xE1 → á
  • 0x414 → Д

So, Unicode assigns the number 0x41 to A, 0xE1 to á, and 0x414 to Д.

Even the little arrow → I used has its Unicode number; it's 0x2192. And even emojis have their Unicode numbers: 😂 is 0x1F602.

You can look up the Unicode numbers of all characters in this table. In particular, you can find the first three characters above here, the arrow here, and the emoji here.

These numbers assigned to all characters by Unicode are called code points.

The purpose of all this is to provide a means to unambiguously refer to each character. For example, if I'm talking about 😂, instead of saying "you know, this laughing emoji with tears", I can just say Unicode code point 0x1F602. Easier, right?

Note that Unicode code points are usually formatted with a leading U+, then the hexadecimal numeric value padded to at least 4 digits. So, the above examples would be U+0041, U+00E1, U+0414, U+2192, U+1F602.

Unicode code points range from U+0000 to U+10FFFF. That is 1,114,112 numbers. 2048 of these numbers are used for surrogates, thus, there remain 1,112,064. This means, Unicode can assign a unique ID (code point) to 1,112,064 distinct characters. Not all of these code points are assigned to a character yet, and Unicode is extended continuously (for example, when new emojis are introduced).
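
Just to double-check that arithmetic in the interpreter:

>>> 0x110000          # number of code points from U+0000 through U+10FFFF
1114112
>>> 0x110000 - 2048   # minus the surrogates
1112064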

The important thing to remember is that all Unicode does is assign a numerical ID, called a code point, to each character for easy and unambiguous reference.

Encodings

Map characters to bit patterns.

These bit patterns are used to represent the characters in computer memory or on disk.

There are many different encodings that cover different subsets of characters. In the English-speaking world, the most common encodings are the following:

ASCII

Maps 128 characters (code points U+0000 to U+007F) to bit patterns of length 7.

Example:

  • a → 1100001 (0x61)

You can see all the mappings in this table.

ISO 8859-1 (aka Latin-1)

Maps 191 characters (code points U+0020 to U+007E and U+00A0 to U+00FF) to bit patterns of length 8.

Example:

  • a → 01100001 (0x61)
  • á → 11100001 (0xE1)

You can see all the mappings in this table.

UTF-8

Maps 1,112,064 characters (all existing Unicode code points) to bit patterns of 8, 16, 24, or 32 bits in length (that is, 1, 2, 3, or 4 bytes).

Example:

  • a → 01100001 (0x61)
  • á → 11000011 10100001 (0xC3 0xA1)
  • ≠ → 11100010 10001001 10100000 (0xE2 0x89 0xA0)
  • 😂 → 11110000 10011111 10011000 10000010 (0xF0 0x9F 0x98 0x82)

The way UTF-8 encodes characters to bit strings is very well described here.
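
You can verify the mappings above from a Python 2 shell: encode produces exactly the byte patterns listed (note the emoji needs the 8-digit \U escape):

>>> u'a'.encode('ascii')
'a'
>>> u'\xe1'.encode('latin1')          # á
'\xe1'
>>> u'\xe1'.encode('utf-8')           # á
'\xc3\xa1'
>>> u'\u2260'.encode('utf-8')         # ≠
'\xe2\x89\xa0'
>>> u'\U0001F602'.encode('utf-8')     # 😂
'\xf0\x9f\x98\x82'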

Unicode and Encodings

Looking at the above examples, it becomes clear how Unicode is useful.

For example, if I’m Latin-1 and I want to explain my encoding of á, I don’t need to say:

“I encode that a with an aigu (or however you call that rising bar) as 11100001”

But I can just say:

“I encode U+00E1 as 11100001”

And if I’m UTF-8, I can say:

“Me, in turn, I encode U+00E1 as 11000011 10100001”

And it’s unambiguously clear to everybody which character we mean.

Now to the often arising confusion

It’s true that sometimes the bit pattern of an encoding, if you interpret it as a binary number, is the same as the Unicode code point of this character.

For example:

  • ASCII encodes a as 1100001, which you can interpret as the hexadecimal number 0x61, and the Unicode code point of a is U+0061.
  • Latin-1 encodes á as 11100001, which you can interpret as the hexadecimal number 0xE1, and the Unicode code point of á is U+00E1.

Of course, this has been arranged like this on purpose for convenience. But you should look at it as a pure coincidence. The bit pattern used to represent a character in memory is not tied in any way to the Unicode code point of this character.

Nobody even says that you have to interpret a bit string like 11100001 as a binary number. Just look at it as the sequence of bits that Latin-1 uses to encode the character á.
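
You can watch that coincidence happen in the interpreter:

>>> hex(ord(u'á'))          # the code point of á, as a number
'0xe1'
>>> u'á'.encode('latin1')   # the Latin-1 byte happens to carry the same value
'\xe1'
>>> u'á'.encode('utf-8')    # no such coincidence in UTF-8
'\xc3\xa1'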

Back to your question

The encoding used by your Python interpreter is UTF-8.

Here’s what’s going on in your examples:

Example 1

The following encodes the character á in UTF-8. This results in the bit string 11000011 10100001, which is saved in the variable a.

>>> a = 'á'

When you look at the value of a, its content 11000011 10100001 is formatted as the hex number 0xC3 0xA1 and output as '\xc3\xa1':

>>> a
'\xc3\xa1'

Example 2

The following saves the Unicode code point of á, which is U+00E1, in the variable ua (we don’t know which data format Python uses internally to represent the code point U+00E1 in memory, and it’s unimportant to us):

>>> ua = u'á'

When you look at the value of ua, Python tells you that it contains the code point U+00E1:

>>> ua
u'\xe1'

Example 3

The following encodes Unicode code point U+00E1 (representing character á) with UTF-8, which results in the bit pattern 11000011 10100001. Again, for output this bit pattern is represented as the hex number 0xC3 0xA1:

>>> ua.encode('utf-8')
'\xc3\xa1'

Example 4

The following encodes Unicode code point U+00E1 (representing character á) with Latin-1, which results in the bit pattern 11100001. For output, this bit pattern is represented as the hex number 0xE1, which by coincidence is the same as the initial code point U+00E1:

>>> ua.encode('latin1')
'\xe1'

There’s no relation between the Unicode object ua and the Latin-1 encoding. That the code point of á is U+00E1 and the Latin-1 encoding of á is 0xE1 (if you interpret the bit pattern of the encoding as a binary number) is a pure coincidence.


Answer 2

Your terminal happens to be configured to UTF-8.

The fact that printing a works is a coincidence; you are writing raw UTF-8 bytes to the terminal. a is a value of length two, containing two bytes, hex values C3 and A1, while ua is a unicode value of length one, containing a codepoint U+00E1.

This difference in length is one major reason to use Unicode values: you cannot easily measure the number of text characters in a byte string. The len() of a byte string tells you how many bytes were used, not how many characters were encoded.
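
A quick illustration of that length difference:

>>> len('á')   # two UTF-8 bytes
2
>>> len(u'á')  # one code point
1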

You can see the difference when you encode the unicode value to different output encodings:

>>> a = 'á'
>>> ua = u'á'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> a
'\xc3\xa1'

Note that the first 256 codepoints of the Unicode standard match the Latin 1 standard, so the U+00E1 codepoint is encoded to Latin 1 as a byte with hex value E1.
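
Decoding shows the same correspondence in the other direction (a quick sketch):

>>> '\xe1'.decode('latin1')     # the Latin-1 byte E1 maps straight to U+00E1
u'\xe1'
>>> '\xc3\xa1'.decode('utf-8')  # the UTF-8 bytes decode to the same code point
u'\xe1'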

Furthermore, Python uses escape codes in representations of unicode and byte strings alike, and low code points that are not printable ASCII are represented using \x.. escape values as well. This is why a Unicode string with a code point between 128 and 255 looks just like the Latin 1 encoding. If you have a unicode string with code points beyond U+00FF, a different escape sequence, \u...., is used instead, with a four-digit hex value.
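
For example (the euro sign, U+20AC, is just an arbitrary code point above U+00FF):

>>> u'\xe1'    # below U+0100: the \x.. escape, which looks like Latin-1
u'\xe1'
>>> u'\u20ac'  # above U+00FF: the \u.... escape
u'\u20ac'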

It looks like you don't yet fully understand what the difference is between Unicode and an encoding. Please do read a good introduction to the topic before you continue.


Answer 3

When you define a as unicode, the chars a and á have the same length (one code point each); otherwise, á counts as two chars. Try len(a) and len(ua). In addition to that, you may need to encode when you work with other environments. For example, if you use md5, you get different values for a and ua.
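
A minimal sketch of that last point, written as a module (hashing operates on bytes, so the unicode value has to be encoded first, and the chosen encoding changes the digest):

# -*- coding: utf-8 -*-
import hashlib

a = 'á'    # the UTF-8 bytes '\xc3\xa1'
ua = u'á'  # the single code point U+00E1

print hashlib.md5(a).hexdigest()                    # digest of the UTF-8 bytes
print hashlib.md5(ua.encode('latin1')).hexdigest()  # different bytes, different digest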