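What exactly do the “u” and “r” string flags do, and what are raw string literals?

Question: What exactly do the “u” and “r” string flags do, and what are raw string literals?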

While asking this question, I realized I didn’t know much about raw strings. For somebody claiming to be a Django trainer, this sucks.

I know what an encoding is, and I know what u'' alone does, since I understand what Unicode is.

  • But what does r'' do exactly? What kind of string does it result in?

  • And above all, what the heck does ur'' do?

  • Finally, is there any reliable way to go back from a Unicode string to a simple raw string?

  • Ah, and by the way, if your system and your text editor charset are set to UTF-8, does u'' actually do anything?


Answer 0

There’s not really any “raw string”; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.

A “raw string literal” is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning “just a backslash” (except when it comes right before a quote that would otherwise terminate the literal) — no “escape sequences” to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.
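The difference between the two literal syntaxes can be checked directly. The following is a minimal sketch, assuming Python 3 (the behaviour of the r prefix shown here is the same in Python 2); the variable names are only illustrative.

```python
normal = 'a\tb'   # \t is an escape sequence: one tab character
raw = r'a\tb'     # backslash and t are kept as two separate characters

print(len(normal), len(raw))   # 3 4
print(list(raw))               # ['a', '\\', 't', 'b']
```

The second line of output shows the backslash as '\\' only because repr() itself writes it in escaped form; the string holds a single backslash.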

This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the “except” clause above doesn’t matter) and it looks a bit better when you avoid doubling up each of them — that’s all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that’s very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the “except” clause above).
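As an illustration of the regular-expression case, here is a small sketch, assuming Python 3 and the standard re module. The sample text and variable names are made up for the example.

```python
import re

# The same pattern written twice: once with doubled backslashes, once as a raw literal.
escaped_pattern = '\\d+\\.\\d+'
raw_pattern = r'\d+\.\d+'

# Both literals produce the identical pattern string.
assert escaped_pattern == raw_pattern

# Hypothetical sample text, used only for illustration.
sample = 'pi is roughly 3.14 and e is roughly 2.718'
matches = re.findall(raw_pattern, sample)
print(matches)  # ['3.14', '2.718']
```

The raw form is not more powerful, only easier to read, which is the whole point of the syntax.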

r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).

Not sure what you mean by “going back”: there are no intrinsic back and forward directions, because there is no raw string type. It is just an alternative syntax for expressing perfectly normal string objects, byte or Unicode as the case may be.

And yes, in Python 2.*, u'...' is of course always distinct from just '...' — the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.

E.g., consider (Python 2.6):

>>> import sys
>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34

The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).


Answer 1

There are two types of string in Python 2: the traditional str type and the newer unicode type. If you type a string literal without the u in front, you get the old str type, which stores 8-bit characters; with the u in front, you get the newer unicode type, which can store any Unicode character.

The r doesn’t change the type at all, it just changes how the string literal is interpreted. Without the r, backslashes are treated as escape characters. With the r, backslashes are treated as literal. Either way, the type is the same.

ur is of course a Unicode string where backslashes are literal backslashes, not part of escape codes.
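The ur prefix described here is specific to Python 2. As a hedged aside, the sketch below assumes Python 3.3 or later, where u and r are each accepted on their own but the combination ur is a syntax error. The helper name compiles is hypothetical.

```python
def compiles(source):
    """Return True if the given source text is valid Python for this interpreter."""
    try:
        compile(source, '<example>', 'eval')
        return True
    except SyntaxError:
        return False

# Under Python 3 the combined ur prefix is rejected, while u and r alone are accepted.
ur_ok = compiles(r"ur'\n'")
u_ok = compiles(r"u'\n'")
r_ok = compiles(r"r'\n'")

# rb (raw bytes) is the Python 3 spelling closest to Python 2's plain r'...' byte string.
rb_value = rb'\n'

print(ur_ok, u_ok, r_ok, len(rb_value))  # False True True 2
```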

You can try to convert a Unicode string to an old string using the str() function, but str() encodes with the default codec (normally ASCII), so if the string contains any characters that cannot be represented that way you will get a UnicodeEncodeError. You could replace them with question marks first if you wish, but of course this would make those characters unreadable. It is not recommended to use the str type if you want to handle Unicode characters correctly.
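The same trade-off can be sketched in Python 3, where str is the Unicode type and bytes plays the role of Python 2's str. This is only an analogue of the str() call described above: it uses the encode() method with an explicit codec, and the sample word is arbitrary.

```python
text = 'caf\u00e9'  # 'café', written with an escape so the source stays ASCII

# Strict ASCII encoding fails because U+00E9 has no ASCII representation.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors='replace' substitutes a question mark for each unencodable character.
replaced = text.encode('ascii', errors='replace')

# UTF-8 can represent every character, so nothing is lost.
utf8_bytes = text.encode('utf-8')

print(strict_failed, replaced, utf8_bytes)  # True b'caf?' b'caf\xc3\xa9'
```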


Answer 2

A ‘raw string’ literal is stored as it appears in the source. For example, in r'\n' the backslash is just a backslash rather than the start of an escape sequence.


Answer 3

A “u” prefix denotes the value has type unicode rather than str.

Raw string literals, with an “r” prefix, escape any escape sequences within them, so len(r"\n") is 2. Because they escape escape sequences, you cannot end a string literal with a single backslash: that’s not a valid escape sequence (e.g. r"\").

“Raw” is not part of the type, it’s merely one way to represent the value. For example, "\\n" and r"\n" are identical values, just like 32, 0x20, and 0b100000 are identical.
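A quick way to convince yourself of this, sketched for Python 3 (the comparison behaves the same way in Python 2):

```python
escaped = "\\n"
raw = r"\n"

# Same value and same type: "raw" leaves no trace on the resulting object.
print(escaped == raw, type(escaped) is type(raw))  # True True

# The integer analogy from the text: three spellings of one value.
print(32 == 0x20 == 0b100000)  # True
```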

You can have unicode raw string literals:

>>> u = ur"\n"
>>> print type(u), len(u)
<type 'unicode'> 2

The source file encoding just determines how to interpret the source file, it doesn’t affect expressions or types otherwise. However, it’s recommended to avoid code where an encoding other than ASCII would change the meaning:

Files using ASCII (or UTF-8, for Python 3.0) should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.
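The escape-based approach recommended in that quotation might look like the following. This is a sketch assuming Python 3, where plain literals are already Unicode; in Python 2 the \u and \U forms need the u prefix.

```python
# \u escape in a normal literal: one character, U+00E9, written in pure ASCII source.
escaped = '\u00e9'
print(len(escaped), escaped == chr(0xE9))  # 1 True

# \x covers code points up to 0xFF; \U takes eight hex digits.
print('\xe9' == escaped, '\U000000e9' == escaped)  # True True
```

Because the source file contains only ASCII, its meaning does not depend on which encoding the file is read with.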


Answer 4

Let me explain it simply: in Python 2, you can store a string in two different types.

The first is the str type, a byte string that uses 1 byte of memory per character. (256 possible values; it mostly holds English letters and simple symbols.)

The second is the unicode type, which can store text in any language.

By default, a string literal in Python 2 is a str, but if you want a unicode string you can put u in front of the text, like u'text', or you can call unicode('text').

So u is just a short way to get a unicode object instead of a str, without calling a conversion function. That’s it!

Now the r part: you put it in front of the text to tell Python that the text is raw, so a backslash should not be treated as an escape character. r'\n' will not create a newline character. It is just plain text containing 2 characters.

If you want a unicode string that also contains raw text, use ur, because ru raises an error in Python 2.

NOW, the important part:

You cannot store a single backslash by using r; it is the one exception. So this code produces an error: r'\'

To store a backslash (only one) you need to use '\\'

If you want to store more than one backslash you can still use r, as long as the literal does not end in an odd number of them: r'\\' produces 2 backslashes, as you would expect.

The reason r doesn’t work for a single backslash is that, even in a raw literal, a backslash immediately before a quote stops that quote from ending the literal, so r'\' is never closed. This follows from how string literals are tokenized; it is not a bug.
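Some common ways to get a string that is, or ends with, a single backslash are sketched below, assuming Python 3 (all three also work in Python 2). The directory name is hypothetical.

```python
# A normal literal with a doubled backslash holds exactly one backslash.
single = '\\'

# chr(92) builds the same one-character string without any literal at all.
from_code_point = chr(92)

# A raw literal cannot end in an odd number of backslashes,
# so the final one is appended from a normal literal.
folder = r'C:\temp' + '\\'

print(len(single), single == from_code_point, folder)  # 1 True C:\temp\
```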


Answer 5

Maybe this is obvious, maybe not, but you can make a string holding a single backslash by calling x = chr(92)

# Python 2
x = chr(92)
print type(x), len(x)  # <type 'str'> 1
y = '\\'
print type(y), len(y)  # <type 'str'> 1
x == y   # True
x is y   # identity is an implementation detail; compare strings with ==

Answer 6

Unicode string literals

Unicode string literals (string literals prefixed by u) are no longer needed in Python 3, where every str is already Unicode. The prefix is still accepted (it was restored in Python 3.3), but only for compatibility with Python 2.

Raw string literals

If you want to create a string literal consisting only of easily typable characters, such as English letters or digits, you can simply type them: 'hello world'. But if you also want to include some more exotic characters, you need a workaround. One such workaround is escape sequences. For example, you can represent a newline in your string simply by adding the two easily typable characters \n to your string literal. So when you print the 'hello\nworld' string, the words are printed on separate lines. That’s very handy!

On the other hand, there are some situations when you want to create a string literal that contains escape sequences but you don’t want them to be interpreted by Python. You want them to be raw. Look at these examples:

'New updates are ready in c:\windows\updates\new'
'In this lesson we will learn what the \n escape sequence does.'

In such situations you can just prefix the string literal with the r character like this: r'hello\nworld' and no escape sequences will be interpreted by Python. The string will be printed exactly as you created it.
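To see the effect on a path-like string, here is a small sketch assuming Python 3. The path is hypothetical and was chosen because it contains only the valid escapes \n and \t, so the non-raw version still compiles.

```python
# Hypothetical path chosen so that it contains the sequences \n and \t.
interpreted = 'C:\new\table'
literal = r'C:\new\table'

# In the normal literal, \n became a newline and \t became a tab.
print(len(interpreted), len(literal))            # 10 12
print('\n' in interpreted, '\t' in interpreted)  # True True
print(literal)                                   # C:\new\table
```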

Raw string literals are not completely “raw”?

Many people expect raw string literals to be raw in the sense that “anything placed between the quotes is ignored by Python”. That is not true. Python still recognizes all the escape sequences; it just does not interpret them and leaves them unchanged instead. This means that raw string literals still have to be valid string literals.

From the lexical definition of a string literal:

string     ::=  "'" stringitem* "'"
stringitem ::=  stringchar | escapeseq
stringchar ::=  <any source character except "\" or newline or the quote>
escapeseq  ::=  "\" <any source character>

It is clear that string literals (raw or not) containing a bare quote character, such as 'hello'world', or ending with an odd number of backslashes, such as 'hello world\', are not valid.
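These rules can be checked mechanically by asking the interpreter to compile each candidate. The sketch below assumes Python 3; is_valid_literal is a hypothetical helper, and the candidates are assembled from chr() values so that the example file itself stays valid.

```python
def is_valid_literal(source):
    """Return True if source compiles as a Python expression."""
    try:
        compile(source, '<example>', 'eval')
        return True
    except SyntaxError:
        return False

# Each candidate is built from pieces so this file itself stays valid.
backslash = chr(92)
quote = chr(39)
trailing_backslash = 'r' + quote + 'hello world' + backslash + quote  # r'hello world\'
bare_quote = quote + 'hello' + quote + 'world' + quote                # 'hello'world'
escaped_quote = 'r' + quote + backslash + quote + quote               # r'\''

print(is_valid_literal(trailing_backslash))  # False
print(is_valid_literal(bare_quote))          # False
print(is_valid_literal(escaped_quote))       # True

# In the raw literal the backslash protects the quote but also stays in the value.
print(len(eval(escaped_quote)))              # 2
```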