Tag archive: rawstring

Why can't Python's raw string literals end with a single backslash?

Question: Why can't Python's raw string literals end with a single backslash?

Technically, any odd number of backslashes, as described in the documentation.

>>> r'\'
  File "<stdin>", line 1
    r'\'
       ^
SyntaxError: EOL while scanning string literal
>>> r'\\'
'\\\\'
>>> r'\\\'
  File "<stdin>", line 1
    r'\\\'
         ^
SyntaxError: EOL while scanning string literal

It seems like the parser could just treat backslashes in raw strings as regular characters (isn’t that what raw strings are all about?), but I’m probably missing something obvious.


Answer 0

The reason is explained in the part of that section which I highlighted in bold:

String quotes can be escaped with a backslash, but the backslash remains in the string; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes). Specifically, a raw string cannot end in a single backslash (since the backslash would escape the following quote character). Note also that a single backslash followed by a newline is interpreted as those two characters as part of the string, not as a line continuation.

So raw strings are not 100% raw; there is still some rudimentary backslash processing.
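The quoted rule can be checked directly. The sketch below assumes Python 3; `is_valid_literal` is a hypothetical helper written for this illustration, which simply asks the built-in `compile` whether a piece of source text is accepted.

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<demo>", "eval")
    except SyntaxError:
        return False
    return True


# r"\"" is valid and keeps BOTH the backslash and the quote.
two_chars = r"\""
print(len(two_chars))

# The source texts below are written as ordinary literals,
# so every backslash in them is doubled.
print(is_valid_literal('r"\\""'))    # source text: r"\""
print(is_valid_literal('r"\\"'))     # source text: r"\"
print(is_valid_literal('r"\\\\"'))   # source text: r"\\"
```

Only the middle case, the one ending in an odd number of backslashes, is rejected.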


Answer 1

The whole misconception about Python's raw strings is that most people think a backslash (within a raw string) is just a regular character like all the others. It is NOT. The key to understanding this is the following passage from the Python documentation:

When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string

So any character following a backslash is part of the raw string. Once the parser enters a raw string (a non-Unicode one) and encounters a backslash, it knows there are 2 characters (a backslash and the character following it).

This way:

r'abc\d' comprises a, b, c, \, d

r'abc\'d' comprises a, b, c, \, ', d

r'abc\'' comprises a, b, c, \, '

and:

r'abc\' comprises a, b, c, \, ' but there is no terminating quote now.

The last case shows that, according to the documentation, the parser cannot find a closing quote, because the last quote you see above is part of the string. In other words, a backslash cannot be the last character here because it would 'devour' the string's closing quote.
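The three valid cases can be inspected character by character. This is a small sketch assuming Python 3; the variable names are made up for the illustration.

```python
first = list(r'abc\d')     # backslash followed by d
second = list(r'abc\'d')   # backslash followed by a quote, then d
third = list(r'abc\'')     # backslash followed by a quote; the final ' closes the literal

for chars in (first, second, third):
    print(chars)
```

In each list the backslash shows up as its own one-character element, printed as `'\\'` because that is how `repr` displays a single backslash.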


Answer 2

That’s the way it is! I see it as one of those small defects in python!

I don’t think there’s a good reason for it, but it’s definitely not parsing; it’s really easy to parse raw strings with \ as a last character.

The catch is, if you allow \ to be the last character in a raw string then you won't be able to put " inside a raw string. It seems Python went with allowing " instead of allowing \ as the last character.

However, this shouldn’t cause any trouble.

If you're worried about not being able to easily write Windows folder paths such as c:\mypath\ then worry not: you can represent them as r"C:\mypath", and if you need to append a subdirectory name, don't do it with string concatenation, for it's not the right way to do it anyway! Use os.path.join:

>>> import os
>>> os.path.join(r"C:\mypath", "subfolder")
'C:\\mypath\\subfolder'
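The output shown above is what `os.path.join` produces on Windows; on other platforms `os.path` joins with a forward slash. As a sketch that behaves the same everywhere, the standard-library module `ntpath` (the Windows flavour of `os.path`) can be called directly. Joining with an empty last component is one way to get the trailing backslash that a raw literal cannot express. The variable names are assumptions for this example.

```python
import ntpath  # the Windows implementation of os.path, importable on any platform

joined = ntpath.join(r"C:\mypath", "subfolder")

# An empty final component makes join() end the result with a separator.
with_trailing_sep = ntpath.join(r"C:\mypath", "")

print(joined)
print(with_trailing_sep)
```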

Answer 3

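To end a raw string with a backslash, I suggest this trick: put an ordinary string literal holding the backslash right after the raw one, and Python concatenates the adjacent literals:

>>> print(r"c:\test" '\\')
c:\test\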

Answer 4

Another trick is to use chr(92), as it evaluates to "\".

I recently had to clean a string of backslashes and the following did the trick:

CleanString = DirtyString.replace(chr(92),'')

I realize that this does not take care of the “why” but the thread attracts many people looking for a solution to an immediate problem.
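A minimal sketch of the `chr(92)` trick, assuming Python 3. The sample strings and variable names are invented for the illustration.

```python
backslash = chr(92)  # code point 92 is the backslash character

# Append a trailing backslash without writing one at the end of a raw literal.
folder = r"C:\mypath" + backslash

# Strip every backslash from a string, as in the replace() call above.
cleaned = r"a\b\c".replace(backslash, "")

print(folder)
print(cleaned)
```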


Answer 5

Since \" is allowed inside a raw string, a quote preceded by a backslash can't be used to identify the end of the string literal.

Why not stop parsing the string literal when you encounter the first "?

If that was the case, then \" wouldn't be allowed inside the string literal. But it is.


Answer 6

The reason r'\' is syntactically incorrect is that, although the string expression is raw, the quotes used (single or double) always have to be escaped, since they would otherwise mark the end of the string. So if you want to express a single quote inside a single-quoted string, there is no other way than using \'. The same applies to double quotes.

But you could use:

'\\'

Answer 7

Another user who has since deleted their answer (not sure if they’d like to be credited) suggested that the Python language designers may be able to simplify the parser design by using the same parsing rules and expanding escaped characters to raw form as an afterthought (if the literal was marked as raw).

I thought it was an interesting idea and am including it as community wiki for posterity.


Answer 8

Despite its role, even a raw string cannot end in a single backslash, because the backslash escapes the following quote character; you still must escape the surrounding quote character to embed it in the string. That is, r"…\" is not a valid string literal; a raw string cannot end in an odd number of backslashes.
If you need to end a raw string with a single backslash, you can use two and slice off the second.
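The "use two and slice off the second" advice might look like the following sketch (Python 3 assumed; the path is an invented example).

```python
# The literal ends in TWO backslashes (an even count, so it is valid);
# the [:-1] slice then drops the last one.
path = r"C:\mypath\\"[:-1]

print(path)
print(len(path))
```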


Answer 9

Coming from C, it's pretty clear to me that a single \ works as an escape character, allowing you to put special characters such as newlines, tabs and quotes into strings.

That does indeed disallow \ as the last character, since it will escape the " and make the parser choke. But as pointed out earlier \ is legal.


Answer 10

some tips :

1) If you need to manipulate backslashes in a path, the standard Python module os.path is your friend. For example:

os.path.normpath('c:/folder1/')

2) If you want to build strings with backslashes in them BUT without a backslash at the END of the string, a raw string is your friend (use the 'r' prefix before your string literal). For example:

r'\one \two \three'

3) if you need to prefix a string in a variable X with a backslash then you can do this :

X='dummy'
bs=r'\ ' # don't forget the space after backslash or you will get EOL error
X2=bs[0]+X  # X2 now contains \dummy

4) if you need to create a string with a backslash at the end then combine tip 2 and 3 :

voice_name='upper'
lilypond_display=r'\DisplayLilyMusic \ ' # don't forget the space at the end
lilypond_statement=lilypond_display[:-1]+voice_name

now lilypond_statement contains "\DisplayLilyMusic \upper"

long live python ! :)

n3on
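As a rough check of tips 1 and 4, here is a sketch assuming Python 3. It calls `ntpath` (the Windows flavour of `os.path` in the standard library) so that the path result does not depend on the platform running it; the variable names are assumptions for this example.

```python
import ntpath  # Windows implementation of os.path, usable on any platform

# Tip 1: normpath converts forward slashes and drops the trailing separator.
normalized = ntpath.normpath('c:/folder1/')

# Tips 3 and 4: take the backslash from a raw literal that ends in a space.
bs = r'\ '[0]
statement = r'\DisplayLilyMusic ' + bs + 'upper'

print(normalized)
print(statement)
```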


Answer 11

I encountered this problem and found a partial solution which is good for some cases. Although a Python string literal cannot end with a single backslash, the string itself can be serialized and saved in a text file with a single backslash at the end. Therefore, if what you need is to save text with a single trailing backslash on your computer, it is possible:

x = 'a string\\' 
x
'a string\\' 

# Now save it in a text file and it will appear with a single backslash:

with open("my_file.txt", 'w') as h:
    h.write(x)

BTW, this does not work with JSON if you dump the string using Python's json library.

Finally, I work with Spyder, and I noticed that if I open the variable in Spyder's text editor by double-clicking on its name in the variable explorer, it is presented with a single backslash and can be copied to the clipboard that way (it's not very helpful for most needs, but maybe for some).
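It may help to separate the string's contents from how they are displayed. In the sketch below (Python 3 assumed, names invented for the example), the string holds exactly one trailing backslash; the doubled form is only what `repr` shows. `json.dumps` doubles the backslash in its output text because JSON requires backslashes to be escaped, and `json.loads` restores the original.

```python
import json

x = 'a string\\'          # one trailing backslash, written with an escape

print(len(x))             # counts the backslash once
print(x)                  # what would be written to a file
print(repr(x))            # what the interactive prompt shows

as_json = json.dumps(x)   # JSON text with the backslash escaped
round_tripped = json.loads(as_json)
print(as_json)
```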


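What exactly do the "u" and "r" string flags do, and what are raw string literals?

Question: What exactly do the "u" and "r" string flags do, and what are raw string literals?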

While asking this question, I realized I didn’t know much about raw strings. For somebody claiming to be a Django trainer, this sucks.

I know what an encoding is, and I know what u'' alone does since I understand what Unicode is.

  • But what does r'' do exactly? What kind of string does it result in?

  • And above all, what the heck does ur'' do?

  • Finally, is there any reliable way to go back from a Unicode string to a simple raw string?

  • Ah, and by the way, if your system and your text editor charset are set to UTF-8, does u'' actually do anything?


Answer 0

There's not really any "raw string"; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.

A “raw string literal” is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning “just a backslash” (except when it comes right before a quote that would otherwise terminate the literal) — no “escape sequences” to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.

This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the “except” clause above doesn’t matter) and it looks a bit better when you avoid doubling up each of them — that’s all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that’s very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the “except” clause above).

r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).

Not sure what you mean by "going back": there are no intrinsic back and forward directions, because there's no raw string type; it's just an alternative syntax for expressing perfectly normal string objects, byte or Unicode as they may be.

And yes, in Python 2.*, u'...' is of course always distinct from just '...' — the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.

E.g., consider (Python 2.6):

>>> import sys
>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34

The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).
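The answer above is written for Python 2. As a sketch of the same point under Python 3 (an assumption of this example, where `str` is the Unicode type and the `ur` prefix is no longer accepted), the `r` prefix changes only how the literal is written, not the type of the resulting object. The regular-expression pattern is an invented example.

```python
import re

escaped = '\\d+\\.\\d+'   # ordinary literal: every backslash doubled
raw = r'\d+\.\d+'         # raw literal: same text, easier to read

same_value = escaped == raw
result_type = type(raw)          # plain str; there is no "raw" type
bytes_type = type(rb'\d+')       # the b prefix, not r, selects bytes
matched = re.fullmatch(raw, '3.14') is not None

print(same_value, result_type, bytes_type, matched)
```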


Answer 1

There are two types of string in python: the traditional str type and the newer unicode type. If you type a string literal without the u in front you get the old str type which stores 8-bit characters, and with the u in front you get the newer unicode type that can store any Unicode character.

The r doesn’t change the type at all, it just changes how the string literal is interpreted. Without the r, backslashes are treated as escape characters. With the r, backslashes are treated as literal. Either way, the type is the same.

ur is of course a Unicode string where backslashes are literal backslashes, not part of escape codes.

You can try to convert a Unicode string to an old string using the str() function, but if there are any unicode characters that cannot be represented in the old string, you will get an exception. You could replace them with question marks first if you wish, but of course this would cause those characters to be unreadable. It is not recommended to use the str type if you want to correctly handle unicode characters.
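The paragraph above describes Python 2's `str()` conversion. A rough Python 3 analogue, offered as a sketch rather than the same API, is encoding a `str` to ASCII bytes: it fails on characters ASCII cannot represent unless you ask for them to be replaced with question marks. The sample text and variable names are made up.

```python
text = 'caf\u00e9'  # "café": the last character is outside ASCII

try:
    text.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True   # strict encoding refuses the non-ASCII character

# errors='replace' substitutes '?' for anything that cannot be encoded.
replaced = text.encode('ascii', errors='replace')

print(failed)
print(replaced)
```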


Answer 2

'raw string' means it is stored as it appears. For example, a \ inside a raw string literal is just a backslash rather than the start of an escape sequence.


Answer 3

A “u” prefix denotes the value has type unicode rather than str.

Raw string literals, with an “r” prefix, escape any escape sequences within them, so len(r"\n") is 2. Because they escape escape sequences, you cannot end a string literal with a single backslash: that’s not a valid escape sequence (e.g. r"\").

“Raw” is not part of the type, it’s merely one way to represent the value. For example, "\\n" and r"\n" are identical values, just like 32, 0x20, and 0b100000 are identical.
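The comparison with integer literals can be run as written. A short sketch, assuming Python 3, with variable names chosen for the example:

```python
escaped = "\\n"   # ordinary literal: escaped backslash, then n
raw = r"\n"       # raw literal: the same two characters

same_text = escaped == raw
length = len(raw)

# Three spellings of one integer, just as the two literals spell one string.
same_number = 32 == 0x20 == 0b100000

print(same_text, length, same_number)
```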

You can have unicode raw string literals:

>>> u = ur"\n"
>>> print type(u), len(u)
<type 'unicode'> 2

The source file encoding just determines how to interpret the source file, it doesn’t affect expressions or types otherwise. However, it’s recommended to avoid code where an encoding other than ASCII would change the meaning:

Files using ASCII (or UTF-8, for Python 3.0) should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.


Answer 4

Let me explain it simply: in Python 2, you can store a string in 2 different types.

The first one is ASCII, which is the str type in Python; it uses 1 byte of memory per character. (256 possible values, so it mostly stores English letters and simple symbols.)

The 2nd type is UNICODE which is unicode type in python. Unicode stores all types of languages.

By default, Python will prefer the str type, but if you want to store a string in the unicode type you can put u in front of the text, like u'text', or you can do this by calling unicode('text').

So u is just a short way to call a function to cast str to unicode. That's it!

Now the r part: you put it in front of the text to tell the computer that the text is raw text, so a backslash should not be an escape character. r'\n' will not create a newline character. It's just plain text containing 2 characters.

If you want to convert str to unicode and also put raw text in there, use ur, because ru will raise an error.

NOW, the important part:

You cannot store a single backslash by using r; it's the only exception. So this code will produce an error: r'\'

To store a backslash (only one) you need to use '\\'

If you want to store more than 1 character you can still use r: r'\\' will produce 2 backslashes, as you expected.

I don't know the reason why r doesn't work for storing a single backslash, and nobody has described the reason yet. I hope that it is a bug.
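The lengths involved can be verified with a few lines. This is a sketch assuming Python 3; the names are invented for the example.

```python
one = '\\'             # ordinary literal: a single backslash
two = r'\\'            # raw literal: two backslashes
newline_escape = r'\n' # raw literal: a backslash and the letter n

lengths = (len(one), len(two), len(newline_escape))
print(lengths)
```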


Answer 5

Maybe this is obvious, maybe not, but you can make the string '\' by calling x=chr(92)

x = chr(92)
print(type(x), len(x))  # <class 'str'> 1 on Python 3
y = '\\'
print(type(y), len(y))  # <class 'str'> 1 on Python 3
x == y   # True
x is y   # implementation-dependent; do not rely on identity for strings

Answer 6

Unicode string literals

Unicode string literals (string literals prefixed by u) are no longer used in Python 3. They are still valid but just for compatibility purposes with Python 2.

Raw string literals

If you want to create a string literal consisting of only easily typable characters like English letters or numbers, you can simply type them: 'hello world'. But if you also want to include some more exotic characters, you'll have to use a workaround. One such workaround is escape sequences. This way you can, for example, represent a new line in your string simply by adding the two easily typable characters \n to your string literal. So when you print the 'hello\nworld' string, the words will be printed on separate lines. That's very handy!

On the other hand, there are some situations when you want to create a string literal that contains escape sequences but you don’t want them to be interpreted by Python. You want them to be raw. Look at these examples:

'New updates are ready in c:\windows\updates\new'
'In this lesson we will learn what the \n escape sequence does.'

In such situations you can just prefix the string literal with the r character like this: r'hello\nworld' and no escape sequences will be interpreted by Python. The string will be printed exactly as you created it.
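A small side-by-side sketch of the two literals mentioned above (Python 3 assumed; variable names chosen for the example):

```python
normal = 'hello\nworld'   # \n is interpreted as a single newline character
raw = r'hello\nworld'     # \n stays as a backslash followed by n

print(normal)             # printed on two lines
print(raw)                # printed on one line, backslash included
print(len(normal), len(raw))
```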

Raw string literals are not completely “raw”?

Many people expect raw string literals to be raw in the sense that "anything placed between the quotes is ignored by Python". That is not true. Python still recognizes all the escape sequences; it just does not interpret them and leaves them unchanged instead. This means that raw string literals still have to be valid string literals.

From the lexical definition of a string literal:

string     ::=  "'" stringitem* "'"
stringitem ::=  stringchar | escapeseq
stringchar ::=  <any source character except "\" or newline or the quote>
escapeseq  ::=  "\" <any source character>

It is clear that string literals (raw or not) containing a bare quote character: 'hello'world' or ending with a backslash: 'hello world\' are not valid.
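The grammar above can be approximated with a regular expression. This is a simplified sketch, not CPython's tokenizer: it handles only single-quoted literals without prefixes or triple quotes, and `STRING_RE` and `matches_grammar` are hypothetical names introduced for the illustration.

```python
import re

# string     ::= "'" stringitem* "'"
# stringitem ::= stringchar | escapeseq
# stringchar ::= any character except backslash, newline or the quote
# escapeseq  ::= backslash followed by any character
STRING_RE = re.compile(r"'(?:[^\\\n']|\\.)*'", re.DOTALL)


def matches_grammar(source):
    """Return True if `source` is one complete single-quoted literal."""
    return STRING_RE.fullmatch(source) is not None


# The source texts are written as double-quoted literals with doubled backslashes.
print(matches_grammar("'hello world'"))     # plain literal
print(matches_grammar("'hello\\'world'"))   # quote consumed by an escapeseq
print(matches_grammar("'hello'world'"))     # bare quote inside
print(matches_grammar("'hello world\\'"))   # trailing backslash eats the closing quote
```

The last two inputs fail for the reasons given in the text: a bare quote ends the literal early, and a trailing backslash pairs up with the closing quote so that no terminator is left.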