标签归档:escaping

在Python字符串中转义正则表达式特殊字符

问题:在Python字符串中转义正则表达式特殊字符

Python是否具有可用来在正则表达式中转义特殊字符的函数?

例如,I'm "stuck" :\应成为I\'m \"stuck\" :\\

Does Python have a function that I can use to escape special characters in a regular expression?

For example, I'm "stuck" :\ should become I\'m \"stuck\" :\\.


回答 0

re.escape

>>> import re
>>> re.escape(r'\ a.*$')
'\\\\\\ a\\.\\*\\$'
>>> print(re.escape(r'\ a.*$'))
\\\ a\.\*\$
>>> re.escape('www.stackoverflow.com')
'www\\.stackoverflow\\.com'
>>> print(re.escape('www.stackoverflow.com'))
www\.stackoverflow\.com

在这里重复:

re.escape(字符串)

返回所有非字母数字加反斜杠的字符串;如果您想匹配其中可能包含正则表达式元字符的任意文字字符串,这将很有用。

从Python 3.7 re.escape()开始,更改为仅转义对正则表达式操作有意义的字符。

Use re.escape

>>> import re
>>> re.escape(r'\ a.*$')
'\\\\\\ a\\.\\*\\$'
>>> print(re.escape(r'\ a.*$'))
\\\ a\.\*\$
>>> re.escape('www.stackoverflow.com')
'www\\.stackoverflow\\.com'
>>> print(re.escape('www.stackoverflow.com'))
www\.stackoverflow\.com

Repeating it here:

re.escape(string)

Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

As of Python 3.7 re.escape() was changed to escape only characters which are meaningful to regex operations.


回答 1

我很惊讶没有人提到通过re.sub()以下方式使用正则表达式:

import re
print re.sub(r'([\"])',    r'\\\1', 'it\'s "this"')  # it's \"this\"
print re.sub(r"([\'])",    r'\\\1', 'it\'s "this"')  # it\'s "this"
print re.sub(r'([\" \'])', r'\\\1', 'it\'s "this"')  # it\'s\ \"this\"

重要注意事项:

  • 搜索模式中,包括\您要查找的字符。你会使用\逃脱你的角色,所以你需要逃避 为好。
  • 搜索模式周围加上括号,例如([\"]),以便替换 模式在找到的字符添加\到其前面时可以使用该字符。(这就是 \1作用:使用第一个带括号的组的值。)
  • r前面r'([\"])'意味着它是一个原始字符串。原始字符串使用不同的规则来转义反斜杠。要([\"])以纯字符串形式编写,您需要将所有反斜杠加倍,并写入'([\\"])'。在编写正则表达式时,原始字符串更友好。
  • 替代模式,你需要转义\从先于一个取代基的反斜杠,例如区分\1,因此r'\\\1'。写 的是作为一个普通的字符串,你需要'\\\\\\1'-大家都不希望发生。

I’m surprised no one has mentioned using regular expressions via re.sub():

import re
print re.sub(r'([\"])',    r'\\\1', 'it\'s "this"')  # it's \"this\"
print re.sub(r"([\'])",    r'\\\1', 'it\'s "this"')  # it\'s "this"
print re.sub(r'([\" \'])', r'\\\1', 'it\'s "this"')  # it\'s\ \"this\"

Important things to note:

  • In the search pattern, include \ as well as the character(s) you’re looking for. You’re going to be using \ to escape your characters, so you need to escape that as well.
  • Put parentheses around the search pattern, e.g. ([\"]), so that the substitution pattern can use the found character when it adds \ in front of it. (That’s what \1 does: uses the value of the first parenthesized group.)
  • The r in front of r'([\"])' means it’s a raw string. Raw strings use different rules for escaping backslashes. To write ([\"]) as a plain string, you’d need to double all the backslashes and write '([\\"])'. Raw strings are friendlier when you’re writing regular expressions.
  • In the substitution pattern, you need to escape \ to distinguish it from a backslash that precedes a substitution group, e.g. \1, hence r'\\\1'. To write that as a plain string, you’d need '\\\\\\1' — and nobody wants that.

回答 2

使用repr()[1:-1]。在这种情况下,双引号不需要转义。[-1:1]切片是从开头和结尾删除单引号。

>>> x = raw_input()
I'm "stuck" :\
>>> print x
I'm "stuck" :\
>>> print repr(x)[1:-1]
I\'m "stuck" :\\

或者,也许您只是想转义一个短语以粘贴到您的程序中?如果是这样,请执行以下操作:

>>> raw_input()
I'm "stuck" :\
'I\'m "stuck" :\\'

Use repr()[1:-1]. In this case, the double quotes don’t need to be escaped. The [-1:1] slice is to remove the single quote from the beginning and the end.

>>> x = raw_input()
I'm "stuck" :\
>>> print x
I'm "stuck" :\
>>> print repr(x)[1:-1]
I\'m "stuck" :\\

Or maybe you just want to escape a phrase to paste into your program? If so, do this:

>>> raw_input()
I'm "stuck" :\
'I\'m "stuck" :\\'

回答 3

如上所述,答案取决于您的情况。如果要转义正则表达式的字符串,则应使用re.escape()。但是,如果要转义一组特定的字符,请使用此lambda函数:

>>> escape = lambda s, escapechar, specialchars: "".join(escapechar + c if c in specialchars or c == escapechar else c for c in s)
>>> s = raw_input()
I'm "stuck" :\
>>> print s
I'm "stuck" :\
>>> print escape(s, "\\", ['"'])
I'm \"stuck\" :\\

As it was mentioned above, the answer depends on your case. If you want to escape a string for a regular expression then you should use re.escape(). But if you want to escape a specific set of characters then use this lambda function:

>>> escape = lambda s, escapechar, specialchars: "".join(escapechar + c if c in specialchars or c == escapechar else c for c in s)
>>> s = raw_input()
I'm "stuck" :\
>>> print s
I'm "stuck" :\
>>> print escape(s, "\\", ['"'])
I'm \"stuck\" :\\

回答 4

这并不难:

def escapeSpecialCharacters ( text, characters ):
    for character in characters:
        text = text.replace( character, '\\' + character )
    return text

>>> escapeSpecialCharacters( 'I\'m "stuck" :\\', '\'"' )
'I\\\'m \\"stuck\\" :\\'
>>> print( _ )
I\'m \"stuck\" :\

It’s not that hard:

def escapeSpecialCharacters ( text, characters ):
    for character in characters:
        text = text.replace( character, '\\' + character )
    return text

>>> escapeSpecialCharacters( 'I\'m "stuck" :\\', '\'"' )
'I\\\'m \\"stuck\\" :\\'
>>> print( _ )
I\'m \"stuck\" :\

回答 5

如果只想替换某些字符,则可以使用以下命令:

import re

print re.sub(r'([\.\\\+\*\?\[\^\]\$\(\)\{\}\!\<\>\|\:\-])', r'\\\1', "example string.")

If you only want to replace some characters you could use this:

import re

print re.sub(r'([\.\\\+\*\?\[\^\]\$\(\)\{\}\!\<\>\|\:\-])', r'\\\1', "example string.")

如何逃避os.system()调用?

问题:如何逃避os.system()调用?

使用os.system()时,通常必须转义文件名和其他作为参数传递给命令的参数。我怎样才能做到这一点?最好是可以在多个操作系统/ shell上运行的东西,尤其是bash。

我目前正在执行以下操作,但是请确保为此必须有一个库函数,或者至少是一个更优雅/更强大/更有效的选项:

def sh_escape(s):
   return s.replace("(","\\(").replace(")","\\)").replace(" ","\\ ")

os.system("cat %s | grep something | sort > %s" 
          % (sh_escape(in_filename), 
             sh_escape(out_filename)))

编辑:我已经接受了使用引号的简单答案,不知道为什么我没有想到它;我猜是因为我来自Windows,“和”的行为略有不同。

关于安全性,我理解这个问题,但是在这种情况下,我对os.system()提供的一种快速简便的解决方案感兴趣,并且字符串的来源不是用户生成的,或者至少是由受信任的用户(我)。

When using os.system() it’s often necessary to escape filenames and other arguments passed as parameters to commands. How can I do this? Preferably something that would work on multiple operating systems/shells but in particular for bash.

I’m currently doing the following, but am sure there must be a library function for this, or at least a more elegant/robust/efficient option:

def sh_escape(s):
   return s.replace("(","\\(").replace(")","\\)").replace(" ","\\ ")

os.system("cat %s | grep something | sort > %s" 
          % (sh_escape(in_filename), 
             sh_escape(out_filename)))

Edit: I’ve accepted the simple answer of using quotes, don’t know why I didn’t think of that; I guess because I came from Windows where ‘ and ” behave a little differently.

Regarding security, I understand the concern, but, in this case, I’m interested in a quick and easy solution which os.system() provides, and the source of the strings is either not user-generated or at least entered by a trusted user (me).


回答 0

这是我用的:

def shellquote(s):
    return "'" + s.replace("'", "'\\''") + "'"

外壳程序将始终接受带引号的文件名,并在将其传递给相关程序之前删除引号。值得注意的是,这避免了文件名包含空格或其他任何讨厌的shell元字符的问题。

更新:如果您使用的是Python 3.3或更高版本,请使用shlex.quote而不是自己滚动。

This is what I use:

def shellquote(s):
    return "'" + s.replace("'", "'\\''") + "'"

The shell will always accept a quoted filename and remove the surrounding quotes before passing it to the program in question. Notably, this avoids problems with filenames that contain spaces or any other kind of nasty shell metacharacter.

Update: If you are using Python 3.3 or later, use shlex.quote instead of rolling your own.


回答 1

shlex.quote() 从python 3开始做你想要的事情。

(用于pipes.quote同时支持python 2和python 3)

shlex.quote() does what you want since python 3.

(Use pipes.quote to support both python 2 and python 3)


回答 2

也许您有使用的特定原因os.system()。但是,如果不是这样,您可能应该使用该subprocess模块。您可以直接指定管道,并避免使用外壳。

以下是来自PEP324的内容

Replacing shell pipe line
-------------------------

output=`dmesg | grep hda`
==>
p1 = Popen(["dmesg"], stdout=PIPE)
p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
output = p2.communicate()[0]

Perhaps you have a specific reason for using os.system(). But if not you should probably be using the subprocess module. You can specify the pipes directly and avoid using the shell.

The following is from PEP324:

Replacing shell pipe line
-------------------------

output=`dmesg | grep hda`
==>
p1 = Popen(["dmesg"], stdout=PIPE)
p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
output = p2.communicate()[0]

回答 3

也许subprocess.list2cmdline是更好的选择?

Maybe subprocess.list2cmdline is a better shot?


回答 4

请注意,pipes.quote实际上在Python 2.5和Python 3.1中已损坏,并且不安全使用-它不处理零长度参数。

>>> from pipes import quote
>>> args = ['arg1', '', 'arg3']
>>> print 'mycommand %s' % (' '.join(quote(arg) for arg in args))
mycommand arg1  arg3

参见Python问题7476 ; 它已在Python 2.6和3.2及更高版本中修复。

Note that pipes.quote is actually broken in Python 2.5 and Python 3.1 and not safe to use–It doesn’t handle zero-length arguments.

>>> from pipes import quote
>>> args = ['arg1', '', 'arg3']
>>> print 'mycommand %s' % (' '.join(quote(arg) for arg in args))
mycommand arg1  arg3

See Python issue 7476; it has been fixed in Python 2.6 and 3.2 and newer.


回答 5

注意:这是Python 2.7.x的答案。

根据消息来源,这pipes.quote()是“ 可靠地将字符串作为/ bin / sh的单个参数引用 ”的一种方法。(尽管从2.7版开始不推荐使用,但最终在Python 3.3中公开公开为shlex.quote()函数。)

另一方面subprocess.list2cmdline()是一种方法,“ 翻译的参数的序列到命令行串,使用同样的规则作为MS C运行时 ”。

在这里,我们为平台提供了引用命令行字符串的方式。

import sys
mswindows = (sys.platform == "win32")

if mswindows:
    from subprocess import list2cmdline
    quote_args = list2cmdline
else:
    # POSIX
    from pipes import quote

    def quote_args(seq):
        return ' '.join(quote(arg) for arg in seq)

用法:

# Quote a single argument
print quote_args(['my argument'])

# Quote multiple arguments
my_args = ['This', 'is', 'my arguments']
print quote_args(my_args)

Notice: This is an answer for Python 2.7.x.

According to the source, pipes.quote() is a way to “Reliably quote a string as a single argument for /bin/sh“. (Although it is deprecated since version 2.7 and finally exposed publicly in Python 3.3 as the shlex.quote() function.)

On the other hand, subprocess.list2cmdline() is a way to “Translate a sequence of arguments into a command line string, using the same rules as the MS C runtime“.

Here we are, the platform independent way of quoting strings for command lines.

import sys
mswindows = (sys.platform == "win32")

if mswindows:
    from subprocess import list2cmdline
    quote_args = list2cmdline
else:
    # POSIX
    from pipes import quote

    def quote_args(seq):
        return ' '.join(quote(arg) for arg in seq)

Usage:

# Quote a single argument
print quote_args(['my argument'])

# Quote multiple arguments
my_args = ['This', 'is', 'my arguments']
print quote_args(my_args)

回答 6

我相信os.system只会调用为用户配置的任何命令外壳,因此我认为您不能以与平台无关的方式进行操作。我的命令外壳可以是bash,emacs,ruby甚至quake3中的任何东西。这些程序中的某些程序并不期望您传递给它们的参数的种类,即使它们这样做了,也无法保证它们以相同的方式进行转义。

I believe that os.system just invokes whatever command shell is configured for the user, so I don’t think you can do it in a platform independent way. My command shell could be anything from bash, emacs, ruby, or even quake3. Some of these programs aren’t expecting the kind of arguments you are passing to them and even if they did there is no guarantee they do their escaping the same way.


回答 7

我使用的功能是:

def quote_argument(argument):
    return '"%s"' % (
        argument
        .replace('\\', '\\\\')
        .replace('"', '\\"')
        .replace('$', '\\$')
        .replace('`', '\\`')
    )

即:我总是将参数用双引号引起来,然后用反斜杠将双引号内的特殊字符引起来。

The function I use is:

def quote_argument(argument):
    return '"%s"' % (
        argument
        .replace('\\', '\\\\')
        .replace('"', '\\"')
        .replace('$', '\\$')
        .replace('`', '\\`')
    )

that is: I always enclose the argument in double quotes, and then backslash-quote the only characters special inside double quotes.


回答 8

如果您确实使用了system命令,我将尝试将os.system()调用中的内容列入白名单。

clean_user_input re.sub("[^a-zA-Z]", "", user_input)
os.system("ls %s" % (clean_user_input))

子进程模块是一个更好的选择,我建议尽量避免使用os.system / subprocess之类的东西。

If you do use the system command, I would try and whitelist what goes into the os.system() call.. For example..

clean_user_input re.sub("[^a-zA-Z]", "", user_input)
os.system("ls %s" % (clean_user_input))

The subprocess module is a better option, and I would recommend trying to avoid using anything like os.system/subprocess wherever possible.


回答 9

真正的答案是:首先不要使用os.system()。请subprocess.call改用并提供未转义的参数。

The real answer is: Don’t use os.system() in the first place. Use subprocess.call instead and supply the unescaped arguments.


在Python中处理字符串中的转义序列

问题:在Python中处理字符串中的转义序列

有时,当我从文件或用户那里得到输入时,我会得到一个带有转义序列的字符串。我想以与Python处理字符串文字中的转义序列相同的方式来处理转义序列

例如,假设myString定义为:

>>> myString = "spam\\neggs"
>>> print(myString)
spam\neggs

我想要一个process执行此操作的函数(我称之为):

>>> print(process(myString))
spam
eggs

该函数可以处理Python中的所有转义序列(在上面的链接的表格中列出),这一点很重要。

Python是否具有执行此操作的功能?

Sometimes when I get input from a file or the user, I get a string with escape sequences in it. I would like to process the escape sequences in the same way that Python processes escape sequences in string literals.

For example, let’s say myString is defined as:

>>> myString = "spam\\neggs"
>>> print(myString)
spam\neggs

I want a function (I’ll call it process) that does this:

>>> print(process(myString))
spam
eggs

It’s important that the function can process all of the escape sequences in Python (listed in a table in the link above).

Does Python have a function to do this?


回答 0

正确的做法是使用“字符串转义”代码对字符串进行解码。

>>> myString = "spam\\neggs"
>>> decoded_string = bytes(myString, "utf-8").decode("unicode_escape") # python3 
>>> decoded_string = myString.decode('string_escape') # python2
>>> print(decoded_string)
spam
eggs

不要使用AST或eval。使用字符串编解码器更加安全。

The correct thing to do is use the ‘string-escape’ code to decode the string.

>>> myString = "spam\\neggs"
>>> decoded_string = bytes(myString, "utf-8").decode("unicode_escape") # python3 
>>> decoded_string = myString.decode('string_escape') # python2
>>> print(decoded_string)
spam
eggs

Don’t use the AST or eval. Using the string codecs is much safer.


回答 1

unicode_escape 总的来说不起作用

事实证明,string_escapeor unicode_escape解决方案通常无法正常工作-尤其是在存在实际Unicode的情况下,它不能正常工作。

如果您可以确定每个非ASCII字符都会被转义(并且请记住,前128个字符以外的任何字符都是非ASCII),unicode_escape将为您做正确的事。但是,如果您的字符串中已经有任何文字上的非ASCII字符,则会出错。

unicode_escape从根本上来说是设计用来将字节转换为Unicode文本。但是在许多地方(例如Python源代码),源数据已经是Unicode文本。

唯一可以正常工作的方法是首先将文本编码为字节。UTF-8是所有文本的明智编码,因此应该可以使用,对吧?

以下示例是Python 3中的示例,因此字符串文字更清晰,但在Python 2和3上,存在相同的问题,但表现形式略有不同。

>>> s = 'naïve \\t test'
>>> print(s.encode('utf-8').decode('unicode_escape'))
naïve   test

好吧,那是错误的。

建议使用编解码器将文本解码为文本的新方法是codecs.decode直接调用。有帮助吗?

>>> import codecs
>>> print(codecs.decode(s, 'unicode_escape'))
naïve   test

一点也不。(此外,以上是Python 2上的UnicodeError。)

unicode_escape编解码器,尽管它的名字,原来假设所有非ASCII字节拉丁-1(ISO-8859-1)编码。因此,您必须这样做:

>>> print(s.encode('latin-1').decode('unicode_escape'))
naïve    test

但这太可怕了。这将您限制为256个Latin-1字符,就好像根本没有发明Unicode一样!

>>> print('Ernő \\t Rubik'.encode('latin-1').decode('unicode_escape'))
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0151'
in position 3: ordinal not in range(256)

添加正则表达式以解决问题

(令人惊讶的是,我们现在没有两个问题。)

我们需要做的只是将unicode_escape解码器应用于我们确定为ASCII文本的内容。特别是,我们可以确保仅将其应用于有效的Python转义序列,这些序列必须保证为ASCII文本。

计划是,我们将使用正则表达式查找转义序列,并使用函数作为参数以re.sub将其替换为未转义的值。

import re
import codecs

ESCAPE_SEQUENCE_RE = re.compile(r'''
    ( \\U........      # 8-digit hex escapes
    | \\u....          # 4-digit hex escapes
    | \\x..            # 2-digit hex escapes
    | \\[0-7]{1,3}     # Octal escapes
    | \\N\{[^}]+\}     # Unicode characters by name
    | \\[\\'"abfnrtv]  # Single-character escapes
    )''', re.UNICODE | re.VERBOSE)

def decode_escapes(s):
    def decode_match(match):
        return codecs.decode(match.group(0), 'unicode-escape')

    return ESCAPE_SEQUENCE_RE.sub(decode_match, s)

然后:

>>> print(decode_escapes('Ernő \\t Rubik'))
Ernő     Rubik

unicode_escape doesn’t work in general

It turns out that the string_escape or unicode_escape solution does not work in general — particularly, it doesn’t work in the presence of actual Unicode.

If you can be sure that every non-ASCII character will be escaped (and remember, anything beyond the first 128 characters is non-ASCII), unicode_escape will do the right thing for you. But if there are any literal non-ASCII characters already in your string, things will go wrong.

unicode_escape is fundamentally designed to convert bytes into Unicode text. But in many places — for example, Python source code — the source data is already Unicode text.

The only way this can work correctly is if you encode the text into bytes first. UTF-8 is the sensible encoding for all text, so that should work, right?

The following examples are in Python 3, so that the string literals are cleaner, but the same problem exists with slightly different manifestations on both Python 2 and 3.

>>> s = 'naïve \\t test'
>>> print(s.encode('utf-8').decode('unicode_escape'))
naïve   test

Well, that’s wrong.

The new recommended way to use codecs that decode text into text is to call codecs.decode directly. Does that help?

>>> import codecs
>>> print(codecs.decode(s, 'unicode_escape'))
naïve   test

Not at all. (Also, the above is a UnicodeError on Python 2.)

The unicode_escape codec, despite its name, turns out to assume that all non-ASCII bytes are in the Latin-1 (ISO-8859-1) encoding. So you would have to do it like this:

>>> print(s.encode('latin-1').decode('unicode_escape'))
naïve    test

But that’s terrible. This limits you to the 256 Latin-1 characters, as if Unicode had never been invented at all!

>>> print('Ernő \\t Rubik'.encode('latin-1').decode('unicode_escape'))
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0151'
in position 3: ordinal not in range(256)

Adding a regular expression to solve the problem

(Surprisingly, we do not now have two problems.)

What we need to do is only apply the unicode_escape decoder to things that we are certain to be ASCII text. In particular, we can make sure only to apply it to valid Python escape sequences, which are guaranteed to be ASCII text.

The plan is, we’ll find escape sequences using a regular expression, and use a function as the argument to re.sub to replace them with their unescaped value.

import re
import codecs

ESCAPE_SEQUENCE_RE = re.compile(r'''
    ( \\U........      # 8-digit hex escapes
    | \\u....          # 4-digit hex escapes
    | \\x..            # 2-digit hex escapes
    | \\[0-7]{1,3}     # Octal escapes
    | \\N\{[^}]+\}     # Unicode characters by name
    | \\[\\'"abfnrtv]  # Single-character escapes
    )''', re.UNICODE | re.VERBOSE)

def decode_escapes(s):
    def decode_match(match):
        return codecs.decode(match.group(0), 'unicode-escape')

    return ESCAPE_SEQUENCE_RE.sub(decode_match, s)

And with that:

>>> print(decode_escapes('Ernő \\t Rubik'))
Ernő     Rubik

回答 2

python 3的实际正确答案:

>>> import codecs
>>> myString = "spam\\neggs"
>>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
spam
eggs
>>> myString = "naïve \\t test"
>>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
naïve    test

有关的详细信息codecs.escape_decode

  • codecs.escape_decode 是一个逐字节解码器
  • codecs.escape_decode解码ascii转义序列,例如:b"\\n"-> b"\n"b"\\xce"-> b"\xce"
  • codecs.escape_decode 不需要或不需要了解字节对象的编码,但是转义字节的编码应与对象其余部分的编码匹配。

背景:

The actually correct and convenient answer for python 3:

>>> import codecs
>>> myString = "spam\\neggs"
>>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
spam
eggs
>>> myString = "naïve \\t test"
>>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
naïve    test

Details regarding codecs.escape_decode:

  • codecs.escape_decode is a bytes-to-bytes decoder
  • codecs.escape_decode decodes ascii escape sequences, such as: b"\\n" -> b"\n", b"\\xce" -> b"\xce".
  • codecs.escape_decode does not care or need to know about the byte object’s encoding, but the encoding of the escaped bytes should match the encoding of the rest of the object.

Background:

  • @rspeer is correct: unicode_escape is the incorrect solution for python3. This is because unicode_escape decodes escaped bytes, then decodes bytes to unicode string, but receives no information regarding which codec to use for the second operation.
  • @Jerub is correct: avoid the AST or eval.
  • I first discovered codecs.escape_decode from this answer to “how do I .decode(‘string-escape’) in Python3?”. As that answer states, that function is currently not documented for python 3.

回答 3

ast.literal_eval函数将关闭,但是它将期望该字符串先被正确引用。

当然反斜杠Python的解释依赖于字符串的方式引用(""VS r""VS u"",三引号等),所以你可能想包装在合适的报价的用户输入和传递给literal_eval。将其包装在引号中还可以防止literal_eval返回数字,元组,字典等。

如果用户键入您打算在字符串周围使用的引号引起来,事情可能仍然会变得棘手。

The ast.literal_eval function comes close, but it will expect the string to be properly quoted first.

Of course Python’s interpretation of backslash escapes depends on how the string is quoted ("" vs r"" vs u"", triple quotes, etc) so you may want to wrap the user input in suitable quotes and pass to literal_eval. Wrapping it in quotes will also prevent literal_eval from returning a number, tuple, dictionary, etc.

Things still might get tricky if the user types unquoted quotes of the type you intend to wrap around the string.


回答 4

这是一个不好的方法,但是当我尝试解释在字符串参数中传递的转义八进制时,它对我有用。

input_string = eval('b"' + sys.argv[1] + '"')

值得一提的是,eval和ast.literal_eval之间存在区别(eval更加不安全)。请参阅使用python的eval()与ast.literal_eval()吗?

This is a bad way of doing it, but it worked for me when trying to interpret escaped octals passed in a string argument.

input_string = eval('b"' + sys.argv[1] + '"')

It’s worth mentioning that there is a difference between eval and ast.literal_eval (eval being way more unsafe). See Using python’s eval() vs. ast.literal_eval()?


回答 5

下面的代码应该适用于\ n,要求将其显示在字符串上。

import string

our_str = 'The String is \\n, \\n and \\n!'
new_str = string.replace(our_str, '/\\n', '/\n', 1)
print(new_str)

Below code should work for \n is required to be displayed on the string.

import string

our_str = 'The String is \\n, \\n and \\n!'
new_str = string.replace(our_str, '/\\n', '/\n', 1)
print(new_str)

如何在python中写字符串文字而不必转义它们?

问题:如何在python中写字符串文字而不必转义它们?

有没有一种方法可以在python中声明字符串变量,以使其中的所有内容都自动转义,或者具有其原义字符值?

不是问如何用斜杠将引号转义,这是显而易见的。我要的是一种通用的方式,使所有内容都以字符串文字形式显示,这样我就不必手动检查并转义非常大的字符串的所有内容。有人知道解决方案吗?谢谢!

Is there a way to declare a string variable in python such that everything inside of it is automatically escaped, or has its literal character value?

I’m not asking how to escape the quotes with slashes, that’s obvious. What I’m asking for is a general purpose way for making everything in a string literal so that I don’t have to manually go through and escape everything for very large strings. Anyone know of a solution? Thanks!


回答 0

原始字符串文字:

>>> r'abc\dev\t'
'abc\\dev\\t'

Raw string literals:

>>> r'abc\dev\t'
'abc\\dev\\t'

回答 1

如果要处理非常大的字符串,特别是多行字符串,请注意三引号语法:

a = r"""This is a multiline string
with more than one line
in the source code."""

If you’re dealing with very large strings, specifically multiline strings, be aware of the triple-quote syntax:

a = r"""This is a multiline string
with more than one line
in the source code."""

回答 2

哪有这回事。看起来您想要在Perl和shell中提供“此处文档”之类的东西,但是Python没有。

仅使用原始字符串或多行字符串意味着不必担心太多事情。如果您使用原始字符串,则仍然必须在终端“ \”周围解决,并且使用任何字符串解决方案,都必须担心如果数据中包含了结束符“,’,’,’或”“” 。

也就是说,无法获取字符串

 '   ''' """  " \

正确存储在任何Python字符串文字中,而无需进行某种内部转义。

There is no such thing. It looks like you want something like “here documents” in Perl and the shells, but Python doesn’t have that.

Using raw strings or multiline strings only means that there are fewer things to worry about. If you use a raw string then you still have to work around a terminal “\” and with any string solution you’ll have to worry about the closing “, ‘, ”’ or “”” if it is included in your data.

That is, there’s no way to have the string

 '   ''' """  " \

properly stored in any Python string literal without internal escaping of some sort.


回答 3

您可以在这里找到Python的字符串文字文档:

http://docs.python.org/tutorial/introduction.html#strings

和这里:

http://docs.python.org/reference/lexical_analysis.html#literals

最简单的示例将使用’r’前缀:

ss = r'Hello\nWorld'
print(ss)
Hello\nWorld

You will find Python’s string literal documentation here:

http://docs.python.org/tutorial/introduction.html#strings

and here:

http://docs.python.org/reference/lexical_analysis.html#literals

The simplest example would be using the ‘r’ prefix:

ss = r'Hello\nWorld'
print(ss)
Hello\nWorld

回答 4

(假设您不需要直接在Python代码中输入字符串)

为了解决安德鲁·达尔克(Andrew Dalke)指出的问题,只需在文本文件中键入文字字符串,然后使用它即可;

input_ = '/directory_of_text_file/your_text_file.txt' 
input_open   = open(input_,'r+')
input_string = input_open.read()

print input_string

这将打印文本文件中任何内容的文字文本,即使它是;

 '   ''' """   \

虽然不是很有趣,也不是最佳选择,但它很有用,尤其是当您有3页需要转义字符的代码时。

(Assuming you are not required to input the string from directly within Python code)

to get around the Issue Andrew Dalke pointed out, simply type the literal string into a text file and then use this;

input_ = '/directory_of_text_file/your_text_file.txt' 
input_open   = open(input_,'r+')
input_string = input_open.read()

print input_string

This will print the literal text of whatever is in the text file, even if it is;

 '   ''' """  “ \

Not fun or optimal, but can be useful, especially if you have 3 pages of code that would’ve needed character escaping.


回答 5

如果string是变量,请使用。repr方法就可以了:

>>> s = '\tgherkin\n'

>>> s
'\tgherkin\n'

>>> print(s)
    gherkin

>>> print(s.__repr__())
'\tgherkin\n'

if string is a variable, use the .repr method on it:

>>> s = '\tgherkin\n'

>>> s
'\tgherkin\n'

>>> print(s)
    gherkin

>>> print(s.__repr__())
'\tgherkin\n'

如何取消转义的反斜杠字符串?

问题:如何取消转义的反斜杠字符串?

假设我有一个字符串,它是另一个字符串的反斜杠转义版本。在Python中,有没有一种简便的方法可以使字符串不转义?例如,我可以这样做:

>>> escaped_str = '"Hello,\\nworld!"'
>>> raw_str = eval(escaped_str)
>>> print raw_str
Hello,
world!
>>> 

但是,这涉及将(可能不受信任的)字符串传递给eval(),这是安全隐患。标准库中是否有一个函数可以接收一个字符串并生成一个不涉及安全性的字符串?

Suppose I have a string which is a backslash-escaped version of another string. Is there an easy way, in Python, to unescape the string? I could, for example, do:

>>> escaped_str = '"Hello,\\nworld!"'
>>> raw_str = eval(escaped_str)
>>> print raw_str
Hello,
world!
>>> 

However that involves passing a (possibly untrusted) string to eval() which is a security risk. Is there a function in the standard lib which takes a string and produces a string with no security implications?


回答 0

>>> print '"Hello,\\nworld!"'.decode('string_escape')
"Hello,
world!"
>>> print '"Hello,\\nworld!"'.decode('string_escape')
"Hello,
world!"

回答 1

您可以使用ast.literal_eval哪个是安全的:

安全地评估表达式节点或包含Python表达式的字符串。提供的字符串或节点只能由以下Python文字结构组成:字符串,数字,元组,列表,字典,布尔值和无。(结束)

像这样:

>>> import ast
>>> escaped_str = '"Hello,\\nworld!"'
>>> print ast.literal_eval(escaped_str)
Hello,
world!

You can use ast.literal_eval which is safe:

Safely evaluate an expression node or a string containing a Python expression. The string or node provided may only consist of the following Python literal structures: strings, numbers, tuples, lists, dicts, booleans, and None. (END)

Like this:

>>> import ast
>>> escaped_str = '"Hello,\\nworld!"'
>>> print ast.literal_eval(escaped_str)
Hello,
world!

回答 2

所有给出的答案将在通用Unicode字符串上中断。据我所知,以下代码在所有情况下都适用于Python3:

from codecs import encode, decode
sample = u'mon€y\\nröcks'
result = decode(encode(sample, 'latin-1', 'backslashreplace'), 'unicode-escape')
print(result)

如注释中所述,您还可以像下面这样使用模块中的literal_eval方法ast

import ast
sample = u'mon€y\\nröcks'
print(ast.literal_eval(F'"{sample}"'))

当您的字符串确实包含字符串文字(包括引号)时,也可以这样:

import ast
sample = u'"mon€y\\nröcks"'
print(ast.literal_eval(sample))

但是,如果不确定输入字符串是使用双引号还是单引号作为定界符,或者不确定根本不能正确转义输入字符串,则literal_eval可能会花点时间SyntaxError编码/解码方法仍然有效。

All given answers will break on general Unicode strings. The following works for Python3 in all cases, as far as I can tell:

from codecs import encode, decode
sample = u'mon€y\\nröcks'
result = decode(encode(sample, 'latin-1', 'backslashreplace'), 'unicode-escape')
print(result)

As outlined in the comments, you can also use the literal_eval method from the ast module like so:

import ast
sample = u'mon€y\\nröcks'
print(ast.literal_eval(F'"{sample}"'))

Or like this when your string really contains a string literal (including the quotes):

import ast
sample = u'"mon€y\\nröcks"'
print(ast.literal_eval(sample))

However, if you are uncertain whether the input string uses double or single quotes as delimiters, or when you cannot assume it to be properly escaped at all, then literal_eval may raise a SyntaxError while the encode/decode method will still work.


回答 3

在python 3中,str对象没有decode方法,您必须使用bytes对象。ChristopheD的答案涵盖了python 2。

# create a `bytes` object from a `str`
my_str = "Hello,\\nworld"
# (pick an encoding suitable for your str, e.g. 'latin1')
my_bytes = my_str.encode("utf-8")

# or directly
my_bytes = b"Hello,\\nworld"

print(my_bytes.decode("unicode_escape"))
# "Hello,
# world"

In python 3, str objects don’t have a decode method and you have to use a bytes object. ChristopheD’s answer covers python 2.

# create a `bytes` object from a `str`
my_str = "Hello,\\nworld"
# (pick an encoding suitable for your str, e.g. 'latin1')
my_bytes = my_str.encode("utf-8")

# or directly
my_bytes = b"Hello,\\nworld"

print(my_bytes.decode("unicode_escape"))
# "Hello,
# world"

如何在正则表达式中使用变量?

问题:如何在正则表达式中使用变量?

我想在a variable内部使用regex,该怎么办Python

TEXTO = sys.argv[1]

if re.search(r"\b(?=\w)TEXTO\b(?!\w)", subject, re.IGNORECASE):
    # Successful match
else:
    # Match attempt failed

I’d like to use a variable inside a regex, how can I do this in Python?

TEXTO = sys.argv[1]

if re.search(r"\b(?=\w)TEXTO\b(?!\w)", subject, re.IGNORECASE):
    # Successful match
else:
    # Match attempt failed

回答 0

从python 3.6开始,您还可以使用文字字符串插值(“ f-strings”)。在您的特定情况下,解决方案是:

if re.search(rf"\b(?=\w){TEXTO}\b(?!\w)", subject, re.IGNORECASE):
    ...do something

编辑:

既然评论中存在一些有关如何处理特殊字符的问题,我想扩展一下我的答案:

原始字符串(’r’):

在正则表达式中处理特殊字符时,您必须了解的主要概念之一是区分字符串文字和正则表达式本身。这是很好的解释在这里

简而言之:

假设您要匹配字符串\b之后,而不是查找单词边界。你必须写:TEXTO\boundary

TEXTO = "Var"
subject = r"Var\boundary"

if re.search(rf"\b(?=\w){TEXTO}\\boundary(?!\w)", subject, re.IGNORECASE):
    print("match")

这仅起作用,因为我们使用的是原始字符串(正则表达式以’r’开头),否则我们必须在正则表达式中写入“ \\\\ boundary”(四个反斜杠)。另外,如果没有’\ r’,\ b’将不再转换为单词边界,而是转换为退格键!

重新转义

基本上在任何特殊字符的前面放置一个空格。因此,如果您希望TEXTO中有特殊字符,则需要编写:

if re.search(rf"\b(?=\w){re.escape(TEXTO)}\b(?!\w)", subject, re.IGNORECASE):
    print("match")

注:对于任何版本> = 3.7蟒:!"%',/:;<=>@,和`都没有逃脱。仅对正则表达式中具有含义的特殊字符进行转义。_因为Python 3.3没有逃脱。(送。这里

大括号:

如果要在使用f字符串的正则表达式中使用量词,则必须使用双花括号。假设您要匹配TEXTO,然后再精确匹配2位数字:

if re.search(rf"\b(?=\w){re.escape(TEXTO)}\d{{2}}\b(?!\w)", subject, re.IGNORECASE):
    print("match")

From python 3.6 on you can also use Literal String Interpolation, “f-strings”. In your particular case the solution would be:

if re.search(rf"\b(?=\w){TEXTO}\b(?!\w)", subject, re.IGNORECASE):
    ...do something

EDIT:

Since there have been some questions in the comment on how to deal with special characters I’d like to extend my answer:

raw strings (‘r’):

One of the main concepts you have to understand when dealing with special characters in regular expressions is to distinguish between string literals and the regular expression itself. It is very well explained here:

In short:

Let’s say instead of finding a word boundary \b after TEXTO you want to match the string \boundary. The you have to write:

TEXTO = "Var"
subject = r"Var\boundary"

if re.search(rf"\b(?=\w){TEXTO}\\boundary(?!\w)", subject, re.IGNORECASE):
    print("match")

This only works because we are using a raw-string (the regex is preceded by ‘r’), otherwise we must write “\\\\boundary” in the regex (four backslashes). Additionally, without ‘\r’, \b’ would not converted to a word boundary anymore but to a backspace!

re.escape:

Basically puts a backspace in front of any special character. Hence, if you expect a special character in TEXTO, you need to write:

if re.search(rf"\b(?=\w){re.escape(TEXTO)}\b(?!\w)", subject, re.IGNORECASE):
    print("match")

NOTE: For any version >= python 3.7: !, ", %, ', ,, /, :, ;, <, =, >, @, and ` are not escaped. Only special characters with meaning in a regex are still escaped. _ is not escaped since Python 3.3.(s. here)

Curly braces:

If you want to use quantifiers within the regular expression using f-strings, you have to use double curly braces. Let’s say you want to match TEXTO followed by exactly 2 digits:

if re.search(rf"\b(?=\w){re.escape(TEXTO)}\d{{2}}\b(?!\w)", subject, re.IGNORECASE):
    print("match")

回答 1

您必须将正则表达式构建为字符串:

TEXTO = sys.argv[1]
my_regex = r"\b(?=\w)" + re.escape(TEXTO) + r"\b(?!\w)"

if re.search(my_regex, subject, re.IGNORECASE):
    etc.

请注意使用,re.escape这样如果您的文本中包含特殊字符,则不会这样解释它们。

You have to build the regex as a string:

TEXTO = sys.argv[1]
my_regex = r"\b(?=\w)" + re.escape(TEXTO) + r"\b(?!\w)"

if re.search(my_regex, subject, re.IGNORECASE):
    etc.

Note the use of re.escape so that if your text has special characters, they won’t be interpreted as such.


回答 2

if re.search(r"\b(?<=\w)%s\b(?!\w)" % TEXTO, subject, re.IGNORECASE):

这会将TEXTO中的内容作为字符串插入到正则表达式中。

if re.search(r"\b(?<=\w)%s\b(?!\w)" % TEXTO, subject, re.IGNORECASE):

This will insert what is in TEXTO into the regex as a string.


回答 3

rx = r'\b(?<=\w){0}\b(?!\w)'.format(TEXTO)
rx = r'\b(?<=\w){0}\b(?!\w)'.format(TEXTO)

回答 4

我发现通过将多个较小的模式串在一起来构建正则表达式模式非常方便。

import re

string = "begin:id1:tag:middl:id2:tag:id3:end"
re_str1 = r'(?<=(\S{5})):'
re_str2 = r'(id\d+):(?=tag:)'
re_pattern = re.compile(re_str1 + re_str2)
match = re_pattern.findall(string)
print(match)

输出:

[('begin', 'id1'), ('middl', 'id2')]

I find it very convenient to build a regular expression pattern by stringing together multiple smaller patterns.

import re

string = "begin:id1:tag:middl:id2:tag:id3:end"
re_str1 = r'(?<=(\S{5})):'
re_str2 = r'(id\d+):(?=tag:)'
re_pattern = re.compile(re_str1 + re_str2)
match = re_pattern.findall(string)
print(match)

Output:

[('begin', 'id1'), ('middl', 'id2')]

回答 5

我同意以上所有条件,除非:

sys.argv[1] 就像 Chicken\d{2}-\d{2}An\s*important\s*anchor

sys.argv[1] = "Chicken\d{2}-\d{2}An\s*important\s*anchor"

您不想使用re.escape,因为在这种情况下,您希望它的行为类似于正则表达式

TEXTO = sys.argv[1]

if re.search(r"\b(?<=\w)" + TEXTO + "\b(?!\w)", subject, re.IGNORECASE):
    # Successful match
else:
    # Match attempt failed

I agree with all the above unless:

sys.argv[1] was something like Chicken\d{2}-\d{2}An\s*important\s*anchor

sys.argv[1] = "Chicken\d{2}-\d{2}An\s*important\s*anchor"

you would not want to use re.escape, because in that case you would like it to behave like a regex

TEXTO = sys.argv[1]

if re.search(r"\b(?<=\w)" + TEXTO + "\b(?!\w)", subject, re.IGNORECASE):
    # Successful match
else:
    # Match attempt failed

回答 6

我需要搜索彼此相似的用户名,Ned Batchelder所说的话非常有用。但是,当我使用re.compile创建我的搜索项时,发现输出更清晰:

pattern = re.compile(r"("+username+".*):(.*?):(.*?):(.*?):(.*)"
matches = re.findall(pattern, lines)

可以使用以下命令打印输出:

print(matches[1]) # prints one whole matching line (in this case, the first line)
print(matches[1][3]) # prints the fourth character group (established with the parentheses in the regex statement) of the first line.

I needed to search for usernames that are similar to each other, and what Ned Batchelder said was incredibly helpful. However, I found I had cleaner output when I used re.compile to create my re search term:

pattern = re.compile(r"("+username+".*):(.*?):(.*?):(.*?):(.*)"
matches = re.findall(pattern, lines)

Output can be printed using the following:

print(matches[1]) # prints one whole matching line (in this case, the first line)
print(matches[1][3]) # prints the fourth character group (established with the parentheses in the regex statement) of the first line.

回答 7

您可以使用formatgrammer suger 尝试另一种用法:

re_genre = r'{}'.format(your_variable)
regex_pattern = re.compile(re_genre)  

you can try another usage using format grammer suger:

re_genre = r'{}'.format(your_variable)
regex_pattern = re.compile(re_genre)  

回答 8

您也可以为此使用format关键字。Format方法将{}占位符替换为您作为参数传递给format方法的变量。

if re.search(r"\b(?=\w)**{}**\b(?!\w)".**format(TEXTO)**, subject, re.IGNORECASE):
    # Successful match**strong text**
else:
    # Match attempt failed

You can use format keyword as well for this.Format method will replace {} placeholder to the variable which you passed to the format method as an argument.

if re.search(r"\b(?=\w)**{}**\b(?!\w)".**format(TEXTO)**, subject, re.IGNORECASE):
    # Successful match**strong text**
else:
    # Match attempt failed

回答 9

更多例子

我有带有流文件的configus.yml

"pattern":
  - _(\d{14})_
"datetime_string":
  - "%m%d%Y%H%M%f"

在我使用的python代码中

data_time_real_file=re.findall(r""+flows[flow]["pattern"][0]+"", latest_file)

more example

I have configus.yml with flows files

"pattern":
  - _(\d{14})_
"datetime_string":
  - "%m%d%Y%H%M%f"

in python code I use

data_time_real_file=re.findall(r""+flows[flow]["pattern"][0]+"", latest_file)

如何在Python字符串中有选择地转义百分比(%)?

问题:如何在Python字符串中有选择地转义百分比(%)?

我有以下代码

test = "have it break."
selectiveEscape = "Print percent % in sentence and not %s" % test

print(selectiveEscape)

我想获得输出:

Print percent % in sentence and not have it break.

实际发生的情况:

    selectiveEscape = "Use percent % in sentence and not %s" % test
TypeError: %d format: a number is required, not str

I have the following code

test = "have it break."
selectiveEscape = "Print percent % in sentence and not %s" % test

print(selectiveEscape)

I would like to get the output:

Print percent % in sentence and not have it break.

What actually happens:

    selectiveEscape = "Use percent % in sentence and not %s" % test
TypeError: %d format: a number is required, not str

回答 0

>>> test = "have it break."
>>> selectiveEscape = "Print percent %% in sentence and not %s" % test
>>> print selectiveEscape
Print percent % in sentence and not have it break.
>>> test = "have it break."
>>> selectiveEscape = "Print percent %% in sentence and not %s" % test
>>> print selectiveEscape
Print percent % in sentence and not have it break.

回答 1

另外,从Python 2.6开始,您可以使用新的字符串格式(如PEP 3101中所述):

'Print percent % in sentence and not {0}'.format(test)

当您的弦变得越来越复杂时,这尤其方便。

Alternatively, as of Python 2.6, you can use new string formatting (described in PEP 3101):

'Print percent % in sentence and not {0}'.format(test)

which is especially handy as your strings get more complicated.


回答 2

尝试使用%%打印%符号。

try using %% to print % sign .


回答 3

您不能选择性地转义%,因为%根据以下字符,它总是具有特殊的含义。

在Python 文档中,该部分第二个表格的底部指出:

'%'        No argument is converted, results in a '%' character in the result.

因此,您应该使用:

selectiveEscape = "Print percent %% in sentence and not %s" % (test, )

(请注意,将元组的显式更改作为的参数%

在不了解上述情况的情况下,我会这样做:

selectiveEscape = "Print percent %s in sentence and not %s" % ('%', test)

显然你已经有了知识。

You can’t selectively escape %, as % always has a special meaning depending on the following character.

In the documentation of Python, at the bottem of the second table in that section, it states:

'%'        No argument is converted, results in a '%' character in the result.

Therefore you should use:

selectiveEscape = "Print percent %% in sentence and not %s" % (test, )

(please note the expicit change to tuple as argument to %)

Without knowing about the above, I would have done:

selectiveEscape = "Print percent %s in sentence and not %s" % ('%', test)

with the knowledge you obviously already had.


回答 4

如果从文件中读取了格式模板,并且不能确保内容将百分号加倍,则可能必须检测百分号并以编程方式确定它是否是占位符的开始。然后,解析器还应该识别类似%d(以及可以使用的其他字母)之类的序列,也应如此%(xxx)s

使用新格式可以观察到类似的问题-文本可以包含花括号。

If the formatting template was read from a file, and you cannot ensure the content doubles the percent sign, then you probably have to detect the percent character and decide programmatically whether it is the start of a placeholder or not. Then the parser should also recognize sequences like %d (and other letters that can be used), but also %(xxx)s etc.

Similar problem can be observed with the new formats — the text can contain curly braces.


回答 5

如果您使用的是Python 3.6或更高版本,则可以使用f-string

>>> test = "have it break."
>>> selectiveEscape = f"Print percent % in sentence and not {test}"
>>> print(selectiveEscape)
... Print percent % in sentence and not have it break.

If you are using Python 3.6 or newer, you can use f-string:

>>> test = "have it break."
>>> selectiveEscape = f"Print percent % in sentence and not {test}"
>>> print(selectiveEscape)
... Print percent % in sentence and not have it break.

回答 6

我尝试了不同的方法来打印子图标题,看看它们是如何工作的。当我使用乳胶时,情况有所不同。

在典型情况下,它适用于’%%’和’string’+’%’。

如果您使用Latex,则可以使用’string’+’\%’

因此,在典型情况下:

import matplotlib.pyplot as plt
fig,ax = plt.subplots(4,1)
float_number = 4.17
ax[0].set_title('Total: (%1.2f' %float_number + '\%)')
ax[1].set_title('Total: (%1.2f%%)' %float_number)
ax[2].set_title('Total: (%1.2f' %float_number + '%%)')
ax[3].set_title('Total: (%1.2f' %float_number + '%)')

带有%的标题示例

如果我们使用乳胶:

import matplotlib.pyplot as plt
import matplotlib
font = {'family' : 'normal',
        'weight' : 'bold',
        'size'   : 12}
matplotlib.rc('font', **font)
matplotlib.rcParams['text.usetex'] = True
matplotlib.rcParams['text.latex.unicode'] = True
fig,ax = plt.subplots(4,1)
float_number = 4.17
#ax[0].set_title('Total: (%1.2f\%)' %float_number) This makes python crash
ax[1].set_title('Total: (%1.2f%%)' %float_number)
ax[2].set_title('Total: (%1.2f' %float_number + '%%)')
ax[3].set_title('Total: (%1.2f' %float_number + '\%)')

我们得到这样的结果: 具有%和乳胶的标题示例

I have tried different methods to print a subplot title, look how they work. It’s different when i use Latex.

It works with ‘%%’ and ‘string’+’%’ in a typical case.

If you use Latex it worked using ‘string’+’\%’

So in a typical case:

import matplotlib.pyplot as plt
fig,ax = plt.subplots(4,1)
float_number = 4.17
ax[0].set_title('Total: (%1.2f' %float_number + '\%)')
ax[1].set_title('Total: (%1.2f%%)' %float_number)
ax[2].set_title('Total: (%1.2f' %float_number + '%%)')
ax[3].set_title('Total: (%1.2f' %float_number + '%)')

Title examples with %

If we use latex:

import matplotlib.pyplot as plt
import matplotlib
font = {'family' : 'normal',
        'weight' : 'bold',
        'size'   : 12}
matplotlib.rc('font', **font)
matplotlib.rcParams['text.usetex'] = True
matplotlib.rcParams['text.latex.unicode'] = True
fig,ax = plt.subplots(4,1)
float_number = 4.17
#ax[0].set_title('Total: (%1.2f\%)' %float_number) This makes python crash
ax[1].set_title('Total: (%1.2f%%)' %float_number)
ax[2].set_title('Total: (%1.2f' %float_number + '%%)')
ax[3].set_title('Total: (%1.2f' %float_number + '\%)')

We get this: Title example with % and latex


将utf-8文本保存在json.dumps中为UTF8,而不是\ u转义序列

问题:将utf-8文本保存在json.dumps中为UTF8,而不是\ u转义序列

样例代码:

>>> import json
>>> json_string = json.dumps("ברי צקלה")
>>> print json_string
"\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"

问题:这不是人类可读的。我的(智能)用户想要使用JSON转储来验证甚至编辑文本文件(我宁愿不使用XML)。

有没有一种方法可以将对象序列化为UTF-8 JSON字符串(而不是 \uXXXX)?

sample code:

>>> import json
>>> json_string = json.dumps("ברי צקלה")
>>> print json_string
"\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"

The problem: it’s not human readable. My (smart) users want to verify or even edit text files with JSON dumps (and I’d rather not use XML).

Is there a way to serialize objects into UTF-8 JSON strings (instead of \uXXXX)?


回答 0

使用ensure_ascii=False切换至json.dumps(),然后手动将值编码为UTF-8:

>>> json_string = json.dumps("ברי צקלה", ensure_ascii=False).encode('utf8')
>>> json_string
b'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
>>> print(json_string.decode())
"ברי צקלה"

如果要写入文件,只需使用json.dump()并将其留给文件对象进行编码:

with open('filename', 'w', encoding='utf8') as json_file:
    json.dump("ברי צקלה", json_file, ensure_ascii=False)

Python 2警告

对于Python 2,还有更多注意事项需要考虑。如果要将其写入文件,则可以使用io.open()代替open()来生成一个文件对象,该对象在编写时为您编码Unicode值,然后使用json.dump()代替来写入该文件:

with io.open('filename', 'w', encoding='utf8') as json_file:
    json.dump(u"ברי צקלה", json_file, ensure_ascii=False)

做笔记,有一对在错误json模块,其中ensure_ascii=False标志可以产生一个混合unicodestr对象。那么,Python 2的解决方法是:

with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(u"ברי צקלה", ensure_ascii=False)
    # unicode(data) auto-decodes data to unicode if str
    json_file.write(unicode(data))

在Python 2中,当使用str编码为UTF-8的字节字符串(类型)时,请确保还设置encoding关键字:

>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}

>>> s=json.dumps(d, ensure_ascii=False, encoding='utf8')
>>> s
u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
>>> json.loads(s)['1']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> json.loads(s)['2']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
ברי צקלה

Use the ensure_ascii=False switch to json.dumps(), then encode the value to UTF-8 manually:

>>> json_string = json.dumps("ברי צקלה", ensure_ascii=False).encode('utf8')
>>> json_string
b'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
>>> print(json_string.decode())
"ברי צקלה"

If you are writing to a file, just use json.dump() and leave it to the file object to encode:

with open('filename', 'w', encoding='utf8') as json_file:
    json.dump("ברי צקלה", json_file, ensure_ascii=False)

Caveats for Python 2

For Python 2, there are some more caveats to take into account. If you are writing this to a file, you can use io.open() instead of open() to produce a file object that encodes Unicode values for you as you write, then use json.dump() instead to write to that file:

with io.open('filename', 'w', encoding='utf8') as json_file:
    json.dump(u"ברי צקלה", json_file, ensure_ascii=False)

Do note that there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 then is:

with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(u"ברי צקלה", ensure_ascii=False)
    # unicode(data) auto-decodes data to unicode if str
    json_file.write(unicode(data))

In Python 2, when using byte strings (type str), encoded to UTF-8, make sure to also set the encoding keyword:

>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}

>>> s=json.dumps(d, ensure_ascii=False, encoding='utf8')
>>> s
u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
>>> json.loads(s)['1']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> json.loads(s)['2']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
ברי צקלה

回答 1

写入文件

import codecs
import json

with codecs.open('your_file.txt', 'w', encoding='utf-8') as f:
    json.dump({"message":"xin chào việt nam"}, f, ensure_ascii=False)

打印到标准输出

import json
print(json.dumps({"message":"xin chào việt nam"}, ensure_ascii=False))

To write to a file

import codecs
import json

with codecs.open('your_file.txt', 'w', encoding='utf-8') as f:
    json.dump({"message":"xin chào việt nam"}, f, ensure_ascii=False)

To print to stdout

import json
print(json.dumps({"message":"xin chào việt nam"}, ensure_ascii=False))

回答 2

更新:这是错误的答案,但是了解为什么它仍然是有用的。看评论。

怎么unicode-escape

>>> d = {1: "ברי צקלה", 2: u"ברי צקלה"}
>>> json_str = json.dumps(d).decode('unicode-escape').encode('utf8')
>>> print json_str
{"1": "ברי צקלה", "2": "ברי צקלה"}

UPDATE: This is wrong answer, but it’s still useful to understand why it’s wrong. See comments.

How about unicode-escape?

>>> d = {1: "ברי צקלה", 2: u"ברי צקלה"}
>>> json_str = json.dumps(d).decode('unicode-escape').encode('utf8')
>>> print json_str
{"1": "ברי צקלה", "2": "ברי צקלה"}

回答 3

Peters的python 2解决方法在边缘情况下失败:

d = {u'keyword': u'bad credit  \xe7redit cards'}
with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(d, ensure_ascii=False).decode('utf8')
    try:
        json_file.write(data)
    except TypeError:
        # Decode data to Unicode first
        json_file.write(data.decode('utf8'))

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 25: ordinal not in range(128)

它在第3行的.decode(’utf8’)部分崩溃。我通过避免该步骤以及ascii的特殊大小写,使程序更加简单,从而解决了该问题:

with io.open('filename', 'w', encoding='utf8') as json_file:
  data = json.dumps(d, ensure_ascii=False, encoding='utf8')
  json_file.write(unicode(data))

cat filename
{"keyword": "bad credit  çredit cards"}

Peters’ python 2 workaround fails on an edge case:

d = {u'keyword': u'bad credit  \xe7redit cards'}
with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(d, ensure_ascii=False).decode('utf8')
    try:
        json_file.write(data)
    except TypeError:
        # Decode data to Unicode first
        json_file.write(data.decode('utf8'))

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 25: ordinal not in range(128)

It was crashing on the .decode(‘utf8’) part of line 3. I fixed the problem by making the program much simpler by avoiding that step as well as the special casing of ascii:

with io.open('filename', 'w', encoding='utf8') as json_file:
  data = json.dumps(d, ensure_ascii=False, encoding='utf8')
  json_file.write(unicode(data))

cat filename
{"keyword": "bad credit  çredit cards"}

回答 4

从Python 3.7开始,以下代码可以正常运行:

from json import dumps
result = {"symbol": "ƒ"}
json_string = dumps(result, sort_keys=True, indent=2, ensure_ascii=False)
print(json_string)

输出:

{"symbol": "ƒ"}

As of Python 3.7 the following code works fine:

from json import dumps
result = {"symbol": "ƒ"}
json_string = dumps(result, sort_keys=True, indent=2, ensure_ascii=False)
print(json_string)

Output:

{"symbol": "ƒ"}

回答 5

以下是我和google的var阅读答案。

# coding:utf-8
r"""
@update: 2017-01-09 14:44:39
@explain: str, unicode, bytes in python2to3
    #python2 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 7: ordinal not in range(128)
    #1.reload
    #importlib,sys
    #importlib.reload(sys)
    #sys.setdefaultencoding('utf-8') #python3 don't have this attribute.
    #not suggest even in python2 #see:http://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script
    #2.overwrite /usr/lib/python2.7/sitecustomize.py or (sitecustomize.py and PYTHONPATH=".:$PYTHONPATH" python)
    #too complex
    #3.control by your own (best)
    #==> all string must be unicode like python3 (u'xx'|b'xx'.encode('utf-8')) (unicode 's disappeared in python3)
    #see: http://blog.ernest.me/post/python-setdefaultencoding-unicode-bytes

    #how to Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence
    #http://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence
"""

from __future__ import print_function
import json

a = {"b": u"中文"}  # add u for python2 compatibility
print('%r' % a)
print('%r' % json.dumps(a))
print('%r' % (json.dumps(a).encode('utf8')))
a = {"b": u"中文"}
print('%r' % json.dumps(a, ensure_ascii=False))
print('%r' % (json.dumps(a, ensure_ascii=False).encode('utf8')))
# print(a.encode('utf8')) #AttributeError: 'dict' object has no attribute 'encode'
print('')

# python2:bytes=str; python3:bytes
b = a['b'].encode('utf-8')
print('%r' % b)
print('%r' % b.decode("utf-8"))
print('')

# python2:unicode; python3:str=unicode
c = b.decode('utf-8')
print('%r' % c)
print('%r' % c.encode('utf-8'))
"""
#python2
{'b': u'\u4e2d\u6587'}
'{"b": "\\u4e2d\\u6587"}'
'{"b": "\\u4e2d\\u6587"}'
u'{"b": "\u4e2d\u6587"}'
'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'

'\xe4\xb8\xad\xe6\x96\x87'
u'\u4e2d\u6587'

u'\u4e2d\u6587'
'\xe4\xb8\xad\xe6\x96\x87'

#python3
{'b': '中文'}
'{"b": "\\u4e2d\\u6587"}'
b'{"b": "\\u4e2d\\u6587"}'
'{"b": "中文"}'
b'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'

b'\xe4\xb8\xad\xe6\x96\x87'
'中文'

'中文'
b'\xe4\xb8\xad\xe6\x96\x87'
"""

The following is my understanding var reading answer above and google.

# coding:utf-8
r"""
@update: 2017-01-09 14:44:39
@explain: str, unicode, bytes in python2to3
    #python2 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 7: ordinal not in range(128)
    #1.reload
    #importlib,sys
    #importlib.reload(sys)
    #sys.setdefaultencoding('utf-8') #python3 don't have this attribute.
    #not suggest even in python2 #see:http://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script
    #2.overwrite /usr/lib/python2.7/sitecustomize.py or (sitecustomize.py and PYTHONPATH=".:$PYTHONPATH" python)
    #too complex
    #3.control by your own (best)
    #==> all string must be unicode like python3 (u'xx'|b'xx'.encode('utf-8')) (unicode 's disappeared in python3)
    #see: http://blog.ernest.me/post/python-setdefaultencoding-unicode-bytes

    #how to Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence
    #http://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence
"""

from __future__ import print_function
import json

a = {"b": u"中文"}  # add u for python2 compatibility
print('%r' % a)
print('%r' % json.dumps(a))
print('%r' % (json.dumps(a).encode('utf8')))
a = {"b": u"中文"}
print('%r' % json.dumps(a, ensure_ascii=False))
print('%r' % (json.dumps(a, ensure_ascii=False).encode('utf8')))
# print(a.encode('utf8')) #AttributeError: 'dict' object has no attribute 'encode'
print('')

# python2:bytes=str; python3:bytes
b = a['b'].encode('utf-8')
print('%r' % b)
print('%r' % b.decode("utf-8"))
print('')

# python2:unicode; python3:str=unicode
c = b.decode('utf-8')
print('%r' % c)
print('%r' % c.encode('utf-8'))
"""
#python2
{'b': u'\u4e2d\u6587'}
'{"b": "\\u4e2d\\u6587"}'
'{"b": "\\u4e2d\\u6587"}'
u'{"b": "\u4e2d\u6587"}'
'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'

'\xe4\xb8\xad\xe6\x96\x87'
u'\u4e2d\u6587'

u'\u4e2d\u6587'
'\xe4\xb8\xad\xe6\x96\x87'

#python3
{'b': '中文'}
'{"b": "\\u4e2d\\u6587"}'
b'{"b": "\\u4e2d\\u6587"}'
'{"b": "中文"}'
b'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'

b'\xe4\xb8\xad\xe6\x96\x87'
'中文'

'中文'
b'\xe4\xb8\xad\xe6\x96\x87'
"""

回答 6

这是我使用json.dump()的解决方案:

def jsonWrite(p, pyobj, ensure_ascii=False, encoding=SYSTEM_ENCODING, **kwargs):
    with codecs.open(p, 'wb', 'utf_8') as fileobj:
        json.dump(pyobj, fileobj, ensure_ascii=ensure_ascii,encoding=encoding, **kwargs)

其中SYSTEM_ENCODING设置为:

locale.setlocale(locale.LC_ALL, '')
SYSTEM_ENCODING = locale.getlocale()[1]

Here’s my solution using json.dump():

def jsonWrite(p, pyobj, ensure_ascii=False, encoding=SYSTEM_ENCODING, **kwargs):
    with codecs.open(p, 'wb', 'utf_8') as fileobj:
        json.dump(pyobj, fileobj, ensure_ascii=ensure_ascii,encoding=encoding, **kwargs)

where SYSTEM_ENCODING is set to:

locale.setlocale(locale.LC_ALL, '')
SYSTEM_ENCODING = locale.getlocale()[1]

回答 7

尽可能使用编解码器,

with codecs.open('file_path', 'a+', 'utf-8') as fp:
    fp.write(json.dumps(res, ensure_ascii=False))

Use codecs if possible,

with codecs.open('file_path', 'a+', 'utf-8') as fp:
    fp.write(json.dumps(res, ensure_ascii=False))

回答 8

感谢您在这里的原始答案。使用python 3的以下代码行:

print(json.dumps(result_dict,ensure_ascii=False))

还可以 如果不是必须的,请考虑尝试在代码中不要写太多文本。

这对于python控制台可能已经足够了。但是,要使服务器满意,您可能需要按照此处的说明设置语言环境(如果它在apache2上) http://blog.dscpl.com.au/2014/09/setting-lang-and-lcall-when-using .html

基本上在ubuntu上安装he_IL或​​任何语言环境,检查是否未安装

locale -a 

将其安装在XX是您的语言的位置

sudo apt-get install language-pack-XX

例如:

sudo apt-get install language-pack-he

将以下文本添加到/ etc / apache2 / envvrs

export LANG='he_IL.UTF-8'
export LC_ALL='he_IL.UTF-8'

比您希望不要从apache上得到python错误:

打印(js)UnicodeEncodeError:’ascii’编解码器无法在位置41-45处编码字符:序数不在范围内(128)

另外,在apache中,尝试将utf设置为默认编码,如下所示:
如何将Apache的默认编码更改为UTF-8?

尽早执行,因为apache错误可能很难调试,并且您可能会错误地认为它来自python,在这种情况下可能并非如此

Thanks for the original answer here. With python 3 the following line of code:

print(json.dumps(result_dict,ensure_ascii=False))

was ok. Consider trying not writing too much text in the code if it’s not imperative.

This might be good enough for the python console. However, to satisfy a server you might need to set the locale as explained here (if it is on apache2) http://blog.dscpl.com.au/2014/09/setting-lang-and-lcall-when-using.html

basically install he_IL or whatever language locale on ubuntu check it is not installed

locale -a 

install it where XX is your language

sudo apt-get install language-pack-XX

For example:

sudo apt-get install language-pack-he

add the following text to /etc/apache2/envvrs

export LANG='he_IL.UTF-8'
export LC_ALL='he_IL.UTF-8'

Than you would hopefully not get python errors on from apache like:

print (js) UnicodeEncodeError: ‘ascii’ codec can’t encode characters in position 41-45: ordinal not in range(128)

Also in apache try to make utf the default encoding as explained here:
How to change the default encoding to UTF-8 for Apache?

Do it early because apache errors can be pain to debug and you can mistakenly think it’s from python which possibly isn’t the case in that situation


回答 9

如果要从文件和文件内容阿拉伯文本加载JSON字符串。然后这将起作用。

假设文件如下:arabic.json

{ 
"key1" : "لمستخدمين",
"key2" : "إضافة مستخدم"
}

从arabic.json文件获取阿拉伯语内容

with open(arabic.json, encoding='utf-8') as f:
   # deserialises it
   json_data = json.load(f)
   f.close()


# json formatted string
json_data2 = json.dumps(json_data, ensure_ascii = False)

要在Django模板中使用JSON数据,请执行以下步骤:

# If have to get the JSON index in Django Template file, then simply decode the encoded string.

json.JSONDecoder().decode(json_data2)

完成了!现在我们可以将结果作为带有阿拉伯值的JSON索引来获取。

If you are loading JSON string from a file & file contents arabic texts. Then this will work.

Assume File like: arabic.json

{ 
"key1" : "لمستخدمين",
"key2" : "إضافة مستخدم"
}

Get the arabic contents from the arabic.json file

with open(arabic.json, encoding='utf-8') as f:
   # deserialises it
   json_data = json.load(f)
   f.close()


# json formatted string
json_data2 = json.dumps(json_data, ensure_ascii = False)

To use JSON Data in Django Template follow below steps:

# If have to get the JSON index in Django Template file, then simply decode the encoded string.

json.JSONDecoder().decode(json_data2)

done! Now we can get the results as JSON index with arabic value.


回答 10

使用unicode-escape解决问题

>>>import json
>>>json_string = json.dumps("ברי צקלה")
>>>json_string.encode('ascii').decode('unicode-escape')
'"ברי צקלה"'

说明

>>>s = '漢  χαν  хан'
>>>print('unicode: ' + s.encode('unicode-escape').decode('utf-8'))
unicode: \u6f22  \u03c7\u03b1\u03bd  \u0445\u0430\u043d

>>>u = s.encode('unicode-escape').decode('utf-8')
>>>print('original: ' + u.encode("utf-8").decode('unicode-escape'))
original:   χαν  хан

原始资源:https : //blog.csdn.net/chuatony/article/details/72628868

use unicode-escape to solve problem

>>>import json
>>>json_string = json.dumps("ברי צקלה")
>>>json_string.encode('ascii').decode('unicode-escape')
'"ברי צקלה"'

explain

>>>s = '漢  χαν  хан'
>>>print('unicode: ' + s.encode('unicode-escape').decode('utf-8'))
unicode: \u6f22  \u03c7\u03b1\u03bd  \u0445\u0430\u043d

>>>u = s.encode('unicode-escape').decode('utf-8')
>>>print('original: ' + u.encode("utf-8").decode('unicode-escape'))
original: 漢  χαν  хан

original resource:https://blog.csdn.net/chuatony/article/details/72628868


回答 11

正如Martijn指出的那样,在json.dumps中使用suresure_ascii = False是解决此问题的正确方向。但是,这可能会引发异常:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 1: ordinal not in range(128)

您需要在site.py或sitecustomize.py中进行其他设置,才能正确设置sys.getdefaultencoding()。site.py在lib / python2.7 /下,sitecustomize.py在lib / python2.7 / site-packages下。

如果要使用site.py,请在def setencoding()下:将第一个if 0:更改为if 1 :,以便python使用操作系统的语言环境。

如果您喜欢使用sitecustomize.py,那么如果尚未创建,则可能不存在。只需将这些行:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

然后,您可以以utf-8格式进行中文json输出,例如:

name = {"last_name": u"王"}
json.dumps(name, ensure_ascii=False)

您将获得一个utf-8编码的字符串,而不是\ u转义的json字符串。

验证默认编码:

print sys.getdefaultencoding()

您应该获得“ utf-8”或“ UTF-8”来验证site.py或sitecustomize.py设置。

请注意,您无法在交互式python控制台上执行sys.setdefaultencoding(“ utf-8”)。

Using ensure_ascii=False in json.dumps is the right direction to solve this problem, as pointed out by Martijn. However, this may raise an exception:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 1: ordinal not in range(128)

You need extra settings in either site.py or sitecustomize.py to set your sys.getdefaultencoding() correct. site.py is under lib/python2.7/ and sitecustomize.py is under lib/python2.7/site-packages.

If you want to use site.py, under def setencoding(): change the first if 0: to if 1: so that python will use your operation system’s locale.

If you prefer to use sitecustomize.py, which may not exist if you haven’t created it. simply put these lines:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Then you can do some Chinese json output in utf-8 format, such as:

name = {"last_name": u"王"}
json.dumps(name, ensure_ascii=False)

You will get an utf-8 encoded string, rather than \u escaped json string.

To verify your default encoding:

print sys.getdefaultencoding()

You should get “utf-8” or “UTF-8” to verify your site.py or sitecustomize.py settings.

Please note that you could not do sys.setdefaultencoding(“utf-8”) at interactive python console.