




Does Python have a function that I can use to escape special characters in a regular expression?

For example, I'm "stuck" :\ should become I\'m \"stuck\" :\\.


Use re.escape

>>> import re
>>> re.escape(r'\ a.*$')
'\\\\\\ a\\.\\*\\$'
>>> print(re.escape(r'\ a.*$'))
\\\ a\.\*\$
>>> re.escape('www.stackoverflow.com')
'www\\.stackoverflow\\.com'
>>> print(re.escape('www.stackoverflow.com'))
www\.stackoverflow\.com

Repeating it here:


Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

As of Python 3.7 re.escape() was changed to escape only characters which are meaningful to regex operations.
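As a small illustration of why this matters, the sketch below searches for an arbitrary literal string with and without re.escape. The helper name find_literal and the sample text are made up for this example.

```python
import re

def find_literal(needle, haystack):
    """Return the first literal occurrence of needle in haystack, or None."""
    # re.escape neutralises metacharacters such as '.', '*' and '$',
    # so the needle is matched character for character.
    match = re.search(re.escape(needle), haystack)
    return match.group(0) if match else None

text = "price: 3.50$ (was 4.*0$)"
print(find_literal("4.*0$", text))            # the literal text is found
print(re.search("3.50", "3x50") is not None)  # unescaped '.' matches any character
print(find_literal("3.50", "3x50"))           # escaped '.' only matches a dot
```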


I’m surprised no one has mentioned using regular expressions via re.sub():

import re
print re.sub(r'([\"])',    r'\\\1', 'it\'s "this"')  # it's \"this\"
print re.sub(r"([\'])",    r'\\\1', 'it\'s "this"')  # it\'s "this"
print re.sub(r'([\" \'])', r'\\\1', 'it\'s "this"')  # it\'s\ \"this\"

Important things to note:

  • In the search pattern, include \ as well as the character(s) you’re looking for. You’re going to be using \ to escape your characters, so you need to escape that as well.
  • Put parentheses around the search pattern, e.g. ([\"]), so that the substitution pattern can use the found character when it adds \ in front of it. (That’s what \1 does: uses the value of the first parenthesized group.)
  • The r in front of r'([\"])' means it’s a raw string. Raw strings use different rules for escaping backslashes. To write ([\"]) as a plain string, you’d need to double all the backslashes and write '([\\"])'. Raw strings are friendlier when you’re writing regular expressions.
  • In the substitution pattern, you need to escape \ to distinguish it from a backslash that precedes a substitution group, e.g. \1, hence r'\\\1'. To write that as a plain string, you’d need '\\\\\\1' — and nobody wants that.
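The notes above can be folded into a small helper. This is only a sketch: the name backslash_escape and its default character set are assumptions, and it uses \g<0> (the whole match) instead of a numbered group.

```python
import re

def backslash_escape(text, chars="'\""):
    """Put a backslash in front of each character in chars and each backslash."""
    # Build a character class from the chosen characters plus the backslash;
    # re.escape keeps characters like ']' or '^' from breaking the class.
    char_class = "[" + re.escape(chars + "\\") + "]"
    # r"\\" emits one literal backslash, \g<0> re-inserts the matched character.
    return re.sub(char_class, r"\\\g<0>", text)

print(backslash_escape('I\'m "stuck" :\\'))
```

For the input from the question this prints I\'m \"stuck\" :\\, which is the form the question asks for.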


Use repr()[1:-1]. In this case, the double quotes don’t need to be escaped. The [1:-1] slice removes the single quotes from the beginning and the end.

>>> x = raw_input()
I'm "stuck" :\
>>> print x
I'm "stuck" :\
>>> print repr(x)[1:-1]
I\'m "stuck" :\\

Or maybe you just want to escape a phrase to paste into your program? If so, do this:

>>> raw_input()
I'm "stuck" :\
'I\'m "stuck" :\\'


As it was mentioned above, the answer depends on your case. If you want to escape a string for a regular expression then you should use re.escape(). But if you want to escape a specific set of characters then use this lambda function:

>>> escape = lambda s, escapechar, specialchars: "".join(escapechar + c if c in specialchars or c == escapechar else c for c in s)
>>> s = raw_input()
I'm "stuck" :\
>>> print s
I'm "stuck" :\
>>> print escape(s, "\\", ['"'])
I'm \"stuck\" :\\


It’s not that hard:

def escapeSpecialCharacters ( text, characters ):
    for character in characters:
        text = text.replace( character, '\\' + character )
    return text

>>> escapeSpecialCharacters( 'I\'m "stuck" :\\', '\'"' )
'I\\\'m \\"stuck\\" :\\'
>>> print( _ )
I\'m \"stuck\" :\
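Note that the loop above leaves the backslash itself alone, and adding it to characters only works if it is replaced first (otherwise the backslashes added for the quotes get doubled). A single-pass variant, sketched here with str.translate, avoids that ordering problem. The function name and defaults are illustrative.

```python
def escape_chars(text, specials="'\"", escape_char="\\"):
    """Prefix every special character, and the escape character, in one pass."""
    # Map each character to its escaped form; translate visits every input
    # character exactly once, so nothing is escaped twice.
    table = str.maketrans({c: escape_char + c for c in specials + escape_char})
    return text.translate(table)

print(escape_chars('I\'m "stuck" :\\'))
```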


If you only want to replace some characters you could use this:

import re

print re.sub(r'([\.\\\+\*\?\[\^\]\$\(\)\{\}\!\<\>\|\:\-])', r'\\\1', "example string.")





When using os.system() it’s often necessary to escape filenames and other arguments passed as parameters to commands. How can I do this? Preferably something that would work on multiple operating systems/shells but in particular for bash.

I’m currently doing the following, but am sure there must be a library function for this, or at least a more elegant/robust/efficient option:

def sh_escape(s):
   return s.replace("(","\\(").replace(")","\\)").replace(" ","\\ ")

os.system("cat %s | grep something | sort > %s" 
          % (sh_escape(in_filename), 

Edit: I’ve accepted the simple answer of using quotes, don’t know why I didn’t think of that; I guess because I came from Windows where ' and " behave a little differently.

Regarding security, I understand the concern, but, in this case, I’m interested in a quick and easy solution which os.system() provides, and the source of the strings is either not user-generated or at least entered by a trusted user (me).


This is what I use:

def shellquote(s):
    return "'" + s.replace("'", "'\\''") + "'"

The shell will always accept a quoted filename and remove the surrounding quotes before passing it to the program in question. Notably, this avoids problems with filenames that contain spaces or any other kind of nasty shell metacharacter.

Update: If you are using Python 3.3 or later, use shlex.quote instead of rolling your own.
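A short sketch of the shlex.quote route, assuming a POSIX shell. The filename is an invented worst case, and shlex.split is used only to check that a shell-style parser recovers the original name.

```python
import shlex

filename = "my file; rm -rf $(HOME) 'quoted'.txt"
quoted = shlex.quote(filename)

# The quoted name can be interpolated into a command line for os.system().
command = "cat {} | grep something".format(quoted)
print(command)

# shlex.split parses roughly the way a POSIX shell does, so the round trip
# should give back exactly the original filename.
print(shlex.split(quoted) == [filename])
```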


shlex.quote() does what you want since Python 3.3.

(Use pipes.quote to support both Python 2 and older Python 3 releases; the pipes module was removed in Python 3.13.)


Perhaps you have a specific reason for using os.system(). But if not you should probably be using the subprocess module. You can specify the pipes directly and avoid using the shell.

The following is from PEP324:

Replacing shell pipe line

output=`dmesg | grep hda`
==>
p1 = Popen(["dmesg"], stdout=PIPE)
p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
output = p2.communicate()[0]

Maybe subprocess.list2cmdline is a better shot?


Note that pipes.quote is actually broken in Python 2.5 and Python 3.1 and not safe to use: it doesn’t handle zero-length arguments.

>>> from pipes import quote
>>> args = ['arg1', '', 'arg3']
>>> print 'mycommand %s' % (' '.join(quote(arg) for arg in args))
mycommand arg1  arg3

See Python issue 7476; it has been fixed in Python 2.6 and 3.2 and newer.
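For comparison, here is the same check written against shlex.quote on Python 3, as a quick sketch. The empty argument is kept as an explicit pair of quotes, so the argument count survives.

```python
import shlex

args = ["arg1", "", "arg3"]
command = "mycommand " + " ".join(shlex.quote(arg) for arg in args)
print(command)

# Parsing the command line again yields all three arguments, including the empty one.
print(shlex.split(command))
```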


Notice: This is an answer for Python 2.7.x.

According to the source, pipes.quote() is a way to “Reliably quote a string as a single argument for /bin/sh“. (Although it is deprecated since version 2.7 and finally exposed publicly in Python 3.3 as the shlex.quote() function.)

On the other hand, subprocess.list2cmdline() is a way to “Translate a sequence of arguments into a command line string, using the same rules as the MS C runtime“.

Here we are, the platform independent way of quoting strings for command lines.

import sys
mswindows = (sys.platform == "win32")

if mswindows:
    from subprocess import list2cmdline
    quote_args = list2cmdline
else:
    # POSIX
    from pipes import quote

    def quote_args(seq):
        return ' '.join(quote(arg) for arg in seq)


# Quote a single argument
print quote_args(['my argument'])

# Quote multiple arguments
my_args = ['This', 'is', 'my arguments']
print quote_args(my_args)
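On Python 3 the same idea might look like the sketch below, with shlex.quote standing in for pipes.quote. This is an adaptation rather than part of the original answer, and subprocess.list2cmdline is used on the assumption that the Windows rules it implements are the ones you need.

```python
import shlex
import subprocess
import sys

def quote_args(seq):
    """Join arguments into one command-line string for the current platform."""
    if sys.platform == "win32":
        # Quoting rules of the MS C runtime.
        return subprocess.list2cmdline(seq)
    # POSIX shells: quote each argument separately.
    return " ".join(shlex.quote(arg) for arg in seq)

my_args = ["This", "is", "my arguments"]
print(quote_args(my_args))
```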

I believe that os.system just invokes whatever command shell is configured for the user, so I don’t think you can do it in a platform independent way. My command shell could be anything from bash, emacs, ruby, or even quake3. Some of these programs aren’t expecting the kind of arguments you are passing to them and even if they did there is no guarantee they do their escaping the same way.



The function I use is:

def quote_argument(argument):
    return '"%s"' % (
        argument
        .replace('\\', '\\\\')
        .replace('"', '\\"')
        .replace('$', '\\$')
        .replace('`', '\\`')
    )

that is: I always enclose the argument in double quotes, and then backslash-quote the only characters special inside double quotes.


If you do use the system command, I would try and whitelist what goes into the os.system() call.. For example..

clean_user_input = re.sub("[^a-zA-Z]", "", user_input)
os.system("ls %s" % (clean_user_input))

The subprocess module is a better option, and I would recommend trying to avoid using anything like os.system/subprocess wherever possible.

The real answer is: Don’t use os.system() in the first place. Use subprocess.call instead and supply the unescaped arguments.
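A minimal sketch of that approach, using subprocess.run (Python 3.7+ for capture_output) rather than subprocess.call so the output can be inspected. The child command is just the current Python interpreter echoing its argument, chosen so the example runs anywhere; the filename is invented.

```python
import subprocess
import sys

filename = "name with spaces; $(not run) 'and quotes'.txt"

# Arguments are passed as a list, so no shell is involved and nothing needs
# escaping: the child process receives the string exactly as written.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", filename],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip() == filename)
```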








Sometimes when I get input from a file or the user, I get a string with escape sequences in it. I would like to process the escape sequences in the same way that Python processes escape sequences in string literals.

For example, let’s say myString is defined as:

>>> myString = "spam\\neggs"
>>> print(myString)
spam\neggs

I want a function (I’ll call it process) that does this:

>>> print(process(myString))
spam
eggs

It’s important that the function can process all of the escape sequences in Python (listed in a table in the link above).

Does Python have a function to do this?



The correct thing to do is use the ‘string-escape’ code to decode the string.

>>> myString = "spam\\neggs"
>>> decoded_string = bytes(myString, "utf-8").decode("unicode_escape") # python3 
>>> decoded_string = myString.decode('string_escape') # python2
>>> print(decoded_string)
spam
eggs

Don’t use the AST or eval. Using the string codecs is much safer.


unicode_escape doesn’t work in general

It turns out that the string_escape or unicode_escape solution does not work in general — particularly, it doesn’t work in the presence of actual Unicode.

If you can be sure that every non-ASCII character will be escaped (and remember, anything beyond the first 128 characters is non-ASCII), unicode_escape will do the right thing for you. But if there are any literal non-ASCII characters already in your string, things will go wrong.

unicode_escape is fundamentally designed to convert bytes into Unicode text. But in many places — for example, Python source code — the source data is already Unicode text.

The only way this can work correctly is if you encode the text into bytes first. UTF-8 is the sensible encoding for all text, so that should work, right?

The following examples are in Python 3, so that the string literals are cleaner, but the same problem exists with slightly different manifestations on both Python 2 and 3.

>>> s = 'naïve \\t test'
>>> print(s.encode('utf-8').decode('unicode_escape'))
naÃ¯ve   test

Well, that’s wrong.

The new recommended way to use codecs that decode text into text is to call codecs.decode directly. Does that help?

>>> import codecs
>>> print(codecs.decode(s, 'unicode_escape'))
naÃ¯ve   test

Not at all. (Also, the above is a UnicodeError on Python 2.)

The unicode_escape codec, despite its name, turns out to assume that all non-ASCII bytes are in the Latin-1 (ISO-8859-1) encoding. So you would have to do it like this:

>>> print(s.encode('latin-1').decode('unicode_escape'))
naïve    test

But that’s terrible. This limits you to the 256 Latin-1 characters, as if Unicode had never been invented at all!

>>> print('Ernő \\t Rubik'.encode('latin-1').decode('unicode_escape'))
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0151'
in position 3: ordinal not in range(256)

Adding a regular expression to solve the problem

(Surprisingly, we do not now have two problems.)

What we need to do is only apply the unicode_escape decoder to things that we are certain to be ASCII text. In particular, we can make sure only to apply it to valid Python escape sequences, which are guaranteed to be ASCII text.

The plan is, we’ll find escape sequences using a regular expression, and use a function as the argument to re.sub to replace them with their unescaped value.

import re
import codecs

ESCAPE_SEQUENCE_RE = re.compile(r'''
    ( \\U........      # 8-digit hex escapes
    | \\u....          # 4-digit hex escapes
    | \\x..            # 2-digit hex escapes
    | \\[0-7]{1,3}     # Octal escapes
    | \\N\{[^}]+\}     # Unicode characters by name
    | \\[\\'"abfnrtv]  # Single-character escapes
    )''', re.UNICODE | re.VERBOSE)

def decode_escapes(s):
    def decode_match(match):
        return codecs.decode(match.group(0), 'unicode-escape')

    return ESCAPE_SEQUENCE_RE.sub(decode_match, s)

And with that:

>>> print(decode_escapes('Ernő \\t Rubik'))
Ernő     Rubik



The actually correct and convenient answer for python 3:

>>> import codecs
>>> myString = "spam\\neggs"
>>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
spam
eggs
>>> myString = "naïve \\t test"
>>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
naïve    test

Details regarding codecs.escape_decode:

  • codecs.escape_decode is a bytes-to-bytes decoder
  • codecs.escape_decode decodes ascii escape sequences, such as: b"\\n" -> b"\n", b"\\xce" -> b"\xce".
  • codecs.escape_decode does not care or need to know about the byte object’s encoding, but the encoding of the escaped bytes should match the encoding of the rest of the object.


  • @rspeer is correct: unicode_escape is the incorrect solution for python3. This is because unicode_escape decodes escaped bytes, then decodes bytes to unicode string, but receives no information regarding which codec to use for the second operation.
  • @Jerub is correct: avoid the AST or eval.
  • I first discovered codecs.escape_decode from this answer to “how do I .decode(‘string-escape’) in Python3?”. As that answer states, that function is currently not documented for python 3.



The ast.literal_eval function comes close, but it will expect the string to be properly quoted first.

Of course Python’s interpretation of backslash escapes depends on how the string is quoted ("" vs r"" vs u"", triple quotes, etc) so you may want to wrap the user input in suitable quotes and pass to literal_eval. Wrapping it in quotes will also prevent literal_eval from returning a number, tuple, dictionary, etc.

Things still might get tricky if the user types unquoted quotes of the type you intend to wrap around the string.



This is a bad way of doing it, but it worked for me when trying to interpret escaped octals passed in a string argument.

input_string = eval('b"' + sys.argv[1] + '"')

It’s worth mentioning that there is a difference between eval and ast.literal_eval (eval being way more unsafe). See Using python’s eval() vs. ast.literal_eval()?


The code below should work when a literal \n in the string needs to be turned into a real newline.

our_str = 'The String is \\n, \\n and \\n!'
# Replace only the first literal backslash-n with a newline; drop the count to replace all.
new_str = our_str.replace('\\n', '\n', 1)





Is there a way to declare a string variable in python such that everything inside of it is automatically escaped, or has its literal character value?

I’m not asking how to escape the quotes with slashes, that’s obvious. What I’m asking for is a general purpose way for making everything in a string literal so that I don’t have to manually go through and escape everything for very large strings. Anyone know of a solution? Thanks!


Raw string literals:

>>> r'abc\dev\t'
'abc\\dev\\t'


If you’re dealing with very large strings, specifically multiline strings, be aware of the triple-quote syntax:

a = r"""This is a multiline string
with more than one line
in the source code."""
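A few lines to show what the r prefix does and where it stops helping. The variable names and the path are illustrative only.

```python
# In a raw literal the backslashes stay in the string as written.
windows_path = r"C:\new\table"
print(windows_path == "C:\\new\\table")

# r"\n" is two characters (backslash and n), not a newline.
print(len(r"\n"))

# A raw literal cannot end in a single backslash, so the trailing one is
# supplied by a normal literal placed next to it (adjacent literals are joined).
directory = r"C:\new\table" "\\"
print(directory)
```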



There is no such thing. It looks like you want something like “here documents” in Perl and the shells, but Python doesn’t have that.

Using raw strings or multiline strings only means that there are fewer things to worry about. If you use a raw string, you still have to work around a trailing "\", and with any string solution you have to worry about the closing ", ', ''' or """ if it is included in your data.

That is, there’s no way to have the string

 '   ''' """  " \

properly stored in any Python string literal without internal escaping of some sort.


Python’s string literal documentation covers the details.


The simplest example would be using the ‘r’ prefix:

ss = r'Hello\nWorld'



(Assuming you are not required to input the string from directly within Python code)

To get around the issue Andrew Dalke pointed out, simply type the literal string into a text file and then read it like this:

input_ = '/directory_of_text_file/your_text_file.txt'
# Open read-only; the with block closes the file when done.
with open(input_, 'r') as input_open:
    input_string = input_open.read()

print(input_string)

This will print the literal text of whatever is in the text file, even if it is:

 '   ''' """  " \

Not fun or optimal, but can be useful, especially if you have 3 pages of code that would’ve needed character escaping.


>>> s = '\tgherkin\n'

>>> s
'\tgherkin\n'

>>> print(s)
        gherkin

>>> print(s.__repr__())
'\tgherkin\n'






Suppose I have a string which is a backslash-escaped version of another string. Is there an easy way, in Python, to unescape the string? I could, for example, do:

>>> escaped_str = '"Hello,\\nworld!"'
>>> raw_str = eval(escaped_str)
>>> print raw_str
Hello,
world!

However that involves passing a (possibly untrusted) string to eval() which is a security risk. Is there a function in the standard lib which takes a string and produces a string with no security implications?

>>> print '"Hello,\\nworld!"'.decode('string_escape')


You can use ast.literal_eval which is safe:

Safely evaluate an expression node or a string containing a Python expression. The string or node provided may only consist of the following Python literal structures: strings, numbers, tuples, lists, dicts, booleans, and None.

Like this:

>>> import ast
>>> escaped_str = '"Hello,\\nworld!"'
>>> print ast.literal_eval(escaped_str)
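To make the safety difference concrete, here is a Python 3 sketch that wraps ast.literal_eval so that anything other than a string literal is rejected. The wrapper name unescape_literal and the choice to return None on failure are assumptions made for this example.

```python
import ast

def unescape_literal(source):
    """Return the string that a quoted literal denotes, or None if it is not one."""
    try:
        value = ast.literal_eval(source)
    except (ValueError, SyntaxError):
        # Function calls, names, and malformed input end up here instead of running.
        return None
    # literal_eval also accepts numbers, tuples, dicts and so on; keep only strings.
    return value if isinstance(value, str) else None

print(unescape_literal('"Hello,\\nworld!"'))
print(unescape_literal('__import__("os").getcwd()'))
```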



All given answers will break on general Unicode strings. The following works for Python3 in all cases, as far as I can tell:

from codecs import encode, decode
sample = u'mon€y\\nröcks'
result = decode(encode(sample, 'latin-1', 'backslashreplace'), 'unicode-escape')

As outlined in the comments, you can also use the literal_eval method from the ast module like so:

import ast
sample = u'mon€y\\nröcks'
print(ast.literal_eval(f'"{sample}"'))

Or like this when your string really contains a string literal (including the quotes):

import ast
sample = u'"mon€y\\nröcks"'
print(ast.literal_eval(sample))

However, if you are uncertain whether the input string uses double or single quotes as delimiters, or when you cannot assume it to be properly escaped at all, then literal_eval may raise a SyntaxError while the encode/decode method will still work.
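The encode/decode approach from this answer can be wrapped in a function, sketched below. The name unescape is an assumption; the two sample strings come from this page.

```python
import codecs

def unescape(text):
    """Interpret backslash escapes in text while leaving non-ASCII characters intact."""
    # Characters outside Latin-1 become \uXXXX escapes; Latin-1 characters
    # become single bytes, which is what unicode-escape expects.
    raw = codecs.encode(text, "latin-1", "backslashreplace")
    return codecs.decode(raw, "unicode-escape")

print(unescape("mon€y\\nröcks"))
print(unescape("Ernő \\t Rubik"))
```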


In python 3, str objects don’t have a decode method and you have to use a bytes object. ChristopheD’s answer covers python 2.

# create a `bytes` object from a `str`
my_str = "Hello,\\nworld"
# (pick an encoding suitable for your str, e.g. 'latin1')
my_bytes = my_str.encode("utf-8")

# or directly
my_bytes = b"Hello,\\nworld"

print(my_bytes.decode("unicode_escape"))
# "Hello,
# world"




I’d like to use a variable inside a regex, how can I do this in Python?

TEXTO = sys.argv[1]

if re.search(r"\b(?=\w)TEXTO\b(?!\w)", subject, re.IGNORECASE):
    pass  # Successful match
else:
    pass  # Match attempt failed


From python 3.6 on you can also use Literal String Interpolation, “f-strings”. In your particular case the solution would be:

if re.search(rf"\b(?=\w){TEXTO}\b(?!\w)", subject, re.IGNORECASE):
    ...do something


Since there have been some questions in the comment on how to deal with special characters I’d like to extend my answer:

raw strings (‘r’):

One of the main concepts you have to understand when dealing with special characters in regular expressions is to distinguish between string literals and the regular expression itself. It is very well explained here:

In short:

Let’s say instead of finding a word boundary \b after TEXTO you want to match the string \boundary. Then you have to write:

TEXTO = "Var"
subject = r"Var\boundary"

if re.search(rf"\b(?=\w){TEXTO}\\boundary(?!\w)", subject, re.IGNORECASE):

This only works because we are using a raw string (the regex is preceded by ‘r’); otherwise we would have to write “\\\\boundary” in the regex (four backslashes). Additionally, without the ‘r’ prefix, ‘\b’ would no longer be interpreted as a word boundary but as a backspace!

re.escape:

Basically puts a backslash in front of any special character. Hence, if you expect a special character in TEXTO, you need to write:

if re.search(rf"\b(?=\w){re.escape(TEXTO)}\b(?!\w)", subject, re.IGNORECASE):

NOTE: For any version >= Python 3.7: !, ", %, ', ,, /, :, ;, <, =, >, @, and ` are not escaped. Only special characters with meaning in a regex are still escaped. _ is not escaped since Python 3.3.

Curly braces:

If you want to use quantifiers within the regular expression using f-strings, you have to use double curly braces. Let’s say you want to match TEXTO followed by exactly 2 digits:

if re.search(rf"\b(?=\w){re.escape(TEXTO)}\d{{2}}\b(?!\w)", subject, re.IGNORECASE):
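The pieces above can be tried out together in a short sketch. The value of TEXTO and the subject strings are invented so that the effect of re.escape and of the doubled braces is visible.

```python
import re

TEXTO = "a.b"  # contains '.', which is a wildcard unless escaped

unescaped_re = rf"\b(?=\w){TEXTO}\b(?!\w)"
escaped_re = rf"\b(?=\w){re.escape(TEXTO)}\b(?!\w)"
# {{2}} in an f-string produces the literal quantifier {2}.
two_digits_re = rf"\b(?=\w){re.escape(TEXTO)}\d{{2}}\b(?!\w)"

def found(pattern, subject):
    return re.search(pattern, subject, re.IGNORECASE) is not None

print(found(unescaped_re, "x aXb y"))         # '.' matched the X
print(found(escaped_re, "x aXb y"))           # only a literal dot is accepted now
print(found(escaped_re, "x A.B y"))           # still case-insensitive
print(found(two_digits_re, "id a.b42 end"))   # TEXTO followed by two digits
```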



You have to build the regex as a string:

TEXTO = sys.argv[1]
my_regex = r"\b(?=\w)" + re.escape(TEXTO) + r"\b(?!\w)"

if re.search(my_regex, subject, re.IGNORECASE):



if re.search(r"\b(?<=\w)%s\b(?!\w)" % TEXTO, subject, re.IGNORECASE):

This will insert what is in TEXTO into the regex as a string.

rx = r'\b(?<=\w){0}\b(?!\w)'.format(TEXTO)

I find it very convenient to build a regular expression pattern by stringing together multiple smaller patterns.

import re

string = "begin:id1:tag:middl:id2:tag:id3:end"
re_str1 = r'(?<=(\S{5})):'
re_str2 = r'(id\d+):(?=tag:)'
re_pattern = re.compile(re_str1 + re_str2)
match = re_pattern.findall(string)


[('begin', 'id1'), ('middl', 'id2')]


I agree with all the above unless:

sys.argv[1] was something like Chicken\d{2}-\d{2}An\s*important\s*anchor

sys.argv[1] = "Chicken\d{2}-\d{2}An\s*important\s*anchor"

you would not want to use re.escape, because in that case you would like it to behave like a regex

TEXTO = sys.argv[1]

if re.search(r"\b(?<=\w)" + TEXTO + r"\b(?!\w)", subject, re.IGNORECASE):
    pass  # Successful match
else:
    pass  # Match attempt failed


I needed to search for usernames that are similar to each other, and what Ned Batchelder said was incredibly helpful. However, I found I had cleaner output when I used re.compile to create my re search term:

pattern = re.compile(r"("+username+".*):(.*?):(.*?):(.*?):(.*)")
matches = re.findall(pattern, lines)

Output can be printed using the following:

print(matches[0]) # prints one whole matching line (in this case, the first line)
print(matches[0][3]) # prints the fourth character group (established with the parentheses in the regex statement) of the first line.


You can also try format as syntactic sugar:

re_genre = r'{}'.format(your_variable)
regex_pattern = re.compile(re_genre)  


You can also use the format method for this. It replaces the {} placeholder with the variable you pass to format as an argument.

if re.search(r"\b(?=\w){}\b(?!\w)".format(TEXTO), subject, re.IGNORECASE):
    pass  # Successful match
else:
    pass  # Match attempt failed


One more example.

I have a configus.yml file with the flow patterns:

  - _(\d{14})_
  - "%m%d%Y%H%M%f"

In the Python code I use:

data_time_real_file=re.findall(r""+flows[flow]["pattern"][0]+"", latest_file)





I have the following code

test = "have it break."
selectiveEscape = "Print percent % in sentence and not %s" % test


I would like to get the output:

Print percent % in sentence and not have it break.

What actually happens:

    selectiveEscape = "Use percent % in sentence and not %s" % test
TypeError: %d format: a number is required, not str

>>> test = "have it break."
>>> selectiveEscape = "Print percent %% in sentence and not %s" % test
>>> print selectiveEscape
Print percent % in sentence and not have it break.


Alternatively, as of Python 2.6, you can use new string formatting (described in PEP 3101):

'Print percent % in sentence and not {0}'.format(test)

which is especially handy as your strings get more complicated.


You can’t selectively escape %, as % always has a special meaning depending on the following character.

In the Python documentation, at the bottom of the second table in that section, it states:

'%'        No argument is converted, results in a '%' character in the result.

Therefore you should use:

selectiveEscape = "Print percent %% in sentence and not %s" % (test, )

(please note the explicit change to a tuple as the argument to %)

Without knowing about the above, I would have done:

selectiveEscape = "Print percent %s in sentence and not %s" % ('%', test)

with the knowledge you obviously already had.

If the formatting template was read from a file, and you cannot ensure the content doubles the percent sign, then you probably have to detect the percent character and decide programmatically whether it is the start of a placeholder or not. Then the parser should also recognize sequences like %d (and other letters that can be used), but also %(xxx)s etc.

A similar problem arises with the new format strings: the text can contain curly braces.
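
For the simpler case, where the text read from the file is known to contain no intended placeholders at all, doubling the special characters before appending your own placeholder may be enough. This is a minimal sketch under that assumption; the helper names escape_percent and escape_braces and the sample text are made up for illustration, and it does not attempt the placeholder detection described above.

```python
def escape_percent(text):
    """Double every % so the % operator treats it as a literal percent sign."""
    return text.replace("%", "%%")

def escape_braces(text):
    """Double every brace so str.format() treats it as a literal brace."""
    return text.replace("{", "{{").replace("}", "}}")

# Hypothetical text that came from outside and must stay literal.
raw = "Discount: 50% off {today only}"

percent_template = escape_percent(raw) + " for %s"
format_template = escape_braces(raw) + " for {0}"

print(percent_template % "members")
print(format_template.format("members"))
```

Both calls print the raw text unchanged, followed by " for members".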


If you are using Python 3.6 or newer, you can use f-string:

>>> test = "have it break."
>>> selectiveEscape = f"Print percent % in sentence and not {test}"
>>> print(selectiveEscape)
Print percent % in sentence and not have it break.


I tried different methods of printing a subplot title; here is how they behave. It is different when I use LaTeX.

In a typical case, both '%%' and 'string' + '%' work.

With LaTeX, 'string' + '\%' worked.

So in a typical case:

import matplotlib.pyplot as plt
fig,ax = plt.subplots(4,1)
float_number = 4.17
ax[0].set_title('Total: (%1.2f' %float_number + '\%)')
ax[1].set_title('Total: (%1.2f%%)' %float_number)
ax[2].set_title('Total: (%1.2f' %float_number + '%%)')
ax[3].set_title('Total: (%1.2f' %float_number + '%)')

[Figure: title examples with %]

If we use LaTeX:

import matplotlib.pyplot as plt
import matplotlib
font = {'family' : 'normal',
        'weight' : 'bold',
        'size'   : 12}
matplotlib.rc('font', **font)
matplotlib.rcParams['text.usetex'] = True
matplotlib.rcParams['text.latex.unicode'] = True
fig,ax = plt.subplots(4,1)
float_number = 4.17
#ax[0].set_title('Total: (%1.2f\%)' %float_number) This makes python crash
ax[1].set_title('Total: (%1.2f%%)' %float_number)
ax[2].set_title('Total: (%1.2f' %float_number + '%%)')
ax[3].set_title('Total: (%1.2f' %float_number + '\%)')

We get this: [Figure: title example with % and LaTeX]

Question: Saving UTF-8 texts in json.dumps as UTF-8, not as \u escape sequences



sample code:

>>> import json
>>> json_string = json.dumps("ברי צקלה")
>>> print json_string
"\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"

The problem: it’s not human readable. My (smart) users want to verify or even edit text files with JSON dumps (and I’d rather not use XML).

Is there a way to serialize objects into UTF-8 JSON strings (instead of \uXXXX)?


Use the ensure_ascii=False switch to json.dumps(), then encode the value to UTF-8 manually:

>>> json_string = json.dumps("ברי צקלה", ensure_ascii=False).encode('utf8')
>>> json_string
b'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
>>> print(json_string.decode())
"ברי צקלה"

If you are writing to a file, just use json.dump() and leave it to the file object to encode:

with open('filename', 'w', encoding='utf8') as json_file:
    json.dump("ברי צקלה", json_file, ensure_ascii=False)
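
To check that the file really holds UTF-8 text rather than \u escapes, one option is to read it back both as raw bytes and as JSON. The sketch below targets Python 3; the helper names dump_utf8 and load_utf8 and the temporary file name are assumptions chosen for illustration.

```python
import json
import os
import tempfile

def dump_utf8(obj, path):
    """Write obj as JSON, keeping non-ASCII characters as real UTF-8 text."""
    with open(path, "w", encoding="utf8") as json_file:
        json.dump(obj, json_file, ensure_ascii=False)

def load_utf8(path):
    """Read the JSON file back, decoding it as UTF-8."""
    with open(path, encoding="utf8") as json_file:
        return json.load(json_file)

# Hypothetical file location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.json")
dump_utf8({"name": "ברי צקלה"}, path)

with open(path, "rb") as raw_file:
    raw_bytes = raw_file.read()

print(b"\\u" in raw_bytes)  # False: no \uXXXX escapes were written
print(load_utf8(path))      # the original dictionary round-trips
```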

Caveats for Python 2

For Python 2, there are some more caveats to take into account. If you are writing this to a file, you can use io.open() instead of open() to produce a file object that encodes Unicode values for you as you write, then use json.dump() instead to write to that file:

with io.open('filename', 'w', encoding='utf8') as json_file:
    json.dump(u"ברי צקלה", json_file, ensure_ascii=False)

Do note that there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 then is:

with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(u"ברי צקלה", ensure_ascii=False)
    # unicode(data) auto-decodes data to unicode if str
    json_file.write(unicode(data))

In Python 2, when using byte strings (type str), encoded to UTF-8, make sure to also set the encoding keyword:

>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}

>>> s=json.dumps(d, ensure_ascii=False, encoding='utf8')
>>> s
u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
>>> json.loads(s)['1']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> json.loads(s)['2']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
ברי צקלה


To write to a file

import codecs
import json

with codecs.open('your_file.txt', 'w', encoding='utf-8') as f:
    json.dump({"message":"xin chào việt nam"}, f, ensure_ascii=False)

To print to stdout

import json
print(json.dumps({"message":"xin chào việt nam"}, ensure_ascii=False))


How about unicode-escape?

>>> d = {1: "ברי צקלה", 2: u"ברי צקלה"}
>>> json_str = json.dumps(d).decode('unicode-escape').encode('utf8')
>>> print json_str
{"1": "ברי צקלה", "2": "ברי צקלה"}

Peters’ python 2 workaround fails on an edge case:

d = {u'keyword': u'bad credit  \xe7redit cards'}
with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(d, ensure_ascii=False).decode('utf8')
    try:
        json_file.write(data)
    except TypeError:
        # Decode data to Unicode first
        json_file.write(data.decode('utf8'))

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 25: ordinal not in range(128)

It was crashing on the .decode('utf8') part of line 3. I fixed the problem by making the program much simpler, avoiding that step as well as the special casing of ASCII:

with io.open('filename', 'w', encoding='utf8') as json_file:
  data = json.dumps(d, ensure_ascii=False, encoding='utf8')
  json_file.write(unicode(data))

cat filename
{"keyword": "bad credit  çredit cards"}


As of Python 3.7 the following code works fine:

from json import dumps
result = {"symbol": "ƒ"}
json_string = dumps(result, sort_keys=True, indent=2, ensure_ascii=False)
print(json_string)

Output:

{
  "symbol": "ƒ"
}
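
As a quick comparison, the sketch below (Python 3, reusing the dictionary from the answer) shows the default output next to the ensure_ascii=False output. The variable names escaped and readable are just illustrative. Both strings are valid JSON and decode to the same object; only the readability differs.

```python
import json

data = {"symbol": "ƒ"}

escaped = json.dumps(data)                       # default: non-ASCII becomes \uXXXX
readable = json.dumps(data, ensure_ascii=False)  # non-ASCII is kept as-is

print(escaped)   # {"symbol": "\u0192"}
print(readable)  # {"symbol": "ƒ"}

# Either form parses back to the same dictionary.
print(json.loads(escaped) == json.loads(readable))  # True
```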

The following is my understanding after reading the answers above and searching Google.

# coding:utf-8
r"""
@update: 2017-01-09 14:44:39
@explain: str, unicode, bytes in python2to3
    #python2 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 7: ordinal not in range(128)
    #sys.setdefaultencoding('utf-8') #python3 doesn't have this attribute.
    #not suggested even in python2 #see:http://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script
    #2.overwrite /usr/lib/python2.7/sitecustomize.py or (sitecustomize.py and PYTHONPATH=".:$PYTHONPATH" python)
    #too complex
    #3.control by your own (best)
    #==> all strings must be unicode like python3 (u'xx'|b'xx'.encode('utf-8')) (unicode has disappeared in python3)
    #see: http://blog.ernest.me/post/python-setdefaultencoding-unicode-bytes

    #how to save utf-8 texts in json.dumps as UTF8, not as \u escape sequence
"""

from __future__ import print_function
import json

a = {"b": u"中文"}  # add u for python2 compatibility
print('%r' % a)
print('%r' % json.dumps(a))
print('%r' % (json.dumps(a).encode('utf8')))
a = {"b": u"中文"}
print('%r' % json.dumps(a, ensure_ascii=False))
print('%r' % (json.dumps(a, ensure_ascii=False).encode('utf8')))
# print(a.encode('utf8')) #AttributeError: 'dict' object has no attribute 'encode'

# python2:bytes=str; python3:bytes
b = a['b'].encode('utf-8')
print('%r' % b)
print('%r' % b.decode("utf-8"))

# python2:unicode; python3:str=unicode
c = b.decode('utf-8')
print('%r' % c)
print('%r' % c.encode('utf-8'))

Python 2 output:

{'b': u'\u4e2d\u6587'}
'{"b": "\\u4e2d\\u6587"}'
'{"b": "\\u4e2d\\u6587"}'
u'{"b": "\u4e2d\u6587"}'
'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'

Python 3 output:

{'b': '中文'}
'{"b": "\\u4e2d\\u6587"}'
b'{"b": "\\u4e2d\\u6587"}'
'{"b": "中文"}'

Here’s my solution using json.dump():

def jsonWrite(p, pyobj, ensure_ascii=False, encoding=SYSTEM_ENCODING, **kwargs):
    with codecs.open(p, 'wb', 'utf_8') as fileobj:
        json.dump(pyobj, fileobj, ensure_ascii=ensure_ascii,encoding=encoding, **kwargs)

where SYSTEM_ENCODING is set to:

locale.setlocale(locale.LC_ALL, '')
SYSTEM_ENCODING = locale.getlocale()[1]


Use codecs if possible:

with codecs.open('file_path', 'a+', 'utf-8') as fp:
    fp.write(json.dumps(res, ensure_ascii=False))

Thanks for the original answer here. With python 3 the following line of code:


was OK. Consider not hard-coding too much text in the code unless it is necessary.

This might be good enough for the Python console. However, to satisfy a server you might need to set the locale as explained here (if it is on apache2): http://blog.dscpl.com.au/2014/09/setting-lang-and-lcall-when-using.html

Basically, install he_IL or whichever language locale you need on Ubuntu. First check whether it is already installed:

locale -a

If it is not, install it, where XX is your language:

sudo apt-get install language-pack-XX

For example:

sudo apt-get install language-pack-he

Add the following text to /etc/apache2/envvars:

export LANG='he_IL.UTF-8'
export LC_ALL='he_IL.UTF-8'

Then you will hopefully not get Python errors from Apache like:

print (js) UnicodeEncodeError: 'ascii' codec can't encode characters in position 41-45: ordinal not in range(128)

Also, in Apache, try to make UTF-8 the default encoding as explained here:
How to change the default encoding to UTF-8 for Apache?

Do it early, because Apache errors can be a pain to debug and you can mistakenly think they come from Python, which may not be the case in that situation.

If you are loading a JSON string from a file and the file contains Arabic text, then this will work.

Assume a file like arabic.json:

{
"key1" : "لمستخدمين",
"key2" : "إضافة مستخدم"
}

Get the Arabic contents from the arabic.json file:

with open('arabic.json', encoding='utf-8') as f:
   # deserialises it
   json_data = json.load(f)

# json formatted string
json_data2 = json.dumps(json_data, ensure_ascii = False)

To use the JSON data in a Django template, follow the steps below:

# If have to get the JSON index in Django Template file, then simply decode the encoded string.


Done! Now we can get the results as a JSON index with Arabic values.


Use unicode-escape to solve the problem:

>>> import json
>>> json_string = json.dumps("ברי צקלה")
>>> json_string.encode('ascii').decode('unicode-escape')
'"ברי צקלה"'


>>> s = '漢  χαν  хан'
>>> print('unicode: ' + s.encode('unicode-escape').decode('utf-8'))
unicode: \u6f22  \u03c7\u03b1\u03bd  \u0445\u0430\u043d

>>> u = s.encode('unicode-escape').decode('utf-8')
>>> print('original: ' + u.encode("utf-8").decode('unicode-escape'))
original: 漢  χαν  хан

Original source: https://blog.csdn.net/chuatony/article/details/72628868


Using ensure_ascii=False in json.dumps is the right direction to solve this problem, as pointed out by Martijn. However, this may raise an exception:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 1: ordinal not in range(128)

You need extra settings in either site.py or sitecustomize.py so that sys.getdefaultencoding() returns the correct value. site.py is under lib/python2.7/ and sitecustomize.py is under lib/python2.7/site-packages.

If you want to use site.py, under def setencoding(): change the first if 0: to if 1: so that Python will use your operating system's locale.

If you prefer to use sitecustomize.py, which may not exist if you haven't created it, simply put these lines in it:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Then you can produce Chinese JSON output in UTF-8, such as:

name = {"last_name": u"王"}
json.dumps(name, ensure_ascii=False)

You will get a UTF-8 encoded string, rather than a \u escaped JSON string.

To verify your default encoding:

print sys.getdefaultencoding()

You should get "utf-8" or "UTF-8", which confirms your site.py or sitecustomize.py settings.

Please note that you cannot call sys.setdefaultencoding("utf-8") in the interactive Python console.