标签归档:unicode

如何更正TypeError:散列之前必须对Unicode对象进行编码?

问题:如何更正TypeError:散列之前必须对Unicode对象进行编码?

我有这个错误:

Traceback (most recent call last):
  File "python_md5_cracker.py", line 27, in <module>
  m.update(line)
TypeError: Unicode-objects must be encoded before hashing

当我尝试在Python 3.2.2中执行以下代码时:

import hashlib, sys
m = hashlib.md5()
hash = ""
hash_file = input("What is the file name in which the hash resides?  ")
wordlist = input("What is your wordlist?  (Enter the file name)  ")
try:
  hashdocument = open(hash_file, "r")
except IOError:
  print("Invalid file.")
  raw_input()
  sys.exit()
else:
  hash = hashdocument.readline()
  hash = hash.replace("\n", "")

try:
  wordlistfile = open(wordlist, "r")
except IOError:
  print("Invalid file.")
  raw_input()
  sys.exit()
else:
  pass
for line in wordlistfile:
  # Flush the buffer (this caused a massive problem when placed 
  # at the beginning of the script, because the buffer kept getting
  # overwritten, thus comparing incorrect hashes)
  m = hashlib.md5()
  line = line.replace("\n", "")
  m.update(line)
  word_hash = m.hexdigest()
  if word_hash == hash:
    print("Collision! The word corresponding to the given hash is", line)
    input()
    sys.exit()

print("The hash given does not correspond to any supplied word in the wordlist.")
input()
sys.exit()

I have this error:

Traceback (most recent call last):
  File "python_md5_cracker.py", line 27, in <module>
  m.update(line)
TypeError: Unicode-objects must be encoded before hashing

when I try to execute this code in Python 3.2.2:

import hashlib, sys
m = hashlib.md5()
hash = ""
hash_file = input("What is the file name in which the hash resides?  ")
wordlist = input("What is your wordlist?  (Enter the file name)  ")
try:
  hashdocument = open(hash_file, "r")
except IOError:
  print("Invalid file.")
  raw_input()
  sys.exit()
else:
  hash = hashdocument.readline()
  hash = hash.replace("\n", "")

try:
  wordlistfile = open(wordlist, "r")
except IOError:
  print("Invalid file.")
  raw_input()
  sys.exit()
else:
  pass
for line in wordlistfile:
  # Flush the buffer (this caused a massive problem when placed 
  # at the beginning of the script, because the buffer kept getting
  # overwritten, thus comparing incorrect hashes)
  m = hashlib.md5()
  line = line.replace("\n", "")
  m.update(line)
  word_hash = m.hexdigest()
  if word_hash == hash:
    print("Collision! The word corresponding to the given hash is", line)
    input()
    sys.exit()

print("The hash given does not correspond to any supplied word in the wordlist.")
input()
sys.exit()

回答 0

它可能正在寻找来自的字符编码wordlistfile

wordlistfile = open(wordlist,"r",encoding='utf-8')

或者,如果您逐行工作:

line.encode('utf-8')

It is probably looking for a character encoding from wordlistfile.

wordlistfile = open(wordlist,"r",encoding='utf-8')

Or, if you’re working on a line-by-line basis:

line.encode('utf-8')

回答 1

您必须定义encoding formatlike utf-8,尝试这种简单的方法,

本示例使用SHA256算法生成一个随机数:

>>> import hashlib
>>> hashlib.sha256(str(random.getrandbits(256)).encode('utf-8')).hexdigest()
'cd183a211ed2434eac4f31b317c573c50e6c24e3a28b82ddcb0bf8bedf387a9f'

You must have to define encoding format like utf-8, Try this easy way,

This example generates a random number using the SHA256 algorithm:

>>> import hashlib
>>> hashlib.sha256(str(random.getrandbits(256)).encode('utf-8')).hexdigest()
'cd183a211ed2434eac4f31b317c573c50e6c24e3a28b82ddcb0bf8bedf387a9f'

回答 2

要存储密码(PY3):

import hashlib, os
password_salt = os.urandom(32).hex()
password = '12345'

hash = hashlib.sha512()
hash.update(('%s%s' % (password_salt, password)).encode('utf-8'))
password_hash = hash.hexdigest()

To store the password (PY3):

import hashlib, os
password_salt = os.urandom(32).hex()
password = '12345'

hash = hashlib.sha512()
hash.update(('%s%s' % (password_salt, password)).encode('utf-8'))
password_hash = hash.hexdigest()

回答 3

该错误已说明您必须执行的操作。MD5对字节进行操作,因此您必须将Unicode字符串编码为bytes,例如使用line.encode('utf-8')

The error already says what you have to do. MD5 operates on bytes, so you have to encode Unicode string into bytes, e.g. with line.encode('utf-8').


回答 4

请首先查看答案。

现在,该错误信息是明确的:你只能使用字节,而不是Python字符串(曾经被认为是unicode在Python <3),所以你必须与你的首选编码编码字符串:utf-32utf-16utf-8或受限制的甚至是一个8-位编码(有些可能称为代码页)。

从文件中读取时,Python 3将自动将单词列表文件中的字节解码为Unicode。我建议你这样做:

m.update(line.encode(wordlistfile.encoding))

因此,推送到md5算法的编码数据的编码方式与基础文件完全相同。

Please take a look first at that answer.

Now, the error message is clear: you can only use bytes, not Python strings (what used to be unicode in Python < 3), so you have to encode the strings with your preferred encoding: utf-32, utf-16, utf-8 or even one of the restricted 8-bit encodings (what some might call codepages).

The bytes in your wordlist file are being automatically decoded to Unicode by Python 3 as you read from the file. I suggest you do:

m.update(line.encode(wordlistfile.encoding))

so that the encoded data pushed to the md5 algorithm are encoded exactly like the underlying file.


回答 5

import hashlib
string_to_hash = '123'
hash_object = hashlib.sha256(str(string_to_hash).encode('utf-8'))
print('Hash', hash_object.hexdigest())
import hashlib
string_to_hash = '123'
hash_object = hashlib.sha256(str(string_to_hash).encode('utf-8'))
print('Hash', hash_object.hexdigest())

回答 6

您可以以二进制模式打开文件:

import hashlib

with open(hash_file) as file:
    control_hash = file.readline().rstrip("\n")

wordlistfile = open(wordlist, "rb")
# ...
for line in wordlistfile:
    if hashlib.md5(line.rstrip(b'\n\r')).hexdigest() == control_hash:
       # collision

You could open the file in binary mode:

import hashlib

with open(hash_file) as file:
    control_hash = file.readline().rstrip("\n")

wordlistfile = open(wordlist, "rb")
# ...
for line in wordlistfile:
    if hashlib.md5(line.rstrip(b'\n\r')).hexdigest() == control_hash:
       # collision

回答 7

编码这行为我固定。

m.update(line.encode('utf-8'))

encoding this line fixed it for me.

m.update(line.encode('utf-8'))

回答 8

如果是单行字符串。用b或B包裹它。例如:

variable = b"This is a variable"

要么

variable2 = B"This is also a variable"

If it’s a single line string. wrapt it with b or B. e.g:

variable = b"This is a variable"

or

variable2 = B"This is also a variable"

回答 9

该程序是上述MD5破解程序的无错误和增强版本,该MD5破解程序读取包含哈希密码列表的文件,并根据英语词典单词列表中的哈希单词检查该文件。希望对您有所帮助。

我从以下链接https://github.com/dwyl/english-words下载了英语词典

# md5cracker.py
# English Dictionary https://github.com/dwyl/english-words 

import hashlib, sys

hash_file = 'exercise\hashed.txt'
wordlist = 'data_sets\english_dictionary\words.txt'

try:
    hashdocument = open(hash_file,'r')
except IOError:
    print('Invalid file.')
    sys.exit()
else:
    count = 0
    for hash in hashdocument:
        hash = hash.rstrip('\n')
        print(hash)
        i = 0
        with open(wordlist,'r') as wordlistfile:
            for word in wordlistfile:
                m = hashlib.md5()
                word = word.rstrip('\n')            
                m.update(word.encode('utf-8'))
                word_hash = m.hexdigest()
                if word_hash==hash:
                    print('The word, hash combination is ' + word + ',' + hash)
                    count += 1
                    break
                i += 1
        print('Itiration is ' + str(i))
    if count == 0:
        print('The hash given does not correspond to any supplied word in the wordlist.')
    else:
        print('Total passwords identified is: ' + str(count))
sys.exit()

This program is the bug free and enhanced version of the above MD5 cracker that reads the file containing list of hashed passwords and checks it against hashed word from the English dictionary word list. Hope it is helpful.

I downloaded the English dictionary from the following link https://github.com/dwyl/english-words

# md5cracker.py
# English Dictionary https://github.com/dwyl/english-words 

import hashlib, sys

hash_file = 'exercise\hashed.txt'
wordlist = 'data_sets\english_dictionary\words.txt'

try:
    hashdocument = open(hash_file,'r')
except IOError:
    print('Invalid file.')
    sys.exit()
else:
    count = 0
    for hash in hashdocument:
        hash = hash.rstrip('\n')
        print(hash)
        i = 0
        with open(wordlist,'r') as wordlistfile:
            for word in wordlistfile:
                m = hashlib.md5()
                word = word.rstrip('\n')            
                m.update(word.encode('utf-8'))
                word_hash = m.hexdigest()
                if word_hash==hash:
                    print('The word, hash combination is ' + word + ',' + hash)
                    count += 1
                    break
                i += 1
        print('Itiration is ' + str(i))
    if count == 0:
        print('The hash given does not correspond to any supplied word in the wordlist.')
    else:
        print('Total passwords identified is: ' + str(count))
sys.exit()

Unicode(UTF-8)用Python读写文件

问题:Unicode(UTF-8)用Python读写文件

我在理解将文本写入文件和将文件写入文件时遇到了一些大脑故障(Python 2.4)。

# The string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)

(“ u’Capit \ xe1n’”,“’Capit \ xc3 \ xa1n’”)

print ss, ss8
print >> open('f1','w'), ss8

>>> file('f1').read()
'Capit\xc3\xa1n\n'

因此,我Capit\xc3\xa1n在文件f2 中输入我最喜欢的编辑器。

然后:

>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'

我在这里不明白什么?显然,我缺少一些至关重要的魔术(或理智)。一种类型的文本文件可以正确转换?

我真正无法理解的是UTF-8表示法的意义所在,如果您实际上无法让Python识别它的话(如果它来自外部)。也许我应该只将JSON转储字符串,然后使用它,因为它具有可表示性!更重要的是,当来自文件时,Python是否会识别并解码该Unicode对象的ASCII表示形式?如果是这样,我如何得到它?

>>> print simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'

I’m having some brain failure in understanding reading and writing text to a file (Python 2.4).

# The string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)

(“u’Capit\xe1n'”, “‘Capit\xc3\xa1n'”)

print ss, ss8
print >> open('f1','w'), ss8

>>> file('f1').read()
'Capit\xc3\xa1n\n'

So I type in Capit\xc3\xa1n into my favorite editor, in file f2.

Then:

>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'

What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I’m missing. What does one type into text files to get proper conversions?

What I’m truly failing to grok here, is what the point of the UTF-8 representation is, if you can’t actually get Python to recognize it, when it comes from outside. Maybe I should just JSON dump the string, and use that instead, since that has an asciiable representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode, when coming in from a file? If so, how do I get it?

>>> print simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'

回答 0

在符号中

u'Capit\xe1n\n'

“ \ xe1”仅代表一个字节。“ \ x”告诉您“ e1”为十六进制。当你写

Capit\xc3\xa1n

到您的文件中,您有“ \ xc3”。这些是4个字节,在您的代码中,您全部读取了它们。显示它们时可以看到以下内容:

>>> open('f2').read()
'Capit\\xc3\\xa1n\n'

您可以看到反斜杠被反斜杠转义了。因此,您的字符串中有四个字节:“ \”,“ x”,“ c”和“ 3”。

编辑:

正如其他人在他们的答案中指出的那样,您只需要在编辑器中输入字符,然后您的编辑器就应处理到UTF-8的转换并保存。

如果您实际上有这种格式的字符串,则可以使用string_escape编解码器将其解码为普通字符串:

In [15]: print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán

结果是一个以UTF-8编码的字符串,其中重音字符由\\xc3\\xa1原始字符串中写入的两个字节表示。如果要使用unicode字符串,则必须使用UTF-8再次解码。

编辑:您的文件中没有UTF-8。实际查看它的外观:

s = u'Capit\xe1n\n'
sutf8 = s.encode('UTF-8')
open('utf-8.out', 'w').write(sutf8)

将文件utf-8.out内容与使用编辑器保存的文件内容进行比较。

In the notation

u'Capit\xe1n\n'

the “\xe1” represents just one byte. “\x” tells you that “e1” is in hexadecimal. When you write

Capit\xc3\xa1n

into your file you have “\xc3” in it. Those are 4 bytes and in your code you read them all. You can see this when you display them:

>>> open('f2').read()
'Capit\\xc3\\xa1n\n'

You can see that the backslash is escaped by a backslash. So you have four bytes in your string: “\”, “x”, “c” and “3”.

Edit:

As others pointed out in their answers you should just enter the characters in the editor and your editor should then handle the conversion to UTF-8 and save it.

If you actually have a string in this format you can use the string_escape codec to decode it into a normal string:

In [15]: print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán

The result is a string that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. If you want to have a unicode string you have to decode again with UTF-8.

To your edit: you don’t have UTF-8 in your file. To actually see how it would look like:

s = u'Capit\xe1n\n'
sutf8 = s.encode('UTF-8')
open('utf-8.out', 'w').write(sutf8)

Compare the content of the file utf-8.out to the content of the file you saved with your editor.


回答 1

我发现打开文件时更容易指定编码,而不是搞乱编码和解码方法。该io模块(Python 2.6中添加)提供了一个io.open函数,该函数具有一个编码参数。

使用io模块中的open方法。

>>>import io
>>>f = io.open("test", mode="r", encoding="utf-8")

然后,在调用f的read()函数之后,将返回一个编码的Unicode对象。

>>>f.read()
u'Capit\xe1l\n\n'

请注意,在Python 3中,该io.open函数是内置函数的别名open。内置的open函数仅在Python 3中支持encoding参数,而在Python 2中不支持。

编辑:以前此答案推荐编解码器模块。该混合编解码器时,模块可能会造成问题read()readline(),所以这个答案现在建议的IO模块来代替。

使用编解码器模块中的open方法。

>>>import codecs
>>>f = codecs.open("test", "r", "utf-8")

然后,在调用f的read()函数之后,将返回一个编码的Unicode对象。

>>>f.read()
u'Capit\xe1l\n\n'

如果您知道文件的编码,那么使用编解码器软件包将减少混乱。

请参阅http://docs.python.org/library/codecs.html#codecs.open

Rather than mess with the encode and decode methods I find it easier to specify the encoding when opening the file. The io module (added in Python 2.6) provides an io.open function, which has an encoding parameter.

Use the open method from the io module.

>>>import io
>>>f = io.open("test", mode="r", encoding="utf-8")

Then after calling f’s read() function, an encoded Unicode object is returned.

>>>f.read()
u'Capit\xe1l\n\n'

Note that in Python 3, the io.open function is an alias for the built-in open function. The built-in open function only supports the encoding argument in Python 3, not Python 2.

Edit: Previously this answer recommended the codecs module. The codecs module can cause problems when mixing read() and readline(), so this answer now recommends the io module instead.

Use the open method from the codecs module.

>>>import codecs
>>>f = codecs.open("test", "r", "utf-8")

Then after calling f’s read() function, an encoded Unicode object is returned.

>>>f.read()
u'Capit\xe1l\n\n'

If you know the encoding of a file, using the codecs package is going to be much less confusing.

See http://docs.python.org/library/codecs.html#codecs.open


回答 2

现在,您在Python3中所需的就是 open(Filename, 'r', encoding='utf-8')

[在2016-02-10上进行编辑以要求澄清]

Python3在其open函数中添加了encoding参数。从此处收集了有关open函数的以下信息:https : //docs.python.org/3/library/functions.html#open

open(file, mode='r', buffering=-1, 
      encoding=None, errors=None, newline=None, 
      closefd=True, opener=None)

编码是用于解码或编码文件的编码名称。仅应在文本模式下使用。默认编码取决于平台(无论locale.getpreferredencoding() 返回什么),但是可以使用Python支持的任何文本编码。有关支持的编码列表,请参见编解码器模块。

因此,通过向encoding='utf-8'open函数添加参数,所有文件的读取和写入操作都将以utf8的形式完成(现在,这也是使用Python完成的所有操作的默认编码。)

Now all you need in Python3 is open(Filename, 'r', encoding='utf-8')

[Edit on 2016-02-10 for requested clarification]

Python3 added the encoding parameter to its open function. The following information about the open function is gathered from here: https://docs.python.org/3/library/functions.html#open

open(file, mode='r', buffering=-1, 
      encoding=None, errors=None, newline=None, 
      closefd=True, opener=None)

Encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.

So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as utf8 (which is also now the default encoding of everything done in Python.)


回答 3

因此,我找到了所需的解决方案,即:

print open('f2').read().decode('string-escape').decode("utf-8")

这里有一些不常用的编解码器。这种特殊的阅读方式允许人们从Python内部获取UTF-8表示形式,将其复制到ASCII文件中,然后将其读入Unicode。在“字符串转义”解码下,斜杠不会加倍。

这允许我想象中的那种往返。

So, I’ve found a solution for what I’m looking for, which is:

print open('f2').read().decode('string-escape').decode("utf-8")

There are some unusual codecs that are useful here. This particular reading allows one to take UTF-8 representations from within Python, copy them into an ASCII file, and have them be read in to Unicode. Under the “string-escape” decode, the slashes won’t be doubled.

This allows for the sort of round trip that I was imagining.


回答 4

# -*- encoding: utf-8 -*-

# converting a unknown formatting file in utf-8

import codecs
import commands

file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)

file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location+"b", 'w', 'utf-8')

for l in file_stream:
    file_output.write(l)

file_stream.close()
file_output.close()
# -*- encoding: utf-8 -*-

# converting a unknown formatting file in utf-8

import codecs
import commands

file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)

file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location+"b", 'w', 'utf-8')

for l in file_stream:
    file_output.write(l)

file_stream.close()
file_output.close()

回答 5

实际上,这对于在Python 3.2中读取UTF-8编码的文件非常有用:

import codecs
f = codecs.open('file_name.txt', 'r', 'UTF-8')
for line in f:
    print(line)

Actually, this worked for me for reading a file with UTF-8 encoding in Python 3.2:

import codecs
f = codecs.open('file_name.txt', 'r', 'UTF-8')
for line in f:
    print(line)

回答 6

要读取Unicode字符串然后发送到HTML,我这样做:

fileline.decode("utf-8").encode('ascii', 'xmlcharrefreplace')

对于由python驱动的http服务器有用。

To read in an Unicode string and then send to HTML, I did this:

fileline.decode("utf-8").encode('ascii', 'xmlcharrefreplace')

Useful for python powered http servers.


回答 7

您已经迷惑了编码的一般问题:如何确定文件是哪种编码?

答:除非文件格式为此提供,否则您不能这样做。例如,XML以:

<?xml encoding="utf-8"?>

仔细选择了此标头,以便无论编码方式都可以读取它。在您的情况下,没有这样的提示,因此您的编辑器和Python都不知道发生了什么。因此,您必须使用codecs模块并使用codecs.open(path,mode,encoding)它提供Python中缺少的位。

对于您的编辑器,您必须检查它是否提供某种方式来设置文件的编码。

UTF-8的重点是能够将21位字符(Unicode)编码为8位数据流(因为这是世界上所有计算机只能处理的事情)。但是,由于大多数操作系统早于Unicode时代,因此它们没有合适的工具将编码信息附加到硬盘上的文件中。

下一个问题是Python中的表示形式。heikogerlach评论中对此做了完美解释。您必须了解控制台只能显示ASCII。为了显示Unicode或> = charcode 128的任何内容,它必须使用某种转义方法。在编辑器中,您不得键入转义的显示字符串,而应输入字符串的含义(在这种情况下,必须输入变音符号并保存文件)。

也就是说,您可以使用Python函数eval()将转义的字符串转换为字符串:

>>> x = eval("'Capit\\xc3\\xa1n\\n'")
>>> x
'Capit\xc3\xa1n\n'
>>> x[5]
'\xc3'
>>> len(x[5])
1

如您所见,字符串“ \ xc3”已变成单个字符。现在,这是一个8位字符串,采用UTF-8编码。要获取Unicode:

>>> x.decode('utf-8')
u'Capit\xe1n\n'

Gregg Lind问:我认为这里缺少一些内容:文件f2包含:十六进制:

0000000: 4361 7069 745c 7863 335c 7861 316e  Capit\xc3\xa1n

codecs.open('f2','rb', 'utf-8'),例如,将它们全部读取到一个单独的字符中(期望),是否有任何方法可以用ASCII写入文件?

答:这取决于您的意思。ASCII不能表示大于127的字符。因此,您需要某种方式来表示“接下来的几个字符表示特殊的含义”,这就是序列“ \ x”的作用。它说:接下来的两个字符是单个字符的代码。“ \ u”使用四个字符对最多0xFFFF(65535)的Unicode进行编码。

因此,您不能直接将Unicode写为ASCII(因为ASCII根本不包含相同的字符)。您可以将其写为字符串转义符(如f2所示);在这种情况下,文件可以表示为ASCII。或者您可以将其编写为UTF-8,在这种情况下,您需要8位安全流。

您的解决方案使用decode('string-escape')确实可以,但是您必须知道使用了多少内存:使用量的三倍codecs.open()

请记住,文件只是一个具有8位的字节序列。位和字节都没有意义。是您说“ 65代表’A’”。由于\xc3\xa1应该变成“à”,但是计算机无法识别,因此必须通过指定在写入文件时使用的编码来告诉它。

You have stumbled over the general problem with encodings: How can I tell in which encoding a file is?

Answer: You can’t unless the file format provides for this. XML, for example, begins with:

<?xml encoding="utf-8"?>

This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path,mode,encoding) which provides the missing bit in Python.

As for your editor, you must check if it offers some way to set the encoding of a file.

The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that’s the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don’t have suitable tools to attach the encoding information to files on the hard disk.

The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that your console can only display ASCII. In order to display Unicode or anything >= charcode 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the umlaut and save the file).

That said, you can use the Python function eval() to turn an escaped string into a string:

>>> x = eval("'Capit\\xc3\\xa1n\\n'")
>>> x
'Capit\xc3\xa1n\n'
>>> x[5]
'\xc3'
>>> len(x[5])
1

As you can see, the string “\xc3” has been turned into a single character. This is now an 8-bit string, UTF-8 encoded. To get Unicode:

>>> x.decode('utf-8')
u'Capit\xe1n\n'

Gregg Lind asked: I think there are some pieces missing here: the file f2 contains: hex:

0000000: 4361 7069 745c 7863 335c 7861 316e  Capit\xc3\xa1n

codecs.open('f2','rb', 'utf-8'), for example, reads them all in a separate chars (expected) Is there any way to write to a file in ASCII that would work?

Answer: That depends on what you mean. ASCII can’t represent characters > 127. So you need some way to say “the next few characters mean something special” which is what the sequence “\x” does. It says: The next two characters are the code of a single character. “\u” does the same using four characters to encode Unicode up to 0xFFFF (65535).

So you can’t directly write Unicode to ASCII (because ASCII simply doesn’t contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case, you need an 8-bit safe stream.

Your solution using decode('string-escape') does work, but you must be aware how much memory you use: Three times the amount of using codecs.open().

Remember that a file is just a sequence of bytes with 8 bits. Neither the bits nor the bytes have a meaning. It’s you who says “65 means ‘A'”. Since \xc3\xa1 should become “à” but the computer has no means to know, you must tell it by specifying the encoding which was used when writing the file.


回答 8

除了之外codecs.open(),可以使用io.open()Python2或Python3来读取/写入unicode文件

import io

text = u'á'
encoding = 'utf8'

with io.open('data.txt', 'w', encoding=encoding, newline='\n') as fout:
    fout.write(text)

with io.open('data.txt', 'r', encoding=encoding, newline='\n') as fin:
    text2 = fin.read()

assert text == text2

except for codecs.open(), one can uses io.open() to work with Python2 or Python3 to read / write unicode file

example

import io

text = u'á'
encoding = 'utf8'

with io.open('data.txt', 'w', encoding=encoding, newline='\n') as fout:
    fout.write(text)

with io.open('data.txt', 'r', encoding=encoding, newline='\n') as fin:
    text2 = fin.read()

assert text == text2

回答 9

好吧,您最喜欢的文本编辑器没有意识到这\xc3\xa1应该是字符文字,而是将它们解释为文本。这就是为什么在最后一行得到双反斜杠的原因-它现在是xc3文件中的真实反斜杠+ 等。

如果要用Python读写编码文件,最好使用编解码器模块。

在终端和应用程序之间粘贴文本很困难,因为您不知道哪个程序将使用哪种编码来解释您的文本。您可以尝试以下方法:

>>> s = file("f1").read()
>>> print unicode(s, "Latin-1")
Capitán

然后将此字符串粘贴到编辑器中,并确保使用Latin-1将其存储。在剪贴板不乱码的假设下,往返应该起作用。

Well, your favorite text editor does not realize that \xc3\xa1 are supposed to be character literals, but it interprets them as text. That’s why you get the double backslashes in the last line — it’s now a real backslash + xc3, etc. in your file.

If you want to read and write encoded files in Python, best use the codecs module.

Pasting text between the terminal and applications is difficult, because you don’t know which program will interpret your text using which encoding. You could try the following:

>>> s = file("f1").read()
>>> print unicode(s, "Latin-1")
Capitán

Then paste this string into your editor and make sure that it stores it using Latin-1. Under the assumption that the clipboard does not garble the string, the round trip should work.


回答 10

\ x ..序列特定于Python。这不是通用字节转义序列。

实际输入UTF-8编码的非ASCII的方式取决于您的操作系统和/或编辑器。这是您在Windows中的操作方法。对于OS X进入一个带有尖音,你可以点击option+ E,然后A在OS X的支持UTF-8,而几乎所有的文本编辑器。

The \x.. sequence is something that’s specific to Python. It’s not a universal byte escape sequence.

How you actually enter in UTF-8-encoded non-ASCII depends on your OS and/or your editor. Here’s how you do it in Windows. For OS X to enter a with an acute accent you can just hit option + E, then A, and almost all text editors in OS X support UTF-8.


回答 11

您还可以open()通过使用该partial函数替换原来的函数,从而改进原始函数以使用Unicode文件。该解决方案的优点在于您无需更改任何旧代码。它是透明的。

import codecs
import functools
open = functools.partial(codecs.open, encoding='utf-8')

You can also improve the original open() function to work with Unicode files by replacing it in place, using the partial function. The beauty of this solution is you don’t need to change any old code. It’s transparent.

import codecs
import functools
open = functools.partial(codecs.open, encoding='utf-8')

回答 12

我试图使用Python 2.7.9 解析iCal

从icalendar导入日历

但是我得到了:

 Traceback (most recent call last):
 File "ical.py", line 92, in parse
    print "{}".format(e[attr])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 7: ordinal not in range(128)

它被固定为:

print "{}".format(e[attr].encode("utf-8"))

(现在,它可以打印likéáböss了。)

I was trying to parse iCal using Python 2.7.9:

from icalendar import Calendar

But I was getting:

 Traceback (most recent call last):
 File "ical.py", line 92, in parse
    print "{}".format(e[attr])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 7: ordinal not in range(128)

and it was fixed with just:

print "{}".format(e[attr].encode("utf-8"))

(Now it can print liké á böss.)


回答 13

通过将整个脚本的默认编码更改为’UTF-8’,我找到了最简单的方法:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

任何openprint或其他语句将只使用utf8

至少适用于Python 2.7.9

Thx转到https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/(看看结尾)。

I found the most simple approach by changing the default encoding of the whole script to be ‘UTF-8’:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

any open, print or other statement will just use utf8.

Works at least for Python 2.7.9.

Thx goes to https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/ (look at the end).


将Unicode字符串转换为Python中的字符串(包含多余的符号)

问题:将Unicode字符串转换为Python中的字符串(包含多余的符号)

如何将Unicode字符串(包含额外的字符,如£$等)转换为Python字符串?

How do you convert a Unicode string (containing extra characters like £ $, etc.) into a Python string?


回答 0

title = u"Klüft skräms inför på fédéral électoral große"
import unicodedata
unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'
title = u"Klüft skräms inför på fédéral électoral große"
import unicodedata
unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'

回答 1

如果不需要翻译非ASCII字符,则可以使用编码为ASCII:

>>> a=u"aaaàçççñññ"
>>> type(a)
<type 'unicode'>
>>> a.encode('ascii','ignore')
'aaa'
>>> a.encode('ascii','replace')
'aaa???????'
>>>

You can use encode to ASCII if you don’t need to translate the non-ASCII characters:

>>> a=u"aaaàçççñññ"
>>> type(a)
<type 'unicode'>
>>> a.encode('ascii','ignore')
'aaa'
>>> a.encode('ascii','replace')
'aaa???????'
>>>

回答 2

>>> text=u'abcd'
>>> str(text)
'abcd'

如果字符串仅包含ascii字符。

>>> text=u'abcd'
>>> str(text)
'abcd'

If the string only contains ascii characters.


回答 3

如果您有Unicode字符串,并且想要将其写入文件或其他序列化形式,则必须首先将其编码为可以存储的特定表示形式。有几种常见的Unicode编码,例如UTF-16(大多数Unicode字符使用两个字节)或UTF-8(1-4个字节/代码点,取决于字符)等。要将该字符串转换为特定的编码,您可以可以使用:

>>> s= u'£10'
>>> s.encode('utf8')
'\xc2\x9c10'
>>> s.encode('utf16')
'\xff\xfe\x9c\x001\x000\x00'

可以将此原始字节字符串写入文件。但是,请注意,当读回它时,您必须知道它所使用的编码并使用相同的编码对其进行解码。

写入文件时,您可以使用编解码器模块摆脱此手动编码/解码过程。因此,要打开将所有Unicode字符串编码为UTF-8的文件,请使用:

import codecs
f = codecs.open('path/to/file.txt','w','utf8')
f.write(my_unicode_string)  # Stored on disk as UTF-8

请注意,正在使用这些文件的其他任何文件,如果要读取它们,都必须了解文件的编码格式。如果您是唯一一个进行读/写的人,那么这不是问题,否则请确保以一种其他任何使用文件都可以理解的形式书写。

在Python 3中,这种形式的文件访问是默认的,并且内置open函数将采用编码参数,并始终与以文本模式打开的文件在Unicode字符串(Python 3中的默认字符串对象)之间进行转换。

If you have a Unicode string, and you want to write this to a file, or other serialised form, you must first encode it into a particular representation that can be stored. There are several common Unicode encodings, such as UTF-16 (uses two bytes for most Unicode characters) or UTF-8 (1-4 bytes / codepoint depending on the character), etc. To convert that string into a particular encoding, you can use:

>>> s= u'£10'
>>> s.encode('utf8')
'\xc2\x9c10'
>>> s.encode('utf16')
'\xff\xfe\x9c\x001\x000\x00'

This raw string of bytes can be written to a file. However, note that when reading it back, you must know what encoding it is in and decode it using that same encoding.

When writing to files, you can get rid of this manual encode/decode process by using the codecs module. So, to open a file that encodes all Unicode strings into UTF-8, use:

import codecs
f = codecs.open('path/to/file.txt','w','utf8')
f.write(my_unicode_string)  # Stored on disk as UTF-8

Do note that anything else that is using these files must understand what encoding the file is in if they want to read them. If you are the only one doing the reading/writing this isn’t a problem, otherwise make sure that you write in a form understandable by whatever else uses the files.

In Python 3, this form of file access is the default, and the built-in open function will take an encoding parameter and always translate to/from Unicode strings (the default string object in Python 3) for files opened in text mode.


回答 4

这是一个例子:

>>> u = u'€€€'
>>> s = u.encode('utf8')
>>> s
'\xe2\x82\xac\xe2\x82\xac\xe2\x82\xac'

Here is an example:

>>> u = u'€€€'
>>> s = u.encode('utf8')
>>> s
'\xe2\x82\xac\xe2\x82\xac\xe2\x82\xac'

回答 5

好吧,如果您愿意/准备切换到Python 3(可能不是由于与某些Python 2代码的向后不兼容),则不必进行任何转换。Python 3中的所有文本均以Unicode字符串表示,这也意味着不再使用该u'<text>'语法。实际上,您还有字节字符串,用于表示数据(可以是编码字符串)。

http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8位

(当然,如果您当前使用的是Python 3,则问题可能与您尝试将文本保存到文件中有关。)

Well, if you’re willing/ready to switch to Python 3 (which you may not be due to the backwards incompatibility with some Python 2 code), you don’t have to do any converting; all text in Python 3 is represented with Unicode strings, which also means that there’s no more usage of the u'<text>' syntax. You also have what are, in effect, strings of bytes, which are used to represent data (which may be an encoded string).

http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

(Of course, if you’re currently using Python 3, then the problem is likely something to do with how you’re attempting to save the text to a file.)


回答 6

这是一个示例代码

import unicodedata    
raw_text = u"here $%6757 dfgdfg"
convert_text = unicodedata.normalize('NFKD', raw_text).encode('ascii','ignore')

Here is an example code

import unicodedata    
raw_text = u"here $%6757 dfgdfg"
convert_text = unicodedata.normalize('NFKD', raw_text).encode('ascii','ignore')

回答 7

文件包含Unicode字符串

\"message\": \"\\u0410\\u0432\\u0442\\u043e\\u0437\\u0430\\u0446\\u0438\\u044f .....\",

为了我

 f = open("56ad62-json.log", encoding="utf-8")
 qq=f.readline() 

 print(qq)                          
 {"log":\"message\": \"\\u0410\\u0432\\u0442\\u043e\\u0440\\u0438\\u0437\\u0430\\u0446\\u0438\\u044f \\u043f\\u043e\\u043b\\u044c\\u0437\\u043e\\u0432\\u0430\\u0442\\u0435\\u043b\\u044f\"}

(qq.encode().decode("unicode-escape").encode().decode("unicode-escape")) 
# '{"log":"message": "Авторизация пользователя"}\n'

file contain unicode-esaped string

\"message\": \"\\u0410\\u0432\\u0442\\u043e\\u0437\\u0430\\u0446\\u0438\\u044f .....\",

for me

 f = open("56ad62-json.log", encoding="utf-8")
 qq=f.readline() 

 print(qq)                          
 {"log":\"message\": \"\\u0410\\u0432\\u0442\\u043e\\u0440\\u0438\\u0437\\u0430\\u0446\\u0438\\u044f \\u043f\\u043e\\u043b\\u044c\\u0437\\u043e\\u0432\\u0430\\u0442\\u0435\\u043b\\u044f\"}

(qq.encode().decode("unicode-escape").encode().decode("unicode-escape")) 
# '{"log":"message": "Авторизация пользователя"}\n'

回答 8

对于我的情况,没有答案可用。在这里,我有一个包含unichar字符的字符串变量,在此没有解释的encoding-decode起作用。

如果我在航站楼里

echo "no me llama mucho la atenci\u00f3n"

要么

python3
>>> print("no me llama mucho la atenci\u00f3n")

输出正确:

output: no me llama mucho la atención

但是使用脚本加载此字符串变量无法正常工作。

这是对我的案例起作用的,以防万一:

string_to_convert = "no me llama mucho la atenci\u00f3n"
print(json.dumps(json.loads(r'"%s"' % string_to_convert), ensure_ascii=False))
output: no me llama mucho la atención

No answere worked for my case, where I had a string variable containing unicode chars, and no encode-decode explained here did the work.

If I do in a Terminal

echo "no me llama mucho la atenci\u00f3n"

or

python3
>>> print("no me llama mucho la atenci\u00f3n")

The output is correct:

output: no me llama mucho la atención

But working with scripts loading this string variable didn’t work.

This is what worked on my case, in case helps anybody:

string_to_convert = "no me llama mucho la atenci\u00f3n"
print(json.dumps(json.loads(r'"%s"' % string_to_convert), ensure_ascii=False))
output: no me llama mucho la atención

移除Python unicode字符串中的重音符号的最佳方法是什么?

问题:移除Python unicode字符串中的重音符号的最佳方法是什么?

我在Python中有一个Unicode字符串,我想删除所有的重音符号(变音符号)。

我在网上发现了一种用Java实现此目的的优雅方法:

  1. 将Unicode字符串转换为长规范化格式(带有单独的字母和变音符号)
  2. 删除Unicode类型为“变音符号”的所有字符。

我是否需要安装pyICU之类的库,还是仅使用python标准库就可以?那python 3呢?

重要说明:我想避免使用带有重音符号到非重音符号的显式映射的代码。

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the Web an elegant way to do this in Java:

  1. convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
  2. remove all the characters whose Unicode type is “diacritic”.

Do I need to install a library such as pyICU or is this possible with just the python standard library? And what about python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.


回答 0

Unidecode是正确的答案。它将所有unicode字符串音译为ASCII文本中最接近的可能表示形式。

例:

accented_string = u'Málaga'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga'and is of type 'str'

Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ascii text.

Example:

accented_string = u'Málaga'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga'and is of type 'str'

回答 1

这个怎么样:

import unicodedata
def strip_accents(s):
   return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')

这也适用于希腊字母:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>> 

字符类别 “锰”表示Nonspacing_Mark,这是类似于MiniQuark的答案unicodedata.combining(我没想到unicodedata.combining的,但它可能是更好的解决方案,因为它更明确)。

请记住,这些操作可能会大大改变文本的含义。口音,Umlauts等不是“装饰”。

How about this:

import unicodedata
def strip_accents(s):
   return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')

This works on greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>> 

The character category “Mn” stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark’s answer (I didn’t think of unicodedata.combining, but it is probably the better solution, because it’s more explicit).

And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not “decoration”.


回答 2

我刚刚在网上找到了这个答案:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

它可以正常工作(例如,对于法语),但是我认为第二步(删除重音符号)比丢弃非ASCII字符要好,因为这对于某些语言(例如希腊文)会失败。最好的解决方案可能是显式删除标记为变音符号的unicode字符。

编辑:这可以解决问题:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

unicodedata.combining(c)如果该字符c可以与前面的字符组合,则返回true ,这主要是如果它是一个变音符。

编辑2remove_accents需要一个unicode字符串,而不是字节字符串。如果您有字节字符串,则必须将其解码为一个unicode字符串,如下所示:

encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"café"  # or simply "café" before python 3.
unicode_string = byte_string.decode(encoding)

I just found this answer on the Web:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as being diacritics.

Edit: this does the trick:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

unicodedata.combining(c) will return true if the character c can be combined with the preceding character, that is mainly if it’s a diacritic.

Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:

encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"café"  # or simply "café" before python 3.
unicode_string = byte_string.decode(encoding)

回答 3

实际上,我正在开发与项目兼容的python 2.6、2.7和3.4,并且必须从免费用户条目中创建ID。

多亏了您,我创建了一个可以实现奇迹的功能。

import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError): # unicode is a default on python 3 
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text

结果:

text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
>>> 'montreal_uber_1289_mere_francoise_noel_889'

Actually I work on project compatible python 2.6, 2.7 and 3.4 and I have to create IDs from free user entries.

Thanks to you, I have created this function that works wonders.

import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError): # unicode is a default on python 3 
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text

result:

text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
>>> 'montreal_uber_1289_mere_francoise_noel_889'

回答 4

这不仅处理重音,而且还处理“笔画”(如ø等):

import unicodedata as ud

def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(char)
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
        try:
            char = ud.lookup(desc)
        except KeyError:
            pass  # removing "WITH ..." produced an invalid name
    return char

这是我能想到的最优雅的方式(alexis在此页的评论中已经提到),尽管我认为这确实不是很优雅。实际上,正如注释中所指出的那样,这更像是一种黑客,因为Unicode名称是–实际上只是名称,它们不能保证其一致性或任何形式。

由于它们的Unicode名称中不包含“ WITH”,因此仍有一些特殊的字母无法对此进行处理,例如转弯和倒转字母。无论如何,这取决于您想做什么。有时我需要重音符号来实现字典的排序顺序。

编辑说明:

合并了注释中的建议(处理查找错误,Python-3代码)。

This handles not only accents, but also “strokes” (as in ø etc.):

import unicodedata as ud

def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(char)
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
        try:
            char = ud.lookup(desc)
        except KeyError:
            pass  # removing "WITH ..." produced an invalid name
    return char

This is the most elegant way I can think of (and it has been mentioned by alexis in a comment on this page), although I don’t think it is very elegant indeed. In fact, it’s more of a hack, as pointed out in comments, since Unicode names are – really just names, they give no guarantee to be consistent or anything.

There are still special letters that are not handled by this, such as turned and inverted letters, since their unicode name does not contain ‘WITH’. It depends on what you want to do anyway. I sometimes needed accent stripping for achieving dictionary sort order.

EDIT NOTE:

Incorporated suggestions from the comments (handling lookup errors, Python-3 code).


回答 5

回应@MiniQuark的回答:

我试图读取一个半法语的csv文件(包含重音符号)以及一些最终会变成整数和浮点数的字符串。作为测试,我创建了一个如下所示的test.txt文件:

蒙特利尔,于伯,12.89,梅尔,弗朗索瓦,诺尔,889

我必须包括行23使其起作用(在python票证中找到),并包含@Jabba的注释:

import sys 
reload(sys) 
sys.setdefaultencoding("utf-8")
import csv
import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)

结果:

Montreal
uber
12.89
Mere
Francoise
noel
889

(注意:我在Mac OS X 10.8.4上并使用Python 2.7.3)

In response to @MiniQuark’s answer:

I was trying to read in a csv file that was half-French (containing accents) and also some strings which would eventually become integers and floats. As a test, I created a test.txt file that looked like this:

Montréal, über, 12.89, Mère, Françoise, noël, 889

I had to include lines 2 and 3 to get it to work (which I found in a python ticket), as well as incorporate @Jabba’s comment:

import sys 
reload(sys) 
sys.setdefaultencoding("utf-8")
import csv
import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)

The result:

Montreal
uber
12.89
Mere
Francoise
noel
889

(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)


回答 6

gensim.utils.deaccent(文本)Gensim -人类主题建模

'Sef chomutovskych komunistu dostal postou bily prasek'

另一个解决方案是unidecode

需要注意的是,用建议的解决方案unicodedata通常只在某些字符去掉口音(例如,它变成'ł''',而不是进入'l')。

gensim.utils.deaccent(text) from Gensim – topic modelling for humans:

'Sef chomutovskych komunistu dostal postou bily prasek'

Another solution is unidecode.

Note that the suggested solution with unicodedata typically removes accents only in some character (e.g. it turns 'ł' into '', rather than into 'l').


回答 7

一些语言结合了变音符号作为语言字母和重音符号来指定重音。

我认为更明确地指定要去除的折光度数更安全:

def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))

Some languages have combining diacritics as language letters and accent diacritics to specify accent.

I think it is more safe to specify explicitly what diactrics you want to strip:

def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))

将utf-8文本保存在json.dumps中为UTF8,而不是\ u转义序列

问题:将utf-8文本保存在json.dumps中为UTF8,而不是\ u转义序列

样例代码:

>>> import json
>>> json_string = json.dumps("ברי צקלה")
>>> print json_string
"\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"

问题:这不是人类可读的。我的(智能)用户想要使用JSON转储来验证甚至编辑文本文件(我宁愿不使用XML)。

有没有一种方法可以将对象序列化为UTF-8 JSON字符串(而不是 \uXXXX)?

sample code:

>>> import json
>>> json_string = json.dumps("ברי צקלה")
>>> print json_string
"\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"

The problem: it’s not human readable. My (smart) users want to verify or even edit text files with JSON dumps (and I’d rather not use XML).

Is there a way to serialize objects into UTF-8 JSON strings (instead of \uXXXX)?


回答 0

使用ensure_ascii=False切换至json.dumps(),然后手动将值编码为UTF-8:

>>> json_string = json.dumps("ברי צקלה", ensure_ascii=False).encode('utf8')
>>> json_string
b'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
>>> print(json_string.decode())
"ברי צקלה"

如果要写入文件,只需使用json.dump()并将其留给文件对象进行编码:

with open('filename', 'w', encoding='utf8') as json_file:
    json.dump("ברי צקלה", json_file, ensure_ascii=False)

Python 2警告

对于Python 2,还有更多注意事项需要考虑。如果要将其写入文件,则可以使用io.open()代替open()来生成一个文件对象,该对象在编写时为您编码Unicode值,然后使用json.dump()代替来写入该文件:

with io.open('filename', 'w', encoding='utf8') as json_file:
    json.dump(u"ברי צקלה", json_file, ensure_ascii=False)

做笔记,有一对在错误json模块,其中ensure_ascii=False标志可以产生一个混合unicodestr对象。那么,Python 2的解决方法是:

with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(u"ברי צקלה", ensure_ascii=False)
    # unicode(data) auto-decodes data to unicode if str
    json_file.write(unicode(data))

在Python 2中,当使用str编码为UTF-8的字节字符串(类型)时,请确保还设置encoding关键字:

>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}

>>> s=json.dumps(d, ensure_ascii=False, encoding='utf8')
>>> s
u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
>>> json.loads(s)['1']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> json.loads(s)['2']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
ברי צקלה

Use the ensure_ascii=False switch to json.dumps(), then encode the value to UTF-8 manually:

>>> json_string = json.dumps("ברי צקלה", ensure_ascii=False).encode('utf8')
>>> json_string
b'"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
>>> print(json_string.decode())
"ברי צקלה"

If you are writing to a file, just use json.dump() and leave it to the file object to encode:

with open('filename', 'w', encoding='utf8') as json_file:
    json.dump("ברי צקלה", json_file, ensure_ascii=False)

Caveats for Python 2

For Python 2, there are some more caveats to take into account. If you are writing this to a file, you can use io.open() instead of open() to produce a file object that encodes Unicode values for you as you write, then use json.dump() instead to write to that file:

with io.open('filename', 'w', encoding='utf8') as json_file:
    json.dump(u"ברי צקלה", json_file, ensure_ascii=False)

Do note that there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 then is:

with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(u"ברי צקלה", ensure_ascii=False)
    # unicode(data) auto-decodes data to unicode if str
    json_file.write(unicode(data))

In Python 2, when using byte strings (type str), encoded to UTF-8, make sure to also set the encoding keyword:

>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}

>>> s=json.dumps(d, ensure_ascii=False, encoding='utf8')
>>> s
u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
>>> json.loads(s)['1']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> json.loads(s)['2']
u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
ברי צקלה

回答 1

写入文件

import codecs
import json

with codecs.open('your_file.txt', 'w', encoding='utf-8') as f:
    json.dump({"message":"xin chào việt nam"}, f, ensure_ascii=False)

打印到标准输出

import json
print(json.dumps({"message":"xin chào việt nam"}, ensure_ascii=False))

To write to a file

import codecs
import json

with codecs.open('your_file.txt', 'w', encoding='utf-8') as f:
    json.dump({"message":"xin chào việt nam"}, f, ensure_ascii=False)

To print to stdout

import json
print(json.dumps({"message":"xin chào việt nam"}, ensure_ascii=False))

回答 2

更新:这是错误的答案,但是了解为什么它仍然是有用的。看评论。

怎么unicode-escape

>>> d = {1: "ברי צקלה", 2: u"ברי צקלה"}
>>> json_str = json.dumps(d).decode('unicode-escape').encode('utf8')
>>> print json_str
{"1": "ברי צקלה", "2": "ברי צקלה"}

UPDATE: This is wrong answer, but it’s still useful to understand why it’s wrong. See comments.

How about unicode-escape?

>>> d = {1: "ברי צקלה", 2: u"ברי צקלה"}
>>> json_str = json.dumps(d).decode('unicode-escape').encode('utf8')
>>> print json_str
{"1": "ברי צקלה", "2": "ברי צקלה"}

回答 3

Peters的python 2解决方法在边缘情况下失败:

d = {u'keyword': u'bad credit  \xe7redit cards'}
with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(d, ensure_ascii=False).decode('utf8')
    try:
        json_file.write(data)
    except TypeError:
        # Decode data to Unicode first
        json_file.write(data.decode('utf8'))

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 25: ordinal not in range(128)

它在第3行的.decode(’utf8’)部分崩溃。我通过避免该步骤以及ascii的特殊大小写,使程序更加简单,从而解决了该问题:

with io.open('filename', 'w', encoding='utf8') as json_file:
  data = json.dumps(d, ensure_ascii=False, encoding='utf8')
  json_file.write(unicode(data))

cat filename
{"keyword": "bad credit  çredit cards"}

Peters’ python 2 workaround fails on an edge case:

d = {u'keyword': u'bad credit  \xe7redit cards'}
with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(d, ensure_ascii=False).decode('utf8')
    try:
        json_file.write(data)
    except TypeError:
        # Decode data to Unicode first
        json_file.write(data.decode('utf8'))

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 25: ordinal not in range(128)

It was crashing on the .decode(‘utf8’) part of line 3. I fixed the problem by making the program much simpler by avoiding that step as well as the special casing of ascii:

with io.open('filename', 'w', encoding='utf8') as json_file:
  data = json.dumps(d, ensure_ascii=False, encoding='utf8')
  json_file.write(unicode(data))

cat filename
{"keyword": "bad credit  çredit cards"}

回答 4

从Python 3.7开始,以下代码可以正常运行:

from json import dumps
result = {"symbol": "ƒ"}
json_string = dumps(result, sort_keys=True, indent=2, ensure_ascii=False)
print(json_string)

输出:

{"symbol": "ƒ"}

As of Python 3.7 the following code works fine:

from json import dumps
result = {"symbol": "ƒ"}
json_string = dumps(result, sort_keys=True, indent=2, ensure_ascii=False)
print(json_string)

Output:

{"symbol": "ƒ"}

回答 5

以下是我和google的var阅读答案。

# coding:utf-8
r"""
@update: 2017-01-09 14:44:39
@explain: str, unicode, bytes in python2to3
    #python2 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 7: ordinal not in range(128)
    #1.reload
    #importlib,sys
    #importlib.reload(sys)
    #sys.setdefaultencoding('utf-8') #python3 don't have this attribute.
    #not suggest even in python2 #see:http://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script
    #2.overwrite /usr/lib/python2.7/sitecustomize.py or (sitecustomize.py and PYTHONPATH=".:$PYTHONPATH" python)
    #too complex
    #3.control by your own (best)
    #==> all string must be unicode like python3 (u'xx'|b'xx'.encode('utf-8')) (unicode 's disappeared in python3)
    #see: http://blog.ernest.me/post/python-setdefaultencoding-unicode-bytes

    #how to Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence
    #http://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence
"""

from __future__ import print_function
import json

a = {"b": u"中文"}  # add u for python2 compatibility
print('%r' % a)
print('%r' % json.dumps(a))
print('%r' % (json.dumps(a).encode('utf8')))
a = {"b": u"中文"}
print('%r' % json.dumps(a, ensure_ascii=False))
print('%r' % (json.dumps(a, ensure_ascii=False).encode('utf8')))
# print(a.encode('utf8')) #AttributeError: 'dict' object has no attribute 'encode'
print('')

# python2:bytes=str; python3:bytes
b = a['b'].encode('utf-8')
print('%r' % b)
print('%r' % b.decode("utf-8"))
print('')

# python2:unicode; python3:str=unicode
c = b.decode('utf-8')
print('%r' % c)
print('%r' % c.encode('utf-8'))
"""
#python2
{'b': u'\u4e2d\u6587'}
'{"b": "\\u4e2d\\u6587"}'
'{"b": "\\u4e2d\\u6587"}'
u'{"b": "\u4e2d\u6587"}'
'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'

'\xe4\xb8\xad\xe6\x96\x87'
u'\u4e2d\u6587'

u'\u4e2d\u6587'
'\xe4\xb8\xad\xe6\x96\x87'

#python3
{'b': '中文'}
'{"b": "\\u4e2d\\u6587"}'
b'{"b": "\\u4e2d\\u6587"}'
'{"b": "中文"}'
b'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'

b'\xe4\xb8\xad\xe6\x96\x87'
'中文'

'中文'
b'\xe4\xb8\xad\xe6\x96\x87'
"""

The following is my understanding var reading answer above and google.

# coding:utf-8
r"""
@update: 2017-01-09 14:44:39
@explain: str, unicode, bytes in python2to3
    #python2 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 7: ordinal not in range(128)
    #1.reload
    #importlib,sys
    #importlib.reload(sys)
    #sys.setdefaultencoding('utf-8') #python3 don't have this attribute.
    #not suggest even in python2 #see:http://stackoverflow.com/questions/3828723/why-should-we-not-use-sys-setdefaultencodingutf-8-in-a-py-script
    #2.overwrite /usr/lib/python2.7/sitecustomize.py or (sitecustomize.py and PYTHONPATH=".:$PYTHONPATH" python)
    #too complex
    #3.control by your own (best)
    #==> all string must be unicode like python3 (u'xx'|b'xx'.encode('utf-8')) (unicode 's disappeared in python3)
    #see: http://blog.ernest.me/post/python-setdefaultencoding-unicode-bytes

    #how to Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence
    #http://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence
"""

from __future__ import print_function
import json

a = {"b": u"中文"}  # add u for python2 compatibility
print('%r' % a)
print('%r' % json.dumps(a))
print('%r' % (json.dumps(a).encode('utf8')))
a = {"b": u"中文"}
print('%r' % json.dumps(a, ensure_ascii=False))
print('%r' % (json.dumps(a, ensure_ascii=False).encode('utf8')))
# print(a.encode('utf8')) #AttributeError: 'dict' object has no attribute 'encode'
print('')

# python2:bytes=str; python3:bytes
b = a['b'].encode('utf-8')
print('%r' % b)
print('%r' % b.decode("utf-8"))
print('')

# python2:unicode; python3:str=unicode
c = b.decode('utf-8')
print('%r' % c)
print('%r' % c.encode('utf-8'))
"""
#python2
{'b': u'\u4e2d\u6587'}
'{"b": "\\u4e2d\\u6587"}'
'{"b": "\\u4e2d\\u6587"}'
u'{"b": "\u4e2d\u6587"}'
'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'

'\xe4\xb8\xad\xe6\x96\x87'
u'\u4e2d\u6587'

u'\u4e2d\u6587'
'\xe4\xb8\xad\xe6\x96\x87'

#python3
{'b': '中文'}
'{"b": "\\u4e2d\\u6587"}'
b'{"b": "\\u4e2d\\u6587"}'
'{"b": "中文"}'
b'{"b": "\xe4\xb8\xad\xe6\x96\x87"}'

b'\xe4\xb8\xad\xe6\x96\x87'
'中文'

'中文'
b'\xe4\xb8\xad\xe6\x96\x87'
"""

回答 6

这是我使用json.dump()的解决方案:

def jsonWrite(p, pyobj, ensure_ascii=False, encoding=SYSTEM_ENCODING, **kwargs):
    with codecs.open(p, 'wb', 'utf_8') as fileobj:
        json.dump(pyobj, fileobj, ensure_ascii=ensure_ascii,encoding=encoding, **kwargs)

其中SYSTEM_ENCODING设置为:

locale.setlocale(locale.LC_ALL, '')
SYSTEM_ENCODING = locale.getlocale()[1]

Here’s my solution using json.dump():

def jsonWrite(p, pyobj, ensure_ascii=False, encoding=SYSTEM_ENCODING, **kwargs):
    with codecs.open(p, 'wb', 'utf_8') as fileobj:
        json.dump(pyobj, fileobj, ensure_ascii=ensure_ascii,encoding=encoding, **kwargs)

where SYSTEM_ENCODING is set to:

locale.setlocale(locale.LC_ALL, '')
SYSTEM_ENCODING = locale.getlocale()[1]

回答 7

尽可能使用编解码器,

with codecs.open('file_path', 'a+', 'utf-8') as fp:
    fp.write(json.dumps(res, ensure_ascii=False))

Use codecs if possible,

with codecs.open('file_path', 'a+', 'utf-8') as fp:
    fp.write(json.dumps(res, ensure_ascii=False))

回答 8

感谢您在这里的原始答案。使用python 3的以下代码行:

print(json.dumps(result_dict,ensure_ascii=False))

还可以 如果不是必须的,请考虑尝试在代码中不要写太多文本。

这对于python控制台可能已经足够了。但是,要使服务器满意,您可能需要按照此处的说明设置语言环境(如果它在apache2上) http://blog.dscpl.com.au/2014/09/setting-lang-and-lcall-when-using .html

基本上在ubuntu上安装he_IL或​​任何语言环境,检查是否未安装

locale -a 

将其安装在XX是您的语言的位置

sudo apt-get install language-pack-XX

例如:

sudo apt-get install language-pack-he

将以下文本添加到/ etc / apache2 / envvrs

export LANG='he_IL.UTF-8'
export LC_ALL='he_IL.UTF-8'

比您希望不要从apache上得到python错误:

打印(js)UnicodeEncodeError:’ascii’编解码器无法在位置41-45处编码字符:序数不在范围内(128)

另外,在apache中,尝试将utf设置为默认编码,如下所示:
如何将Apache的默认编码更改为UTF-8?

尽早执行,因为apache错误可能很难调试,并且您可能会错误地认为它来自python,在这种情况下可能并非如此

Thanks for the original answer here. With python 3 the following line of code:

print(json.dumps(result_dict,ensure_ascii=False))

was ok. Consider trying not writing too much text in the code if it’s not imperative.

This might be good enough for the python console. However, to satisfy a server you might need to set the locale as explained here (if it is on apache2) http://blog.dscpl.com.au/2014/09/setting-lang-and-lcall-when-using.html

basically install he_IL or whatever language locale on ubuntu check it is not installed

locale -a 

install it where XX is your language

sudo apt-get install language-pack-XX

For example:

sudo apt-get install language-pack-he

add the following text to /etc/apache2/envvrs

export LANG='he_IL.UTF-8'
export LC_ALL='he_IL.UTF-8'

Than you would hopefully not get python errors on from apache like:

print (js) UnicodeEncodeError: ‘ascii’ codec can’t encode characters in position 41-45: ordinal not in range(128)

Also in apache try to make utf the default encoding as explained here:
How to change the default encoding to UTF-8 for Apache?

Do it early because apache errors can be pain to debug and you can mistakenly think it’s from python which possibly isn’t the case in that situation


回答 9

如果要从文件和文件内容阿拉伯文本加载JSON字符串。然后这将起作用。

假设文件如下:arabic.json

{ 
"key1" : "لمستخدمين",
"key2" : "إضافة مستخدم"
}

从arabic.json文件获取阿拉伯语内容

with open(arabic.json, encoding='utf-8') as f:
   # deserialises it
   json_data = json.load(f)
   f.close()


# json formatted string
json_data2 = json.dumps(json_data, ensure_ascii = False)

要在Django模板中使用JSON数据,请执行以下步骤:

# If have to get the JSON index in Django Template file, then simply decode the encoded string.

json.JSONDecoder().decode(json_data2)

完成了!现在我们可以将结果作为带有阿拉伯值的JSON索引来获取。

If you are loading JSON string from a file & file contents arabic texts. Then this will work.

Assume File like: arabic.json

{ 
"key1" : "لمستخدمين",
"key2" : "إضافة مستخدم"
}

Get the arabic contents from the arabic.json file

with open(arabic.json, encoding='utf-8') as f:
   # deserialises it
   json_data = json.load(f)
   f.close()


# json formatted string
json_data2 = json.dumps(json_data, ensure_ascii = False)

To use JSON Data in Django Template follow below steps:

# If have to get the JSON index in Django Template file, then simply decode the encoded string.

json.JSONDecoder().decode(json_data2)

done! Now we can get the results as JSON index with arabic value.


回答 10

使用unicode-escape解决问题

>>>import json
>>>json_string = json.dumps("ברי צקלה")
>>>json_string.encode('ascii').decode('unicode-escape')
'"ברי צקלה"'

说明

>>>s = '漢  χαν  хан'
>>>print('unicode: ' + s.encode('unicode-escape').decode('utf-8'))
unicode: \u6f22  \u03c7\u03b1\u03bd  \u0445\u0430\u043d

>>>u = s.encode('unicode-escape').decode('utf-8')
>>>print('original: ' + u.encode("utf-8").decode('unicode-escape'))
original:   χαν  хан

原始资源:https : //blog.csdn.net/chuatony/article/details/72628868

use unicode-escape to solve problem

>>>import json
>>>json_string = json.dumps("ברי צקלה")
>>>json_string.encode('ascii').decode('unicode-escape')
'"ברי צקלה"'

explain

>>>s = '漢  χαν  хан'
>>>print('unicode: ' + s.encode('unicode-escape').decode('utf-8'))
unicode: \u6f22  \u03c7\u03b1\u03bd  \u0445\u0430\u043d

>>>u = s.encode('unicode-escape').decode('utf-8')
>>>print('original: ' + u.encode("utf-8").decode('unicode-escape'))
original: 漢  χαν  хан

original resource:https://blog.csdn.net/chuatony/article/details/72628868


回答 11

正如Martijn指出的那样,在json.dumps中使用suresure_ascii = False是解决此问题的正确方向。但是,这可能会引发异常:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 1: ordinal not in range(128)

您需要在site.py或sitecustomize.py中进行其他设置,才能正确设置sys.getdefaultencoding()。site.py在lib / python2.7 /下,sitecustomize.py在lib / python2.7 / site-packages下。

如果要使用site.py,请在def setencoding()下:将第一个if 0:更改为if 1 :,以便python使用操作系统的语言环境。

如果您喜欢使用sitecustomize.py,那么如果尚未创建,则可能不存在。只需将这些行:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

然后,您可以以utf-8格式进行中文json输出,例如:

name = {"last_name": u"王"}
json.dumps(name, ensure_ascii=False)

您将获得一个utf-8编码的字符串,而不是\ u转义的json字符串。

验证默认编码:

print sys.getdefaultencoding()

您应该获得“ utf-8”或“ UTF-8”来验证site.py或sitecustomize.py设置。

请注意,您无法在交互式python控制台上执行sys.setdefaultencoding(“ utf-8”)。

Using ensure_ascii=False in json.dumps is the right direction to solve this problem, as pointed out by Martijn. However, this may raise an exception:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 1: ordinal not in range(128)

You need extra settings in either site.py or sitecustomize.py to set your sys.getdefaultencoding() correct. site.py is under lib/python2.7/ and sitecustomize.py is under lib/python2.7/site-packages.

If you want to use site.py, under def setencoding(): change the first if 0: to if 1: so that python will use your operation system’s locale.

If you prefer to use sitecustomize.py, which may not exist if you haven’t created it. simply put these lines:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Then you can do some Chinese json output in utf-8 format, such as:

name = {"last_name": u"王"}
json.dumps(name, ensure_ascii=False)

You will get an utf-8 encoded string, rather than \u escaped json string.

To verify your default encoding:

print sys.getdefaultencoding()

You should get “utf-8” or “UTF-8” to verify your site.py or sitecustomize.py settings.

Please note that you could not do sys.setdefaultencoding(“utf-8”) at interactive python console.


使用Python在Pandas中读取CSV文件时出现UnicodeDecodeError

问题:使用Python在Pandas中读取CSV文件时出现UnicodeDecodeError

我正在运行一个程序,正在处理30,000个类似文件。他们中有随机数正在停止并产生此错误…

   File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
     data = pd.read_csv(filepath, names=fields)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
     return _read(filepath_or_buffer, kwds)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
     return parser.read()
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
     ret = self._engine.read(nrows)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
     data = self._reader.read(nrows)
   File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
   File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
   File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
   File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
   File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
   File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
   File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
   File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid    continuation byte

这些文件的源/创建都来自同一位置。纠正此错误以继续导入的最佳方法是什么?

I’m running a program which is processing 30,000 similar files. A random number of them are stopping and producing this error…

   File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
     data = pd.read_csv(filepath, names=fields)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
     return _read(filepath_or_buffer, kwds)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
     return parser.read()
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
     ret = self._engine.read(nrows)
   File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
     data = self._reader.read(nrows)
   File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
   File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
   File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
   File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
   File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
   File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
   File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
   File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid    continuation byte

The source/creation of these files all come from the same place. What’s the best way to correct this to proceed with the import?


回答 0

read_csv可以encoding选择处理不同格式的文件。我主要使用read_csv('file', encoding = "ISO-8859-1"),或者替代地encoding = "utf-8"阅读,并且通常utf-8用于to_csv

您还可以使用而不是的多个alias选项'latin'之一'ISO-8859-1'(请参阅python docs,还可能会遇到许多其他编码)。

请参阅相关的Pandas文档有关csv文件的python文档示例以及有关SO的大量相关问题。一个好的背景资源是每个开发人员应了解的unicode和字符集

要检测编码(假设文件包含非ASCII字符),可以使用enca(请参见手册页)或file -i(linux)或file -I(osx)(请参见手册页)。

read_csv takes an encoding option to deal with files in different formats. I mostly use read_csv('file', encoding = "ISO-8859-1"), or alternatively encoding = "utf-8" for reading, and generally utf-8 for to_csv.

You can also use one of several alias options like 'latin' instead of 'ISO-8859-1' (see python docs, also for numerous other encodings you may encounter).

See relevant Pandas documentation, python docs examples on csv files, and plenty of related questions here on SO. A good background resource is What every developer should know about unicode and character sets.

To detect the encoding (assuming the file contains non-ascii characters), you can use enca (see man page) or file -i (linux) or file -I (osx) (see man page).


回答 1

所有解决方案中最简单的:

import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')

替代解决方案:

  • Sublime文本编辑器中打开csv文件。
  • 以utf-8格式保存文件。

崇高地,单击文件->使用编码保存-> UTF-8

然后,您可以照常读取文件:

import pandas as pd
data = pd.read_csv('file_name.csv', encoding='utf-8')

其他不同的编码类型是:

encoding = "cp1252"
encoding = "ISO-8859-1"

Simplest of all Solutions:

import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')

Alternate Solution:

  • Open the csv file in Sublime text editor.
  • Save the file in utf-8 format.

In sublime, Click File -> Save with encoding -> UTF-8

Then, you can read your file as usual:

import pandas as pd
data = pd.read_csv('file_name.csv', encoding='utf-8')

and the other different encoding types are:

encoding = "cp1252"
encoding = "ISO-8859-1"

回答 2

熊猫允许指定编码,但不允许忽略错误以免自动替换有问题的字节。因此,没有一种适合所有方法的大小,而是取决于实际用例的不同方法。

  1. 您知道编码,并且文件中没有编码错误。太好了:您只需要指定编码即可:

    file_encoding = 'cp1252'        # set file_encoding to the file encoding (utf8, latin1, etc.)
    pd.read_csv(input_file_and_path, ..., encoding=file_encoding)
  2. 您不希望被编码问题困扰,无论某些文本字段是否包含垃圾内容,都只希望加载该死的文件。好的,您只需要使用Latin1编码,因为它接受任何可能的字节作为输入(并将其转换为相同代码的unicode字符):

    pd.read_csv(input_file_and_path, ..., encoding='latin1')
  3. 您知道大多数文件都是用特定的编码编写的,但是它也包含编码错误。一个真实的示例是一个UTF8文件,该文件已使用非utf8编辑器进行了编辑,并且其中包含一些使用不同编码的行。Pandas没有提供特殊的错误处理的准备,但是Python open函数具有(假设Python3),并且read_csv接受像object这样的文件。在这里使用的典型错误参数是'ignore'仅抑制有问题的字节,或者(IMHO更好)'backslashreplace'用其Python的反斜杠转义序列替换有问题的字节:

    file_encoding = 'utf8'        # set file_encoding to the file encoding (utf8, latin1, etc.)
    input_fd = open(input_file_and_path, encoding=file_encoding, errors = 'backslashreplace')
    pd.read_csv(input_fd, ...)

Pandas allows to specify encoding, but does not allow to ignore errors not to automatically replace the offending bytes. So there is no one size fits all method but different ways depending on the actual use case.

  1. You know the encoding, and there is no encoding error in the file. Great: you have just to specify the encoding:

    file_encoding = 'cp1252'        # set file_encoding to the file encoding (utf8, latin1, etc.)
    pd.read_csv(input_file_and_path, ..., encoding=file_encoding)
    
  2. You do not want to be bothered with encoding questions, and only want that damn file to load, no matter if some text fields contain garbage. Ok, you only have to use Latin1 encoding because it accept any possible byte as input (and convert it to the unicode character of same code):

    pd.read_csv(input_file_and_path, ..., encoding='latin1')
    
  3. You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real world example is an UTF8 file that has been edited with a non utf8 editor and which contains some lines with a different encoding. Pandas has no provision for a special error processing, but Python open function has (assuming Python3), and read_csv accepts a file like object. Typical errors parameter to use here are 'ignore' which just suppresses the offending bytes or (IMHO better) 'backslashreplace' which replaces the offending bytes by their Python’s backslashed escape sequence:

    file_encoding = 'utf8'        # set file_encoding to the file encoding (utf8, latin1, etc.)
    input_fd = open(input_file_and_path, encoding=file_encoding, errors = 'backslashreplace')
    pd.read_csv(input_fd, ...)
    

回答 3

with open('filename.csv') as f:
   print(f)

执行此代码后,您将找到“ filename.csv”的编码,然后执行以下代码

data=pd.read_csv('filename.csv', encoding="encoding as you found earlier"

你去

with open('filename.csv') as f:
   print(f)

after executing this code you will find encoding of ‘filename.csv’ then execute code as following

data=pd.read_csv('filename.csv', encoding="encoding as you found earlier"

there you go


回答 4

就我而言,USC-2 LE BOM根据Notepad ++ ,文件具有编码。它encoding="utf_16_le"用于python。

希望这有助于更快找到某人的答案。

In my case, a file has USC-2 LE BOM encoding, according to Notepad++. It is encoding="utf_16_le" for python.

Hope, it helps to find an answer a bit faster for someone.


回答 5

就我而言,这适用于python 2.7:

data = read_csv(filename, encoding = "ISO-8859-1", dtype={'name_of_colum': unicode}, low_memory=False) 

而对于python 3,仅:

data = read_csv(filename, encoding = "ISO-8859-1", low_memory=False) 

In my case this worked for python 2.7:

data = read_csv(filename, encoding = "ISO-8859-1", dtype={'name_of_colum': unicode}, low_memory=False) 

And for python 3, only:

data = read_csv(filename, encoding = "ISO-8859-1", low_memory=False) 

回答 6

尝试指定engine =’python’。它对我有用,但我仍在尝试找出原因。

df = pd.read_csv(input_file_path,...engine='python')

Try specifying the engine=’python’. It worked for me but I’m still trying to figure out why.

df = pd.read_csv(input_file_path,...engine='python')

回答 7

我正在发布答案,以提供有关为什么会出现此问题的更新解决方案和解释。假设您正在从数据库或Excel工作簿中获取此数据。如果您有特殊字符,例如La Cañada Flintridge city,除非您使用UTF-8编码导出数据,否则将引入错误。La Cañada Flintridge city将成为La Ca\xf1ada Flintridge city。如果您pandas.read_csv对默认参数没有任何调整,则会遇到以下错误

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 5: invalid continuation byte

幸运的是,有一些解决方案。

选项1,修复出口。确保使用UTF-8编码。

选项2,如果您无法解决出口问题,而需要使用pandas.read_csv,请确保包括以下参数engine='python'。缺省情况下,pandas使用engine='C'此选项非常适合读取大型干净文件,但如果出现意外情况,它将崩溃。根据我的经验,设置encoding='utf-8'从未解决过这个问题UnicodeDecodeError。另外,您不需要使用errors_bad_lines,但是,如果您确实需要它,那仍然是一个选择。

pd.read_csv(<your file>, engine='python')

选项3:解决方案是我个人首选的解决方案。使用香草Python读取文件。

import pandas as pd

data = []

with open(<your file>, "rb") as myfile:
    # read the header seperately
    # decode it as 'utf-8', remove any special characters, and split it on the comma (or deliminator)
    header = myfile.readline().decode('utf-8').replace('\r\n', '').split(',')
    # read the rest of the data
    for line in myfile:
        row = line.decode('utf-8', errors='ignore').replace('\r\n', '').split(',')
        data.append(row)

# save the data as a dataframe
df = pd.DataFrame(data=data, columns = header)

希望这可以帮助人们第一次遇到这个问题。

I am posting an answer to provide an updated solution and explanation as to why this problem can occur. Say you are getting this data from a database or Excel workbook. If you have special characters like La Cañada Flintridge city, well unless you are exporting the data using UTF-8 encoding, you’re going to introduce errors. La Cañada Flintridge city will become La Ca\xf1ada Flintridge city. If you are using pandas.read_csv without any adjustments to the default parameters, you’ll hit the following error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 5: invalid continuation byte

Fortunately, there are a few solutions.

Option 1, fix the exporting. Be sure to use UTF-8 encoding.

Option 2, if fixing the exporting problem is not available to you, and you need to use pandas.read_csv, be sure to include the following paramters, engine='python'. By default, pandas uses engine='C' which is great for reading large clean files, but will crash if anything unexpected comes up. In my experience, setting encoding='utf-8' has never fixed this UnicodeDecodeError. Also, you do not need to use errors_bad_lines, however, that is still an option if you REALLY need it.

pd.read_csv(<your file>, engine='python')

Option 3: solution is my preferred solution personally. Read the file using vanilla Python.

import pandas as pd

data = []

with open(<your file>, "rb") as myfile:
    # read the header seperately
    # decode it as 'utf-8', remove any special characters, and split it on the comma (or deliminator)
    header = myfile.readline().decode('utf-8').replace('\r\n', '').split(',')
    # read the rest of the data
    for line in myfile:
        row = line.decode('utf-8', errors='ignore').replace('\r\n', '').split(',')
        data.append(row)

# save the data as a dataframe
df = pd.DataFrame(data=data, columns = header)

Hope this helps people encountering this issue for the first time.


回答 8

挣扎了一段时间,以为我会在这个问题上发布,因为它是第一个搜索结果。将encoding="iso-8859-1"标签添加到熊猫read_csv没有用,也没有任何其他编码,但始终给出UnicodeDecodeError。

如果您要传递文件句柄,则pd.read_csv(),需要将encoding属性放在文件上,而不是中read_csv。事后看来很明显,但是要跟踪却有一个微妙的错误。

Struggled with this a while and thought I’d post on this question as it’s the first search result. Adding the encoding="iso-8859-1" tag to pandas read_csv didn’t work, nor did any other encoding, kept giving a UnicodeDecodeError.

If you’re passing a file handle to pd.read_csv(), you need to put the encoding attribute on the file open, not in read_csv. Obvious in hindsight, but a subtle error to track down.


回答 9

这个答案似乎可以解决CSV编码问题。如果标题出现奇怪的编码问题,如下所示:

>>> f = open(filename,"r")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('\ufeffid', '1'), ... ])

然后,您在CSV文件的开头就有一个字节顺序标记(BOM)字符。这个答案解决了这个问题:

Python读取csv-BOM嵌入第一个密钥

解决方案是使用加载CSV encoding="utf-8-sig"

>>> f = open(filename,"r", encoding="utf-8-sig")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('id', '1'), ... ])

希望这对某人有帮助。

This answer seems to be the catch-all for CSV encoding issues. If you are getting a strange encoding problem with your header like this:

>>> f = open(filename,"r")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('\ufeffid', '1'), ... ])

Then you have a byte order mark (BOM) character at the beginning of your CSV file. This answer addresses the issue:

Python read csv – BOM embedded into the first key

The solution is to load the CSV with encoding="utf-8-sig":

>>> f = open(filename,"r", encoding="utf-8-sig")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('id', '1'), ... ])

Hopefully this helps someone.


回答 10

我正在发布此旧线程的更新。我找到了一个可行的解决方案,但需要打开每个文件。我在LibreOffice中打开了csv文件,选择另存为>编辑过滤器设置。在下拉菜单中,我选择了UTF8编码。然后我添加encoding="utf-8-sig"data = pd.read_csv(r'C:\fullpathtofile\filename.csv', sep = ',', encoding="utf-8-sig")

希望这对某人有帮助。

I am posting an update to this old thread. I found one solution that worked, but requires opening each file. I opened my csv file in LibreOffice, chose Save As > edit filter settings. In the drop-down menu I chose UTF8 encoding. Then I added encoding="utf-8-sig" to the data = pd.read_csv(r'C:\fullpathtofile\filename.csv', sep = ',', encoding="utf-8-sig").

Hope this helps someone.


回答 11

我无法打开从网上银行下载的简体中文CSV文件,我尝试过latin1,尝试过iso-8859-1cp1252,但都无济于事。

但是pd.read_csv("",encoding ='gbk')工作就完成了。

I have trouble opening a CSV file in simplified Chinese downloaded from an online bank, I have tried latin1, I have tried iso-8859-1, I have tried cp1252, all to no avail.

But pd.read_csv("",encoding ='gbk') simply does the work.


回答 12

请尝试添加

encoding='unicode_escape'

这会有所帮助。为我工作。另外,请确保使用正确的定界符和列名。

您可以从仅加载1000行开始,以快速加载文件。

Please try to add

encoding='unicode_escape'

This will help. Worked for me. Also, make sure you’re using the correct delimiter and column names.

You can start with loading just 1000 rows to load the file quickly.


回答 13

我正在使用Jupyter笔记本。以我为例,它以错误的格式显示文件。“编码”选项无效。因此,我将CSV保存为utf-8格式,并且可以正常工作。

I am using Jupyter-notebook. And in my case, it was showing the file in the wrong format. The ‘encoding’ option was not working. So I save the csv in utf-8 format, and it works.


回答 14

尝试这个:

import pandas as pd
with open('filename.csv') as f:
    data = pd.read_csv(f)

看起来它会处理编码,而无需通过参数明确表示

Try this:

import pandas as pd
with open('filename.csv') as f:
    data = pd.read_csv(f)

Looks like it will take care of the encoding without explicitly expressing it through argument


回答 15

在传递给熊猫之前,请检查编码。它会使您减速,但是…

with open(path, 'r') as f:
    encoding = f.encoding 

df = pd.read_csv(path,sep=sep, encoding=encoding)

在python 3.7中

Check the encoding before you pass to pandas. It will slow you down, but…

with open(path, 'r') as f:
    encoding = f.encoding 

df = pd.read_csv(path,sep=sep, encoding=encoding)

In python 3.7


回答 16

我遇到的另一个导致相同错误的重要问题是:

_values = pd.read_csv("C:\Users\Mujeeb\Desktop\file.xlxs")

^此行导致相同的错误,因为我正在使用read_csv()方法读取Excel文件。使用read_excel()阅读.xlxs

Another important issue that I faced which resulted in the same error was:

_values = pd.read_csv("C:\Users\Mujeeb\Desktop\file.xlxs")

^This line resulted in the same error because I am reading an excel file using read_csv() method. Use read_excel() for reading .xlxs


字符串文字前的’b’字符做什么?

问题:字符串文字前的’b’字符做什么?

显然,以下是有效的语法:

my_string = b'The string'

我想知道:

  1. 这是什么b字在前面的字符串是什么意思?
  2. 使用它有什么作用?
  3. 在什么情况下可以使用它?

我在SO上找到了一个相关的问题,但是这个问题是关于PHP的,它指出b用来表示字符串是二进制的,与Unicode相反,Unicode是使PHP <6版本兼容的代码所必需的,当迁移到PHP 6时。我认为这不适用于Python。

我确实在Python站点上找到了有关使用相同语法的字符将字符串指定为Unicode的文档u。不幸的是,它在该文档的任何地方都没有提到b字符。

而且,只是出于好奇,有没有比多符号bu是做其他事情?

Apparently, the following is the valid syntax:

my_string = b'The string'

I would like to know:

  1. What does this b character in front of the string mean?
  2. What are the effects of using it?
  3. What are appropriate situations to use it?

I found a related question right here on SO, but that question is about PHP though, and it states the b is used to indicate the string is binary, as opposed to Unicode, which was needed for code to be compatible from version of PHP < 6, when migrating to PHP 6. I don’t think this applies to Python.

I did find this documentation on the Python site about using a u character in the same syntax to specify a string as Unicode. Unfortunately, it doesn’t mention the b character anywhere in that document.

Also, just out of curiosity, are there more symbols than the b and u that do other things?


回答 0

引用Python 2.x文档

在Python 2中,前缀’b’或’B’被忽略;它表示文字应在Python 3中变成字节文字(例如,当代码自动由2to3转换时)。前缀“ u”或“ b”后可以带有前缀“ r”。

Python 3中的文件状态:

字节字面量始终以“ b”或“ B”为前缀;它们产生字节类型的实例而不是str类型。它们只能包含ASCII字符;数值等于或大于128的字节必须用转义符表示。

To quote the Python 2.x documentation:

A prefix of ‘b’ or ‘B’ is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A ‘u’ or ‘b’ prefix may be followed by an ‘r’ prefix.

The Python 3 documentation states:

Bytes literals are always prefixed with ‘b’ or ‘B’; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.


回答 1

Python 3.x明确区分了以下类型:

  • str= '...'文字= Unicode字符序列(UTF-16或UTF-32,取决于Python的编译方式)
  • bytes= b'...'文字=八位字节序列(0到255之间的整数)

如果你熟悉Java或C#,想到strStringbytes作为byte[]。如果您熟悉SQL,请认为stras NVARCHARbytesas BINARYBLOB。如果你熟悉Windows注册表,想到strREG_SZbytes作为REG_BINARY。如果您熟悉C(++),请忘记学习的所有知识char和字符串,因为CHARACTER不是BYTE。这个想法早已过时。

您可以使用str,当你想要表达的文字。

print('שלום עולם')

您可以使用bytes,当你想表示相同结构的低级别的二进制数据。

NaN = struct.unpack('>d', b'\xff\xf8\x00\x00\x00\x00\x00\x00')[0]

您可以编码一个str到一个bytes对象。

>>> '\uFEFF'.encode('UTF-8')
b'\xef\xbb\xbf'

您可以将a解码bytesstr

>>> b'\xE2\x82\xAC'.decode('UTF-8')
'€'

但是您不能随意混合使用这两种类型。

>>> b'\xEF\xBB\xBF' + 'Text with a UTF-8 BOM'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str

这种b'...'表示法有些令人困惑,因为它允许使用ASCII字符而不是十六进制数字指定字节0x01-0x7F。

>>> b'A' == b'\x41'
True

但是我必须强调,字符不是字节

>>> 'A' == b'A'
False

在Python 2.x中

Python 3.0之前的版本在文本和二进制数据之间缺乏这种区别。相反,有:

  • unicode= u'...'文字= Unicode字符序列= 3.xstr
  • str= '...'文字=混杂字节/字符的序列
    • 通常为文本,以某种未指定的编码进行编码。
    • 而且还用来表示二进制数据,如struct.pack输出。

为了简化从2.x到3.x的过渡,b'...'将原义语法反向移植到Python 2.6,以便区分二进制字符串(应bytes在3.x中)和文本字符串(应str在3中) 。X)。该b前缀在2.x中不执行任何操作,但告诉2to3脚本不要在3.x中将其转换为Unicode字符串。

因此,是的,b'...'Python中的文字具有与PHP中相同的目的。

另外,出于好奇,还有比b和u更多的符号可以执行其他操作吗?

r前缀创建原始字符串(例如,r'\t'是反斜杠+ t,而不是一个选项卡),和三引号'''...'''"""..."""允许多行字符串文字。

Python 3.x makes a clear distinction between the types:

  • str = '...' literals = a sequence of Unicode characters (UTF-16 or UTF-32, depending on how Python was compiled)
  • bytes = b'...' literals = a sequence of octets (integers between 0 and 255)

If you’re familiar with Java or C#, think of str as String and bytes as byte[]. If you’re familiar with SQL, think of str as NVARCHAR and bytes as BINARY or BLOB. If you’re familiar with the Windows registry, think of str as REG_SZ and bytes as REG_BINARY. If you’re familiar with C(++), then forget everything you’ve learned about char and strings, because A CHARACTER IS NOT A BYTE. That idea is long obsolete.

You use str when you want to represent text.

print('שלום עולם')

You use bytes when you want to represent low-level binary data like structs.

NaN = struct.unpack('>d', b'\xff\xf8\x00\x00\x00\x00\x00\x00')[0]

You can encode a str to a bytes object.

>>> '\uFEFF'.encode('UTF-8')
b'\xef\xbb\xbf'

And you can decode a bytes into a str.

>>> b'\xE2\x82\xAC'.decode('UTF-8')
'€'

But you can’t freely mix the two types.

>>> b'\xEF\xBB\xBF' + 'Text with a UTF-8 BOM'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str

The b'...' notation is somewhat confusing in that it allows the bytes 0x01-0x7F to be specified with ASCII characters instead of hex numbers.

>>> b'A' == b'\x41'
True

But I must emphasize, a character is not a byte.

>>> 'A' == b'A'
False

In Python 2.x

Pre-3.0 versions of Python lacked this kind of distinction between text and binary data. Instead, there was:

  • unicode = u'...' literals = sequence of Unicode characters = 3.x str
  • str = '...' literals = sequences of confounded bytes/characters
    • Usually text, encoded in some unspecified encoding.
    • But also used to represent binary data like struct.pack output.

In order to ease the 2.x-to-3.x transition, the b'...' literal syntax was backported to Python 2.6, in order to allow distinguishing binary strings (which should be bytes in 3.x) from text strings (which should be str in 3.x). The b prefix does nothing in 2.x, but tells the 2to3 script not to convert it to a Unicode string in 3.x.

So yes, b'...' literals in Python have the same purpose that they do in PHP.

Also, just out of curiosity, are there more symbols than the b and u that do other things?

The r prefix creates a raw string (e.g., r'\t' is a backslash + t instead of a tab), and triple quotes '''...''' or """...""" allow multi-line string literals.


回答 2

b表示字节字符串。

字节是实际数据。字符串是一种抽象。

如果您有多个字符的字符串对象并且使用了一个字符,则该字符串将是一个字符串,并且根据编码的不同,大小可能会超过1个字节。

如果使用1个字节和一个字节字符串,则您将获得0-255之间的单个8位值,并且如果由于编码而导致的那些字符大于1个字节,则它可能不表示完整的字符。

TBH我将使用字符串,除非我有一些特定的低级原因要使用字节。

The b denotes a byte string.

Bytes are the actual data. Strings are an abstraction.

If you had multi-character string object and you took a single character, it would be a string, and it might be more than 1 byte in size depending on encoding.

If took 1 byte with a byte string, you’d get a single 8-bit value from 0-255 and it might not represent a complete character if those characters due to encoding were > 1 byte.

TBH I’d use strings unless I had some specific low level reason to use bytes.


回答 3

从服务器端,如果我们发送任何响应,它将以字节类型的形式发送,因此它将在客户端中显示为 b'Response from server'

为了摆脱,b'....'只需使用以下代码:

服务器文件:

stri="Response from server"    
c.send(stri.encode())

客户端文件:

print(s.recv(1024).decode())

然后它将打印 Response from server

From server side, if we send any response, it will be sent in the form of byte type, so it will appear in the client as b'Response from server'

In order get rid of b'....' simply use below code:

Server file:

stri="Response from server"    
c.send(stri.encode())

Client file:

print(s.recv(1024).decode())

then it will print Response from server


回答 4

这是一个示例,其中缺少bTypeError在Python 3.x中引发异常

>>> f=open("new", "wb")
>>> f.write("Hello Python!")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface

添加b前缀将解决此问题。

Here’s an example where the absence of b would throw a TypeError exception in Python 3.x

>>> f=open("new", "wb")
>>> f.write("Hello Python!")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface

Adding a b prefix would fix the problem.


回答 5

它将其转换为bytes文字(或str在2.x中),并且对于2.6+有效。

r前缀导致反斜杠需要“不解释”(不被忽略,差异确实物质)。

It turns it into a bytes literal (or str in 2.x), and is valid for 2.6+.

The r prefix causes backslashes to be “uninterpreted” (not ignored, and the difference does matter).


回答 6

除了其他人所说的以外,请注意unicode中的单个字符可以由多个字节组成

unicode的工作方式是采用旧的ASCII格式(7位代码,看起来像0xxx xxxx),并添加了多字节序列,其中所有字节均以1(1xxx xxxx)开头,以表示ASCII以外的字符,以便Unicode 向后-与ASCII 兼容

>>> len('Öl')  # German word for 'oil' with 2 characters
2
>>> 'Öl'.encode('UTF-8')  # convert str to bytes 
b'\xc3\x96l'
>>> len('Öl'.encode('UTF-8'))  # 3 bytes encode 2 characters !
3

In addition to what others have said, note that a single character in unicode can consist of multiple bytes.

The way unicode works is that it took the old ASCII format (7-bit code that looks like 0xxx xxxx) and added multi-bytes sequences where all bytes start with 1 (1xxx xxxx) to represent characters beyond ASCII so that Unicode would be backwards-compatible with ASCII.

>>> len('Öl')  # German word for 'oil' with 2 characters
2
>>> 'Öl'.encode('UTF-8')  # convert str to bytes 
b'\xc3\x96l'
>>> len('Öl'.encode('UTF-8'))  # 3 bytes encode 2 characters !
3

回答 7

您可以使用JSON将其转换为字典

import json
data = b'{"key":"value"}'
print(json.loads(data))

{“核心价值”}


烧瓶:

这是烧瓶的一个例子。在终端行上运行此命令:

import requests
requests.post(url='http://localhost(example)/',json={'key':'value'})

在flask / routes.py中

@app.route('/', methods=['POST'])
def api_script_add():
    print(request.data) # --> b'{"hi":"Hello"}'
    print(json.loads(request.data))
return json.loads(request.data)

{‘核心价值’}

You can use JSON to convert it to dictionary

import json
data = b'{"key":"value"}'
print(json.loads(data))

{“key”:”value”}


FLASK:

This is an example from flask. Run this on terminal line:

import requests
requests.post(url='http://localhost(example)/',json={'key':'value'})

In flask/routes.py

@app.route('/', methods=['POST'])
def api_script_add():
    print(request.data) # --> b'{"hi":"Hello"}'
    print(json.loads(request.data))
return json.loads(request.data)

{‘key’:’value’}


字符串标志“ u”和“ r”到底是做什么的,什么是原始字符串文字?

问题:字符串标志“ u”和“ r”到底是做什么的,什么是原始字符串文字?

当问这个问题时,我意识到我对原始字符串不了解很多。对于自称是Django培训师的人来说,这很糟糕。

我知道什么是编码,而且我知道u''自从得到Unicode以来,它独自做什么。

  • 但是究竟是r''什么呢?它产生什么样的字符串?

  • 最重要的是,该怎么ur''办?

  • 最后,有什么可靠的方法可以从Unicode字符串返回到简单的原始字符串?

  • 嗯,顺便说一句,如果您的系统和文本编辑器字符集设置为UTF-8,u''实际上有什么作用吗?

While asking this question, I realized I didn’t know much about raw strings. For somebody claiming to be a Django trainer, this sucks.

I know what an encoding is, and I know what u'' alone does since I get what is Unicode.

  • But what does r'' do exactly? What kind of string does it result in?

  • And above all, what the heck does ur'' do?

  • Finally, is there any reliable way to go back from a Unicode string to a simple raw string?

  • Ah, and by the way, if your system and your text editor charset are set to UTF-8, does u'' actually do anything?


回答 0

实际上并没有任何“原始字符串 ”。这里有原始的字符串文字,它们恰好是'r'在引号前用a标记的字符串文字。

“原始字符串文字”与字符串文字的语法略有不同,其中\反斜杠“”代表“只是反斜杠”(除非在引号之前否则会终止文字)- “转义序列”代表换行符,制表符,退格键,换页等。在普通的字符串文字中,每个反斜杠必须加倍,以避免被当作转义序列的开始。

之所以存在此语法变体,主要是因为正则表达式模式的语法带有反斜杠(但不会在末尾加重),所以语法比较繁琐(因此,上面的“ except”子句无关紧要),并且在避免将每个模式加倍时看起来会更好一些- – 就这样。表达本机Windows文件路径(用反斜杠代替其他平台上的常规斜杠)也引起了人们的欢迎,但这很少需要(因为正常斜杠在Windows上也可以正常工作)并且不完美(由于“ except”子句)以上)。

r'...'是一个字节串(在Python 2 *),ur'...'是Unicode字符串(再次,在Python 2 *),以及任何其他3种引用的也产生完全相同的类型字符串(因此,例如r'...'r'''...'''r"..."r"""..."""都是字节字符串,依此类推)。

不确定“ 返回 ”是什么意思-本质上没有前后方向,因为没有原始字符串类型,它只是一种表示完全正常的字符串对象,字节或Unicode的替代语法。

是的,在Python 2 *,u'...' 当然总是从刚不同'...'-前者是一个unicode字符串,后者是一个字节的字符串。文字表达的编码方式可能是完全正交的问题。

例如,考虑一下(Python 2.6):

>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34

Unicode对象当然会占用更多的存储空间(很短的字符串,很明显,;-差别很小)。

There’s not really any “raw string“; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.

A “raw string literal” is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning “just a backslash” (except when it comes right before a quote that would otherwise terminate the literal) — no “escape sequences” to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.

This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the “except” clause above doesn’t matter) and it looks a bit better when you avoid doubling up each of them — that’s all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that’s very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the “except” clause above).

r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).

Not sure what you mean by “going back” – there is no intrinsically back and forward directions, because there’s no raw string type, it’s just an alternative syntax to express perfectly normal string objects, byte or unicode as they may be.

And yes, in Python 2.*, u'...' is of course always distinct from just '...' — the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.

E.g., consider (Python 2.6):

>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34

The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).


回答 1

python中有两种类型的字符串:传统str类型和较新unicode类型。如果您键入的字符串文字u前面不带,则将得到str存储8位字符的旧类型,而带u前面的将得到unicode可存储任何Unicode字符的较新类型。

r完全不改变类型,它只是改变了字符串是如何解释。如果没有,则将r反斜杠视为转义字符。使用时r,反斜杠被视为文字。无论哪种方式,类型都是相同的。

ur 当然是Unicode字符串,其中反斜杠是文字反斜杠,而不是转义码的一部分。

您可以尝试使用str()函数将Unicode字符串转换为旧字符串,但是如果旧字符串中无法表示任何Unicode字符,则会出现异常。如果愿意,可以先用问号替换它们,但是当然这会导致这些字符不可读。str如果要正确处理Unicode字符,建议不要使用该类型。

There are two types of string in python: the traditional str type and the newer unicode type. If you type a string literal without the u in front you get the old str type which stores 8-bit characters, and with the u in front you get the newer unicode type that can store any Unicode character.

The r doesn’t change the type at all, it just changes how the string literal is interpreted. Without the r, backslashes are treated as escape characters. With the r, backslashes are treated as literal. Either way, the type is the same.

ur is of course a Unicode string where backslashes are literal backslashes, not part of escape codes.

You can try to convert a Unicode string to an old string using the str() function, but if there are any unicode characters that cannot be represented in the old string, you will get an exception. You could replace them with question marks first if you wish, but of course this would cause those characters to be unreadable. It is not recommended to use the str type if you want to correctly handle unicode characters.


回答 2

“原始字符串”表示将其存储为原样。例如,'\'只是一个反斜杠,而不是逃避

‘raw string’ means it is stored as it appears. For example, '\' is just a backslash instead of an escaping.


回答 3

“ u”前缀表示该值具有类型unicode而不是str

带有“ r”前缀的原始字符串文字将转义其中的所有转义序列,因此len(r"\n")也是如此。2。由于它们转义了转义序列,因此您不能以单个反斜杠结束字符串文字:这不是有效的转义序列(例如r"\")。

“原始”不是该类型的一部分,它只是表示值的一种方式。例如,"\\n"r"\n"是相同的值,就像320x200b100000是相同的。

您可以使用unicode原始字符串文字:

>>> u = ur"\n"
>>> print type(u), len(u)
<type 'unicode'> 2

源文件编码仅决定如何解释源文件,否则不会影响表达式或类型。但是,建议避免使用非ASCII编码会改变含义的代码:

使用ASCII的文件(对于Python 3.0,则为UTF-8)应该没有编码cookie。只有在注释或文档字符串需要提及需要使用Latin-1的作者姓名时,才应使用Latin-1(或UTF-8)。否则,使用\ x,\ u或\ U转义是在字符串文字中包含非ASCII数据的首选方法。

A “u” prefix denotes the value has type unicode rather than str.

Raw string literals, with an “r” prefix, escape any escape sequences within them, so len(r"\n") is 2. Because they escape escape sequences, you cannot end a string literal with a single backslash: that’s not a valid escape sequence (e.g. r"\").

“Raw” is not part of the type, it’s merely one way to represent the value. For example, "\\n" and r"\n" are identical values, just like 32, 0x20, and 0b100000 are identical.

You can have unicode raw string literals:

>>> u = ur"\n"
>>> print type(u), len(u)
<type 'unicode'> 2

The source file encoding just determines how to interpret the source file, it doesn’t affect expressions or types otherwise. However, it’s recommended to avoid code where an encoding other than ASCII would change the meaning:

Files using ASCII (or UTF-8, for Python 3.0) should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.


回答 4

让我简单地解释一下:在python 2中,您可以将字符串存储为2种不同的类型。

第一个是ASCII,它是python中的str类型,它使用1个字节的内存。(256个字符,将主要存储英文字母和简单符号)

第二种类型是UNICODE,它是python中的unicode类型。Unicode存储所有类型的语言。

默认情况下,python会更喜欢str类型,但是如果您想将字符串存储为unicode类型,则可以将u放在像u’text’这样的文本前面,也可以通过调用unicode(’text’)来实现

所以ü只是打电话投的功能一小段路海峡Unicode的。而已!

现在r部分,您将其放在文本前面以告诉计算机该文本是原始文本,反斜杠不应是转义字符。r’\ n’不会创建换行符。只是包含2个字符的纯文本。

如果要将str转换为unicode并将原始文本也放入其中,请使用ur,因为ru会引发错误。

现在,重要的部分:

您不能使用r来存储一个反斜杠,这是唯一的exceptions。因此,此代码将产生错误:r’\’

要存储反斜杠(仅一个),您需要使用“ \\”

如果要存储1个以上的字符,则仍可以使用r,r’\\’一样,将产生2个反斜杠,如您所愿。

我不知道r无法与一个反斜杠存储一起使用的原因,但至今尚未有人描述。我希望这是一个错误。

Let me explain it simply: In python 2, you can store string in 2 different types.

The first one is ASCII which is str type in python, it uses 1 byte of memory. (256 characters, will store mostly English alphabets and simple symbols)

The 2nd type is UNICODE which is unicode type in python. Unicode stores all types of languages.

By default, python will prefer str type but if you want to store string in unicode type you can put u in front of the text like u’text’ or you can do this by calling unicode(‘text’)

So u is just a short way to call a function to cast str to unicode. That’s it!

Now the r part, you put it in front of the text to tell the computer that the text is raw text, backslash should not be an escaping character. r’\n’ will not create a new line character. It’s just plain text containing 2 characters.

If you want to convert str to unicode and also put raw text in there, use ur because ru will raise an error.

NOW, the important part:

You cannot store one backslash by using r, it’s the only exception. So this code will produce error: r’\’

To store a backslash (only one) you need to use ‘\\’

If you want to store more than 1 characters you can still use r like r’\\’ will produce 2 backslashes as you expected.

I don’t know the reason why r doesn’t work with one backslash storage but the reason isn’t described by anyone yet. I hope that it is a bug.


回答 5

也许这很明显,也许不是,但是您可以通过调用x = chr(92)来使字符串“ \”

x=chr(92)
print type(x), len(x) # <type 'str'> 1
y='\\'
print type(y), len(y) # <type 'str'> 1
x==y   # True
x is y # False

Maybe this is obvious, maybe not, but you can make the string ‘\’ by calling x=chr(92)

x=chr(92)
print type(x), len(x) # <type 'str'> 1
y='\\'
print type(y), len(y) # <type 'str'> 1
x==y   # True
x is y # False

回答 6

Unicode字符串文字

Unicode字符串文字(以前缀的字符串文字u)在Python 3中不再使用。它们仍然有效,但仅出于与Python 2 兼容的目的

原始字符串文字

如果要创建仅由易于键入的字符(例如英文字母或数字)组成的字符串文字,只需键入以下内容即可:'hello world'。但是,如果您还想包含一些其他奇特的字符,则必须使用一些解决方法。解决方法之一是转义序列。这样,例如,您只需\n在字符串文字中添加两个易于键入的字符,即可在字符串中表示新行。因此,当您打印'hello\nworld'字符串时,单词将被打印在单独的行上。非常方便!

另一方面,在某些情况下,您想创建一个包含转义序列的字符串文字,但又不希望它们由Python解释。您希望它们变得生硬。看下面的例子:

'New updates are ready in c:\windows\updates\new'
'In this lesson we will learn what the \n escape sequence does.'

在这种情况下,您可以在字符串文字前加上如下r字符:r'hello\nworld'并且Python不会解释任何转义序列。字符串将完全按照您创建的样子打印。

原始字符串文字不是完全“原始”吗?

许多人期望原始字符串文字是原始的,因为“ Python会忽略引号之间的任何内容”。那是不对的。Python仍然可以识别所有转义序列,只是不解释它们-而是使它们保持不变。这意味着原始字符串文字仍然必须是有效的字符串文字

根据字符串文字的词汇定义

string     ::=  "'" stringitem* "'"
stringitem ::=  stringchar | escapeseq
stringchar ::=  <any source character except "\" or newline or the quote>
escapeseq  ::=  "\" <any source character>

显然,包含裸引号:'hello'world'或以反斜杠:结尾的字符串文字(无论是否原始)'hello world\'都是无效的。

Unicode string literals

Unicode string literals (string literals prefixed by u) are no longer used in Python 3. They are still valid but just for compatibility purposes with Python 2.

Raw string literals

If you want to create a string literal consisting of only easily typable characters like english letters or numbers, you can simply type them: 'hello world'. But if you want to include also some more exotic characters, you’ll have to use some workaround. One of the workarounds are Escape sequences. This way you can for example represent a new line in your string simply by adding two easily typable characters \n to your string literal. So when you print the 'hello\nworld' string, the words will be printed on separate lines. That’s very handy!

On the other hand, there are some situations when you want to create a string literal that contains escape sequences but you don’t want them to be interpreted by Python. You want them to be raw. Look at these examples:

'New updates are ready in c:\windows\updates\new'
'In this lesson we will learn what the \n escape sequence does.'

In such situations you can just prefix the string literal with the r character like this: r'hello\nworld' and no escape sequences will be interpreted by Python. The string will be printed exactly as you created it.

Raw string literals are not completely “raw”?

Many people expect the raw string literals to be raw in a sense that “anything placed between the quotes is ignored by Python”. That is not true. Python still recognizes all the escape sequences, it just does not interpret them – it leaves them unchanged instead. It means that raw string literals still have to be valid string literals.

From the lexical definition of a string literal:

string     ::=  "'" stringitem* "'"
stringitem ::=  stringchar | escapeseq
stringchar ::=  <any source character except "\" or newline or the quote>
escapeseq  ::=  "\" <any source character>

It is clear that string literals (raw or not) containing a bare quote character: 'hello'world' or ending with a backslash: 'hello world\' are not valid.


如何在Python的终端中打印彩色文本?

问题:如何在Python的终端中打印彩色文本?

如何在Python中将彩色文本输出到终端?代表实体块的最佳Unicode符号是什么?

How can I output colored text to the terminal, in Python? What is the best Unicode symbol to represent a solid block?


回答 0

这在某种程度上取决于您所使用的平台。最常见的方法是打印ANSI转义序列。对于一个简单的示例,这是Blender构建脚本中的一些python代码:

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

要使用这样的代码,您可以执行以下操作

print(bcolors.WARNING + "Warning: No active frommets remain. Continue?" + bcolors.ENDC)

或者,使用Python3.6 +:

print(f"{bcolors.WARNING}Warning: No active frommets remain. Continue?{bcolors.ENDC}")

这将在包括OS X,Linux和Windows的Unix上运行(前提是您使用ANSICON,或者在Windows 10中,前提是您启用了VT100仿真)。有用于设置颜色,移动光标等的ansi代码。

如果您要对此进行复杂化处理(听起来就像是在编写游戏),则应查看“ curses”模块,该模块为您处理了许多复杂的部分。在Python的诅咒HOWTO是一个很好的介绍。

如果您没有使用扩展的ASCII码(即不在PC上),那么您将只能使用127以下的ASCII字符,而“#”或“ @”可能是您最好的选择。如果可以确保您的终端使用的是IBM 扩展的ascii字符集,那么您还有更多选择。字符176、177、178和219是“块字符”。

一些现代的基于文本的程序,例如“矮人要塞”,以图形模式模拟文本模式,并使用经典PC字体的图像。您可以在Dwarf Fortress Wiki see(用户制作的tileset)上找到一些可以使用的位图。

文本模式设计大赛已在文本模式下做图形更多的资源。

嗯..我认为这个答案有点过头了。不过,我正在计划一个史诗般的基于文本的冒险游戏。祝您彩色文字好运!

This somewhat depends on what platform you are on. The most common way to do this is by printing ANSI escape sequences. For a simple example, here’s some python code from the blender build scripts:

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

To use code like this, you can do something like

print(bcolors.WARNING + "Warning: No active frommets remain. Continue?" + bcolors.ENDC)

or, with Python3.6+:

print(f"{bcolors.WARNING}Warning: No active frommets remain. Continue?{bcolors.ENDC}")

This will work on unixes including OS X, linux and windows (provided you use ANSICON, or in Windows 10 provided you enable VT100 emulation). There are ansi codes for setting the color, moving the cursor, and more.

If you are going to get complicated with this (and it sounds like you are if you are writing a game), you should look into the “curses” module, which handles a lot of the complicated parts of this for you. The Python Curses HowTO is a good introduction.

If you are not using extended ASCII (i.e. not on a PC), you are stuck with the ascii characters below 127, and ‘#’ or ‘@’ is probably your best bet for a block. If you can ensure your terminal is using a IBM extended ascii character set, you have many more options. Characters 176, 177, 178 and 219 are the “block characters”.

Some modern text-based programs, such as “Dwarf Fortress”, emulate text mode in a graphical mode, and use images of the classic PC font. You can find some of these bitmaps that you can use on the Dwarf Fortress Wiki see (user-made tilesets).

The Text Mode Demo Contest has more resources for doing graphics in text mode.

Hmm.. I think got a little carried away on this answer. I am in the midst of planning an epic text-based adventure game, though. Good luck with your colored text!


回答 1

我很惊讶没有人提到Python termcolor模块。用法很简单:

from termcolor import colored

print colored('hello', 'red'), colored('world', 'green')

或在Python 3中:

print(colored('hello', 'red'), colored('world', 'green'))

但是,对于游戏编程和您要执行的“彩色块”来说,它可能不够复杂…

I’m surprised no one has mentioned the Python termcolor module. Usage is pretty simple:

from termcolor import colored

print colored('hello', 'red'), colored('world', 'green')

Or in Python 3:

print(colored('hello', 'red'), colored('world', 'green'))

It may not be sophisticated enough, however, for game programming and the “colored blocks” that you want to do…


回答 2

答案是Colorama,用于Python中的所有跨平台着色。

Python 3.6示例屏幕截图:

The answer is Colorama for all cross-platform coloring in Python.

A Python 3.6 example screenshot:


回答 3

打印一个以颜色/样式开头的字符串,然后打印该字符串,然后通过以下命令结束颜色/样式更改'\x1b[0m'

print('\x1b[6;30;42m' + 'Success!' + '\x1b[0m')

使用以下代码获取外壳程序文本的格式选项表:

def print_format_table():
    """
    prints table of formatted text format options
    """
    for style in range(8):
        for fg in range(30,38):
            s1 = ''
            for bg in range(40,48):
                format = ';'.join([str(style), str(fg), str(bg)])
                s1 += '\x1b[%sm %s \x1b[0m' % (format, format)
            print(s1)
        print('\n')

print_format_table()

浅色示例(完整)

暗光示例(部分)

Print a string that starts a color/style, then the string, then end the color/style change with '\x1b[0m':

print('\x1b[6;30;42m' + 'Success!' + '\x1b[0m')

Get a table of format options for shell text with following code:

def print_format_table():
    """
    prints table of formatted text format options
    """
    for style in range(8):
        for fg in range(30,38):
            s1 = ''
            for bg in range(40,48):
                format = ';'.join([str(style), str(fg), str(bg)])
                s1 += '\x1b[%sm %s \x1b[0m' % (format, format)
            print(s1)
        print('\n')

print_format_table()

Light-on-dark example (complete)

Dark-on-light example (partial)


回答 4

定义一个以颜色开头的字符串和一个以颜色结尾的字符串,然后打印您的文本,其中起始字符串在前面,结尾字符串在结尾。

CRED = '\033[91m'
CEND = '\033[0m'
print(CRED + "Error, does not compute!" + CEND)

这会产生bash,并urxvt带有Zenburn风格的配色方案:

通过实验,我们可以获得更多的颜色:

注意:\33[5m\33[6m闪烁。

这样,我们可以创建完整的颜色集合:

CEND      = '\33[0m'
CBOLD     = '\33[1m'
CITALIC   = '\33[3m'
CURL      = '\33[4m'
CBLINK    = '\33[5m'
CBLINK2   = '\33[6m'
CSELECTED = '\33[7m'

CBLACK  = '\33[30m'
CRED    = '\33[31m'
CGREEN  = '\33[32m'
CYELLOW = '\33[33m'
CBLUE   = '\33[34m'
CVIOLET = '\33[35m'
CBEIGE  = '\33[36m'
CWHITE  = '\33[37m'

CBLACKBG  = '\33[40m'
CREDBG    = '\33[41m'
CGREENBG  = '\33[42m'
CYELLOWBG = '\33[43m'
CBLUEBG   = '\33[44m'
CVIOLETBG = '\33[45m'
CBEIGEBG  = '\33[46m'
CWHITEBG  = '\33[47m'

CGREY    = '\33[90m'
CRED2    = '\33[91m'
CGREEN2  = '\33[92m'
CYELLOW2 = '\33[93m'
CBLUE2   = '\33[94m'
CVIOLET2 = '\33[95m'
CBEIGE2  = '\33[96m'
CWHITE2  = '\33[97m'

CGREYBG    = '\33[100m'
CREDBG2    = '\33[101m'
CGREENBG2  = '\33[102m'
CYELLOWBG2 = '\33[103m'
CBLUEBG2   = '\33[104m'
CVIOLETBG2 = '\33[105m'
CBEIGEBG2  = '\33[106m'
CWHITEBG2  = '\33[107m'

这是生成测试的代码:

x = 0
for i in range(24):
  colors = ""
  for j in range(5):
    code = str(x+j)
    colors = colors + "\33[" + code + "m\\33[" + code + "m\033[0m "
  print(colors)
  x=x+5

Define a string that starts a color and a string that ends the color, then print your text with the start string at the front and the end string at the end.

CRED = '\033[91m'
CEND = '\033[0m'
print(CRED + "Error, does not compute!" + CEND)

This produces the following in bash, in urxvt with a Zenburn-style color scheme:

Through experimentation, we can get more colors:

Note: \33[5m and \33[6m are blinking.

This way we can create a full color collection:

CEND      = '\33[0m'
CBOLD     = '\33[1m'
CITALIC   = '\33[3m'
CURL      = '\33[4m'
CBLINK    = '\33[5m'
CBLINK2   = '\33[6m'
CSELECTED = '\33[7m'

CBLACK  = '\33[30m'
CRED    = '\33[31m'
CGREEN  = '\33[32m'
CYELLOW = '\33[33m'
CBLUE   = '\33[34m'
CVIOLET = '\33[35m'
CBEIGE  = '\33[36m'
CWHITE  = '\33[37m'

CBLACKBG  = '\33[40m'
CREDBG    = '\33[41m'
CGREENBG  = '\33[42m'
CYELLOWBG = '\33[43m'
CBLUEBG   = '\33[44m'
CVIOLETBG = '\33[45m'
CBEIGEBG  = '\33[46m'
CWHITEBG  = '\33[47m'

CGREY    = '\33[90m'
CRED2    = '\33[91m'
CGREEN2  = '\33[92m'
CYELLOW2 = '\33[93m'
CBLUE2   = '\33[94m'
CVIOLET2 = '\33[95m'
CBEIGE2  = '\33[96m'
CWHITE2  = '\33[97m'

CGREYBG    = '\33[100m'
CREDBG2    = '\33[101m'
CGREENBG2  = '\33[102m'
CYELLOWBG2 = '\33[103m'
CBLUEBG2   = '\33[104m'
CVIOLETBG2 = '\33[105m'
CBEIGEBG2  = '\33[106m'
CWHITEBG2  = '\33[107m'

Here is the code to generate the test:

x = 0
for i in range(24):
  colors = ""
  for j in range(5):
    code = str(x+j)
    colors = colors + "\33[" + code + "m\\33[" + code + "m\033[0m "
  print(colors)
  x=x+5

回答 5

您想了解ANSI转义序列。这是一个简单的示例:

CSI="\x1B["
print(CSI+"31;40m" + "Colored Text" + CSI + "0m")

有关更多信息,请参见http://en.wikipedia.org/wiki/ANSI_escape_code

对于块字符,请尝试使用\ u2588这样的Unicode字符:

print(u"\u2588")

放在一起:

print(CSI+"31;40m" + u"\u2588" + CSI + "0m")

You want to learn about ANSI escape sequences. Here’s a brief example:

CSI="\x1B["
print(CSI+"31;40m" + "Colored Text" + CSI + "0m")

For more info see http://en.wikipedia.org/wiki/ANSI_escape_code

For a block character, try a unicode character like \u2588:

print(u"\u2588")

Putting it all together:

print(CSI+"31;40m" + u"\u2588" + CSI + "0m")

回答 6

我之所以做出回应,是因为我找到了一种在Windows 10上使用ANSI代码的方法,这样您就可以更改文本的颜色而无需任何内置模块:

进行此工作的行是os.system("")或任何其他系统调用,它使您可以在终端中打印ANSI代码:

import os

os.system("")

# Group of Different functions for different styles
class style():
    BLACK = '\033[30m'
    RED = '\033[31m'
    GREEN = '\033[32m'
    YELLOW = '\033[33m'
    BLUE = '\033[34m'
    MAGENTA = '\033[35m'
    CYAN = '\033[36m'
    WHITE = '\033[37m'
    UNDERLINE = '\033[4m'
    RESET = '\033[0m'

print(style.YELLOW + "Hello, World!")

注意:尽管此选项与其他Windows选项具有相同的选项,但是Windows即使使用此技巧也无法完全支持ANSI代码。并非所有的文本装饰颜色都起作用,并且所有“明亮”颜色(代码90-97和100-107)显示的颜色与常规颜色相同(代码30-37和40-47)

编辑:感谢@jl查找更短的方法。

tl; dros.system("")在文件顶部附近添加。

Python版本: 3.6.7

I’m responding because I have found out a way to use ANSI codes on Windows 10, so that you can change the colour of text without any modules that aren’t built in:

The line that makes this work is os.system(""), or any other system call, which allows you to print ANSI codes in the Terminal:

import os

os.system("")

# Group of Different functions for different styles
class style():
    BLACK = '\033[30m'
    RED = '\033[31m'
    GREEN = '\033[32m'
    YELLOW = '\033[33m'
    BLUE = '\033[34m'
    MAGENTA = '\033[35m'
    CYAN = '\033[36m'
    WHITE = '\033[37m'
    UNDERLINE = '\033[4m'
    RESET = '\033[0m'

print(style.YELLOW + "Hello, World!")

Note: Although this gives the same options as other Windows options, Windows does not full support ANSI codes, even with this trick. Not all the text decoration colours work and all the ‘bright’ colours (Codes 90-97 and 100-107) display the same as the regular colours (Codes 30-37 and 40-47)

Edit: Thanks to @j-l for finding an even shorter method.

tl;dr: Add os.system("") near the top of your file.

Python Version: 3.6.7


回答 7

我最喜欢的方法是使用Blessings库(完整披露:我写了它)。例如:

from blessings import Terminal

t = Terminal()
print t.red('This is red.')
print t.bold_bright_red_on_black('Bright red on black')

要打印彩色砖,最可靠的方法是使用背景色打印空间。我用这种技术绘制了鼻子渐进式的进度条

print t.on_green(' ')

您也可以在特定位置打印:

with t.location(0, 5):
    print t.on_yellow(' ')

如果您在游戏过程中不得不考虑其他终端功能,也可以这样做。您可以使用Python的标准字符串格式来保持可读性:

print '{t.clear_eol}You just cleared a {t.bold}whole{t.normal} line!'.format(t=t)

Blessings的好处在于,它尽其所能在各种终端上工作,而不仅仅是(绝大多数)ANSI颜色终端。它还在使代码简洁明了的同时,将无法读取的转义序列保留在代码之外。玩得开心!

My favorite way is with the Blessings library (full disclosure: I wrote it). For example:

from blessings import Terminal

t = Terminal()
print t.red('This is red.')
print t.bold_bright_red_on_black('Bright red on black')

To print colored bricks, the most reliable way is to print spaces with background colors. I use this technique to draw the progress bar in nose-progressive:

print t.on_green(' ')

You can print in specific locations as well:

with t.location(0, 5):
    print t.on_yellow(' ')

If you have to muck with other terminal capabilities in the course of your game, you can do that as well. You can use Python’s standard string formatting to keep it readable:

print '{t.clear_eol}You just cleared a {t.bold}whole{t.normal} line!'.format(t=t)

The nice thing about Blessings is that it does its best to work on all sorts of terminals, not just the (overwhelmingly common) ANSI-color ones. It also keeps unreadable escape sequences out of your code while remaining concise to use. Have fun!


回答 8

sty与colorama相似,但较为冗长,支持8位24位(rgb)颜色,允许您注册自己的样式,支持静音,非常灵活,文档丰富,并且更多。

例子:

from sty import fg, bg, ef, rs

foo = fg.red + 'This is red text!' + fg.rs
bar = bg.blue + 'This has a blue background!' + bg.rs
baz = ef.italic + 'This is italic text' + rs.italic
qux = fg(201) + 'This is pink text using 8bit colors' + fg.rs
qui = fg(255, 10, 10) + 'This is red text using 24bit colors.' + fg.rs

# Add custom colors:

from sty import Style, RgbFg

fg.orange = Style(RgbFg(255, 150, 50))

buf = fg.orange + 'Yay, Im orange.' + fg.rs

print(foo, bar, baz, qux, qui, buf, sep='\n')

印刷品:

演示:

sty is similar to colorama, but it’s less verbose, supports 8bit and 24bit (rgb) colors, allows you to register your own styles, supports muting, is really flexible, well documented and more.

Examples:

from sty import fg, bg, ef, rs

foo = fg.red + 'This is red text!' + fg.rs
bar = bg.blue + 'This has a blue background!' + bg.rs
baz = ef.italic + 'This is italic text' + rs.italic
qux = fg(201) + 'This is pink text using 8bit colors' + fg.rs
qui = fg(255, 10, 10) + 'This is red text using 24bit colors.' + fg.rs

# Add custom colors:

from sty import Style, RgbFg

fg.orange = Style(RgbFg(255, 150, 50))

buf = fg.orange + 'Yay, Im orange.' + fg.rs

print(foo, bar, baz, qux, qui, buf, sep='\n')

prints:

Demo:


回答 9

使用for循环生成一个具有所有颜色的类,以将每种颜色的组合最多迭代到100,然后编写一个带有python颜色的类。请按我的意愿复制并粘贴GPLv2:

class colors:
    '''Colors class:
    reset all colors with colors.reset
    two subclasses fg for foreground and bg for background.
    use as colors.subclass.colorname.
    i.e. colors.fg.red or colors.bg.green
    also, the generic bold, disable, underline, reverse, strikethrough,
    and invisible work with the main class
    i.e. colors.bold
    '''
    reset='\033[0m'
    bold='\033[01m'
    disable='\033[02m'
    underline='\033[04m'
    reverse='\033[07m'
    strikethrough='\033[09m'
    invisible='\033[08m'
    class fg:
        black='\033[30m'
        red='\033[31m'
        green='\033[32m'
        orange='\033[33m'
        blue='\033[34m'
        purple='\033[35m'
        cyan='\033[36m'
        lightgrey='\033[37m'
        darkgrey='\033[90m'
        lightred='\033[91m'
        lightgreen='\033[92m'
        yellow='\033[93m'
        lightblue='\033[94m'
        pink='\033[95m'
        lightcyan='\033[96m'
    class bg:
        black='\033[40m'
        red='\033[41m'
        green='\033[42m'
        orange='\033[43m'
        blue='\033[44m'
        purple='\033[45m'
        cyan='\033[46m'
        lightgrey='\033[47m'

generated a class with all the colors using a for loop to iterate every combination of color up to 100, then wrote a class with python colors. Copy and paste as you will, GPLv2 by me:

class colors:
    '''Colors class:
    reset all colors with colors.reset
    two subclasses fg for foreground and bg for background.
    use as colors.subclass.colorname.
    i.e. colors.fg.red or colors.bg.green
    also, the generic bold, disable, underline, reverse, strikethrough,
    and invisible work with the main class
    i.e. colors.bold
    '''
    reset='\033[0m'
    bold='\033[01m'
    disable='\033[02m'
    underline='\033[04m'
    reverse='\033[07m'
    strikethrough='\033[09m'
    invisible='\033[08m'
    class fg:
        black='\033[30m'
        red='\033[31m'
        green='\033[32m'
        orange='\033[33m'
        blue='\033[34m'
        purple='\033[35m'
        cyan='\033[36m'
        lightgrey='\033[37m'
        darkgrey='\033[90m'
        lightred='\033[91m'
        lightgreen='\033[92m'
        yellow='\033[93m'
        lightblue='\033[94m'
        pink='\033[95m'
        lightcyan='\033[96m'
    class bg:
        black='\033[40m'
        red='\033[41m'
        green='\033[42m'
        orange='\033[43m'
        blue='\033[44m'
        purple='\033[45m'
        cyan='\033[46m'
        lightgrey='\033[47m'

回答 10

试试这个简单的代码

def prRed(prt): print("\033[91m {}\033[00m" .format(prt))
def prGreen(prt): print("\033[92m {}\033[00m" .format(prt))
def prYellow(prt): print("\033[93m {}\033[00m" .format(prt))
def prLightPurple(prt): print("\033[94m {}\033[00m" .format(prt))
def prPurple(prt): print("\033[95m {}\033[00m" .format(prt))
def prCyan(prt): print("\033[96m {}\033[00m" .format(prt))
def prLightGray(prt): print("\033[97m {}\033[00m" .format(prt))
def prBlack(prt): print("\033[98m {}\033[00m" .format(prt))

prGreen("Hello world")

Try this simple code

def prRed(prt): print("\033[91m {}\033[00m" .format(prt))
def prGreen(prt): print("\033[92m {}\033[00m" .format(prt))
def prYellow(prt): print("\033[93m {}\033[00m" .format(prt))
def prLightPurple(prt): print("\033[94m {}\033[00m" .format(prt))
def prPurple(prt): print("\033[95m {}\033[00m" .format(prt))
def prCyan(prt): print("\033[96m {}\033[00m" .format(prt))
def prLightGray(prt): print("\033[97m {}\033[00m" .format(prt))
def prBlack(prt): print("\033[98m {}\033[00m" .format(prt))

prGreen("Hello world")

回答 11

在Windows上,您可以使用模块“ win32console”(在某些Python发行版中可用)或模块“ ctypes”(Python 2.5及更高版本)来访问Win32 API。

要查看完整的代码,同时支持方式,见色控制台报告代码Testoob

ctypes示例:

import ctypes

# Constants from the Windows API
STD_OUTPUT_HANDLE = -11
FOREGROUND_RED    = 0x0004 # text color contains red.

def get_csbi_attributes(handle):
    # Based on IPython's winconsole.py, written by Alexander Belchenko
    import struct
    csbi = ctypes.create_string_buffer(22)
    res = ctypes.windll.kernel32.GetConsoleScreenBufferInfo(handle, csbi)
    assert res

    (bufx, bufy, curx, cury, wattr,
    left, top, right, bottom, maxx, maxy) = struct.unpack("hhhhHhhhhhh", csbi.raw)
    return wattr


handle = ctypes.windll.kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
reset = get_csbi_attributes(handle)

ctypes.windll.kernel32.SetConsoleTextAttribute(handle, FOREGROUND_RED)
print "Cherry on top"
ctypes.windll.kernel32.SetConsoleTextAttribute(handle, reset)

On Windows you can use module ‘win32console’ (available in some Python distributions) or module ‘ctypes’ (Python 2.5 and up) to access the Win32 API.

To see complete code that supports both ways, see the color console reporting code from Testoob.

ctypes example:

import ctypes

# Constants from the Windows API
STD_OUTPUT_HANDLE = -11
FOREGROUND_RED    = 0x0004 # text color contains red.

def get_csbi_attributes(handle):
    # Based on IPython's winconsole.py, written by Alexander Belchenko
    import struct
    csbi = ctypes.create_string_buffer(22)
    res = ctypes.windll.kernel32.GetConsoleScreenBufferInfo(handle, csbi)
    assert res

    (bufx, bufy, curx, cury, wattr,
    left, top, right, bottom, maxx, maxy) = struct.unpack("hhhhHhhhhhh", csbi.raw)
    return wattr


handle = ctypes.windll.kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
reset = get_csbi_attributes(handle)

ctypes.windll.kernel32.SetConsoleTextAttribute(handle, FOREGROUND_RED)
print "Cherry on top"
ctypes.windll.kernel32.SetConsoleTextAttribute(handle, reset)

回答 12

基于@joeld的答案非常简单

class PrintInColor:
    RED = '\033[91m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    LIGHT_PURPLE = '\033[94m'
    PURPLE = '\033[95m'
    END = '\033[0m'

    @classmethod
    def red(cls, s, **kwargs):
        print(cls.RED + s + cls.END, **kwargs)

    @classmethod
    def green(cls, s, **kwargs):
        print(cls.GREEN + s + cls.END, **kwargs)

    @classmethod
    def yellow(cls, s, **kwargs):
        print(cls.YELLOW + s + cls.END, **kwargs)

    @classmethod
    def lightPurple(cls, s, **kwargs):
        print(cls.LIGHT_PURPLE + s + cls.END, **kwargs)

    @classmethod
    def purple(cls, s, **kwargs):
        print(cls.PURPLE + s + cls.END, **kwargs)

然后就

PrintInColor.red('hello', end=' ')
PrintInColor.green('world')

Stupidly simple based on @joeld’s answer

class PrintInColor:
    RED = '\033[91m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    LIGHT_PURPLE = '\033[94m'
    PURPLE = '\033[95m'
    END = '\033[0m'

    @classmethod
    def red(cls, s, **kwargs):
        print(cls.RED + s + cls.END, **kwargs)

    @classmethod
    def green(cls, s, **kwargs):
        print(cls.GREEN + s + cls.END, **kwargs)

    @classmethod
    def yellow(cls, s, **kwargs):
        print(cls.YELLOW + s + cls.END, **kwargs)

    @classmethod
    def lightPurple(cls, s, **kwargs):
        print(cls.LIGHT_PURPLE + s + cls.END, **kwargs)

    @classmethod
    def purple(cls, s, **kwargs):
        print(cls.PURPLE + s + cls.END, **kwargs)

Then just

PrintInColor.red('hello', end=' ')
PrintInColor.green('world')

回答 13

我已经将@joeld答案包装到具有全局函数的模块中,可以在代码的任何地方使用它。

文件:log.py

HEADER = '\033[95m'
OKBLUE = '\033[94m'
OKGREEN = '\033[92m'
WARNING = '\033[93m'
FAIL = '\033[91m'
ENDC = '\033[0m'
BOLD = "\033[1m"

def disable():
    HEADER = ''
    OKBLUE = ''
    OKGREEN = ''
    WARNING = ''
    FAIL = ''
    ENDC = ''

def infog( msg):
    print OKGREEN + msg + ENDC

def info( msg):
    print OKBLUE + msg + ENDC

def warn( msg):
    print WARNING + msg + ENDC

def err( msg):
    print FAIL + msg + ENDC

用途如下:

 import log
    log.info("Hello World")
    log.err("System Error")

I have wrapped @joeld answer into a module with global functions that I can use anywhere in my code.

file: log.py

HEADER = '\033[95m'
OKBLUE = '\033[94m'
OKGREEN = '\033[92m'
WARNING = '\033[93m'
FAIL = '\033[91m'
ENDC = '\033[0m'
BOLD = "\033[1m"

def disable():
    HEADER = ''
    OKBLUE = ''
    OKGREEN = ''
    WARNING = ''
    FAIL = ''
    ENDC = ''

def infog( msg):
    print OKGREEN + msg + ENDC

def info( msg):
    print OKBLUE + msg + ENDC

def warn( msg):
    print WARNING + msg + ENDC

def err( msg):
    print FAIL + msg + ENDC

use as follows:

 import log
    log.info("Hello World")
    log.err("System Error")

回答 14

对于Windows,除非使用win32api,否则无法使用颜色打印到控制台。

对于Linux,这就像使用print一样简单,这里概述了转义序列:

色彩

对于要像盒子一样打印的字符,实际上取决于您在控制台窗口中使用的字体。井字符号效果很好,但是取决于字体:

#

For Windows you cannot print to console with colors unless you’re using the win32api.

For Linux it’s as simple as using print, with the escape sequences outlined here:

Colors

For the character to print like a box, it really depends on what font you are using for the console window. The pound symbol works well, but it depends on the font:

#

回答 15

# Pure Python 3.x demo, 256 colors
# Works with bash under Linux and MacOS

fg = lambda text, color: "\33[38;5;" + str(color) + "m" + text + "\33[0m"
bg = lambda text, color: "\33[48;5;" + str(color) + "m" + text + "\33[0m"

def print_six(row, format, end="\n"):
    for col in range(6):
        color = row*6 + col - 2
        if color>=0:
            text = "{:3d}".format(color)
            print (format(text,color), end=" ")
        else:
            print(end="    ")   # four spaces
    print(end=end)

for row in range(0, 43):
    print_six(row, fg, " ")
    print_six(row, bg)

# Simple usage: print(fg("text", 160))

# Pure Python 3.x demo, 256 colors
# Works with bash under Linux and MacOS

fg = lambda text, color: "\33[38;5;" + str(color) + "m" + text + "\33[0m"
bg = lambda text, color: "\33[48;5;" + str(color) + "m" + text + "\33[0m"

def print_six(row, format, end="\n"):
    for col in range(6):
        color = row*6 + col - 2
        if color>=0:
            text = "{:3d}".format(color)
            print (format(text,color), end=" ")
        else:
            print(end="    ")   # four spaces
    print(end=end)

for row in range(0, 43):
    print_six(row, fg, " ")
    print_six(row, bg)

# Simple usage: print(fg("text", 160))


回答 16

我最终这样做了,我觉得那是最干净的:

formatters = {             
    'RED': '\033[91m',     
    'GREEN': '\033[92m',   
    'END': '\033[0m',      
}

print 'Master is currently {RED}red{END}!'.format(**formatters)
print 'Help make master {GREEN}green{END} again!'.format(**formatters)

I ended up doing this, I felt it was cleanest:

formatters = {             
    'RED': '\033[91m',     
    'GREEN': '\033[92m',   
    'END': '\033[0m',      
}

print 'Master is currently {RED}red{END}!'.format(**formatters)
print 'Help make master {GREEN}green{END} again!'.format(**formatters)

回答 17

使用https://pypi.python.org/pypi/lazyme 在@joeld答案上构建pip install -U lazyme

from lazyme.string import color_print
>>> color_print('abc')
abc
>>> color_print('abc', color='pink')
abc
>>> color_print('abc', color='red')
abc
>>> color_print('abc', color='yellow')
abc
>>> color_print('abc', color='green')
abc
>>> color_print('abc', color='blue', underline=True)
abc
>>> color_print('abc', color='blue', underline=True, bold=True)
abc
>>> color_print('abc', color='pink', underline=True, bold=True)
abc

屏幕截图:


color_print使用新的格式化程序对进行了一些更新,例如:

>>> from lazyme.string import palette, highlighter, formatter
>>> from lazyme.string import color_print
>>> palette.keys() # Available colors.
['pink', 'yellow', 'cyan', 'magenta', 'blue', 'gray', 'default', 'black', 'green', 'white', 'red']
>>> highlighter.keys() # Available highlights.
['blue', 'pink', 'gray', 'black', 'yellow', 'cyan', 'green', 'magenta', 'white', 'red']
>>> formatter.keys() # Available formatter, 
['hide', 'bold', 'italic', 'default', 'fast_blinking', 'faint', 'strikethrough', 'underline', 'blinking', 'reverse']

注:italicfast blinkingstrikethrough可能无法在所有终端上使用,在Mac / Ubuntu上也无法使用。

例如

>>> color_print('foo bar', color='pink', highlight='white')
foo bar
>>> color_print('foo bar', color='pink', highlight='white', reverse=True)
foo bar
>>> color_print('foo bar', color='pink', highlight='white', bold=True)
foo bar
>>> color_print('foo bar', color='pink', highlight='white', faint=True)
foo bar
>>> color_print('foo bar', color='pink', highlight='white', faint=True, reverse=True)
foo bar
>>> color_print('foo bar', color='pink', highlight='white', underline=True, reverse=True)
foo bar

屏幕截图:

Building on @joeld answer, using https://pypi.python.org/pypi/lazyme pip install -U lazyme :

from lazyme.string import color_print
>>> color_print('abc')
abc
>>> color_print('abc', color='pink')
abc
>>> color_print('abc', color='red')
abc
>>> color_print('abc', color='yellow')
abc
>>> color_print('abc', color='green')
abc
>>> color_print('abc', color='blue', underline=True)
abc
>>> color_print('abc', color='blue', underline=True, bold=True)
abc
>>> color_print('abc', color='pink', underline=True, bold=True)
abc

Screenshot:


Some updates to the color_print with new formatters, e.g.:

>>> from lazyme.string import palette, highlighter, formatter
>>> from lazyme.string import color_print
>>> palette.keys() # Available colors.
['pink', 'yellow', 'cyan', 'magenta', 'blue', 'gray', 'default', 'black', 'green', 'white', 'red']
>>> highlighter.keys() # Available highlights.
['blue', 'pink', 'gray', 'black', 'yellow', 'cyan', 'green', 'magenta', 'white', 'red']
>>> formatter.keys() # Available formatter, 
['hide', 'bold', 'italic', 'default', 'fast_blinking', 'faint', 'strikethrough', 'underline', 'blinking', 'reverse']

Note: italic, fast blinking and strikethrough may not work on all terminals, doesn’t work on Mac / Ubuntu.

E.g.

>>> color_print('foo bar', color='pink', highlight='white')
foo bar
>>> color_print('foo bar', color='pink', highlight='white', reverse=True)
foo bar
>>> color_print('foo bar', color='pink', highlight='white', bold=True)
foo bar
>>> color_print('foo bar', color='pink', highlight='white', faint=True)
foo bar
>>> color_print('foo bar', color='pink', highlight='white', faint=True, reverse=True)
foo bar
>>> color_print('foo bar', color='pink', highlight='white', underline=True, reverse=True)
foo bar

Screenshot:


回答 18

def black(text):
    print('\033[30m', text, '\033[0m', sep='')

def red(text):
    print('\033[31m', text, '\033[0m', sep='')

def green(text):
    print('\033[32m', text, '\033[0m', sep='')

def yellow(text):
    print('\033[33m', text, '\033[0m', sep='')

def blue(text):
    print('\033[34m', text, '\033[0m', sep='')

def magenta(text):
    print('\033[35m', text, '\033[0m', sep='')

def cyan(text):
    print('\033[36m', text, '\033[0m', sep='')

def gray(text):
    print('\033[90m', text, '\033[0m', sep='')


black("BLACK")
red("RED")
green("GREEN")
yellow("YELLOW")
blue("BLACK")
magenta("MAGENTA")
cyan("CYAN")
gray("GRAY")

在线尝试

def black(text):
    print('\033[30m', text, '\033[0m', sep='')

def red(text):
    print('\033[31m', text, '\033[0m', sep='')

def green(text):
    print('\033[32m', text, '\033[0m', sep='')

def yellow(text):
    print('\033[33m', text, '\033[0m', sep='')

def blue(text):
    print('\033[34m', text, '\033[0m', sep='')

def magenta(text):
    print('\033[35m', text, '\033[0m', sep='')

def cyan(text):
    print('\033[36m', text, '\033[0m', sep='')

def gray(text):
    print('\033[90m', text, '\033[0m', sep='')


black("BLACK")
red("RED")
green("GREEN")
yellow("YELLOW")
blue("BLACK")
magenta("MAGENTA")
cyan("CYAN")
gray("GRAY")

Try online


回答 19

请注意,with关键字与需要重置的修饰符(使用Python 3和Colorama)混合的程度如何:

from colorama import Fore, Style
import sys

class Highlight:
  def __init__(self, clazz, color):
    self.color = color
    self.clazz = clazz
  def __enter__(self):
    print(self.color, end="")
  def __exit__(self, type, value, traceback):
    if self.clazz == Fore:
      print(Fore.RESET, end="")
    else:
      assert self.clazz == Style
      print(Style.RESET_ALL, end="")
    sys.stdout.flush()

with Highlight(Fore, Fore.GREEN):
  print("this is highlighted")
print("this is not")

note how well the with keyword mixes with modifiers like these that need to be reset (using Python 3 and Colorama):

from colorama import Fore, Style
import sys

class Highlight:
  def __init__(self, clazz, color):
    self.color = color
    self.clazz = clazz
  def __enter__(self):
    print(self.color, end="")
  def __exit__(self, type, value, traceback):
    if self.clazz == Fore:
      print(Fore.RESET, end="")
    else:
      assert self.clazz == Style
      print(Style.RESET_ALL, end="")
    sys.stdout.flush()

with Highlight(Fore, Fore.GREEN):
  print("this is highlighted")
print("this is not")

回答 20

您可以使用curses库的Python实现:http : //docs.python.org/library/curses.html

另外,运行此命令,您将找到您的盒子:

for i in range(255):
    print i, chr(i)

You can use the Python implementation of the curses library: http://docs.python.org/library/curses.html

Also, run this and you’ll find your box:

for i in range(255):
    print i, chr(i)

回答 21

您可以使用CLINT:

from clint.textui import colored
print colored.red('some warning message')
print colored.green('nicely done!')

从GitHub获取它

You could use CLINT:

from clint.textui import colored
print colored.red('some warning message')
print colored.green('nicely done!')

Get it from GitHub.


回答 22

如果您正在编写游戏,也许您想更改背景颜色并仅使用空格?例如:

print " "+ "\033[01;41m" + " " +"\033[01;46m"  + "  " + "\033[01;42m"

If you are programming a game perhaps you would like to change the background color and use only spaces? For example:

print " "+ "\033[01;41m" + " " +"\033[01;46m"  + "  " + "\033[01;42m"

回答 23

我知道我迟到了。但是,我有一个名为ColorIt。非常简单。

这里有些例子:

from ColorIt import *

# Use this to ensure that ColorIt will be usable by certain command line interfaces
initColorIt()

# Foreground
print (color ('This text is red', colors.RED))
print (color ('This text is orange', colors.ORANGE))
print (color ('This text is yellow', colors.YELLOW))
print (color ('This text is green', colors.GREEN))
print (color ('This text is blue', colors.BLUE))
print (color ('This text is purple', colors.PURPLE))
print (color ('This text is white', colors.WHITE))

# Background
print (background ('This text has a background that is red', colors.RED))
print (background ('This text has a background that is orange', colors.ORANGE))
print (background ('This text has a background that is yellow', colors.YELLOW))
print (background ('This text has a background that is green', colors.GREEN))
print (background ('This text has a background that is blue', colors.BLUE))
print (background ('This text has a background that is purple', colors.PURPLE))
print (background ('This text has a background that is white', colors.WHITE))

# Custom
print (color ("This color has a custom grey text color", (150, 150, 150))
print (background ("This color has a custom grey background", (150, 150, 150))

# Combination
print (background (color ("This text is blue with a white background", colors.BLUE), colors.WHITE))

这给您:

还值得注意的是,这是跨平台的,并且已经在Mac,Linux和Windows上进行了测试。

您可能想尝试一下:https : //github.com/CodeForeverAndEver/ColorIt

注意:几天后将添加闪烁,斜体,粗体等。

I know that I am late. But, I have a library called ColorIt. It is super simple.

Here are some examples:

from ColorIt import *

# Use this to ensure that ColorIt will be usable by certain command line interfaces
initColorIt()

# Foreground
print (color ('This text is red', colors.RED))
print (color ('This text is orange', colors.ORANGE))
print (color ('This text is yellow', colors.YELLOW))
print (color ('This text is green', colors.GREEN))
print (color ('This text is blue', colors.BLUE))
print (color ('This text is purple', colors.PURPLE))
print (color ('This text is white', colors.WHITE))

# Background
print (background ('This text has a background that is red', colors.RED))
print (background ('This text has a background that is orange', colors.ORANGE))
print (background ('This text has a background that is yellow', colors.YELLOW))
print (background ('This text has a background that is green', colors.GREEN))
print (background ('This text has a background that is blue', colors.BLUE))
print (background ('This text has a background that is purple', colors.PURPLE))
print (background ('This text has a background that is white', colors.WHITE))

# Custom
print (color ("This color has a custom grey text color", (150, 150, 150))
print (background ("This color has a custom grey background", (150, 150, 150))

# Combination
print (background (color ("This text is blue with a white background", colors.BLUE), colors.WHITE))

This gives you:

It’s also worth noting that this is cross platform and has been tested on mac, linux, and windows.

You might want to try it out: https://github.com/CodeForeverAndEver/ColorIt

Note: Blinking, italics, bold, etc. will be added in a few days.


回答 24

如果您使用的是Windows,那么就到这里!

# display text on a Windows console
# Windows XP with Python27 or Python32
from ctypes import windll
# needed for Python2/Python3 diff
try:
    input = raw_input
except:
    pass
STD_OUTPUT_HANDLE = -11
stdout_handle = windll.kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
# look at the output and select the color you want
# for instance hex E is yellow on black
# hex 1E is yellow on blue
# hex 2E is yellow on green and so on
for color in range(0, 75):
     windll.kernel32.SetConsoleTextAttribute(stdout_handle, color)
     print("%X --> %s" % (color, "Have a fine day!"))
     input("Press Enter to go on ... ")

If you are using Windows, then here you go!

# display text on a Windows console
# Windows XP with Python27 or Python32
from ctypes import windll
# needed for Python2/Python3 diff
try:
    input = raw_input
except:
    pass
STD_OUTPUT_HANDLE = -11
stdout_handle = windll.kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
# look at the output and select the color you want
# for instance hex E is yellow on black
# hex 1E is yellow on blue
# hex 2E is yellow on green and so on
for color in range(0, 75):
     windll.kernel32.SetConsoleTextAttribute(stdout_handle, color)
     print("%X --> %s" % (color, "Have a fine day!"))
     input("Press Enter to go on ... ")

回答 25

如果您使用的是Django

>>> from django.utils.termcolors import colorize
>>> print colorize("Hello World!", fg="blue", bg='red',
...                 opts=('bold', 'blink', 'underscore',))
Hello World!
>>> help(colorize)

快照:

(我通常在运行服务器终端上使用彩色输出进行调试,所以我添加了它。)

您可以测试它是否已安装在您的计算机中:
$ python -c "import django; print django.VERSION"
要安装它,请检查:如何安装Django

试试看!!

If you are using Django

>>> from django.utils.termcolors import colorize
>>> print colorize("Hello World!", fg="blue", bg='red',
...                 opts=('bold', 'blink', 'underscore',))
Hello World!
>>> help(colorize)

snapshot:

(I generally use colored output for debugging on runserver terminal so I added it.)

You can test if it is installed in your machine:
$ python -c "import django; print django.VERSION"
To install it check: How to install Django

Give it a Try!!


回答 26

这是一个诅咒的例子:

import curses

def main(stdscr):
    stdscr.clear()
    if curses.has_colors():
        for i in xrange(1, curses.COLORS):
            curses.init_pair(i, i, curses.COLOR_BLACK)
            stdscr.addstr("COLOR %d! " % i, curses.color_pair(i))
            stdscr.addstr("BOLD! ", curses.color_pair(i) | curses.A_BOLD)
            stdscr.addstr("STANDOUT! ", curses.color_pair(i) | curses.A_STANDOUT)
            stdscr.addstr("UNDERLINE! ", curses.color_pair(i) | curses.A_UNDERLINE)
            stdscr.addstr("BLINK! ", curses.color_pair(i) | curses.A_BLINK)
            stdscr.addstr("DIM! ", curses.color_pair(i) | curses.A_DIM)
            stdscr.addstr("REVERSE! ", curses.color_pair(i) | curses.A_REVERSE)
    stdscr.refresh()
    stdscr.getch()

if __name__ == '__main__':
    print "init..."
    curses.wrapper(main)

Here’s a curses example:

import curses

def main(stdscr):
    stdscr.clear()
    if curses.has_colors():
        for i in xrange(1, curses.COLORS):
            curses.init_pair(i, i, curses.COLOR_BLACK)
            stdscr.addstr("COLOR %d! " % i, curses.color_pair(i))
            stdscr.addstr("BOLD! ", curses.color_pair(i) | curses.A_BOLD)
            stdscr.addstr("STANDOUT! ", curses.color_pair(i) | curses.A_STANDOUT)
            stdscr.addstr("UNDERLINE! ", curses.color_pair(i) | curses.A_UNDERLINE)
            stdscr.addstr("BLINK! ", curses.color_pair(i) | curses.A_BLINK)
            stdscr.addstr("DIM! ", curses.color_pair(i) | curses.A_DIM)
            stdscr.addstr("REVERSE! ", curses.color_pair(i) | curses.A_REVERSE)
    stdscr.refresh()
    stdscr.getch()

if __name__ == '__main__':
    print "init..."
    curses.wrapper(main)

回答 27

https://raw.github.com/fabric/fabric/master/fabric/colors.py

"""
.. versionadded:: 0.9.2

Functions for wrapping strings in ANSI color codes.

Each function within this module returns the input string ``text``, wrapped
with ANSI color codes for the appropriate color.

For example, to print some text as green on supporting terminals::

    from fabric.colors import green

    print(green("This text is green!"))

Because these functions simply return modified strings, you can nest them::

    from fabric.colors import red, green

    print(red("This sentence is red, except for " + \
          green("these words, which are green") + "."))

If ``bold`` is set to ``True``, the ANSI flag for bolding will be flipped on
for that particular invocation, which usually shows up as a bold or brighter
version of the original color on most terminals.
"""


def _wrap_with(code):

    def inner(text, bold=False):
        c = code
        if bold:
            c = "1;%s" % c
        return "\033[%sm%s\033[0m" % (c, text)
    return inner

red = _wrap_with('31')
green = _wrap_with('32')
yellow = _wrap_with('33')
blue = _wrap_with('34')
magenta = _wrap_with('35')
cyan = _wrap_with('36')
white = _wrap_with('37')

https://raw.github.com/fabric/fabric/master/fabric/colors.py

"""
.. versionadded:: 0.9.2

Functions for wrapping strings in ANSI color codes.

Each function within this module returns the input string ``text``, wrapped
with ANSI color codes for the appropriate color.

For example, to print some text as green on supporting terminals::

    from fabric.colors import green

    print(green("This text is green!"))

Because these functions simply return modified strings, you can nest them::

    from fabric.colors import red, green

    print(red("This sentence is red, except for " + \
          green("these words, which are green") + "."))

If ``bold`` is set to ``True``, the ANSI flag for bolding will be flipped on
for that particular invocation, which usually shows up as a bold or brighter
version of the original color on most terminals.
"""


def _wrap_with(code):

    def inner(text, bold=False):
        c = code
        if bold:
            c = "1;%s" % c
        return "\033[%sm%s\033[0m" % (c, text)
    return inner

red = _wrap_with('31')
green = _wrap_with('32')
yellow = _wrap_with('33')
blue = _wrap_with('34')
magenta = _wrap_with('35')
cyan = _wrap_with('36')
white = _wrap_with('37')

回答 28

asciimatics为构建文本UI和动画提供了可移植的支持:

#!/usr/bin/env python
from asciimatics.effects import RandomNoise  # $ pip install asciimatics
from asciimatics.renderers import SpeechBubble, Rainbow
from asciimatics.scene import Scene
from asciimatics.screen import Screen
from asciimatics.exceptions import ResizeScreenError


def demo(screen):
    render = Rainbow(screen, SpeechBubble('Rainbow'))
    effects = [RandomNoise(screen, signal=render)]
    screen.play([Scene(effects, -1)], stop_on_resize=True)

while True:
    try:
        Screen.wrapper(demo)
        break
    except ResizeScreenError:
        pass

Asciicast:

asciimatics provides a portable support for building text UI and animations:

#!/usr/bin/env python
from asciimatics.effects import RandomNoise  # $ pip install asciimatics
from asciimatics.renderers import SpeechBubble, Rainbow
from asciimatics.scene import Scene
from asciimatics.screen import Screen
from asciimatics.exceptions import ResizeScreenError


def demo(screen):
    render = Rainbow(screen, SpeechBubble('Rainbow'))
    effects = [RandomNoise(screen, signal=render)]
    screen.play([Scene(effects, -1)], stop_on_resize=True)

while True:
    try:
        Screen.wrapper(demo)
        break
    except ResizeScreenError:
        pass

Asciicast:


回答 29

另一个包装python 3打印功能的pypi模块:

https://pypi.python.org/pypi/colorprint

如果您也可以在python 2.x中使用它from __future__ import print。这是来自模块pypi页面的python 2示例:

from __future__ import print_function
from colorprint import *

print('Hello', 'world', color='blue', end='', sep=', ')
print('!', color='red', format=['bold', 'blink'])

输出“你好,世界!” 用蓝色和感叹号标记为红色和闪烁。

Yet another pypi module that wraps the python 3 print function:

https://pypi.python.org/pypi/colorprint

It’s usable in python 2.x if you also from __future__ import print. Here is a python 2 example from the modules pypi page:

from __future__ import print_function
from colorprint import *

print('Hello', 'world', color='blue', end='', sep=', ')
print('!', color='red', format=['bold', 'blink'])

Outputs “Hello, world!” with the words in blue and the exclamation mark bold red and blinking.


UnicodeEncodeError:’ascii’编解码器无法在位置20编码字符u’\ xa0’:序数不在范围内(128)

问题:UnicodeEncodeError:’ascii’编解码器无法在位置20编码字符u’\ xa0’:序数不在范围内(128)

我在处理从不同网页(在不同站点上)获取的文本中的unicode字符时遇到问题。我正在使用BeautifulSoup。

问题是错误并非总是可重现的。它有时可以在某些页面上使用,有时它会通过抛出来发声UnicodeEncodeError。我已经尝试了几乎所有我能想到的东西,但是没有找到任何能正常工作而不抛出某种与Unicode相关的错误的东西。

导致问题的代码部分之一如下所示:

agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()

这是运行上述代码段时在某些字符串上生成的堆栈跟踪:

Traceback (most recent call last):
  File "foobar.py", line 792, in <module>
    p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

我怀疑这是因为某些页面(或更具体地说,来自某些站点的页面)可能已编码,而其他页面可能未编码。所有站点都位于英国,并提供供英国消费的数据-因此,与英语以外的其他任何形式的内部化或文字处理都没有问题。

是否有人对如何解决此问题有任何想法,以便我可以始终如一地解决此问题?

I’m having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.

The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes, it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.

One of the sections of code that is causing problems is shown below:

agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()

Here is a stack trace produced on SOME strings when the snippet above is run:

Traceback (most recent call last):
  File "foobar.py", line 792, in <module>
    p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption – so there are no issues relating to internalization or dealing with text written in anything other than English.

Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?


回答 0

您需要阅读Python Unicode HOWTO。这个错误是第一个例子

基本上,停止使用str从Unicode转换为编码的文本/字节。

相反,请正确使用.encode()编码字符串:

p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()

或完全以unicode工作。

You need to read the Python Unicode HOWTO. This error is the very first example.

Basically, stop using str to convert from unicode to encoded text / bytes.

Instead, properly use .encode() to encode the string:

p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()

or work entirely in unicode.


回答 1

这是经典的python unicode痛点!考虑以下:

a = u'bats\u00E0'
print a
 => batsà

到目前为止一切都很好,但是如果我们调用str(a),让我们看看会发生什么:

str(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

噢,蘸,那对任何人都不会有好处!要解决该错误,请使用.encode明确编码字节,并告诉python使用哪种编解码器:

a.encode('utf-8')
 => 'bats\xc3\xa0'
print a.encode('utf-8')
 => batsà

Voil \ u00E0!

问题是,当您调用str()时,python使用默认的字符编码来尝试对给定的字节进行编码,在您的情况下,有时表示为unicode字符。要解决此问题,您必须告诉python如何使用.encode(’whatever_unicode’)处理您给它的字符串。大多数时候,使用utf-8应该会很好。

有关此主题的出色论述,请参见Ned Batchelder在PyCon上的演讲:http : //nedbatchelder.com/text/unipain.html

This is a classic python unicode pain point! Consider the following:

a = u'bats\u00E0'
print a
 => batsà

All good so far, but if we call str(a), let’s see what happens:

str(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

Oh dip, that’s not gonna do anyone any good! To fix the error, encode the bytes explicitly with .encode and tell python what codec to use:

a.encode('utf-8')
 => 'bats\xc3\xa0'
print a.encode('utf-8')
 => batsà

Voil\u00E0!

The issue is that when you call str(), python uses the default character encoding to try and encode the bytes you gave it, which in your case are sometimes representations of unicode characters. To fix the problem, you have to tell python how to deal with the string you give it by using .encode(‘whatever_unicode’). Most of the time, you should be fine using utf-8.

For an excellent exposition on this topic, see Ned Batchelder’s PyCon talk here: http://nedbatchelder.com/text/unipain.html


回答 2

我发现可以通过优雅的方法删除符号并继续按以下方式将字符串保留为字符串:

yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')

重要的是要注意,使用ignore选项是危险的,因为它会悄悄地从使用它的代码中删除所有对unicode(和国际化)的支持,如下所示(转换unicode):

>>> u'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
'City: Malm'

I found elegant work around for me to remove symbols and continue to keep string as string in follows:

yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')

It’s important to notice that using the ignore option is dangerous because it silently drops any unicode(and internationalization) support from the code that uses it, as seen here (convert unicode):

>>> u'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
'City: Malm'

回答 3

好吧,我尝试了一切,但并没有帮助,在谷歌搜索之后,我发现了以下内容并有所帮助。使用python 2.7。

# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')

well i tried everything but it did not help, after googling around i figured the following and it helped. python 2.7 is in use.

# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')

回答 4

导致甚至打印失败的一个细微问题是环境变量设置错误,例如。此处LC_ALL设置为“ C”。在Debian中,他们不鼓励设置它:Locale上的Debian Wiki

$ echo $LANG
en_US.utf8
$ echo $LC_ALL 
C
$ python -c "print (u'voil\u00e0')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
$ export LC_ALL='en_US.utf8'
$ python -c "print (u'voil\u00e0')"
voilà
$ unset LC_ALL
$ python -c "print (u'voil\u00e0')"
voilà

A subtle problem causing even print to fail is having your environment variables set wrong, eg. here LC_ALL set to “C”. In Debian they discourage setting it: Debian wiki on Locale

$ echo $LANG
en_US.utf8
$ echo $LC_ALL 
C
$ python -c "print (u'voil\u00e0')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
$ export LC_ALL='en_US.utf8'
$ python -c "print (u'voil\u00e0')"
voilà
$ unset LC_ALL
$ python -c "print (u'voil\u00e0')"
voilà

回答 5

对我来说,有效的是:

BeautifulSoup(html_text,from_encoding="utf-8")

希望这对某人有帮助。

For me, what worked was:

BeautifulSoup(html_text,from_encoding="utf-8")

Hope this helps someone.


回答 6

实际上,我发现在大多数情况下,仅去除那些字符会更加简单:

s = mystring.decode('ascii', 'ignore')

I’ve actually found that in most of my cases, just stripping out those characters is much simpler:

s = mystring.decode('ascii', 'ignore')

回答 7

问题是您正在尝试打印unicode字符,但是您的终端不支持它。

您可以尝试安装language-pack-en软件包来解决此问题:

sudo apt-get install language-pack-en

它为所有支持的软件包(包括Python)提供英语翻译数据更新。如有必要,请安装其他语言包(取决于您尝试打​​印的字符)。

在某些Linux发行版中,需要确保正确设置了默认的英语语言环境(因此unicode字符可以由shell / terminal处理)。有时,与手动配置相比,它更容易安装。

然后,在编写代码时,请确保在代码中使用正确的编码。

例如:

open(foo, encoding='utf-8')

如果仍然有问题,请仔细检查系统配置,例如:

  • 您的语言环境文件(/etc/default/locale),应包含例如

    LANG="en_US.UTF-8"
    LC_ALL="en_US.UTF-8"

    要么:

    LC_ALL=C.UTF-8
    LANG=C.UTF-8
  • LANG/ LC_CTYPEin shell的值。

  • 通过以下方法检查您的shell支持的语言环境:

    locale -a | grep "UTF-8"

演示新VM中的问题和解决方案。

  1. 初始化和配置VM(例如使用vagrant):

    vagrant init ubuntu/trusty64; vagrant up; vagrant ssh

    请参阅:可用的Ubuntu盒

  2. 打印unicode字符(例如商标符号):

    $ python -c 'print(u"\u2122");'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
  3. 现在安装language-pack-en

    $ sudo apt-get -y install language-pack-en
    The following extra packages will be installed:
      language-pack-en-base
    Generating locales...
      en_GB.UTF-8... /usr/sbin/locale-gen: done
    Generation complete.
  4. 现在应该解决问题:

    $ python -c 'print(u"\u2122");'
    
  5. 否则,请尝试以下命令:

    $ LC_ALL=C.UTF-8 python -c 'print(u"\u2122");'
    

The problem is that you’re trying to print a unicode character, but your terminal doesn’t support it.

You can try installing language-pack-en package to fix that:

sudo apt-get install language-pack-en

which provides English translation data updates for all supported packages (including Python). Install different language package if necessary (depending which characters you’re trying to print).

On some Linux distributions it’s required in order to make sure that the default English locales are set-up properly (so unicode characters can be handled by shell/terminal). Sometimes it’s easier to install it, than configuring it manually.

Then when writing the code, make sure you use the right encoding in your code.

For example:

open(foo, encoding='utf-8')

If you’ve still a problem, double check your system configuration, such as:

  • Your locale file (/etc/default/locale), which should have e.g.

    LANG="en_US.UTF-8"
    LC_ALL="en_US.UTF-8"
    

    or:

    LC_ALL=C.UTF-8
    LANG=C.UTF-8
    
  • Value of LANG/LC_CTYPE in shell.

  • Check which locale your shell supports by:

    locale -a | grep "UTF-8"
    

Demonstrating the problem and solution in fresh VM.

  1. Initialize and provision the VM (e.g. using vagrant):

    vagrant init ubuntu/trusty64; vagrant up; vagrant ssh
    

    See: available Ubuntu boxes..

  2. Printing unicode characters (such as trade mark sign like ):

    $ python -c 'print(u"\u2122");'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
    
  3. Now installing language-pack-en:

    $ sudo apt-get -y install language-pack-en
    The following extra packages will be installed:
      language-pack-en-base
    Generating locales...
      en_GB.UTF-8... /usr/sbin/locale-gen: done
    Generation complete.
    
  4. Now problem should be solved:

    $ python -c 'print(u"\u2122");'
    ™
    
  5. Otherwise, try the following command:

    $ LC_ALL=C.UTF-8 python -c 'print(u"\u2122");'
    ™
    

回答 8

在外壳中:

  1. 通过以下命令查找支持的UTF-8语言环境:

    locale -a | grep "UTF-8"
  2. 在运行脚本之前将其导出,例如:

    export LC_ALL=$(locale -a | grep UTF-8)

    或手动像:

    export LC_ALL=C.UTF-8
  3. 通过打印特殊字符进行测试,例如

    python -c 'print(u"\u2122");'

以上在Ubuntu中测试过。

In shell:

  1. Find supported UTF-8 locale by the following command:

    locale -a | grep "UTF-8"
    
  2. Export it, before running the script, e.g.:

    export LC_ALL=$(locale -a | grep UTF-8)
    

    or manually like:

    export LC_ALL=C.UTF-8
    
  3. Test it by printing special character, e.g. :

    python -c 'print(u"\u2122");'
    

Above tested in Ubuntu.


回答 9

在脚本开头的下面添加一行(或作为第二行):

# -*- coding: utf-8 -*-

那就是python源代码编码的定义。PEP 263中的更多信息。

Add line below at the beginning of your script ( or as second line):

# -*- coding: utf-8 -*-

That’s definition of python source code encoding. More info in PEP 263.


回答 10

这是其他一些所谓的“警惕”答案的重新表述。在某些情况下,尽管有人在这里提出抗议,但简单地扔掉麻烦的字符/字符串是一个很好的解决方案。

def safeStr(obj):
    try: return str(obj)
    except UnicodeEncodeError:
        return obj.encode('ascii', 'ignore').decode('ascii')
    except: return ""

测试它:

if __name__ == '__main__': 
    print safeStr( 1 ) 
    print safeStr( "test" ) 
    print u'98\xb0'
    print safeStr( u'98\xb0' )

结果:

1
test
98°
98

建议:您可能想将此函数命名为toAscii?这是优先事项。

这是为Python 2编写的。 对于Python 3,我相信您将要使用bytes(obj,"ascii")而不是str(obj)。我尚未对此进行测试,但是我会在某个时候修改答案。

Here’s a rehashing of some other so-called “cop out” answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.

def safeStr(obj):
    try: return str(obj)
    except UnicodeEncodeError:
        return obj.encode('ascii', 'ignore').decode('ascii')
    except: return ""

Testing it:

if __name__ == '__main__': 
    print safeStr( 1 ) 
    print safeStr( "test" ) 
    print u'98\xb0'
    print safeStr( u'98\xb0' )

Results:

1
test
98°
98

Suggestion: you might want to name this function to toAscii instead? That’s a matter of preference.

This was written for Python 2. For Python 3, I believe you’ll want to use bytes(obj,"ascii") rather than str(obj). I didn’t test this yet, but I will at some point and revise the answer.


回答 11

我总是将代码放在python文件的前两行中:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

I always put the code below in the first two lines of the python files:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

回答 12

这里可以找到简单的帮助程序功能。

def safe_unicode(obj, *args):
    """ return the unicode representation of obj """
    try:
        return unicode(obj, *args)
    except UnicodeDecodeError:
        # obj is byte string
        ascii_text = str(obj).encode('string_escape')
        return unicode(ascii_text)

def safe_str(obj):
    """ return the byte string representation of obj """
    try:
        return str(obj)
    except UnicodeEncodeError:
        # obj is unicode
        return unicode(obj).encode('unicode_escape')

Simple helper functions found here.

def safe_unicode(obj, *args):
    """ return the unicode representation of obj """
    try:
        return unicode(obj, *args)
    except UnicodeDecodeError:
        # obj is byte string
        ascii_text = str(obj).encode('string_escape')
        return unicode(ascii_text)

def safe_str(obj):
    """ return the byte string representation of obj """
    try:
        return str(obj)
    except UnicodeEncodeError:
        # obj is unicode
        return unicode(obj).encode('unicode_escape')

回答 13

只需添加到变量encode(’utf-8’)

agent_contact.encode('utf-8')

Just add to a variable encode(‘utf-8’)

agent_contact.encode('utf-8')

回答 14

请打开终端并执行以下命令:

export LC_ALL="en_US.UTF-8"

Please open terminal and fire the below command:

export LC_ALL="en_US.UTF-8"

回答 15

我只使用了以下内容:

import unicodedata
message = unicodedata.normalize("NFKD", message)

查看有关它的文档说明:

unicodedata.normalize(form,unistr)返回Unicode字符串unistr的普通形式form。格式的有效值为“ NFC”,“ NFKC”,“ NFD”和“ NFKD”。

Unicode标准基于规范对等和兼容性对等的定义,定义了Unicode字符串的各种规范化形式。在Unicode中,可以用各种方式表示几个字符。例如,字符U + 00C7(带有CEDILLA的拉丁文大写字母C)也可以表示为序列U + 0043(拉丁文的大写字母C)U + 0327(合并CEDILLA)。

对于每个字符,有两种规范形式:规范形式C和规范形式D。规范形式D(NFD)也称为规范分解,将每个字符转换为其分解形式。范式C(NFC)首先应用规范分解,然后再次组成预组合字符。

除了这两种形式,还有基于兼容性对等的两种其他常规形式。在Unicode中,支持某些字符,这些字符通常会与其他字符统一。例如,U + 2160(罗马数字ONE)与U + 0049(拉丁大写字母I)实际上是同一回事。但是,Unicode支持它与现有字符集(例如gb2312)兼容。

普通形式的KD(NFKD)将应用兼容性分解,即用所有等效字符替换它们的等效字符。范式KC(NFKC)首先应用兼容性分解,然后进行规范组合。

即使将两个unicode字符串归一化并在人类读者看来是相同的,但是如果一个字符串包含组合字符而另一个字符串没有组合,则它们可能不相等。

为我解决。简单容易。

I just used the following:

import unicodedata
message = unicodedata.normalize("NFKD", message)

Check what documentation says about it:

unicodedata.normalize(form, unistr) Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.

The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various way. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

For each character, there are two normal forms: normal form C and normal form D. Normal form D (NFD) is also known as canonical decomposition, and translates each character into its decomposed form. Normal form C (NFC) first applies a canonical decomposition, then composes pre-combined characters again.

In addition to these two forms, there are two additional normal forms based on compatibility equivalence. In Unicode, certain characters are supported which normally would be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets (e.g. gb2312).

The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents. The normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.

Even if two unicode strings are normalized and look the same to a human reader, if one has combining characters and the other doesn’t, they may not compare equal.

Solves it for me. Simple and easy.


回答 16

下面的解决方案为我工作,刚刚添加

u“字符串”

(将字符串表示为unicode)在我的字符串之前。

result_html = result.to_html(col_space=1, index=False, justify={'right'})

text = u"""
<html>
<body>
<p>
Hello all, <br>
<br>
Here's weekly summary report.  Let me know if you have any questions. <br>
<br>
Data Summary <br>
<br>
<br>
{0}
</p>
<p>Thanks,</p>
<p>Data Team</p>
</body></html>
""".format(result_html)

Below solution worked for me, Just added

u “String”

(representing the string as unicode) before my string.

result_html = result.to_html(col_space=1, index=False, justify={'right'})

text = u"""
<html>
<body>
<p>
Hello all, <br>
<br>
Here's weekly summary report.  Let me know if you have any questions. <br>
<br>
Data Summary <br>
<br>
<br>
{0}
</p>
<p>Thanks,</p>
<p>Data Team</p>
</body></html>
""".format(result_html)

回答 17

this这至少在Python 3中有效…

Python 3

有时错误在于环境变量中,因此

import os
import locale
os.environ["PYTHONIOENCODING"] = "utf-8"
myLocale=locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")
... 
print(myText.encode('utf-8', errors='ignore'))

在编码中忽略错误的地方。

Alas this works in Python 3 at least…

Python 3

Sometimes the error is in the enviroment variables and enconding so

import os
import locale
os.environ["PYTHONIOENCODING"] = "utf-8"
myLocale=locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")
... 
print(myText.encode('utf-8', errors='ignore'))

where errors are ignored in encoding.


回答 18

我只是遇到了这个问题,而Google带领我来到这里,因此,为了在这里添加一般的解决方案,这对我有用:

# 'value' contains the problematic data
unic = u''
unic += value
value = unic

阅读内德的演讲后,我有了这个主意。

不过,我并没有声称完全理解为什么这样做。因此,如果任何人都可以编辑此答案或发表评论以进行解释,我将不胜感激。

I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:

# 'value' contains the problematic data
unic = u''
unic += value
value = unic

I had this idea after reading Ned’s presentation.

I don’t claim to fully understand why this works, though. So if anyone can edit this answer or put in a comment to explain, I’ll appreciate it.


回答 19

manage.py migrate在带有本地化夹具的Django中运行时,我们遇到了此错误。

我们的资料包含# -*- coding: utf-8 -*-声明,MySQL已为utf8正确配置,而Ubuntu在中具有适当的语言包和值/etc/default/locale

问题只是因为Django容器(我们使用docker)缺少 LANG env var。

在重新运行迁移之前设置LANGen_US.UTF-8并重新启动容器可以解决此问题。

We struck this error when running manage.py migrate in Django with localized fixtures.

Our source contained the # -*- coding: utf-8 -*- declaration, MySQL was correctly configured for utf8 and Ubuntu had the appropriate language pack and values in /etc/default/locale.

The issue was simply that the Django container (we use docker) was missing the LANG env var.

Setting LANG to en_US.UTF-8 and restarting the container before re-running migrations fixed the problem.


回答 20

这里的许多答案(例如,@ agf和@Andbdrew)已经解决了OP问题的最直接方面。

但是,我认为有一个微妙但重要的方面已被很大程度上忽略,这对于像我这样在尝试理解Python编码时最终落到这里的每个人都非常重要:Python 2 vs Python 3字符表示的管理截然不同。我觉得很多困惑与人们在不了解版本的情况下阅读Python编码有关。

我建议有兴趣了解OP问题根本原因的人首先阅读Spolsky对字符表示法和Unicode 介绍,然后转向Python 2和Python 3中的Unicode Batchelder

Many answers here (@agf and @Andbdrew for example) have already addressed the most immediate aspects of the OP question.

However, I think there is one subtle but important aspect that has been largely ignored and that matters dearly for everyone who like me ended up here while trying to make sense of encodings in Python: Python 2 vs Python 3 management of character representation is wildly different. I feel like a big chunk of confusion out there has to do with people reading about encodings in Python without being version aware.

I suggest anyone interested in understanding the root cause of OP problem to begin by reading Spolsky’s introduction to character representations and Unicode and then move to Batchelder on Unicode in Python 2 and Python 3.


回答 21

尽量避免将变量转换为str(variable)。有时,这可能会导致问题。

避免的简单提示:

try: 
    data=str(data)
except:
    data = data #Don't convert to String

上面的示例还将解决Encode错误。

Try to avoid conversion of variable to str(variable). Sometimes, It may cause the issue.

Simple tip to avoid :

try: 
    data=str(data)
except:
    data = data #Don't convert to String

The above example will solve Encode error also.


回答 22

如果您有类似的packet_data = "This is data"操作,请在初始化后立即在下一行执行此操作packet_data

unic = u''
packet_data = unic

If you have something like packet_data = "This is data" then do this on the next line, right after initializing packet_data:

unic = u''
packet_data = unic

回答 23

python 3.0及更高版本的更新。在python编辑器中尝试以下操作:

locale-gen en_US.UTF-8
export LANG=en_US.UTF-8 LANGUAGE=en_US.en
LC_ALL=en_US.UTF-8

这会将系统的默认语言环境编码设置为UTF-8格式。

有关更多信息,请参见PEP 538-将传统C语言环境强制为基于UTF-8的语言环境

Update for python 3.0 and later. Try the following in the python editor:

locale-gen en_US.UTF-8
export LANG=en_US.UTF-8 LANGUAGE=en_US.en
LC_ALL=en_US.UTF-8

This sets the system`s default locale encoding to the UTF-8 format.

More can be read here at PEP 538 — Coercing the legacy C locale to a UTF-8 based locale.


回答 24

我遇到了尝试将Unicode字符输出到stdout,但使用sys.stdout.write而不是print的问题(这样我也可以支持将输出输出到其他文件)。

从BeautifulSoup自己的文档中,我使用编解码器库解决了此问题:

import sys
import codecs

def main(fIn, fOut):
    soup = BeautifulSoup(fIn)
    # Do processing, with data including non-ASCII characters
    fOut.write(unicode(soup))

if __name__ == '__main__':
    with (sys.stdin) as fIn: # Don't think we need codecs.getreader here
        with codecs.getwriter('utf-8')(sys.stdout) as fOut:
            main(fIn, fOut)

I had this issue trying to output Unicode characters to stdout, but with sys.stdout.write, rather than print (so that I could support output to a different file as well).

From BeautifulSoup’s own documentation, I solved this with the codecs library:

import sys
import codecs

def main(fIn, fOut):
    soup = BeautifulSoup(fIn)
    # Do processing, with data including non-ASCII characters
    fOut.write(unicode(soup))

if __name__ == '__main__':
    with (sys.stdin) as fIn: # Don't think we need codecs.getreader here
        with codecs.getwriter('utf-8')(sys.stdout) as fOut:
            main(fIn, fOut)

回答 25

当使用Apache部署django项目时,经常会发生此问题。因为Apache在/ etc / sysconfig / httpd中设置环境变量LANG = C。只需打开文件并注释(或更改为您的样式)此设置即可。或使用WSGIDaemonProcess命令的lang选项,在这种情况下,您将能够为不同的虚拟主机设置不同的LANG环境变量。

This problem often happens when a django project deploys using Apache. Because Apache sets environment variable LANG=C in /etc/sysconfig/httpd. Just open the file and comment (or change to your flavior) this setting. Or use the lang option of the WSGIDaemonProcess command, in this case you will be able to set different LANG environment variable to different virtualhosts.


回答 26

推荐的解决方案对我不起作用,我可以忍受所有非ascii字符的转储,因此

s = s.encode('ascii',errors='ignore')

这给我留下了不会抛出错误的东西。

The recommended solution did not work for me, and I could live with dumping all non ascii characters, so

s = s.encode('ascii',errors='ignore')

which left me with something stripped that doesn’t throw errors.


回答 27

这将起作用:

 >>>print(unicodedata.normalize('NFD', re.sub("[\(\[].*?[\)\]]", "", "bats\xc3\xa0")).encode('ascii', 'ignore'))

输出:

>>>bats

This will work:

 >>>print(unicodedata.normalize('NFD', re.sub("[\(\[].*?[\)\]]", "", "bats\xc3\xa0")).encode('ascii', 'ignore'))

Output:

>>>bats