Traceback (most recent call last):
File "python_md5_cracker.py", line 27, in <module>
m.update(line)
TypeError: Unicode-objects must be encoded before hashing
I get the above error when I try to execute this code in Python 3.2.2:
import hashlib, sys
m = hashlib.md5()
hash = ""
hash_file = input("What is the file name in which the hash resides? ")
wordlist = input("What is your wordlist? (Enter the file name) ")
try:
    hashdocument = open(hash_file, "r")
except IOError:
    print("Invalid file.")
    raw_input()
    sys.exit()
else:
    hash = hashdocument.readline()
    hash = hash.replace("\n", "")

try:
    wordlistfile = open(wordlist, "r")
except IOError:
    print("Invalid file.")
    raw_input()
    sys.exit()
else:
    pass

for line in wordlistfile:
    # Flush the buffer (this caused a massive problem when placed
    # at the beginning of the script, because the buffer kept getting
    # overwritten, thus comparing incorrect hashes)
    m = hashlib.md5()
    line = line.replace("\n", "")
    m.update(line)
    word_hash = m.hexdigest()
    if word_hash == hash:
        print("Collision! The word corresponding to the given hash is", line)
        input()
        sys.exit()

print("The hash given does not correspond to any supplied word in the wordlist.")
input()
sys.exit()
Now, the error message is clear: you can only use bytes, not Python strings (what used to be unicode in Python < 3), so you have to encode the strings with your preferred encoding: utf-32, utf-16, utf-8 or even one of the restricted 8-bit encodings (what some might call codepages).
The bytes in your wordlist file are being automatically decoded to Unicode by Python 3 as you read from the file. I suggest you do:
m.update(line.encode(wordlistfile.encoding))
so that the encoded data pushed to the md5 algorithm are encoded exactly like the underlying file.
import hashlib

with open(hash_file) as file:
    control_hash = file.readline().rstrip("\n")

wordlistfile = open(wordlist, "rb")
# ...
for line in wordlistfile:
    if hashlib.md5(line.rstrip(b'\n\r')).hexdigest() == control_hash:
        # collision
This program is a bug-free and enhanced version of the above MD5 cracker: it reads a file containing a list of hashed passwords and checks each one against hashed words from an English dictionary word list. Hope it is helpful.
# md5cracker.py
# English Dictionary https://github.com/dwyl/english-words

import hashlib, sys

hash_file = r'exercise\hashed.txt'
wordlist = r'data_sets\english_dictionary\words.txt'

try:
    hashdocument = open(hash_file, 'r')
except IOError:
    print('Invalid file.')
    sys.exit()
else:
    count = 0
    for hash in hashdocument:
        hash = hash.rstrip('\n')
        print(hash)
        i = 0
        with open(wordlist, 'r') as wordlistfile:
            for word in wordlistfile:
                m = hashlib.md5()
                word = word.rstrip('\n')
                m.update(word.encode('utf-8'))
                word_hash = m.hexdigest()
                if word_hash == hash:
                    print('The word, hash combination is ' + word + ',' + hash)
                    count += 1
                    break
                i += 1
        print('Iteration is ' + str(i))
    if count == 0:
        print('The hash given does not correspond to any supplied word in the wordlist.')
    else:
        print('Total passwords identified is: ' + str(count))
sys.exit()
What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I’m missing. What does one type into text files to get proper conversions?
What I’m truly failing to grok here, is what the point of the UTF-8 representation is, if you can’t actually get Python to recognize it, when it comes from outside. Maybe I should just JSON dump the string, and use that instead, since that has an asciiable representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode, when coming in from a file? If so, how do I get it?
The "\xe1" represents just one byte. "\x" tells you that "e1" is in hexadecimal.
When you write
Capit\xc3\xa1n
into your file you have “\xc3” in it. Those are 4 bytes and in your code you read them all. You can see this when you display them:
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
You can see that the backslash is escaped by a backslash. So you have four bytes in your string: “\”, “x”, “c” and “3”.
Edit:
As others pointed out in their answers you should just enter the characters in the editor and your editor should then handle the conversion to UTF-8 and save it.
If you actually have a string in this format you can use the string_escape codec to decode it into a normal string:
In [15]: print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán
The result is a string that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. If you want to have a unicode string you have to decode again with UTF-8.
To your edit: you don’t have UTF-8 in your file. To actually see how it would look like:
s = u'Capit\xe1n\n'
sutf8 = s.encode('UTF-8')
open('utf-8.out', 'w').write(sutf8)
Compare the content of the file utf-8.out to the content of the file you saved with your editor.
Rather than mess with the encode and decode methods I find it easier to specify the encoding when opening the file. The io module (added in Python 2.6) provides an io.open function, which has an encoding parameter.
Then after calling f's read() function, a decoded Unicode object is returned:
>>> import io
>>> f = io.open('f1', mode='r', encoding='utf-8')  # file name illustrative
>>> f.read()
u'Capit\xe1l\n\n'
Note that in Python 3, the io.open function is an alias for the built-in open function. The built-in open function only supports the encoding argument in Python 3, not Python 2.
Encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as UTF-8. (UTF-8 is also Python 3's default source encoding; note, however, that the default encoding for open itself still depends on the platform locale.)
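A minimal Python 3 sketch (the file names here are illustrative):
with open('f1', encoding='utf-8') as f:
    text = f.read()  # 'text' is a str, already decoded from the file's UTF-8 bytes
with open('f1-copy', 'w', encoding='utf-8') as f:
    f.write(text)  # encoded back to UTF-8 bytes on disk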
There are some unusual codecs that are useful here. This particular reading allows one to take UTF-8 representations from within Python, copy them into an ASCII file, and have them be read back in as Unicode. Under the "string-escape" decode, the slashes won't be doubled.
This allows for the sort of round trip that I was imagining.
Answer 4
# -*- encoding: utf-8 -*-
# converting a file of unknown encoding to utf-8

import codecs
import commands

file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)

file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location + "b", 'w', 'utf-8')

for l in file_stream:
    file_output.write(l)

file_stream.close()
file_output.close()
You have stumbled over the general problem with encodings: How can I tell in which encoding a file is?
Answer: You can’t unless the file format provides for this. XML, for example, begins with:
<?xml encoding="utf-8"?>
This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path,mode,encoding) which provides the missing bit in Python.
As for your editor, you must check if it offers some way to set the encoding of a file.
The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that’s the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don’t have suitable tools to attach the encoding information to files on the hard disk.
The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that your console can only display ASCII. In order to display Unicode or anything >= charcode 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the umlaut and save the file).
That said, you can use the Python function eval() to turn an escaped string into a string:
>>> x = eval("'Capit\\xc3\\xa1n\\n'")
>>> x
'Capit\xc3\xa1n\n'
>>> x[5]
'\xc3'
>>> len(x[5])
1
As you can see, the string “\xc3” has been turned into a single character. This is now an 8-bit string, UTF-8 encoded. To get Unicode:
>>> x.decode('utf-8')
u'Capit\xe1n\n'
Gregg Lind asked: I think there are some pieces missing here: the file f2 contains: hex:
codecs.open('f2', 'rb', 'utf-8'), for example, reads them all in as separate chars (expected). Is there any way to write to a file in ASCII that would work?
Answer: That depends on what you mean. ASCII can’t represent characters > 127. So you need some way to say “the next few characters mean something special” which is what the sequence “\x” does. It says: The next two characters are the code of a single character. “\u” does the same using four characters to encode Unicode up to 0xFFFF (65535).
So you can’t directly write Unicode to ASCII (because ASCII simply doesn’t contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case, you need an 8-bit safe stream.
Your solution using decode('string-escape') does work, but you must be aware how much memory you use: Three times the amount of using codecs.open().
Remember that a file is just a sequence of bytes with 8 bits. Neither the bits nor the bytes have a meaning. It's you who says "65 means 'A'". Since \xc3\xa1 should become "á", but the computer has no way to know, you must tell it by specifying the encoding which was used when writing the file.
Apart from codecs.open(), one can use io.open() in both Python 2 and Python 3 to read and write Unicode files.
example
import io
text = u'á'
encoding = 'utf8'
with io.open('data.txt', 'w', encoding=encoding, newline='\n') as fout:
    fout.write(text)
with io.open('data.txt', 'r', encoding=encoding, newline='\n') as fin:
    text2 = fin.read()
assert text == text2
Well, your favorite text editor does not realize that \xc3\xa1 are supposed to be character literals, but it interprets them as text. That’s why you get the double backslashes in the last line — it’s now a real backslash + xc3, etc. in your file.
If you want to read and write encoded files in Python, best use the codecs module.
Pasting text between the terminal and applications is difficult, because you don’t know which program will interpret your text using which encoding. You could try the following:
>>> s = file("f1").read()
>>> print unicode(s, "Latin-1")
Capitán
Then paste this string into your editor and make sure that it stores it using Latin-1. Under the assumption that the clipboard does not garble the string, the round trip should work.
The \x.. sequence is something that’s specific to Python. It’s not a universal byte escape sequence.
How you actually enter in UTF-8-encoded non-ASCII depends on your OS and/or your editor. Here’s how you do it in Windows. For OS X to enter a with an acute accent you can just hit option + E, then A, and almost all text editors in OS X support UTF-8.
You can also improve the original open() function to work with Unicode files by replacing it in place, using the partial function. The beauty of this solution is you don’t need to change any old code. It’s transparent.
import codecs
import functools
open = functools.partial(codecs.open, encoding='utf-8')
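After this override, existing calls keep working but decode as UTF-8. A hedged usage sketch (data.txt is a hypothetical file):
with open('data.txt') as f:  # actually calls codecs.open(..., encoding='utf-8')
    print(f.read())  # returns a unicode object in Python 2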
Traceback (most recent call last):
File "ical.py", line 92, in parse
print "{}".format(e[attr])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 7: ordinal not in range(128)
Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ascii text.
Example:
accented_string = u'Málaga'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga' and is of type 'str'
Answer 1
How about this:
import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
This works on Greek letters, too:
>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>
The character category “Mn” stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark’s answer (I didn’t think of unicodedata.combining, but it is probably the better solution, because it’s more explicit).
And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not “decoration”.
It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as being diacritics.
Edit: this does the trick:
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])
unicodedata.combining(c) will return true if the character c can be combined with the preceding character, that is mainly if it’s a diacritic.
Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:
encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"café" # or simply "café" before python 3.
unicode_string = byte_string.decode(encoding)
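Putting it together, a quick usage sketch (assuming the remove_accents definition above):
>>> remove_accents(u'Málaga')
u'Malaga'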
import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError):  # unicode is a default on python 3
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text
This handles not only accents, but also “strokes” (as in ø etc.):
import unicodedata as ud

def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(char)
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
        try:
            char = ud.lookup(desc)
        except KeyError:
            pass  # removing "WITH ..." produced an invalid name
    return char
This is the most elegant way I can think of (and it has been mentioned by alexis in a comment on this page), although I don't think it is very elegant indeed.
In fact, it's more of a hack, as pointed out in comments, since Unicode names are really just names; they give no guarantee to be consistent or anything.
There are still special letters that are not handled by this, such as turned and inverted letters, since their Unicode name does not contain 'WITH'. It depends on what you want to do anyway. I sometimes needed accent stripping for achieving dictionary sort order.
EDIT NOTE:
Incorporated suggestions from the comments (handling lookup errors, Python-3 code).
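A short usage sketch (assuming the rmdiacritics definition above):
>>> rmdiacritics('é')
'e'
>>> rmdiacritics('ø')  # the stroke is handled too, unlike NFKD-based stripping
'o'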
I was trying to read in a csv file that was half-French (containing accents) and also some strings which would eventually become integers and floats.
As a test, I created a test.txt file that looked like this:
Montréal, über, 12.89, Mère, Françoise, noël, 889
I had to include lines 2 and 3 to get it to work (which I found in a python ticket), as well as incorporate @Jabba’s comment:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

import csv
import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)
The result:
Montreal
uber
12.89
Mere
Francoise
noel
889
(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)
Some languages have combining diacritics as language letters and accent diacritics to specify accent.
I think it is safer to specify explicitly which diacritics you want to strip:
def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))
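A usage sketch (assuming import unicodedata):
>>> strip_accents(u'Málaga, señor')
u'Malaga, senor'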
How do you convert a Unicode string (containing extra characters like £ $, etc.) into a Python string?
Answer 0
title = u"Klüft skräms inför på fédéral électoral große"
import unicodedata
unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'
If you have a Unicode string, and you want to write this to a file, or other serialised form, you must first encode it into a particular representation that can be stored. There are several common Unicode encodings, such as UTF-16 (uses two bytes for most Unicode characters) or UTF-8 (1-4 bytes / codepoint depending on the character), etc. To convert that string into a particular encoding, you can use:
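# A sketch: my_unicode_string is the name used in the file example below
my_unicode_string.encode('utf-8')  # returns the raw bytes in the chosen encoding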
This raw string of bytes can be written to a file. However, note that when reading it back, you must know what encoding it is in and decode it using that same encoding.
When writing to files, you can get rid of this manual encode/decode process by using the codecs module. So, to open a file that encodes all Unicode strings into UTF-8, use:
import codecs
f = codecs.open('path/to/file.txt','w','utf8')
f.write(my_unicode_string) # Stored on disk as UTF-8
Do note that anything else that is using these files must understand what encoding the file is in if they want to read them. If you are the only one doing the reading/writing this isn’t a problem, otherwise make sure that you write in a form understandable by whatever else uses the files.
In Python 3, this form of file access is the default, and the built-in open function will take an encoding parameter and always translate to/from Unicode strings (the default string object in Python 3) for files opened in text mode.
Answer 4
Here is an example:
>>> u = u'€€€'
>>> s = u.encode('utf8')
>>> s
'\xe2\x82\xac\xe2\x82\xac\xe2\x82\xac'
Well, if you’re willing/ready to switch to Python 3 (which you may not be due to the backwards incompatibility with some Python 2 code), you don’t have to do any converting; all text in Python 3 is represented with Unicode strings, which also means that there’s no more usage of the u'<text>' syntax. You also have what are, in effect, strings of bytes, which are used to represent data (which may be an encoded string).
No answer worked for my case, where I had a string variable containing unicode chars, and none of the encode/decode recipes explained here did the work.
If I do in a Terminal
echo "no me llama mucho la atenci\u00f3n"
or
python3
>>> print("no me llama mucho la atenci\u00f3n")
The output is correct:
output: no me llama mucho la atención
But working with scripts loading this string variable didn’t work.
This is what worked on my case, in case helps anybody:
string_to_convert = "no me llama mucho la atenci\u00f3n"
print(json.dumps(json.loads(r'"%s"' % string_to_convert), ensure_ascii=False))
output: no me llama mucho la atención
If you are writing to a file, just use json.dump() and leave it to the file object to encode:
with open('filename', 'w', encoding='utf8') as json_file:
    json.dump("ברי צקלה", json_file, ensure_ascii=False)
Caveats for Python 2
For Python 2, there are some more caveats to take into account. If you are writing this to a file, you can use io.open() instead of open() to produce a file object that encodes Unicode values for you as you write, then use json.dump() instead to write to that file:
with io.open('filename', 'w', encoding='utf8') as json_file:
    json.dump(u"ברי צקלה", json_file, ensure_ascii=False)
Do note that there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 then is:
with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(u"ברי צקלה", ensure_ascii=False)
    # unicode(data) auto-decodes data to unicode if str
    json_file.write(unicode(data))
In Python 2, when using byte strings (type str), encoded to UTF-8, make sure to also set the encoding keyword:
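# A minimal sketch; obj stands for your data containing UTF-8 byte strings
json.dumps(obj, ensure_ascii=False, encoding='utf8')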
Peters’ python 2 workaround fails on an edge case:
d = {u'keyword': u'bad credit \xe7redit cards'}
with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(d, ensure_ascii=False).decode('utf8')
    try:
        json_file.write(data)
    except TypeError:
        # Decode data to Unicode first
        json_file.write(data.decode('utf8'))

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 25: ordinal not in range(128)
It was crashing on the .decode('utf8') part of line 3. I fixed the problem by making the program much simpler, avoiding that step as well as the special-casing of ascii:
with io.open('filename', 'w', encoding='utf8') as json_file:
    data = json.dumps(d, ensure_ascii=False, encoding='utf8')
    json_file.write(unicode(data))
cat filename
{"keyword": "bad credit çredit cards"}
Answer 4
As of Python 3.7, the following code works fine:

from json import dumps

result = {"symbol": "ƒ"}
json_string = dumps(result, sort_keys=True, indent=2, ensure_ascii=False)
print(json_string)
Using ensure_ascii=False in json.dumps is the right direction to solve this problem, as pointed out by Martijn. However, this may raise an exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 1: ordinal not in range(128)
You need extra settings in either site.py or sitecustomize.py to make your sys.getdefaultencoding() correct. site.py is under lib/python2.7/ and sitecustomize.py is under lib/python2.7/site-packages.
If you want to use site.py, under def setencoding(): change the first if 0: to if 1: so that Python will use your operating system's locale.
If you prefer to use sitecustomize.py (which may not exist if you haven't created it), simply put these lines in it:
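# Presumably the standard Python 2 default-encoding override;
# sys.setdefaultencoding is still available while sitecustomize.py runs.
import sys
sys.setdefaultencoding('utf-8')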
File"C:\Importer\src\dfman\importer.py", line 26,in import_chr
data = pd.read_csv(filepath, names=fields)File"C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400,in parser_f
return _read(filepath_or_buffer, kwds)File"C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205,in _read
return parser.read()File"C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608,in read
ret = self._engine.read(nrows)File"C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028,in read
data = self._reader.read(nrows)File"parser.pyx", line 706,in pandas.parser.TextReader.read (pandas\parser.c:6745)File"parser.pyx", line 728,in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)File"parser.pyx", line 804,in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)File"parser.pyx", line 890,in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)File"parser.pyx", line 950,in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)File"parser.pyx", line 1026,in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)File"parser.pyx", line 1046,in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)File"parser.pyx", line 1278,in pandas.parser._string_box_utf8 (pandas\parser.c:15657)UnicodeDecodeError:'utf-8' codec can't decode byte 0xda in position 6: invalid continuation byte
I’m running a program which is processing 30,000 similar files. A random number of them are stopping and producing this error…
File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
data = pd.read_csv(filepath, names=fields)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
return parser.read()
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
ret = self._engine.read(nrows)
File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
data = self._reader.read(nrows)
File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid continuation byte
The source/creation of these files all come from the same place. What’s the best way to correct this to proceed with the import?
read_csv takes an encoding option to deal with files in different formats. I mostly use read_csv('file', encoding = "ISO-8859-1"), or alternatively encoding = "utf-8" for reading, and generally utf-8 for to_csv.
You can also use one of several alias options like 'latin' instead of 'ISO-8859-1' (see python docs, also for numerous other encodings you may encounter).
To detect the encoding (assuming the file contains non-ascii characters), you can use enca (see man page) or file -i (linux) or file -I (osx) (see man page).
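If you would rather detect it from Python, the third-party chardet package (an assumption; it is not mentioned above) gives a best guess:
import chardet
with open('file_name.csv', 'rb') as f:
    result = chardet.detect(f.read())
print(result['encoding'])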
Answer 1
Simplest of all solutions:
import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')
Alternative solution:
Open the csv file in the Sublime text editor.
Save the file in utf-8 format.
In Sublime, click File -> Save with Encoding -> UTF-8
Then you can read the file as usual:
import pandas as pd
data = pd.read_csv('file_name.csv', encoding='utf-8')
Pandas allows you to specify the encoding, but it does not allow you to ignore errors or automatically replace the offending bytes. So there is no one-size-fits-all method, but different approaches depending on the actual use case.
You know the encoding, and there is no encoding error in the file.
Great: you just have to specify the encoding:
file_encoding = 'cp1252' # set file_encoding to the file encoding (utf8, latin1, etc.)
pd.read_csv(input_file_and_path, ..., encoding=file_encoding)
You do not want to be bothered with encoding questions, and only want that damn file to load, no matter if some text fields contain garbage. Then you only have to use Latin1 encoding, because it accepts any possible byte as input (and converts it to the unicode character with the same code point):
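pd.read_csv(input_file_and_path, ..., encoding='latin1')  # sketch; mirrors the call shown above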
You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real-world example is a UTF-8 file that has been edited with a non-UTF-8 editor and which contains some lines in a different encoding. Pandas has no provision for special error processing, but Python's open function has (assuming Python 3), and read_csv accepts a file-like object. Typical values for the errors parameter here are 'ignore', which just suppresses the offending bytes, or (IMHO better) 'backslashreplace', which replaces the offending bytes with their Python backslashed escape sequence:
file_encoding = 'utf8' # set file_encoding to the file encoding (utf8, latin1, etc.)
input_fd = open(input_file_and_path, encoding=file_encoding, errors = 'backslashreplace')
pd.read_csv(input_fd, ...)
Answer 3

with open('filename.csv') as f:
    print(f)

After executing this code, you will find the encoding of 'filename.csv'; then run the following:

data = pd.read_csv('filename.csv', encoding="<the encoding you found earlier>")
I am posting an answer to provide an updated solution and explanation as to why this problem can occur. Say you are getting this data from a database or Excel workbook. If you have special characters like La Cañada Flintridge city, well unless you are exporting the data using UTF-8 encoding, you’re going to introduce errors. La Cañada Flintridge city will become La Ca\xf1ada Flintridge city. If you are using pandas.read_csv without any adjustments to the default parameters, you’ll hit the following error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 5: invalid continuation byte
Fortunately, there are a few solutions.
Option 1, fix the exporting. Be sure to use UTF-8 encoding.
Option 2: if fixing the exporting problem is not available to you, and you need to use pandas.read_csv, be sure to include the parameter engine='python'. By default, pandas uses engine='C', which is great for reading large clean files, but will crash if anything unexpected comes up. In my experience, setting encoding='utf-8' has never fixed this UnicodeDecodeError. Also, you do not need to use error_bad_lines; however, that is still an option if you REALLY need it.
pd.read_csv(<your file>, engine='python')
Option 3 is personally my preferred solution: read the file using vanilla Python.
import pandas as pd

data = []
with open(<your file>, "rb") as myfile:
    # read the header separately
    # decode it as 'utf-8', remove any special characters, and split it on the comma (or delimiter)
    header = myfile.readline().decode('utf-8').replace('\r\n', '').split(',')
    # read the rest of the data
    for line in myfile:
        row = line.decode('utf-8', errors='ignore').replace('\r\n', '').split(',')
        data.append(row)

# save the data as a dataframe
df = pd.DataFrame(data=data, columns=header)
Hope this helps people encountering this issue for the first time.
Struggled with this a while and thought I’d post on this question as it’s the first search result. Adding the encoding="iso-8859-1" tag to pandas read_csv didn’t work, nor did any other encoding, kept giving a UnicodeDecodeError.
If you’re passing a file handle to pd.read_csv(), you need to put the encoding attribute on the file open, not in read_csv. Obvious in hindsight, but a subtle error to track down.
Answer 9
This answer seems to cover CSV encoding issues. If you are getting a strange encoding problem in your header like this:

>>> f = open(filename, "r")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('\ufeffid', '1'), ...])

then your file contains a UTF-8 byte order mark (BOM); open it with encoding='utf-8-sig' instead.
I am posting an update to this old thread. I found one solution that worked, but requires opening each file. I opened my csv file in LibreOffice, chose Save As > edit filter settings. In the drop-down menu I chose UTF8 encoding. Then I added encoding="utf-8-sig" to the data = pd.read_csv(r'C:\fullpathtofile\filename.csv', sep = ',', encoding="utf-8-sig").
I had trouble opening a CSV file in simplified Chinese downloaded from an online bank.
I tried latin1, iso-8859-1, and cp1252, all to no avail.
But pd.read_csv("", encoding='gbk') simply does the work.
I am using Jupyter Notebook. In my case, it was showing the file in the wrong format, and the 'encoding' option was not working.
So I saved the csv in utf-8 format, and it works.
Answer 14
Try this:

import pandas as pd

with open('filename.csv') as f:
    data = pd.read_csv(f)
What does this b character in front of the string mean?
What are the effects of using it?
What are appropriate situations to use it?
I found a related question right here on SO, but that question is about PHP though, and it states the b is used to indicate the string is binary, as opposed to Unicode, which was needed for code to be compatible from version of PHP < 6, when migrating to PHP 6. I don’t think this applies to Python.
I did find this documentation on the Python site about using a u character in the same syntax to specify a string as Unicode. Unfortunately, it doesn’t mention the b character anywhere in that document.
Also, just out of curiosity, are there more symbols than the b and u that do other things?
A prefix of 'b' or 'B' is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A 'u' or 'b' prefix may be followed by an 'r' prefix.
Bytes literals are always prefixed with ‘b’ or ‘B’; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
Python 3.x makes a clear distinction between the types:
str = '...' literals = a sequence of Unicode characters (UTF-16 or UTF-32, depending on how Python was compiled)
bytes = b'...' literals = a sequence of octets (integers between 0 and 255)
If you’re familiar with Java or C#, think of str as String and bytes as byte[]. If you’re familiar with SQL, think of str as NVARCHAR and bytes as BINARY or BLOB. If you’re familiar with the Windows registry, think of str as REG_SZ and bytes as REG_BINARY. If you’re familiar with C(++), then forget everything you’ve learned about char and strings, because A CHARACTER IS NOT A BYTE. That idea is long obsolete.
You use str when you want to represent text.
print('שלום עולם')
You use bytes when you want to represent low-level binary data like structs.
NaN = struct.unpack('>d', b'\xff\xf8\x00\x00\x00\x00\x00\x00')[0]
>>> b'\xEF\xBB\xBF' + 'Text with a UTF-8 BOM'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
The b'...' notation is somewhat confusing in that it allows the bytes 0x01-0x7F to be specified with ASCII characters instead of hex numbers.
>>> b'A' == b'\x41'
True
But I must emphasize, a character is not a byte.
>>> 'A' == b'A'
False
In Python 2.x
Pre-3.0 versions of Python lacked this kind of distinction between text and binary data. Instead, there was:
str = '...' literals = sequences of confounded bytes/characters
Usually text, encoded in some unspecified encoding.
But also used to represent binary data like struct.pack output.
In order to ease the 2.x-to-3.x transition, the b'...' literal syntax was backported to Python 2.6, in order to allow distinguishing binary strings (which should be bytes in 3.x) from text strings (which should be str in 3.x). The b prefix does nothing in 2.x, but tells the 2to3 script not to convert it to a Unicode string in 3.x.
So yes, b'...' literals in Python have the same purpose that they do in PHP.
Also, just out of curiosity, are there more symbols than the b and u that do other things?
The r prefix creates a raw string (e.g., r'\t' is a backslash + t instead of a tab), and triple quotes '''...''' or """...""" allow multi-line string literals.
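A quick illustration of the r prefix:
>>> len(r'\t')  # raw: backslash + t, two characters
2
>>> len('\t')  # escaped: a single tab character
1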
Bytes are the actual data. Strings are an abstraction.
If you had a multi-character string object and you took a single character, it would be a string, and it might be more than 1 byte in size depending on encoding.
If you took 1 byte from a byte string, you'd get a single 8-bit value from 0-255, and it might not represent a complete character if those characters, due to the encoding, were more than 1 byte.
TBH I’d use strings unless I had some specific low level reason to use bytes.
Answer 3
From server side, if we send any response, it will be sent in the form of byte type, so it will appear in the client as b'Response from server'
In order to get rid of b'....', simply use the code below:
Server file:
stri="Response from server"
c.send(stri.encode())
Client file:
print(s.recv(1024).decode())
then it will print Response from server
Answer 4
Here’s an example where the absence of b would throw a TypeError exception in Python 3.x
>>> f=open("new", "wb")
>>> f.write("Hello Python!")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface
In addition to what others have said, note that a single character in unicode can consist of multiple bytes.
The way UTF-8 works is that it took the old ASCII format (7-bit codes that look like 0xxx xxxx) and added multi-byte sequences in which every byte starts with a 1 (1xxx xxxx) to represent characters beyond ASCII, so that UTF-8 is backwards-compatible with ASCII.
>>> len('Öl') # German word for 'oil' with 2 characters
2
>>> 'Öl'.encode('UTF-8') # convert str to bytes
b'\xc3\x96l'
>>> len('Öl'.encode('UTF-8')) # 3 bytes encode 2 characters !
3
Answer 7
You can use JSON to convert bytes to a dictionary:

import json

data = b'{"key":"value"}'
print(json.loads(data))
There’s not really any “raw string“; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.
A “raw string literal” is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning “just a backslash” (except when it comes right before a quote that would otherwise terminate the literal) — no “escape sequences” to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.
This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the “except” clause above doesn’t matter) and it looks a bit better when you avoid doubling up each of them — that’s all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that’s very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the “except” clause above).
r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).
Not sure what you mean by “going back” – there is no intrinsically back and forward directions, because there’s no raw string type, it’s just an alternative syntax to express perfectly normal string objects, byte or unicode as they may be.
And yes, in Python 2.*, u'...' is of course always distinct from just '...' — the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.
There are two types of string in Python: the traditional str type and the newer unicode type. If you type a string literal without the u in front, you get the old str type, which stores 8-bit characters; with the u in front, you get the newer unicode type, which can store any Unicode character.
The r doesn’t change the type at all, it just changes how the string literal is interpreted. Without the r, backslashes are treated as escape characters. With the r, backslashes are treated as literal. Either way, the type is the same.
ur is of course a Unicode string where backslashes are literal backslashes, not part of escape codes.
You can try to convert a Unicode string to an old string using the str() function, but if there are any unicode characters that cannot be represented in the old string, you will get an exception. You could replace them with question marks first if you wish, but of course this would cause those characters to be unreadable. It is not recommended to use the str type if you want to correctly handle unicode characters.
A “u” prefix denotes the value has type unicode rather than str.
Raw string literals, with an "r" prefix, leave escape sequences uninterpreted, so len(r"\n") is 2. Because the escape sequences are still scanned (just not processed), you cannot end a string literal with a single backslash: r"\" is not a valid literal.
“Raw” is not part of the type, it’s merely one way to represent the value. For example, "\\n" and r"\n" are identical values, just like 32, 0x20, and 0b100000 are identical.
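For instance:
>>> "\\n" == r"\n"
True
>>> 32 == 0x20 == 0b100000
True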
The source file encoding just determines how to interpret the source file, it doesn’t affect expressions or types otherwise. However, it’s recommended to avoid code where an encoding other than ASCII would change the meaning:
Files using ASCII (or UTF-8, for Python 3.0) should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.
Let me explain it simply:
In Python 2, you can store a string in 2 different types.
The first one is ASCII, which is the str type in Python; it uses 1 byte per character. (256 characters; it will store mostly English alphabets and simple symbols.)
The 2nd type is UNICODE, which is the unicode type in Python. Unicode stores all types of languages.
By default, Python will prefer the str type, but if you want to store a string in the unicode type you can put u in front of the text like u'text', or you can do this by calling unicode('text').
So u is just a short way to call a function to cast str to unicode. That's it!
Now the r part: you put it in front of the text to tell the computer that the text is raw text, and a backslash should not be an escape character. r'\n' will not create a new-line character; it's just plain text containing 2 characters.
If you want to convert str to unicode and also put raw text in there, use ur, because ru will raise an error.
NOW, the important part:
You cannot store one backslash by using r; it's the only exception.
So this code will produce an error: r'\'
To store a backslash (only one) you need to use '\\'.
If you want to store more than 1 character you can still use r, like r'\\', which will produce 2 backslashes as you expected.
I don't know the reason why r doesn't work with one-backslash storage, but it hasn't been described by anyone yet. I hope that it is a bug.
Answer 5
Maybe this is obvious, maybe not, but you can make the string '\' by calling x = chr(92):

x = chr(92)
print type(x), len(x)  # <type 'str'> 1
y = '\\'
print type(y), len(y)  # <type 'str'> 1
x == y  # True
x is y  # False
If you want to create a string literal consisting of only easily typable characters like english letters or numbers, you can simply type them: 'hello world'. But if you want to include also some more exotic characters, you’ll have to use some workaround. One of the workarounds are Escape sequences. This way you can for example represent a new line in your string simply by adding two easily typable characters \n to your string literal. So when you print the 'hello\nworld' string, the words will be printed on separate lines. That’s very handy!
On the other hand, there are some situations when you want to create a string literal that contains escape sequences but you don’t want them to be interpreted by Python. You want them to be raw. Look at these examples:
'New updates are ready in c:\windows\updates\new'
'In this lesson we will learn what the \n escape sequence does.'
In such situations you can just prefix the string literal with the r character like this: r'hello\nworld' and no escape sequences will be interpreted by Python. The string will be printed exactly as you created it.
Raw string literals are not completely “raw”?
Many people expect the raw string literals to be raw in a sense that “anything placed between the quotes is ignored by Python”. That is not true. Python still recognizes all the escape sequences, it just does not interpret them – it leaves them unchanged instead. It means that raw string literals still have to be valid string literals.
string ::= "'" stringitem* "'"
stringitem ::= stringchar | escapeseq
stringchar ::= <any source character except "\" or newline or the quote>
escapeseq ::= "\" <any source character>
It is clear that string literals (raw or not) containing a bare quote character: 'hello'world' or ending with a backslash: 'hello world\' are not valid.
This somewhat depends on what platform you are on. The most common way to do this is by printing ANSI escape sequences. For a simple example, here’s some python code from the blender build scripts:
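The example relies on a bcolors class defined earlier in those scripts; presumably something like this commonly-cited snippet (the exact constants are an assumption here):
class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'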
print(bcolors.WARNING + "Warning: No active frommets remain. Continue?" + bcolors.ENDC)
or, with Python3.6+:
print(f"{bcolors.WARNING}Warning: No active frommets remain. Continue?{bcolors.ENDC}")
This will work on Unixes including OS X, Linux, and Windows (provided you use ANSICON, or in Windows 10 provided you enable VT100 emulation). There are ANSI codes for setting the color, moving the cursor, and more.
If you are going to get complicated with this (and it sounds like you are if you are writing a game), you should look into the “curses” module, which handles a lot of the complicated parts of this for you. The Python Curses HowTO is a good introduction.
If you are not using extended ASCII (i.e. not on a PC), you are stuck with the ASCII characters below 127, and '#' or '@' is probably your best bet for a block. If you can ensure your terminal is using an IBM extended ASCII character set, you have many more options. Characters 176, 177, 178 and 219 are the "block characters".
Some modern text-based programs, such as “Dwarf Fortress”, emulate text mode in a graphical mode, and use images of the classic PC font. You can find some of these bitmaps that you can use on the Dwarf Fortress Wiki see (user-made tilesets).
Hmm.. I think got a little carried away on this answer. I am in the midst of planning an epic text-based adventure game, though. Good luck with your colored text!
def print_format_table():"""
prints table of formatted text format options
"""for style in range(8):for fg in range(30,38):
s1 =''for bg in range(40,48):
format =';'.join([str(style), str(fg), str(bg)])
s1 +='\x1b[%sm %s \x1b[0m'%(format, format)print(s1)print('\n')
print_format_table()
Print a string that starts a color/style, then the string, then end the color/style change with '\x1b[0m':
print('\x1b[6;30;42m' + 'Success!' + '\x1b[0m')
Get a table of format options for shell text with following code:
def print_format_table():
    """
    prints table of formatted text format options
    """
    for style in range(8):
        for fg in range(30, 38):
            s1 = ''
            for bg in range(40, 48):
                format = ';'.join([str(style), str(fg), str(bg)])
                s1 += '\x1b[%sm %s \x1b[0m' % (format, format)
            print(s1)
        print('\n')

print_format_table()
Define a string that starts a color and a string that ends the color, then print your text with the start string at the front and the end string at the end.
CRED = '\033[91m'
CEND = '\033[0m'
print(CRED + "Error, does not compute!" + CEND)
This produces red error text in bash, in urxvt with a Zenburn-style color scheme.
I’m responding because I have found out a way to use ANSI codes on Windows 10, so that you can change the colour of text without any modules that aren’t built in:
The line that makes this work is os.system(""), or any other system call, which allows you to print ANSI codes in the Terminal:
import os
os.system("")
# Group of Different functions for different styles
class style():
    BLACK = '\033[30m'
    RED = '\033[31m'
    GREEN = '\033[32m'
    YELLOW = '\033[33m'
    BLUE = '\033[34m'
    MAGENTA = '\033[35m'
    CYAN = '\033[36m'
    WHITE = '\033[37m'
    UNDERLINE = '\033[4m'
    RESET = '\033[0m'

print(style.YELLOW + "Hello, World!")
Note: Although this gives the same options as other Windows solutions, Windows does not fully support ANSI codes, even with this trick. Not all the text decoration colours work, and all the 'bright' colours (codes 90-97 and 100-107) display the same as the regular colours (codes 30-37 and 40-47).
Edit: Thanks to @j-l for finding an even shorter method.
tl;dr: Add os.system("") near the top of your file.
My favorite way is with the Blessings library (full disclosure: I wrote it). For example:
from blessings import Terminal
t = Terminal()
print t.red('This is red.')
print t.bold_bright_red_on_black('Bright red on black')
To print colored bricks, the most reliable way is to print spaces with background colors. I use this technique to draw the progress bar in nose-progressive:
print t.on_green(' ')
You can print in specific locations as well:
with t.location(0, 5):
print t.on_yellow(' ')
If you have to muck with other terminal capabilities in the course of your game, you can do that as well. You can use Python’s standard string formatting to keep it readable:
print '{t.clear_eol}You just cleared a {t.bold}whole{t.normal} line!'.format(t=t)
The nice thing about Blessings is that it does its best to work on all sorts of terminals, not just the (overwhelmingly common) ANSI-color ones. It also keeps unreadable escape sequences out of your code while remaining concise to use. Have fun!
sty is similar to colorama, but it’s less verbose, supports 8bit and 24bit (rgb) colors, allows you to register your own styles, supports muting, is really flexible, well documented and more.
Examples:
from sty import fg, bg, ef, rs
foo = fg.red + 'This is red text!' + fg.rs
bar = bg.blue + 'This has a blue background!' + bg.rs
baz = ef.italic + 'This is italic text' + rs.italic
qux = fg(201) + 'This is pink text using 8bit colors' + fg.rs
qui = fg(255, 10, 10) + 'This is red text using 24bit colors.' + fg.rs
# Add custom colors:
from sty import Style, RgbFg
fg.orange = Style(RgbFg(255, 150, 50))
buf = fg.orange + 'Yay, Im orange.' + fg.rs
print(foo, bar, baz, qux, qui, buf, sep='\n')
I generated a class with all the colors, using a for loop to iterate over every combination of color up to 100, and then wrote a class with the Python colors. Copy and paste as you will; GPLv2 by me:
class colors:
    '''Colors class:
    reset all colors with colors.reset
    two subclasses fg for foreground and bg for background.
    use as colors.subclass.colorname.
    i.e. colors.fg.red or colors.bg.green
    also, the generic bold, disable, underline, reverse, strikethrough,
    and invisible work with the main class
    i.e. colors.bold
    '''
    reset = '\033[0m'
    bold = '\033[01m'
    disable = '\033[02m'
    underline = '\033[04m'
    reverse = '\033[07m'
    strikethrough = '\033[09m'
    invisible = '\033[08m'

    class fg:
        black = '\033[30m'
        red = '\033[31m'
        green = '\033[32m'
        orange = '\033[33m'
        blue = '\033[34m'
        purple = '\033[35m'
        cyan = '\033[36m'
        lightgrey = '\033[37m'
        darkgrey = '\033[90m'
        lightred = '\033[91m'
        lightgreen = '\033[92m'
        yellow = '\033[93m'
        lightblue = '\033[94m'
        pink = '\033[95m'
        lightcyan = '\033[96m'

    class bg:
        black = '\033[40m'
        red = '\033[41m'
        green = '\033[42m'
        orange = '\033[43m'
        blue = '\033[44m'
        purple = '\033[45m'
        cyan = '\033[46m'
        lightgrey = '\033[47m'
import ctypes

# Constants from the Windows API
STD_OUTPUT_HANDLE = -11
FOREGROUND_RED = 0x0004  # text color contains red.

def get_csbi_attributes(handle):
    # Based on IPython's winconsole.py, written by Alexander Belchenko
    import struct
    csbi = ctypes.create_string_buffer(22)
    res = ctypes.windll.kernel32.GetConsoleScreenBufferInfo(handle, csbi)
    assert res

    (bufx, bufy, curx, cury, wattr,
     left, top, right, bottom, maxx, maxy) = struct.unpack("hhhhHhhhhhh", csbi.raw)
    return wattr

handle = ctypes.windll.kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
reset = get_csbi_attributes(handle)

ctypes.windll.kernel32.SetConsoleTextAttribute(handle, FOREGROUND_RED)
print "Cherry on top"
ctypes.windll.kernel32.SetConsoleTextAttribute(handle, reset)
Whether the character prints like a box really depends on what font you are using for the console window. The pound symbol works well with many fonts:
#
Answer 15
# Pure Python 3.x demo, 256 colors
# Works with bash under Linux and macOS
fg = lambda text, color: "\33[38;5;" + str(color) + "m" + text + "\33[0m"
bg = lambda text, color: "\33[48;5;" + str(color) + "m" + text + "\33[0m"

def print_six(row, format, end="\n"):
    for col in range(6):
        color = row * 6 + col - 2
        if color >= 0:
            text = "{:3d}".format(color)
            print(format(text, color), end=" ")
        else:
            print(end="    ")  # four spaces, to keep the columns aligned
    print(end=end)

for row in range(0, 43):
    print_six(row, fg, " ")
    print_six(row, bg)

# Simple usage: print(fg("text", 160))
Answer 16
I ended up doing this, and I felt it was the cleanest:
formatters = {
    'RED': '\033[91m',
    'GREEN': '\033[92m',
    'END': '\033[0m',
}

print('Master is currently {RED}red{END}!'.format(**formatters))
print('Help make master {GREEN}green{END} again!'.format(**formatters))
from ColorIt import *
# Use this to ensure that ColorIt will be usable by certain command line interfaces
initColorIt()
# Foreground
print (color ('This text is red', colors.RED))
print (color ('This text is orange', colors.ORANGE))
print (color ('This text is yellow', colors.YELLOW))
print (color ('This text is green', colors.GREEN))
print (color ('This text is blue', colors.BLUE))
print (color ('This text is purple', colors.PURPLE))
print (color ('This text is white', colors.WHITE))
# Background
print (background ('This text has a background that is red', colors.RED))
print (background ('This text has a background that is orange', colors.ORANGE))
print (background ('This text has a background that is yellow', colors.YELLOW))
print (background ('This text has a background that is green', colors.GREEN))
print (background ('This text has a background that is blue', colors.BLUE))
print (background ('This text has a background that is purple', colors.PURPLE))
print (background ('This text has a background that is white', colors.WHITE))
# Custom
print (color ("This color has a custom grey text color", (150, 150, 150))
print (background ("This color has a custom grey background", (150, 150, 150))
# Combination
print (background (color ("This text is blue with a white background", colors.BLUE), colors.WHITE))
Running this gives you each of the colors and backgrounds listed above.
It's also worth noting that this is cross-platform and has been tested on macOS, Linux, and Windows.
Note: Blinking, italics, bold, etc. will be added in a few days.
Answer 24
If you are using Windows, here you go!
# display text on a Windows console
# Windows XP with Python27 or Python32
from ctypes import windll

# needed for Python2/Python3 diff
try:
    input = raw_input
except NameError:
    pass

STD_OUTPUT_HANDLE = -11
stdout_handle = windll.kernel32.GetStdHandle(STD_OUTPUT_HANDLE)

# look at the output and select the color you want
# for instance hex E is yellow on black
# hex 1E is yellow on blue
# hex 2E is yellow on green and so on
for color in range(0, 75):
    windll.kernel32.SetConsoleTextAttribute(stdout_handle, color)
    print("%X --> %s" % (color, "Have a fine day!"))
    input("Press Enter to go on ... ")
"""
.. versionadded:: 0.9.2
Functions for wrapping strings in ANSI color codes.
Each function within this module returns the input string ``text``, wrapped
with ANSI color codes for the appropriate color.
For example, to print some text as green on supporting terminals::
from fabric.colors import green
print(green("This text is green!"))
Because these functions simply return modified strings, you can nest them::
from fabric.colors import red, green
print(red("This sentence is red, except for " + \
green("these words, which are green") + "."))
If ``bold`` is set to ``True``, the ANSI flag for bolding will be flipped on
for that particular invocation, which usually shows up as a bold or brighter
version of the original color on most terminals.
"""def _wrap_with(code):def inner(text, bold=False):
c = code
if bold:
c ="1;%s"% c
return"\033[%sm%s\033[0m"%(c, text)return inner
red = _wrap_with('31')
green = _wrap_with('32')
yellow = _wrap_with('33')
blue = _wrap_with('34')
magenta = _wrap_with('35')
cyan = _wrap_with('36')
white = _wrap_with('37')
"""
.. versionadded:: 0.9.2
Functions for wrapping strings in ANSI color codes.
Each function within this module returns the input string ``text``, wrapped
with ANSI color codes for the appropriate color.
For example, to print some text as green on supporting terminals::
from fabric.colors import green
print(green("This text is green!"))
Because these functions simply return modified strings, you can nest them::
from fabric.colors import red, green
print(red("This sentence is red, except for " + \
green("these words, which are green") + "."))
If ``bold`` is set to ``True``, the ANSI flag for bolding will be flipped on
for that particular invocation, which usually shows up as a bold or brighter
version of the original color on most terminals.
"""
def _wrap_with(code):
    def inner(text, bold=False):
        c = code
        if bold:
            c = "1;%s" % c
        return "\033[%sm%s\033[0m" % (c, text)
    return inner
red = _wrap_with('31')
green = _wrap_with('32')
yellow = _wrap_with('33')
blue = _wrap_with('34')
magenta = _wrap_with('35')
cyan = _wrap_with('36')
white = _wrap_with('37')
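A quick usage sketch of the wrappers above:

print(red("This is red"))
print(green("This is bold green", bold=True))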
I’m having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.
The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes, it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.
One of the sections of code that is causing problems is the line shown in the stack trace below, which is produced on SOME strings when the snippet is run:
Traceback (most recent call last):
File "foobar.py", line 792, in <module>
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption, so there are no issues relating to internationalization or dealing with text written in anything other than English.
Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?
This is a classic python unicode pain point! Consider the following:
a = u'bats\u00E0'
print a
=> batsà
All good so far, but if we call str(a), let’s see what happens:
str(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
Oh dip, that’s not gonna do anyone any good! To fix the error, encode the bytes explicitly with .encode and tell python what codec to use:
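Continuing the example above, a minimal sketch of the fix in Python 2 (the bytes shown are the UTF-8 encoding of à):

a.encode('utf-8')
=> 'bats\xc3\xa0'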
The issue is that when you call str(), python uses the default character encoding to try and encode the bytes you gave it, which in your case are sometimes representations of unicode characters. To fix the problem, you have to tell python how to deal with the string you give it by using .encode('whatever_unicode'). Most of the time, you should be fine using utf-8.
It's important to notice that using the ignore option is dangerous, because it silently drops any unicode (and internationalization) support from the code that uses it, as seen here:
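Here is a minimal sketch of what silently disappears (the string is just an example):

u'voil\u00e0'.encode('ascii', 'ignore')
=> 'voil'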
A subtle problem that causes even print to fail is having your environment variables set wrong, e.g. here LC_ALL set to "C". In Debian they discourage setting it: Debian wiki on Locale
$ echo $LANG
en_US.utf8
$ echo $LC_ALL
C
$ python -c "print (u'voil\u00e0')"
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
$ export LC_ALL='en_US.utf8'
$ python -c "print (u'voil\u00e0')"
voilà
$ unset LC_ALL
$ python -c "print (u'voil\u00e0')"
voilà
$ python -c 'print(u"\u2122");'Traceback(most recent call last):File"<string>", line 1,in<module>UnicodeEncodeError:'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
现在安装language-pack-en:
$ sudo apt-get -y install language-pack-en
The following extra packages will be installed:
language-pack-en-base
Generating locales...
en_GB.UTF-8.../usr/sbin/locale-gen: done
Generation complete.
The problem is that you’re trying to print a unicode character, but your terminal doesn’t support it.
You can try installing language-pack-en package to fix that:
sudo apt-get install language-pack-en
which provides English translation data updates for all supported packages (including Python). Install a different language package if necessary (depending on which characters you're trying to print).
On some Linux distributions it's required in order to make sure that the default English locales are set up properly (so unicode characters can be handled by the shell/terminal). Sometimes it's easier to install it than to configure it manually.
Then when writing the code, make sure you use the right encoding in your code.
For example:
open(foo, encoding='utf-8')
If you still have a problem, double check your system configuration, such as:
Your locale file (/etc/default/locale), which should have, e.g.:
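# illustrative values; match them to the locale you actually generated
LANG="en_US.UTF-8"
LC_ALL="en_US.UTF-8"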
Printing unicode characters (such as the trademark sign ™):
$ python -c 'print(u"\u2122");'
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
Now installing language-pack-en:
$ sudo apt-get -y install language-pack-en
The following extra packages will be installed:
language-pack-en-base
Generating locales...
en_GB.UTF-8... /usr/sbin/locale-gen: done
Generation complete.
Here’s a rehashing of some other so-called “cop out” answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.
Suggestion: you might want to rename this function to toAscii instead? That's a matter of preference.
This was written for Python 2. For Python 3, I believe you’ll want to use bytes(obj,"ascii") rather than str(obj). I didn’t test this yet, but I will at some point and revise the answer.
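A sketch of that Python 3 variant (equally untested, per the caveat above; passing 'ignore' mirrors the throw-away behaviour being discussed):

bytes(u'caf\u00e9', 'ascii', 'ignore')
=> b'caf'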
unicodedata.normalize(form, unistr) Return the normal form form for
the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’,
‘NFD’, and ‘NFKD’.
The Unicode standard defines various normalization forms of a Unicode
string, based on the definition of canonical equivalence and
compatibility equivalence. In Unicode, several characters can be
expressed in various ways. For example, the character U+00C7 (LATIN
CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence
U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
For each character, there are two normal forms: normal form C and
normal form D. Normal form D (NFD) is also known as canonical
decomposition, and translates each character into its decomposed form.
Normal form C (NFC) first applies a canonical decomposition, then
composes pre-combined characters again.
In addition to these two forms, there are two additional normal forms
based on compatibility equivalence. In Unicode, certain characters are
supported which normally would be unified with other characters. For
example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049
(LATIN CAPITAL LETTER I). However, it is supported in Unicode for
compatibility with existing character sets (e.g. gb2312).
The normal form KD (NFKD) will apply the compatibility decomposition,
i.e. replace all compatibility characters with their equivalents. The
normal form KC (NFKC) first applies the compatibility decomposition,
followed by the canonical composition.
Even if two unicode strings are normalized and look the same to a
human reader, if one has combining characters and the other doesn’t,
they may not compare equal.
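As a minimal sketch of how this is typically applied to the problem at hand (the input string is just an example), normalize first and then throw away whatever still will not fit in ASCII:

import unicodedata
unicodedata.normalize('NFKD', u'voil\u00e0').encode('ascii', 'ignore')
=> 'voila'  # b'voila' on Python 3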
Solves it for me. Simple and easy.
Answer 16
The solution below worked for me: I just added u"String" (representing the string as unicode) before my string:
result_html = result.to_html(col_space=1, index=False, justify={'right'})
text = u"""
<html>
<body>
<p>
Hello all, <br>
<br>
Here's weekly summary report. Let me know if you have any questions. <br>
<br>
Data Summary <br>
<br>
<br>
{0}
</p>
<p>Thanks,</p>
<p>Data Team</p>
</body></html>
""".format(result_html)
Answer 17
this这至少在Python 3中有效…
Python 3
有时错误在于环境变量中,因此
import os
import locale
os.environ["PYTHONIOENCODING"]="utf-8"
myLocale=locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")...print(myText.encode('utf-8', errors='ignore'))
We struck this error when running manage.py migrate in Django with localized fixtures.
Our source contained the # -*- coding: utf-8 -*- declaration, MySQL was correctly configured for utf8 and Ubuntu had the appropriate language pack and values in /etc/default/locale.
The issue was simply that the Django container (we use Docker) was missing the LANG env var.
Setting LANG to en_US.UTF-8 and restarting the container before re-running migrations fixed the problem.
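A minimal sketch of that fix (the image name is a hypothetical placeholder):

$ docker run -e LANG=en_US.UTF-8 -e LC_ALL=en_US.UTF-8 our-django-image python manage.py migrate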
Answer 20
Many answers here (@agf and @Andbdrew for example) have already addressed the most immediate aspects of the OP question.
However, I think there is one subtle but important aspect that has been largely ignored and that matters dearly for everyone who, like me, ended up here while trying to make sense of encodings in Python: Python 2 and Python 3 manage character representation in wildly different ways. I feel like a big chunk of the confusion out there has to do with people reading about encodings in Python without being version aware.
I suggest anyone interested in understanding the root cause of the OP's problem begin by reading Spolsky's introduction to character representations and Unicode, and then move on to Batchelder on Unicode in Python 2 and Python 3.
Answer 21
Try to avoid converting a variable to str(variable); sometimes, this may cause the problem.
A simple tip to avoid that:
try:
    data = str(data)
except:
    data = data  # Don't convert to String
I had this issue trying to output Unicode characters to stdout, but with sys.stdout.write, rather than print (so that I could support output to a different file as well).
import sys
import codecs
from BeautifulSoup import BeautifulSoup  # assumed import (BeautifulSoup 3, matching the Python 2 unicode() call below)

def main(fIn, fOut):
    soup = BeautifulSoup(fIn)
    # Do processing, with data including non-ASCII characters
    fOut.write(unicode(soup))

if __name__ == '__main__':
    with (sys.stdin) as fIn:  # Don't think we need codecs.getreader here
        with codecs.getwriter('utf-8')(sys.stdout) as fOut:
            main(fIn, fOut)
This problem often happens when a Django project is deployed using Apache, because Apache sets the environment variable LANG=C in /etc/sysconfig/httpd. Just open that file and comment out (or change to your flavor) this setting. Alternatively, use the lang option of the WSGIDaemonProcess directive, in which case you will be able to set a different LANG environment variable for each virtual host.
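A minimal sketch of the WSGIDaemonProcess route (the process group name and values are placeholders; lang and locale are the relevant mod_wsgi options):

WSGIDaemonProcess example.com lang='en_US.UTF-8' locale='en_US.UTF-8'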