Tag Archives: character-encoding

How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8 and ASCII

Question: How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8 and ASCII

At work it seems like no week ever passes without some encoding-related conniption, calamity, or catastrophe. The problem usually derives from programmers who think they can reliably process a “text” file without specifying the encoding. But you can’t.

So it’s been decided to henceforth forbid files from ever having names that end in *.txt or *.text. The thinking is that those extensions mislead the casual programmer into a dull complacency regarding encodings, and this leads to improper handling. It would almost be better to have no extension at all, because at least then you know that you don’t know what you’ve got.

However, we aren’t going to go that far. Instead you will be expected to use a filename that ends in the encoding. So for text files, for example, these would be something like README.ascii, README.latin1, README.utf8, etc.

For files that demand a particular extension, if one can specify the encoding inside the file itself, such as in Perl or Python, then you shall do that. For files like Java source where no such facility exists internal to the file, you will put the encoding before the extension, such as SomeClass-utf8.java.

For output, UTF-8 is to be strongly preferred.

But for input, we need to figure out how to deal with the thousands of files in our codebase named *.txt. We want to rename all of them to fit into our new standard. But we can’t possibly eyeball them all. So we need a library or program that actually works.

These are variously in ASCII, ISO-8859-1, UTF-8, Microsoft CP1252, or Apple MacRoman. Although we know we can tell if something is ASCII, and we stand a good chance of knowing if something is probably UTF-8, we’re stumped about the 8-bit encodings. Because we’re running in a mixed Unix environment (Solaris, Linux, Darwin) with most desktops being Macs, we have quite a few annoying MacRoman files. And these especially are a problem.

For some time now I’ve been looking for a way to programmatically determine which of

  1. ASCII
  2. ISO-8859-1
  3. CP1252
  4. MacRoman
  5. UTF-8

a file is in, and I haven’t found a program or library that can reliably distinguish between those three different 8-bit encodings. We probably have over a thousand MacRoman files alone, so whatever charset detector we use has to be able to sniff those out. Nothing I’ve looked at can manage the trick. I had big hopes for the ICU charset detector library, but it cannot handle MacRoman. I’ve also looked at modules to do the same sort of thing in both Perl and Python, but again and again it’s always the same story: no support for detecting MacRoman.

What I am therefore looking for is an existing library or program that reliably determines which of those five encodings a file is in—and preferably more than that. In particular it has to distinguish between the three 8-bit encodings I’ve cited, especially MacRoman. The files are more than 99% English language text; there are a few in other languages, but not many.

If it’s library code, our language preference is for it to be in Perl, C, Java, or Python, and in that order. If it’s just a program, then we don’t really care what language it’s in so long as it comes in full source, runs on Unix, and is fully unencumbered.

Has anyone else had this problem of a zillion legacy text files randomly encoded? If so, how did you attempt to solve it, and how successful were you? This is the most important aspect of my question, but I’m also interested in whether you think encouraging programmers to name (or rename) their files with the actual encoding those files are in will help us avoid the problem in the future. Has anyone ever tried to enforce this on an institutional basis, and if so, was that successful or not, and why?

And yes, I fully understand why one cannot guarantee a definite answer given the nature of the problem. This is especially the case with small files, where you don’t have enough data to go on. Fortunately, our files are seldom small. Apart from the random README file, most are in the size range of 50k to 250k, and many are larger. Anything more than a few K in size is guaranteed to be in English.

The problem domain is biomedical text mining, so we sometimes deal with extensive and extremely large corpora, like all of PubMedCentral’s Open Access repository. A rather huge file is the BioThesaurus 6.0, at 5.7 gigabytes. This file is especially annoying because it is almost all UTF-8. However, some numbskull went and stuck a few lines in it that are in some 8-bit encoding—Microsoft CP1252, I believe. It takes quite a while before you trip on that one. :(


Answer 0

First, the easy cases:

ASCII

If your data contains no bytes above 0x7F, then it’s ASCII. (Or a 7-bit ISO646 encoding, but those are very obsolete.)

UTF-8

If your data validates as UTF-8, then you can safely assume it is UTF-8. Due to UTF-8’s strict validation rules, false positives are extremely rare.

ISO-8859-1 vs. windows-1252

The only difference between these two encodings is that ISO-8859-1 has the C1 control characters where windows-1252 has the printable characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ. I’ve seen plenty of files that use curly quotes or dashes, but none that use C1 control characters. So don’t even bother with them, or ISO-8859-1, just detect windows-1252 instead.

That now leaves you with only one question.

How do you distinguish MacRoman from cp1252?

This is a lot trickier.

Undefined characters

The bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D are not used in windows-1252. If they occur, then assume the data is MacRoman.

Identical characters

The bytes 0xA2 (¢), 0xA3 (£), 0xA9 (©), 0xB1 (±), 0xB5 (µ) happen to be the same in both encodings. If these are the only non-ASCII bytes, then it doesn’t matter whether you choose MacRoman or cp1252.

Statistical approach

Count character (NOT byte!) frequencies in the data you know to be UTF-8. Determine the most frequent characters. Then use this data to determine whether the cp1252 or MacRoman characters are more common.

For example, in a search I just performed on 100 random English Wikipedia articles, the most common non-ASCII characters are ·•–é°®’èö—. Based on this fact,

  • The bytes 0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, or 0xF6 suggest windows-1252.
  • The bytes 0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, or 0xE1 suggest MacRoman.

Count up the cp1252-suggesting bytes and the MacRoman-suggesting bytes, and go with whichever is greatest.
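
Here is a rough Python 3 sketch of the whole decision chain above; the hint sets are copied from the byte lists in this answer, and the name guess_encoding is only illustrative, not any library’s API.

# Rough sketch of the heuristic above, operating on raw bytes.
WIN1252_HINTS  = {0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6}
MACROMAN_HINTS = {0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1}
UNDEFINED_1252 = {0x81, 0x8D, 0x8F, 0x90, 0x9D}   # bytes not used in windows-1252

def guess_encoding(data):
    # Easy cases first: pure ASCII, then strict UTF-8 validation.
    if all(b <= 0x7F for b in data):
        return 'ascii'
    try:
        data.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        pass
    # From here on it is one of the 8-bit encodings.
    high_bytes = [b for b in data if b > 0x7F]
    if any(b in UNDEFINED_1252 for b in high_bytes):
        return 'mac-roman'
    cp1252_votes = sum(b in WIN1252_HINTS for b in high_bytes)
    macroman_votes = sum(b in MACROMAN_HINTS for b in high_bytes)
    return 'mac-roman' if macroman_votes > cp1252_votes else 'windows-1252'

# Usage: with open(path, 'rb') as f: print(guess_encoding(f.read()))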


Answer 1

Mozilla’s nsUniversalDetector (Perl bindings: Encode::Detect / Encode::Detect::Detector) has been proven a million times over.


Answer 2

My attempt at such a heuristic, assuming that you’ve ruled out ASCII and UTF-8 (a rough code sketch follows the list):

  • If 0x7f to 0x9f don’t appear at all, it’s probably ISO-8859-1, because those are very rarely used control codes.
  • If 0x91 through 0x94 appear a lot, it’s probably Windows-1252, because those are the “smart quotes”, by far the most likely characters in that range to be used in English text. To be more certain, you could look for pairs.
  • Otherwise, it’s MacRoman, especially if you see a lot of 0xd2 through 0xd5 (that’s where the typographic quotes are in MacRoman).
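
As promised above, a minimal sketch of those three rules in Python 3, taking raw bytes with ASCII and UTF-8 already excluded; treating “appears a lot” as a simple count comparison is my own simplification.

def guess_8bit(data):
    c1_bytes        = sum(0x7F <= b <= 0x9F for b in data)
    cp1252_quotes   = sum(0x91 <= b <= 0x94 for b in data)   # Windows smart quotes
    macroman_quotes = sum(0xD2 <= b <= 0xD5 for b in data)   # MacRoman typographic quotes
    if c1_bytes == 0:
        return 'iso-8859-1'
    if cp1252_quotes > macroman_quotes:
        return 'windows-1252'
    return 'mac-roman'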

Side note:

For files like Java source where no such facility exists internal to the file, you will put the encoding before the extension, such as SomeClass-utf8.java

Do not do this!!

The Java compiler expects file names to match class names, so renaming the files will render the source code uncompilable. The correct thing would be to guess the encoding, then use the native2ascii tool to convert all non-ASCII characters to Unicode escape sequences.


Answer 3

“Perl, C, Java, or Python, and in that order”: interesting attitude :-)

“we stand a good chance of knowing if something is probably UTF-8”: Actually the chance that a file containing meaningful text encoded in some other charset that uses high-bit-set bytes will decode successfully as UTF-8 is vanishingly small.

UTF-8 strategies (in least preferred language):

# 100% Unicode-standard-compliant UTF-8
def utf8_strict(text):
    try:
        text.decode('utf8')
        return True
    except UnicodeDecodeError:
        return False

# looking for almost all UTF-8 with some junk
def utf8_replace(text):
    utext = text.decode('utf8', 'replace')
    dodgy_count = utext.count(u'\uFFFD') 
    return dodgy_count, utext
    # further action depends on how large dodgy_count / float(len(utext)) is

# checking for UTF-8 structure but non-compliant
# e.g. encoded surrogates, not minimal length, more than 4 bytes:
# Can be done with a regex, if you need it

Once you’ve decided that it’s neither ASCII nor UTF-8:

The Mozilla-origin charset detectors that I’m aware of don’t support MacRoman and in any case don’t do a good job on 8-bit charsets especially with English because AFAICT they depend on checking whether the decoding makes sense in the given language, ignoring the punctuation characters, and based on a wide selection of documents in that language.

As others have remarked, you really only have the high-bit-set punctuation characters available to distinguish between cp1252 and macroman. I’d suggest training a Mozilla-type model on your own documents, not Shakespeare or Hansard or the KJV Bible, and taking all 256 bytes into account. I presume that your files have no markup (HTML, XML, etc) in them — that would distort the probabilities something shocking.

You’ve mentioned files that are mostly UTF-8 but fail to decode. You should also be very suspicious of:

(1) files that are allegedly encoded in ISO-8859-1 but contain “control characters” in the range 0x80 to 0x9F inclusive … this is so prevalent that the draft HTML5 standard says to decode ALL HTML streams declared as ISO-8859-1 using cp1252.

(2) files that decode OK as UTF-8 but the resultant Unicode contains “control characters” in the range U+0080 to U+009F inclusive … this can result from transcoding cp1252 / cp850 (seen it happen!) / etc files from “ISO-8859-1” to UTF-8.
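
A quick way to flag case (2) once you have decoded text, as a small sketch (the function name is made up for illustration):

import re

# Text that decoded "fine" but contains C1 control characters (U+0080 to U+009F)
# has very likely been transcoded from mislabelled cp1252/cp850 rather than
# from genuine ISO-8859-1.
C1_CONTROLS = re.compile(u'[\u0080-\u009f]')

def looks_transcoded(text):
    return bool(C1_CONTROLS.search(text))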

Background: I have a wet-Sunday-afternoon project to create a Python-based charset detector that’s file-oriented (instead of web-oriented) and works well with 8-bit character sets including legacy ones like cp850 and cp437. It’s nowhere near prime time yet. I’m interested in training files; are your ISO-8859-1 / cp1252 / MacRoman files as “unencumbered” as you expect anyone’s code solution to be?


Answer 4

As you have discovered, there is no perfect way to solve this problem, because without the implicit knowledge about which encoding a file uses, all 8-bit encodings are exactly the same: A collection of bytes. All bytes are valid for all 8-bit encodings.

The best you can hope for, is some sort of algorithm that analyzes the bytes, and based on probabilities of a certain byte being used in a certain language with a certain encoding will guess at what encoding the files uses. But that has to know which language the file uses, and becomes completely useless when you have files with mixed encodings.

On the upside, if you know that the text in a file is written in English, then you’re unlikely to notice any difference whichever encoding you decide to use for that file, as the differences between all the mentioned encodings are all localized in the parts of the encodings that specify characters not normally used in the English language. You might have some trouble where the text uses special formatting, or special versions of punctuation (CP1252 has several versions of the quote characters for instance), but for the gist of the text there will probably be no problems.


Answer 5

If you can detect every encoding EXCEPT for macroman, then it would be logical to assume that the ones that can’t be deciphered are in macroman. In other words, just make a list of files that couldn’t be processed and handle those as if they were macroman.

Another way to sort these files would be to make a server based program that allows users to decide which encoding isn’t garbled. Of course, it would be within the company, but with 100 employees doing a few each day, you’ll have thousands of files done in no time.

Finally, wouldn’t it be better to just convert all existing files to a single format, and require that new files be in that format?


Answer 6

Has anyone else had this problem of a zillion legacy text files randomly encoded? If so, how did you attempt to solve it, and how successful were you?

I am currently writing a program that translates files into XML. It has to autodetect the type of each file, which is a superset of the problem of determining the encoding of a text file. For determining the encoding I am using a Bayesian approach. That is, my classification code computes a probability (likelihood) that a text file has a particular encoding for all the encodings it understands. The program then selects the most probable decoder. The Bayesian approach works like this for each encoding.

  1. Set the initial (prior) probability that the file is in the encoding, based on the frequencies of each encoding.
  2. Examine each byte in turn in the file. Look-up the byte value to determine the correlation between that byte value being present and a file actually being in that encoding. Use that correlation to compute a new (posterior) probability that the file is in the encoding. If you have more bytes to examine, use the posterior probability of that byte as the prior probability when you examine the next byte.
  3. When you get to the end of the file (I actually look at only the first 1024 bytes), the probability you have is the probability that the file is in the encoding.

It transpires that Bayes’ theorem becomes very easy to do if instead of computing probabilities, you compute information content, which is the logarithm of the odds: info = log(p / (1.0 - p)).

You will have to compute the initial prior probability, and the correlations, by examining a corpus of files that you have manually classified.
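
As a toy illustration of the log-odds bookkeeping described above (not the answerer’s actual code), where the per-byte weights would have to be learned from that manually classified corpus:

import math

def log_odds(p):
    # info = log(p / (1 - p)), as above
    return math.log(p / (1.0 - p))

def encoding_score(data, prior_p, byte_weights, limit=1024):
    # byte_weights[b] holds a learned log-likelihood ratio:
    # log(P(byte b | this encoding) / P(byte b | some other encoding)).
    score = log_odds(prior_p)
    for b in data[:limit]:              # only the first 1024 bytes, as above
        score += byte_weights.get(b, 0.0)
    return score

# Compute encoding_score() once per candidate encoding and pick the largest.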


Writing Unicode text to a text file?

Question: Writing Unicode text to a text file?

I’m pulling data out of a Google doc, processing it, and writing it to a file (that eventually I will paste into a WordPress page).

It has some non-ASCII symbols. How can I convert these safely to symbols that can be used in HTML source?

Currently I’m converting everything to Unicode on the way in, joining it all together in a Python string, then doing:

import codecs
f = codecs.open('out.txt', mode="w", encoding="iso-8859-1")
f.write(all_html.encode("iso-8859-1", "replace"))

There is an encoding error on the last line:

UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xa0 in position 12286: ordinal not in range(128)

Partial solution:

This Python runs without an error:

row = [unicode(x.strip()) if x is not None else u'' for x in row]
all_html = row[0] + "<br/>" + row[1]
f = open('out.txt', 'w')
f.write(all_html.encode("utf-8"))

But then if I open the actual text file, I see lots of symbols like:

Qur’an 

Maybe I need to write to something other than a text file?


Answer 0

Deal exclusively with unicode objects as much as possible by decoding things to unicode objects when you first get them and encoding them as necessary on the way out.

If your string is actually a unicode object, you’ll need to encode it to a byte string in some encoding before writing it to a file:

foo = u'Δ, Й, ק, ‎ م, ๗, あ, 叶, 葉, and 말.'
f = open('test', 'w')
f.write(foo.encode('utf8'))
f.close()

When you read that file again, you’ll get an encoded byte string that you can decode back to a unicode object:

f = file('test', 'r')
print f.read().decode('utf8')

Answer 1

In Python 2.6+, you could use io.open(), which is the default (the builtin open()) on Python 3:

import io

with io.open(filename, 'w', encoding=character_encoding) as file:
    file.write(unicode_text)

It might be more convenient if you need to write the text incrementally (you don’t need to call unicode_text.encode(character_encoding) multiple times). Unlike the codecs module, the io module has proper universal newline support.
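
For example, incremental writing with io.open might look like this (the chunks are just placeholders):

import io

chunks = [u'first line\n', u'second line\n']      # any iterable of unicode text

with io.open('out.txt', 'w', encoding='utf-8') as f:
    for chunk in chunks:
        f.write(chunk)    # each chunk is encoded for you; no .encode() calls needed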


Answer 2

Unicode string handling is already standardized in Python 3.

  1. Characters are already stored in Unicode (32-bit) in memory
  2. You only need to open file in utf-8
    (32-bit Unicode to variable-byte-length utf-8 conversion is automatically performed from memory to file.)

    out1 = "(嘉南大圳 ㄐㄧㄚ ㄋㄢˊ ㄉㄚˋ ㄗㄨㄣˋ )"
    fobj = open("t1.txt", "w", encoding="utf-8")
    fobj.write(out1)
    fobj.close()
    

Answer 3

The file opened by codecs.open is a file that takes unicode data, encodes it in iso-8859-1 and writes it to the file. However, what you try to write isn’t unicode; you take unicode and encode it in iso-8859-1 yourself. That’s what the unicode.encode method does, and the result of encoding a unicode string is a bytestring (a str type.)

You should either use normal open() and encode the unicode yourself, or (usually a better idea) use codecs.open() and not encode the data yourself.
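
Both correct variants side by side, as a sketch based on the question’s snippet; all_html here is only a placeholder unicode value:

import codecs

all_html = u'caf\xe9 au lait'     # placeholder for the real unicode content

# Option 1: plain open(), encode the unicode yourself and write the bytes
f = open('out.txt', 'wb')
f.write(all_html.encode('iso-8859-1', 'replace'))
f.close()

# Option 2: codecs.open() does the encoding, so hand it unicode directly
f = codecs.open('out.txt', mode='w', encoding='iso-8859-1')
f.write(all_html)
f.close()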


Answer 4

Preface: will your viewer work?

Make sure your viewer/editor/terminal (however you are interacting with your utf-8 encoded file) can read the file. This is frequently an issue on Windows, for example, Notepad.

Writing Unicode text to a text file?

In Python 2, use open from the io module (this is the same as the builtin open in Python 3):

import io

Best practice, in general, use UTF-8 for writing to files (we don’t even have to worry about byte-order with utf-8).

encoding = 'utf-8'

utf-8 is the most modern and universally usable encoding – it works in all web browsers, most text-editors (see your settings if you have issues) and most terminals/shells.

On Windows, you might try utf-16le if you’re limited to viewing output in Notepad (or another limited viewer).

encoding = 'utf-16le' # sorry, Windows users... :(

And just open it with the context manager and write your unicode characters out:

with io.open(filename, 'w', encoding=encoding) as f:
    f.write(unicode_object)

Example using many Unicode characters

Here’s an example that attempts to map every possible character up to three bytes wide (4 is the max, but that would be going a bit far) from the digital representation (in integers) to an encoded printable output, along with its name, if possible (put this into a file called uni.py):

from __future__ import print_function
import io
from unicodedata import name, category
from curses.ascii import controlnames
from collections import Counter

try: # use these if Python 2
    unicode_chr, range = unichr, xrange
except NameError: # Python 3
    unicode_chr = chr

exclude_categories = set(('Co', 'Cn'))
counts = Counter()
control_names = dict(enumerate(controlnames))
with io.open('unidata', 'w', encoding='utf-8') as f:
    for x in range((2**8)**3): 
        try:
            char = unicode_chr(x)
        except ValueError:
            continue # can't map to unicode, try next x
        cat = category(char)
        counts.update((cat,))
        if cat in exclude_categories:
            continue # get rid of noise & greatly shorten result file
        try:
            uname = name(char)
        except ValueError: # probably control character, don't use actual
            uname = control_names.get(x, '')
            f.write(u'{0:>6x} {1}    {2}\n'.format(x, cat, uname))
        else:
            f.write(u'{0:>6x} {1}  {2}  {3}\n'.format(x, cat, char, uname))
# may as well describe the types we logged.
for cat, count in counts.items():
    print('{0} chars of category, {1}'.format(count, cat))

This should run in the order of about a minute, and you can view the data file, and if your file viewer can display unicode, you’ll see it. Information about the categories can be found here. Based on the counts, we can probably improve our results by excluding the Cn and Co categories, which have no symbols associated with them.

$ python uni.py

It will display the hexadecimal mapping, category, symbol (unless can’t get the name, so probably a control character), and the name of the symbol. e.g.

I recommend less on Unix or Cygwin (don’t print/cat the entire file to your output):

$ less unidata

e.g. will display similar to the following lines which I sampled from it using Python 2 (unicode 5.2):

     0 Cc NUL
    20 Zs     SPACE
    21 Po  !  EXCLAMATION MARK
    b6 So  ¶  PILCROW SIGN
    d0 Lu  Ð  LATIN CAPITAL LETTER ETH
   e59 Nd  ๙  THAI DIGIT NINE
  2887 So  ⢇  BRAILLE PATTERN DOTS-1238
  bc13 Lo  밓  HANGUL SYLLABLE MIH
  ffeb Sm  →  HALFWIDTH RIGHTWARDS ARROW

My Python 3.5 from Anaconda has unicode 8.0, I would presume most 3’s would.


Answer 5

How to print unicode characters into a file:

Save this to file: foo.py:

#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys 
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print(u'e with obfuscation: é')

Run it and pipe output to file:

python foo.py > tmp.txt

Open tmp.txt and look inside, you see this:

el@apollo:~$ cat tmp.txt 
e with obfuscation: é

Thus you have saved unicode e with an obfuscation mark on it to a file.


Answer 6

That error arises when you try to encode a non-unicode string: it tries to decode it, assuming it’s in plain ASCII. There are two possibilities:

  1. You’re encoding it to a bytestring, but because you’ve used codecs.open, the write method expects a unicode object. So you encode it, and it tries to decode it again. Try: f.write(all_html) instead.
  2. all_html is not, in fact, a unicode object. When you do .encode(...), it first tries to decode it.

Answer 7

In case of writing in python3

>>> a = u'bats\u00E0'
>>> print(a)
batsà
>>> f = open("/tmp/test", "w")
>>> f.write(a)
>>> f.close()
>>> data = open("/tmp/test").read()
>>> data
'batsà'

In case of writing in python2:

>>> a = u'bats\u00E0'
>>> f = open("/tmp/test", "w")
>>> f.write(a)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

To avoid this error you would have to encode it to bytes using codecs “utf-8” like this:

>>> f.write(a.encode("utf-8"))
>>> f.close()

and decode the data while reading using the codecs “utf-8”:

>>> data = open("/tmp/test").read()
>>> data.decode("utf-8")
u'bats\xe0'

And also if you try to execute print on this string it will automatically decode using the “utf-8” codecs like this

>>> print a
batsà

“for line in…” results in UnicodeDecodeError: 'utf-8' codec can't decode byte

Question: “for line in…” results in UnicodeDecodeError: 'utf-8' codec can't decode byte

Here is my code,

for line in open('u.item'):
#read each line

whenever I run this code it gives the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte

I tried to solve this and add an extra parameter in open(), the code looks like;

for line in open('u.item', encoding='utf-8'):
#read each line

But again it gives the same error. What should I do then? Please help.


Answer 0

As suggested by Mark Ransom, I found the right encoding for that problem. The encoding was "ISO-8859-1", so replacing open("u.item", encoding="utf-8") with open('u.item', encoding = "ISO-8859-1") will solve the problem.


Answer 1

Also worked for me, ISO 8859-1 is going to save a lot, hahaha, mainly if using Speech Recognition API’s

Example:

file = open('../Resources/' + filename, 'r', encoding="ISO-8859-1");

Answer 2

Your file doesn’t actually contain utf-8 encoded data, it contains some other encoding. Figure out what that encoding is and use it in the open call.

In the Windows-1252 encoding, for example, the byte 0xe9 would be the character é.
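
If you have no idea what the encoding might be, the chardet package (pip install chardet) can make a guess from the raw bytes; a minimal sketch:

import chardet

with open('u.item', 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)   # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
print(guess)
text = raw.decode(guess['encoding'])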


Answer 3

Try this to read the file using pandas:

pd.read_csv('u.item', sep='|', names=m_cols , encoding='latin-1')

Answer 4

If you are using Python 2, the following is the solution:

import io
for line in io.open("u.item", encoding="ISO-8859-1"):
    # do something

Because the encoding parameter doesn’t work with Python 2’s open(), you would otherwise get the following error:

TypeError: 'encoding' is an invalid keyword argument for this function

Answer 5

You could resolve the problem with:

for line in open(your_file_path, 'rb'):

‘rb’ reads the file in binary mode. Read more here. Hope this will help!
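
Keep in mind that in binary mode each line comes back as raw bytes, so you still have to decode it yourself once you know (or have guessed) the encoding; for example:

with open('u.item', 'rb') as f:
    for raw_line in f:
        line = raw_line.decode('ISO-8859-1')   # swap in the file's real encoding
        print(line.rstrip('\n'))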


Answer 6

This works:

open('filename', encoding='latin-1')

or:

open('filename',encoding="ISO-8859-1")

Answer 7

If someone is looking for this, this is an example for converting a CSV file in Python 3:

try:
    inputReader = csv.reader(open(argv[1], encoding='ISO-8859-1'), delimiter=',',quotechar='"')
except IOError:
    pass

Answer 8

Sometimes open(filepath), where filepath is not actually a file, raises the same error, so first make sure the file you’re trying to open exists:

import os
assert os.path.isfile(filepath)

hope this will help.


Answer 9

you can try this way:

open('u.item', encoding='utf8', errors='ignore')

What is the difference between encode/decode?

Question: What is the difference between encode/decode?

I’ve never been sure that I understand the difference between str/unicode decode and encode.

I know that str().decode() is for when you have a string of bytes that you know has a certain character encoding, given that encoding name it will return a unicode string.

I know that unicode().encode() converts unicode chars into a string of bytes according to a given encoding name.

But I don’t understand what str().encode() and unicode().decode() are for. Can anyone explain, and possibly also correct anything else I’ve gotten wrong above?

EDIT:

Several answers give info on what .encode does on a string, but no-one seems to know what .decode does for unicode.


Answer 0

The decode method of unicode strings really doesn’t have any applications at all (unless you have some non-text data in a unicode string for some reason — see below). It is mainly there for historical reasons, I think. In Python 3 it is completely gone.

unicode().decode() will perform an implicit encoding of s using the default (ascii) codec. Verify this like so:

>>> s = u'ö'
>>> s.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0:
ordinal not in range(128)

>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0:
ordinal not in range(128)

The error messages are exactly the same.

For str().encode() it’s the other way around — it attempts an implicit decoding of s with the default encoding:

>>> s = 'ö'
>>> s.decode('utf-8')
u'\xf6'
>>> s.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)

Used like this, str().encode() is also superfluous.

But there is another application of the latter method that is useful: there are encodings that have nothing to do with character sets, and thus can be applied to 8-bit strings in a meaningful way:

>>> s.encode('zip')
'x\x9c;\xbc\r\x00\x02>\x01z'

You are right, though: the ambiguous usage of “encoding” for both these applications is… awkward. Again, with separate byte and string types in Python 3, this is no longer an issue.


Answer 1

To represent a unicode string as a string of bytes is known as encoding. Use u'...'.encode(encoding).

Example:

    >>> u'æøå'.encode('utf8')
    '\xc3\xa6\xc3\xb8\xc3\xa5'
    >>> u'æøå'.encode('latin1')
    '\xe6\xf8\xe5'
    >>> u'æøå'.encode('ascii')
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2:
    ordinal not in range(128)

You typically encode a unicode string whenever you need to use it for IO, for instance transfer it over the network, or save it to a disk file.

To convert a string of bytes to a unicode string is known as decoding. Use unicode('...', encoding) or '...'.decode(encoding).

Example:

   >>> u'æøå'
   u'\xe6\xf8\xe5' # the interpreter prints the unicode object like so
   >>> unicode('\xe6\xf8\xe5', 'latin1')
   u'\xe6\xf8\xe5'
   >>> '\xe6\xf8\xe5'.decode('latin1')
   u'\xe6\xf8\xe5'

You typically decode a string of bytes whenever you receive string data from the network or from a disk file.

I believe there are some changes in unicode handling in python 3, so the above is probably not correct for python 3.



Answer 2

anUnicode.encode(‘encoding’) results in a string object and can be called on a unicode object

aString.decode(‘encoding’) results in a unicode object and can be called on a string encoded in the given encoding.


Some more explanations:

You can create some unicode object, which doesn’t have any encoding set. The way it is stored by Python in memory is none of your concern. You can search it, split it and call any string manipulating function you like.

But there comes a time, when you’d like to print your unicode object to console or into some text file. So you have to encode it (for example – in UTF-8), you call encode(‘utf-8’) and you get a string with ‘\u<someNumber>’ inside, which is perfectly printable.

Then, again – you’d like to do the opposite – read string encoded in UTF-8 and treat it as an Unicode, so the \u360 would be one character, not 5. Then you decode a string (with selected encoding) and get brand new object of the unicode type.

Just as a side note – you can select some perverse encoding, like ‘zip’, ‘base64’, ‘rot’ and some of them will convert from string to string, but I believe the most common case is one that involves UTF-8/UTF-16 and string.


Answer 3

mybytestring.encode(somecodec) is meaningful for these values of somecodec:

  • base64
  • bz2
  • zlib
  • hex
  • quopri
  • rot13
  • string_escape
  • uu

I am not sure what decoding an already decoded unicode text is good for. Trying that with any encoding seems to always try to encode with the system’s default encoding first.
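
A quick Python 2 illustration of a few of the codecs listed above (these str-to-str codecs are gone in Python 3, where you would reach for the base64/binascii modules instead):

# Python 2 only: these codecs map str -> str, no character set involved.
data = 'hello world'

print data.encode('base64')              # 'aGVsbG8gd29ybGQ=\n'
print data.encode('hex')                 # '68656c6c6f20776f726c64'
print data.encode('rot13')               # 'uryyb jbeyq'
print data.encode('hex').decode('hex')   # round-trips back to 'hello world'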


Answer 4

There are a few encodings that can be used to de-/encode from str to str or from unicode to unicode. For example base64, hex or even rot13. They are listed in the codecs module.

Edit:

The decode message on a unicode string can undo the corresponding encode operation:

In [1]: u'0a'.decode('hex')
Out[1]: '\n'

The returned type is str instead of unicode which is unfortunate in my opinion. But when you are not doing a proper en-/decode between str and unicode this looks like a mess anyway.


Answer 5

The simple answer is that they are the exact opposite of each other.

The computer uses the very basic unit of byte to store and process information; it is meaningless for human eyes.

For example, '\xe4\xb8\xad\xe6\x96\x87' is the representation of two Chinese characters, but the computer only knows (meaning it can print or store) that they are Chinese characters when it is given a dictionary to look that Chinese word up in; in this case it is a “utf-8” dictionary, and it would fail to correctly show the intended Chinese word if you look into a different or wrong dictionary (using a different decoding method).

In the above case, the process for a computer to look for Chinese word is decode().

And the process of computer writing the Chinese into computer memory is encode().

So the encoded information is the raw bytes, and the decoded information is the raw bytes and the name of the dictionary to reference (but not the dictionary itself).
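
To make the dictionary analogy concrete, here is a small Python 2 illustration (on a UTF-8 terminal) reusing the byte string quoted above:

raw = '\xe4\xb8\xad\xe6\x96\x87'   # the bytes of the two Chinese characters

print raw.decode('utf-8')          # right "dictionary": the two characters appear
print raw.decode('cp1252')         # wrong "dictionary": mojibake comes out instead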


Convert Unicode to ASCII without errors in Python

Question: Convert Unicode to ASCII without errors in Python

My code just scrapes a web page, then converts it to Unicode.

html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)

But I get a UnicodeDecodeError:


Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
    handler.get(*groups)
  File "/Users/greg/clounce/main.py", line 55, in get
    html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere. Can I just drop whatever code bytes are causing the problem instead of getting an error?


Answer 0

2018 Update:

As of February 2018, using compressions like gzip has become quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and Stack Exchange Network sites).
If you do a simple decode like in the original answer on a gzipped response, you’ll get an error like the following or similar:

UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0x8b in position 1: unexpected code byte

In order to decode a gzipped response you need to add the following modules (in Python 3):

import gzip
import io

Note: In Python 2 you’d use StringIO instead of io

Then you can parse the content out like this:

response = urlopen("https://example.com/gzipped-ressource")
buffer = io.BytesIO(response.read()) # Use StringIO.StringIO(response.read()) in Python 2
gzipped_file = gzip.GzipFile(fileobj=buffer)
decoded = gzipped_file.read()
content = decoded.decode("utf-8") # Replace utf-8 with the source encoding of your requested resource

This code reads the response, and places the bytes in a buffer. The gzip module then reads the buffer using the GZipFile function. After that, the gzipped file can be read into bytes again and decoded to normally readable text in the end.

Original Answer from 2010:

Can we get the actual value used for link?

In addition, we usually encounter this problem here when we are trying to .encode() an already encoded byte string. So you might try to decode it first as in

html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")

As an example:

html = '\xa0'
encoded_str = html.encode("utf8")

Fails with

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

While:

html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")

Succeeds without error. Do note that “windows-1252” is something I used as an example. I got this from chardet and it had 0.5 confidence that it is right! (well, as given with a 1-character-length string, what do you expect) You should change that to the encoding of the byte string returned from .urlopen().read() to what applies to the content you retrieved.

Another problem I see there is that the .encode() string method returns the modified string and does not modify the source in place. So it’s kind of useless to have self.response.out.write(html) as html is not the encoded string from html.encode (if that is what you were originally aiming for).

As Ignacio suggested, check the source webpage for the actual encoding of the returned string from read(). It’s either in one of the Meta tags or in the ContentType header in the response. Use that then as the parameter for .decode().

Do note however that it should not be assumed that other developers are responsible enough to make sure the header and/or meta character set declarations match the actual content. (Which is a PITA, yeah, I should know, I was one of those before).


Answer 1

>>> u'aあä'.encode('ascii', 'ignore')
'a'

Decode the string you get back, using either the charset in the appropriate meta tag in the response or in the Content-Type header, then encode.

The method encode(encoding, errors) accepts custom handlers for errors. Besides ignore, the other built-in handlers include:

>>> u'aあä'.encode('ascii', 'replace')
b'a??'
>>> u'aあä'.encode('ascii', 'xmlcharrefreplace')
b'a&#12354;&#228;'
>>> u'aあä'.encode('ascii', 'backslashreplace')
b'a\\u3042\\xe4'

See https://docs.python.org/3/library/stdtypes.html#str.encode


Answer 2

As an extension to Ignacio Vazquez-Abrams’ answer

>>> u'aあä'.encode('ascii', 'ignore')
'a'

It is sometimes desirable to remove accents from characters and print the base form. This can be accomplished with

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore')
'aa'

You may also want to translate other characters (such as punctuation) to their nearest equivalents, for instance the RIGHT SINGLE QUOTATION MARK unicode character does not get converted to an ascii APOSTROPHE when encoding.

>>> print u'\u2019'
’
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
>>> u'\u2019'.encode('ascii', 'ignore')
''
# Note we get an empty string back
>>> u'\u2019'.replace(u'\u2019', u'\'').encode('ascii', 'ignore')
"'"

Although there are more efficient ways to accomplish this; see this question for more details: Where is Python’s “best ASCII for this Unicode” database?


Answer 3

Use unidecode – it converts weird characters to ascii instantly, and even converts Chinese to phonetic ascii.

$ pip install unidecode

then:

>>> from unidecode import unidecode
>>> unidecode(u'北京')
'Bei Jing'
>>> unidecode(u'Škoda')
'Skoda'

Answer 4

I use this helper function throughout all of my projects. If it can’t convert the unicode, it ignores it. This ties into a django library, but with a little research you could bypass it.

from django.utils import encoding

def convert_unicode_to_string(x):
    """
    >>> convert_unicode_to_string(u'ni\xf1era')
    'niera'
    """
    return encoding.smart_str(x, encoding='ascii', errors='ignore')

I no longer get any unicode errors after using this.
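
If you would rather not pull in Django just for this, here is a Django-free sketch of the same idea for Python 3 (the function name is kept only to mirror the snippet above):

def convert_unicode_to_string(x):
    """
    Drop anything that cannot be represented in ASCII.

    >>> convert_unicode_to_string('ni\xf1era')
    'niera'
    """
    return x.encode('ascii', errors='ignore').decode('ascii')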


Answer 5

For broken consoles like cmd.exe and HTML output you can always use:

my_unicode_string.encode('ascii','xmlcharrefreplace')

This will preserve all the non-ASCII characters while keeping them printable in pure ASCII and in HTML.

WARNING: If you use this in production code to avoid errors, then most likely there is something wrong in your code. The only valid use cases for this are printing to a non-Unicode console or easy conversion to HTML entities in an HTML context.

And finally, if you are on Windows and use cmd.exe, you can type chcp 65001 to enable UTF-8 output (it works with the Lucida Console font). You might need to add myUnicodeString.encode('utf8').
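
For illustration, here is what xmlcharrefreplace produces on a made-up sample string (Python 3 session; the string itself is just an example):

>>> 'café & 北京'.encode('ascii', 'xmlcharrefreplace')
b'caf&#233; & &#21271;&#20140;'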


Answer 6

You wrote “””I assume that means the HTML contains some wrongly-formed attempt at unicode somewhere.”””

The HTML is NOT expected to contain any kind of “attempt at unicode”, well-formed or not. It must of necessity contain Unicode characters encoded in some encoding, which is usually declared up front; look for “charset”.

You appear to be assuming that the charset is UTF-8; on what grounds? The “\xA0” byte shown in your error message indicates that you may have a single-byte charset, e.g. cp1252.

If you can’t get any sense out of the declaration at the start of the HTML, try using chardet to find out what the likely encoding is.
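
For reference, a minimal chardet sketch (it assumes the chardet package is installed and uses a made-up file name); note that detect() returns a guess with a confidence, not a certainty:

import chardet

raw = open('page.html', 'rb').read()   # read raw bytes, not text; the file name is a placeholder
guess = chardet.detect(raw)            # e.g. {'encoding': 'windows-1252', 'confidence': 0.73, ...}
text = raw.decode(guess['encoding'] or 'latin-1', errors='replace')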

Why have you tagged your question with “regex”?

Update after you replaced your whole question with a non-question:

html = urllib.urlopen(link).read()
# html refers to a str object. To get unicode, you need to find out
# how it is encoded, and decode it.

html.encode("utf8","ignore")
# problem 1: will fail because html is a str object;
# encode works on unicode objects so Python tries to decode it using 
# 'ascii' and fails
# problem 2: even if it worked, the result will be ignored; it doesn't 
# update html in situ, it returns a function result.
# problem 3: "ignore" with UTF-n: any valid unicode object 
# should be encodable in UTF-n; error implies end of the world,
# don't try to ignore it. Don't just whack in "ignore" willy-nilly,
# put it in only with a comment explaining your very cogent reasons for doing so.
# "ignore" with most other encodings: error implies that you are mistaken
# in your choice of encoding -- same advice as for UTF-n :-)
# "ignore" with decode latin1 aka iso-8859-1: error implies end of the world.
# Irrespective of error or not, you are probably mistaken
# (needing e.g. cp1252 or even cp850 instead) ;-)

Answer 7

If you have a string line, you can use the .encode([encoding], [errors='strict']) method on the string to convert it to the given encoding.

line = 'my big string'

line.encode('ascii', 'ignore')

For more information about handling ASCII and unicode in Python, this is a really useful site: https://docs.python.org/2/howto/unicode.html


Answer 8

I think the answer is already there, but only in bits and pieces, which makes it difficult to quickly fix a problem such as

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

Let’s take an example. Suppose I have a file containing data in the following form (with both ASCII and non-ASCII characters):

1/10/17, 21:36 – Land : Welcome ��

and we want to ignore the non-ASCII characters and preserve only the ASCII ones.

This (Python 2) code will do it:

import unicodedata
fp  = open(<FILENAME>)
for line in fp:
    rline = line.strip()
    rline = unicode(rline, "utf-8")
    rline = unicodedata.normalize('NFKD', rline).encode('ascii','ignore')
    if len(rline) != 0:
        print rline

and type(rline) will give you

>type(rline) 
<type 'str'>
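
Since that snippet is Python 2, a rough Python 3 equivalent might look like this (keeping the <FILENAME> placeholder from above; the UTF-8 assumption and the 'replace' fallback are mine):

import unicodedata

with open(<FILENAME>, 'rb') as fp:          # <FILENAME> is still a placeholder
    for line in fp:
        # decode explicitly, then strip anything that will not survive an ASCII encode
        rline = line.strip().decode('utf-8', errors='replace')
        rline = unicodedata.normalize('NFKD', rline).encode('ascii', 'ignore').decode('ascii')
        if rline:
            print(rline)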

Answer 9

unicodestring = '\xa0'

decoded_str = unicodestring.decode("windows-1252")
encoded_str = decoded_str.encode('ascii', 'ignore')

Works for me
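
In Python 3, where str objects no longer have a decode method, the same idea is written against the raw bytes; a minimal sketch:

raw_bytes = b'\xa0'                                  # the byte you actually received
decoded_str = raw_bytes.decode('windows-1252')       # '\xa0' is NO-BREAK SPACE in cp1252
encoded_str = decoded_str.encode('ascii', 'ignore')  # b'' (the character is simply dropped)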


Answer 10

Looks like you are using Python 2.x. Python 2.x defaults to ASCII and does not assume any Unicode encoding; hence the exception.

Just paste the line below right after the shebang and it will work:

# -*- coding: utf-8 -*-

Best way to convert a string to bytes in Python 3?

Question: Best way to convert a string to bytes in Python 3?

There appear to be two different ways to convert a string to bytes, as seen in the answers to TypeError: ‘str’ does not support the buffer interface

Which of these methods would be better or more Pythonic? Or is it just a matter of personal preference?

b = bytes(mystring, 'utf-8')

b = mystring.encode('utf-8')

Answer 0

If you look at the docs for bytes, it points you to bytearray:

bytearray([source[, encoding[, errors]]])

Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256. It has most of the usual methods of mutable sequences, described in Mutable Sequence Types, as well as most methods that the bytes type has, see Bytes and Byte Array Methods.

The optional source parameter can be used to initialize the array in a few different ways:

If it is a string, you must also give the encoding (and optionally, errors) parameters; bytearray() then converts the string to bytes using str.encode().

If it is an integer, the array will have that size and will be initialized with null bytes.

If it is an object conforming to the buffer interface, a read-only buffer of the object will be used to initialize the bytes array.

If it is an iterable, it must be an iterable of integers in the range 0 <= x < 256, which are used as the initial contents of the array.

Without an argument, an array of size 0 is created.

So bytes can do much more than just encode a string. It’s Pythonic that it would allow you to call the constructor with any type of source parameter that makes sense.

For encoding a string, I think that some_string.encode(encoding) is more Pythonic than using the constructor, because it is the most self-documenting: “take this string and encode it with this encoding” is clearer than bytes(some_string, encoding), where there is no explicit verb.

Edit: I checked the Python source. If you pass a unicode string to bytes using CPython, it calls PyUnicode_AsEncodedString, which is the implementation of encode; so you’re just skipping a level of indirection if you call encode yourself.

Also, see Serdalis’ comment — unicode_string.encode(encoding) is also more Pythonic because its inverse is byte_string.decode(encoding) and symmetry is nice.
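
To illustrate the point that bytes accepts several kinds of source argument, here is a short interactive sketch (Python 3; the sample values are arbitrary):

>>> bytes('aあä', 'utf-8')        # str plus an encoding
b'a\xe3\x81\x82\xc3\xa4'
>>> bytes(3)                      # an int: that many null bytes
b'\x00\x00\x00'
>>> bytes([97, 98, 99])           # an iterable of ints in range(256)
b'abc'
>>> 'aあä'.encode('utf-8')        # the equivalent, more explicit spelling
b'a\xe3\x81\x82\xc3\xa4'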


Answer 1

It’s easier than you might think:

my_str = "hello world"
my_str_as_bytes = str.encode(my_str)
type(my_str_as_bytes) # ensure it is byte representation
my_decoded_str = my_str_as_bytes.decode()
type(my_decoded_str) # ensure it is string representation

Answer 2

The absolute best way is neither of the two, but a third: the first parameter of encode has defaulted to 'utf-8' ever since Python 3.0. Thus the best way is

b = mystring.encode()

This is also faster, because with the default argument the C code receives NULL rather than the string "utf-8", and NULL is much faster to check!

Here are some timings:

In [1]: %timeit -r 10 'abc'.encode('utf-8')
The slowest run took 38.07 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 183 ns per loop

In [2]: %timeit -r 10 'abc'.encode()
The slowest run took 27.34 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 137 ns per loop

Despite the warning, the times were very stable after repeated runs; the deviation was just ~2 per cent.


Using encode() without an argument is not Python 2 compatible, as in Python 2 the default character encoding is ASCII.

>>> 'äöä'.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)