Tag Archives: unicode

UnicodeDecodeError when redirecting to a file

Question: UnicodeDecodeError when redirecting to a file

I run this snippet twice, in the Ubuntu terminal (encoding set to utf-8), once with ./test.py and then with ./test.py >out.txt:

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni

Without redirection it prints garbage. With redirection I get a UnicodeDecodeError. Can someone explain why I get the error only in the second case, or even better give a detailed explanation of what’s going on behind the curtain in both cases?


Answer 0

The whole key to such encoding problems is to understand that there are in principle two distinct concepts of “string”: (1) string of characters, and (2) string/array of bytes. This distinction has been mostly ignored for a long time because of the historic ubiquity of encodings with no more than 256 characters (ASCII, Latin-1, Windows-1252, Mac OS Roman,…): these encodings map a set of common characters to numbers between 0 and 255 (i.e. bytes); the relatively limited exchange of files before the advent of the web made this situation of incompatible encodings tolerable, as most programs could ignore the fact that there were multiple encodings as long as they produced text that remained on the same operating system: such programs would simply treat text as bytes (through the encoding used by the operating system). The correct, modern view properly separates these two string concepts, based on the following two points:

  1. Characters are mostly unrelated to computers: one can draw them on a chalk board, etc., like for instance بايثون, 中蟒 and 🐍. “Characters” for machines also include “drawing instructions” like for example spaces, carriage return, instructions to set the writing direction (for Arabic, etc.), accents, etc. A very large character list is included in the Unicode standard; it covers most of the known characters.

  2. On the other hand, computers do need to represent abstract characters in some way: for this, they use arrays of bytes (numbers between 0 and 255 included), because their memory comes in byte chunks. The necessary process that converts characters to bytes is called encoding. Thus, a computer requires an encoding in order to represent characters. Any text present on your computer is encoded (until it is displayed), whether it be sent to a terminal (which expects characters encoded in a specific way), or saved in a file. In order to be displayed or properly “understood” (by, say, the Python interpreter), streams of bytes are decoded into characters. A few encodings (UTF-8, UTF-16,…) are defined by Unicode for its list of characters (Unicode thus defines both a list of characters and encodings for these characters—there are still places where one sees the expression “Unicode encoding” as a way to refer to the ubiquitous UTF-8, but this is incorrect terminology, as Unicode provides multiple encodings).

In summary, computers need to internally represent characters with bytes, and they do so through two operations:

Encoding: characters → bytes

Decoding: bytes → characters

Some encodings cannot encode all characters (e.g., ASCII), while (some) Unicode encodings allow you to encode all Unicode characters. The encoding is also not necessarily unique, because some characters can be represented either directly or as a combination (e.g. of a base character and of accents).
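
As a minimal Python 3 sketch of both points (assuming a UTF-8-capable terminal, with é as the example character):

# Encoding: characters -> bytes; decoding: bytes -> characters.
s = 'caf\u00e9'                     # 'café', with é as a single code point
data = s.encode('utf-8')            # b'caf\xc3\xa9'
assert data.decode('utf-8') == s    # round trip

try:
    s.encode('ascii')               # ASCII cannot encode every character
except UnicodeEncodeError as exc:
    print(exc)

# Encodings are not unique: é can also be written as e + combining accent.
import unicodedata
decomposed = 'cafe\u0301'
print(s == decomposed)                                # False
print(unicodedata.normalize('NFC', decomposed) == s)  # True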

Note that the concept of newline adds a layer of complication, since it can be represented by different (control) characters that depend on the operating system (this is the reason for Python’s universal newline file reading mode).

Now, what I have called “character” above is what Unicode calls a “user-perceived character”. A single user-perceived character can sometimes be represented in Unicode by combining character parts (base character, accents,…) found at different indexes in the Unicode list, which are called “code points”—these code points can be combined together to form a “grapheme cluster”. Unicode thus leads to a third concept of string, made of a sequence of Unicode code points, that sits between byte and character strings, and which is closer to the latter. I will call them “Unicode strings” (like in Python 2).

While Python can print strings of (user-perceived) characters, Python non-byte strings are essentially sequences of Unicode code points, not of user-perceived characters. The code point values are the ones used in Python’s \u and \U Unicode string syntax. They should not be confused with the encoding of a character (and do not have to bear any relationship with it: Unicode code points can be encoded in various ways).

This has an important consequence: the length of a Python (Unicode) string is its number of code points, which is not always its number of user-perceived characters: thus s = "\u1100\u1161\u11a8"; print(s, "len", len(s)) (Python 3) gives 각 len 3 despite s having a single user-perceived (Korean) character (because it is represented with 3 code points—even if it does not have to, as print("\uac01") shows). However, in many practical circumstances, the length of a string is its number of user-perceived characters, because many characters are typically stored by Python as a single Unicode code point.
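
A short Python 3 sketch of that Korean example; normalizing with unicodedata is one way (not the only one) to collapse the jamo sequence into the single precomposed code point:

import unicodedata

s = '\u1100\u1161\u11a8'       # one user-perceived character, three code points
print(len(s))                  # 3
t = unicodedata.normalize('NFC', s)
print(t == '\uac01', len(t))   # True 1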

In Python 2, Unicode strings are called… “Unicode strings” (unicode type, literal form u"…"), while byte arrays are “strings” (str type, where the array of bytes can for instance be constructed with string literals "…"). In Python 3, Unicode strings are simply called “strings” (str type, literal form "…"), while byte arrays are “bytes” (bytes type, literal form b"…"). As a consequence, something like "🐍"[0] gives a different result in Python 2 ('\xf0', a byte) and Python 3 ("🐍", the first and only character).
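
A minimal Python 3 sketch of this difference (the snake character is written with its \U escape so the source stays ASCII):

s = '\U0001F40D'         # the snake, a single code point
b = s.encode('utf-8')    # b'\xf0\x9f\x90\x8d', four bytes
print(s[0] == s)         # True: indexing a str yields a code point
print(b[0], hex(b[0]))   # 240 0xf0: indexing bytes yields an integer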

With these few key points, you should be able to understand most encoding related questions!


Normally, when you print u"…" to a terminal, you should not get garbage: Python knows the encoding of your terminal. In fact, you can check what encoding the terminal expects:

% python
Python 2.7.6 (default, Nov 15 2013, 15:20:37) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.stdout.encoding
UTF-8

If your input characters can be encoded with the terminal’s encoding, Python will do so and will send the corresponding bytes to your terminal without complaining. The terminal will then do its best to display the characters after decoding the input bytes (at worst the terminal font does not have some of the characters and will print some kind of blank instead).

If your input characters cannot be encoded with the terminal’s encoding, then it means that the terminal is not configured for displaying these characters. Python will complain (in Python with a UnicodeEncodeError since the character string cannot be encoded in a way that suits your terminal). The only possible solution is to use a terminal that can display the characters (either by configuring the terminal so that it accepts an encoding that can represent your characters, or by using a different terminal program). This is important when you distribute programs that can be used in different environments: messages that you print should be representable in the user’s terminal. Sometimes it is thus best to stick to strings that only contain ASCII characters.
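
You can reproduce this failure mode (and a forgiving fallback) without reconfiguring anything, by encoding by hand; a Python 2 sketch, with ASCII standing in for a limited terminal encoding:

# -*- coding: utf-8 -*-
s = u'\u2265'                           # GREATER-THAN OR EQUAL TO
try:
    print s.encode('ascii')             # fails: not representable in ASCII
except UnicodeEncodeError:
    print s.encode('ascii', 'replace')  # prints '?' instead of failing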

However, when you redirect or pipe the output of your program, then it is generally not possible to know what the input encoding of the receiving program is, and the above code returns some default encoding: None (Python 2.7) or UTF-8 (Python 3):

% python2.7 -c "import sys; print sys.stdout.encoding" | cat
None
% python3.4 -c "import sys; print(sys.stdout.encoding)" | cat
UTF-8

The encoding of stdin, stdout and stderr can however be set through the PYTHONIOENCODING environment variable, if needed:

% PYTHONIOENCODING=UTF-8 python2.7 -c "import sys; print sys.stdout.encoding" | cat
UTF-8

If printing to a terminal does not produce what you expect, you can check whether the UTF-8 encoding that you put in manually is correct; for instance, your first character (\u001A) is not printable, if I’m not mistaken.

At http://wiki.python.org/moin/PrintFails, you can find a solution like the following, for Python 2.x:

import codecs
import locale
import sys

# Wrap sys.stdout into a StreamWriter to allow writing unicode.
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout) 

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni

For Python 3, you can check one of the questions asked previously on StackOverflow.
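
For reference, one Python 3 option is to reconfigure sys.stdout yourself; a hedged sketch (sys.stdout.reconfigure() exists since Python 3.7; older 3.x can rewrap the underlying buffer instead):

import io
import sys

# Python 3.7+: change the encoding/error handling of sys.stdout in place.
sys.stdout.reconfigure(encoding='utf-8', errors='replace')

# Older Python 3 alternative:
# sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', errors='replace')

print('\u0BC3\u1451\U0001D10C')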


Answer 1

Python always encodes Unicode strings when writing to a terminal, file, pipe, etc. When writing to a terminal Python can usually determine the encoding of the terminal and use it correctly. When writing to a file or pipe Python defaults to the ‘ascii’ encoding unless explicitly told otherwise. Python can be told what to do when piping output through the PYTHONIOENCODING environment variable. A shell can set this variable before redirecting Python output to a file or pipe so the correct encoding is known.

In your case you've printed 4 uncommon characters that your terminal didn't support in its font. Here are some examples to help explain the behavior, with characters that are actually supported by my terminal (which uses cp437, not UTF-8).

Example 1

Note that the #coding comment indicates the encoding in which the source file is saved. I chose utf8 so I could support characters in the source that my terminal could not. The encoding is written to stderr so that it can still be seen when stdout is redirected to a file.

#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ'
print >>sys.stderr,sys.stdout.encoding
print uni

Output (run directly from terminal)

cp437
αßΓπΣσµτΦΘΩδ∞φ

Python correctly determined the encoding of the terminal.

Output (redirected to file)

None
Traceback (most recent call last):
  File "C:\ex.py", line 5, in <module>
    print uni
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)

Python could not determine the encoding (None), so it used the 'ascii' default. ASCII only supports converting the first 128 characters of Unicode.

Output (redirected to file, PYTHONIOENCODING=cp437)

cp437

and my output file was correct:

C:\>type out.txt
αßΓπΣσµτΦΘΩδ∞φ

Example 2

Now I’ll throw in a character in the source that isn’t supported by my terminal:

#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ马' # added Chinese character at end.
print >>sys.stderr,sys.stdout.encoding
print uni

Output (run directly from terminal)

cp437
Traceback (most recent call last):
  File "C:\ex.py", line 5, in <module>
    print uni
  File "C:\Python26\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u9a6c' in position 14: character maps to <undefined>

My terminal didn’t understand that last Chinese character.

Output (run directly, PYTHONIOENCODING=437:replace)

cp437
αßΓπΣσµτΦΘΩδ∞φ?

Error handlers can be specified with the encoding. In this case unknown characters were replaced with ?. ignore and xmlcharrefreplace are some other options. When using UTF8 (which supports encoding all Unicode characters) replacements will never be made, but the font used to display the characters must still support them.
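
A small Python 2 sketch of those error handlers, reusing the Chinese character from Example 2 (U+9A6C is code point 39532 in decimal):

# -*- coding: utf-8 -*-
u = u'\u9a6c'
print u.encode('cp437', 'replace')            # '?'
print u.encode('cp437', 'ignore')             # '' (character dropped)
print u.encode('cp437', 'xmlcharrefreplace')  # '&#39532;'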


Answer 2

Encode it while printing

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni.encode("utf-8")

This is because when you run the script manually, Python encodes the output before sending it to the terminal; when you pipe it, Python does not encode it itself, so you have to encode manually when doing I/O.
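
If you only want the manual encode when output is actually piped or redirected, a Python 2 sketch can key off the sys.stdout.encoding check shown in the first answer:

import sys

uni = u"\u001A\u0BC3\u1451\U0001D10C"
if sys.stdout.encoding is None:    # piped or redirected: encode by hand
    print uni.encode('utf-8')
else:
    print uni                      # terminal: Python encodes for you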


How to make a unicode string in Python 3

Question: How to make a unicode string in Python 3

I used this :

u = unicode(text, 'utf-8')

But getting error with Python 3 (or… maybe I just forgot to include something) :

NameError: global name 'unicode' is not defined

Thank you.


Answer 0

Literal strings are unicode by default in Python3.

Assuming that text is a bytes object, just use text.decode('utf-8')

unicode in Python 2 is equivalent to str in Python 3, so you can also write:

str(text, 'utf-8')

if you prefer.
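
For instance, assuming text holds UTF-8-encoded bytes (Python 3):

text = b'caf\xc3\xa9'          # UTF-8-encoded bytes
print(text.decode('utf-8'))    # café
print(str(text, 'utf-8'))      # café, the equivalent spelling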


Answer 1

What’s new in Python 3.0 says:

All text is Unicode; however encoded Unicode is represented as binary data

If you want to ensure you are outputting utf-8, here’s an example from this page on unicode in 3.0:

b'\x80abc'.decode("utf-8", "strict")
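
Note that this particular example raises under "strict", since 0x80 is not a valid UTF-8 start byte; the error-handler argument controls what happens instead. A sketch ("backslashreplace" works for decoding on Python 3.5+):

b = b'\x80abc'
try:
    b.decode('utf-8', 'strict')
except UnicodeDecodeError as exc:
    print(exc)
print(b.decode('utf-8', 'replace'))           # '\ufffdabc'
print(b.decode('utf-8', 'backslashreplace'))  # '\\x80abc'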

Answer 2

As a workaround, I’ve been using this:

# Fix Python 2.x.
try:
    UNICODE_EXISTS = bool(type(unicode))
except NameError:
    unicode = lambda s: str(s)

Answer 3

This is how I solved my problem of converting characters like \uFE0F, \u000A, etc., and also emoji encoded as UTF-16 surrogate pairs.

example = 'raw vegan chocolate cocoa pie w chocolate &amp; vanilla cream\\uD83D\\uDE0D\\uD83D\\uDE0D\\u2764\\uFE0F Present Moment Caf\\u00E8 in St.Augustine\\u2764\\uFE0F\\u2764\\uFE0F '
import codecs
new_str = codecs.unicode_escape_decode(example)[0]
print(new_str)
>>> 'raw vegan chocolate cocoa pie w chocolate &amp; vanilla cream\ud83d\ude0d\ud83d\ude0d❤️ Present Moment Cafè in St.Augustine❤️❤️ '
new_new_str = new_str.encode('utf-16', 'surrogatepass').decode('utf-16')
print(new_new_str)
>>> 'raw vegan chocolate cocoa pie w chocolate &amp; vanilla cream😍😍❤️ Present Moment Cafè in St.Augustine❤️❤️ '

Answer 4

In a Python 2 program that I used for many years there was this line:

ocd[i].namn=unicode(a[:b], 'utf-8')

This did not work in Python 3.

However, the program turned out to work with:

ocd[i].namn=a[:b]

I don’t remember why I put unicode there in the first place, but I think it was because the name can contain Swedish letters åäöÅÄÖ. But even those work without unicode.


Answer 5

The easiest way in Python 3.x:

text = "hi , I'm text"
text.encode('utf-8')
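
Keep in mind that in Python 3 the literal is already a Unicode str; encode() goes the other way and returns bytes. A quick check:

text = "hi , I'm text"
data = text.encode('utf-8')    # bytes
print(type(text), type(data))  # <class 'str'> <class 'bytes'>
print(data.decode('utf-8'))    # back to str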

MySQL "Incorrect string value" error when saving a unicode string in Django

Question: MySQL "Incorrect string value" error when saving a unicode string in Django

I got a strange error message when I tried to save first_name and last_name to Django’s auth_user model.

Failed examples

user = User.object.create_user(username, email, password)
user.first_name = u'Rytis'
user.last_name = u'Slatkevičius'
user.save()
>>> Incorrect string value: '\xC4\x8Dius' for column 'last_name' at row 104

user.first_name = u'Валерий'
user.last_name = u'Богданов'
user.save()
>>> Incorrect string value: '\xD0\x92\xD0\xB0\xD0\xBB...' for column 'first_name' at row 104

user.first_name = u'Krzysztof'
user.last_name = u'Szukiełojć'
user.save()
>>> Incorrect string value: '\xC5\x82oj\xC4\x87' for column 'last_name' at row 104

Successful examples

user.first_name = u'Marcin'
user.last_name = u'Król'
user.save()
>>> SUCCEED

MySQL settings

mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | utf8                       | 
| character_set_filesystem | binary                     | 
| character_set_results    | utf8                       | 
| character_set_server     | utf8                       | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

Table charset and collation

Table auth_user has utf-8 charset with utf8_general_ci collation.

Results of UPDATE command

No error was raised when updating the above values in the auth_user table using an UPDATE command.

mysql> update auth_user set last_name='Slatkevičiusa' where id=1;
Query OK, 1 row affected, 1 warning (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select last_name from auth_user where id=100;
+---------------+
| last_name     |
+---------------+
| Slatkevi?iusa | 
+---------------+
1 row in set (0.00 sec)

PostgreSQL

The failing values listed above could be saved into the PostgreSQL table when I switched the database backend in Django. It’s strange.

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
...
| utf8     | UTF-8 Unicode               | utf8_general_ci     |      3 | 
...

But from http://www.postgresql.org/docs/8.1/interactive/multibyte.html, I found the following:

Name Bytes/Char
UTF8 1-4

Does this mean that a Unicode character has a maximum length of 4 bytes in PostgreSQL but 3 bytes in MySQL, and that this caused the above error?


Answer 0

None of these answers solved the problem for me. The root cause being:

You cannot store 4-byte characters in MySQL with the utf-8 character set.

MySQL has a 3 byte limit on utf-8 characters (yes, it’s wack, nicely summed up by a Django developer here).

To solve this you need to:

  1. Change your MySQL database, table and columns to use the utf8mb4 character set (only available from MySQL 5.5 onwards)
  2. Specify the charset in your Django settings file as below:

settings.py

DATABASES = {
    'default': {
        'ENGINE':'django.db.backends.mysql',
        ...
        'OPTIONS': {'charset': 'utf8mb4'},
    }
}
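
For step 1, a hedged sketch of the conversion in the style of the MySQLdb scripts further down this page (the host, credentials, and table name are placeholders to adjust):

import MySQLdb

db = MySQLdb.connect(host='localhost', user='youruser', passwd='passwd', db='yourdbname')
cursor = db.cursor()

# Convert the database default and one table to utf8mb4 (MySQL 5.5+).
cursor.execute("ALTER DATABASE `yourdbname` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci")
cursor.execute("ALTER TABLE `auth_user` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci")
db.close()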

Note: When recreating your database you may run into the ‘Specified key was too long‘ issue.

The most likely cause is a CharField which has a max_length of 255 and some kind of index on it (e.g. unique). Because utf8mb4 uses 33% more space than utf-8 you’ll need to make these fields 33% smaller.

In this case, change the max_length from 255 to 191.

Alternatively, you can edit your MySQL configuration to remove this restriction, but not without some django hackery.

UPDATE: I just ran into this issue again and ended up switching to PostgreSQL because I was unable to reduce my VARCHAR to 191 characters.


Answer 1

I had the same problem and resolved it by changing the character set of the column. Even though your database has a default character set of utf-8, I think it’s possible for database columns in MySQL to have a different character set. Here’s the SQL query I used:

    ALTER TABLE database.table MODIFY COLUMN col VARCHAR(255)
    CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;

Answer 2

If you have this problem, here’s a Python script to change all the columns of your MySQL database automatically:

#! /usr/bin/env python
import MySQLdb

host = "localhost"
passwd = "passwd"
user = "youruser"
dbname = "yourdbname"

db = MySQLdb.connect(host=host, user=user, passwd=passwd, db=dbname)
cursor = db.cursor()

cursor.execute("ALTER DATABASE `%s` CHARACTER SET 'utf8' COLLATE 'utf8_unicode_ci'" % dbname)

sql = "SELECT DISTINCT(table_name) FROM information_schema.columns WHERE table_schema = '%s'" % dbname
cursor.execute(sql)

results = cursor.fetchall()
for row in results:
  sql = "ALTER TABLE `%s` convert to character set DEFAULT COLLATE DEFAULT" % (row[0])
  cursor.execute(sql)
db.close()

Answer 3

If it’s a new project, I’d just drop the database, and create a new one with a proper charset:

CREATE DATABASE <dbname> CHARACTER SET utf8;

Answer 4

I just figured out one method to avoid the above errors.

Save to database

user.first_name = u'Rytis'.encode('unicode_escape')
user.last_name = u'Slatkevičius'.encode('unicode_escape')
user.save()
>>> SUCCEED

print user.last_name
>>> Slatkevi\u010dius
print user.last_name.decode('unicode_escape')
>>> Slatkevičius

Is this the only method to save strings like that into a MySQL table and decode it before rendering to templates for display?


Answer 5

You can change the collation of your text field to UTF8_general_ci and the problem will be solved.

Notice, this cannot be done in Django.


Answer 6

You aren’t trying to save unicode strings, you’re trying to save bytestrings in the UTF-8 encoding. Make them actual unicode string literals:

user.last_name = u'Slatkevičius'

or (when you don’t have string literals) decode them using the utf-8 encoding:

user.last_name = lastname.decode('utf-8')

Answer 7

Simply alter your table; no need to do anything else. Just run this query on the database: ALTER TABLE table_name CONVERT TO CHARACTER SET utf8

It will definitely work.


Answer 8

An improvement on @madprops' answer: the solution as a Django management command:

import MySQLdb
from django.conf import settings

from django.core.management.base import BaseCommand


class Command(BaseCommand):

    def handle(self, *args, **options):
        host = settings.DATABASES['default']['HOST']
        password = settings.DATABASES['default']['PASSWORD']
        user = settings.DATABASES['default']['USER']
        dbname = settings.DATABASES['default']['NAME']

        db = MySQLdb.connect(host=host, user=user, passwd=password, db=dbname)
        cursor = db.cursor()

        cursor.execute("ALTER DATABASE `%s` CHARACTER SET 'utf8' COLLATE 'utf8_unicode_ci'" % dbname)

        sql = "SELECT DISTINCT(table_name) FROM information_schema.columns WHERE table_schema = '%s'" % dbname
        cursor.execute(sql)

        results = cursor.fetchall()
        for row in results:
            print(f'Changing table "{row[0]}"...')
            sql = "ALTER TABLE `%s` convert to character set DEFAULT COLLATE DEFAULT" % (row[0])
            cursor.execute(sql)
        db.close()

Hope this helps anybody but me :)
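
Assuming the file is saved as, say, yourapp/management/commands/convert_charset.py (the path and command name here are hypothetical; the file name becomes the command name), it would be run with python manage.py convert_charset.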


Python: using .format() on a Unicode-escaped string

Question: Python: using .format() on a Unicode-escaped string

I am using Python 2.6.5. My code requires the use of the “more than or equal to” sign. Here it goes:

>>> s = u'\u2265'
>>> print s
≥
>>> print "{0}".format(s)
Traceback (most recent call last):
     File "<input>", line 1, in <module> 
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2265'
  in position 0: ordinal not in range(128)

Why do I get this error? Is there a right way to do this? I need to use the .format() function.


Answer 0

Just make the second string also a unicode string

>>> s = u'\u2265'
>>> print s
≥
>>> print "{0}".format(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2265' in position 0: ordinal not in range(128)
>>> print u"{0}".format(s)
≥
>>> 

Answer 1

Unicode strings need unicode format strings.

>>> print u'{0}'.format(s)
≥

Answer 2

A bit more information on why that happens.

>>> s = u'\u2265'
>>> print s

works because print automatically uses the system encoding for your environment, which was likely set to UTF-8. (You can check by doing import sys; print sys.stdout.encoding)

>>> print "{0}".format(s)

fails because format tries to match the encoding of the type that it is called on (I couldn’t find documentation on this, but this is the behavior I’ve noticed). Since string literals are byte strings encoded as ASCII in Python 2, format tries to encode s as ASCII, which then results in that exception. Observe:

>>> s = u'\u2265'
>>> s.encode('ascii')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2265' in position 0: ordinal not in range(128)

So that is basically why these approaches work:

>>> s = u'\u2265'
>>> print u'{}'.format(s)
≥
>>> print '{}'.format(s.encode('utf-8'))
≥

The source character set is defined by the encoding declaration; it is ASCII if no encoding declaration is given in the source file (https://docs.python.org/2/reference/lexical_analysis.html#string-literals)


u'\ufeff' in a Python string

Question: u'\ufeff' in a Python string

I get an error with the following pattern:

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 155: ordinal not in range(128)

I’m not sure what u'\ufeff' is; it shows up when I’m web scraping. How can I remedy the situation? The .replace() string method doesn’t work on it.


Answer 0

The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you. Examples:

#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print 'utf-8     %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16    %r' % e16
print 'utf-16le  %r' % e16le
print 'utf-16be  %r' % e16be
print
print 'utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8')
print 'utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le')

Note that EF BB BF is a UTF-8-encoded BOM. It is not required for UTF-8, but serves only as a signature (usually on Windows).

Output:

utf-8     'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16    '\xff\xfeA\x00B\x00C\x00'    # Adds BOM and encodes using native processor endian-ness.
utf-16le  'A\x00B\x00C\x00'
utf-16be  '\x00A\x00B\x00C'

utf-8  w/ BOM decoded with utf-8     u'\ufeffABC'    # doesn't remove BOM if present.
utf-8  w/ BOM decoded with utf-8-sig u'ABC'          # removes BOM if present.
utf-16 w/ BOM decoded with utf-16    u'ABC'          # *requires* BOM to be present.
utf-16 w/ BOM decoded with utf-16le  u'\ufeffABC'    # doesn't remove BOM if present.

Note that the utf-16 codec requires BOM to be present, or Python won’t know if the data is big- or little-endian.


Answer 1

I ran into this on Python 3 and found this question (and solution). When opening a file, Python 3 supports the encoding keyword to automatically handle the encoding.

Without it, the BOM is included in the read result:

>>> f = open('file', mode='r')
>>> f.read()
'\ufefftest'

Given the correct encoding, the BOM is omitted from the result:

>>> f = open('file', mode='r', encoding='utf-8-sig')
>>> f.read()
'test'

Just my 2 cents.


Answer 2

That character is the BOM or “Byte Order Mark”. It is usually received as the first few bytes of a file, telling you how to interpret the encoding of the rest of the data. You can simply remove the character to continue. Although, since the error says you were trying to convert to ‘ascii’, you should probably pick another encoding for whatever you were trying to do.
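
For example, if the BOM has already been decoded into your string, stripping it is straightforward (a sketch: lstrip removes a leading BOM, replace removes it anywhere; it only looks like replace "doesn't work" when it is applied to the raw bytes instead of the decoded string):

scraped = u'\ufeffsome scraped text'
print(scraped.lstrip(u'\ufeff'))
print(scraped.replace(u'\ufeff', u''))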


Answer 3

The content you’re scraping is encoded in unicode rather than ascii text, and you’re getting a character that doesn’t convert to ascii. The right ‘translation’ depends on what the original web page thought it was. Python’s unicode page gives the background on how it works.

Are you trying to print the result or stick it in a file? The error suggests it’s writing the data that’s causing the problem, not reading it. This question is a good place to look for the fixes.


Answer 4

This is based on the answer from Mark Tolonen. The string includes the word ‘test’ in different languages, separated by ‘|’, so you can see the difference.

u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print('utf-8     %r' % e8)
print('utf-8-sig %r' % e8s)
print('utf-16    %r' % e16)
print('utf-16le  %r' % e16le)
print('utf-16be  %r' % e16be)
print()
print('utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8'))
print('utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
print('utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16'))
print('utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le'))

Here is a test run:

>>> u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> e8 = u.encode('utf-8')        # encode without BOM
>>> e8s = u.encode('utf-8-sig')   # encode with BOM
>>> e16 = u.encode('utf-16')      # encode with BOM
>>> e16le = u.encode('utf-16le')  # encode without BOM
>>> e16be = u.encode('utf-16be')  # encode without BOM
>>> print('utf-8     %r' % e8)
utf-8     b'ABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-8-sig %r' % e8s)
utf-8-sig b'\xef\xbb\xbfABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-16    %r' % e16)
utf-16    b"\xff\xfeA\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16le  %r' % e16le)
utf-16le  b"A\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16be  %r' % e16be)
utf-16be  b"\x00A\x00B\x00C\x00t\x00e\x00s\x00t\x03\xb2\x8c\x9dXT\xc7\x04\x00m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x06'\x06.\x06*\x06(\x06'\x061\x00|mK\x8b\xd5\x00|n,\x8af\x00|0\xc60\xb90\xc8\x00|\t*\t0\t@\t\x15\tM\t7\t>\x00|\r*\r0\r?\r6\rK\r'\r(\x00|\x05\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x00|\x00k\x00i\x1e\xc3\x00m\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|"
>>> print()

>>> print('utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8'))
utf-8  w/ BOM decoded with utf-8     '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
utf-8  w/ BOM decoded with utf-8-sig 'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16'))
utf-16 w/ BOM decoded with utf-16    'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le'))
utf-16 w/ BOM decoded with utf-16le  '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'

It’s worth knowing that only utf-8-sig and utf-16 get back the original string after both encode and decode.


Answer 5

This problem basically arises when you save your Python code in a UTF-8 or UTF-16 encoding, because Python automatically adds a special character at the beginning of the code (which is not shown by text editors) to identify the encoding format. But when you try to execute the code, it gives you a syntax error on line 1, i.e., the start of the code, because the Python compiler expects ASCII encoding. When you view the file’s contents using the read() function, you can see ‘\ufeff’ at the beginning of the returned text. The simplest solution to this problem is to change the encoding back to ASCII (for this you can copy your code to Notepad and save it; remember to choose the ASCII encoding). Hope this helps.


Python, Unicode, and the Windows console

Question: Python, Unicode, and the Windows console

When I try to print a Unicode string in a Windows console, I get a UnicodeEncodeError: 'charmap' codec can't encode character .... error. I assume this is because the Windows console does not accept Unicode-only characters. What’s the best way around this? Is there any way I can make Python automatically print a ? instead of failing in this situation?

Edit: I’m using Python 2.5.


Note: @LasseV.Karlsen’s answer with the checkmark is somewhat outdated (from 2008). Please use the solutions/answers/suggestions below with care!

@JFSebastian’s answer is more relevant as of today (6 Jan 2016).


Answer 0

Note: This answer is sort of outdated (from 2008). Please use the solution below with care!!


Here is a page that details the problem and a solution (search the page for the text Wrapping sys.stdout into an instance):

PrintFails – Python Wiki

Here’s a code excerpt from that page:

$ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
    line = u"\u0411\n"; print type(line), len(line); \
    sys.stdout.write(line); print line'
  UTF-8
  <type 'unicode'> 2
  Б
  Б

  $ python -c 'import sys, codecs, locale; print sys.stdout.encoding; \
    sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout); \
    line = u"\u0411\n"; print type(line), len(line); \
    sys.stdout.write(line); print line' | cat
  None
  <type 'unicode'> 2
  Б
  Б

There’s some more information on that page, well worth a read.


Answer 1

Update: Python 3.6 implements PEP 528: Change Windows console encoding to UTF-8: the default console on Windows will now accept all Unicode characters. Internally, it uses the same Unicode API as the win-unicode-console package mentioned below. print(unicode_string) should just work now.


I get a UnicodeEncodeError: 'charmap' codec can't encode character... error.

The error means that Unicode characters that you are trying to print can’t be represented using the current (chcp) console character encoding. The codepage is often 8-bit encoding such as cp437 that can represent only ~0x100 characters from ~1M Unicode characters:

>>> u"\N{EURO SIGN}".encode('cp437')
Traceback (most recent call last):
...
UnicodeEncodeError: 'charmap' codec can't encode character '\u20ac' in position 0:
character maps to <undefined>

I assume this is because the Windows console does not accept Unicode-only characters. What’s the best way around this?

Windows console does accept Unicode characters and it can even display them (BMP only) if the corresponding font is configured. WriteConsoleW() API should be used as suggested in @Daira Hopwood’s answer. It can be called transparently i.e., you don’t need to and should not modify your scripts if you use win-unicode-console package:

T:\> py -mpip install win-unicode-console
T:\> py -mrun your_script.py

See What’s the deal with Python 3.4, Unicode, different languages and Windows?

Is there any way I can make Python automatically print a ? instead of failing in this situation?

If it is enough to replace all unencodable characters with ? in your case then you could set PYTHONIOENCODING envvar:

T:\> set PYTHONIOENCODING=:replace
T:\> python3 -c "print(u'[\N{EURO SIGN}]')"
[?]

In Python 3.6+, the encoding specified by PYTHONIOENCODING envvar is ignored for interactive console buffers unless PYTHONLEGACYWINDOWSIOENCODING envvar is set to a non-empty string.


回答 2

尽管有其他听起来合理的答案建议将代码页更改为 65001,但该方法无效。(此外,使用 sys.setdefaultencoding 更改默认编码也不是一个好主意。)

请参阅此问题,以获取有效的详细信息和代码。

Despite the other plausible-sounding answers that suggest changing the code page to 65001, that does not work. (Also, changing the default encoding using sys.setdefaultencoding is not a good idea.)

See this question for details and code that does work.


回答 3

如果您不在乎为这些问题字符获得可靠的表示形式,可以使用类似下面的方式(适用于 python >= 2.6,包括 3.x):

from __future__ import print_function
import sys

def safeprint(s):
    try:
        print(s)
    except UnicodeEncodeError:
        if sys.version_info >= (3,):
            print(s.encode('utf8').decode(sys.stdout.encoding))
        else:
            print(s.encode('utf8'))

safeprint(u"\N{EM DASH}")

字符串中的错误字符将转换为Windows控制台可打印的表示形式。

If you’re not interested in getting a reliable representation of the bad character(s) you might use something like this (working with python >= 2.6, including 3.x):

from __future__ import print_function
import sys

def safeprint(s):
    try:
        print(s)
    except UnicodeEncodeError:
        if sys.version_info >= (3,):
            print(s.encode('utf8').decode(sys.stdout.encoding))
        else:
            print(s.encode('utf8'))

safeprint(u"\N{EM DASH}")

The bad character(s) in the string will be converted in a representation which is printable by the Windows console.


回答 4

以下代码即使在 Windows 上也能让 Python 以 UTF-8 向控制台输出。

在 Windows 7 上,控制台能很好地显示这些字符;在 Windows XP 上则显示不佳,但至少它能工作,而且最重要的是,您的脚本将在所有平台上获得一致的输出。您还可以把输出重定向到文件。

以下代码已在Windows上使用Python 2.6进行了测试。


#!/usr/bin/python
# -*- coding: UTF-8 -*-

import codecs, sys

reload(sys)
sys.setdefaultencoding('utf-8')

print sys.getdefaultencoding()

if sys.platform == 'win32':
    try:
        import win32console 
    except:
        print "Python Win32 Extensions module is required.\n You can download it from https://sourceforge.net/projects/pywin32/ (x86 and x64 builds are available)\n"
        exit(-1)
    # win32console implementation  of SetConsoleCP does not return a value
    # CP_UTF8 = 65001
    win32console.SetConsoleCP(65001)
    if (win32console.GetConsoleCP() != 65001):
        raise Exception ("Cannot set console codepage to 65001 (UTF-8)")
    win32console.SetConsoleOutputCP(65001)
    if (win32console.GetConsoleOutputCP() != 65001):
        raise Exception ("Cannot set console output codepage to 65001 (UTF-8)")

#import sys, codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)

print "This is an Е乂αmp١ȅ testing Unicode support using Arabic, Latin, Cyrillic, Greek, Hebrew and CJK code points.\n"

The code below will make Python output to the console as UTF-8, even on Windows.

The console will display the characters well on Windows 7 but on Windows XP it will not display them well, but at least it will work and most important you will have a consistent output from your script on all platforms. You’ll be able to redirect the output to a file.

The code below was tested with Python 2.6 on Windows.


#!/usr/bin/python
# -*- coding: UTF-8 -*-

import codecs, sys

reload(sys)
sys.setdefaultencoding('utf-8')

print sys.getdefaultencoding()

if sys.platform == 'win32':
    try:
        import win32console 
    except:
        print "Python Win32 Extensions module is required.\n You can download it from https://sourceforge.net/projects/pywin32/ (x86 and x64 builds are available)\n"
        exit(-1)
    # win32console implementation  of SetConsoleCP does not return a value
    # CP_UTF8 = 65001
    win32console.SetConsoleCP(65001)
    if (win32console.GetConsoleCP() != 65001):
        raise Exception ("Cannot set console codepage to 65001 (UTF-8)")
    win32console.SetConsoleOutputCP(65001)
    if (win32console.GetConsoleOutputCP() != 65001):
        raise Exception ("Cannot set console output codepage to 65001 (UTF-8)")

#import sys, codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)

print "This is an Е乂αmp١ȅ testing Unicode support using Arabic, Latin, Cyrillic, Greek, Hebrew and CJK code points.\n"

回答 5

只需在执行python脚本之前在命令行中输入以下代码即可:

chcp 65001 & set PYTHONIOENCODING=utf-8

Just enter this code in command line before executing python script:

chcp 65001 & set PYTHONIOENCODING=utf-8
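
For instance, a hypothetical session (the commands are standard, but the exact output is an assumption; the Active code page line is echoed by chcp, and the euro sign appears only if the console font supports it):

C:\> chcp 65001 & set PYTHONIOENCODING=utf-8
Active code page: 65001
C:\> python -c "print(u'\N{EURO SIGN}')"
€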

回答 6

与 Giampaolo Rodolà 的回答类似,但更加粗暴:我真的、真的打算(很快)花很长时间去理解编码这一整个主题,以及它们如何作用于 Windoze 控制台。

就目前而言,我只想要一个能让我的程序不会崩溃、而且我自己能理解的东西……并且不涉及导入太多奇异的模块(特别是,我在用 Jython,所以有一半的时候某个 Python 模块实际上并不可用)。

def pr(s):
    try:
        print(s)
    except UnicodeEncodeError:
        for c in s:
            try:
                print( c, end='')
            except UnicodeEncodeError:
                print( '?', end='')

注意:“pr”比“print”打起来更短(而且比“safeprint”短得多)……!

Like Giampaolo Rodolà’s answer, but even more dirty: I really, really intend to spend a long time (soon) understanding the whole subject of encodings and how they apply to Windoze consoles,

For the moment I just wanted sthg which would mean my program would NOT CRASH, and which I understood … and also which didn’t involve importing too many exotic modules (in particular I’m using Jython, so half the time a Python module turns out not in fact to be available).

def pr(s):
    try:
        print(s)
    except UnicodeEncodeError:
        for c in s:
            try:
                print( c, end='')
            except UnicodeEncodeError:
                print( '?', end='')

NB “pr” is shorter to type than “print” (and quite a bit shorter to type than “safeprint”)…!
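
A minimal usage sketch (the string is a made-up example; note that under Python 2 the function as written also needs from __future__ import print_function for the end='' keyword):

pr(u'caf\xe9 \u2014 na\u00efve')   # characters the console cannot encode print as ?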


回答 7

对于Python 2,请尝试:

print unicode(string, 'unicode-escape')

对于Python 3,请尝试:

import os
string = "002 Could've Would've Should've"
os.system('echo ' + string)

或者尝试使用win-unicode-console:

pip install win-unicode-console
py -mrun your_script.py

For Python 2 try:

print unicode(string, 'unicode-escape')

For Python 3 try:

import os
string = "002 Could've Would've Should've"
os.system('echo ' + string)

Or try win-unicode-console:

pip install win-unicode-console
py -mrun your_script.py

回答 8

TL; DR:

print(yourstring.encode('ascii','replace'));

我自己遇到了这个问题,正在使用Twitch聊天(IRC)机器人。(最新的Python 2.7)

我想解析聊天消息以便回复…

msg = s.recv(1024).decode("utf-8")

还要以易于阅读的格式将它们安全地打印到控制台:

print(msg.encode('ascii','replace'));

这样就解决了机器人抛出 UnicodeEncodeError: 'charmap' 错误的问题,并把那些 Unicode 字符替换成了 ?。

TL;DR:

print(yourstring.encode('ascii','replace'));

I ran into this myself, working on a Twitch chat (IRC) bot. (Python 2.7 latest)

I wanted to parse chat messages in order to respond…

msg = s.recv(1024).decode("utf-8")

but also print them safely to the console in a human-readable format:

print(msg.encode('ascii','replace'));

This corrected the issue of the bot throwing UnicodeEncodeError: 'charmap' errors and replaced the unicode characters with ?.


回答 9

您的问题的原因并不是 Win 控制台不愿接受 Unicode(我猜从 Win2k 起它默认就接受了)。原因在于默认的系统编码。试试下面的代码,看看它给出什么:

import sys
sys.getdefaultencoding()

如果显示 ascii,那就是原因所在 ;-) 您必须创建一个名为 sitecustomize.py 的文件,并将其放在 python 路径下(我把它放在 /usr/lib/python2.5/site-packages 下,但在 Windows 上位置不同,是 c:\python\lib\site-packages 之类的路径),内容如下:

import sys
sys.setdefaultencoding('utf-8')

也许您可能还需要在文件中指定编码:

# -*- coding: UTF-8 -*-
import sys,time

编辑:更多信息可以在优秀的《Dive into Python》一书中找到

The cause of your problem is NOT the Win console not willing to accept Unicode (as it does this since I guess Win2k by default). It is the default system encoding. Try this code and see what it gives you:

import sys
sys.getdefaultencoding()

if it says ascii, there’s your cause ;-) You have to create a file called sitecustomize.py and put it under python path (I put it under /usr/lib/python2.5/site-packages, but that is different on Win – it is c:\python\lib\site-packages or something), with the following contents:

import sys
sys.setdefaultencoding('utf-8')

and perhaps you might want to specify the encoding in your files as well:

# -*- coding: UTF-8 -*-
import sys,time

Edit: more info can be found in the excellent Dive into Python book


回答 10

与 J.F. Sebastian 的答案有些相关,但更为直接。

如果在打印到控制台/终端时遇到此问题,请执行以下操作:

>set PYTHONIOENCODING=UTF-8

Kind of related to the answer by J. F. Sebastian, but more direct.

If you are having this problem when printing to the console/terminal, then do this:

>set PYTHONIOENCODING=UTF-8

回答 11

Python 3.6 Windows7:有几种启动python的方法,您可以使用python控制台(上面带有python徽标)或Windows控制台(上面写有cmd.exe)。

我无法在Windows控制台中打印utf8字符。打印utf-8字符会引发此错误:

OSError: [WinError 87] The parameter is incorrect
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf8'>
OSError: [WinError 87] The parameter is incorrect

在尝试并未能理解上面的答案之后,我发现这只是一个设置问题。右键单击 cmd 控制台窗口的顶部,在 font(字体)选项卡中选择 Lucida Console。

Python 3.6, Windows 7: There are several ways to launch Python: you could use the Python console (which has a Python logo on it) or the Windows console (it says cmd.exe on it).

I could not print utf8 characters in the Windows console. Printing utf-8 characters threw me this error:

OSError: [WinError 87] The parameter is incorrect
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf8'>
OSError: [WinError 87] The parameter is incorrect

After trying and failing to understand the answer above I discovered it was only a setting problem. Right click on the top of the cmd console windows, on the tab font chose lucida console.


回答 12

詹姆斯·苏拉克(James Sulak)问,

有什么办法可以使Python自动打印?而不是在这种情况下失败?

其他解决方案建议我们尝试修改 Windows 环境,或者替换 Python 的 print() 函数。下面的答案更接近于满足 Sulak 的请求。

在Windows 7下,可以使Python 3.5打印Unicode而不会抛出 UnicodeEncodeError如下内容:

    代替:    print(text)
    替代:     print(str(text).encode('utf-8'))

现在,Python 不会抛出异常,而是将不可打印的 Unicode 字符显示为 \xNN 十六进制代码,例如:

  Halmalo n\xe2\x80\x99\xc3\xa9tait plus qu\xe2\x80\x99un point noir

代替

  Halmalo n’était plus qu’un point noir

当然,在其他条件相同时后者更可取,但就诊断消息而言,前者也是完全准确的。由于它把 Unicode 显示为字面的字节值,前者还有助于诊断编码/解码问题。

注意:上面的 str() 调用是必需的,否则 encode() 会导致 Python 把 Unicode 字符当作数字元组而拒绝。

James Sulak asked,

Is there any way I can make Python automatically print a ? instead of failing in this situation?

Other solutions recommend we attempt to modify the Windows environment or replace Python’s print() function. The answer below comes closer to fulfilling Sulak’s request.

Under Windows 7, Python 3.5 can be made to print Unicode without throwing a UnicodeEncodeError as follows:

    In place of:    print(text)
    substitute:     print(str(text).encode('utf-8'))

Instead of throwing an exception, Python now displays unprintable Unicode characters as \xNN hex codes, e.g.:

  Halmalo n\xe2\x80\x99\xc3\xa9tait plus qu\xe2\x80\x99un point noir

Instead of

  Halmalo n’était plus qu’un point noir

Granted, the latter is preferable ceteris paribus, but otherwise the former is completely accurate for diagnostic messages. Because it displays Unicode as literal byte values the former may also assist in diagnosing encode/decode problems.

Note: The str() call above is needed because otherwise encode() causes Python to reject a Unicode character as a tuple of numbers.
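
A minimal sketch of the substitution under Python 3 (the sample string is assumed; note that the printed output carries a b'' prefix, slightly different from the transcript above):

text = u'Halmalo n\u2019\xe9tait plus qu\u2019un point noir'
print(str(text).encode('utf-8'))
# b'Halmalo n\xe2\x80\x99\xc3\xa9tait plus qu\xe2\x80\x99un point noir'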


Python字符串打印为[u’String’]

问题:Python字符串打印为[u’String’]

这肯定是一件容易的事,但这确实困扰着我。

我有一个脚本,可以读取网页并使用Beautiful Soup对其进行解析。我从汤中提取所有链接,因为我的最终目标是打印出link.contents。

我要解析的所有文本都是ASCII。我知道Python将字符串视为unicode,并且我确信这非常方便,在我的wee脚本中没有用。

每次我打印一个保存着 'String' 的变量时,屏幕上显示的都是 [u'String']。有没有一种简单的方法把它变回普通的 ascii,还是我应该写一个正则表达式来去掉它?

This will surely be an easy one but it is really bugging me.

I have a script that reads in a webpage and uses Beautiful Soup to parse it. From the soup I extract all the links as my final goal is to print out the link.contents.

All of the text that I am parsing is ASCII. I know that Python treats strings as unicode, and I am sure this is very handy, just of no use in my wee script.

Every time I go to print out a variable that holds ‘String’ I get [u'String'] printed to the screen. Is there a simple way of getting this back into just ascii or should I write a regex to strip it?


回答 0

[u'ABC'] 是一个包含单个 unicode 字符串的单元素列表。Beautiful Soup 总是产生 Unicode。因此,您需要先把该列表转换为单个 unicode 字符串,然后再将其转换为 ASCII。

我不知道您究竟是怎么得到单元素列表的;contents 成员应当是由字符串和标签组成的列表,这显然不是您所拥有的。假设您确实总是得到只含一个元素的列表,并且您的文本真的只有 ASCII,那么可以这样用:

 soup[0].encode("ascii")

但是,请仔细检查您的数据是否真的是ASCII。这很少见。更有可能是latin-1或utf-8。

 soup[0].encode("latin-1")


 soup[0].encode("utf-8")

或者,您可以询问Beautiful Soup原始编码是什么,然后以该编码重新获取:

 soup[0].encode(soup.originalEncoding)

[u'ABC'] would be a one-element list of unicode strings. Beautiful Soup always produces Unicode. So you need to convert the list to a single unicode string, and then convert that to ASCII.

I don’t know exactly how you got the one-element lists; the contents member would be a list of strings and tags, which is apparently not what you have. Assuming that you really always get a list with a single element, and that your text is really only ASCII, you would use this:

 soup[0].encode("ascii")

However, please double-check that your data is really ASCII. This is pretty rare. Much more likely it’s latin-1 or utf-8.

 soup[0].encode("latin-1")


 soup[0].encode("utf-8")

Or you ask Beautiful Soup what the original encoding was and get it back in this encoding:

 soup[0].encode(soup.originalEncoding)

回答 1

您可能有一个包含一个 unicode 字符串的列表。它的 repr 是 [u'String']。

您可以使用以下任何变体将其转换为字节字符串列表:

# Functional style.
print map(lambda x: x.encode('ascii'), my_list)

# List comprehension.
print [x.encode('ascii') for x in my_list]

# Interesting if my_list may be a tuple or a string.
print type(my_list)(x.encode('ascii') for x in my_list)

# What do I care about the brackets anyway?
print ', '.join(repr(x.encode('ascii')) for x in my_list)

# That's actually not a good way of doing it.
print ' '.join(repr(x).lstrip('u')[1:-1] for x in my_list)

You probably have a list containing one unicode string. The repr of this is [u'String'].

You can convert this to a list of byte strings using any variation of the following:

# Functional style.
print map(lambda x: x.encode('ascii'), my_list)

# List comprehension.
print [x.encode('ascii') for x in my_list]

# Interesting if my_list may be a tuple or a string.
print type(my_list)(x.encode('ascii') for x in my_list)

# What do I care about the brackets anyway?
print ', '.join(repr(x.encode('ascii')) for x in my_list)

# That's actually not a good way of doing it.
print ' '.join(repr(x).lstrip('u')[1:-1] for x in my_list)

回答 2

import json, ast
r = {u'name': u'A', u'primary_key': 1}
ast.literal_eval(json.dumps(r)) 

将打印

{'name': 'A', 'primary_key': 1}
import json, ast
r = {u'name': u'A', u'primary_key': 1}
ast.literal_eval(json.dumps(r)) 

will print

{'name': 'A', 'primary_key': 1}

回答 3

如果访问/打印单个元素列表(例如顺序或过滤):

my_list = [u'String'] # sample element
my_list = [str(my_list[0])]

If accessing/printing single element lists (e.g., sequentially or filtered):

my_list = [u'String'] # sample element
my_list = [str(my_list[0])]

回答 4

将输出传递给 str() 函数,它会把 unicode 输出转换掉;同样,打印该输出时也会去掉其中的 u'' 标记。

Pass the output to the str() function and it will convert the unicode output; also, printing the output will remove the u'' tags from it.
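
A minimal sketch of what this answer suggests (Python 2; it assumes the string holds only ASCII characters, since otherwise str() itself raises a UnicodeEncodeError):

my_list = [u'String']       # hypothetical one-element list
print str(my_list[0])       # String  (no u'' prefix)
print my_list[0]            # printing the element directly also drops the prefix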


回答 5

[u'String'] 是一个列表的文本表示形式,该列表在 Python 2 上包含一个 Unicode 字符串。

如果运行 print(some_list),则相当于
print '[%s]' % ', '.join(map(repr, some_list)),也就是说:为了创建 list 类型 Python 对象的文本表示形式,会对每个项目调用 repr() 函数。

请勿混淆 Python 对象及其文本表示形式:repr('a') != 'a',甚至文本表示形式的文本表示形式也不同:repr(repr('a')) != repr('a')。

repr(obj) 返回一个包含对象可打印表示形式的字符串。其目的是给出对象的无歧义表示,这在 REPL 中对调试很有用。通常 eval(repr(obj)) == obj。

为避免调用 repr(),您可以直接打印列表项(如果它们都是 Unicode 字符串),例如 print ",".join(some_list),它会打印以逗号分隔的字符串列表:String

不要使用硬编码的字符编码把 Unicode 字符串编码为字节,而应直接打印 Unicode。否则,代码可能会失败,因为该编码无法表示所有字符,例如当您尝试对非 ASCII 字符使用 'ascii' 编码时;或者,如果环境使用的编码与硬编码的编码不兼容,代码会默默产生乱码(mojibake),把损坏的数据继续传入管道。

[u'String'] is a text representation of a list that contains a Unicode string on Python 2.

If you run print(some_list) then it is equivalent to
print'[%s]' % ', '.join(map(repr, some_list)) i.e., to create a text representation of a Python object with the type list, repr() function is called for each item.

Don’t confuse a Python object and its text representationrepr('a') != 'a' and even the text representation of the text representation differs: repr(repr('a')) != repr('a').

repr(obj) returns a string that contains a printable representation of an object. Its purpose is to be an unambiguous representation of an object that can be useful for debugging, in a REPL. Often eval(repr(obj)) == obj.

To avoid calling repr(), you could print list items directly (if they are all Unicode strings) e.g.: print ",".join(some_list)—it prints a comma separated list of the strings: String

Do not encode a Unicode string to bytes using a hardcoded character encoding, print Unicode directly instead. Otherwise, the code may fail because the encoding can’t represent all the characters e.g., if you try to use 'ascii' encoding with non-ascii characters. Or the code silently produces mojibake (corrupted data is passed further in a pipeline) if the environment uses an encoding that is incompatible with the hardcoded encoding.
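
A minimal sketch contrasting the two behaviours described above (Python 2):

some_list = [u'String']
print some_list             # [u'String']  (repr() is called for each item)
print ",".join(some_list)   # String       (the items are printed directly)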


回答 6

在这个“字符串”上使用 dir 或 type 来弄清它到底是什么。我怀疑它是 BeautifulSoup 的标签对象之一,打印出来像字符串,但实际上不是。否则,它就在一个列表内,您需要分别转换每个字符串。

无论如何,您为什么反对使用Unicode?有什么具体原因吗?

Use dir or type on the ‘string’ to find out what it is. I suspect that it’s one of BeautifulSoup’s tag objects, that prints like a string, but really isn’t one. Otherwise, its inside a list and you need to convert each string separately.

In any case, why are you objecting to using Unicode? Any specific reason?


回答 7

你真的是指 u'String' 吗?

无论如何,您不能直接用 str(string) 获取普通字符串而不是 unicode 字符串吗?(在 Python 3 中情况不同,那里所有字符串都是 unicode。)

Do you really mean u'String'?

In any event, can’t you just do str(string) to get a string rather than a unicode-string? (This should be different for Python 3, for which all strings are unicode.)


回答 8

encode("latin-1") 在我的情况下帮到了我:

facultyname[0].encode("latin-1")

encode("latin-1") helped me in my case:

facultyname[0].encode("latin-1")

回答 9

也许是我没明白:为什么您不能直接获取 element.text,然后在使用之前对其进行转换?例如(不知道您为什么要这样做,但是……)找到网页的所有 label 元素,并在它们之间迭代,直到找到一个文本为 MyText 的元素为止。

        avail = []
        avail = driver.find_elements_by_class_name("label");
        for i in avail:
                if  i.text == "MyText":

把 i 的字符串转换出来,然后做您想做的任何事……也许是我漏掉了原始消息中的什么?还是这正是您要找的?

Maybe i dont understand , why cant you just get the element.text and then convert it before using it ? for instance (dont know why you would do this but…) find all label elements of the web page and iterate between them until you find one called MyText

        avail = []
        avail = driver.find_elements_by_class_name("label");
        for i in avail:
                if  i.text == "MyText":

Convert the string from i and do whatever you wanted to do … maybe im missing something in the original message ? or was this what you were looking for ?


为什么默认编码为ASCII时Python为什么打印unicode字符?

问题:为什么默认编码为ASCII时Python为什么打印unicode字符?

从Python 2.6 shell:

>>> import sys
>>> print sys.getdefaultencoding()
ascii
>>> print u'\xe9'
é
>>> 

我希望在打印语句后出现一些乱码或错误,因为“é”字符不是ASCII的一部分,并且我未指定编码。我想我不明白ASCII是默认编码的意思。

编辑

我将编辑移至“ 答案”部分,并按建议接受。

From the Python 2.6 shell:

>>> import sys
>>> print sys.getdefaultencoding()
ascii
>>> print u'\xe9'
é
>>> 

I expected to have either some gibberish or an Error after the print statement, since the “é” character isn’t part of ASCII and I haven’t specified an encoding. I guess I don’t understand what ASCII being the default encoding means.

EDIT

I moved the edit to the Answers section and accepted it as suggested.


回答 0

多亏各方面的答复,我认为我们可以做出一个解释。

通过尝试打印 unicode 字符串 u'\xe9',Python 会隐式地尝试使用当前存储在 sys.stdout.encoding 中的编码方案对该字符串进行编码。Python 实际上是从启动它的环境中选取此设置的。只有当它无法从环境中找到合适的编码时,才会退回到其默认值 ASCII。

例如,我使用bash shell,其编码默认为UTF-8。如果我从中启动Python,它将启动并使用该设置:

$ python

>>> import sys
>>> print sys.stdout.encoding
UTF-8

让我们暂时退出Python shell,并使用一些伪造的编码设置bash的环境:

$ export LC_CTYPE=klingon
# we should get some error message here, just ignore it.

然后再次启动python shell并确认它确实恢复为默认的ascii编码。

$ python

>>> import sys
>>> print sys.stdout.encoding
ANSI_X3.4-1968

答对了!

如果现在尝试在ascii之外输出一些Unicode字符,则应该会收到一条不错的错误消息

>>> print u'\xe9'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' 
in position 0: ordinal not in range(128)

让我们退出Python并丢弃bash shell。

现在,我们将观察Python输出字符串之后发生的情况。为此,我们首先在图形终端(我使用Gnome Terminal)中启动bash shell,然后将终端设置为使用ISO-8859-1 aka latin-1解码输出(图形终端通常可以选择设置字符)在其下拉菜单之一中编码)。请注意,这不会更改实际shell环境的编码,仅会更改终端本身将解码给定输出的方式,就像Web浏览器一样。因此,您可以独立于外壳环境而更改终端的编码。然后让我们从外壳启动Python,并验证sys.stdout.encoding是否设置为外壳环境的编码(对我来说是UTF-8):

$ python

>>> import sys

>>> print sys.stdout.encoding
UTF-8

>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
é
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>

(1)python按原样输出二进制字符串,终端将其接收并尝试将其值与latin-1字符映射进行匹配。在latin-1中,0xe9或233产生字符“é”,这就是终端显示的内容。

(2)python 尝试使用 sys.stdout.encoding 中当前设置的任何方案对 Unicode 字符串进行隐式编码,在本例中为“UTF-8”。经过 UTF-8 编码后,生成的二进制字符串为 '\xc3\xa9'(请参阅后面的说明)。终端按原样接收该流,并尝试使用 latin-1 解码 0xc3a9,但 latin-1 的范围是 0 到 255,因此一次只解码 1 个字节。0xc3a9 有 2 个字节长,latin-1 解码器于是将其解释为 0xc3(195)和 0xa9(169),产生 2 个字符:Ã 和 ©。

(3)python 使用 latin-1 方案对 unicode 代码点 u'\xe9'(233)进行编码。latin-1 代码点的范围是 0-255,并且在该范围内指向与 Unicode 完全相同的字符。因此,该范围内的 Unicode 代码点以 latin-1 编码时会产生相同的值。于是,以 latin-1 编码的 u'\xe9'(233)也会产生二进制字符串 '\xe9'。终端接收到该值,并尝试在 latin-1 字符映射上进行匹配。就像情况(1)一样,它产生“é”,这就是显示的内容。

现在,从下拉菜单中将终端的编码设置更改为UTF-8(就像您将更改Web浏览器的编码设置一样)。无需停止Python或重新启动Shell。终端的编码现在与Python匹配。让我们再次尝试打印:

>>> print '\xe9' # (4)

>>> print u'\xe9' # (5)
é
>>> print u'\xe9'.encode('latin-1') # (6)

>>>

(4)python 按原样输出二进制字符串。终端尝试使用UTF-8解码该流。但是UTF-8无法理解值0xe9(请参阅后面的说明),因此无法将其转换为unicode代码点。找不到代码点,没有打印字符。

(5)python 尝试使用 sys.stdout.encoding 中的任何内容隐式编码 Unicode 字符串,仍然是“UTF-8”。生成的二进制字符串为 '\xc3\xa9'。终端接收该流,并尝试同样使用 UTF-8 解码 0xc3a9。它解码回值 0xe9(233),该值在 Unicode 字符映射表上指向符号“é”。终端显示“é”。

(6)python 使用 latin-1 编码 unicode 字符串,产生具有相同值 '\xe9' 的二进制字符串。对于终端来说,这与情况(4)几乎相同。

结论:

  • Python 将非 Unicode 字符串作为原始数据输出,而不考虑其默认编码。如果终端当前的编码与数据匹配,终端恰好能显示它们。
  • Python 在使用 sys.stdout.encoding 中指定的方案编码之后输出 Unicode 字符串。
  • Python 从 Shell 的环境中获取该设置。
  • 终端根据其自身的编码设置显示输出。
  • 终端的编码独立于 shell 的编码。


有关Unicode,UTF-8和latin-1的更多详细信息:

Unicode 基本上是一个字符表,其中按惯例把某些键(代码点)指定为指向某些符号。例如,按照惯例,键 0xe9(233)指向符号 'é'。ASCII 和 Unicode 使用相同的 0 到 127 的代码点,latin-1 和 Unicode 则在 0 到 255 范围内相同。也就是说,0x41 在 ASCII、latin-1 和 Unicode 中都指向 'A',0xc8 在 latin-1 和 Unicode 中指向 'È',0xe9 在 latin-1 和 Unicode 中指向 'é'。

在使用电子设备时,Unicode代码点需要一种有效的方式以电子方式表示。这就是编码方案。存在各种Unicode编码方案(utf7,UTF-8,UTF-16,UTF-32)。最直观,最直接的编码方法是简单地使用Unicode映射中的代码点值作为其电子形式的值,但是Unicode当前有超过一百万个代码点,这意味着其中一些代码点需要3个字节表达。为了有效地处理文本,一对一的映射将是不切实际的,因为它将要求所有代码点都存储在完全相同的空间中,每个字符至少要占用3个字节,而不管它们的实际需要如何。

大多数编码方案在空间要求上都有缺点,最经济的方案不能覆盖所有unicode码点,例如ascii仅覆盖前128个,而latin-1覆盖前256个。这是浪费的,因为即使对于常见的“便宜”字符,它们也需要更多的字节。例如,UTF-16每个字符至少使用2个字节,包括在ASCII范围内的字符(“ B”为65,在UTF-16中仍需要2个字节的存储空间)。UTF-32更加浪费,因为它将所有字符存储在4个字节中。

UTF-8恰好巧妙地解决了这一难题,该方案能够存储带有可变数量字节空间的代码点。作为其编码策略的一部分,UTF-8在代码点上附加标志位,这些标志位指示(可能是解码器)其空间要求和边界。

Unicode编码点在ASCII范围(0-127)中的UTF-8编码:

0xxx xxxx  (in binary)
  • x表示在编码过程中为“存储”代码点保留的实际空间
  • 前导0是一个标志,向UTF-8解码器指示此代码点仅需要1个字节。
  • 编码后,UTF-8不会在该特定范围内更改代码点的值(即,以UTF-8编码的65也是65)。考虑到Unicode和ASCII在相同范围内也兼容,因此附带地使UTF-8和ASCII在该范围内也兼容。

例如,“ B”的Unicode代码点是“ 0x42”或二进制的0100 0010(正如我们所说的,在ASCII中是相同的)。用UTF-8编码后,它变为:

0xxx xxxx  <-- UTF-8 encoding for Unicode code points 0 to 127
*100 0010  <-- Unicode code point 0x42
0100 0010  <-- UTF-8 encoded (exactly the same)

127以上的Unicode代码点的UTF-8编码(非ascii):

110x xxxx 10xx xxxx            <-- (from 128 to 2047)
1110 xxxx 10xx xxxx 10xx xxxx  <-- (from 2048 to 65535)
  • 前导比特“ 110”向UTF-8解码器指示以2个字节编码的代码点的开始,而“ 1110”指示3个字节,11110将指示4个字节,依此类推。
  • 内部的“ 10”标志位用于表示内部字节的开始。
  • 再次,x标记编码后存储Unicode代码点值的空间。

例如,“é” Unicode代码点为0xe9(233)。

1110 1001    <-- 0xe9

当UTF-8对该值进行编码时,它确定该值大于127且小于2048,因此应以2个字节进行编码:

110x xxxx 10xx xxxx   <-- UTF-8 encoding for Unicode 128-2047
***0 0011 **10 1001   <-- 0xe9
1100 0011 1010 1001   <-- 'é' after UTF-8 encoding
C    3    A    9

Unicode 代码点 0xe9 经过 UTF-8 编码后变为 0xc3a9,这正是终端接收到的内容。如果您的终端被设置为使用 latin-1(一种非 unicode 的遗留编码)解码字符串,您会看到 Ã©,因为恰好 latin-1 中的 0xc3 指向 Ã,而 0xa9 指向 ©。

Thanks to bits and pieces from various replies, I think we can stitch up an explanation.

By trying to print a unicode string, u'\xe9', Python implicitly tries to encode that string using the encoding scheme currently stored in sys.stdout.encoding. Python actually picks up this setting from the environment it’s been initiated from. If it can’t find a proper encoding from the environment, only then does it revert to its default, ASCII.

For example, I use a bash shell which encoding defaults to UTF-8. If I start Python from it, it picks up and use that setting:

$ python

>>> import sys
>>> print sys.stdout.encoding
UTF-8

Let’s for a moment exit the Python shell and set bash’s environment with some bogus encoding:

$ export LC_CTYPE=klingon
# we should get some error message here, just ignore it.

Then start the python shell again and verify that it does indeed revert to its default ascii encoding.

$ python

>>> import sys
>>> print sys.stdout.encoding
ANSI_X3.4-1968

Bingo!

If you now try to output some unicode character outside of ascii you should get a nice error message

>>> print u'\xe9'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' 
in position 0: ordinal not in range(128)

Lets exit Python and discard the bash shell.

We’ll now observe what happens after Python outputs strings. For this we’ll first start a bash shell within a graphic terminal (I use Gnome Terminal) and we’ll set the terminal to decode output with ISO-8859-1 aka latin-1 (graphic terminals usually have an option to Set Character Encoding in one of their dropdown menus). Note that this doesn’t change the actual shell environment’s encoding, it only changes the way the terminal itself will decode output it’s given, a bit like a web browser does. You can therefore change the terminal’s encoding, independantly from the shell’s environment. Let’s then start Python from the shell and verify that sys.stdout.encoding is set to the shell environment’s encoding (UTF-8 for me):

$ python

>>> import sys

>>> print sys.stdout.encoding
UTF-8

>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
é
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>

(1) python outputs binary string as is, terminal receives it and tries to match its value with latin-1 character map. In latin-1, 0xe9 or 233 yields the character “é” and so that’s what the terminal displays.

(2) python attempts to implicitly encode the Unicode string with whatever scheme is currently set in sys.stdout.encoding, in this instance it’s “UTF-8”. After UTF-8 encoding, the resulting binary string is ‘\xc3\xa9’ (see later explanation). Terminal receives the stream as such and tries to decode 0xc3a9 using latin-1, but latin-1 goes from 0 to 255 and so, only decodes streams 1 byte at a time. 0xc3a9 is 2 bytes long, latin-1 decoder therefore interprets it as 0xc3 (195) and 0xa9 (169) and that yields 2 characters: Ã and ©.

(3) python encodes unicode code point u’\xe9′ (233) with the latin-1 scheme. Turns out latin-1 code points range is 0-255 and points to the exact same character as Unicode within that range. Therefore, Unicode code points in that range will yield the same value when encoded in latin-1. So u’\xe9′ (233) encoded in latin-1 will also yields the binary string ‘\xe9’. Terminal receives that value and tries to match it on the latin-1 character map. Just like case (1), it yields “é” and that’s what’s displayed.

Let’s now change the terminal’s encoding settings to UTF-8 from the dropdown menu (like you would change your web browser’s encoding settings). No need to stop Python or restart the shell. The terminal’s encoding now matches Python’s. Let’s try printing again:

>>> print '\xe9' # (4)

>>> print u'\xe9' # (5)
é
>>> print u'\xe9'.encode('latin-1') # (6)

>>>

(4) python outputs a binary string as is. Terminal attempts to decode that stream with UTF-8. But UTF-8 doesn’t understand the value 0xe9 (see later explanation) and is therefore unable to convert it to a unicode code point. No code point found, no character printed.

(5) python attempts to implicitly encode the Unicode string with whatever’s in sys.stdout.encoding. Still “UTF-8”. The resulting binary string is ‘\xc3\xa9’. Terminal receives the stream and attempts to decode 0xc3a9 also using UTF-8. It yields back code value 0xe9 (233), which on the Unicode character map points to the symbol “é”. Terminal displays “é”.

(6) python encodes unicode string with latin-1, it yields a binary string with the same value ‘\xe9’. Again, for the terminal this is pretty much the same as case (4).

Conclusions:

  • Python outputs non-unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data.
  • Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
  • Python gets that setting from the shell’s environment.
  • The terminal displays output according to its own encoding settings.
  • The terminal’s encoding is independent from the shell’s.


More details on unicode, UTF-8 and latin-1:

Unicode is basically a table of characters where some keys (code points) have been conventionally assigned to point to some symbols. e.g. by convention it’s been decided that key 0xe9 (233) is the value pointing to the symbol ‘é’. ASCII and Unicode use the same code points from 0 to 127, as do latin-1 and Unicode from 0 to 255. That is, 0x41 points to ‘A’ in ASCII, latin-1 and Unicode, 0xc8 points to ‘È’ in latin-1 and Unicode, 0xe9 points to ‘é’ in latin-1 and Unicode.

When working with electronic devices, Unicode code points need an efficient way to be represented electronically. That’s what encoding schemes are about. Various Unicode encoding schemes exist (utf7, UTF-8, UTF-16, UTF-32). The most intuitive and straight forward encoding approach would be to simply use a code point’s value in the Unicode map as its value for its electronic form, but Unicode currently has over a million code points, which means that some of them require 3 bytes to be expressed. To work efficiently with text, a 1 to 1 mapping would be rather impractical, since it would require that all code points be stored in exactly the same amount of space, with a minimum of 3 bytes per character, regardless of their actual need.

Most encoding schemes have shortcomings regarding space requirement, the most economic ones don’t cover all unicode code points, for example ascii only covers the first 128, while latin-1 covers the first 256. Others that try to be more comprehensive end up also being wasteful, since they require more bytes than necessary, even for common “cheap” characters. UTF-16 for instance, uses a minimum of 2 bytes per character, including those in the ascii range (‘B’ which is 65, still requires 2 bytes of storage in UTF-16). UTF-32 is even more wasteful as it stores all characters in 4 bytes.

UTF-8 happens to have cleverly resolved the dilemma, with a scheme able to store code points with a variable amount of byte spaces. As part of its encoding strategy, UTF-8 laces code points with flag bits that indicate (presumably to decoders) their space requirements and their boundaries.

UTF-8 encoding of unicode code points in the ascii range (0-127):

0xxx xxxx  (in binary)
  • the x’s show the actual space reserved to “store” the code point during encoding
  • The leading 0 is a flag that indicates to the UTF-8 decoder that this code point will only require 1 byte.
  • upon encoding, UTF-8 doesn’t change the value of code points in that specific range (i.e. 65 encoded in UTF-8 is also 65). Considering that Unicode and ASCII are also compatible in the same range, it incidentally makes UTF-8 and ASCII also compatible in that range.

e.g. Unicode code point for ‘B’ is ‘0x42’ or 0100 0010 in binary (as we said, it’s the same in ASCII). After encoding in UTF-8 it becomes:

0xxx xxxx  <-- UTF-8 encoding for Unicode code points 0 to 127
*100 0010  <-- Unicode code point 0x42
0100 0010  <-- UTF-8 encoded (exactly the same)

UTF-8 encoding of Unicode code points above 127 (non-ascii):

110x xxxx 10xx xxxx            <-- (from 128 to 2047)
1110 xxxx 10xx xxxx 10xx xxxx  <-- (from 2048 to 65535)
  • the leading bits ‘110’ indicate to the UTF-8 decoder the beginning of a code point encoded in 2 bytes, whereas ‘1110’ indicates 3 bytes, 11110 would indicate 4 bytes and so forth.
  • the inner ’10’ flag bits are used to signal the beginning of an inner byte.
  • again, the x’s mark the space where the Unicode code point value is stored after encoding.

e.g. ‘é’ Unicode code point is 0xe9 (233).

1110 1001    <-- 0xe9

When UTF-8 encodes this value, it determines that the value is larger than 127 and less than 2048, therefore should be encoded in 2 bytes:

110x xxxx 10xx xxxx   <-- UTF-8 encoding for Unicode 128-2047
***0 0011 **10 1001   <-- 0xe9
1100 0011 1010 1001   <-- 'é' after UTF-8 encoding
C    3    A    9

The 0xe9 Unicode code point, after UTF-8 encoding, becomes 0xc3a9. Which is exactly how the terminal receives it. If your terminal is set to decode strings using latin-1 (one of the non-unicode legacy encodings), you’ll see Ã©, because it just so happens that 0xc3 in latin-1 points to à and 0xa9 to ©.
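
The encodings walked through above can be verified directly in a Python 2 session; a quick sanity sketch:

>>> u'\xe9'.encode('utf-8')        # 0xe9 becomes the two bytes C3 A9
'\xc3\xa9'
>>> u'B'.encode('utf-8')           # code points below 128 are unchanged
'B'
>>> '\xc3\xa9'.decode('latin-1')   # those two bytes read as latin-1: Ã and ©
u'\xc3\xa9'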


回答 1

将 Unicode 字符打印到 stdout 时,会使用 sys.stdout.encoding。非 Unicode 字符则被假定已采用 sys.stdout.encoding 编码,并被原样发送到终端。在我的系统上(Python 2):

>>> import unicodedata as ud
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> ud.name(u'\xe9') # U+00E9 Unicode codepoint
'LATIN SMALL LETTER E WITH ACUTE'
>>> ud.name('\xe9'.decode('cp437')) 
'GREEK CAPITAL LETTER THETA'
>>> '\xe9'.decode('cp437') # byte E9 decoded using code page 437 is U+0398.
u'\u0398'
>>> ud.name(u'\u0398')
'GREEK CAPITAL LETTER THETA'
>>> print u'\xe9' # Unicode is encoded to CP437 correctly
é
>>> print '\xe9'  # Byte is just sent to terminal and assumed to be CP437.
Θ

sys.getdefaultencoding() 仅在Python没有其他选项时使用。

请注意,Python 3.6 或更高版本会忽略 Windows 上的编码设置,改用 Unicode API 将 Unicode 写入终端。不会有 UnicodeEncodeError,并且如果字体支持,就会显示正确的字符。即使字体不支持,仍可以把字符从终端剪切粘贴到使用支持字体的应用程序中,结果也是正确的。升级吧!

When Unicode characters are printed to stdout, sys.stdout.encoding is used. A non-Unicode character is assumed to be in sys.stdout.encoding and is just sent to the terminal. On my system (Python 2):

>>> import unicodedata as ud
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> ud.name(u'\xe9') # U+00E9 Unicode codepoint
'LATIN SMALL LETTER E WITH ACUTE'
>>> ud.name('\xe9'.decode('cp437')) 
'GREEK CAPITAL LETTER THETA'
>>> '\xe9'.decode('cp437') # byte E9 decoded using code page 437 is U+0398.
u'\u0398'
>>> ud.name(u'\u0398')
'GREEK CAPITAL LETTER THETA'
>>> print u'\xe9' # Unicode is encoded to CP437 correctly
é
>>> print '\xe9'  # Byte is just sent to terminal and assumed to be CP437.
Θ

sys.getdefaultencoding() is only used when Python doesn’t have another option.

Note that Python 3.6 or later ignores encodings on Windows and uses Unicode APIs to write Unicode to the terminal. No UnicodeEncodeError warnings and the correct character is displayed if the font supports it. Even if the font doesn’t support it the characters can still be cut-n-pasted from the terminal to an application with a supporting font and it will be correct. Upgrade!


回答 2

Python REPL 会尝试从您的环境中选择要使用的编码。如果它能找到合理的编码,一切就都正常。只有在它无法弄清情况时才会出问题。

>>> print sys.stdout.encoding
UTF-8

The Python REPL tries to pick up what encoding to use from your environment. If it finds something sane then it all Just Works. It’s when it can’t figure out what’s going on that it bugs out.

>>> print sys.stdout.encoding
UTF-8

回答 3

您已经通过输入一个显式的 Unicode 字符串指定了编码。请比较不使用 u 前缀时的结果。

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> '\xe9'
'\xe9'
>>> u'\xe9'
u'\xe9'
>>> print u'\xe9'
é
>>> print '\xe9'

>>> 

对于 \xe9,Python 会采用您的默认编码(Ascii),因此打印出来的是……一片空白。

You have specified an encoding by entering an explicit Unicode string. Compare the results of not using the u prefix.

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> '\xe9'
'\xe9'
>>> u'\xe9'
u'\xe9'
>>> print u'\xe9'
é
>>> print '\xe9'

>>> 

In the case of \xe9 then Python assumes your default encoding (Ascii), thus printing … something blank.


回答 4

这个对我有用:

import sys
stdin, stdout = sys.stdin, sys.stdout
reload(sys)
sys.stdin, sys.stdout = stdin, stdout
sys.setdefaultencoding('utf-8')

It works for me:

import sys
stdin, stdout = sys.stdin, sys.stdout
reload(sys)
sys.stdin, sys.stdout = stdin, stdout
sys.setdefaultencoding('utf-8')

回答 5

根据 Python 默认/隐式字符串编码和转换:

  • 打印 unicode 时,会使用 <file>.encoding 对其进行编码。
    • 当 encoding 未设置时,unicode 会被隐式转换为 str(由于所用编解码器是 sys.getdefaultencoding(),即 ascii,任何非 ASCII 的本国字符都会导致 UnicodeEncodeError)
    • 对于标准流,encoding 是从环境中推断的。tty 流通常会设置它(来自终端的区域设置),但管道很可能不会设置
      • 因此,print u'\xe9' 在输出到终端时很可能成功,而在重定向时很可能失败。一种解决方案是在打印前用所需的编码对字符串调用 encode()(见下方示意)。
  • 打印 str 时,字节会原样发送到流中。终端显示什么字形取决于其区域设置。

As per Python default/implicit string encodings and conversions :

  • When printing unicode, it’s encoded with <file>.encoding.
    • when the encoding is not set, the unicode is implicitly converted to str (since the codec for that is sys.getdefaultencoding(), i.e. ascii, any national characters would cause a UnicodeEncodeError)
    • for standard streams, the encoding is inferred from the environment. It’s typically set for tty streams (from the terminal’s locale settings), but is likely to not be set for pipes
      • so a print u'\xe9' is likely to succeed when the output is to a terminal, and fail if it’s redirected. A solution is to encode() the string with the desired encoding before printing (see the sketch below).
  • When printing str, the bytes are sent to the stream as is. What glyphs the terminal shows will depend on its locale settings.
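
A minimal sketch of that last workaround (Python 2): detect redirection via sys.stdout.encoding and encode explicitly, here assuming UTF-8 as the desired output encoding:

import sys

s = u'\xe9'
if sys.stdout.encoding is None:     # stdout is a pipe or a file
    sys.stdout.write(s.encode('utf-8') + '\n')
else:
    print s                         # a terminal: the implicit encoding works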

将Unicode文本写入文本文件?

问题:将Unicode文本写入文本文件?

我正在从Google文档中提取数据,进行处理,然后将其写入文件(最终我将其粘贴到Wordpress页面中)。

它具有一些非ASCII符号。如何将这些安全地转换为可以在HTML源代码中使用的符号?

目前,我正在将所有内容都转换为Unicode,将它们全部组合成Python字符串,然后执行以下操作:

import codecs
f = codecs.open('out.txt', mode="w", encoding="iso-8859-1")
f.write(all_html.encode("iso-8859-1", "replace"))

最后一行存在编码错误:

UnicodeDecodeError:’ascii’编解码器无法解码位置12286的字节0xa0:序数不在范围内(128)

部分解决方案:

此Python运行无错误:

row = [unicode(x.strip()) if x is not None else u'' for x in row]
all_html = row[0] + "<br/>" + row[1]
f = open('out.txt', 'w')
f.write(all_html.encode("utf-8"))

但是,如果我打开实际的文本文件,则会看到很多符号,例如:

Qur’an 

也许我需要写文本文件以外的东西?

I’m pulling data out of a Google doc, processing it, and writing it to a file (that eventually I will paste into a WordPress page).

It has some non-ASCII symbols. How can I convert these safely to symbols that can be used in HTML source?

Currently I’m converting everything to Unicode on the way in, joining it all together in a Python string, then doing:

import codecs
f = codecs.open('out.txt', mode="w", encoding="iso-8859-1")
f.write(all_html.encode("iso-8859-1", "replace"))

There is an encoding error on the last line:

UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xa0 in position 12286: ordinal not in range(128)

Partial solution:

This Python runs without an error:

row = [unicode(x.strip()) if x is not None else u'' for x in row]
all_html = row[0] + "<br/>" + row[1]
f = open('out.txt', 'w')
f.write(all_html.encode("utf-8"))

But then if I open the actual text file, I see lots of symbols like:

Qur’an 

Maybe I need to write to something other than a text file?


回答 0

在最初获取数据时就将其解码为 unicode 对象,在输出时再根据需要进行编码,从而尽可能只与 unicode 对象打交道。

如果您的字符串实际上是unicode对象,则需要先将其转换为unicode编码的字符串对象,然后再将其写入文件:

foo = u'Δ, Й, ק, ‎ م, ๗, あ, 叶, 葉, and 말.'
f = open('test', 'w')
f.write(foo.encode('utf8'))
f.close()

再次读取该文件时,您将获得一个unicode编码的字符串,可以将其解码为unicode对象:

f = file('test', 'r')
print f.read().decode('utf8')

Deal exclusively with unicode objects as much as possible by decoding things to unicode objects when you first get them and encoding them as necessary on the way out.

If your string is actually a unicode object, you’ll need to convert it to a unicode-encoded string object before writing it to a file:

foo = u'Δ, Й, ק, ‎ م, ๗, あ, 叶, 葉, and 말.'
f = open('test', 'w')
f.write(foo.encode('utf8'))
f.close()

When you read that file again, you’ll get a unicode-encoded string that you can decode to a unicode object:

f = file('test', 'r')
print f.read().decode('utf8')

回答 1

在 Python 2.6+ 中,您可以使用 io.open(),它在 Python 3 上就是默认的(即内置 open()):

import io

with io.open(filename, 'w', encoding=character_encoding) as file:
    file.write(unicode_text)

如果您需要增量地写入文本,它会更方便(无需多次调用 unicode_text.encode(character_encoding))。与 codecs 模块不同,io 模块对通用换行符有适当的支持。

In Python 2.6+, you could use io.open() that is default (builtin open()) on Python 3:

import io

with io.open(filename, 'w', encoding=character_encoding) as file:
    file.write(unicode_text)

It might be more convenient if you need to write the text incrementally (you don’t need to call unicode_text.encode(character_encoding) multiple times). Unlike codecs module, io module has a proper universal newlines support.
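
A minimal sketch of that incremental-write convenience (the file name and chunks are made up for illustration):

import io

chunks = [u'\u0394', u', ', u'\u0419', u'\n']   # GREEK DELTA, CYRILLIC SHORT I
with io.open('out.txt', 'w', encoding='utf-8') as f:
    for chunk in chunks:
        f.write(chunk)   # no manual chunk.encode('utf-8') needed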


回答 2

Unicode字符串处理已在Python 3中标准化。

  1. 字符已经以Unicode(32位)存储在内存中
  2. 您只需要以utf-8打开文件
    (从内存到文件自动执行32位Unicode到可变字节长度的utf-8转换)。

    out1 = "(嘉南大圳 ㄐㄧㄚ ㄋㄢˊ ㄉㄚˋ ㄗㄨㄣˋ )"
    fobj = open("t1.txt", "w", encoding="utf-8")
    fobj.write(out1)
    fobj.close()
    

Unicode string handling is already standardized in Python 3.

  1. char’s are already stored in Unicode (32-bit) in memory
  2. You only need to open file in utf-8
    (32-bit Unicode to variable-byte-length utf-8 conversion is automatically performed from memory to file.)

    out1 = "(嘉南大圳 ㄐㄧㄚ ㄋㄢˊ ㄉㄚˋ ㄗㄨㄣˋ )"
    fobj = open("t1.txt", "w", encoding="utf-8")
    fobj.write(out1)
    fobj.close()
    

回答 3

由 codecs.open 打开的文件对象接收 unicode 数据,将其编码为 iso-8859-1 并写入文件。然而,您试图写入的并不是 unicode;您自己先把 unicode 编码成了 iso-8859-1。这正是 unicode.encode 方法的作用,而对 unicode 字符串编码的结果是一个字节字符串(str 类型)。

您应该要么使用普通的 open() 并自己对 unicode 进行编码,要么(通常是更好的主意)使用 codecs.open() 并且不要自己对数据编码。

The file opened by codecs.open is a file that takes unicode data, encodes it in iso-8859-1 and writes it to the file. However, what you try to write isn’t unicode; you take unicode and encode it in iso-8859-1 yourself. That’s what the unicode.encode method does, and the result of encoding a unicode string is a bytestring (a str type.)

You should either use normal open() and encode the unicode yourself, or (usually a better idea) use codecs.open() and not encode the data yourself.
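
A minimal sketch of the two alternatives, assuming all_html is a unicode object whose characters fit in iso-8859-1:

import codecs

all_html = u'caf\xe9'

# Option 1: normal open(), encode the unicode yourself.
f = open('out.txt', 'w')
f.write(all_html.encode('iso-8859-1'))
f.close()

# Option 2 (usually the better idea): codecs.open(), write the unicode as is.
f = codecs.open('out.txt', mode='w', encoding='iso-8859-1')
f.write(all_html)
f.close()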


回答 4

前言:您的查看器会工作吗?

确保查看器/编辑器/终端(无论与utf-8编码的文件进行交互)都可以读取该文件。这在Windows(例如记事本)上经常是一个问题。

将Unicode文本写入文本文件?

在 Python 2 中,使用来自 io 模块的 open(这与 Python 3 中的内置 open 相同):

import io

通常,最佳实践是使用 UTF-8 写入文件(用 utf-8 时我们甚至不必担心字节顺序)。

encoding = 'utf-8'

utf-8是最现代且通用的编码-适用于所有Web浏览器,大多数文本编辑器(如果有问题,请参阅设置)和大多数终端/外壳。

在 Windows 上,如果您只能用记事本(或其他受限的查看器)查看输出,可以尝试 utf-16le。

encoding = 'utf-16le' # sorry, Windows users... :(

只需使用上下文管理器打开它,然后将您的unicode字符写出来:

with io.open(filename, 'w', encoding=encoding) as f:
    f.write(unicode_object)

使用许多Unicode字符的示例

这是一个示例,它尝试把数字表示(整数)最多三个字节宽的每个可能字符(4 字节是上限,但那就有点过了)映射为经过编码的可打印输出,并在可能时附上其名称(把它放进名为 uni.py 的文件中):

from __future__ import print_function
import io
from unicodedata import name, category
from curses.ascii import controlnames
from collections import Counter

try: # use these if Python 2
    unicode_chr, range = unichr, xrange
except NameError: # Python 3
    unicode_chr = chr

exclude_categories = set(('Co', 'Cn'))
counts = Counter()
control_names = dict(enumerate(controlnames))
with io.open('unidata', 'w', encoding='utf-8') as f:
    for x in range((2**8)**3): 
        try:
            char = unicode_chr(x)
        except ValueError:
            continue # can't map to unicode, try next x
        cat = category(char)
        counts.update((cat,))
        if cat in exclude_categories:
            continue # get rid of noise & greatly shorten result file
        try:
            uname = name(char)
        except ValueError: # probably control character, don't use actual
            uname = control_names.get(x, '')
            f.write(u'{0:>6x} {1}    {2}\n'.format(x, cat, uname))
        else:
            f.write(u'{0:>6x} {1}  {2}  {3}\n'.format(x, cat, char, uname))
# may as well describe the types we logged.
for cat, count in counts.items():
    print('{0} chars of category, {1}'.format(count, cat))

此过程应大约运行一分钟,您可以查看数据文件,如果文件查看器可以显示unicode,则可以看到它。有关类别的信息可在此处找到。根据计数,我们可以通过排除没有关联符号的Cn和Co类别来改善结果。

$ python uni.py

它将显示十六进制映射,category,symbol(除非无法获得名称,因此可能是控制字符)以及该符号的名称。例如

我建议在 Unix 或 Cygwin 上使用 less(不要把整个文件 print/cat 到输出中):

$ less unidata

例如,它将显示类似于以下使用Python 2(unicode 5.2)从中采样的行:

     0 Cc NUL
    20 Zs     SPACE
    21 Po  !  EXCLAMATION MARK
    b6 So  ¶  PILCROW SIGN
    d0 Lu  Ð  LATIN CAPITAL LETTER ETH
   e59 Nd  ๙  THAI DIGIT NINE
  2887 So  ⢇  BRAILLE PATTERN DOTS-1238
  bc13 Lo  밓  HANGUL SYLLABLE MIH
  ffeb Sm  →  HALFWIDTH RIGHTWARDS ARROW

我这个来自 Anaconda 的 Python 3.5 带有 unicode 8.0,我想大多数 Python 3 版本都是如此。

Preface: will your viewer work?

Make sure your viewer/editor/terminal (however you are interacting with your utf-8 encoded file) can read the file. This is frequently an issue on Windows, for example, Notepad.

Writing Unicode text to a text file?

In Python 2, use open from the io module (this is the same as the builtin open in Python 3):

import io

Best practice, in general, use UTF-8 for writing to files (we don’t even have to worry about byte-order with utf-8).

encoding = 'utf-8'

utf-8 is the most modern and universally usable encoding – it works in all web browsers, most text-editors (see your settings if you have issues) and most terminals/shells.

On Windows, you might try utf-16le if you’re limited to viewing output in Notepad (or another limited viewer).

encoding = 'utf-16le' # sorry, Windows users... :(

And just open it with the context manager and write your unicode characters out:

with io.open(filename, 'w', encoding=encoding) as f:
    f.write(unicode_object)

Example using many Unicode characters

Here’s an example that attempts to map every possible character up to three bytes wide (4 is the max, but that would be going a bit far) from the digital representation (in integers) to an encoded printable output, along with its name, if possible (put this into a file called uni.py):

from __future__ import print_function
import io
from unicodedata import name, category
from curses.ascii import controlnames
from collections import Counter

try: # use these if Python 2
    unicode_chr, range = unichr, xrange
except NameError: # Python 3
    unicode_chr = chr

exclude_categories = set(('Co', 'Cn'))
counts = Counter()
control_names = dict(enumerate(controlnames))
with io.open('unidata', 'w', encoding='utf-8') as f:
    for x in range((2**8)**3): 
        try:
            char = unicode_chr(x)
        except ValueError:
            continue # can't map to unicode, try next x
        cat = category(char)
        counts.update((cat,))
        if cat in exclude_categories:
            continue # get rid of noise & greatly shorten result file
        try:
            uname = name(char)
        except ValueError: # probably control character, don't use actual
            uname = control_names.get(x, '')
            f.write(u'{0:>6x} {1}    {2}\n'.format(x, cat, uname))
        else:
            f.write(u'{0:>6x} {1}  {2}  {3}\n'.format(x, cat, char, uname))
# may as well describe the types we logged.
for cat, count in counts.items():
    print('{0} chars of category, {1}'.format(count, cat))

This should run in the order of about a minute, and you can view the data file, and if your file viewer can display unicode, you’ll see it. Information about the categories can be found here. Based on the counts, we can probably improve our results by excluding the Cn and Co categories, which have no symbols associated with them.

$ python uni.py

It will display the hexadecimal mapping, category, symbol (unless can’t get the name, so probably a control character), and the name of the symbol. e.g.

I recommend less on Unix or Cygwin (don’t print/cat the entire file to your output):

$ less unidata

e.g. will display similar to the following lines which I sampled from it using Python 2 (unicode 5.2):

     0 Cc NUL
    20 Zs     SPACE
    21 Po  !  EXCLAMATION MARK
    b6 So  ¶  PILCROW SIGN
    d0 Lu  Ð  LATIN CAPITAL LETTER ETH
   e59 Nd  ๙  THAI DIGIT NINE
  2887 So  ⢇  BRAILLE PATTERN DOTS-1238
  bc13 Lo  밓  HANGUL SYLLABLE MIH
  ffeb Sm  →  HALFWIDTH RIGHTWARDS ARROW

My Python 3.5 from Anaconda has unicode 8.0, I would presume most 3’s would.


回答 5

如何将unicode字符打印到文件中:

将此保存到文件:foo.py:

#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys 
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print(u'e with obfuscation: é')

运行它,并将输出管道传输到文件:

python foo.py > tmp.txt

打开tmp.txt并查看内部,您会看到以下内容:

el@apollo:~$ cat tmp.txt 
e with obfuscation: é

因此,您已将带有混淆标记的unicode e保存到文件中。

How to print unicode characters into a file:

Save this to file: foo.py:

#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys 
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print(u'e with obfuscation: é')

Run it and pipe output to file:

python foo.py > tmp.txt

Open tmp.txt and look inside, you see this:

el@apollo:~$ cat tmp.txt 
e with obfuscation: é

Thus you have saved unicode e with a obfuscation mark on it to a file.


回答 6

当您尝试对非 unicode 字符串进行编码时,就会出现该错误:Python 会先假定它是纯 ASCII 并尝试对其解码。有两种可能性:

  1. 您把它编码成了字节串,但由于使用了 codecs.open,write 方法期望的是一个 unicode 对象。于是您先编码,它又尝试再解码一次。请改用 f.write(all_html)。
  2. 实际上,all_html 并不是 unicode 对象。当您执行 .encode(...) 时,Python 会先尝试对它解码。

That error arises when you try to encode a non-unicode string: it tries to decode it, assuming it’s in plain ASCII. There are two possibilities:

  1. You’re encoding it to a bytestring, but because you’ve used codecs.open, the write method expects a unicode object. So you encode it, and it tries to decode it again. Try: f.write(all_html) instead.
  2. all_html is not, in fact, a unicode object. When you do .encode(...), it first tries to decode it.
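
A minimal sketch reproducing possibility 1 under Python 2 (the sample string is an assumption): the file from codecs.open expects unicode, so handing it already-encoded bytes makes Python first re-decode them with the default ascii codec, which fails on the \xe9 byte:

import codecs

all_html = u'caf\xe9'
f = codecs.open('out.txt', mode='w', encoding='iso-8859-1')
# f.write(all_html.encode('iso-8859-1'))   # raises UnicodeDecodeError
f.write(all_html)   # correct: hand the unicode object to the codecs writer
f.close()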

回答 7

如果用python3编写

>>> a = u'bats\u00E0'
>>> print(a)
batsà
>>> f = open("/tmp/test", "w")
>>> f.write(a)
>>> f.close()
>>> data = open("/tmp/test").read()
>>> data
'batsà'

如果使用python2编写:

>>> a = u'bats\u00E0'
>>> f = open("/tmp/test", "w")
>>> f.write(a)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

为避免此错误,您将必须使用编解码器“ utf-8”将其编码为字节,如下所示:

>>> f.write(a.encode("utf-8"))
>>> f.close()

并在使用编解码器“ utf-8”读取时解码数据:

>>> data = open("/tmp/test").read()
>>> data.decode("utf-8")
u'bats\xe0'

而且,如果您尝试在此字符串上执行打印,它将使用“ utf-8”编解码器自动解码,如下所示

>>> print a
batsà

In case of writing in python3

>>> a = u'bats\u00E0'
>>> print(a)
batsà
>>> f = open("/tmp/test", "w")
>>> f.write(a)
>>> f.close()
>>> data = open("/tmp/test").read()
>>> data
'batsà'

In case of writing in python2:

>>> a = u'bats\u00E0'
>>> f = open("/tmp/test", "w")
>>> f.write(a)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

To avoid this error you would have to encode it to bytes using codecs “utf-8” like this:

>>> f.write(a.encode("utf-8"))
>>> f.close()

and decode the data while reading using the codecs “utf-8”:

>>> data = open("/tmp/test").read()
>>> data.decode("utf-8")
u'bats\xe0'

And also if you try to execute print on this string it will automatically decode using the “utf-8” codecs like this

>>> print a
batsà

Python __str__与__unicode__

问题:Python __str__与__unicode__

关于什么时候应该实现 __str__() 而不是 __unicode__(),Python 有没有约定?我见过类重写 __unicode__() 的频率高于 __str__(),但这似乎并不一致。在什么情况下实现其中一个比另一个更好,有具体规则吗?两个方法都实现是必要的/好的做法吗?

Is there a python convention for when you should implement __str__() versus __unicode__(). I’ve seen classes override __unicode__() more frequently than __str__() but it doesn’t appear to be consistent. Are there specific rules when it is better to implement one versus the other? Is it necessary/good practice to implement both?


回答 0

__str__() 是旧方法,它返回字节。__unicode__() 是新的首选方法,它返回字符。这两个名称有点混乱,但在 2.x 中,出于兼容性原因我们只能沿用它们。通常,您应该把所有字符串格式化逻辑都放在 __unicode__() 中,并创建一个存根 __str__() 方法:

def __str__(self):
    return unicode(self).encode('utf-8')

在 3.0 中,str 包含的是字符,因此相同的这两个方法被命名为 __bytes__() 和 __str__()。它们的行为符合预期。

__str__() is the old method — it returns bytes. __unicode__() is the new, preferred method — it returns characters. The names are a bit confusing, but in 2.x we’re stuck with them for compatibility reasons. Generally, you should put all your string formatting in __unicode__(), and create a stub __str__() method:

def __str__(self):
    return unicode(self).encode('utf-8')

In 3.0, str contains characters, so the same methods are named __bytes__() and __str__(). These behave as expected.
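
A minimal sketch of the recommended pattern (Python 2; the class and its text are made up for illustration):

class Greeting(object):
    def __unicode__(self):
        return u'caf\xe9'                     # all string formatting lives here

    def __str__(self):
        return unicode(self).encode('utf-8')  # stub: delegate, then encode

g = Greeting()
print repr(unicode(g))   # u'caf\xe9'
print repr(str(g))       # 'caf\xc3\xa9'  (UTF-8 bytes)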


回答 1

如果我不特别在意给定类字符串化的微优化,我会始终只实现 __unicode__,因为它更通用。当我确实在意这类细微的性能问题时(这是例外,不是常规),只实现 __str__(当我能证明字符串化输出中绝不会出现非 ASCII 字符时)或者两者都实现(当两者都可能时)可能会有帮助。

我认为这些是可靠的原则,但在实践中,很常见的情况是不做任何证明就“知道”只会有 ASCII 字符(例如,字符串形式只包含数字、标点,也许还有一个简短的 ASCII 名称;-),此时直接采用“只写 __str__”的做法相当典型(但如果我所在的编程团队提议用一条本地准则来避免这种做法,我会给该提案 +1,因为在这些问题上很容易出错,而且“过早的优化是编程中万恶之源”;-)。

If I didn’t especially care about micro-optimizing stringification for a given class I’d always implement __unicode__ only, as it’s more general. When I do care about such minute performance issues (which is the exception, not the rule), having __str__ only (when I can prove there never will be non-ASCII characters in the stringified output) or both (when both are possible), might help.

These I think are solid principles, but in practice it’s very common to KNOW there will be nothing but ASCII characters without doing effort to prove it (e.g. the stringified form only has digits, punctuation, and maybe a short ASCII name;-) in which case it’s quite typical to move on directly to the “just __str__” approach (but if a programming team I worked with proposed a local guideline to avoid that, I’d be +1 on the proposal, as it’s easy to err in these matters AND “premature optimization is the root of all evil in programming”;-).


回答 2

随着世界变得越来越小,您遇到的任何字符串最终都有可能包含 Unicode。因此,对于任何新应用,您至少应提供 __unicode__()。至于是否还要重写 __str__(),那就只是品味问题了。

With the world getting smaller, chances are that any string you encounter will contain Unicode eventually. So for any new apps, you should at least provide __unicode__(). Whether you also override __str__() is then just a matter of taste.


回答 3

如果您在Django中同时使用python2和python3,则建议使用python_2_unicode_compatible装饰器:

Django 提供了一种简单的方法来定义可同时在 Python 2 和 3 上工作的 __str__() 和 __unicode__() 方法:您必须定义一个返回文本的 __str__() 方法,并应用 python_2_unicode_compatible() 装饰器。

如前面对另一个答案的评论中所述,某些版本的 future.utils 也支持此装饰器。在我的系统上,我需要为 python2 安装较新的 future 模块,并为 python3 安装 future。之后,这里是一个可运行的示例:

#! /usr/bin/env python

from future.utils import python_2_unicode_compatible
from sys import version_info

@python_2_unicode_compatible
class SomeClass():
    def __str__(self):
        return "Called __str__"


if __name__ == "__main__":
    some_inst = SomeClass()
    print(some_inst)
    if (version_info > (3,0)):
        print("Python 3 does not support unicode()")
    else:
        print(unicode(some_inst))

这是示例输出(其中venv2 / venv3是virtualenv实例):

~/tmp$ ./venv3/bin/python3 demo_python_2_unicode_compatible.py 
Called __str__
Python 3 does not support unicode()

~/tmp$ ./venv2/bin/python2 demo_python_2_unicode_compatible.py 
Called __str__
Called __str__

If you are working in both python2 and python3 in Django, I recommend the python_2_unicode_compatible decorator:

Django provides a simple way to define str() and unicode() methods that work on Python 2 and 3: you must define a str() method returning text and to apply the python_2_unicode_compatible() decorator.

As noted in earlier comments on another answer, some versions of future.utils also provide this decorator. On my system, I needed to install a newer future module for Python 2 and install future for Python 3. After that, here is a working example:

#! /usr/bin/env python

from future.utils import python_2_unicode_compatible
from sys import version_info

@python_2_unicode_compatible
class SomeClass(object):
    def __str__(self):
        # On Python 2 the decorator moves this method to __unicode__()
        # and installs a __str__() that encodes the result to UTF-8.
        return "Called __str__"


if __name__ == "__main__":
    some_inst = SomeClass()
    print(some_inst)
    if version_info >= (3, 0):
        print("Python 3 does not support unicode()")
    else:
        print(unicode(some_inst))

Here is example output (where venv2/venv3 are virtualenv instances):

~/tmp$ ./venv3/bin/python3 demo_python_2_unicode_compatible.py 
Called __str__
Python 3 does not support unicode()

~/tmp$ ./venv2/bin/python2 demo_python_2_unicode_compatible.py 
Called __str__
Called __str__
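
If you are inside a Django project, the decorator is also available from Django itself in versions that still support Python 2 (it was removed in Django 3.0). A minimal sketch, with a hypothetical Tag class standing in for a model:

# -*- coding: utf-8 -*-
from django.utils.encoding import python_2_unicode_compatible

@python_2_unicode_compatible
class Tag(object):
    def __init__(self, name):
        self.name = name

    def __str__(self):
        # Return text; on Python 2 the decorator renames this to
        # __unicode__() and adds a __str__() that encodes to UTF-8.
        return u"Tag: %s" % self.name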

Answer 4

Python 2: Implement __str__() only, and return a unicode.

When __unicode__() is omitted and someone calls unicode(o) or u"%s"%o, Python calls o.__str__() and converts to unicode using the system encoding. (See documentation of __unicode__().)

The opposite is not true. If you implement __unicode__() but not __str__(), then when someone calls str(o) or "%s"%o, Python returns repr(o).


Rationale

Why would it work to return a unicode from __str__()?
If __str__() returns a unicode, Python automatically converts it to str using the system encoding.

What’s the benefit?
① It frees you from worrying about what the system encoding is (i.e., locale.getpreferredencoding(…)). Not only is that messy; personally, I think it's something the system should take care of anyway. ② If you are careful, your code may come out cross-compatible with Python 3, in which __str__() returns unicode.

Isn’t it deceptive to return a unicode from a function called __str__()?
A little. However, you might be already doing it. If you have from __future__ import unicode_literals at the top of your file, there’s a good chance you’re returning a unicode without even knowing it.

What about Python 3?
Python 3 does not use __unicode__(). However, if you implement __str__() so that it returns unicode under either Python 2 or Python 3, then that part of your code will be cross-compatible.
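
A minimal sketch of that cross-compatible style (the Greeting class is hypothetical); note that Python 2's implicit unicode-to-str conversion uses the ASCII default encoding, so this stays painless only while the text is ASCII:

from __future__ import unicode_literals

class Greeting(object):
    def __str__(self):
        # A unicode object on Python 2 (because of the __future__
        # import) and an ordinary str on Python 3.
        return "hello"

g = Greeting()
print("%s" % g)  # works unchanged on Python 2 and Python 3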

What if I want unicode(o) to be substantively different from str()?
Implement both __str__() (possibly returning str) and __unicode__(). I imagine this would be rare, but you might want substantively different output (e.g., ASCII versions of special characters, like ":)" for u"☺").

I realize some may find this controversial.
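
A sketch of that rare case; the Face class is made up for illustration:

# -*- coding: utf-8 -*-
class Face(object):
    def __str__(self):
        # Plain-ASCII stand-in for byte-oriented contexts.
        return ":)"

    def __unicode__(self):
        # The real character for unicode-aware contexts.
        return u"\u263a"  # ☺

f = Face()
print str(f)      # :)
print unicode(f)  # ☺ (if the terminal encoding can represent it)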


Answer 5

It's worth pointing out, for those unfamiliar with the __unicode__ method, some of the default behaviors surrounding it in Python 2.x, especially when it is defined side by side with __str__.

class A:
    # An old-style class (no explicit base), as in the original example;
    # its default str()/repr() is "<__main__.A instance at ...>".
    def __init__(self):
        self.x = 123
        self.y = 23.3

    #def __str__(self):
    #    return "STR      {}      {}".format(self.x, self.y)
    def __unicode__(self):
        return u"UNICODE  {}      {}".format(self.x, self.y)

a1 = A()
a2 = A()

print("__repr__ checks")
print(a1)
print(a2)

print("\n__str__ vs __unicode__ checks")
print(str(a1))
print(unicode(a1))
print("{}".format(a1))
print(u"{}".format(a1))

yields the following console output…

__repr__ checks
<__main__.A instance at 0x103f063f8>
<__main__.A instance at 0x103f06440>

__str__ vs __unicode__ checks
<__main__.A instance at 0x103f063f8>
UNICODE  123      23.3
<__main__.A instance at 0x103f063f8>
UNICODE  123      23.3

Now, when I uncomment the __str__ method, the output becomes:

__repr__ checks
STR      123      23.3
STR      123      23.3

__str__ vs __unicode__ checks
STR      123      23.3
UNICODE  123      23.3
STR      123      23.3
UNICODE  123      23.3

In other words, when only __unicode__ is defined, the byte-oriented paths (str(), plain "{}".format()) fall back to the default repr, while the unicode-aware paths (unicode(), u"{}".format()) find __unicode__. Once __str__ is defined as well, it is used everywhere except by the explicitly unicode-aware calls.