Unicode (UTF-8) reading and writing to files in Python

Question: Unicode (UTF-8) reading and writing to files in Python

I’m having some brain failure in understanding reading and writing text to a file (Python 2.4).

# The string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)

("u'Capit\xe1n'", "'Capit\xc3\xa1n'")

print ss, ss8
print >> open('f1','w'), ss8

>>> file('f1').read()
'Capit\xc3\xa1n\n'

So I type Capit\xc3\xa1n into my favorite editor, in file f2.

Then:

>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'

What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I’m missing. What does one type into text files to get proper conversions?

What I’m truly failing to grok here, is what the point of the UTF-8 representation is, if you can’t actually get Python to recognize it, when it comes from outside. Maybe I should just JSON dump the string, and use that instead, since that has an asciiable representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode, when coming in from a file? If so, how do I get it?

>>> print simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'

Answer 0

In the notation

u'Capit\xe1n\n'

the “\xe1” represents just one character. “\x” tells you that “e1” is in hexadecimal. When you write

Capit\xc3\xa1n

into your file you have “\xc3” in it. Those are 4 bytes and in your code you read them all. You can see this when you display them:

>>> open('f2').read()
'Capit\\xc3\\xa1n\n'

You can see that the backslash is escaped by a backslash. So you have four bytes in your string: “\”, “x”, “c” and “3”.

Edit:

As others pointed out in their answers you should just enter the characters in the editor and your editor should then handle the conversion to UTF-8 and save it.

If you actually have a string in this format you can use the string_escape codec to decode it into a normal string:

In [15]: print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán

The result is a string that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. If you want to have a unicode string you have to decode again with UTF-8.

To your edit: you don’t have UTF-8 in your file. To see what it would actually look like:

s = u'Capit\xe1n\n'
sutf8 = s.encode('UTF-8')
open('utf-8.out', 'w').write(sutf8)

Compare the content of the file utf-8.out to the content of the file you saved with your editor.
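
The difference between the two files can also be checked by counting bytes. The following is a minimal sketch in Python 3 syntax (an assumption, since the question targets Python 2.4); the variable names are chosen only for illustration.

```python
# What Python produces when the text is encoded as UTF-8.
real_utf8 = "Capit\u00e1n".encode("utf-8")

# What an editor saves when the escape sequences are typed literally.
# The raw bytes literal keeps every backslash as an ordinary character.
typed_literally = rb"Capit\xc3\xa1n"

print(len(real_utf8))         # 8: the accented letter takes two bytes
print(len(typed_literally))   # 14: each typed "\xNN" takes four bytes
print(typed_literally[5:9])   # b'\\xc3': backslash, "x", "c", "3"
```

The second file is plain ASCII that merely looks like Python escape syntax, which is why decoding it as UTF-8 leaves the backslashes in place.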


Answer 1

Rather than mess with the encode and decode methods I find it easier to specify the encoding when opening the file. The io module (added in Python 2.6) provides an io.open function, which has an encoding parameter.

Use the open method from the io module.

>>> import io
>>> f = io.open("test", mode="r", encoding="utf-8")

Then after calling f’s read() function, a decoded Unicode object is returned.

>>> f.read()
u'Capit\xe1l\n\n'

Note that in Python 3, the io.open function is an alias for the built-in open function. The built-in open function only supports the encoding argument in Python 3, not Python 2.

Edit: Previously this answer recommended the codecs module. The codecs module can cause problems when mixing read() and readline(), so this answer now recommends the io module instead.

Use the open method from the codecs module.

>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")

Then after calling f’s read() function, a decoded Unicode object is returned.

>>> f.read()
u'Capit\xe1l\n\n'

If you know the encoding of a file, using the codecs package is going to be much less confusing.

See http://docs.python.org/library/codecs.html#codecs.open
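
As a rough illustration of the io.open approach, here is a small round trip written so that it should run on Python 3 (and, as far as I can tell, on Python 2.6+ as well). The scratch file path is hypothetical and created only for the demonstration.

```python
import io
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "test.txt")

with io.open(path, mode="w", encoding="utf-8") as f:
    f.write(u"Capit\xe1n\n")

with io.open(path, mode="r", encoding="utf-8") as f:
    text = f.read()   # already decoded text, no .decode() call needed

print(text == u"Capit\xe1n\n")   # True
print(io.open is open)           # True on Python 3, False on Python 2
```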


Answer 2

Now all you need in Python3 is open(Filename, 'r', encoding='utf-8')

[Edit on 2016-02-10 for requested clarification]

Python3 added the encoding parameter to its open function. The following information about the open function is gathered from here: https://docs.python.org/3/library/functions.html#open

open(file, mode='r', buffering=-1, 
      encoding=None, errors=None, newline=None, 
      closefd=True, opener=None)

Encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.

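So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as UTF-8. (UTF-8 is also the default source-code encoding in Python 3, but the default for open() itself remains platform dependent, as quoted above, so it is safer to pass the encoding explicitly.)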
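
A short sketch of what this looks like in practice, assuming Python 3 and a throwaway file path invented for the example. Reading the file back in binary mode shows the bytes that actually landed on disk.

```python
import os
import tempfile

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "capitan.txt")

# newline="\n" keeps the line ending identical on every platform.
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write("Capit\u00e1n\n")

with open(path, "rb") as f:
    raw = f.read()            # bytes exactly as stored

with open(path, "r", encoding="utf-8") as f:
    text = f.read()           # decoded text

print(raw)                       # b'Capit\xc3\xa1n\n'
print(text == "Capit\u00e1n\n")  # True
```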


Answer 3

So, I’ve found a solution for what I’m looking for, which is:

print open('f2').read().decode('string-escape').decode("utf-8")

There are some unusual codecs that are useful here. This particular reading allows one to take UTF-8 representations from within Python, copy them into an ASCII file, and have them be read in to Unicode. Under the “string-escape” decode, the slashes won’t be doubled.

This allows for the sort of round trip that I was imagining.
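
The string-escape codec exists only in Python 2. For readers on Python 3, a comparable round trip can be sketched with the unicode_escape codec plus a Latin-1 hop; this is one possible substitute under that assumption, not the method used above.

```python
# Bytes as an editor would save them when the escapes are typed literally.
raw = rb"Capit\xc3\xa1n"

# Step 1: interpret the \xNN escapes. Each escape becomes the code point
# with that number, so the intermediate text still looks garbled.
unescaped = raw.decode("unicode_escape")

# Step 2: map those code points back to single bytes (Latin-1 is a 1:1
# mapping for values 0-255), then decode the resulting bytes as UTF-8.
text = unescaped.encode("latin-1").decode("utf-8")

print(text == "Capit\u00e1n")   # True
```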


Answer 4

# -*- encoding: utf-8 -*-

# Convert a file of unknown encoding to UTF-8 (Python 2 on a Unix-like
# system: relies on the commands module and the external `file` utility).

import codecs
import commands

file_location = "jumper.sub"
file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)

file_stream = codecs.open(file_location, 'r', file_encoding)
file_output = codecs.open(file_location+"b", 'w', 'utf-8')

for l in file_stream:
    file_output.write(l)

file_stream.close()
file_output.close()
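
The script above depends on the Python 2 commands module and the external file utility. A portable variant can be sketched as follows, under the assumption that the source encoding is already known and passed in; convert_to_utf8 and its parameters are names made up for this illustration.

```python
import io
import os
import tempfile

def convert_to_utf8(src_path, dst_path, src_encoding, chunk_size=64 * 1024):
    """Re-encode the text in src_path (src_encoding) into dst_path as UTF-8."""
    # newline="" disables newline translation, so line endings are preserved.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                dst.write(chunk)

# Demonstration with a hypothetical Latin-1 input file.
workdir = tempfile.mkdtemp()
src_file = os.path.join(workdir, "jumper.sub")
dst_file = src_file + "b"
with open(src_file, "wb") as f:
    f.write(b"Capit\xe1n\r\n")

convert_to_utf8(src_file, dst_file, "latin-1")

with open(dst_file, "rb") as f:
    result = f.read()
print(result)   # b'Capit\xc3\xa1n\r\n'
```

Reading in chunks keeps memory use bounded for large files; detecting the encoding is left out on purpose because no method is fully reliable.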

Answer 5

Actually, this worked for me for reading a file with UTF-8 encoding in Python 3.2:

import codecs
f = codecs.open('file_name.txt', 'r', 'UTF-8')
for line in f:
    print(line)

Answer 6

To read in a Unicode string and then send it to HTML, I did this:

fileline.decode("utf-8").encode('ascii', 'xmlcharrefreplace')

Useful for Python-powered HTTP servers.
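
For reference, this is roughly what the xmlcharrefreplace error handler produces. The sketch uses Python 3 syntax, where the line read from the file is assumed to be a bytes object.

```python
line = b"Capit\xc3\xa1n"   # bytes as read from a UTF-8 encoded file

# Decode to text, then encode to ASCII; anything outside ASCII is
# replaced by a numeric character reference that HTML understands.
html_safe = line.decode("utf-8").encode("ascii", "xmlcharrefreplace")

print(html_safe)   # b'Capit&#225;n'
```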


Answer 7

You have stumbled over the general problem with encodings: How can I tell which encoding a file is in?

Answer: You can’t unless the file format provides for this. XML, for example, begins with:

<?xml version="1.0" encoding="utf-8"?>

This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path,mode,encoding) which provides the missing bit in Python.

As for your editor, you must check if it offers some way to set the encoding of a file.

The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that’s the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don’t have suitable tools to attach the encoding information to files on the hard disk.

The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that your console can only display ASCII. In order to display Unicode or anything with a character code >= 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the accented character á and save the file).

That said, you can use the Python function eval() to turn an escaped string into a string:

>>> x = eval("'Capit\\xc3\\xa1n\\n'")
>>> x
'Capit\xc3\xa1n\n'
>>> x[5]
'\xc3'
>>> len(x[5])
1

As you can see, the string “\xc3” has been turned into a single character. This is now an 8-bit string, UTF-8 encoded. To get Unicode:

>>> x.decode('utf-8')
u'Capit\xe1n\n'

Gregg Lind asked: I think there are some pieces missing here: the file f2 contains: hex:

0000000: 4361 7069 745c 7863 335c 7861 316e  Capit\xc3\xa1n

codecs.open('f2','rb', 'utf-8'), for example, reads them all in as separate chars (expected). Is there any way to write to a file in ASCII that would work?

Answer: That depends on what you mean. ASCII can’t represent characters > 127. So you need some way to say “the next few characters mean something special” which is what the sequence “\x” does. It says: The next two characters are the code of a single character. “\u” does the same using four characters to encode Unicode up to 0xFFFF (65535).

So you can’t directly write Unicode to ASCII (because ASCII simply doesn’t contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case, you need an 8-bit safe stream.
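
Two ASCII-only representations that Python can read back are sketched below, using Python 3 syntax as an assumption. JSON escaping is the route the question already tried; unicode_escape is the codec counterpart of the \x and \u notation described here.

```python
import json

text = "Capit\u00e1n"

as_json = json.dumps(text)                  # ensure_ascii is True by default
as_escape = text.encode("unicode_escape")   # Python-style escapes, as bytes

print(as_json)     # "Capit\u00e1n"
print(as_escape)   # b'Capit\\xe1n'

# Both forms contain only ASCII and both decode back to the same text.
print(json.loads(as_json) == text)                   # True
print(as_escape.decode("unicode_escape") == text)    # True
```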

Your solution using decode('string-escape') does work, but be aware of how much memory it uses: about three times as much as codecs.open().

Remember that a file is just a sequence of 8-bit bytes. Neither the bits nor the bytes have a meaning. It’s you who says “65 means ‘A’”. Since \xc3\xa1 should become “á” but the computer has no means to know, you must tell it by specifying the encoding which was used when writing the file.
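
The point that bytes carry no meaning on their own can be seen by decoding the same two bytes with two different codecs. A minimal sketch, assuming Python 3:

```python
data = b"\xc3\xa1"

as_utf8 = data.decode("utf-8")       # one character: U+00E1
as_latin1 = data.decode("latin-1")   # two characters: U+00C3 and U+00A1

print(len(as_utf8), len(as_latin1))       # 1 2
print(ascii(as_utf8), ascii(as_latin1))   # '\xe1' '\xc3\xa1'
```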


Answer 8

Besides codecs.open(), one can use io.open(), which works in both Python 2 and Python 3, to read and write Unicode files.

Example:

import io

text = u'á'
encoding = 'utf8'

with io.open('data.txt', 'w', encoding=encoding, newline='\n') as fout:
    fout.write(text)

with io.open('data.txt', 'r', encoding=encoding, newline='\n') as fin:
    text2 = fin.read()

assert text == text2

Answer 9

Well, your favorite text editor does not realize that \xc3\xa1 are supposed to be character literals, but it interprets them as text. That’s why you get the double backslashes in the last line — it’s now a real backslash + xc3, etc. in your file.

If you want to read and write encoded files in Python, best use the codecs module.

Pasting text between the terminal and applications is difficult, because you don’t know which program will interpret your text using which encoding. You could try the following:

>>> s = file("f1").read()
>>> print unicode(s, "Latin-1")
Capitán

Then paste this string into your editor and make sure that it stores it using Latin-1. Under the assumption that the clipboard does not garble the string, the round trip should work.


Answer 10

The \x.. sequence is something that’s specific to Python. It’s not a universal byte escape sequence.

How you actually enter UTF-8-encoded non-ASCII text depends on your OS and/or your editor. Here’s how you do it in Windows. On OS X, to enter an a with an acute accent you can just hit Option + E, then A; almost all text editors on OS X support UTF-8.


Answer 11

You can also improve the original open() function to work with Unicode files by replacing it in place, using the partial function. The beauty of this solution is you don’t need to change any old code. It’s transparent.

import codecs
import functools
open = functools.partial(codecs.open, encoding='utf-8')
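
On Python 3 the same idea can be applied to the built-in open. The sketch below binds the result to a separate, hypothetical name (open_utf8) rather than shadowing open, which is a design choice of this example and not part of the answer.

```python
import functools
import os
import tempfile

# A separate name keeps the built-in open available for binary files.
open_utf8 = functools.partial(open, encoding="utf-8")

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "partial_demo.txt")

with open_utf8(path, "w") as f:
    f.write("Capit\u00e1n\n")

with open_utf8(path) as f:
    text = f.read()

print(text == "Capit\u00e1n\n")   # True
```

One caveat with the built-in: calling open_utf8(path, "rb") raises ValueError, because binary mode does not accept an encoding argument. That is a reason to be cautious about replacing open globally.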

Answer 12

I was trying to parse iCal using Python 2.7.9:

from icalendar import Calendar

But I was getting:

 Traceback (most recent call last):
 File "ical.py", line 92, in parse
    print "{}".format(e[attr])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 7: ordinal not in range(128)

and it was fixed with just:

print "{}".format(e[attr].encode("utf-8"))

(Now it can print liké á böss.)
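
The error comes from an implicit conversion to ASCII. It can be reproduced without icalendar; the following sketch assumes Python 3, where the conversion has to be requested explicitly.

```python
value = "lik\u00e9 \u00e1 b\u00f6ss"

try:
    value.encode("ascii")             # what Python 2 attempted implicitly
except UnicodeEncodeError as exc:
    print(exc.reason)                 # ordinal not in range(128)

encoded = value.encode("utf-8")       # the explicit fix from above
print(len(value), len(encoded))       # 11 14
```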


Answer 13

I found the simplest approach was to change the default encoding of the whole script to UTF-8:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

After that, any open, print, or other statement will just use UTF-8.

Works at least for Python 2.7.9.

Thanks go to https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/ (look at the end).
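
This trick is specific to Python 2. As a quick check for readers on Python 3 (an assumption about the reader's setup, not part of the answer), the following shows that the default is already UTF-8 there and that the setter no longer exists:

```python
import sys

print(sys.getdefaultencoding())             # utf-8
print(hasattr(sys, "setdefaultencoding"))   # False
```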