Writing Unicode text to a text file?

Question: Writing Unicode text to a text file?

I’m pulling data out of a Google doc, processing it, and writing it to a file (that eventually I will paste into a WordPress page).

It has some non-ASCII symbols. How can I convert these safely to symbols that can be used in HTML source?

Currently I’m converting everything to Unicode on the way in, joining it all together in a Python string, then doing:

import codecs
f = codecs.open('out.txt', mode="w", encoding="iso-8859-1")
f.write(all_html.encode("iso-8859-1", "replace"))

There is an encoding error on the last line:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 12286: ordinal not in range(128)

Partial solution:

This Python runs without an error:

row = [unicode(x.strip()) if x is not None else u'' for x in row]
all_html = row[0] + "<br/>" + row[1]
f = open('out.txt', 'w')
f.write(all_html.encode("utf-8"))

But then if I open the actual text file, I see lots of symbols like:

Qur’an 

Maybe I need to write to something other than a text file?


Answer 0

Deal exclusively with unicode objects as much as possible by decoding things to unicode objects when you first get them and encoding them as necessary on the way out.

If your string is actually a unicode object, you’ll need to encode it to a byte string (a Python 2 str, here in UTF-8) before writing it to a file:

foo = u'Δ, Й, ק, ‎ م, ๗, あ, 叶, 葉, and 말.'
f = open('test', 'w')
f.write(foo.encode('utf8'))
f.close()

When you read that file again, you’ll get a UTF-8 encoded byte string that you can decode back to a unicode object:

f = file('test', 'r')
print f.read().decode('utf8')
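
The answer's code is Python 2. As a rough sketch of the same "decode on the way in, encode on the way out" round trip in Python 3, the following writes UTF-8 bytes in binary mode and decodes them on reading. The scratch directory and the sample text are assumptions made for illustration.

```python
import os
import tempfile

# Sample text, chosen for illustration only.
text = 'Δ, Й, あ, 叶, and 말.'

# Hypothetical scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'test')

# Encode on the way out: the file receives UTF-8 bytes.
with open(path, 'wb') as f:
    f.write(text.encode('utf-8'))

# Decode on the way in: the bytes become text again.
with open(path, 'rb') as f:
    raw = f.read()
round_tripped = raw.decode('utf-8')

print(type(raw).__name__, type(round_tripped).__name__)
```

Opening the file in binary mode keeps the encoding step explicit, which mirrors what the Python 2 code above does by hand.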

Answer 1

In Python 2.6+, you can use io.open(), which is the same function as the builtin open() on Python 3:

import io

with io.open(filename, 'w', encoding=character_encoding) as file:
    file.write(unicode_text)

This can be more convenient if you need to write the text incrementally (you don’t need to call unicode_text.encode(character_encoding) multiple times). Unlike the codecs module, the io module has proper universal newlines support.
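
A minimal sketch of the incremental case: several pieces are written one at a time and the file object does the encoding. The file location, the pieces, and the choice of newline='\n' are assumptions for illustration, not part of the answer.

```python
import io
import os
import tempfile

# Hypothetical output path and encoding.
filename = os.path.join(tempfile.mkdtemp(), 'out.txt')
character_encoding = 'utf-8'

# Text pieces produced one at a time (u'' literals also work on Python 2).
pieces = [u'Qur\u2019an', u'<br/>', u'caf\u00e9', u'\n']

# newline='\n' disables newline translation on write.
with io.open(filename, 'w', encoding=character_encoding, newline='\n') as f:
    for piece in pieces:
        f.write(piece)  # no manual .encode() call per piece

# Reading in text mode decodes the bytes back to text.
with io.open(filename, 'r', encoding=character_encoding) as f:
    content = f.read()
```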


Answer 2

Unicode string handling is already standardized in Python 3.

  1. Strings are already stored as Unicode code points in memory (CPython uses 1, 2, or 4 bytes per character internally, depending on the string's contents)
  2. You only need to open the file with encoding="utf-8"
    (the conversion from in-memory Unicode to variable-byte-length UTF-8 is performed automatically when writing to the file.)

    out1 = "(嘉南大圳 ㄐㄧㄚ ㄋㄢˊ ㄉㄚˋ ㄗㄨㄣˋ )"
    fobj = open("t1.txt", "w", encoding="utf-8")
    fobj.write(out1)
    fobj.close()
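
To make "variable-byte-length" concrete, this small sketch encodes four single characters and counts the resulting UTF-8 bytes. The sample characters are arbitrary choices for illustration.

```python
# One character each from the 1-, 2-, 3- and 4-byte UTF-8 ranges.
samples = ['A', 'é', '嘉', '\U0001F600']

# Each item is a single code point in memory...
code_point_counts = [len(ch) for ch in samples]

# ...but occupies a different number of bytes once encoded as UTF-8.
utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]

print(utf8_lengths)  # [1, 2, 3, 4]
```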
    

Answer 3

The file opened by codecs.open is a file that takes unicode data, encodes it in iso-8859-1 and writes it to the file. However, what you try to write isn’t unicode; you take unicode and encode it in iso-8859-1 yourself. That’s what the unicode.encode method does, and the result of encoding a unicode string is a bytestring (a str in Python 2).

You should either use normal open() and encode the unicode yourself, or (usually a better idea) use codecs.open() and not encode the data yourself.
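
As a sketch of the second option in Python 3 terms, the file object below does the encoding and the caller passes text only. It also compares two error handlers for characters that iso-8859-1 cannot represent: 'replace' (used in the question) and 'xmlcharrefreplace', which emits numeric character references that are valid in HTML source. The sample string and file names are assumptions.

```python
import os
import tempfile

# Sample text: U+2019 is not representable in iso-8859-1, U+00E9 is.
all_html = 'Qur\u2019an<br/>caf\u00e9'
out_dir = tempfile.mkdtemp()  # hypothetical scratch directory


def write_latin1(name, errors):
    """Write all_html as iso-8859-1 and return the bytes that hit the disk."""
    path = os.path.join(out_dir, name)
    with open(path, 'w', encoding='iso-8859-1', errors=errors) as f:
        f.write(all_html)  # pass text; do not call .encode() yourself
    with open(path, 'rb') as f:
        return f.read()


replaced = write_latin1('replace.txt', 'replace')
html_safe = write_latin1('charref.txt', 'xmlcharrefreplace')

print(replaced)   # b'Qur?an<br/>caf\xe9'
print(html_safe)  # b'Qur&#8217;an<br/>caf\xe9'
```

With 'replace' the curly apostrophe is lost, whereas the character reference survives a paste into HTML source regardless of the page encoding.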


Answer 4

Preface: will your viewer work?

Make sure your viewer/editor/terminal (however you are interacting with your utf-8 encoded file) can read the file. This is frequently an issue on Windows, for example with Notepad.
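
To illustrate what a mismatched viewer does, this sketch reproduces the garbled text from the question by decoding UTF-8 bytes as Windows-1252. That the asker's viewer used Windows-1252 is an assumption, although the output matches what they saw.

```python
# The text as it should appear (U+2019 is a right single quotation mark).
original = 'Qur\u2019an'

# What ends up in the file when it is written as UTF-8.
utf8_bytes = original.encode('utf-8')

# What a viewer shows if it assumes Windows-1252 instead of UTF-8.
garbled = utf8_bytes.decode('cp1252')

# The file itself is intact: reversing the wrong decoding recovers the text.
repaired = garbled.encode('cp1252').decode('utf-8')

print(garbled)   # Qur’an
print(repaired)  # Qur’an
```

In other words, the file on disk is correct and only the viewer's interpretation is wrong.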

Writing Unicode text to a text file?

In Python 2, use open from the io module (this is the same as the builtin open in Python 3):

import io

As a general best practice, use UTF-8 for writing to files (we don’t even have to worry about byte order with utf-8).

encoding = 'utf-8'

utf-8 is the most modern and universally usable encoding – it works in all web browsers, most text-editors (see your settings if you have issues) and most terminals/shells.

On Windows, you might try utf-16le if you’re limited to viewing output in Notepad (or another limited viewer).

encoding = 'utf-16le' # sorry, Windows users... :(

And just open it with the context manager and write your unicode characters out:

with io.open(filename, 'w', encoding=encoding) as f:
    f.write(unicode_object)
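
The sketch below compares the bytes that different encodings put on disk for the same text. It includes 'utf-8-sig', which is not mentioned in the answer: it prepends a byte order mark that some Windows tools use to detect UTF-8, and whether a given viewer honours it is an assumption. The sample text and file names are hypothetical.

```python
import codecs
import os
import tempfile

text = 'caf\u00e9'
out_dir = tempfile.mkdtemp()  # hypothetical scratch directory


def bytes_on_disk(encoding):
    """Write text with the given encoding and return the raw file bytes."""
    path = os.path.join(out_dir, encoding + '.txt')
    with open(path, 'w', encoding=encoding) as f:
        f.write(text)
    with open(path, 'rb') as f:
        return f.read()


plain_utf8 = bytes_on_disk('utf-8')       # no byte order mark
utf8_sig = bytes_on_disk('utf-8-sig')     # starts with the UTF-8 BOM
utf16_le = bytes_on_disk('utf-16-le')     # two bytes per character here, no BOM

print(plain_utf8, utf8_sig, utf16_le)
```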

Example using many Unicode characters

Here’s an example that attempts to map every possible character up to three bytes wide (4 is the max, but that would be going a bit far) from its numeric representation (an integer) to an encoded printable output, along with its name, if possible (put this into a file called uni.py):

from __future__ import print_function
import io
from unicodedata import name, category
from curses.ascii import controlnames
from collections import Counter

try: # use these if Python 2
    unicode_chr, range = unichr, xrange
except NameError: # Python 3
    unicode_chr = chr

exclude_categories = set(('Co', 'Cn'))
counts = Counter()
control_names = dict(enumerate(controlnames))
with io.open('unidata', 'w', encoding='utf-8') as f:
    for x in range((2**8)**3): 
        try:
            char = unicode_chr(x)
        except ValueError:
            continue # can't map to unicode, try next x
        cat = category(char)
        counts.update((cat,))
        if cat in exclude_categories:
            continue # get rid of noise & greatly shorten result file
        try:
            uname = name(char)
        except ValueError: # probably control character, don't use actual
            uname = control_names.get(x, '')
            f.write(u'{0:>6x} {1}    {2}\n'.format(x, cat, uname))
        else:
            f.write(u'{0:>6x} {1}  {2}  {3}\n'.format(x, cat, char, uname))
# may as well describe the types we logged.
for cat, count in counts.items():
    print('{0} chars of category, {1}'.format(count, cat))

This should run in about a minute. You can then view the data file, and if your file viewer can display Unicode, you’ll see the characters. Information about the categories can be found in the Unicode Character Database documentation (General_Category values). Based on the counts, the script excludes the Cn and Co categories, which have no symbols associated with them.

$ python uni.py

It will write the hexadecimal code point, the category, the symbol (unless no name is available, in which case it is probably a control character), and the name of the symbol.

I recommend less on Unix or Cygwin (don’t print/cat the entire file to your output):

$ less unidata

For example, it will display lines similar to the following, which I sampled from it using Python 2 (unicode 5.2):

     0 Cc NUL
    20 Zs     SPACE
    21 Po  !  EXCLAMATION MARK
    b6 So  ¶  PILCROW SIGN
    d0 Lu  Ð  LATIN CAPITAL LETTER ETH
   e59 Nd  ๙  THAI DIGIT NINE
  2887 So  ⢇  BRAILLE PATTERN DOTS-1238
  bc13 Lo  밓  HANGUL SYLLABLE MIH
  ffeb Sm  →  HALFWIDTH RIGHTWARDS ARROW

My Python 3.5 from Anaconda has unicode 8.0; I would presume most Python 3 versions have it or newer.


Answer 5

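How to print unicode characters into a file (this is a Python 2 approach; in Python 3, sys.stdout already accepts text, and wrapping it this way would fail because the writer passes bytes to a text stream):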

Save this to file: foo.py:

#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys 
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print(u'e with obfuscation: é')

Run it and pipe output to file:

python foo.py > tmp.txt

Open tmp.txt and look inside, you see this:

el@apollo:~$ cat tmp.txt 
e with obfuscation: é

Thus you have saved the Unicode character é (an e with an accent mark on it) to a file.


Answer 6

That error arises when you try to encode a non-unicode string: it tries to decode it, assuming it’s in plain ASCII. There are two possibilities:

  1. You’re encoding it to a bytestring, but because you’ve used codecs.open, the write method expects a unicode object. So you encode it, and it tries to decode it again. Try: f.write(all_html) instead.
  2. all_html is not, in fact, a unicode object. When you do .encode(...), it first tries to decode it.
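
The implicit step described in both cases is specific to Python 2 and cannot be run as-is on Python 3, where bytes have no .encode() method. As a rough simulation, the sketch below performs the hidden ASCII decode explicitly on Python 3 and captures the resulting error. The sample bytes are an assumption.

```python
# Bytes that are already encoded, like a Python 2 str holding iso-8859-1 data.
already_encoded = 'x\u00a0'.encode('iso-8859-1')  # b'x\xa0'

message = None
try:
    # Python 2 ran this decode implicitly before re-encoding a byte string.
    already_encoded.decode('ascii')
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The message names the same codec and byte as the error in the question, which is why a UnicodeDecodeError can appear on a line that only calls .encode().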

Answer 7

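In case of writing in Python 3 (passing the encoding explicitly avoids depending on the platform default):

>>> a = u'bats\u00E0'
>>> print(a)
batsà
>>> f = open("/tmp/test", "w", encoding="utf-8")
>>> f.write(a)
5
>>> f.close()
>>> data = open("/tmp/test", encoding="utf-8").read()
>>> data
'batsà'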

In case of writing in Python 2:

>>> a = u'bats\u00E0'
>>> f = open("/tmp/test", "w")
>>> f.write(a)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

To avoid this error, you have to encode it to bytes using the "utf-8" codec, like this:

>>> f.write(a.encode("utf-8"))
>>> f.close()

and decode the data with the "utf-8" codec when reading:

>>> data = open("/tmp/test").read()
>>> data.decode("utf-8")
u'bats\xe0'

Also, if you print the unicode object, Python 2 encodes it automatically using the terminal's encoding (UTF-8 on most modern systems), like this:

>>> print a
batsà