Character reading from file in Python

Question: Character reading from file in Python

In a text file, there is a string “I don’t like this”.

However, when I read it into a string, it becomes “I don\xe2\x80\x98t like this”. I understand that \u2018 is the Unicode representation of “‘”. I read the file with:

f1 = open(file1, "r")
text = f1.read()

Now, is it possible to read the file so that the resulting string is “I don’t like this” instead of “I don\xe2\x80\x98t like this”?

Second edit: I have seen some people use a mapping to solve this problem, but is there really no built-in conversion between ANSI and Unicode (in either direction)?


Answer 0

Ref: http://docs.python.org/howto/unicode

Reading Unicode from a file is therefore simple:

import codecs
with codecs.open('unicode.rst', encoding='utf-8') as f:
    for line in f:
        print repr(line)

It’s also possible to open files in update mode, allowing both reading and writing:

with codecs.open('test', encoding='utf-8', mode='w+') as f:
    f.write(u'\u4500 blah blah blah\n')
    f.seek(0)
    print repr(f.readline()[:1])

EDIT: I’m assuming that your intended goal is just to be able to read the file properly into a string in Python. If you’re trying to convert to an ASCII string from Unicode, then there’s really no direct way to do so, since the Unicode characters won’t necessarily exist in ASCII.

If you’re trying to convert to an ASCII string, try one of the following:

  1. Replace the specific Unicode characters with ASCII equivalents, if you only need to handle a few special cases such as this particular example.

  2. Use the unicodedata module’s normalize() and the string’s encode() method to convert as best you can to the closest ASCII equivalent (Ref https://web.archive.org/web/20090228203858/http://techxplorer.com/2006/07/18/converting-unicode-to-ascii-using-python):

    >>> teststr
    u'I don\xe2\x80\x98t like this'
    >>> unicodedata.normalize('NFKD', teststr).encode('ascii', 'ignore')
    'I donat like this'
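
For comparison, here is a minimal Python 3 sketch of the same normalize-then-encode idea, applied to a string that was decoded correctly. The helper name to_ascii is hypothetical and not part of any library. It suggests that this approach strips accents but simply drops characters such as U+2018, which have no ASCII decomposition:

```python
import unicodedata


def to_ascii(text):
    """Best-effort ASCII: decompose characters, then drop what is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# "é" decomposes into "e" plus a combining accent, so the "e" survives.
accented = to_ascii("caf\u00e9")
# U+2018 has no decomposition, so the quotation mark is dropped entirely.
quoted = to_ascii("I don\u2018t like this")

print(accented)
print(quoted)
```

If the quotation mark should become an ASCII apostrophe rather than disappear, an explicit replacement is still needed.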
    

Answer 1

There are a few points to consider.

The sequence \u2018 appears only in the repr of a unicode string in Python, not in the string itself, e.g. if you write:

>>> text = u'‘'
>>> print repr(text)
u'\u2018'

Now if you simply want to print the unicode string legibly, encode it with unicode’s encode method:

>>> text = u'I don\u2018t like this'
>>> print text.encode('utf-8')
I don‘t like this

To make sure that every line of a file is read as unicode, use the codecs.open function rather than the plain open; it lets you specify the file’s encoding:

>>> import codecs
>>> f1 = codecs.open(file1, "r", "utf-8")
>>> text = f1.read()
>>> print type(text)
<type 'unicode'>
>>> print text.encode('utf-8')
I don‘t like this

Answer 2

But it really is "I don\u2018t like this" and not "I don't like this". The character u'\u2018' is a completely different character from "'" (and, visually, corresponds more closely to "`").

If you’re trying to convert encoded unicode into plain ASCII, you could perhaps keep a mapping of unicode punctuation that you would like to translate into ASCII.

punctuation = {
  u'\u2018': "'",
  u'\u2019': "'",
}
for src, dest in punctuation.iteritems():
  text = text.replace(src, dest)

Unicode contains an awful lot of punctuation characters, but I suppose you can count on only a few of them actually being used by whatever application creates the documents you’re reading.
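
In Python 3, the same mapping idea can be sketched with str.translate, which applies all replacements in one pass. The table contents and the function name asciify_punctuation below are illustrative assumptions; the characters that actually need mapping depend on the files being read:

```python
# Hypothetical mapping; extend it with whatever punctuation your files contain.
PUNCTUATION = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}
# str.maketrans turns the one-character keys into a code point lookup table.
TABLE = str.maketrans(PUNCTUATION)


def asciify_punctuation(text):
    """Replace the mapped punctuation; leave every other character untouched."""
    return text.translate(TABLE)


result = asciify_punctuation("I don\u2018t like \u201cthis\u201d")
print(result)
```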


Answer 3

It is also possible to read an encoded text file in Python 3 by passing the encoding to the built-in open function:

f = open('file.txt', 'r', encoding='utf-8')
text = f.read()
f.close()

With this variation, there is no need to import any additional libraries.
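
A self-contained sketch of that Python 3 approach follows. It first writes a small UTF-8 file to a temporary path (a stand-in for the real file, so the example can run anywhere) and then reads it back with a with block, which closes the file even if reading fails:

```python
import os
import tempfile

# Create a small UTF-8 encoded file as a stand-in for the real input file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"I don\xe2\x80\x98t like this\n")
    path = tmp.name

try:
    # Passing the encoding makes read() return decoded text, not raw bytes.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
finally:
    os.remove(path)

# ascii() is used so the output is printable on any console.
print(ascii(text))
```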


Answer 4

Leaving aside the fact that your text file is broken (U+2018 is a left quotation mark, not an apostrophe): iconv can be used to transliterate unicode characters to ascii.

You’ll have to google for “iconvcodec”, since the module seems not to be supported anymore and I can’t find a canonical home page for it.

>>> import iconvcodec
>>> from locale import setlocale, LC_ALL
>>> setlocale(LC_ALL, '')
>>> u'\u2018'.encode('ascii//translit')
"'"

Alternatively you can use the iconv command line utility to clean up your file:

$ xxd foo
0000000: e280 980a                                ....
$ iconv -t 'ascii//translit' foo | xxd
0000000: 270a                                     '.

Answer 5

There is a possibility that somehow you have a non-unicode string with unicode escape characters, e.g.:

>>> print repr(text)
'I don\\u2018t like this'

This actually happened to me once before. You can use a unicode_escape codec to decode the string to unicode and then encode it to any format you want:

>>> uni = text.decode('unicode_escape')
>>> print type(uni)
<type 'unicode'>
>>> print uni.encode('utf-8')
I don‘t like this

Answer 6

This is Python’s way of showing you unicode strings. But I think you should be able to print the string to the screen or write it to a new file without any problems.

>>> test = u"I don\u2018t like this"
>>> test
u'I don\u2018t like this'
>>> print test
I don‘t like this

Answer 7

Actually, U+2018 is the Unicode code point of the special character ‘ (left single quotation mark). If you want, you can convert instances of that character to U+0027 with this code:

text = text.replace(u"\u2018", "'")

In addition, what are you using to write the file? f1.read() should return a string that looks like this:

'I don\xe2\x80\x98t like this'

If it’s returning this string, the file is being written incorrectly:

'I don\u2018t like this'