Tag Archives: character-encoding

How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8 and ASCII

Question: How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8 and ASCII

At work it seems like no week ever passes without some encoding-related conniption, calamity, or catastrophe. The problem usually derives from programmers who think they can reliably process a “text” file without specifying the encoding. But you can’t.

So it’s been decided to henceforth forbid files from ever having names that end in *.txt or *.text. The thinking is that those extensions mislead the casual programmer into a dull complacency regarding encodings, and this leads to improper handling. It would almost be better to have no extension at all, because at least then you know that you don’t know what you’ve got.

However, we aren’t going to go that far. Instead you will be expected to use a filename that ends in the encoding. So for text files, for example, these would be something like README.ascii, README.latin1, README.utf8, etc.

For files that demand a particular extension, if one can specify the encoding inside the file itself, such as in Perl or Python, then you shall do that. For files like Java source where no such facility exists internal to the file, you will put the encoding before the extension, such as SomeClass-utf8.java.

For output, UTF-8 is to be strongly preferred.

But for input, we need to figure out how to deal with the thousands of files in our codebase named *.txt. We want to rename all of them to fit into our new standard. But we can’t possibly eyeball them all. So we need a library or program that actually works.

These are variously in ASCII, ISO-8859-1, UTF-8, Microsoft CP1252, or Apple MacRoman. Although we know we can tell if something is ASCII, and we stand a good chance of knowing if something is probably UTF-8, we’re stumped about the 8-bit encodings. Because we’re running in a mixed Unix environment (Solaris, Linux, Darwin) with most desktops being Macs, we have quite a few annoying MacRoman files. And these especially are a problem.

For some time now I’ve been looking for a way to programmatically determine which of

  1. ASCII
  2. ISO-8859-1
  3. CP1252
  4. MacRoman
  5. UTF-8

a file is in, and I haven’t found a program or library that can reliably distinguish between those three different 8-bit encodings. We probably have over a thousand MacRoman files alone, so whatever charset detector we use has to be able to sniff those out. Nothing I’ve looked at can manage the trick. I had big hopes for the ICU charset detector library, but it cannot handle MacRoman. I’ve also looked at modules to do the same sort of thing in both Perl and Python, but again and again it’s always the same story: no support for detecting MacRoman.

What I am therefore looking for is an existing library or program that reliably determines which of those five encodings a file is in—and preferably more than that. In particular it has to distinguish between the three 8-bit encodings I’ve cited, especially MacRoman. The files are more than 99% English language text; there are a few in other languages, but not many.

If it’s library code, our language preference is for it to be in Perl, C, Java, or Python, and in that order. If it’s just a program, then we don’t really care what language it’s in so long as it comes in full source, runs on Unix, and is fully unencumbered.

Has anyone else had this problem of a zillion legacy text files randomly encoded? If so, how did you attempt to solve it, and how successful were you? This is the most important aspect of my question, but I’m also interested in whether you think encouraging programmers to name (or rename) their files with the actual encoding those files are in will help us avoid the problem in the future. Has anyone ever tried to enforce this on an institutional basis, and if so, was that successful or not, and why?

And yes, I fully understand why one cannot guarantee a definite answer given the nature of the problem. This is especially the case with small files, where you don’t have enough data to go on. Fortunately, our files are seldom small. Apart from the random README file, most are in the size range of 50k to 250k, and many are larger. Anything more than a few K in size is guaranteed to be in English.

The problem domain is biomedical text mining, so we sometimes deal with extensive and extremely large corpora, like all of PubMedCentral’s Open Access repository. A rather huge file is the BioThesaurus 6.0, at 5.7 gigabytes. This file is especially annoying because it is almost all UTF-8. However, some numbskull went and stuck a few lines in it that are in some 8-bit encoding—Microsoft CP1252, I believe. It takes quite a while before you trip on that one. :(


Answer 0

First, the easy cases:

ASCII

If your data contains no bytes above 0x7F, then it’s ASCII. (Or a 7-bit ISO646 encoding, but those are very obsolete.)

UTF-8

If your data validates as UTF-8, then you can safely assume it is UTF-8. Due to UTF-8’s strict validation rules, false positives are extremely rare.

ISO-8859-1 vs. windows-1252

The only difference between these two encodings is that ISO-8859-1 has the C1 control characters where windows-1252 has the printable characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ. I’ve seen plenty of files that use curly quotes or dashes, but none that use C1 control characters. So don’t even bother with them, or ISO-8859-1, just detect windows-1252 instead.

That now leaves you with only one question.

How do you distinguish MacRoman from cp1252?

This is a lot trickier.

Undefined characters

The bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D are not used in windows-1252. If they occur, then assume the data is MacRoman.

Identical characters

The bytes 0xA2 (¢), 0xA3 (£), 0xA9 (©), 0xB1 (±), 0xB5 (µ) happen to be the same in both encodings. If these are the only non-ASCII bytes, then it doesn’t matter whether you choose MacRoman or cp1252.

Statistical approach

Count character (NOT byte!) frequencies in the data you know to be UTF-8. Determine the most frequent characters. Then use this data to determine whether the cp1252 or MacRoman characters are more common.

For example, in a search I just performed on 100 random English Wikipedia articles, the most common non-ASCII characters are ·•–é°®’èö—. Based on this fact,

  • The bytes 0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, or 0xF6 suggest windows-1252.
  • The bytes 0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, or 0xE1 suggest MacRoman.

Count up the cp1252-suggesting bytes and the MacRoman-suggesting bytes, and go with whichever is greatest.
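
Here is a rough Python 3 sketch of the whole decision chain above; the hint sets are copied from the byte lists in this answer, and the name guess_encoding is only illustrative, not any library’s API.

# Rough sketch of the heuristic above, operating on raw bytes.
WIN1252_HINTS  = {0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6}
MACROMAN_HINTS = {0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1}
UNDEFINED_1252 = {0x81, 0x8D, 0x8F, 0x90, 0x9D}   # bytes not used in windows-1252

def guess_encoding(data):
    # Easy cases first: pure ASCII, then strict UTF-8 validation.
    if all(b <= 0x7F for b in data):
        return 'ascii'
    try:
        data.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        pass
    # From here on it is one of the 8-bit encodings.
    high_bytes = [b for b in data if b > 0x7F]
    if any(b in UNDEFINED_1252 for b in high_bytes):
        return 'mac-roman'
    cp1252_votes = sum(b in WIN1252_HINTS for b in high_bytes)
    macroman_votes = sum(b in MACROMAN_HINTS for b in high_bytes)
    return 'mac-roman' if macroman_votes > cp1252_votes else 'windows-1252'

# Usage: with open(path, 'rb') as f: print(guess_encoding(f.read()))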


Answer 1

Mozilla’s nsUniversalDetector (Perl bindings: Encode::Detect / Encode::Detect::Detector) has been proven a million times over.


Answer 2

My attempt at such a heuristic, assuming that you’ve ruled out ASCII and UTF-8 (a rough code sketch follows the list):

  • If 0x7f to 0x9f don’t appear at all, it’s probably ISO-8859-1, because those are very rarely used control codes.
  • If 0x91 through 0x94 appear a lot, it’s probably Windows-1252, because those are the “smart quotes”, by far the most likely characters in that range to be used in English text. To be more certain, you could look for pairs.
  • Otherwise, it’s MacRoman, especially if you see a lot of 0xd2 through 0xd5 (that’s where the typographic quotes are in MacRoman).
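
As promised above, a minimal sketch of those three rules in Python 3, taking raw bytes with ASCII and UTF-8 already excluded; treating “appears a lot” as a simple count comparison is my own simplification.

def guess_8bit(data):
    c1_bytes        = sum(0x7F <= b <= 0x9F for b in data)
    cp1252_quotes   = sum(0x91 <= b <= 0x94 for b in data)   # Windows smart quotes
    macroman_quotes = sum(0xD2 <= b <= 0xD5 for b in data)   # MacRoman typographic quotes
    if c1_bytes == 0:
        return 'iso-8859-1'
    if cp1252_quotes > macroman_quotes:
        return 'windows-1252'
    return 'mac-roman'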

Side note:

For files like Java source where no such facility exists internal to the file, you will put the encoding before the extension, such as SomeClass-utf8.java

Do not do this!!

The Java compiler expects file names to match class names, so renaming the files will render the source code uncompilable. The correct thing would be to guess the encoding, then use the native2ascii tool to convert all non-ASCII characters to Unicode escape sequences.


Answer 3

“Perl, C, Java, or Python, and in that order”: interesting attitude :-)

“we stand a good chance of knowing if something is probably UTF-8”: Actually the chance that a file containing meaningful text encoded in some other charset that uses high-bit-set bytes will decode successfully as UTF-8 is vanishingly small.

UTF-8 strategies (in least preferred language):

# 100% Unicode-standard-compliant UTF-8
def utf8_strict(text):
    try:
        text.decode('utf8')
        return True
    except UnicodeDecodeError:
        return False

# looking for almost all UTF-8 with some junk
def utf8_replace(text):
    utext = text.decode('utf8', 'replace')
    dodgy_count = utext.count(u'\uFFFD') 
    return dodgy_count, utext
    # further action depends on how large dodgy_count / float(len(utext)) is

# checking for UTF-8 structure but non-compliant
# e.g. encoded surrogates, not minimal length, more than 4 bytes:
# Can be done with a regex, if you need it

Once you’ve decided that it’s neither ASCII nor UTF-8:

The Mozilla-origin charset detectors that I’m aware of don’t support MacRoman and in any case don’t do a good job on 8-bit charsets especially with English because AFAICT they depend on checking whether the decoding makes sense in the given language, ignoring the punctuation characters, and based on a wide selection of documents in that language.

As others have remarked, you really only have the high-bit-set punctuation characters available to distinguish between cp1252 and macroman. I’d suggest training a Mozilla-type model on your own documents, not Shakespeare or Hansard or the KJV Bible, and taking all 256 bytes into account. I presume that your files have no markup (HTML, XML, etc) in them — that would distort the probabilities something shocking.

You’ve mentioned files that are mostly UTF-8 but fail to decode. You should also be very suspicious of:

(1) files that are allegedly encoded in ISO-8859-1 but contain “control characters” in the range 0x80 to 0x9F inclusive … this is so prevalent that the draft HTML5 standard says to decode ALL HTML streams declared as ISO-8859-1 using cp1252.

(2) files that decode OK as UTF-8 but the resultant Unicode contains “control characters” in the range U+0080 to U+009F inclusive … this can result from transcoding cp1252 / cp850 (seen it happen!) / etc files from “ISO-8859-1” to UTF-8.
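
A quick way to flag case (2) once you have decoded text, as a small sketch (the function name is made up for illustration):

import re

# Text that decoded "fine" but contains C1 control characters (U+0080 to U+009F)
# has very likely been transcoded from mislabelled cp1252/cp850 rather than
# from genuine ISO-8859-1.
C1_CONTROLS = re.compile(u'[\u0080-\u009f]')

def looks_transcoded(text):
    return bool(C1_CONTROLS.search(text))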

Background: I have a wet-Sunday-afternoon project to create a Python-based charset detector that’s file-oriented (instead of web-oriented) and works well with 8-bit character sets including legacy ones like cp850 and cp437. It’s nowhere near prime time yet. I’m interested in training files; are your ISO-8859-1 / cp1252 / MacRoman files as “unencumbered” as you expect anyone’s code solution to be?


Answer 4

As you have discovered, there is no perfect way to solve this problem, because without the implicit knowledge about which encoding a file uses, all 8-bit encodings are exactly the same: A collection of bytes. All bytes are valid for all 8-bit encodings.

The best you can hope for, is some sort of algorithm that analyzes the bytes, and based on probabilities of a certain byte being used in a certain language with a certain encoding will guess at what encoding the files uses. But that has to know which language the file uses, and becomes completely useless when you have files with mixed encodings.

On the upside, if you know that the text in a file is written in English, then you’re unlikely to notice any difference whichever encoding you decide to use for that file, as the differences between all the mentioned encodings are all localized in the parts of the encodings that specify characters not normally used in the English language. You might have some trouble where the text uses special formatting, or special versions of punctuation (CP1252 has several versions of the quote characters for instance), but for the gist of the text there will probably be no problems.


Answer 5

If you can detect every encoding EXCEPT for macroman, then it would be logical to assume that the ones that can’t be deciphered are in macroman. In other words, just make a list of files that couldn’t be processed and handle those as if they were macroman.

Another way to sort these files would be to make a server based program that allows users to decide which encoding isn’t garbled. Of course, it would be within the company, but with 100 employees doing a few each day, you’ll have thousands of files done in no time.

Finally, wouldn’t it be better to just convert all existing files to a single format, and require that new files be in that format?


Answer 6

Has anyone else had this problem of a zillion legacy text files randomly encoded? If so, how did you attempt to solve it, and how successful were you?

I am currently writing a program that translates files into XML. It has to autodetect the type of each file, which is a superset of the problem of determining the encoding of a text file. For determining the encoding I am using a Bayesian approach. That is, my classification code computes a probability (likelihood) that a text file has a particular encoding for all the encodings it understands. The program then selects the most probable decoder. The Bayesian approach works like this for each encoding.

  1. Set the initial (prior) probability that the file is in the encoding, based on the frequencies of each encoding.
  2. Examine each byte in turn in the file. Look-up the byte value to determine the correlation between that byte value being present and a file actually being in that encoding. Use that correlation to compute a new (posterior) probability that the file is in the encoding. If you have more bytes to examine, use the posterior probability of that byte as the prior probability when you examine the next byte.
  3. When you get to the end of the file (I actually look at only the first 1024 bytes), the probability you have is the probability that the file is in the encoding.

It transpires that Bayes’ theorem becomes very easy to do if instead of computing probabilities, you compute information content, which is the logarithm of the odds: info = log(p / (1.0 - p)).

You will have to compute the initial prior probability, and the correlations, by examining a corpus of files that you have manually classified.
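
As a toy illustration of the log-odds bookkeeping described above (not the answerer’s actual code), where the per-byte weights would have to be learned from that manually classified corpus:

import math

def log_odds(p):
    # info = log(p / (1 - p)), as above
    return math.log(p / (1.0 - p))

def encoding_score(data, prior_p, byte_weights, limit=1024):
    # byte_weights[b] holds a learned log-likelihood ratio:
    # log(P(byte b | this encoding) / P(byte b | some other encoding)).
    score = log_odds(prior_p)
    for b in data[:limit]:              # only the first 1024 bytes, as above
        score += byte_weights.get(b, 0.0)
    return score

# Compute encoding_score() once per candidate encoding and pick the largest.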


Writing Unicode text to a text file?

Question: Writing Unicode text to a text file?

I’m pulling data out of a Google doc, processing it, and writing it to a file (that eventually I will paste into a WordPress page).

It has some non-ASCII symbols. How can I convert these safely to symbols that can be used in HTML source?

Currently I’m converting everything to Unicode on the way in, joining it all together in a Python string, then doing:

import codecs
f = codecs.open('out.txt', mode="w", encoding="iso-8859-1")
f.write(all_html.encode("iso-8859-1", "replace"))

There is an encoding error on the last line:

UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xa0 in position 12286: ordinal not in range(128)

Partial solution:

This Python runs without an error:

row = [unicode(x.strip()) if x is not None else u'' for x in row]
all_html = row[0] + "<br/>" + row[1]
f = open('out.txt', 'w')
f.write(all_html.encode("utf-8"))

But then if I open the actual text file, I see lots of symbols like:

Qur’an 

Maybe I need to write to something other than a text file?


Answer 0

Deal exclusively with unicode objects as much as possible by decoding things to unicode objects when you first get them and encoding them as necessary on the way out.

If your string is actually a unicode object, you’ll need to encode it to a byte string in some encoding before writing it to a file:

foo = u'Δ, Й, ק, ‎ م, ๗, あ, 叶, 葉, and 말.'
f = open('test', 'w')
f.write(foo.encode('utf8'))
f.close()

When you read that file again, you’ll get an encoded byte string that you can decode back to a unicode object:

f = file('test', 'r')
print f.read().decode('utf8')

Answer 1

In Python 2.6+, you could use io.open(), which is the default (the builtin open()) on Python 3:

import io

with io.open(filename, 'w', encoding=character_encoding) as file:
    file.write(unicode_text)

It might be more convenient if you need to write the text incrementally (you don’t need to call unicode_text.encode(character_encoding) multiple times). Unlike the codecs module, the io module has proper universal newline support.
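
For example, incremental writing with io.open might look like this (the chunks are just placeholders):

import io

chunks = [u'first line\n', u'second line\n']      # any iterable of unicode text

with io.open('out.txt', 'w', encoding='utf-8') as f:
    for chunk in chunks:
        f.write(chunk)    # each chunk is encoded for you; no .encode() calls needed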


Answer 2

Unicode string handling is already standardized in Python 3.

  1. Characters are already stored in Unicode (32-bit) in memory
  2. You only need to open file in utf-8
    (32-bit Unicode to variable-byte-length utf-8 conversion is automatically performed from memory to file.)

    out1 = "(嘉南大圳 ㄐㄧㄚ ㄋㄢˊ ㄉㄚˋ ㄗㄨㄣˋ )"
    fobj = open("t1.txt", "w", encoding="utf-8")
    fobj.write(out1)
    fobj.close()
    

Answer 3

The file opened by codecs.open is a file that takes unicode data, encodes it in iso-8859-1 and writes it to the file. However, what you try to write isn’t unicode; you take unicode and encode it in iso-8859-1 yourself. That’s what the unicode.encode method does, and the result of encoding a unicode string is a bytestring (a str type.)

You should either use normal open() and encode the unicode yourself, or (usually a better idea) use codecs.open() and not encode the data yourself.
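
Both correct variants side by side, as a sketch based on the question’s snippet; all_html here is only a placeholder unicode value:

import codecs

all_html = u'caf\xe9 au lait'     # placeholder for the real unicode content

# Option 1: plain open(), encode the unicode yourself and write the bytes
f = open('out.txt', 'wb')
f.write(all_html.encode('iso-8859-1', 'replace'))
f.close()

# Option 2: codecs.open() does the encoding, so hand it unicode directly
f = codecs.open('out.txt', mode='w', encoding='iso-8859-1')
f.write(all_html)
f.close()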


Answer 4

Preface: will your viewer work?

Make sure your viewer/editor/terminal (however you are interacting with your utf-8 encoded file) can read the file. This is frequently an issue on Windows, for example, Notepad.

Writing Unicode text to a text file?

In Python 2, use open from the io module (this is the same as the builtin open in Python 3):

import io

Best practice, in general, use UTF-8 for writing to files (we don’t even have to worry about byte-order with utf-8).

encoding = 'utf-8'

utf-8 is the most modern and universally usable encoding – it works in all web browsers, most text-editors (see your settings if you have issues) and most terminals/shells.

On Windows, you might try utf-16le if you’re limited to viewing output in Notepad (or another limited viewer).

encoding = 'utf-16le' # sorry, Windows users... :(

And just open it with the context manager and write your unicode characters out:

with io.open(filename, 'w', encoding=encoding) as f:
    f.write(unicode_object)

Example using many Unicode characters

Here’s an example that attempts to map every possible character up to three bytes wide (4 is the max, but that would be going a bit far) from the digital representation (in integers) to an encoded printable output, along with its name, if possible (put this into a file called uni.py):

from __future__ import print_function
import io
from unicodedata import name, category
from curses.ascii import controlnames
from collections import Counter

try: # use these if Python 2
    unicode_chr, range = unichr, xrange
except NameError: # Python 3
    unicode_chr = chr

exclude_categories = set(('Co', 'Cn'))
counts = Counter()
control_names = dict(enumerate(controlnames))
with io.open('unidata', 'w', encoding='utf-8') as f:
    for x in range((2**8)**3): 
        try:
            char = unicode_chr(x)
        except ValueError:
            continue # can't map to unicode, try next x
        cat = category(char)
        counts.update((cat,))
        if cat in exclude_categories:
            continue # get rid of noise & greatly shorten result file
        try:
            uname = name(char)
        except ValueError: # probably control character, don't use actual
            uname = control_names.get(x, '')
            f.write(u'{0:>6x} {1}    {2}\n'.format(x, cat, uname))
        else:
            f.write(u'{0:>6x} {1}  {2}  {3}\n'.format(x, cat, char, uname))
# may as well describe the types we logged.
for cat, count in counts.items():
    print('{0} chars of category, {1}'.format(count, cat))

This should run in the order of about a minute, and you can view the data file, and if your file viewer can display unicode, you’ll see it. Information about the categories can be found here. Based on the counts, we can probably improve our results by excluding the Cn and Co categories, which have no symbols associated with them.

$ python uni.py

It will display the hexadecimal mapping, category, symbol (unless can’t get the name, so probably a control character), and the name of the symbol. e.g.

I recommend less on Unix or Cygwin (don’t print/cat the entire file to your output):

$ less unidata

e.g. will display similar to the following lines which I sampled from it using Python 2 (unicode 5.2):

     0 Cc NUL
    20 Zs     SPACE
    21 Po  !  EXCLAMATION MARK
    b6 So  ¶  PILCROW SIGN
    d0 Lu  Ð  LATIN CAPITAL LETTER ETH
   e59 Nd  ๙  THAI DIGIT NINE
  2887 So  ⢇  BRAILLE PATTERN DOTS-1238
  bc13 Lo  밓  HANGUL SYLLABLE MIH
  ffeb Sm  →  HALFWIDTH RIGHTWARDS ARROW

My Python 3.5 from Anaconda has unicode 8.0, I would presume most 3’s would.


Answer 5

How to print unicode characters into a file:

Save this to file: foo.py:

#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys 
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print(u'e with obfuscation: é')

Run it and pipe output to file:

python foo.py > tmp.txt

Open tmp.txt and look inside, you see this:

el@apollo:~$ cat tmp.txt 
e with obfuscation: é

Thus you have saved unicode e with an obfuscation mark on it to a file.


Answer 6

That error arises when you try to encode a non-unicode string: it tries to decode it, assuming it’s in plain ASCII. There are two possibilities:

  1. You’re encoding it to a bytestring, but because you’ve used codecs.open, the write method expects a unicode object. So you encode it, and it tries to decode it again. Try: f.write(all_html) instead.
  2. all_html is not, in fact, a unicode object. When you do .encode(...), it first tries to decode it.

Answer 7

In case of writing in python3

>>> a = u'bats\u00E0'
>>> print(a)
batsà
>>> f = open("/tmp/test", "w")
>>> f.write(a)
>>> f.close()
>>> data = open("/tmp/test").read()
>>> data
'batsà'

In case of writing in python2:

>>> a = u'bats\u00E0'
>>> f = open("/tmp/test", "w")
>>> f.write(a)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

To avoid this error you would have to encode it to bytes using codecs “utf-8” like this:

>>> f.write(a.encode("utf-8"))
>>> f.close()

and decode the data while reading using the codecs “utf-8”:

>>> data = open("/tmp/test").read()
>>> data.decode("utf-8")
u'bats\xe0'

And also if you try to execute print on this string it will automatically decode using the “utf-8” codecs like this

>>> print a
batsà

“for line in…” results in UnicodeDecodeError: 'utf-8' codec can't decode byte

Question: “for line in…” results in UnicodeDecodeError: 'utf-8' codec can't decode byte

Here is my code,

for line in open('u.item'):
#read each line

whenever I run this code it gives the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte

I tried to solve this and add an extra parameter in open(), the code looks like;

for line in open('u.item', encoding='utf-8'):
#read each line

But again it gives the same error. What should I do then? Please help.


Answer 0

As suggested by Mark Ransom, I found the right encoding for that problem. The encoding was "ISO-8859-1", so replacing open("u.item", encoding="utf-8") with open('u.item', encoding = "ISO-8859-1") will solve the problem.


Answer 1

Also worked for me, ISO 8859-1 is going to save a lot, hahaha, mainly if using Speech Recognition API’s

Example:

file = open('../Resources/' + filename, 'r', encoding="ISO-8859-1");

Answer 2

Your file doesn’t actually contain utf-8 encoded data, it contains some other encoding. Figure out what that encoding is and use it in the open call.

In the Windows-1252 encoding, for example, the byte 0xe9 would be the character é.
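
If you have no idea what the encoding might be, the chardet package (pip install chardet) can make a guess from the raw bytes; a minimal sketch:

import chardet

with open('u.item', 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)   # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
print(guess)
text = raw.decode(guess['encoding'])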


Answer 3

Try this to read the file using pandas:

pd.read_csv('u.item', sep='|', names=m_cols , encoding='latin-1')

Answer 4

If you are using Python 2, the following is the solution:

import io
for line in io.open("u.item", encoding="ISO-8859-1"):
    # do something

Because the encoding parameter doesn’t work with Python 2’s open(), you would otherwise get the following error:

TypeError: 'encoding' is an invalid keyword argument for this function

Answer 5

You could resolve the problem with:

for line in open(your_file_path, 'rb'):

‘rb’ reads the file in binary mode. Read more here. Hope this will help!
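
Keep in mind that in binary mode each line comes back as raw bytes, so you still have to decode it yourself once you know (or have guessed) the encoding; for example:

with open('u.item', 'rb') as f:
    for raw_line in f:
        line = raw_line.decode('ISO-8859-1')   # swap in the file's real encoding
        print(line.rstrip('\n'))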


Answer 6

This works:

open('filename', encoding='latin-1')

or:

open('filename',encoding="ISO-8859-1")

Answer 7

If someone is looking for this, this is an example for converting a CSV file in Python 3:

try:
    inputReader = csv.reader(open(argv[1], encoding='ISO-8859-1'), delimiter=',',quotechar='"')
except IOError:
    pass

Answer 8

Sometimes open(filepath), where filepath is not actually a file, raises the same error, so first make sure the file you’re trying to open exists:

import os
assert os.path.isfile(filepath)

hope this will help.


Answer 9

you can try this way:

open('u.item', encoding='utf8', errors='ignore')

What is the difference between encode/decode?

Question: What is the difference between encode/decode?

I’ve never been sure that I understand the difference between str/unicode decode and encode.

I know that str().decode() is for when you have a string of bytes that you know has a certain character encoding, given that encoding name it will return a unicode string.

I know that unicode().encode() converts unicode chars into a string of bytes according to a given encoding name.

But I don’t understand what str().encode() and unicode().decode() are for. Can anyone explain, and possibly also correct anything else I’ve gotten wrong above?

EDIT:

Several answers give info on what .encode does on a string, but no-one seems to know what .decode does for unicode.


Answer 0

The decode method of unicode strings really doesn’t have any applications at all (unless you have some non-text data in a unicode string for some reason — see below). It is mainly there for historical reasons, I think. In Python 3 it is completely gone.

unicode().decode() will perform an implicit encoding of s using the default (ascii) codec. Verify this like so:

>>> s = u'ö'
>>> s.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0:
ordinal not in range(128)

>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0:
ordinal not in range(128)

The error messages are exactly the same.

For str().encode() it’s the other way around — it attempts an implicit decoding of s with the default encoding:

>>> s = 'ö'
>>> s.decode('utf-8')
u'\xf6'
>>> s.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)

Used like this, str().encode() is also superfluous.

But there is another application of the latter method that is useful: there are encodings that have nothing to do with character sets, and thus can be applied to 8-bit strings in a meaningful way:

>>> s.encode('zip')
'x\x9c;\xbc\r\x00\x02>\x01z'

You are right, though: the ambiguous usage of “encoding” for both these applications is… awkward. Again, with separate byte and string types in Python 3, this is no longer an issue.


Answer 1

To represent a unicode string as a string of bytes is known as encoding. Use u'...'.encode(encoding).

Example:

    >>> u'æøå'.encode('utf8')
    '\xc3\xa6\xc3\xb8\xc3\xa5'
    >>> u'æøå'.encode('latin1')
    '\xe6\xf8\xe5'
    >>> u'æøå'.encode('ascii')
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2:
    ordinal not in range(128)

You typically encode a unicode string whenever you need to use it for IO, for instance transfer it over the network, or save it to a disk file.

To convert a string of bytes to a unicode string is known as decoding. Use unicode('...', encoding) or '...'.decode(encoding).

Example:

   >>> u'æøå'
   u'\xe6\xf8\xe5' # the interpreter prints the unicode object like so
   >>> unicode('\xe6\xf8\xe5', 'latin1')
   u'\xe6\xf8\xe5'
   >>> '\xe6\xf8\xe5'.decode('latin1')
   u'\xe6\xf8\xe5'

You typically decode a string of bytes whenever you receive string data from the network or from a disk file.

I believe there are some changes in unicode handling in python 3, so the above is probably not correct for python 3.



Answer 2

anUnicode.encode(‘encoding’) results in a string object and can be called on a unicode object

aString.decode(‘encoding’) results in a unicode object and can be called on a string encoded in the given encoding.


Some more explanations:

You can create some unicode object, which doesn’t have any encoding set. The way it is stored by Python in memory is none of your concern. You can search it, split it and call any string manipulating function you like.

But there comes a time, when you’d like to print your unicode object to console or into some text file. So you have to encode it (for example – in UTF-8), you call encode(‘utf-8’) and you get a string with ‘\u<someNumber>’ inside, which is perfectly printable.

Then, again – you’d like to do the opposite – read string encoded in UTF-8 and treat it as an Unicode, so the \u360 would be one character, not 5. Then you decode a string (with selected encoding) and get brand new object of the unicode type.

Just as a side note – you can select some perverse encoding, like ‘zip’, ‘base64’, ‘rot’ and some of them will convert from string to string, but I believe the most common case is one that involves UTF-8/UTF-16 and string.


Answer 3

mybytestring.encode(somecodec) is meaningful for these values of somecodec:

  • base64
  • bz2
  • zlib
  • hex
  • quopri
  • rot13
  • string_escape
  • uu

I am not sure what decoding an already decoded unicode text is good for. Trying that with any encoding seems to always try to encode with the system’s default encoding first.
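
A quick Python 2 illustration of a few of the codecs listed above (these str-to-str codecs are gone in Python 3, where you would reach for the base64/binascii modules instead):

# Python 2 only: these codecs map str -> str, no character set involved.
data = 'hello world'

print data.encode('base64')              # 'aGVsbG8gd29ybGQ=\n'
print data.encode('hex')                 # '68656c6c6f20776f726c64'
print data.encode('rot13')               # 'uryyb jbeyq'
print data.encode('hex').decode('hex')   # round-trips back to 'hello world'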


Answer 4

There are a few encodings that can be used to de-/encode from str to str or from unicode to unicode. For example base64, hex or even rot13. They are listed in the codecs module.

Edit:

The decode message on a unicode string can undo the corresponding encode operation:

In [1]: u'0a'.decode('hex')
Out[1]: '\n'

The returned type is str instead of unicode which is unfortunate in my opinion. But when you are not doing a proper en-/decode between str and unicode this looks like a mess anyway.


Answer 5

The simple answer is that they are the exact opposite of each other.

The computer uses the very basic unit of byte to store and process information; it is meaningless for human eyes.

For example, '\xe4\xb8\xad\xe6\x96\x87' is the representation of two Chinese characters, but the computer only knows (meaning it can print or store) that they are Chinese characters when it is given a dictionary to look that Chinese word up in; in this case it is a “utf-8” dictionary, and it would fail to correctly show the intended Chinese word if you look into a different or wrong dictionary (using a different decoding method).

In the above case, the process for a computer to look for Chinese word is decode().

And the process of computer writing the Chinese into computer memory is encode().

So the encoded information is the raw bytes, and the decoded information is the raw bytes and the name of the dictionary to reference (but not the dictionary itself).
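
To make the dictionary analogy concrete, here is a small Python 2 illustration (on a UTF-8 terminal) reusing the byte string quoted above:

raw = '\xe4\xb8\xad\xe6\x96\x87'   # the bytes of the two Chinese characters

print raw.decode('utf-8')          # right "dictionary": the two characters appear
print raw.decode('cp1252')         # wrong "dictionary": mojibake comes out instead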


Convert Unicode to ASCII without errors in Python

Question: Convert Unicode to ASCII without errors in Python

My code just scrapes a web page, then converts it to Unicode.

html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)

But I get a UnicodeDecodeError:


Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
    handler.get(*groups)
  File "/Users/greg/clounce/main.py", line 55, in get
    html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere. Can I just drop whatever code bytes are causing the problem instead of getting an error?


Answer 0

2018 Update:

As of February 2018, using compressions like gzip has become quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and Stack Exchange Network sites).
If you do a simple decode like in the original answer on a gzipped response, you’ll get an error like the following or similar:

UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0x8b in position 1: unexpected code byte

In order to decode a gzipped response you need to add the following modules (in Python 3):

import gzip
import io

Note: In Python 2 you’d use StringIO instead of io

Then you can parse the content out like this:

response = urlopen("https://example.com/gzipped-ressource")
buffer = io.BytesIO(response.read()) # Use StringIO.StringIO(response.read()) in Python 2
gzipped_file = gzip.GzipFile(fileobj=buffer)
decoded = gzipped_file.read()
content = decoded.decode("utf-8") # Replace utf-8 with the source encoding of your requested resource

This code reads the response, and places the bytes in a buffer. The gzip module then reads the buffer using the GZipFile function. After that, the gzipped file can be read into bytes again and decoded to normally readable text in the end.

Original Answer from 2010:

Can we get the actual value used for link?

In addition, we usually encounter this problem here when we are trying to .encode() an already encoded byte string. So you might try to decode it first as in

html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")

As an example:

html = '\xa0'
encoded_str = html.encode("utf8")

Fails with

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

While:

html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")

Succeeds without error. Do note that “windows-1252” is something I used as an example. I got this from chardet and it had 0.5 confidence that it is right! (well, as given with a 1-character-length string, what do you expect) You should change that to the encoding of the byte string returned from .urlopen().read() to what applies to the content you retrieved.

Another problem I see there is that the .encode() string method returns the modified string and does not modify the source in place. So it’s kind of useless to have self.response.out.write(html) as html is not the encoded string from html.encode (if that is what you were originally aiming for).

As Ignacio suggested, check the source webpage for the actual encoding of the returned string from read(). It’s either in one of the Meta tags or in the ContentType header in the response. Use that then as the parameter for .decode().

Do note however that it should not be assumed that other developers are responsible enough to make sure the header and/or meta character set declarations match the actual content. (Which is a PITA, yeah, I should know, I was one of those before).


Answer 1

>>> u'aあä'.encode('ascii', 'ignore')
'a'

Decode the string you get back, using either the charset in the appropriate meta tag in the response or in the Content-Type header, then encode.

The method encode(encoding, errors) accepts custom handlers for errors. Besides ignore, the other built-in handlers include:

>>> u'aあä'.encode('ascii', 'replace')
b'a??'
>>> u'aあä'.encode('ascii', 'xmlcharrefreplace')
b'a&#12354;&#228;'
>>> u'aあä'.encode('ascii', 'backslashreplace')
b'a\\u3042\\xe4'

See https://docs.python.org/3/library/stdtypes.html#str.encode


Answer 2

As an extension to Ignacio Vazquez-Abrams’ answer

>>> u'aあä'.encode('ascii', 'ignore')
'a'

It is sometimes desirable to remove accents from characters and print the base form. This can be accomplished with

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore')
'aa'

You may also want to translate other characters (such as punctuation) to their nearest equivalents, for instance the RIGHT SINGLE QUOTATION MARK unicode character does not get converted to an ascii APOSTROPHE when encoding.

>>> print u'\u2019'
’
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
>>> u'\u2019'.encode('ascii', 'ignore')
''
# Note we get an empty string back
>>> u'\u2019'.replace(u'\u2019', u'\'').encode('ascii', 'ignore')
"'"

Although there are more efficient ways to accomplish this; see this question for more details: Where is Python’s “best ASCII for this Unicode” database?


Answer 3

Use unidecode – it converts weird characters to ascii instantly, and even converts Chinese to phonetic ascii.

$ pip install unidecode

then:

>>> from unidecode import unidecode
>>> unidecode(u'北京')
'Bei Jing'
>>> unidecode(u'Škoda')
'Skoda'

Answer 4

I use this helper function throughout all of my projects. If it can’t convert the unicode, it ignores it. This ties into a django library, but with a little research you could bypass it.

from django.utils import encoding

def convert_unicode_to_string(x):
    """
    >>> convert_unicode_to_string(u'ni\xf1era')
    'niera'
    """
    return encoding.smart_str(x, encoding='ascii', errors='ignore')

I no longer get any unicode errors after using this.
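
If you would rather not pull in Django just for this, here is a Django-free sketch of the same idea for Python 3 (the function name is kept only to mirror the snippet above):

def convert_unicode_to_string(x):
    """
    Drop anything that cannot be represented in ASCII.

    >>> convert_unicode_to_string('ni\xf1era')
    'niera'
    """
    return x.encode('ascii', errors='ignore').decode('ascii')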


Answer 5

For broken consoles like cmd.exe and HTML output you can always use:

my_unicode_string.encode('ascii','xmlcharrefreplace')

This will preserve all the non-ASCII characters while keeping them printable in pure ASCII and in HTML.

WARNING: If you use this in production code to avoid errors, then most likely there is something wrong in your code. The only valid use cases for this are printing to a non-Unicode console or easy conversion to HTML entities in an HTML context.

And finally, if you are on Windows and use cmd.exe, you can type chcp 65001 to enable UTF-8 output (it works with the Lucida Console font). You might need to add myUnicodeString.encode('utf8').
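
For illustration, here is what xmlcharrefreplace produces on a made-up sample string (Python 3 session; the string itself is just an example):

>>> 'café & 北京'.encode('ascii', 'xmlcharrefreplace')
b'caf&#233; & &#21271;&#20140;'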


Answer 6

You wrote “””I assume that means the HTML contains some wrongly-formed attempt at unicode somewhere.”””

The HTML is NOT expected to contain any kind of “attempt at unicode”, well-formed or not. It must of necessity contain Unicode characters encoded in some encoding, which is usually declared up front; look for “charset”.

You appear to be assuming that the charset is UTF-8; on what grounds? The “\xA0” byte shown in your error message indicates that you may have a single-byte charset, e.g. cp1252.

If you can’t get any sense out of the declaration at the start of the HTML, try using chardet to find out what the likely encoding is.
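
For reference, a minimal chardet sketch (it assumes the chardet package is installed and uses a made-up file name); note that detect() returns a guess with a confidence, not a certainty:

import chardet

raw = open('page.html', 'rb').read()   # read raw bytes, not text; the file name is a placeholder
guess = chardet.detect(raw)            # e.g. {'encoding': 'windows-1252', 'confidence': 0.73, ...}
text = raw.decode(guess['encoding'] or 'latin-1', errors='replace')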

Why have you tagged your question with “regex”?

Update after you replaced your whole question with a non-question:

html = urllib.urlopen(link).read()
# html refers to a str object. To get unicode, you need to find out
# how it is encoded, and decode it.

html.encode("utf8","ignore")
# problem 1: will fail because html is a str object;
# encode works on unicode objects so Python tries to decode it using 
# 'ascii' and fails
# problem 2: even if it worked, the result will be ignored; it doesn't 
# update html in situ, it returns a function result.
# problem 3: "ignore" with UTF-n: any valid unicode object 
# should be encodable in UTF-n; error implies end of the world,
# don't try to ignore it. Don't just whack in "ignore" willy-nilly,
# put it in only with a comment explaining your very cogent reasons for doing so.
# "ignore" with most other encodings: error implies that you are mistaken
# in your choice of encoding -- same advice as for UTF-n :-)
# "ignore" with decode latin1 aka iso-8859-1: error implies end of the world.
# Irrespective of error or not, you are probably mistaken
# (needing e.g. cp1252 or even cp850 instead) ;-)

Answer 7

If you have a string line, you can use the .encode([encoding], [errors='strict']) method on the string to convert it to the given encoding.

line = 'my big string'

line.encode('ascii', 'ignore')

For more information about handling ASCII and unicode in Python, this is a really useful site: https://docs.python.org/2/howto/unicode.html


Answer 8

I think the answer is already there, but only in bits and pieces, which makes it difficult to quickly fix a problem such as

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

Let’s take an example. Suppose I have a file containing data in the following form (with both ASCII and non-ASCII characters):

1/10/17, 21:36 – Land : Welcome ��

and we want to ignore the non-ASCII characters and preserve only the ASCII ones.

This (Python 2) code will do it:

import unicodedata
fp  = open(<FILENAME>)
for line in fp:
    rline = line.strip()
    rline = unicode(rline, "utf-8")
    rline = unicodedata.normalize('NFKD', rline).encode('ascii','ignore')
    if len(rline) != 0:
        print rline

and type(rline) will give you

>type(rline) 
<type 'str'>
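
Since that snippet is Python 2, a rough Python 3 equivalent might look like this (keeping the <FILENAME> placeholder from above; the UTF-8 assumption and the 'replace' fallback are mine):

import unicodedata

with open(<FILENAME>, 'rb') as fp:          # <FILENAME> is still a placeholder
    for line in fp:
        # decode explicitly, then strip anything that will not survive an ASCII encode
        rline = line.strip().decode('utf-8', errors='replace')
        rline = unicodedata.normalize('NFKD', rline).encode('ascii', 'ignore').decode('ascii')
        if rline:
            print(rline)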

Answer 9

unicodestring = '\xa0'

decoded_str = unicodestring.decode("windows-1252")
encoded_str = decoded_str.encode('ascii', 'ignore')

Works for me
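
In Python 3, where str objects no longer have a decode method, the same idea is written against the raw bytes; a minimal sketch:

raw_bytes = b'\xa0'                                  # the byte you actually received
decoded_str = raw_bytes.decode('windows-1252')       # '\xa0' is NO-BREAK SPACE in cp1252
encoded_str = decoded_str.encode('ascii', 'ignore')  # b'' (the character is simply dropped)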


Answer 10

Looks like you are using Python 2.x. Python 2.x defaults to ASCII and does not assume any Unicode encoding; hence the exception.

Just paste the line below right after the shebang and it will work:

# -*- coding: utf-8 -*-

Best way to convert a string to bytes in Python 3?

Question: Best way to convert a string to bytes in Python 3?

There appear to be two different ways to convert a string to bytes, as seen in the answers to TypeError: ‘str’ does not support the buffer interface

Which of these methods would be better or more Pythonic? Or is it just a matter of personal preference?

b = bytes(mystring, 'utf-8')

b = mystring.encode('utf-8')

Answer 0

If you look at the docs for bytes, it points you to bytearray:

bytearray([source[, encoding[, errors]]])

Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256. It has most of the usual methods of mutable sequences, described in Mutable Sequence Types, as well as most methods that the bytes type has, see Bytes and Byte Array Methods.

The optional source parameter can be used to initialize the array in a few different ways:

If it is a string, you must also give the encoding (and optionally, errors) parameters; bytearray() then converts the string to bytes using str.encode().

If it is an integer, the array will have that size and will be initialized with null bytes.

If it is an object conforming to the buffer interface, a read-only buffer of the object will be used to initialize the bytes array.

If it is an iterable, it must be an iterable of integers in the range 0 <= x < 256, which are used as the initial contents of the array.

Without an argument, an array of size 0 is created.

So bytes can do much more than just encode a string. It’s Pythonic that it would allow you to call the constructor with any type of source parameter that makes sense.

For encoding a string, I think that some_string.encode(encoding) is more Pythonic than using the constructor, because it is the most self-documenting: “take this string and encode it with this encoding” is clearer than bytes(some_string, encoding), where there is no explicit verb.

Edit: I checked the Python source. If you pass a unicode string to bytes using CPython, it calls PyUnicode_AsEncodedString, which is the implementation of encode; so you’re just skipping a level of indirection if you call encode yourself.

Also, see Serdalis’ comment — unicode_string.encode(encoding) is also more Pythonic because its inverse is byte_string.decode(encoding) and symmetry is nice.
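
To illustrate the point that bytes accepts several kinds of source argument, here is a short interactive sketch (Python 3; the sample values are arbitrary):

>>> bytes('aあä', 'utf-8')        # str plus an encoding
b'a\xe3\x81\x82\xc3\xa4'
>>> bytes(3)                      # an int: that many null bytes
b'\x00\x00\x00'
>>> bytes([97, 98, 99])           # an iterable of ints in range(256)
b'abc'
>>> 'aあä'.encode('utf-8')        # the equivalent, more explicit spelling
b'a\xe3\x81\x82\xc3\xa4'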


Answer 1

It’s easier than you might think:

my_str = "hello world"
my_str_as_bytes = str.encode(my_str)
type(my_str_as_bytes) # ensure it is byte representation
my_decoded_str = my_str_as_bytes.decode()
type(my_decoded_str) # ensure it is string representation

Answer 2

The absolute best way is neither of the two, but a third: the first parameter of encode has defaulted to 'utf-8' ever since Python 3.0. Thus the best way is

b = mystring.encode()

This is also faster, because with the default argument the C code receives NULL rather than the string "utf-8", and NULL is much faster to check!

Here are some timings:

In [1]: %timeit -r 10 'abc'.encode('utf-8')
The slowest run took 38.07 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 183 ns per loop

In [2]: %timeit -r 10 'abc'.encode()
The slowest run took 27.34 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 137 ns per loop

Despite the warning, the times were very stable after repeated runs; the deviation was just ~2 per cent.


Using encode() without an argument is not Python 2 compatible, as in Python 2 the default character encoding is ASCII.

>>> 'äöä'.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)