How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8 and ASCII

Question: How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8 and ASCII

At work it seems like no week ever passes without some encoding-related conniption, calamity, or catastrophe. The problem usually derives from programmers who think they can reliably process a “text” file without specifying the encoding. But you can’t.

So it’s been decided to henceforth forbid files from ever having names that end in *.txt or *.text. The thinking is that those extensions mislead the casual programmer into a dull complacency regarding encodings, and this leads to improper handling. It would almost be better to have no extension at all, because at least then you know that you don’t know what you’ve got.

However, we aren’t going to go that far. Instead you will be expected to use a filename that ends in the encoding. So for text files, for example, these would be something like README.ascii, README.latin1, README.utf8, etc.

For files that demand a particular extension, if one can specify the encoding inside the file itself, such as in Perl or Python, then you shall do that. For files like Java source where no such facility exists internal to the file, you will put the encoding before the extension, such as SomeClass-utf8.java.
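
For example, Python reads such an in-file declaration from a PEP 263 coding comment near the top of the source, and Perl source can mark itself UTF-8 with the use utf8 pragma; a minimal illustration:

# -*- coding: utf-8 -*-
# PEP 263: this comment, on the first or second line of a Python source
# file, tells the interpreter which encoding the file's bytes are in.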

For output, UTF-8 is to be strongly preferred.

But for input, we need to figure out how to deal with the thousands of files in our codebase named *.txt. We want to rename all of them to fit into our new standard. But we can’t possibly eyeball them all. So we need a library or program that actually works.

These are variously in ASCII, ISO-8859-1, UTF-8, Microsoft CP1252, or Apple MacRoman. Although we know we can tell if something is ASCII, and we stand a good chance of knowing if something is probably UTF-8, we’re stumped about the 8-bit encodings. Because we’re running in a mixed Unix environment (Solaris, Linux, Darwin) with most desktops being Macs, we have quite a few annoying MacRoman files. And these especially are a problem.

For some time now I’ve been looking for a way to programmatically determine which of

  1. ASCII
  2. ISO-8859-1
  3. CP1252
  4. MacRoman
  5. UTF-8

a file is in, and I haven’t found a program or library that can reliably distinguish between those three different 8-bit encodings. We probably have over a thousand MacRoman files alone, so whatever charset detector we use has to be able to sniff those out. Nothing I’ve looked at can manage the trick. I had big hopes for the ICU charset detector library, but it cannot handle MacRoman. I’ve also looked at modules to do the same sort of thing in both Perl and Python, but again and again it’s always the same story: no support for detecting MacRoman.

What I am therefore looking for is an existing library or program that reliably determines which of those five encodings a file is in, and preferably more than that. In particular it has to distinguish between the three 8-bit encodings I’ve cited, especially MacRoman. The files are more than 99% English language text; there are a few in other languages, but not many.

If it’s library code, our language preference is for it to be in Perl, C, Java, or Python, and in that order. If it’s just a program, then we don’t really care what language it’s in so long as it comes in full source, runs on Unix, and is fully unencumbered.

Has anyone else had this problem of a zillion legacy text files randomly encoded? If so, how did you attempt to solve it, and how successful were you? This is the most important aspect of my question, but I’m also interested in whether you think encouraging programmers to name (or rename) their files with the actual encoding those files are in will help us avoid the problem in the future. Has anyone ever tried to enforce this on an institutional basis, and if so, was that successful or not, and why?

And yes, I fully understand why one cannot guarantee a definite answer given the nature of the problem. This is especially the case with small files, where you don’t have enough data to go on. Fortunately, our files are seldom small. Apart from the random README file, most are in the size range of 50k to 250k, and many are larger. Anything more than a few K in size is guaranteed to be in English.

The problem domain is biomedical text mining, so we sometimes deal with extensive and extremely large corpora, like all of PubMedCentral’s Open Access repository. A rather huge file is the BioThesaurus 6.0, at 5.7 gigabytes. This file is especially annoying because it is almost all UTF-8. However, some numbskull went and stuck a few lines in it that are in some 8-bit encoding, Microsoft CP1252 I believe. It takes quite a while before you trip on that one. :(


Answer 0

First, the easy cases:

ASCII

If your data contains no bytes above 0x7F, then it’s ASCII. (Or a 7-bit ISO646 encoding, but those are very obsolete.)

UTF-8

If your data validates as UTF-8, then you can safely assume it is UTF-8. Due to UTF-8’s strict validation rules, false positives are extremely rare.

ISO-8859-1 vs. windows-1252

The only difference between these two encodings is that ISO-8859-1 has the C1 control characters where windows-1252 has the printable characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ. I’ve seen plenty of files that use curly quotes or dashes, but none that use C1 control characters. So don’t even bother with them, or ISO-8859-1, just detect windows-1252 instead.

That now leaves you with only one question.

How do you distinguish MacRoman from cp1252?

This is a lot trickier.

Undefined characters

The bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D are not used in windows-1252. If they occur, then assume the data is MacRoman.

Identical characters

The bytes 0xA2 (¢), 0xA3 (£), 0xA9 (©), 0xB1 (±), 0xB5 (µ) happen to be the same in both encodings. If these are the only non-ASCII bytes, then it doesn’t matter whether you choose MacRoman or cp1252.

Statistical approach

Count character (NOT byte!) frequencies in the data you know to be UTF-8. Determine the most frequent characters. Then use this data to determine whether the cp1252 or MacRoman characters are more common.

For example, in a search I just performed on 100 random English Wikipedia articles, the most common non-ASCII characters are ·•–é°®’èö—. Based on this fact,

  • The bytes 0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, or 0xF6 suggest windows-1252.
  • The bytes 0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, or 0xE1 suggest MacRoman.

Count up the cp1252-suggesting bytes and the MacRoman-suggesting bytes, and go with whichever is greatest.
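
Here is a rough Python sketch of the whole procedure; the byte lists are exactly the ones above, while the function name, structure, and tie-breaking are only illustrative:

WIN1252_UNDEFINED = frozenset([0x81, 0x8D, 0x8F, 0x90, 0x9D])
WIN1252_HINTS = frozenset([0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6])
MACROMAN_HINTS = frozenset([0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1])

def guess_encoding(data):
    # data is the raw byte content of the file
    high_bytes = [b for b in bytearray(data) if b > 0x7F]
    if not high_bytes:
        return 'ascii'
    try:
        data.decode('utf-8')                  # strict validation
        return 'utf-8'
    except UnicodeDecodeError:
        pass
    # Bytes that windows-1252 leaves undefined strongly suggest MacRoman.
    if any(b in WIN1252_UNDEFINED for b in high_bytes):
        return 'macroman'
    # Otherwise count the bytes that hint at one encoding or the other.
    cp1252_score = sum(1 for b in high_bytes if b in WIN1252_HINTS)
    macroman_score = sum(1 for b in high_bytes if b in MACROMAN_HINTS)
    return 'macroman' if macroman_score > cp1252_score else 'windows-1252'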


Answer 1

Mozilla’s nsUniversalDetector (Perl bindings: Encode::Detect / Encode::Detect::Detector) has been proven a million times over.


Answer 2

My attempt at such a heuristic (assuming that you’ve ruled out ASCII and UTF-8), with a code sketch after the list:

  • If 0x7f to 0x9f don’t appear at all, it’s probably ISO-8859-1, because those are very rarely used control codes.
  • If 0x91 through 0x94 appear a lot, it’s probably Windows-1252, because those are the “smart quotes”, by far the most likely characters in that range to be used in English text. To be more certain, you could look for pairs.
  • Otherwise, it’s MacRoman, especially if you see a lot of 0xd2 through 0xd5 (that’s where the typographic quotes are in MacRoman).
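
A rough Python sketch of this heuristic; the byte ranges are the ones above, and the “a lot” test is simplified to a straight comparison, so treat the names and thresholds as illustrative only:

def guess_8bit_encoding(data):
    # data: raw bytes already known not to be ASCII or valid UTF-8
    counts = [0] * 256
    for b in bytearray(data):
        counts[b] += 1
    control_bytes = sum(counts[0x7F:0xA0])   # 0x7F through 0x9F
    smart_quotes = sum(counts[0x91:0x95])    # 0x91 through 0x94 (cp1252 quotes)
    mac_quotes = sum(counts[0xD2:0xD6])      # 0xD2 through 0xD5 (MacRoman quotes)
    if control_bytes == 0:
        return 'iso-8859-1'
    if smart_quotes > mac_quotes:
        return 'windows-1252'
    return 'macroman'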

Side note:

For files like Java source where no such facility exists internal to the file, you will put the encoding before the extension, such as SomeClass-utf8.java

Do not do this!!

The Java compiler expects file names to match class names, so renaming the files will render the source code uncompilable. The correct thing would be to guess the encoding, then use the native2ascii tool to convert all non-ASCII characters to Unicode escape sequences.


Answer 3

“Perl, C, Java, or Python, and in that order”: interesting attitude :-)

“we stand a good chance of knowing if something is probably UTF-8”: Actually the chance that a file containing meaningful text encoded in some other charset that uses high-bit-set bytes will decode successfully as UTF-8 is vanishingly small.

UTF-8 strategies (in your least preferred language):

# 100% Unicode-standard-compliant UTF-8
def utf8_strict(text):
    try:
        text.decode('utf8')
        return True
    except UnicodeDecodeError:
        return False

# looking for almost all UTF-8 with some junk
def utf8_replace(text):
    utext = text.decode('utf8', 'replace')
    dodgy_count = utext.count(u'\uFFFD') 
    return dodgy_count, utext
    # further action depends on how large dodgy_count / float(len(utext)) is

# checking for UTF-8 structure but non-compliant
# e.g. encoded surrogates, not minimal length, more than 4 bytes:
# Can be done with a regex, if you need it

Once you’ve decided that it’s neither ASCII nor UTF-8:

The Mozilla-origin charset detectors that I’m aware of don’t support MacRoman, and in any case they don’t do a good job on 8-bit charsets, especially for English: AFAICT they depend on checking whether the decoded text makes sense in a given language, they ignore the punctuation characters, and they are trained on a wide selection of documents in that language.

As others have remarked, you really only have the high-bit-set punctuation characters available to distinguish between cp1252 and macroman. I’d suggest training a Mozilla-type model on your own documents, not Shakespeare or Hansard or the KJV Bible, and taking all 256 bytes into account. I presume that your files have no markup (HTML, XML, etc) in them — that would distort the probabilities something shocking.
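
A crude illustration of what training on your own documents and taking all 256 bytes into account might look like; the samples would be files whose encodings you already know, and every name here is hypothetical:

import math
from collections import defaultdict

def train_byte_model(samples):
    # samples: iterable of (encoding_name, raw_bytes) pairs with known encodings
    counts = defaultdict(lambda: [1] * 256)   # add-one smoothing
    for encoding, data in samples:
        for b in bytearray(data):
            counts[encoding][b] += 1
    model = {}
    for encoding, c in counts.items():
        total = float(sum(c))
        model[encoding] = [math.log(n / total) for n in c]
    return model

def rank_encodings(model, data):
    # most plausible encoding first, by total log-likelihood of the bytes
    totals = {enc: sum(logp[b] for b in bytearray(data))
              for enc, logp in model.items()}
    return sorted(totals, key=totals.get, reverse=True)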

You’ve mentioned files that are mostly UTF-8 but fail to decode. You should also be very suspicious of:

(1) files that are allegedly encoded in ISO-8859-1 but contain “control characters” in the range 0x80 to 0x9F inclusive … this is so prevalent that the draft HTML5 standard says to decode ALL HTML streams declared as ISO-8859-1 using cp1252.

(2) files that decode OK as UTF-8 but the resultant Unicode contains “control characters” in the range U+0080 to U+009F inclusive … this can result from transcoding cp1252 / cp850 (seen it happen!) / etc files from “ISO-8859-1” to UTF-8.
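
A cheap post-decode sanity check for either symptom (the function name is mine, not from any library):

def has_c1_controls(utext):
    # utext: already-decoded Unicode text; C1 "control characters" in it
    # usually mean cp1252/cp850 data was mislabelled as ISO-8859-1 at
    # some point along the way.
    return any(u'\u0080' <= ch <= u'\u009F' for ch in utext)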

Background: I have a wet-Sunday-afternoon project to create a Python-based charset detector that’s file-oriented (instead of web-oriented) and works well with 8-bit character sets, including legacy ones like cp850 and cp437. It’s nowhere near prime time yet. I’m interested in training files; are your ISO-8859-1 / cp1252 / MacRoman files as “unencumbered” as you expect anyone’s code solution to be?


Answer 4

As you have discovered, there is no perfect way to solve this problem, because without the implicit knowledge about which encoding a file uses, all 8-bit encodings are exactly the same: A collection of bytes. All bytes are valid for all 8-bit encodings.

The best you can hope for, is some sort of algorithm that analyzes the bytes, and based on probabilities of a certain byte being used in a certain language with a certain encoding will guess at what encoding the files uses. But that has to know which language the file uses, and becomes completely useless when you have files with mixed encodings.

On the upside, if you know that the text in a file is written in English, then you’re unlikely to notice any difference whichever encoding you decide to use for that file, because the differences between all the mentioned encodings are localized in the parts of the encodings that specify characters not normally used in English. You might run into some trouble where the text uses special formatting or special versions of punctuation (CP1252 has several versions of the quote characters, for instance), but for the gist of the text there will probably be no problems.


Answer 5

If you can detect every encoding EXCEPT for MacRoman, then it would be logical to assume that the ones that can’t be deciphered are in MacRoman. In other words, just make a list of files that couldn’t be processed and handle those as if they were MacRoman.

Another way to sort these files would be to make a server-based program that allows users to decide which encoding isn’t garbled. Of course, it would be within the company, but with 100 employees doing a few each day, you’ll have thousands of files done in no time.

Finally, wouldn’t it be better to just convert all existing files to a single format, and require that new files be in that format?


Answer 6

Has anyone else had this problem of a zillion legacy text files randomly encoded? If so, how did you attempt to solve it, and how successful were you?

I am currently writing a program that translates files into XML. It has to autodetect the type of each file, which is a superset of the problem of determining the encoding of a text file. For determining the encoding I am using a Bayesian approach. That is, for each encoding it understands, my classification code computes a probability (likelihood) that a text file is in that encoding. The program then selects the most probable decoder. The Bayesian approach works like this for each encoding.

  1. Set the initial (prior) probability that the file is in the encoding, based on the frequencies of each encoding.
  2. Examine each byte in turn in the file. Look up the byte value to determine the correlation between that byte value being present and a file actually being in that encoding. Use that correlation to compute a new (posterior) probability that the file is in the encoding. If you have more bytes to examine, use the posterior probability from that byte as the prior probability when you examine the next byte.
  3. When you get to the end of the file (I actually look at only the first 1024 bytes), the probability you have is the probability that the file is in the encoding.

It transpires that Bayes’ theorem becomes very easy to do if instead of computing probabilities, you compute information content, which is the logarithm of the odds: info = log(p / (1.0 - p)).

You will have to compute the initial prior probabilities, and the correlations, by examining a corpus of files that you have manually classified.
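
A toy sketch of that scheme in Python; the priors and the per-byte log-odds adjustments would come from your manually classified corpus, and all of the names here are illustrative rather than the actual program:

import math

def logit(p):
    # the "information content": log of the odds
    return math.log(p / (1.0 - p))

def classify(data, priors, byte_weights, limit=1024):
    # priors: {encoding: prior probability}, reflecting how common each encoding is
    # byte_weights: {encoding: 256-entry list of log-odds adjustments per byte value}
    best_encoding, best_info = None, None
    for encoding, prior in priors.items():
        info = logit(prior)
        for b in bytearray(data[:limit]):    # only look at the first 1024 bytes
            info += byte_weights[encoding][b]
        # log-odds is monotonic in probability, so compare it directly
        if best_info is None or info > best_info:
            best_encoding, best_info = encoding, info
    return best_encoding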