Tag Archives: perl

How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8 and ASCII

Question: How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8 and ASCII

At work it seems like no week ever passes without some encoding-related conniption, calamity, or catastrophe. The problem usually derives from programmers who think they can reliably process a “text” file without specifying the encoding. But you can’t.

So it’s been decided to henceforth forbid files from ever having names that end in *.txt or *.text. The thinking is that those extensions mislead the casual programmer into a dull complacency regarding encodings, and this leads to improper handling. It would almost be better to have no extension at all, because at least then you know that you don’t know what you’ve got.

However, we aren’t going to go that far. Instead you will be expected to use a filename that ends in the encoding. So for text files, for example, these would be something like README.ascii, README.latin1, README.utf8, etc.

For files that demand a particular extension, if one can specify the encoding inside the file itself, such as in Perl or Python, then you shall do that. For files like Java source where no such facility exists internal to the file, you will put the encoding before the extension, such as SomeClass-utf8.java.

For output, UTF-8 is to be strongly preferred.

But for input, we need to figure out how to deal with the thousands of files in our codebase named *.txt. We want to rename all of them to fit into our new standard. But we can’t possibly eyeball them all. So we need a library or program that actually works.

These are variously in ASCII, ISO-8859-1, UTF-8, Microsoft CP1252, or Apple MacRoman. Although we know we can tell if something is ASCII, and we stand a good chance of knowing if something is probably UTF-8, we’re stumped about the 8-bit encodings. Because we’re running in a mixed Unix environment (Solaris, Linux, Darwin) with most desktops being Macs, we have quite a few annoying MacRoman files. And these especially are a problem.

For some time now I’ve been looking for a way to programmatically determine which of

  1. ASCII
  2. ISO-8859-1
  3. CP1252
  4. MacRoman
  5. UTF-8

a file is in, and I haven’t found a program or library that can reliably distinguish between those three different 8-bit encodings. We probably have over a thousand MacRoman files alone, so whatever charset detector we use has to be able to sniff those out. Nothing I’ve looked at can manage the trick. I had big hopes for the ICU charset detector library, but it cannot handle MacRoman. I’ve also looked at modules to do the same sort of thing in both Perl and Python, but again and again it’s always the same story: no support for detecting MacRoman.

What I am therefore looking for is an existing library or program that reliably determines which of those five encodings a file is in—and preferably more than that. In particular it has to distinguish between the three 8-bit encodings I’ve cited, especially MacRoman. The files are more than 99% English language text; there are a few in other languages, but not many.

If it’s library code, our language preference is for it to be in Perl, C, Java, or Python, and in that order. If it’s just a program, then we don’t really care what language it’s in so long as it comes in full source, runs on Unix, and is fully unencumbered.

Has anyone else had this problem of a zillion legacy text files randomly encoded? If so, how did you attempt to solve it, and how successful were you? This is the most important aspect of my question, but I’m also interested in whether you think encouraging programmers to name (or rename) their files with the actual encoding those files are in will help us avoid the problem in the future. Has anyone ever tried to enforce this on an institutional basis, and if so, was that successful or not, and why?

And yes, I fully understand why one cannot guarantee a definite answer given the nature of the problem. This is especially the case with small files, where you don’t have enough data to go on. Fortunately, our files are seldom small. Apart from the random README file, most are in the size range of 50k to 250k, and many are larger. Anything more than a few K in size is guaranteed to be in English.

The problem domain is biomedical text mining, so we sometimes deal with extensive and extremely large corpora, like all of PubMedCentral’s Open Access repository. A rather huge file is the BioThesaurus 6.0, at 5.7 gigabytes. This file is especially annoying because it is almost all UTF-8. However, some numbskull went and stuck a few lines in it that are in some 8-bit encoding—Microsoft CP1252, I believe. It takes quite a while before you trip on that one. :(


Answer 0

First, the easy cases:

ASCII

If your data contains no bytes above 0x7F, then it’s ASCII. (Or a 7-bit ISO646 encoding, but those are very obsolete.)

UTF-8

If your data validates as UTF-8, then you can safely assume it is UTF-8. Due to UTF-8’s strict validation rules, false positives are extremely rare.

ISO-8859-1 vs. windows-1252

The only difference between these two encodings is that ISO-8859-1 has the C1 control characters where windows-1252 has the printable characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ. I’ve seen plenty of files that use curly quotes or dashes, but none that use C1 control characters. So don’t even bother with them, or ISO-8859-1, just detect windows-1252 instead.

That now leaves you with only one question.

How do you distinguish MacRoman from cp1252?

This is a lot trickier.

Undefined characters

The bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D are not used in windows-1252. If they occur, then assume the data is MacRoman.

Identical characters

The bytes 0xA2 (¢), 0xA3 (£), 0xA9 (©), 0xB1 (±), 0xB5 (µ) happen to be the same in both encodings. If these are the only non-ASCII bytes, then it doesn’t matter whether you choose MacRoman or cp1252.

Statistical approach

Count character (NOT byte!) frequencies in the data you know to be UTF-8. Determine the most frequent characters. Then use this data to determine whether the cp1252 or MacRoman characters are more common.

For example, in a search I just performed on 100 random English Wikipedia articles, the most common non-ASCII characters are ·•–é°®’èö—. Based on this fact,

  • The bytes 0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, or 0xF6 suggest windows-1252.
  • The bytes 0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, or 0xE1 suggest MacRoman.

Count up the cp1252-suggesting bytes and the MacRoman-suggesting bytes, and go with whichever is greatest.
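
Putting the pieces above together, here is a minimal Python 3 sketch of the whole decision procedure. The byte sets come straight from this answer; the function name, the folding of ISO-8859-1 into windows-1252, and the simple "count the hint bytes" tie-break are my own reading of it, not the author's code:

# Byte sets taken from the answer above; everything else is illustrative.
WIN1252_UNDEFINED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}   # not used in windows-1252
WIN1252_HINTS  = {0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6}
MACROMAN_HINTS = {0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1}

def guess_encoding(data):
    """Return 'ASCII', 'UTF-8', 'windows-1252' or 'MacRoman' for a bytes object."""
    if all(b <= 0x7F for b in data):
        return 'ASCII'
    try:
        data.decode('utf-8')
        return 'UTF-8'          # strict validation; false positives are extremely rare
    except UnicodeDecodeError:
        pass
    if any(b in WIN1252_UNDEFINED for b in data):
        return 'MacRoman'       # contains bytes that windows-1252 never uses
    win = sum(1 for b in data if b in WIN1252_HINTS)
    mac = sum(1 for b in data if b in MACROMAN_HINTS)
    return 'windows-1252' if win >= mac else 'MacRoman'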


Answer 1

The Mozilla nsUniversalDetector (Perl bindings: Encode::Detect / Encode::Detect::Detector) has proven itself a million times over.


Answer 2

My attempt at such a heuristic (assuming that you’ve ruled out ASCII and UTF-8):

  • If 0x7f to 0x9f don’t appear at all, it’s probably ISO-8859-1, because those are very rarely used control codes.
  • If 0x91 through 0x94 appear a lot, it’s probably Windows-1252, because those are the “smart quotes”, by far the most likely characters in that range to be used in English text. To be more certain, you could look for pairs.
  • Otherwise, it’s MacRoman, especially if you see a lot of 0xd2 through 0xd5 (that’s where the typographic quotes are in MacRoman).
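
A compact Python 3 sketch of those three bullets, assuming ASCII and UTF-8 have already been ruled out (the function name and the reduction of "appear a lot" to a simple count comparison are my own assumptions, not part of the answer):

def guess_8bit(data):
    """Rough three-way guess over a bytes object, per the heuristic above."""
    if not any(0x7F <= b <= 0x9F for b in data):
        return 'ISO-8859-1'                                    # control range absent
    cp1252_quotes = sum(1 for b in data if 0x91 <= b <= 0x94)  # smart quotes
    mac_quotes = sum(1 for b in data if 0xD2 <= b <= 0xD5)     # MacRoman typographic quotes
    return 'Windows-1252' if cp1252_quotes > mac_quotes else 'MacRoman'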

Side note:

For files like Java source where no such facility exists internal to the file, you will put the encoding before the extension, such as SomeClass-utf8.java

Do not do this!!

The Java compiler expects file names to match class names, so renaming the files will render the source code uncompilable. The correct thing would be to guess the encoding, then use the native2ascii tool to convert all non-ASCII characters to Unicode escape sequences.


Answer 3

“Perl, C, Java, or Python, and in that order”: interesting attitude :-)

“we stand a good change of knowing if something is probably UTF-8”: Actually the chance that a file containing meaningful text encoded in some other charset that uses high-bit-set bytes will decode successfully as UTF-8 is vanishingly small.

UTF-8 strategies (in your least preferred language):

# 100% Unicode-standard-compliant UTF-8
def utf8_strict(text):
    try:
        text.decode('utf8')
        return True
    except UnicodeDecodeError:
        return False

# looking for almost all UTF-8 with some junk
def utf8_replace(text):
    utext = text.decode('utf8', 'replace')
    dodgy_count = utext.count(u'\uFFFD') 
    return dodgy_count, utext
    # further action depends on how large dodgy_count / float(len(utext)) is

# checking for UTF-8 structure but non-compliant
# e.g. encoded surrogates, not minimal length, more than 4 bytes:
# Can be done with a regex, if you need it
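
One possible way to drive those two helpers, as a hedged sketch (the file name and the idea of thresholding on the replacement-character ratio are assumptions, not part of the answer):

data = open('some_file.txt', 'rb').read()

if utf8_strict(data):
    print('strictly valid UTF-8')
else:
    dodgy_count, utext = utf8_replace(data)
    # e.g. treat a tiny ratio as "UTF-8 with a few junk lines", per the BioThesaurus case
    print('replacement-character ratio: %.6f' % (dodgy_count / float(len(utext))))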

Once you’ve decided that it’s neither ASCII nor UTF-8:

The Mozilla-origin charset detectors that I’m aware of don’t support MacRoman and in any case don’t do a good job on 8-bit charsets especially with English because AFAICT they depend on checking whether the decoding makes sense in the given language, ignoring the punctuation characters, and based on a wide selection of documents in that language.

As others have remarked, you really only have the high-bit-set punctuation characters available to distinguish between cp1252 and macroman. I’d suggest training a Mozilla-type model on your own documents, not Shakespeare or Hansard or the KJV Bible, and taking all 256 bytes into account. I presume that your files have no markup (HTML, XML, etc) in them — that would distort the probabilities something shocking.

You’ve mentioned files that are mostly UTF-8 but fail to decode. You should also be very suspicious of:

(1) files that are allegedly encoded in ISO-8859-1 but contain “control characters” in the range 0x80 to 0x9F inclusive … this is so prevalent that the draft HTML5 standard says to decode ALL HTML streams declared as ISO-8859-1 using cp1252.

(2) files that decode OK as UTF-8 but the resultant Unicode contains “control characters” in the range U+0080 to U+009F inclusive … this can result from transcoding cp1252 / cp850 (seen it happen!) / etc files from “ISO-8859-1” to UTF-8.
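
A small check for case (2), sketched in the same Python style as above (the function name is made up; it only flags the symptom, it does not repair it):

import re

C1_CONTROLS = re.compile(u'[\u0080-\u009f]')   # the C1 range in already-decoded text

def smells_like_mislabelled_cp1252(utext):
    """True if UTF-8-decoded text contains C1 controls, a common sign that
    cp1252/cp850 bytes were labelled ISO-8859-1 and then transcoded."""
    return C1_CONTROLS.search(utext) is not None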

Background: I have a wet-Sunday-afternoon project to create a Python-based charset detector that’s file-oriented (instead of web-oriented) and works well with 8-bit character sets, including legacy ones like cp850 and cp437. It’s nowhere near prime time yet. I’m interested in training files; are your ISO-8859-1 / cp1252 / MacRoman files as equally “unencumbered” as you expect anyone’s code solution to be?


Answer 4

As you have discovered, there is no perfect way to solve this problem, because without the implicit knowledge about which encoding a file uses, all 8-bit encodings are exactly the same: A collection of bytes. All bytes are valid for all 8-bit encodings.

The best you can hope for is some sort of algorithm that analyzes the bytes and, based on the probabilities of a certain byte being used in a certain language with a certain encoding, will guess at what encoding the file uses. But that has to know which language the file uses, and it becomes completely useless when you have files with mixed encodings.

On the upside, if you know that the text in a file is written in English, then you’re unlikely to notice any difference whichever encoding you decide to use for that file, as the differences between all the mentioned encodings are all localized in the parts of the encodings that specify characters not normally used in the English language. You might have some trouble where the text uses special formatting, or special versions of punctuation (CP1252 has several versions of the quote characters, for instance), but for the gist of the text there will probably be no problems.


Answer 5

If you can detect every encoding EXCEPT for macroman, then it would be logical to assume that the ones that can’t be deciphered are in macroman. In other words, just make a list of files that couldn’t be processed and handle those as if they were macroman.

Another way to sort these files would be to make a server based program that allows users to decide which encoding isn’t garbled. Of course, it would be within the company, but with 100 employees doing a few each day, you’ll have thousands of files done in no time.

Finally, wouldn’t it be better to just convert all existing files to a single format, and require that new files be in that format?


Answer 6

Has anyone else had this problem of a zillion legacy text files randomly encoded? If so, how did you attempt to solve it, and how successful were you?

I am currently writing a program that translates files into XML. It has to autodetect the type of each file, which is a superset of the problem of determining the encoding of a text file. For determining the encoding I am using a Bayesian approach. That is, my classification code computes a probability (likelihood) that a text file has a particular encoding for all the encodings it understands. The program then selects the most probable decoder. The Bayesian approach works like this for each encoding.

  1. Set the initial (prior) probability that the file is in the encoding, based on the frequencies of each encoding.
  2. Examine each byte in turn in the file. Look up the byte value to determine the correlation between that byte value being present and a file actually being in that encoding. Use that correlation to compute a new (posterior) probability that the file is in the encoding. If you have more bytes to examine, use the posterior probability after that byte as the prior probability when you examine the next byte.
  3. When you get to the end of the file (I actually look at only the first 1024 bytes), the probability you have is the probability that the file is in the encoding.

It transpires that Bayes’ theorem becomes very easy to do if instead of computing probabilities, you compute information content, which is the logarithm of the odds: info = log(p / (1.0 - p)).

You will have to compute the initial prior probabilities, and the correlations, by examining a corpus of files that you have manually classified.
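
A hedged Python 3 sketch of that procedure follows. It is a simplified naive-Bayes reading of the steps above (score each encoding by log prior plus summed per-byte log likelihoods and take the maximum) rather than the author's exact log-odds bookkeeping; the smoothing, the priors and the training corpus are placeholders:

import math
from collections import Counter

def train_byte_tables(labelled_files):
    """labelled_files: iterable of (bytes, encoding) pairs from a manually
    classified corpus. Returns per-encoding byte-probability tables, with
    add-one smoothing so unseen bytes don't zero out a score."""
    counts = {}
    for data, enc in labelled_files:
        counts.setdefault(enc, Counter()).update(data)
    tables = {}
    for enc, c in counts.items():
        total = sum(c.values()) + 256
        tables[enc] = [(c[b] + 1) / float(total) for b in range(256)]
    return tables

def classify(data, priors, tables, max_bytes=1024):
    """Pick the encoding with the highest log posterior over the first
    max_bytes bytes: log P(enc) + sum of log P(byte | enc)."""
    scores = {}
    for enc, table in tables.items():
        score = math.log(priors[enc])
        for b in data[:max_bytes]:
            score += math.log(table[b])
        scores[enc] = score
    return max(scores, key=scores.get)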


Seeking clarification on apparent contradictions regarding weakly typed languages

Question: Seeking clarification on apparent contradictions regarding weakly typed languages

I think I understand strong typing, but every time I look for examples for what is weak typing I end up finding examples of programming languages that simply coerce/convert types automatically.

For instance, this article named Typing: Strong vs. Weak, Static vs. Dynamic says that Python is strongly typed because you get an exception if you try to:

Python

1 + "1"
Traceback (most recent call last):
File "", line 1, in ? 
TypeError: unsupported operand type(s) for +: 'int' and 'str'

However, such a thing is possible in Java and in C#, and we do not consider them weakly typed just for that.

Java

  int a = 10;
  String b = "b";
  String result = a + b;
  System.out.println(result);

C#

int a = 10;
string b = "b";
string c = a + b;
Console.WriteLine(c);

In this other article named Weakly Type Languages the author says that Perl is weakly typed simply because I can concatenate a string to a number and vice versa without any explicit conversion.

Perl

$a=10;
$b="a";
$c=$a.$b;
print $c; #10a

So the same example makes Perl weakly typed, but not Java and C#?

Gee, this is confusing

The authors seem to imply that a language that prevents the application of certain operations on values of different types is strongly typed and the contrary means weakly typed.

Therefore, at some point I have felt prompted to believe that if a language provides a lot of automatic conversions or coercion between types (as perl) may end up being considered weakly typed, whereas other languages that provide only a few conversions may end up being considered strongly typed.

I am inclined to believe, though, that I must be wrong in this interpretation; I just do not know why or how to explain it.

So, my questions are:

  • What does it really mean for a language to be truly weakly typed?
  • Could you mention any good examples of weak typing that are not related to automatic conversion/automatic coercion done by the language?
  • Can a language be weakly typed and strongly typed at the same time?

Answer 0

UPDATE: This question was the subject of my blog on the 15th of October, 2012. Thanks for the great question!


What does it really mean for a language to be “weakly typed”?

It means “this language uses a type system that I find distasteful”. A “strongly typed” language by contrast is a language with a type system that I find pleasant.

The terms are essentially meaningless and you should avoid them. Wikipedia lists eleven different meanings for “strongly typed”, several of which are contradictory. This indicates that the odds of confusion being created are high in any conversation involving the term “strongly typed” or “weakly typed”.

All that you can really say with any certainty is that a “strongly typed” language under discussion has some additional restriction in the type system, either at runtime or compile time, that a “weakly typed” language under discussion lacks. What that restriction might be cannot be determined without further context.

Instead of using “strongly typed” and “weakly typed”, you should describe in detail what kind of type safety you mean. For example, C# is a statically typed language and a type safe language and a memory safe language, for the most part. C# allows all three of those forms of “strong” typing to be violated. The cast operator violates static typing; it says to the compiler “I know more about the runtime type of this expression than you do”. If the developer is wrong, then the runtime will throw an exception in order to protect type safety. If the developer wishes to break type safety or memory safety, they can do so by turning off the type safety system by making an “unsafe” block. In an unsafe block you can use pointer magic to treat an int as a float (violating type safety) or to write to memory you do not own. (Violating memory safety.)

C# imposes type restrictions that are checked at both compile-time and at runtime, thereby making it a “strongly typed” language compared to languages that do less compile-time checking or less runtime checking. C# also allows you to in special circumstances do an end-run around those restrictions, making it a “weakly typed” language compared with languages which do not allow you to do such an end-run.

Which is it really? It is impossible to say; it depends on the point of view of the speaker and their attitude towards the various language features.


Answer 1

As others have noted, the terms “strongly typed” and “weakly typed” have so many different meanings that there’s no single answer to your question. However, since you specifically mentioned Perl in your question, let me try to explain in what sense Perl is weakly typed.

The point is that, in Perl, there is no such thing as an “integer variable”, a “float variable”, a “string variable” or a “boolean variable”. In fact, as far as the user can (usually) tell, there aren’t even integer, float, string or boolean values: all you have are “scalars”, which are all of these things at the same time. So you can, for example, write:

$foo = "123" + "456";           # $foo = 579
$bar = substr($foo, 2, 1);      # $bar = 9
$bar .= " lives";               # $bar = "9 lives"
$foo -= $bar;                   # $foo = 579 - 9 = 570

Of course, as you correctly note, all of this can be seen as just type coercion. But the point is that, in Perl, types are always coerced. In fact, it’s quite hard for a user to tell what the internal “type” of a variable might be: at line 2 in my example above, asking whether the value of $bar is the string "9" or the number 9 is pretty much meaningless, since, as far as Perl is concerned, those are the same thing. Indeed, it’s even possible for a Perl scalar to internally have both a string and a numeric value at the same time, as is e.g. the case for $foo after line 2 above.

The flip side of all this is that, since Perl variables are untyped (or, rather, don’t expose their internal type to the user), operators cannot be overloaded to do different things for different types of arguments; you can’t just say “this operator will do X for numbers and Y for strings”, because the operator can’t (won’t) tell which kind of values its arguments are.

Thus, for example, Perl has and needs both a numeric addition operator (+) and a string concatenation operator (.): as you saw above, it’s perfectly fine to add strings ("1" + "2" == "3") or to concatenate numbers (1 . 2 == 12). Similarly, the numeric comparison operators ==, !=, <, >, <=, >= and <=> compare the numeric values of their arguments, while the string comparison operators eq, ne, lt, gt, le, ge and cmp compare them lexicographically as strings. So 2 < 10, but 2 gt 10 (but "02" lt 10, while "02" == 2). (Mind you, certain other languages, like JavaScript, try to accommodate Perl-like weak typing while also doing operator overloading. This often leads to ugliness, like the loss of associativity for +.)

(The fly in the ointment here is that, for historical reasons, Perl 5 does have a few corner cases, like the bitwise logical operators, whose behavior depends on the internal representation of their arguments. Those are generally considered an annoying design flaw, since the internal representation can change for surprising reasons, and so predicting just what those operators do in a given situation can be tricky.)

All that said, one could argue that Perl does have strong types; they’re just not the kind of types you might expect. Specifically, in addition to the “scalar” type discussed above, Perl also has two structured types: “array” and “hash”. Those are very distinct from scalars, to the point where Perl variables have different sigils indicating their type ($ for scalars, @ for arrays, % for hashes)1. There are coercion rules between these types, so you can write e.g. %foo = @bar, but many of them are quite lossy: for example, $foo = @bar assigns the length of the array @bar to $foo, not its contents. (Also, there are a few other strange types, like typeglobs and I/O handles, that you don’t often see exposed.)

Also, a slight chink in this nice design is the existence of reference types, which are a special kind of scalars (and which can be distinguished from normal scalars, using the ref operator). It’s possible to use references as normal scalars, but their string/numeric values are not particularly useful, and they tend to lose their special reference-ness if you modify them using normal scalar operations. Also, any Perl variable2 can be blessed to a class, turning it into an object of that class; the OO class system in Perl is somewhat orthogonal to the primitive type (or typelessness) system described above, although it’s also “weak” in the sense of following the duck typing paradigm. The general opinion is that, if you find yourself checking the class of an object in Perl, you’re doing something wrong.


1 Actually, the sigil denotes the type of the value being accessed, so that e.g. the first scalar in the array @foo is denoted $foo[0]. See perlfaq4 for more details.

2 Objects in Perl are (normally) accessed through references to them, but what actually gets blessed is the (possibly anonymous) variable the reference points to. However, the blessing is indeed a property of the variable, not of its value, so e.g. that assigning the actual blessed variable to another one just gives you a shallow, unblessed copy of it. See perlobj for more details.


Answer 2

In addition to what Eric has said, consider the following C code:

void f(void* x);

f(42);
f("hello");

In contrast to languages such as Python, C#, Java or whatnot, the above is weakly typed because we lose type information. Eric correctly pointed out that in C# we can circumvent the compiler by casting, effectively telling it “I know more about the type of this variable than you”.

But even then, the runtime will still check the type! If the cast is invalid, the runtime system will catch it and throw an exception.

With type erasure, this doesn’t happen – type information is thrown away. A cast to void* in C does exactly that. In this regard, the above is fundamentally different from a C# method declaration such as void f(Object x).

(Technically, C# also allows type erasure through unsafe code or marshalling.)

This is as weakly typed as it gets. Everything else is just a matter of static vs. dynamic type checking, i.e. of the time when a type is checked.


Answer 3

A perfect example comes from the Wikipedia article on Strong Typing:

Generally strong typing implies that the programming language places severe restrictions on the intermixing that is permitted to occur.

Weak Typing

a = 2
b = "2"

concatenate(a, b) # returns "22"
add(a, b) # returns 4

Strong Typing

a = 2
b = "2"

concatenate(a, b) # Type Error
add(a, b) # Type Error
concatenate(str(a), b) #Returns "22"
add(a, int(b)) # Returns 4

Notice that a weakly typed language can intermix different types without errors. A strongly typed language requires the input types to be the expected types. In a strongly typed language a type can be converted (str(a) converts an integer to a string) or cast (int(b)).

This all depends on the interpretation of typing.


Answer 4

I would like to contribute to the discussion with my own research on the subject. As others have commented and contributed, I have been reading their answers and following their references, and I have found interesting information. As suggested, most of this would probably be better discussed in the Programmers forum, since it appears to be more theoretical than practical.

From a theoretical standpoint, I think the article by Luca Cardelli and Peter Wegner named On Understanding Types, Data Abstraction and Polymorphism has one of the best arguments I have read.

A type may be viewed as a set of clothes (or a suit of armor) that protects an underlying untyped representation from arbitrary or unintended use. It provides a protective covering that hides the underlying representation and constrains the way objects may interact with other objects. In an untyped system untyped objects are naked in that the underlying representation is exposed for all to see. Violating the type system involves removing the protective set of clothing and operating directly on the naked representation.

This statement seems to suggest that weak typing would let us access the inner structure of a type and manipulate it as if it were something else (another type). Perhaps this is what we could do with unsafe code (mentioned by Eric) or with the C type-erased pointers mentioned by Konrad.

The article continues…

Languages in which all expressions are type-consistent are called strongly typed languages. If a language is strongly typed its compiler can guarantee that the programs it accepts will execute without type errors. In general, we should strive for strong typing, and adopt static typing whenever possible. Note that every statically typed language is strongly typed but the converse is not necessarily true.

As such, strong typing means the absence of type errors, I can only assume that weak typing means the contrary: the likely presence of type errors. At runtime or compile time? Seems irrelevant here.

Funny thing, as per this definition, a language with powerful type coercions like Perl would be considered strongly typed, because the system is not failing, but it is dealing with the types by coercing them into appropriate and well defined equivalences.

On the other hand, could I say that the allowance of ClassCastException and ArrayStoreException (in Java) and InvalidCastException and ArrayTypeMismatchException (in C#) would indicate a level of weak typing, at least at compile time? Eric’s answer seems to agree with this.

In a second article named Typeful Programming, referenced in one of the answers to this question, Luca Cardelli delves into the concept of type violations:

Most system programming languages allow arbitrary type violations, some indiscriminately, some only in restricted parts of a program. Operations that involve type violations are called unsound. Type violations fall in several classes [among which we can mention]:

Basic-value coercions: These include conversions between integers, booleans, characters, sets, etc. There is no need for type violations here, because built-in interfaces can be provided to carry out the coercions in a type-sound way.

As such, type coercions like those provided by operators could be considered type violations, but unless they break the consistency of the type system, we might say that they do not lead to a weakly typed system.

Based on this neither Python, Perl, Java or C# are weakly typed.

Cardelli mentions two type violations that I very well consider cases of truly weak typing:

Address arithmetic. If necessary, there should be a built-in (unsound) interface, providing the adequate operations on addresses and type conversions. Various situations involve pointers into the heap (very dangerous with relocating collectors), pointers to the stack, pointers to static areas, and pointers into other address spaces. Sometimes array indexing can replace address arithmetic. Memory mapping. This involves looking at an area of memory as an unstructured array, although it contains structured data. This is typical of memory allocators and collectors.

These kinds of things, possible in languages like C (mentioned by Konrad) or through unsafe code in .Net (mentioned by Eric), would truly imply weak typing.

I believe the best answer so far is Eric’s, because the definition of these concepts is very theoretical, and when it comes to a particular language, the interpretations of all these concepts may lead to different debatable conclusions.


Answer 5

Weak typing does indeed mean that a high percentage of types can be implicitly coerced, attempting to guess what the coder intended.

Strong typing means that types are not coerced, or at least coerced less.

Static typing means your variables’ types are determined at compile time.

Many people have recently been confusing “manifestly typed” with “strongly typed”. “Manifestly typed” means that you declare your variables’ types explicitly.

Python is mostly strongly typed, though you can use almost anything in a boolean context, and booleans can be used in an integer context, and you can use an integer in a float context. It is not manifestly typed, because you don’t need to declare your types (except for Cython, which isn’t entirely python, albeit interesting). It is also not statically typed.
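
A tiny illustrative snippet of those allowances (Python; the example is mine, not from the answer):

# Python coerces across a few numeric/boolean contexts...
print(True + 1)            # 2   (bool in an integer context)
print(3 * 0.5)             # 1.5 (int in a float context)
if "non-empty string":     # almost anything works in a boolean context
    print("truthy")
# ...but refuses to mix int and str:
try:
    1 + "1"
except TypeError as e:
    print(e)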

C and C++ are manifestly typed, statically typed, and somewhat strongly typed, because you declare your types, types are determined at compile time, and you can mix integers and pointers, or integers and doubles, or even cast a pointer to one type into a pointer to another type.

Haskell is an interesting example, because it is not manifestly typed, but it’s also statically and strongly typed.


Answer 6

The strong <=> weak typing is not only about the continuum on how much or how little of the values are coerced automatically by the language for one datatype to another, but how strongly or weakly the actual values are typed. In Python and Java, and mostly in C#, the values have their types set in stone. In Perl, not so much – there are really only a handful of different valuetypes to store in a variable.

Let’s open the cases one by one.


Python

In the Python example 1 + "1", the + operator calls __add__ for the type int, giving it the string "1" as an argument – however, this results in NotImplemented:

>>> (1).__add__('1')
NotImplemented

Next, the interpreter tries the __radd__ of str:

>>> '1'.__radd__(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute '__radd__'

As it fails, the + operator fails with the result TypeError: unsupported operand type(s) for +: 'int' and 'str'. As such, the exception does not say much about strong typing, but the fact that the operator + does not coerce its arguments automatically to the same type is a pointer to the fact that Python is not the most weakly typed language in the continuum.

On the other hand, in Python 'a' * 5 is implemented:

>>> 'a' * 5
'aaaaa'

That is,

>>> 'a'.__mul__(5)
'aaaaa'

The fact that the operation is different requires some strong typing – however, the opposite behaviour, with * coercing the values to numbers before multiplying, still would not necessarily make the values weakly typed.


Java

The Java example, String result = "1" + 1; works only because as a fact of convenience, the operator + is overloaded for strings. The Java + operator replaces the sequence with creating a StringBuilder (see this):

String result = a + b;
// becomes something like
String result = new StringBuilder().append(a).append(b).toString()

This is rather an example of very static typing, with no actual coercion – StringBuilder has a method append(Object) that is specifically used here. The documentation says the following:

Appends the string representation of the Object argument.

The overall effect is exactly as if the argument were converted to a string by the method String.valueOf(Object), and the characters of that string were then appended to this character sequence.

Where String.valueOf then

Returns the string representation of the Object argument. [Returns] if the argument is null, then a string equal to "null"; otherwise, the value of obj.toString() is returned.

Thus this is a case of absolutely no coercion by the language – every concern is delegated to the objects themselves.


C#

According to the Jon Skeet answer here, operator + is not even overloaded for the string class – akin to Java, this is just convenience generated by the compiler, thanks to both static and strong typing.


Perl

As the perldata explains,

Perl has three built-in data types: scalars, arrays of scalars, and associative arrays of scalars, known as “hashes”. A scalar is a single string (of any size, limited only by the available memory), number, or a reference to something (which will be discussed in perlref). Normal arrays are ordered lists of scalars indexed by number, starting with 0. Hashes are unordered collections of scalar values indexed by their associated string key.

Perl however does not have a separate data type for numbers, booleans, strings, nulls, undefineds, references to other objects etc – it just has one type for all of these, the scalar type; 0 is a scalar value as much as “0” is. A scalar variable that was set as a string can really change into a number, and from there on behave differently from “just a string” if it is accessed in a numerical context. The scalar can hold anything in Perl; it is as much the object as anything that exists in the system: whereas in Python the names just refer to the objects, in Perl the scalar values in the names are changeable objects. Furthermore, the Object Oriented Type system is glued on top of this: there are just 3 datatypes in perl – scalars, lists and hashes. A user defined object in Perl is a reference (that is, a pointer to any of the 3 previous) blessed to a package – you can take any such value and bless it to any class at any instant you want.

Perl even allows you to change the classes of values at whim – this is not possible in Python, where to create a value of some class you need to explicitly construct the value belonging to that class with object.__new__ or similar. In Python you cannot really change the essence of the object after creation; in Perl you can do pretty much anything:

package Foo;
package Bar;

my $val = 42;
# $val is now a scalar value set from double
bless \$val, Foo;
# all references to $val now belong to class Foo
my $obj = \$val;
# now $obj refers to the SV stored in $val
# thus this prints: Foo=SCALAR(0x1c7d8c8)
print \$val, "\n"; 
# all references to $val now belong to class Bar
bless \$val, Bar;
# thus this prints Bar=SCALAR(0x1c7d8c8)
print \$val, "\n";
# we change the value stored in $val from number to a string
$val = 'abc';
# yet still the SV is blessed: Bar=SCALAR(0x1c7d8c8)
print \$val, "\n";
# and on the course, the $obj now refers to a "Bar" even though
# at the time of copying it did refer to a "Foo".
print $obj, "\n";

thus the type identity is weakly bound to the variable, and it can be changed through any reference on the fly. In fact, if you do

my $another = $val;

\$another does not have the class identity, even though \$val will still give the blessed reference.


TL;DR

There is much more to Perl’s weak typing than just automatic coercions; it is more that the types of the values themselves are not set in stone, unlike in Python, which is a dynamically yet very strongly typed language. That Python gives a TypeError on 1 + "1" is an indication that the language is strongly typed, even though the contrary behaviour of doing something useful with it, as in Java or C#, does not preclude them from being strongly typed languages.


Answer 7

As many others have expressed, the entire notion of “strong” vs “weak” typing is problematic.

As an archetype, Smalltalk is very strongly typed — it will always raise an exception if an operation between two objects is incompatible. However, I suspect few on this list would call Smalltalk a strongly-typed language, because it is dynamically typed.

I find the notion of “static” versus “dynamic” typing more useful than “strong” versus “weak.” A statically-typed language has all the types figured out at compile-time, and the programmer has to explicitly declare if otherwise.

Contrast with a dynamically-typed language, where typing is performed at run-time. This is typically a requirement for polymorphic languages, so that decisions about whether an operation between two objects is legal does not have to be decided by the programmer in advance.

In polymorphic, dynamically-typed languages (like Smalltalk and Ruby), it’s more useful to think of a “type” as a “conformance to protocol.” If an object obeys a protocol the same way another object does — even if the two objects do not share any inheritance or mixins or other voodoo — they are considered the same “type” by the run-time system. More correctly, an object in such systems is autonomous, and can decide if it makes sense to respond to any particular message referring to any particular argument.

Want an object that can make some meaningful response to the message “+” with an object argument that describes the colour blue? You can do that in dynamically-typed languages, but it is a pain in statically-typed languages.
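
As a hedged illustration of that last point in Python (the classes and the "name" protocol are invented for this example, not taken from the answer):

class Blue:
    name = "blue"

class Mixer:
    """Decides at run time whether '+' makes sense for whatever it was given."""
    def __init__(self, description):
        self.description = description
    def __add__(self, other):
        # duck typing: anything with a .name will do, no shared base class needed
        if hasattr(other, "name"):
            return Mixer(self.description + " + " + other.name)
        return NotImplemented

print((Mixer("grey") + Blue()).description)    # grey + blue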


Answer 8

I like @Eric Lippert’s answer, but to address the question – strongly typed languages typically have explicit knowledge of the types of variables at each point of the program. Weakly typed languages do not, so they can attempt to perform an operation that may not be possible for a particular type. I think the easiest way to see this is in a function. C++:

void func(string a) {...}

The variable a is known to be of type string and any incompatible operation will be caught at compile time.

Python:

def func(a):
  ...

The variable a could be anything and we can have code that calls an invalid method, which will only get caught at runtime.
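
For example (an illustrative sketch; the .upper() call is made up just to show where the failure surfaces):

def func(a):
    return a.upper()      # only meaningful for objects that have .upper()

print(func("hello"))      # HELLO
try:
    func(42)              # nothing stops this call from being written...
except AttributeError as e:
    print(e)              # ...the mistake only shows up at runtime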


What are the differences between Perl, Python, AWK and sed? [closed]

Question: What are the differences between Perl, Python, AWK and sed? [closed]

Just want to know what the main differences among them are, and the power of each language (where it’s better to use it).

Edit: it’s not a “vs.”-like topic, just information.


Answer 0

In order of appearance, the languages are sed, awk, perl, python.

The sed program is a stream editor and is designed to apply the actions from a script to each line (or, more generally, to specified ranges of lines) of the input file or files. Its language is based on ed, the Unix editor, and although it has conditionals and so on, it is hard to work with for complex tasks. You can work minor miracles with it – but at a cost to the hair on your head. However, it is probably the fastest of the programs when attempting tasks within its remit. (It has the least powerful regular expressions of the programs discussed – adequate for many purposes, but certainly not PCRE – Perl-Compatible Regular Expressions)

The awk program (name from the initials of its authors – Aho, Weinberger, and Kernighan) is a tool initially for formatting reports. It can be used as a souped-up sed; in its more recent versions, it is computationally complete. It uses an interesting idea – the program is based on ‘patterns matched’ and ‘actions taken when the pattern matches’. The patterns are fairly powerful (Extended Regular Expressions). The language for the actions is similar to C. One of the key features of awk is that it splits the input automatically into records and each record into fields.

Perl was written in part as an awk-killer and sed-killer. Two of the programs provided with it are a2p and s2p for converting awk scripts and sed scripts into Perl. Perl is one of the earliest of the next generation of scripting languages (Tcl/Tk can probably claim primacy). It has powerful integrated regular-expression handling, and the language as a whole is vastly more capable than awk or sed. It provides access to almost all system calls and has the extensibility of the CPAN modules. (Neither awk nor sed is extensible.) One of Perl’s mottos is “TMTOWTDI – There’s more than one way to do it” (pronounced “tim-toady”). Perl has ‘objects’, but object support is more of an add-on than a fundamental part of the language.

Python was written last, and probably in part as a reaction to Perl. It has some interesting syntactic ideas (indenting to indicate levels – no braces or equivalents). It is more fundamentally object-oriented than Perl; it is just as extensible as Perl.

OK – when to use each?

  • Sed – when you need to do simple text transforms on files.
  • Awk – when you only need simple formatting and summarisation or transformation of data.
  • Perl – for almost any task, but especially when the task needs complex regular expressions.
  • Python – for the same tasks that you could use Perl for.

I’m not aware of anything that Perl can do that Python can’t, nor vice versa. The choice between the two would depend on other factors. I learned Perl before there was a Python, so I tend to use it. Python has less accreted syntax and is generally somewhat simpler to learn. Perl 6, when it becomes available, will be a fascinating development.

(Note that the ‘overviews’ of Perl and Python, in particular, are woefully incomplete; whole books could be written on the topic.)


Answer 1


After mastering a few dozen languages, you get tired of people like S. Lott (see his controversial answer to this question, nearly half as many down-votes as up (+45/-22) six years after answering).

Sed is the best tool for extremely simple command-line pipelines. In the hands of a sed master, it’s suitable for one-offs of arbitrary complexity, but it should not be used in production code except in very simple substitution pipelines – stuff like ‘s/this/that/’.

Gawk (the GNU awk) is by far the best choice for complex data reformatting when there is only a single input source and a single output (or, multiple outputs sequentially written). Since a great deal of real-world work conforms to this description, and a good programmer can learn gawk in two hours, it is the best choice. On this planet, simpler and faster is better!

Perl and Python are far better than any version of awk or sed when you have very complex input/output scenarios. The more complex the problem is, the better off you are using Python, from a maintenance and readability standpoint. Note, however, that a good programmer can write readable code in any language, and a bad programmer can write unmaintainable crap in any useful language, so the choice of Perl or Python can safely be left to the preferences of the programmer, provided said programmer is skilled and clever.


Answer 2


I wouldn’t call sed a fully-fledged programming language; it is a stream editor with language constructs aimed at editing text files programmatically.

Awk is a little more of a general purpose language but it is still best suited for text processing.

Perl and Python are fully fledged, general-purpose programming languages. Perl has its roots in text processing and has a number of awk-like constructs (there is even an awk-to-perl script floating around on the net). There are many differences between Perl and Python; your best bet is probably to read the summaries of both languages on something like Wikipedia to get a good grasp of what they are.


Answer 3


First, there are two unrelated kinds of things in the list “Perl, Python, awk and sed”.

Thing 1 – simplistic text manipulation tools.

  • sed. It has a fixed, relatively simple scope of work defined by the idea of reading and examining each line of a file. sed is not designed to be particularly readable. It is designed to be very small and very efficient on very tiny unix servers.

  • awk. It has a slightly less fixed, less simple scope of work. However, the main loop of an awk program is defined by the implicit reading of lines of a source file.

These are not “complete” programming languages. While you can — with some work — write fairly sophisticated programs in awk, it rapidly gets complicated and difficult to read.

Thing 2 – general-purpose programming languages. These have a rich variety of statement types, numerous built-in data structures, and no wired-in assumptions or shortcuts to speak of.

  • Perl.

  • Python.

When to use them.

  • sed. Never. It really doesn’t have any value in the modern era of computers with more than 32K of memory. Perl or Python do the same things more clearly.

  • awk. Never. Like sed, it reflects an earlier era of computing. Rather than maintain this language (in addition to all the other languages required for a successful system), it is simpler to just do everything in one pleasant language.

  • Perl. Any programming problem of any kind. If you like free-thinking syntax, where there are many, many ways to do the same thing, perl is fun.

  • Python. Any programming problem of any kind. If you like a fairly limited syntax, with fewer choices, less subtlety, and (perhaps) more clarity, then Python is the better fit. Python’s object-oriented nature makes it more suitable for large, complex problems.

Background — I’m not bashing sed and awk out of ignorance. I learned awk over 20 years ago. Did many things with it; used to teach it as a core unix skill. I learned Perl about 15 years ago. Did many sophisticated things with it. I’ve left both behind because I can do the same things in Python — and it is simpler and more clear.

There are two serious problems with sed and awk, neither of which is their age.

  1. The incompleteness of their implementation. Everything sed and awk do can be done in Python or Perl, often more simply and sometimes faster, too. A shell pipeline has some performance advantages because of its multi-processing. Python offers a subprocess module to allow me to recover those advantages (see the sketch after this list).

  2. The need to learn yet another language. By doing things in Python (or Perl) your implementation depends on fewer languages, with a resulting increase in clarity.
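
As a hedged illustration of the point in item 1, here is a minimal Python sketch that stands in for a small shell pipeline such as grep ERROR app.log | sed 's/this/that/g'; the file name, pattern, and replacement are placeholders, not anything from this answer. The external grep still runs as a separate process (via subprocess, keeping the pipeline’s parallelism), while the sed-style substitution is done in Python itself:

import re
import subprocess

# Run grep in its own process and pipe its output into this script.
grep = subprocess.Popen(["grep", "ERROR", "app.log"],
                        stdout=subprocess.PIPE, text=True)

# Do the sed-style substitution on each matching line in Python.
for line in grep.stdout:
    print(re.sub(r"this", "that", line), end="")

grep.wait()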


Answer 4


When to use them: awk – never – S. Lott.

I think S. Lott slightly missed the mark with this recommendation. The fact is, on Linux and other UNIX environments, awk is a useful tool to use alongside bash, sh, and ksh for quick text processing. The idea of scripting itself is that you solve your problem by gluing together this tool and that tool. Hence in admin scripts it is common to have ls, grep, |, awk, time, ps, and so on. Each is a tool that the scripter combines, like a builder laying brick upon brick, to finish the building (that is, to solve the problem at hand).

For instance, I am a member of the team managing a paintball gear supplies dotcom. This e-commerce site is based on the LAMP stack. To automate the processing and normalization of data feeds from various suppliers into the back-end database, we employ and maintain a diverse mix of scripts, including bash, perl, php, and even expect. Each has its strengths based on the available modules and APIs. In the bash scripts we use awk to do quick pattern matching and take the appropriate actions on those patterns as needed, without having to switch to Perl. One thing I would also like to point out, which has not been emphasized in the thread, is that a fair number of these scripts were purchased or obtained from open source. If a script came as Perl, we maintain it as Perl; if it came as PHP, we maintain it as PHP; if it came as bash, we maintain it as bash; we do not rewrite it in another language just because we think it is less efficient in the original language.