# Python Tutorial 13: Choosing Data Structures

## 1. Word Frequency Analysis

### Exercise 13-1

```python
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
```

## 2. Random Numbers

The `random` module provides functions that generate pseudorandom numbers (which I will simply call "random" from here on).

```python
import random

for i in range(10):
    x = random.random()
    print(x)
```

```python
>>> random.randint(5, 10)
5
>>> random.randint(5, 10)
9
```

```python
>>> t = [1, 2, 3]
>>> random.choice(t)
2
>>> random.choice(t)
3
```

The `random` module also provides functions to generate random values from continuous distributions, including Gaussian, exponential, gamma, and more.
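For example, a few of these generators (the parameters here are chosen arbitrarily for illustration):

```python
import random

random.seed(17)  # fixed seed so the example is repeatable

print(random.gauss(0, 1))         # Gaussian (normal), mean 0, std dev 1
print(random.expovariate(1.0))    # exponential with rate 1
print(random.gammavariate(2, 1))  # gamma with shape 2, scale 1
```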

### Exercise 13-5

```python
>>> t = ['a', 'a', 'b']
>>> hist = histogram(t)
>>> hist
{'a': 2, 'b': 1}
```
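The `histogram` function used here is the one from Chapter 11 of the book; a minimal version looks like this:

```python
def histogram(s):
    """Maps each element of s to the number of times it appears."""
    d = dict()
    for x in s:
        d[x] = d.get(x, 0) + 1
    return d

print(histogram(['a', 'a', 'b']))  # → {'a': 2, 'b': 1}
```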

```"""This module contains a code example related to
Think Python, 2nd Edition
by Allen Downey
http://thinkpython2.com
"""

from __future__ import print_function, division

import random
import string

"""Makes a histogram that contains the words from a file.
filename: string

returns: map from each word to the number of times it appears.
"""
hist = {}
fp = open(filename)

for line in fp:
if line.startswith('*** END OF THIS'):
break

process_line(line, hist)

return hist

"""Reads from fp until it finds the line that ends the header.
fp: open file object
"""
for line in fp:
if line.startswith('*** START OF THIS'):
break

def process_line(line, hist):
"""Adds the words in the line to the histogram.
Modifies hist.
line: string
hist: histogram (map from word to frequency)
"""
# TODO: rewrite using Counter

# replace hyphens with spaces before splitting
line = line.replace('-', ' ')
strippables = string.punctuation + string.whitespace

for word in line.split():
# remove punctuation and convert to lowercase
word = word.strip(strippables)
word = word.lower()

# update the histogram
hist[word] = hist.get(word, 0) + 1

def most_common(hist):
"""Makes a list of word-freq pairs in descending order of frequency.
hist: map from word to frequency
returns: list of (frequency, word) pairs
"""
t = []
for key, value in hist.items():
t.append((value, key))

t.sort()
t.reverse()
return t

def print_most_common(hist, num=10):
"""Prints the most commons words in a histgram and their frequencies.

hist: histogram (map from word to frequency)
num: number of words to print
"""
t = most_common(hist)
print('The most common words are:')
for freq, word in t[:num]:
print(word, '\t', freq)

def subtract(d1, d2):
"""Returns a dictionary with all keys that appear in d1 but not d2.
d1, d2: dictionaries
"""
# TODO: reimplement using Counter
res = {}
for key in d1:
if key not in d2:
res[key] = None
return res

def total_words(hist):
"""Returns the total of the frequencies in a histogram."""
return sum(hist.values())

def different_words(hist):
"""Returns the number of different words in a histogram."""
return len(hist)

def random_word(hist):
"""Chooses a random word from a histogram.
The probability of each word is proportional to its frequency.
"""
# TODO: rewrite using Counter
t = []
for word, freq in hist.items():
t.extend([word] * freq)

return random.choice(t)

def main():
print('Total number of words:', total_words(hist))
print('Number of different words:', different_words(hist))

t = most_common(hist)
print('The most common words are:')
for freq, word in t[0:20]:
print(word, '\t', freq)

diff = subtract(hist, words)
print("The words in the book that aren't in the word list are:")
for word in diff.keys():
print(word, end=' ')

print("\n\nHere are some random words from the book")
for i in range(100):
print(random_word(hist), end=' ')

if __name__ == '__main__':
main()```

## 3. Word Histogram

```python
import string

def process_file(filename):
    hist = dict()
    fp = open(filename)
    for line in fp:
        process_line(line, hist)
    return hist

def process_line(line, hist):
    line = line.replace('-', ' ')

    for word in line.split():
        word = word.strip(string.punctuation + string.whitespace)
        word = word.lower()
        hist[word] = hist.get(word, 0) + 1

hist = process_file('emma.txt')
```

`process_file` loops through the lines of the file, passing them one at a time to `process_line`. The histogram `hist` is used as an accumulator.

```python
def total_words(hist):
    return sum(hist.values())
```

```python
def different_words(hist):
    return len(hist)
```

```python
print('Total number of words:', total_words(hist))
print('Number of different words:', different_words(hist))
```

```
Total number of words: 161080
Number of different words: 7214
```

## 4. Most Common Words

```python
def most_common(hist):
    t = []
    for key, value in hist.items():
        t.append((value, key))

    t.sort(reverse=True)
    return t
```

```python
t = most_common(hist)
print('The most common words are:')
for freq, word in t[:10]:
    print(word, freq, sep='\t')
```

```
The most common words are:
to      5242
the     5205
and     4897
of      4295
i       3191
a       3130
it      2529
her     2483
was     2400
she     2364
```
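As an aside, the standard library's `collections.Counter` can do the same job in a couple of lines; the short sentence below is made up for illustration:

```python
from collections import Counter

words = 'the quick brown fox jumps over the lazy dog and the cat'.split()
counter = Counter(words)

# most_common returns (word, frequency) pairs, most frequent first
print(counter.most_common(1))  # → [('the', 3)]
```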

## 5. Optional Parameters

```python
def print_most_common(hist, num=10):
    t = most_common(hist)
    print('The most common words are:')
    for freq, word in t[:num]:
        print(word, freq, sep='\t')
```

`print_most_common(hist)`

`num` uses the default value. If you provide two arguments:

`print_most_common(hist, 20)`

`num` gets the value of the argument instead. In other words, the optional argument overrides the default value.
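The same mechanism in a self-contained toy (the `greet` function is hypothetical, not from the book):

```python
def greet(name, greeting='Hello'):
    # greeting keeps its default unless the caller overrides it
    return '%s, %s!' % (greeting, name)

print(greet('Emma'))        # → Hello, Emma!
print(greet('Emma', 'Hi'))  # → Hi, Emma!
```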

## 6. Dictionary Subtraction

`subtract` takes dictionaries `d1` and `d2` and returns a new dictionary that contains all the keys from `d1` that are not in `d2`. Since we don't really care about the values, we set them all to `None`.

```python
def subtract(d1, d2):
    res = dict()
    for key in d1:
        if key not in d2:
            res[key] = None
    return res
```

```python
words = process_file('words.txt')
diff = subtract(hist, words)

print("Words in the book that aren't in the word list:")
for word in diff.keys():
    print(word, end=' ')
```

```
Words in the book that aren't in the word list:
rencontre jane's blanche woodhouses disingenuousness
friend's venice apartment ...
```

### Exercise 13-6

Python provides a data structure called a set that supports many common set operations. You can read about them in Chapter 19, or read the documentation at http://docs.python.org/3/library/stdtypes.html#types-set .
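Because iterating a dictionary yields its keys, `set(d)` builds a set of a dictionary's keys, which makes dictionary subtraction a one-liner. A quick illustration with made-up dictionaries:

```python
d1 = {'a': 1, 'b': 2, 'c': 3}
d2 = {'b': 10, 'd': 4}

# set(d) contains only the keys; '-' is set difference
print(set(d1) - set(d2) == {'a', 'c'})  # → True
```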

```"""This module contains a code example related to
Think Python, 2nd Edition
by Allen Downey
http://thinkpython2.com
"""

from __future__ import print_function, division

from analyze_book1 import process_file

def subtract(d1, d2):
"""Returns a set of all keys that appear in d1 but not d2.
d1, d2: dictionaries
"""
return set(d1) - set(d2)

def main():

diff = subtract(hist, words)
print("The words in the book that aren't in the word list are:")
for word in diff:
print(word, end=' ')

if __name__ == '__main__':
main()```

## 7. Random Words

```python
def random_word(h):
    t = []
    for word, freq in h.items():
        t.extend([word] * freq)

    return random.choice(t)
```

1. Use `keys` to get a list of the words in the book.
2. Build a list with the cumulative sum of the word frequencies (see Exercise 10-2). The last item in this list is the total number of words in the book, n.
3. Choose a random number from 1 to n. Use a bisection search (see Exercise 10-10) to find the index where the random number would be inserted in the cumulative sum.
4. Use the index to find the corresponding word in the word list.

### Exercise 13-7

```"""This module contains a code example related to
Think Python, 2nd Edition
by Allen Downey
http://thinkpython2.com
"""

from __future__ import print_function, division

import random

from bisect import bisect

from analyze_book1 import process_file

def random_word(hist):
"""Chooses a random word from a histogram.
The probability of each word is proportional to its frequency.
hist: map from word to frequency
"""
# TODO: This could be made faster by computing the cumulative
# frequencies once and reusing them.

words = []
freqs = []
total_freq = 0

# make a list of words and a list of cumulative frequencies
for word, freq in hist.items():
total_freq += freq
words.append(word)
freqs.append(total_freq)

# choose a random value and find its location in the cumulative list
x = random.randint(0, total_freq-1)
index = bisect(freqs, x)
return words[index]

def main():

print("\n\nHere are some random words from the book")
for i in range(100):
print(random_word(hist), end=' ')

if __name__ == '__main__':
main()```
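As an aside, since Python 3.6 the standard library can draw a weighted sample directly: `random.choices` accepts per-item weights, so a histogram's frequencies can serve as weights without building the cumulative list by hand (the tiny histogram below is made up for illustration):

```python
import random

hist = {'the': 3, 'cat': 1, 'sat': 1}

# the probability of each word is proportional to its frequency
word = random.choices(list(hist), weights=list(hist.values()), k=1)[0]
print(word)
```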

## 8. Markov Analysis

`this the small regard harriet which knightley's it most things`

### Exercise 13-8

1. Write a program that reads a text from a file and performs Markov analysis. The result should be a dictionary that maps from prefixes to a collection of possible suffixes. The collection might be a list, tuple, or dictionary; it is up to you to make an appropriate choice. You can test your program with prefix length two, but you should write the program in a way that makes it easy to try other lengths.
2. Add a function to the previous program that generates random text based on the Markov analysis. Here is an example from *Emma* with prefix length 2: "He was very clever, be it sweetness or be angry, ashamed or only amused, at such a stroke. She had never thought of Hannah till you were never meant for me?" "I cannot make speeches, Emma:" he soon cut it all himself. For this example, I left the punctuation attached to the words. The result is almost syntactically correct, but not quite. Semantically, it almost makes sense, but not quite. What happens if you increase the prefix length? Does the random text make more sense?
3. Once your program is working, you might want to try a mash-up: if you combine text from two or more books, the random text you generate will blend the vocabulary and phrases from the sources in interesting ways.

```"""This module contains a code example related to
Think Python, 2nd Edition
by Allen Downey
http://thinkpython2.com
"""

from __future__ import print_function, division

import sys
import string
import random

# global variables
suffix_map = {}        # map from prefixes to a list of suffixes
prefix = ()            # current tuple of words

def process_file(filename, order=2):
"""Reads a file and performs Markov analysis.
filename: string
order: integer number of words in the prefix
returns: map from prefix to list of possible suffixes.
"""
fp = open(filename)

for line in fp:
if line.startswith('*** END OF THIS'):
break

for word in line.rstrip().split():
process_word(word, order)

"""Reads from fp until it finds the line that ends the header.
fp: open file object
"""
for line in fp:
if line.startswith('*** START OF THIS'):
break

def process_word(word, order=2):
"""Processes each word.
word: string
order: integer
During the first few iterations, all we do is store up the words;
after that we start adding entries to the dictionary.
"""
global prefix
if len(prefix) < order:
prefix += (word,)
return

try:
suffix_map[prefix].append(word)
except KeyError:
# if there is no entry for this prefix, make one
suffix_map[prefix] = [word]

prefix = shift(prefix, word)

def random_text(n=100):
"""Generates random wordsfrom the analyzed text.
Starts with a random prefix from the dictionary.
n: number of words to generate
"""
# choose a random prefix (not weighted by frequency)
start = random.choice(list(suffix_map.keys()))

for i in range(n):
suffixes = suffix_map.get(start, None)
if suffixes == None:
# if the start isn't in map, we got to the end of the
# original text, so we have to start again.
random_text(n-i)
return

# choose a random suffix
word = random.choice(suffixes)
print(word, end=' ')
start = shift(start, word)

def shift(t, word):
"""Forms a new tuple by removing the head and adding word to the tail.
t: tuple of strings
word: string
Returns: tuple of strings
"""
return t[1:] + (word,)

def main(script, filename='158-0.txt', n=100, order=2):
try:
n = int(n)
order = int(order)
except ValueError:
print('Usage: %d filename [# of words] [prefix length]' % script)
else:
process_file(filename, order)
random_text(n)
print()

if __name__ == '__main__':
main(*sys.argv)```

## 9. Data Structures

- How to represent the prefixes.
- How to represent the collection of possible suffixes.
- How to represent the mapping from each prefix to the collection of possible suffixes.

```python
def shift(prefix, word):
    return prefix[1:] + (word,)
```

`shift` takes a tuple of words, `prefix`, and a string, `word`, and forms a new tuple that has all the words in `prefix` except the first, with `word` added to the end.
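A quick check of `shift` at work:

```python
def shift(prefix, word):
    return prefix[1:] + (word,)

print(shift(('He', 'was'), 'very'))  # → ('was', 'very')
```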

## 12. Exercises

### Exercise 13-9

Zipf's law (http://en.wikipedia.org/wiki/Zipf’s_law ) describes the relationship between the ranks and frequencies of words in natural languages. Specifically, it predicts that the frequency f of the word with rank r is f = c · r^(−s), where s and c are parameters that depend on the language and the text.

```"""This module contains a code example related to
Think Python, 2nd Edition
by Allen Downey
http://thinkpython2.com
"""

from __future__ import print_function, division

import sys

import matplotlib.pyplot as plt

from analyze_book1 import process_file

def rank_freq(hist):
"""Returns a list of (rank, freq) tuples.
hist: map from word to frequency
returns: list of (rank, freq) tuples
"""
# sort the list of frequencies in decreasing order
freqs = list(hist.values())
freqs.sort(reverse=True)

# enumerate the ranks and frequencies
rf = [(r+1, f) for r, f in enumerate(freqs)]
return rf

def print_ranks(hist):
"""Prints the rank vs. frequency data.
hist: map from word to frequency
"""
for r, f in rank_freq(hist):
print(r, f)

def plot_ranks(hist, scale='log'):
"""Plots frequency vs. rank.
hist: map from word to frequency
scale: string 'linear' or 'log'
"""
t = rank_freq(hist)
rs, fs = zip(*t)

plt.clf()
plt.xscale(scale)
plt.yscale(scale)
plt.title('Zipf plot')
plt.xlabel('rank')
plt.ylabel('frequency')
plt.plot(rs, fs, 'r-', linewidth=3)
plt.show()

def main(script, filename='emma.txt', flag='plot'):

# either print the results or plot them
if flag == 'print':
print_ranks(hist)
elif flag == 'plot':
plot_ranks(hist)
else:
print('Usage: zipf.py filename [print|plot]')

if __name__ == '__main__':
main(*sys.argv)```

1. Translation: @iphyer
2. Proofreading: @bingjin
3. Reference: @carfly
