# Python Tutorial 13: Choosing Data Structures

## 1. Word Frequency Analysis

### Exercise 13-1

```python
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
```

## 2. Random Numbers

The `random` module provides functions that generate pseudorandom numbers (which I will simply call "random" from here on).

```python
import random

for i in range(10):
    x = random.random()
    print(x)
```

```python
>>> random.randint(5, 10)
5
>>> random.randint(5, 10)
9
```

```python
>>> t = [1, 2, 3]
>>> random.choice(t)
2
>>> random.choice(t)
3
```

The `random` module also provides functions to generate random values from continuous distributions, including Gaussian, exponential, gamma, and more.
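For example, a few of these generators (the parameters here are chosen arbitrarily for illustration):

```python
import random

random.seed(17)  # fixed seed so the example is repeatable

print(random.gauss(0, 1))         # Gaussian (normal), mean 0, std dev 1
print(random.expovariate(1.0))    # exponential with rate 1
print(random.gammavariate(2, 1))  # gamma with shape 2, scale 1
```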

### Exercise 13-5

```python
>>> t = ['a', 'a', 'b']
>>> hist = histogram(t)
>>> hist
{'a': 2, 'b': 1}
```
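The `histogram` function used here is the one from Chapter 11 of the book; a minimal version looks like this:

```python
def histogram(s):
    """Maps each element of s to the number of times it appears."""
    d = dict()
    for x in s:
        d[x] = d.get(x, 0) + 1
    return d

print(histogram(['a', 'a', 'b']))  # → {'a': 2, 'b': 1}
```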

```"""This module contains a code example related to
Think Python, 2nd Edition
by Allen Downey
http://thinkpython2.com
"""

from __future__ import print_function, division

import random
import string

"""Makes a histogram that contains the words from a file.
filename: string

returns: map from each word to the number of times it appears.
"""
hist = {}
fp = open(filename)

for line in fp:
if line.startswith('*** END OF THIS'):
break

process_line(line, hist)

return hist

"""Reads from fp until it finds the line that ends the header.
fp: open file object
"""
for line in fp:
if line.startswith('*** START OF THIS'):
break

def process_line(line, hist):
"""Adds the words in the line to the histogram.
Modifies hist.
line: string
hist: histogram (map from word to frequency)
"""
# TODO: rewrite using Counter

# replace hyphens with spaces before splitting
line = line.replace('-', ' ')
strippables = string.punctuation + string.whitespace

for word in line.split():
# remove punctuation and convert to lowercase
word = word.strip(strippables)
word = word.lower()

# update the histogram
hist[word] = hist.get(word, 0) + 1

def most_common(hist):
"""Makes a list of word-freq pairs in descending order of frequency.
hist: map from word to frequency
returns: list of (frequency, word) pairs
"""
t = []
for key, value in hist.items():
t.append((value, key))

t.sort()
t.reverse()
return t

def print_most_common(hist, num=10):
"""Prints the most commons words in a histgram and their frequencies.

hist: histogram (map from word to frequency)
num: number of words to print
"""
t = most_common(hist)
print('The most common words are:')
for freq, word in t[:num]:
print(word, '\t', freq)

def subtract(d1, d2):
"""Returns a dictionary with all keys that appear in d1 but not d2.
d1, d2: dictionaries
"""
# TODO: reimplement using Counter
res = {}
for key in d1:
if key not in d2:
res[key] = None
return res

def total_words(hist):
"""Returns the total of the frequencies in a histogram."""
return sum(hist.values())

def different_words(hist):
"""Returns the number of different words in a histogram."""
return len(hist)

def random_word(hist):
"""Chooses a random word from a histogram.
The probability of each word is proportional to its frequency.
"""
# TODO: rewrite using Counter
t = []
for word, freq in hist.items():
t.extend([word] * freq)

return random.choice(t)

def main():
print('Total number of words:', total_words(hist))
print('Number of different words:', different_words(hist))

t = most_common(hist)
print('The most common words are:')
for freq, word in t[0:20]:
print(word, '\t', freq)

diff = subtract(hist, words)
print("The words in the book that aren't in the word list are:")
for word in diff.keys():
print(word, end=' ')

print("\n\nHere are some random words from the book")
for i in range(100):
print(random_word(hist), end=' ')

if __name__ == '__main__':
main()```

## 3. Word Histogram

```python
import string

def process_file(filename):
    hist = dict()
    fp = open(filename)
    for line in fp:
        process_line(line, hist)
    return hist

def process_line(line, hist):
    line = line.replace('-', ' ')

    for word in line.split():
        word = word.strip(string.punctuation + string.whitespace)
        word = word.lower()
        hist[word] = hist.get(word, 0) + 1

hist = process_file('emma.txt')
```

`process_file` loops through the lines of the file, passing them one at a time to `process_line`. The histogram `hist` is used as an accumulator.

```python
def total_words(hist):
    return sum(hist.values())
```

```python
def different_words(hist):
    return len(hist)
```

```python
print('Total number of words:', total_words(hist))
print('Number of different words:', different_words(hist))
```

```
Total number of words: 161080
Number of different words: 7214
```

## 4. Most Common Words

```python
def most_common(hist):
    t = []
    for key, value in hist.items():
        t.append((value, key))

    t.sort(reverse=True)
    return t
```

```python
t = most_common(hist)
print('The most common words are:')
for freq, word in t[:10]:
    print(word, freq, sep='\t')
```

```
The most common words are:
to      5242
the     5205
and     4897
of      4295
i       3191
a       3130
it      2529
her     2483
was     2400
she     2364
```
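As an aside, the standard library's `collections.Counter` can do the same job in a couple of lines; the short sentence below is made up for illustration:

```python
from collections import Counter

words = 'the quick brown fox jumps over the lazy dog and the cat'.split()
counter = Counter(words)

# most_common returns (word, frequency) pairs, most frequent first
print(counter.most_common(1))  # → [('the', 3)]
```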

## 5. Optional Parameters

```python
def print_most_common(hist, num=10):
    t = most_common(hist)
    print('The most common words are:')
    for freq, word in t[:num]:
        print(word, freq, sep='\t')
```

`print_most_common(hist)`

`num` uses the default value. If you provide two arguments:

`print_most_common(hist, 20)`

`num` gets the value of the argument instead. In other words, the optional argument overrides the default value.
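The same mechanism in a self-contained toy (the `greet` function is hypothetical, not from the book):

```python
def greet(name, greeting='Hello'):
    # greeting keeps its default unless the caller overrides it
    return '%s, %s!' % (greeting, name)

print(greet('Emma'))        # → Hello, Emma!
print(greet('Emma', 'Hi'))  # → Hi, Emma!
```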

## 6. Dictionary Subtraction

`subtract` takes dictionaries `d1` and `d2` and returns a new dictionary that contains all the keys from `d1` that are not in `d2`. Since we don't really care about the values, we set them all to `None`.

```python
def subtract(d1, d2):
    res = dict()
    for key in d1:
        if key not in d2:
            res[key] = None
    return res
```

```python
words = process_file('words.txt')
diff = subtract(hist, words)

print("Words in the book that aren't in the word list:")
for word in diff.keys():
    print(word, end=' ')
```

```
Words in the book that aren't in the word list:
rencontre jane's blanche woodhouses disingenuousness
friend's venice apartment ...
```

### Exercise 13-6

Python provides a data structure called a set that supports many common set operations. You can read about them in Chapter 19, or read the documentation at http://docs.python.org/3/library/stdtypes.html#types-set .
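Because iterating a dictionary yields its keys, `set(d)` builds a set of a dictionary's keys, which makes dictionary subtraction a one-liner. A quick illustration with made-up dictionaries:

```python
d1 = {'a': 1, 'b': 2, 'c': 3}
d2 = {'b': 10, 'd': 4}

# set(d) contains only the keys; '-' is set difference
print(set(d1) - set(d2) == {'a', 'c'})  # → True
```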

```"""This module contains a code example related to
Think Python, 2nd Edition
by Allen Downey
http://thinkpython2.com
"""

from __future__ import print_function, division

from analyze_book1 import process_file

def subtract(d1, d2):
"""Returns a set of all keys that appear in d1 but not d2.
d1, d2: dictionaries
"""
return set(d1) - set(d2)

def main():

diff = subtract(hist, words)
print("The words in the book that aren't in the word list are:")
for word in diff:
print(word, end=' ')

if __name__ == '__main__':
main()```

## 7. Random Words

```python
def random_word(h):
    t = []
    for word, freq in h.items():
        t.extend([word] * freq)

    return random.choice(t)
```

1. Use `keys` to get a list of the words in the book.
2. Build a list with the cumulative sum of the word frequencies (see Exercise 10-2). The last item in this list is the total number of words in the book, n.
3. Choose a random number from 1 to n. Use a bisection search (see Exercise 10-10) to find the index where the random number would be inserted in the cumulative sum.
4. Use the index to find the corresponding word in the word list.

### Exercise 13-7

```"""This module contains a code example related to
Think Python, 2nd Edition
by Allen Downey
http://thinkpython2.com
"""

from __future__ import print_function, division

import random

from bisect import bisect

from analyze_book1 import process_file

def random_word(hist):
"""Chooses a random word from a histogram.
The probability of each word is proportional to its frequency.
hist: map from word to frequency
"""
# TODO: This could be made faster by computing the cumulative
# frequencies once and reusing them.

words = []
freqs = []
total_freq = 0

# make a list of words and a list of cumulative frequencies
for word, freq in hist.items():
total_freq += freq
words.append(word)
freqs.append(total_freq)

# choose a random value and find its location in the cumulative list
x = random.randint(0, total_freq-1)
index = bisect(freqs, x)
return words[index]

def main():

print("\n\nHere are some random words from the book")
for i in range(100):
print(random_word(hist), end=' ')

if __name__ == '__main__':
main()```
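As an aside, since Python 3.6 the standard library can draw a weighted sample directly: `random.choices` accepts per-item weights, so a histogram's frequencies can serve as weights without building the cumulative list by hand (the tiny histogram below is made up for illustration):

```python
import random

hist = {'the': 3, 'cat': 1, 'sat': 1}

# the probability of each word is proportional to its frequency
word = random.choices(list(hist), weights=list(hist.values()), k=1)[0]
print(word)
```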

## 8. Markov Analysis

`this the small regard harriet which knightley's it most things`

### Exercise 13-8

1. Write a program that reads a text from a file and performs Markov analysis. The result should be a dictionary that maps from prefixes to a collection of possible suffixes. The collection might be a list, tuple, or dictionary; it is up to you to make an appropriate choice. You can test your program with prefix length two, but you should write the program in a way that makes it easy to try other lengths.
2. Add a function to the previous program that generates random text based on the Markov analysis. Here is an example from *Emma* with prefix length 2: "He was very clever, be it sweetness or be angry, ashamed or only amused, at such a stroke. She had never thought of Hannah till you were never meant for me?" "I cannot make speeches, Emma:" he soon cut it all himself. For this example, I left the punctuation attached to the words. The result is almost syntactically correct, but not quite. Semantically, it almost makes sense, but not quite. What happens if you increase the prefix length? Does the random text make more sense?
3. Once your program is working, you might want to try a mash-up: if you combine text from two or more books, the random text you generate will blend the vocabulary and phrases from the sources in interesting ways.

```"""This module contains a code example related to
Think Python, 2nd Edition
by Allen Downey
http://thinkpython2.com
"""

from __future__ import print_function, division

import sys
import string
import random

# global variables
suffix_map = {}        # map from prefixes to a list of suffixes
prefix = ()            # current tuple of words

def process_file(filename, order=2):
"""Reads a file and performs Markov analysis.
filename: string
order: integer number of words in the prefix
returns: map from prefix to list of possible suffixes.
"""
fp = open(filename)

for line in fp:
if line.startswith('*** END OF THIS'):
break

for word in line.rstrip().split():
process_word(word, order)

"""Reads from fp until it finds the line that ends the header.
fp: open file object
"""
for line in fp:
if line.startswith('*** START OF THIS'):
break

def process_word(word, order=2):
"""Processes each word.
word: string
order: integer
During the first few iterations, all we do is store up the words;
after that we start adding entries to the dictionary.
"""
global prefix
if len(prefix) < order:
prefix += (word,)
return

try:
suffix_map[prefix].append(word)
except KeyError:
# if there is no entry for this prefix, make one
suffix_map[prefix] = [word]

prefix = shift(prefix, word)

def random_text(n=100):
"""Generates random wordsfrom the analyzed text.
Starts with a random prefix from the dictionary.
n: number of words to generate
"""
# choose a random prefix (not weighted by frequency)
start = random.choice(list(suffix_map.keys()))

for i in range(n):
suffixes = suffix_map.get(start, None)
if suffixes == None:
# if the start isn't in map, we got to the end of the
# original text, so we have to start again.
random_text(n-i)
return

# choose a random suffix
word = random.choice(suffixes)
print(word, end=' ')
start = shift(start, word)

def shift(t, word):
"""Forms a new tuple by removing the head and adding word to the tail.
t: tuple of strings
word: string
Returns: tuple of strings
"""
return t[1:] + (word,)

def main(script, filename='158-0.txt', n=100, order=2):
try:
n = int(n)
order = int(order)
except ValueError:
print('Usage: %d filename [# of words] [prefix length]' % script)
else:
process_file(filename, order)
random_text(n)
print()

if __name__ == '__main__':
main(*sys.argv)```

## 9. Data Structures

- How to represent the prefixes.
- How to represent the collection of possible suffixes.
- How to represent the mapping from each prefix to the collection of possible suffixes.

```python
def shift(prefix, word):
    return prefix[1:] + (word,)
```

`shift` takes a tuple of words, `prefix`, and a string, `word`, and forms a new tuple that has all the words in `prefix` except the first, with `word` added to the end.
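A quick check of `shift` at work:

```python
def shift(prefix, word):
    return prefix[1:] + (word,)

print(shift(('He', 'was'), 'very'))  # → ('was', 'very')
```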

## 12. Exercises

### Exercise 13-9

Zipf's law (http://en.wikipedia.org/wiki/Zipf’s_law ) describes the relationship between the ranks and frequencies of words in natural languages. Specifically, it predicts that the frequency f of the word with rank r is f = c · r^(−s), where s and c are parameters that depend on the language and the text.

```"""This module contains a code example related to
Think Python, 2nd Edition
by Allen Downey
http://thinkpython2.com
"""

from __future__ import print_function, division

import sys

import matplotlib.pyplot as plt

from analyze_book1 import process_file

def rank_freq(hist):
"""Returns a list of (rank, freq) tuples.
hist: map from word to frequency
returns: list of (rank, freq) tuples
"""
# sort the list of frequencies in decreasing order
freqs = list(hist.values())
freqs.sort(reverse=True)

# enumerate the ranks and frequencies
rf = [(r+1, f) for r, f in enumerate(freqs)]
return rf

def print_ranks(hist):
"""Prints the rank vs. frequency data.
hist: map from word to frequency
"""
for r, f in rank_freq(hist):
print(r, f)

def plot_ranks(hist, scale='log'):
"""Plots frequency vs. rank.
hist: map from word to frequency
scale: string 'linear' or 'log'
"""
t = rank_freq(hist)
rs, fs = zip(*t)

plt.clf()
plt.xscale(scale)
plt.yscale(scale)
plt.title('Zipf plot')
plt.xlabel('rank')
plt.ylabel('frequency')
plt.plot(rs, fs, 'r-', linewidth=3)
plt.show()

def main(script, filename='emma.txt', flag='plot'):

# either print the results or plot them
if flag == 'print':
print_ranks(hist)
elif flag == 'plot':
plot_ranks(hist)
else:
print('Usage: zipf.py filename [print|plot]')

if __name__ == '__main__':
main(*sys.argv)```

1. Translation: @iphyer
2. Proofreading: @bingjin
3. Reference: @carfly
