如何将没有空格的文本分割成单词列表?

问题:如何将没有空格的文本分割成单词列表?

输入: "tableapplechairtablecupboard..."很多单词

将此类文本拆分为单词列表并获得以下内容的有效算法是什么?

输出: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]

首先想到的是遍历所有可能的单词(从第一个字母开始)并找到可能的最长单词,然后从 position=word_position+len(word)

PS:
我们列出了所有可能的单词。
单词“ cupboard”可以是“ cup”和“ board”,选择时间最长。
语言:python,但主要是算法本身。

Input: "tableapplechairtablecupboard..." many words

What would be an efficient algorithm to split such text to the list of words and get:

Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]

First thing that cames to mind is to go through all possible words (starting with first letter) and find the longest word possible, continue from position=word_position+len(word)

P.S.
We have a list of all possible words.
Word “cupboard” can be “cup” and “board”, select longest.
Language: python, but main thing is the algorithm itself.


回答 0

如果将朴素的算法应用于现实世界的数据,则效果不佳。这是一条20行的算法,该算法利用相对词频来为实词文本提供准确的结果。

(如果您想回答不使用词频的原始问题的答案,则需要完善“最长词”的确切含义:最好是使用20个字母的单词和10个3字母的单词,还是最好有五个10个字母的单词?确定精确的定义后,只需更改定义的行wordcost即可反映出预期的含义。)

这个主意

最好的进行方法是对输出的分布进行建模。一个很好的第一近似是假设所有单词都是独立分布的。然后,您只需要知道所有单词的相对频率即可。可以合理地假设它们遵循Zipf定律,也就是说,单词列表中排名为n的单词的概率约为1 /(n log N),其中N是词典中的单词数。

固定模型后,可以使用动态编程来推断空间的位置。最可能的句子是最大化每个单词的概率乘积的句子,并且可以通过动态编程轻松地计算出该句子。代替直接使用概率,我们使用定义为概率倒数的对数的代价来避免溢出。

代码

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

您可以使用

s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))

结果

我使用的是从Wikipedia的一小部分中拼凑而成的这个快速且肮脏的125k字字典

之前: thumbgreenappleactiveassignmentweekly隐喻。
之后:拇指绿苹果积极地分配每周的隐喻。

之前:有一种从html解析出的软注释信息,但是却重新获得了有限的字符内阁,例如,thumbgreenappleactiveassignmentlymetapho rapparentlytherearthumbgreenappleetc 字符串中应该用词条查询该词是否合理,这是最快的提取方法。

之后:从html解析出大量的人民评论文本信息,但是其中没有定界字符,例如,拇指绿苹果活跃作业每周隐喻显然在字符串中有拇指绿苹果等,我也有一个大词典查询单词是否合理,那么提取最快的方法是什么。

之前:夜晚是阴暗的暴风雨,除了偶然的间隔外,当被狂风吹过一阵子的时候,暴风雨席卷了整个伦敦的大街小巷,这在我们的幕后掠过,并迅速燃烧了灯火,直到黑暗中挣扎了。

之后:那是一个黑暗而暴风雨的夜晚,雨水倾盆而下,除了偶尔被狂风吹拂而扫过街道的间隔,因为在伦敦,我们的场景在房顶上嘎嘎作响,剧烈地搅动着房屋。挣扎着黑暗的灯火。

如您所见,它本质上是完美的。最重要的部分是确保将单词列表训练成与您实际遇到的相似的语料库,否则结果将很糟糕。


优化

该实现消耗线性时间和内存,因此它是相当有效的。如果需要进一步的提速,则可以从单词列表中构建后缀树,以减少候选集的大小。

如果您需要处理非常大的连续字符串,则合理的做法是拆分字符串以避免过多的内存使用。例如,您可以按10000个字符的块处理文本,并在每侧加1000个字符的边距,以避免边界效应。这将使内存使用降至最低,并且几乎可以肯定不会对质量产生任何影响。

A naive algorithm won’t give good results when applied to real-world data. Here is a 20-line algorithm that exploits relative word frequency to give accurate results for real-word text.

(If you want an answer to your original question which does not use word frequency, you need to refine what exactly is meant by “longest word”: is it better to have a 20-letter word and ten 3-letter words, or is it better to have five 10-letter words? Once you settle on a precise definition, you just have to change the line defining wordcost to reflect the intended meaning.)

The idea

The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf’s law, that is the word with rank n in the list of words has probability roughly 1/(n log N) where N is the number of words in the dictionary.

Once you have fixed the model, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probability of each individual word, and it’s easy to compute it with dynamic programming. Instead of directly using the probability we use a cost defined as the logarithm of the inverse of the probability to avoid overflows.

The code

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

which you can use with

s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))

The results

I am using this quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.

Before: thumbgreenappleactiveassignmentweeklymetaphor.
After: thumb green apple active assignment weekly metaphor.

Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearen odelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetapho rapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquery whetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.

After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so what s the fastest way of extraction thx a lot.

Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness.

After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.

As you can see it is essentially flawless. The most important part is to make sure your word list was trained to a corpus similar to what you will actually encounter, otherwise the results will be very bad.


Optimization

The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the set of candidates.

If you need to process a very large consecutive string it would be reasonable to split the string to avoid excessive memory usage. For example you could process the text in blocks of 10000 characters plus a margin of 1000 characters on either side to avoid boundary effects. This will keep memory usage to a minimum and will have almost certainly no effect on the quality.


回答 1

基于最佳答案中的出色工作,我创建了一个pip易于使用的软件包。

>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']

要安装,请运行pip install wordninja

唯一的区别是微小的。这将返回a list而不是a str,它适用于python3,它包括单词列表并正确拆分,即使存在非字母字符(如下划线,破折号等)也是如此。

再次感谢通用人类!

https://github.com/keredson/wordninja

Based on the excellent work in the top answer, I’ve created a pip package for easy use.

>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']

To install, run pip install wordninja.

The only differences are minor. This returns a list rather than a str, it works in python3, it includes the word list and properly splits even if there are non-alpha chars (like underscores, dashes, etc).

Thanks again to Generic Human!

https://github.com/keredson/wordninja


回答 2

这是使用递归搜索的解决方案:

def find_words(instring, prefix = '', words = None):
    if not instring:
        return []
    if words is None:
        words = set()
        with open('/usr/share/dict/words') as f:
            for line in f:
                words.add(line.strip())
    if (not prefix) and (instring in words):
        return [instring]
    prefix, suffix = prefix + instring[0], instring[1:]
    solutions = []
    # Case 1: prefix in solution
    if prefix in words:
        try:
            solutions.append([prefix] + find_words(suffix, '', words))
        except ValueError:
            pass
    # Case 2: prefix not in solution
    try:
        solutions.append(find_words(suffix, prefix, words))
    except ValueError:
        pass
    if solutions:
        return sorted(solutions,
                      key = lambda solution: [len(word) for word in solution],
                      reverse = True)[0]
    else:
        raise ValueError('no solution')

print(find_words('tableapplechairtablecupboard'))
print(find_words('tableprechaun', words = set(['tab', 'table', 'leprechaun'])))

Yield

['table', 'apple', 'chair', 'table', 'cupboard']
['tab', 'leprechaun']

Here is solution using recursive search:

def find_words(instring, prefix = '', words = None):
    if not instring:
        return []
    if words is None:
        words = set()
        with open('/usr/share/dict/words') as f:
            for line in f:
                words.add(line.strip())
    if (not prefix) and (instring in words):
        return [instring]
    prefix, suffix = prefix + instring[0], instring[1:]
    solutions = []
    # Case 1: prefix in solution
    if prefix in words:
        try:
            solutions.append([prefix] + find_words(suffix, '', words))
        except ValueError:
            pass
    # Case 2: prefix not in solution
    try:
        solutions.append(find_words(suffix, prefix, words))
    except ValueError:
        pass
    if solutions:
        return sorted(solutions,
                      key = lambda solution: [len(word) for word in solution],
                      reverse = True)[0]
    else:
        raise ValueError('no solution')

print(find_words('tableapplechairtablecupboard'))
print(find_words('tableprechaun', words = set(['tab', 'table', 'leprechaun'])))

yields

['table', 'apple', 'chair', 'table', 'cupboard']
['tab', 'leprechaun']

回答 3

使用一个trie 数据结构(该结构包含可能的单词的列表),执行以下操作不会太复杂:

  1. 前进指针(在串联字符串中)
  2. 查找并将相应的节点存储在trie中
  3. 如果trie节点有子节点(例如,单词较长),请转到1。
  4. 如果到达的节点没有孩子,则会出现最长的单词匹配;将单词(存储在节点中或在遍历遍历过程中只是串联在一起)添加到结果列表中,将指针重置为trie(或重置引用),然后重新开始

Using a trie data structure, which holds the list of possible words, it would not be too complicated to do the following:

  1. Advance pointer (in the concatenated string)
  2. Lookup and store the corresponding node in the trie
  3. If the trie node has children (e.g. there are longer words), go to 1.
  4. If the node reached has no children, a longest word match happened; add the word (stored in the node or just concatenated during trie traversal) to the result list, reset the pointer in the trie (or reset the reference), and start over

回答 4

Unutbu的解决方案非常接近,但是我发现代码难以阅读,并且没有产生预期的结果。通用人类解决方案的缺点是需要单词频率。不适用于所有用例。

这是一个使用分而治之算法的简单解决方案。

  1. 它试图使 Eg find_words('cupboard')返回['cupboard']而不是['cup', 'board'](假设cupboardcup并且board在字典中)的单词数量最少。
  2. 最佳解决方案不是唯一的,下面的实现返回一个解决方案。find_words('charactersin')可能会返回,['characters', 'in']或者可能会返回['character', 'sin'](如下所示)。您可以很容易地修改算法以返回所有最优解。
  3. 在此实现中,将记住解决方案,以使其在合理的时间内运行。

代码:

words = set()
with open('/usr/share/dict/words') as f:
    for line in f:
        words.add(line.strip())

solutions = {}
def find_words(instring):
    # First check if instring is in the dictionnary
    if instring in words:
        return [instring]
    # No... But maybe it's a result we already computed
    if instring in solutions:
        return solutions[instring]
    # Nope. Try to split the string at all position to recursively search for results
    best_solution = None
    for i in range(1, len(instring) - 1):
        part1 = find_words(instring[:i])
        part2 = find_words(instring[i:])
        # Both parts MUST have a solution
        if part1 is None or part2 is None:
            continue
        solution = part1 + part2
        # Is the solution found "better" than the previous one?
        if best_solution is None or len(solution) < len(best_solution):
            best_solution = solution
    # Remember (memoize) this solution to avoid having to recompute it
    solutions[instring] = best_solution
    return best_solution

在我的3GHz机器上,这大约需要5秒:

result = find_words("thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearenodelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetaphorapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquerywhetherthewordisreasonablesowhatsthefastestwayofextractionthxalot")
assert(result is not None)
print ' '.join(result)

从html解析的人民评论文本信息的全部内容,但没有定界字符,例如,拇指绿苹果活动任务每周隐喻显然在字符串中有拇指绿苹果等,我也有一个大型词典来查询是否这个词很合理,所以提取很多最快的方法是什么

Unutbu’s solution was quite close but I find the code difficult to read, and it didn’t yield the expected result. Generic Human’s solution has the drawback that it needs word frequencies. Not appropriate for all use case.

Here’s a simple solution using a Divide and Conquer algorithm.

  1. It tries to minimize the number of words E.g. find_words('cupboard') will return ['cupboard'] rather than ['cup', 'board'] (assuming that cupboard, cup and board are in the dictionnary)
  2. The optimal solution is not unique, the implementation below returns a solution. find_words('charactersin') could return ['characters', 'in'] or maybe it will return ['character', 'sin'] (as seen below). You could quite easily modify the algorithm to return all optimal solutions.
  3. In this implementation solutions are memoized so that it runs in a reasonable time.

The code:

words = set()
with open('/usr/share/dict/words') as f:
    for line in f:
        words.add(line.strip())

solutions = {}
def find_words(instring):
    # First check if instring is in the dictionnary
    if instring in words:
        return [instring]
    # No... But maybe it's a result we already computed
    if instring in solutions:
        return solutions[instring]
    # Nope. Try to split the string at all position to recursively search for results
    best_solution = None
    for i in range(1, len(instring) - 1):
        part1 = find_words(instring[:i])
        part2 = find_words(instring[i:])
        # Both parts MUST have a solution
        if part1 is None or part2 is None:
            continue
        solution = part1 + part2
        # Is the solution found "better" than the previous one?
        if best_solution is None or len(solution) < len(best_solution):
            best_solution = solution
    # Remember (memoize) this solution to avoid having to recompute it
    solutions[instring] = best_solution
    return best_solution

This will take about about 5sec on my 3GHz machine:

result = find_words("thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearenodelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetaphorapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquerywhetherthewordisreasonablesowhatsthefastestwayofextractionthxalot")
assert(result is not None)
print ' '.join(result)

the reis masses of text information of peoples comments which is parsed from h t m l but there are no delimited character sin them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple e t c in the string i also have a large dictionary to query whether the word is reasonable so whats the fastest way of extraction t h x a lot


回答 5

https://stackoverflow.com/users/1515832/generic-human的答案很棒。但是我见过的最好的实现是彼得·诺维格本人在他的《美丽的数据》一书中写的。

在粘贴他的代码之前,让我先解释一下为什么Norvig的方法更准确(尽管在代码方面会更慢一些,也更长一些)。

1)数据要好一些-就大小和精度而言(他使用字数而不是简单的排名)2)更重要的是,n-gram背后的逻辑确实使方法如此精确。

他在书中提供的示例是拆分字符串“ sitdown”的问题。现在,一个非二字组合的字符串拆分方法将考虑p(’sit’)* p(’down’),如果小于p(’sitdown’)-这种情况经常发生-它将不会拆分它,但我们希望(大多数时候)。

但是,当您具有bigram模型时,可以将p(’sit down’)视为bigram vs p(’sitdown’),并且前者获胜。基本上,如果您不使用双字母组,它将把您拆分的单词的概率视为独立,而不是这种情况,某些单词更有可能一个接一个地出现。不幸的是,这些词在很多情况下经常被粘在一起,使拆分器感到困惑。

这是数据的链接(它是3个独立问题的数据,而细分仅仅是一个。请阅读本章以获取详细信息):http : //norvig.com/ngrams/

这是代码的链接:http : //norvig.com/ngrams/ngrams.py

这些链接已经建立了一段时间,但是无论如何我都会在此处复制粘贴代码的分段部分

import re, string, random, glob, operator, heapq
from collections import defaultdict
from math import log10

def memo(f):
    "Memoize function f."
    table = {}
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)
        return table[args]
    fmemo.memo = table
    return fmemo

def test(verbose=None):
    """Run some tests, taken from the chapter.
    Since the hillclimbing algorithm is randomized, some tests may fail."""
    import doctest
    print 'Running tests...'
    doctest.testfile('ngrams-test.txt', verbose=verbose)

################ Word Segmentation (p. 223)

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:]) 
            for i in range(min(len(text), L))]

def Pwords(words): 
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

#### Support functions (p. 224)

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key,count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.itervalues()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key): 
        if key in self: return self[key]/self.N  
        else: return self.missingfn(key, self.N)

def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in file(name):
        yield line.split(sep)

def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))

N = 1024908267229 ## Number of tokens

Pw  = Pdist(datafile('count_1w.txt'), N, avoid_long_words)

#### segment2: second version, with bigram counts, (p. 226-227)

def cPw(word, prev):
    "Conditional probability of word, given previous word."
    try:
        return P2w[prev + ' ' + word]/float(Pw[prev])
    except KeyError:
        return Pw(word)

P2w = Pdist(datafile('count_2w.txt'), N)

@memo 
def segment2(text, prev='<S>'): 
    "Return (log P(words), words), where words is the best segmentation." 
    if not text: return 0.0, [] 
    candidates = [combine(log10(cPw(first, prev)), first, segment2(rem, first)) 
                  for first,rem in splits(text)] 
    return max(candidates) 

def combine(Pfirst, first, (Prem, rem)): 
    "Combine first and rem results into one (probability, words) pair." 
    return Pfirst+Prem, [first]+rem 

The answer by https://stackoverflow.com/users/1515832/generic-human is great. But the best implementation of this I’ve ever seen was written Peter Norvig himself in his book ‘Beautiful Data’.

Before I paste his code, let me expand on why Norvig’s method is more accurate (although a little slower and longer in terms of code).

1) The data is a bit better – both in terms of size and in terms of precision (he uses a word count rather than a simple ranking) 2) More importantly, it’s the logic behind n-grams that really makes the approach so accurate.

The example he provides in his book is the problem of splitting a string ‘sitdown’. Now a non-bigram method of string split would consider p(‘sit’) * p (‘down’), and if this less than the p(‘sitdown’) – which will be the case quite often – it will NOT split it, but we’d want it to (most of the time).

However when you have the bigram model you could value p(‘sit down’) as a bigram vs p(‘sitdown’) and the former wins. Basically, if you don’t use bigrams, it treats the probability of the words you’re splitting as independent, which is not the case, some words are more likely to appear one after the other. Unfortunately those are also the words that are often stuck together in a lot of instances and confuses the splitter.

Here’s the link to the data (it’s data for 3 separate problems and segmentation is only one. Please read the chapter for details): http://norvig.com/ngrams/

and here’s the link to the code: http://norvig.com/ngrams/ngrams.py

These links have been up a while, but I’ll copy paste the segmentation part of the code here anyway

import re, string, random, glob, operator, heapq
from collections import defaultdict
from math import log10

def memo(f):
    "Memoize function f."
    table = {}
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)
        return table[args]
    fmemo.memo = table
    return fmemo

def test(verbose=None):
    """Run some tests, taken from the chapter.
    Since the hillclimbing algorithm is randomized, some tests may fail."""
    import doctest
    print 'Running tests...'
    doctest.testfile('ngrams-test.txt', verbose=verbose)

################ Word Segmentation (p. 223)

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:]) 
            for i in range(min(len(text), L))]

def Pwords(words): 
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

#### Support functions (p. 224)

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key,count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.itervalues()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key): 
        if key in self: return self[key]/self.N  
        else: return self.missingfn(key, self.N)

def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in file(name):
        yield line.split(sep)

def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))

N = 1024908267229 ## Number of tokens

Pw  = Pdist(datafile('count_1w.txt'), N, avoid_long_words)

#### segment2: second version, with bigram counts, (p. 226-227)

def cPw(word, prev):
    "Conditional probability of word, given previous word."
    try:
        return P2w[prev + ' ' + word]/float(Pw[prev])
    except KeyError:
        return Pw(word)

P2w = Pdist(datafile('count_2w.txt'), N)

@memo 
def segment2(text, prev='<S>'): 
    "Return (log P(words), words), where words is the best segmentation." 
    if not text: return 0.0, [] 
    candidates = [combine(log10(cPw(first, prev)), first, segment2(rem, first)) 
                  for first,rem in splits(text)] 
    return max(candidates) 

def combine(Pfirst, first, (Prem, rem)): 
    "Combine first and rem results into one (probability, words) pair." 
    return Pfirst+Prem, [first]+rem 

回答 6

这是转换为JavaScript的可接受答案(需要node.js,以及来自https://github.com/keredson/wordninja的文件“ wordninja_words.txt” ):

var fs = require("fs");

var splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");
var maxWordLen = 0;
var wordCost = {};

fs.readFile("./wordninja_words.txt", 'utf8', function(err, data) {
    if (err) {
        throw err;
    }
    var words = data.split('\n');
    words.forEach(function(word, index) {
        wordCost[word] = Math.log((index + 1) * Math.log(words.length));
    })
    words.forEach(function(word) {
        if (word.length > maxWordLen)
            maxWordLen = word.length;
    });
    console.log(maxWordLen)
    splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");
    console.log(split(process.argv[2]));
});


function split(s) {
    var list = [];
    s.split(splitRegex).forEach(function(sub) {
        _split(sub).forEach(function(word) {
            list.push(word);
        })
    })
    return list;
}
module.exports = split;


function _split(s) {
    var cost = [0];

    function best_match(i) {
        var candidates = cost.slice(Math.max(0, i - maxWordLen), i).reverse();
        var minPair = [Number.MAX_SAFE_INTEGER, 0];
        candidates.forEach(function(c, k) {
            if (wordCost[s.substring(i - k - 1, i).toLowerCase()]) {
                var ccost = c + wordCost[s.substring(i - k - 1, i).toLowerCase()];
            } else {
                var ccost = Number.MAX_SAFE_INTEGER;
            }
            if (ccost < minPair[0]) {
                minPair = [ccost, k + 1];
            }
        })
        return minPair;
    }

    for (var i = 1; i < s.length + 1; i++) {
        cost.push(best_match(i)[0]);
    }

    var out = [];
    i = s.length;
    while (i > 0) {
        var c = best_match(i)[0];
        var k = best_match(i)[1];
        if (c == cost[i])
            console.log("Alert: " + c);

        var newToken = true;
        if (s.slice(i - k, i) != "'") {
            if (out.length > 0) {
                if (out[-1] == "'s" || (Number.isInteger(s[i - 1]) && Number.isInteger(out[-1][0]))) {
                    out[-1] = s.slice(i - k, i) + out[-1];
                    newToken = false;
                }
            }
        }

        if (newToken) {
            out.push(s.slice(i - k, i))
        }

        i -= k

    }
    return out.reverse();
}

Here is the accepted answer translated to JavaScript (requires node.js, and the file “wordninja_words.txt” from https://github.com/keredson/wordninja):

var fs = require("fs");

var splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");
var maxWordLen = 0;
var wordCost = {};

fs.readFile("./wordninja_words.txt", 'utf8', function(err, data) {
    if (err) {
        throw err;
    }
    var words = data.split('\n');
    words.forEach(function(word, index) {
        wordCost[word] = Math.log((index + 1) * Math.log(words.length));
    })
    words.forEach(function(word) {
        if (word.length > maxWordLen)
            maxWordLen = word.length;
    });
    console.log(maxWordLen)
    splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");
    console.log(split(process.argv[2]));
});


function split(s) {
    var list = [];
    s.split(splitRegex).forEach(function(sub) {
        _split(sub).forEach(function(word) {
            list.push(word);
        })
    })
    return list;
}
module.exports = split;


function _split(s) {
    var cost = [0];

    function best_match(i) {
        var candidates = cost.slice(Math.max(0, i - maxWordLen), i).reverse();
        var minPair = [Number.MAX_SAFE_INTEGER, 0];
        candidates.forEach(function(c, k) {
            if (wordCost[s.substring(i - k - 1, i).toLowerCase()]) {
                var ccost = c + wordCost[s.substring(i - k - 1, i).toLowerCase()];
            } else {
                var ccost = Number.MAX_SAFE_INTEGER;
            }
            if (ccost < minPair[0]) {
                minPair = [ccost, k + 1];
            }
        })
        return minPair;
    }

    for (var i = 1; i < s.length + 1; i++) {
        cost.push(best_match(i)[0]);
    }

    var out = [];
    i = s.length;
    while (i > 0) {
        var c = best_match(i)[0];
        var k = best_match(i)[1];
        if (c == cost[i])
            console.log("Alert: " + c);

        var newToken = true;
        if (s.slice(i - k, i) != "'") {
            if (out.length > 0) {
                if (out[-1] == "'s" || (Number.isInteger(s[i - 1]) && Number.isInteger(out[-1][0]))) {
                    out[-1] = s.slice(i - k, i) + out[-1];
                    newToken = false;
                }
            }
        }

        if (newToken) {
            out.push(s.slice(i - k, i))
        }

        i -= k

    }
    return out.reverse();
}

回答 7

如果将单词表预编译到DFA中(这将非常慢),则匹配输入所花费的时间将与字符串的长度成比例(实际上,这仅比遍历字符串慢一点)。

实际上,这是前面提到的trie算法的更通用的版本。我只是为了完全完成而提到它-到目前为止,您还不能使用DFA实现。RE2可以使用,但是我不知道Python绑定是否可以让您调整DFA的大小,然后再丢弃编译的DFA数据并进行NFA搜索。

If you precompile the wordlist into a DFA (which will be very slow), then the time it takes to match an input will be proportional to the length of the string (in fact, only a little slower than just iterating over the string).

This is effectively a more general version of the trie algorithm which was mentioned earlier. I only mention it for completeless — as of yet, there’s no DFA implementation you can just use. RE2 would work, but I don’t know if the Python bindings let you tune how large you allow a DFA to be before it just throws away the compiled DFA data and does NFA search.


回答 8

看起来相当平凡的回溯可以做到。从字符串的开头开始。向右扫描,直到您有一个字。然后,在字符串的其余部分调用该函数。如果函数一直扫描到右边而没有识别出一个单词,则该函数返回“ false”。否则,返回找到的单词以及递归调用返回的单词列表。

示例:“ tableapple”。查找“ tab”,然后查找“ leap”,但“ ple”中没有单词。“ leapple”中没有其他词。查找“表”,然后查找“ app”。“ le”不是一个词,所以尝试苹果,认识并返回。

为了尽可能长的时间,请继续前进,只发出(而不是返回)正确的解决方案;然后,根据您选择的任何标准(最大值,最小值,平均值等)选择最佳的一种

It seems like fairly mundane backtracking will do. Start at the beggining of the string. Scan right until you have a word. Then, call the function on the rest of the string. Function returns “false” if it scans all the way to the right without recognizing a word. Otherwise, returns the word it found and the list of words returned by the recursive call.

Example: “tableapple”. Finds “tab”, then “leap”, but no word in “ple”. No other word in “leapple”. Finds “table”, then “app”. “le” not a word, so tries apple, recognizes, returns.

To get longest possible, keep going, only emitting (rather than returning) correct solutions; then, choose the optimal one by any criterion you choose (maxmax, minmax, average, etc.)


回答 9

基于unutbu的解决方案,我实现了Java版本:

private static List<String> splitWordWithoutSpaces(String instring, String suffix) {
    if(isAWord(instring)) {
        if(suffix.length() > 0) {
            List<String> rest = splitWordWithoutSpaces(suffix, "");
            if(rest.size() > 0) {
                List<String> solutions = new LinkedList<>();
                solutions.add(instring);
                solutions.addAll(rest);
                return solutions;
            }
        } else {
            List<String> solutions = new LinkedList<>();
            solutions.add(instring);
            return solutions;
        }

    }
    if(instring.length() > 1) {
        String newString = instring.substring(0, instring.length()-1);
        suffix = instring.charAt(instring.length()-1) + suffix;
        List<String> rest = splitWordWithoutSpaces(newString, suffix);
        return rest;
    }
    return Collections.EMPTY_LIST;
}

输入: "tableapplechairtablecupboard"

输出: [table, apple, chair, table, cupboard]

输入: "tableprechaun"

输出: [tab, leprechaun]

Based on unutbu’s solution I’ve implemented a Java version:

private static List<String> splitWordWithoutSpaces(String instring, String suffix) {
    if(isAWord(instring)) {
        if(suffix.length() > 0) {
            List<String> rest = splitWordWithoutSpaces(suffix, "");
            if(rest.size() > 0) {
                List<String> solutions = new LinkedList<>();
                solutions.add(instring);
                solutions.addAll(rest);
                return solutions;
            }
        } else {
            List<String> solutions = new LinkedList<>();
            solutions.add(instring);
            return solutions;
        }

    }
    if(instring.length() > 1) {
        String newString = instring.substring(0, instring.length()-1);
        suffix = instring.charAt(instring.length()-1) + suffix;
        List<String> rest = splitWordWithoutSpaces(newString, suffix);
        return rest;
    }
    return Collections.EMPTY_LIST;
}

Input: "tableapplechairtablecupboard"

Output: [table, apple, chair, table, cupboard]

Input: "tableprechaun"

Output: [tab, leprechaun]


回答 10

对于德语,有CharSplit,它使用机器学习,并且对几个单词的字符串非常有效。

https://github.com/dtuggener/CharSplit

For German language there is CharSplit which uses machine learning and works pretty good for strings of a few words.

https://github.com/dtuggener/CharSplit


回答 11

扩展@miku的建议使用a Trie,仅追加Trie是相对简单的实现python

class Node:
    def __init__(self, is_word=False):
        self.children = {}
        self.is_word = is_word

class TrieDictionary:
    def __init__(self, words=tuple()):
        self.root = Node()
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for c in word:
            node = node.children.setdefault(c, Node())
        node.is_word = True

    def lookup(self, word, from_node=None):
        node = self.root if from_node is None else from_node
        for c in word:
            try:
                node = node.children[c]
            except KeyError:
                return None

        return node

然后,我们可以Trie根据一组单词构建一个基于字典的字典:

dictionary = {"a", "pea", "nut", "peanut", "but", "butt", "butte", "butter"}
trie_dictionary = TrieDictionary(words=dictionary)

它将产生如下所示的树(*指示单词的开头或结尾):

* -> a*
 \\\ 
  \\\-> p -> e -> a*
   \\              \-> n -> u -> t*
    \\
     \\-> b -> u -> t*
      \\             \-> t*
       \\                 \-> e*
        \\                     \-> r*
         \
          \-> n -> u -> t*

通过将其与关于如何选择单词的启发式方法相结合,我们可以将其合并到解决方案中。例如,与较短的单词相比,我们更喜欢较长的单词:

def using_trie_longest_word_heuristic(s):
    node = None
    possible_indexes = []

    # O(1) short-circuit if whole string is a word, doesn't go against longest-word wins
    if s in dictionary:
        return [ s ]

    for i in range(len(s)):
        # traverse the trie, char-wise to determine intermediate words
        node = trie_dictionary.lookup(s[i], from_node=node)

        # no more words start this way
        if node is None:
            # iterate words we have encountered from biggest to smallest
            for possible in possible_indexes[::-1]:
                # recurse to attempt to solve the remaining sub-string
                end_of_phrase = using_trie_longest_word_heuristic(s[possible+1:])

                # if we have a solution, return this word + our solution
                if end_of_phrase:
                    return [ s[:possible+1] ] + end_of_phrase

            # unsolvable
            break

        # if this is a leaf, append the index to the possible words list
        elif node.is_word:
            possible_indexes.append(i)

    # empty string OR unsolvable case 
    return []

我们可以这样使用此功能:

>>> using_trie_longest_word_heuristic("peanutbutter")
[ "peanut", "butter" ]

由于我们在Trie搜索越来越长的单词时一直保持在的位置,因此trie每个可能的解决方案最多遍历一次(而不是:,的2次数)。最终的短路使我们免于在最坏的情况下通过字符串随意移动。peanutpeapeanut

最终结果仅需进行少量检查:

'peanutbutter' - not a word, go charwise
'p' - in trie, use this node
'e' - in trie, use this node
'a' - in trie and edge, store potential word and use this node
'n' - in trie, use this node
'u' - in trie, use this node
't' - in trie and edge, store potential word and use this node
'b' - not in trie from `peanut` vector
'butter' - remainder of longest is a word

该解决方案的一个好处是,您可以很快地知道是否存在带有给定前缀的较长单词,从而无需针对字典全面测试序列组合。unsolvable与其他实现方式相比,这样做还使答案相对便宜。

该解决方案的缺点是要占用大量内存,trie以及trie前期构建成本。

Expanding on @miku’s suggestion to use a Trie, an append-only Trie is relatively straight-forward to implement in python:

class Node:
    def __init__(self, is_word=False):
        self.children = {}
        self.is_word = is_word

class TrieDictionary:
    def __init__(self, words=tuple()):
        self.root = Node()
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for c in word:
            node = node.children.setdefault(c, Node())
        node.is_word = True

    def lookup(self, word, from_node=None):
        node = self.root if from_node is None else from_node
        for c in word:
            try:
                node = node.children[c]
            except KeyError:
                return None

        return node

We can then build a Trie-based dictionary from a set of words:

dictionary = {"a", "pea", "nut", "peanut", "but", "butt", "butte", "butter"}
trie_dictionary = TrieDictionary(words=dictionary)

Which will produce a tree that looks like this (* indicates beginning or end of a word):

* -> a*
 \\\ 
  \\\-> p -> e -> a*
   \\              \-> n -> u -> t*
    \\
     \\-> b -> u -> t*
      \\             \-> t*
       \\                 \-> e*
        \\                     \-> r*
         \
          \-> n -> u -> t*

We can incorporate this into a solution by combining it with a heuristic about how to choose words. For example we can prefer longer words over shorter words:

def using_trie_longest_word_heuristic(s):
    node = None
    possible_indexes = []

    # O(1) short-circuit if whole string is a word, doesn't go against longest-word wins
    if s in dictionary:
        return [ s ]

    for i in range(len(s)):
        # traverse the trie, char-wise to determine intermediate words
        node = trie_dictionary.lookup(s[i], from_node=node)

        # no more words start this way
        if node is None:
            # iterate words we have encountered from biggest to smallest
            for possible in possible_indexes[::-1]:
                # recurse to attempt to solve the remaining sub-string
                end_of_phrase = using_trie_longest_word_heuristic(s[possible+1:])

                # if we have a solution, return this word + our solution
                if end_of_phrase:
                    return [ s[:possible+1] ] + end_of_phrase

            # unsolvable
            break

        # if this is a leaf, append the index to the possible words list
        elif node.is_word:
            possible_indexes.append(i)

    # empty string OR unsolvable case 
    return []

We can use this function like this:

>>> using_trie_longest_word_heuristic("peanutbutter")
[ "peanut", "butter" ]

Because we maintain our position in the Trie as we search for longer and longer words, we traverse the trie at most once per possible solution (rather than 2 times for peanut: pea, peanut). The final short-circuit saves us from walking char-wise through the string in the worst-case.

The final result is only a handful of inspections:

'peanutbutter' - not a word, go charwise
'p' - in trie, use this node
'e' - in trie, use this node
'a' - in trie and edge, store potential word and use this node
'n' - in trie, use this node
'u' - in trie, use this node
't' - in trie and edge, store potential word and use this node
'b' - not in trie from `peanut` vector
'butter' - remainder of longest is a word

A benefit of this solution is in the fact that you know very quickly if longer words exist with a given prefix, which spares the need to exhaustively test sequence combinations against a dictionary. It also makes getting to an unsolvable answer comparatively cheap to other implementations.

The downsides of this solution are a large memory footprint for the trie and the cost of building the trie up-front.


回答 12

如果您包含字符串中包含的单词的详尽列表:

word_list = ["table", "apple", "chair", "cupboard"]

使用列表理解来遍历列表以找到单词以及单词出现的次数。

string = "tableapplechairtablecupboard"

def split_string(string, word_list):

    return ("".join([(item + " ")*string.count(item.lower()) for item in word_list if item.lower() in string])).strip()

该函数string按列表顺序返回单词的输出table table apple chair cupboard

If you have an exhaustive list of the words contained within the string:

word_list = ["table", "apple", "chair", "cupboard"]

Using list comprehension to iterate over the list to locate the word and how many times it appears.

string = "tableapplechairtablecupboard"

def split_string(string, word_list):

    return ("".join([(item + " ")*string.count(item.lower()) for item in word_list if item.lower() in string])).strip()


The function returns a string output of words in order of the list table table apple chair cupboard


回答 13

非常感谢https://github.com/keredson/wordninja/中的帮助

在我看来,Java的贡献很小。

可以将public方法splitContiguousWords与其他2种方法一起嵌入在同一目录中具有ninja_words.txt的类中(或根据编码器的选择进行修改)。并且该方法splitContiguousWords可以用于该目的。

public List<String> splitContiguousWords(String sentence) {

    String splitRegex = "[^a-zA-Z0-9']+";
    Map<String, Number> wordCost = new HashMap<>();
    List<String> dictionaryWords = IOUtils.linesFromFile("ninja_words.txt", StandardCharsets.UTF_8.name());
    double naturalLogDictionaryWordsCount = Math.log(dictionaryWords.size());
    long wordIdx = 0;
    for (String word : dictionaryWords) {
        wordCost.put(word, Math.log(++wordIdx * naturalLogDictionaryWordsCount));
    }
    int maxWordLength = Collections.max(dictionaryWords, Comparator.comparing(String::length)).length();
    List<String> splitWords = new ArrayList<>();
    for (String partSentence : sentence.split(splitRegex)) {
        splitWords.add(split(partSentence, wordCost, maxWordLength));
    }
    log.info("Split word for the sentence: {}", splitWords);
    return splitWords;
}

private String split(String partSentence, Map<String, Number> wordCost, int maxWordLength) {
    List<Pair<Number, Number>> cost = new ArrayList<>();
    cost.add(new Pair<>(Integer.valueOf(0), Integer.valueOf(0)));
    for (int index = 1; index < partSentence.length() + 1; index++) {
        cost.add(bestMatch(partSentence, cost, index, wordCost, maxWordLength));
    }
    int idx = partSentence.length();
    List<String> output = new ArrayList<>();
    while (idx > 0) {
        Pair<Number, Number> candidate = bestMatch(partSentence, cost, idx, wordCost, maxWordLength);
        Number candidateCost = candidate.getKey();
        Number candidateIndexValue = candidate.getValue();
        if (candidateCost.doubleValue() != cost.get(idx).getKey().doubleValue()) {
            throw new RuntimeException("Candidate cost unmatched; This should not be the case!");
        }
        boolean newToken = true;
        String token = partSentence.substring(idx - candidateIndexValue.intValue(), idx);
        if (token != "\'" && output.size() > 0) {
            String lastWord = output.get(output.size() - 1);
            if (lastWord.equalsIgnoreCase("\'s") ||
                    (Character.isDigit(partSentence.charAt(idx - 1)) && Character.isDigit(lastWord.charAt(0)))) {
                output.set(output.size() - 1, token + lastWord);
                newToken = false;
            }
        }
        if (newToken) {
            output.add(token);
        }
        idx -= candidateIndexValue.intValue();
    }
    return String.join(" ", Lists.reverse(output));
}


private Pair<Number, Number> bestMatch(String partSentence, List<Pair<Number, Number>> cost, int index,
                      Map<String, Number> wordCost, int maxWordLength) {
    List<Pair<Number, Number>> candidates = Lists.reverse(cost.subList(Math.max(0, index - maxWordLength), index));
    int enumerateIdx = 0;
    Pair<Number, Number> minPair = new Pair<>(Integer.MAX_VALUE, Integer.valueOf(enumerateIdx));
    for (Pair<Number, Number> pair : candidates) {
        ++enumerateIdx;
        String subsequence = partSentence.substring(index - enumerateIdx, index).toLowerCase();
        Number minCost = Integer.MAX_VALUE;
        if (wordCost.containsKey(subsequence)) {
            minCost = pair.getKey().doubleValue() + wordCost.get(subsequence).doubleValue();
        }
        if (minCost.doubleValue() < minPair.getKey().doubleValue()) {
            minPair = new Pair<>(minCost.doubleValue(), enumerateIdx);
        }
    }
    return minPair;
}

Many Thanks for the help in https://github.com/keredson/wordninja/

A small contribution of the same in Java from my side.

The public method splitContiguousWords could be embedded with the other 2 methods in the class having ninja_words.txt in same directory(or modified as per the choice of coder). And the method splitContiguousWords could be used for the purpose.

public List<String> splitContiguousWords(String sentence) {

    String splitRegex = "[^a-zA-Z0-9']+";
    Map<String, Number> wordCost = new HashMap<>();
    List<String> dictionaryWords = IOUtils.linesFromFile("ninja_words.txt", StandardCharsets.UTF_8.name());
    double naturalLogDictionaryWordsCount = Math.log(dictionaryWords.size());
    long wordIdx = 0;
    for (String word : dictionaryWords) {
        wordCost.put(word, Math.log(++wordIdx * naturalLogDictionaryWordsCount));
    }
    int maxWordLength = Collections.max(dictionaryWords, Comparator.comparing(String::length)).length();
    List<String> splitWords = new ArrayList<>();
    for (String partSentence : sentence.split(splitRegex)) {
        splitWords.add(split(partSentence, wordCost, maxWordLength));
    }
    log.info("Split word for the sentence: {}", splitWords);
    return splitWords;
}

private String split(String partSentence, Map<String, Number> wordCost, int maxWordLength) {
    List<Pair<Number, Number>> cost = new ArrayList<>();
    cost.add(new Pair<>(Integer.valueOf(0), Integer.valueOf(0)));
    for (int index = 1; index < partSentence.length() + 1; index++) {
        cost.add(bestMatch(partSentence, cost, index, wordCost, maxWordLength));
    }
    int idx = partSentence.length();
    List<String> output = new ArrayList<>();
    while (idx > 0) {
        Pair<Number, Number> candidate = bestMatch(partSentence, cost, idx, wordCost, maxWordLength);
        Number candidateCost = candidate.getKey();
        Number candidateIndexValue = candidate.getValue();
        if (candidateCost.doubleValue() != cost.get(idx).getKey().doubleValue()) {
            throw new RuntimeException("Candidate cost unmatched; This should not be the case!");
        }
        boolean newToken = true;
        String token = partSentence.substring(idx - candidateIndexValue.intValue(), idx);
        if (token != "\'" && output.size() > 0) {
            String lastWord = output.get(output.size() - 1);
            if (lastWord.equalsIgnoreCase("\'s") ||
                    (Character.isDigit(partSentence.charAt(idx - 1)) && Character.isDigit(lastWord.charAt(0)))) {
                output.set(output.size() - 1, token + lastWord);
                newToken = false;
            }
        }
        if (newToken) {
            output.add(token);
        }
        idx -= candidateIndexValue.intValue();
    }
    return String.join(" ", Lists.reverse(output));
}


private Pair<Number, Number> bestMatch(String partSentence, List<Pair<Number, Number>> cost, int index,
                      Map<String, Number> wordCost, int maxWordLength) {
    List<Pair<Number, Number>> candidates = Lists.reverse(cost.subList(Math.max(0, index - maxWordLength), index));
    int enumerateIdx = 0;
    Pair<Number, Number> minPair = new Pair<>(Integer.MAX_VALUE, Integer.valueOf(enumerateIdx));
    for (Pair<Number, Number> pair : candidates) {
        ++enumerateIdx;
        String subsequence = partSentence.substring(index - enumerateIdx, index).toLowerCase();
        Number minCost = Integer.MAX_VALUE;
        if (wordCost.containsKey(subsequence)) {
            minCost = pair.getKey().doubleValue() + wordCost.get(subsequence).doubleValue();
        }
        if (minCost.doubleValue() < minPair.getKey().doubleValue()) {
            minPair = new Pair<>(minCost.doubleValue(), enumerateIdx);
        }
    }
    return minPair;
}

回答 14

这会有所帮助

from wordsegment import load, segment
load()
segment('providesfortheresponsibilitiesofperson')

This will help

from wordsegment import load, segment
load()
segment('providesfortheresponsibilitiesofperson')

回答 15

您需要确定您的词汇表-也许任何免费单词列表都可以。

完成后,使用该词汇表来构建后缀树,并将输入流与此匹配:http : //en.wikipedia.org/wiki/Suffix_tree

You need to identify your vocabulary – perhaps any free word list will do.

Once done, use that vocabulary to build a suffix tree, and match your stream of input against that: http://en.wikipedia.org/wiki/Suffix_tree