标签归档:split

为什么在split()结果中返回空字符串?

问题:为什么在split()结果中返回空字符串?

什么是点'/segment/segment/'.split('/')回来['', 'segment', 'segment', '']

注意空元素。如果您要分割的分隔符恰好位于字符串的第一位置,并且位于字符串的末尾,那么它又能为您带来什么额外的价值呢?

What is the point of '/segment/segment/'.split('/') returning ['', 'segment', 'segment', '']?

Notice the empty elements. If you’re splitting on a delimiter that happens to be at position one and at the very end of a string, what extra value does it give you to have the empty string returned from each end?


回答 0

str.splitstr.join,所以

"/".join(['', 'segment', 'segment', ''])

让您返回原始字符串。

如果没有空字符串,则第一个和最后符串'/'将丢失join()

str.split complements str.join, so

"/".join(['', 'segment', 'segment', ''])

gets you back the original string.

If the empty strings were not there, the first and last '/' would be missing after the join()


回答 1

更一般而言,要删除split()结果中返回的空字符串,您可能需要查看该filter函数。

例:

filter(None, '/segment/segment/'.split('/'))

退货

['segment', 'segment']

More generally, to remove empty strings returned in split() results, you may want to look at the filter function.

Example:

f = filter(None, '/segment/segment/'.split('/'))
s_all = list(f)

returns

['segment', 'segment']

回答 2

这里有两点要考虑:

  • 期望结果'/segment/segment/'.split('/')['segment', 'segment']于是合理的,但这会丢失信息。如果split()按照您想要的方式工作,如果我告诉您a.split('/') == ['segment', 'segment'],您将无法告诉我是什么a
  • 结果应该是什么'a//b'.split()['a', 'b']?或['a', '', 'b']?即,是否应split()合并相邻的定界符?如果需要,那么将很难解析由字符分隔的数据,并且某些字段可以为空。我可以肯定,有很多人确实想要上述情况的结果中的空值!

最后,归结为两点:

一致性:如果我有n定界符,则在中a,我会在n+1返回值split()

应该可以做复杂的事情,并且可以轻松地做简单的事情:如果由于想要忽略空字符串split(),可以始终这样做:

def mysplit(s, delim=None):
    return [x for x in s.split(delim) if x]

但是如果不想忽略空值,则应该可以。

该语言必须选择一种定义split()-有太多不同的用例无法满足所有人的默认要求。我认为Python的选择是不错的选择,也是最合乎逻辑的选择。(顺便说一句,我不喜欢C的原因之一strtok()是因为它合并了相邻的定界符,因此很难对其进行认真的解析/标记化处理。)

有一个exceptions:a.split()没有参数会挤压连续的空格,但是有人可以认为在这种情况下这样做是正确的。如果您不想要这种行为,则可以始终这样做a.split(' ')

There are two main points to consider here:

  • Expecting the result of '/segment/segment/'.split('/') to be equal to ['segment', 'segment'] is reasonable, but then this loses information. If split() worked the way you wanted, if I tell you that a.split('/') == ['segment', 'segment'], you can’t tell me what a was.
  • What should be the result of 'a//b'.split() be? ['a', 'b']?, or ['a', '', 'b']? I.e., should split() merge adjacent delimiters? If it should, then it will be very hard to parse data that’s delimited by a character, and some of the fields can be empty. I am fairly sure there are many people who do want the empty values in the result for the above case!

In the end, it boils down to two things:

Consistency: if I have n delimiters, in a, I get n+1 values back after the split().

It should be possible to do complex things, and easy to do simple things: if you want to ignore empty strings as a result of the split(), you can always do:

def mysplit(s, delim=None):
    return [x for x in s.split(delim) if x]

but if one doesn’t want to ignore the empty values, one should be able to.

The language has to pick one definition of split()—there are too many different use cases to satisfy everyone’s requirement as a default. I think that Python’s choice is a good one, and is the most logical. (As an aside, one of the reasons I don’t like C’s strtok() is because it merges adjacent delimiters, making it extremely hard to do serious parsing/tokenization with it.)

There is one exception: a.split() without an argument squeezes consecutive white-space, but one can argue that this is the right thing to do in that case. If you don’t want the behavior, you can always to a.split(' ').


回答 3

x.split(y)始终返回列表1 + x.count(y)项是一种珍贵的规律性-为@ gnibbler本已指出,这让splitjoin对方的确切逆(因为它们显然应该是),这也正是各种分隔符连记录的语义(映射例如csv文件行[[引用网络的净额]],/etc/groupUnix中的行等等),它允许(如@Roman的回答所述)轻松检查(例如)绝对路径与相对路径(在文件路径和URL中),等等。

另一种看待它的方式是,您不应该肆意地将信息扔出窗外,以免获得任何好处。x.split(y)等于会得到什么x.strip(y).split(y)?没事,当然-它很容易使用第二种形式时,这就是你的意思,但如果第一种形式是任意视为指第二个,你有很多工作要做,当你希望第一个(如上一段所指出的,这绝非罕见。

但是实际上,根据数学规律进行思考是您可以自学的设计可传递API的最简单,最通用的方法。举一个不同的例子,对于任何有效的xy x == x[:y] + x[y:]-这立即表明为什么应该排除切片的一个极端非常重要。您可以表述不变的断言越简单,就越有可能产生的语义就是您在现实生活中需要的语义-这是神秘的事实,即数学在处理宇宙中非常有用。

尝试为split前导和尾随定界符是特殊情况的方言制定不变式…反例:像这样的字符串方法isspace并不是最大程度的简单- x.isspace()等效于x and all(c in string.whitespace for c in x)-愚蠢的前导x and是您经常发现自己编码的原因not x or x.isspace(),返回到应该is...字符串方法中设计的简单性(因此,空字符串“就是”您想要的任何东西)与街上人马的感觉相反,也许[[空集,如零&c,始终使大多数人感到困惑;-)]],但完全符合明显完善的数学常识!-)。

Having x.split(y) always return a list of 1 + x.count(y) items is a precious regularity — as @gnibbler’s already pointed out it makes split and join exact inverses of each other (as they obviously should be), it also precisely maps the semantics of all kinds of delimiter-joined records (such as csv file lines [[net of quoting issues]], lines from /etc/group in Unix, and so on), it allows (as @Roman’s answer mentioned) easy checks for (e.g.) absolute vs relative paths (in file paths and URLs), and so forth.

Another way to look at it is that you shouldn’t wantonly toss information out of the window for no gain. What would be gained in making x.split(y) equivalent to x.strip(y).split(y)? Nothing, of course — it’s easy to use the second form when that’s what you mean, but if the first form was arbitrarily deemed to mean the second one, you’d have lot of work to do when you do want the first one (which is far from rare, as the previous paragraph points out).

But really, thinking in terms of mathematical regularity is the simplest and most general way you can teach yourself to design passable APIs. To take a different example, it’s very important that for any valid x and y x == x[:y] + x[y:] — which immediately indicates why one extreme of a slicing should be excluded. The simpler the invariant assertion you can formulate, the likelier it is that the resulting semantics are what you need in real life uses — part of the mystical fact that maths is very useful in dealing with the universe.

Try formulating the invariant for a split dialect in which leading and trailing delimiters are special-cased… counter-example: string methods such as isspace are not maximally simple — x.isspace() is equivalent to x and all(c in string.whitespace for c in x) — that silly leading x and is why you so often find yourself coding not x or x.isspace(), to get back to the simplicity which should have been designed into the is... string methods (whereby an empty string “is” anything you want — contrary to man-in-the-street horse-sense, maybe [[empty sets, like zero &c, have always confused most people;-)]], but fully conforming to obvious well-refined mathematical common-sense!-).


回答 4

我不确定您要寻找哪种答案?您得到三个匹配,因为您有三个定界符。如果您不想要那个空的,只需使用:

'/segment/segment/'.strip('/').split('/')

I’m not sure what kind of answer you’re looking for? You get three matches because you have three delimiters. If you don’t want that empty one, just use:

'/segment/segment/'.strip('/').split('/')

回答 5

好吧,它让您知道那里有一个定界符。因此,看到4个结果会让您知道您有3个定界符。这使您能够使用此信息执行任何所需的操作,而不是让Python删除空元素,然后让您手动检查是否需要定界符(如果需要)。

一个简单的例子:假设您要检查绝对文件名和相对文件名。这样,您就可以使用拆分完成所有操作,而不必检查文件名的第一个字符是什么。

Well, it lets you know there was a delimiter there. So, seeing 4 results lets you know you had 3 delimiters. This gives you the power to do whatever you want with this information, rather than having Python drop the empty elements, and then making you manually check for starting or ending delimiters if you need to know it.

Simple example: Say you want to check for absolute vs. relative filenames. This way you can do it all with the split, without also having to check what the first character of your filename is.


回答 6

考虑以下最小示例:

>>> '/'.split('/')
['', '']

split必须给您定界符前后的内容'/',但没有其他字符。因此,它必须为您提供一个空字符串,从技术上讲,它在之前和之后'/',因为'' + '/' + '' == '/'

Consider this minimal example:

>>> '/'.split('/')
['', '']

split must give you what’s before and after the delimiter '/', but there are no other characters. So it has to give you the empty string, which technically precedes and follows the '/', because '' + '/' + '' == '/'.


基于正则表达式的Python拆分字符串

问题:基于正则表达式的Python拆分字符串

"HELLO there HOW are YOU"用大写单词分割字符串的最佳方法是什么(在Python中)?

所以我最终得到一个像这样的数组: results = ['HELLO there', 'HOW are', 'YOU']


编辑:

我努力了:

p = re.compile("\b[A-Z]{2,}\b")
print p.split(page_text)

不过,它似乎不起作用。

What is the best way to split a string like "HELLO there HOW are YOU" by upper case words (in Python)?

So I’d end up with an array like such: results = ['HELLO there', 'HOW are', 'YOU']


EDIT:

I have tried:

p = re.compile("\b[A-Z]{2,}\b")
print p.split(page_text)

It doesn’t seem to work, though.


回答 0

我建议

l = re.compile("(?<!^)\s+(?=[A-Z])(?!.\s)").split(s)

检查这个演示

I suggest

l = re.compile("(?<!^)\s+(?=[A-Z])(?!.\s)").split(s)

Check this demo.


回答 1

您可以先行使用:

re.split(r'[ ](?=[A-Z]+\b)', input)

这将在每个空格处分开,后跟一串大写字母,每个大写字母以单词边界结尾。

请注意,方括号仅出于可读性考虑,也可以省略。

如果一个单词的第一个字母大写就足够了(因此,如果您也想在其前面拆分Hello),它将变得更加容易:

re.split(r'[ ](?=[A-Z])', input)

现在,这会在每个空格处分开,后跟任何大写字母。

You could use a lookahead:

re.split(r'[ ](?=[A-Z]+\b)', input)

This will split at every space that is followed by a string of upper-case letters which end in a word-boundary.

Note that the square brackets are only for readability and could as well be omitted.

If it is enough that the first letter of a word is upper case (so if you would want to split in front of Hello as well) it gets even easier:

re.split(r'[ ](?=[A-Z])', input)

Now this splits at every space followed by any upper-case letter.


回答 2

您的问题包含字符串文字"\b[A-Z]{2,}\b",但是这\b意味着退格,因为没有r-修饰符。

试试:r"\b[A-Z]{2,}\b"

Your question contains the string literal "\b[A-Z]{2,}\b", but that \b will mean backspace, because there is no r-modifier.

Try: r"\b[A-Z]{2,}\b".


如何将文本分成句子?

问题:如何将文本分成句子?

我有一个文本文件。我需要得到一个句子清单。

如何实现呢?有很多微妙之处,例如缩写中使用了点。

我的旧正则表达式效果很差:

re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)

I have a text file. I need to get a list of sentences.

How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.

My old regular expression works badly:

re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)

回答 0

自然语言工具包(nltk.org)满足您的需求。 该群组发布表明这样做:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print '\n-----\n'.join(tokenizer.tokenize(data))

(我还没有尝试过!)

The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates this does it:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print '\n-----\n'.join(tokenizer.tokenize(data))

(I haven’t tried it!)


回答 1

此功能可以在约0.1秒内将Huckleberry Finn的整个文本拆分成句子,并处理许多使句子解析变得不平凡的更痛苦的案例,例如“ John Johnson Jr.先生出生于美国,但获得了博士学位。 D.在以色列加入耐克公司的工程师之前,他还曾在craigslist.org业务分析师。

# -*- coding: utf-8 -*-
import re
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub("\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences

This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds and handles many of the more painful edge cases that make sentence parsing non-trivial e.g. “Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst.

# -*- coding: utf-8 -*-
import re
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub("\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences

回答 2

除了使用正则表达式将文本拆分为句子外,还可以使用nltk库。

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."

>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

参考:https : //stackoverflow.com/a/9474645/2877052

Instead of using regex for spliting the text into sentences, you can also use nltk library.

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."

>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

ref: https://stackoverflow.com/a/9474645/2877052


回答 3

您可以尝试使用Spacy代替正则表达式。我用它就可以了。

import spacy
nlp = spacy.load('en')

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    print(sent.string.strip())

You can try using Spacy instead of regex. I use it and it does the job.

import spacy
nlp = spacy.load('en')

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    print(sent.string.strip())

回答 4

这是不依赖任何外部库的中间方法。我使用列表推导来排除缩写词和终止符之间的重叠以及排除终止符变体之间的重叠,例如:“。” 与“。”

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
   end = True
   sentences = []
   while end > -1:
       end = find_sentence_end(paragraph)
       if end > -1:
           sentences.append(paragraph[end:].strip())
           paragraph = paragraph[:end]
   sentences.append(paragraph)
   sentences.reverse()
   return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

我从以下条目中使用了Karl的find_all函数: 查找Python中所有出现的子字符串

Here is a middle of the road approach that doesn’t rely on any external libraries. I use list comprehension to exclude overlaps between abbreviations and terminators as well as to exclude overlaps between variations on terminations, for example: ‘.’ vs. ‘.”‘

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
   end = True
   sentences = []
   while end > -1:
       end = find_sentence_end(paragraph)
       if end > -1:
           sentences.append(paragraph[end:].strip())
           paragraph = paragraph[:end]
   sentences.append(paragraph)
   sentences.reverse()
   return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

I used Karl’s find_all function from this entry: Find all occurrences of a substring in Python


回答 5

对于简单的情况(句子正常终止),这应该起作用:

import re
text = ''.join(open('somefile.txt').readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

regex是*\. +,它匹配一个由左0或更多空格和右1或更多空格包围的句点(以防止re.split的句点被视为句子的变化)。

显然,这不是最可靠的解决方案,但是在大多数情况下,它会很好用。唯一无法解决的情况就是缩写(也许遍历句子列表并检查每个字符串中是否都sentences以大写字母开头?)

For simple cases (where sentences are terminated normally), this should work:

import re
text = ''.join(open('somefile.txt').readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The regex is *\. +, which matches a period surrounded by 0 or more spaces to the left and 1 or more to the right (to prevent something like the period in re.split being counted as a change in sentence).

Obviously, not the most robust solution, but it’ll do fine in most cases. The only case this won’t cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter?)


回答 6

您还可以在NLTK中使用句子标记化功能:

from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes.  Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."

sent_tokenize(sentence)

You can also use sentence tokenization function in NLTK:

from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes.  Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."

sent_tokenize(sentence)

回答 7

@Artyom,

嗨!您可以使用此功能为俄语(和其他一些语言)创建新的令牌生成器:

def russianTokenizer(text):
    result = text
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\"', ' \" ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ') 
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result

然后以这种方式调用它:

text = 'вы выполняете поиск, используя Google SSL;'
tokens = russianTokenizer(text)

祝你好运,Marilena。

@Artyom,

Hi! You could make a new tokenizer for Russian (and some other languages) using this function:

def russianTokenizer(text):
    result = text
    result = result.replace('.', ' . ')
    result = result.replace(' .  .  . ', ' ... ')
    result = result.replace(',', ' , ')
    result = result.replace(':', ' : ')
    result = result.replace(';', ' ; ')
    result = result.replace('!', ' ! ')
    result = result.replace('?', ' ? ')
    result = result.replace('\"', ' \" ')
    result = result.replace('\'', ' \' ')
    result = result.replace('(', ' ( ')
    result = result.replace(')', ' ) ') 
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.replace('  ', ' ')
    result = result.strip()
    result = result.split(' ')
    return result

and then call it in this way:

text = 'вы выполняете поиск, используя Google SSL;'
tokens = russianTokenizer(text)

Good luck, Marilena.


回答 8

毫无疑问,NLTK最适合此目的。但是开始使用NLTK会很痛苦(但是一旦安装,您就可以得到回报)

因此,这是位于http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html的简单的基于re的代码

# split up a paragraph into sentences
# using regular expressions


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split by multile characters

    #   regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print s.strip()

#output:
#   This is a sentence
#   This is an excited sentence

#   And do you think this is a question 

No doubt that NLTK is the most suitable for the purpose. But getting started with NLTK is quite painful (But once you install it – you just reap the rewards)

So here is simple re based code available at http://pythonicprose.blogspot.com/2009/09/python-split-paragraph-into-sentences.html

# split up a paragraph into sentences
# using regular expressions


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split by multile characters

    #   regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print s.strip()

#output:
#   This is a sentence
#   This is an excited sentence

#   And do you think this is a question 

回答 9

我必须阅读字幕文件,并将其拆分为句子。经过预处理(如删除.srt文件中的时间信息等)后,变量fullFile包含字幕文件的全文。下面的粗略方法将它们整齐地分成句子。可能我很幸运,句子总是(正确)以空格结尾。首先尝试执行此操作,如果有任何exceptions,请添加更多的检查和余额。

# Very approximate way to split the text into sentences - Break after ? . and !
fullFile = re.sub("(\!|\?|\.) ","\\1<BRK>",fullFile)
sentences = fullFile.split("<BRK>");
sentFile = open("./sentences.out", "w+");
for line in sentences:
    sentFile.write (line);
    sentFile.write ("\n");
sentFile.close;

哦! 好。我现在意识到,由于我的内容是西班牙语,所以我没有遇到与“史密斯先生”等人打交道的问题。但是,如果有人想要快速又脏的解析器…

I had to read subtitles files and split them into sentences. After pre-processing (like removing time information etc in the .srt files), the variable fullFile contained the full text of the subtitle file. The below crude way neatly split them into sentences. Probably I was lucky that the sentences always ended (correctly) with a space. Try this first and if it has any exceptions, add more checks and balances.

# Very approximate way to split the text into sentences - Break after ? . and !
fullFile = re.sub("(\!|\?|\.) ","\\1<BRK>",fullFile)
sentences = fullFile.split("<BRK>");
sentFile = open("./sentences.out", "w+");
for line in sentences:
    sentFile.write (line);
    sentFile.write ("\n");
sentFile.close;

Oh! well. I now realize that since my content was Spanish, I did not have the issues of dealing with “Mr. Smith” etc. Still, if someone wants a quick and dirty parser…


回答 10

我希望这对您使用拉丁文,中文,阿拉伯文有帮助

import re

punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|!|?|;|…| |!|؟|؛)+")
lines = []

with open('myData.txt','r',encoding="utf-8") as myFile:
    lines = punctuation.sub(r"\1\2<pad>", myFile.read())
    lines = [line.strip() for line in lines.split("<pad>") if line.strip()]

i hope this will help you on latin,chinese,arabic text

import re

punctuation = re.compile(r"([^\d+])(\.|!|\?|;|\n|。|!|?|;|…| |!|؟|؛)+")
lines = []

with open('myData.txt','r',encoding="utf-8") as myFile:
    lines = punctuation.sub(r"\1\2<pad>", myFile.read())
    lines = [line.strip() for line in lines.split("<pad>") if line.strip()]

回答 11

通过执行一些链接并为nltk进行了一些练习,正在处理类似的任务并遇到了此查询,下面的代码对我来说就像魔术一样。

from nltk.tokenize import sent_tokenize 
  
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text) 

输出:

['Hello everyone.',
 'Welcome to GeeksforGeeks.',
 'You are studying NLP article']

资料来源:https : //www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/

Was working on similar task and came across this query, by following few links and working on few exercises for nltk the below code worked for me like magic.

from nltk.tokenize import sent_tokenize 
  
text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article"
sent_tokenize(text) 

output:

['Hello everyone.',
 'Welcome to GeeksforGeeks.',
 'You are studying NLP article']

Source: https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/


如何在Python中拆分和解析字符串?

问题:如何在Python中拆分和解析字符串?

我正在尝试在python中拆分此字符串: 2.7.0_bf4fda703454

我想在下划线处拆分该字符串,_以便可以使用左侧的值。

I am trying to split this string in python: 2.7.0_bf4fda703454

I want to split that string on the underscore _ so that I can use the value on the left side.


回答 0

"2.7.0_bf4fda703454".split("_") 给出字符串列表:

In [1]: "2.7.0_bf4fda703454".split("_")
Out[1]: ['2.7.0', 'bf4fda703454']

这会在每个下划线处拆分字符串。如果要在第一次拆分后停止它,请使用"2.7.0_bf4fda703454".split("_", 1)

如果您知道字符串中包含下划线,那么您甚至可以将LHS和RHS解压缩为单独的变量:

In [8]: lhs, rhs = "2.7.0_bf4fda703454".split("_", 1)

In [9]: lhs
Out[9]: '2.7.0'

In [10]: rhs
Out[10]: 'bf4fda703454'

另一种方法是使用partition()。用法与上一个示例类似,不同之处在于它返回三个组件而不是两个。主要优点是,如果字符串不包含分隔符,则此方法不会失败。

"2.7.0_bf4fda703454".split("_") gives a list of strings:

In [1]: "2.7.0_bf4fda703454".split("_")
Out[1]: ['2.7.0', 'bf4fda703454']

This splits the string at every underscore. If you want it to stop after the first split, use "2.7.0_bf4fda703454".split("_", 1).

If you know for a fact that the string contains an underscore, you can even unpack the LHS and RHS into separate variables:

In [8]: lhs, rhs = "2.7.0_bf4fda703454".split("_", 1)

In [9]: lhs
Out[9]: '2.7.0'

In [10]: rhs
Out[10]: 'bf4fda703454'

An alternative is to use partition(). The usage is similar to the last example, except that it returns three components instead of two. The principal advantage is that this method doesn’t fail if the string doesn’t contain the separator.


回答 1

Python字符串解析演练

在空间上分割字符串,获取列表,显示其类型,然后打印出来:

el@apollo:~/foo$ python
>>> mystring = "What does the fox say?"

>>> mylist = mystring.split(" ")

>>> print type(mylist)
<type 'list'>

>>> print mylist
['What', 'does', 'the', 'fox', 'say?']

如果两个相邻的定界符相邻,则假定为空字符串:

el@apollo:~/foo$ python
>>> mystring = "its  so   fluffy   im gonna    DIE!!!"

>>> print mystring.split(" ")
['its', '', 'so', '', '', 'fluffy', '', '', 'im', 'gonna', '', '', '', 'DIE!!!']

在下划线处分割字符串,并获取列表中的第五项:

el@apollo:~/foo$ python
>>> mystring = "Time_to_fire_up_Kowalski's_Nuclear_reactor."

>>> mystring.split("_")[4]
"Kowalski's"

将多个空格合为一个

el@apollo:~/foo$ python
>>> mystring = 'collapse    these       spaces'

>>> mycollapsedstring = ' '.join(mystring.split())

>>> print mycollapsedstring.split(' ')
['collapse', 'these', 'spaces']

当您不向Python的split方法传递任何参数时,文档指出:“连续的空白行被视为单个分隔符,如果字符串的开头或结尾是空白,则结果在开头或结尾将不包含空字符串”。

抓住男孩们的帽子,解析一个正则表达式:

el@apollo:~/foo$ python
>>> mystring = 'zzzzzzabczzzzzzdefzzzzzzzzzghizzzzzzzzzzzz'
>>> import re
>>> mylist = re.split("[a-m]+", mystring)
>>> print mylist
['zzzzzz', 'zzzzzz', 'zzzzzzzzz', 'zzzzzzzzzzzz']

正则表达式“[AM] +”装置中的小写字母a通过m发生一次或多次被作为分隔符相匹配。re是要导入的库。

或者,如果您想一次将一项剁碎:

el@apollo:~/foo$ python
>>> mystring = "theres coffee in that nebula"

>>> mytuple = mystring.partition(" ")

>>> print type(mytuple)
<type 'tuple'>

>>> print mytuple
('theres', ' ', 'coffee in that nebula')

>>> print mytuple[0]
theres

>>> print mytuple[2]
coffee in that nebula

Python string parsing walkthrough

Split a string on space, get a list, show its type, print it out:

el@apollo:~/foo$ python
>>> mystring = "What does the fox say?"

>>> mylist = mystring.split(" ")

>>> print type(mylist)
<type 'list'>

>>> print mylist
['What', 'does', 'the', 'fox', 'say?']

If you have two delimiters next to each other, empty string is assumed:

el@apollo:~/foo$ python
>>> mystring = "its  so   fluffy   im gonna    DIE!!!"

>>> print mystring.split(" ")
['its', '', 'so', '', '', 'fluffy', '', '', 'im', 'gonna', '', '', '', 'DIE!!!']

Split a string on underscore and grab the 5th item in the list:

el@apollo:~/foo$ python
>>> mystring = "Time_to_fire_up_Kowalski's_Nuclear_reactor."

>>> mystring.split("_")[4]
"Kowalski's"

Collapse multiple spaces into one

el@apollo:~/foo$ python
>>> mystring = 'collapse    these       spaces'

>>> mycollapsedstring = ' '.join(mystring.split())

>>> print mycollapsedstring.split(' ')
['collapse', 'these', 'spaces']

When you pass no parameter to Python’s split method, the documentation states: “runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace”.

Hold onto your hats boys, parse on a regular expression:

el@apollo:~/foo$ python
>>> mystring = 'zzzzzzabczzzzzzdefzzzzzzzzzghizzzzzzzzzzzz'
>>> import re
>>> mylist = re.split("[a-m]+", mystring)
>>> print mylist
['zzzzzz', 'zzzzzz', 'zzzzzzzzz', 'zzzzzzzzzzzz']

The regular expression “[a-m]+” means the lowercase letters a through m that occur one or more times are matched as a delimiter. re is a library to be imported.

Or if you want to chomp the items one at a time:

el@apollo:~/foo$ python
>>> mystring = "theres coffee in that nebula"

>>> mytuple = mystring.partition(" ")

>>> print type(mytuple)
<type 'tuple'>

>>> print mytuple
('theres', ' ', 'coffee in that nebula')

>>> print mytuple[0]
theres

>>> print mytuple[2]
coffee in that nebula

回答 2

如果总是将LHS / RHS分开,则也可以使用partition字符串中内置的方法。它返回一个三元组,就(LHS, separator, RHS)好像找到了分隔符,(original_string, '', '')如果不存在分隔符:

>>> "2.7.0_bf4fda703454".partition('_')
('2.7.0', '_', 'bf4fda703454')

>>> "shazam".partition("_")
('shazam', '', '')

If it’s always going to be an even LHS/RHS split, you can also use the partition method that’s built into strings. It returns a 3-tuple as (LHS, separator, RHS) if the separator is found, and (original_string, '', '') if the separator wasn’t present:

>>> "2.7.0_bf4fda703454".partition('_')
('2.7.0', '_', 'bf4fda703454')

>>> "shazam".partition("_")
('shazam', '', '')

如何将没有空格的文本分割成单词列表?

问题:如何将没有空格的文本分割成单词列表?

输入: "tableapplechairtablecupboard..."很多单词

将此类文本拆分为单词列表并获得以下内容的有效算法是什么?

输出: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]

首先想到的是遍历所有可能的单词(从第一个字母开始)并找到可能的最长单词,然后从 position=word_position+len(word)

PS:
我们列出了所有可能的单词。
单词“ cupboard”可以是“ cup”和“ board”,选择时间最长。
语言:python,但主要是算法本身。

Input: "tableapplechairtablecupboard..." many words

What would be an efficient algorithm to split such text to the list of words and get:

Output: ["table", "apple", "chair", "table", ["cupboard", ["cup", "board"]], ...]

First thing that cames to mind is to go through all possible words (starting with first letter) and find the longest word possible, continue from position=word_position+len(word)

P.S.
We have a list of all possible words.
Word “cupboard” can be “cup” and “board”, select longest.
Language: python, but main thing is the algorithm itself.


回答 0

如果将朴素的算法应用于现实世界的数据,则效果不佳。这是一条20行的算法,该算法利用相对词频来为实词文本提供准确的结果。

(如果您想回答不使用词频的原始问题的答案,则需要完善“最长词”的确切含义:最好是使用20个字母的单词和10个3字母的单词,还是最好有五个10个字母的单词?确定精确的定义后,只需更改定义的行wordcost即可反映出预期的含义。)

这个主意

最好的进行方法是对输出的分布进行建模。一个很好的第一近似是假设所有单词都是独立分布的。然后,您只需要知道所有单词的相对频率即可。可以合理地假设它们遵循Zipf定律,也就是说,单词列表中排名为n的单词的概率约为1 /(n log N),其中N是词典中的单词数。

固定模型后,可以使用动态编程来推断空间的位置。最可能的句子是最大化每个单词的概率乘积的句子,并且可以通过动态编程轻松地计算出该句子。代替直接使用概率,我们使用定义为概率倒数的对数的代价来避免溢出。

代码

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

您可以使用

s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))

结果

我使用的是从Wikipedia的一小部分中拼凑而成的这个快速且肮脏的125k字字典

之前: thumbgreenappleactiveassignmentweekly隐喻。
之后:拇指绿苹果积极地分配每周的隐喻。

之前:有一种从html解析出的软注释信息,但是却重新获得了有限的字符内阁,例如,thumbgreenappleactiveassignmentlymetapho rapparentlytherearthumbgreenappleetc 字符串中应该用词条查询该词是否合理,这是最快的提取方法。

之后:从html解析出大量的人民评论文本信息,但是其中没有定界字符,例如,拇指绿苹果活跃作业每周隐喻显然在字符串中有拇指绿苹果等,我也有一个大词典查询单词是否合理,那么提取最快的方法是什么。

之前:夜晚是阴暗的暴风雨,除了偶然的间隔外,当被狂风吹过一阵子的时候,暴风雨席卷了整个伦敦的大街小巷,这在我们的幕后掠过,并迅速燃烧了灯火,直到黑暗中挣扎了。

之后:那是一个黑暗而暴风雨的夜晚,雨水倾盆而下,除了偶尔被狂风吹拂而扫过街道的间隔,因为在伦敦,我们的场景在房顶上嘎嘎作响,剧烈地搅动着房屋。挣扎着黑暗的灯火。

如您所见,它本质上是完美的。最重要的部分是确保将单词列表训练成与您实际遇到的相似的语料库,否则结果将很糟糕。


优化

该实现消耗线性时间和内存,因此它是相当有效的。如果需要进一步的提速,则可以从单词列表中构建后缀树,以减少候选集的大小。

如果您需要处理非常大的连续字符串,则合理的做法是拆分字符串以避免过多的内存使用。例如,您可以按10000个字符的块处理文本,并在每侧加1000个字符的边距,以避免边界效应。这将使内存使用降至最低,并且几乎可以肯定不会对质量产生任何影响。

A naive algorithm won’t give good results when applied to real-world data. Here is a 20-line algorithm that exploits relative word frequency to give accurate results for real-word text.

(If you want an answer to your original question which does not use word frequency, you need to refine what exactly is meant by “longest word”: is it better to have a 20-letter word and ten 3-letter words, or is it better to have five 10-letter words? Once you settle on a precise definition, you just have to change the line defining wordcost to reflect the intended meaning.)

The idea

The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf’s law, that is the word with rank n in the list of words has probability roughly 1/(n log N) where N is the number of words in the dictionary.

Once you have fixed the model, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probability of each individual word, and it’s easy to compute it with dynamic programming. Instead of directly using the probability we use a cost defined as the logarithm of the inverse of the probability to avoid overflows.

The code

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

which you can use with

s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))

The results

I am using this quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.

Before: thumbgreenappleactiveassignmentweeklymetaphor.
After: thumb green apple active assignment weekly metaphor.

Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearen odelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetapho rapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquery whetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.

After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so what s the fastest way of extraction thx a lot.

Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness.

After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.

As you can see it is essentially flawless. The most important part is to make sure your word list was trained to a corpus similar to what you will actually encounter, otherwise the results will be very bad.


Optimization

The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the set of candidates.

If you need to process a very large consecutive string it would be reasonable to split the string to avoid excessive memory usage. For example you could process the text in blocks of 10000 characters plus a margin of 1000 characters on either side to avoid boundary effects. This will keep memory usage to a minimum and will have almost certainly no effect on the quality.


回答 1

基于最佳答案中的出色工作,我创建了一个pip易于使用的软件包。

>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']

要安装,请运行pip install wordninja

唯一的区别是微小的。这将返回a list而不是a str,它适用于python3,它包括单词列表并正确拆分,即使存在非字母字符(如下划线,破折号等)也是如此。

再次感谢通用人类!

https://github.com/keredson/wordninja

Based on the excellent work in the top answer, I’ve created a pip package for easy use.

>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']

To install, run pip install wordninja.

The only differences are minor. This returns a list rather than a str, it works in python3, it includes the word list and properly splits even if there are non-alpha chars (like underscores, dashes, etc).

Thanks again to Generic Human!

https://github.com/keredson/wordninja


回答 2

这是使用递归搜索的解决方案:

def find_words(instring, prefix = '', words = None):
    if not instring:
        return []
    if words is None:
        words = set()
        with open('/usr/share/dict/words') as f:
            for line in f:
                words.add(line.strip())
    if (not prefix) and (instring in words):
        return [instring]
    prefix, suffix = prefix + instring[0], instring[1:]
    solutions = []
    # Case 1: prefix in solution
    if prefix in words:
        try:
            solutions.append([prefix] + find_words(suffix, '', words))
        except ValueError:
            pass
    # Case 2: prefix not in solution
    try:
        solutions.append(find_words(suffix, prefix, words))
    except ValueError:
        pass
    if solutions:
        return sorted(solutions,
                      key = lambda solution: [len(word) for word in solution],
                      reverse = True)[0]
    else:
        raise ValueError('no solution')

print(find_words('tableapplechairtablecupboard'))
print(find_words('tableprechaun', words = set(['tab', 'table', 'leprechaun'])))

Yield

['table', 'apple', 'chair', 'table', 'cupboard']
['tab', 'leprechaun']

Here is solution using recursive search:

def find_words(instring, prefix = '', words = None):
    if not instring:
        return []
    if words is None:
        words = set()
        with open('/usr/share/dict/words') as f:
            for line in f:
                words.add(line.strip())
    if (not prefix) and (instring in words):
        return [instring]
    prefix, suffix = prefix + instring[0], instring[1:]
    solutions = []
    # Case 1: prefix in solution
    if prefix in words:
        try:
            solutions.append([prefix] + find_words(suffix, '', words))
        except ValueError:
            pass
    # Case 2: prefix not in solution
    try:
        solutions.append(find_words(suffix, prefix, words))
    except ValueError:
        pass
    if solutions:
        return sorted(solutions,
                      key = lambda solution: [len(word) for word in solution],
                      reverse = True)[0]
    else:
        raise ValueError('no solution')

print(find_words('tableapplechairtablecupboard'))
print(find_words('tableprechaun', words = set(['tab', 'table', 'leprechaun'])))

yields

['table', 'apple', 'chair', 'table', 'cupboard']
['tab', 'leprechaun']

回答 3

使用一个trie 数据结构(该结构包含可能的单词的列表),执行以下操作不会太复杂:

  1. 前进指针(在串联字符串中)
  2. 查找并将相应的节点存储在trie中
  3. 如果trie节点有子节点(例如,单词较长),请转到1。
  4. 如果到达的节点没有孩子,则会出现最长的单词匹配;将单词(存储在节点中或在遍历遍历过程中只是串联在一起)添加到结果列表中,将指针重置为trie(或重置引用),然后重新开始

Using a trie data structure, which holds the list of possible words, it would not be too complicated to do the following:

  1. Advance pointer (in the concatenated string)
  2. Lookup and store the corresponding node in the trie
  3. If the trie node has children (e.g. there are longer words), go to 1.
  4. If the node reached has no children, a longest word match happened; add the word (stored in the node or just concatenated during trie traversal) to the result list, reset the pointer in the trie (or reset the reference), and start over

回答 4

Unutbu的解决方案非常接近,但是我发现代码难以阅读,并且没有产生预期的结果。通用人类解决方案的缺点是需要单词频率。不适用于所有用例。

这是一个使用分而治之算法的简单解决方案。

  1. 它试图使 Eg find_words('cupboard')返回['cupboard']而不是['cup', 'board'](假设cupboardcup并且board在字典中)的单词数量最少。
  2. 最佳解决方案不是唯一的,下面的实现返回一个解决方案。find_words('charactersin')可能会返回,['characters', 'in']或者可能会返回['character', 'sin'](如下所示)。您可以很容易地修改算法以返回所有最优解。
  3. 在此实现中,将记住解决方案,以使其在合理的时间内运行。

代码:

words = set()
with open('/usr/share/dict/words') as f:
    for line in f:
        words.add(line.strip())

solutions = {}
def find_words(instring):
    # First check if instring is in the dictionnary
    if instring in words:
        return [instring]
    # No... But maybe it's a result we already computed
    if instring in solutions:
        return solutions[instring]
    # Nope. Try to split the string at all position to recursively search for results
    best_solution = None
    for i in range(1, len(instring) - 1):
        part1 = find_words(instring[:i])
        part2 = find_words(instring[i:])
        # Both parts MUST have a solution
        if part1 is None or part2 is None:
            continue
        solution = part1 + part2
        # Is the solution found "better" than the previous one?
        if best_solution is None or len(solution) < len(best_solution):
            best_solution = solution
    # Remember (memoize) this solution to avoid having to recompute it
    solutions[instring] = best_solution
    return best_solution

在我的3GHz机器上,这大约需要5秒:

result = find_words("thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearenodelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetaphorapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquerywhetherthewordisreasonablesowhatsthefastestwayofextractionthxalot")
assert(result is not None)
print ' '.join(result)

从html解析的人民评论文本信息的全部内容,但没有定界字符,例如,拇指绿苹果活动任务每周隐喻显然在字符串中有拇指绿苹果等,我也有一个大型词典来查询是否这个词很合理,所以提取很多最快的方法是什么

Unutbu’s solution was quite close but I find the code difficult to read, and it didn’t yield the expected result. Generic Human’s solution has the drawback that it needs word frequencies. Not appropriate for all use case.

Here’s a simple solution using a Divide and Conquer algorithm.

  1. It tries to minimize the number of words E.g. find_words('cupboard') will return ['cupboard'] rather than ['cup', 'board'] (assuming that cupboard, cup and board are in the dictionnary)
  2. The optimal solution is not unique, the implementation below returns a solution. find_words('charactersin') could return ['characters', 'in'] or maybe it will return ['character', 'sin'] (as seen below). You could quite easily modify the algorithm to return all optimal solutions.
  3. In this implementation solutions are memoized so that it runs in a reasonable time.

The code:

words = set()
with open('/usr/share/dict/words') as f:
    for line in f:
        words.add(line.strip())

solutions = {}
def find_words(instring):
    # First check if instring is in the dictionnary
    if instring in words:
        return [instring]
    # No... But maybe it's a result we already computed
    if instring in solutions:
        return solutions[instring]
    # Nope. Try to split the string at all position to recursively search for results
    best_solution = None
    for i in range(1, len(instring) - 1):
        part1 = find_words(instring[:i])
        part2 = find_words(instring[i:])
        # Both parts MUST have a solution
        if part1 is None or part2 is None:
            continue
        solution = part1 + part2
        # Is the solution found "better" than the previous one?
        if best_solution is None or len(solution) < len(best_solution):
            best_solution = solution
    # Remember (memoize) this solution to avoid having to recompute it
    solutions[instring] = best_solution
    return best_solution

This will take about about 5sec on my 3GHz machine:

result = find_words("thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearenodelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetaphorapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquerywhetherthewordisreasonablesowhatsthefastestwayofextractionthxalot")
assert(result is not None)
print ' '.join(result)

the reis masses of text information of peoples comments which is parsed from h t m l but there are no delimited character sin them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple e t c in the string i also have a large dictionary to query whether the word is reasonable so whats the fastest way of extraction t h x a lot


回答 5

https://stackoverflow.com/users/1515832/generic-human的答案很棒。但是我见过的最好的实现是彼得·诺维格本人在他的《美丽的数据》一书中写的。

在粘贴他的代码之前,让我先解释一下为什么Norvig的方法更准确(尽管在代码方面会更慢一些,也更长一些)。

1)数据要好一些-就大小和精度而言(他使用字数而不是简单的排名)2)更重要的是,n-gram背后的逻辑确实使方法如此精确。

他在书中提供的示例是拆分字符串“ sitdown”的问题。现在,一个非二字组合的字符串拆分方法将考虑p(’sit’)* p(’down’),如果小于p(’sitdown’)-这种情况经常发生-它将不会拆分它,但我们希望(大多数时候)。

但是,当您具有bigram模型时,可以将p(’sit down’)视为bigram vs p(’sitdown’),并且前者获胜。基本上,如果您不使用双字母组,它将把您拆分的单词的概率视为独立,而不是这种情况,某些单词更有可能一个接一个地出现。不幸的是,这些词在很多情况下经常被粘在一起,使拆分器感到困惑。

这是数据的链接(它是3个独立问题的数据,而细分仅仅是一个。请阅读本章以获取详细信息):http : //norvig.com/ngrams/

这是代码的链接:http : //norvig.com/ngrams/ngrams.py

这些链接已经建立了一段时间,但是无论如何我都会在此处复制粘贴代码的分段部分

import re, string, random, glob, operator, heapq
from collections import defaultdict
from math import log10

def memo(f):
    "Memoize function f."
    table = {}
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)
        return table[args]
    fmemo.memo = table
    return fmemo

def test(verbose=None):
    """Run some tests, taken from the chapter.
    Since the hillclimbing algorithm is randomized, some tests may fail."""
    import doctest
    print 'Running tests...'
    doctest.testfile('ngrams-test.txt', verbose=verbose)

################ Word Segmentation (p. 223)

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:]) 
            for i in range(min(len(text), L))]

def Pwords(words): 
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

#### Support functions (p. 224)

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key,count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.itervalues()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key): 
        if key in self: return self[key]/self.N  
        else: return self.missingfn(key, self.N)

def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in file(name):
        yield line.split(sep)

def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))

N = 1024908267229 ## Number of tokens

Pw  = Pdist(datafile('count_1w.txt'), N, avoid_long_words)

#### segment2: second version, with bigram counts, (p. 226-227)

def cPw(word, prev):
    "Conditional probability of word, given previous word."
    try:
        return P2w[prev + ' ' + word]/float(Pw[prev])
    except KeyError:
        return Pw(word)

P2w = Pdist(datafile('count_2w.txt'), N)

@memo 
def segment2(text, prev='<S>'): 
    "Return (log P(words), words), where words is the best segmentation." 
    if not text: return 0.0, [] 
    candidates = [combine(log10(cPw(first, prev)), first, segment2(rem, first)) 
                  for first,rem in splits(text)] 
    return max(candidates) 

def combine(Pfirst, first, (Prem, rem)): 
    "Combine first and rem results into one (probability, words) pair." 
    return Pfirst+Prem, [first]+rem 

The answer by https://stackoverflow.com/users/1515832/generic-human is great. But the best implementation of this I’ve ever seen was written Peter Norvig himself in his book ‘Beautiful Data’.

Before I paste his code, let me expand on why Norvig’s method is more accurate (although a little slower and longer in terms of code).

1) The data is a bit better – both in terms of size and in terms of precision (he uses a word count rather than a simple ranking) 2) More importantly, it’s the logic behind n-grams that really makes the approach so accurate.

The example he provides in his book is the problem of splitting a string ‘sitdown’. Now a non-bigram method of string split would consider p(‘sit’) * p (‘down’), and if this less than the p(‘sitdown’) – which will be the case quite often – it will NOT split it, but we’d want it to (most of the time).

However when you have the bigram model you could value p(‘sit down’) as a bigram vs p(‘sitdown’) and the former wins. Basically, if you don’t use bigrams, it treats the probability of the words you’re splitting as independent, which is not the case, some words are more likely to appear one after the other. Unfortunately those are also the words that are often stuck together in a lot of instances and confuses the splitter.

Here’s the link to the data (it’s data for 3 separate problems and segmentation is only one. Please read the chapter for details): http://norvig.com/ngrams/

and here’s the link to the code: http://norvig.com/ngrams/ngrams.py

These links have been up a while, but I’ll copy paste the segmentation part of the code here anyway

import re, string, random, glob, operator, heapq
from collections import defaultdict
from math import log10

def memo(f):
    "Memoize function f."
    table = {}
    def fmemo(*args):
        if args not in table:
            table[args] = f(*args)
        return table[args]
    fmemo.memo = table
    return fmemo

def test(verbose=None):
    """Run some tests, taken from the chapter.
    Since the hillclimbing algorithm is randomized, some tests may fail."""
    import doctest
    print 'Running tests...'
    doctest.testfile('ngrams-test.txt', verbose=verbose)

################ Word Segmentation (p. 223)

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:]) 
            for i in range(min(len(text), L))]

def Pwords(words): 
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

#### Support functions (p. 224)

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key,count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.itervalues()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key): 
        if key in self: return self[key]/self.N  
        else: return self.missingfn(key, self.N)

def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in file(name):
        yield line.split(sep)

def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))

N = 1024908267229 ## Number of tokens

Pw  = Pdist(datafile('count_1w.txt'), N, avoid_long_words)

#### segment2: second version, with bigram counts, (p. 226-227)

def cPw(word, prev):
    "Conditional probability of word, given previous word."
    try:
        return P2w[prev + ' ' + word]/float(Pw[prev])
    except KeyError:
        return Pw(word)

P2w = Pdist(datafile('count_2w.txt'), N)

@memo 
def segment2(text, prev='<S>'): 
    "Return (log P(words), words), where words is the best segmentation." 
    if not text: return 0.0, [] 
    candidates = [combine(log10(cPw(first, prev)), first, segment2(rem, first)) 
                  for first,rem in splits(text)] 
    return max(candidates) 

def combine(Pfirst, first, (Prem, rem)): 
    "Combine first and rem results into one (probability, words) pair." 
    return Pfirst+Prem, [first]+rem 

回答 6

这是转换为JavaScript的可接受答案(需要node.js,以及来自https://github.com/keredson/wordninja的文件“ wordninja_words.txt” ):

var fs = require("fs");

var splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");
var maxWordLen = 0;
var wordCost = {};

fs.readFile("./wordninja_words.txt", 'utf8', function(err, data) {
    if (err) {
        throw err;
    }
    var words = data.split('\n');
    words.forEach(function(word, index) {
        wordCost[word] = Math.log((index + 1) * Math.log(words.length));
    })
    words.forEach(function(word) {
        if (word.length > maxWordLen)
            maxWordLen = word.length;
    });
    console.log(maxWordLen)
    splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");
    console.log(split(process.argv[2]));
});


function split(s) {
    var list = [];
    s.split(splitRegex).forEach(function(sub) {
        _split(sub).forEach(function(word) {
            list.push(word);
        })
    })
    return list;
}
module.exports = split;


function _split(s) {
    var cost = [0];

    function best_match(i) {
        var candidates = cost.slice(Math.max(0, i - maxWordLen), i).reverse();
        var minPair = [Number.MAX_SAFE_INTEGER, 0];
        candidates.forEach(function(c, k) {
            if (wordCost[s.substring(i - k - 1, i).toLowerCase()]) {
                var ccost = c + wordCost[s.substring(i - k - 1, i).toLowerCase()];
            } else {
                var ccost = Number.MAX_SAFE_INTEGER;
            }
            if (ccost < minPair[0]) {
                minPair = [ccost, k + 1];
            }
        })
        return minPair;
    }

    for (var i = 1; i < s.length + 1; i++) {
        cost.push(best_match(i)[0]);
    }

    var out = [];
    i = s.length;
    while (i > 0) {
        var c = best_match(i)[0];
        var k = best_match(i)[1];
        if (c == cost[i])
            console.log("Alert: " + c);

        var newToken = true;
        if (s.slice(i - k, i) != "'") {
            if (out.length > 0) {
                if (out[-1] == "'s" || (Number.isInteger(s[i - 1]) && Number.isInteger(out[-1][0]))) {
                    out[-1] = s.slice(i - k, i) + out[-1];
                    newToken = false;
                }
            }
        }

        if (newToken) {
            out.push(s.slice(i - k, i))
        }

        i -= k

    }
    return out.reverse();
}

Here is the accepted answer translated to JavaScript (requires node.js, and the file “wordninja_words.txt” from https://github.com/keredson/wordninja):

var fs = require("fs");

var splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");
var maxWordLen = 0;
var wordCost = {};

fs.readFile("./wordninja_words.txt", 'utf8', function(err, data) {
    if (err) {
        throw err;
    }
    var words = data.split('\n');
    words.forEach(function(word, index) {
        wordCost[word] = Math.log((index + 1) * Math.log(words.length));
    })
    words.forEach(function(word) {
        if (word.length > maxWordLen)
            maxWordLen = word.length;
    });
    console.log(maxWordLen)
    splitRegex = new RegExp("[^a-zA-Z0-9']+", "g");
    console.log(split(process.argv[2]));
});


function split(s) {
    var list = [];
    s.split(splitRegex).forEach(function(sub) {
        _split(sub).forEach(function(word) {
            list.push(word);
        })
    })
    return list;
}
module.exports = split;


function _split(s) {
    var cost = [0];

    function best_match(i) {
        var candidates = cost.slice(Math.max(0, i - maxWordLen), i).reverse();
        var minPair = [Number.MAX_SAFE_INTEGER, 0];
        candidates.forEach(function(c, k) {
            if (wordCost[s.substring(i - k - 1, i).toLowerCase()]) {
                var ccost = c + wordCost[s.substring(i - k - 1, i).toLowerCase()];
            } else {
                var ccost = Number.MAX_SAFE_INTEGER;
            }
            if (ccost < minPair[0]) {
                minPair = [ccost, k + 1];
            }
        })
        return minPair;
    }

    for (var i = 1; i < s.length + 1; i++) {
        cost.push(best_match(i)[0]);
    }

    var out = [];
    i = s.length;
    while (i > 0) {
        var c = best_match(i)[0];
        var k = best_match(i)[1];
        if (c == cost[i])
            console.log("Alert: " + c);

        var newToken = true;
        if (s.slice(i - k, i) != "'") {
            if (out.length > 0) {
                if (out[-1] == "'s" || (Number.isInteger(s[i - 1]) && Number.isInteger(out[-1][0]))) {
                    out[-1] = s.slice(i - k, i) + out[-1];
                    newToken = false;
                }
            }
        }

        if (newToken) {
            out.push(s.slice(i - k, i))
        }

        i -= k

    }
    return out.reverse();
}

回答 7

如果将单词表预编译到DFA中(这将非常慢),则匹配输入所花费的时间将与字符串的长度成比例(实际上,这仅比遍历字符串慢一点)。

实际上,这是前面提到的trie算法的更通用的版本。我只是为了完全完成而提到它-到目前为止,您还不能使用DFA实现。RE2可以使用,但是我不知道Python绑定是否可以让您调整DFA的大小,然后再丢弃编译的DFA数据并进行NFA搜索。

If you precompile the wordlist into a DFA (which will be very slow), then the time it takes to match an input will be proportional to the length of the string (in fact, only a little slower than just iterating over the string).

This is effectively a more general version of the trie algorithm which was mentioned earlier. I only mention it for completeless — as of yet, there’s no DFA implementation you can just use. RE2 would work, but I don’t know if the Python bindings let you tune how large you allow a DFA to be before it just throws away the compiled DFA data and does NFA search.


回答 8

看起来相当平凡的回溯可以做到。从字符串的开头开始。向右扫描,直到您有一个字。然后,在字符串的其余部分调用该函数。如果函数一直扫描到右边而没有识别出一个单词,则该函数返回“ false”。否则,返回找到的单词以及递归调用返回的单词列表。

示例:“ tableapple”。查找“ tab”,然后查找“ leap”,但“ ple”中没有单词。“ leapple”中没有其他词。查找“表”,然后查找“ app”。“ le”不是一个词,所以尝试苹果,认识并返回。

为了尽可能长的时间,请继续前进,只发出(而不是返回)正确的解决方案;然后,根据您选择的任何标准(最大值,最小值,平均值等)选择最佳的一种

It seems like fairly mundane backtracking will do. Start at the beggining of the string. Scan right until you have a word. Then, call the function on the rest of the string. Function returns “false” if it scans all the way to the right without recognizing a word. Otherwise, returns the word it found and the list of words returned by the recursive call.

Example: “tableapple”. Finds “tab”, then “leap”, but no word in “ple”. No other word in “leapple”. Finds “table”, then “app”. “le” not a word, so tries apple, recognizes, returns.

To get longest possible, keep going, only emitting (rather than returning) correct solutions; then, choose the optimal one by any criterion you choose (maxmax, minmax, average, etc.)


回答 9

基于unutbu的解决方案,我实现了Java版本:

private static List<String> splitWordWithoutSpaces(String instring, String suffix) {
    if(isAWord(instring)) {
        if(suffix.length() > 0) {
            List<String> rest = splitWordWithoutSpaces(suffix, "");
            if(rest.size() > 0) {
                List<String> solutions = new LinkedList<>();
                solutions.add(instring);
                solutions.addAll(rest);
                return solutions;
            }
        } else {
            List<String> solutions = new LinkedList<>();
            solutions.add(instring);
            return solutions;
        }

    }
    if(instring.length() > 1) {
        String newString = instring.substring(0, instring.length()-1);
        suffix = instring.charAt(instring.length()-1) + suffix;
        List<String> rest = splitWordWithoutSpaces(newString, suffix);
        return rest;
    }
    return Collections.EMPTY_LIST;
}

输入: "tableapplechairtablecupboard"

输出: [table, apple, chair, table, cupboard]

输入: "tableprechaun"

输出: [tab, leprechaun]

Based on unutbu’s solution I’ve implemented a Java version:

private static List<String> splitWordWithoutSpaces(String instring, String suffix) {
    if(isAWord(instring)) {
        if(suffix.length() > 0) {
            List<String> rest = splitWordWithoutSpaces(suffix, "");
            if(rest.size() > 0) {
                List<String> solutions = new LinkedList<>();
                solutions.add(instring);
                solutions.addAll(rest);
                return solutions;
            }
        } else {
            List<String> solutions = new LinkedList<>();
            solutions.add(instring);
            return solutions;
        }

    }
    if(instring.length() > 1) {
        String newString = instring.substring(0, instring.length()-1);
        suffix = instring.charAt(instring.length()-1) + suffix;
        List<String> rest = splitWordWithoutSpaces(newString, suffix);
        return rest;
    }
    return Collections.EMPTY_LIST;
}

Input: "tableapplechairtablecupboard"

Output: [table, apple, chair, table, cupboard]

Input: "tableprechaun"

Output: [tab, leprechaun]


回答 10

对于德语,有CharSplit,它使用机器学习,并且对几个单词的字符串非常有效。

https://github.com/dtuggener/CharSplit

For German language there is CharSplit which uses machine learning and works pretty good for strings of a few words.

https://github.com/dtuggener/CharSplit


回答 11

扩展@miku的建议使用a Trie,仅追加Trie是相对简单的实现python

class Node:
    def __init__(self, is_word=False):
        self.children = {}
        self.is_word = is_word

class TrieDictionary:
    def __init__(self, words=tuple()):
        self.root = Node()
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for c in word:
            node = node.children.setdefault(c, Node())
        node.is_word = True

    def lookup(self, word, from_node=None):
        node = self.root if from_node is None else from_node
        for c in word:
            try:
                node = node.children[c]
            except KeyError:
                return None

        return node

然后,我们可以Trie根据一组单词构建一个基于字典的字典:

dictionary = {"a", "pea", "nut", "peanut", "but", "butt", "butte", "butter"}
trie_dictionary = TrieDictionary(words=dictionary)

它将产生如下所示的树(*指示单词的开头或结尾):

* -> a*
 \\\ 
  \\\-> p -> e -> a*
   \\              \-> n -> u -> t*
    \\
     \\-> b -> u -> t*
      \\             \-> t*
       \\                 \-> e*
        \\                     \-> r*
         \
          \-> n -> u -> t*

通过将其与关于如何选择单词的启发式方法相结合,我们可以将其合并到解决方案中。例如,与较短的单词相比,我们更喜欢较长的单词:

def using_trie_longest_word_heuristic(s):
    node = None
    possible_indexes = []

    # O(1) short-circuit if whole string is a word, doesn't go against longest-word wins
    if s in dictionary:
        return [ s ]

    for i in range(len(s)):
        # traverse the trie, char-wise to determine intermediate words
        node = trie_dictionary.lookup(s[i], from_node=node)

        # no more words start this way
        if node is None:
            # iterate words we have encountered from biggest to smallest
            for possible in possible_indexes[::-1]:
                # recurse to attempt to solve the remaining sub-string
                end_of_phrase = using_trie_longest_word_heuristic(s[possible+1:])

                # if we have a solution, return this word + our solution
                if end_of_phrase:
                    return [ s[:possible+1] ] + end_of_phrase

            # unsolvable
            break

        # if this is a leaf, append the index to the possible words list
        elif node.is_word:
            possible_indexes.append(i)

    # empty string OR unsolvable case 
    return []

我们可以这样使用此功能:

>>> using_trie_longest_word_heuristic("peanutbutter")
[ "peanut", "butter" ]

由于我们在Trie搜索越来越长的单词时一直保持在的位置,因此trie每个可能的解决方案最多遍历一次(而不是:,的2次数)。最终的短路使我们免于在最坏的情况下通过字符串随意移动。peanutpeapeanut

最终结果仅需进行少量检查:

'peanutbutter' - not a word, go charwise
'p' - in trie, use this node
'e' - in trie, use this node
'a' - in trie and edge, store potential word and use this node
'n' - in trie, use this node
'u' - in trie, use this node
't' - in trie and edge, store potential word and use this node
'b' - not in trie from `peanut` vector
'butter' - remainder of longest is a word

该解决方案的一个好处是,您可以很快地知道是否存在带有给定前缀的较长单词,从而无需针对字典全面测试序列组合。unsolvable与其他实现方式相比,这样做还使答案相对便宜。

该解决方案的缺点是要占用大量内存,trie以及trie前期构建成本。

Expanding on @miku’s suggestion to use a Trie, an append-only Trie is relatively straight-forward to implement in python:

class Node:
    def __init__(self, is_word=False):
        self.children = {}
        self.is_word = is_word

class TrieDictionary:
    def __init__(self, words=tuple()):
        self.root = Node()
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for c in word:
            node = node.children.setdefault(c, Node())
        node.is_word = True

    def lookup(self, word, from_node=None):
        node = self.root if from_node is None else from_node
        for c in word:
            try:
                node = node.children[c]
            except KeyError:
                return None

        return node

We can then build a Trie-based dictionary from a set of words:

dictionary = {"a", "pea", "nut", "peanut", "but", "butt", "butte", "butter"}
trie_dictionary = TrieDictionary(words=dictionary)

Which will produce a tree that looks like this (* indicates beginning or end of a word):

* -> a*
 \\\ 
  \\\-> p -> e -> a*
   \\              \-> n -> u -> t*
    \\
     \\-> b -> u -> t*
      \\             \-> t*
       \\                 \-> e*
        \\                     \-> r*
         \
          \-> n -> u -> t*

We can incorporate this into a solution by combining it with a heuristic about how to choose words. For example we can prefer longer words over shorter words:

def using_trie_longest_word_heuristic(s):
    node = None
    possible_indexes = []

    # O(1) short-circuit if whole string is a word, doesn't go against longest-word wins
    if s in dictionary:
        return [ s ]

    for i in range(len(s)):
        # traverse the trie, char-wise to determine intermediate words
        node = trie_dictionary.lookup(s[i], from_node=node)

        # no more words start this way
        if node is None:
            # iterate words we have encountered from biggest to smallest
            for possible in possible_indexes[::-1]:
                # recurse to attempt to solve the remaining sub-string
                end_of_phrase = using_trie_longest_word_heuristic(s[possible+1:])

                # if we have a solution, return this word + our solution
                if end_of_phrase:
                    return [ s[:possible+1] ] + end_of_phrase

            # unsolvable
            break

        # if this is a leaf, append the index to the possible words list
        elif node.is_word:
            possible_indexes.append(i)

    # empty string OR unsolvable case 
    return []

We can use this function like this:

>>> using_trie_longest_word_heuristic("peanutbutter")
[ "peanut", "butter" ]

Because we maintain our position in the Trie as we search for longer and longer words, we traverse the trie at most once per possible solution (rather than 2 times for peanut: pea, peanut). The final short-circuit saves us from walking char-wise through the string in the worst-case.

The final result is only a handful of inspections:

'peanutbutter' - not a word, go charwise
'p' - in trie, use this node
'e' - in trie, use this node
'a' - in trie and edge, store potential word and use this node
'n' - in trie, use this node
'u' - in trie, use this node
't' - in trie and edge, store potential word and use this node
'b' - not in trie from `peanut` vector
'butter' - remainder of longest is a word

A benefit of this solution is in the fact that you know very quickly if longer words exist with a given prefix, which spares the need to exhaustively test sequence combinations against a dictionary. It also makes getting to an unsolvable answer comparatively cheap to other implementations.

The downsides of this solution are a large memory footprint for the trie and the cost of building the trie up-front.


回答 12

如果您包含字符串中包含的单词的详尽列表:

word_list = ["table", "apple", "chair", "cupboard"]

使用列表理解来遍历列表以找到单词以及单词出现的次数。

string = "tableapplechairtablecupboard"

def split_string(string, word_list):

    return ("".join([(item + " ")*string.count(item.lower()) for item in word_list if item.lower() in string])).strip()

该函数string按列表顺序返回单词的输出table table apple chair cupboard

If you have an exhaustive list of the words contained within the string:

word_list = ["table", "apple", "chair", "cupboard"]

Using list comprehension to iterate over the list to locate the word and how many times it appears.

string = "tableapplechairtablecupboard"

def split_string(string, word_list):

    return ("".join([(item + " ")*string.count(item.lower()) for item in word_list if item.lower() in string])).strip()


The function returns a string output of words in order of the list table table apple chair cupboard


回答 13

非常感谢https://github.com/keredson/wordninja/中的帮助

在我看来,Java的贡献很小。

可以将public方法splitContiguousWords与其他2种方法一起嵌入在同一目录中具有ninja_words.txt的类中(或根据编码器的选择进行修改)。并且该方法splitContiguousWords可以用于该目的。

public List<String> splitContiguousWords(String sentence) {

    String splitRegex = "[^a-zA-Z0-9']+";
    Map<String, Number> wordCost = new HashMap<>();
    List<String> dictionaryWords = IOUtils.linesFromFile("ninja_words.txt", StandardCharsets.UTF_8.name());
    double naturalLogDictionaryWordsCount = Math.log(dictionaryWords.size());
    long wordIdx = 0;
    for (String word : dictionaryWords) {
        wordCost.put(word, Math.log(++wordIdx * naturalLogDictionaryWordsCount));
    }
    int maxWordLength = Collections.max(dictionaryWords, Comparator.comparing(String::length)).length();
    List<String> splitWords = new ArrayList<>();
    for (String partSentence : sentence.split(splitRegex)) {
        splitWords.add(split(partSentence, wordCost, maxWordLength));
    }
    log.info("Split word for the sentence: {}", splitWords);
    return splitWords;
}

private String split(String partSentence, Map<String, Number> wordCost, int maxWordLength) {
    List<Pair<Number, Number>> cost = new ArrayList<>();
    cost.add(new Pair<>(Integer.valueOf(0), Integer.valueOf(0)));
    for (int index = 1; index < partSentence.length() + 1; index++) {
        cost.add(bestMatch(partSentence, cost, index, wordCost, maxWordLength));
    }
    int idx = partSentence.length();
    List<String> output = new ArrayList<>();
    while (idx > 0) {
        Pair<Number, Number> candidate = bestMatch(partSentence, cost, idx, wordCost, maxWordLength);
        Number candidateCost = candidate.getKey();
        Number candidateIndexValue = candidate.getValue();
        if (candidateCost.doubleValue() != cost.get(idx).getKey().doubleValue()) {
            throw new RuntimeException("Candidate cost unmatched; This should not be the case!");
        }
        boolean newToken = true;
        String token = partSentence.substring(idx - candidateIndexValue.intValue(), idx);
        if (token != "\'" && output.size() > 0) {
            String lastWord = output.get(output.size() - 1);
            if (lastWord.equalsIgnoreCase("\'s") ||
                    (Character.isDigit(partSentence.charAt(idx - 1)) && Character.isDigit(lastWord.charAt(0)))) {
                output.set(output.size() - 1, token + lastWord);
                newToken = false;
            }
        }
        if (newToken) {
            output.add(token);
        }
        idx -= candidateIndexValue.intValue();
    }
    return String.join(" ", Lists.reverse(output));
}


private Pair<Number, Number> bestMatch(String partSentence, List<Pair<Number, Number>> cost, int index,
                      Map<String, Number> wordCost, int maxWordLength) {
    List<Pair<Number, Number>> candidates = Lists.reverse(cost.subList(Math.max(0, index - maxWordLength), index));
    int enumerateIdx = 0;
    Pair<Number, Number> minPair = new Pair<>(Integer.MAX_VALUE, Integer.valueOf(enumerateIdx));
    for (Pair<Number, Number> pair : candidates) {
        ++enumerateIdx;
        String subsequence = partSentence.substring(index - enumerateIdx, index).toLowerCase();
        Number minCost = Integer.MAX_VALUE;
        if (wordCost.containsKey(subsequence)) {
            minCost = pair.getKey().doubleValue() + wordCost.get(subsequence).doubleValue();
        }
        if (minCost.doubleValue() < minPair.getKey().doubleValue()) {
            minPair = new Pair<>(minCost.doubleValue(), enumerateIdx);
        }
    }
    return minPair;
}

Many Thanks for the help in https://github.com/keredson/wordninja/

A small contribution of the same in Java from my side.

The public method splitContiguousWords could be embedded with the other 2 methods in the class having ninja_words.txt in same directory(or modified as per the choice of coder). And the method splitContiguousWords could be used for the purpose.

public List<String> splitContiguousWords(String sentence) {

    String splitRegex = "[^a-zA-Z0-9']+";
    Map<String, Number> wordCost = new HashMap<>();
    List<String> dictionaryWords = IOUtils.linesFromFile("ninja_words.txt", StandardCharsets.UTF_8.name());
    double naturalLogDictionaryWordsCount = Math.log(dictionaryWords.size());
    long wordIdx = 0;
    for (String word : dictionaryWords) {
        wordCost.put(word, Math.log(++wordIdx * naturalLogDictionaryWordsCount));
    }
    int maxWordLength = Collections.max(dictionaryWords, Comparator.comparing(String::length)).length();
    List<String> splitWords = new ArrayList<>();
    for (String partSentence : sentence.split(splitRegex)) {
        splitWords.add(split(partSentence, wordCost, maxWordLength));
    }
    log.info("Split word for the sentence: {}", splitWords);
    return splitWords;
}

private String split(String partSentence, Map<String, Number> wordCost, int maxWordLength) {
    List<Pair<Number, Number>> cost = new ArrayList<>();
    cost.add(new Pair<>(Integer.valueOf(0), Integer.valueOf(0)));
    for (int index = 1; index < partSentence.length() + 1; index++) {
        cost.add(bestMatch(partSentence, cost, index, wordCost, maxWordLength));
    }
    int idx = partSentence.length();
    List<String> output = new ArrayList<>();
    while (idx > 0) {
        Pair<Number, Number> candidate = bestMatch(partSentence, cost, idx, wordCost, maxWordLength);
        Number candidateCost = candidate.getKey();
        Number candidateIndexValue = candidate.getValue();
        if (candidateCost.doubleValue() != cost.get(idx).getKey().doubleValue()) {
            throw new RuntimeException("Candidate cost unmatched; This should not be the case!");
        }
        boolean newToken = true;
        String token = partSentence.substring(idx - candidateIndexValue.intValue(), idx);
        if (token != "\'" && output.size() > 0) {
            String lastWord = output.get(output.size() - 1);
            if (lastWord.equalsIgnoreCase("\'s") ||
                    (Character.isDigit(partSentence.charAt(idx - 1)) && Character.isDigit(lastWord.charAt(0)))) {
                output.set(output.size() - 1, token + lastWord);
                newToken = false;
            }
        }
        if (newToken) {
            output.add(token);
        }
        idx -= candidateIndexValue.intValue();
    }
    return String.join(" ", Lists.reverse(output));
}


private Pair<Number, Number> bestMatch(String partSentence, List<Pair<Number, Number>> cost, int index,
                      Map<String, Number> wordCost, int maxWordLength) {
    List<Pair<Number, Number>> candidates = Lists.reverse(cost.subList(Math.max(0, index - maxWordLength), index));
    int enumerateIdx = 0;
    Pair<Number, Number> minPair = new Pair<>(Integer.MAX_VALUE, Integer.valueOf(enumerateIdx));
    for (Pair<Number, Number> pair : candidates) {
        ++enumerateIdx;
        String subsequence = partSentence.substring(index - enumerateIdx, index).toLowerCase();
        Number minCost = Integer.MAX_VALUE;
        if (wordCost.containsKey(subsequence)) {
            minCost = pair.getKey().doubleValue() + wordCost.get(subsequence).doubleValue();
        }
        if (minCost.doubleValue() < minPair.getKey().doubleValue()) {
            minPair = new Pair<>(minCost.doubleValue(), enumerateIdx);
        }
    }
    return minPair;
}

回答 14

这会有所帮助

from wordsegment import load, segment
load()
segment('providesfortheresponsibilitiesofperson')

This will help

from wordsegment import load, segment
load()
segment('providesfortheresponsibilitiesofperson')

回答 15

您需要确定您的词汇表-也许任何免费单词列表都可以。

完成后,使用该词汇表来构建后缀树,并将输入流与此匹配:http : //en.wikipedia.org/wiki/Suffix_tree

You need to identify your vocabulary – perhaps any free word list will do.

Once done, use that vocabulary to build a suffix tree, and match your stream of input against that: http://en.wikipedia.org/wiki/Suffix_tree


如何在字符串Python中获取:之前的所有内容

问题:如何在字符串Python中获取:之前的所有内容

我正在寻找一种方法来在:之前获取字符串中的所有字母,但是我不知道从哪里开始。我会使用正则表达式吗?如果可以,怎么办?

string = "Username: How are you today?"

有人可以给我示范我可以做什么吗?

I am looking for a way to get all of the letters in a string before a : but I have no idea on where to start. Would I use regex? If so how?

string = "Username: How are you today?"

Can someone show me a example on what I could do?


回答 0

只需使用该split功能。它返回一个列表,因此您可以保留第一个元素:

>>> s1.split(':')
['Username', ' How are you today?']
>>> s1.split(':')[0]
'Username'

Just use the split function. It returns a list, so you can keep the first element:

>>> s1.split(':')
['Username', ' How are you today?']
>>> s1.split(':')[0]
'Username'

回答 1

使用index

>>> string = "Username: How are you today?"
>>> string[:string.index(":")]
'Username'

该索引将给您以下位置 :在字符串中,然后可以对其进行切片。

如果要使用正则表达式:

>>> import re
>>> re.match("(.*?):",string).group()
'Username'                       

match 从字符串开头开始匹配。

你也可以使用 itertools.takewhile

>>> import itertools
>>> "".join(itertools.takewhile(lambda x: x!=":", string))
'Username'

Using index:

>>> string = "Username: How are you today?"
>>> string[:string.index(":")]
'Username'

The index will give you the position of : in string, then you can slice it.

If you want to use regex:

>>> import re
>>> re.match("(.*?):",string).group()
'Username'                       

match matches from the start of the string.

you can also use itertools.takewhile

>>> import itertools
>>> "".join(itertools.takewhile(lambda x: x!=":", string))
'Username'

回答 2

你不需要regex这个

>>> s = "Username: How are you today?"

您可以使用split方法拆分的字符串':'的字符

>>> s.split(':')
['Username', ' How are you today?']

并切出元素[0]以获得字符串的第一部分

>>> s.split(':')[0]
'Username'

You don’t need regex for this

>>> s = "Username: How are you today?"

You can use the split method to split the string on the ':' character

>>> s.split(':')
['Username', ' How are you today?']

And slice out element [0] to get the first part of the string

>>> s.split(':')[0]
'Username'

回答 3

我已经在Python 3.7.0(IPython)下对这些各种技术进行了基准测试。

TLDR

  • 最快(当分割符号 c已知):预编译的正则表达式。
  • 最快(否则): s.partition(c)[0]
  • 安全(即何时c可能不在s):分区,拆分。
  • 不安全:索引,正则表达式。

import string, random, re

SYMBOLS = string.ascii_uppercase + string.digits
SIZE = 100

def create_test_set(string_length):
    for _ in range(SIZE):
        random_string = ''.join(random.choices(SYMBOLS, k=string_length))
        yield (random.choice(random_string), random_string)

for string_length in (2**4, 2**8, 2**16, 2**32):
    print("\nString length:", string_length)
    print("  regex (compiled):", end=" ")
    test_set_for_regex = ((re.compile("(.*?)" + c).match, s) for (c, s) in test_set)
    %timeit [re_match(s).group() for (re_match, s) in test_set_for_regex]
    test_set = list(create_test_set(16))
    print("  partition:       ", end=" ")
    %timeit [s.partition(c)[0] for (c, s) in test_set]
    print("  index:           ", end=" ")
    %timeit [s[:s.index(c)] for (c, s) in test_set]
    print("  split (limited): ", end=" ")
    %timeit [s.split(c, 1)[0] for (c, s) in test_set]
    print("  split:           ", end=" ")
    %timeit [s.split(c)[0] for (c, s) in test_set]
    print("  regex:           ", end=" ")
    %timeit [re.match("(.*?)" + c, s).group() for (c, s) in test_set]

结果

String length: 16
  regex (compiled): 156 ns ± 4.41 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  partition:        19.3 µs ± 430 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  index:            26.1 µs ± 341 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split (limited):  26.8 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split:            26.3 µs ± 835 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  regex:            128 µs ± 4.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

String length: 256
  regex (compiled): 167 ns ± 2.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  partition:        20.9 µs ± 694 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  index:            28.6 µs ± 2.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split (limited):  27.4 µs ± 979 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split:            31.5 µs ± 4.86 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  regex:            148 µs ± 7.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

String length: 65536
  regex (compiled): 173 ns ± 3.95 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  partition:        20.9 µs ± 613 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  index:            27.7 µs ± 515 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split (limited):  27.2 µs ± 796 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split:            26.5 µs ± 377 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  regex:            128 µs ± 1.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

String length: 4294967296
  regex (compiled): 165 ns ± 1.2 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  partition:        19.9 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  index:            27.7 µs ± 571 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split (limited):  26.1 µs ± 472 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split:            28.1 µs ± 1.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  regex:            137 µs ± 6.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

I have benchmarked these various technics under Python 3.7.0 (IPython).

TLDR

  • fastest (when the split symbol c is known): pre-compiled regex.
  • fastest (otherwise): s.partition(c)[0].
  • safe (i.e., when c may not be in s): partition, split.
  • unsafe: index, regex.

Code

import string, random, re

SYMBOLS = string.ascii_uppercase + string.digits
SIZE = 100

def create_test_set(string_length):
    for _ in range(SIZE):
        random_string = ''.join(random.choices(SYMBOLS, k=string_length))
        yield (random.choice(random_string), random_string)

for string_length in (2**4, 2**8, 2**16, 2**32):
    print("\nString length:", string_length)
    print("  regex (compiled):", end=" ")
    test_set_for_regex = ((re.compile("(.*?)" + c).match, s) for (c, s) in test_set)
    %timeit [re_match(s).group() for (re_match, s) in test_set_for_regex]
    test_set = list(create_test_set(16))
    print("  partition:       ", end=" ")
    %timeit [s.partition(c)[0] for (c, s) in test_set]
    print("  index:           ", end=" ")
    %timeit [s[:s.index(c)] for (c, s) in test_set]
    print("  split (limited): ", end=" ")
    %timeit [s.split(c, 1)[0] for (c, s) in test_set]
    print("  split:           ", end=" ")
    %timeit [s.split(c)[0] for (c, s) in test_set]
    print("  regex:           ", end=" ")
    %timeit [re.match("(.*?)" + c, s).group() for (c, s) in test_set]

Results

String length: 16
  regex (compiled): 156 ns ± 4.41 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  partition:        19.3 µs ± 430 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  index:            26.1 µs ± 341 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split (limited):  26.8 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split:            26.3 µs ± 835 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  regex:            128 µs ± 4.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

String length: 256
  regex (compiled): 167 ns ± 2.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  partition:        20.9 µs ± 694 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  index:            28.6 µs ± 2.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split (limited):  27.4 µs ± 979 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split:            31.5 µs ± 4.86 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  regex:            148 µs ± 7.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

String length: 65536
  regex (compiled): 173 ns ± 3.95 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  partition:        20.9 µs ± 613 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  index:            27.7 µs ± 515 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split (limited):  27.2 µs ± 796 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split:            26.5 µs ± 377 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  regex:            128 µs ± 1.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

String length: 4294967296
  regex (compiled): 165 ns ± 1.2 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  partition:        19.9 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  index:            27.7 µs ± 571 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split (limited):  26.1 µs ± 472 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  split:            28.1 µs ± 1.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  regex:            137 µs ± 6.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

回答 4

为此,partition()可能比split()更好,因为在没有定界符或更多定界符的情况下,它具有更好的可预测结果。

partition() may be better then split() for this purpose as it has the better predicable results for situations you have no delimiter or more delimiters.


python中是否有将单词拆分为列表的函数?[重复]

问题:python中是否有将单词拆分为列表的函数?[重复]

python中是否有将单词分解为单个字母列表的函数?例如:

s="Word to Split"

要得到

wordlist=['W','o','r','d','','t','o' ....]

Is there a function in python to split a word into a list of single letters? e.g:

s="Word to Split"

to get

wordlist=['W','o','r','d','','t','o' ....]

回答 0

>>> list("Word to Split")
['W', 'o', 'r', 'd', ' ', 't', 'o', ' ', 'S', 'p', 'l', 'i', 't']
>>> list("Word to Split")
['W', 'o', 'r', 'd', ' ', 't', 'o', ' ', 'S', 'p', 'l', 'i', 't']

回答 1

最简单的方法可能只是使用list(),但也至少还有一个其他选择:

s = "Word to Split"
wordlist = list(s)               # option 1, 
wordlist = [ch for ch in s]      # option 2, list comprehension.

他们应该为您提供您所需要的:

['W','o','r','d',' ','t','o',' ','S','p','l','i','t']

如前所述,第一个可能是最适合您的示例的示例,但是有些用例可能会使后者在处理更复杂的内容时非常方便,例如,如果您想对项目应用某些任意函数,例如:

[doSomethingWith(ch) for ch in s]

The easiest way is probably just to use list(), but there is at least one other option as well:

s = "Word to Split"
wordlist = list(s)               # option 1, 
wordlist = [ch for ch in s]      # option 2, list comprehension.

They should both give you what you need:

['W','o','r','d',' ','t','o',' ','S','p','l','i','t']

As stated, the first is likely the most preferable for your example but there are use cases that may make the latter quite handy for more complex stuff, such as if you want to apply some arbitrary function to the items, such as with:

[doSomethingWith(ch) for ch in s]

回答 2

列表功能将执行此操作

>>> list('foo')
['f', 'o', 'o']

The list function will do this

>>> list('foo')
['f', 'o', 'o']

回答 3

滥用规则,结果相同:(x表示“要拆分的单词”中的x)

实际上是一个迭代器,而不是列表。但是,您可能不太在意。

Abuse of the rules, same result: (x for x in ‘Word to split’)

Actually an iterator, not a list. But it’s likely you won’t really care.


回答 4

text = "just trying out"

word_list = []

for i in range(0, len(text)):
    word_list.append(text[i])
    i+=1

print(word_list)

['j', 'u', 's', 't', ' ', 't', 'r', 'y', 'i', 'n', 'g', ' ', 'o', 'u', 't']
text = "just trying out"

word_list = []

for i in range(0, len(text)):
    word_list.append(text[i])
    i+=1

print(word_list)

['j', 'u', 's', 't', ' ', 't', 'r', 'y', 'i', 'n', 'g', ' ', 'o', 'u', 't']

回答 5

def count():列表=’oixfjhibokxnjfklmhjpxesriktglanwekgfvnk’

word_list = []
# dict = {}
for i in range(len(list)):
    word_list.append(list[i])
# word_list1 = sorted(word_list)
for i in range(len(word_list) - 1, 0, -1):
    for j in range(i):
        if word_list[j] > word_list[j + 1]:
            temp = word_list[j]
            word_list[j] = word_list[j + 1]
            word_list[j + 1] = temp
print("final count of arrival of each letter is : \n", dict(map(lambda x: (x, word_list.count(x)), word_list)))

def count(): list = ‘oixfjhibokxnjfklmhjpxesriktglanwekgfvnk’

word_list = []
# dict = {}
for i in range(len(list)):
    word_list.append(list[i])
# word_list1 = sorted(word_list)
for i in range(len(word_list) - 1, 0, -1):
    for j in range(i):
        if word_list[j] > word_list[j + 1]:
            temp = word_list[j]
            word_list[j] = word_list[j + 1]
            word_list[j + 1] = temp
print("final count of arrival of each letter is : \n", dict(map(lambda x: (x, word_list.count(x)), word_list)))

回答 6

最简单的选择是仅使用spit()命令。但是,如果您不想使用它或由于某种市集原因而无法使用它,则可以始终使用此方法。

word = 'foo'
splitWord = []

for letter in word:
    splitWord.append(letter)

print(splitWord) #prints ['f', 'o', 'o']

The easiest option is to just use the spit() command. However, if you don’t want to use it or it dose not work for some bazaar reason, you can always use this method.

word = 'foo'
splitWord = []

for letter in word:
    splitWord.append(letter)

print(splitWord) #prints ['f', 'o', 'o']

在Python中拆分空字符串时,为什么split()返回空列表,而split(’\ n’)返回[”]?

问题:在Python中拆分空字符串时,为什么split()返回空列表,而split(’\ n’)返回[”]?

split('\n')用来获取一个字符串中的行,并发现''.split()返回一个空列表[],而''.split('\n')return ['']。有什么特殊原因造成这种差异?

还有没有更方便的方法来计算字符串中的行数?

I am using split('\n') to get lines in one string, and found that ''.split() returns an empty list, [], while ''.split('\n') returns ['']. Is there any specific reason for such a difference?

And is there any more convenient way to count lines in a string?


回答 0

问题:我正在使用split(’\ n’)在一个字符串中获取行,并发现”.split()返回空列表[],而”.split(’\ n’)返回[”] 。

所述str.split()方法有两种算法。如果未提供任何参数,它将在重复运行空白时拆分。但是,如果给出参数,则将其视为单个定界符,且不会重复运行。

在拆分空字符串的情况下,第一种模式(无参数)将返回一个空列表,因为空白被吃掉并且结果列表中没有任何值。

相比之下,第二种模式(带有参数如\n)将产生第一个空字段。考虑一下您是否写过'\n'.split('\n'),您将得到两个字段(一个字段拆分成两半)。

问题:有什么特殊原因造成这种差异?

当数据在具有可变空白量的列中对齐时,第一种模式很有用。例如:

>>> data = '''\
Shasta      California     14,200
McKinley    Alaska         20,300
Fuji        Japan          12,400
'''
>>> for line in data.splitlines():
        print line.split()

['Shasta', 'California', '14,200']
['McKinley', 'Alaska', '20,300']
['Fuji', 'Japan', '12,400']

第二种模式对于定界数据(例如CSV)很有用,其中重复的逗号表示空白字段。例如:

>>> data = '''\
Guido,BDFL,,Amsterdam
Barry,FLUFL,,USA
Tim,,,USA
'''
>>> for line in data.splitlines():
        print line.split(',')

['Guido', 'BDFL', '', 'Amsterdam']
['Barry', 'FLUFL', '', 'USA']
['Tim', '', '', 'USA']

注意,结果字段的数量比定界符的数量大一。想想剪一条绳子。如果不削减,则只有一件。一切,给出两块。进行两次切割,得到三块。Python的str.split(delimiter)方法也是如此:

>>> ''.split(',')       # No cuts
['']
>>> ','.split(',')      # One cut
['', '']
>>> ',,'.split(',')     # Two cuts
['', '', '']

问题:还有什么更方便的方法来计算字符串中的行数?

是的,有两种简单的方法。一个使用str.count(),另一个使用str.splitlines()。除非最后一行缺少,否则两种方法都将给出相同的答案\n。如果最后的换行符丢失,则str.splitlines方法将给出准确的答案。一种更快且更准确的技术是使用count方法,然后将其更正为最终的换行符:

>>> data = '''\
Line 1
Line 2
Line 3
Line 4'''

>>> data.count('\n')                               # Inaccurate
3
>>> len(data.splitlines())                         # Accurate, but slow
4
>>> data.count('\n') + (not data.endswith('\n'))   # Accurate and fast
4    

来自@Kaz的问题:为什么两个非常不同的算法被误用到一个函数中?

str.split的签名大约有20年的历史了,那个时代的许多API都是严格实用的。虽然并不完美,但方法签名也不是“糟糕的”。在大多数情况下,Guido的API设计选择经受了时间的考验。

当前的API并非没有优势。考虑如下字符串:

ps_aux_header  = "USER               PID  %CPU %MEM      VSZ"
patient_header = "name,age,height,weight"

当要求将这些字符串分成多个字段时,人们倾向于使用相同的英语单词“ split”来描述这两个字符串。当要求读取诸如fields = line.split() 或的代码时fields = line.split(','),人们倾向于正确地将语句解释为“将行拆分为字段”。

Microsoft Excel的“ 文本到列”工具做出了类似的API选择,并将两种分割算法都合并到了同一工具中。尽管似乎涉及多个算法,但人们似乎在思维上将字段拆分建模为一个单独的概念。

Question: I am using split(‘\n’) to get lines in one string, and found that ”.split() returns empty list [], while ”.split(‘\n’) returns [”].

The str.split() method has two algorithms. If no arguments are given, it splits on repeated runs of whitespace. However, if an argument is given, it is treated as a single delimiter with no repeated runs.

In the case of splitting an empty string, the first mode (no argument) will return an empty list because the whitespace is eaten and there are no values to put in the result list.

In contrast, the second mode (with an argument such as \n) will produce the first empty field. Consider if you had written '\n'.split('\n'), you would get two fields (one split, gives you two halves).

Question: Is there any specific reason for such a difference?

This first mode is useful when data is aligned in columns with variable amounts of whitespace. For example:

>>> data = '''\
Shasta      California     14,200
McKinley    Alaska         20,300
Fuji        Japan          12,400
'''
>>> for line in data.splitlines():
        print line.split()

['Shasta', 'California', '14,200']
['McKinley', 'Alaska', '20,300']
['Fuji', 'Japan', '12,400']

The second mode is useful for delimited data such as CSV where repeated commas denote empty fields. For example:

>>> data = '''\
Guido,BDFL,,Amsterdam
Barry,FLUFL,,USA
Tim,,,USA
'''
>>> for line in data.splitlines():
        print line.split(',')

['Guido', 'BDFL', '', 'Amsterdam']
['Barry', 'FLUFL', '', 'USA']
['Tim', '', '', 'USA']

Note, the number of result fields is one greater than the number of delimiters. Think of cutting a rope. If you make no cuts, you have one piece. Making one cut, gives two pieces. Making two cuts, gives three pieces. And so it is with Python’s str.split(delimiter) method:

>>> ''.split(',')       # No cuts
['']
>>> ','.split(',')      # One cut
['', '']
>>> ',,'.split(',')     # Two cuts
['', '', '']

Question: And is there any more convenient way to count lines in a string?

Yes, there are a couple of easy ways. One uses str.count() and the other uses str.splitlines(). Both ways will give the same answer unless the final line is missing the \n. If the final newline is missing, the str.splitlines approach will give the accurate answer. A faster technique that is also accurate uses the count method but then corrects it for the final newline:

>>> data = '''\
Line 1
Line 2
Line 3
Line 4'''

>>> data.count('\n')                               # Inaccurate
3
>>> len(data.splitlines())                         # Accurate, but slow
4
>>> data.count('\n') + (not data.endswith('\n'))   # Accurate and fast
4    

Question from @Kaz: Why the heck are two very different algorithms shoe-horned into a single function?

The signature for str.split is about 20 years old, and a number of the APIs from that era are strictly pragmatic. While not perfect, the method signature isn’t “terrible” either. For the most part, Guido’s API design choices have stood the test of time.

The current API is not without advantages. Consider strings such as:

ps_aux_header  = "USER               PID  %CPU %MEM      VSZ"
patient_header = "name,age,height,weight"

When asked to break these strings into fields, people tend to describe both using the same English word, “split”. When asked to read code such as fields = line.split() or fields = line.split(','), people tend to correctly interpret the statements as “splits a line into fields”.

Microsoft Excel’s text-to-columns tool made a similar API choice and incorporates both splitting algorithms in the same tool. People seem to mentally model field-splitting as a single concept even though more than one algorithm is involved.


回答 1

根据文档,这似乎只是它应该工作的方式:

使用指定的分隔符分割空字符串将返回['']

如果未指定sep或为None,则将应用不同的拆分算法:连续的空白行将被视为单个分隔符,并且如果字符串的开头或结尾处有空白,则结果在开头或结尾将不包含空字符串。因此,使用None分隔符拆分空字符串或仅包含空格的字符串将返回[]。

因此,为了更清楚一点,该split()函数实现了两种不同的拆分算法,并使用参数的存在来决定要运行哪个参数。这可能是因为它允许优化一个不带参数的参数,而不是优化带参数的参数。我不知道。

It seems to simply be the way it’s supposed to work, according to the documentation:

Splitting an empty string with a specified separator returns [''].

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].

So, to make it clearer, the split() function implements two different splitting algorithms, and uses the presence of an argument to decide which one to run. This might be because it allows optimizing the one for no arguments more than the one with arguments; I don’t know.


回答 2

.split()没有参数的人会变得聪明。它在任何空格,制表符,空格,换行符等处分割,并因此跳过所有空字符串。

>>> "  fii    fbar \n bopp ".split()
['fii', 'fbar', 'bopp']

本质上,.split()不带参数的用于从字符串中提取单词,而.split()带参数的参数只是带一个字符串并将其分割。

这就是差异的原因。

是的,通过分割来计数行不是一种有效的方法。计算换行符的数量,如果字符串不以换行符结尾,则加一个。

.split() without parameters tries to be clever. It splits on any whitespace, tabs, spaces, line feeds etc, and it also skips all empty strings as a result of this.

>>> "  fii    fbar \n bopp ".split()
['fii', 'fbar', 'bopp']

Essentially, .split() without parameters are used to extract words from a string, as opposed to .split() with parameters which just takes a string and splits it.

That’s the reason for the difference.

And yeah, counting lines by splitting is not an efficient way. Count the number of line feeds, and add one if the string doesn’t end with a line feed.


回答 3

用途count()

s = "Line 1\nLine2\nLine3"
n_lines = s.count('\n') + 1

Use count():

s = "Line 1\nLine2\nLine3"
n_lines = s.count('\n') + 1

回答 4

>>> print str.split.__doc__
S.split([sep [,maxsplit]]) -> list of strings

Return a list of the words in the string S, using sep as the
delimiter string.  If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are removed
from the result.

注意最后一句话。

要计算行数,您可以简单地计算行数\n

line_count = some_string.count('\n') + some_string[-1] != '\n'

最后一部分考虑到不结束最后一行\n,即使这意味着,Hello, World!Hello, World!\n具有相同的行数(这对我来说是合理的),否则,你可以简单地添加1到的计数\n

>>> print str.split.__doc__
S.split([sep [,maxsplit]]) -> list of strings

Return a list of the words in the string S, using sep as the
delimiter string.  If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are removed
from the result.

Note the last sentence.

To count lines you can simply count how many \n are there:

line_count = some_string.count('\n') + some_string[-1] != '\n'

The last part takes into account the last line that do not end with \n, even though this means that Hello, World! and Hello, World!\n have the same line count(which for me is reasonable), otherwise you can simply add 1 to the count of \n.


回答 5

要计算行数,可以计算换行数:

n_lines = sum(1 for s in the_string if s == "\n") + 1 # add 1 for last line

编辑

内置的另一个答案count更合适,实际上

To count lines, you can count the number of line breaks:

n_lines = sum(1 for s in the_string if s == "\n") + 1 # add 1 for last line

Edit:

The other answer with built-in count is more suitable, actually


将列表拆分为较小的列表(一分为二)

问题:将列表拆分为较小的列表(一分为二)

我正在寻找一种轻松地将python列表分成两半的方法。

这样,如果我有一个数组:

A = [0,1,2,3,4,5]

我将能够得到:

B = [0,1,2]

C = [3,4,5]

I am looking for a way to easily split a python list in half.

So that if I have an array:

A = [0,1,2,3,4,5]

I would be able to get:

B = [0,1,2]

C = [3,4,5]

回答 0

A = [1,2,3,4,5,6]
B = A[:len(A)//2]
C = A[len(A)//2:]

如果需要功能:

def split_list(a_list):
    half = len(a_list)//2
    return a_list[:half], a_list[half:]

A = [1,2,3,4,5,6]
B, C = split_list(A)
A = [1,2,3,4,5,6]
B = A[:len(A)//2]
C = A[len(A)//2:]

If you want a function:

def split_list(a_list):
    half = len(a_list)//2
    return a_list[:half], a_list[half:]

A = [1,2,3,4,5,6]
B, C = split_list(A)

回答 1

一些通用的解决方案(您可以指定所需的零件数量,而不仅仅是“分成两半”):

编辑:更新后的帖子以处理奇数列表长度

EDIT2:基于Brians的评论性评论再次更新帖子

def split_list(alist, wanted_parts=1):
    length = len(alist)
    return [ alist[i*length // wanted_parts: (i+1)*length // wanted_parts] 
             for i in range(wanted_parts) ]

A = [0,1,2,3,4,5,6,7,8,9]

print split_list(A, wanted_parts=1)
print split_list(A, wanted_parts=2)
print split_list(A, wanted_parts=8)

A little more generic solution (you can specify the number of parts you want, not just split ‘in half’):

EDIT: updated post to handle odd list lengths

EDIT2: update post again based on Brians informative comments

def split_list(alist, wanted_parts=1):
    length = len(alist)
    return [ alist[i*length // wanted_parts: (i+1)*length // wanted_parts] 
             for i in range(wanted_parts) ]

A = [0,1,2,3,4,5,6,7,8,9]

print split_list(A, wanted_parts=1)
print split_list(A, wanted_parts=2)
print split_list(A, wanted_parts=8)

回答 2

f = lambda A, n=3: [A[i:i+n] for i in range(0, len(A), n)]
f(A)

n -结果数组的预定义长度

f = lambda A, n=3: [A[i:i+n] for i in range(0, len(A), n)]
f(A)

n – the predefined length of result arrays


回答 3

def split(arr, size):
     arrs = []
     while len(arr) > size:
         pice = arr[:size]
         arrs.append(pice)
         arr   = arr[size:]
     arrs.append(arr)
     return arrs

测试:

x=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
print(split(x, 5))

结果:

[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13]]
def split(arr, size):
     arrs = []
     while len(arr) > size:
         pice = arr[:size]
         arrs.append(pice)
         arr   = arr[size:]
     arrs.append(arr)
     return arrs

Test:

x=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
print(split(x, 5))

result:

[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13]]

回答 4

如果您不在乎订单…

def split(list):  
    return list[::2], list[1::2]

list[::2]从第0个元素开始获取列表中的每个第二个元素。
list[1::2]从第一个元素开始获取列表中的第二个元素。

If you don’t care about the order…

def split(list):  
    return list[::2], list[1::2]

list[::2] gets every second element in the list starting from the 0th element.
list[1::2] gets every second element in the list starting from the 1st element.


回答 5

B,C=A[:len(A)/2],A[len(A)/2:]

B,C=A[:len(A)/2],A[len(A)/2:]


回答 6

这是一个常见的解决方案,将arr分为count部分

def split(arr, count):
     return [arr[i::count] for i in range(count)]

Here is a common solution, split arr into count part

def split(arr, count):
     return [arr[i::count] for i in range(count)]

回答 7

def splitter(A):
    B = A[0:len(A)//2]
    C = A[len(A)//2:]

 return (B,C)

我进行了测试,在python 3中强制进行int分割需要使用双斜杠。我的原始帖子是正确的,尽管所见即所得由于某种原因在Opera中中断了。

def splitter(A):
    B = A[0:len(A)//2]
    C = A[len(A)//2:]

 return (B,C)

I tested, and the double slash is required to force int division in python 3. My original post was correct, although wysiwyg broke in Opera, for some reason.


回答 8

对于将数组拆分为较小的较小数组的更普遍的情况,有一个官方的Python记录n

from itertools import izip_longest
def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

该代码段来自python itertools doc页面

There is an official Python receipe for the more generalized case of splitting an array into smaller arrays of size n.

from itertools import izip_longest
def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

This code snippet is from the python itertools doc page.


回答 9

使用列表切片。语法基本上是my_list[start_index:end_index]

>>> i = [0,1,2,3,4,5]
>>> i[:3] # same as i[0:3] - grabs from first to third index (0->2)
[0, 1, 2]
>>> i[3:] # same as i[3:len(i)] - grabs from fourth index to end
[3, 4, 5]

要获得列表的前半部分,请从第一个索引切成len(i)//2(其中//是整数除法-因此3//2 will give the floored result of1 , instead of the invalid list index of1.5`):

>>> i[:len(i)//2]
[0, 1, 2]

..并交换周围的值以获得下半部分:

>>> i[len(i)//2:]
[3, 4, 5]

Using list slicing. The syntax is basically my_list[start_index:end_index]

>>> i = [0,1,2,3,4,5]
>>> i[:3] # same as i[0:3] - grabs from first to third index (0->2)
[0, 1, 2]
>>> i[3:] # same as i[3:len(i)] - grabs from fourth index to end
[3, 4, 5]

To get the first half of the list, you slice from the first index to len(i)//2 (where // is the integer division – so 3//2 will give the floored result of1, instead of the invalid list index of1.5`):

>>> i[:len(i)//2]
[0, 1, 2]

..and the swap the values around to get the second half:

>>> i[len(i)//2:]
[3, 4, 5]

回答 10

如果列表很大,最好使用itertools并编写一个函数以根据需要生成每个部分:

from itertools import islice

def make_chunks(data, SIZE):
    it = iter(data)
    # use `xragne` if you are in python 2.7:
    for i in range(0, len(data), SIZE):
        yield [k for k in islice(it, SIZE)]

您可以这样使用:

A = [0, 1, 2, 3, 4, 5, 6]

size = len(A) // 2

for sample in make_chunks(A, size):
    print(sample)

输出为:

[0, 1, 2]
[3, 4, 5]
[6]

感谢@thefourtheye@Bede Constantinides

If you have a big list, It’s better to use itertools and write a function to yield each part as needed:

from itertools import islice

def make_chunks(data, SIZE):
    it = iter(data)
    # use `xragne` if you are in python 2.7:
    for i in range(0, len(data), SIZE):
        yield [k for k in islice(it, SIZE)]

You can use this like:

A = [0, 1, 2, 3, 4, 5, 6]

size = len(A) // 2

for sample in make_chunks(A, size):
    print(sample)

The output is:

[0, 1, 2]
[3, 4, 5]
[6]

Thanks to @thefourtheye and @Bede Constantinides


回答 11

10年后..我想-为什么不加上另一个:

arr = 'Some random string' * 10; n = 4
print([arr[e:e+n] for e in range(0,len(arr),n)])

10 years later.. I thought – why not add another:

arr = 'Some random string' * 10; n = 4
print([arr[e:e+n] for e in range(0,len(arr),n)])

回答 12

虽然上述答案或多或少是正确的,但是如果您的数组大小不能被2整除,则可能会遇到麻烦,因为,结果为a / 2odd,在python 3.0中为浮点数,在较早的版本中为from __future__ import division在脚本的开头指定。无论如何,您最好进行整数除法,即a // 2为了获得代码的“前向”兼容性。

While the answers above are more or less correct, you may run into trouble if the size of your array isn’t divisible by 2, as the result of a / 2, a being odd, is a float in python 3.0, and in earlier version if you specify from __future__ import division at the beginning of your script. You are in any case better off going for integer division, i.e. a // 2, in order to get “forward” compatibility of your code.


回答 13

这与其他解决方案相似,但速度更快。

# Usage: split_half([1,2,3,4,5]) Result: ([1, 2], [3, 4, 5])

def split_half(a):
    half = len(a) >> 1
    return a[:half], a[half:]

This is similar to other solutions, but a little faster.

# Usage: split_half([1,2,3,4,5]) Result: ([1, 2], [3, 4, 5])

def split_half(a):
    half = len(a) >> 1
    return a[:half], a[half:]

回答 14

来自@ChristopheD的提示

def line_split(N, K=1):
    length = len(N)
    return [N[i*length/K:(i+1)*length/K] for i in range(K)]

A = [0,1,2,3,4,5,6,7,8,9]
print line_split(A,1)
print line_split(A,2)

With hints from @ChristopheD

def line_split(N, K=1):
    length = len(N)
    return [N[i*length/K:(i+1)*length/K] for i in range(K)]

A = [0,1,2,3,4,5,6,7,8,9]
print line_split(A,1)
print line_split(A,2)

回答 15

#for python 3
    A = [0,1,2,3,4,5]
    l = len(A)/2
    B = A[:int(l)]
    C = A[int(l):]       
#for python 3
    A = [0,1,2,3,4,5]
    l = len(A)/2
    B = A[:int(l)]
    C = A[int(l):]       

回答 16

在2020年对该问题的另一种看法…这是问题的概括。我将“将列表分为两半”解释为..(即仅两个列表,并且如果出现奇数等,则不会溢出到第三数组)。例如,如果数组长度为19,并使用//运算符将其除以2,则结果为9,我们将最终得到两个长度为9的数组和一个长度为1的数组(第三个)(因此总共三个数组)。如果我们想要一个通用的解决方案一直提供两个数组,那么我将假定我们对长度不相等的二重奏数组感到满意(一个长于另一个长)。并且假定混合顺序是可以的(在这种情况下替代)。

"""
arrayinput --> is an array of length N that you wish to split 2 times
"""
ctr = 1 # lets initialize a counter

holder_1 = []
holder_2 = []

for i in range(len(arrayinput)): 

    if ctr == 1 :
        holder_1.append(arrayinput[i])
    elif ctr == 2: 
        holder_2.append(arrayinput[i])

    ctr += 1 

    if ctr > 2 : # if it exceeds 2 then we reset 
        ctr = 1 

这个概念适用于任意数量的列表分区(您必须根据想要的列表部分数量来调整代码)。并且很容易解释。为了加快速度,您甚至可以在cython / C / C ++中编写此循环以加快速度。再说一遍,我已经在相对较小的列表(约10,000行)上尝试了此代码,并且在不到一秒的时间内完成了代码。

只是我的两分钱。

谢谢!

Another take on this problem in 2020 … Here’s a generalization of the problem. I interpret the ‘divide a list in half’ to be .. (i.e. two lists only and there shall be no spillover to a third array in case of an odd one out etc). For instance, if the array length is 19 and a division by two using // operator gives 9, and we will end up having two arrays of length 9 and one array (third) of length 1 (so in total three arrays). If we’d want a general solution to give two arrays all the time, I will assume that we are happy with resulting duo arrays that are not equal in length (one will be longer than the other). And that its assumed to be ok to have the order mixed (alternating in this case).

"""
arrayinput --> is an array of length N that you wish to split 2 times
"""
ctr = 1 # lets initialize a counter

holder_1 = []
holder_2 = []

for i in range(len(arrayinput)): 

    if ctr == 1 :
        holder_1.append(arrayinput[i])
    elif ctr == 2: 
        holder_2.append(arrayinput[i])

    ctr += 1 

    if ctr > 2 : # if it exceeds 2 then we reset 
        ctr = 1 

This concept works for any amount of list partition as you’d like (you’d have to tweak the code depending on how many list parts you want). And is rather straightforward to interpret. To speed things up , you can even write this loop in cython / C / C++ to speed things up. Then again, I’ve tried this code on relatively small lists ~ 10,000 rows and it finishes in a fraction of second.

Just my two cents.

Thanks!


在Python字符串的最后一个分隔符上分割?

问题:在Python字符串的最后一个分隔符上分割?

对于在字符串中最后一次出现定界符时拆分字符串的建议Python惯用法是什么?例:

# instead of regular split
>> s = "a,b,c,d"
>> s.split(",")
>> ['a', 'b', 'c', 'd']

# ..split only on last occurrence of ',' in string:
>>> s.mysplit(s, -1)
>>> ['a,b,c', 'd']

mysplit接受第二个参数,即要分割的分隔符的出现。像常规列表索引一样,-1表示末尾的末尾。如何才能做到这一点?

What’s the recommended Python idiom for splitting a string on the last occurrence of the delimiter in the string? example:

# instead of regular split
>> s = "a,b,c,d"
>> s.split(",")
>> ['a', 'b', 'c', 'd']

# ..split only on last occurrence of ',' in string:
>>> s.mysplit(s, -1)
>>> ['a,b,c', 'd']

mysplit takes a second argument that is the occurrence of the delimiter to be split. Like in regular list indexing, -1 means the last from the end. How can this be done?


回答 0

使用.rsplit().rpartition()代替:

s.rsplit(',', 1)
s.rpartition(',')

str.rsplit()可让您指定拆分次数,而str.rpartition()仅拆分一次,但始终返回固定数量的元素(前缀,定界符和后缀),并且对于单个拆分情况而言更快。

演示:

>>> s = "a,b,c,d"
>>> s.rsplit(',', 1)
['a,b,c', 'd']
>>> s.rsplit(',', 2)
['a,b', 'c', 'd']
>>> s.rpartition(',')
('a,b,c', ',', 'd')

两种方法都从字符串的右侧开始拆分;通过str.rsplit()将最大值作为第二个参数,您可以仅分割最右边的出现。

Use .rsplit() or .rpartition() instead:

s.rsplit(',', 1)
s.rpartition(',')

str.rsplit() lets you specify how many times to split, while str.rpartition() only splits once but always returns a fixed number of elements (prefix, delimiter & postfix) and is faster for the single split case.

Demo:

>>> s = "a,b,c,d"
>>> s.rsplit(',', 1)
['a,b,c', 'd']
>>> s.rsplit(',', 2)
['a,b', 'c', 'd']
>>> s.rpartition(',')
('a,b,c', ',', 'd')

Both methods start splitting from the right-hand-side of the string; by giving str.rsplit() a maximum as the second argument, you get to split just the right-hand-most occurrences.


回答 1

您可以使用rsplit

string.rsplit('delimeter',1)[1]

从反向获取字符串。

You can use rsplit

string.rsplit('delimeter',1)[1]

To get the string from reverse.


回答 2

我只是为了好玩而做

    >>> s = 'a,b,c,d'
    >>> [item[::-1] for item in s[::-1].split(',', 1)][::-1]
    ['a,b,c', 'd']

警告:请参阅下面的第一个评论,此答案可能会出错。

I just did this for fun

    >>> s = 'a,b,c,d'
    >>> [item[::-1] for item in s[::-1].split(',', 1)][::-1]
    ['a,b,c', 'd']

Caution: Refer to the first comment in below where this answer can go wrong.