Question: Split strings into words with multiple word boundary delimiters

I think what I want to do is a fairly common task but I’ve found no reference on the web. I have text with punctuation, and I want a list of the words.

"Hey, you - what are you doing here!?"

should be

['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

But Python’s str.split() only works with one argument, so I have all words with the punctuation after I split with whitespace. Any ideas?


Answer 0

A case where regular expressions are justified:

import re
DATA = "Hey, you - what are you doing here!?"
print(re.findall(r"[\w']+", DATA))
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
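
The question asks for lowercase words, while the pattern above preserves the original case. A minimal sketch, assuming it is acceptable to lowercase the text before matching (the helper name `words_lower` is hypothetical, not part of the answer):

```python
import re

def words_lower(text):
    # Lowercase first, then collect runs of word characters and apostrophes.
    return re.findall(r"[\w']+", text.lower())

result = words_lower("Hey, you - what are you doing here!?")
print(result)
```

Keeping the apostrophe in the character class means contractions such as "don't" stay in one piece.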

Answer 1

re.split()

re.split(pattern, string[, maxsplit=0])

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)

>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
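
Note the trailing empty string in the first example: it appears because the text ends with a delimiter. A small sketch of one way to drop such empty pieces (the helper name `split_nonword` is an assumption made for illustration):

```python
import re

def split_nonword(text):
    # re.split yields '' at either end when the text starts or ends with a
    # delimiter; keep only the non-empty pieces.
    return [piece for piece in re.split(r"\W+", text) if piece]

pieces = split_nonword("Words, words, words.")
print(pieces)
```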

Answer 2

Another quick way to do this without a regexp is to replace the characters first, as below:

>>> 'a;bcd,ef g'.replace(';',' ').replace(',',' ').split()
['a', 'bcd', 'ef', 'g']

Answer 3

So many answers, yet I can’t find any solution that does efficiently what the title of the question literally asks for (splitting on multiple possible separators; instead, many answers split on anything that is not a word, which is different). So here is an answer to the question in the title, that relies on Python’s standard and efficient re module:

>>> import re  # Will be splitting on: , <space> - ! ? :
>>> list(filter(None, re.split(r"[, \-!?:]+", "Hey, you - what are you doing here!?")))
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

where:

  • the […] matches one of the separators listed inside,
  • the \- in the regular expression is here to prevent the special interpretation of - as a character range indicator (as in A-Z),
  • the + skips one or more delimiters (it could be omitted thanks to the filter(), but this would unnecessarily produce empty strings between matched separators), and
  • filter(None, …) removes the empty strings possibly created by leading and trailing separators (since empty strings have a false boolean value).

This re.split() precisely “splits with multiple separators”, as asked for in the question title.

This solution is furthermore immune to the problems with non-ASCII characters in words found in some other solutions (see the first comment to ghostdog74’s answer).

The re module is much more efficient (in speed and concision) than doing Python loops and tests “by hand”!
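
If the separators live in a plain string, the character class can be built with `re.escape` instead of escaping characters by hand. This is only a sketch: the function name `split_on_chars` is hypothetical, and it assumes `delimiters` is a non-empty string of single-character separators.

```python
import re

def split_on_chars(text, delimiters):
    # re.escape protects characters such as '-' or ']' that are special
    # inside a character class. An empty delimiter string would produce an
    # invalid pattern, so callers are assumed to pass at least one character.
    pattern = "[" + re.escape(delimiters) + "]+"
    return [piece for piece in re.split(pattern, text) if piece]

words = split_on_chars("Hey, you - what are you doing here!?", ", -!?:")
print(words)
```

Because only the listed separators are split on, words containing non-ASCII letters are left untouched.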


Answer 4

Another way, without regex

import string
punc = string.punctuation
thestring = "Hey, you - what are you doing here!?"
# Drop the punctuation characters, then split on whitespace
''.join(ch for ch in thestring if ch not in punc).split()

Answer 5

Pro-Tip: Use string.translate for the fastest string operations Python has.

Some proof…

First, the slow way (sorry pprzemek):

>>> import timeit
>>> S = 'Hey, you - what are you doing here!?'
>>> def my_split(s, seps):
...     res = [s]
...     for sep in seps:
...         s, res = res, []
...         for seq in s:
...             res += seq.split(sep)
...     return res
... 
>>> timeit.Timer('my_split(S, punctuation)', 'from __main__ import S,my_split; from string import punctuation').timeit()
54.65477919578552

Next, we use re.findall() (as given by the suggested answer). MUCH faster:

>>> timeit.Timer('findall(r"\w+", S)', 'from __main__ import S; from re import findall').timeit()
4.194725036621094

Finally, we use translate:

>>> from string import translate,maketrans,punctuation 
>>> T = maketrans(punctuation, ' '*len(punctuation))
>>> timeit.Timer('translate(S, T).split()', 'from __main__ import S,T,translate').timeit()
1.2835021018981934

Explanation:

string.translate is implemented in C and, unlike a chain of separate string operations in Python, it builds the result in a single pass without creating intermediate strings. So it’s about as fast as you can get for string substitution.

It’s a bit awkward, though, as it needs a translation table in order to do this magic. You can make a translation table with the maketrans() convenience function. The objective here is to translate all unwanted characters to spaces. A one-for-one substitute. Again, no intermediate data is produced. So this is fast!

Next, we use good old split(). split() by default will operate on all whitespace characters, grouping them together for the split. The result will be the list of words that you want. And this approach is more than 3x faster than re.findall() in the timings above!
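
The module-level `string.translate` and `string.maketrans` functions used above exist only in Python 2. A rough Python 3 equivalent, using `str.maketrans` and `str.translate`, might look like the sketch below; the names `TABLE` and `split_words` are made up for illustration, and the timings above were not re-measured for this version.

```python
import string

# Map every ASCII punctuation character to a space.
TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def split_words(text):
    # One translate pass, then split on runs of whitespace.
    return text.translate(TABLE).split()

words = split_words("Hey, you - what are you doing here!?")
print(words)
```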


Answer 6

I had a similar dilemma and didn’t want to use the ‘re’ module.

def my_split(s, seps):
    res = [s]
    for sep in seps:
        s, res = res, []
        for seq in s:
            res += seq.split(sep)
    return res

print(my_split('1111  2222 3333;4444,5555;6666', [' ', ';', ',']))
['1111', '', '2222', '3333', '4444', '5555', '6666']
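
The output above contains an empty string where two separators are adjacent (the double space). If that is unwanted, a variant along these lines could discard the empty pieces; `my_split_nonempty` is a hypothetical name, not something from the answer.

```python
def my_split_nonempty(s, seps):
    pieces = [s]
    for sep in seps:
        # Split every current piece on this separator and flatten the result.
        pieces = [part for piece in pieces for part in piece.split(sep)]
    # Adjacent separators produce '' entries; discard them.
    return [piece for piece in pieces if piece]

result = my_split_nonempty('1111  2222 3333;4444,5555;6666', [' ', ';', ','])
print(result)
```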

Answer 7

First, I want to agree with others that the regex or str.translate(...) based solutions are the most performant. For my use case the performance of this function wasn’t significant, so I wanted to add ideas that I considered with that criterion in mind.

My main goal was to generalize ideas from some of the other answers into one solution that could work for strings containing more than just regex words (i.e., blacklisting the explicit subset of punctuation characters vs whitelisting word characters).

Note that, in any approach, one might also consider using string.punctuation in place of a manually defined list.

Option 1 – re.sub

I was surprised to see no answer so far uses re.sub(…). I find it a simple and natural approach to this problem.

import re

my_str = "Hey, you - what are you doing here!?"

words = re.split(r'\s+', re.sub(r'[,\-!?]', ' ', my_str).strip())

In this solution, I nested the call to re.sub(...) inside re.split(...). If performance is critical, compiling the regex outside could be beneficial; for my use case, the difference wasn’t significant, so I prefer simplicity and readability.
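
For reference, a version with the patterns compiled once at module level might look like this. It is a sketch only: the names `PUNCT_RE`, `SPACE_RE` and `split_words` are assumptions, and whether it is measurably faster depends on the workload.

```python
import re

# Compiled once and reused on every call.
PUNCT_RE = re.compile(r"[,\-!?]")
SPACE_RE = re.compile(r"\s+")

def split_words(text):
    # Replace the listed punctuation with spaces, trim, then split on
    # runs of whitespace.
    cleaned = PUNCT_RE.sub(" ", text).strip()
    return SPACE_RE.split(cleaned)

words = split_words("Hey, you - what are you doing here!?")
print(words)
```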

Option 2 – str.replace

This is a few more lines, but it has the benefit of being expandable without having to check whether you need to escape a certain character in regex.

my_str = "Hey, you - what are you doing here!?"

replacements = (',', '-', '!', '?')
for r in replacements:
    my_str = my_str.replace(r, ' ')

words = my_str.split()

It would have been nice to be able to map the str.replace to the string instead, but I don’t think it can be done with immutable strings, and while mapping against a list of characters would work, running every replacement against every character sounds excessive. (Edit: See next option for a functional example.)

Option 3 – functools.reduce

(In Python 2, reduce is available in global namespace without importing it from functools.)

import functools

my_str = "Hey, you - what are you doing here!?"

replacements = (',', '-', '!', '?')
my_str = functools.reduce(lambda s, sep: s.replace(sep, ' '), replacements, my_str)
words = my_str.split()

Answer 8

join = lambda x: sum(x,[])  # a.k.a. flatten1([[1],[2,3],[4]]) -> [1,2,3,4]
# ...alternatively...
join = lambda lists: [x for l in lists for x in l]

Then this becomes a three-liner:

fragments = [text]
for token in tokens:
    fragments = join(f.split(token) for f in fragments)

Explanation

This is what in Haskell is known as the List monad. The idea behind the monad is that once “in the monad” you “stay in the monad” until something takes you out. For example, say you map the Python range(n) -> [0,1,...,n-1] function over a list. Because each result is itself a list, the results are concatenated into a single flat list, so you’d get something like map(range, [3,4,1]) -> [0,1,2,0,1,2,3,0]. In Haskell this map-then-concatenate operation is called concatMap (the bind operation of the List monad). The idea here is that you’ve got this operation you’re applying (splitting on a token), and whenever you do that, you join the result into the list.

You can abstract this into a function and have tokens=string.punctuation by default.

Advantages of this approach:

  • This approach (unlike naive regex-based approaches) can work with arbitrary-length tokens (which regex can also do with more advanced syntax).
  • You are not restricted to mere tokens; you could have arbitrary logic in place of each token, for example one of the “tokens” could be a function which splits according to how nested parentheses are.
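
The abstraction into a function mentioned above could be sketched as follows. The name `multi_split` is hypothetical; the default of `string.punctuation` follows the suggestion in the text, and tokens may be longer than one character.

```python
import string

def multi_split(text, tokens=string.punctuation):
    fragments = [text]
    for token in tokens:
        # Split every fragment on this token and flatten the pieces
        # back into a single list.
        fragments = [piece for f in fragments for piece in f.split(token)]
    return fragments

print(multi_split("a,b;c"))
print(multi_split("1FROM2INTO3", ["FROM", "INTO"]))
```

As with the three-liner, adjacent tokens leave empty strings in the result, so callers may want to filter those out.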

Answer 9

Try this:

import re

phrase = "Hey, you - what are you doing here!?"
matches = re.findall(r'\w+', phrase)
print(matches)

This will print ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']


Answer 10

Use replace two times:

a = '11223FROM33344INTO33222FROM3344'
a.replace('FROM', ',,,').replace('INTO', ',,,').split(',,,')

results in:

['11223', '33344', '33222', '3344']
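
The same multi-character split can also be written as a single regular expression, which avoids having to pick a placeholder such as ',,,' that must not occur in the data. This is a different technique from the replace-based one above, shown only as a sketch; `split_on_any` is a hypothetical name.

```python
import re

def split_on_any(text, separators):
    # Try longer separators first so that a separator which is a prefix
    # of another one does not shadow it.
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "|".join(re.escape(sep) for sep in ordered)
    return re.split(pattern, text)

parts = split_on_any('11223FROM33344INTO33222FROM3344', ['FROM', 'INTO'])
print(parts)
```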

Answer 11

I like re, but here is my solution without it:

from itertools import groupby
sep = ' ,-!?'
s = "Hey, you - what are you doing here!?"
print([''.join(g) for k, g in groupby(s, sep.__contains__) if not k])

sep.__contains__ is the method used by the ‘in’ operator. Basically it is the same as

lambda ch: ch in sep

but is more convenient here.

groupby takes our string and a function. It splits the string into groups using that function: whenever the value of the function changes, a new group is generated. So, sep.__contains__ is exactly what we need.

groupby returns a sequence of pairs, where pair[0] is the result of our function and pair[1] is a group. Using ‘if not k’ we filter out the groups of separators (because the result of sep.__contains__ is True on separators). That’s all: now we have a sequence of groups where each one is a word (a group is actually an iterable, so we use join to convert it to a string).

This solution is quite general, because it uses a function to separate the string (you can split by any condition you need). Also, it doesn’t create intermediate strings/lists (you can remove join and the expression will become lazy, since each group is an iterator).
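
To make the laziness concrete, the same idea can be wrapped in a generator that yields one word at a time. A minimal sketch, where the name `iter_words` and the default separator string are assumptions for illustration:

```python
from itertools import groupby

def iter_words(text, sep=" ,-!?"):
    # groupby starts a new group whenever sep.__contains__ flips between
    # True (separator characters) and False (word characters).
    for is_sep, group in groupby(text, sep.__contains__):
        if not is_sep:
            yield "".join(group)

words = list(iter_words("Hey, you - what are you doing here!?"))
print(words)
```

Each group has to be consumed before the generator advances, because groupby shares the underlying iterator between groups.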


Answer 12

Instead of using the re module function re.split, you can achieve the same result using the Series.str.split method of pandas.

First, create a series with the above string and then apply the method to the series.

import pandas as pd

thestring = pd.Series("Hey, you - what are you doing here!?")
thestring.str.split(pat=',|-')

The pat parameter takes the delimiters, and the method returns the split string as a list. Here the two delimiters are passed using a | (the regex "or" operator). The output is as follows:

[Hey, you , what are you doing here!?]


Answer 13

I’m re-acquainting myself with Python and needed the same thing. The findall solution may be better, but I came up with this:

tokens = [x.strip() for x in data.split(',')]

Answer 14

Using maketrans and translate, you can do it easily and neatly:

specials = ',.!?:;"()<>[]#$=-/'
trans = str.maketrans(specials, ' '*len(specials))
body = body.translate(trans)
words = body.strip().split()

Answer 15

In Python 3, you can use the method from PY4E – Python for Everybody.

We can solve both these problems by using the string methods lower, punctuation, and translate. The translate is the most subtle of the methods. Here is the documentation for translate:

your_string.translate(your_string.maketrans(fromstr, tostr, deletestr))

Replace the characters in fromstr with the character in the same position in tostr and delete all characters that are in deletestr. The fromstr and tostr can be empty strings and the deletestr parameter can be omitted.

You can see the “punctuation”:

In [10]: import string

In [11]: string.punctuation
Out[11]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'  

For your example:

In [12]: your_str = "Hey, you - what are you doing here!?"

In [13]: line = your_str.translate(your_str.maketrans('', '', string.punctuation))

In [14]: line = line.lower()

In [15]: words = line.split()

In [16]: print(words)
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
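
Deleting every punctuation character also glues hyphenated words together ("well-known" becomes "wellknown"). The three-argument form of maketrans described above can handle both cases at once. The following is only a sketch with a made-up example sentence: it maps '-' to a space and deletes the remaining punctuation.

```python
import string

# Everything in string.punctuation except the hyphen is deleted;
# the hyphen itself is mapped to a space.
to_delete = string.punctuation.replace("-", "")
table = str.maketrans("-", " ", to_delete)

line = "Hey, you - a well-known word!".lower().translate(table)
words = line.split()
print(words)
```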



Answer 16

Another way to achieve this is to use the Natural Language Tool Kit (nltk).

import nltk
data = "Hey, you - what are you doing here!?"
word_tokens = nltk.tokenize.regexp_tokenize(data, r'\w+')
print(word_tokens)

This prints: ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

The biggest drawback of this method is that you need to install the nltk package.

The benefits are that you can do a lot of fun stuff with the rest of the nltk package once you get your tokens.


Answer 17

First of all, I don’t think that your intention is to actually use punctuation as delimiters in the split functions. Your description suggests that you simply want to eliminate punctuation from the resultant strings.

I come across this pretty frequently, and my usual solution doesn’t require re.

One-liner lambda function w/ list comprehension:

(requires import string):

split_without_punc = lambda text : [word.strip(string.punctuation) for word in 
    text.split() if word.strip(string.punctuation) != '']

# Call function
split_without_punc("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']


Function (traditional)

As a traditional function, this is still only two lines with a list comprehension (in addition to import string):

def split_without_punctuation2(text):

    # Split by whitespace
    words = text.split()

    # Strip punctuation from each word
    return [word.strip(string.punctuation) for word in words if word.strip(string.punctuation) != '']

split_without_punctuation2("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

It will also naturally leave contractions and hyphenated words intact. You can always use text.replace("-", " ") to turn hyphens into spaces before the split.

General Function w/o Lambda or List Comprehension

For a more general solution (where you can specify the characters to eliminate), and without a list comprehension, you get:

def split_without(text: str, ignore: str) -> list:

    # Split by whitespace
    split_string = text.split()

    # Strip any characters in the ignore string, and ignore empty strings
    words = []
    for word in split_string:
        word = word.strip(ignore)
        if word != '':
            words.append(word)

    return words

# Situation-specific call to general function
import string
final_text = split_without("Hey, you - what are you doing?!", string.punctuation)
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

Of course, you can always generalize the lambda function to any specified string of characters as well.
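
One way such a generalization could look, written as a regular function with a default argument rather than a lambda. This is a sketch: the name `split_stripping` is hypothetical, and stripping each word only once is a design choice of this example, not something the answer prescribes.

```python
import string

def split_stripping(text, ignore=string.punctuation):
    # Strip the unwanted characters from both ends of each
    # whitespace-separated word, stripping each word only once.
    stripped = (word.strip(ignore) for word in text.split())
    # Words made up entirely of ignored characters become '' and are dropped.
    return [word for word in stripped if word]

print(split_stripping("Don't stop - it's well-known!"))
print(split_stripping("xxhixx yybyeyy", ignore="xy"))
```

Because strip only touches the ends of each word, the apostrophes and the inner hyphen survive.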


Answer 18

First of all, compile the pattern with re.compile() before performing a regex operation in a loop; reusing the compiled pattern object is faster than calling the module-level functions each time.

So for your problem, first compile the pattern and then perform the action on it.

import re
DATA = "Hey, you - what are you doing here!?"
reg_tok = re.compile(r"[\w']+")
print(reg_tok.findall(DATA))
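
The benefit only shows up when the pattern is reused, for example across many lines of input. A small sketch of that usage, where the sample `lines` list and the name `WORD_RE` are made up for illustration:

```python
import re

WORD_RE = re.compile(r"[\w']+")

lines = ["Hey, you - what are you doing here!?", "Don't panic."]
# Compile once, then reuse the pattern object for every line.
words_per_line = [WORD_RE.findall(line) for line in lines]
print(words_per_line)
```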

Answer 19

Here is the answer with some explanation.

st = "Hey, you - what are you doing here!?"

# replace every non-alphanumeric character with a space, then join
new_string = ''.join([x if x.isalnum() else ' ' for x in st])
# output of new_string
'Hey  you   what are you doing here  '

# str.split() drops all empty strings when no separator is given
new_list = new_string.split()

# output of new_list
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

# we can join it to get a complete string without any non-alphanumeric characters
' '.join(new_list)
# output
'Hey you what are you doing here'

Or, in one line:

(''.join([x if x.isalnum() else ' ' for x in st])).split()

# output
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']



Answer 20

Create a function that takes as input two strings (the source string to be split and the splitlist string of delimiters) and outputs a list of split words:

def split_string(source, splitlist):
    output = []  # output list of cleaned words
    atsplit = True
    for char in source:
        if char in splitlist:
            atsplit = True
        else:
            if atsplit:
                output.append(char)  # append new word after split
                atsplit = False
            else: 
                output[-1] = output[-1] + char  # continue copying characters until next split
    return output

Answer 21

I like pprzemek’s solution because it does not assume that the delimiters are single characters and it doesn’t try to leverage a regex (which would not work well if the number of separators got to be crazy long).

Here’s a more readable version of the above solution for clarity:

def split_string_on_multiple_separators(input_string, separators):
    buffer = [input_string]
    for sep in separators:
        strings = buffer
        buffer = []  # reset the buffer
        for s in strings:
            buffer = buffer + s.split(sep)

    return buffer

Answer 22

I had the same problem as @ooboo and found this topic. @ghostdog74 inspired me; maybe someone will find my solution useful:

str1='adj:sg:nom:m1.m2.m3:pos'
splitat=':.'
''.join([ s if s not in splitat else ' ' for s in str1]).split()

If you don’t want to split at spaces, substitute some other character for the space and split on that same character.


Answer 23

Here is my go at a split with multiple delimiters:

def msplit(text, delims):
    w = ''
    for z in text:
        if z not in delims:
            w += z
        else:
            if len(w) > 0:
                yield w
            w = ''
    if len(w) > 0:
        yield w

Answer 24

I think the following is the best answer to suit your needs:

\W+ may be suitable for this case, but may not be suitable for other cases.

list(filter(None, re.compile(r'[ ,\-!?]').split("Hey, you - what are you doing here!?")))

Answer 25

Here’s my take on it:

def split_string(source,splitlist):
    splits = frozenset(splitlist)
    l = []
    s1 = ""
    for c in source:
        if c in splits:
            if s1:
                l.append(s1)
                s1 = ""
        else:
            s1 = s1 + c
    if s1:
        l.append(s1)
    return l

>>> out = split_string("First Name,Last Name,Street Address,City,State,Zip Code", ",")
>>> print(out)
['First Name', 'Last Name', 'Street Address', 'City', 'State', 'Zip Code']

Answer 26

I like the replace() way the best. The following procedure changes all separators defined in a string splitlist to the first separator in splitlist and then splits the text on that one separator. It also handles the case where splitlist is an empty string. It returns a list of words, with no empty strings in it.

def split_string(text, splitlist):
    for sep in splitlist:
        text = text.replace(sep, splitlist[0])
    return list(filter(None, text.split(splitlist[0]))) if splitlist else [text]

Answer 27

def get_words(s):
    l = []
    w = ''
    for c in s.lower():
        if c in '-!?,. ':
            if w != '': 
                l.append(w)
            w = ''
        else:
            w = w + c
    if w != '': 
        l.append(w)
    return l

Here is the usage:

>>> s = "Hey, you - what are you doing here!?"
>>> print(get_words(s))
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

Answer 28

If you want a reversible operation (preserve the delimiters), you can use this function:

def tokenizeSentence_Reversible(sentence):
    setOfDelimiters = ['.', ' ', ',', '*', ';', '!']
    listOfTokens = [sentence]

    for delimiter in setOfDelimiters:
        newListOfTokens = []
        for ind, token in enumerate(listOfTokens):
            ll = [([delimiter, w] if ind > 0 else [w]) for ind, w in enumerate(token.split(delimiter))]
            listOfTokens = [item for sublist in ll for item in sublist] # flattens.
            listOfTokens = filter(None, listOfTokens) # Removes empty tokens: ''
            newListOfTokens.extend(listOfTokens)

        listOfTokens = newListOfTokens

    return listOfTokens
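
A shorter alternative is to use re.split with a capturing group, which keeps the delimiters in the result. This is a different technique from the loop above, offered only as a sketch; the names `DELIMS` and `tokenize_reversible` are hypothetical, and the delimiter set is copied from the function above.

```python
import re

DELIMS = ".,*;! "

def tokenize_reversible(sentence):
    # The capturing group makes re.split return the delimiters as tokens too.
    pattern = "([" + re.escape(DELIMS) + "])"
    # Adjacent delimiters leave '' between them; drop those.
    return [tok for tok in re.split(pattern, sentence) if tok]

tokens = tokenize_reversible("Hey, you!")
print(tokens)
print("".join(tokens))
```

Joining the tokens gives back the original sentence, which is what makes the operation reversible.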

Answer 29

I recently needed to do this but wanted a function that somewhat matched the standard library str.split function. This function behaves the same as the standard library one when called with 0 or 1 arguments.

def split_many(string, *separators):
    if len(separators) == 0:
        return string.split()
    if len(separators) > 1:
        table = {
            ord(separator): ord(separators[0])
            for separator in separators
        }
        string = string.translate(table)
    return string.split(separators[0])

NOTE: This function is only useful when your separators consist of a single character (as was my use case).

