Tag archive: regex

Does "\d" in a regular expression mean a digit?

Question: Does "\d" in a regular expression mean a digit?

I found that in 123, \d matches 1 and 3 but not 2. I was wondering what requirement a digit has to satisfy for \d to match it. I am talking about Python-style regex.

The regular expression plugin in Gedit uses Python-style regexes. I created a text file with its content being

123

Only 1 and 3 are matched by the regex \d; 2 is not.

Generally, for a sequence of digits with no other characters in between, only the odd-position digits are matched, and the even-position ones are not. For example, in 12345 the matches are 1, 3 and 5.


Answer 0

[0-9] is not always equivalent to \d. In Python 3, [0-9] matches only the characters 0123456789, while \d matches [0-9] plus any other Unicode decimal digit, for example the Eastern Arabic numerals ٠١٢٣٤٥٦٧٨٩.
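
A quick sketch of this behaviour; the re.ASCII flag is the standard way to restrict \d to [0-9] in Python 3:

>>> import re
>>> re.findall(r'\d', '123 ٤٥٦')            # Unicode decimal digits match by default
['1', '2', '3', '٤', '٥', '٦']
>>> re.findall(r'\d', '123 ٤٥٦', re.ASCII)  # re.ASCII restricts \d to [0-9]
['1', '2', '3']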


Answer 1

\d matches any single digit in most regex grammars, including Python's. Regex Reference


Answer 2

In Python-style regex, \d matches any individual digit. If you’re seeing something that doesn’t seem to do that, please provide the full regex you’re using, as opposed to just describing that one particular symbol.

>>> import re
>>> re.match(r'\d', '3')
<_sre.SRE_Match object at 0x02155B80>
>>> re.match(r'\d', '2')
<_sre.SRE_Match object at 0x02155BB8>
>>> re.match(r'\d', '1')
<_sre.SRE_Match object at 0x02155B80>

Answer 3

\\d{3} matches any sequence of three digits in Java.


Answer 4

This is just a guess, but I think your editor actually matches every single digit — 1 2 3 — but only odd matches are highlighted, to distinguish it from the case when the whole 123 string is matched.

Most regex consoles highlight contiguous matches with different colors, but due to the plugin settings, terminal limitations or for some other reason, only every other group might be highlighted in your case.


Answer 5

Info regarding .NET / C#:

Decimal digit character: \d. It matches any decimal digit and is equivalent to the \p{Nd} regular expression pattern, which includes the standard decimal digits 0-9 as well as the decimal digits of a number of other character sets.

If ECMAScript-compliant behavior is specified, \d is equivalent to [0-9]. For information on ECMAScript regular expressions, see the “ECMAScript Matching Behavior” section in Regular Expression Options.

Info: https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#decimal-digit-character-d


Split a string based on a regular expression

Question: Split a string based on a regular expression

I have the output of a command in tabular form. I'm parsing this output from a result file and storing it in a string. Each element in a row is separated by one or more whitespace characters, so I'm using a regular expression to match one or more spaces and split on them. However, a space element ends up between every two items:

>>> str1="a    b     c      d" # spaces are irregular
>>> str1
'a    b     c      d'
>>> str2=re.split("( )+", str1)
>>> str2
['a', ' ', 'b', ' ', 'c', ' ', 'd'] # 1 space element between!!!

Is there a better way to do this?

After each split str2 is appended to a list.


Answer 0

By using ( ) you are capturing the group; if you simply remove them, you will not have this problem.

>>> str1 = "a    b     c      d"
>>> re.split(" +", str1)
['a', 'b', 'c', 'd']

However there is no need for regex, str.split without any delimiter specified will split this by whitespace for you. This would be the best way in this case.

>>> str1.split()
['a', 'b', 'c', 'd']

If you really want a regex, you can use this (\s represents whitespace, and it's clearer):

>>> re.split(r"\s+", str1)
['a', 'b', 'c', 'd']

or you can find all non-whitespace characters

>>> re.findall(r'\S+',str1)
['a', 'b', 'c', 'd']

Answer 1

The str.split method will automatically remove all white space between items:

>>> str1 = "a    b     c      d"
>>> str1.split()
['a', 'b', 'c', 'd']

Docs are here: http://docs.python.org/library/stdtypes.html#str.split


Answer 2

When you use re.split and the split pattern contains capturing groups, the groups are retained in the output. If you don’t want this, use a non-capturing group instead.
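
A minimal sketch of the difference:

>>> import re
>>> re.split(r"( )+", "a    b")    # capturing group: the separator is kept
['a', ' ', 'b']
>>> re.split(r"(?: )+", "a    b")  # non-capturing group: the separator is dropped
['a', 'b']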


Answer 3

It's very simple, actually. Try this:

str1="a    b     c      d"
splitStr1 = str1.split()
print(splitStr1)

In Python, how do I split a string and keep the separators?

Question: In Python, how do I split a string and keep the separators?

Here’s the simplest way to explain this. Here’s what I’m using:

re.split('\W', 'foo/bar spam\neggs')
-> ['foo', 'bar', 'spam', 'eggs']

Here’s what I want:

someMethod('\W', 'foo/bar spam\neggs')
-> ['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

The reason is that I want to split a string into tokens, manipulate it, then put it back together again.


Answer 0

>>> re.split(r'(\W)', 'foo/bar spam\neggs')
['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

If capturing parentheses are used in the pattern, the text of all groups is also returned as part of the resulting list; that is what keeps the delimiters here.

Answer 1

If you are splitting on newline, use splitlines(True).

>>> 'line 1\nline 2\nline without newline'.splitlines(True)
['line 1\n', 'line 2\n', 'line without newline']

(Not a general solution, but adding this here in case someone comes here not realizing this method existed.)


Answer 2

Another no-regex solution that works well on Python 3

# Split strings and keep separator
test_strings = ['<Hello>', 'Hi', '<Hi> <Planet>', '<', '']

def split_and_keep(s, sep):
   if not s: return [''] # consistent with string.split()

   # Find replacement character that is not used in string
   # i.e. just use the highest available character plus one
   # Note: This fails if ord(max(s)) = 0x10FFFF (ValueError)
   p=chr(ord(max(s))+1) 

   return s.replace(sep, sep+p).split(p)

for s in test_strings:
   print(split_and_keep(s, '<'))


# If the unicode limit is reached it will fail explicitly
unicode_max_char = chr(1114111)
ridiculous_string = '<Hello>'+unicode_max_char+'<World>'
print(split_and_keep(ridiculous_string, '<'))

Answer 3

If you have only 1 separator, you can employ list comprehensions:

text = 'foo,bar,baz,qux'  
sep = ','

Appending/prepending separator:

result = [x+sep for x in text.split(sep)]
#['foo,', 'bar,', 'baz,', 'qux,']
# to get rid of trailing
result[-1] = result[-1].strip(sep)
#['foo,', 'bar,', 'baz,', 'qux']

result = [sep+x for x in text.split(sep)]
#[',foo', ',bar', ',baz', ',qux']
# to get rid of trailing
result[0] = result[0].strip(sep)
#['foo', ',bar', ',baz', ',qux']

Separator as its own element:

result = [u for x in text.split(sep) for u in (x, sep)]
#['foo', ',', 'bar', ',', 'baz', ',', 'qux', ',']
result = result[:-1]    # to get rid of trailing separator

Answer 4

Another example: split on non-alphanumeric characters and keep the separators.

import re
a = "foo,bar@candy*ice%cream"
re.split('([^a-zA-Z0-9])',a)

Output:

['foo', ',', 'bar', '@', 'candy', '*', 'ice', '%', 'cream']

Explanation:

re.split('([^a-zA-Z0-9])',a)

()         <- keep the separators
[]         <- match everything inside the class
^a-zA-Z0-9 <- negated: everything except letters (upper/lower case) and digits

Answer 5

You can also split a string with an array of strings instead of a regular expression, like this:

def tokenizeString(aString, separators):
    #separators is an array of strings that are being used to split the string.
    #sort separators in ascending length, so that below the longest
    #matching separator is assigned last and therefore wins
    separators.sort(key=len)
    listToReturn = []
    i = 0
    while i < len(aString):
        theSeparator = ""
        for current in separators:
            if current == aString[i:i+len(current)]:
                theSeparator = current
        if theSeparator != "":
            listToReturn += [theSeparator]
            i = i + len(theSeparator)
        else:
            if listToReturn == []:
                listToReturn = [""]
            if(listToReturn[-1] in separators):
                listToReturn += [""]
            listToReturn[-1] += aString[i]
            i += 1
    return listToReturn
    

print(tokenizeString(aString = "\"\"\"hi\"\"\" hello + world += (1*2+3/5) '''hi'''", separators = ["'''", '+=', '+', "/", "*", "\\'", '\\"', "-=", "-", " ", '"""', "(", ")"]))

Answer 6

# This keeps all separators  in result 
##########################################################################
import re
st="%%(c+dd+e+f-1523)%%7"
sh=re.compile(r'[+\-/*<>%()]')

def splitStringFull(sh, st):
   ls=sh.split(st)
   lo=[]
   start=0
   for l in ls:
     if not l : continue
     k=st.find(l)
     llen=len(l)
     if k> start:
       tmp= st[start:k]
       lo.append(tmp)
       lo.append(l)
       start = k + llen
     else:
       lo.append(l)
       start =llen
   return lo
  #############################

li = splitStringFull(sh, st)
# ['%%(', 'c', '+', 'dd', '+', 'e', '+', 'f', '-', '1523', ')%%', '7']

Answer 7

One Lazy and Simple Solution

Assume your regex pattern is split_pattern = r'(!|\?)'

First, append some marker string, e.g. '[cut]', after every separator match:

new_string = re.sub(split_pattern, '\\1[cut]', your_string)

Then split on the new marker: new_string.split('[cut]')
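
A small end-to-end sketch of this idea (the sample sentence is made up for illustration):

import re

split_pattern = r'(!|\?)'
your_string = 'Hello! How are you? Fine.'
new_string = re.sub(split_pattern, '\\1[cut]', your_string)  # append the marker after each separator
print(new_string.split('[cut]'))  # ['Hello!', ' How are you?', ' Fine.']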


Answer 8

If one wants to split a string by a regex while keeping the separators, without a capturing group:

def finditer_with_separators(regex, s):
    matches = []
    prev_end = 0
    for match in regex.finditer(s):
        match_start = match.start()
        if (prev_end != 0 or match_start > 0) and match_start != prev_end:
            matches.append(s[prev_end:match.start()])
        matches.append(match.group())
        prev_end = match.end()
    if prev_end < len(s):
        matches.append(s[prev_end:])
    return matches

regex = re.compile(r"[\(\)]")
matches = finditer_with_separators(regex, s)

If one assumes instead that the regex is wrapped in a capturing group:

def split_with_separators(regex, s):
    matches = list(filter(None, regex.split(s)))
    return matches

regex = re.compile(r"([\(\)])")
matches = split_with_separators(regex, s)

Both approaches also remove the empty groups, which are useless and annoying in most cases.


Answer 9

I had a similar issue trying to split a file path and struggled to find a simple answer. This worked for me and didn’t involve having to substitute delimiters back into the split text:

my_path = 'folder1/folder2/folder3/file1'

import re

re.findall('[^/]+/|[^/]+', my_path)

returns:

['folder1/', 'folder2/', 'folder3/', 'file1']


Answer 10

I found this generator based approach more satisfying:

def split_keep(string, sep):
    """Usage:
    >>> list(split_keep("a.b.c.d", "."))
    ['a.', 'b.', 'c.', 'd']
    """
    start = 0
    while True:
        end = string.find(sep, start) + 1
        if end == 0:
            break
        yield string[start:end]
        start = end
    yield string[start:]

It avoids the need to figure out the correct regex and in theory should be fairly cheap: it doesn't create intermediate string objects and delegates most of the iteration work to the efficient find method.

… and in Python 3.8 it can be as short as:

def split_keep(string, sep):
    start = 0
    while (end := string.find(sep, start) + 1) > 0:
        yield string[start:end]
        start = end
    yield string[start:]

Answer 11

  1. Replace every match of separator (\W) with the match followed by a new separator (;).

  2. Split on the new separator (;).

import re

def split_and_keep(separator, s):
  return re.split(';', re.sub(separator, lambda match: match.group() + ';', s))

print(split_and_keep(r'\W', 'foo/bar spam\neggs'))

Answer 12

Here is a simple .split solution that works without regex.

This is an answer for Python split() without removing the delimiter, so it is not exactly what the original post asks, but the other question was closed as a duplicate of this one.

def splitkeep(s, delimiter):
    split = s.split(delimiter)
    return [substr + delimiter for substr in split[:-1]] + [split[-1]]

Random tests:

import random

CHARS = [".", "a", "b", "c"]
assert splitkeep("", "X") == [""]  # 0 length test
for delimiter in ('.', '..'):
    for _ in range(100000):
        length = random.randint(1, 50)
        s = "".join(random.choice(CHARS) for _ in range(length))
        assert "".join(splitkeep(s, delimiter)) == s

How do I replace whitespace with underscores, and vice versa?

Question: How do I replace whitespace with underscores, and vice versa?

I want to replace whitespace with underscore in a string to create nice URLs. So that for example:

"This should be connected" becomes "This_should_be_connected" 

I am using Python with Django. Can this be solved using regular expressions?


Answer 0

You don’t need regular expressions. Python has a built-in string method that does what you need:

mystring.replace(" ", "_")

Answer 1

Replacing spaces is fine, but I might suggest going a little further to handle other URL-hostile characters like question marks, apostrophes, exclamation points, etc.

Also note that the general consensus among SEO experts is that dashes are preferred to underscores in URLs.

import re

def urlify(s):

    # Remove all non-word characters (everything except numbers and letters)
    s = re.sub(r"[^\w\s]", '', s)

    # Replace all runs of whitespace with a single dash
    s = re.sub(r"\s+", '-', s)

    return s

# Prints: I-cant-get-no-satisfaction
print(urlify("I can't get no satisfaction!"))

Answer 2

Django has a ‘slugify’ function which does this, as well as other URL-friendly optimisations. It’s hidden away in the defaultfilters module.

>>> from django.template.defaultfilters import slugify
>>> slugify("This should be connected")

this-should-be-connected

This isn’t exactly the output you asked for, but IMO it’s better for use in URLs.


Answer 3

This takes into account whitespace characters other than the plain space, and I think it's faster than using the re module:

url = "_".join( title.split() )

Answer 4

Using the re module:

import re
re.sub(r'\s+', '_', "This should be connected") # This_should_be_connected
re.sub(r'\s+', '_', 'And     so\tshould this')  # And_so_should_this

Unless you have multiple spaces or other whitespace possibilities as above, you may just wish to use string.replace as others have suggested.


Answer 5

Use the string's replace method:

"this should be connected".replace(" ", "_")

"this_should_be_disconnected".replace("_", " ")


Answer 6

Surprisingly, this library has not been mentioned yet: the Python package named python-slugify, which does a pretty good job of slugifying:

pip install python-slugify

Works like this:

from slugify import slugify

txt = "This is a test ---"
r = slugify(txt)
assert r == "this-is-a-test"

txt = "This -- is a ## test ---"
r = slugify(txt)
assert r == "this-is-a-test"

txt = 'C\'est déjà l\'été.'
r = slugify(txt)
assert r == "cest-deja-lete"

txt = 'Nín hǎo. Wǒ shì zhōng guó rén'
r = slugify(txt)
assert r == "nin-hao-wo-shi-zhong-guo-ren"

txt = 'Компьютер'
r = slugify(txt)
assert r == "kompiuter"

txt = 'jaja---lol-méméméoo--a'
r = slugify(txt)
assert r == "jaja-lol-mememeoo-a"

Answer 7

I’m using the following piece of code for my friendly urls:

from unicodedata import normalize
from re import sub

def slugify(title):
    # decode back to str so replace/lower work on Python 3
    name = normalize('NFKD', title).encode('ascii', 'ignore').decode('ascii').replace(' ', '-').lower()
    #remove `other` characters
    name = sub('[^a-zA-Z0-9_-]', '', name)
    #normalize dashes
    name = sub('-+', '-', name)

    return name

It works fine with Unicode characters as well.


Answer 8

Python has a built in method on strings called replace which is used as so:

string.replace(old, new)

So you would use:

string.replace(" ", "_")

I had this problem a while ago and I wrote code to replace characters in a string. I have to start remembering to check the Python documentation, because it has built-in functions for everything.


Answer 9

The OP is using Python, but here it is in JavaScript (something to be careful of, since the syntaxes are similar):

// only replaces the first instance of ' ' with '_'
"one two three".replace(' ', '_'); 
=> "one_two three"

// replaces all instances of ' ' with '_'
"one two three".replace(/\s/g, '_');
=> "one_two_three"

Answer 10

mystring.replace(" ", "_")

If you assign the returned value to a variable, it will work:

s = mystring.replace(" ", "_")

By itself the call does not change mystring, because strings are immutable and replace returns a new string.


Answer 11

You can try this instead (note that it replaces spaces with dashes, not underscores):

mystring.replace(' ', '-')

Answer 12

perl -e 'map { $on=$_; s/ /_/; rename($on, $_) or warn $!; } <*>;'

A Perl one-liner that matches and replaces space > underscore in the names of all files in the current directory.


Named regex group "(?P<group_name>regexp)": what does "P" stand for?

Question: Named regex group "(?P<group_name>regexp)": what does "P" stand for?

In Python, the (?P<group_name>…) syntax allows one to refer to the matched string through its name:

>>> import re
>>> match = re.search('(?P<name>.*) (?P<phone>.*)', 'John 123456')
>>> match.group('name')
'John'

What does “P” stand for? I could not find any hint in the official documentation.

I would love to get ideas about how to help my students remember this syntax. Knowing what "P" stands for (or might stand for) would be useful.


Answer 0

Since we’re all guessing, I might as well give mine: I’ve always thought it stood for Python. That may sound pretty stupid — what, P for Python?! — but in my defense, I vaguely remembered this thread [emphasis mine]:

Subject: Claiming (?P…) regex syntax extensions

From: Guido van Rossum (gui…@CNRI.Reston.Va.US)

Date: Dec 10, 1997 3:36:19 pm

I have an unusual request for the Perl developers (those that develop the Perl language). I hope this (perl5-porters) is the right list. I am cc’ing the Python string-sig because it is the origin of most of the work I’m discussing here.

You are probably aware of Python. I am Python’s creator; I am planning to release a next “major” version, Python 1.5, by the end of this year. I hope that Python and Perl can co-exist in years to come; cross-pollination can be good for both languages. (I believe Larry had a good look at Python when he added objects to Perl 5; O’Reilly publishes books about both languages.)

As you may know, Python 1.5 adds a new regular expression module that more closely matches Perl’s syntax. We’ve tried to be as close to the Perl syntax as possible within Python’s syntax. However, the regex syntax has some Python-specific extensions, which all begin with (?P . Currently there are two of them:

(?P<foo>...) Similar to regular grouping parentheses, but the text
matched by the group is accessible after the match has been performed, via the symbolic group name “foo”.

(?P=foo) Matches the same string as that matched by the group named “foo”. Equivalent to \1, \2, etc. except that the group is referred
to by name, not number.

I hope that this Python-specific extension won’t conflict with any future Perl extensions to the Perl regex syntax. If you have plans to use (?P, please let us know as soon as possible so we can resolve the conflict. Otherwise, it would be nice if the (?P syntax could be permanently reserved for Python-specific syntax extensions. (Is there some kind of registry of extensions?)

to which Larry Wall replied:

[…] There’s no registry as of now–yours is the first request from outside perl5-porters, so it’s a pretty low-bandwidth activity. (Sorry it was even lower last week–I was off in New York at Internet World.)

Anyway, as far as I’m concerned, you may certainly have ‘P’ with my blessing. (Obviously Perl doesn’t need the ‘P’ at this point. :-) […]

So I don’t know what the original choice of P was motivated by — pattern? placeholder? penguins? — but you can understand why I’ve always associated it with Python. Which considering that (1) I don’t like regular expressions and avoid them wherever possible, and (2) this thread happened fifteen years ago, is kind of odd.


Answer 1

Pattern! The group names a (sub)pattern for later use in the regex. See the documentation here for details about how such groups are used.
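
A minimal sketch of naming a pattern and reusing it later in the same regex via (?P=name):

>>> import re
>>> m = re.search(r'(?P<word>\w+) (?P=word)', 'say hello hello world')
>>> m.group('word')
'hello'
>>> m.group(0)
'hello hello'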


Answer 2

Python extension. From the Python docs:

The solution chosen by the Perl developers was to use (?…) as the extension syntax. ? immediately after a parenthesis was a syntax error because the ? would have nothing to repeat, so this didn’t introduce any compatibility problems. The characters immediately after the ? indicate what extension is being used, so (?=foo) is one thing (a positive lookahead assertion) and (?:foo) is something else (a non-capturing group containing the subexpression foo).

Python supports several of Perl's extensions and adds an extension syntax to Perl's extension syntax. If the first character after the question mark is a P, you know that it's an extension that's specific to Python.

https://docs.python.org/3/howto/regex.html


python re.sub group: number after \number

Question: python re.sub group: number after \number

How can I replace foobar with foo123bar?

This doesn’t work:

>>> re.sub(r'(foo)', r'\1123', 'foobar')
'J3bar'

This works:

>>> re.sub(r'(foo)', r'\1hi', 'foobar')
'foohibar'

I think it’s a common issue when having something like \number. Can anyone give me a hint on how to handle this?


Answer 0

The answer is:

re.sub(r'(foo)', r'\g<1>123', 'foobar')

Relevant excerpt from the docs:

In addition to character escapes and backreferences as described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn't ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
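
For completeness, a sketch of the equivalent named-group replacement, which avoids the numbering ambiguity altogether:

>>> import re
>>> re.sub(r'(?P<pre>foo)', r'\g<pre>123', 'foobar')
'foo123bar'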


Remove all special characters, punctuation and spaces from a string

Question: Remove all special characters, punctuation and spaces from a string

I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.


Answer 0

This can be done without regex:

>>> string = "Special $#! characters   spaces 888323"
>>> ''.join(e for e in string if e.isalnum())
'Specialcharactersspaces888323'

You can use str.isalnum:

S.isalnum() -> bool

Return True if all characters in S are alphanumeric
and there is at least one character in S, False otherwise.

If you insist on using regex, the other solutions will do fine. However, note that if it can be done without a regular expression, that's the best way to go about it.


Answer 1

Here is a regex to match a run of characters that are not letters or numbers:

[^A-Za-z0-9]+

Here is the Python command to do a regex substitution:

re.sub('[^A-Za-z0-9]+', '', mystring)

Answer 2

A shorter way:

import re
cleanString = re.sub(r'\W+', '', string)

If you want spaces between words and numbers, substitute '' with ' '.


Answer 3

After seeing this, I was interested in expanding on the provided answers by finding out which executes in the least amount of time, so I went through and checked some of the proposed answers with timeit against two of the example strings:

  • string1 = 'Special $#! characters spaces 888323'
  • string2 = 'how much for the maple syrup? $20.99? That s ricidulous!!!'

Example 1

''.join(e for e in string if e.isalnum())

  • string1 – Result: 10.7061979771
  • string2 – Result: 7.78372597694

Example 2

import re re.sub('[^A-Za-z0-9]+', '', string)

  • string1 – Result: 7.10785102844
  • string2 – Result: 4.12814903259

Example 3

import re re.sub('\W+','', string)

  • string1 – Result: 3.11899876595
  • string2 – Result: 2.78014397621

Each result above is the lowest value returned from timeit's repeat(3, 2000000).

Example 3 can be 3x faster than Example 1.
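
A sketch of how such timings can be reproduced (the iteration count is scaled down here, and absolute numbers will vary by machine):

import timeit

setup = "import re; string = 'Special $#! characters   spaces 888323'"
print(min(timeit.repeat("''.join(e for e in string if e.isalnum())", setup=setup, repeat=3, number=200000)))
print(min(timeit.repeat("re.sub('[^A-Za-z0-9]+', '', string)", setup=setup, repeat=3, number=200000)))
print(min(timeit.repeat(r"re.sub(r'\W+', '', string)", setup=setup, repeat=3, number=200000)))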


Answer 4

Python 2.*

I think just filter(str.isalnum, string) works

In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
Out[20]: 'stringwithspecialcharslikeetcs'

Python 3.*

In Python 3, the filter() function returns an iterable object (instead of a string, unlike the above). One has to join it back to get a string from the iterable:

''.join(filter(str.isalnum, string)) 

or pass a list to join (not sure, but it can be a bit faster):

''.join([*filter(str.isalnum, string)])

Note: unpacking with [*args] is valid for Python >= 3.5.


Answer 5

#!/usr/bin/python
import re

strs = "how much for the maple syrup? $20.99? That's ricidulous!!!"
print(strs)
nstr = re.sub(r'[?$.!]', r'', strs)  # inside [...] a '|' is a literal pipe, so it is omitted
print(nstr)
nestr = re.sub(r'[^a-zA-Z0-9 ]', r'', nstr)
print(nestr)

You can add more special characters to the class; replacing them with '' (nothing) removes them.


Answer 6

Unlike everyone else who used regex here, I would try to exclude every character that is not what I want, instead of enumerating explicitly what I don't want.

For example, if I want only characters from ‘a to z’ (upper and lower case) and numbers, I would exclude everything else:

import re
s = re.sub(r"[^a-zA-Z0-9]","",s)

This means “substitute every character that is not a number, or a character in the range ‘a to z’ or ‘A to Z’ with an empty string”.

In fact, if you insert the special character ^ in the first position of your character class, you get the negation.

Extra tip: if you also need to lowercase the result, you can make the regex even faster and simpler, because the pattern no longer has to allow for uppercase letters.

import re
s = re.sub(r"[^a-z0-9]","",s.lower())

Answer 7

Assuming you want to use a regex and you want/need Unicode-cognisant 2.x code that is 2to3-ready:

>>> import re
>>> rx = re.compile(u'[\W_]+', re.UNICODE)
>>> data = u''.join(unichr(i) for i in range(256))
>>> rx.sub(u'', data)
u'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb2 [snip] \xfe\xff'
>>>

Answer 8

s = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", s)

Answer 9

The most generic approach is using the ‘categories’ of the unicodedata table which classifies every single character. E.g. the following code filters only printable characters based on their category:

import unicodedata
# strip of crap characters (based on the Unicode database
# categorization:
# http://www.sql-und-xml.de/unicode-database/#kategorien

PRINTABLE = set(('Lu', 'Ll', 'Nd', 'Zs'))

def filter_non_printable(s):
    result = []
    for c in s:
        c = unicodedata.category(c) in PRINTABLE and c or u'#'
        result.append(c)
    return u''.join(result).replace(u'#', u' ')

Look at the given URL above for all related categories. You also can of course filter by the punctuation categories.


Answer 10

string.punctuation contains the following characters:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

You can use translate and maketrans functions to map punctuations to empty values (replace)

import string

'This, is. A test!'.translate(str.maketrans('', '', string.punctuation))

Output:

'This is A test'

Answer 11

Use translate:

import string

def clean(instr):
    return instr.translate(None, string.punctuation + ' ')

Caveat: this only works on Python 2 (ASCII) strings; the two-argument form of str.translate was removed in Python 3.
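
For reference, a Python 3 sketch of the same idea (my addition, not part of the original answer), using str.maketrans:

import string

def clean(instr):
    # the third maketrans argument lists characters to delete
    return instr.translate(str.maketrans('', '', string.punctuation + ' '))

print(clean("Hello, world! 123"))  # Helloworld123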


Answer 12

import re
my_string = """Strings are amongst the most popular data types in Python. We can create the strings by enclosing characters in quotes. Python treats single quotes the same as double quotes."""

# if we need to count the word Python whether or not it ends with ',' or '.'

text = my_string.split()
count = 0
for i in text:
    if i.endswith("."):
        # strip a single trailing punctuation character from the word
        text[count] = re.sub(r"^([A-Za-z]+)(.)?$", r"\1", i)
    count += 1
print("The count of Python : ", text.count("Python"))

Answer 13

abc = "askhnl#$%askdjalsdk"
ddd = abc.replace("#$%", "")
print(ddd)

and you shall see your result as

'askhnlaskdjalsdk'


Answer 14

Removing Punctuations, Numbers, and Special Characters

Example (on a pandas string column):

Code

combi['tidy_tweet'] = combi['tidy_tweet'].str.replace("[^a-zA-Z#]", " ") 

Result:

Thanks :)


How do I use a variable inside a regex?

Question: How do I use a variable inside a regex?

I'd like to use a variable inside a regex; how can I do this in Python?

TEXTO = sys.argv[1]

if re.search(r"\b(?=\w)TEXTO\b(?!\w)", subject, re.IGNORECASE):
    # Successful match
else:
    # Match attempt failed

Answer 0

From Python 3.6 on, you can also use literal string interpolation ("f-strings"). In your particular case the solution would be:

if re.search(rf"\b(?=\w){TEXTO}\b(?!\w)", subject, re.IGNORECASE):
    ...do something

EDIT:

Since there have been some questions in the comment on how to deal with special characters I’d like to extend my answer:

raw strings (‘r’):

One of the main concepts you have to understand when dealing with special characters in regular expressions is to distinguish between string literals and the regular expression itself. It is very well explained here:

In short:

Let's say that instead of finding a word boundary \b after TEXTO you want to match the string \boundary. Then you have to write:

TEXTO = "Var"
subject = r"Var\boundary"

if re.search(rf"\b(?=\w){TEXTO}\\boundary(?!\w)", subject, re.IGNORECASE):
    print("match")

This only works because we are using a raw string (the regex is preceded by 'r'); otherwise we would have to write "\\\\boundary" in the regex (four backslashes). Additionally, without 'r', \b would no longer be converted to a word boundary but to a backspace!

re.escape:

re.escape basically puts a backslash in front of any special character. Hence, if you expect a special character in TEXTO, you need to write:

if re.search(rf"\b(?=\w){re.escape(TEXTO)}\b(?!\w)", subject, re.IGNORECASE):
    print("match")

NOTE: For any version >= Python 3.7: !, ", %, ', ,, /, :, ;, <, =, >, @, and ` are not escaped. Only special characters with meaning in a regex are still escaped. _ has not been escaped since Python 3.3. (See here.)

Curly braces:

If you want to use quantifiers within the regular expression using f-strings, you have to use double curly braces. Let’s say you want to match TEXTO followed by exactly 2 digits:

if re.search(rf"\b(?=\w){re.escape(TEXTO)}\d{{2}}\b(?!\w)", subject, re.IGNORECASE):
    print("match")

Answer 1

You have to build the regex as a string:

TEXTO = sys.argv[1]
my_regex = r"\b(?=\w)" + re.escape(TEXTO) + r"\b(?!\w)"

if re.search(my_regex, subject, re.IGNORECASE):
    etc.

Note the use of re.escape so that if your text has special characters, they won’t be interpreted as such.


Answer 2

if re.search(r"\b(?<=\w)%s\b(?!\w)" % TEXTO, subject, re.IGNORECASE):

This will insert what is in TEXTO into the regex as a string.


Answer 3

rx = r'\b(?<=\w){0}\b(?!\w)'.format(TEXTO)

Answer 4

I find it very convenient to build a regular expression pattern by stringing together multiple smaller patterns.

import re

string = "begin:id1:tag:middl:id2:tag:id3:end"
re_str1 = r'(?<=(\S{5})):'
re_str2 = r'(id\d+):(?=tag:)'
re_pattern = re.compile(re_str1 + re_str2)
match = re_pattern.findall(string)
print(match)

Output:

[('begin', 'id1'), ('middl', 'id2')]

Answer 5

I agree with all the above unless:

sys.argv[1] was something like Chicken\d{2}-\d{2}An\s*important\s*anchor

sys.argv[1] = "Chicken\d{2}-\d{2}An\s*important\s*anchor"

you would not want to use re.escape, because in that case you would like it to behave like a regex:

TEXTO = sys.argv[1]

if re.search(r"\b(?<=\w)" + TEXTO + r"\b(?!\w)", subject, re.IGNORECASE):
    # Successful match
else:
    # Match attempt failed

Answer 6

I needed to search for usernames that are similar to each other, and what Ned Batchelder said was incredibly helpful. However, I found I had cleaner output when I used re.compile to create my re search term:

pattern = re.compile(r"(" + username + r".*):(.*?):(.*?):(.*?):(.*)")
matches = re.findall(pattern, lines)

Output can be printed using the following:

print(matches[1]) # prints one whole matching line (in this case, the first line)
print(matches[1][3]) # prints the fourth character group (established with the parentheses in the regex statement) of the first line.

Answer 7

You can try another usage, building the pattern with format syntax sugar:

re_genre = r'{}'.format(your_variable)
regex_pattern = re.compile(re_genre)  

Answer 8

You can use the format method for this as well. It replaces the {} placeholder with the variable you pass to format as an argument.

if re.search(r"\b(?=\w){}\b(?!\w)".format(TEXTO), subject, re.IGNORECASE):
    # Successful match
else:
    # Match attempt failed

Answer 9

One more example.

I have configus.yml with the flow files:

"pattern":
  - _(\d{14})_
"datetime_string":
  - "%m%d%Y%H%M%f"

in python code I use

data_time_real_file=re.findall(r""+flows[flow]["pattern"][0]+"", latest_file)

Escaping regex strings in Python

Question: Escaping regex strings in Python

I want to use input from a user as a regex pattern for a search over some text. It works, but how can I handle cases where the user puts in characters that have a special meaning in a regex?

For example, the user wants to search for Word (s): the regex engine will take (s) as a group. I want it treated as the literal string "(s)". I can run replace on the user input and replace ( with \( and ) with \), but the problem is that I would need to do a replace for every possible regex symbol.

Do you know some better way ?


Answer 0

Use the re.escape() function for this:

4.2.3 re Module Contents

escape(string)

Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

A simplistic example: search for any occurrence of the provided string, optionally followed by 's', and return the match object.

def simplistic_plural(word, text):
    word_or_plural = re.escape(word) + 's?'
    return re.match(word_or_plural, text)
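
A usage sketch with a metacharacter-laden input (sample strings made up; the match repr is as shown on Python 3.7+):

>>> import re
>>> re.escape('Word (s)')
'Word\\ \\(s\\)'
>>> simplistic_plural('Word (s)', 'Word (s)s in text')
<re.Match object; span=(0, 9), match='Word (s)s'>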

Answer 1

You can use re.escape():

re.escape(string) Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

>>> import re
>>> re.escape('^a.*$')
'\\^a\\.\\*\\$'

If you are using a Python version < 3.7, this will escape non-alphanumerics that are not part of regular expression syntax as well.

If you are using a Python version < 3.7 but >= 3.3, this will escape non-alphanumerics that are not part of regular expression syntax, except for specifically underscore (_).


Answer 2

Unfortunately, re.escape() is not suited for the replacement string:

>>> re.sub('a', re.escape('_'), 'aa')
'\\_\\_'

A solution is to put the replacement in a lambda:

>>> re.sub('a', lambda _: '_', 'aa')
'__'

because the return value of the lambda is treated by re.sub() as a literal string.


Answer 3

Give this a try:

\Q and \E as anchors

Put an OR condition in to match either a full word or a regex.

Reference: How to match a whole word that includes special characters in regex


Splitting a string by spaces in Python, preserving quoted substrings

Question: Splitting a string by spaces in Python, preserving quoted substrings

I have a string which is like this:

this is "a test"

I’m trying to write something in Python to split it up by space while ignoring spaces within quotes. The result I’m looking for is:

['this','is','a test']

PS. I know you are going to ask "what happens if there are quotes within the quotes?" Well, in my application, that will never happen.


Answer 0

You want split, from the built-in shlex module.

>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']

This should do exactly what you want.


Answer 1

Have a look at the shlex module, particularly shlex.split.

>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']

Answer 2

I see regex approaches here that look complex and/or wrong. This surprises me, because regex syntax can easily describe “whitespace or thing-surrounded-by-quotes”, and most regex engines (including Python’s) can split on a regex. So if you’re going to use regexes, why not just say exactly what you mean?:

test = 'this is "a test"'  # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]

Explanation:

[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators

shlex probably provides more features, though.
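
For reference, a run of the suggested one-liner on the question's input; note that, unlike shlex, this variant keeps the surrounding quotes:

>>> import re
>>> test = 'this is "a test"'
>>> [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]
['this', 'is', '"a test"']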


回答 3

根据您的用例,您可能还想看看 csv 模块:

import csv
lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print(row)

输出:

['this', 'is', 'a string']
['and', 'more', 'stuff']

Depending on your use case, you may also want to check out the csv module:

import csv
lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print(row)

Output:

['this', 'is', 'a string']
['and', 'more', 'stuff']
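
One caveat worth knowing: with delimiter=" ", every space is a delimiter, so runs of spaces produce empty fields. The standard skipinitialspace option collapses them; a minimal sketch:

>>> import csv
>>> next(csv.reader(['this  is "a test"'], delimiter=' '))
['this', '', 'is', 'a test']
>>> next(csv.reader(['this  is "a test"'], delimiter=' ', skipinitialspace=True))
['this', 'is', 'a test']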

回答 4

我用 shlex.split 处理 70,000,000 行 Squid 日志,它太慢了,所以我改用了 re。

如果shlex有性能问题,请尝试此操作。

import re

def line_split(line):
    return re.findall(r'[^"\s]\S*|".+?"', line)

I used shlex.split to process 70,000,000 lines of Squid logs and it was very slow, so I switched to re.

Please try this, if you have performance problem with shlex.

import re

def line_split(line):
    return re.findall(r'[^"\s]\S*|".+?"', line)
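
At that volume it may also help to compile the pattern once instead of relying on re's internal cache; a sketch of the same splitter with re.compile:

import re

_TOKEN = re.compile(r'[^"\s]\S*|".+?"')  # compiled once, reused for every line

def line_split(line):
    return _TOKEN.findall(line)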

回答 5

由于此问题带有 regex 标签,我决定尝试一种正则表达式方法。我先把引号部分中的所有空格替换为 \x00,然后按空格拆分,再把每个部分中的 \x00 替换回空格。

两个版本做的事情相同,但 splitter 比 splitter2 更具可读性。

import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print splitter2(s)

Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces in the quotes parts with \x00, then split by spaces, then replace the \x00 back to spaces in each part.

Both versions do the same thing, but splitter is a bit more readable then splitter2.

import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print splitter2(s)
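
For reference, both versions keep the quote characters around the quoted parts; a sample run on the answer's own test string:

>>> splitter(s)
['this', 'is', '"a test"', 'some', 'text', '"another test"']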

回答 6

出于性能方面的考虑,re 似乎更快。下面是我的解决方案,它使用非贪婪运算符并保留外层引号:

re.findall("(?:\".*?\"|\S)+", s)

结果:

['this', 'is', '"a test"']

它会把 aaa"bla blub"bbb 这样的结构保留为一个整体,因为这些标记之间没有空格分隔。如果字符串包含转义字符,可以这样匹配:

>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall("(?:\".*?[^\\\\]\"|\S)+", a): print(i)
...
She
said
"He said, \"My name is Mark.\""

请注意,借助模式中的 \S 部分,它也会匹配空字符串 ""。

It seems that for performance reasons re is faster. Here is my solution using a least greedy operator that preserves the outer quotes:

re.findall("(?:\".*?\"|\S)+", s)

Result:

['this', 'is', '"a test"']

It leaves constructs like aaa"bla blub"bbb together as these tokens are not separated by spaces. If the string contains escaped characters, you can match like that:

>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall("(?:\".*?[^\\\\]\"|\S)+", a): print(i)
...
She
said
"He said, \"My name is Mark.\""

Please note that this also matches the empty string "" by means of the \S part of the pattern.


回答 7

被接受的 shlex 方法的主要问题在于:它不会忽略引号子字符串之外的转义字符,并且在某些极端情况下会产生略微出乎意料的结果。

我有以下用例:我需要一个拆分函数,它在拆分输入字符串时保留单引号或双引号括起的子字符串,并支持在这类子字符串内转义引号;无引号部分中的引号则与其他字符同等对待。下面是一些带有预期输出的示例测试用例:

 输入字符串          | 预期输出
===============================================
 'abc def'           | ['abc', 'def']
 "abc \\s def"       | ['abc', '\\s', 'def']
 '"abc def" ghi'     | ['abc def', 'ghi']
 "'abc def' ghi"     | ['abc def', 'ghi']
 '"abc \\" def" ghi' | ['abc " def', 'ghi']
 "'abc \\' def' ghi" | ["abc ' def", 'ghi']
 "'abc \\s def' ghi" | ['abc \\s def', 'ghi']
 '"abc \\s def" ghi' | ['abc \\s def', 'ghi']
 '"" test'           | ['', 'test']
 "'' test"           | ['', 'test']
 "abc'def"           | ["abc'def"]
 "abc'def'"          | ["abc'def'"]
 "abc'def' ghi"      | ["abc'def'", 'ghi']
 "abc'def'ghi"       | ["abc'def'ghi"]
 'abc"def'           | ['abc"def']
 'abc"def"'          | ['abc"def"']
 'abc"def" ghi'      | ['abc"def"', 'ghi']
 'abc"def"ghi'       | ['abc"def"ghi']
 "r'AA' r'.*_xyz$'"  | ["r'AA'", "r'.*_xyz$'"]
我最终写出了下面这个拆分函数,它对所有输入字符串都能得到预期的输出:

import re

def quoted_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") \
            for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

下面的测试程序检查了其他方法(目前是 shlex 和 csv)以及自定义拆分实现的结果:

#!/bin/python2.7

import csv
import re
import shlex

from timeit import timeit

def test_case(fn, s, expected):
    try:
        if fn(s) == expected:
            print '[ OK ] %s -> %s' % (s, fn(s))
        else:
            print '[FAIL] %s -> %s' % (s, fn(s))
    except Exception as e:
        print '[FAIL] %s -> exception: %s' % (s, e)

def test_case_no_output(fn, s, expected):
    try:
        fn(s)
    except:
        pass

def test_split(fn, test_case_fn=test_case):
    test_case_fn(fn, 'abc def', ['abc', 'def'])
    test_case_fn(fn, "abc \\s def", ['abc', '\\s', 'def'])
    test_case_fn(fn, '"abc def" ghi', ['abc def', 'ghi'])
    test_case_fn(fn, "'abc def' ghi", ['abc def', 'ghi'])
    test_case_fn(fn, '"abc \\" def" ghi', ['abc " def', 'ghi'])
    test_case_fn(fn, "'abc \\' def' ghi", ["abc ' def", 'ghi'])
    test_case_fn(fn, "'abc \\s def' ghi", ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"abc \\s def" ghi', ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"" test', ['', 'test'])
    test_case_fn(fn, "'' test", ['', 'test'])
    test_case_fn(fn, "abc'def", ["abc'def"])
    test_case_fn(fn, "abc'def'", ["abc'def'"])
    test_case_fn(fn, "abc'def' ghi", ["abc'def'", 'ghi'])
    test_case_fn(fn, "abc'def'ghi", ["abc'def'ghi"])
    test_case_fn(fn, 'abc"def', ['abc"def'])
    test_case_fn(fn, 'abc"def"', ['abc"def"'])
    test_case_fn(fn, 'abc"def" ghi', ['abc"def"', 'ghi'])
    test_case_fn(fn, 'abc"def"ghi', ['abc"def"ghi'])
    test_case_fn(fn, "r'AA' r'.*_xyz$'", ["r'AA'", "r'.*_xyz$'"])

def csv_split(s):
    return list(csv.reader([s], delimiter=' '))[0]

def re_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

if __name__ == '__main__':
    print 'shlex\n'
    test_split(shlex.split)
    print

    print 'csv\n'
    test_split(csv_split)
    print

    print 're\n'
    test_split(re_split)
    print

    iterations = 100
    setup = 'from __main__ import test_split, test_case_no_output, csv_split, re_split\nimport shlex, re'
    def benchmark(method, code):
        print '%s: %.3fms per iteration' % (method, (1000 * timeit(code, setup=setup, number=iterations) / iterations))
    benchmark('shlex', 'test_split(shlex.split, test_case_no_output)')
    benchmark('csv', 'test_split(csv_split, test_case_no_output)')
    benchmark('re', 'test_split(re_split, test_case_no_output)')

输出:

shlex

[ OK ] abc def -> ['abc', 'def']
[FAIL] abc \s def -> ['abc', 's', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[FAIL] 'abc \' def' ghi -> exception: No closing quotation
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[FAIL] abc'def -> exception: No closing quotation
[FAIL] abc'def' -> ['abcdef']
[FAIL] abc'def' ghi -> ['abcdef', 'ghi']
[FAIL] abc'def'ghi -> ['abcdefghi']
[FAIL] abc"def -> exception: No closing quotation
[FAIL] abc"def" -> ['abcdef']
[FAIL] abc"def" ghi -> ['abcdef', 'ghi']
[FAIL] abc"def"ghi -> ['abcdefghi']
[FAIL] r'AA' r'.*_xyz$' -> ['rAA', 'r.*_xyz$']

csv

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[FAIL] 'abc def' ghi -> ["'abc", "def'", 'ghi']
[FAIL] "abc \" def" ghi -> ['abc \\', 'def"', 'ghi']
[FAIL] 'abc \' def' ghi -> ["'abc", "\\'", "def'", 'ghi']
[FAIL] 'abc \s def' ghi -> ["'abc", '\\s', "def'", 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[FAIL] '' test -> ["''", 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

re

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[ OK ] 'abc \' def' ghi -> ["abc ' def", 'ghi']
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

shlex: 每次迭代 0.281ms
csv: 每次迭代 0.030ms
re: 每次迭代 0.049ms

因此,它的性能比 shlex 好得多,并且还可以通过预编译正则表达式进一步提升;在那种情况下,它会胜过 csv 方法。

The main problem with the accepted shlex approach is that it does not ignore escape characters outside quoted substrings, and gives slightly unexpected results in some corner cases.

I have the following use case, where I need a split function that splits input strings such that either single-quoted or double-quoted substrings are preserved, with the ability to escape quotes within such a substring. Quotes within an unquoted string should not be treated differently from any other character. Some example test cases with the expected output:

 input string        | expected output
===============================================
 'abc def'           | ['abc', 'def']
 "abc \\s def"       | ['abc', '\\s', 'def']
 '"abc def" ghi'     | ['abc def', 'ghi']
 "'abc def' ghi"     | ['abc def', 'ghi']
 '"abc \\" def" ghi' | ['abc " def', 'ghi']
 "'abc \\' def' ghi" | ["abc ' def", 'ghi']
 "'abc \\s def' ghi" | ['abc \\s def', 'ghi']
 '"abc \\s def" ghi' | ['abc \\s def', 'ghi']
 '"" test'           | ['', 'test']
 "'' test"           | ['', 'test']
 "abc'def"           | ["abc'def"]
 "abc'def'"          | ["abc'def'"]
 "abc'def' ghi"      | ["abc'def'", 'ghi']
 "abc'def'ghi"       | ["abc'def'ghi"]
 'abc"def'           | ['abc"def']
 'abc"def"'          | ['abc"def"']
 'abc"def" ghi'      | ['abc"def"', 'ghi']
 'abc"def"ghi'       | ['abc"def"ghi']
 "r'AA' r'.*_xyz$'"  | ["r'AA'", "r'.*_xyz$'"]

I ended up with the following function to split a string such that the expected output results for all input strings:

import re

def quoted_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") \
            for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

The following test application checks the results of other approaches (shlex and csv for now) and the custom split implementation:

#!/bin/python2.7

import csv
import re
import shlex

from timeit import timeit

def test_case(fn, s, expected):
    try:
        if fn(s) == expected:
            print '[ OK ] %s -> %s' % (s, fn(s))
        else:
            print '[FAIL] %s -> %s' % (s, fn(s))
    except Exception as e:
        print '[FAIL] %s -> exception: %s' % (s, e)

def test_case_no_output(fn, s, expected):
    try:
        fn(s)
    except:
        pass

def test_split(fn, test_case_fn=test_case):
    test_case_fn(fn, 'abc def', ['abc', 'def'])
    test_case_fn(fn, "abc \\s def", ['abc', '\\s', 'def'])
    test_case_fn(fn, '"abc def" ghi', ['abc def', 'ghi'])
    test_case_fn(fn, "'abc def' ghi", ['abc def', 'ghi'])
    test_case_fn(fn, '"abc \\" def" ghi', ['abc " def', 'ghi'])
    test_case_fn(fn, "'abc \\' def' ghi", ["abc ' def", 'ghi'])
    test_case_fn(fn, "'abc \\s def' ghi", ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"abc \\s def" ghi', ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"" test', ['', 'test'])
    test_case_fn(fn, "'' test", ['', 'test'])
    test_case_fn(fn, "abc'def", ["abc'def"])
    test_case_fn(fn, "abc'def'", ["abc'def'"])
    test_case_fn(fn, "abc'def' ghi", ["abc'def'", 'ghi'])
    test_case_fn(fn, "abc'def'ghi", ["abc'def'ghi"])
    test_case_fn(fn, 'abc"def', ['abc"def'])
    test_case_fn(fn, 'abc"def"', ['abc"def"'])
    test_case_fn(fn, 'abc"def" ghi', ['abc"def"', 'ghi'])
    test_case_fn(fn, 'abc"def"ghi', ['abc"def"ghi'])
    test_case_fn(fn, "r'AA' r'.*_xyz$'", ["r'AA'", "r'.*_xyz$'"])

def csv_split(s):
    return list(csv.reader([s], delimiter=' '))[0]

def re_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

if __name__ == '__main__':
    print 'shlex\n'
    test_split(shlex.split)
    print

    print 'csv\n'
    test_split(csv_split)
    print

    print 're\n'
    test_split(re_split)
    print

    iterations = 100
    setup = 'from __main__ import test_split, test_case_no_output, csv_split, re_split\nimport shlex, re'
    def benchmark(method, code):
        print '%s: %.3fms per iteration' % (method, (1000 * timeit(code, setup=setup, number=iterations) / iterations))
    benchmark('shlex', 'test_split(shlex.split, test_case_no_output)')
    benchmark('csv', 'test_split(csv_split, test_case_no_output)')
    benchmark('re', 'test_split(re_split, test_case_no_output)')

Output:

shlex

[ OK ] abc def -> ['abc', 'def']
[FAIL] abc \s def -> ['abc', 's', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[FAIL] 'abc \' def' ghi -> exception: No closing quotation
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[FAIL] abc'def -> exception: No closing quotation
[FAIL] abc'def' -> ['abcdef']
[FAIL] abc'def' ghi -> ['abcdef', 'ghi']
[FAIL] abc'def'ghi -> ['abcdefghi']
[FAIL] abc"def -> exception: No closing quotation
[FAIL] abc"def" -> ['abcdef']
[FAIL] abc"def" ghi -> ['abcdef', 'ghi']
[FAIL] abc"def"ghi -> ['abcdefghi']
[FAIL] r'AA' r'.*_xyz$' -> ['rAA', 'r.*_xyz$']

csv

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[FAIL] 'abc def' ghi -> ["'abc", "def'", 'ghi']
[FAIL] "abc \" def" ghi -> ['abc \\', 'def"', 'ghi']
[FAIL] 'abc \' def' ghi -> ["'abc", "\\'", "def'", 'ghi']
[FAIL] 'abc \s def' ghi -> ["'abc", '\\s', "def'", 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[FAIL] '' test -> ["''", 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

re

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[ OK ] 'abc \' def' ghi -> ["abc ' def", 'ghi']
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

shlex: 0.281ms per iteration
csv: 0.030ms per iteration
re: 0.049ms per iteration

So performance is much better than shlex, and can be improved further by precompiling the regular expression, in which case it will outperform the csv approach.
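
A sketch of that last suggestion, hoisting the compiled pattern out of the function so it is built only once:

import re

_QUOTED = re.compile(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+')  # compiled once

def quoted_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'")
            for p in _QUOTED.findall(s)]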


回答 8

要保留引号,请使用以下函数:

def getArgs(s):
    args = []
    cur = ''
    inQuotes = 0
    for char in s.strip():
        if char == ' ' and not inQuotes:
            args.append(cur)
            cur = ''
        elif char == '"' and not inQuotes:
            inQuotes = 1
            cur += char
        elif char == '"' and inQuotes:
            inQuotes = 0
            cur += char
        else:
            cur += char
    args.append(cur)
    return args

To preserve quotes use this function:

def getArgs(s):
    args = []
    cur = ''
    inQuotes = 0
    for char in s.strip():
        if char == ' ' and not inQuotes:
            args.append(cur)
            cur = ''
        elif char == '"' and not inQuotes:
            inQuotes = 1
            cur += char
        elif char == '"' and inQuotes:
            inQuotes = 0
            cur += char
        else:
            cur += char
    args.append(cur)
    return args
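
A sample run, showing that the quote characters are retained in the output:

>>> getArgs('this is "a test"')
['this', 'is', '"a test"']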

回答 9

不同答案的速度测试:

import re
import shlex
import csv

line = 'this is "a test"'

%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop

%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop

%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop

%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop

Speed test of different answers:

import re
import shlex
import csv

line = 'this is "a test"'

%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop

%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop

%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop

%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop

回答 10

嗯,似乎找不到“Reply”按钮……无论如何,此答案基于 Kate 的方法,但能正确拆分包含转义引号子字符串的字符串,并且还会去掉子字符串首尾的引号:

  [i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

这适用于类似'This is " a \\\"test\\\"\\\'s substring"'的字符串(不幸的是,必须使用疯狂的标记来防止Python删除转义符)。

如果不需要返回列表中的字符串中的结果转义符,则可以使用此函数的稍有改动的版本:

[i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

Hmm, can’t seem to find the “Reply” button… anyway, this answer is based on the approach by Kate, but correctly splits strings with substrings containing escaped quotes and also removes the start and end quotes of the substrings:

  [i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

This works on strings like 'This is " a \\\"test\\\"\\\'s substring"' (the insane markup is unfortunately necessary to keep Python from removing the escapes).

If the resulting escapes in the strings in the returned list are not wanted, you can use this slightly altered version of the function:

[i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]
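
Note that decode('string_escape') is Python 2 only; on Python 3 a rough equivalent for ASCII/latin-1 text is to round-trip through unicode_escape (this loses non-latin-1 characters, so treat it as a sketch):

>>> s = 'a \\"test\\" here'
>>> s.encode('latin-1').decode('unicode_escape')
'a "test" here'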

回答 11

为了解决某些Python 2版本中的unicode问题,我建议:

from shlex import split as _split
split = lambda a: [b.decode('utf-8') for b in _split(a.encode('utf-8'))]

To get around the unicode issues in some Python 2 versions, I suggest:

from shlex import split as _split
split = lambda a: [b.decode('utf-8') for b in _split(a.encode('utf-8'))]

回答 12

作为一种选择,尝试tssplit:

In [1]: from tssplit import tssplit
In [2]: tssplit('this is "a test"', quote='"', delimiter='')
Out[2]: ['this', 'is', 'a test']

As an option try tssplit:

In [1]: from tssplit import tssplit
In [2]: tssplit('this is "a test"', quote='"', delimiter='')
Out[2]: ['this', 'is', 'a test']

回答 13

我建议:

测试字符串:

s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''

若还要捕获 "" 和 '':

import re
re.findall(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+',s)

结果:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz', '""', "''"]

忽略空的 "" 和 '':

import re
re.findall(r'"[^"]+"|\'[^\']+\'|[^"\'\s]+',s)

结果:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz']

I suggest:

test string:

s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''

to capture also "" and '':

import re
re.findall(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+',s)

result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz', '""', "''"]

to ignore empty "" and '':

import re
re.findall(r'"[^"]+"|\'[^\']+\'|[^"\'\s]+',s)

result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz']

回答 14

如果您不关心带引号的子字符串,那么一个简单的拆分就够了:

>>> 'a short sized string with spaces '.split()

性能:

>>> s = " ('a short sized string with spaces '*100).split() "
>>> t = timeit.Timer(stmt=s)
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
171.39 usec/pass

或字符串模块

>>> from string import split as stringsplit; 
>>> stringsplit('a short sized string with spaces '*100)

性能:字符串模块似乎比字符串方法的性能更好

>>> s = "stringsplit('a short sized string with spaces '*100)"
>>> t = timeit.Timer(s, "from string import split as stringsplit")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
154.88 usec/pass

或者您可以使用RE引擎

>>> from re import split as resplit
>>> regex = '\s+'
>>> medstring = 'a short sized string with spaces '*100
>>> resplit(regex, medstring)

性能

>>> s = "resplit(regex, medstring)"
>>> t = timeit.Timer(s, "from re import split as resplit; regex='\s+'; medstring='a short sized string with spaces '*100")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
540.21 usec/pass

对于非常长的字符串,您不应将整个字符串加载到内存中,而应拆分行或使用迭代循环

If you don’t care about sub strings than a simple

>>> 'a short sized string with spaces '.split()

Performance:

>>> s = " ('a short sized string with spaces '*100).split() "
>>> t = timeit.Timer(stmt=s)
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
171.39 usec/pass

Or string module

>>> from string import split as stringsplit; 
>>> stringsplit('a short sized string with spaces '*100)

Performance: String module seems to perform better than string methods

>>> s = "stringsplit('a short sized string with spaces '*100)"
>>> t = timeit.Timer(s, "from string import split as stringsplit")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
154.88 usec/pass

Or you can use RE engine

>>> from re import split as resplit
>>> regex = '\s+'
>>> medstring = 'a short sized string with spaces '*100
>>> resplit(regex, medstring)

Performance

>>> s = "resplit(regex, medstring)"
>>> t = timeit.Timer(s, "from re import split as resplit; regex='\s+'; medstring='a short sized string with spaces '*100")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
540.21 usec/pass

For very long strings you should not load the entire string into memory and instead either split the lines or use an iterative loop
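
A minimal sketch of that last point, streaming a (hypothetical) large file line by line so the whole string never has to sit in memory:

import re

ws = re.compile(r'\s+')

# 'big.log' is a placeholder path; each line is tokenized on its own
with open('big.log') as f:
    for line in f:
        tokens = ws.split(line.strip())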


回答 15

试试这个:

  def adamsplit(s):
    result = []
    inquotes = False
    for substring in s.split('"'):
      if not inquotes:
        result.extend(substring.split())
      else:
        result.append(substring)
      inquotes = not inquotes
    return result

一些测试字符串:

'This is "a test"' -> ['This', 'is', 'a test']
'"This is \'a test\'"' -> ["This is 'a test'"]

Try this:

  def adamsplit(s):
    result = []
    inquotes = False
    for substring in s.split('"'):
      if not inquotes:
        result.extend(substring.split())
      else:
        result.append(substring)
      inquotes = not inquotes
    return result

Some test strings:

'This is "a test"' -> ['This', 'is', 'a test']
'"This is \'a test\'"' -> ["This is 'a test'"]