Python 实用宝典

Question 1

Here’s the simplest way to explain this. Here’s what I’m using:

re.split('\W', 'foo/bar spam\neggs')
-> ['foo', 'bar', 'spam', 'eggs']

Here’s what I want:

someMethod('\W', 'foo/bar spam\neggs')
-> ['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

The reason is that I want to split a string into tokens, manipulate it, then put it back together again.

Question 2

>>> re.split('(\W)', 'foo/bar spam\neggs')
['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

Question 3

If you are splitting on newline, use splitlines(True).

>>> 'line 1\nline 2\nline without newline'.splitlines(True)
['line 1\n', 'line 2\n', 'line without newline']

(Not a general solution, but adding this here in case someone comes here not realizing this method existed.)

Question 4

Another no-regex solution that works well on Python 3

# Split strings and keep separator
test_strings = ['<Hello>', 'Hi', '<Hi> <Planet>', '<', '']

def split_and_keep(s, sep):
   if not s: return [''] # consistent with string.split()

   # Find replacement character that is not used in string
   # i.e. just use the highest available character plus one
   # Note: This fails if ord(max(s)) = 0x10FFFF (ValueError)
   p=chr(ord(max(s))+1) 

   return s.replace(sep, sep+p).split(p)

for s in test_strings:
   print(split_and_keep(s, '<'))


# If the unicode limit is reached it will fail explicitly
unicode_max_char = chr(1114111)
ridiculous_string = '<Hello>'+unicode_max_char+'<World>'
print(split_and_keep(ridiculous_string, '<'))

Question 5

If you have only 1 separator, you can employ list comprehensions:

text = 'foo,bar,baz,qux'  
sep = ','

Appending/prepending separator:

result = [x+sep for x in text.split(sep)]
#['foo,', 'bar,', 'baz,', 'qux,']
# to get rid of trailing
result[-1] = result[-1].strip(sep)
#['foo,', 'bar,', 'baz,', 'qux']

result = [sep+x for x in text.split(sep)]
#[',foo', ',bar', ',baz', ',qux']
# to get rid of trailing
result[0] = result[0].strip(sep)
#['foo', ',bar', ',baz', ',qux']

Separator as it’s own element:

result = [u for x in text.split(sep) for u in (x, sep)]
#['foo', ',', 'bar', ',', 'baz', ',', 'qux', ',']
results = result[:-1]   # to get rid of trailing

Question 6

another example, split on non alpha-numeric and keep the separators

import re
a = "foo,bar@candy*ice%cream"
re.split('([^a-zA-Z0-9])',a)

output:

['foo', ',', 'bar', '@', 'candy', '*', 'ice', '%', 'cream']

explanation

re.split('([^a-zA-Z0-9])',a)

() <- keep the separators
[] <- match everything in between
^a-zA-Z0-9 <-except alphabets, upper/lower and numbers.

Question 7

You can also split a string with an array of strings instead of a regular expression, like this:

def tokenizeString(aString, separators):
    #separators is an array of strings that are being used to split the string.
    #sort separators in order of descending length
    separators.sort(key=len)
    listToReturn = []
    i = 0
    while i < len(aString):
        theSeparator = ""
        for current in separators:
            if current == aString[i:i+len(current)]:
                theSeparator = current
        if theSeparator != "":
            listToReturn += [theSeparator]
            i = i + len(theSeparator)
        else:
            if listToReturn == []:
                listToReturn = [""]
            if(listToReturn[-1] in separators):
                listToReturn += [""]
            listToReturn[-1] += aString[i]
            i += 1
    return listToReturn
    

print(tokenizeString(aString = "\"\"\"hi\"\"\" hello + world += (1*2+3/5) '''hi'''", separators = ["'''", '+=', '+', "/", "*", "\\'", '\\"', "-=", "-", " ", '"""', "(", ")"]))

Question 8

# This keeps all separators  in result 
##########################################################################
import re
st="%%(c+dd+e+f-1523)%%7"
sh=re.compile('[\+\-//\*\<\>\%\(\)]')

def splitStringFull(sh, st):
   ls=sh.split(st)
   lo=[]
   start=0
   for l in ls:
     if not l : continue
     k=st.find(l)
     llen=len(l)
     if k> start:
       tmp= st[start:k]
       lo.append(tmp)
       lo.append(l)
       start = k + llen
     else:
       lo.append(l)
       start =llen
   return lo
  #############################

li= splitStringFull(sh , st)
['%%(', 'c', '+', 'dd', '+', 'e', '+', 'f', '-', '1523', ')%%', '7']

Question 9

One Lazy and Simple Solution

Assume your regex pattern is split_pattern = r'(!|\?)'

First, you add some same character as the new separator, like ‘[cut]’

new_string = re.sub(split_pattern, '\\1[cut]', your_string)

Then you split the new separator, new_string.split('[cut]')

Question 10

If one wants to split string while keeping separators by regex without capturing group:

def finditer_with_separators(regex, s):
    matches = []
    prev_end = 0
    for match in regex.finditer(s):
        match_start = match.start()
        if (prev_end != 0 or match_start > 0) and match_start != prev_end:
            matches.append(s[prev_end:match.start()])
        matches.append(match.group())
        prev_end = match.end()
    if prev_end < len(s):
        matches.append(s[prev_end:])
    return matches

regex = re.compile(r"[\(\)]")
matches = finditer_with_separators(regex, s)

If one assumes that regex is wrapped up into capturing group:

def split_with_separators(regex, s):
    matches = list(filter(None, regex.split(s)))
    return matches

regex = re.compile(r"([\(\)])")
matches = split_with_separators(regex, s)

Both ways also will remove empty groups which are useless and annoying in most of the cases.

Question 11

I had a similar issue trying to split a file path and struggled to find a simple answer. This worked for me and didn’t involve having to substitute delimiters back into the split text:

my_path = 'folder1/folder2/folder3/file1'

import re

re.findall('[^/]+/|[^/]+', my_path)

returns:

['folder1/', 'folder2/', 'folder3/', 'file1']

Question 12

I found this generator based approach more satisfying:

def split_keep(string, sep):
    """Usage:
    >>> list(split_keep("a.b.c.d", "."))
    ['a.', 'b.', 'c.', 'd']
    """
    start = 0
    while True:
        end = string.find(sep, start) + 1
        if end == 0:
            break
        yield string[start:end]
        start = end
    yield string[start:]

It avoids the need to figure out the correct regex, while in theory should be fairly cheap. It doesn’t create new string objects and, delegates most of the iteration work to the efficient find method.

… and in Python 3.8 it can be as short as:

def split_keep(string, sep):
    start = 0
    while (end := string.find(sep, start) + 1) > 0:
        yield string[start:end]
        start = end
    yield string[start:]

Question 13

replace all seperator: (\W) with seperator + new_seperator: (\W;)
split by the new_seperator: (;)

def split_and_keep(seperator, s):
  return re.split(';', re.sub(seperator, lambda match: match.group() + ';', s))

print('\W', 'foo/bar spam\neggs')

Question 14

Here is a simple .split solution that works without regex.

This is an answer for Python split() without removing the delimiter, so not exactly what the original post asks but the other question was closed as a duplicate for this one.

def splitkeep(s, delimiter):
    split = s.split(delimiter)
    return [substr + delimiter for substr in split[:-1]] + [split[-1]]

Random tests:

import random

CHARS = [".", "a", "b", "c"]
assert splitkeep("", "X") == [""]  # 0 length test
for delimiter in ('.', '..'):
    for _ in range(100000):
        length = random.randint(1, 50)
        s = "".join(random.choice(CHARS) for _ in range(length))
        assert "".join(splitkeep(s, delimiter)) == s

Python 实用宝典

在Python中，如何分割字符串并保留分隔符？

问题：在Python中，如何分割字符串并保留分隔符？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

回答 12

有趣好用的Python教程