Split a string by a delimiter in Python

Question: Split a string by a delimiter in Python

How do I split this string, where __ is the delimiter:

MATCHES__STRING

to get the output ['MATCHES', 'STRING']?


Answer 0

You can use the str.split function: string.split('__')

>>> "MATCHES__STRING".split("__")
['MATCHES', 'STRING']
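
As a small sketch of how str.split behaves with a multi-character delimiter (the extra sample strings are made up for illustration): the delimiter is matched as a whole unit, and two delimiters in a row produce an empty string between them.

```python
# str.split treats the separator as one unit, not as a set of characters.
parts = "MATCHES__STRING".split("__")
print(parts)  # ['MATCHES', 'STRING']

# A single underscore is not the delimiter, so it stays in the result.
single = "MATCH_ES__STRING".split("__")
print(single)  # ['MATCH_ES', 'STRING']

# Two delimiters in a row yield an empty string between them.
doubled = "A____B".split("__")
print(doubled)  # ['A', '', 'B']
```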

Answer 1

You may be interested in the csv module, which is designed for comma-separated files but can be configured to use a custom delimiter. Note that the csv module only accepts a single-character delimiter, so it cannot split on "__" directly; the example below uses "_" to show the mechanics.

import csv

# The csv delimiter must be exactly one character.
csv.register_dialect("myDialect", delimiter="_")
lines = ["MATCHES_STRING"]

for row in csv.reader(lines, dialect="myDialect"):
    print(row)  # ['MATCHES', 'STRING']

Answer 2

When the string contains two or more elements (three in the example below), you can unpack the result of split() into comma-separated variables:

date, time, event_name = ev.get_text(separator='@').split("@")

After this line of code, the three variables hold the three parts of the text returned by ev.get_text().

So, if ev yields this string with the separator '@' applied:

Sa., 23. März@19:00@Klavier + Orchester: SPEZIAL

then, after the split operation:

  • date will have the value “Sa., 23. März”
  • time will have the value “19:00”
  • event_name will have the value “Klavier + Orchester: SPEZIAL”
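
The unpacking idea can be sketched without the ev object, using a plain string in the same shape (the variable names first, rest and error are only illustrative). Unpacking needs exactly as many parts as there are names, so maxsplit is useful when the last field may itself contain the separator.

```python
# A plain string standing in for the text taken from ev.
text = "Sa., 23. März@19:00@Klavier + Orchester: SPEZIAL"

# Unpacking requires exactly as many parts as there are names.
date, time, event_name = text.split("@")

# maxsplit caps the number of parts when the last field may contain '@'.
first, rest = "a@b@c".split("@", 1)

# A mismatch between names and parts raises ValueError.
try:
    x, y = "a@b@c".split("@")
    error = None
except ValueError as exc:
    error = str(exc)
```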

Split on first occurrence

Question: Split on first occurrence

What would be the best way to split a string on the first occurrence of a delimiter?

For example:

"123mango abcd mango kiwi peach"

splitting on the first mango to get:

"abcd mango kiwi peach"

Answer 0

From the docs:

str.split([sep[, maxsplit]])

Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).

s.split('mango', 1)[1]

Answer 1

>>> s = "123mango abcd mango kiwi peach"
>>> s.split("mango", 1)
['123', ' abcd mango kiwi peach']
>>> s.split("mango", 1)[1]
' abcd mango kiwi peach'

Answer 2

For me the better approach is:

s.split('mango', 1)[-1]

…because with [1], if the delimiter does not occur in the string, you get “IndexError: list index out of range”.

Using -1 does no harm here, because the number of splits is already limited to one: the list has at most two elements, so [-1] is the second element when the delimiter is found and the whole string when it is not.
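
A short sketch of the difference between the two indexes, using a second, made-up string that does not contain the delimiter:

```python
s = "123mango abcd mango kiwi peach"
missing = "no fruit here"  # hypothetical input without the delimiter

# [-1] works whether or not the delimiter is present.
with_match = s.split("mango", 1)[-1]
without_match = missing.split("mango", 1)[-1]

# [1] fails when the delimiter is absent, because the list has one element.
try:
    missing.split("mango", 1)[1]
    raised = False
except IndexError:
    raised = True
```

Whether falling back to the whole string is the right behavior depends on the caller; if a missing delimiter should be treated as an error, the IndexError from [1] may actually be preferable.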


Answer 3

You can also use str.partition:

>>> text = "123mango abcd mango kiwi peach"

>>> text.partition("mango")
('123', 'mango', ' abcd mango kiwi peach')

>>> text.partition("mango")[-1]
' abcd mango kiwi peach'

>>> text.partition("mango")[-1].lstrip()  # if whitespace strip-ing is needed
'abcd mango kiwi peach'

The advantage of str.partition is that it always returns a tuple of the form:

(<pre>, <separator>, <post>)

This makes unpacking the output reliable, because the resulting tuple always has exactly 3 elements, even when the separator is not found.
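
A minimal sketch of that unpacking, including the not-found case (the second input string is made up): when the separator is missing, the middle and last elements are empty strings, so the middle element doubles as a "was it found" flag.

```python
text = "123mango abcd mango kiwi peach"

# Always three elements, so this unpacking never fails.
head, sep, tail = text.partition("mango")

# When the separator is absent, sep and tail are empty strings.
head2, sep2, tail2 = "no fruit here".partition("mango")
found = bool(sep2)
```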


Answer 4

df.columnname[1].split('.', 1)

This splits the value at the first occurrence of '.', whether it is a plain string or a single value taken from a data frame column.


Split a string every nth character?

Question: Split a string every nth character?

Is it possible to split a string every nth character?

For example, suppose I have a string containing the following:

'1234567890'

How can I get it to look like this:

['12','34','56','78','90']

Answer 0

>>> line = '1234567890'
>>> n = 2
>>> [line[i:i+n] for i in range(0, len(line), n)]
['12', '34', '56', '78', '90']

Answer 1

Just to be complete, you can do this with a regex:

>>> import re
>>> re.findall('..','1234567890')
['12', '34', '56', '78', '90']

For an odd number of characters you can do this:

>>> import re
>>> re.findall('..?', '123456789')
['12', '34', '56', '78', '9']

You can also do the following, to simplify the regex for longer chunks:

>>> import re
>>> re.findall('.{1,2}', '123456789')
['12', '34', '56', '78', '9']

And if the string is long, you can use re.finditer to generate the chunks one at a time.
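
The re.finditer idea could look roughly like this. It is a sketch, and the helper name iter_chunks is made up; re.DOTALL is added on the assumption that newlines should be kept in the chunks, since '.' skips them by default.

```python
import re

def iter_chunks(text, n):
    """Yield successive chunks of up to n characters from text."""
    # re.DOTALL lets '.' match newlines too, so no character is skipped.
    pattern = re.compile(".{1,%d}" % n, re.DOTALL)
    for match in pattern.finditer(text):
        yield match.group(0)

chunks = list(iter_chunks("123456789", 2))
print(chunks)  # ['12', '34', '56', '78', '9']
```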


Answer 2

Python already has a built-in function for this.

>>> from textwrap import wrap
>>> s = '1234567890'
>>> wrap(s, 2)
['12', '34', '56', '78', '90']

This is what the docstring for wrap says:

>>> help(wrap)
'''
Help on function wrap in module textwrap:

wrap(text, width=70, **kwargs)
    Wrap a single paragraph of text, returning a list of wrapped lines.

    Reformat the single paragraph in 'text' so it fits in lines of no
    more than 'width' columns, and return a list of wrapped lines.  By
    default, tabs in 'text' are expanded with string.expandtabs(), and
    all other whitespace characters (including newline) are converted to
    space.  See TextWrapper class for available keyword args to customize
    wrapping behaviour.
'''
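
One caveat follows from the docstring: wrap() is a text-wrapping tool, so it breaks at whitespace and drops it. The sketch below compares it with plain slicing on a made-up input containing a space; for strings without whitespace the two agree.

```python
from textwrap import wrap

# Without whitespace, wrap() behaves like fixed-width chunking.
digits = wrap("1234567890", 2)

# With whitespace, wrap() breaks at the space and drops it ...
spaced = wrap("ab cd", 2)

# ... whereas slicing keeps every character.
sliced = ["ab cd"[i:i + 2] for i in range(0, len("ab cd"), 2)]
```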

Answer 3

Another common way of grouping elements into n-length groups:

>>> s = '1234567890'
>>> list(map(''.join, zip(*[iter(s)]*2)))
['12', '34', '56', '78', '90']

This method comes straight from the docs for zip(). In Python 3, map() returns an iterator, hence the list() call. Note that zip() silently drops a trailing incomplete group.


Answer 4

I think this is shorter and more readable than the itertools version:

def split_by_n(seq, n):
    '''A generator to divide a sequence into chunks of n units.'''
    while seq:
        yield seq[:n]
        seq = seq[n:]

print(list(split_by_n('1234567890', 2)))

Answer 5

I like this solution:

s = '1234567890'
o = []
while s:
    o.append(s[:2])
    s = s[2:]

Answer 6

Using more-itertools from PyPI:

>>> from more_itertools import sliced
>>> list(sliced('1234567890', 2))
['12', '34', '56', '78', '90']

Answer 7

You could use the grouper() recipe from itertools:

Python 2.x:

from itertools import izip_longest    

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

Python 3.x:

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

These functions are memory-efficient and work with any iterables.
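
As a usage sketch for the Python 3 version applied to the string from the question: grouper() yields tuples, so the pieces are joined back into strings, and fillvalue='' is chosen here so that a short final group is not padded with None.

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Join each tuple of characters back into a string.
even = ["".join(group) for group in grouper("1234567890", 2, fillvalue="")]

# With an odd length, the last group is padded with '' and stays short.
odd = ["".join(group) for group in grouper("123456789", 2, fillvalue="")]
```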


Answer 8

Try the following code:

from itertools import islice

def split_every(n, iterable):
    i = iter(iterable)
    piece = list(islice(i, n))
    while piece:
        yield piece
        piece = list(islice(i, n))

s = '1234567890'
# Each piece is a list of characters; join them to get strings back.
print([''.join(piece) for piece in split_every(2, s)])

Answer 9

>>> from functools import reduce
>>> from operator import add
>>> x = iter('1234567890')
>>> [reduce(add, tup) for tup in zip(x, x)]
['12', '34', '56', '78', '90']
>>> x = iter('1234567890')
>>> [reduce(add, tup) for tup in zip(x, x, x)]
['123', '456', '789']

Answer 10

Try this:

s='1234567890'
print([s[idx:idx+2] for idx,val in enumerate(s) if idx%2 == 0])

Output:

['12', '34', '56', '78', '90']

Answer 11

As always, for those who love one-liners:

n = 2  
line = "this is a line split into n characters"  
line = [line[i * n:i * n+n] for i,blah in enumerate(line[::n])]

Answer 12

A simple recursive solution for short strings (note that a trailing chunk shorter than n is dropped):

def split(s, n):
    if len(s) < n:
        return []
    else:
        return [s[:n]] + split(s[n:], n)

print(split('1234567890', 2))

Or in such a form:

def split(s, n):
    if len(s) < n:
        return []
    elif len(s) == n:
        return [s]
    else:
        return split(s[:n], n) + split(s[n:], n)

This form shows the typical divide-and-conquer pattern of a recursive approach more explicitly (though in practice it is not necessary to do it this way).


Answer 13

I was stuck in the same scenario.

This worked for me:

x = "1234567890"
n = 2
chunks = []  # avoid naming this "list", which would shadow the built-in
for i in range(0, len(x), n):
    chunks.append(x[i:i+n])
print(chunks)

Output:

['12', '34', '56', '78', '90']

Answer 14

more_itertools.sliced has been mentioned before. Here are four more options from the more_itertools library:

import more_itertools as mit

s = "1234567890"

["".join(c) for c in mit.grouper(s, 2)]  # older releases took grouper(2, s)

["".join(c) for c in mit.chunked(s, 2)]

["".join(c) for c in mit.windowed(s, 2, step=2)]

["".join(c) for c in mit.split_after(s, lambda x: int(x) % 2 == 0)]

Each of these options produces the following output:

['12', '34', '56', '78', '90']

Documentation for discussed options: grouper, chunked, windowed, split_after


Answer 15

This can be achieved by a simple for loop.

a = '1234567890a'
result = []

for i in range(0, len(a), 2):
    result.append(a[i : i + 2])
print(result)

The output is ['12', '34', '56', '78', '90', 'a']


How do I split a string into a list?

Question: How do I split a string into a list?

I want my Python function to split a sentence (input) and store each word in a list. My current code splits the sentence, but does not store the words as a list. How do I do that?

def split_line(text):

    # split the text
    words = text.split()

    # for each word in the line:
    for word in words:

        # print the word
        print(words)

Answer 0

text.split()

This should be enough to store each word in a list. words is already a list of the words from the sentence, so there is no need for the loop.

Second, it might be a typo, but you have your loop a little messed up. If you really did want to use append, it would be:

words.append(word)

not

word.append(words)

Answer 1

Splits the string in text on any consecutive runs of whitespace.

words = text.split()      

Split the string in text on delimiter: ",".

words = text.split(",")   

The words variable will be a list and contain the words from text split on the delimiter.
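
The difference between the no-argument form and an explicit separator shows up with repeated or leading whitespace. A small sketch with a made-up input:

```python
text = "  a  b,c  "

# No argument: runs of whitespace count as one separator, edges are trimmed.
by_whitespace = text.split()

# Explicit " ": every single space splits, so empty strings appear.
by_space = text.split(" ")

# Explicit ",": whitespace is left untouched.
by_comma = text.split(",")
```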


Answer 2

str.split()

Return a list of the words in the string, using sep as the delimiter … If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.

>>> line="a sentence with a few words"
>>> line.split()
['a', 'sentence', 'with', 'a', 'few', 'words']
>>> 

Answer 3

Depending on what you plan to do with your sentence-as-a-list, you may want to look at the Natural Language Toolkit (NLTK). It deals heavily with text processing and evaluation. You can also use it to solve your problem:

import nltk
words = nltk.word_tokenize(raw_sentence)

This has the added benefit of splitting out punctuation.

Example:

>>> import nltk
>>> s = "The fox's foot grazed the sleeping dog, waking it."
>>> words = nltk.word_tokenize(s)
>>> words
['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping', 'dog', ',', 
'waking', 'it', '.']

This allows you to filter out any punctuation you don’t want and use only words.

Please note that the other solutions using string.split() are better if you don’t plan on doing any complex manipulation of the sentence.



Answer 4

How about this algorithm? Split text on whitespace, then trim punctuation. This carefully removes punctuation from the edge of words, without harming apostrophes inside words such as we're.

>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"

>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]

>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']

Answer 5

I want my python function to split a sentence (input) and store each word in a list

The str().split() method does this, it takes a string, splits it into a list:

>>> the_string = "this is a sentence"
>>> words = the_string.split(" ")
>>> print(words)
['this', 'is', 'a', 'sentence']
>>> type(words)
<type 'list'> # or <class 'list'> in Python 3.0

The problem you’re having is because of a typo, you wrote print(words) instead of print(word):

Renaming the word variable to current_word, this is what you had:

def split_line(text):
    words = text.split()
    for current_word in words:
        print(words)

..when you should have done:

def split_line(text):
    words = text.split()
    for current_word in words:
        print(current_word)

If for some reason you want to manually construct a list in the for loop, you would use the list append() method, perhaps because you want to lower-case all words (for example):

my_list = [] # make empty list
for current_word in words:
    my_list.append(current_word.lower())

Or, a bit more neatly, using a list comprehension:

my_list = [current_word.lower() for current_word in words]

Answer 6

shlex has a .split() function. It differs from str.split() in that it does not preserve quotes and treats a quoted phrase as a single word:

>>> import shlex
>>> shlex.split("sudo echo 'foo && bar'")
['sudo', 'echo', 'foo && bar']
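
For comparison, here is a sketch of the same command string split both ways; str.split() cuts the quoted phrase apart and leaves the quote characters attached.

```python
import shlex

command = "sudo echo 'foo && bar'"

# Plain whitespace splitting ignores the quoting.
plain = command.split()

# shlex.split() follows shell-like quoting rules.
shell_like = shlex.split(command)
```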

Answer 7

If you want all the chars of a word/sentence in a list, do this:

print(list("word"))
#  ['w', 'o', 'r', 'd']


print(list("some sentence"))
#  ['s', 'o', 'm', 'e', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e']

Answer 8

I think you are confused because of a typo.

Replace print(words) with print(word) inside your loop to have every word printed on a different line


How to split a string into an array of characters?

Question: How to split a string into an array of characters?

I’ve tried to look around the web for answers to splitting a string into an array of characters but I can’t seem to find a simple method

str.split(//) does not seem to work like Ruby does. Is there a simple way of doing this without looping?


Answer 0

>>> s = "foobar"
>>> list(s)
['f', 'o', 'o', 'b', 'a', 'r']

You need list().


Answer 1

You take the string and pass it to list():

s = "mystring"
l = list(s)
print(l)

Answer 2

You can also do it in this very simple way without list():

>>> [c for c in "foobar"]
['f', 'o', 'o', 'b', 'a', 'r']

Answer 3

If you want to process your string one character at a time, you have various options.

uhello = u'Hello\u0020World'

Using List comprehension:

print([x for x in uhello])

Output:

['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']

Using map:

print(list(map(lambda c2: c2, uhello)))

Output:

['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']

Calling Built in list function:

print(list(uhello))

Output:

['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']

Using for loop:

for c in uhello:
    print(c)

Output:

H
e
l
l
o

W
o
r
l
d

Answer 4

I explored another two ways to accomplish this task. It may be helpful for someone.

The first one is easy:

In [25]: a = []
In [26]: s = 'foobar'
In [27]: a += s
In [28]: a
Out[28]: ['f', 'o', 'o', 'b', 'a', 'r']

The second one uses map with a lambda function. It may be appropriate for more complex tasks (in Python 3, wrap map() in list() to get a list):

In [36]: s = 'foobar12'
In [37]: a = list(map(lambda c: c, s))
In [38]: a
Out[38]: ['f', 'o', 'o', 'b', 'a', 'r', '1', '2']

For example:

# isdigit, isspace or other facilities such as regexps may be used
In [40]: a = list(map(lambda c: c if c.isalpha() else '', s))
In [41]: a
Out[41]: ['f', 'o', 'o', 'b', 'a', 'r', '', '']

See the Python docs for more methods.


Answer 5

The task boils down to iterating over characters of the string and collecting them into a list. The most naïve solution would look like

result = []
for character in string:
    result.append(character)

Of course, it can be shortened to just

result = [character for character in string]

but there still are shorter solutions that do the same thing.

The list constructor can be used to convert any iterable (iterators, lists, tuples, strings, etc.) to a list.

>>> list('abc')
['a', 'b', 'c']

The big plus is that it works the same in both Python 2 and Python 3.

Also, starting from Python 3.5 (thanks to the awesome PEP 448) it’s now possible to build a list from any iterable by unpacking it to an empty list literal:

>>> [*'abc']
['a', 'b', 'c']

This is neater, and in some cases more efficient than calling list constructor directly.

I’d advise against using map-based approaches, because map does not return a list in Python 3. See How to use filter, map, and reduce in Python 3.
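
The point about map can be seen in a short Python 3 sketch (str.upper is used only as a stand-in function): map() returns a one-shot iterator, so it is not a list and is empty after being consumed once.

```python
chars = map(str.upper, "abc")

# In Python 3 this is a lazy map object, not a list.
is_list = isinstance(chars, list)

# Consuming it once yields the values ...
as_list = list(chars)

# ... and a second pass finds it exhausted.
again = list(chars)

# list() and unpacking both give a real list directly.
unpacked = [*"abc"]
```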


Answer 6

If you just need an array of chars:

arr = list(s)

If you want to split the string s by a particular substring:

# with s = "temp//temps", arr will be ['temp', 'temps']
arr = s.split("//")

Answer 7

The built-in split() method only separates a string at a given delimiter (whitespace by default), so a single word with no delimiter in it comes back unsplit. list() solves this: it iterates over the string and stores each character as a separate list element.

Suppose:

a = "bottle"
a.split()  # returns ['bottle']; the word is not split into characters

a = "bottle"
list(a)  # returns ['b', 'o', 't', 't', 'l', 'e']

Answer 8

Unpack them:

word = "Paralelepipedo"
print([*word])

Answer 9

If you only need read-only access to the characters, you can index the string directly.

Python 2.7.6 (default, Mar 22 2014, 22:59:38) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> t = 'my string'
>>> t[1]
'y'

Could be useful for testing without using regexp. Does the string contain an ending newline?

>>> t[-1] == '\n'
False
>>> t = 'my string\n'
>>> t[-1] == '\n'
True
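
One thing to watch with t[-1] is the empty string, where indexing raises IndexError. As a hedged alternative for this particular test, str.endswith() handles that case without an exception:

```python
empty = ""

# Indexing an empty string fails ...
try:
    empty[-1] == "\n"
    index_ok = True
except IndexError:
    index_ok = False

# ... while endswith() simply returns False.
empty_ends = empty.endswith("\n")
ends = "my string\n".endswith("\n")
```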

Answer 10

Well, much as I like the list(s) version, here’s another more verbose way I found (but it’s cool so I thought I’d add it to the fray):

>>> text = "My hovercraft is full of eels"
>>> [text[i] for i in range(len(text))]
['M', 'y', ' ', 'h', 'o', 'v', 'e', 'r', 'c', 'r', 'a', 'f', 't', ' ', 'i', 's', ' ', 'f', 'u', 'l', 'l', ' ', 'o', 'f', ' ', 'e', 'e', 'l', 's']

Answer 11

from itertools import chain

string = 'your string'
chain(string)

Similar to list(string), but returns an iterator that is evaluated lazily at the point of use, so it is memory efficient.


Answer 12

>>> for i in range(len(a)):
...     print(a[i])
...

where a is the string that you want to separate. The values a[i] are the individual characters of the string; these could be appended to a list.


Split strings into words with multiple word boundary delimiters

Question: Split strings into words with multiple word boundary delimiters

I think what I want to do is a fairly common task but I’ve found no reference on the web. I have text with punctuation, and I want a list of the words.

"Hey, you - what are you doing here!?"

should be

['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

But Python’s str.split() only works with one argument, so I have all words with the punctuation after I split with whitespace. Any ideas?


Answer 0

A case where regular expressions are justified:

import re
DATA = "Hey, you - what are you doing here!?"
print(re.findall(r"[\w']+", DATA))
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

Answer 1

re.split()

re.split(pattern, string[, maxsplit=0])

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)

>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']

Answer 2

Another quick way to do this without a regexp is to replace the characters first, as below:

>>> 'a;bcd,ef g'.replace(';',' ').replace(',',' ').split()
['a', 'bcd', 'ef', 'g']

Answer 3

So many answers, yet I can’t find any solution that efficiently does what the title of the question literally asks for: splitting on multiple possible separators. (Many answers instead split on anything that is not a word character, which is different.) So here is an answer to the question in the title, relying on Python’s standard and efficient re module:

>>> import re  # Will be splitting on: , <space> - ! ? :
>>> list(filter(None, re.split(r"[, \-!?:]+", "Hey, you - what are you doing here!?")))
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

where:

  • the […] matches one of the separators listed inside,
  • the \- in the regular expression is here to prevent the special interpretation of - as a character range indicator (as in A-Z),
  • the + skips one or more delimiters (it could be omitted thanks to the filter(), but this would unnecessarily produce empty strings between matched separators), and
  • filter(None, …) removes the empty strings possibly created by leading and trailing separators (since empty strings have a false boolean value).

This re.split() precisely “splits with multiple separators”, as asked for in the question title.

This solution is furthermore immune to the problems with non-ASCII characters in words found in some other solutions (see the first comment to ghostdog74’s answer).

The re module is much more efficient (in speed and concision) than doing Python loops and tests “by hand”!
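
When the separators come from a variable rather than being typed into the pattern by hand, the character class can be built with re.escape so that characters such as - or ] need no manual escaping. The following is a minimal sketch under that assumption; split_on_any is a hypothetical helper name, not something from the answer above, and it only handles single-character separators.

```python
import re


def split_on_any(text, delimiters):
    """Split text on any run of the given single-character delimiters."""
    # re.escape neutralizes characters that are special inside [...],
    # such as '-', '^' or ']'.
    char_class = "[" + re.escape("".join(delimiters)) + "]+"
    # Drop the empty strings produced by leading/trailing separators.
    return [piece for piece in re.split(char_class, text) if piece]


words = split_on_any("Hey, you - what are you doing here!?", ", -!?:")
print(words)
```

The list comprehension plays the same role as filter(None, …) in the answer.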


Answer 4

Another way, without regex

import string
punc = string.punctuation
thestring = "Hey, you - what are you doing here!?"
s = list(thestring)
''.join([o for o in s if o not in punc]).split()

Answer 5

Pro-Tip: Use string.translate for the fastest string operations Python has.

Some proof…

First, the slow way (sorry pprzemek):

>>> import timeit
>>> S = 'Hey, you - what are you doing here!?'
>>> def my_split(s, seps):
...     res = [s]
...     for sep in seps:
...         s, res = res, []
...         for seq in s:
...             res += seq.split(sep)
...     return res
... 
>>> timeit.Timer('my_split(S, punctuation)', 'from __main__ import S,my_split; from string import punctuation').timeit()
54.65477919578552

Next, we use re.findall() (as given by the suggested answer). MUCH faster:

>>> timeit.Timer('findall(r"\w+", S)', 'from __main__ import S; from re import findall').timeit()
4.194725036621094

Finally, we use translate (this is Python 2, where translate and maketrans are still functions in the string module):

>>> from string import translate,maketrans,punctuation 
>>> T = maketrans(punctuation, ' '*len(punctuation))
>>> timeit.Timer('translate(S, T).split()', 'from __main__ import S,T,translate').timeit()
1.2835021018981934

Explanation:

string.translate is implemented in C and processes the string in a single pass, with one table lookup per character. It still returns a new string (Python strings are immutable), but it creates no intermediate strings along the way. So it’s about as fast as you can get for string substitution.

It’s a bit awkward, though, as it needs a translation table in order to do this magic. You can make a translation table with the maketrans() convenience function. The objective here is to translate all unwanted characters to spaces: a one-for-one substitution, so the result has the same length as the input. That is why this is fast.

Next, we use good old split(). By default, split() operates on all whitespace characters, grouping runs of them together for the split. The result is the list of words that you want. And in the timings above, this approach is more than 3x faster than re.findall()!
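
The timings above were taken on Python 2. In Python 3 the string module no longer provides translate and maketrans, so the same idea can be sketched with str.maketrans and str.translate instead. This is only an illustration of the technique; the timings have not been re-measured for it.

```python
import string

text = "Hey, you - what are you doing here!?"

# Build a table that maps every punctuation character to a single space.
table = str.maketrans(string.punctuation, " " * len(string.punctuation))

# translate() swaps the characters, split() then collapses runs of whitespace.
words = text.translate(table).split()
print(words)
```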


Answer 6

I had a similar dilemma and didn’t want to use ‘re’ module.

def my_split(s, seps):
    res = [s]
    for sep in seps:
        s, res = res, []
        for seq in s:
            res += seq.split(sep)
    return res

print(my_split('1111  2222 3333;4444,5555;6666', [' ', ';', ',']))
['1111', '', '2222', '3333', '4444', '5555', '6666']

Answer 7

First, I want to agree with others that the regex or str.translate(...) based solutions are most performant. For my use case the performance of this function wasn’t significant, so I wanted to add ideas that I considered with that criterion in mind.

My main goal was to generalize ideas from some of the other answers into one solution that could work for strings containing more than just regex words (i.e., blacklisting the explicit subset of punctuation characters vs whitelisting word characters).

Note that, in any approach, one might also consider using string.punctuation in place of a manually defined list.

Option 1 – re.sub

I was surprised to see no answer so far uses re.sub(…). I find it a simple and natural approach to this problem.

import re

my_str = "Hey, you - what are you doing here!?"

words = re.split(r'\s+', re.sub(r'[,\-!?]', ' ', my_str).strip())

In this solution, I nested the call to re.sub(...) inside re.split(...). If performance is critical, compiling the regexes outside could be beneficial; for my use case the difference wasn’t significant, so I prefer simplicity and readability.
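
For reference, a precompiled variant of Option 1 might look like the sketch below. The names PUNCT, SPACES and split_words are made up for this illustration; the patterns are the same ones used above.

```python
import re

# Compile both patterns once, outside any loop that calls split_words().
PUNCT = re.compile(r"[,\-!?]")
SPACES = re.compile(r"\s+")


def split_words(text):
    """Replace the listed punctuation with spaces, then split on whitespace."""
    return SPACES.split(PUNCT.sub(" ", text).strip())


words = split_words("Hey, you - what are you doing here!?")
print(words)
```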

Option 2 – str.replace

This is a few more lines, but it has the benefit of being expandable without having to check whether you need to escape a certain character in regex.

my_str = "Hey, you - what are you doing here!?"

replacements = (',', '-', '!', '?')
for r in replacements:
    my_str = my_str.replace(r, ' ')

words = my_str.split()

It would have been nice to be able to map the str.replace to the string instead, but I don’t think it can be done with immutable strings, and while mapping against a list of characters would work, running every replacement against every character sounds excessive. (Edit: See next option for a functional example.)

Option 3 – functools.reduce

(In Python 2, reduce is available in global namespace without importing it from functools.)

import functools

my_str = "Hey, you - what are you doing here!?"

replacements = (',', '-', '!', '?')
my_str = functools.reduce(lambda s, sep: s.replace(sep, ' '), replacements, my_str)
words = my_str.split()

Answer 8

join = lambda x: sum(x,[])  # a.k.a. flatten1([[1],[2,3],[4]]) -> [1,2,3,4]
# ...alternatively...
join = lambda lists: [x for l in lists for x in l]

Then this becomes a three-liner:

fragments = [text]
for token in tokens:
    fragments = join(f.split(token) for f in fragments)

Explanation

This is what is known in Haskell as the List monad. The idea behind the monad is that once “in the monad” you “stay in the monad” until something takes you out. For example, say you map Python’s range(n) -> [0,1,...,n-1] function over a list. Because each result is itself a list, it is appended to the output list rather than nested inside it, so from the inputs [3,4,1] you’d get something like [0,1,2,0,1,2,3,0]. In Haskell this operation is called concatMap (it is the bind operation of the List monad). The idea here is that you’ve got this operation you’re applying (splitting on a token), and whenever you do that, you join the results back into a single flat list.

You can abstract this into a function and have tokens=string.punctuation by default.

Advantages of this approach:

  • This approach (unlike naive regex-based approaches) can work with arbitrary-length tokens (which regex can also do with more advanced syntax).
  • You are not restricted to mere tokens; you could have arbitrary logic in place of each token, for example one of the “tokens” could be a function which splits according to how nested parentheses are.
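
A minimal sketch of the function form mentioned above, with tokens defaulting to string.punctuation. The names flatten and split_on_tokens are assumptions made for this example.

```python
import string


def flatten(lists):
    """Concatenate an iterable of lists into one flat list."""
    return [x for sub in lists for x in sub]


def split_on_tokens(text, tokens=string.punctuation):
    """Split text on every token in turn; tokens may be longer than one character."""
    fragments = [text]
    for token in tokens:
        # Split each current fragment and join the pieces back into one list.
        fragments = flatten(f.split(token) for f in fragments)
    return fragments


print(split_on_tokens("a,b;c"))
print(split_on_tokens("1FROM2INTO3", ["FROM", "INTO"]))
```

Note that, like str.split with an explicit separator, this keeps the empty strings that appear between adjacent tokens.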

Answer 9

Try this:

import re

phrase = "Hey, you - what are you doing here!?"
matches = re.findall(r'\w+', phrase)
print(matches)

This will print ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']


Answer 10

Use replace two times:

a = '11223FROM33344INTO33222FROM3344'
a.replace('FROM', ',,,').replace('INTO', ',,,').split(',,,')

results in:

['11223', '33344', '33222', '3344']

Answer 11

I like re, but here is my solution without it:

from itertools import groupby
sep = ' ,-!?'
s = "Hey, you - what are you doing here!?"
print([''.join(g) for k, g in groupby(s, sep.__contains__) if not k])

sep.__contains__ is the method that the ‘in’ operator uses. Basically it is the same as

lambda ch: ch in sep

but is more convenient here.

groupby takes our string and the function. It splits the string into groups using that function: whenever the function’s value changes, a new group starts. So sep.__contains__ is exactly what we need.

groupby returns a sequence of pairs, where pair[0] is the result of our function and pair[1] is a group. Using ‘if not k’ we filter out the groups of separators (because sep.__contains__ returns True for separators). That’s all: now we have a sequence of groups where each one is a word (a group is actually an iterable, so we use join to convert it to a string).

This solution is quite general, because it uses a function to separate the string (you can split by any condition you need). Also, it doesn’t create intermediate strings or lists (you can remove the join and the expression will become lazy, since each group is an iterator).
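
To illustrate the “any condition” point, here is a sketch that wraps the same groupby idea in a generator taking an arbitrary predicate. iter_words is a hypothetical name; words are produced one at a time as the caller iterates.

```python
from itertools import groupby


def iter_words(text, is_separator):
    """Yield the runs of characters for which is_separator(ch) is false."""
    for is_sep, group in groupby(text, is_separator):
        if not is_sep:
            yield "".join(group)


s = "Hey, you - what are you doing here!?"

# Same separators as above, via the bound __contains__ method.
print(list(iter_words(s, " ,-!?".__contains__)))

# Any other condition works too, e.g. "everything that is not alphanumeric".
print(list(iter_words(s, lambda ch: not ch.isalnum())))
```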


Answer 12

Instead of using a re module function re.split you can achieve the same result using the series.str.split method of pandas.

First, create a series with the above string and then apply the method to the series.

import pandas as pd

thestring = pd.Series("Hey, you - what are you doing here!?")
thestring.str.split(pat=',|-')

The pat parameter takes the delimiters, and the method returns a Series holding the list of split pieces. Here the two delimiters are combined using | (the regex alternation operator). The output is as follows:

[Hey, you , what are you doing here!?]


Answer 13

I’m re-acquainting myself with Python and needed the same thing. The findall solution may be better, but I came up with this:

tokens = [x.strip() for x in data.split(',')]

Answer 14

Using maketrans and translate, you can do it easily and neatly:

specials = ',.!?:;"()<>[]#$=-/'
trans = str.maketrans(specials, ' ' * len(specials))  # Python 3; Python 2 used string.maketrans
body = body.translate(trans)
words = body.strip().split()

Answer 15

In Python 3, you can use the method from PY4E – Python for Everybody.

We can solve both these problems by using the string methods lower, punctuation, and translate. The translate is the most subtle of the methods. Here is the documentation for translate:

your_string.translate(your_string.maketrans(fromstr, tostr, deletestr))

Replace the characters in fromstr with the character in the same position in tostr and delete all characters that are in deletestr. The fromstr and tostr can be empty strings and the deletestr parameter can be omitted.
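
As a small illustration of the three arguments (the sample strings here are made up for the example):

```python
# 'a' -> 'x', 'b' -> 'y', 'c' -> 'z'; '!' and '?' are deleted outright.
table = str.maketrans("abc", "xyz", "!?")

result = "a cab!?".translate(table)
print(result)  # x zxy
```

Characters that appear in none of the three strings, such as the space here, pass through unchanged.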

You can see the “punctuation”:

In [10]: import string

In [11]: string.punctuation
Out[11]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'  

For your example:

In [12]: your_str = "Hey, you - what are you doing here!?"

In [13]: line = your_str.translate(your_str.maketrans('', '', string.punctuation))

In [14]: line = line.lower()

In [15]: words = line.split()

In [16]: print(words)
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

For more information, you can refer:


Answer 16

Another way to achieve this is to use the Natural Language Tool Kit (nltk).

import nltk
data= "Hey, you - what are you doing here!?"
word_tokens = nltk.tokenize.regexp_tokenize(data, r'\w+')
print(word_tokens)

This prints: ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

The biggest drawback of this method is that you need to install the nltk package.

The benefits are that you can do a lot of fun stuff with the rest of the nltk package once you get your tokens.


Answer 17

First of all, I don’t think that your intention is to actually use punctuation as delimiters in the split functions. Your description suggests that you simply want to eliminate punctuation from the resultant strings.

I come across this pretty frequently, and my usual solution doesn’t require re.

One-liner lambda function w/ list comprehension:

(requires import string):

split_without_punc = lambda text : [word.strip(string.punctuation) for word in 
    text.split() if word.strip(string.punctuation) != '']

# Call function
split_without_punc("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']


Function (traditional)

As a traditional function, this is still only two lines with a list comprehension (in addition to import string):

def split_without_punctuation2(text):

    # Split by whitespace
    words = text.split()

    # Strip punctuation from each word
    return [word.strip(string.punctuation) for word in words if word.strip(string.punctuation) != '']

split_without_punctuation2("Hey, you -- what are you doing?!")
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

It will also naturally leave contractions and hyphenated words intact. You can always use text.replace("-", " ") to turn hyphens into spaces before the split.
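
A short sketch of that behavior, using a made-up sentence. The function is a compact restatement of the one above, stripping each word once instead of twice.

```python
import string


def split_without_punc(text):
    # Strip punctuation from both ends of each whitespace-separated word.
    stripped = (word.strip(string.punctuation) for word in text.split())
    # Drop words that consisted of punctuation only (e.g. a lone "-").
    return [word for word in stripped if word]


words = split_without_punc("Don't stop - it's a well-known trick!")
print(words)
```

The apostrophes and the inner hyphen survive because strip() only removes characters from the two ends of each word.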

General Function w/o Lambda or List Comprehension

For a more general solution (where you can specify the characters to eliminate), and without a list comprehension, you get:

def split_without(text: str, ignore: str) -> list:

    # Split by whitespace
    split_string = text.split()

    # Strip any characters in the ignore string, and ignore empty strings
    words = []
    for word in split_string:
        word = word.strip(ignore)
        if word != '':
            words.append(word)

    return words

# Situation-specific call to general function
import string
final_text = split_without("Hey, you - what are you doing?!", string.punctuation)
# returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

Of course, you can always generalize the lambda function to any specified string of characters as well.


Answer 18

First of all, compile the pattern with re.compile() before performing any regex operation in a loop. The re module caches recently used patterns, so the gain is usually modest, but a precompiled pattern skips the cache lookup on every call.

So for your problem, first compile the pattern and then call its methods:

import re
DATA = "Hey, you - what are you doing here!?"
reg_tok = re.compile(r"[\w']+")
print(reg_tok.findall(DATA))

Answer 19

Here is the answer with some explanation.

st = "Hey, you - what are you doing here!?"

# replace every non-alphanumeric character with a space and then join.
new_string = ''.join([' ' if not x.isalnum() else x for x in st])
# output of new_string
'Hey  you   what are you doing here  '

# str.split() will remove all the empty string if separator is not provided
new_list = new_string.split()

# output of new_list
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

# we can join it to get a complete string without any non-alphanumeric characters
' '.join(new_list)
# output
'Hey you what are you doing here'

or in one line, we can do like this:

''.join([' ' if not x.isalnum() else x for x in st]).split()

# output
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']



Answer 20

Create a function that takes as input two strings (the source string to be split and the splitlist string of delimiters) and outputs a list of split words:

def split_string(source, splitlist):
    output = []  # output list of cleaned words
    atsplit = True
    for char in source:
        if char in splitlist:
            atsplit = True
        else:
            if atsplit:
                output.append(char)  # append new word after split
                atsplit = False
            else: 
                output[-1] = output[-1] + char  # continue copying characters until next split
    return output

Answer 21

I like pprzemek’s solution because it does not assume that the delimiters are single characters and it doesn’t try to leverage a regex (which would not work well if the number of separators got to be crazy long).

Here’s a more readable version of the above solution for clarity:

def split_string_on_multiple_separators(input_string, separators):
    buffer = [input_string]
    for sep in separators:
        strings = buffer
        buffer = []  # reset the buffer
        for s in strings:
            buffer = buffer + s.split(sep)

    return buffer

Answer 22

I had the same problem as @ooboo and found this topic. @ghostdog74 inspired me; maybe someone will find my solution useful:

str1='adj:sg:nom:m1.m2.m3:pos'
splitat=':.'
''.join([ s if s not in splitat else ' ' for s in str1]).split()

If you don’t want to split on spaces, use some other placeholder character instead of the space in the join, and pass that same character to split().


Answer 23

Here is my go at a split with multiple delimiters:

def msplit( str, delims ):
  w = ''
  for z in str:
    if z not in delims:
        w += z
    else:
        if len(w) > 0 :
            yield w
        w = ''
  if len(w) > 0 :
    yield w

Answer 24

I think the following is the best answer to suit your needs:

\W+ may be suitable for this case, but may not be suitable for other cases.

list(filter(None, re.compile(r'[ ,\-!?]').split("Hey, you - what are you doing here!?")))

Answer 25

Here’s my take on it:

def split_string(source,splitlist):
    splits = frozenset(splitlist)
    l = []
    s1 = ""
    for c in source:
        if c in splits:
            if s1:
                l.append(s1)
                s1 = ""
        else:
            s1 = s1 + c
    if s1:
        l.append(s1)
    return l

>>> out = split_string("First Name,Last Name,Street Address,City,State,Zip Code", ",")
>>> print(out)
['First Name', 'Last Name', 'Street Address', 'City', 'State', 'Zip Code']

Answer 26

I like the replace() way the best. The following procedure changes all separators defined in a string splitlist to the first separator in splitlist and then splits the text on that one separator. It also accounts for if splitlist happens to be an empty string. It returns a list of words, with no empty strings in it.

def split_string(text, splitlist):
    for sep in splitlist:
        text = text.replace(sep, splitlist[0])
    return list(filter(None, text.split(splitlist[0]))) if splitlist else [text]

Answer 27

def get_words(s):
    l = []
    w = ''
    for c in s.lower():
        if c in '-!?,. ':
            if w != '': 
                l.append(w)
            w = ''
        else:
            w = w + c
    if w != '': 
        l.append(w)
    return l

Here is the usage:

>>> s = "Hey, you - what are you doing here!?"
>>> print(get_words(s))
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

Answer 28

If you want a reversible operation (preserve the delimiters), you can use this function:

def tokenizeSentence_Reversible(sentence):
    setOfDelimiters = ['.', ' ', ',', '*', ';', '!']
    listOfTokens = [sentence]

    for delimiter in setOfDelimiters:
        newListOfTokens = []
        for ind, token in enumerate(listOfTokens):
            ll = [([delimiter, w] if ind > 0 else [w]) for ind, w in enumerate(token.split(delimiter))]
            listOfTokens = [item for sublist in ll for item in sublist] # flattens.
            listOfTokens = filter(None, listOfTokens) # Removes empty tokens: ''
            newListOfTokens.extend(listOfTokens)

        listOfTokens = newListOfTokens

    return listOfTokens
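
For comparison, a similar delimiter-preserving split can be sketched with re.split and a capturing group, which returns the matched delimiters as part of the result. This is a different technique from the function above, not a rewrite of it; tokenize_reversible and its default delimiter string are assumptions made for the example.

```python
import re


def tokenize_reversible(sentence, delimiters=". ,*;!"):
    """Split on single-character delimiters, keeping each delimiter as a token."""
    # The capturing group makes re.split return the delimiters too.
    pattern = "([" + re.escape(delimiters) + "])"
    return [tok for tok in re.split(pattern, sentence) if tok]


tokens = tokenize_reversible("Hey, you!")
print(tokens)
print("".join(tokens) == "Hey, you!")  # joining the tokens restores the input
```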

Answer 29

I recently needed to do this but wanted a function that somewhat matched the standard library str.split function. This function behaves the same as str.split when called with 0 or 1 separators.

def split_many(string, *separators):
    if len(separators) == 0:
        return string.split()
    if len(separators) > 1:
        table = {
            ord(separator): ord(separators[0])
            for separator in separators
        }
        string = string.translate(table)
    return string.split(separators[0])

NOTE: This function is only useful when your separators consist of a single character (as was my usecase).
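
If multi-character separators are needed, one possible variation is to build a regular expression from the separators instead of a translation table. The sketch below is an assumption-laden alternative, not the function above: split_many_re is a hypothetical name, and it requires that no separator is an empty string.

```python
import re


def split_many_re(string, *separators):
    """Like str.split, but accepts several separators of any length."""
    if not separators:
        return string.split()
    # Try longer separators first so that "--" is not read as two "-".
    ordered = sorted(separators, key=len, reverse=True)
    pattern = "|".join(re.escape(sep) for sep in ordered)
    return re.split(pattern, string)


print(split_many_re("a--b-c", "-", "--"))
print(split_many_re("a,b,,c", ","))
```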


How do you split a list into evenly sized chunks?

Question: How do you split a list into evenly sized chunks?

I have a list of arbitrary length, and I need to split it up into equal size chunks and operate on it. There are some obvious ways to do this, like keeping a counter and two lists, and when the second list fills up, add it to the first list and empty the second list for the next round of data, but this is potentially extremely expensive.

I was wondering if anyone had a good solution to this for lists of any length, e.g. using generators.

I was looking for something useful in itertools but I couldn’t find anything obviously useful. Might’ve missed it, though.

Related question: What is the most “pythonic” way to iterate over a list in chunks?


Answer 0

Here’s a generator that yields the chunks you want:

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

import pprint
pprint.pprint(list(chunks(range(10, 75), 10)))
[[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
 [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
 [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
 [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
 [70, 71, 72, 73, 74]]

If you’re using Python 2, you should use xrange() instead of range():

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in xrange(0, len(lst), n):
        yield lst[i:i + n]

You can also simply use a list comprehension instead of writing a function, though it’s a good idea to encapsulate operations like this in named functions so that your code is easier to understand. Python 3:

[lst[i:i + n] for i in range(0, len(lst), n)]

Python 2 version:

[lst[i:i + n] for i in xrange(0, len(lst), n)]

Answer 1

If you want something super simple:

def chunks(l, n):
    n = max(1, n)
    return (l[i:i+n] for i in range(0, len(l), n))

Use xrange() instead of range() in the case of Python 2.x


Answer 2

Directly from the (old) Python documentation (recipes for itertools):

from itertools import izip, chain, repeat

def grouper(n, iterable, padvalue=None):
    "grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"
    return izip(*[chain(iterable, repeat(padvalue, n-1))]*n)

The current version, as suggested by J.F.Sebastian:

#from itertools import izip_longest as zip_longest # for Python 2.x
from itertools import zip_longest # for Python 3.x
#from six.moves import zip_longest # for both (uses the six compat library)

def grouper(n, iterable, padvalue=None):
    "grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"
    return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

I guess Guido’s time machine works—worked—will work—will have worked—was working again.

These solutions work because [iter(iterable)]*n (or the equivalent in the earlier version) creates one iterator, repeated n times in the list. izip_longest then effectively performs a round-robin of “each” iterator; because this is the same iterator, it is advanced by each such call, resulting in each such zip-roundrobin generating one tuple of n items.
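
The shared-iterator point can be checked directly. This is a small illustration with made-up data, not code from the answer:

```python
from itertools import zip_longest

it = iter("abcdefg")
refs = [it] * 3  # three references to one iterator, not three iterators

# Every slot in the list is the very same object.
all_same = all(ref is it for ref in refs)

# zip_longest pulls one item from each slot in turn, so it advances `it`
# three times per output tuple.
groups = list(zip_longest(*refs, fillvalue="x"))

print(all_same)  # True
print(groups)    # [('a', 'b', 'c'), ('d', 'e', 'f'), ('g', 'x', 'x')]
```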


Answer 3

I know this is kind of old but nobody yet mentioned numpy.array_split:

import numpy as np

lst = range(50)
np.array_split(lst, 5)
# [array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
#  array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]),
#  array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29]),
#  array([30, 31, 32, 33, 34, 35, 36, 37, 38, 39]),
#  array([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])]

Answer 4

I’m surprised nobody has thought of using iter’s two-argument form:

from itertools import islice

def chunk(it, size):
    it = iter(it)
    return iter(lambda: tuple(islice(it, size)), ())

Demo:

>>> list(chunk(range(14), 3))
[(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11), (12, 13)]

This works with any iterable and produces output lazily. It returns tuples rather than iterators, but I think it has a certain elegance nonetheless. It also doesn’t pad; if you want padding, a simple variation on the above will suffice:

from itertools import islice, chain, repeat

def chunk_pad(it, size, padval=None):
    it = chain(iter(it), repeat(padval))
    return iter(lambda: tuple(islice(it, size)), (padval,) * size)

Demo:

>>> list(chunk_pad(range(14), 3))
[(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11), (12, 13, None)]
>>> list(chunk_pad(range(14), 3, 'a'))
[(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11), (12, 13, 'a')]

Like the izip_longest-based solutions, the above always pads. As far as I know, there’s no one- or two-line itertools recipe for a function that optionally pads. By combining the above two approaches, this one comes pretty close:

_no_padding = object()

def chunk(it, size, padval=_no_padding):
    if padval == _no_padding:
        it = iter(it)
        sentinel = ()
    else:
        it = chain(iter(it), repeat(padval))
        sentinel = (padval,) * size
    return iter(lambda: tuple(islice(it, size)), sentinel)

Demo:

>>> list(chunk(range(14), 3))
[(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11), (12, 13)]
>>> list(chunk(range(14), 3, None))
[(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11), (12, 13, None)]
>>> list(chunk(range(14), 3, 'a'))
[(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11), (12, 13, 'a')]

I believe this is the shortest chunker proposed that offers optional padding.

As Tomasz Gandor observed, the two padding chunkers will stop unexpectedly if they encounter a long sequence of pad values. Here’s a final variation that works around that problem in a reasonable way:

_no_padding = object()
def chunk(it, size, padval=_no_padding):
    it = iter(it)
    chunker = iter(lambda: tuple(islice(it, size)), ())
    if padval == _no_padding:
        yield from chunker
    else:
        for ch in chunker:
            yield ch if len(ch) == size else ch + (padval,) * (size - len(ch))

Demo:

>>> list(chunk([1, 2, (), (), 5], 2))
[(1, 2), ((), ()), (5,)]
>>> list(chunk([1, 2, None, None, 5], 2, None))
[(1, 2), (None, None), (5, None)]

Answer 5

Here is a generator that works on arbitrary iterables:

import itertools

def split_seq(iterable, size):
    it = iter(iterable)
    item = list(itertools.islice(it, size))
    while item:
        yield item
        item = list(itertools.islice(it, size))

Example:

>>> import pprint
>>> pprint.pprint(list(split_seq(xrange(75), 10)))
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
 [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
 [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
 [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
 [70, 71, 72, 73, 74]]

Answer 6

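# Python 2 only: map(None, ...) zips the iterators and pads the last chunk with None.
# In Python 3 this fails because None is not callable; itertools.zip_longest fills the same role.
def chunk(input, size):
    return map(None, *([iter(input)] * size))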

Answer 7

Simple yet elegant

l = range(1, 1000)
print [l[x:x+10] for x in xrange(0, len(l), 10)]

or if you prefer:

def chunks(l, n): return [l[x: x+n] for x in xrange(0, len(l), n)]
chunks(l, 10)

Answer 8

Critique of other answers here:

None of these answers produce evenly sized chunks. They all leave a runt chunk at the end, so they’re not completely balanced. If you were using these functions to distribute work, you’ve built in the prospect of one worker likely finishing well before the others, so it would sit around doing nothing while the others continued working hard.

For example, the current top answer ends with:

[60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
[70, 71, 72, 73, 74]]

I just hate that runt at the end!

Others, like list(grouper(3, xrange(7))) and chunk(xrange(7), 3), both return: [(0, 1, 2), (3, 4, 5), (6, None, None)]. The Nones are just padding, and rather inelegant in my opinion. They are NOT evenly chunking the iterables.

Why can’t we divide these better?

My Solution(s)

Here’s a balanced solution, adapted from a function I’ve used in production (note: in Python 3, replace xrange with range):

def baskets_from(items, maxbaskets=25):
    baskets = [[] for _ in xrange(maxbaskets)] # in Python 3 use range
    for i, item in enumerate(items):
        baskets[i % maxbaskets].append(item)
    return filter(None, baskets) 

And I created a generator that does the same if you put it into a list:

def iter_baskets_from(items, maxbaskets=3):
    '''generates evenly balanced baskets from indexable iterable'''
    item_count = len(items)
    baskets = min(item_count, maxbaskets)
    for x_i in xrange(baskets):
        yield [items[y_i] for y_i in xrange(x_i, item_count, baskets)]

And finally, since I see that all of the above functions return elements in a contiguous order (as they were given):

def iter_baskets_contiguous(items, maxbaskets=3, item_count=None):
    '''
    generates balanced baskets from iterable, contiguous contents
    provide item_count if providing a iterator that doesn't support len()
    '''
    item_count = item_count or len(items)
    baskets = min(item_count, maxbaskets)
    items = iter(items)
    floor = item_count // baskets 
    ceiling = floor + 1
    stepdown = item_count % baskets
    for x_i in xrange(baskets):
        length = ceiling if x_i < stepdown else floor
        yield [items.next() for _ in xrange(length)]

Output

To test them out:

print(baskets_from(xrange(6), 8))
print(list(iter_baskets_from(xrange(6), 8)))
print(list(iter_baskets_contiguous(xrange(6), 8)))
print(baskets_from(xrange(22), 8))
print(list(iter_baskets_from(xrange(22), 8)))
print(list(iter_baskets_contiguous(xrange(22), 8)))
print(baskets_from('ABCDEFG', 3))
print(list(iter_baskets_from('ABCDEFG', 3)))
print(list(iter_baskets_contiguous('ABCDEFG', 3)))
print(baskets_from(xrange(26), 5))
print(list(iter_baskets_from(xrange(26), 5)))
print(list(iter_baskets_contiguous(xrange(26), 5)))

Which prints out:

[[0], [1], [2], [3], [4], [5]]
[[0], [1], [2], [3], [4], [5]]
[[0], [1], [2], [3], [4], [5]]
[[0, 8, 16], [1, 9, 17], [2, 10, 18], [3, 11, 19], [4, 12, 20], [5, 13, 21], [6, 14], [7, 15]]
[[0, 8, 16], [1, 9, 17], [2, 10, 18], [3, 11, 19], [4, 12, 20], [5, 13, 21], [6, 14], [7, 15]]
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11], [12, 13, 14], [15, 16, 17], [18, 19], [20, 21]]
[['A', 'D', 'G'], ['B', 'E'], ['C', 'F']]
[['A', 'D', 'G'], ['B', 'E'], ['C', 'F']]
[['A', 'B', 'C'], ['D', 'E'], ['F', 'G']]
[[0, 5, 10, 15, 20, 25], [1, 6, 11, 16, 21], [2, 7, 12, 17, 22], [3, 8, 13, 18, 23], [4, 9, 14, 19, 24]]
[[0, 5, 10, 15, 20, 25], [1, 6, 11, 16, 21], [2, 7, 12, 17, 22], [3, 8, 13, 18, 23], [4, 9, 14, 19, 24]]
[[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15], [16, 17, 18, 19, 20], [21, 22, 23, 24, 25]]

Notice that the contiguous generator provides chunks in the same length patterns as the other two, but the items are all in order, and they are as evenly divided as a list of discrete elements can be.
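
The functions above are written for Python 2 (xrange, items.next()). A possible Python 3 rendering of the contiguous variant is sketched below. The name iter_baskets_contiguous_py3 is hypothetical, and the early return for empty input is an added assumption rather than behaviour of the original.

```python
from itertools import islice

def iter_baskets_contiguous_py3(items, maxbaskets=3, item_count=None):
    """Python 3 sketch of iter_baskets_contiguous (hypothetical name)."""
    if item_count is None:
        item_count = len(items)
    if item_count == 0:
        return  # nothing to yield; also avoids dividing by zero below
    baskets = min(item_count, maxbaskets)
    it = iter(items)
    # The first `stepdown` baskets receive one extra item.
    floor, stepdown = divmod(item_count, baskets)
    for i in range(baskets):
        length = floor + 1 if i < stepdown else floor
        yield list(islice(it, length))

print(list(iter_baskets_contiguous_py3(range(22), 8)))
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11], [12, 13, 14], [15, 16, 17], [18, 19], [20, 21]]
print(list(iter_baskets_contiguous_py3("ABCDEFG", 3)))
# [['A', 'B', 'C'], ['D', 'E'], ['F', 'G']]
```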


Answer 9

I saw the most awesome Python-ish answer in a duplicate of this question:

from itertools import zip_longest

a = range(1, 16)
i = iter(a)
r = list(zip_longest(i, i, i))
>>> print(r)
[(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12), (13, 14, 15)]

You can create n-tuples for any n. If a = range(1, 15), then the result will be:

[(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12), (13, 14, None)]

If the list is divided evenly, then you can replace zip_longest with zip, otherwise the triplet (13, 14, None) would be lost. Python 3 is used above. For Python 2, use izip_longest.


Answer 10

If you know list size:

def SplitList(mylist, chunk_size):
    return [mylist[offs:offs+chunk_size] for offs in range(0, len(mylist), chunk_size)]

If you don’t (an iterator):

def IterChunks(sequence, chunk_size):
    res = []
    for item in sequence:
        res.append(item)
        if len(res) >= chunk_size:
            yield res
            res = []
    if res:
        yield res  # yield the last, incomplete, portion

In the latter case, it can be rephrased in a more beautiful way if you can be sure that the sequence always contains a whole number of chunks of given size (i.e. there is no incomplete last chunk).
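
One possible reading of that last remark (an assumption; the answer does not show the code) is the zip idiom that also appears elsewhere in this thread. iter_full_chunks is a hypothetical name.

```python
def iter_full_chunks(sequence, chunk_size):
    """Yield tuples of exactly chunk_size items (hypothetical helper).

    Assumes the input length is a multiple of chunk_size; any leftover
    items at the end are silently dropped.
    """
    it = iter(sequence)
    # The same iterator repeated chunk_size times: zip takes one item
    # from it for each position in the output tuple.
    return zip(*[it] * chunk_size)

print(list(iter_full_chunks(range(6), 3)))
# [(0, 1, 2), (3, 4, 5)]
```

If an incomplete last chunk is possible, IterChunks above is the safer choice because it yields the remainder.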


Answer 11

The toolz library has the partition function for this:

from toolz import partition

list(partition(2, [1, 2, 3, 4]))
[(1, 2), (3, 4)]

Answer 12

If you had a chunk size of 3 for example, you could do:

zip(*[iterable[i::3] for i in range(3)]) 

source: http://code.activestate.com/recipes/303060-group-a-list-into-sequential-n-tuples/

I would use this when my chunk size is a fixed number I can type, e.g. ‘3’, and would never change.
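
Note that zip stops at the shortest slice, so this one-liner silently drops trailing items when the length is not a multiple of the chunk size. A small illustration with made-up data:

```python
data = list(range(8))  # 8 items, not a multiple of 3

# data[0::3] -> [0, 3, 6], data[1::3] -> [1, 4, 7], data[2::3] -> [2, 5]
slices = [data[i::3] for i in range(3)]
triples = list(zip(*slices))

print(triples)
# [(0, 1, 2), (3, 4, 5)]  (6 and 7 are dropped)
```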


Answer 13

I like the Python doc’s version proposed by tzot and J.F.Sebastian a lot, but it has two shortcomings:

  • it is not very explicit
  • I usually don’t want a fill value in the last chunk

I’m using this one a lot in my code:

from itertools import islice

def chunks(n, iterable):
    iterable = iter(iterable)
    while True:
        yield tuple(islice(iterable, n)) or iterable.next()

UPDATE: A lazy chunks version:

from itertools import chain, islice

def chunks(n, iterable):
   iterable = iter(iterable)
   while True:
       yield chain([next(iterable)], islice(iterable, n-1))
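
Both versions rely on an exception ending the generator at the end of the input. On current Python 3 that no longer works: iterable.next() does not exist, and since PEP 479 (the default from Python 3.7) a StopIteration escaping a generator body is turned into a RuntimeError. A sketch of the lazy variant that ends cleanly, under the hypothetical name lazy_chunks:

```python
from itertools import chain, islice

def lazy_chunks(n, iterable):
    """Yield lazy chunks of up to n items (hypothetical rewrite of the lazy version)."""
    it = iter(iterable)
    # The for loop ends normally when `it` is exhausted, so no StopIteration escapes.
    for first in it:
        # `first` is already consumed, so take at most n - 1 more items.
        yield chain([first], islice(it, n - 1))

print([list(chunk) for chunk in lazy_chunks(3, range(7))])
# [[0, 1, 2], [3, 4, 5], [6]]
```

As with the original lazy version, each chunk has to be consumed before the next one is requested, because all chunks read from the same underlying iterator.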

Answer 14

[AA[i:i+SS] for i in range(len(AA))[::SS]]

Where AA is array, SS is chunk size. For example:

>>> AA=range(10,21);SS=3
>>> [AA[i:i+SS] for i in range(len(AA))[::SS]]
[[10, 11, 12], [13, 14, 15], [16, 17, 18], [19, 20]]
# or [range(10, 13), range(13, 16), range(16, 19), range(19, 21)] in py3

Answer 15

I was curious about the performance of different approaches and here it is:

Tested on Python 3.5.1

import time
batch_size = 7
arr_len = 298937

#---------slice-------------

print("\r\nslice")
start = time.time()
arr = [i for i in range(0, arr_len)]
while True:
    if not arr:
        break

    tmp = arr[0:batch_size]
    arr = arr[batch_size:]
print(time.time() - start)

#-----------index-----------

print("\r\nindex")
arr = [i for i in range(0, arr_len)]
start = time.time()
for i in range(0, round(len(arr) / batch_size + 1)):
    tmp = arr[batch_size * i : batch_size * (i + 1)]
print(time.time() - start)

#----------batches 1------------

def batch(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

print("\r\nbatches 1")
arr = [i for i in range(0, arr_len)]
start = time.time()
for x in batch(arr, batch_size):
    tmp = x
print(time.time() - start)

#----------batches 2------------

from itertools import islice, chain

def batch(iterable, size):
    sourceiter = iter(iterable)
    while True:
        batchiter = islice(sourceiter, size)
        yield chain([next(batchiter)], batchiter)


print("\r\nbatches 2")
arr = [i for i in range(0, arr_len)]
start = time.time()
for x in batch(arr, batch_size):
    tmp = x
print(time.time() - start)

#---------chunks-------------
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]
print("\r\nchunks")
arr = [i for i in range(0, arr_len)]
start = time.time()
for x in chunks(arr, batch_size):
    tmp = x
print(time.time() - start)

#-----------grouper-----------

from itertools import zip_longest # for Python 3.x
#from six.moves import zip_longest # for both (uses the six compat library)

def grouper(iterable, n, padvalue=None):
    "grouper(3, 'abcdefg', 'x') --> ('a','b','c'), ('d','e','f'), ('g','x','x')"
    return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

arr = [i for i in range(0, arr_len)]
print("\r\ngrouper")
start = time.time()
for x in grouper(arr, batch_size):
    tmp = x
print(time.time() - start)

Results:

slice
31.18285083770752

index
0.02184295654296875

batches 1
0.03503894805908203

batches 2
0.22681021690368652

chunks
0.019841909408569336

grouper
0.006506919860839844
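
Single time.time() measurements like these can be noisy, and the slice case above also times the construction of the list. As a rough sketch of an alternative harness (not the author's setup; the smaller data size and the repeat count are arbitrary assumptions), the standard timeit module can time two of the approaches on a list built once:

```python
import timeit
from itertools import zip_longest

def chunks(lst, n):
    """Yield successive n-sized slices from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

def grouper(iterable, n, padvalue=None):
    """Group items into n-tuples, padding the last one."""
    return zip_longest(*[iter(iterable)] * n, fillvalue=padvalue)

def consume(iterator):
    """Exhaust an iterator so that lazy approaches do their full work."""
    for _ in iterator:
        pass

data = list(range(100_000))  # built once, outside the timed region
batch_size = 7

timings = {
    "chunks": timeit.timeit(lambda: consume(chunks(data, batch_size)), number=5),
    "grouper": timeit.timeit(lambda: consume(grouper(data, batch_size)), number=5),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 5 runs")
```

The absolute numbers depend on the machine and Python version, so none are shown here.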

Answer 16

code:

def split_list(the_list, chunk_size):
    result_list = []
    while the_list:
        result_list.append(the_list[:chunk_size])
        the_list = the_list[chunk_size:]
    return result_list

a_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(split_list(a_list, 3))

result:

[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]

Answer 17

You can also use the get_chunks function of the utilspie library:

>>> from utilspie import iterutils
>>> a = [1, 2, 3, 4, 5, 6, 7, 8, 9]

>>> list(iterutils.get_chunks(a, 5))
[[1, 2, 3, 4, 5], [6, 7, 8, 9]]

You can install utilspie via pip:

sudo pip install utilspie

Disclaimer: I am the creator of utilspie library.


Answer 18

At this point, I think we need a recursive generator, just in case…

In python 2:

def chunks(li, n):
    if li == []:
        return
    yield li[:n]
    for e in chunks(li[n:], n):
        yield e

In python 3:

def chunks(li, n):
    if li == []:
        return
    yield li[:n]
    yield from chunks(li[n:], n)

Also, in case of massive Alien invasion, a decorated recursive generator might become handy:

def dec(gen):
    def new_gen(li, n):
        for e in gen(li, n):
            if e == []:
                return
            yield e
    return new_gen

@dec
def chunks(li, n):
    yield li[:n]
    for e in chunks(li[n:], n):
        yield e

Answer 19

With Assignment Expressions in Python 3.8 it becomes quite nice:

import itertools

def batch(iterable, size):
    it = iter(iterable)
    while item := list(itertools.islice(it, size)):
        yield item

This works on an arbitrary iterable, not just a list.

>>> import pprint
>>> pprint.pprint(list(batch(range(75), 10)))
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
 [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
 [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
 [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
 [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
 [70, 71, 72, 73, 74]]
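
From Python 3.12 the standard library ships itertools.batched, which covers the same ground but yields tuples instead of lists. The sketch below falls back to a hand-written equivalent on older versions; the fallback is an illustration, not the standard library's implementation.

```python
from itertools import islice

try:
    from itertools import batched  # available from Python 3.12
except ImportError:
    def batched(iterable, n):
        """Fallback sketch for older versions; yields tuples of up to n items."""
        it = iter(iterable)
        while chunk := tuple(islice(it, n)):
            yield chunk

print(list(batched(range(7), 3)))
# [(0, 1, 2), (3, 4, 5), (6,)]
```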

Answer 20

heh, one line version

In [48]: chunk = lambda ulist, step:  map(lambda i: ulist[i:i+step],  xrange(0, len(ulist), step))

In [49]: chunk(range(1,100), 10)
Out[49]: 
[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
 [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 [31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
 [41, 42, 43, 44, 45, 46, 47, 48, 49, 50],
 [51, 52, 53, 54, 55, 56, 57, 58, 59, 60],
 [61, 62, 63, 64, 65, 66, 67, 68, 69, 70],
 [71, 72, 73, 74, 75, 76, 77, 78, 79, 80],
 [81, 82, 83, 84, 85, 86, 87, 88, 89, 90],
 [91, 92, 93, 94, 95, 96, 97, 98, 99]]
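
The same slicing idea can also be written as a list comprehension, which may be easier to scan than nested lambdas. This is only a sketch, not part of the original answer; the name chunk_list is made up for illustration.

```python
def chunk_list(ulist, step):
    """Return consecutive slices of ulist, each at most `step` items long."""
    return [ulist[i:i + step] for i in range(0, len(ulist), step)]

print(chunk_list(list(range(1, 8)), 3))
# [[1, 2, 3], [4, 5, 6], [7]]
```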

Answer 21

def split_seq(seq, num_pieces):
    # Split seq into num_pieces contiguous pieces whose lengths differ by at most one.
    start = 0
    for i in range(num_pieces):
        stop = start + len(seq[i::num_pieces])
        yield seq[start:stop]
        start = stop

Usage:

seq = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

for piece in split_seq(seq, 3):
    print(piece)
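
Note that this function takes the number of pieces rather than a chunk size, so the result differs from the fixed-size chunkers in the other answers. The sketch below restates the generator for Python 3 and shows what it produces for the ten-element list; the variable name pieces is only illustrative.

```python
def split_seq(seq, num_pieces):
    start = 0
    for i in range(num_pieces):
        # len(seq[i::num_pieces]) is the fair share of items for piece i
        stop = start + len(seq[i::num_pieces])
        yield seq[start:stop]
        start = stop

pieces = list(split_seq([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3))
print(pieces)
# [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]
```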

Answer 22

Another more explicit version.

def chunkList(initialList, chunkSize):
    """
    This function chunks a list into sub-lists
    of length chunkSize (the last one may be shorter).

    Example:
    lst = [3, 4, 9, 7, 1, 1, 2, 3]
    print(chunkList(lst, 3)) 
    returns
    [[3, 4, 9], [7, 1, 1], [2, 3]]
    """
    finalList = []
    for i in range(0, len(initialList), chunkSize):
        finalList.append(initialList[i:i+chunkSize])
    return finalList

Answer 23

Without calling len(), which is good for large lists:

def splitter(l, n):
    i = 0
    chunk = l[:n]
    while chunk:
        yield chunk
        i += n
        chunk = l[i:i+n]

And this is for iterables:

from itertools import islice, repeat, takewhile

def isplitter(l, n):
    l = iter(l)
    chunk = list(islice(l, n))
    while chunk:
        yield chunk
        chunk = list(islice(l, n))

The functional flavour of the above:

def isplitter2(l, n):
    return takewhile(bool,
                     (tuple(islice(start, n))
                      for start in repeat(iter(l))))

Or:

def chunks_gen_sentinel(n, seq):
    continuous_slices = map(islice, repeat(iter(seq)), repeat(0), repeat(n))
    return iter(map(tuple, continuous_slices).__next__, ())

Or:

def chunks_gen_filter(n, seq):
    continuous_slices = map(islice, repeat(iter(seq)), repeat(0), repeat(n))
    return takewhile(bool, map(tuple, continuous_slices))
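
The sentinel variant relies on the two-argument form of the built-in iter(callable, sentinel), which keeps calling the callable until it returns the sentinel value. A more compact sketch of that idea for Python 3 might look like this; the name chunks_with_sentinel is hypothetical and not from the original answer.

```python
from itertools import islice

def chunks_with_sentinel(iterable, n):
    it = iter(iterable)
    # Each call takes up to n items; an empty tuple () means the
    # source is exhausted, and iter() stops when it sees that sentinel.
    return iter(lambda: tuple(islice(it, n)), ())

print(list(chunks_with_sentinel(range(7), 3)))
# [(0, 1, 2), (3, 4, 5), (6,)]
```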

Answer 24

Here is a list of additional approaches:

Given

import itertools as it
import collections as ct

import more_itertools as mit


iterable = range(11)
n = 3

Code

The Standard Library

list(it.zip_longest(*[iter(iterable)] * n))
# [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, None)]

d = {}
for i, x in enumerate(iterable):
    d.setdefault(i//n, []).append(x)

list(d.values())
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10]]

dd = ct.defaultdict(list)
for i, x in enumerate(iterable):
    dd[i//n].append(x)

list(dd.values())
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10]]

more_itertools+

list(mit.chunked(iterable, n))
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10]]

list(mit.sliced(iterable, n))
# [range(0, 3), range(3, 6), range(6, 9), range(9, 11)]

list(mit.grouper(iterable, n))
# [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, None)]

list(mit.windowed(iterable, n, step=n))
# [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, None)]

References

+ A third-party library that implements itertools recipes and more:

> pip install more_itertools
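
The zip_longest(*[iter(iterable)] * n) idiom works because the list holds n references to the same iterator, so each output tuple consumes n consecutive items. If the None padding in the last group is unwanted, one possible sketch is to pad with a private sentinel object and strip it afterwards. The helper name grouper_no_fill is an assumption made for this example.

```python
import itertools as it

def grouper_no_fill(iterable, n):
    sentinel = object()            # unique marker that cannot clash with real data
    args = [iter(iterable)] * n    # n references to one shared iterator
    for group in it.zip_longest(*args, fillvalue=sentinel):
        # Drop the padding that zip_longest added to the final group
        yield tuple(x for x in group if x is not sentinel)

print(list(grouper_no_fill(range(11), 3)))
# [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10)]
```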


Answer 25

See this reference

>>> orange = range(1, 1001)
>>> otuples = list( zip(*[iter(orange)]*10))
>>> print(otuples)
[(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), ... (991, 992, 993, 994, 995, 996, 997, 998, 999, 1000)]
>>> olist = [list(i) for i in otuples]
>>> print(olist)
[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], ..., [991, 992, 993, 994, 995, 996, 997, 998, 999, 1000]]
>>> 

Python3
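
One caveat with the plain zip idiom: zip stops at the shortest input, so a trailing group with fewer than n items is silently dropped. The example above avoids this only because 1000 is a multiple of 10. A short sketch of the difference, using an illustrative seven-item range:

```python
from itertools import zip_longest

items = range(1, 8)  # 7 items, not a multiple of 3

truncated = list(zip(*[iter(items)] * 3))
print(truncated)
# [(1, 2, 3), (4, 5, 6)]

padded = list(zip_longest(*[iter(items)] * 3))
print(padded)
# [(1, 2, 3), (4, 5, 6), (7, None, None)]
```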


Answer 26

Since everybody here is talking about iterators: boltons has a function made for exactly this, called iterutils.chunked_iter.

from boltons import iterutils

list(iterutils.chunked_iter(list(range(50)), 11))

Output:

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
 [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32],
 [33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43],
 [44, 45, 46, 47, 48, 49]]

But if memory is not a concern, you can do it the old way and build the full list of chunks up front with iterutils.chunked.


Answer 27

One more solution

def make_chunks(data, chunk_size): 
    while data:
        chunk, data = data[:chunk_size], data[chunk_size:]
        yield chunk

>>> for chunk in make_chunks([1, 2, 3, 4, 5, 6, 7], 2):
...     print(chunk)
... 
[1, 2]
[3, 4]
[5, 6]
[7]
>>> 
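
Because make_chunks relies only on slicing and truthiness, it should work for any sliceable sequence, not just lists. The sketch below tries it on a string and a tuple. Keep in mind that each step copies the remaining data, so it is best suited to small inputs.

```python
def make_chunks(data, chunk_size):
    while data:
        # Peel off the first chunk_size items; the rest becomes the new data
        chunk, data = data[:chunk_size], data[chunk_size:]
        yield chunk

print(list(make_chunks("abcdefg", 3)))
# ['abc', 'def', 'g']
print(list(make_chunks((1, 2, 3, 4, 5), 2)))
# [(1, 2), (3, 4), (5,)]
```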

Answer 28

def chunks(iterable,n):
    """assumes n is an integer>0
    """
    iterable=iter(iterable)
    while True:
        result=[]
        for i in range(n):
            try:
                a=next(iterable)
            except StopIteration:
                break
            else:
                result.append(a)
        if result:
            yield result
        else:
            break

g1 = (i*i for i in range(10))
g2 = chunks(g1, 3)
print(g2)
# <generator object chunks at 0x0337B9B8>
print(list(g2))
# [[0, 1, 4], [9, 16, 25], [36, 49, 64], [81]]
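
Since this generator pulls items only as they are requested, it can also be applied to an unbounded source. The sketch below uses a condensed restatement of chunks (built on islice instead of the explicit try/except loop, which is a substitution made here for brevity) together with itertools.count to take just the first two chunks of an infinite stream.

```python
from itertools import count, islice

def chunks(iterable, n):
    """Condensed restatement of the generator above; assumes n is an integer > 0."""
    it = iter(iterable)
    while True:
        result = list(islice(it, n))
        if not result:
            return          # source exhausted
        yield result

# count() never ends, so only the outer islice keeps this finite
first_two = list(islice(chunks(count(), 3), 2))
print(first_two)
# [[0, 1, 2], [3, 4, 5]]
```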

Answer 29

Consider using pieces from matplotlib.cbook (only available in older Matplotlib releases).

For example:

import numpy as np
import matplotlib.cbook as cbook

segments = cbook.pieces(np.arange(20), 3)
for s in segments:
    print(s)