Question: When splitting an empty string in Python, why does split() return an empty list while split('\n') returns ['']?

I am using split('\n') to get lines in one string, and found that ''.split() returns an empty list, [], while ''.split('\n') returns ['']. Is there any specific reason for such a difference?

And is there any more convenient way to count lines in a string?


Answer 0

Question: I am using split('\n') to get lines in one string, and found that ''.split() returns an empty list [], while ''.split('\n') returns [''].

The str.split() method has two algorithms. If no arguments are given, it splits on repeated runs of whitespace. However, if an argument is given, it is treated as a single delimiter with no repeated runs.

In the case of splitting an empty string, the first mode (no argument) will return an empty list because the whitespace is eaten and there are no values to put in the result list.

In contrast, the second mode (with an argument such as \n) always produces at least one field, even if that field is empty. Consider '\n'.split('\n'): you get two fields, because one split gives you two halves.
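
As a minimal sketch, the two modes can be compared side by side on a few inputs. The sample strings and the names samples and results are chosen here for illustration and are not part of the question:

```python
# Inputs chosen for illustration: empty, whitespace-only, and newline cases.
samples = ['', '   ', '\n', 'a\n', 'a\nb']

# Map each input to (whitespace-mode result, delimiter-mode result).
results = {s: (s.split(), s.split('\n')) for s in samples}

for s, (no_arg, with_arg) in results.items():
    # no_arg: runs of whitespace collapse and empty strings are dropped.
    # with_arg: every '\n' is exactly one cut, so empty fields survive.
    print(repr(s), no_arg, with_arg)
```

The whitespace mode never yields an empty string, while the delimiter mode always yields at least one field.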

Question: Is there any specific reason for such a difference?

This first mode is useful when data is aligned in columns with variable amounts of whitespace. For example:

>>> data = '''\
Shasta      California     14,200
McKinley    Alaska         20,300
Fuji        Japan          12,400
'''
>>> for line in data.splitlines():
...     print(line.split())

['Shasta', 'California', '14,200']
['McKinley', 'Alaska', '20,300']
['Fuji', 'Japan', '12,400']

The second mode is useful for delimited data such as CSV where repeated commas denote empty fields. For example:

>>> data = '''\
Guido,BDFL,,Amsterdam
Barry,FLUFL,,USA
Tim,,,USA
'''
>>> for line in data.splitlines():
...     print(line.split(','))

['Guido', 'BDFL', '', 'Amsterdam']
['Barry', 'FLUFL', '', 'USA']
['Tim', '', '', 'USA']

Note that the number of result fields is one greater than the number of delimiters. Think of cutting a rope: if you make no cuts, you have one piece; one cut gives two pieces; two cuts give three pieces. And so it is with Python's str.split(delimiter) method:

>>> ''.split(',')       # No cuts
['']
>>> ','.split(',')      # One cut
['', '']
>>> ',,'.split(',')     # Two cuts
['', '', '']
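
The rope analogy can be stated as a small check. This is a sketch that assumes a non-empty delimiter and no maxsplit argument; the helper name field_count_matches is made up for illustration:

```python
def field_count_matches(text, delimiter):
    """Return True if split() yields one more field than there are delimiters."""
    return len(text.split(delimiter)) == text.count(delimiter) + 1

for text in ['', ',', ',,', 'Tim,,,USA']:
    # Print the input, the number of fields, and whether the rule holds.
    print(repr(text), len(text.split(',')), field_count_matches(text, ','))
```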

Question: And is there any more convenient way to count lines in a string?

Yes, there are a couple of easy ways. One uses str.count() and the other uses str.splitlines(). Both ways will give the same answer unless the final line is missing the \n. If the final newline is missing, the str.splitlines approach will give the accurate answer. A faster technique that is also accurate uses the count method but then corrects it for the final newline:

>>> data = '''\
Line 1
Line 2
Line 3
Line 4'''

>>> data.count('\n')                               # Inaccurate
3
>>> len(data.splitlines())                         # Accurate, but slow
4
>>> data.count('\n') + (not data.endswith('\n'))   # Accurate and fast
4    
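
One caveat worth checking before picking a method: the corrected count and splitlines() agree on ordinary '\n'-separated text, but they can differ on inputs not covered above. The sketch below uses a hypothetical helper named count_by_newlines. It shows two such cases: an empty string, and text that uses a bare '\r', which splitlines() treats as a line boundary but count('\n') does not see.

```python
def count_by_newlines(text):
    # count() plus a correction when the last line has no trailing '\n'
    return text.count('\n') + (not text.endswith('\n'))

samples = ['a\nb\n', 'a\nb', 'a\r\nb', 'a\rb', '']

# Pair each input with (corrected count, splitlines count).
comparison = {t: (count_by_newlines(t), len(t.splitlines())) for t in samples}

for text, (by_count, by_splitlines) in comparison.items():
    print(repr(text), by_count, by_splitlines)
```

Which answer is "right" for the last two inputs depends on what counts as a line in your data.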

Question from @Kaz: Why the heck are two very different algorithms shoe-horned into a single function?

The signature for str.split is about 20 years old, and a number of the APIs from that era are strictly pragmatic. While not perfect, the method signature isn’t “terrible” either. For the most part, Guido’s API design choices have stood the test of time.

The current API is not without advantages. Consider strings such as:

ps_aux_header  = "USER               PID  %CPU %MEM      VSZ"
patient_header = "name,age,height,weight"

When asked to break these strings into fields, people tend to describe both using the same English word, “split”. When asked to read code such as fields = line.split() or fields = line.split(','), people tend to correctly interpret the statements as “splits a line into fields”.

Microsoft Excel’s text-to-columns tool made a similar API choice and incorporates both splitting algorithms in the same tool. People seem to mentally model field-splitting as a single concept even though more than one algorithm is involved.


Answer 1

It seems to simply be the way it’s supposed to work, according to the documentation:

Splitting an empty string with a specified separator returns [''].

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].

So, to make it clearer, the split() function implements two different splitting algorithms, and uses the presence of an argument to decide which one to run. This might be because it allows optimizing the one for no arguments more than the one with arguments; I don’t know.
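
A short sketch of what the documentation describes: passing None explicitly selects the same whitespace algorithm as passing nothing, while passing a literal space selects the single-delimiter algorithm. The sample string is invented for illustration.

```python
text = '  a  b '

whitespace_mode = text.split()        # no argument
explicit_none = text.split(None)      # same algorithm, selected explicitly
single_space = text.split(' ')        # each space is its own delimiter

print(whitespace_mode)
print(explicit_none)
print(single_space)
```

With ' ' as the separator, the leading, doubled, and trailing spaces each produce an empty string in the result.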


Answer 2

.split() without parameters tries to be clever. It splits on any whitespace (tabs, spaces, line feeds, etc.) and, as a result, skips all empty strings.

>>> "  fii    fbar \n bopp ".split()
['fii', 'fbar', 'bopp']

Essentially, .split() without parameters is used to extract words from a string, as opposed to .split() with parameters, which just takes a string and splits it.

That’s the reason for the difference.

And yeah, counting lines by splitting is not an efficient way. Count the number of line feeds, and add one if the string doesn’t end with a line feed.


Answer 3

Use count():

s = "Line 1\nLine2\nLine3"
n_lines = s.count('\n') + 1

Answer 4

>>> print str.split.__doc__
S.split([sep [,maxsplit]]) -> list of strings

Return a list of the words in the string S, using sep as the
delimiter string.  If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are removed
from the result.

Note the last sentence.

To count lines you can simply count how many \n are there:

line_count = some_string.count('\n') + (not some_string.endswith('\n'))

The last part takes into account a last line that does not end with \n, even though this means that Hello, World! and Hello, World!\n have the same line count (which for me is reasonable). Otherwise, you can simply add 1 to the count of \n.
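
A minimal sketch of this count wrapped in a helper (the name count_lines is hypothetical). The comparison needs its own parentheses because + binds more tightly than comparison operators; without them Python would try to add an int to a str. Using endswith() rather than indexing the last character also avoids an IndexError on an empty string. Returning 0 for empty input is an assumption made here, chosen to match len(''.splitlines()).

```python
def count_lines(text):
    """Count lines; a missing final newline still counts as a line."""
    if not text:
        return 0  # assumption: an empty string has no lines
    # The boolean is added as 0 or 1, so the result is an int.
    return text.count('\n') + (not text.endswith('\n'))

print(count_lines('Hello, World!'))
print(count_lines('Hello, World!\n'))
print(count_lines(''))
```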


Answer 5

To count lines, you can count the number of line breaks:

n_lines = sum(1 for s in the_string if s == "\n") + 1 # add 1 for last line

Edit:

The other answer, which uses the built-in count, is actually more suitable.

