Iterate over the lines of a string

Question: Iterate over the lines of a string

I have a multi-line string defined like this:

foo = """
this is 
a multi-line string.
"""

We use this string as test input for a parser I am writing. The parser function receives a file object as input and iterates over it. It also calls the next() method directly to skip lines, so I really need an iterator as input, not just an iterable. I need an iterator that iterates over the individual lines of that string, as a file object would over the lines of a text file. I could of course do it like this:

lineiterator = iter(foo.splitlines())

Is there a more direct way of doing this? In this scenario the string has to be traversed once for the splitting, and then again by the parser. It doesn’t matter in my test case, since the string is very short there; I am just asking out of curiosity. Python has so many useful and efficient built-ins for this kind of thing, but I could find nothing that suits this need.
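
To make the iterator-versus-iterable requirement concrete, here is a minimal sketch in Python 3. The skip_header function is a hypothetical stand-in for the parser, not the actual code from the question. A list of lines is iterable but does not support next(), whereas the object returned by iter() does:

```python
foo = "\nthis is \na multi-line string.\n"

def skip_header(lines):
    """Hypothetical parser stand-in: skip one line via next(), return the rest."""
    next(lines)              # requires an iterator, not just an iterable
    return list(lines)

line_list = foo.splitlines()     # a list: iterable, but not an iterator
try:
    skip_header(line_list)
    list_worked = True
except TypeError:                # 'list' object is not an iterator
    list_worked = False

rest = skip_header(iter(line_list))  # wrapping the list in iter() gives an iterator
print(list_worked, rest)
```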


Answer 0

Here are three possibilities:

foo = """
this is 
a multi-line string.
"""

def f1(foo=foo):
    # Split eagerly into a list, then wrap the list in an iterator.
    return iter(foo.splitlines())

def f2(foo=foo):
    # Build each line one character at a time.
    retval = ''
    for char in foo:
        if char == '\n':
            yield retval
            retval = ''
        else:
            retval += char
    if retval:
        yield retval

def f3(foo=foo):
    # Locate each newline with str.find and yield the slice before it.
    prevnl = -1
    while True:
        nextnl = foo.find('\n', prevnl + 1)
        if nextnl < 0:
            break
        yield foo[prevnl + 1:nextnl]
        prevnl = nextnl

if __name__ == '__main__':
    for f in f1, f2, f3:
        print(list(f()))

Running this as the main script confirms that the three functions are equivalent. With timeit (with the script saved as asp.py, and foo multiplied by 100 to get strings substantial enough for more precise measurement):

$ python -mtimeit -s'import asp' 'list(asp.f3())'
1000 loops, best of 3: 370 usec per loop
$ python -mtimeit -s'import asp' 'list(asp.f2())'
1000 loops, best of 3: 1.36 msec per loop
$ python -mtimeit -s'import asp' 'list(asp.f1())'
10000 loops, best of 3: 61.5 usec per loop

Note we need the list() call to ensure the iterators are traversed, not just built.
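
The point about list() can be illustrated with a small sketch (Python 3). The lines_lazily generator and its log argument are hypothetical names introduced only for this illustration. Calling a generator function merely builds the generator object; none of its body runs until something consumes it:

```python
def lines_lazily(text, log):
    """Generator that records in `log` each line at the moment it is produced."""
    for line in text.splitlines():
        log.append(line)
        yield line

log = []
gen = lines_lazily("a\nb\nc\n", log)   # builds the generator; the body has not run yet
built_only = list(log)                  # snapshot of the log: still empty here

result = list(gen)                      # list() drives the generator to exhaustion
print(built_only, result)
```

Timing only the call that builds the generator would therefore measure almost nothing, which is why the benchmark wraps each call in list().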

In other words, the naive implementation is so much faster it isn’t even funny: 6 times faster than my attempt with find calls, which in turn is about 4 times faster than the lower-level, character-by-character approach.

Lessons to retain: measurement is always a good thing (but must be accurate); string methods like splitlines are implemented in very fast ways; putting strings together by programming at a very low level (especially with loops that += very small pieces) can be quite slow.
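
For readers who prefer to measure from inside Python rather than from the shell, the comparison can be sketched with the timeit module (Python 3). The best_time helper and its number and repeat defaults are assumptions made for this sketch, and the absolute figures will differ from those quoted above depending on machine and interpreter version:

```python
import timeit

foo = "\nthis is \na multi-line string.\n" * 100

def f1(text=foo):
    return iter(text.splitlines())

def f3(text=foo):
    prevnl = -1
    while True:
        nextnl = text.find('\n', prevnl + 1)
        if nextnl < 0:
            break
        yield text[prevnl + 1:nextnl]
        prevnl = nextnl

def best_time(func, number=200, repeat=3):
    """Best per-call time in seconds over `repeat` runs of `number` calls each."""
    timer = timeit.Timer(lambda: list(func()))   # list() forces full traversal
    return min(timer.repeat(repeat=repeat, number=number)) / number

timings = {f.__name__: best_time(f) for f in (f1, f3)}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} usec per call")
```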

Edit: added @Jacob’s proposal, slightly modified to give the same results as the others (trailing blanks on a line are kept), i.e.:

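from cStringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl != '':
            yield nl.strip('\n')
        else:
            # A bare return ends the generator; raising StopIteration
            # inside a generator is a RuntimeError on Python 3.7+ (PEP 479).
            return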

Measuring gives:

$ python -mtimeit -s'import asp' 'list(asp.f4())'
1000 loops, best of 3: 406 usec per loop

Not quite as good as the .find-based approach. Still, it is worth keeping in mind because it might be less prone to small off-by-one bugs. Any loop where you see occurrences of +1 and -1, like my f3 above, should automatically trigger off-by-one suspicions (and so should many loops that lack such tweaks but should have them), though I believe my code is right, since I was able to check its output against the other functions’ output.

But the split-based approach still rules.
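
One way to act on that off-by-one suspicion is to compare the find-based generator with splitlines on a few edge cases. The sketch below (Python 3) repeats f3 with the same logic and adds a hypothetical variant, f3_with_tail, that is not part of the original answer. The test inputs use only '\n' as a line ending, because splitlines also splits on other line boundaries such as '\r\n':

```python
def f3(text):
    """The find-based generator from the answer, with the same logic."""
    prevnl = -1
    while True:
        nextnl = text.find('\n', prevnl + 1)
        if nextnl < 0:
            break
        yield text[prevnl + 1:nextnl]
        prevnl = nextnl

def f3_with_tail(text):
    """Hypothetical variant that also yields a final line lacking a newline."""
    prevnl = -1
    while True:
        nextnl = text.find('\n', prevnl + 1)
        if nextnl < 0:
            break
        yield text[prevnl + 1:nextnl]
        prevnl = nextnl
    tail = text[prevnl + 1:]     # whatever follows the last newline
    if tail:
        yield tail

cases = ["", "a", "a\n", "a\nb", "\n\n", "a\n\nb\n"]
mismatches = [c for c in cases if list(f3(c)) != c.splitlines()]
print(mismatches)
```

With these inputs, f3 agrees with splitlines whenever the text ends in a newline (as the test string in the question does) and differs only when the last line is unterminated, in which case that line is dropped.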

An aside: possibly better style for f4 would be:

from cStringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl == '':
            break
        yield nl.strip('\n')

At least, it’s a bit less verbose. Unfortunately, the need to strip the trailing \n characters rules out the clearer and faster replacement of the while loop with return iter(stri) (the iter part is redundant in modern versions of Python, I believe since 2.3 or 2.4, but it’s also innocuous). It may also be worth trying:

    # Python 2 (needs import itertools); on Python 3 the built-in map is already lazy.
    return itertools.imap(lambda s: s.strip('\n'), stri)

or variations thereof. But I’m stopping here, since this is pretty much a theoretical exercise with respect to the split-based approach, which is the simplest and fastest one.
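
For reference, a Python 3 rendering of that last idea might look like the sketch below. The name f5 is hypothetical, and io.StringIO with the built-in map stands in for cStringIO and itertools.imap, neither of which exists on Python 3. It uses rstrip('\n') since only the trailing newline needs to go:

```python
import io

def f5(text):
    """Iterator over the lines of `text`, trailing newline removed (Python 3 sketch)."""
    return map(lambda s: s.rstrip('\n'), io.StringIO(text))

foo = "\nthis is \na multi-line string.\n"
lines = f5(foo)
first = next(lines)        # a map object is an iterator, so next() works directly
remaining = list(lines)
print(repr(first), remaining)
```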


Answer 1

I’m not sure what you mean by “then again by the parser”. After the splitting has been done, there’s no further traversal of the string, only a traversal of the list of split strings. This will probably be the fastest way to accomplish this, so long as your string isn’t absolutely huge. Because Python strings are immutable, a new string has to be created for each line anyway, so this work has to be done at some point.

If your string is very large, the disadvantage is in memory usage: you’ll have the original string and a list of split strings in memory at the same time, doubling the memory required. An iterator approach can save you this, building a string as needed, though it still pays the “splitting” penalty. However, if your string is that large, you generally want to avoid even the unsplit string being in memory. It would be better just to read the string from a file, which already allows you to iterate through it as lines.
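
A rough way to see the extra memory is to compare sizes with sys.getsizeof, as in the sketch below (Python 3). This is only an approximation under stated assumptions: getsizeof reports per-object sizes as implemented by CPython, and the sample text is made up for the illustration. On CPython, the per-object overhead of many short strings can make the split list even larger than the original string:

```python
import sys

text = "this is a line\n" * 1000     # made-up sample input
lines = text.splitlines()            # the original string stays alive alongside the list

size_of_text = sys.getsizeof(text)
size_of_list = sys.getsizeof(lines) + sum(sys.getsizeof(line) for line in lines)

print(size_of_text, size_of_list)
```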

However, if you do have a huge string in memory already, one approach would be to use StringIO, which presents a file-like interface to a string, including allowing iterating by line (internally using .find to find the next newline). You then get:

from io import StringIO  # Python 3; on Python 2 this was: from StringIO import StringIO

s = StringIO(myString)   # myString is the large string already in memory
for line in s:           # each line keeps its trailing newline
    do_something_with(line)
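
Since the question also needs next() to skip lines, it may help to see that a StringIO object is its own iterator. The sketch below (Python 3) uses a hypothetical parse function and made-up input that skips the line following each HEADER line:

```python
import io

def parse(fileobj):
    """Hypothetical parser: drop each 'HEADER' line and the line that follows it."""
    kept = []
    for line in fileobj:
        if line.startswith("HEADER"):
            next(fileobj, None)   # skip the next line; the default avoids StopIteration at EOF
            continue
        kept.append(line.rstrip("\n"))
    return kept

data = "HEADER\nskipped\nkept 1\nkept 2\n"
result = parse(io.StringIO(data))
print(result)
```

The same parse function would accept a real file opened in text mode, which is what makes StringIO convenient for test input.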

Answer 2

If I read Modules/cStringIO.c correctly, this should be quite efficient (although somewhat verbose):

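from cStringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO

def iterbuf(buf):
    stri = StringIO(buf)
    while True:
        nl = stri.readline()
        if nl != '':
            # Note: strip() with no argument also removes leading and
            # trailing blanks, not just the newline.
            yield nl.strip()
        else:
            # End the generator with return; raising StopIteration here
            # is a RuntimeError on Python 3.7+ (PEP 479).
            return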

Answer 3

Regex-based searching is sometimes faster than the generator approach:

import re

# Matches each newline-terminated line; an unterminated last line is skipped.
RRR = re.compile(r'(.*)\n')
def f4(arg):
    return (i.group(1) for i in RRR.finditer(arg))

Answer 4

I suppose you could roll your own:

def parse(string):
    retval = ''
    for char in string:
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    if retval:
        yield retval

I’m not sure how efficient this implementation is, but that will only iterate over your string once.

Mmm, generators.

Edit:

Of course you’ll also want to add in whatever type of parsing actions you want to take, but that’s pretty simple.


Answer 5

You can iterate over “a file”, which produces lines, including the trailing newline character. To make a “virtual file” out of a string, you can use StringIO:

import io  # for Py2.7 that would be import cStringIO as io

for line in io.StringIO(foo):
    print(repr(line))
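
As a small check of what "including the trailing newline" means, the sketch below (Python 3) compares iterating over io.StringIO with str.splitlines(keepends=True). The comparison holds for text that uses only '\n' as its line ending, which is an assumption of this example, because splitlines recognises additional line boundaries:

```python
import io

foo = "\nthis is \na multi-line string.\n"

from_file_like = list(io.StringIO(foo))        # lines with their trailing '\n'
from_split = foo.splitlines(keepends=True)     # same lines, built eagerly as a list
stripped = [line.rstrip("\n") for line in io.StringIO(foo)]

print(from_file_like)
print(stripped)
```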