查找字符串中子字符串的第n次出现

Question 1

This seems like it should be pretty trivial, but I am new at Python and want to do it the most Pythonic way.

I want to find the index corresponding to the n’th occurrence of a substring within a string.

There’s got to be something equivalent to what I WANT to do which is

mystring.find("substring", 2nd)

How can you achieve this in Python?

Question 2

Mark’s iterative approach would be the usual way, I think.

Here’s an alternative with string-splitting, which can often be useful for finding-related processes:

def findnth(haystack, needle, n):
    parts= haystack.split(needle, n+1)
    if len(parts)<=n+1:
        return -1
    return len(haystack)-len(parts[-1])-len(needle)

And here’s a quick (and somewhat dirty, in that you have to choose some chaff that can’t match the needle) one-liner:

'foo bar bar bar'.replace('bar', 'XXX', 1).find('bar')

Question 3

Here’s a more Pythonic version of the straightforward iterative solution:

def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

Example:

>>> find_nth("foofoofoofoo", "foofoo", 2)
6

If you want to find the nth overlapping occurrence of needle, you can increment by 1 instead of len(needle), like this:

def find_nth_overlapping(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+1)
        n -= 1
    return start

Example:

>>> find_nth_overlapping("foofoofoofoo", "foofoo", 2)
3

This is easier to read than Mark’s version, and it doesn’t require the extra memory of the splitting version or importing regular expression module. It also adheres to a few of the rules in the Zen of python, unlike the various re approaches:

Simple is better than complex.
Flat is better than nested.
Readability counts.

Question 4

This will find the second occurrence of substring in string.

def find_2nd(string, substring):
   return string.find(substring, string.find(substring) + 1)

Edit: I haven’t thought much about the performance, but a quick recursion can help with finding the nth occurrence:

def find_nth(string, substring, n):
   if (n == 1):
       return string.find(substring)
   else:
       return string.find(substring, find_nth(string, substring, n - 1) + 1)

Question 5

Understanding that regex is not always the best solution, I’d probably use one here:

>>> import re
>>> s = "ababdfegtduab"
>>> [m.start() for m in re.finditer(r"ab",s)]
[0, 2, 11]
>>> [m.start() for m in re.finditer(r"ab",s)][2] #index 2 is third occurrence 
11

Question 6

I’m offering some benchmarking results comparing the most prominent approaches presented so far, namely @bobince’s findnth() (based on str.split()) vs. @tgamblin’s or @Mark Byers’ find_nth() (based on str.find()). I will also compare with a C extension (_find_nth.so) to see how fast we can go. Here is find_nth.py:

def findnth(haystack, needle, n):
    parts= haystack.split(needle, n+1)
    if len(parts)<=n+1:
        return -1
    return len(haystack)-len(parts[-1])-len(needle)

def find_nth(s, x, n=0, overlap=False):
    l = 1 if overlap else len(x)
    i = -l
    for c in xrange(n + 1):
        i = s.find(x, i + l)
        if i < 0:
            break
    return i

Of course, performance matters most if the string is large, so suppose we want to find the 1000001st newline (‘\n’) in a 1.3 GB file called ‘bigfile’. To save memory, we would like to work on an mmap.mmap object representation of the file:

In [1]: import _find_nth, find_nth, mmap

In [2]: f = open('bigfile', 'r')

In [3]: mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

There is already the first problem with findnth(), since mmap.mmap objects don’t support split(). So we actually have to copy the whole file into memory:

In [4]: %time s = mm[:]
CPU times: user 813 ms, sys: 3.25 s, total: 4.06 s
Wall time: 17.7 s

Ouch! Fortunately s still fits in the 4 GB of memory of my Macbook Air, so let’s benchmark findnth():

In [5]: %timeit find_nth.findnth(s, '\n', 1000000)
1 loops, best of 3: 29.9 s per loop

Clearly a terrible performance. Let’s see how the approach based on str.find() does:

In [6]: %timeit find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 774 ms per loop

Much better! Clearly, findnth()‘s problem is that it is forced to copy the string during split(), which is already the second time we copied the 1.3 GB of data around after s = mm[:]. Here comes in the second advantage of find_nth(): We can use it on mm directly, such that zero copies of the file are required:

In [7]: %timeit find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 1.21 s per loop

There appears to be a small performance penalty operating on mm vs. s, but this illustrates that find_nth() can get us an answer in 1.2 s compared to findnth‘s total of 47 s.

I found no cases where the str.find() based approach was significantly worse than the str.split() based approach, so at this point, I would argue that @tgamblin’s or @Mark Byers’ answer should be accepted instead of @bobince’s.

In my testing, the version of find_nth() above was the fastest pure Python solution I could come up with (very similar to @Mark Byers’ version). Let’s see how much better we can do with a C extension module. Here is _find_nthmodule.c:

#include <Python.h>
#include <string.h>

off_t _find_nth(const char *buf, size_t l, char c, int n) {
    off_t i;
    for (i = 0; i < l; ++i) {
        if (buf[i] == c && n-- == 0) {
            return i;
        }
    }
    return -1;
}

off_t _find_nth2(const char *buf, size_t l, char c, int n) {
    const char *b = buf - 1;
    do {
        b = memchr(b + 1, c, l);
        if (!b) return -1;
    } while (n--);
    return b - buf;
}

/* mmap_object is private in mmapmodule.c - replicate beginning here */
typedef struct {
    PyObject_HEAD
    char *data;
    size_t size;
} mmap_object;

typedef struct {
    const char *s;
    size_t l;
    char c;
    int n;
} params;

int parse_args(PyObject *args, params *P) {
    PyObject *obj;
    const char *x;

    if (!PyArg_ParseTuple(args, "Osi", &obj, &x, &P->n)) {
        return 1;
    }
    PyTypeObject *type = Py_TYPE(obj);

    if (type == &PyString_Type) {
        P->s = PyString_AS_STRING(obj);
        P->l = PyString_GET_SIZE(obj);
    } else if (!strcmp(type->tp_name, "mmap.mmap")) {
        mmap_object *m_obj = (mmap_object*) obj;
        P->s = m_obj->data;
        P->l = m_obj->size;
    } else {
        PyErr_SetString(PyExc_TypeError, "Cannot obtain char * from argument 0");
        return 1;
    }
    P->c = x[0];
    return 0;
}

static PyObject* py_find_nth(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("i", _find_nth(P.s, P.l, P.c, P.n));
    } else {
        return NULL;    
    }
}

static PyObject* py_find_nth2(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("i", _find_nth2(P.s, P.l, P.c, P.n));
    } else {
        return NULL;    
    }
}

static PyMethodDef methods[] = {
    {"find_nth", py_find_nth, METH_VARARGS, ""},
    {"find_nth2", py_find_nth2, METH_VARARGS, ""},
    {0}
};

PyMODINIT_FUNC init_find_nth(void) {
    Py_InitModule("_find_nth", methods);
}

Here is the setup.py file:

from distutils.core import setup, Extension
module = Extension('_find_nth', sources=['_find_nthmodule.c'])
setup(ext_modules=[module])

Install as usual with python setup.py install. The C code plays at an advantage here since it is limited to finding single characters, but let’s see how fast this is:

In [8]: %timeit _find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 218 ms per loop

In [9]: %timeit _find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 216 ms per loop

In [10]: %timeit _find_nth.find_nth2(mm, '\n', 1000000)
1 loops, best of 3: 307 ms per loop

In [11]: %timeit _find_nth.find_nth2(s, '\n', 1000000)
1 loops, best of 3: 304 ms per loop

Clearly quite a bit faster still. Interestingly, there is no difference on the C level between the in-memory and mmapped cases. It is also interesting to see that _find_nth2(), which is based on string.h‘s memchr() library function, loses out against the straightforward implementation in _find_nth(): The additional “optimizations” in memchr() are apparently backfiring…

In conclusion, the implementation in findnth() (based on str.split()) is really a bad idea, since (a) it performs terribly for larger strings due to the required copying, and (b) it doesn’t work on mmap.mmap objects at all. The implementation in find_nth() (based on str.find()) should be preferred in all circumstances (and therefore be the accepted answer to this question).

There is still quite a bit of room for improvement, since the C extension ran almost a factor of 4 faster than the pure Python code, indicating that there might be a case for a dedicated Python library function.

Question 7

Simplest way?

text = "This is a test from a test ok" 

firstTest = text.find('test')

print text.find('test', firstTest + 1)

Question 8

I’d probably do something like this, using the find function that takes an index parameter:

def find_nth(s, x, n):
    i = -1
    for _ in range(n):
        i = s.find(x, i + len(x))
        if i == -1:
            break
    return i

print find_nth('bananabanana', 'an', 3)

It’s not particularly Pythonic I guess, but it’s simple. You could do it using recursion instead:

def find_nth(s, x, n, i = 0):
    i = s.find(x, i)
    if n == 1 or i == -1:
        return i 
    else:
        return find_nth(s, x, n - 1, i + len(x))

print find_nth('bananabanana', 'an', 3)

It’s a functional way to solve it, but I don’t know if that makes it more Pythonic.

Question 9

This will give you an array of the starting indices for matches to yourstring:

import re
indices = [s.start() for s in re.finditer(':', yourstring)]

Then your nth entry would be:

n = 2
nth_entry = indices[n-1]

Of course you have to be careful with the index bounds. You can get the number of instances of yourstring like this:

num_instances = len(indices)

Question 10

Here is another approach using re.finditer.
The difference is that this only looks into the haystack as far as necessary

from re import finditer
from itertools import dropwhile
needle='an'
haystack='bananabanana'
n=2
next(dropwhile(lambda x: x[0]<n, enumerate(re.finditer(needle,haystack))))[1].start()

Question 11

Here’s another re + itertools version that should work when searching for either a str or a RegexpObject. I will freely admit that this is likely over-engineered, but for some reason it entertained me.

import itertools
import re

def find_nth(haystack, needle, n = 1):
    """
    Find the starting index of the nth occurrence of ``needle`` in \
    ``haystack``.

    If ``needle`` is a ``str``, this will perform an exact substring
    match; if it is a ``RegexpObject``, this will perform a regex
    search.

    If ``needle`` doesn't appear in ``haystack``, return ``-1``. If
    ``needle`` doesn't appear in ``haystack`` ``n`` times,
    return ``-1``.

    Arguments
    ---------
    * ``needle`` the substring (or a ``RegexpObject``) to find
    * ``haystack`` is a ``str``
    * an ``int`` indicating which occurrence to find; defaults to ``1``

    >>> find_nth("foo", "o", 1)
    1
    >>> find_nth("foo", "o", 2)
    2
    >>> find_nth("foo", "o", 3)
    -1
    >>> find_nth("foo", "b")
    -1
    >>> import re
    >>> either_o = re.compile("[oO]")
    >>> find_nth("foo", either_o, 1)
    1
    >>> find_nth("FOO", either_o, 1)
    1
    """
    if (hasattr(needle, 'finditer')):
        matches = needle.finditer(haystack)
    else:
        matches = re.finditer(re.escape(needle), haystack)
    start_here = itertools.dropwhile(lambda x: x[0] < n, enumerate(matches, 1))
    try:
        return next(start_here)[1].start()
    except StopIteration:
        return -1

Question 12

Building on modle13‘s answer, but without the re module dependency.

def iter_find(haystack, needle):
    return [i for i in range(0, len(haystack)) if haystack[i:].startswith(needle)]

I kinda wish this was a builtin string method.

>>> iter_find("http://stackoverflow.com/questions/1883980/", '/')
[5, 6, 24, 34, 42]

Question 13

>>> s="abcdefabcdefababcdef"
>>> j=0
>>> for n,i in enumerate(s):
...   if s[n:n+2] =="ab":
...     print n,i
...     j=j+1
...     if j==2: print "2nd occurence at index position: ",n
...
0 a
6 a
2nd occurence at index position:  6
12 a
14 a

Question 14

Providing another “tricky” solution, which use split and join.

In your example, we can use

len("substring".join([s for s in ori.split("substring")[:2]]))

Question 15

# return -1 if nth substr (0-indexed) d.n.e, else return index
def find_nth(s, substr, n):
    i = 0
    while n >= 0:
        n -= 1
        i = s.find(substr, i + 1)
    return i

Question 16

Solution without using loops and recursion.

Use the required pattern in compile method and enter the desired occurrence in variable ‘n’ and the last statement will print the starting index of the nth occurrence of the pattern in the given string. Here the result of finditer i.e. iterator is being converted to list and directly accessing the nth index.

import re
n=2
sampleString="this is history"
pattern=re.compile("is")
matches=pattern.finditer(sampleString)
print(list(matches)[n].span()[0])

Question 17

The replace one liner is great but only works because XX and bar have the same lentgh

A good and general def would be:

def findN(s,sub,N,replaceString="XXX"):
    return s.replace(sub,replaceString,N-1).find(sub) - (len(replaceString)-len(sub))*(N-1)

Question 18

This is the answer you really want:

def Find(String,ToFind,Occurence = 1):
index = 0 
count = 0
while index <= len(String):
    try:
        if String[index:index + len(ToFind)] == ToFind:
            count += 1
        if count == Occurence:
               return index
               break
        index += 1
    except IndexError:
        return False
        break
return False

Question 19

Here is my solution for finding nth occurrance of b in string a:

from functools import reduce


def findNth(a, b, n):
    return reduce(lambda x, y: -1 if y > x + 1 else a.find(b, x + 1), range(n), -1)

It is pure Python and iterative. For 0 or n that is too large, it returns -1. It is one-liner and can be used directly. Here is an example:

>>> reduce(lambda x, y: -1 if y > x + 1 else 'bibarbobaobaotang'.find('b', x + 1), range(4), -1)
7

Question 20

For the special case where you search for the n’th occurence of a character (i.e. substring of length 1), the following function works by building a list of all positions of occurences of the given character:

def find_char_nth(string, char, n):
    """Find the n'th occurence of a character within a string."""
    return [i for i, c in enumerate(string) if c == char][n-1]

If there are fewer than n occurences of the given character, it will give IndexError: list index out of range.

This is derived from @Zv_oDD’s answer and simplified for the case of a single character.

Question 21

Def:

def get_first_N_words(mytext, mylen = 3):
    mylist = list(mytext.split())
    if len(mylist)>=mylen: return ' '.join(mylist[:mylen])

To use:

get_first_N_words('  One Two Three Four ' , 3)

Output:

'One Two Three'

Question 22

How about:

c = os.getcwd().split('\\')
print '\\'.join(c[0:-2])

查找字符串中子字符串的第n次出现

问题：查找字符串中子字符串的第n次出现

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

回答 12

回答 13

回答 14

回答 15

回答 16

回答 17

回答 18

回答 19

回答 20

排行榜展示

Python 情人节超强技能导出微信聊天记录生成词云

你不得不知道的python超级文献批量搜索下载工具

7行代码 Python热力图可视化分析缺失数据处理

Python 流程图 — 一键转化代码为流程图

Python 优化—算出每条语句执行时间

你的10W块放哪里能赚最多钱？

文章展示

子流程Popen和call有什么区别（我该如何使用它们）？

Python的所有函数如何工作？

如何获得给定日期的星期几？

按位操作和用法

使用pickle.dump-TypeError：必须为str，而不是字节

如何在Spark中关闭INFO日志记录？

查找字符串中子字符串的第n次出现

问题：查找字符串中子字符串的第n次出现

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

回答 12

回答 13

回答 14

回答 15

回答 16

回答 17

回答 18

回答 19

回答 20

相关文章

排行榜展示

文章展示