问题:检查列表中是否存在值的最快方法

最快的方法是什么才能知道列表中是否存在值(列表中包含数百万个值)及其索引是什么?

我知道列表中的所有值都是唯一的,如本例所示。

我尝试的第一种方法是(在我的实际代码中为3.8秒):

a = [4,2,3,1,5,6]

if a.count(7) == 1:
    b=a.index(7)
    "Do something with variable b"

我尝试的第二种方法是(速度提高了2倍:实际代码为1.9秒):

a = [4,2,3,1,5,6]

try:
    b=a.index(7)
except ValueError:
    "Do nothing"
else:
    "Do something with variable b"

堆栈溢出用户建议的方法(我的实际代码为2.74秒):

a = [4,2,3,1,5,6]
if 7 in a:
    a.index(7)

在我的真实代码中,第一种方法耗时3.81秒,第二种方法耗时1.88秒。这是一个很好的改进,但是:

我是使用Python /脚本的初学者,有没有更快的方法来完成相同的事情并节省更多的处理时间?

我的应用程序更具体的说明:

在Blender API中,我可以访问粒子列表:

particles = [1, 2, 3, 4, etc.]

从那里,我可以访问粒子的位置:

particles[x].location = [x,y,z]

对于每个粒子,我通过搜索每个粒子位置来测试是否存在邻居:

if [x+1,y,z] in particles.location
    "Find the identity of this neighbour particle in x:the particle's index
    in the array"
    particles.index([x+1,y,z])

What is the fastest way to know if a value exists in a list (a list with millions of values in it) and what its index is?

I know that all values in the list are unique as in this example.

The first method I try is (3.8 sec in my real code):

a = [4,2,3,1,5,6]

if a.count(7) == 1:
    b=a.index(7)
    "Do something with variable b"

The second method I try is (2x faster: 1.9 sec for my real code):

a = [4,2,3,1,5,6]

try:
    b=a.index(7)
except ValueError:
    "Do nothing"
else:
    "Do something with variable b"

Proposed methods from Stack Overflow user (2.74 sec for my real code):

a = [4,2,3,1,5,6]
if 7 in a:
    a.index(7)

In my real code, the first method takes 3.81 sec and the second method takes 1.88 sec. It’s a good improvement, but:

I’m a beginner with Python/scripting, and is there a faster way to do the same things and save more processing time?

More specific explication for my application:

In the Blender API I can access a list of particles:

particles = [1, 2, 3, 4, etc.]

From there, I can access a particle’s location:

particles[x].location = [x,y,z]

And for each particle I test if a neighbour exists by searching each particle location like so:

if [x+1,y,z] in particles.location
    "Find the identity of this neighbour particle in x:the particle's index
    in the array"
    particles.index([x+1,y,z])

回答 0

7 in a

最清晰,最快的方法。

您也可以考虑使用set,但是从列表中构造该集合所花费的时间可能比更快的成员资格测试所节省的时间还要长。唯一可以确定的基准就是基准测试。(这还取决于您需要执行哪些操作)

7 in a

Clearest and fastest way to do it.

You can also consider using a set, but constructing that set from your list may take more time than faster membership testing will save. The only way to be certain is to benchmark well. (this also depends on what operations you require)


回答 1

正如其他人所述,in对于大型列表,它可能非常慢。这里是表演一些比较insetbisect。请注意时间(以秒为单位)是对数刻度。

在此处输入图片说明

测试代码:

import random
import bisect
import matplotlib.pyplot as plt
import math
import time

def method_in(a,b,c):
    start_time = time.time()
    for i,x in enumerate(a):
        if x in b:
            c[i] = 1
    return(time.time()-start_time)   

def method_set_in(a,b,c):
    start_time = time.time()
    s = set(b)
    for i,x in enumerate(a):
        if x in s:
            c[i] = 1
    return(time.time()-start_time)

def method_bisect(a,b,c):
    start_time = time.time()
    b.sort()
    for i,x in enumerate(a):
        index = bisect.bisect_left(b,x)
        if index < len(a):
            if x == b[index]:
                c[i] = 1
    return(time.time()-start_time)

def profile():
    time_method_in = []
    time_method_set_in = []
    time_method_bisect = []

    Nls = [x for x in range(1000,20000,1000)]
    for N in Nls:
        a = [x for x in range(0,N)]
        random.shuffle(a)
        b = [x for x in range(0,N)]
        random.shuffle(b)
        c = [0 for x in range(0,N)]

        time_method_in.append(math.log(method_in(a,b,c)))
        time_method_set_in.append(math.log(method_set_in(a,b,c)))
        time_method_bisect.append(math.log(method_bisect(a,b,c)))

    plt.plot(Nls,time_method_in,marker='o',color='r',linestyle='-',label='in')
    plt.plot(Nls,time_method_set_in,marker='o',color='b',linestyle='-',label='set')
    plt.plot(Nls,time_method_bisect,marker='o',color='g',linestyle='-',label='bisect')
    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc = 'upper left')
    plt.show()

As stated by others, in can be very slow for large lists. Here are some comparisons of the performances for in, set and bisect. Note the time (in second) is in log scale.

enter image description here

Code for testing:

import random
import bisect
import matplotlib.pyplot as plt
import math
import time

def method_in(a,b,c):
    start_time = time.time()
    for i,x in enumerate(a):
        if x in b:
            c[i] = 1
    return(time.time()-start_time)   

def method_set_in(a,b,c):
    start_time = time.time()
    s = set(b)
    for i,x in enumerate(a):
        if x in s:
            c[i] = 1
    return(time.time()-start_time)

def method_bisect(a,b,c):
    start_time = time.time()
    b.sort()
    for i,x in enumerate(a):
        index = bisect.bisect_left(b,x)
        if index < len(a):
            if x == b[index]:
                c[i] = 1
    return(time.time()-start_time)

def profile():
    time_method_in = []
    time_method_set_in = []
    time_method_bisect = []

    Nls = [x for x in range(1000,20000,1000)]
    for N in Nls:
        a = [x for x in range(0,N)]
        random.shuffle(a)
        b = [x for x in range(0,N)]
        random.shuffle(b)
        c = [0 for x in range(0,N)]

        time_method_in.append(math.log(method_in(a,b,c)))
        time_method_set_in.append(math.log(method_set_in(a,b,c)))
        time_method_bisect.append(math.log(method_bisect(a,b,c)))

    plt.plot(Nls,time_method_in,marker='o',color='r',linestyle='-',label='in')
    plt.plot(Nls,time_method_set_in,marker='o',color='b',linestyle='-',label='set')
    plt.plot(Nls,time_method_bisect,marker='o',color='g',linestyle='-',label='bisect')
    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc = 'upper left')
    plt.show()

回答 2

您可以将物品放入set。集合查找非常有效。

尝试:

s = set(a)
if 7 in s:
  # do stuff

编辑在注释中,您说您想获取元素的索引。不幸的是,集合没有元素位置的概念。另一种方法是对列表进行预排序,然后在每次需要查找元素时使用二进制搜索

You could put your items into a set. Set lookups are very efficient.

Try:

s = set(a)
if 7 in s:
  # do stuff

edit In a comment you say that you’d like to get the index of the element. Unfortunately, sets have no notion of element position. An alternative is to pre-sort your list and then use binary search every time you need to find an element.


回答 3

def check_availability(element, collection: iter):
    return element in collection

用法

check_availability('a', [1,2,3,4,'a','b','c'])

我相信这是知道所选值是否在数组中的最快方法。

def check_availability(element, collection: iter):
    return element in collection

Usage

check_availability('a', [1,2,3,4,'a','b','c'])

I believe this is the fastest way to know if a chosen value is in an array.


回答 4

a = [4,2,3,1,5,6]

index = dict((y,x) for x,y in enumerate(a))
try:
   a_index = index[7]
except KeyError:
   print "Not found"
else:
   print "found"

如果a不变,这将是一个好主意,因此我们可以做一次dict()部分,然后重复使用它。如果确实发生变化,请提供您正在做的更多详细信息。

a = [4,2,3,1,5,6]

index = dict((y,x) for x,y in enumerate(a))
try:
   a_index = index[7]
except KeyError:
   print "Not found"
else:
   print "found"

This will only be a good idea if a doesn’t change and thus we can do the dict() part once and then use it repeatedly. If a does change, please provide more detail on what you are doing.


回答 5

最初的问题是:

最快的方法是什么才能知道列表中是否存在值(列表中包含数百万个值)及其索引是什么?

因此,有两件事可以找到:

  1. 是列表中的一项,并且
  2. 什么是索引(如果在列表中)。

为此,我修改了@xslittlegrass代码以在所有情况下计算索引,并添加了其他方法。

结果

在此处输入图片说明

方法是:

  1. in-基本上如果b中的x:返回b.index(x)
  2. try–try / catch on b.index(x)(跳过必须检查b中的x)
  3. set-基本上,如果x在set(b)中:返回b.index(x)
  4. bisect-用索引对其b进行排序,对sorted(b)中的x进行二进制搜索。请注意@xslittlegrass的mod,它返回排序后的b中的索引,而不是原始b)
  5. 反向-为b形成反向查找字典d; 然后d [x]提供x的索引。

结果表明,方法5最快。

有趣的是,tryset方法在时间上是等效的。


测试代码

import random
import bisect
import matplotlib.pyplot as plt
import math
import timeit
import itertools

def wrapper(func, *args, **kwargs):
    " Use to produced 0 argument function for call it"
    # Reference https://www.pythoncentral.io/time-a-python-function/
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

def method_in(a,b,c):
    for i,x in enumerate(a):
        if x in b:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_try(a,b,c):
    for i, x in enumerate(a):
        try:
            c[i] = b.index(x)
        except ValueError:
            c[i] = -1

def method_set_in(a,b,c):
    s = set(b)
    for i,x in enumerate(a):
        if x in s:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_bisect(a,b,c):
    " Finds indexes using bisection "

    # Create a sorted b with its index
    bsorted = sorted([(x, i) for i, x in enumerate(b)], key = lambda t: t[0])

    for i,x in enumerate(a):
        index = bisect.bisect_left(bsorted,(x, ))
        c[i] = -1
        if index < len(a):
            if x == bsorted[index][0]:
                c[i] = bsorted[index][1]  # index in the b array

    return c

def method_reverse_lookup(a, b, c):
    reverse_lookup = {x:i for i, x in enumerate(b)}
    for i, x in enumerate(a):
        c[i] = reverse_lookup.get(x, -1)
    return c

def profile():
    Nls = [x for x in range(1000,20000,1000)]
    number_iterations = 10
    methods = [method_in, method_try, method_set_in, method_bisect, method_reverse_lookup]
    time_methods = [[] for _ in range(len(methods))]

    for N in Nls:
        a = [x for x in range(0,N)]
        random.shuffle(a)
        b = [x for x in range(0,N)]
        random.shuffle(b)
        c = [0 for x in range(0,N)]

        for i, func in enumerate(methods):
            wrapped = wrapper(func, a, b, c)
            time_methods[i].append(math.log(timeit.timeit(wrapped, number=number_iterations)))

    markers = itertools.cycle(('o', '+', '.', '>', '2'))
    colors = itertools.cycle(('r', 'b', 'g', 'y', 'c'))
    labels = itertools.cycle(('in', 'try', 'set', 'bisect', 'reverse'))

    for i in range(len(time_methods)):
        plt.plot(Nls,time_methods[i],marker = next(markers),color=next(colors),linestyle='-',label=next(labels))

    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc = 'upper left')
    plt.show()

profile()

The original question was:

What is the fastest way to know if a value exists in a list (a list with millions of values in it) and what its index is?

Thus there are two things to find:

  1. is an item in the list, and
  2. what is the index (if in the list).

Towards this, I modified @xslittlegrass code to compute indexes in all cases, and added an additional method.

Results

enter image description here

Methods are:

  1. in–basically if x in b: return b.index(x)
  2. try–try/catch on b.index(x) (skips having to check if x in b)
  3. set–basically if x in set(b): return b.index(x)
  4. bisect–sort b with its index, binary search for x in sorted(b). Note mod from @xslittlegrass who returns the index in the sorted b, rather than the original b)
  5. reverse–form a reverse lookup dictionary d for b; then d[x] provides the index of x.

Results show that method 5 is the fastest.

Interestingly the try and the set methods are equivalent in time.


Test Code

import random
import bisect
import matplotlib.pyplot as plt
import math
import timeit
import itertools

def wrapper(func, *args, **kwargs):
    " Use to produced 0 argument function for call it"
    # Reference https://www.pythoncentral.io/time-a-python-function/
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

def method_in(a,b,c):
    for i,x in enumerate(a):
        if x in b:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_try(a,b,c):
    for i, x in enumerate(a):
        try:
            c[i] = b.index(x)
        except ValueError:
            c[i] = -1

def method_set_in(a,b,c):
    s = set(b)
    for i,x in enumerate(a):
        if x in s:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_bisect(a,b,c):
    " Finds indexes using bisection "

    # Create a sorted b with its index
    bsorted = sorted([(x, i) for i, x in enumerate(b)], key = lambda t: t[0])

    for i,x in enumerate(a):
        index = bisect.bisect_left(bsorted,(x, ))
        c[i] = -1
        if index < len(a):
            if x == bsorted[index][0]:
                c[i] = bsorted[index][1]  # index in the b array

    return c

def method_reverse_lookup(a, b, c):
    reverse_lookup = {x:i for i, x in enumerate(b)}
    for i, x in enumerate(a):
        c[i] = reverse_lookup.get(x, -1)
    return c

def profile():
    Nls = [x for x in range(1000,20000,1000)]
    number_iterations = 10
    methods = [method_in, method_try, method_set_in, method_bisect, method_reverse_lookup]
    time_methods = [[] for _ in range(len(methods))]

    for N in Nls:
        a = [x for x in range(0,N)]
        random.shuffle(a)
        b = [x for x in range(0,N)]
        random.shuffle(b)
        c = [0 for x in range(0,N)]

        for i, func in enumerate(methods):
            wrapped = wrapper(func, a, b, c)
            time_methods[i].append(math.log(timeit.timeit(wrapped, number=number_iterations)))

    markers = itertools.cycle(('o', '+', '.', '>', '2'))
    colors = itertools.cycle(('r', 'b', 'g', 'y', 'c'))
    labels = itertools.cycle(('in', 'try', 'set', 'bisect', 'reverse'))

    for i in range(len(time_methods)):
        plt.plot(Nls,time_methods[i],marker = next(markers),color=next(colors),linestyle='-',label=next(labels))

    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc = 'upper left')
    plt.show()

profile()

回答 6

听起来您的应用程序可能会受益于使用Bloom Filter数据结构的优势。

简而言之,布隆过滤器查询可以很快告诉您集合中是否绝对没有值。否则,您可以进行较慢的查找,以获取列表中可能存在的值的索引。因此,如果您的应用程序倾向于比“已找到”结果更频繁地获得“未找到”结果,则可以通过添加Bloom Filter来加快速度。

有关详细信息,Wikipedia很好地概述了布隆过滤器的工作方式,并且对“ python布隆过滤器库”的网络搜索将至少提供一些有用的实现。

It sounds like your application might gain advantage from the use of a Bloom Filter data structure.

In short, a bloom filter look-up can tell you very quickly if a value is DEFINITELY NOT present in a set. Otherwise, you can do a slower look-up to get the index of a value that POSSIBLY MIGHT BE in the list. So if your application tends to get the “not found” result much more often then the “found” result, you might see a speed up by adding a Bloom Filter.

For details, Wikipedia provides a good overview of how Bloom Filters work, and a web search for “python bloom filter library” will provide at least a couple useful implementations.


回答 7

请注意,in运算符不仅测试相等性(==),还测试身份(is),s 的in逻辑大致等同于以下内容(它实际上是用C编写的,但不是用Python编写的,至少是用CPython编写的):list

for element in s:
    if element is target:
        # fast check for identity implies equality
        return True
    if element == target:
        # slower check for actual equality
        return True
return False

在大多数情况下,这个细节是无关紧要的,但是在某些情况下,它可能会使Python新手感到惊讶,例如,numpy.NAN具有不等于自身的异常特性:

>>> import numpy
>>> numpy.NAN == numpy.NAN
False
>>> numpy.NAN is numpy.NAN
True
>>> numpy.NAN in [numpy.NAN]
True

要区分这些异常情况,可以使用any()

>>> lst = [numpy.NAN, 1 , 2]
>>> any(element == numpy.NAN for element in lst)
False
>>> any(element is numpy.NAN for element in lst)
True 

注意s 的in逻辑为:listany()

any(element is target or element == target for element in lst)

但是,我要强调的是,这是一个in极端的情况,在绝大多数情况下,运算符都是经过高度优化的,而这正是您想要的(当然是使用a list或使用a set)。

Be aware that the in operator tests not only equality (==) but also identity (is), the in logic for lists is roughly equivalent to the following (it’s actually written in C and not Python though, at least in CPython):

for element in s:
    if element is target:
        # fast check for identity implies equality
        return True
    if element == target:
        # slower check for actual equality
        return True
return False

In most circumstances this detail is irrelevant, but in some circumstances it might leave a Python novice surprised, for example, numpy.NAN has the unusual property of being not being equal to itself:

>>> import numpy
>>> numpy.NAN == numpy.NAN
False
>>> numpy.NAN is numpy.NAN
True
>>> numpy.NAN in [numpy.NAN]
True

To distinguish between these unusual cases you could use any() like:

>>> lst = [numpy.NAN, 1 , 2]
>>> any(element == numpy.NAN for element in lst)
False
>>> any(element is numpy.NAN for element in lst)
True 

Note the in logic for lists with any() would be:

any(element is target or element == target for element in lst)

However, I should emphasize that this is an edge case, and for the vast majority of cases the in operator is highly optimised and exactly what you want of course (either with a list or with a set).


回答 8

或使用__contains__

sequence.__contains__(value)

演示:

>>> l=[1,2,3]
>>> l.__contains__(3)
True
>>> 

Or use __contains__:

sequence.__contains__(value)

Demo:

>>> l=[1,2,3]
>>> l.__contains__(3)
True
>>> 

回答 9

@Winston Ewert的解决方案极大地提高了非常大的列表的速度,但是这个stackoverflow答案表明,如果经常到达除外分支,则try:/ except:/ else:构造将变慢。一种替代方法是利用该.get()方法使用dict:

a = [4,2,3,1,5,6]

index = dict((y, x) for x, y in enumerate(a))

b = index.get(7, None)
if b is not None:
    "Do something with variable b"

.get(key, default)方法仅适用于无法保证键将包含在dict中的情况。如果关键存在,它返回值(将dict[key]),但是,当它不是,.get()返回默认值(在这里None)。在这种情况下,您需要确保所选的默认值不会在中a

@Winston Ewert’s solution yields a big speed-up for very large lists, but this stackoverflow answer indicates that the the try:/except:/else: construct will be slowed down if the except branch is often reached. An alternative is to take advantage of the .get() method for the dict:

a = [4,2,3,1,5,6]

index = dict((y, x) for x, y in enumerate(a))

b = index.get(7, None)
if b is not None:
    "Do something with variable b"

The .get(key, default) method is just for the case when you can’t guarantee a key will be in the dict. If key is present, it returns the value (as would dict[key]), but when it is not, .get() returns your default value (here None). You need to make sure in this case that the chosen default will not be in a.


回答 10

这不是代码,而是用于快速搜索的算法。

如果您的列表和要查找的值都是数字,那么这很简单。如果是字符串:请看底部:

  • -让“ n”为列表的长度
  • -可选步骤:如果需要元素索引:将第二列添加到元素的当前索引(0到n-1)-稍后再说
  • 订购列表或列表的副本(.sort())
  • 依次通过:
    • 将您的数字与列表的第n / 2个元素进行比较
      • 如果更大,则在索引n / 2-n之间再次循环
      • 如果较小,则在索引0-n / 2之间再次循环
      • 如果相同:您找到了
  • 不断缩小列表的范围,直到找到它或只有2个数字(在您要查找的数字的下方和上方)
  • 这将在最多19个步骤中找到1.000.000列表中的任何元素(准确地说是log(2)n)

如果您还需要号码的原始位置,请在第二个索引列中查找。

如果您的列表不是由数字组成的,则该方法仍然有效并且将是最快的,但是您可能需要定义一个可以比较/排序字符串的函数。

当然,这需要sorted()方法的投资,但是如果您继续重复使用相同的列表进行检查,那可能是值得的。

This is not the code, but the algorithm for very fast searching.

If your list and the value you are looking for are all numbers, this is pretty straightforward. If strings: look at the bottom:

  • -Let “n” be the length of your list
  • -Optional step: if you need the index of the element: add a second column to the list with current index of elements (0 to n-1) – see later
  • Order your list or a copy of it (.sort())
  • Loop through:
    • Compare your number to the n/2th element of the list
      • If larger, loop again between indexes n/2-n
      • If smaller, loop again between indexes 0-n/2
      • If the same: you found it
  • Keep narrowing the list until you have found it or only have 2 numbers (below and above the one you are looking for)
  • This will find any element in at most 19 steps for a list of 1.000.000 (log(2)n to be precise)

If you also need the original position of your number, look for it in the second, index column.

If your list is not made of numbers, the method still works and will be fastest, but you may need to define a function which can compare/order strings.

Of course, this needs the investment of the sorted() method, but if you keep reusing the same list for checking, it may be worth it.


回答 11

因为问题不一定总是被理解为最快的技术方法-我总是建议理解/编写最直接的最快方法:列表理解,单线

[i for i in list_from_which_to_search if i in list_to_search_in]

我对list_to_search_in所有项目都拥有一个,并想返回中的项目索引list_from_which_to_search

这将在一个不错的列表中返回索引。

还有其他方法可以解决此问题-但是列表理解速度足够快,并且可以以足够快的速度编写它来解决问题。

Because the question is not always supposed to be understood as the fastest technical way – I always suggest the most straightforward fastest way to understand/write: a list comprehension, one-liner

[i for i in list_from_which_to_search if i in list_to_search_in]

I had a list_to_search_in with all the items, and wanted to return the indexes of the items in the list_from_which_to_search.

This returns the indexes in a nice list.

There are other ways to check this problem – however list comprehensions are quick enough, adding to the fact of writing it quick enough, to solve a problem.


回答 12

对我而言,这是0.030秒(实际),0.026秒(用户)和0.004秒(系统)。

try:
print("Started")
x = ["a", "b", "c", "d", "e", "f"]

i = 0

while i < len(x):
    i += 1
    if x[i] == "e":
        print("Found")
except IndexError:
    pass

For me it was 0.030 sec (real), 0.026 sec (user), and 0.004 sec (sys).

try:
print("Started")
x = ["a", "b", "c", "d", "e", "f"]

i = 0

while i < len(x):
    i += 1
    if x[i] == "e":
        print("Found")
except IndexError:
    pass

回答 13

检查乘积等于k的数组中是否存在两个元素的代码:

n = len(arr1)
for i in arr1:
    if k%i==0:
        print(i)

Code to check whether two elements exist in array whose product equals k:

n = len(arr1)
for i in arr1:
    if k%i==0:
        print(i)

声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。