问题:首次出现大于当前值的值

我在numpy中有一个1D数组,我想在值超过numpy数组中的值的位置找到索引的位置。

例如

aa = range(-10,10)

查找超出aa值的位置5

I have a 1D array in numpy and I want to find the position of the index where a value exceeds the value in numpy array.

E.g.

aa = range(-10,10)

Find position in aa where, the value 5 gets exceeded.


回答 0

这有点快(看起来更好)

np.argmax(aa>5)

因为将在第一个位置停止True(“如果出现多次最大值,则返回对应于第一次出现的索引。”),并且不会保存其他列表。

In [2]: N = 10000

In [3]: aa = np.arange(-N,N)

In [4]: timeit np.argmax(aa>N/2)
100000 loops, best of 3: 52.3 us per loop

In [5]: timeit np.where(aa>N/2)[0][0]
10000 loops, best of 3: 141 us per loop

In [6]: timeit np.nonzero(aa>N/2)[0][0]
10000 loops, best of 3: 142 us per loop

This is a little faster (and looks nicer)

np.argmax(aa>5)

Since will stop at the first True (“In case of multiple occurrences of the maximum values, the indices corresponding to the first occurrence are returned.”) and doesn’t save another list.

In [2]: N = 10000

In [3]: aa = np.arange(-N,N)

In [4]: timeit np.argmax(aa>N/2)
100000 loops, best of 3: 52.3 us per loop

In [5]: timeit np.where(aa>N/2)[0][0]
10000 loops, best of 3: 141 us per loop

In [6]: timeit np.nonzero(aa>N/2)[0][0]
10000 loops, best of 3: 142 us per loop

回答 1

给定数组的排序内容,还有一个更快的方法:searchsorted

import time
N = 10000
aa = np.arange(-N,N)
%timeit np.searchsorted(aa, N/2)+1
%timeit np.argmax(aa>N/2)
%timeit np.where(aa>N/2)[0][0]
%timeit np.nonzero(aa>N/2)[0][0]

# Output
100000 loops, best of 3: 5.97 µs per loop
10000 loops, best of 3: 46.3 µs per loop
10000 loops, best of 3: 154 µs per loop
10000 loops, best of 3: 154 µs per loop

given the sorted content of your array, there is an even faster method: searchsorted.

import time
N = 10000
aa = np.arange(-N,N)
%timeit np.searchsorted(aa, N/2)+1
%timeit np.argmax(aa>N/2)
%timeit np.where(aa>N/2)[0][0]
%timeit np.nonzero(aa>N/2)[0][0]

# Output
100000 loops, best of 3: 5.97 µs per loop
10000 loops, best of 3: 46.3 µs per loop
10000 loops, best of 3: 154 µs per loop
10000 loops, best of 3: 154 µs per loop

回答 2

我对此也很感兴趣,并且将所有建议的答案与perfplot进行了比较。(免责声明:我是perfplot的作者。)

如果您知道要查找的数组已经排序,则

numpy.searchsorted(a, alpha)

是给你的。这是一个固定时间操作,即,速度也不能依赖于数组的大小。您无法获得比这更快的速度。

如果您对阵列一无所知,那么您不会错

numpy.argmax(a > alpha)

已排序:

在此处输入图片说明

未分类:

在此处输入图片说明

复制剧情的代码:

import numpy
import perfplot


alpha = 0.5

def argmax(data):
    return numpy.argmax(data > alpha)

def where(data):
    return numpy.where(data > alpha)[0][0]

def nonzero(data):
    return numpy.nonzero(data > alpha)[0][0]

def searchsorted(data):
    return numpy.searchsorted(data, alpha)

out = perfplot.show(
    # setup=numpy.random.rand,
    setup=lambda n: numpy.sort(numpy.random.rand(n)),
    kernels=[
        argmax, where,
        nonzero,
        searchsorted
        ],
    n_range=[2**k for k in range(2, 20)],
    logx=True,
    logy=True,
    xlabel='len(array)'
    )

I was also interested in this and I’ve compared all the suggested answers with perfplot. (Disclaimer: I’m the author of perfplot.)

If you know that the array you’re looking through is already sorted, then

numpy.searchsorted(a, alpha)

is for you. It’s O(log(n)) operation, i.e., the speed hardly depends on the size of the array. You can’t get faster than that.

If you don’t know anything about your array, you’re not going wrong with

numpy.argmax(a > alpha)

Already sorted:

enter image description here

Unsorted:

enter image description here

Code to reproduce the plot:

import numpy
import perfplot


alpha = 0.5
numpy.random.seed(0)


def argmax(data):
    return numpy.argmax(data > alpha)


def where(data):
    return numpy.where(data > alpha)[0][0]


def nonzero(data):
    return numpy.nonzero(data > alpha)[0][0]


def searchsorted(data):
    return numpy.searchsorted(data, alpha)


perfplot.save(
    "out.png",
    # setup=numpy.random.rand,
    setup=lambda n: numpy.sort(numpy.random.rand(n)),
    kernels=[argmax, where, nonzero, searchsorted],
    n_range=[2 ** k for k in range(2, 23)],
    xlabel="len(array)",
)

回答 3

In [34]: a=np.arange(-10,10)

In [35]: a
Out[35]:
array([-10,  -9,  -8,  -7,  -6,  -5,  -4,  -3,  -2,  -1,   0,   1,   2,
         3,   4,   5,   6,   7,   8,   9])

In [36]: np.where(a>5)
Out[36]: (array([16, 17, 18, 19]),)

In [37]: np.where(a>5)[0][0]
Out[37]: 16
In [34]: a=np.arange(-10,10)

In [35]: a
Out[35]:
array([-10,  -9,  -8,  -7,  -6,  -5,  -4,  -3,  -2,  -1,   0,   1,   2,
         3,   4,   5,   6,   7,   8,   9])

In [36]: np.where(a>5)
Out[36]: (array([16, 17, 18, 19]),)

In [37]: np.where(a>5)[0][0]
Out[37]: 16

回答 4

元素之间步长恒定的数组

如果是range数组或任何其他线性增加的数组,则可以简单地以编程方式计算索引,而根本不需要实际遍历数组:

def first_index_calculate_range_like(val, arr):
    if len(arr) == 0:
        raise ValueError('no value greater than {}'.format(val))
    elif len(arr) == 1:
        if arr[0] > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    first_value = arr[0]
    step = arr[1] - first_value
    # For linearly decreasing arrays or constant arrays we only need to check
    # the first element, because if that does not satisfy the condition
    # no other element will.
    if step <= 0:
        if first_value > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    calculated_position = (val - first_value) / step

    if calculated_position < 0:
        return 0
    elif calculated_position > len(arr) - 1:
        raise ValueError('no value greater than {}'.format(val))

    return int(calculated_position) + 1

一个人可能可以改善这一点。我已经确保它可以正确地用于一些示例数组和值,但这并不意味着在那里不会有错误,特别是考虑到它使用浮点数…

>>> import numpy as np
>>> first_index_calculate_range_like(5, np.arange(-10, 10))
16
>>> np.arange(-10, 10)[16]  # double check
6

>>> first_index_calculate_range_like(4.8, np.arange(-10, 10))
15

假设它无需任何迭代就可以计算位置,那么它将是恒定时间(O(1)),并且可能击败所有其他提到的方法。但是,它需要数组中的恒定步长,否则将产生错误的结果。

使用numba的一般解决方案

更通用的方法是使用numba函数:

@nb.njit
def first_index_numba(val, arr):
    for idx in range(len(arr)):
        if arr[idx] > val:
            return idx
    return -1

这将适用于任何数组,但必须迭代该数组,因此在一般情况下将是O(n)

>>> first_index_numba(4.8, np.arange(-10, 10))
15
>>> first_index_numba(5, np.arange(-10, 10))
16

基准测试

尽管NicoSchlömer已经提供了一些基准,但我认为包括我的新解决方案并测试不同的“值”可能很有用。

测试设置:

import numpy as np
import math
import numba as nb

def first_index_using_argmax(val, arr):
    return np.argmax(arr > val)

def first_index_using_where(val, arr):
    return np.where(arr > val)[0][0]

def first_index_using_nonzero(val, arr):
    return np.nonzero(arr > val)[0][0]

def first_index_using_searchsorted(val, arr):
    return np.searchsorted(arr, val) + 1

def first_index_using_min(val, arr):
    return np.min(np.where(arr > val))

def first_index_calculate_range_like(val, arr):
    if len(arr) == 0:
        raise ValueError('empty array')
    elif len(arr) == 1:
        if arr[0] > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    first_value = arr[0]
    step = arr[1] - first_value
    if step <= 0:
        if first_value > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    calculated_position = (val - first_value) / step

    if calculated_position < 0:
        return 0
    elif calculated_position > len(arr) - 1:
        raise ValueError('no value greater than {}'.format(val))

    return int(calculated_position) + 1

@nb.njit
def first_index_numba(val, arr):
    for idx in range(len(arr)):
        if arr[idx] > val:
            return idx
    return -1

funcs = [
    first_index_using_argmax, 
    first_index_using_min, 
    first_index_using_nonzero,
    first_index_calculate_range_like, 
    first_index_numba, 
    first_index_using_searchsorted, 
    first_index_using_where
]

from simple_benchmark import benchmark, MultiArgument

并使用以下方式生成图:

%matplotlib notebook
b.plot()

项目在开始

b = benchmark(
    funcs,
    {2**i: MultiArgument([0, np.arange(2**i)]) for i in range(2, 20)},
    argument_name="array size")

在此处输入图片说明

numba函数执行效果最佳,其次是calculate-function和searchsorted函数。其他解决方案的性能要差得多。

项目在末尾

b = benchmark(
    funcs,
    {2**i: MultiArgument([2**i-2, np.arange(2**i)]) for i in range(2, 20)},
    argument_name="array size")

在此处输入图片说明

对于较小的数组,numba函数的执行速度非常快,但是对于较大的数组,它的性能要优于calculate-function和searchsorted函数。

项目在sqrt(len)

b = benchmark(
    funcs,
    {2**i: MultiArgument([np.sqrt(2**i), np.arange(2**i)]) for i in range(2, 20)},
    argument_name="array size")

在此处输入图片说明

这更有趣。numba和calculate函数再次表现出色,但这实际上触发了最糟糕的搜索排序情况,在这种情况下确实不能很好地工作。

没有值满足条件时的功能比较

另一个有趣的一点是,如果没有值应返回其索引,则这些函数的行为:

arr = np.ones(100)
value = 2

for func in funcs:
    print(func.__name__)
    try:
        print('-->', func(value, arr))
    except Exception as e:
        print('-->', e)

结果如下:

first_index_using_argmax
--> 0
first_index_using_min
--> zero-size array to reduction operation minimum which has no identity
first_index_using_nonzero
--> index 0 is out of bounds for axis 0 with size 0
first_index_calculate_range_like
--> no value greater than 2
first_index_numba
--> -1
first_index_using_searchsorted
--> 101
first_index_using_where
--> index 0 is out of bounds for axis 0 with size 0

Searchsorted,argmax和numba只会返回错误的值。但是searchsortednumba返回的索引不是该数组的有效索引。

功能whereminnonzerocalculate抛出一个异常。但是,只有exceptionscalculate实际上可以说明任何帮助。

这意味着实际上必须将这些调用包装在一个适当的包装函数中,该函数可以捕获异常或无效的返回值并适当地进行处理,至少在不确定该值是否可以在数组中的情况下。


注意:计算和searchsorted选项仅在特殊条件下起作用。“计算”功能需要一个恒定的步骤,而searchsorted需要对数组进行排序。因此,这些方法在适当的情况下可能很有用,但不是解决此问题的一般方法。如果要处理排序的 Python列表,则可能需要查看bisect模块,而不是使用Numpys searchsorted。

Arrays that have a constant step between elements

In case of a range or any other linearly increasing array you can simply calculate the index programmatically, no need to actually iterate over the array at all:

def first_index_calculate_range_like(val, arr):
    if len(arr) == 0:
        raise ValueError('no value greater than {}'.format(val))
    elif len(arr) == 1:
        if arr[0] > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    first_value = arr[0]
    step = arr[1] - first_value
    # For linearly decreasing arrays or constant arrays we only need to check
    # the first element, because if that does not satisfy the condition
    # no other element will.
    if step <= 0:
        if first_value > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    calculated_position = (val - first_value) / step

    if calculated_position < 0:
        return 0
    elif calculated_position > len(arr) - 1:
        raise ValueError('no value greater than {}'.format(val))

    return int(calculated_position) + 1

One could probably improve that a bit. I have made sure it works correctly for a few sample arrays and values but that doesn’t mean there couldn’t be mistakes in there, especially considering that it uses floats…

>>> import numpy as np
>>> first_index_calculate_range_like(5, np.arange(-10, 10))
16
>>> np.arange(-10, 10)[16]  # double check
6

>>> first_index_calculate_range_like(4.8, np.arange(-10, 10))
15

Given that it can calculate the position without any iteration it will be constant time (O(1)) and can probably beat all other mentioned approaches. However it requires a constant step in the array, otherwise it will produce wrong results.

General solution using numba

A more general approach would be using a numba function:

@nb.njit
def first_index_numba(val, arr):
    for idx in range(len(arr)):
        if arr[idx] > val:
            return idx
    return -1

That will work for any array but it has to iterate over the array, so in the average case it will be O(n):

>>> first_index_numba(4.8, np.arange(-10, 10))
15
>>> first_index_numba(5, np.arange(-10, 10))
16

Benchmark

Even though Nico Schlömer already provided some benchmarks I thought it might be useful to include my new solutions and to test for different “values”.

The test setup:

import numpy as np
import math
import numba as nb

def first_index_using_argmax(val, arr):
    return np.argmax(arr > val)

def first_index_using_where(val, arr):
    return np.where(arr > val)[0][0]

def first_index_using_nonzero(val, arr):
    return np.nonzero(arr > val)[0][0]

def first_index_using_searchsorted(val, arr):
    return np.searchsorted(arr, val) + 1

def first_index_using_min(val, arr):
    return np.min(np.where(arr > val))

def first_index_calculate_range_like(val, arr):
    if len(arr) == 0:
        raise ValueError('empty array')
    elif len(arr) == 1:
        if arr[0] > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    first_value = arr[0]
    step = arr[1] - first_value
    if step <= 0:
        if first_value > val:
            return 0
        else:
            raise ValueError('no value greater than {}'.format(val))

    calculated_position = (val - first_value) / step

    if calculated_position < 0:
        return 0
    elif calculated_position > len(arr) - 1:
        raise ValueError('no value greater than {}'.format(val))

    return int(calculated_position) + 1

@nb.njit
def first_index_numba(val, arr):
    for idx in range(len(arr)):
        if arr[idx] > val:
            return idx
    return -1

funcs = [
    first_index_using_argmax, 
    first_index_using_min, 
    first_index_using_nonzero,
    first_index_calculate_range_like, 
    first_index_numba, 
    first_index_using_searchsorted, 
    first_index_using_where
]

from simple_benchmark import benchmark, MultiArgument

and the plots were generated using:

%matplotlib notebook
b.plot()

item is at the beginning

b = benchmark(
    funcs,
    {2**i: MultiArgument([0, np.arange(2**i)]) for i in range(2, 20)},
    argument_name="array size")

enter image description here

The numba function performs best followed by the calculate-function and the searchsorted function. The other solutions perform much worse.

item is at the end

b = benchmark(
    funcs,
    {2**i: MultiArgument([2**i-2, np.arange(2**i)]) for i in range(2, 20)},
    argument_name="array size")

enter image description here

For small arrays the numba function performs amazingly fast, however for bigger arrays it’s outperformed by the calculate-function and the searchsorted function.

item is at sqrt(len)

b = benchmark(
    funcs,
    {2**i: MultiArgument([np.sqrt(2**i), np.arange(2**i)]) for i in range(2, 20)},
    argument_name="array size")

enter image description here

This is more interesting. Again numba and the calculate function perform great, however this is actually triggering the worst case of searchsorted which really doesn’t work well in this case.

Comparison of the functions when no value satisfies the condition

Another interesting point is how these function behave if there is no value whose index should be returned:

arr = np.ones(100)
value = 2

for func in funcs:
    print(func.__name__)
    try:
        print('-->', func(value, arr))
    except Exception as e:
        print('-->', e)

With this result:

first_index_using_argmax
--> 0
first_index_using_min
--> zero-size array to reduction operation minimum which has no identity
first_index_using_nonzero
--> index 0 is out of bounds for axis 0 with size 0
first_index_calculate_range_like
--> no value greater than 2
first_index_numba
--> -1
first_index_using_searchsorted
--> 101
first_index_using_where
--> index 0 is out of bounds for axis 0 with size 0

Searchsorted, argmax, and numba simply return a wrong value. However searchsorted and numba return an index that is not a valid index for the array.

The functions where, min, nonzero and calculate throw an exception. However only the exception for calculate actually says anything helpful.

That means one actually has to wrap these calls in an appropriate wrapper function that catches exceptions or invalid return values and handle appropriately, at least if you aren’t sure if the value could be in the array.


Note: The calculate and searchsorted options only work in special conditions. The “calculate” function requires a constant step and the searchsorted requires the array to be sorted. So these could be useful in the right circumstances but aren’t general solutions for this problem. In case you’re dealing with sorted Python lists you might want to take a look at the bisect module instead of using Numpys searchsorted.


回答 5

我想提出

np.min(np.append(np.where(aa>5)[0],np.inf))

这将返回满足条件的最小索引,如果从不满足条件,则返回无穷大(并where返回一个空数组)。

I’d like to propose

np.min(np.append(np.where(aa>5)[0],np.inf))

This will return the smallest index where the condition is met, while returning infinity if the condition is never met (and where returns an empty array).


回答 6

我会去

i = np.min(np.where(V >= x))

其中Vvector是向量(一维数组),x是值,i是结果索引。

I would go with

i = np.min(np.where(V >= x))

where V is vector (1d array), x is the value and i is the resulting index.


声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。