问题:可以按降序使用argsort吗?

考虑以下代码:

avgDists = np.array([1, 8, 6, 9, 4])
ids = avgDists.argsort()[:n]

这给了我n最小元素的索引。是否可以argsort按降序使用它来获得n最高元素的索引?

Consider the following code:

avgDists = np.array([1, 8, 6, 9, 4])
ids = avgDists.argsort()[:n]

This gives me indices of the n smallest elements. Is it possible to use this same argsort in descending order to get the indices of n highest elements?


回答 0

如果对数组求反,则最低的元素变为最高的元素,反之亦然。因此,n最高元素的索引为:

(-avgDists).argsort()[:n]

评论中所述,对此进行推理的另一种方法是观察大元素在argsort 中排在最后。因此,您可以从argsort的末尾读取以找到n最高的元素:

avgDists.argsort()[::-1][:n]

两种方法的时间复杂度均为O(n log n),因为在此argsort调用是主要项。但是第二种方法有一个很好的优点:它将数组的O(n)取反替换为O(1)切片。如果在循环中使用小型数组,则避免这种求反可能会获得一些性能提升;如果使用大型数组,则可以节省内存使用量,因为这种求反会创建整个数组的副本。

请注意,这些方法并不总是给出相等的结果:如果要求稳定的排序实现(argsort例如,通过传递关键字parameter)kind='mergesort',则第一个策略将保留排序稳定性,但是第二个策略将破坏稳定性(即,位置相等)项目将被撤消)。

时间示例:

使用100个浮点和30个尾巴的小阵列,查看方法快大约15%

>>> avgDists = np.random.rand(100)
>>> n = 30
>>> timeit (-avgDists).argsort()[:n]
1.93 µs ± 6.68 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit avgDists.argsort()[::-1][:n]
1.64 µs ± 3.39 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit avgDists.argsort()[-n:][::-1]
1.64 µs ± 3.66 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

对于较大的阵列,argsort占主导地位,并且没有明显的时序差异

>>> avgDists = np.random.rand(1000)
>>> n = 300
>>> timeit (-avgDists).argsort()[:n]
21.9 µs ± 51.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> timeit avgDists.argsort()[::-1][:n]
21.7 µs ± 33.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> timeit avgDists.argsort()[-n:][::-1]
21.9 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

请注意,以下来自nedim的评论不正确。在反转之前还是之后进行截断在效率上没有区别,因为这两个操作都只是以不同的方式遍历数组的视图,而实际上并未复制数据。

If you negate an array, the lowest elements become the highest elements and vice-versa. Therefore, the indices of the n highest elements are:

(-avgDists).argsort()[:n]

Another way to reason about this, as mentioned in the comments, is to observe that the big elements are coming last in the argsort. So, you can read from the tail of the argsort to find the n highest elements:

avgDists.argsort()[::-1][:n]

Both methods are O(n log n) in time complexity, because the argsort call is the dominant term here. But the second approach has a nice advantage: it replaces an O(n) negation of the array with an O(1) slice. If you’re working with small arrays inside loops then you may get some performance gains from avoiding that negation, and if you’re working with huge arrays then you can save on memory usage because the negation creates a copy of the entire array.

Note that these methods do not always give equivalent results: if a stable sort implementation is requested to argsort, e.g. by passing the keyword argument kind='mergesort', then the first strategy will preserve the sorting stability, but the second strategy will break stability (i.e. the positions of equal items will get reversed).

Example timings:

Using a small array of 100 floats and a length 30 tail, the view method was about 15% faster

>>> avgDists = np.random.rand(100)
>>> n = 30
>>> timeit (-avgDists).argsort()[:n]
1.93 µs ± 6.68 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit avgDists.argsort()[::-1][:n]
1.64 µs ± 3.39 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit avgDists.argsort()[-n:][::-1]
1.64 µs ± 3.66 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

For larger arrays, the argsort is dominant and there is no significant timing difference

>>> avgDists = np.random.rand(1000)
>>> n = 300
>>> timeit (-avgDists).argsort()[:n]
21.9 µs ± 51.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> timeit avgDists.argsort()[::-1][:n]
21.7 µs ± 33.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> timeit avgDists.argsort()[-n:][::-1]
21.9 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Please note that the comment from nedim below is incorrect. Whether to truncate before or after reversing makes no difference in efficiency, since both of these operations are only striding a view of the array differently and not actually copying data.


回答 1

就像Python一样,它[::-1]反转了返回的数组argsort()[:n]给出最后n个元素:

>>> avgDists=np.array([1, 8, 6, 9, 4])
>>> n=3
>>> ids = avgDists.argsort()[::-1][:n]
>>> ids
array([3, 1, 2])

这种方法的优点ids是可以看到 avgDists:

>>> ids.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

(“ OWNDATA”为False表示这是一个视图,而不是副本)

另一种方法是这样的:

(-avgDists).argsort()[:n]

问题在于,这种工作方式是为数组中的每个元素创建负数:

>>> (-avgDists)
array([-1, -8, -6, -9, -4])

ANd为此创建了一个副本:

>>> (-avgDists_n).flags['OWNDATA']
True

因此,如果您每次使用非常小的数据集计时:

>>> import timeit
>>> timeit.timeit('(-avgDists).argsort()[:3]', setup="from __main__ import avgDists")
4.2879798610229045
>>> timeit.timeit('avgDists.argsort()[::-1][:3]', setup="from __main__ import avgDists")
2.8372560259886086

查看方法实质上更快(并使用1/2的内存…)

Just like Python, in that [::-1] reverses the array returned by argsort() and [:n] gives that last n elements:

>>> avgDists=np.array([1, 8, 6, 9, 4])
>>> n=3
>>> ids = avgDists.argsort()[::-1][:n]
>>> ids
array([3, 1, 2])

The advantage of this method is that ids is a view of avgDists:

>>> ids.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

(The ‘OWNDATA’ being False indicates this is a view, not a copy)

Another way to do this is something like:

(-avgDists).argsort()[:n]

The problem is that the way this works is to create negative of each element in the array:

>>> (-avgDists)
array([-1, -8, -6, -9, -4])

ANd creates a copy to do so:

>>> (-avgDists_n).flags['OWNDATA']
True

So if you time each, with this very small data set:

>>> import timeit
>>> timeit.timeit('(-avgDists).argsort()[:3]', setup="from __main__ import avgDists")
4.2879798610229045
>>> timeit.timeit('avgDists.argsort()[::-1][:3]', setup="from __main__ import avgDists")
2.8372560259886086

The view method is substantially faster (and uses 1/2 the memory…)


回答 2

使用命令进行排序后,可以使用flip命令numpy.flipud()numpy.fliplr()以降序获取索引argsort。那就是我通常要做的。

You can use the flip commands numpy.flipud() or numpy.fliplr() to get the indexes in descending order after sorting using the argsort command. Thats what I usually do.


回答 3

如果您只需要最低/最高n个元素的索引,则np.argsort可以使用np.argpartition– 来代替使用。

这不需要对整个数组进行排序,而只需要排序所需的部分,但请注意,“分区内的顺序”是未定义的,因此尽管它提供了正确的索引,但它们可能未正确排序:

>>> avgDists = [1, 8, 6, 9, 4]
>>> np.array(avgDists).argpartition(2)[:2]  # indices of lowest 2 items
array([0, 4], dtype=int64)

>>> np.array(avgDists).argpartition(-2)[-2:]  # indices of highest 2 items
array([1, 3], dtype=int64)

Instead of using np.argsort you could use np.argpartition – if you only need the indices of the lowest/highest n elements.

That doesn’t require to sort the whole array but just the part that you need but note that the “order inside your partition” is undefined, so while it gives the correct indices they might not be correctly ordered:

>>> avgDists = [1, 8, 6, 9, 4]
>>> np.array(avgDists).argpartition(2)[:2]  # indices of lowest 2 items
array([0, 4], dtype=int64)

>>> np.array(avgDists).argpartition(-2)[-2:]  # indices of highest 2 items
array([1, 3], dtype=int64)

回答 4

您可以创建数组的副本,然后将每个元素乘以-1。
结果,之前最大的元素将变成最小的元素。
副本中n个最小元素的索引是原件中的n个最大元素。

You could create a copy of the array and then multiply each element with -1.
As an effect the before largest elements would become the smallest.
The indeces of the n smallest elements in the copy are the n greatest elements in the original.


回答 5

就像@Kanmani暗示的那样,可以使用来简化解释numpy.flip,如下所示:

import numpy as np

avgDists = np.array([1, 8, 6, 9, 4])
ids = np.flip(np.argsort(avgDists))
print(ids)

通过使用访问者模式而不是成员函数,可以更轻松地读取操作顺序。

As @Kanmani hinted, an easier to interpret implementation may use numpy.flip, as in the following:

import numpy as np

avgDists = np.array([1, 8, 6, 9, 4])
ids = np.flip(np.argsort(avgDists))
print(ids)

By using the visitor pattern rather than member functions, it is easier to read the order of operations.


回答 6

以您的示例为例:

avgDists = np.array([1, 8, 6, 9, 4])

获得n个最大值的索引:

ids = np.argpartition(avgDists, -n)[-n:]

按降序对它们进行排序:

ids = ids[np.argsort(avgDists[ids])[::-1]]

获得结果(n = 4):

>>> avgDists[ids]
array([9, 8, 6, 4])

With your example:

avgDists = np.array([1, 8, 6, 9, 4])

Obtain indexes of n maximal values:

ids = np.argpartition(avgDists, -n)[-n:]

Sort them in descending order:

ids = ids[np.argsort(avgDists[ids])[::-1]]

Obtain results (for n=4):

>>> avgDists[ids]
array([9, 8, 6, 4])

回答 7

另一种方法是在argsort的参数中仅使用“-”,例如:“ df [np.argsort(-df [:, 0])]”,前提是df是数据帧,并且您想按第一个对它进行排序列(由列号“ 0”表示)。适当更改列名。当然,该列必须是数字列。

Another way is to use only a ‘-‘ in the argument for argsort as in : “df[np.argsort(-df[:, 0])]”, provided df is the dataframe and you want to sort it by the first column (represented by the column number ‘0’). Change the column-name as appropriate. Of course, the column has to be a numeric one.


声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。