Rank items in an array using Python/NumPy, without sorting the array twice

Question 1

I have an array of numbers and I’d like to create another array that represents the rank of each item in the first array. I’m using Python and NumPy.

For example:

array = [4,2,7,1]
ranks = [2,1,3,0]

Here’s the best method I’ve come up with:

array = numpy.array([4,2,7,1])
temp = array.argsort()
ranks = numpy.arange(len(array))[temp.argsort()]

Are there any better/faster methods that avoid sorting the array twice?

Question 2

Use advanced (integer-array) indexing on the left-hand side in the last step:

import numpy

array = numpy.array([4,2,7,1])
temp = array.argsort()                   # temp[k] is the index of the k-th smallest item
ranks = numpy.empty_like(temp)
ranks[temp] = numpy.arange(len(array))   # the item at temp[k] gets rank k

This avoids sorting twice by inverting the permutation in the last step.
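
As a quick sanity check, the sketch below wraps the inverse-permutation trick in a helper and compares it with the double-argsort result on the example array. The function name `rank_by_inversion` is made up for this illustration.

```python
import numpy as np

def rank_by_inversion(values):
    """Return 0-based ranks using one argsort plus an inverse permutation."""
    values = np.asarray(values)
    order = values.argsort()                # order[k] = index of the k-th smallest item
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(values))   # the item at order[k] gets rank k
    return ranks

array = np.array([4, 2, 7, 1])
ranks = rank_by_inversion(array)
print(ranks)  # [2 1 3 0]
```

The assignment touches each element once, so the only sort left is the single argsort call.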

Question 3

Use argsort twice, first to obtain the order of the array, then to obtain ranking:

import numpy

array = numpy.array([4,2,7,1])
order = array.argsort()    # indices that would sort the array
ranks = order.argsort()    # position of each item in the sorted order

When dealing with 2D (or higher dimensional) arrays, be sure to pass an axis argument to argsort to order over the correct axis.
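
A small sketch of the axis point, using a made-up 2D array: the same axis has to be passed to both argsort calls, and the result differs depending on whether rows or columns are ranked.

```python
import numpy as np

A = np.array([[4, 2, 7, 1],
              [10, 30, 20, 0]])

# Rank within each row: both argsort calls use axis=1.
row_ranks = A.argsort(axis=1).argsort(axis=1)
# Rank within each column instead: both calls use axis=0.
col_ranks = A.argsort(axis=0).argsort(axis=0)

print(row_ranks)  # rows: [2 1 3 0] and [1 3 2 0]
print(col_ranks)  # rows: [0 0 0 1] and [1 1 1 0]
```

argsort defaults to the last axis, so omitting the argument on a 2D array ranks within rows.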

Question 4

This question is a few years old, and the accepted answer is great, but I think the following is still worth mentioning. If you don’t mind the dependency on scipy, you can use scipy.stats.rankdata:

In [22]: from scipy.stats import rankdata

In [23]: a = [4, 2, 7, 1]

In [24]: rankdata(a)
Out[24]: array([ 3.,  2.,  4.,  1.])

In [25]: (rankdata(a) - 1).astype(int)
Out[25]: array([2, 1, 3, 0])

A nice feature of rankdata is that the method argument provides several options for handling ties. For example, there are three occurrences of 20 and two occurrences of 40 in b:

In [26]: b = [40, 20, 70, 10, 20, 50, 30, 40, 20]

The default assigns the average rank to the tied values:

In [27]: rankdata(b)
Out[27]: array([ 6.5,  3. ,  9. ,  1. ,  3. ,  8. ,  5. ,  6.5,  3. ])

method='ordinal' assigns consecutive ranks:

In [28]: rankdata(b, method='ordinal')
Out[28]: array([6, 2, 9, 1, 3, 8, 5, 7, 4])

method='min' assigns the minimum rank of the tied values to all the tied values:

In [29]: rankdata(b, method='min')
Out[29]: array([6, 2, 9, 1, 2, 8, 5, 6, 2])

See the docstring for more options.
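
If the scipy dependency is a problem, two of the tie-handling behaviours can be approximated in plain NumPy. This is only a sketch: it reproduces the 'min' output shown above and a 'dense'-style ranking (tied values share a rank and no ranks are skipped), and it is not how rankdata is implemented.

```python
import numpy as np

b = np.array([40, 20, 70, 10, 20, 50, 30, 40, 20])

# 'min'-style ranks: 1 + position of each value's first occurrence in sorted order.
min_ranks = np.searchsorted(np.sort(b), b, side='left') + 1

# 'dense'-style ranks: 1 + index of each value among the distinct sorted values.
_, inverse = np.unique(b, return_inverse=True)
dense_ranks = inverse + 1

print(min_ranks)    # [6 2 9 1 2 8 5 6 2]
print(dense_ranks)  # [4 2 6 1 2 5 3 4 2]
```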

Question 5

I tried to extend both solutions to arrays A with more than one dimension, assuming the array is processed row by row (axis=1).

I extended the first code with a loop over rows; it can probably be improved:

import numpy as np

# A is the 2D input array, ranked row by row
temp = A.argsort(axis=1)
rank = np.empty_like(temp)
rangeA = np.arange(temp.shape[1])
for iRow in range(temp.shape[0]):
    rank[iRow, temp[iRow, :]] = rangeA

And the second one, following k.rooijers's suggestion, becomes:

temp = A.argsort(axis=1)
rank = temp.argsort(axis=1)

I randomly generated 400 arrays with shape (1000,100); the first code took about 7.5, the second one 3.8.
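
The row loop in the first version can probably be removed with advanced indexing. The sketch below is one way to do it, on a small random array chosen only for illustration; it has not been timed against the two versions above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 4))                      # small illustrative input

temp = A.argsort(axis=1)
rank = np.empty_like(temp)
rows = np.arange(temp.shape[0])[:, None]    # column vector of row indices
# rows broadcasts against temp, so rank[r, temp[r, k]] = k for every row r.
rank[rows, temp] = np.arange(temp.shape[1])
```

The right-hand side has shape (ncols,) and is broadcast across all rows in a single assignment.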

Question 6

For a vectorized version of an averaged rank, see below. I love np.unique; it really widens the scope of what can be efficiently vectorized. Aside from avoiding Python for-loops, this approach also avoids the implicit double loop over ‘a’.

import numpy as np

a = np.array([4,2,7,2,1])
rank = a.argsort().argsort()

unique, inverse = np.unique(a, return_inverse = True)

# Sum the ordinal ranks and count the occurrences of each distinct value
unique_rank_sum = np.zeros_like(unique)
np.add.at(unique_rank_sum, inverse, rank)
unique_count = np.zeros_like(unique)
np.add.at(unique_count, inverse, 1)

# Average rank per distinct value
unique_rank_mean = unique_rank_sum.astype(float) / unique_count

# Map the averages back onto the original positions
rank_mean = unique_rank_mean[inverse]

print(rank_mean)
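
The two np.add.at accumulations can also be written with np.bincount, which sums weights per group. The following is a sketch of that variant on the same input, not a claim that it is faster.

```python
import numpy as np

a = np.array([4, 2, 7, 2, 1])
rank = a.argsort().argsort()

_, inverse = np.unique(a, return_inverse=True)
rank_sum = np.bincount(inverse, weights=rank)   # float sum of ranks per distinct value
count = np.bincount(inverse)                    # occurrences of each distinct value
rank_mean = (rank_sum / count)[inverse]

print(rank_mean)  # [3.  1.5 4.  1.5 0. ]
```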

Question 7

Apart from the elegance and shortness of solutions, there is also the question of performance. Here is a little benchmark:

import numpy as np
from scipy.stats import rankdata
l = list(reversed(range(1000)))

%%timeit -n10000 -r5
x = (rankdata(l) - 1).astype(int)
>>> 128 µs ± 2.72 µs per loop (mean ± std. dev. of 5 runs, 10000 loops each)

%%timeit -n10000 -r5
a = np.array(l)
r = a.argsort().argsort()
>>> 69.1 µs ± 464 ns per loop (mean ± std. dev. of 5 runs, 10000 loops each)

%%timeit -n10000 -r5
a = np.array(l)
temp = a.argsort()
r = np.empty_like(temp)
r[temp] = np.arange(len(a))
>>> 63.7 µs ± 1.27 µs per loop (mean ± std. dev. of 5 runs, 10000 loops each)
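
Outside IPython, a similar comparison can be run with the standard timeit module. The sketch below covers only the two NumPy variants; the function names are made up here, and the numbers it prints will depend on the machine and NumPy version.

```python
import timeit
import numpy as np

l = list(reversed(range(1000)))

def double_argsort():
    a = np.array(l)
    return a.argsort().argsort()

def inverse_permutation():
    a = np.array(l)
    temp = a.argsort()
    r = np.empty_like(temp)
    r[temp] = np.arange(len(a))
    return r

loops = 1000
for fn in (double_argsort, inverse_permutation):
    seconds = timeit.timeit(fn, number=loops)
    print(f"{fn.__name__}: {seconds / loops * 1e6:.1f} µs per loop")
```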

Question 8

Using argsort() twice will do it:

>>> array = [4,2,7,1]
>>> ranks = numpy.array(array).argsort().argsort()
>>> ranks
array([2, 1, 3, 0])

Question 9

I tried the above methods, but they failed because my data had many zeros. Yes, even with floats, duplicate items can matter.

So I wrote a modified 1D solution by adding a tie-checking step:

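import numpy as np

def ranks(v):
    t = np.argsort(v)
    r = np.empty(len(v), int)
    r[t] = np.arange(len(v))
    # Tie check: an item equal to its predecessor in sorted order shares its rank
    for i in range(1, len(r)):
        if v[t[i]] <= v[t[i-1]]:
            r[t[i]] = r[t[i-1]]
    return r

# test it (v is a made-up sample input with repeated zeros)
v = [0.0, 2.5, 0.0, 1.0, 0.0]
print(sorted(zip(ranks(v), v)))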

I believe it’s as efficient as it can be.

Question 10

I liked the method by k.rooijers, but as rcoup wrote, repeated numbers are ranked according to array position. This was no good for me, so I modified the version to postprocess the ranks and merge any repeated numbers into a combined average rank:

import numpy as np
a = np.array([4,2,7,2,1])
r = np.array(a.argsort().argsort(), dtype=float)
f = a==a  # all True; marks items not yet processed
for i in range(len(a)):
    if not f[i]: continue
    s = a == a[i]  # mask of every item equal to a[i]
    ls = np.sum(s)
    if ls > 1:
        tr = np.sum(r[s])
        r[s] = float(tr)/ls  # replace tied ranks with their average
    f[s] = False

print(r)  # [3.  1.5 4.  1.5 0. ]

I hope this helps others too. I tried to find another solution to this but couldn’t find any.

Question 11

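argsort and indexing are inverse operations: indexing with order sorts the array, and assigning through order undoes the sort.

Index with order once instead of calling argsort a second time, since indexing is faster than argsort. Do the indexing on the left-hand side of an assignment: np.arange(n)[order][order] evaluates to order[order], which equals the ranks for this example but not for every permutation.

import numpy as np

array = np.array([4,2,7,1])
order = array.argsort()
ranks = np.empty_like(order)
ranks[order] = np.arange(array.shape[0])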
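
To see why the choice of test input matters, the sketch below compares indexing with order twice against the inverse permutation on a different array, picked here only for illustration. For [4,2,7,1] the two happen to agree; for this input they do not.

```python
import numpy as np

array = np.array([1, 2, 3, 0])
order = array.argsort()                                   # [3 0 1 2]

twice_indexed = np.arange(array.shape[0])[order][order]   # same as order[order]
ranks = np.empty_like(order)
ranks[order] = np.arange(array.shape[0])                  # inverse permutation of order

print(twice_indexed)  # [2 3 0 1]
print(ranks)          # [1 2 3 0]
```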

Question 12

More general version of one of the answers:

In [140]: x = np.random.randn(10, 3)

In [141]: i = np.argsort(x, axis=0)

In [142]: ranks = np.empty_like(i)

In [143]: np.put_along_axis(ranks, i, np.repeat(np.arange(x.shape[0])[:,None], x.shape[1], axis=1), axis=0)

See "How to use numpy.argsort() as indices in more than 2 dimensions?" to generalize to more dimensions.
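
One possible generalization to any number of dimensions is sketched below. It relies on np.put_along_axis broadcasting the values against the index array; the helper name `rank_along_axis` is hypothetical.

```python
import numpy as np

def rank_along_axis(x, axis=-1):
    """0-based ranks of x along one axis (ties broken by argsort's ordering)."""
    x = np.asarray(x)
    order = np.argsort(x, axis=axis)
    ranks = np.empty_like(order)
    # arange reshaped so that it broadcasts along every axis except `axis`.
    shape = [1] * x.ndim
    shape[axis] = x.shape[axis]
    positions = np.arange(x.shape[axis]).reshape(shape)
    np.put_along_axis(ranks, order, positions, axis=axis)
    return ranks

x = np.random.default_rng(0).standard_normal((4, 3, 5))
ranks = rank_along_axis(x, axis=1)
```

Reshaping the arange avoids building the repeated index matrix explicitly, at the cost of depending on the broadcasting behaviour of put_along_axis.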
