在熊猫系列中查找元素的索引

问题:在熊猫系列中查找元素的索引

我知道这是一个非常基本的问题,但是由于某种原因我找不到答案。如何获取python pandas中Series某些元素的索引?(第一次出现就足够了)

即,我想要类似的东西:

import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
print myseries.find(7) # should output 3

当然,可以使用循环定义这样的方法:

def find(s, el):
    for i in s.index:
        if s[i] == el: 
            return i
    return None

print find(myseries, 7)

但我认为应该有更好的方法。在那儿?

I know this is a very basic question but for some reason I can’t find an answer. How can I get the index of certain element of a Series in python pandas? (first occurrence would suffice)

I.e., I’d like something like:

import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
print myseries.find(7) # should output 3

Certainly, it is possible to define such a method with a loop:

def find(s, el):
    for i in s.index:
        if s[i] == el: 
            return i
    return None

print find(myseries, 7)

but I assume there should be a better way. Is there?


回答 0

>>> myseries[myseries == 7]
3    7
dtype: int64
>>> myseries[myseries == 7].index[0]
3

尽管我承认应该有一个更好的方法,但这至少避免了迭代和循环遍历对象并将其移至C级别。

>>> myseries[myseries == 7]
3    7
dtype: int64
>>> myseries[myseries == 7].index[0]
3

Though I admit that there should be a better way to do that, but this at least avoids iterating and looping through the object and moves it to the C level.


回答 1

转换为索引,您可以使用 get_loc

In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])

In [3]: Index(myseries).get_loc(7)
Out[3]: 3

In [4]: Index(myseries).get_loc(10)
KeyError: 10

重复处理

In [5]: Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)

如果非连续返回,将返回一个布尔数组

In [6]: Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False,  True, False, False,  True, False], dtype=bool)

内部使用哈希表,速度如此之快

In [7]: s = Series(randint(0,10,10000))

In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 µs per loop

In [12]: i = Index(s)

In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 µs per loop

正如Viktor所指出的那样,创建索引有一次性的创建开销(实际上是在使用索引执行某些操作时产生的开销,例如is_unique

In [2]: s = Series(randint(0,10,10000))

In [3]: %timeit Index(s)
100000 loops, best of 3: 9.6 µs per loop

In [4]: %timeit Index(s).is_unique
10000 loops, best of 3: 140 µs per loop

Converting to an Index, you can use get_loc

In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])

In [3]: Index(myseries).get_loc(7)
Out[3]: 3

In [4]: Index(myseries).get_loc(10)
KeyError: 10

Duplicate handling

In [5]: Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)

Will return a boolean array if non-contiguous returns

In [6]: Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False,  True, False, False,  True, False], dtype=bool)

Uses a hashtable internally, so fast

In [7]: s = Series(randint(0,10,10000))

In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 µs per loop

In [12]: i = Index(s)

In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 µs per loop

As Viktor points out, there is a one-time creation overhead to creating an index (its incurred when you actually DO something with the index, e.g. the is_unique)

In [2]: s = Series(randint(0,10,10000))

In [3]: %timeit Index(s)
100000 loops, best of 3: 9.6 µs per loop

In [4]: %timeit Index(s).is_unique
10000 loops, best of 3: 140 µs per loop

回答 2

In [92]: (myseries==7).argmax()
Out[92]: 3

如果您提前知道7个,则此方法有效。您可以使用(myseries == 7).any()进行检查

另一种方法(非常类似于第一个答案)也占多个7(或全无)的原因是

In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']
In [92]: (myseries==7).argmax()
Out[92]: 3

This works if you know 7 is there in advance. You can check this with (myseries==7).any()

Another approach (very similar to the first answer) that also accounts for multiple 7’s (or none) is

In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']

回答 3

这里的所有答案给我留下了深刻的印象。这不是一个新的答案,只是尝试总结所有这些方法的时间。我考虑了一个由25个元素组成的系列的情况,并假设了一般情况下索引可以包含任何值,并且您希望索引值与该系列末尾的搜索值相对应。

以下是2013年MacBook Pro的Python 3.7和Pandas 0.25.3版的速度测试。

In [1]: import pandas as pd                                                

In [2]: import numpy as np                                                 

In [3]: data = [406400, 203200, 101600,  76100,  50800,  25400,  19050,  12700, 
   ...:          9500,   6700,   4750,   3350,   2360,   1700,   1180,    850, 
   ...:           600,    425,    300,    212,    150,    106,     75,     53, 
   ...:            38]                                                                               

In [4]: myseries = pd.Series(data, index=range(1,26))                                                

In [5]: myseries[21]                                                                                 
Out[5]: 150

In [7]: %timeit myseries[myseries == 150].index[0]                                                   
416 µs ± 5.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: %timeit myseries[myseries == 150].first_valid_index()                                        
585 µs ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [9]: %timeit myseries.where(myseries == 150).first_valid_index()                                  
652 µs ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [10]: %timeit myseries.index[np.where(myseries == 150)[0][0]]                                     
195 µs ± 1.18 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [11]: %timeit pd.Series(myseries.index, index=myseries)[150]                 
178 µs ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [12]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]                                    
77.4 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [13]: %timeit myseries.index[list(myseries).index(150)]
12.7 µs ± 42.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [14]: %timeit myseries.index[myseries.tolist().index(150)]                   
9.46 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@Jeff的答案似乎是最快的-尽管它不处理重复项。

更正:很抱歉,我错过了一个,@ Alex Spangher使用列表索引方法的解决方案是迄今为止最快的。

更新资料:添加了@EliadL的答案。

希望这可以帮助。

如此简单的操作需要如此复杂的解决方案,而且许多解决方案是如此之慢,真令人惊讶。在某些情况下,超过半毫秒才能找到一系列25的值。

I’m impressed with all the answers here. This is not a new answer, just an attempt to summarize the timings of all these methods. I considered the case of a series with 25 elements and assumed the general case where the index could contain any values and you want the index value corresponding to the search value which is towards the end of the series.

Here are the speed tests on a 2013 MacBook Pro in Python 3.7 with Pandas version 0.25.3.

In [1]: import pandas as pd                                                

In [2]: import numpy as np                                                 

In [3]: data = [406400, 203200, 101600,  76100,  50800,  25400,  19050,  12700, 
   ...:          9500,   6700,   4750,   3350,   2360,   1700,   1180,    850, 
   ...:           600,    425,    300,    212,    150,    106,     75,     53, 
   ...:            38]                                                                               

In [4]: myseries = pd.Series(data, index=range(1,26))                                                

In [5]: myseries[21]                                                                                 
Out[5]: 150

In [7]: %timeit myseries[myseries == 150].index[0]                                                   
416 µs ± 5.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: %timeit myseries[myseries == 150].first_valid_index()                                        
585 µs ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [9]: %timeit myseries.where(myseries == 150).first_valid_index()                                  
652 µs ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [10]: %timeit myseries.index[np.where(myseries == 150)[0][0]]                                     
195 µs ± 1.18 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [11]: %timeit pd.Series(myseries.index, index=myseries)[150]                 
178 µs ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [12]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]                                    
77.4 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [13]: %timeit myseries.index[list(myseries).index(150)]
12.7 µs ± 42.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [14]: %timeit myseries.index[myseries.tolist().index(150)]                   
9.46 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@Jeff’s answer seems to be the fastest – although it doesn’t handle duplicates.

Correction: Sorry, I missed one, @Alex Spangher’s solution using the list index method is by far the fastest.

Update: Added @EliadL’s answer.

Hope this helps.

Amazing that such a simple operation requires such convoluted solutions and many are so slow. Over half a millisecond in some cases to find a value in a series of 25.


回答 4

尽管同样不令人满意,但另一种方法是:

s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])

list(s).index(7)

返回:3

使用我正在使用的当前数据集进行时间测试(随机考虑):

[64]:    %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 µs per loop

In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 µs per loop


In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 µs per loop

Another way to do this, although equally unsatisfying is:

s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])

list(s).index(7)

returns: 3

On time tests using a current dataset I’m working with (consider it random):

[64]:    %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 µs per loop

In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 µs per loop


In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 µs per loop

回答 5

如果您使用numpy,则可以获取一个数组,该数组确定了您的值:

import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)

这将返回一个包含元素数组的单元素元组,其中7是myseries中的值:

(array([3], dtype=int64),)

If you use numpy, you can get an array of the indecies that your value is found:

import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)

This returns a one element tuple containing an array of the indecies where 7 is the value in myseries:

(array([3], dtype=int64),)

回答 6

您可以使用Series.idxmax()

>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
>>> 

you can use Series.idxmax()

>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
>>> 

回答 7

尚未提及的另一种实现方法是tolist方法:

myseries.tolist().index(7)

假设该系列中存在该值,则应返回正确的索引。

Another way to do it that hasn’t been mentioned yet is the tolist method:

myseries.tolist().index(7)

should return the correct index, assuming the value exists in the Series.


回答 8

通常,您的价值出现在多个指标上:

>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')

Often your value occurs at multiple indices:

>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')

回答 9

这是我能找到的最原生和可扩展的方法:

>>> myindex = pd.Series(myseries.index, index=myseries)

>>> myindex[7]
3

>>> myindex[[7, 5, 7]]
7    3
5    4
7    3
dtype: int64

This is the most native and scalable approach I could find:

>>> myindex = pd.Series(myseries.index, index=myseries)

>>> myindex[7]
3

>>> myindex[[7, 5, 7]]
7    3
5    4
7    3
dtype: int64