Python: List vs Dict for look up table

Question: Python: List vs Dict for look up table

I have about 10 million values that I need to put in some type of look-up table, so I was wondering which would be more efficient: a list or a dict?

I know you can do something like this for both:

if something in dict_of_stuff:
    pass

and

if something in list_of_stuff:
    pass

My thought is the dict will be faster and more efficient.

Thanks for your help.

EDIT 1
A little more info on what I’m trying to do: Euler Problem 92. I’m making a look-up table to see if a value has already been calculated.

EDIT 2
Efficiency for look-up.

EDIT 3
There are no values associated with the value… so would a set be better?


Answer 0

Speed

Lookups in lists are O(n); lookups in dictionaries are amortized O(1), with respect to the number of items in the data structure. If you don’t need to associate values, use sets.

Memory

Both dictionaries and sets use hashing, and they use much more memory than is needed for object storage alone. According to A.M. Kuchling in Beautiful Code, the implementation tries to keep the hash table 2/3 full, so you may waste quite a bit of memory.

If you do not add new entries on the fly (which you do, based on your updated question), it may be worthwhile to sort the list and use binary search. This is O(log n), and it is likely to be slower for strings and impossible for objects that do not have a natural ordering.
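
For illustration, a minimal sketch of that sorted-list-plus-binary-search approach, using the standard library’s bisect module (contains_sorted is an illustrative name, not something from this answer):

from bisect import bisect_left

def contains_sorted(sorted_list, value):
    # bisect_left finds the leftmost insertion point in O(log n);
    # the value is present only if it actually sits at that position.
    i = bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value

lookup = sorted([3, 1, 4, 1, 5, 9, 2, 6])
print(contains_sorted(lookup, 5))  # True
print(contains_sorted(lookup, 7))  # False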


Answer 1

A dict is a hash table, so finding keys is really fast. So between a dict and a list, the dict would be faster. But if you don’t have a value to associate, it is even better to use a set. It is a hash table, without the “table” part.


EDIT: for your new question, YES, a set would be better. Just create two sets, one for sequences ending in 1 and the other for sequences ending in 89. I have successfully solved this problem using sets.
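
For illustration, here is a minimal sketch of that two-set caching approach (the names next_term, ends_in_1 and ends_in_89 are my own, and this is one way to do it, not necessarily how this answer’s author solved it):

def next_term(n):
    # Sum of the squares of the digits of n.
    return sum(int(d) ** 2 for d in str(n))

ends_in_1, ends_in_89 = {1}, {89}
count = 0
for start in range(1, 10000000):
    chain = []
    n = start
    # Walk the chain until we reach a number whose fate is already cached.
    while n not in ends_in_1 and n not in ends_in_89:
        chain.append(n)
        n = next_term(n)
    # Every number on the chain shares that fate, so cache all of them.
    if n in ends_in_89:
        ends_in_89.update(chain)
        count += 1
    else:
        ends_in_1.update(chain)
print(count)  # how many starting numbers below 10 million arrive at 89

Caching every value this way is simple but memory-hungry; since the first step already maps any value below 10 million to at most 567, caching only values up to 567 would be far leaner.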


Answer 2

set() is exactly what you want. O(1) lookups, and smaller than a dict.
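
A rough way to sanity-check the size claim in CPython (sys.getsizeof measures only the container itself, not the stored objects, and the exact numbers vary by Python version):

import sys

keys = range(1000)
print(sys.getsizeof(set(keys)))            # the set container...
print(sys.getsizeof(dict.fromkeys(keys)))  # ...is smaller than the equivalent dict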


Answer 3

I did some benchmarking, and it turns out that dict is faster than both list and set for large data sets (running Python 2.7.3 on an i7 CPU on Linux):

  • python -mtimeit -s 'd=range(10**7)' '5*10**6 in d'

    10 loops, best of 3: 64.2 msec per loop

  • python -mtimeit -s 'd=dict.fromkeys(range(10**7))' '5*10**6 in d'

    10000000 loops, best of 3: 0.0759 usec per loop

  • python -mtimeit -s 'from sets import Set; d=Set(range(10**7))' '5*10**6 in d'

    1000000 loops, best of 3: 0.262 usec per loop

As you can see, dict is considerably faster than list and about 3 times faster than set. In some applications you might still want to choose a set for the beauty of it, though. And if the data sets are really small (< 1000 elements), lists perform pretty well.
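
Note that the sets module is Python 2 only (it was removed in Python 3); on Python 3 the equivalent commands would presumably be the following, though absolute timings will of course differ by machine and version:

python -m timeit -s 'd=list(range(10**7))' '5*10**6 in d'
python -m timeit -s 'd=dict.fromkeys(range(10**7))' '5*10**6 in d'
python -m timeit -s 'd=set(range(10**7))' '5*10**6 in d'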


Answer 4

You want a dict.

For (unsorted) lists in Python, the “in” operation requires O(n) time—not good when you have a large amount of data. A dict, on the other hand, is a hash table, so you can expect O(1) lookup time.

As others have noted, you might choose a set (a special type of dict) instead, if you only have keys rather than key/value pairs.

Related:

  • Python wiki: information on the time complexity of Python container operations.
  • SO: Python container operation time and memory complexities

Answer 5

If the data are unique, set() will be the most efficient; but of the two you asked about, dict (which also requires uniqueness, oops :)


Answer 6

A new set of tests to show that @EriF89 is still right after all these years:

$ python -m timeit -s "l={k:k for k in xrange(5000)}"    "[i for i in xrange(10000) if i in l]"
1000 loops, best of 3: 1.84 msec per loop
$ python -m timeit -s "l=[k for k in xrange(5000)]"    "[i for i in xrange(10000) if i in l]"
10 loops, best of 3: 573 msec per loop
$ python -m timeit -s "l=tuple([k for k in xrange(5000)])"    "[i for i in xrange(10000) if i in l]"
10 loops, best of 3: 587 msec per loop
$ python -m timeit -s "l=set([k for k in xrange(5000)])"    "[i for i in xrange(10000) if i in l]"
1000 loops, best of 3: 1.88 msec per loop

Here we also compare a tuple, which is known to be faster than a list (and to use less memory) in some use cases. In the case of a lookup table, the tuple fared no better.

Both the dict and the set performed very well. This brings up an interesting point tying into @SilentGhost’s answer about uniqueness: if the OP has 10M values in a data set and it’s unknown whether there are duplicates among them, then it would be worth keeping a set/dict of its elements in parallel with the actual data set, and testing for existence in that set/dict. It’s possible the 10M data points contain only 10 unique values, which is a much smaller space to search!

SilentGhost’s mistake about dicts is actually illuminating, because one could use a dict to correlate duplicated data (the values) with a non-duplicated set (the keys), and thus keep one data object holding all the data while still being as fast as a lookup table. For example, a dict key could be the value being looked up, and the value could be a list of indices in an imaginary list where that value occurred.

For example, if the source data list to be searched was l=[1,2,3,1,2,1,4], it could be optimized for both searching and memory by replacing it with this dict:

>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> l=[1,2,3,1,2,1,4]
>>> for i, e in enumerate(l):
...     d[e].append(i)
>>> d
defaultdict(<class 'list'>, {1: [0, 3, 5], 2: [1, 4], 3: [2], 4: [6]})

With this dict, one can know:

  1. If a value was in the original dataset (i.e. 2 in d returns True)
  2. Where the value was in the original dataset (i.e. d[2] returns the list of indices where the value was found in the original data list: [1, 4])
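
To make those two lookups concrete, continuing the session above (one caveat worth adding: indexing a missing key on a defaultdict silently inserts an empty list, so the in test is the safer membership check):

>>> 2 in d
True
>>> d[2]
[1, 4]
>>> 5 in d          # a plain membership test does not mutate the defaultdict
False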

Answer 7

You don’t actually need to store 10 million values in the table, so it’s not a big deal either way.

Hint: think about how large your result can be after the first sum of squares operation. The largest possible result will be much smaller than 10 million…
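
The hint, worked out: a starting value below 10 million has at most seven digits, so the first sum-of-squares step can produce at most 7 × 9² = 567, which means the look-up table only ever needs to cover that much smaller range:

>>> sum(int(d) ** 2 for d in str(9999999))
567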