Tag archive: data-structures

Given a string of a million digits, return all repeating 3-digit numbers

I had an interview with a hedge fund company in New York a few months ago and, unfortunately, I did not get the internship offer as a data/software engineer. (They also asked for the solution to be in Python.)

I pretty much screwed up on the first interview problem…

Question: Given a string of a million digits (Pi, for example), write a function/program that returns every 3-digit number that occurs more than once, together with the number of times it occurs.

For example: if the string was: 123412345123456 then the function/program would return:

123 - 3 times
234 - 3 times
345 - 2 times

They did not give me the solution after I failed the interview, but they did tell me that the time complexity for the solution was constant of 1000 since all the possible outcomes are between:

000 –> 999

Now that I’m thinking about it, I don’t think it’s possible to come up with a constant time algorithm. Is it?


Answer 0

You got off lightly; you probably don’t want to be working for a hedge fund where the quants don’t understand basic algorithms :-)

There is no way to process an arbitrarily-sized data structure in O(1) if, as in this case, you need to visit every element at least once. The best you can hope for is O(n) in this case, where n is the length of the string.

Although, as an aside, a nominal O(n) algorithm will be O(1) for a fixed input size so, technically, they may have been correct here. However, that’s not usually how people use complexity analysis.

It appears to me you could have impressed them in a number of ways.

First, by informing them that it’s not possible to do it in O(1), unless you use the “suspect” reasoning given above.

Second, by showing your elite skills by providing Pythonic code such as:

inpStr = '123412345123456'

# O(1) array creation.
freq = [0] * 1000

# O(n) string processing.
for val in [int(inpStr[pos:pos+3]) for pos in range(len(inpStr) - 2)]:
    freq[val] += 1

# O(1) output of relevant array values.
print ([(num, freq[num]) for num in range(1000) if freq[num] > 1])

This outputs:

[(123, 3), (234, 3), (345, 2)]

though you could, of course, modify the output format to anything you desire.

And, finally, by telling them there’s almost certainly no problem with an O(n) solution, since the code above delivers results for a one-million-digit string in well under half a second. It seems to scale quite linearly as well, since a 10,000,000-character string takes 3.5 seconds and a 100,000,000-character one takes 36 seconds.

And, if they need better than that, there are ways to parallelise this sort of stuff that can greatly speed it up.

Not within a single Python interpreter of course, due to the GIL, but you could split the string into something like (overlap indicated by vv is required to allow proper processing of the boundary areas):

    vv
123412  vv
    123451
        5123456

You can farm these out to separate workers and combine the results afterwards.
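
As a rough sketch of that split-and-combine idea (not the answerer's code): the helper names `split_with_overlap`, `count_triplets` and `count_chunked` are made up for illustration, and the `map_fn` parameter is an assumed hook where something like the `map` method of a `concurrent.futures.ProcessPoolExecutor` could be passed to spread the pieces over several processes. With the default built-in `map` everything runs sequentially, which is enough to check that the overlap handling is right.

```python
from collections import Counter


def split_with_overlap(text, n_chunks):
    """Split text into pieces that overlap by two characters.

    Every triplet start position belongs to exactly one piece, so no
    triplet is lost or counted twice at a boundary.
    """
    n_starts = max(len(text) - 2, 0)            # number of triplet start positions
    step = -(-n_starts // n_chunks) or 1        # ceiling division, at least 1
    pieces = []
    for a in range(0, n_starts, step):
        b = min(a + step, n_starts)
        pieces.append(text[a:b + 2])            # + 2 supplies the overlap
    return pieces


def count_triplets(piece):
    """Count every 3-character window in one piece."""
    return Counter(piece[i:i + 3] for i in range(len(piece) - 2))


def count_chunked(text, n_chunks=4, map_fn=map):
    """Count triplets piece by piece and merge the partial counts."""
    total = Counter()
    for partial in map_fn(count_triplets, split_with_overlap(text, n_chunks)):
        total.update(partial)
    return total


result = count_chunked('123412345123456', n_chunks=3)
print(sorted((k, v) for k, v in result.items() if v > 1))
# prints: [('123', 3), ('234', 3), ('345', 2)]
```

Merging is cheap because each partial result has at most 1000 keys, whatever the size of the piece.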

The splitting of input and combining of output are likely to swamp any saving with small strings (and possibly even million-digit strings) but, for much larger data sets, it may well make a difference. My usual mantra of “measure, don’t guess” applies here, of course.
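
To act on “measure, don’t guess”, a small timing harness along these lines could be used. It is only a sketch: `count_triplets` restates the list-based loop from earlier in this answer, `time_once` is a hypothetical helper, and the input sizes are kept small here so it finishes quickly. The timings it prints depend entirely on the machine, so none are shown.

```python
import random
from timeit import timeit


def count_triplets(text):
    """Single pass over the string, one table slot per 3-digit number."""
    counts = [0] * 1000
    for pos in range(len(text) - 2):
        counts[int(text[pos:pos + 3])] += 1
    return counts


def time_once(n):
    """Time one run of count_triplets on n random digits (seconds)."""
    text = ''.join(random.choice('0123456789') for _ in range(n))
    return timeit(lambda: count_triplets(text), number=1)


# Increase the sizes to see how the run time scales with the input length.
for n in (10_000, 100_000):
    print(f'{n:>9} digits: {time_once(n):.4f} s')
```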


This mantra also applies to other possibilities, such as bypassing Python altogether and using a different language which may be faster.

For example, the following C code, running on the same hardware as the earlier Python code, handles a hundred million digits in 0.6 seconds, roughly the same amount of time as the Python code processed one million. In other words, much faster:

#include <stdio.h>
#include <string.h>

int main(void) {
    static char inpStr[100000000+1];
    static int freq[1000];

    // Set up test data.

    memset(inpStr, '1', sizeof(inpStr));
    inpStr[sizeof(inpStr)-1] = '\0';

    // Need at least three digits to do anything useful.

    if (strlen(inpStr) <= 2) return 0;

    // Get initial feed from first two digits, process others.

    int val = (inpStr[0] - '0') * 10 + inpStr[1] - '0';
    char *inpPtr = &(inpStr[2]);
    while (*inpPtr != '\0') {
        // Remove hundreds, add next digit as units, adjust table.

        val = (val % 100) * 10 + *inpPtr++ - '0';
        freq[val]++;
    }

    // Output (relevant part of) table.

    for (int i = 0; i < 1000; ++i)
        if (freq[i] > 1)
            printf("%03d -> %d\n", i, freq[i]);  // %03d keeps leading zeros, e.g. 012

    return 0;
}

Answer 1

Constant time isn’t possible. All 1 million digits need to be looked at at least once, so that is a time complexity of O(n), where n = 1 million in this case.

For a simple O(n) solution, create an array of size 1000 that represents the number of occurrences of each possible 3 digit number. Advance 1 digit at a time, first index == 0, last index == 999997, and increment array[3 digit number] to create a histogram (count of occurrences for each possible 3 digit number). Then output the content of the array with counts > 1.
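
The histogram described above might look like this in Python. This is a sketch rather than the answerer's code: the function name `triplet_histogram` is invented, and instead of slicing three characters at each index it carries a rolling value forward one digit at a time, which visits the same start positions (0 up to the length minus 3).

```python
def triplet_histogram(digits):
    """Return a 1000-slot list; slot k holds how often the number k occurs."""
    counts = [0] * 1000
    if len(digits) < 3:
        return counts
    # Seed the rolling value with the first two digits.
    val = int(digits[0]) * 10 + int(digits[1])
    for ch in digits[2:]:
        # Drop the hundreds digit, shift left, append the new units digit.
        val = (val % 100) * 10 + int(ch)
        counts[val] += 1
    return counts


hist = triplet_histogram('123412345123456')
for number, count in enumerate(hist):
    if count > 1:
        print(f'{number:03d} - {count} times')
# prints:
# 123 - 3 times
# 234 - 3 times
# 345 - 2 times
```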


Answer 2

A million is small for the answer I give below. Assuming only that you have to be able to run the solution in the interview without a pause, the following works in less than two seconds and gives the required result:

from collections import Counter

def triple_counter(s):
    # range ends at len(s) + 1 so the final triplet s[-3:] is counted too
    c = Counter(s[n-3: n] for n in range(3, len(s) + 1))
    for tri, n in c.most_common():
        if n > 1:
            print('%s - %i times.' % (tri, n))
        else:
            break

if __name__ == '__main__':
    import random

    s = ''.join(random.choice('0123456789') for _ in range(1_000_000))
    triple_counter(s)

Hopefully the interviewer would be looking for use of the standard library’s collections.Counter class.

Parallel execution version

I wrote a blog post on this with more explanation.


Answer 3

The simple O(n) solution would be to count each 3-digit number:

for nr in range(1000):
    # Note: str.count only counts non-overlapping matches.
    cnt = text.count('%03d' % nr)
    if cnt > 1:
        print('%03d is found %d times' % (nr, cnt))

This would search through all 1 million digits 1000 times.

Traversing the digits only once:

counts = [0] * 1000
for idx in range(len(text)-2):
    counts[int(text[idx:idx+3])] += 1

for nr, cnt in enumerate(counts):
    if cnt > 1:
        print('%03d is found %d times' % (nr, cnt))

Timing shows that iterating only once over the index is twice as fast as using count.
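
One caveat when comparing the two approaches: `str.count` counts non-overlapping occurrences, while the index loop looks at every start position, so the results can differ for triplets that overlap themselves such as 111. A minimal illustration, using a made-up four-digit input:

```python
text = '1111'

# str.count skips past each match, so the second, overlapping match is missed.
non_overlapping = text.count('111')

# A sliding window checks every start position instead.
sliding = sum(1 for i in range(len(text) - 2) if text[i:i + 3] == '111')

print(non_overlapping, sliding)   # prints: 1 2
```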


Answer 4

Here is a NumPy implementation of the “consensus” O(n) algorithm: walk through all triplets and bin as you go. The binning works like this: upon encountering, say, “385”, add one to bin[3, 8, 5], which is an O(1) operation. Bins are arranged in a 10x10x10 cube. As the binning is fully vectorized, there is no explicit loop in the code.

def setup_data(n):
    import random
    digits = "0123456789"
    return dict(text = ''.join(random.choice(digits) for i in range(n)))

def f_np(text):
    # Get the data into NumPy
    import numpy as np
    a = np.frombuffer(bytes(text, 'utf8'), dtype=np.uint8) - ord('0')
    # Rolling triplets
    a3 = np.lib.stride_tricks.as_strided(a, (3, a.size-2), 2*a.strides)

    bins = np.zeros((10, 10, 10), dtype=int)
    # Next line performs O(n) binning
    np.add.at(bins, tuple(a3), 1)
    # Filtering is left as an exercise
    return bins.ravel()

def f_py(text):
    counts = [0] * 1000
    for idx in range(len(text)-2):
        counts[int(text[idx:idx+3])] += 1
    return counts

import numpy as np
import types
from timeit import timeit
for n in (10, 1000, 1000000):
    data = setup_data(n)
    ref = f_np(**data)
    print(f'n = {n}')
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        try:
            assert np.all(ref == func(**data))
            print("{:16s}{:16.8f} ms".format(name[2:], timeit(
                'f(**data)', globals={'f':func, 'data':data}, number=10)*100))
        except:
            print("{:16s} apparently crashed".format(name[2:]))

Unsurprisingly, NumPy is a bit faster than @Daniel’s pure Python solution on large data sets. Sample output:

# n = 10
# np                    0.03481400 ms
# py                    0.00669330 ms
# n = 1000
# np                    0.11215360 ms
# py                    0.34836530 ms
# n = 1000000
# np                   82.46765980 ms
# py                  360.51235450 ms

Answer 5

I would solve the problem as follows:

def find_numbers(str_num):
    final_dict = {}
    buffer = {}
    for idx in range(len(str_num) - 2):  # - 2 so the final triplet is counted too
        num = int(str_num[idx:idx + 3])
        if num not in buffer:
            buffer[num] = 0
        buffer[num] += 1
        if buffer[num] > 1:
            final_dict[num] = buffer[num]
    return final_dict

Applied to your example string, this yields:

>>> find_numbers("123412345123456")
{123: 3, 234: 3, 345: 2}

This solution runs in O(n) for n being the length of the provided string, and is, I guess, the best you can get.


Answer 6

As per my understanding, you cannot have the solution in constant time. It will take at least one pass over the million-digit number (assuming it’s a string). You can do a 3-digit rolling iteration over the digits of the million-digit number and increase the value of the hash key by 1 if it already exists, or create a new hash key (initialized to 1) if it doesn’t already exist in the dictionary.

The code will look something like this:

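def calc_repeating_digits(number):
    # Accept an int or a digit string; work on the string form throughout.
    digits = str(number)

    # Maps each 3-digit substring to the number of times it has been seen.
    counts = {}

    for i in range(len(digits) - 2):
        current_three_digits = digits[i:i+3]
        if current_three_digits in counts:
            counts[current_three_digits] += 1
        else:
            counts[current_three_digits] = 1

    return counts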

You can filter down to the keys which have item value greater than 1.
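
That last filtering step could be a one-line dictionary comprehension. In this sketch the helper name `keep_repeated` and the sample counts are made up for illustration; in practice the input would be the dictionary returned by the function above.

```python
def keep_repeated(counts):
    """Keep only the entries whose count is greater than 1."""
    return {key: n for key, n in counts.items() if n > 1}


example_counts = {'123': 3, '234': 3, '341': 1, '345': 2}
print(keep_repeated(example_counts))
# prints: {'123': 3, '234': 3, '345': 2}
```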


Answer 7

As mentioned in another answer, you cannot do this algorithm in constant time, because you must look at at least n digits. Linear time is the fastest you can get.

However, the algorithm can be done in O(1) space. You only need to store the counts of each 3 digit number, so you need an array of 1000 entries. You can then stream the number in.

My guess is that either the interviewer misspoke when they gave you the solution, or you misheard “constant time” when they said “constant space.”
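
A sketch of what streaming the number in with constant space could look like, under some assumptions: the digits arrive through a file-like object with a `read` method, the function name `stream_triplet_counts` and the `block_size` parameter are invented here, and `io.StringIO` stands in for a real file. Apart from the current block, the only state kept is the 1000-entry table and the last two digits seen.

```python
import io


def stream_triplet_counts(stream, block_size=4096):
    """Count 3-digit windows while reading the input block by block."""
    counts = [0] * 1000
    tail = ''                              # last two digits of the data seen so far
    while True:
        block = stream.read(block_size)
        if not block:
            break
        window = tail + block              # re-attach the carry-over digits
        for i in range(len(window) - 2):
            counts[int(window[i:i + 3])] += 1
        tail = window[-2:]
    return counts


counts = stream_triplet_counts(io.StringIO('123412345123456'), block_size=4)
print([(n, c) for n, c in enumerate(counts) if c > 1])
# prints: [(123, 3), (234, 3), (345, 2)]
```

Because the carry-over is at most two characters, every window counted in a block contains at least one new digit, so nothing is counted twice across block boundaries.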


Answer 8

Here’s my answer:

from timeit import timeit
from collections import Counter
import types
import random

def setup_data(n):
    digits = "0123456789"
    return dict(text = ''.join(random.choice(digits) for i in range(n)))


def f_counter(text):
    c = Counter()
    for i in range(len(text)-2):
        ss = text[i:i+3]
        c.update([ss])
    return (i for i in c.items() if i[1] > 1)

def f_dict(text):
    d = {}
    for i in range(len(text)-2):
        ss = text[i:i+3]
        if ss not in d:
            d[ss] = 0
        d[ss] += 1
    return ((i, d[i]) for i in d if d[i] > 1)

def f_array(text):
    a = [[[0 for _ in range(10)] for _ in range(10)] for _ in range(10)]
    for n in range(len(text)-2):
        i, j, k = (int(ss) for ss in text[n:n+3])
        a[i][j][k] += 1
    for i, b in enumerate(a):
        for j, c in enumerate(b):
            for k, d in enumerate(c):
                if d > 1: yield (f'{i}{j}{k}', d)


for n in (1E1, 1E3, 1E6):
    n = int(n)
    data = setup_data(n)
    print(f'n = {n}')
    results = {}
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        print("{:16s}{:16.8f} ms".format(name[2:], timeit(
            'results[name] = f(**data)', globals={'f':func, 'data':data, 'results':results, 'name':name}, number=10)*100))
    for r in results:
        print('{:10}: {}'.format(r, sorted(list(results[r]))[:5]))

The array lookup method is very fast (even faster than @paul-panzer’s numpy method!). Of course, it cheats since it isn’t technically finished after it completes: f_array is a generator function, so calling it only returns a generator, and the counting loop does not run until the results are consumed, outside the timed call. It also doesn’t have to check every iteration if the value already exists, which is likely to help a lot.

n = 10
counter               0.10595780 ms
dict                  0.01070654 ms
array                 0.00135370 ms
f_counter : []
f_dict    : []
f_array   : []
n = 1000
counter               2.89462101 ms
dict                  0.40434612 ms
array                 0.00073838 ms
f_counter : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_dict    : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_array   : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
n = 1000000
counter            2849.00500992 ms
dict                438.44007806 ms
array                 0.00135370 ms
f_counter : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_dict    : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_array   : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]

Answer 9

Image as answer:

IMAGE AS ANSWER

Looks like a sliding window.


Answer 10

Here is my solution:

from collections import defaultdict
string = "103264685134845354863"
d = defaultdict(int)
for elt in range(len(string)-2):
    d[string[elt:elt+3]] += 1
d = {key: d[key] for key in d.keys() if d[key] > 1}

With a bit of creativity in the for loop (an additional lookup list with True/False/None, for example) you should be able to get rid of the last line, since you only want to create keys in the dict for triplets we have already visited once up to that point. Hope it helps :)


Answer 11

- Telling from the perspective of C.
- You can have an int 3-d array results[10][10][10];
- Go from the 0th location to the n-4th location, where n is the size of the string array.
- On each location, check the current, next and next’s next.
- Increment the counter as results[current][next][next’s next]++;
- Print the values of

results[1][2][3]
results[2][3][4]
results[3][4][5]
results[4][5][6]
results[5][6][7]
results[6][7][8]
results[7][8][9]

- It is O(n) time; there are no comparisons involved.
- You can run some parallel stuff here by partitioning the array and calculating the matches around the partitions.


Answer 12

inputStr = '123456123138276237284287434628736482376487234682734682736487263482736487236482634'

count = {}
for i in range(len(inputStr) - 2):
    subNum = int(inputStr[i:i+3])
    if subNum not in count:
        count[subNum] = 1
    else:
        count[subNum] += 1

print(count)

How to convert SQL query results to a pandas data structure?

Any help on this problem will be greatly appreciated.

So basically I want to run a query to my SQL database and store the returned data as Pandas data structure.

I have attached code for query.

I am reading the documentation on Pandas, but I have problem to identify the return type of my query.

I tried to print the query result, but it doesn’t give any useful information.

Thanks!!!!

from sqlalchemy import create_engine

engine2 = create_engine('mysql://THE DATABASE I AM ACCESSING')
connection2 = engine2.connect()
dataid = 1022
resoverall = connection2.execute("""
  SELECT 
      sum(BLABLA) AS BLA,
      sum(BLABLABLA2) AS BLABLABLA2,
      sum(SOME_INT) AS SOME_INT,
      sum(SOME_INT2) AS SOME_INT2,
      100*sum(SOME_INT2)/sum(SOME_INT) AS ctr,
      sum(SOME_INT2)/sum(SOME_INT) AS cpc
   FROM daily_report_cooked
   WHERE campaign_id = '%s'""" % dataid)

So I sort of want to understand what’s the format/datatype of my variable “resoverall” and how to put it with PANDAS data structure.


Answer 0

Here’s the shortest code that will do the job:

from pandas import DataFrame
df = DataFrame(resoverall.fetchall())
df.columns = resoverall.keys()

You can go fancier and parse the types as in Paul’s answer.
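
To make the shape of the data concrete, here is a sketch that uses the standard library's sqlite3 module as a stand-in for the MySQL connection in the question (an assumption made purely so the example is self-contained; the table and column names are invented). It shows the two pieces the snippet above relies on: the rows come back as a list of tuples, and the column names are available separately. That pair is what a DataFrame is then built from.

```python
import sqlite3

# In-memory database with a tiny, made-up table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE daily_report (campaign_id INTEGER, clicks INTEGER)')
conn.executemany('INSERT INTO daily_report VALUES (?, ?)',
                 [(1022, 5), (1022, 7), (1, 100)])

# Bound parameter (?) instead of string formatting for the campaign id.
cur = conn.execute(
    'SELECT sum(clicks) AS clicks, count(*) AS n_rows '
    'FROM daily_report WHERE campaign_id = ?', (1022,))

columns = [d[0] for d in cur.description]   # column names from the cursor
rows = cur.fetchall()                       # list of row tuples
conn.close()

print(columns)   # prints: ['clicks', 'n_rows']
print(rows)      # prints: [(12, 2)]
```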


Answer 1

Edit: Mar. 2015

As noted below, pandas now uses SQLAlchemy to both read from (read_sql) and insert into (to_sql) a database. The following should work

import pandas as pd

df = pd.read_sql(sql, cnxn)

Previous answer: Via mikebmassey from a similar question

import pyodbc
import pandas.io.sql as psql

cnxn = pyodbc.connect(connection_info) 
cursor = cnxn.cursor()
sql = "SELECT * FROM TABLE"

df = psql.frame_query(sql, cnxn)
cnxn.close()

Answer 2

If you are using SQLAlchemy’s ORM rather than the expression language, you might find yourself wanting to convert an object of type sqlalchemy.orm.query.Query to a Pandas data frame.

The cleanest approach is to get the generated SQL from the query’s statement attribute, and then execute it with pandas’s read_sql() method. E.g., starting with a Query object called query:

df = pd.read_sql(query.statement, query.session.bind)

Answer 3

Edit 2014-09-30:

pandas now has a read_sql function. You definitely want to use that instead.

Original answer:

I can’t help you with SQLAlchemy; I always use pyodbc, MySQLdb, or psycopg2 as needed. But when doing so, a function as simple as the one below tends to suit my needs:

import datetime
import decimal

import pyodbc
import numpy as np
import pandas

# Usage example; in a script, run this after __processCursor (below) is defined.
cnn, cur = myConnectToDBfunction()
cmd = "SELECT * FROM myTable"
cur.execute(cmd)
dataframe = __processCursor(cur, dataframe=True)

def __processCursor(cur, dataframe=False, index=None):
    '''
    Processes a database cursor with data on it into either
    a structured numpy array or a pandas dataframe.

    input:
    cur - a pyodbc cursor that has just received data
    dataframe - bool. if false, a numpy record array is returned
                if true, return a pandas dataframe
    index - list of column(s) to use as index in a pandas dataframe
    '''
    datatypes = []
    colinfo = cur.description
    for col in colinfo:
        if col[1] == unicode:
            datatypes.append((col[0], 'U%d' % col[3]))
        elif col[1] == str:
            datatypes.append((col[0], 'S%d' % col[3]))
        elif col[1] in [float, decimal.Decimal]:
            datatypes.append((col[0], 'f4'))
        elif col[1] == datetime.datetime:
            datatypes.append((col[0], 'O4'))
        elif col[1] == int:
            datatypes.append((col[0], 'i4'))

    data = []
    for row in cur:
        data.append(tuple(row))

    array = np.array(data, dtype=datatypes)
    if dataframe:
        output = pandas.DataFrame.from_records(array)

        if index is not None:
            output = output.set_index(index)

    else:
        output = array

    return output

Answer 4

MySQL Connector

For those who work with the MySQL connector, you can use this code as a start. (Thanks to @Daniel Velkov)


import pandas as pd
import mysql.connector

# Setup MySQL connection
db = mysql.connector.connect(
    host="<IP>",              # your host, usually localhost
    user="<USER>",            # your username
    password="<PASS>",        # your password
    database="<DATABASE>"     # name of the data base
)   

# You must create a Cursor object. It will let you execute all the queries you need
cur = db.cursor()

# Use all the SQL you like
cur.execute("SELECT * FROM <TABLE>")

# Put it all to a data frame
sql_data = pd.DataFrame(cur.fetchall())
sql_data.columns = cur.column_names

# Close the session
db.close()

# Show the data
print(sql_data.head())

Answer 5

Here’s the code I use. Hope this helps.

import pandas as pd
from sqlalchemy import create_engine

def getData():
  # Parameters
  ServerName = "my_server"
  Database = "my_db"
  UserPwd = "user:pwd"
  Driver = "driver=SQL Server Native Client 11.0"

  # Create the connection
  engine = create_engine('mssql+pyodbc://' + UserPwd + '@' + ServerName + '/' + Database + "?" + Driver)

  sql = "select * from mytable"
  df = pd.read_sql(sql, engine)
  return df

df2 = getData()
print(df2)

Here’s the code I use. Hope this helps.

import pandas as pd
from sqlalchemy import create_engine

def getData():
  # Parameters
  ServerName = "my_server"
  Database = "my_db"
  UserPwd = "user:pwd"
  Driver = "driver=SQL Server Native Client 11.0"

  # Create the connection
  engine = create_engine('mssql+pyodbc://' + UserPwd + '@' + ServerName + '/' + Database + "?" + Driver)

  sql = "select * from mytable"
  df = pd.read_sql(sql, engine)
  return df

df2 = getData()
print(df2)

Answer 6

This is a short and crisp answer to your problem:

from __future__ import print_function
import MySQLdb
import numpy as np
import pandas as pd
import xlrd

# Connecting to MySQL Database
connection = MySQLdb.connect(
             host="hostname",
             port=0000,
             user="userID",
             passwd="password",
             db="table_documents",
             charset='utf8'
           )
print(connection)
#getting data from database into a dataframe
sql_for_df = 'select * from tabledata'
df_from_database = pd.read_sql(sql_for_df , connection)

This is a short and crisp answer to your problem:

from __future__ import print_function
import MySQLdb
import numpy as np
import pandas as pd
import xlrd

# Connecting to MySQL Database
connection = MySQLdb.connect(
             host="hostname",
             port=0000,
             user="userID",
             passwd="password",
             db="table_documents",
             charset='utf8'
           )
print(connection)
#getting data from database into a dataframe
sql_for_df = 'select * from tabledata'
df_from_database = pd.read_sql(sql_for_df , connection)

Answer 7

1. Using MySQL-connector-python

# pip install mysql-connector-python

import mysql.connector
import pandas as pd

mydb = mysql.connector.connect(
    host = 'host',
    user = 'username',
    passwd = 'pass',
    database = 'db_name'
)
query = 'select * from table_name'
df = pd.read_sql(query, con = mydb)
print(df)

2. Using SQLAlchemy

# pip install pymysql
# pip install sqlalchemy

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mysql+pymysql://username:password@localhost:3306/db_name')

query = '''
select * from table_name
'''
df = pd.read_sql_query(query, engine)
print(df)

1. Using MySQL-connector-python

# pip install mysql-connector-python

import mysql.connector
import pandas as pd

mydb = mysql.connector.connect(
    host = 'host',
    user = 'username',
    passwd = 'pass',
    database = 'db_name'
)
query = 'select * from table_name'
df = pd.read_sql(query, con = mydb)
print(df)

2. Using SQLAlchemy

# pip install pymysql
# pip install sqlalchemy

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mysql+pymysql://username:password@localhost:3306/db_name')

query = '''
select * from table_name
'''
df = pd.read_sql_query(query, engine)
print(df)

Answer 8

Like Nathan, I often want to dump the results of a SQLAlchemy or sqlsoup Query into a pandas DataFrame. My own solution for this is:

query = session.query(tbl.Field1, tbl.Field2)
DataFrame(query.all(), columns=[column['name'] for column in query.column_descriptions])

Like Nathan, I often want to dump the results of a sqlalchemy or sqlsoup Query into a Pandas data frame. My own solution for this is:

query = session.query(tbl.Field1, tbl.Field2)
DataFrame(query.all(), columns=[column['name'] for column in query.column_descriptions])

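Answer 9

resoverall is a SQLAlchemy ResultProxy object. You can read more about it in the SQLAlchemy docs, which explain the basic usage of working with Engines and Connections. What matters here is that resoverall is dict-like.

pandas likes dict-like objects for creating its data structures; see the online docs.

Good luck with SQLAlchemy and pandas.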

resoverall is a SQLAlchemy ResultProxy object. You can read more about it in the SQLAlchemy docs, which explain the basic usage of working with Engines and Connections. What matters here is that resoverall is dict-like.

pandas likes dict-like objects for creating its data structures; see the online docs.

Good luck with sqlalchemy and pandas.


Answer 10

Simply use pandas and pyodbc together. You’ll have to modify your connection string (connstr) according to your database specifications.

import pyodbc
import pandas as pd

# MSSQL Connection String Example
connstr = "Server=myServerAddress;Database=myDB;User Id=myUsername;Password=myPass;"

# Query Database and Create DataFrame Using Results
df = pd.read_sql("select * from myTable", pyodbc.connect(connstr))

I’ve used pyodbc with several enterprise databases (e.g. SQL Server, MySQL, MariaDB, IBM).

Simply use pandas and pyodbc together. You’ll have to modify your connection string (connstr) according to your database specifications.

import pyodbc
import pandas as pd

# MSSQL Connection String Example
connstr = "Server=myServerAddress;Database=myDB;User Id=myUsername;Password=myPass;"

# Query Database and Create DataFrame Using Results
df = pd.read_sql("select * from myTable", pyodbc.connect(connstr))

I’ve used pyodbc with several enterprise databases (e.g. SQL Server, MySQL, MariaDB, IBM).


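Answer 11

This question is old, but I wanted to add my two cents. I read the question as “I want to run a query against my [My]SQL database and store the returned data as a pandas data structure [DataFrame].”

From the code it looks like you mean a MySQL database, and I assume you mean a pandas DataFrame.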

import MySQLdb as mdb
import pandas.io.sql as sql
from pandas import *

conn = mdb.connect('<server>','<user>','<pass>','<db>');
df = sql.read_frame('<query>', conn)

For example,

conn = mdb.connect('localhost','myname','mypass','testdb');
df = sql.read_frame('select * from testTable', conn)

This will import all rows of testTable into a DataFrame.

This question is old, but I wanted to add my two cents. I read the question as “I want to run a query against my [My]SQL database and store the returned data as a pandas data structure [DataFrame].”

From the code it looks like you mean a MySQL database, and I assume you mean a pandas DataFrame.

import MySQLdb as mdb
import pandas.io.sql as sql
from pandas import *

conn = mdb.connect('<server>','<user>','<pass>','<db>');
df = sql.read_frame('<query>', conn)

For example,

conn = mdb.connect('localhost','myname','mypass','testdb');
df = sql.read_frame('select * from testTable', conn)

This will import all rows of testTable into a DataFrame.


Answer 12

Here is mine, in case you are using “pymysql”:

import pymysql
from pandas import DataFrame

host   = 'localhost'
port   = 3306
user   = 'yourUserName'
passwd = 'yourPassword'
db     = 'yourDatabase'

cnx    = pymysql.connect(host=host, port=port, user=user, passwd=passwd, db=db)
cur    = cnx.cursor()

query  = """ SELECT * FROM yourTable LIMIT 10"""
cur.execute(query)

field_names = [i[0] for i in cur.description]
get_data = [xx for xx in cur]

cur.close()
cnx.close()

df = DataFrame(get_data)
df.columns = field_names

Here is mine, in case you are using “pymysql”:

import pymysql
from pandas import DataFrame

host   = 'localhost'
port   = 3306
user   = 'yourUserName'
passwd = 'yourPassword'
db     = 'yourDatabase'

cnx    = pymysql.connect(host=host, port=port, user=user, passwd=passwd, db=db)
cur    = cnx.cursor()

query  = """ SELECT * FROM yourTable LIMIT 10"""
cur.execute(query)

field_names = [i[0] for i in cur.description]
get_data = [xx for xx in cur]

cur.close()
cnx.close()

df = DataFrame(get_data)
df.columns = field_names
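
The cursor.description trick used above is not specific to pymysql: DB-API cursors in general expose the column name as the first item of each description entry. As a minimal sketch of the same idea, the snippet below uses the standard-library sqlite3 module with an in-memory database so that it runs without a MySQL server. The table demo, its columns, and the helper name rows_as_dicts are made up for illustration.

```python
import sqlite3


def rows_as_dicts(cursor):
    """Pair every fetched row with the column names from cursor.description."""
    names = [col[0] for col in cursor.description]
    return [dict(zip(names, row)) for row in cursor.fetchall()]


# Hypothetical in-memory table, only here to have something to query.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE demo (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO demo VALUES (?, ?)", [(1, "a"), (2, "b")])

cur.execute("SELECT id, name FROM demo ORDER BY id")
records = rows_as_dicts(cur)
conn.close()
```

A list of dicts like records can be handed straight to pandas.DataFrame, which takes the column names from the dict keys, so the separate df.columns assignment is no longer needed.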

Answer 13

pandas.io.sql.write_frame is DEPRECATED. https://pandas.pydata.org/pandas-docs/version/0.15.2/generated/pandas.io.sql.write_frame.html

You should switch to pandas.DataFrame.to_sql instead. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html

There is another solution. PYODBC to Pandas – DataFrame not working – Shape of passed values is (x,y), indices imply (w,z)

As of pandas 0.12 (I believe) you can do:

import pandas
import pyodbc

sql = 'select * from table'
cnn = pyodbc.connect(...)

data = pandas.read_sql(sql, cnn)

Prior to 0.12, you could do:

import pandas
from pandas.io.sql import read_frame
import pyodbc

sql = 'select * from table'
cnn = pyodbc.connect(...)

data = read_frame(sql, cnn)

pandas.io.sql.write_frame is DEPRECATED. https://pandas.pydata.org/pandas-docs/version/0.15.2/generated/pandas.io.sql.write_frame.html

You should switch to pandas.DataFrame.to_sql instead. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html

There is another solution. PYODBC to Pandas – DataFrame not working – Shape of passed values is (x,y), indices imply (w,z)

As of Pandas 0.12 (I believe) you can do:

import pandas
import pyodbc

sql = 'select * from table'
cnn = pyodbc.connect(...)

data = pandas.read_sql(sql, cnn)

Prior to 0.12, you could do:

import pandas
from pandas.io.sql import read_frame
import pyodbc

sql = 'select * from table'
cnn = pyodbc.connect(...)

data = read_frame(sql, cnn)

Answer 14

It has been a long time since the last post, but maybe this helps someone…

A shorter way than Paul H’s:

my_dic = session.query(query.all())
my_df = pandas.DataFrame.from_dict(my_dic)

It has been a long time since the last post, but maybe this helps someone…

A shorter way than Paul H’s:

my_dic = session.query(query.all())
my_df = pandas.DataFrame.from_dict(my_dic)

Answer 15

The best way I have found to do this:

db = db_class()  # database class (a cursor-like wrapper of your own)
db.execute(query)
mydata = [x for x in db.fetchall()]
df = pd.DataFrame(data=mydata)

The best way I have found to do this:

db = db_class()  # database class (a cursor-like wrapper of your own)
db.execute(query)
mydata = [x for x in db.fetchall()]
df = pd.DataFrame(data=mydata)

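Answer 16

If the result type is ResultSet, you should convert each row to a dictionary first. The DataFrame columns will then be collected automatically.

This works in my case: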

df = pd.DataFrame([dict(r) for r in resoverall])

If the result type is ResultSet, you should convert each row to a dictionary first. The DataFrame columns will then be collected automatically.

This works in my case:

df = pd.DataFrame([dict(r) for r in resoverall])

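Representing graphs (data structure) in Python

Question: Representing graphs (data structure) in Python

How can one neatly represent a graph in Python? (Starting from scratch, i.e. no libraries!) What data structure (e.g. dicts/tuples/dict(tuples)) will be fast but also memory efficient? One must be able to do various graph operations on it. As pointed out, the various graph representations might help. How does one go about implementing them in Python? As for the libraries, this question has quite good answers.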





How can one neatly represent a graph in Python? (Starting from scratch i.e. no libraries!)
What data structure (e.g. dicts/tuples/dict(tuples)) will be fast but also memory efficient?
One must be able to do various graph operations on it.

As pointed out, the various graph representations might help. How does one go about implementing them in Python?

As for the libraries, this question has quite good answers.


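Answer 0

Even though this is a somewhat old question, I thought I’d give a practical answer for anyone stumbling across it.

Let’s say you get the input data for your connections as a list of tuples, like so:

[('A', 'B'), ('B', 'C'), ('B', 'D'), ('C', 'D'), ('E', 'F'), ('F', 'C')]

The data structure I’ve found to be most useful and efficient for graphs in Python is a dict of sets. This will be the underlying structure for our Graph class. You also have to know whether these connections are arcs (directed, connecting one way) or edges (undirected, connecting both ways). We’ll handle that by adding a directed parameter to the Graph.__init__ method. We’ll also add some other helpful methods.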

import pprint
from collections import defaultdict


class Graph(object):
    """ Graph data structure, undirected by default. """

    def __init__(self, connections, directed=False):
        self._graph = defaultdict(set)
        self._directed = directed
        self.add_connections(connections)

    def add_connections(self, connections):
        """ Add connections (list of tuple pairs) to graph """

        for node1, node2 in connections:
            self.add(node1, node2)

    def add(self, node1, node2):
        """ Add connection between node1 and node2 """

        self._graph[node1].add(node2)
        if not self._directed:
            self._graph[node2].add(node1)

    def remove(self, node):
        """ Remove all references to node """

        for n, cxns in self._graph.items():  # python3: items(); python2: iteritems()
            try:
                cxns.remove(node)
            except KeyError:
                pass
        try:
            del self._graph[node]
        except KeyError:
            pass

    def is_connected(self, node1, node2):
        """ Is node1 directly connected to node2 """

        return node1 in self._graph and node2 in self._graph[node1]

    def find_path(self, node1, node2, path=[]):
        """ Find any path between node1 and node2 (may not be shortest) """

        path = path + [node1]
        if node1 == node2:
            return path
        if node1 not in self._graph:
            return None
        for node in self._graph[node1]:
            if node not in path:
                new_path = self.find_path(node, node2, path)
                if new_path:
                    return new_path
        return None

    def __str__(self):
        return '{}({})'.format(self.__class__.__name__, dict(self._graph))

I’ll leave it as an “exercise for the reader” to create a find_shortest_path and other methods.

Let’s see this in action though…

>>> connections = [('A', 'B'), ('B', 'C'), ('B', 'D'),
                   ('C', 'D'), ('E', 'F'), ('F', 'C')]
>>> g = Graph(connections, directed=True)
>>> pretty_print = pprint.PrettyPrinter()
>>> pretty_print.pprint(g._graph)
{'A': {'B'},
 'B': {'D', 'C'},
 'C': {'D'},
 'E': {'F'},
 'F': {'C'}}

>>> g = Graph(connections)  # undirected
>>> pretty_print = pprint.PrettyPrinter()
>>> pretty_print.pprint(g._graph)
{'A': {'B'},
 'B': {'D', 'A', 'C'},
 'C': {'D', 'F', 'B'},
 'D': {'C', 'B'},
 'E': {'F'},
 'F': {'E', 'C'}}

>>> g.add('E', 'D')
>>> pretty_print.pprint(g._graph)
{'A': {'B'},
 'B': {'D', 'A', 'C'},
 'C': {'D', 'F', 'B'},
 'D': {'C', 'E', 'B'},
 'E': {'D', 'F'},
 'F': {'E', 'C'}}

>>> g.remove('A')
>>> pretty_print.pprint(g._graph)
{'B': {'D', 'C'},
 'C': {'D', 'F', 'B'},
 'D': {'C', 'E', 'B'},
 'E': {'D', 'F'},
 'F': {'E', 'C'}}

>>> g.add('G', 'B')
>>> pretty_print.pprint(g._graph)
{'B': {'D', 'G', 'C'},
 'C': {'D', 'F', 'B'},
 'D': {'C', 'E', 'B'},
 'E': {'D', 'F'},
 'F': {'E', 'C'},
 'G': {'B'}}

>>> g.find_path('G', 'E')
['G', 'B', 'D', 'C', 'F', 'E']

Even though this is a somewhat old question, I thought I’d give a practical answer for anyone stumbling across this.

Let’s say you get your input data for your connections as a list of tuples like so:

[('A', 'B'), ('B', 'C'), ('B', 'D'), ('C', 'D'), ('E', 'F'), ('F', 'C')]

The data structure I’ve found to be most useful and efficient for graphs in Python is a dict of sets. This will be the underlying structure for our Graph class. You also have to know if these connections are arcs (directed, connect one way) or edges (undirected, connect both ways). We’ll handle that by adding a directed parameter to the Graph.__init__ method. We’ll also add some other helpful methods.

import pprint
from collections import defaultdict


class Graph(object):
    """ Graph data structure, undirected by default. """

    def __init__(self, connections, directed=False):
        self._graph = defaultdict(set)
        self._directed = directed
        self.add_connections(connections)

    def add_connections(self, connections):
        """ Add connections (list of tuple pairs) to graph """

        for node1, node2 in connections:
            self.add(node1, node2)

    def add(self, node1, node2):
        """ Add connection between node1 and node2 """

        self._graph[node1].add(node2)
        if not self._directed:
            self._graph[node2].add(node1)

    def remove(self, node):
        """ Remove all references to node """

        for n, cxns in self._graph.items():  # python3: items(); python2: iteritems()
            try:
                cxns.remove(node)
            except KeyError:
                pass
        try:
            del self._graph[node]
        except KeyError:
            pass

    def is_connected(self, node1, node2):
        """ Is node1 directly connected to node2 """

        return node1 in self._graph and node2 in self._graph[node1]

    def find_path(self, node1, node2, path=[]):
        """ Find any path between node1 and node2 (may not be shortest) """

        path = path + [node1]
        if node1 == node2:
            return path
        if node1 not in self._graph:
            return None
        for node in self._graph[node1]:
            if node not in path:
                new_path = self.find_path(node, node2, path)
                if new_path:
                    return new_path
        return None

    def __str__(self):
        return '{}({})'.format(self.__class__.__name__, dict(self._graph))

I’ll leave it as an “exercise for the reader” to create a find_shortest_path and other methods.

Let’s see this in action though…

>>> connections = [('A', 'B'), ('B', 'C'), ('B', 'D'),
                   ('C', 'D'), ('E', 'F'), ('F', 'C')]
>>> g = Graph(connections, directed=True)
>>> pretty_print = pprint.PrettyPrinter()
>>> pretty_print.pprint(g._graph)
{'A': {'B'},
 'B': {'D', 'C'},
 'C': {'D'},
 'E': {'F'},
 'F': {'C'}}

>>> g = Graph(connections)  # undirected
>>> pretty_print = pprint.PrettyPrinter()
>>> pretty_print.pprint(g._graph)
{'A': {'B'},
 'B': {'D', 'A', 'C'},
 'C': {'D', 'F', 'B'},
 'D': {'C', 'B'},
 'E': {'F'},
 'F': {'E', 'C'}}

>>> g.add('E', 'D')
>>> pretty_print.pprint(g._graph)
{'A': {'B'},
 'B': {'D', 'A', 'C'},
 'C': {'D', 'F', 'B'},
 'D': {'C', 'E', 'B'},
 'E': {'D', 'F'},
 'F': {'E', 'C'}}

>>> g.remove('A')
>>> pretty_print.pprint(g._graph)
{'B': {'D', 'C'},
 'C': {'D', 'F', 'B'},
 'D': {'C', 'E', 'B'},
 'E': {'D', 'F'},
 'F': {'E', 'C'}}

>>> g.add('G', 'B')
>>> pretty_print.pprint(g._graph)
{'B': {'D', 'G', 'C'},
 'C': {'D', 'F', 'B'},
 'D': {'C', 'E', 'B'},
 'E': {'D', 'F'},
 'F': {'E', 'C'},
 'G': {'B'}}

>>> g.find_path('G', 'E')
['G', 'B', 'D', 'C', 'F', 'E']
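
The find_path result above is a valid path, but not the shortest one. One possible take on the find_shortest_path exercise is a breadth-first search, sketched here as a standalone function over a plain dict of sets (the same shape as Graph._graph) rather than as a method. The function name comes from the answer's suggestion; the parents bookkeeping and the sample graph literal are assumptions made for this illustration.

```python
from collections import deque


def find_shortest_path(graph, start, goal):
    """Breadth-first search over a dict-of-sets adjacency mapping.

    Returns a list of nodes from start to goal with the fewest edges,
    or None when goal cannot be reached.
    """
    if start == goal:
        return [start]
    parents = {start: None}          # node -> node it was first reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour in parents:  # already reached by a path at least as short
                continue
            parents[neighbour] = node
            if neighbour == goal:
                # Walk the parent links back to start, then reverse.
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Same undirected graph as in the final state of the session above.
sample = {
    'B': {'D', 'G', 'C'},
    'C': {'D', 'F', 'B'},
    'D': {'C', 'E', 'B'},
    'E': {'D', 'F'},
    'F': {'E', 'C'},
    'G': {'B'},
}
shortest = find_shortest_path(sample, 'G', 'E')
```

Because breadth-first search visits nodes in order of distance from the start, the first time the goal is reached the path has the minimum number of edges. Using graph.get avoids creating empty entries when the mapping is a defaultdict.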

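Answer 1

NetworkX is an awesome Python graph library. You’ll be hard pressed to find something you need that it doesn’t already do.

And it’s open source, so you can see how they implemented their algorithms. You can also add additional algorithms.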

https://github.com/networkx/networkx/tree/master/networkx/algorithms

NetworkX is an awesome Python graph library. You’ll be hard pressed to find something you need that it doesn’t already do.

And it’s open source so you can see how they implemented their algorithms. You can also add additional algorithms.

https://github.com/networkx/networkx/tree/master/networkx/algorithms


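Answer 2

First, the choice between the classical list and matrix representations depends on the purpose (on what you want to do with the representation). The well-known problems and algorithms are related to that choice. The choice of the abstract representation more or less dictates how it should be implemented.

Second, the question is whether the vertices and edges should be expressed only in terms of existence, or whether they carry some extra information.

From the point of view of Python’s built-in data types, any value contained elsewhere is expressed as a (hidden) reference to the target object. If it is a variable (i.e. a named reference), then the name and the reference are always stored in an (internal) dictionary. If you do not need names, then the references can be stored in your own container; here a Python list will probably always be used for the list as an abstraction.

A Python list is implemented as a dynamic array of references; a Python tuple is implemented as a static array of references with constant content (the references themselves cannot be changed). Because of that, both can be easily indexed. This way, the list can also be used to implement matrices.

Another way to represent matrices is the arrays implemented by the standard module array, which are more constrained with respect to the stored type: the values are homogeneous. The elements store the values directly. (A list stores references to the value objects instead.) This way, it is more memory efficient, and access to the values can also be faster.

Sometimes you may find an even more restricted representation, such as bytearray, useful.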

First, the choice between the classical list and matrix representations depends on the purpose (on what you want to do with the representation). The well-known problems and algorithms are related to that choice. The choice of the abstract representation more or less dictates how it should be implemented.

Second, the question is whether the vertices and edges should be expressed only in terms of existence, or whether they carry some extra information.

From the point of view of Python’s built-in data types, any value contained elsewhere is expressed as a (hidden) reference to the target object. If it is a variable (i.e. a named reference), then the name and the reference are always stored in an (internal) dictionary. If you do not need names, then the references can be stored in your own container; here a Python list will probably always be used for the list as an abstraction.

A Python list is implemented as a dynamic array of references; a Python tuple is implemented as a static array of references with constant content (the references themselves cannot be changed). Because of that, both can be easily indexed. This way, the list can also be used to implement matrices.

Another way to represent matrices is the arrays implemented by the standard module array, which are more constrained with respect to the stored type: the values are homogeneous. The elements store the values directly. (A list stores references to the value objects instead.) This way, it is more memory efficient, and access to the values can also be faster.

Sometimes you may find an even more restricted representation, such as bytearray, useful.
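
To make the last point concrete, here is a minimal sketch of an adjacency matrix packed into a single flat bytearray, where the cell for row u and column v lives at index u * n + v. It assumes vertices are numbered 0 to n - 1 and that edges carry no extra information (existence only); the class and method names are made up for this illustration.

```python
class ByteMatrixGraph:
    """Adjacency matrix for n vertices stored in one flat bytearray."""

    def __init__(self, n):
        self.n = n
        self.cells = bytearray(n * n)   # one byte per (row, col) cell, all zero

    def add_edge(self, u, v, directed=False):
        self.cells[u * self.n + v] = 1
        if not directed:
            self.cells[v * self.n + u] = 1

    def has_edge(self, u, v):
        return self.cells[u * self.n + v] == 1

    def neighbours(self, u):
        # Slice out row u and keep the column indices whose flag is set.
        row = self.cells[u * self.n:(u + 1) * self.n]
        return [v for v, flag in enumerate(row) if flag]


g = ByteMatrixGraph(4)
g.add_edge(0, 1)
g.add_edge(1, 2)
```

The trade-off is the usual one for matrices: the edge test is a single index operation, but memory grows with the square of the vertex count regardless of how many edges exist, so this suits small or dense graphs better than large sparse ones.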


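Answer 3

There are two excellent graph libraries, NetworkX and igraph. You can find the source code of both libraries on GitHub, so you can always see how the functions are written. I prefer NetworkX because it is easy to understand.
Look at their code to see how the functions are implemented. You will get several ideas and can then choose how you want to build a graph using data structures.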

There are two excellent graph libraries, NetworkX and igraph. You can find the source code of both libraries on GitHub, so you can always see how the functions are written. I prefer NetworkX because it is easy to understand.
Look at their code to see how the functions are implemented. You will get several ideas and can then choose how you want to build a graph using data structures.


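How to implement a binary tree?

Question: How to implement a binary tree?

Which is the best data structure that can be used to implement a binary tree in Python?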

Which is the best data structure that can be used to implement a binary tree in Python?


Answer 0

Here is my simple recursive implementation of a binary search tree.

#!/usr/bin/python

class Node:
    def __init__(self, val):
        self.l = None
        self.r = None
        self.v = val

class Tree:
    def __init__(self):
        self.root = None

    def getRoot(self):
        return self.root

    def add(self, val):
        if self.root is None:
            self.root = Node(val)
        else:
            self._add(val, self.root)

    def _add(self, val, node):
        if val < node.v:
            if node.l is not None:
                self._add(val, node.l)
            else:
                node.l = Node(val)
        else:
            if node.r is not None:
                self._add(val, node.r)
            else:
                node.r = Node(val)

    def find(self, val):
        if self.root is not None:
            return self._find(val, self.root)
        else:
            return None

    def _find(self, val, node):
        if val == node.v:
            return node
        elif (val < node.v and node.l is not None):
            # return the recursive result, otherwise only the root can be found
            return self._find(val, node.l)
        elif (val > node.v and node.r is not None):
            return self._find(val, node.r)
        return None

    def deleteTree(self):
        # garbage collector will do this for us. 
        self.root = None

    def printTree(self):
        if self.root is not None:
            self._printTree(self.root)

    def _printTree(self, node):
        if node is not None:
            self._printTree(node.l)
            print(str(node.v) + ' ')
            self._printTree(node.r)

#     3
# 0     4
#   2      8
tree = Tree()
tree.add(3)
tree.add(4)
tree.add(0)
tree.add(8)
tree.add(2)
tree.printTree()
print(tree.find(3).v)
print(tree.find(10))
tree.deleteTree()
tree.printTree()

Here is my simple recursive implementation of a binary search tree.

#!/usr/bin/python

class Node:
    def __init__(self, val):
        self.l = None
        self.r = None
        self.v = val

class Tree:
    def __init__(self):
        self.root = None

    def getRoot(self):
        return self.root

    def add(self, val):
        if self.root is None:
            self.root = Node(val)
        else:
            self._add(val, self.root)

    def _add(self, val, node):
        if val < node.v:
            if node.l is not None:
                self._add(val, node.l)
            else:
                node.l = Node(val)
        else:
            if node.r is not None:
                self._add(val, node.r)
            else:
                node.r = Node(val)

    def find(self, val):
        if self.root is not None:
            return self._find(val, self.root)
        else:
            return None

    def _find(self, val, node):
        if val == node.v:
            return node
        elif (val < node.v and node.l is not None):
            # return the recursive result, otherwise only the root can be found
            return self._find(val, node.l)
        elif (val > node.v and node.r is not None):
            return self._find(val, node.r)
        return None

    def deleteTree(self):
        # garbage collector will do this for us. 
        self.root = None

    def printTree(self):
        if self.root is not None:
            self._printTree(self.root)

    def _printTree(self, node):
        if node is not None:
            self._printTree(node.l)
            print(str(node.v) + ' ')
            self._printTree(node.r)

#     3
# 0     4
#   2      8
tree = Tree()
tree.add(3)
tree.add(4)
tree.add(0)
tree.add(8)
tree.add(2)
tree.printTree()
print(tree.find(3).v)
print(tree.find(10))
tree.deleteTree()
tree.printTree()

Answer 1

# simple binary tree
# in this implementation, a node is inserted between an existing node and the root


class BinaryTree():

    def __init__(self,rootid):
      self.left = None
      self.right = None
      self.rootid = rootid

    def getLeftChild(self):
        return self.left
    def getRightChild(self):
        return self.right
    def setNodeValue(self,value):
        self.rootid = value
    def getNodeValue(self):
        return self.rootid

    def insertRight(self,newNode):
        if self.right == None:
            self.right = BinaryTree(newNode)
        else:
            tree = BinaryTree(newNode)
            tree.right = self.right
            self.right = tree

    def insertLeft(self,newNode):
        if self.left == None:
            self.left = BinaryTree(newNode)
        else:
            tree = BinaryTree(newNode)
            tree.left = self.left
            self.left = tree


def printTree(tree):
        if tree != None:
            printTree(tree.getLeftChild())
            print(tree.getNodeValue())
            printTree(tree.getRightChild())



# test tree

def testTree():
    myTree = BinaryTree("Maud")
    myTree.insertLeft("Bob")
    myTree.insertRight("Tony")
    myTree.insertRight("Steven")
    printTree(myTree)

Read more about it here: this is a very simple implementation of a binary tree.

This is a nice tutorial with questions in between.

# simple binary tree
# in this implementation, a node is inserted between an existing node and the root


class BinaryTree():

    def __init__(self,rootid):
      self.left = None
      self.right = None
      self.rootid = rootid

    def getLeftChild(self):
        return self.left
    def getRightChild(self):
        return self.right
    def setNodeValue(self,value):
        self.rootid = value
    def getNodeValue(self):
        return self.rootid

    def insertRight(self,newNode):
        if self.right == None:
            self.right = BinaryTree(newNode)
        else:
            tree = BinaryTree(newNode)
            tree.right = self.right
            self.right = tree

    def insertLeft(self,newNode):
        if self.left == None:
            self.left = BinaryTree(newNode)
        else:
            tree = BinaryTree(newNode)
            tree.left = self.left
            self.left = tree


def printTree(tree):
        if tree != None:
            printTree(tree.getLeftChild())
            print(tree.getNodeValue())
            printTree(tree.getRightChild())



# test tree

def testTree():
    myTree = BinaryTree("Maud")
    myTree.insertLeft("Bob")
    myTree.insertRight("Tony")
    myTree.insertRight("Steven")
    printTree(myTree)

Read more about it here: this is a very simple implementation of a binary tree.

This is a nice tutorial with questions in between.


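Answer 2

[What you need for interviews] A Node class is a sufficient data structure to represent a binary tree.

(While other answers are mostly correct, they are not required for a binary tree: no need to extend the object class, no need to be a BST, no need to import deque).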

class Node:

    def __init__(self, value = None):
        self.left  = None
        self.right = None
        self.value = value

Here is an example of a tree:

n1 = Node(1)
n2 = Node(2)
n3 = Node(3)
n1.left  = n2
n1.right = n3

In this example, n1 is the root of the tree, with n2 and n3 as its children.

[What you need for interviews] A Node class is a sufficient data structure to represent a binary tree.

(While other answers are mostly correct, they are not required for a binary tree: no need to extend object class, no need to be a BST, no need to import deque).

class Node:

    def __init__(self, value = None):
        self.left  = None
        self.right = None
        self.value = value

Here is an example of a tree:

n1 = Node(1)
n2 = Node(2)
n3 = Node(3)
n1.left  = n2
n1.right = n3

In this example n1 is the root of the tree having n2, n3 as its children.
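
To show that the bare Node class really is enough to work with, here is a small sketch of two typical interview operations on it: a pre-order traversal and a height computation. The Node class is repeated so that the snippet runs on its own; the helper names preorder and height are assumptions for this illustration and not part of the answer.

```python
class Node:
    def __init__(self, value=None):
        self.left = None
        self.right = None
        self.value = value


def preorder(node):
    """Yield values root first, then the left subtree, then the right subtree."""
    if node is not None:
        yield node.value
        yield from preorder(node.left)
        yield from preorder(node.right)


def height(node):
    """Number of nodes on the longest root-to-leaf path (0 for an empty tree)."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))


# The same three-node tree as in the example above.
n1, n2, n3 = Node(1), Node(2), Node(3)
n1.left, n1.right = n2, n3
```

Both helpers treat None as the empty tree, which is why no separate Tree wrapper class is needed.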



Answer 3

Simple implementation of a BST in Python

class TreeNode:
    def __init__(self, value):
        self.left = None
        self.right = None
        self.data = value

class Tree:
    def __init__(self):
        self.root = None

    def addNode(self, node, value):
        if(node==None):
            self.root = TreeNode(value)
        else:
            if(value<node.data):
                if(node.left==None):
                    node.left = TreeNode(value)
                else:
                    self.addNode(node.left, value)
            else:
                if(node.right==None):
                    node.right = TreeNode(value)
                else:
                    self.addNode(node.right, value)

    def printInorder(self, node):
        if(node!=None):
            self.printInorder(node.left)
            print(node.data)
            self.printInorder(node.right)

def main():
    testTree = Tree()
    testTree.addNode(testTree.root, 200)
    testTree.addNode(testTree.root, 300)
    testTree.addNode(testTree.root, 100)
    testTree.addNode(testTree.root, 30)
    testTree.printInorder(testTree.root)

Simple implementation of BST in Python

class TreeNode:
    def __init__(self, value):
        self.left = None
        self.right = None
        self.data = value

class Tree:
    def __init__(self):
        self.root = None

    def addNode(self, node, value):
        if(node==None):
            self.root = TreeNode(value)
        else:
            if(value<node.data):
                if(node.left==None):
                    node.left = TreeNode(value)
                else:
                    self.addNode(node.left, value)
            else:
                if(node.right==None):
                    node.right = TreeNode(value)
                else:
                    self.addNode(node.right, value)

    def printInorder(self, node):
        if(node!=None):
            self.printInorder(node.left)
            print(node.data)
            self.printInorder(node.right)

def main():
    testTree = Tree()
    testTree.addNode(testTree.root, 200)
    testTree.addNode(testTree.root, 300)
    testTree.addNode(testTree.root, 100)
    testTree.addNode(testTree.root, 30)
    testTree.printInorder(testTree.root)

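Answer 4

A very quick ‘n’ dirty way of implementing a binary tree using lists. It is not the most efficient, nor does it handle nil values all too well. But it’s very transparent (at least to me):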

def _add(node, v):
    new = [v, [], []]
    if node:
        left, right = node[1:]
        if not left:
            left.extend(new)
        elif not right:
            right.extend(new)
        else:
            _add(left, v)
    else:
        node.extend(new)

def binary_tree(s):
    root = []
    for e in s:
        _add(root, e)
    return root

def traverse(n, order):
    if n:
        v = n[0]
        if order == 'pre':
            yield v
        for left in traverse(n[1], order):
            yield left
        if order == 'in':
            yield v
        for right in traverse(n[2], order):
            yield right
        if order == 'post':
            yield v

Constructing a tree from an iterable:

 >>> tree = binary_tree('A B C D E'.split())
 >>> print(tree)
 ['A', ['B', ['D', [], []], ['E', [], []]], ['C', [], []]]

Traversing a tree:

 >>> list(traverse(tree, 'pre')), list(traverse(tree, 'in')), list(traverse(tree, 'post'))
 (['A', 'B', 'D', 'E', 'C'],
  ['D', 'B', 'E', 'A', 'C'],
  ['D', 'E', 'B', 'C', 'A'])

A very quick ‘n dirty way of implementing a binary tree using lists. Not the most efficient, nor does it handle nil values all too well. But it’s very transparent (at least to me):

def _add(node, v):
    new = [v, [], []]
    if node:
        left, right = node[1:]
        if not left:
            left.extend(new)
        elif not right:
            right.extend(new)
        else:
            _add(left, v)
    else:
        node.extend(new)

def binary_tree(s):
    root = []
    for e in s:
        _add(root, e)
    return root

def traverse(n, order):
    if n:
        v = n[0]
        if order == 'pre':
            yield v
        for left in traverse(n[1], order):
            yield left
        if order == 'in':
            yield v
        for right in traverse(n[2], order):
            yield right
        if order == 'post':
            yield v

Constructing a tree from an iterable:

 >>> tree = binary_tree('A B C D E'.split())
 >>> print(tree)
 ['A', ['B', ['D', [], []], ['E', [], []]], ['C', [], []]]

Traversing a tree:

 >>> list(traverse(tree, 'pre')), list(traverse(tree, 'in')), list(traverse(tree, 'post'))
 (['A', 'B', 'D', 'E', 'C'],
  ['D', 'B', 'E', 'A', 'C'],
  ['D', 'E', 'B', 'C', 'A'])
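
Note that `_add` above always recurses into the left subtree once both children of a node are filled, so a longer input produces a left-leaning tree. As a possible variation (a sketch that is not part of the original answer; the names `add_level_order` and `binary_tree_level_order` are made up for illustration), the same `[value, left, right]` lists can be filled level by level with a queue:

```python
from collections import deque

def add_level_order(root, v):
    """Put v into the first empty slot found in breadth-first order."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if not node:
            # An empty list marks a free slot: fill it in place.
            node.extend([v, [], []])
            return
        queue.append(node[1])  # left child
        queue.append(node[2])  # right child

def binary_tree_level_order(values):
    root = []
    for v in values:
        add_level_order(root, v)
    return root

tree = binary_tree_level_order('A B C D E F'.split())
print(tree)
```

With six values, the sixth one ('F') ends up as the left child of 'C' instead of deep under 'B'. The trade-off is that every insertion walks the tree from the root, so building a tree this way is quadratic in the number of values.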

Answer 5

I can’t help but notice that most answers here are implementing a Binary Search Tree. Binary Search Tree != Binary Tree.

  • A Binary Search Tree has a very specific property: for any node X, X’s key is larger than every key in its left subtree, and smaller than every key in its right subtree.

  • A Binary Tree imposes no such restriction. A Binary Tree is simply a data structure with a ‘key’ element, and two children, say ‘left’ and ‘right’.

  • A Tree is an even more general case of a Binary Tree where each node can have an arbitrary number of children. Typically, each node has a ‘children’ element which is of type list/array.
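
To make the distinction concrete, here is a minimal sketch (not part of the original answer) that checks whether a given binary tree also satisfies the binary search tree property. The `Node` class and the `is_bst` helper are hypothetical names used only for this illustration:

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def is_bst(node, low=None, high=None):
    """Return True if every key lies strictly between low and high."""
    if node is None:
        return True
    if low is not None and node.key <= low:
        return False
    if high is not None and node.key >= high:
        return False
    # Keys in the left subtree must stay below node.key,
    # keys in the right subtree must stay above it.
    return (is_bst(node.left, low, node.key) and
            is_bst(node.right, node.key, high))

# Both are valid binary trees, but only the first is a search tree:
ordered = Node(5, Node(3, Node(2), Node(4)), Node(8))
unordered = Node(5, Node(3, None, Node(6)), Node(8))  # 6 sits left of 5
print(is_bst(ordered), is_bst(unordered))
```

The second tree is rejected even though each node is larger than its own left child and smaller than its own right child; the property has to hold for whole subtrees, which is why the bounds are passed down the recursion.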

Now, to answer the OP’s question, I am including a full implementation of a Binary Tree in Python. The underlying data structure storing each BinaryTreeNode is a dictionary, since it offers O(1) average-case lookups. I’ve also implemented depth-first (pre-, in-, and post-order) and breadth-first traversals, which are very common operations on trees.

from collections import deque

class BinaryTreeNode:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

    def __repr__(self):
        return "%s l: (%s) r: (%s)" % (self.key, self.left, self.right)

    def __eq__(self, other):
        # Anything that is not a node (e.g. None) is simply "not equal".
        if not isinstance(other, BinaryTreeNode):
            return NotImplemented
        return (self.key == other.key and
                self.right == other.right and
                self.left == other.left)

class BinaryTree:
    def __init__(self, root_key=None):
        # maps from BinaryTreeNode key to BinaryTreeNode instance.
        # Thus, BinaryTreeNode keys must be unique.
        self.nodes = {}
        # always define root so height() and __repr__ work on an empty tree
        self.root = None
        if root_key is not None:
            # create a root BinaryTreeNode
            self.root = BinaryTreeNode(root_key)
            self.nodes[root_key] = self.root

    def add(self, key, left_key=None, right_key=None):
        if key not in self.nodes:
            # BinaryTreeNode with given key does not exist, create it
            self.nodes[key] = BinaryTreeNode(key)
        # invariant: self.nodes[key] exists

        # handle left child
        if left_key is None:
            self.nodes[key].left = None
        else:
            if left_key not in self.nodes:
                self.nodes[left_key] = BinaryTreeNode(left_key)
            # invariant: self.nodes[left_key] exists
            self.nodes[key].left = self.nodes[left_key]

        # handle right child
        if right_key is None:
            self.nodes[key].right = None
        else:
            if right_key not in self.nodes:
                self.nodes[right_key] = BinaryTreeNode(right_key)
            # invariant: self.nodes[right_key] exists
            self.nodes[key].right = self.nodes[right_key]

    def remove(self, key):
        if key not in self.nodes:
            raise ValueError('%s not in tree' % key)
        # remove key from the list of nodes
        del self.nodes[key]
        # if node removed is left/right child, update parent node
        for k in self.nodes:
            if self.nodes[k].left and self.nodes[k].left.key == key:
                self.nodes[k].left = None
            if self.nodes[k].right and self.nodes[k].right.key == key:
                self.nodes[k].right = None
        return True

    def _height(self, node):
        if node is None:
            return 0
        else:
            return 1 + max(self._height(node.left), self._height(node.right))

    def height(self):
        return self._height(self.root)

    def size(self):
        return len(self.nodes)

    def __repr__(self):
        return str(self.traverse_inorder(self.root))

    def bfs(self, node):
        # self.nodes is keyed by node key, so look up the key, not the node
        if not node or node.key not in self.nodes:
            return
        reachable = []    
        q = deque()
        # add starting node to queue
        q.append(node)
        while len(q):
            visit = q.popleft()
            # add currently visited BinaryTreeNode to list
            reachable.append(visit)
            # add left/right children as needed
            if visit.left:
                q.append(visit.left)
            if visit.right:
                q.append(visit.right)
        return reachable

    # visit left child, root, then right child
    def traverse_inorder(self, node, reachable=None):
        if not node or node.key not in self.nodes:
            return
        if reachable is None:
            reachable = []
        self.traverse_inorder(node.left, reachable)
        reachable.append(node.key)
        self.traverse_inorder(node.right, reachable)
        return reachable

    # visit left and right children, then root
    # root of tree is always last to be visited
    def traverse_postorder(self, node, reachable=None):
        if not node or node.key not in self.nodes:
            return
        if reachable is None:
            reachable = []
        self.traverse_postorder(node.left, reachable)
        self.traverse_postorder(node.right, reachable)
        reachable.append(node.key)
        return reachable

    # visit root, left, then right children
    # root is always visited first
    def traverse_preorder(self, node, reachable=None):
        if not node or node.key not in self.nodes:
            return
        if reachable is None:
            reachable = []
        reachable.append(node.key)
        self.traverse_preorder(node.left, reachable)
        self.traverse_preorder(node.right, reachable)
        return reachable

Answer 6

You don’t need to have two classes:

class Tree:
    val = None
    left = None
    right = None

    def __init__(self, val):
        self.val = val


    def insert(self, val):
        if self.val is not None:
            if val < self.val:
                if self.left is not None:
                    self.left.insert(val)
                else:
                    self.left = Tree(val)
            elif val > self.val:
                if self.right is not None:
                    self.right.insert(val)
                else:
                    self.right = Tree(val)
            else:
                return
        else:
            self.val = val
            print("new node added")

    def showTree(self):
        if self.left is not None:
            self.left.showTree()
        print(self.val, end = ' ')
        if self.right is not None:
            self.right.showTree()

Answer 7

A little more “Pythonic”?

class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

    def __repr__(self):
        return str(self.value)



class BST:
    def __init__(self):
        self.root = None

    def __repr__(self):
        self.sorted = []
        self.get_inorder(self.root)
        return str(self.sorted)

    def get_inorder(self, node):
        if node:
            self.get_inorder(node.left)
            self.sorted.append(str(node.value))
            self.get_inorder(node.right)

    def add(self, value):
        if not self.root:
            self.root = Node(value)
        else:
            self._add(self.root, value)

    def _add(self, node, value):
        if value <= node.value:
            if node.left:
                self._add(node.left, value)
            else:
                node.left = Node(value)
        else:
            if node.right:
                self._add(node.right, value)
            else:
                node.right = Node(value)



from random import randint

bst = BST()

for i in range(100):
    bst.add(randint(1, 1000))
print (bst)

Answer 8

#!/usr/bin/python

class BinaryTree:
    def __init__(self, left, right, data):
        self.left = left
        self.right = right
        self.data = data


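# The traversals are module-level functions: they take the node to start
# from and call themselves recursively by name.
def pre_order_traversal(root):
    # node first, then left and right subtrees
    print(root.data, end=' ')

    if root.left is not None:
        pre_order_traversal(root.left)

    if root.right is not None:
        pre_order_traversal(root.right)

def in_order_traversal(root):
    # left subtree, node, then right subtree
    if root.left is not None:
        in_order_traversal(root.left)
    print(root.data, end=' ')
    if root.right is not None:
        in_order_traversal(root.right)

def post_order_traversal(root):
    # left and right subtrees first, node last
    if root.left is not None:
        post_order_traversal(root.left)
    if root.right is not None:
        post_order_traversal(root.right)
    print(root.data, end=' ')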

Answer 9

A Node-based class of connected nodes is a standard approach. These can be hard to visualize.

Inspired by the essay Python Patterns – Implementing Graphs, consider a simple dictionary instead:

Given

A binary tree

               a
              / \
             b   c
            / \   \
           d   e   f

Code

Make a dictionary of unique nodes:

tree = {
   "a": ["b", "c"],
   "b": ["d", "e"],
   "c": [None, "f"],
   "d": [None, None],
   "e": [None, None],
   "f": [None, None],
}

Details

  • Each key-value pair is a unique node pointing to its children.
  • A list (or tuple) holds an ordered pair of left/right children.
  • With a dict having ordered insertion, assume the first entry is the root.
  • Common methods can be functions that mutate or traverse the dict (see find_all_paths()).

Tree-based functions often include the following common operations:

  • traverse: yield each node in a given order (usually left-to-right)
    • breadth-first search (BFS): traverse levels
    • depth-first search (DFS): traverse branches first (pre-/in-/post-order)
  • insert: add a node to the tree depending on the number of children
  • remove: remove a node depending on the number of children
  • update: merge missing nodes from one tree to the other
  • visit: yield the value of a traversed node

Try implementing all of these operations. Here we demonstrate one of these functions – a BFS traversal:

Example

import collections as ct


def traverse(tree):
    """Yield nodes from a tree via BFS."""
    q = ct.deque()                                         # 1
    root = next(iter(tree))                                # 2
    q.append(root)

    while q:
        node = q.popleft()
        children = filter(None, tree.get(node))
        for n in children:                                 # 3 
            q.append(n)
        yield node

list(traverse(tree))
# ['a', 'b', 'c', 'd', 'e', 'f']

This is a breadth-first search (level-order) algorithm applied to a dict of nodes and children.

  1. Initialize a FIFO queue. We use a deque, but a queue or a list works (the latter is inefficient).
  2. Get and enqueue the root node (assumes the root is the first entry in the dict, Python 3.6+).
  3. Iteratively dequeue a node, enqueue its children and yield the node value.

See also this in-depth tutorial on trees.


Insight

A nice property of traversals in general is that the iterative approach above can easily be turned into a depth-first search (DFS) by replacing the queue with a stack (a.k.a. a LIFO queue). This simply means we dequeue from the same side that we enqueue. DFS lets us follow each branch to its end before moving on to the next one.

How? Since we are using a deque, we can emulate a stack by changing node = q.popleft() to node = q.pop() (right). The result is a right-favored, pre-ordered DFS: ['a', 'c', 'f', 'b', 'e', 'd'].
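
For completeness, here is that one-line change written out as a self-contained sketch. The function name `traverse_dfs` is chosen here for illustration; everything else mirrors the BFS example:

```python
import collections as ct

tree = {
    "a": ["b", "c"],
    "b": ["d", "e"],
    "c": [None, "f"],
    "d": [None, None],
    "e": [None, None],
    "f": [None, None],
}

def traverse_dfs(tree):
    """Yield nodes from a tree via an iterative, right-favored DFS."""
    q = ct.deque()
    root = next(iter(tree))      # assumes the first dict entry is the root
    q.append(root)

    while q:
        node = q.pop()           # pop from the right: the deque acts as a stack
        children = filter(None, tree.get(node))
        for n in children:
            q.append(n)
        yield node

print(list(traverse_dfs(tree)))
# ['a', 'c', 'f', 'b', 'e', 'd']
```

To get the more conventional left-favored order, one option is to push the children in reverse (right child first), so that the left child is popped first.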


Answer 10

import random

class TreeNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.p = None

class BinaryTree:
    def __init__(self):
        self.root = None

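    def length(self):
        # No size counter is maintained, so count the nodes by walking the tree.
        count, stack = 0, [self.root]
        while stack:
            node = stack.pop()
            if node is not None:
                count += 1
                stack.extend((node.left, node.right))
        return count

    def inorder(self, node):
        if node is None:
            return None
        else:
            self.inorder(node.left)
            print(node.key, end=' ')
            self.inorder(node.right)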

    def search(self, k):
        node = self.root
        while node != None:
            if node.key == k:
                return node
            if node.key > k:
                node = node.left
            else:
                node = node.right
        return None

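    def minimum(self, node):
        # Follow left links; a node without a left child is its own minimum.
        while node.left is not None:
            node = node.left
        return node

    def maximum(self, node):
        # Follow right links; a node without a right child is its own maximum.
        while node.right is not None:
            node = node.right
        return node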

    def successor(self, node):
        parent = None
        if node.right != None:
            return self.minimum(node.right)
        parent = node.p
        while parent != None and node == parent.right:
            node = parent
            parent = parent.p
        return parent

    def predecessor(self, node):
        parent = None
        if node.left != None:
            return self.maximum(node.left)
        parent = node.p
        while parent != None and node == parent.left:
            node = parent
            parent = parent.p
        return parent

    def insert(self, k):
        t = TreeNode(k)
        parent = None
        node = self.root
        while node != None:
            parent = node
            if node.key > t.key:
                node = node.left
            else:
                node = node.right
        t.p = parent
        if parent == None:
            self.root = t
        elif t.key < parent.key:
            parent.left = t
        else:
            parent.right = t
        return t


    def delete(self, node):
        if node.left == None:
            self.transplant(node, node.right)
        elif node.right == None:
            self.transplant(node, node.left)
        else:
            succ = self.minimum(node.right)
            if succ.p != node:
                self.transplant(succ, succ.right)
                succ.right = node.right
                succ.right.p = succ
            self.transplant(node, succ)
            succ.left = node.left
            succ.left.p = succ

    def transplant(self, node, newnode):
        if node.p == None:
            self.root = newnode
        elif node == node.p.left:
            node.p.left = newnode
        else:
            node.p.right = newnode
        if newnode != None:
            newnode.p = node.p

Answer 11

This implementation supports insert, find and delete operations without destroying the structure of the tree. This is not a balanced tree.

# Class for constructing the nodes of the tree (subtrees).
class Node:
    def __init__(self, key, parent_node=None):
        self.left = None
        self.right = None
        self.key = key
        # The root node is its own parent.
        if parent_node is None:
            self.parent = self
        else:
            self.parent = parent_node

# Class with the structure of the tree.
# This tree is not balanced.
class Tree:
    def __init__(self):
        self.root = None

    # Insert a single element.
    def insert(self, x):
        if self.root is None:
            self.root = Node(x)
        else:
            self._insert(x, self.root)

    def _insert(self, x, node):
        if x < node.key:
            if node.left is None:
                node.left = Node(x, node)
            else:
                self._insert(x, node.left)
        else:
            if node.right is None:
                node.right = Node(x, node)
            else:
                self._insert(x, node.right)

    # Given an element, return a node in the tree with key x (or None).
    def find(self, x):
        if self.root is None:
            return None
        else:
            return self._find(x, self.root)

    def _find(self, x, node):
        if x == node.key:
            return node
        elif x < node.key:
            if node.left is None:
                return None
            else:
                return self._find(x, node.left)
        else:
            if node.right is None:
                return None
            else:
                return self._find(x, node.right)

    # Given a node, return the node in the tree with the next largest element.
    def next(self, node):
        if node.right is not None:
            return self._left_descendant(node.right)
        else:
            return self._right_ancestor(node)

    def _left_descendant(self, node):
        if node.left is None:
            return node
        else:
            return self._left_descendant(node.left)

    def _right_ancestor(self, node):
        if node.parent is node:
            return None  # reached the root: there is no larger element
        if node is node.parent.left:
            return node.parent
        else:
            return self._right_ancestor(node.parent)

    # Delete an element of the tree.
    def delete(self, x):
        node = self.find(x)
        if node is None:
            print(x, "isn't in the tree")
            return
        if node.left is not None and node.right is not None:
            # Two children: take over the successor's key,
            # then unlink the successor node instead.
            succ = self._left_descendant(node.right)
            node.key = succ.key
            node = succ
        # At this point node has at most one child.
        child = node.left if node.left is not None else node.right
        if node is self.root:
            self.root = child
            if child is not None:
                child.parent = child  # the new root is its own parent
        else:
            if node is node.parent.left:
                node.parent.left = child
            else:
                node.parent.right = child
            if child is not None:
                child.parent = node.parent


# tests
t = Tree()
t.insert(5)
t.insert(8)
t.insert(3)
t.insert(4)
t.insert(6)
t.insert(2)

t.delete(8)
t.delete(5)

t.insert(9)
t.insert(1)

t.delete(2)
t.delete(100)

# Remember: the find method returns the node object.
# To get the number itself, use t.find(n).key
# But that raises an error if the number is not in the tree.
print(t.find(5)) 
print(t.find(8))
print(t.find(4))
print(t.find(6))
print(t.find(9))
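
The comment above warns that reading `.key` from the result of `find` fails when the value is absent, because `find` returns `None`. A minimal sketch of one way to guard against that, assuming a node object with a `key` attribute; the `key_or_default` helper and the stand-in `Node` class are hypothetical and not part of the answer's code:

```python
class Node:
    """Stand-in node with only the attribute this sketch needs (hypothetical)."""
    def __init__(self, key):
        self.key = key


def key_or_default(node, default=None):
    """Return node.key, or `default` when the lookup returned None."""
    return default if node is None else node.key


found = Node(4)   # what a successful lookup might return
missing = None    # what a lookup returns for an absent value

print(key_or_default(found))        # 4
print(key_or_default(missing, -1))  # -1
```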

Answer 12

I know many good solutions have already been posted, but I usually take a different approach to binary trees. Using a Node class and implementing the tree directly is more readable, but with a lot of nodes it can become very greedy with memory. So I suggest adding one layer of complexity: store the nodes in a Python list and simulate the tree's behavior using only that list.

You can still define a Node class to represent the nodes of the tree when needed, but keeping them in a simple form such as [value, left, right] in a list will use half the memory or less!
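
To make the list layout concrete, here is a minimal sketch, assuming each row is `[value, parent, left, right]` with the links stored as list indexes and `None` meaning "no link". The `rows` data and the `contains` helper are illustrative only:

```python
# Each row is [value, parent_index, left_index, right_index].
#        5
#       / \
#      3   8
rows = [
    [5, None, 1, 2],     # index 0: root
    [3, 0, None, None],  # index 1: left child of the root
    [8, 0, None, None],  # index 2: right child of the root
]

VALUE, PARENT, LEFT, RIGHT = range(4)


def contains(rows, value):
    """Walk the index links from the root (index 0) without recursion."""
    index = 0 if rows else None
    while index is not None:
        row = rows[index]
        if value == row[VALUE]:
            return True
        index = row[LEFT] if value < row[VALUE] else row[RIGHT]
    return False


print(contains(rows, 8))  # True
print(contains(rows, 4))  # False
```

No node objects are created per element; a "pointer" is just an integer position in the list.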

Here is a quick example of a Binary Search Tree class that stores its nodes in an array. It provides basic functions such as add, remove, find…

"""
Basic Binary Search Tree class without recursion...
"""

__author__ = "@fbparis"

class Node(object):
    __slots__ = "value", "parent", "left", "right"
    def __init__(self, value, parent=None, left=None, right=None):
        self.value = value
        self.parent = parent
        self.left = left
        self.right = right

    def __repr__(self):
        return "<%s object at %s: parent=%s, left=%s, right=%s, value=%s>" % (self.__class__.__name__, hex(id(self)), self.parent, self.left, self.right, self.value)

class BinarySearchTree(object):
    __slots__ = "_tree"
    def __init__(self, *args):
        self._tree = []
        if args:
            for x in args[0]:
                self.add(x)

    def __len__(self):
        return len(self._tree)

    def __repr__(self):
        return "<%s object at %s with %d nodes>" % (self.__class__.__name__, hex(id(self)), len(self))

    def __str__(self, nodes=None, level=0):
        ret = ""
        if nodes is None:
            if len(self):
                nodes = [0]
            else:
                nodes = []
        for node in nodes:
            if node is None:
                continue
            ret += "-" * level + " %s\n" % self._tree[node][0]
            ret += self.__str__(self._tree[node][2:4], level + 1)
        if level == 0:
            ret = ret.strip()
        return ret

    def __contains__(self, value):
        if len(self):
            node_index = 0
            while self._tree[node_index][0] != value:
                if value < self._tree[node_index][0]:
                    node_index = self._tree[node_index][2]
                else:
                    node_index = self._tree[node_index][3]
                if node_index is None:
                    return False
            return True
        return False

    def __eq__(self, other):
        return self._tree == other._tree

    def add(self, value):
        if len(self):
            node_index = 0
            while self._tree[node_index][0] != value:
                if value < self._tree[node_index][0]:
                    b = self._tree[node_index][2]
                    k = 2
                else:
                    b = self._tree[node_index][3]
                    k = 3
                if b is None:
                    self._tree[node_index][k] = len(self)
                    self._tree.append([value, node_index, None, None])
                    break
                node_index = b
        else:
            self._tree.append([value, None, None, None])

    def remove(self, value):
        if len(self):
            node_index = 0
            while self._tree[node_index][0] != value:
                if value < self._tree[node_index][0]:
                    node_index = self._tree[node_index][2]
                else:
                    node_index = self._tree[node_index][3]
                if node_index is None:
                    raise KeyError
            if self._tree[node_index][2] is not None:
                b, d = 2, 3
            elif self._tree[node_index][3] is not None:
                b, d = 3, 2
            else:
                i = node_index
                b = None
            if b is not None:
                i = self._tree[node_index][b]
                while self._tree[i][d] is not None:
                    i = self._tree[i][d]
                p = self._tree[i][1]
                b = self._tree[i][b]
                if p == node_index:
                    self._tree[p][5-d] = b
                else:
                    self._tree[p][d] = b
                if b is not None:
                    self._tree[b][1] = p
                self._tree[node_index][0] = self._tree[i][0]
            else:
                p = self._tree[i][1]
                if p is not None:
                    if self._tree[p][2] == i:
                        self._tree[p][2] = None
                    else:
                        self._tree[p][3] = None
            last = self._tree.pop()
            n = len(self)
            if i < n:
                self._tree[i] = last[:]
                if last[2] is not None:
                    self._tree[last[2]][1] = i
                if last[3] is not None:
                    self._tree[last[3]][1] = i
                if self._tree[last[1]][2] == n:
                    self._tree[last[1]][2] = i
                else:
                    self._tree[last[1]][3] = i
        else:
            raise KeyError

    def find(self, value):
        if len(self):
            node_index = 0
            while self._tree[node_index][0] != value:
                if value < self._tree[node_index][0]:
                    node_index = self._tree[node_index][2]
                else:
                    node_index = self._tree[node_index][3]
                if node_index is None:
                    return None
            return Node(*self._tree[node_index])
        return None

I’ve added a parent attribute so that you can remove any node and maintain the BST structure.

Sorry for the readability, especially of the “remove” function. Basically, when a node is removed, we pop the last element of the tree array and move it into the freed slot (unless the removed node was itself the last element). To maintain the BST structure, the removed node's value is replaced with the max of its left subtree or the min of its right subtree, and some bookkeeping is needed to keep the indexes valid, but it’s fast enough.

I used this technique for more advanced stuff, building some large word dictionaries with an internal radix trie, and I was able to divide memory consumption by 7-8 (you can see an example here: https://gist.github.com/fbparis/b3ddd5673b603b42c880974b23db7cda)
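
The compaction step of this approach (fill the freed slot with the last row, then repair the links that pointed at the moved row) can be sketched separately from the BST logic. This assumes rows of the form `[value, parent, left, right]` and that the row being dropped has already been unlinked from the tree; `drop_row` is a hypothetical helper, not the author's code:

```python
VALUE, PARENT, LEFT, RIGHT = range(4)


def drop_row(rows, i):
    """Remove rows[i] (already unlinked from the tree) and keep the list dense."""
    last = rows.pop()
    n = len(rows)  # old index of the row that was just popped
    if i == n:
        return     # the unlinked row was the last one: nothing to move
    rows[i] = last
    # Children of the moved row must point back at its new index.
    for side in (LEFT, RIGHT):
        child = last[side]
        if child is not None:
            rows[child][PARENT] = i
    # The parent of the moved row must point at its new index too.
    parent = last[PARENT]
    if parent is not None:
        side = LEFT if rows[parent][LEFT] == n else RIGHT
        rows[parent][side] = i


# Tree: 5 has right child 8, and 8 has left child 7.
# The leaf 3 at index 1 has already been unlinked from its parent.
rows = [
    [5, None, None, 2],
    [3, 0, None, None],
    [8, 0, 3, None],
    [7, 2, None, None],
]
drop_row(rows, 1)
print(rows)
```

After the call, the row for 7 lives at index 1 and its parent (the row for 8) points there, so no index in the list is left unused.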


Answer 13

A good implementation of binary search tree, taken from here:

'''
A binary search Tree
'''
from __future__ import print_function
class Node:

    def __init__(self, label, parent):
        self.label = label
        self.left = None
        self.right = None
        #Added in order to delete a node easier
        self.parent = parent

    def getLabel(self):
        return self.label

    def setLabel(self, label):
        self.label = label

    def getLeft(self):
        return self.left

    def setLeft(self, left):
        self.left = left

    def getRight(self):
        return self.right

    def setRight(self, right):
        self.right = right

    def getParent(self):
        return self.parent

    def setParent(self, parent):
        self.parent = parent

class BinarySearchTree:

    def __init__(self):
        self.root = None

    def insert(self, label):
        # Create a new Node
        new_node = Node(label, None)
        # If Tree is empty
        if self.empty():
            self.root = new_node
        else:
            #If Tree is not empty
            curr_node = self.root
            #While we don't get to a leaf
            while curr_node is not None:
                #We keep reference of the parent node
                parent_node = curr_node
                #If node label is less than current node
                if new_node.getLabel() < curr_node.getLabel():
                    #We go left
                    curr_node = curr_node.getLeft()
                else:
                    #Else we go right
                    curr_node = curr_node.getRight()
            #We insert the new node in a leaf
            if new_node.getLabel() < parent_node.getLabel():
                parent_node.setLeft(new_node)
            else:
                parent_node.setRight(new_node)
            #Set parent to the new node
            new_node.setParent(parent_node)      

    def delete(self, label):
        if (not self.empty()):
            #Look for the node with that label
            node = self.getNode(label)
            #If the node exists
            if(node is not None):
                #If it has no children
                if(node.getLeft() is None and node.getRight() is None):
                    self.__reassignNodes(node, None)
                    node = None
                #Has only right children
                elif(node.getLeft() is None and node.getRight() is not None):
                    self.__reassignNodes(node, node.getRight())
                #Has only left children
                elif(node.getLeft() is not None and node.getRight() is None):
                    self.__reassignNodes(node, node.getLeft())
                #Has two children
                else:
                    #Gets the max value of the left branch
                    tmpNode = self.getMax(node.getLeft())
                    #Deletes the tmpNode
                    self.delete(tmpNode.getLabel())
                    #Assigns the value to the node to delete and keeps tree structure
                    node.setLabel(tmpNode.getLabel())

    def getNode(self, label):
        curr_node = None
        #If the tree is not empty
        if(not self.empty()):
            #Get tree root
            curr_node = self.getRoot()
            #While we don't find the node we look for
            #I am using lazy evaluation here to avoid NoneType Attribute error
            while curr_node is not None and curr_node.getLabel() != label:
                #If node label is less than current node
                if label < curr_node.getLabel():
                    #We go left
                    curr_node = curr_node.getLeft()
                else:
                    #Else we go right
                    curr_node = curr_node.getRight()
        return curr_node

    def getMax(self, root = None):
        if(root is not None):
            curr_node = root
        else:
            #We go deep on the right branch
            curr_node = self.getRoot()
        if(not self.empty()):
            while(curr_node.getRight() is not None):
                curr_node = curr_node.getRight()
        return curr_node

    def getMin(self, root = None):
        if(root is not None):
            curr_node = root
        else:
            #We go deep on the left branch
            curr_node = self.getRoot()
        if(not self.empty()):
            while(curr_node.getLeft() is not None):
                curr_node = curr_node.getLeft()
        return curr_node

    def empty(self):
        if self.root is None:
            return True
        return False

    def __InOrderTraversal(self, curr_node):
        #Left subtree first, then the node itself, then the right subtree
        nodeList = []
        if curr_node is not None:
            nodeList = nodeList + self.__InOrderTraversal(curr_node.getLeft())
            nodeList.append(curr_node)
            nodeList = nodeList + self.__InOrderTraversal(curr_node.getRight())
        return nodeList

    def getRoot(self):
        return self.root

    def __isRightChildren(self, node):
        if(node == node.getParent().getRight()):
            return True
        return False

    def __reassignNodes(self, node, newChildren):
        if(newChildren is not None):
            newChildren.setParent(node.getParent())
        if(node.getParent() is not None):
            #If it is the Right Children
            if(self.__isRightChildren(node)):
                node.getParent().setRight(newChildren)
            else:
                #Else it is the left children
                node.getParent().setLeft(newChildren)
        else:
            #The node is the root, so its replacement becomes the new root
            self.root = newChildren

    #This function traverses the tree. By default it returns an
    #in-order traversal list. You can pass a function that traverses
    #the tree as needed by client code
    def traversalTree(self, traversalFunction = None, root = None):
        if(traversalFunction is None):
            #Returns a list of nodes in order by default
            return self.__InOrderTraversal(self.root)
        else:
            #Returns a list of nodes in the order that the users wants to
            return traversalFunction(self.root)

    #Returns an string of all the nodes labels in the list 
    #In Order Traversal
    def __str__(self):
        list = self.__InOrderTraversal(self.root)
        str = ""
        for x in list:
            str = str + " " + x.getLabel().__str__()
        return str

def InPreOrder(curr_node):
    nodeList = []
    if curr_node is not None:
        nodeList = nodeList + InPreOrder(curr_node.getLeft())
        nodeList.insert(0, curr_node.getLabel())
        nodeList = nodeList + InPreOrder(curr_node.getRight())
    return nodeList

def testBinarySearchTree():
    r'''
    Example
                  8
                 / \
                3   10
               / \    \
              1   6    14
                 / \   /
                4   7 13 
    '''

    r'''
    Example After Deletion
                  7
                 / \
                1   4

    '''
    t = BinarySearchTree()
    t.insert(8)
    t.insert(3)
    t.insert(6)
    t.insert(1)
    t.insert(10)
    t.insert(14)
    t.insert(13)
    t.insert(4)
    t.insert(7)

    #Prints all the elements of the list in order traversal
    print(t.__str__())

    if(t.getNode(6) is not None):
        print("The label 6 exists")
    else:
        print("The label 6 doesn't exist")

    if(t.getNode(-1) is not None):
        print("The label -1 exists")
    else:
        print("The label -1 doesn't exist")

    if(not t.empty()):
        print(("Max Value: ", t.getMax().getLabel()))
        print(("Min Value: ", t.getMin().getLabel()))

    t.delete(13)
    t.delete(10)
    t.delete(8)
    t.delete(3)
    t.delete(6)
    t.delete(14)

    #Gets all the elements of the tree In pre order
    #And it prints them
    list = t.traversalTree(InPreOrder, t.root)
    for x in list:
        print(x)

if __name__ == "__main__":
    testBinarySearchTree()
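
Since the class above names its default traversal "in order" and also accepts a custom traversal function, it may help to see how the visiting orders differ. A minimal sketch on a stand-alone node type (the `N` class and both functions are hypothetical, not part of the quoted implementation); for a binary search tree, in-order visiting yields the labels sorted:

```python
class N:
    """Tiny stand-in node (hypothetical)."""
    def __init__(self, label, left=None, right=None):
        self.label = label
        self.left = left
        self.right = right


def in_order(node):
    """Left subtree, node, right subtree."""
    if node is None:
        return []
    return in_order(node.left) + [node.label] + in_order(node.right)


def pre_order(node):
    """Node, left subtree, right subtree."""
    if node is None:
        return []
    return [node.label] + pre_order(node.left) + pre_order(node.right)


#        8
#       / \
#      3   10
#     / \
#    1   6
root = N(8, N(3, N(1), N(6)), N(10))

print(in_order(root))   # [1, 3, 6, 8, 10]
print(pre_order(root))  # [8, 3, 1, 6, 10]
```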

Answer 14

I want to show a variation of @apadana’s method, which is more useful when there is a considerable number of nodes:

'''
Suppose we have the following tree
      10
    /    \
  11      9
 /  \     / \
7   12  15   8
'''
# Step 1 - Create nodes - Use a list instead of defining each node separately
nlist = [10,11,7,9,15,8,12]
n = [Node(value) for value in nlist]

# Step 2 - Set each node position
n[0].left  = n[1]
n[1].left = n[2]
n[0].right = n[3]
n[3].left = n[4]
n[3].right = n[5]
n[1].right = n[6]
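
The snippet above relies on a `Node` class that is not defined in this answer (presumably the one from @apadana's answer). A self-contained sketch of the same idea, assuming a minimal `Node` with `data`, `left` and `right` attributes; the attribute name `data`, the `links` table and the `level_order` helper are assumptions made for illustration:

```python
from collections import deque


class Node:
    """Minimal node; the attribute names are assumed for this sketch."""
    def __init__(self, data):
        self.data = data
        self.left = None
        self.right = None


values = [10, 11, 7, 9, 15, 8, 12]
n = [Node(v) for v in values]

# (parent index, left child index, right child index), matching the diagram
links = [(0, 1, 3), (1, 2, 6), (3, 4, 5)]
for parent, left, right in links:
    n[parent].left = n[left]
    n[parent].right = n[right]


def level_order(root):
    """Return the values level by level, left to right."""
    out, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        out.append(node.data)
        for child in (node.left, node.right):
            if child is not None:
                queue.append(child)
    return out


print(level_order(n[0]))  # [10, 11, 9, 7, 12, 15, 8]
```

Keeping the wiring in a table of index triples means a larger tree only needs more rows in `links`, not more assignment statements.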

Answer 15

class Node:
    """
    single Node for tree
    """

    def __init__(self, data):
        self.data = data
        self.right = None
        self.left = None


class binaryTree:
    """
    binary tree implementation
    """

    def __init__(self):
        self.root = None

    def push(self, element, node=None):
        if node is None:
            node = self.root

        if self.root is None:
            self.root = Node(element)

        else:
            if element < node.data:
                if node.left is not None:
                    self.push(element, node.left)
                else:
                    node.left = Node(element)
            else:
                if node.right is not None:
                    self.push(element, node.right)
                else:
                    node.right = Node(element)

    def __str__(self):
        self.printInorder(self.root)
        return "\n"

    def printInorder(self, node):
        """
        print tree in inorder
        """
        if node is not None:
            self.printInorder(node.left)
            print(node.data)
            self.printInorder(node.right)


def main():
    """
    Main code and logic comes here
    """
    tree = binaryTree()
    tree.push(5)
    tree.push(3)
    tree.push(1)
    tree.push(3)
    tree.push(0)
    tree.push(2)
    tree.push(9)
    tree.push(10)
    print(tree)


if __name__ == "__main__":
    main()
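
In the class above, `__str__` prints the nodes as a side effect and returns only a newline. A hedged alternative sketch, assuming the same `data`/`left`/`right` node layout: collect the values with a generator and let `__str__` return them as text. `SimpleTree` and `_inorder` are hypothetical names, not the answer's code:

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.left = None
        self.right = None


class SimpleTree:
    def __init__(self):
        self.root = None

    def push(self, element):
        """Iterative insert; a value equal to a node goes to its right subtree."""
        new_node = Node(element)
        if self.root is None:
            self.root = new_node
            return
        node = self.root
        while True:
            if element < node.data:
                if node.left is None:
                    node.left = new_node
                    return
                node = node.left
            else:
                if node.right is None:
                    node.right = new_node
                    return
                node = node.right

    def _inorder(self, node):
        """Yield the values in sorted order without printing anything."""
        if node is not None:
            yield from self._inorder(node.left)
            yield node.data
            yield from self._inorder(node.right)

    def __str__(self):
        return " ".join(str(value) for value in self._inorder(self.root))


tree = SimpleTree()
for value in (5, 3, 1, 3, 0, 2, 9, 10):
    tree.push(value)
print(tree)  # 0 1 2 3 3 5 9 10
```

Returning the text lets callers log it, compare it in tests, or format it further, which a printing `__str__` does not allow.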

Answer 16

Binary Tree in Python

class node(object):
    """A single tree node: a value plus links to two children."""
    def __init__(self, x):
        self.data = x
        self.left = None
        self.right = None

class Tree(object):
    def insert(self, x, root):
        # Insert x into the subtree rooted at `root` and return that subtree's root.
        if root is None:
            return node(x)
        elif x < root.data:
            root.left = self.insert(x, root.left)
        else:
            root.right = self.insert(x, root.right)
        return root

    def printTree(self, t):
        # In-order traversal: prints the stored values in sorted order.
        if t is None:
            return

        self.printTree(t.left)
        print(t.data)
        self.printTree(t.right)

bt = Tree()
root = None
n = int(input())  # number of values to read
a = []
for i in range(n):
    a.append(int(input()))
for i in range(n):
    root = bt.insert(a[i], root)
bt.printTree(root)

Answer 17

Here is a simple solution that builds a binary tree recursively from user input. The code below then displays the tree using a pre-order traversal.

class Node(object):

    def __init__(self):
        self.left = None
        self.right = None
        self.value = None
    @property
    def get_value(self):
        return self.value

    @property
    def get_left(self):
        return self.left

    @property
    def get_right(self):
        return self.right

    @get_left.setter
    def set_left(self, left_node):
        self.left = left_node
    @get_value.setter
    def set_value(self, value):
        self.value = value
    @get_right.setter
    def set_right(self, right_node):
        self.right = right_node



    def create_tree(self):
        _node = Node() #creating new node.
        _x = input("Enter the node data(-1 for null)")
        if(_x == str(-1)): #for defining no child.
            return None
        _node.set_value = _x #setting the value of the node.
        print("Enter the left child of {}".format(_x))
        _node.set_left = self.create_tree() #setting the left subtree
        print("Enter the right child of {}".format(_x))
        _node.set_right = self.create_tree() #setting the right subtree.

        return _node

    def pre_order(self, root):
        if root is not None:
            print(root.get_value)
            self.pre_order(root.get_left)
            self.pre_order(root.get_right)

if __name__ == '__main__':
    node = Node()
    root_node = node.create_tree()
    node.pre_order(root_node)

Code taken from: Binary Tree in Python
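
The same recursive idea can be tried without typing at a prompt. The sketch below is not the answer's code: it assumes the node values arrive as a pre-order token sequence in which '-1' marks a missing child (mirroring the prompt above), and the names build_tree and pre_order are made up for the illustration.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None


def build_tree(tokens):
    """Consume tokens in pre-order; '-1' marks a missing child."""
    token = next(tokens)
    if token == '-1':
        return None
    node = Node(token)
    node.left = build_tree(tokens)   # left subtree comes first in the stream
    node.right = build_tree(tokens)  # then the right subtree
    return node


def pre_order(root, out):
    """Append values to `out` in root, left, right order."""
    if root is not None:
        out.append(root.value)
        pre_order(root.left, out)
        pre_order(root.right, out)
    return out


tokens = iter("1 2 -1 -1 3 -1 -1".split())
root = build_tree(tokens)
visited = pre_order(root, [])
print(visited)  # ['1', '2', '3']
```

Feeding the tokens from an iterator keeps the recursion identical to the interactive version: each call consumes exactly the part of the input that describes its own subtree.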


What would a “frozen dict” be?

Question: What would a “frozen dict” be?

  • A frozen set is a frozenset.
  • A frozen list could be a tuple.
  • What would a frozen dict be? An immutable, hashable dict.

I guess it could be something like collections.namedtuple, but that is more like a frozen-keys dict (a half-frozen dict). Isn’t it?

A “frozendict” should be a frozen dictionary, it should have keys, values, get, etc., and support in, for, etc.

Update:
* There it is: https://www.python.org/dev/peps/pep-0603


Answer 0

Python doesn’t have a builtin frozendict type. It turns out this wouldn’t be useful too often (though it would still probably be useful more often than frozenset is).

The most common reason to want such a type is when memoizing function calls for functions with unknown arguments. The most common solution to store a hashable equivalent of a dict (where the values are hashable) is something like tuple(sorted(kwargs.items())) (kwargs.iteritems() on Python 2).

This depends on the sorting not being a bit insane. Python cannot positively promise sorting will result in something reasonable here. (But it can’t promise much else, so don’t sweat it too much.)
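
As a rough illustration of that memoization use case, here is a minimal sketch of a caching decorator that freezes the keyword arguments into a sorted tuple. The names memoize, area and calls are invented for the example, and it assumes every argument value is hashable.

```python
import functools


def memoize(func):
    """Cache results keyed on positional and keyword arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # A dict is unhashable, so freeze kwargs into a sorted tuple of pairs.
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]

    wrapper.cache = cache  # exposed only so the example can inspect it
    return wrapper


calls = []


@memoize
def area(width=1, height=1):
    calls.append((width, height))
    return width * height


area(width=2, height=3)
area(height=3, width=2)  # same arguments, different keyword order
```

Because the pairs are sorted, both calls map to the same cache key and the wrapped function body runs only once. Keyword names are unique strings, so the sort never has to compare the values themselves.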


You could easily enough make some sort of wrapper that works much like a dict. It might look something like

import collections.abc

class FrozenDict(collections.abc.Mapping):
    """Don't forget the docstrings!!"""

    def __init__(self, *args, **kwargs):
        self._d = dict(*args, **kwargs)
        self._hash = None

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

    def __getitem__(self, key):
        return self._d[key]

    def __hash__(self):
        # It would have been simpler and maybe more obvious to
        # use hash(tuple(sorted(self._d.items()))) from this discussion
        # so far, but sorting is O(n log n) while this solution is O(n).
        # I don't know what kind of n we are going to run into, but
        # sometimes it's hard to resist the urge to optimize when it
        # will gain improved algorithmic performance.
        if self._hash is None:
            hash_ = 0
            for pair in self.items():
                hash_ ^= hash(pair)
            self._hash = hash_
        return self._hash

It should work great:

>>> x = FrozenDict(a=1, b=2)
>>> y = FrozenDict(a=1, b=2)
>>> x is y
False
>>> x == y
True
>>> x == {'a': 1, 'b': 2}
True
>>> d = {x: 'foo'}
>>> d[y]
'foo'
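
To see which parts of the dict interface come for free from the abstract base class, here is a stripped-down sketch. ReadOnlyDict is a hypothetical name used only here; it leaves out the hashing shown above and assumes Python 3, where the base class lives in collections.abc.

```python
from collections.abc import Mapping


class ReadOnlyDict(Mapping):
    """Hypothetical minimal wrapper, used only for this illustration."""

    def __init__(self, *args, **kwargs):
        self._d = dict(*args, **kwargs)

    def __getitem__(self, key):
        return self._d[key]

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)


x = ReadOnlyDict(a=1, b=2)

# Provided by the Mapping mixin, not written by hand:
keys = sorted(x.keys())
values = sorted(x.values())
missing = x.get('c', 0)
has_a = 'a' in x

# No __setitem__ was defined, so item assignment fails.
try:
    x['a'] = 10
    assignment_error = None
except TypeError as exc:
    assignment_error = exc
```

Only three methods are implemented; keys, values, items, get, the in operator and == all come from Mapping. The underlying dict is still reachable through the _d attribute, so the immutability here is a convention rather than a guarantee.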

Answer 1

Curiously, although we have the seldom-useful frozenset in Python, there’s still no frozen mapping. The idea was rejected in PEP 416 (Add a frozendict builtin type). The idea may be revisited in Python 3.9; see PEP 603 (Adding a frozenmap type to collections).

So the Python 2 solution to this:

def foo(config={'a': 1}):
    ...

still seems to be the somewhat lame:

def foo(config=None):
    if config is None:
        config = default_config = {'a': 1}
    ...

In Python 3 you have this option:

from types import MappingProxyType

default_config = {'a': 1}
DEFAULTS = MappingProxyType(default_config)

def foo(config=DEFAULTS):
    ...

Now the default config can be updated dynamically, but remain immutable where you want it to be immutable by passing around the proxy instead.

So changes in the default_config will update DEFAULTS as expected, but you can’t write to the mapping proxy object itself.
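
That behaviour can be checked with a short sketch that reuses the names from the snippet above; write_error and snapshot are added only for the demonstration.

```python
from types import MappingProxyType

default_config = {'a': 1}
DEFAULTS = MappingProxyType(default_config)

# Writes through the proxy are rejected...
try:
    DEFAULTS['a'] = 2
    write_error = None
except TypeError as exc:
    write_error = exc

# ...but changes to the underlying dict show through the proxy.
default_config['b'] = 2
snapshot = dict(DEFAULTS)
```

The proxy is a read-only view rather than a copy, so anyone holding a reference to default_config can still change what DEFAULTS reports.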

Admittedly it’s not quite the same thing as an “immutable, hashable dict” – but it’s a decent substitute given the same kind of use cases for which we might want a frozendict.


Answer 2

Assuming the keys and values of the dictionary are themselves immutable (e.g. strings) then:

>>> d
{'forever': 'atones', 'minks': 'cards', 'overhands': 'warranted', 
 'hardhearted': 'tartly', 'gradations': 'snorkeled'}
>>> t = tuple((k, d[k]) for k in sorted(d.keys()))
>>> hash(t)
1524953596
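
Sorting the keys is what makes the result independent of insertion order. The sketch below wraps the expression in a helper, called freeze here purely for illustration, and assumes a flat dict whose keys can be ordered and whose values are hashable.

```python
def freeze(d):
    """Return a hashable, order-independent snapshot of a flat dict."""
    return tuple((k, d[k]) for k in sorted(d.keys()))


d1 = {'minks': 'cards', 'forever': 'atones'}
d2 = {'forever': 'atones', 'minks': 'cards'}  # same content, different insertion order

t1, t2 = freeze(d1), freeze(d2)
lookup = {t1: 'found'}  # the tuple can be used as a dict key
```

The actual hash number will usually differ from the one shown above, because string hashes are randomized per process on Python 3.3 and later unless PYTHONHASHSEED is set.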

Answer 3

There is no frozendict, but you can use MappingProxyType, which was added to the standard library in Python 3.3:

>>> from types import MappingProxyType
>>> foo = MappingProxyType({'a': 1})
>>> foo
mappingproxy({'a': 1})
>>> foo['a'] = 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'mappingproxy' object does not support item assignment
>>> foo
mappingproxy({'a': 1})

Answer 4

Here is the code I’ve been using. I subclassed frozenset. The advantages of this are the following.

  1. This is a truly immutable object. No relying on the good behavior of future users and developers.
  2. It’s easy to convert back and forth between a regular dictionary and a frozen dictionary. FrozenDict(orig_dict) –> frozen dictionary. dict(frozen_dict) –> regular dict.

Update Jan 21 2015: The original piece of code I posted in 2014 used a for-loop to find a key that matched. That was incredibly slow. Now I’ve put together an implementation which takes advantage of frozenset’s hashing features. Key-value pairs are stored in special containers where the __hash__ and __eq__ functions are based on the key only. This code has also been formally unit-tested, unlike what I posted here in August 2014.

MIT-style license.

import sys

# Major Python version (2 or 3); selects the compatibility code at the bottom.
version = sys.version_info[0]

def col(i):
    ''' For binding named attributes to spots inside subclasses of tuple.'''
    g = tuple.__getitem__
    @property
    def _col(self):
        return g(self,i)
    return _col

class Item(tuple):
    ''' Designed for storing key-value pairs inside
        a FrozenDict, which itself is a subclass of frozenset.
        The __hash__ is overloaded to return the hash of only the key.
        __eq__ is overloaded so that normally it only checks whether the Item's
        key is equal to the other object, HOWEVER, if the other object itself
        is an instance of Item, it checks BOTH the key and value for equality.

        WARNING: Do not use this class for any purpose other than to contain
        key value pairs inside FrozenDict!!!!

        The __eq__ operator is overloaded in such a way that it violates a
        fundamental property of mathematics. That property, which says that
        a == b and b == c implies a == c, does not hold for this object.
        Here's a demonstration:
            [in]  >>> x = Item(('a',4))
            [in]  >>> y = Item(('a',5))
            [in]  >>> hash('a')
            [out] >>> 194817700
            [in]  >>> hash(x)
            [out] >>> 194817700
            [in]  >>> hash(y)
            [out] >>> 194817700
            [in]  >>> 'a' == x
            [out] >>> True
            [in]  >>> 'a' == y
            [out] >>> True
            [in]  >>> x == y
            [out] >>> False
    '''

    __slots__ = ()
    key, value = col(0), col(1)
    def __hash__(self):
        return hash(self.key)
    def __eq__(self, other):
        if isinstance(other, Item):
            return tuple.__eq__(self, other)
        return self.key == other
    def __ne__(self, other):
        return not self.__eq__(other)
    def __str__(self):
        return '%r: %r' % self
    def __repr__(self):
        return 'Item((%r, %r))' % self

class FrozenDict(frozenset):
    ''' Behaves in most ways like a regular dictionary, except that it's immutable.
        It differs from other implementations because it doesn't subclass "dict".
        Instead it subclasses "frozenset" which guarantees immutability.
        FrozenDict instances are created with the same arguments used to initialize
        regular dictionaries, and has all the same methods.
            [in]  >>> f = FrozenDict(x=3,y=4,z=5)
            [in]  >>> f['x']
            [out] >>> 3
            [in]  >>> f['a'] = 0
            [out] >>> TypeError: 'FrozenDict' object does not support item assignment

        FrozenDict can accept un-hashable values, but FrozenDict is only hashable if its values are hashable.
            [in]  >>> f = FrozenDict(x=3,y=4,z=5)
            [in]  >>> hash(f)
            [out] >>> 646626455
            [in]  >>> g = FrozenDict(x=3,y=4,z=[])
            [in]  >>> hash(g)
            [out] >>> TypeError: unhashable type: 'list'

        FrozenDict interacts with dictionary objects as though it were a dict itself.
            [in]  >>> original = dict(x=3,y=4,z=5)
            [in]  >>> frozen = FrozenDict(x=3,y=4,z=5)
            [in]  >>> original == frozen
            [out] >>> True

        FrozenDict supports bi-directional conversions with regular dictionaries.
            [in]  >>> original = {'x': 3, 'y': 4, 'z': 5}
            [in]  >>> FrozenDict(original)
            [out] >>> FrozenDict({'x': 3, 'y': 4, 'z': 5})
            [in]  >>> dict(FrozenDict(original))
            [out] >>> {'x': 3, 'y': 4, 'z': 5}   '''

    __slots__ = ()
    def __new__(cls, orig={}, **kw):
        if kw:
            d = dict(orig, **kw)
            items = map(Item, d.items())
        else:
            try:
                items = map(Item, orig.items())
            except AttributeError:
                items = map(Item, orig)
        return frozenset.__new__(cls, items)

    def __repr__(self):
        cls = self.__class__.__name__
        items = frozenset.__iter__(self)
        _repr = ', '.join(map(str,items))
        return '%s({%s})' % (cls, _repr)

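    def __getitem__(self, key):
        if key not in self:
            raise KeyError(key)
        # self - (self - {key}) leaves only the stored Item whose key equals
        # `key`; Item hashes and compares by key, so set operations can
        # match a bare key against it.
        diff = self.difference
        item = diff(diff({key}))
        key, value = set(item).pop()
        return value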

    def get(self, key, default=None):
        if key not in self:
            return default
        return self[key]

    def __iter__(self):
        items = frozenset.__iter__(self)
        return map(lambda i: i.key, items)

    def keys(self):
        items = frozenset.__iter__(self)
        return map(lambda i: i.key, items)

    def values(self):
        items = frozenset.__iter__(self)
        return map(lambda i: i.value, items)

    def items(self):
        items = frozenset.__iter__(self)
        return map(tuple, items)

    def copy(self):
        cls = self.__class__
        items = frozenset.copy(self)
        dupl = frozenset.__new__(cls, items)
        return dupl

    @classmethod
    def fromkeys(cls, keys, value):
        d = dict.fromkeys(keys,value)
        return cls(d)

    def __hash__(self):
        kv = tuple.__hash__
        items = frozenset.__iter__(self)
        return hash(frozenset(map(kv, items)))

    def __eq__(self, other):
        if not isinstance(other, FrozenDict):
            try:
                other = FrozenDict(other)
            except Exception:
                return False
        return frozenset.__eq__(self, other)

    def __ne__(self, other):
        return not self.__eq__(other)


if version == 2:
    #Here are the Python2 modifications
    class Python2(FrozenDict):
        def __iter__(self):
            items = frozenset.__iter__(self)
            for i in items:
                yield i.key

        def iterkeys(self):
            items = frozenset.__iter__(self)
            for i in items:
                yield i.key

        def itervalues(self):
            items = frozenset.__iter__(self)
            for i in items:
                yield i.value

        def iteritems(self):
            items = frozenset.__iter__(self)
            for i in items:
                yield (i.key, i.value)

        def has_key(self, key):
            return key in self

        def viewkeys(self):
            return dict(self).viewkeys()

        def viewvalues(self):
            return dict(self).viewvalues()

        def viewitems(self):
            return dict(self).viewitems()

    #If this is Python2, rebuild the class
    #from scratch rather than use a subclass
    py3 = FrozenDict.__dict__
    py3 = {k: py3[k] for k in py3}
    py2 = {}
    py2.update(py3)
    dct = Python2.__dict__
    py2.update({k: dct[k] for k in dct})

    FrozenDict = type('FrozenDict', (frozenset,), py2)

Answer 5

I think of frozendict every time I write a function like this:

def do_something(blah, optional_dict_parm=None):
    if optional_dict_parm is None:
        optional_dict_parm = {}
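
The None check exists because a mutable default is created once, when the function is defined, and is then shared by every call. A small sketch of the difference follows; append_bad and append_good are made-up names.

```python
def append_bad(item, bucket={}):
    # The default dict is created once, at definition time, and then shared.
    bucket[item] = True
    return bucket


def append_good(item, bucket=None):
    # The None sentinel gives every call its own fresh dict.
    if bucket is None:
        bucket = {}
    bucket[item] = True
    return bucket


first_bad = dict(append_bad('x'))
second_bad = dict(append_bad('y'))  # still contains 'x' from the previous call

first_good = append_good('x')
second_good = append_good('y')
```

An immutable default such as a frozendict could not be modified by the function body, which is why it would make the shorter signature safe.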

Answer 6

You can use frozendict from the utilspie package like this:

>>> from utilspie.collectionsutils import frozendict

>>> my_dict = frozendict({1: 3, 4: 5})
>>> my_dict  # object of `frozendict` type
frozendict({1: 3, 4: 5})

# Hashable
>>> {my_dict: 4}
{frozendict({1: 3, 4: 5}): 4}

# Immutable
>>> my_dict[1] = 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mquadri/workspace/utilspie/utilspie/collectionsutils/collections_utils.py", line 44, in __setitem__
    self.__setitem__.__name__, type(self).__name__))
AttributeError: You can not call '__setitem__()' for 'frozendict' object

As per the documentation:

frozendict(dict_obj): Accepts obj of dict type and returns a hashable and immutable dict


Answer 7

Install frozendict

pip install frozendict

Use it!

from frozendict import frozendict

def smth(param = frozendict({})):
    pass

Answer 8

Yes, this is my second answer, but it is a completely different approach. The first implementation was in pure Python. This one is in Cython. If you know how to use and compile Cython modules, this is just as fast as a regular dictionary: roughly 0.04 to 0.06 microseconds to retrieve a single value.

This is the file “frozen_dict.pyx”

import cython
from collections.abc import Mapping  # plain "collections" on Python 2

cdef class dict_wrapper:
    cdef object d
    cdef int h

    def __init__(self, *args, **kw):
        self.d = dict(*args, **kw)
        self.h = -1

    def __len__(self):
        return len(self.d)

    def __iter__(self):
        return iter(self.d)

    def __getitem__(self, key):
        return self.d[key]

    def __hash__(self):
        if self.h == -1:
            self.h = hash(frozenset(self.d.items()))
        return self.h

class FrozenDict(dict_wrapper, Mapping):
    def __repr__(self):
        c = type(self).__name__
        r = ', '.join('%r: %r' % (k,self[k]) for k in self)
        return '%s({%s})' % (c, r)

__all__ = ['FrozenDict']

Here’s the file “setup.py”

from distutils.core import setup
from Cython.Build import cythonize

setup(
    ext_modules = cythonize('frozen_dict.pyx')
)

If you have Cython installed, save the two files above into the same directory. Move to that directory in the command line.

python setup.py build_ext --inplace
python setup.py install

And you should be done.


Answer 9

The main disadvantage of namedtuple is that it needs to be specified before it is used, so it’s less convenient for single-use cases.

However, there is a practical workaround that can be used to handle many such cases. Let’s say that you want to have an immutable equivalent of the following dict:

MY_CONSTANT = {
    'something': 123,
    'something_else': 456
}

This can be emulated like this:

from collections import namedtuple

MY_CONSTANT = namedtuple('MyConstant', 'something something_else')(123, 456)

It’s even possible to write an auxiliary function to automate this:

def freeze_dict(data):
    from collections import namedtuple
    keys = sorted(data.keys())
    frozen_type = namedtuple(''.join(keys), keys)
    return frozen_type(**data)

a = {'foo': 'bar', 'x': 'y'}
fa = freeze_dict(a)
assert a['foo'] == fa.foo

Of course this works only for flat dicts, but it shouldn’t be too difficult to implement a recursive version.
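
The recursive version mentioned above could look roughly like the following. This is only a sketch under a few assumptions: the helper name freeze and the type name 'Frozen' are made up for illustration, all keys are strings that are valid Python identifiers, and lists are converted to tuples so the result stays hashable.

```python
from collections import namedtuple


def freeze(data):
    """Recursively convert nested dicts into namedtuples (hypothetical helper)."""
    if isinstance(data, dict):
        keys = sorted(data)
        # 'Frozen' is an arbitrary type name chosen for this sketch.
        frozen_type = namedtuple('Frozen', keys)
        return frozen_type(**{k: freeze(data[k]) for k in keys})
    if isinstance(data, (list, tuple)):
        # Lists are mutable, so freeze them into tuples as well.
        return tuple(freeze(item) for item in data)
    return data


config = freeze({'db': {'host': 'localhost', 'port': 5432}, 'tags': ['a', 'b']})
print(config.db.port)  # 5432
print(config.tags)     # ('a', 'b')
```

Because every level becomes a tuple, the whole structure can be hashed and used as a dictionary key, as long as the leaf values are hashable too.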


回答 10

子类化 dict

我在野外(github)看到了这种模式,想提一下:

class FrozenDict(dict):
    def __init__(self, *args, **kwargs):
        self._hash = None
        super(FrozenDict, self).__init__(*args, **kwargs)

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(tuple(sorted(self.items())))  # iteritems() on py2
        return self._hash

    def _immutable(self, *args, **kws):
        raise TypeError('cannot change object - object is immutable')

    __setitem__ = _immutable
    __delitem__ = _immutable
    pop = _immutable
    popitem = _immutable
    clear = _immutable
    update = _immutable
    setdefault = _immutable

用法示例:

d1 = FrozenDict({'a': 1, 'b': 2})
d2 = FrozenDict({'a': 1, 'b': 2})
d1.keys() 
assert isinstance(d1, dict)
assert len(set([d1, d2])) == 1  # hashable

优点

  • 支持get()keys()items()iteritems()上PY2)和所有从东西dict开箱没有明确执行这些
  • 在内部使用dict这意味着性能(dict用CPython用c编写)
  • 优雅简约,无黑魔法
  • isinstance(my_frozen_dict, dict)返回True-尽管python鼓励使用鸭式键入许多软件包isinstance(),但这可以节省许多调整和自定义

缺点

  • 任何子类都可以覆盖它或在内部访问它(您不能真正100%保护python中的某些内容,您应该信任您的用户并提供良好的文档)。
  • 如果您关心速度,则可能需要__hash__提高速度。

Subclassing dict

I have seen this pattern in the wild (on GitHub) and wanted to mention it:

class FrozenDict(dict):
    def __init__(self, *args, **kwargs):
        self._hash = None
        super(FrozenDict, self).__init__(*args, **kwargs)

    def __hash__(self):
        if self._hash is None:
            self._hash = hash(tuple(sorted(self.items())))  # iteritems() on py2
        return self._hash

    def _immutable(self, *args, **kws):
        raise TypeError('cannot change object - object is immutable')

    __setitem__ = _immutable
    __delitem__ = _immutable
    pop = _immutable
    popitem = _immutable
    clear = _immutable
    update = _immutable
    setdefault = _immutable

Example usage:

d1 = FrozenDict({'a': 1, 'b': 2})
d2 = FrozenDict({'a': 1, 'b': 2})
d1.keys() 
assert isinstance(d1, dict)
assert len(set([d1, d2])) == 1  # hashable

Pros

  • Supports get(), keys(), items() (iteritems() on Python 2) and everything else dict offers out of the box, without implementing them explicitly.
  • Uses dict internally, which means good performance (dict is written in C in CPython).
  • Elegant, simple, and free of black magic.
  • isinstance(my_frozen_dict, dict) returns True. Although Python encourages duck typing, many packages use isinstance(), so this can save a lot of tweaks and customizations.

Cons

  • Any subclass can override this or access it internally (you can’t really protect something 100% in Python; you should trust your users and provide good documentation).
  • If you care about speed, you might want to make __hash__ a bit faster.
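
One possible way to speed up __hash__ is to hash a frozenset of the items instead of a sorted tuple. The sketch below is an assumption about what "faster" could mean here, not part of the pattern as found on GitHub; it drops the hash caching for brevity and, like the original, requires all values to be hashable.

```python
class FrozenDict(dict):
    """Sketch: same idea as above, but hashing via frozenset instead of a sorted tuple."""

    def __hash__(self):
        # frozenset hashing is order-independent, so no sort is needed and
        # keys only have to be hashable, not orderable.
        return hash(frozenset(self.items()))

    def _immutable(self, *args, **kwargs):
        raise TypeError('cannot change object - object is immutable')

    __setitem__ = __delitem__ = _immutable
    pop = popitem = clear = update = setdefault = _immutable


# Mixed key types like these cannot be sorted on Python 3, but they can be hashed.
d = FrozenDict({1: 'a', 'b': 2})
print(hash(d) == hash(FrozenDict({'b': 2, 1: 'a'})))  # True
```

Avoiding the sort also removes the O(n log n) step from the first hash computation; caching the result in an attribute, as the original does, can be added back if the hash is requested often.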

回答 11

另一个选择是包中的MultiDictProxymultidict

Another option is the MultiDictProxy class from the multidict package.


回答 12

我需要在某一时刻访问某种东西的固定键,这是一种全球稳定的东西,因此我选择了以下方式:

class MyFrozenDict:
    def __getitem__(self, key):
        if key == 'mykey1':
            return 0
        if key == 'mykey2':
            return "another value"
        raise KeyError(key)

像这样使用

a = MyFrozenDict()
print(a['mykey1'])

警告:对于大多数用例,我不建议这样做,因为这会带来一些非常严重的折衷。

At one point I needed to access a fixed set of keys for something that was effectively a global constant, and I settled on something like this:

class MyFrozenDict:
    def __getitem__(self, key):
        if key == 'mykey1':
            return 0
        if key == 'mykey2':
            return "another value"
        raise KeyError(key)

Use it like

a = MyFrozenDict()
print(a['mykey1'])

WARNING: I don’t recommend this for most use cases as it makes some pretty severe tradeoffs.


回答 13

在没有本地语言支持的情况下,您可以自己做,也可以使用现有的解决方案。幸运的是,Python使扩展基本实现变得非常简单。

class frozen_dict(dict):
    def __setitem__(self, key, value):
        raise Exception('Frozen dictionaries cannot be mutated')

frozen = frozen_dict({'foo': 'FOO'})
print(frozen['foo']) # FOO
frozen['foo'] = 'NEWFOO' # Exception: Frozen dictionaries cannot be mutated

# OR

from types import MappingProxyType

frozen_dict = MappingProxyType({'foo': 'FOO'})
print(frozen_dict['foo']) # FOO
frozen_dict['foo'] = 'NEWFOO' # TypeError: 'mappingproxy' object does not support item assignment

In the absence of native language support, you can either do it yourself or use an existing solution. Fortunately, Python makes it dead simple to extend its built-in types.

class frozen_dict(dict):
    def __setitem__(self, key, value):
        raise Exception('Frozen dictionaries cannot be mutated')

frozen = frozen_dict({'foo': 'FOO'})
print(frozen['foo']) # FOO
frozen['foo'] = 'NEWFOO' # Exception: Frozen dictionaries cannot be mutated

# OR

from types import MappingProxyType

frozen_dict = MappingProxyType({'foo': 'FOO'})
print(frozen_dict['foo']) # FOO
frozen_dict['foo'] = 'NEWFOO' # TypeError: 'mappingproxy' object does not support item assignment
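
One detail worth knowing about the MappingProxyType variant is that it is a read-only view, not a copy. The short sketch below illustrates this; the names _backing, frozen_view and snapshot are made up for the example.

```python
from types import MappingProxyType

_backing = {'foo': 'FOO'}
frozen_view = MappingProxyType(_backing)

try:
    frozen_view['foo'] = 'NEWFOO'
except TypeError as exc:
    print('write rejected:', exc)

_backing['bar'] = 'BAR'        # mutate the underlying dict directly
print(frozen_view['bar'])      # BAR (the proxy reflects the change)

# Wrapping a private copy prevents later changes from leaking through.
snapshot = MappingProxyType(dict(_backing))
_backing['baz'] = 'BAZ'
print('baz' in snapshot)       # False
```

If nobody else holds a reference to the wrapped dictionary, the proxy behaves like a frozen mapping for practical purposes.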

How is set() implemented?

Question: How is set() implemented?

我见过有人说setpython 中的对象具有O(1)成员资格检查。如何在内部实现它们以允许这样做?它使用哪种数据结构?该实现还有什么其他含义?

这里的每个答案都非常有启发性,但是我只能接受一个答案,因此,我将选择与原始问题最接近的答案。谢谢你的信息!

I’ve seen people say that set objects in python have O(1) membership-checking. How are they implemented internally to allow this? What sort of data structure does it use? What other implications does that implementation have?

Every answer here was really enlightening, but I can only accept one, so I’ll go with the closest answer to my original question. Thanks all for the info!


回答 0

根据这个线程

实际上,CPython的集合被实现为类似于带有伪值的字典(键是集合的成员)的字典,并且进行了一些优化,可以利用这种缺乏值的方式

因此,基本上a set使用哈希表作为其基础数据结构。这解释了O(1)成员资格检查,因为在哈希表中查找项目平均而言是O(1)操作。

如果您愿意,甚至可以浏览CPython源代码以获取集合,根据Achim Domma的说法,该代码大部分是实现中的剪切和粘贴dict

According to this thread:

Indeed, CPython’s sets are implemented as something like dictionaries with dummy values (the keys being the members of the set), with some optimization(s) that exploit this lack of values

So basically a set uses a hashtable as its underlying data structure. This explains the O(1) membership checking, since looking up an item in a hashtable is an O(1) operation, on average.

If you are so inclined you can even browse the CPython source code for set which, according to Achim Domma, is mostly a cut-and-paste from the dict implementation.
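
Two practical consequences of the hash-table design can be seen in a few lines. This is just an illustrative sketch: members must be hashable, and membership is decided by hash plus equality rather than by type.

```python
members = {1, 'a', (2, 3)}

# 1 == 1.0 and hash(1) == hash(1.0), so the float is found in the set.
print(1.0 in members)      # True

try:
    members.add([2, 3])    # lists are mutable and therefore unhashable
except TypeError as exc:
    print('unhashable:', exc)
```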


回答 1

当人们说集合具有O(1)成员资格检查时,他们正在谈论平均情况。在最坏的情况下(当所有哈希值冲突时),成员资格检查为O(n)。有关时间复杂性,请参见Python Wiki

维基百科的文章说,最好的情况下为一个哈希表,不调整大小的时间复杂度O(1 + k/n)。由于Python集使用调整大小的哈希表,因此该结果并不直接适用于Python集。

在Wikipedia文章上再说一点,对于一般情况,并假设一个简单的统一哈希函数,时间复杂度为O(1/(1-k/n)),其中k/n可以由常数限制c<1

Big-O仅将渐近行为表示为n→∞。由于k / n可以由常数c <1限制,与n无关

O(1/(1-k/n))不大于O(1/(1-c))等于O(constant)= O(1)

因此,假设统一的简单哈希,平均而言,Python集的成员资格检查为O(1)

When people say sets have O(1) membership-checking, they are talking about the average case. In the worst case (when all hashed values collide) membership-checking is O(n). See the Python wiki on time complexity.

The Wikipedia article says the best case time complexity for a hash table that does not resize is O(1 + k/n). This result does not directly apply to Python sets since Python sets use a hash table that resizes.

A little further on the Wikipedia article says that for the average case, and assuming a simple uniform hashing function, the time complexity is O(1/(1-k/n)), where k/n can be bounded by a constant c<1.

Big-O refers only to asymptotic behavior as n → ∞. Since k/n can be bounded by a constant, c<1, independent of n,

O(1/(1-k/n)) is no bigger than O(1/(1-c)) which is equivalent to O(constant) = O(1).

So assuming uniform simple hashing, on average, membership-checking for Python sets is O(1).
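
The gap between the average case and the worst case can be made visible by counting equality comparisons. The Key class and the helper function below are hypothetical and exist only for this experiment; the exact counts depend on the interpreter, so no expected output is shown.

```python
class Key:
    """Hypothetical key type that counts how often __eq__ is called."""
    eq_calls = 0

    def __init__(self, value, constant_hash):
        self.value = value
        self.constant_hash = constant_hash

    def __hash__(self):
        # Either every key collides (worst case) or hashes are spread out.
        return 1 if self.constant_hash else hash(self.value)

    def __eq__(self, other):
        Key.eq_calls += 1
        return self.value == other.value


def comparisons_for_missing_lookup(n, constant_hash):
    members = {Key(i, constant_hash) for i in range(n)}
    Key.eq_calls = 0                      # ignore comparisons made while building
    found = Key(-1, constant_hash) in members
    assert not found
    return Key.eq_calls


good = comparisons_for_missing_lookup(500, constant_hash=False)
bad = comparisons_for_missing_lookup(500, constant_hash=True)
print(good, bad)
```

With well-spread hashes the lookup needs few or no comparisons, while with a constant hash the number of comparisons grows with the size of the set.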


回答 2

我认为这是一个常见的错误,set查找(或该问题的哈希表)不是O(1)。
来自维基百科

在最简单的模型中,哈希函数是完全未指定的,并且该表不会调整大小。为了最好地选择散列函数,大小为n且具有开放寻址的表没有冲突,最多可容纳n个元素,一次比较即可成功查找,并且大小为n的具有链接和k个键的表具有最小的最大(0,kn)冲突和O(1 + k / n)比较以查找。对于最差的哈希函数选择,每个插入都会导致冲突,并且哈希表会退化为线性搜索,每个插入都要进行Ω(k)摊销比较,并且最多可以进行k个比较才能成功查找。

相关:Java哈希图真的是O(1)吗?

I think it’s a common mistake: set lookups (or hash table lookups, for that matter) are not O(1).
From Wikipedia:

In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size n with open addressing has no collisions and holds up to n elements, with a single comparison for successful lookup, and a table of size n with chaining and k keys has the minimum max(0, k-n) collisions and O(1 + k/n) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(k) amortized comparisons per insertion and up to k comparisons for a successful lookup.

Related: Is a Java hashmap really O(1)?


回答 3

我们都可以轻松访问source,前面的评论set_lookkey()说:

/* set object implementation
 Written and maintained by Raymond D. Hettinger <python@rcn.com>
 Derived from Lib/sets.py and Objects/dictobject.c.
 The basic lookup function used by all operations.
 This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
 The initial probe index is computed as hash mod the table size.
 Subsequent probe indices are computed as explained in Objects/dictobject.c.
 To improve cache locality, each probe inspects a series of consecutive
 nearby entries before moving on to probes elsewhere in memory.  This leaves
 us with a hybrid of linear probing and open addressing.  The linear probing
 reduces the cost of hash collisions because consecutive memory accesses
 tend to be much cheaper than scattered probes.  After LINEAR_PROBES steps,
 we then use open addressing with the upper bits from the hash value.  This
 helps break-up long chains of collisions.
 All arithmetic on hash should ignore overflow.
 Unlike the dictionary implementation, the lookkey function can return
 NULL if the rich comparison returns an error.
*/


...
#ifndef LINEAR_PROBES
#define LINEAR_PROBES 9
#endif

/* This must be >= 1 */
#define PERTURB_SHIFT 5

static setentry *
set_lookkey(PySetObject *so, PyObject *key, Py_hash_t hash)  
{
...

We all have easy access to the source, where the comment preceding set_lookkey() says:

/* set object implementation
 Written and maintained by Raymond D. Hettinger <python@rcn.com>
 Derived from Lib/sets.py and Objects/dictobject.c.
 The basic lookup function used by all operations.
 This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
 The initial probe index is computed as hash mod the table size.
 Subsequent probe indices are computed as explained in Objects/dictobject.c.
 To improve cache locality, each probe inspects a series of consecutive
 nearby entries before moving on to probes elsewhere in memory.  This leaves
 us with a hybrid of linear probing and open addressing.  The linear probing
 reduces the cost of hash collisions because consecutive memory accesses
 tend to be much cheaper than scattered probes.  After LINEAR_PROBES steps,
 we then use open addressing with the upper bits from the hash value.  This
 helps break-up long chains of collisions.
 All arithmetic on hash should ignore overflow.
 Unlike the dictionary implementation, the lookkey function can return
 NULL if the rich comparison returns an error.
*/


...
#ifndef LINEAR_PROBES
#define LINEAR_PROBES 9
#endif

/* This must be >= 1 */
#define PERTURB_SHIFT 5

static setentry *
set_lookkey(PySetObject *so, PyObject *key, Py_hash_t hash)  
{
...

回答 4

为了进一步强调set's和之间的区别dict's,这是setobject.c注释部分的摘录,其中阐明了set与dicts的主要区别。

集合的用例与字典中存在较大差异的字典大相径庭。相反,集合主要是关于成员资格测试,其中事先不知道元素的存在。因此,集合实现需要针对发现和未发现的情况进行优化。

github上的源代码

To emphasize a little more the difference between sets and dicts, here is an excerpt from the comment section of setobject.c, which clarifies the main difference between them.

Use cases for sets differ considerably from dictionaries where looked-up keys are more likely to be present. In contrast, sets are primarily about membership testing where the presence of an element is not known in advance. Accordingly, the set implementation needs to optimize for both the found and not-found case.

source on github


How do I implement a max-heap in Python?

Question: How do I implement a max-heap in Python?

Python包括用于最小堆的heapq模块,但是我需要一个最大堆。在Python中最大堆实现应使用什么?

Python includes the heapq module for min-heaps, but I need a max heap. What should I use for a max-heap implementation in Python?


回答 0

最简单的方法是反转键的值并使用heapq。例如,将1000.0转换为-1000.0,将5.0转换为-5.0。

The easiest way is to invert the value of the keys and use heapq. For example, turn 1000.0 into -1000.0 and 5.0 into -5.0.
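
As a minimal sketch of the negation trick (the sample values and task names are invented for illustration), numbers can be negated directly, and for non-numeric payloads only the priority in a (priority, item) tuple needs to be negated:

```python
import heapq

scores = [1000.0, 5.0, 42.5]

# Store negated keys so that the largest original value becomes the smallest.
heap = [-s for s in scores]
heapq.heapify(heap)

largest = -heapq.heappop(heap)     # negate again on the way out
print(largest)                     # 1000.0

# For non-numeric payloads, negate only the priority in a (priority, item) tuple.
tasks = []
heapq.heappush(tasks, (-2, 'write tests'))
heapq.heappush(tasks, (-9, 'fix outage'))
print(heapq.heappop(tasks)[1])     # fix outage
```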


回答 1

您可以使用

import heapq
listForTree = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]    
heapq.heapify(listForTree)             # for a min heap
heapq._heapify_max(listForTree)        # for a maxheap!!

如果然后要弹出元素,请使用:

heapq.heappop(minheap)      # pop from minheap
heapq._heappop_max(maxheap) # pop from maxheap

You can use

import heapq
listForTree = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]    
heapq.heapify(listForTree)             # for a min heap
heapq._heapify_max(listForTree)        # for a maxheap!!

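If you then want to pop elements, use the matching function (minheap and maxheap below stand for lists heapified as above; the underscore-prefixed functions are private CPython helpers and may change between versions):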

heapq.heappop(minheap)      # pop from minheap
heapq._heappop_max(maxheap) # pop from maxheap

回答 2

解决方案是在将值存储在堆中时取反值,或像这样反转对象比较:

import heapq

class MaxHeapObj(object):
  def __init__(self, val): self.val = val
  def __lt__(self, other): return self.val > other.val
  def __eq__(self, other): return self.val == other.val
  def __str__(self): return str(self.val)

最大堆的示例:

maxh = []
heapq.heappush(maxh, MaxHeapObj(x))
x = maxh[0].val  # fetch max value
x = heapq.heappop(maxh).val  # pop max value

但是您必须记住包装和解包值,这需要知道您要处理的是最小堆还是最大堆。

MinHeap,MaxHeap类

MinHeap和添加类MaxHeap可以简化代码:

class MinHeap(object):
  def __init__(self): self.h = []
  def heappush(self, x): heapq.heappush(self.h, x)
  def heappop(self): return heapq.heappop(self.h)
  def __getitem__(self, i): return self.h[i]
  def __len__(self): return len(self.h)

class MaxHeap(MinHeap):
  def heappush(self, x): heapq.heappush(self.h, MaxHeapObj(x))
  def heappop(self): return heapq.heappop(self.h).val
  def __getitem__(self, i): return self.h[i].val

用法示例:

minh = MinHeap()
maxh = MaxHeap()
# add some values
minh.heappush(12)
maxh.heappush(12)
minh.heappush(4)
maxh.heappush(4)
# fetch "top" values
print(minh[0], maxh[0])  # "4 12"
# fetch and remove "top" values
print(minh.heappop(), maxh.heappop())  # "4 12"

The solution is to negate your values when you store them in the heap, or invert your object comparison like so:

import heapq

class MaxHeapObj(object):
  def __init__(self, val): self.val = val
  def __lt__(self, other): return self.val > other.val
  def __eq__(self, other): return self.val == other.val
  def __str__(self): return str(self.val)

Example of a max-heap:

maxh = []
heapq.heappush(maxh, MaxHeapObj(x))
x = maxh[0].val  # fetch max value
x = heapq.heappop(maxh).val  # pop max value

But you have to remember to wrap and unwrap your values, which requires knowing if you are dealing with a min- or max-heap.

MinHeap, MaxHeap classes

Adding classes for MinHeap and MaxHeap objects can simplify your code:

class MinHeap(object):
  def __init__(self): self.h = []
  def heappush(self, x): heapq.heappush(self.h, x)
  def heappop(self): return heapq.heappop(self.h)
  def __getitem__(self, i): return self.h[i]
  def __len__(self): return len(self.h)

class MaxHeap(MinHeap):
  def heappush(self, x): heapq.heappush(self.h, MaxHeapObj(x))
  def heappop(self): return heapq.heappop(self.h).val
  def __getitem__(self, i): return self.h[i].val

Example usage:

minh = MinHeap()
maxh = MaxHeap()
# add some values
minh.heappush(12)
maxh.heappush(12)
minh.heappush(4)
maxh.heappush(4)
# fetch "top" values
print(minh[0], maxh[0])  # "4 12"
# fetch and remove "top" values
print(minh.heappop(), maxh.heappop())  # "4 12"

回答 3

最简单理想的解决方案

将值乘以-1

妳去 现在,所有最高的数字都是最低的,反之亦然。

只需记住,当您弹出一个元素以使其与-1相乘时,才能再次获得原始值。

The easiest and ideal solution

Multiply the values by -1

There you go. All the highest numbers are now the lowest and vice versa.

Just remember to multiply each element by -1 again when you pop it, in order to get the original value back.


回答 4

我实现了heapq的最大堆版本并将其提交给PyPI。(对heapq模块的CPython代码稍作更改。)

https://pypi.python.org/pypi/heapq_max/

https://github.com/he-zhe/heapq_max

安装

pip install heapq_max

用法

tl; dr:与heapq模块相同,只不过在所有函数中添加了“ _max”。

heap_max = []                           # creates an empty heap
heappush_max(heap_max, item)            # pushes a new item on the heap
item = heappop_max(heap_max)            # pops the largest item from the heap
item = heap_max[0]                      # largest item on the heap without popping it
heapify_max(x)                          # transforms list into a heap, in-place, in linear time
item = heapreplace_max(heap_max, item)  # pops and returns largest item, and
                                    # adds new item; the heap size is unchanged

I implemented a max-heap version of heapq and submitted it to PyPI. (It is a very slight modification of CPython’s heapq module code.)

https://pypi.python.org/pypi/heapq_max/

https://github.com/he-zhe/heapq_max

Installation

pip install heapq_max

Usage

tl;dr: same as heapq module except adding ‘_max’ to all functions.

heap_max = []                           # creates an empty heap
heappush_max(heap_max, item)            # pushes a new item on the heap
item = heappop_max(heap_max)            # pops the largest item from the heap
item = heap_max[0]                      # largest item on the heap without popping it
heapify_max(x)                          # transforms list into a heap, in-place, in linear time
item = heapreplace_max(heap_max, item)  # pops and returns largest item, and
                                    # adds new item; the heap size is unchanged

回答 5

如果您插入的是可比较的但不是int的键,则可能会覆盖它们上的比较运算符(即<=变为>,而>变为<=)。否则,您可以在heapq模块中覆盖heapq._siftup(最后只是Python代码)。

If you are inserting keys that are comparable but not int-like, you could override the comparison operators on them (i.e. <= becomes > and > becomes <=). Otherwise, you can override heapq._siftup in the heapq module (it’s all just Python code, in the end).


回答 6

允许您选择任意数量的最大或最小项目

import heapq
heap = [23, 7, -4, 18, 23, 42, 37, 2, 8, 2, 23, 7, -4, 18, 23, 42, 37, 2]
heapq.heapify(heap)
print(heapq.nlargest(3, heap))  # [42, 42, 37]
print(heapq.nsmallest(3, heap)) # [-4, -4, 2]

Allowing you to choose an arbitrary number of largest or smallest items:

import heapq
heap = [23, 7, -4, 18, 23, 42, 37, 2, 8, 2, 23, 7, -4, 18, 23, 42, 37, 2]
heapq.heapify(heap)
print(heapq.nlargest(3, heap))  # [42, 42, 37]
print(heapq.nsmallest(3, heap)) # [-4, -4, 2]

回答 7

扩展int类并覆盖__lt__是方法之一。

import queue
class MyInt(int):
    def __lt__(self, other):
        return self > other

def main():
    q = queue.PriorityQueue()
    q.put(MyInt(10))
    q.put(MyInt(5))
    q.put(MyInt(1))
    while not q.empty():
        print (q.get())


if __name__ == "__main__":
    main()

Extending the int class and overriding __lt__ is one of the ways.

import queue
class MyInt(int):
    def __lt__(self, other):
        return self > other

def main():
    q = queue.PriorityQueue()
    q.put(MyInt(10))
    q.put(MyInt(5))
    q.put(MyInt(1))
    while not q.empty():
        print (q.get())


if __name__ == "__main__":
    main()

回答 8

我创建了一个堆包装器,该堆包装器将这些值取反以创建一个最大堆,以及一个用于最小堆的包装器类,以使库更像OOP。这里是要点。一共有三节课;堆(抽象类),HeapMin和HeapMax。

方法:

isempty() -> bool; obvious
getroot() -> int; returns min/max
push() -> None; equivalent to heapq.heappush
pop() -> int; equivalent to heapq.heappop
view_min()/view_max() -> int; alias for getroot()
pushpop() -> int; equivalent to heapq.heappushpop

I have created a heap wrapper that inverts the values to create a max-heap, as well as a wrapper class for a min-heap to make the library more OOP-like. Here is the gist. There are three classes: Heap (abstract class), HeapMin, and HeapMax.

Methods:

isempty() -> bool; obvious
getroot() -> int; returns min/max
push() -> None; equivalent to heapq.heappush
pop() -> int; equivalent to heapq.heappop
view_min()/view_max() -> int; alias for getroot()
pushpop() -> int; equivalent to heapq.heappushpop
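
Since the gist itself is not reproduced here, the following is only a rough sketch of how such a wrapper could be written. It reuses the class and method names listed above, but the body (including the _sign attribute and the non-abstract base class) is an assumption and not the author's code; it works for numeric values only.

```python
import heapq


class Heap:
    """Base class; _sign is +1 for a min-heap and -1 for a max-heap."""
    _sign = 1

    def __init__(self, values=()):
        self._data = [self._sign * v for v in values]
        heapq.heapify(self._data)

    def isempty(self):
        return not self._data

    def getroot(self):
        return self._sign * self._data[0]

    def push(self, value):
        heapq.heappush(self._data, self._sign * value)

    def pop(self):
        return self._sign * heapq.heappop(self._data)

    def pushpop(self, value):
        return self._sign * heapq.heappushpop(self._data, self._sign * value)


class HeapMin(Heap):
    view_min = Heap.getroot


class HeapMax(Heap):
    _sign = -1                 # values are stored negated
    view_max = Heap.getroot


h = HeapMax([3, 1, 4])
h.push(10)
print(h.view_max())   # 10
print(h.pushpop(7))   # 10
print(h.pop())        # 7
```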

回答 9

如果您想使用最大堆来获取最大的K元素,可以执行以下技巧:

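import heapq

def kth_largest(nums, k):
    # nlargest returns the k largest values in descending order,
    # so the last one is the k-th largest; no heapify call is needed.
    return heapq.nlargest(k, nums)[-1]

print(kth_largest([3, 2, 1, 5, 6, 4], 2))  # 5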

If you want to get the k-th largest element, you can use the following trick:

import heapq

def kth_largest(nums, k):
    # nlargest returns the k largest values in descending order,
    # so the last one is the k-th largest; no heapify call is needed.
    return heapq.nlargest(k, nums)[-1]

print(kth_largest([3, 2, 1, 5, 6, 4], 2))  # 5

回答 10

遵循艾萨克·特纳(Isaac Turner)的出色回答,我想举一个基于使用最大堆的K个最接近原点的示例。

from math import sqrt
import heapq


class MaxHeapObj(object):
    def __init__(self, val):
        self.val = val.distance
        self.coordinates = val.coordinates

    def __lt__(self, other):
        return self.val > other.val

    def __eq__(self, other):
        return self.val == other.val

    def __str__(self):
        return str(self.val)


class MinHeap(object):
    def __init__(self):
        self.h = []

    def heappush(self, x):
        heapq.heappush(self.h, x)

    def heappop(self):
        return heapq.heappop(self.h)

    def __getitem__(self, i):
        return self.h[i]

    def __len__(self):
        return len(self.h)


class MaxHeap(MinHeap):
    def heappush(self, x):
        heapq.heappush(self.h, MaxHeapObj(x))

    def heappop(self):
        return heapq.heappop(self.h).val

    def peek(self):
        return self.h[0].val  # the heap root holds the largest value, O(1)

    def __getitem__(self, i):
        return self.h[i].val


class Point():
    def __init__(self, x, y):
        self.distance = round(sqrt(x**2 + y**2), 3)
        self.coordinates = (x, y)


def find_k_closest(points, k):
    res = [Point(x, y) for (x, y) in points]
    maxh = MaxHeap()

    for i in range(k):
        maxh.heappush(res[i])

    for p in res[k:]:
        if p.distance < maxh.peek():
            maxh.heappop()
            maxh.heappush(p)

    res = [str(x.coordinates) for x in maxh.h]
    print(f"{k} closest points from origin : {', '.join(res)}")


points = [(10, 8), (-2, 4), (0, -2), (-1, 0), (3, 5), (-2, 3), (3, 2), (0, 1)]
find_k_closest(points, 3)

Following up on Isaac Turner’s excellent answer, I’d like to add an example based on the K Closest Points to the Origin problem, using a max-heap.

from math import sqrt
import heapq


class MaxHeapObj(object):
    def __init__(self, val):
        self.val = val.distance
        self.coordinates = val.coordinates

    def __lt__(self, other):
        return self.val > other.val

    def __eq__(self, other):
        return self.val == other.val

    def __str__(self):
        return str(self.val)


class MinHeap(object):
    def __init__(self):
        self.h = []

    def heappush(self, x):
        heapq.heappush(self.h, x)

    def heappop(self):
        return heapq.heappop(self.h)

    def __getitem__(self, i):
        return self.h[i]

    def __len__(self):
        return len(self.h)


class MaxHeap(MinHeap):
    def heappush(self, x):
        heapq.heappush(self.h, MaxHeapObj(x))

    def heappop(self):
        return heapq.heappop(self.h).val

    def peek(self):
        return self.h[0].val  # the heap root holds the largest value, O(1)

    def __getitem__(self, i):
        return self.h[i].val


class Point():
    def __init__(self, x, y):
        self.distance = round(sqrt(x**2 + y**2), 3)
        self.coordinates = (x, y)


def find_k_closest(points, k):
    res = [Point(x, y) for (x, y) in points]
    maxh = MaxHeap()

    for i in range(k):
        maxh.heappush(res[i])

    for p in res[k:]:
        if p.distance < maxh.peek():
            maxh.heappop()
            maxh.heappush(p)

    res = [str(x.coordinates) for x in maxh.h]
    print(f"{k} closest points from origin : {', '.join(res)}")


points = [(10, 8), (-2, 4), (0, -2), (-1, 0), (3, 5), (-2, 3), (3, 2), (0, 1)]
find_k_closest(points, 3)

回答 11

为了详细说明https://stackoverflow.com/a/59311063/1328979,这里是针对一般情况的完整记录,带注释和经过测试的Python 3实现。

from __future__ import annotations  # To allow "MinHeap.push -> MinHeap:"
from typing import Generic, List, Optional, TypeVar
from heapq import heapify, heappop, heappush, heapreplace


T = TypeVar('T')


class MinHeap(Generic[T]):
    '''
    MinHeap provides a nicer API around heapq's functionality.
    As it is a minimum heap, the first element of the heap is always the
    smallest.
    >>> h = MinHeap([3, 1, 4, 2])
    >>> h[0]
    1
    >>> h.peek()
    1
    >>> h.push(5)  # N.B.: the array isn't always fully sorted.
    [1, 2, 4, 3, 5]
    >>> h.pop()
    1
    >>> h.pop()
    2
    >>> h.pop()
    3
    >>> h.push(3).push(2)
    [2, 3, 4, 5]
    >>> h.replace(1)
    2
    >>> h
    [1, 3, 4, 5]
    '''
    def __init__(self, array: Optional[List[T]] = None):
        if array is None:
            array = []
        heapify(array)
        self.h = array
    def push(self, x: T) -> MinHeap:
        heappush(self.h, x)
        return self  # To allow chaining operations.
    def peek(self) -> T:
        return self.h[0]
    def pop(self) -> T:
        return heappop(self.h)
    def replace(self, x: T) -> T:
        return heapreplace(self.h, x)
    def __getitem__(self, i) -> T:
        return self.h[i]
    def __len__(self) -> int:
        return len(self.h)
    def __str__(self) -> str:
        return str(self.h)
    def __repr__(self) -> str:
        return str(self.h)


class Reverse(Generic[T]):
    '''
    Wrap around the provided object, reversing the comparison operators.
    >>> 1 < 2
    True
    >>> Reverse(1) < Reverse(2)
    False
    >>> Reverse(2) < Reverse(1)
    True
    >>> Reverse(1) <= Reverse(2)
    False
    >>> Reverse(2) <= Reverse(1)
    True
    >>> Reverse(2) <= Reverse(2)
    True
    >>> Reverse(1) == Reverse(1)
    True
    >>> Reverse(2) > Reverse(1)
    False
    >>> Reverse(1) > Reverse(2)
    True
    >>> Reverse(2) >= Reverse(1)
    False
    >>> Reverse(1) >= Reverse(2)
    True
    >>> Reverse(1)
    1
    '''
    def __init__(self, x: T) -> None:
        self.x = x
    def __lt__(self, other: Reverse) -> bool:
        return other.x.__lt__(self.x)
    def __le__(self, other: Reverse) -> bool:
        return other.x.__le__(self.x)
    def __eq__(self, other) -> bool:
        return self.x == other.x
    def __ne__(self, other: Reverse) -> bool:
        return other.x.__ne__(self.x)
    def __ge__(self, other: Reverse) -> bool:
        return other.x.__ge__(self.x)
    def __gt__(self, other: Reverse) -> bool:
        return other.x.__gt__(self.x)
    def __str__(self):
        return str(self.x)
    def __repr__(self):
        return str(self.x)


class MaxHeap(MinHeap):
    '''
    MaxHeap provides an implementation of a maximum-heap, as heapq does not provide
    it. As it is a maximum heap, the first element of the heap is always the
    largest. It achieves this by wrapping around elements with Reverse,
    which reverses the comparison operations used by heapq.
    >>> h = MaxHeap([3, 1, 4, 2])
    >>> h[0]
    4
    >>> h.peek()
    4
    >>> h.push(5)  # N.B.: the array isn't always fully sorted.
    [5, 4, 3, 1, 2]
    >>> h.pop()
    5
    >>> h.pop()
    4
    >>> h.pop()
    3
    >>> h.pop()
    2
    >>> h.push(3).push(2).push(4)
    [4, 3, 2, 1]
    >>> h.replace(1)
    4
    >>> h
    [3, 1, 2, 1]
    '''
    def __init__(self, array: Optional[List[T]] = None):
        if array is not None:
            array = [Reverse(x) for x in array]  # Wrap with Reverse.
        super().__init__(array)
    def push(self, x: T) -> MaxHeap:
        super().push(Reverse(x))
        return self
    def peek(self) -> T:
        return super().peek().x
    def pop(self) -> T:
        return super().pop().x
    def replace(self, x: T) -> T:
        return super().replace(Reverse(x)).x


if __name__ == '__main__':
    import doctest
    doctest.testmod()

https://gist.github.com/marccarre/577a55850998da02af3d4b7b98152cf4

To elaborate on https://stackoverflow.com/a/59311063/1328979, here is a fully documented, annotated and tested Python 3 implementation for the general case.

from __future__ import annotations  # To allow "MinHeap.push -> MinHeap:"
from typing import Generic, List, Optional, TypeVar
from heapq import heapify, heappop, heappush, heapreplace


T = TypeVar('T')


class MinHeap(Generic[T]):
    '''
    MinHeap provides a nicer API around heapq's functionality.
    As it is a minimum heap, the first element of the heap is always the
    smallest.
    >>> h = MinHeap([3, 1, 4, 2])
    >>> h[0]
    1
    >>> h.peek()
    1
    >>> h.push(5)  # N.B.: the array isn't always fully sorted.
    [1, 2, 4, 3, 5]
    >>> h.pop()
    1
    >>> h.pop()
    2
    >>> h.pop()
    3
    >>> h.push(3).push(2)
    [2, 3, 4, 5]
    >>> h.replace(1)
    2
    >>> h
    [1, 3, 4, 5]
    '''
    def __init__(self, array: Optional[List[T]] = None):
        if array is None:
            array = []
        heapify(array)
        self.h = array
    def push(self, x: T) -> MinHeap:
        heappush(self.h, x)
        return self  # To allow chaining operations.
    def peek(self) -> T:
        return self.h[0]
    def pop(self) -> T:
        return heappop(self.h)
    def replace(self, x: T) -> T:
        return heapreplace(self.h, x)
    def __getitem__(self, i) -> T:
        return self.h[i]
    def __len__(self) -> int:
        return len(self.h)
    def __str__(self) -> str:
        return str(self.h)
    def __repr__(self) -> str:
        return str(self.h)


class Reverse(Generic[T]):
    '''
    Wrap around the provided object, reversing the comparison operators.
    >>> 1 < 2
    True
    >>> Reverse(1) < Reverse(2)
    False
    >>> Reverse(2) < Reverse(1)
    True
    >>> Reverse(1) <= Reverse(2)
    False
    >>> Reverse(2) <= Reverse(1)
    True
    >>> Reverse(2) <= Reverse(2)
    True
    >>> Reverse(1) == Reverse(1)
    True
    >>> Reverse(2) > Reverse(1)
    False
    >>> Reverse(1) > Reverse(2)
    True
    >>> Reverse(2) >= Reverse(1)
    False
    >>> Reverse(1) >= Reverse(2)
    True
    >>> Reverse(1)
    1
    '''
    def __init__(self, x: T) -> None:
        self.x = x
    def __lt__(self, other: Reverse) -> bool:
        return other.x.__lt__(self.x)
    def __le__(self, other: Reverse) -> bool:
        return other.x.__le__(self.x)
    def __eq__(self, other) -> bool:
        return self.x == other.x
    def __ne__(self, other: Reverse) -> bool:
        return other.x.__ne__(self.x)
    def __ge__(self, other: Reverse) -> bool:
        return other.x.__ge__(self.x)
    def __gt__(self, other: Reverse) -> bool:
        return other.x.__gt__(self.x)
    def __str__(self):
        return str(self.x)
    def __repr__(self):
        return str(self.x)


class MaxHeap(MinHeap):
    '''
    MaxHeap provides an implementation of a maximum-heap, as heapq does not provide
    it. As it is a maximum heap, the first element of the heap is always the
    largest. It achieves this by wrapping around elements with Reverse,
    which reverses the comparison operations used by heapq.
    >>> h = MaxHeap([3, 1, 4, 2])
    >>> h[0]
    4
    >>> h.peek()
    4
    >>> h.push(5)  # N.B.: the array isn't always fully sorted.
    [5, 4, 3, 1, 2]
    >>> h.pop()
    5
    >>> h.pop()
    4
    >>> h.pop()
    3
    >>> h.pop()
    2
    >>> h.push(3).push(2).push(4)
    [4, 3, 2, 1]
    >>> h.replace(1)
    4
    >>> h
    [3, 1, 2, 1]
    '''
    def __init__(self, array: Optional[List[T]] = None):
        if array is not None:
            array = [Reverse(x) for x in array]  # Wrap with Reverse.
        super().__init__(array)
    def push(self, x: T) -> MaxHeap:
        super().push(Reverse(x))
        return self
    def peek(self) -> T:
        return super().peek().x
    def pop(self) -> T:
        return super().pop().x
    def replace(self, x: T) -> T:
        return super().replace(Reverse(x)).x


if __name__ == '__main__':
    import doctest
    doctest.testmod()

https://gist.github.com/marccarre/577a55850998da02af3d4b7b98152cf4


Answer 12

This is a simple MaxHeap implementation based on heapq, though it only works with numeric values.

import heapq


class MaxHeap:
    def __init__(self):
        self.data = []

    def top(self):
        return -self.data[0]

    def push(self, val):
        heapq.heappush(self.data, -val)

    def pop(self):
        return -heapq.heappop(self.data)

Usage:

max_heap = MaxHeap()
max_heap.push(3)
max_heap.push(5)
max_heap.push(1)
print(max_heap.top())  # 5
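
The same negation trick can also be sketched with the plain heapq functions, without a wrapper class. This is only an illustrative sketch for numeric values; the helper name largest_first is hypothetical and not part of the answer above.

```python
import heapq


def largest_first(values):
    """Yield numeric values from largest to smallest using a min-heap of negated values."""
    heap = [-v for v in values]      # negate so the largest value becomes the smallest entry
    heapq.heapify(heap)              # build the heap in place
    while heap:
        yield -heapq.heappop(heap)   # undo the negation on the way out


print(list(largest_first([3, 5, 1])))
```

Because the values are negated, this only works for types that support unary minus, which is the limitation mentioned above.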

What is the best way to implement nested dictionaries?

Question: What is the best way to implement nested dictionaries?

I have a data structure which essentially amounts to a nested dictionary. Let’s say it looks like this:

{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}

Now, maintaining and creating this is pretty painful; every time I have a new state/county/profession I have to create the lower layer dictionaries via obnoxious try/catch blocks. Moreover, I have to create annoying nested iterators if I want to go over all the values.

I could also use tuples as keys, like such:

{('new jersey', 'mercer county', 'plumbers'): 3,
 ('new jersey', 'mercer county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'salesmen'): 62,
 ('new york', 'queens county', 'plumbers'): 9,
 ('new york', 'queens county', 'salesmen'): 36}

This makes iterating over the values very simple and natural, but it is more syntactically painful to do things like aggregations and looking at subsets of the dictionary (e.g. if I just want to go state-by-state).

Basically, sometimes I want to think of a nested dictionary as a flat dictionary, and sometimes I want to think of it indeed as a complex hierarchy. I could wrap this all in a class, but it seems like someone might have done this already. Alternatively, it seems like there might be some really elegant syntactical constructions to do this.

How could I do this better?

Addendum: I’m aware of setdefault() but it doesn’t really make for clean syntax. Also, each sub-dictionary you create still needs to have setdefault() manually set.


Answer 0

What is the best way to implement nested dictionaries in Python?

This is a bad idea, don’t do it. Instead, use a regular dictionary and use dict.setdefault where apropos, so when keys are missing under normal usage you get the expected KeyError. If you insist on getting this behavior, here’s how to shoot yourself in the foot:

Implement __missing__ on a dict subclass to set and return a new instance.

This approach has been available (and documented) since Python 2.5, and (particularly valuable to me) it pretty prints just like a normal dict, instead of the ugly printing of an autovivified defaultdict:

class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)() # retain local pointer to value
        return value                     # faster to return than dict lookup

(Note self[key] is on the left-hand side of assignment, so there’s no recursion here.)

and say you have some data:

data = {('new jersey', 'mercer county', 'plumbers'): 3,
        ('new jersey', 'mercer county', 'programmers'): 81,
        ('new jersey', 'middlesex county', 'programmers'): 81,
        ('new jersey', 'middlesex county', 'salesmen'): 62,
        ('new york', 'queens county', 'plumbers'): 9,
        ('new york', 'queens county', 'salesmen'): 36}

Here’s our usage code:

vividict = Vividict()
for (state, county, occupation), number in data.items():
    vividict[state][county][occupation] = number

And now:

>>> import pprint
>>> pprint.pprint(vividict, width=40)
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}

Criticism

A criticism of this type of container is that if the user misspells a key, our code could fail silently:

>>> vividict['new york']['queens counyt']
{}

And additionally now we’d have a misspelled county in our data:

>>> pprint.pprint(vividict, width=40)
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36},
              'queens counyt': {}}}

Explanation:

We’re just providing another nested instance of our class Vividict whenever a key is accessed but missing. (Returning the value assignment is useful because it avoids us additionally calling the getter on the dict, and unfortunately, we can’t return it as it is being set.)
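
To make the "accessed but missing" condition concrete, here is a small sketch (Python 3 assumed). It relies on the documented behaviour that only subscript access on a dict subclass goes through __missing__; the variable names probe and present are just for illustration.

```python
class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value


d = Vividict()

# dict.get() and the `in` operator do not go through __missing__,
# so they can probe for a key without creating it.
probe = d.get('new york')
present = 'new york' in d

# Subscript access on a missing key does call __missing__ and stores a new Vividict.
d['new york']['queens county']['plumbers'] = 9

print(probe, present, d)
```

So a lookup written with get() or `in` leaves the structure untouched, while d[key] on a missing key adds an empty entry.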

Note, these are the same semantics as the most upvoted answer but in half the lines of code – nosklo’s implementation:

class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

Demonstration of Usage

Below is just an example of how this dict could be easily used to create a nested dict structure on the fly. This can quickly create a hierarchical tree structure as deeply as you might want to go.

import pprint

class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value

d = Vividict()

d['foo']['bar']
d['foo']['baz']
d['fizz']['buzz']
d['primary']['secondary']['tertiary']['quaternary']
pprint.pprint(d)

Which outputs:

{'fizz': {'buzz': {}},
 'foo': {'bar': {}, 'baz': {}},
 'primary': {'secondary': {'tertiary': {'quaternary': {}}}}}

And as the last line shows, it pretty prints beautifully and in order for manual inspection. But if you want to visually inspect your data, implementing __missing__ to set a new instance of its class to the key and return it is a far better solution.

Other alternatives, for contrast:

dict.setdefault

Although the asker thinks this isn’t clean, I find it preferable to the Vividict myself.

d = {} # or dict()
for (state, county, occupation), number in data.items():
    d.setdefault(state, {}).setdefault(county, {})[occupation] = number

and now:

>>> pprint.pprint(d, width=40)
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}

A misspelling would fail noisily, and not clutter our data with bad information:

>>> d['new york']['queens counyt']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'queens counyt'

Additionally, I think setdefault works great when used in loops and you don’t know what you’re going to get for keys, but repetitive usage becomes quite burdensome, and I don’t think anyone would want to keep up the following:

d = dict()

d.setdefault('foo', {}).setdefault('bar', {})
d.setdefault('foo', {}).setdefault('baz', {})
d.setdefault('fizz', {}).setdefault('buzz', {})
d.setdefault('primary', {}).setdefault('secondary', {}).setdefault('tertiary', {}).setdefault('quaternary', {})

Another criticism is that setdefault requires a new instance whether it is used or not. However, Python (or at least CPython) is rather smart about handling unused and unreferenced new instances, for example, it reuses the location in memory:

>>> id({}), id({}), id({})
(523575344, 523575344, 523575344)

An auto-vivified defaultdict

This is a neat looking implementation, and usage in a script that you’re not inspecting the data on would be as useful as implementing __missing__:

from collections import defaultdict

def vivdict():
    return defaultdict(vivdict)

But if you need to inspect your data, the results of an auto-vivified defaultdict populated with data in the same way looks like this:

>>> d = vivdict(); d['foo']['bar']; d['foo']['baz']; d['fizz']['buzz']; d['primary']['secondary']['tertiary']['quaternary']; import pprint; 
>>> pprint.pprint(d)
defaultdict(<function vivdict at 0x17B01870>, {'foo': defaultdict(<function vivdict 
at 0x17B01870>, {'baz': defaultdict(<function vivdict at 0x17B01870>, {}), 'bar': 
defaultdict(<function vivdict at 0x17B01870>, {})}), 'primary': defaultdict(<function 
vivdict at 0x17B01870>, {'secondary': defaultdict(<function vivdict at 0x17B01870>, 
{'tertiary': defaultdict(<function vivdict at 0x17B01870>, {'quaternary': defaultdict(
<function vivdict at 0x17B01870>, {})})})}), 'fizz': defaultdict(<function vivdict at 
0x17B01870>, {'buzz': defaultdict(<function vivdict at 0x17B01870>, {})})})

This output is quite inelegant, and the results are quite unreadable. The solution typically given is to recursively convert back to a dict for manual inspection. This non-trivial solution is left as an exercise for the reader.
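
One possible take on that exercise, as a minimal sketch for Python 3. The helper name to_plain_dict is an assumption, and it only descends into values that are themselves defaultdicts.

```python
from collections import defaultdict


def vivdict():
    return defaultdict(vivdict)


def to_plain_dict(d):
    """Recursively copy nested defaultdicts into ordinary dicts."""
    return {
        key: to_plain_dict(value) if isinstance(value, defaultdict) else value
        for key, value in d.items()
    }


d = vivdict()
d['foo']['bar'] = 1
d['primary']['secondary']['tertiary'] = 2

plain = to_plain_dict(d)
print(plain)
```

The result is built from regular dicts, so it pretty prints like one and raises KeyError on missing keys again.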

Performance

Finally, let’s look at performance. I’m subtracting the costs of instantiation.

>>> import timeit
>>> min(timeit.repeat(lambda: {}.setdefault('foo', {}))) - min(timeit.repeat(lambda: {}))
0.13612580299377441
>>> min(timeit.repeat(lambda: vivdict()['foo'])) - min(timeit.repeat(lambda: vivdict()))
0.2936999797821045
>>> min(timeit.repeat(lambda: Vividict()['foo'])) - min(timeit.repeat(lambda: Vividict()))
0.5354437828063965
>>> min(timeit.repeat(lambda: AutoVivification()['foo'])) - min(timeit.repeat(lambda: AutoVivification()))
2.138362169265747

Based on performance, dict.setdefault works the best. I’d highly recommend it for production code, in cases where you care about execution speed.

If you need this for interactive use (in an IPython notebook, perhaps) then performance doesn’t really matter – in which case, I’d go with Vividict for readability of the output. Compared to the AutoVivification object (which uses __getitem__ instead of __missing__, which was made for this purpose) it is far superior.

Conclusion

Implementing __missing__ on a subclassed dict to set and return a new instance is slightly more difficult than alternatives but has the benefits of

  • easy instantiation
  • easy data population
  • easy data viewing

and because it is less complicated and more performant than modifying __getitem__, it should be preferred to that method.

Nevertheless, it has drawbacks:

  • Bad lookups will fail silently.
  • The bad lookup will remain in the dictionary.

Thus I personally prefer setdefault to the other solutions, and have in every situation where I have needed this sort of behavior.


Answer 1

class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

Testing:

a = AutoVivification()

a[1][2][3] = 4
a[1][3][3] = 5
a[1][2]['test'] = 6

print a

Output:

{1: {2: {'test': 6, 3: 4}, 3: {3: 5}}}

Answer 2

Just because I haven’t seen one this small, here’s a dict that gets as nested as you like, no sweat:

from collections import defaultdict

# yo dawg, i heard you liked dicts
def yodict():
    return defaultdict(yodict)

Answer 3

You could create a YAML file and read it in using PyYaml.

Step 1: Create a YAML file, “employment.yml”:

new jersey:
  mercer county:
    plumbers: 3
    programmers: 81
  middlesex county:
    salesmen: 62
    programmers: 81
new york:
  queens county:
    plumbers: 9
    salesmen: 36

Step 2: Read it in Python

import yaml

with open("employment.yml") as file_handle:
    my_shnazzy_dictionary = yaml.safe_load(file_handle)

and now my_shnazzy_dictionary has all your values. If you needed to do this on the fly, you can create the YAML as a string and feed that into yaml.safe_load(...).


Answer 4

Since you have a star-schema design, you might want to structure it more like a relational table and less like a dictionary.

import collections

class Jobs( object ):
    def __init__( self, state, county, title, count ):
        self.state= state
        self.county= county
        self.title= title
        self.count= count

facts = [
    Jobs( 'new jersey', 'mercer county', 'plumbers', 3 ),
    ...
]

def groupBy( facts, name ):
    total= collections.defaultdict( int )
    for f in facts:
        key= getattr( f, name )
        total[key] += f.count
    return total

That kind of thing can go a long way to creating a data warehouse-like design without the SQL overheads.
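
A self-contained variant of the same idea, as a sketch for Python 3. The dataclass Job and the function name group_by are stand-ins chosen for illustration, and the rows are taken from the question's sample data.

```python
import collections
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """One row of the fact table."""
    state: str
    county: str
    title: str
    count: int


facts = [
    Job('new jersey', 'mercer county', 'plumbers', 3),
    Job('new jersey', 'mercer county', 'programmers', 81),
    Job('new jersey', 'middlesex county', 'programmers', 81),
    Job('new jersey', 'middlesex county', 'salesmen', 62),
    Job('new york', 'queens county', 'plumbers', 9),
    Job('new york', 'queens county', 'salesmen', 36),
]


def group_by(facts, name):
    """Sum the count column, grouped by the attribute called `name`."""
    total = collections.defaultdict(int)
    for fact in facts:
        total[getattr(fact, name)] += fact.count
    return dict(total)


print(group_by(facts, 'state'))
print(group_by(facts, 'title'))
```

Grouping by any dimension is then a one-line call, which is the aggregation the nested-dict layout makes awkward.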


Answer 5

If the number of nesting levels is small, I use collections.defaultdict for this:

from collections import defaultdict

def nested_dict_factory(): 
  return defaultdict(int)
def nested_dict_factory2(): 
  return defaultdict(nested_dict_factory)
db = defaultdict(nested_dict_factory2)

db['new jersey']['mercer county']['plumbers'] = 3
db['new jersey']['mercer county']['programmers'] = 81

Using defaultdict like this avoids a lot of messy setdefault(), get(), etc.
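
Because the innermost level is a defaultdict(int), counts can also be accumulated with += without initialising anything first. A small sketch, with factory names and sample observations that are made up for illustration:

```python
from collections import defaultdict


def county_factory():
    return defaultdict(int)              # job title -> count, missing titles start at 0


def state_factory():
    return defaultdict(county_factory)   # county -> job titles


db = defaultdict(state_factory)          # state -> counties

# Hypothetical stream of observations, one per person.
observations = [
    ('new jersey', 'mercer county', 'plumbers'),
    ('new jersey', 'mercer county', 'plumbers'),
    ('new york', 'queens county', 'salesmen'),
]

for state, county, job in observations:
    db[state][county][job] += 1          # no setdefault() or get() needed

print(db['new jersey']['mercer county']['plumbers'])
```

The trade-off is that the depth is fixed by the factories: a fourth level would need another factory function.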


Answer 6

This is a function that returns a nested dictionary of arbitrary depth:

from collections import defaultdict
def make_dict():
    return defaultdict(make_dict)

Use it like this:

d=defaultdict(make_dict)
d["food"]["meat"]="beef"
d["food"]["veggie"]="corn"
d["food"]["sweets"]="ice cream"
d["animal"]["pet"]["dog"]="collie"
d["animal"]["pet"]["cat"]="tabby"
d["animal"]["farm animal"]="chicken"

Iterate through everything with something like this:

def iter_all(d,depth=1):
    for k,v in d.iteritems():
        print "-"*depth,k
        if type(v) is defaultdict:
            iter_all(v,depth+1)
        else:
            print "-"*(depth+1),v

iter_all(d)

This prints out:

- food
-- sweets
--- ice cream
-- meat
--- beef
-- veggie
--- corn
- animal
-- pet
--- dog
---- collie
--- cat
---- tabby
-- farm animal
--- chicken

You might eventually want to make it so that new items can not be added to the dict. It’s easy to recursively convert all these defaultdicts to normal dicts.

def dictify(d):
    for k,v in d.iteritems():
        if isinstance(v,defaultdict):
            d[k] = dictify(v)
    return dict(d)

Answer 7

I find setdefault quite useful; it checks whether a key is present and adds it if not:

d = {}
d.setdefault('new jersey', {}).setdefault('mercer county', {})['plumbers'] = 3

setdefault always returns the value stored under the key (the existing one, or the default it has just inserted), so you are actually updating the contents of d in place.
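
A quick way to check what setdefault hands back, as a small sketch (the variable names inner and same are only for illustration):

```python
d = {}

inner = d.setdefault('new jersey', {})   # key missing: stores the new {} and returns it
inner['mercer county'] = {'plumbers': 3}

same = d.setdefault('new jersey', {})    # key present: returns the stored dict, ignores the new {}

print(same is inner)
print(d)
```

Since both calls return the very same dict object that lives inside d, mutating the result mutates d.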

When it comes to iterating, I’m sure you could write a generator easily enough if one doesn’t already exist in Python:

def iterateStates(d):
    # Let's count up the total number of "plumbers" / "dentists" / etc.
    # across all counties and states
    job_totals = {}

    # I guess this is the annoying nested stuff you were talking about?
    for (state, counties) in d.iteritems():
        for (county, jobs) in counties.iteritems():
            for (job, num) in jobs.iteritems():
                # If job isn't already in job_totals, default it to zero
                job_totals[job] = job_totals.get(job, 0) + num

    # Now return an iterator of (job, number) tuples
    return job_totals.iteritems()

# Display all jobs
for (job, num) in iterateStates(d):
    print "There are %d %s in total" % (num, job)

Answer 8

As others have suggested, a relational database could be more useful to you. You can use an in-memory sqlite3 database as a data structure to create tables and then query them.

import sqlite3

c = sqlite3.Connection(':memory:')
c.execute('CREATE TABLE jobs (state, county, title, count)')

c.executemany('insert into jobs values (?, ?, ?, ?)', [
    ('New Jersey', 'Mercer County',    'Programmers', 81),
    ('New Jersey', 'Mercer County',    'Plumbers',     3),
    ('New Jersey', 'Middlesex County', 'Programmers', 81),
    ('New Jersey', 'Middlesex County', 'Salesmen',    62),
    ('New York',   'Queens County',    'Salesmen',    36),
    ('New York',   'Queens County',    'Plumbers',     9),
])

# some example queries
print list(c.execute('SELECT * FROM jobs WHERE county = "Queens County"'))
print list(c.execute('SELECT SUM(count) FROM jobs WHERE title = "Programmers"'))

This is just a simple example. You could define separate tables for states, counties and job titles.
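
For the aggregation part of the question, a GROUP BY query does the state-by-state roll-up. The following is a sketch for Python 3 using the same single-table layout; it passes values as bound parameters rather than quoting them inside the SQL string.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE jobs (state, county, title, count)')
conn.executemany('INSERT INTO jobs VALUES (?, ?, ?, ?)', [
    ('New Jersey', 'Mercer County',    'Programmers', 81),
    ('New Jersey', 'Mercer County',    'Plumbers',     3),
    ('New Jersey', 'Middlesex County', 'Programmers', 81),
    ('New Jersey', 'Middlesex County', 'Salesmen',    62),
    ('New York',   'Queens County',    'Salesmen',    36),
    ('New York',   'Queens County',    'Plumbers',     9),
])

# Total headcount per state.
totals = conn.execute(
    'SELECT state, SUM(count) FROM jobs GROUP BY state ORDER BY state'
).fetchall()

# Total for one job title, with the title supplied as a parameter.
programmers = conn.execute(
    'SELECT SUM(count) FROM jobs WHERE title = ?', ('Programmers',)
).fetchone()[0]

print(totals)
print(programmers)
conn.close()
```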


Answer 9

collections.defaultdict can be sub-classed to make a nested dict. Then add any useful iteration methods to that class.

>>> from collections import defaultdict
>>> class nesteddict(defaultdict):
    def __init__(self):
        defaultdict.__init__(self, nesteddict)
    def walk(self):
        for key, value in self.iteritems():
            if isinstance(value, nesteddict):
                for tup in value.walk():
                    yield (key,) + tup
            else:
                yield key, value


>>> nd = nesteddict()
>>> nd['new jersey']['mercer county']['plumbers'] = 3
>>> nd['new jersey']['mercer county']['programmers'] = 81
>>> nd['new jersey']['middlesex county']['programmers'] = 81
>>> nd['new jersey']['middlesex county']['salesmen'] = 62
>>> nd['new york']['queens county']['plumbers'] = 9
>>> nd['new york']['queens county']['salesmen'] = 36
>>> for tup in nd.walk():
    print tup


('new jersey', 'mercer county', 'programmers', 81)
('new jersey', 'mercer county', 'plumbers', 3)
('new jersey', 'middlesex county', 'programmers', 81)
('new jersey', 'middlesex county', 'salesmen', 62)
('new york', 'queens county', 'salesmen', 36)
('new york', 'queens county', 'plumbers', 9)

Answer 10

As for “obnoxious try/catch blocks”:

d = {}
d.setdefault('key',{}).setdefault('inner key',{})['inner inner key'] = 'value'
print(d)

yields

{'key': {'inner key': {'inner inner key': 'value'}}}

You can use this to convert from your flat dictionary format to structured format:

fd = {('new jersey', 'mercer county', 'plumbers'): 3,
 ('new jersey', 'mercer county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'salesmen'): 62,
 ('new york', 'queens county', 'plumbers'): 9,
 ('new york', 'queens county', 'salesmen'): 36}

for (k1, k2, k3), v in fd.items():
    d.setdefault(k1, {}).setdefault(k2, {})[k3] = v
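
The setdefault loop above assumes exactly three key parts. As a sketch, a small helper (the name nest is hypothetical, not from the answer) can handle tuple keys of any length in the same way:

```python
def nest(flat):
    """Turn {(k1, k2, ..., kn): value} into nested plain dicts."""
    nested = {}
    for keys, value in flat.items():
        level = nested
        # Walk (and create) every level except the last key part.
        for key in keys[:-1]:
            level = level.setdefault(key, {})
        level[keys[-1]] = value
    return nested


flat = {
    ('new jersey', 'mercer county', 'plumbers'): 3,
    ('new jersey', 'mercer county', 'programmers'): 81,
    ('new york', 'queens county', 'salesmen'): 36,
}
nested = nest(flat)
print(nested)
```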

Answer 11

You can use Addict: https://github.com/mewwts/addict

>>> from addict import Dict
>>> my_new_shiny_dict = Dict()
>>> my_new_shiny_dict.a.b.c.d.e = 2
>>> my_new_shiny_dict
{'a': {'b': {'c': {'d': {'e': 2}}}}}

Answer 12

defaultdict() is your friend!

For a two dimensional dictionary you can do:

from collections import defaultdict

d = defaultdict(defaultdict)
d[1][2] = 3

For more dimensions you can:

d = defaultdict(lambda :defaultdict(defaultdict))
d[1][2][3] = 4
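
Each extra dimension above needs another layer of nesting in the constructor. If the depth isn't known in advance, one common alternative is a self-referential factory; the function name tree below is just an illustrative choice:

```python
from collections import defaultdict


def tree():
    """Return a defaultdict that creates another tree() for any missing key."""
    return defaultdict(tree)


d = tree()
d[1][2][3] = 4      # three levels
d['a']['b'] = 5     # two levels, same object
print(d[1][2][3], d['a']['b'])
```

Note that merely reading a missing key, such as d['x']['y'], also creates the intermediate levels.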

Answer 13

For easy iterating over your nested dictionary, why not just write a simple generator?

def each_job(my_dict):
    for state, a in my_dict.items():
        for county, b in a.items():
            for job, value in b.items():
                yield {
                    'state'  : state,
                    'county' : county,
                    'job'    : job,
                    'value'  : value
                }

So then, if you have your complicated nested dictionary, iterating over it becomes simple:

for r in each_job(my_dict):
    print("There are %d %s in %s, %s" % (r['value'], r['job'], r['county'], r['state']))

Obviously your generator can yield whatever format of data is useful to you.

Why are you using try/catch blocks to read the tree? It’s easy enough (and probably safer) to query whether a key exists in a dict before trying to retrieve it. A function using guard clauses might look like this:

if 'new jersey' not in my_dict:
    return False

nj_dict = my_dict['new jersey']
...

Or, perhaps somewhat more verbosely, use the get method:

value = my_dict.get('new jersey', {}).get('middlesex county', {}).get('salesmen', 0)

For a more succinct approach, you might want to look at collections.defaultdict, which has been part of the standard library since Python 2.5.

import collections

def state_struct(): return collections.defaultdict(county_struct)
def county_struct(): return collections.defaultdict(job_struct)
def job_struct(): return 0

my_dict = collections.defaultdict(state_struct)

print(my_dict['new jersey']['middlesex county']['salesmen'])

I’m making assumptions about the meaning of your data structure here, but it should be easy to adjust for what you actually want to do.


Answer 14

I like the idea of wrapping this in a class and implementing __getitem__ and __setitem__ such that they implement a simple query language:

>>> d['new jersey/mercer county/plumbers'] = 3
>>> d['new jersey/mercer county/programmers'] = 81
>>> d['new jersey/mercer county/programmers']
81
>>> d['new jersey/mercer county']
<view which implicitly adds 'new jersey/mercer county' to queries/mutations>

If you wanted to get fancy you could also implement something like:

>>> d['*/*/programmers']
<view which would contain 'programmers' entries>

but mostly I think such a thing would be really fun to implement :D
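
A minimal sketch of the first half of that idea, assuming '/'-separated string paths and plain nested dicts underneath. The class name PathDict is hypothetical, a partial path returns the underlying sub-dict rather than a special view, and the wildcard queries are left out:

```python
class PathDict:
    """Nested dicts addressed by '/'-separated paths (illustrative only)."""

    def __init__(self, sep='/'):
        self._sep = sep
        self._root = {}

    def __setitem__(self, path, value):
        *parents, leaf = path.split(self._sep)
        node = self._root
        # Create intermediate levels as needed.
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value

    def __getitem__(self, path):
        node = self._root
        for part in path.split(self._sep):
            node = node[part]  # raises KeyError for an unknown segment
        return node


d = PathDict()
d['new jersey/mercer county/plumbers'] = 3
d['new jersey/mercer county/programmers'] = 81
print(d['new jersey/mercer county/programmers'])
print(d['new jersey/mercer county'])
```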


Answer 15

Unless your dataset is going to stay pretty small, you might want to consider using a relational database. It will do exactly what you want: make it easy to add counts, select subsets of counts, and even aggregate counts by state, county, occupation, or any combination of these.
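
As an illustration, here is a minimal sketch using the sqlite3 module from the standard library with an in-memory database. The table name jobs and the column names are assumptions made up for this example:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE jobs (state TEXT, county TEXT, occupation TEXT, n INTEGER)'
)
rows = [
    ('new jersey', 'mercer county', 'plumbers', 3),
    ('new jersey', 'mercer county', 'programmers', 81),
    ('new jersey', 'middlesex county', 'programmers', 81),
    ('new jersey', 'middlesex county', 'salesmen', 62),
    ('new york', 'queens county', 'plumbers', 9),
    ('new york', 'queens county', 'salesmen', 36),
]
conn.executemany('INSERT INTO jobs VALUES (?, ?, ?, ?)', rows)

# Aggregate counts by state.
by_state = conn.execute(
    'SELECT state, SUM(n) FROM jobs GROUP BY state ORDER BY state'
).fetchall()

# Select a subset: every plumber row, whatever the state or county.
plumbers = conn.execute(
    'SELECT state, county, n FROM jobs WHERE occupation = ? ORDER BY state',
    ('plumbers',),
).fetchall()

print(by_state)
print(plumbers)
conn.close()
```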


Answer 16

class JobDb(object):
    def __init__(self):
        self.data = []      # slot -> (key, value), or None for a freed slot
        self.all = set()    # slots currently in use
        self.free = []      # freed slots available for reuse
        self.index1 = {}    # first key part -> set of slots
        self.index2 = {}    # second key part -> set of slots
        self.index3 = {}    # third key part -> set of slots

    def _indices(self, key):
        # None acts as a wildcard for that position of the 3-part key.
        key1, key2, key3 = key
        indices = self.all.copy()
        wild = False
        for index, part in ((self.index1, key1), (self.index2, key2),
                            (self.index3, key3)):
            if part is not None:
                indices &= index.get(part, set())
            else:
                wild = True
        return indices, wild

    def __getitem__(self, key):
        indices, wild = self._indices(key)
        if wild:
            return dict(self.data[i] for i in indices)
        else:
            values = [self.data[i][-1] for i in indices]
            if values:
                return values[0]

    def __setitem__(self, key, value):
        indices, wild = self._indices(key)
        if indices:
            for i in indices:
                # Keep the stored full key; only the value changes.
                self.data[i] = self.data[i][0], value
        elif wild:
            raise KeyError(key)
        else:
            if self.free:
                index = self.free.pop(0)
                self.data[index] = key, value
            else:
                index = len(self.data)
                self.data.append((key, value))
            # Register the slot whether it is new or reused.
            self.all.add(index)
            self.index1.setdefault(key[0], set()).add(index)
            self.index2.setdefault(key[1], set()).add(index)
            self.index3.setdefault(key[2], set()).add(index)

    def __delitem__(self, key):
        indices, wild = self._indices(key)
        if not indices:
            raise KeyError(key)
        # Drop the slots from every index set, so wildcard deletes work too.
        for index in (self.index1, self.index2, self.index3):
            for slots in index.values():
                slots -= indices
        self.all -= indices
        for i in indices:
            self.data[i] = None
        self.free.extend(indices)

    def __len__(self):
        return len(self.all)

    def __iter__(self):
        for item in self.data:
            if item is not None:    # skip freed slots
                yield item[0]

Example:

>>> db = JobDb()
>>> db['new jersey', 'mercer county', 'plumbers'] = 3
>>> db['new jersey', 'mercer county', 'programmers'] = 81
>>> db['new jersey', 'middlesex county', 'programmers'] = 81
>>> db['new jersey', 'middlesex county', 'salesmen'] = 62
>>> db['new york', 'queens county', 'plumbers'] = 9
>>> db['new york', 'queens county', 'salesmen'] = 36

>>> db['new york', None, None]
{('new york', 'queens county', 'plumbers'): 9,
 ('new york', 'queens county', 'salesmen'): 36}

>>> db[None, None, 'plumbers']
{('new jersey', 'mercer county', 'plumbers'): 3,
 ('new york', 'queens county', 'plumbers'): 9}

>>> db['new jersey', 'mercer county', None]
{('new jersey', 'mercer county', 'plumbers'): 3,
 ('new jersey', 'mercer county', 'programmers'): 81}

>>> db['new jersey', 'middlesex county', 'programmers']
81

>>>

Edit: Now returning dictionaries when querying with wild cards (None), and single values otherwise.


Answer 17

I have a similar thing going. I have a lot of cases where I do:

thedict = {}
for item in ('foo', 'bar', 'baz'):
  mydict = thedict.get(item, {})
  mydict = get_value_for(item)
  thedict[item] = mydict

But going many levels deep. It’s the “.get(item, {})” that’s the key as it’ll make another dictionary if there isn’t one already. Meanwhile, I’ve been thinking of ways to deal with this better. Right now, there’s a lot of

value = mydict.get('foo', {}).get('bar', {}).get('baz', 0)

So instead, I made:

def dictgetter(thedict, default, *args):
  totalargs = len(args)
  for i,arg in enumerate(args):
    if i+1 == totalargs:
      thedict = thedict.get(arg, default)
    else:
      thedict = thedict.get(arg, {})
  return thedict

Which has the same effect if you do:

value = dictgetter(mydict, 0, 'foo', 'bar', 'baz')

Better? I think so.


Answer 18

You can use recursion in lambdas and defaultdict, no need to define names:

from collections import defaultdict

a = defaultdict((lambda f: f(f))(lambda g: lambda:defaultdict(g(g))))

Here’s an example:

>>> a['new jersey']['mercer county']['plumbers']=3
>>> a['new jersey']['middlesex county']['programmers']=81
>>> a['new jersey']['mercer county']['programmers']=81
>>> a['new jersey']['middlesex county']['salesmen']=62
>>> a
defaultdict(<function __main__.<lambda>>,
        {'new jersey': defaultdict(<function __main__.<lambda>>,
                     {'mercer county': defaultdict(<function __main__.<lambda>>,
                                  {'plumbers': 3, 'programmers': 81}),
                      'middlesex county': defaultdict(<function __main__.<lambda>>,
                                  {'programmers': 81, 'salesmen': 62})})})

Answer 19

I used to use this function. It’s safe, quick, and easy to maintain.

from functools import reduce

def deep_get(dictionary, keys, default=None):
    return reduce(lambda d, key: d.get(key, default) if isinstance(d, dict) else default, keys.split("."), dictionary)

Example:

>>> from functools import reduce
>>> def deep_get(dictionary, keys, default=None):
...     return reduce(lambda d, key: d.get(key, default) if isinstance(d, dict) else default, keys.split("."), dictionary)
...
>>> person = {'person':{'name':{'first':'John'}}}
>>> print (deep_get(person, "person.name.first"))
John
>>> print (deep_get(person, "person.name.lastname"))
None
>>> print (deep_get(person, "person.name.lastname", default="No lastname"))
No lastname
>>>

Python sets vs. lists

Question: Python sets vs. lists

In Python, which data structure is more efficient/speedy? Assuming that order is not important to me and I would be checking for duplicates anyway, is a Python set slower than a Python list?


Answer 0

It depends on what you are intending to do with it.

Sets are significantly faster when it comes to determining if an object is present in the set (as in x in s), but are slower than lists when it comes to iterating over their contents.

You can use the timeit module to see which is faster for your situation.


Answer 1

Lists are slightly faster than sets when you just want to iterate over the values.

Sets, however, are significantly faster than lists if you want to check if an item is contained within it. They can only contain unique items though.

It turns out tuples perform in almost exactly the same way as lists, except for their immutability.

Iterating

>>> def iter_test(iterable):
...     for i in iterable:
...         pass
...
>>> from timeit import timeit
>>> timeit(
...     "iter_test(iterable)",
...     setup="from __main__ import iter_test; iterable = set(range(10000))",
...     number=100000)
12.666952133178711
>>> timeit(
...     "iter_test(iterable)",
...     setup="from __main__ import iter_test; iterable = list(range(10000))",
...     number=100000)
9.917098999023438
>>> timeit(
...     "iter_test(iterable)",
...     setup="from __main__ import iter_test; iterable = tuple(range(10000))",
...     number=100000)
9.865639209747314

Determine if an object is present

>>> def in_test(iterable):
...     for i in range(1000):
...         if i in iterable:
...             pass
...
>>> from timeit import timeit
>>> timeit(
...     "in_test(iterable)",
...     setup="from __main__ import in_test; iterable = set(range(1000))",
...     number=10000)
0.5591847896575928
>>> timeit(
...     "in_test(iterable)",
...     setup="from __main__ import in_test; iterable = list(range(1000))",
...     number=10000)
50.18339991569519
>>> timeit(
...     "in_test(iterable)",
...     setup="from __main__ import in_test; iterable = tuple(range(1000))",
...     number=10000)
51.597304821014404

Answer 2

List performance:

>>> import timeit
>>> timeit.timeit(stmt='10**6 in a', setup='a = range(10**6)', number=100000)
0.008128150348026608

Set performance:

>>> timeit.timeit(stmt='10**6 in a', setup='a = set(range(10**6))', number=100000)
0.005674857488571661
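
One caveat, assuming the timings above were taken on Python 3: there, range(10**6) is a lazy range object rather than a list, and its membership test for integers does not scan the elements, so the first measurement is not really a list search. A sketch of a fairer comparison wraps the range in list(); the variable names are illustrative, and the actual numbers will vary by machine:

```python
from timeit import timeit

n = 10**5
as_list = list(range(n))   # a real list: membership scans element by element
as_set = set(as_list)      # hash-based membership
as_range = range(n)        # Python 3 range object, not a list

missing = n  # worst case for the list: the value is not present

list_time = timeit(lambda: missing in as_list, number=100)
set_time = timeit(lambda: missing in as_set, number=100)
range_time = timeit(lambda: missing in as_range, number=100)

print(f"list:  {list_time:.5f}s")
print(f"set:   {set_time:.5f}s")
print(f"range: {range_time:.5f}s")
```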

You may want to consider tuples, as they’re similar to lists but can’t be modified. They take up slightly less memory and are faster to access. They aren’t as flexible as lists but are more efficient. They are commonly used as dictionary keys.

Sets are also collections, but with two differences from lists and tuples. First, although iterating over a set yields its elements in some order, that order is arbitrary and not under the programmer’s control. Second, the elements in a set must be unique.

A set contains only unique elements by definition. [python | wiki]

>>> x = set([1, 1, 2, 2, 3, 3])
>>> x
{1, 2, 3}

Answer 3

Set wins due to near-instant ‘contains’ checks: https://en.wikipedia.org/wiki/Hash_table

List implementation: usually an array, low-level and close to the metal, good for iteration and random access by element index.

Set implementation: a hash table (https://en.wikipedia.org/wiki/Hash_table). It does not iterate over a list, but finds the element by computing a hash from the key, so performance depends on the nature of the key elements and the hash function. This is similar to what is used for dict. I suspect a list could be faster if you have very few elements (< 5); the larger the element count, the better the set will perform for a contains check. A set is also fast for element addition and removal. Also, always keep in mind that building a set has a cost!

NOTE: If the list is already sorted, searching the list could be quite fast on small lists, but with more data a set is faster for contains checks.
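
For the sorted-list case, the standard library’s bisect module gives a binary search. A minimal sketch, with a hypothetical helper name contains_sorted:

```python
from bisect import bisect_left


def contains_sorted(sorted_items, target):
    """Binary-search membership test; sorted_items must be sorted ascending."""
    i = bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target


data = [2, 3, 5, 7, 11, 13]
print(contains_sorted(data, 7))
print(contains_sorted(data, 8))
```

This takes a logarithmic number of comparisons, but it only works if the list is kept sorted, whereas a set needs no ordering at all.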


Answer 4

tl;dr

Data structures (DS) are important because they are used to perform operations on data which basically implies: take some input, process it, and give back the output.

Some data structures are more useful than others in particular cases. Therefore, it is rather unfair to ask which data structure is more efficient/speedy. It is like asking which tool is more efficient, a knife or a fork. It all depends on the situation.

Lists

A list is a mutable sequence, typically used to store collections of homogeneous items.

Sets

A set object is an unordered collection of distinct hashable objects. It is commonly used to test membership, remove duplicates from a sequence, and compute mathematical operations such as intersection, union, difference, and symmetric difference.

Usage

From some of the answers, it is clear that a list is somewhat faster than a set when iterating over the values. On the other hand, a set is faster than a list when checking whether an item is contained within it. Therefore, the only thing you can say is that a list is better than a set for some particular operations and vice versa.


Answer 5

I was interested in the results when checking, with CPython, if a value is one of a small number of literals. set wins in Python 3 vs tuple, list and or:

from timeit import timeit

def in_test1():
  for i in range(1000):
    if i in (314, 628):
      pass

def in_test2():
  for i in range(1000):
    if i in [314, 628]:
      pass

def in_test3():
  for i in range(1000):
    if i in {314, 628}:
      pass

def in_test4():
  for i in range(1000):
    if i == 314 or i == 628:
      pass

print("tuple")
print(timeit("in_test1()", setup="from __main__ import in_test1", number=100000))
print("list")
print(timeit("in_test2()", setup="from __main__ import in_test2", number=100000))
print("set")
print(timeit("in_test3()", setup="from __main__ import in_test3", number=100000))
print("or")
print(timeit("in_test4()", setup="from __main__ import in_test4", number=100000))

Output:

tuple
4.735646052286029
list
4.7308746771886945
set
3.5755991376936436
or
4.687681658193469

For 3 to 5 literals, set still wins by a wide margin, and or becomes the slowest.

In Python 2, set is always the slowest. or is the fastest for 2 to 3 literals, and tuple and list are faster with 4 or more literals. I couldn’t distinguish the speed of tuple vs list.

When the values to test were cached in a global variable out of the function, rather than creating the literal within the loop, set won every time, even in Python 2.

These results apply to 64-bit CPython on a Core i7.


Answer 6

I would recommend a set where the use case is limited to referencing or checking for existence, and a tuple where the use case requires you to perform iteration. A list is a low-level implementation and requires significant memory overhead.


Answer 7

from datetime import datetime

listA = list(range(10000000))
setA = set(listA)
tupA = tuple(listA)

# Source code
def calc(data, container):
    # Time a single membership test against the given container.
    start = datetime.now()
    if data in container:
        print("")
    end = datetime.now()
    print(end - start)

calc(9999, listA)
calc(9999, tupA)
calc(9999, setA)

Output after comparing 10 iterations for all 3 : Comparison


Answer 8

Sets are faster; moreover, you get more operations with sets. For example, say you have two sets:

set1 = {"Harry Potter", "James Bond", "Iron Man"}
set2 = {"Captain America", "Black Widow", "Hulk", "Harry Potter", "James Bond"}

We can easily join two sets:

set3 = set1.union(set2)

Find out what is common in both:

set3 = set1.intersection(set2)

Find out what is different in both:

set3 = set1.difference(set2)

And much more! Just try them out, they are fun! Moreover, if you have to work with the differing values or the common values of two lists, I prefer to convert the lists to sets, and many programmers do it that way. Hope it helps you :-)
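
As a small sketch of that list-to-set conversion, reusing the sample names from above (the variable names are illustrative):

```python
list1 = ["Harry Potter", "James Bond", "Iron Man"]
list2 = ["Captain America", "Black Widow", "Hulk", "Harry Potter", "James Bond"]

common = set(list1) & set(list2)           # same as .intersection()
only_in_list1 = set(list1) - set(list2)    # same as .difference()
in_exactly_one = set(list1) ^ set(list2)   # same as .symmetric_difference()

print(common)
print(only_in_list1)
print(in_exactly_one)
```

Keep in mind that converting to a set discards both duplicates and the original order of the lists.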


In Python, when should you use a dictionary, list, or set?

Question: In Python, when should you use a dictionary, list, or set?

When should I use a dictionary, list or set?

Are there scenarios that are more suited for each data type?


Answer 0

A list keeps order, dict and set don’t: when you care about order, therefore, you must use list (if your choice of containers is limited to these three, of course;-).

dict associates with each key a value, while list and set just contain values: very different use cases, obviously.

set requires items to be hashable, list doesn’t: if you have non-hashable items, therefore, you cannot use set and must instead use list.

set forbids duplicates, list does not: also a crucial distinction. (A “multiset”, which maps duplicates into a different count for items present more than once, can be found in collections.Counter — you could build one as a dict, if for some weird reason you couldn’t import collections, or, in pre-2.7 Python as a collections.defaultdict(int), using the items as keys and the associated value as the count).

Checking for membership of a value in a set (or dict, for keys) is blazingly fast (taking about a constant, short time), while in a list it takes time proportional to the list’s length in the average and worst cases. So, if you have hashable items, don’t care either way about order or duplicates, and want speedy membership checking, set is better than list.
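
To illustrate the multiset mentioned above, here is a minimal sketch with collections.Counter; the sample words are made up for the example:

```python
from collections import Counter

words = ['spam', 'egg', 'spam', 'ham', 'spam', 'egg']
counts = Counter(words)

# Items that occur more than once, with their counts.
repeated = {word: n for word, n in counts.items() if n > 1}

print(counts['spam'])      # count of a present item
print(counts['missing'])   # absent items count as zero, no KeyError
print(repeated)
```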


Answer 1

  • Do you just need an ordered sequence of items? Go for a list.
  • Do you just need to know whether or not you’ve already got a particular value, but without ordering (and you don’t need to store duplicates)? Use a set.
  • Do you need to associate values with keys, so you can look them up efficiently (by key) later on? Use a dictionary.

Answer 2

When you want an unordered collection of unique elements, use a set. (For example, when you want the set of all the words used in a document).

When you want to collect an immutable ordered list of elements, use a tuple. (For example, when you want a (name, phone_number) pair that you wish to use as an element in a set, you would need a tuple rather than a list since sets require elements be immutable).

When you want to collect a mutable ordered list of elements, use a list. (For example, when you want to append new phone numbers to a list: [number1, number2, …]).

When you want a mapping from keys to values, use a dict. (For example, when you want a telephone book which maps names to phone numbers: {'John Smith' : '555-1212'}). Note the keys in a dict are unordered. (If you iterate through a dict (telephone book), the keys (names) may show up in any order).
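
A short sketch of the tuple-versus-list point: a (name, phone_number) tuple can go into a set, while the equivalent list is rejected. The second contact is made up for the example:

```python
contacts = set()
contacts.add(('John Smith', '555-1212'))    # a tuple of strings is hashable

try:
    contacts.add(['Jane Doe', '555-3434'])  # a list is mutable, so unhashable
except TypeError as exc:
    error = str(exc)
    print(error)

# The same pairs can seed a dict that maps names to phone numbers.
phone_book = {name: number for name, number in contacts}
print(phone_book)
```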


Answer 3

  • Use a dictionary when you have a set of unique keys that map to values.

  • Use a list if you have an ordered collection of items.

  • Use a set to store an unordered set of items.


Answer 4

In short, use:

list – if you need an ordered sequence of items.

dict – if you need to associate values with keys.

set – if you need to keep only unique elements.

Detailed Explanation

List

A list is a mutable sequence, typically used to store collections of homogeneous items.

A list implements all of the common sequence operations:

  • x in l and x not in l
  • l[i], l[i:j], l[i:j:k]
  • len(l), min(l), max(l)
  • l.count(x)
  • l.index(x[, i[, j]]) – index of the first occurrence of x in l (at or after index i and before index j)

A list also implements all of the mutable sequence operations:

  • l[i] = x – item i of l is replaced by x
  • l[i:j] = t – slice of l from i to j is replaced by the contents of the iterable t
  • del l[i:j] – same as l[i:j] = []
  • l[i:j:k] = t – the elements of l[i:j:k] are replaced by those of t
  • del l[i:j:k] – removes the elements of l[i:j:k] from the list
  • l.append(x) – appends x to the end of the sequence
  • l.clear() – removes all items from l (same as del l[:])
  • l.copy() – creates a shallow copy of l (same as l[:])
  • l.extend(t) or l += t – extends l with the contents of t
  • l *= n – updates l with its contents repeated n times
  • l.insert(i, x) – inserts x into l at the index given by i
  • l.pop([i]) – retrieves the item at i and also removes it from l
  • l.remove(x) – removes the first item of l that is equal to x
  • l.reverse() – reverses the items of l in place
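
A few of the mutable sequence operations listed above, applied one after another to a small example list. The values are arbitrary; the comments show the state of l after each step:

```python
l = [0, 1, 2, 3, 4, 5]

l[1:3] = ["a", "b", "c"]   # the slice may be replaced by a longer iterable
# l == [0, 'a', 'b', 'c', 3, 4, 5]

del l[::2]                 # delete every second element
# l == ['a', 'c', 4]

l.extend([9, 9])           # same effect as l += [9, 9]
# l == ['a', 'c', 4, 9, 9]

l.insert(1, "x")           # insert before index 1
# l == ['a', 'x', 'c', 4, 9, 9]

l.remove(9)                # removes only the first 9
# l == ['a', 'x', 'c', 4, 9]

last = l.pop()             # with no index, pop takes the last item
# last == 9, l == ['a', 'x', 'c', 4]

l.reverse()                # in place, returns None
print(l)
```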

A list can be used as a stack by taking advantage of the methods append and pop.
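
As one possible illustration of the stack idiom, the sketch below checks whether brackets are balanced. The function name is_balanced is a hypothetical helper, not something from the answer:

```python
def is_balanced(text):
    """Return True if every bracket in text is closed in the right order."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in text:
        if ch in pairs.values():
            stack.append(ch)              # push an opening bracket
        elif ch in pairs:
            # pop the most recent opening bracket and compare it
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack                      # leftovers mean unclosed brackets


print(is_balanced("([]{})"))
print(is_balanced("(]"))
```

Both append and pop work at the end of the list, which is the cheap end to modify; for a queue that removes from the front, collections.deque is usually the better fit.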

Dictionary

A dictionary maps hashable values to arbitrary objects. A dictionary is a mutable object. The main operations on a dictionary are storing a value with some key and extracting the value given the key.

In a dictionary, you cannot use values that are not hashable as keys, that is, lists, dictionaries, or other mutable types, or values containing them.
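
The hashability rule can be checked directly. In this sketch the dictionary name grid and the stored values are made up for illustration:

```python
grid = {}

# Hashable keys: a tuple of ints and a frozenset both work.
grid[(0, 1)] = "wall"
grid[frozenset({"a", "b"})] = "pair"

# A list is mutable and therefore not hashable.
try:
    grid[[0, 1]] = "wall"
except TypeError as exc:
    list_error = str(exc)

# A tuple is only hashable if everything inside it is hashable too.
try:
    grid[(0, [1])] = "wall"
except TypeError as exc:
    nested_error = str(exc)

print(list_error)
print(nested_error)
```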

Set

A set is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference.
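
The uses mentioned above, shown on two small example sets (the values are arbitrary):

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

print(3 in a)    # membership testing
print(a & b)     # intersection: elements in both
print(a | b)     # union: elements in either
print(a - b)     # difference: elements in a but not in b
print(a ^ b)     # symmetric difference: elements in exactly one of the two

# Removing duplicates from a sequence; sorted() gives a predictable order back,
# because the set itself does not keep one.
unique_sorted = sorted(set([3, 1, 3, 2, 1]))
print(unique_sorted)
```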


Answer 5

Although this doesn’t cover sets, it is a good explanation of dicts and lists:

Lists are what they seem – a list of values. Each one of them is numbered, starting from zero – the first one is numbered zero, the second 1, the third 2, etc. You can remove values from the list, and add new values to the end. Example: Your many cats’ names.

Dictionaries are similar to what their name suggests – a dictionary. In a dictionary, you have an ‘index’ of words, and for each of them a definition. In python, the word is called a ‘key’, and the definition a ‘value’. The values in a dictionary aren’t numbered – they aren’t in any specific order, either – the key does the same thing. You can add, remove, and modify the values in dictionaries. Example: telephone book.

http://www.sthurlow.com/python/lesson06/
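
The two examples from the quoted explanation might look like this in code. The cat names and phone numbers are invented placeholders:

```python
# A list: values numbered from zero, with new values added at the end.
cats = ["Tom", "Snappy", "Kitty"]
cats.append("Jessie")      # add to the end
del cats[1]                # remove the value at index 1 ("Snappy")
print(cats[0], cats)

# A dictionary: values are reached through their key, not a position.
phone_book = {"Andrew": "8806336"}
phone_book["Emily"] = "6784346"     # add a new entry
phone_book["Andrew"] = "1112223"    # modify an existing entry
del phone_book["Emily"]             # remove an entry
print(phone_book)
```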


Answer 6

For C++ I always had this flow chart in mind: In which scenario do I use a particular STL container? I was curious whether something similar is available for Python 3 as well, but I had no luck.

What you need to keep in mind for Python is: there is no single Python standard as there is for C++. Hence there might be huge differences between Python interpreters (e.g. CPython, PyPy). The following flow chart is for CPython.

Additionally, I found no good way to incorporate the following data structures into the diagram: bytes, bytearray, tuple, namedtuple, ChainMap, Counter, and array.

  • OrderedDict and deque are available via collections module.
  • heapq is available from the heapq module
  • LifoQueue, Queue, and PriorityQueue are available via the queue module, which is designed for concurrent (threaded) access. (There is also a multiprocessing.Queue, but I don’t know how it differs from queue.Queue; I would assume it should be used when concurrent access from processes is needed.)
  • dict, set, frozenset, and list are built in, of course
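
For orientation, here is a minimal sketch that touches each of the modules listed above. It only uses standard-library calls; the values are arbitrary:

```python
import heapq
import queue
from collections import OrderedDict, deque

# deque: cheap appends and pops at both ends.
d = deque([1, 2, 3])
d.appendleft(0)
right = d.pop()            # removes 3 from the right end

# heapq: functions that keep a plain list arranged as a min-heap.
heap = []
for value in [5, 1, 3]:
    heapq.heappush(heap, value)
smallest = heapq.heappop(heap)

# queue.Queue / LifoQueue / PriorityQueue: thread-safe queues.
fifo = queue.Queue()
fifo.put("first")
fifo.put("second")
fifo_item = fifo.get()

lifo = queue.LifoQueue()
lifo.put("first")
lifo.put("second")
lifo_item = lifo.get()

pq = queue.PriorityQueue()
pq.put((2, "low priority"))
pq.put((1, "high priority"))
pq_item = pq.get()         # the entry with the lowest number comes out first

# OrderedDict: a dict subclass with extra order-related methods.
od = OrderedDict(a=1, b=2)
od.move_to_end("a")

print(list(d), smallest, fifo_item, lifo_item, pq_item, list(od))
```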

I would be grateful to anyone who can improve this answer and provide a better diagram in every respect. Feel free to do so.

[Figure: flowchart for choosing a CPython data structure]

PS: the diagram was made with yEd. The graphml file is here


Answer 7

In combination with lists, dicts and sets, there is another interesting Python object: OrderedDict.

Ordered dictionaries are just like regular dictionaries but they remember the order that items were inserted. When iterating over an ordered dictionary, the items are returned in the order their keys were first added.

OrderedDicts could be useful when you need to preserve the order of the keys, for example working with documents: It’s common to need the vector representation of all terms in a document. So using OrderedDicts you can efficiently verify if a term has been read before, add terms, extract terms, and after all the manipulations you can extract the ordered vector representation of them.
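
A rough sketch of the document use case described above: counting terms while keeping them in first-seen order. The helper name term_counts and the sample sentence are assumptions for illustration. Note that since Python 3.7 a plain dict also preserves insertion order; OrderedDict still adds methods such as move_to_end and an order-sensitive equality check.

```python
from collections import OrderedDict


def term_counts(words):
    """Count each term, keeping terms in the order they first appear."""
    counts = OrderedDict()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


document = "the cat saw the other cat".split()
counts = term_counts(document)

seen_before = "cat" in counts          # fast check whether a term was read
terms = list(counts)                   # terms in first-seen order
vector = list(counts.values())         # matching ordered vector of counts
print(terms, vector)

# Features a plain dict does not offer:
counts.move_to_end("the")              # push a key to the end of the order
reordered = list(counts)
order_matters = OrderedDict(a=1, b=2) != OrderedDict(b=2, a=1)
print(reordered, order_matters)
```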


Answer 8

Lists are what they seem – a list of values. Each one of them is numbered, starting from zero – the first one is numbered zero, the second 1, the third 2, etc. You can remove values from the list, and add new values to the end. Example: Your many cats’ names.

Tuples are just like lists, but you can’t change their values. The values that you give it first up, are the values that you are stuck with for the rest of the program. Again, each value is numbered starting from zero, for easy reference. Example: the names of the months of the year.
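
The "stuck with the values" behaviour of tuples can be seen in a short sketch, using a shortened version of the months example:

```python
months = ("January", "February", "March")

first = months[0]          # values are numbered from zero, as in a list

# Trying to change a value raises TypeError instead of modifying the tuple.
try:
    months[0] = "Jan"
except TypeError as exc:
    error = str(exc)

print(first)
print(error)
```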

Dictionaries are similar to what their name suggests – a dictionary. In a dictionary, you have an ‘index’ of words, and for each of them a definition. In python, the word is called a ‘key’, and the definition a ‘value’. The values in a dictionary aren’t numbered – they aren’t in any specific order, either – the key does the same thing. You can add, remove, and modify the values in dictionaries. Example: telephone book.


Answer 9

When using them, I keep an exhaustive cheat sheet of their methods; here it is for your reference:

class ContainerMethods:
    """Cheat sheet of the public methods of list, tuple, dict and set, grouped by purpose."""

    def __init__(self):
        # The numeric suffix of each attribute is the total number of public methods.
        self.list_methods_11 = {
            'Add': {'append', 'extend', 'insert'},
            'Subtract': {'pop', 'remove'},
            'Sort': {'reverse', 'sort'},
            'Search': {'count', 'index'},
            'Entire': {'clear', 'copy'},
        }
        self.tuple_methods_2 = {'Search': {'count', 'index'}}

        self.dict_methods_11 = {
            'Views': {'keys', 'values', 'items'},
            'Add': {'update'},
            'Subtract': {'pop', 'popitem'},
            'Extract': {'get', 'setdefault'},
            'Entire': {'clear', 'copy', 'fromkeys'},
        }
        self.set_methods_17 = {
            # A set cannot contain lists, so the two original sub-groups get separate keys.
            'Add': {'add', 'update'},
            'Update': {'difference_update', 'symmetric_difference_update', 'intersection_update'},
            'Subtract': {'pop', 'remove', 'discard'},
            'Relation': {'isdisjoint', 'issubset', 'issuperset'},
            'Operation': {'union', 'intersection', 'difference', 'symmetric_difference'},
            'Entire': {'clear', 'copy'},
        }
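
A cheat sheet like this can be cross-checked against the interpreter. The sketch below uses dir() to list the public names of each built-in type; public_methods is a hypothetical helper name, and the exact counts could differ between Python versions:

```python
def public_methods(container_type):
    """Return the sorted public (non-underscore) attribute names of a type."""
    return sorted(name for name in dir(container_type) if not name.startswith("_"))


for container_type in (list, tuple, dict, set):
    names = public_methods(container_type)
    print(container_type.__name__, len(names), names)
```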

Answer 10

Dictionary: A Python dictionary is used like a hash table, with the key as the index and an object as the value.

List: A list is used for holding objects in an array indexed by position of that object in the array.

Set: A set is a collection with functions that can tell if an object is present or not present in the set.