Python 实用宝典

Question 1

How can I convert a set to a list in Python? Using

a = set(["Blah", "Hello"])
a = list(a)

doesn’t work. It gives me:

TypeError: 'set' object is not callable

Question 2

Your code does work (tested with cpython 2.4, 2.5, 2.6, 2.7, 3.1 and 3.2):

>>> a = set(["Blah", "Hello"])
>>> a = list(a) # You probably wrote a = list(a()) here or list = set() above
>>> a
['Blah', 'Hello']

Check that you didn’t overwrite list by accident:

>>> assert list == __builtins__.list

Question 3

You’ve shadowed the builtin set by accidentally using it as a variable name, here is a simple way to replicate your error

>>> set=set()
>>> set=set()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object is not callable

The first line rebinds set to an instance of set. The second line is trying to call the instance which of course fails.

Here is a less confusing version using different names for each variable. Using a fresh interpreter

>>> a=set()
>>> b=a()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object is not callable

Hopefully it is obvious that calling a is an error

Question 4

before you write set(XXXXX) you have used “set” as a variable e.g.

set = 90 #you have used "set" as an object
…
…
a = set(["Blah", "Hello"])
a = list(a)

Question 5

This will work:

>>> t = [1,1,2,2,3,3,4,5]
>>> print list(set(t))
[1,2,3,4,5]

However, if you have used “list” or “set” as a variable name you will get the:

TypeError: 'set' object is not callable

eg:

>>> set = [1,1,2,2,3,3,4,5]
>>> print list(set(set))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'list' object is not callable

Same error will occur if you have used “list” as a variable name.

Question 6

s = set([1,2,3])
print [ x for x in iter(s) ]

Question 7

Your code works with Python 3.2.1 on Win7 x64

a = set(["Blah", "Hello"])
a = list(a)
type(a)
<class 'list'>

Question 8

Try using combination of map and lambda functions:

aList = map( lambda x: x, set ([1, 2, 6, 9, 0]) )

It is very convenient approach if you have a set of numbers in string and you want to convert it to list of integers:

aList = map( lambda x: int(x), set (['1', '2', '3', '7', '12']) )

Question 9

I don’t understand how looping over a dictionary or set in python is done by ‘arbitrary’ order.

I mean, it’s a programming language so everything in the language must be 100% determined, correct? Python must have some kind of algorithm that decides which part of the dictionary or set is chosen, 1st, second and so on.

What am I missing?

Question 10

Note: This answer was written before the implementation of the dict type changed, in Python 3.6. Most of the implementation details in this answer still apply, but the listing order of keys in dictionaries is no longer determined by hash values. The set implementation remains unchanged.

The order is not arbitrary, but depends on the insertion and deletion history of the dictionary or set, as well as on the specific Python implementation. For the remainder of this answer, for ‘dictionary’, you can also read ‘set’; sets are implemented as dictionaries with just keys and no values.

Keys are hashed, and hash values are assigned to slots in a dynamic table (it can grow or shrink based on needs). And that mapping process can lead to collisions, meaning that a key will have to be slotted in a next slot based on what is already there.

Listing the contents loops over the slots, and so keys are listed in the order they currently reside in the table.

Take the keys 'foo' and 'bar', for example, and lets assume the table size is 8 slots. In Python 2.7, hash('foo') is -4177197833195190597, hash('bar') is 327024216814240868. Modulo 8, that means these two keys are slotted in slots 3 and 4 then:

>>> hash('foo')
-4177197833195190597
>>> hash('foo') % 8
3
>>> hash('bar')
327024216814240868
>>> hash('bar') % 8
4

This informs their listing order:

>>> {'bar': None, 'foo': None}
{'foo': None, 'bar': None}

All slots except 3 and 4 are empty, looping over the table first lists slot 3, then slot 4, so 'foo' is listed before 'bar'.

bar and baz, however, have hash values that are exactly 8 apart and thus map to the exact same slot, 4:

>>> hash('bar')
327024216814240868
>>> hash('baz')
327024216814240876
>>> hash('bar') % 8
4
>>> hash('baz') % 8
4

Their order now depends on which key was slotted first; the second key will have to be moved to a next slot:

>>> {'baz': None, 'bar': None}
{'bar': None, 'baz': None}
>>> {'bar': None, 'baz': None}
{'baz': None, 'bar': None}

The table order differs here, because one or the other key was slotted first.

The technical name for the underlying structure used by CPython (the most commonly used Python implemenation) is a hash table, one that uses open addressing. If you are curious, and understand C well enough, take a look at the C implementation for all the (well documented) details. You could also watch this Pycon 2010 presentation by Brandon Rhodes about how CPython dict works, or pick up a copy of Beautiful Code, which includes a chapter on the implementation written by Andrew Kuchling.

Note that as of Python 3.3, a random hash seed is used as well, making hash collisions unpredictable to prevent certain types of denial of service (where an attacker renders a Python server unresponsive by causing mass hash collisions). This means that the order of a given dictionary or set is then also dependent on the random hash seed for the current Python invocation.

Other implementations are free to use a different structure for dictionaries, as long as they satisfy the documented Python interface for them, but I believe that all implementations so far use a variation of the hash table.

CPython 3.6 introduces a new dict implementation that maintains insertion order, and is faster and more memory efficient to boot. Rather than keep a large sparse table where each row references the stored hash value, and the key and value objects, the new implementation adds a smaller hash array that only references indices in a separate ‘dense’ table (one that only contains as many rows as there are actual key-value pairs), and it is the dense table that happens to list the contained items in order. See the proposal to Python-Dev for more details. Note that in Python 3.6 this is considered an implementation detail, Python-the-language does not specify that other implementations have to retain order. This changed in Python 3.7, where this detail was elevated to be a language specification; for any implementation to be properly compatible with Python 3.7 or newer it must copy this order-preserving behaviour. And to be explicit: this change doesn’t apply to sets, as sets already have a ‘small’ hash structure.

Python 2.7 and newer also provides an OrderedDict class, a subclass of dict that adds an additional data structure to record key order. At the price of some speed and extra memory, this class remembers in what order you inserted keys; listing keys, values or items will then do so in that order. It uses a doubly-linked list stored in an additional dictionary to keep the order up-to-date efficiently. See the post by Raymond Hettinger outlining the idea. OrderedDict objects have other advantages, such as being re-orderable.

If you wanted an ordered set, you can install the oset package; it works on Python 2.5 and up.

Question 11

This is more a response to Python 3.41 A set before it was closed as a duplicate.

The others are right: don’t rely on the order. Don’t even pretend there is one.

That said, there is one thing you can rely on:

list(myset) == list(myset)

That is, the order is stable.

Understanding why there is a perceived order requires understanding a few things:

That Python uses hash sets,
How CPython’s hash set is stored in memory and
How numbers get hashed

From the top:

A hash set is a method of storing random data with really fast lookup times.

It has a backing array:

# A C array; items may be NULL,
# a pointer to an object, or a
# special dummy object
_ _ 4 _ _ 2 _ _ 6

We shall ignore the special dummy object, which exists only to make removes easier to deal with, because we won’t be removing from these sets.

In order to have really fast lookup, you do some magic to calculate a hash from an object. The only rule is that two objects which are equal have the same hash. (But if two objects have the same hash they can be unequal.)

You then make in index by taking the modulus by the array length:

hash(4) % len(storage) = index 2

This makes it really fast to access elements.

Hashes are only most of the story, as hash(n) % len(storage) and hash(m) % len(storage) can result in the same number. In that case, several different strategies can try and resolve the conflict. CPython uses “linear probing” 9 times before doing complicated things, so it will look to the left of the slot for up to 9 places before looking elsewhere.

CPython’s hash sets are stored like this:

A hash set can be no more than 2/3 full. If there are 20 elements and the backing array is 30 elements long, the backing store will resize to be larger. This is because you get collisions more often with small backing stores, and collisions slow everything down.
The backing store resizes in powers of 4, starting at 8, except for large sets (50k elements) which resize in powers of two: (8, 32, 128, …).

So when you create an array the backing store is length 8. When it is 5 full and you add an element, it will briefly contain 6 elements. 6 > ²⁄₃·8 so this triggers a resize, and the backing store quadruples to size 32.

Finally, hash(n) just returns n for numbers (except -1 which is special).

So, let’s look at the first one:

v_set = {88,11,1,33,21,3,7,55,37,8}

len(v_set) is 10, so the backing store is at least 15(+1) after all items have been added. The relevant power of 2 is 32. So the backing store is:

__ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __

We have

hash(88) % 32 = 24
hash(11) % 32 = 11
hash(1)  % 32 = 1
hash(33) % 32 = 1
hash(21) % 32 = 21
hash(3)  % 32 = 3
hash(7)  % 32 = 7
hash(55) % 32 = 23
hash(37) % 32 = 5
hash(8)  % 32 = 8

so these insert as:

__  1 __  3 __ 37 __  7  8 __ __ 11 __ __ __ __ __ __ __ __ __ 21 __ 55 88 __ __ __ __ __ __ __
   33 ← Can't also be where 1 is;
        either 1 or 33 has to move

So we would expect an order like

{[1 or 33], 3, 37, 7, 8, 11, 21, 55, 88}

with the 1 or 33 that isn’t at the start somewhere else. This will use linear probing, so we will either have:

       ↓
__  1 33  3 __ 37 __  7  8 __ __ 11 __ __ __ __ __ __ __ __ __ 21 __ 55 88 __ __ __ __ __ __ __

or

       ↓
__ 33  1  3 __ 37 __  7  8 __ __ 11 __ __ __ __ __ __ __ __ __ 21 __ 55 88 __ __ __ __ __ __ __

You might expect the 33 to be the one that’s displaced because the 1 was already there, but due to the resizing that happens as the set is being built, this isn’t actually the case. Every time the set gets rebuilt, the items already added are effectively reordered.

Now you can see why

{7,5,11,1,4,13,55,12,2,3,6,20,9,10}

might be in order. There are 14 elements, so the backing store is at least 21+1, which means 32:

__ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __

1 to 13 hash in the first 13 slots. 20 goes in slot 20.

__  1  2  3  4  5  6  7  8  9 10 11 12 13 __ __ __ __ __ __ 20 __ __ __ __ __ __ __ __ __ __ __

55 goes in slot hash(55) % 32 which is 23:

__  1  2  3  4  5  6  7  8  9 10 11 12 13 __ __ __ __ __ __ 20 __ __ 55 __ __ __ __ __ __ __ __

If we chose 50 instead, we’d expect

__  1  2  3  4  5  6  7  8  9 10 11 12 13 __ __ __ __ 50 __ 20 __ __ __ __ __ __ __ __ __ __ __

And lo and behold:

{1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 20, 50}
#>>> {1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 50, 20}

pop is implemented quite simply by the looks of things: it traverses the list and pops the first one.

This is all implementation detail.

Question 12

“Arbitrary” isn’t the same thing as “non-determined”.

What they’re saying is that there are no useful properties of dictionary iteration order that are “in the public interface”. There almost certainly are many properties of the iteration order that are fully determined by the code that currently implements dictionary iteration, but the authors aren’t promising them to you as something you can use. This gives them more freedom to change these properties between Python versions (or even just in different operating conditions, or completely at random at runtime) without worrying that your program will break.

Thus if you write a program that depends on any property at all of dictionary order, then you are “breaking the contract” of using the dictionary type, and the Python developers are not promising that this will always work, even if it appears to work for now when you test it. It’s basically the equivalent of relying on “undefined behaviour” in C.

Question 13

The other answers to this question are excellent and well written. The OP asks “how” which I interpret as “how do they get away with” or “why”.

The Python documentation says dictionaries are not ordered because the Python dictionary implements the abstract data type associative array. As they say

the order in which the bindings are returned may be arbitrary

In other words, a computer science student cannot assume that an associative array is ordered. The same is true for sets in math

the order in which the elements of a set are listed is irrelevant

and computer science

a set is an abstract data type that can store certain values, without any particular order

Implementing a dictionary using a hash table is an implementation detail that is interesting in that it has the same properties as associative arrays as far as order is concerned.

Question 14

Python use hash table for storing the dictionaries, so there is no order in dictionaries or other iterable objects that use hash table.

But regarding the indices of items in a hash object, python calculate the indices based on following code within hashtable.c:

key_hash = ht->hash_func(key);
index = key_hash & (ht->num_buckets - 1);

Therefor, as the hash value of integers is the integer itself^* the index is based on the number (ht->num_buckets - 1 is a constant) so the index calculated by Bitwise-and between (ht->num_buckets - 1) and the number itself^* (expect for -1 which it’s hash is -2) , and for other objects with their hash value.

consider the following example with set that use hash-table :

>>> set([0,1919,2000,3,45,33,333,5])
set([0, 33, 3, 5, 45, 333, 2000, 1919])

For number 33 we have :

33 & (ht->num_buckets - 1) = 1

That actually it’s :

'0b100001' & '0b111'= '0b1' # 1 the index of 33

Note in this case (ht->num_buckets - 1) is 8-1=7 or 0b111.

And for 1919 :

'0b11101111111' & '0b111' = '0b111' # 7 the index of 1919

And for 333 :

'0b101001101' & '0b111' = '0b101' # 5 the index of 333

For more details about python hash function its good to read the following quotes from python source code :

Major subtleties ahead: Most hash schemes depend on having a “good” hash function, in the sense of simulating randomness. Python doesn’t: its most important hash functions (for strings and ints) are very regular in common cases:
>>> map(hash, (0, 1, 2, 3))
  [0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
  [-1658398457, -1658398460, -1658398459, -1658398462]
This isn’t necessarily bad! To the contrary, in a table of size 2**i, taking the low-order i bits as the initial table index is extremely fast, and there are no collisions at all for dicts indexed by a contiguous range of ints. The same is approximately true when keys are “consecutive” strings. So this gives better-than-random behavior in common cases, and that’s very desirable.

OTOH, when collisions occur, the tendency to fill contiguous slices of the hash table makes a good collision resolution strategy crucial. Taking only the last i bits of the hash code is also vulnerable: for example, consider the list [i << 16 for i in range(20000)] as a set of keys. Since ints are their own hash codes, and this fits in a dict of size 2**15, the last 15 bits of every hash code are all 0: they all map to the same table index.

But catering to unusual cases should not slow the usual ones, so we just take the last i bits anyway. It’s up to collision resolution to do the rest. If we usually find the key we’re looking for on the first try (and, it turns out, we usually do — the table load factor is kept under 2/3, so the odds are solidly in our favor), then it makes best sense to keep the initial index computation dirt cheap.

_{* The hash function for class int :}

class int:
    def __hash__(self):
        value = self
        if value == -1:
            value = -2
        return value

Question 15

Starting with Python 3.7 (and already in CPython 3.6), dictionary items stay in the order they were inserted.

Question 16

I’ve seen people say that set objects in python have O(1) membership-checking. How are they implemented internally to allow this? What sort of data structure does it use? What other implications does that implementation have?

Every answer here was really enlightening, but I can only accept one, so I’ll go with the closest answer to my original question. Thanks all for the info!

Question 17

According to this thread:

Indeed, CPython’s sets are implemented as something like dictionaries with dummy values (the keys being the members of the set), with some optimization(s) that exploit this lack of values

So basically a set uses a hashtable as its underlying data structure. This explains the O(1) membership checking, since looking up an item in a hashtable is an O(1) operation, on average.

If you are so inclined you can even browse the CPython source code for set which, according to Achim Domma, is mostly a cut-and-paste from the dict implementation.

Question 18

When people say sets have O(1) membership-checking, they are talking about the average case. In the worst case (when all hashed values collide) membership-checking is O(n). See the Python wiki on time complexity.

The Wikipedia article says the best case time complexity for a hash table that does not resize is O(1 + k/n). This result does not directly apply to Python sets since Python sets use a hash table that resizes.

A little further on the Wikipedia article says that for the average case, and assuming a simple uniform hashing function, the time complexity is O(1/(1-k/n)), where k/n can be bounded by a constant c<1.

Big-O refers only to asymptotic behavior as n → ∞. Since k/n can be bounded by a constant, c<1, independent of n,

O(1/(1-k/n)) is no bigger than O(1/(1-c)) which is equivalent to O(constant) = O(1).

So assuming uniform simple hashing, on average, membership-checking for Python sets is O(1).

Question 19

I think its a common mistake, set lookup (or hashtable for that matter) are not O(1).
from the Wikipedia

In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size n with open addressing has no collisions and holds up to n elements, with a single comparison for successful lookup, and a table of size n with chaining and k keys has the minimum max(0, k-n) collisions and O(1 + k/n) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(k) amortized comparisons per insertion and up to k comparisons for a successful lookup.

Related: Is a Java hashmap really O(1)?

Question 20

We all have easy access to the source, where the comment preceding set_lookkey() says:

/* set object implementation
 Written and maintained by Raymond D. Hettinger <python@rcn.com>
 Derived from Lib/sets.py and Objects/dictobject.c.
 The basic lookup function used by all operations.
 This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
 The initial probe index is computed as hash mod the table size.
 Subsequent probe indices are computed as explained in Objects/dictobject.c.
 To improve cache locality, each probe inspects a series of consecutive
 nearby entries before moving on to probes elsewhere in memory.  This leaves
 us with a hybrid of linear probing and open addressing.  The linear probing
 reduces the cost of hash collisions because consecutive memory accesses
 tend to be much cheaper than scattered probes.  After LINEAR_PROBES steps,
 we then use open addressing with the upper bits from the hash value.  This
 helps break-up long chains of collisions.
 All arithmetic on hash should ignore overflow.
 Unlike the dictionary implementation, the lookkey function can return
 NULL if the rich comparison returns an error.
*/


...
#ifndef LINEAR_PROBES
#define LINEAR_PROBES 9
#endif

/* This must be >= 1 */
#define PERTURB_SHIFT 5

static setentry *
set_lookkey(PySetObject *so, PyObject *key, Py_hash_t hash)  
{
...

Question 21

To emphasize a little more the difference between set's and dict's, here is an excerpt from the setobject.c comment sections, which clarify’s the main difference of set’s against dicts.

Use cases for sets differ considerably from dictionaries where looked-up keys are more likely to be present. In contrast, sets are primarily about membership testing where the presence of an element is not known in advance. Accordingly, the set implementation needs to optimize for both the found and not-found case.

source on github

Question 22

I have a Python set that contains objects with __hash__ and __eq__ methods in order to make certain no duplicates are included in the collection.

I need to json encode this result set, but passing even an empty set to the json.dumps method raises a TypeError.

  File "/usr/lib/python2.7/json/encoder.py", line 201, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 264, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python2.7/json/encoder.py", line 178, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: set([]) is not JSON serializable

I know I can create an extension to the json.JSONEncoder class that has a custom default method, but I’m not even sure where to begin in converting over the set. Should I create a dictionary out of the set values within the default method, and then return the encoding on that? Ideally, I’d like to make the default method able to handle all the datatypes that the original encoder chokes on (I’m using Mongo as a data source so dates seem to raise this error too)

Any hint in the right direction would be appreciated.

EDIT:

Thanks for the answer! Perhaps I should have been more precise.

I utilized (and upvoted) the answers here to get around the limitations of the set being translated, but there are internal keys that are an issue as well.

The objects in the set are complex objects that translate to __dict__, but they themselves can also contain values for their properties that could be ineligible for the basic types in the json encoder.

There’s a lot of different types coming into this set, and the hash basically calculates a unique id for the entity, but in the true spirit of NoSQL there’s no telling exactly what the child object contains.

One object might contain a date value for starts, whereas another may have some other schema that includes no keys containing “non-primitive” objects.

That is why the only solution I could think of was to extend the JSONEncoder to replace the default method to turn on different cases – but I’m not sure how to go about this and the documentation is ambiguous. In nested objects, does the value returned from default go by key, or is it just a generic include/discard that looks at the whole object? How does that method accommodate nested values? I’ve looked through previous questions and can’t seem to find the best approach to case-specific encoding (which unfortunately seems like what I’m going to need to do here).

Question 23

JSON notation has only a handful of native datatypes (objects, arrays, strings, numbers, booleans, and null), so anything serialized in JSON needs to be expressed as one of these types.

As shown in the json module docs, this conversion can be done automatically by a JSONEncoder and JSONDecoder, but then you would be giving up some other structure you might need (if you convert sets to a list, then you lose the ability to recover regular lists; if you convert sets to a dictionary using dict.fromkeys(s) then you lose the ability to recover dictionaries).

A more sophisticated solution is to build-out a custom type that can coexist with other native JSON types. This lets you store nested structures that include lists, sets, dicts, decimals, datetime objects, etc.:

from json import dumps, loads, JSONEncoder, JSONDecoder
import pickle

class PythonObjectEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, (list, dict, str, unicode, int, float, bool, type(None))):
            return JSONEncoder.default(self, obj)
        return {'_python_object': pickle.dumps(obj)}

def as_python_object(dct):
    if '_python_object' in dct:
        return pickle.loads(str(dct['_python_object']))
    return dct

Here is a sample session showing that it can handle lists, dicts, and sets:

>>> data = [1,2,3, set(['knights', 'who', 'say', 'ni']), {'key':'value'}, Decimal('3.14')]

>>> j = dumps(data, cls=PythonObjectEncoder)

>>> loads(j, object_hook=as_python_object)
[1, 2, 3, set(['knights', 'say', 'who', 'ni']), {u'key': u'value'}, Decimal('3.14')]

Alternatively, it may be useful to use a more general purpose serialization technique such as YAML, Twisted Jelly, or Python’s pickle module. These each support a much greater range of datatypes.

Question 24

You can create a custom encoder that returns a list when it encounters a set. Here’s an example:

>>> import json
>>> class SetEncoder(json.JSONEncoder):
...    def default(self, obj):
...       if isinstance(obj, set):
...          return list(obj)
...       return json.JSONEncoder.default(self, obj)
... 
>>> json.dumps(set([1,2,3,4,5]), cls=SetEncoder)
'[1, 2, 3, 4, 5]'

You can detect other types this way too. If you need to retain that the list was actually a set, you could use a custom encoding. Something like return {'type':'set', 'list':list(obj)} might work.

To illustrated nested types, consider serializing this:

>>> class Something(object):
...    pass
>>> json.dumps(set([1,2,3,4,5,Something()]), cls=SetEncoder)

This raises the following error:

TypeError: <__main__.Something object at 0x1691c50> is not JSON serializable

This indicates that the encoder will take the list result returned and recursively call the serializer on its children. To add a custom serializer for multiple types, you can do this:

>>> class SetEncoder(json.JSONEncoder):
...    def default(self, obj):
...       if isinstance(obj, set):
...          return list(obj)
...       if isinstance(obj, Something):
...          return 'CustomSomethingRepresentation'
...       return json.JSONEncoder.default(self, obj)
... 
>>> json.dumps(set([1,2,3,4,5,Something()]), cls=SetEncoder)
'[1, 2, 3, 4, 5, "CustomSomethingRepresentation"]'

Question 25

I adapted Raymond Hettinger’s solution to python 3.

Here is what has changed:

unicode disappeared
updated the call to the parents’ default with super()
using base64 to serialize the bytes type into str (because it seems that bytes in python 3 can’t be converted to JSON)

from decimal import Decimal
from base64 import b64encode, b64decode
from json import dumps, loads, JSONEncoder
import pickle

class PythonObjectEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, (list, dict, str, int, float, bool, type(None))):
            return super().default(obj)
        return {'_python_object': b64encode(pickle.dumps(obj)).decode('utf-8')}

def as_python_object(dct):
    if '_python_object' in dct:
        return pickle.loads(b64decode(dct['_python_object'].encode('utf-8')))
    return dct

data = [1,2,3, set(['knights', 'who', 'say', 'ni']), {'key':'value'}, Decimal('3.14')]
j = dumps(data, cls=PythonObjectEncoder)
print(loads(j, object_hook=as_python_object))
# prints: [1, 2, 3, {'knights', 'who', 'say', 'ni'}, {'key': 'value'}, Decimal('3.14')]

Question 26

Only dictionaries, Lists and primitive object types (int, string, bool) are available in JSON.

Question 27

You don’t need to make a custom encoder class to supply the default method – it can be passed in as a keyword argument:

import json

def serialize_sets(obj):
    if isinstance(obj, set):
        return list(obj)

    return obj

json_str = json.dumps(set([1,2,3]), default=serialize_sets)
print(json_str)

results in [1, 2, 3] in all supported Python versions.

Question 28

If you only need to encode sets, not general Python objects, and want to keep it easily human-readable, a simplified version of Raymond Hettinger’s answer can be used:

import json
import collections

class JSONSetEncoder(json.JSONEncoder):
    """Use with json.dumps to allow Python sets to be encoded to JSON

    Example
    -------

    import json

    data = dict(aset=set([1,2,3]))

    encoded = json.dumps(data, cls=JSONSetEncoder)
    decoded = json.loads(encoded, object_hook=json_as_python_set)
    assert data == decoded     # Should assert successfully

    Any object that is matched by isinstance(obj, collections.Set) will
    be encoded, but the decoded value will always be a normal Python set.

    """

    def default(self, obj):
        if isinstance(obj, collections.Set):
            return dict(_set_object=list(obj))
        else:
            return json.JSONEncoder.default(self, obj)

def json_as_python_set(dct):
    """Decode json {'_set_object': [1,2,3]} to set([1,2,3])

    Example
    -------
    decoded = json.loads(encoded, object_hook=json_as_python_set)

    Also see :class:`JSONSetEncoder`

    """
    if '_set_object' in dct:
        return set(dct['_set_object'])
    return dct

Question 29

If you need just quick dump and don’t want to implement custom encoder. You can use the following:

json_string = json.dumps(data, iterable_as_array=True)

This will convert all sets (and other iterables) into arrays. Just beware that those fields will stay arrays when you parse the json back. If you want to preserve the types, you need to write custom encoder.

Question 30

One shortcoming of the accepted solution is that its output is very python specific. I.e. its raw json output cannot be observed by a human or loaded by another language (e.g. javascript). example:

db = {
        "a": [ 44, set((4,5,6)) ],
        "b": [ 55, set((4,3,2)) ]
        }

j = dumps(db, cls=PythonObjectEncoder)
print(j)

Will get you:

{"a": [44, {"_python_object": "gANjYnVpbHRpbnMKc2V0CnEAXXEBKEsESwVLBmWFcQJScQMu"}], "b": [55, {"_python_object": "gANjYnVpbHRpbnMKc2V0CnEAXXEBKEsCSwNLBGWFcQJScQMu"}]}

I can propose a solution which downgrades the set to a dict containing a list on the way out, and back to a set when loaded into python using the same encoder, therefore preserving observability and language agnosticism:

from decimal import Decimal
from base64 import b64encode, b64decode
from json import dumps, loads, JSONEncoder
import pickle

class PythonObjectEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, (list, dict, str, int, float, bool, type(None))):
            return super().default(obj)
        elif isinstance(obj, set):
            return {"__set__": list(obj)}
        return {'_python_object': b64encode(pickle.dumps(obj)).decode('utf-8')}

def as_python_object(dct):
    if '__set__' in dct:
        return set(dct['__set__'])
    elif '_python_object' in dct:
        return pickle.loads(b64decode(dct['_python_object'].encode('utf-8')))
    return dct

db = {
        "a": [ 44, set((4,5,6)) ],
        "b": [ 55, set((4,3,2)) ]
        }

j = dumps(db, cls=PythonObjectEncoder)
print(j)
ob = loads(j)
print(ob["a"])

Which gets you:

{"a": [44, {"__set__": [4, 5, 6]}], "b": [55, {"__set__": [2, 3, 4]}]}
[44, {'__set__': [4, 5, 6]}]

Note that serializing a dictionary which has an element with a key "__set__" will break this mechanism. So __set__ has now become a reserved dict key. Obviously feel free to use another, more deeply obfuscated key.

Question 31

I am trying to convert a set to a list in Python 2.6. I’m using this syntax:

first_list = [1,2,3,4]
my_set=set(first_list)
my_list = list(my_set)

However, I get the following stack trace:

Traceback (most recent call last):
  File "<console>", line 1, in <module>
TypeError: 'set' object is not callable

How can I fix this?

Question 32

It is already a list

type(my_set)
>>> <type 'list'>

Do you want something like

my_set = set([1,2,3,4])
my_list = list(my_set)
print my_list
>> [1, 2, 3, 4]

EDIT : Output of your last comment

>>> my_list = [1,2,3,4]
>>> my_set = set(my_list)
>>> my_new_list = list(my_set)
>>> print my_new_list
[1, 2, 3, 4]

I’m wondering if you did something like this :

>>> set=set()
>>> set([1,2])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object is not callable

Question 33

Instead of:

first_list = [1,2,3,4]
my_set=set(first_list)
my_list = list(my_set)

Why not shortcut the process:

my_list = list(set([1,2,3,4])

This will remove the dupes from you list and return a list back to you.

Question 34

[EDITED] It’s seems you earlier have redefined “list”, using it as a variable name, like this:

list = set([1,2,3,4]) # oops
#...
first_list = [1,2,3,4]
my_set=set(first_list)
my_list = list(my_set)

And you’l get

Traceback (most recent call last):
  File "<console>", line 1, in <module>
TypeError: 'set' object is not callable

Question 35

Whenever you are stuck in such type of problems, try to find the datatype of the element you want to convert first by using :

type(my_set)

Then, Use:

  list(my_set)

to convert it to a list. You can use the newly built list like any normal list in python now.

Question 36

Simply type:

list(my_set)

This will turn a set in the form {‘1′,’2’} into a list in the form [‘1′,’2’].

Question 37

Review your first line. Your stack trace is clearly not from the code you’ve pasted here, so I don’t know precisely what you’ve done.

>>> my_set=([1,2,3,4])
>>> my_set
[1, 2, 3, 4]
>>> type(my_set)
<type 'list'>
>>> list(my_set)
[1, 2, 3, 4]
>>> type(_)
<type 'list'>

What you wanted was set([1, 2, 3, 4]).

>>> my_set = set([1, 2, 3, 4])
>>> my_set
set([1, 2, 3, 4])
>>> type(my_set)
<type 'set'>
>>> list(my_set)
[1, 2, 3, 4]
>>> type(_)
<type 'list'>

The “not callable” exception means you were doing something like set()() – attempting to call a set instance.

Question 38

I’m not sure that you’re creating a set with this ([1, 2]) syntax, rather a list. To create a set, you should use set([1, 2]).

These brackets are just envelopping your expression, as if you would have written:

if (condition1
    and condition2 == 3):
    print something

There’re not really ignored, but do nothing to your expression.

Note: (something, something_else) will create a tuple (but still no list).

Question 39

Python is a dynamically typed language, which means that you cannot define the type of the variable as you do in C or C++:

type variable = value

or

type variable(value)

In Python, you use coercing if you change types, or the init functions (constructors) of the types to declare a variable of a type:

my_set = set([1,2,3])
type my_set

will give you <type 'set'> for an answer.

If you have a list, do this:

my_list = [1,2,3]
my_set = set(my_list)

Question 40

Hmmm I bet that in some previous lines you have something like:

list = set(something)

Am I wrong ?

Question 41

How can I make as “perfect” a subclass of dict as possible? The end goal is to have a simple dict in which the keys are lowercase.

It would seem that there should be some tiny set of primitives I can override to make this work, but according to all my research and attempts it seem like this isn’t the case:

If I override __getitem__/__setitem__, then get/set don’t work. How can I make them work? Surely I don’t need to implement them individually?
Am I preventing pickling from working, and do I need to implement __setstate__ etc?
Do I need repr, update and __init__?
Should I just use mutablemapping (it seems one shouldn’t use UserDict or DictMixin)? If so, how? The docs aren’t exactly enlightening.

Here is my first go at it, get() doesn’t work and no doubt there are many other minor problems:

class arbitrary_dict(dict):
    """A dictionary that applies an arbitrary key-altering function
       before accessing the keys."""

    def __keytransform__(self, key):
        return key

    # Overridden methods. List from 
    # https://stackoverflow.com/questions/2390827/how-to-properly-subclass-dict

    def __init__(self, *args, **kwargs):
        self.update(*args, **kwargs)

    # Note: I'm using dict directly, since super(dict, self) doesn't work.
    # I'm not sure why, perhaps dict is not a new-style class.

    def __getitem__(self, key):
        return dict.__getitem__(self, self.__keytransform__(key))

    def __setitem__(self, key, value):
        return dict.__setitem__(self, self.__keytransform__(key), value)

    def __delitem__(self, key):
        return dict.__delitem__(self, self.__keytransform__(key))

    def __contains__(self, key):
        return dict.__contains__(self, self.__keytransform__(key))


class lcdict(arbitrary_dict):
    def __keytransform__(self, key):
        return str(key).lower()

Question 42

You can write an object that behaves like a dict quite easily with ABCs (Abstract Base Classes) from the collections.abc module. It even tells you if you missed a method, so below is the minimal version that shuts the ABC up.

from collections.abc import MutableMapping


class TransformedDict(MutableMapping):
    """A dictionary that applies an arbitrary key-altering
       function before accessing the keys"""

    def __init__(self, *args, **kwargs):
        self.store = dict()
        self.update(dict(*args, **kwargs))  # use the free update to set keys

    def __getitem__(self, key):
        return self.store[self._keytransform(key)]

    def __setitem__(self, key, value):
        self.store[self._keytransform(key)] = value

    def __delitem__(self, key):
        del self.store[self._keytransform(key)]

    def __iter__(self):
        return iter(self.store)
    
    def __len__(self):
        return len(self.store)

    def _keytransform(self, key):
        return key

You get a few free methods from the ABC:

class MyTransformedDict(TransformedDict):

    def _keytransform(self, key):
        return key.lower()


s = MyTransformedDict([('Test', 'test')])

assert s.get('TEST') is s['test']   # free get
assert 'TeSt' in s                  # free __contains__
                                    # free setdefault, __eq__, and so on

import pickle
# works too since we just use a normal dict
assert pickle.loads(pickle.dumps(s)) == s

I wouldn’t subclass dict (or other builtins) directly. It often makes no sense, because what you actually want to do is implement the interface of a dict. And that is exactly what ABCs are for.

Question 43

How can I make as “perfect” a subclass of dict as possible?

The end goal is to have a simple dict in which the keys are lowercase.

If I override __getitem__/__setitem__, then get/set don’t work. How do I make them work? Surely I don’t need to implement them individually?

Am I preventing pickling from working, and do I need to implement __setstate__ etc?

Do I need repr, update and __init__?

Should I just use mutablemapping (it seems one shouldn’t use UserDict or DictMixin)? If so, how? The docs aren’t exactly enlightening.

The accepted answer would be my first approach, but since it has some issues, and since no one has addressed the alternative, actually subclassing a dict, I’m going to do that here.

What’s wrong with the accepted answer?

This seems like a rather simple request to me:

How can I make as “perfect” a subclass of dict as possible? The end goal is to have a simple dict in which the keys are lowercase.

The accepted answer doesn’t actually subclass dict, and a test for this fails:

>>> isinstance(MyTransformedDict([('Test', 'test')]), dict)
False

Ideally, any type-checking code would be testing for the interface we expect, or an abstract base class, but if our data objects are being passed into functions that are testing for dict – and we can’t “fix” those functions, this code will fail.

Other quibbles one might make:

The accepted answer is also missing the classmethod: fromkeys.
The accepted answer also has a redundant __dict__ – therefore taking up more space in memory:
```
>>> s.foo = 'bar'
>>> s.__dict__
{'foo': 'bar', 'store': {'test': 'test'}}
```

Actually subclassing `dict`

We can reuse the dict methods through inheritance. All we need to do is create an interface layer that ensures keys are passed into the dict in lowercase form if they are strings.

If I override __getitem__/__setitem__, then get/set don’t work. How do I make them work? Surely I don’t need to implement them individually?

Well, implementing them each individually is the downside to this approach and the upside to using MutableMapping (see the accepted answer), but it’s really not that much more work.

First, let’s factor out the difference between Python 2 and 3, create a singleton (_RaiseKeyError) to make sure we know if we actually get an argument to dict.pop, and create a function to ensure our string keys are lowercase:

from itertools import chain
try:              # Python 2
    str_base = basestring
    items = 'iteritems'
except NameError: # Python 3
    str_base = str, bytes, bytearray
    items = 'items'

_RaiseKeyError = object() # singleton for no-default behavior

def ensure_lower(maybe_str):
    """dict keys can be any hashable object - only call lower if str"""
    return maybe_str.lower() if isinstance(maybe_str, str_base) else maybe_str

Now we implement – I’m using super with the full arguments so that this code works for Python 2 and 3:

class LowerDict(dict):  # dicts take a mapping or iterable as their optional first argument
    __slots__ = () # no __dict__ - that would be redundant
    @staticmethod # because this doesn't make sense as a global function.
    def _process_args(mapping=(), **kwargs):
        if hasattr(mapping, items):
            mapping = getattr(mapping, items)()
        return ((ensure_lower(k), v) for k, v in chain(mapping, getattr(kwargs, items)()))
    def __init__(self, mapping=(), **kwargs):
        super(LowerDict, self).__init__(self._process_args(mapping, **kwargs))
    def __getitem__(self, k):
        return super(LowerDict, self).__getitem__(ensure_lower(k))
    def __setitem__(self, k, v):
        return super(LowerDict, self).__setitem__(ensure_lower(k), v)
    def __delitem__(self, k):
        return super(LowerDict, self).__delitem__(ensure_lower(k))
    def get(self, k, default=None):
        return super(LowerDict, self).get(ensure_lower(k), default)
    def setdefault(self, k, default=None):
        return super(LowerDict, self).setdefault(ensure_lower(k), default)
    def pop(self, k, v=_RaiseKeyError):
        if v is _RaiseKeyError:
            return super(LowerDict, self).pop(ensure_lower(k))
        return super(LowerDict, self).pop(ensure_lower(k), v)
    def update(self, mapping=(), **kwargs):
        super(LowerDict, self).update(self._process_args(mapping, **kwargs))
    def __contains__(self, k):
        return super(LowerDict, self).__contains__(ensure_lower(k))
    def copy(self): # don't delegate w/ super - dict.copy() -> dict :(
        return type(self)(self)
    @classmethod
    def fromkeys(cls, keys, v=None):
        return super(LowerDict, cls).fromkeys((ensure_lower(k) for k in keys), v)
    def __repr__(self):
        return '{0}({1})'.format(type(self).__name__, super(LowerDict, self).__repr__())

We use an almost boiler-plate approach for any method or special method that references a key, but otherwise, by inheritance, we get methods: len, clear, items, keys, popitem, and values for free. While this required some careful thought to get right, it is trivial to see that this works.

(Note that haskey was deprecated in Python 2, removed in Python 3.)

Here’s some usage:

>>> ld = LowerDict(dict(foo='bar'))
>>> ld['FOO']
'bar'
>>> ld['foo']
'bar'
>>> ld.pop('FoO')
'bar'
>>> ld.setdefault('Foo')
>>> ld
{'foo': None}
>>> ld.get('Bar')
>>> ld.setdefault('Bar')
>>> ld
{'bar': None, 'foo': None}
>>> ld.popitem()
('bar', None)

Am I preventing pickling from working, and do I need to implement __setstate__ etc?

pickling

And the dict subclass pickles just fine:

>>> import pickle
>>> pickle.dumps(ld)
b'\x80\x03c__main__\nLowerDict\nq\x00)\x81q\x01X\x03\x00\x00\x00fooq\x02Ns.'
>>> pickle.loads(pickle.dumps(ld))
{'foo': None}
>>> type(pickle.loads(pickle.dumps(ld)))
<class '__main__.LowerDict'>

`repr`

Do I need repr, update and __init__?

We defined update and __init__, but you have a beautiful __repr__ by default:

>>> ld # without __repr__ defined for the class, we get this
{'foo': None}

However, it’s good to write a __repr__ to improve the debugability of your code. The ideal test is eval(repr(obj)) == obj. If it’s easy to do for your code, I strongly recommend it:

>>> ld = LowerDict({})
>>> eval(repr(ld)) == ld
True
>>> ld = LowerDict(dict(a=1, b=2, c=3))
>>> eval(repr(ld)) == ld
True

You see, it’s exactly what we need to recreate an equivalent object – this is something that might show up in our logs or in backtraces:

>>> ld
LowerDict({'a': 1, 'c': 3, 'b': 2})

Conclusion

Should I just use mutablemapping (it seems one shouldn’t use UserDict or DictMixin)? If so, how? The docs aren’t exactly enlightening.

Yeah, these are a few more lines of code, but they’re intended to be comprehensive. My first inclination would be to use the accepted answer, and if there were issues with it, I’d then look at my answer – as it’s a little more complicated, and there’s no ABC to help me get my interface right.

Premature optimization is going for greater complexity in search of performance. MutableMapping is simpler – so it gets an immediate edge, all else being equal. Nevertheless, to lay out all the differences, let’s compare and contrast.

I should add that there was a push to put a similar dictionary into the collections module, but it was rejected. You should probably just do this instead:

my_dict[transform(key)]

It should be far more easily debugable.

Compare and contrast

There are 6 interface functions implemented with the MutableMapping (which is missing fromkeys) and 11 with the dict subclass. I don’t need to implement __iter__ or __len__, but instead I have to implement get, setdefault, pop, update, copy, __contains__, and fromkeys – but these are fairly trivial, since I can use inheritance for most of those implementations.

The MutableMapping implements some things in Python that dict implements in C – so I would expect a dict subclass to be more performant in some cases.

We get a free __eq__ in both approaches – both of which assume equality only if another dict is all lowercase – but again, I think the dict subclass will compare more quickly.

Summary:

subclassing MutableMapping is simpler with fewer opportunities for bugs, but slower, takes more memory (see redundant dict), and fails isinstance(x, dict)
subclassing dict is faster, uses less memory, and passes isinstance(x, dict), but it has greater complexity to implement.

Which is more perfect? That depends on your definition of perfect.

Question 44

My requirements were a bit stricter:

I had to retain case info (the strings are paths to files displayed to the user, but it’s a windows app so internally all operations must be case insensitive)
I needed keys to be as small as possible (it did make a difference in memory performance, chopped off 110 mb out of 370). This meant that caching lowercase version of keys is not an option.
I needed creation of the data structures to be as fast as possible (again made a difference in performance, speed this time). I had to go with a builtin

My initial thought was to substitute our clunky Path class for a case insensitive unicode subclass – but:

proved hard to get that right – see: A case insensitive string class in python
turns out that explicit dict keys handling makes code verbose and messy – and error prone (structures are passed hither and thither, and it is not clear if they have CIStr instances as keys/elements, easy to forget plus some_dict[CIstr(path)] is ugly)

So I had finally to write down that case insensitive dict. Thanks to code by @AaronHall that was made 10 times easier.

class CIstr(unicode):
    """See https://stackoverflow.com/a/43122305/281545, especially for inlines"""
    __slots__ = () # does make a difference in memory performance

    #--Hash/Compare
    def __hash__(self):
        return hash(self.lower())
    def __eq__(self, other):
        if isinstance(other, CIstr):
            return self.lower() == other.lower()
        return NotImplemented
    def __ne__(self, other):
        if isinstance(other, CIstr):
            return self.lower() != other.lower()
        return NotImplemented
    def __lt__(self, other):
        if isinstance(other, CIstr):
            return self.lower() < other.lower()
        return NotImplemented
    def __ge__(self, other):
        if isinstance(other, CIstr):
            return self.lower() >= other.lower()
        return NotImplemented
    def __gt__(self, other):
        if isinstance(other, CIstr):
            return self.lower() > other.lower()
        return NotImplemented
    def __le__(self, other):
        if isinstance(other, CIstr):
            return self.lower() <= other.lower()
        return NotImplemented
    #--repr
    def __repr__(self):
        return '{0}({1})'.format(type(self).__name__,
                                 super(CIstr, self).__repr__())

def _ci_str(maybe_str):
    """dict keys can be any hashable object - only call CIstr if str"""
    return CIstr(maybe_str) if isinstance(maybe_str, basestring) else maybe_str

class LowerDict(dict):
    """Dictionary that transforms its keys to CIstr instances.
    Adapted from: https://stackoverflow.com/a/39375731/281545
    """
    __slots__ = () # no __dict__ - that would be redundant

    @staticmethod # because this doesn't make sense as a global function.
    def _process_args(mapping=(), **kwargs):
        if hasattr(mapping, 'iteritems'):
            mapping = getattr(mapping, 'iteritems')()
        return ((_ci_str(k), v) for k, v in
                chain(mapping, getattr(kwargs, 'iteritems')()))
    def __init__(self, mapping=(), **kwargs):
        # dicts take a mapping or iterable as their optional first argument
        super(LowerDict, self).__init__(self._process_args(mapping, **kwargs))
    def __getitem__(self, k):
        return super(LowerDict, self).__getitem__(_ci_str(k))
    def __setitem__(self, k, v):
        return super(LowerDict, self).__setitem__(_ci_str(k), v)
    def __delitem__(self, k):
        return super(LowerDict, self).__delitem__(_ci_str(k))
    def copy(self): # don't delegate w/ super - dict.copy() -> dict :(
        return type(self)(self)
    def get(self, k, default=None):
        return super(LowerDict, self).get(_ci_str(k), default)
    def setdefault(self, k, default=None):
        return super(LowerDict, self).setdefault(_ci_str(k), default)
    __no_default = object()
    def pop(self, k, v=__no_default):
        if v is LowerDict.__no_default:
            # super will raise KeyError if no default and key does not exist
            return super(LowerDict, self).pop(_ci_str(k))
        return super(LowerDict, self).pop(_ci_str(k), v)
    def update(self, mapping=(), **kwargs):
        super(LowerDict, self).update(self._process_args(mapping, **kwargs))
    def __contains__(self, k):
        return super(LowerDict, self).__contains__(_ci_str(k))
    @classmethod
    def fromkeys(cls, keys, v=None):
        return super(LowerDict, cls).fromkeys((_ci_str(k) for k in keys), v)
    def __repr__(self):
        return '{0}({1})'.format(type(self).__name__,
                                 super(LowerDict, self).__repr__())

Implicit vs explicit is still a problem, but once dust settles, renaming of attributes/variables to start with ci (and a big fat doc comment explaining that ci stands for case insensitive) I think is a perfect solution – as readers of the code must be fully aware that we are dealing with case insensitive underlying data structures. This will hopefully fix some hard to reproduce bugs, which I suspect boil down to case sensitivity.

Comments/corrections welcome :)

Question 45

All you will have to do is

class BatchCollection(dict):
    def __init__(self, *args, **kwargs):
        dict.__init__(*args, **kwargs)

OR

class BatchCollection(dict):
    def __init__(self, inpt={}):
        super(BatchCollection, self).__init__(inpt)

A sample usage for my personal use

### EXAMPLE
class BatchCollection(dict):
    def __init__(self, inpt={}):
        dict.__init__(*args, **kwargs)

    def __setitem__(self, key, item):
        if (isinstance(key, tuple) and len(key) == 2
                and isinstance(item, collections.Iterable)):
            # self.__dict__[key] = item
            super(BatchCollection, self).__setitem__(key, item)
        else:
            raise Exception(
                "Valid key should be a tuple (database_name, table_name) "
                "and value should be iterable")

Note: tested only in python3

Question 46

After trying out both of the top two suggestions, I’ve settled on a shady-looking middle route for Python 2.7. Maybe 3 is saner, but for me:

class MyDict(MutableMapping):
   # ... the few __methods__ that mutablemapping requires
   # and then this monstrosity
   @property
   def __class__(self):
       return dict

which I really hate, but seems to fit my needs, which are:

can override **my_dict
- if you inherit from dict, this bypasses your code. try it out.
- this makes #2 unacceptable for me at all times, as this is quite common in python code
masquerades as isinstance(my_dict, dict)
- rules out MutableMapping alone, so #1 is not enough
- I heartily recommend #1 if you don’t need this, it’s simple and predictable
fully controllable behavior
- so I cannot inherit from dict

If you need to tell yourself apart from others, personally I use something like this (though I’d recommend better names):

def __am_i_me(self):
  return True

@classmethod
def __is_it_me(cls, other):
  try:
    return other.__am_i_me()
  except Exception:
    return False

As long as you only need to recognize yourself internally, this way it’s harder to accidentally call __am_i_me due to python’s name-munging (this is renamed to _MyDict__am_i_me from anything calling outside this class). Slightly more private than _methods, both in practice and culturally.

So far I have no complaints, aside from the seriously-shady-looking __class__ override. I’d be thrilled to hear of any problems that others encounter with this though, I don’t fully understand the consequences. But so far I’ve had no problems whatsoever, and this allowed me to migrate a lot of middling-quality code in lots of locations without needing any changes.

As evidence: https://repl.it/repls/TraumaticToughCockatoo

Basically: copy the current #2 option, add print 'method_name' lines to every method, and then try this and watch the output:

d = LowerDict()  # prints "init", or whatever your print statement said
print '------'
splatted = dict(**d)  # note that there are no prints here

You’ll see similar behavior for other scenarios. Say your fake-dict is a wrapper around some other datatype, so there’s no reasonable way to store the data in the backing-dict; **your_dict will be empty, regardless of what every other method does.

This works correctly for MutableMapping, but as soon as you inherit from dict it becomes uncontrollable.

Edit: as an update, this has been running without a single issue for almost two years now, on several hundred thousand (eh, might be a couple million) lines of complicated, legacy-ridden python. So I’m pretty happy with it :)

Edit 2: apparently I mis-copied this or something long ago. @classmethod __class__ does not work for isinstance checks – @property __class__ does: https://repl.it/repls/UnitedScientificSequence

Question 47

In Python, which data structure is more efficient/speedy? Assuming that order is not important to me and I would be checking for duplicates anyway, is a Python set slower than a Python list?

Question 48

It depends on what you are intending to do with it.

Sets are significantly faster when it comes to determining if an object is present in the set (as in x in s), but are slower than lists when it comes to iterating over their contents.

You can use the timeit module to see which is faster for your situation.

Question 49

Lists are slightly faster than sets when you just want to iterate over the values.

Sets, however, are significantly faster than lists if you want to check if an item is contained within it. They can only contain unique items though.

It turns out tuples perform in almost exactly the same way as lists, except for their immutability.

Iterating

>>> def iter_test(iterable):
...     for i in iterable:
...         pass
...
>>> from timeit import timeit
>>> timeit(
...     "iter_test(iterable)",
...     setup="from __main__ import iter_test; iterable = set(range(10000))",
...     number=100000)
12.666952133178711
>>> timeit(
...     "iter_test(iterable)",
...     setup="from __main__ import iter_test; iterable = list(range(10000))",
...     number=100000)
9.917098999023438
>>> timeit(
...     "iter_test(iterable)",
...     setup="from __main__ import iter_test; iterable = tuple(range(10000))",
...     number=100000)
9.865639209747314

Determine if an object is present

>>> def in_test(iterable):
...     for i in range(1000):
...         if i in iterable:
...             pass
...
>>> from timeit import timeit
>>> timeit(
...     "in_test(iterable)",
...     setup="from __main__ import in_test; iterable = set(range(1000))",
...     number=10000)
0.5591847896575928
>>> timeit(
...     "in_test(iterable)",
...     setup="from __main__ import in_test; iterable = list(range(1000))",
...     number=10000)
50.18339991569519
>>> timeit(
...     "in_test(iterable)",
...     setup="from __main__ import in_test; iterable = tuple(range(1000))",
...     number=10000)
51.597304821014404

Question 50

List performance:

>>> import timeit
>>> timeit.timeit(stmt='10**6 in a', setup='a = range(10**6)', number=100000)
0.008128150348026608

Set performance:

>>> timeit.timeit(stmt='10**6 in a', setup='a = set(range(10**6))', number=100000)
0.005674857488571661

You may want to consider Tuples as they’re similar to lists but can’t be modified. They take up slightly less memory and are faster to access. They aren’t as flexible but are more efficient than lists. Their normal use is to serve as dictionary keys.

Sets are also sequence structures but with two differences from lists and tuples. Although sets do have an order, that order is arbitrary and not under the programmer’s control. The second difference is that the elements in a set must be unique.

set by definition. [python | wiki].

>>> x = set([1, 1, 2, 2, 3, 3])
>>> x
{1, 2, 3}

Question 51

Set wins due to near instant ‘contains’ checks: https://en.wikipedia.org/wiki/Hash_table

List implementation: usually an array, low level close to the metal good for iteration and random access by element index.

Set implementation: https://en.wikipedia.org/wiki/Hash_table, it does not iterate on a list, but finds the element by computing a hash from the key, so it depends on the nature of the key elements and the hash function. Similar to what is used for dict. I suspect list could be faster if you have very few elements (< 5), the larger element count the better the set will perform for a contains check. It is also fast for element addition and removal. Also always keep in mind that building a set has a cost !

NOTE: If the list is already sorted, searching the list could be quite fast on small lists, but with more data a set is faster for contains checks.

Question 52

tl;dr

Data structures (DS) are important because they are used to perform operations on data which basically implies: take some input, process it, and give back the output.

Some data structures are more useful than others in some particular cases. Therefore, it is quite unfair to ask which (DS) is more efficient/speedy. It is like asking which tool is more efficient between a knife and fork. I mean all depends on the situation.

Lists

A list is mutable sequence, typically used to store collections of homogeneous items.

Sets

A set object is an unordered collection of distinct hashable objects. It is commonly used to test membership, remove duplicates from a sequence, and compute mathematical operations such as intersection, union, difference, and symmetric difference.

Usage

From some of the answers, it is clear that a list is quite faster than a set when iterating over the values. On the other hand, a set is faster than a list when checking if an item is contained within it. Therefore, the only thing you can say is that a list is better than a set for some particular operations and vice-versa.

Question 53

I was interested in the results when checking, with CPython, if a value is one of a small number of literals. set wins in Python 3 vs tuple, list and or:

from timeit import timeit

def in_test1():
  for i in range(1000):
    if i in (314, 628):
      pass

def in_test2():
  for i in range(1000):
    if i in [314, 628]:
      pass

def in_test3():
  for i in range(1000):
    if i in {314, 628}:
      pass

def in_test4():
  for i in range(1000):
    if i == 314 or i == 628:
      pass

print("tuple")
print(timeit("in_test1()", setup="from __main__ import in_test1", number=100000))
print("list")
print(timeit("in_test2()", setup="from __main__ import in_test2", number=100000))
print("set")
print(timeit("in_test3()", setup="from __main__ import in_test3", number=100000))
print("or")
print(timeit("in_test4()", setup="from __main__ import in_test4", number=100000))

Output:

tuple
4.735646052286029
list
4.7308746771886945
set
3.5755991376936436
or
4.687681658193469

For 3 to 5 literals, set still wins by a wide margin, and or becomes the slowest.

In Python 2, set is always the slowest. or is the fastest for 2 to 3 literals, and tuple and list are faster with 4 or more literals. I couldn’t distinguish the speed of tuple vs list.

When the values to test were cached in a global variable out of the function, rather than creating the literal within the loop, set won every time, even in Python 2.

These results apply to 64-bit CPython on a Core i7.

Question 54

I would recommend a Set implementation where the use case is limit to referencing or search for existence and Tuple implementation where the use case requires you to perform iteration. A list is a low-level implementation and requires significant memory overhead.

Question 55

from datetime import datetime
listA = range(10000000)
setA = set(listA)
tupA = tuple(listA)
#Source Code

def calc(data, type):
start = datetime.now()
if data in type:
print ""
end = datetime.now()
print end-start

calc(9999, listA)
calc(9999, tupA)
calc(9999, setA)

Output after comparing 10 iterations for all 3 : Comparison

Question 56

Sets are faster, morover you get more functions with sets, such as lets say you have two sets :

set1 = {"Harry Potter", "James Bond", "Iron Man"}
set2 = {"Captain America", "Black Widow", "Hulk", "Harry Potter", "James Bond"}

We can easily join two sets:

set3 = set1.union(set2)

Find out what is common in both:

set3 = set1.intersection(set2)

Find out what is different in both:

set3 = set1.difference(set2)

And much more! Just try them out, they are fun! Moreover if you have to work on the different values within 2 list or common values within 2 lists, I prefer to convert your lists to sets, and many programmers do in that way. Hope it helps you :-)

Question 57

Tested on Python 2.6 interpreter:

>>> a=set('abcde')
>>> a
set(['a', 'c', 'b', 'e', 'd'])
>>> l=['f','g']
>>> l
['f', 'g']
>>> a.add(l)
Traceback (most recent call last):
  File "<pyshell#35>", line 1, in <module>
    a.add(l)
TypeError: list objects are unhashable

I think that I can’t add the list to the set because there’s no way Python can tell If I have added the same list twice. Is there a workaround?

EDIT: I want to add the list itself, not its elements.

Question 58

You can’t add a list to a set because lists are mutable, meaning that you can change the contents of the list after adding it to the set.

You can however add tuples to the set, because you cannot change the contents of a tuple:

>>> a.add(('f', 'g'))
>>> print a
set(['a', 'c', 'b', 'e', 'd', ('f', 'g')])

Edit: some explanation: The documentation defines a set as an unordered collection of distinct hashable objects. The objects have to be hashable so that finding, adding and removing elements can be done faster than looking at each individual element every time you perform these operations. The specific algorithms used are explained in the Wikipedia article. Pythons hashing algorithms are explained on effbot.org and pythons __hash__ function in the python reference.

Some facts:

Set elements as well as dictionary keys have to be hashable
Some unhashable datatypes:
- list: use tuple instead
- set: use frozenset instead
- dict: has no official counterpart, but there are some recipes
Object instances are hashable by default with each instance having a unique hash. You can override this behavior as explained in the python reference.

Question 59

Use set.update() or |=

>>> a = set('abc')
>>> l = ['d', 'e']
>>> a.update(l)
>>> a
{'e', 'b', 'c', 'd', 'a'}

>>> l = ['f', 'g']
>>> a |= set(l)
>>> a
{'e', 'b', 'f', 'c', 'd', 'g', 'a'}

edit: If you want to add the list itself and not its members, then you must use a tuple, unfortunately. Set members must be hashable.

Question 60

To add the elements of a list to a set, use update

From https://docs.python.org/2/library/sets.html

s.update(t): return set s with elements added from t

E.g.

>>> s = set([1, 2])
>>> l = [3, 4]
>>> s.update(l)
>>> s
{1, 2, 3, 4}

If you instead want to add the entire list as a single element to the set, you can’t because lists aren’t hashable. You could instead add a tuple, e.g. s.add(tuple(l)). See also TypeError: unhashable type: ‘list’ when using built-in set function for more information on that.

Question 61

Hopefully this helps:

>>> seta = set('1234')
>>> listb = ['a','b','c']
>>> seta.union(listb)
set(['a', 'c', 'b', '1', '3', '2', '4'])
>>> seta
set(['1', '3', '2', '4'])
>>> seta = seta.union(listb)
>>> seta
set(['a', 'c', 'b', '1', '3', '2', '4'])

Question 62

Please notice the function set.update(). The documentation says:

Update a set with the union of itself and others.

Question 63

list objects are unhashable. you might want to turn them in to tuples though.

Question 64

Sets can’t have mutable (changeable) elements/members. A list, being mutable, cannot be a member of a set.

As sets are mutable, you cannot have a set of sets! You can have a set of frozensets though.

(The same kind of “mutability requirement” applies to the keys of a dict.)

Other answers have already given you code, I hope this gives a bit of insight. I’m hoping Alex Martelli will answer with even more details.

问题：Python设置为列表

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

问题：为什么字典和集合中的顺序是任意的？

回答 0

回答 1

这是所有实现细节。

This is all implementation detail.

回答 2

回答 3

回答 4

回答 5

问题：set（）如何实现？

回答 0

回答 1

回答 2

回答 3

回答 4

问题：如何对集合进行JSON序列化？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

问题：如何在python中将集合转换为列表？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

问题：如何“完美”地覆盖字典？

回答 0

回答 1

如何使dict的子类尽可能“完美”？

接受的答案有什么问题？

实际上是子类化 dict

酸洗

__repr__

结论

比较和对比

摘要：

How can I make as “perfect” a subclass of dict as possible?

What’s wrong with the accepted answer?

Actually subclassing dict

pickling

__repr__

Conclusion

Compare and contrast

Summary:

回答 2

回答 3

回答 4

问题：Python集与列表

回答 0

回答 1

回答 2

回答 3

回答 4

tl; dr

用法

tl;dr

Usage

回答 5

回答 6

回答 7

回答 8

问题：添加列表进行设置？

回答 0

实际上是子类化 `dict`

`repr`

Actually subclassing `dict`

`repr`