Python 实用宝典

Question 1

When should I use a dictionary, list or set?

Are there scenarios that are more suited for each data type?

Question 2

A list keeps order, dict and set don’t: when you care about order, therefore, you must use list (if your choice of containers is limited to these three, of course;-).

dict associates with each key a value, while list and set just contain values: very different use cases, obviously.

set requires items to be hashable, list doesn’t: if you have non-hashable items, therefore, you cannot use set and must instead use list.

set forbids duplicates, list does not: also a crucial distinction. (A “multiset”, which maps duplicates into a different count for items present more than once, can be found in collections.Counter — you could build one as a dict, if for some weird reason you couldn’t import collections, or, in pre-2.7 Python as a collections.defaultdict(int), using the items as keys and the associated value as the count).

Checking for membership of a value in a set (or dict, for keys) is blazingly fast (taking about a constant, short time), while in a list it takes time proportional to the list’s length in the average and worst cases. So, if you have hashable items, don’t care either way about order or duplicates, and want speedy membership checking, set is better than list.

Question 3

Do you just need an ordered sequence of items? Go for a list.
Do you just need to know whether or not you’ve already got a particular value, but without ordering (and you don’t need to store duplicates)? Use a set.
Do you need to associate values with keys, so you can look them up efficiently (by key) later on? Use a dictionary.

Question 4

When you want an unordered collection of unique elements, use a set. (For example, when you want the set of all the words used in a document).

When you want to collect an immutable ordered list of elements, use a tuple. (For example, when you want a (name, phone_number) pair that you wish to use as an element in a set, you would need a tuple rather than a list since sets require elements be immutable).

When you want to collect a mutable ordered list of elements, use a list. (For example, when you want to append new phone numbers to a list: [number1, number2, …]).

When you want a mapping from keys to values, use a dict. (For example, when you want a telephone book which maps names to phone numbers: {'John Smith' : '555-1212'}). Note the keys in a dict are unordered. (If you iterate through a dict (telephone book), the keys (names) may show up in any order).

Question 5

Use a dictionary when you have a set of unique keys that map to values.
Use a list if you have an ordered collection of items.
Use a set to store an unordered set of items.

Question 6

In short, use:

list – if you require an ordered sequence of items.

dict – if you require to relate values with keys

set – if you require to keep unique elements.

Detailed Explanation

List

A list is a mutable sequence, typically used to store collections of homogeneous items.

A list implements all of the common sequence operations:

x in l and x not in l
l[i], l[i:j], l[i:j:k]
len(l), min(l), max(l)
l.count(x)
l.index(x[, i[, j]]) – index of the 1st occurrence of x in l (at or after i and before j indeces)

A list also implements all of the mutable sequence operations:

l[i] = x – item i of l is replaced by x
l[i:j] = t – slice of l from i to j is replaced by the contents of the iterable t
del l[i:j] – same as l[i:j] = []
l[i:j:k] = t – the elements of l[i:j:k] are replaced by those of t
del l[i:j:k] – removes the elements of s[i:j:k] from the list
l.append(x) – appends x to the end of the sequence
l.clear() – removes all items from l (same as del l[:])
l.copy() – creates a shallow copy of l (same as l[:])
l.extend(t) or l += t – extends l with the contents of t
l *= n – updates l with its contents repeated n times
l.insert(i, x) – inserts x into l at the index given by i
l.pop([i]) – retrieves the item at i and also removes it from l
l.remove(x) – remove the first item from l where l[i] is equal to x
l.reverse() – reverses the items of l in place

A list could be used as stack by taking advantage of the methods append and pop.

Dictionary

A dictionary maps hashable values to arbitrary objects. A dictionary is a mutable object. The main operations on a dictionary are storing a value with some key and extracting the value given the key.

In a dictionary, you cannot use as keys values that are not hashable, that is, values containing lists, dictionaries or other mutable types.

Set

A set is an unordered collection of distinct hashable objects. A set is commonly used to include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference.

Question 7

Although this doesn’t cover sets, it is a good explanation of dicts and lists:

Lists are what they seem – a list of values. Each one of them is numbered, starting from zero – the first one is numbered zero, the second 1, the third 2, etc. You can remove values from the list, and add new values to the end. Example: Your many cats’ names.

Dictionaries are similar to what their name suggests – a dictionary. In a dictionary, you have an ‘index’ of words, and for each of them a definition. In python, the word is called a ‘key’, and the definition a ‘value’. The values in a dictionary aren’t numbered – tare similar to what their name suggests – a dictionary. In a dictionary, you have an ‘index’ of words, and for each of them a definition. The values in a dictionary aren’t numbered – they aren’t in any specific order, either – the key does the same thing. You can add, remove, and modify the values in dictionaries. Example: telephone book.

http://www.sthurlow.com/python/lesson06/

Question 8

For C++ I was always having this flow chart in mind: In which scenario do I use a particular STL container?, so I was curious if something similar is available for Python3 as well, but I had no luck.

What you need to keep in mind for Python is: There is no single Python standard as for C++. Hence there might be huge differences for different Python interpreters (e.g. CPython, PyPy). The following flow chart is for CPython.

Additionally I found no good way to incorporate the following data structures into the diagram: bytes, byte arrays, tuples, named_tuples, ChainMap, Counter, and arrays.

OrderedDict and deque are available via collections module.
heapq is available from the heapq module
LifoQueue, Queue, and PriorityQueue are available via the queue module which is designed for concurrent (threads) access. (There is also a multiprocessing.Queue available but I don’t know the differences to queue.Queue but would assume that it should be used when concurrent access from processes is needed.)
dict, set, frozen_set, and list are builtin of course

For anyone I would be grateful if you could improve this answer and provide a better diagram in every aspect. Feel free and welcome.

PS: the diagram has been made with yed. The graphml file is here

Question 9

In combination with lists, dicts and sets, there are also another interesting python objects, OrderedDicts.

Ordered dictionaries are just like regular dictionaries but they remember the order that items were inserted. When iterating over an ordered dictionary, the items are returned in the order their keys were first added.

OrderedDicts could be useful when you need to preserve the order of the keys, for example working with documents: It’s common to need the vector representation of all terms in a document. So using OrderedDicts you can efficiently verify if a term has been read before, add terms, extract terms, and after all the manipulations you can extract the ordered vector representation of them.

Question 10

Lists are what they seem – a list of values. Each one of them is numbered, starting from zero – the first one is numbered zero, the second 1, the third 2, etc. You can remove values from the list, and add new values to the end. Example: Your many cats’ names.

Tuples are just like lists, but you can’t change their values. The values that you give it first up, are the values that you are stuck with for the rest of the program. Again, each value is numbered starting from zero, for easy reference. Example: the names of the months of the year.

Dictionaries are similar to what their name suggests – a dictionary. In a dictionary, you have an ‘index’ of words, and for each of them a definition. In python, the word is called a ‘key’, and the definition a ‘value’. The values in a dictionary aren’t numbered – tare similar to what their name suggests – a dictionary. In a dictionary, you have an ‘index’ of words, and for each of them a definition. In python, the word is called a ‘key’, and the definition a ‘value’. The values in a dictionary aren’t numbered – they aren’t in any specific order, either – the key does the same thing. You can add, remove, and modify the values in dictionaries. Example: telephone book.

Question 11

When use them, I make an exhaustive cheatsheet of their methods for your reference:

class ContainerMethods:
    def __init__(self):
        self.list_methods_11 = {
                    'Add':{'append','extend','insert'},
                    'Subtract':{'pop','remove'},
                    'Sort':{'reverse', 'sort'},
                    'Search':{'count', 'index'},
                    'Entire':{'clear','copy'},
                            }
        self.tuple_methods_2 = {'Search':'count','index'}

        self.dict_methods_11 = {
                    'Views':{'keys', 'values', 'items'},
                    'Add':{'update'},
                    'Subtract':{'pop', 'popitem',},
                    'Extract':{'get','setdefault',},
                    'Entire':{ 'clear', 'copy','fromkeys'},
                            }
        self.set_methods_17 ={
                    'Add':{['add', 'update'],['difference_update','symmetric_difference_update','intersection_update']},
                    'Subtract':{'pop', 'remove','discard'},
                    'Relation':{'isdisjoint', 'issubset', 'issuperset'},
                    'operation':{'union' 'intersection','difference', 'symmetric_difference'}
                    'Entire':{'clear', 'copy'}}

Question 12

Dictionary: A python dictionary is used like a hash table with key as index and object as value.

List: A list is used for holding objects in an array indexed by position of that object in the array.

Set: A set is a collection with functions that can tell if an object is present or not present in the set.

Question 13

Does anyone know how the built in dictionary type for python is implemented? My understanding is that it is some sort of hash table, but I haven’t been able to find any sort of definitive answer.

Question 14

Here is everything about Python dicts that I was able to put together (probably more than anyone would like to know; but the answer is comprehensive).

Python dictionaries are implemented as hash tables.
Hash tables must allow for hash collisions i.e. even if two distinct keys have the same hash value, the table’s implementation must have a strategy to insert and retrieve the key and value pairs unambiguously.
Python dict uses open addressing to resolve hash collisions (explained below) (see dictobject.c:296-297).
Python hash table is just a contiguous block of memory (sort of like an array, so you can do an O(1) lookup by index).
Each slot in the table can store one and only one entry. This is important.
Each entry in the table is actually a combination of the three values: < hash, key, value >. This is implemented as a C struct (see dictobject.h:51-56).

The figure below is a logical representation of a Python hash table. In the figure below, 0, 1, ..., i, ... on the left are indices of the slots in the hash table (they are just for illustrative purposes and are not stored along with the table obviously!).

  # Logical model of Python Hash table
  -+-----------------+
  0| <hash|key|value>|
  -+-----------------+
  1|      ...        |
  -+-----------------+
  .|      ...        |
  -+-----------------+
  i|      ...        |
  -+-----------------+
  .|      ...        |
  -+-----------------+
  n|      ...        |
  -+-----------------+

When a new dict is initialized it starts with 8 slots. (see dictobject.h:49)
When adding entries to the table, we start with some slot, i, that is based on the hash of the key. CPython initially uses i = hash(key) & mask (where mask = PyDictMINSIZE - 1, but that’s not really important). Just note that the initial slot, i, that is checked depends on the hash of the key.
If that slot is empty, the entry is added to the slot (by entry, I mean, <hash|key|value>). But what if that slot is occupied!? Most likely because another entry has the same hash (hash collision!)
If the slot is occupied, CPython (and even PyPy) compares the hash AND the key (by compare I mean == comparison not the is comparison) of the entry in the slot against the hash and key of the current entry to be inserted (dictobject.c:337,344-345) respectively. If both match, then it thinks the entry already exists, gives up and moves on to the next entry to be inserted. If either hash or the key don’t match, it starts probing.
Probing just means it searches the slots by slot to find an empty slot. Technically we could just go one by one, i+1, i+2, ... and use the first available one (that’s linear probing). But for reasons explained beautifully in the comments (see dictobject.c:33-126), CPython uses random probing. In random probing, the next slot is picked in a pseudo random order. The entry is added to the first empty slot. For this discussion, the actual algorithm used to pick the next slot is not really important (see dictobject.c:33-126 for the algorithm for probing). What is important is that the slots are probed until first empty slot is found.
The same thing happens for lookups, just starts with the initial slot i (where i depends on the hash of the key). If the hash and the key both don’t match the entry in the slot, it starts probing, until it finds a slot with a match. If all slots are exhausted, it reports a fail.
BTW, the dict will be resized if it is two-thirds full. This avoids slowing down lookups. (see dictobject.h:64-65)

NOTE: I did the research on Python Dict implementation in response to my own question about how multiple entries in a dict can have same hash values. I posted a slightly edited version of the response here because all the research is very relevant for this question as well.

Question 15

How are Python’s Built In Dictionaries Implemented?

Here’s the short course:

They are hash tables. (See below for the specifics of Python’s implementation.)
A new layout and algorithm, as of Python 3.6, makes them
- ordered by key insertion, and
- take up less space,
- at virtually no cost in performance.
Another optimization saves space when dicts share keys (in special cases).

The ordered aspect is unofficial as of Python 3.6 (to give other implementations a chance to keep up), but official in Python 3.7.

Python’s Dictionaries are Hash Tables

For a long time, it worked exactly like this. Python would preallocate 8 empty rows and use the hash to determine where to stick the key-value pair. For example, if the hash for the key ended in 001, it would stick it in the 1 (i.e. 2nd) index (like the example below.)

   <hash>       <key>    <value>
     null        null    null
...010001    ffeb678c    633241c4 # addresses of the keys and values
     null        null    null
      ...         ...    ...

Each row takes up 24 bytes on a 64 bit architecture, 12 on a 32 bit. (Note that the column headers are just labels for our purposes here – they don’t actually exist in memory.)

If the hash ended the same as a preexisting key’s hash, this is a collision, and then it would stick the key-value pair in a different location.

After 5 key-values are stored, when adding another key-value pair, the probability of hash collisions is too large, so the dictionary is doubled in size. In a 64 bit process, before the resize, we have 72 bytes empty, and after, we are wasting 240 bytes due to the 10 empty rows.

This takes a lot of space, but the lookup time is fairly constant. The key comparison algorithm is to compute the hash, go to the expected location, compare the key’s id – if they’re the same object, they’re equal. If not then compare the hash values, if they are not the same, they’re not equal. Else, then we finally compare keys for equality, and if they are equal, return the value. The final comparison for equality can be quite slow, but the earlier checks usually shortcut the final comparison, making the lookups very quick.

Collisions slow things down, and an attacker could theoretically use hash collisions to perform a denial of service attack, so we randomized the initialization of the hash function such that it computes different hashes for each new Python process.

The wasted space described above has led us to modify the implementation of dictionaries, with an exciting new feature that dictionaries are now ordered by insertion.

The New Compact Hash Tables

We start, instead, by preallocating an array for the index of the insertion.

Since our first key-value pair goes in the second slot, we index like this:

[null, 0, null, null, null, null, null, null]

And our table just gets populated by insertion order:

   <hash>       <key>    <value>
...010001    ffeb678c    633241c4 
      ...         ...    ...

So when we do a lookup for a key, we use the hash to check the position we expect (in this case, we go straight to index 1 of the array), then go to that index in the hash-table (e.g. index 0), check that the keys are equal (using the same algorithm described earlier), and if so, return the value.

We retain constant lookup time, with minor speed losses in some cases and gains in others, with the upsides that we save quite a lot of space over the pre-existing implementation and we retain insertion order. The only space wasted are the null bytes in the index array.

Raymond Hettinger introduced this on python-dev in December of 2012. It finally got into CPython in Python 3.6. Ordering by insertion was considered an implementation detail for 3.6 to allow other implementations of Python a chance to catch up.

Shared Keys

Another optimization to save space is an implementation that shares keys. Thus, instead of having redundant dictionaries that take up all of that space, we have dictionaries that reuse the shared keys and keys’ hashes. You can think of it like this:

     hash         key    dict_0    dict_1    dict_2...
...010001    ffeb678c    633241c4  fffad420  ...
      ...         ...    ...       ...       ...

For a 64 bit machine, this could save up to 16 bytes per key per extra dictionary.

Shared Keys for Custom Objects & Alternatives

These shared-key dicts are intended to be used for custom objects’ __dict__. To get this behavior, I believe you need to finish populating your __dict__ before you instantiate your next object (see PEP 412). This means you should assign all your attributes in the __init__ or __new__, else you might not get your space savings.

However, if you know all of your attributes at the time your __init__ is executed, you could also provide __slots__ for your object, and guarantee that __dict__ is not created at all (if not available in parents), or even allow __dict__ but guarantee that your foreseen attributes are stored in slots anyways. For more on __slots__, see my answer here.

Differentiating between the possible origins of exceptions raised from a compound `with` statement

Differentiating between exceptions that occur in a with statement is tricky because they can originate in different places. Exceptions can be raised from either of the following places (or functions called therein):

ContextManager.__init__
ContextManager.__enter__
the body of the with
ContextManager.__exit__

For more details see the documentation about Context Manager Types.

If we want to distinguish between these different cases, just wrapping the with into a try .. except is not sufficient. Consider the following example (using ValueError as an example but of course it could be substituted with any other exception type):

try:
    with ContextManager():
        BLOCK
except ValueError as err:
    print(err)

Here the except will catch exceptions originating in all of the four different places and thus does not allow to distinguish between them. If we move the instantiation of the context manager object outside the with, we can distinguish between __init__ and BLOCK / __enter__ / __exit__:

try:
    mgr = ContextManager()
except ValueError as err:
    print('__init__ raised:', err)
else:
    try:
        with mgr:
            try:
                BLOCK
            except TypeError:  # catching another type (which we want to handle here)
                pass
    except ValueError as err:
        # At this point we still cannot distinguish between exceptions raised from
        # __enter__, BLOCK, __exit__ (also BLOCK since we didn't catch ValueError in the body)
        pass

Effectively this just helped with the __init__ part but we can add an extra sentinel variable to check whether the body of the with started to execute (i.e. differentiating between __enter__ and the others):

try:
    mgr = ContextManager()  # __init__ could raise
except ValueError as err:
    print('__init__ raised:', err)
else:
    try:
        entered_body = False
        with mgr:
            entered_body = True  # __enter__ did not raise at this point
            try:
                BLOCK
            except TypeError:  # catching another type (which we want to handle here)
                pass
    except ValueError as err:
        if not entered_body:
            print('__enter__ raised:', err)
        else:
            # At this point we know the exception came either from BLOCK or from __exit__
            pass

The tricky part is to differentiate between exceptions originating from BLOCK and __exit__ because an exception that escapes the body of the with will be passed to __exit__ which can decide how to handle it (see the docs). If however __exit__ raises itself, the original exception will be replaced by the new one. To deal with these cases we can add a general except clause in the body of the with to store any potential exception that would have otherwise escaped unnoticed and compare it with the one caught in the outermost except later on – if they are the same this means the origin was BLOCK or otherwise it was __exit__ (in case __exit__ suppresses the exception by returning a true value the outermost except will simply not be executed).

try:
    mgr = ContextManager()  # __init__ could raise
except ValueError as err:
    print('__init__ raised:', err)
else:
    entered_body = exc_escaped_from_body = False
    try:
        with mgr:
            entered_body = True  # __enter__ did not raise at this point
            try:
                BLOCK
            except TypeError:  # catching another type (which we want to handle here)
                pass
            except Exception as err:  # this exception would normally escape without notice
                # we store this exception to check in the outer `except` clause
                # whether it is the same (otherwise it comes from __exit__)
                exc_escaped_from_body = err
                raise  # re-raise since we didn't intend to handle it, just needed to store it
    except ValueError as err:
        if not entered_body:
            print('__enter__ raised:', err)
        elif err is exc_escaped_from_body:
            print('BLOCK raised:', err)
        else:
            print('__exit__ raised:', err)

Alternative approach using the equivalent form mentioned in PEP 343

PEP 343 — The “with” Statement specifies an equivalent “non-with” version of the with statement. Here we can readily wrap the various parts with try ... except and thus differentiate between the different potential error sources:

import sys

try:
    mgr = ContextManager()
except ValueError as err:
    print('__init__ raised:', err)
else:
    try:
        value = type(mgr).__enter__(mgr)
    except ValueError as err:
        print('__enter__ raised:', err)
    else:
        exit = type(mgr).__exit__
        exc = True
        try:
            try:
                BLOCK
            except TypeError:
                pass
            except:
                exc = False
                try:
                    exit_val = exit(mgr, *sys.exc_info())
                except ValueError as err:
                    print('__exit__ raised:', err)
                else:
                    if not exit_val:
                        raise
        except ValueError as err:
            print('BLOCK raised:', err)
        finally:
            if exc:
                try:
                    exit(mgr, None, None, None)
                except ValueError as err:
                    print('__exit__ raised:', err)

Usually a simpler approach will do just fine

The need for such special exception handling should be quite rare and normally wrapping the whole with in a try ... except block will be sufficient. Especially if the various error sources are indicated by different (custom) exception types (the context managers need to be designed accordingly) we can readily distinguish between them. For example:

try:
    with ContextManager():
        BLOCK
except InitError:  # raised from __init__
    ...
except AcquireResourceError:  # raised from __enter__
    ...
except ValueError:  # raised from BLOCK
    ...
except ReleaseResourceError:  # raised from __exit__
    ...

Question 33

I want to write a Django query equivalent to this SQL query:

SELECT * from user where income >= 5000 or income is NULL.

How to construct the Django queryset filter?

User.objects.filter(income__gte=5000, income=0)

This doesn’t work, because it ANDs the filters. I want to OR the filters to get union of individual querysets.

Question 34

from django.db.models import Q
User.objects.filter(Q(income__gte=5000) | Q(income__isnull=True))

via Documentation

Question 35

Because QuerySets implement the Python __or__ operator (|), or union, it just works. As you’d expect, the | binary operator returns a QuerySet so order_by(), .distinct(), and other queryset filters can be tacked on to the end.

combined_queryset = User.objects.filter(income__gte=5000) | User.objects.filter(income__isnull=True)
ordered_queryset = combined_queryset.order_by('-income')

Update 2019-06-20: This is now fully documented in the Django 2.1 QuerySet API reference. More historic discussion can be found in DjangoProject ticket #21333.

Question 36

Both options are already mentioned in the existing answers:

from django.db.models import Q
q1 = User.objects.filter(Q(income__gte=5000) | Q(income__isnull=True))

and

q2 = User.objects.filter(income__gte=5000) | User.objects.filter(income__isnull=True)

However, there seems to be some confusion regarding which one is to prefer.

The point is that they are identical on the SQL level, so feel free to pick whichever you like!

The Django ORM Cookbook talks in some detail about this, here is the relevant part:

queryset = User.objects.filter(
        first_name__startswith='R'
    ) | User.objects.filter(
    last_name__startswith='D'
)

leads to

In [5]: str(queryset.query)
Out[5]: 'SELECT "auth_user"."id", "auth_user"."password", "auth_user"."last_login",
"auth_user"."is_superuser", "auth_user"."username", "auth_user"."first_name",
"auth_user"."last_name", "auth_user"."email", "auth_user"."is_staff",
"auth_user"."is_active", "auth_user"."date_joined" FROM "auth_user"
WHERE ("auth_user"."first_name"::text LIKE R% OR "auth_user"."last_name"::text LIKE D%)'

and

qs = User.objects.filter(Q(first_name__startswith='R') | Q(last_name__startswith='D'))

leads to

In [9]: str(qs.query)
Out[9]: 'SELECT "auth_user"."id", "auth_user"."password", "auth_user"."last_login",
 "auth_user"."is_superuser", "auth_user"."username", "auth_user"."first_name",
  "auth_user"."last_name", "auth_user"."email", "auth_user"."is_staff",
  "auth_user"."is_active", "auth_user"."date_joined" FROM "auth_user"
  WHERE ("auth_user"."first_name"::text LIKE R% OR "auth_user"."last_name"::text LIKE D%)'

source: django-orm-cookbook

Question 37

import numpy as np
y = np.array(((1,2,3),(4,5,6),(7,8,9)))
OUTPUT:
print(y.flatten())
[1   2   3   4   5   6   7   8   9]
print(y.ravel())
[1   2   3   4   5   6   7   8   9]

Both function return the same list. Then what is the need of two different functions performing same job.

Question 38

The current API is that:

flatten always returns a copy.
ravel returns a view of the original array whenever possible. This isn’t visible in the printed output, but if you modify the array returned by ravel, it may modify the entries in the original array. If you modify the entries in an array returned from flatten this will never happen. ravel will often be faster since no memory is copied, but you have to be more careful about modifying the array it returns.
reshape((-1,)) gets a view whenever the strides of the array allow it even if that means you don’t always get a contiguous array.

Question 39

As explained here a key difference is that:

flatten is a method of an ndarray object and hence can only be called for true numpy arrays.
ravel is a library-level function and hence can be called on any object that can successfully be parsed.

For example ravel will work on a list of ndarrays, while flatten is not available for that type of object.

@IanH also points out important differences with memory handling in his answer.

Question 40

Here is the correct namespace for the functions:

Both functions return flattened 1D arrays pointing to the new memory structures.

import numpy
a = numpy.array([[1,2],[3,4]])

r = numpy.ravel(a)
f = numpy.ndarray.flatten(a)  

print(id(a))
print(id(r))
print(id(f))

print(r)
print(f)

print("\nbase r:", r.base)
print("\nbase f:", f.base)

---returns---
140541099429760
140541099471056
140541099473216

[1 2 3 4]
[1 2 3 4]

base r: [[1 2]
 [3 4]]

base f: None

In the upper example:

the memory locations of the results are different,
the results look the same
flatten would return a copy
ravel would return a view.

How we check if something is a copy? Using the .base attribute of the ndarray. If it’s a view, the base will be the original array; if it is a copy, the base will be None.

Question 41

Let’s say I have a NumPy array, a:

a = np.array([
    [1, 2, 3],
    [2, 3, 4]
    ])

And I would like to add a column of zeros to get an array, b:

b = np.array([
    [1, 2, 3, 0],
    [2, 3, 4, 0]
    ])

How can I do this easily in NumPy?

Question 42

I think a more straightforward solution and faster to boot is to do the following:

import numpy as np
N = 10
a = np.random.rand(N,N)
b = np.zeros((N,N+1))
b[:,:-1] = a

And timings:

In [23]: N = 10

In [24]: a = np.random.rand(N,N)

In [25]: %timeit b = np.hstack((a,np.zeros((a.shape[0],1))))
10000 loops, best of 3: 19.6 us per loop

In [27]: %timeit b = np.zeros((a.shape[0],a.shape[1]+1)); b[:,:-1] = a
100000 loops, best of 3: 5.62 us per loop

Question 43

np.r_[ ... ] and np.c_[ ... ] are useful alternatives to vstack and hstack, with square brackets [] instead of round ().
A couple of examples:

: import numpy as np
: N = 3
: A = np.eye(N)

: np.c_[ A, np.ones(N) ]              # add a column
array([[ 1.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  1.],
       [ 0.,  0.,  1.,  1.]])

: np.c_[ np.ones(N), A, np.ones(N) ]  # or two
array([[ 1.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  1.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  1.]])

: np.r_[ A, [A[1]] ]              # add a row
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  1.,  0.]])
: # not np.r_[ A, A[1] ]

: np.r_[ A[0], 1, 2, 3, A[1] ]    # mix vecs and scalars
  array([ 1.,  0.,  0.,  1.,  2.,  3.,  0.,  1.,  0.])

: np.r_[ A[0], [1, 2, 3], A[1] ]  # lists
  array([ 1.,  0.,  0.,  1.,  2.,  3.,  0.,  1.,  0.])

: np.r_[ A[0], (1, 2, 3), A[1] ]  # tuples
  array([ 1.,  0.,  0.,  1.,  2.,  3.,  0.,  1.,  0.])

: np.r_[ A[0], 1:4, A[1] ]        # same, 1:4 == arange(1,4) == 1,2,3
  array([ 1.,  0.,  0.,  1.,  2.,  3.,  0.,  1.,  0.])

(The reason for square brackets [] instead of round () is that Python expands e.g. 1:4 in square — the wonders of overloading.)

Question 44

Use numpy.append:

>>> a = np.array([[1,2,3],[2,3,4]])
>>> a
array([[1, 2, 3],
       [2, 3, 4]])

>>> z = np.zeros((2,1), dtype=int64)
>>> z
array([[0],
       [0]])

>>> np.append(a, z, axis=1)
array([[1, 2, 3, 0],
       [2, 3, 4, 0]])

Question 45

One way, using hstack, is:

b = np.hstack((a, np.zeros((a.shape[0], 1), dtype=a.dtype)))

Question 46

I find the following most elegant:

b = np.insert(a, 3, values=0, axis=1) # Insert values before column 3

An advantage of insert is that it also allows you to insert columns (or rows) at other places inside the array. Also instead of inserting a single value you can easily insert a whole vector, for instance duplicate the last column:

b = np.insert(a, insert_index, values=a[:,2], axis=1)

Which leads to:

array([[1, 2, 3, 3],
       [2, 3, 4, 4]])

For the timing, insert might be slower than JoshAdel’s solution:

In [1]: N = 10

In [2]: a = np.random.rand(N,N)

In [3]: %timeit b = np.hstack((a, np.zeros((a.shape[0], 1))))
100000 loops, best of 3: 7.5 µs per loop

In [4]: %timeit b = np.zeros((a.shape[0], a.shape[1]+1)); b[:,:-1] = a
100000 loops, best of 3: 2.17 µs per loop

In [5]: %timeit b = np.insert(a, 3, values=0, axis=1)
100000 loops, best of 3: 10.2 µs per loop

Question 47

I was also interested in this question and compared the speed of

numpy.c_[a, a]
numpy.stack([a, a]).T
numpy.vstack([a, a]).T
numpy.ascontiguousarray(numpy.stack([a, a]).T)               
numpy.ascontiguousarray(numpy.vstack([a, a]).T)
numpy.column_stack([a, a])
numpy.concatenate([a[:,None], a[:,None]], axis=1)
numpy.concatenate([a[None], a[None]], axis=0).T

which all do the same thing for any input vector a. Timings for growing a:

Note that all non-contiguous variants (in particular stack/vstack) are eventually faster than all contiguous variants. column_stack (for its clarity and speed) appears to be a good option if you require contiguity.

Code to reproduce the plot:

import numpy
import perfplot

perfplot.save(
    "out.png",
    setup=lambda n: numpy.random.rand(n),
    kernels=[
        lambda a: numpy.c_[a, a],
        lambda a: numpy.ascontiguousarray(numpy.stack([a, a]).T),
        lambda a: numpy.ascontiguousarray(numpy.vstack([a, a]).T),
        lambda a: numpy.column_stack([a, a]),
        lambda a: numpy.concatenate([a[:, None], a[:, None]], axis=1),
        lambda a: numpy.ascontiguousarray(
            numpy.concatenate([a[None], a[None]], axis=0).T
        ),
        lambda a: numpy.stack([a, a]).T,
        lambda a: numpy.vstack([a, a]).T,
        lambda a: numpy.concatenate([a[None], a[None]], axis=0).T,
    ],
    labels=[
        "c_",
        "ascont(stack)",
        "ascont(vstack)",
        "column_stack",
        "concat",
        "ascont(concat)",
        "stack (non-cont)",
        "vstack (non-cont)",
        "concat (non-cont)",
    ],
    n_range=[2 ** k for k in range(20)],
    xlabel="len(a)",
    logx=True,
    logy=True,
)

Question 48

I think:

np.column_stack((a, zeros(shape(a)[0])))

is more elegant.

Question 49

np.concatenate also works

>>> a = np.array([[1,2,3],[2,3,4]])
>>> a
array([[1, 2, 3],
       [2, 3, 4]])
>>> z = np.zeros((2,1))
>>> z
array([[ 0.],
       [ 0.]])
>>> np.concatenate((a, z), axis=1)
array([[ 1.,  2.,  3.,  0.],
       [ 2.,  3.,  4.,  0.]])

Question 50

Assuming M is a (100,3) ndarray and y is a (100,) ndarray append can be used as follows:

M=numpy.append(M,y[:,None],1)

The trick is to use

y[:, None]

This converts y to a (100, 1) 2D array.

M.shape

now gives

(100, 4)

Question 51

I like JoshAdel’s answer because of the focus on performance. A minor performance improvement is to avoid the overhead of initializing with zeros, only to be overwritten. This has a measurable difference when N is large, empty is used instead of zeros, and the column of zeros is written as a separate step:

In [1]: import numpy as np

In [2]: N = 10000

In [3]: a = np.ones((N,N))

In [4]: %timeit b = np.zeros((a.shape[0],a.shape[1]+1)); b[:,:-1] = a
1 loops, best of 3: 492 ms per loop

In [5]: %timeit b = np.empty((a.shape[0],a.shape[1]+1)); b[:,:-1] = a; b[:,-1] = np.zeros((a.shape[0],))
1 loops, best of 3: 407 ms per loop

Question 52

np.insert also serves the purpose.

matA = np.array([[1,2,3], 
                 [2,3,4]])
idx = 3
new_col = np.array([0, 0])
np.insert(matA, idx, new_col, axis=1)

array([[1, 2, 3, 0],
       [2, 3, 4, 0]])

It inserts values, here new_col, before a given index, here idx along one axis. In other words, the newly inserted values will occupy the idx column and move what were originally there at and after idx backward.

Question 53

Add an extra column to a numpy array:

Numpy’s np.append method takes three parameters, the first two are 2D numpy arrays and the 3rd is an axis parameter instructing along which axis to append:

import numpy as np  
x = np.array([[1,2,3], [4,5,6]]) 
print("Original x:") 
print(x) 

y = np.array([[1], [1]]) 
print("Original y:") 
print(y) 

print("x appended to y on axis of 1:") 
print(np.append(x, y, axis=1))

Prints:

Original x:
[[1 2 3]
 [4 5 6]]
Original y:
[[1]
 [1]]
x appended to y on axis of 1:
[[1 2 3 1]
 [4 5 6 1]]

Question 54

A bit late to the party, but nobody posted this answer yet, so for the sake of completeness: you can do this with list comprehensions, on a plain Python array:

source = a.tolist()
result = [row + [0] for row in source]
b = np.array(result)

Question 55

In my case, I had to add a column of ones to a NumPy array

X = array([ 6.1101, 5.5277, ... ])
X.shape => (97,)
X = np.concatenate((np.ones((m,1), dtype=np.int), X.reshape(m,1)), axis=1)

After X.shape => (97, 2)

array([[ 1. , 6.1101],
       [ 1. , 5.5277],
...

Question 56

For me, the next way looks pretty intuitive and simple.

zeros = np.zeros((2,1)) #2 is a number of rows in your array.   
b = np.hstack((a, zeros))

Question 57

There is a function specifically for this. It is called numpy.pad

a = np.array([[1,2,3], [2,3,4]])
b = np.pad(a, ((0, 0), (0, 1)), mode='constant', constant_values=0)
print b
>>> array([[1, 2, 3, 0],
           [2, 3, 4, 0]])

Here is what it says in the docstring:

Pads an array.

Parameters
----------
array : array_like of rank N
    Input array
pad_width : {sequence, array_like, int}
    Number of values padded to the edges of each axis.
    ((before_1, after_1), ... (before_N, after_N)) unique pad widths
    for each axis.
    ((before, after),) yields same before and after pad for each axis.
    (pad,) or int is a shortcut for before = after = pad width for all
    axes.
mode : str or function
    One of the following string values or a user supplied function.

    'constant'
        Pads with a constant value.
    'edge'
        Pads with the edge values of array.
    'linear_ramp'
        Pads with the linear ramp between end_value and the
        array edge value.
    'maximum'
        Pads with the maximum value of all or part of the
        vector along each axis.
    'mean'
        Pads with the mean value of all or part of the
        vector along each axis.
    'median'
        Pads with the median value of all or part of the
        vector along each axis.
    'minimum'
        Pads with the minimum value of all or part of the
        vector along each axis.
    'reflect'
        Pads with the reflection of the vector mirrored on
        the first and last values of the vector along each
        axis.
    'symmetric'
        Pads with the reflection of the vector mirrored
        along the edge of the array.
    'wrap'
        Pads with the wrap of the vector along the axis.
        The first values are used to pad the end and the
        end values are used to pad the beginning.
    <function>
        Padding function, see Notes.
stat_length : sequence or int, optional
    Used in 'maximum', 'mean', 'median', and 'minimum'.  Number of
    values at edge of each axis used to calculate the statistic value.

    ((before_1, after_1), ... (before_N, after_N)) unique statistic
    lengths for each axis.

    ((before, after),) yields same before and after statistic lengths
    for each axis.

    (stat_length,) or int is a shortcut for before = after = statistic
    length for all axes.

    Default is ``None``, to use the entire axis.
constant_values : sequence or int, optional
    Used in 'constant'.  The values to set the padded values for each
    axis.

    ((before_1, after_1), ... (before_N, after_N)) unique pad constants
    for each axis.

    ((before, after),) yields same before and after constants for each
    axis.

    (constant,) or int is a shortcut for before = after = constant for
    all axes.

    Default is 0.
end_values : sequence or int, optional
    Used in 'linear_ramp'.  The values used for the ending value of the
    linear_ramp and that will form the edge of the padded array.

    ((before_1, after_1), ... (before_N, after_N)) unique end values
    for each axis.

    ((before, after),) yields same before and after end values for each
    axis.

    (constant,) or int is a shortcut for before = after = end value for
    all axes.

    Default is 0.
reflect_type : {'even', 'odd'}, optional
    Used in 'reflect', and 'symmetric'.  The 'even' style is the
    default with an unaltered reflection around the edge value.  For
    the 'odd' style, the extented part of the array is created by
    subtracting the reflected values from two times the edge value.

Returns
-------
pad : ndarray
    Padded array of rank equal to `array` with shape increased
    according to `pad_width`.

Notes
-----
.. versionadded:: 1.7.0

For an array with rank greater than 1, some of the padding of later
axes is calculated from padding of previous axes.  This is easiest to
think about with a rank 2 array where the corners of the padded array
are calculated by using padded values from the first axis.

The padding function, if used, should return a rank 1 array equal in
length to the vector argument with padded values replaced. It has the
following signature::

    padding_func(vector, iaxis_pad_width, iaxis, kwargs)

where

    vector : ndarray
        A rank 1 array already padded with zeros.  Padded values are
        vector[:pad_tuple[0]] and vector[-pad_tuple[1]:].
    iaxis_pad_width : tuple
        A 2-tuple of ints, iaxis_pad_width[0] represents the number of
        values padded at the beginning of vector where
        iaxis_pad_width[1] represents the number of values padded at
        the end of vector.
    iaxis : int
        The axis currently being calculated.
    kwargs : dict
        Any keyword arguments the function requires.

Examples
--------
>>> a = [1, 2, 3, 4, 5]
>>> np.pad(a, (2,3), 'constant', constant_values=(4, 6))
array([4, 4, 1, 2, 3, 4, 5, 6, 6, 6])

>>> np.pad(a, (2, 3), 'edge')
array([1, 1, 1, 2, 3, 4, 5, 5, 5, 5])

>>> np.pad(a, (2, 3), 'linear_ramp', end_values=(5, -4))
array([ 5,  3,  1,  2,  3,  4,  5,  2, -1, -4])

>>> np.pad(a, (2,), 'maximum')
array([5, 5, 1, 2, 3, 4, 5, 5, 5])

>>> np.pad(a, (2,), 'mean')
array([3, 3, 1, 2, 3, 4, 5, 3, 3])

>>> np.pad(a, (2,), 'median')
array([3, 3, 1, 2, 3, 4, 5, 3, 3])

>>> a = [[1, 2], [3, 4]]
>>> np.pad(a, ((3, 2), (2, 3)), 'minimum')
array([[1, 1, 1, 2, 1, 1, 1],
       [1, 1, 1, 2, 1, 1, 1],
       [1, 1, 1, 2, 1, 1, 1],
       [1, 1, 1, 2, 1, 1, 1],
       [3, 3, 3, 4, 3, 3, 3],
       [1, 1, 1, 2, 1, 1, 1],
       [1, 1, 1, 2, 1, 1, 1]])

>>> a = [1, 2, 3, 4, 5]
>>> np.pad(a, (2, 3), 'reflect')
array([3, 2, 1, 2, 3, 4, 5, 4, 3, 2])

>>> np.pad(a, (2, 3), 'reflect', reflect_type='odd')
array([-1,  0,  1,  2,  3,  4,  5,  6,  7,  8])

>>> np.pad(a, (2, 3), 'symmetric')
array([2, 1, 1, 2, 3, 4, 5, 5, 4, 3])

>>> np.pad(a, (2, 3), 'symmetric', reflect_type='odd')
array([0, 1, 1, 2, 3, 4, 5, 5, 6, 7])

>>> np.pad(a, (2, 3), 'wrap')
array([4, 5, 1, 2, 3, 4, 5, 1, 2, 3])

>>> def pad_with(vector, pad_width, iaxis, kwargs):
...     pad_value = kwargs.get('padder', 10)
...     vector[:pad_width[0]] = pad_value
...     vector[-pad_width[1]:] = pad_value
...     return vector
>>> a = np.arange(6)
>>> a = a.reshape((2, 3))
>>> np.pad(a, 2, pad_with)
array([[10, 10, 10, 10, 10, 10, 10],
       [10, 10, 10, 10, 10, 10, 10],
       [10, 10,  0,  1,  2, 10, 10],
       [10, 10,  3,  4,  5, 10, 10],
       [10, 10, 10, 10, 10, 10, 10],
       [10, 10, 10, 10, 10, 10, 10]])
>>> np.pad(a, 2, pad_with, padder=100)
array([[100, 100, 100, 100, 100, 100, 100],
       [100, 100, 100, 100, 100, 100, 100],
       [100, 100,   0,   1,   2, 100, 100],
       [100, 100,   3,   4,   5, 100, 100],
       [100, 100, 100, 100, 100, 100, 100],
       [100, 100, 100, 100, 100, 100, 100]])

Question 58

I want to loop over the contents of a text file and do a search and replace on some lines and write the result back to the file. I could first load the whole file in memory and then write it back, but that probably is not the best way to do it.

What is the best way to do this, within the following code?

f = open(file)
for line in f:
    if line.contains('foo'):
        newline = line.replace('foo', 'bar')
        # how to write this newline back to the file

Question 59

I guess something like this should do it. It basically writes the content to a new file and replaces the old file with the new file:

from tempfile import mkstemp
from shutil import move, copymode
from os import fdopen, remove

def replace(file_path, pattern, subst):
    #Create temp file
    fh, abs_path = mkstemp()
    with fdopen(fh,'w') as new_file:
        with open(file_path) as old_file:
            for line in old_file:
                new_file.write(line.replace(pattern, subst))
    #Copy the file permissions from the old file to the new file
    copymode(file_path, abs_path)
    #Remove original file
    remove(file_path)
    #Move new file
    move(abs_path, file_path)

Question 60

The shortest way would probably be to use the fileinput module. For example, the following adds line numbers to a file, in-place:

import fileinput

for line in fileinput.input("test.txt", inplace=True):
    print('{} {}'.format(fileinput.filelineno(), line), end='') # for Python 3
    # print "%d: %s" % (fileinput.filelineno(), line), # for Python 2

What happens here is:

The original file is moved to a backup file
The standard output is redirected to the original file within the loop
Thus any print statements write back into the original file

fileinput has more bells and whistles. For example, it can be used to automatically operate on all files in sys.args[1:], without your having to iterate over them explicitly. Starting with Python 3.2 it also provides a convenient context manager for use in a with statement.

While fileinput is great for throwaway scripts, I would be wary of using it in real code because admittedly it’s not very readable or familiar. In real (production) code it’s worthwhile to spend just a few more lines of code to make the process explicit and thus make the code readable.

There are two options:

The file is not overly large, and you can just read it wholly to memory. Then close the file, reopen it in writing mode and write the modified contents back.
The file is too large to be stored in memory; you can move it over to a temporary file and open that, reading it line by line, writing back into the original file. Note that this requires twice the storage.

Question 61

Here’s another example that was tested, and will match search & replace patterns:

import fileinput
import sys

def replaceAll(file,searchExp,replaceExp):
    for line in fileinput.input(file, inplace=1):
        if searchExp in line:
            line = line.replace(searchExp,replaceExp)
        sys.stdout.write(line)

Example use:

replaceAll("/fooBar.txt","Hello\sWorld!$","Goodbye\sWorld.")

Question 62

This should work: (inplace editing)

import fileinput

# Does a list of files, and
# redirects STDOUT to the file in question
for line in fileinput.input(files, inplace = 1): 
      print line.replace("foo", "bar"),

Question 63

Based on the answer by Thomas Watnedal. However, this does not answer the line-to-line part of the original question exactly. The function can still replace on a line-to-line basis

This implementation replaces the file contents without using temporary files, as a consequence file permissions remain unchanged.

Also re.sub instead of replace, allows regex replacement instead of plain text replacement only.

Reading the file as a single string instead of line by line allows for multiline match and replacement.

import re

def replace(file, pattern, subst):
    # Read contents from file as a single string
    file_handle = open(file, 'r')
    file_string = file_handle.read()
    file_handle.close()

    # Use RE package to allow for replacement (also allowing for (multiline) REGEX)
    file_string = (re.sub(pattern, subst, file_string))

    # Write contents to file.
    # Using mode 'w' truncates the file.
    file_handle = open(file, 'w')
    file_handle.write(file_string)
    file_handle.close()

Question 64

As lassevk suggests, write out the new file as you go, here is some example code:

fin = open("a.txt")
fout = open("b.txt", "wt")
for line in fin:
    fout.write( line.replace('foo', 'bar') )
fin.close()
fout.close()

问题：在Python中，什么时候使用字典，列表或集合？

回答 0

回答 1

回答 2

回答 3

回答 4

详细说明

Detailed Explanation

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

问题：Python的内置词典如何实现？

回答 0

回答 1

Python的字典是哈希表

新的紧凑型哈希表

共用金钥

自定义对象和替代项的共享密钥

也可以看看：

Python’s Dictionaries are Hash Tables

The New Compact Hash Tables

Shared Keys

Shared Keys for Custom Objects & Alternatives

See also:

回答 2

问题：如何更正TypeError：散列之前必须对Unicode对象进行编码？

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

问题：使用Python’with’语句时捕获异常

回答 0

回答 1

回答 2

使用Python’with’语句时捕获异常

Catching an exception while using a Python ‘with’ statement

回答 3

区分复合with语句引发的异常的可能来源

使用PEP 343中提到的等效形式的替代方法

通常，更简单的方法就可以了

Differentiating between the possible origins of exceptions raised from a compound with statement

Alternative approach using the equivalent form mentioned in PEP 343

Usually a simpler approach will do just fine

问题：如何在Django queryset中执行OR条件？

回答 0

回答 1

回答 2

问题：numpy中的flatten和ravel函数有什么区别？

回答 0

回答 1

回答 2

问题：如何在NumPy数组中添加额外的列

回答 0

回答 1

回答 2

回答 3

回答 4

回答 5

回答 6

回答 7

回答 8

回答 9

回答 10

回答 11

向numpy数组添加额外的列：

Add an extra column to a numpy array:

回答 12

回答 13

回答 14

回答 15

问题：在Python中搜索并替换文件中的一行

区分复合`with`语句引发的异常的可能来源

Differentiating between the possible origins of exceptions raised from a compound `with` statement