Tag archive: numpy

Using numpy to build an array of all combinations of two arrays

Question: Using numpy to build an array of all combinations of two arrays


I’m trying to run over the parameter space of a 6-parameter function to study its numerical behavior before trying to do anything complex with it, so I’m searching for an efficient way to do this.

My function takes float values given a 6-dim numpy array as input. What I tried to do initially was this:

First I created a function that takes 2 arrays and generates an array with all combinations of values from the two arrays:

import numpy as np
from numpy import r_

def comb(a, b):
    c = []
    for i in a:
        for j in b:
            c.append(r_[i, j])
    return c

Then I used reduce() to apply that to m copies of the same array:

from functools import reduce  # reduce is a builtin on Python 2

def combs(a, m):
    return reduce(comb, [a]*m)

And then I evaluate my function like this:

values = combs(np.arange(0, 1, 0.1), 6)
for val in values:
    print(F(val))

This works but it’s waaaay too slow. I know the space of parameters is huge, but it shouldn’t be this slow. I have only sampled 10^6 (a million) points in this example, and it took more than 15 seconds just to create the array values.

Do you know any more efficient way of doing this with numpy?

I can modify the way the function F takes its arguments if necessary.


Answer 0


In newer versions of numpy (>1.8.x), numpy.meshgrid() provides a much faster implementation:

@pv’s solution

In [113]:

%timeit cartesian(([1, 2, 3], [4, 5], [6, 7]))
10000 loops, best of 3: 135 µs per loop
In [114]:

cartesian(([1, 2, 3], [4, 5], [6, 7]))

Out[114]:
array([[1, 4, 6],
       [1, 4, 7],
       [1, 5, 6],
       [1, 5, 7],
       [2, 4, 6],
       [2, 4, 7],
       [2, 5, 6],
       [2, 5, 7],
       [3, 4, 6],
       [3, 4, 7],
       [3, 5, 6],
       [3, 5, 7]])

numpy.meshgrid() used to be 2D only; now it is capable of ND. In this case, 3D:

In [115]:

%timeit np.array(np.meshgrid([1, 2, 3], [4, 5], [6, 7])).T.reshape(-1,3)
10000 loops, best of 3: 74.1 µs per loop
In [116]:

np.array(np.meshgrid([1, 2, 3], [4, 5], [6, 7])).T.reshape(-1,3)

Out[116]:
array([[1, 4, 6],
       [1, 5, 6],
       [2, 4, 6],
       [2, 5, 6],
       [3, 4, 6],
       [3, 5, 6],
       [1, 4, 7],
       [1, 5, 7],
       [2, 4, 7],
       [2, 5, 7],
       [3, 4, 7],
       [3, 5, 7]])

Note that the order of the final result is slightly different.
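
If the row order matters, a small tweak (my addition, not part of the original answer) recovers the same ordering as cartesian() by using indexing='ij':

import numpy as np

arrays = ([1, 2, 3], [4, 5], [6, 7])
# 'ij' indexing makes the first input vary slowest, matching cartesian()'s order
out = np.array(np.meshgrid(*arrays, indexing='ij')).reshape(len(arrays), -1).T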


Answer 1


Here’s a pure-numpy implementation. It’s about 5× faster than using itertools.


import numpy as np

def cartesian(arrays, out=None):
    """
    Generate a cartesian product of input arrays.

    Parameters
    ----------
    arrays : list of array-like
        1-D arrays to form the cartesian product of.
    out : ndarray
        Array to place the cartesian product in.

    Returns
    -------
    out : ndarray
        2-D array of shape (M, len(arrays)) containing cartesian products
        formed of input arrays.

    Examples
    --------
    >>> cartesian(([1, 2, 3], [4, 5], [6, 7]))
    array([[1, 4, 6],
           [1, 4, 7],
           [1, 5, 6],
           [1, 5, 7],
           [2, 4, 6],
           [2, 4, 7],
           [2, 5, 6],
           [2, 5, 7],
           [3, 4, 6],
           [3, 4, 7],
           [3, 5, 6],
           [3, 5, 7]])

    """

    arrays = [np.asarray(x) for x in arrays]
    dtype = arrays[0].dtype

    n = np.prod([x.size for x in arrays])
    if out is None:
        out = np.zeros([n, len(arrays)], dtype=dtype)

    m = n // arrays[0].size  # integer division so m can be used as an index
    out[:, 0] = np.repeat(arrays[0], m)
    if arrays[1:]:
        cartesian(arrays[1:], out=out[0:m, 1:])
        for j in range(1, arrays[0].size):  # xrange in the original Python 2 code
            out[j*m:(j+1)*m, 1:] = out[0:m, 1:]
    return out

Answer 2


itertools.combinations is in general the fastest way to get combinations from a Python container (if you do in fact want combinations, i.e., arrangements WITHOUT repetitions and independent of order; that’s not what your code appears to be doing, but I can’t tell whether that’s because your code is buggy or because you’re using the wrong terminology).

If you want something different than combinations, perhaps other iterators in itertools, product or permutations, might serve you better. For example, it looks like your code is roughly the same as:

import itertools
import numpy as np

for val in itertools.product(np.arange(0, 1, 0.1), repeat=6):
    print(F(val))

All of these iterators yield tuples, not lists or numpy arrays, so if your F is picky about getting specifically a numpy array, you’ll have to accept the extra overhead of constructing or clearing and re-filling an array at each step.
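
If the per-step overhead is a concern, the product can also be materialized into a single array up front (a sketch of mine, not from the original answer; memory permitting):

import itertools
import numpy as np

vals = np.array(list(itertools.product(np.arange(0, 1, 0.1), repeat=6)))
print(vals.shape)  # (1000000, 6): one row per parameter combination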


Answer 3


You can do something like this:

import numpy as np

def cartesian_coord(*arrays):
    grid = np.meshgrid(*arrays)        
    coord_list = [entry.ravel() for entry in grid]
    points = np.vstack(coord_list).T
    return points

a = np.arange(4)  # fake data
print(cartesian_coord(*6*[a]))

which gives

array([[0, 0, 0, 0, 0, 0],
   [0, 0, 0, 0, 0, 1],
   [0, 0, 0, 0, 0, 2],
   ..., 
   [3, 3, 3, 3, 3, 1],
   [3, 3, 3, 3, 3, 2],
   [3, 3, 3, 3, 3, 3]])

Answer 4


The following numpy implementation should be approx. 2x the speed of the given answer:

import numpy as np

def cartesian2(arrays):
    arrays = [np.asarray(a) for a in arrays]
    # np.indices expects a sequence of ints, not a generator
    shape = tuple(len(x) for x in arrays)

    ix = np.indices(shape, dtype=int)
    ix = ix.reshape(len(arrays), -1).T

    # note: ix has integer dtype, so this variant is exact only for
    # integer-valued input arrays
    for n, arr in enumerate(arrays):
        ix[:, n] = arr[ix[:, n]]

    return ix
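
A quick usage check (my own sketch; note the integer-dtype caveat in the comment above):

print(cartesian2(([1, 2], [4, 5])))
# [[1 4]
#  [1 5]
#  [2 4]
#  [2 5]]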

Answer 5


It looks like you want a grid to evaluate your function, in which case you can use numpy.ogrid (open) or numpy.mgrid (fleshed out):

import numpy
my_grid = numpy.mgrid[[slice(0,1,0.1)]*6]
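
The result is a (6, 10, ..., 10) grid; if you want one row per parameter combination, it can be flattened like this (my addition, not part of the original answer):

points = my_grid.reshape(6, -1).T  # shape (10**6, 6)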

Answer 6


You can use np.array(list(itertools.product(a, b))). Note that np.array does not consume iterators, so the product has to be materialized with list() first.


Answer 7


Here’s yet another way, using pure NumPy, no recursion, no list comprehension, and no explicit for loops. It’s about 20% slower than the original answer, and it’s based on np.meshgrid.

import numpy as np

def cartesian(*arrays):
    mesh = np.meshgrid(*arrays)  # standard numpy meshgrid
    dim = len(mesh)  # number of dimensions
    elements = mesh[0].size  # number of elements, any index will do
    flat = np.concatenate(mesh).ravel()  # flatten the whole meshgrid
    reshape = np.reshape(flat, (dim, elements)).T  # reshape and transpose
    return reshape

For example,

x = np.arange(3)
a = cartesian(x, x, x, x, x)
print(a)

gives

[[0 0 0 0 0]
 [0 0 0 0 1]
 [0 0 0 0 2]
 ..., 
 [2 2 2 2 0]
 [2 2 2 2 1]
 [2 2 2 2 2]]

Answer 8


For a pure numpy implementation of the Cartesian product of 1D arrays (or flat Python lists), just use meshgrid(), roll the axes with transpose(), and reshape to the desired output:

import numpy as np

def cartprod(*arrays):
    N = len(arrays)
    return np.transpose(np.meshgrid(*arrays, indexing='ij'),
                        np.roll(np.arange(N + 1), -1)).reshape(-1, N)

Note this has the convention of last axis changing fastest (“C style” or “row-major”).

In [88]: cartprod([1,2,3], [4,8], [100, 200, 300, 400], [-5, -4])
Out[88]: 
array([[  1,   4, 100,  -5],
       [  1,   4, 100,  -4],
       [  1,   4, 200,  -5],
       [  1,   4, 200,  -4],
       [  1,   4, 300,  -5],
       [  1,   4, 300,  -4],
       [  1,   4, 400,  -5],
       [  1,   4, 400,  -4],
       [  1,   8, 100,  -5],
       [  1,   8, 100,  -4],
       [  1,   8, 200,  -5],
       [  1,   8, 200,  -4],
       [  1,   8, 300,  -5],
       [  1,   8, 300,  -4],
       [  1,   8, 400,  -5],
       [  1,   8, 400,  -4],
       [  2,   4, 100,  -5],
       [  2,   4, 100,  -4],
       [  2,   4, 200,  -5],
       [  2,   4, 200,  -4],
       [  2,   4, 300,  -5],
       [  2,   4, 300,  -4],
       [  2,   4, 400,  -5],
       [  2,   4, 400,  -4],
       [  2,   8, 100,  -5],
       [  2,   8, 100,  -4],
       [  2,   8, 200,  -5],
       [  2,   8, 200,  -4],
       [  2,   8, 300,  -5],
       [  2,   8, 300,  -4],
       [  2,   8, 400,  -5],
       [  2,   8, 400,  -4],
       [  3,   4, 100,  -5],
       [  3,   4, 100,  -4],
       [  3,   4, 200,  -5],
       [  3,   4, 200,  -4],
       [  3,   4, 300,  -5],
       [  3,   4, 300,  -4],
       [  3,   4, 400,  -5],
       [  3,   4, 400,  -4],
       [  3,   8, 100,  -5],
       [  3,   8, 100,  -4],
       [  3,   8, 200,  -5],
       [  3,   8, 200,  -4],
       [  3,   8, 300,  -5],
       [  3,   8, 300,  -4],
       [  3,   8, 400,  -5],
       [  3,   8, 400,  -4]])

If you want to change the first axis fastest (“FORTRAN style” or “column-major”), just change the order parameter of reshape() like this: reshape((-1, N), order='F')
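
Spelled out, the column-major variant described in the sentence above looks like this (my rendering of the author's note):

def cartprod_f(*arrays):
    N = len(arrays)
    return np.transpose(np.meshgrid(*arrays, indexing='ij'),
                        np.roll(np.arange(N + 1), -1)).reshape((-1, N), order='F')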


Answer 9


Pandas merge offers a naive, fast solution to the problem:

import numpy as np
import pandas as pd

# given the lists
x, y, z = [1, 2, 3], [4, 5], [6, 7]

# get dfs with the same, constant index
x = pd.DataFrame({'x': x}, index=np.repeat(0, len(x)))
y = pd.DataFrame({'y': y}, index=np.repeat(0, len(y)))
z = pd.DataFrame({'z': z}, index=np.repeat(0, len(z)))

# get all combinations stored in a new df
df = pd.merge(x, pd.merge(y, z, left_index=True, right_index=True),
              left_index=True, right_index=True)
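
The merged frame can be handed back to numpy if needed (my addition):

print(df.to_numpy())
# [[1 4 6]
#  [1 4 7]
#  [1 5 6]
#  ...
#  [3 5 7]]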

‘and’ (boolean) vs ‘&’ (bitwise) – why the difference in behavior with lists vs NumPy arrays?

Question: ‘and’ (boolean) vs ‘&’ (bitwise) – why the difference in behavior with lists vs NumPy arrays?


What explains the difference in behavior of boolean and bitwise operations on lists vs NumPy arrays?

I’m confused about the appropriate use of & vs and in Python, illustrated in the following examples.

mylist1 = [True,  True,  True, False,  True]
mylist2 = [False, True, False,  True, False]

>>> len(mylist1) == len(mylist2)
True

# ---- Example 1 ----
>>> mylist1 and mylist2
[False, True, False, True, False]
# I would have expected [False, True, False, False, False]

# ---- Example 2 ----
>>> mylist1 & mylist2
TypeError: unsupported operand type(s) for &: 'list' and 'list'
# Why not just like example 1?

>>> import numpy as np

# ---- Example 3 ----
>>> np.array(mylist1) and np.array(mylist2)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
# Why not just like Example 4?

# ---- Example 4 ----
>>> np.array(mylist1) & np.array(mylist2)
array([False,  True, False, False, False], dtype=bool)
# This is the output I was expecting!

This answer and this answer helped me understand that and is a boolean operation but & is a bitwise operation.

I read about bitwise operations to better understand the concept, but I am struggling to use that information to make sense of my above 4 examples.

Example 4 led me to my desired output, so that is fine, but I am still confused about when/how/why I should use and vs &. Why do lists and NumPy arrays behave differently with these operators?

Can anyone help me understand the difference between boolean and bitwise operations to explain why they handle lists and NumPy arrays differently?


Answer 0


and tests whether both expressions are logically True while & (when used with True/False values) tests if both are True.

In Python, empty built-in objects are typically treated as logically False while non-empty built-ins are logically True. This facilitates the common use case where you want to do something if a list is empty and something else if the list is not. Note that this means that the list [False] is logically True:

>>> if [False]:
...    print('True')
...
True

So in Example 1, the first list is non-empty and therefore logically True, so the truth value of the and is the same as that of the second list. (In our case, the second list is non-empty and therefore logically True, but identifying that would require an unnecessary step of calculation.)

For example 2, lists cannot meaningfully be combined in a bitwise fashion because they can contain arbitrary unlike elements. Things that can be combined bitwise include: Trues and Falses, integers.

NumPy objects, by contrast, support vectorized calculations. That is, they let you perform the same operations on multiple pieces of data.

Example 3 fails because NumPy arrays (of length > 1) have no truth value as this prevents vector-based logic confusion.

Example 4 is simply a vectorized bitwise and operation.

Bottom Line

  • If you are not dealing with arrays and are not performing math manipulations of integers, you probably want and.

  • If you have vectors of truth values that you wish to combine, use numpy with &.
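
A related option (my addition, not part of the original answer): NumPy also exposes the logical operation directly, which reads more explicitly than the bitwise operator:

import numpy as np

a = np.array([True, True, True, False, True])
b = np.array([False, True, False, True, False])
print(np.logical_and(a, b))  # [False  True False False False]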


Answer 1


About list

First a very important point, from which everything will follow (I hope).

In ordinary Python, list is not special in any way (except for having cute syntax for constructing one, which is mostly a historical accident). Once a list [3,2,6] is made, it is for all intents and purposes just an ordinary Python object, like the number 3, the set {3,7}, or the function lambda x: x+5.

(Yes, it supports changing its elements, and it supports iteration, and many other things, but that’s just what a type is: it supports some operations, while not supporting some others. int supports raising to a power, but that doesn’t make it very special – it’s just what an int is. lambda supports calling, but that doesn’t make it very special – that’s what lambda is for, after all:).

About and

and is not an operator (you can call it “operator”, but you can call “for” an operator too:). Operators in Python are (implemented through) methods called on objects of some type, usually written as part of that type. A method cannot withhold the evaluation of some of its operands, but and can (and must) do that.

The consequence of that is that and cannot be overloaded, just like for cannot be overloaded. It is completely general, and communicates through a specified protocol. What you can do is customize your part of the protocol, but that doesn’t mean you can alter the behavior of and completely. The protocol is:

Imagine Python interpreting “a and b” (this doesn’t happen literally this way, but it helps understanding). When it comes to “and”, it looks at the object it has just evaluated (a), and asks it: are you true? (NOT: are you True?) If you are an author of a’s class, you can customize this answer. If a answers “no”, and (skips b completely, it is not evaluated at all, and) says: a is my result (NOT: False is my result).

If a doesn’t answer, and asks it: what is your length? (Again, you can customize this as an author of a‘s class). If a answers 0, and does the same as above – considers it false (NOT False), skips b, and gives a as result.

If a answers something other than 0 to the second question (“what is your length”), or it doesn’t answer at all, or it answers “yes” to the first one (“are you true”), and evaluates b, and says: b is my result. Note that it does NOT ask b any questions.

The other way to say all of this is that a and b is almost the same as b if a else a, except a is evaluated only once.

Now sit for a few minutes with a pen and paper, and convince yourself that when {a,b} is a subset of {True,False}, it works exactly as you would expect of Boolean operators. But I hope I have convinced you it is much more general, and as you’ll see, much more useful this way.

Putting those two together

Now I hope you understand your example 1. and doesn’t care if mylist1 is a number, list, lambda or an object of a class Argmhbl. It just cares about mylist1’s answer to the questions of the protocol. And of course, mylist1 answers 5 to the question about length, so and returns mylist2. And that’s it. It has nothing to do with elements of mylist1 and mylist2 – they don’t enter the picture anywhere.

Second example: & on list

On the other hand, & is an operator like any other, like + for example. It can be defined for a type by defining a special method on that class. int defines it as bitwise “and”, and bool defines it as logical “and”, but that’s just one option: for example, sets and some other objects like dict keys views define it as a set intersection. list just doesn’t define it, probably because Guido didn’t think of any obvious way of defining it.

numpy

On the other leg:-D, numpy arrays are special, or at least they are trying to be. Of course, numpy.array is just a class, it cannot override and in any way, so it does the next best thing: when asked “are you true”, numpy.array raises a ValueError, effectively saying “please rephrase the question, my view of truth doesn’t fit into your model”. (Note that the ValueError message doesn’t speak about and – because numpy.array doesn’t know who is asking it the question; it just speaks about truth.)

For &, it’s a completely different story. numpy.array can define it as it wishes, and it defines & consistently with its other operators: pointwise. So you finally get what you want.
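
A small demonstration of the protocol described above (my own sketch): an object customizes the “are you true?” answer via __bool__ (or __len__), which is exactly the hook numpy.array uses to raise its ValueError:

class Chatty:
    def __bool__(self):  # __nonzero__ in Python 2
        print("asked: are you true?")
        return False

c = Chatty()
result = c and [1, 2, 3]  # prints the question and short-circuits
print(result is c)        # True: `and` returned the first operand itself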

HTH,


Answer 2


The short-circuiting boolean operators (and, or) can’t be overridden because there is no satisfying way to do this without introducing new language features or sacrificing short-circuiting. As you may or may not know, they evaluate the first operand for its truth value, and depending on that value, either evaluate and return the second argument, or don’t evaluate the second argument and return the first:

something_true and x -> x
something_false and x -> something_false
something_true or x -> something_true
something_false or x -> x

Note that the (result of evaluating the) actual operand is returned, not its truth value.

The only way to customize their behavior is to override __nonzero__ (renamed to __bool__ in Python 3), so you can affect which operand gets returned, but not return something different. Lists (and other collections) are defined to be “truthy” when they contain anything at all, and “falsey” when they are empty.

NumPy arrays reject that notion: For the use cases they aim at, two different notions of truth are common: (1) Whether any element is true, and (2) whether all elements are true. Since these two are completely (and silently) incompatible, and neither is clearly more correct or more common, NumPy refuses to guess and requires you to explicitly use .any() or .all().

& and | (and ~, by the way) can be fully overridden, as they don’t short-circuit. When overridden they can return anything at all, and NumPy makes good use of that to do element-wise operations, as it does with practically any other scalar operation. Lists, on the other hand, don’t broadcast operations across their elements. Just as mylist1 - mylist2 doesn’t mean anything and mylist1 + mylist2 means something completely different, there is no & operator for lists.


Answer 3


Example 1:

This is how the and operator works.

x and y => if x is false, then x, else y

So in other words, since mylist1 is not False, the result of the expression is mylist2. (Only empty lists evaluate to False.)

Example 2:

The & operator is for a bitwise and, as you mention. Bitwise operations only work on numbers. The result of a & b is a number composed of 1s in bits that are 1 in both a and b. For example:

>>> 3 & 1
1

It’s easier to see what’s happening using a binary literal (same numbers as above):

>>> 0b0011 & 0b0001
1                      # i.e. 0b0001

Bitwise operations are similar in concept to boolean (truth) operations, but they work only on bits.

So, given a couple statements about my car

  1. My car is red
  2. My car has wheels

The logical “and” of these two statements is:

(is my car red?) and (does my car have wheels?) => logical true or false value

Both of which are true, for my car at least. So the value of the statement as a whole is logically true.

The bitwise “and” of these two statements is a little more nebulous:

(the numeric value of the statement ‘my car is red’) & (the numeric value of the statement ‘my car has wheels’) => number

If python knows how to convert the statements to numeric values, then it will do so and compute the bitwise-and of the two values. This may lead you to believe that & is interchangeable with and, but as with the above example they are different things. Also, for the objects that can’t be converted, you’ll just get a TypeError.

Examples 3 and 4:

Numpy implements arithmetic operations for arrays:

Arithmetic and comparison operations on ndarrays are defined as element-wise operations, and generally yield ndarray objects as results.

But it does not implement logical operations for arrays, because you can’t overload the logical operators in Python. That’s why example three doesn’t work, but example four does.

So to answer your and vs & question: Use and.

The bitwise operations are used for examining the structure of a number (which bits are set, which bits aren’t set). This kind of information is mostly used in low-level operating system interfaces (unix permission bits, for example). Most python programs won’t need to know that.

The logical operations (and, or, not), however, are used all the time.


Answer 4


  1. In Python the expression X and Y returns X if bool(X) == False, and otherwise returns Y, e.g.:

    True and 20 
    >>> 20
    
    False and 20
    >>> False
    
    20 and []
    >>> []
    
  2. The bitwise operator is simply not defined for lists. But it is defined for integers – operating over the binary representation of the numbers. Consider 16 (10000) and 31 (11111):

    16 & 31
    >>> 16
    
  3. NumPy is not psychic; it does not know whether you mean that, e.g., [False, False] should be equal to True in a logical expression. In this it overrides the standard Python behaviour, which is: “any empty collection with len(collection) == 0 is False”.

  4. Element-wise application is simply the expected behaviour of the & operator on NumPy arrays.


Answer 5


For the first example, and based on the django doc, it will always return the second list; indeed, a non-empty list is seen as a True value by Python, so Python returns the ‘last’ True value, hence the second list:

In [74]: mylist1 = [False]
In [75]: mylist2 = [False, True, False,  True, False]
In [76]: mylist1 and mylist2
Out[76]: [False, True, False, True, False]
In [77]: mylist2 and mylist1
Out[77]: [False]

Answer 6


Operations with a Python list operate on the list as a whole. list1 and list2 will check if list1 is empty, return list1 if it is, and list2 if it isn’t. list1 + list2 will append list2 to list1, so you get a new list with len(list1) + len(list2) elements.

Operators that only make sense when applied element-wise, such as &, raise a TypeError, as element-wise operations aren’t supported without looping through the elements.

Numpy arrays support element-wise operations. array1 & array2 will calculate the bitwise and of each corresponding element in array1 and array2. array1 + array2 will calculate the sum of each corresponding element in array1 and array2.

This does not work for and and or.

array1 and array2 is essentially a short-hand for the following code:

if bool(array1):
    return array2
else:
    return array1

For this you need a good definition of bool(array1). For collection-level operations like those used on Python lists, the definition is that bool(list) == True if list is not empty, and False if it is empty. For numpy’s element-wise operations, there is some ambiguity about whether to check if any element evaluates to True, or whether all elements evaluate to True. Because both are arguably correct, numpy doesn’t guess and raises a ValueError when bool() is (indirectly) called on an array.
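
To resolve the ambiguity explicitly, use the any()/all() escape hatch that numpy’s error message suggests (a short usage sketch of mine):

import numpy as np

a = np.array([True, False, True])
print(a.any())  # True: at least one element is truthy
print(a.all())  # False: not every element is truthy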


Answer 7


Good question. Similar to your observation about examples 1 and 4 (or should I say 1 & 4 :) ) with the logical and and bitwise & operators, I ran into the same thing with the sum operator. The numpy sum and the Python sum behave differently as well. For example:

Suppose “mat” is a numpy 5×5 2D array such as:

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])

Then numpy.sum(mat) gives the total sum of the entire matrix, whereas the built-in Python sum, as in sum(mat), only folds + over the first axis, adding the rows together element-wise. See below:

np.sum(mat)  ## --> gives 325
sum(mat)     ## --> gives array([55, 60, 65, 70, 75])
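
In other words (my own illustration), the built-in sum is equivalent to adding the rows one by one, i.e. summing over axis 0:

import numpy as np

mat = np.arange(1, 26).reshape(5, 5)
assert (sum(mat) == mat[0] + mat[1] + mat[2] + mat[3] + mat[4]).all()
assert (sum(mat) == np.sum(mat, axis=0)).all()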

From ND to 1D arrays

Question: From ND to 1D arrays


Say I have an array a:

a = np.array([[1,2,3], [4,5,6]])

array([[1, 2, 3],
       [4, 5, 6]])

I would like to convert it to a 1D array (i.e. a column vector):

b = np.reshape(a, (1,np.product(a.shape)))

but this returns

array([[1, 2, 3, 4, 5, 6]])

which is not the same as:

array([1, 2, 3, 4, 5, 6])

I can take the first element of this array to manually convert it to a 1D array:

b = np.reshape(a, (1,np.product(a.shape)))[0]

but this requires me to know how many dimensions the original array has (and concatenate [0]’s when working with higher dimensions)

Is there a dimensions-independent way of getting a column/row vector from an arbitrary ndarray?


Answer 0


Use np.ravel (for a 1D view) or np.ndarray.flatten (for a 1D copy) or np.ndarray.flat (for a 1D iterator):

In [12]: a = np.array([[1,2,3], [4,5,6]])

In [13]: b = a.ravel()

In [14]: b
Out[14]: array([1, 2, 3, 4, 5, 6])

Note that ravel() returns a view of a when possible. So modifying b also modifies a. ravel() returns a view when the 1D elements are contiguous in memory, but would return a copy if, for example, a were made from slicing another array using a non-unit step size (e.g. a = x[::2]).

If you want a copy rather than a view, use

In [15]: c = a.flatten()

If you just want an iterator, use np.ndarray.flat:

In [20]: d = a.flat

In [21]: d
Out[21]: <numpy.flatiter object at 0x8ec2068>

In [22]: list(d)
Out[22]: [1, 2, 3, 4, 5, 6]
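
A quick check of the view-vs-copy behavior described above (my own sketch):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = a.ravel()    # view: shares memory with a
b[0] = 99
print(a[0, 0])   # 99

c = a.flatten()  # copy: independent of a
c[0] = -1
print(a[0, 0])   # still 99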

Answer 1

In [14]: b = np.reshape(a, (np.product(a.shape),))

In [15]: b
Out[15]: array([1, 2, 3, 4, 5, 6])

or, simply:

In [16]: a.flatten()
Out[16]: array([1, 2, 3, 4, 5, 6])
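
The reshape can also be written with -1, letting numpy infer the length (my note):

b = a.reshape(-1)  # array([1, 2, 3, 4, 5, 6])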

Answer 2


One of the simplest ways is to use flatten(), as in this example:

 import numpy as np

 batch_y =train_output.iloc[sample, :]
 batch_y = np.array(batch_y).flatten()

My array was like this:

    0
0   6
1   6
2   5
3   4
4   3
.
.
.

After using flatten():

array([6, 6, 5, ..., 5, 3, 6])

It’s also the solution to errors of this type:

Cannot feed value of shape (100, 1) for Tensor 'input/Y:0', which has shape '(?,)' 

Answer 3


For a list of arrays with different sizes, use the following:

import numpy as np

# ND array list with different size
a = [[1],[2,3,4,5],[6,7,8]]

# stack them
b = np.hstack(a)

print(b)

Output:

[1 2 3 4 5 6 7 8]
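
np.concatenate gives the same result for a list of 1-D pieces (my note):

b = np.concatenate(a)  # [1 2 3 4 5 6 7 8]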


Answer 4


I wanted to see benchmark results for the functions mentioned in the answers, including unutbu’s.

I also want to point out that the numpy docs recommend using arr.reshape(-1) when a view is preferable (even though ravel is a tad faster in the following results).


TL;DR: np.ravel is the most performant (by a very small amount).

Benchmark

Functions benchmarked: reshape(-1), ravel, flatten, and flat (see “Used code” below).

numpy version: ‘1.18.0’

Execution times on different ndarray sizes

+-------------+----------+-----------+-----------+-------------+
|  function   |   10x10  |  100x100  | 1000x1000 | 10000x10000 |
+-------------+----------+-----------+-----------+-------------+
| ravel       | 0.002073 |  0.002123 |  0.002153 |    0.002077 |
| reshape(-1) | 0.002612 |  0.002635 |  0.002674 |    0.002701 |
| flatten     | 0.000810 |  0.007467 |  0.587538 |  107.321913 |
| flat        | 0.000337 |  0.000255 |  0.000227 |    0.000216 |
+-------------+----------+-----------+-----------+-------------+

Conclusion

ravel’s and reshape(-1)’s execution times were consistent and independent of ndarray size. However, ravel is a tad faster, while reshape provides flexibility in choosing the output shape (maybe that’s why the numpy docs recommend it instead; or there could be cases where reshape returns a view and ravel doesn’t). Note that nd.flat appears fastest only because it merely constructs an iterator without touching the data.
If you are dealing with large ndarrays, using flatten can cause a performance problem; it isn’t recommended unless you need a copy of the data to do something else.

Used code

import timeit
setup = '''
import numpy as np
nd = np.random.randint(10, size=(10, 10))
'''

timeit.timeit('nd = np.reshape(nd, -1)', setup=setup, number=1000)
timeit.timeit('nd = np.ravel(nd)', setup=setup, number=1000)
timeit.timeit('nd = nd.flatten()', setup=setup, number=1000)
timeit.timeit('nd.flat', setup=setup, number=1000)

Answer 5


Although this isn’t using the np array format (too lazy to modify my code), this should do what you want… If you truly want a column vector you will want to transpose the vector result. It all depends on how you are planning to use this.

def getVector(data_array,col):
    vector = []
    imax = len(data_array)
    for i in range(imax):
        vector.append(data_array[i][col])
    return ( vector )
a = ([1,2,3], [4,5,6])
b = getVector(a,1)
print(b)

Out>[2,5]

So if you need to transpose, you can do something like this:

def transposeArray(data_array):
    # need to test if this is a 1D array 
    # can't do a len(data_array[0]) if it's 1D
    two_d = True
    if isinstance(data_array[0], list):
        dimx = len(data_array[0])
    else:
        dimx = 1
        two_d = False
    dimy = len(data_array)
    # init output transposed array
    data_array_t = [[0 for row in range(dimx)] for col in range(dimy)]
    # fill output transposed array
    for i in range(dimx):
        for j in range(dimy):
            if two_d:
                data_array_t[j][i] = data_array[i][j]
            else:
                data_array_t[j][i] = data_array[j]
    return data_array_t

Fitting empirical distribution to theoretical ones with Scipy (Python)?

Question: Fitting empirical distribution to theoretical ones with Scipy (Python)?


INTRODUCTION: I have a list of more than 30,000 integer values ranging from 0 to 47, inclusive, e.g. [0,0,0,0,...,1,1,1,1,...,2,2,2,2,...,47,47,47,...] sampled from some continuous distribution. The values in the list are not necessarily in order, but order doesn’t matter for this problem.

PROBLEM: Based on my distribution I would like to calculate the p-value (the probability of seeing greater values) for any given value. For example, the p-value for 0 would approach 1 and the p-value for higher numbers would tend to 0.

I don’t know if I am right, but to determine probabilities I think I need to fit my data to a theoretical distribution that is the most suitable to describe my data. I assume that some kind of goodness of fit test is needed to determine the best model.

Is there a way to implement such an analysis in Python (Scipy or Numpy)? Could you present any examples?

Thank you!


Answer 0


Distribution Fitting with Sum of Square Error (SSE)

This is an update and modification to Saullo’s answer, that uses the full list of the current scipy.stats distributions and returns the distribution with the least SSE between the distribution’s histogram and the data’s histogram.

Example Fitting

Using the El Niño dataset from statsmodels, the distributions are fit and error is determined. The distribution with the least error is returned.

All Distributions

[figure: all fitted distributions overlaid on the data histogram]

Best Fit Distribution

[figure: the best-fit distribution overlaid on the data histogram]

Example Code

%matplotlib inline

import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm  # datasets are exposed via the .api namespace
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')

# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Distributions to check
    DISTRIBUTIONS = [        
        st.alpha,st.anglit,st.arcsine,st.beta,st.betaprime,st.bradford,st.burr,st.cauchy,st.chi,st.chi2,st.cosine,
        st.dgamma,st.dweibull,st.erlang,st.expon,st.exponnorm,st.exponweib,st.exponpow,st.f,st.fatiguelife,st.fisk,
        st.foldcauchy,st.foldnorm,st.genlogistic,st.genpareto,st.gennorm,st.genexpon,  # frechet_r/frechet_l removed in SciPy >= 1.6
        st.genextreme,st.gausshyper,st.gamma,st.gengamma,st.genhalflogistic,st.gilbrat,st.gompertz,st.gumbel_r,
        st.gumbel_l,st.halfcauchy,st.halflogistic,st.halfnorm,st.halfgennorm,st.hypsecant,st.invgamma,st.invgauss,
        st.invweibull,st.johnsonsb,st.johnsonsu,st.ksone,st.kstwobign,st.laplace,st.levy,st.levy_l,st.levy_stable,
        st.logistic,st.loggamma,st.loglaplace,st.lognorm,st.lomax,st.maxwell,st.mielke,st.nakagami,st.ncx2,st.ncf,
        st.nct,st.norm,st.pareto,st.pearson3,st.powerlaw,st.powerlognorm,st.powernorm,st.rdist,st.reciprocal,
        st.rayleigh,st.rice,st.recipinvgauss,st.semicircular,st.t,st.triang,st.truncexpon,st.truncnorm,st.tukeylambda,
        st.uniform,st.vonmises,st.vonmises_line,st.wald,st.weibull_min,st.weibull_max,st.wrapcauchy
    ]

    # Best holders
    best_distribution = st.norm
    best_params = (0.0, 1.0)
    best_sse = np.inf

    # Estimate distribution parameters from data
    for distribution in DISTRIBUTIONS:

        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings('ignore')

                # fit dist to data
                params = distribution.fit(data)

                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]

                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))

                # if an axis was passed in, add this fit to the plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                except Exception:
                    pass

                # identify if this distribution is better
                if best_sse > sse > 0:
                    best_distribution = distribution
                    best_params = params
                    best_sse = sse

        except Exception:
            pass

    return (best_distribution.name, best_params)

def make_pdf(dist, params, size=10000):
    """Generate distributions's Probability Distribution Function """

    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    # Get sane start and end points of distribution
    start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
    end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)

    # Build PDF and turn into pandas Series
    x = np.linspace(start, end, size)
    y = dist.pdf(x, loc=loc, scale=scale, *arg)
    pdf = pd.Series(y, x)

    return pdf

# Load data from statsmodels datasets
data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())

# Plot for comparison
plt.figure(figsize=(12,8))
ax = data.plot(kind='hist', bins=50, density=True, alpha=0.5, color=plt.rcParams['axes.prop_cycle'].by_key()['color'][1])
# Save plot limits
dataYLim = ax.get_ylim()

# Find best fit distribution
best_fit_name, best_fit_params = best_fit_distribution(data, 200, ax)
best_dist = getattr(st, best_fit_name)

# Update plots
ax.set_ylim(dataYLim)
ax.set_title(u'El Niño sea temp.\n All Fitted Distributions')
ax.set_xlabel(u'Temp (°C)')
ax.set_ylabel('Frequency')

# Make PDF with best params 
pdf = make_pdf(best_dist, best_fit_params)

# Display
plt.figure(figsize=(12,8))
ax = pdf.plot(lw=2, label='PDF', legend=True)
data.plot(kind='hist', bins=50, density=True, alpha=0.5, label='Data', legend=True, ax=ax)

param_names = (best_dist.shapes + ', loc, scale').split(', ') if best_dist.shapes else ['loc', 'scale']
param_str = ', '.join(['{}={:0.2f}'.format(k,v) for k,v in zip(param_names, best_fit_params)])
dist_str = '{}({})'.format(best_fit_name, param_str)

ax.set_title(u'El Niño sea temp. with best fit distribution \n' + dist_str)
ax.set_xlabel(u'Temp. (°C)')
ax.set_ylabel('Frequency')

Answer 1

There are 82 implemented distribution functions in SciPy 0.12.0. You can test how some of them fit to your data using their fit() method. Check the code below for more details:


import matplotlib.pyplot as plt
import scipy
import scipy.stats
size = 30000
x = scipy.arange(size)
y = scipy.int_(scipy.round_(scipy.stats.vonmises.rvs(5,size=size)*47))
h = plt.hist(y, bins=range(48))

dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']

for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
    plt.plot(pdf_fitted, label=dist_name)
    plt.xlim(0,47)
plt.legend(loc='upper right')
plt.show()

References:

– Fitting distributions, goodness of fit, p-value. Is it possible to do this with Scipy (Python)?

– Distribution fitting with Scipy

And here is a list with the names of all distribution functions available in SciPy 0.12.0:

dist_names = [ 'alpha', 'anglit', 'arcsine', 'beta', 'betaprime', 'bradford', 'burr', 'cauchy', 'chi', 'chi2', 'cosine', 'dgamma', 'dweibull', 'erlang', 'expon', 'exponweib', 'exponpow', 'f', 'fatiguelife', 'fisk', 'foldcauchy', 'foldnorm', 'frechet_r', 'frechet_l', 'genlogistic', 'genpareto', 'genexpon', 'genextreme', 'gausshyper', 'gamma', 'gengamma', 'genhalflogistic', 'gilbrat', 'gompertz', 'gumbel_r', 'gumbel_l', 'halfcauchy', 'halflogistic', 'halfnorm', 'hypsecant', 'invgamma', 'invgauss', 'invweibull', 'johnsonsb', 'johnsonsu', 'ksone', 'kstwobign', 'laplace', 'logistic', 'loggamma', 'loglaplace', 'lognorm', 'lomax', 'maxwell', 'mielke', 'nakagami', 'ncx2', 'ncf', 'nct', 'norm', 'pareto', 'pearson3', 'powerlaw', 'powerlognorm', 'powernorm', 'rdist', 'reciprocal', 'rayleigh', 'rice', 'recipinvgauss', 'semicircular', 't', 'triang', 'truncexpon', 'truncnorm', 'tukeylambda', 'uniform', 'vonmises', 'wald', 'weibull_min', 'weibull_max', 'wrapcauchy'] 

Answer 2

The fit() method mentioned by @Saullo Castro provides maximum likelihood estimates (MLE). The best distribution for your data can then be determined in several different ways, such as:

1. The one that gives you the highest log likelihood.

2. The one that gives you the smallest AIC, BIC or BICc value (see wiki: http://en.wikipedia.org/wiki/Akaike_information_criterion); these can basically be viewed as the log likelihood adjusted for the number of parameters, as distributions with more parameters are expected to fit better.

3. The one that maximizes the Bayesian posterior probability (see wiki: http://en.wikipedia.org/wiki/Posterior_probability).

Of course, if you already have a distribution that should describe your data (based on the theories in your particular field) and want to stick to that, you will skip the step of identifying the best fit distribution.

scipy does not come with a function to calculate the log likelihood (although the MLE method is provided), but hard-coding one is easy: see Is the build-in probability density functions of `scipy.stat.distributions` slower than a user provided one?
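
For instance, a minimal sketch of computing the log likelihood and AIC for a fitted scipy.stats distribution (the normal distribution and the sample data here are illustrative assumptions, not part of the original answer):

import numpy as np
from scipy import stats

data = stats.norm.rvs(loc=5, scale=2, size=1000)  # illustrative sample data

dist = stats.norm
params = dist.fit(data)                    # MLE parameter estimates
log_likelihood = np.sum(dist.logpdf(data, *params))
k = len(params)                            # number of fitted parameters
aic = 2 * k - 2 * log_likelihood           # lower AIC indicates a better fit
print(log_likelihood, aic)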


Answer 3

AFAICU, your distribution is discrete (and nothing but discrete). Therefore just counting the frequencies of different values and normalizing them should be enough for your purposes. So, an example to demonstrate this:

In []: from numpy import asarray, bincount
In []: values= [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
In []: counts= asarray(bincount(values), dtype= float)
In []: cdf= counts.cumsum()/ counts.sum()

Thus, the probability of seeing values higher than 1 is simply (according to the complementary cumulative distribution function (ccdf)):

In []: 1- cdf[1]
Out[]: 0.40000000000000002

Please note that the ccdf is closely related to the survival function (sf), but it is also defined for discrete distributions, whereas the sf is defined only for continuous distributions.
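
For what it is worth, scipy's distribution objects expose the survival function directly; a minimal sketch with a continuous distribution (the parameters are illustrative):

from scipy import stats

# P(X > 1) for a standard normal distribution, i.e. 1 - cdf(1)
print(stats.norm.sf(1, loc=0, scale=1))  # ~0.1587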


Answer 4

It sounds like a probability density estimation problem to me.

from scipy.stats import gaussian_kde
occurences = [0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47]
values = range(0,48)
kde = gaussian_kde(map(float, occurences))
p = kde(values)
p = p/sum(p)
print "P(x>=1) = %f" % sum(p[1:])

Also see http://jpktd.blogspot.com/2009/03/using-gaussian-kernel-density.html.


Answer 5

Try the distfit library.

pip install distfit

import numpy as np

# Create 1000 random integers, value between [0-50]
X = np.random.randint(0, 50, 1000)

# Retrieve P-value for y
y = [0,10,45,55,100]

# From the distfit library import the class distfit
from distfit import distfit

# Initialize.
# Set any properties here, such as alpha.
# The smoothing can be of use when working with integers. Otherwise your histogram
# may be jumping up-and-down, and getting the correct fit may be harder.
dist = distfit(alpha=0.05, smooth=10)

# Search for best theoretical fit on your empirical data
dist.fit_transform(X)

> [distfit] >fit..
> [distfit] >transform..
> [distfit] >[norm      ] [RSS: 0.0037894] [loc=23.535 scale=14.450] 
> [distfit] >[expon     ] [RSS: 0.0055534] [loc=0.000 scale=23.535] 
> [distfit] >[pareto    ] [RSS: 0.0056828] [loc=-384473077.778 scale=384473077.778] 
> [distfit] >[dweibull  ] [RSS: 0.0038202] [loc=24.535 scale=13.936] 
> [distfit] >[t         ] [RSS: 0.0037896] [loc=23.535 scale=14.450] 
> [distfit] >[genextreme] [RSS: 0.0036185] [loc=18.890 scale=14.506] 
> [distfit] >[gamma     ] [RSS: 0.0037600] [loc=-175.505 scale=1.044] 
> [distfit] >[lognorm   ] [RSS: 0.0642364] [loc=-0.000 scale=1.802] 
> [distfit] >[beta      ] [RSS: 0.0021885] [loc=-3.981 scale=52.981] 
> [distfit] >[uniform   ] [RSS: 0.0012349] [loc=0.000 scale=49.000] 

# Best fitted model
best_distr = dist.model
print(best_distr)

# Uniform shows best fit, with 95% CII (confidence intervals), and all other parameters
> {'distr': <scipy.stats._continuous_distns.uniform_gen at 0x16de3a53160>,
>  'params': (0.0, 49.0),
>  'name': 'uniform',
>  'RSS': 0.0012349021241149533,
>  'loc': 0.0,
>  'scale': 49.0,
>  'arg': (),
>  'CII_min_alpha': 2.45,
>  'CII_max_alpha': 46.55}

# Ranking distributions
dist.summary

# Plot the summary of fitted distributions
dist.plot_summary()


# Make prediction on new datapoints based on the fit
dist.predict(y)

# Retrieve your pvalues with 
dist.y_pred
# array(['down', 'none', 'none', 'up', 'up'], dtype='<U4')
dist.y_proba
array([0.02040816, 0.02040816, 0.02040816, 0.        , 0.        ])

# Or in one dataframe
dist.df

# The plot function will now also include the predictions of y
dist.plot()

[Figure: Best fit]

Note that in this case, all points will be significant because of the uniform distribution. You can filter with the dist.y_pred if required.


Answer 6

With OpenTURNS, I would use the BIC criteria to select the best distribution that fits such data. This is because this criteria does not give too much advantage to the distributions which have more parameters. Indeed, if a distribution has more parameters, it is easier for the fitted distribution to be closer to the data. Moreover, the Kolmogorov-Smirnov may not make sense in this case, because a small error in the measured values will have a huge impact on the p-value.

To illustrate the process, I load the El-Nino data, which contains 732 monthly temperature measurements from 1950 to 2010:

import statsmodels.api as sm
dta = sm.datasets.elnino.load_pandas().data
dta['YEAR'] = dta.YEAR.astype(int).astype(str)
dta = dta.set_index('YEAR').T.unstack()
data = dta.values

It is easy to get the 30 built-in univariate distribution factories with the GetContinuousUniVariateFactories static method. Once done, the BestModelBIC static method returns the best model and the corresponding BIC score.

import openturns as ot

sample = ot.Sample([[p] for p in data]) # data reshaping
tested_factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
best_model, best_bic = ot.FittingTest.BestModelBIC(sample,
                                                   tested_factories)
print("Best=",best_model)

which prints:

Best= Beta(alpha = 1.64258, beta = 2.4348, a = 18.936, b = 29.254)

In order to graphically compare the fit to the histogram, I use the drawPDF methods of the best distribution.

import openturns.viewer as otv
graph = ot.HistogramFactory().build(sample).drawPDF()
bestPDF = best_model.drawPDF()
bestPDF.setColors(["blue"])
graph.add(bestPDF)
graph.setTitle("Best BIC fit")
name = best_model.getImplementation().getClassName()
graph.setLegends(["Histogram",name])
graph.setXTitle("Temperature (°C)")
otv.View(graph)

This produces:

[Figure: Beta fit to the El-Nino temperatures]

More details on this topic are presented in the BestModelBIC doc. It would be possible to include SciPy distributions via SciPyDistribution, or even ChaosPy distributions via ChaosPyDistribution, but I guess that the current script fulfills most practical purposes.


Answer 7

Forgive me if I don't understand your need, but what about storing your data in a dictionary where the keys would be the numbers between 0 and 47 and the values the number of occurrences of their related keys in your original list?
Thus your likelihood p(x) will be the sum of all the values for keys greater than x, divided by 30000.
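
A minimal sketch of this idea (the sample data below is a made-up stand-in for the 30000 observations):

from collections import Counter

# made-up stand-in for the 30000 observed integers in [0, 47]
observations = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]

counts = Counter(observations)
total = len(observations)

def p_greater_than(x):
    # empirical P(value > x): sum the counts for keys greater than x
    return sum(c for k, c in counts.items() if k > x) / float(total)

print(p_greater_than(1))  # 0.4 for this sample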


Removing nan values from an array

Question: Removing nan values from an array

I want to figure out how to remove nan values from my array. My array looks something like this:

x = [1400, 1500, 1600, nan, nan, nan ,1700] #Not in this exact configuration

How can I remove the nan values from x?


Answer 0

If you’re using numpy for your arrays, you can also use

x = x[numpy.logical_not(numpy.isnan(x))]

Equivalently

x = x[~numpy.isnan(x)]

[Thanks to chbrown for the added shorthand]

Explanation

The inner function, numpy.isnan returns a boolean/logical array which has the value True everywhere that x is not-a-number. As we want the opposite, we use the logical-not operator, ~ to get an array with Trues everywhere that x is a valid number.

Lastly we use this logical array to index into the original array x, to retrieve just the non-NaN values.
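
For example, a quick sketch using the sample array from the question:

import numpy as np

x = np.array([1400, 1500, 1600, np.nan, np.nan, np.nan, 1700])
mask = ~np.isnan(x)  # True wherever x holds a valid number
print(x[mask])       # [1400. 1500. 1600. 1700.]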


Answer 1

filter(lambda v: v==v, x)

works both for lists and numpy arrays, since v != v holds only for NaN
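
For example, a small sketch (note the list() call needed on Python 3, where filter returns an iterator):

x = [1400, 1500, 1600, float('nan'), float('nan'), float('nan'), 1700]
print(list(filter(lambda v: v == v, x)))  # [1400, 1500, 1600, 1700]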


Answer 2

Try this:

import math
print [value for value in x if not math.isnan(value)]

For more, read on List Comprehensions.


Answer 3

For me the answer by @jmetz didn’t work, however using pandas isnull() did.

x = x[~pd.isnull(x)]

Answer 4

Doing the above:

x = x[~numpy.isnan(x)]

or

x = x[numpy.logical_not(numpy.isnan(x))]

I found that resetting to the same variable (x) did not remove the actual nan values and had to use a different variable. Setting it to a different variable removed the nans. e.g.

y = x[~numpy.isnan(x)]

Answer 5

As shown by others

x[~numpy.isnan(x)]

works. But it will throw an error if the numpy dtype is not a native data type, for example if it is object. In that case you can use pandas.

x[~pandas.isna(x)] or x[~pandas.isnull(x)]
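
A small sketch of that failure mode (the array contents are illustrative):

import numpy as np
import pandas as pd

x = np.array([1400, np.nan, 1700], dtype=object)
# np.isnan(x) would raise a TypeError on this object-dtype array
print(x[~pd.isnull(x)])  # [1400 1700]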

Answer 6

The accepted answer changes shape for 2d arrays. I present a solution here, using the Pandas dropna() functionality. It works for 1D and 2D arrays. In the 2D case, you can choose whether to drop the row or column containing np.nan.

import pandas as pd
import numpy as np

def dropna(arr, *args, **kwarg):
    assert isinstance(arr, np.ndarray)
    dropped=pd.DataFrame(arr).dropna(*args, **kwarg).values
    if arr.ndim==1:
        dropped=dropped.flatten()
    return dropped

x = np.array([1400, 1500, 1600, np.nan, np.nan, np.nan ,1700])
y = np.array([[1400, 1500, 1600], [np.nan, 0, np.nan] ,[1700,1800,np.nan]] )


print('='*20+' 1D Case: ' +'='*20+'\nInput:\n',x,sep='')
print('\ndropna:\n',dropna(x),sep='')

print('\n\n'+'='*20+' 2D Case: ' +'='*20+'\nInput:\n',y,sep='')
print('\ndropna (rows):\n',dropna(y),sep='')
print('\ndropna (columns):\n',dropna(y,axis=1),sep='')

print('\n\n'+'='*20+' x[np.logical_not(np.isnan(x))] for 2D: ' +'='*20+'\nInput:\n',y,sep='')
print('\ndropna:\n',x[np.logical_not(np.isnan(x))],sep='')

Result:

==================== 1D Case: ====================
Input:
[1400. 1500. 1600.   nan   nan   nan 1700.]

dropna:
[1400. 1500. 1600. 1700.]


==================== 2D Case: ====================
Input:
[[1400. 1500. 1600.]
 [  nan    0.   nan]
 [1700. 1800.   nan]]

dropna (rows):
[[1400. 1500. 1600.]]

dropna (columns):
[[1500.]
 [   0.]
 [1800.]]


==================== x[np.logical_not(np.isnan(x))] for 2D: ====================
Input:
[[1400. 1500. 1600.]
 [  nan    0.   nan]
 [1700. 1800.   nan]]

dropna:
[1400. 1500. 1600. 1700.]

Answer 7

If you’re using numpy

# first get a boolean mask marking where the values are finite
ii = np.isfinite(x)

# second get the values
x = x[ii]

Answer 8

The simplest way is:

numpy.nan_to_num(x)

(Note that nan_to_num replaces NaN with 0, and infinities with large finite numbers, rather than removing the entries.)

Docs: https://docs.scipy.org/doc/numpy/reference/generated/numpy.nan_to_num.html


Answer 9

This is my approach to filter ndarray “X” for NaNs and infs,

I create a map of rows without any NaN and any inf as follows:

idx = np.where((np.isnan(X)==False) & (np.isinf(X)==False))

idx is a tuple. Its second column (idx[1]) contains the indices of the array where no NaN nor inf were found across the row.

Then:

filtered_X = X[idx[1]]

filtered_X contains X without NaNs or infs.


Answer 10

@jmetz's answer is probably the one most people need; however, it yields a one-dimensional array, which makes it unusable, e.g., for removing entire rows or columns in matrices.

To do so, one should reduce the logical array to one dimension, then index the target array. For instance, the following will remove rows which have at least one NaN value:

x = x[~numpy.isnan(x).any(axis=1)]

See more detail here.


How to install python modules without root access?

Question: How to install python modules without root access?

I’m taking some university classes and have been given an ‘instructional account’, which is a school account I can ssh into to do work. I want to run my computationally intensive Numpy, matplotlib, scipy code on that machine, but I cannot install these modules because I am not a system administrator.

How can I do the installation?


Answer 0

In most situations the best solution is to rely on the so-called “user site” location (see the PEP for details) by running:

pip install --user package_name

Below is a more "manual" way from my original answer; you do not need to read it if the above solution works for you.


With easy_install you can do:

easy_install --prefix=$HOME/local package_name

which will install into

$HOME/local/lib/pythonX.Y/site-packages

(the ‘local’ folder is a typical name many people use, but of course you may specify any folder you have permissions to write into).

You will need to manually create

$HOME/local/lib/pythonX.Y/site-packages

and add it to your PYTHONPATH environment variable (otherwise easy_install will complain — btw run the command above once to find the correct value for X.Y).

If you are not using easy_install, look for a prefix option, most install scripts let you specify one.

With pip you can use:

pip install --install-option="--prefix=$HOME/local" package_name

Answer 1

No permission to access or install easy_install?

Then, you can create a python virtualenv (https://pypi.python.org/pypi/virtualenv) and install the package from this virtual environment.

Executing 4 commands in the shell will be enough (insert current release like 16.1.0 for X.X.X):

$ curl --location --output virtualenv-X.X.X.tar.gz https://github.com/pypa/virtualenv/tarball/X.X.X
$ tar xvfz virtualenv-X.X.X.tar.gz
$ python pypa-virtualenv-YYYYYY/src/virtualenv.py my_new_env
$ . my_new_env/bin/activate
(my_new_env)$ pip install package_name

Source and more info: https://virtualenv.pypa.io/en/latest/installation/


Answer 2

You can run easy_install to install Python packages in your home directory even without root access. There's a standard way to do this using site.USER_BASE, which defaults to something like $HOME/.local or $HOME/Library/Python/2.7/bin and is included by default on the PYTHONPATH.

To do this, create a .pydistutils.cfg in your home directory:

cat > $HOME/.pydistutils.cfg <<EOF
[install]
user=1
EOF

Now you can run easy_install without root privileges:

easy_install boto

Alternatively, this also lets you run pip without root access:

pip install boto

This works for me.

Source from Wesley Tanaka’s blog : http://wtanaka.com/node/8095


Answer 3

If you have to use a distutils setup.py script, there are some commandline options for forcing an installation destination. See http://docs.python.org/install/index.html#alternate-installation. If this problem repeats, you can setup a distutils configuration file, see http://docs.python.org/install/index.html#inst-config-files.

Setting the PYTHONPATH variable is described in tiho's post.


Answer 4

Important question. The server I use (Ubuntu 12.04) had easy_install3 but not pip3. This is how I installed Pip and then other packages to my home folder:

  1. Asked admin to install the Ubuntu package python3-setuptools

  2. Installed pip, like this (the first easy_install3 run complains until the site-packages directory exists, hence the mkdir and the retry):

 easy_install3 --prefix=$HOME/.local pip
 mkdir -p $HOME/.local/lib/python3.2/site-packages
 easy_install3 --prefix=$HOME/.local pip

  3. Added Pip (and other Python apps) to the path, like this:

PATH="$HOME/.local/bin:$PATH"
echo PATH="$HOME/.local/bin:$PATH" > $HOME/.profile

  4. Installed a Python package, like this:

pip3 install --user httpie

# test httpie package
http httpbin.org

Answer 5

I use JuJu, which basically allows you to have a really tiny Linux distribution (containing just the package manager) inside your $HOME/.juju directory.

It allows you to have your custom system inside the home directory, accessible via proot; therefore, you can install any packages without root privileges. It runs properly on all the major Linux distributions; the only limitation is that JuJu requires a Linux kernel with a minimum recommended version of 2.6.32.

For instance, after installed JuJu to install pip just type the following:

$>juju -f
(juju)$> pacman -S python-pip
(juju)> pip

Answer 6

The best and easiest way is this command:

pip install --user package_name

http://www.lleess.com/2013/05/how-to-install-python-modules-without.html#.WQrgubyGOnc


Answer 7

Install virtualenv locally (source of instructions):

Important: Insert the current release (like 16.1.0) for X.X.X.
Check the name of the extracted file and insert it for YYYYY.

$ curl -L -o virtualenv.tar.gz https://github.com/pypa/virtualenv/tarball/X.X.X
$ tar xfz virtualenv.tar.gz
$ python pypa-virtualenv-YYYYY/src/virtualenv.py env

Before you can use or install any package you need to source your virtual Python environment env:

$ source env/bin/activate

To install new python packages (like numpy), use:

(env)$ pip install <package>

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Question: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I just discovered a logical bug in my code which was causing all sorts of problems. I was inadvertently doing a bitwise AND instead of a logical AND.

I changed the code from:

r = mlab.csv2rec(datafile, delimiter=',', names=COL_HEADERS)
mask = ((r["dt"] >= startdate) & (r["dt"] <= enddate))
selected = r[mask]

TO:

r = mlab.csv2rec(datafile, delimiter=',', names=COL_HEADERS)
mask = ((r["dt"] >= startdate) and (r["dt"] <= enddate))
selected = r[mask]

To my surprise, I got the rather cryptic error message:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Why was a similar error not emitted when I use a bitwise operation – and how do I fix this?


Answer 0

r is a numpy (rec)array. So r["dt"] >= startdate is also a (boolean) array. For numpy arrays the & operation returns the elementwise-and of the two boolean arrays.

The NumPy developers felt there was no one commonly understood way to evaluate an array in boolean context: it could mean True if any element is True, or it could mean True if all elements are True, or True if the array has non-zero length, just to name three possibilities.

Since different users might have different needs and different assumptions, the NumPy developers refused to guess and instead decided to raise a ValueError whenever one tries to evaluate an array in boolean context. Applying and to two numpy arrays causes the two arrays to be evaluated in boolean context (by calling __bool__ in Python3 or __nonzero__ in Python2).

Your original code

mask = ((r["dt"] >= startdate) & (r["dt"] <= enddate))
selected = r[mask]

looks correct. However, if you do want and, then instead of a and b use (a-b).any() or (a-b).all().
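
To see the difference concretely, here is a small sketch (the arrays are illustrative):

import numpy as np

a = np.array([True, False, True])
b = np.array([True, True, False])

print(a & b)           # element-wise: [ True False False]
print((a == b).all())  # explicit reduction to a single bool: False
# print(a and b)       # would raise the ambiguous-truth-value ValueError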


Answer 1

I had the same problem (i.e. indexing with multi-conditions, here it's finding data in a certain date range). The (a-b).any() or (a-b).all() seemed not to work, at least for me.

Alternatively, I found another solution which works perfectly for my desired functionality (the truth value of an array with more than one element is ambiguous when trying to index an array).

Instead of the suggested code above, simply using numpy.logical_and(a, b) works. Here you may want to rewrite the code as

selected  = r[numpy.logical_and(r["dt"] >= startdate, r["dt"] <= enddate)]

Answer 2

The reason for the exception is that and implicitly calls bool. First on the left operand and (if the left operand is True) then on the right operand. So x and y is equivalent to bool(x) and bool(y).

However the bool on a numpy.ndarray (if it contains more than one element) will throw the exception you have seen:

>>> import numpy as np
>>> arr = np.array([1, 2, 3])
>>> bool(arr)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

The bool() call is implicit in and, but also in if, while, or, so any of the following examples will also fail:

>>> arr and arr
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

>>> if arr: pass
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

>>> while arr: pass
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

>>> arr or arr
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

There are more functions and statements in Python that hide bool calls, for example 2 < x < 10 is just another way of writing 2 < x and x < 10. And the and will call bool: bool(2 < x) and bool(x < 10).

The element-wise equivalent for and would be the np.logical_and function, similarly you could use np.logical_or as equivalent for or.

For boolean arrays – and comparisons like <, <=, ==, !=, >= and > on NumPy arrays return boolean NumPy arrays – you can also use the element-wise bitwise functions (and operators): np.bitwise_and (& operator)

>>> np.logical_and(arr > 1, arr < 3)
array([False,  True, False], dtype=bool)

>>> np.bitwise_and(arr > 1, arr < 3)
array([False,  True, False], dtype=bool)

>>> (arr > 1) & (arr < 3)
array([False,  True, False], dtype=bool)

and bitwise_or (| operator):

>>> np.logical_or(arr <= 1, arr >= 3)
array([ True, False,  True], dtype=bool)

>>> np.bitwise_or(arr <= 1, arr >= 3)
array([ True, False,  True], dtype=bool)

>>> (arr <= 1) | (arr >= 3)
array([ True, False,  True], dtype=bool)

A complete list of the logical and binary functions can be found in the NumPy documentation.


Answer 3

If you work with pandas, what solved the issue for me was that I was trying to do calculations while I had NA values. The solution was to run:

df = df.dropna()

and after that to rerun the calculation that failed.


Answer 4

This type of error message also shows up when an if-statement comparison is done between an array and, for example, a bool or an int. See for example:

... code snippet ...

if dataset == bool:
    ....

... code snippet ...

This clause has dataset as an array, and bool is, euhm, the "open door"… True or False.

In case the function is wrapped within a try-statement, except Exception as error: will give you the message without its error type:

The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
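
A minimal sketch of that situation, and one way to make the comparison explicit (the array and the comparison are made up):

import numpy as np

dataset = np.array([1, 0, 1])

# dataset == 1 is an element-wise boolean array, so an if-statement
# needs an explicit reduction such as .any() or .all():
if (dataset == 1).any():
    print("at least one element equals 1")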


Answer 5

Try this: numpy.array(r) or numpy.array(yourvariable), followed by the comparison command you wish to run.
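
A small sketch of what is meant (the variable contents are illustrative):

import numpy as np

r = [1, 2, 3]            # a plain Python list
arr = np.array(r)        # convert to a numpy array first
print(arr > 1)           # element-wise comparison: [False  True  True]
print((arr > 1).any())   # reduce explicitly before using the result in an if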


Dropping infinite values from dataframes in pandas?

Question: Dropping infinite values from dataframes in pandas?

what is the quickest/simplest way to drop nan and inf/-inf values from a pandas DataFrame without resetting mode.use_inf_as_null? I’d like to be able to use the subset and how arguments of dropna, except with inf values considered missing, like:

df.dropna(subset=["col1", "col2"], how="all", with_inf=True)

is this possible? Is there a way to tell dropna to include inf in its definition of missing values?


Answer 0

The simplest way would be to first replace the infs with NaN:

df.replace([np.inf, -np.inf], np.nan)

and then use the dropna:

df.replace([np.inf, -np.inf], np.nan).dropna(subset=["col1", "col2"], how="all")

For example:

In [11]: df = pd.DataFrame([1, 2, np.inf, -np.inf])

In [12]: df.replace([np.inf, -np.inf], np.nan)
Out[12]:
    0
0   1
1   2
2 NaN
3 NaN

The same method would work for a Series.


Answer 1

With option context, this is possible without permanently setting use_inf_as_na. For example:

with pd.option_context('mode.use_inf_as_na', True):
    df = df.dropna(subset=['col1', 'col2'], how='all')

Of course it can be set to treat inf as NaN permanently with

pd.set_option('use_inf_as_na', True)

For older versions, replace use_inf_as_na with use_inf_as_null.


Answer 2

Here is another method using .loc to replace inf with nan on a Series:

s.loc[(~np.isfinite(s)) & s.notnull()] = np.nan

So, in response to the original question:

df = pd.DataFrame(np.ones((3, 3)), columns=list('ABC'))

for i in range(3): 
    df.iat[i, i] = np.inf

df
          A         B         C
0       inf  1.000000  1.000000
1  1.000000       inf  1.000000
2  1.000000  1.000000       inf

df.sum()
A    inf
B    inf
C    inf
dtype: float64

df.apply(lambda s: s[np.isfinite(s)].dropna()).sum()
A    2
B    2
C    2
dtype: float64

Answer 3

Use (fast and simple):

df = df[np.isfinite(df).all(1)]

This answer is based on DougR's answer in another question. Here is an example code:

import pandas as pd
import numpy as np
df=pd.DataFrame([1,2,3,np.nan,4,np.inf,5,-np.inf,6])
print('Input:\n',df,sep='')
df = df[np.isfinite(df).all(1)]
print('\nDropped:\n',df,sep='')

Result:

Input:
    0
0  1.0000
1  2.0000
2  3.0000
3     NaN
4  4.0000
5     inf
6  5.0000
7    -inf
8  6.0000

Dropped:
     0
0  1.0
1  2.0
2  3.0
4  4.0
6  5.0
8  6.0

Answer 4

Yet another solution would be to use the isin method. Use it to determine whether each value is infinite or missing and then chain the all method to determine if all the values in the rows are infinite or missing.

Finally, use the negation of that result to select the rows that don’t have all infinite or missing values via boolean indexing.

all_inf_or_nan = df.isin([np.inf, -np.inf, np.nan]).all(axis='columns')
df[~all_inf_or_nan]

Answer 5

The above solution will modify the infs that are not in the target columns. To remedy that,

lst = [np.inf, -np.inf]
to_replace = {v: lst for v in ['col1', 'col2']}
df.replace(to_replace, np.nan)

Answer 6

You can use pd.DataFrame.mask with np.isinf. You should first ensure your dataframe series are all of type float, then use dropna with your existing logic.

print(df)

       col1      col2
0 -0.441406       inf
1 -0.321105      -inf
2 -0.412857  2.223047
3 -0.356610  2.513048

df = df.mask(np.isinf(df))

print(df)

       col1      col2
0 -0.441406       NaN
1 -0.321105       NaN
2 -0.412857  2.223047
3 -0.356610  2.513048

How do I calculate percentiles with python/numpy?

Question: How do I calculate percentiles with python/numpy?

Is there a convenient way to calculate percentiles for a sequence or single-dimensional numpy array?

I am looking for something similar to Excel’s percentile function.

I looked in NumPy’s statistics reference, and couldn’t find this. All I could find is the median (50th percentile), but not something more specific.


Answer 0

You might be interested in the SciPy Stats package. It has the percentile function you’re after and many other statistical goodies.

percentile() is available in numpy too.

import numpy as np
a = np.array([1,2,3,4,5])
p = np.percentile(a, 50) # return 50th percentile, e.g median.
print p
3.0

This ticket leads me to believe they won’t be integrating percentile() into numpy anytime soon.


Answer 1

By the way, there is a pure-Python implementation of percentile function, in case one doesn’t want to depend on scipy. The function is copied below:

## {{{ http://code.activestate.com/recipes/511478/ (r1)
import math
import functools

def percentile(N, percent, key=lambda x:x):
    """
    Find the percentile of a list of values.

    @parameter N - is a list of values. Note N MUST BE already sorted.
    @parameter percent - a float value from 0.0 to 1.0.
    @parameter key - optional key function to compute value from each element of N.

    @return - the percentile of the values
    """
    if not N:
        return None
    k = (len(N)-1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return key(N[int(k)])
    d0 = key(N[int(f)]) * (c-k)
    d1 = key(N[int(c)]) * (k-f)
    return d0+d1

# median is 50th percentile.
median = functools.partial(percentile, percent=0.5)
## end of http://code.activestate.com/recipes/511478/ }}}

Answer 2

import numpy as np
a = [154, 400, 1124, 82, 94, 108]
print np.percentile(a,95) # gives the 95th percentile

Answer 3

Here’s how to do it without numpy, using only python to calculate the percentile.

import math

def percentile(data, percentile):
    size = len(data)
    return sorted(data)[int(math.ceil((size * percentile) / 100)) - 1]

p5 = percentile(mylist, 5)
p25 = percentile(mylist, 25)
p50 = percentile(mylist, 50)
p75 = percentile(mylist, 75)
p95 = percentile(mylist, 95)

Answer 4

The definition of percentile I usually see expects as a result the value from the supplied list below which P percent of values are found… which means the result must be from the set, not an interpolation between set elements. To get that, you can use a simpler function.

def percentile(N, P):
    """
    Find the percentile of a list of values

    @parameter N - A list of values.  N must be sorted.
    @parameter P - A float value from 0.0 to 1.0

    @return - The percentile of the values.
    """
    n = int(round(P * len(N) + 0.5))
    return N[n-1]

# A = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# B = (15, 20, 35, 40, 50)
#
# print percentile(A, P=0.3)
# 4
# print percentile(A, P=0.8)
# 9
# print percentile(B, P=0.3)
# 20
# print percentile(B, P=0.8)
# 50

If you would rather get the value from the supplied list at or below which P percent of values are found, then use this simple modification:

def percentile(N, P):
    n = int(round(P * len(N) + 0.5))
    if n > 1:
        return N[n-2]
    else:
        return N[0]

Or with the simplification suggested by @ijustlovemath:

def percentile(N, P):
    n = max(int(round(P * len(N) + 0.5)), 2)
    return N[n-2]

Answer 5

Starting with Python 3.8, the standard library comes with the quantiles function as part of the statistics module:

from statistics import quantiles

quantiles([1, 2, 3, 4, 5], n=100)
# [0.06, 0.12, 0.18, 0.24, 0.3, 0.36, 0.42, 0.48, 0.54, 0.6, 0.66, 0.72, 0.78, 0.84, 0.9, 0.96, 1.02, 1.08, 1.14, 1.2, 1.26, 1.32, 1.38, 1.44, 1.5, 1.56, 1.62, 1.68, 1.74, 1.8, 1.86, 1.92, 1.98, 2.04, 2.1, 2.16, 2.22, 2.28, 2.34, 2.4, 2.46, 2.52, 2.58, 2.64, 2.7, 2.76, 2.82, 2.88, 2.94, 3.0, 3.06, 3.12, 3.18, 3.24, 3.3, 3.36, 3.42, 3.48, 3.54, 3.6, 3.66, 3.72, 3.78, 3.84, 3.9, 3.96, 4.02, 4.08, 4.14, 4.2, 4.26, 4.32, 4.38, 4.44, 4.5, 4.56, 4.62, 4.68, 4.74, 4.8, 4.86, 4.92, 4.98, 5.04, 5.1, 5.16, 5.22, 5.28, 5.34, 5.4, 5.46, 5.52, 5.58, 5.64, 5.7, 5.76, 5.82, 5.88, 5.94]
quantiles([1, 2, 3, 4, 5], n=100)[49] # 50th percentile (e.g median)
# 3.0

quantiles returns, for a given distribution dist, a list of n - 1 cut points separating the n quantile intervals (a division of dist into n continuous intervals with equal probability):

statistics.quantiles(dist, *, n=4, method='exclusive')

where n, in our case (percentiles), is 100.


Answer 6

Check the scipy.stats module:

 scipy.stats.scoreatpercentile
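
For example, a minimal sketch (the data values are illustrative):

from scipy import stats

a = [1, 2, 3, 4, 5]
print(stats.scoreatpercentile(a, 50))  # 3.0, the median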

Answer 7

To calculate the percentile of a series, run:

from scipy.stats import rankdata
import numpy as np

def calc_percentile(a, method='min'):
    if isinstance(a, list):
        a = np.asarray(a)
    return rankdata(a, method=method) / float(len(a))

For example:

a = range(20)
print {val: round(percentile, 3) for val, percentile in zip(a, calc_percentile(a))}
>>> {0: 0.05, 1: 0.1, 2: 0.15, 3: 0.2, 4: 0.25, 5: 0.3, 6: 0.35, 7: 0.4, 8: 0.45, 9: 0.5, 10: 0.55, 11: 0.6, 12: 0.65, 13: 0.7, 14: 0.75, 15: 0.8, 16: 0.85, 17: 0.9, 18: 0.95, 19: 1.0}

Answer 8

In case you need the answer to be a member of the input numpy array:

Just to add that the percentile function in numpy by default calculates the output as a linear weighted average of the two neighboring entries in the input vector. In some cases people may want the returned percentile to be an actual element of the vector, in this case, from v1.9.0 onwards you can use the “interpolation” option, with either “lower”, “higher” or “nearest”.

import numpy as np
x=np.random.uniform(10,size=(1000))-5.0

np.percentile(x,70) # 70th percentile

2.075966046220879

np.percentile(x,70,interpolation="nearest")

2.0729677997904314

The latter is an actual entry in the vector, while the former is a linear interpolation of two vector entries that border the percentile.


Answer 9

For a Series: use the describe function.

Suppose your df has the columns sales and id. To compute percentiles for sales, it works like this:

df['sales'].describe(percentiles = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1])

0.0 : minimum
1   : maximum
0.1 : 10th percentile, and so on
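
A runnable sketch with made-up data (the original df is not shown, so the columns here are hypothetical):

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': np.arange(1, 101), 'sales': np.arange(1, 101)})
print(df['sales'].describe(percentiles=[0.1, 0.5, 0.9]))
# prints count, mean, std, min, 10%, 50%, 90%, and max of the sales column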

Answer 10

A convenient way to calculate percentiles for a one-dimensional numpy sequence or matrix is numpy.percentile (https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html). Example:

import numpy as np

a = np.array([0,1,2,3,4,5,6,7,8,9,10])
p50 = np.percentile(a, 50) # 50th percentile, i.e., the median
p90 = np.percentile(a, 90) # 90th percentile
print('median = ',p50,' and p90 = ',p90) # median =  5.0  and p90 =  9.0

However, if there is any NaN value in your data, the function above is not useful. The recommended function in that case is numpy.nanpercentile (https://docs.scipy.org/doc/numpy/reference/generated/numpy.nanpercentile.html), which ignores NaNs:

import numpy as np

a_NaN = np.array([0.,1.,2.,3.,4.,5.,6.,7.,8.,9.,10.])
a_NaN[0] = np.nan
print('a_NaN',a_NaN)
p50 = np.nanpercentile(a_NaN, 50) # 50th percentile, i.e., the median (NaNs ignored)
p90 = np.nanpercentile(a_NaN, 90) # 90th percentile
print('median = ',p50,' and p90 = ',p90) # median =  5.5  and p90 =  9.1

With both functions you can still choose the interpolation mode; the example below walks through the available modes.

import numpy as np

b = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# 'linear' is the default interpolation mode.
for mode in ['linear', 'lower', 'higher', 'midpoint', 'nearest']:
    p10 = np.percentile(b, 10, interpolation=mode)  # 10th percentile
    p50 = np.percentile(b, 50, interpolation=mode)  # 50th percentile, i.e., the median
    p90 = np.percentile(b, 90, interpolation=mode)  # 90th percentile
    print('interpolation =', mode, ': p10 =', p10, ', median =', p50, ', p90 =', p90)

# interpolation = linear : p10 = 1.9 , median = 5.5 , p90 = 9.1
# interpolation = lower : p10 = 1 , median = 5 , p90 = 9
# interpolation = higher : p10 = 2 , median = 6 , p90 = 10
# interpolation = midpoint : p10 = 1.5 , median = 5.5 , p90 = 9.5
# interpolation = nearest : p10 = 2 , median = 5 , p90 = 9

If your input array consists only of integer values, you might want the percentile answer to be an integer as well. If so, choose an interpolation mode such as 'lower', 'higher', or 'nearest'.
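
Relatedly, numpy.quantile (added in NumPy 1.15) is the same function except that it takes fractions in [0, 1] instead of percentages in [0, 100]:

import numpy as np

b = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(np.quantile(b, 0.9))  # 9.1, identical to np.percentile(b, 90)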


How to remove specific elements in a numpy array

Question: How to remove specific elements in a numpy array

How can I remove some specific elements from a numpy array? Say I have

import numpy as np

a = np.array([1,2,3,4,5,6,7,8,9])

I then want to remove 3, 4, 7 from a. All I know are the indices of the values (index = [2, 3, 6]).


Answer 0

Use numpy.delete() – returns a new array with sub-arrays along an axis deleted

numpy.delete(a, index)

For your specific question:

import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
index = [2, 3, 6]

new_a = np.delete(a, index)

print(new_a) # Prints: [1 2 5 6 8 9]

Note that numpy.delete() returns a new array; NumPy arrays are fixed in size, so elements cannot be removed in place. I.e., to quote the delete() docs:

“A copy of arr with the elements specified by obj removed. Note that delete does not occur in-place…”

If the code I post has output, it is the result of running the code.
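
np.delete also handles multidimensional arrays through its axis argument; a short sketch (the array here is illustrative):

import numpy as np

m = np.arange(12).reshape(3, 4)
print(np.delete(m, 1, axis=0))       # drops the second row  -> shape (2, 4)
print(np.delete(m, [0, 2], axis=1))  # drops columns 0 and 2 -> shape (3, 2)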


Answer 1

There is a numpy built-in function to help with that.

>>> import numpy as np
>>> a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> b = np.array([3,4,7])
>>> c = np.setdiff1d(a,b)
>>> c
array([1, 2, 5, 6, 8, 9])
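
Keep in mind that setdiff1d removes by value rather than by index, and that it returns a sorted array with duplicates dropped:

>>> np.setdiff1d(np.array([9, 1, 1, 5, 3]), np.array([5]))
array([1, 3, 9])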

Answer 2

A NumPy array has a fixed size, so technically you cannot delete an item from it in place. However, you can construct a new array without the values you don’t want, like this:

b = np.delete(a, [2,3,6])

Answer 3

To delete by value:

modified_array = np.delete(original_array, np.where(original_array == value_to_delete))
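
A runnable sketch of that one-liner; the array and value below are made up for illustration (note that np.where finds every occurrence, so all duplicates of the value are removed):

import numpy as np

original_array = np.array([1, 2, 3, 4, 3])
value_to_delete = 3
modified_array = np.delete(original_array, np.where(original_array == value_to_delete))
print(modified_array)  # [1 2 4]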

Answer 4

Not being a numpy person, I took a shot with:

>>> import numpy as np
>>> import itertools
>>> 
>>> a = np.array([1,2,3,4,5,6,7,8,9])
>>> index=[2,3,6]
>>> a = np.array(list(itertools.compress(a, [i not in index for i in range(len(a))])))
>>> a
array([1, 2, 5, 6, 8, 9])

According to my tests, this outperforms numpy.delete(). I don’t know why that would be the case, maybe due to the small size of the initial array?

python -m timeit -s "import numpy as np" -s "import itertools" -s "a = np.array([1,2,3,4,5,6,7,8,9])" -s "index=[2,3,6]" "a = np.array(list(itertools.compress(a, [i not in index for i in range(len(a))])))"
100000 loops, best of 3: 12.9 usec per loop

python -m timeit -s "import numpy as np" -s "a = np.array([1,2,3,4,5,6,7,8,9])" -s "index=[2,3,6]" "np.delete(a, index)"
10000 loops, best of 3: 108 usec per loop

That’s a pretty significant difference (in the opposite direction to what I was expecting), anyone have any idea why this would be the case?

Even more weirdly, passing numpy.delete() a list performs worse than looping through the list and deleting single indices (note that this loop is only a timing comparison; each call deletes from the original array, so it does not produce the same result):

python -m timeit -s "import numpy as np" -s "a = np.array([1,2,3,4,5,6,7,8,9])" -s "index=[2,3,6]" "for i in index:" "    np.delete(a, i)"
10000 loops, best of 3: 33.8 usec per loop

Edit: It does appear to be to do with the size of the array. With large arrays, numpy.delete() is significantly faster.

python -m timeit -s "import numpy as np" -s "import itertools" -s "a = np.array(list(range(10000)))" -s "index=[i for i in range(10000) if i % 2 == 0]" "a = np.array(list(itertools.compress(a, [i not in index for i in range(len(a))])))"
10 loops, best of 3: 200 msec per loop

python -m timeit -s "import numpy as np" -s "a = np.array(list(range(10000)))" -s "index=[i for i in range(10000) if i % 2 == 0]" "np.delete(a, index)"
1000 loops, best of 3: 1.68 msec per loop

Obviously, this is all pretty irrelevant, as you should always go for clarity and avoid reinventing the wheel, but I found it a little interesting, so I thought I’d leave it here.


Answer 5

If you don’t know the indices, you can use logical_and to build a boolean mask and keep only the values in a given range (everything else is dropped):

import numpy as np

x = 10 * np.random.randn(1, 100)
low = 5
high = 27
x[0, np.logical_and(x[0, :] > low, x[0, :] < high)]  # keeps only values strictly between low and high
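
Conversely, inverting the mask with ~ drops the in-range values instead of keeping them (reusing x, low, and high from above):

removed = x[0, ~np.logical_and(x[0, :] > low, x[0, :] < high)]  # keeps only values outside (low, high)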

Answer 6

Using np.delete is the fastest way to do it, if we know the indices of the elements that we want to remove. However, for completeness, let me add another way of “removing” array elements using a boolean mask created with the help of np.isin. This method allows us to remove the elements by specifying them directly or by their indices:

import numpy as np
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

Remove by indices:

indices_to_remove = [2, 3, 6]
a = a[~np.isin(np.arange(a.size), indices_to_remove)]

Remove by elements (don’t forget to recreate the original a, since it was overwritten by the previous line):

elements_to_remove = a[indices_to_remove]  # [3, 4, 7]
a = a[~np.isin(a, elements_to_remove)]
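
As a small convenience, np.isin also takes invert=True, which is equivalent to the ~ in the last line above:

a = a[np.isin(a, elements_to_remove, invert=True)]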

Answer 7

Remove specific indices (here, indices 4 and 9 are deleted, removing the values 16 and 21 from the matrix):

import numpy as np
mat = np.arange(12,26)
a = [4,9]
del_map = np.delete(mat, a)
del_map.reshape(3,4)

Output:

array([[12, 13, 14, 15],
       [17, 18, 19, 20],
       [22, 23, 24, 25]])

Answer 8

You can also use sets:

import numpy

a = numpy.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
the_index_list = [2, 3, 6]

the_big_set = set(numpy.arange(len(a)))
the_small_set = set(the_index_list)
# sort the difference: iterating over a set has no guaranteed order
the_delta_row_list = sorted(the_big_set - the_small_set)

a = a[the_delta_row_list]