from xml.etree import ElementTree as ET
tree = ET.parse(r"test.xml")
el1 = tree.findall("DEAL_LEVEL/PAID_OFF") # Returns [] -- no namespace prefix, so nothing matches
el2 = tree.findall("{http://www.test.com}DEAL_LEVEL/{http://www.test.com}PAID_OFF") # Returns [<Element '{http://www.test.com}PAID_OFF' at 0xb78b90>]
Although it can works, because there is a namespace “{http://www.test.com}”, it’s very inconvenient to add a namespace in front of each tag.
How can I ignore the namespace when using the method of “find”, “findall” and so on?
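For reference, two common ways to make these queries shorter (a sketch; the sample document below is made up to mirror the question's structure): pass a prefix-to-URI mapping as the second argument of findall(), or, on Python 3.8+, use the {*} wildcard to match any namespace.

```python
import io
import xml.etree.ElementTree as ET

# hypothetical document mirroring the question's structure
xml = ('<ROOT xmlns="http://www.test.com">'
       '<DEAL_LEVEL><PAID_OFF>Y</PAID_OFF></DEAL_LEVEL></ROOT>')
tree = ET.parse(io.StringIO(xml))

# option 1: a namespaces dict maps a prefix of your choosing to the URI
ns = {'t': 'http://www.test.com'}
el1 = tree.findall('t:DEAL_LEVEL/t:PAID_OFF', ns)

# option 2 (Python 3.8+): the {*} wildcard matches any (or no) namespace
el2 = tree.findall('{*}DEAL_LEVEL/{*}PAID_OFF')
```

This still names the namespace once (in the dict), but keeps the paths readable.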
Instead of modifying the XML document itself, it’s best to parse it and then modify the tags in the result. This way you can handle multiple namespaces and namespace aliases:
from io import StringIO  # for Python 2 import from StringIO instead
import xml.etree.ElementTree as ET

# instead of ET.fromstring(xml)
it = ET.iterparse(StringIO(xml))
for _, el in it:
    prefix, has_namespace, postfix = el.tag.partition('}')
    if has_namespace:
        el.tag = postfix  # strip all namespaces
root = it.root
Here’s an extension to nonagon’s answer, which also strips namespaces off attributes:
from StringIO import StringIO
import xml.etree.ElementTree as ET

# instead of ET.fromstring(xml)
it = ET.iterparse(StringIO(xml))
for _, el in it:
    if '}' in el.tag:
        el.tag = el.tag.split('}', 1)[1]  # strip all namespaces
    for at in list(el.attrib.keys()):  # strip namespaces of attributes too
        if '}' in at:
            newat = at.split('}', 1)[1]
            el.attrib[newat] = el.attrib[at]
            del el.attrib[at]
root = it.root
UPDATE: added list() so the iterator works (needed for Python 3)
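A quick end-to-end check of the stripping approach (a sketch using Python 3's io.StringIO and a made-up two-element document):

```python
import io
import xml.etree.ElementTree as ET

# hypothetical document with a namespaced tag and attribute
xml = ('<a:root xmlns:a="http://example.com/a">'
       '<a:child a:flag="1">hi</a:child></a:root>')

it = ET.iterparse(io.StringIO(xml))
for _, el in it:
    if '}' in el.tag:
        el.tag = el.tag.split('}', 1)[1]          # strip tag namespace
    for at in list(el.attrib.keys()):             # strip attribute namespaces
        if '}' in at:
            el.attrib[at.split('}', 1)[1]] = el.attrib.pop(at)
root = it.root
```

After the loop, both the tags and the attribute keys are plain names.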
import xml.etree.ElementTree as ET

with DisableXmlNamespaces():
    tree = ET.parse("test.xml")
The beauty of this approach is that it does not change any behaviour for unrelated code outside the with block. I ended up creating this after getting errors in unrelated libraries with the version by ericspod, which also happened to use expat.
ElementTree uses Expat by calling ParserCreate(), but provides no option to omit the namespace separator string. The code above causes the separator to be ignored, but be warned that this could break other things.
Answer 7
I might be late to this, but I don't think re.sub is a good solution.
However, the xml.parsers.expat rewrite does not work for Python 3.x versions.
The main culprit is at the bottom of the xml/etree/ElementTree.py source code:
# Import the C accelerators
try:
    # Element is going to be shadowed by the C implementation. We need to keep
    # the Python version of it accessible for some "creative" uses by external code
    # (see tests)
    _Element_Py = Element

    # Element, SubElement, ParseError, TreeBuilder, XMLParser
    from _elementtree import *
except ImportError:
    pass
Which is kinda sad.
The solution is to get rid of it first.
import _elementtree
try:
    del _elementtree.XMLParser
except AttributeError:
    # in case deleted twice
    pass
else:
    from xml.parsers import expat  # NOQA: F811
    oldcreate = expat.ParserCreate
    expat.ParserCreate = lambda encoding, sep: oldcreate(encoding, None)
Tested on Python 3.6.
The try statement is useful in case somewhere in your code you reload or import a module twice; otherwise you get strange errors like
maximum recursion depth exceeded
AttributeError: XMLParser
By the way, damn, the etree source code looks really messy.
Create an iterator to get both the namespaces and a parsed tree object.
Iterate over the created iterator to get the namespaces dict, which we can later pass into each find() or findall() call, as suggested by iMom0.
Return the parsed tree's root element object and the namespaces.
I think this is the best approach all around, as there's no manipulation involved of either the source XML or the resulting parsed xml.etree.ElementTree output.
I'd also like to credit barny's answer for providing an essential piece of this puzzle (that you can get the parsed root from the iterator). Until then I actually traversed the XML tree twice in my application (once to get the namespaces, a second time for the root).
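A minimal sketch of the approach described above, assuming a made-up XML string: collect the namespaces from 'start-ns' events during a single iterparse() pass, keep the root from the iterator, and reuse the mapping in find()/findall():

```python
import io
import xml.etree.ElementTree as ET

# hypothetical document with one namespace declaration
xml = ('<root xmlns:a="http://example.com/a">'
       '<a:item>42</a:item></root>')

namespaces = {}
it = ET.iterparse(io.StringIO(xml), events=('start-ns', 'end'))
for event, data in it:
    if event == 'start-ns':
        prefix, uri = data          # e.g. ('a', 'http://example.com/a')
        namespaces[prefix] = uri
root = it.root                      # the parsed root -- no second traversal

item = root.find('a:item', namespaces)
```

One pass yields both the namespaces dict and the tree.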
How this code works:
>>> def is_alpha(word):
...     try:
...         return word.encode('ascii').isalpha()
...     except:
...         return False
...
>>> is_alpha('中国')
False
>>> is_alpha(u'中国')
False
>>>
>>> a = 'ａ'  # fullwidth letter
>>> b = 'a'
>>> ord(a), ord(b)
(65345, 97)
>>> a.isalpha(), b.isalpha()
(True, True)
>>> is_alpha(a), is_alpha(b)
(False, True)
>>>
Return true if all characters in the string are alphabetic and there is at least one character, false otherwise. Alphabetic characters are those characters defined in the Unicode character database as “Letter”, i.e., those with general category property being one of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”. Note that this is different from the “Alphabetic” property defined in the Unicode Standard.
In python2.x:
>>> s = u'a1中文'
>>> for char in s: print char, char.isalpha()
...
a True
1 False
中 True
文 True
>>> s = 'a1中文'
>>> for char in s: print char, char.isalpha()
...
a True
1 False
� False
� False
� False
� False
� False
� False
>>>
In python3.x:
>>> s = 'a1中文'
>>> for char in s: print(char, char.isalpha())
...
a True
1 False
中 True
文 True
>>>
I found a good way to do this using a function and basic code.
This code accepts a string and counts the number of uppercase letters, lowercase letters, and 'other'. 'Other' is classed as a space, a punctuation mark, or even Japanese and Chinese characters.
def check(count):
    lowercase = 0
    uppercase = 0
    other = 0
    low = 'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'
    upper = 'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'
    for n in count:
        if n in low:
            lowercase += 1
        elif n in upper:
            uppercase += 1
        else:
            other += 1
    print("There are " + str(lowercase) + " lowercase letters.")
    print("There are " + str(uppercase) + " uppercase letters.")
    print("There are " + str(other) + " other elements to this sentence.")
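A more compact variant of the same idea (a sketch, not the answer's code), using str.islower()/str.isupper() instead of hand-written letter tuples; note that these methods also count non-ASCII cased letters:

```python
def check_compact(text):
    # count cased characters; everything else (spaces, punctuation,
    # uncased scripts like Chinese) falls into 'other'
    lowercase = sum(1 for c in text if c.islower())
    uppercase = sum(1 for c in text if c.isupper())
    other = len(text) - lowercase - uppercase
    return lowercase, uppercase, other
```

For example, check_compact("Hello World!") returns (8, 2, 2): eight lowercase letters, two uppercase letters, and two other characters (the space and the exclamation mark).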
Answer 3
userinput = input("Please Enter Letters with Numbers: ")  # e.g. "abcdefg hi j 12345"
digits_count = 0
letters_count = 0
others_count = 0
for i in userinput:
    if i.isdigit():
        digits_count += 1
    elif i.isalpha():
        letters_count += 1
    else:
        others_count += 1
print("Result:")
print("Letters=", letters_count)
print("Digits=", digits_count)
Output:
Please Enter Letters with Numbers:
abcdefg hi j 12345
Result:
Letters= 10
Digits= 5
word = str(input("Enter string:"))
notChar = 0
isChar = 0
for char in word:
    if not char.isalpha():
        notChar += 1
    else:
        isChar += 1
print(isChar, " were letters; ", notChar, " were not letters.")
ok so that’s simplified but you get the idea. Now thing might not actually be in the list, in which case I want to pass -1 as thing_index. In other languages this is what you’d expect index() to return if it couldn’t find the element. In fact it throws a ValueError.
But this feels dirty, plus I don’t know if ValueError could be raised for some other reason. I came up with the following solution based on generator functions, but it seems a little complex:
thing_index = ([i for i in xrange(len(thing_list)) if thing_list[i] == thing] or [-1])[0]
Is there a cleaner way to achieve the same thing? Let’s assume the list isn’t sorted.
There is nothing “dirty” about using try-except clause. This is the pythonic way. ValueError will be raised by the .index method only, because it’s the only code you have there!
To answer the comment:
In Python, the "easier to ask forgiveness than permission" philosophy is well established, and .index will not raise this type of error for any other issues. Not that I can think of any.
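Spelled out, that forgiveness-not-permission form looks like this (thing_list and thing are placeholder names):

```python
thing_list = ['a', 'b', 'c']

thing = 'b'
try:
    thing_index = thing_list.index(thing)   # single pass over the list
except ValueError:
    thing_index = -1
# thing_index is now 1

thing = 'z'
try:
    missing_index = thing_list.index(thing)
except ValueError:
    missing_index = -1                      # sentinel for "not found"
# missing_index is now -1
```

Unlike the `x in lst ... lst.index(x)` idiom, this scans the list only once.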
Answer 1
thing_index = thing_list.index(elem) if elem in thing_list else -1
The dict type has a get function, where if the key doesn’t exist in the dictionary, the 2nd argument to get is the value that it should return. Similarly there is setdefault, which returns the value in the dict if the key exists, otherwise it sets the value according to your default parameter and then returns your default parameter.
You could extend the list type to have a getindexdefault method.
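A sketch of such an extension (the class and method names are made up; it simply wraps .index() in try/except, mirroring dict.get):

```python
class ListWithDefault(list):
    """Hypothetical list subclass with a dict.get-style index lookup."""

    def getindexdefault(self, value, default=-1):
        try:
            return self.index(value)
        except ValueError:
            return default


lst = ListWithDefault(['a', 'b', 'c'])
```

lst.getindexdefault('b') gives 1, and lst.getindexdefault('z') falls back to -1.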
This issue is one of language philosophy. In Java for example there has always been a tradition that exceptions should really only be used in “exceptional circumstances” that is when errors have happened, rather than for flow control. In the beginning this was for performance reasons as Java exceptions were slow but now this has become the accepted style.
In contrast Python has always used exceptions to indicate normal program flow, like raising a ValueError as we are discussing here. There is nothing “dirty” about this in Python style and there are many more where that came from. An even more common example is StopIteration exception which is raised by an iterator‘s next() method to signal that there are no further values.
li = [1, 2, 3, 4, 5]                 # create list
li = dict(zip(li, range(len(li))))   # convert list to dict
print(li)   # {1: 0, 2: 1, 3: 2, 4: 3, 5: 4}
li.get(20)  # None
li.get(1)   # 0
Rather than expose something as implementation-dependent as a list index in a function interface, pass the collection and the thing and let otherfunction deal with the "test for membership" issues. If otherfunction is written to be collection-type-agnostic, then it would probably start with:
if thing in thing_collection:
... proceed with operation on thing
which will work if thing_collection is a list, tuple, set, or dict.
This is possibly clearer than:
if thing_index != MAGIC_VALUE_INDICATING_NOT_A_MEMBER:
which is the code you already have in otherfunction.
Answer 8
Like this:
temp_inx = (L + [x]).index(x)
inx = temp_inx if temp_inx < len(L) else -1
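How the trick works: appending x guarantees that index() succeeds, and a result equal to len(L) means x was not in the original list. A quick check (L and x are hypothetical names; note the trick copies the list on every call):

```python
L = ['a', 'b', 'c']

x = 'b'                          # present: index found within the original list
temp_inx = (L + [x]).index(x)
inx = temp_inx if temp_inx < len(L) else -1

y = 'z'                          # absent: match is the appended copy at len(L)
temp_iny = (L + [y]).index(y)
iny = temp_iny if temp_iny < len(L) else -1
```

Here inx is 1 and iny is -1.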
I have the same issue with the “.index()” method on lists. I have no issue with the fact that it throws an exception but I strongly disagree with the fact that it’s a non-descriptive ValueError. I could understand if it would’ve been an IndexError, though.
I can see why returning “-1” would be an issue too because it’s a valid index in Python. But realistically, I never expect a “.index()” method to return a negative number.
Here goes a one liner (ok, it’s a rather long line …), goes through the list exactly once and returns “None” if the item isn’t found. It would be trivial to rewrite it to return -1, should you so desire.
indexOf = lambda lst, thing: \
    reduce(lambda acc, (idx, elem): \
        idx if (acc is None) and elem == thing else acc, enumerate(lst), None)  # Python 2 only: tuple parameters
I don't know why you think it is dirty… because of the exception? If you want a one-liner, here it is:
thing_index = thing_list.index(elem) if thing_list.count(elem) else -1
But I would advise against using it; I think Ross Rogers' solution is the best: use an object to encapsulate your desired behaviour, and don't try pushing the language to its limits at the cost of readability.
Answer 11
I suggest:
if thing in thing_list:
    list_index = thing_list.index(thing)
else:
    list_index = -1
How can I find the index of the first occurrence of a number in a Numpy array?
Speed is important to me. I am not interested in the following answers because they scan the whole array and don’t stop when they find the first occurrence:
Although it is way too late for you, for future reference:
Using numba (1) is the easiest way until numpy implements it. If you use the Anaconda Python distribution, it should already be installed.
The code will be compiled, so it will be fast.
from numba import jit

@jit(nopython=True)
def find_first(item, vec):
    """return the index of the first occurrence of item in vec"""
    for i in range(len(vec)):  # xrange in Python 2
        if item == vec[i]:
            return i
    return -1
The Python and Fortran code are available. I skipped the unpromising ones like converting to a list.
The results are on a log scale: the x-axis is the position of the needle (it takes longer to find if it's further down the array), and the last value is a needle that's not in the array; the y-axis is the time to find it.
The array had 1 million elements and tests were run 100 times. The results still fluctuate a bit, but the qualitative trend is clear: Python and f2py quit at the first match, so they scale differently. Python gets too slow if the needle is not in the first 1%, whereas f2py is fast (but you need to compile it).
To summarize, f2py is the fastest solution, especially if the needle appears fairly early.
It's not built in, which is annoying, but it's really just two minutes of work. Add this to a file called search.f90:
subroutine find_first(needle, haystack, haystack_length, index)
    implicit none
    integer, intent(in) :: needle
    integer, intent(in) :: haystack_length
    integer, intent(in), dimension(haystack_length) :: haystack
    !f2py intent(inplace) haystack
    integer, intent(out) :: index
    integer :: k
    index = -1
    do k = 1, haystack_length
        if (haystack(k) == needle) then
            index = k - 1
            exit
        endif
    enddo
end
If you’re looking for something other than integer, just change the type. Then compile using:
f2py -c -m search search.f90
after which you can do (from Python):
import search
print(search.find_first.__doc__)
a = search.find_first(your_int_needle, your_int_array)
You can convert a boolean array to a Python byte string using array.tobytes() (tostring() in Python 2 / older numpy) and then use the find() method:
(array==item).tobytes().find(b'\x01')
This does involve copying the data, though, since byte strings are immutable. An advantage is that you can also search for e.g. a rising edge by finding b'\x00\x01'.
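The same trick in Python 3 syntax (tobytes() and bytes patterns; the example array is made up):

```python
import numpy as np

array = np.array([0, 0, 5, 5, 0, 5])
item = 5

mask = (array == item)                     # bool array: one byte per element
first = mask.tobytes().find(b'\x01')       # index of first match, -1 if none
edge = mask.tobytes().find(b'\x00\x01')    # index just before a rising edge
```

Here first is 2 (the first 5) and edge is 1 (the 0 immediately before it).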
I think you have hit a problem where a different method and some a priori knowledge of the array would really help: the kind of thing where you have an X probability of finding your answer in the first Y percent of the data. You could split up the problem with the hope of getting lucky, then do this in python with a nested list comprehension or something.
Writing a C function to do this brute force isn't too hard using ctypes either.
The C code I hacked together (index.c):
long index(long val, long *data, long length){
long ans, i;
for(i=0;i<length;i++){
if (data[i] == val)
return(i);
}
return(-999);
}
@tal already presented a numba function to find the first index, but that only works for 1D arrays. With np.ndenumerate you can also find the first index in an arbitrarily dimensional array:
from numba import njit
import numpy as np

@njit
def index(array, item):
    for idx, val in np.ndenumerate(array):
        if val == item:
            return idx
    return None
Timings show that it's similar in performance to tal's solution:
arr = np.arange(100000)
%timeit index(arr, 5) # 1000000 loops, best of 3: 1.88 µs per loop
%timeit find_first(5, arr) # 1000000 loops, best of 3: 1.7 µs per loop
%timeit index(arr, 99999) # 10000 loops, best of 3: 118 µs per loop
%timeit find_first(99999, arr) # 10000 loops, best of 3: 96 µs per loop
As far as I know only np.any and np.all on boolean arrays are short-circuited.
In your case, numpy has to go through the entire array twice, once to create the boolean condition and a second time to find the indices.
My recommendation in this case would be to use cython. I think it should be easy to adjust an example for this case, especially if you don’t need much flexibility for different dtypes and shapes.
I needed this for my job so I taught myself Python and Numpy’s C interface and wrote my own. http://pastebin.com/GtcXuLyd It’s only for 1-D arrays, but works for most data types (int, float, or strings) and testing has shown it is again about 20 times faster than the expected approach in pure Python-numpy.
Answer 10
This problem can be effectively solved in pure numpy by processing the array in chunks:
def find_first(x):
    idx, step = 0, 32
    while idx < x.size:
        nz, = x[idx: idx + step].nonzero()
        if len(nz):  # found non-zero, return it
            return nz[0] + idx
        # move to the next chunk, increase step
        idx += step
        step = min(9600, step + step // 2)
    return -1
The array is processed in chunks of size step. The longer the step, the faster a zeroed array (the worst case) is processed; the smaller it is, the faster an array with a non-zero element near the start is processed. The trick is to start with a small step and increase it exponentially. Moreover, there is no need to increase it above some threshold, due to limited benefits.
I've compared the solution with pure ndarray.nonzero and the numba solution on a 10-million-element array of floats.
import numpy as np
from numba import jit
from timeit import timeit
def find_first(x):
    idx, step = 0, 32
    while idx < x.size:
        nz, = x[idx: idx + step].nonzero()
        if len(nz):
            return nz[0] + idx
        idx += step
        step = min(9600, step + step // 2)
    return -1

@jit(nopython=True)
def find_first_numba(vec):
    """return the index of the first occurrence of item in vec"""
    for i in range(len(vec)):
        if vec[i]:
            return i
    return -1
SIZE = 10_000_000
# First only
x = np.empty(SIZE)
find_first_numba(x[:10])
print('---- FIRST ----')
x[:] = 0
x[0] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=1000), 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=1000), 'ms')
print('---- LAST ----')
x[:] = 0
x[-1] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')
print('---- NONE ----')
x[:] = 0
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')
print('---- ALL ----')
x[:] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')
And results on my machine:
---- FIRST ----
ndarray.nonzero 54.733994480002366 ms
find_first 0.0013148509997336078 ms
find_first_numba 0.0002839310000126716 ms
---- LAST ----
ndarray.nonzero 54.56336712999928 ms
find_first 25.38929685000312 ms
find_first_numba 8.022820680002951 ms
---- NONE ----
ndarray.nonzero 24.13432420999925 ms
find_first 25.345200140000088 ms
find_first_numba 8.154927100003988 ms
---- ALL ----
ndarray.nonzero 55.753537260002304 ms
find_first 0.0014760300018679118 ms
find_first_numba 0.0004358099977253005 ms
Pure ndarray.nonzero is the definitive loser. The numba solution is about 5 times faster in the best case and about 3 times faster in the worst case.
Answer 11
If you are looking for the first non-zero element, you can use the following trick:
idx = x.view(bool).argmax() // x.itemsize
idx = idx if x[idx] else -1
It is a very fast “numpy-pure” solution but it fails for some cases discussed below.
The solution takes advantage of the fact that pretty much all representations of zero for numeric types consist of 0 bytes. It applies to numpy's bool as well. In recent versions of numpy, the argmax() function uses short-circuit logic when processing the bool type, and the size of bool is 1 byte.
So one needs to:
create a view of the array as bool. No copy is created
use argmax() to find the first non-zero byte using short-circuit logic
recalculate the offset of this byte to the index of the first non-zero element by integer division (operator //) of the offset by the size of a single element expressed in bytes (x.itemsize)
check if x[idx] is actually non-zero to identify the case when no non-zero is present
I've benchmarked it against the numba solution and the built-in np.nonzero.
import numpy as np
from numba import jit
from timeit import timeit
def find_first(x):
    idx = x.view(bool).argmax() // x.itemsize
    return idx if x[idx] else -1

@jit(nopython=True)
def find_first_numba(vec):
    """return the index of the first occurrence of item in vec"""
    for i in range(len(vec)):
        if vec[i]:
            return i
    return -1
SIZE = 10_000_000
# First only
x = np.empty(SIZE)
find_first_numba(x[:10])
print('---- FIRST ----')
x[:] = 0
x[0] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=1000), 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=1000), 'ms')
print('---- LAST ----')
x[:] = 0
x[-1] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')
print('---- NONE ----')
x[:] = 0
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')
print('---- ALL ----')
x[:] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')
The results on my machine are:
---- FIRST ----
ndarray.nonzero 57.63976670001284 ms
find_first 0.0010841979965334758 ms
find_first_numba 0.0002308919938514009 ms
---- LAST ----
ndarray.nonzero 58.96685277999495 ms
find_first 5.923203580023255 ms
find_first_numba 8.762269750004634 ms
---- NONE ----
ndarray.nonzero 25.13398071998381 ms
find_first 5.924289370013867 ms
find_first_numba 8.810063839919167 ms
---- ALL ----
ndarray.nonzero 55.181210660084616 ms
find_first 0.001246920000994578 ms
find_first_numba 0.00028766007744707167 ms
The solution is 33% faster than numba and it is “numpy-pure”.
The disadvantages:
does not work for other numpy-acceptable types, like object
fails for negative zero, which occasionally appears in float or double computations
As a longtime Matlab user I have been searching for an efficient solution to this problem for quite a while. Finally, motivated by discussions and propositions in this thread, I have tried to come up with a solution implementing an API similar to what was suggested here, supporting for the moment only 1D arrays.
You would use it like this
import numpy as np
import utils_find_1st as utf1st
array = np.arange(100000)
item = 1000
ind = utf1st.find_1st(array, item, utf1st.cmp_larger_eq)
The condition operators supported are: cmp_equal, cmp_not_equal, cmp_larger, cmp_smaller, cmp_larger_eq, cmp_smaller_eq. For efficiency, the extension is written in C.
You find the source, benchmarks and other details here:
Just a note: if you are doing a sequence of searches, the performance gain from doing something clever like converting to a string might be lost in the outer loop if the search dimension isn't big enough. See how the performance of iterating find1, which uses the string-conversion trick proposed above, compares with find2, which uses argmax along the inner axis (plus an adjustment to ensure a non-match returns -1):
import numpy, time

def find1(arr, value):
    return (arr==value).tostring().find('\x01')

def find2(arr, value):  # find value over innermost axis, and return array of indices to the match
    b = arr==value
    return b.argmax(axis=-1) - ~(b.any())

for size in [(1,100000000), (10000,10000), (1000000,100), (10000000,10)]:
    print(size)
    values = numpy.random.choice([0,0,0,0,0,0,0,1], size=size)
    v = values > 0
    t = time.time()
    numpy.apply_along_axis(find1, -1, v, 1)
    print('find1', time.time()-t)
    t = time.time()
    find2(v, 1)
    print('find2', time.time()-t)
>>> d = {'1': 'one', '3': 'three', '2': 'two', '5': 'five', '4': 'four'}
>>> 'one' in d.values()
True
Out of curiosity, some comparative timing:
>>> T(lambda : 'one' in d.itervalues()).repeat()
[0.28107285499572754, 0.29107213020324707, 0.27941107749938965]
>>> T(lambda : 'one' in d.values()).repeat()
[0.38303399085998535, 0.37257885932922363, 0.37096405029296875]
>>> T(lambda : 'one' in d.viewvalues()).repeat()
[0.32004380226135254, 0.31716084480285645, 0.3171098232269287]
EDIT: And in case you wonder why… the reason is that each of the above returns a different type of object, which may or may not be well suited for lookup operations:
d = {"key1":"value1", "key2":"value2"}
"value10" in d.values()
>> False
What if list of values
test = {'key1': ['value4', 'value5', 'value6'], 'key2': ['value9'], 'key3': ['value6']}
"value4" in [x for v in test.values() for x in v]
>>True
What if list of values with string values
test = {'key1': ['value4', 'value5', 'value6'], 'key2': ['value9'], 'key3': ['value6'], 'key5':'value10'}
values = test.values()
"value10" in [x for v in test.values() for x in v] or 'value10' in values
>>True
In Python 3 you can use the values() method of the dictionary. It returns a view object of the values. This, in turn, can be passed to the iter function, which returns an iterator object. The iterator can be checked using in, like this:
'one' in iter(d.values())
Or you can use the view object directly, since it supports membership tests just like a list: 'one' in d.values()
As for your first question: that code is perfectly fine and should work if item equals one of the elements inside myList. Maybe you try to find a string that does not exactly match one of the items or maybe you are using a float value which suffers from inaccuracy.
As for your second question: there are actually several possible ways of "finding" things in lists.
Checking if something is inside
This is the use case you describe: Checking whether something is inside a list or not. As you know, you can use the in operator for that:
3 in [1, 2, 3] # => True
Filtering a collection
That is, finding all elements in a sequence that meet a certain condition. You can use list comprehension or generator expressions for that:
matches = [x for x in lst if fulfills_some_condition(x)]
matches = (x for x in lst if x > 6)
The latter will return a generator which you can imagine as a sort of lazy list that will only be built as soon as you iterate through it. By the way, the first one is exactly equivalent to
matches = filter(fulfills_some_condition, lst)
in Python 2. Here you can see higher-order functions at work. In Python 3, filter doesn’t return a list, but a generator-like object.
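To make the Python 3 behaviour concrete, here is a short sketch (the sample list is illustrative): the filter object is lazy and must be materialized, or iterated, to see the results.

```python
lst = [3, 7, 9, 2]
matches = filter(lambda x: x > 6, lst)

print(matches)        # a lazy filter object, not a list
print(list(matches))  # [7, 9]
```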
Finding the first occurrence
If you only want the first thing that matches a condition (but you don’t know what it is yet), it’s fine to use a for loop (possibly using the else clause as well, which is not really well-known). You can also use
next(x for x in lst if ...)
which will return the first match or raise a StopIteration if none is found. Alternatively, you can use
next((x for x in lst if ...), [default value])
Finding the location of an item
For lists, there’s also the index method that can sometimes be useful if you want to know where a certain element is in the list:
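A short illustration (sample values are my own); note that index raises a ValueError when the element is absent:

```python
lst = ['foo', 'bar', 'baz']
print(lst.index('bar'))   # 1

try:
    lst.index('qux')
except ValueError:
    print('not found')    # index raises ValueError on a missing element
```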
While the answer from Niklas B. is pretty comprehensive, when we want to find an item in a list it is sometimes useful to get its index:
next((i for i, x in enumerate(lst) if [condition on x]), [default value])
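For example, with a concrete condition and a default of -1 standing in for "not found" (the sample data is illustrative):

```python
lst = [4, 9, 16, 25]

# index of the first element greater than 10, or -1 if there is none
idx = next((i for i, x in enumerate(lst) if x > 10), -1)
print(idx)       # 2

missing = next((i for i, x in enumerate(lst) if x > 100), -1)
print(missing)   # -1
```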
Answer 3
Finding the first occurrence
There is a recipe for this in itertools:
def first_true(iterable, default=False, pred=None):
    """Returns the first true value in the iterable.
    If no true value is found, returns *default*
    If *pred* is not None, returns the first item
    for which pred(item) is true.
    """
    # first_true([a,b,c], x) --> a or b or c or x
    # first_true([a,b], x, f) --> a if f(a) else b if f(b) else x
    return next(filter(pred, iterable), default)
For example, the following code finds the first odd number in a list:
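A minimal self-contained sketch (the sample list is my own; first_true is repeated so the snippet runs on its own):

```python
def first_true(iterable, default=False, pred=None):
    """The itertools recipe above: first item for which pred(item) is true."""
    return next(filter(pred, iterable), default)

numbers = [2, 4, 6, 7, 8, 9]
first_odd = first_true(numbers, default=None, pred=lambda x: x % 2 == 1)
print(first_odd)  # 7
```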
Another alternative: you can check if an item is in a list with if item in list:, but this is order O(n). If you are dealing with big lists of items and all you need to know is whether something is a member of your list, you can convert the list to a set first and take advantage of constant time set lookup:
my_set = set(my_list)
if item in my_set:  # much faster on average than using a list
    # do something
Not going to be the correct solution in every case, but for some cases this might give you better performance.
Note that creating the set with set(my_list) is also O(n), so if you only need to do this once then it isn’t any faster to do it this way. If you need to repeatedly check membership though, then this will be O(1) for every lookup after that initial set creation.
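A rough sketch of the trade-off (sizes and numbers are illustrative; exact timings depend on the machine):

```python
import timeit

my_list = list(range(100_000))
my_set = set(my_list)   # one-off O(n) conversion

# worst case for the list: the sought item is at the very end
list_time = timeit.timeit(lambda: 99_999 in my_list, number=100)
set_time = timeit.timeit(lambda: 99_999 in my_set, number=100)
print(f'list: {list_time:.4f}s  set: {set_time:.4f}s')
```

On repeated lookups the set wins by a wide margin, since each check is O(1) instead of O(n).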
Instead of using list.index(x), which returns the index of x if it is found in the list and raises a ValueError if x is not found, you could use list.count(x), which returns the number of occurrences of x in the list (confirming that x is indeed present) or 0 otherwise (in the absence of x). The nice thing about count() is that it doesn't break your code or require you to catch an exception when x is not found.
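A small sketch of guarding index with count (sample values are my own):

```python
lst = ['a', 'b', 'c', 'b']

if lst.count('b'):         # 2 occurrences: truthy, so 'b' is present
    print(lst.index('b'))  # 1

print(lst.count('z'))      # 0: absent, and nothing was raised
```

Note that count() always scans the entire list, whereas in can stop at the first match, so for a pure membership test in is still the cheaper check.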
If you are going to check whether a value exists in a collection only once, then using the 'in' operator is fine. However, if you are going to check more than once, I recommend the bisect module. Keep in mind that to use bisect the data must be sorted, so you sort it once and can then search it repeatedly. On my machine, using the bisect module is about 12 times faster than the 'in' operator.
Here is an example of code using Python 3.8 and above syntax:
import bisect
from timeit import timeit
def bisect_search(container, value):
    # True when value is present: bisect_left finds the insertion point,
    # and the element there equals value only on an exact match
    return (
        (index := bisect.bisect_left(container, value)) < len(container)
        and container[index] == value
    )
data = list(range(1000))
# value to search
true_value = 666
false_value = 66666
# times to test
ttt = 1000
print(f"{bisect_search(data, true_value)=} {bisect_search(data, false_value)=}")
t1 = timeit(lambda: true_value in data, number=ttt)
t2 = timeit(lambda: bisect_search(data, true_value), number=ttt)
print("Performance:", f"{t1=:.4f}, {t2=:.4f}, diffs {t1/t2=:.2f}")
Check that there is no additional/unwanted whitespace in the items of the list of strings.
Stray whitespace is one reason items may appear not to be found.