from xml.etree import ElementTree as ET
tree = ET.parse(r"test.xml")
el1 = tree.findall("DEAL_LEVEL/PAID_OFF") # Returns [] -- no namespace prefix, so nothing matches
el2 = tree.findall("{http://www.test.com}DEAL_LEVEL/{http://www.test.com}PAID_OFF") # Returns [<Element '{http://www.test.com}PAID_OFF' at 0xb78b90>]
Although it can works, because there is a namespace “{http://www.test.com}”, it’s very inconvenient to add a namespace in front of each tag.
How can I ignore the namespace when using the method of “find”, “findall” and so on?
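For reference, two common ways to make these queries shorter (a sketch; the sample document below is made up to mirror the question's structure): pass a prefix-to-URI mapping as the second argument of findall(), or, on Python 3.8+, use the {*} wildcard to match any namespace.

```python
import io
import xml.etree.ElementTree as ET

# hypothetical document mirroring the question's structure
xml = ('<ROOT xmlns="http://www.test.com">'
       '<DEAL_LEVEL><PAID_OFF>Y</PAID_OFF></DEAL_LEVEL></ROOT>')
tree = ET.parse(io.StringIO(xml))

# option 1: a namespaces dict maps a prefix of your choosing to the URI
ns = {'t': 'http://www.test.com'}
el1 = tree.findall('t:DEAL_LEVEL/t:PAID_OFF', ns)

# option 2 (Python 3.8+): the {*} wildcard matches any (or no) namespace
el2 = tree.findall('{*}DEAL_LEVEL/{*}PAID_OFF')
```

This still names the namespace once (in the dict), but keeps the paths readable.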
Instead of modifying the XML document itself, it’s best to parse it and then modify the tags in the result. This way you can handle multiple namespaces and namespace aliases:
from io import StringIO  # for Python 2 import from StringIO instead
import xml.etree.ElementTree as ET

# instead of ET.fromstring(xml)
it = ET.iterparse(StringIO(xml))
for _, el in it:
    prefix, has_namespace, postfix = el.tag.partition('}')
    if has_namespace:
        el.tag = postfix  # strip all namespaces
root = it.root
Here’s an extension to nonagon’s answer, which also strips namespaces off attributes:
from StringIO import StringIO
import xml.etree.ElementTree as ET

# instead of ET.fromstring(xml)
it = ET.iterparse(StringIO(xml))
for _, el in it:
    if '}' in el.tag:
        el.tag = el.tag.split('}', 1)[1]  # strip all namespaces
    for at in list(el.attrib.keys()):  # strip namespaces of attributes too
        if '}' in at:
            newat = at.split('}', 1)[1]
            el.attrib[newat] = el.attrib[at]
            del el.attrib[at]
root = it.root
UPDATE: added list() so the iterator works (needed for Python 3)
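A quick end-to-end check of the stripping approach (a sketch using Python 3's io.StringIO and a made-up two-element document):

```python
import io
import xml.etree.ElementTree as ET

# hypothetical document with a namespaced tag and attribute
xml = ('<a:root xmlns:a="http://example.com/a">'
       '<a:child a:flag="1">hi</a:child></a:root>')

it = ET.iterparse(io.StringIO(xml))
for _, el in it:
    if '}' in el.tag:
        el.tag = el.tag.split('}', 1)[1]          # strip tag namespace
    for at in list(el.attrib.keys()):             # strip attribute namespaces
        if '}' in at:
            el.attrib[at.split('}', 1)[1]] = el.attrib.pop(at)
root = it.root
```

After the loop, both the tags and the attribute keys are plain names.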
import xml.etree.ElementTree as ET

with DisableXmlNamespaces():
    tree = ET.parse("test.xml")
The beauty of this approach is that it does not change any behaviour for unrelated code outside the with block. I ended up creating this after getting errors in unrelated libraries with the version by ericspod, which also happened to use expat.
ElementTree uses Expat by calling ParserCreate(), but provides no option to omit the namespace separator string. The code above causes the separator to be ignored, but be warned that this could break other things.
Answer 7
I might be late to this, but I don't think re.sub is a good solution.
However, the xml.parsers.expat rewrite does not work for Python 3.x versions.
The main culprit is at the bottom of the xml/etree/ElementTree.py source code:
# Import the C accelerators
try:
    # Element is going to be shadowed by the C implementation. We need to keep
    # the Python version of it accessible for some "creative" uses by external code
    # (see tests)
    _Element_Py = Element

    # Element, SubElement, ParseError, TreeBuilder, XMLParser
    from _elementtree import *
except ImportError:
    pass
Which is kinda sad.
The solution is to get rid of it first.
import _elementtree
try:
    del _elementtree.XMLParser
except AttributeError:
    # in case deleted twice
    pass
else:
    from xml.parsers import expat  # NOQA: F811
    oldcreate = expat.ParserCreate
    expat.ParserCreate = lambda encoding, sep: oldcreate(encoding, None)
Tested on Python 3.6.
The try statement is useful in case somewhere in your code you reload or import a module twice; otherwise you get strange errors like
maximum recursion depth exceeded
AttributeError: XMLParser
By the way, damn, the etree source code looks really messy.
Create an iterator to get both the namespaces and a parsed tree object.
Iterate over the created iterator to get the namespaces dict, which we can later pass into each find() or findall() call, as suggested by iMom0.
Return the parsed tree's root element object and the namespaces.
I think this is the best approach all around, as there's no manipulation involved of either the source XML or the resulting parsed xml.etree.ElementTree output.
I'd also like to credit barny's answer for providing an essential piece of this puzzle (that you can get the parsed root from the iterator). Until then I actually traversed the XML tree twice in my application (once to get the namespaces, a second time for the root).
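A minimal sketch of the approach described above, assuming a made-up XML string: collect the namespaces from 'start-ns' events during a single iterparse() pass, keep the root from the iterator, and reuse the mapping in find()/findall():

```python
import io
import xml.etree.ElementTree as ET

# hypothetical document with one namespace declaration
xml = ('<root xmlns:a="http://example.com/a">'
       '<a:item>42</a:item></root>')

namespaces = {}
it = ET.iterparse(io.StringIO(xml), events=('start-ns', 'end'))
for event, data in it:
    if event == 'start-ns':
        prefix, uri = data          # e.g. ('a', 'http://example.com/a')
        namespaces[prefix] = uri
root = it.root                      # the parsed root -- no second traversal

item = root.find('a:item', namespaces)
```

One pass yields both the namespaces dict and the tree.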
How this code works:
>>> def is_alpha(word):
...     try:
...         return word.encode('ascii').isalpha()
...     except:
...         return False
...
>>> is_alpha('中国')
False
>>> is_alpha(u'中国')
False
>>>
>>> a = 'ａ'  # fullwidth letter
>>> b = 'a'
>>> ord(a), ord(b)
(65345, 97)
>>> a.isalpha(), b.isalpha()
(True, True)
>>> is_alpha(a), is_alpha(b)
(False, True)
>>>
Return true if all characters in the string are alphabetic and there is at least one character, false otherwise. Alphabetic characters are those characters defined in the Unicode character database as “Letter”, i.e., those with general category property being one of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”. Note that this is different from the “Alphabetic” property defined in the Unicode Standard.
In python2.x:
>>> s = u'a1中文'
>>> for char in s: print char, char.isalpha()
...
a True
1 False
中 True
文 True
>>> s = 'a1中文'
>>> for char in s: print char, char.isalpha()
...
a True
1 False
� False
� False
� False
� False
� False
� False
>>>
In python3.x:
>>> s = 'a1中文'
>>> for char in s: print(char, char.isalpha())
...
a True
1 False
中 True
文 True
>>>
I found a good way to do this using a function and basic code.
This code accepts a string and counts the number of uppercase letters, lowercase letters, and 'other'. 'Other' is classed as a space, a punctuation mark, or even Japanese and Chinese characters.
def check(count):
    lowercase = 0
    uppercase = 0
    other = 0
    low = 'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'
    upper = 'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'
    for n in count:
        if n in low:
            lowercase += 1
        elif n in upper:
            uppercase += 1
        else:
            other += 1
    print("There are " + str(lowercase) + " lowercase letters.")
    print("There are " + str(uppercase) + " uppercase letters.")
    print("There are " + str(other) + " other elements to this sentence.")
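A more compact variant of the same idea (a sketch, not the answer's code), using str.islower()/str.isupper() instead of hand-written letter tuples; note that these methods also count non-ASCII cased letters:

```python
def check_compact(text):
    # count cased characters; everything else (spaces, punctuation,
    # uncased scripts like Chinese) falls into 'other'
    lowercase = sum(1 for c in text if c.islower())
    uppercase = sum(1 for c in text if c.isupper())
    other = len(text) - lowercase - uppercase
    return lowercase, uppercase, other
```

For example, check_compact("Hello World!") returns (8, 2, 2): eight lowercase letters, two uppercase letters, and two other characters (the space and the exclamation mark).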
Answer 3
userinput = input("Please Enter Letters with Numbers: ")  # e.g. "abcdefg hi j 12345"
digits_count = 0
letters_count = 0
others_count = 0
for i in userinput:
    if i.isdigit():
        digits_count += 1
    elif i.isalpha():
        letters_count += 1
    else:
        others_count += 1
print("Result:")
print("Letters=", letters_count)
print("Digits=", digits_count)
Output:
Please Enter Letters with Numbers:
abcdefg hi j 12345
Result:
Letters= 10
Digits= 5
word = str(input("Enter string:"))
notChar = 0
isChar = 0
for char in word:
    if not char.isalpha():
        notChar += 1
    else:
        isChar += 1
print(isChar, " were letters; ", notChar, " were not letters.")
ok so that’s simplified but you get the idea. Now thing might not actually be in the list, in which case I want to pass -1 as thing_index. In other languages this is what you’d expect index() to return if it couldn’t find the element. In fact it throws a ValueError.
But this feels dirty, plus I don’t know if ValueError could be raised for some other reason. I came up with the following solution based on generator functions, but it seems a little complex:
thing_index = ([i for i in xrange(len(thing_list)) if thing_list[i] == thing] or [-1])[0]
Is there a cleaner way to achieve the same thing? Let’s assume the list isn’t sorted.
There is nothing “dirty” about using try-except clause. This is the pythonic way. ValueError will be raised by the .index method only, because it’s the only code you have there!
To answer the comment:
In Python, the "easier to ask forgiveness than permission" philosophy is well established, and .index will not raise this type of error for any other issues. Not that I can think of any.
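Spelled out, that forgiveness-not-permission form looks like this (thing_list and thing are placeholder names):

```python
thing_list = ['a', 'b', 'c']

thing = 'b'
try:
    thing_index = thing_list.index(thing)   # single pass over the list
except ValueError:
    thing_index = -1
# thing_index is now 1

thing = 'z'
try:
    missing_index = thing_list.index(thing)
except ValueError:
    missing_index = -1                      # sentinel for "not found"
# missing_index is now -1
```

Unlike the `x in lst ... lst.index(x)` idiom, this scans the list only once.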
Answer 1
thing_index = thing_list.index(elem) if elem in thing_list else -1
The dict type has a get function, where if the key doesn’t exist in the dictionary, the 2nd argument to get is the value that it should return. Similarly there is setdefault, which returns the value in the dict if the key exists, otherwise it sets the value according to your default parameter and then returns your default parameter.
You could extend the list type to have a getindexdefault method.
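A sketch of such an extension (the class and method names are made up; it simply wraps .index() in try/except, mirroring dict.get):

```python
class ListWithDefault(list):
    """Hypothetical list subclass with a dict.get-style index lookup."""

    def getindexdefault(self, value, default=-1):
        try:
            return self.index(value)
        except ValueError:
            return default


lst = ListWithDefault(['a', 'b', 'c'])
```

lst.getindexdefault('b') gives 1, and lst.getindexdefault('z') falls back to -1.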
This issue is one of language philosophy. In Java for example there has always been a tradition that exceptions should really only be used in “exceptional circumstances” that is when errors have happened, rather than for flow control. In the beginning this was for performance reasons as Java exceptions were slow but now this has become the accepted style.
In contrast Python has always used exceptions to indicate normal program flow, like raising a ValueError as we are discussing here. There is nothing “dirty” about this in Python style and there are many more where that came from. An even more common example is StopIteration exception which is raised by an iterator‘s next() method to signal that there are no further values.
li = [1, 2, 3, 4, 5]                 # create list
li = dict(zip(li, range(len(li))))   # convert list to dict
print(li)   # {1: 0, 2: 1, 3: 2, 4: 3, 5: 4}
li.get(20)  # None
li.get(1)   # 0
Rather than expose something as implementation-dependent as a list index in a function interface, pass the collection and the thing and let otherfunction deal with the "test for membership" issues. If otherfunction is written to be collection-type-agnostic, then it would probably start with:
if thing in thing_collection:
... proceed with operation on thing
which will work if thing_collection is a list, tuple, set, or dict.
This is possibly clearer than:
if thing_index != MAGIC_VALUE_INDICATING_NOT_A_MEMBER:
which is the code you already have in otherfunction.
Answer 8
Like this:
temp_inx = (L + [x]).index(x)
inx = temp_inx if temp_inx < len(L) else -1
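How the trick works: appending x guarantees that index() succeeds, and a result equal to len(L) means x was not in the original list. A quick check (L and x are hypothetical names; note the trick copies the list on every call):

```python
L = ['a', 'b', 'c']

x = 'b'                          # present: index found within the original list
temp_inx = (L + [x]).index(x)
inx = temp_inx if temp_inx < len(L) else -1

y = 'z'                          # absent: match is the appended copy at len(L)
temp_iny = (L + [y]).index(y)
iny = temp_iny if temp_iny < len(L) else -1
```

Here inx is 1 and iny is -1.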
I have the same issue with the “.index()” method on lists. I have no issue with the fact that it throws an exception but I strongly disagree with the fact that it’s a non-descriptive ValueError. I could understand if it would’ve been an IndexError, though.
I can see why returning “-1” would be an issue too because it’s a valid index in Python. But realistically, I never expect a “.index()” method to return a negative number.
Here goes a one liner (ok, it’s a rather long line …), goes through the list exactly once and returns “None” if the item isn’t found. It would be trivial to rewrite it to return -1, should you so desire.
indexOf = lambda lst, thing: \
    reduce(lambda acc, (idx, elem): \
        idx if (acc is None) and elem == thing else acc, enumerate(lst), None)  # Python 2 only: tuple parameters
I don't know why you think it is dirty… because of the exception? If you want a one-liner, here it is:
thing_index = thing_list.index(elem) if thing_list.count(elem) else -1
But I would advise against using it; I think Ross Rogers' solution is the best: use an object to encapsulate your desired behaviour, and don't try pushing the language to its limits at the cost of readability.
Answer 11
I suggest:
if thing in thing_list:
    list_index = thing_list.index(thing)
else:
    list_index = -1
How can I find the index of the first occurrence of a number in a Numpy array?
Speed is important to me. I am not interested in the following answers because they scan the whole array and don’t stop when they find the first occurrence:
Although it is way too late for you, for future reference:
Using numba (1) is the easiest way until numpy implements it. If you use the Anaconda Python distribution, it should already be installed.
The code will be compiled, so it will be fast.
from numba import jit

@jit(nopython=True)
def find_first(item, vec):
    """return the index of the first occurrence of item in vec"""
    for i in range(len(vec)):  # xrange in Python 2
        if item == vec[i]:
            return i
    return -1
The Python and Fortran code are available. I skipped the unpromising ones like converting to a list.
The results are on a log scale: the x-axis is the position of the needle (it takes longer to find if it's further down the array), and the last value is a needle that's not in the array; the y-axis is the time to find it.
The array had 1 million elements and tests were run 100 times. The results still fluctuate a bit, but the qualitative trend is clear: Python and f2py quit at the first match, so they scale differently. Python gets too slow if the needle is not in the first 1%, whereas f2py is fast (but you need to compile it).
To summarize, f2py is the fastest solution, especially if the needle appears fairly early.
It's not built in, which is annoying, but it's really just two minutes of work. Add this to a file called search.f90:
subroutine find_first(needle, haystack, haystack_length, index)
    implicit none
    integer, intent(in) :: needle
    integer, intent(in) :: haystack_length
    integer, intent(in), dimension(haystack_length) :: haystack
    !f2py intent(inplace) haystack
    integer, intent(out) :: index
    integer :: k
    index = -1
    do k = 1, haystack_length
        if (haystack(k) == needle) then
            index = k - 1
            exit
        endif
    enddo
end
If you’re looking for something other than integer, just change the type. Then compile using:
f2py -c -m search search.f90
after which you can do (from Python):
import search
print(search.find_first.__doc__)
a = search.find_first(your_int_needle, your_int_array)
You can convert a boolean array to a Python byte string using array.tobytes() (tostring() in Python 2 / older numpy) and then use the find() method:
(array==item).tobytes().find(b'\x01')
This does involve copying the data, though, since byte strings are immutable. An advantage is that you can also search for e.g. a rising edge by finding b'\x00\x01'.
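The same trick in Python 3 syntax (tobytes() and bytes patterns; the example array is made up):

```python
import numpy as np

array = np.array([0, 0, 5, 5, 0, 5])
item = 5

mask = (array == item)                     # bool array: one byte per element
first = mask.tobytes().find(b'\x01')       # index of first match, -1 if none
edge = mask.tobytes().find(b'\x00\x01')    # index just before a rising edge
```

Here first is 2 (the first 5) and edge is 1 (the 0 immediately before it).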
I think you have hit a problem where a different method and some a priori knowledge of the array would really help: the kind of thing where you have an X probability of finding your answer in the first Y percent of the data. You could split up the problem with the hope of getting lucky, then do this in python with a nested list comprehension or something.
Writing a C function to do this brute force isn't too hard using ctypes either.
The C code I hacked together (index.c):
long index(long val, long *data, long length){
long ans, i;
for(i=0;i<length;i++){
if (data[i] == val)
return(i);
}
return(-999);
}
@tal already presented a numba function to find the first index, but that only works for 1D arrays. With np.ndenumerate you can also find the first index in an arbitrarily dimensional array:
from numba import njit
import numpy as np

@njit
def index(array, item):
    for idx, val in np.ndenumerate(array):
        if val == item:
            return idx
    return None
Timings show that it's similar in performance to tal's solution:
arr = np.arange(100000)
%timeit index(arr, 5) # 1000000 loops, best of 3: 1.88 µs per loop
%timeit find_first(5, arr) # 1000000 loops, best of 3: 1.7 µs per loop
%timeit index(arr, 99999) # 10000 loops, best of 3: 118 µs per loop
%timeit find_first(99999, arr) # 10000 loops, best of 3: 96 µs per loop
As far as I know only np.any and np.all on boolean arrays are short-circuited.
In your case, numpy has to go through the entire array twice, once to create the boolean condition and a second time to find the indices.
My recommendation in this case would be to use cython. I think it should be easy to adjust an example for this case, especially if you don’t need much flexibility for different dtypes and shapes.
I needed this for my job so I taught myself Python and Numpy’s C interface and wrote my own. http://pastebin.com/GtcXuLyd It’s only for 1-D arrays, but works for most data types (int, float, or strings) and testing has shown it is again about 20 times faster than the expected approach in pure Python-numpy.
Answer 10
This problem can be effectively solved in pure numpy by processing the array in chunks:
def find_first(x):
    idx, step = 0, 32
    while idx < x.size:
        nz, = x[idx: idx + step].nonzero()
        if len(nz):  # found non-zero, return it
            return nz[0] + idx
        # move to the next chunk, increase step
        idx += step
        step = min(9600, step + step // 2)
    return -1
The array is processed in chunks of size step. The longer the step, the faster a zeroed array (the worst case) is processed; the smaller it is, the faster an array with a non-zero element near the start is processed. The trick is to start with a small step and increase it exponentially. Moreover, there is no need to increase it above some threshold, due to limited benefits.
I've compared the solution with pure ndarray.nonzero and the numba solution on a 10-million-element array of floats.
import numpy as np
from numba import jit
from timeit import timeit
def find_first(x):
    idx, step = 0, 32
    while idx < x.size:
        nz, = x[idx: idx + step].nonzero()
        if len(nz):
            return nz[0] + idx
        idx += step
        step = min(9600, step + step // 2)
    return -1

@jit(nopython=True)
def find_first_numba(vec):
    """return the index of the first occurrence of item in vec"""
    for i in range(len(vec)):
        if vec[i]:
            return i
    return -1
SIZE = 10_000_000
# First only
x = np.empty(SIZE)
find_first_numba(x[:10])
print('---- FIRST ----')
x[:] = 0
x[0] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=1000), 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=1000), 'ms')
print('---- LAST ----')
x[:] = 0
x[-1] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')
print('---- NONE ----')
x[:] = 0
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')
print('---- ALL ----')
x[:] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')
And results on my machine:
---- FIRST ----
ndarray.nonzero 54.733994480002366 ms
find_first 0.0013148509997336078 ms
find_first_numba 0.0002839310000126716 ms
---- LAST ----
ndarray.nonzero 54.56336712999928 ms
find_first 25.38929685000312 ms
find_first_numba 8.022820680002951 ms
---- NONE ----
ndarray.nonzero 24.13432420999925 ms
find_first 25.345200140000088 ms
find_first_numba 8.154927100003988 ms
---- ALL ----
ndarray.nonzero 55.753537260002304 ms
find_first 0.0014760300018679118 ms
find_first_numba 0.0004358099977253005 ms
Pure ndarray.nonzero is the definitive loser. The numba solution is about 5 times faster in the best case and about 3 times faster in the worst case.
Answer 11
If you are looking for the first non-zero element, you can use the following trick:
idx = x.view(bool).argmax() // x.itemsize
idx = idx if x[idx] else -1
It is a very fast “numpy-pure” solution but it fails for some cases discussed below.
The solution takes advantage of the fact that pretty much all representations of zero for numeric types consist of 0 bytes. It applies to numpy's bool as well. In recent versions of numpy, the argmax() function uses short-circuit logic when processing the bool type, and the size of bool is 1 byte.
So one needs to:
create a view of the array as bool. No copy is created
use argmax() to find the first non-zero byte using short-circuit logic
recalculate the offset of this byte to the index of the first non-zero element by integer division (operator //) of the offset by the size of a single element expressed in bytes (x.itemsize)
check if x[idx] is actually non-zero to identify the case when no non-zero is present
I've benchmarked it against the numba solution and the built-in np.nonzero.
import numpy as np
from numba import jit
from timeit import timeit
def find_first(x):
    idx = x.view(bool).argmax() // x.itemsize
    return idx if x[idx] else -1

@jit(nopython=True)
def find_first_numba(vec):
    """return the index of the first occurrence of item in vec"""
    for i in range(len(vec)):
        if vec[i]:
            return i
    return -1
SIZE = 10_000_000
# First only
x = np.empty(SIZE)
find_first_numba(x[:10])
print('---- FIRST ----')
x[:] = 0
x[0] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=1000), 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=1000), 'ms')
print('---- LAST ----')
x[:] = 0
x[-1] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')
print('---- NONE ----')
x[:] = 0
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')
print('---- ALL ----')
x[:] = 1
print('ndarray.nonzero', timeit(lambda: x.nonzero()[0][0], number=100)*10, 'ms')
print('find_first', timeit(lambda: find_first(x), number=100)*10, 'ms')
print('find_first_numba', timeit(lambda: find_first_numba(x), number=100)*10, 'ms')
The results on my machine are:
---- FIRST ----
ndarray.nonzero 57.63976670001284 ms
find_first 0.0010841979965334758 ms
find_first_numba 0.0002308919938514009 ms
---- LAST ----
ndarray.nonzero 58.96685277999495 ms
find_first 5.923203580023255 ms
find_first_numba 8.762269750004634 ms
---- NONE ----
ndarray.nonzero 25.13398071998381 ms
find_first 5.924289370013867 ms
find_first_numba 8.810063839919167 ms
---- ALL ----
ndarray.nonzero 55.181210660084616 ms
find_first 0.001246920000994578 ms
find_first_numba 0.00028766007744707167 ms
The solution is 33% faster than numba and it is “numpy-pure”.
The disadvantages:
does not work for other numpy-acceptable types, like object
fails for negative zero, which occasionally appears in float or double computations
As a longtime Matlab user I have been searching for an efficient solution to this problem for quite a while. Finally, motivated by discussions and propositions in this thread, I have tried to come up with a solution implementing an API similar to what was suggested here, supporting for the moment only 1D arrays.
You would use it like this
import numpy as np
import utils_find_1st as utf1st
array = np.arange(100000)
item = 1000
ind = utf1st.find_1st(array, item, utf1st.cmp_larger_eq)
The condition operators supported are: cmp_equal, cmp_not_equal, cmp_larger, cmp_smaller, cmp_larger_eq, cmp_smaller_eq. For efficiency, the extension is written in C.
You find the source, benchmarks and other details here:
Just a note: if you are doing a sequence of searches, the performance gain from doing something clever like converting to a string might be lost in the outer loop if the search dimension isn't big enough. See how the performance of iterating find1, which uses the string-conversion trick proposed above, compares with find2, which uses argmax along the inner axis (plus an adjustment to ensure a non-match returns -1):
import numpy, time

def find1(arr, value):
    return (arr==value).tostring().find('\x01')

def find2(arr, value):  # find value over innermost axis, and return array of indices to the match
    b = arr==value
    return b.argmax(axis=-1) - ~(b.any())

for size in [(1,100000000), (10000,10000), (1000000,100), (10000000,10)]:
    print(size)
    values = numpy.random.choice([0,0,0,0,0,0,0,1], size=size)
    v = values > 0
    t = time.time()
    numpy.apply_along_axis(find1, -1, v, 1)
    print('find1', time.time()-t)
    t = time.time()
    find2(v, 1)
    print('find2', time.time()-t)
>>> d = {'1': 'one', '3': 'three', '2': 'two', '5': 'five', '4': 'four'}
>>> 'one' in d.values()
True
Out of curiosity, some comparative timing:
>>> T(lambda : 'one' in d.itervalues()).repeat()
[0.28107285499572754, 0.29107213020324707, 0.27941107749938965]
>>> T(lambda : 'one' in d.values()).repeat()
[0.38303399085998535, 0.37257885932922363, 0.37096405029296875]
>>> T(lambda : 'one' in d.viewvalues()).repeat()
[0.32004380226135254, 0.31716084480285645, 0.3171098232269287]
EDIT: And in case you wonder why… the reason is that each of the above returns a different type of object, which may or may not be well suited for lookup operations:
d = {"key1":"value1", "key2":"value2"}
"value10" in d.values()
>> False
What if list of values
test = {'key1': ['value4', 'value5', 'value6'], 'key2': ['value9'], 'key3': ['value6']}
"value4" in [x for v in test.values() for x in v]
>>True
What if list of values with string values
test = {'key1': ['value4', 'value5', 'value6'], 'key2': ['value9'], 'key3': ['value6'], 'key5':'value10'}
values = test.values()
"value10" in [x for v in test.values() for x in v] or 'value10' in values
>>True
In Python 3 you can use the values() method of the dictionary. It returns a view object of the values. This, in turn, can be passed to the iter function, which returns an iterator object. The iterator can be checked using in, like this:
'one' in iter(d.values())
Or you can use the view object directly, since it supports membership tests just like a list: 'one' in d.values()
As for your first question: that code is perfectly fine and should work if item equals one of the elements inside myList. Maybe you try to find a string that does not exactly match one of the items or maybe you are using a float value which suffers from inaccuracy.
As for your second question: there are actually several possible ways of "finding" things in lists.
Checking if something is inside
This is the use case you describe: Checking whether something is inside a list or not. As you know, you can use the in operator for that:
3 in [1, 2, 3] # => True
Filtering a collection
That is, finding all elements in a sequence that meet a certain condition. You can use list comprehension or generator expressions for that:
matches = [x for x in lst if fulfills_some_condition(x)]
matches = (x for x in lst if x > 6)
The latter will return a generator which you can imagine as a sort of lazy list that will only be built as soon as you iterate through it. By the way, the first one is exactly equivalent to
matches = filter(fulfills_some_condition, lst)
in Python 2. Here you can see higher-order functions at work. In Python 3, filter doesn’t return a list, but a generator-like object.
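To make the Python 3 behaviour concrete, here is a short sketch (the sample list is illustrative): the filter object is lazy and must be materialized, or iterated, to see the results.

```python
lst = [3, 7, 9, 2]
matches = filter(lambda x: x > 6, lst)

print(matches)        # a lazy filter object, not a list
print(list(matches))  # [7, 9]
```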
Finding the first occurrence
If you only want the first thing that matches a condition (but you don’t know what it is yet), it’s fine to use a for loop (possibly using the else clause as well, which is not really well-known). You can also use
next(x for x in lst if ...)
which will return the first match or raise a StopIteration if none is found. Alternatively, you can use
next((x for x in lst if ...), [default value])
Finding the location of an item
For lists, there’s also the index method that can sometimes be useful if you want to know where a certain element is in the list:
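A short illustration (sample values are my own); note that index raises a ValueError when the element is absent:

```python
lst = ['foo', 'bar', 'baz']
print(lst.index('bar'))   # 1

try:
    lst.index('qux')
except ValueError:
    print('not found')    # index raises ValueError on a missing element
```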
While the answer from Niklas B. is pretty comprehensive, when we want to find an item in a list it is sometimes useful to get its index:
next((i for i, x in enumerate(lst) if [condition on x]), [default value])
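For example, with a concrete condition and a default of -1 standing in for "not found" (the sample data is illustrative):

```python
lst = [4, 9, 16, 25]

# index of the first element greater than 10, or -1 if there is none
idx = next((i for i, x in enumerate(lst) if x > 10), -1)
print(idx)       # 2

missing = next((i for i, x in enumerate(lst) if x > 100), -1)
print(missing)   # -1
```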
Answer 3
Finding the first occurrence
There is a recipe for this in itertools:
def first_true(iterable, default=False, pred=None):
    """Returns the first true value in the iterable.
    If no true value is found, returns *default*
    If *pred* is not None, returns the first item
    for which pred(item) is true.
    """
    # first_true([a,b,c], x) --> a or b or c or x
    # first_true([a,b], x, f) --> a if f(a) else b if f(b) else x
    return next(filter(pred, iterable), default)
For example, the following code finds the first odd number in a list:
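A minimal self-contained sketch (the sample list is my own; first_true is repeated so the snippet runs on its own):

```python
def first_true(iterable, default=False, pred=None):
    """The itertools recipe above: first item for which pred(item) is true."""
    return next(filter(pred, iterable), default)

numbers = [2, 4, 6, 7, 8, 9]
first_odd = first_true(numbers, default=None, pred=lambda x: x % 2 == 1)
print(first_odd)  # 7
```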
Another alternative: you can check if an item is in a list with if item in list:, but this is order O(n). If you are dealing with big lists of items and all you need to know is whether something is a member of your list, you can convert the list to a set first and take advantage of constant time set lookup:
my_set = set(my_list)
if item in my_set:  # much faster on average than using a list
    # do something
Not going to be the correct solution in every case, but for some cases this might give you better performance.
Note that creating the set with set(my_list) is also O(n), so if you only need to do this once then it isn’t any faster to do it this way. If you need to repeatedly check membership though, then this will be O(1) for every lookup after that initial set creation.
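A rough sketch of the trade-off (sizes and numbers are illustrative; exact timings depend on the machine):

```python
import timeit

my_list = list(range(100_000))
my_set = set(my_list)   # one-off O(n) conversion

# worst case for the list: the sought item is at the very end
list_time = timeit.timeit(lambda: 99_999 in my_list, number=100)
set_time = timeit.timeit(lambda: 99_999 in my_set, number=100)
print(f'list: {list_time:.4f}s  set: {set_time:.4f}s')
```

On repeated lookups the set wins by a wide margin, since each check is O(1) instead of O(n).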
Instead of using list.index(x), which returns the index of x if it is found in the list and raises a ValueError if x is not found, you could use list.count(x), which returns the number of occurrences of x in the list (confirming that x is indeed present) or 0 otherwise (in the absence of x). The nice thing about count() is that it doesn't break your code or require you to catch an exception when x is not found.
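A small sketch of guarding index with count (sample values are my own):

```python
lst = ['a', 'b', 'c', 'b']

if lst.count('b'):         # 2 occurrences: truthy, so 'b' is present
    print(lst.index('b'))  # 1

print(lst.count('z'))      # 0: absent, and nothing was raised
```

Note that count() always scans the entire list, whereas in can stop at the first match, so for a pure membership test in is still the cheaper check.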
If you are going to check whether a value exists in a collection only once, then using the 'in' operator is fine. However, if you are going to check more than once, I recommend the bisect module. Keep in mind that to use bisect the data must be sorted, so you sort it once and can then search it repeatedly. On my machine, using the bisect module is about 12 times faster than the 'in' operator.
Here is an example of code using Python 3.8 and above syntax:
import bisect
from timeit import timeit
def bisect_search(container, value):
    # True when value is present: bisect_left finds the insertion point,
    # and the element there equals value only on an exact match
    return (
        (index := bisect.bisect_left(container, value)) < len(container)
        and container[index] == value
    )
data = list(range(1000))
# value to search
true_value = 666
false_value = 66666
# times to test
ttt = 1000
print(f"{bisect_search(data, true_value)=} {bisect_search(data, false_value)=}")
t1 = timeit(lambda: true_value in data, number=ttt)
t2 = timeit(lambda: bisect_search(data, true_value), number=ttt)
print("Performance:", f"{t1=:.4f}, {t2=:.4f}, diffs {t1/t2=:.2f}")
Check that there is no additional/unwanted whitespace in the items of the list of strings.
Stray whitespace is one reason items may appear not to be found.