Given a mean and a variance is there a simple function call which will plot a normal distribution?
Answer 0
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import math
mu = 0
variance = 1
sigma = math.sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.plot(x, stats.norm.pdf(x, mu, sigma))
plt.show()
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Plot between -10 and 10 with .001 steps.
x_axis = np.arange(-10, 10, 0.001)
# Mean = 0, SD = 2.
plt.plot(x_axis, norm.pdf(x_axis, 0, 2))
plt.show()
import numpy as np
import matplotlib.pyplot as plt
mean = 0; std = 1; variance = np.square(std)
x = np.arange(-5, 5, .01)
f = np.exp(-np.square(x - mean)/(2*variance))/(np.sqrt(2*np.pi*variance))
plt.plot(x,f)
plt.ylabel('gaussian distribution')
plt.show()
I have just come back to this and I had to install scipy, as matplotlib.mlab gave me the error message MatplotlibDeprecationWarning (pointing to scipy.stats.norm.pdf as the replacement) when trying the example above. So the sample is now:
%matplotlib inline
import math
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats
mu = 0
variance = 1
sigma = math.sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.plot(x, scipy.stats.norm.pdf(x, mu, sigma))
plt.show()
Answer 6
I believe setting the height is important, so I created the following function:
def my_gauss(x, sigma=1, h=1, mid=0):
    from math import exp, pow
    variance = pow(sigma, 2)
    return h * exp(-pow(x - mid, 2) / (2 * variance))
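For example, to see the curve (a quick sketch of my own, assuming numpy and matplotlib are available):

import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(-4, 4, 200)
plt.plot(xs, [my_gauss(x, sigma=1, h=2, mid=0) for x in xs])  # h controls the peak height
plt.show()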
You can use options.display.max_colwidth to specify you want to see more in the default representation:
In [2]: df
Out[2]:
one
0 one
1 two
2 This is very long string very long string very...
In [3]: pd.options.display.max_colwidth
Out[3]: 50
In [4]: pd.options.display.max_colwidth = 100
In [5]: df
Out[5]:
one
0 one
1 two
2 This is very long string very long string very long string veryvery long string
And indeed, if you just want to inspect the one value, by accessing it (as a scalar, not as a row as df.iloc[2] does) you also see the full string:
In [7]: df.iloc[2,0] # or df.loc[2,'one']
Out[7]: 'This is very long string very long string very long string veryvery long string'
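If you only want the wider display temporarily, a small sketch using pandas' option_context (same option name as above, applied only inside the with block):

import pandas as pd

with pd.option_context('display.max_colwidth', 100):
    print(df)  # long strings shown in full here; the global setting is untouched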
Another easier way to print the whole string is to call values on the dataframe.
df = pd.DataFrame({'one' : ['one', 'two',
'This is very long string very long string very long string veryvery long string']})
print(df.values)
The Output will be
[['one']
['two']
['This is very long string very long string very long string veryvery long string']]
Answer 4
Is this what you meant to do?
In [7]: x = pd.DataFrame({'one' : ['one', 'two', 'This is very long string very long string very long string veryvery long string']})
In [8]: x
Out[8]:
one
0 one
1 two
2 This is very long string very long string very...
In [9]: x['one'][2]
Out[9]: 'This is very long string very long string very long string veryvery long string'
The way I often deal with the situation you describe is to use the .to_csv() method and write to stdout:
import sys
df.to_csv(sys.stdout)
Update: it should now be possible to just use None instead of sys.stdout with similar effect!
This should dump the whole dataframe, including the entirety of any strings. You can use the to_csv parameters to configure column separators, whether the index is printed, etc. It will be less pretty than rendering it properly though.
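For instance, a small sketch of the None variant (assuming a pandas version where path_or_buf=None makes to_csv return a string):

import pandas as pd

df = pd.DataFrame({'one': ['one', 'two', 'This is a very long string ' * 4]})
print(df.to_csv(None))  # returns the CSV text as a str instead of writing to a buffer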
The original question was in regard to TensorFlow implementations specifically. However, the answers are for implementations in general. This general answer is also the correct answer for TensorFlow.
When using batch normalization and dropout in TensorFlow (specifically using the contrib.layers) do I need to be worried about the ordering?
It seems possible that if I use dropout followed immediately by batch normalization there might be trouble. For example, if the shift in the batch normalization trains to the larger scale numbers of the training outputs, but then that same shift is applied to the smaller (due to the compensation for having more outputs) scale numbers without dropout during testing, then that shift may be off. Does the TensorFlow batch normalization layer automatically compensate for this? Or does this not happen for some reason I’m missing?
Also, are there other pitfalls to look out for in when using these two together? For example, assuming I’m using them in the correct order in regards to the above (assuming there is a correct order), could there be trouble with using both batch normalization and dropout on multiple successive layers? I don’t immediately see a problem with that, but I might be missing something.
Thank you much!
UPDATE:
An experimental test seems to suggest that ordering does matter. I ran the same network twice with only the batch norm and dropout reversed. When the dropout is before the batch norm, validation loss seems to be going up as training loss is going down. They’re both going down in the other case. But in my case the movements are slow, so things may change after more training and it’s just a single test. A more definitive and informed answer would still be appreciated.
Answer 0
In the Ioffe and Szegedy 2015, the authors state that “we would like to ensure that for any parameter values, the network always produces activations with the desired distribution”. So the Batch Normalization Layer is actually inserted right after a Conv Layer/Fully Connected Layer, but before feeding into ReLu (or any other kinds of) activation. See this video at around time 53 min for more details.
As far as dropout goes, I believe dropout is applied after activation layer. In the dropout paper figure 3b, the dropout factor/probability matrix r(l) for hidden layer l is applied to it on y(l), where y(l) is the result after applying activation function f.
So in summary, the order of using batch normalization and dropout is:

-> CONV/FC -> BatchNorm -> ReLU (or other activation) -> Dropout -> CONV/FC ->
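A minimal sketch of that ordering (my own example, assuming tf.keras; the layer sizes and input shape are arbitrary placeholders):

import tensorflow as tf
from tensorflow.keras import layers

# CONV -> BatchNorm -> activation -> Dropout, repeated per block
model = tf.keras.Sequential([
    layers.Conv2D(32, 3, padding='same', input_shape=(32, 32, 3)),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dropout(0.25),
    layers.Conv2D(64, 3, padding='same'),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),
])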
As noted in the comments, an amazing resource to read up on the order of layers is here. I have gone through the comments and it is the best resource on this topic I have found on the internet.
My 2 cents:
Dropout is meant to block information from certain neurons completely to make sure the neurons do not co-adapt.
So, batch normalization has to come after dropout; otherwise you are passing information from the dropped neurons through the normalization statistics.
If you think about it, in typical ML problems, this is the reason we don’t compute mean and standard deviation over entire data and then split it into train, test and validation sets. We split and then compute the statistics over the train set and use them to normalize and center the validation and test datasets
So I suggest Scheme 1 (this takes pseudomarvin’s comment on the accepted answer into consideration):

-> CONV/FC -> ReLU (or other activation) -> Dropout -> BatchNorm -> CONV/FC

as opposed to Scheme 2, the ordering in the accepted answer:

-> CONV/FC -> BatchNorm -> ReLU (or other activation) -> Dropout -> CONV/FC ->

Please note that this means the network under Scheme 2 should show more over-fitting than the network under Scheme 1, but the OP ran some tests as mentioned in the question and they support Scheme 2.
I found a paper that explains the disharmony between Dropout and Batch Norm(BN). The key idea is what they call the “variance shift”. This is due to the fact that dropout has a different behavior between training and testing phases, which shifts the input statistics that BN learns.
The main idea can be found in this figure which is taken from this paper.
A small demo for this effect can be found in this notebook.
The convolutional layers have a kernel size of (3,3) and default padding; the activation is elu. The pooling is a MaxPooling with pool size (2,2). The loss is categorical_crossentropy and the optimizer is adam.
The corresponding Dropout probability is 0.2 or 0.3, respectively. The amount of feature maps is 32 or 64, respectively.
Edit:
When I dropped the Dropout, as recommended in some answers, it converged faster but had a worse generalization ability than when I use BatchNorm and Dropout.
From Ioffe and Szegedy (2015)’s point of view, only use BN in the network structure. Li et al. (2018) give statistical and experimental analyses showing that there is a variance shift when practitioners use Dropout before BN. Thus, Li et al. (2018) recommend applying Dropout after all BN layers.
From Ioffe and Szegedy (2015)’s point of view, BN is located inside/before the activation function. However, Chen et al. (2019) use an IC layer which combines dropout and BN, and Chen et al. (2019) recommend using BN after ReLU.
To be on the safe side, I use only Dropout or only BN in a network, not both.
Chen, Guangyong, Pengfei Chen, Yujun Shi, Chang-Yu Hsieh, Benben Liao, and Shengyu Zhang. 2019. “Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks.” CoRR abs/1905.05928. http://arxiv.org/abs/1905.05928.
Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” CoRR abs/1502.03167. http://arxiv.org/abs/1502.03167.
Li, Xiang, Shuo Chen, Xiaolin Hu, and Jian Yang. 2018. “Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift.” CoRR abs/1801.05134. http://arxiv.org/abs/1801.05134.
Good question and answer, but they only handle one column with a list. In my answer the self-defined function will work for multiple columns. Also, the accepted answer uses apply, which is the most time-consuming approach and is not recommended; for more info see When should I ever want to use pandas apply() in my code?
I know that object-dtype columns make the data hard to convert with pandas functions. When I receive data like this, the first thing that comes to mind is to 'flatten' or unnest the columns.
I am using pandas and python functions for this type of question. If you are worried about the speed of the above solutions, check user3483203's answer, since it uses numpy and most of the time numpy is faster. I recommend Cython or numba if speed matters.
Method 0 [pandas >= 0.25]
Starting from pandas 0.25, if you only need to explode one column, you can use the pandas.DataFrame.explode function:
df.explode('B')
A B
0 1 1
1 1 2
0 2 1
1 2 2
Given a dataframe with an empty list or a NaN in the column: an empty list will not cause an issue, but a NaN will need to be filled with a list.
df = pd.DataFrame({'A': [1, 2, 3, 4],'B': [[1, 2], [1, 2], [], np.nan]})
df.B = df.B.fillna({i: [] for i in df.index}) # replace NaN with []
df.explode('B')
A B
0 1 1
0 1 2
1 2 1
1 2 2
2 3 NaN
3 4 NaN
Method 1: apply + pd.Series (easy to understand, but not recommended in terms of performance).
df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
Out[463]:
A B
0 1 1
1 1 2
0 2 1
1 2 2
Method 2
Using repeat with the DataFrame constructor, re-create your dataframe (good performance, but not good with multiple columns).
df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
df
Out[465]:
A B
0 1 1
0 1 2
1 2 1
1 2 2
Method 2.1
For example, besides A we have A.1 ... A.n. If we still use Method 2 above, it is hard to re-create the columns one by one.
Solution: join or merge with the index after 'unnesting' the single column.
s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
s.join(df.drop('B',1),how='left')
Out[477]:
B A
0 1 1
0 2 1
1 1 2
1 2 2
If you need the column order exactly the same as before, add reindex at the end.
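For illustration (continuing the s and df from Method 2.1; the reindex here only restores the original column order):

s.join(df.drop('B', 1), how='left').reindex(columns=df.columns)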
Method 3
Recreate the dataframe with a list comprehension:

pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
Out[488]:
A B
0 1 1
1 1 2
2 2 1
3 2 2
If more than two columns, use
s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
s.merge(df,left_on=0,right_index=True)
Out[491]:
0 1 A B
0 0 1 1 [1, 2]
1 0 2 1 [1, 2]
2 1 1 2 [1, 2]
3 1 2 2 [1, 2]
Method 4
using reindex or loc
df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]:
A B
0 1 1
0 1 2
1 2 1
1 2 2
#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))
Method 5
when the list only contains unique values:
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]:
B A
0 1 1
1 2 1
2 3 2
3 4 2
Method 6
using numpy for high performance:
newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
pd.DataFrame(data=newvalues[0],columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Method 7
using the base functions itertools.cycle and chain: a pure Python solution, just for fun
from itertools import cycle,chain
l=df.values.tolist()
l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Generalizing to multiple columns
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
df
Out[592]:
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4]
Self-defined function:
def unnesting(df, explode):
idx = df.index.repeat(df[explode[0]].str.len())
df1 = pd.concat([
pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
df1.index = idx
return df1.join(df.drop(explode, 1), how='left')
unnesting(df,['B','C'])
Out[609]:
B C A
0 1 1 1
0 2 2 1
1 3 3 2
1 4 4 2
Column-wise Unnesting
All the methods above deal with vertical unnesting and exploding. If you need to expand the list horizontally, check the pd.DataFrame constructor:
df.join(pd.DataFrame(df.B.tolist(),index=df.index).add_prefix('B_'))
Out[33]:
A B C B_0 B_1
0 1 [1, 2] [1, 2] 1 2
1 2 [3, 4] [3, 4] 3 4
Updated function
def unnesting(df, explode, axis):
if axis==1:
idx = df.index.repeat(df[explode[0]].str.len())
df1 = pd.concat([
pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
df1.index = idx
return df1.join(df.drop(explode, 1), how='left')
else :
df1 = pd.concat([
pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
return df1.join(df.drop(explode, 1), how='left')
Generalizing this to tile some columns and flatten (explode) others, with example data:

   A          B          C  D
0  1     [1, 2]  [1, 2, 3]  A
1  2  [1, 2, 3]     [1, 2]  B
2  3        [1]     [1, 2]  C

def unnest(df, tile, explode):
    vals = df[explode].sum(1)
    rs = [len(r) for r in vals]
    a = np.repeat(df[tile].values, rs, axis=0)
    b = np.concatenate(vals.values)
    d = np.column_stack((a, b))
    return pd.DataFrame(d, columns=tile + ['_'.join(explode)])

unnest(df, ['A', 'D'], ['B', 'C'])

    A  D B_C
0   1  A   1
1   1  A   2
2   1  A   1
3   1  A   2
4   1  A   3
5   2  B   1
6   2  B   2
7   2  B   3
8   2  B   1
9   2  B   2
10  3  C   1
11  3  C   1
12  3  C   2
Functions

def wen1(df):
    return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

def wen2(df):
    return pd.DataFrame({'A': df.A.repeat(df.B.str.len()), 'B': np.concatenate(df.B.values)})

def wen3(df):
    s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
    return s.join(df.drop('B', 1), how='left')

def wen4(df):
    return pd.DataFrame([[x] + [z] for x, y in df.values for z in y], columns=df.columns)

def chris1(df):
    vals = np.array(df.B.values.tolist())
    a = np.repeat(df.A, vals.shape[1])
    return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

def chris2(df):
    vals = df.B.values.tolist()
    rs = [len(r) for r in vals]
    a = np.repeat(df.A.values, rs)
    return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)

Timings

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit

res = pd.DataFrame(
    index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
    columns=[10, 50, 100, 500, 1000, 5000, 10000],
    dtype=float)

for f in res.index:
    for c in res.columns:
        df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
        df = pd.concat([df]*c)
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
If all of the sublists in the other column are the same length, numpy can be an efficient option here:
vals = np.array(df.B.values.tolist())
a = np.repeat(df.A, vals.shape[1])
pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Option 2
If the sublists have different length, you need an additional step:
vals = df.B.values.tolist()
rs = [len(r) for r in vals]
a = np.repeat(df.A, rs)
pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Option 3
I took a shot at generalizing this to work to flatten N columns and tile M columns, I’ll work later on making it more efficient:
df[['B1','B2']] = pd.DataFrame([*df['B']])  # if values.tolist() is too boring

(pd.wide_to_long(df.drop('B', 1), 'B', 'A', '')
   .reset_index(level=1, drop=True)
   .reset_index())
Because normally sublist lengths are different, and join/merge is far more computationally expensive, I retested the methods for different-length sublists and more normal columns.
MultiIndex should also be an easier way to write, with nearly the same performance as the numpy way.
Surprisingly, in my implementation the comprehension way has the best performance.
def stack(df):
return df.set_index(['A', 'C']).B.apply(pd.Series).stack()
def comprehension(df):
return pd.DataFrame([x + [z] for x, y in zip(df[['A', 'C']].values.tolist(), df.B) for z in y])
def multiindex(df):
return pd.DataFrame(np.concatenate(df.B.values), index=df.set_index(['A', 'C']).index.repeat(df.B.str.len()))
def array(df):
return pd.DataFrame(
np.column_stack((
np.repeat(df[['A', 'C']].values, df.B.str.len(), axis=0),
np.concatenate(df.B.values)
))
)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit
res = pd.DataFrame(
index=[
'stack',
'comprehension',
'multiindex',
'array',
],
columns=[1000, 2000, 5000, 10000, 20000, 50000],
dtype=float
)
for f in res.index:
for c in res.columns:
df = pd.DataFrame({'A': list('abc'), 'C': list('def'), 'B': [['g', 'h', 'i'], ['j', 'k'], ['l']]})
df = pd.concat([df] * c)
stmt = '{}(df)'.format(f)
setp = 'from __main__ import df, {}'.format(f)
res.at[f, c] = timeit(stmt, setp, number=20)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
The actual explosion is performed in 3 lines. The rest is cosmetics (multi column explosion, handling of strings instead of lists in the explosion column, …).
import pandas as pd
import numpy as np
df=pd.DataFrame( {'A': ['A1','A2','A3'],
'B': ['B1','B2','B3'],
'C': [ ['C1.1','C1.2'],['C2.1','C2.2'],'C3'],
'columnD': [ 'D1',['D2.1','D2.2', 'D2.3'],['D3.1','D3.2']],
})
print('df',df, sep='\n')
def dfListExplode(df, explodeKeys):
if not isinstance(explodeKeys, list):
explodeKeys=[explodeKeys]
# recursive handling of explodeKeys
if len(explodeKeys)==0:
return df
elif len(explodeKeys)==1:
explodeKey=explodeKeys[0]
else:
return dfListExplode( dfListExplode(df, explodeKeys[:1]), explodeKeys[1:])
# perform explosion/unnesting for key: explodeKey
dfPrep=df[explodeKey].apply(lambda x: x if isinstance(x,list) else [x]) #casts all elements to a list
dfIndExpl=pd.DataFrame([[x] + [z] for x, y in zip(dfPrep.index,dfPrep.values) for z in y ], columns=['explodedIndex',explodeKey])
dfMerged=dfIndExpl.merge(df.drop(explodeKey, axis=1), left_on='explodedIndex', right_index=True)
dfReind=dfMerged.reindex(columns=list(df))
return dfReind
dfExpl=dfListExplode(df,['C','columnD'])
print('dfExpl',dfExpl, sep='\n')
When the lengths are the same, it is easy for us to assume that the varying elements coincide and should be “zipped” together.
A B C
0 1 [1, 2] [1, 2] # Typical to assume these should be zipped [(1, 1), (2, 2)]
1 2 [3, 4] [3, 4, 5]
However, the assumption gets challenged when we see objects of different lengths. Should we “zip”? If so, how do we handle the excess in one of the objects? Or maybe we want the product of all of the objects; this will get big fast, but might be what is wanted.
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4, 5] # is this [(3, 3), (4, 4), (None, 5)]?
OR
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4, 5] # is this [(3, 3), (3, 4), (3, 5), (4, 3), (4, 4), (4, 5)]
The Function
This function gracefully handles zip or product based on a parameter, and assumes zipping up to the length of the longest object, using zip_longest.
from itertools import zip_longest, product
def xplode(df, explode, zipped=True):
method = zip_longest if zipped else product
rest = {*df} - {*explode}
zipped = zip(zip(*map(df.get, rest)), zip(*map(df.get, explode)))
tups = [tup + exploded
for tup, pre in zipped
for exploded in method(*pre)]
return pd.DataFrame(tups, columns=[*rest, *explode])[[*df]]
Zipped
xplode(df, ['B', 'C'])
A B C
0 1 1.0 1
1 1 2.0 2
2 2 3.0 3
3 2 4.0 4
4 2 NaN 5
New Setup
Modifying the example:

df = pd.DataFrame({
'A': [1, 2],
'B': [[1, 2], [3, 4]],
'C': 'C',
'D': [[1, 2], [3, 4, 5]],
'E': [('X', 'Y', 'Z'), ('W',)]
})
df
A B C D E
0 1 [1, 2] C [1, 2] (X, Y, Z)
1 2 [3, 4] C [3, 4, 5] (W,)
Zipped
xplode(df, ['B', 'D', 'E'])
A B C D E
0 1 1.0 C 1.0 X
1 1 2.0 C 2.0 Y
2 1 NaN C NaN Z
3 2 3.0 C 3.0 W
4 2 4.0 C 4.0 None
5 2 NaN C 5.0 None
Product
xplode(df, ['B', 'D', 'E'], zipped=False)
A B C D E
0 1 1 C 1 X
1 1 1 C 1 Y
2 1 1 C 1 Z
3 1 1 C 2 X
4 1 1 C 2 Y
5 1 1 C 2 Z
6 1 2 C 1 X
7 1 2 C 1 Y
8 1 2 C 1 Z
9 1 2 C 2 X
10 1 2 C 2 Y
11 1 2 C 2 Z
12 2 3 C 3 W
13 2 3 C 4 W
14 2 3 C 5 W
15 2 4 C 3 W
16 2 4 C 4 W
17 2 4 C 5 W
Any opinions on this method I thought of? or is doing both concat and melt considered too “expensive”?
Answer 10
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})
out = pd.concat([df.loc[:,'A'],(df.B.apply(pd.Series))], axis=1, sort=False)
out = out.set_index('A').stack().droplevel(level=1).reset_index().rename(columns={0:"B"})
A B
0 1 1
1 1 2
2 2 1
3 2 2
You can implement this as a one-liner if you don’t wish to create an intermediate object.
Answer 11
# Here's the answer to the related question in:
# https://stackoverflow.com/q/56708671/11426125
# initial dataframe
df12=pd.DataFrame({'Date':['2007-12-03','2008-09-07'],'names':
[['Peter','Alex'],['Donald','Stan']]})
# convert dataframe to array for indexing list values (names)
a = np.array(df12.values)
# create a new, dataframe with dimensions for unnested
b = np.ndarray(shape = (4,2))
df2 = pd.DataFrame(b, columns = ["Date", "names"], dtype = str)
# implement loops to assign date/name values as required
i = range(len(a[0]))
j = range(len(a[0]))
for x in i:
for y in j:
df2.iat[2*x+y, 0] = a[x][0]
df2.iat[2*x+y, 1] = a[x][1][y]
# set Date column as Index
df2.Date=pd.to_datetime(df2.Date)
df2.index=df2.Date
df2.drop('Date',axis=1,inplace =True)
I have another good way to solve this when you have more than one column to explode.
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]], 'C':[[1,2,3],[1,2,3]]})
print(df)
A B C
0 1 [1, 2] [1, 2, 3]
1 2 [1, 2] [1, 2, 3]
I want to explode the columns B and C. First I explode B, then C. Then I drop B and C from the original df. After that I do an index join on the 3 dfs.
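A minimal sketch of that procedure (my own code, assuming pandas >= 0.25 for Series.explode; note that joining on the duplicated index gives every combination of the B and C values per original row):

df_b = df['B'].explode()
df_c = df['C'].explode()
out = df.drop(['B', 'C'], axis=1).join(df_b).join(df_c)
print(out)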
I’m facing an issue with allocating huge arrays in numpy on Ubuntu 18 while not facing the same issue on MacOS.
I am trying to allocate memory for a numpy array with shape (156816, 36, 53806)
with
np.zeros((156816, 36, 53806), dtype='uint8')
and while I’m getting an error on Ubuntu OS
>>> import numpy as np
>>> np.zeros((156816, 36, 53806), dtype='uint8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (156816, 36, 53806) and data type uint8
I’ve read somewhere that np.zeros shouldn’t really allocate all the memory needed for the array, but only the memory for the non-zero elements. And this even though the Ubuntu machine has 64 GB of memory, while my MacBook Pro has only 16 GB.
versions:
Ubuntu
os -> ubuntu mate 18
python -> 3.6.8
numpy -> 1.17.0
mac
os -> 10.14.6
python -> 3.6.4
numpy -> 1.17.0
This is likely governed by your system’s overcommit handling mode; the default mode, 0, is described as:
Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.
The array you are asking for needs 156816 × 36 × 53806 bytes, i.e. ~282 GB of address space, and the kernel is saying well obviously there’s no way I’m going to be able to commit that many physical pages to this, and it refuses the allocation.
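A quick back-of-the-envelope check of that figure (plain Python; uint8 means one byte per element):

n_bytes = 156816 * 36 * 53806   # one byte per uint8 element
print(n_bytes)                  # 303755101056
print(n_bytes / 2**30)          # about 282.9 (GiB)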
If (as root) you run:
$ echo 1 > /proc/sys/vm/overcommit_memory
This will enable “always overcommit” mode, and you’ll find that indeed the system will allow you to make the allocation no matter how large it is (within 64-bit memory addressing at least).
I tested this myself on a machine with 32 GB of RAM. With overcommit mode 0 I also got a MemoryError, but after changing it back to 1 it works:
>>> import numpy as np
>>> a = np.zeros((156816, 36, 53806), dtype='uint8')
>>> a.nbytes
303755101056
You can then go ahead and write to any location within the array, and the system will only allocate physical pages when you explicitly write to that page. So you can use this, with care, for sparse arrays.
I had this same problem on Windows and came across this solution. So if someone comes across this problem in Windows, the solution for me was to increase the pagefile size, as it was a memory overcommitment problem for me too.
Windows 8
On the Keyboard Press the WindowsKey + X then click System in the popup menu
Tap or click Advanced system settings. You might be asked for an admin password or to confirm your choice
On the Advanced tab, under Performance, tap or click Settings.
Tap or click the Advanced tab, and then, under Virtual memory, tap or click Change
Clear the Automatically manage paging file size for all drives check box.
Under Drive [Volume Label], tap or click the drive that contains the paging file you want to change
Tap or click Custom size, enter a new size in megabytes in the initial size (MB) or Maximum size (MB) box, tap or click Set, and then tap or click OK
Reboot your system
Windows 10
Press the Windows key
Type SystemPropertiesAdvanced
Click Run as administrator
Click Settings
Select the Advanced tab
Select Change…
Uncheck Automatically manage paging file size for all drives
Then select Custom size and fill in the appropriate size
Press Set then press OK then exit from the Virtual Memory, Performance Options, and System Properties Dialog
Reboot your system
Note: I did not have enough memory on my system for the ~282 GB in this example, but for my particular case this worked.
EDIT
From here the suggested recommendations for page file size:
There is a formula for calculating the correct pagefile size. Initial size is one and a half (1.5) x the amount of total system memory. Maximum size is three (3) x the initial size. So let’s say you have 4 GB (1 GB = 1,024 MB x 4 = 4,096 MB) of memory. The initial size would be 1.5 x 4,096 = 6,144 MB and the maximum size would be 3 x 6,144 = 18,432 MB.
However, this does not take into consideration other important factors and system settings that may be unique to your computer. Again, let Windows choose what to use instead of relying on some arbitrary formula that worked on a different computer.
Also:
Increasing page file size may help prevent instabilities and crashing in Windows. However, hard drive read/write times are much slower than what they would be if the data were in your computer’s memory. Having a larger page file is going to add extra work for your hard drive, causing everything else to run slower. Page file size should only be increased when encountering out-of-memory errors, and only as a temporary fix. A better solution is to add more memory to the computer.
I came across this problem on Windows too. The solution for me was to switch from a 32-bit to a 64-bit version of Python. Indeed, 32-bit software, like a 32-bit CPU, can address a maximum of 4 GB of RAM (2^32). So if you have more than 4 GB of RAM, a 32-bit version cannot take advantage of it.
With a 64-bit version of Python (the one labeled x86-64 in the download page), the issue disappeared.
You can check which version you have by entering the interpreter. I, with a 64-bit version, now have:
Python 3.7.5rc1 (tags/v3.7.5rc1:4082f600a5, Oct 1 2019, 20:28:14) [MSC v.1916 64 bit (AMD64)], where [MSC v.1916 64 bit (AMD64)] means “64-bit Python”.
Note: as of the time of this writing (May 2020), matplotlib is not available for Python 3.9, so I recommend installing Python 3.7, 64-bit.
In my case, adding a dtype argument changed the dtype of the array to a smaller type (from float64 to uint8), decreasing the array size enough to not throw a MemoryError on Windows (64-bit).
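A tiny illustration of the size difference (my own numbers, not the original array):

import numpy as np

a = np.zeros((10000, 10000))                 # float64 by default: ~800 MB
b = np.zeros((10000, 10000), dtype='uint8')  # one byte per element: ~100 MB
print(a.nbytes, b.nbytes)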
res ="".join(i + j for i, j in zip(u, l))print(res)# 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'
它也比使用两个join()调用更快:
In[5]: l1 ='A'*1000000; l2 ='a'*1000000In[6]:%timeit "".join("".join(item)for item in zip(l1, l2))1 loops, best of 3:442 ms per loopIn[7]:%timeit "".join(i + j for i, j in zip(l1, l2))1 loops, best of 3:360 ms per loop
For me, the most pythonic* way is the following which pretty much does the same thing but uses the + operator for concatenating the individual characters in each string:
res = "".join(i + j for i, j in zip(u, l))
print(res)
# 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'
It is also faster than using two join() calls:
In [5]: l1 = 'A' * 1000000; l2 = 'a' * 1000000
In [6]: %timeit "".join("".join(item) for item in zip(l1, l2))
1 loops, best of 3: 442 ms per loop
In [7]: %timeit "".join(i + j for i, j in zip(l1, l2))
1 loops, best of 3: 360 ms per loop
Faster approaches exist, but they often obfuscate the code.
Note: If the two input strings are not the same length then the longer one will be truncated as zip stops iterating at the end of the shorter string. In this case instead of zip one should use zip_longest (izip_longest in Python 2) from the itertools module to ensure that both strings are fully exhausted.
*To take a quote from the Zen of Python: Readability counts.
Pythonic = readability for me; i + j is just visually parsed more easily, at least for my eyes.
Answer 1
Faster Alternative
Another way to do it:

res = [''] * len(u) * 2
res[::2] = u
res[1::2] = l
print(''.join(res))

%%timeit
res = [''] * len(u) * 2
res[::2] = u
res[1::2] = l
''.join(res)
100000 loops, best of 3: 4.75 µs per loop

Faster than the fastest solution so far:

%timeit "".join(list(chain.from_iterable(zip(u, l))))
100000 loops, best of 3: 6.52 µs per loop

Also for larger strings:

l1 = 'A' * 1000000; l2 = 'a' * 1000000

%timeit "".join(list(chain.from_iterable(zip(l1, l2))))
1 loops, best of 3: 151 ms per loop

%%timeit
res = [''] * len(l1) * 2
res[::2] = l1
res[1::2] = l2
''.join(res)
10 loops, best of 3: 92 ms per loop
but by then you’ve already lost the gains over list slicing for small strings (it’s still 20x the speed for long strings) and this doesn’t even work for non-ASCII characters yet.
FWIW, if you are doing this on massive strings and need every cycle, and for some reason have to use Python strings… here’s how to do it:
Special-casing the common case of smaller types will help too. FWIW, this is only 3x the speed of list slicing for long strings and a factor of 4 to 5 slower for small strings.
Either way I prefer the join solutions, but since timings were mentioned elsewhere I thought I might as well join in.
If you want the fastest way, you can combine itertools with operator.add:
In [36]: from operator import add
In [37]: from itertools import starmap, izip
In [38]: timeit "".join([i + j for i, j in uzip(l1, l2)])
1 loops, best of 3: 142 ms per loop
In [39]: timeit "".join(starmap(add, izip(l1,l2)))
1 loops, best of 3: 117 ms per loop
In [40]: timeit "".join(["".join(item) for item in zip(l1, l2)])
1 loops, best of 3: 196 ms per loop
In [41]: "".join(starmap(add, izip(l1,l2))) == "".join([i + j for i, j in izip(l1, l2)]) == "".join(["".join(item) for item in izip(l1, l2)])
Out[42]: True
But combining izip and chain.from_iterable is faster again
In [2]: from itertools import chain, izip
In [3]: timeit "".join(chain.from_iterable(izip(l1, l2)))
10 loops, best of 3: 98.7 ms per loop
There is also a substantial difference between
chain(* and chain.from_iterable(....
In [5]: timeit "".join(chain(*izip(l1, l2)))
1 loops, best of 3: 212 ms per loop
There is no point passing a generator to join: it is always going to be slower, because Python will first build a list from the contents. It does two passes over the data, one to figure out the size needed and one to actually do the join, which would not be possible using a generator:
/* Here is the general case. Do a pre-pass to figure out the total
* amount of space we'll need (sz), and see whether all arguments are
* bytes-like.
*/
Also if you have different length strings and you don’t want to lose data you can use izip_longest :
In [22]: from itertools import izip_longest
In [23]: a,b = "hlo","elworld"
In [24]: "".join(chain.from_iterable(izip_longest(a, b,fillvalue="")))
Out[24]: 'helloworld'
For python 3 it is called zip_longest
But for python2, veedrac’s suggestion is by far the fastest:
In [18]: %%timeit
res = bytearray(len(u) * 2)
res[::2] = u
res[1::2] = l
str(res)
....:
100 loops, best of 3: 2.68 ms per loop
from operator import add
u = 'AAAAA'
l = 'aaaaa'
s = "".join(map(add, u, l))
Output:
'AaAaAaAaAa'
What map does is take every element from the first iterable u and the corresponding element from the second iterable l, and apply the function supplied as the first argument, add. Then join just joins them.
Answer 6
Jim’s answer is great, but if you don’t mind a couple of imports, here is my favorite option:
from functools import reduce
from operator import add
reduce(add, map(add, u, l))
A lot of these suggestions assume the strings are of equal length. Maybe that covers all reasonable use cases, but at least to me it seems that you might want to accommodate strings of differing lengths too. Or am I the only one thinking the mesh should work a bit like this:
u = "foobar"
l = "baz"
mesh(u,l) = "fboaozbar"
One way to do this would be the following:
def mesh(a,b):
minlen = min(len(a),len(b))
return "".join(["".join(x+y for x,y in zip(a,b)),a[minlen:],b[minlen:]])
Answer 8
I like using two fors; the variable names can hint at/remind you of what is going on:

"".join(char for pair in zip(u, l) for char in pair)
Feels a bit un-pythonic not to consider the double-list-comprehension answer here, to handle n strings with O(1) effort:
"".join(c for cs in itertools.zip_longest(*all_strings) for c in cs)
where all_strings is a list of the strings you want to interleave. In your case, all_strings = [u, l]. A full use example would look like this:
import itertools
a = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
b = 'abcdefghijklmnopqrstuvwxyz'
all_strings = [a,b]
interleaved = "".join(c for cs in itertools.zip_longest(*all_strings) for c in cs)
print(interleaved)
# 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'
Like many answers, is this the fastest? Probably not, but it is simple and flexible. Also, without too much added complexity, this is slightly faster than the accepted answer (in general, string addition is a bit slow in Python):
In [7]: l1 = 'A' * 1000000; l2 = 'a' * 1000000;
In [8]: %timeit "".join(a + b for i, j in zip(l1, l2))
1 loops, best of 3: 227 ms per loop
In [9]: %timeit "".join(c for cs in zip(*(l1, l2)) for c in cs)
1 loops, best of 3: 198 ms per loop
Answer 11
Potentially faster and shorter than the current leading solution:
from itertools import chain
u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijklmnopqrstuvwxyz'
res = "".join(chain(*zip(u, l)))
Strategy speed-wise is to do as much at the C-level as possible. Same zip_longest() fix for uneven strings and it would be coming out of the same module as chain() so can’t ding me too many points there!
Other solutions I came up with along the way:
res = "".join(u[x] + l[x] for x in range(len(u)))
res = "".join(k + l[i] for i, k in enumerate(u))
u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijklmnopqrstuvwxyz'
from iteration_utilities import roundrobin
''.join(roundrobin(u, l))
# returns 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'
The pooling and convolutional ops slide a “window” across the input tensor. Using tf.nn.conv2d as an example: If the input tensor has 4 dimensions: [batch, height, width, channels], then the convolution operates on a 2D window on the height, width dimensions.
strides determines how much the window shifts by in each of the dimensions. The typical use sets the first (the batch) and last (the depth) stride to 1.
Let’s use a very concrete example: running a 2-d convolution over a 32×32 greyscale input image. I say greyscale because then the input image has depth=1, which helps keep it simple. Picture that image as a grid of pixels indexed by row and column, so the first row is 00, 01, 02, … and the second row starts with 10, 11, 12, …
Let’s run a 2×2 convolution window over a single example (batch size = 1). We’ll give the convolution an output channel depth of 8.
The input to the convolution has shape=[1, 32, 32, 1].
If you specify strides=[1,1,1,1] with padding=SAME, then the output of the filter will be [1, 32, 32, 8].
The filter will first create an output for:
F(00 01
10 11)
And then for:
F(01 02
11 12)
and so on. Then it will move to the second row, calculating:
F(10, 11
20, 21)
then
F(11, 12
21, 22)
If you specify a stride of [1, 2, 2, 1] it won’t do overlapping windows. It will compute:
F(00, 01
10, 11)
and then
F(02, 03
12, 13)
The stride operates similarly for the pooling operators.
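A small shape check of the two cases above (my own sketch, assuming TensorFlow 2.x; the zero-filled tensors are just placeholders):

import tensorflow as tf

x = tf.zeros((1, 32, 32, 1))   # [batch, height, width, channels]
w = tf.zeros((2, 2, 1, 8))     # [filter_height, filter_width, in_channels, out_channels]

print(tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME').shape)  # (1, 32, 32, 8)
print(tf.nn.conv2d(x, w, strides=[1, 2, 2, 1], padding='SAME').shape)  # (1, 16, 16, 8)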
Question 2: Why strides [1, x, y, 1] for convnets
The first 1 is the batch: You don’t usually want to skip over examples in your batch, or you shouldn’t have included them in the first place. :)
The last 1 is the depth of the convolution: You don’t usually want to skip inputs, for the same reason.
The conv2d operator is more general, so you could create convolutions that slide the window along other dimensions, but that’s not a typical use in convnets. The typical use is to use them spatially.
Why reshape to -1? -1 is a placeholder that says “adjust as necessary to match the size needed for the full tensor.” It’s a way of making the code independent of the input batch size, so that you can change your pipeline and not have to adjust the batch size everywhere in the code.
The inputs are 4 dimensional and are of form: [batch_size, image_rows, image_cols, number_of_colors]
Strides, in general, define an overlap between applying operations. In the case of conv2d, it specifies what is the distance between consecutive applications of convolutional filters. The value of 1 in a specific dimension means that we apply the operator at every row/col, the value of 2 means every second, and so on.
Re 1) The values that matter for convolutions are 2nd and 3rd and they represent the overlap in the application of the convolutional filters along rows and columns. The value of [1, 2, 2, 1] says that we want to apply the filters on every second row and column.
Re 2) I don’t know the technical limitations (might be CuDNN requirement) but typically people use strides along the rows or columns dimensions. It doesn’t necessarily make sense to do it over batch size. Not sure of the
last dimension.
Re 3) Setting -1 for one of the dimension means, “set the value for the first dimension so that the total number of elements in the tensor is unchanged”. In our case, the -1 will be equal to the batch_size.
Let’s assume your input = [1, 0, 2, 3, 0, 1, 1] and kernel = [2, 1, 3] the result of the convolution is [8, 11, 7, 9, 4], which is calculated by sliding your kernel over the input, performing element-wise multiplication and summing everything. Like this:
8 = 1 * 2 + 0 * 1 + 2 * 3
11 = 0 * 2 + 2 * 1 + 3 * 3
7 = 2 * 2 + 3 * 1 + 0 * 3
9 = 3 * 2 + 0 * 1 + 1 * 3
4 = 0 * 2 + 1 * 1 + 1 * 3
Here we slide by one element, but nothing stops you by using any other number. This number is your stride. You can think about it as downsampling the result of the 1-strided convolution by just taking every s-th result.
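A small NumPy sketch of that 1-D example (my own code, reproducing the numbers above):

import numpy as np

x = np.array([1, 0, 2, 3, 0, 1, 1])
k = np.array([2, 1, 3])

# stride 1: one window per position, element-wise multiply and sum
out = np.array([np.dot(x[i:i + len(k)], k) for i in range(len(x) - len(k) + 1)])
print(out)        # [ 8 11  7  9  4]

# stride s: keep every s-th result of the stride-1 convolution
s = 2
print(out[::s])   # [8 7 4]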
Knowing the input size i, kernel size k, stride s and padding p, you can easily calculate the output size of the convolution as:

o = ceil((i + 2p - k + 1) / s)

Here the || operator (written ceil above) means the ceiling operation. For a pooling layer s = 1.
N-dim case.
Knowing the math for a 1-dim case, n-dim case is easy once you see that each dim is independent. So you just slide each dimension separately. Here is an example for 2-d. Notice that you do not need to have the same stride at all the dimensions. So for an N-dim input/kernel you should provide N strides.
So now it is easy to answer all your questions:
What do each of the 4+ integers represent? conv2d and pool tell you that this list represents the strides along each dimension. Notice that the length of the strides list is the same as the rank of the kernel tensor.
Why must they have strides[0] = strides[3] = 1 for convnets? The first dimension is the batch size, the last is the channels. There is no point in skipping either the batch or the channels, so you make them 1. For width/height you can skip something, and that’s why they might not be 1.
tf.reshape(_X, shape=[-1, 28, 28, 1]). Why -1? tf.reshape has it covered for you:
If one component of shape is the special value -1, the size of that dimension is computed so that the total size remains constant. In particular, a shape of [-1] flattens into 1-D. At most one component of shape can be -1.
@dga has done a wonderful job explaining and I can’t be thankful enough how helpful it has been. In the like manner, I will like to share my findings on how stride works in 3D convolution.
According to the TensorFlow documentation on conv3d, the shape of the input must be in this order: [batch, in_depth, in_height, in_width, in_channels].
Let’s explain the variables from the extreme right to the left using an example. Assuming the input shape is
input_shape = [1000,16,112,112,3]
input_shape[4] is the number of colour channels (RGB or whichever format it is extracted in)
input_shape[3] is the width of the image
input_shape[2] is the height of the image
input_shape[1] is the number of frames that have been lumped into 1 complete data
input_shape[0] is the number of lumped frames of images we have.
Below is a summary documentation for how stride is used.
strides: A list of ints that has length >= 5. 1-D tensor of length 5.
The stride of the sliding window for each dimension of input. Must
have strides[0] = strides[4] = 1
As indicated in many works, strides simply mean how many steps away a window or kernel jumps away from the closest element, be it a data frame or pixel (this is paraphrased by the way).
From the above documentation, a stride in 3D will look like this strides = (1,X,Y,Z,1).
The documentation emphasizes that strides[0] = strides[4] = 1.
strides[0]=1 means that we do not want to skip any data in the batch
strides[4]=1 means that we do not want to skip in the channel
strides[X] means how many skips we should make in the lumped frames. So for example, if we have 16 frames, X=1 means use every frame, X=2 means use every second frame, and so on.
strides[y] and strides[z] follow the explanation by @dga so I will not redo that part.
In Keras, however, you only need to specify a tuple/list of 3 integers for the strides of the convolution along each spatial dimension, where the spatial dimensions are strides[X], strides[Y] and strides[Z]. strides[0] and strides[4] already default to 1.
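For instance, a small sketch matching the shapes used above (my own example, assuming tf.keras; the zero input is just a placeholder):

import tensorflow as tf

layer = tf.keras.layers.Conv3D(filters=8, kernel_size=3, strides=(1, 2, 2), padding='same')
x = tf.zeros((1, 16, 112, 112, 3))   # [batch, frames, height, width, channels]
print(layer(x).shape)                # (1, 16, 56, 56, 8): frames kept, height/width halved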
In another question, other users offered some help if I could supply the array I was having trouble with. However, I even fail at a basic I/O task, such as writing an array to a file.
Can anyone explain what kind of loop I would need to write a 4x11x14 numpy array to file?
This array consists of four 11 x 14 arrays, so I should format it with a nice newline, to make the reading of the file easier on others.
Edit: So I’ve tried the numpy.savetxt function. Strangely, it gives the following error:
TypeError: float argument required, not numpy.ndarray
I assume that this is because the function doesn’t work with multidimensional arrays? Any solutions as I would like them within one file?
If you want to write it to disk so that it will be easy to read back in as a numpy array, look into numpy.save. Pickling it will work fine, as well, but it’s less efficient for large arrays (which yours isn’t, so either is perfectly fine).
If you want it to be human readable, look into numpy.savetxt.
Edit: So, it seems like savetxt isn’t quite as great an option for arrays with >2 dimensions… But just to draw everything out to it’s full conclusion:
I just realized that numpy.savetxt chokes on ndarrays with more than 2 dimensions… This is probably by design, as there’s no inherently defined way to indicate additional dimensions in a text file.
E.g. This (a 2D array) works fine
import numpy as np
x = np.arange(20).reshape((4,5))
np.savetxt('test.txt', x)
While the same thing would fail (with a rather uninformative error: TypeError: float argument required, not numpy.ndarray) for a 3D array:
import numpy as np
x = np.arange(200).reshape((4,5,10))
np.savetxt('test.txt', x)
One workaround is just to break the 3D (or greater) array into 2D slices. E.g.
x = np.arange(200).reshape((4,5,10))
with open('test.txt', 'w') as outfile:
for slice_2d in x:
np.savetxt(outfile, slice_2d)
However, our goal is to be clearly human readable, while still being easily read back in with numpy.loadtxt. Therefore, we can be a bit more verbose, and differentiate the slices using commented out lines. By default, numpy.loadtxt will ignore any lines that start with # (or whichever character is specified by the comments kwarg). (This looks more verbose than it actually is…)
import numpy as np
# Generate some test data
data = np.arange(200).reshape((4,5,10))
# Write the array to disk
with open('test.txt', 'w') as outfile:
# I'm writing a header here just for the sake of readability
# Any line starting with "#" will be ignored by numpy.loadtxt
outfile.write('# Array shape: {0}\n'.format(data.shape))
# Iterating through a ndimensional array produces slices along
# the last axis. This is equivalent to data[i,:,:] in this case
for data_slice in data:
# The formatting string indicates that I'm writing out
# the values in left-justified columns 7 characters in width
# with 2 decimal places.
np.savetxt(outfile, data_slice, fmt='%-7.2f')
# Writing out a break to indicate different slices...
outfile.write('# New slice\n')
Reading it back in is very easy, as long as we know the shape of the original array. We can just do numpy.loadtxt('test.txt').reshape((4,5,10)). As an example (You can do this in one line, I’m just being verbose to clarify things):
# Read the array from disk
new_data = np.loadtxt('test.txt')
# Note that this returned a 2D array!
print new_data.shape
# However, going back to 3D is easy if we know the
# original shape of the array
new_data = new_data.reshape((4,5,10))
# Just to check that they're the same...
assert np.all(new_data == data)
I am not certain if this meets your requirements, given I think you are interested in making the file readable by people, but if that’s not a primary concern, just pickle it.
If you don’t need a human-readable output, another option you could try is to save the array as a MATLAB .mat file, which is a structured array. I despise MATLAB, but the fact that I can both read and write a .mat in very few lines is convenient.
Unlike Joe Kington’s answer, the benefit of this is that you don’t need to know the original shape of the data in the .mat file, i.e. no need to reshape upon reading in. And, unlike using pickle, a .mat file can be read by MATLAB, and probably some other programs/languages as well.
Here is an example:
import numpy as np
import scipy.io
# Some test data
x = np.arange(200).reshape((4,5,10))
# Specify the filename of the .mat file
matfile = 'test_mat.mat'
# Write the array to the mat file. For this to work, the array must be the value
# corresponding to a key name of your choice in a dictionary
scipy.io.savemat(matfile, mdict={'out': x}, oned_as='row')
# For the above line, I specified the kwarg oned_as since python (2.7 with
# numpy 1.6.1) throws a FutureWarning. Here, this isn't really necessary
# since oned_as is a kwarg for dealing with 1-D arrays.
# Now load in the data from the .mat that was just saved
matdata = scipy.io.loadmat(matfile)
# And just to check if the data is the same:
assert np.all(x == matdata['out'])
If you forget the key that the array is named in the .mat file, you can always do:
print matdata.keys()
And of course you can store many arrays using many more keys.
So yes – it won’t be readable with your eyes, but only takes 2 lines to write and read the data, which I think is a fair trade-off.
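As a small illustration of storing several arrays under different keys (the names 'x' and 'y' and the filename here are made up for the example):
import numpy as np
import scipy.io

x = np.arange(200).reshape((4, 5, 10))
y = np.linspace(0, 1, 50)

# Each key in mdict becomes a separate variable in the .mat file
scipy.io.savemat('test_multi.mat', mdict={'x': x, 'y': y}, oned_as='row')

matdata = scipy.io.loadmat('test_multi.mat')
assert np.all(matdata['x'] == x)
# loadmat returns 1-D arrays as 2-D rows, hence the ravel()
assert np.all(matdata['y'].ravel() == y)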
You can simply traverse the array in three nested loops and write their values to your file. For reading, you simply use the same exact loop construction. You will get the values in exactly the right order to fill your arrays correctly again.
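That answer describes the approach without code, so here is a minimal sketch of what it might look like for a 3-D array (the (4, 5, 10) shape and the file name are assumptions carried over from the earlier examples):
import numpy as np

data = np.arange(200).reshape((4, 5, 10))

# Write: three nested loops, one value per line
with open('looped.txt', 'w') as f:
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            for k in range(data.shape[2]):
                f.write('{}\n'.format(data[i, j, k]))

# Read: the same loop structure fills the array in the same order
restored = np.empty((4, 5, 10))
with open('looped.txt') as f:
    for i in range(restored.shape[0]):
        for j in range(restored.shape[1]):
            for k in range(restored.shape[2]):
                restored[i, j, k] = float(f.readline())

assert np.all(restored == data)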
I have a way to do it using a simple file.write() operation. It works fine for me, but I’m dealing with arrays having ~1500 data elements.
I basically just have for loops to iterate through the file and write it to the output destination line-by-line in a csv style output.
import numpy as np

trial = np.genfromtxt("/extension/file.txt", dtype=str, delimiter=",")
num_of_columns = trial.shape[1]  # number of columns in the 2-D array

with open("/extension/file.txt", "w") as f:
    for x in xrange(len(trial[:, 1])):
        for y in range(num_of_columns):
            if y < num_of_columns - 1:
                # not the last column yet: write the value plus a comma
                f.write(trial[x][y] + ",")
            else:
                # last column: write the value only
                f.write(trial[x][y])
        f.write("\n")
The if and else statements are used to add commas between the data elements; for whatever reason, these get stripped out when reading the file in as an ndarray. My goal was to output the file as a csv, so this method helps to handle that.
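As an aside, numpy.savetxt (mentioned earlier for 2-D data) can probably produce the same comma-separated output directly; a rough sketch, where the input path is the placeholder from the answer above and the output filename is made up for illustration:
import numpy as np

trial = np.genfromtxt("/extension/file.txt", dtype=str, delimiter=",")

# fmt='%s' writes each string value as-is; delimiter inserts the commas
np.savetxt("/extension/file_csv.txt", trial, fmt="%s", delimiter=",")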
Pickle is best for these cases. Suppose you have an ndarray named x_train. You can dump it into a file and load it back using the following commands:
import pickle

### Dump to file
with open("myfile.pkl", "wb") as f:
    pickle.dump(x_train, f)

### Load from file
with open("myfile.pkl", "rb") as f:
    x_temp = pickle.load(f)
If I’m writing unit tests in python (using the unittest module), is it possible to output data from a failed test, so I can examine it to help deduce what caused the error? I am aware of the ability to create a customized message, which can carry some information, but sometimes you might deal with more complex data, that can’t easily be represented as a string.
For example, suppose you had a class Foo, and were testing a method bar, using data from a list called testdata:
class TestBar(unittest.TestCase):
    def runTest(self):
        for t1, t2 in testdata:
            f = Foo(t1)
            self.assertEqual(f.bar(t2), 2)
If the test failed, I might want to output t1, t2 and/or f, to see why this particular data resulted in a failure. By output, I mean that the variables can be accessed like any other variables, after the test has been run.
You can use simple print statements, or any other way of writing to stdout. You can also invoke the Python debugger anywhere in your tests.
If you use nose to run your tests (which I recommend), it will collect the stdout for each test and only show it to you if the test failed, so you don’t have to live with the cluttered output when the tests pass.
nose also has switches to automatically show variables mentioned in asserts, or to invoke the debugger on failed tests. For example -s (--nocapture) prevents the capture of stdout.
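As a rough sketch of the plain-print approach applied to the example from the question (Foo and testdata below are trivial stand-ins, since the question leaves them undefined):
import unittest

# Trivial stand-ins for the Foo class and testdata list from the question
class Foo(object):
    def __init__(self, t1):
        self.t1 = t1
    def bar(self, t2):
        return self.t1 + t2

testdata = [(1, 1), (0, 2), (1, 2)]   # the last pair will fail

class TestBar(unittest.TestCase):
    def runTest(self):
        for t1, t2 in testdata:
            f = Foo(t1)
            # Printed output is captured by nose and shown only on failure
            # (or always, if you run nose with -s / --nocapture)
            print("checking t1=%r t2=%r" % (t1, t2))
            self.assertEqual(f.bar(t2), 2)

if __name__ == "__main__":
    unittest.main()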
I don’t think this is quite what you’re looking for; there’s no way to display the values of variables that don’t fail, but this may help you get closer to outputting the results the way you want.
You can use the TestResult object returned by TestRunner.run() for results analysis and processing, in particular TestResult.errors and TestResult.failures.
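As a sketch of what inspecting that object might look like, assuming the TestBar case from the question has already been defined:
import unittest

# Build a suite from the example TestCase and run it programmatically
suite = unittest.TestLoader().loadTestsFromTestCase(TestBar)
result = unittest.TextTestRunner(verbosity=0).run(suite)

# Each entry is a (test, traceback-string) pair
for test, traceback_text in result.failures + result.errors:
    print("FAILED:", test)
    print(traceback_text)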
Another option – start a debugger where the test fails.
Try running your tests with Testoob (it will run your unittest suite without changes), and you can use the '--debug' command line switch to open a debugger when a test fails.
Here’s a terminal session on windows:
C:\work> testoob tests.py --debug
F
Debugging for failure in test: test_foo (tests.MyTests.test_foo)
> c:\python25\lib\unittest.py(334)failUnlessEqual()
-> (msg or '%r != %r' % (first, second))
(Pdb) up
> c:\work\tests.py(6)test_foo()
-> self.assertEqual(x, y)
(Pdb) l
1 from unittest import TestCase
2 class MyTests(TestCase):
3 def test_foo(self):
4 x = 1
5 y = 2
6 -> self.assertEqual(x, y)
[EOF]
(Pdb)
Answer 5
The method I use is really simple. I just log it as a warning so it will actually show up.
import logging

class TestBar(unittest.TestCase):
    def runTest(self):
        # this line is important
        logging.basicConfig()
        log = logging.getLogger("LOG")
        for t1, t2 in testdata:
            f = Foo(t1)
            self.assertEqual(f.bar(t2), 2)
            log.warning(t1)
Answer 6
I think I might have been overthinking this. One way I’ve come up with that does the job, is simply to have a global variable, that accumulates the diagnostic data.
Something like this:
log1 = dict()

class TestBar(unittest.TestCase):
    def runTest(self):
        for t1, t2 in testdata:
            f = Foo(t1)
            if f.bar(t2) != 2:
                log1["TestBar.runTest"] = (f, t1, t2)
                self.fail("f.bar(t2) != 2")
Thanks for the replies. They have given me some alternative ideas for how to record information from unit tests in python.
Answer 7
Use logging:
import unittest
import logging
import inspect
import os

logging_level = logging.INFO

try:
    log_file = os.environ["LOG_FILE"]
except KeyError:
    log_file = None

def logger(stack=None):
    if not hasattr(logger, "initialized"):
        logging.basicConfig(filename=log_file, level=logging_level)
        logger.initialized = True
    if not stack:
        stack = inspect.stack()
    name = stack[1][3]
    try:
        name = stack[1][0].f_locals["self"].__class__.__name__ + "." + name
    except KeyError:
        pass
    return logging.getLogger(name)

def todo(msg):
    logger(inspect.stack()).warning("TODO: {}".format(msg))

def get_pi():
    logger().info("sorry, I know only three digits")
    return 3.14

class Test(unittest.TestCase):
    def testName(self):
        todo("use a better get_pi")
        pi = get_pi()
        logger().info("pi = {}".format(pi))
        todo("check more digits in pi")
        self.assertAlmostEqual(pi, 3.14)
        logger().debug("end of this test")
        pass
Usage:
# LOG_FILE=/tmp/log python3 -m unittest LoggerDemo
.
----------------------------------------------------------------------
Ran 1 test in 0.047s
OK
# cat /tmp/log
WARNING:Test.testName:TODO: use a better get_pi
INFO:get_pi:sorry, I know only three digits
INFO:Test.testName:pi = 3.14
WARNING:Test.testName:TODO: check more digits in pi
If you do not set LOG_FILE, logging will go to stderr.
Answer 8
You can use the logging module.
So, in your unit test code, use:
import logging as log

def test_foo(self):
    log.debug("Some debug message.")
    log.info("Some info message.")
    log.warning("Some warning message.")
    log.error("Some error message.")
What I do in these cases is to have a log.debug() with some messages in my application. Since the default logging level is WARNING, such messages don’t show in the normal execution.
Then, in the unittest I change the logging level to DEBUG, so that such messages are shown while running them.
import logging

log = logging.getLogger(__name__)
log.debug("Some messages to be shown just when debugging or unittesting")
In the unittests:
# Set log level
loglevel = logging.DEBUG
logging.basicConfig(level=loglevel)
See a full example:
This is daikiri.py, a basic class that implements a Daikiri with its name and price. There is a method make_discount() that returns the price of that specific daikiri after applying a given discount:
import logging

log = logging.getLogger(__name__)

class Daikiri(object):

    def __init__(self, name, price):
        self.name = name
        self.price = price

    def make_discount(self, percentage):
        log.debug("Deducting discount...")  # I want to see this message
        return self.price * percentage
Then, I create a unittest test_daikiri.py that checks its usage:
import unittest
import logging
from .daikiri import Daikiri

class TestDaikiri(unittest.TestCase):

    def setUp(self):
        # Changing log level to DEBUG
        loglevel = logging.DEBUG
        logging.basicConfig(level=loglevel)
        self.mydaikiri = Daikiri("cuban", 25)

    def test_drop_price(self):
        new_price = self.mydaikiri.make_discount(0)
        self.assertEqual(new_price, 0)

if __name__ == "__main__":
    unittest.main()
So when I execute it I get the log.debug messages:
$ python -m test_daikiri
DEBUG:daikiri:Deducting discount...
.
----------------------------------------------------------------------
Ran 1 test in 0.000s
OK
inspect.trace will let you get local variables after an exception has been thrown. You can then wrap the unit tests with a decorator like the following one to save off those local variables for examination during the post mortem.
import random
import unittest
import inspect

def store_result(f):
    """
    Store the results of a test

    On success, store the return value.
    On failure, store the local variables where the exception was thrown.
    """
    def wrapped(self):
        if 'results' not in self.__dict__:
            self.results = {}
        # If a test throws an exception, store local variables in results:
        try:
            result = f(self)
        except Exception as e:
            self.results[f.__name__] = {'success': False, 'locals': inspect.trace()[-1][0].f_locals}
            raise e
        self.results[f.__name__] = {'success': True, 'result': result}
        return result
    return wrapped

def suite_results(suite):
    """
    Get all the results from a test suite
    """
    ans = {}
    for test in suite:
        if 'results' in test.__dict__:
            ans.update(test.results)
    return ans

# Example:
class TestSequenceFunctions(unittest.TestCase):

    def setUp(self):
        self.seq = range(10)

    @store_result
    def test_shuffle(self):
        # make sure the shuffled sequence does not lose any elements
        random.shuffle(self.seq)
        self.seq.sort()
        self.assertEqual(self.seq, range(10))
        # should raise an exception for an immutable sequence
        self.assertRaises(TypeError, random.shuffle, (1,2,3))
        return {1: 2}

    @store_result
    def test_choice(self):
        element = random.choice(self.seq)
        self.assertTrue(element in self.seq)
        return {7: 2}

    @store_result
    def test_sample(self):
        x = 799
        with self.assertRaises(ValueError):
            random.sample(self.seq, 20)
        for element in random.sample(self.seq, 5):
            self.assertTrue(element in self.seq)
        return {1: 99999}

suite = unittest.TestLoader().loadTestsFromTestCase(TestSequenceFunctions)
unittest.TextTestRunner(verbosity=2).run(suite)

from pprint import pprint
pprint(suite_results(suite))
The last line will print the returned values where the test succeeded, and the local variables (in this case x) when it fails.
How about catching the exception that gets generated from the assertion failure? In your catch block you could output the data however you wanted, to wherever you wanted. Then, when you were done, you could re-throw the exception. The test runner probably wouldn’t know the difference.
Disclaimer: I haven’t tried this with Python’s unittest framework, but I have with other unit test frameworks.
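Hedging a bit (for the same reason as the disclaimer above), a sketch of that idea might look like this; Foo and testdata are the hypothetical names from the question:
import unittest

class TestBar(unittest.TestCase):
    def runTest(self):
        for t1, t2 in testdata:
            f = Foo(t1)
            try:
                self.assertEqual(f.bar(t2), 2)
            except AssertionError:
                # Dump the offending data however/wherever you like...
                print("assertion failed for t1=%r, t2=%r, f=%r" % (t1, t2, f))
                # ...then re-raise so the test still counts as a failure
                raise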
I have this code which finds the largest index of a specific character in a string, however I would like it to raise a ValueError when the specified character does not occur in a string.
So something like this:
contains('bababa', 'k')
would result in a:
→ ValueError: could not find k in bababa
How can I do this?
Here’s the current code for my function:
def contains(string, char):
    list = []
    for i in range(0, len(string)):
        if string[i] == char:
            list = list + [i]
    return list[-1]
Answer 0
raise ValueError('could not find %c in %s' % (ch,str))
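For instance, plugged into the function from the question (keeping its variable names, and guarding against the IndexError that list[-1] would otherwise raise when nothing is found), it might look like this:
def contains(string, char):
    list = []
    for i in range(0, len(string)):
        if string[i] == char:
            list = list + [i]
    if not list:
        # no occurrence found, so raise instead of failing on list[-1]
        raise ValueError('could not find %c in %s' % (char, string))
    return list[-1]

contains('bababa', 'k')   # -> ValueError: could not find k in bababa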
Here’s a revised version of your code which still works, plus it illustrates how to raise a ValueError the way you want. By the way, I think find_last(), find_last_index(), or something similar would be a more descriptive name for this function. Adding to the possible confusion is the fact that Python already has a container object method named __contains__() that does something a little different, membership-testing-wise.
def contains(char_string, char):
    largest_index = -1
    for i, ch in enumerate(char_string):
        if ch == char:
            largest_index = i
    if largest_index > -1:       # any found?
        return largest_index     # return index of last one
    else:
        raise ValueError('could not find {!r} in {!r}'.format(char, char_string))

print(contains('mississippi', 's'))  # -> 6
print(contains('bababa', 'k'))  # ->
Traceback (most recent call last):
File "how-to-raise-a-valueerror.py", line 15, in <module>
print(contains('bababa', 'k'))
File "how-to-raise-a-valueerror.py", line 12, in contains
raise ValueError('could not find {} in {}'.format(char, char_string))
ValueError: could not find 'k' in 'bababa'
Update — A substantially simpler way
Wow! Here’s a much more concise version, essentially a one-liner, that is also likely faster: it reverses the string (via [::-1]) and then does a forward search through it for the first matching character using the fast built-in string index() method. With respect to your actual question, a nice little bonus of using index() is that it already raises a ValueError when the character isn’t found, so nothing additional is required to make that happen.
Here it is along with a quick unit test:
def contains(char_string, char):
    # Ending - 1 adjusts returned index to account for searching in reverse.
    return len(char_string) - char_string[::-1].index(char) - 1

print(contains('mississippi', 's'))  # -> 6
print(contains('bababa', 'k'))  # ->
Traceback (most recent call last):
File "better-way-to-raise-a-valueerror.py", line 9, in <module>
print(contains('bababa', 'k'))
File "better-way-to-raise-a-valueerror", line 6, in contains
return len(char_string) - char_string[::-1].index(char) - 1
ValueError: substring not found
Answer 2
>>> def contains(string, char):
...     for i in xrange(len(string) - 1, -1, -1):
...         if string[i] == char:
...             return i
...     raise ValueError("could not find %r in %r" % (char, string))
...
>>> contains('bababa', 'k')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in contains
ValueError: could not find 'k' in 'bababa'
>>> contains('bababa', 'a')
5
>>> contains('bababa', 'b')
4
>>> contains('xbababa', 'x')
0
>>>