Tag Archives: Python

Plotting a normal distribution in Python

Question: Plotting a normal distribution in Python

Given a mean and a variance, is there a simple function call which will plot a normal distribution?


Answer 0

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import math

mu = 0
variance = 1
sigma = math.sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.plot(x, stats.norm.pdf(x, mu, sigma))
plt.show()



Answer 1

I don’t think there is a function that does all that in a single call. However, you can find the Gaussian probability density function in scipy.stats.

So the simplest way I could come up with is:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Plot between -10 and 10 with .001 steps.
x_axis = np.arange(-10, 10, 0.001)
# Mean = 0, SD = 2.
plt.plot(x_axis, norm.pdf(x_axis,0,2))
plt.show()



Answer 2

Use seaborn instead. Here I am using seaborn’s distplot on 1000 values with mean=5 and std=3:

import numpy as np
import seaborn as sns
value = np.random.normal(loc=5, scale=3, size=1000)
sns.distplot(value)

You will get a normal distribution curve
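Note that in recent seaborn releases (0.11 and later) distplot is deprecated; histplot (or displot) is its replacement. A small sketch of the equivalent call:

import numpy as np
import seaborn as sns

value = np.random.normal(loc=5, scale=3, size=1000)
sns.histplot(value, kde=True)  # histogram plus KDE curve, similar to what distplot used to draw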


Answer 3

Unutbu’s answer is correct. But because our mean can be more or less than zero, I would still like to change this:

x = np.linspace(-3 * sigma, 3 * sigma, 100)

to this :

x = np.linspace(-3 * sigma + mean, 3 * sigma + mean, 100)
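Putting it together, a minimal sketch based on the snippet from Answer 0, using an arbitrary nonzero mean:

import math
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

mean = 2.0
variance = 1.5
sigma = math.sqrt(variance)
# center the x range on the mean so the whole curve stays visible
x = np.linspace(-3 * sigma + mean, 3 * sigma + mean, 100)
plt.plot(x, stats.norm.pdf(x, mean, sigma))
plt.show()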

Answer 4

If you prefer to use a step-by-step approach, you could consider a solution like the following:

import numpy as np
import matplotlib.pyplot as plt

mean = 0; std = 1; variance = np.square(std)
x = np.arange(-5,5,.01)
f = np.exp(-np.square(x-mean)/(2*variance))/(np.sqrt(2*np.pi*variance))

plt.plot(x,f)
plt.ylabel('gaussian distribution')
plt.show()

Answer 5

I have just come back to this and I had to install scipy, as matplotlib.mlab gave me the error message MatplotlibDeprecationWarning: scipy.stats.norm.pdf when trying the example above. So the sample is now:

%matplotlib inline
import math
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats


mu = 0
variance = 1
sigma = math.sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.plot(x, scipy.stats.norm.pdf(x, mu, sigma))

plt.show()

Answer 6

I believe it is important to be able to set the height, so I created this function:

def my_gauss(x, sigma=1, h=1, mid=0):
    from math import exp, pow
    variance = pow(sigma, 2)
    return h * exp(-pow(x-mid, 2)/(2*variance))

Where sigma is the standard deviation, h is the height and mid is the mean.

Here is the result using different heights and deviations:
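A minimal usage sketch (my own example values, not from the original answer) that plots the function for a couple of heights and deviations:

import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(-5, 5, 200)
for sigma, h in [(1, 1), (2, 0.5)]:
    ys = [my_gauss(x, sigma=sigma, h=h) for x in xs]
    plt.plot(xs, ys, label="sigma={}, h={}".format(sigma, h))
plt.legend()
plt.show()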


Answer 7

You can get the CDF easily, so you can get the PDF via the CDF:

    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.interpolate
    import scipy.stats

    def setGridLine(ax):
        #http://jonathansoma.com/lede/data-studio/matplotlib/adding-grid-lines-to-a-matplotlib-chart/
        ax.set_axisbelow(True)
        ax.minorticks_on()
        ax.grid(which='major', linestyle='-', linewidth=0.5, color='grey')
        ax.grid(which='minor', linestyle=':', linewidth=0.5, color='#a6a6a6')
        ax.tick_params(which='both', # Options for both major and minor ticks
                        top=False, # turn off top ticks
                        left=False, # turn off left ticks
                        right=False,  # turn off right ticks
                        bottom=False) # turn off bottom ticks

    data1 = np.random.normal(0,1,1000000)
    x=np.sort(data1)
    y=np.arange(x.shape[0])/(x.shape[0]+1)

    f2 = scipy.interpolate.interp1d(x, y,kind='linear')
    x2 = np.linspace(x[0],x[-1],1001)
    y2 = f2(x2)

    y2b = np.diff(y2)/np.diff(x2)
    x2b=(x2[1:]+x2[:-1])/2.

    f3 = scipy.interpolate.interp1d(x, y,kind='cubic')
    x3 = np.linspace(x[0],x[-1],1001)
    y3 = f3(x3)

    y3b = np.diff(y3)/np.diff(x3)
    x3b=(x3[1:]+x3[:-1])/2.

    bins=np.arange(-4,4,0.1)
    bins_centers=0.5*(bins[1:]+bins[:-1])
    cdf = scipy.stats.norm.cdf(bins_centers)
    pdf = scipy.stats.norm.pdf(bins_centers)

    plt.rcParams["font.size"] = 18
    fig, ax = plt.subplots(3,1,figsize=(10,16))
    ax[0].set_title("cdf")
    ax[0].plot(x,y,label="data")
    ax[0].plot(x2,y2,label="linear")
    ax[0].plot(x3,y3,label="cubic")
    ax[0].plot(bins_centers,cdf,label="ans")

    ax[1].set_title("pdf:linear")
    ax[1].plot(x2b,y2b,label="linear")
    ax[1].plot(bins_centers,pdf,label="ans")

    ax[2].set_title("pdf:cubic")
    ax[2].plot(x3b,y3b,label="cubic")
    ax[2].plot(bins_centers,pdf,label="ans")

    for idx in range(3):
        ax[idx].legend()
        setGridLine(ax[idx])

    plt.show()
    plt.clf()
    plt.close()

Print a very long string completely in a pandas DataFrame

Question: Print a very long string completely in a pandas DataFrame

I am struggling with a seemingly very simple thing. I have a pandas data frame containing a very long string.

df = pd.DataFrame({'one' : ['one', 'two', 
      'This is very long string very long string very long string veryvery long string']})

Now when I try to print the same, I do not see the full string; rather, I see only part of the string.

I tried the following options:

  • using print(df.iloc[2])
  • using to_html
  • using to_string
  • One of the stackoverflow answers suggested increasing the column width by using the pandas display option, but that did not work either.
  • I also did not get how set_printoptions will help me.

Any ideas appreciated. Looks very simple, but not able to get it!


Answer 0

You can use options.display.max_colwidth to specify you want to see more in the default representation:

In [2]: df
Out[2]:
                                                 one
0                                                one
1                                                two
2  This is very long string very long string very...

In [3]: pd.options.display.max_colwidth
Out[3]: 50

In [4]: pd.options.display.max_colwidth = 100

In [5]: df
Out[5]:
                                                                               one
0                                                                              one
1                                                                              two
2  This is very long string very long string very long string veryvery long string

And indeed, if you just want to inspect the one value, by accessing it (as a scalar, not as a row as df.iloc[2] does) you also see the full string:

In [7]: df.iloc[2,0]    # or df.loc[2,'one']
Out[7]: 'This is very long string very long string very long string veryvery long string'

Answer 1

Use pd.set_option('display.max_colwidth', -1) for automatic linebreaks and multi-line cells.

This is a great resource on how to use Jupyter’s display with pandas to the fullest.
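Note that in pandas 1.0 and later, -1 is deprecated for this option; None is the documented way to remove the limit:

pd.set_option('display.max_colwidth', None)  # no truncation of cell contents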


Answer 2

Another, pretty simple approach is to call list function:

list(df['one'][2:3])
# output:
['This is very long string very long string very long string veryvery long string']

Needless to say, it is not good to convert whole columns to lists this way, but for a single line, why not?


Answer 3

Another easier way to print the whole string is to call values on the dataframe.

df = pd.DataFrame({'one' : ['one', 'two', 
      'This is very long string very long string very long string veryvery long string']})

print(df.values)

The Output will be

[['one']
 ['two']
 ['This is very long string very long string very long string veryvery long string']]

Answer 4

Is this what you meant to do?

In [7]: x =  pd.DataFrame({'one' : ['one', 'two', 'This is very long string very long string very long string veryvery long string']})

In [8]: x
Out[8]: 
                                                 one
0                                                one
1                                                two
2  This is very long string very long string very...

In [9]: x['one'][2]
Out[9]: 'This is very long string very long string very long string veryvery long string'

Answer 5

The way I often deal with the situation you describe is to use the .to_csv() method and write to stdout:

import sys

df.to_csv(sys.stdout)

Update: it should now be possible to just use None instead of sys.stdout with similar effect!
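For example, a small sketch of the None variant (to_csv without a path returns the CSV text as a string):

print(df.to_csv(None))  # same content as df.to_csv(sys.stdout), returned as a string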

This should dump the whole dataframe, including the entirety of any strings. You can use the to_csv parameters to configure column separators, whether the index is printed, etc. It will be less pretty than rendering it properly though.

I posted this originally in answer to the somewhat-related question at Output data from all columns in a dataframe in pandas


Answer 6

Just add the following line to your code before print.

 pd.options.display.max_colwidth = 90  # set a value as your need

You can simply do the following steps for setting other additional options,

  • You can change the options for pandas max_columns feature as follows to display more columns

    import pandas as pd
    pd.options.display.max_columns = 10
    

    (this allows 10 columns to display, you can change this as you need)

  • Similarly, you can change the number of rows to display as follows

    pd.options.display.max_rows = 999
    

    (this allows to print 999 rows at a time)

This should work fine.

Please refer to the docs to change more pandas options/settings.


Answer 7

I have created a small utility function; this works well for me:

def display_text_max_col_width(df, width):
    with pd.option_context('display.max_colwidth', width):
        print(df)

display_text_max_col_width(train_df["Description"], 800)

I can change the width as per my requirement, without setting any option permanently.


Answer 8

If you’re using a jupyter notebook, you can also print the pandas dataframe as an HTML table, which will print full strings.

from IPython.display import display, HTML
display(HTML(df.to_html()))

Output

    one
0   one
1   two
2   This is very long string very long string very long string veryvery long string

Ordering of batch normalization and dropout?

Question: Ordering of batch normalization and dropout?

The original question was in regard to TensorFlow implementations specifically. However, the answers are for implementations in general. This general answer is also the correct answer for TensorFlow.

When using batch normalization and dropout in TensorFlow (specifically using the contrib.layers) do I need to be worried about the ordering?

It seems possible that if I use dropout followed immediately by batch normalization there might be trouble. For example, if the shift in the batch normalization trains to the larger scale numbers of the training outputs, but then that same shift is applied to the smaller (due to the compensation for having more outputs) scale numbers without dropout during testing, then that shift may be off. Does the TensorFlow batch normalization layer automatically compensate for this? Or does this not happen for some reason I’m missing?

Also, are there other pitfalls to look out for when using these two together? For example, assuming I’m using them in the correct order in regards to the above (assuming there is a correct order), could there be trouble with using both batch normalization and dropout on multiple successive layers? I don’t immediately see a problem with that, but I might be missing something.

Thank you much!

UPDATE:

An experimental test seems to suggest that ordering does matter. I ran the same network twice with only the batch norm and dropout reverse. When the dropout is before the batch norm, validation loss seems to be going up as training loss is going down. They’re both going down in the other case. But in my case the movements are slow, so things may change after more training and it’s just a single test. A more definitive and informed answer would still be appreciated.


Answer 0

In Ioffe and Szegedy (2015), the authors state that “we would like to ensure that for any parameter values, the network always produces activations with the desired distribution”. So the Batch Normalization layer is actually inserted right after a Conv layer/Fully Connected layer, but before feeding into the ReLu (or any other kind of) activation. See this video at around the 53 minute mark for more details.

As far as dropout goes, I believe dropout is applied after the activation layer. In the dropout paper figure 3b, the dropout factor/probability matrix r(l) for hidden layer l is applied to y(l), where y(l) is the result after applying the activation function f.

So in summary, the order of using batch normalization and dropout is:

-> CONV/FC -> BatchNorm -> ReLu(or other activation) -> Dropout -> CONV/FC ->
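As an illustration only (not code from the answer), a minimal tf.keras sketch of one block in that order; the filter counts and dropout rate are arbitrary:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), padding='same'),  # CONV
    layers.BatchNormalization(),                # BatchNorm
    layers.Activation('relu'),                  # ReLu (or other activation)
    layers.Dropout(0.25),                       # Dropout
    layers.Conv2D(64, (3, 3), padding='same'),  # next CONV/FC
])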


Answer 1

As noted in the comments, an amazing resource to read up on the order of layers is here. I have gone through the comments and it is the best resource on this topic I have found on the internet.

My 2 cents:

Dropout is meant to block information from certain neurons completely to make sure the neurons do not co-adapt. So, the batch normalization has to be after dropout otherwise you are passing information through normalization statistics.

If you think about it, in typical ML problems, this is the reason we don’t compute mean and standard deviation over entire data and then split it into train, test and validation sets. We split and then compute the statistics over the train set and use them to normalize and center the validation and test datasets

So I suggest Scheme 1 (this takes pseudomarvin’s comment on the accepted answer into consideration):

-> CONV/FC -> ReLu(or other activation) -> Dropout -> BatchNorm -> CONV/FC

as opposed to Scheme 2

-> CONV/FC -> BatchNorm -> ReLu(or other activation) -> Dropout -> CONV/FC (as in the accepted answer)

Please note that this means the network under Scheme 2 should show more over-fitting than the network under Scheme 1, but the OP ran some tests as mentioned in the question and they support Scheme 2.


Answer 2

Usually, just drop the Dropout (when you have BN):

  • “BN eliminates the need for Dropout in some cases cause BN provides similar regularization benefits as Dropout intuitively”
  • “Architectures like ResNet, DenseNet, etc. not using Dropout”

For more details, refer to this paper [Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift] as already mentioned by @Haramoz in the comments.


Answer 3

I found a paper that explains the disharmony between Dropout and Batch Norm(BN). The key idea is what they call the “variance shift”. This is due to the fact that dropout has a different behavior between training and testing phases, which shifts the input statistics that BN learns. The main idea can be found in this figure which is taken from this paper.

A small demo for this effect can be found in this notebook.


Answer 4

Based on the research paper, for better performance we should use BN before applying Dropout.


Answer 5

The correct order is: Conv > Normalization > Activation > Dropout > Pooling


Answer 6

Conv – Activation – DropOut – BatchNorm – Pool –> Test_loss: 0.04261355847120285

Conv – Activation – DropOut – Pool – BatchNorm –> Test_loss: 0.050065308809280396

Conv – Activation – BatchNorm – Pool – DropOut –> Test_loss: 0.04911309853196144

Conv – Activation – BatchNorm – DropOut – Pool –> Test_loss: 0.06809622049331665

Conv – BatchNorm – Activation – DropOut – Pool –> Test_loss: 0.038886815309524536

Conv – BatchNorm – Activation – Pool – DropOut –> Test_loss: 0.04126095026731491

Conv – BatchNorm – DropOut – Activation – Pool –> Test_loss: 0.05142546817660332

Conv – DropOut – Activation – BatchNorm – Pool –> Test_loss: 0.04827788099646568

Conv – DropOut – Activation – Pool – BatchNorm –> Test_loss: 0.04722036048769951

Conv – DropOut – BatchNorm – Activation – Pool –> Test_loss: 0.03238215297460556


Trained on the MNIST dataset (20 epochs) with 2 convolutional modules (see below), followed each time with

model.add(Flatten())
model.add(layers.Dense(512, activation="elu"))
model.add(layers.Dense(10, activation="softmax"))

The convolutional layers have a kernel size of (3,3) and default padding; the activation is elu. The pooling is a MaxPooling with a pool size of (2,2). The loss is categorical_crossentropy and the optimizer is adam.

The corresponding Dropout probabilities are 0.2 and 0.3, respectively. The numbers of feature maps are 32 and 64, respectively.

Edit: When I dropped the Dropout, as recommended in some answers, it converged faster but had a worse generalization ability than when I used BatchNorm and Dropout.
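For reference, a sketch of how one of the tested variants (Conv - BatchNorm - Activation - DropOut - Pool) could be assembled from the hyper-parameters described above; this is a reconstruction under those assumptions, not the author's original code:

from tensorflow.keras import layers, models

model = models.Sequential()
for filters, rate in [(32, 0.2), (64, 0.3)]:   # the two convolutional modules
    model.add(layers.Conv2D(filters, (3, 3)))  # kernel size (3,3), default padding
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('elu'))
    model.add(layers.Dropout(rate))
    model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(512, activation="elu"))
model.add(layers.Dense(10, activation="softmax"))
# input shape (e.g. (28, 28, 1) for MNIST) is inferred when the model is first fit
model.compile(optimizer="adam", loss="categorical_crossentropy")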


Answer 7

ConV/FC – BN – Sigmoid/tanh – dropout. If the activation function is ReLU or otherwise, the order of normalization and dropout depends on your task.


Answer 8

I read the recommended papers in the answer and comments from https://stackoverflow.com/a/40295999/8625228

From Ioffe and Szegedy (2015)’s point of view, only use BN in the network structure. Li et al. (2018) give the statistical and experimental analyses, that there is a variance shift when the practitioners use Dropout before BN. Thus, Li et al. (2018) recommend applying Dropout after all BN layers.

From Ioffe and Szegedy (2015)’s point of view, BN is located inside/before the activation function. However, Chen et al. (2019) use an IC layer which combines dropout and BN, and Chen et al. (2019) recommend using BN after ReLU.

To be on the safe side, I use only Dropout or only BN in the network.

Chen, Guangyong, Pengfei Chen, Yujun Shi, Chang-Yu Hsieh, Benben Liao, and Shengyu Zhang. 2019. “Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks.” CoRR abs/1905.05928. http://arxiv.org/abs/1905.05928.

Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” CoRR abs/1502.03167. http://arxiv.org/abs/1502.03167.

Li, Xiang, Shuo Chen, Xiaolin Hu, and Jian Yang. 2018. “Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift.” CoRR abs/1801.05134. http://arxiv.org/abs/1801.05134.


How to unnest (explode) a column in a pandas DataFrame?

Question: How to unnest (explode) a column in a pandas DataFrame?

I have the following DataFrame where one of the columns is an object (list type cell):

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})
df
Out[458]: 
   A       B
0  1  [1, 2]
1  2  [1, 2]

My expected output is:

   A  B
0  1  1
1  1  2
3  2  1
4  2  2

What should I do to achieve this?


Related question

pandas: When cell contents are lists, create a row for each element in the list

Good question and answer, but it only handles one column with a list (in my answer the self-defined function will work for multiple columns; also, the accepted answer uses the most time-consuming apply, which is not recommended; for more info see When should I ever want to use pandas apply() in my code?)


Answer 0

As someone who uses both R and python, I have seen this type of question a couple of times.

In R, they have a built-in function called unnest from the tidyr package. But in Python (pandas) there is no built-in function for this type of question.

I know object column types make the data hard to convert with a pandas function. When I received data like this, the first thing that came to mind was to ‘flatten’ or unnest the columns.

I am using pandas and python functions for this type of question. If you are worried about the speed of the above solutions, check user3483203’s answer, since it’s using numpy and most of the time numpy is faster. I recommend Cython and numba if speed matters.


Method 0 [pandas >= 0.25]
Starting from pandas 0.25, if you only need to explode one column, you can use the pandas.DataFrame.explode function:

df.explode('B')

       A  B
    0  1  1
    1  1  2
    0  2  1
    1  2  2

Given a dataframe with an empty list or a NaN in the column: an empty list will not cause an issue, but a NaN will need to be filled with a list.

df = pd.DataFrame({'A': [1, 2, 3, 4],'B': [[1, 2], [1, 2], [], np.nan]})
df.B = df.B.fillna({i: [] for i in df.index})  # replace NaN with []
df.explode('B')

   A    B
0  1    1
0  1    2
1  2    1
1  2    2
2  3  NaN
3  4  NaN

Method 1
apply + pd.Series (easy to understand, but not recommended in terms of performance)

df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
Out[463]: 
   A  B
0  1  1
1  1  2
0  2  1
1  2  2

Method 2
Using repeat with the DataFrame constructor, re-create your dataframe (good for performance, not good for multiple columns)

df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
df
Out[465]: 
   A  B
0  1  1
0  1  2
1  2  1
1  2  2

Method 2.1
For example, besides A we have A.1 ... A.n. If we still use the method above (Method 2), it is hard for us to re-create the columns one by one.

Solution: join or merge with the index after ‘unnesting’ the single column

s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
s.join(df.drop('B',1),how='left')
Out[477]: 
   B  A
0  1  1
0  2  1
1  1  2
1  2  2

If you need the column order exactly the same as before, add reindex at the end.

s.join(df.drop('B',1),how='left').reindex(columns=df.columns)

Method 3
recreate the list

pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
Out[488]: 
   A  B
0  1  1
1  1  2
2  2  1
3  2  2

If more than two columns, use

s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
s.merge(df,left_on=0,right_index=True)
Out[491]: 
   0  1  A       B
0  0  1  1  [1, 2]
1  0  2  1  [1, 2]
2  1  1  2  [1, 2]
3  1  2  2  [1, 2]

Method 4
using reindex or loc

df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]: 
   A  B
0  1  1
0  1  2
1  2  1
1  2  2

#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))

Method 5
when the list only contains unique values:

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]: 
   B  A
0  1  1
1  2  1
2  3  2
3  4  2

Method 6
using numpy for high performance:

newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
pd.DataFrame(data=newvalues[0],columns=df.columns)
   A  B
0  1  1
1  1  2
2  2  1
3  2  2

Method 7
using base function itertools cycle and chain: Pure python solution just for fun

from itertools import cycle,chain
l=df.values.tolist()
l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
   A  B
0  1  1
1  1  2
2  2  1
3  2  2

Generalizing to multiple columns

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
df
Out[592]: 
   A       B       C
0  1  [1, 2]  [1, 2]
1  2  [3, 4]  [3, 4]

Self-def function:

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(explode, 1), how='left')

        
unnesting(df,['B','C'])
Out[609]: 
   B  C  A
0  1  1  1
0  2  2  1
1  3  3  2
1  4  4  2

Column-wise Unnesting

All of the methods above are about vertical unnesting and exploding. If you do need to expand the list horizontally, check with the pd.DataFrame constructor:

df.join(pd.DataFrame(df.B.tolist(),index=df.index).add_prefix('B_'))
Out[33]: 
   A       B       C  B_0  B_1
0  1  [1, 2]  [1, 2]    1    2
1  2  [3, 4]  [3, 4]    3    4

Updated function

def unnesting(df, explode, axis):
    if axis==1:
        idx = df.index.repeat(df[explode[0]].str.len())
        df1 = pd.concat([
            pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
        df1.index = idx

        return df1.join(df.drop(explode, 1), how='left')
    else :
        df1 = pd.concat([
                         pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(explode, 1), how='left')

Test Output

unnesting(df, ['B','C'], axis=0)
Out[36]: 
   B0  B1  C0  C1  A
0   1   2   1   2  1
1   3   4   3   4  2

Answer 1

Option 1

If all of the sublists in the other column are the same length, numpy can be an efficient option here:

vals = np.array(df.B.values.tolist())    
a = np.repeat(df.A, vals.shape[1])

pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

   A  B
0  1  1
1  1  2
2  2  1
3  2  2

Option 2

If the sublists have different length, you need an additional step:

vals = df.B.values.tolist()
rs = [len(r) for r in vals]    
a = np.repeat(df.A, rs)

pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)

   A  B
0  1  1
1  1  2
2  2  1
3  2  2

Option 3

I took a shot at generalizing this to flatten N columns and tile M columns; I’ll work later on making it more efficient:

df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
                   'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})

   A          B          C  D
0  1     [1, 2]  [1, 2, 3]  A
1  2  [1, 2, 3]     [1, 2]  B
2  3        [1]     [1, 2]  C

def unnest(df, tile, explode):
    vals = df[explode].sum(1)
    rs = [len(r) for r in vals]
    a = np.repeat(df[tile].values, rs, axis=0)
    b = np.concatenate(vals.values)
    d = np.column_stack((a, b))
    return pd.DataFrame(d, columns = tile +  ['_'.join(explode)])

unnest(df, ['A', 'D'], ['B', 'C'])

    A  D B_C
0   1  A   1
1   1  A   2
2   1  A   1
3   1  A   2
4   1  A   3
5   2  B   1
6   2  B   2
7   2  B   3
8   2  B   1
9   2  B   2
10  3  C   1
11  3  C   1
12  3  C   2

Functions

def wen1(df):
    return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

def wen2(df):
    return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})

def wen3(df):
    s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
    return s.join(df.drop('B', 1), how='left')

def wen4(df):
    return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)

def chris1(df):
    vals = np.array(df.B.values.tolist())
    a = np.repeat(df.A, vals.shape[1])
    return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

def chris2(df):
    vals = df.B.values.tolist()
    rs = [len(r) for r in vals]
    a = np.repeat(df.A.values, rs)
    return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)

Timings

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit

res = pd.DataFrame(
       index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
       columns=[10, 50, 100, 500, 1000, 5000, 10000],
       dtype=float
)

for f in res.index:
    for c in res.columns:
        df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
        df = pd.concat([df]*c)
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")

Performance


Answer 2

Exploding a list-like column has been simplified significantly in pandas 0.25 with the addition of the explode() method:

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
df.explode('B')

Out:

   A  B
0  1  1
0  1  2
1  2  1
1  2  2

Answer 3

One alternative is to apply the meshgrid recipe over the rows of the columns to unnest:

import numpy as np
import pandas as pd


def unnest(frame, explode):
    def mesh(values):
        return np.array(np.meshgrid(*values)).T.reshape(-1, len(values))

    data = np.vstack([mesh(row) for row in frame[explode].values])
    return pd.DataFrame(data=data, columns=explode)


df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
print(unnest(df, ['A', 'B']))  # base
print()

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
print(unnest(df, ['A', 'B', 'C']))  # multiple columns
print()

df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [1, 2, 3], [1]],
                   'C': [[1, 2, 3], [1, 2], [1, 2]], 'D': ['A', 'B', 'C']})

print(unnest(df, ['A', 'B']))  # uneven length lists
print()
print(unnest(df, ['D', 'B']))  # different types
print()

Output

   A  B
0  1  1
1  1  2
2  2  1
3  2  2

   A  B  C
0  1  1  1
1  1  2  1
2  1  1  2
3  1  2  2
4  2  3  3
5  2  4  3
6  2  3  4
7  2  4  4

   A  B
0  1  1
1  1  2
2  2  1
3  2  2
4  2  3
5  3  1

   D  B
0  A  1
1  A  2
2  B  1
3  B  2
4  B  3
5  C  1

Answer 4

My 5 cents:

df[['B', 'B2']] = pd.DataFrame(df['B'].values.tolist())

df[['A', 'B']].append(df[['A', 'B2']].rename(columns={'B2': 'B'}),
                      ignore_index=True)

and another 5

df[['B1', 'B2']] = pd.DataFrame([*df['B']]) # if values.tolist() is too boring

(pd.wide_to_long(df.drop('B', 1), 'B', 'A', '')
 .reset_index(level=1, drop=True)
 .reset_index())

both resulting in the same

   A  B
0  1  1
1  2  1
2  1  2
3  2  2

Answer 5

Because normally the sublist lengths are different and join/merge is far more computationally expensive, I retested the methods for different-length sublists and more normal columns.

MultiIndex should also be an easier way to write, and it has nearly the same performance as the numpy way.

Surprisingly, in my implementation the comprehension way has the best performance.

def stack(df):
    return df.set_index(['A', 'C']).B.apply(pd.Series).stack()


def comprehension(df):
    return pd.DataFrame([x + [z] for x, y in zip(df[['A', 'C']].values.tolist(), df.B) for z in y])


def multiindex(df):
    return pd.DataFrame(np.concatenate(df.B.values), index=df.set_index(['A', 'C']).index.repeat(df.B.str.len()))


def array(df):
    return pd.DataFrame(
        np.column_stack((
            np.repeat(df[['A', 'C']].values, df.B.str.len(), axis=0),
            np.concatenate(df.B.values)
        ))
    )


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit

res = pd.DataFrame(
    index=[
        'stack',
        'comprehension',
        'multiindex',
        'array',
    ],
    columns=[1000, 2000, 5000, 10000, 20000, 50000],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        df = pd.DataFrame({'A': list('abc'), 'C': list('def'), 'B': [['g', 'h', 'i'], ['j', 'k'], ['l']]})
        df = pd.concat([df] * c)
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=20)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")

Performance

Relative time of each method


Answer 6

I generalized the problem a bit to be applicable to more columns.

Summary of what my solution does:

In[74]: df
Out[74]: 
    A   B             C             columnD
0  A1  B1  [C1.1, C1.2]                D1
1  A2  B2  [C2.1, C2.2]  [D2.1, D2.2, D2.3]
2  A3  B3            C3        [D3.1, D3.2]

In[75]: dfListExplode(df,['C','columnD'])
Out[75]: 
    A   B     C columnD
0  A1  B1  C1.1    D1
1  A1  B1  C1.2    D1
2  A2  B2  C2.1    D2.1
3  A2  B2  C2.1    D2.2
4  A2  B2  C2.1    D2.3
5  A2  B2  C2.2    D2.1
6  A2  B2  C2.2    D2.2
7  A2  B2  C2.2    D2.3
8  A3  B3    C3    D3.1
9  A3  B3    C3    D3.2

Complete example:

The actual explosion is performed in 3 lines. The rest is cosmetics (multi column explosion, handling of strings instead of lists in the explosion column, …).

import pandas as pd
import numpy as np

df=pd.DataFrame( {'A': ['A1','A2','A3'],
                  'B': ['B1','B2','B3'],
                  'C': [ ['C1.1','C1.2'],['C2.1','C2.2'],'C3'],
                  'columnD': [ 'D1',['D2.1','D2.2', 'D2.3'],['D3.1','D3.2']],
                  })
print('df',df, sep='\n')

def dfListExplode(df, explodeKeys):
    if not isinstance(explodeKeys, list):
        explodeKeys=[explodeKeys]
    # recursive handling of explodeKeys
    if len(explodeKeys)==0:
        return df
    elif len(explodeKeys)==1:
        explodeKey=explodeKeys[0]
    else:
        return dfListExplode( dfListExplode(df, explodeKeys[:1]), explodeKeys[1:])
    # perform explosion/unnesting for key: explodeKey
    dfPrep=df[explodeKey].apply(lambda x: x if isinstance(x,list) else [x]) #casts all elements to a list
    dfIndExpl=pd.DataFrame([[x] + [z] for x, y in zip(dfPrep.index,dfPrep.values) for z in y ], columns=['explodedIndex',explodeKey])
    dfMerged=dfIndExpl.merge(df.drop(explodeKey, axis=1), left_on='explodedIndex', right_index=True)
    dfReind=dfMerged.reindex(columns=list(df))
    return dfReind

dfExpl=dfListExplode(df,['C','columnD'])
print('dfExpl',dfExpl, sep='\n')

Credits to WeNYoBen’s answer


Answer 7

Problem Setup

Assume there are multiple columns with different length objects within it

df = pd.DataFrame({
    'A': [1, 2],
    'B': [[1, 2], [3, 4]],
    'C': [[1, 2], [3, 4, 5]]
})

df

   A       B          C
0  1  [1, 2]     [1, 2]
1  2  [3, 4]  [3, 4, 5]

When the lengths are the same, it is easy for us to assume that the varying elements coincide and should be “zipped” together.

   A       B          C
0  1  [1, 2]     [1, 2]  # Typical to assume these should be zipped [(1, 1), (2, 2)]
1  2  [3, 4]  [3, 4, 5]

However, the assumption gets challenged when we see different length objects, should we “zip”, if so, how do we handle the excess in one of the objects. OR, maybe we want the product of all of the objects. This will get big fast, but might be what is wanted.

   A       B          C
0  1  [1, 2]     [1, 2]
1  2  [3, 4]  [3, 4, 5]  # is this [(3, 3), (4, 4), (None, 5)]?

OR

   A       B          C
0  1  [1, 2]     [1, 2]
1  2  [3, 4]  [3, 4, 5]  # is this [(3, 3), (3, 4), (3, 5), (4, 3), (4, 4), (4, 5)]

The Function

This function gracefully handles zip or product based on a parameter and assumes to zip according to the length of the longest object with zip_longest

from itertools import zip_longest, product

def xplode(df, explode, zipped=True):
    method = zip_longest if zipped else product

    rest = {*df} - {*explode}

    zipped = zip(zip(*map(df.get, rest)), zip(*map(df.get, explode)))
    tups = [tup + exploded
     for tup, pre in zipped
     for exploded in method(*pre)]

    return pd.DataFrame(tups, columns=[*rest, *explode])[[*df]]

Zipped

xplode(df, ['B', 'C'])

   A    B  C
0  1  1.0  1
1  1  2.0  2
2  2  3.0  3
3  2  4.0  4
4  2  NaN  5

Product

xplode(df, ['B', 'C'], zipped=False)

   A  B  C
0  1  1  1
1  1  1  2
2  1  2  1
3  1  2  2
4  2  3  3
5  2  3  4
6  2  3  5
7  2  4  3
8  2  4  4
9  2  4  5

New Setup

Varying up the example a bit

df = pd.DataFrame({
    'A': [1, 2],
    'B': [[1, 2], [3, 4]],
    'C': 'C',
    'D': [[1, 2], [3, 4, 5]],
    'E': [('X', 'Y', 'Z'), ('W',)]
})

df

   A       B  C          D          E
0  1  [1, 2]  C     [1, 2]  (X, Y, Z)
1  2  [3, 4]  C  [3, 4, 5]       (W,)

Zipped

xplode(df, ['B', 'D', 'E'])

   A    B  C    D     E
0  1  1.0  C  1.0     X
1  1  2.0  C  2.0     Y
2  1  NaN  C  NaN     Z
3  2  3.0  C  3.0     W
4  2  4.0  C  4.0  None
5  2  NaN  C  5.0  None

Product

xplode(df, ['B', 'D', 'E'], zipped=False)

    A  B  C  D  E
0   1  1  C  1  X
1   1  1  C  1  Y
2   1  1  C  1  Z
3   1  1  C  2  X
4   1  1  C  2  Y
5   1  1  C  2  Z
6   1  2  C  1  X
7   1  2  C  1  Y
8   1  2  C  1  Z
9   1  2  C  2  X
10  1  2  C  2  Y
11  1  2  C  2  Z
12  2  3  C  3  W
13  2  3  C  4  W
14  2  3  C  5  W
15  2  4  C  3  W
16  2  4  C  4  W
17  2  4  C  5  W

回答 8

不推荐使用的方法(至少在这种情况下有效):

df=pd.concat([df]*2).sort_index()
it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
df['B']=df['B'].apply(lambda x:next(it))

concat + sort_index + iter + apply + next

现在:

print(df)

是:

   A  B
0  1  1
0  1  2
1  2  1
1  2  2

如果在乎索引:

df=df.reset_index(drop=True)

现在:

print(df)

是:

   A  B
0  1  1
1  1  2
2  2  1
3  2  2

Something pretty not recommended (at least work in this case):

df=pd.concat([df]*2).sort_index()
it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
df['B']=df['B'].apply(lambda x:next(it))

concat + sort_index + iter + apply + next.

Now:

print(df)

Is:

   A  B
0  1  1
0  1  2
1  2  1
1  2  2

If care about index:

df=df.reset_index(drop=True)

Now:

print(df)

Is:

   A  B
0  1  1
1  1  2
2  2  1
3  2  2

回答 9

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})

pd.concat([df['A'], pd.DataFrame(df['B'].values.tolist())], axis = 1)\
  .melt(id_vars = 'A', value_name = 'B')\
  .dropna()\
  .drop('variable', axis = 1)

    A   B
0   1   1
1   2   1
2   1   2
3   2   2

大家对我想到的这种方法有何看法?还是说同时使用 concat 和 melt 被认为开销过于“昂贵”?

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})

pd.concat([df['A'], pd.DataFrame(df['B'].values.tolist())], axis = 1)\
  .melt(id_vars = 'A', value_name = 'B')\
  .dropna()\
  .drop('variable', axis = 1)

    A   B
0   1   1
1   2   1
2   1   2
3   2   2

Any opinions on this method I thought of? or is doing both concat and melt considered too “expensive”?


回答 10

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})

out = pd.concat([df.loc[:,'A'],(df.B.apply(pd.Series))], axis=1, sort=False)

out = out.set_index('A').stack().droplevel(level=1).reset_index().rename(columns={0:"B"})

       A    B
   0    1   1
   1    1   2
   2    2   1
   3    2   2
  • 如果您不想创建中间对象,则可以将其写成单行代码(one-liner)
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})

out = pd.concat([df.loc[:,'A'],(df.B.apply(pd.Series))], axis=1, sort=False)

out = out.set_index('A').stack().droplevel(level=1).reset_index().rename(columns={0:"B"})

       A    B
   0    1   1
   1    1   2
   2    2   1
   3    2   2
  • you can implement this as one liner, if you don’t wish to create intermediate object

回答 11

# Here's the answer to the related question in:
# https://stackoverflow.com/q/56708671/11426125

# initial dataframe
df12=pd.DataFrame({'Date':['2007-12-03','2008-09-07'],'names':
[['Peter','Alex'],['Donald','Stan']]})

# convert dataframe to array for indexing list values (names)
a = np.array(df12.values)  

# create a new, dataframe with dimensions for unnested
b = np.ndarray(shape = (4,2))
df2 = pd.DataFrame(b, columns = ["Date", "names"], dtype = str)

# implement loops to assign date/name values as required
i = range(len(a[0]))
j = range(len(a[0]))
for x in i:
    for y in j:
        df2.iat[2*x+y, 0] = a[x][0]
        df2.iat[2*x+y, 1] = a[x][1][y]

# set Date column as Index
df2.Date=pd.to_datetime(df2.Date)
df2.index=df2.Date
df2.drop('Date',axis=1,inplace =True)
# Here's the answer to the related question in:
# https://stackoverflow.com/q/56708671/11426125

# initial dataframe
df12=pd.DataFrame({'Date':['2007-12-03','2008-09-07'],'names':
[['Peter','Alex'],['Donald','Stan']]})

# convert dataframe to array for indexing list values (names)
a = np.array(df12.values)  

# create a new, dataframe with dimensions for unnested
b = np.ndarray(shape = (4,2))
df2 = pd.DataFrame(b, columns = ["Date", "names"], dtype = str)

# implement loops to assign date/name values as required
i = range(len(a[0]))
j = range(len(a[0]))
for x in i:
    for y in j:
        df2.iat[2*x+y, 0] = a[x][0]
        df2.iat[2*x+y, 1] = a[x][1][y]

# set Date column as Index
df2.Date=pd.to_datetime(df2.Date)
df2.index=df2.Date
df2.drop('Date',axis=1,inplace =True)

回答 12

在我的案例中,需要展开(explode)的列不止一个,而且待取消嵌套的数组长度各不相同。

我最后把 pandas 0.25 中新的 explode 函数应用了两次,然后删除生成的重复项,这样就达到目的了!

df = df.explode('A')
df = df.explode('B')
df = df.drop_duplicates()

In my case with more than one column to explode, and with variables lengths for the arrays that needs to be unnested.

I ended up applying the new pandas 0.25 explode function two times, then removing generated duplicates and it does the job !

df = df.explode('A')
df = df.explode('B')
df = df.drop_duplicates()

回答 13

当您有多个要爆炸的列时,我还有另一种解决此问题的好方法。

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]], 'C':[[1,2,3],[1,2,3]]})

print(df)
   A       B          C
0  1  [1, 2]  [1, 2, 3]
1  2  [1, 2]  [1, 2, 3]

我想爆炸B和C列。首先爆炸B,然后爆炸C。然后我将B和C从原始df中删除。之后,我将在3个df上进行索引连接。

explode_b = df.explode('B')['B']
explode_c = df.explode('C')['C']
df = df.drop(['B', 'C'], axis=1)
df = df.join([explode_b, explode_c])

I have another good way to solves this when you have more than one column to explode.

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]], 'C':[[1,2,3],[1,2,3]]})

print(df)
   A       B          C
0  1  [1, 2]  [1, 2, 3]
1  2  [1, 2]  [1, 2, 3]

I want to explode the columns B and C. First I explode B, second C. Than I drop B and C from the original df. After that I will do an index join on the 3 dfs.

explode_b = df.explode('B')['B']
explode_c = df.explode('C')['C']
df = df.drop(['B', 'C'], axis=1)
df = df.join([explode_b, explode_c])

无法分配具有形状和数据类型的数组

问题:无法分配具有形状和数据类型的数组

我在Ubuntu 18上在numpy中分配大型数组时遇到了一个问题,而在MacOS上却没有遇到同样的问题。

我想为一个形状为 (156816, 36, 53806) 的 numpy 数组分配内存,使用

np.zeros((156816, 36, 53806), dtype='uint8')

但我在 Ubuntu 上遇到了如下错误:

>>> import numpy as np
>>> np.zeros((156816, 36, 53806), dtype='uint8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (156816, 36, 53806) and data type uint8

我没有在MacOS上得到它:

>>> import numpy as np 
>>> np.zeros((156816, 36, 53806), dtype='uint8')
array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       ...,

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]], dtype=uint8)

我在某处读到,np.zeros 并不会真正分配数组所需的全部内存,而只为非零元素分配。况且这台 Ubuntu 机器有 64GB 内存,我的 MacBook Pro 却只有 16GB。

版本:

Ubuntu
os -> ubuntu mate 18
python -> 3.6.8
numpy -> 1.17.0

mac
os -> 10.14.6
python -> 3.6.4
numpy -> 1.17.0

PS:在Google Colab上也失败

I’m facing an issue with allocating huge arrays in numpy on Ubuntu 18 while not facing the same issue on MacOS.

I am trying to allocate memory for a numpy array with shape (156816, 36, 53806) with

np.zeros((156816, 36, 53806), dtype='uint8')

and while I’m getting an error on Ubuntu OS

>>> import numpy as np
>>> np.zeros((156816, 36, 53806), dtype='uint8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (156816, 36, 53806) and data type uint8

I’m not getting it on MacOS:

>>> import numpy as np 
>>> np.zeros((156816, 36, 53806), dtype='uint8')
array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       ...,

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]], dtype=uint8)

I’ve read somewhere that np.zeros shouldn’t be really allocating the whole memory needed for the array, but only for the non-zero elements. Even though the Ubuntu machine has 64gb of memory, while my MacBook Pro has only 16gb.

versions:

Ubuntu
os -> ubuntu mate 18
python -> 3.6.8
numpy -> 1.17.0

mac
os -> 10.14.6
python -> 3.6.4
numpy -> 1.17.0

PS: also failed on Google Colab


回答 0

这可能是由于系统的过量使用处理模式所致。

在默认模式 0 下,

启发式过量使用处理。明显的地址空间过量使用会被拒绝。用于典型的系统。它确保严重离谱的分配会失败,同时允许通过过量使用来减少交换空间的占用。在此模式下,root 被允许多分配一点内存。这是默认值。

此处没有很好地解释所使用的确切启发式方法,但在 Linux overcommit heuristic 的讨论以及本页上对此有更多说明。

您可以通过运行以下命令检查当前的过量使用模式

$ cat /proc/sys/vm/overcommit_memory
0

在这种情况下,您要分配

>>> 156816 * 36 * 53806 / 1024.0**3
282.8939827680588

约 282 GB。内核的意思很明确:显然不可能为它提交这么多物理页,于是拒绝了这次分配。

如果(以root用户身份)运行:

$ echo 1 > /proc/sys/vm/overcommit_memory

这将启用“始终过量使用”模式,并且您会发现,无论系统有多大(至少在64位内存寻址中),该系统的确允许您进行分配。

我自己在一台 32 GB RAM 的计算机上做了测试。在过量提交模式 0 下我同样得到 MemoryError,但把它改为 1 之后就可以工作了:

>>> import numpy as np
>>> a = np.zeros((156816, 36, 53806), dtype='uint8')
>>> a.nbytes
303755101056

之后,您可以往数组中的任何位置写入数据,系统只有在您真正写到某个页面时才会分配对应的物理页。因此,只要小心使用,这种方式很适合稀疏数组。

This is likely due to your system’s overcommit handling mode.

In the default mode, 0,

Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.

The exact heuristic used is not well explained here, but this is discussed more on Linux over commit heuristic and on this page.

You can check your current overcommit mode by running

$ cat /proc/sys/vm/overcommit_memory
0

In this case you’re allocating

>>> 156816 * 36 * 53806 / 1024.0**3
282.8939827680588

~282 GB, and the kernel is saying well obviously there’s no way I’m going to be able to commit that many physical pages to this, and it refuses the allocation.

If (as root) you run:

$ echo 1 > /proc/sys/vm/overcommit_memory

This will enable “always overcommit” mode, and you’ll find that indeed the system will allow you to make the allocation no matter how large it is (within 64-bit memory addressing at least).

I tested this myself on a machine with 32 GB of RAM. With overcommit mode 0 I also got a MemoryError, but after changing it back to 1 it works:

>>> import numpy as np
>>> a = np.zeros((156816, 36, 53806), dtype='uint8')
>>> a.nbytes
303755101056

You can then go ahead and write to any location within the array, and the system will only allocate physical pages when you explicitly write to that page. So you can use this, with care, for sparse arrays.
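As a rough, Linux-only sketch (not part of the original answer), you can compute the requested size and check the current overcommit mode from Python before attempting the allocation:

import numpy as np

shape = (156816, 36, 53806)
needed = np.prod(shape) * np.dtype('uint8').itemsize   # bytes required
print(needed / 1024**3, 'GiB requested')                # ~282.9 GiB

# Linux only: read the current overcommit mode (0, 1 or 2)
with open('/proc/sys/vm/overcommit_memory') as f:
    print('overcommit_memory =', f.read().strip())

# With overcommit mode 1 this allocation succeeds even without 282 GiB of RAM;
# physical pages are only committed when the array is actually written to.
a = np.zeros(shape, dtype='uint8')
a[0, 0, 0] = 1   # touching a page commits it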


回答 1

我在 Windows 上遇到了同样的问题,并找到了这个解决方案。因此,如果有人在 Windows 中遇到此问题:对我来说,解决办法是增大页面文件(pagefile)的大小,因为我的情况同样是内存过量使用的问题。

Windows 8

  1. 在键盘上按WindowsKey + X,然后在弹出菜单中单击“系统”。
  2. 点击或单击高级系统设置。系统可能会要求您输入管理员密码或确认选择
  3. 在“高级”选项卡上的“性能”下,点击或单击“设置”。
  4. 点击或单击“高级”选项卡,然后在“虚拟内存”下,单击或单击“更改”
  5. 清除“自动管理所有驱动器的页面文件大小”复选框。
  6. 在驱动器[卷标签]下,点击或单击包含要更改的页面文件的驱动器
  7. 点击或单击“自定义大小”,在“初始大小(MB)”或“最大大小(MB)”框中输入新的大小(以兆字节为单位),单击或单击“设置”,然后单击或单击“确定”
  8. 重新启动系统

Windows 10

  1. 按Windows键
  2. 类型SystemPropertiesAdvanced
  3. 单击以管理员身份运行
  4. 在性能下,单击设置
  5. 选择高级选项卡
  6. 选择更改…
  7. 取消选中“自动管理所有驱动器的页面文件大小”
  8. 然后选择自定义尺寸并填写适当的尺寸
  9. 按设置,然后按确定,然后从“虚拟内存”,“性能选项”和“系统属性”对话框退出
  10. 重新启动系统

注意:在此示例中,我的系统上没有足够的内存供〜282GB使用,但对于我的特殊情况,此方法有效。

编辑

以下是这里给出的页面文件大小建议:

有一个公式可以计算合适的页面文件大小:初始大小是系统总内存的 1.5 倍,最大大小是初始大小的 3 倍。假设您有 4 GB(1 GB = 1,024 MB,共 4,096 MB)内存,则初始大小为 1.5 x 4,096 = 6,144 MB,最大大小为 3 x 6,144 = 18,432 MB。

这里要记住一些事情:

但是,这没有考虑到计算机可能特有的其他重要因素和系统设置。同样,让Windows选择要使用的内容,而不是依赖于在另一台计算机上工作的任意公式。

也:

页面文件大小的增加可能有助于防止Windows中的不稳定和崩溃。但是,硬盘驱动器的读/写时间比数据存储在计算机内存中的情况要慢得多。页面文件较大将增加硬盘驱动器的工作量,从而导致其他所有文件的运行速度变慢。仅当遇到内存不足错误时才应增加页面文件的大小,并且仅作为临时解决方案。更好的解决方案是向计算机添加更多的内存。

I had this same problem on Windows and came across this solution. So if someone comes across this problem in Windows, the solution for me was to increase the pagefile size, as it was a memory overcommitment problem for me too.

Windows 8

  1. On the Keyboard Press the WindowsKey + X then click System in the popup menu
  2. Tap or click Advanced system settings. You might be asked for an admin password or to confirm your choice
  3. On the Advanced tab, under Performance, tap or click Settings.
  4. Tap or click the Advanced tab, and then, under Virtual memory, tap or click Change
  5. Clear the Automatically manage paging file size for all drives check box.
  6. Under Drive [Volume Label], tap or click the drive that contains the paging file you want to change
  7. Tap or click Custom size, enter a new size in megabytes in the initial size (MB) or Maximum size (MB) box, tap or click Set, and then tap or click OK
  8. Reboot your system

Windows 10

  1. Press the Windows key
  2. Type SystemPropertiesAdvanced
  3. Click Run as administrator
  4. Click Settings
  5. Select the Advanced tab
  6. Select Change…
  7. Uncheck Automatically managing paging file size for all drives
  8. Then select Custom size and fill in the appropriate size
  9. Press Set then press OK then exit from the Virtual Memory, Performance Options, and System Properties Dialog
  10. Reboot your system

Note: I did not have the enough memory on my system for the ~282GB in this example but for my particular case this worked.

EDIT

From here the suggested recommendations for page file size:

There is a formula for calculating the correct pagefile size. Initial size is one and a half (1.5) x the amount of total system memory. Maximum size is three (3) x the initial size. So let’s say you have 4 GB (1 GB = 1,024 MB x 4 = 4,096 MB) of memory. The initial size would be 1.5 x 4,096 = 6,144 MB and the maximum size would be 3 x 6,144 = 18,432 MB.
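Just to spell out the arithmetic from the quoted formula (a trivial sketch, using the 4 GB example above):

total_mb = 4 * 1024              # e.g. 4 GB of RAM
initial_mb = 1.5 * total_mb      # 6144 MB initial pagefile size
maximum_mb = 3 * initial_mb      # 18432 MB maximum pagefile size
print(initial_mb, maximum_mb)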

Some things to keep in mind from here:

However, this does not take into consideration other important factors and system settings that may be unique to your computer. Again, let Windows choose what to use instead of relying on some arbitrary formula that worked on a different computer.

Also:

Increasing page file size may help prevent instabilities and crashing in Windows. However, a hard drive read/write times are much slower than what they would be if the data were in your computer memory. Having a larger page file is going to add extra work for your hard drive, causing everything else to run slower. Page file size should only be increased when encountering out-of-memory errors, and only as a temporary fix. A better solution is to adding more memory to the computer.


回答 2

我也在 Windows 上遇到了这个问题。对我来说,解决方案是从 32 位版本的 Python 切换到 64 位版本。的确,32 位软件(就像 32 位 CPU 一样)最多只能寻址 4 GB 的 RAM(2^32)。因此,如果您拥有超过 4 GB 的 RAM,32 位版本将无法利用多出来的部分。

使用64位版本的Python(在下载页面中标记为x86-64的版本),问题就消失了。

您可以通过进入解释器来检查自己用的是哪个版本。我用的是 64 位版本,现在显示:Python 3.7.5rc1 (tags/v3.7.5rc1:4082f600a5, Oct 1 2019, 20:28:14) [MSC v.1916 64 bit (AMD64)],其中 [MSC v.1916 64 bit (AMD64)] 表示“64 位 Python”。

注:截至撰写本文时(2020 年 5 月),matplotlib 在 python39 上尚不可用,所以我建议安装 64 位的 python37。

资料来源:

I came across this problem on Windows too. The solution for me was to switch from a 32-bit to a 64-bit version of Python. Indeed, a 32-bit software, like a 32-bit CPU, can adress a maximum of 4 GB of RAM (2^32). So if you have more than 4 GB of RAM, a 32-bit version cannot take advantage of it.

With a 64-bit version of Python (the one labeled x86-64 in the download page), the issue disappeared.

You can check which version you have by entering the interpreter. I, with a 64-bit version, now have: Python 3.7.5rc1 (tags/v3.7.5rc1:4082f600a5, Oct 1 2019, 20:28:14) [MSC v.1916 64 bit (AMD64)], where [MSC v.1916 64 bit (AMD64)] means “64-bit Python”.
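If you prefer a programmatic check over reading the interpreter banner, a small sketch (standard library only, nothing specific to this answer) is:

import struct, sys

print(struct.calcsize('P') * 8)   # 64 on a 64-bit interpreter, 32 on a 32-bit one
print(sys.maxsize > 2**32)        # True only on 64-bit builds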

Note : as of the time of this writing (May 2020), matplotlib is not available on python39, so I recommand installing python37, 64 bits.

Sources :


回答 3

在我的情况下,添加 dtype 参数把数组的 dtype 换成了更小的类型(从 float64 变为 uint8),数组体积缩小到足以在 Windows(64 位)下不再引发 MemoryError。

从

mask = np.zeros(edges.shape)

到

mask = np.zeros(edges.shape,dtype='uint8')

In my case, adding a dtype attribute changed dtype of the array to a smaller type(from float64 to uint8), decreasing array size enough to not throw MemoryError in Windows(64 bit).

from

mask = np.zeros(edges.shape)

to

mask = np.zeros(edges.shape,dtype='uint8')
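To get a feel for the saving, here is a hedged sketch; the shape below is made up, since the real edges.shape isn't given in the answer:

import numpy as np

shape = (10000, 10000)                                   # hypothetical edges.shape
print(np.zeros(shape).nbytes / 1024**2)                  # float64 default: ~762.9 MiB
print(np.zeros(shape, dtype='uint8').nbytes / 1024**2)   # uint8: ~95.4 MiB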

回答 4

有时,由于内核已达到极限,会弹出此错误。尝试重新启动内核,然后重做必要的步骤。

Sometimes, this error pops up because of the kernel has reached its limit. Try to restart the kernel redo the necessary steps.


回答 5

把数据类型改成另一种占用内存更少的类型即可解决。对我来说,我把数据类型改成了 numpy.uint8:

data['label'] = data['label'].astype(np.uint8)

change the data type to another one which uses less memory works. For me, I change the data type to numpy.uint8:

data['label'] = data['label'].astype(np.uint8)

交错合并两个字符串的最 pythonic 方式

问题:交错合并两个字符串的最 pythonic 方式

把两个字符串交错合并(mesh)在一起,最 pythonic 的方式是什么?

例如:

输入:

u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijklmnopqrstuvwxyz'

输出:

'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

What’s the most pythonic way to mesh two strings together?

For example:

Input:

u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijklmnopqrstuvwxyz'

Output:

'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

回答 0

对我来说,最 pythonic* 的方式是下面这段代码,它做的事情基本相同,但使用 + 运算符来连接两个字符串中对应位置的字符:

res = "".join(i + j for i, j in zip(u, l))
print(res)
# 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

它也比使用两个join()调用更快:

In [5]: l1 = 'A' * 1000000; l2 = 'a' * 1000000

In [6]: %timeit "".join("".join(item) for item in zip(l1, l2))
1 loops, best of 3: 442 ms per loop

In [7]: %timeit "".join(i + j for i, j in zip(l1, l2))
1 loops, best of 3: 360 ms per loop

存在更快的方法,但它们往往会让代码变得晦涩。

注:如果两个输入字符串的长度不相同,则较长的那个会被截断,因为 zip 在较短字符串结束时就停止迭代。这种情况下应改用 itertools 模块中的 zip_longest(Python 2 中为 izip_longest),以确保两个字符串都被完整用到。


*引用 Python 之禅的一句话:可读性很重要(Readability counts)。
Pythonic = 对我而言就是可读性;i + j 至少在我看来,从视觉上更容易解析。

For me, the most pythonic* way is the following which pretty much does the same thing but uses the + operator for concatenating the individual characters in each string:

res = "".join(i + j for i, j in zip(u, l))
print(res)
# 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

It is also faster than using two join() calls:

In [5]: l1 = 'A' * 1000000; l2 = 'a' * 1000000

In [6]: %timeit "".join("".join(item) for item in zip(l1, l2))
1 loops, best of 3: 442 ms per loop

In [7]: %timeit "".join(i + j for i, j in zip(l1, l2))
1 loops, best of 3: 360 ms per loop

Faster approaches exist, but they often obfuscate the code.

Note: If the two input strings are not the same length then the longer one will be truncated as zip stops iterating at the end of the shorter string. In this case instead of zip one should use zip_longest (izip_longest in Python 2) from the itertools module to ensure that both strings are fully exhausted.
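To make the zip_longest note above concrete, here is a short Python 3 sketch with strings of unequal length (the example strings are my own):

from itertools import zip_longest

u = 'ABCDE'
l = 'abc'
print("".join(i + j for i, j in zip(u, l)))                        # 'AaBbCc' (the 'DE' tail is lost)
print("".join(i + j for i, j in zip_longest(u, l, fillvalue='')))  # 'AaBbCcDE'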


*To take a quote from the Zen of Python: Readability counts.
Pythonic = readability for me; i + j is just visually parsed more easily, at least for my eyes.


回答 1

更快的选择

其他方式:

res = [''] * len(u) * 2
res[::2] = u
res[1::2] = l
print(''.join(res))

输出:

'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

速度

看起来更快:

%%timeit
res = [''] * len(u) * 2
res[::2] = u
res[1::2] = l
''.join(res)

100000 loops, best of 3: 4.75 µs per loop

比迄今为止最快的解决方案:

%timeit "".join(list(chain.from_iterable(zip(u, l))))

100000 loops, best of 3: 6.52 µs per loop

同样对于较大的字符串:

l1 = 'A' * 1000000; l2 = 'a' * 1000000

%timeit "".join(list(chain.from_iterable(zip(l1, l2))))
1 loops, best of 3: 151 ms per loop


%%timeit
res = [''] * len(l1) * 2
res[::2] = l1
res[1::2] = l2
''.join(res)

10 loops, best of 3: 92 ms per loop

Python 3.5.1。

不同长度字符串的变化

u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijkl'

较短的一个确定长度(zip()等效)

min_len = min(len(u), len(l))
res = [''] * min_len * 2 
res[::2] = u[:min_len]
res[1::2] = l[:min_len]
print(''.join(res))

输出:

AaBbCcDdEeFfGgHhIiJjKkLl

更长的长度决定长度(itertools.zip_longest(fillvalue='')等效)

min_len = min(len(u), len(l))
res = [''] * min_len * 2 
res[::2] = u[:min_len]
res[1::2] = l[:min_len]
res += u[min_len:] + l[min_len:]
print(''.join(res))

输出:

AaBbCcDdEeFfGgHhIiJjKkLlMNOPQRSTUVWXYZ

Faster Alternative

Another way:

res = [''] * len(u) * 2
res[::2] = u
res[1::2] = l
print(''.join(res))

Output:

'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

Speed

Looks like it is faster:

%%timeit
res = [''] * len(u) * 2
res[::2] = u
res[1::2] = l
''.join(res)

100000 loops, best of 3: 4.75 µs per loop

than the fastest solution so far:

%timeit "".join(list(chain.from_iterable(zip(u, l))))

100000 loops, best of 3: 6.52 µs per loop

Also for the larger strings:

l1 = 'A' * 1000000; l2 = 'a' * 1000000

%timeit "".join(list(chain.from_iterable(zip(l1, l2))))
1 loops, best of 3: 151 ms per loop


%%timeit
res = [''] * len(l1) * 2
res[::2] = l1
res[1::2] = l2
''.join(res)

10 loops, best of 3: 92 ms per loop

Python 3.5.1.

Variation for strings with different lengths

u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijkl'

Shorter one determines length (zip() equivalent)

min_len = min(len(u), len(l))
res = [''] * min_len * 2 
res[::2] = u[:min_len]
res[1::2] = l[:min_len]
print(''.join(res))

Output:

AaBbCcDdEeFfGgHhIiJjKkLl

Longer one determines length (itertools.zip_longest(fillvalue='') equivalent)

min_len = min(len(u), len(l))
res = [''] * min_len * 2 
res[::2] = u[:min_len]
res[1::2] = l[:min_len]
res += u[min_len:] + l[min_len:]
print(''.join(res))

Output:

AaBbCcDdEeFfGgHhIiJjKkLlMNOPQRSTUVWXYZ

回答 2

使用 join() 和 zip()。

>>> ''.join(''.join(item) for item in zip(u,l))
'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

With join() and zip().

>>> ''.join(''.join(item) for item in zip(u,l))
'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

回答 3

在 Python 2 上,到目前为止最快的做法是下面这种:对小字符串,它大约是列表切片方案速度的 3 倍;对长字符串则约为 30 倍。

res = bytearray(len(u) * 2)
res[::2] = u
res[1::2] = l
str(res)

但是,这在Python 3上不起作用。您可以实现类似

res = bytearray(len(u) * 2)
res[::2] = u.encode("ascii")
res[1::2] = l.encode("ascii")
res.decode("ascii")

但是到那时,您已经失去了对小型字符串进行列表切片所获得的收益(对于长字符串而言,它的速度仍然是20倍),并且这甚至还不适用于非ASCII字符。

FWIW,如果您在大量字符串上执行此操作并且需要每个周期,并且由于某种原因必须使用Python字符串…以下是操作方法:

res = bytearray(len(u) * 4 * 2)

u_utf32 = u.encode("utf_32_be")
res[0::8] = u_utf32[0::4]
res[1::8] = u_utf32[1::4]
res[2::8] = u_utf32[2::4]
res[3::8] = u_utf32[3::4]

l_utf32 = l.encode("utf_32_be")
res[4::8] = l_utf32[0::4]
res[5::8] = l_utf32[1::4]
res[6::8] = l_utf32[2::4]
res[7::8] = l_utf32[3::4]

res.decode("utf_32_be")

对较小类型这种常见情况做特殊处理(special-casing)也会有帮助。FWIW,对长字符串这只有列表切片速度的 3 倍,而对小字符串反而慢 4 到 5 倍。

无论哪种方式,我更喜欢 join 方案;但既然别处已经提到了计时结果,我想我也不妨凑一下热闹。

On Python 2, by far the faster way to do things, at ~3x the speed of list slicing for small strings and ~30x for long ones, is

res = bytearray(len(u) * 2)
res[::2] = u
res[1::2] = l
str(res)

This wouldn’t work on Python 3, though. You could implement something like

res = bytearray(len(u) * 2)
res[::2] = u.encode("ascii")
res[1::2] = l.encode("ascii")
res.decode("ascii")

but by then you’ve already lost the gains over list slicing for small strings (it’s still 20x the speed for long strings) and this doesn’t even work for non-ASCII characters yet.

FWIW, if you are doing this on massive strings and need every cycle, and for some reason have to use Python strings… here’s how to do it:

res = bytearray(len(u) * 4 * 2)

u_utf32 = u.encode("utf_32_be")
res[0::8] = u_utf32[0::4]
res[1::8] = u_utf32[1::4]
res[2::8] = u_utf32[2::4]
res[3::8] = u_utf32[3::4]

l_utf32 = l.encode("utf_32_be")
res[4::8] = l_utf32[0::4]
res[5::8] = l_utf32[1::4]
res[6::8] = l_utf32[2::4]
res[7::8] = l_utf32[3::4]

res.decode("utf_32_be")

Special-casing the common case of smaller types will help too. FWIW, this is only 3x the speed of list slicing for long strings and a factor of 4 to 5 slower for small strings.

Either way I prefer the join solutions, but since timings were mentioned elsewhere I thought I might as well join in.


回答 4

如果您想要最快的方法,可以把 itertools 与 operator.add 结合使用:

In [36]: from operator import add

In [37]: from itertools import  starmap, izip

In [38]: timeit "".join([i + j for i, j in izip(l1, l2)])
1 loops, best of 3: 142 ms per loop

In [39]: timeit "".join(starmap(add, izip(l1,l2)))
1 loops, best of 3: 117 ms per loop

In [40]: timeit "".join(["".join(item) for item in zip(l1, l2)])
1 loops, best of 3: 196 ms per loop

In [41]:  "".join(starmap(add, izip(l1,l2))) ==  "".join([i + j   for i, j in izip(l1, l2)]) ==  "".join(["".join(item) for item in izip(l1, l2)])
Out[42]: True

但是把 izip 与 chain.from_iterable 结合起来还要更快:

In [2]: from itertools import  chain, izip

In [3]: timeit "".join(chain.from_iterable(izip(l1, l2)))
10 loops, best of 3: 98.7 ms per loop

chain(* 与 chain.from_iterable(... 之间也存在实质性差异。

In [5]: timeit "".join(chain(*izip(l1, l2)))
1 loops, best of 3: 212 ms per loop

join 并没有真正的生成器支持:传入生成器总是会更慢,因为 Python 会先用其内容构建一个列表。原因是 join 要对数据做两次遍历,一次计算所需的总大小,一次真正执行连接,而这用生成器是做不到的:

join.h

 /* Here is the general case.  Do a pre-pass to figure out the total
  * amount of space we'll need (sz), and see whether all arguments are
  * bytes-like.
   */

另外,如果您使用不同长度的字符串,并且不想丢失数据,则可以使用izip_longest

In [22]: from itertools import izip_longest    
In [23]: a,b = "hlo","elworld"

In [24]:  "".join(chain.from_iterable(izip_longest(a, b,fillvalue="")))
Out[24]: 'helloworld'

对于python 3,它称为 zip_longest

但是对于python2来说,veedrac的建议是迄今为止最快的:

In [18]: %%timeit
res = bytearray(len(u) * 2)
res[::2] = u
res[1::2] = l
str(res)
   ....: 
100 loops, best of 3: 2.68 ms per loop

If you want the fastest way, you can combine itertools with operator.add:

In [36]: from operator import add

In [37]: from itertools import  starmap, izip

In [38]: timeit "".join([i + j for i, j in izip(l1, l2)])
1 loops, best of 3: 142 ms per loop

In [39]: timeit "".join(starmap(add, izip(l1,l2)))
1 loops, best of 3: 117 ms per loop

In [40]: timeit "".join(["".join(item) for item in zip(l1, l2)])
1 loops, best of 3: 196 ms per loop

In [41]:  "".join(starmap(add, izip(l1,l2))) ==  "".join([i + j   for i, j in izip(l1, l2)]) ==  "".join(["".join(item) for item in izip(l1, l2)])
Out[42]: True

But combining izip and chain.from_iterable is faster again

In [2]: from itertools import  chain, izip

In [3]: timeit "".join(chain.from_iterable(izip(l1, l2)))
10 loops, best of 3: 98.7 ms per loop

There is also a substantial difference between chain(* and chain.from_iterable(....

In [5]: timeit "".join(chain(*izip(l1, l2)))
1 loops, best of 3: 212 ms per loop

There is no such thing as a generator with join, passing one is always going to be slower as python will first build a list using the content because it does two passes over the data, one to figure out the size needed and one to actually do the join which would not be possible using a generator:

join.h:

 /* Here is the general case.  Do a pre-pass to figure out the total
  * amount of space we'll need (sz), and see whether all arguments are
  * bytes-like.
   */

Also if you have different length strings and you don’t want to lose data you can use izip_longest :

In [22]: from itertools import izip_longest    
In [23]: a,b = "hlo","elworld"

In [24]:  "".join(chain.from_iterable(izip_longest(a, b,fillvalue="")))
Out[24]: 'helloworld'

For python 3 it is called zip_longest

But for python2, veedrac’s suggestion is by far the fastest:

In [18]: %%timeit
res = bytearray(len(u) * 2)
res[::2] = u
res[1::2] = l
str(res)
   ....: 
100 loops, best of 3: 2.68 ms per loop

回答 5

您也可以使用 map 和 operator.add 来实现:

from operator import add

u = 'AAAAA'
l = 'aaaaa'

s = "".join(map(add, u, l))

输出

'AaAaAaAaAa'

map 的作用是:它依次从第一个可迭代对象 u 中取出每个元素,并从第二个可迭代对象 l 中取出对应位置的元素,然后应用作为第一个参数传入的函数 add。最后 join 只是把它们连接起来。

You could also do this using map and operator.add:

from operator import add

u = 'AAAAA'
l = 'aaaaa'

s = "".join(map(add, u, l))

Output:

'AaAaAaAaAa'

What map does is it takes every element from the first iterable u and the first elements from the second iterable l and applies the function supplied as the first argument add. Then join just joins them.


回答 6

Jim 的答案很好,但如果您不介意多几个导入,下面是我最喜欢的写法:

from functools import reduce
from operator import add

reduce(add, map(add, u, l))

Jim’s answer is great, but here’s my favorite option, if you don’t mind a couple of imports:

from functools import reduce
from operator import add

reduce(add, map(add, u, l))

回答 7

这些建议很多都假设两个字符串长度相等。也许这已经涵盖了所有合理的用例,但至少在我看来,你可能也希望兼容长度不同的字符串。还是说只有我一个人认为交错(mesh)应该大致像下面这样工作:

u = "foobar"
l = "baz"
mesh(u,l) = "fboaozbar"

一种方法是:

def mesh(a,b):
    minlen = min(len(a),len(b))
    return "".join(["".join(x+y for x,y in zip(a,b)),a[minlen:],b[minlen:]])

A lot of these suggestions assume the strings are of equal length. Maybe that covers all reasonable use cases, but at least to me it seems that you might want to accomodate strings of differing lengths too. Or am I the only one thinking the mesh should work a bit like this:

u = "foobar"
l = "baz"
mesh(u,l) = "fboaozbar"

One way to do this would be the following:

def mesh(a,b):
    minlen = min(len(a),len(b))
    return "".join(["".join(x+y for x,y in zip(a,b)),a[minlen:],b[minlen:]])

回答 8

我喜欢使用两个fors,变量名可以提示/提醒正在发生的事情:

"".join(char for pair in zip(u,l) for char in pair)

I like using two fors, the variable names can give a hint/reminder to what is going on:

"".join(char for pair in zip(u,l) for char in pair)

回答 9

只是添加另一种更基本的方法:

st = ""
for char in u:
    st = "{0}{1}{2}".format( st, char, l[ u.index( char ) ] )

Just to add another, more basic approach:

st = ""
for char in u:
    st = "{0}{1}{2}".format( st, char, l[ u.index( char ) ] )

回答 10

不考虑这里的双重列表推导式(double-list-comprehension)答案,感觉有点不够 pythonic;它能以 O(1) 的额外工作量处理 n 个字符串:

"".join(c for cs in itertools.zip_longest(*all_strings) for c in cs)

其中 all_strings 是你想要交错的字符串列表。就本题而言,all_strings = [u, l]。完整的使用示例如下所示:

import itertools
a = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
b = 'abcdefghijklmnopqrstuvwxyz'
all_strings = [a,b]
interleaved = "".join(c for cs in itertools.zip_longest(*all_strings) for c in cs)
print(interleaved)
# 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

和许多答案一样,它是最快的吗?可能不是,但它简单而灵活。另外,在几乎不增加复杂度的情况下,它比被采纳的答案还要稍快一些(一般来说,Python 中的字符串相加有点慢):

In [7]: l1 = 'A' * 1000000; l2 = 'a' * 1000000;

In [8]: %timeit "".join(i + j for i, j in zip(l1, l2))
1 loops, best of 3: 227 ms per loop

In [9]: %timeit "".join(c for cs in zip(*(l1, l2)) for c in cs)
1 loops, best of 3: 198 ms per loop

Feels a bit un-pythonic not to consider the double-list-comprehension answer here, to handle n string with O(1) effort:

"".join(c for cs in itertools.zip_longest(*all_strings) for c in cs)

where all_strings is a list of the strings you want to interleave. In your case, all_strings = [u, l]. A full use example would look like this:

import itertools
a = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
b = 'abcdefghijklmnopqrstuvwxyz'
all_strings = [a,b]
interleaved = "".join(c for cs in itertools.zip_longest(*all_strings) for c in cs)
print(interleaved)
# 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

Like many answers, fastest? Probably not, but simple and flexible. Also, without too much added complexity, this is slightly faster than the accepted answer (in general, string addition is a bit slow in python):

In [7]: l1 = 'A' * 1000000; l2 = 'a' * 1000000;

In [8]: %timeit "".join(i + j for i, j in zip(l1, l2))
1 loops, best of 3: 227 ms per loop

In [9]: %timeit "".join(c for cs in zip(*(l1, l2)) for c in cs)
1 loops, best of 3: 198 ms per loop

回答 11

可能比当前领先的解决方案更快,更短:

from itertools import chain

u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijklmnopqrstuvwxyz'

res = "".join(chain(*zip(u, l)))

速度上的策略是尽可能多地把工作放在 C 层面完成。对长度不等的字符串,同样可以用 zip_longest() 来修复,而且它与 chain() 来自同一个模块,所以在这一点上也扣不了我多少分!

我提出的其他解决方案:

res = "".join(u[x] + l[x] for x in range(len(u)))

res = "".join(k + l[i] for i, k in enumerate(u))

Potentially faster and shorter than the current leading solution:

from itertools import chain

u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijklmnopqrstuvwxyz'

res = "".join(chain(*zip(u, l)))

Strategy speed-wise is to do as much at the C-level as possible. Same zip_longest() fix for uneven strings and it would be coming out of the same module as chain() so can’t ding me too many points there!

Other solutions I came up with along the way:

res = "".join(u[x] + l[x] for x in range(len(u)))

res = "".join(k + l[i] for i, k in enumerate(u))

回答 12

你可以使用 iteration_utilities.roundrobin¹

u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijklmnopqrstuvwxyz'

from iteration_utilities import roundrobin
''.join(roundrobin(u, l))
# returns 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

ManyIterables同一包中的类:

from iteration_utilities import ManyIterables
ManyIterables(u, l).roundrobin().as_string()
# returns 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

¹ 这来自我编写的一个第三方库:iteration_utilities。

You could use iteration_utilities.roundrobin1

u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijklmnopqrstuvwxyz'

from iteration_utilities import roundrobin
''.join(roundrobin(u, l))
# returns 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

or the ManyIterables class from the same package:

from iteration_utilities import ManyIterables
ManyIterables(u, l).roundrobin().as_string()
# returns 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

1 This is from a third-party library I have written: iteration_utilities.


回答 13

我将使用zip()来获得一种可读且简单的方法:

result = ''
for cha, chb in zip(u, l):
    result += '%s%s' % (cha, chb)

print result
# 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

I would use zip() to get a readable and easy way:

result = ''
for cha, chb in zip(u, l):
    result += '%s%s' % (cha, chb)

print result
# 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

TensorFlow 中的 strides 参数

问题:TensorFlow 中的 strides 参数

我想理解 tf.nn.avg_pool、tf.nn.max_pool、tf.nn.conv2d 中的 strides 参数。

文档中反复说:

步幅:长度大于等于4的整数的列表。输入张量每个维度的滑动窗口的步幅。

我的问题是:

  1. 这 4 个(或更多)整数分别代表什么?
  2. 对于卷积网络,为什么必须满足 strides[0] = strides[3] = 1?
  3. 此示例中,我们看到了tf.reshape(_X,shape=[-1, 28, 28, 1])。为什么是-1?

遗憾的是,文档中使用-1进行重塑的示例并不能很好地解释这种情况。

I am trying to understand the strides argument in tf.nn.avg_pool, tf.nn.max_pool, tf.nn.conv2d.

The documentation repeatedly says

strides: A list of ints that has length >= 4. The stride of the sliding window for each dimension of the input tensor.

My questions are:

  1. What do each of the 4+ integers represent?
  2. Why must they have strides[0] = strides[3] = 1 for convnets?
  3. In this example we see tf.reshape(_X,shape=[-1, 28, 28, 1]). Why -1?

Sadly the examples in the docs for reshape using -1 don’t translate too well to this scenario.


回答 0

池化和卷积运算会在输入张量上滑动一个“窗口”。以 tf.nn.conv2d 为例:如果输入张量有 4 个维度:[batch, height, width, channels],则卷积是在 height、width 这两个维度构成的二维窗口上进行的。

strides确定窗口在每个维度上的移动量。典型用法是将第一个(批处理)和最后一个(深度)跨度设置为1。

让我们使用一个非常具体的示例:在32×32灰度输入图像上运行2-d卷积。我说灰度是因为输入图像的深度为1,这有助于使其简单。让该图像看起来像这样:

00 01 02 03 04 ...
10 11 12 13 14 ...
20 21 22 23 24 ...
30 31 32 33 34 ...
...

让我们在一个示例(批处理大小= 1)上运行2×2卷积窗口。我们给卷积的输出通道深度为8。

卷积的输入为shape=[1, 32, 32, 1]

如果指定strides=[1,1,1,1]padding=SAME,则滤波器的输出将是[1,32,32,8]。

过滤器将首先为以下内容创建输出:

F(00 01
  10 11)

然后针对:

F(01 02
  11 12)

等等。然后它将移至第二行,计算:

F(10, 11
  20, 21)

然后

F(11, 12
  21, 22)

如果将跨度指定为[1、2、2、1],则不会重叠窗口。它将计算:

F(00, 01
  10, 11)

然后

F(02, 03
  12, 13)

跨步操作对于池操作员类似。

问题 2:卷积网络的步幅为何是 [1, x, y, 1]

第一个是批处理:您通常不想跳过批处理中的示例,否则您不应该首先将它们包括在内。:)

最后一个是卷积的深度:出于相同的原因,您通常不想跳过输入。

conv2d运算符比较笼统,因此您可以创建卷积以使窗口沿其他维度滑动,但这在卷积网络中并不常见。典型用途是在空间上使用它们。

为什么要 reshape 成 -1?-1 是一个占位符,表示“根据需要进行调整,以匹配整个张量所需的大小”。这是让代码不依赖输入批大小的一种方法,这样您就可以调整流水线,而不必在代码的各处修改 batch size。

The pooling and convolutional ops slide a “window” across the input tensor. Using tf.nn.conv2d as an example: If the input tensor has 4 dimensions: [batch, height, width, channels], then the convolution operates on a 2D window on the height, width dimensions.

strides determines how much the window shifts by in each of the dimensions. The typical use sets the first (the batch) and last (the depth) stride to 1.

Let’s use a very concrete example: Running a 2-d convolution over a 32×32 greyscale input image. I say greyscale because then the input image has depth=1, which helps keep it simple. Let that image look like this:

00 01 02 03 04 ...
10 11 12 13 14 ...
20 21 22 23 24 ...
30 31 32 33 34 ...
...

Let’s run a 2×2 convolution window over a single example (batch size = 1). We’ll give the convolution an output channel depth of 8.

The input to the convolution has shape=[1, 32, 32, 1].

If you specify strides=[1,1,1,1] with padding=SAME, then the output of the filter will be [1, 32, 32, 8].

The filter will first create an output for:

F(00 01
  10 11)

And then for:

F(01 02
  11 12)

and so on. Then it will move to the second row, calculating:

F(10, 11
  20, 21)

then

F(11, 12
  21, 22)

If you specify a stride of [1, 2, 2, 1] it won’t do overlapping windows. It will compute:

F(00, 01
  10, 11)

and then

F(02, 03
  12, 13)

The stride operates similarly for the pooling operators.

Question 2: Why strides [1, x, y, 1] for convnets

The first 1 is the batch: You don’t usually want to skip over examples in your batch, or you shouldn’t have included them in the first place. :)

The last 1 is the depth of the convolution: You don’t usually want to skip inputs, for the same reason.

The conv2d operator is more general, so you could create convolutions that slide the window along other dimensions, but that’s not a typical use in convnets. The typical use is to use them spatially.

Why reshape to -1? -1 is a placeholder that says “adjust as necessary to match the size needed for the full tensor.” It’s a way of making the code be independent of the input batch size, so that you can change your pipeline and not have to adjust the batch size everywhere in the code.
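As a hedged illustration of the shapes described above (assuming TensorFlow 2.x; the 32x32 greyscale numbers follow the example in this answer):

import numpy as np
import tensorflow as tf

x = tf.constant(np.arange(32 * 32, dtype=np.float32).reshape(1, 32, 32, 1))  # [batch, h, w, channels]
w = tf.ones([2, 2, 1, 8], dtype=tf.float32)                                  # 2x2 filter, depth 1 -> 8 channels

y1 = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
y2 = tf.nn.conv2d(x, w, strides=[1, 2, 2, 1], padding='SAME')
print(y1.shape)  # (1, 32, 32, 8): overlapping windows
print(y2.shape)  # (1, 16, 16, 8): the window moves 2 pixels at a time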


回答 1

输入是4维的,格式为: [batch_size, image_rows, image_cols, number_of_colors]

通常,步幅决定了相邻两次运算之间的重叠程度。对于 conv2d,它指定卷积滤波器相邻两次应用之间相隔多远。某个维度取值为 1 表示在每一行/列都应用运算,取值为 2 表示每隔一行/列应用一次,以此类推。

关于1)对于卷积重要的值是2nd和3rd,它们表示卷积滤波器在沿行和列的应用中的重叠。值[1,2,2,1]表示我们要在每隔第二行和第二列上应用过滤器。

关于2)我不知道技术限制(可能是CuDNN要求),但通常人们会沿行或列尺寸使用步幅。在批处理大小上执行此操作不一定有意义。不确定最后一个尺寸。

关于3)为其中一个维设置-1表示“为第一维设置值,以使张量中的元素总数不变”。在我们的例子中,-1将等于batch_size。

The inputs are 4 dimensional and are of form: [batch_size, image_rows, image_cols, number_of_colors]

Strides, in general, define an overlap between applying operations. In the case of conv2d, it specifies what is the distance between consecutive applications of convolutional filters. The value of 1 in a specific dimension means that we apply the operator at every row/col, the value of 2 means every second, and so on.

Re 1) The values that matter for convolutions are 2nd and 3rd and they represent the overlap in the application of the convolutional filters along rows and columns. The value of [1, 2, 2, 1] says that we want to apply the filters on every second row and column.

Re 2) I don’t know the technical limitations (might be CuDNN requirement) but typically people use strides along the rows or columns dimensions. It doesn’t necessarily make sense to do it over batch size. Not sure of the last dimension.

Re 3) Setting -1 for one of the dimension means, “set the value for the first dimension so that the total number of elements in the tensor is unchanged”. In our case, the -1 will be equal to the batch_size.
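A quick numpy-only sketch of point 3 (the 28x28 shape is borrowed from the question's example):

import numpy as np

flat = np.zeros((3 * 28 * 28,))          # a hypothetical flattened batch of 3 MNIST-sized images
imgs = flat.reshape(-1, 28, 28, 1)       # -1 lets numpy infer the batch dimension
print(imgs.shape)                        # (3, 28, 28, 1)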


回答 2

让我们从1-dim情况下的步幅开始。

假设你的 input = [1, 0, 2, 3, 0, 1, 1],kernel = [2, 1, 3],卷积的结果是 [8, 11, 7, 9, 4]。它是通过让内核在输入上滑动、做逐元素相乘再求和得到的,像这样:

  • 8 = 1 * 2 + 0 * 1 + 2 * 3
  • 11 = 0 * 2 + 2 * 1 + 3 * 3
  • 7 = 2 * 2 + 3 * 1 + 0 * 3
  • 9 = 3 * 2 + 0 * 1 +1 * 3
  • 4 = 0 * 2 +1 * 1 +1 * 3

在这里,我们每次滑动一个元素,但没有什么阻止你换用其他的数字,这个数字就是你的步幅(stride)。你可以把它理解为对步幅为 1 的卷积结果做下采样,只取其中每第 s 个结果。

知道输入大小 i、内核大小 k、步幅 s 和填充 p,就可以很容易地算出卷积的输出大小:

output = ⌈(i - k + 2p + 1) / s⌉

这里 ⌈ ⌉ 表示向上取整(ceiling)操作。对于池化层,s = 1。


N 维的情况

掌握了 1 维情形的数学原理之后,只要意识到每个维度是相互独立的,n 维情形就很容易了:你只需在每个维度上分别滑动即可。这里是一个 2 维的示例。请注意,各个维度的步幅不必相同,因此对于 N 维输入/内核,你应当提供 N 个步幅。


因此,现在很容易回答您的所有问题:

  1. 这 4 个(或更多)整数分别代表什么?conv2d、pool 的文档告诉你,这个列表表示每个维度上的步幅。注意,步幅列表的长度与内核张量的秩相同。
  2. 为什么卷积网络必须满足 strides[0] = strides[3] = 1?第一个维度是批大小,最后一个是通道。既没有理由跳过批中的样本,也没有理由跳过通道,所以把它们设为 1。对于宽度/高度,你可以跳过一些位置,所以它们可以不是 1。
  3. tf.reshape(_X, shape=[-1, 28, 28, 1]),为什么是 -1?tf.reshape 的文档已经说明了:

    如果形状的一个分量是特殊值-1,则将计算该尺寸的大小,以便总大小保持恒定。特别地,[-1]的形状变平为1-D。形状的最多一个分量可以是-1。

Let’s start with what stride does in 1-dim case.

Let’s assume your input = [1, 0, 2, 3, 0, 1, 1] and kernel = [2, 1, 3] the result of the convolution is [8, 11, 7, 9, 4], which is calculated by sliding your kernel over the input, performing element-wise multiplication and summing everything. Like this:

  • 8 = 1 * 2 + 0 * 1 + 2 * 3
  • 11 = 0 * 2 + 2 * 1 + 3 * 3
  • 7 = 2 * 2 + 3 * 1 + 0 * 3
  • 9 = 3 * 2 + 0 * 1 + 1 * 3
  • 4 = 0 * 2 + 1 * 1 + 1 * 3

Here we slide by one element, but nothing stops you by using any other number. This number is your stride. You can think about it as downsampling the result of the 1-strided convolution by just taking every s-th result.

Knowing the input size i, kernel size k, stride s and padding p you can easily calculate the output size of the convolution as:

output = ⌈(i - k + 2p + 1) / s⌉

Here ⌈ ⌉ means the ceiling operation. For a pooling layer s = 1.
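A tiny numpy sketch of the 1-d example above (my own addition, not from the original answer), showing stride as downsampling of the stride-1 result:

import numpy as np

x = np.array([1, 0, 2, 3, 0, 1, 1])
k = np.array([2, 1, 3])

full = [np.dot(x[i:i + len(k)], k) for i in range(len(x) - len(k) + 1)]
print(full)        # [8, 11, 7, 9, 4]: stride 1
print(full[::2])   # [8, 7, 4]: stride 2 keeps every second result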


N-dim case.

Knowing the math for a 1-dim case, n-dim case is easy once you see that each dim is independent. So you just slide each dimension separately. Here is an example for 2-d. Notice that you do not need to have the same stride at all the dimensions. So for an N-dim input/kernel you should provide N strides.


So now it is easy to answer all your questions:

  1. What do each of the 4+ integers represent?. conv2d, pool tells you that this list represents the strides among each dimension. Notice that the length of strides list is the same as the rank of kernel tensor.
  2. Why must they have strides[0] = strides3 = 1 for convnets?. The first dimension is batch size, the last is channels. There is no point of skipping neither batch nor channel. So you make them 1. For width/height you can skip something and that’s why they might be not 1.
  3. tf.reshape(_X,shape=[-1, 28, 28, 1]). Why -1? tf.reshape has it covered for you:

    If one component of shape is the special value -1, the size of that dimension is computed so that the total size remains constant. In particular, a shape of [-1] flattens into 1-D. At most one component of shape can be -1.


回答 3

@dga 的解释非常出色,它给我的帮助让我感激不尽。我也想以类似的方式分享一下我关于 stride 在 3D 卷积中如何工作的发现。

根据conv3d 上的TensorFlow文档,输入的形状必须按以下顺序排列:

[batch, in_depth, in_height, in_width, in_channels]

让我们使用一个示例从最右到左解释变量。假设输入形状为 input_shape = [1000,16,112,112,3]

input_shape[4] is the number of colour channels (RGB or whichever format it is extracted in)
input_shape[3] is the width of the image
input_shape[2] is the height of the image
input_shape[1] is the number of frames that have been lumped into 1 complete data
input_shape[0] is the number of lumped frames of images we have.

以下是有关如何使用步幅的摘要文档。

步幅:长度 >= 5 的整数列表,即长度为 5 的 1-D 张量,表示输入每个维度上滑动窗口的步幅。必须满足 strides[0] = strides[4] = 1。

正如许多资料所指出的,步幅简单来说就是窗口(内核)每次相对最近的元素(无论是数据帧还是像素)跳开多少步(顺便说一句,这是转述)。

从以上文档可以看出,3D 中的步幅应形如 strides = (1, X, Y, Z, 1)。

文档强调必须有 strides[0] = strides[4] = 1。

strides[0]=1 means that we do not want to skip any data in the batch 
strides[4]=1 means that we do not want to skip in the channel 

strides[X] 表示我们在堆叠的帧上要跳过多少。例如,如果我们有 16 帧,X = 1 表示使用每一帧,X = 2 表示每隔一帧使用一次,以此类推。

strides [y]和strides [z]按照@dga的说明进行操作,因此我不会重做该部分。

但是,在keras中,您只需要指定3个整数的元组/列表,即可指定沿每个空间维度的卷积步幅,其中空间维度为stride [x],strides [y]和strides [z]。strides [0]和strides [4]已默认为1。

我希望有人觉得这有帮助!

@dga has done a wonderful job explaining and I can’t be thankful enough how helpful it has been. In the like manner, I will like to share my findings on how stride works in 3D convolution.

According to the TensorFlow documentation on conv3d, the shape of the input must be in this order:

[batch, in_depth, in_height, in_width, in_channels]

Let’s explain the variables from the extreme right to the left using an example. Assuming the input shape is input_shape = [1000,16,112,112,3]

input_shape[4] is the number of colour channels (RGB or whichever format it is extracted in)
input_shape[3] is the width of the image
input_shape[2] is the height of the image
input_shape[1] is the number of frames that have been lumped into 1 complete data
input_shape[0] is the number of lumped frames of images we have.

Below is a summary documentation for how stride is used.

strides: A list of ints that has length >= 5. 1-D tensor of length 5. The stride of the sliding window for each dimension of input. Must have strides[0] = strides[4] = 1

As indicated in many works, strides simply mean how many steps away a window or kernel jumps away from the closest element, be it a data frame or pixel (this is paraphrased by the way).

From the above documentation, a stride in 3D will look like this strides = (1,X,Y,Z,1).

The documentation emphasizes that strides[0] = strides[4] = 1.

strides[0]=1 means that we do not want to skip any data in the batch 
strides[4]=1 means that we do not want to skip in the channel 

strides[X] means how many skips we should make in the lumped frames. So for example, if we have 16 frames, X=1 means use every frame. X=2 means use every second frame and it goes and on

strides[y] and strides[z] follow the explanation by @dga so I will not redo that part.

In keras however, you only need to specify a tuple/list of 3 integers, specifying the strides of the convolution along each spatial dimension, where spatial dimension is stride[x], strides[y] and strides[z]. strides[0] and strides[4] is already defaulted to 1.

I hope someone finds this helpful!


如何将多维数组写入文本文件?

问题:如何将多维数组写入文本文件?

在另一个问题中,其他用户表示,如果我能提供出问题的那个数组,他们可以给些帮助。但是,我甚至连把数组写入文件这样基本的 I/O 任务都搞不定。

谁能解释一下,要把一个 4x11x14 的 numpy 数组写入文件,我需要写什么样的循环?

该数组由四个11 x 14数组组成,因此我应该使用漂亮的换行符对其进行格式化,以使文件读取更加容易。

编辑:所以我已经尝试了numpy.savetxt函数。奇怪的是,它给出了以下错误:

TypeError: float argument required, not numpy.ndarray

我猜这是因为该函数不支持多维数组?有没有什么办法能把它们写进同一个文件?

In another question, other users offered some help if I could supply the array I was having trouble with. However, I even fail at a basic I/O task, such as writing an array to a file.

Can anyone explain what kind of loop I would need to write a 4x11x14 numpy array to file?

This array consist of four 11 x 14 arrays, so I should format it with a nice newline, to make the reading of the file easier on others.

Edit: So I’ve tried the numpy.savetxt function. Strangely, it gives the following error:

TypeError: float argument required, not numpy.ndarray

I assume that this is because the function doesn’t work with multidimensional arrays? Any solutions as I would like them within one file?


回答 0

如果您想把它写入磁盘,以便之后能方便地以 numpy 数组形式读回,请看 numpy.save。用 pickle 序列化也可以,只是对大数组效率较低(您的数组不大,所以两种方式都没问题)。

如果您希望它易于阅读,请查看numpy.savetxt

编辑:所以,对于维度大于 2 的数组,savetxt 似乎并不是一个很好的选择……但为了把事情说完整,结论如下:

我刚刚意识到,numpy.savetxt 在维度大于 2 的 ndarray 上会直接报错……这可能是有意为之,因为并没有一种天然的方式能在文本文件中表示额外的维度。

例如,这个(二维数组)可以正常工作

import numpy as np
x = np.arange(20).reshape((4,5))
np.savetxt('test.txt', x)

而对 3D 数组做同样的事则会失败(错误信息也没什么参考价值:TypeError: float argument required, not numpy.ndarray):

import numpy as np
x = np.arange(200).reshape((4,5,10))
np.savetxt('test.txt', x)

一种解决方法是将3D(或更大)阵列分成2D切片。例如

x = np.arange(200).reshape((4,5,10))
with open('test.txt', 'w') as outfile:
    for slice_2d in x:
        np.savetxt(outfile, slice_2d)

但是,我们的目标是既清晰易读,又仍然能用 numpy.loadtxt 轻松读回。因此,我们可以稍微冗长一些,用注释行来区分各个切片。默认情况下,numpy.loadtxt 会忽略任何以 #(或 comments 关键字参数指定的任何字符)开头的行。(这段代码看起来比实际上要冗长……)

import numpy as np

# Generate some test data
data = np.arange(200).reshape((4,5,10))

# Write the array to disk
with open('test.txt', 'w') as outfile:
    # I'm writing a header here just for the sake of readability
    # Any line starting with "#" will be ignored by numpy.loadtxt
    outfile.write('# Array shape: {0}\n'.format(data.shape))

    # Iterating through a ndimensional array produces slices along
    # the last axis. This is equivalent to data[i,:,:] in this case
    for data_slice in data:

        # The formatting string indicates that I'm writing out
        # the values in left-justified columns 7 characters in width
        # with 2 decimal places.  
        np.savetxt(outfile, data_slice, fmt='%-7.2f')

        # Writing out a break to indicate different slices...
        outfile.write('# New slice\n')

这样生成:

# Array shape: (4, 5, 10)
0.00    1.00    2.00    3.00    4.00    5.00    6.00    7.00    8.00    9.00   
10.00   11.00   12.00   13.00   14.00   15.00   16.00   17.00   18.00   19.00  
20.00   21.00   22.00   23.00   24.00   25.00   26.00   27.00   28.00   29.00  
30.00   31.00   32.00   33.00   34.00   35.00   36.00   37.00   38.00   39.00  
40.00   41.00   42.00   43.00   44.00   45.00   46.00   47.00   48.00   49.00  
# New slice
50.00   51.00   52.00   53.00   54.00   55.00   56.00   57.00   58.00   59.00  
60.00   61.00   62.00   63.00   64.00   65.00   66.00   67.00   68.00   69.00  
70.00   71.00   72.00   73.00   74.00   75.00   76.00   77.00   78.00   79.00  
80.00   81.00   82.00   83.00   84.00   85.00   86.00   87.00   88.00   89.00  
90.00   91.00   92.00   93.00   94.00   95.00   96.00   97.00   98.00   99.00  
# New slice
100.00  101.00  102.00  103.00  104.00  105.00  106.00  107.00  108.00  109.00 
110.00  111.00  112.00  113.00  114.00  115.00  116.00  117.00  118.00  119.00 
120.00  121.00  122.00  123.00  124.00  125.00  126.00  127.00  128.00  129.00 
130.00  131.00  132.00  133.00  134.00  135.00  136.00  137.00  138.00  139.00 
140.00  141.00  142.00  143.00  144.00  145.00  146.00  147.00  148.00  149.00 
# New slice
150.00  151.00  152.00  153.00  154.00  155.00  156.00  157.00  158.00  159.00 
160.00  161.00  162.00  163.00  164.00  165.00  166.00  167.00  168.00  169.00 
170.00  171.00  172.00  173.00  174.00  175.00  176.00  177.00  178.00  179.00 
180.00  181.00  182.00  183.00  184.00  185.00  186.00  187.00  188.00  189.00 
190.00  191.00  192.00  193.00  194.00  195.00  196.00  197.00  198.00  199.00 
# New slice

只要我们知道原始数组的形状,读回来就很容易:直接 numpy.loadtxt('test.txt').reshape((4,5,10)) 即可。下面是一个示例(其实一行就能完成,这里写得啰嗦一点只是为了把事情说清楚):

# Read the array from disk
new_data = np.loadtxt('test.txt')

# Note that this returned a 2D array!
print new_data.shape

# However, going back to 3D is easy if we know the 
# original shape of the array
new_data = new_data.reshape((4,5,10))

# Just to check that they're the same...
assert np.all(new_data == data)

If you want to write it to disk so that it will be easy to read back in as a numpy array, look into numpy.save. Pickling it will work fine, as well, but it’s less efficient for large arrays (which yours isn’t, so either is perfectly fine).

If you want it to be human readable, look into numpy.savetxt.
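For completeness, the numpy.save route mentioned above looks roughly like this (a minimal sketch reusing the question's 4x11x14 shape):

import numpy as np

data = np.arange(4 * 11 * 14).reshape((4, 11, 14))
np.save('data.npy', data)          # binary, not human readable, but keeps shape and dtype
loaded = np.load('data.npy')
assert loaded.shape == (4, 11, 14) and np.all(loaded == data)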

Edit: So, it seems like savetxt isn’t quite as great an option for arrays with >2 dimensions… But just to draw everything out to it’s full conclusion:

I just realized that numpy.savetxt chokes on ndarrays with more than 2 dimensions… This is probably by design, as there’s no inherently defined way to indicate additional dimensions in a text file.

E.g. This (a 2D array) works fine

import numpy as np
x = np.arange(20).reshape((4,5))
np.savetxt('test.txt', x)

While the same thing would fail (with a rather uninformative error: TypeError: float argument required, not numpy.ndarray) for a 3D array:

import numpy as np
x = np.arange(200).reshape((4,5,10))
np.savetxt('test.txt', x)

One workaround is just to break the 3D (or greater) array into 2D slices. E.g.

x = np.arange(200).reshape((4,5,10))
with open('test.txt', 'w') as outfile:
    for slice_2d in x:
        np.savetxt(outfile, slice_2d)

However, our goal is to be clearly human readable, while still being easily read back in with numpy.loadtxt. Therefore, we can be a bit more verbose, and differentiate the slices using commented out lines. By default, numpy.loadtxt will ignore any lines that start with # (or whichever character is specified by the comments kwarg). (This looks more verbose than it actually is…)

import numpy as np

# Generate some test data
data = np.arange(200).reshape((4,5,10))

# Write the array to disk
with open('test.txt', 'w') as outfile:
    # I'm writing a header here just for the sake of readability
    # Any line starting with "#" will be ignored by numpy.loadtxt
    outfile.write('# Array shape: {0}\n'.format(data.shape))
    
    # Iterating through a ndimensional array produces slices along
    # the last axis. This is equivalent to data[i,:,:] in this case
    for data_slice in data:

        # The formatting string indicates that I'm writing out
        # the values in left-justified columns 7 characters in width
        # with 2 decimal places.  
        np.savetxt(outfile, data_slice, fmt='%-7.2f')

        # Writing out a break to indicate different slices...
        outfile.write('# New slice\n')

This yields:

# Array shape: (4, 5, 10)
0.00    1.00    2.00    3.00    4.00    5.00    6.00    7.00    8.00    9.00   
10.00   11.00   12.00   13.00   14.00   15.00   16.00   17.00   18.00   19.00  
20.00   21.00   22.00   23.00   24.00   25.00   26.00   27.00   28.00   29.00  
30.00   31.00   32.00   33.00   34.00   35.00   36.00   37.00   38.00   39.00  
40.00   41.00   42.00   43.00   44.00   45.00   46.00   47.00   48.00   49.00  
# New slice
50.00   51.00   52.00   53.00   54.00   55.00   56.00   57.00   58.00   59.00  
60.00   61.00   62.00   63.00   64.00   65.00   66.00   67.00   68.00   69.00  
70.00   71.00   72.00   73.00   74.00   75.00   76.00   77.00   78.00   79.00  
80.00   81.00   82.00   83.00   84.00   85.00   86.00   87.00   88.00   89.00  
90.00   91.00   92.00   93.00   94.00   95.00   96.00   97.00   98.00   99.00  
# New slice
100.00  101.00  102.00  103.00  104.00  105.00  106.00  107.00  108.00  109.00 
110.00  111.00  112.00  113.00  114.00  115.00  116.00  117.00  118.00  119.00 
120.00  121.00  122.00  123.00  124.00  125.00  126.00  127.00  128.00  129.00 
130.00  131.00  132.00  133.00  134.00  135.00  136.00  137.00  138.00  139.00 
140.00  141.00  142.00  143.00  144.00  145.00  146.00  147.00  148.00  149.00 
# New slice
150.00  151.00  152.00  153.00  154.00  155.00  156.00  157.00  158.00  159.00 
160.00  161.00  162.00  163.00  164.00  165.00  166.00  167.00  168.00  169.00 
170.00  171.00  172.00  173.00  174.00  175.00  176.00  177.00  178.00  179.00 
180.00  181.00  182.00  183.00  184.00  185.00  186.00  187.00  188.00  189.00 
190.00  191.00  192.00  193.00  194.00  195.00  196.00  197.00  198.00  199.00 
# New slice

Reading it back in is very easy, as long as we know the shape of the original array. We can just do numpy.loadtxt('test.txt').reshape((4,5,10)). As an example (You can do this in one line, I’m just being verbose to clarify things):

# Read the array from disk
new_data = np.loadtxt('test.txt')

# Note that this returned a 2D array!
print new_data.shape

# However, going back to 3D is easy if we know the 
# original shape of the array
new_data = new_data.reshape((4,5,10))
    
# Just to check that they're the same...
assert np.all(new_data == data)
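
如果不要求人类可读,前面提到的 numpy.save 的用法大致如下(这只是一个最小示意,文件名 test.npy 为假设的示例):

import numpy as np

data = np.arange(200).reshape((4, 5, 10))

# 以 numpy 自带的二进制 .npy 格式保存,形状和 dtype 会一并写入文件
np.save('test.npy', data)

# 读回时不需要知道原始形状,也不需要 reshape
new_data = np.load('test.npy')
assert new_data.shape == (4, 5, 10)
assert np.all(new_data == data)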

回答 1

鉴于我认为您是希望这个文件可以被人类读懂,我不确定这是否满足您的要求;但如果这不是主要考虑,直接用 pickle 就可以了。

要保存它:

import pickle

my_data = {'a': [1, 2.0, 3, 4+6j],
           'b': ('string', u'Unicode string'),
           'c': None}
output = open('data.pkl', 'wb')
pickle.dump(my_data, output)
output.close()

读回:

import pprint, pickle

pkl_file = open('data.pkl', 'rb')

data1 = pickle.load(pkl_file)
pprint.pprint(data1)

pkl_file.close()

I am not certain if this meets your requirements, given I think you are interested in making the file readable by people, but if that’s not a primary concern, just pickle it.

To save it:

import pickle

my_data = {'a': [1, 2.0, 3, 4+6j],
           'b': ('string', u'Unicode string'),
           'c': None}
output = open('data.pkl', 'wb')
pickle.dump(my_data, output)
output.close()

To read it back:

import pprint, pickle

pkl_file = open('data.pkl', 'rb')

data1 = pickle.load(pkl_file)
pprint.pprint(data1)

pkl_file.close()
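
套用到题目里的三维数组上大致是这样(仅作示意,文件名随意):

import pickle
import numpy as np

data = np.arange(200).reshape((4, 5, 10))

# 直接把整个三维数组 dump 到文件里
with open('data.pkl', 'wb') as output:
    pickle.dump(data, output)

# 读回来就是原来的 ndarray,形状也保留了下来
with open('data.pkl', 'rb') as pkl_file:
    data1 = pickle.load(pkl_file)

assert np.all(data1 == data)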

回答 2

如果不需要人类可读的输出,另一种可以尝试的选择是把数组另存为 MATLAB 的 .mat 文件,它是一种结构化数组格式。我鄙视 MATLAB,但只需寥寥几行就能读写 .mat 文件这一点确实很方便。

与 Joe Kington 的回答不同,这样做的好处是你不需要知道 .mat 文件中数据的原始形状,也就是说读取时无需 reshape。而且,与使用 pickle 不同,.mat 文件不仅能被 MATLAB 读取,也能被其他一些程序/语言读取。

这是一个例子:

import numpy as np
import scipy.io

# Some test data
x = np.arange(200).reshape((4,5,10))

# Specify the filename of the .mat file
matfile = 'test_mat.mat'

# Write the array to the mat file. For this to work, the array must be the value
# corresponding to a key name of your choice in a dictionary
scipy.io.savemat(matfile, mdict={'out': x}, oned_as='row')

# For the above line, I specified the kwarg oned_as since python (2.7 with 
# numpy 1.6.1) throws a FutureWarning.  Here, this isn't really necessary 
# since oned_as is a kwarg for dealing with 1-D arrays.

# Now load in the data from the .mat that was just saved
matdata = scipy.io.loadmat(matfile)

# And just to check if the data is the same:
assert np.all(x == matdata['out'])

如果忘记了数组在 .mat 文件中对应的键名,则始终可以执行以下操作:

print matdata.keys()

当然,您可以使用更多键存储许多数组。

因此,是的,它不是肉眼可读的,但只需要两行就能完成数据的写入和读取,我认为这是一个公平的权衡。

看看scipy.io.savematscipy.io.loadmat的文档, 以及本教程页面:scipy.io文件IO教程

If you don’t need a human-readable output, another option you could try is to save the array as a MATLAB .mat file, which is a structured array. I despise MATLAB, but the fact that I can both read and write a .mat in very few lines is convenient.

Unlike Joe Kington’s answer, the benefit of this is that you don’t need to know the original shape of the data in the .mat file, i.e. no need to reshape upon reading in. And, unlike using pickle, a .mat file can be read by MATLAB, and probably some other programs/languages as well.

Here is an example:

import numpy as np
import scipy.io

# Some test data
x = np.arange(200).reshape((4,5,10))

# Specify the filename of the .mat file
matfile = 'test_mat.mat'

# Write the array to the mat file. For this to work, the array must be the value
# corresponding to a key name of your choice in a dictionary
scipy.io.savemat(matfile, mdict={'out': x}, oned_as='row')

# For the above line, I specified the kwarg oned_as since python (2.7 with 
# numpy 1.6.1) throws a FutureWarning.  Here, this isn't really necessary 
# since oned_as is a kwarg for dealing with 1-D arrays.

# Now load in the data from the .mat that was just saved
matdata = scipy.io.loadmat(matfile)

# And just to check if the data is the same:
assert np.all(x == matdata['out'])

If you forget the key that the array is named in the .mat file, you can always do:

print matdata.keys()

And of course you can store many arrays using many more keys.

So yes – it won’t be readable with your eyes, but only takes 2 lines to write and read the data, which I think is a fair trade-off.

Take a look at the docs for scipy.io.savemat and scipy.io.loadmat and also this tutorial page: scipy.io File IO Tutorial


回答 3

ndarray.tofile() 应该也可以

例如,如果您的数组名为 a:

a.tofile('yourfile.txt',sep=" ",format="%s")

不过不确定如何得到换行的格式。

编辑(感谢 Kevin J. Black 在此处的评论):

从 1.5.0 版开始,np.savetxt() 接受可选参数 newline='\n',以允许多行输出。 https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.savetxt.html

ndarray.tofile() should also work

e.g. if your array is called a:

a.tofile('yourfile.txt',sep=" ",format="%s")

Not sure how to get newline formatting though.

Edit (credit Kevin J. Black’s comment here):

Since version 1.5.0, np.savetxt() takes an optional parameter newline='\n' to allow multi-line output. https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.savetxt.html
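
补充一个往返读写的示意(假设还是前面那个 (4,5,10) 的整数数组,文件名仅作示例)。注意 tofile 不会保存形状和 dtype 信息,读回时要自己指定并 reshape:

import numpy as np

a = np.arange(200).reshape((4, 5, 10))

# 以空格分隔的文本形式写出"扁平"的数据
a.tofile('yourfile.txt', sep=" ", format="%s")

# 读回时指定 dtype 和分隔符,再手动 reshape 回原始形状
b = np.fromfile('yourfile.txt', dtype=int, sep=" ").reshape((4, 5, 10))
assert np.all(a == b)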


回答 4

有专门的库可以做到这一点(外加对应的 Python 包装器)。

希望这可以帮助

There exist special libraries to do just that. (Plus wrappers for python)

hope this helps


回答 5

您可以简单地在三个嵌套循环中遍历数组并将其值写入文件。为了阅读,您只需使用完全相同的循环结构即可。您将以正确的顺序获得值,以再次正确填充数组。

You can simply traverse the array in three nested loops and write their values to your file. For reading, you simply use the same exact loop construction. You will get the values in exactly the right order to fill your arrays correctly again.
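
下面是这个思路的一个最小示意(假设数组形状为 (4,5,10),文件名 loops.txt 仅作示例):

import numpy as np

data = np.arange(200).reshape((4, 5, 10))

# 写出:三层嵌套循环,每个数值占一行
with open('loops.txt', 'w') as f:
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            for k in range(data.shape[2]):
                f.write('%d\n' % data[i, j, k])

# 读回:用完全相同的循环结构,按同样的顺序填回数组
new_data = np.empty_like(data)
with open('loops.txt') as f:
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            for k in range(data.shape[2]):
                new_data[i, j, k] = int(f.readline())

assert np.all(new_data == data)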


回答 6

我有一种方法可以使用简单的filename.write()操作。它对我来说很好用,但是我正在处理具有约1500个数据元素的数组。

我基本上只需要for循环来遍历文件,然后以csv样式输出将其逐行写入输出目标。

import numpy as np

trial = np.genfromtxt("/extension/file.txt", dtype = str, delimiter = ",")

# number of columns, taken from the array itself
num_of_columns = trial.shape[1]

with open("/extension/file.txt", "w") as f:
    for x in xrange(len(trial[:,1])):
        for y in range(num_of_columns):
            if y < num_of_columns-1:
                f.write(trial[x][y] + ",")
            elif y == num_of_columns-1:
                f.write(trial[x][y])
        f.write("\n")

if和elif语句用于在数据元素之间添加逗号。无论出于何种原因,当以nd数组形式读取文件时,这些内容都会被删除。我的目标是将文件输出为csv,因此此方法有助于解决该问题。

希望这可以帮助!

I have a way to do it using a simply filename.write() operation. It works fine for me, but I’m dealing with arrays having ~1500 data elements.

I basically just have for loops to iterate through the file and write it to the output destination line-by-line in a csv style output.

import numpy as np

trial = np.genfromtxt("/extension/file.txt", dtype = str, delimiter = ",")

# number of columns, taken from the array itself
num_of_columns = trial.shape[1]

with open("/extension/file.txt", "w") as f:
    for x in xrange(len(trial[:,1])):
        for y in range(num_of_columns):
            if y < num_of_columns-1:
                f.write(trial[x][y] + ",")
            elif y == num_of_columns-1:
                f.write(trial[x][y])
        f.write("\n")

The if and elif statement are used to add commas between the data elements. For whatever reason, these get stripped out when reading the file in as an nd array. My goal was to output the file as a csv, so this method helps to handle that.

Hope this helps!


回答 7

Pickle 最适合这些情况。假设您有一个名为 x_train 的 ndarray。您可以将其转储到文件中,然后使用以下命令将其还原:

import pickle

###Load into file
with open("myfile.pkl","wb") as f:
    pickle.dump(x_train,f)

###Extract from file
with open("myfile.pkl","rb") as f:
    x_temp = pickle.load(f)

Pickle is best for these cases. Suppose you have a ndarray named x_train. You can dump it into a file and revert it back using the following command:

import pickle

###Load into file
with open("myfile.pkl","wb") as f:
    pickle.dump(x_train,f)

###Extract from file
with open("myfile.pkl","rb") as f:
    x_temp = pickle.load(f)

在python中从单元测试输出数据

问题:在python中从单元测试输出数据

如果我用 Python(使用 unittest 模块)编写单元测试,是否可以从失败的测试中输出数据,以便检查这些数据来帮助推断出错的原因?我知道可以创建自定义消息来携带一些信息,但有时你要处理的数据更复杂,不容易表示成字符串。

例如,假设您有一个 Foo 类,并且正在用名为 testdata 的列表中的数据来测试它的 bar 方法:

class TestBar(unittest.TestCase):
    def runTest(self):
        for t1, t2 in testdata:
            f = Foo(t1)
            self.assertEqual(f.bar(t2), 2)

如果测试失败,则可能要输出t1,t2和/或f,以查看为什么此特定数据导致失败。通过输出,我的意思是说,在运行测试之后,可以像访问任何其他变量一样访问这些变量。

If I’m writing unit tests in python (using the unittest module), is it possible to output data from a failed test, so I can examine it to help deduce what caused the error? I am aware of the ability to create a customized message, which can carry some information, but sometimes you might deal with more complex data, that can’t easily be represented as a string.

For example, suppose you had a class Foo, and were testing a method bar, using data from a list called testdata:

class TestBar(unittest.TestCase):
    def runTest(self):
        for t1, t2 in testdata:
            f = Foo(t1)
            self.assertEqual(f.bar(t2), 2)

If the test failed, I might want to output t1, t2 and/or f, to see why this particular data resulted in a failure. By output, I mean that the variables can be accessed like any other variables, after the test has been run.


回答 0

给像我这样来这里寻找简单快速答案的人的一个迟到的回答。

在 Python 2.7 中,您可以使用附加参数 msg 把信息添加到错误消息中,如下所示:

self.assertEqual(f.bar(t2), 2, msg='{0}, {1}'.format(t1, t2))

官方文档在这里

Very late answer for someone that, like me, comes here looking for a simple and quick answer.

In Python 2.7 you could use an additional parameter msg to add information to the error message like this:

self.assertEqual(f.bar(t2), 2, msg='{0}, {1}'.format(t1, t2))

Official docs here
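
放进一个可以直接运行的小例子里大致如下(测试数据纯属示意,第二组数据会失败,并把 t1、t2 的值带进错误信息):

import unittest

class TestBar(unittest.TestCase):
    def test_bar(self):
        testdata = [(1, 1), (2, 3)]
        for t1, t2 in testdata:
            # 断言失败时,msg 的内容会附加到 AssertionError 的消息后面
            self.assertEqual(t1, t2, msg='t1={0}, t2={1}'.format(t1, t2))

if __name__ == '__main__':
    unittest.main()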


回答 1

我们为此使用日志记录模块。

例如:

import sys
import unittest
import logging
class SomeTest( unittest.TestCase ):
    def testSomething( self ):
        log= logging.getLogger( "SomeTest.testSomething" )
        log.debug( "this= %r", self.this )
        log.debug( "that= %r", self.that )
        # etc.
        self.assertEquals( 3.14, pi )

if __name__ == "__main__":
    logging.basicConfig( stream=sys.stderr )
    logging.getLogger( "SomeTest.testSomething" ).setLevel( logging.DEBUG )
    unittest.main()

这使我们可以针对已知失败、并且需要额外调试信息的特定测试打开调试输出。

但是,我的首选方法不是花很多时间进行调试,而是花更多的时间编写更细粒度的测试来揭示问题。

We use the logging module for this.

For example:

import sys
import unittest
import logging
class SomeTest( unittest.TestCase ):
    def testSomething( self ):
        log= logging.getLogger( "SomeTest.testSomething" )
        log.debug( "this= %r", self.this )
        log.debug( "that= %r", self.that )
        # etc.
        self.assertEquals( 3.14, pi )

if __name__ == "__main__":
    logging.basicConfig( stream=sys.stderr )
    logging.getLogger( "SomeTest.testSomething" ).setLevel( logging.DEBUG )
    unittest.main()

That allows us to turn on debugging for specific tests which we know are failing and for which we want additional debugging information.

My preferred method, however, isn’t to spend a lot of time on debugging, but spend it writing more fine-grained tests to expose the problem.


回答 2

您可以使用简单的打印语句,也可以使用其他任何方式写入stdout。您还可以在测试中的任何地方调用Python调试器。

如果你用 nose 来运行测试(我推荐这样做),它会为每个测试收集标准输出,并且只在测试失败时才把它显示给你,因此测试通过时你不必忍受混乱的输出。

nose 还提供一些开关,可以自动显示断言中涉及的变量,或者在测试失败时调用调试器。例如 -s(--nocapture)可以阻止捕获 stdout。

You can use simple print statements, or any other way of writing to stdout. You can also invoke the Python debugger anywhere in your tests.

If you use nose to run your tests (which I recommend), it will collect the stdout for each test and only show it to you if the test failed, so you don’t have to live with the cluttered output when the tests pass.

nose also has switches to automatically show variables mentioned in asserts, or to invoke the debugger on failed tests. For example -s (--nocapture) prevents the capture of stdout.
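
在测试里直接进入 Python 调试器的一个最小示意(在想检查变量的位置手动设置断点即可):

import pdb
import unittest

class TestBar(unittest.TestCase):
    def test_bar(self):
        t1, t2 = 'baz', 3
        # 进入 pdb 之后,可以交互式地查看 t1、t2 等局部变量
        pdb.set_trace()
        self.assertEqual(len(t1), t2)

if __name__ == '__main__':
    # 用 nose 运行时记得加 -s,避免 stdout 被捕获,否则无法与 pdb 交互
    unittest.main()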


回答 3

我认为这并不完全是您要找的东西(没有办法显示那些没有失败的测试中的变量值),但它可以帮助您更接近以所需方式输出结果。

您可以使用TestRunner.run()返回的TestResult对象进行结果分析和处理。特别是TestResult.errors和TestResult.failures

关于TestResults对象:

http://docs.python.org/library/unittest.html#id3

还有一些代码可以指导您正确的方向:

>>> import random
>>> import unittest
>>>
>>> class TestSequenceFunctions(unittest.TestCase):
...     def setUp(self):
...         self.seq = range(5)
...     def testshuffle(self):
...         # make sure the shuffled sequence does not lose any elements
...         random.shuffle(self.seq)
...         self.seq.sort()
...         self.assertEqual(self.seq, range(10))
...     def testchoice(self):
...         element = random.choice(self.seq)
...         error_test = 1/0
...         self.assert_(element in self.seq)
...     def testsample(self):
...         self.assertRaises(ValueError, random.sample, self.seq, 20)
...         for element in random.sample(self.seq, 5):
...             self.assert_(element in self.seq)
...
>>> suite = unittest.TestLoader().loadTestsFromTestCase(TestSequenceFunctions)
>>> testResult = unittest.TextTestRunner(verbosity=2).run(suite)
testchoice (__main__.TestSequenceFunctions) ... ERROR
testsample (__main__.TestSequenceFunctions) ... ok
testshuffle (__main__.TestSequenceFunctions) ... FAIL

======================================================================
ERROR: testchoice (__main__.TestSequenceFunctions)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<stdin>", line 11, in testchoice
ZeroDivisionError: integer division or modulo by zero

======================================================================
FAIL: testshuffle (__main__.TestSequenceFunctions)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<stdin>", line 8, in testshuffle
AssertionError: [0, 1, 2, 3, 4] != [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

----------------------------------------------------------------------
Ran 3 tests in 0.031s

FAILED (failures=1, errors=1)
>>>
>>> testResult.errors
[(<__main__.TestSequenceFunctions testMethod=testchoice>, 'Traceback (most recent call last):\n  File "<stdin>"
, line 11, in testchoice\nZeroDivisionError: integer division or modulo by zero\n')]
>>>
>>> testResult.failures
[(<__main__.TestSequenceFunctions testMethod=testshuffle>, 'Traceback (most recent call last):\n  File "<stdin>
", line 8, in testshuffle\nAssertionError: [0, 1, 2, 3, 4] != [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n')]
>>>

I don’t think this is quite what your looking for, there’s no way to display variable values that don’t fail, but this may help you get closer to outputting the results the way you want.

You can use the TestResult object returned by the TestRunner.run() for results analysis and processing. Particularly, TestResult.errors and TestResult.failures

About the TestResults Object:

http://docs.python.org/library/unittest.html#id3

And some code to point you in the right direction:

>>> import random
>>> import unittest
>>>
>>> class TestSequenceFunctions(unittest.TestCase):
...     def setUp(self):
...         self.seq = range(5)
...     def testshuffle(self):
...         # make sure the shuffled sequence does not lose any elements
...         random.shuffle(self.seq)
...         self.seq.sort()
...         self.assertEqual(self.seq, range(10))
...     def testchoice(self):
...         element = random.choice(self.seq)
...         error_test = 1/0
...         self.assert_(element in self.seq)
...     def testsample(self):
...         self.assertRaises(ValueError, random.sample, self.seq, 20)
...         for element in random.sample(self.seq, 5):
...             self.assert_(element in self.seq)
...
>>> suite = unittest.TestLoader().loadTestsFromTestCase(TestSequenceFunctions)
>>> testResult = unittest.TextTestRunner(verbosity=2).run(suite)
testchoice (__main__.TestSequenceFunctions) ... ERROR
testsample (__main__.TestSequenceFunctions) ... ok
testshuffle (__main__.TestSequenceFunctions) ... FAIL

======================================================================
ERROR: testchoice (__main__.TestSequenceFunctions)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<stdin>", line 11, in testchoice
ZeroDivisionError: integer division or modulo by zero

======================================================================
FAIL: testshuffle (__main__.TestSequenceFunctions)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<stdin>", line 8, in testshuffle
AssertionError: [0, 1, 2, 3, 4] != [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

----------------------------------------------------------------------
Ran 3 tests in 0.031s

FAILED (failures=1, errors=1)
>>>
>>> testResult.errors
[(<__main__.TestSequenceFunctions testMethod=testchoice>, 'Traceback (most recent call last):\n  File "<stdin>"
, line 11, in testchoice\nZeroDivisionError: integer division or modulo by zero\n')]
>>>
>>> testResult.failures
[(<__main__.TestSequenceFunctions testMethod=testshuffle>, 'Traceback (most recent call last):\n  File "<stdin>
", line 8, in testshuffle\nAssertionError: [0, 1, 2, 3, 4] != [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n')]
>>>

回答 4

另一个选择:在测试失败的地方启动调试器。

尝试用 Testoob 运行你的测试(它可以不加修改地运行你的 unittest 套件),当某个测试失败时,你可以用 '--debug' 命令行开关打开调试器。

这是Windows上的终端会话:

C:\work> testoob tests.py --debug
F
Debugging for failure in test: test_foo (tests.MyTests.test_foo)
> c:\python25\lib\unittest.py(334)failUnlessEqual()
-> (msg or '%r != %r' % (first, second))
(Pdb) up
> c:\work\tests.py(6)test_foo()
-> self.assertEqual(x, y)
(Pdb) l
  1     from unittest import TestCase
  2     class MyTests(TestCase):
  3       def test_foo(self):
  4         x = 1
  5         y = 2
  6  ->     self.assertEqual(x, y)
[EOF]
(Pdb)

Another option – start a debugger where the test fails.

Try running your tests with Testoob (it will run your unittest suite without changes), and you can use the '--debug' command line switch to open a debugger when a test fails.

Here’s a terminal session on windows:

C:\work> testoob tests.py --debug
F
Debugging for failure in test: test_foo (tests.MyTests.test_foo)
> c:\python25\lib\unittest.py(334)failUnlessEqual()
-> (msg or '%r != %r' % (first, second))
(Pdb) up
> c:\work\tests.py(6)test_foo()
-> self.assertEqual(x, y)
(Pdb) l
  1     from unittest import TestCase
  2     class MyTests(TestCase):
  3       def test_foo(self):
  4         x = 1
  5         y = 2
  6  ->     self.assertEqual(x, y)
[EOF]
(Pdb)

回答 5

我使用的方法非常简单。我只是将其记录为警告,因此它实际上会显示出来。

import logging

class TestBar(unittest.TestCase):
    def runTest(self):

       #this line is important
       logging.basicConfig()
       log = logging.getLogger("LOG")

       for t1, t2 in testdata:
         f = Foo(t1)
         self.assertEqual(f.bar(t2), 2)
         log.warning(t1)

The method I use is really simple. I just log it as a warning so it will actually show up.

import logging

class TestBar(unittest.TestCase):
    def runTest(self):

       #this line is important
       logging.basicConfig()
       log = logging.getLogger("LOG")

       for t1, t2 in testdata:
         f = Foo(t1)
         self.assertEqual(f.bar(t2), 2)
         log.warning(t1)

回答 6

我想我可能把这个问题想得太复杂了。我想出的一个能做到这件事的办法,就是简单地用一个全局变量来累积诊断数据。

像这样的东西:

log1 = dict()
class TestBar(unittest.TestCase):
    def runTest(self):
        for t1, t2 in testdata:
            f = Foo(t1) 
            if f.bar(t2) != 2: 
                log1["TestBar.runTest"] = (f, t1, t2)
                self.fail("f.bar(t2) != 2")

感谢大家的答复。这些答复给了我一些在 Python 单元测试中记录信息的替代思路。

I think I might have been overthinking this. One way I’ve come up with that does the job, is simply to have a global variable, that accumulates the diagnostic data.

Something like this:

log1 = dict()
class TestBar(unittest.TestCase):
    def runTest(self):
        for t1, t2 in testdata:
            f = Foo(t1) 
            if f.bar(t2) != 2: 
                log1["TestBar.runTest"] = (f, t1, t2)
                self.fail("f.bar(t2) != 2")

Thanks for the replies. They have given me some alternative ideas for how to record information from unit tests in python.


回答 7

使用日志记录:

import unittest
import logging
import inspect
import os

logging_level = logging.INFO

try:
    log_file = os.environ["LOG_FILE"]
except KeyError:
    log_file = None

def logger(stack=None):
    if not hasattr(logger, "initialized"):
        logging.basicConfig(filename=log_file, level=logging_level)
        logger.initialized = True
    if not stack:
        stack = inspect.stack()
    name = stack[1][3]
    try:
        name = stack[1][0].f_locals["self"].__class__.__name__ + "." + name
    except KeyError:
        pass
    return logging.getLogger(name)

def todo(msg):
    logger(inspect.stack()).warning("TODO: {}".format(msg))

def get_pi():
    logger().info("sorry, I know only three digits")
    return 3.14

class Test(unittest.TestCase):

    def testName(self):
        todo("use a better get_pi")
        pi = get_pi()
        logger().info("pi = {}".format(pi))
        todo("check more digits in pi")
        self.assertAlmostEqual(pi, 3.14)
        logger().debug("end of this test")
        pass

用法:

# LOG_FILE=/tmp/log python3 -m unittest LoggerDemo
.
----------------------------------------------------------------------
Ran 1 test in 0.047s

OK
# cat /tmp/log
WARNING:Test.testName:TODO: use a better get_pi
INFO:get_pi:sorry, I know only three digits
INFO:Test.testName:pi = 3.14
WARNING:Test.testName:TODO: check more digits in pi

如果您未设置 LOG_FILE,日志将输出到 stderr。

Use logging:

import unittest
import logging
import inspect
import os

logging_level = logging.INFO

try:
    log_file = os.environ["LOG_FILE"]
except KeyError:
    log_file = None

def logger(stack=None):
    if not hasattr(logger, "initialized"):
        logging.basicConfig(filename=log_file, level=logging_level)
        logger.initialized = True
    if not stack:
        stack = inspect.stack()
    name = stack[1][3]
    try:
        name = stack[1][0].f_locals["self"].__class__.__name__ + "." + name
    except KeyError:
        pass
    return logging.getLogger(name)

def todo(msg):
    logger(inspect.stack()).warning("TODO: {}".format(msg))

def get_pi():
    logger().info("sorry, I know only three digits")
    return 3.14

class Test(unittest.TestCase):

    def testName(self):
        todo("use a better get_pi")
        pi = get_pi()
        logger().info("pi = {}".format(pi))
        todo("check more digits in pi")
        self.assertAlmostEqual(pi, 3.14)
        logger().debug("end of this test")
        pass

Usage:

# LOG_FILE=/tmp/log python3 -m unittest LoggerDemo
.
----------------------------------------------------------------------
Ran 1 test in 0.047s

OK
# cat /tmp/log
WARNING:Test.testName:TODO: use a better get_pi
INFO:get_pi:sorry, I know only three digits
INFO:Test.testName:pi = 3.14
WARNING:Test.testName:TODO: check more digits in pi

If you do not set LOG_FILE, logging will go to stderr.


回答 8

您可以使用 logging模块。

因此,在单元测试代码中,使用:

import logging as log

def test_foo(self):
    log.debug("Some debug message.")
    log.info("Some info message.")
    log.warning("Some warning message.")
    log.error("Some error message.")

默认情况下,警告和错误输出到 /dev/stderr,因此它们应该在控制台上可见。

要自定义日志(例如格式化),请尝试以下示例:

# Set-up logger
if args.verbose or args.debug:
    logging.basicConfig( stream=sys.stdout )
    root = logging.getLogger()
    root.setLevel(logging.INFO if args.verbose else logging.DEBUG)
    ch = logging.StreamHandler(sys.stdout)
    ch.setLevel(logging.INFO if args.verbose else logging.DEBUG)
    ch.setFormatter(logging.Formatter('%(asctime)s %(levelname)s: %(name)s: %(message)s'))
    root.addHandler(ch)
else:
    logging.basicConfig(stream=sys.stderr)

You can use logging module for that.

So in the unit test code, use:

import logging as log

def test_foo(self):
    log.debug("Some debug message.")
    log.info("Some info message.")
    log.warning("Some warning message.")
    log.error("Some error message.")

By default warnings and errors are outputted to /dev/stderr, so they should be visible on the console.

To customize logs (such as formatting), try the following sample:

# Set-up logger
if args.verbose or args.debug:
    logging.basicConfig( stream=sys.stdout )
    root = logging.getLogger()
    root.setLevel(logging.INFO if args.verbose else logging.DEBUG)
    ch = logging.StreamHandler(sys.stdout)
    ch.setLevel(logging.INFO if args.verbose else logging.DEBUG)
    ch.setFormatter(logging.Formatter('%(asctime)s %(levelname)s: %(name)s: %(message)s'))
    root.addHandler(ch)
else:
    logging.basicConfig(stream=sys.stderr)

回答 9

在这些情况下,我的做法是在应用程序里放一些 log.debug() 消息。由于默认的日志级别是 WARNING,这类消息在正常执行时不会显示。

然后,在单元测试中,我将日志记录级别更改为DEBUG,以便在运行它们时显示此类消息。

import logging

log = logging.getLogger(__name__)

log.debug("Some messages to be shown just when debugging or unittesting")

在单元测试中:

# Set log level
loglevel = logging.DEBUG
logging.basicConfig(level=loglevel)



查看完整示例:

这是 daikiri.py,一个用名称和价格实现 Daikiri 的基本类。它有一个方法 make_discount(),会返回该 daikiri 在应用给定折扣后的价格:

import logging

log = logging.getLogger(__name__)

class Daikiri(object):
    def __init__(self, name, price):
        self.name = name
        self.price = price

    def make_discount(self, percentage):
        log.debug("Deducting discount...")  # I want to see this message
        return self.price * percentage

然后,我创建一个单元测试 test_daikiri.py 来检查它的用法:

import unittest
import logging
from .daikiri import Daikiri


class TestDaikiri(unittest.TestCase):
    def setUp(self):
        # Changing log level to DEBUG
        loglevel = logging.DEBUG
        logging.basicConfig(level=loglevel)

        self.mydaikiri = Daikiri("cuban", 25)

    def test_drop_price(self):
        new_price = self.mydaikiri.make_discount(0)
        self.assertEqual(new_price, 0)

if __name__ == "__main__":
    unittest.main()

所以当我执行它时,我得到log.debug消息:

$ python -m test_daikiri
DEBUG:daikiri:Deducting discount...
.
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK

What I do in these cases is to have a log.debug() with some messages in my application. Since the default logging level is WARNING, such messages don’t show in the normal execution.

Then, in the unittest I change the logging level to DEBUG, so that such messages are shown while running them.

import logging

log = logging.getLogger(__name__)

log.debug("Some messages to be shown just when debugging or unittesting")

In the unittests:

# Set log level
loglevel = logging.DEBUG
logging.basicConfig(level=loglevel)



See a full example:

This is daikiri.py, a basic class that implements a Daikiri with its name and price. There is method make_discount() that returns the price of that specific daikiri after applying a given discount:

import logging

log = logging.getLogger(__name__)

class Daikiri(object):
    def __init__(self, name, price):
        self.name = name
        self.price = price

    def make_discount(self, percentage):
        log.debug("Deducting discount...")  # I want to see this message
        return self.price * percentage

Then, I create a unittest test_daikiri.py that checks its usage:

import unittest
import logging
from .daikiri import Daikiri


class TestDaikiri(unittest.TestCase):
    def setUp(self):
        # Changing log level to DEBUG
        loglevel = logging.DEBUG
        logging.basicConfig(level=loglevel)

        self.mydaikiri = Daikiri("cuban", 25)

    def test_drop_price(self):
        new_price = self.mydaikiri.make_discount(0)
        self.assertEqual(new_price, 0)

if __name__ == "__main__":
    unittest.main()

So when I execute it I get the log.debug messages:

$ python -m test_daikiri
DEBUG:daikiri:Deducting discount...
.
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK

回答 10

inspect.trace 可以让您在异常抛出之后获取局部变量。然后,您可以用类似下面的装饰器包装单元测试,把这些局部变量保存下来,以便在事后分析时检查。

import random
import unittest
import inspect


def store_result(f):
    """
    Store the results of a test
    On success, store the return value.
    On failure, store the local variables where the exception was thrown.
    """
    def wrapped(self):
        if 'results' not in self.__dict__:
            self.results = {}
        # If a test throws an exception, store local variables in results:
        try:
            result = f(self)
        except Exception as e:
            self.results[f.__name__] = {'success':False, 'locals':inspect.trace()[-1][0].f_locals}
            raise e
        self.results[f.__name__] = {'success':True, 'result':result}
        return result
    return wrapped

def suite_results(suite):
    """
    Get all the results from a test suite
    """
    ans = {}
    for test in suite:
        if 'results' in test.__dict__:
            ans.update(test.results)
    return ans

# Example:
class TestSequenceFunctions(unittest.TestCase):

    def setUp(self):
        self.seq = range(10)

    @store_result
    def test_shuffle(self):
        # make sure the shuffled sequence does not lose any elements
        random.shuffle(self.seq)
        self.seq.sort()
        self.assertEqual(self.seq, range(10))
        # should raise an exception for an immutable sequence
        self.assertRaises(TypeError, random.shuffle, (1,2,3))
        return {1:2}

    @store_result
    def test_choice(self):
        element = random.choice(self.seq)
        self.assertTrue(element in self.seq)
        return {7:2}

    @store_result
    def test_sample(self):
        x = 799
        with self.assertRaises(ValueError):
            random.sample(self.seq, 20)
        for element in random.sample(self.seq, 5):
            self.assertTrue(element in self.seq)
        return {1:99999}


suite = unittest.TestLoader().loadTestsFromTestCase(TestSequenceFunctions)
unittest.TextTestRunner(verbosity=2).run(suite)

from pprint import pprint
pprint(suite_results(suite))

最后一行会打印测试成功时返回的值,以及测试失败时的局部变量(本例中是 x):

{'test_choice': {'result': {7: 2}, 'success': True},
 'test_sample': {'locals': {'self': <__main__.TestSequenceFunctions testMethod=test_sample>,
                            'x': 799},
                 'success': False},
 'test_shuffle': {'result': {1: 2}, 'success': True}}

Har det gøy :-)

inspect.trace will let you get local variables after an exception has been thrown. You can then wrap the unit tests with a decorator like the following one to save off those local variables for examination during the post mortem.

import random
import unittest
import inspect


def store_result(f):
    """
    Store the results of a test
    On success, store the return value.
    On failure, store the local variables where the exception was thrown.
    """
    def wrapped(self):
        if 'results' not in self.__dict__:
            self.results = {}
        # If a test throws an exception, store local variables in results:
        try:
            result = f(self)
        except Exception as e:
            self.results[f.__name__] = {'success':False, 'locals':inspect.trace()[-1][0].f_locals}
            raise e
        self.results[f.__name__] = {'success':True, 'result':result}
        return result
    return wrapped

def suite_results(suite):
    """
    Get all the results from a test suite
    """
    ans = {}
    for test in suite:
        if 'results' in test.__dict__:
            ans.update(test.results)
    return ans

# Example:
class TestSequenceFunctions(unittest.TestCase):

    def setUp(self):
        self.seq = range(10)

    @store_result
    def test_shuffle(self):
        # make sure the shuffled sequence does not lose any elements
        random.shuffle(self.seq)
        self.seq.sort()
        self.assertEqual(self.seq, range(10))
        # should raise an exception for an immutable sequence
        self.assertRaises(TypeError, random.shuffle, (1,2,3))
        return {1:2}

    @store_result
    def test_choice(self):
        element = random.choice(self.seq)
        self.assertTrue(element in self.seq)
        return {7:2}

    @store_result
    def test_sample(self):
        x = 799
        with self.assertRaises(ValueError):
            random.sample(self.seq, 20)
        for element in random.sample(self.seq, 5):
            self.assertTrue(element in self.seq)
        return {1:99999}


suite = unittest.TestLoader().loadTestsFromTestCase(TestSequenceFunctions)
unittest.TextTestRunner(verbosity=2).run(suite)

from pprint import pprint
pprint(suite_results(suite))

The last line will print the returned values where the test succeeded and the local variables, in this case x, when it fails:

{'test_choice': {'result': {7: 2}, 'success': True},
 'test_sample': {'locals': {'self': <__main__.TestSequenceFunctions testMethod=test_sample>,
                            'x': 799},
                 'success': False},
 'test_shuffle': {'result': {1: 2}, 'success': True}}

Har det gøy :-)


回答 11

捕获断言失败所产生的异常怎么样?在捕获异常的代码块里,您可以把数据以任何想要的方式输出到任何地方。处理完之后再重新抛出该异常。测试运行器多半察觉不到区别。

免责声明:我还没有在 Python 的单元测试框架里试过这种做法,但在其他单元测试框架里试过。

How about catching the exception that gets generated from the assertion failure? In your catch block you could output the data however you wanted to wherever. Then when you were done you could re-throw the exception. The test runner probably wouldn’t know the difference.

Disclaimer: I haven’t tried this with python’s unit test framework but have with other unit test frameworks.
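
这个思路放到 Python 的 unittest 里大致是这样(断言失败抛出的是 AssertionError,数据纯属示意):

import unittest

class TestBar(unittest.TestCase):
    def test_bar(self):
        t1, t2 = 'baba', 2
        try:
            self.assertEqual(len(t1), t2)
        except AssertionError:
            # 在重新抛出之前,把想检查的数据输出到任何地方
            print('t1=%r, t2=%r' % (t1, t2))
            raise

if __name__ == '__main__':
    unittest.main()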


回答 12

虽然我承认还没有尝试过,但 testfixtures 的日志记录功能看起来非常有用……

Admitting that I haven’t tried it, the testfixtures’ logging feature looks quite useful…


回答 13

扩展 @F.C. 的答案,下面这段对我来说效果很好:

class MyTest(unittest.TestCase):
    def messenger(self, message):
        try:
            self.assertEqual(1, 2, msg=message)
        except AssertionError as e:      
            print "\nMESSENGER OUTPUT: %s" % str(e),

Expanding @F.C. ‘s answer, this works quite well for me:

class MyTest(unittest.TestCase):
    def messenger(self, message):
        try:
            self.assertEqual(1, 2, msg=message)
        except AssertionError as e:      
            print "\nMESSENGER OUTPUT: %s" % str(e),

如何引发ValueError?

问题:如何引发ValueError?

我有一段代码,可以找到特定字符在字符串中出现的最大索引,但是当指定的字符没有出现在字符串中时,我希望它引发一个 ValueError。

所以像这样:

contains('bababa', 'k')

会导致:

ValueError: could not find k in bababa

我怎样才能做到这一点?

这是我的函数的当前代码:

def contains(string,char):
  list = []

  for i in range(0,len(string)):
      if string[i] == char:
           list = list + [i]

  return list[-1]

I have this code which finds the largest index of a specific character in a string, however I would like it to raise a ValueError when the specified character does not occur in a string.

So something like this:

contains('bababa', 'k')

would result in a:

ValueError: could not find k in bababa

How can I do this?

Here’s the current code for my function:

def contains(string,char):
  list = []

  for i in range(0,len(string)):
      if string[i] == char:
           list = list + [i]

  return list[-1]

回答 0

raise ValueError('could not find %c in %s' % (ch,str))

raise ValueError('could not find %c in %s' % (ch,str))


回答 1

这是您代码的一个修订版本,它仍然可以正常工作,同时也演示了如何按您想要的方式引发 ValueError。顺便说一句,我认为 find_last() 或 find_last_index() 之类的名字对这个函数更有描述性。更容易造成混淆的一点是,Python 已经有一个名为 __contains__() 的容器对象方法,它做的事情(成员资格测试)略有不同。

def contains(char_string, char):
    largest_index = -1
    for i, ch in enumerate(char_string):
        if ch == char:
            largest_index = i
    if largest_index > -1:  # any found?
        return largest_index  # return index of last one
    else:
        raise ValueError('could not find {!r} in {!r}'.format(char, char_string))

print(contains('mississippi', 's'))  # -> 6
print(contains('bababa', 'k'))  # ->
Traceback (most recent call last):
  File "how-to-raise-a-valueerror.py", line 15, in <module>
    print(contains('bababa', 'k'))
  File "how-to-raise-a-valueerror.py", line 12, in contains
    raise ValueError('could not find {} in {}'.format(char, char_string))
ValueError: could not find 'k' in 'bababa'

更新:一种简单得多的方法

哇!这是一个简洁得多的版本(本质上就是一行代码),而且可能也更快,因为它先把字符串反转(通过 [::-1]),再用快速的内置字符串方法 index() 在反转后的字符串中正向查找第一个匹配的字符。就您的实际问题而言,使用 index() 还附带一个小小的便利:当找不到该字符时它本身就会引发 ValueError,因此不需要再做任何额外的事情。

下面是代码,连同一个快速的单元测试:

def contains(char_string, char):
    #  Ending - 1 adjusts returned index to account for searching in reverse.
    return len(char_string) - char_string[::-1].index(char) - 1

print(contains('mississippi', 's'))  # -> 6
print(contains('bababa', 'k'))  # ->
Traceback (most recent call last):
  File "better-way-to-raise-a-valueerror.py", line 9, in <module>
    print(contains('bababa', 'k'))
  File "better-way-to-raise-a-valueerror", line 6, in contains
    return len(char_string) - char_string[::-1].index(char) - 1
ValueError: substring not found

Here’s a revised version of your code which still works plus it illustrates how to raise a ValueError the way you want. By-the-way, I think find_last(), find_last_index(), or something simlar would be a more descriptive name for this function. Adding to the possible confusion is the fact that Python already has a container object method named __contains__() that does something a little different, membership-testing-wise.

def contains(char_string, char):
    largest_index = -1
    for i, ch in enumerate(char_string):
        if ch == char:
            largest_index = i
    if largest_index > -1:  # any found?
        return largest_index  # return index of last one
    else:
        raise ValueError('could not find {!r} in {!r}'.format(char, char_string))

print(contains('mississippi', 's'))  # -> 6
print(contains('bababa', 'k'))  # ->
Traceback (most recent call last):
  File "how-to-raise-a-valueerror.py", line 15, in <module>
    print(contains('bababa', 'k'))
  File "how-to-raise-a-valueerror.py", line 12, in contains
    raise ValueError('could not find {} in {}'.format(char, char_string))
ValueError: could not find 'k' in 'bababa'

Update — A substantially simpler way

Wow! Here’s a much more concise version—essentially a one-liner—that is also likely faster because it reverses (via [::-1]) the string before doing a forward search through it for the first matching character and it does so using the fast built-in string index() method. With respect to your actual question, a nice little bonus convenience that comes with using index() is that it already raises a ValueError when the character substring isn’t found, so nothing additional is required to make that happen.

Here it is along with a quick unit test:

def contains(char_string, char):
    #  Ending - 1 adjusts returned index to account for searching in reverse.
    return len(char_string) - char_string[::-1].index(char) - 1

print(contains('mississippi', 's'))  # -> 6
print(contains('bababa', 'k'))  # ->
Traceback (most recent call last):
  File "better-way-to-raise-a-valueerror.py", line 9, in <module>
    print(contains('bababa', 'k'))
  File "better-way-to-raise-a-valueerror", line 6, in contains
    return len(char_string) - char_string[::-1].index(char) - 1
ValueError: substring not found

回答 2

>>> def contains(string, char):
...     for i in xrange(len(string) - 1, -1, -1):
...         if string[i] == char:
...             return i
...     raise ValueError("could not find %r in %r" % (char, string))
...
>>> contains('bababa', 'k')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in contains
ValueError: could not find 'k' in 'bababa'
>>> contains('bababa', 'a')
5
>>> contains('bababa', 'b')
4
>>> contains('xbababa', 'x')
0
>>>
>>> def contains(string, char):
...     for i in xrange(len(string) - 1, -1, -1):
...         if string[i] == char:
...             return i
...     raise ValueError("could not find %r in %r" % (char, string))
...
>>> contains('bababa', 'k')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in contains
ValueError: could not find 'k' in 'bababa'
>>> contains('bababa', 'a')
5
>>> contains('bababa', 'b')
4
>>> contains('xbababa', 'x')
0
>>>

回答 3

>>> response = 'bababa'
>>> if "k" not in response:
...     raise ValueError("Not found")
>>> response = 'bababa'
>>> if "k" not in response:
...     raise ValueError("Not found")