Tag Archives: Python

Plotting a normal distribution in Python

Question: Plotting a normal distribution in Python

Given a mean and a variance, is there a simple function call which will plot a normal distribution?


Answer 0

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import math

mu = 0
variance = 1
sigma = math.sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.plot(x, stats.norm.pdf(x, mu, sigma))
plt.show()



Answer 1

I don’t think there is a function that does all that in a single call. However, you can find the Gaussian probability density function in scipy.stats.

So the simplest way I could come up with is:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Plot between -10 and 10 with .001 steps.
x_axis = np.arange(-10, 10, 0.001)
# Mean = 0, SD = 2.
plt.plot(x_axis, norm.pdf(x_axis,0,2))
plt.show()



Answer 2

Use seaborn instead. Here I am using seaborn’s distplot on 1000 values with mean=5 and std=3:

import numpy as np
import seaborn as sns
value = np.random.normal(loc=5, scale=3, size=1000)
sns.distplot(value)

You will get a normal distribution curve
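Note that in recent seaborn releases (0.11 and later) distplot is deprecated; histplot (or displot) is its replacement. A small sketch of the equivalent call:

import numpy as np
import seaborn as sns

value = np.random.normal(loc=5, scale=3, size=1000)
sns.histplot(value, kde=True)  # histogram plus KDE curve, similar to what distplot used to draw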


Answer 3

Unutbu’s answer is correct. But because our mean can be more or less than zero, I would still like to change this:

x = np.linspace(-3 * sigma, 3 * sigma, 100)

to this :

x = np.linspace(-3 * sigma + mean, 3 * sigma + mean, 100)
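Putting it together, a minimal sketch based on the snippet from Answer 0, using an arbitrary nonzero mean:

import math
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

mean = 2.0
variance = 1.5
sigma = math.sqrt(variance)
# center the x range on the mean so the whole curve stays visible
x = np.linspace(-3 * sigma + mean, 3 * sigma + mean, 100)
plt.plot(x, stats.norm.pdf(x, mean, sigma))
plt.show()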

Answer 4

If you prefer to use a step-by-step approach, you could consider a solution like the following:

import numpy as np
import matplotlib.pyplot as plt

mean = 0; std = 1; variance = np.square(std)
x = np.arange(-5,5,.01)
f = np.exp(-np.square(x-mean)/(2*variance))/(np.sqrt(2*np.pi*variance))

plt.plot(x,f)
plt.ylabel('gaussian distribution')
plt.show()

Answer 5

I have just come back to this and I had to install scipy, as matplotlib.mlab gave me the error message MatplotlibDeprecationWarning: scipy.stats.norm.pdf when trying the example above. So the sample is now:

%matplotlib inline
import math
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats


mu = 0
variance = 1
sigma = math.sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.plot(x, scipy.stats.norm.pdf(x, mu, sigma))

plt.show()

Answer 6

I believe it is important to be able to set the height, so I created this function:

def my_gauss(x, sigma=1, h=1, mid=0):
    from math import exp, pow
    variance = pow(sigma, 2)
    return h * exp(-pow(x-mid, 2)/(2*variance))

Where sigma is the standard deviation, h is the height and mid is the mean.

Here is the result using different heights and deviations:
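A minimal usage sketch (my own example values, not from the original answer) that plots the function for a couple of heights and deviations:

import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(-5, 5, 200)
for sigma, h in [(1, 1), (2, 0.5)]:
    ys = [my_gauss(x, sigma=sigma, h=h) for x in xs]
    plt.plot(xs, ys, label="sigma={}, h={}".format(sigma, h))
plt.legend()
plt.show()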


Answer 7

You can get the CDF easily, so you can get the PDF via the CDF:

    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.interpolate
    import scipy.stats

    def setGridLine(ax):
        #http://jonathansoma.com/lede/data-studio/matplotlib/adding-grid-lines-to-a-matplotlib-chart/
        ax.set_axisbelow(True)
        ax.minorticks_on()
        ax.grid(which='major', linestyle='-', linewidth=0.5, color='grey')
        ax.grid(which='minor', linestyle=':', linewidth=0.5, color='#a6a6a6')
        ax.tick_params(which='both', # Options for both major and minor ticks
                        top=False, # turn off top ticks
                        left=False, # turn off left ticks
                        right=False,  # turn off right ticks
                        bottom=False) # turn off bottom ticks

    data1 = np.random.normal(0,1,1000000)
    x=np.sort(data1)
    y=np.arange(x.shape[0])/(x.shape[0]+1)

    f2 = scipy.interpolate.interp1d(x, y,kind='linear')
    x2 = np.linspace(x[0],x[-1],1001)
    y2 = f2(x2)

    y2b = np.diff(y2)/np.diff(x2)
    x2b=(x2[1:]+x2[:-1])/2.

    f3 = scipy.interpolate.interp1d(x, y,kind='cubic')
    x3 = np.linspace(x[0],x[-1],1001)
    y3 = f3(x3)

    y3b = np.diff(y3)/np.diff(x3)
    x3b=(x3[1:]+x3[:-1])/2.

    bins=np.arange(-4,4,0.1)
    bins_centers=0.5*(bins[1:]+bins[:-1])
    cdf = scipy.stats.norm.cdf(bins_centers)
    pdf = scipy.stats.norm.pdf(bins_centers)

    plt.rcParams["font.size"] = 18
    fig, ax = plt.subplots(3,1,figsize=(10,16))
    ax[0].set_title("cdf")
    ax[0].plot(x,y,label="data")
    ax[0].plot(x2,y2,label="linear")
    ax[0].plot(x3,y3,label="cubic")
    ax[0].plot(bins_centers,cdf,label="ans")

    ax[1].set_title("pdf:linear")
    ax[1].plot(x2b,y2b,label="linear")
    ax[1].plot(bins_centers,pdf,label="ans")

    ax[2].set_title("pdf:cubic")
    ax[2].plot(x3b,y3b,label="cubic")
    ax[2].plot(bins_centers,pdf,label="ans")

    for idx in range(3):
        ax[idx].legend()
        setGridLine(ax[idx])

    plt.show()
    plt.clf()
    plt.close()

Print a very long string completely in a pandas DataFrame

Question: Print a very long string completely in a pandas DataFrame

I am struggling with a seemingly very simple thing. I have a pandas data frame containing a very long string.

df = pd.DataFrame({'one' : ['one', 'two', 
      'This is very long string very long string very long string veryvery long string']})

Now when I try to print the same, I do not see the full string; rather, I see only part of the string.

I tried the following options:

  • using print(df.iloc[2])
  • using to_html
  • using to_string
  • One of the stackoverflow answers suggested increasing the column width by using the pandas display option, but that did not work either.
  • I also did not get how set_printoptions will help me.

Any ideas appreciated. Looks very simple, but not able to get it!


Answer 0

You can use options.display.max_colwidth to specify you want to see more in the default representation:

In [2]: df
Out[2]:
                                                 one
0                                                one
1                                                two
2  This is very long string very long string very...

In [3]: pd.options.display.max_colwidth
Out[3]: 50

In [4]: pd.options.display.max_colwidth = 100

In [5]: df
Out[5]:
                                                                               one
0                                                                              one
1                                                                              two
2  This is very long string very long string very long string veryvery long string

And indeed, if you just want to inspect the one value, by accessing it (as a scalar, not as a row as df.iloc[2] does) you also see the full string:

In [7]: df.iloc[2,0]    # or df.loc[2,'one']
Out[7]: 'This is very long string very long string very long string veryvery long string'

Answer 1

Use pd.set_option('display.max_colwidth', -1) for automatic linebreaks and multi-line cells.

This is a great resource on how to use Jupyter’s display with pandas to the fullest.
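Note that in pandas 1.0 and later, -1 is deprecated for this option; None is the documented way to remove the limit:

pd.set_option('display.max_colwidth', None)  # no truncation of cell contents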


Answer 2

Another, pretty simple approach is to call list function:

list(df['one'][2:3])
# output:
['This is very long string very long string very long string veryvery long string']

Needless to say, it is not good to convert whole columns to lists this way, but for a single line, why not?


Answer 3

Another easier way to print the whole string is to call values on the dataframe.

df = pd.DataFrame({'one' : ['one', 'two', 
      'This is very long string very long string very long string veryvery long string']})

print(df.values)

The Output will be

[['one']
 ['two']
 ['This is very long string very long string very long string veryvery long string']]

Answer 4

Is this what you meant to do?

In [7]: x =  pd.DataFrame({'one' : ['one', 'two', 'This is very long string very long string very long string veryvery long string']})

In [8]: x
Out[8]: 
                                                 one
0                                                one
1                                                two
2  This is very long string very long string very...

In [9]: x['one'][2]
Out[9]: 'This is very long string very long string very long string veryvery long string'

Answer 5

The way I often deal with the situation you describe is to use the .to_csv() method and write to stdout:

import sys

df.to_csv(sys.stdout)

Update: it should now be possible to just use None instead of sys.stdout with similar effect!
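For example, a small sketch of the None variant (to_csv without a path returns the CSV text as a string):

print(df.to_csv(None))  # same content as df.to_csv(sys.stdout), returned as a string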

This should dump the whole dataframe, including the entirety of any strings. You can use the to_csv parameters to configure column separators, whether the index is printed, etc. It will be less pretty than rendering it properly though.

I posted this originally in answer to the somewhat-related question at Output data from all columns in a dataframe in pandas


Answer 6

Just add the following line to your code before print.

 pd.options.display.max_colwidth = 90  # set a value as your need

You can simply do the following steps for setting other additional options,

  • You can change the options for pandas max_columns feature as follows to display more columns

    import pandas as pd
    pd.options.display.max_columns = 10
    

    (this allows 10 columns to display, you can change this as you need)

  • Similarly, you can change the number of rows to display as follows

    pd.options.display.max_rows = 999
    

    (this allows to print 999 rows at a time)

This should work fine.

Please refer to the docs to change more pandas options/settings.


Answer 7

I have created a small utility function; this works well for me:

def display_text_max_col_width(df, width):
    with pd.option_context('display.max_colwidth', width):
        print(df)

display_text_max_col_width(train_df["Description"], 800)

I can change the width as per my requirement, without setting any option permanently.


Answer 8

If you’re using a jupyter notebook, you can also print the pandas dataframe as an HTML table, which will print full strings.

from IPython.display import display, HTML
display(HTML(df.to_html()))

Output

    one
0   one
1   two
2   This is very long string very long string very long string veryvery long string

Ordering of batch normalization and dropout?

Question: Ordering of batch normalization and dropout?

The original question was in regard to TensorFlow implementations specifically. However, the answers are for implementations in general. This general answer is also the correct answer for TensorFlow.

When using batch normalization and dropout in TensorFlow (specifically using the contrib.layers) do I need to be worried about the ordering?

It seems possible that if I use dropout followed immediately by batch normalization there might be trouble. For example, if the shift in the batch normalization trains to the larger scale numbers of the training outputs, but then that same shift is applied to the smaller (due to the compensation for having more outputs) scale numbers without dropout during testing, then that shift may be off. Does the TensorFlow batch normalization layer automatically compensate for this? Or does this not happen for some reason I’m missing?

Also, are there other pitfalls to look out for when using these two together? For example, assuming I’m using them in the correct order in regards to the above (assuming there is a correct order), could there be trouble with using both batch normalization and dropout on multiple successive layers? I don’t immediately see a problem with that, but I might be missing something.

Thank you much!

UPDATE:

An experimental test seems to suggest that ordering does matter. I ran the same network twice with only the batch norm and dropout reverse. When the dropout is before the batch norm, validation loss seems to be going up as training loss is going down. They’re both going down in the other case. But in my case the movements are slow, so things may change after more training and it’s just a single test. A more definitive and informed answer would still be appreciated.


Answer 0

In Ioffe and Szegedy (2015), the authors state that “we would like to ensure that for any parameter values, the network always produces activations with the desired distribution”. So the Batch Normalization layer is actually inserted right after a Conv layer/Fully Connected layer, but before feeding into the ReLu (or any other kind of) activation. See this video at around the 53 minute mark for more details.

As far as dropout goes, I believe dropout is applied after the activation layer. In the dropout paper figure 3b, the dropout factor/probability matrix r(l) for hidden layer l is applied to y(l), where y(l) is the result after applying the activation function f.

So in summary, the order of using batch normalization and dropout is:

-> CONV/FC -> BatchNorm -> ReLu(or other activation) -> Dropout -> CONV/FC ->
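As an illustration only (not code from the answer), a minimal tf.keras sketch of one block in that order; the filter counts and dropout rate are arbitrary:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), padding='same'),  # CONV
    layers.BatchNormalization(),                # BatchNorm
    layers.Activation('relu'),                  # ReLu (or other activation)
    layers.Dropout(0.25),                       # Dropout
    layers.Conv2D(64, (3, 3), padding='same'),  # next CONV/FC
])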


Answer 1

As noted in the comments, an amazing resource to read up on the order of layers is here. I have gone through the comments and it is the best resource on this topic I have found on the internet.

My 2 cents:

Dropout is meant to block information from certain neurons completely to make sure the neurons do not co-adapt. So, the batch normalization has to be after dropout otherwise you are passing information through normalization statistics.

If you think about it, in typical ML problems, this is the reason we don’t compute mean and standard deviation over entire data and then split it into train, test and validation sets. We split and then compute the statistics over the train set and use them to normalize and center the validation and test datasets

So I suggest Scheme 1 (this takes pseudomarvin’s comment on the accepted answer into consideration):

-> CONV/FC -> ReLu(or other activation) -> Dropout -> BatchNorm -> CONV/FC

as opposed to Scheme 2

-> CONV/FC -> BatchNorm -> ReLu(or other activation) -> Dropout -> CONV/FC (as in the accepted answer)

Please note that this means the network under Scheme 2 should show more over-fitting than the network under Scheme 1, but the OP ran some tests as mentioned in the question and they support Scheme 2.


Answer 2

Usually, just drop the Dropout (when you have BN):

  • “BN eliminates the need for Dropout in some cases cause BN provides similar regularization benefits as Dropout intuitively”
  • “Architectures like ResNet, DenseNet, etc. not using Dropout”

For more details, refer to this paper [Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift] as already mentioned by @Haramoz in the comments.


Answer 3

I found a paper that explains the disharmony between Dropout and Batch Norm(BN). The key idea is what they call the “variance shift”. This is due to the fact that dropout has a different behavior between training and testing phases, which shifts the input statistics that BN learns. The main idea can be found in this figure which is taken from this paper.

A small demo for this effect can be found in this notebook.


Answer 4

Based on the research paper, for better performance we should use BN before applying Dropout.


Answer 5

The correct order is: Conv > Normalization > Activation > Dropout > Pooling


Answer 6

Conv – Activation – DropOut – BatchNorm – Pool –> Test_loss: 0.04261355847120285

Conv – Activation – DropOut – Pool – BatchNorm –> Test_loss: 0.050065308809280396

Conv – Activation – BatchNorm – Pool – DropOut –> Test_loss: 0.04911309853196144

Conv – Activation – BatchNorm – DropOut – Pool –> Test_loss: 0.06809622049331665

Conv – BatchNorm – Activation – DropOut – Pool –> Test_loss: 0.038886815309524536

Conv – BatchNorm – Activation – Pool – DropOut –> Test_loss: 0.04126095026731491

Conv – BatchNorm – DropOut – Activation – Pool –> Test_loss: 0.05142546817660332

Conv – DropOut – Activation – BatchNorm – Pool –> Test_loss: 0.04827788099646568

Conv – DropOut – Activation – Pool – BatchNorm –> Test_loss: 0.04722036048769951

Conv – DropOut – BatchNorm – Activation – Pool –> Test_loss: 0.03238215297460556


Trained on the MNIST dataset (20 epochs) with 2 convolutional modules (see below), followed each time with

model.add(Flatten())
model.add(layers.Dense(512, activation="elu"))
model.add(layers.Dense(10, activation="softmax"))

The convolutional layers have a kernel size of (3,3) and default padding; the activation is elu. The pooling is a MaxPooling with a pool size of (2,2). The loss is categorical_crossentropy and the optimizer is adam.

The corresponding Dropout probabilities are 0.2 and 0.3, respectively. The numbers of feature maps are 32 and 64, respectively.

Edit: When I dropped the Dropout, as recommended in some answers, it converged faster but had a worse generalization ability than when I used BatchNorm and Dropout.
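For reference, a sketch of how one of the tested variants (Conv - BatchNorm - Activation - DropOut - Pool) could be assembled from the hyper-parameters described above; this is a reconstruction under those assumptions, not the author's original code:

from tensorflow.keras import layers, models

model = models.Sequential()
for filters, rate in [(32, 0.2), (64, 0.3)]:   # the two convolutional modules
    model.add(layers.Conv2D(filters, (3, 3)))  # kernel size (3,3), default padding
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('elu'))
    model.add(layers.Dropout(rate))
    model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(512, activation="elu"))
model.add(layers.Dense(10, activation="softmax"))
# input shape (e.g. (28, 28, 1) for MNIST) is inferred when the model is first fit
model.compile(optimizer="adam", loss="categorical_crossentropy")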


Answer 7

ConV/FC – BN – Sigmoid/tanh – dropout. If the activation function is ReLU or otherwise, the order of normalization and dropout depends on your task.


Answer 8

I read the recommended papers in the answer and comments from https://stackoverflow.com/a/40295999/8625228

From Ioffe and Szegedy (2015)’s point of view, only use BN in the network structure. Li et al. (2018) give the statistical and experimental analyses, that there is a variance shift when the practitioners use Dropout before BN. Thus, Li et al. (2018) recommend applying Dropout after all BN layers.

From Ioffe and Szegedy (2015)’s point of view, BN is located inside/before the activation function. However, Chen et al. (2019) use an IC layer which combines dropout and BN, and Chen et al. (2019) recommend using BN after ReLU.

To be on the safe side, I use only Dropout or only BN in the network.

Chen, Guangyong, Pengfei Chen, Yujun Shi, Chang-Yu Hsieh, Benben Liao, and Shengyu Zhang. 2019. “Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks.” CoRR abs/1905.05928. http://arxiv.org/abs/1905.05928.

Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” CoRR abs/1502.03167. http://arxiv.org/abs/1502.03167.

Li, Xiang, Shuo Chen, Xiaolin Hu, and Jian Yang. 2018. “Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift.” CoRR abs/1801.05134. http://arxiv.org/abs/1801.05134.


How to unnest (explode) a column in a pandas DataFrame?

Question: How to unnest (explode) a column in a pandas DataFrame?

I have the following DataFrame where one of the columns is an object (list type cell):

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})
df
Out[458]: 
   A       B
0  1  [1, 2]
1  2  [1, 2]

My expected output is:

   A  B
0  1  1
1  1  2
3  2  1
4  2  2

What should I do to achieve this?


Related question

pandas: When cell contents are lists, create a row for each element in the list

Good question and answer, but it only handles one column with a list (in my answer the self-defined function will work for multiple columns; also, the accepted answer uses the most time-consuming apply, which is not recommended; for more info see When should I ever want to use pandas apply() in my code?)


Answer 0

As someone who uses both R and python, I have seen this type of question a couple of times.

In R, they have a built-in function called unnest from the tidyr package. But in Python (pandas) there is no built-in function for this type of question.

I know object column types make the data hard to convert with a pandas function. When I received data like this, the first thing that came to mind was to ‘flatten’ or unnest the columns.

I am using pandas and python functions for this type of question. If you are worried about the speed of the above solutions, check user3483203’s answer, since it’s using numpy and most of the time numpy is faster. I recommend Cython and numba if speed matters.


Method 0 [pandas >= 0.25]
Starting from pandas 0.25, if you only need to explode one column, you can use the pandas.DataFrame.explode function:

df.explode('B')

       A  B
    0  1  1
    1  1  2
    0  2  1
    1  2  2

Given a dataframe with an empty list or a NaN in the column: an empty list will not cause an issue, but a NaN will need to be filled with a list.

df = pd.DataFrame({'A': [1, 2, 3, 4],'B': [[1, 2], [1, 2], [], np.nan]})
df.B = df.B.fillna({i: [] for i in df.index})  # replace NaN with []
df.explode('B')

   A    B
0  1    1
0  1    2
1  2    1
1  2    2
2  3  NaN
3  4  NaN

Method 1
apply + pd.Series (easy to understand, but not recommended in terms of performance)

df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
Out[463]: 
   A  B
0  1  1
1  1  2
0  2  1
1  2  2

Method 2
Using repeat with the DataFrame constructor, re-create your dataframe (good for performance, not good for multiple columns)

df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
df
Out[465]: 
   A  B
0  1  1
0  1  2
1  2  1
1  2  2

Method 2.1
For example, besides A we have A.1 ... A.n. If we still use the method above (Method 2), it is hard for us to re-create the columns one by one.

Solution: join or merge with the index after ‘unnesting’ the single column

s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
s.join(df.drop('B',1),how='left')
Out[477]: 
   B  A
0  1  1
0  2  1
1  1  2
1  2  2

If you need the column order exactly the same as before, add reindex at the end.

s.join(df.drop('B',1),how='left').reindex(columns=df.columns)

Method 3
recreate the list

pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
Out[488]: 
   A  B
0  1  1
1  1  2
2  2  1
3  2  2

If more than two columns, use

s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
s.merge(df,left_on=0,right_index=True)
Out[491]: 
   0  1  A       B
0  0  1  1  [1, 2]
1  0  2  1  [1, 2]
2  1  1  2  [1, 2]
3  1  2  2  [1, 2]

Method 4
using reindex or loc

df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]: 
   A  B
0  1  1
0  1  2
1  2  1
1  2  2

#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))

Method 5
when the list only contains unique values:

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]: 
   B  A
0  1  1
1  2  1
2  3  2
3  4  2

Method 6
using numpy for high performance:

newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
pd.DataFrame(data=newvalues[0],columns=df.columns)
   A  B
0  1  1
1  1  2
2  2  1
3  2  2

Method 7
using base function itertools cycle and chain: Pure python solution just for fun

from itertools import cycle,chain
l=df.values.tolist()
l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
   A  B
0  1  1
1  1  2
2  2  1
3  2  2

Generalizing to multiple columns

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
df
Out[592]: 
   A       B       C
0  1  [1, 2]  [1, 2]
1  2  [3, 4]  [3, 4]

Self-def function:

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(explode, 1), how='left')

        
unnesting(df,['B','C'])
Out[609]: 
   B  C  A
0  1  1  1
0  2  2  1
1  3  3  2
1  4  4  2

Column-wise Unnesting

All of the methods above are about vertical unnesting and exploding. If you do need to expand the list horizontally, check with the pd.DataFrame constructor:

df.join(pd.DataFrame(df.B.tolist(),index=df.index).add_prefix('B_'))
Out[33]: 
   A       B       C  B_0  B_1
0  1  [1, 2]  [1, 2]    1    2
1  2  [3, 4]  [3, 4]    3    4

Updated function

def unnesting(df, explode, axis):
    if axis==1:
        idx = df.index.repeat(df[explode[0]].str.len())
        df1 = pd.concat([
            pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
        df1.index = idx

        return df1.join(df.drop(explode, 1), how='left')
    else :
        df1 = pd.concat([
                         pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(explode, 1), how='left')

Test Output

unnesting(df, ['B','C'], axis=0)
Out[36]: 
   B0  B1  C0  C1  A
0   1   2   1   2  1
1   3   4   3   4  2

Answer 1

Option 1

If all of the sublists in the other column are the same length, numpy can be an efficient option here:

vals = np.array(df.B.values.tolist())    
a = np.repeat(df.A, vals.shape[1])

pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

   A  B
0  1  1
1  1  2
2  2  1
3  2  2

Option 2

If the sublists have different length, you need an additional step:

vals = df.B.values.tolist()
rs = [len(r) for r in vals]    
a = np.repeat(df.A, rs)

pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)

   A  B
0  1  1
1  1  2
2  2  1
3  2  2

Option 3

I took a shot at generalizing this to flatten N columns and tile M columns; I’ll work later on making it more efficient:

df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
                   'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})

   A          B          C  D
0  1     [1, 2]  [1, 2, 3]  A
1  2  [1, 2, 3]     [1, 2]  B
2  3        [1]     [1, 2]  C

def unnest(df, tile, explode):
    vals = df[explode].sum(1)
    rs = [len(r) for r in vals]
    a = np.repeat(df[tile].values, rs, axis=0)
    b = np.concatenate(vals.values)
    d = np.column_stack((a, b))
    return pd.DataFrame(d, columns = tile +  ['_'.join(explode)])

unnest(df, ['A', 'D'], ['B', 'C'])

    A  D B_C
0   1  A   1
1   1  A   2
2   1  A   1
3   1  A   2
4   1  A   3
5   2  B   1
6   2  B   2
7   2  B   3
8   2  B   1
9   2  B   2
10  3  C   1
11  3  C   1
12  3  C   2

Functions

def wen1(df):
    return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

def wen2(df):
    return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})

def wen3(df):
    s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
    return s.join(df.drop('B', 1), how='left')

def wen4(df):
    return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)

def chris1(df):
    vals = np.array(df.B.values.tolist())
    a = np.repeat(df.A, vals.shape[1])
    return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

def chris2(df):
    vals = df.B.values.tolist()
    rs = [len(r) for r in vals]
    a = np.repeat(df.A.values, rs)
    return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)

Timings

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit

res = pd.DataFrame(
       index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
       columns=[10, 50, 100, 500, 1000, 5000, 10000],
       dtype=float
)

for f in res.index:
    for c in res.columns:
        df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
        df = pd.concat([df]*c)
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")

Performance


Answer 2

Exploding a list-like column has been simplified significantly in pandas 0.25 with the addition of the explode() method:

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
df.explode('B')

Out:

   A  B
0  1  1
0  1  2
1  2  1
1  2  2

Answer 3

One alternative is to apply the meshgrid recipe over the rows of the columns to unnest:

import numpy as np
import pandas as pd


def unnest(frame, explode):
    def mesh(values):
        return np.array(np.meshgrid(*values)).T.reshape(-1, len(values))

    data = np.vstack([mesh(row) for row in frame[explode].values])
    return pd.DataFrame(data=data, columns=explode)


df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
print(unnest(df, ['A', 'B']))  # base
print()

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
print(unnest(df, ['A', 'B', 'C']))  # multiple columns
print()

df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [1, 2, 3], [1]],
                   'C': [[1, 2, 3], [1, 2], [1, 2]], 'D': ['A', 'B', 'C']})

print(unnest(df, ['A', 'B']))  # uneven length lists
print()
print(unnest(df, ['D', 'B']))  # different types
print()

Output

   A  B
0  1  1
1  1  2
2  2  1
3  2  2

   A  B  C
0  1  1  1
1  1  2  1
2  1  1  2
3  1  2  2
4  2  3  3
5  2  4  3
6  2  3  4
7  2  4  4

   A  B
0  1  1
1  1  2
2  2  1
3  2  2
4  2  3
5  3  1

   D  B
0  A  1
1  A  2
2  B  1
3  B  2
4  B  3
5  C  1

Answer 4

My 5 cents:

df[['B', 'B2']] = pd.DataFrame(df['B'].values.tolist())

df[['A', 'B']].append(df[['A', 'B2']].rename(columns={'B2': 'B'}),
                      ignore_index=True)

and another 5

df[['B1', 'B2']] = pd.DataFrame([*df['B']]) # if values.tolist() is too boring

(pd.wide_to_long(df.drop('B', 1), 'B', 'A', '')
 .reset_index(level=1, drop=True)
 .reset_index())

both resulting in the same

   A  B
0  1  1
1  2  1
2  1  2
3  2  2

Answer 5

Because normally the sublist lengths are different and join/merge is far more computationally expensive, I retested the methods for different-length sublists and more normal columns.

MultiIndex should also be an easier way to write, and it has nearly the same performance as the numpy way.

Surprisingly, in my implementation the comprehension way has the best performance.

def stack(df):
    return df.set_index(['A', 'C']).B.apply(pd.Series).stack()


def comprehension(df):
    return pd.DataFrame([x + [z] for x, y in zip(df[['A', 'C']].values.tolist(), df.B) for z in y])


def multiindex(df):
    return pd.DataFrame(np.concatenate(df.B.values), index=df.set_index(['A', 'C']).index.repeat(df.B.str.len()))


def array(df):
    return pd.DataFrame(
        np.column_stack((
            np.repeat(df[['A', 'C']].values, df.B.str.len(), axis=0),
            np.concatenate(df.B.values)
        ))
    )


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from timeit import timeit

res = pd.DataFrame(
    index=[
        'stack',
        'comprehension',
        'multiindex',
        'array',
    ],
    columns=[1000, 2000, 5000, 10000, 20000, 50000],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        df = pd.DataFrame({'A': list('abc'), 'C': list('def'), 'B': [['g', 'h', 'i'], ['j', 'k'], ['l']]})
        df = pd.concat([df] * c)
        stmt = '{}(df)'.format(f)
        setp = 'from __main__ import df, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=20)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")

Performance

Relative time of each method


Answer 6

I generalized the problem a bit to be applicable to more columns.

Summary of what my solution does:

In[74]: df
Out[74]: 
    A   B             C             columnD
0  A1  B1  [C1.1, C1.2]                D1
1  A2  B2  [C2.1, C2.2]  [D2.1, D2.2, D2.3]
2  A3  B3            C3        [D3.1, D3.2]

In[75]: dfListExplode(df,['C','columnD'])
Out[75]: 
    A   B     C columnD
0  A1  B1  C1.1    D1
1  A1  B1  C1.2    D1
2  A2  B2  C2.1    D2.1
3  A2  B2  C2.1    D2.2
4  A2  B2  C2.1    D2.3
5  A2  B2  C2.2    D2.1
6  A2  B2  C2.2    D2.2
7  A2  B2  C2.2    D2.3
8  A3  B3    C3    D3.1
9  A3  B3    C3    D3.2

Complete example:

The actual explosion is performed in 3 lines. The rest is cosmetics (multi column explosion, handling of strings instead of lists in the explosion column, …).

import pandas as pd
import numpy as np

df=pd.DataFrame( {'A': ['A1','A2','A3'],
                  'B': ['B1','B2','B3'],
                  'C': [ ['C1.1','C1.2'],['C2.1','C2.2'],'C3'],
                  'columnD': [ 'D1',['D2.1','D2.2', 'D2.3'],['D3.1','D3.2']],
                  })
print('df',df, sep='\n')

def dfListExplode(df, explodeKeys):
    if not isinstance(explodeKeys, list):
        explodeKeys=[explodeKeys]
    # recursive handling of explodeKeys
    if len(explodeKeys)==0:
        return df
    elif len(explodeKeys)==1:
        explodeKey=explodeKeys[0]
    else:
        return dfListExplode( dfListExplode(df, explodeKeys[:1]), explodeKeys[1:])
    # perform explosion/unnesting for key: explodeKey
    dfPrep=df[explodeKey].apply(lambda x: x if isinstance(x,list) else [x]) #casts all elements to a list
    dfIndExpl=pd.DataFrame([[x] + [z] for x, y in zip(dfPrep.index,dfPrep.values) for z in y ], columns=['explodedIndex',explodeKey])
    dfMerged=dfIndExpl.merge(df.drop(explodeKey, axis=1), left_on='explodedIndex', right_index=True)
    dfReind=dfMerged.reindex(columns=list(df))
    return dfReind

dfExpl=dfListExplode(df,['C','columnD'])
print('dfExpl',dfExpl, sep='\n')

Credits to WeNYoBen’s answer


Answer 7

Problem Setup

Assume there are multiple columns with different length objects within it

df = pd.DataFrame({
    'A': [1, 2],
    'B': [[1, 2], [3, 4]],
    'C': [[1, 2], [3, 4, 5]]
})

df

   A       B          C
0  1  [1, 2]     [1, 2]
1  2  [3, 4]  [3, 4, 5]

When the lengths are the same, it is easy for us to assume that the varying elements coincide and should be “zipped” together.

   A       B          C
0  1  [1, 2]     [1, 2]  # Typical to assume these should be zipped [(1, 1), (2, 2)]
1  2  [3, 4]  [3, 4, 5]

However, the assumption gets challenged when we see different length objects, should we “zip”, if so, how do we handle the excess in one of the objects. OR, maybe we want the product of all of the objects. This will get big fast, but might be what is wanted.

   A       B          C
0  1  [1, 2]     [1, 2]
1  2  [3, 4]  [3, 4, 5]  # is this [(3, 3), (4, 4), (None, 5)]?

OR

   A       B          C
0  1  [1, 2]     [1, 2]
1  2  [3, 4]  [3, 4, 5]  # is this [(3, 3), (3, 4), (3, 5), (4, 3), (4, 4), (4, 5)]

The Function

This function gracefully handles zip or product based on a parameter and assumes to zip according to the length of the longest object with zip_longest

from itertools import zip_longest, product

def xplode(df, explode, zipped=True):
    method = zip_longest if zipped else product

    rest = {*df} - {*explode}

    zipped = zip(zip(*map(df.get, rest)), zip(*map(df.get, explode)))
    tups = [tup + exploded
     for tup, pre in zipped
     for exploded in method(*pre)]

    return pd.DataFrame(tups, columns=[*rest, *explode])[[*df]]

Zipped

xplode(df, ['B', 'C'])

   A    B  C
0  1  1.0  1
1  1  2.0  2
2  2  3.0  3
3  2  4.0  4
4  2  NaN  5

Product

xplode(df, ['B', 'C'], zipped=False)

   A  B  C
0  1  1  1
1  1  1  2
2  1  2  1
3  1  2  2
4  2  3  3
5  2  3  4
6  2  3  5
7  2  4  3
8  2  4  4
9  2  4  5

New Setup

Varying up the example a bit

df = pd.DataFrame({
    'A': [1, 2],
    'B': [[1, 2], [3, 4]],
    'C': 'C',
    'D': [[1, 2], [3, 4, 5]],
    'E': [('X', 'Y', 'Z'), ('W',)]
})

df

   A       B  C          D          E
0  1  [1, 2]  C     [1, 2]  (X, Y, Z)
1  2  [3, 4]  C  [3, 4, 5]       (W,)

Zipped

xplode(df, ['B', 'D', 'E'])

   A    B  C    D     E
0  1  1.0  C  1.0     X
1  1  2.0  C  2.0     Y
2  1  NaN  C  NaN     Z
3  2  3.0  C  3.0     W
4  2  4.0  C  4.0  None
5  2  NaN  C  5.0  None

Product

xplode(df, ['B', 'D', 'E'], zipped=False)

    A  B  C  D  E
0   1  1  C  1  X
1   1  1  C  1  Y
2   1  1  C  1  Z
3   1  1  C  2  X
4   1  1  C  2  Y
5   1  1  C  2  Z
6   1  2  C  1  X
7   1  2  C  1  Y
8   1  2  C  1  Z
9   1  2  C  2  X
10  1  2  C  2  Y
11  1  2  C  2  Z
12  2  3  C  3  W
13  2  3  C  4  W
14  2  3  C  5  W
15  2  4  C  3  W
16  2  4  C  4  W
17  2  4  C  5  W

回答 8

不推荐使用的方法(至少在这种情况下有效):

df=pd.concat([df]*2).sort_index()
it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
df['B']=df['B'].apply(lambda x:next(it))

concat + sort_index + iter + apply + next

现在:

print(df)

是:

   A  B
0  1  1
0  1  2
1  2  1
1  2  2

如果在乎索引:

df=df.reset_index(drop=True)

现在:

print(df)

是:

   A  B
0  1  1
1  1  2
2  2  1
3  2  2

Something pretty not recommended (at least work in this case):

df=pd.concat([df]*2).sort_index()
it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
df['B']=df['B'].apply(lambda x:next(it))

concat + sort_index + iter + apply + next.

Now:

print(df)

Is:

   A  B
0  1  1
0  1  2
1  2  1
1  2  2

If care about index:

df=df.reset_index(drop=True)

Now:

print(df)

Is:

   A  B
0  1  1
1  1  2
2  2  1
3  2  2

回答 9

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})

pd.concat([df['A'], pd.DataFrame(df['B'].values.tolist())], axis = 1)\
  .melt(id_vars = 'A', value_name = 'B')\
  .dropna()\
  .drop('variable', axis = 1)

    A   B
0   1   1
1   2   1
2   1   2
3   2   2

大家对我想到的这种方法有何看法?还是说同时使用 concat 和 melt 被认为开销过于“昂贵”?

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})

pd.concat([df['A'], pd.DataFrame(df['B'].values.tolist())], axis = 1)\
  .melt(id_vars = 'A', value_name = 'B')\
  .dropna()\
  .drop('variable', axis = 1)

    A   B
0   1   1
1   2   1
2   1   2
3   2   2

Any opinions on this method I thought of? or is doing both concat and melt considered too “expensive”?


回答 10

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})

out = pd.concat([df.loc[:,'A'],(df.B.apply(pd.Series))], axis=1, sort=False)

out = out.set_index('A').stack().droplevel(level=1).reset_index().rename(columns={0:"B"})

       A    B
   0    1   1
   1    1   2
   2    2   1
   3    2   2
  • 如果您不想创建中间对象,则可以将其写成单行代码(one-liner)
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})

out = pd.concat([df.loc[:,'A'],(df.B.apply(pd.Series))], axis=1, sort=False)

out = out.set_index('A').stack().droplevel(level=1).reset_index().rename(columns={0:"B"})

       A    B
   0    1   1
   1    1   2
   2    2   1
   3    2   2
  • you can implement this as one liner, if you don’t wish to create intermediate object

回答 11

# Here's the answer to the related question in:
# https://stackoverflow.com/q/56708671/11426125

# initial dataframe
df12=pd.DataFrame({'Date':['2007-12-03','2008-09-07'],'names':
[['Peter','Alex'],['Donald','Stan']]})

# convert dataframe to array for indexing list values (names)
a = np.array(df12.values)  

# create a new, dataframe with dimensions for unnested
b = np.ndarray(shape = (4,2))
df2 = pd.DataFrame(b, columns = ["Date", "names"], dtype = str)

# implement loops to assign date/name values as required
i = range(len(a[0]))
j = range(len(a[0]))
for x in i:
    for y in j:
        df2.iat[2*x+y, 0] = a[x][0]
        df2.iat[2*x+y, 1] = a[x][1][y]

# set Date column as Index
df2.Date=pd.to_datetime(df2.Date)
df2.index=df2.Date
df2.drop('Date',axis=1,inplace =True)
# Here's the answer to the related question in:
# https://stackoverflow.com/q/56708671/11426125

# initial dataframe
df12=pd.DataFrame({'Date':['2007-12-03','2008-09-07'],'names':
[['Peter','Alex'],['Donald','Stan']]})

# convert dataframe to array for indexing list values (names)
a = np.array(df12.values)  

# create a new, dataframe with dimensions for unnested
b = np.ndarray(shape = (4,2))
df2 = pd.DataFrame(b, columns = ["Date", "names"], dtype = str)

# implement loops to assign date/name values as required
i = range(len(a[0]))
j = range(len(a[0]))
for x in i:
    for y in j:
        df2.iat[2*x+y, 0] = a[x][0]
        df2.iat[2*x+y, 1] = a[x][1][y]

# set Date column as Index
df2.Date=pd.to_datetime(df2.Date)
df2.index=df2.Date
df2.drop('Date',axis=1,inplace =True)

回答 12

在我的案例中,需要展开(explode)的列不止一个,而且待取消嵌套的数组长度各不相同。

我最后把 pandas 0.25 中新的 explode 函数应用了两次,然后删除生成的重复项,这样就达到目的了!

df = df.explode('A')
df = df.explode('B')
df = df.drop_duplicates()

In my case with more than one column to explode, and with variables lengths for the arrays that needs to be unnested.

I ended up applying the new pandas 0.25 explode function two times, then removing generated duplicates and it does the job !

df = df.explode('A')
df = df.explode('B')
df = df.drop_duplicates()

回答 13

当您有多个要爆炸的列时,我还有另一种解决此问题的好方法。

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]], 'C':[[1,2,3],[1,2,3]]})

print(df)
   A       B          C
0  1  [1, 2]  [1, 2, 3]
1  2  [1, 2]  [1, 2, 3]

我想爆炸B和C列。首先爆炸B,然后爆炸C。然后我将B和C从原始df中删除。之后,我将在3个df上进行索引连接。

explode_b = df.explode('B')['B']
explode_c = df.explode('C')['C']
df = df.drop(['B', 'C'], axis=1)
df = df.join([explode_b, explode_c])

I have another good way to solves this when you have more than one column to explode.

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]], 'C':[[1,2,3],[1,2,3]]})

print(df)
   A       B          C
0  1  [1, 2]  [1, 2, 3]
1  2  [1, 2]  [1, 2, 3]

I want to explode the columns B and C. First I explode B, second C. Than I drop B and C from the original df. After that I will do an index join on the 3 dfs.

explode_b = df.explode('B')['B']
explode_c = df.explode('C')['C']
df = df.drop(['B', 'C'], axis=1)
df = df.join([explode_b, explode_c])

无法分配具有形状和数据类型的数组

问题:无法分配具有形状和数据类型的数组

我在Ubuntu 18上在numpy中分配大型数组时遇到了一个问题,而在MacOS上却没有遇到同样的问题。

我想为一个形状为 (156816, 36, 53806) 的 numpy 数组分配内存,使用

np.zeros((156816, 36, 53806), dtype='uint8')

但我在 Ubuntu 上遇到了如下错误:

>>> import numpy as np
>>> np.zeros((156816, 36, 53806), dtype='uint8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (156816, 36, 53806) and data type uint8

我没有在MacOS上得到它:

>>> import numpy as np 
>>> np.zeros((156816, 36, 53806), dtype='uint8')
array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       ...,

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]], dtype=uint8)

我在某处读到,np.zeros 并不会真正分配数组所需的全部内存,而只为非零元素分配。况且这台 Ubuntu 机器有 64GB 内存,我的 MacBook Pro 却只有 16GB。

版本:

Ubuntu
os -> ubuntu mate 18
python -> 3.6.8
numpy -> 1.17.0

mac
os -> 10.14.6
python -> 3.6.4
numpy -> 1.17.0

PS:在Google Colab上也失败

I’m facing an issue with allocating huge arrays in numpy on Ubuntu 18 while not facing the same issue on MacOS.

I am trying to allocate memory for a numpy array with shape (156816, 36, 53806) with

np.zeros((156816, 36, 53806), dtype='uint8')

and while I’m getting an error on Ubuntu OS

>>> import numpy as np
>>> np.zeros((156816, 36, 53806), dtype='uint8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (156816, 36, 53806) and data type uint8

I’m not getting it on MacOS:

>>> import numpy as np 
>>> np.zeros((156816, 36, 53806), dtype='uint8')
array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       ...,

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]], dtype=uint8)

I’ve read somewhere that np.zeros shouldn’t be really allocating the whole memory needed for the array, but only for the non-zero elements. Even though the Ubuntu machine has 64gb of memory, while my MacBook Pro has only 16gb.

versions:

Ubuntu
os -> ubuntu mate 18
python -> 3.6.8
numpy -> 1.17.0

mac
os -> 10.14.6
python -> 3.6.4
numpy -> 1.17.0

PS: also failed on Google Colab


回答 0

这可能是由于系统的过量使用处理模式所致。

在默认模式 0 下,

启发式过量使用处理。明显的地址空间过量使用会被拒绝。用于典型的系统。它确保严重离谱的分配会失败,同时允许通过过量使用来减少交换空间的占用。在此模式下,root 被允许多分配一点内存。这是默认值。

此处没有很好地解释所使用的确切启发式方法,但在 Linux overcommit heuristic 的讨论以及本页上对此有更多说明。

您可以通过运行以下命令检查当前的过量使用模式

$ cat /proc/sys/vm/overcommit_memory
0

在这种情况下,您要分配

>>> 156816 * 36 * 53806 / 1024.0**3
282.8939827680588

约 282 GB。内核的意思很明确:显然不可能为它提交这么多物理页,于是拒绝了这次分配。

如果(以root用户身份)运行:

$ echo 1 > /proc/sys/vm/overcommit_memory

这将启用“始终过量使用”模式,并且您会发现,无论系统有多大(至少在64位内存寻址中),该系统的确允许您进行分配。

我自己在一台 32 GB RAM 的计算机上做了测试。在过量提交模式 0 下我同样得到 MemoryError,但把它改为 1 之后就可以工作了:

>>> import numpy as np
>>> a = np.zeros((156816, 36, 53806), dtype='uint8')
>>> a.nbytes
303755101056

之后,您可以往数组中的任何位置写入数据,系统只有在您真正写到某个页面时才会分配对应的物理页。因此,只要小心使用,这种方式很适合稀疏数组。

This is likely due to your system’s overcommit handling mode.

In the default mode, 0,

Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.

The exact heuristic used is not well explained here, but this is discussed more on Linux over commit heuristic and on this page.

You can check your current overcommit mode by running

$ cat /proc/sys/vm/overcommit_memory
0

In this case you’re allocating

>>> 156816 * 36 * 53806 / 1024.0**3
282.8939827680588

~282 GB, and the kernel is saying well obviously there’s no way I’m going to be able to commit that many physical pages to this, and it refuses the allocation.

If (as root) you run:

$ echo 1 > /proc/sys/vm/overcommit_memory

This will enable “always overcommit” mode, and you’ll find that indeed the system will allow you to make the allocation no matter how large it is (within 64-bit memory addressing at least).

I tested this myself on a machine with 32 GB of RAM. With overcommit mode 0 I also got a MemoryError, but after changing it back to 1 it works:

>>> import numpy as np
>>> a = np.zeros((156816, 36, 53806), dtype='uint8')
>>> a.nbytes
303755101056

You can then go ahead and write to any location within the array, and the system will only allocate physical pages when you explicitly write to that page. So you can use this, with care, for sparse arrays.
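As a rough, Linux-only sketch (not part of the original answer), you can compute the requested size and check the current overcommit mode from Python before attempting the allocation:

import numpy as np

shape = (156816, 36, 53806)
needed = np.prod(shape) * np.dtype('uint8').itemsize   # bytes required
print(needed / 1024**3, 'GiB requested')                # ~282.9 GiB

# Linux only: read the current overcommit mode (0, 1 or 2)
with open('/proc/sys/vm/overcommit_memory') as f:
    print('overcommit_memory =', f.read().strip())

# With overcommit mode 1 this allocation succeeds even without 282 GiB of RAM;
# physical pages are only committed when the array is actually written to.
a = np.zeros(shape, dtype='uint8')
a[0, 0, 0] = 1   # touching a page commits it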


回答 1

我在 Windows 上遇到了同样的问题,并找到了这个解决方案。因此,如果有人在 Windows 中遇到此问题:对我来说,解决办法是增大页面文件(pagefile)的大小,因为我的情况同样是内存过量使用的问题。

Windows 8

  1. 在键盘上按WindowsKey + X,然后在弹出菜单中单击“系统”。
  2. 点击或单击高级系统设置。系统可能会要求您输入管理员密码或确认选择
  3. 在“高级”选项卡上的“性能”下,点击或单击“设置”。
  4. 点击或单击“高级”选项卡,然后在“虚拟内存”下,单击或单击“更改”
  5. 清除“自动管理所有驱动器的页面文件大小”复选框。
  6. 在驱动器[卷标签]下,点击或单击包含要更改的页面文件的驱动器
  7. 点击或单击“自定义大小”,在“初始大小(MB)”或“最大大小(MB)”框中输入新的大小(以兆字节为单位),单击或单击“设置”,然后单击或单击“确定”
  8. 重新启动系统

Windows 10

  1. 按Windows键
  2. 类型SystemPropertiesAdvanced
  3. 单击以管理员身份运行
  4. 在性能下,单击设置
  5. 选择高级选项卡
  6. 选择更改…
  7. 取消选中“自动管理所有驱动器的页面文件大小”
  8. 然后选择自定义尺寸并填写适当的尺寸
  9. 按设置,然后按确定,然后从“虚拟内存”,“性能选项”和“系统属性”对话框退出
  10. 重新启动系统

注意:在此示例中,我的系统上没有足够的内存供〜282GB使用,但对于我的特殊情况,此方法有效。

编辑

以下是这里给出的页面文件大小建议:

有一个公式可以计算合适的页面文件大小:初始大小是系统总内存的 1.5 倍,最大大小是初始大小的 3 倍。假设您有 4 GB(1 GB = 1,024 MB,共 4,096 MB)内存,则初始大小为 1.5 x 4,096 = 6,144 MB,最大大小为 3 x 6,144 = 18,432 MB。

这里要记住一些事情:

但是,这没有考虑到计算机可能特有的其他重要因素和系统设置。同样,让Windows选择要使用的内容,而不是依赖于在另一台计算机上工作的任意公式。

也:

页面文件大小的增加可能有助于防止Windows中的不稳定和崩溃。但是,硬盘驱动器的读/写时间比数据存储在计算机内存中的情况要慢得多。页面文件较大将增加硬盘驱动器的工作量,从而导致其他所有文件的运行速度变慢。仅当遇到内存不足错误时才应增加页面文件的大小,并且仅作为临时解决方案。更好的解决方案是向计算机添加更多的内存。

I had this same problem on Windows and came across this solution. So if someone comes across this problem in Windows, the solution for me was to increase the pagefile size, as it was a memory overcommitment problem for me too.

Windows 8

  1. On the Keyboard Press the WindowsKey + X then click System in the popup menu
  2. Tap or click Advanced system settings. You might be asked for an admin password or to confirm your choice
  3. On the Advanced tab, under Performance, tap or click Settings.
  4. Tap or click the Advanced tab, and then, under Virtual memory, tap or click Change
  5. Clear the Automatically manage paging file size for all drives check box.
  6. Under Drive [Volume Label], tap or click the drive that contains the paging file you want to change
  7. Tap or click Custom size, enter a new size in megabytes in the initial size (MB) or Maximum size (MB) box, tap or click Set, and then tap or click OK
  8. Reboot your system

Windows 10

  1. Press the Windows key
  2. Type SystemPropertiesAdvanced
  3. Click Run as administrator
  4. Click Settings
  5. Select the Advanced tab
  6. Select Change…
  7. Uncheck Automatically managing paging file size for all drives
  8. Then select Custom size and fill in the appropriate size
  9. Press Set then press OK then exit from the Virtual Memory, Performance Options, and System Properties Dialog
  10. Reboot your system

Note: I did not have the enough memory on my system for the ~282GB in this example but for my particular case this worked.

EDIT

From here the suggested recommendations for page file size:

There is a formula for calculating the correct pagefile size. Initial size is one and a half (1.5) x the amount of total system memory. Maximum size is three (3) x the initial size. So let’s say you have 4 GB (1 GB = 1,024 MB x 4 = 4,096 MB) of memory. The initial size would be 1.5 x 4,096 = 6,144 MB and the maximum size would be 3 x 6,144 = 18,432 MB.
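Just to spell out the arithmetic from the quoted formula (a trivial sketch, using the 4 GB example above):

total_mb = 4 * 1024              # e.g. 4 GB of RAM
initial_mb = 1.5 * total_mb      # 6144 MB initial pagefile size
maximum_mb = 3 * initial_mb      # 18432 MB maximum pagefile size
print(initial_mb, maximum_mb)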

Some things to keep in mind from here:

However, this does not take into consideration other important factors and system settings that may be unique to your computer. Again, let Windows choose what to use instead of relying on some arbitrary formula that worked on a different computer.

Also:

Increasing page file size may help prevent instabilities and crashing in Windows. However, a hard drive read/write times are much slower than what they would be if the data were in your computer memory. Having a larger page file is going to add extra work for your hard drive, causing everything else to run slower. Page file size should only be increased when encountering out-of-memory errors, and only as a temporary fix. A better solution is to adding more memory to the computer.


回答 2

我也在 Windows 上遇到了这个问题。对我来说,解决方案是从 32 位版本的 Python 切换到 64 位版本。的确,32 位软件(就像 32 位 CPU 一样)最多只能寻址 4 GB 的 RAM(2^32)。因此,如果您拥有超过 4 GB 的 RAM,32 位版本将无法利用多出来的部分。

使用64位版本的Python(在下载页面中标记为x86-64的版本),问题就消失了。

您可以通过进入解释器来检查自己用的是哪个版本。我用的是 64 位版本,现在显示:Python 3.7.5rc1 (tags/v3.7.5rc1:4082f600a5, Oct 1 2019, 20:28:14) [MSC v.1916 64 bit (AMD64)],其中 [MSC v.1916 64 bit (AMD64)] 表示“64 位 Python”。

注:截至撰写本文时(2020 年 5 月),matplotlib 在 python39 上尚不可用,所以我建议安装 64 位的 python37。

资料来源:

I came across this problem on Windows too. The solution for me was to switch from a 32-bit to a 64-bit version of Python. Indeed, a 32-bit software, like a 32-bit CPU, can adress a maximum of 4 GB of RAM (2^32). So if you have more than 4 GB of RAM, a 32-bit version cannot take advantage of it.

With a 64-bit version of Python (the one labeled x86-64 in the download page), the issue disappeared.

You can check which version you have by entering the interpreter. I, with a 64-bit version, now have: Python 3.7.5rc1 (tags/v3.7.5rc1:4082f600a5, Oct 1 2019, 20:28:14) [MSC v.1916 64 bit (AMD64)], where [MSC v.1916 64 bit (AMD64)] means “64-bit Python”.
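If you prefer a programmatic check over reading the interpreter banner, a small sketch (standard library only, nothing specific to this answer) is:

import struct, sys

print(struct.calcsize('P') * 8)   # 64 on a 64-bit interpreter, 32 on a 32-bit one
print(sys.maxsize > 2**32)        # True only on 64-bit builds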

Note : as of the time of this writing (May 2020), matplotlib is not available on python39, so I recommand installing python37, 64 bits.

Sources :


回答 3

在我的情况下,添加 dtype 参数把数组的 dtype 换成了更小的类型(从 float64 变为 uint8),数组体积缩小到足以在 Windows(64 位)下不再引发 MemoryError。

从

mask = np.zeros(edges.shape)

到

mask = np.zeros(edges.shape,dtype='uint8')

In my case, adding a dtype attribute changed dtype of the array to a smaller type(from float64 to uint8), decreasing array size enough to not throw MemoryError in Windows(64 bit).

from

mask = np.zeros(edges.shape)

to

mask = np.zeros(edges.shape,dtype='uint8')
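To get a feel for the saving, here is a hedged sketch; the shape below is made up, since the real edges.shape isn't given in the answer:

import numpy as np

shape = (10000, 10000)                                   # hypothetical edges.shape
print(np.zeros(shape).nbytes / 1024**2)                  # float64 default: ~762.9 MiB
print(np.zeros(shape, dtype='uint8').nbytes / 1024**2)   # uint8: ~95.4 MiB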

回答 4

有时,由于内核已达到极限,会弹出此错误。尝试重新启动内核,然后重做必要的步骤。

Sometimes, this error pops up because of the kernel has reached its limit. Try to restart the kernel redo the necessary steps.


回答 5

把数据类型改成另一种占用内存更少的类型即可解决。对我来说,我把数据类型改成了 numpy.uint8:

data['label'] = data['label'].astype(np.uint8)

change the data type to another one which uses less memory works. For me, I change the data type to numpy.uint8:

data['label'] = data['label'].astype(np.uint8)

交错合并两个字符串的最 pythonic 方式

问题:交错合并两个字符串的最 pythonic 方式

把两个字符串交错合并(mesh)在一起,最 pythonic 的方式是什么?

例如:

输入:

u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijklmnopqrstuvwxyz'

输出:

'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

What’s the most pythonic way to mesh two strings together?

For example:

Input:

u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijklmnopqrstuvwxyz'

Output:

'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

回答 0

对我来说,最 pythonic* 的方式是下面这段代码,它做的事情基本相同,但使用 + 运算符来连接两个字符串中对应位置的字符:

res = "".join(i + j for i, j in zip(u, l))
print(res)
# 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

它也比使用两个join()调用更快:

In [5]: l1 = 'A' * 1000000; l2 = 'a' * 1000000

In [6]: %timeit "".join("".join(item) for item in zip(l1, l2))
1 loops, best of 3: 442 ms per loop

In [7]: %timeit "".join(i + j for i, j in zip(l1, l2))
1 loops, best of 3: 360 ms per loop

存在更快的方法,但它们往往会让代码变得晦涩。

注:如果两个输入字符串的长度不相同,则较长的那个会被截断,因为 zip 在较短字符串结束时就停止迭代。这种情况下应改用 itertools 模块中的 zip_longest(Python 2 中为 izip_longest),以确保两个字符串都被完整用到。


*引用 Python 之禅的一句话:可读性很重要(Readability counts)。
Pythonic = 对我而言就是可读性;i + j 至少在我看来,从视觉上更容易解析。

For me, the most pythonic* way is the following which pretty much does the same thing but uses the + operator for concatenating the individual characters in each string:

res = "".join(i + j for i, j in zip(u, l))
print(res)
# 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

It is also faster than using two join() calls:

In [5]: l1 = 'A' * 1000000; l2 = 'a' * 1000000

In [6]: %timeit "".join("".join(item) for item in zip(l1, l2))
1 loops, best of 3: 442 ms per loop

In [7]: %timeit "".join(i + j for i, j in zip(l1, l2))
1 loops, best of 3: 360 ms per loop

Faster approaches exist, but they often obfuscate the code.

Note: If the two input strings are not the same length then the longer one will be truncated as zip stops iterating at the end of the shorter string. In this case instead of zip one should use zip_longest (izip_longest in Python 2) from the itertools module to ensure that both strings are fully exhausted.
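To make the zip_longest note above concrete, here is a short Python 3 sketch with strings of unequal length (the example strings are my own):

from itertools import zip_longest

u = 'ABCDE'
l = 'abc'
print("".join(i + j for i, j in zip(u, l)))                        # 'AaBbCc' (the 'DE' tail is lost)
print("".join(i + j for i, j in zip_longest(u, l, fillvalue='')))  # 'AaBbCcDE'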


*To take a quote from the Zen of Python: Readability counts.
Pythonic = readability for me; i + j is just visually parsed more easily, at least for my eyes.


回答 1

更快的选择

其他方式:

res = [''] * len(u) * 2
res[::2] = u
res[1::2] = l
print(''.join(res))

输出:

'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

速度

看起来更快:

%%timeit
res = [''] * len(u) * 2
res[::2] = u
res[1::2] = l
''.join(res)

100000 loops, best of 3: 4.75 µs per loop

比迄今为止最快的解决方案:

%timeit "".join(list(chain.from_iterable(zip(u, l))))

100000 loops, best of 3: 6.52 µs per loop

同样对于较大的字符串:

l1 = 'A' * 1000000; l2 = 'a' * 1000000

%timeit "".join(list(chain.from_iterable(zip(l1, l2))))
1 loops, best of 3: 151 ms per loop


%%timeit
res = [''] * len(l1) * 2
res[::2] = l1
res[1::2] = l2
''.join(res)

10 loops, best of 3: 92 ms per loop

Python 3.5.1。

不同长度字符串的变化

u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijkl'

较短的一个确定长度(zip()等效)

min_len = min(len(u), len(l))
res = [''] * min_len * 2 
res[::2] = u[:min_len]
res[1::2] = l[:min_len]
print(''.join(res))

输出:

AaBbCcDdEeFfGgHhIiJjKkLl

更长的长度决定长度(itertools.zip_longest(fillvalue='')等效)

min_len = min(len(u), len(l))
res = [''] * min_len * 2 
res[::2] = u[:min_len]
res[1::2] = l[:min_len]
res += u[min_len:] + l[min_len:]
print(''.join(res))

输出:

AaBbCcDdEeFfGgHhIiJjKkLlMNOPQRSTUVWXYZ

Faster Alternative

Another way:

res = [''] * len(u) * 2
res[::2] = u
res[1::2] = l
print(''.join(res))

Output:

'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

Speed

Looks like it is faster:

%%timeit
res = [''] * len(u) * 2
res[::2] = u
res[1::2] = l
''.join(res)

100000 loops, best of 3: 4.75 µs per loop

than the fastest solution so far:

%timeit "".join(list(chain.from_iterable(zip(u, l))))

100000 loops, best of 3: 6.52 µs per loop

Also for the larger strings:

l1 = 'A' * 1000000; l2 = 'a' * 1000000

%timeit "".join(list(chain.from_iterable(zip(l1, l2))))
1 loops, best of 3: 151 ms per loop


%%timeit
res = [''] * len(l1) * 2
res[::2] = l1
res[1::2] = l2
''.join(res)

10 loops, best of 3: 92 ms per loop

Python 3.5.1.

Variation for strings with different lengths

u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijkl'

Shorter one determines length (zip() equivalent)

min_len = min(len(u), len(l))
res = [''] * min_len * 2 
res[::2] = u[:min_len]
res[1::2] = l[:min_len]
print(''.join(res))

Output:

AaBbCcDdEeFfGgHhIiJjKkLl

Longer one determines length (itertools.zip_longest(fillvalue='') equivalent)

min_len = min(len(u), len(l))
res = [''] * min_len * 2 
res[::2] = u[:min_len]
res[1::2] = l[:min_len]
res += u[min_len:] + l[min_len:]
print(''.join(res))

Output:

AaBbCcDdEeFfGgHhIiJjKkLlMNOPQRSTUVWXYZ

回答 2

使用 join() 和 zip()。

>>> ''.join(''.join(item) for item in zip(u,l))
'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

With join() and zip().

>>> ''.join(''.join(item) for item in zip(u,l))
'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

回答 3

在 Python 2 上,到目前为止最快的做法是下面这种:对小字符串,它大约是列表切片方案速度的 3 倍;对长字符串则约为 30 倍。

res = bytearray(len(u) * 2)
res[::2] = u
res[1::2] = l
str(res)

但是,这在Python 3上不起作用。您可以实现类似

res = bytearray(len(u) * 2)
res[::2] = u.encode("ascii")
res[1::2] = l.encode("ascii")
res.decode("ascii")

但是到那时,您已经失去了对小型字符串进行列表切片所获得的收益(对于长字符串而言,它的速度仍然是20倍),并且这甚至还不适用于非ASCII字符。

FWIW,如果您在大量字符串上执行此操作并且需要每个周期,并且由于某种原因必须使用Python字符串…以下是操作方法:

res = bytearray(len(u) * 4 * 2)

u_utf32 = u.encode("utf_32_be")
res[0::8] = u_utf32[0::4]
res[1::8] = u_utf32[1::4]
res[2::8] = u_utf32[2::4]
res[3::8] = u_utf32[3::4]

l_utf32 = l.encode("utf_32_be")
res[4::8] = l_utf32[0::4]
res[5::8] = l_utf32[1::4]
res[6::8] = l_utf32[2::4]
res[7::8] = l_utf32[3::4]

res.decode("utf_32_be")

对较小类型这种常见情况做特殊处理(special-casing)也会有帮助。FWIW,对长字符串这只有列表切片速度的 3 倍,而对小字符串反而慢 4 到 5 倍。

无论哪种方式,我更喜欢 join 方案;但既然别处已经提到了计时结果,我想我也不妨凑一下热闹。

On Python 2, by far the faster way to do things, at ~3x the speed of list slicing for small strings and ~30x for long ones, is

res = bytearray(len(u) * 2)
res[::2] = u
res[1::2] = l
str(res)

This wouldn’t work on Python 3, though. You could implement something like

res = bytearray(len(u) * 2)
res[::2] = u.encode("ascii")
res[1::2] = l.encode("ascii")
res.decode("ascii")

but by then you’ve already lost the gains over list slicing for small strings (it’s still 20x the speed for long strings) and this doesn’t even work for non-ASCII characters yet.

FWIW, if you are doing this on massive strings and need every cycle, and for some reason have to use Python strings… here’s how to do it:

res = bytearray(len(u) * 4 * 2)

u_utf32 = u.encode("utf_32_be")
res[0::8] = u_utf32[0::4]
res[1::8] = u_utf32[1::4]
res[2::8] = u_utf32[2::4]
res[3::8] = u_utf32[3::4]

l_utf32 = l.encode("utf_32_be")
res[4::8] = l_utf32[0::4]
res[5::8] = l_utf32[1::4]
res[6::8] = l_utf32[2::4]
res[7::8] = l_utf32[3::4]

res.decode("utf_32_be")

Special-casing the common case of smaller types will help too. FWIW, this is only 3x the speed of list slicing for long strings and a factor of 4 to 5 slower for small strings.

Either way I prefer the join solutions, but since timings were mentioned elsewhere I thought I might as well join in.


回答 4

如果您想要最快的方法,可以把 itertools 与 operator.add 结合使用:

In [36]: from operator import add

In [37]: from itertools import  starmap, izip

In [38]: timeit "".join([i + j for i, j in izip(l1, l2)])
1 loops, best of 3: 142 ms per loop

In [39]: timeit "".join(starmap(add, izip(l1,l2)))
1 loops, best of 3: 117 ms per loop

In [40]: timeit "".join(["".join(item) for item in zip(l1, l2)])
1 loops, best of 3: 196 ms per loop

In [41]:  "".join(starmap(add, izip(l1,l2))) ==  "".join([i + j   for i, j in izip(l1, l2)]) ==  "".join(["".join(item) for item in izip(l1, l2)])
Out[42]: True

但是把 izip 与 chain.from_iterable 结合起来还要更快:

In [2]: from itertools import  chain, izip

In [3]: timeit "".join(chain.from_iterable(izip(l1, l2)))
10 loops, best of 3: 98.7 ms per loop

chain(* 与 chain.from_iterable(... 之间也存在实质性差异。

In [5]: timeit "".join(chain(*izip(l1, l2)))
1 loops, best of 3: 212 ms per loop

join 并没有真正的生成器支持:传入生成器总是会更慢,因为 Python 会先用其内容构建一个列表。原因是 join 要对数据做两次遍历,一次计算所需的总大小,一次真正执行连接,而这用生成器是做不到的:

join.h

 /* Here is the general case.  Do a pre-pass to figure out the total
  * amount of space we'll need (sz), and see whether all arguments are
  * bytes-like.
   */

另外,如果您使用不同长度的字符串,并且不想丢失数据,则可以使用izip_longest

In [22]: from itertools import izip_longest    
In [23]: a,b = "hlo","elworld"

In [24]:  "".join(chain.from_iterable(izip_longest(a, b,fillvalue="")))
Out[24]: 'helloworld'

对于python 3,它称为 zip_longest

但是对于python2来说,veedrac的建议是迄今为止最快的:

In [18]: %%timeit
res = bytearray(len(u) * 2)
res[::2] = u
res[1::2] = l
str(res)
   ....: 
100 loops, best of 3: 2.68 ms per loop

If you want the fastest way, you can combine itertools with operator.add:

In [36]: from operator import add

In [37]: from itertools import  starmap, izip

In [38]: timeit "".join([i + j for i, j in izip(l1, l2)])
1 loops, best of 3: 142 ms per loop

In [39]: timeit "".join(starmap(add, izip(l1,l2)))
1 loops, best of 3: 117 ms per loop

In [40]: timeit "".join(["".join(item) for item in zip(l1, l2)])
1 loops, best of 3: 196 ms per loop

In [41]:  "".join(starmap(add, izip(l1,l2))) ==  "".join([i + j   for i, j in izip(l1, l2)]) ==  "".join(["".join(item) for item in izip(l1, l2)])
Out[42]: True

But combining izip and chain.from_iterable is faster again

In [2]: from itertools import  chain, izip

In [3]: timeit "".join(chain.from_iterable(izip(l1, l2)))
10 loops, best of 3: 98.7 ms per loop

There is also a substantial difference between chain(* and chain.from_iterable(....

In [5]: timeit "".join(chain(*izip(l1, l2)))
1 loops, best of 3: 212 ms per loop

There is no such thing as a generator with join, passing one is always going to be slower as python will first build a list using the content because it does two passes over the data, one to figure out the size needed and one to actually do the join which would not be possible using a generator:

join.h:

 /* Here is the general case.  Do a pre-pass to figure out the total
  * amount of space we'll need (sz), and see whether all arguments are
  * bytes-like.
   */

Also if you have different length strings and you don’t want to lose data you can use izip_longest :

In [22]: from itertools import izip_longest    
In [23]: a,b = "hlo","elworld"

In [24]:  "".join(chain.from_iterable(izip_longest(a, b,fillvalue="")))
Out[24]: 'helloworld'

For python 3 it is called zip_longest

But for python2, veedrac’s suggestion is by far the fastest:

In [18]: %%timeit
res = bytearray(len(u) * 2)
res[::2] = u
res[1::2] = l
str(res)
   ....: 
100 loops, best of 3: 2.68 ms per loop

回答 5

您也可以使用 map 和 operator.add 来实现:

from operator import add

u = 'AAAAA'
l = 'aaaaa'

s = "".join(map(add, u, l))

输出

'AaAaAaAaAa'

map 的作用是:它依次从第一个可迭代对象 u 中取出每个元素,并从第二个可迭代对象 l 中取出对应位置的元素,然后应用作为第一个参数传入的函数 add。最后 join 只是把它们连接起来。

You could also do this using map and operator.add:

from operator import add

u = 'AAAAA'
l = 'aaaaa'

s = "".join(map(add, u, l))

Output:

'AaAaAaAaAa'

What map does is it takes every element from the first iterable u and the first elements from the second iterable l and applies the function supplied as the first argument add. Then join just joins them.


回答 6

Jim 的答案很好,但如果您不介意多几个导入,下面是我最喜欢的写法:

from functools import reduce
from operator import add

reduce(add, map(add, u, l))

Jim’s answer is great, but here’s my favorite option, if you don’t mind a couple of imports:

from functools import reduce
from operator import add

reduce(add, map(add, u, l))

回答 7

这些建议很多都假设两个字符串长度相等。也许这已经涵盖了所有合理的用例,但至少在我看来,你可能也希望兼容长度不同的字符串。还是说只有我一个人认为交错(mesh)应该大致像下面这样工作:

u = "foobar"
l = "baz"
mesh(u,l) = "fboaozbar"

一种方法是:

def mesh(a,b):
    minlen = min(len(a),len(b))
    return "".join(["".join(x+y for x,y in zip(a,b)),a[minlen:],b[minlen:]])

A lot of these suggestions assume the strings are of equal length. Maybe that covers all reasonable use cases, but at least to me it seems that you might want to accomodate strings of differing lengths too. Or am I the only one thinking the mesh should work a bit like this:

u = "foobar"
l = "baz"
mesh(u,l) = "fboaozbar"

One way to do this would be the following:

def mesh(a,b):
    minlen = min(len(a),len(b))
    return "".join(["".join(x+y for x,y in zip(a,b)),a[minlen:],b[minlen:]])

回答 8

我喜欢使用两个fors,变量名可以提示/提醒正在发生的事情:

"".join(char for pair in zip(u,l) for char in pair)

I like using two fors, the variable names can give a hint/reminder to what is going on:

"".join(char for pair in zip(u,l) for char in pair)

回答 9

只是添加另一种更基本的方法:

st = ""
for char in u:
    st = "{0}{1}{2}".format( st, char, l[ u.index( char ) ] )

Just to add another, more basic approach:

st = ""
for char in u:
    st = "{0}{1}{2}".format( st, char, l[ u.index( char ) ] )

回答 10

不考虑这里的双重列表推导式(double-list-comprehension)答案,感觉有点不够 pythonic;它能以 O(1) 的额外工作量处理 n 个字符串:

"".join(c for cs in itertools.zip_longest(*all_strings) for c in cs)

其中 all_strings 是你想要交错的字符串列表。就本题而言,all_strings = [u, l]。完整的使用示例如下所示:

import itertools
a = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
b = 'abcdefghijklmnopqrstuvwxyz'
all_strings = [a,b]
interleaved = "".join(c for cs in itertools.zip_longest(*all_strings) for c in cs)
print(interleaved)
# 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

和许多答案一样,它是最快的吗?可能不是,但它简单而灵活。另外,在几乎不增加复杂度的情况下,它比被采纳的答案还要稍快一些(一般来说,Python 中的字符串相加有点慢):

In [7]: l1 = 'A' * 1000000; l2 = 'a' * 1000000;

In [8]: %timeit "".join(i + j for i, j in zip(l1, l2))
1 loops, best of 3: 227 ms per loop

In [9]: %timeit "".join(c for cs in zip(*(l1, l2)) for c in cs)
1 loops, best of 3: 198 ms per loop

Feels a bit un-pythonic not to consider the double-list-comprehension answer here, to handle n string with O(1) effort:

"".join(c for cs in itertools.zip_longest(*all_strings) for c in cs)

where all_strings is a list of the strings you want to interleave. In your case, all_strings = [u, l]. A full use example would look like this:

import itertools
a = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
b = 'abcdefghijklmnopqrstuvwxyz'
all_strings = [a,b]
interleaved = "".join(c for cs in itertools.zip_longest(*all_strings) for c in cs)
print(interleaved)
# 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

Like many answers, fastest? Probably not, but simple and flexible. Also, without too much added complexity, this is slightly faster than the accepted answer (in general, string addition is a bit slow in python):

In [7]: l1 = 'A' * 1000000; l2 = 'a' * 1000000;

In [8]: %timeit "".join(i + j for i, j in zip(l1, l2))
1 loops, best of 3: 227 ms per loop

In [9]: %timeit "".join(c for cs in zip(*(l1, l2)) for c in cs)
1 loops, best of 3: 198 ms per loop

回答 11

可能比当前领先的解决方案更快,更短:

from itertools import chain

u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijklmnopqrstuvwxyz'

res = "".join(chain(*zip(u, l)))

速度上的策略是尽可能多地把工作放在 C 层面完成。对长度不等的字符串,同样可以用 zip_longest() 来修复,而且它与 chain() 来自同一个模块,所以在这一点上也扣不了我多少分!

我提出的其他解决方案:

res = "".join(u[x] + l[x] for x in range(len(u)))

res = "".join(k + l[i] for i, k in enumerate(u))

Potentially faster and shorter than the current leading solution:

from itertools import chain

u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijklmnopqrstuvwxyz'

res = "".join(chain(*zip(u, l)))

Strategy speed-wise is to do as much at the C-level as possible. Same zip_longest() fix for uneven strings and it would be coming out of the same module as chain() so can’t ding me too many points there!

Other solutions I came up with along the way:

res = "".join(u[x] + l[x] for x in range(len(u)))

res = "".join(k + l[i] for i, k in enumerate(u))

回答 12

你可以使用 iteration_utilities.roundrobin¹

u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijklmnopqrstuvwxyz'

from iteration_utilities import roundrobin
''.join(roundrobin(u, l))
# returns 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

ManyIterables同一包中的类:

from iteration_utilities import ManyIterables
ManyIterables(u, l).roundrobin().as_string()
# returns 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

¹ 这来自我编写的一个第三方库:iteration_utilities。

You could use iteration_utilities.roundrobin1

u = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
l = 'abcdefghijklmnopqrstuvwxyz'

from iteration_utilities import roundrobin
''.join(roundrobin(u, l))
# returns 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

or the ManyIterables class from the same package:

from iteration_utilities import ManyIterables
ManyIterables(u, l).roundrobin().as_string()
# returns 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

1 This is from a third-party library I have written: iteration_utilities.


回答 13

我将使用zip()来获得一种可读且简单的方法:

result = ''
for cha, chb in zip(u, l):
    result += '%s%s' % (cha, chb)

print result
# 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

I would use zip() to get a readable and easy way:

result = ''
for cha, chb in zip(u, l):
    result += '%s%s' % (cha, chb)

print result
# 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz'

TensorFlow 中的 strides 参数

问题:TensorFlow 中的 strides 参数

我想理解 tf.nn.avg_pool、tf.nn.max_pool、tf.nn.conv2d 中的 strides 参数。

文档中反复说:

步幅:长度大于等于4的整数的列表。输入张量每个维度的滑动窗口的步幅。

我的问题是:

  1. 这 4 个(或更多)整数分别代表什么?
  2. 对于卷积网络,为什么必须满足 strides[0] = strides[3] = 1?
  3. 此示例中,我们看到了tf.reshape(_X,shape=[-1, 28, 28, 1])。为什么是-1?

遗憾的是,文档中使用-1进行重塑的示例并不能很好地解释这种情况。

I am trying to understand the strides argument in tf.nn.avg_pool, tf.nn.max_pool, tf.nn.conv2d.

The documentation repeatedly says

strides: A list of ints that has length >= 4. The stride of the sliding window for each dimension of the input tensor.

My questions are:

  1. What do each of the 4+ integers represent?
  2. Why must they have strides[0] = strides[3] = 1 for convnets?
  3. In this example we see tf.reshape(_X,shape=[-1, 28, 28, 1]). Why -1?

Sadly the examples in the docs for reshape using -1 don’t translate too well to this scenario.


回答 0

池化和卷积运算会在输入张量上滑动一个“窗口”。以 tf.nn.conv2d 为例:如果输入张量有 4 个维度:[batch, height, width, channels],则卷积是在 height、width 这两个维度构成的二维窗口上进行的。

strides确定窗口在每个维度上的移动量。典型用法是将第一个(批处理)和最后一个(深度)跨度设置为1。

让我们使用一个非常具体的示例:在32×32灰度输入图像上运行2-d卷积。我说灰度是因为输入图像的深度为1,这有助于使其简单。让该图像看起来像这样:

00 01 02 03 04 ...
10 11 12 13 14 ...
20 21 22 23 24 ...
30 31 32 33 34 ...
...

让我们在一个示例(批处理大小= 1)上运行2×2卷积窗口。我们给卷积的输出通道深度为8。

卷积的输入为shape=[1, 32, 32, 1]

如果指定strides=[1,1,1,1]padding=SAME,则滤波器的输出将是[1,32,32,8]。

过滤器将首先为以下内容创建输出:

F(00 01
  10 11)

然后针对:

F(01 02
  11 12)

等等。然后它将移至第二行,计算:

F(10, 11
  20, 21)

然后

F(11, 12
  21, 22)

如果将跨度指定为[1、2、2、1],则不会重叠窗口。它将计算:

F(00, 01
  10, 11)

然后

F(02, 03
  12, 13)

跨步操作对于池操作员类似。

问题 2:卷积网络的步幅为何是 [1, x, y, 1]

第一个是批处理:您通常不想跳过批处理中的示例,否则您不应该首先将它们包括在内。:)

最后一个是卷积的深度:出于相同的原因,您通常不想跳过输入。

conv2d运算符比较笼统,因此您可以创建卷积以使窗口沿其他维度滑动,但这在卷积网络中并不常见。典型用途是在空间上使用它们。

为什么要 reshape 成 -1?-1 是一个占位符,表示“根据需要进行调整,以匹配整个张量所需的大小”。这是让代码不依赖输入批大小的一种方法,这样您就可以调整流水线,而不必在代码的各处修改 batch size。

The pooling and convolutional ops slide a “window” across the input tensor. Using tf.nn.conv2d as an example: If the input tensor has 4 dimensions: [batch, height, width, channels], then the convolution operates on a 2D window on the height, width dimensions.

strides determines how much the window shifts by in each of the dimensions. The typical use sets the first (the batch) and last (the depth) stride to 1.

Let’s use a very concrete example: Running a 2-d convolution over a 32×32 greyscale input image. I say greyscale because then the input image has depth=1, which helps keep it simple. Let that image look like this:

00 01 02 03 04 ...
10 11 12 13 14 ...
20 21 22 23 24 ...
30 31 32 33 34 ...
...

Let’s run a 2×2 convolution window over a single example (batch size = 1). We’ll give the convolution an output channel depth of 8.

The input to the convolution has shape=[1, 32, 32, 1].

If you specify strides=[1,1,1,1] with padding=SAME, then the output of the filter will be [1, 32, 32, 8].

The filter will first create an output for:

F(00 01
  10 11)

And then for:

F(01 02
  11 12)

and so on. Then it will move to the second row, calculating:

F(10, 11
  20, 21)

then

F(11, 12
  21, 22)

If you specify a stride of [1, 2, 2, 1] it won’t do overlapping windows. It will compute:

F(00, 01
  10, 11)

and then

F(02, 03
  12, 13)

The stride operates similarly for the pooling operators.

Question 2: Why strides [1, x, y, 1] for convnets

The first 1 is the batch: You don’t usually want to skip over examples in your batch, or you shouldn’t have included them in the first place. :)

The last 1 is the depth of the convolution: You don’t usually want to skip inputs, for the same reason.

The conv2d operator is more general, so you could create convolutions that slide the window along other dimensions, but that’s not a typical use in convnets. The typical use is to use them spatially.

Why reshape to -1? -1 is a placeholder that says “adjust as necessary to match the size needed for the full tensor.” It’s a way of making the code be independent of the input batch size, so that you can change your pipeline and not have to adjust the batch size everywhere in the code.
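As a hedged illustration of the shapes described above (assuming TensorFlow 2.x; the 32x32 greyscale numbers follow the example in this answer):

import numpy as np
import tensorflow as tf

x = tf.constant(np.arange(32 * 32, dtype=np.float32).reshape(1, 32, 32, 1))  # [batch, h, w, channels]
w = tf.ones([2, 2, 1, 8], dtype=tf.float32)                                  # 2x2 filter, depth 1 -> 8 channels

y1 = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
y2 = tf.nn.conv2d(x, w, strides=[1, 2, 2, 1], padding='SAME')
print(y1.shape)  # (1, 32, 32, 8): overlapping windows
print(y2.shape)  # (1, 16, 16, 8): the window moves 2 pixels at a time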


回答 1

输入是4维的,格式为: [batch_size, image_rows, image_cols, number_of_colors]

通常,步幅决定了相邻两次运算之间的重叠程度。对于 conv2d,它指定卷积滤波器相邻两次应用之间相隔多远。某个维度取值为 1 表示在每一行/列都应用运算,取值为 2 表示每隔一行/列应用一次,以此类推。

关于1)对于卷积重要的值是2nd和3rd,它们表示卷积滤波器在沿行和列的应用中的重叠。值[1,2,2,1]表示我们要在每隔第二行和第二列上应用过滤器。

关于2)我不知道技术限制(可能是CuDNN要求),但通常人们会沿行或列尺寸使用步幅。在批处理大小上执行此操作不一定有意义。不确定最后一个尺寸。

关于3)为其中一个维设置-1表示“为第一维设置值,以使张量中的元素总数不变”。在我们的例子中,-1将等于batch_size。

The inputs are 4 dimensional and are of form: [batch_size, image_rows, image_cols, number_of_colors]

Strides, in general, define an overlap between applying operations. In the case of conv2d, it specifies what is the distance between consecutive applications of convolutional filters. The value of 1 in a specific dimension means that we apply the operator at every row/col, the value of 2 means every second, and so on.

Re 1) The values that matter for convolutions are 2nd and 3rd and they represent the overlap in the application of the convolutional filters along rows and columns. The value of [1, 2, 2, 1] says that we want to apply the filters on every second row and column.

Re 2) I don’t know the technical limitations (might be CuDNN requirement) but typically people use strides along the rows or columns dimensions. It doesn’t necessarily make sense to do it over batch size. Not sure of the last dimension.

Re 3) Setting -1 for one of the dimension means, “set the value for the first dimension so that the total number of elements in the tensor is unchanged”. In our case, the -1 will be equal to the batch_size.
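A quick numpy-only sketch of point 3 (the 28x28 shape is borrowed from the question's example):

import numpy as np

flat = np.zeros((3 * 28 * 28,))          # a hypothetical flattened batch of 3 MNIST-sized images
imgs = flat.reshape(-1, 28, 28, 1)       # -1 lets numpy infer the batch dimension
print(imgs.shape)                        # (3, 28, 28, 1)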


回答 2

让我们从1-dim情况下的步幅开始。

假设你的 input = [1, 0, 2, 3, 0, 1, 1],kernel = [2, 1, 3],卷积的结果是 [8, 11, 7, 9, 4]。它是通过让内核在输入上滑动、做逐元素相乘再求和得到的,像这样:

  • 8 = 1 * 2 + 0 * 1 + 2 * 3
  • 11 = 0 * 2 + 2 * 1 + 3 * 3
  • 7 = 2 * 2 + 3 * 1 + 0 * 3
  • 9 = 3 * 2 + 0 * 1 +1 * 3
  • 4 = 0 * 2 +1 * 1 +1 * 3

在这里,我们每次滑动一个元素,但没有什么阻止你换用其他的数字,这个数字就是你的步幅(stride)。你可以把它理解为对步幅为 1 的卷积结果做下采样,只取其中每第 s 个结果。

知道输入大小 i、内核大小 k、步幅 s 和填充 p,就可以很容易地算出卷积的输出大小:

output = ⌈(i - k + 2p + 1) / s⌉

这里 ⌈ ⌉ 表示向上取整(ceiling)操作。对于池化层,s = 1。


N 维的情况

掌握了 1 维情形的数学原理之后,只要意识到每个维度是相互独立的,n 维情形就很容易了:你只需在每个维度上分别滑动即可。这里是一个 2 维的示例。请注意,各个维度的步幅不必相同,因此对于 N 维输入/内核,你应当提供 N 个步幅。


因此,现在很容易回答您的所有问题:

  1. 这 4 个(或更多)整数分别代表什么?conv2d、pool 的文档告诉你,这个列表表示每个维度上的步幅。注意,步幅列表的长度与内核张量的秩相同。
  2. 为什么卷积网络必须满足 strides[0] = strides[3] = 1?第一个维度是批大小,最后一个是通道。既没有理由跳过批中的样本,也没有理由跳过通道,所以把它们设为 1。对于宽度/高度,你可以跳过一些位置,所以它们可以不是 1。
  3. tf.reshape(_X, shape=[-1, 28, 28, 1]),为什么是 -1?tf.reshape 的文档已经说明了:

    如果形状的一个分量是特殊值-1,则将计算该尺寸的大小,以便总大小保持恒定。特别地,[-1]的形状变平为1-D。形状的最多一个分量可以是-1。

Let’s start with what stride does in 1-dim case.

Let’s assume your input = [1, 0, 2, 3, 0, 1, 1] and kernel = [2, 1, 3] the result of the convolution is [8, 11, 7, 9, 4], which is calculated by sliding your kernel over the input, performing element-wise multiplication and summing everything. Like this:

  • 8 = 1 * 2 + 0 * 1 + 2 * 3
  • 11 = 0 * 2 + 2 * 1 + 3 * 3
  • 7 = 2 * 2 + 3 * 1 + 0 * 3
  • 9 = 3 * 2 + 0 * 1 + 1 * 3
  • 4 = 0 * 2 + 1 * 1 + 1 * 3

Here we slide by one element, but nothing stops you by using any other number. This number is your stride. You can think about it as downsampling the result of the 1-strided convolution by just taking every s-th result.

Knowing the input size i, kernel size k, stride s and padding p you can easily calculate the output size of the convolution as:

output = ⌈(i - k + 2p + 1) / s⌉

Here ⌈ ⌉ means the ceiling operation. For a pooling layer s = 1.
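A tiny numpy sketch of the 1-d example above (my own addition, not from the original answer), showing stride as downsampling of the stride-1 result:

import numpy as np

x = np.array([1, 0, 2, 3, 0, 1, 1])
k = np.array([2, 1, 3])

full = [np.dot(x[i:i + len(k)], k) for i in range(len(x) - len(k) + 1)]
print(full)        # [8, 11, 7, 9, 4]: stride 1
print(full[::2])   # [8, 7, 4]: stride 2 keeps every second result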


N-dim case.

Knowing the math for a 1-dim case, n-dim case is easy once you see that each dim is independent. So you just slide each dimension separately. Here is an example for 2-d. Notice that you do not need to have the same stride at all the dimensions. So for an N-dim input/kernel you should provide N strides.


So now it is easy to answer all your questions:

  1. What do each of the 4+ integers represent?. conv2d, pool tells you that this list represents the strides among each dimension. Notice that the length of strides list is the same as the rank of kernel tensor.
  2. Why must they have strides[0] = strides3 = 1 for convnets?. The first dimension is batch size, the last is channels. There is no point of skipping neither batch nor channel. So you make them 1. For width/height you can skip something and that’s why they might be not 1.
  3. tf.reshape(_X,shape=[-1, 28, 28, 1]). Why -1? tf.reshape has it covered for you:

    If one component of shape is the special value -1, the size of that dimension is computed so that the total size remains constant. In particular, a shape of [-1] flattens into 1-D. At most one component of shape can be -1.


回答 3

@dga 的解释非常出色,它给我的帮助让我感激不尽。我也想以类似的方式分享一下我关于 stride 在 3D 卷积中如何工作的发现。

根据conv3d 上的TensorFlow文档,输入的形状必须按以下顺序排列:

[batch, in_depth, in_height, in_width, in_channels]

让我们使用一个示例从最右到左解释变量。假设输入形状为 input_shape = [1000,16,112,112,3]

input_shape[4] is the number of colour channels (RGB or whichever format it is extracted in)
input_shape[3] is the width of the image
input_shape[2] is the height of the image
input_shape[1] is the number of frames that have been lumped into 1 complete data
input_shape[0] is the number of lumped frames of images we have.

以下是有关如何使用步幅的摘要文档。

步幅:长度 >= 5 的整数列表,即长度为 5 的 1-D 张量,表示输入每个维度上滑动窗口的步幅。必须满足 strides[0] = strides[4] = 1。

正如许多资料所指出的,步幅简单来说就是窗口(内核)每次相对最近的元素(无论是数据帧还是像素)跳开多少步(顺便说一句,这是转述)。

从以上文档可以看出,3D 中的步幅应形如 strides = (1, X, Y, Z, 1)。

文档强调必须有 strides[0] = strides[4] = 1。

strides[0]=1 means that we do not want to skip any data in the batch 
strides[4]=1 means that we do not want to skip in the channel 

strides[X] 表示我们在堆叠的帧上要跳过多少。例如,如果我们有 16 帧,X = 1 表示使用每一帧,X = 2 表示每隔一帧使用一次,以此类推。

strides [y]和strides [z]按照@dga的说明进行操作,因此我不会重做该部分。

但是,在keras中,您只需要指定3个整数的元组/列表,即可指定沿每个空间维度的卷积步幅,其中空间维度为stride [x],strides [y]和strides [z]。strides [0]和strides [4]已默认为1。

我希望有人觉得这有帮助!

@dga has done a wonderful job explaining and I can’t be thankful enough how helpful it has been. In the like manner, I will like to share my findings on how stride works in 3D convolution.

According to the TensorFlow documentation on conv3d, the shape of the input must be in this order:

[batch, in_depth, in_height, in_width, in_channels]

Let’s explain the variables from the extreme right to the left using an example. Assuming the input shape is input_shape = [1000,16,112,112,3]

input_shape[4] is the number of colour channels (RGB or whichever format it is extracted in)
input_shape[3] is the width of the image
input_shape[2] is the height of the image
input_shape[1] is the number of frames that have been lumped into 1 complete data
input_shape[0] is the number of lumped frames of images we have.

Below is a summary documentation for how stride is used.

strides: A list of ints that has length >= 5. 1-D tensor of length 5. The stride of the sliding window for each dimension of input. Must have strides[0] = strides[4] = 1

As indicated in many works, strides simply mean how many steps away a window or kernel jumps away from the closest element, be it a data frame or pixel (this is paraphrased by the way).

From the above documentation, a stride in 3D will look like this strides = (1,X,Y,Z,1).

The documentation emphasizes that strides[0] = strides[4] = 1.

strides[0]=1 means that we do not want to skip any data in the batch 
strides[4]=1 means that we do not want to skip in the channel 

strides[X] means how many skips we should make in the lumped frames. So for example, if we have 16 frames, X=1 means use every frame. X=2 means use every second frame and it goes and on

strides[y] and strides[z] follow the explanation by @dga so I will not redo that part.

In keras however, you only need to specify a tuple/list of 3 integers, specifying the strides of the convolution along each spatial dimension, where spatial dimension is stride[x], strides[y] and strides[z]. strides[0] and strides[4] is already defaulted to 1.

I hope someone finds this helpful!


如何将多维数组写入文本文件?

问题:如何将多维数组写入文本文件?

在另一个问题中,其他用户表示,如果我能提供出问题的那个数组,他们可以给些帮助。但是,我甚至连把数组写入文件这样基本的 I/O 任务都搞不定。

谁能解释一下,要把一个 4x11x14 的 numpy 数组写入文件,我需要写什么样的循环?

该数组由四个11 x 14数组组成,因此我应该使用漂亮的换行符对其进行格式化,以使文件读取更加容易。

编辑:所以我已经尝试了numpy.savetxt函数。奇怪的是,它给出了以下错误:

TypeError: float argument required, not numpy.ndarray

我猜这是因为该函数不支持多维数组?有没有什么办法能把它们写进同一个文件?

In another question, other users offered some help if I could supply the array I was having trouble with. However, I even fail at a basic I/O task, such as writing an array to a file.

Can anyone explain what kind of loop I would need to write a 4x11x14 numpy array to file?

This array consist of four 11 x 14 arrays, so I should format it with a nice newline, to make the reading of the file easier on others.

Edit: So I’ve tried the numpy.savetxt function. Strangely, it gives the following error:

TypeError: float argument required, not numpy.ndarray

I assume that this is because the function doesn’t work with multidimensional arrays? Any solutions as I would like them within one file?


回答 0

如果您想把它写入磁盘,以便之后能方便地以 numpy 数组形式读回,请看 numpy.save。用 pickle 序列化也可以,只是对大数组效率较低(您的数组不大,所以两种方式都没问题)。

如果您希望它易于阅读,请查看numpy.savetxt

编辑:所以,对于维度大于 2 的数组,savetxt 似乎并不是一个很好的选择……但为了把事情说完整,结论如下:

我刚刚意识到,numpy.savetxt 在维度大于 2 的 ndarray 上会直接报错……这可能是有意为之,因为并没有一种天然的方式能在文本文件中表示额外的维度。

例如,这个(二维数组)可以正常工作

import numpy as np
x = np.arange(20).reshape((4,5))
np.savetxt('test.txt', x)

而对 3D 数组做同样的事则会失败(错误信息也没什么参考价值:TypeError: float argument required, not numpy.ndarray):

import numpy as np
x = np.arange(200).reshape((4,5,10))
np.savetxt('test.txt', x)

一种解决方法是将3D(或更大)阵列分成2D切片。例如

x = np.arange(200).reshape((4,5,10))
with open('test.txt', 'w') as outfile:
    for slice_2d in x:
        np.savetxt(outfile, slice_2d)

但是,我们的目标是既清晰易读,又仍然能用 numpy.loadtxt 轻松读回。因此,我们可以稍微冗长一些,用注释行来区分各个切片。默认情况下,numpy.loadtxt 会忽略任何以 #(或 comments 关键字参数指定的任何字符)开头的行。(这段代码看起来比实际上要冗长……)

import numpy as np

# Generate some test data
data = np.arange(200).reshape((4,5,10))

# Write the array to disk
with open('test.txt', 'w') as outfile:
    # I'm writing a header here just for the sake of readability
    # Any line starting with "#" will be ignored by numpy.loadtxt
    outfile.write('# Array shape: {0}\n'.format(data.shape))

    # Iterating through a ndimensional array produces slices along
    # the last axis. This is equivalent to data[i,:,:] in this case
    for data_slice in data:

        # The formatting string indicates that I'm writing out
        # the values in left-justified columns 7 characters in width
        # with 2 decimal places.  
        np.savetxt(outfile, data_slice, fmt='%-7.2f')

        # Writing out a break to indicate different slices...
        outfile.write('# New slice\n')

这样生成:

# Array shape: (4, 5, 10)
0.00    1.00    2.00    3.00    4.00    5.00    6.00    7.00    8.00    9.00   
10.00   11.00   12.00   13.00   14.00   15.00   16.00   17.00   18.00   19.00  
20.00   21.00   22.00   23.00   24.00   25.00   26.00   27.00   28.00   29.00  
30.00   31.00   32.00   33.00   34.00   35.00   36.00   37.00   38.00   39.00  
40.00   41.00   42.00   43.00   44.00   45.00   46.00   47.00   48.00   49.00  
# New slice
50.00   51.00   52.00   53.00   54.00   55.00   56.00   57.00   58.00   59.00  
60.00   61.00   62.00   63.00   64.00   65.00   66.00   67.00   68.00   69.00  
70.00   71.00   72.00   73.00   74.00   75.00   76.00   77.00   78.00   79.00  
80.00   81.00   82.00   83.00   84.00   85.00   86.00   87.00   88.00   89.00  
90.00   91.00   92.00   93.00   94.00   95.00   96.00   97.00   98.00   99.00  
# New slice
100.00  101.00  102.00  103.00  104.00  105.00  106.00  107.00  108.00  109.00 
110.00  111.00  112.00  113.00  114.00  115.00  116.00  117.00  118.00  119.00 
120.00  121.00  122.00  123.00  124.00  125.00  126.00  127.00  128.00  129.00 
130.00  131.00  132.00  133.00  134.00  135.00  136.00  137.00  138.00  139.00 
140.00  141.00  142.00  143.00  144.00  145.00  146.00  147.00  148.00  149.00 
# New slice
150.00  151.00  152.00  153.00  154.00  155.00  156.00  157.00  158.00  159.00 
160.00  161.00  162.00  163.00  164.00  165.00  166.00  167.00  168.00  169.00 
170.00  171.00  172.00  173.00  174.00  175.00  176.00  177.00  178.00  179.00 
180.00  181.00  182.00  183.00  184.00  185.00  186.00  187.00  188.00  189.00 
190.00  191.00  192.00  193.00  194.00  195.00  196.00  197.00  198.00  199.00 
# New slice

只要我们知道原始数组的形状,读回来就很容易:直接 numpy.loadtxt('test.txt').reshape((4,5,10)) 即可。下面是一个示例(其实一行就能完成,这里写得啰嗦一点只是为了把事情说清楚):

# Read the array from disk
new_data = np.loadtxt('test.txt')

# Note that this returned a 2D array!
print new_data.shape

# However, going back to 3D is easy if we know the 
# original shape of the array
new_data = new_data.reshape((4,5,10))

# Just to check that they're the same...
assert np.all(new_data == data)

If you want to write it to disk so that it will be easy to read back in as a numpy array, look into numpy.save. Pickling it will work fine, as well, but it’s less efficient for large arrays (which yours isn’t, so either is perfectly fine).

If you want it to be human readable, look into numpy.savetxt.
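For completeness, the numpy.save route mentioned above looks roughly like this (a minimal sketch reusing the question's 4x11x14 shape):

import numpy as np

data = np.arange(4 * 11 * 14).reshape((4, 11, 14))
np.save('data.npy', data)          # binary, not human readable, but keeps shape and dtype
loaded = np.load('data.npy')
assert loaded.shape == (4, 11, 14) and np.all(loaded == data)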

Edit: So, it seems like savetxt isn’t quite as great an option for arrays with >2 dimensions… But just to draw everything out to it’s full conclusion:

I just realized that numpy.savetxt chokes on ndarrays with more than 2 dimensions… This is probably by design, as there’s no inherently defined way to indicate additional dimensions in a text file.

E.g. This (a 2D array) works fine

import numpy as np
x = np.arange(20).reshape((4,5))
np.savetxt('test.txt', x)

While the same thing would fail (with a rather uninformative error: TypeError: float argument required, not numpy.ndarray) for a 3D array:

import numpy as np
x = np.arange(200).reshape((4,5,10))
np.savetxt('test.txt', x)

One workaround is just to break the 3D (or greater) array into 2D slices. E.g.

x = np.arange(200).reshape((4,5,10))
with open('test.txt', 'w') as outfile:
    for slice_2d in x:
        np.savetxt(outfile, slice_2d)

However, our goal is to be clearly human readable, while still being easily read back in with numpy.loadtxt. Therefore, we can be a bit more verbose, and differentiate the slices using commented out lines. By default, numpy.loadtxt will ignore any lines that start with # (or whichever character is specified by the comments kwarg). (This looks more verbose than it actually is…)

import numpy as np

# Generate some test data
data = np.arange(200).reshape((4,5,10))

# Write the array to disk
with open('test.txt', 'w') as outfile:
    # I'm writing a header here just for the sake of readability
    # Any line starting with "#" will be ignored by numpy.loadtxt
    outfile.write('# Array shape: {0}\n'.format(data.shape))
    
    # Iterating through a ndimensional array produces slices along
    # the last axis. This is equivalent to data[i,:,:] in this case
    for data_slice in data:

        # The formatting string indicates that I'm writing out
        # the values in left-justified columns 7 characters in width
        # with 2 decimal places.  
        np.savetxt(outfile, data_slice, fmt='%-7.2f')

        # Writing out a break to indicate different slices...
        outfile.write('# New slice\n')

This yields:

# Array shape: (4, 5, 10)
0.00    1.00    2.00    3.00    4.00    5.00    6.00    7.00    8.00    9.00   
10.00   11.00   12.00   13.00   14.00   15.00   16.00   17.00   18.00   19.00  
20.00   21.00   22.00   23.00   24.00   25.00   26.00   27.00   28.00   29.00  
30.00   31.00   32.00   33.00   34.00   35.00   36.00   37.00   38.00   39.00  
40.00   41.00   42.00   43.00   44.00   45.00   46.00   47.00   48.00   49.00  
# New slice
50.00   51.00   52.00   53.00   54.00   55.00   56.00   57.00   58.00   59.00  
60.00   61.00   62.00   63.00   64.00   65.00   66.00   67.00   68.00   69.00  
70.00   71.00   72.00   73.00   74.00   75.00   76.00   77.00   78.00   79.00  
80.00   81.00   82.00   83.00   84.00   85.00   86.00   87.00   88.00   89.00  
90.00   91.00   92.00   93.00   94.00   95.00   96.00   97.00   98.00   99.00  
# New slice
100.00  101.00  102.00  103.00  104.00  105.00  106.00  107.00  108.00  109.00 
110.00  111.00  112.00  113.00  114.00  115.00  116.00  117.00  118.00  119.00 
120.00  121.00  122.00  123.00  124.00  125.00  126.00  127.00  128.00  129.00 
130.00  131.00  132.00  133.00  134.00  135.00  136.00  137.00  138.00  139.00 
140.00  141.00  142.00  143.00  144.00  145.00  146.00  147.00  148.00  149.00 
# New slice
150.00  151.00  152.00  153.00  154.00  155.00  156.00  157.00  158.00  159.00 
160.00  161.00  162.00  163.00  164.00  165.00  166.00  167.00  168.00  169.00 
170.00  171.00  172.00  173.00  174.00  175.00  176.00  177.00  178.00  179.00 
180.00  181.00  182.00  183.00  184.00  185.00  186.00  187.00  188.00  189.00 
190.00  191.00  192.00  193.00  194.00  195.00  196.00  197.00  198.00  199.00 
# New slice

Reading it back in is very easy, as long as we know the shape of the original array. We can just do numpy.loadtxt('test.txt').reshape((4,5,10)). As an example (You can do this in one line, I’m just being verbose to clarify things):

# Read the array from disk
new_data = np.loadtxt('test.txt')

# Note that this returned a 2D array!
print new_data.shape

# However, going back to 3D is easy if we know the 
# original shape of the array
new_data = new_data.reshape((4,5,10))
    
# Just to check that they're the same...
assert np.all(new_data == data)
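
如果不要求人类可读,前面提到的 numpy.save 的用法大致如下(这只是一个最小示意,文件名 test.npy 为假设的示例):

import numpy as np

data = np.arange(200).reshape((4, 5, 10))

# 以 numpy 自带的二进制 .npy 格式保存,形状和 dtype 会一并写入文件
np.save('test.npy', data)

# 读回时不需要知道原始形状,也不需要 reshape
new_data = np.load('test.npy')
assert new_data.shape == (4, 5, 10)
assert np.all(new_data == data)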

回答 1

鉴于我认为您是希望这个文件可以被人类读懂,我不确定这是否满足您的要求;但如果这不是主要考虑,直接用 pickle 就可以了。

要保存它:

import pickle

my_data = {'a': [1, 2.0, 3, 4+6j],
           'b': ('string', u'Unicode string'),
           'c': None}
output = open('data.pkl', 'wb')
pickle.dump(my_data, output)
output.close()

读回:

import pprint, pickle

pkl_file = open('data.pkl', 'rb')

data1 = pickle.load(pkl_file)
pprint.pprint(data1)

pkl_file.close()

I am not certain if this meets your requirements, given I think you are interested in making the file readable by people, but if that’s not a primary concern, just pickle it.

To save it:

import pickle

my_data = {'a': [1, 2.0, 3, 4+6j],
           'b': ('string', u'Unicode string'),
           'c': None}
output = open('data.pkl', 'wb')
pickle.dump(my_data, output)
output.close()

To read it back:

import pprint, pickle

pkl_file = open('data.pkl', 'rb')

data1 = pickle.load(pkl_file)
pprint.pprint(data1)

pkl_file.close()
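
套用到题目里的三维数组上大致是这样(仅作示意,文件名随意):

import pickle
import numpy as np

data = np.arange(200).reshape((4, 5, 10))

# 直接把整个三维数组 dump 到文件里
with open('data.pkl', 'wb') as output:
    pickle.dump(data, output)

# 读回来就是原来的 ndarray,形状也保留了下来
with open('data.pkl', 'rb') as pkl_file:
    data1 = pickle.load(pkl_file)

assert np.all(data1 == data)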

回答 2

如果不需要人类可读的输出,另一种可以尝试的选择是把数组另存为 MATLAB 的 .mat 文件,它是一种结构化数组格式。我鄙视 MATLAB,但只需寥寥几行就能读写 .mat 文件这一点确实很方便。

与 Joe Kington 的回答不同,这样做的好处是你不需要知道 .mat 文件中数据的原始形状,也就是说读取时无需 reshape。而且,与使用 pickle 不同,.mat 文件不仅能被 MATLAB 读取,也能被其他一些程序/语言读取。

这是一个例子:

import numpy as np
import scipy.io

# Some test data
x = np.arange(200).reshape((4,5,10))

# Specify the filename of the .mat file
matfile = 'test_mat.mat'

# Write the array to the mat file. For this to work, the array must be the value
# corresponding to a key name of your choice in a dictionary
scipy.io.savemat(matfile, mdict={'out': x}, oned_as='row')

# For the above line, I specified the kwarg oned_as since python (2.7 with 
# numpy 1.6.1) throws a FutureWarning.  Here, this isn't really necessary 
# since oned_as is a kwarg for dealing with 1-D arrays.

# Now load in the data from the .mat that was just saved
matdata = scipy.io.loadmat(matfile)

# And just to check if the data is the same:
assert np.all(x == matdata['out'])

如果忘记了数组在 .mat 文件中对应的键名,则始终可以执行以下操作:

print matdata.keys()

当然,您可以使用更多键存储许多数组。

因此,是的,它不是肉眼可读的,但只需要两行就能完成数据的写入和读取,我认为这是一个公平的权衡。

看看scipy.io.savematscipy.io.loadmat的文档, 以及本教程页面:scipy.io文件IO教程

If you don’t need a human-readable output, another option you could try is to save the array as a MATLAB .mat file, which is a structured array. I despise MATLAB, but the fact that I can both read and write a .mat in very few lines is convenient.

Unlike Joe Kington’s answer, the benefit of this is that you don’t need to know the original shape of the data in the .mat file, i.e. no need to reshape upon reading in. And, unlike using pickle, a .mat file can be read by MATLAB, and probably some other programs/languages as well.

Here is an example:

import numpy as np
import scipy.io

# Some test data
x = np.arange(200).reshape((4,5,10))

# Specify the filename of the .mat file
matfile = 'test_mat.mat'

# Write the array to the mat file. For this to work, the array must be the value
# corresponding to a key name of your choice in a dictionary
scipy.io.savemat(matfile, mdict={'out': x}, oned_as='row')

# For the above line, I specified the kwarg oned_as since python (2.7 with 
# numpy 1.6.1) throws a FutureWarning.  Here, this isn't really necessary 
# since oned_as is a kwarg for dealing with 1-D arrays.

# Now load in the data from the .mat that was just saved
matdata = scipy.io.loadmat(matfile)

# And just to check if the data is the same:
assert np.all(x == matdata['out'])

If you forget the key that the array is named in the .mat file, you can always do:

print matdata.keys()

And of course you can store many arrays using many more keys.

So yes – it won’t be readable with your eyes, but only takes 2 lines to write and read the data, which I think is a fair trade-off.

Take a look at the docs for scipy.io.savemat and scipy.io.loadmat and also this tutorial page: scipy.io File IO Tutorial


回答 3

ndarray.tofile() 应该也可以

例如,如果您的数组名为 a:

a.tofile('yourfile.txt',sep=" ",format="%s")

不过不确定如何得到换行的格式。

编辑(感谢 Kevin J. Black 在此处的评论):

从 1.5.0 版开始,np.savetxt() 接受可选参数 newline='\n',以允许多行输出。 https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.savetxt.html

ndarray.tofile() should also work

e.g. if your array is called a:

a.tofile('yourfile.txt',sep=" ",format="%s")

Not sure how to get newline formatting though.

Edit (credit Kevin J. Black’s comment here):

Since version 1.5.0, np.savetxt() takes an optional parameter newline='\n' to allow multi-line output. https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.savetxt.html
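
补充一个往返读写的示意(假设还是前面那个 (4,5,10) 的整数数组,文件名仅作示例)。注意 tofile 不会保存形状和 dtype 信息,读回时要自己指定并 reshape:

import numpy as np

a = np.arange(200).reshape((4, 5, 10))

# 以空格分隔的文本形式写出"扁平"的数据
a.tofile('yourfile.txt', sep=" ", format="%s")

# 读回时指定 dtype 和分隔符,再手动 reshape 回原始形状
b = np.fromfile('yourfile.txt', dtype=int, sep=" ").reshape((4, 5, 10))
assert np.all(a == b)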


回答 4

有专门的库可以做到这一点(外加对应的 Python 包装器)。

希望这可以帮助

There exist special libraries to do just that. (Plus wrappers for python)

hope this helps


回答 5

您可以简单地在三个嵌套循环中遍历数组并将其值写入文件。为了阅读,您只需使用完全相同的循环结构即可。您将以正确的顺序获得值,以再次正确填充数组。

You can simply traverse the array in three nested loops and write their values to your file. For reading, you simply use the same exact loop construction. You will get the values in exactly the right order to fill your arrays correctly again.
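
下面是这个思路的一个最小示意(假设数组形状为 (4,5,10),文件名 loops.txt 仅作示例):

import numpy as np

data = np.arange(200).reshape((4, 5, 10))

# 写出:三层嵌套循环,每个数值占一行
with open('loops.txt', 'w') as f:
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            for k in range(data.shape[2]):
                f.write('%d\n' % data[i, j, k])

# 读回:用完全相同的循环结构,按同样的顺序填回数组
new_data = np.empty_like(data)
with open('loops.txt') as f:
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            for k in range(data.shape[2]):
                new_data[i, j, k] = int(f.readline())

assert np.all(new_data == data)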


回答 6

我有一种方法可以使用简单的filename.write()操作。它对我来说很好用,但是我正在处理具有约1500个数据元素的数组。

我基本上只需要for循环来遍历文件,然后以csv样式输出将其逐行写入输出目标。

import numpy as np

trial = np.genfromtxt("/extension/file.txt", dtype = str, delimiter = ",")

# number of columns, taken from the array itself
num_of_columns = trial.shape[1]

with open("/extension/file.txt", "w") as f:
    for x in xrange(len(trial[:,1])):
        for y in range(num_of_columns):
            if y < num_of_columns-1:
                f.write(trial[x][y] + ",")
            elif y == num_of_columns-1:
                f.write(trial[x][y])
        f.write("\n")

if和elif语句用于在数据元素之间添加逗号。无论出于何种原因,当以nd数组形式读取文件时,这些内容都会被删除。我的目标是将文件输出为csv,因此此方法有助于解决该问题。

希望这可以帮助!

I have a way to do it using a simply filename.write() operation. It works fine for me, but I’m dealing with arrays having ~1500 data elements.

I basically just have for loops to iterate through the file and write it to the output destination line-by-line in a csv style output.

import numpy as np

trial = np.genfromtxt("/extension/file.txt", dtype = str, delimiter = ",")

# number of columns, taken from the array itself
num_of_columns = trial.shape[1]

with open("/extension/file.txt", "w") as f:
    for x in xrange(len(trial[:,1])):
        for y in range(num_of_columns):
            if y < num_of_columns-1:
                f.write(trial[x][y] + ",")
            elif y == num_of_columns-1:
                f.write(trial[x][y])
        f.write("\n")

The if and elif statement are used to add commas between the data elements. For whatever reason, these get stripped out when reading the file in as an nd array. My goal was to output the file as a csv, so this method helps to handle that.

Hope this helps!


回答 7

Pickle 最适合这些情况。假设您有一个名为 x_train 的 ndarray。您可以将其转储到文件中,然后使用以下命令将其还原:

import pickle

###Load into file
with open("myfile.pkl","wb") as f:
    pickle.dump(x_train,f)

###Extract from file
with open("myfile.pkl","rb") as f:
    x_temp = pickle.load(f)

Pickle is best for these cases. Suppose you have a ndarray named x_train. You can dump it into a file and revert it back using the following command:

import pickle

###Load into file
with open("myfile.pkl","wb") as f:
    pickle.dump(x_train,f)

###Extract from file
with open("myfile.pkl","rb") as f:
    x_temp = pickle.load(f)

在python中从单元测试输出数据

问题:在python中从单元测试输出数据

如果我用 Python(使用 unittest 模块)编写单元测试,是否可以从失败的测试中输出数据,以便检查这些数据来帮助推断出错的原因?我知道可以创建自定义消息来携带一些信息,但有时你要处理的数据更复杂,不容易表示成字符串。

例如,假设您有一个 Foo 类,并且正在用名为 testdata 的列表中的数据来测试它的 bar 方法:

class TestBar(unittest.TestCase):
    def runTest(self):
        for t1, t2 in testdata:
            f = Foo(t1)
            self.assertEqual(f.bar(t2), 2)

如果测试失败,则可能要输出t1,t2和/或f,以查看为什么此特定数据导致失败。通过输出,我的意思是说,在运行测试之后,可以像访问任何其他变量一样访问这些变量。

If I’m writing unit tests in python (using the unittest module), is it possible to output data from a failed test, so I can examine it to help deduce what caused the error? I am aware of the ability to create a customized message, which can carry some information, but sometimes you might deal with more complex data, that can’t easily be represented as a string.

For example, suppose you had a class Foo, and were testing a method bar, using data from a list called testdata:

class TestBar(unittest.TestCase):
    def runTest(self):
        for t1, t2 in testdata:
            f = Foo(t1)
            self.assertEqual(f.bar(t2), 2)

If the test failed, I might want to output t1, t2 and/or f, to see why this particular data resulted in a failure. By output, I mean that the variables can be accessed like any other variables, after the test has been run.


回答 0

给像我这样来这里寻找简单快速答案的人的一个迟到的回答。

在 Python 2.7 中,您可以使用附加参数 msg 把信息添加到错误消息中,如下所示:

self.assertEqual(f.bar(t2), 2, msg='{0}, {1}'.format(t1, t2))

官方文档在这里

Very late answer for someone that, like me, comes here looking for a simple and quick answer.

In Python 2.7 you could use an additional parameter msg to add information to the error message like this:

self.assertEqual(f.bar(t2), 2, msg='{0}, {1}'.format(t1, t2))

Official docs here
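
放进一个可以直接运行的小例子里大致如下(测试数据纯属示意,第二组数据会失败,并把 t1、t2 的值带进错误信息):

import unittest

class TestBar(unittest.TestCase):
    def test_bar(self):
        testdata = [(1, 1), (2, 3)]
        for t1, t2 in testdata:
            # 断言失败时,msg 的内容会附加到 AssertionError 的消息后面
            self.assertEqual(t1, t2, msg='t1={0}, t2={1}'.format(t1, t2))

if __name__ == '__main__':
    unittest.main()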


回答 1

我们为此使用日志记录模块。

例如:

import sys
import unittest
import logging
class SomeTest( unittest.TestCase ):
    def testSomething( self ):
        log= logging.getLogger( "SomeTest.testSomething" )
        log.debug( "this= %r", self.this )
        log.debug( "that= %r", self.that )
        # etc.
        self.assertEquals( 3.14, pi )

if __name__ == "__main__":
    logging.basicConfig( stream=sys.stderr )
    logging.getLogger( "SomeTest.testSomething" ).setLevel( logging.DEBUG )
    unittest.main()

这使我们可以针对已知失败、并且需要额外调试信息的特定测试打开调试输出。

但是,我的首选方法不是花很多时间进行调试,而是花更多的时间编写更细粒度的测试来揭示问题。

We use the logging module for this.

For example:

import sys
import unittest
import logging
class SomeTest( unittest.TestCase ):
    def testSomething( self ):
        log= logging.getLogger( "SomeTest.testSomething" )
        log.debug( "this= %r", self.this )
        log.debug( "that= %r", self.that )
        # etc.
        self.assertEquals( 3.14, pi )

if __name__ == "__main__":
    logging.basicConfig( stream=sys.stderr )
    logging.getLogger( "SomeTest.testSomething" ).setLevel( logging.DEBUG )
    unittest.main()

That allows us to turn on debugging for specific tests which we know are failing and for which we want additional debugging information.

My preferred method, however, isn’t to spend a lot of time on debugging, but spend it writing more fine-grained tests to expose the problem.


回答 2

您可以使用简单的打印语句,也可以使用其他任何方式写入stdout。您还可以在测试中的任何地方调用Python调试器。

如果你用 nose 来运行测试(我推荐这样做),它会为每个测试收集标准输出,并且只在测试失败时才把它显示给你,因此测试通过时你不必忍受混乱的输出。

nose 还提供一些开关,可以自动显示断言中涉及的变量,或者在测试失败时调用调试器。例如 -s(--nocapture)可以阻止捕获 stdout。

You can use simple print statements, or any other way of writing to stdout. You can also invoke the Python debugger anywhere in your tests.

If you use nose to run your tests (which I recommend), it will collect the stdout for each test and only show it to you if the test failed, so you don’t have to live with the cluttered output when the tests pass.

nose also has switches to automatically show variables mentioned in asserts, or to invoke the debugger on failed tests. For example -s (--nocapture) prevents the capture of stdout.
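
在测试里直接进入 Python 调试器的一个最小示意(在想检查变量的位置手动设置断点即可):

import pdb
import unittest

class TestBar(unittest.TestCase):
    def test_bar(self):
        t1, t2 = 'baz', 3
        # 进入 pdb 之后,可以交互式地查看 t1、t2 等局部变量
        pdb.set_trace()
        self.assertEqual(len(t1), t2)

if __name__ == '__main__':
    # 用 nose 运行时记得加 -s,避免 stdout 被捕获,否则无法与 pdb 交互
    unittest.main()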


回答 3

我认为这并不完全是您要找的东西(没有办法显示那些没有失败的测试中的变量值),但它可以帮助您更接近以所需方式输出结果。

您可以使用TestRunner.run()返回的TestResult对象进行结果分析和处理。特别是TestResult.errors和TestResult.failures

关于TestResults对象:

http://docs.python.org/library/unittest.html#id3

还有一些代码可以指导您正确的方向:

>>> import random
>>> import unittest
>>>
>>> class TestSequenceFunctions(unittest.TestCase):
...     def setUp(self):
...         self.seq = range(5)
...     def testshuffle(self):
...         # make sure the shuffled sequence does not lose any elements
...         random.shuffle(self.seq)
...         self.seq.sort()
...         self.assertEqual(self.seq, range(10))
...     def testchoice(self):
...         element = random.choice(self.seq)
...         error_test = 1/0
...         self.assert_(element in self.seq)
...     def testsample(self):
...         self.assertRaises(ValueError, random.sample, self.seq, 20)
...         for element in random.sample(self.seq, 5):
...             self.assert_(element in self.seq)
...
>>> suite = unittest.TestLoader().loadTestsFromTestCase(TestSequenceFunctions)
>>> testResult = unittest.TextTestRunner(verbosity=2).run(suite)
testchoice (__main__.TestSequenceFunctions) ... ERROR
testsample (__main__.TestSequenceFunctions) ... ok
testshuffle (__main__.TestSequenceFunctions) ... FAIL

======================================================================
ERROR: testchoice (__main__.TestSequenceFunctions)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<stdin>", line 11, in testchoice
ZeroDivisionError: integer division or modulo by zero

======================================================================
FAIL: testshuffle (__main__.TestSequenceFunctions)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<stdin>", line 8, in testshuffle
AssertionError: [0, 1, 2, 3, 4] != [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

----------------------------------------------------------------------
Ran 3 tests in 0.031s

FAILED (failures=1, errors=1)
>>>
>>> testResult.errors
[(<__main__.TestSequenceFunctions testMethod=testchoice>, 'Traceback (most recent call last):\n  File "<stdin>"
, line 11, in testchoice\nZeroDivisionError: integer division or modulo by zero\n')]
>>>
>>> testResult.failures
[(<__main__.TestSequenceFunctions testMethod=testshuffle>, 'Traceback (most recent call last):\n  File "<stdin>
", line 8, in testshuffle\nAssertionError: [0, 1, 2, 3, 4] != [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n')]
>>>

I don’t think this is quite what your looking for, there’s no way to display variable values that don’t fail, but this may help you get closer to outputting the results the way you want.

You can use the TestResult object returned by the TestRunner.run() for results analysis and processing. Particularly, TestResult.errors and TestResult.failures

About the TestResults Object:

http://docs.python.org/library/unittest.html#id3

And some code to point you in the right direction:

>>> import random
>>> import unittest
>>>
>>> class TestSequenceFunctions(unittest.TestCase):
...     def setUp(self):
...         self.seq = range(5)
...     def testshuffle(self):
...         # make sure the shuffled sequence does not lose any elements
...         random.shuffle(self.seq)
...         self.seq.sort()
...         self.assertEqual(self.seq, range(10))
...     def testchoice(self):
...         element = random.choice(self.seq)
...         error_test = 1/0
...         self.assert_(element in self.seq)
...     def testsample(self):
...         self.assertRaises(ValueError, random.sample, self.seq, 20)
...         for element in random.sample(self.seq, 5):
...             self.assert_(element in self.seq)
...
>>> suite = unittest.TestLoader().loadTestsFromTestCase(TestSequenceFunctions)
>>> testResult = unittest.TextTestRunner(verbosity=2).run(suite)
testchoice (__main__.TestSequenceFunctions) ... ERROR
testsample (__main__.TestSequenceFunctions) ... ok
testshuffle (__main__.TestSequenceFunctions) ... FAIL

======================================================================
ERROR: testchoice (__main__.TestSequenceFunctions)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<stdin>", line 11, in testchoice
ZeroDivisionError: integer division or modulo by zero

======================================================================
FAIL: testshuffle (__main__.TestSequenceFunctions)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<stdin>", line 8, in testshuffle
AssertionError: [0, 1, 2, 3, 4] != [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

----------------------------------------------------------------------
Ran 3 tests in 0.031s

FAILED (failures=1, errors=1)
>>>
>>> testResult.errors
[(<__main__.TestSequenceFunctions testMethod=testchoice>, 'Traceback (most recent call last):\n  File "<stdin>"
, line 11, in testchoice\nZeroDivisionError: integer division or modulo by zero\n')]
>>>
>>> testResult.failures
[(<__main__.TestSequenceFunctions testMethod=testshuffle>, 'Traceback (most recent call last):\n  File "<stdin>
", line 8, in testshuffle\nAssertionError: [0, 1, 2, 3, 4] != [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n')]
>>>

回答 4

另一个选择:在测试失败的地方启动调试器。

尝试用 Testoob 运行你的测试(它可以不加修改地运行你的 unittest 套件),当某个测试失败时,你可以用 '--debug' 命令行开关打开调试器。

这是Windows上的终端会话:

C:\work> testoob tests.py --debug
F
Debugging for failure in test: test_foo (tests.MyTests.test_foo)
> c:\python25\lib\unittest.py(334)failUnlessEqual()
-> (msg or '%r != %r' % (first, second))
(Pdb) up
> c:\work\tests.py(6)test_foo()
-> self.assertEqual(x, y)
(Pdb) l
  1     from unittest import TestCase
  2     class MyTests(TestCase):
  3       def test_foo(self):
  4         x = 1
  5         y = 2
  6  ->     self.assertEqual(x, y)
[EOF]
(Pdb)

Another option – start a debugger where the test fails.

Try running your tests with Testoob (it will run your unittest suite without changes), and you can use the '--debug' command line switch to open a debugger when a test fails.

Here’s a terminal session on windows:

C:\work> testoob tests.py --debug
F
Debugging for failure in test: test_foo (tests.MyTests.test_foo)
> c:\python25\lib\unittest.py(334)failUnlessEqual()
-> (msg or '%r != %r' % (first, second))
(Pdb) up
> c:\work\tests.py(6)test_foo()
-> self.assertEqual(x, y)
(Pdb) l
  1     from unittest import TestCase
  2     class MyTests(TestCase):
  3       def test_foo(self):
  4         x = 1
  5         y = 2
  6  ->     self.assertEqual(x, y)
[EOF]
(Pdb)

回答 5

我使用的方法非常简单。我只是将其记录为警告,因此它实际上会显示出来。

import logging

class TestBar(unittest.TestCase):
    def runTest(self):

       #this line is important
       logging.basicConfig()
       log = logging.getLogger("LOG")

       for t1, t2 in testdata:
         f = Foo(t1)
         self.assertEqual(f.bar(t2), 2)
         log.warning(t1)

The method I use is really simple. I just log it as a warning so it will actually show up.

import logging

class TestBar(unittest.TestCase):
    def runTest(self):

       #this line is important
       logging.basicConfig()
       log = logging.getLogger("LOG")

       for t1, t2 in testdata:
         f = Foo(t1)
         self.assertEqual(f.bar(t2), 2)
         log.warning(t1)

回答 6

我想我可能把这个问题想得太复杂了。我想出的一个能做到这件事的办法,就是简单地用一个全局变量来累积诊断数据。

像这样的东西:

log1 = dict()
class TestBar(unittest.TestCase):
    def runTest(self):
        for t1, t2 in testdata:
            f = Foo(t1) 
            if f.bar(t2) != 2: 
                log1["TestBar.runTest"] = (f, t1, t2)
                self.fail("f.bar(t2) != 2")

感谢大家的答复。这些答复给了我一些在 Python 单元测试中记录信息的替代思路。

I think I might have been overthinking this. One way I’ve come up with that does the job, is simply to have a global variable, that accumulates the diagnostic data.

Something like this:

log1 = dict()
class TestBar(unittest.TestCase):
    def runTest(self):
        for t1, t2 in testdata:
            f = Foo(t1) 
            if f.bar(t2) != 2: 
                log1["TestBar.runTest"] = (f, t1, t2)
                self.fail("f.bar(t2) != 2")

Thanks for the replies. They have given me some alternative ideas for how to record information from unit tests in python.


回答 7

使用日志记录:

import unittest
import logging
import inspect
import os

logging_level = logging.INFO

try:
    log_file = os.environ["LOG_FILE"]
except KeyError:
    log_file = None

def logger(stack=None):
    if not hasattr(logger, "initialized"):
        logging.basicConfig(filename=log_file, level=logging_level)
        logger.initialized = True
    if not stack:
        stack = inspect.stack()
    name = stack[1][3]
    try:
        name = stack[1][0].f_locals["self"].__class__.__name__ + "." + name
    except KeyError:
        pass
    return logging.getLogger(name)

def todo(msg):
    logger(inspect.stack()).warning("TODO: {}".format(msg))

def get_pi():
    logger().info("sorry, I know only three digits")
    return 3.14

class Test(unittest.TestCase):

    def testName(self):
        todo("use a better get_pi")
        pi = get_pi()
        logger().info("pi = {}".format(pi))
        todo("check more digits in pi")
        self.assertAlmostEqual(pi, 3.14)
        logger().debug("end of this test")
        pass

用法:

# LOG_FILE=/tmp/log python3 -m unittest LoggerDemo
.
----------------------------------------------------------------------
Ran 1 test in 0.047s

OK
# cat /tmp/log
WARNING:Test.testName:TODO: use a better get_pi
INFO:get_pi:sorry, I know only three digits
INFO:Test.testName:pi = 3.14
WARNING:Test.testName:TODO: check more digits in pi

如果您未设置 LOG_FILE,日志将输出到 stderr。

Use logging:

import unittest
import logging
import inspect
import os

logging_level = logging.INFO

try:
    log_file = os.environ["LOG_FILE"]
except KeyError:
    log_file = None

def logger(stack=None):
    if not hasattr(logger, "initialized"):
        logging.basicConfig(filename=log_file, level=logging_level)
        logger.initialized = True
    if not stack:
        stack = inspect.stack()
    name = stack[1][3]
    try:
        name = stack[1][0].f_locals["self"].__class__.__name__ + "." + name
    except KeyError:
        pass
    return logging.getLogger(name)

def todo(msg):
    logger(inspect.stack()).warning("TODO: {}".format(msg))

def get_pi():
    logger().info("sorry, I know only three digits")
    return 3.14

class Test(unittest.TestCase):

    def testName(self):
        todo("use a better get_pi")
        pi = get_pi()
        logger().info("pi = {}".format(pi))
        todo("check more digits in pi")
        self.assertAlmostEqual(pi, 3.14)
        logger().debug("end of this test")
        pass

Usage:

# LOG_FILE=/tmp/log python3 -m unittest LoggerDemo
.
----------------------------------------------------------------------
Ran 1 test in 0.047s

OK
# cat /tmp/log
WARNING:Test.testName:TODO: use a better get_pi
INFO:get_pi:sorry, I know only three digits
INFO:Test.testName:pi = 3.14
WARNING:Test.testName:TODO: check more digits in pi

If you do not set LOG_FILE, logging will go to stderr.


回答 8

您可以使用 logging模块。

因此,在单元测试代码中,使用:

import logging as log

def test_foo(self):
    log.debug("Some debug message.")
    log.info("Some info message.")
    log.warning("Some warning message.")
    log.error("Some error message.")

默认情况下,警告和错误输出到 /dev/stderr,因此它们应该在控制台上可见。

要自定义日志(例如格式化),请尝试以下示例:

# Set-up logger
if args.verbose or args.debug:
    logging.basicConfig( stream=sys.stdout )
    root = logging.getLogger()
    root.setLevel(logging.INFO if args.verbose else logging.DEBUG)
    ch = logging.StreamHandler(sys.stdout)
    ch.setLevel(logging.INFO if args.verbose else logging.DEBUG)
    ch.setFormatter(logging.Formatter('%(asctime)s %(levelname)s: %(name)s: %(message)s'))
    root.addHandler(ch)
else:
    logging.basicConfig(stream=sys.stderr)

You can use logging module for that.

So in the unit test code, use:

import logging as log

def test_foo(self):
    log.debug("Some debug message.")
    log.info("Some info message.")
    log.warning("Some warning message.")
    log.error("Some error message.")

By default warnings and errors are outputted to /dev/stderr, so they should be visible on the console.

To customize logs (such as formatting), try the following sample:

# Set-up logger
if args.verbose or args.debug:
    logging.basicConfig( stream=sys.stdout )
    root = logging.getLogger()
    root.setLevel(logging.INFO if args.verbose else logging.DEBUG)
    ch = logging.StreamHandler(sys.stdout)
    ch.setLevel(logging.INFO if args.verbose else logging.DEBUG)
    ch.setFormatter(logging.Formatter('%(asctime)s %(levelname)s: %(name)s: %(message)s'))
    root.addHandler(ch)
else:
    logging.basicConfig(stream=sys.stderr)

回答 9

在这些情况下,我的做法是在应用程序里放一些 log.debug() 消息。由于默认的日志级别是 WARNING,这类消息在正常执行时不会显示。

然后,在单元测试中,我将日志记录级别更改为DEBUG,以便在运行它们时显示此类消息。

import logging

log = logging.getLogger(__name__)

log.debug("Some messages to be shown just when debugging or unittesting")

在单元测试中:

# Set log level
loglevel = logging.DEBUG
logging.basicConfig(level=loglevel)



查看完整示例:

这是 daikiri.py,一个用名称和价格实现 Daikiri 的基本类。它有一个方法 make_discount(),会返回该 daikiri 在应用给定折扣后的价格:

import logging

log = logging.getLogger(__name__)

class Daikiri(object):
    def __init__(self, name, price):
        self.name = name
        self.price = price

    def make_discount(self, percentage):
        log.debug("Deducting discount...")  # I want to see this message
        return self.price * percentage

然后,我创建一个单元测试 test_daikiri.py 来检查它的用法:

import unittest
import logging
from .daikiri import Daikiri


class TestDaikiri(unittest.TestCase):
    def setUp(self):
        # Changing log level to DEBUG
        loglevel = logging.DEBUG
        logging.basicConfig(level=loglevel)

        self.mydaikiri = Daikiri("cuban", 25)

    def test_drop_price(self):
        new_price = self.mydaikiri.make_discount(0)
        self.assertEqual(new_price, 0)

if __name__ == "__main__":
    unittest.main()

所以当我执行它时,我得到log.debug消息:

$ python -m test_daikiri
DEBUG:daikiri:Deducting discount...
.
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK

What I do in these cases is to have a log.debug() with some messages in my application. Since the default logging level is WARNING, such messages don’t show in the normal execution.

Then, in the unittest I change the logging level to DEBUG, so that such messages are shown while running them.

import logging

log = logging.getLogger(__name__)

log.debug("Some messages to be shown just when debugging or unittesting")

In the unittests:

# Set log level
loglevel = logging.DEBUG
logging.basicConfig(level=loglevel)



See a full example:

This is daikiri.py, a basic class that implements a Daikiri with its name and price. There is method make_discount() that returns the price of that specific daikiri after applying a given discount:

import logging

log = logging.getLogger(__name__)

class Daikiri(object):
    def __init__(self, name, price):
        self.name = name
        self.price = price

    def make_discount(self, percentage):
        log.debug("Deducting discount...")  # I want to see this message
        return self.price * percentage

Then, I create a unittest test_daikiri.py that checks its usage:

import unittest
import logging
from .daikiri import Daikiri


class TestDaikiri(unittest.TestCase):
    def setUp(self):
        # Changing log level to DEBUG
        loglevel = logging.DEBUG
        logging.basicConfig(level=loglevel)

        self.mydaikiri = Daikiri("cuban", 25)

    def test_drop_price(self):
        new_price = self.mydaikiri.make_discount(0)
        self.assertEqual(new_price, 0)

if __name__ == "__main__":
    unittest.main()

So when I execute it I get the log.debug messages:

$ python -m test_daikiri
DEBUG:daikiri:Deducting discount...
.
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK

回答 10

inspect.trace 可以让您在异常抛出之后获取局部变量。然后,您可以用类似下面的装饰器包装单元测试,把这些局部变量保存下来,以便在事后分析时检查。

import random
import unittest
import inspect


def store_result(f):
    """
    Store the results of a test
    On success, store the return value.
    On failure, store the local variables where the exception was thrown.
    """
    def wrapped(self):
        if 'results' not in self.__dict__:
            self.results = {}
        # If a test throws an exception, store local variables in results:
        try:
            result = f(self)
        except Exception as e:
            self.results[f.__name__] = {'success':False, 'locals':inspect.trace()[-1][0].f_locals}
            raise e
        self.results[f.__name__] = {'success':True, 'result':result}
        return result
    return wrapped

def suite_results(suite):
    """
    Get all the results from a test suite
    """
    ans = {}
    for test in suite:
        if 'results' in test.__dict__:
            ans.update(test.results)
    return ans

# Example:
class TestSequenceFunctions(unittest.TestCase):

    def setUp(self):
        self.seq = range(10)

    @store_result
    def test_shuffle(self):
        # make sure the shuffled sequence does not lose any elements
        random.shuffle(self.seq)
        self.seq.sort()
        self.assertEqual(self.seq, range(10))
        # should raise an exception for an immutable sequence
        self.assertRaises(TypeError, random.shuffle, (1,2,3))
        return {1:2}

    @store_result
    def test_choice(self):
        element = random.choice(self.seq)
        self.assertTrue(element in self.seq)
        return {7:2}

    @store_result
    def test_sample(self):
        x = 799
        with self.assertRaises(ValueError):
            random.sample(self.seq, 20)
        for element in random.sample(self.seq, 5):
            self.assertTrue(element in self.seq)
        return {1:99999}


suite = unittest.TestLoader().loadTestsFromTestCase(TestSequenceFunctions)
unittest.TextTestRunner(verbosity=2).run(suite)

from pprint import pprint
pprint(suite_results(suite))

最后一行会打印测试成功时返回的值,以及测试失败时的局部变量(本例中是 x):

{'test_choice': {'result': {7: 2}, 'success': True},
 'test_sample': {'locals': {'self': <__main__.TestSequenceFunctions testMethod=test_sample>,
                            'x': 799},
                 'success': False},
 'test_shuffle': {'result': {1: 2}, 'success': True}}

Har det gøy :-)

inspect.trace will let you get local variables after an exception has been thrown. You can then wrap the unit tests with a decorator like the following one to save off those local variables for examination during the post mortem.

import random
import unittest
import inspect


def store_result(f):
    """
    Store the results of a test
    On success, store the return value.
    On failure, store the local variables where the exception was thrown.
    """
    def wrapped(self):
        if 'results' not in self.__dict__:
            self.results = {}
        # If a test throws an exception, store local variables in results:
        try:
            result = f(self)
        except Exception as e:
            self.results[f.__name__] = {'success':False, 'locals':inspect.trace()[-1][0].f_locals}
            raise e
        self.results[f.__name__] = {'success':True, 'result':result}
        return result
    return wrapped

def suite_results(suite):
    """
    Get all the results from a test suite
    """
    ans = {}
    for test in suite:
        if 'results' in test.__dict__:
            ans.update(test.results)
    return ans

# Example:
class TestSequenceFunctions(unittest.TestCase):

    def setUp(self):
        self.seq = range(10)

    @store_result
    def test_shuffle(self):
        # make sure the shuffled sequence does not lose any elements
        random.shuffle(self.seq)
        self.seq.sort()
        self.assertEqual(self.seq, range(10))
        # should raise an exception for an immutable sequence
        self.assertRaises(TypeError, random.shuffle, (1,2,3))
        return {1:2}

    @store_result
    def test_choice(self):
        element = random.choice(self.seq)
        self.assertTrue(element in self.seq)
        return {7:2}

    @store_result
    def test_sample(self):
        x = 799
        with self.assertRaises(ValueError):
            random.sample(self.seq, 20)
        for element in random.sample(self.seq, 5):
            self.assertTrue(element in self.seq)
        return {1:99999}


suite = unittest.TestLoader().loadTestsFromTestCase(TestSequenceFunctions)
unittest.TextTestRunner(verbosity=2).run(suite)

from pprint import pprint
pprint(suite_results(suite))

The last line will print the returned values where the test succeeded and the local variables, in this case x, when it fails:

{'test_choice': {'result': {7: 2}, 'success': True},
 'test_sample': {'locals': {'self': <__main__.TestSequenceFunctions testMethod=test_sample>,
                            'x': 799},
                 'success': False},
 'test_shuffle': {'result': {1: 2}, 'success': True}}

Har det gøy :-)


回答 11

捕获断言失败所产生的异常怎么样?在捕获异常的代码块里,您可以把数据以任何想要的方式输出到任何地方。处理完之后再重新抛出该异常。测试运行器多半察觉不到区别。

免责声明:我还没有在 Python 的单元测试框架里试过这种做法,但在其他单元测试框架里试过。

How about catching the exception that gets generated from the assertion failure? In your catch block you could output the data however you wanted to wherever. Then when you were done you could re-throw the exception. The test runner probably wouldn’t know the difference.

Disclaimer: I haven’t tried this with python’s unit test framework but have with other unit test frameworks.
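
这个思路放到 Python 的 unittest 里大致是这样(断言失败抛出的是 AssertionError,数据纯属示意):

import unittest

class TestBar(unittest.TestCase):
    def test_bar(self):
        t1, t2 = 'baba', 2
        try:
            self.assertEqual(len(t1), t2)
        except AssertionError:
            # 在重新抛出之前,把想检查的数据输出到任何地方
            print('t1=%r, t2=%r' % (t1, t2))
            raise

if __name__ == '__main__':
    unittest.main()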


回答 12

虽然我承认还没有尝试过,但 testfixtures 的日志记录功能看起来非常有用……

Admitting that I haven’t tried it, the testfixtures’ logging feature looks quite useful…


回答 13

扩展 @F.C. 的答案,下面这段对我来说效果很好:

class MyTest(unittest.TestCase):
    def messenger(self, message):
        try:
            self.assertEqual(1, 2, msg=message)
        except AssertionError as e:      
            print "\nMESSENGER OUTPUT: %s" % str(e),

Expanding @F.C. ‘s answer, this works quite well for me:

class MyTest(unittest.TestCase):
    def messenger(self, message):
        try:
            self.assertEqual(1, 2, msg=message)
        except AssertionError as e:      
            print "\nMESSENGER OUTPUT: %s" % str(e),

如何引发ValueError?

问题:如何引发ValueError?

我有一段代码,可以找到特定字符在字符串中出现的最大索引,但是当指定的字符没有出现在字符串中时,我希望它引发一个 ValueError。

所以像这样:

contains('bababa', 'k')

会导致:

ValueError: could not find k in bababa

我怎样才能做到这一点?

这是我的函数的当前代码:

def contains(string,char):
  list = []

  for i in range(0,len(string)):
      if string[i] == char:
           list = list + [i]

  return list[-1]

I have this code which finds the largest index of a specific character in a string, however I would like it to raise a ValueError when the specified character does not occur in a string.

So something like this:

contains('bababa', 'k')

would result in a:

ValueError: could not find k in bababa

How can I do this?

Here’s the current code for my function:

def contains(string,char):
  list = []

  for i in range(0,len(string)):
      if string[i] == char:
           list = list + [i]

  return list[-1]

回答 0

raise ValueError('could not find %c in %s' % (ch,str))

raise ValueError('could not find %c in %s' % (ch,str))


回答 1

这是您代码的一个修订版本,它仍然可以正常工作,同时也演示了如何按您想要的方式引发 ValueError。顺便说一句,我认为 find_last() 或 find_last_index() 之类的名字对这个函数更有描述性。更容易造成混淆的一点是,Python 已经有一个名为 __contains__() 的容器对象方法,它做的事情(成员资格测试)略有不同。

def contains(char_string, char):
    largest_index = -1
    for i, ch in enumerate(char_string):
        if ch == char:
            largest_index = i
    if largest_index > -1:  # any found?
        return largest_index  # return index of last one
    else:
        raise ValueError('could not find {!r} in {!r}'.format(char, char_string))

print(contains('mississippi', 's'))  # -> 6
print(contains('bababa', 'k'))  # ->
Traceback (most recent call last):
  File "how-to-raise-a-valueerror.py", line 15, in <module>
    print(contains('bababa', 'k'))
  File "how-to-raise-a-valueerror.py", line 12, in contains
    raise ValueError('could not find {} in {}'.format(char, char_string))
ValueError: could not find 'k' in 'bababa'

更新:一种简单得多的方法

哇!这是一个简洁得多的版本(本质上就是一行代码),而且可能也更快,因为它先把字符串反转(通过 [::-1]),再用快速的内置字符串方法 index() 在反转后的字符串中正向查找第一个匹配的字符。就您的实际问题而言,使用 index() 还附带一个小小的便利:当找不到该字符时它本身就会引发 ValueError,因此不需要再做任何额外的事情。

下面是代码,连同一个快速的单元测试:

def contains(char_string, char):
    #  Ending - 1 adjusts returned index to account for searching in reverse.
    return len(char_string) - char_string[::-1].index(char) - 1

print(contains('mississippi', 's'))  # -> 6
print(contains('bababa', 'k'))  # ->
Traceback (most recent call last):
  File "better-way-to-raise-a-valueerror.py", line 9, in <module>
    print(contains('bababa', 'k'))
  File "better-way-to-raise-a-valueerror", line 6, in contains
    return len(char_string) - char_string[::-1].index(char) - 1
ValueError: substring not found

Here’s a revised version of your code which still works plus it illustrates how to raise a ValueError the way you want. By-the-way, I think find_last(), find_last_index(), or something simlar would be a more descriptive name for this function. Adding to the possible confusion is the fact that Python already has a container object method named __contains__() that does something a little different, membership-testing-wise.

def contains(char_string, char):
    largest_index = -1
    for i, ch in enumerate(char_string):
        if ch == char:
            largest_index = i
    if largest_index > -1:  # any found?
        return largest_index  # return index of last one
    else:
        raise ValueError('could not find {!r} in {!r}'.format(char, char_string))

print(contains('mississippi', 's'))  # -> 6
print(contains('bababa', 'k'))  # ->
Traceback (most recent call last):
  File "how-to-raise-a-valueerror.py", line 15, in <module>
    print(contains('bababa', 'k'))
  File "how-to-raise-a-valueerror.py", line 12, in contains
    raise ValueError('could not find {} in {}'.format(char, char_string))
ValueError: could not find 'k' in 'bababa'

Update — A substantially simpler way

Wow! Here’s a much more concise version—essentially a one-liner—that is also likely faster because it reverses (via [::-1]) the string before doing a forward search through it for the first matching character and it does so using the fast built-in string index() method. With respect to your actual question, a nice little bonus convenience that comes with using index() is that it already raises a ValueError when the character substring isn’t found, so nothing additional is required to make that happen.

Here it is along with a quick unit test:

def contains(char_string, char):
    #  Ending - 1 adjusts returned index to account for searching in reverse.
    return len(char_string) - char_string[::-1].index(char) - 1

print(contains('mississippi', 's'))  # -> 6
print(contains('bababa', 'k'))  # ->
Traceback (most recent call last):
  File "better-way-to-raise-a-valueerror.py", line 9, in <module>
    print(contains('bababa', 'k'))
  File "better-way-to-raise-a-valueerror", line 6, in contains
    return len(char_string) - char_string[::-1].index(char) - 1
ValueError: substring not found

回答 2

>>> def contains(string, char):
...     for i in xrange(len(string) - 1, -1, -1):
...         if string[i] == char:
...             return i
...     raise ValueError("could not find %r in %r" % (char, string))
...
>>> contains('bababa', 'k')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in contains
ValueError: could not find 'k' in 'bababa'
>>> contains('bababa', 'a')
5
>>> contains('bababa', 'b')
4
>>> contains('xbababa', 'x')
0
>>>
>>> def contains(string, char):
...     for i in xrange(len(string) - 1, -1, -1):
...         if string[i] == char:
...             return i
...     raise ValueError("could not find %r in %r" % (char, string))
...
>>> contains('bababa', 'k')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in contains
ValueError: could not find 'k' in 'bababa'
>>> contains('bababa', 'a')
5
>>> contains('bababa', 'b')
4
>>> contains('xbababa', 'x')
0
>>>

回答 3

>>> response = 'bababa'
>>> if "k" not in response:
...     raise ValueError("Not found")
>>> response = 'bababa'
>>> if "k" not in response:
...     raise ValueError("Not found")