Tag Archives: deep-learning

Best way to save a trained model in PyTorch?

Question: Best way to save a trained model in PyTorch?

I was looking for alternative ways to save a trained model in PyTorch. So far, I have found two alternatives.

  1. torch.save() to save a model and torch.load() to load a model.
  2. model.state_dict() to save a trained model and model.load_state_dict() to load the saved model.

I have come across this discussion where approach 2 is recommended over approach 1.

My question is, why is the second approach preferred? Is it only because torch.nn modules have those two functions and we are encouraged to use them?


Answer 0

I’ve found this page on their GitHub repo; I’ll just paste the content here.


Recommended approach for saving a model

There are two main approaches for serializing and restoring a model.

The first (recommended) saves and loads only the model parameters:

torch.save(the_model.state_dict(), PATH)

Then later:

the_model = TheModelClass(*args, **kwargs)
the_model.load_state_dict(torch.load(PATH))

The second saves and loads the entire model:

torch.save(the_model, PATH)

Then later:

the_model = torch.load(PATH)

However, in this case, the serialized data is bound to the specific classes and the exact directory structure used, so it can break in various ways when used in other projects, or after some serious refactors.
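
A minimal, self-contained sketch of the two approaches (the class name and file paths here are illustrative, not from the original post):

import torch
import torch.nn as nn

# Toy model, standing in for TheModelClass above.
class TheModelClass(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

model = TheModelClass()

# First approach (recommended): save/load only the parameters.
torch.save(model.state_dict(), "params.pth")
restored = TheModelClass()  # the class definition must be available
restored.load_state_dict(torch.load("params.pth"))

# Second approach: pickle the entire model object.
torch.save(model, "whole_model.pth")
restored2 = torch.load("whole_model.pth")  # tied to the class and directory structure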


Answer 1

It depends on what you want to do.

Case # 1: Save the model to use it yourself for inference: You save the model, you restore it, and then you change the model to evaluation mode. This is done because you usually have BatchNorm and Dropout layers that by default are in train mode on construction:

torch.save(model.state_dict(), filepath)

#Later to restore:
model.load_state_dict(torch.load(filepath))
model.eval()

Case # 2: Save model to resume training later: If you need to keep training the model that you are about to save, you need to save more than just the model. You also need to save the state of the optimizer, epochs, score, etc. You would do it like this:

state = {
    'epoch': epoch,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    ...
}
torch.save(state, filepath)

To resume training you would do things like: state = torch.load(filepath), and then, to restore the state of each individual object, something like this:

model.load_state_dict(state['state_dict'])
optimizer.load_state_dict(state['optimizer'])

Since you are resuming training, DO NOT call model.eval() once you restore the states when loading.
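
A minimal sketch of the corresponding restore-and-resume side, assuming the dict keys from the snippet above (the toy model and checkpoint path are illustrative):

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(5, 2)  # illustrative toy model; use your real architecture
optimizer = optim.SGD(model.parameters(), lr=0.01)

state = torch.load('checkpoint.pth')  # the dict saved with torch.save(state, filepath)
model.load_state_dict(state['state_dict'])
optimizer.load_state_dict(state['optimizer'])
start_epoch = state['epoch'] + 1  # pick up where training stopped

model.train()  # keep training mode; do not call model.eval() here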

Case # 3: Model to be used by someone else with no access to your code: In TensorFlow you can create a .pb file that defines both the architecture and the weights of the model. This is very handy, especially when using TensorFlow Serving. The equivalent way to do this in PyTorch would be:

torch.save(model, filepath)

# Then later:
model = torch.load(filepath)

This approach is still not bulletproof, and since PyTorch is still undergoing a lot of changes, I wouldn’t recommend it.


Answer 2

The pickle Python library implements binary protocols for serializing and de-serializing a Python object.

When you import torch (or when you use PyTorch) it will import pickle for you and you don’t need to call pickle.dump() and pickle.load() directly, which are the methods to save and to load the object.

In fact, torch.save() and torch.load() will wrap pickle.dump() and pickle.load() for you.

The state_dict the other answer mentioned deserves just a few more notes.

What state_dict do we have inside PyTorch? There are actually two state_dicts.

A PyTorch model is a torch.nn.Module; it has a model.parameters() call to get the learnable parameters (w and b). These learnable parameters, once randomly set, will update over time as we learn. Learnable parameters are the first state_dict.

The second state_dict is the optimizer state dict. You recall that the optimizer is used to improve our learnable parameters. But the optimizer state_dict is fixed; there is nothing to learn in there.

Because state_dict objects are Python dictionaries, they can be easily saved, updated, altered, and restored, adding a great deal of modularity to PyTorch models and optimizers.

Let’s create a super simple model to explain this:

import torch
import torch.optim as optim

model = torch.nn.Linear(5, 2)

# Initialize optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

print("Model's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

print("Model weight:")    
print(model.weight)

print("Model bias:")    
print(model.bias)

print("---")
print("Optimizer's state_dict:")
for var_name in optimizer.state_dict():
    print(var_name, "\t", optimizer.state_dict()[var_name])

This code will output the following:

Model's state_dict:
weight   torch.Size([2, 5])
bias     torch.Size([2])
Model weight:
Parameter containing:
tensor([[ 0.1328,  0.1360,  0.1553, -0.1838, -0.0316],
        [ 0.0479,  0.1760,  0.1712,  0.2244,  0.1408]], requires_grad=True)
Model bias:
Parameter containing:
tensor([ 0.4112, -0.0733], requires_grad=True)
---
Optimizer's state_dict:
state    {}
param_groups     [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [140695321443856, 140695321443928]}]

Note that this is a minimal model. You may try adding a stack of sequential layers:

model = torch.nn.Sequential(
          torch.nn.Linear(D_in, H),
          torch.nn.Conv2d(A, B, C),  # D_in, H, A, B, C, D_out are placeholder sizes
          torch.nn.Linear(H, D_out),
        )

Note that only layers with learnable parameters (convolutional layers, linear layers, etc.) and registered buffers (batchnorm layers) have entries in the model’s state_dict.

Non-learnable things belong to the optimizer object’s state_dict, which contains information about the optimizer’s state, as well as the hyperparameters used.

The rest of the story is the same. In the inference phase (the phase when we use the model after training to make predictions), we predict based on the parameters we learned. So for inference, we just need to save the parameters model.state_dict().

torch.save(model.state_dict(), filepath)

And to use it later: model.load_state_dict(torch.load(filepath)) followed by model.eval().

Note: don’t forget the last line, model.eval(); it is crucial after loading the model.

Also, don’t try to save with torch.save(model.parameters(), filepath); model.parameters() is just a generator object.

On the other hand, torch.save(model, filepath) saves the model object itself, but keep in mind the model does not include the optimizer’s state_dict. Check the other excellent answer by @Jadiel de Armas for saving the optimizer’s state dict.


Answer 3

A common PyTorch convention is to save models using either a .pt or .pth file extension.

Save/Load Entire Model Save:

path = "username/directory/lstmmodelgpu.pth"
torch.save(trainer, path)

Load:

# Model class must be defined somewhere

model = torch.load(PATH)
model.eval()

Answer 4

If you want to save the model and resume training later:

Single GPU: Save:

state = {
        'epoch': epoch,
        'state_dict': model.state_dict(),
        'optimizer': optimizer.state_dict(),
}
savepath='checkpoint.t7'
torch.save(state,savepath)

Load:

checkpoint = torch.load('checkpoint.t7')
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])
epoch = checkpoint['epoch']

Multiple GPU: Save

state = {
        'epoch': epoch,
        'state_dict': model.module.state_dict(),
        'optimizer': optimizer.state_dict(),
}
savepath='checkpoint.t7'
torch.save(state,savepath)

Load:

checkpoint = torch.load('checkpoint.t7')
model.load_state_dict(checkpoint['state_dict'])
optimizer.load_state_dict(checkpoint['optimizer'])
epoch = checkpoint['epoch']

#Don't call DataParallel before loading the model otherwise you will get an error

model = nn.DataParallel(model) #ignore the line if you want to load on Single GPU
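
One related wrinkle worth noting: the multi-GPU snippet above saves model.module.state_dict(), so its keys load directly into an unwrapped model. If a checkpoint was instead saved from the nn.DataParallel wrapper itself, its keys carry a 'module.' prefix; a hedged sketch for stripping it (variable names are illustrative):

import torch

checkpoint = torch.load('checkpoint.t7')
state_dict = checkpoint['state_dict']

# Remove the 'module.' prefix that nn.DataParallel adds to parameter names.
state_dict = {(k[len('module.'):] if k.startswith('module.') else k): v
              for k, v in state_dict.items()}
model.load_state_dict(state_dict)  # model here is the plain, unwrapped module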

AutoKeras: an AutoML library for deep learning


Official website: autokeras.com

AutoKeras: an AutoML system based on Keras. It is developed by the DATA Lab at Texas A&M University. The goal of AutoKeras is to make machine learning accessible to everyone.

Learning resources

  • A short example:

import autokeras as ak

clf = ak.ImageClassifier()
clf.fit(x_train, y_train)
results = clf.predict(x_test)
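
A slightly fuller usage sketch (not from the official README), assuming AutoKeras 1.x, with MNIST as stand-in data and illustrative max_trials/epochs values:

import autokeras as ak
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Try at most one candidate architecture and train briefly (illustrative settings).
clf = ak.ImageClassifier(max_trials=1)
clf.fit(x_train, y_train, epochs=1)
print(clf.evaluate(x_test, y_test))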


Installation

To install the package, please use pip as follows:

pip3 install autokeras

Please follow the installation guide for more details.

Note: currently, AutoKeras is only compatible with Python >= 3.5 and TensorFlow >= 2.3.0.

Community

Stay up to date:

Twitter: you can also follow us on Twitter @autokeras for the latest news.

Email: subscribe to our email list to receive announcements.

Questions and discussions

GitHub Discussions: please ask your questions on our GitHub Discussions, a forum hosted on GitHub. We will monitor and answer the questions there.

Instant communication

Slack: request an invitation, and use the #autokeras channel for communication.

QQ group: join our QQ group 1150366085. Password: akqqgroup.

Online meetings: join the online meeting Google group; the calendar events will appear on your Google Calendar.

Contributing code

We are committed to keeping everything about AutoKeras open to the public. Everyone can easily join as a developer. Here is how we manage the project:

  • Triage the issues: we pick the critical issues to work on from GitHub issues. They are added to this Project, and some of them are then added to the milestones used to plan releases.
  • Assign the tasks: we assign the tasks to people during the online meetings.
  • Discuss: we have discussions in multiple places. Code reviews happen on GitHub; questions can be asked on Slack or during the meetings.

Please join our Slack and send Haifeng Jin a message, or drop by our online meetings and talk to us. We will help you get started!

Refer to our Contributing Guide to learn the best practices.

Thank all the contributors!

Donations

We accept financial support on Open Collective. Thank every backer for supporting us!


Cite this work

Haifeng Jin, Qingquan Song, and Xia Hu. "Auto-Keras: An Efficient Neural Architecture Search System." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2019. (Download)

BibLaTeX entry:

@inproceedings{jin2019auto,
  title={Auto-Keras: An Efficient Neural Architecture Search System},
  author={Jin, Haifeng and Song, Qingquan and Hu, Xia},
  booktitle={Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
  pages={1946--1956},
  year={2019},
  organization={ACM}
}

Acknowledgements

The authors gratefully acknowledge the Defense Advanced Research Projects Agency (DARPA) D3M program administered through AFRL contract FA8750-17-2-0116, the Texas A&M College of Engineering, and Texas A&M University.

numerical-linear-algebra: a free online Jupyter-notebook textbook for the fast.ai Computational Linear Algebra course

Computational Linear Algebra for Coders

This course is focused on the question: how do we do matrix computations with acceptable speed and acceptable accuracy?

This course was taught in the University of San Francisco’s Masters of Science in Analytics program, summer 2017 (for graduate students studying to become data scientists). The course is taught in Python with Jupyter notebooks, using libraries such as scikit-learn and NumPy for most lessons, as well as Numba (a library that compiles Python to C for faster performance) and PyTorch (an alternative to NumPy for the GPU) in a few lessons.

Accompanying the notebooks is a playlist of lecture videos, available on YouTube. If you are ever confused by a lecture or it goes too quickly, check out the beginning of the next video, where I review concepts from the previous lecture, often explaining things from a new perspective or with different illustrations, and answer questions.

Getting help

You can ask questions or share your thoughts and resources in the Computational Linear Algebra category on our fast.ai discussion forums.

Table of contents

The listing below links to the notebooks in this repository, rendered through the nbviewer service. Topics covered:

0. Course Logistics (Video 1)

1. Why are we here? (Video 1)

We start with a high-level overview of some foundational concepts in numerical linear algebra.

2. Topic Modeling with NMF and SVD (Video 2 and Video 3)

We will use the newsgroups dataset to try to identify the topics of different posts. We use a term-document matrix that represents the frequency of the vocabulary in the documents. We factor it using NMF, and then with singular value decomposition (SVD).
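
A rough sketch of that pipeline using scikit-learn (this is not the course's own code; the hyperparameters are illustrative and a recent scikit-learn is assumed):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF, TruncatedSVD

newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
vectorizer = TfidfVectorizer(stop_words='english', max_features=2000)
vectors = vectorizer.fit_transform(newsgroups.data)  # term-document matrix

# Factor the matrix two ways: NMF and truncated SVD.
nmf = NMF(n_components=5).fit(vectors)
svd = TruncatedSVD(n_components=5).fit(vectors)

# Print the top words of each NMF "topic".
words = vectorizer.get_feature_names_out()
for topic in nmf.components_:
    print([words[i] for i in topic.argsort()[-8:]])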

3. Background Removal with Robust PCA (Video 3, Video 4, and Video 5)

Another application of SVD is to identify the people and remove the background of a surveillance video. We will cover robust PCA, which uses randomized SVD; randomized SVD in turn uses LU factorization.

4. Compressed Sensing with Robust Regression (Video 6 and Video 7)

Compressed sensing is critical to allowing CT scans with lower radiation: the image can be reconstructed with less data. Here we will learn the technique and apply it to CT images.

5. Predicting Health Outcomes with Linear Regressions (Video 8)

6. How to Implement Linear Regression (Video 8)

7. PageRank with Eigen Decompositions (Video 9 and Video 10)

We have applied SVD to topic modeling, background removal, and linear regression. SVD is intimately connected to the eigen decomposition, so we will now learn how to calculate eigenvalues for a large matrix. We will use DBpedia data, a large dataset of Wikipedia links, because here the principal eigenvector gives the relative importance of different Wikipedia pages (this is the basic idea of Google's PageRank algorithm). We will look at three different methods for calculating eigenvectors, of increasing complexity (and increasing usefulness!).

8. Implementing QR Factorization (Video 10)


Why is this course taught in such a weird order?

This course is structured with a top-down teaching method, which is different from how most math courses operate. Typically, in a bottom-up approach, you first learn all the separate components you will be using, and then you gradually build them up into more complex structures. The problem with this is that students often lose motivation, don't have a sense of the "big picture", and don't know what they will need.

Harvard professor David Perkins has a book, Making Learning Whole, in which he uses baseball as an analogy. We don't require kids to memorize all the rules of baseball and understand all the technical details before we let them play the game. Rather, they start playing with a general sense of it, and then gradually learn more rules/details as time goes on.

If you took the fast.ai deep learning course, that is what we used. You can hear more about my teaching philosophy in this blog post or this talk I gave at the San Francisco Machine Learning meetup.

All that to say: don't worry if you don't understand everything at the beginning! You're not supposed to. We will start using some "black boxes" or matrix decompositions that haven't yet been explained, and then we'll dig into the lower-level details later.

To start, focus on what things do, not what they are.

TensorLayer: a deep learning and reinforcement learning library for scientists and engineers 🔥

TensorLayer is a deep learning and reinforcement learning library based on TensorFlow, designed for researchers and engineers. It provides a large collection of customizable neural layers to build advanced AI models quickly, on top of which the community has open-sourced a mass of tutorials and applications. TensorLayer won the Best Open Source Software Award from the ACM Multimedia Society in 2017. This project can also be found on iHub and Gitee.

News

🔥 3.0.0 will support multiple backends, such as TensorFlow, MindSpore, and PaddlePaddle, allowing users to run the code on different hardware like NVIDIA GPUs and Huawei Ascend. We need more people to join the dev team; if you are interested, please email hao.dong@pku.edu.cn

🔥 Reinforcement learning zoo: low-level APIs for professional usage, high-level APIs for simple usage, and a corresponding Springer textbook

🔥 Sipeed Maix-EMC: run TensorLayer models on a low-cost AI chip (e.g., K210) (alpha version)

Design features

TensorLayer is a brand-new deep learning library designed with simplicity, flexibility, and high performance in mind.

  • Simplicity: TensorLayer has high-level layer/model abstractions that are easy to learn. You can learn, in minutes, how deep learning can benefit your AI tasks, through the massive examples.
  • Flexibility: TensorLayer APIs are transparent and flexible, inspired by the emerging PyTorch library. Compared with the Keras abstraction, TensorLayer makes it much easier to build and train complex AI models.
  • Zero-cost abstraction: while being simple to use, TensorLayer does not require you to make any compromise in the performance of TensorFlow (check the following benchmark section for more details).

TensorLayer stands at a unique spot among the TensorFlow wrappers. Other wrappers, like Keras and TFLearn, hide many of TensorFlow's powerful features and provide little support for writing custom AI models. Inspired by PyTorch, the TensorLayer API is simple, flexible, and Pythonic, making it easy to learn while being flexible enough to cope with complex AI tasks. TensorLayer has a fast-growing community. It has been used by researchers and engineers all over the world, including from Peking University, Imperial College London, UC Berkeley, Carnegie Mellon University, and Stanford University, as well as companies such as Google, Microsoft, Alibaba, Tencent, Xiaomi, and Bloomberg.

Multilingual documentation

TensorLayer has extensive documentation for both beginners and professionals. The documentation is available in both English and Chinese.

English Documentation
Chinese Documentation
Chinese Book

If you want to try the experimental features on the master branch, you can find the latest documentation here.

Extensive examples

A large number of examples that use TensorLayer can be found here, as well as in the following places:

Getting started

TensorLayer 2.0 relies on TensorFlow, NumPy, and more. To use GPUs, CUDA and cuDNN are required.

Install TensorFlow:

pip3 install tensorflow-gpu==2.0.0-rc1 # TensorFlow GPU (version 2.0 RC1)
pip3 install tensorflow # CPU version

Install the stable release of TensorLayer:

pip3 install tensorlayer

Install the unstable development version of TensorLayer:

pip3 install git+https://github.com/tensorlayer/tensorlayer.git

If you want to install the additional dependencies, you can also run:

pip3 install --upgrade tensorlayer[all]              # all additional dependencies
pip3 install --upgrade tensorlayer[extra]            # only the `extra` dependencies
pip3 install --upgrade tensorlayer[contrib_loggers]  # only the `contrib_loggers` dependencies

If you are a TensorFlow 1.X user, you can use TensorLayer 1.11.0:

# For last stable version of TensorLayer 1.X
pip3 install --upgrade tensorlayer==1.11.0
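
Once installed, a minimal model might look like the following sketch of the TensorLayer 2.x functional API (the layer sizes are illustrative, and the exact API is version-dependent, so treat this as an assumption rather than canonical usage):

import tensorflow as tf
import tensorlayer as tl

# A tiny MLP for 784-dimensional inputs (e.g. flattened MNIST), built functionally.
ni = tl.layers.Input([None, 784])
nn = tl.layers.Dense(n_units=800, act=tf.nn.relu)(ni)
nn = tl.layers.Dense(n_units=10)(nn)
net = tl.models.Model(inputs=ni, outputs=nn)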

Performance benchmark

The following table shows the training speeds of VGG16 using TensorLayer and native TensorFlow on a Titan Xp.

Mode      | Lib             | Data format  | Max GPU memory (MB) | Max CPU memory (MB) | Avg CPU memory (MB) | Runtime (sec)
AutoGraph | TensorFlow 2.0  | channel last | 11833               | 2161                | 2136                | 74
AutoGraph | TensorLayer 2.0 | channel last | 11833               | 2187                | 2169                | 76
Graph     | Keras           | channel last | 8677                | 2580                | 2576                | 101
Eager     | TensorFlow 2.0  | channel last | 8723                | 2052                | 2024                | 97
Eager     | TensorLayer 2.0 | channel last | 8723                | 2010                | 2007                | 95

Getting involved

Please read the Contributor Guideline before submitting your PRs.

We suggest users report bugs using GitHub issues. Users can also discuss how to use TensorLayer in the following Slack channel.

Citing TensorLayer

If you find TensorLayer useful for your project, please cite the following papers:

@article{tensorlayer2017,
    author  = {Dong, Hao and Supratak, Akara and Mai, Luo and Liu, Fangde and Oehmichen, Axel and Yu, Simiao and Guo, Yike},
    journal = {ACM Multimedia},
    title   = {{TensorLayer: A Versatile Library for Efficient Deep Learning Development}},
    url     = {http://tensorlayer.org},
    year    = {2017}
}

@inproceedings{tensorlayer2021,
  title={Tensorlayer 3.0: A Deep Learning Library Compatible With Multiple Backends},
  author={Lai, Cheng and Han, Jiarong and Dong, Hao},
  booktitle={2021 IEEE International Conference on Multimedia \& Expo Workshops (ICMEW)},
  pages={1--3},
  year={2021},
  organization={IEEE}
}

Understanding Keras LSTMs

Question: Understanding Keras LSTMs

I am trying to reconcile my understanding of LSTMs with the ideas pointed out in this post by Christopher Olah, as implemented in Keras. I am following the blog written by Jason Brownlee for the Keras tutorial. What I am mainly confused about is:

  1. The reshaping of the data series into [samples, time steps, features] and,
  2. The stateful LSTMs

Let’s concentrate on the above two questions with reference to the code pasted below:

# reshape into X=t and Y=t+1
look_back = 3
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)

# reshape input to be [samples, time steps, features]
trainX = numpy.reshape(trainX, (trainX.shape[0], look_back, 1))
testX = numpy.reshape(testX, (testX.shape[0], look_back, 1))
########################
# The IMPORTANT BIT
##########################
# create and fit the LSTM network
batch_size = 1
model = Sequential()
model.add(LSTM(4, batch_input_shape=(batch_size, look_back, 1), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
for i in range(100):
    model.fit(trainX, trainY, nb_epoch=1, batch_size=batch_size, verbose=2, shuffle=False)
    model.reset_states()

Note: create_dataset takes a sequence of length N and returns an (N - look_back) array, each element of which is a sequence of length look_back.

What is Time Steps and Features?

As can be seen, trainX is a 3-D array with Time_steps and Feature being the last two dimensions respectively (3 and 1 in this particular code). With respect to the image below, does this mean that we are considering the many to one case, where the number of pink boxes is 3? Or does it literally mean the chain length is 3 (i.e. only 3 green boxes are considered)?

Does the features argument become relevant when we consider multivariate series? e.g. modelling two financial stocks simultaneously?

Stateful LSTMs

Do stateful LSTMs mean that we save the cell memory values between runs of batches? If this is the case, batch_size is one, and the memory is reset between the training runs, so what was the point of saying that it was stateful? I’m guessing this is related to the fact that training data is not shuffled, but I’m not sure how.

Any thoughts? Image reference: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Edit 1:

A bit confused about @van’s comment about the red and green boxes being equal. So just to confirm, do the following API calls correspond to the unrolled diagrams? Especially noting the second diagram (batch_size was arbitrarily chosen):

Edit 2:

For people who have done Udacity’s deep learning course and are still confused about the time_step argument, look at the following discussion: https://discussions.udacity.com/t/rnn-lstm-use-implementation/163169

Update:

It turns out model.add(TimeDistributed(Dense(vocab_len))) was what I was looking for. Here is an example: https://github.com/sachinruk/ShakespeareBot

Update2:

I have summarised most of my understanding of LSTMs here: https://www.youtube.com/watch?v=ywinX5wgdEU


Answer 0

First of all, you chose great tutorials (1, 2) to start.

What Time-step means: Time-steps==3 in X.shape (describing the data shape) means there are three pink boxes. Since in Keras each step requires an input, the number of green boxes should usually equal the number of red boxes, unless you hack the structure.

many to many vs. many to one: In Keras, there is a return_sequences parameter when you initialize LSTM or GRU or SimpleRNN. When return_sequences is False (the default), then it is many to one, as shown in the picture. Its return shape is (batch_size, hidden_unit_length), which represents the last state. When return_sequences is True, then it is many to many. Its return shape is (batch_size, time_step, hidden_unit_length). A sketch after this answer makes the shapes concrete.

Does the features argument become relevant: the feature argument means “how big is your red box”, or what the input dimension is at each step. If you want to predict from, say, 8 kinds of market information, then you can generate your data with feature==8.

Stateful: you can look up the source code. When initializing the state, if stateful==True, the state from the last training batch will be used as the initial state; otherwise a new state is generated. I haven’t turned on stateful yet. However, I disagree that batch_size can only be 1 when stateful==True.

Currently, you generate your data from collected data. Imagine your stock information is coming in as a stream: rather than waiting a day to collect everything sequentially, you would like to generate input data online while training/predicting with the network. If you have 400 stocks sharing the same network, then you can set batch_size==400.
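
A small sketch that makes those return shapes concrete (TensorFlow 2.x Keras; the sizes mirror the question's look_back=3, features=1 setup, with 4 hidden units):

from tensorflow.keras.layers import Input, LSTM

inputs = Input(shape=(3, 1))  # (time_steps=3, features=1)

# return_sequences=False (default): many to one, shape (batch_size, hidden_units)
print(LSTM(4)(inputs).shape)                          # (None, 4)

# return_sequences=True: many to many, shape (batch_size, time_steps, hidden_units)
print(LSTM(4, return_sequences=True)(inputs).shape)   # (None, 3, 4)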


Answer 1

As a complement to the accepted answer, this answer shows Keras behaviors and how to achieve each pictured case.

General Keras behavior

The standard keras internal processing is always a many to many as in the following picture (where I used features=2, pressure and temperature, just as an example):

ManyToMany

In this image, I increased the number of steps to 5, to avoid confusion with the other dimensions.

For this example:

  • We have N oil tanks
  • We spent 5 hours taking measures hourly (time steps)
  • We measured two features:
    • Pressure P
    • Temperature T

Our input array should then be something shaped as (N,5,2):

        [     Step1      Step2      Step3      Step4      Step5
Tank A:    [[Pa1,Ta1], [Pa2,Ta2], [Pa3,Ta3], [Pa4,Ta4], [Pa5,Ta5]],
Tank B:    [[Pb1,Tb1], [Pb2,Tb2], [Pb3,Tb3], [Pb4,Tb4], [Pb5,Tb5]],
  ....
Tank N:    [[Pn1,Tn1], [Pn2,Tn2], [Pn3,Tn3], [Pn4,Tn4], [Pn5,Tn5]],
        ]

Inputs for sliding windows

Often, LSTM layers are supposed to process entire sequences. Dividing them into windows may not be the best idea. The layer has internal states about how a sequence is evolving as it steps forward. Windows eliminate the possibility of learning long sequences, limiting all sequences to the window size.

In windows, each window is part of a long original sequence, but Keras will see each of them as an independent sequence:

        [     Step1    Step2    Step3    Step4    Step5
Window  A:  [[P1,T1], [P2,T2], [P3,T3], [P4,T4], [P5,T5]],
Window  B:  [[P2,T2], [P3,T3], [P4,T4], [P5,T5], [P6,T6]],
Window  C:  [[P3,T3], [P4,T4], [P5,T5], [P6,T6], [P7,T7]],
  ....
        ]

Notice that in this case, you have initially only one sequence, but you’re dividing it into many sequences to create windows.
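
A hedged NumPy sketch of building such windows from one long sequence (the sequence length, window size, and feature count are illustrative):

import numpy as np

sequence = np.random.rand(100, 2)  # one long sequence: 100 steps, 2 features (P, T)
window = 5

# Stack overlapping windows; Keras will treat each one as an independent sequence.
windows = np.stack([sequence[i:i + window]
                    for i in range(len(sequence) - window + 1)])
print(windows.shape)  # (96, 5, 2) -> (num_windows, steps, features)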

The concept of “what is a sequence” is abstract. The important parts are:

  • you can have batches with many individual sequences
  • what makes the sequences be sequences is that they evolve in steps (usually time steps)

Achieving each case with “single layers”

Achieving standard many to many:

StandardManyToMany

You can achieve many to many with a simple LSTM layer, using return_sequences=True:

outputs = LSTM(units, return_sequences=True)(inputs)

#output_shape -> (batch_size, steps, units)

Achieving many to one:

Using the exact same layer, keras will do the exact same internal preprocessing, but when you use return_sequences=False (or simply ignore this argument), keras will automatically discard the steps previous to the last:

ManyToOne

outputs = LSTM(units)(inputs)

#output_shape -> (batch_size, units) --> steps were discarded, only the last was returned

Achieving one to many

Now, this is not supported by Keras LSTM layers alone. You will have to create your own strategy to multiply the steps. There are two good approaches:

  • Create a constant multi-step input by repeating a tensor
  • Use a stateful=True to recurrently take the output of one step and serve it as the input of the next step (needs output_features == input_features)

One to many with repeat vector

In order to fit to keras standard behavior, we need inputs in steps, so, we simply repeat the inputs for the length we want:

OneToManyRepeat

outputs = RepeatVector(steps)(inputs) #where inputs is (batch,features)
outputs = LSTM(units,return_sequences=True)(outputs)

#output_shape -> (batch_size, steps, units)

Understanding stateful = True

Now comes one of the possible usages of stateful=True (besides avoiding loading data that can’t fit your computer’s memory at once)

Stateful allows us to input “parts” of the sequences in stages. The difference is:

  • In stateful=False, the second batch contains whole new sequences, independent from the first batch
  • In stateful=True, the second batch continues the first batch, extending the same sequences.

It’s like dividing the sequences in windows too, with these two main differences:

  • these windows do not superpose!!
  • stateful=True will see these windows connected as a single long sequence

In stateful=True, every new batch will be interpreted as continuing the previous batch (until you call model.reset_states()).

  • Sequence 1 in batch 2 will continue sequence 1 in batch 1.
  • Sequence 2 in batch 2 will continue sequence 2 in batch 1.
  • Sequence n in batch 2 will continue sequence n in batch 1.

Example of inputs, batch 1 contains steps 1 and 2, batch 2 contains steps 3 to 5:

                   BATCH 1                           BATCH 2
        [     Step1      Step2        |    [    Step3      Step4      Step5
Tank A:    [[Pa1,Ta1], [Pa2,Ta2],     |       [Pa3,Ta3], [Pa4,Ta4], [Pa5,Ta5]],
Tank B:    [[Pb1,Tb1], [Pb2,Tb2],     |       [Pb3,Tb3], [Pb4,Tb4], [Pb5,Tb5]],
  ....                                |
Tank N:    [[Pn1,Tn1], [Pn2,Tn2],     |       [Pn3,Tn3], [Pn4,Tn4], [Pn5,Tn5]],
        ]                                  ]

Notice the alignment of tanks in batch 1 and batch 2! That’s why we need shuffle=False (unless we are using only one sequence, of course).

You can have any number of batches, indefinitely. (For having variable lengths in each batch, use input_shape=(None,features).)

One to many with stateful=True

For our case here, we are going to use only 1 step per batch, because we want to get one output step and make it be an input.

Please notice that the behavior in the picture is not “caused by” stateful=True. We will force that behavior in a manual loop below. In this example, stateful=True is what “allows” us to stop the sequence, manipulate what we want, and continue from where we stopped.

OneToManyStateful

Honestly, the repeat approach is probably a better choice for this case. But since we’re looking into stateful=True, this is a good example. The best way to use this is the next “many to many” case.

Layer:

outputs = LSTM(units=features, 
               stateful=True, 
               return_sequences=True, #just to keep a nice output shape even with length 1
               input_shape=(None,features))(inputs) 
    #units = features because we want to use the outputs as inputs
    #None because we want variable length

#output_shape -> (batch_size, steps, units) 

Now, we’re going to need a manual loop for predictions:

input_data = someDataWithShape((batch, 1, features))

#important, we're starting new sequences, not continuing old ones:
model.reset_states()

output_sequence = []
last_step = input_data
for i in steps_to_predict:

    new_step = model.predict(last_step)
    output_sequence.append(new_step)
    last_step = new_step

 #end of the sequences
 model.reset_states()

Many to many with stateful=True

Now, here, we get a very nice application: given an input sequence, try to predict its future unknown steps.

We’re using the same method as in the “one to many” above, with the difference that:

  • we will use the sequence itself to be the target data, one step ahead
  • we know part of the sequence (so we discard this part of the results).

ManyToManyStateful

Layer (same as above):

outputs = LSTM(units=features, 
               stateful=True, 
               return_sequences=True, 
               input_shape=(None,features))(inputs) 
    #units = features because we want to use the outputs as inputs
    #None because we want variable length

#output_shape -> (batch_size, steps, units) 

Training:

We are going to train our model to predict the next step of the sequences:

totalSequences = someSequencesShaped((batch, steps, features))
    #batch size is usually 1 in these cases (often you have only one Tank in the example)

X = totalSequences[:,:-1] #the entire known sequence, except the last step
Y = totalSequences[:,1:] #one step ahead of X

#loop for resetting states at the start/end of the sequences:
for epoch in range(epochs):
    model.reset_states()
    model.train_on_batch(X,Y)

Predicting:

The first stage of our predicting involves “adjusting the states”. That’s why we’re going to predict the entire sequence again, even if we already know this part of it:

model.reset_states() #starting a new sequence
predicted = model.predict(totalSequences)
firstNewStep = predicted[:,-1:] #the last step of the predictions is the first future step

Now we go to the loop as in the one to many case, but don’t reset states here! We want the model to know which step of the sequence it is at (and it knows it’s at the first new step because of the prediction we just made above).

output_sequence = [firstNewStep]
last_step = firstNewStep
for i in steps_to_predict:

    new_step = model.predict(last_step)
    output_sequence.append(new_step)
    last_step = new_step

 #end of the sequences
 model.reset_states()

This approach has been used in other answers and example files.

Achieving complex configurations

In all examples above, I showed the behavior of “one layer”.

You can, of course, stack many layers on top of each other, not necessarily all following the same pattern, and create your own models.

One interesting example that has been appearing is the “autoencoder” that has a “many to one encoder” followed by a “one to many” decoder:

Encoder:

inputs = Input((steps,features))

#a few many to many layers:
outputs = LSTM(hidden1,return_sequences=True)(inputs)
outputs = LSTM(hidden2,return_sequences=True)(outputs)    

#many to one layer:
outputs = LSTM(hidden3)(outputs)

encoder = Model(inputs,outputs)

Decoder:

Using the “repeat” method;

inputs = Input((hidden3,))

#repeat to make one to many:
outputs = RepeatVector(steps)(inputs)

#a few many to many layers:
outputs = LSTM(hidden4,return_sequences=True)(outputs)

#last layer
outputs = LSTM(features,return_sequences=True)(outputs)

decoder = Model(inputs,outputs)

Autoencoder:

inputs = Input((steps,features))
outputs = encoder(inputs)
outputs = decoder(outputs)

autoencoder = Model(inputs,outputs)

Train with fit(X,X)

Additional explanations

If you want details about how steps are calculated in LSTMs, or details about the stateful=True cases above, you can read more in this answer: Doubts regarding `Understanding Keras LSTMs`


Answer 2

When you have return_sequences=True in the last RNN layer, you cannot use a simple Dense layer; use TimeDistributed instead.

Here is an example piece of code this might help others.

words = keras.layers.Input(batch_shape=(None, self.maxSequenceLength), name = "input")

    # Build a matrix of size vocabularySize x EmbeddingDimension 
    # where each row corresponds to a "word embedding" vector.
    # This layer will replace each word-id with a word-vector of size Embedding Dimension.
    embeddings = keras.layers.embeddings.Embedding(self.vocabularySize, self.EmbeddingDimension,
        name = "embeddings")(words)
    # Pass the word-vectors to the LSTM layer.
    # We are setting the hidden-state size to 512.
    # The output will be batchSize x maxSequenceLength x hiddenStateSize
    hiddenStates = keras.layers.GRU(512, return_sequences = True, 
                                        input_shape=(self.maxSequenceLength,
                                        self.EmbeddingDimension),
                                        name = "rnn")(embeddings)
    hiddenStates2 = keras.layers.GRU(128, return_sequences = True, 
                                        input_shape=(self.maxSequenceLength, self.EmbeddingDimension),
                                        name = "rnn2")(hiddenStates)

    denseOutput = TimeDistributed(keras.layers.Dense(self.vocabularySize), 
        name = "linear")(hiddenStates2)
    predictions = TimeDistributed(keras.layers.Activation("softmax"), 
        name = "softmax")(denseOutput)  

    # Build the computational graph by specifying the input, and output of the network.
    model = keras.models.Model(input = words, output = predictions)
    # model.compile(loss='kullback_leibler_divergence', \
    model.compile(loss='sparse_categorical_crossentropy', \
        optimizer = keras.optimizers.Adam(lr=0.009, \
            beta_1=0.9,\
            beta_2=0.999, \
            epsilon=None, \
            decay=0.01, \
            amsgrad=False))

What is the difference between 'SAME' and 'VALID' padding in tf.nn.max_pool of TensorFlow?

Question: What is the difference between 'SAME' and 'VALID' padding in tf.nn.max_pool of TensorFlow?

What is the difference between ‘SAME’ and ‘VALID’ padding in tf.nn.max_pool of tensorflow?

In my opinion, ‘VALID’ means there will be no zero padding outside the edges when we do max pool.

According to A guide to convolution arithmetic for deep learning, it says that there will be no padding in pool operator, i.e. just use ‘VALID’ of tensorflow. But what is ‘SAME’ padding of max pool in tensorflow?


Answer 0

I’ll give an example to make it clearer:

  • x: input image of shape [2, 3], 1 channel
  • valid_pad: max pool with 2×2 kernel, stride 2 and VALID padding.
  • same_pad: max pool with 2×2 kernel, stride 2 and SAME padding (this is the classic way to go)

The output shapes are:

  • valid_pad: here, no padding so the output shape is [1, 1]
  • same_pad: here, we pad the image to the shape [2, 4] (with -inf and then apply max pool), so the output shape is [1, 2]

x = tf.constant([[1., 2., 3.],
                 [4., 5., 6.]])

x = tf.reshape(x, [1, 2, 3, 1])  # give a shape accepted by tf.nn.max_pool

valid_pad = tf.nn.max_pool(x, [1, 2, 2, 1], [1, 2, 2, 1], padding='VALID')
same_pad = tf.nn.max_pool(x, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

valid_pad.get_shape() == [1, 1, 1, 1]  # valid_pad is [5.]
same_pad.get_shape() == [1, 1, 2, 1]   # same_pad is  [5., 6.]


Answer 1

If you like ascii art:

  • "VALID" = without padding:

       inputs:         1  2  3  4  5  6  7  8  9  10 11 (12 13)
                      |________________|                dropped
                                     |_________________|
    
  • "SAME" = with zero padding:

                   pad|                                      |pad
       inputs:      0 |1  2  3  4  5  6  7  8  9  10 11 12 13|0  0
                   |________________|
                                  |_________________|
                                                 |________________|
    

In this example:

  • Input width = 13
  • Filter width = 6
  • Stride = 5

Notes:

  • "VALID" only ever drops the right-most columns (or bottom-most rows).
  • "SAME" tries to pad evenly left and right, but if the amount of columns to be added is odd, it will add the extra column to the right, as is the case in this example (the same logic applies vertically: there may be an extra row of zeros at the bottom).

Edit:

About the name:

  • With "SAME" padding, if you use a stride of 1, the layer’s outputs will have the same spatial dimensions as its inputs.
  • With "VALID" padding, there’s no “made-up” padding inputs. The layer only uses valid input data.

Answer 2

When stride is 1 (more typical with convolution than pooling), we can think of the following distinction:

  • "SAME": the output size is the same as the input size. This requires the filter window to slide outside the input map, hence the need to pad.
  • "VALID": the filter window stays at valid positions inside the input map, so the output size shrinks by filter_size - 1. No padding occurs.

Answer 3

The TensorFlow convolution example gives an overview of the difference between SAME and VALID:

  • For the SAME padding, the output height and width are computed as:

    out_height = ceil(float(in_height) / float(strides[1]))
    out_width  = ceil(float(in_width) / float(strides[2]))
    

And

  • For the VALID padding, the output height and width are computed as:

    out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
    out_width  = ceil(float(in_width - filter_width + 1) / float(strides[2]))
    

Answer 4

Padding is an operation to increase the size of the input data. In the case of 1-dimensional data you just append/prepend the array with a constant; in 2-dim you surround the matrix with these constants; in n-dim you surround your n-dim hypercube with the constant. In most cases this constant is zero, and it is called zero-padding.

Here is an example of zero-padding with p=1 applied to a 2-d tensor.


You can use arbitrary padding for your kernel, but some padding values are used more frequently than others:

  • VALID padding. The easiest case: no padding at all. Just leave your data the same as it was.
  • SAME padding, sometimes called HALF padding. It is called SAME because, for a convolution with stride=1 (or for pooling), it should produce output of the same size as the input. It is called HALF because, for a kernel of size k, the padding on each side is (k-1)/2.
  • FULL padding is the maximum padding that does not result in a convolution over just the padded elements. For a kernel of size k, this padding is equal to k - 1.

To use arbitrary padding in TF, you can use tf.pad()


Answer 5

Quick Explanation

VALID: Don’t apply any padding, i.e., assume that all dimensions are valid so that the input image gets fully covered by the filter and stride you specified.

SAME: Apply padding to the input (if needed) so that the input image gets fully covered by the filter and stride you specified. For stride 1, this will ensure that the output image size is the same as the input.

Notes

  • This applies to conv layers as well as max pool layers in the same way.
  • The term “valid” is a bit of a misnomer, because things don’t become “invalid” if you drop part of the image; sometimes you might even want that. This should probably have been called NO_PADDING instead.
  • The term “same” is a misnomer too, because it only makes sense for a stride of 1, when the output dimension is the same as the input dimension. For a stride of 2, output dimensions will be half, for example. This should probably have been called AUTO_PADDING instead.
  • In SAME (i.e. auto-pad mode), TensorFlow will try to spread the padding evenly between left and right.
  • In VALID (i.e. no-padding mode), TensorFlow will drop the right and/or bottom cells if your filter and stride don’t fully cover the input image.

Answer 6

I am quoting this answer from the official TensorFlow docs, https://www.tensorflow.org/api_guides/python/nn#Convolution. For the ‘SAME’ padding, the output height and width are computed as:

out_height = ceil(float(in_height) / float(strides[1]))
out_width  = ceil(float(in_width) / float(strides[2]))

and the padding on the top and left are computed as:

pad_along_height = max((out_height - 1) * strides[1] +
                    filter_height - in_height, 0)
pad_along_width = max((out_width - 1) * strides[2] +
                   filter_width - in_width, 0)
pad_top = pad_along_height // 2
pad_bottom = pad_along_height - pad_top
pad_left = pad_along_width // 2
pad_right = pad_along_width - pad_left

For the ‘VALID’ padding, the output height and width are computed as:

out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
out_width  = ceil(float(in_width - filter_width + 1) / float(strides[2]))

and the padding values are always zero.
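
These formulas transcribe directly into a small helper for sanity checks (a sketch; the function name output_size is mine, not part of TensorFlow):

import math

def output_size(in_size, filter_size, stride, padding):
    # mirrors the out_height/out_width formulas from the TF docs above
    if padding == 'SAME':
        return math.ceil(in_size / stride)
    elif padding == 'VALID':
        return math.ceil((in_size - filter_size + 1) / stride)

print(output_size(10, 3, 2, 'SAME'))   # 5
print(output_size(10, 3, 2, 'VALID'))  # 4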


Answer 7

There are three choices of padding: valid (no padding), same (or half), and full. You can find explanations (for Theano) here: http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html

  • Valid or no padding:

Valid padding involves no zero-padding, so it covers only the valid input, not including artificially generated zeros. The length of the output is ((length of input) - (k - 1)) for a kernel of size k, if the stride is s=1.

  • Same or half padding:

Same padding makes the size of the output the same as that of the input when s=1. If s=1, the total number of zeros padded is (k - 1).

  • Full padding:

Full padding means that the kernel runs over the whole input, so at the ends the kernel may overlap only a single input element, with zeros everywhere else. The number of zeros padded is 2(k - 1) if s=1, and the length of the output is then ((length of input) + (k - 1)).

Therefore, the amount of padding: (valid) <= (same) <= (full)
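
These three modes map one-to-one onto the mode argument of numpy.convolve, which gives a quick 1-D sanity check of the output lengths above:

import numpy as np

signal = np.ones(10)
kernel = np.ones(3)   # k = 3, s = 1
for mode in ('valid', 'same', 'full'):
    print(mode, np.convolve(signal, kernel, mode).shape)
# valid (8,)   -> 10 - (k - 1)
# same  (10,)  -> same as the input
# full  (12,)  -> 10 + (k - 1)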


Answer 8

Padding on/off. Determines the effective size of your input.

VALID: No padding. Convolution etc. ops are only performed at locations that are “valid”, i.e. not too close to the borders of your tensor.
With a kernel of 3×3 and image of 10×10, you would be performing convolution on the 8×8 area inside the borders.

SAME: Padding is provided. Whenever your operation references a neighborhood (no matter how big), zero values are provided when that neighborhood extends outside the original tensor to allow that operation to work also on border values.
With a kernel of 3×3 and image of 10×10, you would be performing convolution on the full 10×10 area.
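
The 10×10 vs. 8×8 arithmetic is easy to confirm with Keras layers (a minimal sketch, assuming TF 2.x):

import tensorflow as tf

x = tf.ones([1, 10, 10, 1])
print(tf.keras.layers.Conv2D(1, 3, padding='valid')(x).shape)  # (1, 8, 8, 1)
print(tf.keras.layers.Conv2D(1, 3, padding='same')(x).shape)   # (1, 10, 10, 1)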


Answer 9

VALID padding: no padding at all (hopefully there is no confusion).

x = tf.constant([[1., 2., 3.], [4., 5., 6.],[ 7., 8., 9.], [ 7., 8., 9.]])
x = tf.reshape(x, [1, 4, 3, 1])
valid_pad = tf.nn.max_pool(x, [1, 2, 2, 1], [1, 2, 2, 1], padding='VALID')
print (valid_pad.get_shape()) # output-->(1, 2, 1, 1)

SAME padding: This is kind of tricky to understand in the first place, because we have to consider two conditions separately, as mentioned in the official docs.

Let's take the input size as n, the padding as p, the stride as s, and the kernel size as k (only a single dimension is considered; these are the per-dimension versions of the formulas quoted from the TensorFlow docs in Answer 6):

Case 01, n mod s = 0: p = max(k - s, 0)

Case 02, n mod s ≠ 0: p = max(k - (n mod s), 0)

p is calculated as the minimum padding such that the output size is ceil(n / s).

Let's work out this example:

x = tf.constant([[1., 2., 3.], [4., 5., 6.],[ 7., 8., 9.], [ 7., 8., 9.]])
x = tf.reshape(x, [1, 4, 3, 1])
same_pad = tf.nn.max_pool(x, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
print (same_pad.get_shape()) # --> output (1, 2, 2, 1)

Here the spatial dimensions of x are (4, 3). For the horizontal direction (n=3, k=2, s=2): 3 mod 2 = 1, so p = max(2 - 1, 0) = 1 and the output width is ceil(3 / 2) = 2.

For the vertical direction (n=4, k=2, s=2): 4 mod 2 = 0, so p = max(2 - 2, 0) = 0 and the output height is ceil(4 / 2) = 2.

Hope this helps to understand how SAME padding actually works in TF.
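
The two cases reduce to a one-line helper that can be checked against the example above (the function name same_pad is mine, for illustration only):

def same_pad(n, k, s):
    # total padding TF adds along one dimension in 'SAME' mode
    return max(k - s, 0) if n % s == 0 else max(k - (n % s), 0)

print(same_pad(4, 2, 2))  # 0 -> vertical direction
print(same_pad(3, 2, 2))  # 1 -> horizontal direction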


Answer 10

Based on the explanation here and following up on Tristan's answer, I usually use these quick functions for sanity checks.

import numpy as np

# a function to help us stay clean
def getPaddings(pad_along_height, pad_along_width):
    # if even.. easy..
    if pad_along_height % 2 == 0:
        pad_top = pad_along_height // 2
        pad_bottom = pad_top
    # if odd, the bottom gets the extra pixel
    else:
        pad_top = int(np.floor(pad_along_height / 2))
        pad_bottom = int(np.floor(pad_along_height / 2)) + 1
    # check if width padding is odd or even
    if pad_along_width % 2 == 0:
        pad_left = pad_along_width // 2
        pad_right = pad_left
    # if odd, the right side gets the extra pixel
    else:
        pad_left = int(np.floor(pad_along_width / 2))
        pad_right = int(np.floor(pad_along_width / 2)) + 1
    return pad_top, pad_bottom, pad_left, pad_right

# strides [image index, y, x, depth]
# padding 'SAME' or 'VALID'
# bottom and right sides always get the one additional padded pixel (if padding is odd)
def getOutputDim(inputWidth, inputHeight, filterWidth, filterHeight, strides, padding):
    if padding == 'SAME':
        out_height = int(np.ceil(float(inputHeight) / float(strides[1])))
        out_width = int(np.ceil(float(inputWidth) / float(strides[2])))
        pad_along_height = (out_height - 1) * strides[1] + filterHeight - inputHeight
        pad_along_width = (out_width - 1) * strides[2] + filterWidth - inputWidth
        # now get padding
        pad_top, pad_bottom, pad_left, pad_right = getPaddings(pad_along_height, pad_along_width)
        print('output height', out_height)
        print('output width', out_width)
        print('total pad along height', pad_along_height)
        print('total pad along width', pad_along_width)
        print('pad at top', pad_top)
        print('pad at bottom', pad_bottom)
        print('pad at left', pad_left)
        print('pad at right', pad_right)
    elif padding == 'VALID':
        out_height = int(np.ceil(float(inputHeight - filterHeight + 1) / float(strides[1])))
        out_width = int(np.ceil(float(inputWidth - filterWidth + 1) / float(strides[2])))
        print('output height', out_height)
        print('output width', out_width)
        print('no padding')


# use like so
getOutputDim(80, 80, 4, 4, [1, 1, 1, 1], 'SAME')

Answer 11

To sum up, 'valid' padding means no padding: the output of the convolutional layer shrinks, depending on the input size and kernel size.

On the contrary, 'same' padding means using padding: when the stride is set to 1, the output of the convolutional layer keeps the input size, by appending a zero border of the appropriate width around the input data when computing the convolution.

Hope this intuitive description helps.


Answer 12

General Formula

Here, W and H are the width and height of the input, F is the filter dimension, P is the padding size (i.e., the number of rows or columns to be padded on each side) and S is the stride. The general output size is (W - F + 2P)/S + 1 along the width and (H - F + 2P)/S + 1 along the height.

For SAME padding, P is chosen so that:

out_height = ceil(H / S)
out_width  = ceil(W / S)

For VALID padding, P = 0, so:

out_height = ceil((H - F + 1) / S)
out_width  = ceil((W - F + 1) / S)


Answer 13

Complementing YvesgereY's great answer, I found this visualization extremely helpful:

Padding visualization

Padding 'valid' is the first figure. The filter window stays inside the image.

Padding 'same' is the third figure. The output is the same size.

I found it in this article.


Answer 14

TensorFlow 2.0 Compatible Answer: Detailed explanations about "Valid" and "Same" padding have been provided above.

However, I will specify the different pooling functions and their respective commands in TensorFlow 2.x (>= 2.0), for the benefit of the community.

Functions in 1.x:

tf.nn.max_pool

tf.keras.layers.MaxPool2D

Average Pooling => None in tf.nn, tf.keras.layers.AveragePooling2D

Functions in 2.x:

tf.nn.max_pool if used in 2.x and tf.compat.v1.nn.max_pool_v2 or tf.compat.v2.nn.max_pool, if migrated from 1.x to 2.x.

tf.keras.layers.MaxPool2D if used in 2.x, and tf.compat.v1.keras.layers.MaxPool2D or tf.compat.v1.keras.layers.MaxPooling2D or tf.compat.v2.keras.layers.MaxPool2D or tf.compat.v2.keras.layers.MaxPooling2D, if migrated from 1.x to 2.x.

Average Pooling => tf.nn.avg_pool2d or tf.keras.layers.AveragePooling2D if used in TF 2.x, and tf.compat.v1.nn.avg_pool_v2 or tf.compat.v2.nn.avg_pool or tf.compat.v1.keras.layers.AveragePooling2D or tf.compat.v1.keras.layers.AvgPool2D or tf.compat.v2.keras.layers.AveragePooling2D or tf.compat.v2.keras.layers.AvgPool2D, if migrated from 1.x to 2.x.

For more information about Migration from Tensorflow 1.x to 2.x, please refer to this Migration Guide.
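
For instance, the two 2.x-native calls with both padding modes, applied to the (1, 4, 3, 1) tensor from Answer 9 (a sketch, assuming TF 2.x eager execution):

import tensorflow as tf

x = tf.ones([1, 4, 3, 1])
print(tf.nn.max_pool2d(x, ksize=2, strides=2, padding='VALID').shape)             # (1, 2, 1, 1)
print(tf.nn.max_pool2d(x, ksize=2, strides=2, padding='SAME').shape)              # (1, 2, 2, 1)
print(tf.keras.layers.MaxPool2D(pool_size=2, strides=2, padding='same')(x).shape) # (1, 2, 2, 1)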


Labelme - Image Polygonal Annotation with Python (polygons, rectangles, circles, lines, points and image-level flag annotation)

Labelme is a graphical image annotation tool, inspired by http://labelme.csail.mit.edu.
It is written in Python and uses Qt for its graphical interface.


Various primitives (polygons, rectangles, circles, lines, and points).

Features

Requirements

Installation

There are the following options:

Python

You need to install Anaconda, then run below:

# python2
conda create --name=labelme python=2.7
source activate labelme
# conda install -c conda-forge pyside2
conda install pyqt
pip install labelme
# if you'd like to use the latest version. run below:
# pip install git+https://github.com/wkentaro/labelme.git

# python3
conda create --name=labelme python=3.6
source activate labelme
# conda install -c conda-forge pyside2
# conda install pyqt
# pip install pyqt5  # pyqt5 can be installed via pip on python3
pip install labelme
# or you can install everything by conda command
# conda install labelme -c conda-forge

Docker

You need to install Docker, then run below:

# on macOS
socat TCP-LISTEN:6000,reuseaddr,fork UNIX-CLIENT:\"$DISPLAY\" &
docker run -it -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=docker.for.mac.host.internal:0 -v $(pwd):/root/workdir wkentaro/labelme

# on Linux
xhost +
docker run -it -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=:0 -v $(pwd):/root/workdir wkentaro/labelme

Ubuntu

# Ubuntu 14.04 / Ubuntu 16.04
# Python2
# sudo apt-get install python-qt4  # PyQt4
sudo apt-get install python-pyqt5  # PyQt5
sudo pip install labelme
# Python3
sudo apt-get install python3-pyqt5  # PyQt5
sudo pip3 install labelme

# or install standalone executable from:
# https://github.com/wkentaro/labelme/releases

Ubuntu 19.10+ / Debian (sid)

sudo apt-get install labelme

MacOS

# macOS Sierra
brew install pyqt  # maybe pyqt5
pip install labelme  # both python2/3 should work

# or install standalone executable/app from:
# https://github.com/wkentaro/labelme/releases

Windows

Install Anaconda, then run below:

# python3
conda create --name=labelme python=3.6
conda activate labelme
pip install labelme

Usage

Run labelme --help for details.
The annotations are saved as a JSON file.

labelme  # just open gui

# tutorial (single image example)
cd examples/tutorial
labelme apc2016_obj3.jpg  # specify image file
labelme apc2016_obj3.jpg -O apc2016_obj3.json  # close window after the save
labelme apc2016_obj3.jpg --nodata  # not include image data but relative image path in JSON file
labelme apc2016_obj3.jpg \
  --labels highland_6539_self_stick_notes,mead_index_cards,kong_air_dog_squeakair_tennis_ball  # specify label list

# semantic segmentation example
cd examples/semantic_segmentation
labelme data_annotated/  # Open directory to annotate all images in it
labelme data_annotated/ --labels labels.txt  # specify label list with a file

For more advanced usage, please refer to the examples:

Command line arguments

  • --output specifies the location that annotations will be written to. If the location ends with .json, a single annotation will be written to this file. Only one image can be annotated if a location is specified with .json. If the location does not end with .json, the program will assume it is a directory. Annotations will be stored in this directory with a name that corresponds to the image that the annotation was made on.
  • The first time you run labelme, it will create a config file in ~/.labelmerc. You can edit this file and the changes will be applied the next time that you launch labelme. If you would prefer to use a config file from another location, you can specify this file with the --config flag.
  • Without the --nosortlabels flag, the program will list labels in alphabetical order. When the program is run with this flag, it will display labels in the order that they are provided.
  • Flags are assigned to an entire image. Example
  • Labels are assigned to a single polygon. Example

FAQ

Testing

pip install hacking pytest pytest-qt
flake8 .
pytest -v tests

Development

git clone https://github.com/wkentaro/labelme.git
cd labelme

# Install anaconda3 and labelme
curl -L https://github.com/wkentaro/dotfiles/raw/master/local/bin/install_anaconda3.sh | bash -s .
source .anaconda3/bin/activate
pip install -e .

How to build a standalone executable

Below shows how to build the standalone executable on macOS, Linux and Windows.

# Setup conda
conda create --name labelme python==3.6.0
conda activate labelme

# Build the standalone executable
pip install .
pip install pyinstaller
pyinstaller labelme.spec
dist/labelme --version

How to contribute

Make sure the tests below pass in your environment.
See .github/workflows/ci.yml for more details.

pip install black hacking pytest pytest-qt

flake8 .
black --line-length 79 --check labelme/
MPLBACKEND='agg' pytest tests/ -m 'not gpu'

Acknowledgement

This repo is a fork of mpitid/pylabelme, whose development had stopped.

Cite This Project

If you use this project in your research or wish to refer to the baseline results published in the README, please use the following BibTeX entry.

@misc{labelme2016,
  author =       {Kentaro Wada},
  title =        {{labelme: Image Polygonal Annotation with Python}},
  howpublished = {\url{https://github.com/wkentaro/labelme}},
  year =         {2016}
}

Computervision-recipes - Best practices, code samples and documentation for Computer Vision

Computer Vision

In recent years, we've seen extraordinary growth in Computer Vision, with applications in face recognition, image understanding, search, drones, mapping, semi-autonomous and autonomous vehicles. A key part of many of these applications are visual recognition tasks such as image classification, object detection and image similarity.

This repository provides examples and best practice guidelines for building computer vision systems. The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in Computer Vision algorithms, neural architectures, and operationalizing such systems. Rather than creating implementations from scratch, we draw from existing state-of-the-art libraries and build additional utilities around loading image data, optimizing and evaluating models, and scaling up to the cloud. In addition, having worked in this space for many years, we aim to answer common questions, point out frequently observed pitfalls, and show how to use the cloud for training and deployment.

We hope that these examples and utilities can significantly reduce the "time to market" by simplifying the experience from defining the business problem to the development of solutions by orders of magnitude. In addition, the example notebooks serve as guidelines and showcase best practices and usage of the tools in a wide variety of languages.

The examples are provided as Jupyter notebooks and common utility functions. All examples use PyTorch as the underlying deep learning library.

Target Audience

Our target audience for this repository includes data scientists and machine learning engineers with varying levels of Computer Vision knowledge, as our content is purely source code and targets custom machine learning modeling. The utilities and examples provided are intended to be solution accelerators for real-world vision problems.

Getting Started

To get started, navigate to the Setup Guide, which lists instructions on how to set up your compute environment and the dependencies needed to run the notebooks in this repo. Once your environment is set up, navigate to the Scenarios folder and start exploring the notebooks. We recommend starting with the image classification notebooks, as these introduce concepts that are also used by the other scenarios (e.g. pre-training on ImageNet).

Alternatively, we support Binder: simply click the link to try out our notebooks in a web browser. However, Binder is free and hence only provides limited CPU compute power and no GPU support. Expect the notebooks to run very slowly (this is improved somewhat by reducing image resolution to e.g. 60 pixels, at the cost of lower accuracy).

Scenarios

Below is a summary of commonly used Computer Vision scenarios covered in this repository. For each of the main scenarios ("base"), we provide the tools to effectively build your own models. This includes simple tasks such as fine-tuning your own model on your own data, as well as more complex tasks such as hard-negative mining and even model deployment.

Scenario | Support | Description
Classification | Base | Image classification is a supervised machine learning technique to learn and predict the category of a given image.
Similarity | Base | Image similarity is a way to compute a similarity score given a pair of images. Given an image, it allows you to identify the most similar image in a given dataset.
Detection | Base | Object detection is a technique that allows you to detect the bounding box of an object within an image.
Keypoints | Base | Keypoint detection can be used to detect specific points on an object. A pre-trained model is provided to detect body joints for human pose estimation.
Segmentation | Base | Image segmentation assigns a category to each pixel in an image.
Action recognition | Base | Action recognition to identify actions performed in video/webcam footage (e.g. "running", "opening a bottle") with respective start/end times. An i3D implementation of action recognition can also be found under (Contrib)[contrib].
Tracking | Base | Tracking allows detecting and tracking multiple objects in a video sequence over time.
Crowd counting | Contrib | Counting the number of people in low-crowd-density (e.g. fewer than 10 people) and high-crowd-density (e.g. thousands of people) scenarios.

We separate the supported CV scenarios into two locations: (i) base: code and notebooks within the "utils_cv" and "scenarios" folders follow strict coding guidelines and are well tested and maintained; (ii) contrib: code and other resources within the "contrib" folder, mainly covering less common CV scenarios using cutting-edge technology. The code in "contrib" is not regularly tested or maintained.

Computer Vision on Azure

Note that for certain computer vision problems, you may not need to build your own models. Instead, pre-built or easily customizable solutions exist on Azure which do not require any custom coding or machine learning expertise. We strongly recommend evaluating if these can sufficiently solve your problem. If these solutions are not applicable, or the accuracy of these solutions is not sufficient, then resorting to more complex and time-consuming custom approaches may be necessary.

The following Microsoft services offer simple solutions to address common computer vision tasks:

  • Vision Services are a set of pre-trained REST APIs which can be called for image tagging, face recognition, OCR, video analytics, and more. These APIs work out of the box and require minimal expertise in machine learning, but have limited customization capabilities. See the various demos available to get a feel for the functionality (e.g. Computer Vision). The service can be used through API calls or through SDKs (available in .NET, Python, Java, Node and Go languages).
  • Custom Vision is a SaaS service to train and deploy a model as a REST API given a user-provided training set. All steps including image upload, annotation, and model deployment can be performed using an intuitive UI or through SDKs (available in .NET, Python, Java, Node and Go languages). Training image classification or object detection models can be achieved with minimal machine learning expertise. Custom Vision offers more flexibility than using the pre-trained Cognitive Services APIs, but requires the user to bring and annotate their own data.

If you need to train your own model, the following services and links provide additional information that is likely useful:

  • Azure Machine Learning service (AzureML) is a service that helps users accelerate the training and deployment of machine learning models. While not specific to computer vision workloads, the AzureML Python SDK can be used for scalable and reliable training and deployment of machine learning solutions to the cloud. We leverage Azure Machine Learning in several of the notebooks within this repository (e.g. deployment to Azure Kubernetes Service).
  • Azure AI Reference architectures provide a set of examples (backed by code) of how to build common AI-oriented workloads that leverage multiple cloud components. While not computer vision specific, these reference architectures cover several machine learning workloads such as model deployment or batch scoring.

Build Status

AzureML Testing

Build Type | Branch | Status | Branch | Status
Linux GPU | master | Build Status | staging | Build Status
Linux CPU | master | Build Status | staging | Build Status
Notebook unit GPU | master | Build Status | staging | Build Status

Contributing

This project welcomes contributions and suggestions. Please see our contribution guidelines.

Jina - Cloud-native neural search framework for any kind of data

Jina logo: Jina is a cloud-native neural search framework

A cloud-native neural search framework for any kind of data.

Jina allows you to build deep-learning-powered search-as-a-service in just minutes.

🌌 All data types - Large-scale indexing and querying of any kind of unstructured data: video, image, long/short text, music, source code, PDF, etc.

🌩️ Fast and cloud-native - Distributed architecture from day one, scalable and cloud-native by design: enjoy containerizing, streaming, paralleling, sharding, async scheduling, HTTP/gRPC/WebSocket protocols.

⏱️ Save time - The design pattern of neural search systems: from zero to a production-ready system in just minutes.

🍱 Own your stack - Keep end-to-end stack ownership of your solution; avoid the integration pitfalls that come with fragmented, multi-vendor, generic legacy tools.

Run Quick Demo

Installation

  • via PyPI: pip install -U "jina[standard]"
  • via Docker: docker run jinaai/jina:latest
More install options (x86/64, arm64/v6/v7, on Linux/macOS with Python 3.7/3.8/3.9; the right-hand command in each row is for Docker users):

  • Minimum (no HTTP, WebSocket, Docker support): pip install jina / docker run jinaai/jina:latest
  • Daemon: pip install "jina[daemon]" / docker run --network=host jinaai/jina:latest-daemon
  • With additional extras: pip install "jina[devel]" / docker run jinaai/jina:latest-devel

Version identifiers are explained here. Jina can run on Windows Subsystem for Linux. We welcome the community to help us with native Windows support.

Get Started

Document, Executor, and Flow are the three fundamental concepts in Jina.

1️⃣ Copy-paste the minimal example below and run it:

💡 Preliminaries: character embedding, pooling, Euclidean distance

The architecture of a simple neural search system powered by Jina

import numpy as np
from jina import Document, DocumentArray, Executor, Flow, requests

class CharEmbed(Executor):  # a simple character embedding with mean-pooling
    offset = 32  # letter `a`
    dim = 127 - offset + 1  # last pos reserved for `UNK`
    char_embd = np.eye(dim) * 1  # one-hot embedding for all chars

    @requests
    def foo(self, docs: DocumentArray, **kwargs):
        for d in docs:
            r_emb = [ord(c) - self.offset if self.offset <= ord(c) <= 127 else (self.dim - 1) for c in d.text]
            d.embedding = self.char_embd[r_emb, :].mean(axis=0)  # average pooling

class Indexer(Executor):
    _docs = DocumentArray()  # for storing all documents in memory

    @requests(on='/index')
    def foo(self, docs: DocumentArray, **kwargs):
        self._docs.extend(docs)  # extend stored `docs`

    @requests(on='/search')
    def bar(self, docs: DocumentArray, **kwargs):
        q = np.stack(docs.get_attributes('embedding'))  # get all embeddings from query docs
        d = np.stack(self._docs.get_attributes('embedding'))  # get all embeddings from stored docs
        euclidean_dist = np.linalg.norm(q[:, None, :] - d[None, :, :], axis=-1)  # pairwise euclidean distance
        for dist, query in zip(euclidean_dist, docs):  # add & sort match
            query.matches = [Document(self._docs[int(idx)], copy=True, scores={'euclid': d}) for idx, d in enumerate(dist)]
            query.matches.sort(key=lambda m: m.scores['euclid'].value)  # sort matches by their values

f = Flow(port_expose=12345, protocol='http', cors=True).add(uses=CharEmbed, parallel=2).add(uses=Indexer)  # build a Flow, with 2 parallel CharEmbed, tho unnecessary
with f:
    f.post('/index', (Document(text=t.strip()) for t in open(__file__) if t.strip()))  # index all lines of _this_ file
    f.block()  # block for listening request

2️⃣ Open http://localhost:12345/docs (an extended Swagger UI) in your browser, click the /search tab and input:

{"data": [{"text": "@requests(on=something)"}]}

That is, we want to find the lines of the above code snippet that are most similar to @requests(on=something). Now click the Execute button!

Jina Swagger UI extension on visualizing neural search results

3️⃣ Not a GUI person? Let's do it in Python then! Keep the above server running and start a simple client:

from jina import Client, Document
from jina.types.request import Response


def print_matches(resp: Response):  # the callback function invoked when task is done
    for idx, d in enumerate(resp.docs[0].matches[:3]):  # print top-3 matches
        print(f'[{idx}]{d.scores["euclid"].value:2f}: "{d.text}"')


c = Client(protocol='http', port_expose=12345)  # connect to localhost:12345
c.post('/search', Document(text='request(on=something)'), on_done=print_matches)

which prints the following results:

         Client@1608[S]:connected to the gateway at localhost:12345!
[0]0.168526: "@requests(on='/index')"
[1]0.181676: "@requests(on='/search')"
[2]0.192049: "query.matches = [Document(self._docs[int(idx)], copy=True, score=d) for idx, d in enumerate(dist)]"

😔 Doesn't work? Our bad! Please report it here.

Read Tutorials

Support

Join Us

Jina is backed by Jina AI. We are actively hiring full-stack developers and solution engineers to build the next neural search ecosystem in open source.

Contributing

We welcome all kinds of contributions from the open-source community, individuals and partners. We owe our success to your active involvement.

All Contributors

Pandas-profiling - Create HTML profiling reports from pandas DataFrame objects

Pandas Profiling Logo Header

Documentation|Slack|Stack Overflow

Generates profile reports from a pandas DataFrame.

The pandas df.describe() function is great, but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column, the following statistics, if relevant for the column type, are presented in an interactive HTML report:

  • Type inference: detect the types of columns in a dataframe
  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histograms
  • Correlations highlighting highly correlated variables, with Spearman, Pearson and Kendall matrices
  • Missing values: matrix, count, heatmap and dendrogram of missing values
  • Text analysis: learn about categories (uppercase, space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
  • File and image analysis: extract file sizes, creation dates and dimensions, and scan for truncated images or those containing EXIF information

Announcements

Version v3.0.0 has been released, in which the report configuration was fully overhauled, providing a more intuitive API and fixing issues previously inherent to the global configuration.

This is the first release that adheres to the SemVer and Conventional Commits specifications.

The Spark backend is in progress: we are happy to announce that a Spark backend for generating profile reports is close to v1. Testers wanted! The Spark backend will be released as a pre-release of this package.

Support pandas-profiling

The development of pandas-profiling relies completely on contributions. If you find value in the package, we welcome you to support the project directly through GitHub Sponsors! Please help me to continue to support this package. It's extra exciting that GitHub matches your contribution for the first year.

Please find more information here:

May 9, 2021 💘


Contents: Examples | Installation | Documentation | Large datasets | Command line usage | Advanced usage | Integrations | Support | Types | How to contribute | Editor Integration | Dependencies


Examples

The following example reports give an impression of what the package can do:

Specific features:

Tutorials:

Installation

Using pip

PyPi Downloads
PyPi Monthly Downloads
PyPi Version

You can install using the pip package manager by running:

pip install pandas-profiling[notebook]

Alternatively, you could install the latest version directly from Github:

pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Using conda

Conda Downloads
Conda Version

You can install using the conda package manager by running:

conda install -c conda-forge pandas-profiling

From source

Download the source code by cloning the repository, or by pressing 'Download ZIP' on this page.

Install by navigating to the proper directory and running:

python setup.py install

Documentation

The documentation for pandas_profiling can be found here. Previous documentation is still available here.

Quick Start

Start by loading your pandas DataFrame, e.g. by using:

import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])

To generate the report, run:

profile = ProfileReport(df, title="Pandas Profiling Report")

Explore deeper

You can configure the profile report in any way you like. The example code below loads the explorative configuration file, which includes many features for text (length distribution, Unicode information), files (file size, creation time) and images (dimensions, EXIF information). If you are interested in the exact settings that are used, you can compare them with the default configuration file.

profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)

Learn more about configuring pandas-profiling on the Advanced usage page.

Jupyter notebook

We recommend generating reports interactively by using the Jupyter notebook. There are two interfaces (see animations below): through widgets and through an HTML report.

Notebook Widgets

This is achieved by simply displaying the report. In the Jupyter notebook, run:

profile.to_widgets()

The HTML report can be embedded in a Jupyter notebook:

HTML

Run the following code:

profile.to_notebook_iframe()

Saving the report

If you want to generate an HTML report file, save the ProfileReport to an object and use the to_file() function:

profile.to_file("your_report.html")

Alternatively, you can obtain the data as JSON:

# As a string
json_data = profile.to_json()

# As a file
profile.to_file("your_report.json")

Large datasets

Version 2.4 introduced minimal mode.

This is a default configuration that disables expensive computations (such as correlations and duplicate row detection).

Use the following syntax:

profile = ProfileReport(large_dataset, minimal=True)
profile.to_file("output.html")

Benchmarks are available here.

Command line usage

For standard formatted CSV files that can be read immediately by pandas, you can use the pandas_profiling executable.

Run the following for information about options and arguments:

pandas_profiling -h
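
For example, a typical invocation takes the input CSV and the output report path as positional arguments (the file names here are hypothetical; the same argument order is used by the PyCharm integration described below):

pandas_profiling data.csv report.html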

Advanced usage

A set of options is available in order to adapt the report generated.

  • title (str): Title for the report ('Pandas Profiling Report' by default).
  • pool_size (int): Number of workers in the thread pool. When set to zero, the number of available CPUs is used (0 by default).
  • progress_bar (bool): If True, pandas-profiling will display a progress bar.
  • infer_dtypes (bool): When True (default), the dtype of variables is inferred using visions, using the typeset logic (for instance, a column that has integers stored as strings will be analyzed as numerical).

For more settings, see the default configuration file and minimal configuration file.

You can find the configuration docs of the advanced usage page here.

Example

profile = df.profile_report(
    title="Pandas Profiling Report", plot={"histogram": {"bins": 8}}
)
profile.to_file("output.html")

Integrations

Great Expectations

Profiling your data is closely related to data validation: often validation rules are defined in terms of well-known statistics. For that purpose, pandas-profiling integrates with Great Expectations, a world-class open-source library that helps you to maintain data quality and improve communication about data between teams. Great Expectations allows you to create Expectations (which are basically unit tests for your data) and Data Docs (conveniently shareable HTML data reports). pandas-profiling features a method to create a suite of Expectations based on the results of your ProfileReport, which you can store and use to validate another (or future) dataset.

You can find more details on the Great Expectations integration here.

Supporting open source

Maintaining and developing the open-source code for pandas-profiling, with millions of downloads and thousands of users, would not be possible without the support of our gracious sponsors.

Lambda Labs: Lambda workstations, servers, laptops, and cloud services power engineers and researchers at Fortune 500 companies and 94% of the top 50 universities. Lambda Cloud offers 4- and 8-GPU instances starting at $1.50/hr. Pre-installed with TensorFlow, PyTorch, Ubuntu, CUDA, and cuDNN.

We would like to thank our generous Github sponsors and supporters who make pandas-profiling possible:

Martin Sotir, Brian Lee, Stephanie Rivera, abdulAziz, gramster

If you would like to be featured here, please check out more information on the Github Sponsor page.

Types

Types are a powerful abstraction for effective data analysis that goes beyond the logical data types (integer, float, etc.). pandas-profiling currently recognizes the following types: Boolean, Numerical, Date, Categorical, URL, Path, File and Image.

We have developed a type system for Python tailored for data analysis: visions. Choosing an appropriate typeset can both improve the overall expressiveness and reduce the complexity of your analysis/code. To learn more about pandas-profiling's type system, check out the default implementation here. In the meantime, user-customized summarizations and type definitions are now fully supported; if you have a specific use case in mind, raise an idea or a PR!

Contributing

Read on getting involved in the Contribution Guide.

A low-threshold place to ask questions or start contributing is the pandas-profiling Slack. Join the Slack community.

Editor Integration

PyCharm Integration

  1. Install pandas-profiling via the instructions above
  2. Locate your pandas-profiling executable
    • On macOS / Linux / BSD:

      $ which pandas_profiling
      (example) /usr/local/bin/pandas_profiling
    • On Windows:

      $ where pandas_profiling
      (example) C:\ProgramData\Anaconda3\Scripts\pandas_profiling.exe
  3. In PyCharm, go to Settings (or Preferences on macOS) > Tools > External Tools
  4. Click the + icon to add a new external tool
  5. Insert the following values
    • Name: Pandas Profiling
    • Program: the location obtained in step 2
    • Arguments: "$FilePath$" "$FileDir$/$FileNameWithoutAllExtensions$_report.html"
    • Working Directory: $ProjectFileDir$

PyCharm Integration

To use the PyCharm integration, right-click on any dataset file:

External Tools > Pandas Profiling

Other integrations

Other editor integrations may be contributed via pull requests.

Dependencies

The profile report is written in HTML and CSS, which means pandas-profiling requires a modern browser.

You need Python 3 to run this package. Other dependencies can be found in the requirements files:

Filename | Requirements
requirements.txt | Package requirements
requirements-dev.txt | Requirements for development
requirements-test.txt | Requirements for testing
setup.py | Requirements for widgets etc.