
Checkpoints in Deep Learning

Updated: Aug 5, 2021



Suppose you were to train a model on a dataset of 500,000 or more examples and, to reach good accuracy, you chose to train it for 10,000 steps. With a dataset this large, the first thought is how much time the run will consume. And what happens if some failure suddenly occurs? Days or even weeks of work turn to dust, and you have to start training from scratch and spend another week or month on it.

Isn’t that a tedious and thoroughly irritating prospect? It certainly is. Tempers flare when weeks of work are lost to some tiny fault like a power outage, a network issue, a hardware failure, an OS fault, a GPU crash or any other unexpected error. At times like this, you simply want to resume training from a particular point, or continue training from a particular state.




TRAINING COLLAPSED DUE TO UNKNOWN ERROR


For situations like this we have checkpoints, which save our progress. Checkpointing is an important mechanism for recovering quickly from such failures, reducing the overall training time and ensuring that progress is not lost.

Checkpoints are snapshots of your working model taken during training and stored in non-volatile memory. In machine learning and deep learning experiments, they are essentially what you use to save the current state of the model so that you can pick up from where you left off.

Checkpoints do not contain a description of the model architecture or any of the computations the model defines, just some basic details. Since a checkpoint is needed to resume the session, it records information such as the application state and what has been done so far, as well as what is left to be done to complete the session.
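As a quick, hedged illustration (the checkpoint path below is just a placeholder), TensorFlow lets you list what a checkpoint file actually stores, which is essentially variable names, shapes and values rather than the model's code:

import tensorflow as tf

# Inspect a checkpoint file: only variable names, shapes and values are stored,
# not the computations or the architecture definition (the path is hypothetical).
for name, shape in tf.train.list_variables('/tmp/model_dir/model.ckpt'):
    print(name, shape)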


BLOCK DIAGRAM


Given above is a block diagram explaining how checkpoints are generated during training. First, the layers required by the neural network model are built, and an initial checkpoint file is saved in the model directory as a model.ckpt file. After that, the main training begins. Training runs for the number of steps specified by the coder, and the loss is calculated roughly every 100 steps.


When the loss reaches a minimum and the model judges that overfitting would occur beyond this point, training stops and evaluation is performed on the validation dataset. Checkpoints like this can also be specified manually: the user decides after how many epochs a checkpoint should be created.


After creation and validation, the checkpoint is saved to model.ckpt at the checkpoint_path specified by the coder. Training then resumes from the step where it left off and continues until the next checkpoint occurs or the training steps end.
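A rough sketch of this save-and-resume loop, using TensorFlow's tf.train.Checkpoint and tf.train.CheckpointManager (the model, optimizer, directory and step count here are placeholders, not the exact setup from the diagram):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])   # placeholder model
optimizer = tf.keras.optimizers.Adam()

# Bundle the objects whose state should be checkpointed.
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, directory='/tmp/model_dir', max_to_keep=3)

# If a checkpoint already exists, resume from it instead of starting over.
if manager.latest_checkpoint:
    ckpt.restore(manager.latest_checkpoint)

for step in range(10000):
    # ... run one training step here ...
    if step % 100 == 0:
        manager.save(checkpoint_number=step)   # periodic checkpoint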



WHEN TO USE CHECKPOINTS


Checkpoints play a very important role when training long-running machine learning models and when there is a strong possibility of the run being interrupted. However, in some cases checkpoints add unnecessary overhead to a session, such as extra memory and storage usage and longer training time (saving checkpoints also consumes a fair amount of time). So it is advisable to create checkpoints only when the model performs best at a particular epoch.


For example, suppose we have a model training log like the following:



Here, if we save the model at epoch 49, its validation accuracy at that point is 0.6487, which is lower than at epoch 46 (where it is 0.6728). So instead of saving the model at epoch 49, we should save it at epoch 46, i.e. at the epoch where the model performs best.


So checkpoints need not be created after every epoch or at fixed intervals. A checkpoint should be created when the weights give the highest accuracy and lowest loss seen so far; that is when they need to be saved. If the weights at a given epoch do not produce the best accuracy or loss, they are not saved, i.e. no checkpoint is created, and training simply continues.


With this approach only the essential weights are saved, which saves both time and memory.
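In tf.keras this idea maps directly onto the ModelCheckpoint callback covered in the next section: with save_best_only=True and the monitor set to validation accuracy, a checkpoint is only written when val_accuracy improves. A minimal sketch, assuming tensorflow is imported as tf and a compiled model plus training and validation data already exist (the path and variable names are placeholders):

best_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath = '/tmp/best_weights.hdf5',   # hypothetical path
    monitor = 'val_accuracy',              # watch validation accuracy
    mode = 'max',                          # higher accuracy is better
    save_best_only = True,                 # overwrite only on improvement
    verbose = 1)

model.fit(x_traindata, y_traindata, epochs = 50,
          validation_data = (x_testdata, y_testdata),
          callbacks = [best_checkpoint])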



CHECKPOINT TECHNIQUES / METHODS


The checkpoint technique records the execution state so that the session can be resumed from an intermediate point, preventing processor time from being wasted.

When a session is established or training starts, the code creates a new checkpoint whenever the corresponding command is received from the main code. Checkpoints are created periodically, and all of them are stored locally in the folder specified.

Checkpoint creation varies significantly between frameworks and models. We will now look at some of the most popular checkpointing methods and how to create, save and resume training from checkpoints with them.


Keras & TensorFlow


In Keras, checkpoints are implemented using callbacks. Using callbacks we can customize the behaviour of a Keras model during training.


In TensorFlow, the same checkpoint callbacks are available through Keras (tf.keras), so let's study how both libraries approach creating and using checkpoints.


Callbacks can be used to:

  • Get a view of the internal statistics of the model during training.

  • Regularly save your model.

  • Stop and evaluate the model, as and when required.

  • And many more such things (a minimal custom-callback sketch follows this list).
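To make the callback mechanism concrete, here is a minimal custom callback (the class name is illustrative, not part of Keras); the built-in ModelCheckpoint discussed next works through the same hooks:

import tensorflow as tf

class EpochLogger(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Keras calls this at the end of every epoch; `logs` holds the
        # metrics computed so far (loss, val_loss, accuracy, ...).
        print(f"epoch {epoch}:", logs)

# Passed to training like any other callback, e.g.
# model.fit(..., callbacks = [EpochLogger()])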

The first technique for creating checkpoints uses the ModelCheckpoint callback. A tf.keras.callbacks.ModelCheckpoint() instance is created and passed to the model's .fit() function during training; it can also be passed to the .evaluate() and .predict() functions. It saves the model's weights at a frequency specified by the coder.

Checkpoint_create = tf.keras.callbacks.ModelCheckpoint(
    filepath = '/tmp/weights.hdf5',
    monitor = 'val_loss',
    verbose = 1,
    save_best_only = True,
    save_weights_only = False,
    mode = 'auto',
    save_freq = 'epoch',
    period = 1)    # deprecated alias for save_freq in older Keras versions

model.fit(x_traindata, y_traindata, batch_size = 64, epochs = 50, verbose = 'auto', validation_data = (x_testdata, y_testdata), callbacks = [Checkpoint_create])
model.evaluate(x_testdata, y_testdata, batch_size = 64, verbose = 1, callbacks = [Checkpoint_create])
model.predict(x_testdata, batch_size = 64, verbose = 0, callbacks = [Checkpoint_create])

In the ModelCheckpoint function, the arguments are as follows:

  • filepath: String argument giving the path of the file in which the weights (or the full model) will be saved, including its extension.

  • monitor: The metric to monitor during training and validation, given as a string. Commonly used choices are loss, val_loss, accuracy and val_accuracy.

  • verbose: Verbosity mode. 0 = silent, 1 = prints a message each time a checkpoint is saved.

  • save_best_only: If True, a save happens only when the model is considered the "best" so far on the monitored metric.

  • save_weights_only: If True, only the weights of the model are saved; otherwise the full model is saved.

  • mode: One of 'auto', 'min' or 'max'. With 'auto' the direction is inferred from the monitored quantity: accuracy-like quantities (e.g. val_acc) need 'max', while loss needs 'min'. The mode then decides whether the saved file should be overwritten based on the monitored quantity.

  • save_freq: Determines the frequency at which the model is saved. The default is 'epoch', i.e. a save after every epoch; an integer value saves after that many batches.

  • **kwargs: Other additional arguments; may include the deprecated period.

In the above code fragment, the weights are always saved to the same location, the weights.hdf5 file. Every time a checkpoint is created the file is overwritten, so only the most recently saved checkpoint (with save_best_only = True, the best one seen so far) remains in that file.
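If a separate file per checkpoint is preferred instead, the filepath given to ModelCheckpoint can contain formatting placeholders that Keras fills in at save time, and training can later be resumed by reloading the saved weights into the same architecture. A minimal sketch, reusing the placeholder paths and variable names from the example above:

# One file per saved checkpoint, named after the epoch and validation loss.
Checkpoint_create = tf.keras.callbacks.ModelCheckpoint(
    filepath = '/tmp/weights.{epoch:02d}-{val_loss:.2f}.hdf5',
    monitor = 'val_loss',
    save_best_only = True)

# Resuming after a failure: rebuild the same model, reload the saved weights,
# then continue training with fit() as before.
model.load_weights('/tmp/weights.hdf5')
model.fit(x_traindata, y_traindata, batch_size = 64, epochs = 50,
          validation_data = (x_testdata, y_testdata),
          callbacks = [Checkpoint_create])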