Saturday 30 November 2019

Deep Learning | Tensor flow | GPU (cuda/cuDNN) Vs CPU w and w/o EarlyStopping

In the previous post we built a neural net and tuned its hyper parameters. The hyper parameters were tuned using GridSearchCV. Here we look at modifying training the same network on GPU and compare that with training on CPU. 

Tools used
Tensorflow-gpu: 2.0.0
Keras version 2.3.1
Pandas version 0.25.3
Scikitlearn version 0.21.3

Nvidia GTX 960M 2GB
Intel i7 6700HQ 16 GB

Now, the number of epochs needed was tuned separately last time. Loss/accuracy Vs epochs curves raises the question of whether those many epochs are needed for this network and data. This has an impact on the time needed to train the network. Given the low end GPU the numbers are as expected.

Could we train the network in a lower duration with an acceptable loss of accuracy?

Early stopping is used here to answer this question. Results are shown below. Instead of going through 55 epochs, it decides to stop when the loss cannot be minimised beyond a certain point around 17-27 epochs. Early stopping parameters used is shown below.

Hyper parameter grid is small to begin with

Results (2 parameters and 3 values each)

Wall time Accuracy learning rate momentum
CPU 1 min 28 sec 0.978685 0.03 0.43
CPU Early Stopping 35.8 sec 0.973357 0.024 0.39
GPU 9 min 1 sec 0.978685 0.024 0.41
GPU Early Stopping 2 min 54 sec 0.971580 0.024 0.41

Results (2 parameters 5 values each)

Parameter grid is modified with additional ranges.

Wall time Accuracy learning rate momentum
CPU Early Stopping 1 min 35 sec 0.9822 0.03 0.39
GPU Early Stopping 9 min 37 sec 0.9751 0.033 0.41

Tuesday 19 November 2019

Deep learning | Tensor Flow | Building a neural net *Updated with hyper parameter tuning

Here we look at training a neural network for classifying UCI cancer data as benign or malignant. Project has multiple machine learning models for this purpose and is described here In this post a neural network is built using keras and tensor flow backend. Scikit-learn is used for pre-processing data. The final network with hyper parameter tuning looks like this.

Initial hyper parameters used:

1) Visualise the data: There are 36 dimensions of which 10 are not contributing to the prediction. A visualisation is already in the project here. Those 10 dimensions are removed to decrease noise.

2) Pre-process data: The data distribution is examined. For each dimension check the value distribution. The objective here is to figure out which way to pre-process the data. Neural network layers have activation functions. The common activation functions like relu work as max(0, value). Also in each epoch the weights are tweaked by a small amount. If the input is not normalized it can affect the learning. Here the data is normalized using MinMax scaler so that each dimension is between 0 and 1. The medical data renders itself to this better. Every dimension did not have following type of distribution. Figure is for cell texture.

3) Build the layers (dropouts, layer count and neurons in each layer): The decision of how many layers and how many neurons on each layer is mostly based on accuracy/loss plots for fit (see end of post) and visualisation in step 1. The model looks like this.

4) Parameter TuningUsing Grid Search CV the following parameters and hyper parameters were tuned:

a) Optimizer: SGD was selected after comparison with RMSprop and Adam.

b) Epochs: 50-55 was selected after GridSearchCV and Stratified k-fold cross validation.

c) Batch size: Initially 32 but 5 was chosen after GridSearch CV with 30, 35, 40 among other options.

d) Learning rate: initially kept at 0.027 during cross validation. This was changed after looking at loss and accuracy curves from initial training.  Finally using GridSearchCV 0.24.

e) Momentum: 0.44 initially tweaked on loss and accuracy curve plots. Final value 0.41.

A screen of the grid search with best params

Number of intermediate layers, layer neuron counts and dropouts in each layer were decided based on fit from accuracy and loss curves during training. Also each dropout layer has a different dropout ratio. 

Avoiding over fitting, under fitting and unknown fitting. Unnecessarily higher epoch count resulted in over fitting, lower epoch gave under fitting. The epoch count was increased after the intermediate layers with more neurons were added. This resulted in an acceptable fit in loss function behaviour. An example of over fitting is shown below. Notice that validation error is going up as training error is staying low towards right.