I'm trying to add L2 regularization to my MNIST digits NN classifier, which I've built using numpy and vanilla Python. I'm currently using sigmoid activations with a cross-entropy cost function.
Without using the regularizer, I get 97% accuracy.
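For reference, the cost I'm trying to minimize once regularization is added is the mean cross-entropy plus an L2 penalty, whose derivative is the (lambd / m) * w term in my code below. A minimal sketch of what I mean (the function and its arguments are illustrative, not lifted from my notebook):

import numpy as np

def regularized_cost(a_out, y, weights, lambd):
    # Mean (binary) cross-entropy over the batch, plus an L2 penalty of
    # (lambd / 2m) * sum of squared weights; differentiating the penalty
    # gives the (lambd / m) * w term added to each weight gradient.
    m = y.shape[0]
    cross_entropy = -np.sum(y * np.log(a_out) + (1 - y) * np.log(1 - a_out)) / m
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(w ** 2) for w in weights)
    return cross_entropy + l2_penalty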
However, once I add the regularizer, I'm only getting about 11% accuracy, despite playing around with different hyperparameters. I've tried different learning rates (.001, .1, 1) and different lambd values (.5, .8, 1.0, 2.0, etc.).
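Concretely, the search has looked roughly like this (Network, train, and evaluate are stand-ins for my own class and training/accuracy routines, not exact code from the notebook):

for lr in [0.001, 0.1, 1.0]:
    for lambd in [0.5, 0.8, 1.0, 2.0]:
        net = Network(layer_sizes)                # fresh model per run
        train(net, x_train, y_train, lr, lambd)   # SGD using the gradients below
        print(lr, lambd, evaluate(net, x_test, y_test))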
I can't seem to figure out what mistake I'm making. I feel like I'm missing a step, maybe?
The only changes I've made are to the derivatives of the weights, where I add the (lambd / x.shape[0]) * w regularization term. I've implemented the gradients as follows:
def calculate_gradients(self, x, y, lambd):
    '''Calculate all gradients with respect to
    cost. Here our cost function is cross-entropy.
    last_layer_z_error = dC/dZ (z is the logit).
    All weight gradients also include regularization gradients.
    x.shape[0] = number of samples.
    '''
    ##### First we calculate the output layer gradients #########
    gradients, activations, zs = self.gather_backprop_data(x, y)

    # gradient of cost with respect to Z of the last layer
    last_layer_z_error = activations[-1] - y

    # updating the weight derivatives of the final layer
    gradients['w' + str(self.num_layers - 1)] = \
        np.dot(activations[-2].T, last_layer_z_error) / x.shape[0] \
        + (lambd / x.shape[0]) * self.parameters['w' + str(self.num_layers - 1)]
    gradients['b' + str(self.num_layers - 1)] = np.mean(last_layer_z_error, axis=0)
    gradients['b' + str(self.num_layers - 1)] = \
        np.expand_dims(gradients['b' + str(self.num_layers - 1)], 0)

    ### HIDDEN LAYER GRADIENTS ###
    z_previous_layer = last_layer_z_error
    for i in reversed(range(1, self.num_layers - 1)):
        # propagate the error back through layer i+1, then through the
        # sigmoid of layer i
        z_previous_layer = np.dot(z_previous_layer, self.parameters['w' + str(i + 1)].T) \
            * sigmoid_derivative(zs[i - 1])
        gradients['w' + str(i)] = \
            np.dot(activations[i - 1].T, z_previous_layer) / x.shape[0] \
            + (lambd / x.shape[0]) * self.parameters['w' + str(i)]
        gradients['b' + str(i)] = np.mean(z_previous_layer, axis=0)
        gradients['b' + str(i)] = np.expand_dims(gradients['b' + str(i)], 0)
    return gradients
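If it helps, this is the sort of finite-difference gradient check I've been meaning to run against calculate_gradients (a sketch only; self.total_cost is a hypothetical method that would return the regularized cost, including the L2 penalty):

def gradient_check(self, x, y, lambd, eps=1e-5):
    '''Compare the analytic gradients against central finite
    differences of the regularized cost.'''
    analytic = self.calculate_gradients(x, y, lambd)
    for name, param in self.parameters.items():
        numeric = np.zeros_like(param)
        it = np.nditer(param, flags=['multi_index'])
        for _ in it:
            idx = it.multi_index
            saved = param[idx]
            param[idx] = saved + eps
            cost_plus = self.total_cost(x, y, lambd)    # hypothetical method
            param[idx] = saved - eps
            cost_minus = self.total_cost(x, y, lambd)
            param[idx] = saved                          # restore the weight
            numeric[idx] = (cost_plus - cost_minus) / (2 * eps)
        rel_error = np.linalg.norm(analytic[name] - numeric) / \
            (np.linalg.norm(analytic[name]) + np.linalg.norm(numeric) + 1e-12)
        print(name, rel_error)  # roughly 1e-7 or less when the gradients match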
I've uploaded the entire notebook to GitHub, in case the rest of the code is needed: