After writing a simple linear regression model in numpy, I found that changing the step size / learning rate was not effective at improving the model's accuracy or its speed of convergence. Consider this standard model:
%%time
import numpy as np

x = np.arange(100)
y = 3 * x + 5                            # true weights: intercept 5, slope 3
x = np.column_stack((np.ones(100), x))   # design matrix with a bias column
w = np.zeros((100, 2))                   # every row stays identical, so w[0] is the fitted (intercept, slope)
step_size = 10**-4
iterations = 10**5
for i in range(iterations):
    loss = 2 * (np.sum(w * x, axis=1) - y)                   # 2 * (prediction - y)
    w -= step_size * np.average(x * loss[:, None], axis=0)   # mean gradient step
print(w[0])
the model provides the following output:
[4.60820653 3.00590689]
CPU times: user 5.04 s, sys: 19.9 ms, total: 5.06 s
Wall time: 5.07 s
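For reference, a quick sanity check I can run (not part of the timing above) against numpy's closed-form least-squares solver shows the weights the loop should be converging towards:
np.linalg.lstsq(x, y, rcond=None)[0]
# ≈ array([5., 3.])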
Changing the step_size variable could be considered hyperparameter optimization, but when it is increased to anything greater than 10**-4, such as 10**-3, the model fails to converge and the weights explode:
# step_size = 10**-3
[nan nan]
CPU times: user 4.98 s, sys: 38.2 ms, total: 5.02 s
Wall time: 5 s
This behavior is somewhat expected mathematically; still, it is frustrating to encounter, and it raises the question: how can the step size be optimized more effectively?
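For context, here is a small check of my own (assuming the same design matrix x as above) of the classical stability bound for gradient descent on a quadratic loss, step_size < 2 / λ_max of the Hessian (2/n)·XᵀX, which seems consistent with 10**-4 converging while 10**-3 explodes:
n = x.shape[0]
hessian = 2 / n * x.T @ x                     # Hessian of the mean squared error
lam_max = np.linalg.eigvalsh(hessian).max()   # largest eigenvalue
print(2 / lam_max)                            # ≈ 3e-4: 10**-4 is below this bound, 10**-3 is not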
Instead of changing the step size, I tried changing the coefficient applied to the gradient (I labeled the gradient variable 'loss'), since, as far as I'm aware, this also affects how large each step is: a larger loss means a steeper gradient to descend. Surprisingly, changing the coefficient of loss from 2 to 5 gives a dramatically better result (this was just what I did instead of optimizing the step size, but I guess it's a separate topic; see the equivalence sketch after the output below).
# changing loss coefficient from 2 to 5
[4.99998468 3.00000023]
CPU times: user 5.03 s, sys: 42.7 ms, total: 5.07 s
Wall time: 5.08 s
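As far as I can tell (a small sketch of my own reasoning, not from the runs above), scaling the gradient coefficient is equivalent to scaling the step size, so coefficient 5 with step_size = 10**-4 should behave exactly like coefficient 2 with step_size = 2.5 * 10**-4:
# the update only depends on the product step_size * coefficient
residual = np.sum(w * x, axis=1) - y
update_a = 10**-4 * np.average(x * (5 * residual)[:, None], axis=0)
update_b = 2.5 * 10**-4 * np.average(x * (2 * residual)[:, None], axis=0)
print(np.allclose(update_a, update_b))        # True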