I am looking to run a very large grid search over different neural network configurations. In its full form this would be impractical to run on my current hardware. I am aware that there may be superior techniques to a naive grid search (e.g. random search, Bayesian optimization); however, my question is about what reasonable assumptions we can make about what to include in the first place. Specifically, in my case I am looking to run a grid search over
- A: number of hidden layers
- B: size of hidden layer
- C: activation function
- D: L1
- E: L2
- F: dropout
One idea I had is to (1) identify a network configuration c by running a grid search on A-C, (2) select the c with the lowest error (e.g. MSE, measured against the test set), and (3) run the network with configuration c through a separate grid search on D-F to identify the most appropriate regularization strategy.
Is this a sensible approach to take in this case or could I, in theory, get a lower final error (i.e. after regularization) by using a network configuration that showed a higher error in the first grid search (i.e. A-C)?
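For concreteness, the staged search in steps (1)-(3) can be sketched as follows. This is a minimal illustration, not a full training pipeline: `evaluate` is a hypothetical stand-in for "train the network with this configuration and return its validation MSE", and here it is stubbed with a toy scoring function so the example runs.

```python
import itertools

def evaluate(config):
    # Toy stand-in for training + scoring a network; real code would
    # build, fit, and evaluate a model and return its MSE.
    score = abs(config.get("n_layers", 2) - 2)
    score += abs(config.get("layer_size", 64) - 64) / 64
    score += 0.0 if config.get("activation", "relu") == "relu" else 0.5
    score += config.get("l1", 0.0)
    score += abs(config.get("l2", 1e-4) - 1e-4)
    score += abs(config.get("dropout", 0.1) - 0.1)
    return score

def grid_search(base_config, grid):
    """Exhaustively search `grid` (dict of name -> candidate values),
    holding everything in `base_config` fixed."""
    best_config, best_score = None, float("inf")
    for values in itertools.product(*grid.values()):
        config = {**base_config, **dict(zip(grid.keys(), values))}
        score = evaluate(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Stage 1: search the architecture (A-C).
arch_grid = {
    "n_layers": [1, 2, 3],
    "layer_size": [32, 64, 128],
    "activation": ["relu", "tanh"],
}
c, _ = grid_search({}, arch_grid)

# Stage 2: search the regularization (D-F) with architecture c held fixed.
reg_grid = {
    "l1": [0.0, 1e-4],
    "l2": [0.0, 1e-4],
    "dropout": [0.0, 0.1, 0.3],
}
final_config, final_score = grid_search(c, reg_grid)
```

Note that stage 2 only ever sees the single architecture chosen in stage 1, which is exactly the assumption my question is about.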
What you describe is a reasonable approach. It is analogous to the so-called greedy forward feature selection method; in your case the search is over model hyperparameters instead of features.
The idea is valid and widely used in practice. No matter how powerful your hardware is, it will never be powerful enough to try every possible combination, of which there are effectively infinitely many.
However, there is no guarantee that the best configuration from the first grid search will be the best overall, because regularization can interact with architecture. As you said, you could get a lower final error by using a network configuration that had a higher error in the first grid search. In practice, though, the difference should not be large.
I would suggest starting with the fundamental parameters, such as the learning rate and the optimizer. Their effect should be much larger than that of parameters such as the activation function or the number of hidden layers (assuming you are comparing networks that differ by only 1-2 layers, not a single layer against a very deep network). Once you have found the best configuration, try out the important parameters (learning rate, optimizer) once again while keeping the rest of the found configuration fixed.
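The fundamentals-first loop above can be sketched like this. Again `evaluate` is a toy stand-in for real training (all parameter values and the scoring are illustrative, not recommendations):

```python
import itertools

def evaluate(config):
    # Toy stand-in for "train with this config, return validation error".
    score = {"adam": 0.0, "sgd": 0.3}[config["optimizer"]]
    score += abs(config["lr"] - 1e-3) * 100
    score += abs(config["n_layers"] - 2)
    return score

def best_over(grid, fixed):
    """Grid-search `grid`, holding the parameters in `fixed` constant."""
    candidates = [dict(zip(grid, v), **fixed)
                  for v in itertools.product(*grid.values())]
    return min(candidates, key=evaluate)

# Pass 1: tune the fundamentals first, with a reasonable default architecture.
fundamentals = best_over({"lr": [1e-2, 1e-3, 1e-4],
                          "optimizer": ["adam", "sgd"]},
                         fixed={"n_layers": 2})
# Pass 2: tune the architecture, fundamentals held fixed.
arch = best_over({"n_layers": [1, 2, 3]},
                 fixed={"lr": fundamentals["lr"],
                        "optimizer": fundamentals["optimizer"]})
# Pass 3: re-check the fundamentals against the chosen architecture.
final = best_over({"lr": [1e-2, 1e-3, 1e-4],
                   "optimizer": ["adam", "sgd"]},
                  fixed={"n_layers": arch["n_layers"]})
```

Each pass is cheap compared to the full Cartesian product, and pass 3 is the "try out the important ones once again" step.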