Several selected optimization algorithms are analyzed with respect to their ability to perform automatic hyperparameter optimization of deep neural networks (DNNs). This is done by treating the DNN as an expensive-to-evaluate black-box function.
Deep learning models, including DNNs, have recently seen a surge of practical use in data-intensive applications ranging from computer vision and language modelling to bioinformatics and search engines. As the performance of a DNN is typically highly reliant on a good, situation-specific choice of hyperparameters, the design phase of constructing a DNN model becomes critical, especially for very large models.
A commonly employed naive technique for finding suitable hyperparameters is manual search, which relies heavily on the user's expertise and understanding of the problem. Grid search and random search are also common, but quickly become infeasible for high-dimensional inputs and expensive model evaluations. Instead, by treating the DNN as an expensive-to-evaluate black-box function, mapping a set of hyperparameters to some quality metric, techniques from the field of optimization may be employed.
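As a concrete illustration of this black-box view, the sketch below shows an objective function mapping a hyperparameter configuration to a scalar quality metric. In the actual experiments each call would construct and train a DNN and return e.g. its validation error; here a cheap synthetic loss surface stands in so the sketch stays runnable, and the hyperparameter names are illustrative assumptions rather than the exact interface used.

    import math

    def objective(hyperparams):
        # Black-box function: hyperparameter configuration in, quality
        # metric out. In the experiments this call costs minutes of GPU
        # time; a synthetic surrogate is used here so the example runs
        # instantly.
        lr = hyperparams["learning_rate"]
        n_hidden = hyperparams["n_hidden"]
        # Toy loss surface with a minimum near lr = 1e-3, n_hidden = 128.
        return (math.log10(lr) + 3.0) ** 2 + ((n_hidden - 128) / 64.0) ** 2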
In this work we compare four different optimization algorithms side by side on the basis of convergence speed, trial-to-trial variability, quality of the best found solution, and ability to generalize across different problem settings. One experiment consists of running approximately 200 function evaluations, where each evaluation constructs and trains a DNN with a specific hyperparameter configuration; the experiment is then repeated several times per algorithm in order to estimate performance variability.
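A schematic of this experimental protocol might look as follows, using random search (mentioned above) as a stand-in for the compared algorithms, which are not named here; the search ranges and the number of repetitions are illustrative assumptions.

    import random

    def run_experiment(objective, n_evals=200, seed=0):
        # One experiment: roughly 200 evaluations of the black box,
        # recording the best value found so far for convergence curves.
        rng = random.Random(seed)
        best = float("inf")
        trace = []
        for _ in range(n_evals):
            config = {
                "learning_rate": 10 ** rng.uniform(-5, -1),
                "n_hidden": rng.randrange(16, 513),
            }
            best = min(best, objective(config))
            trace.append(best)
        return trace

    # Repeating the experiment gives a distribution over outcomes, from
    # which trial-to-trial variability can be estimated.
    traces = [run_experiment(objective, seed=s) for s in range(5)]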
TensorFlow r1.4 with NVIDIA GPU support is used for creating and training the neural network model, providing a significant speed-up compared to CPU computation. As a single function evaluation typically consumes about 5-10 minutes of GPU time, being granted access to external GPU computing resources would allow us to run more repetitions of each experiment, improving the quality of the estimated distribution of algorithmic performance, and to extend our study to new problem settings.
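For reference, constructing a model from a hyperparameter configuration in the TensorFlow 1.x API used here could look roughly like the following; the layer structure, sizes, and names are illustrative assumptions, not the exact architecture of the experiments.

    import tensorflow as tf  # TensorFlow 1.x API, as in r1.4

    def build_dnn(x, n_layers, n_hidden, n_classes):
        # Stack of fully connected ReLU layers whose depth and width come
        # from the hyperparameter configuration under evaluation.
        h = x
        for i in range(n_layers):
            h = tf.layers.dense(h, n_hidden, activation=tf.nn.relu,
                                name="hidden_%d" % i)
        # Linear output layer producing class logits.
        return tf.layers.dense(h, n_classes, name="logits")

    x = tf.placeholder(tf.float32, shape=[None, 784])  # e.g. flattened images
    logits = build_dnn(x, n_layers=3, n_hidden=128, n_classes=10)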