Understanding local training trajectories of neural networks: optimization and feature learning perspectives

ECE PRE FPO PRESENTATION
Date
Nov 25, 2024, 1:30 pm - 2:30 pm
Location
Sherrerd Hall Room 122

Event Description

In deep neural networks, one frequently encounters loss functions that are high-dimensional, non-convex, and non-smooth. In these large-scale settings, theoretical bounds from the traditional optimization literature, which rely on black-box analyses parameterized by global worst-case constants, can be vacuous. Moreover, optimization algorithms interact with the loss geometry in complex ways: recent work has shown that the geometric characteristics of the iterates encountered during optimization depend strongly on the choice of optimization algorithm and its hyperparameters. These challenges motivate a local, algorithm-dependent trajectory analysis, rather than a global one, that tracks the training dynamics in detail.
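
To make the idea of a trajectory-level analysis concrete, here is a minimal sketch (not taken from the talk) that tracks one local geometry statistic, the top Hessian eigenvalue of the loss, at iterates of a training run using Hessian-vector products in PyTorch; the model, data, and optimizer referenced in the usage comments are placeholders.

```python
# Minimal sketch (illustrative only): tracking a local geometry statistic,
# the largest Hessian eigenvalue ("sharpness"), along a training trajectory.
import torch

def top_hessian_eigenvalue(loss, params, iters=20):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`
    via power iteration on Hessian-vector products."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    v = torch.randn_like(flat_grad)
    v = v / v.norm()
    eig = 0.0
    for _ in range(iters):
        # Hessian-vector product: differentiate (grad . v) w.r.t. the parameters.
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv])
        eig = torch.dot(v, hv).item()   # Rayleigh quotient with unit-norm v
        v = hv / (hv.norm() + 1e-12)
    return eig

# Usage inside a training loop (model, criterion, optimizer, x, y are placeholders):
# loss = criterion(model(x), y)
# sharpness = top_hessian_eigenvalue(loss, list(model.parameters()))
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```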

This talk focuses on the local trajectories of optimization algorithms in neural networks and seeks to understand their training dynamics from both the optimization and feature learning perspectives. From the optimization perspective, I will compare the local loss geometry of adaptive optimization methods (e.g., Adam) with that of non-adaptive methods, and show that adaptive methods can bias their trajectories toward regions with a certain uniformity property. I will then demonstrate that this uniformity property is a contributing factor to fast optimization. From the feature learning perspective, I will focus on the evolution of the Neural Tangent Kernel (NTK) and show that gradient descent with large learning rates tends to increase the alignment between the NTK and the target function. I will also discuss the connection between this alignment property and generalization, and present theoretical results that explain the underlying mechanism.
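
As an illustration of the alignment quantity discussed above (not code from the talk), the following sketch computes an empirical NTK for a scalar-output model on a small batch and its kernel-target alignment with the labels; `model`, `x_batch`, and `y_batch` are assumed placeholders.

```python
# Minimal sketch (illustrative only): empirical NTK and kernel-target alignment.
import torch

def empirical_ntk(model, x):
    """K[i, j] = inner product of parameter gradients of f(x_i) and f(x_j),
    assuming `model` has a scalar output."""
    params = [p for p in model.parameters() if p.requires_grad]
    rows = []
    for xi in x:
        out = model(xi.unsqueeze(0)).squeeze()
        grads = torch.autograd.grad(out, params)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    jac = torch.stack(rows)          # shape (n, num_params)
    return jac @ jac.T               # shape (n, n) empirical NTK

def kernel_target_alignment(K, y):
    """Cosine similarity <K, y y^T>_F / (||K||_F * ||y y^T||_F)."""
    yy = torch.outer(y, y)
    return (K * yy).sum() / (K.norm() * yy.norm())

# Usage (model and batch are placeholders):
# K = empirical_ntk(model, x_batch)
# alignment = kernel_target_alignment(K, y_batch)
```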

Finally, I will conclude by discussing an ongoing project that studies the mean-field parameterization of deep linear residual networks with high-dimensional input. The goal is to understand hyperparameter transfer when both the model and the input data are scaled.
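
For context, here is a minimal sketch of a deep linear residual network with a 1/depth scaling on the residual branches, one common choice in the mean-field / hyperparameter-transfer literature; the exact parameterization and scaling studied in the ongoing project may differ, and all dimensions below are placeholders.

```python
# Minimal sketch (illustrative only): deep linear residual network with
# depth-scaled residual branches.
import torch
import torch.nn as nn

class DeepLinearResNet(nn.Module):
    """x <- x + (1/L) * W_l x at each of L layers (no biases, no nonlinearity)."""
    def __init__(self, dim, depth):
        super().__init__()
        self.depth = depth
        self.layers = nn.ModuleList(
            nn.Linear(dim, dim, bias=False) for _ in range(depth)
        )

    def forward(self, x):
        for layer in self.layers:
            # 1/depth scaling keeps the forward pass well behaved as depth grows.
            x = x + layer(x) / self.depth
        return x

# Usage with high-dimensional input (sizes are placeholders):
# net = DeepLinearResNet(dim=1024, depth=64)
# out = net(torch.randn(32, 1024))
```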

Adviser: Boris Hanin