Reinforcement Learning Agents in Large-Scale AI: Efficient Training and Tighter Analysis

Date
Apr 16, 2025, 2:00 pm – 3:30 pm
Location
Zoom Meeting (See abstract for link)

Event Description

Reinforcement learning is an important area of artificial intelligence. Despite its theoretical roots, it has enjoyed great success in real-world applications. However, reinforcement learning methods are known for their delicate training and heavy demand for data, which limits their use, especially in large-scale applications. To improve their practicality, this thesis is dedicated to developing more efficient training methods and tighter theoretical analyses for reinforcement learning in large-scale settings.

First, a novel algorithm is presented for the tabular setting of infinite-horizon discounted reinforcement learning, one of the most fundamental problems in the area. The prior best algorithm achieves theoretically optimal regret at the cost of high computational and memory requirements. Empowered by a novel policy switching technique, our algorithm achieves optimality while reducing these costs and, at the same time, greatly lowers the burn-in cost, a known limitation of reinforcement learning in practice. A theoretical analysis of our algorithm and a comparison with prior art demonstrate its superiority.
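For readers unfamiliar with the setting, the following is a minimal sketch of tabular Q-learning on an infinite-horizon discounted MDP with a simple low-switching policy update rule; it is only meant to fix ideas about what "policy switching" refers to, and is not the algorithm presented in the thesis. The toy MDP, step sizes, and switching margin are all hypothetical choices.

```python
# Illustrative sketch only: tabular Q-learning on a random discounted MDP,
# where the executed policy is updated only when the greedy action improves
# on it by a fixed margin (a generic low-switching rule, not the thesis's).
import numpy as np

rng = np.random.default_rng(0)

S, A = 5, 3                                   # hypothetical small state/action spaces
gamma = 0.9                                   # discount factor
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a] is a distribution over next states
R = rng.uniform(0.0, 1.0, size=(S, A))        # rewards in [0, 1]

Q = np.zeros((S, A))             # running Q-value estimates
policy = np.zeros(S, dtype=int)  # executed policy, switched only occasionally
visits = np.zeros((S, A))        # visit counts for decaying step sizes
switches = 0                     # how often the executed policy changes

s = 0
for t in range(50_000):
    a = policy[s] if rng.random() > 0.1 else rng.integers(A)  # epsilon-greedy exploration
    s_next = rng.choice(S, p=P[s, a])
    visits[s, a] += 1
    alpha = 1.0 / visits[s, a]
    target = R[s, a] + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

    # Low-switching rule: only change the policy at state s when the greedy
    # action beats the executed one by a hypothetical margin of 0.05.
    greedy = int(Q[s].argmax())
    if Q[s, greedy] - Q[s, policy[s]] > 0.05:
        policy[s] = greedy
        switches += 1
    s = s_next

print("policy switches:", switches)
```

Keeping the number of policy switches small is what makes such schemes attractive computationally and in terms of memory, since the executed policy changes rarely even as the value estimates are updated every step.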

Next, a series of new theoretical results on deep reinforcement learning is presented. The use of deep neural networks has led to many empirical successes in reinforcement learning applications, but most prior theoretical analyses follow from classical uniform laws of large numbers and fall short of fully explaining the empirical phenomena. Our theory takes a further step toward bridging this gap and shows that deep reinforcement learning automatically adapts to any intrinsic low-dimensional structure in a Markov decision process. We develop tighter theoretical analyses for several settings, including off-policy evaluation, preference learning, and actor-critic policy gradient optimization.
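As background for one of the settings mentioned above, here is a minimal sketch of fitted Q-evaluation (FQE), a standard approach to off-policy evaluation with a neural Q-function. It is shown only to make the setting concrete, not as the analysis developed in the thesis; the logged data, network size, and target policy are hypothetical.

```python
# Minimal fitted Q-evaluation sketch: regress a neural Q-function toward
# frozen Bellman targets computed under the policy we want to evaluate.
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, num_actions, gamma = 4, 2, 0.9

# Hypothetical logged transitions (s, a, r, s') from some behavior policy.
N = 1024
s = torch.randn(N, state_dim)
a = torch.randint(num_actions, (N,))
r = torch.rand(N)
s_next = torch.randn(N, state_dim)

def target_policy(states):
    # Hypothetical deterministic target policy to evaluate.
    return (states[:, 0] > 0).long()

qnet = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
opt = torch.optim.Adam(qnet.parameters(), lr=1e-3)

for k in range(20):                      # outer FQE iterations
    with torch.no_grad():                # Bellman targets frozen within each iteration
        a_next = target_policy(s_next)
        y = r + gamma * qnet(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)
    for _ in range(50):                  # regress Q toward the frozen targets
        q = qnet(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = ((q - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

# Estimated value of the target policy at a hypothetical start-state distribution.
s0 = torch.randn(256, state_dim)
v_hat = qnet(s0).gather(1, target_policy(s0).unsqueeze(1)).mean()
print("estimated value:", float(v_hat))
```

The theoretical question in this setting is how accurately such a procedure estimates the target policy's value, and analyses that adapt to low-dimensional structure can yield tighter guarantees than worst-case bounds over the full state space.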

Lastly, we present a novel reinforcement learning from human feedback (RLHF) algorithm for post-training large language models. A dichotomy exists in prior work: empirically popular algorithms have no convergence guarantees and can fail under sparse data coverage, while theoretically motivated methods have provable guarantees but are not computationally efficient for large-scale applications. To bridge this gap, we propose, inspired by the on-average pessimism technique, the first algorithm that is both provably convergent and scalable to large language model post-training. We provide a theoretical convergence analysis and experiments on large language models.
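To illustrate why pessimism helps under sparse data coverage, the sketch below shows one generic construction: score responses with an ensemble of reward models and penalize disagreement, so the policy is not rewarded in regions the preference data barely covers. This is intuition only and is not the on-average pessimism algorithm proposed in the thesis; the ensemble, feature representation, and penalty weight are hypothetical.

```python
# Generic pessimistic reward scoring for RLHF-style post-training:
# mean ensemble reward minus a penalty for ensemble disagreement.
import torch
import torch.nn as nn

torch.manual_seed(0)
feat_dim, ensemble_size = 16, 4

# Hypothetical reward-model ensemble over fixed response features.
reward_models = [nn.Linear(feat_dim, 1) for _ in range(ensemble_size)]

def pessimistic_reward(features: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Mean ensemble reward minus beta times the ensemble standard deviation."""
    scores = torch.stack([rm(features).squeeze(-1) for rm in reward_models])  # (E, batch)
    return scores.mean(dim=0) - beta * scores.std(dim=0)

# Example: candidate responses embedded as hypothetical feature vectors.
response_features = torch.randn(8, feat_dim)
print(pessimistic_reward(response_features))
```

The design intent is that responses whose reward is uncertain receive a lower score, which guards against the failure mode where a policy exploits reward-model errors in poorly covered regions of the data.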

Co-Advisers: Yuxin Chen and Mengdi Wang

Zoom Meeting: https://princeton.zoom.us/j/96324667366