Safe Reinforcement Learning and Constrained Learning for Dynamical Systems

Jul 27, 2022, 2:10 pm–3:30 pm



Event Description

Designing control policies for autonomous systems such as self-driving cars is complex. To this end, researchers increasingly use reinforcement learning (RL) to design policies. However, guaranteeing safe operation during real-world training and deployment remains an open problem, which is of vital importance for safety-critical systems. In addition, current RL approaches require accurate simulators (models) to learn policies, and such simulators are rarely available in real-world applications. This thesis introduces a safe RL framework that provides safety guarantees and develops a constrained learning approach that learns system dynamics.

We develop a safe RL algorithm that optimizes task rewards while satisfying safety constraints. We then consider a variant of the safe RL problem in which a baseline policy is provided. The baseline policy can arise from demonstration data and may provide useful cues for learning, but it is not guaranteed to satisfy the safety constraints. We propose a policy optimization algorithm to solve this problem. In addition, we apply a safe RL algorithm to legged locomotion to demonstrate its real-world applicability. We propose an algorithm that switches between a safe recovery policy, which keeps the robot away from unsafe states, and a learner policy, which is optimized to complete the task. We further exploit knowledge of the system dynamics to determine when to switch between the policies. The results suggest that we can learn legged locomotion skills in the real world without falling.

We then revisit the assumption of known system dynamics and develop a method that performs system identification from observations. Knowing the parameters of the system improves the quality of simulation and hence minimizes unexpected behavior of the policy. Finally, while safe RL holds great promise for many applications, current approaches require domain expertise to specify constraints.
We thus introduce a new benchmark with constraints specified in free-form text. We develop a model that can interpret and adhere to such textual constraints. We show that the method achieves higher rewards and fewer constraint violations than baselines.
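The objective described above, maximizing task reward while keeping an expected safety cost below a limit, is commonly cast as a constrained Markov decision process and handled with a Lagrangian relaxation. The abstract does not name the specific algorithm used, so the following is only a minimal sketch of that generic technique; the function name, learning rate, cost limit, and per-iteration costs are all hypothetical:

```python
# Sketch of the Lagrangian relaxation often used in safe RL (hypothetical,
# not the thesis's specific algorithm):
#   maximize  E[reward]  subject to  E[cost] <= d
# is relaxed to  L(theta, lam) = E[reward] - lam * (E[cost] - d),
# with dual ascent on the non-negative multiplier lam.

def dual_ascent_step(lam, avg_cost, cost_limit, lr=0.05):
    """One projected gradient-ascent step on the Lagrange multiplier.

    If the measured cost exceeds the limit, lam grows, so the next
    policy update penalizes unsafe behavior more strongly; if the
    policy is safely below the limit, lam decays toward zero.
    """
    lam = lam + lr * (avg_cost - cost_limit)
    return max(lam, 0.0)  # project back onto lam >= 0

# Toy run with made-up per-iteration average costs against a limit of 1.0.
lam = 0.0
for avg_cost in [1.5, 1.2, 0.9, 0.7]:
    lam = dual_ascent_step(lam, avg_cost, cost_limit=1.0)
    print(lam)
```

In this sketch the policy parameters themselves would be updated between multiplier steps (e.g., by any policy-gradient method on the relaxed objective); only the dual update is shown here.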

