We study the properties of low-rank Markov decision processes (MDPs), whose transition probabilities admit the low-rank decomposition $P(s'|s,a)=\phi(s,a)^\top\mu(s')$. We discuss three related topics.
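As a concrete illustration of this factorization, the sketch below builds a valid low-rank transition kernel in NumPy. The sizes `S`, `A`, `d` and the Dirichlet construction are hypothetical choices for the example, not part of the setting above; they simply guarantee that each row of the resulting kernel is a probability distribution while the rank stays at most $d$.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d = 6, 3, 2  # hypothetical sizes: states, actions, feature dimension

# phi(s, a): one d-dim feature vector per (s, a) pair, each on the simplex
Phi = rng.dirichlet(np.ones(d), size=S * A)   # shape (S*A, d)

# mu(s'): one d-dim vector per next state; each latent factor k defines a
# distribution over next states, so each row of Factors sums to 1
Factors = rng.dirichlet(np.ones(S), size=d)   # shape (d, S)
Mu = Factors.T                                # shape (S, d), row s' is mu(s')

# P(s'|s, a) = phi(s, a)^T mu(s'): an (S*A) x S matrix of rank at most d
P = Phi @ Mu.T

assert np.allclose(P.sum(axis=1), 1.0)        # every row is a distribution
assert np.linalg.matrix_rank(P) <= d
```

Because $\phi(s,a)$ mixes over $d$ latent factors and each factor is itself a distribution over next states, the product is automatically a stochastic matrix of rank at most $d$.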
First, we study the off-policy policy gradient (PG) estimation problem, an important topic for policy-based reinforcement learning. Conventional methods for off-policy PG estimation often suffer from either significant bias or exponentially large variance. We propose a new estimator based on fitted iteration, which exploits the low-rank structure of the MDP and works with an arbitrary policy parameterization. We provide a tight finite-sample upper bound on the policy gradient estimation error, governed by the amount of distribution mismatch measured in the feature space; our bound can be viewed as an improvement over existing results.
Next, we study the model estimation problem for low-rank MDPs. While a low-rank MDP can be estimated using low-rank matrix estimation methods, that approach fails to capture some special structure of the MDP. We propose a tensor-based estimator, which includes the matrix approach as a special case and yields sharper bounds on the estimation error in certain settings.
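To see why a tensor view can carry more structure than the matrix view, consider the following hedged sketch (not the estimator proposed above): the transition kernel is a three-way tensor $T[s,a,s']$, and a rank-$r$ CP structure is assumed purely for illustration. Flattening $(s,a)$ into one index still gives a low-rank matrix, but the matrix view discards the fact that the $(s,a)$ factor is itself a product of separate state and action factors.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, r = 5, 4, 2  # hypothetical sizes; r is the assumed CP rank

# Rank-r CP tensor: T[s, a, p] = sum_k U[s, k] * V[a, k] * W[p, k]
U = rng.random((S, r))
V = rng.random((A, r))
W = rng.random((S, r))
T = np.einsum('sk,ak,pk->sap', U, V, W)  # shape (S, A, S)

# The matrix view flattens (s, a) into a single index and sees rank <= r ...
M = T.reshape(S * A, S)
assert np.linalg.matrix_rank(M) <= r

# ... but it ignores that the row factor is the (row-wise) Khatri-Rao
# product of U and V, i.e. (S + A) * r parameters rather than S * A * r.
```

The parameter count in the comment is the intuition behind sharper tensor-based bounds: the tensor format constrains the flattened factor far beyond generic low-rankness.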
Last, we study the representation learning problem for general-sum low-rank Markov games, a natural extension of low-rank MDPs. We leverage representation learning and present a model-based and a model-free approach to constructing an effective representation from the collected data. For both approaches, the algorithm achieves a sample complexity of $\mathrm{poly}(H, d, A, 1/\varepsilon)$, where $H$ is the game horizon, $d$ is the dimension of the feature vectors, $A$ is the size of the joint action space, and $\varepsilon$ is the optimality gap.