Deep generative models (DGMs) are deep neural networks capable of modeling high-dimensional probability distributions and generating random samples. Some applications of DGMs involve inherently discrete components, which drives the need to model discrete random variables; examples include structure learning, generative text modeling, and control with discrete or integer variables. Discreteness raises fundamental questions for DGMs. How can gradients be computed through discrete random variables? How can discrete modeling and prediction be performed at scale? We therefore study the principles and applications of discrete DGMs.
We investigate the essential properties of reparameterization for discrete DGMs and propose a new method with lower variance. Reparameterization is a gradient estimation method for random variables modeled by a DGM. For discrete random variables it is challenging because the gradient estimates have high variance. Inspired by the essential properties of Straight-Through Gumbel-Softmax estimators, we propose the Gapped Straight-Through estimator, which reduces the variance without incurring resampling overhead.
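For orientation, the Straight-Through Gumbel-Softmax baseline mentioned above can be sketched as follows. This is the standard baseline estimator, not the Gapped Straight-Through estimator, and it is a minimal sketch under stated assumptions: the function names, the temperature `tau`, and the example logits are all illustrative choices and are not taken from this work.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def st_gumbel_softmax_forward(logits, tau=1.0, rng=random):
    """Sample a one-hot vector; also return the relaxed sample used in backward."""
    # Gumbel(0, 1) noise via -log(-log(U)); U is clamped away from 0 to avoid log(0).
    noise = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    y_soft = softmax([(l + g) / tau for l, g in zip(logits, noise)])
    k = max(range(len(y_soft)), key=y_soft.__getitem__)
    one_hot = [1.0 if i == k else 0.0 for i in range(len(y_soft))]
    return one_hot, y_soft


def st_gumbel_softmax_backward(y_soft, grad_out, tau=1.0):
    """Straight-through backward pass: pretend the output was y_soft.

    Uses the softmax Jacobian d y_i / d logit_j = y_i * (delta_ij - y_j) / tau,
    which is a biased surrogate for the gradient of the discrete sample.
    """
    dot = sum(g * y for g, y in zip(grad_out, y_soft))
    return [y * (g - dot) / tau for g, y in zip(grad_out, y_soft)]


# Illustrative usage with hypothetical logits and upstream gradient.
rng = random.Random(0)
one_hot, y_soft = st_gumbel_softmax_forward([1.0, 0.5, -0.5], tau=0.5, rng=rng)
grad_logits = st_gumbel_softmax_backward(y_soft, grad_out=[1.0, 0.0, 0.0], tau=0.5)
```

The forward pass emits a discrete one-hot sample, while the backward pass reuses the Jacobian of the relaxed sample; the randomness of that relaxed sample is one source of the variance discussed above.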
We also present an application of discrete reparameterization in Reinforcement Learning (RL) for power system control where the control variables are integers. We contribute to this application in two aspects: an RL environment for power systems and an RL algorithm with an integer reparameterization scheme. The environment construction identifies the practical choices of the system. An open-source package for this environment has been released and used in the power research community. The RL algorithm for power systems includes a DDPG-style policy gradient and a reparameterization for integer actions. A US patent application has been filed based on the RL algorithm for power systems.
Lastly, we explore large-scale generative text modeling from a kernelized perspective of Transformers. Relative positional embedding (RPE) has been essential for Transformers to perform well on long sequences, yet a theoretical framework for RPE is still lacking. We therefore formulate a kernelized version of RPE through conditionally positive definite (CPD) kernels. The diversity of CPD kernels allows us to derive a variety of RPEs that enable length extrapolation (train short, test long). Experiments demonstrate that the logarithmic variant achieves excellent extrapolation on three large language modeling datasets.
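To make the idea of a logarithmic RPE concrete, the sketch below builds a relative positional bias that depends only on the distance |m - n| and is added to the pre-softmax attention scores. The functional form -r1 * log(1 + r2 * |m - n|) is an assumption made for illustration, since the text above does not spell out the formula; the names `log_rpe_bias`, `add_bias`, `r1`, and `r2` are likewise hypothetical.

```python
import math


def log_rpe_bias(seq_len, r1=1.0, r2=1.0):
    """Bias matrix b[m][n] = -r1 * log(1 + r2 * |m - n|), with r1, r2 > 0 (assumed form)."""
    if r1 <= 0 or r2 <= 0:
        raise ValueError("r1 and r2 must be positive")
    return [[-r1 * math.log(1.0 + r2 * abs(m - n)) for n in range(seq_len)]
            for m in range(seq_len)]


def add_bias(scores, bias):
    """Add the positional bias to pre-softmax attention scores, entry by entry."""
    return [[s + b for s, b in zip(s_row, b_row)]
            for s_row, b_row in zip(scores, bias)]


# The same two parameters define the bias at any sequence length.
short_bias = log_rpe_bias(4)   # e.g. a training length
long_bias = log_rpe_bias(8)    # e.g. a longer test length
```

Because the bias is a function of the relative distance alone, it is defined for sequence lengths never seen in training, which is the property that length extrapolation relies on.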