Neural networks are machine learning models whose original design was loosely inspired by the structure of networks of neurons in the human brain. Due to recent technological advances that have enabled fast computation on larger models and more training data, neural networks have found applications in a growing number of scientific areas, such as computer vision, natural language processing, and medical imaging. Despite the practical success of training these models with stochastic gradient descent (SGD) or its variants, establishing a proper theoretical foundation for their learning mechanism remains an active area of research and a central challenge in machine learning. In this dissertation, with the goal of constructing a theoretical underpinning for these models, we focus on the main characteristic of neural networks that distinguishes them from other learning models: their multilevel, hierarchical architecture. Based on ideas and tools from information theory, high-dimensional probability, and statistical physics, we present a new perspective on designing the architecture of neural networks and their training procedure, along with theoretical guarantees. The training procedure is multiscale in nature, takes into account the hierarchical architecture of these models, and is characteristically different both from SGD and its extensions, which treat the whole network as a single block, and from classical layer-wise training procedures. By extending the chaining technique of high-dimensional probability to an algorithm-dependent setting, we introduce the notion of multiscale-entropic regularization of neural networks. We show that the minimizing distribution of such a regularization can be characterized precisely via a procedure analogous to the renormalization group of statistical physics.
Then, motivated by the fact that renormalization group theory rests on the notion of self-similarity, we identify an inherent type of self-similarity in neural networks with near-linear activation functions. This self-similarity is then exploited to efficiently simulate an approximation to the minimizing distribution of the multiscale-entropic regularization, which serves as the training procedure. Our results can also be viewed as a multiscale extension of the celebrated Gibbs–Boltzmann distribution and the maximum entropy results of Jaynes (1957), and as a Bayesian variant of the renormalization group procedure.
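As a point of reference for the single-scale case that the multiscale construction generalizes (the notation below is ours for illustration, not the dissertation's), recall the classical entropy-regularized objective over distributions $\rho$ on network weights $w$, with loss $L$, prior $\pi$, and inverse temperature $\beta$:

```latex
\rho^\star \;=\; \arg\min_{\rho}\;
  \mathbb{E}_{w \sim \rho}\bigl[L(w)\bigr]
  \;+\; \frac{1}{\beta}\,\mathrm{KL}\bigl(\rho \,\|\, \pi\bigr),
\qquad
\rho^\star(\mathrm{d}w) \;\propto\; e^{-\beta L(w)}\,\pi(\mathrm{d}w).
```

The minimizer is the Gibbs–Boltzmann distribution; this is the sense in which the multiscale-entropic regularizer and its renormalization-group-style characterization extend the classical maximum-entropy result.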