Computer vision systems are increasingly deployed in real-world applications, such as recognition models in autonomous vehicles, captioning models in presentation software, and retrieval models behind visual search engines. Building these real-world systems poses many practical challenges, and a large share of them stem from imperfections in the data. Specifically, real-world data can be biased with distracting spurious correlations, long-tailed with an unbalanced presence of different categories, noisy with numerous annotation flaws, and so on. In this thesis, we study how to tackle three common data imperfections across different vision tasks.
First, we investigate the bias issue in image classification. We introduce a new benchmark featuring controllable bias through data augmentation. We then provide a thorough comparison of existing bias mitigation methods and propose a simple approach that outperforms more complex competitors.
Second, we study the long-tail issue in image captioning. We show that, due to the long tail, existing captioning models favor common concepts and generate overly generic captions. To tackle this issue, on the evaluation side we propose a new metric that captures both uniqueness and accuracy; on the modeling side we introduce an inference-time re-ranking technique that produces more diverse and informative captions.
Finally, we tackle the noise issue in video retrieval. We demonstrate how noisy annotations complicate both model training and evaluation, and we propose to address the problem with a simple but effective multi-query approach. Through extensive experiments, we show that multi-query training leads to superior performance and that multi-query evaluation better reflects the true capabilities of retrieval models.
Zoom link: https://princeton.zoom.us/j/96300525182