Abstract:
Document Analysis and Recognition is a prominent research area that combines the fields of Computer Vision and Machine Learning and has a great impact on the humanities by unraveling information stored in collections of historical documents all over the world. In this PhD thesis, we focus on extracting and learning visual representations capable of successfully detecting and recognizing text in handwritten documents. The main intention behind the methodologies presented in this thesis is the creation of efficient systems with minimal computational requirements, aiming towards real-time applications. Throughout the thesis, we tackle document-related problems of increasing difficulty, while the main goal is the development of an effective word detection approach by improving the extracted visual representation of text. Specifically, we explore feature extraction techniques along with possible modifications based on the specific characteristics of text images (e.g., possible text deformations). Typical handcrafted feature extraction methods are compared with visual representations generated either by manifold embedding techniques or by deep learning approaches, both of which show superior performance. An important part of this thesis is the study of Convolutional Neural Networks (CNNs) for the word detection problem along with their generalization capability, i.e., whether it is possible to generate transferable and discriminative deep features. To this end, we propose several modified architectures in order to create compact, yet well-performing, features. Furthermore, we present a novel deep learning approach that combines the spotting and recognition tasks, leading to superior performance, while we also tackle the problem of line-level spotting from a deep-features viewpoint. Finally, we address the more generic neural network compression problem, which is not limited to document-related tasks.
Specifically, we design two different approaches for model compression, both achieving significant compression with a favorable size-accuracy trade-off across different datasets and settings, including image classification and keyword spotting tasks.