dc.contributor.author | Νικάνδρου, Μαρία-Βασιλική | el |
dc.contributor.author | Nikandrou, Maria-Vasiliki | en |
dc.date.accessioned | 2020-04-13T13:31:43Z | |
dc.date.available | 2020-04-13T13:31:43Z | |
dc.identifier.uri | https://dspace.lib.ntua.gr/xmlui/handle/123456789/50134 | |
dc.identifier.uri | http://dx.doi.org/10.26240/heal.ntua.17832 | |
dc.rights | Αναφορά Δημιουργού-Μη Εμπορική Χρήση 3.0 Ελλάδα | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc/3.0/gr/ | * |
dc.subject | Απάντηση οπτικών ερωτήσεων | el |
dc.subject | Μοντέλα ακολουθίας-σε-ακολουθία | el |
dc.subject | Γειωμένη συλλογιστική | el |
dc.subject | Βαθιά μάθηση | el |
dc.subject | Πολυτροπική μάθηση | el |
dc.title | Grounded visual question answering using sequence-to-sequence modeling | en |
dc.title | Αυτόματη παραγωγή απαντήσεων σε ερωτήσεις πάνω σε εικόνες με χρήση βαθιάς επιβλεπόμενης μάθησης | el |
heal.type | bachelorThesis | |
heal.classification | Μηχανική μάθηση | el |
heal.language | el | |
heal.language | en | |
heal.access | free | |
heal.recordProvider | ntua | el |
heal.publicationDate | 2019-07-05 | |
heal.abstract | Visual Question Answering (VQA) is a task at the intersection of Natural Language Processing and Computer Vision. Given an open-ended, natural-language question and an image, the goal is to predict the correct answer. In recent years, VQA has attracted considerable interest from the research community, as it goes beyond the preliminary step of extracting meaningful representations from each modality: it also requires the capability to reason over the inputs and to infer their relations. Answering arbitrary visual questions is challenging, as it assesses a wide range of skills spanning the linguistic and visual domains. One limitation of typical modern VQA systems is that they approach the task as a classification problem over a limited set of pre-defined answers. The goal of this Diploma Thesis is to alleviate this problem by proposing a sequence-generation model that can produce answers of arbitrary length. The proposed method is a sequence-to-sequence network conditioned on textual and visual information from the question and the image, respectively. The question encoder is a bidirectional Recurrent Neural Network augmented with a self-attention mechanism, while the image feature extractor is a pretrained Convolutional Neural Network. The answer is generated by a greedy decoding process that uses a Recurrent Neural Network cell as the decoder. At each decoding step, the answer decoder attends to the visual features through a cross-modal, scaled dot-product attention mechanism. We conduct an ablation study that investigates the contribution of each module. Our results indicate that the feedback loop, which allows access to image features at each decoding step, is an effective conditioning mechanism. The proposed model is evaluated on the VQA-CP v2 dataset, which tests the capacity to reason over the visual input without depending on language-based statistical biases. Our model significantly improves prediction accuracy over baseline approaches. We also compare the proposed model against state-of-the-art VQA systems; its performance is comparable to that of existing models in the literature, demonstrating the feasibility of free-form answer generation for open-ended VQA. | en |
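The cross-modal, scaled dot-product attention step described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the thesis implementation: the function name, toy shapes, and random inputs are ours, and the real model would use learned projections over CNN region features.

```python
import numpy as np

def scaled_dot_product_attention(query, keys, values):
    """Attend to image regions from the decoder state (illustrative sketch).

    query:  (d,)    decoder hidden state at the current decoding step
    keys:   (n, d)  image-region features used for scoring
    values: (n, d)  image-region features to be aggregated
    Returns the attended visual context vector, shape (d,).
    """
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)   # (n,) similarity, scaled by sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over the n regions
    return weights @ values              # convex combination of region features

# Toy example: 4 image regions, feature dimension 8 (made-up sizes)
rng = np.random.default_rng(0)
h_t = rng.normal(size=8)                 # decoder state at step t
img = rng.normal(size=(4, 8))            # CNN region features
context = scaled_dot_product_attention(h_t, img, img)
```

At each decoding step the resulting context vector would be fed back into the decoder cell together with the previously generated token, which is the feedback loop the ablation study identifies as an effective conditioning mechanism.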
heal.advisorName | Ποταμιάνος, Αλέξανδρος | el |
heal.committeeMemberName | Τζαφέστας, Κωνσταντίνος | el |
heal.committeeMemberName | Σταφυλοπάτης, Ανδρέας-Γεώργιος | el |
heal.academicPublisher | Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Τομέας Σημάτων, Ελέγχου και Ρομποτικής | el |
heal.academicPublisherID | ntua | |
heal.fullTextAvailability | true |