HEAL DSpace

Grounded visual question answering using sequence-to-sequence modeling

DSpace/Manakin Repository

Show simple item record

dc.contributor.author Νικάνδρου, Μαρία-Βασιλική el
dc.contributor.author Nikandrou, Maria-Vasiliki en
dc.date.accessioned 2020-04-13T13:31:43Z
dc.date.available 2020-04-13T13:31:43Z
dc.identifier.uri https://dspace.lib.ntua.gr/xmlui/handle/123456789/50134
dc.identifier.uri http://dx.doi.org/10.26240/heal.ntua.17832
dc.rights Attribution-NonCommercial 3.0 Greece *
dc.rights.uri http://creativecommons.org/licenses/by-nc/3.0/gr/ *
dc.subject Απάντηση οπτικών ερωτήσεων el
dc.subject Μοντέλα ακολουθίας-σε-ακολουθία el
dc.subject Γειωμένη συλλογιστική el
dc.subject Βαθιά μάθηση el
dc.subject Πολυτροπική μάθηση el
dc.title Grounded visual question answering using sequence-to-sequence modeling en
dc.title Αυτόματη παραγωγή απαντήσεων σε ερωτήσεις πάνω σε εικόνες με χρήση βαθιάς επιβλεπόμενης μάθησης el
heal.type bachelorThesis
heal.classification Μηχανική μάθηση el
heal.language el
heal.language en
heal.access free
heal.recordProvider ntua el
heal.publicationDate 2019-07-05
heal.abstract Visual Question Answering (VQA) is a task at the intersection of Natural Language Processing and Computer Vision. Given an open-ended, natural-language question and an image, the goal is to predict the correct answer. In recent years, VQA has attracted considerable interest from the research community, as it goes beyond the preliminary step of extracting meaningful representations from each modality: it also requires the capability to reason over the inputs and infer their relations. Answering arbitrary visual questions is challenging, as it assesses a wide range of skills spanning the linguistic and visual domains. One limitation of typical modern VQA systems is that they approach the task as a classification problem over a limited set of pre-defined answers. The goal of this Diploma Thesis is to alleviate this problem by proposing a sequence-generation model that can produce answers of arbitrary length. The proposed method is a sequence-to-sequence network conditioned on textual and visual information from the question and the image, respectively. The question encoder is a bidirectional Recurrent Neural Network augmented with a self-attention mechanism, while the image feature extractor is a pretrained Convolutional Neural Network. The answer is generated by a greedy decoding process that uses a Recurrent Neural Network cell as the decoder. At each decoding step, the answer decoder attends to the visual features through a cross-modal, scaled dot-product attention mechanism. We conduct an ablation study investigating the contribution of each module. Our results indicate that the feedback loop, which gives the decoder access to the image features at each decoding step, is an effective conditioning mechanism. The proposed model is evaluated on the VQA-CP v2 dataset, which tests the capacity to reason over the visual input without relying on language-based statistical biases. Our model shows a significant improvement in prediction accuracy over baseline approaches. We also compare the proposed model against state-of-the-art VQA systems; its performance is comparable to that of existing models in the literature, demonstrating the feasibility of free-form answer generation for open-ended VQA. en
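The cross-modal, scaled dot-product attention step mentioned in the abstract can be sketched as follows. This is a minimal NumPy illustration of the general mechanism, not the thesis code: the function name, dimensions, and toy inputs are assumptions for demonstration only.

```python
import numpy as np

def cross_modal_attention(query, keys, values):
    """Scaled dot-product attention: a decoder hidden state (query)
    attends over image region features (keys/values) and returns a
    visual context vector plus the attention weights.

    Illustrative shapes: query (d,), keys (n_regions, d),
    values (n_regions, d_v).
    """
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)   # (n_regions,) similarity scores
    scores -= scores.max()               # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()             # softmax over image regions
    context = weights @ values           # weighted sum of region features
    return context, weights

# Toy example: a 4-dim decoder state attending over 3 image regions.
rng = np.random.default_rng(0)
q = rng.standard_normal(4)
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
ctx, w = cross_modal_attention(q, K, V)
```

At each decoding step the resulting context vector would be fed back into the decoder cell together with the previous output token, which is the feedback loop the ablation study identifies as an effective conditioning mechanism.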
heal.advisorName Ποταμιάνος, Αλέξανδρος el
heal.committeeMemberName Τζαφέστας, Κωνσταντίνος el
heal.committeeMemberName Σταφυλοπάτης, Ανδρέας-Γεώργιος el
heal.academicPublisher Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Τομέας Σημάτων, Ελέγχου και Ρομποτικής el
heal.academicPublisherID ntua
heal.fullTextAvailability true



