dc.contributor.author | Νικάνδρου, Μαρία-Βασιλική | el |
dc.contributor.author | Nikandrou, Maria-Vasiliki | en |
dc.date.accessioned | 2020-04-13T13:31:43Z | |
dc.date.available | 2020-04-13T13:31:43Z | |
dc.identifier.uri | https://dspace.lib.ntua.gr/xmlui/handle/123456789/50134 | |
dc.identifier.uri | http://dx.doi.org/10.26240/heal.ntua.17832 | |
dc.rights | Αναφορά Δημιουργού-Μη Εμπορική Χρήση 3.0 Ελλάδα | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc/3.0/gr/ | * |
dc.subject | Απάντηση οπτικών ερωτήσεων | el |
dc.subject | Μοντέλα ακολουθίας-σε-ακολουθία | el |
dc.subject | Γειωμένη συλλογιστική | el |
dc.subject | Βαθιά μάθηση | el |
dc.subject | Πολυτροπική μάθηση | el |
dc.title | Grounded visual question answering using sequence-to-sequence modeling | en |
dc.title | Αυτόματη παραγωγή απαντήσεων σε ερωτήσεις πάνω σε εικόνες με χρήση βαθιάς επιβλεπόμενης μάθησης | el |
heal.type | bachelorThesis | |
heal.classification | Μηχανική μάθηση | el |
heal.language | el | |
heal.language | en | |
heal.access | free | |
heal.recordProvider | ntua | el |
heal.publicationDate | 2019-07-05 | |
heal.abstract | Visual Question Answering (VQA) is a task at the intersection of Natural Language Processing and Computer Vision. Given an open-ended, natural-language question and an image, the goal is to predict the correct answer. In recent years, VQA has attracted considerable interest from the research community, as it goes beyond the preliminary step of extracting meaningful representations from each modality: it also requires the capability to reason over the inputs and to infer their relations. Answering arbitrary visual questions is challenging, as it assesses a wide range of skills spanning the linguistic and visual domains. One limitation of typical modern VQA systems is that they approach the task as a classification problem over a limited set of pre-defined answers. The goal of this Diploma Thesis is to alleviate this problem by proposing a sequence-generation model that can produce answers of arbitrary length. The proposed method is a sequence-to-sequence network conditioned on textual and visual information from the question and the image, respectively. The question encoder is a bidirectional Recurrent Neural Network augmented with a self-attention mechanism, while the image feature extractor is a pretrained Convolutional Neural Network. The answer is generated by a greedy decoding process that uses a Recurrent Neural Network cell as the decoder. At each decoding step, the answer decoder attends to the visual features through a cross-modal, scaled dot-product attention mechanism. We conduct an ablation study that investigates the contribution of each module. Our results indicate that the feedback loop, which allows access to image features at each decoding step, is an effective conditioning mechanism. The proposed model is evaluated on the VQA-CP v2 dataset, which tests the capacity to reason over the visual input without depending on language-based statistical biases. Our model significantly improves prediction accuracy over baseline approaches. We also compare the proposed model against state-of-the-art VQA systems; its performance is comparable to that of existing models in the literature, demonstrating the feasibility of free-form answer generation for open-ended VQA. | en |
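The cross-modal, scaled dot-product attention step described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the thesis implementation: the function name, toy shapes, and random inputs are ours, and the real model would use learned projections over CNN region features.

```python
import numpy as np

def scaled_dot_product_attention(query, keys, values):
    """Attend to image regions from the decoder state (illustrative sketch).

    query:  (d,)    decoder hidden state at the current decoding step
    keys:   (n, d)  image-region features used for scoring
    values: (n, d)  image-region features to be aggregated
    Returns the attended visual context vector, shape (d,).
    """
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)   # (n,) similarity, scaled by sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over the n regions
    return weights @ values              # convex combination of region features

# Toy example: 4 image regions, feature dimension 8 (made-up sizes)
rng = np.random.default_rng(0)
h_t = rng.normal(size=8)                 # decoder state at step t
img = rng.normal(size=(4, 8))            # CNN region features
context = scaled_dot_product_attention(h_t, img, img)
```

At each decoding step the resulting context vector would be fed back into the decoder cell together with the previously generated token, which is the feedback loop the ablation study identifies as an effective conditioning mechanism.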
heal.advisorName | Ποταμιάνος, Αλέξανδρος | el |
heal.committeeMemberName | Τζαφέστας, Κωνσταντίνος | el |
heal.committeeMemberName | Σταφυλοπάτης, Ανδρέας-Γεώργιος | el |
heal.academicPublisher | Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Τομέας Σημάτων, Ελέγχου και Ρομποτικής | el |
heal.academicPublisherID | ntua | |
heal.fullTextAvailability | true |