HEAL DSpace

Large language models and multimodal retrieval for visual word sense disambiguation

dc.contributor.author Κριθαρούλα, Αναστασία el
dc.contributor.author Kritharoula, Anastasia en
dc.date.accessioned 2024-04-24T10:08:56Z
dc.date.available 2024-04-24T10:08:56Z
dc.identifier.uri https://dspace.lib.ntua.gr/xmlui/handle/123456789/59278
dc.identifier.uri http://dx.doi.org/10.26240/heal.ntua.26974
dc.rights Attribution-NonCommercial-NoDerivs 3.0 Greece *
dc.rights.uri http://creativecommons.org/licenses/by-nc-nd/3.0/gr/ *
dc.subject Visual Word Sense Disambiguation en
dc.subject Multimodal Retrieval en
dc.subject VL Transformers en
dc.subject Large Language Models en
dc.subject Language Models as Knowledge Bases en
dc.subject Αποσαφήνιση Οπτικών Εννοιών el
dc.subject Πολυτροπική Ανάκτηση Εικόνας-Κειμένου el
dc.subject Οπτικογλωσσικοί Μετασχηματιστές el
dc.subject Μεγάλα Γλωσσικά Μοντέλα el
dc.subject Μεγάλα Γλωσσικά Μοντέλα ως Βάσεις Γνώσεων el
dc.title Large language models and multimodal retrieval for visual word sense disambiguation en
heal.type bachelorThesis
heal.classification Computer Science en
heal.language el
heal.language en
heal.access free
heal.recordProvider ntua el
heal.publicationDate 2023-10-26
heal.abstract Visual Word Sense Disambiguation (VWSD) is a challenging task at the intersection of linguistic sense disambiguation and fine-grained multimodal retrieval. Given a word and its context, the goal is to retrieve the appropriate image from a set of competing candidates. This thesis aims to take a substantial step towards understanding this task. As a starting point, we adopt several recent state-of-the-art visiolinguistic (VL) transformers as baselines, which show promising performance. To improve on these baselines, we propose using Large Language Models (LLMs) as knowledge bases that can improve the retrieval performance of VL transformers through knowledge enhancement. Specifically, we query the LLMs with appropriate prompts to retrieve the knowledge stored in their weights, which yields performance improvements. We also study VWSD as a unimodal problem by converting it to text-to-text and image-to-image retrieval, in order to investigate the capabilities of the relevant models thoroughly. To combine these modules, we train a learning-to-rank (LTR) model on a dataset built from the features produced by the aforementioned techniques. Moreover, we transform VWSD into a text-only question-answering (QA) problem: we represent each candidate image by a generated caption and use the captions as multiple-choice textual answers. To assess the potential of this transformation, we employ zero-shot and few-shot strategies, as well as Chain-of-Thought (CoT) prompting in the zero-shot setting, which elicits the reasoning steps an LLM follows to select the most suitable candidate and provides explanations for that selection. Overall, this thesis is the first to analyse the merits of leveraging the knowledge stored in LLMs in various ways to solve VWSD. en
heal.advisorName Στάμου, Γεώργιος el
heal.committeeMemberName Στάμου, Γεώργιος el
heal.committeeMemberName Βαζιργιάννης, Μιχάλης el
heal.committeeMemberName Βουλόδημος, Αθανάσιος el
heal.academicPublisher Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Εργαστήριο Συστημάτων Τεχνητής Νοημοσύνης και Μάθησης. el
heal.academicPublisherID ntua
heal.numberOfPages 97 σ. el
heal.fullTextAvailability false


Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Greece