Large language models and multimodal retrieval for visual word sense disambiguation

Κριθαρούλα, Αναστασία; Kritharoula, Anastasia

dc.contributor.author	Κριθαρούλα, Αναστασία	el
dc.contributor.author	Kritharoula, Anastasia	en
dc.date.accessioned	2024-04-24T10:08:56Z
dc.date.available	2024-04-24T10:08:56Z
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/59278
dc.identifier.uri	http://dx.doi.org/10.26240/heal.ntua.26974
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 Greece	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/gr/	*
dc.subject	Visual Word Sense Disambiguation	en
dc.subject	Multimodal Retrieval	en
dc.subject	VL Transformers	en
dc.subject	Large Language Models	en
dc.subject	Language Models as Knowledge Bases	en
dc.subject	Αποσαφήνιση Οπτικών Εννοιών	el
dc.subject	Πολυτροπική Ανάκτηση Εικόνας-Κειμένου	el
dc.subject	Οπτικογλωσσικοί Μετασχηματιστές	el
dc.subject	Μεγάλα Γλωσσικά Μοντέλα	el
dc.subject	Μεγάλα Γλωσσικά Μοντέλα ως Βάσεις Γνώσεων	el
dc.title	Large language models and multimodal retrieval for visual word sense disambiguation	en
heal.type	bachelorThesis
heal.classification	Computer Science	en
heal.language	el
heal.language	en
heal.access	free
heal.recordProvider	ntua	el
heal.publicationDate	2023-10-26
heal.abstract	Visual Word Sense Disambiguation (VWSD) is a challenging task at the intersection of linguistic sense disambiguation and fine-grained multimodal retrieval. Given a target word and its context, the goal is to retrieve the appropriate image from a set of competing candidates. In this thesis, we take a substantial step towards understanding this task. As a starting point, we adopt several recent state-of-the-art visiolinguistic (VL) transformers as baselines with promising performance. To improve on these baselines, we propose using Large Language Models (LLMs) as Knowledge Bases to enhance the retrieval performance of VL transformers with additional knowledge. Specifically, we use appropriate prompts to query the LLMs and retrieve the knowledge stored in their weights, which yields performance improvements. We also study VWSD as a unimodal problem by converting it to text-to-text and image-to-image retrieval, in order to thoroughly investigate the capabilities of the relevant models. To combine our various modules, we train a learn-to-rank (LTR) model on a dataset built from the features produced by the aforementioned techniques. Moreover, we transform VWSD into a text-only question-answering (QA) problem: we assign each image a generated caption and use the captions as candidate multiple-choice answers. To explore the potential of this transformation, we employ zero-shot and few-shot strategies, as well as Chain-of-Thought (CoT) prompting in the zero-shot setting, to elicit the reasoning steps an LLM follows when selecting the most suitable candidate and to obtain explanations for this selection. Overall, this thesis is the first attempt to analyse the merits of leveraging knowledge stored in LLMs in various ways to solve VWSD.	en
heal.advisorName	Στάμου, Γεώργιος	el
heal.committeeMemberName	Στάμου, Γεώργιος	el
heal.committeeMemberName	Βαζιργιάννης, Μιχάλης	el
heal.committeeMemberName	Βουλόδημος, Αθανάσιος	el
heal.academicPublisher	Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Εργαστήριο Συστημάτων Τεχνητής Νοημοσύνης και Μάθησης.	el
heal.academicPublisherID	ntua
heal.numberOfPages	97 σ.	el
heal.fullTextAvailability	false