dc.contributor.author | Κριθαρούλα, Αναστασία | el |
dc.contributor.author | Kritharoula, Anastasia | en |
dc.date.accessioned | 2024-04-24T10:08:56Z | |
dc.date.available | 2024-04-24T10:08:56Z | |
dc.identifier.uri | https://dspace.lib.ntua.gr/xmlui/handle/123456789/59278 | |
dc.identifier.uri | http://dx.doi.org/10.26240/heal.ntua.26974 | |
dc.rights | Attribution-NonCommercial-NoDerivs 3.0 Greece | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/3.0/gr/ | * |
dc.subject | Visual Word Sense Disambiguation | en |
dc.subject | Multimodal Retrieval | en |
dc.subject | VL Transformers | en |
dc.subject | Large Language Models | en |
dc.subject | Language Models as Knowledge Bases | en |
dc.subject | Αποσαφήνιση Οπτικών Εννοιών | el |
dc.subject | Πολυτροπική Ανάκτηση Εικόνας-Κειμένου | el |
dc.subject | Οπτικογλωσσικοί Μετασχηματιστές | el |
dc.subject | Μεγάλα Γλωσσικά Μοντέλα | el |
dc.subject | Μεγάλα Γλωσσικά Μοντέλα ως Βάσεις Γνώσεων | el |
dc.title | Large language models and multimodal retrieval for visual word sense disambiguation | en |
heal.type | bachelorThesis | |
heal.classification | Computer Science | en |
heal.language | el | |
heal.language | en | |
heal.access | free | |
heal.recordProvider | ntua | el |
heal.publicationDate | 2023-10-26 | |
heal.abstract | Visual Word Sense Disambiguation (VWSD) is a challenging task that lies at the intersection of linguistic sense disambiguation and fine-grained multimodal retrieval. Given a word within a given context, the goal is to retrieve the appropriate image from a set of competing candidates. In this thesis, we aim to take a substantial step towards understanding this task. As a starting point, we employ recent state-of-the-art visiolinguistic (VL) transformers, which offer promising baseline performance. To improve on these baselines, we propose using Large Language Models (LLMs) as Knowledge Bases to enhance the retrieval performance of VL transformers with additional knowledge. Specifically, we utilise appropriate prompts to query the LLMs and retrieve the knowledge stored in their weights, which yields performance improvements. We also study VWSD as a unimodal problem by converting it to text-to-text and image-to-image retrieval, in order to thoroughly investigate the capabilities of the relevant models. To combine our various modules, we train a learn-to-rank (LTR) model on a dataset built from the features of the aforementioned techniques. Moreover, we transform VWSD into a text-only question-answering (QA) problem: we assign each image a generated caption and use the captions as candidate multiple-choice textual answers. To reveal the potential of this transformation, we employ zero-shot and few-shot strategies, as well as Chain-of-Thought (CoT) prompting in the zero-shot setting, in order to elicit the internal reasoning steps an LLM follows to select the most suitable candidate and to obtain explanations for this selection. Overall, this thesis is the first to analyse the merits of leveraging knowledge stored in LLMs in various ways to solve VWSD. | en |
heal.advisorName | Στάμου, Γεώργιος | el |
heal.committeeMemberName | Στάμου, Γεώργιος | el |
heal.committeeMemberName | Βαζιργιάννης, Μιχάλης | el |
heal.committeeMemberName | Βουλόδημος, Αθανάσιος | el |
heal.academicPublisher | Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Εργαστήριο Συστημάτων Τεχνητής Νοημοσύνης και Μάθησης. | el |
heal.academicPublisherID | ntua | |
heal.numberOfPages | 97 σ. | el |
heal.fullTextAvailability | false | |