HEAL DSpace

Enhancing contrastive language-vision pre-training with generative dialogue

dc.contributor.author Τσαπραζλής, Ευθύμιος el
dc.contributor.author Tsaprazlis, Efthymios en
dc.date.accessioned 2025-01-08T10:59:32Z
dc.date.available 2025-01-08T10:59:32Z
dc.identifier.uri https://dspace.lib.ntua.gr/xmlui/handle/123456789/60650
dc.identifier.uri http://dx.doi.org/10.26240/heal.ntua.28346
dc.rights Default License
dc.subject Contrastive Learning en
dc.subject Multimodal Learning en
dc.subject Self-supervised Learning en
dc.subject Generative AI en
dc.subject Machine Learning en
dc.title Enhancing contrastive language-vision pre-training with generative dialogue en
heal.type bachelorThesis
heal.classification Pattern Recognition en
heal.language el
heal.language en
heal.access free
heal.recordProvider ntua el
heal.publicationDate 2024-07-18
heal.abstract Image-text models have become essential in machine learning, leading to the development of several state-of-the-art architectures, such as CLIP, DALL-E, and Stable Diffusion. These foundational models can be applied to a wide range of tasks, often different from their original training objectives. This versatility arises from their ability to use each modality to infer knowledge about the other, enabling them to function without task-specific data. However, these architectures are also very expensive to train, typically requiring millions of image-text pairs, so in practice they are usually fine-tuned rather than trained from scratch. They can also be used directly for inference, serving as the foundation upon which more complex pipelines are built. Moreover, multimodal generative models now provide exceptional capabilities for image captioning, generating meaningful descriptions and high-quality responses about visual concepts. We believe that a dialogue between a user and a generative model about a given image can offer an additional perspective on the image-text pair. First, we study the problem of training a third tower for a new modality given a pre-trained CLIP model; this additional component can be used to incorporate other modalities into the model pipeline. In our framework, called CLIP-3Modal, we use a model such as BLIP-2 to provide a dialogue centered on the image. We evaluate our model on image and text retrieval and compare it against the standard image-text model. Then, we abandon the third-tower approach and focus on fine-tuning the original CLIP to adapt to question-answer-style textual inputs. We introduce DRAFT (Dual Representation Adaptive Fine-Tuning), a method based on contrastive learning and distribution alignment, designed to adapt CLIP-like models to out-of-distribution textual descriptions such as dialogue. We conduct extensive experiments and ablation studies to demonstrate the advantages of our method over the baseline model. Our experiments focus primarily on Visual Question Answering tasks, where our method significantly improves CLIP's performance. en
heal.advisorName Μαραγκός, Πέτρος el
heal.committeeMemberName Ροντογιάννης, Αθανάσιος el
heal.committeeMemberName Κορδώνης, Ιωάννης el
heal.academicPublisher Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Τομέας Σημάτων, Ελέγχου και Ρομποτικής el
heal.academicPublisherID ntua
heal.numberOfPages 114 p. el
heal.fullTextAvailability false
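
The abstract above describes contrastive alignment between image embeddings and dialogue-style text embeddings. As a rough illustration only, not the thesis's CLIP-3Modal or DRAFT implementation, the PyTorch sketch below shows the generic CLIP-style symmetric contrastive (InfoNCE) objective that such fine-tuning typically builds on; the function name, the temperature value, and the random tensors standing in for CLIP features are all hypothetical.

import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Hypothetical helper: symmetric InfoNCE over a batch of paired embeddings,
    # the generic objective used by CLIP-like contrastive training.
    image_emb = F.normalize(image_emb, dim=-1)  # unit vectors, so dot product = cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # pairwise similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal holds matching pairs
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage: random tensors stand in for frozen CLIP image features and
# dialogue-text features from the encoder being adapted.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_style_contrastive_loss(img, txt).item())

In an actual adaptation setting of the kind the abstract describes, image_emb would come from the frozen CLIP image tower and text_emb from the text (or dialogue) encoder being trained, with the loss pulling matching pairs together in the shared embedding space.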

