HEAL DSpace

Enhancing contrastive language-vision pre-training with generative dialogue

dc.contributor.author Τσαπραζλής, Ευθύμιος el
dc.contributor.author Tsaprazlis, Efthymios en
dc.date.accessioned 2025-01-08T10:59:32Z
dc.date.available 2025-01-08T10:59:32Z
dc.identifier.uri https://dspace.lib.ntua.gr/xmlui/handle/123456789/60650
dc.identifier.uri http://dx.doi.org/10.26240/heal.ntua.28346
dc.rights Default License
dc.subject Contrastive Learning en
dc.subject Multimodal Learning en
dc.subject Self-supervised Learning en
dc.subject Generative AI en
dc.subject Machine Learning en
dc.title Enhancing contrastive language-vision pre-training with generative dialogue en
heal.type bachelorThesis
heal.classification Pattern Recognition en
heal.language el
heal.language en
heal.access free
heal.recordProvider ntua el
heal.publicationDate 2024-07-18
heal.abstract Image-text models have become essential in machine learning, leading to the development of several state-of-the-art architectures, such as CLIP, DALL-E, and Stable Diffusion. These foundational models can be applied to a wide range of tasks, often different from their original training objectives. This versatility arises from their ability to use each modality to infer knowledge about the other, enabling them to function without task-specific data. However, these architectures are also very expensive to train, typically requiring millions of image-text pairs, so in practice they are usually fine-tuned rather than trained from scratch. They can also be used directly for inference, serving as the foundation upon which more complex pipelines are built. Moreover, multimodal generative models now provide exceptional capabilities for image captioning, generating meaningful descriptions and high-quality responses about visual concepts. We believe that a dialogue between a user and a generative model about a given image can offer an additional perspective on the image-text pair. First, we study the problem of training a third tower for a new modality given a pre-trained CLIP model; this additional component can be used to incorporate other modalities into the model pipeline. In our framework, called CLIP-3Modal, we use a model such as BLIP-2 to provide a dialogue centered on the image. We evaluate our model on image and text retrieval and compare it against the standard image-text model. Then, we abandon the third-tower approach and focus on fine-tuning the original CLIP to adapt to question-answer-style textual inputs. We introduce DRAFT (Dual Representation Adaptive Fine-Tuning), a method based on contrastive learning and distribution alignment, designed to adapt CLIP-like models to out-of-distribution textual descriptions such as dialogue. We conduct extensive experiments and ablation studies to demonstrate the advantages of our method over the baseline model. Our experiments focus primarily on Visual Question Answering tasks, where our method significantly improves CLIP's performance. en
heal.advisorName Μαραγκός, Πέτρος el
heal.committeeMemberName Ροντογιάννης, Αθανάσιος el
heal.committeeMemberName Κορδώνης, Ιωάννης el
heal.academicPublisher Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Τομέας Σημάτων, Ελέγχου και Ρομποτικής el
heal.academicPublisherID ntua
heal.numberOfPages 114 p. el
heal.fullTextAvailability false
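
The abstract above describes contrastive alignment between image embeddings and dialogue-style text embeddings. As a rough illustration only, not the thesis's CLIP-3Modal or DRAFT implementation, the PyTorch sketch below shows the generic CLIP-style symmetric contrastive (InfoNCE) objective that such fine-tuning typically builds on; the function name, the temperature value, and the random tensors standing in for CLIP features are all hypothetical.

import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Hypothetical helper: symmetric InfoNCE over a batch of paired embeddings,
    # the generic objective used by CLIP-like contrastive training.
    image_emb = F.normalize(image_emb, dim=-1)  # unit vectors, so dot product = cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # pairwise similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal holds matching pairs
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage: random tensors stand in for frozen CLIP image features and
# dialogue-text features from the encoder being adapted.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_style_contrastive_loss(img, txt).item())

In an actual adaptation setting of the kind the abstract describes, image_emb would come from the frozen CLIP image tower and text_emb from the text (or dialogue) encoder being trained, with the loss pulling matching pairs together in the shared embedding space.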

