HEAL DSpace

Story visualization via masked generative transformers with character guidance and caption augmentation

dc.contributor.author Papadimitriou, Christos en
dc.contributor.author Παπαδημητρίου, Χρήστος el
dc.date.accessioned 2024-07-16T08:26:48Z
dc.date.available 2024-07-16T08:26:48Z
dc.identifier.uri https://dspace.lib.ntua.gr/xmlui/handle/123456789/59922
dc.identifier.uri http://dx.doi.org/10.26240/heal.ntua.27618
dc.rights Default License
dc.subject Story Visualization en
dc.subject Οπτικοποίηση Ιστορίας el
dc.subject Artificial Intelligence en
dc.subject Computer Vision en
dc.subject NLP en
dc.subject Transformers en
dc.subject Τεχνητή Νοημοσύνη el
dc.subject Όραση Υπολογιστών el
dc.subject Επεξεργασία Φυσικής Γλώσσας el
dc.subject Μετασχηματιστές el
dc.title Story visualization via masked generative transformers with character guidance and caption augmentation en
heal.type bachelorThesis
heal.classification Computer Science en
heal.language el
heal.language en
heal.access free
heal.recordProvider ntua el
heal.publicationDate 2024-03-01
heal.abstract Story Visualization (SV) is a challenging Artificial Intelligence task that lies at the intersection of Natural Language Processing (NLP) and Computer Vision. The task consists of generating a sequence of images that visualize a given sequence of sentences; the sentences form a coherent narrative, and so should the images. The task was introduced in 2019 and has since been approached in multiple ways, including GANs, Transformers and Diffusion models. In this thesis we tackle the task with an architecture based on MaskGIT, a relatively recent Transformer-based approach proposed for Text-to-Image synthesis; we are the first to employ this method for SV. Specifically, we form our baseline model by enhancing the original MaskGIT architecture with additional Cross-Attention sub-layers that allow the model to integrate information from past and future captions while generating an image. We build on top of our baseline model in several different ways, in search of directions that improve performance. Some of our experiments, such as leveraging a pre-trained text encoder, attempting disentanglement in the latent space, using a Token-Critic and performing super-resolution in the latent space, do not yield better results. On the other hand, we identify three directions that prove beneficial. We find that adding SV-Layers to the Transformer improves its performance on all metrics. Additionally, we propose a successful, image-agnostic caption augmentation technique that uses an LLM. Finally, our Character Guidance method, based on both positive and negative prompting, directly affects the generation of the main characters in the images and results in major improvements across all metrics. We combine the promising approaches to arrive at our top-performing architecture: MaskGST-CG w/ aug. captions. We test our approach on Pororo-SV, the most widely adopted dataset for the task, and evaluate our models using the most prominent metrics in the previous literature (including FID, Char-F1, Char-Acc and BLEU-2/3). Our best model achieves SOTA results in terms of Char-F1, Char-Acc and BLEU-2/3, which speaks to the merit of our Character Guidance approach. Additionally, it outperforms all previous GAN- and Transformer-based approaches in terms of FID. en
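
To make the Character Guidance idea concrete, below is a minimal, hypothetical sketch of MaskGIT-style iterative decoding combined with positive/negative character prompting, in the spirit of classifier-free guidance. All names and signatures here (MASK_ID, guidance_scale, the model(tokens, caption_emb, char_emb) call) are illustrative assumptions, not the thesis' actual code, and the exact logit-combination rule used in the thesis may differ.

```python
# Hypothetical sketch: MaskGIT-style iterative decoding with character
# guidance via positive/negative prompting (CFG-like logit combination).
# The model interface and all hyperparameters are assumptions for
# illustration; they are not taken from the thesis.
import math
import torch

VOCAB = 8192            # assumed VQ codebook size (model predicts codes only)
MASK_ID = VOCAB         # assumed id of the special [MASK] token
SEQ_LEN = 256           # assumed 16x16 grid of image tokens

@torch.no_grad()
def generate_with_character_guidance(model, caption_emb, pos_char_emb,
                                     neg_char_emb, num_steps=12,
                                     guidance_scale=2.0):
    """Iteratively unmask image tokens; at each step the logits are pushed
    toward the positive character prompt and away from the negative one."""
    tokens = torch.full((1, SEQ_LEN), MASK_ID, dtype=torch.long)
    for step in range(num_steps):
        # Cosine schedule: fraction of tokens that stays masked after this step.
        mask_ratio = math.cos(math.pi / 2 * (step + 1) / num_steps)

        logits_pos = model(tokens, caption_emb, pos_char_emb)  # (1, L, VOCAB)
        logits_neg = model(tokens, caption_emb, neg_char_emb)
        # Positive prompting pulls generation toward the target characters,
        # negative prompting pushes it away from unwanted ones.
        logits = logits_pos + guidance_scale * (logits_pos - logits_neg)

        probs = logits.softmax(dim=-1).squeeze(0)              # (L, VOCAB)
        sampled = torch.multinomial(probs, 1).squeeze(-1)      # (L,)
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)

        # Already-committed tokens are never re-masked (infinite confidence).
        still_masked = tokens.squeeze(0) == MASK_ID
        conf = torch.where(still_masked, conf, torch.tensor(float("inf")))

        # Keep the most confident predictions, re-mask the rest.
        num_to_mask = int(mask_ratio * SEQ_LEN)
        if num_to_mask > 0:
            remask = conf.topk(num_to_mask, largest=False).indices
            sampled[remask] = MASK_ID
        tokens = torch.where(still_masked, sampled,
                             tokens.squeeze(0)).unsqueeze(0)
    return tokens

if __name__ == "__main__":
    # Smoke test with a dummy model that returns random logits.
    dummy = lambda tok, cap, char: torch.randn(1, SEQ_LEN, VOCAB)
    out = generate_with_character_guidance(dummy, None, None, None)
    print(out.shape)  # torch.Size([1, 256]); no [MASK] ids remain
```

Presumably the positive prompt encodes the characters that should appear in the frame and the negative prompt those that should not; since the model predicts only the codebook entries, the [MASK] id can never be sampled, so the schedule is guaranteed to fully unmask the grid by the final step.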
heal.advisorName Στάμου, Γιώργος el
heal.committeeMemberName Βουλόδημος, Αθανάσιος el
heal.committeeMemberName Κόλλιας, Στέφανος el
heal.academicPublisher Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών el
heal.academicPublisherID ntua
heal.numberOfPages 135 σ. el
heal.fullTextAvailability false

