heal.abstract |
Story Visualization (SV) is a challenging Artificial Intelligence task that lies at the intersection of Natural Language Processing (NLP) and Computer Vision. The task consists of generating a sequence of images that serve as a visualization of a given sequence of sentences. The sentences form a coherent narrative, and so should the images. The task was introduced in 2019 and has since been approached in multiple ways, including GANs, Transformers and diffusion models. In this thesis we tackle the task with an architecture based on MaskGIT, a relatively recent Transformer-based approach proposed for Text-to-Image synthesis. We are the first to employ this method for SV. Specifically, we form our baseline model by enhancing the original MaskGIT architecture with additional Cross-Attention sub-layers, which allow the model to integrate information from past and future captions while generating each image. We then build on top of our baseline model in several different ways, in search of directions that improve performance. Some of our experiments, such as leveraging a pre-trained text encoder, attempting disentanglement in the latent space, using a Token-Critic and performing super-resolution in the latent space, do not yield better results. On the other hand, we identify three directions that prove beneficial. First, we find that adding SV-Layers to the Transformer improves its performance across all metrics. Second, we propose a successful, image-agnostic caption augmentation technique that uses an LLM. Finally, our Character Guidance method, based on both positive and negative prompting, directly affects the generation of main characters in the images and results in major improvements across all metrics. We combine the promising approaches to arrive at our top-performing architecture: MaskGST-CG with augmented captions. We test our approach on Pororo-SV, the most widely adopted dataset for the task, and evaluate our models using the most prominent metrics in previous literature, including FID, Char-F1, Char-Acc and BLEU-2/3. Our best model achieves state-of-the-art results in terms of Char-F1, Char-Acc and BLEU-2/3, which speaks to the merit of our Character Guidance approach. Additionally, it outperforms all previous GAN- and Transformer-based approaches in terms of FID. |
en |