Text-driven articulate talking face generation


dc.contributor.author Μίλης, Γεώργιος el
dc.contributor.author Milis, Georgios en
dc.date.accessioned 2024-07-17T07:01:30Z
dc.date.available 2024-07-17T07:01:30Z
dc.identifier.uri https://dspace.lib.ntua.gr/xmlui/handle/123456789/59926
dc.identifier.uri http://dx.doi.org/10.26240/heal.ntua.27622
dc.rights Attribution 3.0 Greece
dc.rights.uri http://creativecommons.org/licenses/by/3.0/gr/
dc.subject Talking Face Generation en
dc.subject Σύνθεση Ομιλούντων Προσώπων el
dc.subject Audiovisual Speech Synthesis en
dc.subject Text-to-Visual Speech en
dc.subject Photorealistic Talking Faces en
dc.subject Portrait Videos en
dc.subject Οπτικοακουστική Σύνθεση Ομιλίας el
dc.subject Σύνθεση Ομιλίας από Κείμενο el
dc.subject Φωτορεαλιστικά Πρόσωπα el
dc.subject Φωτορεαλιστικά Βίντεο Ομιλίας el
dc.title Text-driven articulate talking face generation en
heal.type bachelorThesis
heal.classification Όραση Υπολογιστών el
heal.language en
heal.access free
heal.recordProvider ntua el
heal.publicationDate 2024-03-28
heal.abstract Recent advances in deep learning for sequential data have given rise to fast and powerful models that produce realistic videos of talking humans, creating a new era of lifelike virtual experiences. These endeavors not only push the boundaries of audiovisual synthesis but also hold immense potential for applications spanning entertainment, communication, and education. The state of the art in talking face generation focuses mainly on audio-driven methods, which are conditioned on either real or synthetic audio. However, the ability to synthesize talking humans directly from text transcriptions is particularly beneficial for many applications and is expected to receive increasing attention, following the recent breakthroughs in large language models. A text-driven system can provide an animated avatar that utters a conversational agent's response, paving the way towards a more natural mode of human-machine interaction. Regarding text-driven generation, the predominant approach has been to employ a cascaded 2-stage architecture of a text-to-speech module followed by an audio-driven talking face generator. However, this ignores the highly complex interplay between the audio and visual streams that occurs during speaking. In this Diploma Thesis, we construct a text-driven audiovisual speech synthesizer that uses transformers for sequence modeling and does not follow the aforementioned cascaded approach. Instead, our method, which we call NEUral Text to ARticulate Talk (NEUTART), uses joint audiovisual modeling, as well as speech-informed 3D facial reconstructions and various perceptual losses for visual supervision. Notably, we incorporate a lipreading loss which adds realism to the speaker's mouth movements. The proposed model incorporates an audiovisual module that can generate 3D talking head videos with human-like articulation and synced audiovisual streams by design. Then, a photorealistic module leverages the power of generative adversarial networks to convert the 3D talking head into an RGB video. Our experiments on audiovisual datasets as well as in-the-wild videos reveal state-of-the-art generation quality both in terms of objective metrics and human evaluation, especially when assessing the realism of lip articulation. We also showcase the effectiveness of visual supervision for speech synthesis, since our experiments reveal that NEUTART produces more intelligible speech than a similar text-to-speech architecture. en
heal.advisorName Μαραγκός, Πέτρος el
heal.committeeMemberName Ποταμιάνος, Αλέξανδρος el
heal.committeeMemberName Ροντογιάννης, Αθανάσιος el
heal.academicPublisher Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών el
heal.academicPublisherID ntua
heal.numberOfPages 104 σ. el
heal.fullTextAvailability false
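
The abstract above describes NEUTART's joint audiovisual objective, which combines speech supervision with visual supervision on 3D facial reconstructions and a lipreading perceptual term. The Python/PyTorch snippet below is only an illustrative sketch of how such a combined objective could be assembled; the function name, tensor inputs, loss weights, and the frozen lipreading feature extractor are hypothetical placeholders and not the thesis implementation.

import torch
import torch.nn.functional as F

def joint_audiovisual_loss(pred_mel, target_mel,
                           pred_verts, target_verts,
                           lipreader, pred_mouth_crops, target_mouth_crops,
                           w_audio=1.0, w_geom=1.0, w_lipread=0.1):
    """Hypothetical combined objective for joint text-to-audiovisual training."""
    # Audio branch: mel-spectrogram reconstruction (speech supervision).
    audio_loss = F.l1_loss(pred_mel, target_mel)

    # Visual branch: geometric supervision on the predicted 3D face sequence.
    geom_loss = F.l1_loss(pred_verts, target_verts)

    # Lipreading perceptual term: match features of a frozen lipreading
    # network on predicted vs. ground-truth mouth regions, encouraging
    # realistic articulation.
    with torch.no_grad():
        target_feat = lipreader(target_mouth_crops)
    pred_feat = lipreader(pred_mouth_crops)
    lipread_loss = 1.0 - F.cosine_similarity(pred_feat, target_feat, dim=-1).mean()

    return w_audio * audio_loss + w_geom * geom_loss + w_lipread * lipread_loss

Keeping the lipreading network frozen means it acts purely as a perceptual critic: gradients flow only through the predicted mouth crops, so the generator is pushed toward mouth movements that a lipreader finds legible, which is the intuition behind the realism gains reported in the abstract.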

