HEAL DSpace

Automated audio captioning

dc.contributor.author Κουζέλης, Θοδωρής el
dc.contributor.author Kouzelis, Thodoris en
dc.date.accessioned 2022-11-16T07:22:40Z
dc.date.available 2022-11-16T07:22:40Z
dc.identifier.uri https://dspace.lib.ntua.gr/xmlui/handle/123456789/56133
dc.identifier.uri http://dx.doi.org/10.26240/heal.ntua.23831
dc.relation info:eu-repo/grantAgreement/EC/FP7/16188 el
dc.rights Default License
dc.subject Deep Learning en
dc.subject Automated Audio Captioning en
dc.subject Natural Language Processing en
dc.subject Movies en
dc.subject Transformers en
dc.subject Αυτόματη Περιγραφή Ήχου el
dc.subject Ταινίες el
dc.subject Επεξεργασία Φυσικής Γλώσσας el
dc.subject Βαθιά Μάθηση el
dc.subject Μηχανισμοί Προσοχής el
dc.title Automated audio captioning en
heal.type bachelorThesis
heal.secondaryTitle and generation of captions for sound events in movies. en
heal.classification Deep Learning en
heal.language el
heal.language en
heal.access free
heal.recordProvider ntua el
heal.publicationDate 2022-07-13
heal.abstract The purpose of this dissertation is to study Automated Audio Captioning, the task of describing the content of an audio clip in natural language. It is a cross-modal translation task at the intersection of audio signal processing and natural language processing: it focuses on the audio events in a clip and their spatiotemporal relationships, and expresses them in natural language. It is a recent and rather unexplored task with great potential for practical applications. In this work, we model caption generation for a given audio clip as a sequence-to-sequence task, using a Transformer architecture. We show that recent strategies from the related field of Audio Tagging allow us to significantly reduce the complexity of our model without affecting performance. To generate rich and varied descriptions, we investigate decoding algorithms that minimize the trade-off between semantic fidelity and diversity in captions. As a real-world application of Automated Audio Captioning, we propose a novel task in which, given the audio of a movie, a system aims to generate captions for salient sound events. Essentially, the aim of our proposed task is to automatically generate Subtitles for the Deaf and Hard of Hearing (SDH). Our proposed system detects the segments of sound events using a pre-trained tagging model and generates a textual description using our Audio Captioning model. To improve the performance of our audio captioning model, we create a task-specific dataset from SDH subtitles and movies. Furthermore, we integrate the textual information of the tagging model into caption generation by building a model for text-guided audio captioning. Finally, we propose a novel metric to evaluate our results. en
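
The abstract describes caption generation as a sequence-to-sequence task with a Transformer, decoded with strategies that balance semantic fidelity against diversity. The PyTorch sketch below illustrates that formulation under stated assumptions: the feature dimension, model sizes, vocabulary, token ids, and the choice of nucleus (top-p) sampling as the diversity-oriented decoder are illustrative placeholders, not the configuration used in the thesis.

    # Minimal sketch of a Transformer audio captioner (illustrative, not the
    # thesis implementation; all dimensions and ids below are assumptions).
    import torch
    import torch.nn as nn

    class AudioCaptioner(nn.Module):
        def __init__(self, feat_dim=64, d_model=128, vocab_size=5000,
                     nhead=4, num_layers=2):
            super().__init__()
            self.proj = nn.Linear(feat_dim, d_model)        # audio frames -> model space
            self.embed = nn.Embedding(vocab_size, d_model)  # caption tokens -> model space
            self.transformer = nn.Transformer(
                d_model=d_model, nhead=nhead,
                num_encoder_layers=num_layers, num_decoder_layers=num_layers,
                batch_first=True)
            self.out = nn.Linear(d_model, vocab_size)       # next-token logits

        def forward(self, audio_feats, caption_tokens):
            # audio_feats: (batch, time, feat_dim), e.g. log-mel frames
            # caption_tokens: (batch, seq) ids of the partial caption
            src = self.proj(audio_feats)
            tgt = self.embed(caption_tokens)
            # causal mask: each position attends only to earlier tokens
            mask = self.transformer.generate_square_subsequent_mask(
                tgt.size(1)).to(tgt.device)
            return self.out(self.transformer(src, tgt, tgt_mask=mask))

    # Nucleus (top-p) sampling: one way to trade off fidelity and diversity.
    @torch.no_grad()
    def sample_caption(model, audio_feats, bos_id=1, eos_id=2,
                       max_len=30, top_p=0.9):
        tokens = torch.full((audio_feats.size(0), 1), bos_id,
                            dtype=torch.long, device=audio_feats.device)
        for _ in range(max_len):
            logits = model(audio_feats, tokens)[:, -1, :]
            probs = torch.softmax(logits, dim=-1)
            sorted_p, idx = torch.sort(probs, descending=True, dim=-1)
            # keep the smallest prefix whose cumulative mass reaches top_p
            keep = torch.cumsum(sorted_p, dim=-1) - sorted_p < top_p
            sorted_p = sorted_p * keep
            sorted_p = sorted_p / sorted_p.sum(dim=-1, keepdim=True)
            next_tok = idx.gather(-1, torch.multinomial(sorted_p, 1))
            tokens = torch.cat([tokens, next_tok], dim=-1)
            if (next_tok == eos_id).all():
                break
        return tokens

    feats = torch.randn(1, 500, 64)   # e.g. 500 log-mel frames of one clip
    caption_ids = sample_caption(AudioCaptioner(), feats)

Lowering top_p concentrates sampling on high-probability tokens (favoring fidelity), while raising it admits rarer tokens (favoring diversity), which is the trade-off the decoding study in the abstract targets.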
heal.sponsor Ινστιτούτο Επεξεργασίας του Λόγου, EK Αθηνά el
heal.advisorName Ποταμιάνος, Αλέξανδρος el
heal.committeeMemberName Ποταμιάνος, Αλέξανδρος el
heal.committeeMemberName Κόλλιας, Στέφανος el
heal.committeeMemberName Τζαφέστας, Κωνσταντίνος el
heal.academicPublisher Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών. Εργαστήριο Επεξεργασίας Φυσικής Γλώσσας el
heal.academicPublisherID ntua
heal.numberOfPages 120 σ. el
heal.fullTextAvailability false

