dc.contributor.author | Κουζέλης, Θοδωρής | el
dc.contributor.author | Kouzelis, Thodoris | en
dc.date.accessioned | 2022-11-16T07:22:40Z |
dc.date.available | 2022-11-16T07:22:40Z |
dc.identifier.uri | https://dspace.lib.ntua.gr/xmlui/handle/123456789/56133 |
dc.identifier.uri | http://dx.doi.org/10.26240/heal.ntua.23831 |
dc.relation | info:eu-repo/grantAgreement/EC/FP7/16188 | el
dc.rights | Default License |
dc.subject | Deep Learning | en
dc.subject | Automated Audio Captioning | en
dc.subject | Natural Language Processing | en
dc.subject | Movies | en
dc.subject | Transformers | en
dc.subject | Αυτόματη Περιγραφή Ήχου | el
dc.subject | Ταινίες | el
dc.subject | Επεξεργασία Φυσικής Γλώσσας | el
dc.subject | Βαθειά Μάθηση | el
dc.subject | Μηχανισμοί Προσοχής | el
dc.title | Automated audio captioning | en
heal.type | bachelorThesis |
heal.secondaryTitle | and generation of captions for sound events in movies. | el
heal.classification | Deep Learning | en
heal.language | el |
heal.language | en |
heal.access | free |
heal.recordProvider | ntua | el
heal.publicationDate | 13-07 |
heal.abstract | The purpose of this dissertation is to study Automated Audio Captioning. The aim of this task is to describe the content of an audio clip using natural language. It is a cross-modal translation task at the intersection of audio signal processing and natural language processing. Audio Captioning focuses on the audio events in a clip and their spatio-temporal relationships, and expresses them in natural language. It is a recent and relatively unexplored task with great potential for practical applications.

In this work, we model caption generation for a given audio clip as a sequence-to-sequence task using a Transformer architecture. We show that adopting recent strategies from the related field of Audio Tagging allows us to significantly reduce the complexity of our model without affecting performance. To generate rich and varied descriptions, we investigate decoding algorithms that minimize the trade-off between semantic fidelity and diversity in the generated captions.

As a real-world application of Automated Audio Captioning, we propose a novel task in which, given the audio of a movie, a system aims to generate captions for salient sound events. Essentially, the aim of our proposed task is to automatically generate Subtitles for the Deaf and Hard of Hearing (SDH). Our proposed system detects the segments containing sound events using a pre-trained tagging model and generates a textual description with our Audio Captioning model. To improve the performance of our audio captioning model, we create a task-specific dataset from SDH subtitles and movies. Furthermore, we integrate the textual output of the tagging model into caption generation by building a model for text-guided audio captioning. Finally, we propose a novel metric to evaluate our results. | en
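To illustrate the modeling approach described in the abstract, the following is a minimal sketch (not the thesis code) of a sequence-to-sequence audio captioning model: a Transformer decoder that cross-attends to frame-level embeddings produced by a pre-trained audio tagging encoder, which is kept frozen to limit model complexity. All names, dimensions, and the encoder's output interface are illustrative assumptions.

# Minimal sketch, assuming a pre-trained tagging encoder that maps audio to
# (batch, frames, d_model) embeddings; positional encodings omitted for brevity.
import torch
import torch.nn as nn

class AudioCaptioner(nn.Module):
    def __init__(self, tagging_encoder, vocab_size, d_model=256, n_heads=4, n_layers=3):
        super().__init__()
        self.encoder = tagging_encoder                # pre-trained audio tagging model
        for p in self.encoder.parameters():           # frozen: reduces trainable complexity
            p.requires_grad = False
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, audio, caption_tokens):
        memory = self.encoder(audio)                  # audio-event features from tagging model
        tgt = self.token_emb(caption_tokens)          # embedded caption prefix
        seq_len = caption_tokens.size(1)
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tgt.device), diagonal=1
        )
        out = self.decoder(tgt, memory, tgt_mask=causal)  # decoder cross-attends to audio
        return self.lm_head(out)                      # logits for the next caption token

At inference time, captions would be decoded autoregressively from these logits; the diversity-oriented decoding strategies mentioned in the abstract (for example, sampling- or beam-search-based variants) would operate at that step.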
heal.sponsor | Ινστιτούτο Επεξεργασίας του Λόγου, EK Αθηνά | el
heal.advisorName | Ποταμιάνος, Αλέξανδρος | el
heal.committeeMemberName | Ποταμιάνος, Αλέξανδρος | el
heal.committeeMemberName | Κόλλιας, Στέφανος | el
heal.committeeMemberName | Τζαφέστας, Κωνσταντίνος | el
heal.academicPublisher | Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών. Εργαστήριο Επεξεργασίας Φυσικής Γλώσσας | el
heal.academicPublisherID | ntua |
heal.numberOfPages | 120 σ. | el
heal.fullTextAvailability | false |