dc.contributor.author | Κουζέλης, Θοδωρής | el
dc.contributor.author | Kouzelis, Thodoris | en
dc.date.accessioned | 2022-11-16T07:22:40Z |
dc.date.available | 2022-11-16T07:22:40Z |
dc.identifier.uri | https://dspace.lib.ntua.gr/xmlui/handle/123456789/56133 |
dc.identifier.uri | http://dx.doi.org/10.26240/heal.ntua.23831 |
dc.relation | info:eu-repo/grantAgreement/EC/FP7/16188 | el
dc.rights | Default License |
dc.subject | Deep Learning | en
dc.subject | Automated Audio Captioning | en
dc.subject | Natural Language Processing | en
dc.subject | Movies | en
dc.subject | Transformers | en
dc.subject | Αυτόματη Περιγραφή Ήχου | el
dc.subject | Ταινίες | el
dc.subject | Επεξεργασία Φυσικής Γλώσσας | el
dc.subject | Βαθειά Μάθηση | el
dc.subject | Μηχανισμοί Προσοχής | el
dc.title | Automated audio captioning | en
heal.type | bachelorThesis |
heal.secondaryTitle | and generation of captions for sound events in movies. | el
heal.classification | Deep Learning | en
heal.language | el |
heal.language | en |
heal.access | free |
heal.recordProvider | ntua | el
heal.publicationDate | 13-07 |
heal.abstract | The purpose of this dissertation is to study Automated Audio Captioning. The aim of this task is to describe the content of an audio clip using natural language. It is a cross-modal translation task at the intersection of audio signal processing and natural language processing. Audio Captioning focuses on the audio events in a clip and their spatio-temporal relationships, and expresses them in natural language. It is a recent and relatively unexplored task with great potential for practical applications.

In this work, we model caption generation for a given audio clip as a sequence-to-sequence task using a Transformer architecture. We show that adopting recent strategies from the related field of Audio Tagging allows us to significantly reduce the complexity of our model without affecting performance. To generate rich and varied descriptions, we investigate decoding algorithms that minimize the trade-off between semantic fidelity and diversity in the generated captions.

As a real-world application of Automated Audio Captioning, we propose a novel task in which, given the audio of a movie, a system aims to generate captions for salient sound events. Essentially, the aim of our proposed task is to automatically generate Subtitles for the Deaf and Hard of Hearing (SDH). Our proposed system detects the segments containing sound events using a pre-trained tagging model and generates a textual description with our Audio Captioning model. To improve the performance of our audio captioning model, we create a task-specific dataset from SDH subtitles and movies. Furthermore, we integrate the textual output of the tagging model into caption generation by building a model for text-guided audio captioning. Finally, we propose a novel metric to evaluate our results. | en
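To illustrate the modeling approach described in the abstract, the following is a minimal sketch (not the thesis code) of a sequence-to-sequence audio captioning model: a Transformer decoder that cross-attends to frame-level embeddings produced by a pre-trained audio tagging encoder, which is kept frozen to limit model complexity. All names, dimensions, and the encoder's output interface are illustrative assumptions.

# Minimal sketch, assuming a pre-trained tagging encoder that maps audio to
# (batch, frames, d_model) embeddings; positional encodings omitted for brevity.
import torch
import torch.nn as nn

class AudioCaptioner(nn.Module):
    def __init__(self, tagging_encoder, vocab_size, d_model=256, n_heads=4, n_layers=3):
        super().__init__()
        self.encoder = tagging_encoder                # pre-trained audio tagging model
        for p in self.encoder.parameters():           # frozen: reduces trainable complexity
            p.requires_grad = False
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, audio, caption_tokens):
        memory = self.encoder(audio)                  # audio-event features from tagging model
        tgt = self.token_emb(caption_tokens)          # embedded caption prefix
        seq_len = caption_tokens.size(1)
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tgt.device), diagonal=1
        )
        out = self.decoder(tgt, memory, tgt_mask=causal)  # decoder cross-attends to audio
        return self.lm_head(out)                      # logits for the next caption token

At inference time, captions would be decoded autoregressively from these logits; the diversity-oriented decoding strategies mentioned in the abstract (for example, sampling- or beam-search-based variants) would operate at that step.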
heal.sponsor | Ινστιτούτο Επεξεργασίας του Λόγου, EK Αθηνά | el
heal.advisorName | Ποταμιάνος, Αλέξανδρος | el
heal.committeeMemberName | Ποταμιάνος, Αλέξανδρος | el
heal.committeeMemberName | Κόλλιας, Στέφανος | el
heal.committeeMemberName | Τζαφέστας, Κωνσταντίνος | el
heal.academicPublisher | Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών. Εργαστήριο Επεξεργασίας Φυσικής Γλώσσας | el
heal.academicPublisherID | ntua |
heal.numberOfPages | 120 σ. | el
heal.fullTextAvailability | false |