HEAL DSpace

Pre-training for video action recognition with automatically generated datasets

dc.contributor.author Σβέζεντσεβ, Νταβίντ el
dc.contributor.author Svezentsev, David en
dc.date.accessioned 2023-12-06T08:06:56Z
dc.date.available 2023-12-06T08:06:56Z
dc.identifier.uri https://dspace.lib.ntua.gr/xmlui/handle/123456789/58378
dc.identifier.uri http://dx.doi.org/10.26240/heal.ntua.26074
dc.rights Attribution 3.0 Greece
dc.rights.uri http://creativecommons.org/licenses/by/3.0/gr/
dc.subject Computer Vision en
dc.subject Deep Learning en
dc.subject Action Recognition en
dc.subject Synthetic Data en
dc.subject Fractal Geometry en
dc.subject Όραση Υπολογιστών el
dc.subject Μηχανική Μάθηση el
dc.subject Αναγνώριση Δράσης el
dc.subject Συνθετικά Δεδομένα el
dc.subject Γεωμετρία Φράκταλ el
dc.title Pre-training for video action recognition with automatically generated datasets en
dc.contributor.department Computer Vision, Speech Communication and Signal Processing Group el
heal.type bachelorThesis
heal.classification Computer Vision en
heal.language en
heal.access free
heal.recordProvider ntua el
heal.publicationDate 2023-07-12
heal.abstract In recent years, the computer vision community has shown growing interest in synthetic data. For the image modality, prior work has proposed learning visual representations by pre-training on synthetic samples produced by various generative processes instead of real data. This approach is advantageous because it sidesteps issues associated with real data: collection and labeling costs, copyright, privacy, and human bias. The desirable properties of synthetic images have been investigated carefully, and as a result the performance gap between real and synthetic pre-training images has narrowed significantly. The present work extends this approach to the video domain and applies it to the task of action recognition. Owing to the added temporal dimension, video is notably more complex than images; employing fractal geometry and other generative processes, we therefore present methods for automatically producing large-scale datasets of short synthetic video clips, applicable to both supervised and self-supervised learning. To narrow the domain gap, we manually inspect real video samples and identify their key properties, such as periodic motion, random backgrounds, and camera displacement, which we then carefully emulate during pre-training. Through thorough ablations, we determine which properties strengthen downstream results and offer general guidelines for pre-training with synthetic videos. The proposed approach is evaluated on the small-scale action recognition datasets HMDB51 and UCF101, as well as on four other video benchmarks. Compared to standard Kinetics pre-training, our results are competitive and even superior on a subset of the benchmarks. en
heal.advisorName Maragos, Petros en
heal.committeeMemberName Maragos, Petros en
heal.committeeMemberName Rontogiannis, Athanasios en
heal.committeeMemberName Potamianos, Gerasimos en
heal.academicPublisher National Technical University of Athens. School of Electrical and Computer Engineering. Division of Signals, Control and Robotics
heal.academicPublisherID ntua
heal.numberOfPages 146 p.
heal.fullTextAvailability false
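
The abstract above describes generating large-scale synthetic pre-training clips with fractal geometry, and emulating properties of real video such as periodic motion, random backgrounds, and camera displacement. What follows is a minimal illustrative sketch of that idea, not the thesis implementation: it renders an Iterated Function System (IFS) fractal with the chaos game, animates it by sinusoidally perturbing the affine parameters (periodic motion) and translating the viewport (camera displacement), and composites it onto a random background. All function names, parameter values, and the rendering scheme below are assumptions chosen for clarity.

import numpy as np

def sample_ifs(n_maps=3, rng=None):
    # Draw random affine maps w_k(x) = A_k x + b_k and rescale each A_k so its
    # largest singular value is at most 0.8, which keeps the system contractive
    # and guarantees a bounded fractal attractor.
    rng = rng if rng is not None else np.random.default_rng()
    A = rng.uniform(-1.0, 1.0, size=(n_maps, 2, 2))
    for k in range(n_maps):
        s = np.linalg.svd(A[k], compute_uv=False)[0]
        A[k] *= 0.8 / max(s, 0.8)
    b = rng.uniform(-1.0, 1.0, size=(n_maps, 2))
    return A, b

def render_frame(A, b, shift, size=64, n_points=5000, rng=None):
    # Run the chaos game and rasterize the attractor onto a noisy background.
    rng = rng if rng is not None else np.random.default_rng()
    frame = rng.uniform(0.0, 0.2, size=(size, size)).astype(np.float32)  # random background
    x = np.zeros(2)
    for i, k in enumerate(rng.integers(0, len(A), size=n_points)):
        x = A[k] @ x + b[k]
        if i < 20:
            continue  # skip transients before the orbit settles on the attractor
        # Map attractor coordinates in roughly [-2, 2]^2 to pixel indices,
        # applying the per-frame camera displacement.
        px = int((x[0] + shift[0] + 2.0) * size / 4.0)
        py = int((x[1] + shift[1] + 2.0) * size / 4.0)
        if 0 <= px < size and 0 <= py < size:
            frame[py, px] = 1.0
    return frame

def make_clip(n_frames=16, size=64, seed=0):
    # One clip: the IFS parameters oscillate sinusoidally over the clip
    # (periodic motion) while the viewport translates (camera displacement).
    rng = np.random.default_rng(seed)
    A, b = sample_ifs(rng=rng)
    clip = np.empty((n_frames, size, size), dtype=np.float32)
    for t in range(n_frames):
        phase = 2.0 * np.pi * t / n_frames
        A_t = A * (1.0 + 0.1 * np.sin(phase))                   # periodic deformation
        shift = 0.3 * np.array([np.cos(phase), np.sin(phase)])  # camera path
        clip[t] = render_frame(A_t, b, shift, size=size, rng=rng)
    return clip

clip = make_clip()
print(clip.shape)  # (16, 64, 64): one short synthetic clip

In fractal pre-training for images (e.g. FractalDB), class labels are typically derived from the IFS parameter sets that generate each sample; an analogous scheme could supply supervision for clips like the one above, while self-supervised objectives need no labels at all.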

