dc.contributor.author | Σβέζεντσεβ, Νταβίντ | el |
dc.contributor.author | Svezentsev, Ntavint | en |
dc.date.accessioned | 2023-12-06T08:06:56Z | |
dc.date.available | 2023-12-06T08:06:56Z | |
dc.identifier.uri | https://dspace.lib.ntua.gr/xmlui/handle/123456789/58378 | |
dc.identifier.uri | http://dx.doi.org/10.26240/heal.ntua.26074 | |
dc.rights | Αναφορά Δημιουργού 3.0 Ελλάδα | * |
dc.rights.uri | http://creativecommons.org/licenses/by/3.0/gr/ | * |
dc.subject | Computer Vision | en |
dc.subject | Deep Learning | en |
dc.subject | Action Recognition | en |
dc.subject | Synthetic Data | en |
dc.subject | Fractal Geometry | en |
dc.subject | Όραση Υπολογιστών | el |
dc.subject | Μηχανική Μάθηση | el |
dc.subject | Αναγνώριση Δράσης | el |
dc.subject | Συνθετικά Δεδομένα | el |
dc.subject | Γεωμετρία Φράκταλ | el |
dc.title | Pre-training for video action recognition with automatically generated datasets | en |
dc.contributor.department | Computer Vision, Speech Communication and Signal Processing Group | en |
heal.type | bachelorThesis | |
heal.classification | Computer Vision | en |
heal.language | en | |
heal.access | free | |
heal.recordProvider | ntua | el |
heal.publicationDate | 2023-07-12 | |
heal.abstract | In recent years, the computer vision community has shown growing interest in synthetic data. For the image modality, prior work has proposed learning visual representations by pre-training on synthetic samples produced by various generative processes instead of real data. This approach is advantageous because it sidesteps issues associated with real data: collection and labeling costs, copyright, privacy, and human bias. The desirable properties of synthetic images have been investigated carefully, and as a result the performance gap between real and synthetic images has narrowed significantly. The present work extends this approach to the video domain and applies it to the task of action recognition. Owing to the added temporal dimension, video is notably more complex than images. Employing fractal geometry and other generative processes, we present methods to automatically produce large-scale datasets of short synthetic video clips, applicable to both supervised and self-supervised learning. To narrow the domain gap, we manually inspect real video samples and identify key properties such as periodic motion, random backgrounds, and camera displacement, which we then carefully emulate during pre-training. Through thorough ablations, we determine which properties strengthen downstream results and offer general guidelines for pre-training with synthetic videos. The proposed approach is evaluated on the small-scale action recognition datasets HMDB51 and UCF101, as well as four other video benchmarks. Compared to standard Kinetics pre-training, our results are competitive and even superior on a subset of the benchmarks. | en |
heal.advisorName | Maragos, Petros | en |
heal.committeeMemberName | Maragos, Petros | en |
heal.committeeMemberName | Rontogiannis, Athanasios | en |
heal.committeeMemberName | Potamianos, Gerasimos | en |
heal.academicPublisher | Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Τομέας Σημάτων, Ελέγχου και Ρομποτικής | el |
heal.academicPublisherID | ntua | |
heal.numberOfPages | 146 σ. | el |
heal.fullTextAvailability | false | |
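The abstract describes automatically generating short synthetic video clips from fractal geometry, with properties such as periodic motion and camera displacement emulated during pre-training. A minimal, hypothetical sketch of one such generative process, assuming a random 2D iterated function system (IFS) rendered via the chaos game with a sinusoidal per-frame camera shift; this is an illustrative choice, not the thesis's exact recipe:

```python
import numpy as np

def random_ifs(n_maps=3, rng=None):
    """Sample a random contractive 2D affine IFS (hypothetical parameterization)."""
    if rng is None:
        rng = np.random.default_rng(0)
    A = rng.uniform(-1.0, 1.0, size=(n_maps, 2, 2))
    # Rescale each map to Frobenius norm 0.8 so the IFS is contractive
    # and the chaos game stays bounded.
    norms = np.linalg.norm(A, axis=(1, 2))
    A = A * (0.8 / norms)[:, None, None]
    b = rng.uniform(-0.5, 0.5, size=(n_maps, 2))
    return A, b

def render_clip(A, b, n_frames=16, size=64, n_points=20000, rng=None):
    """Render an IFS attractor into a short clip of binary frames.

    Periodic motion is emulated with a sinusoidal camera displacement
    per frame (an assumption for illustration only).
    """
    if rng is None:
        rng = np.random.default_rng(1)
    n_maps = A.shape[0]
    # Chaos game: iterate a random sequence of affine maps.
    x = np.zeros(2)
    pts = np.empty((n_points, 2))
    for i in range(n_points):
        k = rng.integers(n_maps)
        x = A[k] @ x + b[k]
        pts[i] = x
    # Normalize attractor points into the unit square.
    pts = (pts - pts.min(axis=0)) / (np.ptp(pts, axis=0) + 1e-8)
    clip = np.zeros((n_frames, size, size), dtype=np.float32)
    for t in range(n_frames):
        # Periodic camera displacement, wrapped at the frame border.
        shift = 0.1 * np.sin(2 * np.pi * t / n_frames)
        p = (pts + shift) % 1.0
        ij = np.clip((p * size).astype(int), 0, size - 1)
        clip[t, ij[:, 1], ij[:, 0]] = 1.0
    return clip

A, b = random_ifs()
clip = render_clip(A, b)
print(clip.shape)  # (16, 64, 64)
```

In a full pipeline along the lines the abstract sketches, many such IFS systems would be sampled to form pseudo-classes (for supervised pre-training) or augmented views (for self-supervised pre-training), with random backgrounds composited behind the rendered attractor.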