Action to object knowledge distillation for object-centric representation learning

Γιαννακάκης, Νικόλαος; Giannakakis, Nikolaos

dc.contributor.author	Γιαννακάκης, Νικόλαος	el
dc.contributor.author	Giannakakis, Nikolaos	en
dc.date.accessioned	2025-03-28T09:26:21Z
dc.date.available	2025-03-28T09:26:21Z
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/61524
dc.identifier.uri	http://dx.doi.org/10.26240/heal.ntua.29220
dc.rights	Attribution 3.0 Greece	*
dc.rights	Attribution-NonCommercial 3.0 Greece	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc/3.0/gr/	*
dc.subject	Object Affordance Categorization	en
dc.subject	Robotic Manipulation Simulation	en
dc.subject	Object-centric Representation Learning	en
dc.subject	Representation Learning	en
dc.subject	Slot Attention	en
dc.subject	Robot Perception	en
dc.subject	Robotics Simulation	en
dc.title	Action to object knowledge distillation for object-centric representation learning	en
dc.contributor.department	Division of Signals, Control and Robotics	el
heal.type	bachelorThesis
heal.classification	Machine Learning	en
heal.classification	Deep Learning	en
heal.classification	Computer Vision	en
heal.language	el
heal.language	en
heal.access	free
heal.recordProvider	ntua	el
heal.publicationDate	2024-10-24
heal.abstract	This thesis studies whether object-centric image encoders can be improved by enriching them with action-centric representations derived from videos of actions. First, we study a method that distills the representations of a pre-trained Video Masked Autoencoder (Video MAE) into the representations of two state-of-the-art image encoders in an object-centric manner. The method is evaluated on affordance categorization using a small-scale dataset that we built from the Something-Something v2 (SSV2) dataset. Experiments show that the Video MAE representations contain information that could be useful to the image encoders, and we test several methods for enriching the encoders with this information. These methods yield a marginal yet consistent improvement; further experimentation with larger-scale models and datasets could potentially unlock additional gains. Second, we propose and study a method based on the Slot Attention object-centric representation learning framework. It is also evaluated on affordance categorization, where it achieves competitive results while also segmenting the images automatically and substantially reducing the per-object representation size. Finally, we propose a method that combines the object-centric representations of a Slot Attention-based model into a flat representation vector for an image, with the aim of learning visuomotor policies. Evaluated on a simulated robotic manipulation task, it outperforms other out-of-domain representations. We also show that the performance of the slot representations in simulated robotic manipulation can be improved by fine-tuning the model on videos of actions from the SSV2 dataset.
By creating action-object associations in the representations of object-centric image encoders, this study seeks to contribute to more effective visual perception systems for robots and artificial agents, enabling them to better understand the semantics and dynamics of agent-object interaction.	en
heal.advisorName	Μαραγκός, Πέτρος	el
heal.committeeMemberName	Μαραγκός, Πέτρος	el
heal.committeeMemberName	Ροντογιάννης, Αθανάσιος	el
heal.committeeMemberName	Κορδώνης, Ιωάννης	el
heal.academicPublisher	National Technical University of Athens. School of Electrical and Computer Engineering. Division of Signals, Control and Robotics	en
heal.academicPublisherID	ntua
heal.numberOfPages	119 p.	en
heal.fullTextAvailability	false