Model-assisted optimization of Linear Algebra routines on multi-GPU computing systems

Anastasiadis, Petros

dc.contributor.author	Anastasiadis, Petros
dc.date.accessioned	2025-01-17T07:44:12Z
dc.date.available	2025-01-17T07:44:12Z
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/60809
dc.identifier.uri	http://dx.doi.org/10.26240/heal.ntua.28505
dc.rights	Attribution 3.0 Greece	*
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/gr/	*
dc.subject	Linear algebra	en
dc.subject	Graphics processing units (GPUs)	en
dc.subject	BLAS routines	en
dc.subject	Modeling	en
dc.subject	Autotuning	en
dc.title	Model-assisted optimization of Linear Algebra routines on multi-GPU computing systems	en
dc.contributor.department	Computing Systems Laboratory	el
heal.type	doctoralThesis
heal.classification	Performance engineering	en
heal.classification	Computer engineering	en
heal.classification	Software engineering	en
heal.language	en
heal.access	free
heal.recordProvider	ntua	el
heal.publicationDate	2024-09-09
heal.abstract	Dense linear algebra operations appear frequently in high-performance computing (HPC) applications, which makes their performance crucial to achieving optimal scalability. Since many modern HPC clusters contain multi-GPU nodes, BLAS (Basic Linear Algebra Subprograms) operations are frequently offloaded to GPUs, which requires optimized libraries to ensure good performance. However, optimizing BLAS for multi-GPU systems introduces many of the challenges of distributed computing, such as data decomposition, task scheduling, and communication across GPUs with distinct memory spaces. This complexity makes multi-GPU BLAS optimization difficult and often leads to sub-optimal performance or to system-specific solutions with reduced portability. To address these issues, we propose a model-based autotuning approach: we introduce several performance models for BLAS and integrate them into PARALiA, an end-to-end BLAS library. PARALiA uses model-driven insights to autotune BLAS execution at runtime, tailoring performance-critical parameters to each specific problem and system. This autotuning is coupled with an optimized task scheduler, leading to near-optimal data distribution and performance-aware resource utilization. PARALiA provides state-of-the-art performance and energy efficiency and can adapt to heterogeneous systems and scenarios through model-based decisions. Finally, we focus on the GEMM (general matrix multiplication) kernel and extend PARALiA with a custom static scheduler that integrates model-driven algorithmic, communication, and autotuning optimizations (PARALiA-GEMMex), which significantly outperforms the state of the art.	en
heal.advisorName	Goumas, Georgios
heal.committeeMemberName	Goumas, Georgios
heal.committeeMemberName	Papadopoulou, Nikela
heal.committeeMemberName	Koziris, Nectarios
heal.committeeMemberName	Pnevmatikatos, Dionisios
heal.committeeMemberName	Papaspyrou, Nikolaos
heal.committeeMemberName	Xydis, Sotirios
heal.committeeMemberName	Antonopoulos, Christos
heal.academicPublisher	School of Electrical and Computer Engineering	en
heal.academicPublisherID	ntua
heal.numberOfPages	183
heal.fullTextAvailability	false