Facilitating efficient synchronization of asymmetric threads on hyper-threaded processors

Anastopoulos, N; Koziris, N

dc.contributor.author	Anastopoulos, N	en
dc.contributor.author	Koziris, N	en
dc.date.accessioned	2014-03-01T02:45:16Z
dc.date.available	2014-03-01T02:45:16Z
dc.date.issued	2008	en
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/32247
dc.subject	Operating System	en
dc.subject	Performance Optimization	en
dc.subject	Bottom Up	en
dc.subject.other	Execution models	en
dc.subject.other	Parallel and distributed processing	en
dc.subject.other	Thread synchronization	en
dc.subject.other	Computer networks	en
dc.subject.other	Distributed parameter networks	en
dc.subject.other	Spin dynamics	en
dc.subject.other	Synchronization	en
dc.title	Facilitating efficient synchronization of asymmetric threads on hyper-threaded processors	en
heal.type	conferenceItem	en
heal.identifier.primary	10.1109/IPDPS.2008.4536358	en
heal.identifier.secondary	4536358	en
heal.identifier.secondary	http://dx.doi.org/10.1109/IPDPS.2008.4536358	en
heal.publicationDate	2008	en
heal.abstract	So far, the privileged instructions MONITOR and MWAIT, introduced with the Intel Prescott core, have been used mostly for inter-thread synchronization in operating system code. In a hyper-threaded processor, these instructions offer a "performance-optimized" way for threads involved in synchronization events to wait on a condition. In this work, we explore the potential of using these instructions for synchronizing application threads that execute on hyper-threaded processors and are characterized by workload asymmetry. Initially, we propose a framework through which one can use MONITOR/MWAIT to build condition wait and notification primitives, with minimal kernel involvement. Then, we evaluate the efficiency of these primitives in a bottom-up manner: at first, we quantify certain performance aspects of the primitives that reflect the execution model under consideration, such as resource consumption and responsiveness, and we compare them against other commonly used implementations. As a further step, we use our primitives to build synchronization barriers. Again, we examine the same performance issues as before, and using a pseudo-benchmark we evaluate the efficiency of our implementation for fine-grained inter-thread synchronization. In terms of throughput, our barriers yielded 12% better performance on average compared to Pthreads, and 26% compared to a spin-loop-based implementation, for varying levels of thread asymmetry. Finally, we test our barriers in a real-world scenario, specifically in applying thread-level Speculative Precomputation to four applications. For this multithreaded execution scheme, our implementation provided up to 7% better performance compared to Pthreads, and up to 40% compared to spin-loop-based barriers. ©2008 IEEE.	en
heal.journalName	IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM	en
dc.identifier.doi	10.1109/IPDPS.2008.4536358	en