A pipelined schedule to minimize completion time for loop tiling with computation and communication overlapping

Koziris, N; Sotiropoulos, A; Goumas, G

dc.contributor.author	Koziris, N	en
dc.contributor.author	Sotiropoulos, A	en
dc.contributor.author	Goumas, G	en
dc.date.accessioned	2014-03-01T01:18:33Z
dc.date.available	2014-03-01T01:18:33Z
dc.date.issued	2003	en
dc.identifier.issn	0743-7315	en
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/15081
dc.subject	Loop Tiling	en
dc.subject.classification	Computer Science, Theory & Methods	en
dc.subject.other	UNIFORM DEPENDENCIES	en
dc.subject.other	ALGORITHMS	en
dc.subject.other	SPACES	en
dc.title	A pipelined schedule to minimize completion time for loop tiling with computation and communication overlapping	en
heal.type	journalArticle	en
heal.identifier.primary	10.1016/S0743-7315(03)00102-3	en
heal.identifier.secondary	http://dx.doi.org/10.1016/S0743-7315(03)00102-3	en
heal.language	English	en
heal.publicationDate	2003	en
heal.abstract	This paper proposes a new method for minimizing the execution time of nested for-loops using a tiling transformation. In our approach, we are interested not only in tile size and shape according to the required communication-to-computation ratio, but also in overall completion time. We select a time hyperplane to execute different tiles much more efficiently by exploiting the inherent overlapping between communication and computation phases among successive, atomic tile executions. We assign tiles to processors according to the tile space boundaries, thus considering the iteration space bounds. Our schedule considerably reduces overall completion time under the assumption that some part of every communication phase can be efficiently overlapped with atomic, pure tile computations. The overall schedule resembles a pipelined datapath where computations are no longer interleaved with sends and receives to nonlocal processors. We survey the application of our schedule to modern communication architectures. We performed two sets of experiments, one using MPI primitives over FastEthernet and one using the SISCI API over an SCI network. In both cases, the total completion time is significantly reduced. © 2003 Elsevier Inc. All rights reserved.	en
heal.publisher	ACADEMIC PRESS INC ELSEVIER SCIENCE	en
heal.journalName	Journal of Parallel and Distributed Computing	en
dc.identifier.doi	10.1016/S0743-7315(03)00102-3	en
dc.identifier.isi	ISI:000186551500009	en
dc.identifier.volume	63	en
dc.identifier.issue	11	en
dc.identifier.spage	1138	en
dc.identifier.epage	1151	en