A tile size selection analysis for blocked array layouts

Athanasaki, E; Koziris, N; Tsanakas, P

dc.contributor.author	Athanasaki, E	en
dc.contributor.author	Koziris, N	en
dc.contributor.author	Tsanakas, P	en
dc.date.accessioned	2014-03-01T02:43:05Z
dc.date.available	2014-03-01T02:43:05Z
dc.date.issued	2005	en
dc.identifier.issn	15506207	en
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/31223
dc.subject	Blocked array layouts	en
dc.subject	Cache miss analysis	en
dc.subject	Tile selection	en
dc.subject.other	Blocked array layouts	en
dc.subject.other	Cache miss analysis	en
dc.subject.other	Memory speed	en
dc.subject.other	Tile selection	en
dc.subject.other	Cache memory	en
dc.subject.other	Computational methods	en
dc.subject.other	Data reduction	en
dc.subject.other	Data transfer	en
dc.subject.other	Hierarchical systems	en
dc.subject.other	Optimization	en
dc.subject.other	Program processors	en
dc.subject.other	Data storage equipment	en
dc.title	A tile size selection analysis for blocked array layouts	en
heal.type	conferenceItem	en
heal.identifier.primary	10.1109/INTERACT.2005.1	en
heal.identifier.secondary	http://dx.doi.org/10.1109/INTERACT.2005.1	en
heal.identifier.secondary	1423141	en
heal.publicationDate	2005	en
heal.abstract	Efficient use of the memory hierarchy is essential for good performance due to the ever-increasing gap between processor and memory speed. Program transformations such as loop tiling have been shown to be an effective approach to improving locality and cache exploitation, especially for dense matrix scientific computations. In conjunction with tiling, several experimental studies have been conducted on blocked data layouts, a data transformation technique used to boost cache performance. The stability of the achieved performance improvements is heavily dependent on the appropriate selection of tile sizes, taking into account the actual layout of the arrays in memory. In this paper, we first provide a theoretical analysis of the cache and TLB performance of blocked data layouts. According to this analysis, the optimal tile size that maximizes L1 cache utilization should fit completely in the L1 cache, to avoid any interference misses. We prove that when applying optimization techniques such as register assignment, array alignment, prefetching and loop unrolling, tile sizes equal to the L1 capacity offer better cache utilization, even for loop bodies that access more than just one array. Increased self- and/or cross-interference misses are now tolerated through prefetching. Such larger tiles also reduce the CPU cycles lost to mispredicted branches, since fewer branches are mispredicted. Results are validated through simulations and actual benchmarks on various modern platforms.	en
heal.journalName	Proceedings - Annual Workshop on Interaction between Compilers and Computer Architectures, INTERACT	en
dc.identifier.doi	10.1109/INTERACT.2005.1	en
dc.identifier.volume	2005	en
dc.identifier.spage	70	en
dc.identifier.epage	81	en