Optimizing multiway joins in a map-reduce environment

Afrati, FN; Ullman, JD

dc.contributor.author	Afrati, FN	en
dc.contributor.author	Ullman, JD	en
dc.date.accessioned	2014-03-01T01:36:36Z
dc.date.available	2014-03-01T01:36:36Z
dc.date.issued	2011	en
dc.identifier.issn	1041-4347	en
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/21351
dc.subject	joins	en
dc.subject	Map-reduce	en
dc.subject	parallel computing	en
dc.subject	query optimization	en
dc.subject.classification	Computer Science, Artificial Intelligence	en
dc.subject.classification	Computer Science, Information Systems	en
dc.subject.classification	Engineering, Electrical & Electronic	en
dc.subject.other	Dimension tables	en
dc.subject.other	Fixed numbers	en
dc.subject.other	joins	en
dc.subject.other	MAP process	en
dc.subject.other	Map-reduce	en
dc.subject.other	Multi-way join	en
dc.subject.other	query optimization	en
dc.subject.other	Social Networks	en
dc.subject.other	Star join	en
dc.subject.other	Very large datum	en
dc.subject.other	Parallel architectures	en
dc.subject.other	User interfaces	en
dc.subject.other	Optimization	en
dc.title	Optimizing multiway joins in a map-reduce environment	en
heal.type	journalArticle	en
heal.identifier.primary	10.1109/TKDE.2011.47	en
heal.identifier.secondary	http://dx.doi.org/10.1109/TKDE.2011.47	en
heal.identifier.secondary	5710932	en
heal.language	English	en
heal.publicationDate	2011	en
heal.abstract	Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the map-key, the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a share, which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, the degree of replication depending on the shares for those map-key attributes that are missing from their schema. We study the problem of optimizing the shares, given a fixed number of Reduce processes. An algorithm for detecting and fixing problems where a variable is mistakenly included in the map-key is given. Then, we consider two important special cases: chain joins and star joins. In each case, we are able to determine the map-key and determine the shares that yield the least replication. While the method we propose is not always superior to the conventional way of using map-reduce to implement joins, there are some important cases involving large-scale data where our method wins, including: 1) analytic queries in which a very large fact table is joined with smaller dimension tables, and 2) queries involving paths through graphs with high out-degree, such as the Web or a social network. © 2006 IEEE.	en
heal.publisher	IEEE COMPUTER SOC	en
heal.journalName	IEEE Transactions on Knowledge and Data Engineering	en
dc.identifier.doi	10.1109/TKDE.2011.47	en
dc.identifier.isi	ISI:000292888400002	en
dc.identifier.volume	23	en
dc.identifier.issue	9	en
dc.identifier.spage	1282	en
dc.identifier.epage	1298	en
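
Illustrative note on the abstract: the share-based routing it describes can be sketched for the three-relation chain join R(A,B) ⋈ S(B,C) ⋈ T(C,D), taking B and C as the map-key. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`bucket`, `reducers_for`, `replication`), the share values, and the modulo hash are all hypothetical choices made for illustration. A tuple is sent to every Reduce process whose identifier agrees with the hashed values of the map-key attributes it has, so it is replicated once per bucket of each map-key attribute missing from its schema.

```python
from itertools import product
from math import prod


def bucket(value, share):
    # Hypothetical hash: Python's hash of a small int is the int itself,
    # so this is deterministic for the integer values used here.
    return hash(value) % share


def reducers_for(tuple_attrs, shares):
    """Return the Reduce-process identifiers that must receive a tuple.

    tuple_attrs: dict mapping attribute name -> value for one tuple.
    shares: dict mapping each map-key attribute -> its share (bucket count).
    An identifier is a tuple with one bucket number per map-key attribute.
    """
    axes = []
    for attr, share in shares.items():
        if attr in tuple_attrs:
            # Attribute present: its hashed value fixes this component.
            axes.append([bucket(tuple_attrs[attr], share)])
        else:
            # Attribute missing: replicate across all of its buckets.
            axes.append(range(share))
    return list(product(*axes))


def replication(schema, shares):
    # Copies sent per tuple: product of the shares of the missing attributes.
    return prod(s for attr, s in shares.items() if attr not in schema)


# Assumed shares: 2 buckets for B, 3 for C, so 2 * 3 = 6 Reduce processes.
shares = {"B": 2, "C": 3}

r_targets = reducers_for({"A": 1, "B": 5}, shares)  # R lacks C
s_targets = reducers_for({"B": 5, "C": 7}, shares)  # S has both
t_targets = reducers_for({"C": 7, "D": 9}, shares)  # T lacks B
```

With these assumed shares, each S tuple goes to exactly one Reduce process, each R tuple is copied to 3 of them, and each T tuple to 2. All three sample tuples reach the process identified by (1, 1), which is where they can be joined. Choosing the shares to minimize total replication for a fixed number of Reduce processes is the optimization problem the abstract refers to.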