Optimizing joins in a map-reduce environment

Afrati, FN; Ullman, JD

dc.contributor.author	Afrati, FN	en
dc.contributor.author	Ullman, JD	en
dc.date.accessioned	2014-03-01T02:46:54Z
dc.date.available	2014-03-01T02:46:54Z
dc.date.issued	2010	en
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/32931
dc.subject	IT Value	en
dc.subject	Large Data	en
dc.subject.other	Dimension tables	en
dc.subject.other	Fixed numbers	en
dc.subject.other	Key attributes	en
dc.subject.other	MAP process	en
dc.subject.other	New approaches	en
dc.subject.other	Social Networks	en
dc.subject.other	Star join	en
dc.subject.other	Very large datum	en
dc.subject.other	Database systems	en
dc.subject.other	Optimization	en
dc.subject.other	Technology	en
dc.title	Optimizing joins in a map-reduce environment	en
heal.type	conferenceItem	en
heal.identifier.primary	10.1145/1739041.1739056	en
heal.identifier.secondary	http://dx.doi.org/10.1145/1739041.1739056	en
heal.publicationDate	2010	en
heal.abstract	Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the "map-key," the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a "share," which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, the degree of replication depending on the shares for those map-key attributes that are missing from their schema. We study the problem of optimizing the shares, given a fixed number of Reduce processes. An algorithm for detecting and fixing problems where an attribute is "mistakenly" included in the map-key is given. Then, we consider two important special cases: chain joins and star joins. In each case we are able to determine the map-key and determine the shares that yield the least replication. While the method we propose is not always superior to the conventional way of using map-reduce to implement joins, there are some important cases involving large-scale data where our method wins, including: (1) analytic queries in which a very large fact table is joined with smaller dimension tables, and (2) queries involving paths through graphs with high out-degree, such as the Web or a social network. Copyright 2010 ACM.	en
heal.journalName	Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings	en
dc.identifier.doi	10.1145/1739041.1739056	en
dc.identifier.spage	99	en
dc.identifier.epage	110	en
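
Illustrative note on the abstract: the map-key and share idea can be sketched for a three-relation chain join R(A,B), S(B,C), T(C,D). This is a minimal sketch under stated assumptions, not the authors' implementation. The map-key is taken to be {B, C}, the shares b and c satisfy b * c = number of Reduce processes, and the helper names (bucket, map_phase, reduce_phase, replication_cost) and the use of Python's built-in hash are hypothetical choices made for illustration only.

```python
from itertools import product


def bucket(value, share):
    """Hash a value into one of `share` buckets (hash choice is an assumption)."""
    return hash(value) % share


def map_phase(R, S, T, b, c):
    """Route tuples to Reduce processes identified by (bucket of B, bucket of C)."""
    reducers = {key: {"R": [], "S": [], "T": []}
                for key in product(range(b), range(c))}
    for a, bv in R:
        # R lacks C, so each R tuple is replicated to all c buckets of C.
        for j in range(c):
            reducers[(bucket(bv, b), j)]["R"].append((a, bv))
    for bv, cv in S:
        # S has both map-key attributes, so it goes to exactly one reducer.
        reducers[(bucket(bv, b), bucket(cv, c))]["S"].append((bv, cv))
    for cv, d in T:
        # T lacks B, so each T tuple is replicated to all b buckets of B.
        for i in range(b):
            reducers[(i, bucket(cv, c))]["T"].append((cv, d))
    return reducers


def reduce_phase(reducers):
    """Join locally at each reducer and concatenate the results."""
    out = []
    for parts in reducers.values():
        for a, bv in parts["R"]:
            for b2, cv in parts["S"]:
                if bv != b2:
                    continue
                for c2, d in parts["T"]:
                    if cv == c2:
                        out.append((a, bv, cv, d))
    return out


def replication_cost(R, S, T, b, c):
    """Number of tuples shipped from Map to Reduce processes for shares (b, c)."""
    return len(R) * c + len(S) + len(T) * b


# Small made-up relations, used only to exercise the sketch.
R = [(1, 10), (2, 20)]
S = [(10, 100), (20, 200), (30, 300)]
T = [(100, 7), (200, 8), (100, 9)]

joined = sorted(reduce_phase(map_phase(R, S, T, b=2, c=3)))
```

Because each S tuple reaches exactly one Reduce process, each joined tuple is produced once. Choosing the shares then amounts to minimizing a cost such as replication_cost subject to b * c being fixed, which is the optimization problem the abstract describes.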