dc.contributor.author |
Afrati, FN |
en |
dc.contributor.author |
Ullman, JD |
en |
dc.date.accessioned |
2014-03-01T01:36:36Z |
|
dc.date.available |
2014-03-01T01:36:36Z |
|
dc.date.issued |
2011 |
en |
dc.identifier.issn |
1041-4347 |
en |
dc.identifier.uri |
https://dspace.lib.ntua.gr/xmlui/handle/123456789/21351 |
|
dc.subject |
joins |
en |
dc.subject |
Map-reduce |
en |
dc.subject |
parallel computing |
en |
dc.subject |
query optimization |
en |
dc.subject.classification |
Computer Science, Artificial Intelligence |
en |
dc.subject.classification |
Computer Science, Information Systems |
en |
dc.subject.classification |
Engineering, Electrical & Electronic |
en |
dc.subject.other |
Dimension tables |
en |
dc.subject.other |
Fixed numbers |
en |
dc.subject.other |
joins |
en |
dc.subject.other |
MAP process |
en |
dc.subject.other |
Map-reduce |
en |
dc.subject.other |
Multi-way join |
en |
dc.subject.other |
query optimization |
en |
dc.subject.other |
Social Networks |
en |
dc.subject.other |
Star join |
en |
dc.subject.other |
Very large data |
en |
dc.subject.other |
Parallel architectures |
en |
dc.subject.other |
User interfaces |
en |
dc.subject.other |
Optimization |
en |
dc.title |
Optimizing multiway joins in a map-reduce environment |
en |
heal.type |
journalArticle |
en |
heal.identifier.primary |
10.1109/TKDE.2011.47 |
en |
heal.identifier.secondary |
http://dx.doi.org/10.1109/TKDE.2011.47 |
en |
heal.identifier.secondary |
5710932 |
en |
heal.language |
English |
en |
heal.publicationDate |
2011 |
en |
heal.abstract |
Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the map-key, the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a share, which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, the degree of replication depending on the shares for those map-key attributes that are missing from their schema. We study the problem of optimizing the shares, given a fixed number of Reduce processes. An algorithm for detecting and fixing problems where a variable is mistakenly included in the map-key is given. Then, we consider two important special cases: chain joins and star joins. In each case, we are able to determine the map-key and determine the shares that yield the least replication. While the method we propose is not always superior to the conventional way of using map-reduce to implement joins, there are some important cases involving large-scale data where our method wins, including: 1) analytic queries in which a very large fact table is joined with smaller dimension tables, and 2) queries involving paths through graphs with high out-degree, such as the Web or a social network. © 2006 IEEE. |
en |
heal.publisher |
IEEE COMPUTER SOC |
en |
heal.journalName |
IEEE Transactions on Knowledge and Data Engineering |
en |
dc.identifier.doi |
10.1109/TKDE.2011.47 |
en |
dc.identifier.isi |
ISI:000292888400002 |
en |
dc.identifier.volume |
23 |
en |
dc.identifier.issue |
9 |
en |
dc.identifier.spage |
1282 |
en |
dc.identifier.epage |
1298 |
en |
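The replication scheme described in the abstract can be illustrated with a minimal sketch. It assumes a three-relation chain join R(A,B) ⋈ S(B,C) ⋈ T(C,D) with map-key (B, C); the relation contents, the share values, and the helper names `bucket`, `reducers_for`, and `communication_cost` are hypothetical and are not taken from the article. Integer attribute values and a modulo hash are assumed for simplicity.

```python
from itertools import product

def bucket(value, share):
    # Stand-in hash function (assumption): integer values hashed by modulo.
    return value % share

def reducers_for(schema, tup, shares):
    """Return every Reduce-process id that must receive this tuple.

    A Reduce process is identified by one bucket number per map-key
    attribute. If the tuple has the attribute, its value fixes that
    component; if the attribute is missing from the schema, the tuple
    is replicated to all buckets of that attribute.
    """
    axes = []
    for attr, share in shares.items():
        if attr in schema:
            axes.append([bucket(tup[schema.index(attr)], share)])
        else:
            axes.append(range(share))
    return list(product(*axes))

def communication_cost(relations, shares):
    """Count the (tuple, reducer) pairs shipped by the Map phase."""
    return sum(
        len(reducers_for(schema, t, shares))
        for schema, tuples in relations
        for t in tuples
    )

# Hypothetical example data.
R = (("A", "B"), [(1, 10), (2, 11), (3, 10)])
S = (("B", "C"), [(10, 20), (11, 21)])
T = (("C", "D"), [(20, 7), (21, 8), (21, 9)])

# Shares for the map-key attributes; 2 * 3 = 6 Reduce processes in total.
shares = {"B": 2, "C": 3}

print(reducers_for(R[0], (1, 10), shares))   # R lacks C: replicated 3 times
print(reducers_for(S[0], (10, 20), shares))  # S has B and C: sent once
print(reducers_for(T[0], (20, 7), shares))   # T lacks B: replicated 2 times
print(communication_cost([R, S, T], shares)) # 3*3 + 2*1 + 3*2 = 17
```

In this sketch each tuple of R is sent to as many Reduce processes as the share of C, each tuple of T to as many as the share of B, and each tuple of S to exactly one. Choosing the shares so that this total is smallest for a fixed product of shares is the optimization problem the abstract refers to.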