Fuzzy joins using MapReduce

Afrati, FN; Sarma, AD; Menestrina, D; Parameswaran, A; Ullman, JD

dc.contributor.author	Afrati, FN	en
dc.contributor.author	Sarma, AD	en
dc.contributor.author	Menestrina, D	en
dc.contributor.author	Parameswaran, A	en
dc.contributor.author	Ullman, JD	en
dc.date.accessioned	2014-03-01T02:53:38Z
dc.date.available	2014-03-01T02:53:38Z
dc.date.issued	2012	en
dc.identifier.issn	10844627	en
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/36464
dc.subject.other	Communication cost	en
dc.subject.other	Computation model	en
dc.subject.other	Cost analysis	en
dc.subject.other	Edit distance	en
dc.subject.other	Input set	en
dc.subject.other	Jaccard distance	en
dc.subject.other	Map-reduce	en
dc.subject.other	Optimal algorithm	en
dc.subject.other	Real-world application	en
dc.subject.other	Research communities	en
dc.subject.other	Similarity threshold	en
dc.subject.other	Three component	en
dc.subject.other	Communication	en
dc.subject.other	Cost accounting	en
dc.subject.other	Costs	en
dc.subject.other	Hamming distance	en
dc.subject.other	Clustering algorithms	en
dc.title	Fuzzy joins using MapReduce	en
heal.type	conferenceItem	en
heal.identifier.primary	10.1109/ICDE.2012.66	en
heal.identifier.secondary	http://dx.doi.org/10.1109/ICDE.2012.66	en
heal.identifier.secondary	6228109	en
heal.publicationDate	2012	en
heal.abstract	Fuzzy/similarity joins have been widely studied in the research community and extensively used in real-world applications. This paper proposes and evaluates several algorithms for finding all pairs of elements from an input set that meet a similarity threshold. The computation model is a single MapReduce job. Because we allow only one MapReduce round, the Reduce function must be designed so a given output pair is produced by only one task; for many algorithms, satisfying this condition is one of the biggest challenges. We break the cost of an algorithm into three components: the execution cost of the mappers, the execution cost of the reducers, and the communication cost from the mappers to reducers. The algorithms are presented first in terms of Hamming distance, but extensions to edit distance and Jaccard distance are shown as well. We find that there are many different approaches to the similarity-join problem using MapReduce, and none dominates the others when both communication and reducer costs are considered. Our cost analyses enable applications to pick the optimal algorithm based on their communication, memory, and cluster requirements. © 2012 IEEE.	en
heal.journalName	Proceedings - International Conference on Data Engineering	en
dc.identifier.doi	10.1109/ICDE.2012.66	en
dc.identifier.spage	498	en
dc.identifier.epage	509	en