Eliminating the redundancy in blocking-based entity resolution methods

Papadakis, G; Ioannou, E; Niederee, C; Palpanas, T; Nejdl, W

dc.contributor.author	Papadakis, G	en
dc.contributor.author	Ioannou, E	en
dc.contributor.author	Niederee, C	en
dc.contributor.author	Palpanas, T	en
dc.contributor.author	Nejdl, W	en
dc.date.accessioned	2014-03-01T02:53:15Z
dc.date.available	2014-03-01T02:53:15Z
dc.date.issued	2011	en
dc.identifier.issn	1552-5996	en
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/36192
dc.subject	data cleaning	en
dc.subject	entity resolution	en
dc.subject	redundancy-based blocking	en
dc.subject.other	Abstract levels	en
dc.subject.other	Blocking method	en
dc.subject.other	Citation matching	en
dc.subject.other	Computational costs	en
dc.subject.other	Data cleaning	en
dc.subject.other	entity resolution	en
dc.subject.other	Heterogeneous data	en
dc.subject.other	Novel techniques	en
dc.subject.other	Optimal solutions	en
dc.subject.other	Real world data	en
dc.subject.other	Real-world objects	en
dc.subject.other	redundancy-based blocking	en
dc.subject.other	Resolution methods	en
dc.subject.other	Space complexity	en
dc.subject.other	Space limitations	en
dc.subject.other	Time efficiencies	en
dc.subject.other	Redundancy	en
dc.subject.other	Virtual reality	en
dc.subject.other	Digital libraries	en
dc.title	Eliminating the redundancy in blocking-based entity resolution methods	en
heal.type	conferenceItem	en
heal.identifier.primary	10.1145/1998076.1998093	en
heal.identifier.secondary	http://dx.doi.org/10.1145/1998076.1998093	en
heal.publicationDate	2011	en
heal.abstract	Entity resolution is the task of identifying entities that refer to the same real-world object. It has important applications in digital libraries, such as citation matching and author disambiguation. Blocking is an established methodology for addressing this problem efficiently: it clusters similar entities together and compares only the entities inside each cluster. To deal effectively with today's large, noisy and heterogeneous data collections, novel blocking methods that rely on redundancy have been introduced. They associate each entity with multiple blocks in order to increase recall, which also increases the computational cost. In this paper, we introduce novel techniques that remove the superfluous comparisons from any redundancy-based blocking method, improving its time efficiency without any impact on the end result. We present the optimal solution to this problem, which discards all redundant comparisons at the cost of quadratic space complexity. For applications with space limitations, we also present an alternative, lightweight solution that operates at the abstract level of blocks and discards a significant part of the redundant comparisons. We evaluate our techniques on two large, real-world data sets and verify the significant improvements they yield when integrated into existing blocking methods. © 2011 ACM.	en
heal.journalName	Proceedings of the ACM/IEEE Joint Conference on Digital Libraries	en
dc.identifier.doi	10.1145/1998076.1998093	en
dc.identifier.spage	85	en
dc.identifier.epage	94	en
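
The abstract describes redundancy-based blocking, where an entity placed in several blocks can be compared with the same partner more than once. The sketch below is only an illustration of that idea, not the algorithm from the paper: it assumes blocks are plain sets of entity identifiers (the names `blocks`, `e1` and so on are hypothetical) and suppresses repeated pairs by remembering every pair already emitted, which in the worst case needs space quadratic in the number of entities.

```python
from itertools import combinations


def comparisons_with_redundancy(blocks):
    """Yield every within-block pair, including repeats across blocks."""
    for block in blocks:
        # Sorting gives each pair a canonical (smaller, larger) order.
        for a, b in combinations(sorted(block), 2):
            yield (a, b)


def comparisons_without_redundancy(blocks):
    """Yield each distinct pair once by remembering pairs already emitted."""
    seen = set()  # assumption: a plain set; may hold O(n^2) pairs
    for pair in comparisons_with_redundancy(blocks):
        if pair not in seen:
            seen.add(pair)
            yield pair


# Hypothetical toy input: three overlapping blocks of entity ids.
blocks = [{"e1", "e2", "e3"}, {"e2", "e3", "e4"}, {"e1", "e3"}]
all_pairs = list(comparisons_with_redundancy(blocks))
unique_pairs = list(comparisons_without_redundancy(blocks))
print(len(all_pairs), len(unique_pairs))
```

On this toy input, the pairs (e2, e3) and (e1, e3) each occur in two blocks, so filtering them changes how many comparisons are executed but not which distinct pairs are compared.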