Beyond 100 million entities: Large-scale blocking-based resolution for heterogeneous data

Papadakis, G; Ioannou, E; Niederee, C; Palpanas, T; Nejdl, W

dc.contributor.author	Papadakis, G	en
dc.contributor.author	Ioannou, E	en
dc.contributor.author	Niederee, C	en
dc.contributor.author	Palpanas, T	en
dc.contributor.author	Nejdl, W	en
dc.date.accessioned	2014-03-01T02:53:33Z
dc.date.available	2014-03-01T02:53:33Z
dc.date.issued	2012	en
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/36417
dc.subject	Attribute-agnostic blocking	en
dc.subject	Data cleaning	en
dc.subject	Entity resolution	en
dc.subject.other	Attribute-agnostic blocking	en
dc.subject.other	Blocking method	en
dc.subject.other	Blocking technique	en
dc.subject.other	Computational costs	en
dc.subject.other	Data cleaning	en
dc.subject.other	Efficiency	en
dc.subject.other	Entity identifiers	en
dc.subject.other	Experimental evaluation	en
dc.subject.other	Heterogeneous data	en
dc.subject.other	Large datasets	en
dc.subject.other	Real-world objects	en
dc.subject.other	Relationships between entities	en
dc.subject.other	Schema information	en
dc.subject.other	Semi-structured	en
dc.subject.other	Voluminous data	en
dc.subject.other	Data mining	en
dc.subject.other	Information retrieval	en
dc.subject.other	Redundancy	en
dc.subject.other	Signal theory	en
dc.subject.other	Websites	en
dc.subject.other	Statistical tests	en
dc.title	Beyond 100 million entities: Large-scale blocking-based resolution for heterogeneous data	en
heal.type	conferenceItem	en
heal.identifier.primary	10.1145/2124295.2124305	en
heal.identifier.secondary	http://dx.doi.org/10.1145/2124295.2124305	en
heal.publicationDate	2012	en
heal.abstract	A prerequisite for leveraging the vast amount of data available on the Web is Entity Resolution, i.e., the process of identifying and linking data that describe the same real-world objects. To make this inherently quadratic process applicable to large data sets, blocking is typically employed: entities (records) are grouped into clusters - the blocks - of matching candidates, and only entities of the same block are compared. However, novel blocking techniques are required for dealing with the noisy, heterogeneous, semi-structured, user-generated data on the Web, as traditional blocking techniques are inapplicable due to their reliance on schema information. The introduction of redundancy improves the robustness of blocking methods but comes at the price of additional computational cost. In this paper, we present methods for enhancing the efficiency of redundancy-bearing blocking methods, such as our attribute-agnostic blocking approach. We introduce novel blocking schemes that build blocks based on a variety of evidence, including entity identifiers and relationships between entities; they significantly reduce the required number of comparisons, while maintaining blocking effectiveness at very high levels. We also introduce two theoretical measures that provide a reliable estimation of the performance of a blocking method, without requiring the analytical processing of its blocks. Based on these measures, we develop two techniques for improving the performance of blocking: combining individual, complementary blocking schemes, and purging blocks until given criteria are satisfied. We test our methods through an extensive experimental evaluation, using a voluminous data set with 182 million heterogeneous entities. The outcomes of our study show the applicability and the high performance of our approach. Copyright 2012 ACM.	en
heal.journalName	WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining	en
dc.identifier.doi	10.1145/2124295.2124305	en
dc.identifier.spage	53	en
dc.identifier.epage	62	en
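
The abstract describes blocking only in words, so the following is a minimal illustrative sketch of the general idea, not the authors' implementation. It assumes a simple token-based, attribute-agnostic scheme (every token in any attribute value becomes a block key, regardless of attribute name) and a size-threshold purging criterion; the function names, the toy entities, and the `max_size` parameter are all hypothetical and do not come from the paper.

```python
import re
from collections import defaultdict
from itertools import combinations


def token_blocking(entities):
    """Group entity ids by every token found in any attribute value.

    Attribute names are ignored, so no schema information is needed.
    Blocks with fewer than two entities yield no comparisons and are dropped.
    """
    blocks = defaultdict(set)
    for entity_id, attributes in entities.items():
        for value in attributes.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(entity_id)
    return {key: ids for key, ids in blocks.items() if len(ids) >= 2}


def purge_blocks(blocks, max_size):
    """Drop oversized blocks (assumed criterion: a fixed size threshold)."""
    return {key: ids for key, ids in blocks.items() if len(ids) <= max_size}


def candidate_pairs(blocks):
    """Collect the distinct pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy, made-up entities with heterogeneous attribute names.
entities = {
    "e1": {"name": "Jane Doe", "city": "Athens", "type": "person"},
    "e2": {"fullName": "Doe, Jane", "location": "Hannover", "kind": "person"},
    "e3": {"title": "John Smith", "place": "Athens", "class": "person"},
    "e4": {"label": "J. Smith", "town": "Trento", "category": "person"},
}

blocks = token_blocking(entities)
pairs_before = candidate_pairs(blocks)
pairs_after = candidate_pairs(purge_blocks(blocks, max_size=2))
print(len(pairs_before), len(pairs_after))
```

In this toy input, "e1" and "e2" share two blocks ("jane" and "doe"), which illustrates the redundancy the abstract refers to: a pair can be found through several blocks, at the cost of repeated work. The very common token "person" forms one block containing every entity, so purging it removes comparisons while the pair that shares more specific tokens remains a candidate.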