HEAL DSpace

Beyond 100 million entities: Large-scale blocking-based resolution for heterogeneous data

DSpace/Manakin Repository


dc.contributor.author Papadakis, G en
dc.contributor.author Ioannou, E en
dc.contributor.author Niederee, C en
dc.contributor.author Palpanas, T en
dc.contributor.author Nejdl, W en
dc.date.accessioned 2014-03-01T02:53:33Z
dc.date.available 2014-03-01T02:53:33Z
dc.date.issued 2012 en
dc.identifier.uri https://dspace.lib.ntua.gr/xmlui/handle/123456789/36417
dc.subject Attribute-agnostic blocking en
dc.subject Data cleaning en
dc.subject Entity resolution en
dc.subject.other Attribute-agnostic blocking en
dc.subject.other Blocking method en
dc.subject.other Blocking technique en
dc.subject.other Computational costs en
dc.subject.other Data cleaning en
dc.subject.other Efficiency en
dc.subject.other Entity identifiers en
dc.subject.other Experimental evaluation en
dc.subject.other Heterogeneous data en
dc.subject.other Large datasets en
dc.subject.other Real-world objects en
dc.subject.other Relationships between entities en
dc.subject.other Schema information en
dc.subject.other Semi-structured en
dc.subject.other Voluminous data en
dc.subject.other Data mining en
dc.subject.other Information retrieval en
dc.subject.other Redundancy en
dc.subject.other Signal theory en
dc.subject.other Websites en
dc.subject.other Statistical tests en
dc.title Beyond 100 million entities: Large-scale blocking-based resolution for heterogeneous data en
heal.type conferenceItem en
heal.identifier.primary 10.1145/2124295.2124305 en
heal.identifier.secondary http://dx.doi.org/10.1145/2124295.2124305 en
heal.publicationDate 2012 en
heal.abstract A prerequisite for leveraging the vast amount of data available on the Web is Entity Resolution, i.e., the process of identifying and linking data that describe the same real-world objects. To make this inherently quadratic process applicable to large data sets, blocking is typically employed: entities (records) are grouped into clusters - the blocks - of matching candidates and only entities of the same block are compared. However, novel blocking techniques are required for dealing with the noisy, heterogeneous, semi-structured, user-generated data on the Web, as traditional blocking techniques are inapplicable due to their reliance on schema information. The introduction of redundancy improves the robustness of blocking methods but comes at the price of additional computational cost. In this paper, we present methods for enhancing the efficiency of redundancy-bearing blocking methods, such as our attribute-agnostic blocking approach. We introduce novel blocking schemes that build blocks based on a variety of evidence, including entity identifiers and relationships between entities; they significantly reduce the required number of comparisons, while maintaining blocking effectiveness at very high levels. We also introduce two theoretical measures that provide a reliable estimation of the performance of a blocking method, without requiring the analytical processing of its blocks. Based on these measures, we develop two techniques for improving the performance of blocking: combining individual, complementary blocking schemes, and purging blocks until given criteria are satisfied. We test our methods through an extensive experimental evaluation, using a voluminous data set with 182 million heterogeneous entities. The outcomes of our study show the applicability and the high performance of our approach. Copyright 2012 ACM. en
heal.journalName WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining en
dc.identifier.doi 10.1145/2124295.2124305 en
dc.identifier.spage 53 en
dc.identifier.epage 62 en
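
The abstract describes blocking without reliance on schema information, plus purging of blocks to cut comparisons. The following is a minimal illustrative sketch of that general idea, not the authors' implementation: it assumes blocks are keyed by tokens taken from attribute values (attribute names are ignored), and it assumes a simple size threshold as the purging criterion. The function names, the `max_block_size` parameter, and the sample entities are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import re


def token_blocking(entities):
    """Create one block per token found in any attribute value.

    Attribute names are never inspected, so entities with different
    schemas can still share a block (assumed reading of
    "attribute-agnostic").
    """
    blocks = defaultdict(set)
    for entity_id, attributes in entities.items():
        for value in attributes.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(entity_id)
    # A block holding a single entity produces no comparisons; drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def purge_blocks(blocks, max_block_size):
    """Discard oversized blocks (hypothetical size-based criterion)."""
    return {t: ids for t, ids in blocks.items() if len(ids) <= max_block_size}


def candidate_pairs(blocks):
    """Collect the distinct entity pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical sample data: four entities, each with its own schema.
entities = {
    "e1": {"name": "Jane Smith", "city": "Athens"},
    "e2": {"fullName": "Smith, Jane", "location": "Hannover"},
    "e3": {"label": "John Doe", "based_in": "Athens"},
    "e4": {"title": "Jane Roe", "place": "Athens"},
}

blocks = token_blocking(entities)
print(sorted(candidate_pairs(blocks)))
print(sorted(candidate_pairs(purge_blocks(blocks, max_block_size=2))))
```

In this toy data the redundancy is visible directly: the pair (e1, e2) appears in two blocks, which makes it harder to miss but adds repeated comparisons. Dropping the larger, more generic blocks removes most of the remaining candidate pairs at the risk of losing true matches, which is the trade-off the abstract refers to.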


Files in this item

Files Size Format View

There are no files associated with this item.
