A methodology for clustering XML documents by structure

Dalamagas, T; Cheng, T; Winkel, K-J; Sellis, T

dc.contributor.author	Dalamagas, T	en
dc.contributor.author	Cheng, T	en
dc.contributor.author	Winkel, K-J	en
dc.contributor.author	Sellis, T	en
dc.date.accessioned	2014-03-01T01:23:25Z
dc.date.available	2014-03-01T01:23:25Z
dc.date.issued	2006	en
dc.identifier.issn	0306-4379	en
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/16960
dc.subject	Clustering	en
dc.subject	Structural similarity	en
dc.subject	Structural summary	en
dc.subject	Tree edit distance	en
dc.subject	XML	en
dc.subject.classification	Computer Science, Information Systems	en
dc.subject.other	Algorithms	en
dc.subject.other	Estimation	en
dc.subject.other	XML	en
dc.subject.other	Clustering	en
dc.subject.other	Structural summary	en
dc.subject.other	Tree edit distance	en
dc.subject.other	Information theory	en
dc.title	A methodology for clustering XML documents by structure	en
heal.type	journalArticle	en
heal.identifier.primary	10.1016/j.is.2004.11.009	en
heal.identifier.secondary	http://dx.doi.org/10.1016/j.is.2004.11.009	en
heal.language	English	en
heal.publicationDate	2006	en
heal.abstract	The processing and management of XML data are popular research issues. However, operations based on the structure of XML data have not received strong attention. These operations involve, among others, the grouping of structurally similar XML documents. Such grouping results from the application of clustering methods with distances that estimate the similarity between tree structures. This paper presents a framework for clustering XML documents by structure. Modeling the XML documents as rooted ordered labeled trees, we study the usage of structural distance metrics in hierarchical clustering algorithms to detect groups of structurally similar XML documents. We suggest the usage of structural summaries for trees to improve the performance of the distance calculation and at the same time to maintain or even improve its quality. Our approach is tested using a prototype testbed. © 2004 Elsevier B.V. All rights reserved.	en
heal.publisher	PERGAMON-ELSEVIER SCIENCE LTD	en
heal.journalName	Information Systems	en
dc.identifier.doi	10.1016/j.is.2004.11.009	en
dc.identifier.isi	ISI:000234906300003	en
dc.identifier.volume	31	en
dc.identifier.issue	3	en
dc.identifier.spage	187	en
dc.identifier.epage	228	en
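
The pipeline described in the abstract (documents modeled as trees, a structural distance between them, then hierarchical clustering) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names `label_paths`, `structural_distance` and `single_link_clusters` are hypothetical, the distance is a Jaccard distance over root-to-node tag paths used as a simple stand-in for tree edit distance, and the path set ignores sibling order and repetition, so it is not the structural summary construction the authors propose.

```python
import xml.etree.ElementTree as ET


def label_paths(xml_text):
    """Return the set of distinct root-to-node tag paths of one document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def structural_distance(a, b):
    """Jaccard distance between two path sets.

    Illustrative stand-in only; this is NOT tree edit distance.
    The union is never empty because every document has a root path.
    """
    return 1.0 - len(a & b) / len(a | b)


def single_link_clusters(docs, threshold):
    """Agglomerative single-link clustering over document indices.

    Repeatedly merges the two closest clusters and stops once the
    closest pair is at distance >= threshold (a hypothetical cut-off).
    """
    summaries = [label_paths(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def link(c1, c2):
        # Single link: distance between the closest members of two clusters.
        return min(
            structural_distance(summaries[i], summaries[j])
            for i in c1
            for j in c2
        )

    while len(clusters) > 1:
        dist, x, y = min(
            (link(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if dist >= threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    docs = [
        "<book><title/><author/></book>",
        "<book><title/><author/><author/></book>",
        "<order><item><price/></item></order>",
    ]
    print(single_link_clusters(docs, threshold=0.5))
```

In this toy input the two `book` documents share every tag path and end up in one cluster, while the `order` document shares none and stays alone. Swapping `structural_distance` for a real tree edit distance would bring the sketch closer to the setting the abstract describes.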