STAVIES: A system for information extraction from unknown Web data sources through automatic Web wrapper generation using clustering techniques

Papadakis, NK; Skoutas, D; Raftopoulos, K; Varvarigou, TA

dc.contributor.author	Papadakis, NK	en
dc.contributor.author	Skoutas, D	en
dc.contributor.author	Raftopoulos, K	en
dc.contributor.author	Varvarigou, TA	en
dc.date.accessioned	2014-03-01T01:23:07Z
dc.date.available	2014-03-01T01:23:07Z
dc.date.issued	2005	en
dc.identifier.issn	1041-4347	en
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/16821
dc.subject	Automatic wrappers	en
dc.subject	Data source wrappers	en
dc.subject	Generic wrappers	en
dc.subject	Information retrieval	en
dc.subject	Intelligent agents on the Web	en
dc.subject	Resource discovery	en
dc.subject	Web data extraction	en
dc.subject	Web mining	en
dc.subject	Web structure mining	en
dc.subject.classification	Computer Science, Artificial Intelligence	en
dc.subject.classification	Computer Science, Information Systems	en
dc.subject.classification	Engineering, Electrical & Electronic	en
dc.subject.other	Clustering techniques	en
dc.subject.other	Data source wrappers	en
dc.subject.other	Generic wrappers	en
dc.subject.other	Information extraction	en
dc.subject.other	Algorithms	en
dc.subject.other	Computer simulation	en
dc.subject.other	Data mining	en
dc.subject.other	Hierarchical systems	en
dc.subject.other	Web browsers	en
dc.subject.other	Websites	en
dc.subject.other	Information retrieval systems	en
dc.title	STAVIES: A system for information extraction from unknown Web data sources through automatic Web wrapper generation using clustering techniques	en
heal.type	journalArticle	en
heal.identifier.primary	10.1109/TKDE.2005.203	en
heal.identifier.secondary	http://dx.doi.org/10.1109/TKDE.2005.203	en
heal.language	English	en
heal.publicationDate	2005	en
heal.abstract	A fully automated wrapper for information extraction from Web pages is presented. The motivation behind such systems lies in the emerging need for going beyond the concept of "human browsing." The World Wide Web is today the main "all kind of information" repository and has been so far very successful in disseminating information to humans. By automating the process of information retrieval, further utilization by targeted applications is enabled. The key idea in our novel system is to exploit the format of the Web pages to discover the underlying structure in order to finally infer and extract pieces of information from the Web page. Our system first identifies the section of the Web page that contains the information to be extracted and then extracts it by using clustering techniques and other tools of statistical origin. STAVIES can operate without human intervention and does not require any training. The main innovation and contribution of the proposed system consists of introducing a signal-wise treatment of the tag structural hierarchy and using hierarchical clustering techniques to segment the Web pages. The importance of such a treatment is significant since it permits abstracting away from the raw tag-manipulating approach. Experimental results and comparisons with other state of the art systems are presented and discussed in the paper, indicating the high performance of the proposed algorithm. © 2005 IEEE.	en
heal.publisher	IEEE COMPUTER SOC	en
heal.journalName	IEEE Transactions on Knowledge and Data Engineering	en
dc.identifier.doi	10.1109/TKDE.2005.203	en
dc.identifier.isi	ISI:000232664700004	en
dc.identifier.volume	17	en
dc.identifier.issue	12	en
dc.identifier.spage	1638	en
dc.identifier.epage	1652	en