HEAL DSpace

STAVIES: A system for information extraction from unknown Web data sources through automatic Web wrapper generation using clustering techniques

dc.contributor.author Papadakis, NK en
dc.contributor.author Skoutas, D en
dc.contributor.author Raftopoulos, K en
dc.contributor.author Varvarigou, TA en
dc.date.accessioned 2014-03-01T01:23:07Z
dc.date.available 2014-03-01T01:23:07Z
dc.date.issued 2005 en
dc.identifier.issn 1041-4347 en
dc.identifier.uri https://dspace.lib.ntua.gr/xmlui/handle/123456789/16821
dc.subject Automatic wrappers en
dc.subject Data source wrappers en
dc.subject Generic wrappers en
dc.subject Information retrieval en
dc.subject Intelligent agents on the Web en
dc.subject Resource discovery en
dc.subject Web data extraction en
dc.subject Web mining en
dc.subject Web structure mining en
dc.subject.classification Computer Science, Artificial Intelligence en
dc.subject.classification Computer Science, Information Systems en
dc.subject.classification Engineering, Electrical & Electronic en
dc.subject.other Clustering techniques en
dc.subject.other Data source wrappers en
dc.subject.other Generic wrappers en
dc.subject.other Information extraction en
dc.subject.other Algorithms en
dc.subject.other Computer simulation en
dc.subject.other Data mining en
dc.subject.other Hierarchical systems en
dc.subject.other Web browsers en
dc.subject.other Websites en
dc.subject.other Information retrieval systems en
dc.title STAVIES: A system for information extraction from unknown Web data sources through automatic Web wrapper generation using clustering techniques en
heal.type journalArticle en
heal.identifier.primary 10.1109/TKDE.2005.203 en
heal.identifier.secondary http://dx.doi.org/10.1109/TKDE.2005.203 en
heal.language English en
heal.publicationDate 2005 en
heal.abstract A fully automated wrapper for information extraction from Web pages is presented. The motivation behind such systems lies in the emerging need for going beyond the concept of "human browsing." The World Wide Web is today the main "all kind of information" repository and has been so far very successful in disseminating information to humans. By automating the process of information retrieval, further utilization by targeted applications is enabled. The key idea in our novel system is to exploit the format of the Web pages to discover the underlying structure in order to finally infer and extract pieces of information from the Web page. Our system first identifies the section of the Web page that contains the information to be extracted and then extracts it by using clustering techniques and other tools of statistical origin. STAVIES can operate without human intervention and does not require any training. The main innovation and contribution of the proposed system consists of introducing a signal-wise treatment of the tag structural hierarchy and using hierarchical clustering techniques to segment the Web pages. The importance of such a treatment is significant since it permits abstracting away from the raw tag-manipulating approach. Experimental results and comparisons with other state of the art systems are presented and discussed in the paper, indicating the high performance of the proposed algorithm. © 2005 IEEE. en
heal.publisher IEEE COMPUTER SOC en
heal.journalName IEEE Transactions on Knowledge and Data Engineering en
dc.identifier.doi 10.1109/TKDE.2005.203 en
dc.identifier.isi ISI:000232664700004 en
dc.identifier.volume 17 en
dc.identifier.issue 12 en
dc.identifier.spage 1638 en
dc.identifier.epage 1652 en
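
The abstract describes treating the tag hierarchy of a page as a signal and segmenting the page with hierarchical clustering, but this record gives no implementation detail. The following is only a loose illustrative sketch of that general idea, not the STAVIES algorithm: it assumes the "signal" is the nesting depth of each text node, and it uses a simple adjacency-constrained agglomerative merge as a stand-in for the paper's clustering step. The names `DepthSignal`, `segment`, `extract_region`, `SAMPLE`, and the `max_gap` threshold are all hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DepthSignal(HTMLParser):
    """Record (depth, text) for every non-empty text node, in document order."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((self.depth, text))


def mean_depth(cluster):
    return sum(depth for depth, _ in cluster) / len(cluster)


def segment(nodes, max_gap=0.5):
    """Merge adjacent clusters bottom-up until the closest pair differs
    in mean depth by more than max_gap (an assumed, untuned threshold)."""
    clusters = [[node] for node in nodes]
    while len(clusters) > 1:
        gaps = [abs(mean_depth(clusters[i]) - mean_depth(clusters[i + 1]))
                for i in range(len(clusters) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] > max_gap:
            break
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


def extract_region(html):
    """Return the texts of the largest segment as the candidate data region."""
    parser = DepthSignal()
    parser.feed(html)
    parser.close()
    clusters = segment(parser.nodes)
    if not clusters:
        return []
    return [text for _, text in max(clusters, key=len)]


SAMPLE = ("<html><body><h1>Results</h1><table>"
          "<tr><td>A</td><td>1</td></tr><tr><td>B</td><td>2</td></tr>"
          "</table><p>Footer</p></body></html>")

print(extract_region(SAMPLE))
```

On the small `SAMPLE` page, the heading and footer sit at a shallower depth than the table cells, so the table cells end up in one segment of their own. Real pages are far noisier, which is presumably why the paper pairs the clustering with further statistical tools.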

