dc.contributor.author |
Papadakis, NK |
en |
dc.contributor.author |
Skoutas, D |
en |
dc.contributor.author |
Raftopoulos, K |
en |
dc.contributor.author |
Varvarigou, TA |
en |
dc.date.accessioned |
2014-03-01T01:23:07Z |
|
dc.date.available |
2014-03-01T01:23:07Z |
|
dc.date.issued |
2005 |
en |
dc.identifier.issn |
1041-4347 |
en |
dc.identifier.uri |
https://dspace.lib.ntua.gr/xmlui/handle/123456789/16821 |
|
dc.subject |
Automatic wrappers |
en |
dc.subject |
Data source wrappers |
en |
dc.subject |
Generic wrappers |
en |
dc.subject |
Information retrieval |
en |
dc.subject |
Intelligent agents on the Web |
en |
dc.subject |
Resource discovery |
en |
dc.subject |
Web data extraction |
en |
dc.subject |
Web mining |
en |
dc.subject |
Web structure mining |
en |
dc.subject.classification |
Computer Science, Artificial Intelligence |
en |
dc.subject.classification |
Computer Science, Information Systems |
en |
dc.subject.classification |
Engineering, Electrical & Electronic |
en |
dc.subject.other |
Clustering techniques |
en |
dc.subject.other |
Data source wrappers |
en |
dc.subject.other |
Generic wrappers |
en |
dc.subject.other |
Information extraction |
en |
dc.subject.other |
Algorithms |
en |
dc.subject.other |
Computer simulation |
en |
dc.subject.other |
Data mining |
en |
dc.subject.other |
Hierarchical systems |
en |
dc.subject.other |
Web browsers |
en |
dc.subject.other |
Websites |
en |
dc.subject.other |
Information retrieval systems |
en |
dc.title |
STAVIES: A system for information extraction from unknown Web data sources through automatic Web wrapper generation using clustering techniques |
en |
heal.type |
journalArticle |
en |
heal.identifier.primary |
10.1109/TKDE.2005.203 |
en |
heal.identifier.secondary |
http://dx.doi.org/10.1109/TKDE.2005.203 |
en |
heal.language |
English |
en |
heal.publicationDate |
2005 |
en |
heal.abstract |
A fully automated wrapper for information extraction from Web pages is presented. The motivation behind such systems lies in the emerging need for going beyond the concept of "human browsing." The World Wide Web is today the main "all kind of information" repository and has been so far very successful in disseminating information to humans. By automating the process of information retrieval, further utilization by targeted applications is enabled. The key idea in our novel system is to exploit the format of the Web pages to discover the underlying structure in order to finally infer and extract pieces of information from the Web page. Our system first identifies the section of the Web page that contains the information to be extracted and then extracts it by using clustering techniques and other tools of statistical origin. STAVIES can operate without human intervention and does not require any training. The main innovation and contribution of the proposed system consists of introducing a signal-wise treatment of the tag structural hierarchy and using hierarchical clustering techniques to segment the Web pages. The importance of such a treatment is significant since it permits abstracting away from the raw tag-manipulating approach. Experimental results and comparisons with other state of the art systems are presented and discussed in the paper, indicating the high performance of the proposed algorithm. © 2005 IEEE. |
en |
heal.publisher |
IEEE COMPUTER SOC |
en |
heal.journalName |
IEEE Transactions on Knowledge and Data Engineering |
en |
dc.identifier.doi |
10.1109/TKDE.2005.203 |
en |
dc.identifier.isi |
ISI:000232664700004 |
en |
dc.identifier.volume |
17 |
en |
dc.identifier.issue |
12 |
en |
dc.identifier.spage |
1638 |
en |
dc.identifier.epage |
1652 |
en |