dc.contributor.author |
Spanakis, G |
en |
dc.contributor.author |
Siolas, G |
en |
dc.contributor.author |
Stafylopatis, A |
en |
dc.date.accessioned |
2014-03-01T02:08:59Z |
|
dc.date.available |
2014-03-01T02:08:59Z |
|
dc.date.issued |
2012 |
en |
dc.identifier.issn |
0010-4620 |
en |
dc.identifier.uri |
https://dspace.lib.ntua.gr/xmlui/handle/123456789/29761 |
|
dc.subject |
conceptual clustering |
en |
dc.subject |
document clustering |
en |
dc.subject |
document representation |
en |
dc.subject |
Wikipedia knowledge |
en |
dc.subject.other |
Application programmer's interfaces |
en |
dc.subject.other |
Bag-of-words models |
en |
dc.subject.other |
Clustering process |
en |
dc.subject.other |
Computational costs |
en |
dc.subject.other |
Conceptual clustering |
en |
dc.subject.other |
Document Clustering |
en |
dc.subject.other |
Document Representation |
en |
dc.subject.other |
F-measure |
en |
dc.subject.other |
Hierarchical clustering |
en |
dc.subject.other |
Link structure |
en |
dc.subject.other |
Textual content |
en |
dc.subject.other |
Wikipedia |
en |
dc.subject.other |
Clustering algorithms |
en |
dc.subject.other |
Knowledge representation |
en |
dc.subject.other |
Semantics |
en |
dc.subject.other |
Websites |
en |
dc.title |
Exploiting Wikipedia knowledge for conceptual hierarchical clustering of documents |
en |
heal.type |
journalArticle |
en |
heal.identifier.primary |
10.1093/comjnl/bxr024 |
en |
heal.identifier.secondary |
http://dx.doi.org/10.1093/comjnl/bxr024 |
en |
heal.publicationDate |
2012 |
en |
heal.abstract |
In this paper, we propose a novel method for conceptual hierarchical clustering of documents using knowledge extracted from Wikipedia. The proposed method overcomes the disadvantages of classic bag-of-words models by exploiting Wikipedia's textual content and link structure. A robust and compact document representation is built in real time using the Wikipedia application programmer's interface, without the need to store any Wikipedia information locally. The clustering process is hierarchical and extends the idea of frequent items by using Wikipedia article titles to select cluster labels that are descriptive and important for the examined corpus. Experiments show that the proposed technique greatly improves over the baseline approach, both in terms of F-measure and entropy on the one hand and computational cost on the other. © 2011 The Author. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved. |
en |
heal.journalName |
Computer Journal |
en |
dc.identifier.doi |
10.1093/comjnl/bxr024 |
en |
dc.identifier.volume |
55 |
en |
dc.identifier.issue |
3 |
en |
dc.identifier.spage |
299 |
en |
dc.identifier.epage |
312 |
en |