HEAL DSpace

Representation models for text classification: A comparative analysis over three Web document types

DSpace/Manakin Repository

dc.contributor.author Giannakopoulos, G en
dc.contributor.author Mavridi, P en
dc.contributor.author Paliouras, G en
dc.contributor.author Papadakis, G en
dc.contributor.author Tserpes, K en
dc.date.accessioned 2014-03-01T02:54:00Z
dc.date.available 2014-03-01T02:54:00Z
dc.date.issued 2012 en
dc.identifier.uri https://dspace.lib.ntua.gr/xmlui/handle/123456789/36530
dc.subject N-gram graphs en
dc.subject Text classification en
dc.subject Web document types en
dc.subject.other Bag of words en
dc.subject.other Comparative analysis en
dc.subject.other Contextual information en
dc.subject.other Experimental studies en
dc.subject.other Feature space en
dc.subject.other Graph similarity en
dc.subject.other N-gram graphs en
dc.subject.other News articles en
dc.subject.other Real world data en
dc.subject.other Representation model en
dc.subject.other Sentiment analysis en
dc.subject.other Social media en
dc.subject.other Spam filtering en
dc.subject.other Text classification en
dc.subject.other Textual content en
dc.subject.other Topic Classification en
dc.subject.other User-generated content en
dc.subject.other Web document en
dc.subject.other Graphic methods en
dc.subject.other Information retrieval systems en
dc.subject.other Semantic Web en
dc.subject.other Semantics en
dc.subject.other Virtual reality en
dc.subject.other World Wide Web en
dc.title Representation models for text classification: A comparative analysis over three Web document types en
heal.type conferenceItem en
heal.identifier.primary 10.1145/2254129.2254148 en
heal.identifier.secondary http://dx.doi.org/10.1145/2254129.2254148 en
heal.identifier.secondary 13 en
heal.publicationDate 2012 en
heal.abstract Text classification is a popular task in Web research, with applications that range from spam filtering to sentiment analysis. To address it, patterns of co-occurring words or characters are typically extracted from the textual content of Web documents. However, not all documents are of the same quality; for example, the curated content of news articles usually contains less noise than the user-generated content of blog posts and other social media. In this paper, we provide some insight into, and a preliminary study of, a tripartite categorization of Web documents based on inherent document characteristics. We claim that each category calls for different classification settings with respect to the representation model. We verify this claim experimentally by showing that topic classification yields very different results for each document type. In addition, we consider a novel approach that improves the performance of topic classification across all types of Web documents: n-gram graphs. This model goes beyond the established bag-of-words model by representing each document as a graph. Individual graphs can be combined into a class graph, and graph similarities are then employed to position documents in a vector space and classify them. Accuracy increases because of the contextual information encapsulated in the edges of the n-gram graphs; efficiency, on the other hand, is boosted by reducing the feature space to a limited set of dimensions that depend on the number of classes rather than the size of the vocabulary. Our experimental study over three large-scale, real-world data sets validates the higher performance of n-gram graphs in all three domains of Web documents. Copyright 2012 ACM. en
heal.journalName ACM International Conference Proceeding Series en
dc.identifier.doi 10.1145/2254129.2254148 en
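
The n-gram graph model summarized in heal.abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the character trigram size, the window of 3, the undirected edges, the plain averaging used to merge graphs, and the two similarity measures are all assumptions chosen for illustration. The sketch shows only the general idea: each document becomes a graph of co-occurring n-grams, the graphs of one class are merged into a class graph, and a document is described by its similarity to each class graph, so the feature vector grows with the number of classes rather than with the vocabulary.

```python
from collections import Counter

def ngram_graph(text, n=3, window=3):
    """Weighted undirected edges between character n-grams that
    co-occur within `window` positions (assumed parameters)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, g in enumerate(grams):
        for h in grams[i + 1:i + 1 + window]:
            edges[tuple(sorted((g, h)))] += 1
    return edges

def class_graph(graphs):
    """Merge document graphs by averaging edge weights over all
    documents of the class (a simplification for illustration)."""
    total = Counter()
    for g in graphs:
        total.update(g)
    return {edge: w / len(graphs) for edge, w in total.items()}

def containment(doc, cls):
    """Fraction of the document's edges that also occur in the class graph."""
    if not doc:
        return 0.0
    return sum(1 for e in doc if e in cls) / len(doc)

def value_similarity(doc, cls):
    """Like containment, but each shared edge counts by how close its
    two weights are; normalised by the size of the larger graph."""
    if not doc or not cls:
        return 0.0
    shared = sum(min(w, cls[e]) / max(w, cls[e])
                 for e, w in doc.items() if e in cls)
    return shared / max(len(doc), len(cls))

def features(text, class_graphs):
    """Feature vector with two similarity values per class,
    in sorted label order."""
    doc = ngram_graph(text)
    vec = []
    for label in sorted(class_graphs):
        vec.append(containment(doc, class_graphs[label]))
        vec.append(value_similarity(doc, class_graphs[label]))
    return vec

# Toy data, invented for this example only.
train = {
    "sports": ["the team won the match", "the match ended in a draw"],
    "finance": ["the bank raised interest rates", "interest rates fell again"],
}
graphs = {label: class_graph([ngram_graph(t) for t in texts])
          for label, texts in train.items()}
vec = features("the team lost the match", graphs)
print(vec)  # order: finance containment, finance value, sports containment, sports value
```

With two classes the vector has four entries regardless of how many distinct n-grams occur in the texts. In a fuller pipeline such vectors would typically be passed to a standard classifier; that step is omitted here.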


Files in this item

There are no files associated with this item.