HEAL DSpace

A goodness of fit test approach in information retrieval

DSpace/Manakin Repository

dc.contributor.author Fragos, K en
dc.contributor.author Maistros, Y en
dc.date.accessioned 2014-03-01T01:23:24Z
dc.date.available 2014-03-01T01:23:24Z
dc.date.issued 2006 en
dc.identifier.issn 1386-4564 en
dc.identifier.uri https://dspace.lib.ntua.gr/xmlui/handle/123456789/16951
dc.subject Goodness of fit tests en
dc.subject Information Retrieval en
dc.subject.classification Computer Science, Information Systems en
dc.title A goodness of fit test approach in information retrieval en
heal.type journalArticle en
heal.identifier.primary 10.1007/s10791-006-3609-7 en
heal.identifier.secondary http://dx.doi.org/10.1007/s10791-006-3609-7 en
heal.language English en
heal.publicationDate 2006 en
heal.abstract In many probabilistic modeling approaches to Information Retrieval we are interested in estimating how well a document model "fits" the user's information need (query model). In statistics, on the other hand, goodness of fit tests are well-established techniques for assessing assumptions about the underlying distribution of a data set. Supposing that the query terms are randomly distributed across the documents of the collection, we want to know whether the query terms occur in a particular document more frequently than chance would predict. This can be quantified by goodness of fit tests. In this paper, we present a new document ranking technique based on Chi-square goodness of fit tests. Given the null hypothesis that there is no association between the query terms q and the document d other than chance occurrences, we perform a Chi-square goodness of fit test to assess this hypothesis and calculate the corresponding Chi-square values. Our retrieval formula ranks the documents in the collection according to these calculated Chi-square values. The method was evaluated over the entire test collection of TREC data, on disks 4 and 5, using the topics of the TREC-7 and TREC-8 conferences (50 topics each). It performs well, consistently outperforming the classical OKAPI term frequency weighting formula but falling below KL-divergence from the language modeling approach. Despite this, we believe that the technique is an important non-parametric way of thinking about retrieval, offering the possibility of trying simple alternative retrieval formulas within the framework of goodness-of-fit statistical tests and of modeling the data in various ways by estimating or assigning any arbitrary theoretical distribution to the terms. © Springer Science + Business Media, LLC 2006. en
heal.publisher SPRINGER en
heal.journalName Information Retrieval en
dc.identifier.doi 10.1007/s10791-006-3609-7 en
dc.identifier.isi ISI:000238100700006 en
dc.identifier.volume 9 en
dc.identifier.issue 3 en
dc.identifier.spage 331 en
dc.identifier.epage 342 en
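
The abstract describes ranking documents by a Chi-square goodness of fit statistic but does not give the formula itself. The following is only a minimal sketch of that general idea, not the authors' actual retrieval formula. It assumes that the expected count of a query term in a document is the document length times the term's relative frequency in the whole collection, and that only terms occurring more often than expected contribute to the score. The function names chi_square_score and rank_documents are hypothetical.

```python
from collections import Counter


def chi_square_score(query_terms, doc_tokens, collection_counts, collection_length):
    """Sum (observed - expected)^2 / expected over the query terms.

    Assumption (not from the paper): expected = doc_length * cf / collection_length,
    and only terms seen more often than expected add to the score.
    """
    doc_counts = Counter(doc_tokens)
    doc_length = len(doc_tokens)
    score = 0.0
    for term in query_terms:
        cf = collection_counts.get(term, 0)  # collection frequency of the term
        if cf == 0 or doc_length == 0:
            continue  # term unseen in the collection, or empty document
        expected = doc_length * cf / collection_length
        observed = doc_counts[term]
        if observed > expected:
            score += (observed - expected) ** 2 / expected
    return score


def rank_documents(query_terms, docs):
    """Return document ids sorted by decreasing Chi-square score.

    docs maps a document id to its list of tokens.
    """
    collection_counts = Counter()
    for tokens in docs.values():
        collection_counts.update(tokens)
    collection_length = sum(collection_counts.values())
    scores = {
        doc_id: chi_square_score(query_terms, tokens, collection_counts, collection_length)
        for doc_id, tokens in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Calling rank_documents(["chi", "square"], docs) on a small dictionary of tokenized documents places the documents whose query-term counts most exceed their chance expectation first. The one-sided restriction is a design choice of this sketch: without it, a document that contains a query term less often than expected would also receive a high statistic.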


Files in this item

There are no files associated with this item.
