A goodness of fit test approach in information retrieval

Fragos, K; Maistros, Y

dc.contributor.author	Fragos, K	en
dc.contributor.author	Maistros, Y	en
dc.date.accessioned	2014-03-01T01:23:24Z
dc.date.available	2014-03-01T01:23:24Z
dc.date.issued	2006	en
dc.identifier.issn	1386-4564	en
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/16951
dc.subject	Goodness of fit tests	en
dc.subject	Information Retrieval	en
dc.subject.classification	Computer Science, Information Systems	en
dc.title	A goodness of fit test approach in information retrieval	en
heal.type	journalArticle	en
heal.identifier.primary	10.1007/s10791-006-3609-7	en
heal.identifier.secondary	http://dx.doi.org/10.1007/s10791-006-3609-7	en
heal.language	English	en
heal.publicationDate	2006	en
heal.abstract	In many probabilistic modeling approaches to Information Retrieval, we are interested in estimating how well a document model "fits" the user's information need (query model). In statistics, meanwhile, goodness of fit tests are well-established techniques for assessing assumptions about the underlying distribution of a data set. Supposing that the query terms are randomly distributed across the documents of the collection, we want to know whether the query terms occur in a particular document more frequently than would be expected by chance. This can be quantified by goodness of fit tests. In this paper, we present a new document ranking technique based on Chi-square goodness of fit tests. Given the null hypothesis that there is no association between the query terms q and the document d beyond chance occurrences, we perform a Chi-square goodness of fit test to assess this hypothesis and calculate the corresponding Chi-square values. Our retrieval formula ranks the documents in the collection according to these calculated Chi-square values. The method was evaluated over the entire test collection of TREC data, on disks 4 and 5, using the topics of the TREC-7 and TREC-8 conferences (50 topics each). It performs well, consistently outperforming the classical OKAPI term frequency weighting formula but falling below KL-Divergence from the language modeling approach. Despite this, we believe that the technique is an important non-parametric way of thinking about retrieval, offering the possibility of trying simple alternative retrieval formulas within the framework of goodness-of-fit statistical tests, and of modeling the data in various ways by estimating or assigning an arbitrary theoretical distribution for the terms. © Springer Science + Business Media, LLC 2006.	en
heal.publisher	SPRINGER	en
heal.journalName	Information Retrieval	en
dc.identifier.doi	10.1007/s10791-006-3609-7	en
dc.identifier.isi	ISI:000238100700006	en
dc.identifier.volume	9	en
dc.identifier.issue	3	en
dc.identifier.spage	331	en
dc.identifier.epage	342	en
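
The ranking idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' retrieval formula, which the record does not give. It assumes that the expected count of a query term in a document is the document length times the term's relative frequency in the whole collection, and that only terms occurring more often than expected contribute to the score. The names `chi_square_score` and `rank_documents` are hypothetical.

```python
from collections import Counter


def chi_square_score(query_terms, doc_tokens, collection_counts, collection_len):
    """Pearson-style chi-square statistic for the query terms in one document.

    Assumption (not from the source): expected count = doc length * collection
    relative frequency, and only over-represented terms add to the score.
    """
    if collection_len == 0:
        return 0.0
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in set(query_terms):
        expected = doc_len * collection_counts.get(term, 0) / collection_len
        if expected == 0:
            continue  # term absent from the collection, or empty document
        observed = doc_counts.get(term, 0)
        if observed > expected:
            score += (observed - expected) ** 2 / expected
    return score


def rank_documents(query_terms, docs):
    """Rank documents (dict: id -> token list) by descending chi-square score."""
    collection_counts = Counter()
    for tokens in docs.values():
        collection_counts.update(tokens)
    collection_len = sum(collection_counts.values())
    scored = [
        (doc_id, chi_square_score(query_terms, tokens, collection_counts, collection_len))
        for doc_id, tokens in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy collection, invented purely for illustration.
    toy_docs = {
        "d1": ["chi", "square", "test", "chi"],
        "d2": ["language", "model", "test", "retrieval"],
    }
    for doc_id, score in rank_documents(["chi", "square"], toy_docs):
        print(doc_id, round(score, 3))
```

A larger score means the query terms appear in the document more often than the collection-wide distribution would predict, so the null hypothesis of no association is less plausible for that document. Other choices of the theoretical distribution would give different scores within the same framework.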