Adaptive-sampling algorithms for answering aggregation queries on Web sites

Afrati, FN; Lekeas, PV; Li, C

dc.contributor.author	Afrati, FN	en
dc.contributor.author	Lekeas, PV	en
dc.contributor.author	Li, C	en
dc.date.accessioned	2014-03-01T01:27:50Z
dc.date.available	2014-03-01T01:27:50Z
dc.date.issued	2008	en
dc.identifier.issn	0169023X	en
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/18598
dc.subject	Adaptive sampling	en
dc.subject	Aggregation queries	en
dc.subject	Web site	en
dc.subject.other	Algorithms	en
dc.subject.other	Data mining	en
dc.subject.other	Database systems	en
dc.subject.other	Internet	en
dc.subject.other	Query processing	en
dc.subject.other	Statistical methods	en
dc.subject.other	Adaptive-sampling algorithms	en
dc.subject.other	Aggregation queries	en
dc.subject.other	Synthetic data sets	en
dc.subject.other	Websites	en
dc.title	Adaptive-sampling algorithms for answering aggregation queries on Web sites	en
heal.type	journalArticle	en
heal.identifier.primary	10.1016/j.datak.2007.09.014	en
heal.identifier.secondary	http://dx.doi.org/10.1016/j.datak.2007.09.014	en
heal.publicationDate	2008	en
heal.abstract	Many Web sites publish their data in a hierarchical structure. For instance, Amazon.com organizes its pages on books as a hierarchy, in which each leaf node corresponds to a collection of pages of books in the same class (e.g., books on Data Mining). Users can easily browse this class by following a path from the root to the corresponding leaf node, such as "Computers & Internet - Databases - Storage - Data Mining". Business applications often require to submit aggregation queries on such data, such as "finding the average price of books on Data Mining". On the other hand, it is computationally expensive to compute the exact answer to such a query due to the large amount of data, its dynamicity, and limited Web-access resources. In this paper, we study how to answer such aggregation queries approximately with quality guarantees using sampling. We study how to use adaptive-sampling techniques that allocate the resources adaptively based on partial samples retrieved from different nodes in the hierarchy. Based on statistical methods, we study how to estimate the quality of the answer using the sample. Our experimental study using real and synthetic data sets validates the proposed techniques. © 2007 Elsevier B.V. All rights reserved.	en
heal.journalName	Data and Knowledge Engineering	en
dc.identifier.doi	10.1016/j.datak.2007.09.014	en
dc.identifier.volume	64	en
dc.identifier.issue	2	en
dc.identifier.spage	462	en
dc.identifier.epage	490	en