dc.contributor.author |
Konstantinou, I |
en |
dc.contributor.author |
Angelou, E |
en |
dc.contributor.author |
Tsoumakos, D |
en |
dc.contributor.author |
Koziris, N |
en |
dc.date.accessioned |
2014-03-01T02:46:46Z |
|
dc.date.available |
2014-03-01T02:46:46Z |
|
dc.date.issued |
2010 |
en |
dc.identifier.uri |
https://dspace.lib.ntua.gr/xmlui/handle/123456789/32831 |
|
dc.subject |
cloud computing |
en |
dc.subject |
Hadoop |
en |
dc.subject |
HBase |
en |
dc.subject |
MapReduce |
en |
dc.subject |
NoSQL |
en |
dc.subject.other |
Cloud computing |
en |
dc.subject.other |
Cluster prototype |
en |
dc.subject.other |
Data sets |
en |
dc.subject.other |
Distributed architecture |
en |
dc.subject.other |
Indexing systems |
en |
dc.subject.other |
Open sources |
en |
dc.subject.other |
Response time |
en |
dc.subject.other |
Semi-structured |
en |
dc.subject.other |
Text-indexing |
en |
dc.subject.other |
Unstructured data |
en |
dc.subject.other |
Distributed computer systems |
en |
dc.subject.other |
World Wide Web |
en |
dc.subject.other |
Indexing (of information) |
en |
dc.title |
Distributed indexing of web scale datasets for the cloud |
en |
heal.type |
conferenceItem |
en |
heal.identifier.primary |
10.1145/1779599.1779600 |
en |
heal.identifier.secondary |
http://dx.doi.org/10.1145/1779599.1779600 |
en |
heal.identifier.secondary |
1779600 |
en |
heal.publicationDate |
2010 |
en |
heal.abstract |
In this paper, we present a distributed architecture for indexing and serving large and diverse datasets. It incorporates and extends the functionality of Hadoop, the open-source MapReduce framework, and of HBase, a distributed, sparse, NoSQL database, to create a fully parallel indexing system. Experiments with structured, semi-structured and unstructured data of various sizes demonstrate the flexibility, speed and robustness of our implementation and contrast it with similarly oriented projects. Our 11-node cluster prototype kept the full-text indexing time for 150 GB of raw content under 3 hours, while the system's response time under a sustained query load of more than 1000 queries/sec remained on the order of milliseconds. © 2010 ACM. |
en |
heal.journalName |
ACM International Conference Proceeding Series |
en |
dc.identifier.doi |
10.1145/1779599.1779600 |
en |