Multi-objective query optimization for massively parallel processing in Cloud Computing

Γεωργουλάκης Μισεγιάννης, Μιχαήλ; Georgoulakis Misegiannis, Michail

dc.contributor.author	Γεωργουλάκης Μισεγιάννης, Μιχαήλ	el
dc.contributor.author	Georgoulakis Misegiannis, Michail	en
dc.date.accessioned	2022-04-19T11:19:53Z
dc.date.available	2022-04-19T11:19:53Z
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/55115
dc.identifier.uri	http://dx.doi.org/10.26240/heal.ntua.22813
dc.rights	Default License
dc.subject	Query optimization	en
dc.subject	Cloud computing	en
dc.subject	Multi-objective optimization	en
dc.subject	Apache Spark	en
dc.subject	Massively parallel processing	en
dc.subject	Βελτιστοποίηση ερωτημάτων	el
dc.subject	Υπολογιστικό νέφος	el
dc.subject	Βελτιστοποίηση με πολλαπλά κριτήρια	el
dc.subject	Περιβάλλον παράλληλης επεξεργασίας	el
dc.subject	Μοντέλο κοστολόγησης	el
dc.title	Multi-objective query optimization for massively parallel processing in Cloud Computing	en
heal.type	bachelorThesis
heal.classification	Βάσεις Δεδομένων	el
heal.classification	Databases	en
heal.language	en
heal.access	free
heal.recordProvider	ntua	el
heal.publicationDate	2021-11-08
heal.abstract	Data processing has become a prominent topic, as large volumes of data that need to be analyzed are produced every minute. The transition to the big data era was eased by the commercial rise of cloud computing and by massively parallel processing frameworks such as Apache Spark, which process this data in a parallel and distributed manner. Query optimization is a traditional DBMS problem in which the query optimizer selects the most efficient way to execute a query. Cloud computing features such as pricing policies led us to treat query optimization in cloud environments as a multi-objective optimization problem with two objectives: execution time and monetary cost. In this thesis, we propose a baseline query optimizer system architecture for efficient multi-objective query optimization in a cloud-like environment. Components of this system are implemented, and the system serves as the basis for our experiments. Working with Apache Spark allows us to benefit from parallel processing and to gain useful insights into processing big data in a distributed, cloud-like environment. However, solving multi-objective query optimization problems with Spark comes with a significant limitation: Catalyst, the optimizer of Spark SQL, relies mostly on heuristics rather than cost-based estimations. As a result, it is difficult to enumerate and compare alternative query plans and to apply query optimization techniques that have been used successfully in relational databases. To overcome this limitation, we reimplemented a state-of-the-art cost model for Spark SQL from scratch to provide theoretical estimations of the costs of alternative query execution plans. Its accuracy is evaluated with large-scale experiments. We also present a formula, integrated into the cost model, that estimates the monetary cost of a query plan on Amazon EC2 from its execution time and the computing resources used.
The cost model and the formula allow us to provide solutions for multi-objective query optimization problems. After implementing a baseline query optimization system, we integrate a state-of-the-art query optimization technique, multi-objective parametric query optimization, into our system and examine its relevance in this setting, as the technique was originally evaluated in a relational database. In this technique, a query is modeled as a function of a set of parameters, which must be factors to which the optimization objectives are sensitive.	en
heal.advisorName	Καντερέ, Βασιλική	el
heal.committeeMemberName	Καντερέ, Βασιλική	el
heal.committeeMemberName	D'Orazio, Laurent	en
heal.committeeMemberName	Παπαβασιλείου, Συμεών	el
heal.academicPublisher	Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών	el
heal.academicPublisherID	ntua
heal.numberOfPages	145 σ.	el
heal.fullTextAvailability	false