Accelerating main memory query processing for data analytics
dc.contributor.advisor | Bhavsar, Virendrakumar
dc.contributor.advisor | Ray, Suprio
dc.contributor.author | Memarzia, Puya
dc.date.accessioned | 2023-03-01T16:49:14Z
dc.date.available | 2023-03-01T16:49:14Z
dc.date.issued | 2020
dc.date.updated | 2023-03-01T15:03:26Z
dc.description.abstract | Data analytics provides a way to understand and extract value from an ever-growing volume of data. The runtime of analytical queries is of critical importance, as fast results enhance decision making and improve user experience. Data analytics systems commonly utilize in-memory query processing techniques to achieve better throughput and lower latency. Although processing data that is already in main memory is considerably faster than disk-based query processing, this approach is hindered by limited memory bandwidth and cache capacity, resulting in the under-utilization of processing resources. Furthermore, the characteristics of the hardware, data, and workload can all have a major impact on execution time, and the best approach for a given application is not always clear. In this thesis, we address these issues by investigating ways to design more efficient algorithms and data structures. Our approach involves systematically applying application-level and system-level refinements that improve algorithm efficiency and hardware utilization. In particular, we conduct a comprehensive study of the effects of dataset skew and shuffling on hash join algorithms. We significantly improve join runtimes on skewed datasets by modifying the algorithm’s underlying hash table, and we then further improve performance by designing a novel hash table based on the concept of cuckoo hashing. Next, we present a six-dimensional analysis of in-memory aggregation that breaks down the variables that affect query runtime. As part of our evaluation, we investigate 13 different algorithms and data structures, including one that we developed specifically to excel at a popular query category. Based on our results, we produce a decision tree to help practitioners select the best approach based on aggregation workload characteristics. After that, we dissect the runtime impact of NUMA architectures on a wide variety of query workloads and present a methodology that can greatly improve query performance with minimal modifications to the source code. This approach involves systematically modifying the application’s thread placement, memory placement, and memory allocation, and reconfiguring the operating system. Lastly, we design a scalable query processing system that uses distributed in-memory data structures to store, index, and query spatio-temporal data, and we demonstrate the efficiency of our system by comparing it against other data systems.
dc.description.copyright | © Puya Memarzia, 2020
dc.format | text/xml | |
dc.format.extent | xix, 197 pages | |
dc.format.medium | electronic | |
dc.identifier.uri | https://unbscholar.lib.unb.ca/handle/1882/14526 | |
dc.language.iso | en_CA | |
dc.publisher | University of New Brunswick | |
dc.rights | http://purl.org/coar/access_right/c_abf2 | |
dc.subject.discipline | Computer Science | |
dc.title | Accelerating main memory query processing for data analytics | |
dc.type | doctoral thesis | |
thesis.degree.discipline | Computer Science | |
thesis.degree.fullname | Doctor of Philosophy | |
thesis.degree.grantor | University of New Brunswick | |
thesis.degree.level | doctoral | |
thesis.degree.name | Ph.D. |
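
The abstract above mentions a novel hash table based on the concept of cuckoo hashing, used to speed up hash joins on skewed data. The following C++ sketch shows only the textbook cuckoo hashing idea, assuming a fixed-capacity table of 64-bit keys and two hash functions derived from std::hash; it is not the data structure developed in the thesis.

// A minimal, generic sketch of cuckoo hashing (illustrative only -- not the
// hash table designed in the thesis). Each key has two candidate slots given
// by two hash functions; inserting into an occupied slot evicts the resident
// key, which is then pushed to its alternate slot, possibly cascading.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <stdexcept>
#include <utility>
#include <vector>

class CuckooTable {
public:
    explicit CuckooTable(std::size_t capacity) : slots_(capacity) {}

    // Lookups probe at most two slots: a constant worst case that is
    // attractive for probe-heavy operators such as hash joins.
    bool contains(std::uint64_t key) const {
        return slots_[h1(key)] == key || slots_[h2(key)] == key;
    }

    void insert(std::uint64_t key) {
        if (contains(key)) return;
        std::size_t pos = h1(key);
        for (std::size_t kick = 0; kick < kMaxKicks; ++kick) {
            if (!slots_[pos]) { slots_[pos] = key; return; }
            std::swap(key, *slots_[pos]);                 // evict the resident key
            pos = (pos == h1(key)) ? h2(key) : h1(key);   // send it to its alternate slot
        }
        // A production table would grow or rehash here instead of throwing.
        throw std::runtime_error("cuckoo displacement cycle");
    }

private:
    static constexpr std::size_t kMaxKicks = 64;
    std::vector<std::optional<std::uint64_t>> slots_;

    // Two cheap derived hash functions; real implementations would use
    // independent, higher-quality hashes.
    std::size_t h1(std::uint64_t k) const { return std::hash<std::uint64_t>{}(k) % slots_.size(); }
    std::size_t h2(std::uint64_t k) const { return std::hash<std::uint64_t>{}(k ^ 0x9E3779B97F4A7C15ULL) % slots_.size(); }
};

The appeal of this scheme for join probing is the bounded lookup cost: every search touches at most two slots, regardless of how skewed the key distribution is.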
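
The abstract also describes a methodology that improves query performance on NUMA machines by adjusting thread placement, memory placement, and memory allocation. The C++ sketch below shows one common way to express such placement on Linux with libnuma (compile with -lnuma -pthread); the one-worker-per-node layout and the 64 MiB buffer size are illustrative assumptions, not the thesis's methodology.

// A minimal sketch (not the thesis's code) of NUMA-aware thread and memory
// placement on Linux using libnuma: each worker thread is bound to one NUMA
// node and its working buffer is allocated on that same node, so the
// thread's memory accesses stay node-local.
#include <numa.h>     // numa_available, numa_run_on_node, numa_alloc_onnode
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <thread>
#include <vector>

static void worker(int node, std::size_t buffer_bytes) {
    // Restrict this thread to the CPUs of the given node.
    numa_run_on_node(node);
    // Allocate the thread's working buffer from the same node's memory.
    void* buf = numa_alloc_onnode(buffer_bytes, node);
    if (!buf) { std::perror("numa_alloc_onnode"); return; }
    std::memset(buf, 0, buffer_bytes);   // touch the buffer so pages are faulted in
    // ... build or probe a node-local partition here ...
    numa_free(buf, buffer_bytes);
}

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    const int nodes = numa_num_configured_nodes();
    std::vector<std::thread> threads;
    for (int node = 0; node < nodes; ++node)
        threads.emplace_back(worker, node, std::size_t{64} << 20);  // 64 MiB per node
    for (auto& t : threads) t.join();
    return 0;
}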