Accelerating main memory query processing for data analytics

dc.contributor.advisor: Bhavsar, Virendrakumar
dc.contributor.advisor: Ray, Suprio
dc.contributor.author: Memarzia, Puya
dc.date.accessioned: 2023-03-01T16:49:14Z
dc.date.available: 2023-03-01T16:49:14Z
dc.date.issued: 2020
dc.date.updated: 2023-03-01T15:03:26Z
dc.description.abstract: Data analytics provides a way to understand and extract value from an ever-growing volume of data. The runtime of analytical queries is of critical importance, as fast results enhance decision making and improve user experience. Data analytics systems commonly utilize in-memory query processing techniques to achieve better throughput and lower latency. Although processing data that is already in main memory is considerably faster than disk-based query processing, this approach is hindered by limited memory bandwidth and cache capacity, resulting in the under-utilization of processing resources. Furthermore, the characteristics of the hardware, data, and workload can all have a major impact on execution time, and the best approach for a given application is not always clear. In this thesis, we address these issues by investigating ways to design more efficient algorithms and data structures. Our approach involves the systematic application of application-level and system-level refinements that improve algorithm efficiency and hardware utilization. In particular, we conduct a comprehensive study on the effects of dataset skew and shuffling on hash join algorithms. We significantly improve join runtimes on skewed datasets by modifying the algorithm’s underlying hash table. We then further improve performance by designing a novel hash table based on the concept of cuckoo hashing. Next, we present a six-dimensional analysis of in-memory aggregation that breaks down the variables that affect query runtime. As part of our evaluation, we investigate 13 different algorithms and data structures, including one that we specifically developed to excel at a popular query category. Based on our results, we produce a decision tree to help practitioners select the best approach based on aggregation workload characteristics. After that, we dissect the runtime impact of NUMA architectures on a wide variety of query workloads and present a methodology that can greatly improve query performance with minimal modifications to the source code. This approach involves systematically modifying the application’s thread placement, memory placement, and memory allocation, and reconfiguring the operating system. Lastly, we design a scalable query processing system that uses distributed in-memory data structures to store, index, and query spatio-temporal data, and demonstrate the efficiency of our system by comparing it against other data systems.
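The abstract above mentions a join hash table built on the concept of cuckoo hashing. As a rough illustration of that concept only, the following is a minimal C++ sketch of a cuckoo-style table with two candidate slots per key; the class name, slot layout, and hash functions are assumptions made for this example and are not taken from the thesis.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Each key has two candidate slots, one per hash function. Inserts displace
// the occupant of a full slot and push it to its alternate slot, so lookups
// never probe more than two locations.
class CuckooTable {
public:
    explicit CuckooTable(std::size_t capacity) : slots_(capacity) {}

    // Insert by evicting occupants between their two candidate slots,
    // bounded by kMaxKicks to avoid endless eviction cycles.
    bool insert(std::uint64_t key, std::uint64_t value) {
        std::size_t idx = h1(key);
        for (int kicks = 0; kicks < kMaxKicks; ++kicks) {
            if (!slots_[idx].occupied) {
                slots_[idx].key = key;
                slots_[idx].value = value;
                slots_[idx].occupied = true;
                return true;
            }
            std::swap(key, slots_[idx].key);      // evict the current occupant
            std::swap(value, slots_[idx].value);
            idx = (idx == h1(key)) ? h2(key) : h1(key);  // send it to its other slot
        }
        return false;  // caller should grow the table and rehash
    }

    // Probes touch at most two slots, regardless of load or key distribution.
    std::optional<std::uint64_t> find(std::uint64_t key) const {
        for (std::size_t idx : {h1(key), h2(key)}) {
            if (slots_[idx].occupied && slots_[idx].key == key) {
                return slots_[idx].value;
            }
        }
        return std::nullopt;
    }

private:
    struct Slot {
        std::uint64_t key = 0;
        std::uint64_t value = 0;
        bool occupied = false;
    };
    static constexpr int kMaxKicks = 32;

    std::size_t h1(std::uint64_t k) const {
        return std::hash<std::uint64_t>{}(k) % slots_.size();
    }
    std::size_t h2(std::uint64_t k) const {
        return std::hash<std::uint64_t>{}(k ^ 0x9e3779b97f4a7c15ULL) % slots_.size();
    }

    std::vector<Slot> slots_;
};
```

Because every probe inspects at most two slots, lookup cost stays bounded even when the build-side keys are heavily skewed, which is one reason cuckoo-style tables are attractive for hash joins.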
dc.description.copyright: © Puya Memarzia, 2020
dc.format: text/xml
dc.format.extent: xix, 197 pages
dc.format.medium: electronic
dc.identifier.uri: https://unbscholar.lib.unb.ca/handle/1882/14526
dc.language.iso: en_CA
dc.publisher: University of New Brunswick
dc.rights: http://purl.org/coar/access_right/c_abf2
dc.subject.discipline: Computer Science
dc.title: Accelerating main memory query processing for data analytics
dc.type: doctoral thesis
thesis.degree.discipline: Computer Science
thesis.degree.fullname: Doctor of Philosophy
thesis.degree.grantor: University of New Brunswick
thesis.degree.level: doctoral
thesis.degree.name: Ph.D.
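The abstract also describes improving query performance on NUMA architectures by adjusting thread placement, memory placement, and memory allocation. The sketch below illustrates that general idea using libnuma; the node choice, buffer size, and surrounding setup are assumptions made for this example, not the methodology evaluated in the thesis.

```cpp
#include <numa.h>    // libnuma; link with -lnuma
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "libnuma reports no NUMA support on this machine\n");
        return 1;
    }

    const int node = 0;                          // co-locate work and data on node 0
    numa_run_on_node(node);                      // thread placement: bind the calling thread
    const std::size_t bytes = 64ull << 20;       // hypothetical 64 MiB working buffer
    void* buf = numa_alloc_onnode(bytes, node);  // memory placement: allocate on the same node
    if (buf == nullptr) {
        return 1;
    }
    std::memset(buf, 0, bytes);                  // touch pages so they are physically backed locally

    // ... run the query operator's scan/build/probe over buf ...

    numa_free(buf, bytes);
    return 0;
}
```

Build with something like g++ -O2 numa_sketch.cpp -lnuma. Keeping a thread and the memory it scans on the same node avoids remote memory accesses, which is the basic effect the abstract's placement methodology exploits.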

Files

Original bundle: item.pdf (4.55 MB, Adobe Portable Document Format)