Browsing by Author "Bhavsar, Virendrakumar"
Item: Accelerating main memory query processing for data analytics (University of New Brunswick, 2020)
Memarzia, Puya; Bhavsar, Virendrakumar; Ray, Suprio
Data analytics provides a way to understand and extract value from an ever-growing volume of data. The runtime of analytical queries is of critical importance, as fast results enhance decision making and improve user experience. Data analytics systems commonly utilize in-memory query processing techniques to achieve better throughput and lower latency. Although processing data that is already in main memory is decidedly faster than disk-based query processing, this approach is hindered by limited memory bandwidth and cache capacity, resulting in the under-utilization of processing resources. Furthermore, the characteristics of the hardware, data, and workload can all play a major role in determining execution time, and the best approach for a given application is not always clear. In this thesis, we address these issues by investigating ways to design more efficient algorithms and data structures. Our approach involves the systematic application of application-level and system-level refinements that improve algorithm efficiency and hardware utilization. In particular, we conduct a comprehensive study on the effects of dataset skew and shuffling on hash join algorithms. We significantly improve join runtimes on skewed datasets by modifying the algorithm’s underlying hash table. We then further improve performance by designing a novel hash table based on the concept of cuckoo hashing. Next, we present a six-dimensional analysis of in-memory aggregation that breaks down the variables that affect query runtime. As part of our evaluation, we investigate 13 different algorithms and data structures, including one that we specifically developed to excel at a popular query category.
Based on our results, we produce a decision tree to help practitioners select the best approach based on aggregation workload characteristics. After that, we dissect the runtime impact of NUMA architectures on a wide variety of query workloads and present a methodology that can greatly improve query performance with minimal modifications to the source code. This approach involves systematically modifying the application’s thread placement, memory placement, and memory allocation, and reconfiguring the operating system. Lastly, we design a scalable query processing system that uses distributed in-memory data structures to store, index, and query spatio-temporal data, and demonstrate the efficiency of our system by comparing it against other data systems.

Item: Multiword expression identification using deep learning (University of New Brunswick, 2017)
Gharbieh, Waseem; Bhavsar, Virendrakumar; Cook, Paul
Multiword expressions combine words in various ways to produce phrases that have properties that are not predictable from the properties of their individual words or their normal mode of combination. There are many types of multiword expressions, including proverbs, named entities, and verb-noun combinations. In this thesis, we propose various deep learning models to identify multiword expressions and compare their performance to more traditional machine learning models and current multiword expression identification systems. We show that convolutional neural networks are able to outperform the state of the art, with the three-hidden-layer convolutional neural network performing best. To our knowledge, this is the first work that applies deep learning models to broad multiword expression identification.

Item: Parallel and in-memory big spatial data processing systems and benchmarking (University of New Brunswick, 2018)
Alam, Md. Mahbub; Ray, Suprio; Bhavsar, Virendrakumar
With the accelerated growth in the volume of spatial data generated from a wide variety of sources, the need for efficient storage, retrieval, processing, and analysis of spatial data is ever more important. Hence, spatial data processing systems have become an important field of research. Although traditional relational database systems provide spatial functionality (such as PostgreSQL with PostGIS), their lack of parallelism and their I/O bottlenecks make them inefficient at running compute-intensive spatial queries on large datasets. In recent times, a number of big spatial data systems have been proposed by researchers around the world. These systems can be roughly categorized into disk-based systems built on Apache Hadoop and in-memory systems based on Apache Spark. The features supported by these systems vary widely. However, there has not been any comprehensive evaluation study of these systems in terms of performance, scalability, and functionality. To address this need, this thesis proposes a benchmark to evaluate big spatial data systems. It investigates the present status of big spatial data systems by conducting a comprehensive feature analysis and performance evaluation of a few representative systems. The Hadoop- and Spark-based big spatial data systems are distributed, scalable, and able to exploit the parallelism of today’s multi-core/many-core architectures. However, most of them are immature, unstable, difficult to extend, and lack an efficient query language such as SQL. In this work, a disk-based system, Parallax, is introduced as a parallel big spatial database system. It integrates the powerful spatial features of PostgreSQL/PostGIS and the distributed persistent storage of Alluxio. The host-specific data partitioning and parallel query on local data in each node ensure the maximum utilization of main memory, disk storage, and CPU.
This thesis also introduces an in-memory system, Spatial Ignite, which extends Apache Ignite with spatial support. Spatial Ignite incorporates a spatial library that contains all the OGC-compliant join predicates and spatial analysis functions. Along with the query parallelism and collocated query processing of Ignite, the integrated spatial data partitioning techniques improve the performance of Spatial Ignite. The evaluation shows that Spatial Ignite performs better than Hadoop- and Spark-based systems.

Item: Prediction of regulatory networks for non-model organisms (University of New Brunswick, 2013)
Sharma, Rachita; Evans, Patricia; Bhavsar, Virendrakumar
Identification of gene regulatory networks is useful in understanding gene regulation in any organism. Some regulatory network information has already been determined experimentally and using statistical methods for model organisms, but much less has been identified for non-model organisms. The limited amount of data available for non-model organisms makes inference of regulatory networks difficult using the commonly used statistical methods. This thesis proposes a method to determine the regulatory links that can be mapped from a distant model organism to a non-model organism. Experiments are performed to map the regulatory network data of S. cerevisiae to A. thaliana and analyze the results. Mapping a regulatory network involves mapping the transcription factors and target genes from one genome to another. In the proposed method, different techniques for predicting transcription factors and target genes for the non-model organism using the available data for the model organism are compared and analyzed. The techniques that obtain the best results overall should be the ones chosen for these predictions. These predicted transcription factors and target genes are then integrated into predicted regulatory links for the non-model organism.
A set of rules is then defined on the gene expression experiments to identify the predicted regulatory links that are well supported. The very limited gene expression data available for the non-model organism is used to filter the predicted regulatory links according to these rules, eliminating the large number of false positives. Finally, the filtered regulatory links are tested against a large dataset of gene expression experiments to illustrate that correctly predicted regulatory links are obtained. The links discarded by filtering are also tested against the same gene expression dataset to illustrate the significance of this step in refining the results.

Item: Similarity of weighted trees (University of New Brunswick, 2019)
Yang, Lu; Bhavsar, Virendrakumar; Boley, Harold
Similarity computation and matching have become an important topic in information retrieval, case-based reasoning, data clustering, database integration, ontology alignment, image processing, natural language processing, schema matching, and e-Business/e-Learning. Given an object in an application domain, finding similar objects helps in obtaining solutions to problems in all of these areas. Objects can be represented by a set of key words/phrases. However, key words/phrases have limitations in representing complex object attributes. This thesis proposes node-labeled, arc-labeled, and arc-weighted tree representations for applications such as product/service descriptions in e-Business/e-Learning. Arc labels represent attributes of products/services. The targets of arcs can be leaf nodes or arbitrarily complex sub-trees representing product/service partonomies. Arc weights indicate relative importance/preference values on product/service attributes. We propose a tree similarity algorithm to compute the similarity of weighted trees, consisting of syntactic and semantic components.
The syntactic component of the tree similarity algorithm traverses a pair of trees top-down and aggregates similarity values bottom-up. We also propose a tree simplicity measure to compute the simplicity value of a single (sub-)tree. Tree simplicity is computed for each sub-tree missing in the other tree, and the simplicity value is used for similarity computation, where a simpler (less complex) sub-tree leads to a higher tree similarity. Weights are averaged to embody preferences from a pair of trees, e.g., buyer and seller trees in e-Business. We complement the syntactic component with a semantic global similarity, i.e., taxonomic class similarity for inner nodes, and local similarity, i.e., typed similarity for leaf nodes. Our tree similarity algorithm was applied to the Teclantic and eduSource projects for e-Business and e-Learning, respectively. Teclantic matches companies or investors in Atlantic Canada with researchers in various disciplines to share technologies. In eduSource, we match learning objects (i.e., courses) and learning object providers (i.e., course providers). Both projects provided a ranked list of search results based on a user’s requests. We carried out computational experiments on systematic variations of trees to analyze tree simplicity and similarity properties, and to compare our tree similarity algorithm with other tree similarity/distance algorithms in the literature. When compared to other algorithms, our tree similarity algorithm shows advantages in aggregating sub-tree similarity values to obtain the overall tree similarity value. Our tree simplicity algorithm, beyond its usage for missing sub-trees in our tree similarity algorithm, can also be used as a standalone algorithm to calculate the simplicity of a single tree.

Item: Tree structured data processing on GPUs (University of New Brunswick, 2014)
Lu, Yifan; Bhavsar, Virendrakumar
Tree-structured data are used in many applications.
In order to reduce the computing time for processing large tree-structured data sets, parallel processing has been used. Recently, research has been done on parallel computing of tree-structured data on graphics processing units (GPUs). However, tree data structures on GPUs are commonly designed to store a particular kind of tree and support only limited types of tree traversals. In this thesis, we propose a tree data structure that can store many types of trees and supports four common types of tree traversals: pre-order, post-order, in-order, and breadth-first. Therefore, most tree algorithms can be implemented on GPUs by using this data structure. We implemented a weighted similarity algorithm on an NVIDIA GPU to demonstrate the performance of this data structure. The results showed that this GPU application achieves a speedup of about 4000 over an application running on a single AMD Opteron CPU core.