Parallel and in-memory big spatial data processing systems and benchmarking
University of New Brunswick
With the accelerated growth in the volume of spatial data generated from a wide variety of sources, the need for efficient storage, retrieval, processing, and analysis of spatial data is more important than ever. Hence, spatial data processing systems have become an important field of research. Although traditional relational database systems provide spatial functionality (such as PostgreSQL with PostGIS), their lack of parallelism and their I/O bottlenecks make them inefficient for running compute-intensive spatial queries on large datasets. In recent years, researchers have proposed a number of big spatial data systems. These systems can be roughly categorized into disk-based systems built on Apache Hadoop and in-memory systems based on Apache Spark. The features supported by these systems vary widely. However, there has not been any comprehensive evaluation of these systems in terms of performance, scalability, and functionality. To address this need, this thesis proposes a benchmark to evaluate big spatial data systems. It investigates the present status of big spatial data systems by conducting a comprehensive feature analysis and performance evaluation of a few representative systems. The Hadoop- and Spark-based big spatial data systems are distributed, scalable, and able to exploit the parallelism of today’s multi-core and many-core architectures. However, most of them are immature, unstable, and difficult to extend, and they lack an efficient query language such as SQL. In this work, a disk-based system, Parallax, is introduced as a parallel big spatial database system. It integrates the powerful spatial features of PostgreSQL/PostGIS with the distributed persistent storage of Alluxio. Host-specific data partitioning and parallel queries on the local data of each node ensure maximum utilization of main memory, disk storage, and CPU.
This thesis also introduces an in-memory system, Spatial Ignite, which extends Apache Ignite with spatial support. Spatial Ignite incorporates a spatial library that contains all the OGC-compliant join predicates and spatial analysis functions. Along with the query parallelism and collocated query processing of Ignite, the integrated spatial data partitioning techniques improve the performance of Spatial Ignite. The evaluation shows that Spatial Ignite performs better than the Hadoop- and Spark-based systems.
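
As an illustration only, the general idea of spatial data partitioning followed by partition-local join processing can be sketched in a few lines of Python. This is not the Parallax or Spatial Ignite implementation: the uniform grid, the bounding-box representation, and the names `grid_cells`, `partition`, `partitioned_join`, and `cell_size` are all assumptions made for the sketch, and no PostGIS, Alluxio, or Ignite API is used.

```python
from collections import defaultdict
from itertools import product

# Assumption: each geometry is reduced to a bounding box
# (min_x, min_y, max_x, max_y); real systems refine with exact geometry.

def intersects(a, b):
    """Return True when two axis-aligned rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def grid_cells(rect, cell_size):
    """Return every (col, row) cell of a uniform grid that the rectangle overlaps."""
    min_col, min_row = int(rect[0] // cell_size), int(rect[1] // cell_size)
    max_col, max_row = int(rect[2] // cell_size), int(rect[3] // cell_size)
    return product(range(min_col, max_col + 1), range(min_row, max_row + 1))

def partition(rects, cell_size):
    """Map each grid cell to the ids of the rectangles that overlap it."""
    cells = defaultdict(list)
    for rid, rect in rects.items():
        for cell in grid_cells(rect, cell_size):
            cells[cell].append(rid)
    return cells

def partitioned_join(left, right, cell_size):
    """Join two datasets on 'intersects', comparing only within the same cell.

    Each cell could be handled by a different node or thread, since it
    needs only the rectangles assigned to that cell. The result is a set,
    so pairs found in more than one cell are reported once.
    """
    left_cells = partition(left, cell_size)
    right_cells = partition(right, cell_size)
    result = set()
    for cell, left_ids in left_cells.items():
        for lid in left_ids:
            for rid in right_cells.get(cell, ()):
                if intersects(left[lid], right[rid]):
                    result.add((lid, rid))
    return result
```

Calling `partitioned_join(left, right, cell_size)` with two dictionaries that map ids to bounding boxes returns the set of intersecting id pairs. The choice of `cell_size` trades replication of large rectangles across cells against the number of candidate pairs compared inside each cell.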