Improving PySpark performance with cross-language optimization

University of New Brunswick


Big data is a rapidly growing field, and Apache Spark is one of its most commonly used frameworks. Among Spark's APIs, Spark for Python (PySpark) is usually the preferred choice of the data science community because of its simplicity and versatility. PySpark is built on top of Java Spark and operates in two runtimes, Python and the Java Virtual Machine (JVM). However, PySpark is often outperformed by its alternatives in various applications. Previous experiments identified the bottleneck in PySpark as the data marshalling across the language boundary between Python and Java, specifically the serialization and deserialization steps. This project aims to alleviate this issue by implementing a specialized serializer for PySpark. The serializer is first built in C++. To allow the serialization and deserialization code to be generated dynamically according to the schema of the dataset, Just-in-Time (JIT) compilation is used: OMR JitBuilder compiles the code at runtime based on user input. The compiled code can then be assigned to ordinary functions and reused multiple times, saving compilation and binding effort. After the serializer is built in C++, wrappers for Python and Java are implemented. On the Python side, data is transferred between C++ and Python using ctypes and Python.h, a header file from the Python development package. The ctypes library allows Python to call into C++, and the header file enables C++ to access and modify Python objects. On the Java side, the Java Native Interface (JNI), a mechanism that lets Java pass data to and call native code, is used. The JitBuilder serializer is then evaluated against existing serializers, in particular pickle, BSON, and Protocol Buffers. The dataset is the Part table from the TPC-H benchmark, generated at scale factors of 1, 2, 4, and 8, corresponding to 24.1, 48.4, 96.9, and 194.4 MB.
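The idea of generating serialization code from the dataset schema can be illustrated with a minimal pure-Python sketch. It is only an analogue: it builds Python source text and compiles it with `exec`, whereas the project uses OMR JitBuilder to emit native code. The schema subset, the type tags, the wire layout, and the names `make_serializer` and `make_deserializer` are assumptions made for this illustration and are not taken from the thesis.

```python
import struct

# Hypothetical schema description: (column name, type tag). The column names
# follow the TPC-H Part table, but this subset and the type tags are assumed.
PART_SCHEMA = [
    ("p_partkey", "int64"),
    ("p_name", "str"),
    ("p_size", "int32"),
    ("p_retailprice", "float64"),
]

# struct format characters for the fixed-width type tags used above.
_FIXED = {"int32": "i", "int64": "q", "float64": "d"}


def make_serializer(schema):
    """Generate and compile a serialize(row) function specialized to schema."""
    lines = ["def serialize(row):", "    parts = []"]
    for idx, (_name, kind) in enumerate(schema):
        if kind == "str":
            # Strings are written as a 4-byte length followed by UTF-8 bytes.
            lines.append(f"    data = row[{idx}].encode('utf-8')")
            lines.append("    parts.append(_pack('<I', len(data)))")
            lines.append("    parts.append(data)")
        else:
            lines.append(f"    parts.append(_pack('<{_FIXED[kind]}', row[{idx}]))")
    lines.append("    return b''.join(parts)")
    namespace = {"_pack": struct.pack}
    exec("\n".join(lines), namespace)  # compile once, reuse for every row
    return namespace["serialize"]


def make_deserializer(schema):
    """Generate and compile the matching deserialize(buf) function."""
    lines = ["def deserialize(buf):", "    offset = 0", "    out = []"]
    for _name, kind in schema:
        if kind == "str":
            lines.append("    (n,) = _unpack_from('<I', buf, offset)")
            lines.append("    offset += 4")
            lines.append("    out.append(bytes(buf[offset:offset + n]).decode('utf-8'))")
            lines.append("    offset += n")
        else:
            fmt = "<" + _FIXED[kind]
            lines.append(f"    out.append(_unpack_from('{fmt}', buf, offset)[0])")
            lines.append(f"    offset += {struct.calcsize(fmt)}")
    lines.append("    return tuple(out)")
    namespace = {"_unpack_from": struct.unpack_from}
    exec("\n".join(lines), namespace)
    return namespace["deserialize"]


# The generated functions are ordinary callables and can be reused per row.
serialize_part = make_serializer(PART_SCHEMA)
deserialize_part = make_deserializer(PART_SCHEMA)

row = (1, "goldenrod lavender", 7, 901.0)
encoded = serialize_part(row)
assert deserialize_part(encoded) == row
```

Because the schema is resolved when the function is generated, the per-row code contains no type dispatch. This is the property the specialized serializer relies on, here shown at the level of Python source rather than native code.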
The serializers are tested in three scenarios: serialization and deserialization entirely within the Python runtime, serialization and deserialization entirely within the JVM, and serialization in one language followed by deserialization of the same data in the other. The results show that the JitBuilder serializer has competitive performance and in some cases outperforms the other options. After the JitBuilder serializer and its Python and Java wrappers are implemented and evaluated, a prototype is built that integrates the serializer into PySpark. To do this, the function calls that serialize and deserialize data in the Spark source code are modified. The modified PySpark is shown to perform better than the original PySpark in common data applications.
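The first evaluation scenario, serialization and deserialization within the Python runtime only, can be sketched as a small timing harness. This is not the benchmark code used in the project. The rows are synthetic placeholders rather than TPC-H data, `json` stands in as a second standard-library candidate in place of BSON and Protocol Buffers, and the helper names `make_rows` and `time_round_trip` are hypothetical.

```python
import json
import pickle
import time


def make_rows(n):
    """Build n synthetic rows; these are placeholders, not TPC-H data."""
    return [[i, f"part-{i}", i % 50, 900.0 + i] for i in range(n)]


def time_round_trip(dumps, loads, rows, repeats=3):
    """Return (best seconds over repeats, decoded rows) for one serializer."""
    best = float("inf")
    decoded = None
    for _ in range(repeats):
        start = time.perf_counter()
        payload = dumps(rows)
        decoded = loads(payload)
        best = min(best, time.perf_counter() - start)
    return best, decoded


# Each candidate is a (dumps, loads) pair with a bytes payload in between.
CANDIDATES = {
    "pickle": (pickle.dumps, pickle.loads),
    "json": (lambda rows: json.dumps(rows).encode("utf-8"), json.loads),
}

rows = make_rows(10_000)
results = {}
for name, (dumps, loads) in CANDIDATES.items():
    seconds, decoded = time_round_trip(dumps, loads, rows)
    assert decoded == rows  # a timing only counts if the round trip is lossless
    results[name] = seconds
    print(f"{name}: {seconds:.4f} s")
```

Any other serializer can be added to `CANDIDATES` as a `(dumps, loads)` pair. The cross-language scenario would additionally need a JVM-side reader for the same byte layout, which this sketch does not cover.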