
How to speed up your data analysis by up to 70% with Apache Spark




TLDR

You can very easily implement Apache Spark in your code even if you don't have access to a Spark cluster and still get some of the advantages of Spark. You just have to install the pyspark package (which also installs Spark itself), load the tables from your database into Spark, compile your queries to raw SQL, and run them with Spark SQL.


Introduction

One of the typical struggles of AI and data science projects is handling huge datasets, whether they concern business analytics, the processing of medical data, or any other field.

In today's data-driven world, businesses and organizations are grappling with an unprecedented influx of data. From customer transactions to social media interactions, the volume of data generated daily is staggering. While this abundance of data holds immense potential for insights and innovation, it also presents significant challenges, particularly when it comes to processing and analyzing large datasets using traditional tools like SQL.


At K2 we had to face this challenge in one of our projects for Media Press. Media Press is the world's leading provider of localized European metadata for movies and TV shows. Given the very wide range of data, from movie titles through actors and directors to actors' birth dates and biographies, the complexity of the data structure is huge.

That means we have to deal with both high complexity and high volume of data. One key point of the project was to provide analysis of already existing records in relation to many other records. This meant dozens of SQL queries with multiple joins and subqueries. At the beginning, the implementation was meant to involve only a few of Media Press's data providers, but as the project evolved, it became clear that the tool needed to be scaled roughly 100 times to handle all of the data providers. It was obvious that ordinary SQL queries executed sequentially would not be enough to process the data in a reasonable time, so we decided to explore tools capable of handling big data. In this article we describe how we used Apache Spark to handle the big data challenge in our project.


Why Apache Spark?


Rich APIs

Apache Spark is an open-source distributed computing system that provides an interface for programming entire clusters. It was developed in 2009 at the AMPLab at UC Berkeley and is designed to be fast and general-purpose, with APIs in Java, Scala, Python, and R, making it a powerful tool for processing large datasets and performing complex analytics tasks. The main advantage of Spark that needs to be mentioned is its speed, as some operations can be up to 100 times faster than Hadoop MapReduce.


Resilient Distributed Datasets (RDDs)

The main building block of Spark, the Resilient Distributed Dataset, provides further advantages. Because the data is distributed across multiple blocks, RDDs can be rebuilt in case of a worker node failure, which gives Spark its fault tolerance. The data is also kept in memory instead of on disk, which allows much faster execution; when memory fills up, the data can spill over to disk and still be processed. Another feature connected to the distributed nature of RDDs is parallel processing, as each block of data can be processed separately.


Structured Data Handling

Spark provides a rich set of APIs for working with RDDs, including transformations (such as map and filter) and actions (such as reduce). Spark also provides high-level APIs for working with structured data, such as DataFrames and Datasets, which offer a more user-friendly interface.
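
A minimal PySpark sketch (run locally, with made-up numbers) illustrating the transformation/action split and caching an RDD so that it can spill from memory to disk:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("RDDExample").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11))         # distribute the data across partitions
squares = numbers.map(lambda x: x * x)         # transformation: square each element (lazy)
squares.persist(StorageLevel.MEMORY_AND_DISK)  # cache in memory, spill to disk if it doesn't fit
total = squares.filter(lambda x: x % 2 == 0).reduce(lambda a, b: a + b)  # action: triggers execution
print(total)  # 220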


Ease of implementation

While one can take full advantage of Spark mainly when having access to a cluster of machines, it can also be run in standalone mode on a single machine. In the latter case, Spark can be used for development and testing purposes, or for small-scale data processing tasks.
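
For illustration, a minimal single-machine session; the "local[*]" master URL tells Spark to run locally, using one worker thread per CPU core:

from pyspark.sql import SparkSession

# "local[*]" runs Spark on this machine, with as many worker threads as CPU cores.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("LocalSpark")
    .getOrCreate()
)

print(spark.range(1_000_000).count())  # quick sanity check, prints 1000000
spark.stop()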


Implementation


So, now that we can see that Spark provides many advantages, let's discuss how easily it can be implemented in an ongoing project, as happened in our case.

There are two key points that need to be considered when implementing Spark in a project: the installation and the actual usage of Spark with your data.

The technology stack of our BI project included a PostgreSQL database and SQLAlchemy as the ORM. Here came one of the clear strong points of Spark that I personally appreciate the most: its astonishing ease of use and seamless integration with Python.


For usage with Python, Spark of course provides PySpark, the Python API for Spark. The package not only provides the API, it also bundles Spark itself, so you can start using it on your machine right away.


pip install pyspark

After installation, you can initialize Spark:

from pyspark.sql import SparkSession

def init_spark(self):
    # Create (or reuse) a SparkSession with the PostgreSQL JDBC driver on the classpath
    # and memory limits suited to the workload.
    self.spark = SparkSession.builder \
        .appName("SparkImplementationApp") \
        .config("spark.jars.packages", "org.postgresql:postgresql:42.6.0") \
        .config("spark.driver.memory", "10g") \
        .config("spark.executor.memory", "128g") \
        .getOrCreate()


Of course, the best-case scenario is to have a cluster of machines, but for medium-sized datasets, using Spark on a single machine is also doable.

In this approach, Spark comes just off the shelf: running a standalone worker requires just a few lines of code. The dataset also needs to be loaded, but when reading data from a database, Spark provides a very handy method to load a database table into a pandas-like DataFrame with only a few more lines of code.

When using the table multiple times, it's also best to cache it in memory:


from pyspark import StorageLevel

def spark_prepare_data(self) -> None:
    tables = self.get_existing_tables()
    for table in tables:
        df = self.spark_read_table(table)     # load the table into a Spark DataFrame
        df.createOrReplaceTempView(table)     # expose it to Spark SQL under the table's name
        df.persist(StorageLevel.MEMORY_ONLY)  # keep it cached in memory between queries
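
The spark_read_table helper is not shown in the snippet above; a minimal sketch, assuming the PostgreSQL connection details are stored on the class (jdbc_url, db_user and db_password are hypothetical attribute names), could look like this:

def spark_read_table(self, table: str):
    # Hypothetical helper: load one PostgreSQL table into a Spark DataFrame over JDBC.
    # jdbc_url would be something like "jdbc:postgresql://localhost:5432/mydb".
    return (
        self.spark.read.format("jdbc")
        .option("url", self.jdbc_url)
        .option("dbtable", table)
        .option("user", self.db_user)
        .option("password", self.db_password)
        .option("driver", "org.postgresql.Driver")  # shipped via spark.jars.packages above
        .load()
    )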


That covers installation and loading of data. Now let's move to the actual usage of Spark with the data.


When we tried using Spark, we already had a huge, nested structure of data models, which were neatly organized in SQLAlchemy. All queries were also generated using SQLAlchemy.

Fortunately, we could very easily convert the SQLAlchemy queries to Spark queries, as queries built with SQLAlchemy can be compiled to raw SQL strings and then executed in Spark using Spark SQL. It needs to be mentioned that not every query can be converted directly, but in our case, all of them could be. This way we kept the same data processing logic and the advantages of a data model built with an ORM, while gaining Spark's power.
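
As an illustration, a minimal sketch of the compilation step, assuming an SQLAlchemy ORM query object named query built elsewhere in the project and the PostgreSQL dialect:

from sqlalchemy.dialects import postgresql

# Render the query as a raw SQL string with parameter values inlined,
# so that it can be passed directly to Spark SQL.
compiled_query = str(
    query.statement.compile(
        dialect=postgresql.dialect(),
        compile_kwargs={"literal_binds": True},
    )
)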


Running the query then requires just a single call on the Spark instance:


result = self.spark.sql(compiled_query)
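
The returned result is a Spark DataFrame; when the rows are needed on the driver side, it can be materialized, for example:

rows = result.collect()  # bring the rows back to the driver as a list of Row objects
# or, for further analysis in pandas:
# pdf = result.toPandas()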


Results

We compared the time performance of the original approach, i.e. queries executed on the PostgreSQL server, with Spark.

The time difference between trials 1 and 2 is due to different amounts of data being processed. However, the data in each trial was the same for both tested approaches.

Trial no.          | Queries executed on the db server | Spark   | Speedup
1 (single thread)  | 0:08:02                           | 0:06:28 | 19.50%
2 (single thread)  | 1:29:32                           | 1:15:40 | 15.48%
3 (multi-threaded) | 0:48:24                           | 0:16:58 | 64.94%
4 (multi-threaded) | 0:45:41                           | 0:17:48 | 61.03%
5 (multi-threaded) | 0:44:15                           | 0:10:59 | 75.17%
6 (multi-threaded) | 0:44:24                           | 0:11:20 | 74.47%

In our case, using Spark resulted in a speed-up of between 15% and over 75%, depending on the amount of data and on whether the data was processed in multiple threads. The actual queries were a lot faster, but loading the data also takes some time. On the other hand, Spark requires access to a huge amount of memory, which basically means a trade-off between speed and memory consumption.


The most significant difference between the time performance of Spark and of the queries run on the PostgreSQL database can be seen when multithreading is used in both approaches.


But why is the difference so big if we used Spark only in standalone mode? This is due to the fact that using Spark practically means transferring both the data and the processing to the machine on which the Spark cluster is running. Connecting to the database and executing queries using psycopg2 means that the queries are executed on the server side and only the results are sent back to the client. So we are effectively moving the place of computation from the database server to the Spark cluster.


One might ask why we even use Spark in standalone mode if we could use, for example, pandas and consume a similar amount of resources.

One obvious answer is that Spark can easily be scaled horizontally in the future, which is not the case with pandas; pandas can only be scaled vertically, by using a more powerful machine.


But let's focus on the time performance.

During our investigation of Spark, we compared performance on a simple example: calculating π with the Monte Carlo method, since the number of iterations, and thus the workload, can easily be increased or decreased.
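
For reference, a minimal sketch of the Spark variant of this calculation, close to the classic π example from the Spark documentation (the sample count and partition count below are arbitrary):

import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("MonteCarloPi").getOrCreate()

n = 10_000_000  # number of random samples; scale this up or down to change the workload

def inside(_):
    # Draw a random point in the unit square and check whether it lands inside the quarter circle.
    x, y = random.random(), random.random()
    return x * x + y * y <= 1.0

# Spread the samples over partitions and count the hits in parallel.
hits = spark.sparkContext.parallelize(range(n), 8).filter(inside).count()
print("Pi is roughly", 4.0 * hits / n)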


We implemented the Monte Carlo method using both approaches and compared the results. The diagrams below are presented in logarithmic and linear scale.

[Comparison chart: Monte Carlo π calculation time, pandas vs. Spark, in logarithmic and linear scale]

We see that the time performance is similar, with Spark being slightly better in the case of a larger number of iterations.


But the key difference is not visible on the charts. While running the calculations with the largest number of iterations, pandas crashed with an Out Of Memory error, whereas Spark was able to handle it without any problems.


Of course, the main cause of the pandas failure was the amount of available resources, but from a top-down perspective, Spark could simply handle it and pandas couldn't.

And at the end of the day, the ability to handle the workload can be the most important factor. Even in this simple example, Spark showed its fault tolerance, which is crucial for Big Data processing.


Conclusion

Both the installation and the execution of data processing make Spark a remarkably facade-like tool, where injecting a few lines of code at two steps of the data flow can make a huge difference in the performance of the whole system.


To sum up, we hope this article has shown that Spark is a powerful tool and, at the same time, that it is easy to start using right away, even in standalone mode, thanks to its really high-level API. While the main advantage of Spark is its speed, other features, such as fault tolerance, scalability, and ease of use, also prove that it is a great tool for processing Big Data.


FAQ

What is Apache Spark and why is it important for AI and data science projects?

Apache Spark is an open-source distributed computing system designed for fast and general-purpose cluster computing. It is particularly important in AI and data science for its ability to process large datasets quickly and efficiently, which is crucial for analytics and machine learning tasks.


How does Apache Spark enhance the processing of large datasets?

Spark enhances data processing by using in-memory caching and optimized query execution which makes it faster than traditional disk-based processing. Its ability to perform data processing operations in parallel across a cluster also significantly speeds up tasks.


Can Apache Spark be used without a cluster?

Yes, Apache Spark can run in standalone mode on a single machine. This setup is ideal for development, testing, or small-scale data processing tasks, allowing users to leverage Spark's capabilities without the need for a distributed environment.


What are the key advantages of using Apache Spark over traditional SQL databases?

The key advantages include faster processing speeds due to in-memory computation, fault tolerance, scalability, and the ability to handle complex analytics operations like machine learning and real-time data processing, which are not typically feasible with traditional SQL databases.


How does PySpark facilitate the integration of Spark into Python projects?

PySpark provides a Python API for Spark that makes it easy to integrate Spark's capabilities into Python applications. It allows Python developers to use Spark's distributed data processing capabilities seamlessly alongside other Python libraries.


What are some common challenges when integrating Spark into existing projects and how can they be addressed?

Common challenges include data compatibility issues, scaling complexities, and the learning curve associated with understanding Spark's architecture and API. These can be addressed by thorough planning, leveraging Spark's comprehensive documentation, and starting integration with smaller subsets of data.



 
 
 
