Feb 7, 2024
Pandas 2.0 vs Polars (PvP)
Introduction
Recently I wrote a brief post on LinkedIn comparing Pandas and Polars. I decided to take it a step further and run some tests on my own system to provide a more detailed comparison.
Pandas and Polars are two powerful data manipulation libraries in Python. Pandas, a well-established library, provides flexible data structures to manipulate and analyze data. On the other hand, Polars, a relatively new library, offers a fast DataFrame implementation written in Rust and built on the Apache Arrow columnar memory format.
The choice between Pandas and Polars can significantly impact the efficiency, speed, and memory usage of your data operations. So, let’s dive in and explore these libraries further through some hands-on tests.
Pandas 2.0
Pandas is a software library written for the Python programming language for data manipulation and analysis. It provides data structures and functions needed to manipulate structured data. The name Pandas is derived from the term “panel data”, an econometrics term for multidimensional structured data sets.
Pandas 2.0, released on April 3, 2023, brought significant performance improvements, particularly in the areas of reading and writing data from various file formats. These enhancements help make data processing tasks faster and more efficient.
New Features in Pandas 2.0
Improved Extension Array Support: Pandas 2.0 has improved support for extension arrays, making it easier to add new types of arrays to Pandas.
PyArrow Support for DataFrames: DataFrames can now be backed by PyArrow, the Python bindings for Apache Arrow, a cross-language columnar format for in-memory data. This allows data to be shared efficiently between Pandas and other Arrow-aware tools and languages, often without the need to copy or convert it.
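Non-Nanosecond Datetime Resolution: Pandas 2.0 supports datetime resolutions other than nanoseconds (seconds, milliseconds, and microseconds). Coarser resolutions extend the representable date range well beyond the window of roughly 1677 to 2262 that nanosecond resolution allows, which helps with historical or far-future time series.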
Installing Optional Dependencies with pip extras: When installing Pandas using pip, sets of optional dependencies can also be installed by specifying extras, for example pip install "pandas[performance, aws]".
Index Can Now Hold Numpy Numeric dtypes: It is now possible to use any numpy numeric dtype in an Index. Previously it was only possible to use int64, uint64 & float64 dtypes.
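A few of these features can be tried in a short session. This is a minimal sketch, assuming Pandas 2.0 or later is installed; the values are made up for illustration, and the Arrow-backed part only runs if the optional pyarrow package is available.

```python
import numpy as np
import pandas as pd

# An Index can keep a small numeric dtype instead of being upcast to int64.
idx = pd.Index([1, 2, 3], dtype="int8")
print(idx.dtype)

# Second-resolution datetimes can hold dates outside the nanosecond range.
old_dates = pd.Series(
    np.array(["1000-01-01", "3000-01-01"], dtype="datetime64[s]")
)
print(old_dates.dtype)

# Arrow-backed columns need the optional pyarrow dependency.
try:
    import pyarrow  # noqa: F401

    arrow_ints = pd.Series([1, 2, None], dtype="int64[pyarrow]")
    print(arrow_ints.dtype)
except ImportError:
    arrow_ints = None  # pyarrow not installed; skip this part
```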
Polars
Polars is a DataFrame library implemented in Rust and Python. It is built to handle large data frames in a fast and efficient manner. It uses lazy execution and memory mapping to achieve high performance, and it is built on top of Apache Arrow for memory-efficient storage of data.
Benefits of Polars
Speed and performance: Polars is engineered with performance in mind. It leverages parallel processing and memory optimization techniques, allowing it to process large datasets significantly faster than traditional methods.
Data manipulation capabilities: Polars provides a comprehensive toolkit for data manipulation, encompassing essential operations such as filtering, sorting, grouping, joining, and aggregating data.
Expressive syntax: Polars employs a concise and intuitive syntax, making it easy to learn and use.
DataFrame and series structures: At the core of Polars are the DataFrame and Series structures, which provide a familiar and powerful abstraction for working with tabular data.
Lazy evaluation: Polars supports lazy evaluation, in which queries are examined and optimized before they run to improve performance and reduce memory consumption.
Lazy Execution in Polars
Lazy execution is a programming technique where the evaluation of expressions is delayed until their results are needed. This can lead to significant performance improvements by avoiding unnecessary computations and reducing memory usage.
In Polars' lazy API, when you write a query, it doesn’t get executed immediately. Instead, each operation is added to an internal query graph. This graph is then optimized before the code is executed. This means that Polars can rearrange and combine operations to minimize the amount of computation and memory usage.
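For example, if you load a dataset, transform it, and only then filter the rows, Polars can move the filter ahead of the other steps so that it runs as early as possible (predicate pushdown). It can also skip reading columns the query never uses (projection pushdown). This could significantly reduce the amount of data that needs to be processed, leading to faster execution times and lower memory usage.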
This is in contrast to Pandas, which uses eager execution. In eager execution, each operation is performed as soon as it is called. While this can make the code simpler and easier to debug, it can also lead to unnecessary computations and increased memory usage.
Constructing a Synthetic Dataset
The first step in our comparative analysis is the creation of a synthetic dataset. This dataset consists of a million rows and fifteen columns, ten of which are numerical and five are categorical. The numerical columns are populated with random integers between 0 and 100, while the categorical columns are filled with random selections from the list ‘abcde’.
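A dataset with this shape could be generated along the following lines. This is a sketch, not the exact script used for the tests: the column names (num_0, cat_0, and so on), the seed, and the assumption that 100 is included in the integer range are all mine.

```python
import numpy as np
import pandas as pd


def make_synthetic(n_rows: int = 1_000_000, seed: int = 0) -> pd.DataFrame:
    """Build ten integer columns and five categorical columns."""
    rng = np.random.default_rng(seed)
    data = {}
    for i in range(10):
        # integers() excludes the upper bound, so 101 yields values 0..100.
        data[f"num_{i}"] = rng.integers(0, 101, size=n_rows)
    for i in range(5):
        data[f"cat_{i}"] = rng.choice(list("abcde"), size=n_rows)
    return pd.DataFrame(data)


# A small sample keeps this quick; call make_synthetic() for the full size.
sample = make_synthetic(n_rows=1_000)
print(sample.shape)

# To benchmark CSV loading, the frame could be written out first, e.g.
# sample.to_csv("synthetic.csv", index=False)  # hypothetical file name
```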
Comparative Analysis: Code Execution and Performance Metrics
The code provided performs a series of operations using both Pandas and Polars, and measures the time and memory usage of each operation. The operations include loading data, aggregating data, filtering data, grouping data, and sorting data.
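One way to collect such measurements is a small helper that wraps each operation. The sketch below is an assumption about how this could be done, not the original benchmark code: measure is a hypothetical helper, and tracemalloc only sees memory allocated through Python's allocator, so it will under-report memory that Polars allocates in Rust. A process-level measure such as resident set size would be fairer for a cross-library comparison.

```python
import time
import tracemalloc

import pandas as pd


def measure(label, fn):
    """Run fn() once and report wall time and peak traced memory."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"label": label, "seconds": elapsed, "peak_bytes": peak, "result": result}


# Tiny illustrative frame; the real tests would use the synthetic dataset.
df = pd.DataFrame({"num_0": range(1_000), "cat_0": ["a", "b"] * 500})
stats = measure("pandas sort", lambda: df.sort_values("num_0", ascending=False))
print(stats["label"], f"{stats['seconds']:.4f}s", stats["peak_bytes"], "bytes")
```

Running each operation several times and taking the median would make the timings less sensitive to noise.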
Interpreting the Results:
Execution Time Comparison
The output of the code provides us with the execution time and memory usage for each operation performed on the synthetic dataset using both Pandas and Polars.
Loading Data: This operation involves reading the CSV file and converting it into a DataFrame. The time and memory used during this operation can give us an idea of how efficiently each library can handle large datasets.
Aggregating Data: This operation involves computing summary statistics (like min, max, mean, median, standard deviation, and unique count) for a specific column in the DataFrame. The efficiency of this operation can be crucial when dealing with large datasets, as it can significantly impact the performance of data analysis tasks.
Filtering Data: This operation involves selecting a subset of the data based on certain conditions. The speed and memory usage of this operation can affect the performance of data analysis tasks, especially when dealing with large datasets.
Grouping Data: This operation involves grouping the data based on certain criteria and then performing calculations on each group. This is a common operation in data analysis and its efficiency can greatly impact the overall performance of data analysis tasks.
Sorting Data: This operation involves arranging the data in a certain order. The efficiency of this operation can be important when dealing with large datasets, as it can significantly impact the performance of data analysis tasks.
From the output, we can see that Polars generally performs operations faster and uses less memory than Pandas. However, the exact performance can vary depending on the specific operation and the size and structure of the data.
For example, in the case of loading data, Polars was faster and used less memory than Pandas. This suggests that Polars is more efficient at handling large datasets. Similarly, for operations like aggregation, filtering, grouping, and sorting, Polars consistently outperformed Pandas in terms of speed and memory usage.
However, it’s important to note that these results might not hold true for all types of data and operations. The performance of these libraries can be influenced by many factors, including the size and structure of the data, the complexity of the operations, and the specific implementation of the libraries.
In conclusion, while Polars seems to have an edge over Pandas in terms of performance and memory usage for the given operations on the synthetic dataset, both libraries have their strengths and can be effectively used for data manipulation and analysis tasks. The choice between Pandas and Polars would depend on the specific requirements of the task at hand. It’s not a battle between Pandas and Polars, but rather about leveraging the strengths of both to perform better data analysis.



