Husein Ghadiali

Data Scientist

New York, NY

Feb 7, 2024

Pandas 2.0 vs Polars (PvP)

Introduction

Recently I wrote a brief post on LinkedIn about Pandas vs Polars. I decided to take it a step further and run some tests on my system to provide a more detailed comparison.

Pandas and Polars are two powerful data manipulation libraries in Python. Pandas, a well-established library, provides flexible data structures to manipulate and analyze data. Polars, a relatively new library, offers a fast DataFrame implementation written in Rust, with its own query engine built on the Apache Arrow columnar memory format.

The choice between Pandas and Polars can significantly impact the efficiency, speed, and memory usage of your data operations. So, let’s dive in and explore these libraries further through some hands-on tests.

Pandas 2.0

Pandas is a software library written for the Python programming language for data manipulation and analysis. It provides data structures and functions needed to manipulate structured data. The name Pandas is derived from the term “panel data”, an econometrics term for multidimensional structured data sets.

Pandas 2.0, released on April 3, 2023, brought significant performance improvements, particularly in the areas of reading and writing data from various file formats. These enhancements help make data processing tasks faster and more efficient.

New Features in Pandas 2.0

Improved Extension Array Support: Pandas 2.0 has improved support for extension arrays, making it easier to add new types of arrays to Pandas.
PyArrow Support for DataFrames: DataFrames can now be backed by PyArrow, the Python library for Apache Arrow, a cross-language columnar format for in-memory data. This allows data to be shared efficiently between Pandas and other Arrow-aware tools without copying or converting it.
Non-Nanosecond Datetime Resolution: Pandas 2.0 supports datetime resolutions other than nanoseconds (seconds, milliseconds, and microseconds). Coarser resolutions make it possible to represent dates well outside the roughly 1677 to 2262 range that nanosecond timestamps are limited to.
Installing Optional Dependencies with pip extras: When installing Pandas using pip, sets of optional dependencies can also be installed by specifying extras, for example pip install "pandas[performance]".
Index Can Now Hold NumPy Numeric dtypes: It is now possible to use any NumPy numeric dtype in an Index. Previously it was only possible to use int64, uint64 & float64 dtypes.
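
Two of the features above are easy to see in a few lines. This is a minimal sketch, assuming Pandas 2.0 or later is installed; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# An Index can keep a small NumPy integer dtype instead of being upcast to int64.
idx = pd.Index([1, 2, 3], dtype=np.int8)
print(idx.dtype)

# Second-resolution timestamps can hold dates outside the nanosecond range
# (roughly the years 1677 to 2262).
dates = np.array(["1500-01-01", "3000-01-01"], dtype="datetime64[s]")
ts = pd.Series(dates)
print(ts.dtype)
```

On Pandas 1.x the same Index would have been converted to int64, and the two dates could not have been stored as timestamps at all.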

Polars

Polars is a DataFrame library written in Rust, with bindings for Python. It is built to handle large data frames in a fast and efficient manner. It uses lazy execution and memory mapping to achieve high performance, and it is built on top of Apache Arrow for memory-efficient storage of data.

Benefits of Polars

Speed and performance: Polars is engineered with performance in mind. It leverages parallel processing and memory optimization techniques, allowing it to process large datasets significantly faster than traditional methods.
Data manipulation capabilities: Polars provides a comprehensive toolkit for data manipulation, encompassing essential operations such as filtering, sorting, grouping, joining, and aggregating data.
Expressive syntax: Polars employs a concise and intuitive syntax, making it easy to learn and use.
DataFrame and series structures: At the core of Polars are the DataFrame and Series structures, which provide a familiar and powerful abstraction for working with tabular data.
Lazy evaluation: Polars supports lazy evaluation, in which a query is examined and optimized before it runs in order to improve performance and reduce memory consumption.

Lazy Execution in Polars

Lazy execution is a programming technique where the evaluation of expressions is delayed until their results are needed. This can lead to significant performance improvements by avoiding unnecessary computations and reducing memory usage.

With Polars’ lazy API, a query doesn’t get executed immediately. Instead, each operation is added to an internal query plan, which is optimized before anything runs and is executed only when you call collect(). This means that Polars can rearrange and combine operations to minimize the amount of computation and memory usage. Polars also has an eager API, which is what functions such as pl.read_csv return.

For example, if you read a dataset, transform it, and only then filter it, Polars can move the filter as early as possible, even into the file scan itself (predicate pushdown), and read only the columns the query actually uses (projection pushdown). This can significantly reduce the amount of data that needs to be processed, leading to faster execution times and lower memory usage.

This is in contrast to Pandas, which uses eager execution. In eager execution, each operation is performed as soon as it is called. While this can make the code simpler and easier to debug, it can also lead to unnecessary computations and increased memory usage.

Constructing a Synthetic Dataset

The first step in our comparative analysis is the creation of a synthetic dataset. This dataset consists of a million rows and fifteen columns, ten of which are numerical and five of which are categorical. The numerical columns are populated with random integers from 0 to 99, while the categorical columns are filled with random selections from the letters ‘a’ to ‘e’.

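# Create a synthetic dataset
import numpy as np
import pandas as pd

np.random.seed(0)

# Ten numerical columns of random integers in [0, 100)
data = pd.DataFrame(
    np.random.randint(0, 100, size=(1_000_000, 10)),
    columns=[f'num_{i}' for i in range(10)]
)

# Five categorical columns drawn from the letters a to e
data = pd.concat([
    data,
    pd.DataFrame(
        np.random.choice(list('abcde'), size=(1_000_000, 5)),
        columns=[f'cat_{i}' for i in range(5)]
    )
], axis=1)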
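
The benchmark reads its input from CSV, so the dataset has to be written to disk at some point. That step is not shown in this post, so the following is only a sketch under assumptions: the file names pandas_data.csv and polars_data.csv are hypothetical, the helper make_dataset is mine, it uses NumPy's default_rng instead of np.random.seed, and it builds a small 1,000-row sample so it runs quickly.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def make_dataset(n_rows, seed=0):
    """Build a frame with ten integer columns and five letter columns."""
    rng = np.random.default_rng(seed)
    nums = pd.DataFrame(
        rng.integers(0, 100, size=(n_rows, 10)),
        columns=[f"num_{i}" for i in range(10)],
    )
    cats = pd.DataFrame(
        rng.choice(list("abcde"), size=(n_rows, 5)),
        columns=[f"cat_{i}" for i in range(5)],
    )
    return pd.concat([nums, cats], axis=1)


# Write one copy per library into a temporary directory (hypothetical names).
out_dir = tempfile.mkdtemp()
small = make_dataset(1_000)
pandas_path = os.path.join(out_dir, "pandas_data.csv")
polars_path = os.path.join(out_dir, "polars_data.csv")
small.to_csv(pandas_path, index=False)
small.to_csv(polars_path, index=False)
print(pandas_path, polars_path)
```

Passing index=False keeps the row index out of the file, so both libraries see the same fifteen columns when they read it back.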

Comparative Analysis: Code Execution and Performance Metrics

The code below defines a series of operations for both Pandas and Polars; the time and memory usage of each operation were then measured. The operations include loading data, aggregating data, filtering data, grouping data, and sorting data.

# Define operations for comparison
import pandas as pd
import polars as pl


def load_data(file_path):
    # The file name decides which library reads the CSV
    return pd.read_csv(file_path) if 'pandas' in file_path else pl.read_csv(file_path)


def aggregate_data(df):
    # Summary statistics for a single numerical column
    if isinstance(df, pd.DataFrame):
        return df['num_0'].agg(['min', 'max', 'mean', 'median', 'std', 'nunique'])
    else:
        return df.select([
            pl.col('num_0').min().alias('min'),
            pl.col('num_0').max().alias('max'),
            pl.col('num_0').mean().alias('mean'),
            pl.col('num_0').median().alias('median'),
            pl.col('num_0').std().alias('std'),
            pl.col('num_0').n_unique().alias('nunique')
        ])


def filter_data(df):
    # Rows where num_0 is below 50 and cat_0 equals 'a'
    if isinstance(df, pd.DataFrame):
        return df[(df['num_0'] < 50) & (df['cat_0'] == 'a')]
    else:
        return df.filter((pl.col('num_0') < 50) & (pl.col('cat_0') == 'a'))


def group_data(df):
    if isinstance(df, pd.DataFrame):
        # Select num_0 so both libraries aggregate the same column; without it,
        # Pandas 2.0 raises a TypeError when taking the mean of the string columns
        return df.groupby(['cat_0', 'cat_1'])['num_0'].agg(['min', 'max', 'mean'])
    else:
        # Polars renamed groupby to group_by in 0.19
        return df.group_by(['cat_0', 'cat_1']).agg([
            pl.col('num_0').min().alias('min'),
            pl.col('num_0').max().alias('max'),
            pl.col('num_0').mean().alias('mean')
        ])


def sort_data(df):
    # Sort by two numerical columns, ascending
    if isinstance(df, pd.DataFrame):
        return df.sort_values(['num_0', 'num_1'])
    else:
        return df.sort(['num_0', 'num_1'])
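
The measurement code itself is not shown in this post. As a rough sketch of how timings like these can be collected, the helper below runs a function several times and reports the best and median wall-clock time. The name time_operation and the repeats parameter are my own for illustration, and this is not necessarily how the numbers in this post were produced.

```python
import statistics
import time


def time_operation(func, *args, repeats=5):
    """Run func(*args) several times; return (best, median) seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings), statistics.median(timings)


# Example with a stand-in workload: sorting a reversed list.
best, median = time_operation(sorted, list(range(100_000, 0, -1)))
print(f"sorted: best {best:.6f} s, median {median:.6f} s")
```

Repeating each operation matters because a single run includes one-off costs such as cache warm-up, and differences of a few milliseconds can easily come from noise rather than from the library.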

Interpreting the Results:

-------------------------------------------------------
Pandas load_data time: 0.28060007095336914 seconds
Pandas load_data memory usage: 8.59375 MiB
Polars load_data time: 0.2622716426849365 seconds
Polars load_data memory usage: 6.55859375 MiB
-------------------------------------------------------
Pandas aggregate_data time: 0.21640896797180176 seconds
Pandas aggregate_data memory usage: -9.76171875 MiB
Polars aggregate_data time: 0.21160483360290527 seconds
Polars aggregate_data memory usage: -0.5078125 MiB
-------------------------------------------------------
Pandas filter_data time: 0.24292683601379395 seconds
Pandas filter_data memory usage: -66.671875 MiB
Polars filter_data time: 0.23410367965698242 seconds
Polars filter_data memory usage: -68.2421875 MiB
-------------------------------------------------------
Pandas group_data time: 0.24172377586364746 seconds
Pandas group_data memory usage: 40.91796875 MiB
Polars group_data time: 0.24341773986816406 seconds
Polars group_data memory usage: -16.13671875 MiB
-------------------------------------------------------
Pandas sort_data time: 0.34338855743408203 seconds
Pandas sort_data memory usage: -52.48828125 MiB
Polars sort_data time: 0.34462451934814453 seconds
Polars sort_data memory usage: 111.5 MiB

Execution Time Comparison

The output of the code provides us with the execution time and memory usage for each operation performed on the synthetic dataset using both Pandas and Polars.

Loading Data: This operation involves reading the CSV file and converting it into a DataFrame. The time and memory used here give an idea of how efficiently each library ingests a large file.
Aggregating Data: This operation involves computing summary statistics (min, max, mean, median, standard deviation, and unique count) for a specific column in the DataFrame. Summaries like these are often the first step of an analysis, so their cost adds up on large datasets.
Filtering Data: This operation involves selecting a subset of the data based on certain conditions. It is typically applied early in a pipeline, so its speed and memory usage affect everything that follows.
Grouping Data: This operation involves grouping the data based on certain criteria and then performing calculations on each group. It is one of the most common operations in data analysis and often one of the most expensive.
Sorting Data: This operation involves arranging the data in a certain order. Sorting has to touch and rearrange every row, so it tends to be costly in both time and memory.

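From the output, the two libraries are very close on this dataset. Polars was slightly faster at loading (about 6.5%), filtering (about 3.6%), and aggregating (about 2.2%), while Pandas was marginally faster at grouping and sorting (by less than 1% in both cases). Differences this small, taken from a single run, are within the range of normal run-to-run noise.

The memory figures are harder to interpret. Several of them are negative, which suggests they record the net change in process memory around each call rather than the memory the operation itself needed, so they are affected by whatever was freed in the meantime. Polars showed a smaller increase when loading data (6.56 MiB vs 8.59 MiB), but its sort step showed the largest increase of all (111.5 MiB), so the output does not support a consistent memory advantage for either library.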
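
To put the timings side by side, the relative difference for each operation can be computed directly from the output above. This is a small sketch using only the reported numbers; a positive value means Polars was faster, a negative value means Pandas was faster.

```python
# (pandas_seconds, polars_seconds) copied from the output above
timings = {
    "load_data": (0.28060007095336914, 0.2622716426849365),
    "aggregate_data": (0.21640896797180176, 0.21160483360290527),
    "filter_data": (0.24292683601379395, 0.23410367965698242),
    "group_data": (0.24172377586364746, 0.24341773986816406),
    "sort_data": (0.34338855743408203, 0.34462451934814453),
}

# Percentage by which Polars was faster than Pandas for each operation.
relative = {
    op: (pandas_s - polars_s) / pandas_s * 100
    for op, (pandas_s, polars_s) in timings.items()
}

for op, pct in relative.items():
    faster = "Polars" if pct > 0 else "Pandas"
    print(f"{op}: {abs(pct):.1f}% in favor of {faster}")
```

Expressing the gap as a percentage makes it clear how small the differences are at this data size.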

However, it’s important to note that these results might not hold true for all types of data and operations. The performance of these libraries can be influenced by many factors, including the size and structure of the data, the complexity of the operations, and the specific implementation of the libraries.

In conclusion, Polars showed a slight edge over Pandas in loading, aggregating, and filtering on this synthetic dataset, while grouping and sorting were effectively a tie. Both libraries have their strengths and can be effectively used for data manipulation and analysis tasks, and the choice between them depends on the specific requirements of the task at hand. It’s not a battle between Pandas and Polars, but rather about leveraging the strengths of both to perform better data analysis.
