Modin - Quantum: Machine Learning & Analytics

Modin is an early-stage project at UC Berkeley’s RISELab designed to make distributed computing easier to use for data science. It is a multiprocess DataFrame library with an API identical to that of pandas, which lets users speed up their existing pandas workflows. Modin uses Ray or Dask as its execution engine to speed up pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Modin integrates with existing pandas code and stays compatible with it. Even the DataFrame constructor is used in exactly the same way.

Data scientists often need different tools to do the same thing on data of different sizes. DataFrame solutions that work at 1MB do not scale to 1TB+, and the overhead of solutions built for 1TB+ is too costly for datasets in the 1KB range. Because Modin is lightweight, robust, and scalable, it aims to provide a fast DataFrame at both 1MB and 1TB+.

Modin accelerates pandas queries by 4x on an 8-core machine, while requiring users to change only a single line of code in their notebooks: the pandas import. The system is designed for existing pandas users who want their programs to run faster and scale better without significant code changes. The ultimate goal of this work is to make pandas usable in a cloud setting.
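As a minimal sketch of that single-line change, the snippet below swaps the pandas import for Modin's and leaves the rest of the script untouched. The sample data and column names are made up for illustration, and the fallback to plain pandas is an addition here so the sketch still runs where Modin is not installed; it is not part of Modin itself.

```python
# Before:  import pandas as pd
# After:   import modin.pandas as pd   (the only line that changes)
try:
    import modin.pandas as pd
except ImportError:
    # Assumption for this sketch: fall back to plain pandas if Modin is missing.
    import pandas as pd

# Everything below is ordinary pandas code (illustrative data).
df = pd.DataFrame({"city": ["Oslo", "Lima", "Oslo"], "sales": [10, 20, 30]})
totals = df.groupby("city")["sales"].sum()
print(totals)
```

On a dataset this small the swap brings no benefit; any speedup only shows up once the work is large enough to spread across cores.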

Modin is organized into three layers:

The pandas API is exposed at the top-most layer.
The next layer houses the Query Compiler, which receives queries from the pandas API layer and performs certain optimizations.
The last layer is the Partition Manager, which is responsible for data layout and shuffling, partitioning, and serializing the tasks that are sent to each partition.
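The layering above can be pictured with a toy sketch. This is not Modin's code: the class names ToyFrame, QueryCompiler, and PartitionManager, their methods, and the use of a thread pool are all assumptions chosen to show how a call could flow from the API layer through a compiler down to partitions. Modin itself hands this work to Ray or Dask.

```python
from concurrent.futures import ThreadPoolExecutor


class PartitionManager:
    """Bottom layer (toy): owns the data layout and runs a task per partition."""

    def __init__(self, rows, n_partitions=4):
        size = max(1, -(-len(rows) // n_partitions))  # ceiling division
        self.partitions = [rows[i:i + size] for i in range(0, len(rows), size)]

    def map_partitions(self, task):
        # A thread pool stands in for the real distributed engine.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(task, self.partitions))


class QueryCompiler:
    """Middle layer (toy): turns an API call into per-partition work plus a combine step."""

    def __init__(self, manager):
        self.manager = manager

    def sum(self):
        partial_sums = self.manager.map_partitions(sum)
        return sum(partial_sums)


class ToyFrame:
    """Top layer (toy): the user-facing API, which only talks to the compiler."""

    def __init__(self, values, n_partitions=4):
        self._compiler = QueryCompiler(PartitionManager(list(values), n_partitions))

    def sum(self):
        return self._compiler.sum()


frame = ToyFrame(range(1, 101))
total = frame.sum()
print(total)
```

The point of the split is that the user-facing class never sees partitions, so the layout or the engine underneath could change without touching the API.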

With Ray as its execution engine, Modin speeds up pandas notebooks, scripts, and libraries. Ray is a high-performance distributed execution framework targeted at large-scale machine learning and reinforcement learning applications. The same code can run on a single machine for efficient multiprocessing and on a cluster for large computations. Ray is available on GitHub: github.com/ray-project/ray.

Modin handles all the partitioning and shuffling for the user. Its basic goal is to let users apply the same tools to small data and big data alike, without changing the API to suit different data sizes.