Dask - Quantum: Machine Learning & Analytics

Dask is an open source library for parallel computing written in Python. Dask is a library composed of two parts. It includes a task scheduling component for building dependency graphs and scheduling tasks. Second, it includes the distributed data structures with APIs similar to Pandas Dataframes or NumPy arrays. Dask has a variety of use cases and can be run with a single node and scale to thousand node clusters.

Dask introduces 3 parallel collections that are able to store data that is larger than RAM, namely Dataframes, Bags and Arrays. Each of these collection types are able to use data partitioned between RAM and a hard disk as well distributed across multiple nodes in a cluster.Dask DataFrame is made up of smaller split up Pandas dataframes and therefore allows a subset of Pandas query sy subset of Pandas query syntax. Dask Bag is able to store and process collections of Pythonic objects that are unable to fit into memory. Dask Bags are great for processing logs and collections of json documentsntax. Dask Arrays support Numpy like slicing.

The dask collections (array, bag, dataframe) provide reasonable access to parallelism and out-of-core execution. These significantly extend the scale of data that is convenient to manipulate. More importantly these collections demonstrate the feasibility of dask graphs to describe parallel algorithms and of the dask schedulers to execute those algorithms efficiently in a small space. The lack of a more baroque framework drastically reduces the barrier to entry and the ability of developers to use dask within their own libraries.