Distributed computing for larger-than-RAM pandas/NumPy workflows. Use when you need to scale existing pandas/NumPy code beyond memory or across clusters. Best for parallel file processing, distributed ML, and integration with existing pandas code. For out-of-core analytics on a single machine, use vaex; for in-memory speed, use polars.
Use the skills CLI to install this skill with one command. Auto-detects all installed AI assistants.
Method 1 - skills CLI
npx skills i K-Dense-AI/claude-scientific-skills/scientific-skills/dask
Method 2 - openskills (supports sync & update)
npx openskills install K-Dense-AI/claude-scientific-skills
Auto-detects Claude Code, Cursor, Codex CLI, Gemini CLI, and more. One install, works everywhere.
Dask is a Python library for parallel and distributed computing that enables three critical capabilities:
Dask scales from laptops (processing ~100 GiB) to clusters (processing ~100 TiB) while maintaining familiar Python APIs.
This skill should be used when:
Dask provides five main components, each suited to different use cases:
Purpose: Scale pandas operations to larger datasets through parallel processing.
When to Use:
Reference Documentation: For comprehensive guidance on Dask DataFrames, refer to references/dataframes.md which includes:
map_partitions
Quick Example:
import dask.dataframe as dd
# Read multiple files as single DataFrame
ddf = dd.read_csv('data/2024-*.csv')
# Operations are lazy until compute()
filtered = ddf[ddf['value'] > 100]
result = filtered.groupby('category').mean().compute()
Key Points:
Operations are lazy until .compute() is called
Use map_partitions for efficient custom operations
Purpose: Extend NumPy capabilities to datasets larger than memory using blocked algorithms.
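As an illustration of what a blocked algorithm means here, the sketch below computes a sum in two ways: by reducing each chunk separately and combining the partial results, and with the built-in reduction. The array contents and chunk shape are arbitrary choices for the example, and the manual loop is only a conceptual illustration, not how Dask implements reductions internally.

```python
import numpy as np
import dask.array as da

# Small array for illustration; chunks=(2, 4) splits the 4x4 array into 2 row blocks.
data = np.arange(16, dtype="float64").reshape(4, 4)
x = da.from_array(data, chunks=(2, 4))

# Manual blocked sum: reduce each block on its own, then combine the partial sums.
block_sums = [
    x.blocks[i, j].sum()
    for i in range(x.numblocks[0])
    for j in range(x.numblocks[1])
]
manual_total = sum(block_sums).compute()

# Built-in reduction over the same chunked array.
builtin_total = x.sum().compute()
```

Both totals agree, and no step needs the whole array at once: each partial sum only touches one chunk.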
When to Use:
Reference Documentation: For comprehensive guidance on Dask Arrays, refer to references/arrays.md which includes:
map_blocks
Quick Example:
import dask.array as da
# Create large array with chunks
x = da.random.random((100000, 100000), chunks=(10000, 10000))
# Operations are lazy
y = x + 100
z = y.mean(axis=0)
# Compute result
result = z.compute()
Key Points:
Use map_blocks for operations not available in Dask
Purpose: Process unstructured or semi-structured data (text, JSON, logs) with functional operations.
When to Use:
Reference Documentation: For comprehensive guidance on Dask Bags, refer to references/bags.md which includes:
Quick Example:
import dask.bag as db
import json
# Read and parse JSON files
bag = db.read_text('logs/*.json').map(json.loads)
# Filter and transform
valid = bag.filter(lambda x: x['status'] == 'valid')
# Keep only the fields needed downstream (the 'value' source field is assumed)
processed = valid.map(lambda x: {'id': x['id'], 'value': x['value']})
Key Points:
Use foldby instead of groupby for better performance
Purpose: Build custom parallel workflows with fine-grained control over task execution and dependencies.
When to Use:
Reference Documentation: For comprehensive guidance on Dask Futures, refer to references/futures.md which includes:
Quick Example:
from dask.distributed import Client
client = Client() # Create local cluster
# Submit tasks (executes immediately)
def process(x):
    return x ** 2
futures = client.map(process, range(100))
# Gather results
results = client.gather(futures)
client.close()
Key Points:
Purpose: Control how and where Dask tasks execute (threads, processes, distributed).
When to Choose Scheduler:
Reference Documentation: For comprehensive guidance on Dask Schedulers, refer to references/schedulers.md which includes:
Quick Example:
import dask
import dask.dataframe as dd
# Use threads for DataFrame (default, good for numeric)
ddf = dd.read_csv('data.csv')
result1 = ddf.mean().compute() # Uses threads
# Use processes for Python-heavy work
import dask.bag as db
bag = db.read_text('logs/*.txt')
Key Points:
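One point that can be shown concretely is how a scheduler is selected. The sketch below picks a scheduler per `compute()` call and, alternatively, for a scoped block with `dask.config.set`. The function `slow_python_step` and the input values are hypothetical stand-ins for Python-heavy work; `scheduler="processes"` is passed the same way but is left out here so the example stays lightweight.

```python
import dask
import dask.bag as db

def slow_python_step(n: int) -> int:
    """Hypothetical pure-Python function standing in for GIL-bound work."""
    return sum(i * i for i in range(n))

# Build a small lazy pipeline over three partitions.
bag = db.from_sequence([10, 100, 1000], npartitions=3).map(slow_python_step)

# Per-call selection: the scheduler applies to this compute() only.
threaded = bag.compute(scheduler="threads")
serial = bag.compute(scheduler="synchronous")

# Scoped selection: every compute() inside the block uses the chosen scheduler.
with dask.config.set(scheduler="synchronous"):
    scoped = bag.compute()
```

The results are identical across schedulers; only where and how the tasks run changes, which is why switching to the synchronous scheduler is a convenient way to debug.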
For comprehensive performance optimization guidance, memory management strategies, and common pitfalls to avoid, refer to references/best-practices.md. Key principles include:
Before using Dask, explore:
1. Don't Load Data Locally Then Hand to Dask
# Wrong: Loads all data in memory first
import pandas as pd
df = pd.read_csv('large.csv')
ddf = dd.from_pandas(df, npartitions=10)
# Correct: Let Dask handle loading
import dask.dataframe as dd
ddf = dd.read_csv('large.csv')
2. Avoid Repeated compute() Calls
# Wrong: Each compute is separate
for item in items:
    result = dask_computation(item).compute()
# Correct: Single compute for all
import dask
computations = [dask_computation(item) for item in items]
results = dask.compute(*computations)
3. Don't Build Excessively Large Task Graphs
Use map_partitions/map_blocks to fuse operations
Check task graph size with len(ddf.__dask_graph__())
4. Choose Appropriate Chunk Sizes
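A minimal sketch of how chunk size drives task count, using the same `__dask_graph__()` check on a Dask Array. The array shape and chunk sizes are arbitrary and chosen only to keep the example small; suitable sizes for real data depend on available memory and the workload.

```python
import dask.array as da

# Same logical array, chunked two different ways.
small_chunks = da.ones((100, 100), chunks=(10, 10))   # 10 x 10 grid of blocks
large_chunks = da.ones((100, 100), chunks=(50, 50))   # 2 x 2 grid of blocks

# Build the same lazy computation on both and compare task graph sizes.
n_small = len((small_chunks + 1).sum().__dask_graph__())
n_large = len((large_chunks + 1).sum().__dask_graph__())

# An existing array can be given larger chunks with rechunk().
rechunked = small_chunks.rechunk((50, 50))
```

Here `n_small` is much larger than `n_large` even though both graphs describe the same result, so inspecting the graph length before calling `compute()` is a cheap way to spot chunks that are too small.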
5. Use the Dashboard
from dask.distributed import Client
client = Client()
print(client.dashboard_link) # Monitor performance, identify bottlenecks
import dask.dataframe as dd
# Extract: Read data
ddf = dd.read_csv('raw_data/*.csv')
# Transform: Clean and process
ddf = ddf[ddf['status'] == 'valid']
ddf['amount'] = ddf['amount'].astype('float64')
ddf = ddf.dropna(subset
import dask.bag as db
import json
# Start with Bag for unstructured data
bag = db.read_text('logs/*.json').map(json.loads)
bag = bag.filter(lambda x: x['status'] == 'valid')
# Convert to DataFrame for structured analysis
ddf = bag.to_dataframe()
result = ddf.groupby('category').mean().compute()
import dask.array as da
# Load or create large array
x = da.from_zarr('large_dataset.zarr')
# Process in chunks
normalized = (x - x.mean()) / x.std()
# Save result
da.to_zarr(normalized, 'normalized.zarr')
from dask.distributed import Client
client = Client()
# Scatter large dataset once
data = client.scatter(large_dataset)
# Process in parallel with dependencies
futures = []
for param in parameters:
    future = client.submit(process, data, param)
    futures.append(future)
# Gather results
results = client.gather(futures)
Use this decision guide to choose the appropriate Dask component:
Data Type:
Operation Type:
Control Level:
Workflow Type:
# Bag → DataFrame
ddf = bag.to_dataframe()
# DataFrame → Array (for numeric data)
arr = ddf.to_dask_array(lengths=True)
# Array → DataFrame
ddf = dd.from_dask_array(arr, columns=['col1', 'col2'])
dask.config.set(scheduler='synchronous')
result = computation.compute() # Can use pdb, easy debugging
sample = ddf.head(1000) # Small sample
# Test logic, then scale to full dataset
from dask.distributed import Client
client = Client()
print(client.dashboard_link) # Monitor performance
result = computation.compute()
Memory Errors:
Use persist() strategically and delete when done
Slow Start:
Use map_partitions or map_blocks to reduce tasks
Poor Parallelization:
All reference documentation files can be read as needed for detailed information:
references/dataframes.md - Complete Dask DataFrame guide
references/arrays.md - Complete Dask Array guide
references/bags.md - Complete Dask Bag guide
references/futures.md - Complete Dask Futures and distributed computing guide
references/schedulers.md - Complete scheduler selection and configuration guide
references/best-practices.md - Comprehensive performance optimization and troubleshooting
Load these files when users need detailed information about specific Dask components, operations, or patterns beyond the quick guidance provided here.