

Ray Distributed Python

Ray is a Python-native distributed compute framework. Tasks and Actors are the primitives; Tune, Train, Serve, and RLlib are the higher-level libraries built on top.


What It Is

Ray is a distributed execution framework for Python. The core API has two primitives. A Task is a stateless function decorated with @ray.remote that executes on any worker in the cluster; calling f.remote(x) returns an ObjectRef future and dispatches the work asynchronously. An Actor is a stateful Python class decorated with @ray.remote that runs as a long-lived process pinned to one worker; method calls on the actor handle become remote messages. Tasks suit embarrassingly parallel work; Actors suit anything with state (a model, a parameter server, an RL environment). Object references are first-class and can be passed between tasks, which lets the scheduler track lineage and reconstruct objects on failure.

Ray includes four higher-level libraries built on this substrate. Ray Tune is a hyperparameter sweep framework with bandit and Bayesian schedulers (ASHA, PBT, BOHB). Ray Train wraps distributed training (DDP, FSDP, Horovod, JAX). Ray Serve is a model-serving framework with autoscaling and multi-model deployment graphs. RLlib is a distributed RL library with PPO, DQN, IMPALA, SAC, and population-based variants — historically Ray's largest user.

Cluster topology: a head node runs the global control store (GCS) and dashboard; worker nodes register with the head and run scheduled tasks and actors. The cluster YAML declares min/max nodes per type and Ray autoscales to fit pending demand.
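An illustrative cluster YAML, following the Ray autoscaler schema (node type names and resource shapes here are invented for the example):

```yaml
cluster_name: example
max_workers: 8            # cluster-wide cap across all worker types

available_node_types:
  head_node:
    resources: {"CPU": 4}
    node_config: {}
  cpu_worker:
    min_workers: 0        # autoscaler adds nodes only when demand is pending
    max_workers: 8
    resources: {"CPU": 8}
    node_config: {}

head_node_type: head_node
```

The autoscaler scales each node type between its min and max to satisfy the resource requests of queued tasks and actors.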

When You'd Use It

Ray wins when the workload is Python-first, mixes CPU and GPU work, and benefits from stateful actors. Reinforcement learning is the canonical fit: thousands of environment workers (CPU actors) feed a small number of trainer actors (GPU); rollouts are gathered with ray.wait. Hyperparameter sweeps with early stopping fit naturally because each trial is an actor that the scheduler can pause, checkpoint, and reschedule. ML pipelines that chain feature extraction, training, and evaluation into one Python program with mixed hardware requirements are easier to express in Ray than in Spark or Kubernetes-native tooling.

Ray loses when the workload is heavy DataFrame analytics. Spark and Snowflake are decades ahead on query optimization, columnar storage, shuffle, and SQL. Ray Data exists, but it is not the right tool to replace a 100-TB Parquet warehouse pipeline. Ray also adds operational overhead compared to plain multiprocessing for small single-machine jobs; if eight cores on one box is enough, Ray is overkill.

Notable Gotchas

Watch Out

Object store spill is the most common cluster failure mode

Ray keeps task results in a shared-memory object store on each node. When the store fills (default: 30 percent of node memory), Ray spills to disk; when disk fills, tasks fail with ObjectStoreFullError. The trap: a long-running pipeline accumulates intermediate results that are still referenced (e.g. a list of ObjectRef held by the driver) and the GC cannot evict them. The fix is to call ray.get and free the references promptly, or to use streaming patterns instead of materializing the full intermediate set.

Watch Out

Actors do not parallelize within themselves

A single actor runs in one process and processes method calls serially by default. Calling actor.method.remote(x) ten times in a loop does not parallelize on that actor; it queues. To parallelize, create multiple actor replicas and round-robin across them, or set max_concurrency on the actor decorator to allow asyncio-style concurrency within one actor.


Last reviewed: April 18, 2026
