

Ray Distributed Python

Ray is a Python-native distributed compute framework. Tasks and Actors are the primitives; Tune, Train, Serve, and RLlib are the higher-level libraries built on top.


What It Is

Ray is a distributed execution framework for Python. The core API has two primitives. A Task is a stateless function decorated with @ray.remote that executes on any worker in the cluster; calling f.remote(x) returns an ObjectRef future and dispatches the work asynchronously. An Actor is a stateful Python class decorated with @ray.remote that runs as a long-lived process pinned to one worker; method calls on the actor handle become remote messages. Tasks suit embarrassingly parallel work; Actors suit anything with state (a model, a parameter server, an RL environment). Object references are first-class and can be passed between tasks, which lets the scheduler track lineage and reconstruct objects on failure.

Ray includes four higher-level libraries built on this substrate. Ray Tune is a hyperparameter sweep framework with bandit and Bayesian schedulers (ASHA, PBT, BOHB). Ray Train wraps distributed training (DDP, FSDP, Horovod, JAX). Ray Serve is a model-serving framework with autoscaling and multi-model deployment graphs. RLlib is a distributed RL library with PPO, DQN, IMPALA, SAC, and population-based variants — historically Ray's largest user.

Cluster topology: a head node runs the global control store (GCS) and dashboard; worker nodes register with the head and run scheduled tasks and actors. The cluster YAML declares min/max nodes per type and Ray autoscales to fit pending demand.
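An illustrative cluster YAML, following the Ray autoscaler schema (node type names and resource shapes here are invented for the example):

```yaml
cluster_name: example
max_workers: 8            # cluster-wide cap across all worker types

available_node_types:
  head_node:
    resources: {"CPU": 4}
    node_config: {}
  cpu_worker:
    min_workers: 0        # autoscaler adds nodes only when demand is pending
    max_workers: 8
    resources: {"CPU": 8}
    node_config: {}

head_node_type: head_node
```

The autoscaler scales each node type between its min and max to satisfy the resource requests of queued tasks and actors.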

When You'd Use It

Ray wins when the workload is Python-first, mixes CPU and GPU work, and benefits from stateful actors. Reinforcement learning is the canonical fit: thousands of environment workers (CPU actors) feed a small number of trainer actors (GPU); rollouts are gathered with ray.wait. Hyperparameter sweeps with early stopping fit naturally because each trial is an actor that the scheduler can pause, checkpoint, and reschedule. ML pipelines that chain feature extraction, training, and evaluation into one Python program with mixed hardware requirements are easier to express in Ray than in Spark or Kubernetes-native tooling.

Ray loses when the workload is heavy DataFrame analytics. Spark and Snowflake are decades ahead on query optimization, columnar storage, shuffle, and SQL. Ray Data exists, but it is not the right tool to replace a 100-TB Parquet warehouse pipeline. Ray also adds operational overhead compared to plain multiprocessing for small single-machine jobs; if eight cores on one box is enough, Ray is overkill.

Notable Gotchas

Watch Out

Object store spill is the most common cluster failure mode

Ray keeps task results in a shared-memory object store on each node. When the store fills (default: 30 percent of node memory), Ray spills to disk; when disk fills, tasks fail with ObjectStoreFullError. The trap: a long-running pipeline accumulates intermediate results that are still referenced (e.g. a list of ObjectRef held by the driver) and the GC cannot evict them. The fix is to call ray.get and free the references promptly, or to use streaming patterns instead of materializing the full intermediate set.

Watch Out

Actors do not parallelize within themselves

A single actor runs in one process and processes method calls serially by default. Calling actor.method.remote(x) ten times in a loop does not parallelize on that actor; it queues. To parallelize, create multiple actor replicas and round-robin across them, or set max_concurrency on the actor decorator to allow asyncio-style concurrency within one actor.


Last reviewed: April 18, 2026
