smolcluster

Distributed Deep Learning Library for Heterogeneous Hardware

Training and inference for neural networks across heterogeneous hardware (Mac minis, Raspberry Pis, MacBooks, and Windows machines) using PyTorch and plain Python sockets for communication.

Overview

smolcluster is a distributed deep learning library designed for training neural networks across heterogeneous hardware using PyTorch and socket-based communication. It enables researchers and developers to leverage multiple machines with different capabilities for distributed training and inference.

The library supports various distributed training algorithms including Fully Sharded Data Parallelism (FSDP), Classic Data Parallelism (ClassicDP), Elastic Distributed Parallelism (EDP), Model Parallelism, Model Parallelism with Pipeline, and Expert Parallelism. It runs on diverse hardware including Mac minis, Raspberry Pis, MacBooks, and Windows machines.

Cluster Architecture

[Diagram: smolcluster distributed deep learning architecture]

Quick Start

Step-by-step setup guide for Mac Mini (Thunderbolt) and Jetson / home router clusters, with commands ready to copy.

git clone https://github.com/YuvrajSingh-mist/smolcluster.git
cd smolcluster
uv sync

# launch training
bash scripts/launch_edp_train_gpt.sh

# launch inference
bash scripts/inference/launch_mp_inference.sh
bash scripts/inference/launch_api.sh

Key Features

Distributed Training Algorithms

  • Fully Sharded Data Parallelism (FSDP) — ZeRO-optimized data parallelism with configurable optimizer state partitioning. Supports Stage 0 (All-Reduce) and Stage 1 (ZeRO optimizer partitioning), which shards optimizer state to roughly 1/N of full size per node (see the sketch after this list). Includes bounded staleness for flexible async training.
  • Classic Data Parallelism (ClassicDP) — All-Reduce based data parallelism with bounded staleness control. Workers exchange gradients directly in a fully connected topology with configurable synchronization flexibility.
  • Elastic Distributed Parallelism (EDP) — Asynchronous data parallelism with stale gradient tolerance, ideal for heterogeneous clusters.
  • Model Parallelism (MP) — Layer-wise model distribution, well suited to large models and inference serving.
  • Model Parallelism with Pipeline (MPPipeline) — Pipeline-based model parallelism with inter-layer communication and overlapped computation for improved throughput and reduced latency during inference.
  • Expert Parallelism (EP) — Sparse mixture-of-experts (MoE) training with distributed expert assignment across nodes. Enables training of large sparse models with efficient expert routing and load balancing.
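The Stage 1 idea above (each rank keeps optimizer state for only its 1/N slice of the parameters, while gradients are still all-reduced) can be sketched in a few lines. This is illustrative only; the helper names and the round-robin ownership rule are assumptions, not smolcluster's actual API.

import torch

# Sketch of ZeRO Stage 1: rank i keeps Adam moment buffers for only
# its share of parameter tensors, so optimizer memory shrinks to ~1/N.
def build_stage1_optimizer(model, rank, world_size, lr=1e-3):
    # Hypothetical round-robin ownership: rank i owns every i-th tensor.
    owned = [p for i, p in enumerate(model.parameters()) if i % world_size == rank]
    return torch.optim.Adam(owned, lr=lr)

def step(model, optimizer, all_reduce_fn):
    # Every rank ends up with the full averaged gradient...
    for p in model.parameters():
        if p.grad is not None:
            p.grad = all_reduce_fn(p.grad)
    # ...but only updates the shard it owns; updated shards are then
    # broadcast so all ranks converge to the same weights.
    optimizer.step()
    for p in model.parameters():
        p.grad = None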

Distributed Inference

★ Zero-Config Node Discovery: grove (built by Swarnim Jain, integrated into smolcluster) handles automatic node discovery over mDNS on Mac and TCP + Zeroconf on Linux/Jetson — no SSH, no static IPs needed. Includes a live per-rank TUI dashboard. Launch with grove start <script> -n N on the coordinator and grove join on each worker (see the example below).
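For example (the script name and worker count are placeholders):

# on the coordinator
grove start train.py -n 4

# on each worker
grove join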
  • Supported Inference Algorithms — Pipeline Parallelism and Data Parallelism (DP).
  • Streaming Token Generation — Real-time token-by-token generation with activations forwarded sequentially through distributed layers.
  • FastAPI Backend — RESTful API server for easy integration with web and mobile clients (iPad, browsers, etc.); see the streaming sketch after this list.
  • Multi-Device Support — Serve models across heterogeneous hardware (Mac minis, Raspberry Pis) with automatic activation routing.
  • Interactive Chat Interface — React-based web UI and iOS Swift app for real-time interaction with distributed models.
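As a taste of the FastAPI streaming backend mentioned above, here is a minimal endpoint shape. The route, parameter, and generator are illustrative assumptions; in smolcluster the generator would pull tokens from the distributed pipeline rather than a hard-coded list.

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def generate_tokens(prompt: str):
    # Stand-in for the distributed pipeline: the real system forwards
    # activations through the ranked layer shards and yields each
    # decoded token as it arrives.
    for token in ["Hello", ",", " world"]:
        yield token

@app.post("/chat")
def chat(prompt: str):
    # Stream tokens back to the client (browser / iPad) as they are produced.
    return StreamingResponse(generate_tokens(prompt), media_type="text/plain")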

Hardware Support

Train and run inference across heterogeneous hardware including Mac minis, Raspberry Pis, MacBooks, Windows machines, and iPad clients.

Model Support

Data Parallelism (DP) has built-in support for models loaded through the Hugging Face Transformers library. For Model Parallelism (MP), GPT-2 (117M) is currently supported, with support for additional models coming soon.
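For example, a DP worker could load its model and tokenizer like this (a minimal sketch using the standard Transformers API; the model choice is illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")    # GPT-2 small
model = AutoModelForCausalLM.from_pretrained("gpt2")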

Monitoring & Logging

  • Weights & Biases Integration — Automatic tracking of training metrics, gradient norms, and hardware utilization.
  • Web Interface — React-based chat UI for GPT inference with real-time streaming responses.

Demo

Distributed GPT-2 Inference with Model Parallelism
  • Model: GPT-2 (117M parameters)
  • Hardware: iPad client + 2× Mac Mini M4 (2025)
  • Algorithm: Model Parallelism with layer distribution
  • Demo: Real-time streaming token generation across distributed layers
  • Workflow: User prompts from iPad → activations forwarded between Mac Minis → tokens streamed back
Distributed Data Parallelism POC (All-to-All)
  • Model: Llama3.2-1B-Instruct
  • Hardware: 3× Mac Mini M4 (2025, 16 GB RAM) over LAN
  • Algorithm: ClassicDP with all-to-all gradient exchange
Distributed Data Parallelism POC (SyncPS)
  • Model: Llama3.2-1B-Instruct
  • Hardware: 3× Mac Mini M4 (2025) + dedicated parameter server node
  • Algorithm: SyncPS with barrier-based synchronization
grove — Zero-Config Cluster Discovery
  • Package by Swarnim Jain, integrated into smolcluster
  • Auto node discovery over mDNS (Mac) / TCP + Zeroconf (Linux)
  • Live per-rank TUI: loss, grad norm, tokens/sec, network I/O

Technical Details

smolcluster implements a distributed training system designed from the ground up for heterogeneous hardware. The library supports multiple distributed training paradigms, each optimized for different cluster configurations and network topologies.

Communication Infrastructure

  • Socket-based Communication — Raw TCP sockets for reliable, low-level control over gradient and activation transfers between nodes. No dependency on MPI or specialized networking libraries.
  • Pickle Serialization — PyTorch tensors serialized with pickle for efficient network transmission, with optional gradient quantization for bandwidth reduction (a minimal transport sketch follows this list).
  • Asynchronous I/O — Non-blocking socket operations enable workers to compute while waiting for network transfers in EDP mode.
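The transport described above boils down to a length-prefixed pickle message over a TCP socket. A minimal sketch (helper names are illustrative, not smolcluster's internals):

import pickle
import socket
import struct
import torch

def send_tensor(sock: socket.socket, tensor: torch.Tensor) -> None:
    payload = pickle.dumps(tensor)
    # 8-byte big-endian length prefix so the receiver knows how much to read.
    sock.sendall(struct.pack(">Q", len(payload)) + payload)

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the socket mid-message")
        buf += chunk
    return buf

def recv_tensor(sock: socket.socket) -> torch.Tensor:
    (length,) = struct.unpack(">Q", _recv_exact(sock, 8))
    return pickle.loads(_recv_exact(sock, length))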

Distributed Training Modes

  • Elastic Distributed Parallelism (EDP) — Workers train independently with stale gradient tolerance. The parameter server accepts gradients from any model version, making it resilient to stragglers and network latency variance (see the sketch after this list). Workers periodically pull the latest weights without synchronization barriers.
  • Model Parallelism — Sequential layer distribution across nodes with activation forwarding. Enables training and inference of models exceeding single-device memory. Each worker holds a subset of layers and forwards activations to the next rank.
  • Pipeline Model Parallelism — Temporal pipeline parallelism with multiple microbatches in-flight across stages. Reduces bubble size and improves GPU/device utilization during training of large models.
  • Expert Parallelism — Distributed training of mixture-of-experts models with expert placement across nodes. Enables efficient training of sparse models with dynamic expert routing and load balancing across heterogeneous devices.
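One way to implement the stale-gradient tolerance described for EDP is a versioned parameter server with a staleness bound. The class below is a sketch under that assumption; the names, the bound, and the plain-SGD update are all illustrative.

import torch

class StaleTolerantServer:
    def __init__(self, params, lr=0.01, max_staleness=4):
        self.params = params          # list of torch.Tensor weights
        self.version = 0              # bumped on every applied update
        self.lr = lr
        self.max_staleness = max_staleness

    def apply_gradient(self, grads, worker_version):
        # Reject gradients computed against weights that are too old.
        if self.version - worker_version > self.max_staleness:
            return False
        for p, g in zip(self.params, grads):
            p.add_(g, alpha=-self.lr)  # in-place SGD step
        self.version += 1
        return True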

Data Management

  • Automatic Data Partitioning — Dataset automatically sharded across workers based on global rank and world size, ensuring no data overlap (see the sketch after this list).
  • Deterministic Shuffling — Seeded random number generators ensure reproducible data ordering across runs.
  • Streaming Support — Memory-efficient data loading for large datasets with PyTorch DataLoader integration.
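The first two points above reduce to a seeded permutation sliced per rank: every worker computes the same shuffle, then takes a disjoint stride of it. A minimal sketch (function name and defaults are illustrative):

import torch
from torch.utils.data import DataLoader, Subset

def make_shard_loader(dataset, rank, world_size, seed=42, batch_size=32):
    g = torch.Generator().manual_seed(seed)            # same seed on every node
    order = torch.randperm(len(dataset), generator=g)  # identical permutation
    shard = order[rank::world_size].tolist()           # disjoint per-rank slice
    return DataLoader(Subset(dataset, shard), batch_size=batch_size)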

Fault Tolerance & Monitoring

  • Checkpointing — Periodic model snapshots with configurable intervals. Supports resuming training from the latest checkpoint after failures (see the sketch after this list).
  • Weights & Biases Integration — Automatic logging of training metrics, gradient norms, per-layer statistics, and system metrics (GPU utilization, memory usage, network throughput).
  • Timeout Handling — Configurable timeouts prevent deadlocks when workers fail or network partitions occur.
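The checkpoint/resume cycle can be as simple as the pattern below. The file layout and dictionary keys are assumptions for illustration, not smolcluster's on-disk format.

import os
import torch

def save_checkpoint(model, optimizer, step, path="checkpoints/latest.pt"):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict(),
                "step": step}, path)

def resume(model, optimizer, path="checkpoints/latest.pt"):
    if not os.path.exists(path):
        return 0                                 # fresh run
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optim"])
    return ckpt["step"]                          # continue from the saved step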

Performance Optimizations

  • Gradient Quantization — Optional 8-bit quantization reduces gradient transfer size by 4× with minimal accuracy impact (see the sketch after this list).
  • CPU-based Computation — Designed to utilize CPU cores on commodity hardware (Mac minis, Raspberry Pis) rather than requiring GPUs.
  • Mixed Precision Training — FP16 automatic mixed precision support for compatible hardware to accelerate training.
  • Gradient Accumulation — Simulates larger batch sizes by accumulating gradients over multiple micro-batches before updating.
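The 4× figure above comes from packing 32-bit floats into 8-bit integers with a per-tensor scale. A symmetric-quantization sketch (illustrative, not necessarily the library's exact scheme):

import torch

def quantize_grad(grad: torch.Tensor):
    # Per-tensor scale maps the largest magnitude to 127.
    scale = grad.abs().max().clamp(min=1e-8) / 127.0
    q = (grad / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_grad(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

# Round trip: per-element error is bounded by scale / 2.
g = torch.randn(1024)
q, s = quantize_grad(g)
g_hat = dequantize_grad(q, s)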

Documentation

Comprehensive guides are available to help you get the most out of smolcluster.

License

smolcluster is released under the MIT License.

Contributions are welcome! Visit the GitHub repository to get involved.