smolcluster

smolcluster is a distributed deep learning library designed for training neural networks across heterogeneous hardware using PyTorch and socket-based communication. It enables researchers and developers to leverage multiple machines with different capabilities for distributed training and inference.

The library supports various distributed training algorithms including Fully Sharded Data Parallelism (FSDP), Classic Data Parallelism (ClassicDP), Elastic Distributed Parallelism (EDP), Model Parallelism, Model Parallelism with Pipeline, and Expert Parallelism. It runs on diverse hardware including Mac minis, Raspberry Pis, MacBooks, and Windows machines.

A step-by-step setup guide covers Mac mini (Thunderbolt) clusters and Jetson or home-router clusters, with commands ready to copy.

Fully Sharded Data Parallelism (FSDP) — ZeRO-optimized data parallelism with configurable optimizer state partitioning. Supports Stage 0 (All-Reduce, no partitioning) and Stage 1 (ZeRO optimizer state partitioning), which cuts per-worker optimizer-state memory to roughly 1/N across N workers. Includes bounded staleness for flexible async training.
Classic Data Parallelism (ClassicDP) — All-Reduce based data parallelism with bounded staleness control. Workers exchange gradients directly in a fully connected topology with configurable synchronization flexibility.
Elastic Distributed Parallelism (EDP) — Asynchronous data parallelism with stale gradient tolerance, ideal for heterogeneous clusters.
Model Parallelism (MP) — Layer-wise model distribution, suited to large models and inference serving.
Model Parallelism with Pipeline (MPPipeline) — Pipeline-based model parallelism with inter-layer communication and overlapped computation for improved throughput and reduced latency during inference.
Expert Parallelism (EP) — Sparse mixture-of-experts (MoE) training with distributed expert assignment across nodes. Enables training of large sparse models with efficient expert routing and load balancing.
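
The bounded staleness mentioned for FSDP and ClassicDP can be pictured as a version check on incoming gradients. The following is a minimal sketch, not smolcluster's implementation: it assumes each gradient is tagged with the training step it was computed at, and `max_staleness`, `accept_gradient`, and `filter_gradients` are hypothetical names chosen for illustration.

```python
def accept_gradient(current_step: int, grad_step: int, max_staleness: int) -> bool:
    """Return True if a gradient is fresh enough to be applied.

    Staleness is the number of model updates that have happened since the
    gradient was computed. max_staleness = 0 means fully synchronous training.
    """
    if grad_step > current_step:
        raise ValueError("gradient is tagged with a future step")
    return (current_step - grad_step) <= max_staleness


def filter_gradients(current_step, tagged_grads, max_staleness):
    """Keep only the gradients from (step, grad) pairs within the bound."""
    return [
        grad
        for step, grad in tagged_grads
        if accept_gradient(current_step, step, max_staleness)
    ]
```

With `max_staleness=2` at step 5, gradients computed at steps 5 and 3 would be kept and one computed at step 1 would be dropped. A larger bound trades gradient freshness for less waiting on slow workers.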

Supported Inference Algorithms — Pipeline Parallelism and Data Parallelism (DP).
Streaming Token Generation — Real-time token-by-token generation with activations forwarded sequentially through distributed layers.
FastAPI Backend — RESTful API server for easy integration with web and mobile clients (iPad, browsers, etc.).
Multi-Device Support — Serve models across heterogeneous hardware (Mac minis, Raspberry Pis) with automatic activation routing.
Interactive Chat Interface — React-based web UI and iOS Swift app for real-time interaction with distributed models.
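
Streaming generation through distributed layers can be sketched as a generator that pushes activations through each stage in order and yields one token per pass. This is an illustrative sketch under stated assumptions: every stage is modelled as a local callable (in a real cluster each call would be a network hop to another device), and `stream_tokens`, `embed`, and `pick_token` are hypothetical names, not part of smolcluster's API.

```python
from typing import Callable, Iterator, List, Sequence

Stage = Callable[[List[float]], List[float]]


def forward_through_stages(stages: Sequence[Stage], activations: List[float]) -> List[float]:
    """Pass activations through every stage in order, like ranks in a pipeline."""
    for stage in stages:
        activations = stage(activations)
    return activations


def stream_tokens(
    stages: Sequence[Stage],
    embed: Callable[[List[int]], List[float]],
    pick_token: Callable[[List[float]], int],
    prompt: List[int],
    max_new_tokens: int,
) -> Iterator[int]:
    """Yield tokens one at a time as soon as each one is produced."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = forward_through_stages(stages, embed(tokens))
        next_token = pick_token(logits)
        tokens.append(next_token)  # the new token feeds the next pass
        yield next_token
```

Because the function is a generator, a server can forward each token to the client as soon as it is yielded instead of waiting for the full completion.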

Train and run inference across heterogeneous hardware including Mac minis, Raspberry Pis, MacBooks, Windows machines, and iPad clients.

Built-in support via the Hugging Face Transformers library for Data Parallelism (DP). For Model Parallelism (MP), GPT-2 (117M) is currently supported, with support for additional models coming soon.

Weights & Biases Integration — Automatic tracking of training metrics, gradient norms, and hardware utilization.
Web Interface — React-based chat UI for GPT inference with real-time streaming responses.

smolcluster implements a distributed training system designed from the ground up for heterogeneous hardware. The library supports multiple distributed training paradigms, each optimized for different cluster configurations and network topologies.

Communication Infrastructure

Socket-based Communication — Raw TCP sockets for reliable, low-level control over gradient and activation transfers between nodes. No dependency on MPI or specialized networking libraries.
Pickle Serialization — PyTorch tensors serialized with pickle for efficient network transmission, with optional gradient quantization for bandwidth reduction.
Asynchronous I/O — Non-blocking socket operations enable workers to compute while waiting for network transfers in EDP mode.
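
Because TCP is a byte stream with no message boundaries, pickled payloads sent over raw sockets need some form of framing. The sketch below shows one common approach, a fixed-size length prefix. It is an assumption about how such a layer could look, not smolcluster's actual wire format; `send_msg`, `recv_msg`, and the 8-byte header are illustrative choices.

```python
import pickle
import socket
import struct

HEADER = struct.Struct("!Q")  # 8-byte big-endian payload length (assumed format)


def send_msg(sock: socket.socket, obj) -> None:
    """Pickle obj and send it with a length prefix."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    sock.sendall(HEADER.pack(len(payload)) + payload)


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes; recv() may return fewer than requested."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        buf.extend(chunk)
    return bytes(buf)


def recv_msg(sock: socket.socket):
    """Read one length-prefixed message and unpickle it."""
    (length,) = HEADER.unpack(_recv_exact(sock, HEADER.size))
    return pickle.loads(_recv_exact(sock, length))
```

Unpickling executes code chosen by the sender, so a scheme like this is only appropriate between trusted nodes on a private network.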

Distributed Training Modes

Elastic Distributed Parallelism (EDP) — Workers train independently with stale gradient tolerance. The parameter server accepts gradients from any model version, making it resilient to stragglers and network latency variance. Workers periodically pull the latest weights without synchronization barriers.
Model Parallelism — Sequential layer distribution across nodes with activation forwarding. Enables training and inference of models exceeding single-device memory. Each worker holds a subset of layers and forwards activations to the next rank.
Pipeline Model Parallelism — Temporal pipeline parallelism with multiple micro-batches in flight across stages. Reduces bubble size and improves GPU/device utilization during training of large models.
Expert Parallelism — Distributed training of mixture-of-experts models with expert placement across nodes. Enables efficient training of sparse models with dynamic expert routing and load balancing across heterogeneous devices.
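
The EDP behaviour described above, where the server applies gradients regardless of which model version produced them, can be illustrated with a toy parameter server. This is a single-threaded sketch using plain Python lists and plain SGD; the `ParameterServer` class and its `pull`/`push` methods are hypothetical and do not reflect smolcluster's real classes.

```python
class ParameterServer:
    """Toy parameter server that accepts gradients from any model version."""

    def __init__(self, weights, lr):
        self.weights = list(weights)
        self.lr = lr
        self.version = 0  # incremented on every applied update

    def pull(self):
        """Return the current version and a copy of the weights."""
        return self.version, list(self.weights)

    def push(self, grads, worker_version):
        """Apply an SGD step without waiting for other workers.

        Returns the staleness of the gradient, i.e. how many updates were
        applied since the worker pulled the weights it trained on.
        """
        staleness = self.version - worker_version
        self.weights = [w - self.lr * g for w, g in zip(self.weights, grads)]
        self.version += 1
        return staleness
```

A slow worker that pulled version 0 can still push after a faster worker has already advanced the server to version 1; its update is applied with a staleness of 1 rather than rejected. A concurrent server would additionally need a lock around `push`.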

Data Management

Automatic Data Partitioning — Dataset automatically sharded across workers based on global rank and world size, ensuring no data overlap.
Deterministic Shuffling — Seeded random number generators ensure reproducible data ordering across runs.
Streaming Support — Memory-efficient data loading for large datasets with PyTorch DataLoader integration.
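
Rank-based partitioning with deterministic shuffling can be sketched in a few lines. The example below assumes a strided split over a seeded permutation, which is one common way to guarantee disjoint shards; `shard_indices` and its parameters are hypothetical names, and smolcluster's own partitioning may differ.

```python
import random


def shard_indices(num_samples, rank, world_size, seed, epoch=0):
    """Return the dataset indices assigned to one worker.

    Every worker builds the same seeded permutation, then takes every
    world_size-th element starting at its own rank, so shards never overlap.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    indices = list(range(num_samples))
    random.Random(seed + epoch).shuffle(indices)  # same order on every worker
    return indices[rank::world_size]
```

Passing the epoch into the seed gives a different but still reproducible order each epoch. PyTorch's `DistributedSampler` follows a similar idea and can be used with a `DataLoader` when the dataset is map-style.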

Fault Tolerance & Monitoring

Checkpointing — Periodic model snapshots with configurable intervals. Supports resuming training from the latest checkpoint after failures.
Weights & Biases Integration — Automatic logging of training metrics, gradient norms, per-layer statistics, and system metrics (GPU utilization, memory usage, network throughput).
Timeout Handling — Configurable timeouts prevent deadlocks when workers fail or network partitions occur.
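
Timeout handling on a blocking socket can be sketched with the standard library's `settimeout`. This is an illustrative helper, assuming the caller treats a timeout as "worker unresponsive" and decides what to do next; `recv_with_timeout` is a hypothetical name, not a smolcluster function.

```python
import socket


def recv_with_timeout(sock: socket.socket, nbytes: int, timeout_s: float):
    """Return received bytes, or None if nothing arrives within timeout_s.

    An empty bytes object (b"") still means the peer closed the connection.
    """
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        return None  # caller can retry, skip this worker, or abort
    finally:
        sock.settimeout(None)  # restore blocking mode
```

Returning `None` instead of blocking forever lets a coordinator skip or drop a failed worker rather than deadlock waiting on it.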

Performance Optimizations

Gradient Quantization — Optional 8-bit quantization reduces gradient transfer size by about 4× relative to FP32, with minimal accuracy impact.
CPU-based Computation — Designed to utilize CPU cores on commodity hardware (Mac minis, Raspberry Pis) rather than requiring GPUs.
Mixed Precision Training — FP16 automatic mixed precision support for compatible hardware to accelerate training.
Gradient Accumulation — Simulates larger batch sizes by accumulating gradients over multiple micro-batches before updating.
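
The 4× figure for gradient quantization follows from sending one byte per value instead of four. The sketch below uses symmetric linear quantization with a single scale per tensor, which is an assumption for illustration; the scheme smolcluster actually uses is not specified here, and `quantize_int8` and `dequantize_int8` are hypothetical names.

```python
from array import array


def quantize_int8(values):
    """Symmetric linear quantization of floats to int8 plus one scale factor."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Clamp to [-127, 127] to guard against floating-point rounding at the edges.
    q = array("b", (max(-127, min(127, round(v / scale))) for v in values))
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float values from int8 codes."""
    return [x * scale for x in q]
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`). The payload also carries the scale factor, so the saving is slightly under 4× for very small tensors.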

smolcluster is released under the MIT License.

Contributions are welcome! Visit the GitHub repository to get involved.
