smolcluster

Distributed Deep Learning Library for Heterogeneous Hardware

Training and inference for neural networks across Mac minis, Raspberry Pi, and GPUs with PyTorch

Overview

smolcluster is a distributed deep learning library designed for training neural networks across heterogeneous hardware using PyTorch and socket-based communication. It enables researchers and developers to leverage multiple machines with different capabilities for distributed training and inference.

The library supports various distributed training algorithms including Fully Sharded Data Parallelism (FSDP), Classic Data Parallelism (ClassicDP), Elastic Distributed Parallelism (EDP), Synchronous Parameter Server (SyncPS), and Model Parallelism. It can run on diverse hardware including Mac minis, Raspberry Pis, MacBooks, and Windows machines.

Cluster Architecture

Architecture diagram: a heterogeneous cluster of Mac minis and Raspberry Pis arranged in a parameter-server topology, supporting PyTorch model parallelism and data parallelism.

Key Features

🔄 Distributed Training Algorithms

  • Fully Sharded Data Parallelism (FSDP) - ZeRO-optimized data parallelism with configurable optimizer state partitioning. Supports Stage 0 (All-Reduce) and Stage 1 (ZeRO optimizer partitioning) for ~1/N memory reduction. Includes bounded staleness for flexible async training
  • Classic Data Parallelism (ClassicDP) - All-Reduce based data parallelism with bounded staleness control. Workers exchange gradients directly in a fully connected topology with configurable synchronization flexibility
  • Elastic Distributed Parallelism (EDP) - Asynchronous data parallelism with stale gradient tolerance, ideal for heterogeneous clusters
  • Synchronous Parameter Server (SyncPS) - Synchronous data parallelism with barrier coordination for homogeneous clusters
  • Model Parallelism (MP) - Layer-wise model distribution perfect for large models and inference serving

🚀 Distributed Inference

  • Model Parallelism Inference - Run large language models that exceed single-device memory by distributing layers across multiple nodes
  • Streaming Token Generation - Real-time token-by-token generation with activations forwarded sequentially through distributed layers
  • FastAPI Backend - RESTful API server for easy integration with web and mobile clients (iPad, browsers, etc.)
  • Multi-Device Support - Serve models across heterogeneous hardware (Mac minis, Raspberry Pis) with automatic activation routing
  • Interactive Chat Interface - React-based web UI and iOS Swift app for real-time interaction with distributed models
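The streaming flow can be sketched in miniature. This is a pure in-process toy, assuming nothing about smolcluster's actual APIs: `make_node`, `stream_tokens`, and the `decode` callback are illustrative stand-ins for layer shards that would really live on separate machines and exchange activations over sockets.

```python
from typing import Callable, Iterator, List

# Toy stand-in for a node holding a shard of the model's layers: it applies
# its layers to the incoming "activation" (here just a list of floats).
def make_node(layers: List[Callable[[list], list]]) -> Callable[[list], list]:
    def forward(activation: list) -> list:
        for layer in layers:
            activation = layer(activation)
        return activation
    return forward

def stream_tokens(prompt: list, nodes: List[Callable[[list], list]],
                  decode: Callable[[list], str], max_tokens: int) -> Iterator[str]:
    """Yield tokens one at a time, forwarding the activation through each
    node in rank order (as activations would travel between machines)."""
    context = list(prompt)
    for _ in range(max_tokens):
        activation = context
        for node in nodes:          # rank 0 -> rank 1 -> ... -> last rank
            activation = node(activation)
        token = decode(activation)  # the last rank produces the next token
        yield token                 # stream it back before computing the next
        context = context + [float(len(token))]  # toy context update
```

Because `stream_tokens` is a generator, each token reaches the client as soon as the last rank produces it, rather than after the whole completion is done.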

🖥️ Hardware Support

Train and run inference across heterogeneous hardware including Mac minis, Raspberry Pis, MacBooks, Windows machines, and iPad clients.

🤖 Model Support

Built-in support for MNIST, GPT-2 (117M parameters), and custom neural networks. Supports both training and distributed inference with model parallelism. Streaming token generation enables interactive applications with large language models.

📊 Monitoring & Logging

  • Grafana + Loki - Centralized log aggregation with real-time queries across all nodes
  • Weights & Biases Integration - Automatic tracking of training metrics, gradient norms, and hardware utilization
  • Web Interface - React-based chat UI for GPT inference with real-time streaming responses

Demo

Distributed GPT-2 Inference with Model Parallelism:

  • Model: GPT-2 (117M parameters)
  • Hardware: iPad client + 2× Mac Mini M4 (2025)
  • Algorithm: Model Parallelism with layer distribution
  • Demo: Real-time streaming token generation across distributed layers
  • Workflow: User prompts from iPad → activations forwarded between Mac Minis → tokens streamed back

Distributed Training Architectures

The following sections describe the distributed training algorithms and architectures in action.

Classic Data Parallelism

All-to-All Architecture:

  • Architecture: Direct peer-to-peer gradient exchange
  • Algorithm: All-Reduce based data parallelism
  • Communication: Fully connected topology with bounded staleness
  • Use Case: Workers exchange gradients directly without central parameter server
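The core of this mode is the averaging step. Below is a minimal in-process sketch of what the All-Reduce computes; in the real system each worker holds one of these gradient vectors and the exchange happens over sockets (the bounded-staleness bookkeeping is omitted here).

```python
def all_reduce_mean(local_grads):
    """Given one gradient vector per worker, return what an All-Reduce
    leaves behind: every worker holds the elementwise mean of all peers'
    gradients, so every replica applies the identical update."""
    n = len(local_grads)
    dim = len(local_grads[0])
    mean = [sum(g[k] for g in local_grads) / n for k in range(dim)]
    return [list(mean) for _ in range(n)]  # one identical copy per worker
```

After this step every worker applies the same averaged gradient, which keeps all model replicas in sync without a central parameter server.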

Synchronous Parameter Server

Parameter Server Architecture:

  • Architecture: Barrier-based synchronous coordination
  • Algorithm: Data Parallelism with parameter server topology
  • Synchronization: Workers wait for all gradients before updating
  • Use Case: Homogeneous clusters with consistent compute capabilities
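The barrier-based coordination can be sketched with `threading.Barrier` standing in for the network-level barrier. This is a toy, not smolcluster's implementation: workers submit gradients, everyone blocks, and the last arrival triggers a single averaged update before the fresh weights are handed back.

```python
import threading

class SyncParameterServer:
    """Toy synchronous parameter server: workers submit gradients, block on
    a barrier, and the final arrival triggers one averaged SGD update."""

    def __init__(self, weights, lr, n_workers):
        self.weights = list(weights)
        self.lr = lr
        self.n = n_workers
        self.pending = []
        self.lock = threading.Lock()
        # The barrier's action runs exactly once per round, in one thread,
        # after every worker has called wait().
        self.barrier = threading.Barrier(n_workers, action=self._apply_update)

    def _apply_update(self):
        for k in range(len(self.weights)):
            mean_g = sum(g[k] for g in self.pending) / self.n
            self.weights[k] -= self.lr * mean_g
        self.pending.clear()

    def submit(self, grad):
        with self.lock:
            self.pending.append(grad)
        self.barrier.wait()        # block until all workers have submitted
        return list(self.weights)  # every worker receives the same weights
```

The barrier guarantees no worker trains on weights that are missing a peer's contribution, which is why this mode favors homogeneous clusters: the round advances at the pace of the slowest worker.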

Getting Started

Note: smolcluster requires a distributed hardware setup and network configuration before you can begin training; it is not a plug-and-play installation.

Prerequisites

Before using smolcluster, you need to set up your distributed cluster:

  • Hardware Setup: Configure your machines (Mac minis, Raspberry Pis, GPUs, etc.)
  • Network Configuration:
    • Mac minis: Thunderbolt connections and network bridges
    • Raspberry Pi/GPUs: Ethernet connections
    • SSH setup with proper gateways and key authentication
  • Cluster Configuration: YAML configuration files for your specific topology
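To make the prerequisites concrete, here is a hypothetical topology file. The actual schema is defined by the Cluster Setup Guide; every field name, host address, and value below is illustrative only, not smolcluster's real configuration format.

```yaml
# Hypothetical cluster topology -- field names are illustrative only;
# consult the Cluster Setup Guide for the real schema.
cluster:
  algorithm: syncps          # e.g. one of: fsdp, classicdp, edp, syncps, mp
  server:
    host: 192.168.1.10       # parameter server / rank 0
    port: 5000
  workers:
    - host: 192.168.1.11     # Mac mini, reached over a Thunderbolt bridge
      rank: 1
    - host: 192.168.1.20     # Raspberry Pi, reached over Ethernet
      rank: 2
```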

Installation

Once your cluster is properly configured:

```shell
# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/YuvrajSingh-mist/smolcluster.git
cd smolcluster
uv sync
```

Important: Please refer to the Cluster Setup Guide for detailed hardware setup, networking configuration, and troubleshooting before attempting to run training scripts.

Technical Details

smolcluster implements a distributed training system designed from the ground up for heterogeneous hardware. The library supports multiple distributed training paradigms, each optimized for different cluster configurations and network topologies.

Communication Infrastructure

  • Socket-based Communication - Raw TCP sockets for reliable, low-level control over gradient and activation transfers between nodes. No dependency on MPI or specialized networking libraries.
  • Pickle Serialization - PyTorch tensors serialized with pickle for efficient network transmission, with optional gradient quantization for bandwidth reduction.
  • Hybrid Network Support - Handles complex topologies mixing Thunderbolt fabric (10Gbps+) and Ethernet edge connections (1Gbps), with proper routing and gateway configuration.
  • Asynchronous I/O - Non-blocking socket operations enable workers to compute while waiting for network transfers in EDP mode.
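Since raw TCP is a byte stream with no message boundaries, each pickled payload needs framing before it goes on the wire. A common pattern, sketched here with the standard library (this is an assumption about the wire format, not smolcluster's actual protocol), is a 4-byte big-endian length prefix:

```python
import pickle
import struct

def send_obj(sock, obj):
    """Pickle an object (e.g. a gradient or activation tensor) and send it
    with a 4-byte big-endian length prefix, so the receiver knows how many
    bytes belong to this message."""
    payload = pickle.dumps(obj)
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_exact(sock, n):
    """recv() may return fewer bytes than asked; loop until we have n."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf

def recv_obj(sock):
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, length))
```

The `recv_exact` loop is the crucial detail: a single `recv()` call is allowed to return a partial message, so naive one-shot reads corrupt tensor transfers under load.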

Distributed Training Modes

  • Elastic Distributed Parallelism (EDP) - Workers train independently with stale gradient tolerance. The parameter server accepts gradients from any model version, making it resilient to stragglers and network latency variance. Workers periodically pull the latest weights without synchronization barriers.
  • Synchronous Parameter Server (SyncPS) - Barrier-based coordination where the server waits for all workers to submit gradients before updating. Uses Polyak averaging and synchronous weight broadcasts for faster convergence on homogeneous clusters.
  • Model Parallelism - Sequential layer distribution across nodes with activation forwarding. Enables training and inference of models exceeding single-device memory. Each worker holds a subset of layers and forwards activations to the next rank.
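The stale-gradient tolerance that distinguishes EDP can be sketched as a version check on the server side. This toy omits networking and threading entirely; the class and its fields are illustrative, not smolcluster's API:

```python
class ElasticServer:
    """Toy EDP-style parameter server: a gradient is applied only if the
    submitting worker's model version is within max_staleness of the
    current version; otherwise it is dropped as too stale."""

    def __init__(self, weights, lr, max_staleness):
        self.weights = list(weights)
        self.lr = lr
        self.version = 0
        self.max_staleness = max_staleness

    def push(self, grad, worker_version):
        if self.version - worker_version > self.max_staleness:
            return False                   # too stale: reject the update
        for k, g in enumerate(grad):
            self.weights[k] -= self.lr * g # apply immediately, no barrier
        self.version += 1
        return True

    def pull(self):
        """Workers periodically fetch the latest version and weights."""
        return self.version, list(self.weights)
```

Because `push` never blocks waiting for other workers, a slow Raspberry Pi cannot stall the Mac minis; it simply contributes slightly stale gradients until they fall outside the staleness bound.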

Data Management

  • Automatic Data Partitioning - Dataset automatically sharded across workers based on global rank and world size, ensuring no data overlap.
  • Deterministic Shuffling - Seeded random number generators ensure reproducible data ordering across runs.
  • Streaming Support - Memory-efficient data loading for large datasets with PyTorch DataLoader integration.
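Rank-based sharding with deterministic shuffling reduces to a few lines. A minimal sketch, assuming a strided partition (the function name and strategy here are illustrative, not smolcluster's actual sampler):

```python
import random

def shard_indices(n_samples, rank, world_size, seed=0):
    """Deterministically shuffle the sample indices, then give each worker
    a disjoint strided slice, so no two workers ever see the same sample
    and reruns with the same seed reproduce the same ordering."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # identical order on every node
    return indices[rank::world_size]      # strided, non-overlapping shard
```

Seeding a local `random.Random` (rather than the global RNG) is what makes the shuffle identical on every node regardless of what else each process has done with randomness.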

Fault Tolerance & Monitoring

  • Checkpointing - Periodic model snapshots with configurable intervals. Supports resuming training from the latest checkpoint after failures.
  • Distributed Logging - Grafana + Loki stack aggregates logs from all nodes in real-time. Promtail agents on each machine forward structured logs to a central Loki instance.
  • Weights & Biases Integration - Automatic logging of training metrics, gradient norms, per-layer statistics, and system metrics (GPU utilization, memory usage, network throughput).
  • Timeout Handling - Configurable timeouts prevent deadlocks when workers fail or network partitions occur.
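Interval-based checkpointing with resume-from-latest can be sketched as follows. This toy uses pickle on plain Python state; the function names and file-naming scheme are illustrative assumptions, not smolcluster's checkpoint format:

```python
import os
import pickle

def save_checkpoint(state, step, ckpt_dir, interval):
    """Write a snapshot every `interval` steps; zero-padded step numbers
    make lexicographic sort equal numeric sort. Returns the path, or None
    if this step is not a checkpoint step."""
    if step % interval != 0:
        return None
    path = os.path.join(ckpt_dir, f"ckpt_{step:08d}.pkl")
    with open(path, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    return path

def load_latest(ckpt_dir):
    """Resume from the most recent snapshot, or start fresh at step 0."""
    ckpts = sorted(p for p in os.listdir(ckpt_dir) if p.startswith("ckpt_"))
    if not ckpts:
        return 0, None
    with open(os.path.join(ckpt_dir, ckpts[-1]), "rb") as f:
        data = pickle.load(f)
    return data["step"], data["state"]
```

After a node failure, every worker calls `load_latest` and the cluster restarts from the most recent consistent snapshot instead of from scratch.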

Performance Optimizations

  • Gradient Quantization - Optional 8-bit quantization reduces gradient transfer size by 4x with minimal accuracy impact.
  • CPU-based Computation - Designed to utilize CPU cores on commodity hardware (Mac minis, Raspberry Pis) rather than requiring GPUs.
  • Mixed Precision Training - FP16 automatic mixed precision support for compatible hardware to accelerate training.
  • Gradient Accumulation - Simulates larger batch sizes by accumulating gradients over multiple micro-batches before updating.
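The 4x figure for 8-bit quantization comes from shrinking each 32-bit float to one byte plus a single shared scale. A minimal sketch of symmetric absmax quantization (one plausible scheme; the source does not specify which scheme smolcluster uses):

```python
def quantize_8bit(grads):
    """Map float gradients into the signed 8-bit range [-127, 127] using a
    single shared scale (absmax / 127), cutting transfer size roughly 4x
    versus float32."""
    scale = max(abs(g) for g in grads) / 127.0 or 1.0  # avoid scale == 0
    q = [round(g / scale) for g in grads]
    return q, scale

def dequantize_8bit(q, scale):
    """Recover approximate float gradients on the receiving side."""
    return [v * scale for v in q]
```

The worst-case error per element is half a quantization step (scale / 2), which is why the accuracy impact stays small for typical gradient distributions.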

See the architecture diagram above for a visual representation of the cluster topology.

Documentation

Comprehensive guides to help you get the most out of smolcluster are available in the repository.

License

smolcluster is released under the MIT License.

Contributions are welcome! Visit the GitHub repository to get involved.