Share GPU, CPU, VRAM and RAM across multiple machines for LLM inference
nmonit turns a network of machines into a unified compute cluster for Large Language Model inference. A host node orchestrates workloads across worker nodes, pooling their hardware resources — GPU, VRAM, CPU, and RAM.
Built around LiteLLM as its inference backend, nmonit exposes an OpenAI-compatible REST API so any LLM application can use the cluster without modification.
Architecture
The host runs the orchestrator and the LiteLLM proxy. Workers connect via WebSocket and report their hardware capabilities. Applications send standard OpenAI API requests to the host, and the scheduler picks the optimal worker.
How It Works
- Host starts and connects to LiteLLM, loading available models
- Workers connect to the host via WebSocket, reporting their hardware capabilities (GPU, VRAM, CPU, RAM)
- Applications send inference requests to the host’s OpenAI-compatible API (/v1/chat/completions)
- The host’s scheduler selects the optimal worker based on model caching, available VRAM, available RAM, current task load, and priority
- Workers execute inference using their local LiteLLM and return results
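As an illustration of the capability-reporting step, here is a minimal Python sketch of how a Linux worker could gather CPU and RAM figures and package them as JSON. The message shape and every field name (`type`, `cpu_cores`, `ram_total_mb`, and so on) are assumptions made for this example, not nmonit's actual wire format, and GPU/VRAM discovery is left out.

```python
import json
import os


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def build_report(meminfo_text, cpu_count):
    """Assemble a hypothetical capability report; field names are illustrative."""
    mem = parse_meminfo(meminfo_text)
    return {
        "type": "capabilities",
        "cpu_cores": cpu_count,
        "ram_total_mb": mem.get("MemTotal", 0) // 1024,
        "ram_available_mb": mem.get("MemAvailable", 0) // 1024,
        "gpus": [],  # GPU/VRAM discovery is omitted in this sketch
    }


def local_report():
    """Read this machine's values (Linux only: relies on /proc/meminfo)."""
    with open("/proc/meminfo") as f:
        return build_report(f.read(), os.cpu_count() or 1)


# Example with fixed input so the result does not depend on the machine:
sample = "MemTotal:       16384000 kB\nMemAvailable:    8192000 kB\n"
print(json.dumps(build_report(sample, 8)))
```

Keeping the parsing separate from the file read makes the report easy to test with canned input; a real worker would also need to query the GPU driver for VRAM figures.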
Smart Scheduling
The nmonit scheduler distributes workloads using a scoring system:
- Model locality (+200): Workers that already have the model loaded are strongly preferred
- VRAM availability (+100 × ratio): GPU workers with free VRAM are preferred
- RAM availability (+50 × ratio): Workers with free system memory are preferred
- Task capacity (+80 × ratio): Workers with fewer active tasks are preferred
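The weights above can be read as a simple additive score. The Python sketch below is one plausible interpretation, not the scheduler's actual code: it assumes each "ratio" is the free fraction of the resource (free / total, clamped to 0..1) and that task capacity is the fraction of unused task slots. The `Worker` fields and function names are hypothetical, and request priority is left out because the list assigns it no weight.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical snapshot of one worker's reported state."""
    name: str
    loaded_models: set = field(default_factory=set)
    vram_free_mb: int = 0
    vram_total_mb: int = 0  # 0 means a CPU-only worker
    ram_free_mb: int = 0
    ram_total_mb: int = 0
    active_tasks: int = 0
    max_tasks: int = 1


def ratio(free, total):
    """Free fraction of a resource, clamped to [0, 1]; 0 if the resource is absent."""
    if total <= 0:
        return 0.0
    return max(0.0, min(1.0, free / total))


def score(worker, model):
    """Sum the four weighted terms: locality, VRAM, RAM, and task capacity."""
    total = 200.0 if model in worker.loaded_models else 0.0
    total += 100.0 * ratio(worker.vram_free_mb, worker.vram_total_mb)
    total += 50.0 * ratio(worker.ram_free_mb, worker.ram_total_mb)
    total += 80.0 * ratio(worker.max_tasks - worker.active_tasks, worker.max_tasks)
    return total


def pick_worker(workers, model):
    """Return the highest-scoring worker, or None if no workers are connected."""
    return max(workers, key=lambda w: score(w, model), default=None)


gpu = Worker("gpu-1", {"llama"}, 4096, 8192, 8192, 16384, 1, 4)
cpu = Worker("cpu-1", set(), 0, 0, 24576, 32768, 0, 4)
print(score(gpu, "llama"), score(cpu, "llama"))  # 335.0 117.5
print(pick_worker([gpu, cpu], "llama").name)     # gpu-1
```

Under these assumptions the maximum score is 430, and the +200 locality bonus outweighs any combination of the RAM and task terms, which matches the "strongly preferred" wording for workers that already have the model loaded.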
Key Features
- OpenAI-compatible REST API — plug into any LLM application without changes
- Smart scheduling — model locality, VRAM/RAM awareness, task load balancing
- Mixed hardware — GPU workers and CPU-only workers in the same cluster
- Real-time resource monitoring — GPU utilization, CPU, memory tracking
- Secure — shared token authentication, TLS support, workers initiate outbound only
- Systemd integration — host and worker run as services
- Rust-powered — single binary, minimal footprint, 9MB
Configuration
Edit /etc/nmonit/nmonit.yaml:
host:
  listen_addr: "0.0.0.0"
  port: 9742
  litellm_base_url: "http://localhost:4000"
worker:
  host_addr: "192.168.1.100"  # Host IP
  host_port: 9742
  auth_token: "my-secret"
Quick Start
# Build & install
git clone https://github.com/ParadoxFuzzle/nmonit
cd nmonit
cargo build --release
sudo ./scripts/install.sh
# Start the host
nmonit host --config /etc/nmonit/nmonit.yaml
# Connect workers (on each worker machine)
nmonit worker --host 192.168.1.100 --token "my-secret"
# Use the cluster
curl http://localhost:9742/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello!"}]}'
# Check cluster status
nmonit status
nmonit models
CLI Reference
| Command | Description |
|---|---|
| nmonit host [options] | Start as the orchestrator node |
| nmonit worker [options] | Start as a worker node |
| nmonit status [--host <url>] | Show cluster status |
| nmonit models [--host <url>] | List available models |
API Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check |
| GET | /v1/models | List models (OpenAI-compatible) |
| POST | /v1/chat/completions | Chat completion (OpenAI-compatible) |
| GET | /cluster/nodes | List connected workers |
| GET | /cluster/stats | Cluster resource statistics |
| WS | /ws/worker | WebSocket endpoint for workers |
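Because the chat endpoint follows the OpenAI request shape, it can be called from any HTTP client. The following Python sketch uses only the standard library and mirrors the curl command from the Quick Start; the helper names `build_chat_request` and `send` are made up for this example, and the host address and model name are the ones used in the Quick Start. It sends no authentication header, as in the curl example; whether a given deployment requires one is not covered here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9742"  # host address and port from the Quick Start


def build_chat_request(model, user_message, base_url=BASE_URL):
    """Build (but do not send) a POST request to /v1/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# With a running host:
#   reply = send(build_chat_request("gpt-4", "Hello!"))
#   print(reply["choices"][0]["message"]["content"])  # OpenAI-style response layout
```

Splitting request construction from sending keeps the payload inspectable without a live cluster. The GET endpoints in the table (/health, /v1/models, /cluster/nodes, /cluster/stats) can be fetched the same way by passing a plain URL to `urllib.request.urlopen`.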
Requirements
- Host: Linux with systemd, LiteLLM installed and running
- Worker: Linux, network access to host, optional NVIDIA GPU with CUDA
- Rust 1.75+ (for building from source)
Download
The download bundles a pre-built binary, a config file, scripts, and systemd units. Extract the archive and run sudo ./scripts/install.sh to install.
License
MIT
