Overview
MLRunX tracks ML experiments as first-class runs. Each run can store params, tags, metrics, events, and artifacts while remaining easy to query and compare across projects.
Current release focus: Rust API gateway + SQLite-backed local deployment, Next.js dashboard, project-scoped API keys, share links, and a Python SDK with async batching and offline spooling.
Why MLRunX
- Run-centric model with searchable metadata and comparison views.
- Performance-first backend (Rust/Axum + gRPC path).
- Local-first deployment with minimal ops overhead.
- Scaffolded path to scale-out backends (ClickHouse/Postgres/MinIO).
Quick Start
Start API (standalone)
```bash
git clone https://github.com/ibusnowden/MLRunX.git
cd MLRunX
cargo run --bin mlrunx-api
# HTTP :3001, gRPC :50051, SQLite ./mlrunx.db
```
Start Dashboard
```bash
cd apps/ui
npm install
npm run dev
# UI on http://localhost:3000
```
Docker option
```bash
docker run -p 3001:3001 -p 50051:50051 -v mlrunx-data:/data \
  ghcr.io/ibusnowden/mlrunx:latest
```
Use Hosted API
From the training machine:
```bash
uv pip install --upgrade mlrunx

export MLRUNX_SERVER_URL=https://mlrunx.your-domain.com
export MLRUNX_API_KEY=mlrunx_...
export MLRUNX_PROJECT_ID=019c...

python run.py
```
Minimal run code:
```python
import mlrunx

run = mlrunx.init(
    project_id="019c...",
    name="char-gpt-scratch",
    tags={"framework": "scratch", "dataset": "names"},
)

for step in range(1000):
    loss = train_step()
    val_loss = eval_step()
    run.log({"loss": loss, "val_loss": val_loss}, step=step)

run.finish(status="finished")
```
Python SDK
The SDK is asynchronous and non-blocking. Calls to `run.log()` are queued and flushed in the background to avoid slowing training loops.
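The queue-and-flush pattern described above can be sketched roughly as follows. This is an illustrative model, not the SDK's actual implementation: the `EventBuffer` name, the `_flush` hook, and the use of a plain thread are all assumptions, and a real flusher would send each batch to the API over HTTP or gRPC instead of appending to a list.

```python
import queue
import threading


class EventBuffer:
    """Toy model of a non-blocking logger: log() only enqueues;
    a daemon thread drains the queue and flushes in batches."""

    def __init__(self, batch_size=1000, timeout_s=1.0):
        self.batch_size = batch_size  # cf. MLRUNX_BATCH_SIZE
        self.timeout_s = timeout_s    # cf. MLRUNX_BATCH_TIMEOUT_MS
        self.queue = queue.Queue()
        self.flushed = []  # stand-in for "batch sent to server"
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, event):
        # Non-blocking from the caller's perspective: just enqueue.
        self.queue.put(event)

    def _run(self):
        batch = []
        while True:
            try:
                item = self.queue.get(timeout=self.timeout_s)
            except queue.Empty:
                # Queue aged past the timeout: flush whatever we have.
                if batch:
                    self._flush(batch)
                    batch = []
                continue
            if item is StopIteration:  # shutdown sentinel
                if batch:
                    self._flush(batch)
                return
            batch.append(item)
            if len(batch) >= self.batch_size:
                self._flush(batch)
                batch = []

    def _flush(self, batch):
        # A real SDK would POST / gRPC-stream this batch to the API.
        self.flushed.append(list(batch))

    def close(self):
        self.queue.put(StopIteration)
        self._worker.join()
```

The training loop only ever pays the cost of a `Queue.put`; batching and network latency live entirely on the worker thread.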
Install
```bash
pip install mlrunx
```
Basic usage
```python
import mlrunx

run = mlrunx.init(
    project="demo-project",
    name="train-resnet50",
    tags={"model": "resnet50", "dataset": "imagenet"},
)

run.log_params({"lr": 0.001, "batch_size": 32})

for step in range(1000):
    loss, acc = train_step()
    run.log({"loss": loss, "accuracy": acc}, step=step)

run.finish()
```
Context manager
```python
import mlrunx

with mlrunx.init(project="demo-project") as run:
    run.log_params({"optimizer": "adamw", "epochs": 10})
    for step in range(200):
        run.log({"loss": train_step()}, step=step)
# automatically flushes and closes
```
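Under the hood, context-manager support typically amounts to `__exit__` calling `finish()` for you. The sketch below shows the general shape, assuming (this is a guess, not confirmed by the SDK docs) that a run raised out of is marked `"failed"` while a clean exit is marked `"finished"`:

```python
class Run:
    """Minimal stand-in for an SDK run object to show the
    context-manager lifecycle; not the real mlrunx class."""

    def __init__(self):
        self.status = None

    def log(self, metrics, step=None):
        pass  # the real SDK would enqueue these events

    def finish(self, status="finished"):
        self.status = status

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Mark the run failed if the block raised, finished otherwise.
        self.finish(status="failed" if exc_type else "finished")
        return False  # never swallow the caller's exception
```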
Run API
Core methods for everyday tracking:
```python
run = mlrunx.init(project="my-project", name="exp-01")

run.log({"loss": 0.41, "accuracy": 0.87}, step=140)
run.log_params({"lr": 0.0005, "dropout": 0.1})
run.log_tags({"owner": "ibra", "stage": "baseline"})

run.finish(status="finished")
```
Offline spool behavior
When the API is unreachable, queued events are written to a local disk spool and replayed once connectivity returns. Control it with:
```bash
export MLRUNX_SPOOL_ENABLED=true
export MLRUNX_SPOOL_DIR=~/.mlrunx/spool
export MLRUNX_SPOOL_MAX_SIZE=100000000
```
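A spool like this can be modeled as an append-only JSONL file that is replayed in order when the connection recovers. The sketch below is an assumption about the mechanism, not the SDK's actual file layout: `spool_event`, `replay_spool`, and the `events.jsonl` filename are all illustrative names.

```python
import json
import os


def spool_event(spool_dir, event):
    """Append one event to the on-disk spool (offline path)."""
    os.makedirs(spool_dir, exist_ok=True)
    path = os.path.join(spool_dir, "events.jsonl")
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")


def replay_spool(spool_dir, send):
    """Re-send spooled events in original order, then clear the spool."""
    path = os.path.join(spool_dir, "events.jsonl")
    if not os.path.exists(path):
        return 0
    sent = 0
    with open(path) as f:
        for line in f:
            send(json.loads(line))  # send = real network call in the SDK
            sent += 1
    os.remove(path)  # clear only after a successful replay
    return sent
```

JSONL keeps appends cheap and crash-tolerant, which is why it is a common choice for this kind of spool; a size cap like `MLRUNX_SPOOL_MAX_SIZE` would bound how large the file may grow.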
Configuration
Common runtime variables used by the server and SDK.
| Variable | Default | Description |
|---|---|---|
| MLRUNX_SERVER_URL | http://localhost:3001 | SDK target API URL |
| MLRUNX_API_KEY | None | Auth key for SDK requests |
| MLRUNX_BATCH_SIZE | 1000 | Max events per flush batch |
| MLRUNX_BATCH_TIMEOUT_MS | 1000 | Max queue age before flush |
| MLRUNX_COALESCE_METRICS | true | Keep latest metric per step |
| MLRUNX_SPOOL_ENABLED | true | Enable offline disk spool |
| MLRUNX_OFFLINE | false | Force offline-only mode |
| API_HTTP_PORT | 3001 | Rust API HTTP port |
| API_GRPC_PORT | 50051 | Rust API gRPC port |
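One way an SDK might resolve the variables in the table above, falling back to the documented defaults when unset, is sketched below. Only the variable names and defaults come from the table; the `load_config` helper itself is illustrative, not part of the mlrunx API.

```python
import os

# Defaults taken from the configuration table; values are strings
# as they would arrive from the environment.
DEFAULTS = {
    "MLRUNX_SERVER_URL": "http://localhost:3001",
    "MLRUNX_BATCH_SIZE": "1000",
    "MLRUNX_BATCH_TIMEOUT_MS": "1000",
    "MLRUNX_COALESCE_METRICS": "true",
    "MLRUNX_SPOOL_ENABLED": "true",
    "MLRUNX_OFFLINE": "false",
}


def load_config(env=os.environ):
    """Merge the environment over the defaults and coerce types."""
    cfg = {k: env.get(k, v) for k, v in DEFAULTS.items()}
    return {
        "server_url": cfg["MLRUNX_SERVER_URL"],
        "api_key": env.get("MLRUNX_API_KEY"),  # no default: required for auth
        "batch_size": int(cfg["MLRUNX_BATCH_SIZE"]),
        "batch_timeout_ms": int(cfg["MLRUNX_BATCH_TIMEOUT_MS"]),
        "coalesce_metrics": cfg["MLRUNX_COALESCE_METRICS"].lower() == "true",
        "spool_enabled": cfg["MLRUNX_SPOOL_ENABLED"].lower() == "true",
        "offline": cfg["MLRUNX_OFFLINE"].lower() == "true",
    }
```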
Architecture
Current architecture is monolith-first: one API server with clear internal boundaries for future service extraction.
```text
Python SDK --> Rust API (Axum + Tonic) --> SQLite (v0.1 default)
                     |                         |
                     |                         +--> API keys, share tokens,
                     |                              run metadata, metrics/events/params
                     +--> Next.js UI (TypeScript)
```
Scale-out path (scaffolded in repo)
```text
SDK --> Ingest Service --> ClickHouse (metrics)
UI <-> API Gateway  <-> PostgreSQL (metadata)
        Processor   --> MinIO      (artifacts)
```
Project Layout
```text
MLRunX/
├── apps/
│   ├── api/          # Rust API gateway
│   └── ui/           # Next.js dashboard
├── sdks/
│   ├── python/       # Python SDK
│   └── integrations/ # Framework hooks
├── services/
│   ├── ingest/       # Scaffolded ingest service
│   └── processor/    # Scaffolded rollup processor
├── crates/proto/     # Shared protobuf contracts
├── infra/docker/     # Compose stack and local infra
├── docs/             # Architecture/specs/ops docs
└── bench/            # Benchmarks and thresholds
```