Project Documentation

MLRunX

Open-source ML experiment tracking focused on fast ingestion, run-centric workflows, and a clean local-first deployment path.

Overview

MLRunX tracks ML experiments as first-class runs. Each run can store params, tags, metrics, events, and artifacts while remaining easy to query and compare across projects.

Current release focus: Rust API gateway + SQLite-backed local deployment, Next.js dashboard, project-scoped API keys, share links, and a Python SDK with async batching and offline spooling.

Why MLRunX

  • Run-centric model with searchable metadata and comparison views.
  • Performance-first backend (Rust/Axum + gRPC path).
  • Local-first deployment with minimal ops overhead.
  • Scaffolded path to scale-out backends (ClickHouse/Postgres/MinIO).

Quick Start

Start API (standalone)

git clone https://github.com/ibusnowden/MLRunX.git
cd MLRunX
cargo run --bin mlrunx-api
# HTTP :3001, gRPC :50051, SQLite ./mlrunx.db

Start Dashboard

cd apps/ui
npm install
npm run dev
# UI on http://localhost:3000

Docker option

docker run -p 3001:3001 -p 50051:50051 -v mlrunx-data:/data \
  ghcr.io/ibusnowden/mlrunx:latest

Use Hosted API

From the training machine:

uv pip install --upgrade mlrunx
export MLRUNX_SERVER_URL=https://mlrunx.your-domain.com
export MLRUNX_API_KEY=mlrunx_...
export MLRUNX_PROJECT_ID=019c...
python run.py

Minimal run code:

import mlrunx

run = mlrunx.init(
    project_id="019c...",
    name="char-gpt-scratch",
    tags={"framework": "scratch", "dataset": "names"},
)

for step in range(1000):
    loss = train_step()
    val_loss = eval_step()
    run.log({"loss": loss, "val_loss": val_loss}, step=step)

run.finish(status="finished")

Python SDK

The SDK is asynchronous and non-blocking. Calls to run.log() are queued and flushed in the background to avoid slowing training loops.
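The batching behavior described above can be sketched as a background worker draining a queue. This is an illustrative pattern only, not the SDK's actual internals; the `BackgroundLogger` class, its `flush_fn` callback, and the parameter names are hypothetical, loosely mirroring the `MLRUNX_BATCH_SIZE` and `MLRUNX_BATCH_TIMEOUT_MS` settings:

```python
import queue
import threading
import time

class BackgroundLogger:
    """Sketch of a non-blocking log queue (illustrative, not the real SDK internals)."""

    def __init__(self, flush_fn, batch_size=1000, timeout_s=1.0):
        self._q = queue.Queue()
        self._flush_fn = flush_fn          # e.g. an HTTP POST of a batch of events
        self._batch_size = batch_size
        self._timeout_s = timeout_s
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, event):
        # Called from the training loop: enqueue and return immediately.
        self._q.put(event)

    def _run(self):
        batch, deadline = [], time.monotonic() + self._timeout_s
        while True:
            try:
                batch.append(self._q.get(timeout=0.05))
            except queue.Empty:
                pass
            # Flush when the batch is full or the oldest queued event is too old.
            if batch and (len(batch) >= self._batch_size or time.monotonic() >= deadline):
                self._flush_fn(batch)
                batch, deadline = [], time.monotonic() + self._timeout_s
```

The training loop only pays the cost of a queue insert; serialization and network I/O happen on the worker thread.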

Install

pip install mlrunx

Basic usage

import mlrunx

run = mlrunx.init(
    project="demo-project",
    name="train-resnet50",
    tags={"model": "resnet50", "dataset": "imagenet"},
)

run.log_params({"lr": 0.001, "batch_size": 32})

for step in range(1000):
    loss, acc = train_step()
    run.log({"loss": loss, "accuracy": acc}, step=step)

run.finish()

Context manager

import mlrunx

with mlrunx.init(project="demo-project") as run:
    run.log_params({"optimizer": "adamw", "epochs": 10})
    for step in range(200):
        run.log({"loss": train_step()}, step=step)
# automatically flushes and closes

Run API

Core methods for everyday tracking:

run = mlrunx.init(project="my-project", name="exp-01")

run.log({"loss": 0.41, "accuracy": 0.87}, step=140)
run.log_params({"lr": 0.0005, "dropout": 0.1})
run.log_tags({"owner": "ibra", "stage": "baseline"})

run.finish(status="finished")

Offline spool behavior

export MLRUNX_SPOOL_ENABLED=true
export MLRUNX_SPOOL_DIR=~/.mlrunx/spool
export MLRUNX_SPOOL_MAX_SIZE=100000000
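The spool pattern behind these settings can be sketched as follows: when the server is unreachable, events are appended to a local file; once connectivity returns, they are replayed and the file is cleared. This is a minimal illustration of the technique, assuming a JSONL layout; the function names, file name, and format are hypothetical, not the SDK's actual on-disk schema:

```python
import json
from pathlib import Path

def spool_event(spool_dir, event):
    # Append an event to an on-disk spool file (illustrative format).
    spool = Path(spool_dir).expanduser()
    spool.mkdir(parents=True, exist_ok=True)
    with open(spool / "events.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")

def replay_spool(spool_dir, send_fn):
    # Re-send spooled events once the server is reachable again, then clear the spool.
    path = Path(spool_dir).expanduser() / "events.jsonl"
    if not path.exists():
        return 0
    events = [json.loads(line) for line in path.read_text().splitlines() if line]
    for event in events:
        send_fn(event)
    path.unlink()
    return len(events)
```

Replaying in file order preserves the original event ordering, which matters for step-indexed metrics.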

Configuration

Common runtime variables used by the server and SDK.

Variable                   Default                  Description
MLRUNX_SERVER_URL          http://localhost:3001    SDK target API URL
MLRUNX_API_KEY             None                     Auth key for SDK requests
MLRUNX_BATCH_SIZE          1000                     Max events per flush batch
MLRUNX_BATCH_TIMEOUT_MS    1000                     Max queue age before flush
MLRUNX_COALESCE_METRICS    true                     Keep latest metric per step
MLRUNX_SPOOL_ENABLED       true                     Enable offline disk spool
MLRUNX_OFFLINE             false                    Force offline-only mode
API_HTTP_PORT              3001                     Rust API HTTP port
API_GRPC_PORT              50051                    Rust API gRPC port
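The effect of MLRUNX_COALESCE_METRICS ("keep latest metric per step") can be illustrated with a small dedup pass. This is a sketch of the coalescing idea only, not the server's actual implementation; the `coalesce_metrics` function and its event shape are assumptions for illustration:

```python
def coalesce_metrics(events):
    """Keep only the latest value per (metric name, step).

    `events` is an ordered list of {"name", "step", "value"} dicts;
    later entries overwrite earlier ones for the same key.
    """
    latest = {}
    for e in events:
        latest[(e["name"], e["step"])] = e
    return list(latest.values())
```

Coalescing bounds storage when a training loop logs the same metric repeatedly at one step (e.g. retries or gradient-accumulation substeps).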

Architecture

The current architecture is monolith-first: a single API server with clear internal boundaries, so individual services can be extracted later without a rewrite.

Python SDK  -->  Rust API (Axum + Tonic)  -->  SQLite (v0.1 default)
                    |                             |
                    |                             +--> API keys, share tokens, run metadata
                    +--> Next.js UI (TypeScript)       + metrics/events/params

Scale-out path (scaffolded in repo)

SDK --> Ingest Service --> ClickHouse (metrics)
UI  <-> API Gateway    <-> PostgreSQL (metadata)
Processor --> MinIO (artifacts)

Project Layout

MLRunX/
├── apps/
│   ├── api/                  # Rust API gateway
│   └── ui/                   # Next.js dashboard
├── sdks/
│   ├── python/               # Python SDK
│   └── integrations/         # Framework hooks
├── services/
│   ├── ingest/               # Scaffolded ingest service
│   └── processor/            # Scaffolded rollup processor
├── crates/proto/             # Shared protobuf contracts
├── infra/docker/             # Compose stack and local infra
├── docs/                     # Architecture/specs/ops docs
└── bench/                    # Benchmarks and thresholds