Architecture

How CosmicAC's components connect to your Kubernetes cluster and run each job type.

CosmicAC is a self-hosted platform that runs GPU workloads on your own Kubernetes cluster. This page explains the components involved, how they connect to your cluster, and how each job type runs. For deployment steps, see Installation.

Deployment architecture

Setting up your cluster is separate from deploying CosmicAC. You bring a Kubernetes cluster that already has its GPU nodes, KubeVirt, and PCI passthrough configured. The CosmicAC components then connect to that cluster and run your workloads on it.

wrk-server-k8s-nvidia connects to your cluster's Kubernetes API using a kubeconfig you provide, and creates the resources each job needs. Jobs run as KubeVirt Virtual Machine Instances (VMIs) on your GPU nodes, each claiming one or more whole GPUs through passthrough.

CosmicAC documents the cluster requirements, not the steps to build the cluster. See Installation for those requirements.

CosmicAC components

These components make up the CosmicAC platform. Most run outside your GPU cluster; the per-job agents run inside the job's VMI:

app-ui — web interface. A browser dashboard for managing jobs.
cosmicac-cli — command-line interface. Submits jobs, manages resources, and connects to containers from your terminal.
app-node — application server. Serves the HTTP API, authenticates users, and routes commands to the orchestrator.
wrk-ork — orchestrator. Allocates resources, distributes jobs across the cluster, and routes requests to the workers.
wrk-server-k8s-nvidia — Kubernetes server worker. Connects to your cluster's Kubernetes API with a kubeconfig you provide, and provisions the GPU VMs.
proxy-inference — inference proxy. Authenticates Managed Inference requests, balances load, and routes them to model servers.
wrk-agent-instance — GPU Container agent. Runs inside a GPU Container job's VMI and accepts shell sessions over hyperswarm-ssh.
wrk-agent-inference — Managed Inference agent. Runs inside a Managed Inference job's VMI, serves the model with vLLM, and registers itself in the DHT table.

Holepunch stack

Inside of CosmicAC, the components connect to each other over the Holepunch peer-to-peer (p2p) stack rather than through a central server. Components address each other directly, so there is no central broker to route, bottleneck, or expose internal traffic:

Hyperswarm — peer-to-peer networking. Components find and connect to each other directly, without a central broker.
HRPC — Hyperswarm RPC. Carries internal calls between app-node, wrk-ork, and the workers.
hyperswarm-ssh — SSH over Hyperswarm. Lets cosmicac-cli shell directly into a running GPU Container job.
DHT table — distributed hash table. Managed Inference model servers register here, and proxy-inference discovers them by topic.
HyperDB + Autobase — distributed database. Stores usage metrics and job metadata.

GPU Container architecture

A GPU Container job runs your workload inside a KubeVirt VMI with a dedicated GPU and full shell access.

How a job starts. When you submit a job from app-ui or cosmicac-cli, it travels through the platform to your cluster:

app-node authenticates the request and forwards it to wrk-ork.
wrk-ork routes the job to wrk-server-k8s-nvidia.
wrk-server-k8s-nvidia instructs the Kubernetes control plane to schedule the workload.
Kubernetes creates a pod containing a VMI, with wrk-agent-instance running inside it.

How a shell connects. Once the VMI is running, cosmicac-cli connects directly to wrk-agent-instance over hyperswarm-ssh. Your commands reach the VMI over the Holepunch p2p stack rather than through app-node, so the interactive session does not depend on the control path that submitted the job.

Managed Inference architecture

A Managed Inference job runs an open-source language model with vLLM inside a VMI, and exposes it through proxy-inference as an OpenAI-compatible endpoint, which authenticates requests and balances load.

How the job starts. When you create a Managed Inference job from app-ui, the request flows through app-node and wrk-ork to wrk-server-k8s-nvidia, which schedules a pod with a VMI running wrk-agent-inference (vLLM). On spin-up, wrk-agent-inference registers itself in the DHT table so the proxy can find it.

How a request is served. Serving traffic follows a separate path from job creation:

A client sends a request to the inference endpoint over the OpenAI-compatible API.
proxy-inference authenticates the request, searches the DHT table by topic to discover a model server, and balances load across the running servers.
wrk-agent-inference runs the request with vLLM and returns the response.

Adding more model servers adds capacity without changing the API your clients call.

Isolation and security

Your data stays on your infrastructure — CosmicAC runs entirely on your own infrastructure. Your data, models, and inference traffic never leave it.
VM-level isolation — each job runs in its own KubeVirt VMI inside a non-privileged pod, with Kubernetes security controls applied.
Secure GPU access — GPUs are exposed to the VMIs through device plugins, without privileged containers.

Deployment architecture

CosmicAC components

Holepunch stack

GPU Container architecture

Managed Inference architecture

Isolation and security

Next steps

GPU Container Job

Managed Inference

Install CosmicAC

On this page