Model Stack

Current model split for the frontend-first MUTA flow and its Surfer H execution path

For a visual overview of how the models interact with the VNC-based GUI automation loop, see: Workflow Diagram

Requirement

MUTA inherits the same core requirement as the earlier Autonomous UAT Agent work: the solution must use open-source models from European companies for the target architecture.

Current Documentation Scope

This page documents the model split used by the current frontend-first MUTA flow described in this documentation section.

The current standard path starts from the shared frontend and executes Surfer H on the runner in the background.

Current Standard Split

Thinking / planning: Ministral
Grounding / coordinates: Holo-oriented grounding endpoint on the A40 host

The run loop uses one model path to decide what to do next and another path to translate UI intent into pixel-accurate coordinates on the current screenshot. This split remains essential for reliable GUI automation because planning and grounding are different problems and benefit from different model capabilities.

Why split models?

Reasoning models optimize planning and textual decision making
Vision/grounding models optimize stable coordinate output
Separation reduces “coordinate hallucinations” and makes debugging easier

Current state in repo

Some scripts and docs still reference historical Claude and Pixtral experiments.
Some newer shared-frontend and system documents still mention Pixtral on the A40 host.
For this documentation section, the active MUTA narrative follows the Surfer-H-oriented split documented in the Surfer H notes: Ministral for thinking and a Holo-oriented grounding endpoint for UI grounding.
Older Claude- or Pixtral-based references should therefore be read as historical, experimental, or belonging to adjacent documentation tracks unless they explicitly state otherwise.

Current Configuration For This Documentation Track

Thinking model: Ministral 3 8B (Instruct)

HuggingFace model card: https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
Runs on OTC (Open Telekom Cloud) ECS: ecs_ministral_L4 (public IP: 164.30.28.242)
- Flavor: GPU-accelerated | 16 vCPUs | 64 GiB | pi5e.4xlarge.4
- GPU: 1 × NVIDIA Tesla L4 (24 GiB)
- Image: Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074 (Public image)
Deployment: vLLM OpenAI-compatible endpoint (chat completions)
- Endpoint env var: vLLM_THINKING_ENDPOINT
- Current server (deployment reference): http://164.30.28.242:8001/v1

Operational note: vLLM is configured to auto-start on server boot (OTC ECS restart) via systemd.

Key serving settings (vLLM):

--gpu-memory-utilization 0.90
--max-model-len 32768
--host 0.0.0.0
--port 8001

Key client settings (historically script-driven, now runner/frontend-driven):

model: /home/ubuntu/ministral-vllm/models/ministral-3-8b
temperature: 0.0

Grounding model: Holo 1.5-7B

HuggingFace model card: https://huggingface.co/holo-1.5-7b
Runs on OTC (Open Telekom Cloud) ECS: ecs_holo_A40 (public IP: 164.30.22.166)
- Flavor: GPU-accelerated | 48 vCPUs | 384 GiB | g7.12xlarge.8
- GPU: 1 × NVIDIA A40 (48 GiB)
- Image: Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074 (Public image)
Deployment: vLLM OpenAI-compatible endpoint (multimodal grounding)
- Endpoint env var: vLLM_VISION_ENDPOINT
- Current server (deployment reference): http://164.30.22.166:8000/v1

Key client settings (grounding / coordinate space):

model: holo-1.5-7b
Native coordinate space: 3840×2160 (4K)
Client grounding dimensions:
- grounding_width: 3840
- grounding_height: 2160

Notes

The shared frontend remains the primary user-facing entry point; users do not need to select models directly.
Model and endpoint details matter mainly for operations, debugging, and architecture discussions.
If another documentation track describes a different A40 model assignment, treat that as a parallel or older reference and reconcile it explicitly before presenting it as the current MUTA standard.