Model Stack
For a visual overview of how the models interact with the VNC-based GUI automation loop, see: Workflow Diagram
Requirement
MUTA inherits the same core requirement as the earlier Autonomous UAT Agent work: the solution must use open-source models from European companies for the target architecture.
Current Documentation Scope
This page documents the model split used by the current frontend-first MUTA flow described in this documentation section.
The current standard path starts from the shared frontend and executes Surfer H on the runner in the background.
Current Standard Split
- Thinking / planning: Ministral
- Grounding / coordinates: Holo-oriented grounding endpoint on the A40 host
The run loop uses one model path to decide what to do next and another path to translate UI intent into pixel-accurate coordinates on the current screenshot. This split remains essential for reliable GUI automation because planning and grounding are different problems and benefit from different model capabilities.
Why split models?
- Reasoning models optimize planning and textual decision making
- Vision/grounding models optimize stable coordinate output
- Separation reduces “coordinate hallucinations” and makes debugging easier
Current state in repo
- Some scripts and docs still reference historical Claude and Pixtral experiments.
- Some newer shared-frontend and system documents still mention Pixtral on the A40 host.
- For this documentation section, the active MUTA narrative follows the Surfer-H-oriented split documented in the Surfer H notes: Ministral for thinking and a Holo-oriented grounding endpoint for UI grounding.
- Older Claude- or Pixtral-based references should therefore be read as historical, experimental, or belonging to adjacent documentation tracks unless they explicitly state otherwise.
Current Configuration For This Documentation Track
Thinking model: Ministral 3 8B (Instruct)
- HuggingFace model card: https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
- Runs on OTC (Open Telekom Cloud) ECS:
ecs_ministral_L4(public IP:164.30.28.242)- Flavor: GPU-accelerated | 16 vCPUs | 64 GiB |
pi5e.4xlarge.4 - GPU: 1 × NVIDIA Tesla L4 (24 GiB)
- Image:
Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074(Public image)
- Flavor: GPU-accelerated | 16 vCPUs | 64 GiB |
- Deployment: vLLM OpenAI-compatible endpoint (chat completions)
- Endpoint env var:
vLLM_THINKING_ENDPOINT - Current server (deployment reference):
http://164.30.28.242:8001/v1
- Endpoint env var:
Operational note: vLLM is configured to auto-start on server boot (OTC ECS restart) via systemd.
Key serving settings (vLLM):
--gpu-memory-utilization 0.90--max-model-len 32768--host 0.0.0.0--port 8001
Key client settings (historically script-driven, now runner/frontend-driven):
model:/home/ubuntu/ministral-vllm/models/ministral-3-8btemperature:0.0
Grounding model: Holo 1.5-7B
- HuggingFace model card: https://huggingface.co/holo-1.5-7b
- Runs on OTC (Open Telekom Cloud) ECS:
ecs_holo_A40(public IP:164.30.22.166)- Flavor: GPU-accelerated | 48 vCPUs | 384 GiB |
g7.12xlarge.8 - GPU: 1 × NVIDIA A40 (48 GiB)
- Image:
Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074(Public image)
- Flavor: GPU-accelerated | 48 vCPUs | 384 GiB |
- Deployment: vLLM OpenAI-compatible endpoint (multimodal grounding)
- Endpoint env var:
vLLM_VISION_ENDPOINT - Current server (deployment reference):
http://164.30.22.166:8000/v1
- Endpoint env var:
Key client settings (grounding / coordinate space):
model:holo-1.5-7b- Native coordinate space:
3840×2160(4K) - Client grounding dimensions:
grounding_width:3840grounding_height:2160
Notes
- The shared frontend remains the primary user-facing entry point; users do not need to select models directly.
- Model and endpoint details matter mainly for operations, debugging, and architecture discussions.
- If another documentation track describes a different A40 model assignment, treat that as a parallel or older reference and reconcile it explicitly before presenting it as the current MUTA standard.