Model Stack

Current model split for the frontend-first MUTA flow and its Surfer H execution path

For a visual overview of how the models interact with the VNC-based GUI automation loop, see: Workflow Diagram

Requirement

MUTA inherits the same core requirement as the earlier Autonomous UAT Agent work: the solution must use open-source models from European companies for the target architecture.

Current Documentation Scope

This page documents the model split used by the current frontend-first MUTA flow described in this documentation section.

The current standard path starts from the shared frontend and executes Surfer H on the runner in the background.

Current Standard Split

  • Thinking / planning: Ministral
  • Grounding / coordinates: Holo-oriented grounding endpoint on the A40 host

The run loop uses one model path to decide what to do next and another path to translate UI intent into pixel-accurate coordinates on the current screenshot. This split remains essential for reliable GUI automation because planning and grounding are different problems and benefit from different model capabilities.

Why split models?

  • Reasoning models optimize planning and textual decision making
  • Vision/grounding models optimize stable coordinate output
  • Separation reduces “coordinate hallucinations” and makes debugging easier

Current state in repo

  • Some scripts and docs still reference historical Claude and Pixtral experiments.
  • Some newer shared-frontend and system documents still mention Pixtral on the A40 host.
  • For this documentation section, the active MUTA narrative follows the Surfer-H-oriented split documented in the Surfer H notes: Ministral for thinking and a Holo-oriented grounding endpoint for UI grounding.
  • Older Claude- or Pixtral-based references should therefore be read as historical, experimental, or belonging to adjacent documentation tracks unless they explicitly state otherwise.

Current Configuration For This Documentation Track

Thinking model: Ministral 3 8B (Instruct)

  • HuggingFace model card: https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
  • Runs on OTC (Open Telekom Cloud) ECS: ecs_ministral_L4 (public IP: 164.30.28.242)
    • Flavor: GPU-accelerated | 16 vCPUs | 64 GiB | pi5e.4xlarge.4
    • GPU: 1 × NVIDIA Tesla L4 (24 GiB)
    • Image: Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074 (Public image)
  • Deployment: vLLM OpenAI-compatible endpoint (chat completions)
    • Endpoint env var: vLLM_THINKING_ENDPOINT
    • Current server (deployment reference): http://164.30.28.242:8001/v1

Operational note: vLLM is configured to auto-start on server boot (OTC ECS restart) via systemd.

Key serving settings (vLLM):

  • --gpu-memory-utilization 0.90
  • --max-model-len 32768
  • --host 0.0.0.0
  • --port 8001

Key client settings (historically script-driven, now runner/frontend-driven):

  • model: /home/ubuntu/ministral-vllm/models/ministral-3-8b
  • temperature: 0.0

Grounding model: Holo 1.5-7B

  • HuggingFace model card: https://huggingface.co/holo-1.5-7b
  • Runs on OTC (Open Telekom Cloud) ECS: ecs_holo_A40 (public IP: 164.30.22.166)
    • Flavor: GPU-accelerated | 48 vCPUs | 384 GiB | g7.12xlarge.8
    • GPU: 1 × NVIDIA A40 (48 GiB)
    • Image: Standard_Ubuntu_24.04_amd64_bios_GPU_GitLab_3074 (Public image)
  • Deployment: vLLM OpenAI-compatible endpoint (multimodal grounding)
    • Endpoint env var: vLLM_VISION_ENDPOINT
    • Current server (deployment reference): http://164.30.22.166:8000/v1

Key client settings (grounding / coordinate space):

  • model: holo-1.5-7b
  • Native coordinate space: 3840×2160 (4K)
  • Client grounding dimensions:
    • grounding_width: 3840
    • grounding_height: 2160

Notes

  • The shared frontend remains the primary user-facing entry point; users do not need to select models directly.
  • Model and endpoint details matter mainly for operations, debugging, and architecture discussions.
  • If another documentation track describes a different A40 model assignment, treat that as a parallel or older reference and reconcile it explicitly before presenting it as the current MUTA standard.