Multi-Agent Governance: Why AI Teams Need Rules, Not Just Prompts
Markus Engineering Team
1. The Anarchy Problem
Ask any architect who has run more than three AI agents simultaneously, and they will describe the same scenario: one poorly constructed prompt causes an agent to push a breaking change, which cascades into another agent’s workspace, corrupting a third agent’s task branch. Within minutes the system is unrecoverable. The root cause is not bad prompts — it is the absence of a governance layer.
Current platforms treat autonomy as a binary switch. Agents are either fully constrained (chatbots) or fully unleashed (shell access). There is no graduated trust system, no safety net that catches mistakes before they become catastrophes. This approach is unsustainable beyond proof-of-concept scale.
When organizations scale from one agent to fifty, failure modes compound. Review cycles collapse without structured workflows. Workspaces collide without isolation boundaries. Dangerous git operations run without oversight. Stale tasks pile up silently.
The fundamental insight: prompts are guidance, not enforcement. An agent can ignore a prompt instruction as easily as a human ignores a sign on the wall. What scales is mechanical enforcement — rules at the platform level that cannot be bypassed, overridden, or forgotten.
Markus’s governance framework encodes enforcement into tools, task lifecycle, and permission systems. Here is how each component works.
2. Progressive Trust Levels
In Markus, agents earn autonomy through demonstrated reliability, not by configuration default. Every agent starts with minimal permissions and graduates to higher tiers based on two data-driven metrics: trust score (a weighted measure of delivery quality) and delivery count (the volume of completed, accepted work).
| Level | Condition | Permissions |
|---|---|---|
| probation | New Agent or score < 40 | All tasks require human approval |
| standard | score >= 40, >= 5 deliveries | Routine tasks auto-approved |
| trusted | score >= 60, >= 15 deliveries | Higher autonomy, can review others |
| senior | score >= 80, >= 25 deliveries | Highest autonomy, key reviewer |
Source: ARCHITECTURE.md §3.1
Trust score adjusts dynamically: deliveries accepted on first review increase the score; revisions or rejections decrease it. This creates a self-regulating system — agents producing quality work gain speed, while declining quality progressively constrains permissions before a catastrophic failure occurs.
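As a minimal sketch of how the thresholds in the table could map to a level, the function below checks tiers from highest to lowest. The names `AgentRecord`, `TIERS`, and `resolveTrustLevel` are hypothetical and not taken from the Markus codebase. The fallback to probation for an agent that meets no tier (for example, score 50 with 3 deliveries) is an assumption, since the table does not cover that case explicitly.

```typescript
type TrustLevel = "probation" | "standard" | "trusted" | "senior";

// Hypothetical record shape; field names are assumptions for illustration.
interface AgentRecord {
  trustScore: number;    // weighted measure of delivery quality
  deliveryCount: number; // completed, accepted deliveries
}

// Thresholds copied from the table, ordered from highest tier to lowest.
const TIERS: Array<{ level: TrustLevel; minScore: number; minDeliveries: number }> = [
  { level: "senior", minScore: 80, minDeliveries: 25 },
  { level: "trusted", minScore: 60, minDeliveries: 15 },
  { level: "standard", minScore: 40, minDeliveries: 5 },
];

function resolveTrustLevel(agent: AgentRecord): TrustLevel {
  for (const tier of TIERS) {
    // Both conditions must hold: score alone does not unlock a tier.
    if (agent.trustScore >= tier.minScore && agent.deliveryCount >= tier.minDeliveries) {
      return tier.level;
    }
  }
  // Assumption: an agent that qualifies for no tier stays on probation.
  return "probation";
}
```

Because both conditions must hold, an agent with a high score but few deliveries lands in the highest tier whose delivery count it also satisfies.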
Static role assignment — where an admin configures permissions once and never changes them — breaks at scale. It does not adapt to agent performance, creates administrative overhead, and misses gradual quality degradation. By the time a human notices an agent’s quality has slipped, the damage is already done. Progressive Trust solves all three problems by making autonomy a data-driven function of actual output, adjusting in real time as delivery patterns change.
The trust level is also visible to every agent through the ContextEngine’s system prompt assembly. Each agent knows its current level and understands what permissions that level grants. This transparency creates accountability — agents understand that their autonomy is earned and tied directly to the quality of their work.
3. Task Approval Hierarchy
Trust levels determine which agent can do what. A separate dimension — the approval hierarchy — determines what kind of task needs approval.
| Approval Tier | Trigger | Approver |
|---|---|---|
| auto | Low-priority agent-created tasks | No approval (starts in_progress) |
| manager | Standard agent-created tasks | Team Manager Agent |
| human | High/urgent priority, shared-resource impact | Human (HITL) |
Source: ARCHITECTURE.md §3.6
These two systems interact dynamically. A senior agent’s manager-level tasks may auto-approve, while a probation agent’s low-priority tasks still require human approval. The same task type requires different approval for different agents, and those requirements change automatically as agents prove themselves.
Human-created tasks always start as pending regardless of tier. A human explicitly starts execution from the UI via a “Start Execution” button, retaining direct control over work initiation.
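One way to picture the interaction between trust level and approval tier is a two-step lookup: derive a base tier from the task, then adjust it for the agent's level. The sketch below is an illustration only. The type and function names, the `medium` priority value, and the exact adjustment rules are assumptions; the text above confirms only the two examples (probation always needs a human, a senior agent's manager-level tasks may auto-approve) and the rule that human-created tasks start as pending.

```typescript
type TrustLevel = "probation" | "standard" | "trusted" | "senior";
type ApprovalTier = "auto" | "manager" | "human";
type Priority = "low" | "medium" | "high" | "urgent"; // "medium" is an assumed name

interface TaskRequest {
  createdBy: "agent" | "human";
  priority: Priority;
  touchesSharedResources: boolean;
}

// Base tier from the task alone, following the approval table.
function baseTier(task: TaskRequest): ApprovalTier {
  if (task.priority === "high" || task.priority === "urgent" || task.touchesSharedResources) {
    return "human";
  }
  return task.priority === "low" ? "auto" : "manager";
}

// Adjust the base tier for the requesting agent's trust level.
function effectiveTier(task: TaskRequest, level: TrustLevel): ApprovalTier {
  const tier = baseTier(task);
  if (level === "probation") return "human";                   // all tasks need a human
  if (level === "senior" && tier === "manager") return "auto"; // assumed promotion rule
  return tier;
}

// Human-created tasks always start as pending; only auto-tier agent tasks start running.
function initialState(task: TaskRequest, level: TrustLevel): "pending" | "in_progress" {
  if (task.createdBy === "human") return "pending";
  return effectiveTier(task, level) === "auto" ? "in_progress" : "pending";
}
```

Keeping the base tier and the trust adjustment in separate functions makes it straightforward to change one policy without touching the other.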
4. Quality Gates: The Formal Delivery Pipeline
Trust and approval handle who can do what. Quality gates handle how work gets done and accepted.
    Agent completes work
      -> task_submit_review (summary, branch, test results)
      -> Quality gates (TypeScript build, ESLint, Vitest)
      -> Merge conflict pre-check (dry-run merge)
      -> Task state -> review
      -> Reviewer accept / request revision
           -> accept -> merge branch -> completed
           -> revision -> Agent reworks -> resubmit
Source: ARCHITECTURE.md §4.3
Submission. The agent calls task_submit_review with a summary, branch reference, and test results. This is the only path to completion — there is no way to bypass review.
Automated Gates. Three checks run before any reviewer sees the submission:
- TypeScript build — project compiles without errors
- ESLint — code style and common errors
- Vitest — test suite passes
Failed gates block submission and notify the agent to fix issues.
Merge Conflict Pre-check. A dry-run merge detects conflicts before human time is invested.
Review. The reviewer accepts (merge + complete) or requests revision (return to in_progress). Every action — submission, gate result, review decision, merge — is recorded in the audit log with agent ID, task ID, event type, and metadata.
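The submission path described above can be sketched as a loop that runs each gate in order, records every result, and stops at the first failure. This is a simplified illustration, not the Markus implementation: `Gate`, `AuditEvent`, `submitForReview`, and the event type strings are hypothetical, and the gates are injected as plain functions rather than real build, lint, or test runs.

```typescript
type TaskState = "in_progress" | "review";

// A gate is any check that reports pass/fail (build, lint, tests, dry-run merge).
interface Gate {
  name: string;
  run: () => boolean; // true when the check passes
}

// Audit record with the four fields the article lists.
interface AuditEvent {
  agentId: string;
  taskId: string;
  eventType: string; // event names here are assumptions
  metadata: Record<string, unknown>;
}

function submitForReview(
  agentId: string,
  taskId: string,
  gates: Gate[],
  auditLog: AuditEvent[],
): TaskState {
  auditLog.push({ agentId, taskId, eventType: "submission", metadata: {} });
  for (const gate of gates) {
    const passed = gate.run();
    auditLog.push({
      agentId,
      taskId,
      eventType: "gate_result",
      metadata: { gate: gate.name, passed },
    });
    // A failed gate blocks the submission; later gates are not run.
    if (!passed) return "in_progress";
  }
  // All gates passed: the task moves to review and waits for a reviewer.
  return "review";
}
```

Stopping at the first failure keeps feedback fast and avoids spending test time on code that does not compile.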
5. Git Command Governance
Git operations are among the highest-risk actions agents perform. Markus uses a three-tier model:
| Tier | Operations | Behavior |
|---|---|---|
| Allow | add, commit, fetch, log, diff, status, branch, checkout -b, switch -c, worktree add/list/remove, push origin <task-branch> | Execute immediately |
| Approval | checkout <existing-branch>, switch <existing-branch>, push ... main/master, merge, rebase | Pause, request HITL approval |
| Deny | push --force/-f | Always blocked |
Source: ARCHITECTURE.md §4.2
When an agent attempts an Approval-tier operation, execution pauses and a HITL service request is created with full context — agent, repository, operation, branches. A human approves or rejects with a comment.
push --force is permanently denied. Force pushing rewrites remote history, destroys the audit trail, and invalidates every collaborator’s local copy of the branch. There is no legitimate use case for a non-human force push. The correct workflow is a rebase (which requires approval) or a new branch.
New dangerous operations can be added via SecurityPolicy.requireApproval (config-driven) or new pattern arrays in shell.ts (code-driven) — no agent code changes needed.
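To make the three tiers concrete, here is a rough classifier over an already-tokenized git argument list. It is a sketch under several assumptions and does not reflect how shell.ts or SecurityPolicy.requireApproval actually work: the function name `classifyGit` is hypothetical, any flag starting with `--force` is treated as a force push, and any command not listed in the table defaults to the approval tier.

```typescript
type GitTier = "allow" | "approval" | "deny";

const PROTECTED_BRANCHES = new Set(["main", "master"]);
const ALLOWED_SUBCOMMANDS = new Set(["add", "commit", "fetch", "log", "diff", "status", "branch"]);
const ALLOWED_WORKTREE = new Set(["add", "list", "remove"]);

// Extract the destination branch from a push argument such as "main" or "HEAD:main".
function pushTarget(arg: string): string {
  const dest = arg.split(":").pop() ?? arg;
  return dest.replace(/^refs\/heads\//, "");
}

function classifyGit(args: string[]): GitTier {
  const sub = args[0] ?? "";
  const rest = args.slice(1);

  if (sub === "push") {
    // Assumption: every --force* variant counts as a force push.
    if (rest.some((a) => a === "-f" || a.startsWith("--force"))) return "deny";
    if (rest.some((a) => PROTECTED_BRANCHES.has(pushTarget(a)))) return "approval";
    return "allow";
  }
  // Creating a new branch is allowed; moving to an existing one needs approval.
  if (sub === "checkout") return rest.includes("-b") ? "allow" : "approval";
  if (sub === "switch") return rest.includes("-c") ? "allow" : "approval";
  if (sub === "merge" || sub === "rebase") return "approval";
  if (sub === "worktree") return ALLOWED_WORKTREE.has(rest[0] ?? "") ? "allow" : "approval";
  if (ALLOWED_SUBCOMMANDS.has(sub)) return "allow";

  // Assumption: anything not listed is paused for human approval rather than run.
  return "approval";
}
```

The sketch is deliberately not exhaustive. It does not handle combined short flags, `+refspec` force syntax, or git aliases, all of which a production policy would need to consider.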
6. Global Controls
Even with trust levels, gates, and Git governance, situations demand immediate intervention: a runaway agent consuming API credits, a security vulnerability, or a configuration error affecting all agents.
| Function | Description |
|---|---|
| pauseAllAgents(reason) | Pause all Agents with reason. Cancels active LLM streams, stops attention, requeues items. |
| resumeAllAgents() | Resume all Agents. Attention loops restart, deferred items resurface. |
| emergencyStop() | Cancel all active streams and stop all Agents |
| System announcements | Broadcast to all Agents and UI, injected into system prompts |
Source: ARCHITECTURE.md §4.1
Pause vs. Emergency Stop. Pause is graceful — streams cancel, items requeue, state is preserved. Resume continues where work left off. Emergency stop is destructive — in-flight work is lost, agents must be explicitly restarted. Use pause for most situations; reserve emergency stop for security breaches or data corruption.
Pause state persists across restarts at three levels: individual agent (DB paused status survives service restart), team (each member’s pause stored individually), and global (all agents paused; on restart, if all were paused, UI shows “Resume”).
System announcements are injected into every agent’s system prompt on each interaction. Agents cannot miss critical information — security advisories, policy changes, or priority shifts appear within one interaction cycle.
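The difference between pause and emergency stop can be illustrated with a small in-memory model. The method names match the table above, but everything else is an assumption made for this sketch: the `AgentRuntime` shape, the `GlobalControls` class, and the idea of representing a cancelled stream as an item moved back to the front of a queue.

```typescript
type AgentStatus = "running" | "paused" | "stopped";

// Hypothetical runtime view of one agent.
interface AgentRuntime {
  id: string;
  status: AgentStatus;
  inFlight: string | null; // item currently being processed, if any
  queue: string[];         // items waiting to be processed
}

class GlobalControls {
  pauseReason: string | null = null;
  private readonly agents: AgentRuntime[];

  constructor(agents: AgentRuntime[]) {
    this.agents = agents;
  }

  // Graceful: in-flight work is requeued so nothing is lost.
  pauseAllAgents(reason: string): void {
    this.pauseReason = reason;
    for (const agent of this.agents) {
      if (agent.status !== "running") continue;
      if (agent.inFlight !== null) {
        agent.queue.unshift(agent.inFlight);
        agent.inFlight = null;
      }
      agent.status = "paused";
    }
  }

  // Only paused agents resume; stopped agents need an explicit restart.
  resumeAllAgents(): void {
    this.pauseReason = null;
    for (const agent of this.agents) {
      if (agent.status === "paused") agent.status = "running";
    }
  }

  // Destructive: in-flight work is dropped, not requeued.
  emergencyStop(): void {
    for (const agent of this.agents) {
      agent.inFlight = null;
      agent.status = "stopped";
    }
  }
}
```

In this model, the only behavioral difference between the two operations is whether the in-flight item survives, which is the distinction that matters when choosing between them.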
7. Stall Detection
Multi-agent systems have a failure mode single agents do not: silent stalls. An agent gets stuck, hits an unresolvable dependency, or stops responding, and no one notices for hours. Markus’s stall detection engine monitors the task board automatically:
| Condition | Threshold | Action |
|---|---|---|
| Task in_progress too long | > 24h or 2x avg completion time | Warn Agent -> report to Manager |
| Task review unhandled | > 12h | Report to human |
| Task assigned not started | > 4h | Remind Agent -> reassign |
Source: ARCHITECTURE.md §4.6
In-progress stalls (>24h): A warning is sent to the agent with context about the stalled task. If the agent does not respond or resolve the block within a reasonable window, the issue is escalated to the team manager. This mechanism is critical because agents can enter infinite loops or hit unresolvable dependency chains — without stall detection, they would consume API credits while producing no output.
Review stalls (>12h): Completed work stuck in review represents a blocking bottleneck — downstream dependent tasks cannot start. Because stalled reviews are typically a human bottleneck rather than an agent failure, the system reports directly to a human operator rather than to another agent for resolution.
Assignment stalls (>4h): Tasks not started within 4 hours trigger a reminder. Continued inaction triggers automatic reassignment. This prevents neglected tasks from falling through the cracks.
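The three thresholds can be expressed as a single check over a task's state and the time it has spent there. The sketch below returns only the first action for each condition; the follow-up escalations (report to manager, reassign) are left out. `BoardTask`, `detectStall`, and the action names are hypothetical, and reading "> 24h or 2x avg completion time" as "either limit exceeded" is an assumption.

```typescript
type StallableState = "assigned" | "in_progress" | "review";
type StallAction = "remind_agent" | "warn_agent" | "report_to_human";

// Hypothetical task shape; timestamps are epoch milliseconds.
interface BoardTask {
  id: string;
  state: StallableState;
  stateEnteredAt: number;
}

const HOUR_MS = 60 * 60 * 1000;

// Returns the first-step action for a stalled task, or null if it is not stalled.
function detectStall(task: BoardTask, nowMs: number, avgCompletionMs: number): StallAction | null {
  const age = nowMs - task.stateEnteredAt;

  if (task.state === "assigned") {
    return age > 4 * HOUR_MS ? "remind_agent" : null;
  }
  if (task.state === "in_progress") {
    // Assumption: exceeding either limit counts as a stall.
    const stalled = age > 24 * HOUR_MS || age > 2 * avgCompletionMs;
    return stalled ? "warn_agent" : null;
  }
  // Remaining state is "review": stalled reviews go straight to a human.
  return age > 12 * HOUR_MS ? "report_to_human" : null;
}
```

Passing the current time in as a parameter, rather than reading the clock inside the function, keeps the check deterministic and easy to test.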
In a single-agent system, the human operator is the stall detector. In a multi-agent system with persistent background agents, no one watches every agent. Automated stall detection prevents dead tasks from accumulating silently and blocking downstream work.
8. Governance as Infrastructure
Progressive trust, task approval, quality gates, Git governance, global controls, and stall detection are more than a set of product features. Together they form a new layer of infrastructure for AI teams, comparable to what CI/CD brought to software development.
Before CI/CD, deployment was manual and high-risk. Teams relied on checklists and careful humans to avoid breaking production. CI/CD automated these checks, making deployment routine, safe, and repeatable. Governance frameworks do the same for multi-agent AI systems — they transform agent coordination from a manual oversight burden into an automated, self-regulating process that operates at machine speed.
Three Layers of Enforcement
Markus’s governance model operates at three distinct layers, each serving a different purpose:
- SHARED.md (static norms). A shared document all agents receive at startup, defining workflow maps, task governance, workspace discipline, formal delivery procedures, knowledge management, trust mechanics, and Git commit norms. This is the “constitution” that every agent understands.
- ContextEngine (dynamic injection). Per-interaction context injection including current project context, workspace info, system announcements, human feedback, and trust level. This ensures agents always have the latest situational awareness.
- Tools (mechanical enforcement). Platform-level tool behaviors that cannot be bypassed: task_create blocks until approved, task_submit_review replaces direct completion, file writes to other agents’ directories are blocked, and git commits auto-inject metadata for traceability.
Source: ARCHITECTURE.md §10
This three-layer design ensures that governance is understood at the principle level, contextualized at the interaction level, and enforced at the mechanical level. No single layer is sufficient alone — norms without enforcement are suggestions, and enforcement without understanding creates brittle constraints.
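As one example of what mechanical enforcement at the tool level might look like, the sketch below shows a write guard that rejects paths outside an agent's own directory. It is purely illustrative: the `/workspaces/<agentId>` layout, the function names, and the POSIX-style absolute paths are all assumptions, and a real guard would also need to account for symlinks and validate the agent ID itself.

```typescript
// Resolve "." and ".." segments of a POSIX-style absolute path into a segment list.
function normalizeSegments(path: string): string[] {
  const out: string[] = [];
  for (const segment of path.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") {
      out.pop(); // step up one level; popping an empty list stays at the root
      continue;
    }
    out.push(segment);
  }
  return out;
}

// True only when the target path resolves to a location inside the agent's own workspace.
// The workspace root and directory layout are hypothetical.
function canWrite(agentId: string, targetPath: string, workspaceRoot = "/workspaces"): boolean {
  const own = normalizeSegments(`${workspaceRoot}/${agentId}`);
  const resolved = normalizeSegments(targetPath);
  // Compare whole segments so that "a1" does not match "a10".
  return own.every((segment, i) => resolved[i] === segment);
}
```

Because the check runs inside the tool rather than in the prompt, an agent cannot opt out of it, which is the property that separates this layer from the two above it.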
The Landscape Gap
Current platforms reveal a striking gap:
- LangChain Agents: No governance layer. Ephemeral agents with no persistent identity, no trust system, no approval workflow.
- CrewAI: Role definitions and sequential task chaining, but no formal delivery review, no trust levels, no Git governance, no global controls. Roles are static class attributes, not dynamic data-driven systems.
- AutoGen: Conversational agent patterns with no workspace isolation, no stall detection, no audit trail.
None address the fundamental question: how do you run 50 agents in production without a human watching every one? Markus’s governance framework is built from the ground up to answer this question.
9. Get Started
Effective governance is not a constraint — it is the foundation that allows AI teams to scale. Without it, every new agent adds coordination overhead faster than productive capacity. With it, each agent increases both speed and safety.
Markus is open source and available on GitHub. The complete governance framework works out of the box.
    npm install -g @markus-global/cli
    markus start
Open http://localhost:8056 and hire your first agents.
Markus Engineering Team — May 2026