LLMs made generating code cheap. The real bottleneck is verification: checking that a change is correct, follows repo conventions, and is something you’d actually merge. The verification stack is a layered set of automated verifiers designed to catch different kinds of mistakes at different stages — so problems are caught early, cheaply, and without human intervention.

Architecture

The verification stack consists of two layers today, with an architecture designed to support more:

[Figure: Verification Stack Diagram]

  • Layer 1: Trajectory-Level Verifier (Critic Model). A small, fast critic model that scores the agent’s trajectory before code is pushed. If the score falls below a confidence threshold, the agent’s work is gated, preventing obviously broken or off-track changes from ever reaching a pull request. See Enabling Layer 1 below.
  • Layer 2: Repo-Level Verifier (ReviewBot). An automated code reviewer and QA agent that triggers on every pull request. It reviews the diff for correctness, security, and style, then optionally runs the software to verify behavior. See Enabling Layer 2 below.
Together, these layers form a pipeline: the critic prevents obviously broken work from being pushed, and the ReviewBot catches the subtler issues that require repository context.
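The layering described above can be pictured as an ordered list of verifiers where a failure in an early, cheap layer stops later layers from running. The following is a minimal sketch of that idea only. The names `Verdict`, `run_stack`, `critic_layer`, and `review_layer`, the dict-shaped `change`, and the 0.5 threshold are all hypothetical and are not part of the OpenHands API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    layer: str        # name of the layer that produced this verdict
    passed: bool      # True if the change may proceed to the next layer
    reason: str = ""  # short explanation when the layer fails


# A verifier maps a description of the change to a Verdict.
# The change is a plain dict here purely for illustration.
Verifier = Callable[[dict], Verdict]


def run_stack(change: dict, layers: List[Verifier]) -> List[Verdict]:
    """Run the layers in order and stop at the first failure."""
    verdicts = []
    for layer in layers:
        verdict = layer(change)
        verdicts.append(verdict)
        if not verdict.passed:
            break  # later layers never see the change
    return verdicts


def critic_layer(change: dict) -> Verdict:
    # Hypothetical threshold; the real value is not documented here.
    ok = change["critic_score"] >= 0.5
    return Verdict("critic", ok, "" if ok else "score below threshold")


def review_layer(change: dict) -> Verdict:
    ok = not change["review_findings"]
    return Verdict("review_bot", ok, "" if ok else "changes requested")
```

Because `run_stack` only depends on the `Verifier` signature, supporting an additional layer in this sketch amounts to appending another function to the list.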

How Effective Is It?

We’ve been running the verification stack on the OpenHands/software-agent-sdk repository for several months. Key findings:
  • Faster approvals — As ReviewBot adoption increased, time to first approval dropped significantly, with the largest gains on medium-to-large PRs.
  • Improving accuracy — Bot review precision and recall have improved consistently over time. Human reviewers are generally more precise, but the bot catches issues humans miss — the two are complementary.
  • Code quality maintained — Static analysis (radon, bandit, ruff) shows no degradation in cyclomatic complexity, security violations, or code smells for bot-reviewed PRs compared to human-only PRs.
  • Test coverage improving — Since the ReviewBot was introduced, test coverage across the repository has trended upward.
  • Review rounds decreasing — PRs with ReviewBot initially required more review rounds, but that gap has been closing as the skill improves.
For detailed metrics and methodology, see our blog post: The Verification Stack.
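For readers unfamiliar with the precision and recall terms used above, here is a small worked example using the standard definitions. The counts are invented for illustration and are not data from the study, and the blog post may define a "true positive" review comment differently.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Standard precision and recall from raw counts.

    precision = share of flagged issues that were real
    recall    = share of real issues that were flagged
    """
    flagged = true_positives + false_positives
    real = true_positives + false_negatives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / real if real else 0.0
    return precision, recall


# Invented counts: the reviewer raised 10 issues, 6 of them real,
# and missed 2 real issues.
precision, recall = precision_recall(true_positives=6, false_positives=4, false_negatives=2)
print(precision, recall)  # 0.6 0.75
```

A reviewer with higher precision raises fewer false alarms, while one with higher recall misses fewer real issues, which is why two reviewers with different profiles can complement each other.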

Enabling Layer 1: Trajectory-Level Verifier

The trajectory-level verifier integration is under active development. Configuration instructions will be added here once finalized.
The trajectory-level verifier uses a critic model to evaluate agent work before code is pushed. It is currently available through: The critic provides quality scores between 0.0 and 1.0, real-time feedback during agent execution, and automatic iterative refinement when it predicts incomplete work. For technical details on how the critic model works, see our paper: A Rubric-Supervised Critic from Sparse Real-World Outcomes.
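To make the gating and refinement behavior concrete, the sketch below shows one way a score in the 0.0 to 1.0 range could drive a refine-and-rescore loop. It is an illustration under stated assumptions, not the actual integration: `gate_with_refinement`, `score_fn`, `refine_fn`, the 0.5 threshold, and the round limit are all hypothetical.

```python
from typing import Callable, Tuple


def gate_with_refinement(
    attempt: str,
    score_fn: Callable[[str], float],          # stand-in for the critic model
    refine_fn: Callable[[str, float], str],    # stand-in for the agent revising its work
    threshold: float = 0.5,                    # assumed value, not documented
    max_rounds: int = 3,                       # assumed cap on refinement rounds
) -> Tuple[bool, str, float]:
    """Return (allowed_to_push, final_attempt, final_score)."""

    def checked_score(candidate: str) -> float:
        score = score_fn(candidate)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"critic score out of range: {score}")
        return score

    score = checked_score(attempt)
    rounds = 0
    # Refine while the critic predicts incomplete work and budget remains.
    while score < threshold and rounds < max_rounds:
        attempt = refine_fn(attempt, score)
        score = checked_score(attempt)
        rounds += 1
    return score >= threshold, attempt, score
```

In this sketch the push is blocked whenever the final score is still below the threshold after the refinement budget is spent.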

Enabling Layer 2: Repo-Level Verifier

The repo-level verifier consists of two components — a code review agent and a QA agent — both available as plugins in the OpenHands/extensions repository.

Option A: GitHub Actions

Add the ReviewBot directly to your GitHub Actions workflow. This runs the code review (and optionally QA) as part of your CI pipeline on every PR.
1. Create a bot account

Create a bot account under your GitHub organization (e.g., your-org-bot). This account will post review comments and approve/request changes on PRs. Grant it write access and the ability to approve PRs on your repository.
2. Configure secrets

In your repository’s Settings → Secrets and variables → Actions, add:
  • LLM_API_KEY: Your LLM API key
  • BOT_GITHUB_TOKEN: A GitHub token for the bot account (optional; set it if you want reviews posted by the bot account rather than under the default GITHUB_TOKEN)
3. Add the workflow

Create .github/workflows/review-bot.yml:
name: ReviewBot
on:
  pull_request:
    # Run when a PR is opened, leaves draft state, or receives new commits
    types: [opened, ready_for_review, synchronize]

jobs:
  review:
    runs-on: ubuntu-latest
    # Skip PRs that are still drafts
    if: github.event.pull_request.draft == false
    steps:
      - name: Run PR Review
        uses: OpenHands/extensions/plugins/pr-review@main
        with:
          llm-model: anthropic/claude-sonnet-4-5-20250929
          llm-api-key: ${{ secrets.LLM_API_KEY }}
          # Prefer the bot account's token; fall back to the default token
          github-token: ${{ secrets.BOT_GITHUB_TOKEN || secrets.GITHUB_TOKEN }}
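The trigger conditions in the workflow above (three `pull_request` activity types plus the draft check) can be read as a single predicate. The sketch below models that predicate over a parsed webhook payload as a reading aid; `should_review` and `REVIEW_ACTIONS` are hypothetical names, and this is not how GitHub itself evaluates workflow triggers.

```python
# Activity types listed under `on.pull_request.types` in the workflow.
REVIEW_ACTIONS = {"opened", "ready_for_review", "synchronize"}


def should_review(event: dict) -> bool:
    """Decide whether a pull_request event would lead to a review run.

    `event` is a parsed pull_request webhook payload; only the `action`
    field and `pull_request.draft` are inspected.
    """
    if event.get("action") not in REVIEW_ACTIONS:
        return False
    # Mirrors the job-level `if:` condition that skips draft PRs.
    return not event.get("pull_request", {}).get("draft", False)
```

For example, pushing a new commit to a draft PR produces a `synchronize` event, but the draft check keeps the review from running until the PR is marked ready.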
Trade-offs: This option gives you full control over when and how the bot runs. However, it requires per-repository configuration: every new repo needs its own workflow file, and keeping those files in sync takes ongoing maintenance.
For the complete configuration reference (trigger customization, ACP mode, sub-agents, custom review guidelines), see the Automated Code Review page.

Option B: OpenHands Automations (Beta)

OpenHands Automations is currently an experimental beta feature. The API and configuration format may change.
The alternative is using OpenHands Automations, our event-triggered automation system. With Automations, you define the trigger once and it covers all repositories the bot account has access to — no per-repo workflow files needed.
1. Create a bot account on OpenHands Cloud

Log in to OpenHands Cloud with your organization’s bot GitHub account.
2. Connect GitHub

Connect the bot account’s GitHub to OpenHands Cloud via the GitHub installation flow.
3. Create the automation

Log in as the bot account and instruct the agent to set up the automation:
[Placeholder — exact setup prompt coming soon] We are finalizing the exact prompt that users can paste into their OpenHands agent to automate code reviews on every PR. Check back soon.
Trade-offs: This option is simpler to set up and maintain: you define the automation once and it covers all repos. It also leverages the full OpenHands runtime (browser, tools, sandbox), which a GitHub Actions workflow cannot.

Closing the Loop: The Iterate Skill

Setting up the verification stack is only half the story. The other half is acting on it: reading CI results, parsing review comments, fixing code, pushing again, and repeating until everything is green.
The iterate skill (OpenHands/extensions) turns the agent into an orchestration loop that drives a pull request from first push to merge-ready:
  1. Push and open a draft PR — the PR starts as a draft to prevent premature automation triggers.
  2. Poll each verification layer — the agent checks CI, the ReviewBot’s verdict, and the QA agent’s report. It only polls layers that actually exist in the repo.
  3. Decide and act — if CI failed, it reads the logs and fixes the code. If the ReviewBot requested changes, it addresses the inline comments. If QA found regressions, it debugs and fixes.
  4. Push and re-poll — after every fix, the agent commits, pushes, re-requests review, and loops back. A push is never the end — the loop only exits when all present layers pass on the current SHA.
  5. Mark ready — once every verification layer is green, the agent converts the draft PR to ready for review.
Without the iterate skill, the verification stack is a set of independent checks. With it, the stack becomes a closed-loop system where the human reviewer only sees the PR after automated layers have converged on a clean state. Invoke it with /iterate in any OpenHands conversation with the skill loaded.
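The poll, fix, and re-poll cycle in steps 2 through 5 can be sketched as a small loop. This is an illustration of the control flow only, not the skill's implementation: `iterate`, `Poller`, `fix_and_push`, and the `max_iterations` cap are hypothetical, and a real loop would also have to wait on checks that are still pending.

```python
from typing import Callable, Dict, Optional

# A poller returns True (layer passed), False (layer failed),
# or None (layer does not exist in this repo and is skipped).
Poller = Callable[[str], Optional[bool]]


def iterate(
    head_sha: str,
    pollers: Dict[str, Poller],
    fix_and_push: Callable[[str, str], str],  # (failing_layer, sha) -> new sha
    max_iterations: int = 10,                 # assumed safety cap
) -> Optional[str]:
    """Return the SHA on which all present layers pass, or None if the cap is hit."""
    sha = head_sha
    for _ in range(max_iterations):
        results = {name: poll(sha) for name, poll in pollers.items()}
        # Only layers that actually exist in the repo count.
        present = {name: ok for name, ok in results.items() if ok is not None}
        failing = [name for name, ok in present.items() if not ok]
        if not failing:
            return sha  # everything is green on the current SHA: mark ready
        # Address one failing layer, push, then re-poll every layer,
        # because the new commit invalidates earlier verdicts.
        sha = fix_and_push(failing[0], sha)
    return None
```

Re-polling every layer after each push reflects the rule above that the loop exits only when all present layers pass on the current SHA, not on an earlier one.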