Pilot beta walkthrough
Pilot beta — first deploy against a real network
This is the walkthrough for supervised lab-beta deployers: engineers with a real network (or a representative lab fleet) who want to drive cubby.network against actual devices and report back what works, what breaks, and what's missing.
Read this first, end to end, before you start. The "Posture" section in particular — it tells you what cubby will and won't do for you in this beta.
Posture: what "supervised lab beta" means
cubby ships seven real device adapters (Cisco IOS-XE, Cisco NX-OS, Arista EOS, Juniper JunOS, Palo Alto PAN-OS, Fortinet, Nokia SR Linux). The pilot path wires those against real transports (scrapli_ssh, ssh_exec, gnmi) so the harness genuinely talks to your fleet.
What is real:
- Device adapters + transports (the seven vendors above)
- Inventory via NetBox API (when you point it at a real instance)
- Evidence chain (signed bundles, prev-hash linked, verifier)
- CAB approval workflow (HMAC out of the box; Ed25519 supported)
- API auth (HMAC bearer tokens or OIDC JWTs)
- SafetyGate, plan-hash binding, all the safety invariants
What is still simulated (cubby doesn't ship real implementations for these yet):
- Auth backends (`IseTacacsAuthAdapter`, `StaticVaultAuthAdapter`) — device-credential lookup falls back to fixture data
- Validator plugins (`ConfigLintValidator`, `BatfishValidator`, `PolicyGuardValidator`) — pre-change validation runs against bundled rules, not your environment
- Telemetry (`PrometheusTelemetryAdapter`, `GrafanaTelemetryAdapter`) — capacity forecasts and synthetic monitors use bundled series
The harness boots in this hybrid posture and prints a loud banner on every start naming exactly which categories are simulated. Do not route real change traffic through cubby until you have first tested against a representative lab fleet.
What "supervised" means here: you, an engineer, are watching the logs. You're approving every CAB request manually. You're not letting any auto-remediation cascade run unattended. If you want unattended operation, that's outside the v1 beta scope.
NETOPS_PILOT_ACCEPT_LAB_BETA — the supervised-lab escape hatch
`build_pilot_harness` enforces a hard rule: in `NETOPS_ENV=production`, the no-simulated-plugins invariant cannot be relaxed silently. If you genuinely intend to run the supervised-lab-beta posture (real device adapters, simulated auth/validators/telemetry), you must set:
export NETOPS_PILOT_ACCEPT_LAB_BETA=1
This acknowledgement is not production approval. It is an explicit operator statement that you understand:
- The simulated plugin categories are still simulated.
- You will NOT route real change traffic through this build.
- You are watching every CAB request and signing manually.
- The chain of trust is bounded by what `cubby verify-chain` says.
Without this env var, `build_pilot_harness` raises rather than booting silently into a relaxed-invariant production posture. The variable is documented in the env table in `AS_BUILT.md` and read via `RuntimeConfig.pilot_accept_lab_beta`.
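A minimal sketch of that gate in Python (illustrative only; the function and exception names here are hypothetical, not cubby's actual internals):

```python
import os


class LabBetaNotAcknowledged(RuntimeError):
    """Raised when a production boot lacks the explicit lab-beta ack."""


def require_lab_beta_ack(env: dict) -> None:
    # In production, relaxed invariants demand the explicit operator statement.
    if env.get("NETOPS_ENV") == "production" and env.get("NETOPS_PILOT_ACCEPT_LAB_BETA") != "1":
        raise LabBetaNotAcknowledged(
            "refusing to boot: export NETOPS_PILOT_ACCEPT_LAB_BETA=1 "
            "to acknowledge the supervised-lab-beta posture"
        )


# At boot time this would be called as: require_lab_beta_ack(dict(os.environ))
```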
Evidence chain — fresh-install posture
The evidence chain (var/evidence/) is per-instance, not shipped with the source: var/evidence/ is gitignored and a fresh clone has only .gitkeep inside. Your first signed bundle is therefore the first link in your chain — cubby verify-chain returns ok=true from the start.
If a key rotation, replica fork, or development-time accident produces a chain you cannot rebuild, archive it; never delete:
cubby evidence-archive --reason "key rotation 2026-Q2"
This moves the broken segment to a sibling evidence-archive-<UTC-ts>/ directory, preserves the chain pointer, writes an ARCHIVE.md with the reason, and lets the next bundle start a fresh chain. The archive remains auditable; nothing is silently dropped. Production runs of evidence-archive additionally require NETOPS_ARCHIVE_PRODUCTION_ACK=1 on top of --yes so a misfired CI script cannot rotate a live chain.
For mixing pre/post-rotation bundles in a single chain (instead of archiving), cubby verify-chain accepts --legacy-key-ids (signature check skipped, chain link still required) and --chain-reset-bundle-ids (explicit fork boundaries). Both surface in the verifier's output as legacy_count / chain_resets so reviewers see how much of the chain is cryptographically trusted vs. historically retained.
Prerequisites
- Source checkout. `pipx install cubby-network` is currently broken because `config/`, `docs/`, and `examples/` aren't packaged in the wheel (we're fixing this). For now: `git clone` and run from the checkout.
- Python 3.10–3.12 on the host (the lockfile + wheel ship 3.11; 3.10 and 3.12 work for source installs).
- Linux or macOS host for the harness. Windows isn't tested.
- A real or lab fleet of at least one device matching one of the seven supported vendors.
- A NetBox instance (cloud or self-hosted) populated with at least the devices you plan to drive. Cubby reads inventory from NetBox by default in pilot mode.
- Network reach from the harness host to your devices on whichever transport you pick (TCP/22 for SSH, TCP/830 for NETCONF, TCP/57400 for gNMI).
- Read access to NetBox via API token. Cubby never writes to NetBox.
- An LLM API key (Anthropic preferred, OpenAI works) — required for the agent runtime that drives the autonomous incident loop. Without it, the deterministic workflows still run, but agent-driven ones return empty plans.
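A quick way to check transport reach from the harness host before cubby is involved at all is a plain TCP connect (stdlib-only sketch; the hostname is a placeholder):

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# e.g. port_open("sw1.lab.example", 22)     # scrapli_ssh
#      port_open("sw1.lab.example", 830)    # NETCONF
#      port_open("sw1.lab.example", 57400)  # gNMI
```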
Step 1: clone and install
git clone [email protected]:cubby-network/platform.git cubby
cd cubby
python3.11 -m venv .venv
source .venv/bin/activate
pip install -e ".[api,agents,ssh,netconf]"
Pick the extras you need:
- `[api]` — FastAPI server (needed for HTTP)
- `[agents]` — Anthropic + OpenAI clients (needed for LLM workflows)
- `[ssh]` — Scrapli SSH transport (most vendors)
- `[netconf]` — Scrapli NETCONF + ncclient (Junos, IOS-XE NETCONF mode)
- `[gnmi]` — pygnmi (Nokia SR Linux, Arista in some configs)
- `[oidc]` — python-jose (only if you use OIDC for API auth)
- `[postgres]`, `[redis]`, `[s3]` — optional persistent backends
Verify:
pytest -q # 1100+ tests should pass; ~33 skipped (live-service gated)
cubby smoke # builds the demo harness, exits 0 if everything wires
Step 2: run the pilot wizard
cubby init-pilot --out-dir ./
The wizard asks:
- Which env vars name the NetBox URL + token (default `NETBOX_BASE_URL` / `NETBOX_TOKEN` — fine to keep).
- Which device vendors are in scope (y/n for each).
- Which transport (default `scrapli_ssh`; pick `ssh_exec` for Nokia SR Linux, `gnmi` if your fleet exposes it natively).
- API auth mode (`hmac` for shared-secret bearer, `oidc` for JWT).
- Whether to wire Slack / PagerDuty / ServiceNow / Jira (optional notification + ticketing).
Outputs:
- `pilot-config.yaml` — plugin selection; commit it to version control.
- `.env.template` — list of secrets; **never commit the populated copy**.
Step 3: populate the .env
cp .env.template .env
$EDITOR .env
Replace every CHANGE_ME_* placeholder. Key ones:
# Required — production strict-mode gate
NETOPS_ENV=production
# Inventory — your NetBox
NETBOX_BASE_URL=https://netbox.your-corp.example
NETBOX_TOKEN=<long token from NetBox API tokens page>
# API auth — HMAC mode
NETOPS_API_AUTH_MODE=hmac
NETOPS_API_HMAC_SECRET=<32+ random bytes, hex>
# Evidence + approval signing
NETOPS_EVIDENCE_HMAC_SECRET=<32+ random bytes, hex>
NETOPS_APPROVAL_HMAC_SECRET=<32+ random bytes, hex>
NETOPS_EVIDENCE_REQUIRE_CONFIGURED_KEY=1 # fail-closed if unset
# LLM
ANTHROPIC_API_KEY=<your key>
Generate fresh secrets:
python -c "import secrets; print(secrets.token_hex(32))"
Each `*_HMAC_SECRET` should be a distinct value. Reusing the same secret across evidence and approvals weakens the cryptographic separation that binds a CAB approval to a specific plan.
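The separation is easy to see with the stdlib: the same plan hash yields unrelated tags under each secret, so an evidence tag can never be replayed as a CAB approval (illustrative sketch, not cubby's actual signing code):

```python
import hashlib
import hmac
import secrets


def sign(secret: bytes, payload: bytes) -> str:
    """HMAC-SHA256 tag, hex-encoded."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


# Distinct per-role secrets, as generated in the step above.
evidence_secret = secrets.token_bytes(32)
approval_secret = secrets.token_bytes(32)

plan_hash = hashlib.sha256(b'{"canonical": "plan"}').digest()

# Same payload, unrelated tags; reusing one secret would collapse this.
assert sign(evidence_secret, plan_hash) != sign(approval_secret, plan_hash)
```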
Step 4: smoke-test before pointing at devices
set -a; source .env; set +a
export NETOPS_PILOT_CONFIG=$(pwd)/pilot-config.yaml
cubby smoke
Expected output (excerpt):
================================================================================
PILOT-IN-PRODUCTION boot. Real adapters wired for: cisco_iosxe, arista_eos
Simulated for: auth (ISE/TACACS), validators (Batfish/lint), telemetry (Prometheus).
This is supported as 'supervised lab beta' ONLY.
================================================================================
Build pass: ok
Plugins registered: <count>
Readiness checks: ...
The banner is intentional. If you don't see it, you've booted in demo mode (most likely `NETOPS_PILOT_CONFIG` isn't exported).
If smoke fails, the most common causes:
- `NETBOX_TOKEN` invalid / expired → fix the token.
- Required env var unset → `cubby config` shows the resolved `RuntimeConfig` and which fields fell back to defaults.
- One of the strict-mode signer-key gates → set `NETOPS_EVIDENCE_REQUIRE_CONFIGURED_KEY=1` and provide both HMAC secrets.
Step 5: connect to one device, read-only
Before any change, pull a snapshot of one device to confirm transport + credentials work:
cubby refresh-digital-twin --device <hostname-as-it-appears-in-netbox>
This is read-only. It opens a session against the device using the transport you configured, runs a small set of show commands, parses the output, and writes the result to the digital-twin store. No config changes.
What to verify:
- Exit code 0.
- The output JSON shows `interfaces`, `vlans`, `routing` populated with values that match the device's actual state (compare with a manual `show interfaces` on the box).
- The signed evidence bundle for this snapshot lands in `var/evidence/`.
If the snapshot fails, check:
- Network reach (`telnet $device 22`).
- Credentials (does NetBox have the right management address + credential profile?).
- Vendor adapter (the wizard mapped your vendor; confirm with `grep vendor pilot-config.yaml`).
Step 6: dry-run a change, sign the evidence
The harness has a "dry-run" path on the access-port VLAN-change workflow that renders the plan, runs the what-if simulator, and writes a signed evidence bundle without touching the device.
Build a request:
cat > /tmp/req.json <<EOF
{
"requested_by": "operator:you",
"actor_roles": ["network-operator"],
"site": "<site>",
"device": "<hostname>",
"interface": "<port-name>",
"target_vlan": <vlan_id>,
"current_hour": $(date +%H)
}
EOF
cubby run-change --request-file /tmp/req.json
Inspect the output. You should see:
- A `ChangePlan` with the rendered `config_commands`.
- A `WhatIfReport` showing the field-level diff (`access_vlan` before/after, `description`, blast radius).
- An `ApprovalRecord`, either auto-approved (LOW risk) or pending (MEDIUM/HIGH).
- A signed evidence bundle ID.
Verify the bundle:
cubby verify-chain --evidence-dir var/evidence
Expected: ok: true with no chain breaks. If the chain reports errors, see "Failure modes" below.
Step 7: run a real change (against a lab device)
Only do this against a lab device or one you're explicitly authorised to change. The harness will not refuse to run.
- Submit the change request — the workflow blocks at `APPROVAL_PENDING` and returns an `approval_request_id` in `workflow_artifacts`.
- CAB members enumerate open requests and submit signatures:

```bash
curl http://localhost:8000/approvals/open
curl -X POST http://localhost:8000/approvals/<id>/sign \
  -H "content-type: application/json" \
  -d '{"signed_approval": {...SignedApproval.to_dict()...}}'
```

  (MCP exposes the same surface as `list_open_approvals` / `submit_signed_approval` for AI clients.)
- Once quorum is met, resubmit the change with `signed_approvals` in the request body — or call the workflow again and pass the accumulated signatures.
- The harness runs through all 11 steps of the workflow: snapshot, plan, validate, approve, execute, verify, post-snapshot, sign evidence.
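The two curl calls above can also be scripted. A stdlib-only sketch (the endpoint paths come from this step; `build_sign_request` and `submit` are our own helper names, and the payload shape follows `SignedApproval.to_dict()`):

```python
import json
import urllib.request

BASE = "http://localhost:8000"  # the harness API from this step


def build_sign_request(approval_id: str, signed_approval: dict) -> urllib.request.Request:
    """Build the POST /approvals/<id>/sign request."""
    return urllib.request.Request(
        f"{BASE}/approvals/{approval_id}/sign",
        data=json.dumps({"signed_approval": signed_approval}).encode(),
        headers={"content-type": "application/json"},
        method="POST",
    )


def submit(req: urllib.request.Request) -> dict:
    # Sends the request; only call this against a running harness.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```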
The expected wall-clock time depends on transport + device: ~5–15s for IOS-XE / EOS over scrapli_ssh on a single port.
If verification fails, the harness automatically runs the rollback plan. The signed evidence chain records both the failed change and the rollback as separate bundles, both prev-hash-linked.
Step 8: verify the evidence chain
After every change (or weekly, whichever comes first):
cubby verify-chain --evidence-dir var/evidence
This re-reads every bundle, recomputes the SHA-256, verifies the signature against the configured keyring, walks the prev-hash chain, and reports any signature failure or chain break.
Expected: ok: true.
If ok: false:
- `bad_signatures > 0` — at least one bundle's signature didn't verify. Either the keyring rotated and the old key is missing, or someone tampered with a bundle. Inspect the named bundle id; recover the lost key OR archive the segment with `cubby evidence-archive` (next section).
- Chain break — `prev_sha256` doesn't match the previous bundle. Operationally this happens when two writers race (hence the audit-spine guard the K8s manifest enforces with `replicas: 1`). In a single-host pilot this should be impossible; if it happens, open an issue with the bundle id and your `var/evidence/` listing.
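Conceptually, the chain walk reduces to a few lines. A sketch only: the one-JSON-file-per-bundle layout is an assumption for illustration (`prev_sha256` is the field named above, but cubby's real on-disk format may differ):

```python
import hashlib
import json
from pathlib import Path


def walk_chain(evidence_dir: str) -> dict:
    """Recompute each bundle's SHA-256 and check every prev-hash link."""
    prev_digest = None
    chain_breaks = 0
    for path in sorted(Path(evidence_dir).glob("*.json")):
        raw = path.read_bytes()
        bundle = json.loads(raw)
        if prev_digest is not None and bundle.get("prev_sha256") != prev_digest:
            chain_breaks += 1  # link doesn't match the recomputed digest
        prev_digest = hashlib.sha256(raw).hexdigest()
    return {"ok": chain_breaks == 0, "chain_breaks": chain_breaks}
```

Because each link covers the previous bundle's full bytes, tampering with any bundle changes its digest and breaks every later link, which is why the recovery path is archiving a segment rather than editing it.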
Failure modes + recovery
Evidence chain reports legacy signatures from a rotated key
NETOPS_ENV=production NETOPS_VERIFY_DOWNGRADE_ACK=1 \
cubby verify-chain --legacy-key-ids <kid> --evidence-dir var/evidence
This tells the verifier to tolerate signatures under that kid. If that's a permanent state (you've rotated and don't have the old key), seal the segment:
NETOPS_ARCHIVE_PRODUCTION_ACK=1 cubby evidence-archive --reason "rotated to ed25519"
The bundles + chain pointer move to var/evidence-archive-<UTC-ts>/. The live chain starts fresh from genesis.
Pilot harness refuses to boot
The pre-flight banner names exactly which gate failed. Common ones:
- `NETOPS_API_AUTH_MODE='dev'` — production refuses dev-token auth. Set to `hmac` or `oidc`.
- Missing signer keys — set `NETOPS_EVIDENCE_HMAC_SECRET` + `NETOPS_APPROVAL_HMAC_SECRET` (both required, distinct values).
- `simulated adapter (X) reached the strict-mode registry` — you're on the strict path, not the pilot path. Confirm `NETOPS_PILOT_CONFIG` is exported and points at a valid YAML.
A device adapter fails
`cubby refresh-digital-twin` against the device returns a clean error envelope. Check:
- `plugins/device/<vendor>/parsers.py` — these are vendor-specific. If the device runs an unusual NOS variant, the parser may not cover the output format.
- Transport (Scrapli) error → check that device privilege escalation (`enable` password, etc.) is in the credential profile.
A workflow times out mid-execution
The state machine retains the partial state. Inspect `var/runbooks/runbook-<id>.json` to see where it stopped. If the device is in a known-good state, mark the workflow complete via the API (`POST /workflows/<id>/abandon`); if it's in an intermediate state, run the rollback explicitly.
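A minimal inspection sketch for the stalled-runbook case (the field names `steps`, `status`, and `name` are assumptions for illustration; check the actual JSON on disk):

```python
import json
from pathlib import Path


def last_completed_step(runbook_path: str):
    """Name of the last step marked complete, or None if nothing finished.

    Field names here are illustrative, not a documented schema.
    """
    data = json.loads(Path(runbook_path).read_text())
    done = [s for s in data.get("steps", []) if s.get("status") == "complete"]
    return done[-1]["name"] if done else None
```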
Reporting feedback
You're a beta tester — your feedback shapes the next round of work.
File issues at: https://github.com/cubby-network/platform/issues
Use the template categories:
- Adapter bug — device adapter parsed wrong, transport hung, rollback didn't fire. Include the `pilot-config.yaml`, the vendor + NOS version, and the relevant `var/evidence/` bundle id.
- Workflow gap — a workflow type you needed that cubby doesn't cover.
- Operator UX — confusing error message, missing CLI flag, documentation drift. The harness CLI is one of the surfaces we most want feedback on.
- Safety concern — anything that looked like the safety envelope (SafetyGate, CAB, signed evidence) was bypassable. We treat these as security issues; please disclose privately first via the SECURITY.md process.
For each report, please include:
- Output of `cubby config` (sanitised — redact secrets).
- The `pilot-config.yaml` (no secrets in there by design).
- The relevant log lines with timestamps.
- The bundle ID(s) involved if there's an evidence-chain question.
What to expect next
The pilot beta is the path to fully-strict production. Items we expect to ship in response to pilot feedback:
- Real auth backend (ISE/TACACS or LDAP-shaped) so the simulated catch-all gets retired.
- Real validator integration (a Batfish-shaped pre-change check that runs against your topology).
- Real telemetry collector (Prometheus scraping) so capacity forecasts use your data.
- Pipx/wheel packaging fix so `pipx install cubby-network` works without the source checkout.
- Production K8s deployment story for multi-pod setups (currently reference-deploy only via `replicas: 1`).
- Job-queue integration for the worker so it pulls real work instead of refusing to run in production.
When each lands, the corresponding "simulated" line in the boot banner disappears. When all of them have landed, the strict production path (`build_production_harness` without a pilot config) boots cleanly and the lab-beta posture retires.
Glossary
- Demo harness — the default builder; everything simulated; fine for `pytest`, `cubby smoke`, and offline development.
- Production harness — strict pre-flight (env / auth / signer keys); refuses any simulated plugin. Currently un-bootable without a pilot config because we don't ship real auth / validator / telemetry yet.
- Pilot harness — middle ground. Pre-flight gates fire as in production; simulated catch-alls remain for the categories we haven't shipped real implementations for. **This is what you're running in the beta.**
- Evidence chain — append-only log of signed bundles, each one carrying the SHA-256 of the previous. Tampering with any bundle breaks every subsequent verifier check.
- CAB — Change Advisory Board. Signed-quorum approval bound to the plan's canonical hash; an approval can't be replayed against a different plan.
- SafetyGate — agent-tool boundary. Every LLM-driven tool call transits this gate; reads pass, writes are refused, injection patterns are sanitised or blocked.
See docs/GLOSSARY.md for the full vocabulary.