Pilot beta walkthrough
Pilot beta — first deploy against a real network
This is the walkthrough for supervised lab-beta deployers: engineers with a real network (or a representative lab fleet) who want to drive cubby.network against actual devices and report back what works, what breaks, and what's missing.
Read this first, end to end, before you start. The "Posture" section in particular — it tells you what cubby will and won't do for you in this beta.
Posture: what "supervised lab beta" means
cubby ships seven real device adapters (Cisco IOS-XE, Cisco NX-OS, Arista EOS, Juniper JunOS, Palo Alto PAN-OS, Fortinet, Nokia SR Linux). The pilot path wires those against real transports (scrapli_ssh, ssh_exec, gnmi) so the harness genuinely talks to your fleet.
What is real:
- Device adapters + transports (the seven vendors above)
- Inventory via NetBox API (when you point it at a real instance)
- Evidence chain (signed bundles, prev-hash linked, verifier)
- CAB approval workflow (HMAC out of the box; Ed25519 supported)
- API auth (HMAC bearer tokens or OIDC JWTs)
- SafetyGate, plan-hash binding, all the safety invariants
What is still simulated (cubby doesn't ship real implementations for these yet):
- Auth backends (`IseTacacsAuthAdapter`, `StaticVaultAuthAdapter`) — device-credential lookup falls back to fixture data
- Validator plugins (`ConfigLintValidator`, `BatfishValidator`, `PolicyGuardValidator`) — pre-change validation runs against bundled rules, not your environment
- Telemetry (`PrometheusTelemetryAdapter`, `GrafanaTelemetryAdapter`) — capacity forecasts and synthetic monitors use bundled series
The harness boots in this hybrid posture and prints a loud banner on every start naming exactly which categories are simulated. Do not route real change traffic through cubby until you have first tested against a representative lab fleet.
What "supervised" means here: you, an engineer, are watching the logs. You're approving every CAB request manually. You're not letting any auto-remediation cascade run unattended. If you want unattended operation, that's outside the v1 beta scope.
NETOPS_PILOT_ACCEPT_LAB_BETA — the supervised-lab escape hatch
`build_pilot_harness` enforces a hard rule: in `NETOPS_ENV=production`, the no-simulated-plugins invariant cannot be relaxed silently. If you genuinely intend to run the supervised-lab-beta posture (real device adapters, simulated auth/validators/telemetry), you must set:
export NETOPS_PILOT_ACCEPT_LAB_BETA=1
This acknowledgement is not production approval. It is an explicit operator statement that you understand:
- The simulated plugin categories are still simulated.
- You will NOT route real change traffic through this build.
- You are watching every CAB request and signing manually.
- The chain of trust is bounded by what `cubby verify-chain` says.
Without this env var, `build_pilot_harness` raises rather than booting silently into a relaxed-invariant production posture. The variable is documented in the env table in `AS_BUILT.md` and read via `RuntimeConfig.pilot_accept_lab_beta`.
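A minimal sketch of that gate in Python (illustrative only; the function and exception names here are hypothetical, not cubby's actual internals):

```python
import os


class LabBetaNotAcknowledged(RuntimeError):
    """Raised when a production boot lacks the explicit lab-beta ack."""


def require_lab_beta_ack(env: dict) -> None:
    # In production, relaxed invariants demand the explicit operator statement.
    if env.get("NETOPS_ENV") == "production" and env.get("NETOPS_PILOT_ACCEPT_LAB_BETA") != "1":
        raise LabBetaNotAcknowledged(
            "refusing to boot: export NETOPS_PILOT_ACCEPT_LAB_BETA=1 "
            "to acknowledge the supervised-lab-beta posture"
        )


# At boot time this would be called as: require_lab_beta_ack(dict(os.environ))
```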
Evidence chain — fresh-install posture
The evidence chain (var/evidence/) is per-instance, not shipped with the source: var/evidence/ is gitignored and a fresh clone has only .gitkeep inside. Your first signed bundle is therefore the first link in your chain — cubby verify-chain returns ok=true from the start.
If a key rotation, replica fork, or development-time accident produces a chain you cannot rebuild, archive it; never delete:
cubby evidence-archive --reason "key rotation 2026-Q2"
This moves the broken segment to a sibling evidence-archive-<UTC-ts>/ directory, preserves the chain pointer, writes an ARCHIVE.md with the reason, and lets the next bundle start a fresh chain. The archive remains auditable; nothing is silently dropped. Production runs of evidence-archive additionally require NETOPS_ARCHIVE_PRODUCTION_ACK=1 on top of --yes so a misfired CI script cannot rotate a live chain.
For mixing pre/post-rotation bundles in a single chain (instead of archiving), cubby verify-chain accepts --legacy-key-ids (signature check skipped, chain link still required) and --chain-reset-bundle-ids (explicit fork boundaries). Both surface in the verifier's output as legacy_count / chain_resets so reviewers see how much of the chain is cryptographically trusted vs. historically retained.
Prerequisites
- Source checkout. `pipx install cubby-network` is currently broken because `config/`, `docs/`, and `examples/` aren't packaged in the wheel (we're fixing this). For now: `git clone` and run from the checkout.
- Python 3.10–3.12 on the host (the lockfile + wheel ship 3.11; 3.10 and 3.12 work for source installs).
- Linux or macOS host for the harness. Windows isn't tested.
- A real or lab fleet of at least one device matching one of the seven supported vendors.
- A NetBox instance (cloud or self-hosted) populated with at least the devices you plan to drive. Cubby reads inventory from NetBox by default in pilot mode.
- Network reach from the harness host to your devices on whichever transport you pick (TCP/22 for SSH, TCP/830 for NETCONF, TCP/57400 for gNMI).
- Read access to NetBox via API token. Cubby never writes to NetBox.
- An LLM API key (Anthropic preferred, OpenAI works) — required for the agent runtime that drives the autonomous incident loop. Without it, the deterministic workflows still run, but agent-driven ones return empty plans.
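A quick way to check transport reach from the harness host before cubby is involved at all is a plain TCP connect (stdlib-only sketch; the hostname is a placeholder):

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# e.g. port_open("sw1.lab.example", 22)     # scrapli_ssh
#      port_open("sw1.lab.example", 830)    # NETCONF
#      port_open("sw1.lab.example", 57400)  # gNMI
```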
Step 1: clone and install
git clone [email protected]:cubby-network/platform.git cubby
cd cubby
python3.11 -m venv .venv
source .venv/bin/activate
pip install -e ".[api,agents,ssh,netconf]"
Pick the extras you need:
- `[api]` — FastAPI server (needed for HTTP)
- `[agents]` — Anthropic + OpenAI clients (needed for LLM workflows)
- `[ssh]` — Scrapli SSH transport (most vendors)
- `[netconf]` — Scrapli NETCONF + ncclient (Junos, IOS-XE NETCONF mode)
- `[gnmi]` — pygnmi (Nokia SR Linux, Arista in some configs)
- `[oidc]` — python-jose (only if you use OIDC for API auth)
- `[postgres]`, `[redis]`, `[s3]` — optional persistent backends
Verify:
pytest -q # 1100+ tests should pass; ~33 skipped (live-service gated)
cubby smoke # builds the demo harness, exits 0 if everything wires
Step 2: run the pilot wizard
cubby init-pilot --out-dir ./
The wizard asks:
- Which env vars name the NetBox URL + token (default `NETBOX_BASE_URL` / `NETBOX_TOKEN` — fine to keep).
- Which device vendors are in scope (y/n for each).
- Which transport (default `scrapli_ssh`; pick `ssh_exec` for Nokia SR Linux, `gnmi` if your fleet exposes it natively).
- API auth mode (`hmac` for shared-secret bearer, `oidc` for JWT).
- Whether to wire Slack / PagerDuty / ServiceNow / Jira (optional notification + ticketing).
Outputs:
- `pilot-config.yaml` — plugin selection; commit it to version control.
- `.env.template` — list of secrets; **never commit the populated copy**.
Step 3: populate the .env
cp .env.template .env
$EDITOR .env
Replace every CHANGE_ME_* placeholder. Key ones:
# Required — production strict-mode gate
NETOPS_ENV=production
# Inventory — your NetBox
NETBOX_BASE_URL=https://netbox.your-corp.example
NETBOX_TOKEN=<long token from NetBox API tokens page>
# API auth — HMAC mode
NETOPS_API_AUTH_MODE=hmac
NETOPS_API_HMAC_SECRET=<32+ random bytes, hex>
# Evidence + approval signing
NETOPS_EVIDENCE_HMAC_SECRET=<32+ random bytes, hex>
NETOPS_APPROVAL_HMAC_SECRET=<32+ random bytes, hex>
NETOPS_EVIDENCE_REQUIRE_CONFIGURED_KEY=1 # fail-closed if unset
# LLM
ANTHROPIC_API_KEY=<your key>
Generate fresh secrets:
python -c "import secrets; print(secrets.token_hex(32))"
Each `*_HMAC_SECRET` should be a distinct value. Reusing the same secret across evidence and approvals weakens the cryptographic separation that binds a CAB approval to a specific plan.
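The separation is easy to see with the stdlib: the same plan hash yields unrelated tags under each secret, so an evidence tag can never be replayed as a CAB approval (illustrative sketch, not cubby's actual signing code):

```python
import hashlib
import hmac
import secrets


def sign(secret: bytes, payload: bytes) -> str:
    """HMAC-SHA256 tag, hex-encoded."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


# Distinct per-role secrets, as generated in the step above.
evidence_secret = secrets.token_bytes(32)
approval_secret = secrets.token_bytes(32)

plan_hash = hashlib.sha256(b'{"canonical": "plan"}').digest()

# Same payload, unrelated tags; reusing one secret would collapse this.
assert sign(evidence_secret, plan_hash) != sign(approval_secret, plan_hash)
```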
Step 4: smoke-test before pointing at devices
set -a; source .env; set +a
export NETOPS_PILOT_CONFIG=$(pwd)/pilot-config.yaml
cubby smoke
Expected output (excerpt):
================================================================================
PILOT-IN-PRODUCTION boot. Real adapters wired for: cisco_iosxe, arista_eos
Simulated for: auth (ISE/TACACS), validators (Batfish/lint), telemetry (Prometheus).
This is supported as 'supervised lab beta' ONLY.
================================================================================
Build pass: ok
Plugins registered: <count>
Readiness checks: ...
The banner is intentional. If you don't see it, you've booted in demo mode (most likely `NETOPS_PILOT_CONFIG` isn't exported).
If smoke fails, the most common causes:
- `NETBOX_TOKEN` invalid / expired → fix the token.
- Required env var unset → `cubby config` shows the resolved `RuntimeConfig` and which fields fell back to defaults.
- One of the strict-mode signer-key gates → set `NETOPS_EVIDENCE_REQUIRE_CONFIGURED_KEY=1` and provide both HMAC secrets.
Step 5: connect to one device, read-only
Before any change, pull a snapshot of one device to confirm transport + credentials work:
cubby refresh-digital-twin --device <hostname-as-it-appears-in-netbox>
This is read-only. It opens a session against the device using the transport you configured, runs a small set of show commands, parses the output, and writes the result to the digital-twin store. No config changes.
What to verify:
- Exit code 0.
- The output JSON shows `interfaces`, `vlans`, `routing` populated with values that match the device's actual state (compare with a manual `show interfaces` on the box).
- The signed evidence bundle for this snapshot lands in `var/evidence/`.
If the snapshot fails, check:
- Network reach (`telnet $device 22`).
- Credentials (does NetBox have the right management address + credential profile?).
- Vendor adapter (the wizard mapped your vendor; confirm with `grep vendor pilot-config.yaml`).
Step 6: dry-run a change, sign the evidence
The harness has a "dry-run" path on the access-port VLAN-change workflow that renders the plan, runs the what-if simulator, and writes a signed evidence bundle without touching the device.
Build a request:
cat > /tmp/req.json <<EOF
{
"requested_by": "operator:you",
"actor_roles": ["network-operator"],
"site": "<site>",
"device": "<hostname>",
"interface": "<port-name>",
"target_vlan": <vlan_id>,
"current_hour": $(date +%H)
}
EOF
cubby run-change --request-file /tmp/req.json
Inspect the output. You should see:
- A `ChangePlan` with the rendered `config_commands`.
- A `WhatIfReport` showing the field-level diff (`access_vlan` before/after, `description`, blast radius).
- An `ApprovalRecord`, either auto-approved (LOW risk) or pending (MEDIUM/HIGH).
- A signed evidence bundle ID.
Verify the bundle:
cubby verify-chain --evidence-dir var/evidence
Expected: ok: true with no chain breaks. If the chain reports errors, see "Failure modes" below.
Step 7: run a real change (against a lab device)
Only do this against a lab device or one you're explicitly authorised to change. The harness will not refuse to run.
- Submit the change request — the workflow blocks at `APPROVAL_PENDING` and returns an `approval_request_id` in `workflow_artifacts`.
- CAB members enumerate open requests and submit signatures:

```bash
curl http://localhost:8000/approvals/open
curl -X POST http://localhost:8000/approvals/<id>/sign \
  -H "content-type: application/json" \
  -d '{"signed_approval": {...SignedApproval.to_dict()...}}'
```

  (MCP exposes the same surface as `list_open_approvals` / `submit_signed_approval` for AI clients.)
- Once quorum is met, resubmit the change with `signed_approvals` in the request body — or call the workflow again and pass the accumulated signatures.
- The harness runs through all 11 steps of the workflow: snapshot, plan, validate, approve, execute, verify, post-snapshot, sign evidence.
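The two curl calls above can also be scripted. A stdlib-only sketch (the endpoint paths come from this step; `build_sign_request` and `submit` are our own helper names, and the payload shape follows `SignedApproval.to_dict()`):

```python
import json
import urllib.request

BASE = "http://localhost:8000"  # the harness API from this step


def build_sign_request(approval_id: str, signed_approval: dict) -> urllib.request.Request:
    """Build the POST /approvals/<id>/sign request."""
    return urllib.request.Request(
        f"{BASE}/approvals/{approval_id}/sign",
        data=json.dumps({"signed_approval": signed_approval}).encode(),
        headers={"content-type": "application/json"},
        method="POST",
    )


def submit(req: urllib.request.Request) -> dict:
    # Sends the request; only call this against a running harness.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```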
The expected wall-clock time depends on transport + device: ~5–15s for IOS-XE / EOS over scrapli_ssh on a single port.
If verification fails, the harness automatically runs the rollback plan. The signed evidence chain records both the failed change and the rollback as separate bundles, both prev-hash-linked.
Step 8: verify the evidence chain
After every change (or weekly, whichever comes first):
cubby verify-chain --evidence-dir var/evidence
This re-reads every bundle, recomputes the SHA-256, verifies the signature against the configured keyring, walks the prev-hash chain, and reports any signature failure or chain break.
Expected: ok: true.
If ok: false:
- `bad_signatures > 0` — at least one bundle's signature didn't verify. Either the keyring rotated and the old key is missing, or someone tampered with a bundle. Inspect the named bundle id; recover the lost key OR archive the segment with `cubby evidence-archive` (next section).
- Chain break — `prev_sha256` doesn't match the previous bundle. Operationally this happens when two writers race (hence the audit-spine guard the K8s manifest enforces with `replicas: 1`). In a single-host pilot this should be impossible; if it happens, open an issue with the bundle id and your `var/evidence/` listing.
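Conceptually, the chain walk reduces to a few lines. A sketch only: the one-JSON-file-per-bundle layout is an assumption for illustration (`prev_sha256` is the field named above, but cubby's real on-disk format may differ):

```python
import hashlib
import json
from pathlib import Path


def walk_chain(evidence_dir: str) -> dict:
    """Recompute each bundle's SHA-256 and check every prev-hash link."""
    prev_digest = None
    chain_breaks = 0
    for path in sorted(Path(evidence_dir).glob("*.json")):
        raw = path.read_bytes()
        bundle = json.loads(raw)
        if prev_digest is not None and bundle.get("prev_sha256") != prev_digest:
            chain_breaks += 1  # link doesn't match the recomputed digest
        prev_digest = hashlib.sha256(raw).hexdigest()
    return {"ok": chain_breaks == 0, "chain_breaks": chain_breaks}
```

Because each link covers the previous bundle's full bytes, tampering with any bundle changes its digest and breaks every later link, which is why the recovery path is archiving a segment rather than editing it.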
Failure modes + recovery
Evidence chain reports legacy signatures from a rotated key
NETOPS_ENV=production NETOPS_VERIFY_DOWNGRADE_ACK=1 \
cubby verify-chain --legacy-key-ids <kid> --evidence-dir var/evidence
This tells the verifier to tolerate signatures under that kid. If that's a permanent state (you've rotated and don't have the old key), seal the segment:
NETOPS_ARCHIVE_PRODUCTION_ACK=1 cubby evidence-archive --reason "rotated to ed25519"
The bundles + chain pointer move to var/evidence-archive-<UTC-ts>/. The live chain starts fresh from genesis.
Pilot harness refuses to boot
The pre-flight banner names exactly which gate failed. Common ones:
- `NETOPS_API_AUTH_MODE='dev'` — production refuses dev-token auth. Set to `hmac` or `oidc`.
- Missing signer keys — set `NETOPS_EVIDENCE_HMAC_SECRET` + `NETOPS_APPROVAL_HMAC_SECRET` (both required, distinct values).
- `simulated adapter (X) reached the strict-mode registry` — you're on the strict path, not the pilot path. Confirm `NETOPS_PILOT_CONFIG` is exported and points at a valid YAML.
A device adapter fails
`cubby refresh-digital-twin` against the device returns a clean error envelope. Check:
- `plugins/device/<vendor>/parsers.py` — these are vendor-specific. If the device runs an unusual NOS variant, the parser may not cover the output format.
- Transport (Scrapli) error → check that device privilege escalation (`enable` password, etc.) is in the credential profile.
A workflow times out mid-execution
The state machine retains the partial state. Inspect `var/runbooks/runbook-<id>.json` to see where it stopped. If the device is in a known-good state, mark the workflow complete via the API (`POST /workflows/<id>/abandon`); if it's in an intermediate state, run the rollback explicitly.
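A minimal inspection sketch for the stalled-runbook case (the field names `steps`, `status`, and `name` are assumptions for illustration; check the actual JSON on disk):

```python
import json
from pathlib import Path


def last_completed_step(runbook_path: str):
    """Name of the last step marked complete, or None if nothing finished.

    Field names here are illustrative, not a documented schema.
    """
    data = json.loads(Path(runbook_path).read_text())
    done = [s for s in data.get("steps", []) if s.get("status") == "complete"]
    return done[-1]["name"] if done else None
```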
Reporting feedback
You're a beta tester — your feedback shapes the next round of work.
File issues at: https://github.com/cubby-network/platform/issues
Use the template categories:
- Adapter bug — device adapter parsed wrong, transport hung, rollback didn't fire. Include the `pilot-config.yaml`, the vendor + NOS version, and the relevant `var/evidence/` bundle id.
- Workflow gap — a workflow type you needed that cubby doesn't cover.
- Operator UX — confusing error message, missing CLI flag, documentation drift. The harness CLI is one of the surfaces we most want feedback on.
- Safety concern — anything that looked like the safety envelope (SafetyGate, CAB, signed evidence) was bypassable. We treat these as security issues; please disclose privately first via the SECURITY.md process.
For each report, please include:
- Output of `cubby config` (sanitised — redact secrets).
- The `pilot-config.yaml` (no secrets in there by design).
- The relevant log lines with timestamps.
- The bundle ID(s) involved if there's an evidence-chain question.
What to expect next
The pilot beta is the path to fully-strict production. Items we expect to ship in response to pilot feedback:
- Real auth backend (ISE/TACACS or LDAP-shaped) so the simulated catch-all gets retired.
- Real validator integration (a Batfish-shaped pre-change check that runs against your topology).
- Real telemetry collector (Prometheus scraping) so capacity forecasts use your data.
- Pipx/wheel packaging fix so `pipx install cubby-network` works without the source checkout.
- Production K8s deployment story for multi-pod setups (currently reference-deploy only via `replicas: 1`).
- Job-queue integration for the worker so it pulls real work instead of refusing to run in production.
When each lands, the corresponding "simulated" line in the boot banner disappears. When all of them have landed, the strict production path (`build_production_harness` without a pilot config) boots cleanly and the lab-beta posture retires.
Glossary
- Demo harness — the default builder; everything simulated; fine for `pytest`, `cubby smoke`, and offline development.
- Production harness — strict pre-flight (env / auth / signer keys); refuses any simulated plugin. Currently un-bootable without a pilot config because we don't ship real auth / validator / telemetry yet.
- Pilot harness — middle ground. Pre-flight gates fire as in production; simulated catch-alls remain for the categories we haven't shipped real implementations for. **This is what you're running in the beta.**
- Evidence chain — append-only log of signed bundles, each one carrying the SHA-256 of the previous. Tampering with any bundle breaks every subsequent verifier check.
- CAB — Change Advisory Board. Signed-quorum approval bound to the plan's canonical hash; an approval can't be replayed against a different plan.
- SafetyGate — agent-tool boundary. Every LLM-driven tool call transits this gate; reads pass, writes are refused, injection patterns are sanitised or blocked.
See docs/GLOSSARY.md for the full vocabulary.