Rollback & recovery runbook
When something goes wrong with a change Cubby executed, the platform has already captured enough evidence to reconstruct what happened and, in most cases, recover automatically. This runbook is for the operator holding the pager when a test user's (or your own) change has left the network in an unexpected state.
Three cases, in order of severity:
- The workflow rolled itself back — safest, most common, almost always self-resolving.
- The workflow failed mid-execute and is stuck — the canary applied but verification failed and automatic rollback didn't finish; you have to drive the rollback yourself.
- The workflow thinks it succeeded but the network is broken — you can't trust the harness's verdict; you need to reconstruct and remediate manually.
Every one of these assumes you can read the signed evidence chain. If the chain itself is damaged or unavailable, jump to §4 chain recovery first.
Where to find the evidence
```
<repo-root>/var/evidence/
  bundle-<prefix>.json           # the signed bundle (per stage + final)
  bundle-<prefix>.manifest.json  # the signature manifest (alg, key_id, prev_sha256, signature)
  chain.tip                      # the hash of the most recent bundle — the tip of the chain
```
Every transition the workflow state machine makes writes a new bundle whose prev_sha256 points at the previous one. chain.tip is authoritative for "what was the last thing this deployment signed."
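The linkage works like a minimal hash chain. A sketch in Python (the `append_bundle` helper and in-memory layout are illustrative, not the harness's actual code):

```python
import hashlib
import json

def bundle_sha256(bundle: dict) -> str:
    # Hash a canonical (sorted-key, compact) JSON encoding so the digest
    # is stable regardless of field order.
    payload = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()

def append_bundle(chain_tip, bundle: dict):
    # Each new bundle records the current tip as prev_sha256, and its own
    # hash becomes the new chain.tip.
    linked = {**bundle, "prev_sha256": chain_tip}
    return linked, bundle_sha256(linked)

tip = None  # empty chain
b1, tip = append_bundle(tip, {"intent_id": "i-1", "workflow_state": "PLANNED"})
b2, tip = append_bundle(tip, {"intent_id": "i-1", "workflow_state": "CLOSED"})
```

After the second write, `b2["prev_sha256"]` equals the hash of `b1`, and `tip` equals the hash of `b2` — deleting or editing any earlier bundle breaks the walk back from the tip.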
Inspect a bundle:

```sh
jq '.' var/evidence/bundle-<prefix>.json | less
```
Every bundle contains:
- `intent_id`, `workflow_state`, `actor` (who triggered the transition)
- `snapshot_before` / `snapshot_after` — the device state captured around the change
- `commands` + `rollback_commands` — exactly what Cubby asked the device to run
- `validation_report`, `precheck`, `verify` — each stage's structured outcome
- `policy_decision`, `approval` — what was approved and by whom
Verify the chain end-to-end:
```sh
cubby verify-chain        # or: python -m apps.cli.main verify-chain
```
This returns `{ok: true, total, legacy_count, chain_errors: []}`. Any entry in `chain_errors` is a red flag — see §4.
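Conceptually, verification walks the `prev_sha256` pointers backward from the tip. A minimal sketch, assuming bundles are indexed by their hash; the real verifier also checks signatures and legacy keys:

```python
import hashlib
import json

def sha(bundle: dict) -> str:
    return hashlib.sha256(
        json.dumps(bundle, sort_keys=True).encode()).hexdigest()

def verify_chain(bundles_by_hash: dict, tip: str) -> dict:
    # Walk prev_sha256 pointers from the tip; a missing bundle is a break.
    errors, total, h = [], 0, tip
    while h is not None:
        bundle = bundles_by_hash.get(h)
        if bundle is None:
            errors.append(f"missing bundle for hash {h}")
            break
        total += 1
        h = bundle.get("prev_sha256")
    return {"ok": not errors, "total": total, "chain_errors": errors}
```

Deleting any mid-chain bundle makes the walk fail at exactly that link, which is why `chain_errors` pinpoints the break.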
1. Workflow rolled itself back
Symptoms: `workflow_state = ROLLED_BACK`, the most recent bundle contains a rollback phase, and re-querying the device shows it back in its pre-change state.
What to do: read the bundle to understand why. The verify block will tell you what invariant failed. Common reasons Cubby rolls back:
- Target VLAN not in the device's `allowed_vlans`
- Interface mode is `trunk` but the intent was `access`
- Post-change snapshot shows a neighboring device lost LLDP on the affected link
- Verification timeout (device didn't re-advertise the expected state)
Recovery: usually none needed — the network is back to the pre-change state. If the root cause is environmental (allowed_vlans list out of date), fix the fixture and retry the intent. The new run gets a fresh intent_id, a fresh signed chain, and the old ROLLED_BACK bundle stays in the audit trail as context.
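For triage, the failed invariant can be scripted out of the bundle. A hedged sketch: the `verify.ok` flag matches this runbook, but the `failures` list is an assumed field name; adjust for the real schema:

```python
def rollback_reason(bundle: dict) -> str:
    # Pull the failed invariant out of the bundle's verify block.
    # verify.ok comes from this runbook; the "failures" list is an
    # assumed field name -- adjust for the real schema.
    verify = bundle.get("verify", {})
    if verify.get("ok"):
        return "verify passed; rollback was triggered elsewhere"
    failures = verify.get("failures", [])
    return "; ".join(failures) or "no failure detail recorded"
```

Run it over the latest `ROLLED_BACK` bundle to get a one-line summary for the retry note or the ticket.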
2. Workflow stuck mid-execute
Symptoms: `workflow_state` is `CANARY_EXECUTING` / `FULL_EXECUTING` / `ROLLBACK_PENDING` and isn't advancing. The process may have crashed, the device may have become unreachable, or the persist hook failed.
What to do:
- Identify the last confirmed state. Read the highest-numbered bundle for this `intent_id`:

  ```sh
  jq 'select(.intent_id == "<intent-id>") | {workflow_state, phase, commands, timestamp}' \
    var/evidence/bundle-*.json
  ```

  The `workflow_state` in the latest bundle is the last state that was signed — i.e., the last state the harness persisted successfully.
- Inspect the forward commands the harness was executing. They're in `commands[]` on the bundle that matches `CANARY_EXECUTING` or `FULL_EXECUTING`. Each entry has `phase`, `command`, `description`, and the device it targets.
- Pull the rollback block from the same bundle — `rollback_commands[]`. These are the commands Cubby would have executed if it had reached `ROLLBACK_PENDING` on its own. You can apply them manually (SSH to the device, paste the commands in order) to restore the pre-change state.
- Snapshot the device after manual rollback and compare against `snapshot_before` in the bundle. If they match, the device is recovered.
- Mark the workflow failed. From the CLI:

  ```sh
  cubby mark-failed --intent <intent-id> --reason "manual rollback applied"
  ```

  This writes a final bundle transitioning the workflow to `FAILED → CLOSED` with an audit entry. The chain now closes cleanly and future evidence verification passes.
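The "find the latest bundle, pull its rollback block" steps can be combined in a small script. A sketch, assuming the `intent_id`, `timestamp`, and `rollback_commands[].command` fields described earlier:

```python
def manual_rollback_plan(bundles: list, intent_id: str) -> list:
    # Newest bundle for this intent (by timestamp), then its rollback
    # commands in execution order. Field names follow the bundle layout
    # described in this runbook; adjust for the real schema.
    mine = [b for b in bundles if b.get("intent_id") == intent_id]
    if not mine:
        return []
    latest = max(mine, key=lambda b: b.get("timestamp", ""))
    return [c["command"] for c in latest.get("rollback_commands", [])]
```

The output is the exact ordered command list to paste into the device session.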
3. Workflow claims success but network is broken
Symptoms: `workflow_state = CLOSED`, `verify` was ok, but users are reporting an outage or the monitoring is red.
This is the hardest case. The harness's verification invariant passed on whatever it checked, but the real-world impact is different from what the invariant captured. Common shapes:
- The intent was valid but the upstream routing was misconfigured and Cubby's snapshot doesn't include that scope.
- A stale state fixture meant Cubby verified against intended state that didn't match observed state.
- The change itself was correct but a concurrent manual change on a different device conflicted with it.
What to do:
- Assume the harness's self-verification is no longer reliable. Don't trust `verify.ok: true`; go to the device directly with a read-only tool and compare.
- Read the intent. The bundle's `intent` block tells you exactly what the operator asked for. Determine whether the request itself was wrong (user error) or whether Cubby's translation of the request into commands was wrong (code bug).
- Apply the rollback manually from the bundle's `rollback_commands[]`, same as §2.
- File the evidence bundle with the incident. The bundle is the single most valuable artifact for a post-mortem: it captures the intent, the plan, the pre-state, the commands, and the post-state. Attach it to the incident ticket.
- If a code bug caused the mis-translation, open an issue and tag it `workflow-safety`. The team should reproduce against the same fixture that produced the wrong plan.
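The "go to the device directly and compare" step is a structural diff between the bundle's snapshot and a fresh read-only snapshot. A minimal sketch, assuming snapshots are flat key/value dicts:

```python
def snapshot_drift(expected: dict, observed: dict) -> dict:
    # Keys whose values differ between the bundle's snapshot and a fresh
    # read-only snapshot of the device: {key: (expected, observed)}.
    keys = set(expected) | set(observed)
    return {k: (expected.get(k), observed.get(k))
            for k in sorted(keys) if expected.get(k) != observed.get(k)}
```

An empty result means the device matches what the bundle recorded, which points the investigation at scope the snapshot never covered (upstream routing, a neighboring device).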
4. Chain recovery
Symptoms: `cubby verify-chain` reports `chain_errors != []`, or `var/evidence/` was accidentally deleted or overwritten.
Conceptual recovery:
- `chain.tip` stores the SHA-256 of the last bundle. Every new bundle signs its own `prev_sha256`, set to the value of `chain.tip` at write time. A break means one of: `chain.tip` points at a bundle that isn't present, a bundle was deleted mid-chain, or two parallel writers forked the chain (which shouldn't happen given the file lock around writes).
- The safe play is always to fork a new segment rather than rewrite history. Cubby supports this via the `NETOPS_EVIDENCE_CHAIN_RESET_BUNDLE_IDS` env var: list the bundle IDs where the verifier should tolerate a `prev_sha256` mismatch, and chain verification treats them as explicit segment boundaries.
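The boundary check can be pictured like this. A sketch, assuming the env var holds a comma-separated list of bundle IDs (the variable name is from this runbook; the parsing is an assumption):

```python
import os

def prev_hash_ok(bundle_id: str, expected_prev: str, actual_prev: str) -> bool:
    # A prev_sha256 mismatch is tolerated only for bundle IDs explicitly
    # listed as segment boundaries. The env var name is from this runbook;
    # comma-separated parsing is an assumption.
    resets = set(filter(None, os.environ.get(
        "NETOPS_EVIDENCE_CHAIN_RESET_BUNDLE_IDS", "").split(",")))
    return actual_prev == expected_prev or bundle_id in resets
```

Listed bundles start a new segment; everything before and after them still has to chain normally.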
Recovery steps:
- Do not delete bundles. Even broken ones carry the signed intent + plan + snapshots — you may want them for the post-mortem.
- Identify the break point. `cubby verify-chain` will show which `bundle_id` the `prev_sha256` check failed at. That's the first bundle of the new segment.
- Mark the segment boundary. Set the env var:

  ```sh
  export NETOPS_EVIDENCE_CHAIN_RESET_BUNDLE_IDS="<bundle-id-of-first-bundle-in-new-segment>"
  ```

  Future `verify-chain` runs accept the known break and pass.
- Rotate the signing key. If the chain break came from anything other than a planned reset (test fork, manual file deletion), rotate the evidence HMAC key so any prior bundles signed under the old key are now legacy-signed-only:

  ```sh
  export NETOPS_EVIDENCE_LEGACY_KEY_IDS="<old-key-id>"
  export NETOPS_EVIDENCE_HMAC_SECRET="<new-secret>"
  ```
- Write a new bundle (any workflow action does this) to re-anchor `chain.tip` under the new key.
- Post-mortem — every chain break deserves an investigation. Likely culprits: a test run that hit the same `var/evidence/` dir, a backup-restore that didn't preserve `chain.tip`, or a shared-secret signer that got rotated mid-chain without `NETOPS_EVIDENCE_LEGACY_KEY_IDS` being set.
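Key rotation works because the verifier accepts legacy keys for old bundles while only the current key signs new ones. A sketch with an explicit key map instead of env vars, so the rule is visible; this is illustrative, not the harness's API:

```python
import hashlib
import hmac

def verify_signature(payload: bytes, signature: str, key_id: str,
                     current_key_id: str, keys: dict,
                     legacy_key_ids: set) -> bool:
    # Accept the current key, or a key explicitly listed as legacy;
    # anything else fails verification. keys maps key_id -> secret.
    if key_id != current_key_id and key_id not in legacy_key_ids:
        return False
    secret = keys.get(key_id)
    if secret is None:
        return False
    digest = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(digest, signature)
```

This is why forgetting `NETOPS_EVIDENCE_LEGACY_KEY_IDS` after a rotation makes every pre-rotation bundle fail verification: the old key is no longer on the accepted list.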
Things to never do
- Do not `rm -rf var/evidence/`. Every bundle is audit evidence. If you need to start fresh for a clean test, move the old dir aside (`mv var/evidence var/evidence-<date>`); don't delete it.
- Do not edit a bundle JSON file. The manifest signature over the canonical payload will break verification and you'll have to reset the chain.
- Do not skip `verify-chain` after a recovery. The whole point of the chain is that you can prove no unsigned change happened. Run the verify step and make sure it's clean before calling the recovery done.
Escalation
If the chain is broken, a device is in a state you can't reconcile, or the harness is refusing to boot and you need it running now:
- Save the current `var/evidence/` and `var/runbooks/` directories intact (the team will need them for root-cause).
- File the incident. Attach the last good bundle ID and the error output from `cubby verify-chain`.
- If the harness itself is broken, a last-resort bypass is to drive the rollback commands manually from the bundle's `rollback_commands[]` — SSH to each device and apply them in order. The device doesn't care what issued the rollback.