Rollback & recovery runbook

Operations · Source: docs/ROLLBACK.md

When something goes wrong with a change Cubby executed, the platform has already captured enough evidence to reconstruct what happened and, in most cases, to recover automatically. This runbook is for the operator holding the pager when a test user's (or your own) change has left the network in an unexpected state.

Three cases, in order of severity:

  1. The workflow rolled itself back — the safest and most common case; almost always self-resolving.
  2. The workflow failed mid-execute and is stuck — the canary applied but verification failed and the automatic rollback didn't finish; you have to drive the rollback yourself.
  3. The workflow thinks it succeeded but the network is broken — you can't trust the harness's verdict; you need to reconstruct and remediate manually.

All three assume you can read the signed evidence chain. If the chain itself is damaged or unavailable, start with §4 (chain recovery).

Where to find the evidence

<repo-root>/var/evidence/
  bundle-<prefix>.json          # the signed bundle (per stage + final)
  bundle-<prefix>.manifest.json # the signature manifest (alg, key_id, prev_sha256, signature)
  chain.tip                     # the hash of the most recent bundle — the tip of the chain

Every transition the workflow state machine makes writes a new bundle whose prev_sha256 points at the previous one. chain.tip is authoritative for "what was the last thing this deployment signed."
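
The link structure can be sanity-checked by hand. A minimal sketch, assuming the chain hashes the raw bundle file with SHA-256 (the real signer's canonicalization may differ) — this is the per-link check that chain verification performs for every bundle:

```sh
# Sketch: verify a single prev_sha256 link by hand. Assumes the chain
# hashes the raw bundle file with SHA-256 (an assumption for illustration).
set -eu
dir=$(mktemp -d)

printf '{"bundle_id":"b1"}' > "$dir/bundle-b1.json"
parent_sha=$(sha256sum "$dir/bundle-b1.json" | cut -d' ' -f1)

# The next bundle's manifest records the parent's hash as prev_sha256.
printf '{"bundle_id":"b2","prev_sha256":"%s"}' "$parent_sha" > "$dir/bundle-b2.json"

recorded=$(jq -r '.prev_sha256' "$dir/bundle-b2.json")
[ "$recorded" = "$parent_sha" ] && echo "link ok"
```

A break anywhere in this chain of hashes is what `verify-chain` reports as a `chain_errors` entry.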

Inspect a bundle:

jq '.' var/evidence/bundle-<prefix>.json | less

Every bundle contains:

  - the intent block — what the operator asked for
  - the forward commands[] — each with phase, command, description, and the target device
  - rollback_commands[] — the commands that restore the pre-change state
  - snapshot_before — the device state captured before any command ran
  - the verify block — which invariant was checked and its verdict
  - workflow_state, intent_id, timestamp, and the prev_sha256 link to the previous bundle

Verify the chain end-to-end:

cubby verify-chain   # or: python -m apps.cli.main verify-chain

This returns {ok: true, total, legacy_count, chain_errors: []}. A non-empty chain_errors is a red flag — see §4.
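
If you script around the verifier, gate on the verdict shape described above. A sketch with the JSON inlined rather than produced by a live `verify-chain` run:

```sh
# Sketch: gate an automation step on the verify-chain verdict. The fields
# (ok, total, legacy_count, chain_errors) are as described in this runbook;
# the report here is inlined for illustration.
set -eu
report='{"ok": true, "total": 12, "legacy_count": 1, "chain_errors": []}'

errors=$(printf '%s' "$report" | jq '.chain_errors | length')
if [ "$errors" -eq 0 ]; then
  echo "chain clean"
else
  echo "chain broken — go to §4"
fi
```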

1. Workflow rolled itself back

Symptoms: workflow_state = ROLLED_BACK, the most recent bundle contains a rollback phase, re-querying the device shows it back to its pre-change state.

What to do: read the bundle to understand why. The verify block records which invariant failed and triggered the rollback.

Recovery: usually none needed — the network is back in its pre-change state. If the root cause is environmental (e.g., an out-of-date allowed_vlans fixture), fix the fixture and retry the intent. The new run gets a fresh intent_id and a fresh signed chain, and the old ROLLED_BACK bundle stays in the audit trail as context.

2. Workflow stuck mid-execute

Symptoms: workflow_state is CANARY_EXECUTING / FULL_EXECUTING / ROLLBACK_PENDING and isn't advancing. The process may have crashed, the device may have become unreachable, or the persist hook failed.

What to do:

  1. Identify the last confirmed state. Read the highest-numbered bundle for this intent_id:

     ```sh
     jq 'select(.intent_id == "<intent-id>") | {workflow_state, phase, commands, timestamp}' \
       var/evidence/bundle-*.json
     ```

     The workflow_state in the latest bundle is the last state that was signed — i.e., the last state the harness persisted successfully.
  2. Inspect the forward commands the harness was executing. They're in commands[] on the bundle that matches CANARY_EXECUTING or FULL_EXECUTING. Each entry has phase, command, description, and the device it targets.
  3. Pull the rollback block from the same bundle — rollback_commands[]. These are the commands Cubby would have executed if it had reached ROLLBACK_PENDING on its own. You can apply them manually (SSH to the device, paste the commands in order) to restore the pre-change state.
  4. Snapshot the device after manual rollback and compare against snapshot_before in the bundle. If they match, the device is recovered.
  5. Mark the workflow failed. From the CLI:

     ```sh
     cubby mark-failed --intent <intent-id> --reason "manual rollback applied"
     ```

     This writes a final bundle transitioning the workflow to FAILED → CLOSED with an audit entry. The chain then closes cleanly and future evidence verification passes.
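
The jq plumbing for steps 3 and 4 can be sketched against a toy bundle (the field names are the ones this runbook references; a real bundle carries more):

```sh
# Sketch: extract rollback_commands[] (step 3) and diff the post-rollback
# snapshot against snapshot_before (step 4). Toy bundle for illustration only.
set -eu
dir=$(mktemp -d)
cat > "$dir/bundle.json" <<'EOF'
{
  "intent_id": "i-1",
  "snapshot_before": "vlan 10\nvlan 20",
  "rollback_commands": [
    {"phase": "rollback", "command": "no vlan 30"},
    {"phase": "rollback", "command": "vlan 20"}
  ]
}
EOF

# Step 3: the commands to paste into the device session, in order.
jq -r '.rollback_commands[].command' "$dir/bundle.json" | tee "$dir/rollback.txt"

# Step 4: compare snapshot_before with a fresh snapshot taken after rollback.
jq -r '.snapshot_before' "$dir/bundle.json" > "$dir/before.txt"
printf 'vlan 10\nvlan 20\n' > "$dir/after.txt"   # stand-in for the device pull
diff -u "$dir/before.txt" "$dir/after.txt" && echo "device recovered"
```

An empty diff is the signal that the device is back to its pre-change state and it's safe to mark the workflow failed.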

3. Workflow claims success but network is broken

Symptoms: workflow_state = CLOSED, verify was ok, but users are reporting an outage or the monitoring is red.

This is the hardest case. The harness's verification invariant passed on whatever it checked, but the real-world impact differs from what the invariant captured.

What to do:

  1. Assume the harness's self-verification is no longer reliable. Don't trust verify.ok: true; pull the state from the device directly with a read-only tool and compare.
  2. Read the intent. The bundle's intent block tells you exactly what the operator asked for. Determine whether the request itself was wrong (user error) or whether Cubby's translation of the request into commands was wrong (code bug).
  3. Apply the rollback manually from the bundle's rollback_commands[], same as §2.
  4. File the evidence bundle with the incident. The bundle is the single most valuable artifact for a post-mortem: it captures the intent, the plan, the pre-state, the commands, and the post-state. Attach it to the incident ticket.
  5. If a code bug caused the mis-translation, open an issue and tag with workflow-safety. The team should reproduce against the same fixture that produced the wrong plan.
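Step 1's "go to the device and compare" can be sketched as follows. `snapshot_after` is an assumed field name for the captured post-state, and the "live" text stands in for a real read-only pull from the device:

```sh
# Sketch: distrust verify.ok — diff what the bundle claims against a fresh
# read-only read of the device. snapshot_after is an assumed field name;
# the live snapshot here is hard-coded to simulate a diverged device.
set -eu
dir=$(mktemp -d)
printf '{"verify": {"ok": true}, "snapshot_after": "vlan 10\\nvlan 20"}' > "$dir/bundle.json"

jq -r '.snapshot_after' "$dir/bundle.json" > "$dir/claimed.txt"
printf 'vlan 10\nvlan 30\n' > "$dir/live.txt"   # what the device actually says

# A non-empty diff means the harness's verdict can't be trusted for this change.
if ! diff -u "$dir/claimed.txt" "$dir/live.txt" > /dev/null; then
  echo "claimed state diverges from live state"
fi
```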

4. Chain recovery

Symptoms: cubby verify-chain reports chain_errors != [], or var/evidence/ was accidentally deleted / overwritten.

Conceptual recovery: a broken chain cannot be repaired retroactively — the signatures are what they are. Recovery means marking the known break as a segment boundary and re-anchoring a new segment, rotating the signing key if the break wasn't planned.

Recovery steps:

  1. Do not delete bundles. Even broken ones carry the signed intent + plan + snapshots — you may want them for the post-mortem.
  2. Identify the break point. cubby verify-chain will show which bundle_id the prev_sha256 check failed at. That's the first bundle of the new segment.
  3. Mark the segment boundary. Set the env var:

     ```sh
     export NETOPS_EVIDENCE_CHAIN_RESET_BUNDLE_IDS="<bundle-id-of-first-bundle-in-new-segment>"
     ```

     Future verify-chain runs accept the known break and pass.
  4. Rotate the signing key. If the chain break is from anything other than a planned reset (test fork, manual file deletion), rotate the evidence HMAC key so any prior bundles signed under the old key are treated as legacy-signed-only:

     ```sh
     export NETOPS_EVIDENCE_LEGACY_KEY_IDS="<old-key-id>"
     export NETOPS_EVIDENCE_HMAC_SECRET="<new-secret>"
     ```
  5. Write a new bundle (any workflow action does this) to re-anchor chain.tip under the new key.
  6. Post-mortem — every chain break deserves an investigation. Likely culprits: a test run that hit the same var/evidence/ dir, a backup-restore that didn't preserve chain.tip, a shared-secret signer that got rotated mid-chain without NETOPS_EVIDENCE_LEGACY_KEY_IDS being set.
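
Why step 4's legacy-key list matters is easier to see with the signing math in view. A sketch assuming HMAC-SHA256 over the raw bundle bytes (the real signer's input encoding may differ):

```sh
# Sketch: a bundle signed under the old HMAC secret produces a different
# signature than the same bundle under the new secret, so without a legacy
# key list the old bundles stop verifying after rotation.
# (HMAC-SHA256 over raw bytes is an assumption for illustration.)
set -eu
bundle='{"bundle_id":"b1","workflow_state":"CLOSED"}'

old_secret='old-secret'
new_secret='new-secret'

sig_old=$(printf '%s' "$bundle" | openssl dgst -sha256 -hmac "$old_secret" | awk '{print $NF}')
sig_new=$(printf '%s' "$bundle" | openssl dgst -sha256 -hmac "$new_secret" | awk '{print $NF}')

[ "$sig_old" != "$sig_new" ] && echo "old-key signature no longer verifies under new key"
```

This is why a mid-chain rotation without NETOPS_EVIDENCE_LEGACY_KEY_IDS shows up as a chain break.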

Things to never do

  - Never delete or hand-edit bundles in var/evidence/ — they are signed, and they carry the evidence a post-mortem needs.
  - Never rotate NETOPS_EVIDENCE_HMAC_SECRET mid-chain without listing the old key in NETOPS_EVIDENCE_LEGACY_KEY_IDS — bundles signed under the old key will stop verifying.
  - Never point a test run at the production var/evidence/ directory — a forked chain is one of the most common causes of chain breaks.

Escalation

If the chain is broken, a device is in a state you can't reconcile, or the harness is refusing to boot and you need it running now:

  1. Save the current var/evidence/ and var/runbooks/ directories intact (the team will need them for root-cause).
  2. File the incident. Attach the last good bundle ID and the error output from cubby verify-chain.
  3. If the harness itself is broken, a last-resort bypass is to drive the rollback commands manually from the bundle's rollback_commands[] — SSH to each device and apply them in order. The device doesn't care what issued the rollback.