
Recupero dal Disastro (Disaster Recovery)

“Doppo ‘a tempesta vène ‘o sole — ma primma bisogna sopravvivere ‘a tempesta.” (After the storm comes the sun — but first you have to survive the storm.)

No distributed system is immune to catastrophe. A network partition swallows three nodes whole. A botched deployment corrupts the consensus log. A rogue operator runs DROP DATABASE on the wrong kitchen. Pasta Protocol was designed with the Neapolitan spirit of resilience: we have seen Vesuvius erupt, we have rebuilt, and we will rebuild again. This runbook documents how.

Classificazione degli Incidenti (Incident Classification)

Before you open this runbook, confirm the severity of your situation:

| Level | Code | Symptoms | Action |
|---|---|---|---|
| Informational | BRUSCHETTA | Single node slow, minor lag | Monitor; no recovery needed |
| Warning | PEPERONCINO | Node unhealthy, quorum intact | Restart failed node, re-join cluster |
| Critical | VESUVIO | Quorum lost, writes rejected | Execute this runbook |
| Fatal | TERREMOTO | Full cluster halt, no reads or writes | Execute this runbook + post-mortem mandatory |

If your incident is BRUSCHETTA or PEPERONCINO, see A Pasta Scotta instead. This runbook is for VESUVIO and TERREMOTO events only.
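
As an illustration of how the table maps observations to a severity code, here is a minimal sketch. The `ClusterObservation` shape, the function names, and the majority-quorum rule are assumptions made for this example; they are not part of the Pasta Protocol API.

```typescript
// Hypothetical helper: every name here is an assumption, not Pasta Protocol API.
type Severity = "BRUSCHETTA" | "PEPERONCINO" | "VESUVIO" | "TERREMOTO";

interface ClusterObservation {
  totalNodes: number;      // nodes configured in the cluster
  respondingNodes: number; // nodes currently answering health checks
  readsServed: boolean;    // whether the cluster still answers reads
}

// Assumed majority quorum: strictly more than half of the configured nodes.
function hasQuorum(obs: ClusterObservation): boolean {
  return obs.respondingNodes >= Math.floor(obs.totalNodes / 2) + 1;
}

// Returns the most severe code from the table that matches the observation.
function classifyIncident(obs: ClusterObservation): Severity {
  if (!hasQuorum(obs)) {
    // Quorum lost: writes are rejected. If reads are gone too, it is a full halt.
    return obs.readsServed ? "VESUVIO" : "TERREMOTO";
  }
  // Quorum intact: a missing node is PEPERONCINO, otherwise at worst BRUSCHETTA.
  return obs.respondingNodes < obs.totalNodes ? "PEPERONCINO" : "BRUSCHETTA";
}

// Only VESUVIO and TERREMOTO call for this runbook.
function requiresRunbook(severity: Severity): boolean {
  return severity === "VESUVIO" || severity === "TERREMOTO";
}
```

With one of three nodes responding and reads still served, `classifyIncident` returns `"VESUVIO"` and `requiresRunbook` returns `true`.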

Runbook di Ripristino (Recovery Runbook)

  1. Rileva l’eruzione (Detect the eruption)

    Confirm that you have a genuine VESUVIO or TERREMOTO event and not a monitoring false positive. Run the cluster diagnostic:

    npx pasta cluster:status --verbose
    # => VESUVIO: quorum lost — 1/3 nodes responding
    # => Node vesuvio-01: RESPONDING (leader)
    # => Node vesuvio-02: UNREACHABLE (timeout after 5000ms)
    # => Node vesuvio-03: UNREACHABLE (timeout after 5000ms)

    Check the ErrorRegistry log for the triggering event:

    import { ErrorRegistry } from '@pasta-protocol/core';

    const registry = ErrorRegistry.getInstance();
    const recentErrors = registry.getHistory({ severity: ['VESUVIO', 'TERREMOTO'], limit: 20 });
    for (const error of recentErrors) {
      console.log(`[${error.timestamp}] ${error.code}: ${error.message}`);
    }

    Do not proceed to Step 2 until you have identified the root trigger. Treating the wrong symptom will worsen the eruption.
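
    One way to narrow down the trigger is to look at the earliest critical error in the history, since later errors are often consequences of the first. The sketch below is self-contained and does not call the registry. The `ErrorRecord` type is a local stand-in that assumes only the three fields printed by the loop above, and it further assumes that `timestamp` is an ISO 8601 string.

    ```typescript
    // Local stand-in type (an assumption): only the fields printed above are modelled.
    interface ErrorRecord {
      timestamp: string; // assumed ISO 8601, e.g. "2025-03-20T14:31:07Z"
      code: string;
      message: string;
    }

    // Returns the earliest record by timestamp, or undefined for an empty history.
    // The input array is copied first so the caller's ordering is left untouched.
    function earliestError(records: ErrorRecord[]): ErrorRecord | undefined {
      return [...records].sort(
        (a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp)
      )[0];
    }
    ```

    Treat the result as a first candidate to investigate, not as proof of the root cause.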

  2. Evacua il traffico (Evacuate traffic)

    Immediately redirect all client traffic away from the degraded cluster. If you are running behind a load balancer, switch to the standby kitchen:

    # Point the load balancer to the standby region
    npx pasta lb:failover --target standby-kitchen-eu-west --confirm
    # Verify no traffic is reaching the primary cluster
    npx pasta cluster:connections --kitchen primary-kitchen-eu-central
    # => Active connections: 0

    If you do not have a standby kitchen configured, enable maintenance mode to surface a user-friendly error instead of silent data corruption:

    npx pasta maintenance:enable --message "La cucina è momentaneamente chiusa. Torniamo presto."

    Log the evacuation timestamp — you will need it for the consistency verification in Step 4.

  3. Ripristina dal backup (Restore from backup)

    Identify the most recent clean snapshot. The Dispensa subsystem maintains rolling snapshots every 15 minutes by default:

    npx pasta backup:list --kitchen primary-kitchen-eu-central
    # => backup-2025-03-20T14:30:00Z [CLEAN] Size: 2.3 GB
    # => backup-2025-03-20T14:15:00Z [CLEAN] Size: 2.3 GB
    # => backup-2025-03-20T14:00:00Z [SUSPECT] Size: 2.1 GB ← do not use

    Never restore a SUSPECT backup. The [SUSPECT] flag means the snapshot was taken during an active write transaction and may contain partial data. Always use the most recent [CLEAN] snapshot:

    npx pasta backup:restore \
      --snapshot backup-2025-03-20T14:30:00Z \
      --kitchen primary-kitchen-eu-central \
      --wipe-current-state \
      --confirm

    This process will take 5–20 minutes depending on snapshot size. The command streams progress to stdout. Do not interrupt it.
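
    If you script the snapshot choice instead of reading the list by eye, the selection rule can be sketched as follows. This is an illustration only: it assumes the `backup:list` output keeps the line shape shown above (a `backup-<ISO timestamp>` name followed by a bracketed status), and the `Snapshot` type and function names are hypothetical.

    ```typescript
    // Hypothetical types and helpers; the line format is assumed from the sample output.
    interface Snapshot {
      name: string;    // e.g. "backup-2025-03-20T14:30:00Z"
      status: string;  // e.g. "CLEAN" or "SUSPECT"
      takenAt: number; // snapshot time in milliseconds since the Unix epoch
    }

    // Captures the snapshot name, its embedded timestamp, and the bracketed status.
    const SNAPSHOT_LINE = /(backup-(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z))\s+\[(\w+)\]/;

    function parseSnapshots(output: string): Snapshot[] {
      const snapshots: Snapshot[] = [];
      for (const line of output.split("\n")) {
        const match = SNAPSHOT_LINE.exec(line);
        if (match) {
          snapshots.push({ name: match[1], takenAt: Date.parse(match[2]), status: match[3] });
        }
      }
      return snapshots;
    }

    // Picks the newest snapshot flagged CLEAN; anything else is never returned.
    function latestCleanSnapshot(output: string): Snapshot | undefined {
      return parseSnapshots(output)
        .filter((snapshot) => snapshot.status === "CLEAN")
        .sort((a, b) => b.takenAt - a.takenAt)[0];
    }
    ```

    The function returns `undefined` when no CLEAN snapshot exists, which a script should treat as a reason to stop and escalate, not as permission to fall back to a SUSPECT one.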

  4. Verifica la consistenza (Verify consistency)

    Once restoration completes, run the consistency checker before allowing any writes. This compares your restored state against the replicated write-ahead log (WAL) from the surviving nodes:

    npx pasta consistency:check \
      --since 2025-03-20T14:30:00Z \
      --kitchen primary-kitchen-eu-central
    # => Checking 847 operations since snapshot timestamp...
    # => Replayed: 847 Conflicts: 0 Missing: 0
    # => Consistency: OK ✓

    If the consistency check reports conflicts or missing operations, do not resume traffic. Escalate to your data team immediately. A conflict here means the WAL and the snapshot diverged — a rare but serious condition that requires manual reconciliation.
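
    The go/no-go rule above can be encoded as a small gate in recovery automation. The sketch below assumes the summary line keeps the `Replayed: … Conflicts: … Missing: …` shape shown in the sample output; the type and function names are hypothetical.

    ```typescript
    // Hypothetical helper; the summary format is assumed from the sample output.
    interface ConsistencySummary {
      replayed: number;
      conflicts: number;
      missing: number;
    }

    function parseConsistencySummary(output: string): ConsistencySummary | undefined {
      const match = /Replayed:\s*(\d+)\s+Conflicts:\s*(\d+)\s+Missing:\s*(\d+)/.exec(output);
      if (!match) return undefined;
      return {
        replayed: Number(match[1]),
        conflicts: Number(match[2]),
        missing: Number(match[3]),
      };
    }

    // Fails closed: an unparseable report is treated the same as a failed check.
    function safeToResume(output: string): boolean {
      const summary = parseConsistencySummary(output);
      return summary !== undefined && summary.conflicts === 0 && summary.missing === 0;
    }
    ```

    Failing closed is deliberate here: if the output format changes or the command prints nothing, the gate blocks traffic instead of assuming success.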

  5. Riprendi la cottura (Resume cooking)

    With a clean consistency check in hand, bring the cluster back online and restore traffic:

    # Bring all nodes back into quorum
    npx pasta cluster:start --kitchen primary-kitchen-eu-central
    # Wait for quorum confirmation
    npx pasta cluster:await-quorum --timeout 120s
    # => Quorum achieved: 3/3 nodes healthy
    # Disable maintenance mode
    npx pasta maintenance:disable
    # Re-enable load balancer routing to the primary kitchen
    npx pasta lb:failback --target primary-kitchen-eu-central --confirm

    Monitor the Termometro dashboard closely for the first 30 minutes. Set your alert threshold lower than usual — PEPERONCINO events should page on-call during this stabilisation window.

    Document the incident in your post-mortem log. A TERREMOTO event without a post-mortem is a TERREMOTO waiting to happen again.

Prevenzione (Prevention)

The best disaster recovery is the one you never need to run. Pasta Protocol ships with built-in safeguards that significantly reduce the blast radius of any eruption:

Backup automatici — configure your .ricetta file to take snapshots every 5 minutes in high-criticality kitchens. The storage cost is minimal compared to the recovery cost.

Multi-region standby — deploy a warm standby kitchen in a second region. Traffic failover drops from minutes to seconds.

Chaos cooking — run npx pasta chaos:drill quarterly. Kill a random node in your staging environment and practice the runbook. The drill that feels unnecessary now is the muscle memory that saves you at 3 AM.

# .ricetta — example disaster-resilience configuration
kitchen:
  name: primary-kitchen-eu-central
backup:
  enabled: true
  interval: 5m
  retention: 48h
  storage: s3://mia-cucina-backups/primary/
standby:
  enabled: true
  region: eu-west
  kitchen: standby-kitchen-eu-west
chaos:
  drill_schedule: "0 10 * * MON" # Monday mornings, not Friday afternoons
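
To get a feel for what an interval and retention pair implies, the arithmetic can be sketched as below. The duration syntax (`5m`, `48h`) is read here as a number followed by a unit of seconds, minutes, hours, or days; that interpretation and both function names are assumptions for illustration, not documented `.ricetta` behaviour.

```typescript
// Hypothetical helpers; the duration syntax is an assumption about .ricetta values.
const UNIT_MS: Record<string, number> = {
  s: 1000,
  m: 60000,
  h: 3600000,
  d: 86400000,
};

// Converts a value such as "5m" or "48h" into milliseconds.
function parseDuration(value: string): number {
  const match = /^(\d+)([smhd])$/.exec(value);
  if (!match) throw new Error(`Unsupported duration: ${value}`);
  return Number(match[1]) * UNIT_MS[match[2]];
}

// Number of snapshots that fit inside the retention window at the given interval.
function retainedSnapshots(interval: string, retention: string): number {
  return Math.floor(parseDuration(retention) / parseDuration(interval));
}
```

With the values above, `retainedSnapshots("5m", "48h")` gives 576 snapshots, against 192 at the default 15-minute interval. The interval is also the upper bound on how old the newest snapshot can be when an incident starts.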

Ricordatevi: ‘o meglio cuoco non è quello che non sbaglia mai — è quello che sa come rimettere ‘a pasta sul fuoco. (Remember: the best cook is not the one who never makes mistakes — it is the one who knows how to put the pasta back on the fire.)