Restoration Runbook

This is the current disaster recovery restore path for the KH3 infrastructure. Use it together with the script usage guide and the validation checklist.

The active rebuild path for CT 101 is now the rootless Podman path documented in the Rootless Podman Restore Runbook. The Docker restore notes on this page remain useful because they record the verified backup artifact facts and the errors discovered during the first restore exercise.

The restore path is more than documentation: it is implemented as scripts. The scripts validate SSH access, stage the real NAS backup by tar over SSH, validate backup artifacts, copy the staged backup into the target LXC through Proxmox, restore the selected services, and verify the restored service state.

Current Evidence

Read-only checks on June 16, 2026 confirmed:

  • ssh pve reaches the Proxmox host and pct, qm, and pvesm are present.
  • The live short hostname returned by that alias was pve. Older notes call the host pve02, so the restore answer file accepts both by default.
  • ssh -p 2242 nas can see /volume1/vm_backup/proxmox-rebuild-20260614-191305.
  • The NAS backup directory displays broad mode bits through the NAS/CIFS view; Synology ACLs must be checked before treating staged files as protected.
  • The websites directory exists but was empty in the verified backup set.
  • Current Proxmox guest listing did not match older notes: qm list showed pfSense as VM 100, and no LXCs were listed. Treat this as live-state drift to investigate before running a destructive rebuild step.
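
The hostname drift noted above (pve versus pve02) can be guarded with a small allow-list check. The following is only a sketch: ALLOWED_PVE_HOSTNAMES and hostname_allowed are hypothetical names chosen for illustration and are not taken from the real answer file or scripts.

```shell
#!/usr/bin/env bash
# Illustrative sketch only. ALLOWED_PVE_HOSTNAMES and hostname_allowed are
# hypothetical names, not the real answer-file keys or script functions.
ALLOWED_PVE_HOSTNAMES="${ALLOWED_PVE_HOSTNAMES:-pve pve02}"

# hostname_allowed NAME: succeed only if NAME is one of the accepted short hostnames.
hostname_allowed() {
  local candidate="$1" name
  for name in $ALLOWED_PVE_HOSTNAMES; do
    [ "$candidate" = "$name" ] && return 0
  done
  return 1
}

# Example use against the live host (read-only):
#   live="$(ssh pve hostname -s)"
#   hostname_allowed "$live" || { echo "unexpected Proxmox hostname: $live" >&2; exit 1; }
```

Failing closed on an unknown hostname keeps a later destructive step from running against a host that merely answers to the same SSH alias.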

Backup Source and Staging

| Item | Value |
| --- | --- |
| Backup source | nas:/volume1/vm_backup/proxmox-rebuild-20260614-191305 |
| NAS access | ssh -p 2242 nas |
| Workstation stage | /tmp/proxmox-rebuild-20260614-191305-staged by default |
| Workstation marker | .copy-complete inside the staged directory |
| Docker LXC stage | /opt/docker/recovery/proxmox-rebuild-20260614-191305 |
| Docker LXC marker | .copy-complete inside the LXC staged directory |
| Docker compose target | /opt/docker/compose |
| Docker data target | /opt/docker/volumes |
| Podman LXC stage | /opt/podman/restore/proxmox-rebuild-20260614-191305 |
| Podman data target | /opt/podman/volumes |
| Podman env target | /opt/podman/env |

The stage copy deliberately uses tar over SSH, not SCP or SFTP:

ssh -p 2242 nas 'tar -C /volume1/vm_backup/proxmox-rebuild-20260614-191305 -cf - .' |
  tar -C /tmp/proxmox-rebuild-20260614-191305-staged -xf -

Use scripts/restore/01-stage-backup.sh instead of typing this manually. The script writes .copy-complete only after tar extraction succeeds. A normal rerun reuses the staged copy. Set FORCE_BACKUP_COPY=1 only when replacing a known bad or obsolete staged copy.
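
The marker behavior described above can be sketched as follows. This is not the real 01-stage-backup.sh; the function name stage_backup and its argument layout are assumptions made for illustration. Only the .copy-complete marker, the FORCE_BACKUP_COPY switch, and the tar-over-SSH pipeline come from this runbook.

```shell
#!/usr/bin/env bash
# Illustrative sketch of marker-guarded staging; NOT the real 01-stage-backup.sh.
set -o pipefail  # a failure on either side of the pipe must fail the whole copy

# stage_backup DEST PRODUCER...
# PRODUCER is any command that writes a tar stream to stdout, for example:
#   ssh -p 2242 nas 'tar -C /volume1/vm_backup/proxmox-rebuild-20260614-191305 -cf - .'
stage_backup() {
  local dest="$1"
  shift
  local marker="$dest/.copy-complete"

  # Reuse a completed copy unless the operator explicitly forces a recopy.
  if [ -f "$marker" ] && [ "${FORCE_BACKUP_COPY:-0}" != "1" ]; then
    echo "using existing staged backup: $dest"
    return 0
  fi

  # Drop any stale marker first so an interrupted recopy is never
  # mistaken for a complete one.
  rm -f "$marker"
  mkdir -p "$dest" || return 1
  "$@" | tar -C "$dest" -xf - || return 1

  # The marker is written only after extraction has succeeded.
  touch "$marker"
  echo "staged backup complete: $dest"
}
```

Note that this sketch extracts over the existing directory when forced; a real script might clear or rename the old staged copy first.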

Active Podman Script Order

Run from the administration workstation unless noted otherwise:

cp scripts/podman/podman-answer.env.example scripts/podman/podman-answer.env
chmod 600 scripts/podman/podman-answer.env

Edit the answer file and keep it out of Git. For the current restored test target, CT 101 is podman-lxc at 192.168.2.100/24 with gateway 192.168.2.1. The nameserver may be set to 1.1.1.1 as a temporary measure, but only while internal DNS at 192.168.2.2 is unavailable.
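
Because the answer file holds restore settings and is meant to stay private, a quick permission check before each run can catch a file that was recreated with a looser mode. The sketch below assumes GNU stat (as found on Debian and Proxmox); check_answer_file is a hypothetical helper, not part of the real scripts.

```shell
#!/usr/bin/env bash
# Illustrative sketch; check_answer_file is a hypothetical helper.
# Assumes GNU coreutils stat (stat -c). BSD/macOS stat uses different flags.

# check_answer_file PATH: succeed only if PATH exists and has mode 600.
check_answer_file() {
  local file="$1" mode
  if [ ! -f "$file" ]; then
    echo "missing answer file: $file" >&2
    return 1
  fi
  mode="$(stat -c '%a' "$file")" || return 1
  if [ "$mode" != "600" ]; then
    echo "answer file $file has mode $mode, expected 600" >&2
    return 1
  fi
}

# Example use from the repository root:
#   check_answer_file scripts/podman/podman-answer.env || exit 1
#   git check-ignore -q scripts/podman/podman-answer.env || echo "answer file is not ignored by Git" >&2
```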

Create or converge the Podman LXC from the Proxmox host:

ssh pve 'mkdir -p /root/kh3-podman-restore'
scp scripts/podman/common.sh scripts/podman/00-create-podman-lxc-101.sh scripts/podman/podman-answer.env pve:/root/kh3-podman-restore/
ssh pve 'bash /root/kh3-podman-restore/00-create-podman-lxc-101.sh --answer-file /root/kh3-podman-restore/podman-answer.env'

Bootstrap, generate Quadlets, stage the backup, restore data, and validate:

ssh pve 'pct push 101 /root/kh3-podman-restore/podman-answer.env /root/podman-answer.env --perms 0600'
ssh pve 'pct exec 101 -- bash -s -- --answer-file /root/podman-answer.env' < scripts/podman/01-bootstrap-rootless-podman-lxc.sh
ssh pve 'pct exec 101 -- bash -s -- --answer-file /root/podman-answer.env' < scripts/podman/03-generate-rootless-quadlets.sh
scripts/podman/02-stage-backup-to-podman-lxc.sh --answer-file scripts/podman/podman-answer.env

Only after staging is complete and the operator is ready to restore service data, set:

CONFIRM_PODMAN_RESTORE=restore-podman-services

Push the updated answer file, then run:

scp scripts/podman/podman-answer.env pve:/root/kh3-podman-restore/podman-answer.env
ssh pve 'pct push 101 /root/kh3-podman-restore/podman-answer.env /root/podman-answer.env --perms 0600'
ssh pve 'pct exec 101 -- bash -s -- --answer-file /root/podman-answer.env' < scripts/podman/04-restore-rootless-services.sh
ssh pve 'pct exec 101 -- bash -s -- --answer-file /root/podman-answer.env' < scripts/podman/05-validate-rootless-services.sh
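
The confirmation value acts as a gate in front of the restore step. A minimal sketch of such a gate is shown here, assuming the answer file is a plain KEY=VALUE file that a shell can source; require_confirmation is a hypothetical name, and the real 04-restore-rootless-services.sh may implement the check differently.

```shell
#!/usr/bin/env bash
# Illustrative sketch; require_confirmation is a hypothetical helper.

# require_confirmation ACTUAL EXPECTED: succeed only on an exact match.
require_confirmation() {
  local actual="$1" expected="$2"
  if [ "$actual" != "$expected" ]; then
    echo "refusing to restore: confirmation value must be exactly '$expected'" >&2
    return 1
  fi
}

# Example use inside a restore script (assumes a sourceable answer file):
#   . /root/podman-answer.env
#   require_confirmation "${CONFIRM_PODMAN_RESTORE:-}" restore-podman-services || exit 1
```

Requiring an exact phrase rather than a generic yes or 1 makes it harder to carry a stale confirmation over from a different restore path, such as the Docker value restore-docker-services.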

Historical Docker Script Order

Run from the administration workstation unless noted otherwise:

cp scripts/restore/restore-answer.env.example scripts/restore/restore-answer.env
chmod 600 scripts/restore/restore-answer.env

Edit the answer file and keep it out of Git. Then run:

scripts/restore/00-preflight-access.sh --answer-file scripts/restore/restore-answer.env
scripts/restore/01-stage-backup.sh --answer-file scripts/restore/restore-answer.env
scripts/restore/02-validate-artifacts.sh --answer-file scripts/restore/restore-answer.env

If CT 101 docker-host has just been created and scripts/bootstrap-debian13-docker-lxc.sh has printed "Docker installation verified", this is the next checkpoint. Continue from the three restore wrapper commands above. They run from the administration workstation and use ssh pve plus pct exec; direct SSH to docker-host is not required.

Only after the new Docker LXC exists, Docker is installed, artifact validation passes, and the operator is ready to start services, set:

CONFIRM_RESTORE=restore-docker-services

Then run:

scripts/restore/03-restore-docker-services.sh --answer-file scripts/restore/restore-answer.env
scripts/restore/04-validate-services.sh --answer-file scripts/restore/restore-answer.env

scripts/restore/run-restore.sh runs all stages in order, but do not use it for the first exercise. Step through the individual scripts so failures are easier to understand.

Podman Step Results

| Step | Expected result | Safe rerun behavior |
| --- | --- | --- |
| 00-create-podman-lxc-101.sh | CT 101 podman-lxc exists, remains unprivileged, and has the narrow /dev/net/tun passthrough needed for rootless Podman networking | Accepts the existing correct CT and converges settings |
| 01-bootstrap-rootless-podman-lxc.sh | podsvc exists, rootless Podman works, /etc/subuid and /etc/subgid use podsvc:10000:50000, and /opt/podman exists | Keeps the existing service user and directory tree |
| 02-stage-backup-to-podman-lxc.sh | Local and LXC staged backups contain .copy-complete | Does not recopy when markers exist unless FORCE_BACKUP_COPY=1 |
| 03-generate-rootless-quadlets.sh | Rootless Quadlets exist under /home/podsvc/.config/containers/systemd and env files exist under /opt/podman/env | Replaces Quadlet files but does not overwrite edited env files |
| 04-restore-rootless-services.sh | Application archives are restored, PostgreSQL dumps are imported, ownership is mapped with podman unshare, and services start | Skips marked restores unless FORCE_RESTORE=1 |
| 05-validate-rootless-services.sh | Required containers run, PostgreSQL is healthy, databases exist, direct HTTP ports answer, and no user unit is failed | Read-only; safe to rerun |
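
One of the expected results above, the podsvc:10000:50000 subordinate ID range, is easy to spot-check by hand. The helper below is a sketch with a hypothetical name (has_subid_range); it only reads the file it is given and does not reflect how the real bootstrap or validation scripts perform the check.

```shell
#!/usr/bin/env bash
# Illustrative sketch; has_subid_range is a hypothetical helper.

# has_subid_range FILE USER START COUNT:
# succeed only if FILE contains the exact line USER:START:COUNT.
has_subid_range() {
  local file="$1" user="$2" start="$3" count="$4"
  # -x matches whole lines only, -F treats the pattern as a fixed string.
  grep -qxF "${user}:${start}:${count}" "$file"
}

# Example use inside CT 101 (read-only):
#   has_subid_range /etc/subuid podsvc 10000 50000 || echo "unexpected /etc/subuid" >&2
#   has_subid_range /etc/subgid podsvc 10000 50000 || echo "unexpected /etc/subgid" >&2
```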

Docker Step Results

| Step | Expected result | Safe rerun behavior |
| --- | --- | --- |
| 00-preflight-access.sh | Proxmox hostname, paths for pct and qm, active storage output, NAS file list | Read-only; safe to rerun |
| 01-stage-backup.sh | staged backup complete or using existing staged backup | Does not recopy when .copy-complete exists unless forced |
| 02-validate-artifacts.sh | OK lines for archives, dumps, key app files, and a warning if websites are empty | Read-only against staged backup; safe to rerun |
| 03-restore-docker-services.sh | LXC identity check, staged copy to LXC, service restore logs | Does not recopy or re-import data when markers exist unless forced |
| 04-validate-services.sh | Network, PostgreSQL, database, and container OK lines | Read-only; safe to rerun |
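
The difference between an OK line and a warning for an empty websites directory can be sketched as below. The function name check_dir, the exact message wording, and the gitea directory used in the usage comment are assumptions for illustration; the real 02-validate-artifacts.sh output format may differ.

```shell
#!/usr/bin/env bash
# Illustrative sketch; check_dir and its message format are hypothetical.

# check_dir LABEL DIR:
#   prints "OK ..."   when DIR exists and has at least one entry,
#   prints "WARN ..." when DIR exists but is empty (not treated as fatal),
#   prints "FAIL ..." and returns non-zero when DIR is missing.
check_dir() {
  local label="$1" dir="$2"
  if [ ! -d "$dir" ]; then
    echo "FAIL $label: $dir is missing"
    return 1
  fi
  if [ -n "$(ls -A "$dir")" ]; then
    echo "OK $label: $dir has content"
  else
    echo "WARN $label: $dir is empty"
  fi
}

# Example use against the workstation stage (read-only):
#   stage=/tmp/proxmox-rebuild-20260614-191305-staged
#   check_dir websites "$stage/websites"
#   check_dir gitea "$stage/gitea"   # hypothetical directory name
```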

Destructive or Potentially Destructive Actions

  • Creating or reconfiguring CT 101 can affect an existing container if the ID is reused incorrectly. The creation script refuses a wrong hostname.
  • The Podman CT creation path intentionally keeps CT 101 unprivileged. The /dev/net/tun passthrough is for rootless Podman networking and is not a reason to switch the application containers to rootful mode.
  • 04-restore-rootless-services.sh writes application archives into /opt/podman/volumes, rewrites recovered env files under /opt/podman/env, and imports PostgreSQL dumps. Normal reruns use markers; FORCE_RESTORE=1 intentionally reapplies data and should be treated as destructive.
  • 03-restore-docker-services.sh starts and recreates Docker containers in the target LXC.
  • PostgreSQL logical restore uses pg_restore --clean --if-exists inside the restored database. It is guarded by .logical-restore-complete; set FORCE_RESTORE=1 only when intentionally reimporting.
  • Application archive restore writes into /opt/docker/volumes/gitea and /opt/docker/volumes/vaultwarden. Existing data causes a skip marker on normal reruns.
  • Never run restore scripts against the old production Docker host unless the answer file explicitly points to the intended test target and the operator accepts the risk.
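
The marker guard around the PostgreSQL logical restore can be sketched as follows. The pg_restore flags and the .logical-restore-complete marker come from this page; the function name restore_database_once, the PG_RESTORE override, and the argument layout are hypothetical, and the real scripts presumably run the command inside the database container rather than directly.

```shell
#!/usr/bin/env bash
# Illustrative sketch; NOT the real restore script.
# PG_RESTORE is a hypothetical override so the command can be wrapped or stubbed.

# restore_database_once STATE_DIR DB DUMP
restore_database_once() {
  local state_dir="$1" db="$2" dump="$3"
  local marker="$state_dir/.logical-restore-complete"

  # A normal rerun skips a database that was already imported.
  if [ -f "$marker" ] && [ "${FORCE_RESTORE:-0}" != "1" ]; then
    echo "skip: $marker exists"
    return 0
  fi

  # From here on the operation is destructive: --clean drops existing
  # objects before recreating them from the dump.
  rm -f "$marker"
  ${PG_RESTORE:-pg_restore} --clean --if-exists --dbname "$db" "$dump" || return 1

  touch "$marker"
  echo "restored: $db"
}
```

Removing the marker before the import means a forced reimport that fails halfway leaves no marker behind, so the next normal rerun retries instead of skipping.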

Never Do This

  • Do not restore from redacted pfSense XML.
  • Do not edit known_hosts blindly if SSH reports a host key mismatch. Confirm the host identity from console access or a trusted administrator, then remove only the obsolete key for that host.
  • Do not use ssh docker as a required path. It is optional and guarded by a timeout because it previously hung.
  • Do not start services while .env files still contain replace-with-restored-secret.
  • Do not start Podman services while /opt/podman/env/*.env files still contain CHANGE_ME_BEFORE_START.
  • Do not use StrictHostKeyChecking=no after the NAS IP or host key changes. Confirm the fingerprint, then accept the new key deliberately.
  • Do not copy /var/lib/docker as the recovery source.
  • Do not delete the NAS backup or /tmp staged copy until restore validation is complete and a second backup exists.
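
The two placeholder rules above lend themselves to a mechanical check before any service is started. This sketch scans every file under a directory for either placeholder string; env_files_ready is a hypothetical helper name, and the real scripts may perform an equivalent check in a different way.

```shell
#!/usr/bin/env bash
# Illustrative sketch; env_files_ready is a hypothetical helper.

# env_files_ready DIR: succeed only if no file under DIR still contains
# one of the placeholder secrets named in this runbook.
env_files_ready() {
  local dir="$1" hits
  hits="$(grep -rlE 'replace-with-restored-secret|CHANGE_ME_BEFORE_START' "$dir" 2>/dev/null)"
  if [ -n "$hits" ]; then
    echo "placeholder secrets remain in:" >&2
    echo "$hits" >&2
    return 1
  fi
}

# Example use:
#   env_files_ready /opt/podman/env || exit 1
#   env_files_ready /opt/docker/compose || exit 1
```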

Resume After Interruption

  1. Rerun 00-preflight-access.sh.
  2. Rerun 01-stage-backup.sh. If .copy-complete exists, it reuses the staged copy.
  3. Rerun 02-validate-artifacts.sh and review the report path it prints.
  4. For service restore, leave FORCE_RESTORE=0 unless a failed import must be intentionally replaced.
  5. Rerun 03-restore-docker-services.sh; it reuses the LXC staged copy and skips marked data/database restores.
  6. Rerun 04-validate-services.sh.

If a marker exists but the preceding output shows an incomplete operation, preserve the failed staged directory for inspection, create a new stage path in the answer file, and rerun from staging. Do not remove evidence during a real incident until an administrator has reviewed it.
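
Choosing a fresh stage path without touching the failed one can be sketched like this. The -retry-N suffix and the function name next_stage_path are assumptions for illustration; any naming scheme works as long as the original directory is left in place for review.

```shell
#!/usr/bin/env bash
# Illustrative sketch; next_stage_path and the -retry-N suffix are hypothetical.

# next_stage_path BASE: print BASE if it is unused, otherwise the first free
# BASE-retry-N. Existing directories are never modified or removed.
next_stage_path() {
  local base="$1" n=1 candidate="$1"
  while [ -e "$candidate" ]; do
    candidate="${base}-retry-${n}"
    n=$((n + 1))
  done
  echo "$candidate"
}

# Example use:
#   next_stage_path /tmp/proxmox-rebuild-20260614-191305-staged
# Put the printed path into the answer file, then rerun from staging.
```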