Run in production

Deployment

Going from a development standalone instance to a production cluster takes four decisions: where the secrets live, how many CP nodes you run, where logs go, and how you upgrade. This page walks through each one.

Production hardening

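Two config fields have dev-friendly fallbacks that are unsafe in production: the auto-generated KEK lives in the state directory, and the auto-generated JWT secret does not survive a restart. Set both before going live.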

KEK — at-rest encryption key

crypto:
  kek_file: "/etc/tiyi/kek.bin"     # 32-byte at-rest Key Encryption Key

All envelope-encrypted blobs (TLS private keys, ACME account keys, ACME DNS provider credentials, the bundle signing key) decrypt with this KEK. Generate it once and back it up next to the state database:

$ install -m 0600 /dev/null /etc/tiyi/kek.bin
$ head -c 32 /dev/urandom > /etc/tiyi/kek.bin

Lose the KEK and you lose every secret encrypted under it. Rotating the KEK requires re-wrapping every encrypted blob, which is out of scope for v3.0.0-rc.1.

If you don't set crypto.kek_file, Tiyi falls back to <state-db-dir>/kek.bin, which it auto-generates on first boot. This works for single-node dev but is fragile in production, where state directories may be ephemeral.
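If you provision the key from a script rather than the shell, the same steps can be sketched in Python. This is a minimal sketch, not part of Tiyi: write_kek and check_kek are hypothetical helper names, and the only facts taken from this page are the 32-byte size and the 0600 mode.

```python
import os
import secrets
import stat

KEK_SIZE = 32  # bytes, matching the 32-byte KEK described above


def write_kek(path: str) -> None:
    """Create a new 32-byte key file readable only by its owner.

    O_EXCL makes the call fail instead of overwriting an existing key,
    which would orphan every blob encrypted under the old one.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, secrets.token_bytes(KEK_SIZE))
        os.fsync(fd)  # make sure the key is on disk before we report success
    finally:
        os.close(fd)


def check_kek(path: str) -> list[str]:
    """Return a list of problems found with the key file (empty if none)."""
    problems = []
    st = os.stat(path)
    if st.st_size != KEK_SIZE:
        problems.append(f"expected {KEK_SIZE} bytes, found {st.st_size}")
    if stat.S_IMODE(st.st_mode) & 0o077:
        problems.append("file is accessible to group or others")
    return problems
```

Running check_kek against the configured path before each deploy is a cheap way to catch a truncated or world-readable key file.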

JWT signing secret

auth:
  jwt_secret: "<32+ random bytes>"     # or set via TIYI_AUTH_JWT_SECRET

HS256 signing secret for access tokens. When the field is empty, Tiyi generates a 32-byte ephemeral secret on every restart and prints a WARNING line. Every restart then invalidates existing sessions, and replicas can't share refresh tokens. Both are usually problems in production.
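To see why an ephemeral secret invalidates sessions, here is a minimal HS256 sketch using only the Python standard library. It illustrates the signature scheme in general, not Tiyi's token code: the claim names are made up, and a real verifier would also check the alg header and expiry.

```python
import base64
import hashlib
import hmac
import json
import secrets


def b64url(data: bytes) -> str:
    """Base64url without padding, as used by JWT."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: bytes) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode("ascii")
    sig = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"


def verify_hs256(token: str, secret: bytes) -> bool:
    """Check the signature only; claims are not validated in this sketch."""
    try:
        header, payload, sig = token.split(".")
    except ValueError:
        return False
    signing_input = f"{header}.{payload}".encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected), sig)


configured_secret = secrets.token_bytes(32)   # stands in for a fixed jwt_secret
token = sign_hs256({"sub": "admin"}, configured_secret)

regenerated_secret = secrets.token_bytes(32)  # stands in for the post-restart fallback
print(verify_hs256(token, configured_secret))   # True
print(verify_hs256(token, regenerated_secret))  # False
```

The same reasoning applies across replicas: two processes that each generated their own secret cannot verify each other's tokens.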

Bundle signing key

The server's ed25519 bundle signing key is stored in the singleton bundle_signing_key row, envelope-encrypted with the KEK. Agents pin the public key on first contact (TOFU) and refuse to re-pin afterwards. Rotation requires full re-enrollment — issue fresh tokens with tiyi agents issue-token and bring agents back up.
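The pin-once behavior can be pictured with a small trust-on-first-use sketch. This is an illustration of the general TOFU pattern under stated assumptions, not the agent's code: KeyPin and PinMismatch are hypothetical names, and the pin here is held in memory where a real agent would persist it.

```python
import hashlib
import hmac
from typing import Optional


class PinMismatch(Exception):
    """Raised when the server presents a key other than the pinned one."""


class KeyPin:
    """Trust-on-first-use pin for a server public key."""

    def __init__(self) -> None:
        self._fingerprint: Optional[bytes] = None

    def check(self, public_key: bytes) -> None:
        fingerprint = hashlib.sha256(public_key).digest()
        if self._fingerprint is None:
            # First contact: remember the key and trust it from now on.
            self._fingerprint = fingerprint
            return
        if not hmac.compare_digest(self._fingerprint, fingerprint):
            # Never re-pin silently; the only way forward is re-enrollment.
            raise PinMismatch("server key differs from the pinned key")
```

Because a mismatch is a hard failure rather than a prompt to re-pin, a rotated signing key looks the same to an agent as an impostor, which is why rotation implies re-enrollment.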

HA failover

A Tiyi cluster can run with just a primary plus N agents (zero secondaries are allowed). For non-trivial deployments, run one warm secondary so a primary outage does not interrupt mutations for long.

Topology

# Every node below proxies traffic. The L4/L7 LB health-checks /healthz on each.
$ TOKEN_B="$(tiyi agents issue-token --tag secondary | jq -r .token)"
$ TOKEN_C="$(tiyi agents issue-token --tag edge-c | jq -r .token)"

Node A: tiyi server     --addr 0.0.0.0:8080
Node B: tiyi secondary  --primary-api http://A:8080 --enrollment-token $TOKEN_B
Node C: tiyi agent      --api http://A:8080 --enrollment-token $TOKEN_C

Promotion

To fail over from Node A, run tiyi promote on Node B. When the old primary is still reachable, the two-phase handover prevents split-brain:

Contact the old primary via SystemService.Demote — primary flips to read-only, returns its current epoch and last mutation sequence.
Verify the secondary has all data up to the primary's last sequence. If behind, abort.
Bump epoch on the secondary (primary_epoch + 1), flip role to primary, accept mutations.
The old primary's embedded agent reconnects to the new primary as just an agent.
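The four steps above can be sketched as a small in-memory model. Everything here is an assumption made for illustration (the Node fields, the promote function, the PromotionAborted error); it mirrors the order of operations described on this page, not the actual SystemService implementation.

```python
from dataclasses import dataclass


class PromotionAborted(Exception):
    """Raised when the secondary is behind the primary's last sequence."""


@dataclass
class Node:
    name: str
    role: str             # "primary", "secondary", or "agent"
    epoch: int
    last_seq: int         # sequence number of the last mutation applied
    read_only: bool = False

    def demote(self) -> tuple[int, int]:
        """Step 1: stop accepting writes and report (epoch, last_seq)."""
        self.read_only = True
        return self.epoch, self.last_seq


def promote(secondary: Node, old_primary: Node) -> None:
    # Step 1: the old primary flips to read-only and reports its position.
    primary_epoch, primary_seq = old_primary.demote()

    # Step 2: refuse to promote a secondary that is missing mutations.
    # What happens to the demoted primary after an abort is not modeled here.
    if secondary.last_seq < primary_seq:
        raise PromotionAborted(
            f"{secondary.name} is at seq {secondary.last_seq}, "
            f"primary was at {primary_seq}"
        )

    # Step 3: bump the epoch and start accepting mutations.
    secondary.epoch = primary_epoch + 1
    secondary.role = "primary"
    secondary.read_only = False

    # Step 4: the old primary carries on as an agent only.
    old_primary.role = "agent"
```

The ordering is the point: the old primary stops writing before the new epoch exists, so there is no moment at which both nodes accept mutations.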

If the primary is confirmed unreachable, tiyi promote --force skips the demote step. You accept bounded data loss: any mutations made between the last bundle sync and the primary failure are lost. If the old primary comes back, it sees agents reporting a higher epoch and auto-demotes to agent-only.

Epoch fencing prevents split-brain. Every mutation checks the local epoch; a node that sees a higher epoch from any agent's state report auto-demotes. The epoch is monotonic and persisted in cluster_state.
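A minimal sketch of the fencing rule, assuming a node tracks both its own epoch and the highest epoch it has seen in any agent state report. The class and method names are hypothetical, and persistence to cluster_state is left out.

```python
class Fenced(Exception):
    """Raised when a node that is no longer primary is asked to mutate."""


class FencedNode:
    def __init__(self, epoch: int) -> None:
        self.epoch = epoch          # epoch this node was promoted under
        self.highest_seen = epoch   # highest epoch observed in any report
        self.role = "primary"
        self.applied: list[str] = []

    def on_state_report(self, reported_epoch: int) -> None:
        """Handle the epoch carried in an agent's state report."""
        # The tracked value only ever moves forward.
        if reported_epoch > self.highest_seen:
            self.highest_seen = reported_epoch
        # Someone was promoted after us: step down to agent-only.
        if self.highest_seen > self.epoch:
            self.role = "agent"

    def mutate(self, change: str) -> None:
        """Apply a mutation only while this node holds the newest epoch."""
        if self.role != "primary" or self.epoch < self.highest_seen:
            raise Fenced(f"epoch {self.epoch} is stale (seen {self.highest_seen})")
        self.applied.append(change)


node = FencedNode(epoch=7)
node.mutate("add-route")      # accepted: nothing newer has been seen
node.on_state_report(8)       # an agent reports the new primary's epoch
print(node.role)              # agent
```

This is the path a returning old primary takes after a forced promotion: the first state report from any agent is enough to fence it.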

Client-IP trust profile

If Tiyi sits behind a CDN or L4 LB, the peer address that reaches the WAF is the proxy in front of you, not the real client. The trust profile tells Tiyi which proxies are trusted and which header carries the real IP.

Set it once via the UI (Settings → Trust Profile tab) or the CLI:

$ tiyi trust set \
    --trusted-proxies 10.0.0.0/8,172.16.0.0/12 \
    --client-ip-headers X-Forwarded-For

# Auto-fetch CDN ranges:
$ tiyi trust cdn list
$ tiyi trust cdn refresh cloudflare
$ tiyi trust cdn refresh fastly

# Trace any (peer, headers) tuple to the resolved client IP:
$ tiyi trust test --peer 10.0.0.5 --header "X-Forwarded-For: 1.2.3.4, 10.0.0.5"

The Origin Bypass Attempt alert is seeded by default and fires when a request claims a client IP via XFF but the peer isn't a trusted proxy.
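One common way to resolve a client IP from X-Forwarded-For is to walk the header from right to left and stop at the first address that is not a trusted proxy. The sketch below shows that approach and the bypass condition together. Whether Tiyi resolves addresses exactly this way is an assumption; use tiyi trust test for the authoritative answer.

```python
import ipaddress
from typing import Optional


def resolve_client_ip(
    peer: str, xff: Optional[str], trusted_cidrs: list[str]
) -> tuple[str, bool]:
    """Return (client_ip, bypass_attempt) for one request."""
    networks = [ipaddress.ip_network(cidr) for cidr in trusted_cidrs]

    def is_trusted(addr: str) -> bool:
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in networks)

    if not is_trusted(peer):
        # Untrusted peer: ignore the header entirely. A header that is
        # present anyway is the case the Origin Bypass Attempt alert names.
        return peer, bool(xff and xff.strip())

    hops = [h.strip() for h in (xff or "").split(",") if h.strip()]
    for hop in reversed(hops):
        try:
            if not is_trusted(hop):
                return hop, False   # first non-proxy hop from the right
        except ValueError:
            break                   # unparsable entry: stop trusting the header
    return peer, False              # no usable header: fall back to the peer


trusted = ["10.0.0.0/8", "172.16.0.0/12"]
print(resolve_client_ip("10.0.0.5", "1.2.3.4, 10.0.0.5", trusted))
# ('1.2.3.4', False)
print(resolve_client_ip("203.0.113.9", "1.2.3.4", trusted))
# ('203.0.113.9', True)
```

Walking from the right matters because only the rightmost entries were appended by proxies you control; anything further left is client-supplied and can be forged.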

SIEM egress

Tiyi forwards security, access, error, and (optionally) audit events over best-effort TCP, UDP, or unixgram in RFC 5424, CEF, or LEEF. Configure once via Settings → SIEM tab or the CLI:

$ tiyi system settings update --key siem.enabled --value true
$ tiyi system settings update --key siem.address --value "tcp://siem.internal:514"
$ tiyi system settings update --key siem.format --value rfc5424
$ tiyi system settings update --key siem.filter.include_audit --value true

SIEM forwarding is intentionally best-effort. Receiver health, durable delivery, and replay live in your SIEM pipeline, not in Tiyi. Each dispatcher caches a single net.Conn and reconnects on error.
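For building a test receiver or checking what arrives on the wire, here is a sketch of an RFC 5424 line plus a best-effort sender with one cached connection. It is written in Python for illustration and is not Tiyi's dispatcher. The newline framing, the 2-second timeout, and the choice of field values are assumptions.

```python
import socket
from datetime import datetime, timezone


def format_rfc5424(facility: int, severity: int, hostname: str,
                   app_name: str, msg: str, timestamp: datetime,
                   procid: str = "-", msgid: str = "-") -> str:
    """Build one RFC 5424 line with no structured data ("-")."""
    pri = facility * 8 + severity
    ts = timestamp.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
    return f"<{pri}>1 {ts} {hostname} {app_name} {procid} {msgid} - {msg}"


class BestEffortSender:
    """Keeps one cached TCP connection; on any error, drop and reconnect later."""

    def __init__(self, host: str, port: int) -> None:
        self._addr = (host, port)
        self._sock = None
        self.dropped = 0

    def send(self, line: str) -> bool:
        try:
            if self._sock is None:
                self._sock = socket.create_connection(self._addr, timeout=2)
            self._sock.sendall(line.encode("utf-8") + b"\n")
            return True
        except OSError:
            # Best-effort: the event is lost, the connection is discarded,
            # and the next send attempts a fresh connect.
            if self._sock is not None:
                self._sock.close()
                self._sock = None
            self.dropped += 1
            return False


when = datetime(2024, 1, 2, 3, 4, 5, 123456, tzinfo=timezone.utc)
print(format_rfc5424(4, 4, "edge-c", "tiyi", "blocked request", when))
# <36>1 2024-01-02T03:04:05.123Z edge-c tiyi - - - blocked request
```

Note that nothing is queued or retried on failure, which is what best-effort means here: if the receiver is down, events sent in that window are gone.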

Observability

Prometheus exporter at /metrics on the local admin socket. Scrape it from a sidecar; metrics are derived from the same telemetry pipeline that drives the dashboard.
/healthz returns {epoch, role, proxy_healthy}. Use this for L4/L7 LB health checks.
/debug/logsink/stats on the local admin socket exposes per-kind attempted/written/dropped_full/panicked/queue_depth/last_error counters. The panicked counter is the cross-layer canary: it increments whenever a deferred recover handler at a layer boundary catches a panic.
Telemetry explorer UI at /telemetry/explorer for ad-hoc Top-K and sample browsing without leaving the dashboard.
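If your load balancer or monitoring needs more than an HTTP status from /healthz, the response body can be interpreted along these lines. The field names come from this page; their types and the literal role value "primary" are assumptions, as are both function names.

```python
import json
from typing import Optional


def is_routable(body: str) -> bool:
    """Decide whether an LB should keep sending traffic to this node.

    Role is deliberately ignored: every node proxies traffic, so only
    proxy_healthy gates routing in this sketch.
    """
    try:
        health = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(health, dict) and health.get("proxy_healthy") is True


def current_primary(reports: dict[str, dict]) -> Optional[str]:
    """Pick the node claiming the primary role under the highest epoch."""
    claims = [
        (report.get("epoch", 0), name)
        for name, report in reports.items()
        if report.get("role") == "primary"
    ]
    return max(claims)[1] if claims else None


reports = {
    "A": {"epoch": 7, "role": "primary", "proxy_healthy": True},
    "B": {"epoch": 8, "role": "primary", "proxy_healthy": True},
    "C": {"epoch": 8, "role": "agent", "proxy_healthy": False},
}
print(current_primary(reports))                 # B
print(is_routable(json.dumps(reports["C"])))    # False
```

Comparing epochs rather than trusting the role field alone avoids being misled by an old primary that has not yet noticed it was fenced.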

Rolling upgrades

Once a binary release is imported, tiyi release apply fans an APPLY_BINARY command out to every agent whose OS/arch matches. Each agent downloads the replacement, verifies the SHA-256, and exits so its supervisor restarts it into the new binary.

$ tiyi release import --tarball ./tiyi-3.0.1.tar.gz
$ tiyi release list
$ tiyi release apply <release-id>             # all matching agents
$ tiyi release apply <release-id> --agent-id A  # wave deployments
$ tiyi release runs                          # active rollouts
$ tiyi release rollback                      # revert to previous binary
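The SHA-256 check each agent performs before restarting can be reproduced by hand when you want to confirm a tarball or binary before importing it. A minimal sketch using the Python standard library; the function names are hypothetical and the expected digest is whatever your release process publishes.

```python
import hashlib
import hmac


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large binaries don't load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with the published one."""
    return hmac.compare_digest(sha256_file(path), expected_hex.strip().lower())
```

Verifying before the process exits is what makes the restart safe: a corrupted download is rejected while the old binary is still running.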

Backup

Three things to back up:

state.db — the SQLite control-plane database. WAL-mode; sqlite3 state.db ".backup '/path/to/backup.db'" works while Tiyi is running.
kek.bin — the at-rest encryption key. Loss is unrecoverable. Back it up to the same vault you use for other long-lived secrets.
jwt_secret — already in your config repo, but make sure that repo is backed up too.

Litestream-style WAL streaming for offsite backup is supported and recommended. logs/ partition files are not part of the backup set — they're operational data with their own retention loop.
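If your backup job is a script rather than a shell one-liner, Python's sqlite3 module exposes the same online backup mechanism that the sqlite3 CLI's .backup command uses. A minimal sketch: the function names and paths are placeholders, and the integrity check afterwards is a suggestion, not something Tiyi requires.

```python
import sqlite3


def online_backup(src_path: str, dst_path: str) -> None:
    """Copy a live SQLite database using the online backup API."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        src.backup(dst)   # consistent snapshot, safe while writers are active
    finally:
        dst.close()
        src.close()


def integrity_ok(path: str) -> bool:
    """Run SQLite's own consistency check on a backup file."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
    finally:
        conn.close()
```

Checking the copy rather than the source means a bad backup is caught at backup time, not during a restore.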

Going-live checklist

Set crypto.kek_file to a backed-up file
Set auth.jwt_secret to a 32+ byte random value
Replace the bootstrap admin credentials with a real account
Configure the trust profile if Tiyi sits behind a CDN/LB
Pick a SIEM destination and verify a test event lands
Wire /healthz into the L4/L7 LB
Scrape /metrics from your Prometheus instance
Add state.db + kek.bin to your backup pipeline
Enroll at least one secondary node so that losing the primary isn't terminal
Verify the audit chain: tiyi audit verify-chain exits 0