## Reporting

Security vulnerabilities: email security@pylonsync.com. Do NOT file a public issue. Acknowledgement within 48 hours. Target remediation: 7 days for high severity, 30 days for medium.

Operational incidents (your own deploy): follow your internal runbook. This page covers the generic moves.
## First five minutes
- Contain. If the blast radius is unknown, flip the relevant feature flag or route to maintenance mode. A 503 is better than a breach.
- Preserve. Snapshot the database before any destructive recovery.
- Capture signal. Pull logs, metrics, and the recent audit trail.
- Declare. Spin up an incident channel. One person owns coordination, one person drives fixes. Everyone else stays off.
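The preserve and capture steps can be sketched as shell commands. The DB path, backup location, and systemd unit name here are assumptions; adjust to your deploy:

```shell
STAMP=$(date -u +%Y%m%dT%H%M%SZ)

# Preserve: online snapshot of the primary DB (path is an assumption).
# .backup is safe against a live WAL-mode database.
sqlite3 /var/lib/pylon/data.db ".backup /var/backups/pylon/data-$STAMP.db"

# Capture signal: last hour of service logs (unit name is an assumption).
journalctl -u pylon --since "-1 hour" > "/tmp/incident-$STAMP.log"
```

Snapshot first, investigate second: a `.backup` copy is cheap insurance against any recovery step that turns out to be destructive.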
## Common incidents
### Admin token leaked
Follow Admin token rotation → emergency path. Revoke every session, rotate the token, and audit the window of exposure. If OAuth credentials were stored as admin-level env vars, rotate those too.

### Policy bypass / data exposure
- Reproduce with a throwaway session token against staging.
- Check `audit_log` for rows accessed during the window — this is the GDPR notification trigger in the EU.
- Patch, ship, and verify that the regression test in `crates/policy/src/lib.rs::tests` covers it.
- If user data was exposed: initiate the breach notification workflow.
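The audit check can be sketched as a query. The DB path, column names, and window timestamps below are assumptions; substitute your actual schema and exposure window:

```shell
# Hypothetical path, schema, and window -- adjust to your deploy.
# Lists every row the suspect session touched during the exposure window.
sqlite3 /var/lib/pylon/data.db \
  "SELECT user_id, table_name, row_id, created_at
     FROM audit_log
    WHERE created_at BETWEEN '2024-06-01T00:00:00Z' AND '2024-06-01T02:00:00Z'
    ORDER BY created_at;"
```

Export the result before patching: the list of affected rows is the input to the breach notification decision.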
### Runaway write loop
Signal: write rate pegged above the 70k/sec ceiling, WAL growing unbounded, disk filling.

- Identify the source: check recent deploys and `audit_log` for the offending `user_id` or IP.
- Rate-limit or block at the proxy — don’t try to fix inside the server while it’s under load.
- Once stable, investigate the trigger. Common cause: a client in an exponential-backoff retry loop without jitter.
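Finding the source can be sketched as a group-by over the audit trail. The path and schema are assumptions, and the query assumes `created_at` holds SQLite `datetime()`-style strings:

```shell
# Hypothetical path and schema -- top writers in the last 10 minutes.
# Assumes created_at stores SQLite datetime() strings ('YYYY-MM-DD HH:MM:SS').
sqlite3 /var/lib/pylon/data.db \
  "SELECT user_id, COUNT(*) AS writes
     FROM audit_log
    WHERE created_at >= datetime('now', '-10 minutes')
    GROUP BY user_id
    ORDER BY writes DESC
    LIMIT 10;"
```

One `user_id` dominating the top slot by an order of magnitude is the usual signature of a client stuck in a retry loop.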
### WAL file growth past disk budget
SQLite in WAL mode delays checkpoints, and a long-lived reader can block them indefinitely. If checkpointing isn’t completing: inspect the database (`sqlite3 … ".pragma stats"`) and kill the client holding the lock.
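Once the blocking client is gone, you can force a checkpoint and reclaim the WAL. The DB path is an assumption; the pragma itself is standard SQLite:

```shell
# Force a blocking checkpoint and truncate the WAL (path is an assumption).
sqlite3 /var/lib/pylon/data.db "PRAGMA wal_checkpoint(TRUNCATE);"
# Output is "busy|log|checkpointed". A first column of 0 means the
# checkpoint completed; 1 means a reader is still blocking -- find and
# kill that client, then retry.
```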
### WS fanout storm
Signal: `ws-broadcast-N` threads at 100% CPU, clients reporting missed events, queue-full warnings in the log.

- Check the client count — the per-IP cap is 64; if one IP has 64, a client is looping.
- Check broadcast rate. The event that’s being broadcast might not be expected — e.g. a retry loop creating 1000 inserts/sec.
- Bounce the WS server (`kill -HUP <pid>`) only as a last resort — every client has to reconnect.
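Checking per-IP connection counts can be sketched with `ss`. The port number is an assumption; substitute whatever your WS listener binds:

```shell
# Count established connections per client IP on the WS port
# (8443 is an assumption). With -H, peer address is field 5.
ss -Htn '( sport = :8443 )' \
  | awk '{sub(/:[0-9]+$/, "", $5); print $5}' \
  | sort | uniq -c | sort -rn | head
```

An IP sitting at the 64-connection cap at the top of this list is your looping client; block it at the proxy before touching the server.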
### Cloudflare Workers billing spike

Signal: a CF email saying you’ve used 80% of your monthly budget in a week. See Workers costs for patterns.

### Magic-link emails not arriving
- Check the spam folder.
- Verify the email provider env vars (`PYLON_EMAIL_*`).
- Hit the provider’s dashboard — your account may be paused for a high bounce rate.
- Check `journalctl -u pylon | grep email` for delivery errors.
- For domain authentication, verify SPF/DKIM/DMARC records via mxtoolbox.com.
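The DNS side of that check can also be done from the CLI. `example.com` stands in for your actual sending domain:

```shell
# Spot-check SPF and DMARC records (substitute your sending domain).
dig +short TXT example.com | grep 'v=spf1'
dig +short TXT _dmarc.example.com | grep 'v=DMARC1'
```

DKIM records live under a provider-specific selector (`<selector>._domainkey.<domain>`), so check your provider's dashboard for the selector name before querying it.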
### Sudden surge of 401 / 403 errors
- Check if a deploy went out — a regression in policy expressions can lock everyone out.
- Check session DB integrity — if `sessions.db` is corrupted, every authenticated request fails.
- Restore the previous session DB from backup if needed (sessions are recoverable; users just sign back in).
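The integrity check can be sketched with a standard SQLite pragma. The path is an assumption:

```shell
# Quick health check on the session DB (path is an assumption).
sqlite3 /var/lib/pylon/sessions.db "PRAGMA integrity_check;"
# Prints "ok" when the file is healthy; anything else lists the corruption,
# which is your cue to restore from backup.
```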
## Post-mortem template
- Summary — one paragraph, what happened, duration, blast radius.
- Timeline — UTC timestamps from first signal to all-clear.
- Root cause — what went wrong, not who.
- What worked — detection, containment, comms.
- What didn’t — slow alerts, unclear runbooks, missing graphs.
- Action items — with owners and target dates. Track in the sprint.
## On Pylon Cloud
For Cloud workspaces, Pylon’s on-call team is paged on infrastructure-level incidents (DB outage, region-wide failure, control-plane bugs). For app-level incidents (your code, your data), Cloud’s dashboard surfaces logs and request traces, but you own the response. Status: status.pylonsync.com shows region health and incident history.

## Contacts
- Security: security@pylonsync.com
- Public issues / feature requests: github.com/pylonsync/pylon/issues
- Cloud support: dashboard → Help → Contact
- Your internal oncall: (fill in your rotation)