## Reporting

Security vulnerabilities: email security@pylonsync.com. Do NOT file a public issue. Acknowledgement within 48 hours. Target remediation: 7 days for high severity, 30 days for medium.

Operational incidents (your own deploy): follow your internal runbook. This page covers the generic moves.
## First five minutes
- Contain. If the blast radius is unknown, flip the relevant feature flag or route to maintenance mode. A 503 is better than a breach.
- Preserve. Snapshot the database before any destructive recovery.
- Capture signal. Pull logs, metrics, and the recent audit trail.
- Declare. Spin up an incident channel. One person owns coordination, one person drives fixes. Everyone else stays off.
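The preserve and capture steps can be sketched as shell commands. The DB path, backup location, and systemd unit name here are assumptions; adjust to your deploy:

```shell
STAMP=$(date -u +%Y%m%dT%H%M%SZ)

# Preserve: online snapshot of the primary DB (path is an assumption).
# .backup is safe against a live WAL-mode database.
sqlite3 /var/lib/pylon/data.db ".backup /var/backups/pylon/data-$STAMP.db"

# Capture signal: last hour of service logs (unit name is an assumption).
journalctl -u pylon --since "-1 hour" > "/tmp/incident-$STAMP.log"
```

Snapshot first, investigate second: a `.backup` copy is cheap insurance against any recovery step that turns out to be destructive.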
## Common incidents
### Admin token leaked
Follow Admin token rotation → emergency path. Revoke every session, rotate the token, and audit the window of exposure. If OAuth credentials were stored as admin-level env vars, rotate those too.

### Policy bypass / data exposure
- Reproduce with a throwaway session token against staging.
- Check `audit_log` for rows accessed during the window — this is the GDPR notification trigger in the EU.
- Patch, ship, and verify that the regression test in `crates/policy/src/lib.rs::tests` covers it.
- If user data was exposed: initiate the breach notification workflow.
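The audit check can be sketched as a query. The DB path, column names, and window timestamps below are assumptions; substitute your actual schema and exposure window:

```shell
# Hypothetical path, schema, and window -- adjust to your deploy.
# Lists every row the suspect session touched during the exposure window.
sqlite3 /var/lib/pylon/data.db \
  "SELECT user_id, table_name, row_id, created_at
     FROM audit_log
    WHERE created_at BETWEEN '2024-06-01T00:00:00Z' AND '2024-06-01T02:00:00Z'
    ORDER BY created_at;"
```

Export the result before patching: the list of affected rows is the input to the breach notification decision.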
### Runaway write loop
Signal: write rate pegged above the 70k/sec ceiling, WAL growing unbounded, disk filling.

- Identify the source: check recent deploys and `audit_log` for the offending `user_id` or IP.
- Rate-limit or block at the proxy — don’t try to fix inside the server while it’s under load.
- Once stable, investigate the trigger. Common cause: a client in an exponential-backoff retry loop without jitter.
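Finding the source can be sketched as a group-by over the audit trail. The path and schema are assumptions, and the query assumes `created_at` holds SQLite `datetime()`-style strings:

```shell
# Hypothetical path and schema -- top writers in the last 10 minutes.
# Assumes created_at stores SQLite datetime() strings ('YYYY-MM-DD HH:MM:SS').
sqlite3 /var/lib/pylon/data.db \
  "SELECT user_id, COUNT(*) AS writes
     FROM audit_log
    WHERE created_at >= datetime('now', '-10 minutes')
    GROUP BY user_id
    ORDER BY writes DESC
    LIMIT 10;"
```

One `user_id` dominating the top slot by an order of magnitude is the usual signature of a client stuck in a retry loop.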
### WAL file growth past disk budget
SQLite in WAL mode delays checkpoints, and a long-lived reader can block them indefinitely. If checkpointing isn’t completing: inspect the database (`sqlite3 … ".pragma stats"`) and kill the client holding the lock.
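Once the blocking client is gone, you can force a checkpoint and reclaim the WAL. The DB path is an assumption; the pragma itself is standard SQLite:

```shell
# Force a blocking checkpoint and truncate the WAL (path is an assumption).
sqlite3 /var/lib/pylon/data.db "PRAGMA wal_checkpoint(TRUNCATE);"
# Output is "busy|log|checkpointed". A first column of 0 means the
# checkpoint completed; 1 means a reader is still blocking -- find and
# kill that client, then retry.
```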
### WS fanout storm
Signal: `ws-broadcast-N` threads at 100% CPU, clients reporting missed events, queue-full warnings in the log.

- Check the client count — the per-IP cap is 64; if one IP has 64, a client is looping.
- Check broadcast rate. The event that’s being broadcast might not be expected — e.g. a retry loop creating 1000 inserts/sec.
- Bounce the WS server (`kill -HUP <pid>`) only as a last resort — every client has to reconnect.
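Checking per-IP connection counts can be sketched with `ss`. The port number is an assumption; substitute whatever your WS listener binds:

```shell
# Count established connections per client IP on the WS port
# (8443 is an assumption). With -H, peer address is field 5.
ss -Htn '( sport = :8443 )' \
  | awk '{sub(/:[0-9]+$/, "", $5); print $5}' \
  | sort | uniq -c | sort -rn | head
```

An IP sitting at the 64-connection cap at the top of this list is your looping client; block it at the proxy before touching the server.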
### Cloudflare Workers billing spike

Signal: a CF email saying you’ve used 80% of your monthly budget in a week. See Workers costs for patterns.

### Magic-link emails not arriving
- Check the spam folder.
- Verify the email provider env vars (`PYLON_EMAIL_*`).
- Hit the provider’s dashboard — your account may be paused for a high bounce rate.
- Check `journalctl -u pylon | grep email` for delivery errors.
- For domain authentication, verify SPF/DKIM/DMARC records via mxtoolbox.com.
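The DNS side of that check can also be done from the CLI. `example.com` stands in for your actual sending domain:

```shell
# Spot-check SPF and DMARC records (substitute your sending domain).
dig +short TXT example.com | grep 'v=spf1'
dig +short TXT _dmarc.example.com | grep 'v=DMARC1'
```

DKIM records live under a provider-specific selector (`<selector>._domainkey.<domain>`), so check your provider's dashboard for the selector name before querying it.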
### Sudden surge of 401 / 403 errors
- Check if a deploy went out — a regression in policy expressions can lock everyone out.
- Check session DB integrity — if `sessions.db` is corrupted, every authenticated request fails.
- Restore the previous session DB from backup if needed (sessions are recoverable; users just sign back in).
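The integrity check can be sketched with a standard SQLite pragma. The path is an assumption:

```shell
# Quick health check on the session DB (path is an assumption).
sqlite3 /var/lib/pylon/sessions.db "PRAGMA integrity_check;"
# Prints "ok" when the file is healthy; anything else lists the corruption,
# which is your cue to restore from backup.
```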
## Post-mortem template
- Summary — one paragraph, what happened, duration, blast radius.
- Timeline — UTC timestamps from first signal to all-clear.
- Root cause — what went wrong, not who.
- What worked — detection, containment, comms.
- What didn’t — slow alerts, unclear runbooks, missing graphs.
- Action items — with owners and target dates. Track in the sprint.
## On Pylon Cloud
For Cloud workspaces, Pylon’s on-call team is paged on infrastructure-level incidents (DB outage, region-wide failure, control-plane bugs). For app-level incidents (your code, your data), Cloud’s dashboard surfaces logs and request traces, but you own the response. Status: status.pylonsync.com shows region health and incident history.

## Contacts
- Security: security@pylonsync.com
- Public issues / feature requests: github.com/pylonsync/pylon/issues
- Cloud support: dashboard → Help → Contact
- Your internal oncall: (fill in your rotation)