Issue Brain · for platform engineering
The same outage,
again.
plsfix turns the incidents your team has already resolved — across Slack, PagerDuty, GitHub and Claude — into verified runbooks and executable skills. So the second time it happens, the bot already has the answer, posted in-thread, with a one-click Run playbook.
- 14,802 events ingested
- 7 recurring clusters identified
- 22% of new events auto-resolved
§ 01 Why this exists
Most outages aren't new.
They're the same outage, badly remembered.
-
// problem 01
Your senior engineer re-derives the same fix at 3 a.m.
The exact resolution lives in a Slack thread from January. Your on-call is searching scrollback at 3:14 a.m. while production is dropping packets. The thread was archived two sprints ago.
-
// problem 02
Resolution knowledge is scattered across six tools and lost in days.
The acknowledgement is in PagerDuty. The diagnosis is in a Claude thread. The actual fix is a comment on a closed PR. Nobody is going to assemble that timeline by hand.
-
// problem 03
The AI agents on the market today hallucinate fixes.
A bot that generates a runbook from a prompt is a liability, not a teammate. We don't generate from prompts. Every step we propose traces back to a specific incident your team already resolved.
§ 02 How it works
Four steps from a resolved incident
to an executable skill.
-
01
Ingest
Read-only connectors pull resolved work from every place your team actually fixes things. PII is stripped before anything is clustered.
- Slack
- PagerDuty
- GitHub
- Jira
- Linear
- ServiceNow
- Notion
- Claude / ChatGPT
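Concretely, ingest is a handful of read-only scopes. A minimal sketch of what a connector config could look like (source names and fields here are illustrative, not our exact schema):
# minimal connector sketch · fields are illustrative, not the exact schema
connectors:
  - source: slack
    mode: read_only
    channels: ["#payments-platform", "#platform-oncall"]
  - source: pagerduty
    mode: read_only
    services: [pricing-svc]
  - source: github
    mode: read_only
    repos: [payments/pricing-svc]     # repo name is illustrative
# redaction runs before anything is clustered · see § 06 for the full posture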
-
02
Cluster
We learn the signature of each recurring incident — a regex on the alert payload, a service set, a deploy proximity, a channel + reporter pattern. Same shape, same cluster, every time.
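As a rough sketch, the learned signature for the FX cluster described below could be expressed like this (field names are illustrative, not the actual schema):
# illustrative cluster signature · field names are hypothetical
cluster: cl-fx-01
signature:
  alert_regex: "fx\\.rate\\.age_ms.*"   # regex on the alert payload
  services: [pricing-svc]               # service set
  deploy_proximity: 24h                 # fired within a day of a deploy
  slack_pattern:
    channel: "#payments-platform"       # channel + reporter pattern
    reporter: oncall
match_threshold: 0.85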
-
03
Verify
plsfix drafts the runbook from your team's own past resolutions — not a prompt. An engineer reviews it once, edits if needed, clicks Verify. Provenance is attached to every step.
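Provenance on a step might look roughly like this (a sketch; the field names are illustrative):
# illustrative provenance on a single step · fields are hypothetical
- kind: fix
  id: force_cache_refresh
  provenance:
    incident: INC-4189
    sources:
      slack: "resolution thread, #payments-platform"
      github: "comment on the closing PR"
    verified_by: aditi rao
    verified_on: apr 30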
-
04
Execute
The runbook compiles to an executable skill. Low-risk steps run automatically. Anything with blast radius pauses for a named approver. It can fire from Slack, the CLI, PagerDuty, Linear, Jira — same source of truth.
- S Slack thread
- P PagerDuty page
- L Linear / Jira
- $ /pls fix CLI
- W Web inbox
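One way to picture the single source of truth: one skill id, bound to every entry point, with the same approval gate no matter where the run starts. A sketch with hypothetical fields:
# one skill, every surface · fields are hypothetical
skill: rb-fx-01
surfaces:
  slack: thread_suggest
  pagerduty: incident_card
  linear: issue_comment
  jira: issue_comment
  cli: "pls fix fx-rates"           # command name is illustrative
approval:
  gated_steps: [fix]                # check and verify steps run automatically
  approver: platform-payments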
§ 03 The Slack moment
The bot posts the answer
in the same thread that asked the question.
4 seconds after the alert. 94% confidence. The runbook is already verified — written by Aditi two months ago after INC-4189. Click Run playbook and the safe steps run; the risky one waits on you.
// match signal
-
MC
Marcus Chen 10:42 AM
fx rates look stale on a bunch of orders 😬 some prices off by 0.2% — am i seeing things?
fx.rate.age_ms spiking
-
P
PagerDuty APP 10:43 AM
SEV2 Triggered · fx.rate.age_ms_high · service pricing-svc · 4 of 12 pods
-
px
plsfix APP 10:43 AM
I think I've seen this one. Matched cluster FX rate cache TTL fallback · 94% confidence.
verified runbook · Stale FX rates priced to live trades
Cache miss path falls back to a 1-hour TTL constant left over from a 2025 load test. On Redis evictions during peak, prices are computed against rates >60s old. Last seen 11 days ago.
- 1 check · confirm cache age & eviction rate auto
- 2 fix · force refresh and shorten fallback TTL → 60s approval
- 3 verify · max age < 60s for 2 minutes auto
-
MC
Marcus Chen 10:44 AM
running it 🙏
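Behind that reply is a match record: the cluster, the confidence, the verified runbook it points at. Roughly (the shape here is illustrative):
# illustrative match record behind the in-thread suggestion · shape is hypothetical
alert: fx.rate.age_ms_high            # PagerDuty SEV2, pricing-svc, 4 of 12 pods
matched_cluster: cl-fx-01
confidence: 0.94
runbook: rb-fx-01                     # verified by aditi rao after INC-4189
auto_steps: [confirm_cache_age, confirm_fresh_rates]
awaiting_approval: force_cache_refresh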
§ 04 What we cluster
Real engineering bugs.
From real fintech platform teams.
These aren't process complaints. Each one is a specific code or configuration defect with a recognizable signature. Three examples from the seven we currently cluster in our pilot.
-
cl-fx-01 verified
FX rate cache TTL fallback set to 1 hour, not 1 minute
fx.rate.age_ms > 60000 AND order.execution.status = filled
why Cache-miss path returns a constant TTL_FALLBACK_MS = 3_600_000 left over from a load test. During Redis evictions at peak, ~14k symbols are priced against rates >60s old.
$340k of mispriced trades in 18 minutes before the prior incident was caught manually.
-
cl-idem-02 verified
Idempotency keys regenerated on retry → duplicate ACH debits
ach.duplicate_debit AND idempotency_key.reused = false
why Retry middleware mints a fresh X-Idempotency-Key on every 5xx instead of reusing the original. On 504-then-200 responses from the bank, the second retry posts a second debit.
12 duplicate debits last month; every one a manual reversal and a customer apology.
-
cl-dec-03 draft
Decimal precision drift between risk-svc and ledger-svc
pnl.reconcile.diff > 0.01 AND services.disagree = [risk, ledger]
why risk-svc deserializes amounts as float64; ledger-svc uses Decimal128. The JSON round-trip loses sub-cent precision; differences pile up across thousands of trades and trip reconciliation by mid-afternoon.
4 quiet weeks of pennies snowballing into a $9.2k recon delta before anyone caught it.
// and 4 more in the pilot · stripe webhook drops post-deploy · postgres pool exhaustion on report-gen · kafka rebalance storm · market-data WS subscription leak
§ 05 The runbook is the skill
Not a wiki page.
An executable specification.
Every verified runbook compiles to a YAML skill: the trigger signature, the steps, the expected outputs, the named approver for anything risky. Drift-checked against the runbook on every save, so the doc and the executable can never diverge.
- ✓ Steps are real shell commands, not pseudo-code.
- ✓ Each fix step has a named approver and a stated blast radius.
- ✓ Every run becomes a new training example for the next match.
# rb-fx-01 · stale fx rates priced to live trades
# verified by aditi rao · apr 30 · 6 runs · 83% success
id: rb-fx-01
title: "FX rate cache TTL fallback set to 1h, not 1m"
owner: platform-payments
status: verified
trigger:
  signature: |
    fx.rate.age_ms > 60000
    AND order.execution.status = filled
  sources: [pagerduty, slack, claude]
  confidence_threshold: 0.85
steps:
  - kind: check
    id: confirm_cache_age
    cmd: |
      redis-cli -h $RATE_CACHE_HOST GET fx:rate:meta
    expect: "ttl_ms < 60000"
  - kind: fix
    id: force_cache_refresh
    approval: required
    approver: platform-payments
    blast_radius: "~14k symbols, ~2s pricing pause"
    cmd: |
      kubectl exec deploy/pricing-svc -- \
        ./bin/refresh-rates --force --window=2m
  - kind: verify
    id: confirm_fresh_rates
    cmd: |
      curl -s $PRICING_HEALTH/rates/age \
        | jq '.max_age_ms < 60000'
    expect: true
post_run:
  notify: ["#payments-platform", "#platform-oncall"]
  log: s3://plsfix-runs/rb-fx-01/
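The drift check that keeps the doc and the executable in lockstep is easy to picture: recompile the runbook on every save, compare, block the save on mismatch. A sketch, not the actual implementation:
# illustrative drift check on save · hypothetical fields, not the actual implementation
drift_check:
  runbook: rb-fx-01                   # the verified doc
  skill: rb-fx-01.yaml                # the compiled executable
  compare: content_hash               # recompile the runbook, hash both, compare
  on_mismatch: block_save             # the save is rejected until doc and skill agree again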
§ 06 Trust & governance
Read-only by default.
Approval-gated by design.
We were built for fintech. The same posture that lets your security team approve the pilot also lets your auditor sign off on the runs.
- i. Data residency & retention
- 90 days raw, 18 months redacted. Approved by Legal in our pilot. Single-tenant deployment available. No training data leaves your tenant.
- ii. PII redaction before clustering
- Email, IP, customer-id and a configurable secrets dictionary are stripped before any embedding or LLM call. Original artifacts stay where they are; a sketch of the policy follows this list.
- iii. Read-only by default
- Every connector starts read-only. Execute scopes are granted per runbook, with a named approver, and revoked with one click.
- iv. Provenance, end to end
- Every step in every runbook traces back to the specific resolved incident it learned from. Audit log per run, signed, exportable to your SIEM.
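The redaction policy from item ii, sketched as config (field names are illustrative; the secrets dictionary is yours to extend):
# illustrative redaction policy, applied before any embedding or LLM call
redaction:
  stage: pre_clustering
  builtin: [email, ip_address, customer_id]
  secrets_dictionary:
    - "sk_live_*"                     # e.g. payment-provider API keys
    - "*.internal.example.com"        # internal hostnames
  originals: untouched                # source artifacts stay in Slack, PagerDuty, GitHub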
§ 07 Where it fires
The same skill, every surface your team already lives in.
-
S
Slack thread auto-suggest
When a familiar signature fires, plsfix posts the matching runbook in the same thread asking the question.
-
$
/pls fix from your CLI
pls fix stripe-webhooks · same runbook, same approval gates, from a terminal at 3 a.m.
-
P
PagerDuty incident page
The matched runbook appears on the incident card with a one-click run, before the on-call has finished typing.
-
L
Linear / Jira issue
Open an issue with a known signature; plsfix attaches the runbook as a comment and offers to run it.
-
W
Web inbox
For the platform lead — every event, every cluster, every run, with the post-mortem in one place.
Closed pilot · 4 design partners · 2026 Q2
Stop solving
the same outage
twice.
We're working with a small group of fintech and platform teams for a 6-week pilot. Read-only ingest, one cluster verified together, then turn on auto-suggest. If we don't reduce your recurring incident volume by 30% in week 4, you owe us nothing.
// takes ~30 minutes to set up the read-only ingest · we run a joint cluster review in week 1 · no commitment until week 4