Issue Brain · for platform engineering
The same outage,
again.
plsfix turns the incidents your team has already resolved — across Slack, PagerDuty, GitHub and Claude — into verified runbooks and executable skills. So the second time it happens, the bot already has the answer, posted in-thread, with a one-click Run playbook.
- 14,802 events ingested
- 7 recurring clusters identified
- 22% of new events auto-resolved
§ 01 Why this exists
Most outages aren't new.
They're the same outage, badly remembered.
-
// problem 01
Your senior engineer re-derives the same fix at 3 a.m.
The exact resolution lives in a Slack thread from January. Your on-call is searching scrollback at 3:14 a.m. while production is dropping packets. The thread was archived two sprints ago.
-
// problem 02
Resolution knowledge is scattered across six tools and lost in days.
The acknowledgement is in PagerDuty. The diagnosis is in a Claude thread. The actual fix is a comment on a closed PR. Nobody is going to assemble that timeline by hand.
-
// problem 03
The AI agents on the market today hallucinate fixes.
A bot that generates a runbook from a prompt is a liability, not a teammate. We don't generate from prompts. Every step we propose traces back to a specific incident your team already resolved.
§ 02 How it works
Four steps from a resolved incident
to an executable skill.
-
01
Ingest
Read-only connectors pull resolved work from every place your team actually fixes things. PII is stripped before anything is clustered.
- Slack
- PagerDuty
- GitHub
- Jira
- Linear
- ServiceNow
- Notion
- Claude / ChatGPT
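Concretely, ingest is a handful of read-only scopes. A minimal sketch of what a connector config could look like (source names and fields here are illustrative, not our exact schema):
# minimal connector sketch · fields are illustrative, not the exact schema
connectors:
  - source: slack
    mode: read_only
    channels: ["#payments-platform", "#platform-oncall"]
  - source: pagerduty
    mode: read_only
    services: [pricing-svc]
  - source: github
    mode: read_only
    repos: [payments/pricing-svc]     # repo name is illustrative
# redaction runs before anything is clustered · see § 06 for the full posture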
-
02
Cluster
We learn the signature of each recurring incident — a regex on the alert payload, a service set, a deploy proximity, a channel + reporter pattern. Same shape, same cluster, every time.
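As a rough sketch, the learned signature for the FX cluster described below could be expressed like this (field names are illustrative, not the actual schema):
# illustrative cluster signature · field names are hypothetical
cluster: cl-fx-01
signature:
  alert_regex: "fx\\.rate\\.age_ms.*"   # regex on the alert payload
  services: [pricing-svc]               # service set
  deploy_proximity: 24h                 # fired within a day of a deploy
  slack_pattern:
    channel: "#payments-platform"       # channel + reporter pattern
    reporter: oncall
match_threshold: 0.85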
-
03
Verify
plsfix drafts the runbook from your team's own past resolutions — not a prompt. An engineer reviews it once, edits if needed, clicks Verify. Provenance is attached to every step.
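Provenance on a step might look roughly like this (a sketch; the field names are illustrative):
# illustrative provenance on a single step · fields are hypothetical
- kind: fix
  id: force_cache_refresh
  provenance:
    incident: INC-4189
    sources:
      slack: "resolution thread, #payments-platform"
      github: "comment on the closing PR"
    verified_by: aditi rao
    verified_on: apr 30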
-
04
Execute
The runbook compiles to an executable skill. Low-risk steps run automatically. Anything with blast radius pauses for a named approver. It can fire from Slack, the CLI, PagerDuty, Linear, Jira — same source of truth.
- S Slack thread
- P PagerDuty page
- L Linear / Jira
- $ /pls fix CLI
- W Web inbox
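One way to picture the single source of truth: one skill id, bound to every entry point, with the same approval gate no matter where the run starts. A sketch with hypothetical fields:
# one skill, every surface · fields are hypothetical
skill: rb-fx-01
surfaces:
  slack: thread_suggest
  pagerduty: incident_card
  linear: issue_comment
  jira: issue_comment
  cli: "pls fix fx-rates"           # command name is illustrative
approval:
  gated_steps: [fix]                # check and verify steps run automatically
  approver: platform-payments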
§ 03 The Slack moment
The bot posts the answer
in the same thread that asked the question.
4 seconds after the alert. 94% confidence. The runbook is already verified — written by Aditi two months ago after INC-4189. Click Run playbook and the safe steps run; the risky one waits on you.
// match signal
-
MC
Marcus Chen 10:42 AM
fx rates look stale on a bunch of orders 😬 some prices off by 0.2% — am i seeing things?
fx.rate.age_ms spiking
-
P
PagerDuty APP 10:43 AM
SEV2 Triggered · fx.rate.age_ms_high · service pricing-svc · 4 of 12 pods
-
px
plsfix APP 10:43 AM
I think I've seen this one. Matched cluster FX rate cache TTL fallback · 94% confidence.
verified runbook · Stale FX rates priced to live trades
Cache miss path falls back to a 1-hour TTL constant left over from a 2025 load test. On Redis evictions during peak, prices are computed against rates >60s old. Last seen 11 days ago.
- 1 check · confirm cache age & eviction rate auto
- 2 fix · force refresh and shorten fallback TTL → 60s approval
- 3 verify · max age < 60s for 2 minutes auto
-
MC
Marcus Chen 10:44 AM
running it 🙏
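Behind that reply is a match record: the cluster, the confidence, the verified runbook it points at. Roughly (the shape here is illustrative):
# illustrative match record behind the in-thread suggestion · shape is hypothetical
alert: fx.rate.age_ms_high            # PagerDuty SEV2, pricing-svc, 4 of 12 pods
matched_cluster: cl-fx-01
confidence: 0.94
runbook: rb-fx-01                     # verified by aditi rao after INC-4189
auto_steps: [confirm_cache_age, confirm_fresh_rates]
awaiting_approval: force_cache_refresh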
§ 04 What we cluster
Real engineering bugs.
From real fintech platform teams.
These aren't process complaints. Each one is a specific code or configuration defect with a recognizable signature. Three examples from the seven we currently cluster in our pilot.
-
cl-fx-01 verified
FX rate cache TTL fallback set to 1 hour, not 1 minute
fx.rate.age_ms > 60000 AND order.execution.status = filled
why Cache-miss path returns a constant TTL_FALLBACK_MS = 3_600_000 left over from a load test. During Redis evictions at peak, ~14k symbols are priced against rates >60s old.
$340k of mispriced trades in 18 minutes before the prior incident was caught manually.
-
cl-idem-02 verified
Idempotency keys regenerated on retry → duplicate ACH debits
ach.duplicate_debit AND idempotency_key.reused = false
why Retry middleware mints a fresh X-Idempotency-Key on every 5xx instead of reusing the original. On 504-then-200 responses from the bank, the second retry posts a second debit.
12 duplicate debits last month; every one a manual reversal and a customer apology.
-
cl-dec-03 draft
Decimal precision drift between risk-svc and ledger-svc
pnl.reconcile.diff > 0.01 AND services.disagree = [risk, ledger]
why risk-svc deserializes amounts as float64; ledger-svc uses Decimal128. The JSON round-trip loses sub-cent precision; differences pile up across thousands of trades and trip reconciliation by mid-afternoon.
4 quiet weeks of pennies snowballing into a $9.2k recon delta before anyone caught it.
// and 4 more in the pilot · stripe webhook drops post-deploy · postgres pool exhaustion on report-gen · kafka rebalance storm · market-data WS subscription leak
§ 05 The runbook is the skill
Not a wiki page.
An executable specification.
Every verified runbook compiles to a YAML skill: the trigger signature, the steps, the expected outputs, the named approver for anything risky. Drift-checked against the runbook on every save, so the doc and the executable can never diverge.
- ✓ Steps are real shell commands, not pseudo-code.
- ✓ Each fix step has a named approver and a stated blast radius.
- ✓ Every run becomes a new training example for the next match.
# rb-fx-01 · stale fx rates priced to live trades
# verified by aditi rao · apr 30 · 6 runs · 83% success
id: rb-fx-01
title: "FX rate cache TTL fallback set to 1h, not 1m"
owner: platform-payments
status: verified
trigger:
  signature: |
    fx.rate.age_ms > 60000
    AND order.execution.status = filled
  sources: [pagerduty, slack, claude]
  confidence_threshold: 0.85
steps:
  - kind: check
    id: confirm_cache_age
    cmd: |
      redis-cli -h $RATE_CACHE_HOST GET fx:rate:meta
    expect: "ttl_ms < 60000"
  - kind: fix
    id: force_cache_refresh
    approval: required
    approver: platform-payments
    blast_radius: "~14k symbols, ~2s pricing pause"
    cmd: |
      kubectl exec deploy/pricing-svc -- \
        ./bin/refresh-rates --force --window=2m
  - kind: verify
    id: confirm_fresh_rates
    cmd: |
      curl -s $PRICING_HEALTH/rates/age \
        | jq '.max_age_ms < 60000'
    expect: true
post_run:
  notify: ["#payments-platform", "#platform-oncall"]
  log: s3://plsfix-runs/rb-fx-01/
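The drift check that keeps the doc and the executable in lockstep is easy to picture: recompile the runbook on every save, compare, block the save on mismatch. A sketch, not the actual implementation:
# illustrative drift check on save · hypothetical fields, not the actual implementation
drift_check:
  runbook: rb-fx-01                   # the verified doc
  skill: rb-fx-01.yaml                # the compiled executable
  compare: content_hash               # recompile the runbook, hash both, compare
  on_mismatch: block_save             # the save is rejected until doc and skill agree again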
§ 06 Trust & governance
Read-only by default.
Approval-gated by design.
We were built for fintech. The same posture that lets your security team approve the pilot also lets your auditor sign off on the runs.
- i. Data residency & retention
- 90 days raw, 18 months redacted. Approved by Legal in our pilot. Single-tenant deployment available. No training data leaves your tenant.
- ii. PII redaction before clustering
- Email, IP, customer-id and a configurable secrets dictionary are stripped before any embedding or LLM call. Original artifacts stay where they are; a sketch of the policy follows this list.
- iii. Read-only by default
- Every connector starts read-only. Execute scopes are granted per runbook, with a named approver, and revoked with one click.
- iv. Provenance, end to end
- Every step in every runbook traces back to the specific resolved incident it learned from. Audit log per run, signed, exportable to your SIEM.
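The redaction policy from item ii, sketched as config (field names are illustrative; the secrets dictionary is yours to extend):
# illustrative redaction policy, applied before any embedding or LLM call
redaction:
  stage: pre_clustering
  builtin: [email, ip_address, customer_id]
  secrets_dictionary:
    - "sk_live_*"                     # e.g. payment-provider API keys
    - "*.internal.example.com"        # internal hostnames
  originals: untouched                # source artifacts stay in Slack, PagerDuty, GitHub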
§ 07 Where it fires
The same skill, every surface your team already lives in.
-
S
Slack thread auto-suggest
When a familiar signature fires, plsfix posts the matching runbook in the same thread asking the question.
-
$
/pls fix from your CLI
pls fix stripe-webhooks · same runbook, same approval gates, from a terminal at 3 a.m.
-
P
PagerDuty incident page
The matched runbook appears on the incident card with a one-click run, before the on-call has finished typing.
-
L
Linear / Jira issue
Open an issue with a known signature; plsfix attaches the runbook as a comment and offers to run it.
-
W
Web inbox
For the platform lead — every event, every cluster, every run, with the post-mortem in one place.
Closed pilot · 4 design partners · 2026 Q2
Stop solving
the same outage
twice.
We're working with a small group of fintech and platform teams for a 6-week pilot. Read-only ingest, one cluster verified together, then turn on auto-suggest. If we don't reduce your recurring incident volume by 30% in week 4, you owe us nothing.
// takes ~30 minutes to set up the read-only ingest · we run a joint cluster review in week 1 · no commitment until week 4