Issue Brain · for platform engineering

The same outage,
again.

plsfix turns the incidents your team has already resolved — across Slack, PagerDuty, GitHub and Claude — into verified runbooks and executable skills. So the second time it happens, the bot already has the answer, posted in-thread, with a one-click Run playbook.

  • 14,802 events ingested
  • 7 recurring clusters identified
  • 22% of new events auto-resolved
  • Claude, PagerDuty, Slack, GitHub, Jira · week 1 of pilot at acme

§ 01 Why this exists

Most outages aren't new.
They're the same outage, badly remembered.

  1. // problem 01

    Your senior engineer re-derives the same fix at 3 a.m.

    The exact resolution lives in a Slack thread from January. Your on-call is searching scrollback at 3:14 a.m. while production is dropping packets. The thread was archived two sprints ago.

  2. // problem 02

    Resolution knowledge is scattered across six tools and lost in days.

    The acknowledgement is in PagerDuty. The diagnosis is in a Claude thread. The actual fix is a comment on a closed PR. Nobody is going to assemble that timeline by hand.

  3. // problem 03

    The AI agents on the market today hallucinate fixes.

    A bot that generates a runbook from a prompt is a liability, not a teammate. We don't generate from prompts. Every step we propose traces back to a specific incident your team already resolved.

§ 02 How it works

Four steps from a resolved incident
to an executable skill.

  1. Ingest

    Read-only connectors pull resolved work from every place your team actually fixes things. PII is stripped before anything is clustered.

    • Slack
    • PagerDuty
    • GitHub
    • Jira
    • Linear
    • ServiceNow
    • Notion
    • Claude / ChatGPT
  2. Cluster

    We learn the signature of each recurring incident — a regex on the alert payload, a service set, a deploy proximity, a channel + reporter pattern. Same shape, same cluster, every time.

  3. Verify

    plsfix drafts the runbook from your team's own past resolutions — not a prompt. An engineer reviews it once, edits if needed, clicks Verify. Provenance is attached to every step.

  4. Execute

    The runbook compiles to an executable skill. Low-risk steps run automatically. Anything with blast radius pauses for a named approver. It can fire from Slack, the CLI, PagerDuty, Linear, Jira — same source of truth.

    • Slack thread
    • PagerDuty page
    • Linear / Jira
    • /pls fix CLI
    • Web inbox
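The gating in the Execute step can be sketched in a few lines. This is a minimal illustration of the behavior described above, not plsfix's actual engine; the `Step` and `execute` names are assumptions for the sketch. Low-risk steps run in order, and the first approval-gated step pauses the run until its named approver signs off.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    id: str
    kind: str                        # "check" | "fix" | "verify"
    run: Callable[[], bool]          # returns True on success
    approval_required: bool = False  # True for anything with blast radius
    approver: Optional[str] = None   # named approver for gated steps

def execute(steps, approved_by: Optional[str] = None):
    """Run steps in order; stop at the first unapproved gated step."""
    results = []
    for step in steps:
        if step.approval_required and approved_by != step.approver:
            results.append((step.id, "waiting_approval"))
            break                    # everything downstream waits too
        results.append((step.id, "ok" if step.run() else "failed"))
    return results
```

With the rb-fx-01 steps from § 05, `execute(steps)` stops at `force_cache_refresh` in a `waiting_approval` state; calling it again with `approved_by="platform-payments"` lets the full sequence run.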

§ 03 The Slack moment

The bot posts the answer
in the same thread that asked the question.

4 seconds after the alert. 94% confidence. The runbook is already verified — written by Aditi two months ago after INC-4189. Click Run playbook and the safe steps run; the risky one waits on you.

// match signal

  • Signature regex 94%
  • Recent deploy proximity 87%
  • Service overlap 100%
  • Channel + reporter history 71%
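One plausible way to fold those four signals into the single confidence number is a weighted sum. The weights below are illustrative assumptions for this sketch, not plsfix's actual model:

```python
# Illustrative only: these weights are assumptions, not the real model.
SIGNAL_WEIGHTS = {
    "signature_regex": 0.40,
    "deploy_proximity": 0.20,
    "service_overlap": 0.25,
    "channel_reporter": 0.15,
}

def combined_confidence(scores: dict) -> float:
    """Weighted sum of per-signal match scores, each in [0, 1]."""
    return sum(SIGNAL_WEIGHTS[name] * scores.get(name, 0.0)
               for name in SIGNAL_WEIGHTS)

# The match signals from the thread above:
scores = {"signature_regex": 0.94, "deploy_proximity": 0.87,
          "service_overlap": 1.00, "channel_reporter": 0.71}
confidence = combined_confidence(scores)   # ≈ 0.91, above the 0.85 gate
```

Anything above the runbook's `confidence_threshold` (0.85 in the rb-fx-01 skill below) is eligible for an in-thread suggestion.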

§ 04 What we cluster

Real engineering bugs.
From real fintech platform teams.

These aren't process complaints. Each one is a specific code or configuration defect with a recognizable signature. Three examples from the seven we currently cluster in our pilot.

  1. cl-fx-01 verified

    FX rate cache TTL fallback set to 1 hour, not 1 minute

    fx.rate.age_ms > 60000
      AND order.execution.status = filled

    why Cache-miss path returns a constant TTL_FALLBACK_MS = 3_600_000 left over from a load test. During Redis evictions at peak, ~14k symbols are priced against rates >60s old.

    $340k of mispriced trades in 18 minutes before the prior incident was caught manually.

    Sources: PagerDuty, Slack, GitHub, Claude · 7 occurrences · MTTR 22m
  2. cl-idem-02 verified

    Idempotency keys regenerated on retry → duplicate ACH debits

    ach.duplicate_debit
      AND idempotency_key.reused = false

    why Retry middleware mints a fresh X-Idempotency-Key on every 5xx instead of reusing the original. On 504-then-200 responses from the bank, the second retry posts a second debit.

    12 duplicate debits last month — every one a manual reversal and a customer apology.

    Sources: PagerDuty, Slack, GitHub, ServiceNow · 12 occurrences · MTTR 48m
  3. cl-dec-03 draft

    Decimal precision drift between risk-svc and ledger-svc

    pnl.reconcile.diff > 0.01
      AND services.disagree = [risk, ledger]

    why risk-svc deserializes amounts as float64; ledger-svc uses Decimal128. The JSON round-trip loses sub-cent precision; differences pile up across thousands of trades and trip reconciliation by mid-afternoon.

    4 quiet weeks of pennies snowballing into a $9.2k recon delta before anyone caught it.

    Sources: Slack, GitHub, Jira, Claude · 4 occurrences · MTTR 3h 14m
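The cl-dec-03 defect is easy to reproduce in miniature: binary float64 and exact decimal arithmetic disagree on even the simplest amounts, and those sub-cent residues are what eventually trip `pnl.reconcile.diff > 0.01`. A minimal illustration (values invented):

```python
from decimal import Decimal

float_sum = 0.1 + 0.2                           # risk-svc path: float64
exact_sum = Decimal("0.1") + Decimal("0.2")     # ledger-svc path: Decimal

float_disagrees = (float_sum != 0.3)            # True: float64 gives 0.30000000000000004
decimal_agrees = (exact_sum == Decimal("0.3"))  # True: exact decimal arithmetic
```

Scale that residue across thousands of trade legs per day and a sub-cent error per leg becomes a recon delta by mid-afternoon.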

// and 4 more in the pilot · stripe webhook drops post-deploy · postgres pool exhaustion on report-gen · kafka rebalance storm · market-data WS subscription leak

§ 05 The runbook is the skill

Not a wiki page.
An executable specification.

Every verified runbook compiles to a YAML skill: the trigger signature, the steps, the expected outputs, the named approver for anything risky. Drift-checked against the runbook on every save, so the doc and the executable can never diverge.

  • Steps are real shell commands, not pseudo-code.
  • Each fix step has a named approver and a stated blast radius.
  • Every run becomes a new training example for the next match.
# runbooks/rb-fx-01.skill.yaml · dry-run
#  rb-fx-01 · stale fx rates priced to live trades
#  verified by aditi rao · apr 30 · 6 runs · 83% success

id: rb-fx-01
title: "FX rate cache TTL fallback set to 1h, not 1m"
owner: platform-payments
status: verified

trigger:
  signature: |
    fx.rate.age_ms > 60000
    AND order.execution.status = filled
  sources: [pagerduty, slack, claude]
  confidence_threshold: 0.85

steps:
  - kind: check
    id: confirm_cache_age
    cmd: |
      redis-cli -h $RATE_CACHE_HOST GET fx:rate:meta
    expect: "ttl_ms < 60000"

  - kind: fix
    id: force_cache_refresh
    approval: required
    approver: platform-payments
    blast_radius: "~14k symbols, ~2s pricing pause"
    cmd: |
      kubectl exec deploy/pricing-svc -- \
        ./bin/refresh-rates --force --window=2m

  - kind: verify
    id: confirm_fresh_rates
    cmd: |
      curl -s $PRICING_HEALTH/rates/age \
        | jq '.max_age_ms < 60000'
    expect: true

post_run:
  notify: ["#payments-platform", "#platform-oncall"]
  log: s3://plsfix-runs/rb-fx-01/
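The drift check that keeps the doc and the executable from diverging could be as simple as a content hash recorded at compile time. A hypothetical sketch, assuming hash-based comparison; the function names are ours, not plsfix's:

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of the runbook body."""
    return hashlib.sha256(text.strip().encode()).hexdigest()[:12]

def drift_check(runbook_text: str, skill_recorded_hash: str) -> bool:
    """True while the compiled skill still matches its source runbook."""
    return content_hash(runbook_text) == skill_recorded_hash

runbook = "confirm_cache_age -> force_cache_refresh -> confirm_fresh_rates"
recorded = content_hash(runbook)   # stored in the skill at compile time
```

Any save that edits the runbook without recompiling the skill (or vice versa) fails the check, so the wiki-page view and the executable can never silently disagree.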

§ 06 Trust & governance

Read-only by default.
Approval-gated by design.

We were built for fintech. The same posture that lets your security team approve the pilot also lets your auditor sign off on the runs.

i. Data residency & retention
90 days raw, 18 months redacted. Approved by Legal in our pilot. Single-tenant deployment available. No training data leaves your tenant.
ii. PII redaction before clustering
Email, IP, customer-id and a configurable secrets dictionary are stripped before any embedding or LLM call. Original artifacts stay where they are.
iii. Read-only by default
Every connector starts read-only. Execute scopes are granted per runbook, with a named approver, and revoked with one click.
iv. Provenance, end to end
Every step in every runbook traces back to the specific resolved incident it learned from. Audit log per run, signed, exportable to your SIEM.
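The redaction step in (ii) might look like the following sketch. The patterns are illustrative (the `cus_`-style customer id is an assumption modeled on common formats), and the real pipeline also takes the per-tenant secrets dictionary described above:

```python
import re

# Illustrative patterns only; the cus_ id format is an assumption.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "CUSTOMER_ID": re.compile(r"\bcus_[A-Za-z0-9]+\b"),
}

def redact(text: str, secrets: tuple = ()) -> str:
    """Strip PII and tenant secrets before any embedding or LLM call."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    for secret in secrets:           # configurable secrets dictionary
        text = text.replace(secret, "<SECRET>")
    return text
```

For example, `redact("page cus_8f2 at oncall@acme.com from 10.0.4.12")` returns `"page <CUSTOMER_ID> at <EMAIL> from <IPV4>"`; the original artifact in Slack or PagerDuty is untouched.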

§ 07 Where it fires

The same skill, every surface your team already lives in.

  • Slack thread auto-suggest

    When a familiar signature fires, plsfix posts the matching runbook in the same thread that asked the question.

  • /pls fix from your CLI

    pls fix stripe-webhooks — same runbook, same approval gates, from a terminal at 3 a.m.

  • PagerDuty incident page

    The matched runbook appears on the incident card with a one-click run, before the on-call has finished typing.

  • Linear / Jira issue

    Open an issue with a known signature; plsfix attaches the runbook as a comment and offers to run it.

  • Web inbox

    For the platform lead — every event, every cluster, every run, with the post-mortem in one place.

Closed pilot · 4 design partners · 2026 Q2

Stop solving
the same outage
twice.

We're working with a small group of fintech and platform teams for a 6-week pilot. Read-only ingest, one cluster verified together, then turn on auto-suggest. If we don't reduce your recurring incident volume by 30% in week 4, you owe us nothing.

// takes ~30 minutes to set up the read-only ingest · we run a joint cluster review in week 1 · no commitment until week 4