Failure Recovery - Platinur

This walkthrough covers the most common runtime failure: a source schema changed underneath the generated models. A column the warehouse models depend on is dropped from a raw table, the next scheduled refresh fails, and the operator recovers with an assistant repair.

1. The failure appears in Monitoring

The scheduled refresh runs dbt against raw tables that no longer match the models, so the run fails. The failed run appears in the Monitoring runs list with its status, summary, and event log. Use the Failures filter to isolate it, and open the run card to read the dbt events and logs. If a notification channel is configured (Slack webhook, generic webhook, or email), a failed-refresh notification also fires, naming the failing step and environment — for example “dbt run failed in prod” — with the run summary as detail.

2. Schema drift names the cause

Platinur keeps a baseline of the raw source schemas as they existed when Initial Run generated the models. An hourly runtime health check compares the current ClickHouse schemas against that baseline, and Monitoring has a Check button to run the comparison on demand. The Monitoring schema drift strip shows one of:

Drift detected — “Schema drift — N change(s) in raw schemas since the last build”, with an expandable list of changes.
Clean — “Schema drift — raw schemas match the last build”.
No baseline — the baseline is captured by the first Initial Run.
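
The comparison needs a snapshot of the raw schemas in a comparable shape. A minimal sketch, assuming the snapshot is read from ClickHouse's system.columns table: the query text, the snapshot_from_rows helper, and the "schema.table" key format are illustrative assumptions, not Platinur's actual implementation.

```python
from typing import Dict, Iterable, Tuple

# Assumed query: system.columns exposes database, table, name, type, position.
# {database:String} is a ClickHouse query-parameter placeholder; how it is
# bound depends on the client library in use.
COLUMNS_QUERY = """
SELECT database, table, name, type
FROM system.columns
WHERE database = {database:String}
ORDER BY table, position
"""

# Hypothetical snapshot shape: {"schema.table": {"column": "ClickHouse type"}}
Snapshot = Dict[str, Dict[str, str]]


def snapshot_from_rows(rows: Iterable[Tuple[str, str, str, str]]) -> Snapshot:
    """Group (database, table, column, type) rows into a per-table column map."""
    snapshot: Snapshot = {}
    for database, table, name, col_type in rows:
        snapshot.setdefault(f"{database}.{table}", {})[name] = col_type
    return snapshot
```

Storing one such snapshot as the baseline and building another on each check gives the two inputs the comparison works from.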

Each drift event names the exact change, for example:

column_removed — “schema.table lost column X.”
column_type_changed — “schema.table.X changed type from A to B.”
table_removed — “schema.table no longer exists.”
column_added / table_added — new columns or tables since the last build.
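
The event types above can be sketched as a plain dictionary comparison. This is an illustration under stated assumptions, not Platinur's code: the snapshot shape and the diff_schemas name are hypothetical, and the wording for the column_added and table_added details is invented, since the walkthrough only quotes the other three messages.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshot shape: {"schema.table": {"column": "ClickHouse type"}}
Snapshot = Dict[str, Dict[str, str]]


def diff_schemas(baseline: Snapshot, current: Snapshot) -> List[Tuple[str, str]]:
    """Return (event_type, detail) pairs describing how current differs from baseline."""
    events: List[Tuple[str, str]] = []
    for table in sorted(baseline.keys() | current.keys()):
        if table not in current:
            events.append(("table_removed", f"{table} no longer exists."))
            continue
        if table not in baseline:
            # Detail wording assumed; the walkthrough does not quote it.
            events.append(("table_added", f"{table} is new since the last build."))
            continue
        old_cols, new_cols = baseline[table], current[table]
        for col in sorted(old_cols.keys() | new_cols.keys()):
            if col not in new_cols:
                events.append(("column_removed", f"{table} lost column {col}."))
            elif col not in old_cols:
                # Detail wording assumed; the walkthrough does not quote it.
                events.append(("column_added", f"{table} gained column {col}."))
            elif old_cols[col] != new_cols[col]:
                events.append((
                    "column_type_changed",
                    f"{table}.{col} changed type from {old_cols[col]} to {new_cols[col]}.",
                ))
    return events


# Example: one column dropped and one column retyped in the same raw table.
baseline = {"raw.orders": {"id": "UInt64", "coupon_code": "String", "total": "Float64"}}
current = {"raw.orders": {"id": "UInt64", "total": "Decimal(18, 2)"}}
events = diff_schemas(baseline, current)
# events[0] is ("column_removed", "raw.orders lost column coupon_code.")
```

An empty result corresponds to the Clean state. A removed table is reported once rather than as one column_removed event per column, which keeps the change list short.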

A dropped source column shows up as a column_removed event. When drift is detected, a schema drift notification fires once per unique set of changes — “Schema drift detected in <schema>” — listing the first few change details.
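
One way to get "once per unique set of changes" is to fingerprint the change set and remember which fingerprints have already been announced. The sketch below assumes exactly that. DriftNotifier, drift_fingerprint, and the cutoff of three details are hypothetical; the walkthrough only says "the first few".

```python
import hashlib
import json
from typing import Iterable, List, Optional, Set, Tuple

Event = Tuple[str, str]  # (event_type, detail)


def drift_fingerprint(events: Iterable[Event]) -> str:
    """Stable hash of a set of drift events, independent of their order."""
    canonical = json.dumps(sorted(set(events)), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DriftNotifier:
    """Builds a notification text only for change sets not seen before."""

    def __init__(self, max_details: int = 3) -> None:  # cutoff is an assumption
        self._seen: Set[str] = set()
        self._max_details = max_details

    def maybe_notify(self, schema: str, events: Iterable[Event]) -> Optional[str]:
        event_list: List[Event] = list(events)
        if not event_list:
            return None  # no drift, nothing to send
        fingerprint = drift_fingerprint(event_list)
        if fingerprint in self._seen:
            return None  # this exact change set was already announced
        self._seen.add(fingerprint)
        details = [detail for _, detail in event_list[: self._max_details]]
        return f"Schema drift detected in {schema}\n" + "\n".join(details)
```

Under this scheme the hourly check can report the same dropped column repeatedly without re-notifying, while any additional change produces a new fingerprint and therefore a new notification.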

3. Start an assistant repair from the failed run

Repairs run in Staging. Production is release-only, so switch to Staging if needed. Start an assistant repair from the failed dbt run. Platinur queues a “Repair failed dbt run” task and opens it in the Assistant view. Only failed runs can be sent to the assistant, and repair attempts are limited (default 2, configurable as Attempt Limit in Configuration). If the limit is reached or the same validation error repeats, the repair is blocked with “Escalate with logs.”
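
The gating rules in this step can be sketched as a small decision function. RepairTask and next_action are hypothetical names, and treating "the same validation error repeats" as two identical consecutive errors is an assumption about behavior the walkthrough does not spell out.

```python
from dataclasses import dataclass, field
from typing import List

DEFAULT_ATTEMPT_LIMIT = 2  # default from the walkthrough; configurable as Attempt Limit


@dataclass
class RepairTask:
    run_status: str  # status of the dbt run the repair was started from
    attempt_limit: int = DEFAULT_ATTEMPT_LIMIT
    # One entry per failed repair attempt, oldest first.
    validation_errors: List[str] = field(default_factory=list)


def next_action(task: RepairTask) -> str:
    """Decide whether another repair attempt may start."""
    if task.run_status != "failed":
        return "rejected: only failed runs can be sent to the assistant"
    errors = task.validation_errors
    if len(errors) >= task.attempt_limit:
        return "blocked: Escalate with logs"
    # Assumed interpretation: the last two attempts failed with the same error.
    if len(errors) >= 2 and errors[-1] == errors[-2]:
        return "blocked: Escalate with logs"
    return "attempt"
```

With the default limit of 2, the limit check triggers first; the repeated-error check only matters once the Attempt Limit has been raised.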

4. What the repair agent does

The repair agent never edits the live workspace. It works in a sandbox copy of the generated code on the worker, under .platinur/agent-runs/<task id>/.
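
To illustrate what confining work to the sandbox can look like, here is a minimal sketch that builds the per-task directory and rejects any write target that resolves outside it. The helper names and the worker_root parameter are hypothetical; only the .platinur/agent-runs/<task id>/ layout comes from this walkthrough.

```python
from pathlib import Path


def sandbox_dir(worker_root: Path, task_id: str) -> Path:
    """Per-task sandbox location, following the layout named above."""
    return worker_root / ".platinur" / "agent-runs" / task_id


def resolve_in_sandbox(sandbox: Path, relative: str) -> Path:
    """Resolve a write target and refuse anything that escapes the sandbox."""
    root = sandbox.resolve()
    target = (root / relative).resolve()
    # Path.is_relative_to requires Python 3.9 or newer.
    if not target.is_relative_to(root):
        raise PermissionError(f"{relative!r} resolves outside the sandbox")
    return target
```

Resolving before comparing means that ".." segments and symlinks are collapsed first, so a path such as "../../secrets.env" is rejected rather than silently followed.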

It starts by reading the failed run’s logs, then inspects the failed model SQL and YAML, their ref() dependencies, upstream models, and model contracts. It can run bounded read-only ClickHouse queries to confirm which columns actually exist — it does not guess.
It writes complete replacement files, restricted to the generated dbt models and Evidence dashboard paths. It cannot touch secrets, configuration, users, or infrastructure.
It validates the fix by running dbt for the failed selector, tests included by default. If validation fails, it inspects the new logs and retries within the attempt limit.
It creates a proposal only after validation passes. If sandbox files change after the last successful validation, it must validate again before a proposal can be created.
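
The revalidation rule in the last step can be sketched by fingerprinting the sandbox at the moment validation passes and comparing again before a proposal is created. This is one possible mechanism, assumed for illustration; ProposalGate and sandbox_fingerprint are hypothetical names.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sandbox_fingerprint(sandbox: Path) -> str:
    """Hash every file's relative path and bytes so any edit changes the result."""
    digest = hashlib.sha256()
    for path in sorted(p for p in sandbox.rglob("*") if p.is_file()):
        digest.update(path.relative_to(sandbox).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


class ProposalGate:
    """Allows a proposal only for the exact sandbox state that passed validation."""

    def __init__(self, sandbox: Path) -> None:
        self.sandbox = sandbox
        self._validated: Optional[str] = None

    def record_validation(self, passed: bool) -> None:
        # A failed validation clears any earlier success.
        self._validated = sandbox_fingerprint(self.sandbox) if passed else None

    def can_create_proposal(self) -> bool:
        return (
            self._validated is not None
            and self._validated == sandbox_fingerprint(self.sandbox)
        )
```

Any edit made after a passing validation changes the fingerprint, so can_create_proposal returns False until validation is recorded again for the new state.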

If the agent cannot fix the failure safely, it stops without creating a proposal and explains what blocked it. Failed repair attempts stay in the Assistant task log and never appear in Promotions. The Assistant task view shows the step timeline (inspect, edit, validate, proposal), and the task logs include inspected files, changed files, validation logs, and the final diff. When a proposal is created, a “Proposal ready for review” notification fires.

5. Review and apply the repair proposal

The validated repair proposal appears in Promotions like any other proposal:

Review the diff and validation status.
Apply it to Staging.
Run the changed models and check the staging dashboard preview.
Create the promotion PR and merge it. Production picks up the merged change from the production branch on its next refresh.

6. Re-run and confirm

Re-run the refresh manually or wait for the next scheduled run, and confirm it succeeds in Monitoring. The drift baseline reflects the schemas as of the last Initial Run, so Monitoring continues to report the schema change even after the models are repaired. Treat it as a record of how the raw schemas have moved since the models were generated; a new Initial Run captures a fresh baseline.
