Managing runbooks

Create structured incident response procedures using Better Stack's Escalation policies. Runbooks help your team follow consistent steps during incidents, reducing response time and ensuring nothing is missed.

Need dedicated runbook management?

While some monitoring tools offer separate runbook features, Better Stack integrates runbooks directly into Escalation policies, keeping your incident response workflow simple and unified.

What are runbooks?

A runbook is a structured set of predefined steps for handling a specific incident or scenario. It typically includes:

  • Tasks to perform: step-by-step procedures.
  • People to notify: automated through Escalation rules.
  • Links to resources: your dashboards, playbooks, and internal documentation.

Creating runbooks

Step 1: Create a runbook policy

Create a dedicated Escalation policy for your new runbook:

  1. Go to Escalation policies → Create escalation policy.
  2. Name your runbook, for example Runbook: High CPU Usage.
  3. Remove the default escalation steps to avoid notifying people in the runbook directly.
  4. Position the runbook step as the last step of your escalation flow.
  5. Add your runbook instructions:
    • Use the Instructions & todo list step.
    • Format your step-by-step guide using markdown.
    • Start a line with - [ ] to add an interactive task.
  6. Save the escalation policy.

Example runbook instructions with TODO list:
## When to Use
Triggered when CPU > 90% for 5+ minutes on a web server.

## Steps
- [ ] **Acknowledge the Alert**
- [ ] **Find the Affected Server**
   - Use logs or the metrics dashboard to identify the instance/container
   - Example: `aws ecs list-tasks --cluster web-prod`
- [ ] **SSH or Access Container**
   - `ssh ec2-user@<instance-ip>`
- [ ] **Diagnose the Issue**
   - Run `top` or `htop` to find the CPU-heavy process
   - Check application logs
- [ ] **Fix or Mitigate**
   - Restart service if needed
   - Scale up if traffic is legitimate
- [ ] **Verify**
   - CPU drops below 70%
   - No 5xx errors
   - App is responsive
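
If you maintain several runbooks, generating the markdown from structured data can help keep the layout consistent. The following is a minimal sketch, not part of Better Stack: the `Step` class and `render_runbook` function are hypothetical helpers that only build the text you would paste into the Instructions & todo list step.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One interactive task plus optional detail lines shown beneath it."""
    title: str
    details: list[str] = field(default_factory=list)


def render_runbook(when_to_use: str, steps: list[Step]) -> str:
    """Render runbook markdown in the same layout as the example above."""
    lines = ["## When to Use", when_to_use, "", "## Steps"]
    for step in steps:
        # A line starting with "- [ ]" becomes an interactive task.
        lines.append(f"- [ ] **{step.title}**")
        # Detail lines are indented so they nest under their task.
        lines.extend(f"   - {detail}" for detail in step.details)
    return "\n".join(lines)


runbook = render_runbook(
    "Triggered when CPU > 90% for 5+ minutes on a web server.",
    [
        Step("Acknowledge the Alert"),
        Step("Diagnose the Issue", ["Run `top` or `htop`", "Check application logs"]),
    ],
)
print(runbook)
```

Paste the printed text into the instructions step and adjust it by hand as needed.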


Step 2: Reference runbooks in active policies

In your active escalation policies, the ones that notify your team:

  1. Go to Escalation policies and edit one of your existing policies.
  2. Add a time-based rule for the runbook.
  3. Set the schedule to all days from 00:00 to 00:00 so that the rule always applies.
  4. Select your runbook policy from the dropdown.
  5. Position the runbook step as the last step of your escalation flow.

Need to use different runbooks based on incident context?

Use a Metadata-based rule instead of the time-based rule, and route to the appropriate runbook based on the incident's metadata values.
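
To plan metadata-based routing before configuring it, it can help to write the intended mapping down explicitly. The sketch below only models the idea and does not call Better Stack: the `service` metadata key, its values, the fallback runbook name, and the `pick_runbook` function are all assumptions made for illustration.

```python
# Hypothetical mapping from a (metadata key, value) pair to a runbook policy name.
RUNBOOK_BY_METADATA = {
    ("service", "web"): "Runbook: High CPU Usage",
    ("service", "database"): "Runbook: Database Outage",
}

# Hypothetical fallback used when no metadata rule matches.
DEFAULT_RUNBOOK = "Runbook: General Triage"


def pick_runbook(metadata: dict[str, str]) -> str:
    """Return the first runbook whose metadata rule matches the incident."""
    for (key, value), runbook in RUNBOOK_BY_METADATA.items():
        if metadata.get(key) == value:
            return runbook
    return DEFAULT_RUNBOOK


print(pick_runbook({"service": "database", "region": "eu-west-1"}))
print(pick_runbook({"service": "billing"}))
```

Each entry in the mapping corresponds to one metadata-based rule you would create in the policy, and the fallback corresponds to whichever step runs when no rule matches.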


Step 3: Test the escalation

Click Report a new incident in your escalation policy to create a test incident.

You should see your instructions in the incident timeline.

Best practices

Naming convention

Use a consistent prefix for easy identification:

  • Runbook: Database Outage
  • Runbook: API Rate Limiting
  • Runbook: SSL Certificate Expiry
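
If you keep a list of policy names somewhere (for example, in a spreadsheet or an export), a small check can flag names that drift from the convention. This is a sketch under that assumption: the `is_runbook_name` helper and the `On-call: Backend` policy name are hypothetical.

```python
RUNBOOK_PREFIX = "Runbook: "


def is_runbook_name(name: str) -> bool:
    """True when the name is the prefix followed by a non-empty title."""
    title = name[len(RUNBOOK_PREFIX):]
    return name.startswith(RUNBOOK_PREFIX) and bool(title.strip())


policy_names = [
    "Runbook: Database Outage",
    "On-call: Backend",  # hypothetical regular escalation policy
    "Runbook: SSL Certificate Expiry",
]

# Keep only the policies that follow the runbook naming convention.
runbooks = [name for name in policy_names if is_runbook_name(name)]
print(runbooks)
```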

Keep instructions actionable

  • Use checkboxes for step-by-step procedures.
  • Include specific commands and code snippets.
  • Add links to relevant dashboards and documentation.
  • Specify expected outcomes for verification steps.
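
Some of these guidelines can be checked mechanically before you paste instructions into a policy. The following sketch assumes your runbook is plain markdown text. The `lint_runbook` function and its warning messages are hypothetical, and it does not check whether verification steps state expected outcomes, which still needs a human review.

```python
import re

# A task line: optional indentation, then "- [ ] ".
CHECKBOX = re.compile(r"^\s*- \[ \] ", re.MULTILINE)
# Inline code such as `top`, used here as a proxy for "specific commands".
INLINE_CODE = re.compile(r"`[^`\n]+`")
# A markdown link with an http(s) target.
LINK = re.compile(r"\[[^\]]+\]\(https?://[^)\s]+\)")


def lint_runbook(markdown: str) -> list[str]:
    """Return a warning for each guideline the text does not appear to follow."""
    warnings = []
    if not CHECKBOX.search(markdown):
        warnings.append("no checkboxes for step-by-step procedures")
    if not INLINE_CODE.search(markdown):
        warnings.append("no commands or code snippets")
    if not LINK.search(markdown):
        warnings.append("no links to dashboards or documentation")
    return warnings


sample = "- [ ] **Diagnose the Issue**\n   - Run `top` to find the CPU-heavy process\n"
print(lint_runbook(sample))
```

The sample has a checkbox and a command but no link, so it produces a single warning about missing links.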

Reusing runbooks

The same runbook can be referenced across multiple escalation policies. For example, your High CPU Usage runbook might be used in policies for:

  • Web server monitoring
  • Background job processing
  • Database server alerts

This approach keeps runbooks centralized while allowing flexible incident response workflows 🚀
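
When one runbook is shared this way, it helps to know which policies reference it before you edit it. The sketch below assumes you track the policy-to-runbook relationship yourself. The mapping is illustrative, and the `API gateway alerts` policy and the `policies_by_runbook` function are hypothetical.

```python
from collections import defaultdict

# Illustrative mapping: escalation policy name -> runbook it references.
POLICY_RUNBOOK = {
    "Web server monitoring": "Runbook: High CPU Usage",
    "Background job processing": "Runbook: High CPU Usage",
    "Database server alerts": "Runbook: High CPU Usage",
    "API gateway alerts": "Runbook: API Rate Limiting",  # hypothetical policy
}


def policies_by_runbook(policy_runbook: dict[str, str]) -> dict[str, list[str]]:
    """Invert the mapping to list every policy that references each runbook."""
    index = defaultdict(list)
    for policy, runbook in policy_runbook.items():
        index[runbook].append(policy)
    return dict(index)


usage = policies_by_runbook(POLICY_RUNBOOK)
for runbook, policies in usage.items():
    print(f"{runbook}: used by {len(policies)} policies")
```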