Managing runbooks

Create structured incident response procedures using Better Stack's Escalation policies. Runbooks help your team follow consistent steps during incidents, reducing response time and ensuring nothing is missed.

Need dedicated runbook management?

While some monitoring tools offer separate runbook features, Better Stack integrates runbooks directly into Escalation policies, keeping your incident response workflow simple and unified.

What are runbooks?

A runbook is a structured set of predefined steps for handling a specific incident or scenario. Runbooks typically include:

Tasks to perform: step-by-step procedures.
People to notify: this is automated by using Escalation rules.
Links to resources: your dashboards, playbooks, and internal documentation.

Creating runbooks

Step 1: Create a runbook policy

Create a dedicated Escalation policy for your new runbook:

Go to Escalation policies → Create escalation policy.
Name your runbook, for example Runbook: High CPU Usage.
Remove the default escalation steps to avoid notifying people in the runbook directly.
Position the runbook step as the last step of your escalation flow.
Add your runbook instructions:
- Use the Instructions & todo list step.
- Format your step-by-step guide using Markdown.
- Start a line with - [ ] to add an interactive task.
Save the escalation policy.

Example runbook instructions with TODO list

## When to Use
Triggered when CPU > 90% for 5+ minutes on a web server.

## Steps
- [ ] **Acknowledge the Alert**
- [ ] **Find the Affected Server**
   - Use logs or metrics dashboard to identify the instance/container
   - Example: `aws ecs list-tasks --cluster web-prod`
- [ ] **SSH or Access Container**
   - `ssh ec2-user@<instance-ip>`
- [ ] **Diagnose the Issue**
   - Run `top` or `htop` to find CPU-heavy process
   - Check application logs
- [ ] **Fix or Mitigate**
   - Restart service if needed
   - Scale up if traffic is legitimate
- [ ] **Verify**
   - CPU drops below 70%
   - No 5xx errors
   - App is responsive
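
If you keep many runbooks, it can help to generate the markdown from structured data so that every runbook uses the same layout. The following is a minimal sketch, assuming a hypothetical `Step` type and `render_runbook` helper that are not part of Better Stack; it only produces text that you would paste into the Instructions & todo list step.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One runbook task with optional indented details (hypothetical helper type)."""
    title: str
    details: list[str] = field(default_factory=list)


def render_runbook(when_to_use: str, steps: list[Step]) -> str:
    """Render runbook instructions as markdown with `- [ ]` task lines."""
    lines = ["## When to Use", when_to_use, "", "## Steps"]
    for step in steps:
        # Each top-level task becomes an interactive checkbox line.
        lines.append(f"- [ ] **{step.title}**")
        # Details are indented under the task as plain bullets.
        lines.extend(f"   - {detail}" for detail in step.details)
    return "\n".join(lines)


runbook = render_runbook(
    "Triggered when CPU > 90% for 5+ minutes on a web server.",
    [
        Step("Acknowledge the Alert"),
        Step(
            "Diagnose the Issue",
            ["Run `top` or `htop` to find CPU-heavy process", "Check application logs"],
        ),
    ],
)
print(runbook)
```

The printed text follows the same structure as the example above, so it can be pasted into the runbook policy unchanged.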

Step 2: Reference runbooks in active policies

In your active escalation policies (the ones that notify your team):

Go to Escalation policies and edit one of your existing policies.
Add a time-based rule for the runbook.
Set the schedule to all days from 00:00 to 00:00 so that the rule always applies.
Select your runbook policy from the dropdown.
Position the runbook step as the last step of your escalation flow.

Need to use different runbooks based on incident context?

Use a Metadata-based rule instead of the time-based rule, and redirect to the appropriate runbook based on the incident metadata values.
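
Conceptually, metadata-based routing is a lookup from a metadata value to a runbook policy. The sketch below only illustrates that logic; it is not Better Stack's API or configuration format. The `service` metadata key, the mapping, and the fallback choice are all assumptions made for the example, and the actual rule is configured in the escalation policy editor.

```python
# Hypothetical mapping from an incident metadata value to a runbook policy name.
RUNBOOKS_BY_SERVICE = {
    "database": "Runbook: Database Outage",
    "api": "Runbook: API Rate Limiting",
}

# Assumed fallback for incidents whose metadata matches no entry.
DEFAULT_RUNBOOK = "Runbook: High CPU Usage"


def pick_runbook(metadata: dict[str, str]) -> str:
    """Return the runbook policy name for an incident's metadata."""
    service = metadata.get("service", "")
    return RUNBOOKS_BY_SERVICE.get(service, DEFAULT_RUNBOOK)


print(pick_runbook({"service": "database"}))
print(pick_runbook({"region": "eu-west-1"}))
```

Writing the mapping down like this before configuring the rules makes it easier to spot metadata values that have no runbook yet.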

Step 3: Test the escalation

Click Report a new incident in your escalation policy to create a test incident.

You should see your instructions in the incident timeline.

Best practices

Naming convention

Use a consistent prefix for easy identification:

Runbook: Database Outage
Runbook: API Rate Limiting
Runbook: SSL Certificate Expiry
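
A consistent prefix also makes runbook policies easy to pick out programmatically. This is a minimal sketch, assuming you already have a plain list of policy names from somewhere (for example, copied from the Escalation policies page); the helper names are hypothetical.

```python
RUNBOOK_PREFIX = "Runbook: "


def is_runbook_name(name: str) -> bool:
    """True if the name starts with the prefix and has a non-empty title after it."""
    return name.startswith(RUNBOOK_PREFIX) and name[len(RUNBOOK_PREFIX):].strip() != ""


def runbook_policies(policy_names: list[str]) -> list[str]:
    """Return only the runbook policies, sorted alphabetically."""
    return sorted(name for name in policy_names if is_runbook_name(name))


names = [
    "Runbook: SSL Certificate Expiry",
    "Primary on-call",
    "Runbook: Database Outage",
    "Runbook: ",
]
print(runbook_policies(names))
```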

Keep instructions actionable

Use checkboxes for step-by-step procedures.
Include specific commands and code snippets.
Add links to relevant dashboards and documentation.
Specify expected outcomes for verification steps.
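
Some of these practices can be checked automatically before you paste instructions into a policy. The sketch below is a rough, hypothetical linter that looks only for surface signals (checkbox lines, backtick-quoted commands, and links); it cannot judge whether the steps themselves are good or whether expected outcomes are stated.

```python
import re


def lint_runbook(markdown: str) -> list[str]:
    """Return a list of problems found in runbook instructions (empty if none)."""
    problems = []
    # Interactive tasks are lines that start with "- [ ] ".
    if not re.search(r"^\s*- \[ \] ", markdown, flags=re.MULTILINE):
        problems.append("no `- [ ]` task lines")
    # Backticks are used here as a simple signal for commands or code snippets.
    if "`" not in markdown:
        problems.append("no inline commands or code snippets")
    # Any http(s) URL counts as a link to a dashboard or documentation.
    if not re.search(r"https?://", markdown):
        problems.append("no links to dashboards or documentation")
    return problems


print(lint_runbook("- [ ] Run `top` to find the CPU-heavy process\n"))
print(lint_runbook("Restart the service"))
```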

Reusing runbooks

The same runbook can be referenced across multiple escalation policies. For example, your High CPU Usage runbook might be used in policies for:

Web server monitoring
Background job processing
Database server alerts

This approach keeps runbooks centralized while allowing flexible incident response workflows 🚀
