Time to Restore

Summary

Time to Restore - When a failure occurs, how long does it take to restore service?

Our Calculation Methodology

Time to Restore is calculated directly from the team's incident management process:

  Time to Restore = time of incident resolution - time of incident start
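
For illustration, a minimal sketch of this calculation in Python, assuming incident records expose created and resolved timestamps (the function and field names here are illustrative, not WayFinder's API):

    from datetime import datetime, timedelta

    def time_to_restore(incident_start: datetime, incident_resolved: datetime) -> timedelta:
        """Time to Restore = time of incident resolution - time of incident start."""
        return incident_resolved - incident_start

    # Example: an incident created at 09:15 and resolved at 11:45 the same day.
    print(time_to_restore(datetime(2024, 3, 1, 9, 15), datetime(2024, 3, 1, 11, 45)))  # 2:30:00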

Background Information

Definition

Common metrics for monitoring operational health are MTTR (Mean Time To Recovery), MTTF (Mean Time To Failure), and MTBF (Mean Time Between Failure).

Time to Restore Background
  • MTBF - Mean Time Between Failure - How stable is my stuff?
  • MTTI - Mean Time To Identification - How long does it take to realise my application is on fire?
  • MTTA - Mean Time to Action - How long does it take for the First Responder to start investigation?
  • MTTR - Mean Time To Recovery - How long to put out the fires?
  • MTTF - Mean Time To Failure - What's my uptime like?

When working with an incident management system, an incident is generally created at the point it is identified (TTI). This may be some time after the actual failure occurred:

  • For automated alarms, the gap between TTI and the actual incident start may be near zero.
  • For end user notified failures (support tickets), the TTI and the actual start of the failure may be hours or days apart.
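
As a simple illustration with hypothetical timestamps, the gap between the true failure and the point the incident is raised can dominate what the metric captures:

    from datetime import datetime

    # Hypothetical timestamps only, to show how TTI shifts the measurement.
    actual_failure    = datetime(2024, 3, 1, 9, 0)    # fault introduced, not yet noticed
    incident_created  = datetime(2024, 3, 1, 14, 0)   # support ticket raised (TTI)
    incident_resolved = datetime(2024, 3, 1, 16, 0)

    measured_ttr  = incident_resolved - incident_created  # 2:00:00, from the incident record
    actual_outage = incident_resolved - actual_failure    # 7:00:00, as experienced by users
    print(measured_ttr, actual_outage)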

Classification

WayFinder calculates a classification tag for time to restore so that you can compare your performance with the wider industry.

Classification tags are based on industry standards and DORA findings.

  Classification    Time to Restore
  Elite             Less than one hour
  High              Less than one day
  Medium            Less than one week
  Low               More than one week
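
A sketch of how a measured duration could be mapped onto these bands (the thresholds mirror the table above; the function is illustrative, not part of WayFinder):

    from datetime import timedelta

    def classify_time_to_restore(ttr: timedelta) -> str:
        """Map a measured time to restore onto the bands in the table above."""
        if ttr < timedelta(hours=1):
            return "elite"
        if ttr < timedelta(days=1):
            return "high"
        if ttr < timedelta(weeks=1):
            return "medium"
        return "low"

    print(classify_time_to_restore(timedelta(minutes=45)))  # elite
    print(classify_time_to_restore(timedelta(days=3)))      # medium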

Out of scope

For user notified events, establishing the true start of the incident is part of the Post Mortem process, in which the full nature of the incident is understood. This retrospective work may change the start time and therefore affect the calculation.

At WayFinder, we use the time that the incident was created as the start time for measuring time to restore.

Whilst this leaves room for improvement in the confidence of the metric, there is still opportunity for action in the period after the failure has been identified.

Challenges

  • Getting people to record accurately when the incident starts and ends.
  • The start time of an incident may be updated as part of the post mortem process.
  • Sophisticated deployment pipelines enable self-healing. If these recoveries are not captured, the metric is misleading.
  • Zero-downtime deployment strategies (e.g. blue/green) can detect failure without any user impact (no time to restore measurement). Canary deployment may cause an outage, but limit the blast radius.
  • Alarms aren’t a true incident management system, but may be a good proxy/first step. An improvement would be for alarms to automatically trigger your incident management system.
  • If the incident management system is used to track all issues (including minor ones), it leads to a zero defect approach.

Data Sources

We calculate time to restore using your incident management system.


FAQ

Q: Does time to restore start from the point the fault was introduced or the point the fault is first noticed?

A: It starts from the point where the incident is created (in your incident management system). In future, we may extend the functionality to capture additional/alternative fields (e.g. an ‘incident start’ event).

Q: What happens if I update the incident start time after the incident is created?

A: That is not currently supported in our calculations, but let us know if this is important to you.

Q: Does time to restore stop at the point the fix is deployed into production?

A: No. Time to restore stops only once the incident is declared resolved. This may be some time after the production deployment containing the fix, for example after manual checking that service has been restored.

Q: What do you mean by “resolved”?

A: An incident generally follows this lifecycle:

  • Incident Created - an incident is identified and recorded in the system
  • Incident Acknowledged - the incident is acknowledged by support staff who start actively working on the problem
  • Incident Resolved - the support team have worked to stem the bleeding and restore service to users. There may still be underlying issues to be resolved (i.e. a workaround is in place).
  • Incident Closed - the incident post mortem process has been completed and countermeasures are in place to prevent future incidents.
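
To make the measurement boundaries concrete, a minimal sketch of this lifecycle in Python, assuming only the created and resolved timestamps feed the metric (the class and field names are illustrative):

    from dataclasses import dataclass
    from datetime import datetime, timedelta
    from typing import Optional

    @dataclass
    class Incident:
        created: datetime                        # identified and recorded in the system
        acknowledged: Optional[datetime] = None  # support staff start actively working
        resolved: Optional[datetime] = None      # service restored (a workaround may be in place)
        closed: Optional[datetime] = None        # post mortem complete

        def time_to_restore(self) -> Optional[timedelta]:
            """Measured from creation to resolution; acknowledgement and closure do not affect it."""
            if self.resolved is None:
                return None
            return self.resolved - self.created

Acknowledged and Closed are recorded for completeness but, as described above, do not affect the measured value.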

Q: What types of incident are included in time to restore?

A: This is up to you, but we count the following examples as incidents:

  • Production infrastructure incidents
  • Service interruptions
  • Bugs or faults introduced to production through a change