Engineering Scorecard – Quality metrics

A balanced set of metrics allows us to publish KPIs which are both robust and meaningful to the wider organisation. When introducing the Engineering Scorecard I proposed three categories of metric for the Scorecard; this post looks at the Quality category.

Why a Quality metric?

We have seen that relying on a single measure is a risk and that a balanced scorecard is preferable. Flow metrics are not enough: they indicate that work is being done and that the backlog is being processed effectively, but they are focussed on quantity. We could complete a large number of backlog items, yet if these cannot be used because they need rework or do not address the requirements, we will not be successful. We need a clear, public measure of quality which we can communicate.

Quality means doing it right when no one is looking.

Henry Ford

Case study – quality visibility

I worked with one organisation that was concerned about development schedule overruns. When I started working with them, senior management raised concerns about releases being late. To make planning more predictable they had implemented timeboxing, forced releases to deliver on time, and were pleased with the improvement in the on-time metric.

It became clear that the problem was a little different from the one the organisation perceived. Without a reliable quality process, the teams were generating incomplete and poorly tested work. With large batch sizes, this was only discovered at the release points, leading to release delays.

Introducing timeboxing now prevented this testing and rework, resulting in a sharp drop in quality. Releases were subsequently being reworked or abandoned.

The organisation still believed every release was on time. The harsh reality was that code was not being fully released to customers: for two years there had not been a full rollout.

Waste is hidden. Do not hide it.
Make problems visible.

Taiichi Ohno

Customer Satisfaction

The key measure of product quality is the opinion of the customer who is purchasing the product.

Every organisation should perform customer satisfaction surveys to understand where the product is matching or falling short of customer needs.

Although important, this is not necessarily a good engineering KPI and is more likely to sit on a corporate balanced scorecard.

Its linkage to engineering capabilities and behaviours is hard to assess, and it is likely to have a very long lag time.

A product or service possesses quality if it helps someone and enjoys a sustainable market.

W.E. Deming

System uptime

There is a class of measures around the overall stability of the production system. The amount of time for which the system is available is one such measure. This has advantages as a KPI in that it is typically very visible, high impact and widely understood.
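
To make this concrete, availability is normally reported as the fraction of a reporting window not covered by outages. The sketch below is a minimal illustration, assuming a simple list of outage intervals; the function and field layout are placeholders rather than a prescribed tooling choice.

```python
from datetime import datetime, timedelta

def availability(window_start, window_end, outages):
    """Fraction of the reporting window during which the system was up.

    outages is a list of (start, end) tuples; intervals are clipped to
    the window and assumed not to overlap each other.
    """
    window_seconds = (window_end - window_start).total_seconds()
    downtime = 0.0
    for start, end in outages:
        start = max(start, window_start)
        end = min(end, window_end)
        if end > start:
            downtime += (end - start).total_seconds()
    return 1.0 - downtime / window_seconds

# Example: one 90-minute outage in a 30-day window is roughly 99.8% uptime
start = datetime(2024, 1, 1)
end = start + timedelta(days=30)
outage = [(datetime(2024, 1, 10, 3, 0), datetime(2024, 1, 10, 4, 30))]
print(f"{availability(start, end, outage):.3%}")  # 99.792%
```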

One weakness of this measure is that it focusses on the deployed system and not on the upstream reduction of defects. Improving the robustness of the production system will improve uptime, while upstream defect rates and underlying root causes may be unchanged.

System uptime can also potentially be improved by reducing deployment frequency and resisting changes to the system. These are not behaviours we wish to promote as they prevent us achieving flow.

Further, focussing on system uptime may cause us to ignore defects which do not cause a system failure or rollback.

Defects found

If we count defects, we have more flexibility over what we count as a defect. We probably want to measure more than just those defects which bring down production systems, but we also want to exclude “cosmetic” defects from a top-level KPI.

We also have a choice of where we measure defects. We don’t want to measure defects found as part of development – there should be an ongoing cycle of coding and testing which we expect to identify (and fix) defects.

And we should be careful that measuring defects found in pre-release validation does not discourage finding defects before production.
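
As a rough sketch of such a count, assuming an illustrative severity and “found in” labelling rather than any standard taxonomy, we might filter out cosmetic issues and anything caught during the normal development code/test cycle:

```python
from dataclasses import dataclass

@dataclass
class Defect:
    severity: str   # e.g. "critical", "major", "minor", "cosmetic"
    found_in: str   # e.g. "development", "pre-release", "production"

def kpi_defect_count(defects):
    """Count defects for a top-level KPI: exclude cosmetic issues and
    anything found during the normal development code/test cycle.
    Pre-release finds are still counted here, but could be reported
    separately so that finding them early is not discouraged."""
    return sum(
        1 for d in defects
        if d.severity != "cosmetic" and d.found_in != "development"
    )

sample = [
    Defect("major", "production"),
    Defect("cosmetic", "production"),    # excluded: cosmetic
    Defect("critical", "development"),   # excluded: caught in development
    Defect("minor", "pre-release"),
]
print(kpi_defect_count(sample))  # 2
```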

Change failure rate

The DORA metrics include Change failure rate. This looks at the fraction of deployments which fail and need to be rolled back. It is a great measure once we have small batch sizes and fast recovery times, as we will be rolling back individual commits when defects are found. At this level of maturity, rollbacks and significant defects will probably correspond closely.

It is important that this measure is a fraction of deployments, not an absolute number. Minimising the absolute count of failed deployments would tend to encourage a reduction in the number of deployments, increasing batch size and working against flow improvement.
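
The calculation itself is straightforward; in the sketch below the deployment records and the failed flag are placeholders for whatever your deployment tooling actually records.

```python
def change_failure_rate(deployments):
    """DORA-style change failure rate: the fraction of deployments that
    failed in production (e.g. needed a rollback, hotfix or patch).

    Reported as a fraction, not an absolute count, so that deploying
    more often does not make the metric look worse.
    """
    if not deployments:
        return 0.0
    failures = sum(1 for d in deployments if d["failed"])
    return failures / len(deployments)

# Example: 3 failed deployments out of 40 gives a 7.5% change failure rate
deployments = [{"failed": i in (5, 17, 33)} for i in range(40)]
print(f"{change_failure_rate(deployments):.1%}")  # 7.5%
```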

Escape frequency

It is most likely that we initially want a custom measure of the level of defects, tailored further than counting all defects or counting rollbacks. We can define “escapes” as those defects which have passed our standard development and validation and have actually, or potentially, been exposed on production systems.

A custom “escape frequency” measure is probably a good starting KPI. You have some flexibility in how you define “escapes”. You are likely to want to include some defects which do not require rollbacks, but exclude some minor defects which can be fixed in a future release.
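
As one illustration of how such a custom measure could be calculated, the sketch below treats an escape as a defect of at least “major” severity exposed on production; the severity threshold, stage labels and quarterly reporting cadence are all assumptions you would tune to your own definition.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date

@dataclass
class ReportedDefect:
    severity: str    # e.g. "critical", "major", "minor"
    exposed_in: str  # e.g. "development", "validation", "production"
    found_on: date

def is_escape(d):
    """One possible definition: the defect got past standard development
    and validation onto production, and is too serious to simply defer
    to a future release."""
    return d.exposed_in == "production" and d.severity in ("critical", "major")

def escape_frequency(defects):
    """Escapes per calendar quarter, one plausible reporting cadence."""
    return Counter(
        (d.found_on.year, (d.found_on.month - 1) // 3 + 1)
        for d in defects if is_escape(d)
    )

sample = [
    ReportedDefect("major", "production", date(2024, 2, 3)),
    ReportedDefect("minor", "production", date(2024, 2, 20)),    # excluded: minor
    ReportedDefect("critical", "validation", date(2024, 5, 9)),  # excluded: caught in validation
    ReportedDefect("critical", "production", date(2024, 6, 1)),
]
print(escape_frequency(sample))  # Counter({(2024, 1): 1, (2024, 2): 1})
```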

With maturity this should converge with the DORA “Change failure rate” measure: rollbacks become simpler and less painful, and batch sizes shrink to deploying individual commits. Rolling back individual defects becomes a realistic approach.

Case study – defect metric

Let us look at some example data from an organisation I worked with in the past.

As the published Quality measure we used high-severity defects on deployed systems. The release cadence there was relatively slow, meaning we had a larger batch size and rollback was complex. Defects could not be individually rolled back, so change failure rate would have been unlikely to be an effective measure.

The slow release cadence meant that the metric was strongly lagging. Not ideal, but the best available at the time. This lag meant that the Quality measure remained fairly flat for the first two quarters of change: improvements took some time to have an effect within development, and then further time before that effect was observed in deployed systems.

After the initial delay, the range of continuous improvement activities resulted in a reduction in the level of defects observed, driving a fairly steady improvement in this Quality KPI.

We observed a defect level reduction of approximately 20% year-on-year, driven by ongoing process improvement. In particular, the effective application of retrospectives and the investment in technical debt reduction were key.
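
It is worth noting how a sustained 20% year-on-year reduction compounds: the defect level falls to roughly half the starting point after about three years. The baseline index in the sketch below is arbitrary.

```python
# A sustained 20% year-on-year reduction compounds: the defect level is
# roughly halved after about three years.
baseline = 100.0  # arbitrary index of the starting defect level
level = baseline
for year in range(1, 6):
    level *= 0.8  # 20% reduction each year
    print(f"Year {year}: {level:.0f} ({level / baseline:.0%} of baseline)")
```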

There was no sign of the rate of improvement tailing off. We were working with a large legacy system and it is clear we were continuing to find defects. With a slow release cadence it can take considerable time to reduce legacy defect levels. Once defect levels are lower and flow is improved with smaller batch sizes, the measure could evolve to look at which deployments we choose to roll back.
