Why do we have IT delays?

[Image: close-up photo of a stop sign. Photo by Kai Pilger on Unsplash]

You’ve got a quick job to do, so you fire it off to an IT team. It should only take around half a day of effort, so you expect to hear back by the end of the week. Three weeks later, you’re still waiting. So why is this happening? What is the deal with IT delays?

Apologies in advance to all scientists, engineers, and mathematicians.

The effect of utilisation on wait times

The request you made might only be 4 hours of effort, but you need to factor in the waiting time. How long is it going to take before your IT team picks up the request and services it for you? Wait time is largely defined by 3 variables:

  • The processing time
  • The job type variation
  • The input variation

Of these, only the processing time is static. To start, let’s see how utilisation impacts queue length: as utilisation grows, queue length increases dramatically.

The following work assumes an M/M/1/INF queue (random arrivals, random service times, a single server, and unlimited queue capacity).

Utilisation of work centre: p
Idle time: 1 - p
Therefore average queue length: p / (1 - p)
The effect of utilisation on queue length

We can see that as utilisation approaches 100%, the queue length approaches infinity. And this makes sense: what happens when a road is at 100% capacity? It grinds to a halt.
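To make the curve concrete, here is a minimal Python sketch of the p / (1 - p) relationship. The function name `queue_length` and the sample utilisation values are my own choices for illustration, not part of any library.

```python
def queue_length(utilisation: float) -> float:
    """Average queue length p / (1 - p) for a utilisation p in [0, 1)."""
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be at least 0 and below 1")
    return utilisation / (1 - utilisation)


# Sample utilisation levels chosen purely for illustration.
for p in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"{p:.0%} utilised -> average queue of about {queue_length(p):.1f} items")
```

The queue sits at about 1 item when the team is half busy, about 4 items at 80%, and about 19 items at 95%.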

It’s worth noting that p = 0.5 is a point of balance in the system: it is the highest utilisation that still keeps the average queue length at or below 1.

This shows that we will have to wait significantly longer when our dependency is busy. And if they’re a dependency of ours, then I think we can safely assume they might have others in the wings. When we have dependencies, the most inextricably linked of them will have the longest delays, sending us spinning into a vicious cycle.

Calculating your wait time

Assuming that the processing time for an item is constant (it isn’t, and we’ll see how that variation affects things in a moment), we can work out the expected wait time for a single item.

Processing time (pt): 0.5 days
Queue length (q)

We can’t have a total time of less than the processing time, so if the queue length is equal to or less than 1 then our total time is equal to our processing time.

if( q<=1 ) { pt }

But if it’s greater, then we need to multiply the queue length by our processing time in order to clear the queue and then add another pt for our item.

if( q<=1 ) { pt } else { pt + pt * q }
Time to process a work item based on utilisation
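The rule above can be written as a small runnable Python sketch. The names `queue_length` and `total_time` are hypothetical helpers of mine; the logic simply follows the post's rule of using pt alone when the queue is 1 or less, and pt + pt * q otherwise.

```python
def queue_length(utilisation: float) -> float:
    """Average queue length p / (1 - p), as used earlier in the post."""
    return utilisation / (1 - utilisation)


def total_time(pt: float, q: float) -> float:
    """Total time for one item: processing time plus time to clear the queue."""
    if q <= 1:
        return pt            # nothing meaningful ahead of us, just do the work
    return pt + pt * q       # clear q items first, then process our own


pt = 0.5  # processing time in days, taken from the example above
for p in (0.4, 0.8, 0.95):
    days = total_time(pt, queue_length(p))
    print(f"{p:.0%} utilised -> about {days:.1f} days in total")
```

With a half-day job, an 80% utilised team works out to about 2.5 days in total, and a 95% utilised team to about 10 days.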

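As we halve the amount of excess capacity by increasing our utilisation, we at least double the average queue size. Moving from 40% to 70% utilisation takes the queue from around 0.67 to around 2.33 items (3.5 times longer), and moving from 70% to 85% takes it from around 2.33 to around 5.67 items (roughly 2.4 times longer). The closer we get to 100%, the closer each halving comes to an exact doubling.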
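A quick way to check the halving claim against the p / (1 - p) formula is to compute the growth factor directly. This is a sketch with helper names I made up (`halve_excess_capacity`, `queue_growth`). Under this formula the factor works out to 1 + 1/p, so it is always somewhat more than double and gets closer to double as utilisation rises.

```python
def queue_length(utilisation: float) -> float:
    """Average queue length p / (1 - p)."""
    return utilisation / (1 - utilisation)


def halve_excess_capacity(utilisation: float) -> float:
    """New utilisation after halving the idle share (1 - p)."""
    return 1 - (1 - utilisation) / 2


def queue_growth(utilisation: float) -> float:
    """How many times longer the queue gets after halving excess capacity."""
    busier = halve_excess_capacity(utilisation)
    return queue_length(busier) / queue_length(utilisation)


p = 0.4
for _ in range(4):
    nxt = halve_excess_capacity(p)
    print(f"{p:.1%} -> {nxt:.1%}: queue grows about {queue_growth(p):.2f}x")
    p = nxt
```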

How efficiency is changed by utilisation

Here’s what the time breakdown looks like across this series.

Proportion of doing vs. wait time

Here’s what your process efficiency looks like when utilisation increases.

Process efficiency compared with utilisation

So when your teams are using 95% of their time to service requests, the process efficiency of your system (the share of total lead time spent actually working on an item rather than waiting) is only about 5%. Maybe having a day a week to pursue personal development, work on pet projects, or even browse Facebook isn’t so bad.
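Process efficiency here is just doing time divided by total time. The sketch below reuses the post's own wait-time rule; the function names are mine, and the 0.5 day processing time is the example figure from earlier.

```python
def queue_length(utilisation: float) -> float:
    """Average queue length p / (1 - p)."""
    return utilisation / (1 - utilisation)


def total_time(pt: float, q: float) -> float:
    """The post's rule: pt alone if the queue is 1 or less, else pt + pt * q."""
    return pt if q <= 1 else pt + pt * q


def process_efficiency(pt: float, utilisation: float) -> float:
    """Fraction of the total lead time that is hands-on work."""
    return pt / total_time(pt, queue_length(utilisation))


for p in (0.4, 0.7, 0.85, 0.95):
    print(f"{p:.0%} utilised -> efficiency of about {process_efficiency(0.5, p):.0%}")
```

At 95% utilisation the result is about 0.05 as a fraction, which is 5% when written as a percentage.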

The effect of variation

There are two major types of variation that will affect the lead time of a work item. I highlighted them above:

  • Input variation
  • Job type variation

Expanding the formula slightly will give us something called the Kingman Approximation. This formula seeks to estimate the mean waiting time for work based on utilisation, processing time, and variation.

https://en.wikipedia.org/wiki/Kingman%27s_formula
E(W) ≈ (p / (1 - p)) × ((Ca^2 + Cs^2) / 2) × T

E(W):                 Expected waiting time
p / (1 - p):          Utilisation (discussed above)
(Ca^2 + Cs^2) / 2:    Variation coefficient
T:                    Mean service time for a work item

We know what wait time means, we’ve already tackled utilisation, and we can easily work out the mean time for service (it’s the average time it takes for something to be done). So that leaves us with variation, both types in fact.

Ca: Standard deviation divided by mean for time between arrivals
Cs: Standard deviation divided by mean for service time (T)

What Kingman is showing us is that our expected wait time is a function of utilisation, variation, and processing time.
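Kingman's approximation multiplies the three terms together: the utilisation term p / (1 - p), the variation term (Ca^2 + Cs^2) / 2, and the mean service time T. Here is a minimal Python sketch; the function name `kingman_wait` and the example inputs are my own assumptions for illustration.

```python
def kingman_wait(utilisation: float, ca: float, cs: float,
                 mean_service_time: float) -> float:
    """Approximate mean waiting time using Kingman's formula.

    utilisation       -- p, fraction of time the team is busy, in [0, 1)
    ca                -- coefficient of variation of time between arrivals
    cs                -- coefficient of variation of service time
    mean_service_time -- T, in whatever time unit you want the answer in
    """
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be at least 0 and below 1")
    utilisation_term = utilisation / (1 - utilisation)
    variation_term = (ca ** 2 + cs ** 2) / 2
    return utilisation_term * variation_term * mean_service_time


# Illustrative inputs: 80% busy, half-day jobs, highly variable work (Ca = Cs = 1).
print(kingman_wait(0.8, ca=1.0, cs=1.0, mean_service_time=0.5))
# Same load, but arrivals and job sizes are half as variable.
print(kingman_wait(0.8, ca=0.5, cs=0.5, mean_service_time=0.5))
```

With Ca = Cs = 1 the variation term is 1, so the wait is about 2 days; halving both coefficients cuts the wait to about half a day.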

Arrival and processing variation

If work arrives unpredictably into our system, then we experience inefficiencies. It’s much easier to understand this now that we have mastered utilisation. Imagine a system where our utilisation is 80%, which means we have a queue length of 4. We’re working as quickly as we can within that 80%, so we’re processing 2 items every day. If our queue arrival is steady and matches our delivery rate (takt time), then our queue will remain stable.

But what happens if we get a small surge? If our queue temporarily increases from 4 items to 5 items as a result of arrival variation, then we will be unable to pay that debt back unless the input rate drops sufficiently to allow us to ‘catch up’. This means that increases in arrival rate, even if temporary, will increase the lead time for every item that comes after it in our system.

The same concept is true for the other type of variation, processing. If a job takes longer to process than our takt time, then we will be unable to pay that back.
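The surge example can be played out day by day with a tiny simulation. This is a deliberately simplified sketch (whole items, fixed daily capacity, a function name `simulate_queue` that I invented), not a full queueing model. A job that overruns has the same effect as a day with less capacity, so the arrival case stands in for both.

```python
def simulate_queue(arrivals, capacity_per_day, start_queue):
    """Return the queue length at the end of each day."""
    queue = start_queue
    history = []
    for arrived in arrivals:
        # Add the day's arrivals, then process up to the daily capacity.
        queue = max(0, queue + arrived - capacity_per_day)
        history.append(queue)
    return history


steady = simulate_queue([2, 2, 2, 2, 2, 2], capacity_per_day=2, start_queue=4)
surge = simulate_queue([2, 2, 3, 2, 2, 2], capacity_per_day=2, start_queue=4)
caught_up = simulate_queue([2, 2, 3, 2, 1, 2], capacity_per_day=2, start_queue=4)

print("steady:   ", steady)      # [4, 4, 4, 4, 4, 4]
print("surge:    ", surge)       # [4, 4, 5, 5, 5, 5]
print("caught up:", caught_up)   # [4, 4, 5, 5, 4, 4]
```

One extra item on day three leaves the queue at 5 for every day after it. The queue only returns to 4 when a later day brings fewer arrivals than the team can process.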

How can we improve?

Quite simply, we alter some of the variables. We know that variation, processing time, and utilisation all have a multiplicative effect on lead time. In order to reduce the time it takes to deliver value, we can take 3 routes:

  1. Reduce processing time – Nice and simple: because wait time scales with processing time, shorter jobs shorten both the work itself and the queue in front of it.
  2. Reduce variation – Moving the system towards homogeneity of both arrival and processing time smooths out those bumps that cause the system to begin operating less effectively.
  3. Reduce utilisation – As mental as it sounds, by reducing utilisation we improve the efficiency of the system and allow value to flow more quickly to our customers.
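To compare the three routes side by side, here is a sketch that plugs each change into Kingman's approximation while holding the other inputs fixed. The baseline numbers (80% utilisation, Ca = Cs = 1, half-day jobs) and the size of each improvement are assumptions chosen only for illustration. In practice, shortening jobs would also lower utilisation for the same arrival rate, so treating them separately understates route 1.

```python
def kingman_wait(p: float, ca: float, cs: float, t: float) -> float:
    """Kingman's approximation: (p / (1 - p)) * ((ca^2 + cs^2) / 2) * t."""
    return (p / (1 - p)) * ((ca ** 2 + cs ** 2) / 2) * t


baseline = dict(p=0.8, ca=1.0, cs=1.0, t=0.5)   # assumed starting point

scenarios = {
    "baseline": baseline,
    "1. halve processing time": {**baseline, "t": 0.25},
    "2. halve variation": {**baseline, "ca": 0.5, "cs": 0.5},
    "3. utilisation 80% -> 60%": {**baseline, "p": 0.6},
}

waits = {name: kingman_wait(**inputs) for name, inputs in scenarios.items()}
for name, wait in waits.items():
    print(f"{name:<28} expected wait of about {wait:.2f} days")
```

Under these assumed numbers the baseline wait is about 2 days. Halving processing time brings it to about 1 day, dropping utilisation to 60% brings it to about 0.75 days, and halving both variation coefficients brings it to about 0.5 days.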

But we can’t have developers just sitting around!

Even if you followed this post whole-heartedly, there is probably still a little bit of you that cringes at the thought of having people just sitting around waiting for work. This doesn’t have to be the case. Let’s think of all the amazing value-added activities they could be doing while ‘waiting’.

  1. Personal development to improve morale and upskill the team
  2. Pet projects to develop skills and open up future revenue opportunities
  3. Networking and conferences to promote the organisation and build skills
  4. Assist in other business areas to build cross-functionality
  5. Sitting around on the sofa playing video games to enhance morale and reduce workplace stress

Almost anything is better for value flow than working people to the bone, scientifically.

Wrap Up

Thanks for sticking with me. I really love this type of content, even if I end up butchering the mathematics. All I can do is apologise, and promise to use some of my spare capacity to improve it for next time.

If you enjoyed this content, or found it useful, then it would really help me out if you could share it around. It takes a while to produce this sort of thing and it really makes it worthwhile knowing I’ve managed to help a wider audience.
