Diagnosing Patients and Troubleshooting Software


In the IT sector, a lot of time is spent troubleshooting buggy software. Whether it’s your software or someone else’s, chances are you’ve had to diagnose a software-related issue in the last couple of days.

“Troubleshooting” humans is a core part of any medical profession. Let’s see if we can learn a thing or two from medical practitioners.

Step by step

Let’s explore how IT professionals can apply medical diagnostic concepts to their troubleshooting process.

Interview

Our patient might not verbally express its symptoms (yet), but if we pay attention and ask the right questions, we’ll be able to gather them nonetheless.

What exactly is not working? Does it fail all the time? How does it fail exactly? What’s the expected behavior?

Get to know the system, get to know the failure. We can’t fix what we don’t fully understand!

Clinical History

Take a moment to look at the context surrounding the fault. Problems usually don’t come out of nowhere.

When does the issue occur? What happened before it started occurring? Is there any pattern in the occurrence of the error? Any chain of events that always seems to trigger it?

These little nuggets of information might provide more insights than you’d expect, and will point you in the right direction later on.

Physical Exam

More digital than physical, but you get the point.

It hurts when I…

If possible, a patient will attempt to reproduce their problem for the clinician. This ensures all parties involved have a clear and common understanding of the issue.

Which inputs cause the unintended behavior? How do we get it to happen consistently?

This is where the _“it works on my machine ¯\_(ツ)_/¯”_ meme comes into play. It’s a bit inconsiderate and not really helpful, but it reflects a rather central part of any troubleshooting process: Reproducing the issue.

If we can reproduce it consistently, we can fix it eventually!
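As a rough illustration, a reproduction can be as small as a script that feeds the suspicious input to the failing code several times and checks that it fails every time. Everything in this sketch is hypothetical: `parse_price` stands in for whatever code is misbehaving, and `"12,50"` for the input reported by the user.

```python
def parse_price(text: str) -> float:
    """Hypothetical function under investigation: it chokes on comma decimals."""
    return float(text.strip().lstrip("$"))


def reproduce(func, failing_input, attempts: int = 5) -> bool:
    """Return True if func raises on failing_input on every single attempt."""
    failures = 0
    for _ in range(attempts):
        try:
            func(failing_input)
        except Exception:
            failures += 1
    return failures == attempts


# The reported input fails consistently, while a similar input does not.
print(reproduce(parse_price, "12,50"))  # True
print(reproduce(parse_price, "12.50"))  # False
```

If the first call printed `False`, the failure would be intermittent, and we would know there is more context to gather before attempting a fix.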

Does this hurt?

Once the practitioner understands what the issue is, they’ll most likely try to “prod the system” a bit.

Does Y seem to make it any better? Does it also hurt if you X? Is it getting worse when I apply pressure here?

This might come off as a bit sadistic, but it really is just a way to gather as much information as possible. It can be translated to software as evaluation by I/O: Give the system a bunch of different inputs and see how they affect the output.

What if you press this button/use that module instead?

This process is doubly useful: it can confirm or rule out a hypothesis regarding the cause of the problem, and it provides more context about the behavior of the system.
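Prodding the system can be sketched as a small helper that runs the same code against a range of inputs and records what comes back. This is only one possible shape for it: `probe` and `average_word_length` are made-up names, and the inputs are chosen to test the hypothesis that empty text is what hurts.

```python
def probe(func, inputs):
    """Run func on each input and record either the result or the error type."""
    observations = {}
    for value in inputs:
        try:
            observations[value] = ("ok", func(value))
        except Exception as exc:
            observations[value] = ("error", type(exc).__name__)
    return observations


def average_word_length(text: str) -> float:
    """Hypothetical function suspected of failing on some inputs."""
    words = text.split()
    return sum(len(word) for word in words) / len(words)


results = probe(average_word_length, ["hello world", "a", "", "   "])
for value, outcome in results.items():
    print(repr(value), "->", outcome)
```

Here the two inputs without any words are the only ones that fail, which supports the hypothesis and tells us where the system is otherwise healthy.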

Fear no consequence: Software (unlike people) can be rolled back!

Tests and Data Analysis

Just like hospitals have MRI scanners, IT professionals have logs, metrics, user data, etc.

Test results are a very important part of any objective analysis of a situation, but should never be the only base for a diagnosis. Rather, they complement the information gathered in the previous steps.

A good practitioner will look at all the available data and draw their own conclusions.
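To make the "test results" idea a bit more concrete, here is a minimal sketch that tallies error messages from a log. The log lines are invented for the example, and the `TIMESTAMP LEVEL message` layout is an assumption; real logs will need their own parsing.

```python
from collections import Counter

# Made-up sample log lines in an assumed "TIMESTAMP LEVEL message" format.
SAMPLE_LOG = [
    "2024-05-01T10:00:01 INFO request served",
    "2024-05-01T10:00:02 ERROR database timeout",
    "2024-05-01T10:00:03 WARNING slow response",
    "2024-05-01T10:00:04 ERROR database timeout",
    "2024-05-01T10:00:05 ERROR disk full",
]


def count_errors(lines):
    """Count ERROR entries, grouped by message text."""
    counts = Counter()
    for line in lines:
        parts = line.split(maxsplit=2)
        # Skip lines that do not match the assumed three-part format.
        if len(parts) == 3 and parts[1] == "ERROR":
            counts[parts[2]] += 1
    return counts


for message, count in count_errors(SAMPLE_LOG).most_common():
    print(count, message)
```

A tally like this tells us which error dominates, but not why it happens. That part still comes from the interview, the history and the exam.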

Follow-up

Once a diagnosis is reached, and a treatment is prescribed, follow-up appointments are scheduled. This is done to confirm that the diagnosis was correct, the treatment is effective and that there are no unwanted surprises or further actions needed.

Similarly, “fixing” an issue is usually not enough. We need to test our solution once it’s applied, both to check that the original fault actually stopped occurring and to ensure that no other issues have started to appear as a consequence of our fix.
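A follow-up appointment for software could look like the sketch below: one check for the original fault, plus a few checks for behavior that used to work. The function and its fix are hypothetical, and the fix assumes a comma is always a decimal separator, which would not hold for inputs like `1,234.50`.

```python
def parse_price(text: str) -> float:
    """Hypothetical fixed version: now also accepts a comma as decimal separator."""
    cleaned = text.strip().lstrip("$").replace(",", ".")
    return float(cleaned)


def follow_up() -> bool:
    """Return True if the fault is gone and nothing else broke."""
    # The original fault: comma decimals used to raise ValueError.
    original_fault_gone = parse_price("12,50") == 12.5
    # Behavior that worked before the fix must still work.
    no_regressions = (
        parse_price("12.50") == 12.5
        and parse_price("$3") == 3.0
        and parse_price(" 7 ") == 7.0
    )
    return original_fault_gone and no_regressions


print(follow_up())  # True
```

Keeping checks like these around as automated tests means the follow-up repeats itself on every future change.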

Strategies

There are multiple diagnostic procedures, and some of them are also applicable to troubleshooting software.

In fact, you are probably already applying them in some capacity. Still, being mindful of these strategies might help.

Pattern recognition

This involves identifying certain patterns where the cause may be evident given a particular set of circumstances and symptoms.

If you updated your Nvidia drivers before booting into a black screen, the drivers are probably broken or incompatible. There is little reason to evaluate other possible causes first: the drivers are so likely to be the culprit that it’s worth trying that fix right away.

Of course, this requires some degree of experience with the subject at hand.

Differential diagnosis

This involves listing all possible causes and then eliminating them one by one, ideally leaving only the root cause.

A black screen at boot has a lot of possible causes, but if you can hear the fans spinning and see some lights turn on, you can eliminate the power supply as one of them.

This doesn’t guarantee a correct diagnosis. Rather, it produces the most likely one, while helping to unblock a difficult diagnostic process.
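The elimination step can be sketched as a table of candidate causes, each paired with the observations that would rule it out. The candidates and observations below are illustrative only, loosely based on the black screen example, and are not a real diagnostic chart.

```python
def differential(candidates, observations):
    """Keep only the candidate causes that no observation rules out."""
    remaining = []
    for cause, ruled_out_by in candidates.items():
        # A cause survives if none of its disqualifying observations was made.
        if not (ruled_out_by & observations):
            remaining.append(cause)
    return remaining


# Illustrative candidates: each maps to the observations that would exclude it.
CANDIDATES = {
    "power supply failure": {"fans spinning", "lights on"},
    "unplugged monitor cable": {"firmware splash screen visible"},
    "broken graphics driver": {"black screen before the OS loads"},
}

observed = {"fans spinning", "lights on"}
print(differential(CANDIDATES, observed))
# ['unplugged monitor cable', 'broken graphics driver']
```

Each new observation shrinks the list. When several causes survive, the next step is to pick the test that would eliminate the most of them.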

Bonus: Treatment

Medical and IT professionals both face a critical choice: Either find the root cause and treat it, or merely treat the symptoms.

In the medical field, the latter is reserved for three scenarios:

  1. There is no treatment: the best we can do is alleviate the symptoms.
  2. The treatment is unavailable/unaffordable.
  3. The system is overstretched: we lack the time/resources to diagnose and/or treat properly.

Unfortunately, in the IT space, treating the symptom is a near-ubiquitous practice and the third scenario seems to be the norm.

It is essential to keep in mind that software serves the needs of people, and even if users are not visible, they are still affected by inadequate diagnosis and treatment.

Would you behave the same way if the user were sitting by your side? What if your software was being used by medical professionals? Are you considering the impact that software has on the lives of people and the choices they make?

Be sure there is a valid reason not to diagnose and treat the root cause.

