Unless building a greenfield project, devs spend a lot of time troubleshooting buggy systems.
I find it odd that we seem to have no method to this madness, no procedures, no nothing. Just smash your head into the keyboard until something clicks.
Medical professionals have to “troubleshoot people” all the time; maybe they know what they’re doing.
Gather data
There’s not always a sensible bug report to start with.
Pinpoint the issue
People often don’t realize the full extent of their symptoms. They just know it kinda hurts around here sometimes.
Good, sensible questions need to be asked to get a full picture of the unexpected behavior.
What exactly is not working? Does it fail all the time? How does it fail exactly? What’s the expected behavior?
Get to know the system, understand the failure.
Clinical History
Look at the context surrounding the error: problems don’t come out of nowhere.
When does it happen? What makes it fail? What happened before it started? Can you find a pattern?
If a system hasn’t changed recently and a bug was “introduced yesterday”, either the user is the bug, or it’s been there for a while.
Physical Exam
Well, digital really but you get the point.
Once a general understanding of the behavior and context is reached, try to go deeper.
“It hurts when I…”
Get your user to reproduce the bug for you.
Yes, this is not always possible. But patient and therapist should be on the same page. Maybe it’s not a bug but a missing feature.
What input(s) causes the unintended behavior? How do we get it to happen consistently?
This aims at a low-level, I/O approach to reproducing the issue. Reason about the bug as if you were writing a test around it (which you might actually want to do).
If you can reproduce it consistently, you’ll fix it eventually.
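Here is a minimal sketch of what “writing a test around it” could look like. Everything in it is made up for illustration: `parse_price` is a hypothetical function that chokes on thousands separators, and `"1,299.50"` stands in for whatever input the user reported.

```python
def parse_price(text: str) -> float:
    """Hypothetical buggy function: it does not handle thousands separators."""
    return float(text)


def reproduces_bug(raw_input: str) -> bool:
    """Return True if the given input still triggers the reported failure."""
    try:
        parse_price(raw_input)
    except ValueError:
        return True
    return False


# The input from the (imaginary) bug report fails every single time...
bug_is_reproducible = reproduces_bug("1,299.50")
# ...while a plain number goes through, which narrows down the trigger.
plain_number_fails = reproduces_bug("1299.50")
```

Once the fix is in, the same input can be turned into a regular regression test that asserts the expected value instead of the failure.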
Does this hurt?
The “prod it with a stick” part of the process.
Does Y seem to make it any better? Does it also break if you X? What makes it worse?
This might come off as a bit sadistic (and sometimes it is), but it’s analysis by I/O: give the system a bunch of different inputs and see how they affect the output.
What if you press this button/use that plugin instead?
Fear no consequence, break the thing: Software (unlike people) can be rolled back.
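Prodding can be semi-automated. The sketch below is one possible way to do it, not a recipe: `prod` is a hypothetical helper, and Python's built-in `float` plays the role of the system under suspicion. It feeds in a handful of variations and records what comes back for each.

```python
from typing import Any, Callable, Iterable


def prod(func: Callable[[Any], Any], inputs: Iterable[Any]) -> dict:
    """Feed each input to func and record either the result or the exception type."""
    outcomes = {}
    for value in inputs:
        try:
            outcomes[repr(value)] = ("ok", func(value))
        except Exception as exc:  # deliberately broad: we are exploring, not handling
            outcomes[repr(value)] = ("error", type(exc).__name__)
    return outcomes


# Variations around the suspicious input: separators, empty, padded, exponent.
results = prod(float, ["1299.50", "1,299.50", "", " 42 ", "1e3"])

for given, outcome in results.items():
    print(given, "->", outcome)
```

Reading the outcomes side by side usually says more than any single failure: which inputs are fine, which break, and what the broken ones have in common.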
Tests and Data Analysis
There’s no MRI for software, but we do have logs, metrics, user data, observability, etc.
If they are not present in the system, yesterday is a good time to add them. There is never too much information; you can always filter out irrelevant data.
While test results are a very important part of any objective analysis, they should not be the only base for a diagnosis. Use this data to complement the information gathered in the previous steps.
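If the system has no logs yet, adding them does not need to be a big project. A minimal sketch using Python's standard `logging` module; the `checkout` logger name and the `apply_discount` function are invented for the example.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("checkout")  # hypothetical component name


def apply_discount(total: float, percent: float) -> float:
    """Hypothetical function, instrumented so inputs and outputs leave a trace."""
    log.info("apply_discount called total=%s percent=%s", total, percent)
    if not 0 <= percent <= 100:
        # Does not change behavior, only flags values worth a second look.
        log.warning("suspicious discount percent=%s", percent)
    result = total * (1 - percent / 100)
    log.info("apply_discount result=%s", result)
    return result


discounted = apply_discount(200.0, 25)
```

Logging both the inputs and the result is the point: when the bug report arrives, the I/O is already on record.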
Make a bet
Gathering data is alright, but how does one actually reach a diagnosis? In any reasonably complex system, it’s hard or impossible to actually know what is happening end to end. There are often unknowns, black boxes we don’t fully understand.
Even so, we can do better than guessing.
Pattern recognition
If you feel tired, your nose is running, and you have a fever, it doesn’t take a rocket scientist to bet on you having the flu.
If you updated your Nvidia drivers yesterday, and today you got a black screen on boot, your OS is probably fine; the drivers are likely broken or incompatible.
This doesn’t mean there cannot be any other issue; it’s just so likely to be the cause that focusing on any other possibility as a first guess makes no sense.
Of course, this requires some experience: you can probably only recognize these patterns if they are not new to you.
Differential diagnosis
It involves finding all possible causes and eliminating them one by one, leaving only the (most likely) root cause.
A PC may not boot for a bunch of different reasons, but if you can hear the fans spinning and see some lights turn on, you can eliminate the power supply as one of them.
This can be a long and tedious process, especially with software. But it is accessible with or without previous experience, and it lets you be methodical.
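The elimination process can be written down quite literally. The sketch below reuses the PC-that-won't-boot example; apart from the power supply rule, the causes, observations, and rules are invented for illustration and are not real hardware diagnostics.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    cause: str
    # Returns True when the observations eliminate this cause.
    ruled_out_by: Callable[[dict], bool]


def differential(hypotheses: list, observations: dict) -> list:
    """Keep only the causes that the observations have not eliminated."""
    return [h.cause for h in hypotheses if not h.ruled_out_by(observations)]


# What we have seen so far (all hypothetical).
observations = {
    "fans_spin": True,
    "lights_on": True,
    "cable_reseated": True,
    "boots_from_live_usb": True,
}

hypotheses = [
    Hypothesis("dead power supply", lambda o: o["fans_spin"] and o["lights_on"]),
    Hypothesis("loose monitor cable", lambda o: o["cable_reseated"]),
    Hypothesis("broken motherboard", lambda o: o["boots_from_live_usb"]),
    # Nothing observed so far eliminates this one.
    Hypothesis("broken OS install or drivers", lambda o: False),
]

remaining = differential(hypotheses, observations)
print(remaining)
```

With these made-up observations, only "broken OS install or drivers" survives. The value is less in the code than in the habit: every hypothesis gets an explicit check that could rule it out.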
Treat the damn issue
Medical and IT professionals both face a critical choice: Either find the root cause and treat it, or simply treat the symptoms.
In medical fields, the latter is exclusively reserved for three scenarios:
- There is no treatment, so the best we can do is alleviate the symptoms.
- The treatment is unavailable/unaffordable.
- The system is overstretched, and we lack the time/resources to diagnose and/or treat properly.
Unfortunately, in the IT space, treating the symptom is a near-ubiquitous practice and the third scenario seems to be the norm.
We should keep in mind that software serves the needs of people, and even if users are not visible, they are still affected by inadequate diagnosis and treatment.
Would we behave the same way if the user were sitting by our side? What if the software was used by medical professionals? Are we considering the impact that software has on the lives of people and the choices they make?
We should make sure there is a valid reason not to diagnose and treat the root cause.
Follow-up
Once a diagnosis is reached, and a treatment is prescribed, follow-up appointments are scheduled.
This is done to confirm that the diagnosis was correct, that the treatment is effective, and that there are no unwanted surprises or further actions needed.
Try to reproduce the bug, press the same buttons as before, stress the system.
The issue is not fixed until proven so.
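A follow-up can be as simple as replaying the reproduction many times and counting failures. A minimal sketch, assuming a hypothetical, already patched `parse_price` and a made-up reported input of `"1,299.50"`:

```python
from typing import Callable


def parse_price(text: str) -> float:
    """Hypothetical patched function: thousands separators are stripped first."""
    return float(text.replace(",", ""))


def reproduce() -> None:
    """Press the same buttons as before: the input from the imaginary bug report."""
    assert parse_price("1,299.50") == 1299.5


def failures_after_fix(repro: Callable[[], None], attempts: int = 100) -> int:
    """Run the reproduction repeatedly and count how many times it still fails."""
    failures = 0
    for _ in range(attempts):
        try:
            repro()
        except Exception:  # any failure counts, whatever its type
            failures += 1
    return failures


still_failing = failures_after_fix(reproduce, attempts=50)
```

Anything other than zero means the patient comes back for another appointment. For flaky or load-dependent bugs, the number of attempts and the conditions under which they run matter far more than in this toy case.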