Six Step Troubleshooting
The US Navy's six step troubleshooting procedure has become part of academic and professional courses and certifications around the world. It presents a logical, step-by-step approach for troubleshooting system faults. We can apply this to computer networks, electrical and electronic circuits, or business processes. When we use the six steps properly, our troubleshooting can be faster and more efficient than it would be if we "just jump right in".
The primary goal of troubleshooting is very simple - fix faults. But that goal is more nuanced than it first appears. While we want to fix faults, we should aim to do it as efficiently and quickly as possible. Time wasted troubleshooting a system that's unrelated to the fault is expensive. Meanwhile, the person who originally reported the fault is still unable to perform whatever task they had been attempting.
First, we'll outline the six steps. Second, we'll explore what each of them entails. Third, we'll apply the six steps to a real-world network outage scenario. The following six steps make up the formal troubleshooting process:
- Symptom Recognition
- Symptom Elaboration
- List Probable Faulty Functions
- Localize the Faulty Function
- Localize the Faulty Component
- Failure Analysis
These steps were originally developed for troubleshooting electrical and electronic systems, and some of the wording has changed over time. For example, Step 5 was originally "Localizing trouble to the circuit". The wording has evolved, but the result is still the same - finding the specific root cause.
Symptom Recognition
This first step kicks off the overall troubleshooting process. Often for IT professionals this happens when someone calls the helpdesk or puts in a ticket. IT staff might also be alerted by a monitoring tool that a system has gone offline. At this point we know "something is wrong", but there's no indication of exactly what it is. Begin the troubleshooting process and respond with urgency.
Symptom Elaboration
Now that we know something is wrong it's time to begin asking questions. Here's a list of some questions that I like to ask my users when they come to me with a problem:
- What aren't you able to do?
- Were you able to do it before?
- Is it just you, or is this happening to others?
- Has it ever worked?
- Has anything changed recently?
When dealing with non-technical users it's important to understand that they may not be able to fully articulate what they're experiencing when giving us answers. For example, a common report I get when troubleshooting network outages is, "the internet is down!". While this isn't strictly true, most users don't have the training to understand the distinction between LAN, WAN, and the internet. It's not their job to understand that the LAN isn't working and the internet is still there waiting for them. A certain amount of interpretation is needed, and the skills to do it come with time and experience.
During this step I'm also looking for sights, sounds, and smells. Loss of power is typically easy to spot because there will be a lack of LEDs, and the conspicuous sound of silence where there should be the whirring of cooling fans. The smell of burning plastic and electronic components is very distinctive as well.
List Probable Faulty Functions
The first step started up the troubleshooting process, and the second step probed the general nature of the fault. Now we'll brainstorm what the general cause of the fault could be. First though, we need to define what a "function" is. For the purposes of troubleshooting IT systems, a "function" is a general area of operation. Some IT professionals refer to these as "silos" or "domains" as well. The following examples are all functions or silos that can fault-out for one reason or another:
- Environmental Controls
These are all very broad and that's the point of this step. We'll brainstorm which function could be the cause of our fault, and we'll also rule out which could not be. This points us in the right general direction. It's important to note what could not be the cause of a fault because that prevents us from wasting time on an unrelated system. When a technician gets pulled in the wrong direction and troubleshoots a function unrelated to the fault it's sometimes called "going down the rabbit hole".
If the lights are on in a server room, and hardware LEDs on front panels are blinking while the fans whir away, the Power domain can probably be ruled out. If one of those servers with blinking lights cannot be pinged or accessed remotely it's a fair guess that the Network domain might be faulted. There is also a possibility that a hardware failure has occurred on the server, taking it off the network. Depending on past reliability of your servers, you may or may not include the Server domain in your list of possible faulty functions.
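The ruling-in and ruling-out reasoning above can be written down as a small decision function. This is only a sketch of the server room example: the `Observations` fields, the `probable_faulty_functions` name, and the `servers_reliable` flag are hypothetical names chosen for illustration, not part of the Navy procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observations:
    """What we can see and hear at the rack (hypothetical fields)."""
    room_lights_on: bool
    server_leds_blinking: bool
    fans_running: bool
    responds_to_ping: bool


def probable_faulty_functions(obs: Observations, servers_reliable: bool = True) -> list[str]:
    """Return the domains still worth investigating, most likely first."""
    suspects = []
    # Lights, LEDs, and fans together suggest the Power domain is healthy.
    powered = obs.room_lights_on and obs.server_leds_blinking and obs.fans_running
    if not powered:
        suspects.append("Power")
    # A powered server that cannot be pinged points at the Network domain.
    if powered and not obs.responds_to_ping:
        suspects.append("Network")
        # Only add the Server domain when past reliability gives us a reason to.
        if not servers_reliable:
            suspects.append("Server")
    return suspects


obs = Observations(room_lights_on=True, server_leds_blinking=True,
                   fans_running=True, responds_to_ping=False)
print(probable_faulty_functions(obs))  # ['Network']
```

The value of writing it this way is that the ruled-out domains are explicit: Power never appears in the result while the lights, LEDs, and fans are all on.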
Localize the Faulty Function
At this stage we begin to actively search within those brainstormed domains likely at fault. We want to narrow the cause down to a specific domain and focus our efforts further. Going back to our server example, we suspect that the network could be faulted, or possibly the server hardware itself. The server is powered on, with LEDs lit and fans whirring away. Looking at the network connection on the back of the server, our NIC's lights show both a lit link LED and a blinking activity LED. This tells us the cable is connected and the NIC is powered on - the server hardware domain is looking less likely.
Running a traceroute to the server's address shows successful hops all the way to the switch that the server connects to. That switch is the final hop, after which all packets are lost. Based on that result it appears likely that the Network domain is the culprit.
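Finding the point where a path goes quiet is simple enough to sketch in code. The example below assumes the traceroute results have already been collected into `(address, responded)` pairs; that record shape, the function name, and the addresses are all hypothetical and do not come from any particular traceroute tool.

```python
from typing import Optional, Sequence, Tuple

# Each hop is (address, responded). These records are hypothetical stand-ins
# for whatever your traceroute tool reports.
Hop = Tuple[str, bool]


def last_responding_hop(hops: Sequence[Hop]) -> Optional[str]:
    """Return the address of the last hop that answered, or None if none did."""
    last = None
    for address, responded in hops:
        if responded:
            last = address
    return last


# Made-up path: two hops answer, then the probes to the server are lost.
path = [("10.0.0.1", True), ("10.0.20.2", True), ("10.0.30.5", False)]
print(last_responding_hop(path))  # 10.0.20.2
```

The device at the returned address is where the investigation continues, which in our example is the switch the server connects to.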
Localize the Faulty Component
Now that we know the network domain is most likely where the fault resides, we home in on the actual cause. Looking at the NIC's lights showed us that the interface is on and connected. The connection at the other end of the cable must be there as well, otherwise there would be no link light. A traceroute to the server stopped at the switch, so we'll log into the switch and investigate further. A list of all the switch ports shows that the server's port is enabled, with speed and duplex set to autonegotiate.
We know that our network is segmented using VLANs, so we list the VLANs configured on the switch and their associated ports. The port that connects the server is assigned to VLAN number 1 - that's the default VLAN, not the server VLAN. This explains why we have a good physical connection with link lights, but no network traffic.
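Comparing what the switch reports against what the documentation says is the kind of check that is easy to automate. Here is a minimal sketch that assumes both sets of port-to-VLAN assignments are already available as dictionaries; the port names and the server VLAN number 20 are made up for illustration, and only VLAN 1 as the default comes from the example above.

```python
def find_vlan_mismatches(
    documented: dict[str, int], configured: dict[str, int]
) -> dict[str, tuple[int, int]]:
    """Map each port to (documented VLAN, configured VLAN) where the two differ.

    Ports that are documented but absent from the switch output are skipped.
    """
    mismatches = {}
    for port, expected in documented.items():
        actual = configured.get(port)
        if actual is not None and actual != expected:
            mismatches[port] = (expected, actual)
    return mismatches


# Hypothetical data: the server's port should be in VLAN 20 but sits in VLAN 1.
documented = {"Gi1/0/12": 20, "Gi1/0/13": 30}
configured = {"Gi1/0/12": 1, "Gi1/0/13": 30}
print(find_vlan_mismatches(documented, configured))  # {'Gi1/0/12': (20, 1)}
```

A check like this only helps if the documentation is kept current, since it treats the documented assignment as the source of truth.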
Failure Analysis
At this final step we correct the fault and document the process. In the case of our server, setting the port to the right VLAN restored network connectivity, and our users could access the server once again. Once the fault is fixed we need to verify that operations have returned to normal. It's important to follow up with whoever originally reported the fault and ensure that it's been fully resolved. This leads us to the point where we ask questions and document the process. By documenting the fault we make it possible for future technicians to fix the same issue much faster if they experience it again.
Here are the questions I like to ask when documenting a fault:
- What was wrong?
- What symptoms did we see?
- What was the cause?
- How do we prevent it from happening again?
The fault documentation might go something like this:
The network port attached to server ABC123 was placed in the wrong VLAN, breaking network connectivity. The server was powered on and had link lights, but couldn't be reached over the network. Switchport status was up, but the port's VLAN assignment did not match our documentation. A technician "fat-fingered" the port number when changing the VLAN for another host, accidentally knocking the server offline. Putting the switchport back on the server VLAN restored connectivity.
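If your team keeps fault write-ups in a ticketing system or a shared repository, a fixed structure helps make sure all four questions get answered every time. The sketch below is one possible shape, not a standard format: the `FaultRecord` class and its field names are hypothetical, and the prevention text is an illustrative placeholder.

```python
from dataclasses import dataclass


@dataclass
class FaultRecord:
    """One documented fault, with a field for each of the four questions."""
    what_was_wrong: str
    symptoms: list[str]
    cause: str
    prevention: str

    def summary(self) -> str:
        """Render the record as a short plain-text report."""
        lines = [
            f"Fault: {self.what_was_wrong}",
            "Symptoms: " + "; ".join(self.symptoms),
            f"Cause: {self.cause}",
            f"Prevention: {self.prevention}",
        ]
        return "\n".join(lines)


record = FaultRecord(
    what_was_wrong="Switchport for server ABC123 was in the wrong VLAN",
    symptoms=["Server powered on with link lights", "Server unreachable over the network"],
    cause="Wrong port number entered while changing the VLAN for another host",
    prevention="Review VLAN changes through change management",  # illustrative only
)
print(record.summary())
```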
Preventing the fault from happening again can be tricky. A mix of training, mentoring, good documentation, and change management processes can reduce the chance of a repeat. Even informal knowledge sharing within an IT team is better than nothing. During a weekly meeting it's good to recap faults quickly with the following points:
- This is what happened
- This is what we saw while troubleshooting
- Here's how we fixed it
Doing this week-over-week grows the knowledge base within an IT team and helps develop good troubleshooters.