Troubleshooting philosophy – Windows event log error analysis

An event log analysis can be triggered by different circumstances. One of the most common is the reactive Windows error log search: “reactive” because it is done in reaction to a problem, and “error log” because what we are looking for is error events that will help us identify the origin of the problem. When searching the error log entries (error events in the Application and System logs), it can be difficult to establish a direct relation between the problem being experienced and the error event itself.

Normally the events are analyzed after the problem occurs. One of the most frequent actions when troubleshooting is to filter the log for the same event and check whether its past occurrences coincide with past occurrences of the problem.

There are distinct cases in which this is helpful:
– The problem has never happened before and the event has never appeared before: there is a high probability that the event is related to the problem.
– The problem has happened before but the event did not appear at those times: there is a low probability that the event is related to the problem.
– The event is much more frequent than the problem, or the event is seen frequently although the problem never happened before: there is a low probability that the event is related to the problem.
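
The three cases above can be sketched as a small decision function. This is only an illustration: the function name, the "unclear" fallback for situations the list does not cover, and the ratio used to decide what "much more frequent" means are all assumptions, not part of any Windows tool.

```python
def relation_likelihood(past_problems, past_events, ratio=3.0):
    """Rate how likely an event is related to the problem being investigated.

    past_problems: how many times the problem occurred before (hypothetical count)
    past_events:   how many times the event was logged before (hypothetical count)
    ratio:         assumed threshold for "much more frequent"; tune as needed
    """
    if past_events == 0:
        # New event: suspicious only if the problem is new as well.
        return "high" if past_problems == 0 else "low"
    if past_events >= ratio * max(past_problems, 1):
        # Event is far more common than the problem (or the problem is new
        # while the event is seen frequently): probably noise.
        return "low"
    # Not covered by the three cases in the text.
    return "unclear"


print(relation_likelihood(0, 0))   # new problem, new event
print(relation_likelihood(2, 0))   # old problem, event never seen with it
print(relation_likelihood(1, 50))  # event far more frequent than the problem
```

The counts would come from filtering the log for the same event, as described above.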

When searching for events logged due to an error, filtering out the “noise” is a very big step towards finding the right events to analyze.
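
One possible way to filter out noise is to drop events that were already being logged regularly before the problem appeared. The sketch below assumes the events have already been exported into a list of dictionaries; the field names, the event IDs, and the `max_baseline_count` threshold are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def filter_noise(events, window_start, max_baseline_count=2):
    """Keep events logged at or after window_start whose ID was rare before it."""
    # Count how often each event ID was logged before the investigation window.
    baseline = Counter(e["event_id"] for e in events if e["time"] < window_start)
    return [
        e for e in events
        if e["time"] >= window_start and baseline[e["event_id"]] <= max_baseline_count
    ]


# Arbitrary illustrative data: event 1001 is logged every day, 2002 is new.
events = [
    {"time": datetime(2024, 1, 1, 9, 0), "event_id": 1001},
    {"time": datetime(2024, 1, 2, 9, 0), "event_id": 1001},
    {"time": datetime(2024, 1, 3, 9, 0), "event_id": 1001},
    {"time": datetime(2024, 1, 4, 9, 5), "event_id": 1001},
    {"time": datetime(2024, 1, 4, 9, 6), "event_id": 2002},
]
candidates = filter_noise(events, window_start=datetime(2024, 1, 4, 9, 0))
print([e["event_id"] for e in candidates])
```

Here the recurring event is treated as noise and only the new event remains as a candidate. The threshold is a judgment call and should be adapted to how long the baseline period is.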

After finding a Windows error log entry that we think is most likely related to the problem, the events logged shortly before and after it should also be analyzed. Even if they are warnings or informational entries, they can be related to the error.
A deeper analysis of the Windows logs (not only checking for error events) frequently reveals a sequence of events. For example: a warning that a service did not respond, followed by an informational entry that the service is restarting, a warning about a problem loading a DLL, the error itself, and finally an informational entry that the service started successfully.

In this example, the events before and after the error seem related to the error itself. We now know that there is probably an issue with a DLL and also with a service, and that both can be related to the error.
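
Collecting the events around an error can be sketched as a simple time-window selection. The example assumes the log has been exported into a list of dictionaries; the field names, the messages, and the five-minute window are illustrative choices, not values prescribed by Windows.

```python
from datetime import datetime, timedelta


def events_around(events, error_time,
                  before=timedelta(minutes=5), after=timedelta(minutes=5)):
    """Return all events, of any level, logged close to error_time, oldest first."""
    start, end = error_time - before, error_time + after
    nearby = [e for e in events if start <= e["time"] <= end]
    return sorted(nearby, key=lambda e: e["time"])


# Hypothetical log mirroring the service/DLL sequence described above.
log = [
    {"time": datetime(2024, 1, 4, 8, 0, 0), "level": "Information",
     "message": "Unrelated scheduled task completed"},
    {"time": datetime(2024, 1, 4, 9, 0, 10), "level": "Warning",
     "message": "Service did not respond"},
    {"time": datetime(2024, 1, 4, 9, 0, 20), "level": "Information",
     "message": "Service is restarting"},
    {"time": datetime(2024, 1, 4, 9, 0, 25), "level": "Warning",
     "message": "Problem loading a DLL"},
    {"time": datetime(2024, 1, 4, 9, 0, 30), "level": "Error",
     "message": "The error under investigation"},
    {"time": datetime(2024, 1, 4, 9, 0, 45), "level": "Information",
     "message": "Service started successfully"},
]

error = next(e for e in log if e["level"] == "Error")
context = events_around(log, error["time"])
for e in context:
    print(e["time"].time(), e["level"], e["message"])
```

The entry logged an hour earlier falls outside the window, while the warnings and informational entries surrounding the error are kept for review.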

Windows error log analysis can be difficult due to the amount of information found, and there is a risk of losing focus on the search for the one event that will lead to the solution.

As stated above, filtering and carefully choosing the events on which to concentrate the investigation will help.
A clear case that shows the importance of this method is the classic “solve one thing, break another”. It can happen when you act on some other issue that you only discovered because of its Windows error log entry, and in doing so change the environment so that the original error manifests itself in a different way or with a different frequency. If repeated, this kind of action (fixing an error different from the original one) can lead to cascading events that are likely to drive you further away from solving the original problem.

So, while doing the research, one should ignore (for the duration of a particular analysis) other errors that are unrelated, or not critically related, to the problem.

In addition, circumstances can change after the problem first manifests itself. After a reboot, for example, the same software or hardware component may still be failing but be registered in the Windows error log in a different way, producing different event entries.

Remediation should often be treated as an iterative process, carried out in rounds: we detect an issue, find evidence that leads to a potential solution, test the solution, and restart the process or the operating system.
In a truly complex investigation of an issue and its related entries in the Windows log, it is also useful to build a chain of events. A chain is most reliable when we are sure that its events are related and occur in a defined sequence, and that none of them regularly appears on its own, unrelated to the error event we are focusing on. Because of the large number of events that are logged, unrelated events can be registered between two events that belong to the chain; these intermediate events may need to be eliminated.
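
Checking whether a known chain of events appears in a log, while skipping unrelated entries in between, amounts to an ordered subsequence search. The sketch below assumes each log entry has been reduced to its event ID; the function name and the IDs are hypothetical.

```python
def find_chain(logged_ids, chain):
    """Return the positions where the chain's events occur in order, or None.

    Unrelated events between two links of the chain are simply skipped.
    """
    positions = []
    next_step = 0
    for index, event_id in enumerate(logged_ids):
        if next_step < len(chain) and event_id == chain[next_step]:
            positions.append(index)
            next_step += 1
    return positions if next_step == len(chain) else None


# Hypothetical IDs: 900 stands for unrelated events logged in between.
chain = [101, 102, 103, 104]
print(find_chain([101, 900, 102, 900, 103, 104], chain))  # chain present
print(find_chain([102, 101, 103, 104], chain))            # wrong order: None
```

The same check can be run against the log of a later occurrence to see whether the chain shows up again.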

Once a problem has been truly understood, it can be replicated. In that case the same chain of events should appear again, which confirms that it is exactly the same problem.