Troubleshooting knowledge base

By | March 16, 2016

Some of us know the frustration of facing a problem we are sure we have seen before, but never recorded properly.

The outcome? We do all the work over again, wasting both time and the effort we already put in. A good record of the analysis and its key events can save a lot of time.

Most internal knowledge bases, in small and large companies alike, share the same fundamental building block: they map a set of keywords to a specific issue and its solution.
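A minimal sketch of that building block, assuming each entry is reduced to a title plus a list of keywords. The function names are hypothetical; the first entry borrows the vCenter example discussed later in this post, and the second is an invented placeholder.

```python
from collections import defaultdict


def build_index(entries: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each lowercase keyword to the titles of the issues it describes."""
    index: dict[str, set[str]] = defaultdict(set)
    for title, keywords in entries.items():
        for keyword in keywords:
            index[keyword.lower()].add(title)
    return dict(index)


def lookup(index: dict[str, set[str]], keyword: str) -> set[str]:
    """Return every issue tagged with the keyword (case-insensitive)."""
    return index.get(keyword.lower(), set())


# Issue title -> keywords describing it (the second entry is hypothetical).
entries = {
    "Can't access vCenter": ["vCenter", "SQL Express", "connection error"],
    "Disk full on file server": ["disk space", "file server"],
}
index = build_index(entries)
print(lookup(index, "sql express"))  # {"Can't access vCenter"}
```

An inverted index like this is what lets a single remembered word lead you back to the full write-up.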

How could this work for you, and how does it relate to the Windows log file?

Quite simply, you can build your own knowledge base from the event information you are already analyzing.

If you are looking at a Windows log file, then either something is wrong, you are careful and run regular checks to prevent problems, or you are interested in understanding the flow of events in your operating system (in which case you have quite a bit of time on your hands).

I will assume you are with the majority: you are looking at a Windows log file because something is wrong.

Something you will find very useful in the future is a clear description of the problem and its solution. If it never happens again, great! But if it does, you will at least know what you did to solve it. You may even discover that your solution wasn’t the best, which is why the problem came back, so it is a good time to review that solution.

A good method for recording issues during Windows log file analysis is to choose the words that describe the problem and the event IDs (together with their sources) to associate with it.
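One way to hold such a record, sketched under assumptions: the `KBEntry` class and its fields are hypothetical, not part of any product. Events are stored as (source, event ID) pairs, because an event ID number is only unique within the source that logs it. The source name and ID in the usage lines are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class KBEntry:
    """One knowledge base record: the problem, its keywords and its events."""
    title: str
    description: str
    keywords: set[str] = field(default_factory=set)
    # Each event is a (source, event_id) pair, not a bare number, because the
    # same ID can mean different things for different sources.
    events: set[tuple[str, int]] = field(default_factory=set)

    def add_keywords(self, *words: str) -> None:
        """Store keywords trimmed and lowercased; blank words are ignored."""
        self.keywords.update(w.strip().lower() for w in words if w.strip())

    def add_event(self, source: str, event_id: int) -> None:
        self.events.add((source, event_id))


entry = KBEntry("Example issue", "What was seen, how it was read, what fixed it.")
entry.add_keywords("Connection Error", " database ", "")
entry.add_event("ExampleSource", 1000)  # hypothetical source and event ID
print(sorted(entry.keywords))  # ['connection error', 'database']
```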

The best way to illustrate this is with a real-life experience:

A SQL Express database has a maximum size of 10 GB.

When the database is full, it stops accepting new data and an error is logged.

vCenter (installed on Windows Server 2008) uses a SQL Express database by default and keeps history for a defined number of days, among other settings; all of these can eventually fill the 10 GB.

When vCenter fails, the error is not logged as a SQL database error but as a SQL connection error.

The first time you face this problem, which is actually simple to solve (you shrink the database or move it to a full SQL Server), you need more time to understand what is really causing it. For you, the issue isn’t called “SQL Express problems”; it is called “can’t access vCenter”. How do we relate the two?

The easiest way is to tell a story: explain in detail what you are seeing and how you are interpreting it, which events you are analyzing, and which actions you are taking.

After you have solved the problem, review all the information, organize it, and make sure it is accurate and easy to understand. By now you should have a good description of the issue, some keywords or sentences associated with it, and the event IDs and details that were important to the solution.

For the issue in my example, I would have a knowledge base entry named “Can’t access vCenter” with the tags “database access, SQL Express, database limit, windows 2008, connection error”. In the description of the entry I would list all the events that were relevant to my investigation, along with a detailed account of that investigation. I would also provide links to external articles and, possibly, some cases in which the same events led to different issues.
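If you keep the knowledge base as plain files, the entry above could be stored as JSON, one file per issue. This is a sketch under assumptions: the field names and the `save_entry`/`load_entries` helpers are hypothetical, and because the post does not list the specific events or links, those lists are left empty for you to fill in.

```python
import json
import tempfile
from pathlib import Path

# The vCenter entry described above; "events" and "links" are left to fill in.
entry = {
    "title": "Can't access vCenter",
    "tags": ["database access", "SQL Express", "database limit",
             "windows 2008", "connection error"],
    "description": "Detailed account of the investigation goes here.",
    "events": [],  # e.g. {"source": ..., "event_id": ..., "notes": ...}
    "links": [],   # external articles consulted
}


def save_entry(entry: dict, folder: Path) -> Path:
    """Write one entry as a JSON file named after its title."""
    folder.mkdir(parents=True, exist_ok=True)
    safe_name = "".join(c if c.isalnum() else "_" for c in entry["title"])
    path = folder / f"{safe_name}.json"
    path.write_text(json.dumps(entry, indent=2), encoding="utf-8")
    return path


def load_entries(folder: Path) -> list[dict]:
    """Read every entry stored in the folder."""
    return [json.loads(p.read_text(encoding="utf-8"))
            for p in sorted(folder.glob("*.json"))]


with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_entry(entry, Path(tmp))
    loaded = load_entries(Path(tmp))
print(saved_path.name)  # Can_t_access_vCenter.json
```

Plain JSON files are easy to search, diff and back up, which suits a personal knowledge base; a wiki or ticketing tool would serve the same purpose.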

Having solid documentation of this issue means that if it happens again, I can just look up the word vCenter, find this article, verify that the events in the Windows log files are the same, and solve the issue quickly based on past experience.
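The "verify that the events are the same" step can be partly automated if you export the log to CSV from Event Viewer. A sketch under stated assumptions: the column names "Source" and "Event ID" are assumed, so check the header row of your own export; the function names, source names and event IDs below are hypothetical placeholders.

```python
import csv
import io
from typing import Iterable


def events_in_export(lines: Iterable[str], source_col: str = "Source",
                     id_col: str = "Event ID") -> set[tuple[str, int]]:
    """Collect (source, event ID) pairs from an event log exported as CSV."""
    reader = csv.DictReader(lines)
    fields = reader.fieldnames or []
    if source_col not in fields or id_col not in fields:
        raise ValueError(f"expected columns {source_col!r} and {id_col!r}")
    found = set()
    for row in reader:
        source, event_id = row.get(source_col), row.get(id_col)
        # Skip rows with a missing source or a non-numeric event ID.
        if source and event_id and event_id.strip().isdigit():
            found.add((source.strip(), int(event_id)))
    return found


def compare_with_entry(expected: set[tuple[str, int]],
                       observed: set[tuple[str, int]]):
    """Split an entry's recorded events into those seen again and those missing."""
    return expected & observed, expected - observed


sample = io.StringIO(
    "Level,Date and Time,Source,Event ID,Task Category\n"
    "Error,3/16/2016 10:00:00 AM,ExampleSource,1000,None\n"
)
observed = events_in_export(sample)
expected = {("ExampleSource", 1000), ("OtherSource", 2000)}  # hypothetical
seen, missing = compare_with_entry(expected, observed)
print(seen)     # {('ExampleSource', 1000)}
print(missing)  # {('OtherSource', 2000)}
```

A partial match is still useful: it tells you which recorded symptoms are present now and which are not, which is exactly where a different root cause may be hiding.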

Also quite important is the possibility of contributing to the community that helped you in the past. That forum post from someone who had exactly the same issue as you, and that helped you find the solution, was written by someone who actually documented what happened. The more everyone contributes, the more everyone benefits from the solutions available.

And what can be more accurate than a good Windows log file analysis with concrete consequences and actions? A specific event, identified by its source and event ID, means the same thing on every computer running the same operating system.

If you spend 30 minutes documenting a solution that took you 1 hour to reach, don’t think of it as 30 minutes wasted; think of it as 3 hours saved if the problem happens 3 more times.
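The arithmetic behind that claim, as a small sketch: the 3 hours is the gross saving (3 repeats × 60 minutes), and subtracting the 30 minutes spent writing still leaves 2.5 hours. The `reuse_minutes` parameter is a hypothetical addition for the time the fix still takes when you have the write-up; the figure in the post assumes it is zero.

```python
def time_balance(doc_minutes: int, fix_minutes: int, repeats: int,
                 reuse_minutes: int = 0) -> tuple[int, int]:
    """Return (gross, net) minutes saved by documenting a fix.

    gross: time not spent re-solving from scratch on each repeat.
    net:   gross minus the time spent writing the documentation.
    """
    gross = repeats * (fix_minutes - reuse_minutes)
    return gross, gross - doc_minutes


gross, net = time_balance(doc_minutes=30, fix_minutes=60, repeats=3)
print(gross, net)  # 180 150
```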