Bug-killing Via Logs

I am a log watcher. I regularly “tail -f” numerous files such as catalina.out (Tomcat), log4j.log, and apache's error and access logs. By watching these, I can monitor conditions from test or production users and see what the system is really doing. Sometimes this opens interesting windows into the system and its day to day functioning and sometimes its simply a feel good. Since I have simple notification when something really bad happens, this is mostly to monitor systems for little oddities that pop up through rigorous repeated use. No, unfortunately, I'm not a perfect developer. Occassionally bugs creep in and cause problems… smart people realize this and act accordingly.

Anyway, in the past 6 months, I've been involved in numerous aspects of projects where I've recovered and heavily modified old codebases, tweaked existing systems, or deployed completely new ones. Over the weekend, I was installing some new functionality on CodeSnipers and performed the same process. I plugged into the access and error logs and monitored Drupal's logs to see what was going on.

CodeSnipers has been online since August of 2005 with little downtime, a few upgrades, and numerous additional modules, tweaks, and creative little mods. Luckily, Drupal has exceptional error handling and rarely dumps these back to the UI, so although users have rarely been affected, I decided to delve into a few of the errors. With tweaking just a few files, cleaning up one of the customizations, and refining the spam filters*, we're down to a handful of errors each day. To resolve those last ones, I'd have to get deeper into the code and functionality than time and responsibilities allow.

The point of all of this is that logging can be your friend…. your best friend. Yes, sometimes writing this information to disk or database can slow your application down, but if you want to know exactly how it's being used, there's really no alternative. Make the time to get familiar with a few key tools like grep on the command-line and “group by”, sum, and joins in sql. If you know how to parse your logs and watch for the occurance of particular errors – bugs or otherwise! – you can prioritize your efforts to resolve the most critical or annoying issues before considering any of the others. More importantly, you can often determine the exact conditions which caused the error and recreate them yourself at a later time. Better yet, if you have a collection of automated Unit or Integration Tests available, you can use the collected data to replay the process as necessary to fully exercise and evaluate the conditions to make sure it never comes back.

* If you've noticed improvement with the response time of CodeSnipers in the last few months, this is predominantly due to my spam filter tweaking. Previously we were getting upwards of 300 comments/minute at times. The new filter completely blocks an IP address for 24 hours after too much spam is received in a small window. It's still not perfect, but now I'm only seeing about 300/day. Regardless, it's nice to know we're so popular. 😉