Monday, May 23, 2011

My hardest bug

Lately I've been thinking a lot about the question "What is the hardest bug you've ever tackled?" It's an interesting index into the interests and styles of different software engineers, and it usually provides a jumping-off point for a really great conversation.

With that in mind, I figured it was time to write up my hardest bug. This is a pretty old one, but I still enjoy it for just how head-slappingly bad the eventual resolution was.

Early in the year 2000, I was working on a cool cross-platform data warehousing product. One of the coolest things about it was that we had a vendor-independent SQL macro layer, so normal developers could write in one SQL dialect and it would automatically translate to each of the underlying platforms (e.g., SQL Server, Oracle).
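
To make the idea concrete, here's a minimal sketch of how a macro layer like that can work. The macro names, per-dialect expansions, and function below are invented for illustration; they aren't our actual product code.

    # Minimal sketch of a vendor-independent SQL macro layer. The macro
    # names and per-dialect expansions are invented for illustration.
    DIALECT_MACROS = {
        "oracle":    {"$NOW$": "SYSDATE",           "$SUBSTR$": "SUBSTR"},
        "sqlserver": {"$NOW$": "GETDATE()",         "$SUBSTR$": "SUBSTRING"},
        "db2":       {"$NOW$": "CURRENT TIMESTAMP", "$SUBSTR$": "SUBSTR"},
    }

    def expand_sql(template, dialect):
        """Expand vendor-neutral macros into the target platform's SQL."""
        sql = template
        for macro, replacement in DIALECT_MACROS[dialect].items():
            sql = sql.replace(macro, replacement)  # a straight string substitution
        return sql

    # Prints: SELECT SUBSTR(name, 1, 3), CURRENT TIMESTAMP FROM reports
    print(expand_sql("SELECT $SUBSTR$(name, 1, 3), $NOW$ FROM reports", "db2"))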

My first project was to port the reporting engine to DB2. It was relatively straightforward: it just required figuring out some of the idiosyncrasies of DB2, auditing the code to match them, and getting the regression tests to pass. It was a fun project and shipped without much trouble.

A couple of months later, though, our largest DB2 customer called in with major complaints about reporting performance. This was my responsibility, so I took a copy of their logs and started digging in. The weird thing was, the logs showed all the database queries running very quickly, and yet, periodically, there were these 30-second gaps in the logs, as though nothing were happening at all.

I instrumented the code a bit more, shipped a patch to the customer, and asked them to send back new versions of the logs. Everywhere I had added log statements, things continued to run smoothly, and yet there were still those odd gaps where nothing was happening.

In desperation, I asked the customer to send me a full copy of their datamart, which they did. After taking a day to get the system up locally, I tried out the problem operations, and they ran perfectly fine: no performance issues, no weird gaps in the logs, nothing. Now I was really annoyed.

I returned to instrumenting the code, adding a log statement after nearly every line. When the customer sent me the output of this very detailed logging, I could identify the exact line of source code that corresponded to the gaps, but I looked at it and said, "there's no way this could be taking 30 seconds; all it's doing is macro replacement from our SQL meta-language to DB2. It's a straight string substitution."

Still, I dug into it a little more, and that’s when my mind was blown. It turned out that the developer who had ported the macro language to DB2 had changed the string-replacement function to do a database query as part of the replacement step. The query had a performance profile that would grow as the square of the size of a certain internal log table, so the longer a system had been in production, the longer it would take to do string substitution.
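
In other words, what should have been a pure string operation had quietly turned into something like the sketch below. This is a reconstruction, not the original code: the table layout and the exact shape of the hidden query are my guesses, but the failure mode is the same, an all-pairs comparison whose cost grows with the square of the log table's size.

    # Reconstruction of the failure mode, not the original code. The straight
    # version is pure string work; the ported version sneaks a lookup into
    # every substitution that compares each log row against every other one,
    # which is roughly what an unindexed self-join on the log table does.

    def expand_straight(template, macros):
        sql = template
        for macro, replacement in macros.items():
            sql = sql.replace(macro, replacement)   # cheap, pure string work
        return sql

    def expand_with_hidden_query(template, macros, log_table):
        sql = template
        for macro, replacement in macros.items():
            # Hidden cost: grows as the square of len(log_table), so the
            # longer the system has been in production, the slower this gets.
            hits = sum(1 for a in log_table
                         for b in log_table
                         if a["session_id"] == b["session_id"])  # unused; only the cost matters here
            sql = sql.replace(macro, replacement)
        return sql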

It hadn't shown up in my local test because I had only replicated the warehouse data, not the operational data, and it didn't show up in the production logs because the developer had deliberately routed the query around our built-in query logging, covering his tracks.

Once we hit on this, it was a matter of minutes to refactor the n^2 query and ship a patch to the customer.
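
The exact rewrite isn't the interesting part, but the general shape of that kind of refactor is to turn the all-pairs comparison into a single pass with an aggregate (or an index). Continuing the sketch above:

    from collections import Counter

    def count_session_pairs_linear(log_table):
        # Same answer as the all-pairs comparison above, but in one pass:
        # count rows per session, then sum the squares of the counts. In SQL
        # terms, roughly a GROUP BY (backed by an index) instead of an
        # unindexed self-join.
        per_session = Counter(row["session_id"] for row in log_table)
        return sum(n * n for n in per_session.values())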
