Nerdy tidbits from my life as a software engineer

Monday, January 19, 2009

The Universal Record / Replay Framework (TURRF)

An idea finally came to me the other day.  Actually, this has been brewing in my head for some time, but only recently have I decided it might be a good time to act on it.

A common problem with software is retracing what goes wrong.  Some bugs are easier to reproduce than others, and one very common way to try and diagnose a problem that occurred on a particular machine is to spam your code with calls to a logger.  Now I’m sure we’ve all read plenty about the various virtues and pains of diagnostic logging, and we’ve probably all run into scenarios where logs have helped us solve and fix a problem.  But in my opinion, the effectiveness of this strategy is limited to the following scenarios:

  1. The program just happened to log information that was relevant to the problem that occurred.
  2. The log messages point conclusively to a particular block of code failing.
  3. You can change the code and verify that the fix works without additional logging.
  4. The log data can be unwound in a sequential manner.

If any of these four conditions is false, then fixing a bug based on log data alone becomes only slightly better than guesswork.  This isn't to say logs aren't helpful, just that their helpfulness is mostly marginal.  In an effort to improve the probability that log data can help you solve a problem, many engineers resort to spamming their logs with as much diagnostic data as possible.  The result is usually a bloated text file with mostly useless data and a few important clues hidden somewhere in the pile of garbage.

I mean, how much information should you log?  Say you take a lax approach (which is what I usually do) and only log exceptions.  Is it helpful to know that you got an ArgumentException in MyAwesomeCode.cs on line 327?  Probably.  In my experience that’s usually enough.  But not always.  Retracing the code and figuring out why the argument was invalid is not always easy.  There is often far too much variance and randomness in your customers’ environments, and reproducing the particular sequence of events that created the invalid argument can be impossible outside of pure speculation.

Alright, say you take the spammer’s approach.  Now you are tasked with mapping various useless log statements to particular lines in code files in an effort to retrace the sequence of events that led up to the error.  The point of doing this is to try to understand what the value of everything was at the time the log was written.  In other words, we’re trying to debug a complex sequence of events in our heads by mapping lines of nonsense to lines of code.  This is sort of like using a debugger from the year 1630.

So our current solution to this problem is to log as much data as we can and hope for the best…which just seems so, I don’t know, 5th grade to me.  And this is where the universal record / replay framework idea was born (I think I will christen it “TURRF” from here on out.  That sounds pretty catchy to me).

Here’s the idea.  I want to borrow a page from the general strategy of mock frameworks and apply it to diagnostics.  In a mock framework, we set up a unit test by recording a sequence of method invocations on specific objects and simulating what those methods would return if they were actually invoked.  Then, when we invoke a method, we walk through the graph that was created during the recording process and validate that the right methods on the right objects were in fact invoked as we expected them to be.  Mock frameworks therefore have two stages: the first, where they record method calls, and the second, where they replay them.  The idea is to separate dependencies from your unit of work by simulating their behavior without actually invoking them.

(Actually, I should clarify that I don’t really know how Moq or Rhino work. But that’s how I did it with my framework, and it seemed logical when I designed it.)
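To make the two-stage idea concrete, here is a minimal toy sketch in Python (my own illustration, not how Moq or Rhino Mocks actually work internally): expected calls and canned return values are recorded, then each real invocation is validated against the recording.

```python
class SimpleMock:
    def __init__(self):
        self._expected = []   # recorded (method_name, args, canned_return) tuples
        self._cursor = 0      # position in the recording during replay

    def expect(self, name, args, returns):
        # Recording stage: register an expected call and the value to return.
        self._expected.append((name, args, returns))

    def invoke(self, name, *args):
        # Replaying stage: validate the call against the recording and
        # return the canned value without touching any real object.
        exp_name, exp_args, result = self._expected[self._cursor]
        if (name, args) != (exp_name, exp_args):
            raise AssertionError(f"expected {exp_name}{exp_args}, got {name}{args}")
        self._cursor += 1
        return result

mock = SimpleMock()
mock.expect("read_key", ("Version",), "1.2.3")        # record
assert mock.invoke("read_key", "Version") == "1.2.3"  # replay
```

The same shape carries over to diagnostics: swap "expected call in a unit test" for "dependency call logged on a customer's machine."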

If we took this same basic principle of abstracting dependencies from your code via recording / replaying, but instead of applying it to testing, we applied it to an actual, running, production environment, we could essentially do the same thing with our applications.  Imagine, for instance, that we had a library with a number of shim objects.  These objects would encapsulate all interaction with outside dependencies (such as file operations, reading registry keys, reading environment variables, etc.) and exist in one of two states: they are either recording or replaying.  When they are recording, they invoke the actual dependent object and record its return value to some type of log.  When they are replaying, instead of calling the actual dependency, they read the log and return the recorded value.

So here’s the idea.  Take any system where you have chronic problems diagnosing bugs because the external dependencies on everybody’s system are so diverse.  We refactor every place that calls external dependencies to go through TURRF shims instead.  Now some customer calls you and complains about a bug.  So you tell them to switch the diagnostic logging on and send you the log the next time they hit the problem.

Now you start your program and feed the TURRF API the customer’s replay log.  The program starts as usual, except that all of the interaction with external dependencies now goes through the TURRF shims instead of your actual environment, and those shims are replaying the behavior that occurred on the customer’s machine.  So now, when you attach your debugger to the program, you can step through the code and diagnose it as though you were on the customer’s machine at the time the error occurred.  You can actually simulate their environment exactly as it existed at the time it was recorded.  Imagine how much more useful that would be than peering at a semi-useless text file filled with misspelled warnings and messages!
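Here is a guess at what replaying a customer's log might look like in miniature, again as a Python sketch. The log format (one JSON record per dependency call, in invocation order) and the call names are invented for illustration; forward slashes are used in the fake registry path just to keep string escaping simple.

```python
import io
import json

# A pretend customer log, as it might arrive over email.
customer_log = io.StringIO(
    '{"call": "getenv", "arg": "TEMP", "ret": "C:/Temp"}\n'
    '{"call": "read_registry", "arg": "SOFTWARE/MyApp/Version", "ret": "2.0"}\n'
)

replay_queue = [json.loads(line) for line in customer_log]

def replay_call(name, arg):
    # Answer each dependency call from the log instead of the machine,
    # failing loudly if the program's call sequence drifts out of sync
    # with the recording (e.g. because the code versions differ).
    entry = replay_queue.pop(0)
    if entry["call"] != name or entry["arg"] != arg:
        raise RuntimeError(f"replay log out of sync at {name}({arg!r})")
    return entry["ret"]

assert replay_call("getenv", "TEMP") == "C:/Temp"
```

Stepping through this under a debugger, every environment query returns exactly what the customer's machine returned at record time.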

So that’s the idea.  What do you think?  Obviously, there are a number of challenges:

  1. TURRF will only be able to record and replay data that are serializable.
  2. Multi-threaded code will be very difficult to cram into this framework.
  3. The overhead of recording data will almost certainly be such that you wouldn’t want to turn this on unless you were trying to trap problems.
  4. GUIs will probably be tricky to record and replay.  As expected.
  5. Clearly, the code running on the customer’s machine would need to be identical to the code running on my machine.  Otherwise, the sequence of return values won’t match.  Also expected.

But the benefits would be enormous.  No more guesswork.  Now you can hunt down the problem and fix it just like you would any other bug – the humane and civilized way.  In a 21st century debugger.

I must confess that this idea was born out of personal pain.  The software I work on is very difficult to debug, and errors are often very specific to customers’ machines and environments.  Logs allow me to answer maybe 50% of their problems.  But the other 50% are nearly impossible to diagnose because I can’t reproduce the problem on my computer.  A TURRF log would make it trivial to solve their problem.  And wouldn’t that be nice?


ido Sarig said...

Check out - we've developed a product that does just this, based on the same principles you describe.

Michael J. Braude said...

That's very cool! Too bad it's in Java and I work for MSFT.

Following this post I wrote my own internal tool, which I am using in some new projects. But it's not nearly as robust as your solution.

ido Sarig said...

well, we've got a .NET solution coming down the road... stay tuned.