Nerdy tidbits from my life as a software engineer

Wednesday, January 28, 2009

Parallel Performance Tools in VS2010

I just got done watching an internal presentation by Hazim Shafi on the parallel performance tools in Visual Studio 2010.  They are extremely awesome…and yet, I wonder what percentage of people in the development world are going to benefit from them.  We’re in a funny phase in the software world right now where most end-user apps are migrating towards simple, web-based applications that don’t need 4 cores and 8 gigs of RAM.  The popularity of netbooks is a good demonstration of this shift.  This is despite the fact that computing power continues to expand exponentially, and we can do some awesome things on a modern, mundane desktop.  It’s a strange thing, isn’t it?  We have these insanely powerful computers, and yet most people only seem interested in running pitifully simple applications that could easily run on a PC from 2002.

Anyways, for the engineers out there who are still focused on writing high-performance applications that take advantage of multiple cores / threads / etc., the new tools in VS2010 are going to be a very nice addition.  I can see, actually, how my old company would benefit from them.  Detecting explosives by examining X-ray images is a CPU-demanding exercise.  They’re going to enjoy this new feature, I’m sure.

Monday, January 19, 2009

The Universal Record / Replay Framework (TURRF)

An idea finally came to me the other day.  Actually, this has been brewing in my head for some time, but only recently have I decided it might be a good time to act on it.

A common problem with software is retracing what went wrong.  Some bugs are easier to reproduce than others, and one very common way to try and diagnose a problem that occurred on a particular machine is to spam your code with calls to a logger.  Now I’m sure we’ve all read plenty about the various virtues and pains of diagnostic logging, and we’ve probably all run into scenarios where logs have helped us find and fix a problem.  But in my opinion, the effectiveness of this strategy is limited to the following scenarios:

  1. The program just happened to log information that was relevant to the problem that occurred.
  2. The log messages point conclusively to a particular block of code failing.
  3. You can change the code and verify that the fix works without additional logging.
  4. The log data can be unwound in a sequential manner.

If any of these four conditions is false, then fixing a bug based on log data alone becomes only slightly better than guesswork.  This isn’t to say it’s not helpful; I’m just arguing that its helpfulness is mostly marginal.  In an effort to improve the probability that log data can help you solve a problem, many engineers resort to spamming their logs with as much diagnostic data as possible.  The result is usually a bloated text file with mostly useless data and a few important clues hidden somewhere in the pile of garbage.

I mean, how much information should you log?  Say you take a lax approach (which is what I usually do) and only log exceptions.  Is it helpful to know that you got an ArgumentException in MyAwesomeCode.cs on line 327?  Probably.  In my experience that’s usually enough.  But not always.  Retracing the code and figuring out why the argument was invalid is not always an easy thing to do.  There is often far too much variance and randomness in your customers’ environments, and reproducing the particular sequence of events that created the invalid argument can literally be impossible outside of pure speculation.

Alright, say you take the spammer’s approach.  Now you are tasked with mapping various useless log statements to particular lines in code files in an effort to retrace the sequence of events that led up to the error.  The point of doing this is to try and understand what the value of everything was at the time the log was written.  In other words, we’re trying to debug a complex sequence of events in our heads by mapping lines of nonsense to lines of code.  This is sort of like using a debugger from the year 1630.

So our current solution to this problem is to log as much data as we can and hope for the best…which just seems so, I don’t know, 5th grade to me.  And this is where the idea for the universal record / replay framework was born (I think I will christen it “TURRF” from here on out.  That sounds pretty catchy to me).

Here’s the idea.  I want to borrow a page from the general strategy of mock frameworks and apply it to diagnostics.  In a mock framework, we set up a unit test by recording a sequence of method invocations on specific objects and simulating what those methods would return if they were actually invoked.  Then, when a method is invoked during the test, we walk through the graph that was created during the recording process and validate that the right methods on the right objects were in fact invoked as we expected them to be.  Mock frameworks therefore have two stages: the first stage is where they are recording method calls, and the second stage is where they are replaying them.  The idea is to separate dependencies from your unit of work by simulating their behavior without actually invoking them.

(Actually, I should clarify that I don’t really know how Moq or Rhino work. But that’s how I did it with my framework, and it seemed logical when I designed it.)
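
To make that two-stage model concrete, here’s a minimal sketch of the record / replay idea as I understand it (not how Moq or Rhino Mocks are actually implemented; every name in it is made up):

using System;
using System.Collections.Generic;

// A toy record / replay mock: Expect() records the calls we anticipate,
// Replay() flips the switch, and Invoke() hands back the canned values.
public class RecordReplayMock
{
    private readonly Dictionary<string, object> _expectations = new Dictionary<string, object>();
    private bool _replaying;

    // Record stage: declare that a call to this method should return this value.
    public void Expect(string methodName, object returnValue)
    {
        if (_replaying)
            throw new InvalidOperationException("Cannot record after Replay() has been called.");
        _expectations[methodName] = returnValue;
    }

    // Switch from the record stage to the replay stage.
    public void Replay()
    {
        _replaying = true;
    }

    // Replay stage: the code under test "calls" a method and we validate
    // that it was expected, then return the recorded value.
    public object Invoke(string methodName)
    {
        object result;
        if (!_replaying || !_expectations.TryGetValue(methodName, out result))
            throw new InvalidOperationException("Unexpected call: " + methodName);
        return result;
    }
}

A real framework does something much smarter than string keys, of course, but the two phases are the same.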

If we took this same basic principle of abstracting dependencies from your code via recording / replaying, but instead of applying it to testing, we applied it to an actual, running, production environment, we could essentially do the same thing with our applications.  Imagine, for instance, that we had a library that contained a number of shim objects.  These objects encapsulate all interaction with outside dependencies (such as file operations, reading registry keys, reading environment variables, etc.) and exist in one of two states: they are either recording or replaying.  When they are recording, they invoke the actual dependent object and record its return value to some type of log.  When they are replaying, instead of calling the actual dependency, they read the log and return the recorded value.
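
Here’s a rough sketch of what one of those shims might look like.  To be clear, none of this is a real API: TurrfMode, EnvironmentShim, and the log format are all hypothetical, and a real implementation would also record which variable was asked for so the replay could be sanity-checked.

using System;
using System.Collections.Generic;

public enum TurrfMode { Passthrough, Recording, Replaying }

// Hypothetical TURRF shim around one external dependency: environment variables.
public class EnvironmentShim
{
    private readonly TurrfMode _mode;
    private readonly Queue<string> _replayLog;   // values read back during replay
    private readonly List<string> _recordLog;    // values captured during recording

    public EnvironmentShim(TurrfMode mode, Queue<string> replayLog, List<string> recordLog)
    {
        _mode = mode;
        _replayLog = replayLog;
        _recordLog = recordLog;
    }

    public string GetEnvironmentVariable(string name)
    {
        if (_mode == TurrfMode.Replaying)
        {
            // Don't touch the real environment; hand back whatever the
            // customer's machine returned when the log was recorded.
            return _replayLog.Dequeue();
        }

        // Invoke the actual dependency...
        string value = Environment.GetEnvironmentVariable(name);

        // ...and, if we're recording, capture its return value in the log.
        if (_mode == TurrfMode.Recording)
            _recordLog.Add(value);

        return value;
    }
}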

So here’s the idea.  Take any system where you have chronic problems diagnosing bugs because the external dependencies on everybody’s system are so diverse and different.  We refactor every place that calls external dependencies to go through TURRF shims instead.  Now some customer calls you and complains about a bug.  So you tell them to switch the diagnostic logging on and send you the log the next time they get the problem.

Now you start your program and feed the TURRF API the customer’s replay log.  The program starts as usual, except that all of the interaction with external dependencies now goes through the TURRF shims instead of your actual environment, and those shims are replaying the behavior that occurred on the customer’s machine.  So now, when you attach your debugger to the program, you can step through the code and diagnose it as though you were on the customer’s machine at the time the error occurred.  You can actually simulate their environment exactly as it existed at the time it was recorded.  Imagine how much more useful that would be than peering at a semi-useless text file filled with misspelled warnings and messages!
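
In code, the developer’s end of that might look something like this (again, purely hypothetical; it assumes the EnvironmentShim sketched above and a replay log that is just a flat list of recorded return values):

// Load the customer's replay log and put the shim into replay mode.
var recordedValues = new Queue<string>(System.IO.File.ReadAllLines("customer.turrf"));
var environment = new EnvironmentShim(TurrfMode.Replaying, recordedValues, null);

// From here on, anything that asks the shim for an environment variable gets
// the value the customer's machine returned, so stepping through under the
// debugger reproduces their environment as it was recorded.
string temp = environment.GetEnvironmentVariable("TEMP");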

So that’s the idea.  What do you think?  Obviously, there are a number of challenges:

  1. TURRF will only be able to record and replay data that are serializable.
  2. Multi-threaded code will be very difficult to cram into this framework.
  3. The overhead of recording will almost certainly be high enough that you wouldn’t want to turn it on unless you were actively trying to trap a problem.
  4. GUIs will probably be tricky to record and replay.  As expected.
  5. Clearly, the code running on the customer’s machine would need to be identical to the code running on my machine.  Otherwise, the sequence of recorded return values won’t match.  Also expected.

But the benefits would be enormous.  No more guesswork.  Now you can hunt down the problem and fix it just like you would any other bug – the humane, civilized way.  In a 21st-century debugger.

I must confess that this idea was born out of personal pain.  The software I’m working on is very difficult to debug, and errors are often very specific to customers’ machines and environments.  Logs allow me to answer maybe 50% of their problems.  But the other 50% are nearly impossible to diagnose because I can’t possibly reproduce the problem on my computer.  A TURRF log would make it trivial to solve their problem.  And wouldn’t that be nice?

Saturday, January 17, 2009

The Problem with Live Mesh

I’ve been following the progress of the Windows Live desktop applications for the last six months or so.  I do this mostly because I now work for Microsoft and I think it’s important to keep up to date on all the things going on at the company I work for.  I think that Windows Live has a lot of potential to become popular and successful.  Certainly, some of the desktop applications are more appealing to me than others.  For instance, I use Windows Live Writer to author and manage my blog posts, and I should say that it’s 10 million times easier to use Live Writer than to use the online web form on Blogger.  I’m a big fan.

Another Live application is the Live Toolbar, which I must say is significantly improved in Wave 3.  What two websites do I have open almost every time I launch a web browser?  Well, that would be Gmail and Google Calendar (I would love to switch to Live Mail & Calendar – but there are a few key features currently missing that I just can’t live without.  Namely, automatic syncing of my calendar with my Blackberry and conversation view on Hotmail).  The new Live Toolbar integrates with your mail and calendar, so you can check both of them without opening the webpages.  I think it’s very nice.  And if people actually used the Windows Live networking system, then all those status updates / posted links / pictures / etc. would be integrated nicely as well.  And that would be very cool, if people actually start using it, of course.

Now there is one feature of the toolbar that I was initially pretty excited about, and that is automatic synchronization of your favorites.  On the surface this seemed like a great idea, because I’m always re-discovering my bookmarks on every computer I ever use or re-image, and I never have them synchronized or backed up properly.  It would be great if they all lived in one place in the “cloud” so that I could access them anywhere I wanted to, as if they were local links stored on my computer.

And that’s exactly what the bookmark synchronizer is supposed to do for me.  And it works.  Sort of.  It works great for adding new bookmarks.  What doesn’t work, at all, is anything that involves deleting bookmarks.

OK, some examples.  Say I want to move a group of bookmarks into a new folder.  Here is the order of events that occurs:

  1. On my local machine, a new folder is created and the bookmarks are added to that folder.
  2. On my local machine, the old bookmarks are deleted from their old location.
  3. On Sky Drive, a new folder is created and the new bookmarks are added to it.
  4. Live Sync notices that there are bookmarks missing on my machine, and so it downloads them from Sky Drive and puts them in my favorites folder. 
  5. The end result: my bookmarks now exist in two places, the old location and the new one.  This result is then copied to every computer that is synchronized with Sky Drive via the Windows Live Toolbar.

So you may think: well, this is just a bug; they can fix it.  Not so fast.  This is a difficult, technically challenging problem to solve.  How do you know, programmatically, that these files were moved and shouldn’t be replicated back into your favorites folder?  What if I go to another machine and launch the toolbar?  How is the toolbar supposed to know whether the files should be added or deleted?  And what happens if I log onto Sky Drive manually, remove some bookmarks, and then log back into the Live Toolbar?  How does the toolbar know that it should actually delete those bookmarks instead of re-adding them?
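
I don’t know how Live Mesh or the toolbar actually represent this stuff internally, but the usual way to make moves and deletes unambiguous is to give every item a stable ID and keep tombstones for deletions, rather than diffing folder contents.  A very rough sketch of the difference (all names hypothetical):

using System;
using System.Collections.Generic;

// Hypothetical sync record: a stable ID that survives moves and renames,
// plus a tombstone flag so "deleted on purpose" looks different from
// "never existed here".
public class BookmarkEntry
{
    public Guid Id;          // stays the same when the bookmark moves
    public string Folder;    // current location
    public string Url;
    public bool Deleted;     // tombstone: don't re-add this, remove it
}

public static class SyncLogic
{
    // Naive folder diffing can't tell a move from a delete-plus-add, which
    // is exactly how the duplicated favorites happen.  With stable IDs and
    // tombstones the intent is unambiguous (conflict handling omitted).
    public static void ApplyRemoteChanges(IDictionary<Guid, BookmarkEntry> local,
                                          IEnumerable<BookmarkEntry> remote)
    {
        foreach (BookmarkEntry entry in remote)
        {
            BookmarkEntry mine;
            bool known = local.TryGetValue(entry.Id, out mine);

            if (entry.Deleted)
            {
                if (known) local.Remove(entry.Id);   // honor the delete
            }
            else if (!known)
            {
                local[entry.Id] = entry;             // genuinely new bookmark
            }
            else
            {
                mine.Folder = entry.Folder;          // same bookmark, just moved
                mine.Url = entry.Url;
            }
        }
    }
}

Even then you still have to decide what to do when two machines edit the same bookmark, which is where the version-control comparison below comes in.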

So here’s the real challenge with Live Mesh: it’s really supposed to be a glorified, easy-to-use, generic source control system.  And if it were engineered as a version control system, it would work fine.  But the reason it’s not engineered as a version control system is that most people on this planet don’t know what version control systems are, have no clue how to use them, and don’t want to bother learning them.  My mother would be lost trying to resolve conflicts and merge files.  And so for that reason Live Mesh was kept as simple as possible.  But as long as it’s kept this simple, this problem will never go away: we will always have problems resolving conflicts, merging files, and dealing with moves and deletes.  Which is why, cool as it is, I’m skeptical that it will work out the way we want it to.

There is one easy way to solve this problem: instead of trying to replicate data on multiple devices and trying to synchronize data between them all, put the data in one place and access it from multiple devices. This isn’t anything new, of course.  The reason web mail is so popular is precisely because it’s really, really convenient to be able to access the same inbox and deleted items from multiple computers without worrying about whether you already read and downloaded specific emails on specific computers.  Synchronizing email messages between multiple computers is hard, but accessing the same email messages from multiple web browsers is easy. 

This is precisely why internet applications are gaining so much popularity.  It’s not because they’re better than desktop applications, it’s that they’ve solved the inherent problem of data synchronization.  The way they solve this is to circumvent the problem entirely and store the data in one location.

So if we’re really trying to embrace software + services (which I do believe is a good strategy), then we need to stop pretending that there’s this need to solve the problem of file synchronization.  Instead, why not use Sky Drive as the place to store data across multiple devices and use Mesh more as a way to cache and access those data?  I would be much happier if my bookmarks existed solely on the internet but Explorer still allowed me to access them via the Favorites button as though they sat in my Favorites folder on my hard drive.  Isn’t that what software + services is supposed to be about?  Combining the power of desktop applications with the portability of the internet?

Monday, January 12, 2009

The Windows 7 Beta is Here!

As I’m sure you’ve heard, the Windows 7 Beta has been released.  If you’ve been following the news releases over the last few months, then none of the new features should be particularly surprising to you.  But screenshots and demos just don’t do it justice.  To really appreciate how awesome Windows 7 is, you need to see it for yourself – and I highly recommend you do so.

I downloaded and installed the beta on Saturday.  The installer took just 20 minutes from start to finish on my two-year-old laptop.  I haven’t had a single crash.  I’m contemplating putting it on my main machine, which needs to be paved anyways.  The only thing holding me back, frankly, is a lingering concern about Win7’s ability to play video games.  Although I personally don’t think Vista was the train wreck the rest of the world thinks it is, I will readily acknowledge that Win7 is the most innovative Microsoft operating system since Windows 95.  This is the first time since then that there’s been a fundamental change in the way the operating system handles task management, and it does, truly, make an enormous difference in the way you work.  It’s exciting to see it taking shape, and hopefully it bodes well for our future.

You really owe it to yourself to check it out.  It’s worth it.

Wednesday, January 7, 2009

Changing Return Types By Overriding Methods

The C# compiler will not allow you to do something like this:

public class A
{
    public object Test()
    {
        return new object();
    }

    // Differs from the first Test() only by return type, which the
    // compiler will not accept:
    public int Test()
    {
        return 0;
    }
}

error CS0111: Type 'A' already defines a member called 'Test' with the same parameter types

But if you really want to be able to do this sort of return-type overloading, you can actually accomplish it with a bit of inheritance magic:

public class A
{
    public object Test()
    {
        return new object();
    }

}

public class B : A
{
    // "new" hides A.Test rather than overriding it, which is what allows
    // the return type to change.
    new public int Test()
    {
        return 0;
    }
}

B b = new B();
System.Diagnostics.Debug.WriteLine(b.Test());        // calls B.Test and prints 0
System.Diagnostics.Debug.WriteLine(((A)b).Test());   // calls A.Test and prints System.Object

outputs:
0
System.Object

I would imagine that this is probably a dangerous thing to do, and I wonder if there are any practical applications for consciously using it in your design.  A coworker of mine did something similar because he wanted to prevent clients from accessing a property of a base class that had a particularly attractive name.  But since that property was non-virtual (and returned a different type), he decided to hide it using the new keyword in the subclass.  So long as the property is accessed through the subclass type, this works fine.  And I suppose that if the property were virtual, life would be more problematic in another way, since he wouldn’t be able to change the type of the property he was overloading.
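
For what it’s worth, the property version of that trick looks like this.  This is just my reconstruction of the pattern, not his actual code, and the names are made up:

public class LegacyBase
{
    // Non-virtual property with an attractive name but an inconvenient type.
    public string Value
    {
        get { return "legacy"; }
    }
}

public class Wrapper : LegacyBase
{
    // Hides LegacyBase.Value and changes its type.  This only takes effect
    // when the caller's reference is typed as Wrapper; cast back to
    // LegacyBase and you get the original property again.
    new public int Value
    {
        get { return 42; }
    }
}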

Still, I’m not sure I like the fact that the C# compiler allows this design.  I suppose there’s no reason it can’t be legal, but I definitely think it’s something to be avoided as much as possible.

Tuesday, January 6, 2009

Nerd With Hammer Seeks Nail

I need a development project to do in my spare time.  Something interesting that involves GUIs.  Preferably some WPF application.  It’s strange that I can’t think of anything; I’m usually teeming with ideas.  But right now I just keep drawing blanks.  Call it post-holiday lethargy, but I’m completely out of interesting thoughts at the moment.  But I need to do something.  I’ll put it up on CodePlex and open source it; or maybe, if it’s good enough, I’ll even try and make an extra buck or two on it.  Who knows.  But I definitely need a project to waste my spare time on.

I can see what draws people into open source development.  Why not spend my time working on, whatever, Firefox or Linux?  It’s a pretty interesting body of code, I bet.  But I’ll think of something, eventually.  Most of my good ideas come to me when I least expect them.  It would just be really nice to think of something sooner rather than later.