Nerdy tidbits from my life as a software engineer

Wednesday, July 8, 2009

The Beauty of Data-Driven Applications

A common problem I run into when writing applications is this:

I have a situation where a series of tasks need to be assembled and arranged in a way that does something complicated.  Each individual section may be simple, but the process as a whole is complicated.  The pieces of this process need to be arranged dynamically, and I want the ability to update them and slot new pieces in without disrupting the system as a whole.  What’s the best way to design such a system?

Of course, no matter what, you want something with lots of abstractions – ways that disconnected pieces can plug into each other without really knowing who their neighbors are.  That much is a given.  But where do you define the process as a whole?  In other words, where do you physically say, “For the process that I call ‘A’, first do Task1,  then do Task2, then do Task3”, etc.?

Perhaps the easiest and most obvious way to do this would be to use a simple object hierarchy.  Something like this:

Now, your library of Task objects will grow whenever you need to add some new small block of functionality.  And your library of Process objects will grow whenever you need to define a new process.  An individual Process object may be very stupid, and could simply look like this:

public class ProcessA : BaseProcess
{
    public ProcessA()
    {
        this.Tasks = new BaseTask[]
        {
            new TaskA(),
            new TaskB(),
            new TaskC(),
            ...
        };
    }
}
All of the business logic on how to execute a process can be contained inside the generic BaseProcess object, so the only thing the subclasses need to do is define the array of tasks to execute and their order.  In other words, the only purpose of the subclasses is to define the data that the parent class needs in order to execute.
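To make that concrete, here is a minimal sketch of what such a base class might look like.  The `BaseTask`, `Execute`, and `Run` names are my own placeholders, not something prescribed by any framework:

```csharp
using System;

// Hypothetical base class for a single unit of work.
public abstract class BaseTask
{
    public abstract void Execute();
}

// The generic process holds all of the execution logic.
// Subclasses only supply the Tasks array.
public abstract class BaseProcess
{
    protected BaseTask[] Tasks { get; set; }

    public void Run()
    {
        // Execute each task in the order the subclass defined.
        foreach (BaseTask task in Tasks)
        {
            task.Execute();
        }
    }
}
```

A subclass like ProcessA above then only needs to populate `Tasks` in its constructor; it contains no execution logic of its own.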
 
Things get more tricky, however, when more complicated connections need to be defined.  Just defining a sequence of tasks may not be enough.  Maybe we also need to define which task’s output feeds into which task’s input.  Where do we define that logic?  How do we represent it?  Potentially, we could just shove it into our current model and everything would be fine.  But we could soon find ourselves writing a lot of code that just glues these things together.  And that makes me wonder: how much decoupling have we really achieved by separating these tasks into separate procedures instead of just tightly coupling everything together in the first place?  After all, the whole purpose of this design is to decouple the tasks from one another so that we can arrange them in any number of ways.  All we’re really doing in this case is moving that coupling from the Task library to the Process library.
 
To some extent, we will never really get around this problem.  We may like to pretend that TaskB is decoupled from TaskA, but if TaskB requires some input that can only come out of TaskA, then this really isn’t the case.  The important thing to note, however, is that TaskB shouldn’t care where this input comes from – so long as it gets it.  The other important thing to note is that if TaskA produces this input, it shouldn’t care who uses it or what its purpose is.  So from TaskA’s and TaskB’s perspective, this dependency doesn’t exist.  But from the process’s perspective, it does.  The question is: where is the best place to define this dependency?
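One common way to express that kind of loose dependency – a sketch of my own, not something from the original design – is to have tasks read and write named values in a shared context.  A task declares what it needs, not who produces it:

```csharp
using System;
using System.Collections.Generic;

// Tasks communicate only through this shared bag of named values.
public class ProcessContext
{
    private readonly Dictionary<string, object> values = new Dictionary<string, object>();

    public void Set(string key, object value) { values[key] = value; }
    public object Get(string key) { return values[key]; }
}

public abstract class ContextTask
{
    public abstract void Execute(ProcessContext context);
}

// TaskA produces "customerId" but doesn't know who will consume it.
public class TaskA : ContextTask
{
    public override void Execute(ProcessContext context)
    {
        context.Set("customerId", 42);
    }
}

// TaskB consumes "customerId" but doesn't care that TaskA produced it.
public class TaskB : ContextTask
{
    public override void Execute(ProcessContext context)
    {
        int id = (int)context.Get("customerId");
        context.Set("greeting", "Hello, customer " + id);
    }
}
```

With this shape, the process is free to wire any producer of "customerId" in front of TaskB; the dependency lives in the process definition, not in the tasks themselves.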
 
I say, put this logic in external data instead of in your code.  Rather than create a large, complicated, compiled hierarchy of Process classes, define an XML schema and create a library of documents that define these bindings for you.  Then, define an adapter or a factory that generates a generic Process object by parsing these XML files.
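As a rough illustration – the schema and the class names here are invented for the example – a process definition might be a small XML document, with a factory that parses it into a generic Process object:

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// A generic process is now pure data: an ordered list of task names.
public class Process
{
    public string Name { get; set; }
    public List<string> TaskNames { get; set; }
}

public static class ProcessFactory
{
    // Parses a definition like:
    //   <process name="A">
    //     <task type="TaskA" />
    //     <task type="TaskB" />
    //   </process>
    public static Process FromXml(string xml)
    {
        XElement root = XElement.Parse(xml);
        var process = new Process
        {
            Name = (string)root.Attribute("name"),
            TaskNames = new List<string>()
        };
        foreach (XElement task in root.Elements("task"))
        {
            process.TaskNames.Add((string)task.Attribute("type"));
        }
        return process;
    }
}
```

In a real system the factory would go one step further and resolve each type name to a task instance (via reflection or a registry).  The point is that defining a new process is now just writing a new XML file – no new subclass, no recompile.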
 
Understand that both solutions are functionally equivalent.  But making your application data-driven has a few distinct advantages:
  1. You can now alter the behavior of a process object without recompiling it.  This means you can easily distribute hot-fixes and additional functionality.
  2. Third parties can more easily integrate with your application and extend it.
  3. The source of a Process’ XML can now come from any location.  Loading them from a web server or a database instead of a local file system will have no impact on your system.
  4. You can easily write a library of adapters which can deserialize the process object from any number of formats.  You are no longer tied down to any one data representation.
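That last point might look like this in code – a sketch, with an interface name and a `ProcessDefinition` class of my own invention.  The application depends only on the common interface, and each data format gets its own adapter:

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public class ProcessDefinition
{
    public List<string> TaskNames { get; set; }
}

// The application codes against this interface, never against a format.
public interface IProcessAdapter
{
    ProcessDefinition Deserialize(string source);
}

// One adapter per representation: XML here...
public class XmlProcessAdapter : IProcessAdapter
{
    public ProcessDefinition Deserialize(string source)
    {
        var definition = new ProcessDefinition { TaskNames = new List<string>() };
        foreach (XElement task in XElement.Parse(source).Elements("task"))
        {
            definition.TaskNames.Add((string)task.Attribute("type"));
        }
        return definition;
    }
}

// ...and a comma-separated list gets exactly the same treatment.
public class CsvProcessAdapter : IProcessAdapter
{
    public ProcessDefinition Deserialize(string source)
    {
        return new ProcessDefinition { TaskNames = new List<string>(source.Split(',')) };
    }
}
```

Swapping formats – or loading from a database instead of a file – is then a matter of choosing a different adapter, with no change to the code that runs the process.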

Most importantly, however, your application now simply reacts to changes in data.  This is the way I think of it: imagine you have two machines that build houses based on schematics.  One machine has a number of buttons on it.  Each button builds a different house.  If you want to build additional houses, you need to buy a new machine.  Contrast that with a rival machine, which has only one button but also has a scanner.  The scanner can read schematics directly from any piece of paper, so long as they adhere to a certain standard, and the machine can build any house that can be specified in a schematic.

Wouldn’t you rather have the second machine?  The beauty of writing data-driven applications is that at their core, you have created something akin to the second machine.  You have decoupled the dependencies from your application so much that your program is now simply responding to input rather than replaying set procedures.  This makes it far more versatile, and it’s why programming in WPF is so much more pleasant than writing WinForms applications – because now you get to focus on modifying the data and the UI separately from each other.  There is still a contract that the two sides need to adhere to, but your programming paradigm becomes much cleaner.

Which is why I always try to make my applications as data-driven as possible.
