I cannot count the number of times customers or prospects have asked us to remap a relational database structure into a directory, or vice versa. So keeping a separate, different database structure for directories is and has been a very expensive operation for most companies.

So let’s take a look at why we need to have a separate database structure, and how we could reduce the pain of synchronization.

Now, in my previous post, I asked why we need directories in the first place. But I’d like to refine that question a little further:

  1. Do we need a hierarchical structure like a directory? (I’ll be discussing this question in future blog posts, so stay tuned!)
  2. And the issue behind today’s post: do we need a separate, different kind of database to support such a structure? (After all, didn’t we all learn that relational theory is the alpha and the omega, back in Database 101 class?)

Let’s face it, relational databases are:

  • The workhorse of most enterprise applications.
  • Well understood, well supported, and transaction-enabled.
  • Currently optimized for updates.
  • Potentially able to answer to all kind of queries.

So why not implement a directory using a relational model on top of standard SQL engine and avoid the complexity of having to synchronize two very different data models?

Two Possible Directions, Two Big Dilemmas

If we build a directory using relational best practices, we’ll end up with a model that is “standard” and well understood and well integrated with other SQL applications. So far, so good, right? In terms of design, we will likely end up either with a recursive relation, or a static number of joins for a predefined depth of our tree. The problem is that the result of our queries would be atrociously slowno matter how you slice it, multiple joins or recursive queries are very expensive with mainstream SQL implementations. So this approach won’t work because a directory needs fast lookup/queries.

Or we could bypass best practices and build a “hard-coded” hierarchical structure model that just runs on top of a SQL database. In fact, some vendors took this approach (IBM and Oracle, I believe) and it works well. But the approach is so specific that we’re basically building a hierarchical database-a separate hierarchical structure using SQL as an access method! This gives us excellent results in term of speed. But now we’re back to square one: we have 2 different database structures, two different models, and the same old problem of synchronization, This approach buys us speed but creates exactly the impedance headache that our DBA wanted to avoid.

(In fact, If we’re foregoing the relational model, why not just bypass SQL altogether and use the core access method engine (ISAM/B*trees) to win some extra speed, which is how the majority of LDAP implementations have been done?)

The Best of Both Worlds: Caching In…

So on one side we have a highly scalable system that’s optimized for transaction and updates, making it ideal for storing the data from our directory, which could deliver all the queries we need, but not at the right speed, because the multiple joins of a recursive call kills our real time/online performance. But what if we virtualize and cache the result of all queries beforehand, and update this cache incrementally in real time as new entries are created?

That would give us the best of both worlds: the infinite flexibility and scalability of the relational model, and the speed of a directory. On top of that, we also get the rich context that a hierarchical structure, such as a directory, can carry, while benefitting from a well-established, well understood entity relationship model. So when we write, update, or insert, we’re working in the classic SQL model with transactions our application loves and understands. And when we query, we get the speed of a directory, because the cache is the directory.

One final thing, since all the data from an existing application could now be virtualized as a set of context trees, an interesting thing happens: We inverse the relationship of the directory with the rest of the infrastructure. The directory no longer sits in splendid isolation; instead, it’s fed by a grass roots effort of “publishing” which could be consumed by the community. (Sounds a little like the web, right?)

But I’m getting ahead of myself here…let’s dive into this in my next post, when I will cover hierarchy, directory, and context. Oh, and cache, of course. I’m a big fan of persistent cache. 😉