04 April 2009

RDBMS thinking

I stumbled upon a blog post by Michael Stonebraker discussing the mismatch between the needs of scientific applications and the features provided by relational database management systems. What caught my eye was the first sentence of a comment written by Jacek Szymczyk:

This statement might surprise some people but relational databases do require relational approach to stored data.

I think some people have trouble accepting that this is true, and that comes down to confusion that exists in the DBMS area. It's a fact that any possible data structure can be stored in a relational database. However, this doesn't imply that the relational model is always a good fit for your data. The relational model puts its own constraints on how you can model your data, just like other models. Think about it.

03 April 2009

Neoclipse reloaded

As I've been working a lot on Neoclipse recently, I'd like to showcase it using a few fresh screenshots (from the current trunk version). To begin with, Neoclipse is a front end to the Neo4j graph database. As it's easy to visualize structures in a graph database, a graphical tool is a good fit for it.

Neo4j supports properties on both nodes and relationships, so why not show them right away in the database graph view:

One of the main features in Neo4j is the relationship types, so we dedicated a view to them too:

This view can be used to create new relationship types, to highlight relationships of different types, and to create relationships and new nodes. The checkboxes select which relationship types should be followed during the graph traversal. To make it smoother to work with many relationship types at once, we added some filtering options. As you can see from the screenshot, they concern the direction of relationships:

New relationships and nodes can be added from the context menu of the database graph view as well:

To delve deeper into the properties of nodes and relationships, you can use the properties view:

There are a few operations you can perform on properties, and in the following screenshot you can see the available data types (arrays of those types are fine too):

There's one important enhancement that maybe wouldn't catch your eye all that easily: after you make changes to the graph, you can choose to commit them or to roll them back.

Finally, there's an integrated help system. As soon as you click on a view, corresponding help content is made available:

There's still some work to do in order to get the help content up to date with all the new stuff in the application. When that part is finished off, it's release time!

04 March 2009

Flexibility in data modeling

Martin Fowler posted an interesting article regarding Contradictory Observations in his bliki. Simply put, real-world data doesn't always adapt well to our idea of how data models should be organized. Fowler uses the blood group of a patient as an example:

One thing the clinicians were very strong about was this need to capture contradictory information. I might have a note from the Royal Hope Hospital saying my blood type is A and another note from the Sisters of Plenitude saying my blood type is B. This would clearly be nonsense, blood types don't change. But that doesn't mean we cannot record these two bits of data. Without further investigation we don't know which one is correct.

The solution the team used was to record observations, not only simple attributes. The structure for the particular blood group problem looks something like this:

Inside the nodes you find the blood group and the hospital that made the observation. To sum up, we can say the following:

  • a patient can have multiple blood group observations
  • every observation contains metadata as well
  • there can be relationships between observations ("rejects")
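
To make the summary above a bit more concrete, here is a minimal in-memory sketch in Python. It is not taken from Fowler's article or from Neo4j: the class names, the `observe` and `accepted` helpers, and the third observation (which I attribute to Albion Hospital purely for illustration) are my own assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # compare observations by identity, not by field values
class Observation:
    blood_group: str
    hospital: str  # metadata: who made the observation
    rejects: List["Observation"] = field(default_factory=list)


@dataclass(eq=False)
class Patient:
    name: str
    observations: List[Observation] = field(default_factory=list)

    def observe(self, blood_group: str, hospital: str) -> Observation:
        """Record a new observation without touching earlier ones."""
        obs = Observation(blood_group, hospital)
        self.observations.append(obs)
        return obs

    def accepted(self) -> List[Observation]:
        """Observations that no other observation of this patient rejects."""
        rejected = {id(r) for o in self.observations for r in o.rejects}
        return [o for o in self.observations if id(o) not in rejected]


patient = Patient("example patient")
patient.observe("A", "Royal Hope Hospital")
type_b = patient.observe("B", "Sisters of Plenitude")

# Hypothetical follow-up test that rejects the contradictory observation.
follow_up = patient.observe("A", "Albion Hospital")
follow_up.rejects.append(type_b)

accepted_hospitals = [o.hospital for o in patient.accepted()]
```

Note that the rejected observation is still stored; it is only left out when asking for the accepted ones.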

Now, this data isn't stored for its own sake, but to guide behavior. Test results end up as evidence used to arrive at conclusions that are correct, or at least likely to be correct. Our data structure now looks like this:

At the end of his article Fowler concludes:

Most of the time, of course, we don't use complicated schemes like this. We mostly program in a world that we assume is consistent.

Here I would like to object a bit. What's so complicated about this, really? It should be quite straightforward to model this on a whiteboard. But how do you put this information into a database? When it comes to data storage we tend to think in terms of tables, as most DBMSs are table-based. However, there are alternatives, like graph databases. Actually, the screenshots are from the Neo4j graph database. I made them using Neoclipse, a Neo4j tool where I'm the main contributor at the moment. This is the full interface focused on the Albion Hospital node (using the trunk version of Neoclipse):

Let's get back to the more philosophical aspects for a moment. Why is it so hard for us to think in terms of a flexible graph structure? Why do we want all data structured in square tables?! I think part of the problem is that we want behavior to be tied to classes of objects in a static way. That's a nice and simple model, but the question is how well it reflects the real world our applications try to mirror. Jim Coplien and others are developing some interesting thoughts on how roles that encapsulate behavior could be related to objects in a different manner. Read about the DCI architecture in the Lean Architecture book (draft version; pdf)!

16 February 2009

The future of RDBMS's

Tony Bain writes over at ReadWriteWeb about the subject Is the Relational Database Doomed? While I think relationships are essential to data, there certainly are problems with RDBMS's and how they handle data. I mentioned a bit about it in my previous post. The article by Tony Bain is a nice wrap-up on RDBMS vs. key/value stores, but there's still a lot to discuss around it. At the beginning of the article Tony Bain describes how RDBMS's function and says:

Those tables have constraints, and relationships are defined between them.

As far as I know, what you technically define in an RDBMS are constraints on foreign keys. You then choose to think of those keys as representing relationships. But as long as a corresponding key entry exists, there's no way for an RDBMS to tell whether the application (the SQL statement) got the relationship right! Usually it's also quite hard to see this from the SQL code. I think Pawel Lubczonok is with me on this one:

The word relational should be replaced with slightly relational. The relations reflected are only of the most trivial nature: key lookup. All other relations are embedded in programs that read some data and write some data.
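
The point about foreign keys can be tried out with a small sketch using Python's built-in `sqlite3` module. The `customer` and `purchase` tables and their contents are made up for this illustration; the only thing the sketch assumes about the database is that foreign key enforcement is switched on (SQLite needs `PRAGMA foreign_keys = ON` for that).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE purchase (
           id INTEGER PRIMARY KEY,
           item TEXT NOT NULL,
           customer_id INTEGER NOT NULL REFERENCES customer(id))"""
)
conn.executemany(
    "INSERT INTO customer (id, name) VALUES (?, ?)", [(1, "Alice"), (2, "Bob")]
)

# Suppose Alice bought the book, but the application writes Bob's key by mistake.
# The key exists, so the constraint is satisfied and the wrong "relationship" is stored.
conn.execute("INSERT INTO purchase (item, customer_id) VALUES (?, ?)", ("book", 2))
wrong_owner = conn.execute(
    "SELECT c.name FROM purchase p JOIN customer c ON c.id = p.customer_id "
    "WHERE p.item = 'book'"
).fetchone()[0]

# Only a key that does not exist at all is rejected.
try:
    conn.execute("INSERT INTO purchase (item, customer_id) VALUES (?, ?)", ("pen", 99))
    missing_key_rejected = False
except sqlite3.IntegrityError:
    missing_key_rejected = True
```

The database happily reports Bob as the owner of the book; the constraint only guards against keys that point nowhere.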

That's why you should think twice about statements like this (from the article):

The inherent constraints of a relational database ensure that data at the lowest level have integrity.

"Lowest level" is a very apt description here, in my opinion.

Bain goes on to say that the problem RDBMS's face today is that of scalability. To this end Pawel Lubczonok wrote a response well worth thinking through:

What is being discussed is scaling to volume. yes rdbms scales badly. however there is other scaling that is much greater problem : scaling to complexity. here rdbms is hopeless, vast number of tables have to be created and this is solidified upfront.

Next, Bain goes on to describe key/value stores and compare them to RDBMS's. Commenting on the term "non-relational" for key/value stores, Lemon Obrien writes:

here's the deal: if you use a "key" in any way to access data, it's a relationship, aka Relational Database.

I don't get why people think keys are relationships. You can implement relationships in different ways, like using keys or pointers (hello C/C++!). It's also possible to let the DBMS abstract away the details of this for you - after all, that's what you have a DBMS for. Is everyone so obsessed with keys because they once put a lot of effort into understanding them?! In graph databases relationships are first-class citizens of the model, so you don't actually need to know so much about how they are implemented. As Bain doesn't mention graph databases or Neo4j, I'm happy to see someone else did.

Andrej Koelewijn goes for the really big scale stuff, saying:

In my opinion the database implementation is getting less and less important, but the ability to view loosely coupled distributed data as consistent whole. We need to be able to treat the internet as a database.

In a blog post he takes this further and says "REST is a distributed data model". Interesting thoughts, especially if you have read Martin Fowler's post on the future of databases and integration.

My conclusion from all of this is: The future of databases is to combine different ways to store application data. Don't squeeze data into a model that isn't a good fit - at least for web-scale applications, it won't perform well. So there's a lot of fun here in learning about the new models that exist and inventing new ones!

20 January 2009

Aging databases and relationships

Last week Peter Harkins wrote about Rules of Database Aging. What especially caught my eye was the second rule:

All Relationships Become Many-to-Many

...

The modern database paradigm is defined by relations, so of course that’s what falls apart as soon as you get an app into production.

Pinderkent adds his reflections in the blog post Most real-world relationships are truly many-to-many, where he also suggests a way to handle this:

One option that should be considered is treating all relationships as many-to-many. Although it brings in a level of complexity, it can help avoid the after-the-fact database-level hacks that are often necessary to allow for such relationships to be stored. Arguably, the hacks are worse than the complexity brought in by always dealing with many-to-many relationships.

Other than this, you'll find many comments over at Hacker News.

To set things straight from the beginning, I'd like to say that relationships are not really native to "relational databases", however strange that may seem to you! In fact, these databases are set-based, and what you store in the database is keys, not relationships. You actually define the relationships in the queries, not in the database itself. Foreign key constraints are just a way to check for the existence of keys, nothing else.

What's interesting with the debate on one-to-many relationships becoming many-to-many relationships is that it highlights how "relational databases" handle relationships: they just don't! You have to do it yourself, more or less manually. And that's where the pain comes.

So my take on the suggestion from Pinderkent, making all relationships many-to-many from the start, would be: use a better abstraction of data that includes relationships natively, as the set-based one doesn't really do this. One way to go is to use a graph database such as Neo4j. The primitives of a graph database are nodes and relationships (edges). This way, you actually have relationships in your database, and going from one-to-many to many-to-many doesn't imply any changes to the database at all. You only have to change your application; the database will let you add relationships as needed. In this case, relationships are defined when storing data, not on retrieval.
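
Here is a toy sketch of that idea in Python. To be clear, this is not the Neo4j API, just a tiny in-memory graph of my own; the `Graph` class, the `WRITTEN_BY` relationship type and the book/author names are all made up for the example.

```python
from collections import defaultdict


class Graph:
    """Toy graph where relationships are stored as data, not derived in queries."""

    def __init__(self):
        self._out = defaultdict(list)  # start node -> [(relationship type, end node)]
        self._in = defaultdict(list)   # end node -> [(relationship type, start node)]

    def relate(self, start, rel_type, end):
        self._out[start].append((rel_type, end))
        self._in[end].append((rel_type, start))

    def outgoing(self, node, rel_type):
        return [end for t, end in self._out[node] if t == rel_type]

    def incoming(self, node, rel_type):
        return [start for t, start in self._in[node] if t == rel_type]


graph = Graph()

# At first every book has exactly one author (many books to one author).
graph.relate("book-1", "WRITTEN_BY", "author-1")
graph.relate("book-2", "WRITTEN_BY", "author-1")

# Later a book turns out to have a co-author. Nothing about the storage changes,
# we simply add one more relationship.
graph.relate("book-1", "WRITTEN_BY", "author-2")

authors_of_book_1 = graph.outgoing("book-1", "WRITTEN_BY")
books_by_author_1 = graph.incoming("author-1", "WRITTEN_BY")
```

Only the application code that assumed a single author has to change; there is no join table to introduce and no data to migrate.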

My conclusion is that the problem is not so much aging databases, but aging database concepts. Today many new ways of thinking are popping up in this area, and I think that's a very good thing.

15 December 2008

Navigating Neoclipse

Sometimes when you inspect a Neo4j node space using Neoclipse, you want to go back a few steps to an earlier position. As the path you followed to reach a node isn't always easy to remember, I added a simple browser history to the Neoclipse toolbar. The back/forward buttons act like their common web browser equivalents.
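
For the curious, browser-style history is usually built from two stacks. The sketch below is written in Python for brevity and is not the actual Neoclipse code; the `History` class and its method names are assumptions made for this illustration.

```python
class History:
    """Browser-style navigation history built from two stacks."""

    def __init__(self):
        self._back = []     # nodes visited before the current one
        self._forward = []  # nodes we stepped back from
        self.current = None

    def visit(self, node):
        """Go to a new node; like in a web browser, this clears the forward stack."""
        if self.current is not None:
            self._back.append(self.current)
        self.current = node
        self._forward.clear()

    def back(self):
        """Step back one position if possible and return the current node."""
        if self._back:
            self._forward.append(self.current)
            self.current = self._back.pop()
        return self.current

    def forward(self):
        """Step forward one position if possible and return the current node."""
        if self._forward:
            self._back.append(self.current)
            self.current = self._forward.pop()
        return self.current


history = History()
for node_id in (1, 2, 3):
    history.visit(node_id)

first_back = history.back()
second_back = history.back()
after_forward = history.forward()
```

Stepping back twice and forward once from node 3 ends up on node 2, just as the buttons in a web browser would.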

09 December 2008

Colored commit emails

If you read svn/cvs commit emails on a daily basis in Thunderbird, you will appreciate colorediffs, which you can install from the Mozilla add-ons page. I prefer the side-by-side mode (image below), but there are other modes as well.