Agile Data

On Relational Theory: Questioning the Dogma


As you can see at this site, I've written a fair bit about leading-edge practices surrounding the development and evolution of relational database management systems (RDBMSs). Granted, I've also strayed to other technologies such as XML and enterprise issues, but for the most part the focus has been on RDBMSs. Be that as it may, one thing that I haven't done is write about relational theory at all. The reason for this is two-fold: first, my focus is on practical matters, and second, relational theory really doesn't seem to have much to offer to database practitioners.

In the 1980s I earned a degree in Computer Science at the University of Toronto. I was lucky enough to realize that data-related topics were important and chose to take as many classes as I could in the subject (they were all electives if memory serves). The most memorable was a fourth year course on RDBMSs where the focus was relational theory. The course was memorable not because of the content, which was solid, but because none of it proved relevant on the job. In addition to a few problem sets where the focus was on writing proofs using predicate calculus and set theory, the main assignment was the development of a database engine which could process simple CRUD queries. Sadly, at the end of the course, I had the skill to build an RDBMS engine to process SQL, but didn't know a thing about applying SQL in practice.

Over the years I worked on a variety of systems, almost all of which had RDBMSs on the back end. I did this at banks, insurance companies, government agencies, telecommunications firms, and retail firms. I've worked with a variety of technologies and with a range of people with different experiences.  In all cases, although relational theory was sometimes mentioned in conversation (more on this in a minute), it never proved directly relevant in practice.  Indirectly I very likely did benefit from learning about selection, projection, union, relations, relation variables (relvars), tuples, and all those other good things. These days, I'm occasionally asked what I think about relational theory and where it fits into practice. My answer is that it's important in a few niche situations, and it does seem to provide a foundation upon which you can build practical skills, but invariably it seems to me that the person asking about it really isn't interested in practice at all and is more likely looking for an excuse not to improve their skillset.


Why Relational Theory?

From what I've seen over the years, relational theory is important when:

  1. You're developing a relational database engine. I haven't built a commercial database engine myself, but I'm willing to go out on a limb and guess that the people working on such things at IBM, Oracle, and Microsoft (to name a few) are interested in relational theory.  However, considering how much has been written about RDBMS vendors not remaining true to the relational model, and regularly going beyond it, I conclude that although they may be interested in relational theory they're only acting on it when it makes practical sense. More importantly, when you see the features which have been added to mainstream RDBMSs, such as the Java VM to Oracle and the CLR to Microsoft SQL Server, it's clear that the vendors are moving away from basing their products on relational theory. As Dawn Wolthuis has said, this may seem controversial to the theorists, but more and more it's the state of the industry.
  2. You're a computer science academic. Academics like to focus on theory, and many have literally made a career out of it, and have even managed to sell a few books on the subject. Good for them, but that's not practice.
  3. That's all you know, or at least that's what you prefer to focus on.  Many people within IT are overly specialized, a reflection of the Tayloristic theories inflicted on IT in the 1960s and 1970s, and as a result they have an unjustified belief in the importance of their specialty.  This is a completely natural thing to happen, albeit a dysfunctional one.  Luckily there is a clear and growing trend in the IT industry away from specialists towards generalizing specialists, so the already niche specialty of relational theorists will surely shrink over time.
  4. You're focused on past glories. Relational theory has had important impacts on the IT industry, in particular the SQL language and RDBMSs are at least partially based upon it, but that was way back in the 1970s. It's also had an impact on data modeling practices, including introducing the concept of data normalization and functional dependencies, which is clearly valuable.  But, what has come of relational theory lately? Furthermore, with data being only one of many aspects (e.g. functionality, security, hardware, network, user interface, ...) facing IT professionals today, relational theory at best seems applicable only to a very narrow sliver of software development and therefore doesn't appear to supply much of a basis for new advances (and sure enough, it hasn't).
  5. You can't find anything better to talk about over a couple of beers.  How sad.

Every so often someone asks me how techniques such as database refactoring, agile data modeling, or database regression testing relate to relational theory. My answer is always the same: my focus is on practice, not theory. As I indicate above, relational theory has provided the foundation for some important practices, but I really can't recall a time when I saw a database practitioner stop to work out the relational algebra behind whatever it was they were working on. They just did the work.

The theorists like to claim that the reason why there are so many problems with existing database designs is that practitioners don't understand relational theory, and in some ways they have a valid point.  A lot of good database developers that I know understand the theory, whether or not they've received any formal training in it, but they also understand far more than just that. Unfortunately, the theorists struggle to make their ideas attractive to practitioners, writing books which are either inaccessible to them or simply too far divorced from the realities of software development, and in the end they exacerbate the problem they are trying to address. Worse yet, the theorists seem to focus on modeling databases from scratch; they rarely seem to have advice for those of us who are dealing with existing legacy data sources and the mission-critical systems using them. It typically isn't an option to start from scratch and rebuild them, so where are the techniques to help us address the problems which we actually face? They don't seem to be coming from the theory guys (NOTE: If you know of any writings from the theory folks among us which do address legacy concerns, I'd appreciate it if you could send me an email). Techniques such as database refactoring and database regression testing aren't coming from the theory folks; they're coming from those of us in the trenches who are trying to find ways to get the job done.


Why Even Mention Relational Theory in Conversation?

So why do people ask about relational theory? From what I've seen, there are several reasons:

  1. That's all they know, or at least that's what they focus on.  They're one-trick ponies, and they're desperate to convince you that the one trick that they know is impressive.
  2. They're looking for an excuse not to change. This is probably the most common problem: the person fears the many changes which they're seeing in the IT community and is desperately trying to avoid them. They're often looking to justify their unwillingness to change by claiming that the promoter of a new idea doesn't understand relational theory (regardless of whether relational theory is even applicable or whether the person actually does understand relational theory) or that a concept couldn't be any good if there isn't a mathematical proof supporting it. The book Fearless Change is a great resource, as is Becoming Agile.
  3. They're a zealot. Unfortunately, we have them among us and there's rarely anything that we can do to help them. They have their way of doing things and they're really not interested in hearing about anything else. Concepts such as applying the right software process/method for your situation, or applying the right model for your situation, are the antithesis of their "one size fits all" theories. Worse yet, they're often in complete denial that their approaches aren't widely adopted in practice, and seem to think that it's only a matter of time until they are. If practitioners didn't bother to learn relational theory at its height of popularity, it's doubtful that many will bother now.
  4. They think that it drives development efforts. Some people have a nasty habit of making sweeping statements about the importance and applicability of relational and/or mathematical theory, statements which often make sense only to people who are narrowly focused on data-oriented activities. How many times have you heard claims that a solid grounding in relational theory will result in great database designs, or will ensure data integrity? Shouldn't we worry about great overall designs which look at the entire picture, not just data? Wouldn't a good testing strategy do more to help ensure quality, particularly when the traditional approach certainly seems to have resulted in some questionable database designs over the years? When you start to look at the bigger picture and you accept the fact that there is far more to development than just data, then you quickly realize that relational theory is not as important as the theorists would like you to believe. Or at best, it's one of many aspects of theory that you should learn. Call me a radical, but shouldn't we adopt techniques which work in practice and which address the actual problems that we face, and worry a little bit less about mathematical theory? 
  5. They've been misguided by the one-trick ponies, the fearful, and the zealots among us. These people we can actually help, which is one of the reasons why I wrote this article. Many people will often listen to someone, and when they hear what they expect to hear from them, or more importantly what they want to hear from them, then they pretty much leave it at that. They may not know that there are other sides to the issue, or that perhaps these other sides have been misrepresented to them (if mentioned at all) by these other people. 

Relational Theory In Practice

Why is relational theory an issue for someone who is clearly a practitioner?  I've become concerned because of the damage within the IT industry that I'm seeing caused in its name. As I noted, some people use relational theory as an excuse not to change, but frankly that's their business and I'm happy to let them travel along their own path. But other people, in particular college instructors and book writers, needlessly inflict relational theory on people who are trying to learn how to become an effective data professional, or better yet an effective IT professional. Too much focus on theory can really make data-oriented development techniques unattractive to practitioners, which is one of the reasons why I think so many application developers seem to have little or no skills in this area.

Coming full circle, what should I have learned in my university database course? If I were organizing such a course today, the agenda would look something like this:

  1. The history of databases and data theory. There's always value in spending a few hours on foundational concepts, including relational theory as well as newer data-oriented theories.
  2. An overview of data storage technologies. Students should know the differences between the various data storage options available to them. This would include the various database management system approaches (relational, network, hierarchical, XML, and object (this isn't a complete list)) as well as file management strategies. Furthermore there should be a discussion of the trade-offs between the approaches and advice for when to use each.  An important message should be that RDBMSs and files are the most common storage mechanisms in use today, and that XML is an important data transport representation.
  3. Where data fits into the overall software development process.  This is a message which is sorely missing in many university curriculums and books (surprisingly, including the vast majority of data books). The first philosophy of the Agile Data method says it well: Data is one of many important aspects of IT. As you can see at Agile Models Distilled and Software Development Phases Examined, data-oriented techniques represent a small portion of the knowledge which IT professionals require to be successful. Important knowledge to be sure, but only a small sliver of the overall picture.
  4. Data modeling. Data modeling is one of many important skills a developer should have. Furthermore, they should have an understanding of both traditional approaches to data modeling as well as agile/evolutionary approaches to be effective.
  5. Database development techniques. Students should learn when, and how, to implement functionality in relational databases. They should understand what triggers, stored procedures/functions, and database objects are and how to develop them. Furthermore, they should understand relevant application development issues such as how to retrieve objects from an RDB, security access control, transaction control, and concurrency control.
  6. Database testing. Students should learn how to test relational databases. Data is an important corporate asset, and measures should be taken to ensure its quality. Similarly, mission-critical functionality is often implemented in databases which should also be tested.
  7. Database refactoring. Just like you should refactor your code to ensure that it's of the highest quality design at all times, you should do the same for your database schema. Modern developers work in an evolutionary, if not agile manner, and so must people doing database work.  Database refactoring enables them to do exactly that.
  8. Working with legacy data sources. An understanding of the challenges presented by legacy data sources, and how to overcome them, is critical knowledge. Legacy data sources are a fact of life: you might be able to refactor them over time, but the reality is that you'll need to learn to live with them and to deal with the data quality, design, and architectural challenges which they suffer from.
  9. Object/relational development techniques. A common strategy in organizations today is to build applications using a combination of object and relational technologies. Students should understand the technical impedance mismatch between the two technologies and understand the fundamentals of O/R mapping.
  10. Reporting strategies. The course should include a discussion of the various strategies for implementing reports, including discussion of data marts and data warehouses.
  11. Data management within the enterprise. Although it will likely be difficult for students to grasp due to lack of real world experience, they should be given an appreciation for the need to take enterprise issues into account when developing systems. This includes having an appreciation for the importance of enterprise architecture and administration. Students should also learn about the cultural impedance mismatch that they are likely to face in some organizations.
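As a rough illustration of agenda items 6 and 7, here is a minimal sketch, assuming Python's built-in sqlite3 module and an in-memory database. The customer table, its fname and first_name columns, and all of the function names are hypothetical, invented purely for this example. It shows one small database refactoring (introducing a better-named column alongside the old one during a transition period) followed by a regression check that the data survived the change.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical starting schema with a poorly named column, "fname".
    conn.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, fname TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO customer (id, fname) VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )


def refactor_rename_column(conn):
    # Refactoring step: add the new column and copy the data across.
    # The old column stays in place so existing applications keep working
    # until they have been migrated.
    conn.execute("ALTER TABLE customer ADD COLUMN first_name TEXT")
    conn.execute("UPDATE customer SET first_name = fname")


def regression_check(conn):
    # Regression test: return the ids of rows where the old and new
    # columns disagree. IS NOT is SQLite's NULL-safe inequality operator.
    rows = conn.execute(
        "SELECT id FROM customer WHERE first_name IS NOT fname ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]


def run_example():
    conn = sqlite3.connect(":memory:")
    try:
        create_schema(conn)
        refactor_rename_column(conn)
        mismatches = regression_check(conn)
        names = [
            row[0]
            for row in conn.execute(
                "SELECT first_name FROM customer ORDER BY id"
            )
        ]
        return mismatches, names
    finally:
        conn.close()


if __name__ == "__main__":
    print(run_example())
```

On a real project the old column would be dropped only after every application using it had moved over, and the check would run as part of an automated test suite rather than by hand.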

In short, relational theory does have its place in modern database practice; it's just that this place is several orders of magnitude smaller than what the theorists among us would have us think. They're welcome to grind that axe if it makes them happy, but they shouldn't be surprised that the rest of us aren't paying much attention to them. I also invite the theorists to get their hands dirty and gain some practical experience on a modern software development project (e.g. a RUP or an agile (XP, AUP, FDD, ...) project) and see what actually happens in the real world.

 
Remember the adage:

In theory, practice and theory are one and the same.

In practice, they're not.


Acknowledgements

I'd like to thank Curt Monash, Curt Sampson, and Dawn Wolthuis for their feedback regarding this article.