You learned in A
Vision for the Agile Data Method that agile methodologies such as eXtreme
Programming (XP) and
Disciplined Agile Delivery (DAD) take an iterative and
incremental approach to software development.
Application developers on agile teams typically forsake
big
design up front (BDUF) approaches in favor of emergent approaches where the
design of a system evolves throughout the life of the initiative.
On an agile development initiative the final design often isn't known
until the application is ready to be released.
This is a very different way to work for many experienced IT
professionals.
The implication is that the traditional approach of creating a (nearly) complete set
of logical and physical data models up front isn't going to work.
The main advantage of the traditional approach is that it makes the job of the
data engineer much easier - the data schema is put into place
early and that's what people use. However
there are several disadvantages. First,
it requires the designers to get it right early, forcing you to identify most
requirements even earlier in the initiative,
and therefore forcing your
team into taking a serial approach to development.
Second, it doesn't support change easily.
As things progress your stakeholders understanding of
what they need will evolve, motivating them to evolve their requirements.
The business environment will also change during your efort, once again
motivating your stakeholders to evolve their requirements.
In short the traditional way of working simply doesn't work well in an
agile environment. If Agile data engineers are going to work on and support teams that are following
agile methodologies they need to find techniques that support working
iteratively and incrementally. My experience is that one critical technique is database refactoring.
Table of Contents
- Refactoring
- Database Refactoring
- Why Database Refactoring is Hard
- How to Refactor Your Database
- Adopting Database Refactoring Within Your Organization
- Database Refactoring Best Practices
- Database Refactoring in Context
- Parting Thoughts
Martin Fowler (1999) describes a programming technique called
refactoring, a disciplined way to restructure code. The basic idea is that
you make small changes to your code to improve your design, making it easier to
understand and to modify. Refactoring enables you to evolve your code
slowly over time, to take an iterative and incremental approach to programming.
Martin's refactoring site, www.refactoring.com,
is a good online resource.
A critical aspect of a refactoring is that it retains the
behavioral semantics of your code, at least from a black box point of view.
For example there is a very simple refactoring called Rename Method,
perhaps from getPersons() to getPeople().
Although this change looks easy on the surface you need to do more than
just make this single change, you must also change every single invocation of
this operation throughout all of your application code to invoke the new name.
Once you've made these changes then you can say you've truly
refactored your code because it still works again as before.
It is important to understand that you do not add functionality
when you are refactoring. When you
refactor you improve existing code, when you add functionality you are adding
new code. Yes, you may need to
refactor your existing code before you can add new functionality.
Yes, you may discover later on that you need to refactor the new code
that you just added. The point to
be made is that refactoring and adding new functionality are two different but
complementary tasks.
In the February 2002 issue of Software Development
I described a technique that I called data refactoring. This article described my preliminary experiences at something that
should more appropriately have been called database refactoring in hindsight.
Hence the new name. From
this point forward I'll use the term code refactoring to refer to traditional
refactoring as described by Fowler to distinguish it from database refactoring.
Let's start with some definitions.
A database refactoring is a simple change to a database schema that
improves its design while retaining both its behavioral and informational
semantics. For the sake of this
discussion a database schema includes both structural aspects such as table and
view definitions as well as functional aspects such as stored procedures and
triggers. An interesting thing to
note is that a database refactoring is conceptually more difficult than a code
refactoring; code refactorings only need to maintain behavioral semantics while
database refactorings also must maintain informational semantics.
There is a database refactoring named Split Column, one of
many described in A
Catalog of Database Refactorings, where
you replace a single table column with two or more other columns.
For example you are working on the Person table in your database
and discover that the FirstDate column is being used for two distinct
purposes - when the person is a customer this column stores their birth date
and when the person is an employee it stores their hire date.
Your application now needs to support people who can be both a customer
and an employee so you've got a problem.
Before you can implement this new requirement you need to fix your
database schema by replacing the FirstDate column with BirthDate
and HireDate columns. To
maintain the behavioral semantics of your database schema you need to update all
source code that accesses the FirstDate column to now work with the two
new columns. To maintain the
informational semantics you will need to write a migration script that loops
through the table, determines the type, then copies the existing date into the
appropriate column. Although this
sounds easy, and sometimes it is, my experience is that database refactoring is
incredibly difficult in practice when
cultural issues are taken into account (more
on this later).
There are two fundamental reasons why you should be interested in
database refactoring:
-
To safely fix existing legacy databases. The
bottom line is that
legacy
databases aren't going to fix themselves, and that from a technical
point of view database refactoring
is a safe, simple way to improve data, and database, quality over time.
My various
surveys into data quality have consistently shown over the years that organizations suffer from data quality issues
but in many cases do not have a realistic strategy in place to address them.
-
To support evolutionary development. Modern
software development processes, such as
DAD,
XP,
and Scrum,
all work in an
evolutionary manner. Data professionals need to adopt
techniques, including this one, which enable them to work in such a manner.
-
To tune your database.. Part of your tuning efforts may include the
(de)normalization of the schema.
Informational semantics refers to the meaning of the information
within the database from the point of view of the users of that information.
To preserve the informational semantics implies that when you change the
values of the data stored in a column the clients of that information
shouldn't be affected by the improvement.
Similarly, with respect to behavioral semantics the goal is to keep the
black box functionality the same - any source code that works with the changed
aspects of your database schema must be reworked to accomplish the same
functionality as before.
A small transformation to your schema to extend it, such as
the addition of a new column or table, is not a database refactoring because the
change extends your design. A large
number of small changes simultaneously applied to your database schema, such as
the renaming of ten columns, would not be considered a database refactoring
because this isn't a single, small change.
Database refactorings are small changes to your database schema that
improve its design while preserving the behavioral and informational semantics.
That's it. I have no doubt
that you can make those changes to your schema, and you may even follow a
similar process, but they're not database refactorings.
Coupling. As
you learned in
Relational Databases 101 coupling is a measure of the degree of dependence between
two items - the more highly coupled two things are the greater the chance
that a change in one will require a change in another.
Coupling is the "root of all evil" when in comes to
database refactoring, the more things that your database schema is coupled
to the harder it is to refactor. Unfortunately
you learned in Relational
Databases 101 that relational database schemas are potentially coupled
to a wide variety of things:
-
Your application source code
-
Other application source code
-
Data load source code
-
Data extract source code
-
Persistence frameworks/layers
-
Your database schema (captured via models or scripts)
-
Data migration scripts
-
Test code
-
Models and/or documentation
Figure
1 depicts the best-case scenario for database refactoring - when it is
only your application code that is coupled to your database schema. Figure
2 depicts the worst-case scenario for database refactoring efforts
where a wide variety of software systems are coupled to your database schema, a
situation that is quite common with existing production databases.
Figure 1. The
best-case scenario.
Figure 2. The worst-case scenario.
As you can see, coupling is a serious problem when it
comes to database refactoring. For the sake of simplicity, throughout the rest of
this article the term
"application" will refer to all external systems, databases, applications,
programs, test suites "¦ that are coupled to your database.
Before I describe the steps for
refactoring a database I need to address a critical issue - Does the simple
situation depicted in Figure 1 imply you'll do different things than the
highly coupled one of Figure 2?
Yes and no. The fundamental process itself remains the same although the
difficulty of implementing individual database refactorings increases
dramatically as the coupling your database is involved with increases. If
you find yourself in the simple situation then you will not need to do the
transition period work described below, you can simply refactor your database
schema and application code in parallel and deploy them simultaneously.
People who find themselves in the more complex situation do not have this
luxury.
This section is written under the
assumption that your technical and cultural environments are organized to
support database refactoring. Although
this sounds like a big assumption, and it is, I will describe what you need to
do to get to the point where these environments are in fact in place.
Anything less would be inappropriate.
I like to think of database refactoring as
a three-step process:
-
Start in your development sandbox
-
Implement in your integration
sandbox(es)
-
Install into production
Your
development
sandbox is the technical environment where your software, including both
your application code and database schema, is developed and
unit tested. The need to refactor your database schema is typically
identified by an application developer who is trying to implement a new requirement or
who is fixing a defect. For example, a
developer may need to extend their application to accept Canadian mailing
addresses in addition to American addresses.
The main difference is that Canadian addresses have postal codes such as
R2D 2C3 instead of zip codes such as 90210-1234.
Unfortunately the ZipCode column of the SurfaceAddress
table is numeric and therefore will not currently support Canadian postal codes.
The application developer describes the needed change to one of the Agile data engineer(s) on their team and the database refactoring effort begins.
As depicted in
Figure 3, the Agile data engineer
and application developer will typically work through some or all
of the following steps to implement the refactoring:
Figure 3. The process of
database refactoring.
The first thing that the Agile data engineer does is they'll
try to determine if the database refactoring even needs to occur and if it is
the right one to perform. The second thing that the Agile data engineer does is
internally assess the likeliness that the change is actually needed.
This is usually a "gut call" based on the Agile data engineer's previous
experience with the application developer.
The next thing the Agile data engineer does is to assess the
overall impact of the refactoring. In the stovepipe situation of Figure 1 this is fairly
straightforward because the Agile data engineer should have an understanding of how
the application is coupled to this part of the database.
When this isn't the case they'll need to work with the application
developer to do so. In the complex
case of Figure 2 the Agile data engineer will
need to have an understanding of the overall technical infrastructure within
your organization and how the other applications are coupled to your database.
This is knowledge that they'll need to build up over time by working
with the enterprise architects, agile developers, and even other Agile data engineers.
When the Agile data engineer isn't sure of the impact they will either need
to decide to make a decision at the time and go with their gut feel or decide to
advise the application developer to wait while they talk to the right people.
The goal of this effort is to make sure that you attempt a database
refactoring that you aren't likely going to be able to do - if you are going
to need to update, test, and redeploy twenty other applications to make this
refactoring then it likely isn't viable for you to continue.
An important skill that Agile data engineers require is the
understanding that you typically have several choices for implementing new data
structures and new logic within a database.
Like code refactoring, database refactoring is enabled by the existence of a
comprehensive test suite - you know you can safely change your database schema
if you can easily validate that the database still works after the change.
Here's
my advice:
-
Your primary goal is to ensure that the tests exist.
-
You should try to have each test implemented once,
either at the application level or at the database level but not both.
-
Some unit tests will be at the application level and
some at the database level, and that's ok.
-
Go for the lowest common denominator - if the
database is accessed by several applications thed any data-related tests
should appear in your database test suite, helping to ensure they're
tested once.
-
When you have a choice, implement the test at the level
where you have the best testing tools (often at the application level).
Testing tools are discussed in the Tools
article.
The article
Database
Regression Testing describes database testing in
detail.
An effective technique that Pramod Sadalage and Peter Schuh (2002) promote is a
deprecation period, although transition period is a better term, for the original portion of the schema that you're changing.
They observe that you can't simply make the change to your database
schema instantly, that instead you need to work with both the old and the new
schema in parallel for awhile to provide time for the other application teams to
refactor and redeploy their systems. Figure 4
shows how this idea would work when we apply the Replace Column database
refactoring to ZipCode (this example was created in 2003, hence a removal date in 2007 - more on this later). Notice
the changes between the original schema and the schema during the transition
period.
PostCode has been added as a column, exactly what you
would expect. The ZipCode
column has been marked as deprecated - you know this because a removal date
has been assigned to it using a UML named variable.
A trigger was also introduced to keep the values contained in the two
columns synchronized, the assumption being that new application code will work
with PostCode but should not be expect to keep ZipCode up to date,
and that older application code that has not been refactored to use the new
schema won't know to keep PostCode up to date.
This trigger is an example of database scaffolding code, simple and
common code that is required to keep your database "glued together".
This code has been assigned the same removal date as ZipCode.
Figure 4.
Refactoring the Address table.

An interesting thing to notice about Figure
4 is the addition of the Country column to Address. Wait a minute, there isn't an Add Column database
refactoring in the catalog.
Have we found a new type of database refactoring?
No. Database refactorings
are small changes to database schemas that IMPROVE their design, not simply
CHANGE the design. Adding a new
column is a transformation to the schema but not a design improvement to it.
Although this is clearly a very small nuance I believe that it's an
important one.
To see how to implement the code for a database refactoring, read
The Rename Column
Database Refactoring.
Figure 5 depicts the lifecycle of a database refactoring. You first
implement it within the scope of your initiative, and if successful you eventually
deploy it into production. During the transition period both the original
schema and the new schema exist, with sufficient scaffolding code to ensure that
any updates are correctly supported. During the transition period some
applications will work with PostCode and others with ZipCode, but
not both at once. Regardless of which column they work with, the applications
should all run properly. Once the deprecation period has expired the original
schema plus any scaffolding code is removed and the database retested. At this
point in time the assumption is that all applications work with PostCode.
Figure 5. The lifecycle of a database
refactoring.

Across the top of Figure 5 we've applied the 3C pattern from Disciplined Agile (DA) to the lifecycle.
This is a three-step pattern:
- Coordinate. Implement the refactoring, coordinating with the appropriate people at the organizational level (likely your data management group) to ensure it gets into your overall change process.
- Collaborate. Teams across your organization work together to change anyhing coupled to the database (see Figure 2).
- Conclude. The original schema and scaffolding code are removed, effectively concluding the refactoring.
The application developer and Agile data engineer work together to make the changes within the development sandbox.
The strategy is to start each refactoring simply, by performing the
refactoring within the development sandbox first you are effectively putting
yourself in the situation described in
Figure 1.
To do this, you need to update two logs
(assuming you don't have a database refactoring tool which does this
automatically):
-
Database change log.
This is the source code implementing all database schema changes in the
order that they were applied throughout the course of an initiative.
When you are implementing a database refactoring, you include only the
immediate changes in this log. When
applying the Replace Column database refactoring we would include the
DDL for adding the PostCode column and the data definition language (DDL) to implement the
trigger(s) to maintain the values between the PostCode and ZipCode
columns during the transition period.
-
Update log.
This log contains the source code for future changes to the database
schema that are to be run after the transition period for database
refactorings. In our example
this would be the source code required to remove the ZipCode column
and the triggers we introduced.
Many refactorings require you to migrate, or copy data,
from the old version of the schema to the new. Your data migration log contains the data
manipulation language (DML) to reformat or cleanse the source data
throughout the course of your effort.
In our example this would include any code to improve the quality of
the values in the ZipCode column.
The programs which access the portion of the database schema which you're
refactoring must be updated to work with the new version of the database schema.
All of these programs must be refactored and then deployed in production before
the transition period expires, as implied in Figure
5.
Once the changes to your application code and database schema have been put
in place you then need to run your
regression test suite.
Because successful tests discover problems you will need to rework things
until you get it right. A
significant advantage of database refactorings being small changes is that if
your tests do in fact break you've got a pretty good idea where the problem
lies - in the application code and database schema that you just changed.
The larger your changes are the more difficult it becomes to track down
problems, and therefore the slower and less effective your development efforts
are. You'll discover that developing
in small, incremental steps works incredibly well in practice.
Because your database is a shared resource, minimally
it is shared within your application development team if not by several
application teams, the Agile data engineer needs to communicate the changes that have
been made. If you haven't already done so you should update the physical
data model (PDM) for your database. I
personally have a tendency to model the new schema in a PDM tool such as ERWin
and then generate the initial DDL that I'll then modify and include in my
database change scripts.
A critical skill for agile developers is the habit of
putting all of their work under configuration management (CM) control by
checking it into a version control tool. In
the case of database refactoring this includes any DDL that you've created,
change scripts, data migration scripts, test data, test cases, test data
generation code, documentation, and models.
This is in addition to the application-oriented artifacts that you would
normally version - treat your database-oriented artifact the exact same way
that you'd treat other development artifacts and you should be ok.
After several days have passed you will be
ready to implement your database refactoring within your
team integration
sandbox. The reason why you need to
wait to do so is to give your teammates time to refactor their own code to use
the new schema.
Teams that have chosen to encapsulate access to their
database via the use of a persistence
framework will find it easier to react to database schema changes and
therefore may discover they can tighten up the period between implementing a
database refactoring within a development sandbox and in their team
integration sandbox. This is due to
the fact that the database schema is represented in meta data, therefore many
database schema changes will only require updates to the meta data and not to
the actual source code.
To deploy into each sandbox you will need
to both build your application and run your database management scripts (tools
such as Autopatch can help).
The next step is to rerun your regression tests to ensure that your
system still works - if not you will need to fix it in your development
environment, redeploy, and retest. The goal in your team
integration sandbox is to validate that the work of everyone on your team functions when put
together, whereas your goal in the Test/QA sandbox is to validate that your
system works well with the other systems within your organization.
Communication is a critical part of deploying database refactorings into
your Test/QA sandbox, I'm using the plural now because you typically introduce
several database changes (refactorings) into this environment at once.
Long before you change your database schema you need to communicate and
negotiate the changes with the owners of all of the other applications that
access your database. Your
operations engineers (if any) will be involved in this negotiation, they may even
facilitate the effort, to ensure that the overall needs of your organization are
met. Luckily the process that you
followed in your development sandbox has made this aspect of database
refactoring easier:
-
The Agile data engineer only allowed database refactorings
that can realistically be implemented - if another application team
isn't going to be able to rework their code to access the new schema then
you can't make the change.
-
The documentation, even if it's only a brief
description of each change, that the Agile data engineer wrote is important
because it provides an overview of the changes that are about to be
deployed.
-
The new version physical data model (PDM), which was
updated as database refactorings were implemented, serves as a focal point
for the negotiations with other teams.
Agile Modeling (AM) would consider the PDM to be a "contract
model" that your team has with the other application teams, a model that
they can count on to be accurate and that they can count on being actively
involved in negotiating changes to it.
Installing into production is the hardest part of database
refactoring, particularly in the complex situation of Figure
2. You generally won't deploy database refactorings on their own,
instead you will deploy them as part of the overall deployment of one or more
systems. Deployment is easiest when you have one application and one database
to update, and this situation does occur in practice, but realistically we need
to consider the situation where you are deploying several systems and several
data sources at once. Figure 6
overviews the steps of deploying your refactorings into production.
Figure 6. The steps of deploying your
database refactorings.
Figure 7 depicts how you will need to
schedule the deployment of your application pre-defined deployment windows,
shown in green. A deployment window, often called a release window, is a
specific point in time where it is permissible to deploy a system into
production. Your
operations staff will very likely have strict rules regarding
when application teams may deploy systems.
Figure 7 shows how two teams schedule the deployment of their
changes (including database refactorings) into available deployment windows.
Sometimes there is nothing to deploy, sometimes one team has changes, and other
times both teams have schema changes to deploy.
Figure 7. Scheduling your
refactorings into deployment windows.

You will naturally need to coordinate with any other teams which are deploying
during the same deployment window. This coordination will occur long before you
go to deploy, and frankly the primary reason why your pre-production test
environment exists is to provide a sandbox in which you can resolve multi-system
issues. Regardless of how many database refactorings are to be applied to your
production database, or how many teams those refactorings were developed by,
they will have first been tested within your pre-production testing environment
before being applied in production.
Although the adoption of effective tools is an important
part of enabling database refactoring it is only the tip of the iceberg -
database refactoring requires a significant cultural change within your
organization. Because database
refactoring is an enabling technique of the Agile Data method many of the
cultural issues for adopting database refactoring are the same ones that you
face adopting the Agile Data method in general.
These cultural issues include a serial mindset within many
data professionals, resistance to change, and political inertia.
The following approach should help you to overcome these challenges:
-
Start simple.
Database refactoring is easiest in greenfield environments where a
new application accesses a new database, and the next easiest situation is
when a single application accesses a legacy database.
Both of these scenarios are typified by Figure
1. By starting simple you
provide yourself with an environment in which you can learn the basics, once
you understand the basics you are in a much better position to tackle the
situations typified by Figure 2.
-
Accept that iterative and incremental development is
the norm. Modern software development methodologies take an
iterative and incremental approach to software development. Although
serial development is often the preferred approach by many data
professionals unfortunately it doesn't reflect the current way that
application developers work. Time to change.
-
Accept that there is no magic solution to get you
out of your existing mess. Your data quality problems didn't create themselves and they are not going to fix themselves. People created the problem and people need to fix the problem. Database refactoring is the safest and most straightforward strategy available to you to dig your way out of your data technical debt.
-
Adopt a 100% database regression testing policy.
For database refactoring to work, and in general for
iteratively and incremental development to work, you need to be effective at
regression testing. To be successful at database refactoring you need
to not only be able to regression test the database itself but any
application that is coupled to your database. The implication is that
you require regression test suites for every single application, something
you very likely do not have. So start writing them.
-
Explore the technique. Experiment with database refactoring in simple situations first to learn the technique and gain some experience. Then start refactoring more complicated things. So, start simple.
Database refactoring works in practice, it isn't
simply just another academic theory. For
the vast majority of organizations this is a new, "bleeding edge" technique.
6. Database Refactoring Strategies
Fowler (1999) suggests a collection of practices for
code refactoring, practices that I recast below for database refactoring:
- Refactor to ease additions to your schema.
- Ensure the test suite is in place.
- Take small steps.
- Program for people.
- Don't publish data models prematurely.
- The need to document reflects a need to refactor.
- Test frequently.
7. Database Refactoring in Context
The following table summarizes the trade-offs associated with
database refactoring and provides advice for when (not) to adopt it.
Advantages
|
- Enables safe evolution of a data source
- Supports removal of data technical debt
- Enables evolutionary database development
|
Disadvantages
|
- Although possible to implement by hand, you really want to invest in database refactoring tooling.
- Not possible for data sources that do not support executable functionality, stored procedures for example, such as CSV files.
- Requires capability to schedule, and then execute upon, removal of refactoring scaffolding once systems that access the data source are updated.
|
When to Adopt This Practice
|
You need to have a reasonably complete automated regression test suite in place, one that at least validates the critical and high-risk aspects of your database, for database refactoring to be a viable option.
|
8. Parting Thoughts
Database refactoring supports an incremental approach to
the evolution of your database schema, one of the three fundamental strategies
(you can give up, take a "big bang release" approach, take an incremental
approach).
Each strategy has its unique strengths and weaknesses.
I suspect that many organizations, perhaps because of a serial mindset,
have either tried the big-bang release approach or have been too scared to do so
and have now given up. It doesn't
have to be this way. Yes, it will
likely take a significant effort for your organization to put the culture and
technologies in place to support database refactoring across your enterprise,
but in the long run this is likely far more palatable than your other
alternatives.
See the Catalog
of Database Refactorings.
Regardless of your strategy database evolution is hard,
something that is particularly true when your database is highly coupled to
other things. Database refactoring
is not a silver bullet that's going to magically solve all of your database
problems. This article described
how to successfully approach database refactoring within a simple, stovepipe
environment.
9. Related Resources