Agile Data

Agile Best Practices for Data Warehousing (DW)/Business Intelligence (BI) Projects

Follow @scottwambler on Twitter!

This article summarizes "best practices" for the development of a data warehouse (DW) or business intelligence (BI) solution. These best practices describe ways to reduce overall risk on your project while increasing the probability that you will deliver a DW or BI solution which meets the actual needs of its end users. Successful DW/BI projects take an evolutionary approach to development, and better yet an agile one. The principles, practices and philosophies of the Agile Modeling (AM) and Agile Data (AD) methods are applied throughout. I've organized the best practices into the following categories:

DW/BI Modeling Best Practices

  1. Do some initial architecture envisioning. At the beginning of a project, during Iteration 0 (see Figure 1), you want to do some initial architecture modeling to identify a potential vision for how your team will build the data warehouse. As Agile Model Driven Development (AMDD) suggests, you do not need to create a comprehensive, detailed model up front, you only need a high-level vision at the beginning of the project and the details can be identified on a just-in-time (JIT) basis via model storming.  Sometimes a simple whiteboard sketch is all you need to understand your architectural vision. If so, then just do that. For a BI/DW project, the initial architecture views would likely be some form of deployment diagram capturing the technologies you intend to use and a high-level domain model overviewing the business entities and the relationships between them.

Figure 1. The Agile SDLC.


  1. Model the details just in time (JIT). The best time to model details isn't at the beginning of a project but instead to model storm them throughout the project in a JIT manner. There are several reasons for this. First, like it or not, the requirements are going to change throughout the project. Second, by waiting to analyze the details JIT, you have much more domain knowledge than if you had done so at the beginning of a project. For example, if a requirement is to be implemented three months into a project, if you explore the details of that requirement at that point you have three months more domain knowledge than if you had done so at the beginning of the project, therefore you can ask more intelligent questions. Third, if you've been delivering working software on a regular basis (see below) your stakeholders now have three months worth of experience with the system that you've built. In other words, they can give you better answers. Fourth, modeling everything up front appears to result in significant wastage.
  2. Prove the architecture early. Everything works in PowerPoint slides, on a whiteboard, or in CASE tool models but it isn't until you prove it with code that you know that your architecture actually works. Processes such as Disciplined Agile Delivery (DAD) and Unified Process (UP) suggest that you build a working, end-to-end "skeleton" of your system to prove that all aspects of it work. In the case of a DW, this would entail that you show that you can access the major legacy data sources, that your extract-transform-load (ETL) strategy works, that your database regression testing strategy works, and that your reporting tools can access your DW.
  3. Focus on usage. If you want to develop a system effectively, including a DW/BI system, then you need to understand how people will potentially use it to support their business objectives. This means that we need a usage-centered approach to development driven by use cases or usage scenarios, not a data-centered one driven by data models. Data is clearly an important part of the overall picture, but it's only one of many parts. If we focus on data and not usage we run the risk of building something that nobody is interested in using, an all-too-common occurrence on traditional data warehouse efforts.
  4. Don't get hung up on "the one truth". The "one truth" philosophy says that it is desirable to have a single definition for a data element or business term, that there should be a common, shared definition for your master reference data and perhaps even your major business entities. To get to this "one truth", when it is possible, often requires significant effort which often goes past the point of diminishing returns. "One truth" is a nice vision to work toward, but don't let it prevent your team from delivering important business value in a timely manner. The fact is that various portions of your organization have different ways of working, different priorities, and different constraints.  Seeking the one truth for a data element often proves to be an artificial constraint imposed by traditional data professionals, not by the actual business. You can in fact take an Agile approach to Master Data Management (MDM).
  5. Organize your work by requirements. On agile projects we perform work based on prioritized requirements, not by technical issues such as source systems. Each iteration we do the work to fulfill the highest priority stakeholder requirements which fit into that iteration. During each iteration we get a little more data from system X, and some more from system Y, and some more from system Z, and so on. If our iterations are two weeks in length, we pull two weeks worth of work from the top of the priority stack. By working in this manner we are always in a position where we are achieving the maximum benefit for our stakeholder's IT investment, thereby reducing their risk.
  6. Active stakeholder participation.  Stakeholder involvement is critical throughout your project, and better yet active stakeholder participation where stakeholders are not only involved with your project on a daily basis, they are also directly involved with the actual modeling effort itself.

DW/BI Lifecycle Best Practices

    Agile Data Warehousing
  1. Take an evolutionary approach. Requirements, or at least your understanding of them, will change throughout the lifecycle of your project for a variety of reasons. The implication is that if you want to develop a solution which meets the needs of your stakeholders then you will need to take an evolutionary (iterative and incremental) approach to development. Envision the requirements and architecture at a high-level to start, but model storm the details just in time (JIT).
  2. Embrace change. A changed requirement late in the lifecycle is a competitive advantage as long as you can act on it. Instead of adopting a strict change management process which for the most part is based on change prevention, you can instead adopt a more agile approach to change management where your stakeholders can easily change their minds as the progress progresses. This of course requires the development team to adopt evolutionary database development techniques such as database refactoring, evolutionary data modeling, and database regression testing.
  3. Deliver working software regularly. Following short iterations, perhaps a few weeks in length, and providing working software at the end of each iteration, often results in stakeholders who are far more interested in getting more software than they are in getting more specifications. Effective DW/BI teams focus on high-value activities such as actually providing access to data and developing reports instead of merely documenting what you intend to deliver at some point in the future. The only accurate measure of progress on a software development project is the delivery of working software, not the delivery of documentation or other non-executable work products which are nothing more than promised to deliver software at some point in the future. A working system provides the concrete feedback to stakeholders that progress is being made, and regular updates to that software (at least internally) helps to reduce the feedback cycle and thereby decrease overall risk on your project.
  4. Strive for iterations of one to two weeks. Iterations of this length provide more opportunity to govern the project effectively due to the greater feedback provided by regular delivery of working software. Short iterations motivate you to focus on high value activities, long iterations allow room for bureaucratic practices (and the inherent waste which goes along with them). 
  5. Test throughout the lifecycle. One of the great blindspots in the data community is testing, which very likely explains the data quality challenges which we continue to suffer from in production databases. It is common on agile projects to do significantly more testing than what typically occurs on traditional projects, and as you can see in Figure 2 agilists tend to test throughout the project lifecycle (in fact, many agilists prefer to take a test-first approach to development).  There is nothing special about databases -- just like you should test your application code you should also test your database code. If data is a corporate asset, a common refrain in the vast majority of organizations, should you not have a test suite in place which validates your data?  In short,  database regression testing is critical to your success.
  6. Involve operations and support people early. Operations and support staff a key stakeholders on any project, including DW/BI projects. The sooner you involve them the sooner you find out what their requirements are (including pesky details such as deployment windows, archiving needs, hardware lead times, ...).

Figure 2. Testing during development iterations.



Other Good Ideas

  1. Adopt common development standards. Just like when you're building any other system, you should follow your organization's common development guidance. This includes modeling style guidelines, coding guidelines, data naming conventions, report design guidelines, and so on.  Consistency is an important contributor to quality, and guidance which is willingly followed by developers seems to be more effective in practice than guidance which is mandated and enforced by your governing bodies. If such guidance doesn't exist within your organization then you'll want to develop some, or better yet, adopt existing ones from industry. It's amazing what you can find on the Internet these days just by looking for it.
  2. Use good tools. DW/BI development is complex, and good tools will help to address that complexity. In addition to the standard tools for data modeling, extract-transform-load (ETL), and reporting you'll also need tools which support evolutionary development techniques such as database refactoring, database testing, and database deployment. The agile/evolutionary tools are currently emerging in the marketplace.
  3. Don't underestimate legacy data challenges. Existing data sources are often a mess, as I discuss in The Joy of Legacy Data. Ideally you'll refactor the source to fix any data quality problems, but if that's not an option then you'll need to cleanse the source data as much as possible as you extract it from the legacy sources. In my experience data cleansing should be seen as a process smell which indicates the need for legacy data source owners to become better at database evolution.
  4. Travel light. Serial approaches to development are typically documentation heavy, often in a naive attempt to counteract the inherent communication risks of waterfall lifecycles. With the serial approach you often see teams create comprehensive logical and physical data models and detailed report specifications. With an Agile approach where you develop working software each iteration, you quickly discover two things. First, an evolutionary approach to development demands an evolutionary approach to data modeling and therefore you don't need to create detailed data models up front. Second, by having developers work closely with stakeholders you don't need to create detailed report specifications, instead it is more effective to simply create a report and get feedback from your stakeholders which you then act on iteratively.
  5. Adopt a lean approach to data governance. Traditional, command-and-control approaches to data governance appear to work very poorly in practice. The 2006 DDJ survey into the current state of data management practices showed that 66% of development teams will choose to "work around" their organization's data group, and when they do so that 75% of the time it is because they find the data group too difficult to work with, too slow to respond, or that the data group doesn't provide sufficient value to justify the effort of working with them. This is clearly problematic. It is possible to take a lean/agile approach to data governance.