A Definition of Data Warehousing

My favored definition of a data warehouse is a slightly modified version of Ralph Kimball’s definition from his first edition of The Data Warehouse Toolkit:

A data warehouse is a copy of transaction data specifically structured for querying and reporting.

Ralph states that a data warehouse is "a copy of transaction data specifically structured for query and analysis." I have two quibbles with Ralph's definition: 1) Sometimes non-transaction data are stored in a data warehouse, though probably 95-99% of the data usually are transaction data. 2) I say "querying and reporting" rather than "query and analysis" because the main outputs from data warehouse systems are either tabular listings (queries) with minimal formatting or highly formatted "formal" reports. Queries and reports generated from data stored in a data warehouse may or may not be used for analysis.

For more information about why the transaction data are copied, you may want to see my essay The Case for Data Warehousing. To learn about the key decisions that must be made in determining the structure of a data warehouse, you may want to see my essay Aspects of Data Warehouse Architecture.
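To make the definition concrete, here is a minimal sketch in Python using the standard sqlite3 module. It copies a handful of transaction rows into a separate table laid out for reporting. The table names (order_txn, sales_by_day), the columns, and the sample rows are all hypothetical, and a daily summary is only one of many structures a warehouse might use.

```python
import sqlite3

# Hypothetical sample transactions: (order_id, order_date, product, amount)
SAMPLE = [
    (1, "2024-01-01", "widget", 10.0),
    (2, "2024-01-01", "widget", 15.0),
    (3, "2024-01-02", "gadget", 7.5),
]


def build_warehouse(transactions):
    """Load transaction rows, then copy them into a reporting-oriented table."""
    conn = sqlite3.connect(":memory:")

    # Stand-in for the operational (transaction) system.
    conn.execute(
        "CREATE TABLE order_txn ("
        "order_id INTEGER, order_date TEXT, product TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO order_txn VALUES (?, ?, ?, ?)", transactions)

    # The "warehouse": a copy of the transaction data, restructured
    # so that reporting queries are simple to write.
    conn.execute(
        """
        CREATE TABLE sales_by_day AS
        SELECT order_date,
               product,
               COUNT(*)    AS order_count,
               SUM(amount) AS total_amount
        FROM order_txn
        GROUP BY order_date, product
        """
    )
    return conn


def daily_report(conn):
    """Return the tabular listing a user might request from the warehouse."""
    return conn.execute(
        "SELECT order_date, product, order_count, total_amount "
        "FROM sales_by_day ORDER BY order_date, product"
    ).fetchall()


if __name__ == "__main__":
    for row in daily_report(build_warehouse(SAMPLE)):
        print(row)
```

Nothing in the sketch depends on the warehouse being relational or summarized. The same copy could just as well be written to a flat file or kept at full transaction detail.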

What I especially like about Ralph’s definition is what he does not say.

The form of the stored data has nothing to do with whether something is a data warehouse.

A data warehouse can be normalized or denormalized. It can be a relational database, multidimensional database, flat file, hierarchical database, object database, etc. Data warehouse data often get changed. And data warehouses often focus on a specific activity or entity.

Data warehousing is not necessarily for the needs of "decision makers" or used in the process of decision making.

Of course, if you want to define every user as a decision maker and all activities as decision-making processes, then my assertion is false. But in my experience, the overwhelming majority of uses of data warehouses are for quite mundane, non-decision-making purposes rather than as grist for making decisions with wide-ranging effects (so-called "strategic" decisions). In fact, I would assert that most data warehouses are used for post-decision monitoring of the effects of decisions or, as some people might say, for "operational" issues. By the way, I am not saying that using data warehousing in the decision-making process cannot be a wonderful, potentially high-return effort. But my caution is that though the trade press, vendors, and many industry experts trumpet the role of data warehousing vis-à-vis decision making, in reality we do not now have, nor will we ever have, a clear understanding of decision making.

Comments? Contact Larry Greenfield at larryg@lgisystems.com