What Data Errors You May Find When Building a Data Warehouse
 You may have seen publications that tell you that you
may have to spend the majority of your data warehouse development time
building the means for both the initial and recurring extraction, transforming,
and loading of data. What I have not seen, though, is much in-depth discussion
of what exactly are those errors in the dirty data that you will spend your time cleaning up.
Forewarned is forearmed. If you know the possibility that
certain errors exist, you will be more prone to spot them and to plan your
project to attack the errors in a manageable way. Perhaps the material in this
paper can help you formulate a checklist of errors you will be checking for.
What follows is a list of common errors.
Also, if you are a relational database expert, bear with my imprecise use
of some terminology. Finally, note that when I refer to a data warehouse,
I refer to the database that is directly fed with data from the source
systems - not the data marts (or whatever you want to call them) that are
fed with cleansed data.
The categories of "errors"
I place "errors" into four categories. Quotations are
around the word errors because some errors are not, in the metaphysical
sense, erroneous. So, with some awkwardness, let me suggest that errors
involve data that are either:
 |
Incomplete |
 |
Incorrect |
 |
Incomprehensible |
 |
Inconsistent. |

Incomplete errors
These consist of:
Missing records
This means a record that should be in a source system is not there. Usually
this is caused by a programmer who diddled with a file and did not clean
up completely. (I read a white paper about how users have to "fess up"
about bad data. Actually, usually system personnel cause MUCH more headaches
than users.) Note you may not spot this type of error unless you have another
system or old reports to tie to.
Missing fields
These are fields that should be there but are not. There is often a mistaken
belief that a source system requires entry of a field.
Records or fields that,
by design, are not being recorded That is, by intelligent
or careless design, data you want to store in the data warehouse are not
being recorded anywhere. I further divide this situation into three categories.
First, there may be dimension table attributes you will want to record
but which are not in any system feeding the data warehouse. For example,
the marketing user may have a personal classification scheme for products
indicating the degree to which items are being promoted. Second, if you
are feeding the same type of data in from multiple systems you may find
that one of the source systems does not record a field your user wants
to store in the data warehouse. Third, there may be "transactions" you
need to store in the data warehouse that are not recorded in a explicit
manner. For example, updating the source system may not necessarily cause
the recording of a transaction. Or, sometimes adjustments to source system
data are made downstream from the source system. Off-invoice adjustments
made in general ledger systems are a big offender. In this case you may
find that the grain of the information to be stored in the warehouse may
be lost in the downstream system.
Incorrect errors
You can say that again! That is, the data really are
incorrect.
Wrong (but sometimes
right) codes This usually occurs when an old transaction processing
system is assigning a code that the transaction processing system users
do not care about. Now if the code is not valid, you are going to catch
it. The "gotcha" comes when the code is wrong but it is still a valid code.
For example, you may have to extract data from an ancient repair parts
ordering system that was programmed in 1968 to assign a product code of
100 to all transactions. Now, however, product code 100 stands for something
other than repair parts.
Wrong calculations, aggregations This situation refers to when you decide to or have to load
data that have already been calculated or aggregated outside the data warehouse
environment. You will have to make a judgment call on whether to check
the data. You may find it necessary to bring data into the warehouse environment
solely to allow you to check the calculation.
Duplicate records
There usually are two situations to be dealt with. First, there are duplicate
records within one system whose data are feeding the warehouse. Second,
there is information that is duplicated in multiple systems that feed in
the same type of information. For example, maybe you are feeding in data
from an order entry system for products and an order entry system for services.
Unbeknownst to you, your branch in West Wauwatosa is booking services in
both the product and service order entry systems. (The possibility of situation
like this may sound crazy until you encounter the quirks in real world
systems.) In both cases, note that you may miss the duplicates if you feed
already aggregated data into the warehouse.
Wrong information entered into source system
Sometimes a source system contains data that were simply incorrectly entered into the system. For instance, someone may have keypunched 6/9/96 as 9/6/96. Now the obvious action is to correct the source system. However, sometimes, for various reasons, the source system cannot be corrected. Note that if you have many errors in a source system that cannot be corrected, you have a much larger issue in that you do not really have a reliable "system of record".
Incorrect pairing of codes
This is best described by an example. Sometimes there are supposed to be rules
that state that if a part number suffix is XXX, then the category code should
be either A, B, or C. In more technical terms, there is a non-arithmetic
relationship between attributes whose rules have been broken.
Incomprehensibility errors
These are the types of conditions that make source data
difficult to read.
Multiple fields within
one field This is the situation where a source system has
one field which contains information that the data warehouse will carry
in multiple fields. By far the most common occurrence of this problem is
when a whole name, e.g., "Joe E. Brown", is kept in one field in the source
system and it is necessary to parse this into three fields in the warehouse.
Weird formatting to conserve
disk space This occurs when the programmer of the source system
resorted to some out of the ordinary scheme to save disk space. In addition
to singular fields being formatted strangely, the programmer may also have
instituted a record layout that varies.
Unknown codes
Many times you can figure out what 99% of what codes mean. However, you
usually find that there will be a handful of records with unknown codes
and usually these records contain huge or minuscule dollar amounts and
are several years old.
Spreadsheets and word
processing files Often in order to perform the initial load
of a data warehouse it is necessary to extract critical data being held
in spreadsheet files and/or "merge list" files. However, often anything
goes in these files. They may contain a semblance of a structure with data
that are half validated.
Many-to-many relationships
and hierarchical files that allow multiple parents Watch out
for this architecture in source systems. It is easy to incorrectly transfer
data organized in such manner.
Inconsistency errors
The category of inconsistency errors encompasses the
widest range of problems. Obviously similar data from different systems
can easily be inconsistent. However, data within one system can be inconsistent
across locations, reporting units, and time.
Inconsistent use of different
codes Much of the data warehousing literature gives the example of one
system that uses "M" and "F" and another system that uses
"1" or "2" to distinguish gender. May I suggest that you
wish that this is the toughest data cleaning problem you will face.
Inconsistent meaning of
a code This is usually an issue when the definition of an organizational
entity changes over time. For example, say in 1995 you have customers A, B, C,
and D. In 1996, customer A buys customer B. In 1997, customer A buys customer C.
In 1998, Customer A sells of part of what was A and C to customer D. When you
build your warehouse in 1999, based on the type of business analysis you
perform, you may face the dilemma of how to identify the sales to customers A,
B, C, and D in previous years.
Overlapping codes This is a situation where one source system
records, say, all its sales to Customer A with three customer numbers and
another source system records its sales to customer A with two different
customer numbers. Now, the obvious solution is to use one customer number here.
The problem is that there is usually some good business reason why there are
five customer numbers.
Different codes with the
same meaning For example, some records may indicate a color of violet and
some may indicate a color of purple. The data warehouse users may want to see
these as one color. More annoyingly, sometimes spaces and other extraneous
information have been inconsistently embedded in codes.
Inconsistent names and
addresses Strictly speaking this is a case of different codes with the
same meaning. My unscientific impression of this type of problem is that decent
knowledge of string searching will allow you to relatively easily make name and
address information 80% consistent. Going for 90% consistency requires a huge
jump in the level of effort, Going for 95% consistency requires another
incremental huge jump in effort. As for 100% consistency in a database of
substantial size, you may want to decide if sending a person to Mars is easier.
Inconsistent business
rules This, for the most part, is a fancy way of saying that calculated
numbers are calculated differently. Normally, you will probably avoid loading
calculated numbers into the warehouse but there sometimes is the situation where
this must be done. As noted before, you may have to feed data into the warehouse
solely to check calculations. - This can also mean that a non-arithmetic
relationship between two fields (e.g., if a part number suffix is XXX, then
the category code should be either A, B, or C) is non consistently followed.
Inconsistent aggregating
Strictly speaking this is a case of inconsistent business rules. In a
nutshell, this refers to when you need to compare multiple sets of aggregated
data and the data are aggregated differently in the source systems. I believe
the most common instance of this type of problem is where data are aggregated by
customer.
Inconsistent grain of
the most atomic information Certain times you need to compare multiple
sets of information that are not available at the same grain. For example,
customer and product profitability systems compare sales and expenses by product
and customer. Often sales are recorded by product and customer but expenses are
recorded by account and profit center. The problem occurs when there is not
necessarily a relation between the customer or product grain of the sales data
and the account - profit center grain of the expense data.
Inconsistent timing Strictly speaking this is a case of
inconsistent grain of the most atomic information. This problem especially comes
into play when you buy data. For example, if you work for a pickle company you
might want to analyze purchased scanner data for grocery store sales of
gherkins. Perhaps you purchase weekly numbers. When someone comes up with the
idea to produce a monthly report that incorporates monthly expense data from
internal systems, you'll find that you are, well, in a pickle.
Inconsistent use of an
attribute For example, an order entry system may have a field labeled
shipping instructions. You may find that this field contains the name of the
customer purchasing agent, the e-mail address of the customer, etc. A more
difficult situation is when different business policies are used to populate a
field. For example, perhaps you have a fact table with ledger account numbers.
You may find that entity A uses account '1000' for administrative expenses while
entity B uses '1500' for administrative expenses. (This problem gets more
interesting if entity A uses '1500' and entity B uses '1000' for something other
than administrative expenses.)
Inconsistent date cut-offs
Strictly speaking this is a case of inconsistent use of an attribute. This is
when you are merging data from two systems that follow different policies as to
dating transactions. As you can imagine, the issue comes up most with dating
sales and sales returns.
Inconsistent use of nulls,
spaces, empty values, etc. Now this is not the hardest problem to correct
in a warehouse. It is easy, though, to forget about this until it is discovered
at the worst possible time.
Lack of referential integrity It is surprising about how many
source systems have been built without this basic check.
Out of synch fact data
Certain summary information may be derived independently from data in
different fact tables. For example, a total sales number may be derived from
adding up either transactions in a ledger debit/credit fact table or
transactions in a sales invoice fact table. Obviously there may be differences
because one table is updated later than another table. Often, however, the
differences are symptoms of deeper problems.
Some ending thoughts
I hope this paper adds to the understanding of what takes
up the majority of time in a data warehouse. Let me offer the following
ending thoughts:
Be prepared for a lot
of tedious work. Probably the most important "tools" for solving
these problems are a sharp eye and endurance for checking an abundance
of detail information.
You may spend much more
time checking for errors than cleaning up errors. Most of
these errors do not jump out at you.
The errors of inconsistency
are the most difficult to handle. At least that is my experience.
The complexity of a data
warehouse increases geometrically with the number of sources of data fed
into it. Having to reconcile inconsistent systems is the reason.
For example, if it takes 100 hours to reconcile data from two source systems,
you can expect that it will take on the order of 400, not 200, hours to
reconcile data from four source systems.
The complexity of a data
warehouse increases geometrically with the span of time of data to be fed
into it. My previous comment applies. Note, however, that
reconciling inconsistencies over time may be even harder because the people
who know what happened in previous years may not be around to answer your
questions.
You will be faced with
an economic and political question as to how erroneous the data in your
system will be. Completely fixing some of these problems can be
quite expensive. More vexingly, often what constitutes "correct" data is
debatable. What you do, more often then not, boils down to a question
of money and politics.

|