What to Learn About in Order to Speed Up Data Warehouse Loading

 This paper is another laundry list of items data warehouse implementers may wish to learn more about in order to speed up the process of extracting, transforming and loading data (henceforth simply referred to as loading) or to make these processes less prone to errors. This paper will not attempt to provide detailed explanations of these topics. Nor is including a topic in this list a declaration that knowledge of the topic will definitely speed up loading. Rather, data warehouse implementers may use this paper as a starting point in their search for ways to speed up loading. This list does not include points relevant to a specific vendor’s technology. Your DBA should know some ways of speeding up the load that apply only to the technology of your DBMS vendor.

How often the users really need updated data

Oftentimes data warehouse developers unquestioningly give in to the most extreme demands for freshness of data or they automatically assume data need to be updated far more often than makes business sense. Though you read sometimes ridiculous articles in the trade press and from industry analysts (who have coined the awful term "information latency") about how the business world wants to know everything immediately, the reality is quite different. If your data warehouse is not there to support day-to-day monitoring and analysis, question why it should be updated daily. If your data warehouse is not there for week-to-week monitoring and analysis, question why it should be updated weekly. By the way, though, if you do decide to update weekly or monthly, try to design your loading process so you are not tied to loading at a specific interval. There may be certain "crunch" times when you have to load more frequently.

How to drop and re–establish indices and how to set index fill factors

If you update a large portion of the database (I've heard estimates from 10 – 25% up), you may want to learn about dropping indices before a database load and then re–establishing them after the load. If you do not drop indices, you want to make sure you set the index fill factors so your server’s disk drives do not waste time looking for space in which to write index updates.

What facilities does the database have for bulk loading data and which of those facilities does it make sense to use

Many databases have ways of speeding up loading at the expense of data integrity checking. Note that certain bulk loaders do more than load – they will reformat data and sometimes aggregate data.

What input file formatting will speed up bulk loading

Oftentimes operations done on the input data on the feeder system platform (e.g., sorting, eliminating packed and signed fields) can speed up loading.

How to parallelize table load and index maintenance or re–creation

Dropping indices and bulk loading in parallel can drastically improve loading time. By the way, learn the differences between pipeline, component, and data parallelism. Given the circumstances, these different types of parallelism can have widely varying amounts of effect.

How to load databases via a stream

Certain ETL tools will allow you to extract, transform, and load in one process. That is, it is not necessary to create intermediate files. You do, though, have to be careful about data source, platform, size, scalability restrictions and limitations on how sophisticated your transformations can efficiently be.

How indices are used by your database optimizer

You need to learn this so you can figure out whether your indices are actually going to get used. In more recent versions of DBMS software, you may be able to get away with less indices than in older versions.

What integrity checks should be done in the loading process

After you perform the initial load of data warehouse tables, you may want to start a "discussion" of how all the errors you found should be trapped in the feeder systems (preferably at data entry time).

Where does it make sense to transform the data

There may be faster places to do it than in your data warehouse database system. You may want to work with flat files and a dedicated sort/merge utility either on the data warehouse platform or, if the source data are on another platform, you may want to do it on that platform. The problem with doing this on the source system platform, though, is that you then will need people skilled in that platform and you may be invading someone else’s fiefdom.

Where processes can be done in memory

If you have got the available memory, learn how to use it. Sorts especially can be speeded up by doing them in memory.

What domain integrity checks should be in the data warehouse database

Depending on how you resolve the above two issues, you have to investigate the sensibility of incorporating referential integrity or any other type of domain integrity checking in your database.

Where does it make sense to aggregate the data

Sometimes if you do the aggregating outside the data warehouse database environment you can create multiple aggregate output files in one "pass" of the input data. You will probably have to learn how to use memory very carefully if you do this (and have a lot of memory on the server on which you are doing the aggregating).

What statistics are available on aggregate table usage

As you might have read ad nauseum, building a data warehouse is an iterative undertaking. You will probably create aggregates that seldom get used. You need these statistics for making the case for deleting the aggregates (though be forewarned this can get you into a quirky political aspect of data warehouse management.)

What level of data it makes sense to aggregate it and what non-additive measures are sensible to include in your aggregate tables

Say you have region, territory, customer, product, and salesperson dimensions. You may find that you get the most benefit by creating a region, territory, customer, product, and salesperson aggregate and say, that, an additional region, territory, customer, product aggregate adds little to the performance of your queries. A complicating factor, though, is use of non-additive measures in your aggregates because they will force you to re–aggregate. Suffice it to say that you should think twice before adding these measures to your aggregates.

What are non–FTP ways of transferring data

FTP–ing can be slow. There are a number of high speed transfer technologies to investigate. Also, don’t forget about tape. Even if you have to send a tape overnight for early delivery, tape is sometimes the fastest way to transfer data. Also, don’t forget about using compression technology in conjunction with transferring.

Whether you should incrementally update or rebuild a table

Sometimes you have the option to either incrementally update a table or rebuild a table. You may find that after a certain level of update activity it is faster to rebuild than to update. A rule of thumb sometimes stated is that if 20% of the records will be updated, it is faster to rebuild. This is a rough rule and the actual threshold will vary. Nevertheless, if you have options, it may be worth experimenting with them.

What are alternate methods for changed data capture

Presuming you must incrementally update your data warehouse database and you are not extracting from date stamped transaction records in the feeder system, you may find you have a technically daunting task in capturing changed information. Be aware that you may have options in how you do this and the options will differ in speed.

How to modify feeder systems so changes to records are written to flat files

Though this usually is not worth it, if this is done it can eliminate the time needed to go through sometimes time consuming, convoluted processing to determine what feeder system data has changed.

How to use report scraping software

If a report that has the data you need to extract is available, sometimes it make sense to put the report image in a file and use software specially designed to extract data from report image files. You do run a risk if the report format changes. But this technique often makes sense for extracting data the systems whose code hasn't been touched in the last ten years.

How to perform disk mirroring and hot backups

Disk mirroring and hot backups will not speed up loading the data warehouse database (in fact, if a disk is mirrored while being bulk loaded, loading time can greatly increase) but they can give you some greatly desired flexibility and breathing room. With mirrored disks, you can "break" the mirror, update the copy, and restore the mirror with the updated copy. This means that you can still have your data warehouse available while loading it. (Though be careful that you understand how mirroring can be handled by both hardware and software). Similarly, hot backups allow you to have your data warehouse database available when backing it up. By the way, a cycle of partial backups followed by a full backup is also worth looking into.

How to schedule loading processes

Loading a data warehouse usually requires quite a few processes. Obviously, you want to understand where there are and are not dependencies so you can "multi–task" these processes as much as possible. Where there are dependencies, you want to do risk analyses so you can find out whether it is worth the effort to build in restart capabilities in the intermediate processes. And you want to make sure you have the human and automated support for scheduling the way you want to.

How to set a restartable checkpoint

Again, checkpoints will not by themselves speed up the loading process. However, if you have a tight window for loading the data warehouse and that loading takes considerable time, availability of a checkpoint can be a lifesaver when the load crashes (which it does at the worst times).

How certain forms of RAID technology can both speed and slow loading

RAID technology can both help and harm loading speed.

Partial updating of multidimensional (MOLAP) databases

Many of these tools allow you to only recalculate some of the calculated numbers stored in the "cube". Most of these tools that have the capability will warn you that you do so at the risk of possibly getting data out of synch.

How to distribute data on multiple physical disks

If you can afford multiple disks, you may want to make sure input data, data warehouse tables, indexes, and logs (if you do not disable logging) are on different physical disks. In fact, you may want to learn about striping to spread a file over multiple disks and partitioning to divide a logical file into many physical files spread over different disks.

How to defragment table and index files

This is basic knowledge it will probably do you well to know.

How to make a copy of your transaction system database

If you really want to use your data warehouse only for production reporting, you may be better off just copying the transaction database periodically as is. Architectural purists hate this solution but sometimes it just makes sense to handle your reporting needs this way.

How to use multiple disk controllers

You will want high–speed interconnects to these controllers.

What is the cost of installing more/faster CPU, memory, disk

Sometimes buying metal is (by far) the least expensive way to speed up loading.

Some final comments – In the long run long loading times usually will cause bigger problems than long query times. It is not completely uncommon that data warehouse development teams find themselves with systems they have promised to update daily but then they find the update time stretches to 12, 14, 16, and maybe even 20 hours. You can throw more and more technology at this but ultimately your best tactics are the ability to understand what really is most important to the business and good user expectation management. And, unless it is done by design, do not let your data warehouse be the main source for operational–oriented query and report functionality that, in the big picture, ought to be in the feeder transaction processing systems. 

Comments? Contact Larry Greenfield at larryg@lgisystems.com