44 Computer
The “Achieving Fast Query Response Time” sidebar outlines some of the techniques used to accomplish this. ROLAP systems typically scale better in the number of facts they can store (although some MOLAP tools are now becoming just as scalable), are more flexible with respect to cube redefinitions, and provide better support for frequent updates. The virtues of the two approaches are combined in the hybrid OLAP approach, which uses MOLAP technology to store higher-level summary data and ROLAP systems to store the detail data.
ROLAP implementations typically employ star or snowflake schemas [5], both of which store data in fact tables and dimension tables. A fact table holds one row for each fact in the cube. It has a column for each measure, containing the measure value for the particular fact, as well as a column for each dimension that contains a foreign key referencing a dimension table for the particular dimension.
Star and snowflake schemas differ in how they handle dimensions, and choosing between them largely depends on the desired properties of the system being developed. As Figure 3 shows, a star schema has one table for each dimension. The dimension table contains a key column, one column for each dimension level containing textual descriptions of that level’s values, and one column for each level property in the dimension.
The star schema’s fact table holds the sales price for one particular sale and its related dimension values. It has a foreign key column for each of the three dimensions: product, location, and time. The dimension tables have corresponding key columns and one column for each dimension level—for example, LocID, City, and Country. No column is necessary for the T level, which will always hold the same value. The dimension table’s key column is typically a dummy integer key without any semantics. This prevents misuse of keys, offers better storage use, and provides more support for dimension updates than information-bearing keys from the source systems [5].
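A star schema of this kind can be sketched with Python’s sqlite3 module. The table names, columns, and data below are illustrative (loosely following Figure 3), not the article’s actual schema; note the single denormalized table per dimension and the dummy integer keys.

```python
import sqlite3

# Minimal star-schema sketch: one fact table, one denormalized table
# per dimension, joined via semantics-free integer surrogate keys.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Location (LocID INTEGER PRIMARY KEY, City TEXT, Country TEXT);
CREATE TABLE Time     (TimeID INTEGER PRIMARY KEY, Day TEXT, Month TEXT, Year INTEGER);
CREATE TABLE Product  (ProdID INTEGER PRIMARY KEY, Name TEXT, Category TEXT);
CREATE TABLE Sales    (ProdID INTEGER REFERENCES Product,
                       LocID  INTEGER REFERENCES Location,
                       TimeID INTEGER REFERENCES Time,
                       Price  REAL);          -- one row per fact (sale)
""")
con.execute("INSERT INTO Location VALUES (1,'Aalborg','Denmark'),(2,'Berlin','Germany')")
con.execute("INSERT INTO Time VALUES (1,'2001-05-01','2001-05',2001)")
con.execute("INSERT INTO Product VALUES (1,'Milk','Dairy')")
con.executemany("INSERT INTO Sales VALUES (?,?,?,?)",
                [(1, 1, 1, 5.0), (1, 2, 1, 7.0), (1, 1, 1, 3.0)])

# Rolling up to the Country level needs only one join per dimension used.
rows = con.execute("""
    SELECT l.Country, SUM(s.Price)
    FROM Sales s JOIN Location l ON s.LocID = l.LocID
    GROUP BY l.Country
    ORDER BY l.Country
""").fetchall()
print(rows)  # [('Denmark', 8.0), ('Germany', 7.0)]
```

Because every level of a dimension sits in one table, a roll-up to any level touches a single dimension table, which keeps query formulation simple.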
Redundancy will occur in higher-level data. For example, because May 2001 has 31 day values, the year value “2001” will be repeated 30 times. Because dimensions typically only take up one to five percent of a cube’s total required storage, however, redundancy is not a storage problem. Also, the central handling of dimension updates ensures consistency. Thus, using denormalized dimension tables, which support a simpler formulation of better-performing queries, is often beneficial.
Snowflake schemas contain one table for each dimension level to avoid redundancy, which may be advantageous in some situations. The dimension tables each contain a key, a column holding textual descriptions of the level values, and possibly columns for level properties. Tables for lower levels also contain a foreign key to the containing level. For example, the day table in Figure 4 contains an integer key, the date, and a foreign key to the month table.
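The normalized alternative can be sketched the same way; again the names and data are illustrative (loosely following Figure 4), with one table per level of the Time dimension and a foreign key from each level to its containing level.

```python
import sqlite3

# Snowflake sketch of the Time dimension: one table per level,
# lower levels referencing the level that contains them.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Year  (YearID  INTEGER PRIMARY KEY, Year  INTEGER);
CREATE TABLE Month (MonthID INTEGER PRIMARY KEY, Month TEXT,
                    YearID  INTEGER REFERENCES Year);
CREATE TABLE Day   (DayID   INTEGER PRIMARY KEY, Date  TEXT,
                    MonthID INTEGER REFERENCES Month);
""")
con.execute("INSERT INTO Year  VALUES (1, 2001)")
con.execute("INSERT INTO Month VALUES (1, 'May', 1)")
con.executemany("INSERT INTO Day VALUES (?,?,1)",
                [(d, f"2001-05-{d:02d}") for d in range(1, 32)])

# The year value 2001 is stored exactly once, but reaching it from a
# day now costs two joins instead of none.
rows = con.execute("""
    SELECT y.Year, COUNT(*)
    FROM Day d JOIN Month m ON d.MonthID = m.MonthID
               JOIN Year  y ON m.YearID  = y.YearID
    GROUP BY y.Year
""").fetchall()
print(rows)  # [(2001, 31)]
```

The trade-off is visible directly: the redundancy of the 31 repeated year values disappears, at the price of a join chain whose length grows with the number of levels.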
COMPLEX MULTIDIMENSIONAL DATA
Traditional multidimensional data models and implementation techniques assume that

• all facts map directly to the lowest-level dimension values and only to one value in each dimension, and
• dimension hierarchies are balanced trees.
When these assumptions fail, however, standard models and systems do not adequately support the desired applications. Complex multidimensional data is especially problematic because it is not summarizable: higher-level aggregate results cannot be derived from lower-level aggregate results. Either queries that reuse lower-level results return wrong answers, or precomputing, storing, and subsequently reusing lower-level results to compute higher-level results becomes impossible. Aggregates must instead be calculated directly from base data, which considerably increases computational costs.
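A small self-contained example (with hypothetical products and sales figures) shows how a non-strict hierarchy breaks summarizability: a product mapped to two parent categories is counted twice when category totals are reused to compute the grand total.

```python
# Base data: sales per product (hypothetical figures).
sales = {"Milk": 10, "SoyMilk": 20}

# Non-strict hierarchy: SoyMilk has two direct parents.
parents = {"Milk": ["Dairy"], "SoyMilk": ["Dairy", "Beverages"]}

# Lower-level aggregate: total sales per category.
category_totals = {}
for product, amount in sales.items():
    for cat in parents[product]:
        category_totals[cat] = category_totals.get(cat, 0) + amount

reused_total = sum(category_totals.values())  # reuse lower-level aggregates
true_total = sum(sales.values())              # aggregate from base data

print(category_totals)               # {'Dairy': 30, 'Beverages': 20}
print(reused_total, true_total)      # 50 30 -> SoyMilk counted twice
```

With a strict hierarchy the two totals would agree, which is exactly what lets a system reuse the category-level aggregate instead of rescanning the base data.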
Summarizability requires distributive aggregate functions and well-behaved dimension hierarchies [1,7]. Informally, a dimension hierarchy is strict if no dimension value has more than one direct parent, onto if the hierarchy is balanced, and covering if no containment path skips a level.
Achieving Fast Query Response Time
The most essential performance-enhancing techniques in multidimensional databases are precomputation and its more specialized cousin, preaggregation, which enable response times to queries involving potentially huge amounts of data to be fast enough to allow interactive data analysis.
Computing and storing, or materializing, a product’s total sales by country and month is one application of preaggregation. This enables fast answers to queries that ask for the total sales, for example by month alone, by country alone, or by quarter and country in combination. These answers can be derived entirely from the precomputed results without needing to access the bulk of the data in the data warehouse.
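This derivation works because SUM is distributive; a minimal sketch with hypothetical figures, starting from a materialized (country, month) aggregate:

```python
from collections import defaultdict

# Precomputed aggregate: total sales per (country, month).
by_country_month = {
    ("Denmark", "2001-01"): 100, ("Denmark", "2001-02"): 150,
    ("Denmark", "2001-04"): 80,  ("Germany", "2001-01"): 200,
}

def quarter(month):  # "2001-02" -> "2001-Q1"
    y, m = month.split("-")
    return f"{y}-Q{(int(m) - 1) // 3 + 1}"

# Coarser aggregates derived from the stored one, never from base data.
by_country = defaultdict(int)
by_country_quarter = defaultdict(int)
for (country, month), total in by_country_month.items():
    by_country[country] += total
    by_country_quarter[(country, quarter(month))] += total

print(dict(by_country))          # {'Denmark': 330, 'Germany': 200}
print(dict(by_country_quarter))
# {('Denmark', '2001-Q1'): 250, ('Denmark', '2001-Q2'): 80,
#  ('Germany', '2001-Q1'): 200}
```

Every query at or above the (country, month) level can be answered from this one stored table, since country and quarter both contain month and country respectively in their hierarchies.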
The latest versions of commercial relational database products, as well as dedicated multidimensional systems, offer query optimization based on precomputed aggregates and automatic maintenance of stored aggregates during updating of base data [1].
Full preaggregation, materializing all combinations of aggregates, is infeasible because it takes too much storage and initial computation time. Instead, modern OLAP systems adopt the practical preaggregation approach of materializing only select combinations of aggregates and then reusing these to efficiently compute other aggregates [2]. Reusing aggregates requires a well-behaved multidimensional data structure.
References
1. R. Winter, “Databases: Back in the OLAP Game,” Intelligent Enterprise Magazine, vol. 1, no. 4, 1998, pp. 60-64.
2. E. Thomsen, G. Spofford, and D. Chase, Microsoft OLAP Solutions, John Wiley & Sons, New York, 1999.