Data Warehousing: From Business Goals to a Dimensional Data Model
Learn how the design of a dimensional data model emerges from the business goals that you want to achieve.
By Yuli Vasiliev
Not a secret, the data model on which a data warehouse is built comes from the organization's business requirements. So, when it comes to a data warehousing project, you start with identifying and analyzing the goals of your organization, as well as determining the requirements to achieve those goals. Once you have defined a set of business requirements, the next step is to translate them into technical requirements that will be used as a basis for designing your data warehouse.
This concept can be best understood by example. Suppose you need to analyze data derived from purchase orders. Among other things, you want to understand how shipping options that the customer chooses correlate with the total amount in an average purchase order. When translated into technical terms, this business requirement requires you to aggregate the PO data over available shipping options. To address this requirement, you naturally arrive at the idea of defining a Shipping dimension during the design stage of the dimensional model for your data warehouse.
Continuing with the above example, this article covers the steps you need to follow when working on a data warehousing project. Of course, in a real-world scenario, you might need to implement a number of dimensional objects, including a set of dimensions. For the sake of simplicity however, the example in this article illustrates how to implement a single dimension.
Phases of Constructing a Data Warehousing Environment
Designing a data warehouse is a multi-level process. Leaving the implementation details for later, you naturally start with more conceptual and abstract things. There is a concept of a logical design for a data warehousing environment, followed by a physical design. Of course, the logical design is not your starting point either when it comes to constructing a data warehousing environment. As mentioned, it all starts with identifying business requirements that you then convert into technical requirements, which, in turn, are used as a basis for the design of your data warehouse. This process is summarized diagrammatically below:
Figure 1: The major phases of constructing a data warehousing environment.
The rest of the article details the steps shown in the above diagram.
Analyzing Your Business Requirements
Like many other IT solutions, a data warehouse is developed in response to a business need. And business requirements are statements, which describe that need, representing high-level declarations of your goals and objectives.
A business requirements analysis is a multifaceted process that involves discovering, analyzing, and finally documenting the requirements related to your business need. A key aspect of a business requirements analysis is to clearly identify what end-users want from the system you’re designing, to help you establish a scope for the project.
It is important to emphasize that business requirements describe what should be done rather than how it should be done. For example, a business requirement for analyzing customer delivery satisfaction might be stated as follows:
To keep up with the oncoming increased demand, address the issue of improving customer delivery satisfaction, offering shipping options that will keep customers happy.
When translated into a technical requirement, the above might look like this:
Identify when, how, and what is sold by each shipping option the customer chooses. Define a Shipping dimension to facilitate analysis of the PO data over the shipping options chosen.
The above is a simple example of implementing a business requirement in a data model, resulting in the need for a dimension to be defined. Later in this article, you will see what the structure of the Shipping dimension mentioned above might look like.
As you no doubt have realized, the process of identifying, documenting and then translating business requirements deserves an article of its own and is out of scope here. An example of how the design of a dimensional data model emerges from business requirements can be found in this case study from the OLAP User's Guide in the Oracle Database Documentation.
Converting Business Questions into SQL Queries
As mentioned, business requirements are derived from analysis of business goals. Once identified, business requirements, in turn, are used as a basis for deriving technical requirements, which you then use to determine your data warehouse design. It is interesting to note however, that the above is not the only chain in the design process, which starts from business goals. Also, business goals provide the basis for identifying business questions you want to get answered through analyzing your data.
Of course, business questions expressed in human language cannot be sent to a data warehouse system directly. They must first be translated into SQL queries. Turning to the example from the previous section, suppose you have the following question: "What was the cumulative dollar amount of the orders shipped overseas over the last three quarters?" When translated into SQL, this question would turn into a SELECT statement that might look like this:
SELECT t.quarter_dsc quarter,
time_dim t, shipping_dim s, po_fact f
t.time_key = f.time_key AND
s.shipping_key = f.shipping_key AND
t.quarter_dsc IN ('1Q2016', '2Q2016', '3Q2016') AND
s.shipping_group = 'Global Shipping'
GROUP BY t.quarter_dsc;
As you can see, the above is a join query that apparently involves two dimension tables and a fact table.
In practice however, you will more likely deal with a cube rather than with its underlying fact table directly. So, the same query issued against a cube might look like this:
SELECT t.time_dsc quarter, s.amount revenue
FROM cube_view cv,
WHERE t.time_dsc IN ('1Q2016', '2Q2016', '3Q2016')
AND s.group_dsc = 'Global Shipping'
AND p.level_name = 'TOTAL'
AND p.dim_key = cv.product
AND t.dim_key = cv.time
AND s.dim_key = cv.shipping
ORDER BY quarter;
Note that unlike the preceding SQL statement, this one does not include a SUM operator and a GROUP BY clause. This is because the cube already contains aggregated data – there is no need to re-aggregate it, since it would degrade performance. To select the desired aggregates, you apply filters to every cube’s dimension in the WHERE clause.
Examining dimension filters in the above query can give you a grasp of how a cube aggregates data. So, you may notice that a filter can be applied either to an entire level of a dimension or to a particular dimension member at any level, thus instructing the query to select corresponding aggregates stored in the cube.
Dimensions are further discussed in the next section, where you’ll look at how a list of shipping options available to customers can be translated into a dimension object when designing a data model for a warehouse.
Designing a Dimensional Data Model
Once the scope of your project has been set in accordance with the requirements identified, you can proceed to identify the dimensional objects and their relationships in a data model. Thus, you need to identify and define dimensions and cubes. As an example, let’s look at how you might define a logical structure of the Shipping dimension already introduced in this article.
Categorizing data to help answer business questions, dimensions consist of attributes, levels, and hierarchies. Before you can define those objects for a dimension however, you have to understand the data the dimension is supposed to contain. So, turning to our example, suppose the shipping options list is as follows:
As you can see, the shipping options in the above list are grouped into a two-level hierarchy. Defining a dimension in a data warehousing environment, you also have to add a Total level as the highest level of aggregation. So, you might define the following three-level hierarchy for the Shipping dimension:
Total > Group > Shipping
Each level must have level attributes associated with it, including a surrogate identifier and a business identifier. You can also define level attributes that provide descriptive information about a level member. Thus, each level in the Shipping dimension might have the following attributes: ID, NAME, and DESCRIPTION. Here, ID is the surrogate identifier to uniquely identify each level record across the entire dimension; NAME is the business identifier to uniquely identify each level record across a level; and DESCRIPTION is to describe a level member.
It is important to note that a dimension can have more than one hierarchy of levels. Continuing with the Shipping dimension discussed here, suppose you also want to enable analysis across shipping carriers – not only shipping options. In that case, you might add the following hierarchy to the dimension:
Total -> Carrier
This might also have the Group level in the middle, if you want to divide carriers into groups, such as, for example, national carriers and regional carriers.
Implement a Dimensional Data Model with ODI 12cR2
Now that you have decided on the Shipping dimension structure, you might want to implement it. In this section you’ll look at how this can be done with Oracle Data Integrator (ODI) 12c Release 2, using its new support for dimensional objects. If you don’t have it installed and configured in your system, the simplest way to get started is to take advantage of the pre-built virtual machine for Oracle Data Integrator 12cR2 that you can run in Oracle VM VirtualBox.
Although the above virtual machine comes with a pre-installed Oracle database that you can use for storing source data sets to be used in ODI, you are free to use another database for it, even if it resides in the cloud. Thus, for example, you might implement the underlying tables for the Shipping dimension discussed here in an Oracle database located in the cloud. Connect ODI to Oracle Database Cloud Service (DBCS) describes in detail how ODI on-premise can integrate with Oracle Database Cloud Service (DBCS), allowing you to work with a database instance of DBCS in ODI as if it were an on-premise instance.
Turning to the article sample, your first step in implementing the Shipping dimension is to create a set of database tables that will be used for storing the dimension data. So, wherever your database resides, you need to connect to it and create – preferably in a separate database schema – the following underlying tables for the Shipping dimension and its levels:
CREATE TABLE dw1_pr.shipping_dm(
CONSTRAINT shipping_dm__pk PRIMARY KEY(DIMENSION_KEY)
CREATE TABLE dw1_pr.shipping_shipping_stg(
CREATE TABLE dw1_pr.shipping_group_stg(
CREATE TABLE dw1_pr.shipping_total_stg(
Now you can launch ODI Studio and define corresponding dimensional objects on top of the above tables. Before you can do it, you must reverse engineer these tables into an ODI model, which, in turn, can be created upon a topology defined for this model in advance.
Below are the general steps you need to accomplish in order to set up an ODI model for the database schema containing the above tables. (The detailed steps provided for a similar example can be found in my OTN article Data Warehousing with Oracle Data Integrator 12cR2):
- Define a data server corresponding to the database schema containing the tables created above.
- Create a physical schema for the data server defined above.
- Create a logical schema and associate it with the above physical schema.
- Create a model for the topology just defined.
- Reverse-engineer the model.
As a result, the above model should include the following datastores: SHIPPING_DM, SHIPPING_TOTAL_STG, SHIPPING_GROUP_STG, SHIPPING_SHIPPING_STG, as shown in the screenshot below, taken from the Designer tab of ODI Studio:
Figure 2: The model reverse-engineered from the underlying database.
Now that you have the underlying tables reverse-engineered into an ODI model, you can go ahead and define dimensional objects on top of those tables. The only preliminary step left is to define a sequence that will be used to automatically generate surrogate key values for the dimension table. This step is described in detail in the Defining a Sequence in ODI Studio section of Data Warehousing with Oracle Data Integrator 12cR2.
Now you are ready to create the Shipping dimension. For that, complete the following steps in ODI Studio:
1. On the Designer tab of ODI Studio, create a new dimensional model and then create a new dimension under it:
Figure 3: Creating a new dimension under a dimensional model in ODI Studio.
2. On the Definition tab of the new dimension panel, enter Shipping as the name for the dimension, select SHIPPING_DM as the Datastore, and select the sequence (you should have already created) as the Surrogate Key Sequence.
3. Move on to the Levels tab of the Shipping dimension panel, and define these levels: Total, Group, and Shipping, one after the other. The screenshot in the figure below illustrates how you could define the Total level:
Figure 4: Defining a level of the Shipping dimension.
4. For the Group and Shipping levels, you also need to define parent references in the Parent Level References section on the Levels tab.
5. Move on to the Hierarchies tab of the Shipping dimension panel, and define a hierarchy, as shown in the screenshot below:
Figure 5: Defining a hierarchy of levels for the Shipping dimension.
6. In the File menu of ODI Studio, select Save.
Once the above steps are completed, you have the Shipping dimension object in ODI, which is ready to be populated with data. It is important to note that ODI understands different formats such as JSON and XML – so you are not limited only to relational data sources.
As you learned in this article, the design of a data model for a data warehouse starts long before any technical terms come into play. First, you must define a set of business requirements that drive the need for your data warehouse solution. The, you looked at how a business domain entity expressed in human language is then transformed into a SQL statement or a dimensional object, thus implementing a business requirement in a data model.
About the Author
Yuli Vasiliev is a software developer, freelance author, and consultant currently specializing in open source development, Java technologies, business intelligence (BI), databases, service-oriented architecture (SOA), and more recently virtualization. He is the author of a series of books on the Oracle technology, including Oracle Business Intelligence: An introduction to Business Analysis and Reporting (Packt) and PHP Oracle Web Development: Data processing, Security, Caching, XML, Web Services, and Ajax (Packt).