12 Replies Latest reply: Mar 21, 2012 6:03 PM by jgarry RSS

    How to design a fact table to keep track of active dimensions?

    922141
      I would like to design a classic OLAP facts table using a star scheme. The SQL model of the facts table should be independent of any concrete RDBMS technology and portable between different systems.

      The problem is this: users should be able to select subsets of the facts based on conjunctive queries on the dimension values defined for the facts. However, the program that provides the interface for doing this to the user should only present those dimensions where anything is still selectable at all. For example, if a user selected year 2001 and for dimension contract code there is only a single value for all records in the fact table for that year, this dimension should not be shown to the user any more. This should be solved in a generic way. So for n dimensions in total, if the current set of facts is based on constraints from j dimensions, I want to know for which of the remaining n-j dimensions there is still something to select from and only show those.
      The obvious way is to make a count(*) query on the distinct foreign keys of each of the dimensions on the fact table, using the same where clause. That means that one would need (n-j) such queries on the whole facts table and that sounds like an awful waste of resources given that the original query for selecting the facts could have done it internally "on the fly".

      How can this be achieved in the most performant way? Is there a "classical" way of how to approach this problem? Is there tool support for doing this efficiently?

      Any help or pointers to where one could find out more about this would be greatly apreciated - thank you!
        • 1. Re: How to design a fact table to keep track of active dimensions?
          LKBrwn_DBA
          919138 wrote:
          I would like to design a classic OLAP facts table ... Etc ...

          ... This should be solved in a generic way ... and that sounds like an awful waste of resources
          ... Etc ...
          That is the price you pay for anything "Generic" in which you cannot make use of the built in functionality already provided by the RDBMS (whichever it is).

          Good luck in your futility!
          :p
          • 2. Re: How to design a fact table to keep track of active dimensions?
            922141
            Well many thanks for your kind words, but would this kind of functionality not be something that is needed for nearly every application that offers and interactive way to select facts by choosing values from a larger number of dimensions?
            This is not my personal spleen but as I would have thought something common enough so that there should be solution patterns or maybe even tools support to implement it in a way that is less clumsy than the naive approach of submitting those (n-j) complex queries on the fact table.
            • 3. Re: How to design a fact table to keep track of active dimensions?
              922141
              Can anyone else shed some light on this? Surely this must be a pretty common use case?
              • 4. Re: How to design a fact table to keep track of active dimensions?
                sb92075
                919138 wrote:
                Surely this must be a pretty common use case?
                or not.
                • 5. Re: How to design a fact table to keep track of active dimensions?
                  rp0428
                  >
                  I would like to design a classic OLAP facts table using a star scheme. The SQL model of the facts table should be independent of any concrete RDBMS technology and portable between different systems.
                  >
                  Nothing hard about that. It's done all the time.
                  CREATE TABLE MY_FACTS (age varchar2(20), marital_status varchar2(10))
                  Simple.

                  Create an DIM_AGE and DIM_MARITAL_STATUS dimension tables
                  e.g. DIM_AGE has AGE column with values like 'none', 'pre_teen', 'teen', 'young', 'middle-aged', 'old'
                  DIM_MARITAL_STATUS hav values like 'none', 'single', 'married', 'divorced'

                  That's your data model. But not all RDBMS support the same technology. So using Oracle as the example.
                  Create BITMAP indexes on the AGE and MARITAL_STATUS columns

                  >
                  users should be able to select subsets of the facts based on conjunctive queries on the dimension values defined for the facts
                  >
                  This has nothing to do with the database. This is UI. I did one of these using Java and CSV files about 10 years ago; just dug up my demo.

                  The UI shows a tree of 'Attributes' which will have 'age' and 'marital_status' children. Each attribute has children for each distinct value that shows the count for that attribute value
                  Attributes
                       age
                            none - 174531
                            pre_teen - 174402
                            teen - 174641
                            young - 174752
                            middle_aged - 175115
                            old - 175144
                       marital_status
                            none - 261875
                            single - 262808
                            married - 261962
                            divorced - 261931
                            You select tree items (attribute values) and operators "&, |, ^, ~, (, )" to create a compound nested expression and the result is shown in another panel. Each sub-operation shows the two values (with their cardinality) being joined and the cardinality of the result .
                   
                  young & divorced
                  
                  young - 174742 -----
                                                - & - 43798
                  divorced - 261931 --
                  A more complex example
                  (CA | CO) & (young & divorced) & ((baseball & pepsi) | (golf & coors))
                  
                  ca - 18951
                             -- | 37810
                  co - 18859
                                                             -- & - 1636
                  young - 174742
                                    - & - 43798
                  divorced - 261931
                                                                                    -- & - 112
                  baseball - 174564
                                    - & - 29363
                  pepsi - 174788
                                                              -- | 64040
                  golf - 17444
                                 - & - 34677
                  coors - 209604
                  The spacing may not come out properly but you should get the idea.

                  I wrote this in Java and it ran on an IPAQ using 1 million records for the FACT table with dimensions for: age, beer, education, marital_status, gender, softdrink, state, summer_sport and winter_sport.

                  You will notice the counts for set of attribute values is 1 million.
                  Results were displayed in real-time why the user was entering the compound condition. Enter a condition that only returned one value and there was no need to continue since '1' is the fewest you could get.

                  It was easy to see how different conditions (AND, OR) would filter the result. This was part of an application for selecting cable tv stations (FACT table) for advertising purchases. The user might be interested in reaching a particular demographic of viewer.
                  • 6. Re: How to design a fact table to keep track of active dimensions?
                    922141
                    Thank you rp0428, this is exactly what I am interested in!
                    I agree that what I am talking of is GUI, but what I was wondering about is how to find those counts in the most performance effective way. I am confronted with a facts table with several billion rows and several dozen dimensions, so performance is an issue.
                    Also, because there are so many dimensions, filtering those out which are not useful for further selection any more becomes even more important.

                    So question still is: how did you implement this kind of GUI? Did you get the counts for each value of each dimension by doing a separate query with the current "WHERE" clause on each dimension?
                    Because, with several dozen of dimensions, that would imply several dozen queries over a table with several bullion rows which does have a significant performance impact.
                    I am still hoping that there may be a more efficient way to do this or there may be some tool support to do it by going over the big facts table less often then several dozen times in order to get all the counts.
                    • 7. Re: How to design a fact table to keep track of active dimensions?
                      Paulie
                      >
                      So question still is: how did you implement this kind of GUI?
                      Take a look here - there are as many ways of doing this as there are development environments. Java swing,
                      PHP..;. the list is endless.

                      The question is what is its purpose - desktop, web, intranet? What are your own areas of
                      expertise?

                      I am still hoping that there may be a more efficient way to do this or there may be some tool support to do
                      it by going over the big facts table less often then several dozen times in order to get all the counts.
                      I'm not sure if I've understood you - you seem to want to pre-compute everything? This will not
                      be possible, because beyond a small number of facts, the number of possible combinations
                      is huge. You should let those who wish to analyse the data construct their own queries.


                      Paul...
                      • 8. Re: How to design a fact table to keep track of active dimensions?
                        rp0428
                        >
                        Did you get the counts for each value of each dimension by doing a separate query with the current "WHERE" clause on each dimension?
                        >
                        My method doesn't apply to your use case. I wrote a Java class to create my own bit-mapped indexes on CSV files. So each attribute value was a one million bit binary raw.

                        I don't know, and don't want to know, what your particular requirements are. But I can show you a basic process that will work for large numbers of rows. Get a simple process working and then explore to see if it will meet your particular needs. Not going to answer questions here about anything but about my example code

                        1. Assume a single fact table with one primary key column and multiple single-value attribute columns.

                        2. The table is not subject to DML operations AT ALL - truncate and load if you want apply changes. Meaning it will be useful for research purposes on archived data.

                        3. The purpose of the table is to select the fact table ROWIDs for records of interest. So the only value selected is a result set of ROWIDs that can then be used to get any of the normal FACt table data and other linked data as needed.

                        Create the table - insert some records, create a bitmap index on each dimension column and collect the statistics
                        ALTER TABLE SCOTT.STAR_FACT
                         DROP PRIMARY KEY CASCADE;
                        
                        DROP TABLE SCOTT.STAR_FACT CASCADE CONSTRAINTS;
                        
                        create table star_fact (
                            fact_key varchar2(30) DEFAULT 'N/A' not null,
                            age      varchar2(30) DEFAULT 'N/A' not null,
                            beer    varchar2(30) DEFAULT 'N/A' not null,
                            marital_status varchar2(30) DEFAULT 'N/A' not null,
                            softdrink varchar2(30) DEFAULT 'N/A' not null,
                            state    varchar2(30) DEFAULT 'N/A' not null,
                            summer_sport varchar2(30) DEFAULT 'N/A' not null,
                            constraint star_fact_pk PRIMARY KEY (fact_key)
                        );
                        
                        INSERT INTO STAR_FACT (FACT_KEY) SELECT ROWNUM FROM ALL_OBJECTS;
                        create bitmap index age_bitmap on star_fact (age);
                        create bitmap index beer_bitmap on star_fact (beer);
                        create bitmap index marital_status_bitmap on star_fact (marital_status);
                        create bitmap index softdrink_bitmap on star_fact (softdrink);
                        create bitmap index state_bitmap on star_fact (state);
                        create bitmap index summer_sport_bitmap on star_fact (summer_sport);
                        exec DBMS_STATS.GATHER_TABLE_STATS('SCOTT', 'STAR_FACT', NULL, CASCADE => TRUE);
                        Now if you run the 'complex' query for the example from my first reply you will get
                        SQL> set serveroutput on
                        SQL> set autotrace on explain
                        SQL> select rowid from star_fact where
                          2   (state = 'CA') or (state = 'CO')
                          3  and (age = 'young') and (marital_status = 'divorced')
                          4  and (((summer_sport = 'baseball') and (softdrink = 'pepsi'))
                          5  or ((summer_sport = 'golf') and (beer = 'coors')));
                        
                        no rows selected
                        
                        Execution Plan
                        ----------------------------------------------------------
                        Plan hash value: 1934160231
                        ------------------------
                        
                        | Id  | Operation                      | Name                  | Rows  | Bytes |
                        |   0 | SELECT STATEMENT               |                       |     1 |    30 |
                        |   1 |  BITMAP CONVERSION TO ROWIDS   |                       |     1 |    30 |
                        |   2 |   BITMAP OR                    |                       |       |       |
                        |*  3 |    BITMAP INDEX SINGLE VALUE   | STATE_BITMAP          |       |       |
                        |   4 |    BITMAP AND                  |                       |       |       |
                        |*  5 |     BITMAP INDEX SINGLE VALUE  | AGE_BITMAP            |       |       |
                        |*  6 |     BITMAP INDEX SINGLE VALUE  | MARITAL_STATUS_BITMAP |       |       |
                        |*  7 |     BITMAP INDEX SINGLE VALUE  | STATE_BITMAP          |       |       |
                        |   8 |     BITMAP OR                  |                       |       |       |
                        |   9 |      BITMAP AND                |                       |       |       |
                        |* 10 |       BITMAP INDEX SINGLE VALUE| SOFTDRINK_BITMAP      |       |       |
                        |* 11 |       BITMAP INDEX SINGLE VALUE| SUMMER_SPORT_BITMAP   |       |       |
                        |  12 |      BITMAP AND                |                       |       |       |
                        |* 13 |       BITMAP INDEX SINGLE VALUE| BEER_BITMAP           |       |       |
                        |* 14 |       BITMAP INDEX SINGLE VALUE| SUMMER_SPORT_BITMAP   |       |       |
                        
                        Predicate Information (identified by operation id):
                        ---------------------------------------------------
                        
                           3 - access("STATE"='CA')
                           5 - access("AGE"='young')
                           6 - access("MARITAL_STATUS"='divorced')
                           7 - access("STATE"='CO')
                          10 - access("SOFTDRINK"='pepsi')
                          11 - access("SUMMER_SPORT"='baseball')
                          13 - access("BEER"='coors')
                          14 - access("SUMMER_SPORT"='golf')
                        
                        SQL>
                        As you can see Oracle is combining bitmap indexes on columns in a single table to implement the same AND/OR complex conditions I showed earlier. It doesn't need any other table to do this.

                        In 11g you can create virtual columns and then index them.
                        so if you find that the condition 'young' and 'divorced' is used frequently you could create a VIRTUAL 'young_divorced' column and create an index.
                        alter table star_fact add (young_divorced AS (case 
                           when (age = 'young' and marital_status = 'divorced') then 'TRUE' else 'N/A' end) VIRTUAL);
                        
                        create bitmap index young_divorced_ndx on star_fact (young_divorced);
                        exec DBMS_STATS.GATHER_TABLE_STATS('SCOTT', 'STAR_FACT', NULL, CASCADE => TRUE);
                        Now you can query using the name of the virtual column
                        SQL> select rowid from star_fact where young_divorced = 'TRUE'
                          2  and  (state = 'CA') or (state = 'CO')
                          3  /
                        
                        no rows selected
                        
                        
                        Execution Plan
                        ----------------------------------------------------------
                        Plan hash value: 2656088680
                        | Id  | Operation                    | Name               | Rows  | Bytes | Cost
                        |   0 | SELECT STATEMENT             |                    |     1 |    28 |
                        |   1 |  BITMAP CONVERSION TO ROWIDS |                    |       |       |
                        |   2 |   BITMAP OR                  |                    |       |       |
                        |*  3 |    BITMAP INDEX SINGLE VALUE | STATE_BITMAP       |       |       |
                        |   4 |    BITMAP AND                |                    |       |       |
                        |*  5 |     BITMAP INDEX SINGLE VALUE| STATE_BITMAP       |       |       |
                        |*  6 |     BITMAP INDEX SINGLE VALUE| YOUNG_DIVORCED_NDX |       |       |
                        
                        Predicate Information (identified by operation id):
                        ---------------------------------------------------
                        
                           3 - access("STATE"='CO')
                           5 - access("STATE"='CA')
                           6 - access("YOUNG_DIVORCED"='TRUE')
                        
                        SQL>
                        ----------------------------------------------------------------
                        Notice that at line #6 the new index was used. The VIRTUAL column itselfl doesn't create data for the fact table; the definition only exists in the data dictionary.

                        The YOUNG_DIVORCE_NDX is real and does consume space. The tradeoff is additional space for the index but you make the query easier because you don't have to recreate the complex condition every time.

                        Oracle can work with the complex condition and combine the indexes so this really only helps the query writer. Your UI should be able to hide the query construction from the user so I would avoid the use of VIRTUAL columns and an additional index until you demonstrate you really need it.

                        If you provide users with their own RESULT table to store custom query results you could just store the query name and the set of primary keys from the result set. I used ROWIDs in the example but don't use rowid for a real application - use a primary key value that won't change.

                        So your UI would let users construct complext dimension queries for 'young_sportsters' and get a result set of primary keys for that. They could save the label 'young_sportsters' and the primary keys in their own work table. Then you can let them run queries that use the primary keys to query data from your active data warehouse to get any other data it contains.
                        >
                        Did you get the counts for each value of each dimension by doing a separate query with the current "WHERE" clause on each dimension?
                        >
                        For an Oracle implementation you need to do a count select for each dimension. I haven't tried it but you might be able to do multiple dimensions in a singe query. One query would look like this>
                        -- get the dimension counts
                        SQL> select beer, count(*) from star_fact group by beer;
                        
                        BEER                             COUNT(*)
                        ------------------------------ ----------
                        N/A                                 56977
                        
                        
                        Execution Plan
                        ----------------------------------------------------------
                        Plan hash value: 1692670403
                        | Id  | Operation                | Name        | Rows  | Bytes | Cost (%CPU)| Ti
                        |   0 | SELECT STATEMENT         |             |     1 |    12 |     3   (0)| 00
                        |   1 |  SORT GROUP BY NOSORT    |             |     1 |    12 |     3   (0)| 00
                        |   2 |   BITMAP CONVERSION COUNT|             | 56977 |   667K|     3   (0)| 00
                        |   3 |    BITMAP INDEX FULL SCAN| BEER_BITMAP |       |       |            |
                        
                        SQL>
                        Notice that Oracle uses only the index to gather the data.
                        • 9. Re: How to design a fact table to keep track of active dimensions?
                          922141
                          Thanks a lot for the detailed example!
                          I guess the somewhat sobering bottom line is that for figuring out which dimensions still have more than one value for the current selection (i.e. the current where clause), there is really not much else to do but to do a separate count(distinct dimensionkeycolumn) for each remaining candidate dimension.
                          I understand that in certain situations frequently used "where" clauses could be optimized, but in my case this is not foreseeable and the number of dimensions is rather high. So it really seems to boil down to testing the performance impact of doing all these additional queries versus the usability benefit for the end user.
                          • 10. Re: How to design a fact table to keep track of active dimensions?
                            rp0428
                            >
                            I guess the somewhat sobering bottom line is that for figuring out which dimensions still have more than one value for the current selection (i.e. the current where clause), there is really not much else to do but to do a separate count(distinct dimensionkeycolumn) for each remaining candidate dimension.
                            >
                            I'm not sure you're quite sober yet. You can't just use a 'remaining candidate dimension' for an 'AND' condition. You actually have to AND the dimension with the set of result bits you have so far.

                            With Oracle - you don't have a 'so far' - when you add one new dimension the entire query has to be rerun - meaning if you do the type of 'what if' in my example you either build ALL of the conditions up front using the selections made by the user and then execute the entire query.

                            Or you execute a new query every time the user adds a new dimension to the rule set; a build and query on the fly. I was able to do that with my Java app because I wrote the bitmap indexes and they were memory-resident; I could AND and OR them together very quickly and keep the interim bit result set in memory also. So adding a new dimension I could just AND the current set of result bits with a new selection and either keep it or throws it away and do it again with another dimension.

                            That's why I said my case was different than yours. I could simply store the bitset for a complicated set of conditions the user put together by giving it a new and writing it to a file. That, in essence, became a new bitmap index for that user that I could use again later as a starting point without having to actually rerun all of the bit operations that created it to begin with.

                            The Oracle equivalent of that is actually creating a new bitmap index like I showed you. Or create your own bitmap using a long RAW where bit 3 being on means record 3 has the bit set and so on. That is really what my Java bitmaps were: just a million bit binary string.

                            The UTL_RAW package has functions to AND and OR data of RAW data type so a hybrid is certainly possible.
                            Create a conversion table of ROWID values to ROWNUM (1 to n). Do an normal oracle query to get a result set of ROWIDs, convert the ROWID values to a number from 1 to n and then construct a RAW that has the correct bits set. This RAW could be stored in a table and given a name 'coke drinkers that live in florida and plaly golf'.

                            Now to test a new dimension you have to convert the bits for dimension attribute value you are using to a RAW and then AND it with your saved RAW in just one operation.

                            If you were going to use this approach you might as well just do all of the bitmapping yourself and preconvert the rowid values to record number values and create your own bitmapped dimension attribute raw values.

                            Then all you need is a simple expression parser that understands '(, ), &, ^, !' and do all of the AND and OR using precomputed, static RAW data.

                            Java versions now have a BitSet class that you could use to do this in a Java app: that class didn't exist when I wrote my app.

                            See the BitSet class in the Java Docs
                            http://docs.oracle.com/javase/1.5.0/docs/api/java/util/BitSet.html

                            The best you can do easily is to have pre-computed and stored the distinct counts for each attribute value for each dimension. This will give you some idea of the cardinality that might result from using the dimension.

                            If NativeLanguage is an attribute the knowing that there are only 3 people out of 100 million that speak Swahili will visually tell anyone that using that in an AND condition will be pretty selective.

                            That is why it is most useful to use the low cardinality dimension attribute values as early in the decision process as possible because they quickly narrow down the set of possibilities.

                            The basic OR operation using any dimension wlll always result in having ALL of that dimensions records represented in the result set. It is the AND operations that you can't predict.

                            Good luck with project
                            • 11. Re: How to design a fact table to keep track of active dimensions?
                              jgarry
                              To put it in a nutshell: Generic solutions don't scale well. A million row in memory bitmap is quantitatively several orders of magnitude different than a billion row bitmap, which makes it qualitatively different.

                              To put it in a different nutshell: Never optimize code prematurely. Dimensionalization can be thought of as a physical denormalization for performance. If you try to generalize denormalization, most solutions will not be optimized for performance. Look up Anti-pattern on wikipedia.
                              • 12. Re: How to design a fact table to keep track of active dimensions?
                                rp0428
                                Absolutely. That's why I tried to keep emphasizing that what I did was a once time implementation for demo purposes to show what could be done on an IPAQ when no one thought anything could be done. It used a read-only dataset that would only get refreshed once a week or month. Worked great for what it was designed for but that approach would never scale or be appropriate for any other use.