3 Replies Latest reply: Jan 20, 2013 3:55 AM by 973995 RSS

    index selectivity

    793965
      Hi gems..good afternoon..

      I was reading the chapter Indexes in the book of Tom Kyte's Oracle Expert Database Architecture. There I clearly understood the fact that we should use B Tree index when we want to access very small percentage of rows from a table through the index or if we want to access large percentage of rows which are retrievable from the index entirely.

      Now a situation came in my mind (may be its a very silly question but I am not getting it, please help me)...

      Suppose in a table of eighty thousand rows there are several columns and one column has total five distinct values (which is very less and we should not use index). But on the other hand, among those five distinct values, three of them appeared for very less number of times (in 80000 rows, they appear only for 10 to 20 times) and rest two distinct values appeared a lot of times.

      Now in this case, an index on that column will be beneficial if we access those three values which appeared less number of times and the same index will be disastrous for the other two values.

      So, what should we do in this case...please help.

      Thanks in advance.
        • 1. Re: index selectivity
          Harmandeep Singh
          The solution for you problem is as below
          1. Create histogram on this highly skewed column
          2. To use the index use the literal values when the column is referenced in WHERE clause

          select * from tab1 where c1=4;

          HTH

          Thanks,
          Harman
          • 2. Re: index selectivity
            Jonathan Lewis
            gogol wrote:

            Now a situation came in my mind (may be its a very silly question but I am not getting it, please help me)...
            Congratulations for reading something and taking the time to come up with a thought experiment to question what you've read. In fact your scenario is quite common and not a silly question at all.

            >
            Suppose in a table of eighty thousand rows there are several columns and one column has total five distinct values (which is very less and we should not use index). But on the other hand, among those five distinct values, three of them appeared for very less number of times (in 80000 rows, they appear only for 10 to 20 times) and rest two distinct values appeared a lot of times.

            Now in this case, an index on that column will be beneficial if we access those three values which appeared less number of times and the same index will be disastrous for the other two values.
            80,000 isn't a particularly large table - the last time I saw this the table held several million rows with 2 common values and 4 or 5 values which tended to hold 10 to 100 rows each, and only temporarily. (See http://jonathanlewis.wordpress.com/2009/04/23/histogram-change/ )

            The histogram with literals in the SQL is one (perfectly valid) approach - but if you do that you have to be very careful with the histogram in case Oracle manages to sample a histogram that only sees the two popular values. The solution to that would be to ensure that you could create a representative histogram using some pl/sql ( See http://jonathanlewis.wordpress.com/2009/05/28/frequency-histograms/ )

            (Updated - I hit "post" when I meant to hit "Preview")

            An alternative that is much nicer if you can modify the application code is to create a function-based index (or a virtual column with an index in 11g) to hide the popular values, e.g.:
            create index t1_f1 on t1(
               (case flag_col
               when 'X' then null
               when 'Y' then null
                            else flag_col
               end
               )
            )
            (I may have got the syntax a bit wrong - I'm not near a database at present.)
            This index will be tiny as it holds only the rare values that you'd typically be interested in - which also means it reduces the index maintenance costs, of course.

            The side effect is that your client code has to know that queries for X or Y have to use predicates of the form flag_col = 'Y', while queries for all other values should use "case ... end = value".

            Regards
            Jonathan Lewis
            • 3. Re: index selectivity
              973995
              >
              Now a situation came in my mind (may be its a very silly question but I am not getting it, please help me)...
              >

              Perfect question. I did learn new things owing to you. Thanks to you.