index recommendation

kaericn Member Posts: 305 Blue Ribbon
edited Jan 16, 2020 1:34PM in SQL & PL/SQL

Dear community,

I have the below table def and query.

The volume is 200 million plus rows.

What index would be best suited to speed up the query?

This table has a mixed workload, both transactional (insert and update) and read.

Index idea:

CREATE index tbl_1 on tbl(cnt,r_creation_dt,c_id,a_action,last_modified_by);
CREATE index tbl_2 on tbl(cnt,c_status,r_creation_dt);
CREATE bitmap index tbl_3 on tbl(c_status);

tbl definition below:

R_ID              NOT NULL VARCHAR2(30)
C_ID                       NUMBER
N_ID                       VARCHAR2(40)
CREATED_BY        NOT NULL VARCHAR2(30)
LAST_MODIFIED_BY           VARCHAR2(30)
C_STATUS                   CHAR(1)
A_ACTION                   CHAR(1)
R_CREATION_DT              TIMESTAMP(6)
CNT                        NUMBER(38)

SQL:

SELECT
    c_id,
    a_action,
    cnt,
    last_modified_by
FROM
    tbl
WHERE
    c_status IN ('N', 'F')
    AND cnt <= 5
GROUP BY
    cnt,
    r_creation_dt,
    c_id,
    a_action,
    last_modified_by
ORDER BY
    r_creation_dt
FETCH FIRST 1000 ROWS ONLY;

Answers

  • AndrewSayer
    AndrewSayer Member Posts: 12,998 Gold Crown
    edited Jan 15, 2020 6:21PM

    You almost definitely don't want to go anywhere near a bitmap index if your table needs to be updatable by multiple sessions.

    Tell us about the selectivity of these filters: how many rows have the c_status values you care about? How many rows have a cnt value <= 5? How many rows fit both those filters?

    How often will the query run? Does it need to be completely accurate or can it give slightly old data?
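
    For example, counting queries along these lines would answer those questions (a minimal sketch, assuming the table and column names from your post):

    SELECT c_status, COUNT(*) FROM tbl GROUP BY c_status;

    SELECT COUNT(*) FROM tbl WHERE cnt <= 5;

    SELECT COUNT(*) FROM tbl WHERE c_status IN ('N', 'F') AND cnt <= 5;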

  • Solomon Yakobson
    Solomon Yakobson Member Posts: 18,910 Red Diamond
    edited Jan 15, 2020 7:18PM

    It is hard to tell without knowing your data and processes. An index can help your query but slow down insert/update/delete if there is heavy OLTP activity. Anyway, I'd try:

    SQL> create index tbl_idx1
      2    on tbl(
      3           cnt,
      4           c_status,
      5           r_creation_dt,
      6           c_id,
      7           a_action,
      8           last_modified_by
      9          )
     10  /

    Index created.

    SQL> explain plan for
      2  SELECT
      3      c_id,
      4      a_action,
      5      cnt,
      6      last_modified_by
      7  FROM
      8      tbl
      9  WHERE
     10      c_status IN(
     11          'N',
     12          'F'
     13      )
     14      AND cnt<=5
     15  GROUP BY
     16      cnt,
     17      r_creation_dt,
     18      c_id,
     19      a_action,
     20      last_modified_by
     21  ORDER BY
     22      r_creation_dt
     23  FETCH FIRST 1000 ROWS ONLY
     24  /

    Explained.

    SQL> select * from table(dbms_xplan.display)
      2  /

    PLAN_TABLE_OUTPUT
    -----------------------------------------------------------------------------------
    Plan hash value: 2572020106

    -----------------------------------------------------------------------------------
    | Id  | Operation              | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
    -----------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT       |          |  1000 | 72000 |     1 (100)| 00:00:01 |
    |*  1 |  VIEW                  |          |  1000 | 72000 |     1 (100)| 00:00:01 |
    |*  2 |   WINDOW NOSORT STOPKEY|          |     1 |    62 |     1 (100)| 00:00:01 |
    |   3 |    SORT GROUP BY       |          |     1 |    62 |     1 (100)| 00:00:01 |
    |*  4 |     INDEX RANGE SCAN   | TBL_IDX1 |     1 |    62 |     0   (0)| 00:00:01 |
    -----------------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------

       1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=1000)
       2 - filter(ROW_NUMBER() OVER ( ORDER BY "R_CREATION_DT")<=1000)
       4 - access("CNT"<=5)
           filter("C_STATUS"='F' OR "C_STATUS"='N')

    Note
    -----
       - dynamic statistics used: dynamic sampling (level=2)

    23 rows selected.

    SQL>

    Or:

    SQL> create index tbl_idx1
      2    on tbl(
      3           c_status,
      4           cnt,
      5           r_creation_dt,
      6           c_id,
      7           a_action,
      8           last_modified_by
      9          )
     10  /

    Index created.

    SQL> explain plan for
      2  SELECT
      3      c_id,
      4      a_action,
      5      cnt,
      6      last_modified_by
      7  FROM
      8      tbl
      9  WHERE
     10      c_status IN(
     11          'N',
     12          'F'
     13      )
     14      AND cnt<=5
     15  GROUP BY
     16      cnt,
     17      r_creation_dt,
     18      c_id,
     19      a_action,
     20      last_modified_by
     21  ORDER BY
     22      r_creation_dt
     23  FETCH FIRST 1000 ROWS ONLY
     24  /

    Explained.

    SQL> select * from table(dbms_xplan.display)
      2  /

    PLAN_TABLE_OUTPUT
    -----------------------------------------------------------------------------------
    Plan hash value: 2189129534

    -----------------------------------------------------------------------------------
    | Id  | Operation              | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
    -----------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT       |          |  1000 | 72000 |     2  (50)| 00:00:01 |
    |*  1 |  VIEW                  |          |  1000 | 72000 |     2  (50)| 00:00:01 |
    |*  2 |   WINDOW NOSORT STOPKEY|          |     1 |    62 |     2  (50)| 00:00:01 |
    |   3 |    SORT GROUP BY       |          |     1 |    62 |     2  (50)| 00:00:01 |
    |   4 |     INLIST ITERATOR    |          |       |       |            |          |
    |*  5 |      INDEX RANGE SCAN  | TBL_IDX1 |     1 |    62 |     1   (0)| 00:00:01 |
    -----------------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------

       1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=1000)
       2 - filter(ROW_NUMBER() OVER ( ORDER BY "R_CREATION_DT")<=1000)
       5 - access(("C_STATUS"='F' OR "C_STATUS"='N') AND "CNT"<=5)

    Note
    -----
       - dynamic statistics used: dynamic sampling (level=2)

    23 rows selected.

    SQL>

    SY.

  • John_K
    John_K Member Posts: 2,498 Gold Trophy
    edited Jan 16, 2020 6:25AM

    Why not just create the indexes and look at the execution plan (and real-time performance) that the different options produce?
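
    For example (a sketch, assuming you can run the query interactively), the gather_plan_statistics hint plus dbms_xplan.display_cursor shows the actual row counts and work done at each plan step:

    SELECT /*+ gather_plan_statistics */
           c_id, a_action, cnt, last_modified_by
    FROM   tbl
    WHERE  c_status IN ('N', 'F')
    AND    cnt <= 5
    GROUP BY cnt, r_creation_dt, c_id, a_action, last_modified_by
    ORDER BY r_creation_dt
    FETCH FIRST 1000 ROWS ONLY;

    SELECT * FROM table(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));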

  • BEDE
    BEDE Oracle Developer Member Posts: 2,302 Gold Trophy
    edited Jan 16, 2020 6:57AM

    I think r_creation_dt should be the first column in the index, considering that you will need the first 1000 rows or, possibly, the rows in a given date interval.

    Is there any possibility that one or more of those columns could be null?

    If data is inserted into that table on a daily basis, and r_creation_dt is not null and is meant to be the timestamp when the row was inserted, maybe it would be good to range-partition that table on r_creation_dt with interval partitioning, as sketched below.
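
    For illustration, a minimal sketch of that layout (the boundary date and the daily interval are assumptions; the column list is copied from the original post):

    CREATE TABLE tbl (
        r_id             VARCHAR2(30) NOT NULL,
        c_id             NUMBER,
        n_id             VARCHAR2(40),
        created_by       VARCHAR2(30) NOT NULL,
        last_modified_by VARCHAR2(30),
        c_status         CHAR(1),
        a_action         CHAR(1),
        r_creation_dt    TIMESTAMP(6),
        cnt              NUMBER(38)
    )
    PARTITION BY RANGE (r_creation_dt)
    INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
    (
        PARTITION p_start VALUES LESS THAN (TIMESTAMP '2020-01-01 00:00:00')
    );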

  • AndrewSayer
    AndrewSayer Member Posts: 12,998 Gold Crown
    edited Jan 16, 2020 7:29AM
    BEDE wrote:
    I think r_creation_dt should be the first column in the index, considering that you will need the first 1000 rows or, possibly, the rows in a given date interval. Is there any possibility that one or more of those columns could be null? If data is inserted into that table on a daily basis, and r_creation_dt is not null and is meant to be the timestamp when the row was inserted, maybe it would be good to range-partition that table on r_creation_dt with interval partitioning.

    All rows within a group need to be identified before the rows can be returned; there's no legal execution plan that will use an index on r_creation_dt to avoid that sort. The same applies to partitioning.

  • Solomon Yakobson
    Solomon Yakobson Member Posts: 18,910 Red Diamond
    edited Jan 16, 2020 7:32AM
    BEDE wrote:
    I think r_creation_dt should be the first column in the index, considering that you will need the first 1000 rows or, possibly, the rows in a given date interval.

    There is no filter/access condition on r_creation_dt, so the best you'd get is an index full scan, versus an index range scan when the index's first column is CNT or C_STATUS.

    SY.

  • Joerg.Sobottka
    Joerg.Sobottka Senior Consultant and Oracle Ace Member Posts: 596 Bronze Trophy
    edited Jan 16, 2020 8:11AM

    The goal of every execution plan is to do as few block gets (from memory and disk) as possible. If this is the only select on your table, you can go forward with the following two solutions; imho both should work (but the optimizer isn't always predictable):

    One is to put all columns into the index, with the column with the highest selectivity in first place (cnt or c_status); see Solomon's post => you get an index scan without any table access.

    The second is to put only cnt, c_status and r_id into the index, again with the column with the highest selectivity first. The optimizer then should/will choose an index scan with an additional table access afterwards.

    What's faster or better? That depends: if you need the output of the query to select e.g. details in a next step, it could be helpful that the table data is already loaded into the buffer cache.

    Best is, you just try it; both variants are sketched below.
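
    A minimal sketch of the two variants (the index names are invented for illustration):

    -- Option 1: covering index, most selective column first;
    -- the query can be answered from the index alone
    CREATE INDEX tbl_cover_ix
        ON tbl (c_status, cnt, r_creation_dt, c_id, a_action, last_modified_by);

    -- Option 2: slim index; the remaining columns are fetched
    -- from the table by rowid afterwards
    CREATE INDEX tbl_slim_ix ON tbl (c_status, cnt, r_id);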

    By the way - I have never seen a remarkable performance downgrade on DML due to one or three indexes created on a table (except with function-based indexes based on your own application functions).

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,786 Gold Crown
    edited Jan 16, 2020 10:38AM

    Andrew's first response is the most important one. You (we) need to know more about the numbers than you've told us. (And you almost certainly need to get rid of the bitmap index to minimise time lost to locking problems.)

    I'm going to guess that you've read something that says "a column with only a few distinct values should have a bitmap index", leading you to create a bitmap index on c_status. The directive is very badly stated - if every value in the table appears in a very large number of rows, a bitmap index is not going to make a magical difference to the performance, as most of the work will be about visiting a large number of rows in the table. The real benefit of bitmap indexes comes from the optimizer's ability to combine many of them using bit-oriented operations, as illustrated below.
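
    Purely to illustrate that combining ability (not a recommendation for this table; the second bitmap index and the query are invented for the example):

    CREATE BITMAP INDEX tbl_b1 ON tbl (c_status);
    CREATE BITMAP INDEX tbl_b2 ON tbl (a_action);

    -- A query filtering on both columns can merge the two indexes with a
    -- bit-oriented BITMAP AND before any table rows are visited, i.e. a
    -- plan shaped roughly like:
    --   BITMAP CONVERSION TO ROWIDS
    --     BITMAP AND
    --       BITMAP INDEX SINGLE VALUE  (TBL_B1)
    --       BITMAP INDEX SINGLE VALUE  (TBL_B2)
    SELECT COUNT(*) FROM tbl WHERE c_status = 'N' AND a_action = 'Y';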

    So - forget the bitmap index, but ask how many rows there are for each distinct value. Is this an example of a common pattern where one (or maybe two) values cover almost all the data and all the other values correspond to a tiny fraction of the data? In particular, do 'N' and 'F' cover almost all the 200M rows, or do they account for only a few thousand?

    What about cnt? Is it always positive, is it always an integer, how many distinct values appear, and is the number of rows where cnt <= 5 tiny or huge?

    Then you need to think of the combinations - maybe there are lots of N/F rows, but maybe almost all N/F rows have cnt > 5. The only predicates you've got to minimise the visits to the table are on c_status and cnt, so you need to know whether or not you can get the performance you want by getting EXACTLY the right set of rows without getting any other rows from the table. If so, then a really appropriate index on the two columns would be a starting point. If the query would be too slow even with a high-precision index, you would then need to think about creating the most efficient index that would include all the information you needed.

    For example - assume that the number of rows matching your filter predicates is small (and you don't have lots of variations in these predicates), then you could do the following:

    alter table tbl add(
            nf_r_creation_dt generated always as (
                    case
                            when c_status in ('N','F') and cnt <= 5
                            then r_creation_dt
                    end
            ) virtual
    );

    create index tbl_f1 on tbl(nf_r_creation_dt);

    This will give you an index which is as small as it can be and will identify exactly the rows you want to select from the table and no others. Then (after gathering stats on the virtual column), if you add the predicate "and nf_r_creation_dt is not null" to your query, the optimizer will be able to do a full scan of this index and access the data by rowid from the table.
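
    A sketch of those two steps (the method_opt call shown is one common way to gather stats, including a histogram, on the new virtual column):

    exec dbms_stats.gather_table_stats(user, 'TBL', method_opt => 'for columns nf_r_creation_dt size 254')

    SELECT c_id, a_action, cnt, last_modified_by
    FROM   tbl
    WHERE  c_status IN ('N', 'F')
    AND    cnt <= 5
    AND    nf_r_creation_dt IS NOT NULL
    GROUP BY cnt, r_creation_dt, c_id, a_action, last_modified_by
    ORDER BY r_creation_dt
    FETCH FIRST 1000 ROWS ONLY;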

    Regards

    Jonathan Lewis

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,786 Gold Crown
    edited Jan 16, 2020 1:34PM
    AndrewSayer wrote:
    All rows within a group need to be identified before the rows can be returned; there's no legal execution plan that will use an index on r_creation_dt to avoid that sort. The same applies to partitioning.

    Technically the following index would (starting with the relevant column, and covering all the required data) - see plan below - it does "sort" but it's a "nosort sort" which stops after the minimum amount of data:

    create index tbl_i1 on tbl(
            r_creation_dt,
            cnt,
            c_id,
            a_action,
            last_modified_by,
            c_status
    );

    ----------------------------------------------------------------------------------------------------
    | Id  | Operation              | Name   | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
    ----------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT       |        |      1 |        |   1000 |00:00:00.01 |     109 |     27 |
    |*  1 |  VIEW                  |        |      1 |   1000 |   1000 |00:00:00.01 |     109 |     27 |
    |*  2 |   WINDOW NOSORT STOPKEY|        |      1 |   1000 |   1000 |00:00:00.01 |     109 |     27 |
    |   3 |    SORT GROUP BY NOSORT|        |      1 |   1000 |   1000 |00:00:00.01 |     109 |     27 |
    |*  4 |     INDEX FULL SCAN    | TBL_I1 |      1 |   1000 |   1002 |00:00:00.01 |     109 |     27 |
    ----------------------------------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------

       1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=1000)
       2 - filter(ROW_NUMBER() OVER ( ORDER BY "R_CREATION_DT")<=1000)
       4 - access("CNT"<=5)
           filter(("CNT"<=5 AND INTERNAL_FUNCTION("C_STATUS") AND "R_CREATION_DT" IS NOT NULL))

    The flaw with this is, as SY said, that you start doing an index full scan, so if you have to go a long way through the index before you find any relevant data it's very expensive. Of course it would be possible to combine this "full index scan" strategy with the function-based / virtual column index, creating 6 virtual columns based on the supplied predicate in a way that produces index entries ONLY for the data that is needed. Smallest index, with guaranteed minimum work to get the first 1,000 aggregates.
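
    A sketch of that combined idea, re-using nf_r_creation_dt from the earlier post and adding five more virtual columns (the names are invented; each column is non-null only for rows matching the predicates, so the index holds entries only for those rows):

    alter table tbl add(
            nf_cnt              generated always as (case when c_status in ('N','F') and cnt <= 5 then cnt              end) virtual,
            nf_c_id             generated always as (case when c_status in ('N','F') and cnt <= 5 then c_id             end) virtual,
            nf_a_action         generated always as (case when c_status in ('N','F') and cnt <= 5 then a_action         end) virtual,
            nf_last_modified_by generated always as (case when c_status in ('N','F') and cnt <= 5 then last_modified_by end) virtual,
            nf_c_status         generated always as (case when c_status in ('N','F') and cnt <= 5 then c_status         end) virtual
    );

    create index tbl_f6 on tbl(nf_r_creation_dt, nf_cnt, nf_c_id, nf_a_action, nf_last_modified_by, nf_c_status);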

    Regards

    Jonathan Lewis
