Here are 3 occurrences of the same SELECT DISTINCT, with durations that range from 15 seconds to 966 seconds!

 

SQL_ID

Cost

Duration (s)

End Time

SQL TEXT

1hdzqy0f9ukcc

2046761

21

24-06-15 02H

SELECT DISTINCT IRI, CT FROM BIGTABLETHATALWAYSGROWS WHERE INI IS NULL AND CT = 012

35vbdxu0xucjh

2046761

15

24-06-15 02H

SELECT DISTINCT IRI, CT FROM BIGTABLETHATALWAYSGROWS WHERE INI IS NULL AND CT = 034

aa316uybjk84s

2046761

966,14

24-06-15 03H

SELECT DISTINCT IRI, CT FROM BIGTABLETHATALWAYSGROWS WHERE INI IS NULL AND CT = 027

 

This SELECT DISTINCT selects the records to be modified by a nightly batch job, and runs for too long.  This partitioned table has 357M rows and weighs over 50Go.  I am tasked to optimize this SELECT DISTINCT.

 

First and foremost, a detail in the SQL code screams for attention – did you spot it?  The Explain Plan mentions it:

SQL> explain plan for SELECT DISTINCT IRI, CT FROM BIGTABLETHATALWAYSGROWS WHERE INI IS NULL AND CT = 040;

Explained.

SQL> SELECT PLAN_TABLE_OUTPUT FROM TABLE(DBMS_XPLAN.DISPLAY());

PLAN_TABLE_OUTPUT

Plan hash value: 2894093667

--------------------------------------------------------------------------------| Id  | Operation           | Name           | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |

------------------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT | |     1 |    20 |  2036K  (1)| 06:47:17 |       | |

|   1 | PARTITION LIST ALL | |     1 |    20 |  2036K  (1)| 06:47:17 |     1 |    35 |

|   2 | HASH UNIQUE | |     1 |    20 |  2036K  (1)| 06:47:17 | |       |

|*  3 |    TABLE ACCESS FULL| BIGTABLETHATALWAYSGROWS |     1 |    20 |  2036K  (1)| 06:47:17 |     1 |    35 |

--------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

3 - filter("INI" IS NULL AND TO_NUMBER("CT")=040)

 

Yes, an implicit conversion (040 being converted to a number, because of the lack of single quotes around it) will prevent the use of any index (except for the appropriate function-based index, which, of course, has not been created) because the CT column is VARCHAR.

 

So an extremely simple, yet highly effective, tweak is to add single quotes:

EXPLAIN PLAN FOR

SELECT DISTINCT IRI, CT FROM BIGTABLETHATALWAYSGROWS WHERE INI IS NULL AND CT = '040' ;

---------------------------------------------------------------------------------------------------------

| Id  | Operation              | Name           | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |                                                                                                                                                                                                  

---------------------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT       |                |   711 | 14220 | 42066   (1)| 00:08:25 |       | |

|   1 | HASH UNIQUE           |                |   711 | 14220 | 42066   (1)| 00:08:25 |       | |

|   2 |   PARTITION LIST SINGLE|                |  8496 | 165K| 42065   (1)| 00:08:25 |   KEY | KEY |

|*  3 |    TABLE ACCESS FULL   | BIGTABLETHATALWAYSGROWS |  8496 | 165K| 42065   (1)| 00:08:25 |    16 | 16 |

---------------------------------------------------------------------------------------------------------                                                                                                                                                                                                  

Predicate Information (identified by operation id):

---------------------------------------------------

   3 - filter("INI" IS NULL)

 

From a cost of over 2M, we’re down to 42K.  Good, but not good enough for that SELECT that will return only a fraction of that table’s records.  A Full Table Scan is probably overkill.  Let’s see which index will give us the best performance.

 

There are a few indexes on that big table.  Let’s tell the optimizer to use them with a hint:

EXPLAIN PLAN FOR

SELECT  /*+ INDEX */  DISTINCT…

No can do:

0 | SELECT STATEMENT       |                |     1 | 20 | 41718   (1)| 00:08:21 |       | |

|   1 |  HASH UNIQUE           |                |     1 | 20 | 41718   (1)| 00:08:21 |       | |

|   2 |   PARTITION LIST SINGLE|                |     5 | 100 | 41717   (1)| 00:08:21 |   KEY | KEY |                                                                                                                                                                                                  

|*  3 |    TABLE ACCESS FULL   | BIGTABLETHATALWAYSGROWS |     5 |   100 | 41717 (1)| 00:08:21 |    16 | 16

 

The CBO considers the full table scan more effective than using any of the existing indexes. I tested each of them and each time EXPLAIN PLAN gave me a cost higher than that of the full table scan.

 

So let’s try of few indexes of our own.

 

CREATE INDEX fd4_INI_IRI_CT ON BIGTABLETHATALWAYSGROWS (INI,IRI,CT) LOCAL;

CREATE INDEX fd3_IRI_INI_CT ON BIGTABLETHATALWAYSGROWS (IRI,INI,CT) LOCAL;

CREATE INDEX fd2_INI_CT ON BIGTABLETHATALWAYSGROWS (INI,CT) LOCAL;

CREATE INDEX fd1_CT_INI ON BIGTABLETHATALWAYSGROWS (CT,INI) LOCAL;

 

With index #1 (index on all the columns mentioned in the WHERE clause):

----------------------------------------------------------------------------------------------------------------------

| Id  | Operation                           | Name           | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |

----------------------------------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT                    |                |     1 | 20 |     3  (34)| 00:00:01 |       | |

|   1 |  HASH UNIQUE                        |                |     1 | 20 |     3  (34)| 00:00:01 |       | |

|   2 |   PARTITION LIST SINGLE             |                |     5 | 100 |     2   (0)| 00:00:01 |   KEY | KEY |                                                                                                                                                                                     

|   3 |    TABLE ACCESS BY LOCAL INDEX ROWID| BIGTABLETHATALWAYSGROWS |     5 |   100 | 2   (0)| 00:00:01 |    16 | 16 |

|*  4 | INDEX RANGE SCAN                | FD1_CT_INI     | 1 |       |     1 (0)| 00:00:01 |    16 |    16 |

Index Range Scan, good, but the table still has to be accessed because a column of the SELECT is not in that index.  Wow! Cost of 3.  Can’t get it any better, right?  Let’s find out. 

 

With index #2 (index on all the columns mentioned in the WHERE clause, but the order of columns is reversed):

----------------------------------------------------------------------------------------------------------------------

| Id  | Operation                           | Name           | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |                                                                                                                                                                                     

----------------------------------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT                    |                |     1 | 20 |     3 (34)| 00:00:01 |       |       |

|   1 |  HASH UNIQUE                        |                |     1 | 20 |     3  (34)| 00:00:01 |       | |

|   2 |   PARTITION LIST SINGLE             |                |     5 | 100 |     2   (0)| 00:00:01 |   KEY | KEY |

|   3 |    TABLE ACCESS BY LOCAL INDEX ROWID| BIGTABLETHATALWAYSGROWS |     5 |   100 | 2   (0)| 00:00:01 |    16 | 16 |

|*  4 |     INDEX RANGE SCAN                | FD2_INI_CT     | 1 |       |     1 (0)| 00:00:01 |    16 |    16 |

 

No difference, so no better.

 

With index # 3 (index on all the columns mentioned in that SELECT, the least selective last):

---------------------------------------------------------------------------------------------------------                                                                                                                                                                                                  

| Id  | Operation              | Name           | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |                                                                                                                                                                                                  

---------------------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT       |                |     1 | 20 |    27   (4)| 00:00:01 |       | |

|   1 |  SORT UNIQUE NOSORT    |                |     1 | 20 |    27   (4)| 00:00:01 |       | |

|   2 |   PARTITION LIST SINGLE|                |     5 | 100 |    26   (0)| 00:00:01 |   KEY | KEY |

|*  3 |    INDEX SKIP SCAN     | FD3_IRI_INI_CT |     5 | 100 |    26   (0)| 00:00:01 |    16 | 16 |

Interesting, because since all the columns mentioned in our SELECT are in that index, no need to access the table.  Still the cost is higher than our previous attempts, which did include a table access by rowid.

So since that INDEX SKIP SCAN seems too expensive, let’s shuffle the order of columns in our index …

 

With index # 4 (index on all the columns mentioned in that SELECT, the least selective last, and the other 2 columns swapped):

---------------------------------------------------------------------------------------------------------

| Id  | Operation              | Name           | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |

---------------------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT       |                |     1 | 20 |     3  (34)| 00:00:01 |       | |

|   1 |  HASH UNIQUE           |                |     1 | 20 |     3  (34)| 00:00:01 |       | |

|   2 |   PARTITION LIST SINGLE|                |     5 | 100 |     2   (0)| 00:00:01 |   KEY | KEY |                                                                                                                                                                                                  

|*  3 |    INDEX RANGE SCAN    | FD4_INI_IRI_CT |     5 | 100 |     2   (0)| 00:00:01 |    16 | 16 |

---------------------------------------------------------------------------------------------------------  

Hmmh, cost of 3, again.  And with no table access.

 

So, we have 3 possibilities with a CBO cost of 3 (indexes fd1, fd2, and fd4).  Which one really is the best?  I’d rather go for the one with the least steps (there are 5 steps in plans 1 and 2, 4 steps in plan 4).  If we ask the CBO to decide, which will it be?

 

EXPLAIN PLAN FOR

SELECT  DISTINCT IRI…

 

----------------------------------------------------------------------------------------------------------------------

| Id  | Operation                           | Name           | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |

----------------------------------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT                    |                |     1 | 20 |     3  (34)| 00:00:01 |       | |

|   1 |  HASH UNIQUE                        |                |     1 | 20 |     3  (34)| 00:00:01 |       | |

|   2 |   PARTITION LIST SINGLE             |                |     5 | 100 |     2   (0)| 00:00:01 |   KEY | KEY |                                                                                                                                                                                     

|   3 |    TABLE ACCESS BY LOCAL INDEX ROWID| BIGTABLETHATALWAYSGROWS |     5 |   100 | 2   (0)| 00:00:01 |    16 | 16 |

|*  4 | INDEX RANGE SCAN                | FD1_CT_INI     | 1 |       |     1 (0)| 00:00:01 |    16 |    16 |

 

Index #1 seems to be the CBO’s favorite.   Let’s double-check with different values for our bind variable (big partitions, small partitions), and yes, the CBO constantly chooses index #1.


To keep in mind:

  • let's be wary of literals and implicit conversions
  • partitioned (LOCAL) indexes on partitioned tables are a great way to access data when a sufficiently small subset of data is requested.