SQL & PL/SQL

How to take advantage of data already being "grouped" in rows

mathguy, Jun 12 2017 (edited Jun 16 2017)

Suppose we have data as in the table TEST_DATA below (ID is primary key, all columns are NUMBER data type, blank spaces signify NULL):

        ID       COL1       COL2       COL3       COL4       COL5
---------- ---------- ---------- ---------- ---------- ----------
         1        100        200        300        400        500
         2        300        100        100        300        300
         3                   300        100                   400
         4                                                      
         5                   300                              300

And we want to write a query that will return the ID and the highest NON-NULL value in each row (return NULL only if all column values, COL1 through COL5, are NULL, as for ID = 4).

Factored subquery to replicate the table is provided below, and statements to create a much bigger table, for testing, are provided further down in the message.

One may ask why the data is in that structure to begin with. (It would perhaps be better if the table had just three columns: ID, COL_NAME, and VALUE.) For the sake of this exercise, let's accept that the base table simply has this structure.
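For contrast, if the data were stored in that normalized three-column shape, the problem would reduce to a plain GROUP BY. A minimal sketch, assuming a hypothetical table TEST_DATA_NORM (the name and layout are illustrative, not part of the actual schema):

```sql
-- Hypothetical normalized layout:
--   create table test_data_norm ( id number, col_name varchar2(10), value number );
-- One row per (id, column); the "grouping" lives in the ID column instead of the row.
select   id, max(value) as high_val   -- MAX ignores NULLs by design
from     test_data_norm
group by id
order by id;
```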

If no NULLs were present, we could simply use GREATEST(COL1, ... , COL5). Alas, GREATEST and LEAST return NULL if at least one argument is NULL, and that's not what our problem requires.
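To see the difference concretely, here is a throwaway example against DUAL (the values are made up): GREATEST is poisoned by a single NULL argument, while the aggregate MAX simply skips NULLs.

```sql
select greatest(100, 300, null) as g,   -- NULL: one NULL argument makes GREATEST return NULL
       max(v)                  as m    -- 300: the aggregate ignores the NULL row
from   ( select 100  as v from dual union all
         select 300       from dual union all
         select null      from dual );
```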

MIN and MAX work the way we want (they ignore NULLs), but they only work on columns, not on rows. It is easy to write a query that unpivots the base table, then groups by ID and takes the MAX. But this has two disadvantages.

First, in the original arrangement, each set of five values was in effect already "grouped" by ID (by the fact that the values were all in one row). This is something we could use to our advantage - for example, if a function like GREATEST IGNORE NULLS (COL1, ... , COL5) existed. Unfortunately, it doesn't.

Second, as Solomon Yakobson demonstrated recently in another thread, UNPIVOT unfortunately does a UNION ALL behind the scenes - in this case, the table is read five times. (So UNPIVOT does the same work as "manual unpivoting", which is just SELECT ID, COL1 FROM TEST_DATA UNION ALL SELECT ID, COL2 FROM TEST_DATA UNION ALL ... etc.)

with
     test_data ( id, col1, col2, col3, col4, col5 ) as (
       select 1, 100 , 200 , 300 , 400 , 500  from dual union all
       select 2, 300 , 100 , 100 , 300 , 300  from dual union all
       select 3, null, 300 , 100 , null, 400  from dual union all
       select 4, null, null, null, null, null from dual union all
       select 5, null, 300 , null, null, 300  from dual
     )
select   id, max(val) as high_val
from     test_data
unpivot  include nulls ( val for source in (col1, col2, col3, col4, col5) )
group by id
order by id
;

        ID   HIGH_VAL
---------- ----------
         1        500
         2        300
         3        400
         4
         5        300

So, to avoid un-grouping only to group again just so we can use MAX, we may instead work a little harder on the original idea: use GREATEST, but find a way to handle NULLs. Here is one way:

select id,
       greatest ( coalesce(col1, col2, col3, col4, col5),
                  coalesce(col2, col3, col4, col5, col1),
                  coalesce(col3, col4, col5, col1, col2),
                  coalesce(col4, col5, col1, col2, col3),
                  coalesce(col5, col1, col2, col3, col4)
                ) as high_val
from   test_data
;

There are other ways to write this, but they are all ugly (and become uglier as the number of columns grows). However, testing, described later, shows that this approach is efficient.
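For the record, one of those uglier variants substitutes a sentinel value for NULL before calling GREATEST, then maps the sentinel back to NULL. A sketch, assuming -1e125 is smaller than every real value in the table (that assumption is mine, not part of the original problem):

```sql
select id,
       nullif( greatest( nvl(col1, -1e125),   -- replace each NULL with a sentinel
                         nvl(col2, -1e125),   -- assumed smaller than any real value
                         nvl(col3, -1e125),
                         nvl(col4, -1e125),
                         nvl(col5, -1e125) ),
               -1e125 ) as high_val            -- all-NULL rows come back as the sentinel; map to NULL
from   test_data;
```

It works for the same reason the COALESCE version does, but it breaks silently if a real value can ever be at or below the sentinel, which is why the COALESCE rotation is the safer formulation.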

There are various ways to write solutions with correlated subqueries, but they don't seem efficient. In Oracle 12 and above, we can use the LATERAL clause (see below). I was hoping this might be more efficient, but I couldn't figure out how to use it efficiently (or perhaps there is no "good" way to use it). Here is what I came up with.

select t.id, s.high_val
from   test_data t,
       lateral ( select max(y.val) as high_val
                 from   ( select t.id as id
                          from   dual
                        ) x,
                        lateral ( select id, val
                                  from   test_data
                                  unpivot ( val for source in (col1, col2, col3, col4, col5) )
                                  where  id = x.id
                                ) y
               ) s
order by id
;

(There is also a more concise way to write this, with only one LATERAL clause, but that version - in effect - correlates two levels deep; as Solomon Yakobson pointed out in another thread today, that only works in some versions of Oracle, and only because Oracle hasn't gotten around to preventing it yet. And performance testing shows no improvement; the query is shorter, but the execution time is about the same.)


My question in this thread is: Are there better ways than the ones already discussed, to take advantage of the way the values are already presented in rows?

Statements to create a bigger TEST_DATA table:  (NOTE - I didn't bother to add NULLs since that is unlikely to affect performance much, as long as I don't change the solutions.)

create table test_data ( id number, col1 number, col2 number, col3 number, col4 number, col5 number );

insert into test_data
  select level, dbms_random.value(), dbms_random.value(), dbms_random.value(), dbms_random.value(), dbms_random.value()
  from   dual
  connect by level <= 1000000
;

commit;

On this data, the solution with GREATEST, rigged to handle NULLs, runs in a little over 3 seconds on my laptop. The UNPIVOT and MAX solution takes about 8 seconds, and the LATERAL solution about 7 seconds (with some variation from one test to the next; I would say there isn't much of a difference between UNPIVOT/MAX and LATERAL).

Cheers,     -     mathguy

Post Details

Locked on Jul 13 2017
Added on Jun 12 2017
35 comments
2,898 views