35 Replies. Latest reply on Jun 16, 2017 10:02 AM by Jonathan Lewis

    How to take advantage of data already being "grouped" in rows

    mathguy

      Suppose we have data as in the table TEST_DATA below (ID is primary key, all columns are NUMBER data type, blank spaces signify NULL):

       

              ID       COL1       COL2       COL3       COL4       COL5
      ---------- ---------- ---------- ---------- ---------- ----------
               1        100        200        300        400        500
               2        300        100        100        300        300
               3                   300        100                   400
               4                                                      
               5                   300                              300

       

       

      And we want to write a query that will return the ID and the highest NON-NULL value in each row (return NULL only if all column values, COL1 through COL5, are NULL, as for ID = 4).

       

      Factored subquery to replicate the table is provided below, and statements to create a much bigger table, for testing, are provided further down in the message.

       

      One may ask why the data is in that structure to begin with. (It would be better, perhaps, if the table had just three columns: ID, COL_NAME, and VALUE.) For the sake of this exercise, let's accept that the base table simply is in this structure.

       

      If no NULLs were present, we could simply use GREATEST(COL1, ... , COL5). Alas, GREATEST and LEAST return NULL if at least one argument is NULL, and that's not what our problem requires.
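
      A quick illustration of that behavior:

      select greatest(100, null, 300) from dual;   -- returns NULL, not 300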

       

      MIN and MAX work the way we want (they ignore NULLs) but they only work on columns, not on rows. It is easy to write a query that unpivots the base table, then groups by ID and takes the MAX. But this has two disadvantages. One is that, in the original arrangement, each set of five values was in effect already "grouped" by ID (by the fact that the values were all in one row). This is something that we can use to our advantage - for example if a function like GREATEST IGNORE NULLS (COL1, ... , COL5) existed. Unfortunately it doesn't. And the second disadvantage, demonstrated recently in another thread by Solomon Yakobson, is that UNPIVOT, unfortunately, does a UNION ALL behind the scenes - in this case, the table is read five times. (So UNPIVOT will do the same job as "manual unpivoting", which is just a UNION ALL of SELECT ID, COL1 FROM TEST_DATA UNION ALL SELECT ID, COL2 FROM... etc.)
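
      For reference, here is the "manual unpivoting" version spelled out in full against the same TEST_DATA - it reads the table five times:

      select   id, max(val) as high_val
      from     (
                 select id, col1 as val from test_data union all
                 select id, col2        from test_data union all
                 select id, col3        from test_data union all
                 select id, col4        from test_data union all
                 select id, col5        from test_data
               )
      group by id
      order by id
      ;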

       

      with
           test_data ( id, col1, col2, col3, col4, col5 ) as (
             select 1, 100 , 200 , 300 , 400 , 500  from dual union all
             select 2, 300 , 100 , 100 , 300 , 300  from dual union all
             select 3, null, 300 , 100 , null, 400  from dual union all
             select 4, null, null, null, null, null from dual union all
             select 5, null, 300 , null, null, 300  from dual
           )
      select   id, max(val) as high_val
      from     test_data
      unpivot  include nulls ( val for source in (col1, col2, col3, col4, col5) )
      group by id
      order by id
      ;

       

              ID   HIGH_VAL
      ---------- ----------
               1        500
               2        300
               3        400
               4
               5        300

       

      So - in order to avoid un-grouping only to group again just so we can use MAX, we may instead work a little harder on the original idea - use GREATEST, but find a way to handle NULLs. Here is one way:

       

      select id,
             greatest ( coalesce(col1, col2, col3, col4, col5),
                        coalesce(col2, col3, col4, col5, col1),
                        coalesce(col3, col4, col5, col1, col2),
                        coalesce(col4, col5, col1, col2, col3),
                        coalesce(col5, col1, col2, col3, col4)
                      ) as high_val
      from   test_data
      ;

       

      There are other ways to write this, but they are all ugly (and become uglier with the number of columns involved). However, testing - described later - shows that it's efficient.

       

      There are various ways to write solutions with correlated subqueries, but they don't seem efficient. In Oracle 12 and above, we can use the LATERAL clause (see below). I was hoping this may be more efficient, but I couldn't figure out how to use it efficiently (or perhaps there is no "good" way to use it). Here is what I came up with.

       

      select t.id, s.high_val
      from   test_data t,
             lateral
                     ( select max(y.val) as high_val
                       from   (
                                select t.id as id
                                from   dual
                              ) x,
                                lateral (
                                          select id, val
                                          from   test_data
                                          unpivot ( val for source in (col1, col2, col3, col4, col5) )
                                          where id = x.id
                                        ) y
                     ) s
      order by id
      ;

       

      (There is also a more concise way to write this, with only one LATERAL clause, but that version - in effect - correlates two levels deep; as Solomon Yakobson pointed out in another thread today, that only works in some versions of Oracle, and only because Oracle hasn't gotten around to preventing it yet. And performance testing shows there is no improvement; the query is shorter but the execution time is about the same.)
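
      For reference, that concise version would look something like the sketch below; the correlation of T.ID two levels deep into the UNPIVOT subquery is the questionable part.

      select t.id, s.high_val
      from   test_data t,
             lateral
                     ( select max(val) as high_val
                       from   (
                                select val
                                from   test_data
                                unpivot ( val for source in (col1, col2, col3, col4, col5) )
                                where  id = t.id    -- correlated two levels deep
                              )
                     ) s
      order by id
      ;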


      My question in this thread is: Are there better ways than the ones already discussed, to take advantage of the way the values are already presented in rows?

       

      Statements to create a bigger TEST_DATA table:  (NOTE - I didn't bother to add NULLs since that is unlikely to affect performance much, as long as I don't change the solutions.)

       

      create table test_data ( id number, col1 number, col2 number, col3 number, col4 number, col5 number );

       

      insert into test_data
        select level, dbms_random.value(), dbms_random.value(), dbms_random.value(), dbms_random.value(), dbms_random.value()
        from   dual
        connect by level <= 1000000
      ;

       

      commit;

       

      On this data, the solution with GREATEST, rigged to handle NULLs, runs in a little over 3 seconds on my laptop. The UNPIVOT and MAX solution takes about 8 seconds, and the LATERAL solution about 7 seconds (with some variation from one test to the next; I would say there isn't much of a difference between UNPIVOT/MAX and LATERAL).

       

      Cheers,     -     mathguy

        • 1. Re: How to take advantage of data already being "grouped" in rows
          rp0428

          If no NULLs were present, we could simply use GREATEST(COL1, ... , COL5).

          Then try getting rid of the nulls using NVL to replace the values; for example replace a null with a large negative value.

           

          Then if the 'greatest' value is equal to that large negative value, it means ALL values were null, so replace that with a null result.

           

          Can't test it right now.

          • 2. Re: How to take advantage of data already being "grouped" in rows
            Manik

            What do you think about a CASE expression approach rather than the GREATEST function ...

             

            case  
                      when col1>=col2 and col1>=col3 and col1>=col4 and col1>=col5 then col1   
                      when col2>=col3 and col2>=col4 and col2>=col5 then col2  
                      when col3>=col4 and col3>=col5 then col3  
                      when col4>=col5 then col4  
                      else col5   
            end  
            

             

             

            Cheers,

            Manik.

            • 3. Re: How to take advantage of data already being "grouped" in rows
              mathguy

              That can be done easily, but may lead to wrong results.

               

              Let's say just for the sake of the argument that -1000 is the least number (read "negative number of greatest absolute value") possible in the NUMBER data type. (We all know it's not -1000, but let's just say it is).  Then I could do NVL(COLn, -1000) and take the GREATEST over the results. Now if the result is -1000, all I have learnt is that all values are EITHER NULL OR -1000, and I don't know more than that. The correct answer is either NULL or -1000 and I can't tell from what I've done so far (although I can, indeed, check if at least one value is non-NULL).
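
               That extra check could look like this - a sketch, still pretending -1000 is the lower bound:

               select id,
                      case when coalesce(col1, col2, col3, col4, col5) is not null
                           then greatest ( nvl(col1, -1000), nvl(col2, -1000), nvl(col3, -1000),
                                           nvl(col4, -1000), nvl(col5, -1000) )
                      end as high_val
               from   test_data
               ;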

               

              My question, though, wasn't so much "how do we solve this problem using GREATEST and NVL", but rather "what other techniques/features are available for this type of situation", which may prove helpful in broader classes of similar problems.

               

              Best,    -    mathguy

              • 4. Re: How to take advantage of data already being "grouped" in rows
                mathguy

                This can be done with CASE, but not as simply as you wrote it.

                 

                Just imagine the largest value is COL1, but COL3 is NULL.  What will be the value returned by your CASE expression?  (Hint:  definitely not COL1, because COL1 >= COL3 will not be TRUE.)

                • 5. Re: How to take advantage of data already being "grouped" in rows
                  Manik

                  Well, yeah - forgot to mention nulls there... But just curious about CASE performance versus GREATEST.

                   

                  Cheers,

                  Manik.

                  • 6. Re: How to take advantage of data already being "grouped" in rows
                    mathguy

                    OK - suppose GREATEST didn't exist, we were writing our own version, and we didn't worry about NULLs. One way to write the function would be with a CASE like you showed.

                     

                    I don't think that would be the most efficient way; too many comparisons, on average. Some sort of recursive definition, like GREATEST(COL1, ... , COL5) = GREATEST(GREATEST(COL1, ... , COL4), COL5) requires only 4 comparisons in all cases.
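
                     Spelled out for five columns (still ignoring NULLs), that nesting is just:

                     select id,
                            greatest( greatest( greatest( greatest(col1, col2), col3 ), col4 ), col5 ) as high_val
                     from   test_data
                     ;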

                     

                    I just tested the CASE solution on the big data (1 million rows) and it completed in about 4 seconds. A bit longer than the GREATEST solution, even with all the COALESCE calls.

                     

                    Cheers,    mathguy

                    • 7. Re: How to take advantage of data already being "grouped" in rows
                      Manik

                      OK, got you... just thinking along GREATEST-function lines (avoiding re-inventing the wheel)... maybe (assuming you don't have -1E66 in your column values):

                       

                      SELECT id,
                             NULLIF (GREATEST (NVL (col1, -1E66),
                                               NVL (col2, -1E66),
                                               NVL (col3, -1E66),
                                               NVL (col4, -1E66),
                                               NVL (col5, -1E66)),
                                     -1E66)
                                AS high_val
                        FROM test_data;

                       

                      Cheers,

                      Manik.

                      • 8. Re: How to take advantage of data already being "grouped" in rows
                        Frank Kulash

                        Hi,

                        mathguy wrote:

                         

                        ...

                        And the second disadvantage, demonstrated recently in another thread by Solomon Yakobson, is that UNPIVOT, unfortunately, does a UNION ALL behind the scenes - in this case, the table is read five times. (So UNPIVOT will do the same job as "manual unpivoting", which is just a UNION ALL of SELECT ID, COL1 FROM TEST_DATA UNION ALL SELECT ID, COL2 FROM... etc.)

                        ...

                        In Oracle 12 and above, we can use the LATERAL clause (see below). I was hoping this may be more efficient, but I couldn't figure out how to use it efficiently (or perhaps there is no "good" way to use it).

                        ...

                        I recall Solomon saying the UNPIVOT isn't always efficient.  He usually suggests using a cross-join instead.

                         

                        Try using LATERAL or CROSS APPLY without UNPIVOT.  Sorry, I'm not at an Oracle 12 database now, so I can't test it, but it goes something like this:

                        SELECT       t.id, s.high_val
                        FROM         test_data  t
                        CROSS APPLY  (
                                         SELECT  MAX ( CASE  LEVEL
                                                           WHEN  1  THEN  t.col1
                                                           WHEN  2  THEN  t.col2
                                                           WHEN  3  THEN  t.col3
                                                           WHEN  4  THEN  t.col4
                                                           WHEN  5  THEN  t.col5
                                                       END
                                                     ) AS high_val
                                         FROM    dual
                                         CONNECT BY  LEVEL  <= 5
                                     )          s
                        ORDER BY     t.id
                        ;

                        • 9. Re: How to take advantage of data already being "grouped" in rows
                          Frank Kulash

                          Hi,

                          mathguy wrote:

                           

                          That can be done easily, but may lead to wrong results.

                           

                          Let's say just for the sake of the argument that -1000 is the least number (read "negative number of greatest absolute value") possible in the NUMBER data type. (We all know it's not -1000, but let's just say it is). Then I could do NVL(COLn, -1000) and take the GREATEST over the results. Now if the result is -1000, all I have learnt is that all values are EITHER NULL OR -1000, and I don't know more than that. The correct answer is either NULL or -1000 and I can't tell from what I've done so far (although I can, indeed, check if at least one value is non-NULL).

                          ...

                          The point of replies #1 and #7 is, when you know a lower bound to the possible values, use GREATEST and NVL with some value lower than (not equal to) that lower bound.  So, if you know that the columns might contain values as low as -1000 (but no lower), then use something lower than -1000 (like -1001 or, as Manik did, -1E66) as the 2nd argument to NVL, like this:

                          SELECT    id
                          ,         NULLIF ( GREATEST ( NVL (col1, -1E66)
                                                      , NVL (col2, -1E66)
                                                      , NVL (col3, -1E66)
                                                      , NVL (col4, -1E66)
                                                      , NVL (col5, -1E66)
                                                      )
                                           , -1E66
                                           )  AS high_val
                          FROM      test_data
                          ORDER BY  id
                          ;

                          • 10. Re: How to take advantage of data already being "grouped" in rows
                            CameronS

                            Apologies for not being able to test this (my hands are tied due to lack of available tools currently).

                             

                            This topic has been covered in a DB2 blog https://www.ibm.com/developerworks/community/blogs/SQLTips4DB2LUW/entry/pivoting_tables56?lang=en

                            where "The most efficient way to unpivot is to do a cross join between the [relevant] table and a correlated VALUES of as many rows as columns that need to be unpivoted"

                             

                            Suggesting you might want to start with something similar to

                            SELECT T.id, U.clmn, U.result
                            FROM test_data AS T,
                                 LATERAL (VALUES (1, T.col1),
                                                 (2, T.col2),
                                                 (3, T.col3),
                                                 (4, T.col4),
                                                 (5, T.col5)) AS U (clmn, result);

                             

                            Now of course, the same set of tests can be carried out in Oracle.  You mentioned you were after a more efficient method, and this appears to be a more efficient way of unpivoting than your UNION ALL.

                             

                            Just not sure that it works.

                            • 11. Re: How to take advantage of data already being "grouped" in rows
                              Stew Ashton

                              This thread includes some dissing of UNPIVOT. mathguy says "[as] demonstrated recently in another thread by Solomon Yakobson ... UNPIVOT, unfortunately, does a UNION ALL behind the scenes - in this case, the table is read five times."

                               

                              Here is the statement in question:

                               

                              with
                                   test_data ( id, col1, col2, col3, col4, col5 ) as (
                                     select 1, 100 , 200 , 300 , 400 , 500  from dual union all
                                     select 2, 300 , 100 , 100 , 300 , 300  from dual union all
                                     select 3, null, 300 , 100 , null, 400  from dual union all
                                     select 4, null, null, null, null, null from dual union all
                                     select 5, null, 300 , null, null, 300  from dual
                                   )
                              select   id, max(val) as high_val
                              from     test_data
                              unpivot  include nulls ( val for source in (col1, col2, col3, col4, col5) )
                              group by id
                              order by id
                              ;
                              

                               

                              The UNION ALL is not done by the UNPIVOT! It is done in the WITH clause. You can see the text "union all" written four times: it is not "behind the scenes" but on stage front and center.

                               

                              Now put that data into a table and the UNION ALL disappears.

                               

                              create table t( id, col1, col2, col3, col4, col5 ) cache as 
                                     select 1, 100 , 200 , 300 , 400 , 500  from dual union all
                                     select 2, 300 , 100 , 100 , 300 , 300  from dual union all
                                     select 3, null, 300 , 100 , null, 400  from dual union all
                                     select 4, null, null, null, null, null from dual union all
                                     select 5, null, 300 , null, null, 300  from dual
                                   ;
                              select count(*) from t;
                              select /*+ gather_plan_statistics */   id, max(val) as high_val
                              from t
                              unpivot include nulls ( val for source in (col1, col2, col3, col4, col5) )
                              group by id
                              order by id
                              ;
                              select * from table(dbms_xplan.display_cursor(null,null,'IOSTATS LAST'));
                              
                              ---------------------------------------------------------------------------------------
                              | Id  | Operation            | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
                              ---------------------------------------------------------------------------------------
                              |   0 | SELECT STATEMENT     |      |      1 |        |      5 |00:00:00.01 |       9 |
                              |   1 |  SORT GROUP BY       |      |      1 |     25 |      5 |00:00:00.01 |       9 |
                              |   2 |   VIEW               |      |      1 |     25 |     25 |00:00:00.01 |       9 |
                              |   3 |    UNPIVOT           |      |      1 |        |     25 |00:00:00.01 |       9 |
                              |   4 |     TABLE ACCESS FULL| T    |      1 |      5 |      5 |00:00:00.01 |       9 |
                              ---------------------------------------------------------------------------------------
                              

                               

                              Perhaps Solomon had another test case in the thread mathguy mentioned. I would appreciate a link.

                               

                              Best regards, Stew Ashton

                              • 12. Re: How to take advantage of data already being "grouped" in rows
                                Stew Ashton

                                I think there is a more concise way to code LATERAL:

                                 

                                select id, high_val
                                from test_data t, lateral (
                                    select max(
                                        case level  when 1 then col1
                                                    when 2 then col2
                                                    when 3 then col3
                                                    when 4 then col4
                                                    when 5 then col5
                                        end
                                    ) high_val
                                    from dual connect by level <= 5
                                );
                                

                                 

                                But it doesn't perform any better than UNPIVOT.

                                 

                                [UPDATE: sorry, just realized Frank already spoke of this in reply 8.]

                                 

                                On my laptop running the Oracle Developer Days VM for 12.2, the GREATEST(COALESCE(... solution runs in less than half a second.

                                 

                                Why not simplify the COALESCE part a bit?

                                 

                                select id,
                                    nullif(
                                        greatest (
                                            coalesce(col1, -1E99),  
                                            coalesce(col2, -1E99),  
                                            coalesce(col3, -1E99),  
                                            coalesce(col4, -1E99),  
                                            coalesce(col5, -1E99)  
                                        ),
                                        -1E99
                                    )as high_val  
                                from   test_data;
                                

                                 

                                [UPDATE: forgot the NULLIF!]

                                 

                                I prefer COALESCE to NVL because the documentation guarantees short-circuit evaluation: see COALESCE.
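
                                A quick way to see the difference - COALESCE never gets to the second argument here, while NVL evaluates both:

                                select coalesce(1, 1/0) from dual;  -- returns 1
                                select nvl(1, 1/0)      from dual;  -- raises ORA-01476: divisor is equal to zero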

                                • 13. Re: How to take advantage of data already being "grouped" in rows
                                  ascheffer
                                  select nullif( greatest( nvl( col1, -binary_float_infinity )
                                                        , nvl( col2, -binary_float_infinity )
                                                        , nvl( col3, -binary_float_infinity )
                                                        , nvl( col4, -binary_float_infinity )
                                                        , nvl( col5, -binary_float_infinity )
                                                        )
                                              , -binary_float_infinity )
                                  from test_data
                                  
                                  • 14. Re: How to take advantage of data already being "grouped" in rows
                                    Stew Ashton

                                    ascheffer wrote:

                                     

                                    select nullif( greatest( nvl( col1, -binary_float_infinity )
                                                          , nvl( col2, -binary_float_infinity )
                                                          , nvl( col3, -binary_float_infinity )
                                                          , nvl( col4, -binary_float_infinity )
                                                          , nvl( col5, -binary_float_infinity )
                                                          )
                                                , -binary_float_infinity )
                                    from test_data

                                    But you lose precision! All the columns are converted from numbers to binary floats.
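
                                    For instance (a quick sketch - numeric precedence makes NVL return BINARY_FLOAT here, and BINARY_FLOAT carries only about 7 significant decimal digits):

                                    select nvl( cast(123456789 as number), -binary_float_infinity ) as converted
                                    from   dual;
                                    -- the result is a BINARY_FLOAT, so 123456789 comes back rounded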

                                     

                                    Best regards, Stew
