developers

    Forum Stats

  • 3,874,064 Users
  • 2,266,670 Discussions
  • 7,911,712 Comments

Discussions

Intermediate count

user13117585
user13117585 Member Posts: 680 Bronze Badge
edited Apr 16, 2022 6:01PM in SQL & PL/SQL

Hello everyone,

I was wondering if anyone could help and guide me to achieve intermediate counts on a recursive query.

I have a structure like this:


CREATE TABLE grp (id number, pid number, name varchar2(10));
insert into grp values(1, null, 'grp1');
insert into grp values(11, 1, 'grp11');
insert into grp values(12, 11, 'grp12');
insert into grp values(13, 12, 'grp13');
insert into grp values(14, 12, 'grp14');

insert into grp values(2, null, 'grp2');

commit;

WITH t(id, pid, lvl, ident) AS 
(
 SELECT id, pid, 1, name FROM grp WHERE pid IS NULL
 UNION ALL
 SELECT c.id, c.pid, t.lvl +1, lpad(' ', t.lvl) || c.name FROM grp c, t WHERE t.id = c.pid
)
SEARCH DEPTH FIRST BY id SET rnk
select id, pid, lvl, ident from t;

        ID        PID        LVL IDENT     
---------- ---------- ---------- ----------
         1                     1 grp1      
        11          1          2  grp11    
        12         11          3   grp12   
        13         12          4    grp13  
        14         12          4    grp14  
         2                     1 grp2      


6 rows selected. 



I was wondering if there is any easy way to create intermediate counts? To have something like:

        ID        PID        LVL IDENT      direct  total  
---------- ---------- ---------- ---------- ------ -------
         1                     1 grp1            1      4
        11          1          2  grp11          1      3
        12         11          3   grp12         2      2
        13         12          4    grp13        0      0
        14         12          4    grp14        0      0
         2                     1 grp2            0      0

For each level, I would like to have 2 columns. One that will count only immediate children. And, the other one that should count ALL descending nodes from the current level. I could probably do this using a function that will select and count. But, I was wondering if there is a way in pure SQL?

Any suggestion is welcome,

Regards,

Tagged:

Best Answers

  • mathguy
    mathguy Member Posts: 10,893 Black Diamond
    edited Apr 16, 2022 7:26PM Answer ✓

    The match_recognize clause can be used for an elegant solution to such problems.

    When running a traditional hierarchical query (connect by), the hierarchical order together with the level pseudo-column generate a row pattern that we can exploit: all the descendants of any node immediately follow that node, and have level strictly greater than the node's own level. match_recognize can recognize that pattern.

    QUERY:
    with
      prep (id, pid, name, lvl, rn) as (
        select  id, pid, name, level, rownum
        from    grp
        start   with pid is null
        connect by pid = prior id
        order   siblings by pid
      )
    select id, pid, lvl, ident, direct, total
    from   prep
    match_recognize (
      order    by rn
      measures a.id as id, a.pid as pid, a.lvl as lvl,
               lpad(' ', a.lvl) || a.name as ident,
               count(c.*) as direct, count(t.*) as total
      after match skip to next row
      pattern  (a (c|d)*)
      subset   t = (c, d)
      define   c as lvl = a.lvl + 1, d as lvl > a.lvl
    )
    ;
    
    

    For testing I added a few more rows to your sample table. Using the following inputs, I get the output shown at the end.

    TEST DATA:
    drop table grp purge;
    
    create table grp (id number, pid number, name varchar2(10));
    insert into grp values ( 1, null, 'grp1' );
    insert into grp values (11,    1, 'grp11');
    insert into grp values (12,   11, 'grp12');
    insert into grp values (13,   12, 'grp13');
    insert into grp values (14,   12, 'grp14');
    insert into grp values (31,    1, 'grp31');
    insert into grp values (38,   31, 'grp38');
    insert into grp values (39,   31, 'grp39');
    insert into grp values (55,   38, 'grp55');
    insert into grp values ( 2, null, 'grp2' );
    
    commit;
    
    

    Using the query above on this sample data, I get the following output:

    OUTPUT:
      ID  PID  LVL IDENT      DIRECT TOTAL
    ---- ---- ---- ---------- ------ -----
       1         1  grp1           2     8
      11    1    2   grp11         1     3
      12   11    3    grp12        2     2
      13   12    4     grp13       0     0
      14   12    4     grp14       0     0
      31    1    2   grp31         2     3
      38   31    3    grp38        1     1
      55   38    4     grp55       0     0
      39   31    3    grp39        0     0
       2         1  grp2           0     0
    
  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 42,745 Red Diamond
    Answer ✓

    Hi, @user1311758

    Using CONNECT BY, here's one way:

    WITH  top_down  AS
    (
    	SELECT id, pid
    	, 	LEVEL				AS lvl
    	,	LPAD (' ', LEVEL - 1) || name	AS ident
    	,	ROWNUM	  	    	 	AS rnum
    	,	SYS_CONNECT_BY_PATH (pid, ',') || ','	AS ancestors	
    	FROM	grp
    	START WITH pid IS NULL
    	CONNECT BY pid = PRIOR id
    )
    SELECT  t.*
    ,	 (
    	   SELECT COUNT (*)
    	   FROM  top_down d
    	   WHERE  d.pid = t.id
    	 )				AS direct
    ,	 (
    	   SELECT COUNT (*)
    	   FROM  top_down i
    	   WHERE  INSTR ( i.ancestors
    	   	   	  , ',' || t.id || ','
    			  ) > 0
    	 )				AS indirect
    FROM	 top_down t
    ORDER BY rnum
    ;
    

    I tried the same approach (scalar sub-queries) on your results, but I got an error "ORA-00604: error occurred at recursive SQL level 1. ORA-00910: specified length too long for its datatype" I'll try to figure out why

    user13117585
  • mathguy
    mathguy Member Posts: 10,893 Black Diamond
    edited Apr 16, 2022 9:14PM Answer ✓

    Actually I just ran a simple test, described below. On a table with 3,905 rows, the match_recognize solution runs in about 0.012 seconds. The other solution (using correlated subqueries) runs in over 3 seconds, or 250 times slower.

    Then I tried again, this time with a table with 19,530 rows. Still way less than your 300k rows. The match_recognize solution runs in 0.07 seconds. The other solution takes 85 seconds; 1,200 times slower.

    To keep the displaying of thousands of rows to the terminal (irrelevant for this test), instead of using the SELECT statements as posted in both answers, I selected SUM(ID + TOTAL) / SUM(PID + DIRECT). The optimizer can't optimize out the generation of all rows (and all relevant columns), but the output is just a single number. In all cases, the two solutions give the same result (as expected).

    The test table is a graph (a forest, consisting of several trees), as follows. I start with five roots - that's generation 1. Then each node has exactly five direct children. So there are five nodes in generation 1, 25 nodes in generation 2, 125 nodes in generation 3, etc. For the first test I stopped at generation 4 5, and for the second, at generation 5 6. (EDITED: in the code below, if I stop when r.gen <=5, this means I am already generating the sixth generation in the recursive step.)

    Test data generated as follows. Change the cutoff for "generation" toward the end of the insert statement for the different test cases.

    truncate table grp;
    
    insert into grp
    with
      h (id, pid, name) as (
        select  level, null, 'grp' || to_char(level)
        from    dual
        connect by level <= 5
      )
    , r (gen, id, pid, name) as (
        select  1, id, pid, name
          from  h
        union all
        select  r.gen + 1, 10 * r.id + h.id, r.id,
                'grp' || to_char(10 * r.id + h.id)
          from  r join h on r.gen <= 5               ---- <<<< ----
      )
    select id, pid, name from r;
    

Answers

  • mathguy
    mathguy Member Posts: 10,893 Black Diamond
    edited Apr 16, 2022 7:26PM Answer ✓

    The match_recognize clause can be used for an elegant solution to such problems.

    When running a traditional hierarchical query (connect by), the hierarchical order together with the level pseudo-column generate a row pattern that we can exploit: all the descendants of any node immediately follow that node, and have level strictly greater than the node's own level. match_recognize can recognize that pattern.

    QUERY:
    with
      prep (id, pid, name, lvl, rn) as (
        select  id, pid, name, level, rownum
        from    grp
        start   with pid is null
        connect by pid = prior id
        order   siblings by pid
      )
    select id, pid, lvl, ident, direct, total
    from   prep
    match_recognize (
      order    by rn
      measures a.id as id, a.pid as pid, a.lvl as lvl,
               lpad(' ', a.lvl) || a.name as ident,
               count(c.*) as direct, count(t.*) as total
      after match skip to next row
      pattern  (a (c|d)*)
      subset   t = (c, d)
      define   c as lvl = a.lvl + 1, d as lvl > a.lvl
    )
    ;
    
    

    For testing I added a few more rows to your sample table. Using the following inputs, I get the output shown at the end.

    TEST DATA:
    drop table grp purge;
    
    create table grp (id number, pid number, name varchar2(10));
    insert into grp values ( 1, null, 'grp1' );
    insert into grp values (11,    1, 'grp11');
    insert into grp values (12,   11, 'grp12');
    insert into grp values (13,   12, 'grp13');
    insert into grp values (14,   12, 'grp14');
    insert into grp values (31,    1, 'grp31');
    insert into grp values (38,   31, 'grp38');
    insert into grp values (39,   31, 'grp39');
    insert into grp values (55,   38, 'grp55');
    insert into grp values ( 2, null, 'grp2' );
    
    commit;
    
    

    Using the query above on this sample data, I get the following output:

    OUTPUT:
      ID  PID  LVL IDENT      DIRECT TOTAL
    ---- ---- ---- ---------- ------ -----
       1         1  grp1           2     8
      11    1    2   grp11         1     3
      12   11    3    grp12        2     2
      13   12    4     grp13       0     0
      14   12    4     grp14       0     0
      31    1    2   grp31         2     3
      38   31    3    grp38        1     1
      55   38    4     grp55       0     0
      39   31    3    grp39        0     0
       2         1  grp2           0     0
    
  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 42,745 Red Diamond
    Answer ✓

    Hi, @user1311758

    Using CONNECT BY, here's one way:

    WITH  top_down  AS
    (
    	SELECT id, pid
    	, 	LEVEL				AS lvl
    	,	LPAD (' ', LEVEL - 1) || name	AS ident
    	,	ROWNUM	  	    	 	AS rnum
    	,	SYS_CONNECT_BY_PATH (pid, ',') || ','	AS ancestors	
    	FROM	grp
    	START WITH pid IS NULL
    	CONNECT BY pid = PRIOR id
    )
    SELECT  t.*
    ,	 (
    	   SELECT COUNT (*)
    	   FROM  top_down d
    	   WHERE  d.pid = t.id
    	 )				AS direct
    ,	 (
    	   SELECT COUNT (*)
    	   FROM  top_down i
    	   WHERE  INSTR ( i.ancestors
    	   	   	  , ',' || t.id || ','
    			  ) > 0
    	 )				AS indirect
    FROM	 top_down t
    ORDER BY rnum
    ;
    

    I tried the same approach (scalar sub-queries) on your results, but I got an error "ORA-00604: error occurred at recursive SQL level 1. ORA-00910: specified length too long for its datatype" I'll try to figure out why

    user13117585
  • user13117585
    user13117585 Member Posts: 680 Bronze Badge

    Hello both of you.

    I have to admin Frank's solution is easy to understand. To get direct sub level, he uses a join, and to get all of the counts, he concatenate the ids all together and uses that string to count. Very clean, very clear and a very nice trick the concatenation.

    For match_recognize, I'm not familiar with that function. But, it's time to get my hands on it. So, I will take the time to understand that solution.


    Thank you very much for this very nice tricks.

    Regards,

  • mathguy
    mathguy Member Posts: 10,893 Black Diamond

    If you have a sufficiently large amount of data, it will be interesting to compare the performance of the two approaches. I have a strong suspicion, but since I haven't tested, I'll keep that to myself.

  • user13117585
    user13117585 Member Posts: 680 Bronze Badge

    Hi mathguy,

    I have a table with about 300K rows. I will try both solutions and will let you know which one has better performances.

    I'm reading this right now. And I see this is very powerful...

    Thank you again!

  • mathguy
    mathguy Member Posts: 10,893 Black Diamond
    edited Apr 16, 2022 9:14PM Answer ✓

    Actually I just ran a simple test, described below. On a table with 3,905 rows, the match_recognize solution runs in about 0.012 seconds. The other solution (using correlated subqueries) runs in over 3 seconds, or 250 times slower.

    Then I tried again, this time with a table with 19,530 rows. Still way less than your 300k rows. The match_recognize solution runs in 0.07 seconds. The other solution takes 85 seconds; 1,200 times slower.

    To keep the displaying of thousands of rows to the terminal (irrelevant for this test), instead of using the SELECT statements as posted in both answers, I selected SUM(ID + TOTAL) / SUM(PID + DIRECT). The optimizer can't optimize out the generation of all rows (and all relevant columns), but the output is just a single number. In all cases, the two solutions give the same result (as expected).

    The test table is a graph (a forest, consisting of several trees), as follows. I start with five roots - that's generation 1. Then each node has exactly five direct children. So there are five nodes in generation 1, 25 nodes in generation 2, 125 nodes in generation 3, etc. For the first test I stopped at generation 4 5, and for the second, at generation 5 6. (EDITED: in the code below, if I stop when r.gen <=5, this means I am already generating the sixth generation in the recursive step.)

    Test data generated as follows. Change the cutoff for "generation" toward the end of the insert statement for the different test cases.

    truncate table grp;
    
    insert into grp
    with
      h (id, pid, name) as (
        select  level, null, 'grp' || to_char(level)
        from    dual
        connect by level <= 5
      )
    , r (gen, id, pid, name) as (
        select  1, id, pid, name
          from  h
        union all
        select  r.gen + 1, 10 * r.id + h.id, r.id,
                'grp' || to_char(10 * r.id + h.id)
          from  r join h on r.gen <= 5               ---- <<<< ----
      )
    select id, pid, name from r;
    
  • user13117585
    user13117585 Member Posts: 680 Bronze Badge


    Hello mathguy, 


    I had compared both solutions... And you were correct. Your solution is blazing fast in comparison to Franks.


    With Frank's solution, I get this plan.


    Explained.
    
    PLAN_TABLE_OUTPUT                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
    ---------------------------------------------------------------------------------------------------------------------------------
    Plan hash value: 1658207697
     
    ---------------------------------------------------------------------------------------------------------------------------------
    | Id | Operation                 | Name            | Rows | Bytes |TempSpc| Cost (%CPU)| Time   |
    ---------------------------------------------------------------------------------------------------------------------------------
    |  0 | SELECT STATEMENT              |              | 4554K| 8921M|    |  39G (1)|433:21:16 |
    |  1 | SORT AGGREGATE              |              |   1 |  13 |    |      |     |
    |* 2 |  VIEW                   |              | 4554K|  56M|    | 4391  (1)| 00:00:01 |
    |  3 |  TABLE ACCESS FULL            | SYS_TEMP_0FD9D662F_1F267E6 | 4554K|  112M|    | 4391  (1)| 00:00:01 |
    |  4 | SORT AGGREGATE              |              |   1 | 2002 |    |      |     |
    |* 5 |  VIEW                   |              | 4554K| 8695M|    | 4391  (1)| 00:00:01 |
    |  6 |  TABLE ACCESS FULL            | SYS_TEMP_0FD9D662F_1F267E6 | 4554K|  112M|    | 4391  (1)| 00:00:01 |
    |  7 | TEMP TABLE TRANSFORMATION         |              |    |    |    |      |     |
    |  8 |  LOAD AS SELECT (CURSOR DURATION MEMORY) | SYS_TEMP_0FD9D662F_1F267E6 |    |    |    |      |     |
    |  9 |  COUNT                  |              |    |    |    |      |     |
    |* 10 |   CONNECT BY NO FILTERING WITH START-WITH|              |    |    |    |      |     |
    | 11 |   TABLE ACCESS FULL           | TST            | 2360K|  31M|    | 4828  (1)| 00:00:01 |
    | 12 |  SORT ORDER BY              |              | 4554K| 8921M|  11G|  39G (1)|433:21:15 |
    | 13 |  VIEW                  |              | 4554K| 8921M|    | 4391  (1)| 00:00:01 |
    | 14 |   TABLE ACCESS FULL           | SYS_TEMP_0FD9D662F_1F267E6 | 4554K|  112M|    | 4391  (1)| 00:00:01 |
    ---------------------------------------------------------------------------------------------------------------------------------
     
    Predicate Information (identified by operation id):
    ---------------------------------------------------
     
      2 - filter("D"."PARENT_ID"=:B1)
      5 - filter(INSTR("I"."ANCESTORS",','||TO_CHAR(:B1)||',')>0)
     10 - access("PARENT_ID"=PRIOR "ID")
        filter("PARENT_ID" IS NULL AND "D_TYPE"='ROOT')
    
    29 rows selected. 
    




    With your solution, I get this plan.


    PLAN_TABLE_OUTPUT                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                  
    ----------------------------------------------------------------------------------------------------
    Plan hash value: 2619942942
     
    ----------------------------------------------------------------------------------------------------
    | Id | Operation                  | Name | Rows | Bytes | Cost (%CPU)| Time   |
    ----------------------------------------------------------------------------------------------------
    |  0 | SELECT STATEMENT              |   | 4554K|  282M| 97166  (1)| 00:00:04 |
    |  1 | VIEW                    |   | 4554K|  282M| 97166  (1)| 00:00:04 |
    |  2 |  MATCH RECOGNIZE SORT           |   | 4554K|  225M| 97166  (1)| 00:00:04 |
    |  3 |  VIEW                   |   | 4554K|  225M| 38527  (1)| 00:00:02 |
    |  4 |   COUNT                  |   |    |    |      |     |
    |* 5 |   CONNECT BY NO FILTERING WITH START-WITH|   |    |    |      |     |
    |  6 |    TABLE ACCESS FULL           | TST | 2360K|  31M| 4828  (1)| 00:00:01 |
    ----------------------------------------------------------------------------------------------------
     
    Predicate Information (identified by operation id):
    ---------------------------------------------------
     
      5 - access("PARENT_ID"=PRIOR "ID")
        filter("PARENT_ID" IS NULL AND "D_TYPE"='ROOT')
    
    19 rows selected. 
    


    With your proposal, I get an answer in 5 about seconds. For the other suggestion, I didn't get the results yet (40minutes have passed already). So, this is another reason to learn more about this pattern matching.

    THANK you again both of you. These are very nice tricks.

developers