
SQL & PL/SQL


Amalgamating groups to be beyond a given size threshold

JonWatNov 2 2016 — edited Nov 3 2016

I'm having a hard time thinking of a good way to do this.

I have a bunch of small groups that have a category, and an order within the category. If the groups are below a certain size threshold I want to merge them with the group that is just below them in the order, and keep on doing that until the amalgamated group is beyond the threshold.

For example, I have a table like this and my threshold is 10:

cat gives the category -- there is no merging between categories.

lev gives the ordering within the category.

create table small_groups
(id number(3),
 cat varchar2(10),
 lev number(2),
 grpsize number(3),
 final_grp number(3));

insert into small_groups values(1,'Rural',10,2,null);
insert into small_groups values(2,'Rural',9,3,null);
insert into small_groups values(3,'Rural',8,1,null);
insert into small_groups values(4,'Rural',7,4,null);
insert into small_groups values(5,'Rural',6,11,null);
insert into small_groups values(6,'Rural',5,2,null);
insert into small_groups values(7,'Rural',4,2,null);
insert into small_groups values(8,'Rural',3,4,null);
insert into small_groups values(9,'Rural',2,30,null);
insert into small_groups values(10,'Rural',1,12,null);
insert into small_groups values(11,'Urban',10,1,null);
insert into small_groups values(12,'Urban',9,12,null);
insert into small_groups values(13,'Urban',8,2,null);
insert into small_groups values(14,'Urban',7,5,null);
insert into small_groups values(15,'Urban',6,7,null);
insert into small_groups values(16,'Urban',5,15,null);
insert into small_groups values(17,'Urban',4,25,null);
insert into small_groups values(18,'Urban',3,2,null);
insert into small_groups values(19,'Urban',2,1,null);
insert into small_groups values(20,'Urban',1,8,null);

ID 1 has only two people, so it would be merged with the Rural 9 group. That would still have only five people, so merge with Rural 8; still only six people, so merge with Rural 7, which is ID 4. That gives me 10 people, so IDs 1 through 4 get their final_grp numbers set to 4.

ID 5 is big enough on its own, so its final_grp is 5.

IDs 6, 7 and 8 become part of 9.

10 is fine by itself.

11 gets a final_grp of 12.

13 and 14 get a final_grp of 15.

16 and 17 are fine by themselves, and 18 and 19 get a final_grp of 20.

I think I could probably manage to do this procedurally using PL/SQL, but I suspect there is a clever method using analytic functions that, at the moment, escapes me.

Thanks.

This post has been answered by mathguy on Nov 2 2016

Comments

mathguy Nov 2 2016 — edited on Nov 2 2016
Answer

I don't know if this can be done with analytic functions alone, but a recursive query may still work faster than recursion in PL/SQL. (Or not! One should always test both ways...)

Here's one way to do it. I wasn't sure what role "lev" plays in the whole thing; I assumed the ordering is by decreasing lev rather than by increasing id. (IDs should NOT contain information in most cases - not even for ordering things.) If in fact your id contains ordering information, then you could simplify the solution somewhat.

with
    prep ( id, cat, lev, grpsize, rn ) as (
      select id, cat, lev, grpsize,
             row_number() over (partition by cat order by lev desc)
      from   small_groups
    ),
    rec ( id, cat, lev, grpsize, rn, c_size, gp ) as (
      select  id, cat, lev, grpsize, rn, grpsize, 1
        from  prep
        where rn = 1
      union all
      select  p.id, p.cat, p.lev, p.grpsize, p.rn,
              p.grpsize + case when r.c_size >= 10 then 0 else r.c_size end,
              r.gp + case when r.c_size >= 10 then 1 else 0 end
      from    rec r join prep p on r.cat = p.cat and r.rn + 1 = p.rn
    )
select id, cat, lev, grpsize, c_size,
       first_value(id) over (partition by cat, gp order by lev) as final_grp
from   rec
order by cat, final_grp, id;

        ID CAT               LEV    GRPSIZE     C_SIZE  FINAL_GRP
---------- ---------- ---------- ---------- ---------- ----------
         1 Rural              10          2          2          4
         2 Rural               9          3          5          4
         3 Rural               8          1          6          4
         4 Rural               7          4         10          4
         5 Rural               6         11         11          5
         6 Rural               5          2          2          9
         7 Rural               4          2          4          9
         8 Rural               3          4          8          9
         9 Rural               2         30         38          9
        10 Rural               1         12         12         10
        11 Urban              10          1          1         12
        12 Urban               9         12         13         12
        13 Urban               8          2          2         15
        14 Urban               7          5          7         15
        15 Urban               6          7         14         15
        16 Urban               5         15         15         16
        17 Urban               4         25         25         17
        18 Urban               3          2          2         20
        19 Urban               2          1          3         20
        20 Urban               1          8         11         20

20 rows selected

Marked as Answer by JonWat · Sep 27 2020
Frank Kulash Nov 2 2016 — edited on Nov 2 2016

Hi,

Starting in Oracle 12, you can also do it using MATCH_RECOGNIZE:

WITH    got_running_grpsize    AS
(
    SELECT    id, cat, lev, grpsize
    ,         SUM (grpsize) OVER ( PARTITION BY  cat
                                   ORDER BY      id
                                 ) - grpsize  AS running_grpsize
    FROM    small_groups
)
SELECT    id, cat, lev, grpsize
,         MAX (id) OVER ( PARTITION BY  leader_id)
                            AS final_grp
FROM      got_running_grpsize
MATCH_RECOGNIZE (
                  PARTITION BY  cat
                  ORDER BY      id
                  MEASURES      leader.id  AS leader_id
                  ALL ROWS PER MATCH
                  PATTERN       ( leader  follower* )
                  DEFINE        follower  AS  follower.running_grpsize
                                            < leader.running_grpsize + 10
                )
ORDER BY  cat
,         id
;

I'll bet there's a simpler MATCH_RECOGNIZE solution, one that doesn't require the sub-query and/or the analytic MAX function.

Do you really want to store final_grp in the table?  It will need to be re-calculated whenever any DML is done on the table.

If you really do need to store it, you can use something like the query above in a MERGE statement.
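For example, a sketch (not tested) of what that MERGE might look like, reusing the running_grpsize logic and the hard-coded threshold of 10 from the query above; the aliases dst and src and the inline views are just illustrative:

MERGE INTO small_groups dst
USING (
        -- same pattern matching as above, reduced to id -> final_grp pairs
        SELECT  id
        ,       MAX (id) OVER ( PARTITION BY  leader_id )  AS final_grp
        FROM    (
                  SELECT  id, cat, grpsize
                  ,       SUM (grpsize) OVER ( PARTITION BY  cat
                                               ORDER BY      id
                                             ) - grpsize  AS running_grpsize
                  FROM    small_groups
                )
        MATCH_RECOGNIZE (
                          PARTITION BY  cat
                          ORDER BY      id
                          MEASURES      leader.id  AS leader_id
                          ALL ROWS PER MATCH
                          PATTERN       ( leader  follower* )
                          DEFINE        follower  AS  follower.running_grpsize
                                                    < leader.running_grpsize + 10
                        )
      )  src
ON    ( dst.id = src.id )
WHEN MATCHED THEN UPDATE
SET   dst.final_grp = src.final_grp;

There is no WHEN NOT MATCHED branch because every id in the USING subquery comes from small_groups itself.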

mathguy Nov 2 2016

Oracle 12 and above - using match_recognize. I added a column: c_size (as in the previous solution) is the cumulative or "running" sum of grpsize, while the new column t_size is the total grpsize for the new group. (That can also be added to the earlier solution if desired.)

select id, cat, lev, grpsize, c_size, t_size, final_gp as final_grp
from   small_groups
match_recognize(
  partition by cat
  order     by lev desc
  measures           sum(grpsize) as c_size,
               final sum(grpsize) as t_size,
               final last(id) as final_gp
  all rows per match
  pattern ( a* b | a* )
  define    a as sum(grpsize) <  10,
            b as sum(grpsize) >= 10
);

mathguy Nov 2 2016

Ha! You can tell by the difference in posting times that it took me over half an hour to come up with my version. When I started, your reply wasn't there yet... Cheers! mathguy

unknown-7404 Nov 2 2016

If the groups are below a certain size threshold I want to merge them with the group that is just below them in the order, and keep on doing that until the amalgamated group is beyond the threshold.

Ok - but 'merge' usually means a reduction in the number of rows.

So explain what the desired result is:

1. One row for each 'amalgamated group'

2. Same number of rows as the input but with each row assigned an 'amalgamated group' number.

The type and amount of processing are different for each of those. The first is just an accumulation which, when the threshold is exceeded:

1. spits out the accumulated row

2. starts a new group

For #1 you just order the rows and then accumulate.
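For instance, if the row-level output of one of the queries above were available as a named subquery or view (called amalgamated here purely for illustration, with columns cat, final_grp and grpsize), the one-row-per-group result is just an aggregation - a sketch, not tested:

select   cat, final_grp,
         sum(grpsize) as total_grpsize,   -- size of the amalgamated group
         count(*)     as groups_merged    -- how many original small groups it absorbed
from     amalgamated
group by cat, final_grp
order by cat, final_grp;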

I think I could probably manage to do this procedurally using PL/SQL

And, depending on your FULL requirements, a PIPELINED function could be a useful solution.
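A rough sketch of that idea (untested; the types grp_row_t/grp_tab_t, the function name amalgamate_groups and its threshold parameter are made up for illustration, and the ordering is assumed to be by lev descending within cat, as in the accepted answer):

create or replace type grp_row_t as object (id number, cat varchar2(10), final_grp number);
/
create or replace type grp_tab_t as table of grp_row_t;
/
create or replace function amalgamate_groups (p_threshold in number default 10)
  return grp_tab_t pipelined
is
  type id_tab is table of small_groups.id%type;
  l_ids  id_tab := id_tab();            -- ids buffered for the group being built
  l_cat  small_groups.cat%type;
  l_sum  number := 0;
begin
  for r in (select id, cat, grpsize
            from   small_groups
            order  by cat, lev desc)
  loop
    -- category changed: flush whatever is pending (it stays below the threshold)
    if l_cat is not null and r.cat <> l_cat and l_ids.count > 0 then
      for i in 1 .. l_ids.count loop
        pipe row (grp_row_t(l_ids(i), l_cat, l_ids(l_ids.count)));
      end loop;
      l_ids := id_tab();
      l_sum := 0;
    end if;
    l_cat := r.cat;
    l_ids.extend;
    l_ids(l_ids.count) := r.id;
    l_sum := l_sum + r.grpsize;
    -- threshold reached: emit the buffered rows, using the last id as final_grp
    if l_sum >= p_threshold then
      for i in 1 .. l_ids.count loop
        pipe row (grp_row_t(l_ids(i), l_cat, l_ids(l_ids.count)));
      end loop;
      l_ids := id_tab();
      l_sum := 0;
    end if;
  end loop;
  -- a trailing under-threshold group keeps its own last id as final_grp
  for i in 1 .. l_ids.count loop
    pipe row (grp_row_t(l_ids(i), l_cat, l_ids(l_ids.count)));
  end loop;
  return;
end amalgamate_groups;
/

It could then be queried (or used to drive an update) with something like select * from table(amalgamate_groups(10));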

For #2 you can't update each row until you know what group the row belongs to. That means you are processing each row at least twice: once to get its data and again to update it with its group number.

That can cause a BIG difference in performance depending on the number of rows involved and how often you have to do the processing.

So to know the 'best' solution we need to know EVERYTHING about that data.

1. are the ids ALWAYS sequential, with no gaps, within each category?

2. are the levels ALWAYS sequential, with no gaps, within each category?

If so, there are some optimizations that can be implemented when the set of IDs for a group can be computed just by knowing the start/end ID and the number of rows in the group. The start/end IDs and the row count can be computed during the scan.

So for use cases that involve large numbers of rows it can be more effective to use a 'mini' batch process (see the sketch after this list):

1. use a temp table that holds the id and computed group number

2. use either of the above two processes to compute the group number for each id - storing the ids and group numbers in the temp table. If ids/levels are sequential (see last questions above) an optimization is to just store the start/end id and the number of rows in the group along with the group number

3. use the data in the temp table to update/supplement the original source data.
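A minimal sketch of that shape, assuming a global temporary table and assuming the chosen row-level query has been wrapped in a view (the names grp_map and amalgamated_v are made up for illustration; not tested):

-- step 1: a temp table holding the id and the computed group number
create global temporary table grp_map
( id        number(3)
, final_grp number(3)
) on commit preserve rows;

-- step 2: populate it from whichever query computes the group numbers
insert into grp_map (id, final_grp)
select id, final_grp
from   amalgamated_v;

-- step 3: apply the group numbers back to the source data
merge into small_groups sg
using grp_map gm
on    (sg.id = gm.id)
when matched then update
set   sg.final_grp = gm.final_grp;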

It ALL hinges on two things:

1. the true nature of the data

2. the need for real-time versus summary info - that is how often the source data changes

JonWat Nov 3 2016

I should have said: I'm still using 11, so the match_recognize solution won't work for me right now. Something to look forward to.

Yes, as you guessed, lev was the ordering variable; the ordering of the IDs was just an artifact of my test data, and they won't be that way in real life.

And to answer some of rp0428's questions...

Looking for solution number 2.

     2. Same number of rows as the input but with each row assigned an 'amalgamated group' number.

Ultimately there is a "merge" happening, but not at this stage. This table just contains a set of attributes and the group that any rows with those attributes will belong to. Ultimately the source data will be aggregated using those amalgamated groups.

This table is just a temporary table that will be used in the preparation of an aggregate data set from which historical reporting will be done, so updates are not a problem.

The IDs will generally not be ordered. The ordering variable (lev in the example) may have gaps. In the real world, it's the number of years that people have been receiving services, so you could have 20 people who have been receiving services for five years, none for six years, and a couple for seven years who have to be merged back into the five-year group.

The table is not large -- about 700 rows. These rows define the "cells" on which reporting will be based. The problem is that some of these cells end up with just a couple of people in them, and for protection of personal information they can't exist in the final data set. Merging with other cells is a preferable solution to suppressing the cell entirely.

Thanks for the help.

mathguy Nov 3 2016

One runs into a similar need (to group smaller groups together or merge them into larger groups) when one does statistical analysis. The first thing you do is classify data points by one or more characteristics; but if some of the samples are too small, you need to either discard them (if they are too different from anything else) or combine them together. Something we advise against in the classroom, but in real life sometimes you don't have many other options.

BrendanP Nov 3 2016 — edited on Nov 3 2016

By coincidence I was looking at a similar type of problem, Activities and breaks, over the weekend, with a view to using it as a demo for some performance benchmarking code that I am going to publish this weekend. I have four solutions, two of which correspond to the recursive SQL and the MATCH_RECOGNIZE posted above - but slightly simpler, because the problem is simpler and doesn't need the running sums. I think my performance analysis would also apply to the problem here.

The recursive query solution is very slow for larger problems, and in fact timing grows quadratically in the number of records. You may see why from the following execution plan:

Plan hash value: 167333996

-------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name         | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                            |              |      1 |        |     20 |00:00:00.01 |      77 |       |       |          |
|   1 |  SORT ORDER BY                              |              |      1 |     24 |     20 |00:00:00.01 |      77 |  2048 |  2048 | 2048  (0)|
|   2 |   WINDOW SORT                               |              |      1 |     24 |     20 |00:00:00.01 |      77 |  2048 |  2048 | 2048  (0)|
|   3 |    VIEW                                     |              |      1 |     24 |     20 |00:00:00.01 |      77 |       |       |          |
|   4 |     UNION ALL (RECURSIVE WITH) BREADTH FIRST|              |      1 |        |     20 |00:00:00.01 |      77 |       |       |          |
|*  5 |      VIEW                                   |              |      1 |     20 |      2 |00:00:00.01 |       7 |       |       |          |
|*  6 |       WINDOW SORT PUSHED RANK               |              |      1 |     20 |      4 |00:00:00.01 |       7 |  2048 |  2048 | 2048  (0)|
|   7 |        TABLE ACCESS FULL                    | SMALL_GROUPS |      1 |     20 |     20 |00:00:00.01 |       7 |       |       |          |
|*  8 |      HASH JOIN                              |              |     10 |      4 |     18 |00:00:00.01 |      70 |   947K|   947K|  532K (0)|
|   9 |       RECURSIVE WITH PUMP                   |              |     10 |        |     20 |00:00:00.01 |       0 |       |       |          |
|  10 |       VIEW                                  |              |     10 |     20 |    200 |00:00:00.01 |      70 |       |       |          |
|  11 |        WINDOW SORT                          |              |     10 |     20 |    200 |00:00:00.01 |      70 |  2048 |  2048 | 2048  (0)|
|  12 |         TABLE ACCESS FULL                   | SMALL_GROUPS |     10 |     20 |    200 |00:00:00.01 |      70 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - filter("RN"=1)
   6 - filter(ROW_NUMBER() OVER ( PARTITION BY "CAT" ORDER BY INTERNAL_FUNCTION("LEV") DESC )<=1)
   8 - access("SGP"."RN"="RSQ"."RN"+1 AND "SGP"."CAT"="RSQ"."CAT")

Note
-----
   - dynamic sampling used for this statement (level=2)

I found a big improvement that removes the quadratic term by writing the first subquery out to a temporary table with an index, then querying against that. However, it is still slower (on my test data) than a solution using the model clause. I have not performance tested the query in this thread, just used the small data set supplied, but you may still get the impression that it would be faster from the execution plan:

SELECT /*+ MOD_QRY gather_plan_statistics */
       id,
       cat,
       lev,
       grpsize,
       sub_grpsize,
       final_grp
  FROM small_groups
 MODEL
    PARTITION BY (cat)
    DIMENSION BY (Row_Number() OVER (PARTITION BY cat ORDER BY lev DESC) rn)
    MEASURES (id, grpsize, grpsize sub_grpsize, id final_grp, lev)
    RULES AUTOMATIC ORDER (
       sub_grpsize[rn > 1] = CASE WHEN sub_grpsize[cv()-1] >= :DEPTH THEN grpsize[cv()]
                                  ELSE sub_grpsize[cv()-1] + grpsize[cv()] END,
       final_grp[ANY] = PRESENTV (final_grp[cv()+1],
                                  CASE WHEN sub_grpsize[cv()] >= :DEPTH THEN id[cv()]
                                       ELSE final_grp[cv()+1] END,
                                  id[cv()])
    )
ORDER BY 1, 2, 3

Plan hash value: 2629703372

--------------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name         | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |              |      1 |        |     20 |00:00:00.01 |       7 |       |       |          |
|   1 |  SORT ORDER BY       |              |      1 |     20 |     20 |00:00:00.01 |       7 |  2048 |  2048 | 2048  (0)|
|   2 |   SQL MODEL CYCLIC   |              |      1 |     20 |     20 |00:00:00.01 |       7 |   763K|   763K|  545K (0)|
|   3 |    WINDOW SORT       |              |      1 |     20 |     20 |00:00:00.01 |       7 |  2048 |  2048 | 2048  (0)|
|   4 |     TABLE ACCESS FULL| SMALL_GROUPS |      1 |     20 |     20 |00:00:00.01 |       7 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------

Note
-----
   - dynamic sampling used for this statement (level=2)

Match Recognize was much faster than the other three solutions in my performance testing.
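For what it's worth, the "write the first subquery out to a temporary table with an index" improvement mentioned above might look roughly like this (a sketch only, shown with an ordinary staging table for brevity; the names prep_tmp and prep_tmp_ix are made up, the recursive part is unchanged from the accepted answer apart from reading the staged table, and BrendanP's actual implementation may differ):

create table prep_tmp as
select id, cat, lev, grpsize,
       row_number() over (partition by cat order by lev desc) as rn
from   small_groups;

create index prep_tmp_ix on prep_tmp (cat, rn);

with
    rec ( id, cat, lev, grpsize, rn, c_size, gp ) as (
      select  id, cat, lev, grpsize, rn, grpsize, 1
        from  prep_tmp
        where rn = 1
      union all
      select  p.id, p.cat, p.lev, p.grpsize, p.rn,
              p.grpsize + case when r.c_size >= 10 then 0 else r.c_size end,
              r.gp + case when r.c_size >= 10 then 1 else 0 end
      from    rec r join prep_tmp p on r.cat = p.cat and r.rn + 1 = p.rn
    )
select id, cat, lev, grpsize, c_size,
       first_value(id) over (partition by cat, gp order by lev) as final_grp
from   rec
order by cat, final_grp, id;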

BrendanP Nov 3 2016

Hmm, pre tags to preserve formatting don't seem to work great.
