4 Replies Latest reply on Oct 3, 2012 7:07 AM by Kim Berg Hansen

    Group by groups

    Kim Berg Hansen
      I have a bunch of data that I need to transform and "group by groups." Let me explain with an example:
      SQL> create table orig_data as
        2  select distinct job, deptno
        3  from scott.emp e
        4  /
      
      Table created.
      
      SQL> select job
        2       , deptno
        3    from orig_data
        4   order by
        5         job
        6       , deptno
        7  /
      
      JOB           DEPTNO
      --------- ----------
      ANALYST           20
      CLERK             10
      CLERK             20
      CLERK             30
      MANAGER           10
      MANAGER           20
      MANAGER           30
      PRESIDENT         10
      SALESMAN          30
      
      9 rows selected.
      The real-world data is about 5 million rows.

      First I group by job (I use xmlagg here because I am on version 11.1, which has no listagg ;-) ):
      SQL> select od.job
        2       , rtrim(xmlagg(xmlelement(d,od.deptno,',').extract('//text()') order by od.deptno),',') deptnos
        3    from orig_data od
        4   group by od.job
        5  /
      
      JOB       DEPTNOS
      --------- ------------------------------
      ANALYST   20
      CLERK     10,20,30
      MANAGER   10,20,30
      PRESIDENT 10
      SALESMAN  30
      I notice here that both jobs CLERK and MANAGER have the same set of deptnos.
      So if I group by deptnos I can get this result:
      SQL> select s2.deptnos
        2       , rtrim(xmlagg(xmlelement(j,s2.job,',').extract('//text()') order by s2.job),',') jobs
        3    from (
        4     select od.job
        5          , rtrim(xmlagg(xmlelement(d,od.deptno,',').extract('//text()') order by od.deptno),',') deptnos
        6       from orig_data od
        7      group by od.job
        8         ) s2
        9   group by s2.deptnos
       10  /
      
      DEPTNOS                        JOBS
      ------------------------------ ------------------------------
      10                             PRESIDENT
      10,20,30                       CLERK,MANAGER
      20                             ANALYST
      30                             SALESMAN
      My requirement is to identify all such unique groups of deptnos in my orig_data table, give each such group a surrogate key in a parent table, and then populate two child tables with the deptnos of each group and the jobs that have that group of deptnos:
      SQL> create table groups (
        2     groupkey number primary key
        3  )
        4  /
      
      Table created.
      
      SQL> create table groups_depts (
        2     groupkey number references groups (groupkey)
        3   , deptno number(2)
        4  )
        5  /
      
      Table created.
      
      SQL> create table groups_jobs (
        2     groupkey number references groups (groupkey)
        3   , job varchar2(9)
        4  )
        5  /
      
      Table created.
      For the surrogate groupkey I can just use row_number() on my group-by-deptnos query:
      SQL> select row_number() over (order by s2.deptnos) groupkey
        2       , s2.deptnos
        3       , rtrim(xmlagg(xmlelement(j,s2.job,',').extract('//text()') order by s2.job),',') jobs
        4    from (
        5     select od.job
        6          , rtrim(xmlagg(xmlelement(d,od.deptno,',').extract('//text()') order by od.deptno),',') deptnos
        7       from orig_data od
        8      group by od.job
        9         ) s2
       10   group by s2.deptnos
       11  /
      
        GROUPKEY DEPTNOS                        JOBS
      ---------- ------------------------------ ------------------------------
               1 10                             PRESIDENT
               2 10,20,30                       CLERK,MANAGER
               3 20                             ANALYST
               4 30                             SALESMAN
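      The same two-level grouping is easy to sketch outside the database - here is a minimal Python illustration of the logic, using the demo rows above (not the real conversion code):

```python
from collections import defaultdict

# The (job, deptno) rows from orig_data above.
rows = [("ANALYST", 20), ("CLERK", 10), ("CLERK", 20), ("CLERK", 30),
        ("MANAGER", 10), ("MANAGER", 20), ("MANAGER", 30),
        ("PRESIDENT", 10), ("SALESMAN", 30)]

# Step 1: collect the set of deptnos per job.
deptnos_by_job = defaultdict(set)
for job, deptno in rows:
    deptnos_by_job[job].add(deptno)

# Step 2: group jobs by their sorted tuple of deptnos
# (the tuple plays the role of the aggregated string).
jobs_by_deptnos = defaultdict(list)
for job, deptnos in deptnos_by_job.items():
    jobs_by_deptnos[tuple(sorted(deptnos))].append(job)

# Step 3: number the distinct deptno sets - the surrogate groupkey.
groupkeys = {deptnos: key + 1
             for key, deptnos in enumerate(sorted(jobs_by_deptnos))}
```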
      That query I can use for a (slow) insert into my three tables in this simple manner:
      SQL> begin
        2     for g in (
        3        select row_number() over (order by s2.deptnos) groupkey
        4             , s2.deptnos
        5             , rtrim(xmlagg(xmlelement(j,s2.job,',').extract('//text()') order by s2.job),',') jobs
        6          from (
        7           select od.job
        8                , rtrim(xmlagg(xmlelement(d,od.deptno,',').extract('//text()') order by od.deptno),',') deptnos
        9             from orig_data od
       10            group by od.job
       11               ) s2
       12         group by s2.deptnos
       13     ) loop
       14        insert into groups values (g.groupkey);
       15
       16        insert into groups_depts
       17           select g.groupkey
       18                , to_number(regexp_substr(str, '[^,]+', 1, level)) deptno
       19             from (
       20                    select rownum id
       21                         , g.deptnos str
       22                      from dual
       23                  )
       24           connect by instr(str, ',', 1, level-1) > 0
       25                  and id = prior id
       26                  and prior dbms_random.value is not null;
       27
       28        insert into groups_jobs
       29           select g.groupkey
       30                , regexp_substr(str, '[^,]+', 1, level) job
       31             from (
       32                    select rownum id
       33                         , g.jobs str
       34                      from dual
       35                  )
       36           connect by instr(str, ',', 1, level-1) > 0
       37                  and id = prior id
       38                  and prior dbms_random.value is not null;
       39
       40     end loop;
       41  end;
       42  /
      
      PL/SQL procedure successfully completed.
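      The loop body boils down to one parent row plus one child row per delimited element. A small Python sketch of that explode step (a hypothetical helper, illustrating what the regexp_substr/connect by trick generates per group):

```python
def explode_group(groupkey, deptnos_csv, jobs_csv):
    # One parent row, plus one child row per comma-separated element -
    # the same rows the regexp_substr/connect by splitting produces in SQL.
    parent_row = (groupkey,)
    dept_rows = [(groupkey, int(d)) for d in deptnos_csv.split(",")]
    job_rows = [(groupkey, job) for job in jobs_csv.split(",")]
    return parent_row, dept_rows, job_rows
```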
      The tables now contain this data:
      SQL> select *
        2    from groups
        3   order by groupkey
        4  /
      
        GROUPKEY
      ----------
               1
               2
               3
               4
      
      SQL> select *
        2    from groups_depts
        3   order by groupkey, deptno
        4  /
      
        GROUPKEY     DEPTNO
      ---------- ----------
               1         10
               2         10
               2         20
               2         30
               3         20
               4         30
      
      6 rows selected.
      
      SQL> select *
        2    from groups_jobs
        3   order by groupkey, job
        4  /
      
        GROUPKEY JOB
      ---------- ---------
               1 PRESIDENT
               2 CLERK
               2 MANAGER
               3 ANALYST
               4 SALESMAN
      From these data I can now reproduce the same result as before (just to verify that I have created the desired data):
      SQL> select g.groupkey
        2       , d.deptnos
        3       , j.jobs
        4    from groups g
        5    join (
        6           select groupkey
        7                , rtrim(xmlagg(xmlelement(d,deptno,',').extract('//text()') order by deptno),',') deptnos
        8             from groups_depts
        9            group by groupkey
       10         ) d
       11         on d.groupkey = g.groupkey
       12    join (
       13           select groupkey
       14                , rtrim(xmlagg(xmlelement(j,job,',').extract('//text()') order by job),',') jobs
       15             from groups_jobs
       16            group by groupkey
       17         ) j
       18         on j.groupkey = g.groupkey
       19  /
      
        GROUPKEY DEPTNOS                        JOBS
      ---------- ------------------------------ ------------------------------
               1 10                             PRESIDENT
               2 10,20,30                       CLERK,MANAGER
               3 20                             ANALYST
               4 30                             SALESMAN
      So far so good. This all works pretty much as desired - except for a couple of things:

      The very simple loop insert code will be slow. OK, it is a one-time conversion job (in theory; at least it will run only very few times), so that could probably be acceptable (except for my professional pride ;-) .)

      But worse, I have groups where the string aggregation won't work - the aggregated string would have to be about varchar2(10000), which exceeds what SQL allows in the group by :-( .

      So I have tried an attempt using collections. First a collection of deptnos:
      SQL> create type deptno_tab_type as table of number(2)
        2  /
      
      Type created.
      
      SQL> select od.job
        2       , cast(collect(od.deptno order by od.deptno) as deptno_tab_type) deptnos
        3    from orig_data od
        4   group by od.job
        5  /
      
      JOB       DEPTNOS
      --------- ------------------------------
      ANALYST   DEPTNO_TAB_TYPE(20)
      CLERK     DEPTNO_TAB_TYPE(10, 20, 30)
      MANAGER   DEPTNO_TAB_TYPE(10, 20, 30)
      PRESIDENT DEPTNO_TAB_TYPE(10)
      SALESMAN  DEPTNO_TAB_TYPE(30)
      All very good - no problems here. But then a collection of jobs:
      SQL> create type job_tab_type as table of varchar2(9)
        2  /
      
      Type created.
      
      SQL> select s2.deptnos
        2       , cast(collect(s2.job order by s2.job) as job_tab_type) jobs
        3    from (
        4     select od.job
        5          , cast(collect(od.deptno order by od.deptno) as deptno_tab_type) deptnos
        6       from orig_data od
        7      group by od.job
        8         ) s2
        9   group by s2.deptnos
       10  /
       group by s2.deptnos
                *
      ERROR at line 9:
      ORA-00932: inconsistent datatypes: expected -, got XAL_SUPERVISOR.DEPTNO_TAB_TYPE
      Now it fails - I cannot group by a collection datatype...
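      For contrast, the operation itself is well defined - in a language where a collection value can serve as a grouping key, it is a one-liner. A Python illustration using frozenset as the key (just to show the concept, not a SQL workaround):

```python
from collections import defaultdict

# Deptno sets per job, as produced by the collect() query above.
deptnos_by_job = {"ANALYST":   frozenset({20}),
                  "CLERK":     frozenset({10, 20, 30}),
                  "MANAGER":   frozenset({10, 20, 30}),
                  "PRESIDENT": frozenset({10}),
                  "SALESMAN":  frozenset({30})}

# frozenset is immutable and hashable, so it can be used directly
# as a dictionary key - i.e. we can "group by" a collection value.
jobs_by_deptnos = defaultdict(list)
for job, deptnos in deptnos_by_job.items():
    jobs_by_deptnos[deptnos].append(job)
```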

      I am not asking anyone to write my code, but I know there are sharper brains out there on the forums ;-) .

      Would anyone have an idea of something I might try that will allow me to create these "groups of groups" even for larger groups than string aggregation techniques can handle?

      Thanks for any help, hints or tips ;-)
        • 1. Re: Group by groups
          odie_63
          The "group-by-collection" issue can be solved by creating a container object on which we define an ORDER method:
          SQL> create type deptno_container as object (
            2    nt deptno_tab_type
            3  , order member function match (o deptno_container) return integer
            4  );
            5  /
           
          Type created
           
          SQL> create or replace type body deptno_container as
            2    order member function match (o deptno_container) return integer is
            3    begin
            4      return case when nt = o.nt then 0 else 1 end;
            5    end;
            6  end;
            7  /
           
          Type body created
           
          Then a multitable INSERT can do the job, after unnesting the collections:
          SQL> insert all
            2    when rn0 = 1 then into groups (groupkey) values (gid)
            3    when rn1 = 1 then into groups_jobs (groupkey, job) values(gid, job)
            4    when rn2 = 1 then into groups_depts (groupkey, deptno) values(gid, deptno)
            5  with all_groups as (
            6    select s2.deptnos
            7         , cast(collect(s2.job order by s2.job) as job_tab_type) jobs
            8         , row_number() over(order by null) gid
            9    from (
           10      select od.job
           11           , deptno_container(
           12               cast(collect(od.deptno order by od.deptno) as deptno_tab_type)
           13             ) deptnos
           14      from orig_data od
           15      group by od.job
           16    ) s2
           17    group by s2.deptnos
           18  )
           19  select gid
           20       , value(j) job
           21       , value(d) deptno
           22       , row_number() over(partition by gid order by null) rn0
           23       , row_number() over(partition by gid, value(j) order by null) rn1
           24       , row_number() over(partition by gid, value(d) order by null) rn2
           25  from all_groups t
           26     , table(t.jobs) j
           27     , table(t.deptnos.nt) d
           28  ;
           
          15 rows inserted
           
          SQL> select * from groups;
           
            GROUPKEY
          ----------
                   1
                   2
                   3
                   4
           
          SQL> select * from groups_jobs;
           
            GROUPKEY JOB
          ---------- ---------
                   1 SALESMAN
                   2 PRESIDENT
                   3 CLERK
                   3 MANAGER
                   4 ANALYST
           
          SQL> select * from groups_depts;
           
            GROUPKEY DEPTNO
          ---------- ------
                   1     30
                   2     10
                   3     10
                   3     30
                   3     20
                   4     20
           
          6 rows selected
           
          Works great on the sample data but how this approach scales on a much (much) larger dataset is another story :)
          • 2. Re: Group by groups
            Kim Berg Hansen
            Thanks, odie, it worked ;-)

            I had assumed I needed a MAP member function rather than an ORDER member function, but it does work with the ORDER method.

            It might not be the absolutely most efficient, but it does calculate my groups of groups and inserts about 8 million rows in total into the three tables in about 1½ hours - which is quite good enough for my purposes.
            • 3. Re: Group by groups
              odie_63
              I think I was assuming I needed a MAP member function rather than an ORDER member function.
              We can also do it with a MAP method, but we would need an additional function to generate a unique hash value from the collection content.
              I find using the ORDER method handier since we can directly use the built-in comparison operator on collections.
              • 4. Re: Group by groups
                Kim Berg Hansen
                Just a followup:

                 When digging into the created data, I unfortunately discovered that multiple "identical groups" had been created. It seems the group by on the object type didn't quite work out after all.

                 After some testing, it looks like when the group by is performed on this larger set of data (maybe particularly since those data are the result of a couple of hash joins and therefore very unsorted?), it fails when the order member method always returns either 0 or 1. It seems some part of the group by algorithm depends not just on "equal or not equal" but actually needs -1, 0 and +1 for "less than, equal and greater than." I think it doesn't like that comparing two object instances returns +1 no matter which operand comes first.

                 I would have liked a MAP method, but I cannot think of a mapping for which I can ensure uniqueness (hashing will never guarantee it) that still fits into a scalar value that is not a LOB. The minimum I could find for my data would need something like a RAW(4300). Perhaps compressed data, maybe?

                 What I did instead was write a different order member method:
                create or replace type body deptno_container as
                  order member function match (o deptno_container) return integer is
                  begin
                    for i in 1..nt.count loop
                      if i > o.nt.count then
                         return 1;
                      elsif nt(i) > o.nt(i) then
                         return 1;
                      elsif nt(i) < o.nt(i) then
                         return -1;
                      end if;
                    end loop;
                    if nt.count < o.nt.count then
                      return -1;
                    else
                      return 0;
                    end if;
                  end;
                end;
                /
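                 The comparator above is a plain lexicographic ordering of the two nested tables. For reference, here is the same logic in Python (assuming non-null elements, as in the deptno data), which can be checked against the language's built-in sequence ordering:

```python
def match(a, b):
    # Lexicographic compare of two sequences, mirroring the
    # order member function above: returns -1, 0 or +1.
    for i in range(len(a)):
        if i >= len(b):
            return 1            # b is a strict prefix of a
        if a[i] > b[i]:
            return 1
        if a[i] < b[i]:
            return -1
    # a is exhausted: shorter means smaller, same length means equal.
    return -1 if len(a) < len(b) else 0
```

                 Because every pair of distinct values now compares consistently in both directions, the sort underlying the group by can rely on it.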
                It does increase the sort time but the whole process didn't slow by more than maybe 5%.

                I am now checking and re-checking the data whether it did produce unique groups this time - it looks like it so far ;-)