Forum Stats

  • 3,751,001 Users
  • 2,250,294 Discussions
  • 7,867,247 Comments

Discussions

Filtered hierarchical data - previous available ancestor?

paul zip
paul zip Member Posts: 26
edited Jul 10, 2013 2:23PM in SQL & PL/SQL

I'm trying to come up with a solution to a problem relating to hierarchical data and the concept of nearest available ancestor - my platform is 10gR2 and 11gR2.  I am given some hierarchical data (in its simplest form it is in form ID, NAME, PARENT_ID, where PARENT_ID links back to ID).

For example:

with qryData as (

  select 1 as ID, 'Bert' as NAME, to_number(null) as PARENT_ID from dual union

  select 2 as ID, 'Mark' ,   1 from dual union

  select 3 as ID, 'Brenda' , 1 from dual union

  select 4 as ID, 'Mike' ,   3 from dual union

  select 5 as ID, 'Steve' ,  4 from dual union

  select 6 as ID, 'John' ,   2 from dual union

  select 7 as ID, 'Jo' ,     6 from dual union

  select 8 as ID, 'Jim' ,    2 from dual union

  select 9 as ID, 'Jane' ,   7 from dual           

)

select q.*, sys_connect_by_path(ID, '/') ID_PATH

from qryData q

start with parent_id is null

connect by prior ID = PARENT_ID

/

        ID NAME    PARENT_ID ID_PATH

---------- ------ ---------- ----------------

         1 Bert              /1

         2 Mark            1 /1/2

         6 John            2 /1/2/6

         7 Jo              6 /1/2/6/7

         9 Jane            7 /1/2/6/7/9

         8 Jim             2 /1/2/8

         3 Brenda          1 /1/3

         4 Mike            3 /1/3/4

         5 Steve           4 /1/3/4/5

In reality this dataset can be several thousand rows with tens of nesting levels, multiple start nodes with no parent but most importantly often filtered, so certain arbitrary rows are missing.  When this happens I need to find the nearest available ancestor that appears in the list. "Nearest available ancestor" is closest ancestor relation in your family tree that still exists in the filtered list found starting from parent, then contuniing upwards to grandparent, great-grandparent .... nth great grandparent until one ancestor is found.


For example:

        ID NAME    PARENT_ID ID_PATH       AVAIL_ANCESTOR_ID

---------- ------ ---------- ------------- -----------------

         1 Bert              /1             

         6 John            2 /1/2/6        1

         8 Jim             2 /1/2/8        1

         9 Jane            7 /1/2/6/7/9    6

         3 Brenda          1 /1/3          1

         5 Steve           4 /1/3/4/5      3

For example.  Who is Steve's nearest available ancestor in the filtered list?

Steve's family tree is : /Bert/Brenda/Mike/Steve

Mike (ID = 4) isn't in the filtered list, so we move onto Brenda, who is.  Brenda, ID = 3, So ANCESTOR_ID = 3

I do have access to the original table, so can join the filtered list back to the full table, I can also ask for other columns (such as ROWID, ROWNUM, LEVEL, SYS_CONNECT_ROOT) to be included in the filtered set.  I've tried various approaches to achieve this, but all seem to be quite poor in performance (ID and PARENT_ID columns are indexed where appropriate) or don't quite give the correct result, such as...

1. Parsing the ID_PATH and treewalking based on that and joining back to the list...

  select distinct CUSTOMER_ID, regexp_substr( ID_PATH, '[^/]+', 1, level) as ANCESTOR_ID

  from qryData

  connect by regexp_substr( ID_PATH, '[^/]+', 1, level)  is not null

2. Anti-joining back to full list to identify missing items , then treewalking for those.

3. Writing a function which treewalks back.

It's quite a challenging problem, so can anyone think of a solution which performs well?  Ideally I'm trying to find a SQL based solution?

Best Answer

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 40,900 Red Diamond
    Accepted Answer

    Hi,

    paulzip wrote:
    
    This is a clever solution!  The only problem I have is I wouldn't be able to generate "good_id_path" as part of the filtered list, as filtering is not done by me and is done after the id_path is generated.  I'd have to join filtered list back to the full list to generate it, unless there's a tweak to do it?
     

    I see; you're given connect_by_results (with id_path), and that's all you have to work with, besides the original table.

    Here's one way:

    SELECT   d.id, d.name, d.parent_id, d.id_path

    ,        MIN (a.id) KEEP (DENSE_RANK LAST ORDER BY a.id_path) AS avail_ancestor_id

    FROM          connect_by_results  d

    LEFT OUTER JOIN  connect_by_results  a  ON  d.id_path  LIKE a.id_path || '/%'

    GROUP BY  d.id, d.name, d.parent_id, d.id_path

    ORDER BY  d.id_path

    ;

    The fastest way depends on lots of things, some of which can't be known until run-time.  The query above might be best if the number of rows in the result set is small, but you don't know anything else about them (e.g., whether they all have a common ancestor).

«1

Answers

  • BrendanP
    BrendanP Member Posts: 383 Bronze Badge

    Looks like you haven't got a well-defined problem as what you're looking for, 'nearest available ancestor' doesn't mean anything until you specify it, and your example doesn't do that.

  • Solomon Yakobson
    Solomon Yakobson Member Posts: 18,665 Black Diamond

    You just need to traverse the hierarchy from leaf to root:

    with qryData as (

                     select 1 as ID, 'Bert' as NAME, to_number(null) as PARENT_ID from dual union

                     select 2 as ID, 'Mark' ,   1 from dual union

                     select 3 as ID, 'Brenda' , 1 from dual union

                     select 4 as ID, 'Mike' ,   3 from dual union

                     select 5 as ID, 'Steve' ,  4 from dual union

                     select 6 as ID, 'John' ,   2 from dual union

                     select 7 as ID, 'Jo' ,     6 from dual union

                     select 8 as ID, 'Jim' ,    2 from dual union

                     select 9 as ID, 'Jane' ,   7 from dual

                    )

    select  connect_by_root id id,

            id ancestor_id

      from  qryData

      connect by ID = prior PARENT_ID

      order by id,

               ancestor_id

    /


    ID ANCESTOR_ID
    --- -----------
      1           1
      2           1
      2           2
      3           1
      3           3
      4           1
      4           3
      4           4
      5           1
      5           3
      5           4

    ID ANCESTOR_ID
    --- -----------
      5           5
      6           1
      6           2
      6           6
      7           1
      7           2
      7           6
      7           7
      8           1
      8           2
      8           8

    ID ANCESTOR_ID
    --- -----------
      9           1
      9           2
      9           6
      9           7
      9           9

    27 rows selected.

    SQL>

    SY.

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 40,900 Red Diamond
    edited Jul 10, 2013 10:37AM

    Hi,

    So you need to preserve the tree structure (in things like id_path) after you remove some of the rows from the output, is that the problem?
    Here's one way to do that:

    WITH connect_by_results AS

    (

        SELECT  q.*

        ,       SYS_CONNECT_BY_PATH (id, '/')   AS id_path

        ,       SYS_CONNECT_BY_PATH ( CASE

                                          WHEN id IN (1, 3, 5, 6, 8, 9) -- or whatever

                                          THEN id

                                      END

                                    , '/'

                                    )           AS good_id_path

        ,       ROWNUM                          AS r_num

        FROM    qryData  q

        START WITH  parent_id  IS NULL

        CONNECT BY  PRIOR id   = parent_id

    )

    SELECT    id, name, parent_id, id_path

    ,         REGEXP_SUBSTR ( RTRIM ( SUBSTR ( good_id_path

                                             , 1

                                             , INSTR (good_id_path, '/', -1)

                                             )

                                    , '/'

                                    )

                            , '[^/]+$'

                            )        AS avail_ancestor_id

    ,         good_id_path               -- Only for debugging

    FROM      connect_by_results

    WHERE     SUBSTR (good_id_path, -1) != '/'

    ORDER BY  r_num

    ;

    This doesn't involve a second CONNECT BY, a self-join, or any analytic functions, all of which can be slow.

    Whatever criteia determine whether a row is in the result set or not are put in the CASE expression that produces good_id_path, and not in a WHERE clause, so good_id_path will include only ids from the final result set.   If good_id_path does not end with a '/', then it will be included.  Finding the available_ancestor_id is simply a matter of finding the next-to-last id (if any) in good_id_path.

    Output:

                                   AVAIL_

               PARENT              ANCESTOR

    ID NAME      _ID ID_PATH      _ID        GOOD_ID_PATH

    --- ------ ------ ------------ ---------- ------------

      1 Bert          /1                      /1

      6 John        2 /1/2/6       1          /1//6

      9 Jane        7 /1/2/6/7/9   6          /1//6//9

      8 Jim         2 /1/2/8       1          /1//8

      3 Brenda      1 /1/3         1          /1/3

      5 Steve       4 /1/3/4/5     3          /1/3//5

    Frank Kulash
  • paul zip
    paul zip Member Posts: 26

    "Nearest available ancestor" is closest ancestor relation in your family tree that exists in the list starting from parent, grandparent, great-grandparent .... nth great grandparent .


    For example.  Who is Steve's nearest available ancestor in the filtered list?

    Steve's family tree is : /Bert/Brenda/Mike/Steve

    Mike isn't in the list, so we move onto Brenda, who is.  Brenda, ID = 3, So ANCESTOR_ID = 3

  • Solomon Yakobson
    Solomon Yakobson Member Posts: 18,665 Black Diamond

    It looks like I didn't understand the requirements. Based on

            ID NAME    PARENT_ID ID_PATH       AVAIL_ANCESTOR_ID

    ---------- ------ ---------- ------------- -----------------

             1 Bert              /1             

             6 John            2 /1/2/6        1

             8 Jim             2 /1/2/8        1

             9 Jane            7 /1/2/6/7/9    6

             3 Brenda          1 /1/3          1

             5 Steve           4 /1/3/4/5      3

    You want a grandparent id, or parent id if there is no grandparent. If so:

    with qryData as (

                     select 1 as ID, 'Bert' as NAME, to_number(null) as PARENT_ID from dual union

                     select 2 as ID, 'Mark' ,   1 from dual union

                     select 3 as ID, 'Brenda' , 1 from dual union

                     select 4 as ID, 'Mike' ,   3 from dual union

                     select 5 as ID, 'Steve' ,  4 from dual union

                     select 6 as ID, 'John' ,   2 from dual union

                     select 7 as ID, 'Jo' ,     6 from dual union

                     select 8 as ID, 'Jim' ,    2 from dual union

                     select 9 as ID, 'Jane' ,   7 from dual

                    )

    select  q.*,

            sys_connect_by_path(ID, '/') ID_PATH,

            nvl(prior parent_id,parent_id) AVAIL_ANCESTOR_ID

      from  qryData q

      where id IN (1,3,5,6,8,9)

      start with parent_id is null

      connect by prior ID = PARENT_ID

    /


    ID NAME    PARENT_ID ID_PATH               AVAIL_ANCESTOR_ID
    --- ------ ---------- -------------------- ------------------
      1 Bert              /1
      6 John            2 /1/2/6                                1
      9 Jane            7 /1/2/6/7/9                            6
      8 Jim             2 /1/2/8                                1
      3 Brenda          1 /1/3                                  1
      5 Steve           4 /1/3/4/5                              3

    6 rows selected.

    SQL>

    SY.

  • BrendanP
    BrendanP Member Posts: 383 Bronze Badge
    edited Jul 10, 2013 1:28PM

    Well, perhaps I'm missing something obvious, but surely you don't have the family tree as such, only id, parent_id and have to deduce the tree, and as the parent of Steve doesn't exist, how do you know Brenda is the one? If you actually have the id_path - your qryData test subquery doesn't - then ok.

    Message was edited by: BrendanP Looks like OP has updated the subquery so it now does have the 'missing' ids, but still seems underspecified - where are the missing ids identified as such?

  • paul zip
    paul zip Member Posts: 26

    This is a clever solution!  The only problem I have is I wouldn't be able to generate "good_id_path" as part of the filtered list, as filtering is not done by me and is done after the id_path is generated.  I'd have to join filtered list back to the full list to generate it, unless there's a tweak to do it?

  • paul zip
    paul zip Member Posts: 26

    This only works for missing parents, I need to continue from parent => grandparent => great grandparent....... => great (x n) grandparent

    if I use "where id IN (1,3,5,8,9)" it says Jane's ANCESTOR_ID = 6 (John) but he isn't in the list, so it is incorrect.

  • Solomon Yakobson
    Solomon Yakobson Member Posts: 18,665 Black Diamond
    edited Jul 10, 2013 11:42AM

    OK. It looks like now I undestand your requirements. As I showed in my first reply, you can traverse hierarchy from leaf to root:

    with qryData as (

                     select 1 as ID, 'Bert' as NAME, to_number(null) as PARENT_ID from dual union

                     select 2 as ID, 'Mark' ,   1 from dual union

                     select 3 as ID, 'Brenda' , 1 from dual union

                     select 4 as ID, 'Mike' ,   3 from dual union

                     select 5 as ID, 'Steve' ,  4 from dual union

                     select 6 as ID, 'John' ,   2 from dual union

                     select 7 as ID, 'Jo' ,     6 from dual union

                     select 8 as ID, 'Jim' ,    2 from dual union

                     select 9 as ID, 'Jane' ,   7 from dual

                    ),

       ancestors as (

                     select  level lvl,

                             connect_by_root id id,

                             connect_by_root name name,

                             connect_by_root parent_id parent_id,

                             parent_id ancestor_id

                       from  qryData

                       start with id in (1,3,5,6,8,9)

                       connect by ID = prior PARENT_ID

                    )

    select  id,

            name,

            parent_id,

            min(ancestor_id) keep(dense_rank first order by lvl) avail_ancestor_id

      from  ancestors

      where ancestor_id in (1,3,5,6,8,9)

         or ancestor_id is null

      group by id,

               name,

               parent_id

      order by id,

               avail_ancestor_id

    /


    ID NAME    PARENT_ID  AVAIL_ANCESTOR_ID
    --- ------ ---------- ------------------
      1 Bert
      3 Brenda          1                  1
      5 Steve           4                  3
      6 John            2                  1
      8 Jim             2                  1
      9 Jane            7                  6

    6 rows selected.

    SQL>

    SY.

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 40,900 Red Diamond
    Accepted Answer

    Hi,

    paulzip wrote:
    
    This is a clever solution!  The only problem I have is I wouldn't be able to generate "good_id_path" as part of the filtered list, as filtering is not done by me and is done after the id_path is generated.  I'd have to join filtered list back to the full list to generate it, unless there's a tweak to do it?
     

    I see; you're given connect_by_results (with id_path), and that's all you have to work with, besides the original table.

    Here's one way:

    SELECT   d.id, d.name, d.parent_id, d.id_path

    ,        MIN (a.id) KEEP (DENSE_RANK LAST ORDER BY a.id_path) AS avail_ancestor_id

    FROM          connect_by_results  d

    LEFT OUTER JOIN  connect_by_results  a  ON  d.id_path  LIKE a.id_path || '/%'

    GROUP BY  d.id, d.name, d.parent_id, d.id_path

    ORDER BY  d.id_path

    ;

    The fastest way depends on lots of things, some of which can't be known until run-time.  The query above might be best if the number of rows in the result set is small, but you don't know anything else about them (e.g., whether they all have a common ancestor).

This discussion has been closed.