Skip to Main Content

SQL & PL/SQL

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Filtered hierarchical data - previous available ancestor?

paul zipJul 10 2013 — edited Jul 10 2013

I'm trying to come up with a solution to a problem relating to hierarchical data and the concept of nearest available ancestor - my platform is 10gR2 and 11gR2.  I am given some hierarchical data (in its simplest form it is in form ID, NAME, PARENT_ID, where PARENT_ID links back to ID).

For example:

with qryData as (

  select 1 as ID, 'Bert' as NAME, to_number(null) as PARENT_ID from dual union

  select 2 as ID, 'Mark' ,   1 from dual union

  select 3 as ID, 'Brenda' , 1 from dual union

  select 4 as ID, 'Mike' ,   3 from dual union

  select 5 as ID, 'Steve' ,  4 from dual union

  select 6 as ID, 'John' ,   2 from dual union

  select 7 as ID, 'Jo' ,     6 from dual union

  select 8 as ID, 'Jim' ,    2 from dual union

  select 9 as ID, 'Jane' ,   7 from dual           

)

select q.*, sys_connect_by_path(ID, '/') ID_PATH

from qryData q

start with parent_id is null

connect by prior ID = PARENT_ID

/

        ID NAME    PARENT_ID ID_PATH

---------- ------ ---------- ----------------

         1 Bert              /1

         2 Mark            1 /1/2

         6 John            2 /1/2/6

         7 Jo              6 /1/2/6/7

         9 Jane            7 /1/2/6/7/9

         8 Jim             2 /1/2/8

         3 Brenda          1 /1/3

         4 Mike            3 /1/3/4

         5 Steve           4 /1/3/4/5

In reality this dataset can be several thousand rows with tens of nesting levels, multiple start nodes with no parent but most importantly often filtered, so certain arbitrary rows are missing.  When this happens I need to find the nearest available ancestor that appears in the list. "Nearest available ancestor" is closest ancestor relation in your family tree that still exists in the filtered list found starting from parent, then contuniing upwards to grandparent, great-grandparent .... nth great grandparent until one ancestor is found.


For example:

        ID NAME    PARENT_ID ID_PATH       AVAIL_ANCESTOR_ID

---------- ------ ---------- ------------- -----------------

         1 Bert              /1             

         6 John            2 /1/2/6        1

         8 Jim             2 /1/2/8        1

         9 Jane            7 /1/2/6/7/9    6

         3 Brenda          1 /1/3          1

         5 Steve           4 /1/3/4/5      3

For example.  Who is Steve's nearest available ancestor in the filtered list?

Steve's family tree is : /Bert/Brenda/Mike/Steve

Mike (ID = 4) isn't in the filtered list, so we move onto Brenda, who is.  Brenda, ID = 3, So ANCESTOR_ID = 3

I do have access to the original table, so can join the filtered list back to the full table, I can also ask for other columns (such as ROWID, ROWNUM, LEVEL, SYS_CONNECT_ROOT) to be included in the filtered set.  I've tried various approaches to achieve this, but all seem to be quite poor in performance (ID and PARENT_ID columns are indexed where appropriate) or don't quite give the correct result, such as...

1. Parsing the ID_PATH and treewalking based on that and joining back to the list...

  select distinct CUSTOMER_ID, regexp_substr( ID_PATH, '[^/]+', 1, level) as ANCESTOR_ID

  from qryData

  connect by regexp_substr( ID_PATH, '[^/]+', 1, level)  is not null

2. Anti-joining back to full list to identify missing items , then treewalking for those.

3. Writing a function which treewalks back.

It's quite a challenging problem, so can anyone think of a solution which performs well?  Ideally I'm trying to find a SQL based solution?

This post has been answered by Frank Kulash on Jul 10 2013
Jump to Answer

Comments

BrendanP

Looks like you haven't got a well-defined problem as what you're looking for, 'nearest available ancestor' doesn't mean anything until you specify it, and your example doesn't do that.

Solomon Yakobson

You just need to traverse the hierarchy from leaf to root:

with qryData as (

                 select 1 as ID, 'Bert' as NAME, to_number(null) as PARENT_ID from dual union

                 select 2 as ID, 'Mark' ,   1 from dual union

                 select 3 as ID, 'Brenda' , 1 from dual union

                 select 4 as ID, 'Mike' ,   3 from dual union

                 select 5 as ID, 'Steve' ,  4 from dual union

                 select 6 as ID, 'John' ,   2 from dual union

                 select 7 as ID, 'Jo' ,     6 from dual union

                 select 8 as ID, 'Jim' ,    2 from dual union

                 select 9 as ID, 'Jane' ,   7 from dual

                )

select  connect_by_root id id,

        id ancestor_id

  from  qryData

  connect by ID = prior PARENT_ID

  order by id,

           ancestor_id

/


ID ANCESTOR_ID
--- -----------
  1           1
  2           1
  2           2
  3           1
  3           3
  4           1
  4           3
  4           4
  5           1
  5           3
  5           4

ID ANCESTOR_ID
--- -----------
  5           5
  6           1
  6           2
  6           6
  7           1
  7           2
  7           6
  7           7
  8           1
  8           2
  8           8

ID ANCESTOR_ID
--- -----------
  9           1
  9           2
  9           6
  9           7
  9           9

27 rows selected.

SQL>

SY.

Frank Kulash

Hi,

So you need to preserve the tree structure (in things like id_path) after you remove some of the rows from the output, is that the problem?
Here's one way to do that:

WITH connect_by_results AS

(

    SELECT  q.*

    ,       SYS_CONNECT_BY_PATH (id, '/')   AS id_path

    ,       SYS_CONNECT_BY_PATH ( CASE

                                      WHEN id IN (1, 3, 5, 6, 8, 9) -- or whatever

                                      THEN id

                                  END

                                , '/'

                                )           AS good_id_path

    ,       ROWNUM                          AS r_num

    FROM    qryData  q

    START WITH  parent_id  IS NULL

    CONNECT BY  PRIOR id   = parent_id

)

SELECT    id, name, parent_id, id_path

,         REGEXP_SUBSTR ( RTRIM ( SUBSTR ( good_id_path

                                         , 1

                                         , INSTR (good_id_path, '/', -1)

                                         )

                                , '/'

                                )

                        , '[^/]+$'

                        )        AS avail_ancestor_id

,         good_id_path               -- Only for debugging

FROM      connect_by_results

WHERE     SUBSTR (good_id_path, -1) != '/'

ORDER BY  r_num

;

This doesn't involve a second CONNECT BY, a self-join, or any analytic functions, all of which can be slow.

Whatever criteia determine whether a row is in the result set or not are put in the CASE expression that produces good_id_path, and not in a WHERE clause, so good_id_path will include only ids from the final result set.   If good_id_path does not end with a '/', then it will be included.  Finding the available_ancestor_id is simply a matter of finding the next-to-last id (if any) in good_id_path.

Output:

                               AVAIL_

           PARENT              ANCESTOR

ID NAME      _ID ID_PATH      _ID        GOOD_ID_PATH

--- ------ ------ ------------ ---------- ------------

  1 Bert          /1                      /1

  6 John        2 /1/2/6       1          /1//6

  9 Jane        7 /1/2/6/7/9   6          /1//6//9

  8 Jim         2 /1/2/8       1          /1//8

  3 Brenda      1 /1/3         1          /1/3

  5 Steve       4 /1/3/4/5     3          /1/3//5

paul zip

"Nearest available ancestor" is closest ancestor relation in your family tree that exists in the list starting from parent, grandparent, great-grandparent .... nth great grandparent .


For example.  Who is Steve's nearest available ancestor in the filtered list?

Steve's family tree is : /Bert/Brenda/Mike/Steve

Mike isn't in the list, so we move onto Brenda, who is.  Brenda, ID = 3, So ANCESTOR_ID = 3

Solomon Yakobson

It looks like I didn't understand the requirements. Based on

        ID NAME    PARENT_ID ID_PATH       AVAIL_ANCESTOR_ID

---------- ------ ---------- ------------- -----------------

         1 Bert              /1             

         6 John            2 /1/2/6        1

         8 Jim             2 /1/2/8        1

         9 Jane            7 /1/2/6/7/9    6

         3 Brenda          1 /1/3          1

         5 Steve           4 /1/3/4/5      3

You want a grandparent id, or parent id if there is no grandparent. If so:

with qryData as (

                 select 1 as ID, 'Bert' as NAME, to_number(null) as PARENT_ID from dual union

                 select 2 as ID, 'Mark' ,   1 from dual union

                 select 3 as ID, 'Brenda' , 1 from dual union

                 select 4 as ID, 'Mike' ,   3 from dual union

                 select 5 as ID, 'Steve' ,  4 from dual union

                 select 6 as ID, 'John' ,   2 from dual union

                 select 7 as ID, 'Jo' ,     6 from dual union

                 select 8 as ID, 'Jim' ,    2 from dual union

                 select 9 as ID, 'Jane' ,   7 from dual

                )

select  q.*,

        sys_connect_by_path(ID, '/') ID_PATH,

        nvl(prior parent_id,parent_id) AVAIL_ANCESTOR_ID

  from  qryData q

  where id IN (1,3,5,6,8,9)

  start with parent_id is null

  connect by prior ID = PARENT_ID

/


ID NAME    PARENT_ID ID_PATH               AVAIL_ANCESTOR_ID
--- ------ ---------- -------------------- ------------------
  1 Bert              /1
  6 John            2 /1/2/6                                1
  9 Jane            7 /1/2/6/7/9                            6
  8 Jim             2 /1/2/8                                1
  3 Brenda          1 /1/3                                  1
  5 Steve           4 /1/3/4/5                              3

6 rows selected.

SQL>

SY.

BrendanP

Well, perhaps I'm missing something obvious, but surely you don't have the family tree as such, only id, parent_id and have to deduce the tree, and as the parent of Steve doesn't exist, how do you know Brenda is the one? If you actually have the id_path - your qryData test subquery doesn't - then ok.

Message was edited by: BrendanP Looks like OP has updated the subquery so it now does have the 'missing' ids, but still seems underspecified - where are the missing ids identified as such?

paul zip

This is a clever solution!  The only problem I have is I wouldn't be able to generate "good_id_path" as part of the filtered list, as filtering is not done by me and is done after the id_path is generated.  I'd have to join filtered list back to the full list to generate it, unless there's a tweak to do it?

paul zip

This only works for missing parents, I need to continue from parent => grandparent => great grandparent....... => great (x n) grandparent

if I use "where id IN (1,3,5,8,9)" it says Jane's ANCESTOR_ID = 6 (John) but he isn't in the list, so it is incorrect.

Solomon Yakobson

OK. It looks like now I undestand your requirements. As I showed in my first reply, you can traverse hierarchy from leaf to root:

with qryData as (

                 select 1 as ID, 'Bert' as NAME, to_number(null) as PARENT_ID from dual union

                 select 2 as ID, 'Mark' ,   1 from dual union

                 select 3 as ID, 'Brenda' , 1 from dual union

                 select 4 as ID, 'Mike' ,   3 from dual union

                 select 5 as ID, 'Steve' ,  4 from dual union

                 select 6 as ID, 'John' ,   2 from dual union

                 select 7 as ID, 'Jo' ,     6 from dual union

                 select 8 as ID, 'Jim' ,    2 from dual union

                 select 9 as ID, 'Jane' ,   7 from dual

                ),

   ancestors as (

                 select  level lvl,

                         connect_by_root id id,

                         connect_by_root name name,

                         connect_by_root parent_id parent_id,

                         parent_id ancestor_id

                   from  qryData

                   start with id in (1,3,5,6,8,9)

                   connect by ID = prior PARENT_ID

                )

select  id,

        name,

        parent_id,

        min(ancestor_id) keep(dense_rank first order by lvl) avail_ancestor_id

  from  ancestors

  where ancestor_id in (1,3,5,6,8,9)

     or ancestor_id is null

  group by id,

           name,

           parent_id

  order by id,

           avail_ancestor_id

/


ID NAME    PARENT_ID  AVAIL_ANCESTOR_ID
--- ------ ---------- ------------------
  1 Bert
  3 Brenda          1                  1
  5 Steve           4                  3
  6 John            2                  1
  8 Jim             2                  1
  9 Jane            7                  6

6 rows selected.

SQL>

SY.

Frank Kulash
Answer

Hi,

paulzip wrote:

This is a clever solution!  The only problem I have is I wouldn't be able to generate "good_id_path" as part of the filtered list, as filtering is not done by me and is done after the id_path is generated.  I'd have to join filtered list back to the full list to generate it, unless there's a tweak to do it?

I see; you're given connect_by_results (with id_path), and that's all you have to work with, besides the original table.

Here's one way:

SELECT   d.id, d.name, d.parent_id, d.id_path

,        MIN (a.id) KEEP (DENSE_RANK LAST ORDER BY a.id_path) AS avail_ancestor_id

FROM          connect_by_results  d

LEFT OUTER JOIN  connect_by_results  a  ON  d.id_path  LIKE a.id_path || '/%'

GROUP BY  d.id, d.name, d.parent_id, d.id_path

ORDER BY  d.id_path

;

The fastest way depends on lots of things, some of which can't be known until run-time.  The query above might be best if the number of rows in the result set is small, but you don't know anything else about them (e.g., whether they all have a common ancestor).

Marked as Answer by paul zip · Sep 27 2020
Solomon Yakobson

Actually, it can be optimized even more:

with qryData as (

                 select 1 as ID, 'Bert' as NAME, to_number(null) as PARENT_ID from dual union

                 select 2 as ID, 'Mark' ,   1 from dual union

                 select 3 as ID, 'Brenda' , 1 from dual union

                 select 4 as ID, 'Mike' ,   3 from dual union

                 select 5 as ID, 'Steve' ,  4 from dual union

                 select 6 as ID, 'John' ,   2 from dual union

                 select 7 as ID, 'Jo' ,     6 from dual union

                 select 8 as ID, 'Jim' ,    2 from dual union

                 select 9 as ID, 'Jane' ,   7 from dual

                ),

   ancestors as (

                 select  level lvl,

                         connect_by_root id id,

                         connect_by_root name name,

                         connect_by_root parent_id parent_id,

                         parent_id ancestor_id

                   from  qryData

                   start with id in (1,3,5,6,8,9)

                   connect by id = prior parent_id

                          and id not in (1,3,5,6,8,9)

                )

select  id,

        name,

        parent_id,

        min(ancestor_id) keep(dense_rank first order by case when parent_id in (1,3,5,6,8,9) then lvl end nulls last) avail_ancestor_id

  from  ancestors

  group by id,

           name,

           parent_id

  order by id,

           avail_ancestor_id

/


ID NAME    PARENT_ID  AVAIL_ANCESTOR_ID
--- ------ ---------- ------------------
  1 Bert
  3 Brenda          1                  1
  5 Steve           4                  3
  6 John            2                  1
  8 Jim             2                  1
  9 Jane            7                  6

6 rows selected.

SQL>

SY.

paul zip

I was very impressed with this approach.  Yes, it can see may be slow with large datasets, but with smaller sets it'll be fine.  What impressed me is it was extremely clever in the way it only uses the filtered dataset to yield the answer,  Quite genius.

paul zip

I can see what you are doing, but my list could contain 100s / 1000s of items, how would this transpose to your "in (1,3,5,6,8,9)" situations, without multiple table scans?

Solomon Yakobson

paulzip wrote:

I can see what you are doing, but my list could contain 100s / 1000s of items, how would this transpose to your "in (1,3,5,6,8,9)" situations, without multiple table scans?

So what prevents you from using nested table/varray or a driver table:

SQL> select  *
  2    from  qryData
  3  /

ID NAME    PARENT_ID
--- ------ ----------
  1 Bert
  2 Mark            1
  3 Brenda          1
  4 Mike            3
  5 Steve           4
  6 John            2
  7 Jo              6
  8 Jim             2
  9 Jane            7

9 rows selected.

SQL> explain plan for
  2  with  ancestors as (
  3                   select  level lvl,
  4                           connect_by_root id id,
  5                           connect_by_root name name,
  6                           connect_by_root parent_id parent_id,
  7                           parent_id ancestor_id
  8                     from  qryData
  9                     start with id in (select * from table(sys.OdciNumberList(1,3,5,6,8,9)))
10                     connect by id = prior parent_id
11                            and id not in (select * from table(sys.OdciNumberList(1,3,5,6,8,9)))
12                  )
13  select  id,
14          name,
15          parent_id,
16          min(ancestor_id) keep(dense_rank first order by case when parent_id in (select * from table(sys.OdciNumberList(1,3,5,6,8,9))) then lvl end nulls last) avail_ancestor_id
17    from  ancestors
18    group by id,
19             name,
20             parent_id
21    order by id,
22             avail_ancestor_id
23  /

Explained.

SQL> select  *
  2    from  table(dbms_xplan.display)
  3  /

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3643621826

------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |         |     9 |   513 |     5  (40)| 00:00:01 |
|*  1 |  COLLECTION ITERATOR CONSTRUCTOR FETCH     |         |     1 |     2 |     2   (0)| 00:00:01 |
|   2 |  SORT ORDER BY                             |         |     9 |   513 |     5  (40)| 00:00:01 |
|   3 |   SORT GROUP BY                            |         |     9 |   513 |     5  (40)| 00:00:01 |
|   4 |    VIEW                                    |         |     9 |   513 |    22  (87)| 00:00:01 |
|*  5 |     CONNECT BY NO FILTERING WITH START-WITH|         |       |       |            |          |

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
|   6 |      TABLE ACCESS FULL                     | QRYDATA |     9 |   279 |     3   (0)| 00:00:01 |
|*  7 |      COLLECTION ITERATOR CONSTRUCTOR FETCH |         |     2 |     4 |     2   (0)| 00:00:01 |
|*  8 |      COLLECTION ITERATOR CONSTRUCTOR FETCH |         |     1 |     2 |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(VALUE(KOKBF$)=:B1)
   5 - access("ID"=PRIOR "PARENT_ID")
       filter( NOT EXISTS (SELECT 0 FROM TABLE() "KOKBF$3" WHERE LNNVL(VALUE(KOKBF$)<>:B1))

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
              AND  EXISTS (SELECT 0 FROM TABLE() "KOKBF$0" WHERE VALUE(KOKBF$)=:B2))
   7 - filter(LNNVL(VALUE(KOKBF$)<>:B1))
   8 - filter(VALUE(KOKBF$)=:B1)

Note
-----
   - dynamic sampling used for this statement (level=2)

29 rows selected.

SQL>

SY.

1 - 14
Locked Post
New comments cannot be posted to this locked post.

Post Details

Locked on Aug 7 2013
Added on Jul 10 2013
14 comments
2,554 views