Forum Stats

  • 3,874,489 Users
  • 2,266,746 Discussions
  • 7,911,865 Comments

Discussions

How can I use a subquery in conjunction with recursive subquery factoring?

Pseudo Nym
Pseudo Nym Member Posts: 3 Red Ribbon

I am attempting to implement a data sampling algorithm that requires partitioning the data (rows of a table) into 'buckets', iterating through the buckets sequentially, and choosing one data item (row) from each bucket. The row selected for each bucket is determined by maximising a function that is dependent on:

  1. the value of the data value for the row being considered in the current bucket;
  2. the already selected (optimised) value in the previous bucket; and
  3. a property derived from all the data points on the next bucket.

A recursive subquery factoring clause appears to be ideally suited to this. (Indeed, a postgresql implementation of the algorithm I am working towards - which is not mine - uses recursive subquery factoring.)

If I pre-calculate the required derived value (3) for each bucket, and store it into a table, then the recursive subquery factoring clause works as intended. (See SQL below.)

I wish to calculate the result as a single WITH ... SELECT ..., without having to create any additional temporary tables with pre-calculated intermediate results. With Oracle 19c database, if, instead of pre-calculating (3) and storing results in a table to be referenced, I calculate it using a subquery in the main WITH ... SELECT ... clause, it fails to compile with an unexpected error (ORA-00904). (See SQL below.)

The two forms appear to be equivalent and so I cannot understand why the single statement form is not working.

Does anyone understand why this is not working? Have I encountered an Oracle DB limitation or bug and if so, are there any workarounds?

As an aside, assuming I get this to work, I'll be needing to introduce another level of subquery in the WITH clause to take the data from the source table and get it into the form required by the algorithm, so this ability to handle subqueries is important. How can I do this?

Thank you.

INPUT DATA

-- A table holding the input data
CREATE TABLE test_data
(
  id NUMBER -- Row identifier
  , b NUMBER -- Bucket number
  , y NUMBER -- Value
);
INSERT INTO test_data (id, b, y) VALUES(1, 1, 10);
INSERT INTO test_data (id, b, y) VALUES(2, 2, 0);
INSERT INTO test_data (id, b, y) VALUES(3, 3, 5);
INSERT INTO test_data (id, b, y) VALUES(4, 3, 4);
INSERT INTO test_data (id, b, y) VALUES(5, 4, 7);
INSERT INTO test_data (id, b, y) VALUES(6, 5, 6);
INSERT INTO test_data (id, b, y) VALUES(7, 6, 2);
INSERT INTO test_data (id, b, y) VALUES(8, 7, 8);
INSERT INTO test_data (id, b, y) VALUES(9, 8, 1);

EXAMPLE 1 (WORKING)

-- Pre-calculate a table holding intermediate results
CREATE TABLE centroids
(
  b NUMBER -- Bucket number
  , y NUMBER -- Derived value
);
INSERT INTO centroids
SELECT
  TD.b
  , AVG(y) -- average y for each bucket
FROM test_data TD
GROUP BY TD.b;


-- Calculate full result via recursive subquery factoring
WITH
  compiled( id, b, y ) AS
  (
    (
      SELECT
        TD.id
        , TD.b
        , TD.y
      FROM
        test_data TD
      WHERE
        TD.b = 1
    )
    UNION ALL
    (
      SELECT
        CUR.id
        , CUR.b
        , CUR.y
      FROM
        compiled PRE    
      LEFT JOIN
        centroids NXT
      ON
        NXT.b = PRE.b + 2
      JOIN
        test_data CUR
      ON
        CUR.b = PRE.b + 1
        AND
        (
          CUR.id =
            (
              SELECT
                CUR_BEST.id
              FROM
                (
                  SELECT
                    CUR_ORDERED.id
                    , CASE
                        WHEN (NXT.b IS NULL) THEN
                          CUR_ORDERED.y
                        ELSE
                          PRE.y + CUR_ORDERED.y + NXT.y
                      END
                      AS a
                  FROM
                    test_data CUR_ORDERED
                  WHERE
                    CUR_ORDERED.b = PRE.b + 1
                  ORDER BY
                    a DESC
                ) CUR_BEST
              WHERE ROWNUM = 1
          )
        )
    )
  )
SELECT
  C.id
  , C.b
  , C.y
FROM
  compiled C
ORDER BY
  C.b ASC;

EXAMPLE 2 (NOT WORKING)

-- Do entire calculation in a single WITH ... SELECT ... statement
WITH
  -- Calculate intermediate result as a subquery
  -- This part does exactly the same calculation as the initial step in example 1. 
  centroids as
  (
    SELECT
      TD.b
      , AVG(y)
    FROM
      test_data TD
    GROUP BY
      TD.b
  )
  -- This part is exactly the same as example 1
  , compiled( id, b, y ) AS
  (
    (
      SELECT
        TD.id
        , TD.b
        , TD.y
      FROM
        test_data TD
      WHERE
        TD.b = 1
    )
    UNION ALL
    (
      SELECT
        CUR.id
        , CUR.b
        , CUR.y
      FROM
        compiled PRE    
      LEFT JOIN
        centroids NXT
      ON
        NXT.b = PRE.b + 2
      JOIN
        test_data CUR
      ON
        CUR.b = PRE.b + 1
        AND
        (
          CUR.id =
            (
              SELECT
                CUR_BEST.id
              FROM
                (
                  SELECT
                    CUR_ORDERED.id
                    , CASE
                        WHEN (NXT.b IS NULL) THEN
                          CUR_ORDERED.y
                        ELSE
                          PRE.y + CUR_ORDERED.y + NXT.y
                      END
                      AS a
                  FROM
                    test_data CUR_ORDERED
                  WHERE
                    CUR_ORDERED.b = PRE.b + 1
                  ORDER BY
                    a DESC
                ) CUR_BEST
              WHERE ROWNUM = 1
          )
        )
    )
  )  
SELECT
  C.id
  , C.b
  , C.y
FROM
  compiled C
ORDER BY
  C.b ASC;

This fails with:

ORA-00904: "NXT"."Y": invalid identifier

Best Answer

Answers

  • BluShadow
    BluShadow Member, Moderator Posts: 42,559 Red Diamond

    You haven't provided us with any data to test (nor expected results)... so I'm going to have to guess..

    WITH
      -- Calculate intermediate result as a subquery
      -- This part does exactly the same calculation as the initial step in example 1. 
      centroids as (
        SELECT TD.b
              ,AVG(y)
        FROM   test_data TD
        GROUP BY TD.b
        )
      -- This part is exactly the same as example 1
      ,compiled(id, b, y) AS (
        SELECT TD.id
              ,TD.b
              ,TD.y
        FROM   test_data TD
        WHERE  TD.b = 1
        UNION ALL
        SELECT CUR.id
              ,CUR.b
              ,CUR.y
        FROM   compiled PRE    
               LEFT JOIN       centroids NXT ON NXT.b = PRE.b + 2
                    JOIN       test_data CUR ON CUR.b = PRE.b + 1
                    JOIN LATERAL (SELECT CUR_BEST.id
                                  FROM  (SELECT CUR_ORDERED.id
                                               ,CASE WHEN (NXT.b IS NULL) THEN CUR_ORDERED.y
                                                ELSE PRE.y + CUR_ORDERED.y + NXT.y
                                                END AS a
                                         FROM   test_data CUR_ORDERED
                                         WHERE  CUR_ORDERED.b = PRE.b + 1
                                         ORDER BY a DESC
                                        ) CUR_BEST
                                  WHERE ROWNUM = 1
                                 ) x on (x.id = CUR.id)
        )
    SELECT C.id
          ,C.b
          ,C.y
    FROM   compiled C
    ORDER BY C.b ASC;
    


    For lateral joins see examples...


    You need to do this because you can't reference data from other tables in subqueries like you were trying to do.

    p.s. I've also tidied up your formatting.... in these days of wide screen monitors there's really no need to spread the code out as if you're coding on a narrow screen mobile phone. Oh, and you don't need so many brackets around your queries.

  • Pseudo Nym
    Pseudo Nym Member Posts: 3 Red Ribbon

    Thank you for the quick response @BluShadow.

    You haven't provided us with any data to test (nor expected results)... so I'm going to have to guess..

    I did provide data to test on. See the INPUT DATA section in the original post.

    While I didn't provide the expected output, it is what you receive if you run the working example, EXAMPLE 1. Here is the expected output:

            ID          B          Y
    ---------- ---------- ----------
             1          1         10
             2          2          0
             3          3          5
             5          4          7
             6          5          6
             7          6          2
             8          7          8
             9          8          1 
    

    Most importantly, I tried your suggestion, but unfortunately it didn't work. I receive the same error:

    ORA-00904: "NXT"."Y": invalid identifier

    Regarding LATERAL joins, I haven't come across them before. I'll have a read to see if I can understand how they might help and why they might be needed. I can see from the link you provided that they are supposed to address this kind of error message..., but in this case the table being referred to is a named subquery that can be pre-calculated independently, so I'm not sure that that's the issue.

    You need to do this because you can't reference data from other tables in subqueries like you were trying to do.

    You certainly can split out subqueries and then reference them by name, so the key thing must be "like you were trying to do". There's something about the way I have done it that Oracle doesn't like in the second example, whereas it's perfectly happy with the same thing in the first example.

  • Pseudo Nym
    Pseudo Nym Member Posts: 3 Red Ribbon
    edited Nov 11, 2022 11:48AM Answer ✓

    It should be

    SELECT
      TD.b
      , AVG(y) AS y
    

    in EXAMPLE 2. I missed the AS y ! Doh!

    Both examples working now. LATERAL not required.

  • BluShadow
    BluShadow Member, Moderator Posts: 42,559 Red Diamond

    Ah, ok, apologies, I missed the example data.

    Lateral type joins (JOIN LATERAL, CROSS APPLY etc.) are useful if you need to use the columns from one table in a subquery.

    e.g. If I try and use a value from one table in a subquery in the FROM clause...

    SQL> with t(id, csv) as (
      2    select 1, '1,2,3' from dual union all
      3    select 2, '4,5' from dual
      4    )
      5  select t.id
      6        ,v.val
      7  from   t
      8         cross join
      9         (-- in this query we want to take the csv value from our
     10          -- t row so we can split it out to seperate values
     11          select regexp_substr(t.csv,'[^,]+',1,level) as val
     12          from   dual
     13          connect by level <= regexp_count(t.csv,',')+1
     14         ) v
     15  /
            connect by level <= regexp_count(t.csv,',')+1
                                             *
    ERROR at line 13:
    ORA-00904: "T"."CSV": invalid identifier
    

    I get a similar error to what you were getting... the invalid identifier, because the subquery can't reference the value from the other table.

    If I change my CROSS JOIN to a CROSS APPLY, that tells SQL that I want to be able to apply values from my previously referenced table(s) inside the subquery...

    SQL> with t(id, csv) as (
      2    select 1, '1,2,3' from dual union all
      3    select 2, '4,5' from dual
      4    )
      5  select t.id
      6        ,v.val
      7  from   t
      8         cross apply
      9         (-- in this query we want to take the csv value from our
     10          -- t row so we can split it out to seperate values
     11          select regexp_substr(t.csv,'[^,]+',1,level) as val
     12          from   dual
     13          connect by level <= regexp_count(t.csv,',')+1
     14         ) v
     15  /
    
            ID VAL
    ---------- --------------------
             1 1
             1 2
             1 3
             2 4
             2 5
    

    Now it's quite happy to reference the t.csv value inside the subquery, and the subquery is processed for each reference i.e. we don't end up with a cartesian product of all the t.csv value from each row against each of the t.id values as you'd probably expect with a cross join.