Forum Stats

  • 3,851,947 Users
  • 2,264,053 Discussions
  • 7,904,914 Comments

Discussions

match_recognize: Working with pattern-matching for multiple columns

mNem
mNem Member Posts: 1,380 Gold Trophy
edited May 15, 2018 6:33PM in SQL & PL/SQL

Trying to work out a solution for a recently posted question (edit: ).

For columns v1 and v2 in each row, need to find the sum of all values occurring in the subsequent rows

upto a value that is 100 or greater.

with t(id, v1, v2) as

(

  select 1,  80 ,  20 from dual union all

  select 2,  90 ,  40 from dual union all

  select 3, 130 , 170 from dual union all

  select 4,  20 ,  30 from dual union all

  select 5, 150 , 140 from dual union all

  select 6,  30 ,  40 from dual union all

  select 7, 110 , 110 from dual union all

  select 8,  20 , 220 from dual union all

  select 9, 130 ,  30 from dual

)

select * from t;

expected resultset:

        ID     V1FRST      V1SUM     V2FRST      V2SUM

---------- ---------- ---------- ---------- ----------

         1         80        300         20        230

         2         90        220         40        210

         3        130        300        170        340

         4         20        170         30        170

         5        150        290        140        290

         6         30        140         40        150

         7        110        260        110        330

         8         20        150        220        250

         9        130        130         30         30

Thanks.

GregVmNem

Best Answer

  • mathguy
    mathguy Member Posts: 10,674 Blue Diamond
    edited May 10, 2018 2:00PM Answer ✓

    The solution in Reply 3 has a bug - see a brief explanation at the end of Reply 10. To illustrate the bug, I changed the data slightly: for ID = 9, I changed V1 to 30 instead of 130. In the output, the row for ID = 7 is incorrect: V2SUM should be 330, but it is 360 (because I add all the V2 values to the end of the table, even though FOR V2 I don't need to). This is what I explained in Reply 10.

    The fix is simple - I need to replace the  E Z*  portion of the PATTERN, since I need to distinguish the reason E exists in the first place (is it because B.V1 is less than 100, or is it because B.V2 is less than 100?)  Easy fix - I need to add new classifiers, F and ZZZ  (I didn't want to change all the other classifiers to make room for this last one). The definition of E must change, and the SUBSET definitions are adjusted for this fix.

    So: Below I show first the (incorrect) solution from Reply 10, with the data slightly modified (as highlighted, and as explained above) and showing the incorrect result. Then, after that, I show the corrected solution, with the correct output. Sorry for the code alignment - since I am copying and pasting from SQL*Plus, I know I won't be able to line things up properly.

    SQL> with t(id, v1, v2) as  2  (  3    select 1,  80 ,  20 from dual union all  4    select 2,  90 ,  40 from dual union all  5    select 3, 130 , 170 from dual union all  6    select 4,  20 ,  30 from dual union all  7    select 5, 150 , 140 from dual union all  8    select 6,  30 ,  40 from dual union all  9    select 7, 110 , 110 from dual union all10     select 8,  20 , 220 from dual union all11     select 9,  30 ,  30 from dual              -- I changed this value from 130 to 30
    12  )13  select id, v1frst, v1sum, v2frst, v2sum14  from  t15  match_recognize(16    order by id17    measures a.id       as id18           , a.v1       as v1frst19           , sum(s1.v1) as v1sum20           , a.v2       as v2frst21           , sum(s2.v2) as v2sum22    after match skip to next row23    pattern ( a w*? ( b  ( x*? c | y*? d | e z* )?  | $) )24    subset s1 = (a, w, b, x, c, e, z), s2 = (a, w, b, y, d, e, z)25    define  b as   v1 >= 100 or    v2 >= 10026         ,  c as b.v1 <  100 and c.v1 >= 10027         ,  d as b.v2 <  100 and d.v2 >= 10028         ,  e as b.v1 <  100 or  b.v2 <  10029  )30  ;        ID     V1FRST      V1SUM     V2FRST      V2SUM---------- ---------- ---------- ---------- ----------         1         80        300         20        230         2         90        220         40        210         3        130        300        170        340         4         20        170         30        170         5        150        290        140        290         6         30        140         40        150         7        110        160        110        360   -- This one's wrong! Should be 330
             8         20         50        220        250         9         30         30         30          3--  CORRECTED QUERY AND CORRECT RESULT:SQL> with t(id, v1, v2) as  2  (  3    select 1,  80 ,  20 from dual union all  4    select 2,  90 ,  40 from dual union all  5    select 3, 130 , 170 from dual union all  6    select 4,  20 ,  30 from dual union all  7    select 5, 150 , 140 from dual union all  8    select 6,  30 ,  40 from dual union all  9    select 7, 110 , 110 from dual union all10     select 8,  20 , 220 from dual union all11     select 9,  30 ,  30 from dual12  )13  select id, v1frst, v1sum, v2frst, v2sum14  from  t15  match_recognize(16    order by id17    measures a.id       as id18           , a.v1       as v1frst19           , sum(s1.v1) as v1sum20           , a.v2       as v2frst21           , sum(s2.v2) as v2sum22    after match skip to next row23    pattern ( a w*? ( b  ( x*? c | y*? d | e z* | f zzz* )?  | $) )24    subset s1 = (a, w, b, x, c, e, z), s2 = (a, w, b, y, d, f, zzz)25    define  b as   v1 >= 100 or    v2 >= 10026         ,  c as b.v1 <  100 and c.v1 >= 10027         ,  d as b.v2 <  100 and d.v2 >= 10028         ,  e as b.v1 <  10029         ,  f as b.v2 <  10030  )31  ;        ID     V1FRST      V1SUM     V2FRST      V2SUM---------- ---------- ---------- ---------- ----------         1         80        300         20        230         2         90        220         40        210         3        130        300        170        340         4         20        170         30        170         5        150        290        140        290         6         30        140         40        150         7        110        160        110        330         8         20         50        220        250         9         30         30         30         30
    mNem
«134

Answers

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 42,484 Red Diamond
    edited May 10, 2018 9:25AM

    Hi,

    Interesting problem!

    mNem wrote:Trying to work out a solution for a recently posted question. ...

    What was that question?  If it's worth mentioning that there was a previous question, then it's worth posting a link to it.

    This sure sounds like a MATCH_RECOGNIZE problem, but I can't think of a MATCH_RECOGNIZE solution.

    Here's a simple analytic solution:

    WITH    got_grps    AS(    SELECT  id, v1, v2    ,       COUNT (CASE WHEN v1 >= 100 THEN 1 END)                OVER (ORDER BY  id  DESC)           AS grp1    ,       COUNT (CASE WHEN v2 >= 100 THEN 1 END)                OVER (ORDER BY  id  DESC)           AS grp2    FROM    t)SELECT    id,         v1                AS v1frst,         SUM (v1) OVER ( PARTITION BY  grp1                          ORDER BY      id  DESC                        )   AS v1sum,         v2                AS v2frst,         SUM (v2) OVER ( PARTITION BY  grp2                          ORDER BY      id  DESC                        )   AS v2sumFROM      got_grpsORDER BY  id;

    Output (as requested):

        ID     V1FRST      V1SUM     V2FRST      V2SUM---- ---------- ---------- ---------- ----------   1         80        300         20        230   2         90        220         40        210   3        130        130        170        170   4         20        170         30        170   5        150        150        140        140   6         30        140         40        150   7        110        110        110        110   8         20        150        220        220   9        130        130         30         30

    The best MATCH_RECOGNIZE solution might involve some analytic functions as well, so it might be more complicated and less efficient than the solution above.  However, that's a separate issue.

    GregV
  • mNem
    mNem Member Posts: 1,380 Gold Trophy
    edited May 10, 2018 10:25AM

    Hi Frank,

    Thanks for the solution based on analytic functions. But I am interested in seeing a match_recognize solution. At this stage, not too worried about the performance of the query as it is a learning exercise for me.

    Regards,

    mNem

  • mathguy
    mathguy Member Posts: 10,674 Blue Diamond
    edited May 10, 2018 11:26AM

    Fun problem!

    I believe the solution below is correct; please try it on a few other inputs to see if it works as intended.

    The trick is in the PATTERN - the problem is hard because you want different "stopping points" for the summation depending on where the next value >= 100 appears in each column. Since we have to deal with branching (depending on whether the stopping point for V1 is before, after, or at the same time as the stopping point for V2), we can use the SUBSET clause to sum over what's really needed for each column.

    with t(id, v1, v2) as(  select 1,  80 ,  20 from dual union all  select 2,  90 ,  40 from dual union all  select 3, 130 , 170 from dual union all  select 4,  20 ,  30 from dual union all  select 5, 150 , 140 from dual union all  select 6,  30 ,  40 from dual union all  select 7, 110 , 110 from dual union all  select 8,  20 , 220 from dual union all  select 9, 130 ,  30 from dual)select id, v1frst, v1sum, v2frst, v2sumfrom  tmatch_recognize(  order by id  measures a.id       as id         , a.v1       as v1frst         , sum(s1.v1) as v1sum         , a.v2       as v2frst         , sum(s2.v2) as v2sum  after match skip to next row  pattern ( a w*? ( b  ( x*? c | y*? d | e z* )?  | $) )  subset  s1 = (a, w, b, x, c, e, z), s2 = (a, w, b, y, d, e, z)  define  b as   v1 >= 100 or    v2 >= 100       ,  c as b.v1 <  100 and c.v1 >= 100       ,  d as b.v2 <  100 and d.v2 >= 100       ,  e as b.v1 <  100 or  b.v2 <  100);
  • mathguy
    mathguy Member Posts: 10,674 Blue Diamond
    edited May 10, 2018 11:02AM

    Besides not being a MATCH_RECOGNIZE solution, the query in Reply 1 does not return the desired output, as included from the beginning in the original post. Perhaps you didn't notice that? This may be due to a different understanding of the problem statement (inconsistent with the OP's intent).

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 42,484 Red Diamond
    edited May 10, 2018 11:11AM

    Hi,

    I'm not sure if you can do this using a single MATCH_RECOGNIZE.

    Here's a way using 2 MATCH_RECOGNIZEs (separate ones for v1 and v2):

    SELECT    *FROM      tMATCH_RECOGNIZE          (            ORDER BY      id            MEASURES      NVL (a.id, c.id)  AS id            ,             NVL (a.v1, c.v1)  AS v1frst            ,             SUM (v1)          AS v1sum            ,             NVL (a.v2, c.v2)  AS v2            ONE ROW PER MATCH     AFTER MATCH SKIP TO NEXT ROW            PATTERN       (a?  b* c)            DEFINE        b  AS   v1         <  100            ,             c  AS   v1         >= 100                               OR NEXT (id)  IS NULL  -- is there a better way?          )MATCH_RECOGNIZE          (            ORDER BY      id            MEASURES      NVL (a2.id,     c2.id)      AS id            ,             NVL (a2.v1frst, c2.v1frst)  AS v1frst            ,             NVL (a2.v1sum,  c2.v1sum)   AS v1sum            ,             NVL (a2.v2,     c2.v2)      AS v2frst            ,             SUM (v2)                    AS v2sum            ONE ROW PER MATCH     AFTER MATCH SKIP TO NEXT ROW            PATTERN       (a2?  b2* c2)            DEFINE        b2  AS   v2        <  100            ,             c2  AS   v2        >= 100                               OR NEXT (id)  IS NULL  -- is there a better way?          )ORDER BY  id;

    Every row will start a new pattern, and the pattern will end as soon as v1 (or v2) is 100 or more; but if there is no row will v1 >= 100, the pattern will end with the last row.  (Is there a direct way to test if a row is first or last, according to the ORDER BY clause?  I used the NEXT operator, but I suspect there's another way.)

    What was the earlier question that prompted you to ask this?  I'll bet I'm not the only one who would be interested in ti.

  • Mike Kutz
    Mike Kutz Member Posts: 6,199 Silver Crown
    edited May 10, 2018 11:05AM

    Can you explain the missing DEFINE identifiers W,X,Y,Z please?

    Do they operate the same as STRT?

    Thanks,

    MK

  • mathguy
    mathguy Member Posts: 10,674 Blue Diamond
    edited May 10, 2018 11:12AM

    I am not sure what STRT is. I know it is used as a classifier in the documentation examples - but note that STRT has no particular meaning (no different from A or W).

    A classifier that is used in PATTERN without being included in DEFINE will match ANY row. No conditions whatsoever. (Unfortunately, that can't be taken full advantage of; at least one classifier MUST be defined, even if it's by a condition like 0 = 0; the DEFINE clause is required, even though it shouldn't be.)

    The trick here is to use several different classifiers for "any" rows, because they must (or must not) be included in one sum or another - so we still must be able to differentiate them. In the solution I proposed, I didn't need to differentiate between W and Z (since all W and all Z are included in both sums anyway), but that would just make the query even harder to read.

  • mNem
    mNem Member Posts: 1,380 Gold Trophy
    edited May 10, 2018 1:14PM

    Hi mathguy,

    Thanks for the reply. It works as expected in sqlplus but NOT in sql developer. I am not a sqlplus user.  Would appreciate if the solution could be tweaked to work in sql developer.

    Additionally, can you please explain the role of subsets in your solution?

    Thanks.

    Regards,

    mNem

  • mNem
    mNem Member Posts: 1,380 Gold Trophy
    edited May 10, 2018 1:17PM
    Frank Kulash wrote:What was the earlier question that prompted you to ask this? I'll bet I'm not the only one who would be interested in ti.

    Initial post now includes a link to the earlier question.

  • mathguy
    mathguy Member Posts: 10,674 Blue Diamond
    edited May 10, 2018 1:40PM

    You won't be able to use this solution in SQL Developer. The problem is that SQL Developer uses the question mark, when not quoted, as a placeholder for a variable. Before the MATCH_RECOGNIZE clause introduced in 12.1, this wasn't a problem; the question mark had no meaning in Oracle SQL, other than in regular expressions, which are enclosed within single quotes. (Similarly if you include a question mark in an object name, the name must be in double-quotes.)

    Perhaps it would have been good if Oracle would have required the PATTERN in the MATCH_RECOGNIZE clause to be enclosed in single quotes; that would have avoided this issue. Alas, that is not the case. Now we have question marks exposed directly to the parser (whichever parser comes first - in this case some sort of Java nonsense).

    When you have something like X? you can replace it with X{0,1}.   However, there is no alternative for "non-greedy match" operators,  *?  and +?.   And I definitely don't see how you can use MATCH_RECOGNIZE for this problem, and not use non-greedy matches.

    Ideally, the writers of SQL Developer should add a setting to turn off the interpretation of question mark as a placeholder. I am not holding my breath though...

    The SUBSET clause:   All it does is to map classifiers to new names, so that they can be grouped together. When I SUM for column V1, I don't want to add V1 values from rows classified as Y or D - such rows would only exist in the pattern if B.V1 was already >= 100 but B.V2 is < 100 and I need to add Y and D rows to get to the first value of V2 that is >= 0. Those rows should be used when I sum over V2, but not when I sum over V1. So, if the pattern ends up matching A X*? B Y* D, I only want to add the A, X and B rows. If it's A X*? B E Z* (I find a V1 or a V2 that is >= 100 but I don't ever find >= 100 in the other column, so I must read to the end of the table), then I want to add A, X, B, E, Z. In all cases I want to sum the rows grouped in S1 for column V1, and in S2 for column V2.

    While writing this explanation, I just realized I have a bug in my solution. Suppose the match is A X*? B E Z*.  For example, B.V1 >= 100, B.V2 < 100, and V2 never becomes >= 100. Then I must read all the rows to the end of the table. The mistake is that in this situation, I add all the rows for BOTH sums, instead of just for V2. The sum for V1 should only include the A, X and B rows, not the E and Z. I will fix this (I think I know how) and post the corrected solution - please refer back here to see why I made the change, and how that change is supposed to work.

    mNem
This discussion has been closed.