1 2 Previous Next 22 Replies Latest reply on Feb 5, 2017 9:35 PM by BrendanP Go to original post
      • 15. Re: Matching ( in a string
        mathguy

        Recurrence is fun. However, in this case it is almost certain that a solution based on analytic functions will be faster.

         

        I worked on this independently and came up with pretty much the same solution as Stew. In the last part I was doing a join, but Stew's solution to the same "last issue" is clearly superior - GROUP BY in the right way will give the same answer with far less effort. I shamelessly stole that approach from Stew and changed my solution accordingly.

         

        I am posting my solution below, with more input data and output, for a couple of reasons. I am running the query for several input strings at once, which requires one small adjustment to the initial hierarchical query. (I include an ID column in the inputs to distinguish possibly identical input strings in different rows.) I also include strings with different kinds of unbalanced parentheses, with no parentheses, and the empty string as inputs, to see what happens.

         

        I didn't add a column for "Notes" - it should flag the strings where parentheses aren't matched properly. The right definition for that is:  considering the running sum of +1 for opening parenthesis and -1 for closing parenthesis, the sum must always be >= 0 and the final value must be 0. This can be checked in one of the intermediate results. In the query below, I didn't bother with that, so in one case the closing position is before the opening position, but that is NOT how the "invalid" parentheses should be caught, they should be caught earlier. My last example demonstrates why that is... the LAST pair of parentheses seems to be properly matched, but this comes after the parentheses were messed up earlier already, so this "last" pair seeming to be OK is meaningless.

         

         

        with

            inputs ( id, str ) as (

              select 1, ' ((Hello ( Hi Hi hi ( A B C ( D)) (EF)

        why Whwy whyhhh )

        )

        )'                                      from dual union all

              select 2, '(1+3*(3-1) + 3*(2+1))' from dual union all

              select 3, '()()*(())a()(())'      from dual union all

              select 4, 'no-parentheses'        from dual union all

              select 5, '((( unbalanced'        from dual union all

              select 6, ''                      from dual union all

              select 7, 'b0(b1(b2(b3(x))(xy)))' from dual union all

              select 8, '(x))(()'               from dual

            ),

            d ( id, str, pos ) as (

              select id, str, regexp_instr(str, '\(|\)', 1, level)

              from   inputs

              connect by level <= length(str) - length(translate(str, 'x()', 'x'))

                     and prior id = id

                     and prior sys_guid() is not null

            ),

            p ( id, str, pos, flag, l_ct, ct ) as (

              select id, str, pos, case substr(str, pos, 1) when '(' then 1 else -1 end,

                     sum(case substr(str, pos, 1) when '(' then 1         end) over (partition by id order by pos),

                     sum(case substr(str, pos, 1) when '(' then 1 else -1 end) over (partition by id order by pos)

              from   d

            ),

            f ( id, str, pos, flag, l_ct, ct, o_ct ) as (

              select id, str, pos, flag, l_ct, ct + case flag when 1 then 0 else 1 end as ct,

                     row_number() over (partition by id, flag, ct order by pos)

              from   p

            )

        select  id, str, min(l_ct) as l_ct,

                min(case when flag =  1 then pos end) as l_pos,

                min(case when flag = -1 then pos end) as r_pos

        from    f

        group by id, str, ct, o_ct

        order by id, l_ct

        ;

         

         

        Output:

         

        ID STR                                                                L_CT      L_POS      R_POS
        -- ------------------------------------------------------------ ---------- ---------- ----------
        1  ((Hello ( Hi Hi hi ( A B C ( D)) (EF)                                1          2         60
           why Whwy whyhhh )
           )
           )

        1  ((Hello ( Hi Hi hi ( A B C ( D)) (EF)                                2          3         58
           why Whwy whyhhh )
           )
           )

        1  ((Hello ( Hi Hi hi ( A B C ( D)) (EF)                                3         10         56
           why Whwy whyhhh )
           )
           )

        1  ((Hello ( Hi Hi hi ( A B C ( D)) (EF)                                4         21         33
           why Whwy whyhhh )
           )
           )

        1  ((Hello ( Hi Hi hi ( A B C ( D)) (EF)                                5         29         32
           why Whwy whyhhh )
           )
           )

        1  ((Hello ( Hi Hi hi ( A B C ( D)) (EF)                                6         35         38
           why Whwy whyhhh )
           )
           )

        2 (1+3*(3-1) + 3*(2+1))                                                 1          1         21
        2 (1+3*(3-1) + 3*(2+1))                                                 2          6         10
        2 (1+3*(3-1) + 3*(2+1))                                                 3         16         20
        3 ()()*(())a()(())                                                      1          1          2
        3 ()()*(())a()(())                                                      2          3          4
        3 ()()*(())a()(())                                                      3          6          9
        3 ()()*(())a()(())                                                      4          7          8
        3 ()()*(())a()(())                                                      5         11         12
        3 ()()*(())a()(())                                                      6         13         16
        3 ()()*(())a()(())                                                      7         14         15
        4 no-parentheses                                                                              0
        5 ((( unbalanced                                                        1          1
        5 ((( unbalanced                                                        2          2
        5 ((( unbalanced                                                        3          3
        6
        7 b0(b1(b2(b3(x))(xy)))                                                 1          3         21
        7 b0(b1(b2(b3(x))(xy)))                                                 2          6         20
        7 b0(b1(b2(b3(x))(xy)))                                                 3          9         15
        7 b0(b1(b2(b3(x))(xy)))                                                 4         12         14
        7 b0(b1(b2(b3(x))(xy)))                                                 5         16         19
        8 (x))(()                                                               1          5          4   -- invalid pair!
        8 (x))(()                                                               1          1          3   -- this pair is correctly matched and valid
        8 (x))(()                                                               3          6          7   -- this shouldn't be considered valid

         

        29 rows selected.

        • 16. Re: Matching ( in a string

          Now I need to find the matching bracket position for the 3rd bracket.

          First - there are no 'brackets' in what you posted. Did you mean parentheses?

           

          Second - which '3rd parenthesis' do you want to match? The 3rd 'right' parentheses or the 3rd 'left' parentheses? You need to be VERY explicit when you define a problem statement.

           

          Third - why? What possible value can knowing that one value have? I don't see how it can have ANY without having other info such as the actual position of the 'other' matching parentheses.

           

          If you have an actual problem that involves nested expressions why don't you start over and tell us what the real problem is.

           

          Evaluation of a nested expression is generally part of a larger process.

           

          One of the solutions being offered may be the 'best' or 'fastest' in identifying that one component but don't count of that 'solution' being usable to evaluation the actual expression the string represents.

           

          Intellectual exercises can be fun but they aren't necessarily usable in the real world.

          • 17. Re: Matching ( in a string
            Solomon Yakobson

            Using REGEXP negates all that advantage:

             

            SQL> set timing on

            SQL> -- CONNECT BY + REGEXP (hierarchy with one row per parenthesis) + GROUP BY

            SQL> declare

              2      cursor v_cur

              3        is with data as (  

              4    select 'b0(b1(b2(b3(x))(xy)))' as str  

              5    from dual  

              6  )  

              7  , paren_locs as (  

              8    select level as num_parens,  

              9    regexp_instr(str,'[()]',1,level) as paren_loc  

            10    from data  

            11    connect by regexp_instr(str,'[()]',1,level) > 0  

            12  )  

            13  , paren_counts as (  

            14    select num_parens, paren_loc,  

            15      substr(str, paren_loc, 1) as paren,  

            16      case substr(str, paren_loc, 1)  

            17        when '(' then 0 else 1  

            18      end  

            19      +  

            20      sum(  

            21        case substr(str, paren_loc, 1)  

            22          when '(' then 1 else -1  

            23        end  

            24      ) over(order by num_parens) as paren_count  

            25    from data, paren_locs  

            26  )  

            27  select paren_count, rn,

            28    min(decode(paren, '(', paren_loc)) paren_start,  

            29    max(decode(paren, ')', paren_loc)) paren_end  

            30  from (  

            31    select paren_loc, paren_count, paren,

            32    row_number() over(partition by paren_count, paren order by num_parens) rn  

            33    from paren_counts a  

            34  )  

            35  group by paren_count, rn

            36  order by paren_start;

            37  begin

            38      for v_i in 1..10000 loop

            39        for v_rec in v_cur loop

            40          null;

            41        end loop;

            42      end loop;

            43  end;

            44  /

             

            PL/SQL procedure successfully completed.

             

            Elapsed: 00:00:02.12

            SQL> CONNECT BY (no regexp hierarchy with one row per character) + analytics

            SQL> declare

              2      cursor v_cur

              3        is with tbl as (select 'b0(b1(b2(b3(x))(xy)))' str from dual),

              4      t1 as (

              5              select  str,

              6                      level pos,

              7                    sum(case substr(str,level,1)

              8                      when '(' then 1

              9                      else -1

            10                    end) over(order by level) + case substr(str,level,1)

            11                                                  when ')' then 1

            12                                                  else 0

            13                                                end weight

            14                from  tbl

            15                where substr(str,level,1) in ('(',')')

            16                connect by level <= length(str)

            17            ),

            18      t2 as (

            19              select  str,

            20                      pos start_pos,

            21                      lead(pos) over(partition by weight order by pos) end_pos,

            22                      weight

            23                from  t1

            24            )

            25  select  substr(str,start_pos,end_pos - start_pos + 1) str,

            26          weight parenthesis_level,

            27          start_pos,

            28          end_pos

            29    from  t2

            30    where substr(str,start_pos,1) = '('

            31    order by start_pos;

            32  begin

            33      for v_i in 1..10000 loop

            34        for v_rec in v_cur loop

            35          null;

            36        end loop;

            37      end loop;

            38  end;

            39  /

             

            PL/SQL procedure successfully completed.

             

            Elapsed: 00:00:00.81

             

            SQL> -- CONNECT BY replacing hierarchy with one row per character with hierarchy with one row per parenthesis via REGEXP + analytics

            SQL> declare

              2      cursor v_cur

              3        is with tbl as (select 'b0(b1(b2(b3(x))(xy)))' str from dual),

              4  t1 as (

              5              select  str,

              6                      regexp_instr(str,'\(|\)',1,level) pos,

              7                      sum(case regexp_substr(str,'\(|\)',1,level)

              8                      when '(' then 1

              9                      else -1

            10                    end) over(order by level) + case regexp_substr(str,'\(|\)',1,level)

            11                                                  when ')' then 1

            12                                                  else 0

            13                                                end weight

            14                from  tbl

            15                connect by level <= regexp_count(str,'\(|\)')

            16            ),

            17      t2 as (

            18              select  str,

            19                      pos start_pos,

            20                      lead(pos) over(partition by weight order by pos) end_pos,

            21                      weight

            22                from  t1

            23            )

            24  select  substr(str,start_pos,end_pos - start_pos + 1) str,

            25          weight parenthesis_level,

            26          start_pos,

            27          end_pos

            28    from  t2

            29    where substr(str,start_pos,1) = '('

            30    order by start_pos;

            31  begin

            32      for v_i in 1..10000 loop

            33        for v_rec in v_cur loop

            34          null;

            35        end loop;

            36      end loop;

            37  end;

            38  /

             

            PL/SQL procedure successfully completed.

             

            Elapsed: 00:00:01.73

            SQL>

             

            SY.

            • 18. Re: Matching ( in a string
              mathguy

              I keep forgetting Solomon's admonition...  If you are using analytic functions, try to do it all with analytic functions (don't add aggregation at the end).

               

              The output from the version below, where I replace the GROUP BY at the end with LEAD() and partitioning (as suggested by Solomon), looks better too.

               

              Note - in my computations I count opening parentheses from left to right (as meaningful, for example, when counting them in a regular expression pattern), but I don't show the "depth" (nesting level). That is present in the intermediate results and could be included in the output too, with no effort - I just didn't do that.

               

              with

                  inputs ( id, str ) as ( <same as before> ),

                  d ( id, str, pos ) as (

                    select id, str, regexp_instr(str, '\(|\)', 1, level)

                    from   inputs

                    connect by level <= length(str) - length(translate(str, 'x()', 'x'))

                           and prior id = id

                           and prior sys_guid() is not null

                  ),

                  p ( id, str, pos, flag, l_ct, ct ) as (

                    select id, str, pos, case substr(str, pos, 1) when '(' then 1 else -1 end,

                           sum(case substr(str, pos, 1) when '(' then 1         end) over (partition by id order by pos),

                           sum(case substr(str, pos, 1) when '(' then 1 else -1 end) over (partition by id order by pos)

                    from  d

                  ),

                  f ( id, str, pos, flag, l_ct, r_pos) as (

                    select id, str, pos, flag, l_ct, lead(pos) over (partition by id, ct + case flag when 1 then 0 else 1 end order by pos) as end_pos

                    from   p

                  )

              select id, str, l_ct, pos as l_pos, r_pos

              from   f

              where  flag = 1 or pos = 0 or pos is null    --  to include strings with no  (  in them

              order by id, l_ct, l_pos

              ;

              Output:

               

              ID STR                                            L_CT      L_POS      R_POS
              -- ---------------------------------------- ---------- ---------- ----------
              <same results for the first string>

              2 (1+3*(3-1) + 3*(2+1))                              1          1         21
              2 (1+3*(3-1) + 3*(2+1))                              2          6         10
              2 (1+3*(3-1) + 3*(2+1))                              3         16         20
              3 ()()*(())a()(())                                   1          1          2
              3 ()()*(())a()(())                                   2          3          4
              3 ()()*(())a()(())                                   3          6          9
              3 ()()*(())a()(())                                   4          7          8
              3 ()()*(())a()(())                                   5         11         12
              3 ()()*(())a()(())                                   6         13         16
              3 ()()*(())a()(())                                   7         14         15
              4 no-parentheses                                                0
              5 ((( unbalanced                                     1          1
              5 ((( unbalanced                                     2          2
              5 ((( unbalanced                                     3          3
              6
              7 b0(b1(b2(b3(x))(xy)))                              1          3         21
              7 b0(b1(b2(b3(x))(xy)))                              2          6         20
              7 b0(b1(b2(b3(x))(xy)))                              3          9         15
              7 b0(b1(b2(b3(x))(xy)))                              4         12         14
              7 b0(b1(b2(b3(x))(xy)))                              5         16         19
              8 (x))(()                                            1          1          3    --  I like the output for this case a lot better than before
              8 (x))(()                                            2          5
              8 (x))(()                                            3          6          7

              • 19. Re: Matching ( in a string
                mathguy

                There is another approach - the first one that came to mind... I dismissed it because it reads the base rows twice, but perhaps even so it may be faster in some cases?

                 

                We can read each row twice - once to identify the positions of opening parentheses only (so no need for regexp), and once to identify the position of closing perntheses. Then UNION ALL. Would that be faster than creating one row per character in the input string, especially if strings are long and parentheses are relatively rare (for example, only ten parentheses pairs in strings that are otherwise 500 characters on average?) The rest of the query can be as in any of the solutions offered so far.

                 

                Cheers,   -   mathguy

                • 20. Re: Matching ( in a string
                  Stew Ashton

                  If we want or need to optimize away the REGEXP_SUBSTR, we can use UNPIVOT instead of UNION ALL.

                   

                  with data as ( select 'b0(b1(b2(b3(x))(xy)))' as str from dual) 
                  , paren_positions as ( 
                    select paren_pos, num_paren,
                      sum(num_paren) over(order by paren_pos)
                      + case when num_paren = 1 then 0 else 1 end
                      as num_parens
                    from (
                      select
                        instr(str,'(',1,level) pos_start,
                        instr(str, ')',1,level) pos_end
                      from data
                      connect by instr(str,'(',1,level) > 0 or instr(str,')',1,level) > 0  
                    )
                    unpivot( paren_pos for num_paren in ( pos_start as 1, pos_end as -1 ) )
                    where paren_pos > 0
                  )
                  , paren_counts as (    
                    select a.*,
                      row_number()
                        over(partition by num_parens, num_paren order by paren_pos) as rn
                    from paren_positions a
                  )
                  select num_parens, rn,  
                    min(decode(num_paren, 1, paren_pos)) paren_start,    
                    max(decode(num_paren, -1, paren_pos)) paren_end    
                  from paren_counts
                  group by num_parens, rn  
                  order by paren_start; 
                  
                  • 21. Re: Matching ( in a string
                    Peter vd Zwan

                    Hi,

                     

                    Since oracle 12 we can also use MATCH_RECOGNIZE to use pattern recognition between rows. I just played with that and came to the following solution:

                     

                    with a (text_no,txt)as
                    (
                    select 1, ' ((Hello ( Hi Hi hi ( A B C ( D)) (EF)
                    why Whwy whyhhh )
                    )
                    )'                                      from dual union all
                    select 2, '(1+3*(3-1) + 3*(2+1))'       from dual union all
                    select 3, '()()*(())a()(())'            from dual union all
                    SELECT 4, 'no-parentheses'              FROM dual UNION ALL
                    SELECT 5, '((( unbalanced'              FROM dual UNION ALL
                    SELECT 6, ''                            FROM dual UNION ALL
                    SELECT 7, 'b0(b1(b2(b3(x))(xy)))'       FROM dual UNION ALL
                    SELECT 8, '(x))(()' FROM dual
                    )
                    , b as
                    (
                    select
                      substr(txt,level,1) s
                      ,level n
                      ,text_no
                      ,txt
                    from
                      a
                    connect by
                    text_no =  prior text_no
                    and substr(txt,level,1) is not null
                    and prior sys_guid() is not null
                    )
                    select
                      text_no
                      ,opening_bracket
                      ,closing_bracket
                      ,substr(txt,opening_bracket,closing_bracket - opening_bracket + 1) t
                    from
                    b
                    MATCH_RECOGNIZE (
                    partition by text_no
                    ORDER BY n
                    MEASURES
                      txt as txt
                      ,FIRST( N) AS opening_bracket
                      ,LAST( N) AS closing_bracket
                    one ROW PER MATCH
                    AFTER MATCH SKIP to next row
                    PATTERN (ob (ob | nb | cb)*? lcb)
                    DEFINE
                      ob as ob.s = '('
                      ,cb as cb.s = ')'
                      ,nb as nb.s not in ('(',')')
                      ,lcb as lcb.s = ')' and (count(ob.s) = count(cb.s) + 1)
                    ) MR
                    ;

                     

                    And the outcome is:

                     

                       TEXT_NO OPENING_BRACKET CLOSING_BRACKET T                                                         
                    ---------- --------------- --------------- ------------------------------------------------------------
                             1               2              60 ((Hello ( Hi Hi hi ( A B C ( D)) (EF)                       
                                                               why Whwy whyhhh )                                           
                                                               )                                                           
                                                               )                                                           

                             1               3              58 (Hello ( Hi Hi hi ( A B C ( D)) (EF)                        
                                                               why Whwy whyhhh )                                           
                                                               )                                                           

                             1              10              56 ( Hi Hi hi ( A B C ( D)) (EF)                               
                                                               why Whwy whyhhh )                                           

                             1              21              33 ( A B C ( D))                                               
                             1              29              32 ( D)                                                        
                             1              35              38 (EF)                                                        
                             2               1              21 (1+3*(3-1) + 3*(2+1))                                       
                             2               6              10 (3-1)                                                       
                             2              16              20 (2+1)                                                       
                             3               1               2 ()                                                          
                             3               3               4 ()                                                          
                             3               6               9 (())                                                        
                             3               7               8 ()                                                          
                             3              11              12 ()                                                          
                             3              13              16 (())                                                        
                             3              14              15 ()                                                          
                             7               3              21 (b1(b2(b3(x))(xy)))                                         
                             7               6              20 (b2(b3(x))(xy))                                             
                             7               9              15 (b3(x))                                                     
                             7              12              14                                                          
                             7              16              19 (xy)                                                        
                             8               1               3                                                          
                             8               6               7 ()                                                          

                    23 rows selected

                     

                     

                     

                     

                    Regards,

                    Peter

                    • 22. Re: Matching ( in a string
                      BrendanP

                      You can also do this using the v12.1 feature where a PL/SQL function can be defined within an SQL statement. It's maybe more complicated, but a lot faster.

                       

                      CREATE OR REPLACE TYPE bra_rec_type IS OBJECT (o_pos INTEGER, c_pos INTEGER, str VARCHAR2(4000));
                      /
                      CREATE TYPE bra_lis_type IS VARRAY(4000) OF bra_rec_type;
                      /
                      WITH FUNCTION Parse_Brackets (p_str VARCHAR2) RETURN bra_lis_type IS
                        c_n_ob       CONSTANT PLS_INTEGER := Length (p_str) - Length (Replace (p_str, '(', ''));
                        l_ob_lis              SYS.ODCINumberList := SYS.ODCINumberList();
                        l_cb_lis              SYS.ODCINumberList := SYS.ODCINumberList();
                        TYPE b_rec_type   IS  RECORD (pos INTEGER, diff INTEGER);
                        TYPE b_lis_type   IS  VARRAY(32767) OF b_rec_type;
                        l_b_lis               b_lis_type := b_lis_type(NULL);
                        l_bra_lis             bra_lis_type := bra_lis_type();
                        n_b                   PLS_INTEGER := 0;
                        n_ob                  PLS_INTEGER := 0;
                        n_cb                  PLS_INTEGER := 0;
                        l_chr                 VARCHAR2(1);
                        l_o_diff              PLS_INTEGER;
                      BEGIN
                      
                      
                        IF c_n_ob = 0 THEN
                          RETURN NULL;
                        END IF;
                        l_ob_lis.EXTEND (c_n_ob);
                        l_bra_lis.EXTEND (c_n_ob);
                        l_cb_lis.EXTEND (c_n_ob);
                        l_b_lis.EXTEND (c_n_ob + c_n_ob);
                      
                        FOR i IN 1..Length (p_str) LOOP
                      
                          l_chr := Substr (p_str, i, 1);
                          IF l_chr NOT IN ('(', ')') THEN CONTINUE; END IF;
                      
                      
                          n_b := n_b + 1;
                          l_b_lis(n_b).pos := i;
                      
                          IF l_chr = '(' THEN
                            n_ob := n_ob + 1;
                            l_ob_lis(n_ob) := n_b;
                          ELSE
                            n_cb := n_cb + 1;
                            l_cb_lis(n_cb) := n_b;
                          END IF;
                      
                      
                          l_b_lis(n_b).diff := n_ob - n_cb;
                          DBMS_Output.Put_Line (l_b_lis(n_b).pos || ': ' || l_b_lis(n_b).diff);
                      
                        END LOOP;
                      
                      
                        FOR i IN 1..n_ob LOOP
                      
                      
                          l_o_diff := l_b_lis (l_ob_lis(i)).diff;
                          FOR j IN 1..n_cb LOOP
                      
                      
                            IF l_b_lis (l_cb_lis(j)).pos < l_b_lis (l_ob_lis(i)).pos THEN CONTINUE; END IF;
                            IF l_o_diff = l_b_lis (l_cb_lis(j)).diff + 1 THEN
                      
                      
                              l_bra_lis(i) := bra_rec_type (l_b_lis(l_ob_lis(i)).pos, l_b_lis(l_cb_lis(j)).pos, Substr (p_str, l_b_lis(l_ob_lis(i)).pos, l_b_lis(l_cb_lis(j)).pos - l_b_lis(l_ob_lis(i)).pos + 1));
                              EXIT;
                      
                      
                            END IF;
                      
                      
                          END LOOP;
                          DBMS_Output.Put_Line (l_bra_lis(i).o_pos || ' - ' || l_bra_lis(i).c_pos || ' : ' || l_bra_lis(i).str);
                      
                        END LOOP;
                        RETURN l_bra_lis;
                      
                      
                      END;
                      SELECT /*+ WFB_QRY gather_plan_statistics */
                          b.id, t.o_pos, t.c_pos, t.str
                        FROM bracket_strings b
                        OUTER APPLY TABLE (Parse_Brackets (Nvl (b.str, ' '))) t
                       ORDER BY 1, 2
                      
                      1 2 Previous Next