This discussion is archived
1 2 Previous Next 25 Replies Latest reply: Dec 6, 2012 4:13 AM by chris227 Go to original post RSS
  • 15. Re: REGEXP_REPLACE - Is this strange Or Am I Missing something?
    chris227 Guru
    Currently Being Moderated
    Perhaps it's just a lack of understanding of how the regexp-machine works.
    May be one should be more careful with the use of .*.
    with t as
    (select q'[WEEKLY(.417,'THURSDAY')x]' repeat_interval from dual)
    SELECT
      repeat_interval,
      '|'||regexp_replace(repeat_interval,'.*','-')||'|' STRING5,
      '|'||regexp_replace(repeat_interval,'.*?','-')||'|' STRING5a,
      '|'||regexp_replace(repeat_interval,'(.*)','-')||'|' STRING5b,
      '|'||regexp_replace(repeat_interval,'.+','-')||'|' STRING5c,
      '|'||regexp_replace(repeat_interval,'.+?','-')||'|' STRING5d,
      '|'||regexp_replace(repeat_interval,'^(.*)$','-')||'|' STRING6
    FROM  t;
    
    REPEAT_INTERVAL     STRING5     STRING5A     STRING5B     STRING5C     STRING5D     STRING6
    WEEKLY(.417,'THURSDAY')x     |--|     |-W-E-E-K-L-Y-(-.-4-1-7-,-'-T-H-U-R-S-D-A-Y-'-)-x-|     |--|     |-|     |------------------------|     |-|
  • 16. Re: REGEXP_REPLACE - Is this strange Or Am I Missing something?
    Sven W. Guru
    Currently Being Moderated
    jeneesh wrote:
    I was thinking in line of what Sven explained in his first reply..But at second look after reading all the comments, me too feel it like a bug..

    @Alberto: (.*) works good confirms it as a bug I feel...
    I'm still thinking about it. And I come closer to the idea, that it is not a bug....

    Let me try to explain. I simplified the string a little just to rule out any strange side effects. Especially I replaced the double single quotes with "z". The effect is still the same.
    with t as
    (select 'wzTHURSDAYzx' repeat_interval from dual)
    SELECT
      repeat_interval,
      regexp_replace(repeat_interval,'.*?z([^z]+)z.*$','-\1-') STRING1,
      regexp_replace(repeat_interval,'.*?z([^z]+)z.*','-\1-') STRING2
    FROM  t;
    
    REPEAT_INTERVAL      STRING1          STRING2
    --------------------------------------------------------------------------
    wzTHURSDAYzx     -THURSDAY-     -THURSDAY-x
    So still same as before.
    * means grab greedly everything you can but try to match the rest.
    *? means grab non greedy everything you can until you find a match for the next char in the search string

    I think the problem arises from the "not z" expression ([^z]+).

    The regexp parser goes until the Y of "THURSDAY". This is still not a "z".
    The next char is a "z". At this point the "not z" logic will stop. However this position is now already parsed. Therefore nothing until the end of the string will now match with the rest of the expression.
    In string2 wil will finish at this point with the search. Therefore the remaining part ("x") will not be replaced. String1 is not allowed to finish, because the end of line char $ somehow requires to check everything else. Therefore in string1 the whole string is part of the replacement. The difference is how and when the search stopped.

    If we think this thourgh we can find a solution that is even more simple then the not expression.
    with t as
    (select 'wzTHURSDAYzx' repeat_interval from dual)
    SELECT
      repeat_interval,
      regexp_replace(repeat_interval,'.*z(.+)z.*','-\1-') STRING3
    FROM  t;
    
    REPEAT_INTERVAL          STRING3
    --------------------------------------------------------
    wzTHURSDAYzx     -THURSDAY-
    This might help to understand why I mean with the "not z" logic / parser stopping the string.
    with t as
    (select 'wzTHURSDAYzx' repeat_interval from dual)
    SELECT
      repeat_interval,
      regexp_replace(repeat_interval,'.*?z([^z]+).*','-\1-') STRING4,
      regexp_replace(repeat_interval,'.*?z([^z]+)','-\1-') STRING5
    FROM  t;
    
    
    REPEAT_INTERVAL      STRING4          STRING5
    --------------------------------------------------------------------------
    wzTHURSDAYzx     -THURSDAY--x-     -THURSDAY--x-
    The parser stopped after the "Y". ".*" is resolved to an empty string, that's why we see no difference between string4 and string5.
    Then a second replacement is done using the remaining "zx" part of the string.

    Edited by: Sven W. on Dec 5, 2012 5:52 PM
  • 17. Re: REGEXP_REPLACE - Is this strange Or Am I Missing something?
    jeneesh Guru
    Currently Being Moderated
    Well said..:)

    But one point is still open:how it works when we use (.*) as Alberto shown..

    Same was posted by Ranit also..
  • 18. Re: REGEXP_REPLACE - Is this strange Or Am I Missing something?
    Sven W. Guru
    Currently Being Moderated
    jeneesh wrote:
    I was thinking in line of what Sven explained in his first reply..But at second look after reading all the comments, me too feel it like a bug..

    @Alberto: (.*) works good confirms it as a bug I feel...
    I think
    (.*)?
    is something different than
    .*?
    . We can't replace one with the other.
    "(.* )?" means: any char (greedy) and this expresssion zero or one time. And this is the same as simply ". *"
    ".* ?" means any char (non-greedy)

    Edited by: Sven W. on Dec 5, 2012 6:14 PM -- added blanks to prevent forum text bolding
  • 19. Re: REGEXP_REPLACE - Is this strange Or Am I Missing something?
    jeneesh Guru
    Currently Being Moderated
    If you check Ranit post, adding brackets for the .* at the end of the pattern (without ? Mark)also is giving correct output...

    Sorry: I am posting from my mobile, cant test any codes..
  • 20. Re: REGEXP_REPLACE - Is this strange Or Am I Missing something?
    AlbertoFaenza Expert
    Currently Being Moderated
    Hi Sven,

    I might be wrong but the question mark tells the engine to to attempt to match the preceding token zero times or once, in effect making it optional.
    If you use this operator after a star, which means "match the preceding token zero times or n times" it might be a bit confusing for the engine.
    I have no idea how the engine interprets this .{noformat}*{noformat}? but if I want to make optional probably you can use only .* without the ? which is redundant.

    In effect coming to the initial question if we remove the question mark everything seems to work fine:
    with t as
    (select 'WEEKLY(.417,''THURSDAY'')x' repeat_interval from dual)
    SELECT
      repeat_interval,
      regexp_replace(repeat_interval,'.*?''([^'']+)''.*$','\1') STRING1,
      regexp_replace(repeat_interval,'.*?''([^'']+)''.*','\1') STRING2,
      regexp_replace(repeat_interval,'.*''([^'']+)''.*$','\1') STRING3,
      regexp_replace(repeat_interval,'.*''([^'']+)''.*','\1') STRING4
    FROM  t;
    
    REPEAT_INTERVAL          STRING1         STRING2         STRING3         STRING4        
    ------------------------ --------------- --------------- --------------- ---------------
    WEEKLY(.417,'THURSDAY')x THURSDAY        THURSDAY)x      THURSDAY        THURSDAY       
    Regards.
    Al
  • 21. Re: REGEXP_REPLACE - Is this strange Or Am I Missing something?
    Sven W. Guru
    Currently Being Moderated
    Alberto Faenza wrote:
    Hi Sven,

    I might be wrong but the question mark tells the engine to to attempt to match the preceding token zero times or once, in effect making it optional.
    There are two versions of ?.
    One is the standalone version which means zero or one. But in combination with another char like *?, +? or ?? it makes this other char "non-greedy".
    For *? this means it will match the empty string whenever possible.

    See also: http://docs.oracle.com/cd/B28359_01/appdev.111/b28424/adfns_regexp.htm

    Edited by: Sven W. on Dec 5, 2012 6:37 PM - link added
  • 22. Re: REGEXP_REPLACE - Is this strange Or Am I Missing something?
    AlbertoFaenza Expert
    Currently Being Moderated
    Sven W. wrote:
    Alberto Faenza wrote:
    Hi Sven,

    I might be wrong but the question mark tells the engine to to attempt to match the preceding token zero times or once, in effect making it optional.
    There are two versions of ?.
    One is the standalone version which means zero or one. But in combination with another char like *?, +? or ?? it makes this other char "non-greedy".
    For *? this means it will match the empty string whenever possible.

    See also: http://docs.oracle.com/cd/B28359_01/appdev.111/b28424/adfns_regexp.htm

    Edited by: Sven W. on Dec 5, 2012 6:37 PM - link added
    Hi Sven,

    correct according to the documentation. I'm a bit confused then. :-)

    Regards.
    Al
  • 23. Re: REGEXP_REPLACE - Is this strange Or Am I Missing something?
    BobLilly Expert
    Currently Being Moderated
    More food for thought...

    Seems to me that using {0,} should be equivalent to using * (and {0,}? should be the same as *?). Using this form in the example from the first posting in this thread produces the expected result...
    with t as
    (select 'WEEKLY(.417,''THURSDAY'')x' repeat_interval from dual)
    SELECT
      repeat_interval,
      regexp_replace(repeat_interval,'.{0,}?''([^'']+)''.{0,}$','\1') STRING1,
      regexp_replace(repeat_interval,'.{0,}?''([^'']+)''.{0,}','\1') STRING2
    FROM  t
    REPEAT_INTERVAL          STRING1              STRING2              
    ------------------------ -------------------- -------------------- 
    WEEKLY(.417,'THURSDAY')x THURSDAY             THURSDAY             
    and with regard to Sven's question about the 3 dashes, it looks like oracle is making 2 replacements--the first replaces the enitire string, and the second replaces the end of the string ??!!
    with t as
    (select q'[WEEKLY(.417,'THURSDAY')x]' repeat_interval from dual)
    SELECT
      repeat_interval,
      '|'||regexp_replace(repeat_interval,'(.*)','-\1-')||'|' STRING4,
      '|'||regexp_replace(repeat_interval,'(.*)','-\1-',1,1)||'|' STRING5,
      '|'||regexp_replace(repeat_interval,'(.*)','-\1-',1,2)||'|' STRING6
    FROM  t
    REPEAT_INTERVAL          STRING4                        STRING5                        STRING6                        
    ------------------------ ------------------------------ ------------------------------ ------------------------------ 
    WEEKLY(.417,'THURSDAY')x |-WEEKLY(.417,'THURSDAY')x---| |-WEEKLY(.417,'THURSDAY')x-|   |WEEKLY(.417,'THURSDAY')x--|   
    Changing the pattern to '^(.*)$' or '(.+)' prevents the second replacement, but it does seem strange.

    Regards,
    Bob
  • 24. Re: REGEXP_REPLACE - Is this strange Or Am I Missing something?
    AlbertoFaenza Expert
    Currently Being Moderated
    BobLilly wrote:
    More food for thought...

    Changing the pattern to '^(.*)$' or '(.+)' prevents the second replacement, but it does seem strange.
    I also noticed that but have a look:
    with t as
    (select 'aaa' repeat_interval from dual)
    SELECT
      repeat_interval,
      regexp_replace(repeat_interval,'(.*)','-\1-') STRING1,
      regexp_replace(repeat_interval,'^(.*)','-\1-') STRING2,
      regexp_replace(repeat_interval,'(.*)$','-\1-') STRING3
    FROM  t;
    
    REPEAT_INTERVAL STRING1         STRING2         STRING3        
    --------------- --------------- --------------- ---------------
    aaa             -aaa---         -aaa-           -aaa---        
    If you put the start of line anchor (^) only then end of line is ignored.
    If you put end of line anchor ($) only then end of line is not ignored.

    Confusing.

    Regards.
    Al
  • 25. Re: REGEXP_REPLACE - Is this strange Or Am I Missing something?
    chris227 Guru
    Currently Being Moderated
    There are three questions to answer at least:

    1. How does .*? works?
    2. What result should the solution with .*? give?
    3. Is it it bug or not?

    1.
    .*? matches everything and tries to stop if a match occurs, immediatly, because the ? defines it as non-greedy.
    Siince it is not greedy it returns the minimum match, but this is not the first character it meets, it's nothing since it is defined with cardinality 0 zero (One can overcome this with using + instaed of * for illustration purposes).
    So the results of the first test-pattern must be read as:
    nothing replace with--
    character matched
    nothing replace with--
    character matched

    In the case under examination .*? is followed by ''. So the regexp has to test if the next occurence is ''. It is not, so it continues. It continues until it reaches the ''. That's why .*? matches the whole string before the '', because of the '' after it.

    One word about the scope of the regexp. Is it global, the means, all occurencies of the pattern are searched, or is it not, therefore only the first match is searched. Since in our case the occurence-paramter is left out, the scope is global.

    2.
    So the solution with
    .*?
    should give the right result, because the second * is greedy, it should match all the rest after the second (2nd because of the negated character class!) ''.

    3.
    But oracle does not work like it was suggested in 2.
    So let's test it in another enviroment. I did it with VBA in Excel 2010. May be someone will do this with perl or java.
    Sub reg_test()
    
    Dim s As String
    Dim r As String
    Dim regexp
    Dim field()
    
    field = Array( _
             "(.*?)", _
             "(.*)", _
             ".*?''([^'']+)''.*", _
             ".*''([^'']+)''.*" _
             )
    
    s = "WEEKLY(.417,''THURSDAY'')x"
    
    Set regexp = CreateObject("VBScript.regexp")
    
    regexp.Global = True
    
    For i = 0 To UBound(field)
        regexp.Pattern = field(i)
        
        r = regexp.Replace(s, "-$1-")
        
        Debug.Print field(i); Tab(1); r
    
    Next i
    
    End Sub
    
    (.*?)
    --W--E--E--K--L--Y--(--.--4--1--7--,--'--'--T--H--U--R--S--D--A--Y--'--'--)--x--
    (.*)
    -WEEKLY(.417,''THURSDAY'')x---
    .*?''([^'']+)''.*
    -THURSDAY-
    .*''([^'']+)''.*
    -THURSDAY-
    We see two things, it works like expected for the solution with .*?.
    .*? alone stops after the start and therefore replaced just ^ with -^-.

    Lets take a look on what oracle does:
    with t as
    (select 'WEEKLY(.417,''THURSDAY'')x' repeat_interval from dual)
    SELECT
      regexp_replace(repeat_interval,'(.*?)','-\1-') STRING1,
      regexp_replace(repeat_interval,'(.*)','-\1-') STRING2,
      regexp_replace(repeat_interval,'.*?''([^'']+)''.*','-\1-') STRING3,
      regexp_replace(repeat_interval,'.*''([^'']+)''.*','-\1-') STRING4
    FROM  t;
    
    STRING1     STRING2     STRING3     STRING4
    --W--E--E--K--L--Y--(--.--4--1--7--,--'--T--H--U--R--S--D--A--Y--'--)--x--     -WEEKLY(.417,'THURSDAY')x---     -THURSDAY-)x     -THURSDAY-
    So we could answer the third question: Yes, it is a bug, as .* is greedy and should match all the rest to the end. Obviously the first ? defines the non-greediness for this last * too.

    One final conclusion: .* is evil, in meanings of semantics and performance. One should try to avoid it.

    so one correct way for the origin case might be
    with t as
    (select 'WEEKLY(.417,''THURSDAY'')x' repeat_interval from dual)
    SELECT
      regexp_replace(repeat_interval,'[^'']*''([^'']+)''.*','-\1-') STRING1
    FROM  t;
    
    STRING1
    -THURSDAY-
    The only qeustion open to me now is:
    Why in the second and third case of the tests, there is a final --?

    It is because of the final void of the string, indicating the end of the string. It is matched separatly.
    One can see this in VBA
    Sub reg_test()
    
    Dim s As String
    Dim r As String
    Dim regexp
    Dim field()
    
    field = Array( _
             ".", _
             ".*", _
             "(.*)" _
             )
    
    s = ""
    
    Set regexp = CreateObject("VBScript.regexp")
    regexp.Global = True
    'regexp.Global = False
    
    For i = 0 To UBound(field)
        regexp.Pattern = field(i)
        
        r = regexp.Replace(s, "-")
        
        Debug.Print field(i); Tab(1); r
    
    Next i
    
    End Sub
    .
    
    .*
    -
    (.*)
    -
    
    Same for oralce, with one exception
    
    with t as
    (select 'x' s from dual
    union all
    select '' s from dual
    )
    SELECT
      regexp_replace(s,'.','-') STRING1,
      regexp_replace(s,'.*','-') STRING2,
      regexp_replace(s,'(.*)','-') STRING3
    FROM  t;
    
    STRING1     STRING2     STRING3
    -     --     --
              
    In case of '' oracle the regexp-fiunctions behave like other sql-functions and simple return null short circuit. So the behaviour is shown with x.
    Edited by: chris227 on 06.12.2012 04:08
1 2 Previous Next

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points