Forum Stats

  • 3,770,131 Users
  • 2,253,073 Discussions
  • 7,875,336 Comments

Discussions

Search a clob for a pattern

2»

Answers

  • User_G3NSO
    User_G3NSO Member Posts: 13 Green Ribbon

    @mathguy this looks great thanks for your time and expertise

  • User_H3J7U
    User_H3J7U Member Posts: 690 Silver Trophy

    @mathguy

    Do not mix clob and varchar2 expressions. This can lead to wrong result or error.

    Instead of

    len int := length(p_what);
    

    Use

    len int := length(to_clob(p_what));
    


  • mathguy
    mathguy Member Posts: 10,164 Blue Diamond

    Not sure what you mean. Either explain, or give an example where that matters, or whatever.

    I saw something you posted earlier; you may not be aware that LENGTH gives length in characters, and the position in INSTR is also the character position - regardless of whether the input is varchar2 or CLOB; so the possibility of different character sets (and therefore different byte length, etc.) is irrelevant.

    To phrase it another way: I believe for any varchar2 string str, length(str) = length(to_clob(str)); do you know different?

  • User_H3J7U
    User_H3J7U Member Posts: 690 Silver Trophy

    In multibyte databse charset clob is stored as al16utf16. This is leads to different length in characters of some characters.


  • Sayan Malakshinov
    Sayan Malakshinov Member Posts: 32 Blue Ribbon

    Using SQL Macro (tested on Oracle 19.11):

    CREATE OR REPLACE FUNCTION search_clob(input_clob clob, search_string varchar2) return varchar2 SQL_MACRO is
    BEGIN
      RETURN q'{
       select 
               nullif(instr(lower(search_clob.input_clob), search_clob.search_string, 1, level),0) start_pos
              ,nullif(instr(lower(search_clob.input_clob), search_clob.search_string, 1, level),0)
                + length(search_clob.search_string) as end_pos
            from dual
            connect by instr(lower(search_clob.input_clob), search_clob.search_string, 1, level)>0
      }';
    END;
    /
    select
      x, 
      length(y) clob_len,
      s.*
    from t
         outer apply search_clob(t.y, '1') s
    
  • mathguy
    mathguy Member Posts: 10,164 Blue Diamond

    OK, I see. The problem is caused precisely by the fact that in Oracle database, CLOB are not encoded in UTF-16, but rather in the older UCS-2 encoding. Perhaps the main difference is that "surrogate pairs" in UCS-2 are considered separate characters (and counted as such), while in UTF-16 they are considered a single character. The same emoji character in varchar2 (encoded in UTF-8 on my system) has character length 1, but when converted to clob it has character length 2, precisely because clob uses UCS-2 - in particular for counting characters - and not UTF-16. This distinction is to be kept in mind - if we find that the match is at characters 8-9, is that when "characters" are counted UCS2 style, or UTF-8 (and UTF-16) style?

    If the OP's text may include emoji and other such symbols, he will need to pay attention to that. The cure in my function is to use length2 on the varchar2 input, and - for consistency - instr2 also, so that the length is counted and the position is shown in UCS-2 characters.

    It seems that his more pressing issue has to do with case-insensitive matching (he just posted that question on Stack Overflow, making no reference to this thread). I'll just leave this where it is.

  • User_H3J7U
    User_H3J7U Member Posts: 690 Silver Trophy

    The problem is caused precisely by the fact that in Oracle database, CLOB are not encoded in UTF-16, but rather in the older UCS-2 encoding. 

    Note 257772.1

    Perhaps the main difference is that "surrogate pairs" in UCS-2 are considered separate characters (and counted as such), while in UTF-16 they are considered a single character

    select fl, length(fl), length2(fl), length(to_clob(fl)), length(cast(fl as nvarchar2(100)))
    from (select chr(4036986790)||chr(4036986806) fl from dual);
    
    FL LENGTH(FL) LENGTH2(FL) LENGTH(TO_CLOB(FL)) LENGTH(CAST(FLASNVARCHAR2(100)))
    -- ---------- ----------- ------------------- --------------------------------
    🇦🇶          2           4                   4                                4
    
    select * from nls_database_parameters where parameter like '%CHARACTERSET%';
    
    PARAMETER              VALUE
    ---------------------- ---------
    NLS_NCHAR_CHARACTERSET AL16UTF16
    NLS_CHARACTERSET       AL32UTF8
    


  • User_G3NSO
    User_G3NSO Member Posts: 13 Green Ribbon

    @mathguy

    @Frank Kulash

    Case and newlines  between the phrase don't matter for now so valid examples would be 'HELlo world' or 'hello


    world' are valid examples; an invalid example would be 'helloworld' or 'xhello world ' as there is no space separating the phrase to be MATCHED and there is no space between the 'x' and 'hello world'. 


    For now, I am only trying to ignore LF separating the phrase to be MATCHED. 


    To Frank's question the function returns a varchar instead of a number as I was trying to get a handle to see what I was up against. Though I don't have all the tools yet to make that assessment I'm starting to believe it isn't a good situation and the application team, if they're still around, needs to come up with a standardized approach to writing out CLOBS as this is a frustrating task.


    Perhaps seeing examples of a simple phrase starting at position 100 and ending at a ridiculous position of 150 (including LF)will sway them.


    Personally I would hope the LF could be removed and words would just be separated by a space. Therefore, my solution of starting/ending position into the CLOB would make it easier to seek to the correct position and highlight the entire phrase if matched.


     

    I hope this answers your questions. Let me know if you have any more. Thanks for taking the time to answer.  Your help and expertise are greatly appreciated.

  • mathguy
    mathguy Member Posts: 10,164 Blue Diamond

    @User_H3J7U

    Note 257772.1

    Not sure what that means. If it's a pointer to something in Oracle Support, please note that I don't have access - I am not a paying customer of Oracle.

    Your example seems to show that non-BMP characters, encoded as surrogate pairs in UTF-16, are counted as 2 characters (or, to be fair, in AL16UTF16; I don't know what differences, if any, are supposed to exist between UTF-16 and AL16UTF16). If this is so, then it's either a bug (in the LENGTH function) or an intentional departure from the standard on Oracle's part.