Forum Stats

  • 3,768,187 Users
  • 2,252,756 Discussions
  • 7,874,485 Comments

Discussions

Search a clob for a pattern

User_G3NSO
User_G3NSO Member Posts: 9 Green Ribbon


I am trying to create a function that searches a CLOB for a specific term. In this case 'hello world' ONLY!!!


Note some of the text within the CLOB can contain LF and some may not. I provided some sample data off both.


In this particular example I'm looking to search through the CLOB to find the starting and ending positions of the phrase 'hello world'  ONLY.


I know to find the starting position I need to use the INSTR command and the ending position should be INSTR+LENGTH of the phrase I'm searching for.


I haven't added that code yet as I am unable to get the parsing working correctly yet.


If you run the sample data I provided you can I'm getting back more than the phrase 'hello world'


Once the parse works, I'm looking to generate the following output (see below).


id  start_pos  end_pos  search_term


Note in my function the number 4000 is hard coded as I recall that being the max limit of a VARCHAR2. I think it's been extended to 32767 in 12c and up but that should be a non issue here as I only have a small amount of data.


Any help would be greatly appreciated. Below is my  test CASE and im testing on livesql so my environment can be replicated if needed.



create table t ( x int primary key, y clob );

insert into t values ( 1, 'line 1 line 2 line 3 line 4 line 5 hello world' );

insert into t values ( 2, 'line 1 hello world line 2 line 3 line 4 line 5 hello world' );

insert into t values ( 3, 'line 1 hello world line 2 line 3 line 4 line 5

hello world' );

declare
 l_text long;
 begin
 for i in 1 .. 3000
 loop
 l_text := l_text || 'line ' || i || chr(10);
 end loop;
 l_text := l_text || 'hello world 3001';
 insert into t values ( 4, l_text );
 l_text := l_text || chr(10) || 'line 3002';
 insert into t values ( 5, l_text );
 l_text := 'hello world 0' || chr(10) || l_text;
 insert into t values ( 6, l_text );
 end;

create or replace function search_clob( p_lob in clob, p_what in varchar2 ) return varchar2
 as
 l_text long;
 l_instr number;
 begin
 l_instr := dbms_lob.instr( p_lob, p_what );
 l_instr := dbms_lob.instr( p_lob, chr(10), l_instr );
 if ( l_instr = 0 ) then l_instr := dbms_lob.getlength(p_lob); end if;
 if ( l_instr < 4000 )
 then
 l_text := rtrim(chr(10) || dbms_lob.substr( p_lob, l_instr, 1 ), chr(10));
 else
 l_text := rtrim(chr(10) || dbms_lob.substr( p_lob, 4000, l_instr-3999 ), chr(10));
 end if;
 return substr( l_text, instr(l_text,chr(10), -1, 1 )+1 );
 end;
 

select x,search_clob( y, 'hello world' ) y, dbms_lob.getlength(y) len
 from t
 where dbms_lob.instr( y, 'hello world' ) > 0;


Tagged:
«1

Answers

  • User_H3J7U
    User_H3J7U Member Posts: 640 Silver Trophy

    I know to find the starting position I need to use the INSTR command and the ending position should be INSTR+LENGTH of the phrase I'm searching for.


    I haven't added that code yet as I am unable to get the parsing working correctly yet.

    What is a "parsing"??

     I think it's been extended to 32767 in 12c 

    Database reference/max_string_size.

    l_text long;
    

    PL/SQL type long is varchar2(32760). Do not use it,  it is misleading.

    .

  • mathguy
    mathguy Member Posts: 10,151 Blue Diamond

    Before we get too deep in the weeds, let's make sure we have a clear statement of the problem.

    You say "search for a phrase." Do you need to search for it case-sensitive? Or should "Hello" match "hello"? Or do you want a function parameter to indicate whether search should be case sensitive or not? (Perhaps with a default if the parameter is not used?)

    Can the word "hello" appear at the end of a line of text, then a newline, and then the word "world" at the beginning of the next line? Can there also be one (or more) blank line(s) in between?

    Should the function find a match for "hello world" in the CLOB containing " ... Othello worldview ..." ?

    I won't continue - I will let you attempt a complete, non-ambiguous phrasing of the requirement; I will ask follow-up questions if something remains unclear.

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 41,205 Red Diamond

    Hi, @User_G3NSO

    Thanks for posting the sample data. Don't forget to post the complete, exact results you want from the given sample data, and an explanation of exactly what you want. In particular, what do you want when 'hello world' occurs more than once in y (as in the row where x=2)? What do you want if 'hello' and/or 'world' are not complete words? What if there is punctuation around them? For example:

    • 'ABC dhello world EFG'
    • 'HIJ hello worlds'
    • 'Somebody said "hello world" a minute ago'

    If you need to handle special cases like these, include examples in your sample data, results and explanation.

    Note in my function the number 4000 is hard coded as I recall that being the max limit of a VARCHAR2. I think it's been extended to 32767 in 12c and up but that should be a non issue here as I only have a small amount of data.

    I'm confused. Are you saying that the length of y will never be more than 4000 bytes? If so, why are you using CLOBs? If you are using CLOBs, then any limitations of VARCHAR2s won't matter. What role does the function play in this problem or its solution? If you're looking for the postition of 'hello world' (which is a NUMBER) why does the function return a VARCHAR2?

  • User_G3NSO
    User_G3NSO Member Posts: 9 Green Ribbon

    @Frank Kulash sorry to have wasted your time as this is becoming to obfuscated.

    I'm going to start simply by passing in CLOBS and a search pattern.

    Since I want to keep the code encapsulated I'm going to try and create a function to return starting/ending positions.


    Since a function could return more than one row, the return value will be grouped as a list of : start_pos_1, end_pos_1 ; start_pos_2, end_pos_2 ; start_pos_3, end_pos_3

    Apologies for wasting your time. Great way to spend a Sunday huh

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 41,205 Red Diamond

    Hi, @User_G3NSO

    @Frank Kulash sorry to have wasted your time as this is becoming to obfuscated.

    Why single me out? Other people made similar replies, too. Be sure to answer all the questions asked in those replies.

    Since a function could return more than one row,

    It's not clear what you want. Normally, a function doesn't return any rows; it returns a single value (which can be a cursor, which can be used later to produce multiple rows). There are Pipelined Functions, that do return rows; is that what you want?

    the return value will be grouped as a list of : start_pos_1, end_pos_1 ; start_pos_2, end_pos_2 ; start_pos_3, end_pos_3

    Once again, post the exact results you want from the given sample data. Are you looking for a string, with pairs of positions (beginning and ending positions) separated by commas, and (when the pattern occurs multiple times in the same string) semicolons delimiting the pairs? That's probably not a good idea. How do you plan to use the output?

  • User_G3NSO
    User_G3NSO Member Posts: 9 Green Ribbon

    @Frank Kulash A request came in from a user to be able to see exactly where a word or phrase appears in a CLOB in addition to seeing if it exists in the CLOB.


    The application team asked for the starting and ending positions so they can highlight with a color so it could stand out for users.


    While looking at some data I noticed that some CLOBS had many LF, some had none, some words appearing in the pattern weren't on the same lines....


    What I was thinking was passing in CLOBS and a pattern into a function like I showed in my example and RETURN the starting and ending positions of each phrase being searched. By making this a generic  (nothing hard coded) like table name, column name... other departments could take advantage of this code just by supplying the wrapper like I did in my example.


    I don't like the idea of providing different varieties of the same solution as it becomes difficult to maintain, hence the function or procedure idea.


    So I want to step back and forget about the LF for now and RETURN starting and ending positions for a match and let the application team scrub their data before storing it.

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 41,205 Red Diamond

    Hi, @User_G3NSO

    Okay. As soon as you post the answers to the questions asked in the first three replies (especially "whta ar the exact results you want from the given sample data") we can begin.

  • User_G3NSO
    User_G3NSO Member Posts: 9 Green Ribbon

    @Frank Kulash

    I made the data simple and used only 3 rows for iinput. Nothing with LF.

    Expected output 
    Id  term start_pos end_pos
    1 hello world 36 46
    2 hello world 8 18
    2 hello world 48 58
    3 hello world 8 18
    3 hello world 48 58
    
    create table t ( x int primary key, y clob );
    
    insert into t values ( 1, 'line 1 line 2 line 3 line 4 line 5 hello world' );
    
    insert into t values ( 2, 'line 1 hello world line 2 line 3 line 4 line 5 hello world' );
    
    insert into t values ( 3, 'line 1 hello world line 2 line 3 line 4 line 5 hello world' );
    


    This is what I want to take in. As for the out, whatever is easiest to produce the required above mentioned output. Perhaps an object type, PIPELINED function? It doesn't matter to me!!! 


    I was thinking maybe use INSTR in a recursive sub-query factoring clause to produce the output as that would make the code legible and easier to read.


    create or replace function search_clob( p_lob in clob, p_what in varchar2 ) return varchar2
    ….
    ….
    


    The calling wrapper should be simple, something easy and flexible like this though I don't want the user to have to put in the term 'hello world' twice if possible. Note I don't care about the length I only need the positions


    select x,search_clob( y, 'hello world' ) y, dbms_lob.getlength(y) len
     from t
     where dbms_lob.instr( y, 'hello world' ) > 0;
    


    I think I answered all your questions

  • User_H3J7U
    User_H3J7U Member Posts: 640 Silver Trophy

    The application team asked for the starting and ending positions so they can highlight with a color so it could stand out for users.

    The search and highlighting are simpler to do at one place on the client. Furthermore, the length of some characters may differ in app charset and position will be wrong.

  • mathguy
    mathguy Member Posts: 10,151 Blue Diamond
    edited Sep 12, 2021 3:48PM

    Here is one way. I used your original data for testing. I added two more rows to make sure the function works correctly when the CLOB is null and when it is non-null but it does not contain the search phrase.

    insert into t values (7, null);
    insert into t values (8, 'no match');
    

    Newlines are not an issue; my question (which you did not dignify with an answer) was whether a newline BETWEEN the words "hello" and "world" should result in an automatic non-match. In this Reply I assume a newline between the search terms automatically makes it a non-match. Your users may not be happy about that, but that's on you.

    Note that I won't entertain follow-up questions like "what if my words, in the phrase I am searching for, are separated by a newline, and that should still be a match" (or "separated by two spaces instead of one" etc.) I gave you a chance to clarify the question before looking for solutions; you chose to ignore that, fine - but I won't answer the same question more than once.

    In the solution below I define a schema-level object type and corresponding schema-level table type to hold the output from the function. "Pipelined" is not essential; functions that "return rows" are known as table functions, and they may or may not be pipelined. In this solution I chose "pipelined" but that is not required.

    For a long time now, functions like INSTR and LENGTH work fine on CLOB, and that's what I used.

    create or replace type se_obj is object (s number, e number);
    /
    create or replace type se_tbl is table of se_obj;
    /
    
    create or replace function search_clob(p_lob clob, p_what varchar2)
      return se_tbl
      pipelined
    as
      len int := length(p_what);
      s   int;
      pos int := 1;
    begin
      if p_lob is not null then
        loop
          s := instr(p_lob, p_what, pos);
          exit when s = 0;
          pipe row (se_obj(s, s + len - 1));
          pos := s + len;
        end loop;
      end if;
      return;
    end;
    /
    

    You can call the function with hard-coded arguments for testing. If you need to use it in a SQL context, you can do something like I show below. Note the OUTER APPLY operator; you need that if you want to show the input rows with no matches.

    select t.x, l.s, l.e
    from   t outer apply search_clob(y, 'hello world') l
    ;
    
    
             X          S          E
    ---------- ---------- ----------
             1         36         46
             2          8         18
             2         48         58
             3          8         18
             3         49         59
             4      28894      28904
             5      28894      28904
             6          1         11
             6      28908      28918
             7                      
             8                      
    
    User_G3NSO