SQL & PL/SQL

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Search a clob for a pattern

PugzlySep 11 2021

I am trying to create a function that searches a CLOB for a specific term. In this case 'hello world' ONLY!!!

Note some of the text within the CLOB can contain LF and some may not. I provided some sample data off both.

In this particular example I'm looking to search through the CLOB to find the starting and ending positions of the phrase 'hello world' ONLY.

I know to find the starting position I need to use the INSTR command and the ending position should be INSTR+LENGTH of the phrase I'm searching for.

I haven't added that code yet as I am unable to get the parsing working correctly yet.

If you run the sample data I provided you can I'm getting back more than the phrase 'hello world'

Once the parse works, I'm looking to generate the following output (see below).

id start_pos end_pos search_term

Note in my function the number 4000 is hard coded as I recall that being the max limit of a VARCHAR2. I think it's been extended to 32767 in 12c and up but that should be a non issue here as I only have a small amount of data.

Any help would be greatly appreciated. Below is my test CASE and im testing on livesql so my environment can be replicated if needed.

create table t ( x int primary key, y clob );

insert into t values ( 1, 'line 1 line 2 line 3 line 4 line 5 hello world' );

insert into t values ( 2, 'line 1 hello world line 2 line 3 line 4 line 5 hello world' );

insert into t values ( 3, 'line 1 hello world line 2 line 3 line 4 line 5

hello world' );

declare
 l_text long;
 begin
 for i in 1 .. 3000
 loop
 l_text := l_text || 'line ' || i || chr(10);
 end loop;
 l_text := l_text || 'hello world 3001';
 insert into t values ( 4, l_text );
 l_text := l_text || chr(10) || 'line 3002';
 insert into t values ( 5, l_text );
 l_text := 'hello world 0' || chr(10) || l_text;
 insert into t values ( 6, l_text );
 end;

create or replace function search_clob( p_lob in clob, p_what in varchar2 ) return varchar2
 as
 l_text long;
 l_instr number;
 begin
 l_instr := dbms_lob.instr( p_lob, p_what );
 l_instr := dbms_lob.instr( p_lob, chr(10), l_instr );
 if ( l_instr = 0 ) then l_instr := dbms_lob.getlength(p_lob); end if;
 if ( l_instr < 4000 )
 then
 l_text := rtrim(chr(10) || dbms_lob.substr( p_lob, l_instr, 1 ), chr(10));
 else
 l_text := rtrim(chr(10) || dbms_lob.substr( p_lob, 4000, l_instr-3999 ), chr(10));
 end if;
 return substr( l_text, instr(l_text,chr(10), -1, 1 )+1 );
 end;
 

select x,search_clob( y, 'hello world' ) y, dbms_lob.getlength(y) len
 from t
 where dbms_lob.instr( y, 'hello world' ) > 0;

User_H3J7U

I know to find the starting position I need to use the INSTR command and the ending position should be INSTR+LENGTH of the phrase I'm searching for.

I haven't added that code yet as I am unable to get the parsing working correctly yet.
What is a "parsing"??
I think it's been extended to 32767 in 12c
Database reference/max_string_size.

l_text long;

PL/SQL type long is varchar2(32760). Do not use it, it is misleading.
.

mathguy

Before we get too deep in the weeds, let's make sure we have a clear statement of the problem.
You say "search for a phrase." Do you need to search for it case-sensitive? Or should "Hello" match "hello"? Or do you want a function parameter to indicate whether search should be case sensitive or not? (Perhaps with a default if the parameter is not used?)
Can the word "hello" appear at the end of a line of text, then a newline, and then the word "world" at the beginning of the next line? Can there also be one (or more) blank line(s) in between?
Should the function find a match for "hello world" in the CLOB containing " ... Othello worldview ..." ?
I won't continue - I will let you attempt a complete, non-ambiguous phrasing of the requirement; I will ask follow-up questions if something remains unclear.

Frank Kulash

Hi, @pugzly
Thanks for posting the sample data. Don't forget to post the complete, exact results you want from the given sample data, and an explanation of exactly what you want. In particular, what do you want when 'hello world' occurs more than once in y (as in the row where x=2)? What do you want if 'hello' and/or 'world' are not complete words? What if there is punctuation around them? For example:
'ABC dhello world EFG'
'HIJ hello worlds'
'Somebody said "hello world" a minute ago'
If you need to handle special cases like these, include examples in your sample data, results and explanation.
Note in my function the number 4000 is hard coded as I recall that being the max limit of a VARCHAR2. I think it's been extended to 32767 in 12c and up but that should be a non issue here as I only have a small amount of data.
I'm confused. Are you saying that the length of y will never be more than 4000 bytes? If so, why are you using CLOBs? If you are using CLOBs, then any limitations of VARCHAR2s won't matter. What role does the function play in this problem or its solution? If you're looking for the postition of 'hello world' (which is a NUMBER) why does the function return a VARCHAR2?

Pugzly

@frank-kulash sorry to have wasted your time as this is becoming to obfuscated.
I'm going to start simply by passing in CLOBS and a search pattern.
Since I want to keep the code encapsulated I'm going to try and create a function to return starting/ending positions.

Since a function could return more than one row, the return value will be grouped as a list of : start_pos_1, end_pos_1 ; start_pos_2, end_pos_2 ; start_pos_3, end_pos_3
Apologies for wasting your time. Great way to spend a Sunday huh

Frank Kulash

Hi, @pugzly
@frank-kulash sorry to have wasted your time as this is becoming to obfuscated.
Why single me out? Other people made similar replies, too. Be sure to answer all the questions asked in those replies.
Since a function could return more than one row,
It's not clear what you want. Normally, a function doesn't return any rows; it returns a single value (which can be a cursor, which can be used later to produce multiple rows). There are Pipelined Functions, that do return rows; is that what you want?
the return value will be grouped as a list of : start_pos_1, end_pos_1 ; start_pos_2, end_pos_2 ; start_pos_3, end_pos_3
Once again, post the exact results you want from the given sample data. Are you looking for a string, with pairs of positions (beginning and ending positions) separated by commas, and (when the pattern occurs multiple times in the same string) semicolons delimiting the pairs? That's probably not a good idea. How do you plan to use the output?

Pugzly

@frank-kulash A request came in from a user to be able to see exactly where a word or phrase appears in a CLOB in addition to seeing if it exists in the CLOB.

The application team asked for the starting and ending positions so they can highlight with a color so it could stand out for users.

While looking at some data I noticed that some CLOBS had many LF, some had none, some words appearing in the pattern weren't on the same lines....

What I was thinking was passing in CLOBS and a pattern into a function like I showed in my example and RETURN the starting and ending positions of each phrase being searched. By making this a generic (nothing hard coded) like table name, column name... other departments could take advantage of this code just by supplying the wrapper like I did in my example.

I don't like the idea of providing different varieties of the same solution as it becomes difficult to maintain, hence the function or procedure idea.

So I want to step back and forget about the LF for now and RETURN starting and ending positions for a match and let the application team scrub their data before storing it.

Frank Kulash

Hi, @pugzly
Okay. As soon as you post the answers to the questions asked in the first three replies (especially "whta ar the exact results you want from the given sample data") we can begin.

Pugzly

[Frank Kulash](/ords/forums/user/Frank Kulash)
I made the data simple and used only 3 rows for iinput. Nothing with LF.

Expected output 
Id  term start_pos end_pos
1 hello world 36 46
2 hello world 8 18
2 hello world 48 58
3 hello world 8 18
3 hello world 48 58

create table t ( x int primary key, y clob );

insert into t values ( 1, 'line 1 line 2 line 3 line 4 line 5 hello world' );

insert into t values ( 2, 'line 1 hello world line 2 line 3 line 4 line 5 hello world' );

insert into t values ( 3, 'line 1 hello world line 2 line 3 line 4 line 5 hello world' );

This is what I want to take in. As for the out, whatever is easiest to produce the required above mentioned output. Perhaps an object type, PIPELINED function? It doesn't matter to me!!!

I was thinking maybe use INSTR in a recursive sub-query factoring clause to produce the output as that would make the code legible and easier to read.

create or replace function search_clob( p_lob in clob, p_what in varchar2 ) return varchar2
….
….

The calling wrapper should be simple, something easy and flexible like this though I don't want the user to have to put in the term 'hello world' twice if possible. Note I don't care about the length I only need the positions

select x,search_clob( y, 'hello world' ) y, dbms_lob.getlength(y) len
 from t
 where dbms_lob.instr( y, 'hello world' ) > 0;

I think I answered all your questions

User_H3J7U

The application team asked for the starting and ending positions so they can highlight with a color so it could stand out for users.
The search and highlighting are simpler to do at one place on the client. Furthermore, the length of some characters may differ in app charset and position will be wrong.

mathguy

Here is one way. I used your original data for testing. I added two more rows to make sure the function works correctly when the CLOB is null and when it is non-null but it does not contain the search phrase.

insert into t values (7, null);
insert into t values (8, 'no match');

Newlines are not an issue; my question (which you did not dignify with an answer) was whether a newline BETWEEN the words "hello" and "world" should result in an automatic non-match. In this Reply I assume a newline between the search terms automatically makes it a non-match. Your users may not be happy about that, but that's on you.
Note that I won't entertain follow-up questions like "what if my words, in the phrase I am searching for, are separated by a newline, and that should still be a match" (or "separated by two spaces instead of one" etc.) I gave you a chance to clarify the question before looking for solutions; you chose to ignore that, fine - but I won't answer the same question more than once.
In the solution below I define a schema-level object type and corresponding schema-level table type to hold the output from the function. "Pipelined" is not essential; functions that "return rows" are known as table functions, and they may or may not be pipelined. In this solution I chose "pipelined" but that is not required.
For a long time now, functions like INSTR and LENGTH work fine on CLOB, and that's what I used.

create or replace type se_obj is object (s number, e number);
/
create or replace type se_tbl is table of se_obj;
/

create or replace function search_clob(p_lob clob, p_what varchar2)
  return se_tbl
  pipelined
as
  len int := length(p_what);
  s   int;
  pos int := 1;
begin
  if p_lob is not null then
    loop
      s := instr(p_lob, p_what, pos);
      exit when s = 0;
      pipe row (se_obj(s, s + len - 1));
      pos := s + len;
    end loop;
  end if;
  return;
end;
/

You can call the function with hard-coded arguments for testing. If you need to use it in a SQL context, you can do something like I show below. Note the OUTER APPLY operator; you need that if you want to show the input rows with no matches.

select t.x, l.s, l.e
from   t outer apply search_clob(y, 'hello world') l
;


         X          S          E
---------- ---------- ----------
         1         36         46
         2          8         18
         2         48         58
         3          8         18
         3         49         59
         4      28894      28904
         5      28894      28904
         6          1         11
         6      28908      28918
         7                      
         8

1 person found this helpful

Pugzly

@mathguy this looks great thanks for your time and expertise

User_H3J7U

mathguy
Do not mix clob and varchar2 expressions. This can lead to wrong result or error.
Instead of

len int := length(p_what);

Use

len int := length(to_clob(p_what));

mathguy

Not sure what you mean. Either explain, or give an example where that matters, or whatever.
I saw something you posted earlier; you may not be aware that LENGTH gives length in characters, and the position in INSTR is also the character position - regardless of whether the input is varchar2 or CLOB; so the possibility of different character sets (and therefore different byte length, etc.) is irrelevant.
To phrase it another way: I believe for any varchar2 string str, length(str) = length(to_clob(str)); do you know different?

User_H3J7U

In multibyte databse charset clob is stored as al16utf16. This is leads to different length in characters of some characters.

Sayan Malakshinov

Using SQL Macro (tested on Oracle 19.11):

CREATE OR REPLACE FUNCTION search_clob(input_clob clob, search_string varchar2) return varchar2 SQL_MACRO is
BEGIN
  RETURN q'{
   select 
           nullif(instr(lower(search_clob.input_clob), search_clob.search_string, 1, level),0) start_pos
          ,nullif(instr(lower(search_clob.input_clob), search_clob.search_string, 1, level),0)
            + length(search_clob.search_string) as end_pos
        from dual
        connect by instr(lower(search_clob.input_clob), search_clob.search_string, 1, level)>0
  }';
END;
/
select
  x, 
  length(y) clob_len,
  s.*
from t
     outer apply search_clob(t.y, '1') s

mathguy

OK, I see. The problem is caused precisely by the fact that in Oracle database, CLOB are not encoded in UTF-16, but rather in the older UCS-2 encoding. Perhaps the main difference is that "surrogate pairs" in UCS-2 are considered separate characters (and counted as such), while in UTF-16 they are considered a single character. The same emoji character in varchar2 (encoded in UTF-8 on my system) has character length 1, but when converted to clob it has character length 2, precisely because clob uses UCS-2 - in particular for counting characters - and not UTF-16. This distinction is to be kept in mind - if we find that the match is at characters 8-9, is that when "characters" are counted UCS2 style, or UTF-8 (and UTF-16) style?
If the OP's text may include emoji and other such symbols, he will need to pay attention to that. The cure in my function is to use length2 on the varchar2 input, and - for consistency - instr2 also, so that the length is counted and the position is shown in UCS-2 characters.
It seems that his more pressing issue has to do with case-insensitive matching (he just posted that question on Stack Overflow, making no reference to this thread). I'll just leave this where it is.

User_H3J7U

The problem is caused precisely by the fact that in Oracle database, CLOB are not encoded in UTF-16, but rather in the older UCS-2 encoding.
Note 257772.1
Perhaps the main difference is that "surrogate pairs" in UCS-2 are considered separate characters (and counted as such), while in UTF-16 they are considered a single character.

select fl, length(fl), length2(fl), length(to_clob(fl)), length(cast(fl as nvarchar2(100)))
from (select chr(4036986790)||chr(4036986806) fl from dual);

FL LENGTH(FL) LENGTH2(FL) LENGTH(TO_CLOB(FL)) LENGTH(CAST(FLASNVARCHAR2(100)))
-- ---------- ----------- ------------------- --------------------------------
🇦🇶          2           4                   4                                4

select * from nls_database_parameters where parameter like '%CHARACTERSET%';

PARAMETER              VALUE
---------------------- ---------
NLS_NCHAR_CHARACTERSET AL16UTF16
NLS_CHARACTERSET       AL32UTF8

Pugzly

@mathguy
@frank-kulash
Case and newlines between the phrase don't matter for now so valid examples would be 'HELlo world' or 'hello

world' are valid examples; an invalid example would be 'helloworld' or 'xhello world ' as there is no space separating the phrase to be MATCHED and there is no space between the 'x' and 'hello world'.

For now, I am only trying to ignore LF separating the phrase to be MATCHED.

To Frank's question the function returns a varchar instead of a number as I was trying to get a handle to see what I was up against. Though I don't have all the tools yet to make that assessment I'm starting to believe it isn't a good situation and the application team, if they're still around, needs to come up with a standardized approach to writing out CLOBS as this is a frustrating task.

Perhaps seeing examples of a simple phrase starting at position 100 and ending at a ridiculous position of 150 (including LF)will sway them.

Personally I would hope the LF could be removed and words would just be separated by a space. Therefore, my solution of starting/ending position into the CLOB would make it easier to seek to the correct position and highlight the entire phrase if matched.

I hope this answers your questions. Let me know if you have any more. Thanks for taking the time to answer. Your help and expertise are greatly appreciated.

mathguy

@user-h3j7u
Note 257772.1
Not sure what that means. If it's a pointer to something in Oracle Support, please note that I don't have access - I am not a paying customer of Oracle.
Your example seems to show that non-BMP characters, encoded as surrogate pairs in UTF-16, are counted as 2 characters (or, to be fair, in AL16UTF16; I don't know what differences, if any, are supposed to exist between UTF-16 and AL16UTF16). If this is so, then it's either a bug (in the LENGTH function) or an intentional departure from the standard on Oracle's part.

1 - 19

Added on Sep 11 2021

#function, #oracle, #sql

19 comments

7,717 views

4 mentions

SQL & PL/SQL

Search a clob for a pattern

Comments

Post Details