## Forum Stats

• 3,770,131 Users
• 2,253,073 Discussions
• 7,875,336 Comments

Discussions

# Search a clob for a pattern

2»

## Answers

• Member Posts: 13 Green Ribbon

@mathguy this looks great thanks for your time and expertise

• Member Posts: 690 Silver Trophy

Do not mix clob and varchar2 expressions. This can lead to wrong result or error.

Instead of

```len int := length(p_what);
```

Use

```len int := length(to_clob(p_what));
```

• Member Posts: 10,164 Blue Diamond

Not sure what you mean. Either explain, or give an example where that matters, or whatever.

I saw something you posted earlier; you may not be aware that `LENGTH` gives length in characters, and the position in `INSTR` is also the character position - regardless of whether the input is `varchar2` or `CLOB`; so the possibility of different character sets (and therefore different byte length, etc.) is irrelevant.

To phrase it another way: I believe for any `varchar2` string `str`, `length(str) = length(to_clob(str))`; do you know different?

• Member Posts: 690 Silver Trophy

In multibyte databse charset clob is stored as al16utf16. This is leads to different length in characters of some characters.

• Member Posts: 32 Blue Ribbon

Using SQL Macro (tested on Oracle 19.11):

```CREATE OR REPLACE FUNCTION search_clob(input_clob clob, search_string varchar2) return varchar2 SQL_MACRO is
BEGIN
RETURN q'{
select
nullif(instr(lower(search_clob.input_clob), search_clob.search_string, 1, level),0) start_pos
,nullif(instr(lower(search_clob.input_clob), search_clob.search_string, 1, level),0)
+ length(search_clob.search_string) as end_pos
from dual
connect by instr(lower(search_clob.input_clob), search_clob.search_string, 1, level)>0
}';
END;
/
select
x,
length(y) clob_len,
s.*
from t
outer apply search_clob(t.y, '1') s
```
• Member Posts: 10,164 Blue Diamond

OK, I see. The problem is caused precisely by the fact that in Oracle database, CLOB are not encoded in UTF-16, but rather in the older UCS-2 encoding. Perhaps the main difference is that "surrogate pairs" in UCS-2 are considered separate characters (and counted as such), while in UTF-16 they are considered a single character. The same emoji character in `varchar2` (encoded in UTF-8 on my system) has character length 1, but when converted to `clob` it has character length 2, precisely because `clob` uses UCS-2 - in particular for counting characters - and not UTF-16. This distinction is to be kept in mind - if we find that the match is at characters 8-9, is that when "characters" are counted UCS2 style, or UTF-8 (and UTF-16) style?

If the OP's text may include emoji and other such symbols, he will need to pay attention to that. The cure in my function is to use `length2` on the `varchar2` input, and - for consistency - `instr2` also, so that the length is counted and the position is shown in UCS-2 characters.

It seems that his more pressing issue has to do with case-insensitive matching (he just posted that question on Stack Overflow, making no reference to this thread). I'll just leave this where it is.

• Member Posts: 690 Silver Trophy

The problem is caused precisely by the fact that in Oracle database, CLOB are not encoded in UTF-16, but rather in the older UCS-2 encoding.

Note 257772.1

Perhaps the main difference is that "surrogate pairs" in UCS-2 are considered separate characters (and counted as such), while in UTF-16 they are considered a single character

```select fl, length(fl), length2(fl), length(to_clob(fl)), length(cast(fl as nvarchar2(100)))
from (select chr(4036986790)||chr(4036986806) fl from dual);

FL LENGTH(FL) LENGTH2(FL) LENGTH(TO_CLOB(FL)) LENGTH(CAST(FLASNVARCHAR2(100)))
-- ---------- ----------- ------------------- --------------------------------
🇦🇶          2           4                   4                                4

select * from nls_database_parameters where parameter like '%CHARACTERSET%';

PARAMETER              VALUE
---------------------- ---------
NLS_NCHAR_CHARACTERSET AL16UTF16
NLS_CHARACTERSET       AL32UTF8
```

• Member Posts: 13 Green Ribbon

Case and newlines  between the phrase don't matter for now so valid examples would be 'HELlo world' or 'hello

world' are valid examples; an invalid example would be 'helloworld' or 'xhello world ' as there is no space separating the phrase to be MATCHED and there is no space between the 'x' and 'hello world'.

For now, I am only trying to ignore LF separating the phrase to be MATCHED.

To Frank's question the function returns a varchar instead of a number as I was trying to get a handle to see what I was up against. Though I don't have all the tools yet to make that assessment I'm starting to believe it isn't a good situation and the application team, if they're still around, needs to come up with a standardized approach to writing out CLOBS as this is a frustrating task.

Perhaps seeing examples of a simple phrase starting at position 100 and ending at a ridiculous position of 150 (including LF)will sway them.

Personally I would hope the LF could be removed and words would just be separated by a space. Therefore, my solution of starting/ending position into the CLOB would make it easier to seek to the correct position and highlight the entire phrase if matched.

I hope this answers your questions. Let me know if you have any more. Thanks for taking the time to answer.  Your help and expertise are greatly appreciated.

• Member Posts: 10,164 Blue Diamond

Note 257772.1

Not sure what that means. If it's a pointer to something in Oracle Support, please note that I don't have access - I am not a paying customer of Oracle.

Your example seems to show that non-BMP characters, encoded as surrogate pairs in UTF-16, are counted as 2 characters (or, to be fair, in AL16UTF16; I don't know what differences, if any, are supposed to exist between UTF-16 and AL16UTF16). If this is so, then it's either a bug (in the `LENGTH` function) or an intentional departure from the standard on Oracle's part.