Skip to Main Content

SQL & PL/SQL

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

REGEXP Extract a word from any place in a string

user13117585Aug 19 2020 — edited Aug 24 2020

Hello everyone,

I have a set of data like this :

Hello MSG_WORLD. Life is great

THis is anothe [MSG_MESSAGE] and it's there

MSG_START and this one start the string.

Is there any way to extract the words starting with MSG from each example with one REGEXP?

Regards,

Comments

Frank Kulash

Hi,

Whenever you have a question, please post a little sample data (CREATE TABLE and INSERT statements, relevant columns only) for all the tables involved, and the exact results you want from that data, so that the people who want to help you can re-create the problem and test their ideas.

Explain, using specific examples, how you get those results from that data.

Always say what version of Oracle you're using (e.g. 12.2.0.2.0).

See the forum FAQ:

user13117585 wrote:

Hello everyone,

I have a set of data like this :

Hello MSG_WORLD. Life is great

THis is anothe [MSG_MESSAGE] and it's there

MSG_START and this one start the string.

Is there any way to extract the words starting with MSG from each example with one REGEXP?

Regards,

Sure.  Depending on what exactly you want, you don't even need regular expressions.

To see the part of str beginning with the first occurrence of 'MSG'

SUBSTR ( str

       , INSTR (str || 'MSG', 'MSG')

       )

If 'MSG' does not appear in str, the expression above is NULL.

EJ-Egyed

If you know that there is only one MSG per string, then a simple REGEXP_SUBSTR like this will work

WITH

    d

    AS

        (SELECT 'Hello MSG_WORLD. Life is great' AS str FROM DUAL

         UNION ALL

         SELECT 'THis is anothe [MSG_MESSAGE] and it''s there' AS str FROM DUAL

         UNION ALL

         SELECT 'MSG_START and this one start the string.' AS str FROM DUAL)

SELECT d.str, REGEXP_SUBSTR (d.str, 'MSG\w*') AS extracted_str

  FROM d;

If there is the potential for more than one MSG per string, then a solution like this might be needed

WITH

    d

    AS

        (SELECT 'Hello MSG_WORLD. Life is great' AS str FROM DUAL

         UNION ALL

         SELECT 'THis is anothe [MSG_MESSAGE] and it''s there' AS str FROM DUAL

         UNION ALL

         SELECT 'MSG_START and this one start the string and MSG_MIDDLE and MSG_ENDING'     AS str

           FROM DUAL),

    msgs AS (SELECT str, REGEXP_COUNT (str, 'MSG') AS msg_count FROM d)

  SELECT m.str,

         REGEXP_SUBSTR (m.str,

                        'MSG\w*',

                        1,

                        level_num),

         l.level_num

    FROM msgs m,

         (    SELECT LEVEL     AS level_num

                FROM DUAL

          CONNECT BY LEVEL <= (SELECT MAX (msg_count) FROM msgs)) l

   WHERE m.msg_count >= l.level_num

ORDER BY 1, 3;

In either solution, the same Regex expression is being used.  The word boundary may also need to be adjusted to fit your needs but in this case I tested with \w.

Paulzip

with data(str) as (

select

q'¬Hello MSG_WORLD. Life is great

THis is anothe [MSG_MESSAGE] and it's there

MSG_START and this one start the string.¬'

from dual

)

select level num, regexp_substr(str, 'MSG\w*', 1, level, 'i') MSG

from data

connect by level <= regexp_count(str, 'MSG\w*');

NUMMSG
1MSG_WORLD
2MSG_MESSAGE
3MSG_START

3 rows selected.

mathguy

user13117585 wrote:

Hello everyone,

I have a set of data like this :

Hello MSG_WORLD. Life is great

THis is anothe [MSG_MESSAGE] and it's there

MSG_START and this one start the string.

Is there any way to extract the words starting with MSG from each example with one REGEXP?

Regards,

You may think the question is clearly formulated, but it isn't. There are four solutions offered so far (in three answers), and with enough testing inputs, they will all produce different results from the same test data.

Here are just a few questions you need to answer:

  • Are you looking for MSG or for MSG_  (always requiring an underscore also)?

  • What do you mean by "word"? Any sequence of alphanumeric characters and/or underscore? For example, MSG_PAPA234  -  do you need to get the full "word", including the digits at the end? If not, what is a "word" in your business problem?

  • Do you need to "extract" the whole string, including the MSG_ part, or only the word that follows AFTER the MSG_ part?

  • Must MSG be the beginning of a word, or can it be somewhere in a longer word? For example, all the solutions proposed so far will return MSG_STOP if the input string contains FBMSG_STOP.  Is that what you need?
Paulzip

Yes, good point about the FBMSG_STOP type example.

So maybe something like this...

select level num, regexp_substr(str, '\W(MSG\w*)', 1, level, 'i', 1) MSG

from data

connect by level <= regexp_count(str, '\WMSG\w*');

user13117585

Thank you.

I will have a look at my dataset and see how I can deal with it.

Regards

jaramill

Read the FAQ guidelines that Frank posted on HOW TO ASK A QUESTION ON THE FORUMS....it'll help you in the future.  And mark this post as ANSWERED based on whomever posted the correct reply to your question.

mathguy

jaramill wrote:

And mark this post as ANSWERED based on whomever posted the correct reply to your question.

Did you read through the thread, to see what stage it has reached so far?

I asked for further clarification in Reply 4. I believe in Reply 6 the poster acknowledged that he needs to clarify the problem statement by looking at the dataset in more detail. If so, then your demand is premature.

jaramill

mathguy wrote:

jaramill wrote:

And mark this post as ANSWERED based on whomever posted the correct reply to your question.

Did you read through the thread, to see what stage it has reached so far?

yes I did, read through it,

mathguy wrote:

jaramill wrote:

And mark this post as ANSWERED based on whomever posted the correct reply to your question.

Did you read through the thread, to see what stage it has reached so far?

I asked for further clarification in Reply 4. I believe in Reply 6 the poster acknowledged that he needs to clarify the problem statement by looking at the dataset in more detail. If so, then your demand is premature.

No it's  not.  I said

" And mark this post as ANSWERED based on whomever posted the correct reply to your question."  So it's a reminder.

SpotlessFox

You may think the question is clearly formulated, but it isn't. There are four solutions offered so far (in three answers), talktosonic and with enough testing inputs, they will all produce different results from the same test data.

Solomon Yakobson

Paulzip wrote:

So maybe something like this...

Close:

SQL> with data as (select 'MSG_WORLD FBMSG_STOP' str from dual)

  2  select level num, regexp_substr(str, '\W(MSG\w*)', 1, level, 'i', 1) MSG

  3  from data

  4  connect by level <= regexp_count(str, '\WMSG\w*')

  5  /

      NUM MSG

---------- ----------

        1

SQL> with data as (select 'MSG_WORLD FBMSG_STOP' str from dual)

  2  select level num, regexp_substr(str, '(^|\W)(MSG\w*)', 1, level, 'i', 2) MSG

  3  from data

  4  connect by level <= regexp_count(str, '(^|\W)MSG\w*')

  5  /

      NUM MSG

---------- ----------

        1 MSG_WORLD

SQL>

SY.

chris227

Assumptions:

"Words" are separated with blanks and therefore dont contain blanks itself.

The words searched for start with MSG_ may be preceded by an optional square bracket.

There will be one word search for only per row.

select

regexp_replace (

  your_column

, '^.*?(\[?MSG_[^ ]*).*$'

,'\1'

)

from your_table

You may extend the non-matching characterlist [^ ]* depending on your requirements, adding a period for example:

[^ .]*

Or you may define the complemtary set: all characters that are alllowed in a word, therefore any other character will define the end of the word.

for example

, '^.*?(\[?MSG_[]A-Z]*)[^A-Z]*.*$'

As you can see from the answes given, there are plenty of possibilities and they all depend heavily on your exact requirements.

So each solution you will finallly get will match exactly one seet of requirements.

There are many pitfalls so the most important things in those cases is to define the requirements as exactly as possible.

Usually from that the regexp is writing itself.

1 - 12

Post Details

Added on Aug 19 2020
12 comments
2,554 views