Forum Stats

  • 3,740,414 Users
  • 2,248,254 Discussions
  • 7,861,241 Comments

Discussions

REGEXP Extract a word from any place in a string

user13117585
user13117585 Member Posts: 640 Bronze Badge
edited Aug 24, 2020 6:21AM in SQL & PL/SQL

Hello everyone,

I have a set of data like this :

Hello MSG_WORLD. Life is great

THis is anothe [MSG_MESSAGE] and it's there

MSG_START and this one start the string.

Is there any way to extract the words starting with MSG from each example with one REGEXP?

Regards,

Tagged:
user13117585

Answers

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 40,699 Red Diamond
    edited Aug 19, 2020 3:17PM

    Hi,

    Whenever you have a question, please post a little sample data (CREATE TABLE and INSERT statements, relevant columns only) for all the tables involved, and the exact results you want from that data, so that the people who want to help you can re-create the problem and test their ideas.

    Explain, using specific examples, how you get those results from that data.

    Always say what version of Oracle you're using (e.g. 12.2.0.2.0).

    See the forum FAQ:

    user13117585 wrote:Hello everyone, I have a set of data like this : Hello MSG_WORLD. Life is greatTHis is anothe [MSG_MESSAGE] and it's thereMSG_START and this one start the string.Is there any way to extract the words starting with MSG from each example with one REGEXP?Regards, 

    Sure.  Depending on what exactly you want, you don't even need regular expressions.

    To see the part of str beginning with the first occurrence of 'MSG'

    SUBSTR ( str       , INSTR (str || 'MSG', 'MSG')       )

    If 'MSG' does not appear in str, the expression above is NULL.

    user13117585
  • EJ-Egyed
    EJ-Egyed Member Posts: 125 Blue Ribbon
    edited Aug 19, 2020 5:09PM

    If you know that there is only one MSG per string, then a simple REGEXP_SUBSTR like this will work

    WITH    d    AS        (SELECT 'Hello MSG_WORLD. Life is great' AS str FROM DUAL         UNION ALL         SELECT 'THis is anothe [MSG_MESSAGE] and it''s there' AS str FROM DUAL         UNION ALL         SELECT 'MSG_START and this one start the string.' AS str FROM DUAL)SELECT d.str, REGEXP_SUBSTR (d.str, 'MSG\w*') AS extracted_str  FROM d;

    If there is the potential for more than one MSG per string, then a solution like this might be needed

    WITH    d    AS        (SELECT 'Hello MSG_WORLD. Life is great' AS str FROM DUAL         UNION ALL         SELECT 'THis is anothe [MSG_MESSAGE] and it''s there' AS str FROM DUAL         UNION ALL         SELECT 'MSG_START and this one start the string and MSG_MIDDLE and MSG_ENDING'     AS str           FROM DUAL),    msgs AS (SELECT str, REGEXP_COUNT (str, 'MSG') AS msg_count FROM d)  SELECT m.str,         REGEXP_SUBSTR (m.str,                        'MSG\w*',                        1,                        level_num),         l.level_num    FROM msgs m,         (    SELECT LEVEL     AS level_num                FROM DUAL          CONNECT BY LEVEL <= (SELECT MAX (msg_count) FROM msgs)) l   WHERE m.msg_count >= l.level_numORDER BY 1, 3;

    In either solution, the same Regex expression is being used.  The word boundary may also need to be adjusted to fit your needs but in this case I tested with \w.

    user13117585
  • Paulzip
    Paulzip Member Posts: 8,369 Blue Diamond
    edited Aug 19, 2020 5:16PM

    with data(str) as (

    select

    q'¬Hello MSG_WORLD. Life is great

    THis is anothe [MSG_MESSAGE] and it's there

    MSG_START and this one start the string.¬'

    from dual

    )

    select level num, regexp_substr(str, 'MSG\w*', 1, level, 'i') MSG

    from data

    connect by level <= regexp_count(str, 'MSG\w*');

    NUMMSG
    1MSG_WORLD
    2MSG_MESSAGE
    3MSG_START

    3 rows selected.

  • mathguy
    mathguy Member Posts: 9,836 Gold Crown
    edited Aug 19, 2020 7:18PM
    user13117585 wrote:Hello everyone, I have a set of data like this : Hello MSG_WORLD. Life is greatTHis is anothe [MSG_MESSAGE] and it's thereMSG_START and this one start the string.Is there any way to extract the words starting with MSG from each example with one REGEXP?Regards, 

    You may think the question is clearly formulated, but it isn't. There are four solutions offered so far (in three answers), and with enough testing inputs, they will all produce different results from the same test data.

    Here are just a few questions you need to answer:

    • Are you looking for MSG or for MSG_  (always requiring an underscore also)?

    • What do you mean by "word"? Any sequence of alphanumeric characters and/or underscore? For example, MSG_PAPA234  -  do you need to get the full "word", including the digits at the end? If not, what is a "word" in your business problem?

    • Do you need to "extract" the whole string, including the MSG_ part, or only the word that follows AFTER the MSG_ part?

    • Must MSG be the beginning of a word, or can it be somewhere in a longer word? For example, all the solutions proposed so far will return MSG_STOP if the input string contains FBMSG_STOP.  Is that what you need?
  • Paulzip
    Paulzip Member Posts: 8,369 Blue Diamond
    edited Aug 19, 2020 8:56PM

    Yes, good point about the FBMSG_STOP type example.

    So maybe something like this...

    select level num, regexp_substr(str, '\W(MSG\w*)', 1, level, 'i', 1) MSG

    from data

    connect by level <= regexp_count(str, '\WMSG\w*');

  • user13117585
    user13117585 Member Posts: 640 Bronze Badge
    edited Aug 20, 2020 2:34AM

    Thank you.

    I will have a look at my dataset and see how I can deal with it.

    Regards

  • jaramill
    jaramill Member Posts: 4,299 Gold Trophy
    edited Aug 20, 2020 10:16AM

    Read the FAQ guidelines that Frank posted on HOW TO ASK A QUESTION ON THE FORUMS....it'll help you in the future.  And mark this post as ANSWERED based on whomever posted the correct reply to your question.

  • mathguy
    mathguy Member Posts: 9,836 Gold Crown
    edited Aug 20, 2020 10:25AM
    jaramill wrote:And mark this post as ANSWERED based on whomever posted the correct reply to your question.

    Did you read through the thread, to see what stage it has reached so far?

    I asked for further clarification in Reply 4. I believe in Reply 6 the poster acknowledged that he needs to clarify the problem statement by looking at the dataset in more detail. If so, then your demand is premature.

  • jaramill
    jaramill Member Posts: 4,299 Gold Trophy
    edited Aug 20, 2020 10:57AM
    mathguy wrote:jaramill wrote:And mark this post as ANSWERED based on whomever posted the correct reply to your question.Did you read through the thread, to see what stage it has reached so far?

    yes I did, read through it,

    mathguy wrote:jaramill wrote:And mark this post as ANSWERED based on whomever posted the correct reply to your question.Did you read through the thread, to see what stage it has reached so far?I asked for further clarification in Reply 4. I believe in Reply 6 the poster acknowledged that he needs to clarify the problem statement by looking at the dataset in more detail. If so, then your demand is premature.

    No it's  not.  I said

    " And mark this post as ANSWERED based on whomever posted the correct reply to your question."  So it's a reminder.

  • SpotlessFox
    SpotlessFox Member Posts: 1
    edited Aug 24, 2020 6:21AM

    You may think the question is clearly formulated, but it isn't. There are four solutions offered so far (in three answers), talktosonic and with enough testing inputs, they will all produce different results from the same test data.

  • Solomon Yakobson
    Solomon Yakobson Member Posts: 18,499 Black Diamond
    edited Aug 22, 2020 8:49AM
    Paulzip wrote:So maybe something like this...

    Close:

    SQL> with data as (select 'MSG_WORLD FBMSG_STOP' str from dual)

      2  select level num, regexp_substr(str, '\W(MSG\w*)', 1, level, 'i', 1) MSG

      3  from data

      4  connect by level <= regexp_count(str, '\WMSG\w*')

      5  /

          NUM MSG

    ---------- ----------

            1

    SQL> with data as (select 'MSG_WORLD FBMSG_STOP' str from dual)

      2  select level num, regexp_substr(str, '(^|\W)(MSG\w*)', 1, level, 'i', 2) MSG

      3  from data

      4  connect by level <= regexp_count(str, '(^|\W)MSG\w*')

      5  /

          NUM MSG

    ---------- ----------

            1 MSG_WORLD

    SQL>

    SY.

  • chris227
    chris227 Member Posts: 3,513 Bronze Crown
    edited Aug 23, 2020 10:01AM

    Assumptions:

    "Words" are separated with blanks and therefore dont contain blanks itself.

    The words searched for start with MSG_ may be preceded by an optional square bracket.

    There will be one word search for only per row.

    selectregexp_replace (  your_column, '^.*?(\[?MSG_[^ ]*).*$','\1')from your_table

    You may extend the non-matching characterlist [^ ]* depending on your requirements, adding a period for example:

    [^ .]*

    Or you may define the complemtary set: all characters that are alllowed in a word, therefore any other character will define the end of the word.

    for example

    , '^.*?(\[?MSG_[]A-Z]*)[^A-Z]*.*$'

    As you can see from the answes given, there are plenty of possibilities and they all depend heavily on your exact requirements.

    So each solution you will finallly get will match exactly one seet of requirements.

    There are many pitfalls so the most important things in those cases is to define the requirements as exactly as possible.

    Usually from that the regexp is writing itself.

Sign In or Register to comment.