This discussion is archived
14 Replies Latest reply: Sep 12, 2013 5:24 AM by BluShadow RSS

Problem with REGEXP_SUBSTR

c1cf35c4-2107-4109-91ee-aaa09fbb4ce1 Newbie
Currently Being Moderated

Hello,

 

I'm trying to use the regexp_substr function in Oracle 10 to split a string into several lines (semicolon is the delimiter).

 

 

So I'm using:

 

Select Site_List, trim(regexp_substr(Site_List, '[^;]+, 1, level)) FROM (Select Site_List FROM ALL_SITES) Connect by  trim(regexp_substr(Site_List, '[^;]+, 1, level)) is not null

 

 

When I run this in SQL Developer, it's working (in a few seconds, but SQL Developer only shows the 50 first results).

 

Now, if I try to use a count(*) on this or to insert the query result in a table, it takes forever (I have only 375 lines in ALL_SITES table).

 

 

Is there something I did wrong ?

 

 

Thanks for your help.

  • 1. Re: Problem with REGEXP_SUBSTR
    BluShadow Guru Moderator
    Currently Being Moderated

    No, you didn't do anything wrong.

     

    Tools like SQL Developer, TOAD etc. tend to just return the first 50 rows (or however many) very quickly.  That's an indication that the query itself has executed quickly, but the time it takes to process and send you all of the data is clearly going to take longer, so if you do a count(*) it has to process everything first before it can count it.

     

    How many rows of data are in your table?  How many 'elements' are in each string that are gettnig split out?

     

    I assume you are running it on multiple rows?

    I, which case you may need to add to your connect by clause as you're creating a kind-of connect by cartesian product.

     

    e.g. With the following data we expect to get 5 rows from the two strings...

     

    SQL> ed
    Wrote file afiedt.buf

      1  with t as (select 'element1,element2,element3' as txt from dual union all
      2             select 'element4,element5' from dual)
      3  --
      4  select txt
      5        ,regexp_substr(txt,'[^,]+',1,level) as el
      6  from   t
      7* connect by regexp_substr(txt,'[^,]+',1,level) is not null
    SQL> /

     

    TXT                        EL
    -------------------------- --------------------------
    element1,element2,element3 element1
    element1,element2,element3 element2
    element1,element2,element3 element3
    element4,element5          element5
    element1,element2,element3 element3
    element4,element5          element4
    element1,element2,element3 element2
    element1,element2,element3 element3
    element4,element5          element5
    element1,element2,element3 element3

     

    10 rows selected.

     

    but we end up with more than we expected.

     

    If you try to ensure you just connect for each row you're processing...

     

    SQL> ed
    Wrote file afiedt.buf

      1  with t as (select 'element1,element2,element3' as txt from dual union all
      2             select 'element4,element5' from dual)
      3  --
      4  select txt
      5        ,regexp_substr(txt,'[^,]+',1,level) as el
      6  from   t
      7  connect by regexp_substr(txt,'[^,]+',1,level) is not null
      8*        and txt = prior txt
    SQL> /


    ERROR:
    ORA-01436: CONNECT BY loop in user data

     

    no rows selected

     

    ... you get a connect by loop (cycle) in your data.

     

    This can be got around by making each cycle unique...

     

    SQL> ed
    Wrote file afiedt.buf

      1  with t as (select 'element1,element2,element3' as txt from dual union all
      2             select 'element4,element5' from dual)
      3  --
      4  select txt
      5        ,regexp_substr(txt,'[^,]+',1,level) as el
      6  from   t
      7  connect by regexp_substr(txt,'[^,]+',1,level) is not null
      8         and txt = prior txt
      9*        and prior sys_guid() is not null
    SQL> /

     

    TXT                        EL
    -------------------------- --------------------------
    element1,element2,element3 element1
    element1,element2,element3 element2
    element1,element2,element3 element3
    element4,element5          element4
    element4,element5          element5

     

    so now we're getting the 5 rows we expected.

     

    I suspect you're generating far more rows than you planned. 

  • 2. Re: Problem with REGEXP_SUBSTR
    c1cf35c4-2107-4109-91ee-aaa09fbb4ce1 Newbie
    Currently Being Moderated

    Hi BluShadow,

     

    Thanks for the answer. Indeed, I had too many lines in the result. I tried your solution and it kind of works. I just have one issue that I'll try to explain.

     

    In order to make some tests, I created a subset of the data. I created a ALL_SITES_2013 table and populated it with:

     

    SELECT DISTINCT Site_List FROM ALL_SITES Where Year = '2013'


    Then I used the code I sent in the first post and added the "prior thing" and replaced ALL_SITES by ALL_SITES_2013 and everything is working fine (Thanks again !!).


    Next step was to populate ALL_SITES_2013 with all data. So I droped it and then created it again and populate with


    The code is still working.


    So at this step, my code is:


     SELECT 
      Site_List, 
      trim(regexp_substr(Site_List, '[^;]+', 1, level))
    FROM 
      (Select Site_Listfrom ALL_SITES_2013)
    CONNECT BY 
      trim(regexp_substr(Site_List, '[^;]+', 1, level)) is not null and Site_List= prior Site_Listand prior sys_guid() is not null;

     

    Having ALL_SITES_2013 populated this way:

     

    INSERT INTO ALL_SITES_2013 (SELECT DISTINCT Site_List FROM ALL_SITES);

     

     

    Now my last step would be not to use the temporary ALL_SITES_2013 table. So I replaced it by the code from the INSERT statement like this:

     

     

    SELECT 
      Site_List, 
      trim(regexp_substr(Site_List, '[^;]+', 1, level))
    FROM 
      (SELECT DISTINCT Site_List FROM ALL_SITES)
    CONNECT BY 
      trim(regexp_substr(Site_List, '[^;]+', 1, level)) is not null and Site_List= prior Site_Listand prior sys_guid() is not null;

     

    And it's not working anymore.The "CONNECT BY loop in user data" is returned.

    Obviously there is something I miss about the Connect By (first time I use it)

     

    How can it work on the temporary table and not on the select distinct from the final one ?

     

    By the way I tried to replace the temporary table by a temporary view (with same code as insert) but I have the same loop error...

     

    I have to admit that I'm lost. Worth case, I can use the temp table but I would like to understand at least

  • 3. Re: Problem with REGEXP_SUBSTR
    Frank Kulash Guru
    Currently Being Moderated

    Hi,

     

    Do you really need to use REGEXP_SUBSTR?  How about XMLSEQUENCE?

     

    SELECT  TRIM ( x.column_value.EXTRACT ( 'e/text()')

                 ) AS site

    FROM         all_sites  s

    CROSS JOIN   TABLE ( XMLSEQUENCE ( XMLTYPE (  '<e><e>'

                                               || REPLACE (s.site_list, ';', '</e><e>')

                                               || '</e></e>'

                                               ).EXTRACT ('e/e')

                                     )

                       )    x

    ;

    Too bad you're using Oracle 10; there are simpler ways available in later versions.

  • 4. Re: Problem with REGEXP_SUBSTR
    BluShadow Guru Moderator
    Currently Being Moderated

    c1cf35c4-2107-4109-91ee-aaa09fbb4ce1 wrote:

     

     

    I have to admit that I'm lost. Worth case, I can use the temp table but I would like to understand at least

     

    Using a temporary table shouldn't make any difference, it sounds like some issue with the data perhaps.  Unfortunately I don't have your data, so I can't figure out what's happening exactly.  You've already shown that it works for 2013 data, so can you narrow it down to a particular subset of data that's causing the problem, using similar tests?

  • 5. Re: Problem with REGEXP_SUBSTR
    BluShadow Guru Moderator
    Currently Being Moderated

    FrankKulash wrote:

     

    Hi,

     

    Do you really need to use REGEXP_SUBSTR?  How about XMLSEQUENCE?

     

    SELECT  TRIM ( x.column_value.EXTRACT ( 'e/text()')

                 ) AS site

    FROM         all_sites  s

    CROSS JOIN   TABLE ( XMLSEQUENCE ( XMLTYPE (  '<e><e>'

                                               || REPLACE (s.site_list, ';', '</e><e>')

                                               || '</e></e>'

                                               ).EXTRACT ('e/e')

                                     )

                       )    x

    ;

    Too bad you're using Oracle 10; there are simpler ways available in later versions.

     

      Not sure I'd recommend using XMLSEQUENCE... as it's deprecated from 11g onwards...

     

    XMLSEQUENCE

     

    XMLTABLE should be used instead..

     

    SQL> ed
    Wrote file afiedt.buf

      1  with t as (select 'element1,element2,element3' as txt from dual union all
      2             select 'element4,element5' from dual)
      3  --
      4  SELECT  t.txt, x.site
      5  FROM    t
      6         ,XMLTABLE('e/e'
      7                   passing XMLTYPE ('<e><e>'||REPLACE(t.txt, ',', '</e><e>')||'</e></e>')
      8                   columns site varchar2(20) path '.'
      9*                 ) x
    SQL> /

    TXT                        SITE
    -------------------------- --------------------
    element1,element2,element3 element1
    element1,element2,element3 element2
    element1,element2,element3 element3
    element4,element5          element4
    element4,element5          element5

     

    The other issue with that way though is that we're adding to the length of the original string, so if we exceed the 4000 byte varchar2 limit we could get an error (unless we move to 12c for th 32K strings), though we could convert the string to a clob before replacing and appending the tags, and that would circumvent that issue.

  • 6. Re: Problem with REGEXP_SUBSTR
    Manik Expert
    Currently Being Moderated

    Blu,

     

    Without actually converting into xmltype.. following Michael S old solution somewhere in the forum, this is another way... (similar to your option anyways)

     

    ---

     

    WITH t AS

            (SELECT 'element1,element2,element3' AS txt FROM DUAL

             UNION ALL

             SELECT 'element4,element5' FROM DUAL)

    SELECT txt, COLUMN_VALUE

      FROM (SELECT t.*, '"' || REPLACE (txt, ',', '","') || '"' str FROM t),

           XMLTABLE (str);

     

    ---

     

    Cheers,

    Manik.

  • 7. Re: Problem with REGEXP_SUBSTR
    BluShadow Guru Moderator
    Currently Being Moderated

    Manik wrote:

     

    Blu,

     

    Without actually converting into xmltype.. following Michael S old solution somewhere in the forum, this is another way... (similar to your option anyways)

     

    ---

     

    WITH t AS

            (SELECT 'element1,element2,element3' AS txt FROM DUAL

             UNION ALL

             SELECT 'element4,element5' FROM DUAL)

    SELECT txt, COLUMN_VALUE

      FROM (SELECT t.*, '"' || REPLACE (txt, ',', '","') || '"' str FROM t),

           XMLTABLE (str);

     

    ---

     

    Cheers,

    Manik.

     

    Which work ok on 11g, but not on 10g which the OP has...

     

    SQL> col column_values format a20

    SQL> WITH t AS (SELECT 'element1,element2,element3' AS txt FROM DUAL UNION ALL

      2             SELECT 'element4,element5' FROM DUAL)

      3  SELECT txt, COLUMN_VALUE

      4  FROM (SELECT t.*, '"' || REPLACE (txt, ',', '","') || '"' str FROM t),

      5       XMLTABLE (str);

         XMLTABLE (str)

                   *

    ERROR at line 5:

    ORA-19102: XQuery string literal expected

  • 8. Re: Problem with REGEXP_SUBSTR
    Manik Expert
    Currently Being Moderated

    Oops I missed the version of OP..

     

    Cheers,

    Manik.

  • 9. Re: Problem with REGEXP_SUBSTR
    BrendanP Journeyer
    Currently Being Moderated

    XML might be a bit slow, as might Regexp. Here's another way to try:

    WITH t AS (
            SELECT 'element1,element22,element3,E' AS txt FROM DUAL UNION ALL
            SELECT 'element4,element5' FROM DUAL
    ), l AS (
            SELECT ',' || txt || ',' txt, Length (txt) - Length (Replace (txt, ',', '')) + 1 n_fields
              FROM t
    ), r AS (
            SELECT LEVEL ind
              FROM DUAL
           CONNECT BY LEVEL < 100
    )
    SELECT l.txt, Substr (l.txt, Instr (l.txt, ',', 1, r.ind) + 1, Instr (l.txt, ',', 1, r.ind + 1) - Instr (l.txt, ',', 1, r.ind) - 1) token
      FROM l
      JOIN r
        ON r.ind  LE l.n_fields

     

    Message was edited by: BrendanP Oops, I though that formatting problem had gone away LE stands for the Oracle &le

  • 10. Re: Problem with REGEXP_SUBSTR
    Manik Expert
    Currently Being Moderated

    Why number 100 iterations there? why not 5 or 6 or 7 or 8 ?????

     

    5 will be even more faster than your solution of 100 iterations

     

     

    Cheers,

    Manik.

  • 11. Re: Problem with REGEXP_SUBSTR
    BluShadow Guru Moderator
    Currently Being Moderated

    Manik,  I think Brenden is just demonstrating the method... it's up to the OP to adapt as necessary.  As we don't have the data or knowledge of how many 'elements' will be in the strings.

  • 12. Re: Problem with REGEXP_SUBSTR
    Manik Expert
    Currently Being Moderated

    Agreed Blu, that was just for fun. .... Don't you allow a bit of fun in this forums mighty moderator

     

    Cheers,

    Manik.

  • 13. Re: Problem with REGEXP_SUBSTR
    c1cf35c4-2107-4109-91ee-aaa09fbb4ce1 Newbie
    Currently Being Moderated

    Thanks for all the solutions you gave. My probleme is now solved. I added the "Nocycle" instruction and also use the "with t as [Subquery]" instead of putting the subquery in the FROM statement. And everything works fine. Not really sure why for now but it works. I'll do more detailed research in order to understand the details later

     

    In the end, instead of :

    SELECT   
      Site_List,   
      trim(regexp_substr(Site_List, '[^;]+', 1, level))  
    FROM   
      (SELECT DISTINCT Site_List FROM ALL_SITES)  
    CONNECT BY   
      trim(regexp_substr(Site_List, '[^;]+', 1, level)) is not null and Site_List= prior Site_Listand prior sys_guid() is not null



     

    I have :

     

    With t as (Select distinct Site_List FROM ALL_SITES)
    SELECT   
      Site_List,   
      trim(regexp_substr(Site_List, '[^;]+', 1, level))  
    FROM   
      t  
    CONNECT BY   nocycle
      trim(regexp_substr(Site_List, '[^;]+', 1, level)) is not null and Site_List= prior Site_Listand prior sys_guid() is not null


    Thanks again

     

  • 14. Re: Problem with REGEXP_SUBSTR
    BluShadow Guru Moderator
    Currently Being Moderated

    You probably have duplicate site_list values in the table.

    the connect by needs to connect uniquely, so if there's duplicates that'll cause problems.   It was an assumption on my part that you were just providing an example and that you'd actually have some table with a primary key on it, and then youd use the "and pk_column = prior pk_column" in the connect by clause.

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points