14 Replies Latest reply: Sep 12, 2013 7:24 AM by BluShadow RSS

    Problem with REGEXP_SUBSTR

    c1cf35c4-2107-4109-91ee-aaa09fbb4ce1

      Hello,

       

      I'm trying to use the regexp_substr function in Oracle 10 to split a string into several lines (semicolon is the delimiter).

       

       

      So I'm using:

       

      Select Site_List, trim(regexp_substr(Site_List, '[^;]+, 1, level)) FROM (Select Site_List FROM ALL_SITES) Connect by  trim(regexp_substr(Site_List, '[^;]+, 1, level)) is not null
      

       

       

      When I run this in SQL Developer, it's working (in a few seconds, but SQL Developer only shows the 50 first results).

       

      Now, if I try to use a count(*) on this or to insert the query result in a table, it takes forever (I have only 375 lines in ALL_SITES table).

       

       

      Is there something I did wrong ?

       

       

      Thanks for your help.

        • 1. Re: Problem with REGEXP_SUBSTR
          BluShadow

          No, you didn't do anything wrong.

           

          Tools like SQL Developer, TOAD etc. tend to just return the first 50 rows (or however many) very quickly.  That's an indication that the query itself has executed quickly, but the time it takes to process and send you all of the data is clearly going to take longer, so if you do a count(*) it has to process everything first before it can count it.

           

          How many rows of data are in your table?  How many 'elements' are in each string that are gettnig split out?

           

          I assume you are running it on multiple rows?

          I, which case you may need to add to your connect by clause as you're creating a kind-of connect by cartesian product.

           

          e.g. With the following data we expect to get 5 rows from the two strings...

           

          SQL> ed
          Wrote file afiedt.buf

            1  with t as (select 'element1,element2,element3' as txt from dual union all
            2             select 'element4,element5' from dual)
            3  --
            4  select txt
            5        ,regexp_substr(txt,'[^,]+',1,level) as el
            6  from   t
            7* connect by regexp_substr(txt,'[^,]+',1,level) is not null
          SQL> /

           

          TXT                        EL
          -------------------------- --------------------------
          element1,element2,element3 element1
          element1,element2,element3 element2
          element1,element2,element3 element3
          element4,element5          element5
          element1,element2,element3 element3
          element4,element5          element4
          element1,element2,element3 element2
          element1,element2,element3 element3
          element4,element5          element5
          element1,element2,element3 element3

           

          10 rows selected.

           

          but we end up with more than we expected.

           

          If you try to ensure you just connect for each row you're processing...

           

          SQL> ed
          Wrote file afiedt.buf

            1  with t as (select 'element1,element2,element3' as txt from dual union all
            2             select 'element4,element5' from dual)
            3  --
            4  select txt
            5        ,regexp_substr(txt,'[^,]+',1,level) as el
            6  from   t
            7  connect by regexp_substr(txt,'[^,]+',1,level) is not null
            8*        and txt = prior txt
          SQL> /


          ERROR:
          ORA-01436: CONNECT BY loop in user data

           

          no rows selected

           

          ... you get a connect by loop (cycle) in your data.

           

          This can be got around by making each cycle unique...

           

          SQL> ed
          Wrote file afiedt.buf

            1  with t as (select 'element1,element2,element3' as txt from dual union all
            2             select 'element4,element5' from dual)
            3  --
            4  select txt
            5        ,regexp_substr(txt,'[^,]+',1,level) as el
            6  from   t
            7  connect by regexp_substr(txt,'[^,]+',1,level) is not null
            8         and txt = prior txt
            9*        and prior sys_guid() is not null
          SQL> /

           

          TXT                        EL
          -------------------------- --------------------------
          element1,element2,element3 element1
          element1,element2,element3 element2
          element1,element2,element3 element3
          element4,element5          element4
          element4,element5          element5

           

          so now we're getting the 5 rows we expected.

           

          I suspect you're generating far more rows than you planned. 

          • 2. Re: Problem with REGEXP_SUBSTR
            c1cf35c4-2107-4109-91ee-aaa09fbb4ce1

            Hi BluShadow,

             

            Thanks for the answer. Indeed, I had too many lines in the result. I tried your solution and it kind of works. I just have one issue that I'll try to explain.

             

            In order to make some tests, I created a subset of the data. I created a ALL_SITES_2013 table and populated it with:

             

            SELECT DISTINCT Site_List FROM ALL_SITES Where Year = '2013'


            Then I used the code I sent in the first post and added the "prior thing" and replaced ALL_SITES by ALL_SITES_2013 and everything is working fine (Thanks again !!).


            Next step was to populate ALL_SITES_2013 with all data. So I droped it and then created it again and populate with


            The code is still working.


            So at this step, my code is:


             SELECT 
              Site_List, 
              trim(regexp_substr(Site_List, '[^;]+', 1, level))
            FROM 
              (Select Site_Listfrom ALL_SITES_2013)
            CONNECT BY 
              trim(regexp_substr(Site_List, '[^;]+', 1, level)) is not null and Site_List= prior Site_Listand prior sys_guid() is not null;
            

             

            Having ALL_SITES_2013 populated this way:

             

            INSERT INTO ALL_SITES_2013 (SELECT DISTINCT Site_List FROM ALL_SITES);
            

             

             

            Now my last step would be not to use the temporary ALL_SITES_2013 table. So I replaced it by the code from the INSERT statement like this:

             

             

            SELECT 
              Site_List, 
              trim(regexp_substr(Site_List, '[^;]+', 1, level))
            FROM 
              (SELECT DISTINCT Site_List FROM ALL_SITES)
            CONNECT BY 
              trim(regexp_substr(Site_List, '[^;]+', 1, level)) is not null and Site_List= prior Site_Listand prior sys_guid() is not null;
            

             

            And it's not working anymore.The "CONNECT BY loop in user data" is returned.

            Obviously there is something I miss about the Connect By (first time I use it)

             

            How can it work on the temporary table and not on the select distinct from the final one ?

             

            By the way I tried to replace the temporary table by a temporary view (with same code as insert) but I have the same loop error...

             

            I have to admit that I'm lost. Worth case, I can use the temp table but I would like to understand at least

            • 3. Re: Problem with REGEXP_SUBSTR
              Frank Kulash

              Hi,

               

              Do you really need to use REGEXP_SUBSTR?  How about XMLSEQUENCE?

               

              SELECT  TRIM ( x.column_value.EXTRACT ( 'e/text()')

                           ) AS site

              FROM         all_sites  s

              CROSS JOIN   TABLE ( XMLSEQUENCE ( XMLTYPE (  '<e><e>'

                                                         || REPLACE (s.site_list, ';', '</e><e>')

                                                         || '</e></e>'

                                                         ).EXTRACT ('e/e')

                                               )

                                 )    x

              ;

              Too bad you're using Oracle 10; there are simpler ways available in later versions.

              • 4. Re: Problem with REGEXP_SUBSTR
                BluShadow

                c1cf35c4-2107-4109-91ee-aaa09fbb4ce1 wrote:

                 

                 

                I have to admit that I'm lost. Worth case, I can use the temp table but I would like to understand at least

                 

                Using a temporary table shouldn't make any difference, it sounds like some issue with the data perhaps.  Unfortunately I don't have your data, so I can't figure out what's happening exactly.  You've already shown that it works for 2013 data, so can you narrow it down to a particular subset of data that's causing the problem, using similar tests?

                • 5. Re: Problem with REGEXP_SUBSTR
                  BluShadow

                  FrankKulash wrote:

                   

                  Hi,

                   

                  Do you really need to use REGEXP_SUBSTR?  How about XMLSEQUENCE?

                   

                  SELECT  TRIM ( x.column_value.EXTRACT ( 'e/text()')

                               ) AS site

                  FROM         all_sites  s

                  CROSS JOIN   TABLE ( XMLSEQUENCE ( XMLTYPE (  '<e><e>'

                                                             || REPLACE (s.site_list, ';', '</e><e>')

                                                             || '</e></e>'

                                                             ).EXTRACT ('e/e')

                                                   )

                                     )    x

                  ;

                  Too bad you're using Oracle 10; there are simpler ways available in later versions.

                   

                    Not sure I'd recommend using XMLSEQUENCE... as it's deprecated from 11g onwards...

                   

                  XMLSEQUENCE

                   

                  XMLTABLE should be used instead..

                   

                  SQL> ed
                  Wrote file afiedt.buf

                    1  with t as (select 'element1,element2,element3' as txt from dual union all
                    2             select 'element4,element5' from dual)
                    3  --
                    4  SELECT  t.txt, x.site
                    5  FROM    t
                    6         ,XMLTABLE('e/e'
                    7                   passing XMLTYPE ('<e><e>'||REPLACE(t.txt, ',', '</e><e>')||'</e></e>')
                    8                   columns site varchar2(20) path '.'
                    9*                 ) x
                  SQL> /

                  TXT                        SITE
                  -------------------------- --------------------
                  element1,element2,element3 element1
                  element1,element2,element3 element2
                  element1,element2,element3 element3
                  element4,element5          element4
                  element4,element5          element5

                   

                  The other issue with that way though is that we're adding to the length of the original string, so if we exceed the 4000 byte varchar2 limit we could get an error (unless we move to 12c for th 32K strings), though we could convert the string to a clob before replacing and appending the tags, and that would circumvent that issue.

                  • 6. Re: Problem with REGEXP_SUBSTR
                    Manik

                    Blu,

                     

                    Without actually converting into xmltype.. following Michael S old solution somewhere in the forum, this is another way... (similar to your option anyways)

                     

                    ---

                     

                    WITH t AS

                            (SELECT 'element1,element2,element3' AS txt FROM DUAL

                             UNION ALL

                             SELECT 'element4,element5' FROM DUAL)

                    SELECT txt, COLUMN_VALUE

                      FROM (SELECT t.*, '"' || REPLACE (txt, ',', '","') || '"' str FROM t),

                           XMLTABLE (str);

                     

                    ---

                     

                    Cheers,

                    Manik.

                    • 7. Re: Problem with REGEXP_SUBSTR
                      BluShadow

                      Manik wrote:

                       

                      Blu,

                       

                      Without actually converting into xmltype.. following Michael S old solution somewhere in the forum, this is another way... (similar to your option anyways)

                       

                      ---

                       

                      WITH t AS

                              (SELECT 'element1,element2,element3' AS txt FROM DUAL

                               UNION ALL

                               SELECT 'element4,element5' FROM DUAL)

                      SELECT txt, COLUMN_VALUE

                        FROM (SELECT t.*, '"' || REPLACE (txt, ',', '","') || '"' str FROM t),

                             XMLTABLE (str);

                       

                      ---

                       

                      Cheers,

                      Manik.

                       

                      Which work ok on 11g, but not on 10g which the OP has...

                       

                      SQL> col column_values format a20

                      SQL> WITH t AS (SELECT 'element1,element2,element3' AS txt FROM DUAL UNION ALL

                        2             SELECT 'element4,element5' FROM DUAL)

                        3  SELECT txt, COLUMN_VALUE

                        4  FROM (SELECT t.*, '"' || REPLACE (txt, ',', '","') || '"' str FROM t),

                        5       XMLTABLE (str);

                           XMLTABLE (str)

                                     *

                      ERROR at line 5:

                      ORA-19102: XQuery string literal expected

                      • 8. Re: Problem with REGEXP_SUBSTR
                        Manik

                        Oops I missed the version of OP..

                         

                        Cheers,

                        Manik.

                        • 9. Re: Problem with REGEXP_SUBSTR
                          BrendanP

                          XML might be a bit slow, as might Regexp. Here's another way to try:

                          WITH t AS (
                                  SELECT 'element1,element22,element3,E' AS txt FROM DUAL UNION ALL
                                  SELECT 'element4,element5' FROM DUAL
                          ), l AS (
                                  SELECT ',' || txt || ',' txt, Length (txt) - Length (Replace (txt, ',', '')) + 1 n_fields
                                    FROM t
                          ), r AS (
                                  SELECT LEVEL ind
                                    FROM DUAL
                                 CONNECT BY LEVEL < 100
                          )
                          SELECT l.txt, Substr (l.txt, Instr (l.txt, ',', 1, r.ind) + 1, Instr (l.txt, ',', 1, r.ind + 1) - Instr (l.txt, ',', 1, r.ind) - 1) token
                            FROM l
                            JOIN r
                              ON r.ind  LE l.n_fields

                           

                          Message was edited by: BrendanP Oops, I though that formatting problem had gone away LE stands for the Oracle &le

                          • 10. Re: Problem with REGEXP_SUBSTR
                            Manik

                            Why number 100 iterations there? why not 5 or 6 or 7 or 8 ?????

                             

                            5 will be even more faster than your solution of 100 iterations

                             

                             

                            Cheers,

                            Manik.

                            • 11. Re: Problem with REGEXP_SUBSTR
                              BluShadow

                              Manik,  I think Brenden is just demonstrating the method... it's up to the OP to adapt as necessary.  As we don't have the data or knowledge of how many 'elements' will be in the strings.

                              • 12. Re: Problem with REGEXP_SUBSTR
                                Manik

                                Agreed Blu, that was just for fun. .... Don't you allow a bit of fun in this forums mighty moderator

                                 

                                Cheers,

                                Manik.

                                • 13. Re: Problem with REGEXP_SUBSTR
                                  c1cf35c4-2107-4109-91ee-aaa09fbb4ce1

                                  Thanks for all the solutions you gave. My probleme is now solved. I added the "Nocycle" instruction and also use the "with t as [Subquery]" instead of putting the subquery in the FROM statement. And everything works fine. Not really sure why for now but it works. I'll do more detailed research in order to understand the details later

                                   

                                  In the end, instead of :

                                  SELECT   
                                    Site_List,   
                                    trim(regexp_substr(Site_List, '[^;]+', 1, level))  
                                  FROM   
                                    (SELECT DISTINCT Site_List FROM ALL_SITES)  
                                  CONNECT BY   
                                    trim(regexp_substr(Site_List, '[^;]+', 1, level)) is not null and Site_List= prior Site_Listand prior sys_guid() is not null
                                  



                                   

                                  I have :

                                   

                                  With t as (Select distinct Site_List FROM ALL_SITES)
                                  SELECT   
                                    Site_List,   
                                    trim(regexp_substr(Site_List, '[^;]+', 1, level))  
                                  FROM   
                                    t  
                                  CONNECT BY   nocycle
                                    trim(regexp_substr(Site_List, '[^;]+', 1, level)) is not null and Site_List= prior Site_Listand prior sys_guid() is not null
                                  


                                  Thanks again

                                   

                                  • 14. Re: Problem with REGEXP_SUBSTR
                                    BluShadow

                                    You probably have duplicate site_list values in the table.

                                    the connect by needs to connect uniquely, so if there's duplicates that'll cause problems.   It was an assumption on my part that you were just providing an example and that you'd actually have some table with a primary key on it, and then youd use the "and pk_column = prior pk_column" in the connect by clause.