      • 15. Re: How to select csv data stored in a BLOB column as if it were an external table?
        Mike Kutz

        Just to put it out there:  - Process Type Plugin - External Table Processor

         

        At the very least, it should give you an excuse to learn External Tables.

         

        This type of scenario is why I made this suggestion:

        parse CLOBS/BLOBS using "external table" capability

         

        MK

        • 16. Re: How to select csv data stored in a BLOB column as if it were an external table?
          Boneist

          I think you have possibly missed the bit where Hoek said "I cannot use an external table (for the first time in my life )" in the original post...

          • 17. Re: How to select csv data stored in a BLOB column as if it were an external table?
            Great stuff (as always), Anton, thanks! Very interesting.

             

            Why not combine Anton's and Odie's answers and meet ALL of your requirements while improving performance at the same time?

             

            IMHO Anton likely has the best performing query.

             

            Now use that query in a pl/sql procedure or pipelined function to perform your data cleansing and conversions. That function can also detect and log rows with data issues while still processing the clean rows.

             

            That can be done in bulk if you insert the data into a work table using DML ERROR LOGGING. The resulting rows in the work table will be clean and you can process them further.
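
            A rough sketch of that pattern (the work table CSV_WORK, its columns and the GET_CSV splitter are only placeholders, not code from this thread):

            -- one-time setup: generate the error log table (named ERR$_CSV_WORK by default)
            exec dbms_errlog.create_error_log(dml_table_name => 'CSV_WORK')

            -- bulk load: rows whose values cannot be converted to the target column types
            -- (or that violate constraints) are written to the log table instead of
            -- failing the whole statement
            insert into csv_work (str_col, num_col, dte_col)
            select str, num, dte                     -- still varchar2 here; conversion happens on insert
            from   table(get_csv(:csv_blob))         -- placeholder for any splitter from this thread
            log errors into err$_csv_work ('csv load')
            reject limit unlimited;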

             

            The rows in the log table will have one or more problems, but you will only know what ONE of them is.

             

            If a row is likely to have multiple cleansing or validation issues you need to resort to that dreaded slow-by-slow (row-by-row) processing. You can combine Anton's query with bulk collection and a LIMIT clause, then use FORALL to process each batch with SAVE EXCEPTIONS (there should NOT be any exceptions after your thorough cleansing).
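
            A bare-bones sketch of that batch pattern (BLOB_STAGE, CSV_TARGET and the GET_CSV splitter are placeholders; any of the splitting queries/functions from this thread would slot in):

            declare
              cursor c_csv is
                select s.str, s.num, s.dte
                from   blob_stage t,                        -- placeholder staging table holding the BLOB
                       table(get_csv(t.csv_blob)) s;        -- placeholder splitter
              type t_rows is table of c_csv%rowtype;
              l_rows      t_rows;
              dml_errors  exception;
              pragma exception_init(dml_errors, -24381);    -- ORA-24381: error(s) in array DML
            begin
              open c_csv;
              loop
                fetch c_csv bulk collect into l_rows limit 1000;
                exit when l_rows.count = 0;
                begin
                  forall i in 1 .. l_rows.count save exceptions
                    insert into csv_target values l_rows(i);
                exception
                  when dml_errors then
                    for j in 1 .. sql%bulk_exceptions.count loop
                      dbms_output.put_line('batch row ' || sql%bulk_exceptions(j).error_index ||
                                           ': ORA-'     || sql%bulk_exceptions(j).error_code);
                    end loop;
                end;
              end loop;
              close c_csv;
            end;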

             

            The typical use case where the row-by-row method IS suitable is when you have to check multiple columns for complex validations and need to record/log ALL problems. It is MUCH less efficient to use a process that can only detect ONE problem and then reject the row when you ultimately need to know about ALL problems.

             

            Any inefficiencies in Anton's (or another) complex query will be masked/hidden by the inefficiencies in the cleansing/validation process. But overall you can still get the best performance because there is no longer just ONE weakest link.

            • 18. Re: How to select csv data stored in a BLOB column as if it were an external table?
              Mike Kutz

              Boneist wrote:

               

              I think you have possibly missed the bit where Hoek said "I cannot use an external table (for the first time in my life )" in the original post...

              I was scratching my head on that statement.

              It initially came across to me as a limitation of the person (which was questionable for someone like Hoek) rather than a limitation imposed by the site.

               

              My apologies, Hoek, for any bad assumptions about you.

               

              MK

              • 19. Re: Re: How to select csv data stored in a BLOB column as if it were an external table?
                chris227

                rp0428 wrote:

                 

                IMHO Anton likely has the best performing query.

                 

                I did test it:

                The best-performing approach regarding runtime seems to be Karthick's.

                It took under 10 seconds to insert 10,000 rows, split into two parts and converted to number and date.

                Anton's took about 40 seconds.

                 

                I tuned the pipelined function so that it now completes in ~15 seconds (down from 60 seconds) in this scenario:

                 

                create or replace function get_csv_split( p_blob blob )
                return csv_table_split_type
                pipelined
                as
                  c_sep_hex constant varchar2(2) := '3B'; -- hex ;
                  c_sep constant varchar2(2) := ';';
                  c_numsep constant number := 2;
                  l_off number := 1;
                  l_off_new number := 1;
                  l_row varchar2(32000);
                begin
                  -- Attention: assumption that there exists at least one "wellformed" csv-line
                  -- should find some better guard condition
                  while dbms_lob.getlength(p_blob) > l_off and l_off > 0 loop
                    l_off_new := dbms_lob.instr(p_blob, c_sep_hex, l_off, c_numsep);
                    l_row := utl_raw.cast_to_varchar2(
                                  dbms_lob.substr(p_blob
                                      , l_off_new - l_off
                                      , l_off)
                                );
                    l_off := l_off_new + 2; -- to skip c_sep and row sep (chr(10))
                    pipe row (csv_split_type(substr(l_row, 1, instr(l_row, c_sep) - 1), substr(l_row, instr(l_row, c_sep) + 1)));
                  end loop;
                  return;
                end;
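
                For reference, since csv_split_type / csv_table_split_type are not posted in this thread, a minimal sketch of what they might look like, plus a sample call (the attribute names, BLOB_STAGE table and column names are only assumptions):

                create or replace type csv_split_type as object (
                  col1 varchar2(4000),   -- first field of the csv line
                  col2 varchar2(4000)    -- remainder of the csv line
                );
                /
                create or replace type csv_table_split_type as table of csv_split_type;
                /

                -- sample usage against a staging table holding the BLOB
                select s.col1, s.col2
                from   blob_stage t,
                       table(get_csv_split(t.csv_blob)) s;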

                A test with 50,000 rows resulted in 50 seconds (Karthick's) vs. 115 seconds (pipelined), which might hint that the former scales linearly whereas the latter does not
                (the reason might be that Karthick's approach works on constant-sized chunks).

                • 20. Re: How to select csv data stored in a BLOB column as if it were an external table?

                  The query itself is just ONE piece of the process. The crux of the problem, and likely the limiting factor, is the number and types of validations that need to be performed and whether ALL of those validations need to be logged. This is what Hoek said:

                  In short: the 'blobbed csv' needs to be validated (valid dates, valid numbers, that sort of stuff) completely first, before the data is allowed to be inserted

                   

                  My idea was to loop through the blob, apply to_number and to_date to the columns that are to be numbers or dates in the destination table.

                   

                  If there's invalid data then there will be no DML done; instead they want a complete listing of all rows that have invalid data.

                   

                  So, after the 'validation loop', when no invalid data is found I can simply insert all data at once with a single SQL statement reusing Boneist's query, maybe using parallel DML.

                  If you need to detect and log multiple validation issues in a row (which is typical for data cleansing/validation stages) then you typically need a row-by-row process that examines/tests EVERY column for possible issues and calls an AUTONOMOUS TRANSACTION logging procedure to log ALL of the issues for each column in a row. A single column might even have more than one issue.
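
                  A bare-bones sketch of that kind of logger (the procedure and table names are purely illustrative):

                  create or replace procedure log_validation_issue(
                    p_source_rowid varchar2,
                    p_column_name  varchar2,
                    p_issue        varchar2
                  )
                  as
                    pragma autonomous_transaction;  -- commits the log entry independently of the load transaction
                  begin
                    insert into validation_issue_log (logged_at, source_rowid, column_name, issue)
                    values (systimestamp, p_source_rowid, p_column_name, p_issue);
                    commit;
                  end;
                  /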

                   

                  The ultimate throughput of a process like that depends on the SLOWEST step in the process.

                   

                  It doesn't really matter that much how fast the query itself is as long as it is FASTER than each of the other components. Usain Bolt is the fastest man in the world but in a three-legged sack race he can't run any faster than his teammate. All of that speed will be wasted.

                   

                  You can use a query that is the EASIEST to understand and maintain. I'm not advocating against using regular expressions, but if Hoek is the only one at his org who understands how to use them the query will be HARDER for other developers to maintain/enhance/debug.

                  • 21. Re: Re: How to select csv data stored in a BLOB column as if it were an external table?
                    chris227

                    I agree with all you said.

                    I was just interested in your remark (so I replied to your post, not necessarily to discuss this issue with you personally ;-) ) that Anton's query might likely be the best performing (which I thought too at first sight).

                    So I tested it and came to the results posted above.

                    As far as I can see, the access to the BLOB via the dbms_lob functions or the utl_raw conversion call is the bottleneck here.

                    Personally I think it is better to know that than not, and if we know it we can take it into account during the implementation at no extra cost.

                    (And for maintenance's sake I still prefer the pipelined table function in this case (if it performs on a competitive basis); it may also be the right place to start the row-by-row validations.)

                    Hoek stated that 1 minute for 1,000 rows may be OK, so saving time on the split gives more time for the validation.

                     

                    Consequently I came up with another "tuned" approach, converting the BLOB to a CLOB first (mostly for my own educational purposes), now outperforming Karthick's approach (~6 seconds for 10,000 rows) and seemingly (based on this naive test) scaling linearly (30 seconds for 50,000).

                     

                    create or replace function get_csv_split_c( p_blob blob )
                    return csv_table_split_type
                    pipelined
                    as
                      c_sep_hex constant varchar2(2) := '3B'; -- hex ;
                      c_sep constant varchar2(2) := ';';
                      c_numsep constant number := 2;
                      l_off number := 1;
                      l_off_new number := 1;
                      l_row varchar2(32000);
                      l_clob clob;
                      l_src_off pls_integer := 1;
                      l_dst_off pls_integer := 1;
                      l_ctx number := dbms_lob.DEFAULT_LANG_CTX;
                      l_warn number := dbms_lob.WARN_INCONVERTIBLE_CHAR;
                    begin
                      dbms_lob.createtemporary(l_clob, true);
                      dbms_lob.converttoclob(l_clob, p_blob, dbms_lob.lobmaxsize, l_src_off, l_dst_off, dbms_lob.DEFAULT_CSID, l_ctx, l_warn);
                      -- Attention: assumption that there exists at least one "wellformed" csv-line
                      -- should find some better guard condition
                      while dbms_lob.getlength(l_clob) > l_off and l_off > 0 loop
                        l_off_new := instr(l_clob, c_sep, l_off, c_numsep);
                        pipe row (csv_split_type(
                                substr(l_clob, l_off, instr(l_clob, c_sep, l_off) - l_off)
                              , substr(l_clob, instr(l_clob, c_sep, l_off) + 1, l_off_new - instr(l_clob, c_sep, l_off) - 1)
                        ));
                        l_off := l_off_new + 2; -- to skip c_sep and row sep (chr(10))
                      end loop;
                      return;
                    end;
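
                    A sketch of how these functions were exercised in the timings above (the CSV_TARGET table, the BLOB_STAGE staging table, the col1/col2 attribute names and the format mask are assumptions, not code from this thread):

                    insert into csv_target (num_col, dte_col)
                    select to_number(s.col1),
                           to_date(s.col2, 'yyyy-mm-dd hh24:mi:ss')
                    from   blob_stage t,
                           table(get_csv_split_c(t.csv_blob)) s;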

                    • 22. Re: Re: How to select csv data stored in a BLOB column as if it were an external table?
                      Etbin

                      Using Karthick's idea to split the CLOB into chunks, the MODEL clause (which seems not to work with CLOBs directly) applied to the chunks is much faster than recursion chopping off a row at a time from the CLOB.

                      It's a dirty example, as I have almost forgotten how to deal with the MODEL clause.

                      Hopefully someone (chris227 ) might make it better.

                       

                      select str,num,dte
                        from (select chunk_id,i,a_row,
                                     substr(a_row,1,instr(a_row,';',1,1) - 1) str,
                                     to_number(substr(a_row,instr(a_row,';',1,1) + 1,instr(a_row,';',1,2) - instr(a_row,';',1,1) - 1)) num,
                                     to_date(substr(a_row,instr(a_row,';',1,2) + 1),'yyyy-mm-dd hh24:mi:ss') dte
                                from (select level chunk_id,
                                             to_char(substr(clob_column,
                                                            instr(clob_column,chr(10),3900 * (level - 1),1) + 1,
                                                            case when instr(clob_column,chr(10),3900 * level,1) = 0
                                                                 then length(clob_column)
                                                                 else instr(clob_column,chr(10),3900 * level,1)
                                                            end  - instr(clob_column,chr(10),3900 * (level - 1),1) - 1
                                                           )
                                                    ) || chr(10) chunk
                                        from clob_table
                                      connect by level <= ceil(length(clob_column) / 3900)
                                     )
                               model
                               partition by (chunk_id)
                               dimension by (0 i)
                               measures (chunk,cast(null as varchar2(100)) a_row)
                               rules iterate (1000) until a_row[iteration_number] is null
                               (
                                a_row[iteration_number] = substr(chunk[iteration_number],1,instr(chunk[iteration_number],chr(10)) - 1),
                                chunk[iteration_number + 1] = substr(chunk[iteration_number],instr(chunk[iteration_number],chr(10)) + 1)
                               )
                             )
                       where a_row is not null
                       order by chunk_id,i -- to preserve the clob order

                       

                      1000 rows returned in 0.14 seconds

                       

                      Regards

                       

                      Etbin


                      Further investigations on my Apex tablespace about the CLOB size revealed:

                      • I was able to create 30000 (under 40000) rows before receiving ORA-00040: active time limit exceeded - call aborted
                      • those 30000 rows threw ORA-32696: HTI: No free slot (assumed due to dealing with over 300 chunks)
                      • reducing the number of rows to 20000 (to keep the number of slots under 255) threw ORA-32695: HTI: Not enough memory
                      • at 3500 rows (under 4000) it finally worked: 3500 rows returned in 0.42 seconds (pretty proportional to 1000 rows returned in 0.14 seconds, but the factor of 3.5 might be too low for a serious judgement)
                      • 23. Re: Re: How to select csv data stored in a BLOB column as if it were an external table?
                        Etbin

                        Karthick's idea to split the CLOB into chunks has a very positive influence on recursion too


                        with
                        slicer(a_str,a_num,a_dte,a_clob) as
                        (select substr(chunk,1,instr(chunk,';',1,1) - 1),
                                substr(chunk,instr(chunk,';',1,1) + 1,instr(chunk,';',1,2) - instr(chunk,';',1,1) - 1),
                                substr(chunk,instr(chunk,';',1,2) + 1,instr(chunk,chr(10),1,1) - instr(chunk,';',1,2) - 1),
                                substr(chunk,instr(chunk,chr(10),1,1) + 1)
                           from (select to_char(substr(clob_column,
                                                       instr(clob_column,chr(10),3900 * (level - 1),1) + 1,
                                                       case when instr(clob_column,chr(10),3900 * level,1) = 0
                                                            then length(clob_column)
                                                            else instr(clob_column,chr(10),3900 * level,1)
                                                       end  - instr(clob_column,chr(10),3900 * (level - 1),1) - 1
                                                      )
                                               ) || chr(10) chunk
                                   from clob_table
                                connect by level <= ceil(length(clob_column) / 3900)
                                )
                         union all
                         select substr(a_clob,1,instr(a_clob,';',1,1) - 1),
                                substr(a_clob,instr(a_clob,';',1,1) + 1,instr(a_clob,';',1,2) - instr(a_clob,';',1,1) - 1),
                                substr(a_clob,instr(a_clob,';',1,2) + 1,instr(a_clob,chr(10),1,1) - instr(a_clob,';',1,2) - 1),
                                substr(a_clob,instr(a_clob,chr(10),1,1) + 1)
                           from slicer,
                                clob_table
                          where coalesce(a_str,a_num,a_dte) is not null
                        )
                        select a_str str,to_number(a_num) num,to_date(a_dte,'yyyy-mm-dd hh24:mi:ss') dte
                          from slicer
                         where coalesce(a_str,a_num,a_dte) is not null

                         

                        30000 rows returned in 10.50 seconds


                        however


                        select regexp_substr(ltrim(x.col1, ';'), '[^;]+', 1, 1) str
                        ,      to_number(regexp_substr(ltrim(x.col1, ';'), '[^;]+', 1, 2)) num
                        ,      to_date(regexp_substr(ltrim(x.col1, ';'), '[^;]+', 1, 3),'yyyy-mm-dd hh24:mi:ss') dte
                        from   clob_table
                        ,      xmltable('/a/b'
                                      passing xmltype(replace('<a><b>;'||replace(clob_column, chr(10), '</b><b>;')||'</b></a>', ';;', '; ;'))
                                      columns
                                        col1 varchar2(100) path '.') x
                        where  x.col1 != ';'


                         

                        30000 rows returned in 2.90 seconds
                        • 24. Re: How to select csv data stored in a BLOB column as if it were an external table?
                          odie_63

                          Pushing the initial XML approach suggested by Boneist a little further, we can also materialize the columns and directly project them without the need for any regex call:

                           

                          (don't pay attention to the table name, it's a general-purpose test table for BLOBs)
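
                          The query below also relies on a small conv_to_clob helper that isn't posted in the thread; a minimal sketch of what it might look like (an assumption only, using the same dbms_lob.converttoclob approach as chris227's function above):

                          create or replace function conv_to_clob(p_blob in blob) return clob
                          as
                            l_clob     clob;
                            l_dest_off integer := 1;
                            l_src_off  integer := 1;
                            l_ctx      integer := dbms_lob.default_lang_ctx;
                            l_warn     integer;
                          begin
                            dbms_lob.createtemporary(l_clob, true);
                            dbms_lob.converttoclob(l_clob, p_blob, dbms_lob.lobmaxsize,
                                                   l_dest_off, l_src_off,
                                                   dbms_lob.default_csid, l_ctx, l_warn);
                            return l_clob;
                          end;
                          /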

                          Connected to:
                          Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - Production
                          With the Partitioning, OLAP, Data Mining and Real Application Testing options
                          
                          SQL>
                          SQL> set autotrace traceonly statistics
                          SQL> set timing on
                          SQL>
                          SQL> select x.*
                            2  from test_jdbc_blob t
                            3     , xmltable('/a/b'
                            4         passing xmlparse(document
                            5                   '<a><b><c>' ||
                            6                   replace(
                            7                     replace(
                            8                       dbms_xmlgen.convert(conv_to_clob(t.image))
                            9                     , ','
                          10                     , '</c><c>'
                          11                     )
                          12                   , chr(10)
                          13                   , '</c></b><b><c>'
                          14                   ) ||
                          15                   '</c></b></a>'
                          16                   wellformed
                          17                 )
                          18         columns col1 varchar2(4000) path 'c[1]'
                          19               , col2 varchar2(4000) path 'c[2]'
                          20               , col3 varchar2(4000) path 'c[3]'
                          21               , col4 varchar2(4000) path 'c[4]'
                          22               , col5 varchar2(4000) path 'c[5]'
                          23               , col6 varchar2(4000) path 'c[6]'
                          24       ) x
                          25  where t.img_id = 'mockdata.csv';
                          
                          10000 rows selected.
                          
                          Elapsed: 00:00:02.27
                          
                          Statistics
                          ----------------------------------------------------------
                                    0  recursive calls
                                24598  db block gets
                                 3223  consistent gets
                                  155  physical reads
                                    0  redo size
                               723902  bytes sent via SQL*Net to client
                                 7746  bytes received via SQL*Net from client
                                  668  SQL*Net roundtrips to/from client
                                    0  sorts (memory)
                                    0  sorts (disk)
                                10000  rows processed
                          
                          

                           

                          NB1 : tested on a 10,000-row csv like this :

                          1,Theresa,Ramirez,tramirez0@timesonline.co.uk,South Korea,246.93.110.253

                          2,Jeremy,Welch,jwelch1@naver.com,Russia,109.244.189.37

                          3,David,Sullivan,dsullivan2@weibo.com,Serbia,133.110.46.44

                          4,Martin,Hunt,mhunt3@answers.com,Portugal,144.44.120.149

                          ...

                           

                          NB2 : the db version (>= 11.2.0.4) is important as it allows optimization on transient XMLType

                           

                          To be complete, I'll soon post a Java-based pipelined approach.

                          • 25. Re: Re: How to select csv data stored in a BLOB column as if it were an external table?
                            odie_63

                            A Java-based implementation:

                             

                            create or replace and compile java source named CSVTableSrc as
                            import java.sql.*;
                            import java.math.BigDecimal;
                            import java.util.ArrayList;
                            import java.io.BufferedReader;
                            import java.io.IOException;
                            import java.io.InputStreamReader;
                            import java.io.InputStream;
                            import java.io.UnsupportedEncodingException;
                            import oracle.sql.*;
                            import oracle.CartridgeServices.*;
                            
                            public class CSVTableImpl implements SQLData {
                            
                                private BigDecimal key;
                                final static BigDecimal SUCCESS = new BigDecimal(0);
                                final static BigDecimal ERROR = new BigDecimal(1);
                            
                                // Implement SQLData interface.
                                String sql_type;
                                public String getSQLTypeName() throws SQLException {
                                    return sql_type;
                                }
                            
                                public void readSQL(SQLInput stream, String typeName) throws SQLException {
                                    sql_type = typeName;
                                    key = stream.readBigDecimal();
                                }
                            
                                public void writeSQL(SQLOutput stream) throws SQLException {
                                    stream.writeBigDecimal(key);
                                }
                            
                                // stored context type
                                private static class StoredCtx {
                                    BufferedReader reader;
                                    String fieldSep;
                                    public StoredCtx(BufferedReader rd, String fs) {
                                        this.reader = rd;
                                        this.fieldSep = fs;
                                    }
                                }
                            
                                private static StructDescriptor outDesc;
                                private static ArrayDescriptor arrayDesc;
                                private static Connection conn;
                            
                                // type methods implementing ODCITable interface
                                // ===============================================================
                                //    ODCITableStart
                                // ===============================================================
                                static public BigDecimal ODCITableStart(STRUCT[] sctx, BLOB csv, String charset, String fieldsep)
                                        throws SQLException, UnsupportedEncodingException {
                                    // initialization
                                    conn = DriverManager.getConnection("jdbc:default:connection:");
                                    outDesc = StructDescriptor.createDescriptor("CSVTABLEROW", conn);
                                    arrayDesc = ArrayDescriptor.createDescriptor("CSVTABLE", conn);
                                    InputStream is = csv.getBinaryStream();
                                    BufferedReader reader = new BufferedReader(new InputStreamReader(is, charset));
                                    StoredCtx ctx = new StoredCtx(reader, fieldsep);
                                    // register stored context with cartridge services
                                    int key;
                                    try {
                                        key = ContextManager.setContext(ctx);
                                    } catch (CountException ce) {
                                        return ERROR;
                                    }
                                    // create an CSVTableImpl instance and store the key in it
                                    Object[] impAttr = new Object[] { new BigDecimal(key) };
                                    StructDescriptor sd = new StructDescriptor("CSVTABLEIMPL", conn);
                                    sctx[0] = new STRUCT(sd, conn, impAttr);
                            
                                    return SUCCESS;
                                }
                            
                                // ===============================================================
                                //    ODCITableFetch
                                // ===============================================================
                                public BigDecimal ODCITableFetch(BigDecimal nrows, ARRAY[] outSet)
                                        throws SQLException, IOException {
                                    // retrieve stored context using the key
                                    StoredCtx ctx;
                                    try {
                                      ctx = (StoredCtx)ContextManager.getContext(key.intValue());
                                    } catch (InvalidKeyException ik) {
                                        return ERROR;
                                    }
                                    // get the nrows parameter, but return up to 8k rows
                                    int nrowsval = nrows.intValue();
                                    if (nrowsval > 8192) {
                                        nrowsval = 8192;
                                    }
                                    // create a collection for the fetched rows
                                    ArrayList<STRUCT> coll = new ArrayList<STRUCT>(1024);
                                    String line;
                                    // iterate through lines
                                    while (nrowsval > 0 && (line = ctx.reader.readLine()) != null) {
                                        Object[] fields = line.split(ctx.fieldSep);
                                        coll.add(new STRUCT(outDesc, conn, fields));
                                    }
                                    // return if no rows found
                                    if (coll.isEmpty()) {
                                        return SUCCESS;
                                    }
                                    // create the output ARRAY using the collection
                                    outSet[0] = new ARRAY(arrayDesc, conn, coll.toArray());
                                    return SUCCESS;
                                }
                            
                                // ===============================================================
                                //    ODCITableClose
                                // ===============================================================
                                public BigDecimal ODCITableClose() throws SQLException, IOException {
                                    StoredCtx ctx;
                                    try {
                                        ctx = (StoredCtx)ContextManager.clearContext(key.intValue());
                                        ctx.reader.close();
                                    } catch (InvalidKeyException ik) {
                                        return ERROR;
                                    }
                                    return SUCCESS;
                                }
                            
                            }
                            

                             

                            Object types :

                            create or replace type CSVTableRow as object (
                              col1  varchar2(4000)
                            , col2  varchar2(4000)
                            , col3  varchar2(4000)
                            , col4  varchar2(4000)
                            , col5  varchar2(4000)
                            , col6  varchar2(4000)
                            );
                            
                            create or replace type CSVTable as table of CSVTableRow;
                            
                            create or replace type CSVTableImpl as object (
                              key integer
                            , static function ODCITableStart(sctx out CSVTableImpl, csv blob, charset in varchar2, fieldsep in varchar2) return number
                                as language java
                                name 'CSVTableImpl.ODCITableStart(oracle.sql.STRUCT[], oracle.sql.BLOB, java.lang.String, java.lang.String) return java.math.BigDecimal'
                            , member function ODCITableFetch(self in out CSVTableImpl, nrows in number, outSet out CSVTable) return number
                                as language java
                                name 'CSVTableImpl.ODCITableFetch(java.math.BigDecimal, oracle.sql.ARRAY[]) return java.math.BigDecimal'
                            , member function ODCITableClose(self in CSVTableImpl) return number
                                as language java
                                name 'CSVTableImpl.ODCITableClose() return java.math.BigDecimal'
                            );
                            

                             

                            Pipelined function :

                            create or replace function csv2table(
                              csv blob
                            , charset varchar2
                            , fieldsep varchar2
                            )
                            return CSVTable pipelined 
                            using CSVTableImpl;
                            

                             

                            Test :

                            SQL> select x.*
                              2  from test_jdbc_blob t
                              3     , table(csv2table(t.image, 'utf-8', ',')) x
                              4  where t.img_id = 'mockdata.csv';
                            
                            10000 rows selected.
                            
                            Elapsed: 00:00:02.15
                            
                            Statistics
                            ----------------------------------------------------------
                                     33  recursive calls
                                      0  db block gets
                                   2817  consistent gets
                                     78  physical reads
                                      0  redo size
                                 641974  bytes sent via SQL*Net to client
                                    519  bytes received via SQL*Net from client
                                     11  SQL*Net roundtrips to/from client
                                      6  sorts (memory)
                                      0  sorts (disk)
                                  10000  rows processed
                            
                            
                            • 26. Re: How to select csv data stored in a BLOB column as if it were an external table?
                              Hoek

                              Unfortunately, external tables are not an option in this specific case...

                              • 27. Re: How to select csv data stored in a BLOB column as if it were an external table?
                                Hoek

                                Amazing!

                                Thanks for all the additional inputs, all are being considered.

                                You all rock!

                                *tips virtual hat*

                                • 28. Re: How to select csv data stored in a BLOB column as if it were an external table?
                                  chris227

                                  Impressive, thanks for the demonstration.

                                  Would you mind giving some explanation of why this Java approach performs this fast?

                                  Does it outperform any PL/SQL-based approach? (I managed to push my approach to under 4 seconds for 10,000 rows (as far as this can be compared across different systems) by reducing the CLOB/BLOB access to near the minimum.)

                                  How does it scale for a bigger data set, like 1 billion rows (resp. million in Europe)?

                                   

                                  Thanks!

                                  • 29. Re: How to select csv data stored in a BLOB column as if it were an external table?

                                    Would you mind giving some explanation of why this Java approach performs this fast?

                                    Likely because Java is optimized for the main operations being performed:

                                        conn = DriverManager.getConnection("jdbc:default:connection:");
                                    . . .
                                        BufferedReader reader = new BufferedReader(new InputStreamReader(is, charset));
                                    . . .
                                        // iterate through lines
                                        while (nrowsval > 0 && (line = ctx.reader.readLine()) != null) {
                                            Object[] fields = line.split(ctx.fieldSep);
                                            coll.add(new STRUCT(outDesc, conn, fields));
                                        }

                                    1. The internal JDBC driver is being used ('jdbc:default:connection') - Oracle knows it is IN the database so no Oracle client or network functionality is used.

                                    2. The BufferedReader will buffer the input data

                                    3. The 'split' method of the String class ('line.split') parses the data by a regular expression into an array

                                    4. Java has TRUE arrays that are physically indexed using the known length of each entry

                                    5. Oracle's JDBC constructor for the STRUCT can directly use the array produced by the 'split' method

                                     

                                    Java, either inside or outside the DB, is often the best choice when the operations being performed match its skill set. The most common use case is when file operations are needed to access both server and CLIENT files. Java has built-in functionality for dealing with the OS and the file system and can easily produce a list of files in a directory.

                                    Does it outperform any PL/SQL-based approach?

                                    Only testing specific use cases would provide that answer.

                                    How does it scale for a bigger data set, like 1 billion rows (resp. million in Europe)?

                                     

                                    There is NOTHING in the Java code that doesn't scale.

                                     

                                    NOTE: The data cartridge functionality was used so that the custom object type's methods could DRIVE the entire process and be used in a single SQL statement to query the data.

                                     

                                    You don't need any of that if you just want to convert the data and store it somewhere. Java can just read the BLOBs/CLOBs directly from the source table and then do its processing.

                                     

                                    And since each CLOB is in a separate row, it would be rather trivial to use a multi-threaded approach and do the operation in parallel - each thread could work on a range of rows and insert its results into a table.

                                     

                                    That would be FAR SIMPLER than trying to parallelize the operation in PL/SQL.