
    DBMS_PARALLEL_EXECUTE chunk size

    Gwydion

      Hi,

       

      I'd like to process the data of a table in parallel using DBMS_PARALLEL_EXECUTE. I need to keep each chunk rather small (about 100 rows), but this doesn't work.

      I tried

      dbms_parallel_execute.create_chunks_by_rowid(
            task_name   => g_task_name,
            table_owner => user,
            table_name  => 'QUEUE',
            by_row      => TRUE,
            chunk_size  => 100 );

       

      However, the chunk size was much larger, 1500 rows or more. I know that chunk size is just an approximation, but even when I tried setting the chunk size to a single row or block, the smallest chunk I could get was about 500 rows. I need small chunks, since I want to send the data later via HTTP and the requests should be rather small. I could split each chunk manually, but I would prefer to avoid this extra work. Any idea?

        • 1. Re: DBMS_PARALLEL_EXECUTE chunk size
          Billy~Verreynne

          Perhaps the number of rows per block plays a role here. When chunking, I would think one performance goal would be to read a data block only once - and not multiple times by different parallel processes, each lifting only 1 or 2 rows from the block. Ideally a single parallel process reads the data block and does all the rows in it, as that means less I/O in total. And I/O is what is horribly expensive.

           

          In which case I would have designed the chunking interface to use data blocks as boundaries, and not rowids within the same data block (kind of like aligning structs on word boundaries in C/Pascal to improve performance).

           

          Perhaps check if something like this is the case?
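
          For what it's worth, an untested variation on your original call (same names as in your snippet) that chunks on block boundaries instead of rows just flips the BY_ROW flag - CHUNK_SIZE then counts blocks rather than rows:

          dbms_parallel_execute.create_chunks_by_rowid(
                task_name   => g_task_name,
                table_owner => user,
                table_name  => 'QUEUE',
                by_row      => FALSE, -- chunk_size now means blocks, not rows
                chunk_size  => 1 );   -- roughly one data block per chunk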

          • 2. Re: DBMS_PARALLEL_EXECUTE chunk size
            Mike Kutz

            I've run into a similar issue myself.

            However, my chunk sizes were either 10x the number of rows I wanted or 0 rows per chunk.

            Without getting into too much detail: the rows had moved, which caused a lot of empty space between min(ROWID) and max(ROWID).

             

            Remember, the algorithm used to find :START_ROWID and :END_ROWID for each chunk is very simplistic and is based on the min/max values.

            As such, it isn't going to do what you want 100% of the time.

             

            I'll assume there are major gaps in your PK, which prevents you from using chunk_by_number.

            You will have to go with the "CHUNK_BY_SQL" route.

            The 'easiest to code' method may be to write a pipelined function that feeds the CHUNK_BY_SQL call.
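
            Purely as an untested sketch (the object/collection types, the QUEUE_CHUNKS function, and the ID column are made up for illustration), that route could look something like this:

            create type chunk_rec_t as object ( start_id number, end_id number );
            /
            create type chunk_tab_t as table of chunk_rec_t;
            /
            create or replace function queue_chunks( p_rows_per_chunk in number )
              return chunk_tab_t pipelined
            is
            begin
              -- walk the PKs in order and emit one (start_id, end_id) pair
              -- for every P_ROWS_PER_CHUNK rows
              for r in ( select min(id) start_id, max(id) end_id
                         from ( select id,
                                       ceil( row_number() over (order by id)
                                             / p_rows_per_chunk ) grp
                                from queue )
                         group by grp )
              loop
                pipe row( chunk_rec_t( r.start_id, r.end_id ) );
              end loop;
              return;
            end;
            /
            -- then hand it to the package as the chunk statement
            begin
              dbms_parallel_execute.create_chunks_by_sql(
                task_name => g_task_name, -- the task name from your snippet
                sql_stmt  => 'select start_id, end_id from table( queue_chunks(100) )',
                by_rowid  => false );
            end;
            /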

             

            PS - 100 rows for HTTP sounds small.

            If this is a web-service, you want to make 1 call for the 100 rows in each chunk, not 100 calls of 1 row each per chunk.

            This is the same concept as in SQL: bulk process, not slow-by-slow.

            • 3. Re: DBMS_PARALLEL_EXECUTE chunk size
              Gwydion

              Exactly, I want to make one HTTP request per 100 rows. Actually, I generate a rather large XML from each row, so the actual request isn't that small.

              I will try using CHUNK_BY_SQL; currently I simply split each chunk into smaller slices within the task. This isn't really a problem, but it did require a few extra lines of code.
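
              Roughly, that in-task splitting looks like this (simplified, untested sketch; the QUEUE cursor and the inner SEND_BATCH procedure are just placeholders for the real XML/HTTP code):

              create or replace procedure process_queue( p_start_rowid in rowid,
                                                         p_end_rowid   in rowid )
              is
                cursor c is
                  select q.*
                  from   queue q
                  where  q.rowid between p_start_rowid and p_end_rowid;
                type row_tab_t is table of c%rowtype;
                l_rows row_tab_t;

                procedure send_batch( p_rows in row_tab_t ) is
                begin
                  null; -- build one XML document from P_ROWS and post it (one HTTP call per slice)
                end;
              begin
                open c;
                loop
                  fetch c bulk collect into l_rows limit 100; -- 100-row slices
                  exit when l_rows.count = 0;
                  send_batch( l_rows );
                end loop;
                close c;
              end;
              /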

              • 4. Re: DBMS_PARALLEL_EXECUTE chunk size
                Mike Kutz

                I just thought about the SQL statement for CHUNK_BY_SQL.

                I don't think you need to use a pipelined function.

                A single SQL statement with some analytic functions should do just nicely.

                 

                row_number() to number the PKs sequentially

                mod() to find the first PK (START_ID) of each chunk

                lead() to find the next chunk's START_ID (then you subtract 1 to get this chunk's END_ID)

                max() for the END_ID of the last chunk, since LEAD() will produce NULL there

                 

                I haven't tested it, but this is what I came up with.

                with start_id_list as (
                  -- find the first row (START_ID) of each chunk
                  select PK, max_pk
                  from (
                    select PK
                      ,mod(
                          row_number() over (order by PK)
                        ,N ) mod_n           -- N = number of rows per chunk
                      ,max(PK) over () max_pk -- needed for the last chunk's END_ID
                    from T
                  )
                  where mod_n = 1 -- row_number() is 1-based, so the first row of each chunk has mod_n = 1
                )
                select PK as start_id
                  ,lead(PK, 1, max_pk + 1) over (order by PK) - 1
                   as end_id
                from start_id_list
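
                And to wire it up, a rough, untested sketch (the task name 'queue_http_task', the PROCESS_QUEUE procedure, and PARALLEL_LEVEL => 4 are just placeholders):

                declare
                  -- l_chunk_sql is assumed to hold the analytic statement above,
                  -- with N replaced by the rows-per-chunk you want (e.g. 100)
                  l_chunk_sql varchar2(4000) := '...';
                begin
                  dbms_parallel_execute.create_task( task_name => 'queue_http_task' );

                  dbms_parallel_execute.create_chunks_by_sql(
                    task_name => 'queue_http_task',
                    sql_stmt  => l_chunk_sql,
                    by_rowid  => false ); -- the chunk statement returns numbers, not rowids

                  dbms_parallel_execute.run_task(
                    task_name      => 'queue_http_task',
                    sql_stmt       => 'begin process_queue( :start_id, :end_id ); end;',
                    language_flag  => dbms_sql.native,
                    parallel_level => 4 );
                end;
                /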