I'd like to process the data of a table in parallel using DBMS_PARALLEL_EXECUTE. I create the chunks like this:

    dbms_parallel_execute.create_chunks_by_rowid(
        task_name   => g_task_name,
        table_owner => user,
        table_name  => 'QUEUE',
        by_row      => TRUE,
        chunk_size  => 100 );
However, the actual chunks were much larger, 1500 rows or more. I know that the chunk size is just an approximation, but even when I set the chunk size to a single row or block, the smallest chunk I could get was about 500 rows. I need small chunks, since I want to send the data later via HTTP and the requests should be fairly small. I could split each chunk manually, but I would prefer to avoid that extra work. Any ideas?
Perhaps rows per block plays a role here. When chunking, I would think one performance goal is to read a data block only once, and not multiple times by different parallel processes, each lifting only 1 or 2 rows from the block. Ideally a single parallel process reads the data block and handles all the rows in it, as that means less I/O in total. And I/O is what is horribly expensive.
In which case I would have designed the chunking interface to use data blocks as boundaries, and not rowids within the same data block as boundaries (kind of like using word boundaries in C/Pascal when defining structs to improve performance).
Perhaps check if something like this is the case?
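Something along these lines should show it (untested sketch; it just counts how many rows live in each data block of the QUEUE table):

    select dbms_rowid.rowid_relative_fno(rowid) as file_no
          ,dbms_rowid.rowid_block_number(rowid) as block_no
          ,count(*)                             as rows_in_block
    from   queue
    group by dbms_rowid.rowid_relative_fno(rowid)
            ,dbms_rowid.rowid_block_number(rowid)
    order by rows_in_block desc;

If rows_in_block runs into the hundreds, rowid chunking can hardly give you chunks smaller than that.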
I've run into a similar issue myself.
However, my chunk sizes were either 10x the number of rows I wanted or 0 rows per chunk.
Without getting into too much detail: the rows had moved, which left a lot of space between min(ROWID) and max(ROWID).
Remember, the algorithm used to find :START_ROWID and :END_ROWID for each chunk is very simplistic and is based on the min/max values.
As such, it isn't going to do what you want 100% of the time.
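One way to see what actually happened is to look at the generated chunks in USER_PARALLEL_EXECUTE_CHUNKS and count the rows each rowid range really covers (a sketch; 'MY_TASK' is just a placeholder for your task name):

    select c.chunk_id
          ,c.start_rowid
          ,c.end_rowid
          ,( select count(*)
             from   queue q
             where  q.rowid between c.start_rowid and c.end_rowid ) as rows_in_chunk
    from   user_parallel_execute_chunks c
    where  c.task_name = 'MY_TASK'   -- placeholder: your task name
    order by c.chunk_id;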
I'll assume there are major gaps in your PK, preventing you from using chunk_by_number.
You will have to go with the "CHUNK_BY_SQL" route.
The 'easiest to code' method may be to write a pipelined function that is then queried by the 'chunk_by_sql' statement; see the sketch below.
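Roughly like this (untested sketch; the names queue_chunk_t, queue_chunk_tab and queue_chunks are made up for illustration, and I'm assuming the key column is called PK):

    create or replace type queue_chunk_t as object (start_id number, end_id number);
    /
    create or replace type queue_chunk_tab as table of queue_chunk_t;
    /
    create or replace function queue_chunks( p_rows_per_chunk in number )
      return queue_chunk_tab pipelined
    is
    begin
      -- bucket the PKs into groups of p_rows_per_chunk rows
      -- and emit one start/end pair per bucket
      for r in ( select min(pk) as start_id
                       ,max(pk) as end_id
                 from ( select pk
                              ,ceil( row_number() over (order by pk)
                                     / p_rows_per_chunk ) as grp
                        from   queue )
                 group by grp )
      loop
        pipe row ( queue_chunk_t( r.start_id, r.end_id ) );
      end loop;
      return;
    end;
    /

The chunk statement then simply becomes: select start_id, end_id from table(queue_chunks(100)).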
PS - 100 rows for HTTP sounds small.
If this is a web-service, you want to make 1 call for the 100 rows for each chunk, not 100 calls of 1 row each for each chunk.
This is the same concept as SQL. Bulk Process, not slow-by-slow.
Exactly, I want to make one HTTP request per 100 rows. Actually, I generate a rather large XML from each row, so the actual request isn't that small.
I will try using CHUNK_BY_SQL; currently I simply split each chunk into smaller slices within the task. That isn't really a problem, but it meant writing a few extra lines of code.
I just thought about the SQL statement for CHUNK_BY_SQL.
I don't think you need to use a pipelined function.
A single SQL statement with some Analytics should do just nicely.
row_number() to sequentialize the PKs
mod() to find out the START_ID of each chunk
lead() to find out the next START_ID (then you subtract 1 to get END_ID)
max() for the END_ID of the last chunk since LEAD() will produce NULL
I haven't tested it, but this is what I came up with.
    with start_id_list as (
      -- find the first row of each chunk
      select PK, max_pk
      from ( select PK
                   ,mod( row_number() over (order by PK), N ) mod_n  -- N = # of rows per chunk
                   ,max(PK) over () max_pk                           -- last chunk's END_ID
             from T )
      where mod_n = 1   -- row_number() is a 1-based index; you want mod() - 1 = 0
    )
    select PK as start_id
          ,lead(PK, 1, max_pk + 1) over (order by PK) - 1 as end_id
    from start_id_list
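You'd then pass that statement (with N replaced by a literal, say 100) to CREATE_CHUNKS_BY_SQL with by_rowid set to false, roughly like this (untested; 'MY_TASK' is a placeholder for the task name):

    begin
      dbms_parallel_execute.create_chunks_by_sql(
        task_name => 'MY_TASK',      -- placeholder: your task name
        sql_stmt  => q'[
          with start_id_list as (
            select PK, max_pk
            from ( select PK
                         ,mod( row_number() over (order by PK), 100 ) as mod_n
                         ,max(PK) over ()                             as max_pk
                   from   T )
            where mod_n = 1 )
          select PK                                              as start_id
                ,lead(PK, 1, max_pk + 1) over (order by PK) - 1  as end_id
          from   start_id_list ]',
        by_rowid  => false );
    end;
    /

The task's run statement then picks up each slice with WHERE pk BETWEEN :start_id AND :end_id.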