2 Replies. Latest reply: Feb 19, 2013 11:15 AM by rp0428

    Pipelined functions and parallel processing

    rahulras
      Hi,

      I am on 11.2.0.3.
      I am writing a data transformation process which has 3 to 4 transformation (and data cleansing) steps.
      Probably, I will write a series of pipelined table functions, something like
      function1 --<< reads input data and does first step of transformation
      
      function2 : select * from TABLE ( function1 ) -- will do next transformation
      
      function3 : select * from TABLE( function2 ) -- more transformation
      The final output will be fetched using "select * from TABLE( function3 )" (let's call this the final-output-query).
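      In PL/SQL terms, roughly something like this (just a sketch; the object/collection types and table names are made up):

          create type t_row as object (id number, val varchar2(100));
          /
          create type t_tab as table of t_row;
          /

          create or replace function function1 return t_tab pipelined is
          begin
            for r in (select id, val from source_table) loop
              pipe row ( t_row(r.id, upper(r.val)) );   -- first transformation step
            end loop;
            return;
          end;
          /

          create or replace function function2 return t_tab pipelined is
          begin
            for r in (select * from table(function1)) loop
              pipe row ( t_row(r.id, trim(r.val)) );    -- next transformation step
            end loop;
            return;
          end;
          /

          -- function3 follows the same pattern on top of function2, and the
          -- final-output-query is just: select * from table(function3);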
      My question is: while the final-output-query is running in one session, are all the previous steps also running "in parallel" within the same session?
      E.g. when function1 passes a row to function2 and function2 processes that row, is function1 "at the same time" working on the next set of data?
      OR
      do all the functions process one record, produce the output, and only then does function1 start processing the next record?

      Thanks in advance
        • 1. Re: Pipelined functions and parallel processing
          APC
          Basically a single thread can only do one thing at a time. What you have posted is a serial process, which means that it goes
          f1->f2->f3
          (Actually, what happens is slightly more complicated than that if f1 drives round a cursor with a BULK COLLECT INTO ... LIMIT clause. In that situation Oracle passes the bulk-limit set of records to the next function. So it is more like
          f1->f2->f3-+
          ^          |
          |          |
          +----------+   
          until the cursor is exhausted.)

          It's still serial, though. Even if you have parallel-enabled your functions, what happens is that the f1->f2->f3 chain runs in multiple threads, each with a subset of the partitioned cursor.
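
          For illustration, the sort of bulk-fetch loop I mean inside f1 (a rough sketch only; src, t_tab and t_row are invented names):

              create or replace function f1 return t_tab pipelined is
                cursor c is select id, val from src;
                type t_buf is table of c%rowtype;
                l_buf t_buf;
              begin
                open c;
                loop
                  fetch c bulk collect into l_buf limit 100;  -- grab 100 rows at a time
                  for i in 1 .. l_buf.count loop
                    pipe row ( t_row(l_buf(i).id, l_buf(i).val) );  -- handed on to f2 as it is piped
                  end loop;
                  exit when c%notfound;
                end loop;
                close c;
                return;
              end;
              /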

          Cheers, APC

          Edited by: APC on Feb 19, 2013 4:07 PM

          Oh how I miss a WYSIWYG editor
          • 2. Re: Pipelined functions and parallel processing
            rp0428
            >
            I am on 11.2.0.3.
            I am writing a data transformation process which has 3 to 4 transformation (and data cleansing) steps.
            Probably, I will write a series of pipelined table functions, something like

            function1 --<< reads input data and does first step of transformation

            function2 : select * from TABLE ( function1 ) -- will do next transformation

            function3 : select * from TABLE( function2 ) -- more transformation

            The final output will be fetched using "select * from TABLE( function3 )" (let's call this the final-output-query).
            My question is: while the final-output-query is running in one session, are all the previous steps also running "in parallel" within the same session?
            E.g. when function1 passes a row to function2 and function2 processes that row, is function1 "at the same time" working on the next set of data?
            OR
            do all the functions process one record, produce the output, and only then does function1 start processing the next record?
            >
            You are getting ahead of yourself. You are already focused on a solution when, based on what you posted, you haven't completed the requirements process or identified other possible solutions. The recommended process is:

            1. Identify a problem.
            2. Document the requirements solutions must meet.
            3. Identify possible solutions for solving/mitigating the problem.
            4. Select one or two solutions for further evaluation and testing.
            5. Conduct testing of the selected candidate solutions.
            6. Implement your 'best' alternative.

            You already seem to be on step 5 ('Probably, I will write a series of pipelined table functions'). How can you consider that a viable solution if you don't understand how PIPELINED functions even work? Not trying to sound harsh, just realistic.

            Chapter 13 'Using Pipelined and Parallel Table Functions' of the Data Cartridge Dev Guide has an excellent description and summary that includes a discussion of the questions you asked:
            http://docs.oracle.com/cd/B28359_01/appdev.111/b28425/pipe_paral_tbl.htm
            >
            This chapter describes table functions. It also explains the generic datatypes ANYTYPE, ANYDATA, and ANYDATASET, which are likely to be used with table functions.

            This chapter contains these topics:

            • Overview of Table Functions

            • Table Function Concepts

            • Pipelined Table Functions

            • Parallel Table Functions

            • Input Data Streaming for Table Functions
            . . .
            Execution of a table function can be parallelized, and returned rows can be streamed directly to the next process without intermediate staging. Rows from a collection returned by a table function can also be pipelined; this means that they are iteratively returned as they are produced, instead of being returned in a single batch after all processing of the table function's input is completed.

            Streaming, pipelining, and parallel execution of table functions can improve performance in the following manner:

            • By enabling multithreaded, concurrent execution of table functions
            . . .
            Pipelined table functions can be implemented in two ways:

            • In the native PL/SQL approach, the consumer and producers can run on separate execution threads (either in the same or different process context) and communicate through a pipe or queuing mechanism. This approach is similar to co-routine execution.

            • In the interface approach, the consumer and producers run on the same execution thread. Producer explicitly returns the control back to the consumer after producing a set of results. In addition, the producer caches the current state so that it can resume where it left off when the consumer invokes it again.

            The interface approach requires you to implement a set of well-defined interfaces in a procedural language.
            >
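            To make the 'parallel' part of that concrete, a parallel-enabled pipelined function spec looks roughly like this (a sketch only; all names are invented):

                create or replace function transform1 (p_src sys_refcursor)
                  return t_tab
                  pipelined
                  parallel_enable (partition p_src by any)
                is
                  l_id  number;
                  l_val varchar2(100);
                begin
                  loop
                    fetch p_src into l_id, l_val;
                    exit when p_src%notfound;
                    pipe row ( t_row(l_id, l_val) );
                  end loop;
                  return;
                end;
                /

                -- the source query has to run in parallel for the slaves to get work:
                select *
                from   table( transform1( cursor(
                         select /*+ parallel(source_table, 4) */ id, val from source_table ) ) );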
            Whether PIPELINED functions are appropriate for your use case depends a great deal on exactly the type of transformations and data cleansing that are being done and to what degree you need to be able to restart/recover from data problems that will occur.

            Data cleansing, in particular, can encounter data issues that cannot be resolved in real time. A one-pass PIPELINED approach needs to filter out the bad data and somehow keep processing the good data. The end result of a 'batch run' might be a fraction of the entire data set.

            That means that the discovered data problems need to be examined and resolved and then another 'batch run' done to reprocess that data.

            So if transform4 calls transform3, which calls transform2, which calls transform1, and transform1 finds data that cannot be processed further, what needs to happen in your use case? Transform1 needs to be able to filter out any bad data and keep feeding good data to the next step.
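
            Something along these lines (a sketch only; names invented), where the bad rows get set aside for later review instead of stopping the whole run:

                create or replace procedure log_bad_row (p_id number, p_reason varchar2) is
                  pragma autonomous_transaction;   -- so the log entry survives independently of the run
                begin
                  insert into bad_rows (id, reason, logged_at)
                  values (p_id, p_reason, sysdate);
                  commit;
                end;
                /

                create or replace function transform1 (p_src sys_refcursor) return t_tab pipelined is
                  l_id  number;
                  l_val varchar2(100);
                begin
                  loop
                    fetch p_src into l_id, l_val;
                    exit when p_src%notfound;
                    if l_val is null then
                      log_bad_row(l_id, 'missing value');       -- bad data set aside for later
                    else
                      pipe row ( t_row(l_id, upper(l_val)) );   -- good data keeps flowing downstream
                    end if;
                  end loop;
                  return;
                end;
                /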

            The more traditional approach is to call transform1 to perform the first cleansing/validation step, examine the results, fix any data issues, and then reprocess the 'bad but now fixed' data. The result after a 'successful' run of step 1 is that ALL data is available and ready to be processed by step 2.
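
            In plain SQL terms that staged approach can lean on DML error logging, so the bad rows are diverted rather than killing the load (a sketch; stage1 and source_table are invented names for pre-created tables):

                -- one-off setup: creates ERR$_STAGE1 to catch the rejects
                exec dbms_errlog.create_error_log(dml_table_name => 'STAGE1')

                insert into stage1 (id, val)
                select id, upper(val)
                from   source_table
                log errors into err$_stage1 ('step 1') reject limit unlimited;

                -- examine err$_stage1, fix the offending rows, rerun just those,
                -- and only then let step 2 loose on a complete stage1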

            Because of the different types of data in a system you often need a mix of these approaches.

            The 'one-step-at-a-time' method is used for the application-specific data such as lookup tables and common data tables that are global to the application. For example, a table of valid REGIONS or INVENTORY items is usually cleansed and processed one stage at a time.

            You don't really want to process customer data for all inventory items except vacuum cleaners. You can't really process invoices that only include SOME of the products. The master inventory table needs to be 100% complete and clean before you process the data that refers to it.

            In short, much of what would be considered DIMENSION data needs to be squeaky clean before you process the main customer/app data.

            The 'pipelined' method can then often be used productively at a higher level; for example, one customer at a time. There often aren't any issues with a staging process that only includes 90% of the customers and that filters out a 'CUSTOMER' if any part of the customer info has problems.

            It is easy to rerun processes for a subset of customers.

            I suggest you flesh out your requirements, in particular with regard to how each step will identify problem data, how that problem data has to be handled, and what level of restart/recovery is needed for each step. In a multi-stage pipelined process you don't just restart step 4; you wind up re-executing ALL of the pipelined steps.