Forum Stats

  • 3,826,352 Users
  • 2,260,634 Discussions
  • 7,896,914 Comments

Discussions

Need help with writing a query!

User_K6LH0
User_K6LH0 Member Posts: 64 Blue Ribbon
edited Jun 18, 2022 6:50PM in SQL & PL/SQL

Hi,

We have a process which can generate the following data rows each run (Per Day):

As part of the query I need to make sure that anytime a 2nd row appears for the same subject - We first select the terminated row (In this case the 3/1/2022 in Run 2) and then for each subsequent runs, just select the 2nd row with a start date of 6/1/2022 for Run 3, Run 4...

The challenge I am facing is how to make sure I pick the first row in Run 2 but in all subsequent runs - just pick the 2nd row for the same subject.

Any ideas or help would be greatly appreciated. Sharing a livesql link, please check:

*Note - The 2nd row that appears can also have an end date

=====================================================================

Edit1:

I have updated the screenshot above + livesql link in the hopes of making the question clearer. Please note the highlighted rows member data is exactly same. And the row highlighted in green denotes the correct row we need per run_id and name combination.

Some background on the process: The target table will be initially empty but once we go live and the process starts running, it will contain member (Eg Abc from above) records with their division start and end dates (Also shown in the above screenshot).

Lets say we have a function that gets us the sequence_id based on run_id and name inputs:

fn_get_sequence_id(run_id, name) should get us the following results based on the data above

Please advise if additional information/clarification is needed.

=====================================================================

Thanks!

Tagged:

Best Answer

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 42,052 Red Diamond
    Answer ✓

    Hi, @User_K6LH0

    If there was a column: "communicated_date" in the table to denote the rows which have been sent on the file previously, would you do anything different in the query you wrote above?

    Most of the function would be the same, but the first sorting criterion would change. Without a communicated_date column, a CASE expression and EXISTS sub-query assigned all rows to one of two groups:

    1. (Best match) A row has an end_date and there was a similar row (i.e. same name, division and start_date with the previous run_id) but a NULL end_date.
    2. (Worst match) All others.

    With a communicated_date column, then (if I usderstand your requirements) rows can be assigned to four groups:

    1. (Best match) There is a similar row that was communicated, but it had a NULL end_date that has now been supplied
    2. Same as 1., except that no such row was communicated
    3. There were no similar rows
    4. (Worst match) There was a similar row that was communicated with the same end_date

    Here's one way to implement those rules:

    CREATE OR REPLACE FUNCTION fn_get_sequence_id
    ( run_id_wanted IN test.run_id%TYPE
    , name_wanted  IN test.name%TYPE
    )
    RETURN test.sequence_id%TYPE
    IS
      return_val  test.sequence_id%TYPE;
    BEGIN
      SELECT MIN (sequence_id)
      	    KEEP ( DENSE_RANK FIRST ORDER BY (
                                                        -- Begin chnaged section
    		    		  	    	   SELECT NVL ( MAX ( CASE
    									   WHEN  s.communicated_date IS NOT NULL
    									   AND   s.end_date          IS NULL
    									   AND   m.end_date	     IS NOT NULL
    									     THEN 1
    									   WHEN  s.end_date          IS NULL
    									   AND   m.end_date	     IS NOT NULL
    									     THEN 2
    								  	   WHEN  s.communicated_date IS NOT NULL
    									     THEN 4
    									 END
    									)
    
    								 , 3
    								 )
    						   FROM  test s -- s is for sub-query
    						   WHERE  s.run_id    = m.run_id - 1
    						   AND   s.name       = m.name
    						   AND   s.division   = m.division
    						   AND   s.start_date = m.start_date
    		    		  	    	 )
                                                     -- End changed section
    		   		 	 ,       start_date DESC
    					 ,	 end_date  DESC
    		   )
      INTO    return_val
      FROM    test m -- M is for Main query
      WHERE   run_id       = run_id_wanted
      AND	  name        = name_wanted;
    
      RETURN return_val;
    END fn_get_sequence_id;
    

    Comments indicated the part that was changed from the previous solution.

«1

Answers

  • mathguy
    mathguy Member Posts: 10,539 Blue Diamond

    Suppose there is a Run 4. Will it also have exactly two rows? If so, you should say so in your problem description. If not - could it have just one row? If so, how do you need to handle that? Can it have three or more rows? If so, how do you need to handle that?

    Then - is the run number (what you show in the first column) part of what is inserted in your table? If so, why doesn't the run number appear on EVERY row, and instead it is shown only on the first row (of that run)? If that is really what is inserted, then how do you know which row is from which run - except for the rows that do show the run number?

    You talk about rows for the same "subject". What does "subject" mean? Name? Name and division? Something else?

  • User_K6LH0
    User_K6LH0 Member Posts: 64 Blue Ribbon

    If there is a Run 4 with 3 rows - Division 2 line will be termed which needs to be selected and then going forward the newly added Division 3 line will have to be selected - Hope that makes sense.

    Run number is part of every row, it's not in the image but is part of the livesql link shared.

    Subject = Name + Division combination.

    So every time a new Name + Division combination shows up, we need to select the old Name + Division and then going forward select the latest Name + Division combination

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 42,052 Red Diamond

    Hi @User_K6LH0

    It's still unclear what you want. Show the complete, exact results you want for each run, and explain why you want those results and no others. For example, "When only the rows with run_id <= 3 are in the table, then the results should be ... because .... After the rows with run_id = 4 are inserted, then I want ... because ...".

    Not everyone who wants to help you knows your application as well as you do. Make sure you define all the application-specific terms you use.

    We first select the terminated row

    What is a "terminated row"? Is it a row where end_date is not NULL? What if there are two (or more) terminated rows?

    pick the first row

    What is the "first" row? Is it the row with the lowest run_id? The row with the earliest start_date? Something else? What if there is a tie (that is, two or more rows have an equal claim to being first)?

  • User_K6LH0
    User_K6LH0 Member Posts: 64 Blue Ribbon

    Hi @Frank Kulash - Thanks for your response. I have updated the main post with some additional information/clarification under "Edit 1" heading. Have also update the screenshots + livesql script for reference. Please let me know if this helps a bit. Thank you!

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 42,052 Red Diamond
    edited Jun 18, 2022 8:26PM

    Hi, @User_K6LH0

     I have updated the main post

    Please don't ever change your messages after you post them: it makes the thread hard to read and your changes easy to miss. Post all additions and corrections in a new message at the end of the thread. Post CREATE TABLE and INSERT statements for the sample data. Does every row in your actual table have the same value for name? If not, then not every row of the sample data should have the same value for name, either. Always post the exact results you want from that sample data, that is, the exact output you want from the query. For example: "When the table contains only rows for run_id = 1, then the output should be:

    SEQUENCE_ID RUN_ID NAME DIVISION START_DATE END_DATE
    ----------- ------ ---- -------- ---------- --------
    ...
    

    After the rows with run_id = 2 are added, then the results should be:

    SEQUENCE_ID RUN_ID NAME DIVISION START_DATE END_DATE
    ----------- ------ ---- -------- ---------- --------
    ...
    

    Then, when the rows with run_id = 3 are added ..."


    We first select the terminated row

    Once again, what does "termintated" mean? How can we tell if a row is terminated or not? What happens if two pf more rows are terminated?

    for each subsequent runs, just select the 2nd row

    Which row is the 2nd row? Is it the row with the 2nd lowest value of division? The 2nd highest value of division? The 2nd lowest value of start_date? The 2nd highest value of start_date? The 2nd lowest value of sequence_id? The 2nd ... What happens when there is a tie?

  • User_K6LH0
    User_K6LH0 Member Posts: 64 Blue Ribbon
    edited Jun 19, 2022 3:19AM

    Hi @Frank Kulash, the CREATE TABLE/DDL statements have been shared in the livesql link, please see below:

    Oracle Live SQL - Script: Need help with writing a query!https://livesql.oracle.com/apex/livesql/s/nnenumjm74gpqj3kbrgt0az56

    Here is the sample data in a picture format for member "Abc":

    Please note the highlighted rows member data is exactly same (Run_Id 2 and 3 data matches and Run_Id 4 and 5 data matches). And the row highlighted in green denotes the correct row we need per run_id and name combination.

    Some background on the process: The target table will be initially empty but once we go live and the process starts running, it will contain member (Eg Abc from above) records with their division start and end dates (Also shown in the above screenshot).

    Lets say we have a function that gets us the sequence_id based on run_id and name inputs:

    fn_get_sequence_id(run_id, name) should get us the following results based on the data above

    Please advise if additional information/clarification is needed.

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 42,052 Red Diamond

    Hi, @User_K6LH0

    Sorry, I don't have a clear idea of your requiremnets in general terms. Specific examples are helpful, but general rules are absolutely essential. I still don't understand what "terminated" or "second" mean, and I don't have any sample data that I can easily cut and paste. What does "communicated" mean? How do you know if a row has been communicated or not? What happens if all rows have been communicated?

  • User_K6LH0
    User_K6LH0 Member Posts: 64 Blue Ribbon

    Hi @Frank Kulash, please see below for DDL/DML statements and some additional clarifications:

    Background: The test table holds member (name) rows with their respective start and end dates in a given division. Every time the base process runs, it will look into the core application tables (not shows here) and if any member + divisions have had any new inserts or updates to their respective start and end dates - Those rows will get inserted into this table. This is a staging table which holds data before it's sent over/"communicated" to the target application. I am in the process of generating a file feed out of this test table which will be sent over to the target system after each run_id. Only selected data from the latest run_id will be on the file. Process execution happens in the following manner:

    OLTP System makes updates -> Base Process Runs and inserts data into the below "test" staging table with all inserts/updates -> The process I am designing right now will select data from the staging table for the latest run_id and create an outbound file with the member name, division and start and end dates - That file is sent over/communicated to the target system.

    The term "terminated" - It means a row in the test table has a "end_date" populated. If the "end_date" is NULL, that means the row is not terminated. If it has a date, that means the row has been terminated. Depending on whether SYSDATE falls between the row's start and end date, that row could still be active but if the "end_date" is populated, we consider that row as being termed.

    The term "second" - In the context above, I was referring to run_id = 3 where the second row (sequence_id = 5) is the correct row which needs to be selected.

    Goal: I would like to have a function built fn_get_sequence_id(run_id, name) which would return me the correct sequence_id based on the grid shown above.

    CREATE TABLE test  
    (  
    sequence_id     NUMBER GENERATED ALWAYS as IDENTITY(START with 1 INCREMENT by 1) 
    , run_id        NUMBER  
    , name          VARCHAR2(50)  
    , division      NUMBER  
    , start_date    DATE  
    , end_date      DATE  
    );
    
    --Step 0: Table is empty
    
    --Step 1: Base process runs to populate table with run_id = 1
    INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (1, 'Abc', 1, TO_DATE('01/01/2020','MM/DD/YYYY'), NULL);
    
    --Correct Record: During run_id = 1, only 1 row for Abc is present and
    --that's what gets returned
    
    --Step 2: Base process runs to populate table with run_id = 2
    INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (2, 'Abc', 1, TO_DATE('01/01/2020','MM/DD/YYYY'), TO_DATE('03/01/2022','MM/DD/YYYY'));
    
    INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (2, 'Abc', 2, TO_DATE('06/01/2022','MM/DD/YYYY'), NULL);
    
    --Correct Record: During run_id = 2, two rows for Abc are present - The row from run_id = 1 
    --now has a new end_date of 3/1/2022 so this new end_date update needs to be sent over in
    --the file before the newer row with a start_date of 6/1/22 and division 2 can be sent over
    
    --Step 3: Base process runs to populate table with run_id = 3
    INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (3, 'Abc', 1, TO_DATE('01/01/2020','MM/DD/YYYY'), TO_DATE('03/01/2022','MM/DD/YYYY'));
    
    INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (3, 'Abc', 2, TO_DATE('06/01/2022','MM/DD/YYYY'), NULL);
    
    --Correct Record: During run_id = 3, the row having start and end_date of 1/1/2020 - 3/1/2022 has already been sent over in the last file run where run_id = 2. So the new row with 6/1/2022 needs to be selected for run_id = 3.
    
    --Step 4: Base process runs to populate table with run_id = 4
    INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (4, 'Abc', 1, TO_DATE('01/01/2020','MM/DD/YYYY'), TO_DATE('03/01/2022','MM/DD/YYYY'));
    
    INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (4, 'Abc', 2, TO_DATE('06/01/2022','MM/DD/YYYY'), TO_DATE('07/01/2022','MM/DD/YYYY'));
    
    INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (4, 'Abc', 3, TO_DATE('07/02/2022','MM/DD/YYYY'), NULL);
    
    --Correct Record: During run_id = 4, the row from run_id = 3 now has a new end_date of 7/1/2022 which needs to be sent over and thus sequence_id = 7 is selected
    
    --Step 5: Base process runs to populate table with run_id = 5
    INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (5, 'Abc', 1, TO_DATE('01/01/2020','MM/DD/YYYY'), TO_DATE('03/01/2022','MM/DD/YYYY'));
    
    INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (5, 'Abc', 2, TO_DATE('06/01/2022','MM/DD/YYYY'), TO_DATE('07/01/2022','MM/DD/YYYY'));
    
    INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (5, 'Abc', 3, TO_DATE('07/02/2022','MM/DD/YYYY'), NULL);
    
    --Correct Record: During run_id = 5, the row having the start and end_date of 6/1/2022 to 7/1/2022 has already been sent over during the last run. So the latest row having 7/2/2022 as the start_date needs to be selected for this run_id
    

    Please let me know if you have any questions.

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 42,052 Red Diamond

    Hi, @User_K6LH0

    I still haven't seen a description, in general terms, of what you want to do.

    Goal: I would like to have a function built fn_get_sequence_id(run_id, name) which would return me the correct sequence_id based on the grid shown above.

    The grid shown above contains specific examples that illustrate some general rule. What is that general rule? For example: "Given a run_id and name, I want to find the sequence_id of row with the latest ternminated run with that run_id and name that has not already been communicated, where :latest" means in order by start_date. We don't need to worry about ties, because the combination (run_id, name, start_date) is unique. If there is no uncommunicated terminated row with that run_id and name, then I want the sequence_id of the latest unterminated row. If all rows with the given run_id and name have been communicated, then return NULL."

    Earlier, I asked how we can tell if a row has been communicated or not. I think you're saying that there is nothing in the table that says if a row has been communicated or not. If that's the case, then you need to add that information to the table, or perhaps another table. If you have a column called communicated_date in the test table which is not NULL if (and only if) the row has been communicated, then you can do something like this:

    CREATE OR REPLACE FUNCTION fn_get_sequence_id
    ( run_id_wanted IN test.run_id%TYPE
    , name_wanted  IN test.name%TYPE
    )
    RETURN test.sequence_id%TYPE
    IS
      return_val  test.sequence_id%TYPE;
    BEGIN
      SELECT MIN (sequence_id)
      	    KEEP ( DENSE_RANK FIRST ORDER BY NVL2 (end_date, 1, 2)
    		   	            ,        start_date DESC
    		 )
      INTO   return_val
      FROM   test
      WHERE  run_id            = run_id_wanted
      AND	 name              = name_wanted
      AND	 communicated_date IS NULL;
    
      RETURN return_val;
    END fn_get_sequence_id;
    /
    


    To test the function, you can run this query

    SELECT   run_id, name
    ,	 fn_get_sequence_id (run_id, name) AS f
    FROM	 test
    GROUP BY run_id, name
    ORDER BY run_id, name
    ;
    


  • User_K6LH0
    User_K6LH0 Member Posts: 64 Blue Ribbon

    Hi @Frank Kulash - Thanks for your response. I do see now what you mean by a general problem statement so here goes:

    Given a run_id and name I want to find the sequence_id of row with the following rules:

    1) If there is only 1 row available/valid in the table for a combination of run_id and name - Return that row and STOP

    2) If there are more than 1 row available for a combination of run_id + name then do the following:

    i) Pick the row which was sent in the previous file but now has a new/updated end_date populated and STOP

    ii) If the current row's division, start and end date exactly matches with the previous row sent over in last file then disregard the current row (This means that row has already been sent over in the last file and can be marked as "invalid" so as not to consider the same row for sending) and move on to the next available row and go to Step 1

    iii) If there are multiple rows for the same combination of name + run_id and none of the rows have been previously sent over in the past files then pick the one row with the MAX start_date and also MAX end_date: ORDER BY start_date DESC, end_date DESC and get the first row

    And yes - So far, I haven't added a "communicated_date" column in the table but will probably have to add one. And we need to worry about comparisons with what was sent over in the past file for the requirements 2.ii and 2.iii from above.

    Please let me know if this makes sense. Thank you!