SQL & PL/SQL

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Need help with writing a query!

User_K6LH0Jun 18 2022 — edited Jun 18 2022

Hi,
We have a process which can generate the following data rows each run (Per Day):
Screen Shot 2022-06-18 at 1.22.53 PM.png As part of the query I need to make sure that anytime a 2nd row appears for the same subject - We first select the terminated row (In this case the 3/1/2022 in Run 2) and then for each subsequent runs, just select the 2nd row with a start date of 6/1/2022 for Run 3, Run 4...
The challenge I am facing is how to make sure I pick the first row in Run 2 but in all subsequent runs - just pick the 2nd row for the same subject.
Any ideas or help would be greatly appreciated. Sharing a livesql link, please check:
Oracle Live SQL - Script: Need help with writing a query! (0 Bytes)*Note - The 2nd row that appears can also have an end date
=====================================================================
Edit1:
I have updated the screenshot above + livesql link in the hopes of making the question clearer. Please note the highlighted rows member data is exactly same. And the row highlighted in green denotes the correct row we need per run_id and name combination.
Some background on the process: The target table will be initially empty but once we go live and the process starts running, it will contain member (Eg Abc from above) records with their division start and end dates (Also shown in the above screenshot).
Lets say we have a function that gets us the sequence_id based on run_id and name inputs:
fn_get_sequence_id(run_id, name) should get us the following results based on the data above
Screen Shot 2022-06-18 at 1.23.58 PM.png Please advise if additional information/clarification is needed.
=====================================================================
Thanks!

This post has been answered by Frank Kulash on Jun 21 2022

Jump to Answer

mathguy

Suppose there is a Run 4. Will it also have exactly two rows? If so, you should say so in your problem description. If not - could it have just one row? If so, how do you need to handle that? Can it have three or more rows? If so, how do you need to handle that?
Then - is the run number (what you show in the first column) part of what is inserted in your table? If so, why doesn't the run number appear on EVERY row, and instead it is shown only on the first row (of that run)? If that is really what is inserted, then how do you know which row is from which run - except for the rows that do show the run number?
You talk about rows for the same "subject". What does "subject" mean? Name? Name and division? Something else?

User_K6LH0

If there is a Run 4 with 3 rows - Division 2 line will be termed which needs to be selected and then going forward the newly added Division 3 line will have to be selected - Hope that makes sense.
Run number is part of every row, it's not in the image but is part of the livesql link shared.
Subject = Name + Division combination.
So every time a new Name + Division combination shows up, we need to select the old Name + Division and then going forward select the latest Name + Division combination

Frank Kulash

Hi @user-k6lh0
It's still unclear what you want. Show the complete, exact results you want for each run, and explain why you want those results and no others. For example, "When only the rows with run_id <= 3 are in the table, then the results should be ... because .... After the rows with run_id = 4 are inserted, then I want ... because ...".
Not everyone who wants to help you knows your application as well as you do. Make sure you define all the application-specific terms you use.
We first select the terminated row
What is a "terminated row"? Is it a row where end_date is not NULL? What if there are two (or more) terminated rows?
pick the first row
What is the "first" row? Is it the row with the lowest run_id? The row with the earliest start_date? Something else? What if there is a tie (that is, two or more rows have an equal claim to being first)?

User_K6LH0

Hi @frank-kulash - Thanks for your response. I have updated the main post with some additional information/clarification under "Edit 1" heading. Have also update the screenshots + livesql script for reference. Please let me know if this helps a bit. Thank you!

Frank Kulash

Hi, User_K6LH0
I have updated the main post
Please don't ever change your messages after you post them: it makes the thread hard to read and your changes easy to miss. Post all additions and corrections in a new message at the end of the thread. Post CREATE TABLE and INSERT statements for the sample data. Does every row in your actual table have the same value for name? If not, then not every row of the sample data should have the same value for name, either. Always post the exact results you want from that sample data, that is, the exact output you want from the query. For example: "When the table contains only rows for run_id = 1, then the output should be:

SEQUENCE_ID RUN_ID NAME DIVISION START_DATE END_DATE
----------- ------ ---- -------- ---------- --------
...

After the rows with run_id = 2 are added, then the results should be:

SEQUENCE_ID RUN_ID NAME DIVISION START_DATE END_DATE
----------- ------ ---- -------- ---------- --------
...

Then, when the rows with run_id = 3 are added ..."

We first select the terminated row
Once again, what does "termintated" mean? How can we tell if a row is terminated or not? What happens if two pf more rows are terminated?
for each subsequent runs, just select the 2nd row
Which row is the 2nd row? Is it the row with the 2nd lowest value of division? The 2nd highest value of division? The 2nd lowest value of start_date? The 2nd highest value of start_date? The 2nd lowest value of sequence_id? The 2nd ... What happens when there is a tie?

User_K6LH0

Hi @frank-kulash, the CREATE TABLE/DDL statements have been shared in the livesql link, please see below:
Oracle Live SQL - Script: Need help with writing a query!https://livesql.oracle.com/apex/livesql/s/nnenumjm74gpqj3kbrgt0az56
Here is the sample data in a picture format for member "Abc":
Screen Shot 2022-06-18 at 1.22.53 PM.png Please note the highlighted rows member data is exactly same (Run_Id 2 and 3 data matches and Run_Id 4 and 5 data matches). And the row highlighted in green denotes the correct row we need per run_id and name combination.
Some background on the process: The target table will be initially empty but once we go live and the process starts running, it will contain member (Eg Abc from above) records with their division start and end dates (Also shown in the above screenshot).
Lets say we have a function that gets us the sequence_id based on run_id and name inputs:
fn_get_sequence_id(run_id, name) should get us the following results based on the data above
Screen Shot 2022-06-18 at 1.23.58 PM.png Please advise if additional information/clarification is needed.

Frank Kulash

Hi, @user-k6lh0
Sorry, I don't have a clear idea of your requiremnets in general terms. Specific examples are helpful, but general rules are absolutely essential. I still don't understand what "terminated" or "second" mean, and I don't have any sample data that I can easily cut and paste. What does "communicated" mean? How do you know if a row has been communicated or not? What happens if all rows have been communicated?

User_K6LH0

Hi [Frank Kulash](/ords/forums/user/Frank Kulash), please see below for DDL/DML statements and some additional clarifications:
Background: The test table holds member (name) rows with their respective start and end dates in a given division. Every time the base process runs, it will look into the core application tables (not shows here) and if any member + divisions have had any new inserts or updates to their respective start and end dates - Those rows will get inserted into this table. This is a staging table which holds data before it's sent over/"communicated" to the target application. I am in the process of generating a file feed out of this test table which will be sent over to the target system after each run_id. Only selected data from the latest run_id will be on the file. Process execution happens in the following manner:
OLTP System makes updates -> Base Process Runs and inserts data into the below "test" staging table with all inserts/updates -> The process I am designing right now will select data from the staging table for the latest run_id and create an outbound file with the member name, division and start and end dates - That file is sent over/communicated to the target system.
The term "terminated" - It means a row in the test table has a "end_date" populated. If the "end_date" is NULL, that means the row is not terminated. If it has a date, that means the row has been terminated. Depending on whether SYSDATE falls between the row's start and end date, that row could still be active but if the "end_date" is populated, we consider that row as being termed.
The term "second" - In the context above, I was referring to run_id = 3 where the second row (sequence_id = 5) is the correct row which needs to be selected.
Goal: I would like to have a function built fn_get_sequence_id(run_id, name) which would return me the correct sequence_id based on the grid shown above.

CREATE TABLE test  
(  
sequence_id     NUMBER GENERATED ALWAYS as IDENTITY(START with 1 INCREMENT by 1) 
, run_id        NUMBER  
, name          VARCHAR2(50)  
, division      NUMBER  
, start_date    DATE  
, end_date      DATE  
);

--Step 0: Table is empty

--Step 1: Base process runs to populate table with run_id = 1
INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (1, 'Abc', 1, TO_DATE('01/01/2020','MM/DD/YYYY'), NULL);

--Correct Record: During run_id = 1, only 1 row for Abc is present and
--that's what gets returned

--Step 2: Base process runs to populate table with run_id = 2
INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (2, 'Abc', 1, TO_DATE('01/01/2020','MM/DD/YYYY'), TO_DATE('03/01/2022','MM/DD/YYYY'));

INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (2, 'Abc', 2, TO_DATE('06/01/2022','MM/DD/YYYY'), NULL);

--Correct Record: During run_id = 2, two rows for Abc are present - The row from run_id = 1 
--now has a new end_date of 3/1/2022 so this new end_date update needs to be sent over in
--the file before the newer row with a start_date of 6/1/22 and division 2 can be sent over

--Step 3: Base process runs to populate table with run_id = 3
INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (3, 'Abc', 1, TO_DATE('01/01/2020','MM/DD/YYYY'), TO_DATE('03/01/2022','MM/DD/YYYY'));

INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (3, 'Abc', 2, TO_DATE('06/01/2022','MM/DD/YYYY'), NULL);

--Correct Record: During run_id = 3, the row having start and end_date of 1/1/2020 - 3/1/2022 has already been sent over in the last file run where run_id = 2. So the new row with 6/1/2022 needs to be selected for run_id = 3.

--Step 4: Base process runs to populate table with run_id = 4
INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (4, 'Abc', 1, TO_DATE('01/01/2020','MM/DD/YYYY'), TO_DATE('03/01/2022','MM/DD/YYYY'));

INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (4, 'Abc', 2, TO_DATE('06/01/2022','MM/DD/YYYY'), TO_DATE('07/01/2022','MM/DD/YYYY'));

INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (4, 'Abc', 3, TO_DATE('07/02/2022','MM/DD/YYYY'), NULL);

--Correct Record: During run_id = 4, the row from run_id = 3 now has a new end_date of 7/1/2022 which needs to be sent over and thus sequence_id = 7 is selected

--Step 5: Base process runs to populate table with run_id = 5
INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (5, 'Abc', 1, TO_DATE('01/01/2020','MM/DD/YYYY'), TO_DATE('03/01/2022','MM/DD/YYYY'));

INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (5, 'Abc', 2, TO_DATE('06/01/2022','MM/DD/YYYY'), TO_DATE('07/01/2022','MM/DD/YYYY'));

INSERT INTO test (run_id, name, division, start_date, end_date) VALUES (5, 'Abc', 3, TO_DATE('07/02/2022','MM/DD/YYYY'), NULL);

--Correct Record: During run_id = 5, the row having the start and end_date of 6/1/2022 to 7/1/2022 has already been sent over during the last run. So the latest row having 7/2/2022 as the start_date needs to be selected for this run_id

Please let me know if you have any questions.

Frank Kulash

Hi, User_K6LH0
I still haven't seen a description, in general terms, of what you want to do.
Goal: I would like to have a function built fn_get_sequence_id(run_id, name) which would return me the correct sequence_id based on the grid shown above.
The grid shown above contains specific examples that illustrate some general rule. What is that general rule? For example: "Given a run_id and name, I want to find the sequence_id of row with the latest ternminated run with that run_id and name that has not already been communicated, where :latest" means in order by start_date. We don't need to worry about ties, because the combination (run_id, name, start_date) is unique. If there is no uncommunicated terminated row with that run_id and name, then I want the sequence_id of the latest unterminated row. If all rows with the given run_id and name have been communicated, then return NULL."
Earlier, I asked how we can tell if a row has been communicated or not. I think you're saying that there is nothing in the table that says if a row has been communicated or not. If that's the case, then you need to add that information to the table, or perhaps another table. If you have a column called communicated_date in the test table which is not NULL if (and only if) the row has been communicated, then you can do something like this:

CREATE OR REPLACE FUNCTION fn_get_sequence_id
( run_id_wanted IN test.run_id%TYPE
, name_wanted  IN test.name%TYPE
)
RETURN test.sequence_id%TYPE
IS
  return_val  test.sequence_id%TYPE;
BEGIN
  SELECT MIN (sequence_id)
  	    KEEP ( DENSE_RANK FIRST ORDER BY NVL2 (end_date, 1, 2)
		   	            ,        start_date DESC
		 )
  INTO   return_val
  FROM   test
  WHERE  run_id            = run_id_wanted
  AND	 name              = name_wanted
  AND	 communicated_date IS NULL;

  RETURN return_val;
END fn_get_sequence_id;
/

To test the function, you can run this query

SELECT   run_id, name
,	 fn_get_sequence_id (run_id, name) AS f
FROM	 test
GROUP BY run_id, name
ORDER BY run_id, name
;

User_K6LH0

Hi @frank-kulash - Thanks for your response. I do see now what you mean by a general problem statement so here goes:
Given a run_id and name I want to find the sequence_id of row with the following rules:

If there is only 1 row available/valid in the table for a combination of run_id and name - Return that row and STOP
If there are more than 1 row available for a combination of run_id + name then do the following:
i) Pick the row which was sent in the previous file but now has a new/updated end_date populated and STOP
ii) If the current row's division, start and end date exactly matches with the previous row sent over in last file then disregard the current row (This means that row has already been sent over in the last file and can be marked as "invalid" so as not to consider the same row for sending) and move on to the next available row and go to Step 1
iii) If there are multiple rows for the same combination of name + run_id and none of the rows have been previously sent over in the past files then pick the one row with the MAX start_date and also MAX end_date: ORDER BY start_date DESC, end_date DESC and get the first row
And yes - So far, I haven't added a "communicated_date" column in the table but will probably have to add one. And we need to worry about comparisons with what was sent over in the past file for the requirements 2.ii and 2.iii from above.
Please let me know if this makes sense. Thank you!

Frank Kulash

Hi, @user-k6lh0
Sorry, I don't understand.
i) Pick the row which was sent in the previous file but now has a new/updated end_date populated and STOP
When you say "previous file", do you mean "previous run_id"? Each row has exactly one run_id. A row can't have both run_id = N and run_id = (N - 1).

User_K6LH0

Yes, when I say previous file, I mean previous run_id. For example:
Screen Shot 2022-06-19 at 7.23.38 PM.png Consider rows with sequence_id = 1 and 2, the data in these 2 rows are exactly the same except the "end_date" is updated to 3/1/2022 in sequence_id = 2 as opposed to NULL in sequence_id = 1.
So, if I were to call the function and pass in (run_id = 2, name = 'Abc') - I would expect to get back sequence_id = 2 because that same row was sent in the previous file/run_id = 1 and now has an updated end_date which has not been sent over yet.
Hence this requirement:

i) Pick the row which was sent in the previous file but now has a new/updated end_date populated and STOP

Thank you very much for your help [Frank Kulash](/ords/forums/user/Frank Kulash) - please let me know if you have any other questions, appreciate your help with this issue.

Frank Kulash

Hi, User_K6LH0
Absent a communicated_date column (or equivalent) you can't implement any rules that deponed on whether a row was communicated or not. The best you can do might be something like this:

CREATE OR REPLACE FUNCTION fn_get_sequence_id
( run_id_wanted IN test.run_id%TYPE
, name_wanted  IN test.name%TYPE
)
RETURN test.sequence_id%TYPE
IS
  return_val  test.sequence_id%TYPE;
BEGIN
  SELECT MIN (sequence_id)
  	    KEEP ( DENSE_RANK FIRST ORDER BY CASE
		    		  	    	   WHEN end_date IS NOT NULL
						   AND  EXISTS (
								    SELECT 0
E								    FROM   test e -- e is for EXISTS sub-query
								    WHERE  e.run_id    = m.run_id - 1
								    AND    e.name      = m.name
								    AND    e.division  = m.division
								    AND    e.start_date = m.start_date
								    AND    e.end_date  IS NULL
						   	       )
						   THEN 1
						   ELSE 2
		    		  	    	 END
		   		 	 ,       start_date DESC
					 ,	 end_date   DESC
		   )
  INTO    return_val
  FROM    test m -- m is for Main query
  WHERE   run_id            = run_id_wanted
  AND	  name              = name_wanted;

  RETURN return_val;
END fn_get_sequence_id;

This picks the "best" row, according to these criteria (in order, most important first)
The best rows are ones with a value for end_date when the previous run_id had a row with the same name, division and start_date, but no end_date.
After that, rows are ordered by start_date (descending order)
In case of a tie, rows are ordered by end_date (descending order)
If a tie persists, the lowest sequence_id of the contenders is returned
When there is only one row with the given run_id and name, then that row will be picked regardless of the sorting criteria.

User_K6LH0

Thank you very much @frank-kulash - I am testing the query you have created and trying to understand how KEEP works.
If there was a column: "communicated_date" in the table to denote the rows which have been sent on the file previously, would you do anything different in the query you wrote above?
Thanks again!

Frank Kulash

Answer

Hi, User_K6LH0
If there was a column: "communicated_date" in the table to denote the rows which have been sent on the file previously, would you do anything different in the query you wrote above?
Most of the function would be the same, but the first sorting criterion would change. Without a communicated_date column, a CASE expression and EXISTS sub-query assigned all rows to one of two groups:
(Best match) A row has an end_date and there was a similar row (i.e. same name, division and start_date with the previous run_id) but a NULL end_date.
(Worst match) All others.
With a communicated_date column, then (if I usderstand your requirements) rows can be assigned to four groups:
(Best match) There is a similar row that was communicated, but it had a NULL end_date that has now been supplied
Same as 1., except that no such row was communicated
There were no similar rows
(Worst match) There was a similar row that was communicated with the same end_date
Here's one way to implement those rules:

CREATE OR REPLACE FUNCTION fn_get_sequence_id
( run_id_wanted IN test.run_id%TYPE
, name_wanted  IN test.name%TYPE
)
RETURN test.sequence_id%TYPE
IS
  return_val  test.sequence_id%TYPE;
BEGIN
  SELECT MIN (sequence_id)
  	    KEEP ( DENSE_RANK FIRST ORDER BY (
                                                    -- Begin chnaged section
		    		  	    	   SELECT NVL ( MAX ( CASE
									   WHEN  s.communicated_date IS NOT NULL
									   AND   s.end_date          IS NULL
									   AND   m.end_date	     IS NOT NULL
									     THEN 1
									   WHEN  s.end_date          IS NULL
									   AND   m.end_date	     IS NOT NULL
									     THEN 2
								  	   WHEN  s.communicated_date IS NOT NULL
									     THEN 4
									 END
									)

								 , 3
								 )
						   FROM  test s -- s is for sub-query
						   WHERE  s.run_id    = m.run_id - 1
						   AND   s.name       = m.name
						   AND   s.division   = m.division
						   AND   s.start_date = m.start_date
		    		  	    	 )
                                                 -- End changed section
		   		 	 ,       start_date DESC
					 ,	 end_date  DESC
		   )
  INTO    return_val
  FROM    test m -- M is for Main query
  WHERE   run_id       = run_id_wanted
  AND	  name        = name_wanted;

  RETURN return_val;
END fn_get_sequence_id;

Comments indicated the part that was changed from the previous solution.

Marked as Answer by User_K6LH0 · Jun 22 2022

User_K6LH0

Thank you very much @frank-kulash - appreciate all your help with this issue.

1 - 16

Added on Jun 18 2022

#oracle, #plsql, #sql

16 comments

424 views

2 mentions

SQL & PL/SQL

Need help with writing a query!

Comments

Post Details