Is there a more efficient way to do this conditional cross-join?
rwsBertman Mar 4, 2015 11:36 PMAbstracting the situation: we are monitoring the condition of a variety of units, looking at each unit for a recorded period of time. At any single point in time, there may be nearly 8000 units currently being monitored.
Every unit has an ID (required), a start_date (required), and end_date (may be null, i.e. still being monitored), and the date of the most recent info collected for the unit (latest_data; equal to some date later than start_date).
Every unit gets a periodic check (review), which classifies it into a group, identified by group_ID.
If the unit has not been checked yet, then review_date and group_ID are both null.
What I need to do (for the data warehouse) is create a record of the details for each unit for every day of its life in the system, including its most recent review date and current group_ID.
If the unit is still being monitored, i.e. does not have an end_date, then the daily records should continue up to the date of the most recent info collected for the unit, or sysdate if no data was received yet.
select CR.ID, CR.start_date, CR.end_date, cal.day, cal.month, max(CR.rd) as review_date, --get the most recent review date max(CR.group_ID) keep(dense_rank last order by rd nulls first) as group_ID --get the most recent group_ID from ( select C.ID, C.start_date, C.end_date, C.latest_data, R.rd, R.group_ID from table_1 C left outer join table_2 R on R.ID = C.ID ) CR cross join ( --create a calendar of days from 1-Jan-2012 to sysdate select day, trunc(day,'month') as month from ( select rownum-1 + to_date('1-Jan-2012','dd-Mon-yyyy') as day from (select 1 from dual) connect by level <= trunc(sysdate) - to_date('1-Jan-2012','dd-Mon-yyyy') ) ) cal where cal.day >= CR.start_date and cal.day <= least(nvl(CR.end_date, CR.latest_data), CR.latest_data, sysdate) --if the review date hasn't happened yet, then show null and (CR.rd <= cal.day or CR.rd is null) group_ID by CR.ID, CR.start_date, CR.end_date, cal.day, cal.month ;
Table_1 (C) has about 17k records, and is fully indexed (=all units ever monitored; about +500 turnaround per month)
Table_2 (R) has about 330k records, and is indexed on ID and group_ID (=all checks on all units ever done; about 4000+ new per month)
When table_1 and table_2 are left-outer-joined, the result set is about 85k records.
The entire data set and query are stored and running on Oracle 10.2.0.3.
I have tried a number of different configurations for this query, and this one seems to be the most efficient, but it still runs rather slowly -- about 10 minutes per month of data / 2 hours per (fiscal) year of data.
I am currently perusing various Oracle docs and other resources on query tuning, but it's a lot to wade through, and I'm not finding much that directly relates to this kind of situation.
Any suggestions, preferably with explanations and/or references would be very appreciated.
Thanks!
Robert
EDIT: Please see reply 18 and 19 for corrections, and reply 23 for sample data and output.