This discussion is archived
2 Replies Latest reply: Oct 26, 2011 7:34 PM by 860993 RSS

Analytic Question with lag and lead

739106 Newbie
Currently Being Moderated
Hello,
I'm working on tracking a package and the number of times it was recorded in an office. I want to see the start and end dates along with number of occurrences (or records) during the start/end dates. I'm pretty confident I can get the start end date correct but it is the number of occurences that is the issue.

Essentially, I want to build a time line start and end_dates and the number of times the package was recorded in the office.
I am fumbling around with using the lag and lead analytic to build start/end dates along with the count of occurrences during that period.
I've been using analytics lag and lead feature and can pretty much get the start and end dates setup but having difficulty determining count ---count(*) within the analytic. (I think I can do it outside of the analytic with a self join but performance will suffer). I have millions of records in this table.

I've been playing with the windowing using RANGE and INTERVAL days but to no avail. When I try this and count(*) (over partition by package_ID, location_office_id order by event_date range ......) I can calculate the interval correctly by subtracting the lead date - current date, however,
the count is off because when I partition the values by package_id, location_office_id I get the third group of package 12 partitioned with the first group of package 12 (or in same window) because they are at the same office. However, I want to treat these separately because the package has gone to a different office in be-tween.

I've attached the DDL/DML to create my test case. Any help would be appreciated.

--Current
package_id, location_office_ID. event_date
12 1 20010101
12 1 20010102
12 1 20010103
13 5 20010102
13 5 20010104
13 5 20010105
13 6 20010106
13 6 20010111
12 2 20010108
12 2 20010110
12 1 20010111
12 1 20010112
12 1 20010113
12 1 20010114


--Needs to look like
package_id location_office_id start_date end_date count
12     1      20010101 20010103 3
12 2      20010108 20010110 2
12 1      20010111 20010114 4
13 5 20010102 20010105 3
13 6 20010106 20010111 2




create table test (package_id number, location_office_id number,event_date date);

insert into test values (12,1,to_date('20010101','YYYYMMDD'));
insert into test values (12,1,to_date('20010102','YYYYMMDD'));
insert into test values (12,1,to_date('20010103','YYYYMMDD'));
insert into test values (13,5,to_date('20010102','YYYYMMDD'));
insert into test values (13,5,to_date('20010104','YYYYMMDD'));
insert into test values (13,5,to_date('20010105','YYYYMMDD'));
insert into test values (13,6,to_date('20010106','YYYYMMDD'));
insert into test values (13,6,to_date('20010111','YYYYMMDD'));
insert into test values (12,2,to_date('20010108','YYYYMMDD'));
insert into test values (12,2,to_date('20010110','YYYYMMDD'));
insert into test values (12,1,to_date('20010111','YYYYMMDD'));
insert into test values (12,1,to_date('20010112','YYYYMMDD'));
insert into test values (12,1,to_date('20010113','YYYYMMDD'));
insert into test values (12,1,to_date('20010114','YYYYMMDD'));
commit;

--I'm trying something like

select package_id, location_office_id, event_date,
lead(event_date) over (partition by package_id, location_office_id order by event_date) lead_event,
count(*) over (partition by package_id, location_office_id order by event_date) rcount -- When I do this it merges the window together for package 12 and location 1 so I get the total, However, I want to keep them separate because the package moved to another office in between).

Appreciate your help,
  • 1. Fixed Difference
    Frank Kulash Guru
    Currently Being Moderated
    Hi,

    Thanks for posting the CREATE TABLE and INSERT statements; that's very helpful!

    You can do what you want with LEAD and/or LAG, but here's a more elegant way, using the analytic ROW_NUMBER function:
    WITH     got_grp_num     AS
    (
         SELECT     package_id, location_office_id, event_date
         ,     ROW_NUMBER () OVER ( PARTITION BY  package_id
                             ORDER BY        event_date
                           )
               -     ROW_NUMBER () OVER ( PARTITION BY  package_id
                             ,             location_office_id
                             ORDER BY        event_date
                           )     AS grp_num
         FROM     test
    --     WHERE     ...     -- If you need any filtering, put it here
    )
    SELECT       package_id
    ,       location_office_id
    ,       MIN (event_date)     AS start_date
    ,       MAX (event_date)     AS end_date
    ,       COUNT (*)          AS cnt
    FROM       got_grp_num
    GROUP BY  package_id
    ,       location_office_id
    ,       grp_num
    ORDER BY  package_id
    ,       start_date
    ;
    This approach treats the problem as a GROUP BY problem. Getting the start_date, end_date and cnt are all trivial using aggregate functions. The tricky part is what to GROUP BY. We can't just GROUP BY package_id and location_office_id, because, when a package (like package_id=1) leaves an office, goes to another office, then comes back, the two periods spent in the same office have to be treated as separate groups. We need something else to GROUP BY. The query above uses the Fixed Difference method to provide that something else. To see how this works, let's run the sub-query (slightly modified) by itself:
    WITH     got_grp_num     AS
    (
         SELECT     package_id, location_office_id, event_date
         ,     ROW_NUMBER () OVER ( PARTITION BY  package_id
                             ORDER BY        event_date
                           )     AS p_num
         ,          ROW_NUMBER () OVER ( PARTITION BY  package_id
                             ,             location_office_id
                             ORDER BY        event_date
                           )     AS p_l_num
         FROM     test
    )
    SELECT       g.*
    ,       p_num - p_l_num     AS grp_num
    FROM       got_grp_num     g
    ORDER BY  package_id
    ,       event_date
    ;
    Output:
    `       LOCATION
    PACKAGE  _OFFICE
        _ID      _ID EVENT_DATE P_NUM P_L_NUM GRP_NUM
    ------- -------- ---------- ----- ------- -------
         12        1 2001-01-01     1       1       0
         12        1 2001-01-02     2       2       0
         12        1 2001-01-03     3       3       0
         12        2 2001-01-08     4       1       3
         12        2 2001-01-10     5       2       3
         12        1 2001-01-11     6       4       2
         12        1 2001-01-12     7       5       2
         12        1 2001-01-13     8       6       2
         12        1 2001-01-14     9       7       2
         13        5 2001-01-02     1       1       0
         13        5 2001-01-04     2       2       0
         13        5 2001-01-05     3       3       0
         13        6 2001-01-06     4       1       3
         13        6 2001-01-11     5       2       3
    As you can see, p_num numbers the rows for each package with consecutive integers. P_l_num likewise numbers the rows with consecutive integers, but instead of having a separate series of numbers for each package, it has a separate series for each package and location . As long as a package remains at the same location, both numbers increase by 1, and therefore the difference between those two numbers stays fixed. (This assumes that the combination (package_id, event_date is unique.) But whenever a pacakge changes from one location to another, and then comes back, p_num will have increased, but p_l_num will resume where it left off, and so the difference will not be the same as it was previously. The amount of the difference doesn't mean anything by itself; it's just a number (more or less arbitrary) that, together wth package_id and location_office_id, uniquely identifies the groups.

    Edited by: Frank Kulash on Oct 26, 2011 8:49 PM
    Added explanation.
  • 2. Re: Analytic Question with lag and lead
    860993 Explorer
    Currently Being Moderated
    Tabibitosan method may help you.
    Tabibitosan method tutorial by Aketi Jyuuzou

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points