This discussion is archived
1 2 Previous Next 18 Replies Latest reply: Nov 1, 2011 8:18 AM by TinyPenguin RSS

Ranking query

TinyPenguin Newbie
Currently Being Moderated
Hi All,

Could anyone help me out please?

I need to rank/order a set of data by date and period type (blocktype) for groups of worker records. Where the blocktype is 'Ind(ividual)', I need to increment the rank by 1. Where the type is 'Cont(inuous)', I don't want to increment the rank until the next 'Ind' record. This is so I can then get the min/max start/end date for that rank number.

I'm not sure if that's clear, but it's really obvious what I need to do with the attached SQL below (please see 'Desired' column). I'm guessing rank or dense_rank is the way forward, but I'm not sure what combination of columns to partition/order by?

If anyone could help it would make me smile. :)

Thank you.



create table mytable
(pid varchar(10),
startdate date,
enddate date,
blocktype varchar(4),
desired varchar (1));

insert into mytable (pid, startdate,enddate,blocktype, desired) values ('A123', to_date('23/10/2007','DD/MM/RRRR'),to_date('05/11/2007','DD/MM/RRRR'), 'ind',1);
insert into mytable (pid, startdate,enddate,blocktype, desired) values ('A123', to_date('16/11/2007','DD/MM/RRRR'),to_date('19/11/2007','DD/MM/RRRR'), 'ind',2);
insert into mytable (pid, startdate,enddate,blocktype, desired) values ('A123', to_date('26/11/2007','DD/MM/RRRR'),to_date('25/04/2008','DD/MM/RRRR'), 'cont',3);
insert into mytable (pid, startdate,enddate,blocktype, desired) values ('A123', to_date('25/04/2008','DD/MM/RRRR'),to_date('28/08/2008','DD/MM/RRRR'), 'cont',3);
insert into mytable (pid, startdate,enddate,blocktype, desired) values ('A123', to_date('01/04/2009','DD/MM/RRRR'),to_date('09/09/2009','DD/MM/RRRR'), 'ind',4);
insert into mytable (pid, startdate,enddate,blocktype, desired) values ('A123', to_date('01/05/2010','DD/MM/RRRR'),to_date('03/05/2010','DD/MM/RRRR'), 'cont',5);
insert into mytable (pid, startdate,enddate,blocktype, desired) values ('A123', to_date('03/05/2010','DD/MM/RRRR'),to_date('19/11/2010','DD/MM/RRRR'), 'cont',5);
insert into mytable (pid, startdate,enddate,blocktype, desired) values ('A123', to_date('19/11/2010','DD/MM/RRRR'),to_date('06/02/2011','DD/MM/RRRR'), 'cont',5);
insert into mytable (pid, startdate,enddate,blocktype, desired) values ('A123', to_date('29/04/2011','DD/MM/RRRR'),to_date('14/11/2011','DD/MM/RRRR'), 'ind',6);

Edited by: user9363122 on 28-Oct-2011 16:35
  • 1. Re: Ranking query
    Frank Kulash Guru
    Currently Being Moderated
    Hi,

    So you want to keep a count, but only count rows where either:
    (a) blocktype = 'ind' or
    (b) blocktype on the previous row for the same pid was 'ind'. (I assume "Previous" means in order by startdate.)

    Here's one way:
    WITH     got_prev_blocktype     AS
    (
         SELECT  m.*
         ,     LAG (blocktype) OVER ( PARTITION BY  pid
                               ORDER BY          startdate
                             )     AS prev_blocktype
         FROM    mytable     m
    --     WHERE     ...     -- If you need any filtering, this it where it goes
    )
    SELECT       p.*          -- Or list all columns except prev_blocktype
    ,       COUNT ( CASE
                    WHEN  'ind' IN (blocktype, prev_blocktype)
                    THEN  1
                   END
                 ) OVER ( PARTITION BY  pid
                       ORDER BY     startdate
                     )     AS cnt
    FROM       got_prev_blocktype     p
    ORDER BY  pid
    ,       startdate
    ;
    You can't nest analytic functions, so we need a sub-query to compute the analytic LAG value, which we then use in the analytic COUNT function.

    Thanks for posting the CREATE TABLE and INSERT statements! That's very helpful.
  • 2. Re: Ranking query
    TinyPenguin Newbie
    Currently Being Moderated
    Hi Frank,

    No, it's me that needs to thank you! I've just checked it out and it looks like it's working well. I didn't think about using lag a second time. I actually already used the lead function to create the existing cont/ind flag to see if the next start date is equal to the current row's end date.

    If you don't mind me asking, how does the count() over (partition by) part of the syntax work? I don't quite get how it increments the value over the group (worker pid). I understand it only counts if 'ind' is equal to the row's value in both the block type and prev_blocktype columns but not sure how it keeps the cumulative total from one row to the next?

    Thanks again Frank.
  • 3. Re: Ranking query
    Frank Kulash Guru
    Currently Being Moderated
    Hi,
    user9363122 wrote:
    ... If you don't mind me asking, how does the count() over (partition by) part of the syntax work? I don't quite get how it increments the value over the group (worker pid). I understand it only counts if 'ind' is equal to the row's value in both the block type and prev_blocktype columns
    Actually, it counts if 'ind' is equal to either blocktype or prev_blocktype. 'ind' doesn't have to be equal to both.
    but not sure how it keeps the cumulative total from one row to the next?
    Remember (or look up) how the analytic ORDER BY clause works.
    http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/functions001.htm#i97469

    COUNT (x) OVER (PARTITION BY pid) tells how many rows with this pid have a non-NULL x.
    COUNT (x) OVER (PARTITION BY pid ORDER BY startdate ) tells how many rows with this pid and startdate less than or equal to this row's stratdate have a non-NULL x.
    Without an ORDER BY clause, the function operates over the whole partition.
    With an ORDER BY clause, the function operates over only part of the partition. That's how it gets the cumulative COUNT.
  • 4. Re: Ranking query
    TinyPenguin Newbie
    Currently Being Moderated
    Hi Frank,

    Sorry yes you're right, IN() is indeed an Or operator. Interesting use of In() - I've never thought to use columns in In().

    Thanks for the explanation - very concise and clear.

    You've cheered my weekend right up! :)
  • 5. Re: Ranking query
    Frank Kulash Guru
    Currently Being Moderated
    Hi,
    user9363122 wrote:
    ... I actually already used the lead function to create the existing cont/ind flag to see if the next start date is equal to the current row's end date.
    If that's what 'ind' and 'cont' mean, then be sure to test the situation where you have consective multi-row blocks. For example, if you add this row to your sample data:
    insert into mytable (pid,    startdate,                          enddate) 
         values      ('A123', to_date('09/09/2009','DD/MM/RRRR'), to_date('01/05/2010','DD/MM/RRRR');
    Now we have a block of 2 rows (with startdates 26/11/2007 and 25/04/2008) followed immediately by a block of 5 rows (with startdates between 01/04/2009 and 19/11/2010). Does that mean all 7 rows would have blocktype='cont', and therefore they would all have the same desired value, even though there's a gap (from 28/08/2008 to 01/04/2009) in the middle of that section?

    You probably don't need blocktype to compute desired. You can probably compute desited from startdate and enddate, using only 1 sub-query.
  • 6. Re: Ranking query
    TinyPenguin Newbie
    Currently Being Moderated
    Hi Frank,

    I'm actually really glad you picked up on that, good call. I was just reviewing my input data and yes, there's a flaw to my logic.

    If you have time, can you have a look at the create/insert statement below, please?

    Again, the desired column is what I'm trying to arrive at. This is so I can get the min/max start/end dates of the consecutive blocks where the end date of the current row = the start date of the next row.

    For information, I'm using SQL Developer.

    Thanks again Frank.


    Create table mytable2
    (PID varchar(10),
    Startdate date,
    enddate date,
    desired varchar(2));

    insert into mytable2 (PID, StartDate, Enddate, desired) values ('A657', to_date('11/12/2008','DD/MM/RRRR'), to_date('24/12/2008','DD/MM/RRRR'), 1);
    insert into mytable2 (PID, StartDate, Enddate, desired) values ('A657', to_date('24/12/2008','DD/MM/RRRR'), to_date('22/05/2009','DD/MM/RRRR'), 1);
    insert into mytable2 (PID, StartDate, Enddate, desired) values ('A657', to_date('22/05/2009','DD/MM/RRRR'), to_date('25/11/2009','DD/MM/RRRR'), 1);
    insert into mytable2 (PID, StartDate, Enddate, desired) values ('A657', to_date('04/05/2010','DD/MM/RRRR'), to_date('17/06/2010','DD/MM/RRRR'), 2);
    insert into mytable2 (PID, StartDate, Enddate, desired) values ('A657', to_date('17/06/2010','DD/MM/RRRR'), to_date('19/07/2010','DD/MM/RRRR'), 2);
    insert into mytable2 (PID, StartDate, Enddate, desired) values ('A657', to_date('29/09/2010','DD/MM/RRRR'), to_date('23/11/2010','DD/MM/RRRR'), 3);
    insert into mytable2 (PID, StartDate, Enddate, desired) values ('A657', to_date('14/01/2011','DD/MM/RRRR'), to_date('20/04/2011','DD/MM/RRRR'), 4);
    insert into mytable2 (PID, StartDate, Enddate, desired) values ('A657', to_date('20/04/2011','DD/MM/RRRR'), to_date('21/09/2011','DD/MM/RRRR'), 4);

    Edited by: user9363122 on 29-Oct-2011 06:11
  • 7. Re: Ranking query
    Frank Kulash Guru
    Currently Being Moderated
    Hi,

    Here's one way to do that:
    WITH     got_grp_date     AS
    (
         SELECT     mytable2.*
         ,     enddate - SUM (enddate - startdate) OVER ( PARTITION BY  pid
                                          ORDER BY      enddate
                                        )     AS grp_date
         FROM     mytable2
    )
    SELECT       got_grp_date.*     -- After development, list all columns except grp_date
    ,       DENSE_RANK () OVER ( PARTITION BY pid
                          ORDER BY     grp_date
                        )     AS d
    FROM       got_grp_date
    ORDER BY  pid
    ,       startdate
    ;
    Whenever you have a problem, it really helps if you explain what you're doing. Keep posting the results you want, like you've always done, but also describe what those results are. Things that are "really obvious" to you might not be quite so obvious to people who aren't familiar with your data or your business requirements.
    For example, a description of this problem (as I understand it) might be: "The desired column is the number of continous blocks of time. When enddate of of row equals startdate of the next row (for the same pid), then those two rows are part of the same block, and desired will be the same. When endate of one row is less than startdate of the next row (for the same pid), then they are in different blocks, and desired for the earlier row will be 1 less than desire of the later row."
    A more mathematical, and equally good, way of describing the same results would be: "For each pid, the row with the earliest startdate has desired=1. On other rows, let N be desired from the previous row (in order by startdate) for the same pid. If startdate on a given row is equal to enddate of the previous row, then desired on the given row is N. If startdate on a given row is later than enddate on the previous row, then desired on the given row is N+1."
    I'm assuming that the combination (pid, startdate) is unique, and that, within a pid, rows will never overlap. That is, if startdate on row A is less than startdate on row B, then enddate on row A must be less than or equal to startdate on row B.
    Results from the query above:
    PID   STARTDATE  ENDDATE    DE GRP_DATE       D
    ----- ---------- ---------- -- ---------- -----
    A657  11/12/2008 24/12/2008 1  11/12/2008     1
    A657  24/12/2008 22/05/2009 1  11/12/2008     1
    A657  22/05/2009 25/11/2009 1  11/12/2008     1
    A657  04/05/2010 17/06/2010 2  20/05/2009     2
    A657  17/06/2010 19/07/2010 2  20/05/2009     2
    A657  29/09/2010 23/11/2010 3  31/07/2009     3
    A657  14/01/2011 20/04/2011 4  21/09/2009     4
    A657  20/04/2011 21/09/2011 4  21/09/2009     4
    You don't want to display grp_date in your output. I displayed it to help show how the query works.

    Edited by: Frank Kulash on Oct 29, 2011 10:03 AM
  • 8. Re: Ranking query
    TinyPenguin Newbie
    Currently Being Moderated
    Hi Frank,

    Thanks for that and letting me know how to make things easier for you to interpret. I've just ran through a handful (of about 40,000 records) and it looks like it's working like a dream. Yes, your assumptions are correct. The dataset is about children in education who may move class but remain in a continuous period of education. They can't be in 2 classes at once so the end date of row N must be less than or equal to the start date of row N+1. Therefore, the start date of row N+1 <= start date of row N.

    I like your approach to this. I would never have considered using dates to 'floor' the range - is that correct? - to identify contiguous blocks. Sorry to ask this again, but would you mind describing how that works? It's fascinating.

    Thank you.
  • 9. Re: Ranking query
    Frank Kulash Guru
    Currently Being Moderated
    Hi,
    Tiny Penguin wrote:
    Hi Frank,

    Thanks for that and letting me know how to make things easier for you to interpret.
    I don't think it's just me. I think most people would prefer to talk about "when a class begins on the same day the child's last class ended" rather than "when startdate on row N equals enddate on row N-1 for the same pid". Don't you find that easier to say, to hear, and to understand?
    I've just ran through a handful (of about 40,000 records) and it looks like it's working like a dream. Yes, your assumptions are correct. The dataset is about children in education who may move class but remain in a continuous period of education. They can't be in 2 classes at once so the end date of row N must be less than or equal to the start date of row N+1. Therefore, the start date of row N+1 <= start date of row N.

    I like your approach to this. I would never have considered using dates to 'floor' the range - is that correct? - to identify contiguous blocks.
    Exactly! The values in grp_date identify contiguous blocks. Those values don't mean much in absolute terms; they only have meaning relative to other values of grp_date. Equal grp_dates mean the same block. A later grp_date means a later block.
    Sorry to ask this again, but would you mind describing how that works?
    To give a hint as to how this was working, I posted a query that displayed grp_date, not because you need to show it, but because the query needs to use it. Let's make it even clearer by showing some more expressionss, just because they are important in computing grp_date. Grp_date is defined as "enddate - SUM (x)". We're already displaying enddate; let's display SUM (x) (which I called diff_so_far below) as well. In case that's not really obvious, let's show x itself (diff in the query below).
    WITH     got_grp_date     AS
    (
         SELECT     mytable2.*
         ,     enddate - SUM (enddate - startdate) OVER ( PARTITION BY  pid
                                          ORDER BY      enddate
                                        )     AS grp_date
         ,               SUM (enddate - startdate) OVER ( PARTITION BY  pid
                                          ORDER BY      enddate
                                        )     AS diff_so_far          -- For debugging only
         ,                   (enddate - startdate)            AS diff               -- For debugging only
         FROM     mytable2
    )
    SELECT       got_grp_date.*     -- After development, list all columns except grp_date
    ,       DENSE_RANK () OVER ( PARTITION BY pid
                          ORDER BY     grp_date
                        )     AS d
    FROM       got_grp_date
    ORDER BY  pid
    ,       startdate
    ;
    Output from your sample data:
                                               DIFF
                                                _SO
    PID   STARTDATE  ENDDATE    DE GRP_DATE    _FAR  DIFF    D
    ----- ---------- ---------- -- ---------- ----- ----- ----
    A657  11/12/2008 24/12/2008 1  11/12/2008    13    13    1
    A657  24/12/2008 22/05/2009 1  11/12/2008   162   149    1
    A657  22/05/2009 25/11/2009 1  11/12/2008   349   187    1
    A657  04/05/2010 17/06/2010 2  20/05/2009   393    44    2
    A657  17/06/2010 19/07/2010 2  20/05/2009   425    32    2
    A657  29/09/2010 23/11/2010 3  31/07/2009   480    55    3
    A657  14/01/2011 20/04/2011 4  21/09/2009   576    96    4
    A657  20/04/2011 21/09/2011 4  21/09/2009   730   154    4
    What are we trying to do in this problem? We're trying to derive the last column, d, which is a "block number" (1, 2, 3, ...) for each child, where a block is a series of clasess without any gaps between them; that is, one class begins exactly when that child's previous class ended. DENSE_RANK is an obvious choice for computing d, since it assigns numbers (1, 2, 3, ...) with a different series for each partition, repeat numbers when there is a tie in the ordering, and never skipping a number. The problem is, what can we use for an ORDER BY expression in DENSE_RANK? We need something that uniquely identifies each block. We need something that will be the same for row A and row B when those two rows are in the same block, and somehting that is less for row A than for row B if row A belongs in an earlier block than row B. That something in the query above is grp_date.
    This is an example of a Fixed Difference problem. When two rows belong in the same block, they do not necessarily have any column of actual data in common (except pid, of course), but what they do have in common is the difference between two expressions, enddate and diff_so_far in this example. Imagine that, when a child begins his first class, we start a stopwatch ticking, and that stopwatch continues to run as long as the child is in a class. If the child stops taking classes, then the stopwatch stops. If the child starts taking classes again, then the stopwatch resumes from the point where it left off. The difference between the real time and the stopwatch time will be fixed (or constant) as long as the child stays in some class. When the child takes a break, the difference will increase. If the child resumes classes, the stopwatch with start again, and the difference between real time and the stopwatch time will stay at the new level.
    Let's see how this applies to your sample data. This child started taking classes on December 11, 2008. The first class ended on December 24, 2008, that is, 13 days later. The stopwatch stopped, but immediately start again, because that child started another class on December 24, and that class continued unitl May 22, 2009, or 149 days later. Again the stopwatch but immediately started again (you might think of it as staying stopped for 0 days) because the child started another class, which lasted for 187 days. On November 25, 2009, the stopwatch stopped again, and this time it stayed stopped for a while, because the child did not immediately start another class. As on November 25, 2009, the child had been in class a total of 13+149+187 = 349 days. (This is the diff_so_far column.) At any poit so far, if you subtract the number of days that child had been in class from the current date, you would get the same result: November 12, 2008. (This is the grp_date column).
    Now let's see what happens when the child returns to school on May 4, 2010. The stopwatch starts ticking again, starting from the point where it left off, which was 349 days. When that class ends, 44 days later, the stopwatch reads 349_44 = 393. At that point, the child has had as much schooling as he would have had if he had started classes 393 days earlier and never stopped. 393 days before the end of that class (June 17 2010) happens to be May 20, 2009. Not that anything special happend on May 20, 2009, it just happens to be a date that reflects how long the child has been in class.
    The child's next class begins immeidatley, and when it ends 32 days later (on July 19, 2010) the child now has 425 of schooling to his credit, as if he had started school on May 20, 2009. As long as the child keeps taking classes without a break, it will be as if the child had started on May 20, 2009. As long as the child stays in class, the difference between the current date (enddate) and that theoretical point when the child could have started, had he not taken any breaks, will stay fixed. Every day the child stays in class adds 1 day to the current date, and also adds 1 day to diff_so_far, so the difference remains the same as long as the child stays in class.
    If the child takes a break, then, when he resumes, the difference will have increased, but then, as long as the child keeps taking classes, the difference won't get any greater.
    It's fascinating.
    If you think so, you might be interested in another example of the Fixed Difference technique, with another long-winded explanation. If you feel you understand how the query above solves this problem, then see if you can solve the problem in the thread below without reading the posted solution:
    Analytic Question with lag and lead
    If you don't think you understand the solution to your problem, then read the solution to this other problem for another look at how the Fixed Difference tecnique works. I think this other example is a little easier because it doesn't use a DATE column in such an un-intuitive way.
  • 10. Re: Ranking query
    TinyPenguin Newbie
    Currently Being Moderated
    Wow Frank,

    Thanks for posting such a comprehesive explanation of your workings - that's really great and makes it really clear how you achieved it. Have you ever thought about writing a book (if you haven't already!) - It would probably be the best selling SQL book ever! I know I'm only an amateur compared to you guys but this kind problem really shows how versatile SQL is - i.e. so much more than just 'select from where'.

    Thanks again for taking the time to help me.
  • 11. Re: Ranking query
    TinyPenguin Newbie
    Currently Being Moderated
    Hi Frank,

    Do you mind if I ask you a further question on this? A bit of a variation on a theme....

    I have a similar dataset, this time with a start code indicating the start of a child's period of education and an end code indicating the end of a child's period of education. This is still dependent on a contiguous block of dates, with the start code and the end code acting as book ends for the period.

    In this case, I need to get the start date and end date for the child's period of education where the start code is in (S ,B) and the End code != X.

    However, a child may have more than one S,B code (S means start of new period, B means both change of class and start of new period) and therefore more than one E (End) code. Where the End Date/code is null, the period is still ongoing. This could be substituted for sysdate+1/'E' though.

    Therefore I need to assign a dense_rank based on the first start code to the first end code, the second start code to the second end code, the third start code to the third end code etc ordered by start date.

    I've included a desired column that I hope makes this clear.

    I was thinking it might be possible using row_number()?

    Thank you. :)


    Create table mytable3
    (PID varchar(10),
    Startdate date,
    enddate date,
    StartCode varchar(1),
    EndCode varchar(1),
    desired varchar(2));

    insert into mytable3 (PID, StartDate, Enddate, StartCode, EndCode, desired) values ('A657', to_date('11/12/2008','DD/MM/RRRR'), to_date('24/12/2008','DD/MM/RRRR'),'S','X',1);
    insert into mytable3 (PID, StartDate, Enddate, StartCode, EndCode, desired) values ('A657', to_date('24/12/2008','DD/MM/RRRR'), to_date('22/05/2009','DD/MM/RRRR'),'P','X',1);
    insert into mytable3 (PID, StartDate, Enddate, StartCode, EndCode, desired) values ('A657', to_date('22/05/2009','DD/MM/RRRR'), to_date('25/11/2009','DD/MM/RRRR'),'P','E',1);
    insert into mytable3 (PID, StartDate, Enddate, StartCode, EndCode, desired) values ('A657', to_date('04/05/2010','DD/MM/RRRR'), to_date('17/06/2010','DD/MM/RRRR'),'S','X',2);
    insert into mytable3 (PID, StartDate, Enddate, StartCode, EndCode, desired) values ('A657', to_date('17/06/2010','DD/MM/RRRR'), to_date('19/07/2010','DD/MM/RRRR'),'B','E',3);
    insert into mytable3 (PID, StartDate, Enddate, StartCode, EndCode, desired) values ('A657', to_date('29/09/2010','DD/MM/RRRR'), to_date('23/11/2010','DD/MM/RRRR'),'S','E',4);
    insert into mytable3 (PID, StartDate, Enddate, StartCode, EndCode, desired) values ('A657', to_date('14/01/2011','DD/MM/RRRR'), to_date('20/04/2011','DD/MM/RRRR'),'P','E',4);
    insert into mytable3 (PID, StartDate, Enddate, StartCode, EndCode, desired) values ('A657', to_date('20/04/2011','DD/MM/RRRR'), null,'B',null,5);

    Edited by: Tiny Penguin on 30-Oct-2011 11:15 - typo

    Edited by: Tiny Penguin on 30-Oct-2011 11:16

    Edited by: Tiny Penguin on 30-Oct-2011 11:17
  • 12. Re: Ranking query
    Frank Kulash Guru
    Currently Being Moderated
    Hi,
    Tiny Penguin wrote:
    In this case, I need to get the start date and end date for the child's period of education where the start code is in (S ,B) and the End code != X.
    In the sample data you posted, it looks like a new period begions whenever startcode is either 'S' or 'B', and that endcode has nothing to do with it:
    insert into mytable3 (PID, StartDate, Enddate, StartCode, EndCode, desired) values ('A657', to_date('11/12/2008','DD/MM/RRRR'), to_date('24/12/2008','DD/MM/RRRR'),'S','X',1);
    insert into mytable3 (PID, StartDate, Enddate, StartCode, EndCode, desired) values ('A657', to_date('24/12/2008','DD/MM/RRRR'), to_date('22/05/2009','DD/MM/RRRR'),'P','X',1);
    insert into mytable3 (PID, StartDate, Enddate, StartCode, EndCode, desired) values ('A657', to_date('22/05/2009','DD/MM/RRRR'), to_date('25/11/2009','DD/MM/RRRR'),'P','E',1);
    insert into mytable3 (PID, StartDate, Enddate, StartCode, EndCode, desired) values ('A657', to_date('04/05/2010','DD/MM/RRRR'), to_date('17/06/2010','DD/MM/RRRR'),'S','X',2);
    insert into mytable3 (PID, StartDate, Enddate, StartCode, EndCode, desired) values ('A657', to_date('17/06/2010','DD/MM/RRRR'), to_date('19/07/2010','DD/MM/RRRR'),'B','E',3);
    insert into mytable3 (PID, StartDate, Enddate, StartCode, EndCode, desired) values ('A657', to_date('29/09/2010','DD/MM/RRRR'), to_date('23/11/2010','DD/MM/RRRR'),'S','E',4);
    insert into mytable3 (PID, StartDate, Enddate, StartCode, EndCode, desired) values ('A657', to_date('14/01/2011','DD/MM/RRRR'), to_date('20/04/2011','DD/MM/RRRR'),'P','E',4);
    insert into mytable3 (PID, StartDate, Enddate, StartCode, EndCode, desired) values ('A657', to_date('20/04/2011','DD/MM/RRRR'), null,'B',null,5);
    In the data above, periods 2 and 5 do not contain an enddate='E'.
    Therefore I need to assign a dense_rank
    It doesn't have to be a DENSE_RANK, does it? As long as you get the right results, you don't care if DENSE_RANK or some other function produces those results, right?
    based on the first start code to the first end code, the second start code to the second end code, the third start code to the third end code etc ordered by start date.
    ... I was thinking it might be possible using row_number()?
    Good thought, but there might be even better ways of doing it.
    The analytic ROW_NUMBER fucntion counts every row .
    The analytic COUNT function counts only certain rows . In the query below, it only counts rows where startcode in ('B', 'S'):
    SELECT       m.*
    ,       COUNT ( CASE
                          WHEN  startcode IN ('B', 'S')
                    THEN  1
                    ELSE  NULL     -- This is the default, but you can explicitly say it if you want to
                   END
                 ) OVER ( PARTITION BY  pid
                           ORDER BY      startdate
                     )     AS period
    FROM       mytable3     m
    ORDER BY  pid
    ,            startdate
    ;
    To find the startdate and enddate of the whole period, do this in a sub-query, and then use the analytic MIN and MAX functions, like this:
    WITH     got_period     AS
    (
         SELECT     m.*
         ,     COUNT ( CASE
                             WHEN  startcode IN ('B', 'S')
                      THEN  1
                          ELSE  NULL     -- This is the default, but you can explicitly say it if you want to
                         END
                       )     OVER ( PARTITION BY  pid
                                ORDER BY          startdate
                        )     AS period
         FROM     mytable3     m
    )
    SELECT       p.*
    ,       MIN (startdate) OVER (PARTITION BY period)     AS period_startdate
    ,       MAX (enddate)     OVER (PARTITION BY period)     AS period_enddate
    FROM       got_period  p
    ORDER BY  pid
    ,            startdate
    ;
    Output:
    PID   STARTDATE  ENDDATE    S E DE     PERIOD PERIOD_STA PERIOD_END
    ----- ---------- ---------- - - -- ---------- ---------- ----------
    A657  11/12/2008 24/12/2008 S X 1           1 11/12/2008 25/11/2009
    A657  24/12/2008 22/05/2009 P X 1           1 11/12/2008 25/11/2009
    A657  22/05/2009 25/11/2009 P E 1           1 11/12/2008 25/11/2009
    A657  04/05/2010 17/06/2010 S X 2           2 04/05/2010 17/06/2010
    A657  17/06/2010 19/07/2010 B E 3           3 17/06/2010 19/07/2010
    A657  29/09/2010 23/11/2010 S E 4           4 29/09/2010 20/04/2011
    A657  14/01/2011 20/04/2011 P E 4           4 29/09/2010 20/04/2011
    A657  20/04/2011            B   5           5 20/04/2011
    Why did I say you have to use a sub-query to get period? Discuss.
  • 13. Re: Ranking query
    TinyPenguin Newbie
    Currently Being Moderated
    Hi Frank,

    Thanks for getting back to me so quickly. Sorry, you're right. In the amended table I posted period 2 should have had an end code and not an ongoing code. You've got such a keen eye.

    Ah excellent, that looks like it's working a treat. Kind of a bit annoyed I didn't think of it earlier, especially as you posted a similar approach when you first started helping me.

    As for why we need to use a sub query, I'm not sure but I think it's because grouping functions are executed ahead the select statement which would mean it would try to find the min/max of the group before the analytical function had defined the product for the aggregate function to group on? Would love to know if I'm wrong though.

    Thanks Frank. :)
  • 14. Re: Ranking query
    Frank Kulash Guru
    Currently Being Moderated
    Hi,
    Tiny Penguin wrote:
    ... As for why we need to use a sub query, I'm not sure but I think it's because grouping functions are executed ahead the select statement which would mean it would try to find the min/max of the group before the analytical function had defined the product for the aggregate function to group on? Would love to know if I'm wrong though.
    You're on the right track. What you said about aggreagate functions and analytic functions is true: the GROUP BY clause (if there is one) is applied and aggregate functions are computed before analytic functions are computed, so the only way to GROUP BY the results of an analytic function is to compute the analytic function first, in a sub-query. In the query above, however, there are no aggregate functions, only analyitc functions. Analytic functions can't be nested, or depend on one another, so we can't say "MIN (startdate) OVER (PARTITION BY period)" if period is another analytic function.
1 2 Previous Next

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points