PL/SQL 101 : Grouping Sequence Ranges (Tabibitosan Method)

Version 4

    PL/SQL 101 : Grouping Sequence Ranges (Tabibitosan Method)

    (based on the thread: https://community.oracle.com/message/3996302#3996302 by Aketi Jyuuzou)

     

    Author: BluShadow

    Credit to: Aketi Jyuuzou

    Last Updated: 1st June 2015

     

     

    Introduction

    Back in 2011, Aketi Jyuuzou introducted the community to a method of grouping sequences of numbers, called the Tabibitosan Method in Japan.
    (As Aketi no longer appears to be contributing in this community, I'll document this technique rather than try and ask him to do so)

     

    Roughly translated (as Japanese does not directly translate to English), Tabibito means the Traveler, and Tabibitosan is Traveler Calculation.
    The basis of Tabobitisan comes from a Japanese High School entrance exam, and is detailed in the following (translated) Japanese web page:

     

         http://translate.google.co.uk/translate?hl=en&sl=ja&u=https://www.manabinoba.com/index.cfm/7,757,33,html&prev=search

     

    However, in terms of data processing we can simply relate this to our desire to find groups or ranges within sequences, as you'll see in this article.

    It's surprising how often questions pop up on the community that require this technique.

     

     

    1. The basic problem


    Let us say we have some data containing several sequence ranges...

     

    Val

    ---

    1

    2

    3

    5

    6

    7

    10

    11

    12

    20

    21

     

    As you can see from that data, there's a range of numbers from 1-3, 5-7, 10-12 and 20-21, and we can determine those ranges by seeing where there are 'gaps' in sequence.
    But how can we group these sequences together into their ranges using SQL? That's our basic problem.

     

     

    2. The basic method


    Ok, so firstly, let's (theoretically) take our data and apply a row number (RN) to each record

     

    Val RN

    --- --

    1   1

    2   2

    3   3

    5   4

    6   5

    7   6

    10  7

    11  8

    12  9

    20  10

    21  11

     

    Well, that's nothing special.  What does that do for us?

     

    Here's the clever bit.  If we now subtract our row number (RN) from our sequence number (Val) we get this...

     

    Val RN Val-RN

    --- -- ------

    1   1       0

    2   2       0

    3   3       0

    5   4       1

    6   5       1

    7   6       1

    10  7       3

    11  8       3

    12  9       3

    20  10     10

    21  11     10

     

    As if by magic, we've created a number that is identical (and unique) for each "group" in the range of numbers.
    If we've got unique "group" numbers, then SQL is perfectly suited to grouping our rows together with the functionality that is familiar to most... yes... the GROUP BY clause and aggregate functions (MIN, MAX etc.).

     

    Let's do it in SQL and see...

     

    create table myvals as

      select 1 as val from dual union all

      select 2 from dual union all

      select 3 from dual union all

      select 5 from dual union all

      select 6 from dual union all

      select 7 from dual union all

      select 10 from dual union all

      select 11 from dual union all

      select 12 from dual union all

      select 20 from dual union all

      select 21 from dual

    /

     

    Table created.

     

    select val

          ,row_number() over (order by val) as rn

          ,val-row_number() over (order by val) as grp

    from   myvals

    order by val

    /

           VAL         RN        GRP

    ---------- ---------- ----------

             1          1          0

             2          2          0

             3          3          0

             5          4          1

             6          5          1

             7          6          1

            10          7          3

            11          8          3

            12          9          3

            20         10         10

            21         11         10

     

    11 rows selected.

     

    And now let's group our value ranges based on those groups...

     

    select min(val) as range_start

          ,max(val) as range_end

          ,count(*) as range_count

    from (-- our previous query

          select val

                ,row_number() over (order by val) as rn

                ,val-row_number() over (order by val) as grp

          from   myvals

         )

    group by grp

    order by 1

    /

    RANGE_START  RANGE_END RANGE_COUNT

    ----------- ---------- -----------

              1          3           3

              5          7           3

             10         12           3

             20         21           2

     

    4 rows selected.

     

    There we go, that's the basics of the Tabibitosan Method for grouping sequences.  It really is that simple, but so many people are unaware of this simple trick.


    The "group" relates to the Tabibitosan 'traveller' as it's effectively measuring the distance between our two sequences (the one in our data, and the one we generate as a row number).  In the Japanese website, these are the two people walking at different speeds.  This is sometimes referred to as "Grouping by Distance".

     

     

    3. Another Example - Date Ranges


    A common requirement is when we have ranges of Dates.  Can we use the same method for Dates? Yes we can.
    Anything that we can relate to a numeric sequence in some way can be treated like this. Let's take a look...

     

    create table mydates as

      select date '2015-04-01' as dt from dual union all

      select date '2015-04-02' from dual union all

      select date '2015-04-03' from dual union all

      select date '2015-04-04' from dual union all

      select date '2015-04-07' from dual union all

      select date '2015-04-08' from dual union all

      select date '2015-04-10' from dual union all

      select date '2015-04-12' from dual union all

      select date '2015-04-13' from dual union all

      select date '2015-04-14' from dual

    /

     

    alter session set nls_date_format='YYYY-MM-DD';

     

    select dt

          ,row_number() over (order by dt) as rn

          ,dt-row_number() over (order by dt) as grp

    from   mydates

    /

    DT                 RN GRP

    ---------- ---------- ----------

    2015-04-01          1 2015-03-31

    2015-04-02          2 2015-03-31

    2015-04-03          3 2015-03-31

    2015-04-04          4 2015-03-31

    2015-04-07          5 2015-04-02

    2015-04-08          6 2015-04-02

    2015-04-10          7 2015-04-03

    2015-04-12          8 2015-04-04

    2015-04-13          9 2015-04-04

    2015-04-14         10 2015-04-04

     

    10 rows selected.

     

    Here, our "groups" are defined by an actual date.  It doesn't matter what the date is, as long as the date is unique to the group.

     

    So, now we can use those dates to group on...

     

    select min(dt) as range_start

          ,max(dt) as range_end

          ,count(*) as range_count

    from (-- our grouping query

          select dt

                ,row_number() over (order by dt) as rn

                ,dt-row_number() over (order by dt) as grp

          from   mydates

         )

    group by grp

    order by 1

    /

    RANGE_STAR RANGE_END  RANGE_COUNT

    ---------- ---------- -----------

    2015-04-01 2015-04-04           4

    2015-04-07 2015-04-08           2

    2015-04-10 2015-04-10           1

    2015-04-12 2015-04-14           3

     

    4 rows selected.

     

    Likewise you could do your ranges for seconds, minutes, hours, months or years as you want, simply by adjusting the group calculation accordingly.


    We'll look at a slightly different example, grouping by months later, as it may not be as obvious as you first think, especially when you have multiple records for the same month; and that leads us to look at what happens if we have duplicate rows.

     

     

    4. Tabibitosan with duplicate values


    What do we do if we have duplicate rows in our sequences?


    Well, we could distinct those sequences before we apply our Tabibitosan Method to them, that's one way, but requires an extra subquery.
    Another way would be to group them before we apply our Tabibitosan Method to them, again that requires an extra subquery
    Also, in our real applications, we're probably dealing with more than just a sequence, we probably have some other data too.

     

    So, let's set up some example data, and see if we can use what we've learnt already...

    create table mysales as

      select 1 as day, 'Fred' as who, 100 as dollars from dual union all

      select 1, 'Bob', 50 from dual union all

      select 1, 'Jim', 75 from dual union all

      select 2, 'Bob', 125 from dual union all

      select 2, 'Jim', 100 from dual union all

      select 3, 'Fred', 25 as dollars from dual union all

      select 4, 'Fred', 50 from dual union all

      select 4, 'Jim', 150 from dual union all

      select 5, 'Jim', 50 from dual union all

      select 8, 'Fred', 25 from dual union all

      select 8, 'Bob', 100 from dual union all

      select 9, 'Jim', 175 from dual union all

      select 9, 'Fred', 75 from dual union all

      select 10, 'Fred', 125 from dual union all

      select 10, 'Fred', 225 from dual union all

      select 11, 'Fred', 75 from dual union all

      select 12, 'Fred', 100 from dual union all

      select 15, 'Jim', 150 from dual union all

      select 16, 'Bob', 150 from dual

    /

     

    select day, who, dollars

          ,row_number() over (order by day) as rn

          ,day-row_number() over (order by day) as grp

    from   mysales

    order by 1,4

    /

           DAY WHO     DOLLARS         RN        GRP

    ---------- ---- ---------- ---------- ----------

             1 Fred        100          1          0

             1 Bob          50          2         -1

             1 Jim          75          3         -2

             2 Bob         125          4         -2

             2 Jim         100          5         -3

             3 Fred         25          6         -3

             4 Fred         50          7         -3

             4 Jim         150          8         -4

             5 Jim          50          9         -4

             8 Fred         25         10         -2

             8 Bob         100         11         -3

             9 Jim         175         12         -3

             9 Fred         75         13         -4

            10 Fred        125         14         -4

            10 Fred        225         15         -5

            11 Fred         75         16         -5

            12 Fred        100         17         -5

            15 Jim         150         18         -3

            16 Bob         150         19         -3

     

    19 rows selected.

     

     

    Hmmm, those groupings don't look right.


    That's because we are trying to apply Tabibitosan based on the "Day" but that data is not actually sequential; it effectively has duplicates in it.
    So, when we apply our row numbering, which IS sequential, against it, we don't get the right result.

     

    Perhaps there's some other way we can account for these duplicates?  Yes there is.  Rather than a sequential row number, we want something similar, but that takes account of duplicates.
    It may not be immediately obvious, but Oracle provides us with another analytical function, called "dense_rank".

     

    Let's check it out...

    select day, who, dollars

          ,dense_rank() over (order by day) as rn

          ,day-dense_rank() over (order by day) as grp

    from   mysales

    order by 1,4

    /

           DAY WHO     DOLLARS         RN        GRP

    ---------- ---- ---------- ---------- ----------

             1 Fred        100          1          0

             1 Bob          50          1          0

             1 Jim          75          1          0

             2 Bob         125          2          0

             2 Jim         100          2          0

             3 Fred         25          3          0

             4 Fred         50          4          0

             4 Jim         150          4          0

             5 Jim          50          5          0

             8 Fred         25          6          2

             8 Bob         100          6          2

             9 Jim         175          7          2

             9 Fred         75          7          2

            10 Fred        125          8          2

            10 Fred        225          8          2

            11 Fred         75          9          2

            12 Fred        100         10          2

            15 Jim         150         11          4

            16 Bob         150         12          4

     

    19 rows selected.

     

    That looks good.

     

    Ranking in principle works like positioning people on a leaderboard in sports.  If we had several people competing, and two people come in first place, they are both considered 1st, and then the next person is considered in 3rd place (in general there is no second place because of the two people in 1st place).  Dense Ranking works in a similar way, but doesn't leave gaps, so if there are multiple people in one position, the next people are in the next position i.e. two people in 1st position, the next place is 2nd position.

     

    So, looking at our data above, we can see that all the Day 1 records are in 1st position (RN=1), all Day2 records are in 2nd position (RN=2) and so on.  You can see that the RN goes up in a gapless sequence even though it too now has duplicates.

     

    This difference between our generated gapless sequence and the Days which have gaps, allows us to apply out Tabibitosan Method, subtracting one from the other to generate unique group identifiers.

     

    So, now we can group the data...

    select min(day) as start_day

          ,max(day) as end_day

          ,count(distinct who) as sales_people

          ,count(*) as sales_count

          ,sum(dollars) as sales_amount

    from (-- our dense rank grouping query

          select day, who, dollars

                ,dense_rank() over (order by day) as rn

                ,day-dense_rank() over (order by day) as grp

          from   mysales

         )

    group by grp

    order by 1

    /

    START_DAY    END_DAY SALES_PEOPLE SALES_COUNT SALES_AMOUNT

    ---------- ---------- ------------ ----------- ------------

             1          5            3           9          725

             8         12            3           8          900

            15         16            2           2          300

     

    3 rows selected.

     

    That works really well.

     

    We can see from this that, the key to applying Tabibitosan, is to be able to generate a gapless sequence (even if it has duplicates) so that we can compare it with the sequential (and potentially duplicated) gappy sequence we have in our data.  Once we have those two key components we're able to generate those unique groupings.

     

     

    5. Tabibitosan on Dates - by Month


    So, our above example had duplicate rows for the "Days", but those days weren't very realistic for most people's data.  It's unlikely we would be recording a number to represent a day.

    So, let's change our sales data to have some more realistic dates, and apply our dense_rank method to those dates...

     

    drop table mysales

    /

    create table mysales as

      select date '2014-01-01' as dt, 'Fred' as who, 100 as dollars from dual union all

      select date '2014-01-02', 'Bob', 50 from dual union all

      select date '2014-01-03', 'Jim', 75 from dual union all

      select date '2014-02-12', 'Bob', 125 from dual union all

      select date '2014-02-15', 'Jim', 100 from dual union all

      select date '2014-03-07', 'Fred', 25 as dollars from dual union all

      select date '2014-04-01', 'Fred', 50 from dual union all

      select date '2014-04-28', 'Jim', 150 from dual union all

      select date '2014-05-02', 'Jim', 50 from dual union all

      select date '2014-08-13', 'Fred', 25 from dual union all

      select date '2014-08-20', 'Bob', 100 from dual union all

      select date '2014-09-05', 'Jim', 175 from dual union all

      select date '2014-09-06', 'Fred', 75 from dual union all

      select date '2014-10-11', 'Fred', 125 from dual union all

      select date '2014-10-14', 'Fred', 225 from dual union all

      select date '2014-11-11', 'Fred', 75 from dual union all

      select date '2014-12-01', 'Fred', 100 from dual union all

      select date '2015-03-06', 'Jim', 150 from dual union all

      select date '2015-04-01', 'Bob', 150 from dual

    /

     

    select min(dt) as start_day

          ,max(dt) as end_day

          ,count(distinct who) as sales_people

          ,count(*) as sales_count

          ,sum(dollars) as sales_amount

    from (-- our dense rank grouping query

          select dt, who, dollars

                ,dense_rank() over (order by dt) as rn

                ,dt-dense_rank() over (order by dt) as grp

          from   mysales

         )

    group by grp

    order by 1

    /

    START_DAY  END_DAY    SALES_PEOPLE SALES_COUNT SALES_AMOUNT

    ---------- ---------- ------------ ----------- ------------

    2014-01-01 2014-01-03            3           3          225

    2014-02-12 2014-02-12            1           1          125

    2014-02-15 2014-02-15            1           1          100

    2014-03-07 2014-03-07            1           1           25

    2014-04-01 2014-04-01            1           1           50

    2014-04-28 2014-04-28            1           1          150

    2014-05-02 2014-05-02            1           1           50

    2014-08-13 2014-08-13            1           1           25

    2014-08-20 2014-08-20            1           1          100

    2014-09-05 2014-09-06            2           2          250

    2014-10-11 2014-10-11            1           1          125

    2014-10-14 2014-10-14            1           1          225

    2014-11-11 2014-11-11            1           1           75

    2014-12-01 2014-12-01            1           1          100

    2015-03-06 2015-03-06            1           1          150

    2015-04-01 2015-04-01            1           1          150

     

    16 rows selected.

     

    Well, that works, if we had wanted to group the data by day, but our sales team is lazy (or perhaps just buzy doing other things ), and don't make sales in some months, so we want to just group by consecutive months to get a more 'overall' picture of what they've been up to.

     

    For that, we need to consider that we aren't interested in the day component of our DATE values, and our calculation for our group needs to be based on months rather than days (as most will know just subtracting a number from a DATE is subtracting a number of days, not months; and in addition there are variable numbers of days in each month so we cannot just factor up the value we take off).

     

    Therefore, for each of our dates, we'll truncate them to their Month, and we'll calculate our groups using Oracle's add_months function (using a negative value to affect the subtraction).  Likewise, for display purposes, we'll just display the year and month so it makes more sense...

     

     

    select to_char(min(dt),'YYYY-MM') as start_month

          ,to_char(max(dt),'YYYY-MM') as end_month

          ,count(distinct who) as sales_people

          ,count(*) as sales_count

          ,sum(dollars) as sales_amount

    from (-- our dense rank grouping query

          select dt, who, dollars

                ,dense_rank() over (order by trunc(dt,'MM')) as rn

                ,add_months(trunc(dt,'MM'), -dense_rank() over (order by trunc(dt,'MM'))) as grp

          from   mysales

         )

    group by grp

    order by 1

    /

    START_M END_MON SALES_PEOPLE SALES_COUNT SALES_AMOUNT

    ------- ------- ------------ ----------- ------------

    2014-01 2014-05            3           9          725

    2014-08 2014-12            3           8          900

    2015-03 2015-04            2           2          300

     

    3 rows selected.

     

    Brilliant, we've achieved what we wanted, grouping duplicate rows of dates into month ranges.  And because of our Tabibitosan Method, it's nice concise SQL.

     

     

    6. Summary


    Here we've seen a few examples using numbers and dates.  If it suits the situation you can convert your dates to numbers using to_number(to_char(dt, 'j')) and apply Tabibitosan method to that, but clearly there are far too many scenarios to cover in this 101 introduction to the topic.

     

    The link at the top of this article to Aketi's thread shows some more advanced scenarios and provides some other links with examples, so I do encourage you to take a look at that thread.

     

    The one thing to remember though is, if you're considering how to group sequences of values into 'ranges', then think "Tabibitosan" !!

     


     

    3rd June 2015 Update:

    Following feedback to Steven Feuersteine's tweet about this article (thanks for the email Steven ), if you're using Oracle 12c or above, there is now another way to achieve this using the new MATCH_RECOGNIZE clause.

     

    For details see Stew Ashton's excellent blog post demonstrating how this works and how it can improve upon what we've seen with Tabibitosan

     

    https://stewashton.wordpress.com/2014/03/05/12c-match_recognize-grouping-sequences/

     

    Nice one Stew!