Forum Stats

  • 3,768,513 Users
  • 2,252,804 Discussions
  • 7,874,602 Comments

Discussions

PL/SQL 101 : Grouping Sequence Ranges (Tabibitosan Method)

BluShadow
BluShadow Member, Moderator Posts: 41,478 Red Diamond
edited Dec 3, 2015 3:06AM in SQL & PL/SQL

PL/SQL 101 : Grouping Sequence Ranges (Tabibitosan Method)

(based on the thread: https://community.oracle.com/message/3996302#3996302 by @Aketi Jyuuzou)

Author: @BluShadow

Credit to: @Aketi Jyuuzou

Last Updated: 1st June 2015

Introduction

Back in 2011, Aketi Jyuuzou introducted the community to a method of grouping sequences of numbers, called the Tabibitosan Method in Japan.
(As Aketi no longer appears to be contributing in this community, I'll document this technique rather than try and ask him to do so)

Roughly translated (as Japanese does not directly translate to English), Tabibito means the Traveler, and Tabibitosan is Traveler Calculation.
The basis of Tabobitisan comes from a Japanese High School entrance exam, and is detailed in the following (translated) Japanese web page:

     http://translate.google.co.uk/translate?hl=en&sl=ja&u=https://www.manabinoba.com/index.cfm/7,757,33,html&prev=search

However, in terms of data processing we can simply relate this to our desire to find groups or ranges within sequences, as you'll see in this article.

It's surprising how often questions pop up on the community that require this technique.

1. The basic problem


Let us say we have some data containing several sequence ranges...

Val
---
 1
 2
 3
 5
 6
 7
 10
 11
 12
 20
 21



As you can see from that data, there's a range of numbers from 1-3, 5-7, 10-12 and 20-21, and we can determine those ranges by seeing where there are 'gaps' in sequence.
But how can we group these sequences together into their ranges using SQL? That's our basic problem.

2. The basic method


Ok, so firstly, let's (theoretically) take our data and apply a row number (RN) to each record

Val RN
--- --
 1   1
 2   2
 3   3
 5   4
 6   5
 7   6
 10  7
 11  8
 12  9
 20  10
 21  11



Well, that's nothing special.  What does that do for us?

Here's the clever bit.  If we now subtract our row number (RN) from our sequence number (Val) we get this...

Val RN Val-RN
--- -- ------
 1   1       <strong>0</strong>
 2   2       <strong>0</strong>
 3   3       <strong>0</strong>
 5   4       1
 6   5       1
 7   6       1
 10  7       <strong>3</strong>
 11  8       <strong>3</strong>
 12  9       <strong>3</strong>
 20  10     10
 21  11     10



As if by magic, we've created a number that is identical (and unique) for each "group" in the range of numbers.
If we've got unique "group" numbers, then SQL is perfectly suited to grouping our rows together with the functionality that is familiar to most... yes... the GROUP BY clause and aggregate functions (MIN, MAX etc.).

Let's do it in SQL and see...

create table myvals as
  select 1 as val from dual union all
  select 2 from dual union all
  select 3 from dual union all
  select 5 from dual union all
  select 6 from dual union all
  select 7 from dual union all
  select 10 from dual union all
  select 11 from dual union all
  select 12 from dual union all
  select 20 from dual union all
  select 21 from dual
/

Table created.

select val
      ,row_number() over (order by val) as rn
      ,val-row_number() over (order by val) as grp
from   myvals
order by val
/
       VAL         RN        GRP
---------- ---------- ----------
         1          1          <strong>0</strong>
         2          2          <strong>0</strong>
         3          3          <strong>0</strong>
         5          4          1
         6          5          1
         7          6          1
        10          7          <strong>3</strong>
        11          8          <strong>3</strong>
        12          9          <strong>3</strong>
        20         10         10
        21         11         10

11 rows selected.



And now let's group our value ranges based on those groups...

select min(val) as range_start
      ,max(val) as range_end
      ,count(*) as range_count
from (-- our previous query
      select val
            ,row_number() over (order by val) as rn
            ,val-row_number() over (order by val) as grp
      from   myvals
     )
group by grp
order by 1
/
RANGE_START  RANGE_END RANGE_COUNT
----------- ---------- -----------
          1          3           3
          5          7           3
         10         12           3
         20         21           2

4 rows selected.



There we go, that's the basics of the Tabibitosan Method for grouping sequences.  It really is that simple, but so many people are unaware of this simple trick.


The "group" relates to the Tabibitosan 'traveller' as it's effectively measuring the distance between our two sequences (the one in our data, and the one we generate as a row number).  In the Japanese website, these are the two people walking at different speeds.  This is sometimes referred to as "Grouping by Distance".

3. Another Example - Date Ranges


A common requirement is when we have ranges of Dates.  Can we use the same method for Dates? Yes we can.
Anything that we can relate to a numeric sequence in some way can be treated like this. Let's take a look...

create table mydates as
  select date '2015-04-01' as dt from dual union all
  select date '2015-04-02' from dual union all
  select date '2015-04-03' from dual union all
  select date '2015-04-04' from dual union all
  select date '2015-04-07' from dual union all
  select date '2015-04-08' from dual union all
  select date '2015-04-10' from dual union all
  select date '2015-04-12' from dual union all
  select date '2015-04-13' from dual union all
  select date '2015-04-14' from dual
/

alter session set nls_date_format='YYYY-MM-DD';

select dt
      ,row_number() over (order by dt) as rn
      ,dt-row_number() over (order by dt) as grp
from   mydates
/
DT                 RN GRP
---------- ---------- ----------
2015-04-01          1 <strong>2015-03-31</strong>
2015-04-02          2 <strong>2015-03-31</strong>
2015-04-03          3 <strong>2015-03-31</strong>
2015-04-04          4 <strong>2015-03-31</strong>
2015-04-07          5 2015-04-02
2015-04-08          6 2015-04-02
2015-04-10          7 <strong>2015-04-03</strong>
2015-04-12          8 2015-04-04
2015-04-13          9 2015-04-04
2015-04-14         10 2015-04-04

10 rows selected.



Here, our "groups" are defined by an actual date.  It doesn't matter what the date is, as long as the date is unique to the group.

So, now we can use those dates to group on...

select min(dt) as range_start
      ,max(dt) as range_end
      ,count(*) as range_count
from (-- our grouping query
      select dt
            ,row_number() over (order by dt) as rn
            ,dt-row_number() over (order by dt) as grp
      from   mydates
     )
group by grp
order by 1
/
RANGE_STAR RANGE_END  RANGE_COUNT
---------- ---------- -----------
2015-04-01 2015-04-04           4
2015-04-07 2015-04-08           2
2015-04-10 2015-04-10           1
2015-04-12 2015-04-14           3

4 rows selected.



Likewise you could do your ranges for seconds, minutes, hours, months or years as you want, simply by adjusting the group calculation accordingly.


We'll look at a slightly different example, grouping by months later, as it may not be as obvious as you first think, especially when you have multiple records for the same month; and that leads us to look at what happens if we have duplicate rows.

4. Tabibitosan with duplicate values


What do we do if we have duplicate rows in our sequences?


Well, we could distinct those sequences before we apply our Tabibitosan Method to them, that's one way, but requires an extra subquery.
Another way would be to group them before we apply our Tabibitosan Method to them, again that requires an extra subquery
Also, in our real applications, we're probably dealing with more than just a sequence, we probably have some other data too.

So, let's set up some example data, and see if we can use what we've learnt already...

create table mysales as
  select 1 as day, 'Fred' as who, 100 as dollars from dual union all
  select 1, 'Bob', 50 from dual union all
  select 1, 'Jim', 75 from dual union all
  select 2, 'Bob', 125 from dual union all
  select 2, 'Jim', 100 from dual union all
  select 3, 'Fred', 25 as dollars from dual union all
  select 4, 'Fred', 50 from dual union all
  select 4, 'Jim', 150 from dual union all
  select 5, 'Jim', 50 from dual union all
  select 8, 'Fred', 25 from dual union all
  select 8, 'Bob', 100 from dual union all
  select 9, 'Jim', 175 from dual union all
  select 9, 'Fred', 75 from dual union all
  select 10, 'Fred', 125 from dual union all
  select 10, 'Fred', 225 from dual union all
  select 11, 'Fred', 75 from dual union all
  select 12, 'Fred', 100 from dual union all
  select 15, 'Jim', 150 from dual union all
  select 16, 'Bob', 150 from dual
/

select day, who, dollars
      ,row_number() over (order by day) as rn
      ,day-row_number() over (order by day) as grp
from   mysales
order by 1,4
/
       DAY WHO     DOLLARS         RN        GRP
---------- ---- ---------- ---------- ----------
         1 Fred        100          1          0
         1 Bob          50          2         -1
         1 Jim          75          3         -2
         2 Bob         125          4         -2
         2 Jim         100          5         -3
         3 Fred         25          6         -3
         4 Fred         50          7         -3
         4 Jim         150          8         -4
         5 Jim          50          9         -4
         8 Fred         25         10         -2
         8 Bob         100         11         -3
         9 Jim         175         12         -3
         9 Fred         75         13         -4
        10 Fred        125         14         -4
        10 Fred        225         15         -5
        11 Fred         75         16         -5
        12 Fred        100         17         -5
        15 Jim         150         18         -3
        16 Bob         150         19         -3

19 rows selected.




Hmmm, those groupings don't look right.


That's because we are trying to apply Tabibitosan based on the "Day" but that data is not actually sequential; it effectively has duplicates in it.
So, when we apply our row numbering, which IS sequential, against it, we don't get the right result.

Perhaps there's some other way we can account for these duplicates?  Yes there is.  Rather than a sequential row number, we want something similar, but that takes account of duplicates.
It may not be immediately obvious, but Oracle provides us with another analytical function, called "dense_rank".

Let's check it out...

select day, who, dollars
      ,dense_rank() over (order by day) as rn
      ,day-dense_rank() over (order by day) as grp
from   mysales
order by 1,4
/
       DAY WHO     DOLLARS         RN        GRP
---------- ---- ---------- ---------- ----------
         1 Fred        100          1          <strong>0</strong>
         1 Bob          50          1          <strong>0</strong>
         1 Jim          75          1          <strong>0</strong>
         2 Bob         125          2          <strong>0</strong>
         2 Jim         100          2          <strong>0</strong>
         3 Fred         25          3          <strong>0</strong>
         4 Fred         50          4          <strong>0</strong>
         4 Jim         150          4          <strong>0</strong>
         5 Jim          50          5          <strong>0</strong>
         8 Fred         25          6          2
         8 Bob         100          6          2
         9 Jim         175          7          2
         9 Fred         75          7          2
        10 Fred        125          8          2
        10 Fred        225          8          2
        11 Fred         75          9          2
        12 Fred        100         10          2
        15 Jim         150         11          <strong>4</strong>
        16 Bob         150         12          <strong>4</strong>

19 rows selected.



That looks good.

Ranking in principle works like positioning people on a leaderboard in sports.  If we had several people competing, and two people come in first place, they are both considered 1st, and then the next person is considered in 3rd place (in general there is no second place because of the two people in 1st place).  Dense Ranking works in a similar way, but doesn't leave gaps, so if there are multiple people in one position, the next people are in the next position i.e. two people in 1st position, the next place is 2nd position.

So, looking at our data above, we can see that all the Day 1 records are in 1st position (RN=1), all Day2 records are in 2nd position (RN=2) and so on.  You can see that the RN goes up in a gapless sequence even though it too now has duplicates.

This difference between our generated gapless sequence and the Days which have gaps, allows us to apply out Tabibitosan Method, subtracting one from the other to generate unique group identifiers.

So, now we can group the data...

select min(day) as start_day
      ,max(day) as end_day
      ,count(distinct who) as sales_people
      ,count(*) as sales_count
      ,sum(dollars) as sales_amount
from (-- our dense rank grouping query
      select day, who, dollars
            ,dense_rank() over (order by day) as rn
            ,day-dense_rank() over (order by day) as grp
      from   mysales
     )
group by grp
order by 1
/
 START_DAY    END_DAY SALES_PEOPLE SALES_COUNT SALES_AMOUNT
---------- ---------- ------------ ----------- ------------
         1          5            3           9          725
         8         12            3           8          900
        15         16            2           2          300

3 rows selected.



That works really well.

We can see from this that, the key to applying Tabibitosan, is to be able to generate a gapless sequence (even if it has duplicates) so that we can compare it with the sequential (and potentially duplicated) gappy sequence we have in our data.  Once we have those two key components we're able to generate those unique groupings.

5. Tabibitosan on Dates - by Month


So, our above example had duplicate rows for the "Days", but those days weren't very realistic for most people's data.  It's unlikely we would be recording a number to represent a day.

So, let's change our sales data to have some more realistic dates, and apply our dense_rank method to those dates...

drop table mysales
/
create table mysales as
  select date '2014-01-01' as dt, 'Fred' as who, 100 as dollars from dual union all
  select date '2014-01-02', 'Bob', 50 from dual union all
  select date '2014-01-03', 'Jim', 75 from dual union all
  select date '2014-02-12', 'Bob', 125 from dual union all
  select date '2014-02-15', 'Jim', 100 from dual union all
  select date '2014-03-07', 'Fred', 25 as dollars from dual union all
  select date '2014-04-01', 'Fred', 50 from dual union all
  select date '2014-04-28', 'Jim', 150 from dual union all
  select date '2014-05-02', 'Jim', 50 from dual union all
  select date '2014-08-13', 'Fred', 25 from dual union all
  select date '2014-08-20', 'Bob', 100 from dual union all
  select date '2014-09-05', 'Jim', 175 from dual union all
  select date '2014-09-06', 'Fred', 75 from dual union all
  select date '2014-10-11', 'Fred', 125 from dual union all
  select date '2014-10-14', 'Fred', 225 from dual union all
  select date '2014-11-11', 'Fred', 75 from dual union all
  select date '2014-12-01', 'Fred', 100 from dual union all
  select date '2015-03-06', 'Jim', 150 from dual union all
  select date '2015-04-01', 'Bob', 150 from dual
/

select min(dt) as start_day
      ,max(dt) as end_day
      ,count(distinct who) as sales_people
      ,count(*) as sales_count
      ,sum(dollars) as sales_amount
from (-- our dense rank grouping query
      select dt, who, dollars
            ,dense_rank() over (order by dt) as rn
            ,dt-dense_rank() over (order by dt) as grp
      from   mysales
     )
group by grp
order by 1
/
START_DAY  END_DAY    SALES_PEOPLE SALES_COUNT SALES_AMOUNT
---------- ---------- ------------ ----------- ------------
2014-01-01 2014-01-03            3           3          225
2014-02-12 2014-02-12            1           1          125
2014-02-15 2014-02-15            1           1          100
2014-03-07 2014-03-07            1           1           25
2014-04-01 2014-04-01            1           1           50
2014-04-28 2014-04-28            1           1          150
2014-05-02 2014-05-02            1           1           50
2014-08-13 2014-08-13            1           1           25
2014-08-20 2014-08-20            1           1          100
2014-09-05 2014-09-06            2           2          250
2014-10-11 2014-10-11            1           1          125
2014-10-14 2014-10-14            1           1          225
2014-11-11 2014-11-11            1           1           75
2014-12-01 2014-12-01            1           1          100
2015-03-06 2015-03-06            1           1          150
2015-04-01 2015-04-01            1           1          150

16 rows selected.



Well, that works, if we had wanted to group the data by day, but our sales team is lazy (or perhaps just buzy doing other things ), and don't make sales in some months, so we want to just group by consecutive months to get a more 'overall' picture of what they've been up to.

For that, we need to consider that we aren't interested in the day component of our DATE values, and our calculation for our group needs to be based on months rather than days (as most will know just subtracting a number from a DATE is subtracting a number of days, not months; and in addition there are variable numbers of days in each month so we cannot just factor up the value we take off).

Therefore, for each of our dates, we'll truncate them to their Month, and we'll calculate our groups using Oracle's add_months function (using a negative value to affect the subtraction).  Likewise, for display purposes, we'll just display the year and month so it makes more sense...

select to_char(min(dt),'YYYY-MM') as start_month
      ,to_char(max(dt),'YYYY-MM') as end_month
      ,count(distinct who) as sales_people
      ,count(*) as sales_count
      ,sum(dollars) as sales_amount
from (-- our dense rank grouping query
      select dt, who, dollars
            ,dense_rank() over (order by trunc(dt,'MM')) as rn
            ,add_months(trunc(dt,'MM'), -dense_rank() over (order by trunc(dt,'MM'))) as grp
      from   mysales
     )
group by grp
order by 1
/
START_M END_MON SALES_PEOPLE SALES_COUNT SALES_AMOUNT
------- ------- ------------ ----------- ------------
2014-01 2014-05            3           9          725
2014-08 2014-12            3           8          900
2015-03 2015-04            2           2          300

3 rows selected.



Brilliant, we've achieved what we wanted, grouping duplicate rows of dates into month ranges.  And because of our Tabibitosan Method, it's nice concise SQL.

6. Summary


Here we've seen a few examples using numbers and dates.  If it suits the situation you can convert your dates to numbers using to_number(to_char(dt, 'j')) and apply Tabibitosan method to that, but clearly there are far too many scenarios to cover in this 101 introduction to the topic.

The link at the top of this article to Aketi's thread shows some more advanced scenarios and provides some other links with examples, so I do encourage you to take a look at that thread.

The one thing to remember though is, if you're considering how to group sequences of values into 'ranges', then think "Tabibitosan" !!


3rd June 2015 Update:

Following feedback to Steven Feuersteine's tweet about this article (thanks for the email Steven ), if you're using Oracle 12c or above, there is now another way to achieve this using the new MATCH_RECOGNIZE clause.

For details see @Stew Ashton's excellent blog post demonstrating how this works and how it can improve upon what we've seen with Tabibitosan

https://stewashton.wordpress.com/2014/03/05/12c-match_recognize-grouping-sequences/

Nice one Stew!

Comments

  • Huzaifa_Apex
    Huzaifa_Apex Member Posts: 383

    Wow... tht's a great trick.

    Thanks for sharing it

    Br,

    Zaif

  • Gbenga Ajakaye
    Gbenga Ajakaye Member Posts: 3,422 Gold Trophy

    Thanks for the write up.

  • Eric Hinrichsen
    Eric Hinrichsen Member Posts: 1 Red Ribbon

    This approach was just what I was after for a recent BI report I was tasked with creating. THANK YOU!

  • 4051339
    4051339 Member Posts: 1

    Very Helpful !!

  • user3897193
    user3897193 Member Posts: 16 Blue Ribbon

    Hi


    Tabibitosan method is meant for discrete sets of data

    There is modification of this which could be used e.g. with time columns for looking groups with some preset gap with other groups.

    --

    -- if there is gap longer that 2 minutes, there is a new grouping

    --


    with basedata as (

    select

                                PERSONNUMBER,

                                EXECTIME,

                                CASE when (lead(EXECTIME) over (partition by PERSONNUMBER order by EXECTIME) - EXECTIME) > 2/(24*60)

                                then

                                   1

                                else

                                    0

                                end CHANGING

    from

                                TEST2),

    GROUPING as (

    select

                                PERSONNUMBER,

                                EXECTIME,

                                sum(CHANGING) over (partition by PERSONNUMBER order by EXECTIME rows between unbounded preceding and current row) GROUPINGNUMBER

    from BASEDATA

    )

    select PERSONNUMBER,

          GROUPINGNUMBER,

          min(EXECTIME) GROUPING_BEGINS,

          max(EXECTIME) GROUPING_ENDS,

          count(*) ACTIONS

     from

          GROUPING

    group by

          PERSONNUMBER,

          GROUPINGNUMBER

    order by 1,2;


    lh

    finland

  • Solomon Yakobson
    Solomon Yakobson Member Posts: 18,909 Red Diamond
    edited Nov 11, 2020 1:24PM

    Solution you posted uses start of group method. Tabibitosan method is applicable for a subset of grouping tasks where we can calculate "distance". Rows with same "distance" belong to same group. STart of group method covers both cases where "distance" can be or can't be calculated however when "distance" can be calculated tabibitosan method is,in most cases, more efficient. And as it was already noted match_recognize provides even more efficient method for these type of tasks.

    SY.