This discussion is archived
10 Replies Latest reply: Sep 7, 2009 4:11 AM by Aketi Jyuuzou RSS

sql statement for click stream analyzing

486393 Explorer
Currently Being Moderated
I want to divide user clicks into sessions. A session is defined to include all of a certain user's clicks that occur within 60 seconds of one another:
create table clickstream 
(  t date
,  userid number(10) );


insert into clickstream values (to_date('5-9-2009 10:00:00','dd-mm-yyyy hh24:mi:ss'), 2);
insert into clickstream values (to_date('5-9-2009 10:00:24','dd-mm-yyyy hh24:mi:ss') , 2);
insert into clickstream values (to_date('5-9-2009 10:01:23','dd-mm-yyyy hh24:mi:ss') , 2);
insert into clickstream values (to_date('5-9-2009 10:02:40','dd-mm-yyyy hh24:mi:ss') , 2);
insert into clickstream values (to_date('5-9-2009 0:58:24','dd-mm-yyyy hh24:mi:ss') , 7);
insert into clickstream values (to_date('5-9-2009 2:30:33','dd-mm-yyyy hh24:mi:ss') , 7);
 
commit;
The output should be:
Time t       Userid    Session

10:00:00     2          0
10:00:24     2          0
10:01:23     2          0
10:02:40     2          1
00:58:24     7          0
02:30:33     7          1
This query divides the clickstream into sessions:
select to_char(t,'hh24:mi:ss') t
,      userid
,      count(*) over
       ( partition by userid order by t RANGE between (1/(24*60)) preceding and current row) count
from clickstream
order by userid,t

T            USERID      COUNT
-------- ---------- ----------
10:00:00          2          1
10:00:24          2          2
10:01:23          2          2
10:02:40          2          1
00:58:24          7          1
02:30:33          7          1
Because when COUNT = 1 a new session is started for a certain user but I don't know how to proceed?
  • 1. Re: sql statement for click stream analyzing
    Hoek Guru
    Currently Being Moderated
    Hi Wateenmooiedag,

    it's cloudy over here btw ;)

    Have you tried using LAG:
    MHO%xe> select t
      2  ,      userid
      3  ,      case when trunc( mod( (t-lag(t) over ( partition by userid order by userid, t))*24*60, 60 ) ) >= 1
      4              then 1
      5              else 0
      6         end cnt
      7  from   clickstream;
    
    T                       USERID        CNT
    ------------------- ---------- ----------
    05-09-2009 10:00:00          2          0
    05-09-2009 10:00:24          2          0
    05-09-2009 10:01:23          2          0
    05-09-2009 10:02:40          2          1
    05-09-2009 00:58:24          7          0
    05-09-2009 02:30:33          7          1
    
    6 rijen zijn geselecteerd.
    ( It reminded me of this link:http://asktom.oracle.com/tkyte/Misc/DateDiff.html )
  • 2. Re: sql statement for click stream analyzing
    486393 Explorer
    Currently Being Moderated
    It is not exactly what I had in mind.

    When I add two new rows:

    insert into clickstream values (to_date('5-9-2009 10:02:50','dd-mm-yyyy hh24:mi:ss') , 2);
    insert into clickstream values (to_date('5-9-2009 10:05:40','dd-mm-yyyy hh24:mi:ss') , 2);

    It should return:
    T                       USERID        CNT
    ------------------- ---------- ----------
    05-09-2009 10:00:00          2          0
    05-09-2009 10:00:24          2          0
    05-09-2009 10:01:23          2          0
    05-09-2009 10:02:40          2          1
    05-09-2009 10:02:50          2          1
    05-09-2009 10:05:40          2          2
    05-09-2009 00:58:24          7          0
    05-09-2009 02:30:33          7          1
    Because user 2 started a new session on 10:02:40 and another one on 10:05:40 .


    By the way it's cloudy here too but that should not suprise you! :) (wateenmooiedag means what a beautiful day in Dutch)

    Edited by: wateenmooiedag on Sep 6, 2009 3:33 AM
  • 3. Re: sql statement for click stream analyzing
    Hoek Guru
    Currently Being Moderated
    Aha, OK, that calls for some adjustment , indeed.

    One little thing:
    Shouldn't
    insert into clickstream values (to_date('5-9-2009 10:02:40','dd-mm-yyyy hh24:mi:ss') , 2);
    (from your extra inserts)
    be
    insert into clickstream values (to_date('5-9-2009 10:02:50','dd-mm-yyyy hh24:mi:ss') , 2);
    when looking at your desired output?

    And what exactly makes the requirement for CNT = 2 on 10:05:40?
    And how does 10:02:50 being 1 relate to that (since it differs only 10 secs from the previous t)?
    Because I interpreted the requirement like:
    "For at least a difference of 60 seconds (one minute) between the current and previous t we will set CNT to 1."
    But that apparently isn't enough.
  • 4. Re: sql statement for click stream analyzing
    486393 Explorer
    Currently Being Moderated
    It should indeed be: insert into clickstream values (to_date('5-9-2009 10:02:_50_'...

    Every session should have a unique number (SESSIONID) assigned (unique per userid).

    I should have renamed column CNT to SESSIONID.
  • 5. Re: sql statement for click stream analyzing
    Hoek Guru
    Currently Being Moderated
    If the number has to be unique but it has no further meaning or purpose then maybe using row_number() is OK here:
    MHO%xe> select t
      2  ,      userid
      3  ,      nvl(last_value(nullif(sessionid, 0) ignore nulls) over ( partition by userid order by userid, t),0) sessionid
      4  from ( select t
      5         ,      userid 
      6  --       ,      trunc( mod( (t-lag(t) over ( partition by userid order by userid, t))*24*60, 60 ) ) minutes
      7         ,      case when trunc( mod( (t-lag(t) over ( partition by userid order by userid, t))*24*60, 60 )) >= 1
      8                     then row_number() over ( partition by userid order by userid, t)
      9                     else 0
     10                end sessionid
     11         from   clickstream
     12       );
    
    T                       USERID  SESSIONID
    ------------------- ---------- ----------
    05-09-2009 10:00:00          2          0
    05-09-2009 10:00:24          2          0
    05-09-2009 10:01:23          2          0
    05-09-2009 10:02:40          2          4
    05-09-2009 10:02:50          2          4
    05-09-2009 10:05:40          2          6
    05-09-2009 00:58:24          7          0
    05-09-2009 02:30:33          7          2
    
    8 rijen zijn geselecteerd.
    I've made a few assumptions here regarding setting sessionid to 4 on 10:02:50:
    Once a session is identified, then keep that session id for the next records, regardless of the difference in minutes/seconds until the difference is at least a minute again.
  • 6. Re: sql statement for click stream analyzing
    486393 Explorer
    Currently Being Moderated
    That looks very good!! I added a dense_rank() to improve the sessionid:
    select t
    ,      userid
    ,      -1 + (dense_rank() over (partition by userid order by sessionid)) sessionid
    from
    (
    select t
    ,      userid
    ,      nvl(last_value(nullif(sessionid, 0) ignore nulls) over 
           ( partition by userid order by userid, t),0) sessionid
    from ( select t
         ,      userid 
         ,      case when trunc( mod( (t-lag(t) over ( partition by userid order by userid, t))*24*60, 60 )) >= 1
                        then row_number() over ( partition by userid order by userid, t)
                        else 0
                end sessionid
         from   clickstream
    )
    )
    / 
    You can read here more: http://www.asterdata.com/resources/downloads/whitepapers/sqlmr.pdf , chapter 1.2 .

    Edited by: wateenmooiedag on Sep 6, 2009 5:27 AM
  • 7. Re: sql statement for click stream analyzing
    Hoek Guru
    Currently Being Moderated
    Yes ofcourse, dense_rank() tops it off nicely, didn't think of that one.

    Interesting link by the way, thanks Tuinstoel.
    It states SQL needs 'an expensive self-join' but by using analytics we don't need a self-join at all.
    MHO%xe> select t
      2  ,      userid
      3  ,      -1 + (dense_rank() over (partition by userid order by sessionid)) sessionid
      4  from ( select t
      5         ,      userid
      6         ,      nvl(last_value(nullif(sessionid, 0) ignore nulls) over 
      7                ( partition by userid order by userid, t),0) sessionid
      8         from ( select t
      9                ,      userid 
     10                ,      case when trunc( mod( (t-lag(t) over ( partition by userid order by userid, t))*24*60, 60 )) >= 1
     11                               then dense_rank() over ( partition by userid order by userid, t)
     12                               else 0
     13                       end sessionid
     14                from   clickstream
     15              )
     16       );
    Verstreken: 00:00:00.25
    
    Uitvoeringspan
    ----------------------------------------------------------
    Plan hash value: 1413192182
    
    --------------------------------------------------------------------------------------
    | Id  | Operation              | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
    --------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT       |             |     8 |   280 |     5  (40)| 00:00:01 |
    |   1 |  WINDOW SORT           |             |     8 |   280 |     5  (40)| 00:00:01 |
    |   2 |   VIEW                 |             |     8 |   280 |     4  (25)| 00:00:01 |
    |   3 |    WINDOW BUFFER       |             |     8 |   280 |     4  (25)| 00:00:01 |
    |   4 |     VIEW               |             |     8 |   280 |     4  (25)| 00:00:01 |
    |   5 |      WINDOW SORT       |             |     8 |   176 |     4  (25)| 00:00:01 |
    |   6 |       TABLE ACCESS FULL| CLICKSTREAM |     8 |   176 |     3   (0)| 00:00:01 |
    --------------------------------------------------------------------------------------
    
    Note
    -----
       - dynamic sampling used for this statement
    It would be very interesting to see how this query would look like when using the MODEL-clause or some XML function, by the way...
    I think this requirement can be met using less code that way.

    hopes a few other well-known users will drop in here ;)
  • 8. Re: sql statement for click stream analyzing
    MichaelS Guru
    Currently Being Moderated
    Looks to me you can do a bit simpler:
    SQL>  select userid, t, nvl (sum (s) over (partition by userid order by t), 0) sessionid
    from (select userid, 
                 t, 
                 case when t - lag(t) over (partition by userid order by t) > 1 / (24 * 60) then 1 end s
          from clickstream
          order by userid, t)
    /
        USERID T                      SESSIONID
    ---------- --------------------- ----------
             2 05.09.2009 10:00:00            0
             2 05.09.2009 10:00:24            0
             2 05.09.2009 10:01:23            0
             2 05.09.2009 10:02:40            1
             2 05.09.2009 10:02:50            1
             2 05.09.2009 10:05:40            2
             7 05.09.2009 00:58:24            0
             7 05.09.2009 02:30:33            1
    
    8 rows selected.
  • 9. Re: sql statement for click stream analyzing
    Hoek Guru
    Currently Being Moderated
    Nice and very elegant, Michael!
    It looks so easy, but it just never came to my mind...
    memorizing the  '> 1 / (24 * 60)' - part
  • 10. Re: sql statement for click stream analyzing
    Aketi Jyuuzou Oracle ACE
    Currently Being Moderated
    I like using interVal ;-)
    create table streamT(USERID,TimeVal) as
    select 2,to_date('2009-09-05 10:00:00','YYYY/MM/DD HH24:MI:SS') from dual union
    select 2,to_date('2009-09-05 10:00:24','YYYY/MM/DD HH24:MI:SS') from dual union
    select 2,to_date('2009-09-05 10:01:23','YYYY/MM/DD HH24:MI:SS') from dual union
    select 2,to_date('2009-09-05 10:02:40','YYYY/MM/DD HH24:MI:SS') from dual union
    select 2,to_date('2009-09-05 10:02:50','YYYY/MM/DD HH24:MI:SS') from dual union
    select 2,to_date('2009-09-05 10:05:40','YYYY/MM/DD HH24:MI:SS') from dual union
    select 7,to_date('2009-09-05 00:58:24','YYYY/MM/DD HH24:MI:SS') from dual union
    select 7,to_date('2009-09-05 02:30:33','YYYY/MM/DD HH24:MI:SS') from dual;
    
    select USERID,to_char(TimeVal,'yyyy-mm-dd hh24:mi:ss') as TimeVal,
    count(willCnt) over(partition by USERID order by TimeVal) as SESSIONID
    from (select USERID,TimeVal,
          case when TimeVal-interVal '1' minute >=
                    Lag(TimeVal) over(partition by USERID order by TimeVal)
               then 1 end as willCnt
            from streamT)
    order by USERID,TimeVal;
    
     USERID  TimeVal              SESSIONID
    -------  -------------------  ---------
          2  2009-09-05 10:00:00          0
          2  2009-09-05 10:00:24          0
          2  2009-09-05 10:01:23          0
          2  2009-09-05 10:02:40          1
          2  2009-09-05 10:02:50          1
          2  2009-09-05 10:05:40          2
          7  2009-09-05 00:58:24          0
          7  2009-09-05 02:30:33          1

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points