10 Replies Latest reply: Sep 7, 2009 6:11 AM by Aketi Jyuuzou RSS

    sql statement for click stream analyzing

    486393
      I want to divide user clicks into sessions. A session is defined to include all of a certain user's clicks that occur within 60 seconds of one another:
      create table clickstream 
      (  t date
      ,  userid number(10) );
      
      
      insert into clickstream values (to_date('5-9-2009 10:00:00','dd-mm-yyyy hh24:mi:ss'), 2);
      insert into clickstream values (to_date('5-9-2009 10:00:24','dd-mm-yyyy hh24:mi:ss') , 2);
      insert into clickstream values (to_date('5-9-2009 10:01:23','dd-mm-yyyy hh24:mi:ss') , 2);
      insert into clickstream values (to_date('5-9-2009 10:02:40','dd-mm-yyyy hh24:mi:ss') , 2);
      insert into clickstream values (to_date('5-9-2009 0:58:24','dd-mm-yyyy hh24:mi:ss') , 7);
      insert into clickstream values (to_date('5-9-2009 2:30:33','dd-mm-yyyy hh24:mi:ss') , 7);
       
      commit;
      The output should be:
      Time t       Userid    Session
      
      10:00:00     2          0
      10:00:24     2          0
      10:01:23     2          0
      10:02:40     2          1
      00:58:24     7          0
      02:30:33     7          1
      This query divides the clickstream into sessions:
      select to_char(t,'hh24:mi:ss') t
      ,      userid
      ,      count(*) over
             ( partition by userid order by t RANGE between (1/(24*60)) preceding and current row) count
      from clickstream
      order by userid,t
      
      T            USERID      COUNT
      -------- ---------- ----------
      10:00:00          2          1
      10:00:24          2          2
      10:01:23          2          2
      10:02:40          2          1
      00:58:24          7          1
      02:30:33          7          1
      Because when COUNT = 1 a new session is started for a certain user but I don't know how to proceed?
        • 1. Re: sql statement for click stream analyzing
          Hoek
          Hi Wateenmooiedag,

          it's cloudy over here btw ;)

          Have you tried using LAG:
          MHO%xe> select t
            2  ,      userid
            3  ,      case when trunc( mod( (t-lag(t) over ( partition by userid order by userid, t))*24*60, 60 ) ) >= 1
            4              then 1
            5              else 0
            6         end cnt
            7  from   clickstream;
          
          T                       USERID        CNT
          ------------------- ---------- ----------
          05-09-2009 10:00:00          2          0
          05-09-2009 10:00:24          2          0
          05-09-2009 10:01:23          2          0
          05-09-2009 10:02:40          2          1
          05-09-2009 00:58:24          7          0
          05-09-2009 02:30:33          7          1
          
          6 rijen zijn geselecteerd.
          ( It reminded me of this link:http://asktom.oracle.com/tkyte/Misc/DateDiff.html )
          • 2. Re: sql statement for click stream analyzing
            486393
            It is not exactly what I had in mind.

            When I add two new rows:

            insert into clickstream values (to_date('5-9-2009 10:02:50','dd-mm-yyyy hh24:mi:ss') , 2);
            insert into clickstream values (to_date('5-9-2009 10:05:40','dd-mm-yyyy hh24:mi:ss') , 2);

            It should return:
            T                       USERID        CNT
            ------------------- ---------- ----------
            05-09-2009 10:00:00          2          0
            05-09-2009 10:00:24          2          0
            05-09-2009 10:01:23          2          0
            05-09-2009 10:02:40          2          1
            05-09-2009 10:02:50          2          1
            05-09-2009 10:05:40          2          2
            05-09-2009 00:58:24          7          0
            05-09-2009 02:30:33          7          1
            Because user 2 started a new session on 10:02:40 and another one on 10:05:40 .


            By the way it's cloudy here too but that should not suprise you! :) (wateenmooiedag means what a beautiful day in Dutch)

            Edited by: wateenmooiedag on Sep 6, 2009 3:33 AM
            • 3. Re: sql statement for click stream analyzing
              Hoek
              Aha, OK, that calls for some adjustment , indeed.

              One little thing:
              Shouldn't
              insert into clickstream values (to_date('5-9-2009 10:02:40','dd-mm-yyyy hh24:mi:ss') , 2);
              (from your extra inserts)
              be
              insert into clickstream values (to_date('5-9-2009 10:02:50','dd-mm-yyyy hh24:mi:ss') , 2);
              when looking at your desired output?

              And what exactly makes the requirement for CNT = 2 on 10:05:40?
              And how does 10:02:50 being 1 relate to that (since it differs only 10 secs from the previous t)?
              Because I interpreted the requirement like:
              "For at least a difference of 60 seconds (one minute) between the current and previous t we will set CNT to 1."
              But that apparently isn't enough.
              • 4. Re: sql statement for click stream analyzing
                486393
                It should indeed be: insert into clickstream values (to_date('5-9-2009 10:02:_50_'...

                Every session should have a unique number (SESSIONID) assigned (unique per userid).

                I should have renamed column CNT to SESSIONID.
                • 5. Re: sql statement for click stream analyzing
                  Hoek
                  If the number has to be unique but it has no further meaning or purpose then maybe using row_number() is OK here:
                  MHO%xe> select t
                    2  ,      userid
                    3  ,      nvl(last_value(nullif(sessionid, 0) ignore nulls) over ( partition by userid order by userid, t),0) sessionid
                    4  from ( select t
                    5         ,      userid 
                    6  --       ,      trunc( mod( (t-lag(t) over ( partition by userid order by userid, t))*24*60, 60 ) ) minutes
                    7         ,      case when trunc( mod( (t-lag(t) over ( partition by userid order by userid, t))*24*60, 60 )) >= 1
                    8                     then row_number() over ( partition by userid order by userid, t)
                    9                     else 0
                   10                end sessionid
                   11         from   clickstream
                   12       );
                  
                  T                       USERID  SESSIONID
                  ------------------- ---------- ----------
                  05-09-2009 10:00:00          2          0
                  05-09-2009 10:00:24          2          0
                  05-09-2009 10:01:23          2          0
                  05-09-2009 10:02:40          2          4
                  05-09-2009 10:02:50          2          4
                  05-09-2009 10:05:40          2          6
                  05-09-2009 00:58:24          7          0
                  05-09-2009 02:30:33          7          2
                  
                  8 rijen zijn geselecteerd.
                  I've made a few assumptions here regarding setting sessionid to 4 on 10:02:50:
                  Once a session is identified, then keep that session id for the next records, regardless of the difference in minutes/seconds until the difference is at least a minute again.
                  • 6. Re: sql statement for click stream analyzing
                    486393
                    That looks very good!! I added a dense_rank() to improve the sessionid:
                    select t
                    ,      userid
                    ,      -1 + (dense_rank() over (partition by userid order by sessionid)) sessionid
                    from
                    (
                    select t
                    ,      userid
                    ,      nvl(last_value(nullif(sessionid, 0) ignore nulls) over 
                           ( partition by userid order by userid, t),0) sessionid
                    from ( select t
                         ,      userid 
                         ,      case when trunc( mod( (t-lag(t) over ( partition by userid order by userid, t))*24*60, 60 )) >= 1
                                        then row_number() over ( partition by userid order by userid, t)
                                        else 0
                                end sessionid
                         from   clickstream
                    )
                    )
                    / 
                    You can read here more: http://www.asterdata.com/resources/downloads/whitepapers/sqlmr.pdf , chapter 1.2 .

                    Edited by: wateenmooiedag on Sep 6, 2009 5:27 AM
                    • 7. Re: sql statement for click stream analyzing
                      Hoek
                      Yes ofcourse, dense_rank() tops it off nicely, didn't think of that one.

                      Interesting link by the way, thanks Tuinstoel.
                      It states SQL needs 'an expensive self-join' but by using analytics we don't need a self-join at all.
                      MHO%xe> select t
                        2  ,      userid
                        3  ,      -1 + (dense_rank() over (partition by userid order by sessionid)) sessionid
                        4  from ( select t
                        5         ,      userid
                        6         ,      nvl(last_value(nullif(sessionid, 0) ignore nulls) over 
                        7                ( partition by userid order by userid, t),0) sessionid
                        8         from ( select t
                        9                ,      userid 
                       10                ,      case when trunc( mod( (t-lag(t) over ( partition by userid order by userid, t))*24*60, 60 )) >= 1
                       11                               then dense_rank() over ( partition by userid order by userid, t)
                       12                               else 0
                       13                       end sessionid
                       14                from   clickstream
                       15              )
                       16       );
                      Verstreken: 00:00:00.25
                      
                      Uitvoeringspan
                      ----------------------------------------------------------
                      Plan hash value: 1413192182
                      
                      --------------------------------------------------------------------------------------
                      | Id  | Operation              | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
                      --------------------------------------------------------------------------------------
                      |   0 | SELECT STATEMENT       |             |     8 |   280 |     5  (40)| 00:00:01 |
                      |   1 |  WINDOW SORT           |             |     8 |   280 |     5  (40)| 00:00:01 |
                      |   2 |   VIEW                 |             |     8 |   280 |     4  (25)| 00:00:01 |
                      |   3 |    WINDOW BUFFER       |             |     8 |   280 |     4  (25)| 00:00:01 |
                      |   4 |     VIEW               |             |     8 |   280 |     4  (25)| 00:00:01 |
                      |   5 |      WINDOW SORT       |             |     8 |   176 |     4  (25)| 00:00:01 |
                      |   6 |       TABLE ACCESS FULL| CLICKSTREAM |     8 |   176 |     3   (0)| 00:00:01 |
                      --------------------------------------------------------------------------------------
                      
                      Note
                      -----
                         - dynamic sampling used for this statement
                      It would be very interesting to see how this query would look like when using the MODEL-clause or some XML function, by the way...
                      I think this requirement can be met using less code that way.

                      hopes a few other well-known users will drop in here ;)
                      • 8. Re: sql statement for click stream analyzing
                        MichaelS
                        Looks to me you can do a bit simpler:
                        SQL>  select userid, t, nvl (sum (s) over (partition by userid order by t), 0) sessionid
                        from (select userid, 
                                     t, 
                                     case when t - lag(t) over (partition by userid order by t) > 1 / (24 * 60) then 1 end s
                              from clickstream
                              order by userid, t)
                        /
                            USERID T                      SESSIONID
                        ---------- --------------------- ----------
                                 2 05.09.2009 10:00:00            0
                                 2 05.09.2009 10:00:24            0
                                 2 05.09.2009 10:01:23            0
                                 2 05.09.2009 10:02:40            1
                                 2 05.09.2009 10:02:50            1
                                 2 05.09.2009 10:05:40            2
                                 7 05.09.2009 00:58:24            0
                                 7 05.09.2009 02:30:33            1
                        
                        8 rows selected.
                        • 9. Re: sql statement for click stream analyzing
                          Hoek
                          Nice and very elegant, Michael!
                          It looks so easy, but it just never came to my mind...
                          memorizing the  '> 1 / (24 * 60)' - part
                          • 10. Re: sql statement for click stream analyzing
                            Aketi Jyuuzou
                            I like using interVal ;-)
                            create table streamT(USERID,TimeVal) as
                            select 2,to_date('2009-09-05 10:00:00','YYYY/MM/DD HH24:MI:SS') from dual union
                            select 2,to_date('2009-09-05 10:00:24','YYYY/MM/DD HH24:MI:SS') from dual union
                            select 2,to_date('2009-09-05 10:01:23','YYYY/MM/DD HH24:MI:SS') from dual union
                            select 2,to_date('2009-09-05 10:02:40','YYYY/MM/DD HH24:MI:SS') from dual union
                            select 2,to_date('2009-09-05 10:02:50','YYYY/MM/DD HH24:MI:SS') from dual union
                            select 2,to_date('2009-09-05 10:05:40','YYYY/MM/DD HH24:MI:SS') from dual union
                            select 7,to_date('2009-09-05 00:58:24','YYYY/MM/DD HH24:MI:SS') from dual union
                            select 7,to_date('2009-09-05 02:30:33','YYYY/MM/DD HH24:MI:SS') from dual;
                            
                            select USERID,to_char(TimeVal,'yyyy-mm-dd hh24:mi:ss') as TimeVal,
                            count(willCnt) over(partition by USERID order by TimeVal) as SESSIONID
                            from (select USERID,TimeVal,
                                  case when TimeVal-interVal '1' minute >=
                                            Lag(TimeVal) over(partition by USERID order by TimeVal)
                                       then 1 end as willCnt
                                    from streamT)
                            order by USERID,TimeVal;
                            
                             USERID  TimeVal              SESSIONID
                            -------  -------------------  ---------
                                  2  2009-09-05 10:00:00          0
                                  2  2009-09-05 10:00:24          0
                                  2  2009-09-05 10:01:23          0
                                  2  2009-09-05 10:02:40          1
                                  2  2009-09-05 10:02:50          1
                                  2  2009-09-05 10:05:40          2
                                  7  2009-09-05 00:58:24          0
                                  7  2009-09-05 02:30:33          1