Forum Stats

  • 3,768,285 Users
  • 2,252,771 Discussions
  • 7,874,515 Comments

Discussions

timeseries/eventstream processing

User_2SRRN
User_2SRRN Member Posts: 7 Green Ribbon

Hello,

I’m having sort of a struggle querying a table which contains events in the form of “timestamp” and “start/stop” records.

To better illustrate my problem, I’ve created a small table with some data (please see attachment). The demo data consists of a “customer” (representing any business key) and a timestamp marked as either “start” or “stop” event type. My goal is to get the actual duration between a “start” and a “stop” event/record.

The date calculation itself is not a problem and it’s easy to enumerate the events. However, I don’t know how to “reset” the group id, if there are more than one “roundtrips” (pairs of start and stop)

Customers 1000 and 2000 show the most simple case, there’s one, resp. two pairs of a start and a stop event. These work well for me.

My base idea was to just assign row numbers, so “start 1” goes to “stop 1” and so on. However to simulate some “noise” I’ve also included some examples which contain multiple start or stop events. In this situation, I want to pick the first “start” and the last “stop” of a group.

If there’s only one “start”, I’d like to assume the current date (sysdate) as the “stop” event. Invalid constellations (see comments in script) would be ignored/filtered out; eg. If there’s no “start” in a group or “stop” before “start”

Anyone knows how to do this? I was thinking of CTEs each dealing with one situation, but I’m not sure if that’s a performance killer.

Thanks in advance for any help :-)



«1

Answers

  • User_H3J7U
    User_H3J7U Member Posts: 640 Silver Trophy

    match_recognize ... pattern("start"+ ("stop"+|$))

  • Frank Kulash
    Frank Kulash Member, Moderator Posts: 41,205 Red Diamond

    Hi, @User_2SRRN

    I’ve created a small table with some data (please see attachment).

    Please post the sample data and the exact results you want from that data right in this space. Don't use attachments; a lot of people won't open them.

    Anyone knows how to do this

    As already mentioned, this sounds like a job for MATCH_RECOGNIZE. Analytic functions (e.g. LAG and COUNT) might be another way.

  • User_2SRRN
    User_2SRRN Member Posts: 7 Green Ribbon

    create table eventStream


    (

      customer int,

      event_ts date,

      event_type varchar(5)

    );


    -- simple (correct)

    insert into eventStream values (1000, to_date('2021-01-01 10:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start');

    insert into eventStream values (1000, to_date('2021-01-01 12:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop');


    -- two roundtrips (correct)

    insert into eventStream values (2000, to_date('2021-02-01 14:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start');

    insert into eventStream values (2000, to_date('2021-02-01 16:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop');


    insert into eventStream values (2000, to_date('2021-02-02 08:30:00', 'yyyy-mm-dd hh24:mi:ss'), 'start');

    insert into eventStream values (2000, to_date('2021-02-02 11:30:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop');



    -- event repeating: select first start or last stop (one roundtrip)


    insert into eventStream values (3000, to_date('2021-03-01 10:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start');

    insert into eventStream values (3000, to_date('2021-03-01 11:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start');

    insert into eventStream values (3000, to_date('2021-03-01 12:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start');

    insert into eventStream values (3000, to_date('2021-03-01 14:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop');


    insert into eventStream values (3100, to_date('2021-03-02 09:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start');

    insert into eventStream values (3100, to_date('2021-03-02 20:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop');

    insert into eventStream values (3100, to_date('2021-03-02 21:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop');



    -- event repeating: select first start or last stop (two roundtrips)

    insert into eventStream values (3200, to_date('2021-03-01 10:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start');

    insert into eventStream values (3200, to_date('2021-03-01 11:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start');

    insert into eventStream values (3200, to_date('2021-03-01 12:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start');

    insert into eventStream values (3200, to_date('2021-03-01 14:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop');


    insert into eventStream values (3200, to_date('2021-03-02 09:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start');

    insert into eventStream values (3200, to_date('2021-03-02 20:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop');

    insert into eventStream values (3200, to_date('2021-03-02 21:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop');


    -- ongoing (correct): use sysdate as stop

    insert into eventStream values (4000, to_date('2021-04-01 15:15:00', 'yyyy-mm-dd hh24:mi:ss'), 'start');


    -- ongoing (correct): use sysdate as stop - select first start

    insert into eventStream values (4100, to_date('2021-04-02 15:15:00', 'yyyy-mm-dd hh24:mi:ss'), 'start');

    insert into eventStream values (4100, to_date('2021-04-02 17:15:00', 'yyyy-mm-dd hh24:mi:ss'), 'start');


    -- invalid: stop without start (use 2021-01-01 as start)

    insert into eventStream values (-1000, to_date('2020-01-01 13:59:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop');


    -- invalid: stop after start (ignore)

    insert into eventStream values (-2000, to_date('2020-01-01 13:59:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop');

    insert into eventStream values (-2000, to_date('2020-01-01 14:59:00', 'yyyy-mm-dd hh24:mi:ss'), 'start');

     

    commit;

  • User_2SRRN
    User_2SRRN Member Posts: 7 Green Ribbon

    okay, this is the script to create and fill a table with some test data which abstracts my problem.

    what I'm trying to do is to pick the first event_type = "start" and the last event_type = "stop" to calculate the time between the dates of event_ts. if there are more than one pairs, I'm loking for these records too.

    [customer 1000] would be the two records event_ts (as start_ts) and event_ts (as stop_ts)

    [customer 2000] would be start 14:00 and stop 16:00 and start 8:30 with stop 11:30

    [customer 3200] produces two pairs:

    start 10:00 stop 14:00

    start 09:00 stop 21:00

    In the same, or a following step, I'd like to create one record using the lead function, so there is someting like start_ts and stop_ts for each pair.

  • User_H3J7U
    User_H3J7U Member Posts: 640 Silver Trophy
    select *
    from eventStream
    match_recognize(
      partition by customer
      order by event_Ts
      measures first(sta.event_ts) as fts,
               nvl(last(sto.event_ts), sysdate) as lts
      pattern (sta+ (sto+|$))
    --   pattern (sta+ sto+ | ^ sta+ $)
      define sta as event_type='start',
             sto as event_type='stop'
    );
    
      CUSTOMER FTS                 LTS                
    ---------- ------------------- -------------------
         -2000 2020-01-01 14:59:00 2021-10-15 15:14:04
          1000 2021-01-01 10:00:00 2021-01-01 12:00:00
          2000 2021-02-01 14:00:00 2021-02-01 16:00:00
          2000 2021-02-02 08:30:00 2021-02-02 11:30:00
          3000 2021-03-01 10:00:00 2021-03-01 14:00:00
          3100 2021-03-02 09:00:00 2021-03-02 21:00:00
          3200 2021-03-01 10:00:00 2021-03-01 14:00:00
          3200 2021-03-02 09:00:00 2021-03-02 21:00:00
          4000 2021-04-01 15:15:00 2021-10-15 15:14:04
          4100 2021-04-02 15:15:00 2021-10-15 15:14:04
    


  • BluShadow
    BluShadow Member, Moderator Posts: 41,474 Red Diamond

    Using recursive subquery factoring... (a little longer than using match recognize or other techniques, but perhaps a little easier to follow/debug).

    SQL> with eventStream(customer, event_ts, event_type) as (
      2    -- simple (correct)
      3    select 1000, to_date('2021-01-01 10:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start' from dual union all
      4    select 1000, to_date('2021-01-01 12:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop' from dual union all
      5    -- two roundtrips (correct)
      6    select 2000, to_date('2021-02-01 14:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start' from dual union all
      7    select 2000, to_date('2021-02-01 16:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop' from dual union all
      8    select 2000, to_date('2021-02-02 08:30:00', 'yyyy-mm-dd hh24:mi:ss'), 'start' from dual union all
      9    select 2000, to_date('2021-02-02 11:30:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop' from dual union all
     10    -- event repeating: select first start or last stop (one roundtrip)
     11    select 3000, to_date('2021-03-01 10:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start' from dual union all
     12    select 3000, to_date('2021-03-01 11:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start' from dual union all
     13    select 3000, to_date('2021-03-01 12:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start' from dual union all
     14    select 3000, to_date('2021-03-01 14:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop' from dual union all
     15    select 3100, to_date('2021-03-02 09:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start' from dual union all
     16    select 3100, to_date('2021-03-02 20:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop' from dual union all
     17    select 3100, to_date('2021-03-02 21:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop' from dual union all
     18    -- event repeating: select first start or last stop (two roundtrips)
     19    select 3200, to_date('2021-03-01 10:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start' from dual union all
     20    select 3200, to_date('2021-03-01 11:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start' from dual union all
     21    select 3200, to_date('2021-03-01 12:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start' from dual union all
     22    select 3200, to_date('2021-03-01 14:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop' from dual union all
     23    select 3200, to_date('2021-03-02 09:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'start' from dual union all
     24    select 3200, to_date('2021-03-02 20:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop' from dual union all
     25    select 3200, to_date('2021-03-02 21:00:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop' from dual union all
     26    -- ongoing (correct): use sysdate as stop
     27    select 4000, to_date('2021-04-01 15:15:00', 'yyyy-mm-dd hh24:mi:ss'), 'start' from dual union all
     28    -- ongoing (correct): use sysdate as stop - select first start
     29    select 4100, to_date('2021-04-02 15:15:00', 'yyyy-mm-dd hh24:mi:ss'), 'start' from dual union all
     30    select 4100, to_date('2021-04-02 17:15:00', 'yyyy-mm-dd hh24:mi:ss'), 'start' from dual union all
     31    -- invalid: stop without start (use 2021-01-01 as start)
     32    select -1000, to_date('2020-01-01 13:59:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop' from dual union all
     33    -- invalid: stop after start (ignore)
     34    select -2000, to_date('2020-01-01 13:59:00', 'yyyy-mm-dd hh24:mi:ss'), 'stop' from dual union all
     35    select -2000, to_date('2020-01-01 14:59:00', 'yyyy-mm-dd hh24:mi:ss'), 'start' from dual
     36    )
     37    ,rn(customer, event_ts, event_type, rn, cnt) as (
     38    -- apply simple sequential row numbering
     39    select customer
     40          ,event_ts
     41          ,event_type
     42          ,row_number() over (partition by customer order by event_ts) as rn
     43          ,count(*) over (partition by customer) as cnt
     44    from   eventStream
     45    )
     46    ,r(customer, start_ts, stop_ts, rn) as (
     47    select customer
     48          ,case when event_type = 'stop' then date '2021-01-01' else event_ts end
     49          ,case when rn.cnt = 1 then sysdate else cast(null as date) end
     50          ,rn
     51    from   rn
     52    where  rn.rn = 1
     53    and    ((rn.cnt > 1 and rn.event_type = 'start') or (rn.cnt = 1 and rn.event_type = 'stop'))
     54    union all
     55    select r.customer
     56          ,case when r.stop_ts is not null and rn.event_type = 'start' then rn.event_ts -- start new roundtrip
     57           else r.start_ts -- keep existing start
     58           end as start_ts
     59          ,case when rn.event_type = 'start' then
     60             case when rn.rn = rn.cnt then sysdate
     61             else null
     62             end
     63           else rn.event_ts
     64           end as end_ts
     65          ,rn.rn
     66    from   r
     67           join rn on (    rn.customer = r.customer
     68                       and rn.rn = r.rn+1
     69                      )
     70   )
     71  select customer, start_ts, max(stop_ts) as stop_ts
     72        ,floor((max(stop_ts)-start_ts)*24) as hrs
     73  from   r
     74  group by customer, start_ts
     75  order by customer, start_ts
     76  /
    
      CUSTOMER START_TS             STOP_TS                     HRS
    ---------- -------------------- -------------------- ----------
         -1000 01-JAN-2021 00:00:00 15-OCT-2021 13:33:50       6901
          1000 01-JAN-2021 10:00:00 01-JAN-2021 12:00:00          2
          2000 01-FEB-2021 14:00:00 01-FEB-2021 16:00:00          2
          2000 02-FEB-2021 08:30:00 02-FEB-2021 11:30:00          3
          3000 01-MAR-2021 10:00:00 01-MAR-2021 14:00:00          4
          3100 02-MAR-2021 09:00:00 02-MAR-2021 21:00:00         12
          3200 01-MAR-2021 10:00:00 01-MAR-2021 14:00:00          4
          3200 02-MAR-2021 09:00:00 02-MAR-2021 21:00:00         12
          4100 02-APR-2021 15:15:00 15-OCT-2021 13:33:50       4702
    
    9 rows selected.
    
    

    Hoping I've got the results as you expected. I based it on the comments in your text file.

  • mathguy
    mathguy Member Posts: 10,154 Blue Diamond

    Question about your approach (having nothing to do with "programming"). If you have multiple starts with no stop in between, it does make sense to me, too, that you would consider the first "start" as the real one, and the others as "mistakes" or "noise".

    But for stop, you say you will choose the last stop in a sequence, not the first one. Why? It would make more sense to me (at least) to pick the first stop also, not the last.

    On my TV remote, if I press ON, the TV will come on. If I press ON a few more times (without turning the TV off), all the additional ON clicks don't matter - they are just noise. The same with OFF - the first one (not the last one) is the one that actually turns the TV off.

    This should not be a question for you (the programmer), but for your business users. Ask them - in the presence of mistakes or "noise", what do they consider the correct handling?

  • mathguy
    mathguy Member Posts: 10,154 Blue Diamond

    @BluShadow - do you really think that anyone would find your solution easier to follow or easier to maintain than the match_recognize solution? Just stare at both solutions for a few seconds. Is that the vibe you get?

    The only reason not to use match_recognize - other than a personal choice not to learn this clause - is an old Oracle version (prior to 12.1). In those versions, the correct approach is to use the tabibitosan method to identify the "groups"; this will be simpler, easier to understand, easier to maintain, and a lot faster than the recursive query. Besides, the recursive query itself is only available since 11.2, so that's the only sub-version in which recursive query is available while match_recognize is not.

  • User_2SRRN
    User_2SRRN Member Posts: 7 Green Ribbon

    lol, people here really don't like attachments. well, they do have a bad taste :-)

    please give me some time to understand the anwers (especially match_recognize is new to me)

  • mathguy
    mathguy Member Posts: 10,154 Blue Diamond

    "People don't like attachments" is probably true in general, but here the point is more specific. People don't like attachments from unknown sources, posted on a public forum. With all due respect, a user name like User_2SRRN doesn't inspire the highest degree of trust.

    You may "lol" all you want at that preference, but opening such attachments is by far the most common way of compromising security.

    Klostrie-Oracle