
SQL & PL/SQL

Select the most frequently occurring value - analytic function?!

777489 — Jan 14 2011, edited Jan 31 2011
Hi ...

I've gotten stuck on a "little" problem.
I'll try to provide some test code for it:
CREATE TABLE a1 ( 
	id NUMBER(8), 
	val NUMBER(6),
	title VARCHAR2(16),
	CONSTRAINT test_pk PRIMARY KEY(id)
);
INERT INTO a1 (id, val, title) VALUES (1,12,'c');
INERT INTO a1 (id, val, title) VALUES (2,13,'b');
INERT INTO a1 (id, val, title) VALUES (3,13,'a');
INERT INTO a1 (id, val, title) VALUES (4,13,'a');
INERT INTO a1 (id, val, title) VALUES (5,42,'a');
INERT INTO a1 (id, val, title) VALUES (6,42,'b');
INERT INTO a1 (id, val, title) VALUES (7,42,'b');
Actually the table is much bigger ;) But this should be OK for this question. A query like this already exists:
SELECT
	count(*) -- just an example
FROM 
	a1
GROUP BY
	val
;
-- should return 1, 3, 3 (for the groups val=12, val=13, val=42)
Now it is necessary to select a title for each group (as specified by GROUP BY), and the title that occurs most often within the group should be selected. For this example, that is 'c' for the group val=12, 'a' for the group val=13, and finally 'b' for the group val=42.
I tried to use some analytic functions, but I'm not able to get this to work - maybe because I have never used analytic functions before. Whenever I try something, I mostly get an error: FROM keyword not found where expected (ORA-00923). I searched for tutorial/how-to documentation that covers my problem, but without success. So I guess the syntax, and the way to understand analytic functions, is not as easy as it seems to be ...
title OVER ( ORDER BY count(*) ROWS | RANGE BETWEEN 1 AND 1 ) <-- that would be logical for my brain, but not for Oracle ;-)
Can somebody help?

Thanks!

Comments

fsitja
Hi,
Maybe this:
SELECT val, title
  FROM (SELECT val,
               title,
               cnt_title,
               RANK() OVER (PARTITION BY val ORDER BY cnt_title DESC) rnk
          FROM (SELECT val,
                       title,
                       COUNT(title) over (PARTITION BY val, title) cnt_title
                  FROM a1))
 WHERE rnk = 1
 GROUP BY val, title
 ORDER BY val;
Sample execution:
 
    VAL TITLE
------- ----------------
     12 c
     13 a
     42 b
189821
SELECT   val, MAX(title) KEEP (DENSE_RANK FIRST ORDER BY c DESC) AS title, SUM(c) AS c
    FROM (SELECT   val, title, COUNT(*) AS c
              FROM a1
          GROUP BY val, title)
GROUP BY val;


    VAL          TITLE           C     
---------- ---------------- ---------- 
        12 c                         1 
        13 a                         3 
        42 b                         3 
Urs
Boneist
There's this:
select val, title
from   (select val, title, count(*) cnt, max(count(*)) over (partition by val) max_count
        from   a1
        group by val, title)
where  cnt = max_count;

       VAL TITLE           
---------- ----------------
        12 c               
        13 a               
        42 b   
189821
Boneist wrote:
There's this:
select val, title
from   (select val, title, count(*) cnt, max(count(*)) over (partition by val) max_count
from   a1
group by val, title)
where  cnt = max_count;

VAL TITLE           
---------- ----------------
12 c               
13 a               
42 b   
You might get more than one title with this query.
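To illustrate with a hypothetical extra row (not part of the original test data):
INSERT INTO a1 (id, val, title) VALUES (8, 13, 'b');
-- 'a' and 'b' would then both occur twice for val = 13, so the query
-- above would return both (13, 'a') and (13, 'b').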

Urs
Frank Kulash
Hi,
Stede wrote:
... I'll try to provide some test code for it: ...
INERT INTO a1 (id, val, title) VALUES (6,42,'b');
INERT INTO a1 (id, val, title) VALUES (7,42,'b');
Actually the table is much bigger ;) But this should be ok for this question. It already exist a query like:
Thanks, that looks like a very good and clear example.
Please test your code before posting it. (You misspelled INSERT as INERT.)

Fsitja's solution can be simplified by using both aggregate and analytic functions.
WITH   got_rnk	  AS
(
	SELECT	  val, title
	,	  RANK () OVER ( PARTITION BY  val
		       	       	 ORDER BY      COUNT (title)	DESC
			       )       AS rnk
	FROM      a1
	GROUP BY  val, title
)
SELECT	val, title
FROM	got_rnk
WHERE	rnk	= 1
;
Aggregate functions are computed before analytic functions, so there's no need to do the COUNT in a separate sub-query just to make sure it's done before RANK; we can depend on the aggregate COUNT being completed before the analytic RANK is computed.

You only have to worry about RANGE BETWEEN or ROWS BETWEEN for things like point-in-time calculations; for example: how many times has this title occurred up to this date? There's no such requirement in this problem.
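For illustration only, such a running count against the a1 table might look like this (a sketch; running_cnt is a made-up alias, and id is used as the ordering column since a1 has no date column):
SELECT id, val, title
,      COUNT (*) OVER ( PARTITION BY val, title
                        ORDER BY     id
                        ROWS BETWEEN UNBOUNDED PRECEDING
                                 AND CURRENT ROW
                      )  AS running_cnt
FROM   a1;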

Edited by: Frank Kulash on Jan 14, 2011 9:15 AM

Edited by: Frank Kulash on Jan 14, 2011 9:17 AM
fsitja
True, my bad; thanks for clarifying it, Frank. I didn't give it much thought, being rushed out to my lunch break. ;)
Boneist
metzguar wrote:
You might get more than one title with this query.
Good point, although the OP didn't suggest what they wanted to happen in the case where there was a tie. *{:-)
Sven W.
There is no need for an analytic function. Metzguar already showed how to do it with aggregate functions only.
However, analytic functions might be helpful if you need more than one column with the same applied logic. Compare the two patterns (a concrete sketch follows below):
"GROUP function"
MAX(ColX) KEEP (dense_rank first order by ... )

"Analytic function"
MAX(ColX) over (partition by ... order by ...)
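For example, both patterns rendered concretely against the a1 test table (a sketch; top_title is a made-up alias):

-- "GROUP function": one row per val
SELECT val
,      MAX (title) KEEP (DENSE_RANK FIRST ORDER BY c DESC) AS top_title
FROM  (SELECT val, title, COUNT (*) AS c FROM a1 GROUP BY val, title)
GROUP BY val;

-- "Analytic function": keeps every (val, title) row and carries the winner along
SELECT val, title, c
,      MAX (title) KEEP (DENSE_RANK FIRST ORDER BY c DESC)
                   OVER (PARTITION BY val) AS top_title
FROM  (SELECT val, title, COUNT (*) AS c FROM a1 GROUP BY val, title);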
777489
Hi folks,

thanks for the various answers! The weekend is over, so back to work ...

I tried some of the examples you gave me, and I've decided to provide more detailed information! First: if the counts of 'a' and 'b' are equal, it doesn't matter which one is returned (so it can be undefined, if that makes things easier). I will now paste the original query I work with, with some comments added for you - so you can find the lines that should be changed to return the most common value.
If you think it makes sense to provide some CREATE TABLE DDL and some data (maybe as a CSV file), tell me how I can do that (I don't think it's an option to post ~1 million rows as INSERTs here).

The SELECT query I want to manipulate - this is no longer related to our test table 'a1'!! For example, look at the lines where 'A' and 'drm_' are selected - marked by the comments below!
SELECT
  box_id,
  schedule_id,
  fixsecs_down(MIN(acqtime),600),
  COUNT(*), -- each row in rfm_meas_hr represents one frame of measurement data, so this is the number of frames received in this block

  -- instead of the literal 'A', the most common value of the column 'rpro' should be selected here
  'A',
  -- like above, but for the column 'rdmo'
  'drm_',

  -- below this, some other columns are calculated/selected; not important here
  FLOOR(MEDIAN(rfre)),
  ROUND(AVG(rdbv),2),
  ROUND(SUM(POWER(rdbv,2)),2),
  ROUND(MAX(rdbv),2),
  ROUND(MIN(rdbv),2),
  ROUND(SUM(rsnr)/SUM(nframes),2),
  ROUND(SUM(POWER(rsnr,2)),2),
  ROUND(MAX(rsnr),2),
  ROUND(MIN(rsnr),2),
  ROUND( AVG(rsta_sync),2), -- rsta_sync
  ROUND(SUM(POWER(rsta_sync,2)),2), -- rsta_sync_s
  ROUND( MIN(rsta_sync),2), -- rsta_sync_min
  ROUND( MAX(rsta_sync),2), -- rsta_sync_max
  ROUND( AVG(rsta_fac),2), -- rsta_facc
  ROUND(SUM(POWER(rsta_fac,2)),2), -- rsta_fac_s
  ROUND( MIN(rsta_fac),2), -- rsta_fac_min
  ROUND( MAX(rsta_fac),2), -- rsta_fac_max
  ROUND( AVG(rsta_sdc),2), -- rsta_sdc
  ROUND(SUM(POWER(rsta_sdc,2)),2), -- rsta_sdc_s
  ROUND( MIN(rsta_sdc),2), -- rsta_sdc_min
  ROUND( MAX(rsta_sdc),2), -- rsta_sdc_max
  ROUND( AVG(rsta_audio),2), -- rsta_audio
  ROUND(SUM(POWER(rsta_audio,2)),2), -- rsta_audio_s
  ROUND( MIN(rsta_audio),2), -- rsta_audio_min
  ROUND( MAX(rsta_audio),2), -- rsta_audio_max
  MIN(rser), -- TODO: most common value
  MIN(rtty_stream0), -- TODO: most common value
  MIN(rtty_stream1), -- TODO: most common value
  MIN(rtty_stream2), -- TODO: most common value
  MIN(rtty_stream3), -- TODO: most common value
  ROUND(AVG(NVL(rafs_error/nullif(rafs_au,0),1))*SUM(rafs_au)/NULLIF(SUM(rafs_au),0),2), -- rafs
  ROUND( SUM( POWER(  NVL(rafs_error/nullif(rafs_au,0),1),2))*SUM(rafs_au)/NULLIF(SUM(rafs_au),0) ,2), -- rafs_s
  ROUND(MIN(rafs_error/ NULLIF(rafs_au,0)),2), -- rafs_min
  ROUND(MAX(NVL(rafs_error/NULLIF(rafs_au,0),1) )*SUM(rafs_au)/NULLIF(SUM(rafs_au),0),2), -- rafs_max
  SUM(robm_A),
  SUM(robm_B),
  SUM(robm_C),
  SUM(robm_D),
  SUM(robm_E),
  ROUND(SUM(rwmf) / SUM(nframes),2), -- rwmf
  ROUND(SUM(POWER(rwmf,2)),2), -- rwmf_s
  ROUND(MIN(rwmf),2), -- rwmf_min
  ROUND(MAX(rwmf),2), -- rwmf_max
  ROUND(SUM(rwmm) / SUM(nframes),2), -- rwmm
  ROUND(SUM(POWER(rwmm,2)),2), -- rwmm_s
  ROUND(MIN(rwmm),2), -- rwmm_min
  ROUND(MAX(rwmm),2), -- rwmm_max
  ROUND(SUM(rmer) / SUM(nframes),2), -- rmer
  ROUND(SUM(POWER(rmer,2)),2), -- rmer_s
  ROUND(MIN(rmer),2), -- rmer_min
  ROUND(MAX(rmer),2), -- rmer_max
  ROUND(SUM(RBP0_ERRS+ RBP1_ERRS+ RBP2_ERRS+ RBP3_ERRS) / NULLIF(SUM(RBP0_BITS+ RBP1_BITS+ RBP2_BITS+ RBP3_BITS),0) ,10), -- ber
  ROUND(SUM(POWER( (RBP0_ERRS+ RBP1_ERRS+ RBP2_ERRS+ RBP3_ERRS) / NULLIF((RBP0_BITS+ RBP1_BITS+ RBP2_BITS+ RBP3_BITS),0) ,2)),10), -- ber_s
  ROUND(MIN(RBP0_ERRS+ RBP1_ERRS+ RBP2_ERRS+ RBP3_ERRS) / NULLIF(MIN(RBP0_BITS+ RBP1_BITS+ RBP2_BITS+ RBP3_BITS),0) ,10), -- ber_min
  ROUND(MAX(RBP0_ERRS+ RBP1_ERRS+ RBP2_ERRS+ RBP3_ERRS) / NULLIF(MAX(RBP0_BITS+ RBP1_BITS+ RBP2_BITS+ RBP3_BITS),0) ,10), -- ber_max
  ROUND(AVG(rdop),2), -- rdop
  ROUND(SUM(POWER(rdop,2) ),2), -- rdop_s
  ROUND(MIN(rdop),2), -- rdop_min
  ROUND(MAX(rdop),2), -- rdop_max
  ROUND(AVG(rdel90),2), -- rdel90
  ROUND(SUM(POWER(rdel90,2) ),2), -- rdel90_s
  ROUND(MIN(rdel90),2), -- rdel90_min
  ROUND(MAX(rdel90),2), -- rdel90_max
  ROUND(AVG(rdel95),2), -- rdel95
  ROUND(AVG(rdel99),2), -- rdel99
  null AS reslevel
FROM
-- select the data where this should be calculated
  (
    SELECT 
      *
    FROM
      rfm_meas_hr
    WHERE
      acqtime < fixsecs_down(to_timestamp('07-01-2011 14:00:00,00','dd-mm-yyyy hh24:mi:ss,ff'),600)
      AND (reslevel IS NULL OR reslevel=10)
  )
-- group the selected data and execute the calculations given in the SELECT list
GROUP BY
  -- group the data into 10-minute packages, indicated by the timestamp
  to_char( EXTRACT(MONTH FROM acqtime)*100000 + EXTRACT(DAY FROM acqtime)*1000 + EXTRACT(HOUR FROM acqtime)*10 + floor(EXTRACT(MINUTE FROM acqtime)/10) ),
  schedule_id,
  box_id
HAVING
  SUM(nframes)>15
;
I should say: I can add indexes if necessary! At the moment there is one on (acqtime, reslevel), as this improves access speed. But the query above, executed on typical data, already takes 5-7 seconds.

Please let me know if you need any more information.

Regards!
777489
I'm worried that there are no answers :/ What are the reasons? Is the information not detailed enough? Is the question more complex than people want to work through?
I hope someone can help - sometime.

Regards.
Sven W.
Stede wrote:
Hi folks,

thanks for the various answers! The weekend is over, so back to work ...

I tried some of the examples you gave me.
And how was the result? What worked or didn't work as intended? And keep the example simple.
Frank Kulash
Duplicate.

Edited by: Frank Kulash on Jan 24, 2011 11:50 AM
Frank Kulash
Hi,

What version of Oracle you're using can be very important, so always mention it: e.g. "I'm using Oracle 11.1.0.7.0"

The NTH_VALUE function (new in Oracle 11.2) might also help in problems like this.
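For example, against the a1 table from the first post, something like this might work (a sketch; the explicit ROWS window is needed so that every row sees the whole partition):
SELECT DISTINCT
       val
,      NTH_VALUE (title, 1) OVER ( PARTITION BY val
                                   ORDER BY     cnt DESC
                                   ROWS BETWEEN UNBOUNDED PRECEDING
                                            AND UNBOUNDED FOLLOWING
                                 ) AS most_common_title
FROM  (SELECT val, title
       ,      COUNT (*) OVER (PARTITION BY val, title) AS cnt
       FROM   a1);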

Edited by: Frank Kulash on Jan 24, 2011 11:53 AM
Frank Kulash
Hi,
Stede wrote:
I'm worried that there are no answers :/ What are the reasons? Is the information not detailed enough?
Exactly; not enough relevant information, too much irrelevant stuff.

Whenever you have a question, post a little sample data (CREATE TABLE and INSERT statements) like you did in your first message (except that they should work), and the results you want from that data. Explain how you get those results from that data.
Simplify the question as much as possible. If you're selecting 40 columns that you already know how to do, and four that you don't know how to compute, then just post a problem that includes the four (or, in this case, since they're all similar, make it two or three) plus whatever other columns are needed to actually run the query. In this case, that might be just one or two columns that you need for the GROUP BY clause.
You might mention how your real problem differs from what you posted. E.g., "In my real job, I have to join two tables, and the SELECT clause has over 40 more columns that I get from aggregate functions."

If you can illustrate your problem using commonly available tables, such as those in the hr or scott schemas (as I did below), then you don't have to post any sample data; just the results and an explanation.

Is the question more complex than people want to work through?
There are some people who might be more likely to reply to a question if it is complex, or at least unusual.

This problem is a little different, and therefore more interesting, than what you originally posted.
If you need to get the most common value of two (or more) independent columns, then RANK (or ROW_NUMBER) will get rather cumbersome, since you'd need to do two (or more) independent sub-queries, each with a somewhat different GROUP BY clause, and then join the results. It would probably be easier to call the analytic COUNT function two (or more) times, each time with a somewhat different PARTITION BY clause, and then get the value related to the maximum count using the aggregate FIRST (or LAST) function.

For example, say we want to know the most common job, the most common sal, and the most common comm in each department of the scott.emp table. Say we also need to know the average sal for each department. The query below implements the strategy above:
WITH	got_cnt		AS
(
	SELECT	deptno
	,	job	, COUNT (job)  OVER (PARTITION BY deptno, job)	AS job_cnt
	,	sal	, COUNT (sal)  OVER (PARTITION BY deptno, sal)	AS sal_cnt
	,	comm	, COUNT (comm) OVER (PARTITION BY deptno, comm)	AS comm_cnt
	FROM	scott.emp
--	WHERE	...		-- If you need any filtering, put it here
)
SELECT	  deptno
,	  AVG (sal)						AS avg_sal
,	  MIN (job)  KEEP (DENSE_RANK LAST ORDER BY job_cnt)	AS most_common_job
,	  MIN (sal)  KEEP (DENSE_RANK LAST ORDER BY sal_cnt)	AS most_common_sal
,	  MIN (comm) KEEP (DENSE_RANK LAST ORDER BY comm_cnt)	AS most_common_comm
FROM	  got_cnt
GROUP BY  deptno
ORDER BY  deptno
;
Output:
    DEPTNO   AVG_SAL MOST_COMM MOST_COMMON_SAL MOST_COMMON_COMM
---------- --------- --------- --------------- ----------------
        10   2916.67 CLERK                1300
        20   2175.00 ANALYST              3000
        30   1566.67 SALESMAN             1250                0
When there is a tie for the most common value, the contender with the lowest value is displayed: that's what MIN is doing above. For example, when deptno=30, no 2 rows have the same comm. Of the 6 rows in that group, 2 have NULL comm, and the others have values of 0, 300, 500 and 1400. The query above returns the lowest of those 4 values, that is, 0. Notice that the 2 NULLs do not count in this case; you would have to use NVL if you wanted to return NULL, since more people had NULL than any value.
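A sketch of that NVL idea (assuming -1 can never occur as a real comm value, so it can stand in for NULL; NULLIF maps the sentinel back to NULL at the end):
WITH got_cnt AS
(
    SELECT deptno
    ,      comm
    ,      COUNT (*) OVER (PARTITION BY deptno, NVL (comm, -1)) AS comm_cnt
    FROM   scott.emp
)
SELECT   deptno
,        NULLIF ( MIN (NVL (comm, -1)) KEEP (DENSE_RANK LAST ORDER BY comm_cnt)
                , -1 )                 AS most_common_comm
FROM     got_cnt
GROUP BY deptno
ORDER BY deptno;
With the sample data, deptno=30 would then return NULL for most_common_comm, since NULL occurs twice and every real comm value only once.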
777489
Thanks Frank Kulash for those hints.
I'm using: Oracle Database 11g Release 11.2.0.1.0
And now I'm trying to find out where I can find the scott examples and how to use them (as I read in the documentation, the scott schema is outdated and another one is used instead today; however, I can't find $ORACLE_HOME/rdbms/admin/scott.sql on my system) ....

If I "sometimes" made it to put the scott example into my db I will try to work with your suggestions ...

Regards

[edit]
Ok, I have taken the example table from http://www.oracle.com/technology/sample_code/tech/sql_plus/htdocs/demobld.html
Logically it does the thing I want, but I can't translate that into syntax for my task ..
Therefore I will prepare example data scripts ... but that will take some time - be patient please.

Edited by: Stede on 27.01.2011 01:45
777489
Okay .. I tried to cut down my stuff as much as possible. To sort the row values in time I use a little function that you need in order to process the example. It simply truncates the given timestamp down to a multiple of N seconds from midnight; so with secs=600, 2:34:33,123 becomes 2:30:00,00 as the result of this function.
CREATE OR REPLACE
FUNCTION fixsecs_down(ts TIMESTAMP, secs INTEGER) RETURN TIMESTAMP DETERMINISTIC AS
  s INTEGER;
  ts_temp TIMESTAMP;
BEGIN
  s := to_number(to_char(ts, 'SS')) + 60 * to_number(to_char(ts, 'MI')) + 3600 * to_number(to_char(ts, 'HH24'));
  s := secs*floor(s/secs);
  ts_temp := (trunc(ts) + s / (3600* 24));
  -- not the same as: RETURN (trunc(ts) + s / (3600* 24)); !!!
  RETURN ts_temp;
END;
/
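A quick sanity check of the function, using the example from above:
SELECT fixsecs_down (TO_TIMESTAMP ('07.01.2011 02:34:33,123', 'DD.MM.YYYY HH24:MI:SS,FF'), 600) AS rounded
FROM   dual;
-- expected: 07.01.2011 02:30:00,000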
Now comes the data table - I named it sd_meas and put some data into it. I cut out most of the columns, so only some important ones are left. But some are needed to make this example representative of my whole table. In addition, I wrote some comments into the CREATE TABLE statement - here we go:
CREATE TABLE sd_meas (
	id NUMBER(14), -- populated by a sequence
	box_id NUMBER(8), -- foreign key, e.g. like deptno in scott.emp
	schedule_id NUMBER(8), -- foreign key, e.g. like deptno in scott.emp
	acqtime TIMESTAMP(3), -- point in time where data is captured
	rpro VARCHAR2(1), -- can have 'A' / 'B' / 'C' / ... and some others
	rdmo VARCHAR2(4), -- can be 'drm_' / 'drm+' / 'mwa_'
	rfre NUMBER(10), -- receiving frequency
	rdbv NUMBER(10,7), -- a measurement value
	reslevel NUMBER(2), -- indicates whether to calculate this row or not
	CONSTRAINT sd_meas_pk PRIMARY KEY(id)
);

-- between 13:40 and 13:50 for box_id 777
INSERT INTO sd_meas VALUES(56214453,777,441,to_timestamp('07.01.2011 13:43:15,195','DD.MM.YYYY HH24:MI:SS,FF'),'A','drm_',27000000,36.3320313,1);
INSERT INTO sd_meas VALUES(56214455,777,441,to_timestamp('07.01.2011 13:44:15,895','DD.MM.YYYY HH24:MI:SS,FF'),'A','drm_',27000000,36.3476563,1);
INSERT INTO sd_meas VALUES(56214457,777,441,to_timestamp('07.01.2011 13:45:15,695','DD.MM.YYYY HH24:MI:SS,FF'),'A','drm_',27000000,36.3554688,1);
INSERT INTO sd_meas VALUES(56214459,777,441,to_timestamp('07.01.2011 13:46:15,612','DD.MM.YYYY HH24:MI:SS,FF'),'A','drm_',27000000,36.3632813,1);
INSERT INTO sd_meas VALUES(56214461,777,441,to_timestamp('07.01.2011 13:46:15,195','DD.MM.YYYY HH24:MI:SS,FF'),'A','drm_',27000000,36.3710938,1);

-- between 13:40 and 13:50 for box_id 888
INSERT INTO sd_meas VALUES(56214454,888,331,to_timestamp('07.01.2011 13:43:15,295','DD.MM.YYYY HH24:MI:SS,FF'),'B','drm+',27000000,36.3984375,null);
INSERT INTO sd_meas VALUES(56214456,888,331,to_timestamp('07.01.2011 13:44:15,995','DD.MM.YYYY HH24:MI:SS,FF'),'B','drm+',27000000,36.390625,null);
INSERT INTO sd_meas VALUES(56214458,888,331,to_timestamp('07.01.2011 13:45:15,795','DD.MM.YYYY HH24:MI:SS,FF'),'B','drm+',27000000,36.375,null);
INSERT INTO sd_meas VALUES(56214460,888,331,to_timestamp('07.01.2011 13:46:15,612','DD.MM.YYYY HH24:MI:SS,FF'),'B','drm+',27000000,36.3789063,null);
INSERT INTO sd_meas VALUES(56214462,888,331,to_timestamp('07.01.2011 13:46:15,295','DD.MM.YYYY HH24:MI:SS,FF'),'A','drm+',27000000,36.3867188,null);

-- between 13:50 and 14:00 for box_id 777
INSERT INTO sd_meas VALUES(56215420,777,441,to_timestamp('07.01.2011 13:53:15,195','DD.MM.YYYY HH24:MI:SS,FF'),'C','mwa_',27000000,36.53125,null);
INSERT INTO sd_meas VALUES(56215422,777,441,to_timestamp('07.01.2011 13:54:15,895','DD.MM.YYYY HH24:MI:SS,FF'),'B','drm_',27000000,36.5351563,null);
INSERT INTO sd_meas VALUES(56215425,777,441,to_timestamp('07.01.2011 13:55:15,695','DD.MM.YYYY HH24:MI:SS,FF'),'C','mwa_',27000000,36.5429688,null);
INSERT INTO sd_meas VALUES(56215430,777,441,to_timestamp('07.01.2011 13:56:15,612','DD.MM.YYYY HH24:MI:SS,FF'),'B','drm_',27000000,36.5234375,null);
INSERT INTO sd_meas VALUES(56215432,777,441,to_timestamp('07.01.2011 13:56:15,195','DD.MM.YYYY HH24:MI:SS,FF'),'C','mwa_',27000000,36.5,null);

-- between 13:50 and 14:00 for box_id 888
INSERT INTO sd_meas VALUES(56215421,888,331,to_timestamp('07.01.2011 13:53:15,295','DD.MM.YYYY HH24:MI:SS,FF'),'A','drm+',27000000,36.4960938,null);
INSERT INTO sd_meas VALUES(56215423,888,331,to_timestamp('07.01.2011 13:54:15,995','DD.MM.YYYY HH24:MI:SS,FF'),'A','drm+',27000000,36.4921875,null);
INSERT INTO sd_meas VALUES(56215424,888,331,to_timestamp('07.01.2011 13:55:15,795','DD.MM.YYYY HH24:MI:SS,FF'),'B','drm+',27000000,36.484375,null);
INSERT INTO sd_meas VALUES(56215428,888,331,to_timestamp('07.01.2011 13:56:15,612','DD.MM.YYYY HH24:MI:SS,FF'),'B','drm+',27000000,36.4765625,null);
INSERT INTO sd_meas VALUES(56215431,888,331,to_timestamp('07.01.2011 13:56:15,295','DD.MM.YYYY HH24:MI:SS,FF'),'B','drm+',27000000,36.4726563,null);
Before I show you the current implementation and its result, I will describe what happens. Each row represents various measurement values from one device, with a measurement setting for that device (indicated by schedule_id). Every device puts 3-10 rows per second into this table, and the system should handle up to 100 devices - right, that can be a lot of data to work with. Because no one needs every one of these rows, I put a job into dbms_scheduler which executes a procedure that summarizes the measurement data into statistical values over 10-minute pieces, separated by device (box_id) and settings (schedule_id). Because of the (possibly) large number of rows, it is important to keep an eye on performance.
Therefore the current query has one subquery which selects the rows I want to summarize into statistical data. That should make sure the statistics are computed only on the needed rows! What kind of statistical data I calculate is described by the comments in the query. The ROUND is used to make sure the numbers fit into another table and the size of its columns.
  SELECT
    box_id,
    schedule_id,
    fixsecs_down(MIN(acqtime),600) AS acqtime, -- time rounded down to 10-minute intervals
    'A' AS rpro, -- the most common rpro is needed at this place
    'drm_' AS rdmo, -- the most common rdmo is needed
    FLOOR(MEDIAN(rfre)) AS rfre, -- get the median, keep it as a natural number -> FLOOR should be ok and fast
    ROUND(AVG(rdbv),2) AS rdbv, -- take the average of rdbv
    ROUND(SUM(POWER(rdbv,2)),2) AS rdbv_s, -- sum the squares of rdbv
    ROUND(MAX(rdbv),2) AS rdbv_max, -- get the maximum value of rdbv
    ROUND(MIN(rdbv),2) AS rdbv_min, -- get the minimum value of rdbv
    null AS reslevel -- reslevel should be forced to null
  FROM
  -- subselect the affected data
  (
  SELECT 
    *
  FROM
      sd_meas
  WHERE
      --acqtime < fixsecs_down(to_timestamp('07-01-2011 14:00:00,00','dd-mm-yyyy hh24:mi:ss,ff'),600)
      --AND
      (reslevel IS NULL OR reslevel=10)
  )
  GROUP BY
    -- make 10-minute packages, separated for each schedule_id and box_id combination
    to_char( EXTRACT(MONTH FROM acqtime)*100000 + EXTRACT(DAY FROM acqtime)*1000 + EXTRACT(HOUR FROM acqtime)*10 + floor(EXTRACT(MINUTE FROM acqtime)/10) ),
    schedule_id,
    box_id
  ;
For this example, the columns where I need help are rpro and rdmo (as mentioned in the comments)!
The REAL result of this is (that's NOT the result I want, but what the query above produces):
box_id  schedule_id  acqtime                      rpro  rdmo  rfre      rdbv   rdbv_s   rdbv_max  rdbv_min  reslevel
888     331          07.01.11 13:40:00,000000000  A     drm_  27000000  36,39  6619,68  36,4      36,38
888     331          07.01.11 13:50:00,000000000  A     drm_  27000000  36,48  6655,55  36,5      36,47
777     441          07.01.11 13:50:00,000000000  A     drm_  27000000  36,53  6670,95  36,54     36,5
And if everything were implemented correctly, it should look like this:
box_id  schedule_id  acqtime                      rpro  rdmo  rfre      rdbv   rdbv_s   rdbv_max  rdbv_min  reslevel
888     331          07.01.11 13:40:00,000000000  B     drm+  27000000  36,39  6619,68  36,4      36,38
888     331          07.01.11 13:50:00,000000000  B     drm+  27000000  36,48  6655,55  36,5      36,47
777     441          07.01.11 13:50:00,000000000  C     mwa_  27000000  36,53  6670,95  36,54     36,5
It looks as if rpro and rdmo always change at the same time - but that is just an artifact of this example.
If you finished your testing, simply drop the function and the table:
DROP TABLE sd_meas;
DROP FUNCTION fixsecs_down;
I hope this is enough data/information for anyone to easily understand what's happening here, but little enough to keep it clear and "simple" ;-)
If I forgot something, please just ask!
And as demanded: I'm using Oracle Database 11g Release 11.2.0.1.0 Standard Edition One

Regards + Thanks!
Frank Kulash
Hi,

That's a nice, clear example. The comments help a lot. Thanks.

The method I suggested about this time on Monday should work nicely: for each variable for which you need to show the most common value, compute the COUNT of each value in a sub-query, using the analytic COUNT function. The PARTITION BY clause will include the column itself, plus all the columns that you will ultimately be GROUPing BY. Do not use an ORDER BY clause in the analytic functions.
In the main query, do the GROUP BY as you normally would. Use MIN ... KEEP (DENSE_RANK LAST ...) to get the most common values for all the columns that you COUNTed in the sub-query. MIN here means that, in case of a tie, the lowest value from among the contenders is returned.
WITH	got_cnts	AS
(
	SELECT	sd_meas.*	-- or list columns needed
	,	fixsecs_down (acqtime, 600)	        AS rounded_acqtime
	,	COUNT (*) OVER ( PARTITION BY  box_id
		      	       	 ,	       schedule_id
				 ,	       fixsecs_down (acqtime, 600)
				 ,	       rpro
			       )                        AS rpro_cnt
	,	COUNT (*) OVER ( PARTITION BY  box_id
		      	       	 ,	       schedule_id
				 ,	       fixsecs_down (acqtime, 600)
				 ,	       rdmo
			       )                        AS rdmo_cnt
	FROM	sd_meas
     	WHERE   NVL (reslevel, 10)	= 10
--      AND     acqtime < fixsecs_down (TO_TIMESTAMP ('07-01-2011 14:00:00,00', 'dd-mm-yyyy hh24:mi:ss,ff'), 600)
)
SELECT    box_id
,	  schedule_id
,    	  rounded_acqtime
,    	  MIN (rpro) KEEP (DENSE_RANK LAST ORDER BY rpro_cnt)
					   	AS rpro		-- the most common rpro
,    	  MIN (rdmo) KEEP (DENSE_RANK LAST ORDER BY rdmo_cnt)
					   	AS rdmo		-- the most common rdmo
,    	  FLOOR (MEDIAN (rfre))			AS rfre 	-- get the median, keep it as a natural number -> FLOOR should be ok and fast
,    	  ROUND (AVG (rdbv), 2) 		AS rdbv 	-- take the average of rdbv
,    	  ROUND ( SUM (POWER (rdbv, 2))
	  	, 2
		)				AS rdbv_s	-- sum the squares of rdbv
,    	  ROUND (MAX( rdbv), 2)			AS rdbv_max 	-- get the maximum value of rdbv
,    	  ROUND (MIN (rdbv), 2) 		AS rdbv_min 	-- get the minimum value of rdbv
,    	  null 		      			AS reslevel 	-- reslevel should be forced to null
FROM	  got_cnts
GROUP BY  rounded_acqtime
,    	  schedule_id
,    	  box_id
;
Output:
BOX  SCHEDULE                                                     RDBV   RDBV
 _ID      _ID ROUNDED_ACQTIME  R RDMO      RFRE   RDBV   RDBV_S   _MAX   _MIN R
---- -------- ---------------- - ---- --------- ------ -------- ------ ------ -
 777      441 07.01.2011 13:50 C mwa_  27000000  36.53  6670.95  36.54  36.50
 888      331 07.01.2011 13:40 B drm+  27000000  36.39  6619.68  36.40  36.38
 888      331 07.01.2011 13:50 B drm+  27000000  36.48  6655.55  36.50  36.47
You'll probably want an ORDER BY clause in the main query.

I called fixsecs_down with the same arguments 3 times in the sub-query. You might find it a little cleaner to add another sub-query before got_cnts, just to call it once, and then use that value 3 times in got_cnts.
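That might look something like this (a sketch; got_time is a made-up name, and only the mode columns are shown):
WITH got_time AS
(
    SELECT s.*
    ,      fixsecs_down (acqtime, 600)  AS rounded_acqtime
    FROM   sd_meas  s
    WHERE  NVL (reslevel, 10) = 10
)
, got_cnts AS
(
    SELECT got_time.*
    ,      COUNT (*) OVER (PARTITION BY box_id, schedule_id, rounded_acqtime, rpro) AS rpro_cnt
    ,      COUNT (*) OVER (PARTITION BY box_id, schedule_id, rounded_acqtime, rdmo) AS rdmo_cnt
    FROM   got_time
)
SELECT   box_id
,        schedule_id
,        rounded_acqtime
,        MIN (rpro) KEEP (DENSE_RANK LAST ORDER BY rpro_cnt)  AS rpro
,        MIN (rdmo) KEEP (DENSE_RANK LAST ORDER BY rdmo_cnt)  AS rdmo
FROM     got_cnts
GROUP BY rounded_acqtime, schedule_id, box_id;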

Since you have that function, you might as well use the value it returns in the GROUP BY clause, rather than doing a cumbersome process using EXTRACT.

Most of the other changes I made to your query were superficial stylistic things. For example, I used NVL instead of OR to filter for reslevel in the sub-query. There's nothing wrong with using OR; I just find NVL a little clearer, but I'm not going to maintain this code, so you should use whatever suits you best.
777489
Hi Frank,

I'm done for today. Thanks for your help. I will check your solution tomorrow - and see how slowly/quickly it runs. An additional ORDER BY is not necessary because the result is inserted directly into another table - the data will be sorted when another query selects it from there.
Unfortunately I can't use my function in the GROUP BY clause because Oracle doesn't accept it as a groupable expression - that's why I use the EXTRACT there.
Furthermore, I already tested the NVL on reslevel and found out (via autotrace) that it doesn't change the execution time dramatically. Therefore I'll leave it as is, so that in the future I can add other reslevel values I want to include. Referring to this, I should mention that I didn't post the indexes I created on that table, for performance reasons ;)
On my complete test table the query so far takes about 5 seconds ... I'm curious what it will take with your solution. But I will test that tomorrow :/

Thanks - and I will report the result to you tomorrow (and then mark your answer as "correct" if it works).

Regards.
Frank Kulash
Stede wrote:
Hi Frank,

I'm done for today. Thanks for your help. I will check your solution tomorrow - and see how slowly/quickly it runs. An additional ORDER BY is not necessary because the result is inserted directly into another table - the data will be sorted when another query selects it from there.
I see, in that case, you definitely don't want an ORDER BY clause when you move this to production. If you're testing this without INSERTing, then it's good to see the output in the same order every time, but if you're actually INSERTing, then you should only use ORDER BY when you query that table.
Unfortunately I can't use my function in the GROUP BY clause because Oracle doesn't accept it as a groupable expression - that's why I use the EXTRACT there.
Why not? I used it that way, in Oracle 11.2.0.1.0 Enterprise Edition. I can't believe the Standard and Enterprise editions differ on that.
Post the error message that you get, and the query that caused the error.
Furthermore, I already tested the NVL on reslevel and found out (via autotrace) that it doesn't change the execution time dramatically.
No, I wouldn't expect there to be a significant difference in execution time one way or the other. I prefer NVL because I find the code easier to understand, debug and maintain. There's nothing wrong with using OR if you prefer it.
777489
>
Why not? I used it that way, in Oracle 11.2.0.1.0 Enterprise Edition. I can't believe the Standard and Enterprise editions differ on that.
Post the error message that you get, and the query that caused the error.
>

Here is the alternative GROUP BY part, which doesn't work:
GROUP BY
  --to_char( EXTRACT(MONTH FROM acqtime)*100000 + EXTRACT(DAY FROM acqtime)*1000 + EXTRACT(HOUR FROM acqtime)*10 + floor(EXTRACT(MINUTE FROM acqtime)/10) ),
  fixsecs_down(MIN(acqtime),600),
  schedule_id,
  box_id
HAVING
  SUM(nframes)>15
;
It returns:
ORA-00934: "group function is not allowed here"
I will now take care of the implementation from yesterday :) ...

Edited by: Stede on 27.01.2011 23:52
Aketi Jyuuzou

I say again: sometimes, please remember the STATS_MODE function ;-)
col modeVal for a10

with t(id,val,title) as(
select 1,12,'c' from dual union
select 2,13,'b' from dual union
select 3,13,'a' from dual union
select 4,13,'a' from dual union
select 5,42,'a' from dual union
select 6,42,'b' from dual union
select 7,42,'b' from dual)
select val,stats_mode(title) as modeVal
  from t
group by val
order by val;

val  modeVal
---  -------
 12  c
 13  a
 42  b
munky
>
sometimes,Please remember stats_mode function
>
Thank you, Aketi - I've learnt my new thing for the day now! :)
Newbie
Please let me know more about that STATS_MODE function.

Thanks in advance
Frank Kulash
Hi,

You've probably figured out how to use STATS_MODE by now, but in case you haven't gotten to that yet:
SELECT    box_id
,	  schedule_id
,	  fixsecs_down (acqtime, 600)	        AS rounded_acqtime
,    	  STATS_MODE (rpro)		   	AS rpro		-- the most common rpro
,    	  STATS_MODE (rdmo) 			AS rdmo		-- the most common rdmo
,    	  FLOOR (MEDIAN (rfre))			AS rfre 	-- get the median, keep it as a natural number -> FLOOR should be ok and fast
,    	  ROUND (AVG (rdbv), 2) 		AS rdbv 	-- take the average of rdbv
,    	  ROUND ( SUM (POWER (rdbv, 2))
	  	, 2
		)				AS rdbv_s	-- sum the squares of rdbv
,    	  ROUND (MAX( rdbv), 2)			AS rdbv_max 	-- get the maximum value of rdbv
,    	  ROUND (MIN (rdbv), 2) 		AS rdbv_min 	-- get the minimum value of rdbv
,    	  null 		      			AS reslevel 	-- reslevel should be forced to null
FROM	  sd_meas
WHERE     NVL (reslevel, 10)	= 10
--AND     acqtime < fixsecs_down (TO_TIMESTAMP ('07-01-2011 14:00:00,00', 'dd-mm-yyyy hh24:mi:ss,ff'), 600)
GROUP BY  fixsecs_down (acqtime, 600)
,    	  schedule_id
,    	  box_id
;
I wish I had known about this (or that Aketi had replied to this) on Monday.
Unlike most of the other aggregate functions (AVG, MIN and SUM, for example), there is no analytic version of STATS_MODE, or of any of the other aggregate functions named STATS_something. Curious.
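If an analytic version is ever needed, a stand-in can be built from the analytic COUNT plus FIRST_VALUE (a sketch against the a1 table; mode_title is a made-up alias, and the extra ORDER BY title breaks ties in favour of the lowest title, as in the earlier solutions):
SELECT id, val, title
,      FIRST_VALUE (title) OVER ( PARTITION BY val
                                  ORDER BY     cnt DESC, title
                                ) AS mode_title
FROM  (SELECT a1.*
       ,      COUNT (*) OVER (PARTITION BY val, title) AS cnt
       FROM   a1);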
Stede wrote:
... Here is the alternative GROUP BY part, which doesn't work:
GROUP BY
--to_char( EXTRACT(MONTH FROM acqtime)*100000 + EXTRACT(DAY FROM acqtime)*1000 + EXTRACT(HOUR FROM acqtime)*10 + floor(EXTRACT(MINUTE FROM acqtime)/10) ),
fixsecs_down(MIN(acqtime),600),
schedule_id,
box_id
HAVING
SUM(nframes)>15
;
It returns:
ORA-00934: "group function is not allowed here"
The group function about which it's complaining is MIN, not fixsecs_down. MIN is something you compute for each group; that means the groups must be defined before you can compute MIN, and therefore MIN can't be used in defining a group. Does the code I posted (including GROUP BY fixsecs_down) work for you with the sample data? The same kind of GROUP BY will work if you apply it to your real query, with your real data.
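In other words, the fix should just be to drop the MIN and group by the raw column:
GROUP BY
  fixsecs_down(acqtime, 600),  -- no aggregate here; MIN(acqtime) belongs in the SELECT list
  schedule_id,
  box_id
HAVING
  SUM(nframes) > 15
;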
777489
Hi Frank,

the solution you posted after my last example does the job. However, while trying that query I noticed it increases the execution time dramatically. So I thought about the logic of the data in the table and came to the conclusion to calculate the most common occurrence for only 3 columns. Why? Because 2 of those columns are independent of the other columns, but one is "linked" to the values of the other columns (I know it's bad DB design; it is not normalized here :/ ). So I will try to calculate one most common occurrence, get the id of that row, and take the values of the other columns from the same row.
To explain with the given example: let's assume that rpro and rdmo are "linked" together, where rpro is the major value. That means rdmo can only change its value if rpro does! So I'll try to select the id (any of them) where rpro has its most common value, to be able to take rdmo from the same row. But is this kind of selection possible at all? I hope to decrease the execution time with this.

Result of the work so far:
If I execute my old query (without the calculation of the most common occurrences) on my table with realistic production data, it takes around 7 seconds:
Plan hash value: 2326700091
 
-------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name                 | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                      |  1426 |   185K|  4627   (1)| 00:00:56 |
|*  1 |  FILTER                        |                      |       |       |            |          |
|   2 |   SORT GROUP BY                |                      |  1426 |   185K|            |          |
|   3 |    CONCATENATION               |                      |       |       |            |          |
|   4 |     TABLE ACCESS BY INDEX ROWID| RFM_MEAS_HR          |   485 | 64505 |    76   (0)| 00:00:01 |
|*  5 |      INDEX RANGE SCAN          | RFM_MEAS_HR_RESL_ACQ |    87 |       |     3   (0)| 00:00:01 |
|   6 |     TABLE ACCESS BY INDEX ROWID| RFM_MEAS_HR          | 27630 |  3588K|  4230   (1)| 00:00:51 |
|*  7 |      INDEX RANGE SCAN          | RFM_MEAS_HR_RESL_ACQ |  5045 |       |    20   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter(SUM("RFM_MEAS_HR"."NFRAMES")>15)
   5 - access("RESLEVEL"=10 AND "ACQTIME"<"FIXSECS_DOWN"(TIMESTAMP' 2011-01-07 
              14:00:00,000000000',600))
   7 - access("RESLEVEL" IS NULL AND "ACQTIME"<"FIXSECS_DOWN"(TIMESTAMP' 2011-01-07 
              14:00:00,000000000',600))
       filter("ACQTIME"<"FIXSECS_DOWN"(TIMESTAMP' 2011-01-07 14:00:00,000000000',600) AND 
              LNNVL("RESLEVEL"=10))

   Statistics
-----------------------------------------------------------
               4  user calls
To compare, I executed the new query, where I calculate the most common occurrence for one column only; it takes ~20-40 seconds of execution time (depending on HD activity on the server - read operations), which makes it difficult to use in production:
Plan hash value: 2800370525
 
----------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name            | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                 |  1263 |   616K|       | 11419   (1)| 00:02:18 |
|   1 |  SORT GROUP BY                 |                 |  1263 |   616K|    17M| 11419   (1)| 00:02:18 |
|   2 |   VIEW                         |                 | 28514 |    13M|       | 10267   (1)| 00:02:04 |
|   3 |    WINDOW SORT                 |                 | 28514 |  3759K|  6176K| 10267   (1)| 00:02:04 |
|*  4 |     TABLE ACCESS BY INDEX ROWID| RFM_MEAS_HR     | 28514 |  3759K|       |  9405   (1)| 00:01:53 |
|*  5 |      INDEX RANGE SCAN          | RFM_MEAS_HR_ACQ | 11165 |       |       |    38   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   4 - filter(NVL("RESLEVEL",10)=10)
   5 - access("ACQTIME"<"FIXSECS_DOWN"(TIMESTAMP' 2011-01-07 14:00:00,000000000',600))
   Statistics
-----------------------------------------------------------
               4  user calls
I see that the optimizer picks different indexes - dunno why, but I trust that it is the best choice (or should I test it by giving hints?). Now I want to compare the EP when two columns are calculated for the most common occurrence - the execution time doesn't increase here in any way I can notice:
Plan hash value: 4086348800
 
-----------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name            | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |                 |  1263 |   637K|       | 12374   (1)| 00:02:29 |
|   1 |  SORT GROUP BY                  |                 |  1263 |   637K|    17M| 12374   (1)| 00:02:29 |
|   2 |   VIEW                          |                 | 28514 |    14M|       | 11183   (1)| 00:02:15 |
|   3 |    WINDOW SORT                  |                 | 28514 |  3898K|  6528K| 11183   (1)| 00:02:15 |
|   4 |     WINDOW SORT                 |                 | 28514 |  3898K|  6528K| 11183   (1)| 00:02:15 |
|*  5 |      TABLE ACCESS BY INDEX ROWID| RFM_MEAS_HR     | 28514 |  3898K|       |  9405   (1)| 00:01:53 |
|*  6 |       INDEX RANGE SCAN          | RFM_MEAS_HR_ACQ | 11165 |       |       |    38   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   5 - filter(NVL("RESLEVEL",10)=10)
   6 - access("ACQTIME"<"FIXSECS_DOWN"(TIMESTAMP' 2011-01-07 14:00:00,000000000',600))

   Statistics
-----------------------------------------------------------
               4  user calls
If I read the EPs correctly (and I don't know that I do?!), more columns for which the most common occurrence is calculated only increase the CPU cost of the "view" with the several "window sorts" ... assuming one window sort for each calculated most-common value. Maybe I'll put in all possible columns later today and can post the resulting EP. But as there are ~33 such columns, I would guess it costs a lot of CPU due to the 33 window sorts I would expect.

So long ... Thanks.
Regards.
Frank Kulash
Hi,
Stede wrote:
Hi Frank,

the solution you posted after my last example does the job. However, while trying that query I noticed it increases the execution time dramatically.
You are asking it to do something considerably more complicated, so an increase in execution time isn't unexpected.
So I thought about the logic of the data in the table and came to the conclusion to calculate the most common occurrence for only 3 columns. Why? Because 2 of those columns are independent of the other columns, but one is "linked" to the values of the other columns (I know it's bad DB design; it is not normalized here :/ ). So I will try to calculate one most common occurrence, get the id of that row, and take the values of the other columns from the same row.
I'm not sure which id you're talking about.
To explain with the given example: let's assume that rpro and rdmo are "linked" together, where rpro is the major value. That means rdmo can only change its value if rpro does! So I'll try to select the id (any of them) where rpro has its most common value, to be able to take rdmo from the same row. But is this kind of selection possible at all?
Are you saying that you want to keep the id of some row that has the most common rpro, and then find the rdmo on the row with that id? Of course, every row will have a different id, but I think you're saying that it won't matter which id you pick, because they will all have the same rdmo.
Yes, you can do that, but instead of saving an id that was related to the most common rpro, you might as well just save the rdmo that was related to it.

I think you can do something like this:
SELECT    box_id
,	  schedule_id
,	  fixsecs_down (acqtime, 600)	        AS rounded_acqtime
,    	  SUBSTR ( STATS_MODE (rpro || rdmo)
	  	 , 1
		 , 1
		 )				AS rpro		-- the most common rpro
,    	  SUBSTR ( STATS_MODE (rpro || rdmo)
	  	 , 2
		 )				AS rdmo		-- the rdmo that goes with the most common rpro
,    	  FLOOR (MEDIAN (rfre))			AS rfre 	-- get the median, keep it as a natural number -> FLOOR should be ok and fast
,    	  ROUND (AVG (rdbv), 2) 		AS rdbv 	-- take the average of rdbv
,    	  ROUND ( SUM (POWER (rdbv, 2))
	  	, 2
		)				AS rdbv_s	-- sum the squares of rdbv
,    	  ROUND (MAX( rdbv), 2)			AS rdbv_max 	-- get the maximum value of rdbv
,    	  ROUND (MIN (rdbv), 2) 		AS rdbv_min 	-- get the minimum value of rdbv
,    	  null 		      			AS reslevel 	-- reslevel should be forced to null
FROM	  sd_meas
WHERE     NVL (reslevel, 10)	= 10
--AND     acqtime < fixsecs_down (TO_TIMESTAMP ('07-01-2011 14:00:00,00', 'dd-mm-yyyy hh24:mi:ss,ff'), 600)
GROUP BY  fixsecs_down (acqtime, 600)
,    	  schedule_id
,    	  box_id
;
Even though the query above is calling STATS_MODE twice, it looks like the optimizer is smart enough to realize both calls have the same argument, and only call it once. I did some experiments with a 25,000 row table on my system, and noticed no difference in execution time or plan depending on whether I called STATS_MODE once or twice (inside different SUBSTR calls, as above). If I changed the STATS_MODE calls so that they were not identical, then the execution time did increase significantly (but the execution plan remained the same).
If you wanted to be sure that STATS_MODE was only being called once, then you could do the GROUP BY (including STATS_MODE) in a sub-query, and use the value it returned as often as you needed to in the main query, each in a different SUBSTR call. I tried that, and it didn't change the execution speed significantly, and it didn't change the execution plan at all.
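That variation might look like this (a sketch; got_mode is a made-up name, and the other aggregate columns are left out for brevity):
WITH got_mode AS
(
    SELECT   box_id
    ,        schedule_id
    ,        fixsecs_down (acqtime, 600)  AS rounded_acqtime
    ,        STATS_MODE (rpro || rdmo)    AS mode_pair
    FROM     sd_meas
    WHERE    NVL (reslevel, 10) = 10
    GROUP BY fixsecs_down (acqtime, 600), schedule_id, box_id
)
SELECT box_id
,      schedule_id
,      rounded_acqtime
,      SUBSTR (mode_pair, 1, 1)  AS rpro
,      SUBSTR (mode_pair, 2)     AS rdmo
FROM   got_mode;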

This assumes that all rows with the same rpro will have the same rdmo. It does not assume the converse: rows with the same rdmo may have different values for rpro.

You can add as many more dependent variables as you want. If they are not always as long as the declared length (for example, if you sometimes have rdmo values that are shorter than 4 characters), then you may have to use RPAD or LPAD when concatenating the values, and RTRIM or LTRIM when retrieving them.
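For example, if rdmo values could be shorter than 4 characters (they are not in the sample data, so this is purely illustrative):
SELECT   box_id
,        schedule_id
,        fixsecs_down (acqtime, 600)                               AS rounded_acqtime
,        SUBSTR (STATS_MODE (rpro || RPAD (rdmo, 4)), 1, 1)        AS rpro
,        RTRIM (SUBSTR (STATS_MODE (rpro || RPAD (rdmo, 4)), 2))   AS rdmo
FROM     sd_meas
WHERE    NVL (reslevel, 10) = 10
GROUP BY fixsecs_down (acqtime, 600), schedule_id, box_id;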

Edited by: Frank Kulash on Jan 31, 2011 12:52 PM

Post Details

Locked on Feb 28 2011
Added on Jan 14 2011
26 comments
16,600 views