
SQL & PL/SQL


Which one is better Distinct or Group By?

655429 May 28 2009 — edited May 29 2009
Which one is better performance-wise?

Select A.aa,B.bb,C.cc
From A,B,C
Where A.aa = B.aa and A.cc = C.cc
Group By A.aa,B.bb,C.cc

OR

Select DISTINCT A.aa,B.bb,C.cc
From A,B,C
Where A.aa = B.aa and A.cc = C.cc

Comments

Hoek May 28 2009
Hi,

It depends (on your data, for instance)...

Type:
SQL> SET AUTOTRACE TRACEONLY
SQL> --and run your queries....

Post the results in a readable format, using the code tags.
655429 May 28 2009
@hoek
I'm getting this error when I set autotrace on

SP2-0618: Cannot find the Session Identifier. Check PLUSTRACE role is enabled
SP2-0611: Error enabling STATISTICS report
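For reference, the autotrace statistics report needs the PLUSTRACE role; a DBA can create and grant it with something like the following sketch (the username is a placeholder, and the script path assumes a standard Oracle installation where `?` expands to ORACLE_HOME in SQL*Plus):

```sql
-- Run as SYS; plustrce.sql creates the PLUSTRACE role:
@?/sqlplus/admin/plustrce.sql

-- Grant the role to the querying account (some_user is hypothetical):
GRANT PLUSTRACE TO some_user;

-- Then, connected as some_user, autotrace can gather statistics:
SET AUTOTRACE TRACEONLY
```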
Centinul May 28 2009
I personally wouldn't consider performance, but what is semantically correct. DISTINCT implies you want a distinct set of columns. However, GROUP BY implies you want to compute some sort of aggregate value, which you are not doing here.
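Centinul's point can be sketched with a pair of queries (the table and column names here are hypothetical, not from the OP's schema):

```sql
-- DISTINCT says: "de-duplicate this result set"
SELECT DISTINCT dept_id
FROM   employees;

-- GROUP BY says: "compute an aggregate per group"
SELECT   dept_id, COUNT(*) AS headcount
FROM     employees
GROUP BY dept_id;
```

Since the OP's query computes no aggregate, DISTINCT states the intent more directly.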
696786 May 28 2009
Both provide different functionality, so it's unfair to compare them.

If you don't want to perform any aggregate functions, don't use GROUP BY.
It will take more time in your case.
Hope I am clear.
And please mark helpful or correct if you find any of it helpful.


Regards
Shreyans
Hoek May 28 2009
Hi,

Please inform your DBA that you need the proper privileges/roles, so you can test your code and its performance in a decent and proper manner and do your job like you ought to.
You need to be able to see how your query performs; explain plans and statistics provide vital information, as do tracing and tkprof.

( And if he thinks you're not allowed to do that, ask him to measure the performance of your query at least 50 times a day, so you'll get your privileges/roles that way ;) )
655429 May 28 2009
As a matter of fact I'm not using any aggregate functions. But the query returns a large amount of data from three tables.
BluShadow May 28 2009
shreyans wrote:
If you don't want to perform any aggregate functions, don't use GROUP BY.
It will take more time in your case.
Would you care to demonstrate with an example how it will take more time for the given case?
696786 May 28 2009
you seem to be too much free
why don't you devote your time to solving people's queries rather than poking your nose into my matters

And yes, I will care to do that with an example for you, but for that raise a new thread.
Hope things are clear for you
548849 May 28 2009
When you say something and it is not clear to others, then you should give an example. That's what Blu was asking for.

If you don't have time, then don't waste your precious time on the forum helping others.

Look at the solutions Blu and others have provided before replying.
BluShadow May 28 2009
shreyans wrote:
you seem to be too much free
why don't you devote your time to solving people's queries rather than poking your nose into my matters

And yes, I will care to do that with an example for you, but for that raise a new thread.
Hope things are clear for you
It's amazing how immature and unprofessional some of the people on here are getting.

You come out with a blatant statement that "It will take more time in your case." for the OP's requirement, and yet you provide no evidence of such.

In the OP's case I believe that you will find little or no difference performance-wise.

An example using a large table, selecting distinct values using DISTINCT and also GROUP BY (without aggregate functions):
SQL> select distinct line from largetable;

#data cut#

3023 rows selected.

Elapsed: 00:00:02.09

Execution Plan
----------------------------------------------------------
Plan hash value: 4286450035

--------------------------------------------------------------------------------------
| Id  | Operation             | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |              |  3023 | 12092 |  4978  (27)| 00:01:00 |
|   1 |  HASH UNIQUE          |              |  3023 | 12092 |  4978  (27)| 00:01:00 |
|   2 |   INDEX FAST FULL SCAN| LARGETABLE_PK|  6637K|    25M|  3812   (5)| 00:00:46 |
--------------------------------------------------------------------------------------


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
      16752  consistent gets
          0  physical reads
          0  redo size
      41610  bytes sent via SQL*Net to client
       2607  bytes received via SQL*Net from client
        203  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       3023  rows processed


SQL> select line from largetable group by line;

#data cut#

3023 rows selected.

Elapsed: 00:00:02.12

Execution Plan
----------------------------------------------------------
Plan hash value: 41397859

--------------------------------------------------------------------------------------
| Id  | Operation             | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |              |  3023 | 12092 |  4978  (27)| 00:01:00 |
|   1 |  HASH GROUP BY        |              |  3023 | 12092 |  4978  (27)| 00:01:00 |
|   2 |   INDEX FAST FULL SCAN| LARGETABLE_PK|  6637K|    25M|  3812   (5)| 00:00:46 |
--------------------------------------------------------------------------------------


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
      16752  consistent gets
          0  physical reads
          0  redo size
      41610  bytes sent via SQL*Net to client
       2607  bytes received via SQL*Net from client
        203  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       3023  rows processed

SQL>
Now please prove that such a simple query as the OP's will have a performance problem if he uses the group by instead of distinct.
696786 May 28 2009
*"When you say something and it is not clear to others, then you should give an example. That's what Blu was asking for."*

You seem to know blu too well, why don't you yourself answer his query?

*"If you don't have time, then don't waste your precious time on the forum helping others."*

Since this is a voluntary forum, you are no one to give me suggestions about when I should help and when not

*"Look at the solutions Blu and others have provided before replying"*

So you want me to do what you seem to have been doing in this forum for the last 2 years.


And my comments on blu were not regarding this post only, so if you don't know that, why do you get in between..
Boneist May 28 2009
BluShadow wrote:
shreyans wrote:
you seem to be too much free
why don't you devote your time to solving people's queries rather than poking your nose into my matters

And yes, I will care to do that with an example for you, but for that raise a new thread.
Hope things are clear for you
It's amazing how immature and unprofessional some of the people on here are getting.
But I appreciated the irony in shreyans' statement *{;-)

Anyway, you're too much free. You should charge! *{;-D
696786 May 28 2009
select distinct line from largetable;

#data cut#

3023 rows selected.

Elapsed: 00:00:02.09


select line from largetable group by line;

#data cut#

3023 rows selected.

Elapsed: 00:00:02.12


Look at your solution and statistics
It is self-explanatory
Hence it is proved..

Well mark helpful or correct
Centinul May 28 2009
shreyans wrote:
select distinct line from largetable;

Elapsed: 00:00:02.09


select line from largetable group by line;

Elapsed: 00:00:02.12
I am not sure what you are implying here, but a less than 1.5% swing in elapsed time may not be from the added GROUP BY clause. It could be because either the DB or the OS was doing something else at the time the second query was run, causing it to take slightly more time.

I would say the key here is that the consistent gets are the same, as well as the execution plan. So for all intents and purposes these both perform equally.
Hoek May 28 2009
Well, I don't agree with you at all here, since the PLANS and STATS are 100% identical.
Hence it is proved that you are wrong.
BluShadow May 28 2009
shreyans wrote:
select distinct line from largetable;

#data cut#

3023 rows selected.

Elapsed: 00:00:02.09


select line from largetable group by line;

#data cut#

3023 rows selected.

Elapsed: 00:00:02.12


Look at your solution and statistics
It is self-explanatory
Hence it is proved..

Well mark helpful or correct
Shame we can't mark you unhelpful or incorrect.

As well as not proving that your statement was correct, you would also struggle to prove that an identical query run multiple times would always produce the same elapsed time. Three hundredths of a second is not a performance difference, especially as those queries were operating on over 6.5 million rows of data.

Feel free to prove otherwise. I'm perfectly open to accept I am wrong if you can show me sufficient evidence.
BluShadow May 28 2009
hoek wrote:
Well, I don't agree with you at all here, since the PLANS and STATS are 100% identical.
Hence it is proved that you are wrong.
Almost 100% ;)

One was a HASH UNIQUE and the other a HASH GROUP BY, but essentially they were doing the same thing, as there were no aggregate functions.
696786 May 28 2009
I can't do that since you have your own definition of performance

"So enjoy and yes anything not marked helpful is unhelpful. (Opposite of helpful)" You didn't know that. SHAME
Centinul May 28 2009
shreyans wrote:
I can't do that since you have your own definition of performance
<my opinion>
I would venture a guess that Blushadow's measure of performance is the same as mine, as well as that of many other respected forum members here. That measure is typically: fewer consistent gets is better.

Time is relative and hard to use as a benchmark. Typically, if one reduces the number of consistent gets, the time to execute will reduce as well.
<end my opinion>

What is your measure of performance?
Hoek May 28 2009
Well, OK, almost ;)

But a 'full 100%' if I look at columns: Rows | Bytes | Cost (%CPU)| Time
and that says it all imo: how much work the database has done.

Values for Elapsed can vary due to network load/speed etc...it's a client thing, not a database thing.
Boneist May 28 2009
shreyans wrote:
"So enjoy and yes anything not marked helpful is unhelpful. (Opposite of helpful)" You didn't know that. SHAME
LOL!

Not so; with only 5 "helpful" marks available to be allocated for a question, does that mean the 6th helpful answer was in fact not helpful, simply because it couldn't be marked as helpful?

Deary me, you don't seem to have grasped very well how this forum runs; the points, whilst they are an attempt to guide other users to the most helpful / correct answers, are not foolproof. Not everyone marks helpful / correct responses, for instance. So, I like your naive optimism, but dislike your bombastic attitude.

You won't get many points from anyone for being arrogant and demanding points, or by lambasting well respected and prominent members of the forum.
BluShadow May 28 2009
Centinul wrote:
<my opinion>
I would venture a guess that Blushadow's measure of performance is the same as mine, as well as that of many other respected forum members here. That measure is typically: fewer consistent gets is better.

Time is relative and hard to use as a benchmark. Typically, if one reduces the number of consistent gets, the time to execute will reduce as well.
<end my opinion>
Yep, that's pretty much my measure, and I know it to be most other people's too.
SomeoneElse May 28 2009
And here's an example that shows GROUP BY is slightly faster. I ran both statements several times and posted only the last two. This is on a table with 7 million rows:
SQL> set autotrace traceonly

SQL> select freight_terms
  2  from   all_po_orders
  3  group  by freight_terms
  4  ;

29 rows selected.

Elapsed: 00:00:51.50

Execution Plan
----------------------------------------------------------
Plan hash value: 1003577189

------------------------------------------------------------------------------------
| Id  | Operation          | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |               |    26 |   338 | 11418   (4)| 00:02:49 |
|   1 |  HASH GROUP BY     |               |    26 |   338 | 11418   (4)| 00:02:49 |
|   2 |   TABLE ACCESS FULL| ALL_PO_ORDERS |  7303K|    90M| 11186   (2)| 00:02:46 |
------------------------------------------------------------------------------------


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
     132871  consistent gets
     132472  physical reads
          0  redo size
       1024  bytes sent via SQL*Net to client
        357  bytes received via SQL*Net from client
          3  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         29  rows processed

SQL>
SQL>
SQL> select distinct freight_terms
  2  from   all_po_orders
  3  ;

29 rows selected.

Elapsed: 00:00:52.72

Execution Plan
----------------------------------------------------------
Plan hash value: 1529717229

------------------------------------------------------------------------------------
| Id  | Operation          | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |               |    26 |   338 | 11418   (4)| 00:02:49 |
|   1 |  HASH UNIQUE       |               |    26 |   338 | 11418   (4)| 00:02:49 |
|   2 |   TABLE ACCESS FULL| ALL_PO_ORDERS |  7303K|    90M| 11186   (2)| 00:02:46 |
------------------------------------------------------------------------------------


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
     132871  consistent gets
     132797  physical reads
          0  redo size
       1024  bytes sent via SQL*Net to client
        357  bytes received via SQL*Net from client
          3  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         29  rows processed
But this wasn't conclusive. Sometimes the GROUP BY was slightly faster, sometimes the DISTINCT was slightly faster.
Timur Akhmadeev May 28 2009
SomeoneElse wrote:
And here's an example that shows GROUP BY is slightly faster.
It shows only a difference in the number of physical reads, which can easily explain the difference in response time. It shows nothing about differences between GROUP BY/DISTINCT.

IMO, the most reasonable way to get the answer is to ask Oracle Support about it. Here is why: we have four possible row source operations for implementing requirements via GROUP BY or DISTINCT (assuming we have no appropriate index that lets us skip the sort):

1) SORT GROUP BY - for GROUP BY in releases before 10g, or in 10g onwards with hash aggregation disabled (via the NO_USE_HASH_AGGREGATION hint, for example) or when the CBO decides not to use it
2) SORT UNIQUE - for DISTINCT under the same conditions
3) HASH GROUP BY - the new GROUP BY implementation from 10g onwards
4) HASH UNIQUE - the new DISTINCT implementation from 10g onwards

Since all of them are different row sources (or at least they are displayed as different to the user), no one can say for sure whether some of them share the same logic or not. Yes, most likely some of them are implemented using common logic (SORT GROUP BY & SORT UNIQUE, HASH GROUP BY & HASH UNIQUE), but how can one prove that they always behave the same? I see no option to get the correct answer other than asking Support or Oracle developers. Or looking at the code, if you are allowed to do so ;)
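As a rough sketch of how these row sources could be compared (the NO_USE_HASH_AGGREGATION hint assumes Oracle 10gR2 or later; largetable is BluShadow's example table from earlier in the thread):

```sql
-- Default from 10g onwards: the plans show HASH GROUP BY / HASH UNIQUE
SELECT line FROM largetable GROUP BY line;
SELECT DISTINCT line FROM largetable;

-- With hash aggregation disabled, the plans should revert to the
-- sort-based row sources (SORT GROUP BY / SORT UNIQUE):
SELECT /*+ NO_USE_HASH_AGGREGATION */ line
FROM   largetable
GROUP  BY line;

SELECT /*+ NO_USE_HASH_AGGREGATION */ DISTINCT line
FROM   largetable;
```

Running each variant under autotrace would let you compare the four operations' statistics directly.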
655429 May 29 2009
FYI - I ran a test with GROUP BY and DISTINCT. Here are the results:

GROUP BY:
SELECT STATEMENT, GOAL = FIRST_ROWS Cost=20 Cardinality=1 Bytes=68
FILTER
SORT GROUP BY Cost=20 Cardinality=1 Bytes=68
HASH JOIN Cost=19 Cardinality=6 Bytes=408
HASH JOIN Cost=10 Cardinality=57 Bytes=2280
VIEW Cost=3 Cardinality=57 Bytes=684
HASH JOIN
INDEX RANGE SCAN Cost=2 Cardinality=57 Bytes=684
INDEX FAST FULL SCAN Cost=2 Cardinality=57 Bytes=684
TABLE ACCESS FULL Cost=6 Cardinality=1055 Bytes=29540
TABLE ACCESS FULL Cost=9 Cardinality=215 Bytes=6020


DISTINCT
SELECT STATEMENT, GOAL = FIRST_ROWS Cost=13 Cardinality=1 Bytes=68
SORT UNIQUE Cost=12 Cardinality=1 Bytes=68
TABLE ACCESS BY INDEX ROWID Cost=1 Cardinality=1 Bytes=12
NESTED LOOPS Cost=11 Cardinality=1 Bytes=68
NESTED LOOPS Cost=10 Cardinality=1 Bytes=56
TABLE ACCESS FULL Cost=7 Cardinality=11 Bytes=308
TABLE ACCESS BY INDEX ROWID Cost=1 Cardinality=1 Bytes=28
INDEX UNIQUE SCAN Cost=1 Cardinality=1
INDEX RANGE SCAN Cost=1 Cardinality=1
BluShadow May 29 2009
Kakashi wrote:
FYI - I ran a test with GROUP BY and DISTINCT. Here are the results:
And what does that show?

You haven't provided the context of the table structures or the query you used, or the statistics showing the actual I/O-type information.

You can't compare the costs of different queries from an explain plan. The explain plan is just a view of what the optimiser plans to do, based on internal statistical information about the tables and data. What it actually does, performance-wise, when it executes the query may vary from the plan, due to statistics being out of date etc.
The cost is specific to a particular query, so just because one query says it has a cost of 13 and another says it has a cost of 20 doesn't necessarily mean that the query with cost 13 is more performant.
Sven W. May 29 2009 — edited on May 29 2009
Kakashi wrote:
Which one is better performance-wise?

Select A.aa,B.bb,C.cc
From A,B,C
Where A.aa = B.aa and A.cc = C.cc
Group By A.aa,B.bb,C.cc

OR

Select DISTINCT A.aa,B.bb,C.cc
From A,B,C
Where A.aa = B.aa and A.cc = C.cc
I prefer group by 90% of the time.

But not for performance reasons.
I agree with all the senior forum users here that the performance difference between group by and distinct is so marginal that it is not worth talking about.

The more important reason for me is maintainability!
Imagine what happens when another programmer needs to add something to this query.
This could be a simple column, it could be a join with another table and a little sum; it could be anything really.

Group by has the advantage that it forces the developer to think about the logical grouping that he wants to apply.
In my experience distinct is very often used by bad programmers who just forgot a join condition and didn't check their data.
This can also happen with the group by clause, but I feel it happens less often.
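Sven's point about DISTINCT hiding a forgotten join can be sketched with the OP's own tables:

```sql
-- Missing the A.cc = C.cc condition: C is joined as a partial
-- Cartesian product, so each matching row repeats once per row of C...
SELECT A.aa, B.bb, C.cc
FROM   A, B, C
WHERE  A.aa = B.aa;

-- ...and adding DISTINCT hides the duplicates instead of fixing
-- the join:
SELECT DISTINCT A.aa, B.bb, C.cc
FROM   A, B, C
WHERE  A.aa = B.aa;
```

A GROUP BY over all selected columns would mask the same mistake, but writing the grouping key out at least forces the developer to think about what the groups should be.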
Post Details

Locked on Jun 26 2009
Added on May 28 2009
28 comments
7,596 views