Forum Stats

  • 3,768,646 Users
  • 2,252,827 Discussions
  • 7,874,668 Comments

Discussions

COUNT DISTINCT for multiple columns

Thorsten Kettner
Thorsten Kettner Member Posts: 39 Red Ribbon
edited Nov 26, 2020 12:58PM in Database Ideas - Ideas
Well the title says it all. We can count distinct values such as in
select count(distinct col1) from mytable;
but when we want to count distinct column combinations, we must either clumsily concatenate values (and be very careful to choose the right separator):
select count(distinct col1 || '-' || col2) from mytable;
or use a subquery:
select count(*) from (select distinct col1, col2 from mytable);
So I am looking for something along the lines of:
select count(distinct col1, col2) from mytable;
I don't know whether the SQL standard allows multiple values for COUNT. If not, we might use tuples:
select count(distinct (col1, col2) ) from mytable;
or something similar.
Orcl Sajjan2773476Spike HouseThorsten KettnerWilliam Robertsonblessed DBAulohmannApexBineErik van RoonLukas Ederkev22257mathguyGregVKayKFatMartinRUser_G3UV3
19 votes

Active · Last Updated

Comments

  • mtefft
    mtefft Member Posts: 836 Gold Badge
    edited Jul 13, 2017 3:52PM

    Why not just

    select count(*) from (select distinct col1, col2  from mytable);

    (sorry, my eyes just bounced over the line where you gave this as an option)

    BEDE
  • Sven W.
    Sven W. Member Posts: 10,534 Gold Crown
    edited Jul 20, 2017 5:19AM

    DISTINCT is pure evil. Counting distinct values is only slightly less evil. If you want to count something prepare the datasource in such a way that duplicates can be avoided. This will make the whole count operation way faster.

    Other than that, we can count on groups.

    select col1,col2, count(*)from mytablegroup by col1,col2;
    select count(count(*)) as "distinct count"from testdatagroup by col1,col2;

    Or as an analytic function

    select col1,col2, count(*) over (partition by col1,col2)from mytable;
    select col1,col2, count(*) over () as "distinct count"from testdatagroup by col1,col2;
    BEDEobrisson
  • Thorsten Kettner
    Thorsten Kettner Member Posts: 39 Red Ribbon
    edited Nov 26, 2020 12:59PM

    Why not just

    select count(*) from (select distinct col1, col2  from mytable);

    (sorry, my eyes just bounced over the line where you gave this as an option)

    For the same resson that I prefer

    select count(*) from mytable;

    over

    select count(*) from (select * from mytable);

    It's more readable. I don't see why I have to write a subquery when I only want to count.

  • Thorsten Kettner
    Thorsten Kettner Member Posts: 39 Red Ribbon

    DISTINCT is pure evil. Counting distinct values is only slightly less evil. If you want to count something prepare the datasource in such a way that duplicates can be avoided. This will make the whole count operation way faster.

    Other than that, we can count on groups.

    select col1,col2, count(*)from mytablegroup by col1,col2;
    select count(count(*)) as "distinct count"from testdatagroup by col1,col2;

    Or as an analytic function

    select col1,col2, count(*) over (partition by col1,col2)from mytable;
    select col1,col2, count(*) over () as "distinct count"from testdatagroup by col1,col2;

    Sorry, but that's nonsense. Yes, DISTINCT is often used to kind of "repair" queries that are badly written. Especially beginners tend to join all tables involved, only to have to remove duplicates in the end. But that doesn't make DISTINCT bad. It is just something one rarely needs. DISTINCT is perfect for queries like "get all surnames from the users table". Such queries are rare, but they exist.

    Your queries have nothing to do with counting distinct tuples, so I don't see the point of posting them here.

  • Sven W.
    Sven W. Member Posts: 10,534 Gold Crown
    edited Apr 27, 2018 5:12AM

    Sorry, but that's nonsense. Yes, DISTINCT is often used to kind of "repair" queries that are badly written. Especially beginners tend to join all tables involved, only to have to remove duplicates in the end. But that doesn't make DISTINCT bad. It is just something one rarely needs. DISTINCT is perfect for queries like "get all surnames from the users table". Such queries are rare, but they exist.

    Your queries have nothing to do with counting distinct tuples, so I don't see the point of posting them here.

    3500235 wrote:Sorry, but that's nonsense. Yes, DISTINCT is often used to kind of "repair" queries that are badly written. Especially beginners tend to join all tables involved, only to have to remove duplicates in the end. But that doesn't make DISTINCT bad. It is just something one rarely needs. DISTINCT is perfect for queries like "get all surnames from the users table". Such queries are rare, but they exist.Your queries have nothing to do with counting distinct tuples, so I don't see the point of posting them here.

    You are partially right. Sorry I forgot something in the queries I posted. I changed them now.

    My main reason why I consider DISTINCT as evil is twofold.

    Firstly is is used wrong so often (you mentioned beginners that forget joins too) that I just wish it wasn't there in the first place.

    Secondly it is easy to accidently break such code. Just by adding another column to the select clause you suddenly get different counts.

    This won't happen BY ACCIDENT if you do the row elimination using a GROUP BY clause.


    We always should consider which part of an select sattement is responsible for what. The projection (=SELECT clause) serves to give us the columns that we want. The selection (=FROM+GROUP BY clause  ) define which rows we return. By using distinct you move the ROW selection part to the clause where we do not expect that.

    Other parts of an application should be agnostic about adding an attribute(=column) somewhere. Or if they have a strong dependency, then we should get a compilation error. Using DISTINCT (in general) will make your code more error prone for future changes and by that less maintainable.

    Btw. You are now changeing your examples. Do you want to return a distinct list of surnames? Or do you want to know how many different surnames are in the table? I agree there are cases where we want to do such thing. In ALL of those cases I prefer a specifically written GROUP BY clause over the DISTINCT keyword in the select clause.

  • Thorsten Kettner
    Thorsten Kettner Member Posts: 39 Red Ribbon

    I just looked that up in other database systems:

    • PostgreSQL supports select count(distinct (col1, col2)) from mytable;
    • MySQL supports select count(distinct col1, col2) from mytable;

    So other vendors have seen a need and applied a solution. I admit this is something we don't need often, but from time to time it can be really useful.

    William Robertson
  • blessed DBA
    blessed DBA Member Posts: 218

    select count (Col) from (

    Select distinct (col1) as col from mytable

    union

    Select distinct (col2) as col from mytable ) createdtable;

  • Thorsten Kettner
    Thorsten Kettner Member Posts: 39 Red Ribbon

    select count (Col) from (

    Select distinct (col1) as col from mytable

    union

    Select distinct (col2) as col from mytable ) createdtable;

    I don't see how this query contributes to the request. It certainly does not count distinct col1/col2 pairs. (Moreover, DISTINCT is completely superfluous in your query, because of UNION. And DISTINCT is not a function, so the parentheses around col1 and col2 should be removed for better readability. And are you purposely using COUNT(col) instead of COUNT(*) in order not to count possible nulls? Why?)

    William Robertson
  • DISTINCT is pure evil. Counting distinct values is only slightly less evil. If you want to count something prepare the datasource in such a way that duplicates can be avoided. This will make the whole count operation way faster.

    Other than that, we can count on groups.

    select col1,col2, count(*)from mytablegroup by col1,col2;
    select count(count(*)) as "distinct count"from testdatagroup by col1,col2;

    Or as an analytic function

    select col1,col2, count(*) over (partition by col1,col2)from mytable;
    select col1,col2, count(*) over () as "distinct count"from testdatagroup by col1,col2;

    Yeah, this is nonsense. All functionality can be misused by people that don't know what they are doing.

    Distinct has appropriate uses all the time in profiling and exploratory data analysis. Group by is only better in those cases if you enjoy typing more.

  • mathguy
    mathguy Member Posts: 10,155 Blue Diamond
    edited Feb 22, 2020 10:21PM

    Your ask is not ambitious enough. Why only count distinct at the table level, and not in full generality (meaning: in any aggregate query)?

    We should be able to do this (using tuple notation, which to me makes the most sense in the relational model):

    select expr1 [as alias1], ... , COUNT( DISTINCT(COL1, COL2, COL3) ) [as count_distinct_col1_col2_col2]from  ....[where .....]group by .....[having .....]

    I didn't know if other RDBMS support this - glad to learn from this thread that, indeed, at least some do.

    It shouldn't be too difficult for Oracle to implement this; worst case, they can code internally the silly things we must do ourselves, such as concatenate with a good separator (chr(0) seems a lot better to me than any graphical character) - although that of course must be done carefully to avoid going over 4k/32k bytes, etc. Surely Oracle can do this better than we can.

    user560737Thorsten Kettner