4 Replies Latest reply: May 10, 2012 4:08 AM by 920802

Clustering

Hello,

I have the results of 10 different clustering runs.
Now I want to compare them: which are similar (with a percentage level of similarity) and which are different.
I know how to compare 2 clusterings (the simple case), but how do I do it for 10 and build a similarity ranking?
Any ideas?

Paul.
• 1. Re: Clustering
Paul,

Can you be more specific about what exactly you are trying to achieve? What are the differences among the 10 runs? Do you use different settings or different data?
To compare 2 different models, one can consider things like:
1. How similar the hierarchical trees are (depth, branching out)
2. The distribution of data records among clusters (avg, min, max, etc.)
3. If you have class labels, calculate cluster purity - how many of the points within a cluster belong to the same class on average
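
As a rough illustration of item 3, here is a minimal Python sketch of cluster purity. The function name `cluster_purity` and the sample data are made up for the example, and it assumes "on average" means the unweighted mean of the per-cluster purities; a size-weighted average would be an equally reasonable reading.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, class_labels):
    """Return (purity per cluster, unweighted mean purity).

    Purity of a cluster = share of its points that carry the
    cluster's most frequent class label.
    """
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster].append(label)
    purity = {
        cluster: Counter(labels).most_common(1)[0][1] / len(labels)
        for cluster, labels in members.items()
    }
    mean_purity = sum(purity.values()) / len(purity)
    return purity, mean_purity

# Hypothetical data: six points, two clusters, two classes.
clusters = [1, 1, 1, 2, 2, 2]
classes = ["a", "a", "b", "b", "b", "b"]
per_cluster, mean_purity = cluster_purity(clusters, classes)
```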

Lastly, if you have the same data across models, you can use the cluster assignments of 2 models and calculate what percentage of the points fall in the same clusters. Because cluster labels are arbitrary and need not match between models, do this over pairs of points: consider all possible pairs and count how many pairs fall in the same cluster in both models.
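
The pair-counting idea can be sketched in Python roughly as follows. The function name `pair_agreement` and the two sample runs are hypothetical. Besides the count of pairs that share a cluster in both models, the sketch also returns the Rand index, a closely related measure that additionally counts pairs separated in both models; that is a variation on the description above, not a requirement.

```python
from itertools import combinations

def pair_agreement(run_a, run_b):
    """Compare two cluster assignments of the same points, pair by pair.

    Returns (pairs that share a cluster in BOTH runs, Rand index), where
    the Rand index is the share of pairs that both runs treat the same
    way (together in both, or apart in both).
    """
    if len(run_a) != len(run_b):
        raise ValueError("both runs must cover the same points")
    if len(run_a) < 2:
        raise ValueError("need at least two points to form a pair")
    both_same = agree = total = 0
    for i, j in combinations(range(len(run_a)), 2):
        same_a = run_a[i] == run_a[j]
        same_b = run_b[i] == run_b[j]
        both_same += same_a and same_b
        agree += same_a == same_b
        total += 1
    return both_same, agree / total

# Hypothetical assignments; label values differ between runs on purpose.
run_a = [1, 1, 2, 2]
run_b = [5, 5, 7, 8]
both_same, rand = pair_agreement(run_a, run_b)
```

This brute-force loop visits every pair, so its cost grows with the square of the number of points.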

I hope this helps,

Boriana
• 2. Re: Clustering
Boriana,

Thanks for the answer. I'm focusing on the results.
I want to calculate the proportion of IDs that fall into the same cluster in each run and pick the 2 most similar runs.
I can write SQL that compares the results of two runs.
But, as you wrote, it has to compare each pair, which will require many operations.

My question is whether there is a simple method to compare many (10) runs and find the most similar ones.
The data are in 1 table, with a separate column for each run (the cluster assignment).

SQL to compare 2 runs is:

select r.r1,
       r.r2,
       count(r.r2) as cnt_r2_in_r1,
       -- percentage of the rows in cluster r1 that fall into cluster r2
       round(count(r.r2) /
             (select count(o.r1)
                from name_of_table o
               where o.r1 = r.r1
               group by o.r1) * 100, 2) as perc_r2_in_r1
  from name_of_table r
 group by r.r1, r.r2
 order by r.r1, r.r2;

Edited by: 917799 on 2012-05-02 07:39

Edited by: 917799 on 2012-05-02 07:42
• 3. Re: Clustering
Given how your data is organized, it will be simplest to just write a loop in PL/SQL that considers each pair of columns and outputs the 2 columns with the highest score. With 10 runs that is only 45 pairs.
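
For illustration only, here is the same pairwise loop sketched in Python rather than PL/SQL. It assumes the cluster columns have been read into lists in the same row order, and it uses the Rand index as the score; that choice, the function names `rand_index` and `rank_runs`, and the sample values are all assumptions of this sketch. The score is computed from cluster-size counts instead of enumerating every pair of rows, which keeps it cheap for a few thousand rows.

```python
from collections import Counter
from itertools import combinations
from math import comb

def rand_index(run_a, run_b):
    """Share of point pairs that both runs treat alike (together or apart)."""
    total = comb(len(run_a), 2)
    if len(run_a) != len(run_b) or total == 0:
        raise ValueError("runs must have equal length and at least 2 points")
    # Pairs that are together in both runs, in run_a only, in run_b only.
    both = sum(comb(c, 2) for c in Counter(zip(run_a, run_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(run_a).values())
    in_b = sum(comb(c, 2) for c in Counter(run_b).values())
    return (total + 2 * both - in_a - in_b) / total

def rank_runs(runs):
    """runs: dict of column name -> list of cluster ids (same row order).

    Returns (score, run_a, run_b) tuples, most similar pair first.
    """
    scores = [
        (rand_index(runs[a], runs[b]), a, b)
        for a, b in combinations(sorted(runs), 2)
    ]
    return sorted(scores, reverse=True)

# Hypothetical sample: three runs over six rows.
runs = {
    "clustering_num_1": [1, 1, 2, 2, 3, 3],
    "clustering_num_2": [7, 7, 8, 8, 9, 9],  # same grouping, other labels
    "clustering_num_3": [1, 2, 1, 2, 1, 2],
}
ranking = rank_runs(runs)
best_score, best_a, best_b = ranking[0]
```

The first entry of `ranking` is the most similar pair of runs; the full list gives the similarity ranking over all pairs.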
• 4. Re: Clustering
The dataset is very simple:

Case_id, clustering_num_1, clustering_num_2, ... clustering_num_10

The dataset has about 3000 rows.