I have the results of 10 different clustering runs.
Now I want to compare them: which are similar (with a similarity level in %) and which are different.
I know how to compare 2 runs (the simple case), but how do I do it for 10 and build a similarity ranking?
Any ideas?
Can you be more specific about what exactly you are trying to achieve? What are the differences among the 10 runs? Do you use different settings or different data?
To compare 2 different models, one can consider things like:
1. How similar the hierarchical trees are (depth, branching out)
2. The distribution of data records among clusters (avg, min, max, etc.)
3. If you have class labels, calculate cluster purity - how many of the points within a cluster belong to the same class on average
Lastly, if you have the same data across models, you can use the cluster assignments of 2 models and calculate what percentage of the points fall in the same clusters. To do that you can consider all possible pairs and count how many pairs fall in the same cluster in both models.
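A minimal sketch of the pair-counting idea above, in Python. It computes the Rand index, which is one common way to turn pair agreement into a single number: it counts pairs that are together in both runs and also pairs that are apart in both runs, divided by all pairs. The function name `rand_index` and the list-of-labels input are assumptions made for illustration. Rather than looping over every pair, it counts pairs from the cluster sizes, which should give the same result with far less work.

```python
from collections import Counter
from math import comb


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two cluster assignments agree.

    labels_a[i] and labels_b[i] are the cluster ids of point i in each run.
    The ids themselves need not match between runs, only the grouping.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both runs must label the same points")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0

    # Pairs placed in the same cluster by both runs (from the joint counts).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed in the same cluster by each run on its own.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Pairs separated by both runs, by inclusion-exclusion.
    apart_both = total_pairs - together_a - together_b + together_both

    return (together_both + apart_both) / total_pairs


same_grouping = rand_index([1, 1, 2, 2], [5, 5, 9, 9])
crossed_grouping = rand_index([1, 1, 2, 2], [1, 2, 1, 2])
```

Here `same_grouping` is 1.0 because the two runs group the points identically even though the cluster ids differ, while `crossed_grouping` is 2/6 because only two of the six pairs are treated the same way by both runs.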
Thanks for the answer. I'm trying to focus on the results.
I want to calculate the ratio of ids that fall into the same cluster in each run and pick the 2 most similar runs.
I can write SQL that compares the results of two runs.
But that requires comparing (as you wrote) each pair and will take many operations.
My question is whether there is a simple method to compare many (10) runs and find the most similar ones.
The data are in 1 table, and for each run (assignment to cluster) I have a separate column in the table.
The SQL to compare 2 runs is:
select r.r1, r.r2,
       count(*) cnt,
       round(count(*) / (select count(*)
                         from name_of_table o
                         where o.r1 = r.r1
                         group by o.r1)*100,2) perc_r2_in_r1
from name_of_table r
group by r.r1, r.r2
order by r.r1, r.r2;
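For 10 runs there are only 45 distinct pairs of runs, so scoring every pair and sorting the scores is usually cheap. The sketch below does that in Python under some assumptions: the run columns have been loaded into a dict of label lists (the names `runs`, `overlap` and `rank_run_pairs` are hypothetical), and the similarity is a simple overlap score in the spirit of `perc_r2_in_r1`. For each cluster of one run it takes the largest share that lands in a single cluster of the other run, then averages both directions. This is one possible measure, not the only one.

```python
from collections import Counter
from itertools import combinations


def overlap(labels_a, labels_b):
    """Share of points covered when each A-cluster is matched to the
    B-cluster it overlaps most (asymmetric, between 0 and 1)."""
    joint = Counter(zip(labels_a, labels_b))
    best = {}
    for (a, _b), count in joint.items():
        best[a] = max(best.get(a, 0), count)
    return sum(best.values()) / len(labels_a)


def rank_run_pairs(runs):
    """Return (score, name_1, name_2) for every pair of runs, most similar first."""
    ranked = []
    for (name_1, labels_1), (name_2, labels_2) in combinations(runs.items(), 2):
        # Average both directions so the score does not depend on argument order.
        score = (overlap(labels_1, labels_2) + overlap(labels_2, labels_1)) / 2
        ranked.append((score, name_1, name_2))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked


# Toy data: one list per run column, one entry per id (illustrative values only).
runs = {
    "r1": [1, 1, 2, 2, 3, 3],
    "r2": [7, 7, 8, 8, 9, 9],
    "r3": [1, 2, 1, 2, 1, 2],
}
ranking = rank_run_pairs(runs)
```

In this toy data `r1` and `r2` group the ids identically, so that pair comes first with a score of 1.0. Note that an overlap score like this is also high when one run puts nearly everything into a single cluster, so a pair-counting measure may be a safer choice if cluster sizes vary a lot between runs.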