Can you be more specific about what exactly you are trying to achieve? What are the differences among the 10 runs? Do you use different settings or different data?
To compare 2 different models, one can consider things like:
1. How similar the hierarchical trees are (depth, branching out)
2. The distribution of data records among clusters (avg, min, max, etc.)
3. If you have class labels, calculate cluster purity - how many of the points within a cluster belong to the same class on average
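As a rough illustration of point 3, here is a minimal Python sketch of cluster purity. It assumes the cluster ids and class labels are available as two parallel lists (the function and variable names are made up for the example), and it takes the plain average over clusters as described above. Weighting each cluster by its size is another common choice.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, class_labels):
    """Average, over clusters, of the share of points in the majority class.

    cluster_ids and class_labels are hypothetical parallel lists:
    element i of each describes the same data record.
    """
    labels_by_cluster = defaultdict(list)
    for cluster, label in zip(cluster_ids, class_labels):
        labels_by_cluster[cluster].append(label)

    shares = []
    for labels in labels_by_cluster.values():
        # Size of the most frequent class inside this cluster.
        majority_count = Counter(labels).most_common(1)[0][1]
        shares.append(majority_count / len(labels))

    # Unweighted mean: every cluster counts equally, whatever its size.
    return sum(shares) / len(shares)
```

A purity of 1.0 means every cluster contains a single class.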
Lastly, if you have the same data across models, you can use the cluster assignments of 2 models and calculate what percentage of the points fall in the same clusters. To do that you can consider all possible pairs and count how many pairs fall in the same cluster in both models.
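The pair counting can be sketched in Python as below. This is only one way to turn the count into a score: it treats a pair as an agreement when both models put the two points together or both keep them apart, which is the Rand index. That normalization is an assumption on my part, and the names are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(run_a, run_b):
    """Fraction of point pairs on which two clusterings agree (Rand index).

    run_a and run_b are parallel lists of cluster ids for the same records.
    Cluster ids do not need to match between the two runs.
    """
    if len(run_a) != len(run_b):
        raise ValueError("both runs must cover the same records")
    if len(run_a) < 2:
        raise ValueError("need at least two records to form a pair")

    agree = 0
    total = 0
    for i, j in combinations(range(len(run_a)), 2):
        same_in_a = run_a[i] == run_a[j]
        same_in_b = run_b[i] == run_b[j]
        # Agreement: together in both runs, or separated in both runs.
        if same_in_a == same_in_b:
            agree += 1
        total += 1
    return agree / total
```

The loop visits every pair, so the work grows with the square of the number of records.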
I hope this helps,
Thanks for the answer. I'm trying to focus on the results.
I want to calculate the share of ids that fall into the same cluster in each run, and pick the 2 most similar runs.
I can write SQL that compares the results of two runs.
But, as you wrote, it has to compare every pair, which will require many operations.
My question is whether there is a simple method to compare many (10) runs and find the most similar ones.
The data are in 1 table, with a separate column for each run (the cluster assignment).
SQL to compare 2 runs is:
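-- for each (r1, r2) combination: row count and percentage of cluster r1 that lands in r2
select r.r1, r.r2, count(*) cnt,
round(count(*)/(select count(*)
from name_of_table o
where o.r1 = r.r1
group by o.r1)*100,2) perc_r2_in_r1
from name_of_table r
group by r.r1, r.r2
order by r.r1, r.r2;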
Edited by: 917799 on 2012-05-02 07:39
Edited by: 917799 on 2012-05-02 07:42
Given how your data is organized, it will be simplest to just write a loop in PL/SQL that considers each pair of columns and outputs the 2 columns with the highest score. With 10 runs that is only 45 pairs of columns.
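To show the shape of that loop, here is a sketch in Python rather than PL/SQL. It assumes the table has been read into a dict that maps each column name to its list of cluster ids, with rows in the same order for every column (the `runs` dict and both function names are hypothetical). The score used is the Rand index, which is my choice of similarity measure and not the only option. It is computed from group counts instead of visiting every pair of rows, so each comparison is a single pass over the data.

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(run_a, run_b):
    """Rand index of two clusterings, computed from group sizes."""
    total_pairs = comb(len(run_a), 2)
    # Pairs placed in the same cluster by run A, by run B, and by both.
    together_a = sum(comb(c, 2) for c in Counter(run_a).values())
    together_b = sum(comb(c, 2) for c in Counter(run_b).values())
    together_both = sum(comb(c, 2) for c in Counter(zip(run_a, run_b)).values())
    # Pairs that one run joins and the other splits.
    disagree = together_a + together_b - 2 * together_both
    return (total_pairs - disagree) / total_pairs


def most_similar_runs(runs):
    """Return ((name_x, name_y), score) for the best-scoring pair of runs."""
    scores = {
        (x, y): rand_index(runs[x], runs[y])
        for x, y in combinations(runs, 2)
    }
    best_pair = max(scores, key=scores.get)
    return best_pair, scores[best_pair]
```

Usage with three toy runs (the real table would supply ten columns):

```python
runs = {
    "clustering_num_1": [1, 1, 2, 2, 3, 3],
    "clustering_num_2": [7, 7, 8, 8, 9, 9],
    "clustering_num_3": [1, 2, 1, 2, 1, 2],
}
best_pair, score = most_similar_runs(runs)
```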
The dataset is very simple:
Case_id, clustering_num_1, clustering_num_2, ... clustering_num_10
The dataset has about 3000 rows.
Thanks in advance.
Edited by: 917799 on 2012-05-10 02:08