4 Replies Latest reply on May 10, 2012 9:08 AM by 920802



      I have the results of 10 different clustering runs.
      Now I want to compare them: which runs are similar (with a % level of similarity) and which are different.
      I know how to compare 2 runs (the simple case), but how can I do it for 10 and build a similarity ranking?
      Any ideas?

      Thanks in advance,
        • 1. Re: Clustering

          Can you be more specific about what exactly you are trying to achieve? What are the differences among the 10 runs? Do you use different settings or different data?
          To compare 2 different models, one can consider things like:
          1. How similar the hierarchical trees are (depth, branching out)
          2. The distribution of data records among clusters (avg, min, max, etc.)
          3. If you have class labels, calculate cluster purity - how many of the points within a cluster belong to the same class on average
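
          As an illustration of item 3, here is a minimal sketch of cluster purity, written in Python rather than SQL so it can be run on its own. It assumes the common weighted definition (the share of all points that carry the majority class of their cluster); the function name `cluster_purity` and its arguments are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, class_labels):
    """Share of points whose class is the majority class of their cluster."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")
    if len(cluster_ids) == 0:
        raise ValueError("inputs must not be empty")

    # Count how often each class occurs inside each cluster.
    per_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster[cluster][label] += 1

    # For every cluster keep only the size of its largest class.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_ids)
```

          A purity of 1.0 means every cluster contains a single class; values closer to 0 mean the clusters mix classes heavily.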

          Lastly, if you have the same data across models, you can use the cluster assignments of 2 models and calculate what percentage of the points fall in the same clusters. To do that you can consider all possible pairs and count how many pairs fall in the same cluster in both models.
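
          The pair counting described above can be sketched as follows, in Python rather than SQL so it can be run on its own. The function name `pair_agreement` is hypothetical. It returns two ratios: the share of pairs placed together in both models, and the share of pairs on which the two models agree either way (together in both or apart in both), which is usually called the Rand index.

```python
from itertools import combinations


def pair_agreement(labels_a, labels_b):
    """Return (share of pairs together in both models, Rand index)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both models must label the same records")
    if len(labels_a) < 2:
        raise ValueError("need at least two records to form a pair")

    together_in_both = 0
    agreeing = 0
    total = 0
    # Visit every unordered pair of records exactly once.
    for i, j in combinations(range(len(labels_a)), 2):
        together_a = labels_a[i] == labels_a[j]
        together_b = labels_b[i] == labels_b[j]
        total += 1
        if together_a and together_b:
            together_in_both += 1
        if together_a == together_b:
            agreeing += 1
    return together_in_both / total, agreeing / total
```

          Only co-membership is compared, so the cluster numbers themselves do not have to match between the two models.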

          I hope this helps,

          • 2. Re: Clustering

            Thanks for the answer. I'm trying to focus on the results.
            I want to calculate the share of IDs that fall into the same cluster in each run, and pick the 2 most similar runs.
            I can write SQL that compares the results of two runs.
            But (as you wrote) it has to be run for each pair of runs, which will require many operations.

            My question is whether there is a simple method to compare many (10) runs and find the most similar ones.
            The data are in 1 table, with a separate column for each run (cluster assignment).

            The SQL to compare 2 runs is:

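            -- For each pair (cluster in run r1, cluster in run r2): the percentage
            -- of the rows in the r1 cluster that also fall into the r2 cluster.
            select r.r1, r.r2,
                   round(count(r.r2)
                         / (select count(o.r1)       -- size of the r1 cluster
                            from name_of_table o
                            where o.r1 = r.r1
                            group by o.r1) * 100, 2) perc_r2_in_r1
            from name_of_table r
            group by r.r1, r.r2
            order by r.r1, r.r2;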

            Edited by: 917799 on 2012-05-02 07:39

            Edited by: 917799 on 2012-05-02 07:42
            • 3. Re: Clustering
              Given how your data is organized, it will be simplest to just write a loop in PL/SQL that considers each pair of columns and outputs the 2 columns with the highest score. With 10 columns that is only 45 pairs.
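
              The loop over column pairs can be sketched as follows. This is Python rather than PL/SQL so it can be run outside the database, and it assumes the Rand index (the share of record pairs on which two runs agree) as the similarity score; any other pairwise score could be swapped in. The names `rand_index` and `rank_run_pairs` are hypothetical, and `runs` is assumed to map each column name to its list of cluster assignments in the same record order.

```python
from collections import Counter
from itertools import combinations


def _pairs(count):
    """Number of unordered pairs that can be formed from `count` items."""
    return count * (count - 1) // 2


def rand_index(labels_a, labels_b):
    """Share of record pairs that both runs treat alike (together or apart)."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("runs must label the same records (at least two)")
    # Pairs together in both runs, in run A, and in run B, taken from counts
    # instead of enumerating all record pairs.
    both = sum(_pairs(c) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(_pairs(c) for c in Counter(labels_a).values())
    in_b = sum(_pairs(c) for c in Counter(labels_b).values())
    total = _pairs(n)
    # Agreements = together in both + apart in both.
    return (total - in_a - in_b + 2 * both) / total


def rank_run_pairs(runs):
    """Return (score, run_a, run_b) for every pair of runs, best score first."""
    scored = [
        (rand_index(runs[a], runs[b]), a, b)
        for a, b in combinations(sorted(runs), 2)
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

              The first element of the returned list is the most similar pair of runs, and the full list serves as the similarity ranking.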
              • 4. Re: Clustering
                The dataset is very simple:

                Case_id, clustering_num_1, clustering_num_2, ... clustering_num_10

                The dataset has about 3000 rows.

                Thanks in advance.

                Edited by: 917799 on 2012-05-10 02:08