The below text is from “Oracle® Data Mining Concepts 11g Release 2 (11.2)”:
"If you manage your own data preparation for k-Means, keep in mind that outliers with equi-width binning can prevent k-Means from creating clusters that are different in content. The clusters may have very similar centroids, histograms, and rules."
I have two questions:
1. Why can equi-width binning prevent k-Means from creating clusters that are different in content?
Consider the histogram of an attribute from a dataset with outliers. In the extreme case there are two dense sections (one for the normal data and another for the outliers) separated by a long interval of empty bins, so each of these sections could form a different cluster.
Please correct me if I’m wrong!
2. I’m using SQL Developer 220.127.116.11 to apply k-Means. I have three numeric attributes with very different ranges.
I tried four cases: the raw data (no normalization) and three different normalization methods for the numeric fields (Min Max, Linear Scale, Z-Score). The results of all four tests are the same.
I expected different results from these tests.
Can you explain this situation?
The automated data preparation in k-Means does not bin the data. The quote from the documentation refers to the case where the user decides to use binning in their own data transformations. If the data has outliers and only a few bins are used, most of the normal values end up in the same bin, which results in a loss of resolution. That is the gist of the quote.
Using the different normalization approaches may or may not produce different clusters. It depends on how different the individual scaling factors are. Normalization is equivalent to weighting the individual attributes. If the weights are different enough to change the distance function sufficiently, it is possible to get a different solution.
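To make the loss of resolution concrete, here is a minimal sketch in plain Python. It is not ODM code: the helper `equi_width_bin`, the data, and the choice of 10 bins are made up for illustration. It bins 100 normal values plus one extreme outlier into equal-width bins and counts how many distinct bins the normal values occupy.

```python
def equi_width_bin(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Clamp so that the maximum value falls into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


normal = list(range(1, 101))   # 100 "normal" values between 1 and 100 (illustrative)
data = normal + [10_000]       # the same values plus one extreme outlier

bins_with_outlier = equi_width_bin(data, 10)
bins_without_outlier = equi_width_bin(normal, 10)

# Number of distinct bins used by the 100 normal values in each case.
distinct_with = len(set(bins_with_outlier[:100]))
distinct_without = len(set(bins_without_outlier))

print("bins used by normal values, outlier present:", distinct_with)
print("bins used by normal values, outlier absent: ", distinct_without)
```

With the outlier present, the bin width is stretched so far that every normal value lands in the first bin. After binning, those values are indistinguishable to the clustering algorithm, so there is little left to separate into different clusters.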
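The weighting argument can be sketched as follows. This is an illustration in plain Python, not what ODM does internally, and the helper names and data are assumptions. Dividing attribute i by a scale factor s_i turns the squared Euclidean distance into a weighted distance with weight 1/s_i^2 on the raw attribute. Two normalization methods lead to the same distance function, up to a constant, whenever their weight vectors are proportional.

```python
import statistics


def scale_factors(columns, method):
    """Per-attribute divisor used by each (illustrative) normalization method."""
    if method == "minmax":
        return [max(c) - min(c) for c in columns]
    if method == "zscore":
        return [statistics.pstdev(c) for c in columns]
    raise ValueError(f"unknown method: {method}")


def relative_weights(columns, method):
    """Implied weight of each raw attribute in the squared distance, summing to 1."""
    raw = [1.0 / s ** 2 for s in scale_factors(columns, method)]
    total = sum(raw)
    return [w / total for w in raw]


# Three attributes with very different ranges but the same shape (made-up data).
same_shape = [
    [0, 1, 2, 3, 4],
    [0, 100, 200, 300, 400],
    [5.0, 5.5, 6.0, 6.5, 7.0],
]
# Same ranges, but the second attribute is now heavily skewed.
skewed = [
    [0, 1, 2, 3, 4],
    [0, 0, 0, 0, 400],
    [5.0, 5.5, 6.0, 6.5, 7.0],
]

w_minmax_same = relative_weights(same_shape, "minmax")
w_zscore_same = relative_weights(same_shape, "zscore")
w_minmax_skew = relative_weights(skewed, "minmax")
w_zscore_skew = relative_weights(skewed, "zscore")

print(w_minmax_same, w_zscore_same)
print(w_minmax_skew, w_zscore_skew)
```

In the first dataset the range and the standard deviation of each attribute differ by the same constant factor, so Min Max and Z-Score imply the same relative weights and would give the same clusters. In the second dataset the skewed attribute changes that ratio, so the weights differ and the clustering can change.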
Thank you Boriana,
I got the gist of the quote, but I'm still interested in knowing more about the normalization types in ODM. There isn't any setting I can use to control the normalization process.
I just select a normalization type and the process goes through, but none of these types has any effect on the result, and I think there is something wrong!
In my case the attributes have very different means and standard deviations, so I would expect z-score normalization, for example, to change the clustering results!