I am currently conducting a value-based segmentation in which each of the marginal cost (MC) and marginal revenue sub-components is used as input to O-Cluster.
I have a problem with truncating the data (say, removing the top 2% on each MC). But the documentation indicates that removing outliers is strongly encouraged. What should I do? What is the risk of not removing outliers in O-Cluster?
O-Cluster calculates uni-dimensional histograms to find good cutting planes. The histograms are based on equi-width binning. If you have extreme outliers, the main body of the data may get squashed into a few adjacent bins. It may prevent you from finding good separation along these dimensions.
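To make the squashing effect concrete, here is a minimal sketch in plain Python. It is not O-Cluster's actual binning code; the function name `equi_width_histogram`, the bin count of 10, and the sample values are all made up for illustration.

```python
def equi_width_histogram(values, n_bins):
    """Count values per bin, with bins of equal width between min and max."""
    lo, hi = min(values), max(values)
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything lands in the first bin
        else:
            # The maximum value would index one past the end, so clamp it.
            idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical attribute: 1000 values spread evenly over 0.0 .. 99.9.
body = [i / 10 for i in range(1000)]

clean_counts = equi_width_histogram(body, 10)
outlier_counts = equi_width_histogram(body + [10000.0], 10)

print("without outlier:", clean_counts)
print("with one outlier:", outlier_counts)
```

With the single extreme value added, each bin becomes 1000 units wide, so the whole main body falls into the first bin and the histogram no longer shows any structure along that dimension.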
Not sure what stages you are referring to. The algorithm does not look for outliers because it operates on transformed binned data. In the presence of outliers, the clustering may still work well. It really depends on the bin boundaries.
We haven't seen any big issues with outliers. Usually, performing a transformation outside the algorithm (e.g., log(x)) fixes the issue. We also have a transformation package that performs outlier removal. However, it is not embedded in the model.
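As a rough illustration of why a log transform helps without dropping any rows, the sketch below compares how much of the total value range the main body occupies before and after the transform. It uses `log(1 + x)` rather than `log(x)` so that zeros are allowed, which is my assumption; the sample values and the name `log_transform` are hypothetical.

```python
import math

def log_transform(values):
    """Apply log(1 + x) to each value; assumes every value is non-negative."""
    if any(v < 0 for v in values):
        raise ValueError("this sketch assumes non-negative inputs")
    return [math.log1p(v) for v in values]

def body_share(values):
    """Fraction of the full range covered by everything except the largest value."""
    ordered = sorted(values)
    return (ordered[-2] - ordered[0]) / (ordered[-1] - ordered[0])

# Hypothetical marginal-cost values with one extreme record.
costs = [1.0, 2.0, 3.0, 4.0, 1000.0]
logged = log_transform(costs)

raw_share = body_share(costs)
log_share = body_share(logged)
print(f"body share of range, raw: {raw_share:.3f}")
print(f"body share of range, log: {log_share:.3f}")
```

A larger share means the main body is spread over more equi-width bins. Whether a log transform suits a given attribute depends on the data, for example it does not apply directly to negative values.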
Just to add to the transformation options: the Data Miner Transform node has Outlier and Normalization transformations that can be used. You can also create your own using the custom column option. You can then use the workflow to treat your scoring data in the same way you treated your build data. This alleviates the issue of not having the transformation embedded in the model.
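The idea of treating scoring data the same way as build data can be sketched outside Data Miner as follows. This is not the Transform node's implementation: the function names, the 98% cut-off, and the choice to cap values rather than remove rows are all assumptions made for illustration.

```python
def fit_upper_cap(build_values, upper_percent=98):
    """Pick a cap from the build data at roughly the given percentile."""
    ordered = sorted(build_values)
    # Integer arithmetic avoids floating-point surprises in the index.
    idx = (upper_percent * (len(ordered) - 1)) // 100
    return ordered[idx]

def apply_cap(values, cap):
    """Replace anything above the cap with the cap; no rows are dropped."""
    return [min(v, cap) for v in values]

# Hypothetical build data: 1.0 .. 100.0 plus one extreme value.
build = [float(i) for i in range(1, 101)] + [5000.0]
cap = fit_upper_cap(build)

# The cap learned on the build data is reused unchanged at scoring time.
build_capped = apply_cap(build, cap)
scoring_capped = apply_cap([10.0, 250.0, 99.0], cap)

print("cap:", cap)
print("scoring after cap:", scoring_capped)
```

The key point is that the cap is computed once from the build data and stored, so the scoring data is never used to derive its own bounds.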