Using K-Means, I got 5 clusters in the final output, but when I use the Apply node to add cluster IDs to the base data, only 4 cluster IDs are added and the distribution of customers also changes. Please help.
Here is the response from our algorithm team regarding K-Means:
This has come up before and is related to convergence. The model is not quite converged and the scoring reflects the next iteration of the algorithm.
The fix is to increase the number of iterations to ensure convergence and maybe decrease the tolerance (convergence criterion) as well.
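To illustrate the convergence point, here is a minimal Python sketch using plain Lloyd's k-means on made-up 1-D data. It is not Oracle's K-Means implementation; the points, starting centroids, and function names are all hypothetical. It shows how a model stopped before convergence can report three populated clusters at build time while scoring against the updated centroids populates only two.

```python
def nearest(point, centroids):
    """Index of the centroid closest to a 1-D point."""
    return min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))

def kmeans(points, start, max_iter):
    """Plain Lloyd's k-means (illustrative only).

    Returns the final centroids and the assignments computed in the last
    build pass, i.e. the assignments made BEFORE the final centroid update.
    """
    centroids = list(start)
    assign = None
    for _ in range(max_iter):
        new_assign = [nearest(p, centroids) for p in points]
        if new_assign == assign:
            break  # assignments stopped changing: converged
        assign = new_assign
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = sum(members) / len(members)
    return centroids, assign

def score(points, centroids):
    """Assign each point to the nearest final centroid (the 'apply' step)."""
    return [nearest(p, centroids) for p in points]

# Hypothetical data and starting centroids.
points = [2.5, 2.9, 4.0, 8.0, 9.1, 9.5]
start = [0.0, 6.0, 12.0]

# Stopped after one iteration: not converged.
cents_early, build_early = kmeans(points, start, max_iter=1)
score_early = score(points, cents_early)
print(build_early)  # [0, 0, 1, 1, 2, 2] -> three populated clusters
print(score_early)  # [0, 0, 0, 2, 2, 2] -> cluster 1 receives no rows

# Allowed to run until assignments stop changing.
cents_full, build_full = kmeans(points, start, max_iter=20)
score_full = score(points, cents_full)
print(build_full == score_full)  # True
```

Once the loop runs until the assignments stop changing, the build-time view and the scored view agree.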
Thanks Mark. I tried increasing the number of iterations to 20 (the maximum allowed) and reducing the convergence tolerance to 0.00000000001, but it does not help. I still get only 4 clusters when mapping to the whole data, whereas the clustering output shows 5. For your reference:
Number of clusters: 5
Growth factor: 2
Convergence tolerance: 0.00000000001
Distance function: Euclidean
Number of iterations: 20
Min% attribute rule support: 0.1
Number of histogram bins: 10
Split criterion: Size
A few questions:
1) Is the data the same as that used to build the model?
2) Are you using Data Miner or the API?
3) What version of Data Miner and DB are you using?
4) Is the data being prepared differently for the build vs. apply?
Please find the answers below:
1.) Yes, it is the same data that was used to build the model.
2.) I am using Data Miner (Oracle SQL Developer).
3.) I am not able to find the version.
4.) The data is the same for build and apply.
There is a bug that was restricting the number of iterations to 20 which consequently prevented proper convergence.
This is fixed in the upcoming 188.8.131.52 release and of course in the 12.1 release.
One other thing.
If you want to provide us the data, we can validate it against the 184.108.40.206 code base to be sure there is no other issue at play.
You can either file an SR with Oracle Support, or place the data and the workflow somewhere accessible to us (maybe Google Docs).
Thanks Mark. I cannot share the data since I am working with confidential data.
One more question: if the clustering model was built on 10,000 customers and we want to score a population of 15,000, I was not able to do this by connecting the Apply node to the 15,000-customer data. Is it not possible in Oracle to score data that has more customers than the model was built on?
In addition, when I scored the 15K data against the clusters that were built on 10,000 customers, the whole population of 15K ended up in only one cluster.
Does O-Cluster suffer from the same issue? Can we feed centroids to K-Means as the starting point?
The first question regarding whether there is some row limit to the number of rows that can be scored with a cluster model: No. There is no such limit. I wonder how you are validating the result is not the full set of rows? Perhaps you are using the View Data option and just viewing a sample of the result off of the Apply node?
The second question is confusing to me as here you state that all 15k were scored. This is inconsistent with your first question...
It is pretty hard to evaluate your use case as we don't have a view of the overall methodology that you are following.
1. Make sure you are transforming the score data the same way you are for the build.
2. You can do some manual validations by comparing some of the scored rows with the cluster model rules to see if the score makes sense.
Thanks for the reply, and sorry for the confusion. Let me explain in detail. (The 15K and 10K numbers are hypothetical.)
Clustering: I had data for 15,000 customers. I performed outlier treatment and removed the customers with outlier values (done in SQL), which left 10,000 customers. I built a clustering model on those 10,000 customers, which created 4 clusters with a good distribution in each.
Scoring: My objective now is to score all 15,000 customers (including the outliers). So I used an Apply node and connected it to the 15,000-customer data and the cluster model (which has 4 cluster IDs).
Then I used an Aggregate node to check the count of customers in each cluster, and I saw that all 15,000 customers were falling into 1 cluster, which means ODM is not able to assign the 15,000 households across the 4 clusters.
Does O-Cluster face the same issue?
What is the difference between O-Cluster and K-Means (which uses a distance-based algorithm)? How should we decide where to use O-Cluster and where to use K-Means?
When we use O-Cluster, the means of the clustering variables in each cluster do not come out correctly when we look at the tree. Is this an issue with O-Cluster?
Is it possible to fix the centroids as the starting point for scoring?
OK, I see what the numbers now reflect.
For the cluster mean, see the following posting that explains how to interpret the mean of a cluster: Re: Error on Clustering Result
For OCluster, can you run a test to see if the same scoring of all rows to cluster 1 occurs?
I'll see if I can provide some advice on why users choose between OCluster vs. KMeans.
In general, I think it would depend on the data and what provides the user the best result.
It is really unclear how your model could be behaving as described, and it is difficult to decipher the details without more information.
I would recommend that you open an SR with Oracle Support, which would allow you to provide data on a secured basis.
Aside from that, the best we can say is that the data you are scoring seems to be quite different from the data that the model was built with.
The other possibility is that the Apply node is set to predict only for cluster 1.
The default is to predict the most likely cluster ID along with its probability, but it can be changed to select a specific cluster ID.
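The difference between the two Apply settings can be sketched in a few lines of Python. The per-row probabilities are invented and the function names are hypothetical, not Data Miner APIs. The point is only that asking for one specific cluster makes every row report that cluster ID, which can look as if all rows were assigned to it.

```python
# Hypothetical cluster probabilities for three scored rows
# (cluster id -> probability of membership).
rows = [
    {1: 0.10, 2: 0.70, 3: 0.20},
    {1: 0.55, 2: 0.25, 3: 0.20},
    {1: 0.05, 2: 0.15, 3: 0.80},
]

def most_likely_cluster(probs):
    """Default-style output: the cluster with the highest probability."""
    cluster_id = max(probs, key=probs.get)
    return cluster_id, probs[cluster_id]

def fixed_cluster(probs, cluster_id):
    """Specific-cluster output: always report the requested cluster's probability."""
    return cluster_id, probs[cluster_id]

default_output = [most_likely_cluster(r) for r in rows]
fixed_output = [fixed_cluster(r, 1) for r in rows]

print(default_output)  # [(2, 0.7), (1, 0.55), (3, 0.8)]
print(fixed_output)    # [(1, 0.1), (1, 0.55), (1, 0.05)]
```

In the second output every row carries cluster ID 1, but the low probabilities show that most rows do not actually belong there.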
As a point of reference, can you provide the db version that you are using?
You can run the following query:
select * from product_component_version;
The version of SQL Dev should be viewable if you go to the menu option: Help->About
You should see the version of SQL Dev displayed in this dialog.
You can then select the Extensions tab in the same dialog and scroll down to the Data Miner extension to see its version number.
Please find below the details:
Oracle SQL Dev - 3.1.07
NLSRTL - 220.127.116.11.0
Oracle database enterprise edition-18.104.22.168.0
PL/SQL - 22.214.171.124.0
TNS For Linux - 126.96.36.199.0
I have performed the scoring using SQL code now. Thanks.
There is a more current SQL Dev available:
SQL Developer 3.2.2 RTM Version 3.2.20.09 Build MAIN-09.87
You can download it and it will upgrade your existing repository.
The one you are using is fairly current though, released in 2/2012.
The db version is fine.
Are you having better success with the SQL code?