Here is the response from our algorithm team regarding the K-Means issue:
This has come up before and is related to convergence. The model has not quite converged, so the scoring reflects the next iteration of the algorithm.
The fix is to increase the number of iterations to ensure convergence, and possibly to decrease the tolerance (the convergence criterion) as well.
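The interaction between these two settings can be sketched with a toy K-Means loop (an illustration only, not ODM's actual implementation): the loop stops either when the iteration cap is reached or when the centroids move less than the tolerance, so raising the first and lowering the second both push the model toward full convergence.

```python
import numpy as np

def kmeans(X, k, max_iter=20, tol=1e-11, seed=0):
    """Toy K-Means: illustrates the max_iter / tol stopping rule."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        # assign each row to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # convergence criterion: stop once total centroid movement < tol
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```

With a tolerance as small as 1e-11, the loop effectively runs until the centroids stop moving at all, or until max_iter is exhausted.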
Thanks Mark. I tried increasing the number of iterations to 20 (the maximum allowed) and reducing the convergence tolerance to 0.00000000001, but it does not help. I am still getting only 4 clusters when mapping to the whole data set, whereas the clustering output shows 5. For your reference, the settings were:
Number of clusters: 5
Growth factor: 2
Convergence tolerance: 0.00000000001
Distance function: Euclidean
Number of iterations: 20
Min % attribute rule support: 0.1
Number of histogram bins: 10
Split criterion: Size
Thanks Mark. I cannot share the data since I am working with confidential data.
One more question: if the clustering model was built on 10,000 customers and we want to score a population of 15,000, I was not able to do this using the Apply node on the 15,000-customer data. Is it not possible in Oracle to score a data set that has more customers than the one the model was built on?
Regarding your first question, whether there is a row limit on the number of rows that can be scored with a cluster model: no, there is no such limit. How are you validating that the result is not the full set of rows? Perhaps you are using the View Data option and only viewing a sample of the result from the Apply node?
The second question is confusing to me, as there you state that all 15k rows were scored, which is inconsistent with your first question...
It is pretty hard to evaluate your use case as we don't have a view of the overall methodology that you are following.
1. Make sure you are transforming the score data the same way as the build data.
2. You can do some manual validation by comparing some of the scored rows with the cluster model rules to see whether the scores make sense.
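Point 1 can be sketched as follows (hypothetical data; the key point is that the transformation parameters are computed once on the build set and then reused unchanged on the score set):

```python
import numpy as np

# Build set: the cleaned rows the model was trained on.
build = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
# Score set: includes an outlier row that was excluded from the build.
score = np.array([[2.0, 250.0], [50.0, 9000.0]])

# Fit the transformation (z-score here) on the BUILD data only.
mu, sigma = build.mean(axis=0), build.std(axis=0)

build_scaled = (build - mu) / sigma
# Reuse the build parameters on the score data. Refitting mu/sigma on the
# score set (which still contains the outliers) would shift every row and
# can make the scored rows collapse toward a single cluster.
score_scaled = (score - mu) / sigma
```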
Thanks for the reply, and sorry for the confusion. Let me explain in detail (the 15K and 10K numbers are hypothetical).
Clustering: I had data for 15,000 customers. I performed outlier treatment and removed the customers with outlier values (done in SQL), which left me with 10,000 customers. I built a clustering model on those 10,000 customers, which created 4 clusters with a good distribution in each.
Scoring: Now my objective is to score all 15,000 customers (including the outliers), so I used an Apply node and connected it to the 15,000-customer data and the cluster model (which has 4 cluster IDs).
Then I used an Aggregate node to check the count of customers in each cluster and saw that all 15,000 customers fell into a single cluster, which means ODM is not able to assign the 15,000 households to the 4 clusters.
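For reference, a plain nearest-centroid assignment (a simplified stand-in for distance-based scoring; the centroids and rows below are made up) still spreads rows across all clusters even when an extreme outlier is present — the outlier simply lands in whichever cluster is least far away. All rows collapsing into one cluster therefore usually points to a preprocessing mismatch rather than a row-count limit:

```python
import numpy as np

# Hypothetical centroids of a 4-cluster model built on cleaned data.
centroids = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 5.0]])
# Score rows, scaled the same way as the build data; last row is an outlier.
score = np.array([[0.1, 0.2], [4.8, 0.1], [0.2, 5.1], [100.0, 100.0]])

# Assign each row to its nearest centroid (Euclidean distance).
dists = np.linalg.norm(score[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)
# The outlier (100, 100) is assigned to its nearest centroid, (5, 5),
# while the other rows keep their own distinct clusters.
```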
Does O-Cluster face the same issue?
What is the difference between O-Cluster and K-Means (which uses a distance algorithm)? How should we decide when to use O-Cluster and when to use K-Means?
When we use O-Cluster, the means of the clustering variables in each cluster do not come out correctly when we look at the tree. Is this an issue with O-Cluster?
Is it possible to fix the centroids as the starting point for scoring?
OK, I see what the numbers now reflect.
For the cluster mean, see the following posting that explains how to interpret the mean of a cluster:
Re: Error on Clustering Result
For O-Cluster, can you run a test to see whether the same scoring of all rows to cluster 1 occurs?
I'll see if I can provide some advice on how users choose between O-Cluster and K-Means.
In general, I think it would depend on the data and what provides the user the best result.
It is really unclear how your model could be behaving as described, and it is difficult to decipher the details without more information.
I would recommend that you open a SR with Oracle Support which would allow you to provide data on a secured basis.
Aside from that, the best we can say is that the data you are scoring with seems to be quite different from the data that the model was built with.
The other possibility is that the Apply node is set to just predict for cluster 1.
The default is to predict the most likely cluster ID along with its probability, but it can be changed to select a specific cluster ID.
As a point of reference, can you provide the db version that you are using?
You can run the following query:
select * from product_component_version;
The version of SQL Dev should be viewable if you go to the menu option: Help->About
You should see the version of SQL Dev displayed in this dialog.
You can then select the Extensions tab in the same dialog and scroll down to the Data Miner extension to see its version number.
There is a more current SQL Dev available:
SQL Developer 3.2.2 RTM Version 3.2.20.09 Build MAIN-09.87
You can download it and it will upgrade your existing repository.
The one you are using is fairly current though, released in 2/2012.
The db version is fine.
Are you having better success with the SQL code?