I’m trying to run a simple K-means clustering on a small dataset. However, one of the numeric variables (number of children in the household, with values 0-4) is being converted to a character by the clustering. Is there a way to stop the algorithm from converting the variable to a character?
So you must be using ODMr and have the "automatic data usage" setting on.
There is a heuristic that casts a number to a varchar if it has 5 or fewer distinct values.
You can turn this off at the model level in the advanced dialog of the build node editor.
Then ensure that the mining type for the column you are interested in is set to numeric.
I'm using Oracle Data Miner version 18.104.22.168.0, and the setting is under Preferences --> Data.
It is directly under the option that was mentioned earlier, and the description is "Max unique Count for Categorical Strings". I'm changing this option at the moment, but I want to understand what impact it may have on how Data Miner creates models.
I did not realize that you were using the old client.
You must be running on a db that is 11.1 or earlier, otherwise I would suggest using the new UI integrated with SQL Dev.
I pasted in the online help available from the Preference UI below.
The Mining Activities in Data Miner Classic have heuristics used to determine what mining type a column should be (categorical, numerical) or whether to include a column as input for a model etc.
In the older versions of ODM, the models did not perform the type of internal transformations that are available in the later releases.
Another aspect of Mining Activities you will notice is that a number of transformation steps are included as part of the mining activity.
Depending on what algorithm is used, the number and type of transformations in the activity will change.
If you run the wizard associated with the model build step, you will be able to override the default behavior provided by the heuristics.
Hope this helps.
Use the Data tab to specify the characteristics of data profiling.
The first set of constants specifies characteristics of unique values:
Percent Unique Threshold: Specify the percent of distinct values that a numerical attribute must have in order to be considered as having unique values. The default value is .97 (97%). This default means that 97 percent of the values are different values. (The remaining 3% of values may contain duplicates.) This value must be of type NUMBER.
Percent Unique Categorical Threshold: Specify the percent of distinct values that a categorical attribute must have in order to be considered as having unique values. The default value is .97 (97%). This default means that 97 percent of the values are different values. (The remaining 3% of values may contain duplicates.) This value must be of type NUMBER.
Max Unique Count for Categorical Numbers: A categorical attribute that represents a number should not take on more than this number of unique values. The default is 5. This value must be an integer.
Max Unique Count for Categorical Strings: A categorical attribute that represents a string should not take on more than this number of unique values. The default is 60. This value must be an integer. This value applies to all algorithms.
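To make the effect of these constants concrete, here is a rough Python sketch of how such thresholds could decide a column's mining type. It is not Data Miner's actual code: the function and constant names are made up for illustration, and only the default values (.97 and 5) come from the help text above.

```python
# Illustrative only: names are hypothetical, defaults are from the help text.
PERCENT_UNIQUE_THRESHOLD = 0.97            # "Percent Unique Threshold"
MAX_UNIQUE_COUNT_CATEGORICAL_NUMBERS = 5   # "Max Unique Count for Categorical Numbers"


def looks_unique(values, threshold=PERCENT_UNIQUE_THRESHOLD):
    """Return True if the share of distinct values reaches the threshold."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False
    return len(set(non_null)) / len(non_null) >= threshold


def guess_numeric_mining_type(values, max_unique=MAX_UNIQUE_COUNT_CATEGORICAL_NUMBERS):
    """Guess the mining type of a numeric column from its distinct count.

    Assumption: a numeric column with no more than max_unique distinct
    values is treated as categorical, otherwise as numerical.
    """
    distinct = len({v for v in values if v is not None})
    return "categorical" if distinct <= max_unique else "numerical"


# Number of children per household, values 0-4: five distinct values.
children = [0, 1, 2, 3, 4, 2, 1, 0, 3]
print(guess_numeric_mining_type(children))                 # categorical
print(guess_numeric_mining_type(children, max_unique=4))   # numerical
```

Under these assumptions, a column with values 0-4 has exactly five distinct values, so it falls on the categorical side of the default limit. Lowering the limit below 5 (or setting the mining type explicitly in the wizard) keeps it numerical.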
The next set of constants specifies Warn When limits related to numbers of bins. If your data exceeds one of these limits, you are warned. For example, if you have more than 250 bins for a categorical attribute, you are warned.
Categorical Bins Exceed: The number of bins for a categorical attribute should not exceed this integer value; the default is 250.
Numerical Bins Exceed: The number of bins for a numerical attribute should not exceed this integer value; the default is 250.
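The Warn When limits can be pictured the same way. The sketch below is only an illustration of the rule described in the help text: the function name and message wording are hypothetical, and the two defaults of 250 are taken from the text above.

```python
# Illustrative only: names and message text are hypothetical.
WARN_LIMITS = {
    "categorical": 250,  # "Categorical Bins Exceed" default
    "numerical": 250,    # "Numerical Bins Exceed" default
}


def bin_warning(mining_type, bin_count, limits=WARN_LIMITS):
    """Return a warning string if bin_count exceeds the limit, else None."""
    limit = limits[mining_type]
    if bin_count > limit:
        return (f"{mining_type} attribute has {bin_count} bins, "
                f"which exceeds the limit of {limit}")
    return None


print(bin_warning("categorical", 251))  # a warning string
print(bin_warning("numerical", 250))    # None, the limit is not exceeded
```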