Our customer required us to implement a very simple ranking scheme for the indexed XML documents. Each occurrence of the search term is weighted depending on the tag in which it is contained. Most tags simply score 1 by default; some (about a dozen) special tags have higher scores. The final score is the sum of the scores of all occurrences.
This works pretty well so far, but there are still two limitations:
First, Oracle doesn't allow weights greater than 10. As mentioned above, we currently have about a dozen special tags. We want to make a clear distinction between the weighting of each tag (e.g. doubling the weight from each tag to the next higher one). Obviously, a maximum weight of 10 is not enough, even if we started with a weight of 0.1 (the minimum weight that Oracle allows).
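To make the arithmetic behind the first limitation concrete, here is a rough Python sketch. It assumes exactly 12 tags (the post only says "about a dozen") and a strict doubling scheme; the constant names are made up for illustration.

```python
# Weights for a dozen tags, each double the previous one,
# starting from 0.1 (the minimum weight named above).
MIN_WEIGHT = 0.1
MAX_WEIGHT = 10.0  # the per-weight cap described above
NUM_TAGS = 12      # assumption: "about a dozen" taken as exactly 12

weights = [MIN_WEIGHT * 2 ** i for i in range(NUM_TAGS)]

# Collect the weights that a single weight operator could not express.
over_cap = [w for w in weights if w > MAX_WEIGHT]

print("highest weight:", weights[-1])
print("weights above the cap:", len(over_cap))
```

Under these assumptions the top tag would need a weight of about 204.8, and 5 of the 12 weights fall above the cap of 10.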
Second, the maximum total score is 100. That means we cannot distinguish between documents that have a large number of occurrences within highly weighted tags, since they easily score more than 100.
Is there any way to circumvent these limitations? I wonder why one cannot use arbitrarily large weightings and scores.
1. You can "stack" multipliers, so if you want to multiply by 50 you can use (expression)*5*10
2. Using a query template with SCORE DATATYPE="FLOAT" allows more fine-grained scores. The score is still limited to 100.
Perhaps something like:
<query>
  <textquery>dog*9.3*3.5 OR cat*1.7*2.5</textquery>
  <score datatype="FLOAT" algorithm="COUNT"/>
</query>
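The stacking trick from point 1 can be generated mechanically. The sketch below is only an illustration: `stacked_weight` is a hypothetical helper, not part of any Oracle API, and it assumes each individual multiplier must lie between 0.1 and 10, as stated in the question.

```python
def stacked_weight(expression: str, target: float,
                   cap: float = 10.0, floor: float = 0.1) -> str:
    """Split `target` into chained multipliers that each stay within [floor, cap].

    Hypothetical helper: it only builds the query string.
    """
    if target < floor:
        raise ValueError("target weight is below the minimum allowed weight")
    factors = []
    remaining = float(target)
    # Peel off factors of `cap` until the remainder fits in one multiplier.
    while remaining > cap:
        factors.append(cap)
        remaining /= cap
    factors.append(remaining)
    # ':g' drops trailing zeros (10.0 -> "10") and keeps 6 significant digits.
    return f"({expression})" + "".join(f"*{f:g}" for f in factors)


print(stacked_weight("dog", 50))
print(stacked_weight("cat", 204.8))
```

This only addresses the per-weight cap of 10; the resulting score is still limited to 100.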
Yes, you have to use a template to specify a SCORE DATATYPE.
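If the template is assembled in application code, building it with an XML library avoids escaping problems when the search expression contains characters such as & or <. A minimal sketch using Python's standard library follows; `build_query_template` is a made-up name, and the element layout (a `query` root with `textquery` and `score` children) follows the query template format as I understand it.

```python
import xml.etree.ElementTree as ET


def build_query_template(text_query: str) -> str:
    """Return a query template string with a FLOAT score (hypothetical helper)."""
    query = ET.Element("query")
    textquery = ET.SubElement(query, "textquery")
    textquery.text = text_query  # ElementTree escapes &, < and > on output
    ET.SubElement(query, "score", datatype="FLOAT", algorithm="COUNT")
    return ET.tostring(query, encoding="unicode")


template = build_query_template("dog*9.3*3.5 OR cat*1.7*2.5")
print(template)
```

The resulting string would then be passed as the query argument of CONTAINS, ideally as a bind variable.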
I don't know why the score is limited to 100. There are good reasons to limit it to something, but I guess 100 is somewhat arbitrary. It is probably that way simply because it always has been, and some applications may rely on it, so changing it would break backward compatibility.
For example, a common technique when you want to filter by some criterion without having that criterion affect the score is:
(dog AND cat) AND (yes WITHIN published)*10*10
Because AND returns the lower of the two sides, we can be sure that "yes WITHIN published" will saturate at 100 and NOT affect the score of (dog AND cat).
If we suddenly changed things such that (dog AND cat) could score 200, then it's possible that the right-hand side of this expression would be lower than the left, and hence the score would be based on the filter criterion (yes WITHIN published) rather than on the main search.
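The reasoning can be checked with a toy model. This is not how Oracle computes scores internally; it only encodes the two rules stated above (AND returns the lower of its two sides, and scores saturate at a cap). All function names and input scores are invented for illustration.

```python
MAX_SCORE = 100


def weighted(score: float, *multipliers: float, cap: float = MAX_SCORE) -> float:
    """Apply stacked multipliers, then saturate at `cap` (toy model)."""
    for m in multipliers:
        score *= m
    return min(score, cap)


def and_score(left: float, right: float) -> float:
    """AND returns the lower of the two sides."""
    return min(left, right)


main = 37                          # invented score for (dog AND cat)
flt = weighted(4, 10, 10)          # (yes WITHIN published)*10*10 saturates at 100
print(and_score(main, flt))        # the main search decides the result

# Hypothetical change: the main search may exceed 100, the filter may not.
big_main = 150
print(and_score(big_main, flt))    # now the filter decides the result
```

With today's cap the filter side is always at least as high as the main side, so it never changes the result. In the hypothetical case the minimum comes from the filter instead.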