I'm writing a program in Java in which an article is matched against the top 10 topic-related results in Google using similarity algorithms.
Now let us assume that the original article is matched against the first one and the similarity is 0.98 (nearly identical). Against article 2 we get 0.86, against article 3 we get 0.92 and so on...
For this program, I decided to also consider the PageRank (PR) score for each of the 10 Google entries. So for example the PR for the first entry is 8, the PR for the second article is 6 and so on...
My question is how to use the PR score as a weight on the similarity score. For example, for the first match I could say 0.98 * 8 = 7.84, but this score doesn't make much sense: it is no longer on the 0 to 1 scale of the similarity. I need the PR score to serve as a weight on the similarity score, but my idea of just multiplying the two is not very useful.
Can anyone suggest how I can do this, please?
One question: what is your program's purpose? From your description, I think it is to find the most similar topic. If I'm right, why do you care about Google's rank? I mean, Google's top 10 just gives you the 10 most topic-related pages; it doesn't mean their content is the best match for your original topic, right? For example, your program gets all top 10 pages and none of them has a score above 9, but then it finds that the 11th topic has a score of 9.5! If your program is correct, I think this 11th topic should be at the top of your list.
One problem I can see is that your program might find that two topics are very similar when they are actually talking about different things! But that's an issue with your program...
Yes, this is just an initial version; it has many issues and problems that need to be investigated further.
You're right about the PageRank issue, but the thing is that the 11th topic would probably be less related to the topic than the first ten. We also have to keep in mind that Google considers not only keywords but also inbound links from authoritative sources. So a high PageRank score, although it does not necessarily mean the article is the best one, gives a rough indication of the quality of the content. And we would also have an aggregation from 10 different locations, not just one, so the ranking should hopefully be more accurate.
However, I found a solution to the weight problem by using logarithms for the pagerank scores.
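For anyone with the same problem, here is a rough sketch of what a logarithmic PageRank weight could look like. It is not necessarily the exact formula I ended up with: the class and method names (PageRankWeighting, logWeight, weightedScore, weightedMean) are made up for illustration, and it assumes PR is an integer on a 0 to 10 scale. The weight log(1 + PR) / log(1 + 10) lies between 0 and 1, so the weighted score stays on the same scale as the similarity, and a gap between two PR values changes the weight less at the high end of the scale than at the low end.

```java
public class PageRankWeighting {

    // Assumption: PageRank is reported as an integer from 0 to 10.
    static final int MAX_PAGE_RANK = 10;

    // Maps a PageRank of 0..10 to a weight in [0, 1] on a logarithmic scale.
    static double logWeight(int pageRank) {
        if (pageRank < 0 || pageRank > MAX_PAGE_RANK) {
            throw new IllegalArgumentException("PageRank must be between 0 and 10");
        }
        return Math.log(1.0 + pageRank) / Math.log(1.0 + MAX_PAGE_RANK);
    }

    // Similarity (0..1) scaled by the log weight; the result stays in [0, 1].
    static double weightedScore(double similarity, int pageRank) {
        return similarity * logWeight(pageRank);
    }

    // Weighted mean of the similarities over all results, using the log
    // weights. Returns 0 when every weight is 0 (all PageRanks are 0).
    static double weightedMean(double[] similarities, int[] pageRanks) {
        if (similarities.length != pageRanks.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        double weightedSum = 0.0;
        double weightTotal = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            double w = logWeight(pageRanks[i]);
            weightedSum += w * similarities[i];
            weightTotal += w;
        }
        return weightTotal == 0.0 ? 0.0 : weightedSum / weightTotal;
    }

    public static void main(String[] args) {
        // The first two results from the question: similarity 0.98 with PR 8,
        // and similarity 0.86 with PR 6.
        double[] similarities = {0.98, 0.86};
        int[] pageRanks = {8, 6};

        for (int i = 0; i < similarities.length; i++) {
            System.out.printf("Result %d: weighted score = %.4f%n",
                    i + 1, weightedScore(similarities[i], pageRanks[i]));
        }
        System.out.printf("Aggregate (weighted mean) = %.4f%n",
                weightedMean(similarities, pageRanks));
    }
}
```

The per-result weighted score is useful for ranking the individual matches, while the weighted mean gives a single aggregate figure over all the results. Whether log(1 + PR) is the right curve is a judgment call; a different base or offset only changes how strongly PR pulls the score down.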