Wow, interesting program.
One question: what is your program's purpose? From your description, I think it finds the most similar topics. If I'm right, why do you care about Google's rank? I mean, Google's top 10 just gives you the 10 most topic-related pages; it doesn't mean their content is the best match for your original topic, right? For example, say your program fetches all top 10 pages and none of them has a score above 9, but it then finds that the 11th topic has a score of 9.5! If your program is correct, I think this 11th topic should be at the top of your list.
One problem I can see is that your program might rate two topics as very similar when they're actually talking about different things! But that's an issue for your program...
Yes, this is just an initial version; it has many issues and problems that need to be investigated further.
You're right about the pagerank issue, but the 11th topic would probably be less related to the topic than the first ten. We also have to keep in mind that Google considers not only keywords but also inlinks from authoritative sources. So a high pagerank score, although it doesn't necessarily mean the article is the best one, gives a rough indication of the content's quality. And we would be aggregating over 10 different locations, not just one, so the ranking should hopefully be more accurate.
However, I found a solution to the weight problem by using logarithms for the pagerank scores.
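To make the logarithm idea concrete, here is a minimal sketch in Python. It is only an illustration under assumptions, not the actual program: the names `log_weight` and `aggregate`, the use of `log(1 + score)`, and the weighted-mean aggregation are all hypothetical choices, and the input is assumed to be a list of (pagerank score, similarity score) pairs for the fetched pages.

```python
import math


def log_weight(pagerank_score):
    """Hypothetical weighting: dampen a non-negative raw score with log(1 + x)
    so that pages with very large scores do not dominate the aggregate."""
    return math.log1p(pagerank_score)


def aggregate(pages):
    """Weighted mean of similarity scores.

    `pages` is assumed to be a list of (pagerank_score, similarity) pairs,
    e.g. one pair per page in the top 10.
    """
    weights = [log_weight(pr) for pr, _ in pages]
    total = sum(weights)
    if total == 0:
        # No page carries any weight; fall back to 0.0 rather than divide by zero.
        return 0.0
    return sum(w * sim for w, (_, sim) in zip(weights, pages)) / total
```

With raw scores, a page scoring 1000 would count 100 times as much as a page scoring 10; with the log weighting above it counts roughly 3 times as much, so no single page swamps the other nine.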