ErichG wrote: I don't know about Perl, but when it comes to Java, there's really only one (open source/free) web crawler worth mentioning: Heritrix.
Quick question: if I had to write a program to spider websites for specific information in volume (thousands of websites a week, using a DB of URLs), which language is better suited to the task, Perl or Java? I've been looking into both and trying to come to a conclusion one way or the other.
Thanks in advance.
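For what it's worth, since Java 11 the standard library alone is enough for a very simple fetch-and-extract spider; here's a minimal sketch (the URL is a placeholder, and the `href` regex is a naive stand-in for a real HTML parser):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MiniSpider {
    // Very naive link extractor; a real crawler should use an HTML parser.
    static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Placeholder URL: in practice, substitute one from your DB of URLs.
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://example.com/")).build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        extractLinks(response.body()).forEach(System.out::println);
    }
}
```

That said, for thousands of sites a week you'd want politeness delays, robots.txt handling, and retry logic, which is exactly what a crawler like Heritrix gives you for free.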
ErichG wrote: *nix, Solaris, or Windows: it runs on any of them.
OK, I need one more bit of information on this open-source Java program; sorry for the delay, but it takes a while to go over the documentation. From what I understand it runs on Linux; can it also run on another UNIX variant such as Sun Solaris?
Also, does it just bring the web pages over to a local directory so you can extract the desired content from them (using Java regex), or does the program keep the web pages in place and just send the desired content back?
I don't know what you mean by that.
prometheuzz wrote: I think I might, but if I do, it means the OP doesn't have much knowledge of how HTTP works. Of course all of the data must be downloaded before your program can run on it; your program isn't actually running on all of those servers, it's running on your own computer. Expecting it to do otherwise would imply some serious security flaws, no? Asking it to get you the information you want without allowing it access to that information is as impossible as asking a software engineer to design a software system without giving him any requirements (oh wait, that happens daily).
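To make prometheuzz's point concrete: the regex only ever runs on HTML that's already sitting on your own machine. A minimal sketch of the extraction step (the `<title>` pattern is just an illustrative choice):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {
    // Case-insensitive, DOTALL so the title may span multiple lines.
    static final Pattern TITLE = Pattern.compile(
            "<title>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Runs on HTML that has ALREADY been downloaded to your machine;
    // the regex never touches the remote server.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Hello, crawler</title></head><body/></html>";
        System.out.println(extractTitle(page)); // prints "Hello, crawler"
    }
}
```

So "keeping the pages in place and sending back only the content" isn't an option unless the remote server itself offers such an API; a crawler always pulls the full page first.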
endasil wrote: Note that this half TB came from just one (!) website.
All this brings up another question. Prometheuzz...how the F did you run a spider on 0.5 TB of data?
endasil wrote: I have no idea what we pay, or even if we pay. There is a semi-governmental service in the Netherlands called SurfNet which maintains (owns?) the fiber-optic network used by universities, research institutes, and other non-profit organisations. The server running these crawls was tapped directly into that network with a 10 Gbit/s interface. AFAIK, institutes on that network pay a yearly fee which is not directly related to the gigabytes of data they burn. And of course the government pays part of the bill (probably a substantial amount).
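A quick back-of-the-envelope check on why half a TB isn't crazy on a pipe like that (decimal units assumed, and ignoring protocol overhead and per-server rate limits):

```java
public class TransferTime {
    public static void main(String[] args) {
        // 0.5 TB in decimal units: 0.5e12 bytes = 4e12 bits.
        double bits = 0.5e12 * 8;
        double linkBitsPerSecond = 10e9; // 10 Gbit/s interface
        double seconds = bits / linkBitsPerSecond;
        System.out.printf("%.0f s (~%.1f min)%n", seconds, seconds / 60);
        // -> 400 s (~6.7 min) at full line rate; a real crawl is far
        //    slower because the bottleneck is the crawled servers.
    }
}
```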
I'm frustrated because I've been going over my cable usage cap of 60 GB. What the hell are you (or your company) paying for bandwidth???