11 Replies Latest reply: Mar 23, 2009 4:51 PM by PhHein

    Website Spider / Crawler program

    843785
      Quick question: if I had to write a program to spider websites for specific information in volume (thousands of websites a week, driven by a DB of URLs), which language is better suited to the task, Perl or Java? I've been looking into both and am trying to come to a conclusion one way or the other.

      Thanks in advance.
        • 1. Re: Website Spider / Crawler program
          843785
          I suggest you use a tool which does this already, like wget.
          [http://www.gnu.org/software/wget/manual/wget.html#Recursive-Retrieval-Options]
          • 2. Re: Website Spider / Crawler program
            843785
            OK, the problem I have with wget is this little blurb in its documentation:

            '--spider'
            When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there. For example, you can use Wget to check your bookmarks:

            wget --spider --force-html -i bookmarks.html

            This feature needs much more work for Wget to get close to the functionality of real web spiders.

            Also, the blurbs about security issues and the robots.txt exclusion issues have me a little concerned.
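
            For what it's worth, the "just check that they are there" behaviour is simple enough to reproduce in Java with HEAD requests. A minimal sketch, assuming a plain text file of URLs, one per line (the file name urls.txt is made up), and ignoring robots.txt and real error handling:

                import java.io.BufferedReader;
                import java.io.FileReader;
                import java.net.HttpURLConnection;
                import java.net.URL;

                public class LinkChecker {
                    public static void main(String[] args) throws Exception {
                        // urls.txt is a placeholder name: one URL per line.
                        BufferedReader in = new BufferedReader(new FileReader("urls.txt"));
                        String line;
                        while ((line = in.readLine()) != null) {
                            if (line.trim().length() == 0) continue;
                            URL url = new URL(line.trim());
                            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                            conn.setRequestMethod("HEAD"); // don't download the body, just check it's there
                            conn.setConnectTimeout(5000);
                            conn.setReadTimeout(5000);
                            System.out.println(conn.getResponseCode() + "  " + url);
                            conn.disconnect();
                        }
                        in.close();
                    }
                }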
            • 3. Re: Website Spider / Crawler program
              800282
              ErichG wrote:
              Quick question: if I had to write a program to spider websites for specific information in volume (thousands of websites a week, driven by a DB of URLs), which language is better suited to the task, Perl or Java? I've been looking into both and am trying to come to a conclusion one way or the other.

              Thanks in advance.
              I don't know about Perl, but when it comes to Java, there's really only one serious open-source/free web crawler: Heritrix.
              I've used it to crawl websites of up to half a terabyte (500 GB!) without much hassle!
              • 4. Re: Website Spider / Crawler program
                843785
                OK, I need one more bit of information on this open-source Java program, and sorry for the delay, but it takes a while to go over the documentation. From what I understand it runs on Linux; can it also run on another UNIX variant such as Sun Solaris? Also, does it bring the web pages over to a local directory so the desired content can be extracted from them (using Java regex), or does it leave the web pages in place and just send the desired content back?
                • 5. Re: Website Spider / Crawler program
                  800282
                  ErichG wrote:
                  OK, I need one more bit of information on this open-source Java program, and sorry for the delay, but it takes a while to go over the documentation. From what I understand it runs on Linux; can it also run on another UNIX variant such as Sun Solaris?
                  *nix, Solaris or Windows: it runs on any of them.
                  Also, does it bring the web pages over to a local directory so the desired content can be extracted from them (using Java regex), or does it leave the web pages in place and just send the desired content back?
                  I don't know what you mean by that.
                  • 6. Re: Website Spider / Crawler program
                    843785
                    prometheuzz wrote:
                    Also, does it bring the web pages over to a local directory so the desired content can be extracted from them (using Java regex), or does it leave the web pages in place and just send the desired content back?
                    I don't know what you mean by that.
                    I think I might, but if I do, it means the OP doesn't have much knowledge of how HTTP works. Of course all of the data must be downloaded before your program can run on it: your program isn't running on all of those servers, it's running on your own computer. Expecting otherwise would imply some serious security flaws, no? Asking it to get you the information you want without giving it access to that information is as impossible as asking a software engineer to design a software system without giving him any requirements (oh wait, that happens daily).

                    The spider likely doesn't need to store all of that data in a local directory, though; it can just discard each page from main memory once it has been processed, something like the sketch below.
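
                    A minimal sketch of that download-match-discard cycle, assuming you want page titles (the URL and the regex are placeholders, and regex over HTML is famously fragile, but it shows the idea):

                        import java.io.BufferedReader;
                        import java.io.InputStreamReader;
                        import java.net.URL;
                        import java.util.regex.Matcher;
                        import java.util.regex.Pattern;

                        public class FetchAndExtract {
                            public static void main(String[] args) throws Exception {
                                // Placeholder URL and pattern -- substitute whatever you're after.
                                URL url = new URL("http://www.example.com/");
                                Pattern titlePattern = Pattern.compile("<title>(.*?)</title>",
                                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

                                // The whole page has to come over the wire before we can match on it...
                                StringBuilder page = new StringBuilder();
                                BufferedReader in = new BufferedReader(
                                        new InputStreamReader(url.openStream(), "UTF-8"));
                                String line;
                                while ((line = in.readLine()) != null) {
                                    page.append(line).append('\n');
                                }
                                in.close();

                                // ...but once matched, the page itself can simply be garbage collected.
                                Matcher m = titlePattern.matcher(page);
                                if (m.find()) {
                                    System.out.println("Extracted: " + m.group(1));
                                }
                                // 'page' goes out of scope here -- nothing is written to disk.
                            }
                        }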

                    All this brings up another question. Prometheuzz... how the F did you run a spider over 0.5 TB of data? I'm frustrated because I've gone over my cable usage, which is capped at 60 GB. What the hell are you (or your company) paying for bandwidth???
                    • 7. Re: Website Spider / Crawler program
                      800282
                      endasil wrote:
                      ...
                      All this brings up another question. Prometheuzz... how the F did you run a spider over 0.5 TB of data?
                      Note that this half TB was only from one (!) single website.
                      endasil wrote:
                      I'm frustrated because I've gone over my cable usage, which is capped at 60 GB. What the hell are you (or your company) paying for bandwidth???
                      I have no idea what we pay, or even IF we pay. There is a semi-governmental service in the Netherlands called SurfNet that maintains (owns?) the fibre-optic network used by universities, research institutes and other non-profit organisations. The server running these crawls was tapped directly into that network with a 10 Gbit/s interface. AFAIK, institutes on that network pay a yearly fee which is not directly related to the GBs of data they burn. And of course, the government pays part of the bill (probably a substantial part).
                      But again, (thankfully) I have nothing to do with financial stuff at work.
                      • 8. Re: Website Spider / Crawler program
                        843789
                        That's what I was trying to find out: whether the HTML pages are stored locally in a UNIX file system or whether this is all done in memory. Thanks.
                        • 9. Re: Website Spider / Crawler program
                          843789
                          Neither.
                          You have to decide what to do with the data.
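
                          Heritrix, for instance, pushes every fetched URI through a chain of processors, and what happens to the content is whatever the processors you configure do with it. This is not the real Heritrix API, just a sketch of the shape of that decision:

                              import java.io.FileOutputStream;
                              import java.io.IOException;

                              // Not the real Heritrix API -- just an illustration:
                              // the crawler hands you the bytes, your handler decides their fate.
                              interface FetchedContentHandler {
                                  void handle(String uri, byte[] content) throws IOException;
                              }

                              // Option 1: persist each page to the file system.
                              class SaveToDiskHandler implements FetchedContentHandler {
                                  public void handle(String uri, byte[] content) throws IOException {
                                      String fileName = uri.replaceAll("[^A-Za-z0-9]", "_") + ".html";
                                      FileOutputStream out = new FileOutputStream(fileName);
                                      out.write(content);
                                      out.close();
                                  }
                              }

                              // Option 2: extract what you need and let the page be garbage collected.
                              class ExtractOnlyHandler implements FetchedContentHandler {
                                  public void handle(String uri, byte[] content) {
                                      String page = new String(content);
                                      if (page.contains("<title>")) {
                                          System.out.println(uri + " has a title element");
                                      }
                                      // 'page' and 'content' go out of scope here; nothing hits the disk.
                                  }
                              }

                          Write to disk, index, or extract and discard: the crawler doesn't care, you do.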
                          • 10. Re: Website Spider / Crawler program
                            843789
                            Is there some open-source crawler code that can be used readily?
                            thanks,
                            Kim
                            www.4thejobless.com
                            www.sareeuniverse.com
                            • 11. Re: Website Spider / Crawler program
                              PhHein
                              No link sigs, please.