I'm assuming you'll be using the Web Crawler product in the CAS product suite. There is a "CAS Web Crawler Guide" that should tell you how to specify the URLs to crawl. Here's a link to the doc:
By the way, you can write the Web Crawler's output to a CAS Record Store, then use the Integrator's CAS Record Store Reader component to read the crawled data back out of the Record Store and load it into the Dgraph.
Edited by: Frank on Oct 24, 2012 11:36 AM
CAS is not a firm requirement here, by any means. If you're interested in the raw social media feeds from Twitter, Facebook, YouTube, etc., web crawling may not actually be the way to go. Web crawling parses HTML, which is a fragile, unstable data acquisition approach: if the HTML changes, your web crawl may "break" and need continuous reconfiguration. Almost all of the social media sites on the web offer APIs for acquiring data, which leads to a more robust acquisition approach. Integrator has an SDK with which you can create custom data readers that read directly from the social media sites' APIs. Additionally, rumor has it that the next version of OEID v2.4 will offer a JSON data reader that will read the sparsely attributed data feeds offered by the social media sites.
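To give a rough idea of what "sparsely attributed" means in practice, here is a minimal Python sketch. It is not the Integrator SDK and not the rumored OEID JSON reader; the sample feed, the field names, and the `flatten` and `read_records` helpers are all made up for illustration. It assumes the feed arrives as one JSON object per line and turns each object into a flat record, keeping only the attributes that record actually carries.

```python
import json

# Hypothetical sample feed: one JSON object per line. The field names are
# invented for illustration and do not match any particular site's API.
SAMPLE_FEED = """\
{"id": "1", "text": "Hello", "user": {"name": "alice", "followers": 10}}
{"id": "2", "text": "Hi", "geo": {"lat": 41.9, "lon": -87.6}, "tags": ["a", "b"]}
"""


def flatten(obj, prefix=""):
    """Flatten nested JSON objects into dotted attribute names.

    Null values are dropped, so each record only carries the attributes
    that are actually present (sparse attribution).
    """
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        elif value is not None:
            # Lists are kept as-is; they map naturally to multi-value attributes.
            flat[name] = value
    return flat


def read_records(feed_text):
    """Parse a JSON Lines feed into a list of flat records."""
    records = []
    for line in feed_text.splitlines():
        line = line.strip()
        if line:
            records.append(flatten(json.loads(line)))
    return records


records = read_records(SAMPLE_FEED)

# The union of attribute names across all records; no single record has them all.
all_attributes = sorted({name for record in records for name in record})
```

Note that the two records end up with different attribute sets: only the first has `user.name`, only the second has `geo.lat`. A reader built on an API feed has to tolerate that, but it never depends on the page's HTML layout.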
Here at http://branchbird.com, we've pioneered many of the ingest design patterns around social media and OEID. We actually have custom readers which can consume social media JSON output directly from the APIs of the social media sites themselves or from social media aggregators like DataSift, Gnip, or Oracle's own Collective Intellect. See here for more: http://branchbird.com/blog/consuming-data-as-a-service-with-oeid/
If you must go about this through a web crawl, putting those records into a record store may be overkill. The CAS web crawler is not capable of doing partial web crawls, so, in my experience, writing your crawl output to a record store doesn't buy you anything over writing it to an XML file.
Edited by: Dan at Branchbird on Oct 24, 2012 1:22 PM
Edited by: Dan at Branchbird on Nov 8, 2012 7:17 AM