I'm looking for an automatic process to read data from Facebook and Twitter and feed it into OEID (Oracle Endeca Information Discovery).
For instance, on Facebook I would like to have a company page with posts for new products, etc. I would like to crawl all the comments my followers make on my posts and load them into OEID.
I would also like to crawl my followers' data and their wall posts. And the same for Twitter, of course.
It seems everything I read about Endeca mentions reading unstructured data from social websites, but when I try to find out how to do it, all I can find is "load this .csv", "load this .xls", etc.
Can anyone help here?
Thanks in advance.
OEID provides a great platform for developers to build a number of extensions in various areas of the product, one of which is custom readers to assist in ingesting content from a number of different sources.
You're probably seeing a lot of "load that CSV" because it's the quick and easy way to get a demo/POC up and running (dump your social data to Excel, load it in) and doesn't actually require any development. However, when you're looking to build a robust, production-ready solution, you'll want to streamline that content ingestion process.
When you're integrating social data, you essentially have two options.
1) For each source of content (Facebook, Twitter, YouTube, etc.), you build a content acquisition strategy yourself.
2) You use one of the social data-as-a-service (DaaS) providers, like DataSift or Infochimps, to provide the data for you in an easy-to-consume format. In a vacuum, this doesn't really differ from the CSV/XLS model (*read on*), but it handles the vagaries of the different social APIs for you and provides the data in a key-value-pair (KVP) friendly format.
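To make that "KVP-friendly format" concrete, here's a minimal sketch of flattening a nested interaction record into the flat key/value attributes an OEID record expects. The field names (`type`, `content`, `author.username`) are made up for illustration, not a real DataSift schema:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class KvpFlattener {
    // Recursively flattens a nested record into dotted key/value pairs,
    // e.g. {"author": {"username": "x"}} becomes {"author.username": "x"}.
    public static Map<String, String> flatten(String prefix, Map<String, Object> node) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : node.entrySet()) {
            String key = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            if (e.getValue() instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) e.getValue();
                out.putAll(flatten(key, child));
            } else {
                out.put(key, String.valueOf(e.getValue()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical Facebook comment interaction, not a real payload.
        Map<String, Object> author = new LinkedHashMap<>();
        author.put("username", "acme_fan_42");
        Map<String, Object> interaction = new LinkedHashMap<>();
        interaction.put("type", "facebook_comment");
        interaction.put("content", "Love the new product!");
        interaction.put("author", author);
        System.out.println(flatten("", interaction));
    }
}
```

Each flattened pair maps naturally onto a record attribute during ingestion, which is what makes this shape so much easier to load than raw nested API responses.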
To truly integrate this data in real time and make it production-ready, you'll need something that can handle this data, persist it, and stream it into your OEID instance. There are a few partners out there that can do this, and probably fewer that have actually done it, ourselves included.
As Patrick mentions, going with a service such as DataSift is probably your best bet, but it will take some elbow grease.
The following is a set of steps that illustrate how to implement a solution with DataSift, but the general steps should apply regardless of the service you choose.
DataSift freely provides fairly complete sample code that you can use in your solution. For example, its Java client library (https://github.com/datasift/datasift-java/tree/master) includes code for processing an HTTP stream or a single JSON file.
You can use this code to create a custom CloverETL component, as described in Clover's guide...
With some effort, this should get you to the point where you have a new component in your CloverETL graph that ingests social media interactions, and feeds them into your dgraph.
It's worth noting, however, that there is a key issue with real-time streaming of social media data. The problem is inherent in maintaining any persistent HTTP connection to the DataSift server: while DataSift provides sample code to automate reconnection if the connection drops, there is no reliable way to recover the interactions that were lost while your solution was recovering from the failure.
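The reconnection logic in such sample code typically boils down to retrying with exponential backoff. A simplified sketch of that policy (the delay and cap values here are my own assumptions, not DataSift's documented ones):

```java
public class Backoff {
    // Returns how long to wait before reconnection attempt N:
    // the delay doubles each attempt, up to a fixed cap.
    public static long delayMillis(int attempt) {
        long base = 1000L;     // assumed: 1s after the first failure
        long cap = 320_000L;   // assumed: never wait more than ~5 minutes
        long delay = base << Math.min(attempt, 30); // clamp shift to avoid overflow
        return Math.min(delay, cap);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 10; attempt++) {
            System.out.println("attempt " + attempt + " -> wait "
                + delayMillis(attempt) + " ms");
        }
        // Note: even with backoff, interactions sent while you were
        // disconnected are simply gone. That is the data-integrity gap
        // that a push delivery option closes.
    }
}
```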
So if data integrity is more important than real-time ingestion, you'll likely want to go with a push solution (to an FTP server, for example; see http://dev.datasift.com/docs/push/push-steps) that allows you to store and process every interaction.