Endeca is meant to ingest pre-joined, denormalized records, so don't let the data duplication called out in your concern #1 trouble you. An Endeca index is highly compressed and doesn't actually store the repetitive data the way you might think. You should aim to join the three record types together into one denormalized record on ingest.
Your 2nd concern is a valid one, but there are plenty of approaches that will make this work. If 1.csv, 2.csv and 3.csv all source from the same database, I would start by considering whether they can be joined in a view in the database itself. If not, you can always persist the latest version of 1.csv, so that when 2.csv changes or new values become available, you have 1.csv on hand to 'rejoin'. There are other approaches if these don't work for you, so let me know.
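The "persist the master file and rejoin on each incremental load" idea above can be sketched in a few lines of Python. This is a minimal illustration, not Endeca tooling: the file contents, the join key customer_id, and all column names are assumptions standing in for whatever shared property your real sources use.

```python
# Sketch: join the latest persisted master file (1.csv) with an incremental
# transaction file (2.csv) into denormalized records before Endeca ingest.
# All field names below (customer_id, txn_id, amount, ...) are illustrative.
import csv
import io

master_csv = """customer_id,name,mobile
C1,Alice,555-0100
C2,Bob,555-0101
"""

transactions_csv = """txn_id,customer_id,amount
T1,C1,250.00
T2,C2,99.95
T3,C1,12.50
"""

def denormalize(master_file, txn_file):
    """Join each transaction row to its master record, yielding wide dicts."""
    master = {row["customer_id"]: row for row in csv.DictReader(master_file)}
    for txn in csv.DictReader(txn_file):
        base = dict(master.get(txn["customer_id"], {}))
        base.update(txn)  # one flat, denormalized record per transaction
        yield base

records = list(denormalize(io.StringIO(master_csv),
                           io.StringIO(transactions_csv)))
```

Each resulting record carries both the master attributes and the transaction attributes, which is exactly the pre-joined shape Endeca wants to ingest; when a new 2.csv arrives, you rerun the join against the persisted 1.csv.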
Really appreciate your response.
1, 2 and 3 are different (independent) data sources. Assume 1 is the master data source, holding information about customers (mobile number, name, address, DOB, etc.), while 2 and 3 are transactional sources carrying huge volumes of financial data. They share some common properties that can be used to correlate the master data with the different transactional records.
This information arrives from all sources as incremental updates over time, and we are expected to maintain it.
I understand that using ETL we can perform the join, but:
1. It requires reading all the data sources at ingestion time.
2. We need to maintain up-to-date copies of all data sources outside Endeca for correlation purposes.
3. If an existing master record changes, it requires updates to the earlier ingested records (now lying inside Endeca).
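Concern 3 above, propagating a master-record change into records that were already denormalized and ingested, can be sketched as a partial-update selection step. This is a hypothetical illustration: the record shapes and the customer_id key are assumptions, and the output would feed whatever partial-update mechanism your ingest pipeline uses.

```python
# Sketch: when master records change, find and patch only the previously
# denormalized records that reference them, so the index receives a small
# partial update instead of a full re-ingest. Field names are illustrative.

def records_to_update(denormalized, changed_master):
    """Return patched copies of the denormalized records affected by
    the changed master rows; untouched records are not re-emitted."""
    changed = {m["customer_id"]: m for m in changed_master}
    updated = []
    for rec in denormalized:
        master = changed.get(rec["customer_id"])
        if master is not None:
            patched = dict(rec)
            patched.update(master)  # overwrite the stale master attributes
            updated.append(patched)
    return updated

stored = [
    {"txn_id": "T1", "customer_id": "C1", "name": "Alice", "mobile": "555-0100"},
    {"txn_id": "T2", "customer_id": "C2", "name": "Bob", "mobile": "555-0101"},
]
changes = [{"customer_id": "C1", "name": "Alice Smith", "mobile": "555-0199"}]
to_resend = records_to_update(stored, changes)
```

The cost of this step grows with the number of denormalized records per master record, which is the trade-off the replies in this thread are weighing against query-time joins.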
I am looking for an approach where I can upload the different data sources and define a relationship model among them, so that the correlation is performed on the fly.
Ahh, yes. I've been where you are before...you're smack dab in the middle of the most challenging decisions when it comes to Endeca data modeling and incremental ingest.
The truth of the matter is that you're going to maximize the power of your Endeca application if you denormalize in your data model and join the entities together during ingest. Unless your 3 record types share a lot of dimensionality, you're going to have to produce some query-time gymnastics to allow proper pivoting between your normalized entities/record types. These query-time gymnastics can also be costly both from a development effort and performance impact perspective.
With that in mind, I would strongly urge you to consider the denormalization approach, albeit with its own costly ETL gymnastics.
If you're interested, Branchbird recently blogged about just this topic. You'll find more of the subtle details here: http://branchbird.com/blog/pivoting-in-endeca/
Sorry there isn't a silver bullet here,
As Dan suggested, converting the data into denormalized form before ingesting is a good idea.
Joining multiple source datasets into views or MVs before ingesting into Endeca might be an expensive computation if the data size is huge.
But I feel the overall indexing time will remain almost the same, or even improve.
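The database-side view approach mentioned above can be illustrated with an in-memory SQLite sketch: load (or find) the sources in one database, define a view that does the denormalizing join, and have the ETL read the view over a DB connection instead of flat files. The table and column names here are illustrative assumptions, not your actual schema.

```python
# Sketch: a database view performs the denormalizing join server-side,
# so the ingest pipeline reads pre-joined rows. Schema is illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE master (customer_id TEXT PRIMARY KEY, name TEXT, mobile TEXT);
CREATE TABLE txns (txn_id TEXT PRIMARY KEY, customer_id TEXT, amount REAL);
INSERT INTO master VALUES ('C1', 'Alice', '555-0100'), ('C2', 'Bob', '555-0101');
INSERT INTO txns VALUES ('T1', 'C1', 250.0), ('T2', 'C2', 99.95);

-- The view is what the ETL graph would read, one wide row per transaction.
CREATE VIEW denormalized AS
  SELECT t.txn_id, t.amount, m.customer_id, m.name, m.mobile
  FROM txns t JOIN master m ON m.customer_id = t.customer_id;
""")
rows = con.execute(
    "SELECT txn_id, name FROM denormalized ORDER BY txn_id").fetchall()
```

A materialized view (MV) trades refresh cost for faster reads at ingest time, which is the expense-versus-indexing-time balance discussed above.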
But if you want to do the join in the ETL graph itself, I am not sure which joiner you used.
Is it DBJoin or something else?
I am wondering why DBJoin wouldn't help here.
Is there any reason you want to ingest CSV files rather than reading over a DB connection?
Pradeep K Pathak
Edited by: Pradeep K Pathak on Dec 23, 2012 11:34 AM