4 Replies Latest reply: Dec 23, 2012 1:35 PM by PradeepKPathak

    Endeca information : multiple data source feature

      I evaluated Oracle Endeca with multiple data sources.
      The steps were:
      1. Ingested data from 1.csv, which has attributes (x, y, z)
      2. Ingested data from 2.csv, which has attributes (x, p, q)
      3. Ingested data from 3.csv, which has attributes (p, a, b)
      x is the linking attribute between sources 1 and 2, whereas p is the linking attribute between sources 2 and 3.

      When I search on x, it returns 3 records, and these records look odd in the Results Table component: the 1st record has values in the x, y, z columns while the remaining columns (p, q, a, b) are empty, the 2nd record has values for x, p, q while the remaining columns are empty, and so on.
      Can I expect Endeca to treat these as one logical record (if it allows me to define the linking in advance), so that a search on x, or on any other attribute, follows the linking attributes and returns the related details along the chain?

      Using ETL to join at ingestion time is not desirable because:
      1. It physically creates one denormalized record, and if 1.csv is the master, its information is repeated in all transaction records (2 & 3).
      2. In many cases 1.csv has already been uploaded and 2.csv will be uploaded later.

      Is storing separate records for each input source (as I described) the correct way to use Endeca, or is there a better approach to the problem I am trying to solve?
        • 1. Re: Endeca information : multiple data source feature
          Dan at Branchbird
          Endeca is meant to ingest pre-joined, denormalized records so don't let the resulting data duplication called out in your concern #1 trouble you. An Endeca index is highly compressed and doesn't actually store the repetitive data as you think. You should aim to join the three record types together on ingest into one denormalized record.

          Your 2nd concern is a valid one, but there are plenty of approaches that will allow this to work. If 1.csv, 2.csv, and 3.csv are all sourced from the same database, I would start by considering whether they can be joined in a view in the database itself. If not, you can always persist the latest version of 1.csv so that when 2.csv changes or new values become available, you have 1.csv available to 'rejoin'. There are other approaches if these don't work for you, so let me know.
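
          To make the join-on-ingest idea concrete, here is a minimal sketch in plain Python. It is not Endeca or Integrator code: the attribute names (x, y, z, p, q, a, b) come from the question, while the sample values and the helper names `index_by` and `denormalize` are made up for illustration. It chains 1 to 2 on x and 2 to 3 on p, and keeps a source-2 row even when one side has no match.

```python
from collections import defaultdict


def index_by(rows, key):
    """Group rows (dicts) by the value of one linking attribute."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return grouped


def denormalize(source1, source2, source3):
    """Join 1 -> 2 on x and 2 -> 3 on p, yielding one flat record per match."""
    master_by_x = index_by(source1, "x")
    detail_by_p = index_by(source3, "p")
    for mid in source2:
        # .get() with a [{}] fallback keeps the source-2 row when a side is missing.
        for master in master_by_x.get(mid["x"], [{}]):
            for detail in detail_by_p.get(mid["p"], [{}]):
                yield {**master, **mid, **detail}


# Hypothetical sample rows standing in for 1.csv, 2.csv and 3.csv.
source1 = [{"x": "c1", "y": "Alice", "z": "Pune"}]
source2 = [
    {"x": "c1", "p": "t1", "q": "100"},
    {"x": "c1", "p": "t2", "q": "250"},
]
source3 = [{"p": "t1", "a": "debit", "b": "ATM"}]

records = list(denormalize(source1, source2, source3))
for record in records:
    print(record)
```

          Each output record carries the master attributes next to the transaction attributes, so a search on any attribute lands on a fully populated row instead of a sparse one.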

          • 2. Re: Endeca information : multiple data source feature
            Really appreciate your response.

            1, 2, and 3 are different (independent) data sources. Assume 1 is the master data source holding customer information (such as mobile number, name, address, dob, etc.), and 2 & 3 are transactional sources with huge amounts of financial data. They will have some properties in common that can be used to correlate the master and the different transactional information.

            This information arrives from all sources as updates (incremental information) from time to time, and we are supposed to maintain it.

            I understand that we can perform the join using ETL, but it will require reading all data sources at ingestion time, which:
            1. is a costly affair
            2. requires keeping all data sources up to date outside Endeca for correlation purposes
            3. when an existing master record changes, requires updates to previously ingested records (already in Endeca)

            I am looking for an approach where I upload the different data sources and define a relationship model among them, so that Endeca performs the correlation on the fly.
            • 3. Re: Endeca information : multiple data source feature
              Dan at Branchbird
              Ahh, yes. I've been where you are before...you're smack dab in the middle of the most challenging decisions when it comes to Endeca data modeling and incremental ingest.

              The truth of the matter is that you're going to maximize the power of your Endeca application if you denormalize in your data model and join the entities together during ingest. Unless your 3 record types share a lot of dimensionality, you're going to have to produce some query-time gymnastics to allow proper pivoting between your normalized entities/record types. These query-time gymnastics can also be costly both from a development effort and performance impact perspective.
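
              As a rough illustration of what those query-time gymnastics amount to, here is a sketch in plain Python over in-memory records. It does not use any Endeca query API; the function name `related_records`, the `link_keys` parameter, and the sample values are all hypothetical. It finds the records that match a search, then repeatedly follows the linking attributes x and p to pull in related records of the other types.

```python
def related_records(records, attribute, value, link_keys=("x", "p")):
    """Return the initial hits plus every record reachable via the link attributes."""
    hits = [r for r in records if r.get(attribute) == value]
    seen = {id(r) for r in hits}
    frontier = list(hits)
    while frontier:
        current = frontier.pop()
        for key in link_keys:
            if key not in current:
                continue
            # Every hop rescans the full record set, which is the query-time cost.
            for candidate in records:
                if id(candidate) not in seen and candidate.get(key) == current[key]:
                    seen.add(id(candidate))
                    hits.append(candidate)
                    frontier.append(candidate)
    return hits


# Hypothetical normalized records: one per source row, as in the original question.
records = [
    {"x": "c1", "y": "Alice", "z": "Pune"},
    {"x": "c1", "p": "t1", "q": "100"},
    {"p": "t1", "a": "debit", "b": "ATM"},
    {"x": "c2", "y": "Bob", "z": "Delhi"},
]

chain = related_records(records, "a", "debit")
for record in chain:
    print(record)
```

              Searching on a (which only exists in source 3) has to hop through p and then x to reach the master record, and each hop is extra work at query time that a denormalized record would not need.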

              With that in mind, I would strongly urge you to consider the denormalization approach, albeit with its own costly ETL gymnastics.
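
              For the incremental side of those ETL gymnastics, one possible shape is sketched below in plain Python. It assumes you persist the latest master rows and the transactions already ingested; here in-memory dicts stand in for that persisted store, and the class name `IncrementalJoiner` and its methods are hypothetical, not part of any Endeca tooling. New transactions are joined only against the stored master snapshot, and a master change re-emits just the records that depend on it.

```python
class IncrementalJoiner:
    """Keeps the latest master and transaction rows so updates can be re-joined."""

    def __init__(self):
        self.master = {}        # x -> latest master row (stand-in for persisted 1.csv)
        self.transactions = {}  # p -> latest transaction row, which carries x

    def _joined(self, txn):
        # Master attributes first, then the transaction's own attributes.
        return {**self.master.get(txn["x"], {}), **txn}

    def add_transactions(self, rows):
        """Join only the new rows against the stored master snapshot."""
        out = []
        for row in rows:
            self.transactions[row["p"]] = row
            out.append(self._joined(row))
        return out

    def update_master(self, row):
        """Store the new master row and re-emit every record that depends on it."""
        self.master[row["x"]] = row
        return [
            self._joined(txn)
            for txn in self.transactions.values()
            if txn["x"] == row["x"]
        ]


joiner = IncrementalJoiner()
joiner.update_master({"x": "c1", "y": "Alice", "z": "Pune"})
first = joiner.add_transactions([{"x": "c1", "p": "t1", "q": "100"}])
refreshed = joiner.update_master({"x": "c1", "y": "Alice", "z": "Mumbai"})
print(first)
print(refreshed)
```

              The records returned by `update_master` are the ones that would need to be sent to the index again as updates; everything else stays untouched.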

              If you're interested, Branchbird recently blogged about just this topic. You can find more of the subtle details here: http://branchbird.com/blog/pivoting-in-endeca/

              Sorry there isn't a silver bullet here,
              • 4. Re: Endeca information : multiple data source feature
                As Dan suggested, converting the data into denormalized form before ingesting is a good idea.
                Joining data from multiple sources into views or materialized views (MVs) before ingesting into Endeca might be an expensive computation if the data size is huge.
                But I feel the overall indexing time will remain almost the same or better.

                But if you want to do it in the graph itself, which joiner did you use?
                Is it DBJoin or something else?
                I am wondering why DBJoin wouldn't help here.
                Is there any reason you want to ingest CSV files instead of using a DB connection?

                Pradeep K Pathak

                Edited by: Pradeep K Pathak on Dec 23, 2012 11:34 AM