We are implementing an ATG and Endeca integration. Currently, indexing data in Endeca is handled by the jobs/processes defined in ATG.
Running these jobs/processes from ATG indexes the data in Endeca. The whole process is automated, and there is no need to touch any
configuration in Endeca.
We want to index some external data that does not come from ATG. This data needs to be merged with the product catalog data coming from ATG. Because
indexing currently happens automatically when a job/process is run from the ATG console, we have no control over the Endeca pipeline.
How can we bring the external (non-ATG) data into the Endeca pipeline, merge the two data sets (from ATG and from the external source), and index the result in Endeca?
Edited by: 923763 on Dec 23, 2012 10:09 AM
This is an interesting use case. I have not tried this out, but in my opinion it is doable. See if this helps.
In the endeca\apps folder where you deployed your application, go to <app-name>\config\pipeline. You will see the pipeline that was auto-generated for you. If you open it, you will see a CAS adapter that consumes the data from ATG and forges it into Endeca. You can modify this pipeline to include data from the other source (or create another pipeline that feeds into this one).
The question that remains is whether you will still be able to run it automatically from the ATG admin console, or whether you will have to invoke it manually using the script in the control directory. That I leave for you to find out. Please update us with your findings.
Thanks for sharing the information. I actually tried that initially, but found that when the baseline update is run automatically from the ATG console, the pipeline files inside the config folder are not modified.
Instead, it seems to update the pipeline-related files directly in the /data/processing folder. I compared the pipeline configuration files in the two locations, and they were completely different.
Another problem: even if I import the external data into the pipeline, how will I be able to merge the data sent by ATG with my external data?
I will definitely work on your suggestion, although I have doubts about the solution. Let me know your thoughts.
You should be able to join this information in the pipeline. I'm currently doing this for another project where I "switch join" unstructured content from a system containing support cases and FAQs with catalog data from ATG.
It's definitely not straightforward if you are looking to share attribution across the two sources: the catalog data comes from ATG through CAS, so you need two sets of mapping logic (though you don't need two prop mappers). The difference that you noticed between the data\processing\pipeline.epx pipeline and the config\pipeline\pipeline.epx pipeline is due to the ATG configuration being generated dynamically via the configurationGenerator.epx pipeline. All of the properties and dimensions related to the catalog are generated via an FCM-style configuration generation mechanism.
However, at least on my machine, this process also respects the properties and dimensions that you have created manually in your pipeline. The one issue I've run into is giving a property or dimension the same name as one coming from ATG. When that happens, the integration typically overwrites my manually created property with the ATG one. Other than that, it seems to work fine as long as you avoid conflicts.
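To make the conflict case concrete, here is a minimal sketch of a pre-flight check that looks for name collisions before a baseline update. It is plain Python, not part of ATG or Endeca, and every property name in it is hypothetical; in practice you would have to collect the two name lists yourself from your manual pipeline configuration and from the generated configuration.

```python
def find_name_conflicts(manual_names, generated_names):
    """Return the names defined both manually and by the generated configuration."""
    return sorted(set(manual_names) & set(generated_names))


# Hypothetical property names, for illustration only.
manual = ["support.case_id", "product.brand", "faq.topic"]
generated = ["product.brand", "product.displayName", "sku.price"]

conflicts = find_name_conflicts(manual, generated)
print(conflicts)
```

Any name this reports is one that the generated ATG definition would likely replace, so renaming the manual property is the safer choice.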
That said, I can see "left-joining" to the catalog data as a little "tricky" since you need to dig into your record store to find the source property names before you can create the Record Assembler. I'm lucky in the sense that I just need to switch join so this isn't a concern.
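For anyone unfamiliar with the term, a switch join simply passes the records from every source through without combining them, which is why no join key is needed. The sketch below only illustrates that behaviour in plain Python; it is not Forge configuration, and the record and property names are made up.

```python
def switch_join(*sources):
    """Pass every record from every source through unchanged, in source order."""
    joined = []
    for source in sources:
        for record in source:
            joined.append(dict(record))  # copy so the inputs are not modified
    return joined


# Hypothetical records: catalog data from one source, support content from another.
catalog = [{"product.id": "p1", "product.displayName": "Kettle"}]
support = [
    {"case.id": "c1", "case.title": "Kettle will not switch off"},
    {"faq.id": "f1", "faq.title": "How do I descale a kettle?"},
]

records = switch_join(catalog, support)
print(len(records))
```

The support records stay separate from the product record, so nothing about the catalog's source property names has to be known in advance.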
You can index other data along with the product catalog data. If you look at the pipeline in the Endeca application that is created using the DeploymentTemplatePCI, you will by default have one record adapter named "CASDataFeed". You can modify this default pipeline by adding any number of record adapters (for different types of input, such as xls) and then using a record assembler to combine these record adapters.
Once the pipeline is properly configured, you can still trigger the baseline update from the ProductCatalogSimpleIndexingAdmin component of ATG. As part of the indexing steps, all the DocumentSubmitter components load the data into CAS record store instances, and finally the EndecaScriptService component invokes the baseline_update script in the control directory of the Endeca application.
Because the pipeline is already configured to take both CASDataFeed and the other input feeds (through the other record adapters), all of them will be indexed together.
Had another thought about the joining. If a join other than a switch join is needed, can something be done at the CAS layer? Once ATG pushes data into the record store, can it be manipulated there?
The pipeline structure itself (contained in pipeline.epx) is not fundamentally changed, so you should be OK as long as you're making changes in the config\pipeline folder. The generated configuration is really focused on properties/dimensions and their associated settings (searchability, etc.). So as long as you don't conflict with that, you should be in the clear.
As for manipulating records directly in CAS, though it's technically feasible, I would advise against it. The ATG record store write process could potentially overwrite any changes that you've made to the records. You could try to inject your logic directly into the ATG catalog export process, but I don't know enough about the ATG side of the house to say whether that is possible.
I have been able to successfully merge the external data with the product catalog data coming from ATG.
I used a left join to merge the two data sets.
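For readers who land here later, the effect of a left join can be sketched as follows. This is plain Python that only illustrates the semantics, not the actual record assembler configuration used; the property names and the join key are hypothetical, and letting the catalog value win on a name clash is an assumption made for the sketch.

```python
def left_join(left, right, left_key, right_key):
    """Keep every left record and add the properties of matching right records."""
    by_key = {}
    for rec in right:
        key = rec.get(right_key)
        if key is not None:
            by_key.setdefault(key, []).append(rec)

    joined = []
    for rec in left:
        merged = dict(rec)  # copy so the catalog record is not modified
        for match in by_key.get(rec.get(left_key), []):
            for name, value in match.items():
                if name != right_key:
                    # Assumption: on a name clash the catalog (left) value wins.
                    merged.setdefault(name, value)
        joined.append(merged)
    return joined


# Hypothetical records: catalog data on the left, external data on the right.
catalog = [
    {"product.id": "p1", "product.displayName": "Kettle"},
    {"product.id": "p2", "product.displayName": "Toaster"},
]
external = [
    {"ext.product_id": "p1", "ext.rating": "4.5"},
    {"ext.product_id": "p9", "ext.rating": "3.0"},
]

merged = left_join(catalog, external, "product.id", "ext.product_id")
print(merged)
```

Every catalog record survives whether or not it has external data, while external records with no matching product (p9 here) are dropped.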
Thank you all for your suggestions.