I'm trying to migrate our Endeca 5.1 pipeline to the new Integrator graphs.
As a test, I created a graph in Integrator 2.3 to load 2 sets of data.
1. Each set of data is loaded using a DB_INPUT_Table and a Reformat component. The DB_INPUT_Table loads data from an Oracle database using the Oracle JDBC driver.
2. One data set has 2,000,000 rows and the other has 3,000 rows.
3. These 2 sets of data are then joined using the ExtHashJoin component.
4. Finally, the data is loaded into the data store using the Bulk Add/Replace Records component.
I've configured the JRE -Xmx to be 4096M (4G), but when I run the graph, I keep getting a java.lang.OutOfMemoryError: Java heap space error. Even when I increased the -Xmx value to 6G, the graph processed more records but still ended with the OutOfMemoryError. The test data I'm trying to load is only about 15% of the amount we load in our Endeca 5.1 pipeline.
I'm new to the Endeca Information Discovery tool set. So my questions are:
1. Is Integrator designed to load large amounts of data, or do I need to use a different tool?
2. Are there any white papers on large data load and pipeline deployment in a production environment?
Integrator is a tool that uses a lot of memory, but there are steps you can take to decrease its footprint. We have a document with a few notes on how to reduce its usage. Since your graph sounds pretty simple, if you do have large text fields, I would suggest changing the edge type to "fast propagate edge".
An LDI graph that processes a number of very large text columns uses a lot more memory than expected. What are some ways to reduce the amount of memory consumed?
The use of external jars in JDBC and JMS connections consumes a lot of memory; it is better to add them to the classpath when running the graph.
Some components use more memory than others; examples are ExtSort and FastSort, or ExtHashJoin and ExtMergeJoin. Review the use of these components. Place them in the graph in locations where they will need to process the minimum number of records, and if necessary isolate them in their own phase of the graph.
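To give a sense of the trade-off (a toy sketch in plain Java, not Integrator's actual implementation): an external sort keeps its memory footprint bounded by sorting fixed-size chunks in RAM, spilling each sorted run to a temp file, and then k-way merging the runs, so only one chunk plus one record per run is resident at a time, regardless of total input size.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;

public class ExternalSortSketch {
    // Sorts lines using bounded memory: sort fixed-size chunks in RAM,
    // spill each sorted run to a temp file, then k-way merge the runs.
    static List<String> externalSort(List<String> input, int chunkSize) {
        try {
            List<Path> runs = new ArrayList<>();
            for (int i = 0; i < input.size(); i += chunkSize) {
                // Only chunkSize records are in memory at once here.
                List<String> chunk = new ArrayList<>(
                        input.subList(i, Math.min(i + chunkSize, input.size())));
                Collections.sort(chunk);
                Path run = Files.createTempFile("run", ".txt");
                Files.write(run, chunk);
                runs.add(run);
            }
            // k-way merge: one record per run in memory at a time.
            List<BufferedReader> readers = new ArrayList<>();
            PriorityQueue<String[]> heap =
                    new PriorityQueue<>(Comparator.comparing((String[] e) -> e[0]));
            for (int r = 0; r < runs.size(); r++) {
                BufferedReader br = Files.newBufferedReader(runs.get(r));
                readers.add(br);
                String line = br.readLine();
                if (line != null) heap.add(new String[]{line, Integer.toString(r)});
            }
            List<String> out = new ArrayList<>();
            while (!heap.isEmpty()) {
                String[] top = heap.poll();
                out.add(top[0]);
                String next = readers.get(Integer.parseInt(top[1])).readLine();
                if (next != null) heap.add(new String[]{next, top[1]});
            }
            for (BufferedReader br : readers) br.close();
            for (Path run : runs) Files.deleteIfExists(run);
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(externalSort(Arrays.asList("d", "b", "a", "c"), 2)); // [a, b, c, d]
    }
}
```

A purely in-memory sort, by contrast, must hold every record on the heap at once, which is exactly what hurts when the records carry large text fields.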
Finally, if the graph is run in verbose mode, look for the following message:
INFO [WatchDog] - Edge5 type: buffered
The default edge type is "detected", which leaves it up to LDI to determine the edge type. If you have large text data, the buffered edge will consume a lot of memory; it is better to use the fast propagate edge.
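As a rough analogy for the difference (plain Java, not Integrator internals): a fast propagate edge behaves like a small bounded queue that blocks the writing component when it fills, so only a handful of records are ever in flight, whereas a buffered edge can queue up a large backlog of records, which gets expensive when each record carries wide text fields.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class EdgeSketch {
    // Pushes n records through a small bounded buffer. The writer blocks
    // whenever the buffer is full, so at most CAPACITY records are held
    // in memory no matter how large n is.
    static int pump(int n) {
        final int CAPACITY = 4;
        BlockingQueue<String> edge = new ArrayBlockingQueue<>(CAPACITY);
        Thread writer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) edge.put("record-" + i); // blocks when full
                edge.put("EOF");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();
        int consumed = 0;
        try {
            while (!edge.take().equals("EOF")) consumed++;
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // An unbounded queue here would instead let the backlog grow with
        // however far the writer gets ahead of the reader.
        return consumed;
    }

    public static void main(String[] args) {
        System.out.println("consumed=" + pump(1000)); // consumed=1000
    }
}
```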
I followed your advice and made the following changes:
1. Changed edge type to "fast propagate edge" in the graph.
2. Added the Oracle JDBC jar file to the JRE System Library under the Integrator Preference/JRE Definition settings.
After these changes, I ran the graph and noticed that the memory usage for the DB_INPUT_Table and Reformat components improved. But I still got the same OutOfMemoryError during the ExtHashJoin. Are there any further steps I need to take to resolve this issue? I can increase the memory setting, but I'm just not sure how big it needs to be.
I am quoting the following from the section on ExtHashJoin in the Endeca Information Discovery Integrator Guide (available here: http://docs.oracle.com/cd/E29805_01/index.htm):
This joiner should be avoided in case of large inputs on the slave port. The reason is slave data is cached in the memory.
Tip: If you have larger data, consider using the ExtMergeJoin component. If your data sources are unsorted, use a sorting component first (ExtSort, FastSort, or SortWithinGroups).
As the documentation says, please make sure that your smaller dataset is attached to the slave port. Alternatively, you can use ExtMergeJoin.
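To see why the slave-port choice matters, here is a toy hash join in plain Java (an illustration of the general technique, not Integrator's implementation): the slave (build) side is cached in its entirety in a map, so heap usage grows with the slave input, while the master input just streams through one record at a time.

```java
import java.util.*;

public class HashJoinSketch {
    // Joins master rows [key, value] against slave rows [key, value].
    // The ENTIRE slave input is cached in the map, so heap usage is
    // proportional to the slave side; always attach the smaller
    // dataset there.
    static List<String> join(List<String[]> master, List<String[]> slave) {
        Map<String, String> build = new HashMap<>();
        for (String[] row : slave) build.put(row[0], row[1]); // memory ~ slave size
        List<String> out = new ArrayList<>();
        for (String[] row : master) {                          // master just streams
            String v = build.get(row[0]);
            if (v != null) out.add(row[0] + "," + row[1] + "," + v);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> master = Arrays.asList(
                new String[]{"1", "a"}, new String[]{"2", "b"}, new String[]{"3", "c"});
        List<String[]> slave = Arrays.asList(
                new String[]{"2", "x"}, new String[]{"3", "y"});
        System.out.println(join(master, slave)); // [2,b,x, 3,c,y]
    }
}
```

In your graph, that means the 3,000-row input belongs on the slave port and the 2,000,000-row input on the master port; if the slave side were the 2,000,000-row set, the join would try to cache all of it on the heap, which matches the OutOfMemoryError you are seeing.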