A few things:
- You can increase the number of threads used by Dgidx. In later versions of Endeca, you can add <arg>--threads</arg><arg>4</arg> to the "Dgidx" section of your /config/script/DataIngest.xml file. In older versions, this was in the /config/script/AppConfig.xml.
- You can set the threads to be the number of CPU cores.
- This won't vastly improve your indexing, however. Only certain parts of building the index use the extra cores. But it won't hurt!
What you can really do is examine how many fields are enabled for search. In your pipeline folder, find the file with a name like "discover.recsearch_indexes.xml" (replace "discover" with your app name).
That file lists all of the fields enabled for keyword search. Removing entries will reduce indexing time. Also, with your quantity of data, you should try to make sure no fields are enabled for wildcard searching. Look for <WILDCARD_INDEX/> entries and consider removing them.
One last suggestion: In your Dgidx section (which I mentioned above) consider removing --compoundDimSearch. This is only useful for certain kinds of typeahead searches. Removing this will also help with indexing time.
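As a rough illustration, the args list of the Dgidx block might look like this once the flag is gone. The other argument shown (-v) is an assumption about what a typical block contains; keep whatever else your own file has. Delete the arg line outright rather than commenting it out, because an XML comment may not contain a double hyphen and the file would no longer parse.

```xml
<!-- Sketch only: the remaining contents of your own args list may differ -->
<args>
  <arg>-v</arg>
  <!-- the compoundDimSearch arg that used to sit here has been deleted -->
</args>
```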
If you're not doing typeahead searches at all, consider turning off wildcard searches for dimension searches by updating "discover.dimsearch_index.xml".
Look at the breakdown of time taken by each step of the baseline process, then apply the tips that Greg mentioned. Also see if you can do some of the processing/massaging of the data outside the pipeline, so that you can reduce the Forge times.
Another thing to think about at a higher level: do all your customers need the full data set (all 12 million records)? If not, you can consider sharding your Endeca instance.
Thanks for the reply Greg and Pankaj.
I cannot see <arg>--threads</arg><arg>4</arg> in the "DataIngest.xml" file. Instead, I can see it in the Dgraph_defaults.xml file, where it is set to 2 by default. Is that the same setting you mentioned? I will also try the other tips you mentioned. Thanks a lot.
Yes, we need all 12 million records, and we control which data each user sees via navigation queries.
You have to add the <arg>--threads</arg><arg>4</arg> to the Dgidx block. It is not there by default.
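For anyone looking for where this goes, here is a rough sketch of a Dgidx block with the thread arguments added. The element and attribute names (dgidx, id, host-id, args) and the "ITLHost" value follow the usual Deployment Template layout and are assumptions here; match them to what your own file actually contains, and pick a thread count that suits your hardware.

```xml
<!-- Sketch only: layout assumed from a typical Deployment Template file -->
<dgidx id="Dgidx" host-id="ITLHost">
  <args>
    <arg>-v</arg>
    <!-- added by hand: not present by default -->
    <arg>--threads</arg>
    <arg>4</arg>
  </args>
  <!-- leave the other child elements of your existing block unchanged -->
</dgidx>
```

The flag and its value go in as two separate arg elements. As far as I know, the threads entry in Dgraph_defaults.xml only affects the Dgraph that serves queries, so changing it will not speed up indexing.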
If Dgidx is taking a long time, adding the threads and revisiting which fields are enabled for keyword search will help with that. Please post how many fields are enabled for search before and after, with the corresponding times.
Thanks a lot Greg.
I increased the number of threads for Dgidx to 12, and my baseline now runs faster than before. We haven't enabled all the properties and dimensions for search. We have enabled only a few properties for record search, which was the client's requirement, so those can't be disabled. Is there any other way to further improve the speed of indexing?