I have a question on Coherence indexing that I hope you can clarify.
Currently, our indexing process takes up to 60 minutes to create indexes for 500K entries on one cache.
It is an XML cache, and it uses a reflection extractor to evaluate an XPath attribute for each index.
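For context, here is a minimal sketch of the per-entry work this implies, using only the JDK's XPath support (the XML shape and the /order/customerId path are made up for illustration; our real attributes differ):

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathExtractSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical cached XML value; real cache entries are far larger.
        String xml = "<order><customerId>C-42</customerId></order>";

        // This is roughly the per-entry work an XPath-based extractor does:
        // parse the (already deserialized) XML and evaluate the expression.
        XPath xpath = XPathFactory.newInstance().newXPath();
        String value = xpath.evaluate("/order/customerId",
                new InputSource(new StringReader(xml)));

        System.out.println(value); // C-42
    }
}
```

Multiplying that parse-and-evaluate by every entry, for every index, is where the time goes.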
Priming the cache is a two-step process: first we finish priming the cache, and only then do we start adding indexes.
Since 60 minutes is too long, we wrote some quick code to test whether adding the indexes in parallel (using the processing nodes) would reduce the overall creation time.
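The quick test looked roughly like this — with the Coherence addIndex() call replaced by a stand-in task so the sketch is self-contained, and a single shared lock modelling the serialization we appeared to hit:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SerializedIndexModel {

    // Runs four "addIndex" stand-ins in parallel and reports the highest
    // concurrency ever observed inside the shared lock.
    static int measure() throws InterruptedException {
        final Object indexLock = new Object();   // models the suspected serialization
        final AtomicInteger inside = new AtomicInteger();
        final AtomicInteger maxSeen = new AtomicInteger();

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> {
                synchronized (indexLock) {       // each stand-in waits its turn
                    maxSeen.accumulateAndGet(inside.incrementAndGet(), Math::max);
                    try { Thread.sleep(50); } catch (InterruptedException ignored) { }
                    inside.decrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return maxSeen.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // With a single lock, parallel submissions still run one at a time, so
        // four "parallel" index builds take as long as four sequential ones.
        System.out.println("max concurrency observed = " + measure()); // = 1
    }
}
```

If addIndex really does serialize like this, submitting from more client threads cannot help; any parallelism has to come from the storage nodes themselves.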
But it made no difference; it still takes 60 minutes.
Does that mean an addIndex() call locks each entry while it is being indexed, forcing the other threads to wait their turn?
I checked all the documentation but could not find an answer. Any tips would be a big help.
- The business definitely wants to keep the XML cache and index the attributes based on XPath.
Oh dear... if I had £1 for every place I have worked where the business was certain it needed to store XML inside a Coherence cache, I'd be a rich man (well, I'd have a few pounds anyway). If you really must store large amounts of XML and query it, then as much as it pains me to say it, there are possibly better products than Coherence for the job.
Regarding running multiple addIndex calls in parallel: as far as I am aware, Coherence will only run a single addIndex call on a cache at a time.
As for your problem, 60 minutes to add indexes does seem like a very long time. There are various possible reasons. How many indexes are you adding? How big (in bytes) are the XML values you put in the cache? How many storage nodes are in your cluster, and what is their heap size? If you are using reflection extractors, then the storage nodes need to deserialize the whole cache for each index you add, which can result in a lot of GC.
Thanks for your reply. Yep, you are right; I tried to persuade the client to move away from the XML store, but it was a no-go. It is a bad design, hence this approach.
I did a few tests yesterday, and they confirmed that the addIndex call is single-threaded and uses one member node at a time (I profiled the JVMs). However, when I called addIndex before priming the cache, the results were better: the cache service then uses all the storage nodes to update the indexes in parallel, which does reduce the overall indexing time.
We have 16 indexes to create, the average XML size (compressed using zlib) is 10 KB (but some entries are as big as 70 KB), and there are 500K entries. In total there are 16 storage nodes (3.5 GB heap each) split across 2 physical machines in the dev environment.
I agree: every time an index needs to be created or updated, the process has to deserialize the XML, evaluate the XPath, and add the result to the SimpleMapIndex. After that, the deserialized objects are ready for collection, which (as you said) causes a lot of GC (the CPU hits 100% most of the time).
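A back-of-the-envelope on the numbers above (counting only the compressed sizes; the deserialized DOM/object graphs will be several times larger still):

```java
public class IndexChurnEstimate {
    public static void main(String[] args) {
        long entries = 500_000L;  // entries in the cache
        int indexes = 16;         // indexes to build
        long avgBytes = 10_000L;  // ~10 KB zlib-compressed per entry

        // A reflection extractor forces a full deserialization per entry, per index.
        long deserializations = entries * indexes;
        long bytesPerPass = entries * avgBytes;  // transient bytes churned per index
        long totalBytes = bytesPerPass * indexes;

        System.out.println(deserializations);               // 8000000
        System.out.println(bytesPerPass / 1_000_000_000.0); // 5.0  (~5 GB per pass)
        System.out.println(totalBytes / 1_000_000_000.0);   // 80.0 (~80 GB total)
    }
}
```

Even at a conservative 2-3x expansion on decompression and deserialization, that is a huge amount of short-lived garbage for 3.5 GB heaps.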
Given that I cannot change the XML store or the XPath-based indexes, I have a few options to tune:
1. Tune GC
2. Use fast xml processing libraries
3. Use LZ4 compression rather than zlib.
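On option 3: LZ4 itself needs an external library (lz4-java, for example), but the trade-off being chased — cheaper CPU per (de)compression in exchange for a somewhat larger payload — can be sketched with the JDK's own Deflater levels:

```java
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CompressionTradeoff {
    // Compresses input at the given zlib level (1 = fastest, 9 = smallest).
    static byte[] deflate(byte[] input, int level) {
        Deflater d = new Deflater(level);
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length + 64];
        int n = d.deflate(buf);
        d.end();
        return Arrays.copyOf(buf, n);
    }

    // Decompresses back into a buffer of the known original length.
    static byte[] inflate(byte[] input, int originalLength) throws Exception {
        Inflater inf = new Inflater();
        inf.setInput(input);
        byte[] out = new byte[originalLength];
        inf.inflate(out);
        inf.end();
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a cached XML payload; real entries average ~10 KB.
        byte[] xml = "<order><item>widget</item></order>".repeat(500).getBytes();

        byte[] fast  = deflate(xml, Deflater.BEST_SPEED);
        byte[] small = deflate(xml, Deflater.BEST_COMPRESSION);

        // Lower levels cost less CPU but decompress into the same bytes;
        // LZ4 pushes this trade-off further than any Deflater level.
        System.out.println(fast.length < xml.length);                  // true
        System.out.println(Arrays.equals(inflate(fast, xml.length), xml));  // true
        System.out.println(Arrays.equals(inflate(small, xml.length), xml)); // true
    }
}
```

Whether the CPU saved per decompression outweighs the larger payloads is something only a measurement against the real 10-70 KB entries will settle.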