8 Replies Latest reply on Aug 6, 2019 4:26 PM by 962259

    Any way to flush the write behind queue after x store fails?


      We would like to benefit from the requeue mechanism of the write-behind cache store, but with a retry limit.


      Our use case is the following:


      Our write behind cache store sends REST calls to a downstream system but it can become unavailable from time to time.

      If we receive batch updates/inserts to the write behind cache, the queue gets flooded if the REST service is unresponsive.

      We would like to retry the REST calls a maximum of three times and then abort. How can we remove an entry from the write-behind queue after it has been requeued?


      We also noticed that failed stores are retried every minute; is there a way to control this through a parameter?



        • 1. Re: Any way to flush the write behind queue after x store fails?


              I don't think you can specify a number of retries.  Before 3.6, the configurable parameter "write-requeue-threshold" was used to control the requeue operation; reaching the threshold could cause permanent loss of the store operation.  In releases after 3.6, this parameter is just used as a flag to enable/disable requeueing after a store failure, based on whether the value is zero or non-zero.  If write-requeue-threshold is greater than 0, Coherence will always requeue failed store entries to ensure that entries are never dropped.
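              For reference, both parameters live in the read-write-backing-map-scheme of the cache configuration. An illustrative fragment (the cache store class name is a placeholder, and the values are examples, not recommendations):

```xml
<read-write-backing-map-scheme>
  <internal-cache-scheme>
    <local-scheme/>
  </internal-cache-scheme>
  <cachestore-scheme>
    <class-scheme>
      <!-- placeholder class name -->
      <class-name>com.example.MyCacheStore</class-name>
    </class-scheme>
  </cachestore-scheme>
  <!-- non-zero enables write-behind; failed stores are retried no sooner
       than the greater of one minute and twice this delay -->
  <write-delay>5s</write-delay>
  <!-- 0 disables requeueing on store failure; any non-zero value (post-3.6)
       means failed stores are always requeued -->
  <write-requeue-threshold>100</write-requeue-threshold>
</read-write-backing-map-scheme>
```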

              Yes, you can control the entry retry delay through a parameter (the write-behind delay).  The default retry interval is the greater of one minute and twice the write-behind delay.



          • 2. Re: Any way to flush the write behind queue after x store fails?

            And what would be the exact name of the write-behind delay parameter and where is it documented?  Tks

            • 4. Re: Any way to flush the write behind queue after x store fails?

              Yes, the <write-delay> parameter.  We've set this to 5 seconds to enable the write-behind behavior.


              So what you are saying is that the minimum time between retries for requeued stores is one minute in our case, unless we raise the write-delay setting above 30 seconds.






              • 5. Re: Any way to flush the write behind queue after x store fails?
                Randy Stafford-Oracle

                Upon source inspection I concur with Ryan.


                How did you experience "the queue gets flooded" - what were the symptoms?  OOME?


                Thank You,

                Randy Stafford

                Oracle Coherence Product Manager

                • 6. Re: Any way to flush the write behind queue after x store fails?

                  Hi Randy,


                  It's JP from T***S.


                  We had this write-behind cache (FraudWirelessCacheStore) running on 4 storage nodes, defined in its own cache service (DistributedCache:DistributedCacheService-DataGrid-EVENTNOTIFY-FraudMgmt-Message).

                  We found out after a few days that the thread associated with the write-behind cache service on one of the nodes kept logging thread dumps after soft-timeout detection (<cachestore-timeout> was not set):


                  ERROR 2019-06-28 04:26:57,977 [com.tangosol.coherence.component.util.logOutput.Log4j] - 2019-06-28 04:26:57.977/3086020.634 Oracle Coherence GE <Error> (thread=DistributedCache:DistributedCacheService-DataGrid-EVENTNOTIFY-FraudMgmt-Message, member=16): Detected soft timeout of {WrapperGuardable Daemon{Thread="Thread[WriteBehindThread:CacheStoreWrapper(xxxxxxxxxxxxxxxxxxxxxxxxxxxxx.cachestore.FraudWirelessCacheStore):DistributedCacheService-DataGrid-EVENTNOTIFY-FraudMgmt-Message:xxxxxxxxxxxxxxxxxxxxxxxxxxxxx.FraudWirelessEventMessage,5,WriteBehindThread:CacheStoreWrapper(xxxxxxxxxxxxxxxxxxxxxxxxxxxxx.cachestore.FraudWirelessCacheStore):DistributedCacheService-DataGrid-EVENTNOTIFY-FraudMgmt-Message]", State=Running} Service=PartitionedCache{Name=DistributedCacheService-DataGrid-EVENTNOTIFY-FraudMgmt-Message, State=(SERVICE_STARTED), Id=10, OldestMemberId=1, LocalStorage=enabled, PartitionCount=257, BackupCount=0, AssignedPartitions=21, CoordinatorId=1}}


                  From the thread dump, looking at the stack trace of that thread, this is what we could see:


                  "WriteBehindThread:CacheStoreWrapper(xxxxxxxxxxxxxxxxxxxxxxxxxxxxx.cachestore.FraudWirelessCacheStore):DistributedCacheService-DataGrid-EVENTNOTIFY-FraudMgmt-Message:xxxxxxxxxxxxxxxxxxxxxxxxxxxxx.FraudWirelessEventMessage" id=69 State:RUNNABLE (in native)

                  at java.net.SocketInputStream.socketRead0(Native Method)

                  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)

                  at java.net.SocketInputStream.read(SocketInputStream.java:171)

                  at java.net.SocketInputStream.read(SocketInputStream.java:141)

                  at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)

                  at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:593)

                  at sun.security.ssl.InputRecord.read(InputRecord.java:532)

                  at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)

                  -  locked java.lang.Object@7973ed67

                  at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)

                  -  locked java.lang.Object@bc1bb94

                  at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)

                  at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)

                  at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:396)

                  at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355)

                  at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)

                  at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359)

                  at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)

                  at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)

                  at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)

                  at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)

                  at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)

                  at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)

                  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)

                  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)

                  at xxxxxxxxxxxxxxxxxxxxxxxx.cachestore.client.RestWSClient.sendMessage(RestWSClient.java:111)

                  at xxxxxxxxxxxxxxxxxxxxxxxx.cachestore.client.RestWSClient.sendMessage(RestWSClient.java:28)

                  at xxxxxxxxxxxxxxxxxxxxxxxx.cachestore.AbstractMessageCacheStore.sendMessage(AbstractMessageCacheStore.java:214)

                  at xxxxxxxxxxxxxxxxxxxxxxxx.cachestore.AbstractMessageCacheStore.storeAndSendMessage(AbstractMessageCacheStore.java:190)

                  at xxxxxxxxxxxxxxxxxxxxxxxx.cachestore.AbstractMessageCacheStore.store(AbstractMessageCacheStore.java:117)

                  at com.tangosol.net.cache.ReadWriteBackingMap$CacheStoreWrapper.storeInternal(ReadWriteBackingMap.java:6014)

                  at com.tangosol.net.cache.ReadWriteBackingMap$StoreWrapper.store(ReadWriteBackingMap.java:5085)

                  at com.tangosol.net.cache.ReadWriteBackingMap$WriteThread.run(ReadWriteBackingMap.java:4491)

                  at com.tangosol.util.Daemon$DaemonWorker.run(Daemon.java:806)

                  at java.lang.Thread.run(Thread.java:748)


                  So this cache store uses the Apache CloseableHttpClient to send a request to a REST service based on the store data.  This HTTP client had no configured timeouts (all defaults) and uses a connection pool, although the pool is not useful since the write-behind thread is single-threaded per JVM.


                  We think this HTTP client had some issue connecting to the REST target, and somehow the connection was never properly reset.  So it kept failing, and the write-behind retry queue kept growing, as measured through the CacheMBean QueueSize JMX metric.


                  We have since configured proper connection timeouts on the HTTP client, but then we wanted to know exactly what the retry delays on the queue were, to fully understand the write-behind behaviour.
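                  For anyone hitting the same hang: with Apache HttpClient 4.x, the timeouts can be set through a RequestConfig so a stuck endpoint fails the store() call instead of blocking the write-behind thread in socketRead0 indefinitely.  A configuration sketch (timeout values are illustrative, not our production settings):

```java
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// Explicit timeouts so a hung REST endpoint produces a fast store failure
// (which Coherence then requeues) rather than a blocked write-behind thread.
RequestConfig config = RequestConfig.custom()
        .setConnectTimeout(5_000)            // TCP connect / SSL handshake
        .setSocketTimeout(10_000)            // max idle between packets (SO_TIMEOUT)
        .setConnectionRequestTimeout(2_000)  // wait for a pooled connection
        .build();

CloseableHttpClient client = HttpClients.custom()
        .setDefaultRequestConfig(config)
        .build();
```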

                  • 7. Re: Any way to flush the write behind queue after x store fails?
                    Randy Stafford-Oracle

                    Hi JP,


                    Thanks for the explanation.  I'm glad you noticed the issue through monitoring rather than an OOME.


                    Though Ryan covered the retry-delay part of the write-behind behavior, I wanted to look at the code to see if there was an API way, rather than a configuration way, to do what you originally asked about (removing entries from the queue after three retries).


                    You can suppress all requeueing by setting the RWBM's write-requeue-threshold to 0, as Ryan covered.  But to do what you originally asked about, you'd have to do the bookkeeping yourself on the number of failed stores per entry - there is no internal bookkeeping on that.  Through the API you can get your hands on the RWBM's WriteQueue, but its remove(key) method is protected.


                    It sounds like you have a solution through configuration of the RWBM and HTTP client.




                    • 8. Re: Any way to flush the write behind queue after x store fails?

                      I implemented a custom "requeue 3 times then forget" logic, and I have to mention that it was fairly easy to implement within the cache store, using an extra transient retry attribute on the cache entry.  Tested it and it worked fine.
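                      A minimal, standalone sketch of that bookkeeping (class and interface names here are hypothetical; the real version lives inside the Coherence CacheStore's store() method, with the counter carried as a transient attribute on the entry rather than in a side map):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical "retry 3 times then forget" wrapper.  Throwing from store()
// stands in for the failure that makes Coherence requeue the entry;
// swallowing the exception stands in for dropping it.
public class BoundedRetryStore<K, V> {

    /** Stands in for the actual downstream write, e.g. the REST call. */
    public interface StoreCallback<K, V> {
        void store(K key, V value) throws Exception;
    }

    private static final int MAX_ATTEMPTS = 3;
    private final Map<K, Integer> attempts = new ConcurrentHashMap<>();
    private final StoreCallback<K, V> target;

    public BoundedRetryStore(StoreCallback<K, V> target) {
        this.target = target;
    }

    public void store(K key, V value) {
        try {
            target.store(key, value);
            attempts.remove(key);            // success: clear the counter
        } catch (Exception e) {
            int n = attempts.merge(key, 1, Integer::sum);
            if (n >= MAX_ATTEMPTS) {
                attempts.remove(key);        // give up: swallow, no requeue
            } else {
                throw new RuntimeException(e); // propagate so the entry requeues
            }
        }
    }
}
```

Usage-wise, the first two failures propagate (so the entry is requeued) and the third is swallowed (so the entry is dropped and its counter cleared).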


                      So we might not need it after all, but we might in a future use case.


                      Thanks for your contribution, always much appreciated.