DbPing does not respect socketTimeout for reads and can hang indefinitely

keithwall Member Posts: 29
edited Mar 19, 2017 6:02PM in Berkeley DB Java Edition

We use DbPing as part of our application to check an HA node's health. We have noticed that if the node is unresponsive for some reason, the thread calling DbPing can hang forever, even though DbPing is created with socketTimeout set to a reasonable value. Our application used 5.0.104, but I have reproduced the same problem with releases up to 6.4.25 too.

The thread dump looks like this:

"Broker-Config" #13 prio=5 os_prio=31 tid=0x00007fb1ea3d8800 nid=0x5103 runnable [0x000070000134d000]
   java.lang.Thread.State: RUNNABLE
  at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
  at sun.nio.ch.IOUtil.read(IOUtil.java:197)
  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
  - locked <0x000000072fab0eb8> (a java.lang.Object)
  at com.sleepycat.je.rep.utilint.net.SimpleDataChannel.read(SimpleDataChannel.java:73)
  at com.sleepycat.je.rep.utilint.ServiceHandshake$ByteChannelIOAdapter.read(ServiceHandshake.java:1082)
  at com.sleepycat.je.rep.utilint.ServiceHandshake$SendNameOp.processOp(ServiceHandshake.java:523)
  at com.sleepycat.je.rep.utilint.ServiceHandshake$ClientHandshake.process(ServiceHandshake.java:237)
  at com.sleepycat.je.rep.utilint.ServiceDispatcher.doServiceHandshake(ServiceDispatcher.java:434)
  at com.sleepycat.je.rep.utilint.ServiceDispatcher.doServiceHandshake(ServiceDispatcher.java:408)
  at com.sleepycat.je.rep.util.DbPing.getNodeState(DbPing.java:345)

I notice that the JE code configures SO_TIMEOUT on the Socket, but then goes on to read via SocketChannel#read. As described in http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4614802, SocketChannel reads are not subject to SO_TIMEOUT, so the timeout setting has no effect and the read can block forever.
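
For illustration, here is a minimal sketch of one common way to put a bound on a SocketChannel read: switch the channel to non-blocking mode and wait on a Selector with a timeout. This is not the JE code and not a proposed patch; the class and method names (TimedChannelRead, readWithTimeout) are made up for the example, and the silent ServerSocket just stands in for an unresponsive node.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Hypothetical helper, not part of JE.
public class TimedChannelRead {

    /**
     * Reads into buf, or throws SocketTimeoutException if the channel does not
     * become readable within timeoutMs. timeoutMs must be > 0, because
     * Selector.select(0) means "wait indefinitely".
     * Note: the channel is left in non-blocking mode afterwards.
     */
    static int readWithTimeout(SocketChannel channel, ByteBuffer buf, long timeoutMs)
            throws IOException {
        channel.configureBlocking(false);
        try (Selector selector = Selector.open()) {
            channel.register(selector, SelectionKey.OP_READ);
            if (selector.select(timeoutMs) == 0) {
                throw new SocketTimeoutException("no data within " + timeoutMs + " ms");
            }
            return channel.read(buf);
        }
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // The listener never accepts or writes; the connect still completes
        // through the listen backlog, so the client sees an open but silent peer.
        try (ServerSocket silent = new ServerSocket(0, 1, loopback);
             SocketChannel channel = SocketChannel.open(
                     new InetSocketAddress(loopback, silent.getLocalPort()))) {
            try {
                readWithTimeout(channel, ByteBuffer.allocate(16), 500);
                System.out.println("read returned");
            } catch (SocketTimeoutException e) {
                System.out.println("timed out: " + e.getMessage());
            }
        }
    }
}
```

With a plain blocking channel.read(buf) in place of readWithTimeout, the same program would sit in the read indefinitely, which matches the thread dump above.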

To simulate an unresponsive node easily, I use netcat on the command line to start a listening socket (e.g. nc -l 5001) and then provide no keyboard input. Pointing DbPing at that socket then produces a hung thread.
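
Until the read itself honours a timeout, a caller can guard against the hang by running the check on a worker thread and giving up after a deadline. The sketch below shows that pattern in isolation: BoundedCall and callWithDeadline are hypothetical names, the blocking SocketChannel read stands in for the DbPing call, and the silent ServerSocket plays the role of nc -l. Since SocketChannel is an InterruptibleChannel, interrupting the worker closes the channel and ends the blocked read; I have not verified that a real DbPing call unwinds cleanly when interrupted this way.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of JE.
public class BoundedCall {

    /** Runs task on a worker thread; returns fallback if it has not finished within timeoutMs. */
    static <T> T callWithDeadline(Callable<T> task, long timeoutMs, T fallback)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker; a thread blocked in SocketChannel.read is
            // released with ClosedByInterruptException.
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Stand-in for "nc -l <port>" with no input: listens but never sends anything.
        try (ServerSocket silent = new ServerSocket(0, 1, loopback)) {
            final int port = silent.getLocalPort();
            String state = callWithDeadline(() -> {
                // Stand-in for the health check: a blocking read with no timeout.
                try (SocketChannel ch = SocketChannel.open(
                        new InetSocketAddress(loopback, port))) {
                    ch.read(ByteBuffer.allocate(16));
                    return "RESPONDED";
                }
            }, 500, "UNRESPONSIVE");
            System.out.println(state);
        }
    }
}
```

This only limits how long the caller waits; it does not make socketTimeout effective inside DbPing.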
