7 Replies Latest reply on Jun 12, 2012 1:52 AM by 942985

    Selector.select(timeout) seems to sometimes "timeout" way too soon


      We're using NIO for reading and writing from Telnet sockets, and occasionally we're finding that java.nio.channels.Selector.select(timeout) seems to return prematurely. It's supposed to return "only after at least one channel is selected, this selector's wakeup method is invoked, the current thread is interrupted, or the given timeout period expires, whichever comes first" but we're seeing it return when none of those conditions are true. With a 15-second timeout it's returning after about 3 seconds.

      I'm posting here before making a bug report to see if any of you all can catch something we're doing wrong. Also because we frustratingly do not have a repro in house - we are only seeing this on a customer's site. Distressingly, the bug only happens if we use a newer jre. With version 1.6.0_07 there is no repro; with 1.6.0_26 and 1.7.0_04 there is.

      I see a similar bug here which was resolved as not a bug: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5035864
      However we're not using the return value the same way.

      We're creating the selector like this:
          public void connect(String hostname, int port)  {
              // hook up the underlying socket
              InetSocketAddress addr;
              try {
                  // open socket
                  addr = new InetSocketAddress(InetAddress.getByName(hostname), port);
                  channel = SocketChannel.open();
                  channel.configureBlocking(true); // while connecting
                  // switch to non-blocking mode now that we've connected okay
                  // open selector
                  selector = Selector.open();
                  channel.register(selector, SelectionKey.OP_READ);
              } catch (Exception  e) {
      And using it thus:
          private void selectRead(Selector selector, long requestStartMilli, long timeoutMilli) throws IOException {
               long remainingMilli;
              if (timeoutMilli == 0) {  // timeout of 0 also means block indefinitely 
                  remainingMilli = 0;
              } else {
                  remainingMilli = timeoutMilli - (System.currentTimeMillis() - requestStartMilli) ;
                  if (remainingMilli < 0) {
                      // timeout elapsed since the last time we blocked, so short circuit
                      log.debug("[TSC] Timeout elapsed since last read");
                      throw new IOTimeoutException("Timeout elapsed since last read");
              // block for an event
                 log.debug("requestStartMilli " + requestStartMilli + " timeoutMilli " + timeoutMilli + " remainingMilli " + remainingMilli);
                 log.debug("current nanos before calling connect " + System.nanoTime() );
           Log.debug(" total number of keys "+  selector.keys().size() + " selected keys " + selector.selectedKeys());
                 int numkeys = selector.select(remainingMilli);
                log.debug("current nanos after calling connect " + System.nanoTime() + " got num keys " + numkeys );
                  //  has something bad happened?
                 if (!selector.isOpen() || !channel.isOpen())     throw new IOException("Channel or selector not open");     
                  if (Thread.interrupted())                                     throw new InterruptedIOException(); 
              Set<SelectionKey> selectedKeys = selector.selectedKeys();
              Iterator<SelectionKey> selectedKeysItr = selectedKeys.iterator();
              if (selectedKeysItr.hasNext()) {
                  SelectionKey selected = selectedKeysItr.next();
            // do stuff           
              } else {
                  // the timeout elapsed with no data from the server
                  log.debug("[TSC] Timeout elapsed while blocking");
                  throw new IOTimeoutException("Timeout elapsed blocking");
      About once every 20th time we connect and read, we see the inappropriately fast return.
      2012-06-08 13:37:40,703 DEBUG  requestStartMilli 1339177060687 timeoutMilli 15000 remainingMilli 14984 additionalTimeoutMilli 0
      2012-06-08 13:37:40,703 DEBUG  current nanos before calling connect 341011346120112
      2012-06-08 13:37:40,703 DEBUG  total number of keys 1 selected keys []
      2012-06-08 13:37:40,718 DEBUG  current nanos after calling connect  341011349665502 got num keys 0
      2012-06-08 13:37:40,718 DEBUG  [TSC] Timeout elapsed while blocking
      Are we doing something wrong? Can anybody guess why we'd be unable to repro in house? Any other theories?

      Thanks very much.
        • 1. Re: Selector.select(timeout) seems to sometimes "timeout" way too soon
          So to be clear, the selector fires sooner than expected, and 'numkeys' is zero? or 'numkeys' > 0 but key.readyOps() == 0 for all selected keys? and you are processing EOS correctly?
          • 2. Re: Selector.select(timeout) seems to sometimes "timeout" way too soon
            It fires sooner than expected with no selected keys, as determined by checking if selectedKeys().iterator().hasNext(). I'm just recording the return value in numkeys for debugging, since as I understand it it just reports the newly selected keys and can return 0 if there are selected keys left over from a previous call.

            The documentation says it can also return if the current thread is interrupted, which we're checking for with Thread.interrupted. Or it can return if "this selector's wakeup method is invoked", which is not happening anywhere in our code, at least. That leaves that it should have returned because the timeout has elapsed, but not nearly enough time has happened for that.

            I believe we're processing EOS correctly. We're checking if the SelectionKey isValid and isReadable before doing channel.read, and if read returns -1 we're not attempting to read from it again. Is that all you mean?
            • 3. Re: Selector.select(timeout) seems to sometimes "timeout" way too soon
              If read() returns -1 you must close the channel. Same if you get any IOException using the channel, actually. Other than that, if it fires early with nothing to do, just keep looping. You already have all the mechanics to detect real timeouts.
              • 4. Re: Selector.select(timeout) seems to sometimes "timeout" way too soon
                OK, I'll ensure the channel is closed if read returns -1 or throws anything. But that isn't occurring in the case of the selector.select() timeout error, though. (I am certain of that because we're throwing an exception if it ever does return -1.)
                • 5. Re: Selector.select(timeout) seems to sometimes "timeout" way too soon
                  Sorry, got distracted by the first half of your response and didn't notice the second.
                  You already have all the mechanics to detect real timeouts.
                  I'm a newb in this area, what mechanics do you mean? My workaround for the moment is just to call this in a loop and keep track of the elapsed time myself, is that all you meant? It works, but it's pretty frustrating that select(timeout) can sometimes basically just ignore the timeout.

                  Out of curiosity, did you have any thoughts why we'd see this repro on one system but not another? I know that's awfully vague, just thought I'd check if anything had sprung to mind.

                  Thanks for your help!

                  Edited by: 939982 on Jun 11, 2012 6:31 PM
                  • 6. Re: Selector.select(timeout) seems to sometimes "timeout" way too soon
                    Yep that's what I mean. I don' t have any thoughts on why this would happen on one system in particular, especially if you're not seeing Thread.interrupted(). I'm wondering why you're using NIO at all actually. It's pretty rare in a client, and barely worth it I would have thought. I would use a Socket and call setSoTimeout() myself, never seen that fail.
                    • 7. Re: Selector.select(timeout) seems to sometimes "timeout" way too soon
                      Yeah, if we were writing this today it probably wouldn't be NIO. I'm just in here for maintenance - this wasn't a problem on older JREs but it is on newer ones and started causing problems on some customers we just upgraded.

                      Thanks a bunch :)