Difference between "connection timed out" and "connection refused"

jduprez, Mar 22 2011 (edited Mar 23 2011)
Hello,

A while ago I opened a topic on {message:id=4635424} to discuss what could cause the dreaded exception "java.net.ConnectException: Connection timed out: connect", and whether it would make sense to increase the client timeout (which was set to 20s).
The answers to the first question were "the server may be too busy", "the host may be down", "there may be a network problem".
The answer to the second question was "No, if the server and network are healthy, 20 seconds is amply enough".

I am still facing the following exception reproducibly, in a round of performance tests of a WebService application.
A set of test clients exercises WS calls over HTTPS targeting the server configuration.
The server has an Apache reverse-proxy that receives the incoming flow, unwraps the SSL layer, and forwards regular HTTP WebService calls to a back-end Glassfish server hosting a JEE application.

Since a recent modification of the client code (each client now has two threads that invoke the same web service endpoint), the test client has started facing ConnectException reproducibly (~0.1% of requests). I quote an example of these exceptions below.

1) From earlier reading on this forum, I thought this indicates that the server-side (in this case, Apache) has not accepted, or not timely accepted, the connection request issued by the client.
However, I see nothing in my server-side logs or monitoring page that indicates that the Apache server is too busy to service the connection requests. The Apache status page shows lots of non-busy processes.

2) Moreover, someone from my team pointed out that if the server could not accept the connection request, I should face "java.net.ConnectException: Connection refused: connect" instead of "java.net.ConnectException: Connection timed out: connect" (indeed, Connection refused is what I observe if I shut down the Apache server, and also what I observe on a dummy JAX-WS client/server test with an undersized server-side pool).


Further research gave no authoritative answers (TCP tutorial or RFC) as to the difference; I don't know whether those messages are OS-specific or API-specific. I found the following pages though:
http://www.realvnc.com/pipermail/vnc-list/2005-November/052929.html (seems to indicate that "connection timed out" points out a network problem as opposed to "connection refused" indicating a non-working server)
http://answers.yahoo.com/question/index?qid=20090415213118AAsbWzJ (I'm not sure this answer is very reliable)

3) Someone from another team suggested that the client-side "connection request timeout" might have elapsed before the Apache-to-backend timeout, which could explain why Apache neither sees nor logs any problem. However, just to know that it is supposed to forward the request to the back-end Glassfish server, Apache must have "accepted" the socket, right? So how could the connection request itself time out?

4) If the "network problem" hypothesis is right (nasty firewall, bandwidth limit, ...), how does the client code have an impact on the problem? (Remember, the problem, which used to be very infrequent, has been almost systematically reproducible since a client code update.)

5) How else would you advise to investigate this issue?

Thanks in advance,

J.

P.S.: stack trace of intermittent "Connection timed out" problem.
Caused by: java.net.ConnectException: Connection timed out: connect
                at java.net.PlainSocketImpl.socketConnect(Native Method)
                at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
                at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
                at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
                at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
                at java.net.Socket.connect(Socket.java:519)
                at com.sun.net.ssl.internal.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:550)
                at com.sun.net.ssl.internal.ssl.BaseSSLSocketImpl.connect(BaseSSLSocketImpl.java:141)
                at sun.net.NetworkClient.doConnect(NetworkClient.java:163)
                at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
                at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
                at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:271)
                at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:328)
                at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:172)
                at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:816)
                at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:158)
                at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:881)
                at sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(HttpsURLConnectionImpl.java:230)
                at com.sun.xml.internal.ws.transport.http.client.HttpClientTransport.getOutput(HttpClientTransport.java:107)
                ... 22 more

Comments

EJP
The answers to the first question were "the server may be too busy"
Not if it's running on Windows.
"the host may be down"
Correct.
"there may be a network problem".
Correct.
The answer to the second question was "No, if the server and network are healthy, 20 seconds is amply enough".
Far more than enough. A couple of seconds is enough to transit an IP packet several times around the world.
1) From earlier reading on this forum, I thought this indicates that the server-side (in this case, Apache) has not accepted, or not timely accepted, the connection request issued by the client.
No. Connection timeout means there has been no response whatsoever. This is due to, in decreasing order of probability:

1. There has been a temporary network outage for the duration of the timeout, for example a router going down and up.

2. The host is down.

3. The server is running on a non-Windows platform and its backlog queue is full, caused by accept() running behind incoming connections. This is extremely unlikely given that minimum actual backlog queue sizes on non-Windows platforms are of the order of 500 and have been for a decade.
2) Moreover, someone from my team pointed out that if the server could not accept the connection request, I should face "java.net.ConnectException: Connection refused: connect"
That is correct for a server running on Windows only.
3) [snipped]
This suggestion makes no sense. If the client got a connection timeout there was no response to the connection attempt, so there wasn't any back-end activity whatsoever.
4) ?.. how does the client code have an impact on the problem ...
By parallelizing the operations you are presumably increasing network load. Maybe you are doing so far more than you think.
5) How else would you advise to investigate this issue?
Sniff the network between the client and whatever it is failing to connect to and observe the SYN and SYN-ACK packets.

I would also trawl your client code for HttpURLConnection.disconnect() calls and remove them. They inhibit client-side connection pooling, which mitigates most problems of this sort considerably.
802316
EJP wrote:
1) From earlier reading on this forum, I thought this indicates that the server-side (in this case, Apache) has not accepted, or not timely accepted, the connection request issued by the client.
No. Connection timeout means there has been no response whatsoever. This is due to, in decreasing order of probability:
I thought you could get a timeout if you have a firewall which doesn't allow traffic through.
EJP
Correct, I should have included that. There were firewalls on the market that issued RSTs in this situation but I think everybody now agrees that this is incorrect behaviour. The other possible firewall action is an ICMP Destination Unreachable.
jduprez
Hello Esmond,

thanks for your help.
As a general question first, do you have a reference that describes when each exception "message" is raised?
As I said in the OP, I haven't found this in the TCP RFCs, and I fear it's system-specific; in this case it's not clear whether the message comes from the java.net API, from the Windows socket API layer, ...
EJP wrote:
The answers to the first question were "the server may be too busy"
Not if it's running on Windows.
The server is running on Linux (RedHat). The client is on Windows.
"the host may be down"
"there may be a network problem".
Correct [twice]
TBH, you were the one that suggested those reasons :o)
The answer to the second question was "No, if the server and network are healthy, 20 seconds is amply enough".
Far more than enough. A couple of seconds is enough to transit an IP packet several times around the world.
Still, revisited below.
1) From earlier reading on this forum, I thought this indicates that the server-side (in this case, Apache) has not accepted, or not timely accepted, the connection request issued by the client.
No. Connection timeout means there has been no response whatsoever.
OK.
This is due to, in decreasing order of probability:

1. There has been a temporary network outage for the duration of the timeout, for example a router going down and up.
In my case I find this unlikely, given the "client code version" effect (unless the router restart was so precisely timed that it consistently impacts the "small traffic" client and consistently spares the "higher traffic" one).
Moreover, my network admin team claims there were no such incidents.
2. The host is down.
In my case, that seems unlikely, as parallel requests did succeed within seconds of some requests failing.
3. The server is running on a non-Windows platform and its backlog queue is full, caused by accept() running behind incoming connections. This is extremely unlikely given that minimum actual backlog queue sizes on non-Windows platforms are of the order of 500 and have been for a decade.
Ah,ah, that's something to chew on.
What do you call the "backlog queue"? There are a handful of "backlog" mentioned in, e.g., this Red-Hat-specific doc:
net.ipv4.tcp_max_syn_backlog "Length of the backlog queue for each socket (default is 128)"
net.core.netdev_max_backlog "Maximum number of packets that can be queued on input when a network interface receives packets faster than the kernel can process them (default is 300)"

Additionally, if I happened to find out that the "backlog queue" indeed enqueues lots of connection requests, doesn't that add delay to the connection request-accept duration?
Of course, if the queue fills up, the client will eventually face the exception, no matter what its timeout setting is... I should definitely monitor this queue's status.
2) Moreover, someone from my team pointed out that if the server could not accept the connection request, I should face "java.net.ConnectException: Connection refused: connect"
That is correct for a server running on Windows only.
I'm not sure I understand: I do witness "java.net.ConnectException: Connection refused: connect" if I shut down the Apache server on the Linux host. Admittedly, I haven't proven that I get it on Linux if I leave the server up and overflow its capacity. Does your remark "correct for a server running on Windows only" apply only to the "overloaded server" case?
3) [snipped]
This suggestion makes no sense. If the client got a connection timeout there was no response to the connection attempt, so there wasn't any back-end activity whatsoever.
Thanks for confirming.
4) ?.. how does the client code have an impact on the problem ...
By parallelizing the operations you are presumably increasing network load. Maybe you are doing so far more than you think.
I did another test with the "parallel calls" disabled, and faced the same issues.
I'm still investigating the changes between the client versions (I fear a JAX-WS version change has sneaked in as well; sorry if I unintentionally misrepresented the "small changes").
5) How else would you advise to investigate this issue?
Sniff the network between the client and whatever it is failing to connect to and observe the SYN and SYN-ACK packets.
Err..., thanks, but what would I do with the results? Shouldn't I, instead, or in addition, monitor the backlog queue you mentioned above?
I would also trawl your client code for HttpURLConnection.disconnect() calls and remove them. They inhibit client-side connection pooling, which mitigates most problems of this sort considerably.
Yes. But the client has no access to the HttpURLConnection (encapsulated in JAX-WS stubs). I don't know if the JAX-WS client layer performs any disconnection (why would it).
However, that raises another question: why do I even have connection requests with HTTP1.1 enabled on both ends? From a cursory look at the Apache doc, I seem to understand that the number of requests over the same socket is bounded in the default configuration.

I'll post updates if I find anything.

Regards,

J.
jduprez
No. Connection timeout means there has been no response whatsoever. This is due to, in decreasing order of probability:
I thought you could get a timeout if you have a firewall which doesn't allow traffic through.
Thanks for the suggestion.

I did a ping -r 5 serverIp from two Windows clients, and the connection is direct (no hops). Does that guarantee that there is no intermediate machine between the clients and the server?

Additionally, I did a netstat -o | find "<Apache port>" on the Windows clients, and only the PIDs of Java processes are listed.
Does that guarantee that it is not the Windows software firewall that opens the socket with the Apache server?
EJP
As a general question first, do you have a reference that describes when each exception "message" is raised?
As I said in the OP, I haven't found this in the TCP RFCs
They don't specify APIs or exceptions at all, just the protocol. In fact, one of the curiosities of TCP/IP is the complete lack of a formal document relating the Sockets API to the protocol. However, you can figure it out like this:

- ICMP UNREACH -> NoRouteToHostException
- RST -> ConnectException 'connection refused'
- nothing -> connection timeout.
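
The mapping above can be observed from client code with a small probe. This is only a sketch: the class and method names (ConnectProbe, probe) are made up for illustration, and the exact exception messages vary by OS and JDK. One point worth noting: when a timeout is passed to Socket.connect() and it elapses first, Java throws SocketTimeoutException, whereas a "Connection timed out" that surfaces as a ConnectException (as in the stack trace in this thread) carries a message produced by the underlying OS.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.NoRouteToHostException;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ConnectProbe {

    /** Attempts one TCP connect and describes how the attempt ended. */
    public static String probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return "connected";
        } catch (SocketTimeoutException e) {
            // The timeout given to connect() elapsed before any answer arrived.
            return "timed out (Java-side timeout): " + e.getMessage();
        } catch (NoRouteToHostException e) {
            // Typically the result of an ICMP unreachable message.
            return "no route to host: " + e.getMessage();
        } catch (ConnectException e) {
            // Refused (RST) or an OS-level connect timeout; the message tells which.
            return "connect failed: " + e.getMessage();
        } catch (IOException e) {
            // Anything else, for example an unresolvable host name.
            return "other I/O failure: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("usage: java ConnectProbe <host> <port>");
            return;
        }
        // 5000 ms is an arbitrary value chosen for this sketch.
        System.out.println(probe(args[0], Integer.parseInt(args[1]), 5000));
    }
}
```

Running the probe in a loop from the affected client, and logging which branch is hit, gives a first hint as to whether the failures are refusals or silence.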
I fear it's system-specific
It is indeed. Windows will reject a connection when the backlog queue is full with an RST. Most other systems just ignore it and rely on the connect() API to retry, which it does three times in normal implementations ... if all 3 fail with no response you get the timeout exception.
in this case it's not clear whether the message comes from java.net API, from the Windows' socket API layer,...
Everything except the timeout exception emanates from the TCP/IP stack. The timeout exception originates from Java. That's because there is no timed-connect API at the Sockets level: you have to synthesize it with a non-blocking connect() followed by a select() followed by another connect() if the select() didn't time out. That is also the reason for the existence of SocketChannel.finishConnect().
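
In Java NIO terms, the connect/select/finish sequence described above could look roughly like the following. It is a minimal sketch, not the JDK's internal implementation; TimedConnect and its connect() helper are hypothetical names, and a select() that returns no ready key is simply treated as a timeout.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class TimedConnect {

    /** Connects with a timeout built from a non-blocking connect plus select. */
    public static SocketChannel connect(InetSocketAddress address, int timeoutMillis)
            throws IOException {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        SocketChannel channel = SocketChannel.open();
        try {
            channel.configureBlocking(false);
            // Non-blocking connect: returns at once, true only if already connected.
            if (!channel.connect(address)) {
                try (Selector selector = Selector.open()) {
                    channel.register(selector, SelectionKey.OP_CONNECT);
                    // Wait until the stack reports an outcome or the timeout elapses.
                    if (selector.select(timeoutMillis) == 0) {
                        throw new SocketTimeoutException(
                                "no response within " + timeoutMillis + " ms");
                    }
                    // Completes the handshake or throws, e.g. on a refusal.
                    if (!channel.finishConnect()) {
                        throw new IOException("connection still pending after select()");
                    }
                }
            }
            // The selector is closed by now, so blocking mode can be restored.
            channel.configureBlocking(true);
            return channel;
        } catch (IOException | RuntimeException e) {
            channel.close();
            throw e;
        }
    }
}
```

The caller owns the returned channel and is responsible for closing it.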
TBH, you were the one that suggested those reasons :o)
I suggested the ones that were correct :-|
What do you call the "backlog queue"? There are a handful of "backlog"
The queue of incoming connections which have already been accepted by the TCP/IP stack (SYN-SYN/ACK-ACK) but which haven't been returned yet as sockets via the accept() method, because the server is running behind. The length of the queue is specified in the Sockets listen() call, or in Java in the ServerSocket constructor, unless you default it.
Additionally, if I happened to find out that the "backlog queue" indeed enqueues lots of connection requests, doesn't that add delay to the connection request-accept duration?
No. That's what it's for. The connection is completed by the stack before or while the application calls accept().
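
A quick way to see this for yourself: the sketch below opens a listening socket with an explicit backlog and never calls accept(), yet the client's connect() still completes, because the stack finishes the handshake on the application's behalf. BacklogDemo and connectsWithoutAccept are illustrative names only, and the backlog argument is just a request: the Javadoc says its exact semantics are implementation specific, so the OS may adjust it.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class BacklogDemo {

    /** Returns true if a client connects although the server never calls accept(). */
    public static boolean connectsWithoutAccept(int backlog) throws IOException {
        // Port 0 asks the OS for any free port; the second argument is the requested backlog.
        try (ServerSocket server = new ServerSocket(0, backlog, InetAddress.getLoopbackAddress());
             Socket client = new Socket()) {
            InetSocketAddress target =
                    new InetSocketAddress(server.getInetAddress(), server.getLocalPort());
            // The connection sits in the backlog queue; accept() is deliberately never called.
            client.connect(target, 2000);
            return client.isConnected();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("connected without accept(): " + connectsWithoutAccept(50));
    }
}
```

Opening more clients than the backlog allows, still without accepting, is one way to watch how a given platform reacts once the queue is full.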
Of course, if the queue fills up, the client will eventually face the exception, no matter what its timeout setting is... I should definitely monitor this queue's status.
You can't. There is no API to tell you anything whatsoever about the backlog queue, not even its actual or maximum size. You might be able to dig around under /proc on Linux/Unix, that's it.
I do witness "java.net.ConnectException: Connection refused: connect" if I shut down the Apache server on the Linux host.
Of course you do. Nothing is listening at the port so the TCP/IP stack rejects the connection. I was referring to your colleague's theory about the server not being able to accept the connection, which I assumed means the server process is still running and its backlog queue is full. If he means something else, OK.
Admittedly, I haven't proven I get them if I let the server up and overflow its capacity, on Linux. Does your remark "correct for a server running on Windows only" apply only to the "overloaded server" case?
Correct.
Err..., thanks, but what would I do with the results?
Analyze them. Post them here as a last resort: somebody might look ;-)
Shouldn't I, instead, or in addition, monitor the backlog queue you mentioned above?
If you can, yes, but as well, not instead.
However, that raises another question: why do I even have connection requests with HTTP1.1 enabled on both ends? From a cursory look at the Apache doc, I seem to understand that the number of requests over the same socket is bounded in the default configuration.
No idea what Apache does, but Java pools HTTP 1.1 connections and reuses them if required within a short time, I think 15 seconds. This is the implementation of HTTP 1.1 Connection: keep-alive.
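
For code that does use HttpURLConnection directly, a pooling-friendly call pattern might look like the sketch below: read the response body to the end, close the stream, and leave disconnect() out. KeepAliveFriendlyGet and fetchStatus are hypothetical names, the 20-second timeouts merely echo the value mentioned in this thread, and whether a connection is actually reused still depends on the JDK's keep-alive settings and on the server.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class KeepAliveFriendlyGet {

    /** Performs a GET, drains the response body, and returns the HTTP status code. */
    public static int fetchStatus(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(20_000);
        conn.setReadTimeout(20_000);
        int status = conn.getResponseCode();
        // Error responses expose their body through getErrorStream() instead.
        InputStream body = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        if (body != null) {
            try (InputStream in = body) {
                byte[] buffer = new byte[8192];
                // Read to end of stream so the connection is left in a reusable state.
                while (in.read(buffer) != -1) {
                    // The content itself is discarded in this sketch.
                }
            }
        }
        // Deliberately no conn.disconnect(): that may close the underlying socket.
        return status;
    }
}
```

With JAX-WS stubs the connection handling is hidden, so this pattern is mainly useful for a standalone test client that reproduces the traffic without the WS layer.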

Post Details

Locked on Apr 20 2011
Added on Mar 22 2011