This discussion is archived
30 Replies. Latest reply: Aug 30, 2010 6:43 AM by CharlesHooper
  • 15. Re: network performance
    734766 Newbie
    Hi!

    It takes a reasonable time. I, as a developer, use that connection all the time without any problem, except for Oracle transactions. That's why I think there is some config problem with Oracle, SQL*Net, or Windows Server. I ruled out the network configuration because our housing provider insisted on it. So you can see my confusion: on one hand it does not seem to be an Oracle problem, but on the other hand it does.

    Thanks!
  • 16. Re: network performance
    734766 Newbie
    Hi Charles, thanks for your interest!

    Thank you again, now for the article; it looks promising, because we use Wireshark for network issues, but I never knew how to use it well. I'll take a deep look into it.

    But we used Wireshark on the server side, and we observed the strange behaviour I mentioned earlier: several packets (around 20 KB in total) and then around 5 seconds of NOTHING, no ACKs and no dropped packets. I didn't test the client side, but I assume it would show the same thing.

    And the relative time delta between packets is around 0.0001 s. That's normal, right?
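    That burst-then-silence pattern can be made concrete with a small sketch (the timestamps here are invented for illustration, not taken from the capture) that groups packets into bursts wherever the delta jumps:

```python
# Sketch (timestamps invented): group packets into bursts wherever the
# inter-packet delta exceeds a threshold, matching the pattern described
# above: tight ~0.1 ms deltas inside a burst, then ~5 s of silence.
timestamps = [0.0000, 0.0001, 0.0002, 0.0003, 5.0003, 5.0004, 5.0005]

bursts, current = [], [timestamps[0]]
for prev, cur in zip(timestamps, timestamps[1:]):
    if cur - prev > 1.0:          # a gap like the ~5 s silences in the capture
        bursts.append(current)
        current = []
    current.append(cur)
bursts.append(current)

print(len(bursts), [len(b) for b in bursts])   # 2 bursts: 4 packets, then 3
```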

    Thanks a lot!
  • 17. Re: network performance
    734766 Newbie
    Hello!

    The network administrators from the housing service have reconfigured the interface to force 100 Mb full duplex, and guess how long the query that took 3:31 before takes now: 1:38! I think we may have found the solution, but I have to wait until my network administrators deign to make the same config change on our switches, and for confirmation from other networks. I'll keep you in the loop.

    Thank you very much!
  • 18. Re: network performance
    CharlesHooper Expert
    user11098377 wrote:
    Tracing route to www.75.101.212.in-addr.arpa [212.101.75.188]
    over a maximum of 30 hops:

    1 * * * Request timed out.
    2 * * * Request timed out.
    3 69 ms 69 ms 69 ms 193-238-52-2.static.voztelecom.net [193.238.52.2]
    4 69 ms 70 ms 70 ms gi-0-0-0-10.rt-es-07.voztelecom.net [193.22.119.51]
    5 70 ms 72 ms 70 ms ae0-35.mad44.ip4.tinet.net [213.200.71.133]
    6 69 ms 70 ms 73 ms as174.ip4.tinet.net [77.67.72.126]
    7 70 ms 70 ms 70 ms te2-4.mpd02.mad05.atlas.cogentco.com [130.117.3.117]
    8 82 ms 88 ms 83 ms te3-1.mpd01.mad04.atlas.cogentco.com [130.117.2.70]
    9 97 ms 102 ms 99 ms 149.6.151.22
    10 98 ms 97 ms 101 ms 212.101.79.89
    11 97 ms 98 ms 98 ms 212.101.79.77
    12 99 ms 98 ms 98 ms 192.168.35.3
    13 102 ms 98 ms 97 ms 192.168.36.3
    14 98 ms 97 ms 97 ms www.75.101.212.in-addr.arpa [212.101.75.188]
    --
    Hi Charles, thanks for your interest!

    Thank you again, now for the article; it looks promising, because we use Wireshark for network issues, but I never knew how to use it well. I'll take a deep look into it.

    But we used Wireshark on the server side, and we observed the strange behaviour I mentioned earlier: several packets (around 20 KB in total) and then around 5 seconds of NOTHING, no ACKs and no dropped packets. I didn't test the client side, but I assume it would show the same thing.

    And the relative time delta between packets is around 0.0001 s. That's normal, right?
    --
    The network administrators from the housing service have reconfigured the interface to force 100 Mb full duplex, and guess how long the query that took 3:31 before takes now: 1:38! I think we may have found the solution, but I have to wait until my network administrators deign to make the same config change on our switches, and for confirmation from other networks. I'll keep you in the loop.
    Something seems to be very odd here. First, the two sides are connected by a high latency (Internet) connection with many hops, and an average ping time of about 0.1 seconds. You will probably find that the MTU size (consider this to be the maximum packet size before packet fragmentation) when jumping through the various routers is probably about 1500 bytes. If the network administrators have configured the servers and local switch to support jumbo frames (8KB packets, for instance), those jumbo frames will be split into multiple packets so that they are able to pass through the intermediate routers between the server and client computers (you might not see a problem on the server side while one might appear on the client side). You indicate that the time delta between packets is about 0.0001 seconds (0.1ms), which is significantly less than the ping times suggested by the trace route - how many of those packets appear together with a delay of 0.0001 seconds in between?
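    Charles's fragmentation point can be sanity-checked with a little arithmetic. A sketch (the 40-byte IP+TCP header overhead is my assumption, not stated in the thread):

```python
# Rough illustration: how many on-the-wire packets a payload needs when it
# must traverse links with a 1500-byte MTU (assumed 40 bytes IP+TCP headers,
# giving a 1460-byte MSS per packet).
import math

def fragments(payload_bytes: int, mtu: int = 1500, overhead: int = 40) -> int:
    """Number of packets when each carries at most mtu - overhead bytes."""
    mss = mtu - overhead
    return math.ceil(payload_bytes / mss)

print(fragments(8 * 1024))    # one 8 KB jumbo frame -> 6 standard packets
print(fragments(20 * 1024))   # the ~20 KB burst from the capture -> 15 packets
```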

    If you have a Nagle/delayed ACK problem where one side is set to send an ACK after 13 (or some other number more than 2) packets while the other side is set to send an ACK after the default of 2 packets, that might, in part, explain the 5 seconds where no packets appeared on the server side of the Wireshark capture (for reference, on a gigabit network, a 130MB file will typically transfer across the network in about 3 seconds. When a Nagle/delayed ACK problem is present, that same transfer requires roughly 45 minutes).
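    The scale of that slowdown follows from the 0.2-second delayed-ACK timeout. A back-of-envelope sketch (my own numbers: the 1460-byte MSS and one stall per 13-packet window are assumptions; the 45-minute figure in the post is an observed one, so this only shows the order of magnitude):

```python
# Order-of-magnitude sketch of why repeated 0.2 s delayed-ACK stalls turn a
# seconds-long transfer into a minutes-long one. Assumptions: 1460-byte MSS,
# one 0.2 s timeout per 13-packet ACK window.
size_bytes = 130 * 1_000_000
wire_rate = 1_000_000_000 / 8                  # gigabit: ~125 MB/s raw
healthy_s = size_bytes / wire_rate
print(f"healthy: ~{healthy_s:.1f} s raw")      # ~1 s raw, ~3 s real world

packets = size_bytes / 1460
stalled_s = packets / 13 * 0.2                 # one 0.2 s timeout per window
print(f"stalled: ~{stalled_s / 60:.0f} min")   # tens of minutes, not seconds
```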

    Your network administrators forced the servers to 100Mb/s and it improved performance? This very well could be just a coincidence. If the server is forced to 100Mb/s full duplex but the switch port is not also forced to 100Mb/s full duplex, in most cases the actual connection speed will fall back to the minimum possible configuration - 10Mb/s half duplex. I wonder if the network administrators previously forced the network switch port to 100Mb/s full duplex but forgot to do the same for the server? Still, at 10Mb/s half duplex, the speed (throughput) is probably faster than your WAN link, so that change should not have made much of a difference. The very low latency between the server and switch is still much lower than the latency that you reported with the trace route - and that is what is likely killing performance. I wonder if the network administrators also disabled jumbo frames when they forced the connection speed? Disabling jumbo frames might explain the sudden decrease in time.

    In any case, it would still be interesting to investigate a Wireshark capture from the client side.

    Charles Hooper
    Co-author of "Expert Oracle Practices: Oracle Database Administration from the Oak Table"
    http://hoopercharles.wordpress.com/
    IT Manager/Oracle DBA
    K&M Machine-Fabricating, Inc.
  • 19. Re: network performance
    734766 Newbie
    Something seems to be very odd here. First, the two sides are connected by a high latency (Internet) connection with many hops, and an average ping time of about 0.1 seconds. You will probably find that the MTU size (consider this to be the maximum packet size before packet fragmentation) when jumping through the various routers is probably about 1500 bytes. If the network administrators have configured the servers and local switch to support jumbo frames (8KB packets, for instance), those jumbo frames will be split into multiple packets so that they are able to pass through the intermediate routers between the server and client computers (you might not see a problem on the server side while one might appear on the client side). You indicate that the time delta between packets is about 0.0001 seconds (0.1ms), which is significantly less than the ping times suggested by the trace route - how many of those packets appear together with a delay of 0.0001 seconds in between?
    Could the MTU be 1474? That's the size of most of the frames. Should it be 1500? I've heard 1500 more often than 1474 as the right value. I think jumbo frames are configured as unsupported, but I'll ask. And about the delta times... I think I was right, but I don't understand why they are so small. Here is a capture of the Wireshark trace:

    screenshot

    You can see the delta time of almost every frame, and the 5-second delay every now and then. The delta time between frames 625 and 649 is about 5 s. I should note that the SQL captured by this trace was sent by SQL Developer, and the array fetch size was set to 500.
    If you have a Nagle/delayed ACK problem where one side is set to send an ACK after 13 (or some other number more than 2) packets while the other side is set to send an ACK after the default of 2 packets, that might, in part, explain the 5 seconds where no packets appeared on the server side of the Wireshark capture (for reference, on a gigabit network, a 130MB file will typically transfer across the network in about 3 seconds. When a Nagle/delayed ACK problem is present, that same transfer requires roughly 45 minutes).
    I really don't know, because there are many ACK packets, but only every so often... It doesn't seem to fit either case: neither a single ACK every 2 packets nor a single ACK every other fixed number of packets... What I can tell you is that a single 16611 KB file transferred from server to client in a little more than 2 minutes, which gives an average speed of 120 KB/s, our line's bandwidth.
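    For what it's worth, that 120 KB/s figure is arithmetically consistent. A quick check (the exact duration of 2:18 is my assumption for "a little more than 2 minutes"; the file size is from the post):

```python
# Quick check of the quoted figure: 16611 KB in a bit over 2 minutes.
# Assuming ~2:18 as the actual duration reproduces the stated 120 KB/s.
size_kb = 16611
seconds = 2 * 60 + 18
rate = size_kb / seconds
print(f"~{rate:.0f} KB/s")   # ~120 KB/s, about the DSL line's bandwidth
```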
    Your network administrators forced the servers to 100Mb/s and it improved performance? This very well could be just a coincidence. If the server is forced to 100Mb/s full duplex but the switch port is not also forced to 100Mb/s full duplex, in most cases the actual connection speed will fall back to the minimum possible configuration - 10Mb/s half duplex. I wonder if the network administrators previously forced the network switch port to 100Mb/s full duplex but forgot to do the same for the server? Still, at 10Mb/s half duplex, the speed (throughput) is probably faster than your WAN link, so that change should not have made much of a difference. The very low latency between the server and switch is still much lower than the latency that you reported with the trace route - and that is what is likely killing performance. I wonder if the network administrators also disabled jumbo frames when they forced the connection speed? Disabling jumbo frames might explain the sudden decrease in time.
    Our network administrators on the server side did force the server interface to 100 Mb/s, but not the switch interfaces. Should I disable jumbo frames everywhere? I am pretty sure they are disabled on the client network.
    Edit: jumbo frames seem to be disabled everywhere.
    In any case, it would still be interesting to investigate a Wireshark capture from the client side.
    My PC (the client side) is a DB server too, so the traffic capture could be a little messy, but I remember doing a Wireshark capture of the same query from my PC, and I'm pretty sure I received basically the same packets that appear in the server capture...

    Thank you very much for your interest, I really appreciate it.

    Edited by: user11098377 on 13-Aug-2010 4:56
  • 20. Re: network performance
    734766 Newbie
    Hello there!

    Is there any problem if the server has a 100 Mb interface (forced to 100) but is connected to a gigabit switch that I assume is configured for autonegotiation? I hope not, because I doubt they will agree to change the config of all the housing switches...

    Thanks!
  • 21. Re: network performance
    CharlesHooper Expert
    user11098377 wrote:
    Hello there!

    Is there any problem if the server has a 100 Mb interface (forced to 100) but is connected to a gigabit switch that I assume is configured for autonegotiation? I hope not, because I doubt they will agree to change the config of all the housing switches...

    Thanks!
    The standard rule that should be followed by network hardware is that if one side of the connection is set to auto-negotiate and the other side of the connection is forced to a specific speed/duplex (and possibly flow control) setting, the actual link speed will drop to 10Mb/s half duplex, and there will be a good chance for packet retransmits and duplicate ACKs showing in a Wireshark trace. If you cannot change the switch port, it is best to leave the server at auto-negotiate. If the switch configuration is changed and the switch supports gigabit speeds, think about what might happen when a gigabit network card is installed in the server or a new server is installed - the link speed will drop to 10Mb/s with no apparent explanation - and someone might wonder why the database performance is slow all of a sudden.
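    The fallback rule described here can be summarized as a toy lookup (a deliberate simplification of my own; real NIC and switch behavior varies by vendor):

```python
def link_speed(server: str, switch: str) -> str:
    """Toy model of the speed/duplex negotiation rule described above.
    Settings are 'auto' or a forced value like '100/full'. Simplified:
    real hardware behavior differs between vendors."""
    if server == switch == "auto":
        return "best common speed, full duplex"
    if server == switch:
        return server                 # both forced to the same setting: works
    return "10Mb/s half duplex"       # mismatch: worst-case fallback

# The risky combination discussed in the thread (server forced, switch auto):
print(link_speed("100/full", "auto"))
```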

    Charles Hooper
    Co-author of "Expert Oracle Practices: Oracle Database Administration from the Oak Table"
    http://hoopercharles.wordpress.com/
    IT Manager/Oracle DBA
    K&M Machine-Fabricating, Inc.
  • 22. Re: network performance
    CharlesHooper Expert
    user11098377 wrote:
    Could the MTU be 1474? That's the size of most of the frames. Should it be 1500? I've heard 1500 more often than 1474 as the right value. I think jumbo frames are configured as unsupported, but I'll ask. And about the delta times... I think I was right, but I don't understand why they are so small. Here is a capture of the Wireshark trace:
    Some Internet connections only support an MTU smaller than 1500 - the number that you mention sounds familiar (maybe a DSL connection?).
    screenshot
    That screen capture (assuming that it was captured from the server) shows that the server sent 16 packets in rapid fire and then waited. 78ms later (roughly the ping time that you reported based on the TRACERT output) your server received 8 ACK packets - one for every 2 packets sent by the server (the client side is not configured to ACK after every 8 or 16 packets as the server appears to be configured). You will see a somewhat similar screen capture here where I "optimized" the ACK frequency on one side of the connection with a large array fetch size:
    http://hoopercharles.files.wordpress.com/2009/12/fastwirelessselectfrompartnaglefetcharray1000.jpg

    Please check page 24 of this Microsoft document:
    http://download.microsoft.com/download/2/8/0/2800a518-7ac6-4aac-bd85-74d2c52e1ec6/tuning.doc
    Find the section titled TcpAckFrequency that suggests a change to the Windows registry. Do not change anything, I am just curious what the current registry value is showing. In case you are wondering, I mentioned that Microsoft document in this blog article:
    http://hoopercharles.wordpress.com/2010/07/13/windows-as-an-os-platform-for-oracle-database-where-do-i-start/
    You can see the delta time of almost every frame, and the 5-second delay every now and then. The delta time between frames 625 and 649 is about 5 s. I should note that the SQL captured by this trace was sent by SQL Developer, and the array fetch size was set to 500.
    If I am reading the Wireshark screen capture correctly, the 5 second delay happened after the server received 8 ACK packets (ACKs for the previous 16 packets that the server sent), and before the server continued to send the next batch of packets to the client. The problem appears to be on the server side of the connection - the server is probably waiting for another packet or a timeout. The typical timeout permitted to wait for an expected packet is 0.2 seconds, much less than the 5 second delay. Is it possible that someone "optimized" some of the other network parameters on the Windows server?
    I really don't know, because there are many ACK packets, but only every so often... It doesn't seem to fit either case: neither a single ACK every 2 packets nor a single ACK every other fixed number of packets... What I can tell you is that a single 16611 KB file transferred from server to client in a little more than 2 minutes, which gives an average speed of 120 KB/s, our line's bandwidth.
    The test that I mentioned should be performed on the server side of the WAN - two devices on that segment of the network, where a large file that should transfer in 3 or 4 seconds on a gigabit connection takes 45 minutes instead. See the above for an explanation.
    Your network administrators forced the servers to 100Mb/s and it improved performance? This very well could be just a coincidence. If the server is forced to 100Mb/s full duplex but the switch port is not also forced to 100Mb/s full duplex, in most cases the actual connection speed will fall back to the minimum possible configuration - 10Mb/s half duplex. I wonder if the network administrators previously forced the network switch port to 100Mb/s full duplex but forgot to do the same for the server? Still, at 10Mb/s half duplex, the speed (throughput) is probably faster than your WAN link, so that change should not have made much of a difference. The very low latency between the server and switch is still much lower than the latency that you reported with the trace route - and that is what is likely killing performance. I wonder if the network administrators also disabled jumbo frames when they forced the connection speed? Disabling jumbo frames might explain the sudden decrease in time.
    Our network administrators on the server side did force the server interface to 100 Mb/s, but not the switch interfaces. Should I disable jumbo frames everywhere? I am pretty sure they are disabled on the client network.
    Edit: jumbo frames seem to be disabled everywhere.
    I would not recommend implementing jumbo frames if those packets need to traverse the WAN. I could see where jumbo frames (8KB) would be helpful in a RAC configuration on just the private network segment between the RAC nodes.
    In any case, it would still be interesting to investigate a Wireshark capture from the client side.
    My PC (the client side) is a DB server too, so the traffic capture could be a little messy, but I remember doing a Wireshark capture of the same query from my PC, and I'm pretty sure I received basically the same packets that appear in the server capture...

    Thank you very much for your interest, I really appreciate it.
    Charles Hooper
    Co-author of "Expert Oracle Practices: Oracle Database Administration from the Oak Table"
    http://hoopercharles.wordpress.com/
    IT Manager/Oracle DBA
    K&M Machine-Fabricating, Inc.
  • 23. Re: network performance
    734766 Newbie
    Hello Charles!
    Some Internet connections only support an MTU smaller than 1500 - the number that you mention sounds familiar (maybe a DSL connection?).
    Yes indeed, a symmetric 1 Mb DSL link.
    That screen capture (assuming that it was captured from the server) shows that the server sent 16 packets in rapid fire and then waited. 78ms later (roughly the ping time that you reported based on the TRACERT output) your server received 8 ACK packets - one for every 2 packets sent by the server (the client side is not configured to ACK after every 8 or 16 packets as the server appears to be configured). You will see a somewhat similar screen capture here where I "optimized" the ACK frequency on one side of the connection with a large array fetch size:
    http://hoopercharles.files.wordpress.com/2009/12/fastwirelessselectfrompartnaglefetcharray1000.jpg

    I don't think I get it. The client side is configured to send an ACK every 2 packets (the default), so why is it sending them all together? In your capture the client also seems to be configured to ACK every certain number of packets, but it sends the ACKs when it should, unlike mine.
    Please check page 24 of this Microsoft document:
    http://download.microsoft.com/download/2/8/0/2800a518-7ac6-4aac-bd85-74d2c52e1ec6/tuning.doc
    Find the section titled TcpAckFrequency that suggests a change to the Windows registry. Do not change anything, I am just curious what the current registry value is showing. In case you are wondering, I mentioned that Microsoft document in this blog article:
    http://hoopercharles.wordpress.com/2010/07/13/windows-as-an-os-platform-for-oracle-database-where-do-i-start/

    Ok, there is no TcpAckFrecuency parameter in that registry key, so I guess it's using the default value of 2.
    If I am reading the Wireshark screen capture correctly, the 5 second delay happened after the server received 8 ACK packets (ACKs for the previous 16 packets that the server sent), and before the server continued to send the next batch of packets to the client. The problem appears to be on the server side of the connection - the server is probably waiting for another packet or a timeout. The typical timeout permitted to wait for an expected packet is 0.2 seconds, much less than the 5 second delay. Is it possible that someone "optimized" some of the other network parameters on the Windows server?
    Tricky question. I do think so, but the person responsible for those changes will not admit that that is the problem. It was so long ago that he may not even remember. Should I review that Microsoft document for all the parameters and tweaks?
    The test that I mentioned should be performed on the server side of the WAN - two devices on that segment of the network, where a large file that should transfer in 3 or 4 seconds on a gigabit connection takes 45 minutes instead. See the above for an explanation.
    OK, that test was to prove that we use the full DSL link bandwidth when everything is fine. Then I checked the server's network card and saw that it was configured to autonegotiate. I switched it to 100 Mb full and started a network copy of a 640 MB folder from the server to another idle server on the same network: 140, 250 or even more minutes expected. What?! Then I remembered your lessons and discovered that the network card of the second server was autonegotiating too. So I switched it to 100 and started the test again: 3-4 minutes. Fantastic.

    But when I forced the 100 Mb full interface on the server (the switches were changed last week, as I said in an earlier post), I went back to the initial performance!!! Now the query takes 3:30 min again; I don't know what the problem is! There is another server on the network that will stay configured to autonegotiate; could that cause some trouble?

    Thank you very much!
  • 24. Re: network performance
    CharlesHooper Expert
    user11098377 wrote:
    That screen capture (assuming that it was captured from the server) shows that the server sent 16 packets in rapid fire and then waited. 78ms later (roughly the ping time that you reported based on the TRACERT output) your server received 8 ACK packets - one for every 2 packets sent by the server (the client side is not configured to ACK after every 8 or 16 packets as the server appears to be configured). You will see a somewhat similar screen capture here where I "optimized" the ACK frequency on one side of the connection with a large array fetch size:
    http://hoopercharles.files.wordpress.com/2009/12/fastwirelessselectfrompartnaglefetcharray1000.jpg

    I don't think I get it. The client side is configured to send an ACK every 2 packets (the default), so why is it sending them all together? In your capture the client also seems to be configured to ACK every certain number of packets, but it sends the ACKs when it should, unlike mine.
    It is an illusion that all of the ACK packets are being sent together - well, not really. The ACK packet is a little bit like a method of controlling the speed of a transfer: the sender promises not to send more than (by default) two network packets at a time before receiving confirmation that the two packets were safely received by the receiver. If the sender does not receive an ACK packet, typically within 0.2 seconds, it assumes that the packets were lost in transmission and attempts to resend them. Your server is sending out 16 packets before pausing to wait for the return of a single ACK packet acknowledging that the previous 16 packets were received. The client computer at the other end of the DSL line receives the first two packets 0.04 seconds after the server sent them and replies with an ACK for those two packets. It then immediately sees the next two packets and replies with an ACK for those two, and so on, until the client receives the 15th and 16th packets (a Wireshark capture on the client should show this). The server sees the 8 ACK packets - one for every 2 packets sent - coming in all at once (0.04 seconds after being sent by the client), when it is expecting a single ACK for the previous 16 packets. I assume that this is at least part of the cause of the 5 second pauses.
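    A toy timeline (all numbers invented to match the description above: a 16-packet burst ~0.1 ms apart, an ACK every 2nd packet, and ~40 ms one-way latency, half the ~78 ms round trip seen in the capture) reproduces the "all at once" effect:

```python
# Toy timeline of the exchange described above, from the server's viewpoint.
ONE_WAY = 0.040      # seconds one-way latency (assumed: half the ~78 ms RTT)
BURST_GAP = 0.0001   # seconds between packets inside the burst

send_times = [i * BURST_GAP for i in range(16)]
# The client ACKs a pair as soon as its second packet arrives; that ACK then
# takes another ONE_WAY seconds to travel back to the server.
ack_arrivals = [send_times[2 * p + 1] + 2 * ONE_WAY for p in range(8)]

spread_ms = (ack_arrivals[-1] - ack_arrivals[0]) * 1000
print(f"first ACK back at {ack_arrivals[0] * 1000:.2f} ms; "
      f"all 8 ACKs arrive within {spread_ms:.2f} ms of each other")
# ~80 ms after the burst, the 8 ACKs land within ~1.5 ms: "all at once".
```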
    Please check page 24 of this Microsoft document:
    http://download.microsoft.com/download/2/8/0/2800a518-7ac6-4aac-bd85-74d2c52e1ec6/tuning.doc
    Find the section titled TcpAckFrequency that suggests a change to the Windows registry. Do not change anything, I am just curious what the current registry value is showing. In case you are wondering, I mentioned that Microsoft document in this blog article:
    http://hoopercharles.wordpress.com/2010/07/13/windows-as-an-os-platform-for-oracle-database-where-do-i-start/

    Ok, there is no TcpAckFrecuency parameter in that registry key, so I guess it's using the default value of 2.
    Please check again. There is a lowercase Q (q) after "TcpAckFre" - if you do not find it, then it is set to the default value. Note that this setting may appear as a setting for a specific network adapter, see the following which seems to describe the location a little better:
    http://support.microsoft.com/kb/328890/

    For example, in:
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\<Interface GUID>
    The actual location may be something like this:
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{EA219350-C25F-4304-B0A7-CA6C15D25C3F}
    If I am reading the Wireshark screen capture correctly, the 5 second delay happened after the server received 8 ACK packets (ACKs for the previous 16 packets that the server sent), and before the server continued to send the next batch of packets to the client. The problem appears to be on the server side of the connection - the server is probably waiting for another packet or a timeout. The typical timeout permitted to wait for an expected packet is 0.2 seconds, much less than the 5 second delay. Is it possible that someone "optimized" some of the other network parameters on the Windows server?
    Tricky question. I do think so, but the person responsible for those changes will not admit that that is the problem. It was so long ago that he may not even remember. Should I review that Microsoft document for all the parameters and tweaks?
    It may be a good idea to review the parameters - not all of the parameters are included in that Microsoft document. For example, if someone has run the utility at the following page and then allowed the tool to apply the changes, it will change parameters in the Windows registry that probably are not covered in Microsoft's document:
    http://www.dslreports.com/drtcp

    I have used the above utility in the past to adjust these parameters to try to improve WAN performance, but there is a side-effect to making those changes - local LAN performance will likely be impacted. If you run the utility, compare the displayed parameters for the network card with the screen capture that is displayed on the page. If no one has modified the settings on the server it should look about the same as you see in that screen capture (the TCP Receive Window might be different). Needless to say, if you do run the above program, do not hit the "Apply" button unless you know what is being changed.
    The test that I mentioned should be performed on the server side of the WAN - two devices on that segment of the network, where a large file that should transfer in 3 or 4 seconds on a gigabit connection takes 45 minutes instead. See the above for an explanation.
    OK, that test was to prove that we use the full DSL link bandwidth when everything is fine. Then I checked the server's network card and saw that it was configured to autonegotiate. I switched it to 100 Mb full and started a network copy of a 640 MB folder from the server to another idle server on the same network: 140, 250 or even more minutes expected. What?! Then I remembered your lessons and discovered that the network card of the second server was autonegotiating too. So I switched it to 100 and started the test again: 3-4 minutes. Fantastic.
    Something is wrong with the above. On a 100Mb/s connection, I am able to copy a 129MB file to a laptop in about 18 seconds. That suggests a 10Mb/s connection (which you will see when the server and switch are not set to the same speed - if not auto-negotiate) would be able to transfer roughly 12.9MB in the same 18 seconds. Doing the math:
    129MB in 18 seconds at 100Mb/s, 12.9MB in 18 seconds at 10Mb/s
    640MB/12.9 * 18 = 893 seconds = 14.9 minutes to transfer 640MB over a 10Mb/s connection.
    (or 12.9MB/18 = 0.7166MB/s, 640MB / 0.7166MB ~ 893 seconds)

    Or another way, assuming that the real-world speed of a 10Mb/s link is close to 7Mb/s:
    640MB * 8 bits/byte / 7Mb/s = 731.43 seconds = 12.2 minutes to transfer 640MB over a 10Mb/s connection.

    You should not be seeing 140 or 250 minutes (unless your server is not sending ACKs correctly, in which case Wireshark would show periodic 0.2 second delays) - it should be closer to 15 minutes.
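    The calculation above, spelled out in code (same numbers as the post, nothing new):

```python
# Charles's arithmetic: scale the observed 100 Mb/s transfer (129 MB in 18 s)
# down to a 10 Mb/s fallback link and project the 640 MB copy time.
ref_mb, ref_s = 129, 18              # observed: 129 MB in 18 s at 100 Mb/s
rate_10 = ref_mb / 10 / ref_s        # implied MB/s if the link falls to 10 Mb/s
t_640_s = 640 / rate_10
print(f"~{t_640_s:.0f} s = ~{t_640_s / 60:.1f} minutes for 640 MB at 10 Mb/s")
```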
    But when I forced the 100 Mb full interface on the server (the switches were changed last week, as I said in an earlier post), I went back to the initial performance!!! Now the query takes 3:30 min again; I don't know what the problem is! There is another server on the network that will stay configured to autonegotiate; could that cause some trouble?
    As long as the switch port and the server are both set to autonegotiate, you are fine. Some switches will refuse to operate their ports at gigabit speeds unless both the switch port and the device (the server's network card) are set to autonegotiate - you cannot force gigabit speed. I don't think this is the cause of the 3:30 query time, though. Are you able to run Wireshark on the client side of the WAN link?

    Charles Hooper
    Co-author of "Expert Oracle Practices: Oracle Database Administration from the Oak Table"
    http://hoopercharles.wordpress.com/
    IT Manager/Oracle DBA
    K&M Machine-Fabricating, Inc.
  • 25. Re: network performance
    734766 Newbie
    Hello Charles,
    It is an illusion that all of the ACK packets are being sent together - well, not really. The ACK packet is a little bit like a method of controlling the speed of a transfer: the sender promises not to send more than (by default) two network packets at a time before receiving confirmation that the two packets were safely received by the receiver. If the sender does not receive an ACK packet, typically within 0.2 seconds, it assumes that the packets were lost in transmission and attempts to resend them. Your server is sending out 16 packets before pausing to wait for the return of a single ACK packet acknowledging that the previous 16 packets were received. The client computer at the other end of the DSL line receives the first two packets 0.04 seconds after the server sent them and replies with an ACK for those two packets. It then immediately sees the next two packets and replies with an ACK for those two, and so on, until the client receives the 15th and 16th packets (a Wireshark capture on the client should show this). The server sees the 8 ACK packets - one for every 2 packets sent - coming in all at once (0.04 seconds after being sent by the client), when it is expecting a single ACK for the previous 16 packets. I assume that this is at least part of the cause of the 5 second pauses.
    Wow! You really know your stuff, because that is exactly what is happening in the client capture... here it is:

    Client screenshot
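The burst-and-ACK mismatch described above can be sketched numerically (a toy counting model, not a protocol implementation; the 16-packet send burst and the ACK-every-2-packets behaviour are taken from the captures discussed in this thread):

```python
# Toy model of the ACK pattern described above: the server sends a burst
# of 16 data packets, the client acknowledges every 2 packets, so the
# server sees 8 ACKs arrive together instead of the single ACK it expects.
SERVER_BURST = 16  # packets sent before the server pauses (from the capture)
ACK_EVERY = 2      # client ACKs every 2 packets (delayed-ACK style default)

acks_per_burst = SERVER_BURST // ACK_EVERY
print(f"ACKs returned per {SERVER_BURST}-packet burst: {acks_per_burst}")
# The server expected 1 ACK per burst; this 8-vs-1 mismatch is one
# plausible contributor to the 5-second stalls seen in the trace.
```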

    Well, I can see that every one of my packets has an incorrect header checksum - is that a problem? I can see the ACK packet sent after every two server packets, exactly as you foretold, and the server waiting 5 seconds every 16 packets. Every now and then I can also see 4 packets received with only one ACK sent - I don't know why.
    I tried a lighter query that takes 5 seconds, and it has the 5 second delay too - the delay is almost all the time the query takes. This query was sent with an array fetch size of 150; if I set it to 500, the query takes 15 seconds with 3 delays. It can't really be spending so little time transferring the data, can it? It seems that the packets are stalled in some switch, maybe? But the server capture was taken at the server interface and shows that no packets come from Oracle to the network card during the delays.
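A back-of-the-envelope model of the timings just reported (a sketch only; the ~5 second stall length and the stall counts per fetch size are read off the measurements in this post, not derived from any Oracle internals):

```python
# Rough model: the query time is dominated by ~5 s stalls; the data
# transfer itself is comparatively negligible at these sizes.
STALL_SECONDS = 5.0  # observed pause length (from the captures)

def estimated_query_seconds(n_stalls, transfer_seconds=0.1):
    """Total elapsed time if n_stalls five-second pauses dominate."""
    return n_stalls * STALL_SECONDS + transfer_seconds

# Fetch size 150 showed 1 stall, fetch size 500 showed 3 stalls:
print(round(estimated_query_seconds(1)))  # ~5 s, matching the lighter query
print(round(estimated_query_seconds(3)))  # ~15 s, matching the 500-row fetch
```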
    Please check again. There is a lowercase Q (q) after "TcpAckFre" - if you do not find it, then it is set to the default value. Note that this setting may appear as a setting for a specific network adapter, see the following which seems to describe the location a little better:
    http://support.microsoft.com/kb/328890/>

    Sorry, I misspelled that. I did a search for the word TcpAckFrequency and it couldn't find it, so it's set to the default value. I have some of those hexadecimal-named interface entries in the registry, but no TcpAckFrequency parameter.
    It may be a good idea to review the parameters - not all of the parameters are included in that Microsoft document. For example, if someone has run the utility at the following page and then allowed the tool to apply the changes, it will change parameters in the Windows registry that probably are not covered in Microsoft's document:
    http://www.dslreports.com/drtcp

    I have used the above utility in the past to adjust these parameters to try to improve WAN performance, but there is a side-effect to making those changes - local LAN performance will likely be impacted. If you run the utility, compare the displayed parameters for the network card with the screen capture that is displayed on the page. If no one has modified the settings on the server it should look about the same as you see in that screen capture (the TCP Receive Window might be different). Needless to say, if you do run the above program, do not hit the "Apply" button unless you know what is being changed.>

    Ok, unexpected results (maybe the O.S., Windows Server 2003 R2, is not completely supported) here:

    DRTCP

    But at least it seems to be at default settings.
    Something is wrong with the above. On a 100Mb/s connection, I am able to copy a 129MB file to a laptop in about 18 seconds. That suggests a 10Mb/s connection (which you will see when the server and switch are not set to the same speed - if not auto-negotiate) would be able to transfer roughly 12.9MB in the same 18 seconds. Doing the math:
    129MB in 18 seconds at 100Mb/s, 12.9MB in 18 seconds at 10Mb/s
    640MB/12.9 * 18 = 893 seconds = 14.9 minutes to transfer 640MB over a 10Mb/s connection.
    (or 12.9MB/18 = 0.7166MB/s, 640MB / 0.7166MB ~ 893 seconds)

    Or another way, assuming that the real-world speed of a 10Mb/s connection is closer to 7Mb/s:
    640MB * 8 bits / 7Mb/s = 731.43 seconds = 12.2 minutes to transfer 640MB over a 10Mb/s connection.

    You should not be seeing 140 or 250 minutes (unless your server is not sending ACK correctly, in which case Wireshark would show periodic 0.2 second delays) - it should be close to 15 minutes.>
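The arithmetic in the quoted reply can be reproduced directly (the 129 MB in 18 seconds baseline and the ~7 Mb/s real-world figure are the poster's own numbers, taken as given):

```python
# Reproduce the transfer-time estimates quoted above.
baseline_mb, baseline_s = 129, 18              # 129 MB in 18 s at 100 Mb/s
rate_10mbps = (baseline_mb / 10) / baseline_s  # MB/s at one tenth the speed
t1 = 640 / rate_10mbps                         # seconds to move 640 MB at 10 Mb/s
print(round(t1))                               # ~893 s ~ 14.9 minutes

# Alternative estimate assuming ~7 Mb/s real-world throughput on 10 Mb/s:
t2 = 640 * 8 / 7                               # MB * 8 bits / (Mb/s) = seconds
print(round(t2))                               # ~731 s ~ 12.2 minutes
```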

    I must admit I didn't wait till it finished, so I retested the copy and chose a 358 MB folder. In 10 minutes it had transferred around 70 MB, which gives an average of 0.10 - 0.15 MB/s. Unbelievable. The interface says that it is connected at 100 Mb half duplex. Same test with the interface forced to 100 full: less than two minutes -> about 3 - 4 MB/s, which is still less than expected, isn't it? It really seems to be a problem, but where? I think that there are at least two different switches in the server network, and the one we haven't touched is configured to autonegotiate... This is driving me mad!
    As long as the switch port and the server are both set to autonegotiate, you are fine. Some switches will refuse to operate their ports at gigabit speeds unless both the switch port and the device (the server's network card) are set to autonegotiate - you cannot force gigabit speed. I don't think that this is the cause of the 3:30 transfer time. Are you able to run Wireshark on the client side of the WAN link?
    The new switch I mentioned is, I think, a 100 Mb switch, so we cannot force interfaces to gigabit even though they are gigabit interfaces. At the beginning all interfaces were set to autonegotiate, so that may not be the solution. Enjoy the strange client side capture.

    Thank you so much for your time, you are the best!!
  • 26. Re: network performance
    CharlesHooper Expert
    user11098377 wrote:
    Hello Charles,

    Well, I can see that every one of my packets has an incorrect header checksum - is that a problem? I can see the ACK packet sent after every two server packets, exactly as you foretold, and the server waiting 5 seconds every 16 packets. Every now and then I can also see 4 packets received with only one ACK sent - I don't know why.
    I tried a lighter query that takes 5 seconds, and it has the 5 second delay too - the delay is almost all the time the query takes. This query was sent with an array fetch size of 150; if I set it to 500, the query takes 15 seconds with 3 delays. It can't really be spending so little time transferring the data, can it? It seems that the packets are stalled in some switch, maybe? But the server capture was taken at the server interface and shows that no packets come from Oracle to the network card during the delays.
    The header checksum errors are caused by TCP checksum offloading. You will typically receive that message when a better quality network card is installed in the computer - the packet checksum calculation/verification is handled by the CPU in the network card itself, rather than relying on the host computer's CPU. I think that one of my blog articles mentions this feature. More information may be found here:
    http://www.wireshark.org/docs/wsug_html_chunked/ChAdvChecksums.html

    It is interesting that you periodically see 4 packets arriving with a single ACK packet sent in return - possibly some of the packets were received by the client out of order, so the client had to wait for the expected packet to arrive. The typical pattern for a database sending packets seems to be one or more TCP packets with "[TCP segment of a reassembled PDU]" in the info column followed by a single TNS packet with "Response, Data (6), Data" in the info column - that TNS packet seems to mark the end of the array fetch (although I may need to perform additional analysis). Based on some of the tests that I have performed varying the array fetch size, if you cannot track down why the ACK frequency is wrong, you may be able to work around the issue by setting the array fetch size much smaller.
    http://hoopercharles.wordpress.com/2009/12/15/network-monitoring-experimentations-1/

    In the above blog article, you can compare the “Optimized” transfer speed as the fetch array size is increased, and also compare it with the unoptimized (no one has changed the ACK frequency) transfer. You will see that as the fetch array size increases, the ACK frequency drops off - this is a setup where I modified the ACK frequency on the client, while I believe that the ACK frequency of your server was adjusted.
    Please check again. There is a lowercase Q (q) after "TcpAckFre" - if you do not find it, then it is set to the default value. Note that this setting may appear as a setting for a specific network adapter, see the following which seems to describe the location a little better:
    http://support.microsoft.com/kb/328890/>
    I have used the above utility in the past to adjust these parameters to try to improve WAN performance, but there is a side-effect to making those changes - local LAN performance will likely be impacted. If you run the utility, compare the displayed parameters for the network card with the screen capture that is displayed on the page. If no one has modified the settings on the server it should look about the same as you see in that screen capture (the TCP Receive Window might be different). Needless to say, if you do run the above program, do not hit the "Apply" button unless you know what is being changed.>

    Ok, unexpected results (maybe the O.S., Windows Server 2003 R2, is not completely supported) here:

    DRTCP

    But at least it seems to be at default settings.
    I think that "default" in this case means that the setting was not modified specifically for this particular network card. The server's defaults could have been changed. Please take a look at the keys in the following registry location:
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

    If I recall correctly, that location indicates the default values when specific values are not specified for a particular network card. Some of the settings are described here:
    http://technet.microsoft.com/en-us/library/bb463205.aspx
    http://support.microsoft.com/kb/314053
    Something is wrong with the above. On a 100Mb/s connection, I am able to copy a 129MB file to a laptop in about 18 seconds. That suggests a 10Mb/s connection (which you will see when the server and switch are not set to the same speed - if not auto-negotiate) would be able to transfer roughly 12.9MB in the same 18 seconds. Doing the math:
    129MB in 18 seconds at 100Mb/s, 12.9MB in 18 seconds at 10Mb/s
    640MB/12.9 * 18 = 893 seconds = 14.9 minutes to transfer 640MB over a 10Mb/s connection.
    (or 12.9MB/18 = 0.7166MB/s, 640MB / 0.7166MB ~ 893 seconds)

    Or another way, assuming that the real-world speed of a 10Mb/s connection is closer to 7Mb/s:
    640MB * 8 bits / 7Mb/s = 731.43 seconds = 12.2 minutes to transfer 640MB over a 10Mb/s connection.

    You should not be seeing 140 or 250 minutes (unless your server is not sending ACK correctly, in which case Wireshark would show periodic 0.2 second delays) - it should be close to 15 minutes.>

    I must admit I didn't wait till it finished, so I retested the copy and chose a 358 MB folder. In 10 minutes it had transferred around 70 MB, which gives an average of 0.10 - 0.15 MB/s. Unbelievable. The interface says that it is connected at 100 Mb half duplex. Same test with the interface forced to 100 full: less than two minutes -> about 3 - 4 MB/s, which is still less than expected, isn't it? It really seems to be a problem, but where? I think that there are at least two different switches in the server network, and the one we haven't touched is configured to autonegotiate... This is driving me mad!
    358MB / 120 seconds = 2.98MB/s; 2.98MB/s * (8 data bits + 1 overhead bit) ~ 26.85Mb/s. Are you sure that this is a switch and not a hub between the two computers? Hubs operate at half duplex.
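The effective-rate calculation above can be checked the same way (the 9-bits-per-byte figure folds in framing overhead, as the quoted reply assumes):

```python
# Check the effective throughput figure quoted above.
mb_transferred, seconds = 358, 120
rate_mb_per_s = mb_transferred / seconds  # ~2.98 MB/s
rate_mbit = rate_mb_per_s * 9             # 8 data bits + 1 overhead bit
print(round(rate_mb_per_s, 2), round(rate_mbit, 2))
# ~2.98 MB/s ~ 26.85 Mb/s -- well under 100 Mb/s full duplex, consistent
# with a duplex mismatch or a half-duplex (hub-like) segment in the path.
```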
    The new switch I mentioned is, I think, a 100 Mb switch, so we cannot force interfaces to gigabit even though they are gigabit interfaces. At the beginning all interfaces were set to autonegotiate, so that may not be the solution. Enjoy the strange client side capture.
    Are you able to indicate the specific name and model of the switches? Keep in mind that even if the switch and server configuration are mismatched, it is still far faster than the DSL connection, so it should not make a noticeable difference in the remote performance. If the switches are managed switches, there should be a communication log of some sort that shows whether or not there are spurious network packet issues.

    Charles Hooper
    Co-author of "Expert Oracle Practices: Oracle Database Administration from the Oak Table"
    http://hoopercharles.wordpress.com/
    IT Manager/Oracle DBA
    K&M Machine-Fabricating, Inc.
  • 27. Re: network performance
    734766 Newbie
    Hello Charles!

    Sorry, but I couldn't spend time on this over the last few days. Here we go:
    The header checksum errors are caused by TCP checksum offloading. You will typically receive that message when a better quality network card is installed in the computer - the packet checksum calculation/verification is handled by the CPU in the network card itself, rather than relying on the host computer's CPU. I think that one of my blog articles mentions this feature. More information may be found here:
    http://www.wireshark.org/docs/wsug_html_chunked/ChAdvChecksums.html

    It is interesting that you periodically see 4 packets arriving with a single ACK packet sent in return - possibly some of the packets were received by the client out of order, so the client had to wait for the expected packet to arrive. The typical pattern for a database sending packets seems to be one or more TCP packets with "[TCP segment of a reassembled PDU]" in the info column followed by a single TNS packet with "Response, Data (6), Data" in the info column - that TNS packet seems to mark the end of the array fetch (although I may need to perform additional analysis). Based on some of the tests that I have performed varying the array fetch size, if you cannot track down why the ACK frequency is wrong, you may be able to work around the issue by setting the array fetch size much smaller.
    http://hoopercharles.wordpress.com/2009/12/15/network-monitoring-experimentations-1/

    In the above blog article, you can compare the “Optimized” transfer speed as the fetch array size is increased, and also compare it with the unoptimized (no one has changed the ACK frequency) transfer. You will see that as the fetch array size increases, the ACK frequency drops off - this is a setup where I modified the ACK frequency on the client, while I believe that the ACK frequency of your server was adjusted.>

    I don't see the ACK frequency dropping off; these captures were taken with a fetch array size of 500, and I don't remember seeing a different ACK frequency when I modified it.
    I think that "default" in this case means that the setting was not modified specifically for this particular network card. The server's defaults could have been changed. Please take a look at the keys in the following registry location:
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

    If I recall correctly, that location indicates the default values when specific values are not specified for a particular network card. Some of the settings are described here:
    http://technet.microsoft.com/en-us/library/bb463205.aspx
    http://support.microsoft.com/kb/314053>

    I cannot see any parameter related to the ones shown in DRTCP, neither in the Parameters registry key nor inside its subkeys (Interfaces, etc.). I don't know if that's a problem, but I guess not.
    Something is wrong with the above. On a 100Mb/s connection, I am able to copy a 129MB file to a laptop in about 18 seconds. That suggests a 10Mb/s connection (which you will see when the server and switch are not set to the same speed - if not auto-negotiate) would be able to transfer roughly 12.9MB in the same 18 seconds. Doing the math:
    129MB in 18 seconds at 100Mb/s, 12.9MB in 18 seconds at 10Mb/s
    640MB/12.9 * 18 = 893 seconds = 14.9 minutes to transfer 640MB over a 10Mb/s connection.
    (or 12.9MB/18 = 0.7166MB/s, 640MB / 0.7166MB ~ 893 seconds)

    Or another way, assuming that the real-world speed of a 10Mb/s connection is closer to 7Mb/s:
    640MB * 8 bits / 7Mb/s = 731.43 seconds = 12.2 minutes to transfer 640MB over a 10Mb/s connection.

    You should not be seeing 140 or 250 minutes (unless your server is not sending ACK correctly, in which case Wireshark would show periodic 0.2 second delays) - it should be close to 15 minutes.

    358MB / 120 seconds = 2.98MB/s; 2.98MB/s * (8 data bits + 1 overhead bit) ~ 26.85Mb/s. Are you sure that this is a switch and not a hub between the two computers? Hubs operate at half duplex.>

    I have my network administrators looking into it. It is abnormal, at least, so I expect some reason or changes. I'll keep you in the loop.
    Are you able to indicate the specific name and model of the switches? Keep in mind that even if the switch and server configuration are mismatched, it is still far faster than the DSL connection, so it should not make a noticeable difference in the remote performance. If the switches are managed switches, there should be a communication log of some sort that shows whether or not there are spurious network packet issues.
    I asked for that info, I'll post ASAP.

    I ran Wireshark on the server network adapter and saw exactly the same behaviour detected in the captures from the client. If the packets aren't even reaching the server's network card from Oracle during the delays, isn't there a problem in the server configuration or software? Or may the problem still be outside the server? We use a domestic ADSL line - symmetric, but the same line any home user could subscribe to. Could this be a problem? Maybe the ISP or some intermediate node is filtering or throttling TNS traffic? Why the perfect behaviour when I query from the other server in the LAN? Our Oracle license is for the cheapest production database - any limitations there?

    Thank you very much!
  • 28. Re: network performance
    734766 Newbie
    Hello,

    They use only Cisco switches, but they didn't reveal the model names... no hubs. No logs either - they think the logs won't help us, so they won't give us any. Anyway, if all the packets arrive at the client (with lag, but they arrive) and the 5 second lag is already visible at the server network adapter, I don't think switch logs would help here - am I right?

    Thanks!
  • 29. Re: network performance
    734766 Newbie
    Hi everyone!

    Any suggestions, please?

    Thanks!!
