26 Replies Latest reply on Jun 4, 2019 2:50 AM by 3836784

    Windows SMB client performance issues

    3836784

      Hi all,

       

       

      We recently set up a Solaris 11.4 server based on the following hardware;

       

      SuperMicro SSG-6049P-E1CR45L

      2x XEON Silver 4114 10 Core 2.2GHz

      (LSI 3008 IT mode)

      512GB DDR4-2666 ECC RAM

      2 x Intel S4600 240GB

      2 x Intel Optane 905P 480GB U.2

      24 x HGST 10TB SAS

      2 x Intel X540T2 10Gbe NIC

       

       

      Our main storage pool is set up like so:

       

      4 groups of 6 physical 10TB drives in RAID-Z2

      1 x Intel Optane 905P 480GB U.2 - cache

      1 x Intel Optane 905P 480GB U.2 - log

       

       

      lz4 compression is enabled on the shared filesystem.

       

       

      Sharing is via the Solaris CIFS service, to about 32 clients in total. Almost all of the clients read from or write to the server only in short bursts, when they save files or do renders, usually with files between 500 MB and 4 GB in size.

       

       

      The problem we are having is that Windows clients using SMB get erratic performance. When it works, the Windows SMB clients will copy files at up to 1 GB/s over 10GbE, usually hovering around 500-700 MB/s. The problem is that, at frequent and random times, the copy will completely stall and drop to 0 bytes/s. It will often remain stalled for a few minutes and then resume, ramping up to the previous speed. Sometimes it will stall again.


      These stalls only seem to happen on Windows SMB clients and not macOS clients.

       

      The stalls only seem to happen when writing to the server, not reading from it.

       

      Clients are running Windows 10 1709. Solaris version is 11.4.5.3.0.

       

      I can't see any correlation with anything happening on the server that would seem to cause this.

       

      Any help on working this out would be greatly appreciated.

       

      Tristan

        • 1. Re: Windows SMB client performance issues
          Steve H -Oracle

          Hi,

           

          If the stalls "only seem to happen on Windows SMB clients and not macOS clients", then you may want to run snoop or tcpdump to see what is happening on the network during that time, to help narrow down where the issue is.

           

          # snoop -q -d <iface> -o snoop.out <IP-addr-of-PC>

          After the packet capture is collected, you can check it with:

          # snoop -r -i snoop.out

          • 2. Re: Windows SMB client performance issues
            3836784

            Hi Steve,

             

            Thanks for the suggestion. I ran the snoop command as suggested, and in the snoop.out file there are a ton of these 'Type=Unknown' messages:

             

            548   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

            549   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

            550   0.00001  192.168.1.2 -> 192.168.1.131 SMB R port=55748

            551   0.00005 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

            552   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

            553   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

            554   0.00002 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

            555   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

            556   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

            557   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

            558   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

            559   0.00001  192.168.1.2 -> 192.168.1.131 SMB R port=55748

            560   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

            561   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

            562   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

            563   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

            564   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

            565   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

            566   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

            567   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

            568   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

            569   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

            570   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

            571   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

            572   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

             

            Could this be related to the issue we are seeing?

            • 3. Re: Windows SMB client performance issues
              3836784

              I also used tcpdump and compared a problem-free copy to the server with a stalled copy. The main difference was the presence of several lines like this in the stalled copy:

               

              10:56:49.811973 IP saltgum.microsoft-ds > tr2990wx-01.xxxxx.net.56853: Flags [.], ack 775249932, win 32804, options [nop,nop,sack 1 {775251392:775252852}], length 0
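For anyone reading that tcpdump line: it is a selective acknowledgement (SACK) from the server. The `ack` value is the first byte the server is still waiting for, and the `sack` block is a range it has already received beyond that point. Below is a minimal Python sketch, not part of the original troubleshooting, that pulls those numbers out and computes the gap between them. The function name `parse_sack` is my own; the input is the line quoted above.

```python
import re

# The tcpdump line quoted in the post above, unchanged.
LINE = (
    "10:56:49.811973 IP saltgum.microsoft-ds > tr2990wx-01.xxxxx.net.56853: "
    "Flags [.], ack 775249932, win 32804, "
    "options [nop,nop,sack 1 {775251392:775252852}], length 0"
)


def parse_sack(line):
    """Return (ack, sack_blocks, holes) for a tcpdump line, or None if no SACK."""
    ack_match = re.search(r"\back (\d+)", line)
    sack_match = re.search(r"sack \d+ ((?:\{\d+:\d+\})+)", line)
    if ack_match is None or sack_match is None:
        return None
    ack = int(ack_match.group(1))
    blocks = [
        (int(start), int(end))
        for start, end in re.findall(r"\{(\d+):(\d+)\}", sack_match.group(1))
    ]
    # A hole is a byte range the receiver has not seen yet, although it has
    # already received data that comes after it.
    holes = []
    edge = ack
    for start, end in sorted(blocks):
        if start > edge:
            holes.append((edge, start))
        edge = max(edge, end)
    return ack, blocks, holes


if __name__ == "__main__":
    ack, blocks, holes = parse_sack(LINE)
    for start, end in holes:
        print(f"missing bytes {start}..{end} ({end - start} bytes)")
    for start, end in blocks:
        print(f"received out of order {start}..{end} ({end - start} bytes)")
```

For this line the hole and the SACK block are each 1460 bytes, i.e. one full-size segment. That would be consistent with a single segment from the client being lost or reordered on the way to the server, but one line on its own does not prove which.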

              • 4. Re: Windows SMB client performance issues
                Steve H -Oracle

                Hi,

                Do you see TCP retransmits and/or large delta times between packets from the PC?

                Solaris also has a wireshark package for analyzing network captures, which may help find which side is causing the delay.

                # tshark -r snoop.out

                • 5. Re: Windows SMB client performance issues
                  3836784

                  Thanks Steve I will try this.

                   

                  So far I have tried all of the following without success;

                   

                  • Turn off oplocks with - sharectl set -p oplock_enable=false smb

                  • Turn off multichannel with - sharectl set -p multichannel_enable=false smb

                  • Turn off ipv6 with - sharectl set -p ipv6_enable=false smb

                  • Try SMB 2 - sharectl set -p server_maxprotocol=2 smb

                  • Try SMB 1 - sharectl set -p server_maxprotocol=1 smb

                  • Update to latest Solaris SRU

                  • Disable sync on the filesystem

                  • Disable nbmand on the filesystem

                  • Removed the log device

                  • Removed the cache device

                  • Removed the link aggregate

                  • Created a pool with a single Intel Optane device and tested with that instead of the main pool - same issue

                   

                  I'm starting to run out of ideas as to how to potentially fix this or even work around it.

                   

                  Will try WireShark and see if I can understand what is going on. Thanks for your help.

                  • 6. Re: Windows SMB client performance issues
                    3836784

                    Using Wireshark I am seeing lots of these errors from one of my test clients;

                     

                    3708257  12.076199  192.168.1.2 → 192.168.1.72 TCP 54 445 → 50003 [ACK] Seq=387295 Ack=223867368 Win=32804 Len=0

                    3708258  12.076205 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Seq=224728768 Ack=387295 Win=8212 Len=1460[Reassembly error, protocol TCP: New fragment overlaps old data (retransmission?)]

                    3708259  12.076206 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Seq=224730228 Ack=387295 Win=8212 Len=1460[Reassembly error, protocol TCP: New fragment overlaps old data (retransmission?)]

                    3708260  12.076207 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Seq=224731688 Ack=387295 Win=8212 Len=1460[Reassembly error, protocol TCP: New fragment overlaps old data (retransmission?)]

                     

                    As well as lots of these messages:

                     

                    416109   4.982806  192.168.1.2 → 192.168.1.72 TCP 54 445 → 50003 [ACK] Seq=27698 Ack=180468424 Win=32804 Len=0

                    416110   4.982807 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Seq=181274344 Ack=27698 Win=8207 Len=1460 [TCP segment of a reassembled PDU]

                    416111   4.982808 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Seq=181275804 Ack=27698 Win=8207 Len=1460 [TCP segment of a reassembled PDU]

                    416112   4.982809 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Seq=181277264 Ack=27698 Win=8207 Len=1460 [TCP segment of a reassembled PDU]
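A note on the two kinds of message above: "TCP segment of a reassembled PDU" is Wireshark's normal label for a segment that carries part of a larger upper-layer message, so it is not an error by itself. The "Reassembly error ... (retransmission?)" lines are the ones worth counting. Here is a rough Python sketch, assuming tshark text output in exactly the column layout shown in this post, that lists packets flagged as possible retransmissions and any gap between consecutive packets longer than a threshold. The `scan` function and the 1-second default are my own choices, not something from the Wireshark tooling.

```python
import re

# Matches tshark summary lines in the layout shown in the post above:
# number, relative time, src → dst, TCP, frame length, ports, flags, Seq/Ack/Win/Len.
PACKET = re.compile(
    r"^\s*(?P<num>\d+)\s+(?P<time>\d+\.\d+)\s+"
    r"(?P<src>\S+)\s+→\s+(?P<dst>\S+)\s+TCP\s+\d+\s+"
    r"\d+\s+→\s+\d+\s+\[[^\]]*\]\s+"
    r"Seq=(?P<seq>\d+)\s+Ack=(?P<ack>\d+)\s+Win=(?P<win>\d+)\s+Len=(?P<len>\d+)"
    r"(?P<note>.*)$"
)

# Lines copied from the capture excerpt in the post above.
SAMPLE = [
    "3708257  12.076199  192.168.1.2 → 192.168.1.72 TCP 54 445 → 50003 [ACK] Seq=387295 Ack=223867368 Win=32804 Len=0",
    "3708258  12.076205 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Seq=224728768 Ack=387295 Win=8212 Len=1460[Reassembly error, protocol TCP: New fragment overlaps old data (retransmission?)]",
    "3708259  12.076206 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Seq=224730228 Ack=387295 Win=8212 Len=1460[Reassembly error, protocol TCP: New fragment overlaps old data (retransmission?)]",
    "3708260  12.076207 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Seq=224731688 Ack=387295 Win=8212 Len=1460[Reassembly error, protocol TCP: New fragment overlaps old data (retransmission?)]",
]


def scan(lines, stall_seconds=1.0):
    """Return (stalls, flagged) for an iterable of tshark summary lines.

    stalls:  (previous packet number, packet number, gap in seconds) tuples
    flagged: packet numbers whose trailing note mentions a retransmission
    """
    stalls, flagged, previous = [], [], None
    for line in lines:
        match = PACKET.match(line)
        if match is None:
            continue  # not a TCP summary line in the expected layout
        number = int(match["num"])
        timestamp = float(match["time"])
        if previous is not None and timestamp - previous[1] >= stall_seconds:
            stalls.append((previous[0], number, timestamp - previous[1]))
        if "retransmission" in match["note"].lower():
            flagged.append(number)
        previous = (number, timestamp)
    return stalls, flagged


if __name__ == "__main__":
    stalls, flagged = scan(SAMPLE)
    print("possible retransmissions:", flagged)
    print("gaps of 1 s or more:", stalls)
```

To run it over a whole capture, the output of `tshark -r snoop.out` could be saved to a text file and its lines passed to `scan` in place of `SAMPLE`. Comparing where the long gaps fall against where the flagged packets cluster may show whether the stalls follow bursts of retransmissions.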

                    • 7. Re: Windows SMB client performance issues
                      3836784

                      Here you can see copy performance from two different clients when copying individually;

                       

                      SingleMachineCopy.mov

                       

                      And then here you can see what happens to the performance when a copy is started on one client, and then as soon as another copy is started on a different client, both clients stall. A copy that should take about 5 seconds takes 10 minutes instead.

                       

                      TwoMachineCopy.mov

                       

                      (the second link takes quite a while for Jumpshare to process - it is probably easier to download it to view).

                      • 8. Re: Windows SMB client performance issues
                        Scott S.

                        Although this doesn't quite relate to fixing your issue, have you tested using NFS in place of SMB? 3836784

                        • 9. Re: Windows SMB client performance issues
                          Andrew Watkins

                          Hi,

                           

                          I have just had a quick look on our system and I think you are onto something. I will look again on Monday, but I had a quick play and am seeing the same thing.

                           

                          Example:

                          Copying a single 800 MB file (ISO) from a Windows client (1 Gb desktop network) to a Solaris 11.4.6.4.0 kernel zone running SMB.

                           

                          Windows 7 to Solaris = No problem, nice and quick

                          Windows 10 to Solaris = Starts well then stops for a few seconds (10+) and then starts again. See photo.

                          I wonder whether there is a problem between Windows 10 SMB versions and Solaris SMB!

                          to use the GUI.

                           

                          Interesting.

                           

                          Andrew

                          • 10. Re: Windows SMB client performance issues
                            3836784

                            Hey Scott,

                             

                            I just set up NFS quickly to test on the Windows clients and yes, I am able to reproduce this issue over NFS as well.

                            • 11. Re: Windows SMB client performance issues
                              Andrew Watkins

                              Tristan,

                               

                              Have you put in an SR to Oracle for this one? It would be interesting to see what they say, and you may be lucky: there may already be an internal bug report.

                               

                              Andrew

                              • 12. Re: Windows SMB client performance issues
                                3836784

                                Hi Andrew,

                                 

                                Yes, I've opened an SR. Haven't heard much back yet. At the moment we're looking at downgrading to 11.3 to deal with it, as we can't seem to reproduce the issue on our old server running 11.3. Unfortunately for us, our pools are too new for 11.3, so we need to migrate all the data off and back on again.

                                • 13. Re: Windows SMB client performance issues
                                  Cindys-Oracle

                                  Don't have a good suggestion about what is going on here but a few comments:

                                  • If the sluggish performance is reproduced over both SMB and NFS then this points to something outside of these 2 protocols.
                                  • A RAIDZ2 configuration is not a good performance match for an SMB workload. RAIDZ2 is great for large I/O workloads, like streaming video or RMAN backups. A mirrored configuration is really best for SMB workloads, which are usually characterized by smallish I/O. This is just a general comment, and I believe it is unrelated to this particular performance issue, since changing the pool configuration doesn't resolve the problem.
                                  • I would focus on the network retransmission errors. However, I have no idea why this is happening on S11.4 and not S11.3 IF your network configuration is identical, but maybe S11.4 is exposing something different. It is also unclear whether this is related to switch problems, mismatched MTU sizes, MSS, or disabling ECN, but my guess is that the network needs investigation.

                                  Thanks, Cindy
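On the MTU/MSS point, the Wireshark excerpt earlier in the thread already gives a quick sanity check: the data packets show `Len=1460` in 1514-byte frames. Here is a small Python sketch of that arithmetic, assuming IPv4 and TCP headers with no options (20 bytes each) and a 14-byte Ethernet header. The function names are my own.

```python
ETHERNET_HEADER = 14  # bytes, untagged Ethernet II, without the FCS
IPV4_HEADER = 20      # bytes, assuming no IP options
TCP_HEADER = 20       # bytes, assuming no TCP options


def mss_for_mtu(mtu):
    """Largest TCP payload per packet for a given IP MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def frame_length(tcp_payload):
    """Captured frame size for a given TCP payload."""
    return tcp_payload + TCP_HEADER + IPV4_HEADER + ETHERNET_HEADER


if __name__ == "__main__":
    print("MSS at MTU 1500:", mss_for_mtu(1500))
    print("frame length for a 1460-byte payload:", frame_length(1460))
    print("MSS at MTU 9000:", mss_for_mtu(9000))
```

Under those assumptions, the capture is consistent with a standard 1500-byte MTU on that client's path rather than jumbo frames. It does not rule out a mismatch elsewhere, so checking the MTU configured on the server interfaces and the switch ports is still worthwhile.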

                                  • 14. Re: Windows SMB client performance issues
                                    Andrew Watkins

                                    Hello,

                                     

                                    I tried to compare Windows 7 and Windows 10 by removing the network part.

                                     

                                    I installed VirtualBox and then installed new Windows 7 and Windows 10 VMs on the system to see how they compare.

                                     

                                    I used robocopy to copy 3 ISOs (770 MB, 4.1 GB, 2.5 GB) to a Solaris 11.4 server and to a Windows Server. Note that the desktop is on a 1 Gb network.

                                     

                                    Windows 7:
                                        (to Solaris)  2297 MB/min, Time = 3.08 mins
                                        (to Windows)  1622 MB/min, Time = 4.27 mins

                                    Windows 10: Pauses many times, and hitting the Enter key starts the robocopy again! So the speed depends on whether I keep hitting Enter or just leave it!
                                        (to Solaris)  1011 MB/min, Time = 7.39 mins
                                        (to Windows)  3026 MB/min, Time = 2.22 mins

                                     

                                    Hope it may give some ideas.

                                    Note:

                                    I changed the robocopy command so that it writes to a log file, just in case it is some user I/O problem:

                                    C:\> robocopy  C:\temp  H:\temp  *.iso  /R:1  /W:1 /LOG:C:\temp\log.txt
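To put the robocopy figures above on the same scale as the MB/s numbers earlier in the thread, here is a rough Python conversion. The rates are the ones reported in this post. The 125 MB/s ceiling is my own assumption for a 1 Gb/s link (1000 Mb/s divided by 8, ignoring protocol overhead), and robocopy's "MB" may be binary megabytes, so treat the percentages as approximate.

```python
# MB/min figures reported by robocopy in the post above.
RESULTS_MB_PER_MIN = {
    ("Windows 7", "Solaris"): 2297,
    ("Windows 7", "Windows"): 1622,
    ("Windows 10", "Solaris"): 1011,
    ("Windows 10", "Windows"): 3026,
}

# Assumption: 1 Gb/s link, 8 bits per byte, no protocol overhead.
LINK_MB_PER_S = 1000 / 8


def to_mb_per_s(mb_per_min):
    """Convert a robocopy MB/min rate to MB/s."""
    return mb_per_min / 60


def summarize(results=RESULTS_MB_PER_MIN, link_mb_per_s=LINK_MB_PER_S):
    """Return (client, target, MB/s, percent of link) rows."""
    rows = []
    for (client, target), rate in results.items():
        per_second = to_mb_per_s(rate)
        rows.append((client, target, per_second, 100 * per_second / link_mb_per_s))
    return rows


if __name__ == "__main__":
    for client, target, per_second, percent in summarize():
        print(f"{client:10s} -> {target:8s} {per_second:5.1f} MB/s "
              f"(about {percent:.0f}% of a 1 Gb/s link)")
```

None of the four runs comes close to saturating the link, so the interesting part is the ratio: Windows 10 is the faster client against the Windows server but the slower one against Solaris, which matches the pauses described above.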
