Windows SMB client performance issues

3836784 Member Posts: 21
edited Oct 17, 2019 2:07AM in Solaris 11

Hi all,

We recently set up a Solaris 11.4 server based on the following hardware:

SuperMicro SSG-6049P-E1CR45L

2x XEON Silver 4114 10 Core 2.2GHz

(LSI 3008 IT mode)

512GB DDR4-2666 ECC RAM

2 x Intel S4600 240GB

2 x Intel Optane 905P 480GB U.2

24 x HGST 10TB SAS

2 x Intel X540T2 10GbE NIC

Our main storage pool is set up like so:

4 groups of 6 physical 10TB drives in RAID-Z2

1 x Intel Optane 905P 480GB U.2 - cache

1 x Intel Optane 905P 480GB U.2 - log

lz4 compression is enabled on the shared filesystem.

Sharing is via the Solaris CIFS service, to about 32 clients in total. Almost all of the clients read from or write to the server only in short bursts, when they save files or do renders; the files are usually between 500MB and 4GB in size.

The problem we are having is that Windows SMB clients have erratic performance. When it works, the Windows clients copy files at up to 1GB/s over 10GbE, usually hovering around 500-700MB/s. However, at frequent and random times the copy completely stalls and drops to 0 bytes/s. It often remains stalled for a few minutes and then resumes, ramping up to the previous speed. Sometimes it will stall again.


These stalls only seem to happen on Windows SMB clients and not macOS clients.

The stalls only seem to happen when writing to the server, not reading from it.

Clients are running Windows 10 1709. Solaris version is 11.4.5.3.0.

I can't see anything happening on the server that corresponds to these stalls or would seem to cause them.

Any help on working this out would be greatly appreciated.

Tristan

Answers

  • Steve H -Oracle
    Steve H -Oracle Member Posts: 74
    edited Feb 27, 2019 9:21AM

    Hi,

    If "stalls only seem to happen on Windows SMB clients and not macOS clients", then you may want to run snoop or tcpdump to see what is happening on the network during that time, to help narrow down where the issue is.

    # snoop -q -d <iface> -o snoop.out <IP-addr-of-PC>

    After the packet capture is collected, you can check it with:

    # snoop -r -i snoop.out

  • 3836784
    3836784 Member Posts: 21
    edited Feb 27, 2019 6:35PM

    Hi Steve,

    Thanks for the suggestion. I ran the snoop command as suggested, and in the snoop.out file there are a ton of these 'Type=Unknown' messages:

    548   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

    549   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

    550   0.00001  192.168.1.2 -> 192.168.1.131 SMB R port=55748

    551   0.00005 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

    552   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

    553   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

    554   0.00002 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

    555   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

    556   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

    557   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

    558   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

    559   0.00001  192.168.1.2 -> 192.168.1.131 SMB R port=55748

    560   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

    561   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

    562   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

    563   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

    564   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

    565   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

    566   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

    567   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

    568   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

    569   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

    570   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

    571   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456

    572   0.00000 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456

    Could this be related to the issue we are seeing?
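
To put a number on "a ton", the snoop summary lines can be tallied by NBT type. The following is a minimal Python sketch, assuming the output of `snoop -r -i snoop.out` has been saved as text; the function name `tally_nbt_types` and the regular expression are illustrative and not part of any Solaris tool. A plausible explanation for the `Type=Unknown` lines (an assumption, not confirmed in this thread) is that snoop tries to decode every TCP segment on port 445 as the start of an NBT message, so the continuation segments of a large SMB write would show up this way without indicating a fault.

```python
import re
from collections import Counter

# Matches the "NBT Type=<name> Length=<n>" part of a snoop summary line.
NBT_RE = re.compile(r"NBT Type=(.+?) Length=(\d+)")

def tally_nbt_types(lines):
    """Count snoop summary lines per NBT type; non-NBT lines are ignored."""
    counts = Counter()
    for line in lines:
        match = NBT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# A few lines copied from the capture above.
sample = [
    "548   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=SESSION MESSAGE Length=1456",
    "549   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456",
    "550   0.00001  192.168.1.2 -> 192.168.1.131 SMB R port=55748",
    "552   0.00001 192.168.1.131 -> 192.168.1.2  NBT Type=Unknown Length=1456",
]
counts = tally_nbt_types(sample)
for nbt_type, count in counts.most_common():
    print(f"{nbt_type}: {count}")
```

Comparing the ratio of the two types between a clean copy and a stalled copy would show whether these lines track the stalls at all.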

  • 3836784
    3836784 Member Posts: 21
    edited Feb 27, 2019 7:12PM

    I also used tcpdump and compared a problem-free copy to the server with a stalled copy. The main difference was the presence of several lines like this in the stalled copy:

    10:56:49.811973 IP saltgum.microsoft-ds > tr2990wx-01.xxxxx.net.56853: Flags [.], ack 775249932, win 32804, options [nop,nop,sack 1 {775251392:775252852}], length 0
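
The `sack` option in that line carries useful arithmetic: the cumulative `ack` is the next byte the server expects, and each `{left:right}` block is data it has already received beyond that point. The gap between them is what the server is still waiting for. Below is a minimal Python sketch of that calculation; the helper name `sack_holes` is hypothetical, and the parsing assumes tcpdump's default text format as shown above.

```python
import re

ACK_RE = re.compile(r"\back (\d+)")        # cumulative ACK (not the "sack" keyword)
SACK_RE = re.compile(r"\{(\d+):(\d+)\}")   # one {left:right} SACK block

def sack_holes(line):
    """Return (start, end) byte ranges the receiver is still missing.

    Each hole lies between the cumulative ACK (or the end of the previous
    SACK block) and the start of the next SACK block.
    """
    ack = int(ACK_RE.search(line).group(1))
    blocks = sorted((int(a), int(b)) for a, b in SACK_RE.findall(line))
    holes = []
    expected = ack
    for left, right in blocks:
        if left > expected:
            holes.append((expected, left))
        expected = max(expected, right)
    return holes

line = ("10:56:49.811973 IP saltgum.microsoft-ds > tr2990wx-01.xxxxx.net.56853: "
        "Flags [.], ack 775249932, win 32804, "
        "options [nop,nop,sack 1 {775251392:775252852}], length 0")
holes = sack_holes(line)
for start, end in holes:
    print(f"missing bytes {start}..{end} ({end - start} bytes)")
```

For the line above this reports a single 1460-byte hole, which is one full-size segment at a 1500-byte MTU. In other words, the server is telling the client that one segment of the write never arrived (or arrived out of order), which is consistent with loss on the client-to-server path rather than a problem inside the pool.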

  • Steve H -Oracle
    Steve H -Oracle Member Posts: 74
    edited Feb 28, 2019 12:14PM

    Hi,

    Do you see TCP retransmits and/or large delta times between packets from the PC?

    Solaris also has a Wireshark package for analyzing network captures, which may help find which side is causing the delay.

    # tshark -r snoop.out
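
One way to look for "large delta times" is to scan the tshark summary output for big jumps in the time column. This is a minimal Python sketch, assuming the default tshark summary layout (frame number in the first column, time in seconds in the second); the function name `find_gaps`, the one-second threshold, and the sample lines are illustrative only, and the sample timestamps are synthetic rather than taken from a real capture.

```python
def find_gaps(lines, threshold=1.0):
    """Return (frame_no, gap_seconds) for frames that arrive at least
    `threshold` seconds after the previous frame."""
    gaps = []
    prev_time = None
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            frame_no, timestamp = int(fields[0]), float(fields[1])
        except ValueError:
            continue  # skip lines that are not packet summaries
        if prev_time is not None and timestamp - prev_time >= threshold:
            gaps.append((frame_no, round(timestamp - prev_time, 6)))
        prev_time = timestamp
    return gaps

# Synthetic example lines in tshark's summary layout.
sample = [
    "1   0.000000 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Len=1460",
    "2   0.000010 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Len=1460",
    "3  95.500000 192.168.1.2 → 192.168.1.72 TCP 54 445 → 50003 [ACK] Len=0",
]
gaps = find_gaps(sample)
for frame_no, gap in gaps:
    print(f"frame {frame_no}: {gap:.3f}s since previous frame")
```

The source address of the first frame after each gap indicates which side resumed the conversation, which helps narrow down who was waiting on whom.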

  • 3836784
    3836784 Member Posts: 21
    edited Mar 1, 2019 9:04PM

    Thanks Steve, I will try this.

    So far I have tried all of the following without success:

    • Turn off oplocks with - sharectl set -p oplock_enable=false smb

    • Turn off multichannel with - sharectl set -p multichannel_enable=false smb

    • Turn off ipv6 with - sharectl set -p ipv6_enable=false smb

    • Try SMB 2 - sharectl set -p server_maxprotocol=2 smb

    • Try SMB 1 - sharectl set -p server_maxprotocol=1 smb

    • Update to latest Solaris SRU

    • Disable sync on the filesystem

    • Disable nbmand on the filesystem

    • Removed the log device

    • Removed the cache device

    • Removed the link aggregate

    • Created a pool with a single Intel Optane device and tested with that instead of the main pool - same issue

    I'm starting to run out of ideas as to how to potentially fix this or even work around it.

    I will try Wireshark and see if I can understand what is going on. Thanks for your help.

  • 3836784
    3836784 Member Posts: 21
    edited Mar 2, 2019 2:06AM

    Using Wireshark, I am seeing lots of these errors from one of my test clients:

    3708257  12.076199  192.168.1.2 → 192.168.1.72 TCP 54 445 → 50003 [ACK] Seq=387295 Ack=223867368 Win=32804 Len=0

    3708258  12.076205 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Seq=224728768 Ack=387295 Win=8212 Len=1460[Reassembly error, protocol TCP: New fragment overlaps old data (retransmission?)]

    3708259  12.076206 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Seq=224730228 Ack=387295 Win=8212 Len=1460[Reassembly error, protocol TCP: New fragment overlaps old data (retransmission?)]

    3708260  12.076207 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Seq=224731688 Ack=387295 Win=8212 Len=1460[Reassembly error, protocol TCP: New fragment overlaps old data (retransmission?)]

    As well as lots of these messages:

    416109   4.982806  192.168.1.2 → 192.168.1.72 TCP 54 445 → 50003 [ACK] Seq=27698 Ack=180468424 Win=32804 Len=0

    416110   4.982807 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Seq=181274344 Ack=27698 Win=8207 Len=1460 [TCP segment of a reassembled PDU]

    416111   4.982808 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Seq=181275804 Ack=27698 Win=8207 Len=1460 [TCP segment of a reassembled PDU]

    416112   4.982809 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Seq=181277264 Ack=27698 Win=8207 Len=1460 [TCP segment of a reassembled PDU]
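
The sequence numbers in these excerpts can be checked by hand or with a few lines of Python. The sketch below, with a hypothetical helper `tcp_fields`, pulls `Seq`, `Ack` and `Len` out of the first excerpt and answers two questions: are the three flagged client segments contiguous, and how far ahead of the server's last cumulative ACK is the client sending? It assumes both directions are displayed in the same sequence-number space, as Wireshark normally does.

```python
import re

FIELD_RE = re.compile(r"\b(Seq|Ack|Len)=(\d+)")

def tcp_fields(line):
    """Extract Seq, Ack and Len from one tshark summary line as integers."""
    return {name: int(value) for name, value in FIELD_RE.findall(line)}

# Frames 3708257-3708260 from the first excerpt above.
lines = [
    "3708257  12.076199  192.168.1.2 → 192.168.1.72 TCP 54 445 → 50003 [ACK] Seq=387295 Ack=223867368 Win=32804 Len=0",
    "3708258  12.076205 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Seq=224728768 Ack=387295 Win=8212 Len=1460",
    "3708259  12.076206 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Seq=224730228 Ack=387295 Win=8212 Len=1460",
    "3708260  12.076207 192.168.1.72 → 192.168.1.2  TCP 1514 50003 → 445 [ACK] Seq=224731688 Ack=387295 Win=8212 Len=1460",
]

server_ack = tcp_fields(lines[0])["Ack"]        # next byte the server expects
data = [tcp_fields(line) for line in lines[1:]]  # the client's data segments

# Each segment should start where the previous one ended.
contiguous = all(a["Seq"] + a["Len"] == b["Seq"] for a, b in zip(data, data[1:]))

# Distance between the server's cumulative ACK and the data being sent now.
ahead = data[0]["Seq"] - server_ack

print(f"contiguous: {contiguous}")
print(f"client is {ahead} bytes ahead of the server's ACK")
```

Within this excerpt the three client segments follow one another without overlap, and the client is sending 861400 bytes beyond the point the server has cumulatively acknowledged. That fits a picture in which the server is still waiting for an earlier segment, though three frames are not enough to confirm it.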

  • 3836784
    3836784 Member Posts: 21
    edited Mar 2, 2019 2:07AM

    Here you can see copy performance from two different clients when copying individually:

    SingleMachineCopy.mov

    And then here you can see what happens to the performance when a copy is started on one client, and then as soon as another copy is started on a different client, both clients stall. A copy that should take about 5 seconds takes 10 minutes instead.

    TwoMachineCopy.mov

    (the second link takes quite a while for Jumpshare to process - it is probably easier to download it to view).

  • Scott S.
    Scott S. Member Posts: 89 Red Ribbon
    edited Mar 2, 2019 1:39PM

    Although this doesn't quite relate to fixing your issue, have you tested using NFS in place of SMB? @3836784

  • Andrew Watkins
    Andrew Watkins Member Posts: 189 Bronze Badge
    edited Mar 3, 2019 8:12AM

    Hi,

    I have just had a quick look on our system and I think you are onto something. I will look again on Monday, but I had a quick play and am seeing something similar.

    Example:

    Copying a single 800MB file (ISO) from a Windows client (1Gb desktop network) to a Solaris 11.4.6.4.0 kernel zone running SMB.

    Windows 7 to Solaris = No problem, nice and quick

    Windows 10 to Solaris = Starts well then stops for a few seconds (10+) and then starts again. See photo.

    pastedImage_0.png

    I wonder if there is a problem between Windows 10 SMB versions and Solaris SMB!

    to use the GUI.

    Interesting.

    Andrew

  • 3836784
    3836784 Member Posts: 21
    edited Mar 4, 2019 9:13PM

    Hey Scott,

    I just set up NFS quickly to test on the Windows clients and yes, I am able to reproduce this issue over NFS as well.