7 Replies. Latest reply: Jun 5, 2013 10:22 AM by 878011

Solaris 10 LDAP Clients Intermittently Fail

933733 Newbie
I'm working on a rather puzzling issue with some of our Solaris 10 systems authenticating against DSEE 6.3. These clients previously worked without issue, but starting last week SSH connections would hang for a few minutes and then start working again. This never happened on more than one system at a time.

I found the following messages in /var/adm/messages during the time we have these problems:

Apr 27 08:04:57 hostname nscd[20634]: [ID 293258 user.warning] libsldap: Status: 7 Mesg: LDAP ERROR (85): Timed out.
Apr 27 08:05:47 hostname nscd[20634]: [ID 293258 user.warning] libsldap: Status: 7 Mesg: LDAP ERROR (85): Timed out.
... many of these
Apr 27 08:10:07 hostname nscd[20634]: [ID 293258 user.warning] libsldap: Status: 7 Mesg: LDAP ERROR (85): Timed out.
Apr 27 08:10:17 hostname nscd[20634]: [ID 293258 user.warning] libsldap: Status: 7 Mesg: LDAP ERROR (85): Timed out.
Apr 27 08:10:31 hostname nscd[20634]: [ID 293258 user.warning] libsldap: Status: 7 Mesg: LDAP ERROR (81): Can't contact LDAP server.
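
To put a number on how long each hang lasts, the libsldap warnings can be parsed into timestamps and error codes. The following is only a rough sketch: the function names are made up for illustration, the year is a placeholder because syslog lines do not carry one, and pairing the first ERROR (85) with the next ERROR (81) is an assumption about how one incident looks in the log.

```python
import re
from datetime import datetime

# Matches the libsldap warnings quoted above, for example:
# "Apr 27 08:04:57 hostname nscd[20634]: ... LDAP ERROR (85): Timed out."
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"(?P<proc>\w+)\[\d+\]:.*LDAP ERROR \((?P<code>\d+)\)"
)


def parse_events(lines, year=2000):
    """Return (timestamp, process, ldap_error_code) for each matching line.

    `year` is a placeholder: syslog omits it, and only time differences
    are used below.
    """
    events = []
    for line in lines:
        m = LINE_RE.search(line.strip())
        if m is None:
            continue  # not a libsldap "LDAP ERROR (n)" line
        when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
        events.append((when, m["proc"], int(m["code"])))
    return events


def outage_windows(events):
    """Pair the first error 85 (timed out) with the next error 81.

    Assumption: an incident starts with a run of 85s and ends with one 81.
    """
    windows = []
    start = None
    for when, _proc, code in events:
        if code == 85 and start is None:
            start = when
        elif code == 81 and start is not None:
            windows.append((start, when))
            start = None
    return windows
```

Fed the first and last lines of the excerpt above, this reports a single window of a little over five and a half minutes.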

To test connectivity to the LDAP server, I have an ldapsearch running every 15 seconds, logging the time it took and checking for correct results. During the time that I see the libsldap messages and SSH connections are hanging, the ldapsearch command continues to run fine without slowing down.
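
A probe of this kind might look roughly like the sketch below. The helper names are hypothetical, the actual ldapsearch arguments are not shown in this post and so are left to the caller, and "correct results" is approximated here as "exit status 0 and some output". Note that every run opens a fresh connection, so a probe like this only exercises new connections, not one that a daemon has held open for a long time.

```python
import subprocess
import time


def timed_run(cmd, timeout_s=10.0):
    """Run `cmd` once and return (elapsed_seconds, succeeded)."""
    started = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
        # Assumption: success means exit status 0 plus non-empty output.
        ok = result.returncode == 0 and bool(result.stdout.strip())
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the command could not be started at all.
        ok = False
    return time.monotonic() - started, ok


def probe(cmd, count, interval_s=15.0):
    """Run `cmd` `count` times, `interval_s` apart, logging each result."""
    results = []
    for i in range(count):
        elapsed, ok = timed_run(cmd)
        print(f"{time.strftime('%b %d %H:%M:%S')} ok={ok} elapsed={elapsed:.3f}s")
        results.append((elapsed, ok))
        if i + 1 < count:
            time.sleep(interval_s)
    return results
```

Here `cmd` would be the ldapsearch invocation as an argument list, for example `probe(my_ldapsearch_args, count=240)` for an hour of samples.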

A final note is that all three of the problem systems are on the same subnet and systems outside of this subnet aren't having any problems with the same configuration. My first thought was the firewall but ldapsearch continues to work.

Does anyone know whether nscd tries to keep the LDAP connection open? Looking at the logged messages, it appears to give up after 5 minutes or so, throw the LDAP ERROR (81), and then start working again.

Any ideas would be appreciated. This one is making me crazy (crazier).

Thanks.
  • 1. Re: Solaris 10 LDAP Clients Intermittently Fail
    933733 Newbie
    I just saw this in the logs as well:


    Apr 27 09:45:15 hostname ldap_cachemgr[26286]: [ID 293258 daemon.warning] libsldap: Status: 81 Mesg: openConnection: simple bind failed - Can't contact LDAP server
    Apr 27 09:45:15 hostname ldap_cachemgr[26286]: [ID 545954 daemon.error] libsldap: makeConnection: failed to open connection to ldap3.xxxxx.net:636
    Apr 27 09:45:15 hostname ldap_cachemgr[26286]: [ID 687686 daemon.warning] libsldap: Falling back to anonymous, non-SSL mode for __ns_ldap_getRootDSE. openConnection: simple bind failed - Can't contact LDAP server
    Apr 27 09:45:25 hostname ldap_cachemgr[26286]: [ID 293258 daemon.error] libsldap: Status: 1 Mesg: Timed out
  • 2. Re: Solaris 10 LDAP Clients Intermittently Fail
    rukbat Guru Moderator
    user6916976 wrote:
    These clients previously worked without issue but starting last week ...
    ... snip ...
    Has anything changed in that time frame?
    Any physical changes such as office-moves? new hires? lay-offs?

    Could there have been any modifications to the networking hardware such as lengthening the cabling? Is it possible to re-route the subnet to different switches or to different ports on the switches? You might consider snooping the traffic to watch how it traverses the paths to the LDAP server.

    If there are other systems on the subnet, do they experience any sort of timeouts ( even if it is to unrelated tasks such as database access or surfing to the Intranet/Internet ) ?


    ... just random thoughts from a hardware perspective.
  • 3. Re: Solaris 10 LDAP Clients Intermittently Fail
    933733 Newbie
    rukbat wrote:

    Has anything changed in that time frame?
    Any physical changes such as office-moves? new hires? lay-offs?

    Could there have been any modifications to the networking hardware such as lengthening the cabling? Is it possible to re-route the subnet to different switches or to different ports on the switches? You might consider snooping the traffic to watch how it traverses the paths to the LDAP server.

    If there are other systems on the subnet, do they experience any sort of timeouts ( even if it is to unrelated tasks such as database access or surfing to the Intranet/Internet ) ?


    ... just random thoughts from a hardware perspective.

    Given that this started after a maintenance night, I'm sure you are correct and something changed. However, there are no changes in the maintenance plan that could cause this, and nobody will own up to any additional changes. This leaves it to me to find what is causing the failure so I can get it corrected.

    These are the only three Unix systems on that subnet and they are all experiencing the problem, so I don't have a working system to compare them to, other than the systems that aren't on that subnet. Those other systems are working fine with the same configuration. That's why I'm thinking that it is something external to the problem systems.

    Given that all other services on these systems are working, I'm not currently exploring a hardware type failure.

    I've been running pfiles on nscd, and it appears that it is indeed holding a connection to the LDAP server open (if I'm reading it correctly): the inode associated with file descriptor 8 hasn't changed. So my current theory is that the firewall may be killing off long-lived connections after a while. This appears to be consistent with the log entries, where I get many ERROR (85) messages and then a final ERROR (81). I'm thinking that after the ERROR (81), nscd re-opens the connection. Just guesses, though.

    8: S_IFSOCK mode:0666 dev:329,0 ino:3753 uid:0 gid:0 size:0
    O_RDWR|O_NONBLOCK
    SOCK_STREAM
    SO_SNDBUF(49152),SO_RCVBUF(49680),IP_NEXTHOP(0.0.194.16)
    sockname: AF_INET6 ::ffff:10.1.50.50 port: 42758
    peername: AF_INET6 ::ffff:10.1.52.25 port: 636
  • 4. Re: Solaris 10 LDAP Clients Intermittently Fail
    rukbat Guru Moderator
    I'm sure you'll get it figured out.
    Perplexing problems can often turn out to be "fun" problems.
    They make us think out of the box.
  • 5. Re: Solaris 10 LDAP Clients Intermittently Fail
    933733 Newbie
    A little more information on this. I have been running pfiles and netstat on the nscd process and one of the events just happened.

    pfiles on nscd shows that it keeps the connection to the LDAP server open. Once it starts to have problems, I see the ERROR (85) messages. However, as soon as I see the ERROR (81), the inode that pfiles shows changes and things start to work again. This suggests that the connection nscd keeps open to the LDAP server is being closed (firewall?) and that nscd takes 5 minutes or so to give up and re-open it.

    Time to go chat with the networking folks.


    8: S_IFSOCK mode:0666 dev:333,0 ino:64280 uid:0 gid:0 size:0
    8: S_IFSOCK mode:0666 dev:333,0 ino:64280 uid:0 gid:0 size:0
    8: S_IFSOCK mode:0666 dev:333,0 ino:*24013* uid:0 gid:0 size:0
    8: S_IFSOCK mode:0666 dev:333,0 ino:24013 uid:0 gid:0 size:0
    8: S_IFSOCK mode:0666 dev:333,0 ino:24013 uid:0 gid:0 size:0
    8: S_IFSOCK mode:0666 dev:333,0 ino:24013 uid:0 gid:0 size:0
    8: S_IFSOCK mode:0666 dev:333,0 ino:24013 uid:0 gid:0 size:0
    8: S_IFSOCK mode:0666 dev:333,0 ino:24013 uid:0 gid:0 size:0
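
Spotting the moment the socket is replaced can be automated by comparing the inode reported for a given file descriptor across successive pfiles samples. This is a small sketch with hypothetical helper names; it assumes each sample is one `S_IFSOCK` line in the format shown above, and it tolerates the asterisks used for emphasis in this post.

```python
import re

# Matches lines such as:
# "8: S_IFSOCK mode:0666 dev:333,0 ino:64280 uid:0 gid:0 size:0"
INO_RE = re.compile(
    r"^\s*(?P<fd>\d+):\s+S_IFSOCK\b.*\bino:\*?(?P<ino>\d+)\*?"
)


def socket_inode(pfiles_line):
    """Return (fd, inode) for an S_IFSOCK line, or None if it is not one."""
    m = INO_RE.match(pfiles_line)
    if m is None:
        return None
    return int(m["fd"]), int(m["ino"])


def inode_changes(samples):
    """Return (sample_index, fd, old_inode, new_inode) for every change.

    A change means the descriptor now refers to a different socket,
    i.e. the earlier connection was closed and a new one was opened.
    """
    last_seen = {}
    changes = []
    for index, line in enumerate(samples):
        parsed = socket_inode(line)
        if parsed is None:
            continue
        fd, ino = parsed
        if fd in last_seen and last_seen[fd] != ino:
            changes.append((index, fd, last_seen[fd], ino))
        last_seen[fd] = ino
    return changes
```

Logging a timestamp alongside each sample would let the change be lined up against the ERROR (81) entry in /var/adm/messages.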
  • 6. Re: Solaris 10 LDAP Clients Intermittently Fail
    933733 Newbie
    This one turned out to be a bit obvious once I finally figured it out.

    A new application was making a LOT of connections to the front end of a load balancer that was using round-robin to connect to the two back-end LDAP servers. Each new connection from the application to the load balancer used the next outgoing port. When that port matched the outgoing port NSCD was already using, and the load balancer sent the connection to the same LDAP server that NSCD was connected to directly, the source IP:port and destination IP:port matched for both connections. This caused the NSCD connection to be closed.
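
The collision can be illustrated with a toy model of the connection 4-tuple (source IP, source port, destination IP, destination port). This is only a sketch under several assumptions that are not confirmed above: that the load balancer forwards connections without rewriting the client's source IP and port, that outgoing ports are handed out sequentially within 32768-65535, and that the second back-end address (10.1.52.26) is a made-up placeholder. The function name is hypothetical too.

```python
from itertools import cycle


def first_collision(nscd_tuple, client_ip, start_port, backends,
                    dst_port=636, max_conns=100_000,
                    port_range=(32768, 65535)):
    """Return (connection_number, tuple) for the first application
    connection that the back end sees with the same 4-tuple as nscd's
    direct connection, or None if none occurs within max_conns.

    Assumptions: the balancer keeps the client's source IP and port,
    picks back ends round-robin, and ports increase by one per connection.
    """
    round_robin = cycle(backends)
    low, high = port_range
    port = start_port
    for n in range(max_conns):
        backend = next(round_robin)
        # What the back-end LDAP server sees for this connection.
        seen_by_backend = (client_ip, port, backend, dst_port)
        if seen_by_backend == nscd_tuple:
            return n, seen_by_backend
        # Next outgoing port, wrapping at the top of the range.
        port = low if port >= high else port + 1
    return None


# nscd's direct connection, taken from the pfiles output earlier in the thread.
nscd = ("10.1.50.50", 42758, "10.1.52.25", 636)
hit = first_collision(
    nscd,
    client_ip="10.1.50.50",
    start_port=42750,                       # placeholder starting port
    backends=["10.1.52.25", "10.1.52.26"],  # second address is hypothetical
)
```

In this model a collision needs both conditions at once: the application's port reaches the one nscd holds, and the round-robin choice lands on the same back end. That fits the problem showing up only occasionally.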

    The obvious solution would be to switch the LDAP client to use the load balancer. However, it appears that the client is quite picky about SSL certificates and doesn't like getting ldap1.domain.com when it connected to the load balancer at ldap.domain.com. My solution is to restart NSCD every 12 hours, which causes it to re-open the connection using the next available outgoing port. This prevents the application's outgoing port from ever reaching the port that NSCD is using.

    Hope all that made sense.

    - Jon
  • 7. Re: Solaris 10 LDAP Clients Intermittently Fail
    878011 Newbie
    Hiya Jon,

    Great work with this one.
    Just wondering: did you notice any ns-slapd leakage on the DS server while this
    issue was happening on the clients?

    I am in a similar boat: the clients' /var/adm/messages shows this timeout issue,
    and on the DS server I see non-SSL connections. There is also a funny issue with ns-slapd
    consuming loads of space. Although this is a replicated environment,
    the issue is only visible on LDAPServer1. The multi-master replication peers are on the same hardware, configuration, Solaris level and DS level.

    Would you be able to confirm this observation?
    Thanks a lot.
