During the same 3-minute time span during the night, my Solaris and Linux clients lose connectivity to my Sun Directory Server 6.3 LDAP server. I've checked the crons and /var/adm/messages and I don't see anything on the LDAP server that would cause this issue. I am seeing the messages:
Nov 12 01:28:53 client1 ldap_cachemgr: [ID 293258 daemon.error] libsldap: Status: 1 Mesg: Can't connect to the LDAP server
Nov 12 01:30:23 client1 ldap_cachemgr: [ID 293258 daemon.warning] libsldap: Status: 81 Mesg: openConnection: simple bind failed - Can't contact LDAP server
Nov 12 01:30:23 client1 ldap_cachemgr: [ID 545954 daemon.error] libsldap: makeConnection: failed to open connection to ldapsvr1
Nov 12 01:30:23 client1 ldap_cachemgr: [ID 687686 daemon.warning] libsldap: Falling back to anonymous, non-SSL mode for __ns_ldap_getRootDSE. openConnection: simple bind failed - Can't contact LDAP server
Can someone point me in the right directory as to what could be causing this?
During the same time interval, are you able to run an 'ldapsearch' from your client machine to the LDAP server? (The machines could be up and running, but there could be a 'transient' network issue, like a router/switch reboot.)
I haven't tried that. I have tried performing ping tests during this same time frame and we didn't lose any network packets. Would an ldapsearch reveal different results than would be seen during a ping test?
The PING command is just an ICMP echo request, so it will just check the presence of the machine on the network.
The ldapsearch command will instead also test if the 'target port' is reachable at that moment in time.
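To illustrate the difference between a ping and a port-level check, here is a minimal Python sketch that only tests whether a TCP connection to the LDAP port can be opened. The function name `port_reachable` is made up for this example, and port 389 is assumed to be the listener port (the standard LDAP port; your instance may use another). Note that a successful TCP connect still says less than a real ldapsearch, which also proves that the server answers LDAP requests.

```python
import socket


def port_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Unlike ping (ICMP echo), this fails when the host is up but nothing
    accepts connections on the port, or when the connect attempt times out.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


# Example call (hostname taken from the client log above, port assumed):
#   port_reachable("ldapsvr1", 389)
```

If this returns False while ping still succeeds, the problem is above the IP layer (listener not accepting, firewall, or a saturated server) rather than plain host reachability.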
To have a better understanding, you should/could schedule the same ldapsearch command on both ldap client and ldap server machines at the same time.
If the search on the server succeeds and the search on the client fails, then it could be a network issue... or maybe the server machine is just 'too busy' to answer your requests. What is the CPU usage pattern on the LDAP server around that time?
I ran the ldapsearch commands on the client and server last night. I found that the server fails at the specified time, and at about that exact time plus a 3 minute additional timespan, the LDAP clients' ldapsearches time out. I think this pretty much eliminates the network since the search on the LDAP server itself fails. I haven't checked the CPU usage during this time. I'll look into that to determine if this could be the cause of it. However, I also noticed in my error logs on the LDAP server, that I am getting a bunch of errors like:
WARNING<12364> - Connection - conn=-1 op=-1 msgId=-1 - Configuration warning Cannot disable TCP/IP nagle algorithm: error -5962 (The value requested is too large to be stored in the data buffer provided.). Check your system
This also occurs at other times of the day, but they are way more sporadic. I've opened a ticket with Oracle about this warning previously and was told that it's pretty much a "normal warning" and shouldn't cause any issues with my directory server.
Ok, I'd say you are going to want to collect some data when this problem is happening. If you were going to make it from scratch, it would be something along the lines of:
1. Make a script that can detect when the server becomes unresponsive.
2. When the script in (1) detects an unresponsive server condition, kick off a data collection script.
3. Data collection script takes pstacks, prstat -L, maybe gcore, etc.
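The three steps above could be sketched roughly as follows in Python. This is only an outline under assumptions: `run_probe` and `watch` are hypothetical names, the probe command is whatever ldapsearch invocation already works for you, and the collection step is left as a callable you supply (for example one that runs pstack and prstat -L against the directory server process).

```python
import subprocess
import time


def run_probe(cmd, timeout=10):
    """Return True if `cmd` exits with status 0 within `timeout` seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        return result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        # A hung or missing command counts as a failed probe.
        return False


def watch(probe, collect, interval=20, max_checks=None, sleep=time.sleep):
    """Call probe() every `interval` seconds; on the first failure call
    collect() once and stop. Returns True if collection was triggered."""
    checks = 0
    while max_checks is None or checks < max_checks:
        if not probe():
            collect()
            return True
        checks += 1
        sleep(interval)
    return False


# Hypothetical wiring; the command lists are placeholders to fill in:
#   probe_cmd = ["ldapsearch", ...]
#   watch(lambda: run_probe(probe_cmd), my_collect_function)
```

Stopping after the first failure keeps the collector from being launched repeatedly while the server is still unresponsive.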
Luckily there is already a script that will do most of this for you:
I implemented the dirtracer script to kick off tonight if the ldapsearch command I'm running every 20 seconds encounters a failure. Hopefully it'll reveal exactly what is going on. I'll provide an update tomorrow.
I ran the dirtracer right after my script ran into an ldapsearch failure. The logs reveal a spike in A1 (Client abort connections) and B4 (Server failed to flush data (response) back to Client) errors and a decline in U1 (Cleanly Closed Connections). I went from a total of 8 B4 errors during normal operations to 157 during this 10 minute time span. 41 A1 errors to 398 during the same time span. So something is definitely causing an interrupt in LDAP functionality. I haven't been able to pinpoint exactly what it is.
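For comparing a quiet window against the failure window, the closure codes can also be tallied straight from the access log. The sketch below assumes that disconnect lines contain the word "closing" followed by " - " and a two-character code such as A1, B4 or U1; that layout is an assumption about the access log format, so adjust the pattern to what your log actually shows. `count_close_codes` is a made-up helper name.

```python
import re
from collections import Counter

# Assumed layout: "... - closing ... - A1" (letter + digit closure code).
CLOSE_CODE = re.compile(r"\bclosing\b.*?- ([A-Z]\d)\b")


def count_close_codes(lines):
    """Count connection closure codes (A1, B4, U1, ...) in access log lines."""
    counts = Counter()
    for line in lines:
        match = CLOSE_CODE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Example use on a log extract covering the failure window
# (the file name is a placeholder):
#   with open("access.extract") as fh:
#       print(count_close_codes(fh).most_common())
```

Running it once over a normal 10 minute window and once over the failure window gives the same kind of before/after comparison as the numbers above.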
You may also need to check the following during the failure time frame...
1. Check for high etime values or notes=U in the access log
2. Check whether a scheduled backup is choking the server
3. Check for any batch job scheduled to ADD or DEL (DEL is a more expensive operation than ADD if you have the referential integrity plug-in ON)
4. Check what the idle timeout is for DS
5. Check how many new connections are opened during that time frame
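Items 1 and 5 of the checklist above can be approximated with a small log scan. This Python sketch assumes access log lines that start with a bracketed timestamp like `[12/Nov/2010:01:28:53 -0500]`, RESULT lines carrying `etime=` (and `notes=U` for unindexed searches), and new connections logged with the text "connection from". Those details, the 2 second threshold, and the helper names are assumptions to adapt to your own log.

```python
import re
from collections import Counter

ETIME = re.compile(r"\betime=(\d+(?:\.\d+)?)")
# Captures the timestamp up to the minute, e.g. "12/Nov/2010:01:28".
MINUTE = re.compile(r"^\[(\S+):\d{2} ")


def slow_or_unindexed(lines, etime_threshold=2.0):
    """Return RESULT lines whose etime is at or above the threshold,
    or which are flagged notes=U (unindexed search)."""
    flagged = []
    for line in lines:
        if " RESULT " not in line:
            continue
        match = ETIME.search(line)
        slow = match is not None and float(match.group(1)) >= etime_threshold
        if slow or "notes=U" in line:
            flagged.append(line.rstrip("\n"))
    return flagged


def connections_per_minute(lines):
    """Count newly opened connections per minute of log time."""
    counts = Counter()
    for line in lines:
        if " connection from " in line:
            match = MINUTE.match(line)
            if match:
                counts[match.group(1)] += 1
    return counts
```

A sudden jump in the per-minute connection count right before the failure window would point at a client-side burst rather than something internal to the server.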
It definitely looks like there's something strange, and the fact that it's happening 'regularly', always at the same time... makes me think of the OLD issue of tombstone purging... but that happened in the 5.x days, so you should not be hitting such a limitation. However, I would check the following:
1. CPU usage
2. I/O subsystem usage and throughput
with close probes (1-2s) to see if there's any 'suspicious' pattern (i.e.: another process taking all the CPU, or some other process causing degraded I/O performance) that could bring the server to a kind of freeze/locked state.
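As a rough illustration of close-interval probing, the Python sketch below records timestamped load averages every couple of seconds. Load average is only a coarse proxy for CPU pressure and says nothing about I/O throughput, so tools such as vmstat and iostat remain the better source; `sample_load` is a made-up name and the 2 second default simply mirrors the 1-2s suggestion above.

```python
import os
import time


def sample_load(count, interval=2.0, sleep=time.sleep):
    """Collect `count` samples of (timestamp, (1min, 5min, 15min load)).

    Uses os.getloadavg(), which is available on Unix-like systems only.
    """
    samples = []
    for i in range(count):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        samples.append((stamp, os.getloadavg()))
        if i < count - 1:
            sleep(interval)
    return samples


# Example: about one minute of samples at 2 second spacing
#   for stamp, load in sample_load(30):
#       print(stamp, load)
```

Keeping the timestamps next to each sample makes it easy to line the readings up against the access and error log entries from the failure window.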
I opened a ticket with Oracle and their thoughts are that I've encountered bug 12240732 - SUNBT6712614 STARTTLS PERFORMANCE BETTER IN 5.X VERSIONS THAN IN 6.X VERSIONS. Their resolution is for me to upgrade from DSEE 6.3 to DSEE 126.96.36.199.1. I'm going to schedule a downtime and see if this resolves the issue.