The PING command is just an ICMP echo request, so it will just check the presence of the machine on the network.
The ldapsearch command, by contrast, also tests whether the 'target port' is reachable at that moment in time.
To have a better understanding, you should/could schedule the same ldapsearch command on both ldap client and ldap server machines at the same time.
If the search on the server succeeds and the search on the client fails, then it could be a network issue... or maybe the server machine is just 'too busy' to answer your requests. What is the CPU usage pattern on the LDAP server around that time?
I ran the ldapsearch commands on the client and server last night. I found that the search on the server fails at the specified time, and that from about that same time, for roughly 3 more minutes, the ldapsearches on the LDAP clients time out. I think this pretty much eliminates the network as the cause, since the search on the LDAP server itself fails. I haven't checked the CPU usage during this time. I'll look into that to determine if it could be the cause. However, I also noticed in the error logs on the LDAP server that I am getting a bunch of errors like:
WARNING<12364> - Connection - conn=-1 op=-1 msgId=-1 - Configuration warning Cannot disable TCP/IP nagle algorithm: error -5962 (The value requested is too large to be stored in the data buffer provided.). Check your system
These warnings also occur at other times of the day, but much more sporadically. I previously opened a ticket with Oracle about this warning and was told that it's pretty much a "normal warning" and shouldn't cause any issues with my directory server.
Ok, I'd say you are going to want to collect some data when this problem is happening. If you were going to make it from scratch, it would be something along the lines of:
1. Make a script that can detect when the server becomes unresponsive.
2. When the script in (1) detects an unresponsive server condition, kick off a data collection script.
3. Data collection script takes pstacks, prstat -L, maybe gcore, etc.
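As a rough illustration of steps 1 and 2, here is a minimal Python sketch of such a watchdog. It only assumes that the probe is some command that exits 0 when the server answers (for example an ldapsearch against the root DSE) and that the data collection is another command. The function names, the commented-out ldapsearch arguments, and the collection script path are placeholders, not anything from this thread.

```python
import subprocess
import time


def probe_ok(cmd, timeout_s=10):
    """Return True if the probe command exits 0 within timeout_s seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # server did not answer in time
    except OSError:
        return False  # probe command could not be started
    return result.returncode == 0


def watch(probe_cmd, collect_cmd, interval_s=30, timeout_s=10, max_checks=None):
    """Probe repeatedly; on the first failure run collect_cmd once and return True.

    Returns False if max_checks probes all succeed.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if not probe_ok(probe_cmd, timeout_s):
            subprocess.run(collect_cmd)  # e.g. a script taking pstack / prstat -L
            return True
        checks += 1
        time.sleep(interval_s)
    return False


# Hypothetical usage (host, port and script path are assumptions):
# watch(
#     ["ldapsearch", "-h", "localhost", "-p", "389", "-b", "", "-s", "base", "(objectclass=*)"],
#     ["/path/to/collect.sh"],
# )
```

Treating a timeout as a failure matters here, because a hung server usually shows up as a probe that never returns rather than one that returns an error.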
Luckily there is already a script that will do most of this for you:
I ran dirtracer right after my script ran into an ldapsearch failure. The logs reveal a spike in A1 (Client abort connections) and B4 (Server failed to flush data (response) back to Client) errors and a decline in U1 (Cleanly Closed Connections). I went from a total of 8 B4 errors during normal operations to 157 during this 10-minute time span, and from 41 A1 errors to 398 during the same time span. So something is definitely interrupting LDAP functionality. I haven't been able to pinpoint exactly what it is.
You may also need to check the following during the failure time frame...
1. Check for high etime values or notes=U in the access log
2. Whether a scheduled backup is choking the server
3. Any batch job scheduled to ADD or DEL (DEL is a more expensive operation than ADD if you have the referential integrity plug-in ON)
4. What the idle timeout for DS is
5. Check how many new connections are opened during that time frame
It definitely looks like there's something strange, and the fact that it's happening 'regularly', always at the same time, makes me think of the OLD issue of tombstone purging... but that happened in the 5.x days, so you should not be hitting that limitation. However, I would check the following:
1. CPU usage
2. I/O subsystem usage and throughput
Sample both at short intervals (1-2s) to see if there's any 'suspicious' pattern (i.e. another process taking all the CPU, or some other process causing degraded I/O performance) that could bring the server to a kind of frozen/locked state.
I run sar daily on my servers, and the output for this timeframe shows at least 97% idle. This editor doesn't format very well, but the last column (98,99,97) is the %idle.
00:01:00    %usr    %sys    %wio   %idle
00:31:00       1       1       0      98
01:01:01       1       1       0      98
01:23:00       1       1       0      98
01:24:01       1       1       0      99
01:25:00       1       0       0      98
01:26:01       1       0       0      98
01:27:00       1       0       0      99
01:28:01       1       0       0      99
01:29:00       1       0       0      98
01:30:01       1       0       0      98
01:31:00       2       0       0      98
01:32:01       2       0       0      98
01:33:00       2       1       0      98
01:34:01       1       2       0      98
01:35:00       1       2       0      97
01:36:00       1       2       0      97
01:37:01       2       1       0      97
01:38:00       2       1       0      97
01:39:01       2       0       0      97
01:40:00       2       0       0      97
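For anyone who wants to double-check the "at least 97% idle" reading without eyeballing the columns, a small Python sketch like the following finds the busiest row. It assumes the five-column layout shown above (time, %usr, %sys, %wio, %idle); the function name is made up for illustration. Keep in mind that these rows are per-minute averages, so a stall lasting only a few seconds could still be hidden in them.

```python
def min_idle(sar_text):
    """Return (timestamp, idle) for the row with the lowest %idle, or None."""
    rows = []
    for line in sar_text.splitlines():
        fields = line.split()
        # Data rows are a timestamp followed by four integer columns;
        # the header row (%usr %sys ...) fails the isdigit() check and is skipped.
        if len(fields) == 5 and all(f.isdigit() for f in fields[1:]):
            rows.append((fields[0], int(fields[4])))
    # min() returns the first row that has the lowest idle value.
    return min(rows, key=lambda r: r[1]) if rows else None
```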
1. etimes are high during that time:
[15/Nov/2012:01:27:34 -0500] conn=569501510 op=1 msgId=2 - RESULT err=0 tag=97 nentries=0 etime=14.151070 dn=""
[15/Nov/2012:01:27:34 -0500] conn=569501555 op=1 msgId=2 - RESULT err=0 tag=97 nentries=0 etime=14.301360 dn=""
[15/Nov/2012:01:27:34 -0500] conn=569501640 op=1 msgId=2 - RESULT err=0 tag=97 nentries=0 etime=14.453580 dn=""
[15/Nov/2012:01:27:34 -0500] conn=569501565 op=1 msgId=2 - RESULT err=0 tag=97 nentries=0 etime=14.596140 dn=""
[15/Nov/2012:01:27:35 -0500] conn=569501458 op=1 msgId=2 - RESULT err=0 tag=97 nentries=0 etime=14.736160 dn=""
2. The backups occur earlier in the night
3. I don't have any jobs scheduled. Is there a nightly batch job that DS performs?
4. The idle time is at least 97%
5. I'll check on the new connections.
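In case it helps anyone doing the same check, RESULT lines like the ones quoted under item 1 can be pulled out of the access log with a short Python sketch. The regular expression is written only against the line format shown in this thread, and the 1-second threshold and the function name are arbitrary choices for illustration.

```python
import re

# Matches access-log RESULT lines in the format quoted above, e.g.
# [15/Nov/2012:01:27:34 -0500] conn=... op=1 msgId=2 - RESULT ... etime=14.151070 dn=""
RESULT_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+conn=(?P<conn>-?\d+)\s+op=(?P<op>-?\d+)\s+"
    r".*?RESULT\s+.*?etime=(?P<etime>\d+(?:\.\d+)?)"
)


def slow_results(lines, threshold_s=1.0):
    """Return (timestamp, conn, etime) for RESULT lines with etime >= threshold_s."""
    slow = []
    for line in lines:
        match = RESULT_RE.search(line)
        if match and float(match.group("etime")) >= threshold_s:
            slow.append(
                (match.group("ts"), int(match.group("conn")), float(match.group("etime")))
            )
    return slow
```

Called as `slow_results(open(path))` with `path` pointing at the access log, this returns the slow operations in file order, which makes it easy to see when the etimes start climbing.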
I opened a ticket with Oracle, and their assessment is that I've hit bug 12240732 (SUNBT6712614, STARTTLS PERFORMANCE BETTER IN 5.X VERSIONS THAN IN 6.X VERSIONS). Their recommended resolution is for me to upgrade from DSEE 6.3 to DSEE 18.104.22.168.1. I'm going to schedule downtime and see if this resolves the issue.