We suspect a memory issue, but we cannot prove it.
We see the following symptoms on our Apps Tier only (the DB Tier has no issue):
Every other day our EBS instance (Apps Tier only) suddenly drops all connections. All EBS users get disconnected, and none of them can reconnect.
We can ping the server, but we cannot connect to it using ssh/PuTTY. Then, after some time (about 5 minutes), we can connect again, and we need to restart the Apps Tier to clear up resources.
We suspect a memory shortage, or that memory is not being managed or allocated correctly.
Where can we find the OS log that shows a memory shortage when a Linux server hangs and refuses all connections?
Our Apps Tier has 24 GB of memory, and so does our DB Tier; they are separate servers.
top - 00:02:23 up 33 days, 17:34, 1 user, load average: 0.00, 0.02, 0.05
Tasks: 178 total, 1 running, 177 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 24650920k total, 18382204k used, 6268716k free, 556440k buffers
Swap: 24707060k total, 628k used, 24706432k free, 9385116k cached
In fact, it hardly uses any swap, which suggests that our memory is big enough.
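As a rough check of that reading, here is a minimal sketch that subtracts buffers and page cache from the "used" figure in the top output. The variable names are ours, and the numbers are simply copied from the snapshot above, which was taken while the server was idle (load average 0.00), so it may not reflect the state during a hang.

```shell
# Figures copied from the top output above (all in kB).
used_kb=18382204
buffers_kb=556440
cached_kb=9385116

# Memory held by processes, excluding reclaimable buffers and page cache.
app_used_kb=$((used_kb - buffers_kb - cached_kb))
echo "used by processes: ${app_used_kb} kB"
```

By this calculation roughly 8 GB of the 24 GB is held by processes at the time of the snapshot; the rest of "used" is buffers and cache.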
I just suspect it is not allocated correctly.
Can you help me find the OS log showing that the "applmgr" processes are running out of memory?
We also have the following parameter settings. Can you help me find the erroneous settings?
# cat /etc/security/limits.conf
* hard nofile 65536
* soft nofile 4096
* hard nproc 16384
* soft nproc 32768
* hard stack 16384
* soft stack 10240
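One thing we could cross-check ourselves is whether any soft limit is set higher than its hard limit. The sketch below is only an illustration: the `limits` variable holds the values copied from the listing above rather than reading the real file, and the variable names are ours.

```shell
# Values copied from the limits.conf listing above.
limits='* hard nofile 65536
* soft nofile 4096
* hard nproc 16384
* soft nproc 32768
* hard stack 16384
* soft stack 10240'

# Report any item whose soft limit exceeds its hard limit.
bad_items=$(printf '%s\n' "$limits" | awk '
    $2 == "hard" { hard[$3] = $4 }
    $2 == "soft" { soft[$3] = $4 }
    END {
        for (item in soft)
            if (item in hard && soft[item] + 0 > hard[item] + 0)
                print item
    }')
echo "soft > hard for: ${bad_items:-none}"
```

With these values the check flags nproc (soft 32768 against hard 16384). We do not know whether that is related to the hang.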
# cat /etc/sysctl.conf
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
kernel.msgmnb = 65536
kernel.msgmax = 8192
kernel.shmall = 2097152
net.ipv4.conf.all.arp_notify = 1
kernel.msgmni = 2878
fs.file-max = 6815744
kernel.sem = 256 32000 100 142
kernel.shmmni = 4096
kernel.shmmax = 12621271040
kernel.panic_on_oops = 1
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048576
fs.aio-max-nr = 1048576
net.ipv4.ip_local_port_range = 9000 65500
net.ipv4.tcp_tw_recycle = 0
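Another pair we could compare is kernel.shmall (a system-wide total counted in pages) against kernel.shmmax (the largest single segment, in bytes). The sketch below converts shmall to bytes using the values from the listing above. The 4096-byte page size is an assumption on our part; `getconf PAGE_SIZE` would confirm it on the actual server.

```shell
# Values copied from the sysctl.conf listing above.
shmall_pages=2097152
shmmax_bytes=12621271040
page_size=4096   # assumption: typical x86_64 page size; confirm with `getconf PAGE_SIZE`

# shmall is expressed in pages, so convert it to bytes before comparing.
shmall_bytes=$((shmall_pages * page_size))
echo "shmall total: ${shmall_bytes} bytes; shmmax per segment: ${shmmax_bytes} bytes"

if [ "$shmall_bytes" -lt "$shmmax_bytes" ]; then
    shm_check="shmall is smaller than shmmax"
else
    shm_check="shmall covers shmmax"
fi
echo "$shm_check"
```

Under that page-size assumption, shmall works out to 8 GiB in total, which is less than the single-segment maximum set by shmmax. Whether this matters on the Apps Tier is unclear to us.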