In our production environment we have been seeing unhealthy long-running transactions that eventually crash the whole server. After some investigation we identified a filer (a remote NFS server) as the root cause of the long-running transactions. We have replaced the filer, and things are now working.
However, I have noticed that each transaction takes much longer in production than in the lower environments (performance, testing, development, etc.). So I tried to find out whether our production server is, by any chance, slow. I ran the program below in the production, performance, testing and development environments and observed that the production server takes 3 times as long as the other environments.
a=0
for ((ii=0; ii<=2000; ii++)); do
    a=`expr $a + 1`
done
The environments compare as follows:
1) Production: Solaris 10 on a SPARC Enterprise T5220 with one physical processor, 8 cores and 8 threads per core (64 threads in total), and 64 GB of RAM.
2) Performance environment: Solaris 9 on a SPARC SUNW,Sun-Fire-V440 with 4 physical processors, 0 core and 16 GB of memory.
3) Development box: GNU/Linux, kernel 220.127.116.11-4-smp, with 2 Intel Pentium III processors and 2 GB of memory.
4) Testing environment: 64-bit GNU/Linux, kernel 2.6.18-128.el5, with an Intel(R) Xeon(R) CPU (2 cores at 2.53 GHz) and 4 GB of memory.
From the system details above, the production server looks capable of processing a large number of threads very quickly, but in practice it does not. Even a simple ls command on an empty file takes 4 times as long in production as in the performance environment.
Can anybody help me tune our production server, or at least point me to some documents that might help?
Is there a reason you are trying to compare metrics between 4 differing environments? They all run different hardware, with single- and multi-threaded processing, so you will see different results. Regarding Solaris 9 versus Solaris 10 command library calls, command functionality can change between releases. Solaris 10 might have a more robust "ls" command which in turn calls more libraries (just a guess). I believe the commands you are using for your performance testing (incrementing an arithmetic variable in a loop, ls, etc.) are single-threaded.
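To illustrate the single-threaded point, here is a rough sketch (not a tuned benchmark) that splits the loop from the original post into two variants. It assumes bash is available on every box; the function names and the ITERATIONS variable are made up for this example. The first variant starts an external expr process on every iteration, so it mostly measures how quickly one hardware thread can create and run a process. The second uses the shell's builtin arithmetic and starts no child processes at all.

```shell
#!/bin/bash
# Hypothetical helper names; ITERATIONS mirrors the loop bound in the original post.
ITERATIONS=2000

# Variant 1: the original approach. Every iteration starts an external
# expr process, so process creation dominates the run time.
count_with_expr() {
    local a=0 ii
    for ((ii = 0; ii <= ITERATIONS; ii++)); do
        a=`expr $a + 1`
    done
    echo "$a"
}

# Variant 2: the same counting with shell builtin arithmetic.
# No child processes are started.
count_with_builtin() {
    local a=0 ii
    for ((ii = 0; ii <= ITERATIONS; ii++)); do
        a=$((a + 1))
    done
    echo "$a"
}

echo "expr (one external process per iteration):"
time count_with_expr
echo "builtin arithmetic:"
time count_with_builtin
```

If the gap between production and the other boxes is much larger for the first variant than for the second, the original test is probably measuring process start-up cost on a single thread rather than a fault in the server.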
I agree that these machines have different hardware and different OS versions. But these basic operations (expr, etc.) should have similar response times, shouldn't they? I mean, I am trying to understand whether, by any chance, the production environment has some glitch that causes this huge difference in response time.
I understand that the SPARC T2 CMT technology should improve throughput in a multi-threaded environment (which is what we really want), but the response time of these basic commands should not change because of that.
Or can you suggest any other way to find out whether our production box is well tuned for a multi-threaded, multi-JVM application?
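One rough way to look at throughput instead of single-command response time is to run the same fixed unit of work in a growing number of parallel processes and watch how the elapsed time changes. The sketch below assumes bash; the names worker, run_parallel and WORK_ITERATIONS are made up for this example, and a shell loop is only a stand-in for a real multi-JVM workload.

```shell
#!/bin/bash
# Hypothetical names and sizes, chosen only for illustration.
WORK_ITERATIONS=50000

# One fixed, CPU-bound unit of work using builtin arithmetic only.
worker() {
    local a=0 i
    for ((i = 0; i < WORK_ITERATIONS; i++)); do
        a=$((a + 1))
    done
    echo "$a"
}

# Start N workers in the background and wait for all of them to finish.
run_parallel() {
    local n=$1 i
    for ((i = 0; i < n; i++)); do
        worker > /dev/null &
    done
    wait
}

# Elapsed ("real") time per worker count. On a box with many hardware
# threads, it should grow slowly until the thread count is exceeded.
for n in 1 8 32 64; do
    echo "workers=$n"
    time run_parallel "$n"
done
```

Comparing the "real" times across the four environments for the higher worker counts gives a picture closer to what a multi-threaded application would see than timing a single expr or ls call does.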