Well, you're running 10.2.0.3 (which is no longer supported by Oracle) on AIX 5.3 (which is no longer supported by IBM). You're running an uncertified, never mind unsupported, configuration. From what I understand from our system admin team, IBM won't even take your call if you have an issue and say you're running 5.3.
Obviously, I understand that you might be stuck on that version and there's nothing you can do about it. However, if you are able, I would recommend upgrading to 11.2 and then AIX 6 (at least).
Why are you running the ONS: are you using Fast Application Notification? If not, and it's unlikely that you are, you don't need the ONS process to be running.
Of course, this post comes with all the usual caveats: don't trust anyone on a forum, test it out yourself in a non-Production environment first, etc, etc.
What's your PGA usage? That can sometimes be the cause of the memory getting swallowed up on a box (though it would be a coincidence that it happened when you replaced the memory). It is something to check.
Thank you for your response,
We have a maximum of 300 users, about 220 of whom are connected at the same time. Our PGA is set to 989 MB; this value was chosen after several DBAs suggested that an average user consumes about 4 to 5 MB. The crashes have actually taken place during low-load times.
We are conducting tests today: we are going to reinstall the old RAM modules and update the firmware of the server's motherboard.
Is there a way to tell whether Fast Application Notification is being used? I know that load balancing and instance failover are used, but unfortunately we did not install the RAC system ourselves. Meanwhile, we are preparing a new high-availability system with the latest Oracle releases.
+13/03/18 09:34:33  Passive connection 0,<IP of server 1>,6200 invalid connect server IP format+
hostName: <hostname of server2>
This message appears very frequently in ons.log, and unfortunately I can't find any documentation for it.
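To put a number on "very frequently", one option is to count the message per hour. The sketch below assumes every ons.log line starts with a date field followed by a time field, as in the excerpt above; the function name `count_invalid_connect` and the log path in the usage note are made up for illustration.

```shell
# Count "invalid connect server IP format" messages per date and hour.
# Assumption: field 1 is the date and field 2 is HH:MM:SS, as in the excerpt.
count_invalid_connect() {
    log_file="$1"
    grep 'invalid connect server IP format' "$log_file" |
        awk '{ split($2, t, ":"); print $1, t[1] ":00" }' |
        sort | uniq -c
}
```

Calling `count_invalid_connect /path/to/ons.log` (a placeholder path) prints one line per hour with the number of matching messages. Comparing those counts with the times when memory climbs may show whether the two are related.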
The above link might offer some insight. FAN is event-driven, but someone needs to develop the actions that those events trigger.
The PGA metric is a target: that means Oracle will try and keep its PGA usage within those bounds, but there are quite a few ways to make the PGA balloon and eat up all of the memory (and swap) of the server. It might be worth checking into, just in case.
But I suspect the issue is a) with your ONS configuration and b) something to do with your memory structures
Further investigation showed that the ons daemon is stopped, and starting it with onsctl changes nothing regarding the memory consumption.
On the other hand, killing the process at the OS level dramatically reduced the memory consumption, without any negative effect on the cluster or the overall system.
The problem is that the ons process (oracle/ocr/opm/) restarts on its own and again consumes more and more memory. Any idea how to stop that process for good?
Check srvctl and see what that has for the entry for the ONS daemon. You might be automatically restarting it after a failure.
Thank you for your help,
We've checked with onsctl ping: even a few seconds after stopping the daemon (onsctl stop), it restarts and onsctl ping returns (ons is running ...).
In srvctl we always get the following result:
+$ srvctl status nodeapps -n p51+
VIP is running on node: p51
GSD is running on node: p51
Listener is running on node: p51
ONS daemon is running on node: p51
Someone suggested that I should stop ons on both instances and perform a clean restart, and that this should solve the problem. What do you think about that?
You said:
+"You might be automatically restarting it after a failure."+
If that's true, any idea how to stop the automatic restart?
975727 wrote:
+"Thank you for your help,"+
You are very welcome :)
+"Someone suggested that I should stop ons on both instances and perform a clean restart and this should solve the problem. What do you think about that?"+
I don't think that will solve your problem - it sounds very much like you have auto_start enabled. My 10g RAC is a bit rusty, but you might try doing something like this:
*> srvctl status nodeapps -n cheese*
VIP is running on node: cheese
GSD is running on node: cheese
Listener is running on node: cheese
ONS daemon is running on node: cheese
Now check the name of the ONS resource on the server(s) you want to look at:
crs_stat -l | grep ons
Use crs_stat -f to show detailed information about the ONS process:
crs_stat -f ora.cheese.ons | grep AUTO_START
Notice that AUTO_START is set to ALWAYS (at least in this example). You can choose 'always', 'restore', or 'never'. You can also use numbers, I think.
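If there are several resources to check, the same lookup can be done in one pass. This is a minimal sketch that assumes `crs_stat -f` prints one KEY=VALUE attribute per line, with NAME appearing before AUTO_START within each resource; the function name `list_auto_start` is hypothetical.

```shell
# Print "<resource name> <AUTO_START value>" for every resource found in
# KEY=VALUE text read from standard input.
# Assumption: NAME comes before AUTO_START within each resource's attributes.
list_auto_start() {
    awk -F= '
        $1 == "NAME"       { name = $2 }
        $1 == "AUTO_START" { print name, $2 }
    '
}
```

On a cluster node it would be used as `crs_stat -f | list_auto_start`, or with a single resource name such as ora.cheese.ons passed to crs_stat as in the example above.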
The process to actually change it is rather complicated on 10gR2 (I think) - you have to create a profile, edit that, shut down the component, re-register it with the new profile and then restart.
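The "edit that" step can be sketched as a small helper. It only rewrites the AUTO_START line in a profile file that has already been exported; it does not talk to the cluster. The function name `set_auto_start`, the file path and the commented outline are assumptions to check against the 10gR2 documentation, not a tested procedure. Your profile may also expect a numeric value rather than a word.

```shell
# Rewrite the AUTO_START line of an exported CRS resource profile in place.
# Assumption: the profile is a plain text file of KEY=VALUE lines.
set_auto_start() {
    profile="$1"
    value="$2"
    tmp="${profile}.new"
    # Write the edited copy first, then replace the original only on success.
    sed "s/^AUTO_START=.*/AUTO_START=${value}/" "$profile" > "$tmp" &&
        mv "$tmp" "$profile"
}

# Assumed outline of the surrounding steps (verify before use):
#   crs_stat -p ora.cheese.ons > /tmp/ora.cheese.ons.cap   # export the profile
#   set_auto_start /tmp/ora.cheese.ons.cap never           # edit it
#   ...stop the resource, re-register it with the edited profile, restart...
```

Keeping the edit separate from the cluster commands means the changed profile can be reviewed (for example with diff against a backup copy) before anything is re-registered.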
The best example of something similar that I can find is [this link|http://surachartopun.com/2009/07/change-oracle-asm-resource-autostart.html] (obviously, this is for the ASM, not the ONS)
Failing that, you could ask Oracle for the correct syntax. Unfortunately, I don't have a 10gR2 RAC I can try this out on (we upgraded them all to 11gR2).
Thank you for that !
It was very helpful, I will work on it :)
You're welcome :). Happy to help (assuming it does help; if it causes you Production problems, then it wasn't me!)
Good luck and let me know how it goes. It is a pain and it is easier in 11gR2...
(P.S. try and upgrade ASAP!)