Skip to Main Content

Oracle Database Discussions

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Interested in getting your voice heard by members of the Developer Marketing team at Oracle? Check out this post for AppDev or this post for AI focus group information.

Node Affinity - A feature with high overhead; need scalable alternative

746367Jan 13 2010 — edited Oct 1 2010
We recently moved to RAC on Exadata. It has a behavior in the architecture which I think is hard-to-scale, difficult to get right, and expensive to maintain, but I'm hoping I'm just missing something.

THE ORACLE DESIGN

The recommendation from Oracle Consulting is to use "node affinity" as follows:

- DML can cause lots of cross-node traffic if some of the parallel slaves are on one node and some are on another.

- To prevent this traffic, pin any particular DML job to just one node. For example, of the DML has a degree of parallelism of 16, all parallel slaves would run on, say, node 6, thus having no cross-node traffic.

- Facilitate this using "services". For any given DML service, the service will use only one node.

- Note that it is not that all the DML on the server must go to one node; just that for any one DML session, that session would use a particular service. Thus, for example, if we have 8 nodes, we could have 8 DML services with each pinning one-to-one. Then the DML jobs would use the 8 services as needed.

In principle, this can still use the full machine as long as we send an equal number of jobs to each of the eight services/nodes. Thus we have full use of the machine and no inter-node issues.

THE REALITY

In practice, this actually produces a very bad outcome. Here is why:

- This is essentially manual load balancing. That is, we have to constantly monitor the load on the nodes to ensure they are being efficiently utilized.

- If one node is getting overwhelmed because we are not watching it constantly or have more valuable things to do for the business, that node will begin to downgrade or serialize queries (pre 11g R2) or queue those statements (11g R2). Either way, performance degrades.

- Meanwhile, the other nodes go underutilized. That is, even as the overwhelmed node is giving bad service to those pinned to it, the underwhelmed nodes sit idle, leaving CPU and I/O power to go to waste.

- To remedy this situation, we need the intervention of expensive staff. One is the DBAs who, as noted, must watch this, and discovering it, must analyze it and make recommendations on how to fix it.

- It also requires the intervention of application staff or power users--whoever is sending the DML jobs in. They have to connect to the services, and if the things are out of balance, they have to change their job streams to connect to different services.

POSSIBLE SOLUTIONS

My thoughts on solutions:

1) Tell me what I'm missing. How is this anything but terrible architecture? If there is a better way, or I'm just wrong, please let me know.

2) Here's what I think would be the best fix:

- Instead of pinning a service to a single node, let it pin to anyone one of several nodes such that for any one DML query, it would be on one node.

That is:

- The DML service is set up to use, say 4 nodes.

- A session is created on the service.

- The session submits the DML.

- The service will pin it to one of its 4 nodes, keeping all parallel slaves on that one node, but it may be any one of the four nodes.

- Then, if one of those nodes gets overwhelmed, that's fine, it can pick one of the other 3.

This would give both valuable points: it would pin any one DML query/session to one node, but it would not be forced to use just one node.

3) Just a note: it seems we cannot just let DML fly on all nodes because it does seem to slow some queries down, and quite a bit.

4) And we do NOT want to have to solve this on a case by case basis. I know Oracle documentation says you can do some kinds of tricks with joins or clusters or the like. This does not scale. We have 30,000 tables that we manage, and hundreds of thousands of queries; we cannot go retrofitting all that to handle what I can easily see being done at a database level (if the product can do it and if I knew how!).

Thoughts welcome!

Edited by: user2369423 on Jan 13, 2010 6:47 PM
This post has been answered by Arup Nanda_2 on Jan 26 2010
Jump to Answer

Comments

Locked Post
New comments cannot be posted to this locked post.

Post Details

Locked on Oct 29 2010
Added on Jan 13 2010
17 comments
6,926 views