Node Affinity - A feature with high overhead; need scalable alternative

746367
746367 Member Posts: 9
edited Oct 1, 2010 3:10PM in General Database Discussions
We recently moved to RAC on Exadata. It has a behavior in the architecture which I think is hard-to-scale, difficult to get right, and expensive to maintain, but I'm hoping I'm just missing something.

THE ORACLE DESIGN

The recommendation from Oracle Consulting is to use "node affinity" as follows:

- DML can cause lots of cross-node traffic if some of the parallel slaves are on one node and some are on another.

- To prevent this traffic, pin any particular DML job to just one node. For example, if the DML has a degree of parallelism of 16, all parallel slaves would run on, say, node 6, thus generating no cross-node traffic.

- Facilitate this using "services". For any given DML service, the service will use only one node.

- Note that it is not that all the DML on the server must go to one node; just that any one DML session would use a particular service. Thus, for example, if we have 8 nodes, we could have 8 DML services, each pinned one-to-one to a node. The DML jobs would then use the 8 services as needed.

In principle, this can still use the full machine as long as we send an equal number of jobs to each of the eight services/nodes. Thus we have full use of the machine and no inter-node issues.
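To make the recommendation concrete, here is my rough sketch of the setup as I understood it (the database name, service names, and exact srvctl options here are mine, not the consultants', so treat this as illustrative only): one service per node, each with a single preferred instance.

    # one DML service per node, each pinned to a single preferred instance (names invented)
    srvctl add service -d PROD -s DML_SVC_1 -r PROD1
    srvctl add service -d PROD -s DML_SVC_2 -r PROD2
    # ... and so on up to DML_SVC_8 on PROD8
    srvctl start service -d PROD -s DML_SVC_1

The application job streams then spread themselves across DML_SVC_1 through DML_SVC_8 by hand.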

THE REALITY

In practice, this actually produces a very bad outcome. Here is why:

- This is essentially manual load balancing. That is, we have to constantly monitor the load on the nodes to ensure they are being efficiently utilized.

- If one node is getting overwhelmed because we are not watching it constantly or have more valuable things to do for the business, that node will begin to downgrade or serialize queries (pre 11g R2) or queue those statements (11g R2). Either way, performance degrades.

- Meanwhile, the other nodes go underutilized. That is, even as the overwhelmed node is giving bad service to those pinned to it, the underwhelmed nodes sit idle, leaving CPU and I/O power to go to waste.

- To remedy this situation, we need the intervention of expensive staff. First, the DBAs, who, as noted, must watch for this and, on discovering it, must analyze it and recommend how to fix it.

- It also requires the intervention of application staff or power users--whoever is sending the DML jobs in. They have to connect to the services, and if things are out of balance, they have to change their job streams to connect to different services.

POSSIBLE SOLUTIONS

My thoughts on solutions:

1) Tell me what I'm missing. How is this anything but terrible architecture? If there is a better way, or I'm just wrong, please let me know.

2) Here's what I think would be the best fix:

- Instead of pinning a service to a single node, let it pin to any one of several nodes, such that any one DML query still runs entirely on one node.

That is:

- The DML service is set up to use, say 4 nodes.

- A session is created on the service.

- The session submits the DML.

- The service will pin it to one of its 4 nodes, keeping all parallel slaves on that one node, but it may be any one of the four nodes.

- Then, if one of those nodes gets overwhelmed, that's fine, it can pick one of the other 3.

This would give us both benefits: any one DML query/session would still be pinned to a single node, but the service would not be forced to use just one node. (A rough sketch of what I mean follows after this list.)

3) Just a note: it seems we cannot just let DML fly on all nodes because it does seem to slow some queries down, and quite a bit.

4) And we do NOT want to have to solve this on a case by case basis. I know Oracle documentation says you can do some kinds of tricks with joins or clusters or the like. This does not scale. We have 30,000 tables that we manage, and hundreds of thousands of queries; we cannot go retrofitting all that to handle what I can easily see being done at a database level (if the product can do it and if I knew how!).
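Here is the rough sketch promised under point 2 (the service and instance names are invented, and I do not know whether a multi-node service alone would give the per-query pinning; that is the piece I am asking about):

    # hypothetical: one DML service spread over four of the eight instances;
    # sessions load balance across them, but each query's slaves would ideally
    # stay on whichever instance the session landed on
    srvctl add service -d PROD -s DML_POOL -r PROD3,PROD4,PROD5,PROD6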

Thoughts welcome!

Edited by: user2369423 on Jan 13, 2010 6:47 PM


Answers

  • Robert Geier
    Robert Geier Member Posts: 2,989
    edited Jan 14, 2010 4:43AM
    I suspect the main issue you are trying to avoid is GC waits, which occur when a transaction on one RAC instance needs access to a block cached on another RAC instance.

    GC waits happen when multiple instances do work on a particular segment. This is a symptom of the way Cache Fusion works, your application architecture, your table and index structure, and the queries you run.

    If all work for a particular segment is done on one instance, then you can avoid GC waits; this looks to be the solution that you are implementing, and it does work. If you don't want to restrict work on particular segments to specific RAC nodes, then you will see an increase in GC waits, which you MAY be able to reduce by tuning the code to require fewer blocks. You may also want to look at the speed of your interconnects, and do some more analysis of exactly which GC waits you are getting.
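
    As a starting point for that analysis, a rough first look at which gc events dominate on each instance could be something like this (just a sketch, run from any instance):

    -- top global cache wait events per instance, worst first
    SELECT inst_id, event, total_waits, time_waited
    FROM gv$system_event
    WHERE event LIKE 'gc%'
    ORDER BY time_waited DESC;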

    http://jose-valerio.com.ar/blog/?p=196

    Many applications are not designed to scale horizontally, and it sounds like your application would be better suited to fewer servers with more cpus and memory.

    You can find a lot of blogs and presentations on this topic if you google for "rac performance gc waits". In particular there are some good presentations by Julian Dyke and Arup Nanda which give a good overview of the problem and common solutions.
  • Arup Nanda_2
    Arup Nanda_2 Member Posts: 73 Red Ribbon
    I am not sure where you got the advice about pinning a service to a specific instance. I wouldn't have recommended that. It may create severe imbalance in the load on the RAC nodes. The effect could be CPU starvation of the LMS and LMD processes, which will eventually lead to gc_* wait events (or even node eviction), the very problem you are trying to address.

    Rather, I would recommend a solution based on parallel_instance_groups. Suppose you have two instances - prodb1 and prodb2. You would define two instance groups as shown below (in the initialization parameter file):

    prodb1.instance_groups='pqgroup1'
    prodb2.instance_groups='pqgroup2'

    You would also define two more parameters there:

    prodb1.parallel_instance_group='pqgroup1'
    prodb2.parallel_instance_group='pqgroup2'

    What this means is that someone connected to prodb1 will spawn parallel query slaves only in that instance, not in prodb2. The session will see that the parameter "parallel_instance_group" is set to pqgroup1 and therefore will spawn slaves only in those instances where instance_groups is set to "pqgroup1", which happens to be prodb1. This works regardless of which service name the users used to connect.

    In this solution, you would define the services to go across all instances, as appropriate; so the sessions will be load balanced. But the sessions will not spawn off PQ slaves on other instances. Technically speaking, this is still unbalanced, since PQ slaves may be created on one instance more than the other; but in reality they will be somewhat balanced.
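
    If you want to check or temporarily change the behavior for one session, parallel_instance_group is (as far as I recall) session-modifiable, so something like the following should work:

    -- show the current setting, then point this session's PQ slaves at a group
    SHOW PARAMETER parallel_instance_group
    ALTER SESSION SET parallel_instance_group = 'pqgroup1';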

    Hope this helps.

    Arup
  • Timur Akhmadeev
    Timur Akhmadeev Member Posts: 750 Silver Badge
    Hi,

    please describe your situation in a more clear way:

    1) What is your DML job and how is it organized (a single PX statement, or a mix of PX/no-PX statements in PL/SQL)? In what hours does it run (business/off-peak)?
    2) What is your goal: getting the most out of the available hardware for DML job performance, or something else? If the goal is job performance, what is its current run time and how much improvement do you want?
    3) Are your tables partitioned, and do they have usable indexes during the job run? If the tables are partitioned, do the execution plans of the DML job's queries contain PX messaging steps?
  • 746367
    746367 Member Posts: 9
    Arup:

    Great thoughts. The advice came directly from Oracle Consulting, but don't let that sway you. I thought parallel_instance_group might have something to it...but I still have a few questions:

    If we set the parameters as you note, then someone connects to PRODB1, okay, but what is PRODB1? Is that a service?

    If it's a service, we're back in the same place--people have to worry about how they are connecting, instead of just connecting to the database.

    Likewise, if that service is associated to a parallel instance group, and we get a huge spike of connections to that service, it would be imbalanced...the very problem we're trying to avoid.

    So what am I missing? How does the parallel instance group allow us to:

    1) Just let people connect without worrying about connecting to a particular service or other construct that pins them.

    2) Not cause loading on one node while other nodes remain underutilized.

    But again, what I think I'm missing is just this: what precisely is PRODB1 if it is not a service, and how does a particular session become associated with it?

    Thanks!

    James
  • Arup Nanda_2
    Arup Nanda_2 Member Posts: 73 Red Ribbon
    James,

    PRODB1 and PRODB2 are instance names. The snippet I mentioned there was an excerpt from the spfile (or pfile). If it's an spfile, you would have to prefix the parameters with the instance names; otherwise it is not clear which settings apply to each instance. If you need to set something for all instances, you would prefix with a "*".

    Now, having said that, there is no connection between what I described and the service names. You would define service names as you normally would, e.g. a service called SERV1 load balanced across the instances. When a user connects using service SERV1, he can go to either instance with equal probability (assuming listener load balancing). If he connects to PRODB1 and kicks off a degree-2 PQ, all his slaves will be in the PRODB1 instance only.

    Similarly, another user now connects using the same service SERV1, but she connects to instance PRODB2. When she kicks off a degree-2 PQ, all the slaves will be in instance PRODB2 alone.

    So, you accomplish your load balance objective and also make sure that slaves and coordinators of individual parallelized queries are restricted to a specific instance only.
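
    Putting the two posts together, the parameter file excerpt and the service definition would look roughly like this (PRODB as the database name, the 128 value, and the exact srvctl syntax are just illustrative):

    # parameter file excerpt: "*" applies to all instances, an instance-name prefix to that instance only
    *.parallel_max_servers=128
    PRODB1.instance_groups='pqgroup1'
    PRODB2.instance_groups='pqgroup2'
    PRODB1.parallel_instance_group='pqgroup1'
    PRODB2.parallel_instance_group='pqgroup2'

    # one service, load balanced across both instances
    srvctl add service -d PRODB -s SERV1 -r PRODB1,PRODB2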

    Hope this helps.

    Arup
  • damorgan
    damorgan Member Posts: 14,464 Bronze Crown
    edited Jan 27, 2010 1:23AM
    The fact that advice comes from Oracle Consulting carries, I am sorry to say, no authority compared to advice from Arup Nanda, Jonathan Lewis, Julian Dyke, or a lot of other people far more qualified than many of those Oracle employs. There, I said it out loud.

    With RAC clusters, whether Exadata or using any other hardware, the entire point is node affinity so the Oracle consulting people are just stating the obvious. Whether your specific application, as written, will scale on a RAC cluster is quite a different question and one that can only be answered with a thorough review of architecture and code.

    Can you post results from the following?
    SELECT * FROM v$version;
    
    SELECT instance_name, host_name
    FROM gv$instance;
    
    col failover_method format a20
    col failover_type format a20
    
    SELECT name, failover_method, failover_type, enabled, clb_goal
    FROM dba_services;
    
    SELECT *
    FROM gv$gc_elements_with_collisions;
    
    col "AVG RECVD TIME (ms)" format 9999999.9
    col inst_id format 9999
    prompt GCS CURRENT BLOCKS
    
    SELECT b1.inst_id, b2.value RECEIVED, b1.value "RECEIVE TIME", ((b1.value/b2.value)*10) "AVG RECEIVE TIME (ms)"
    FROM gv$sysstat b1, gv$sysstat b2
    WHERE b1.name = 'gc current block receive time'
    AND b2.name = 'gc current blocks received'
    AND b1.inst_id = b2.inst_id;
    
    col "AVG RECVD TIME (ms)" format 9999999.9
    col inst_id format 9999
    
    SELECT b1.inst_id, b2.value RECEIVED, b1.value "RECEIVE TIME", ((b1.value/b2.value)*10) "AVG RECEIVE TIME (ms)"
    FROM gv$sysstat b1, gv$sysstat b2
    WHERE b1.name = 'gc cr block receive time'
    AND b2.name = 'gc cr blocks received'
    AND b1.inst_id = b2.inst_id;
  • 746367
    746367 Member Posts: 9
    Arup:

    This sounds rather precisely like what I'm trying to accomplish, and quite elegantly.

    It keeps the administration of services simple since we would only need a few of them and can associate them with however many nodes we may need.

    Yet when a connection comes in on that service, it will have all its parallel slaves connected to one node, thus getting the single-node performance gain.

    And at the same time, it will spread the connections around the nodes so that we don't have some nodes overloaded and some under-utilized.

    And in that last sentence I said "connections". Is that right? Once a session gets assigned to a node, it stays there and all the parallel slaves it uses are there?

    Again, sounds great. This has been escalated to the Oracle product area, since I've been quite persistent that there must be a better way. I have a conference call with them Friday. I'd like to discuss it with them then.

    Great thanks.

    James
  • 746367
    746367 Member Posts: 9
    I'll run that code, though there is very little in production for a few more weeks, so I'm not sure it's a representative load at this point.

    I'm new to RAC (via Exadata), as is almost our whole organization, so even obvious points are welcome.

    But your statement "the entire point is node affinity" is interesting. The more I understand it the more I ask: wouldn't you want to pin almost everything to some node? (Which I think may be your point)

    As long as you can do it in a low maintenance fashion like Arup describes, not a high maintenance one like I originally defined, it seems quite desirable.

    That is, because RAC is "shared everything", every compute node can get to all storage. Thus I'd almost always want a given job stream on one node, unlike an MPP architecture where spreading across compute nodes is also necessary.

    For example, our default DOP is 16 and we have 10 nodes. If I have, say, 30 big jobs running, wouldn't I want 3 on each node, so that each job gets all 16 of its slaves on one node? What possible advantage is there to having, say, each query run 2 slaves on each of 8 nodes?
  • 573141
    573141 Member Posts: 171
    With RAC clusters, whether Exadata or using any other hardware, the entire point is node affinity so the Oracle consulting people are just stating the obvious. Whether your specific application, as written, will scale on a RAC cluster is quite a different question and one that can only be answered with a thorough review of architecture and code.
    hi morgan,
    do we need any code changes to configure a particular application for connecting to a RAC database?

    my understanding is that the only input the application requires is the TNS details specific to the RAC database
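
    for example, something along these lines is the kind of TNS entry I mean (host names and service name are made up):

    RACAPP =
      (DESCRIPTION =
        (LOAD_BALANCE = ON)
        (FAILOVER = ON)
        (ADDRESS = (PROTOCOL = TCP)(HOST = racnode1-vip)(PORT = 1521))
        (ADDRESS = (PROTOCOL = TCP)(HOST = racnode2-vip)(PORT = 1521))
        (CONNECT_DATA =
          (SERVER = DEDICATED)
          (SERVICE_NAME = SERV1)
        )
      )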


    thanks,
    charles
  • damorgan
    damorgan Member Posts: 14,464 Bronze Crown
    Assuming a 2 node cluster you ideally pin to a single node. In a three node cluster you might want to put reporting on one node and use the other two for batch jobs and processing. You really need to understand your application and some applications, as designed, will never scale on a RAC cluster. What you want to do is minimize the contention for blocks between nodes.
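
    One low-effort way to express that split is with services, along these lines (database, service, and instance names are invented, and the exact srvctl syntax may vary by version):

    # reporting preferred on node 1, batch preferred on nodes 2 and 3
    srvctl add service -d PROD -s REPORTING -r PROD1 -a PROD2
    srvctl add service -d PROD -s BATCH -r PROD2,PROD3 -a PROD1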

    With AQ that means separate queues on each node.
    With indexes it might (and I emphasize might) mean reverse key indexes.
    With other parts of the application it might mean additional creativity.
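
    For instance, the reverse key index idea mentioned above is just this (table and index names invented):

    -- reverses the key bytes so consecutive values land in different leaf blocks,
    -- spreading right-hand-edge index contention across instances
    CREATE INDEX orders_id_rix ON orders (order_id) REVERSE;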

    You have the best possible solution, hardware-wise, for RAC, with the possible exception of an IBM z10, and definitely the best hardware solution for just about anything else you could throw at it. If it does not scale, then you need someone to help you change the design and have it recoded accordingly.
This discussion has been closed.