3 Replies Latest reply: Mar 16, 2012 6:12 AM by Chrisjenkins-Oracle

    High iowait on replication

    598001
      I have 2 HP GL460 boxes (2 cores) running RHEL 5.x 64-bit. Each box hosts 2 databases with master-master replication configured; most of the load is on the first node. When the replication agent is started, iowait is very high (around 10%). For example:
      Cpu(s): 20.0%us, 1.5%sy, 0.0%ni, 70.2%id, 7.8%wa, 0.0%hi, 0.5%si, 0.0%st
      or
      Cpu(s): 3.7%us, 0.2%sy, 0.0%ni, 84.5%id, 11.5%wa, 0.0%hi, 0.1%si, 0.0%st

      Replication traffic is not very high (eth2 is the crosslink used for replication):
      [root@spb-lab-pcrf1 ~]# dstat -nf 10
      -net/bond0--net/bond0.1---net/eth0----net/eth1----net/eth2-
      recv send: recv send: recv send: recv send: recv send
      0 0 : 0 0 : 0 0 : 0 0 : 0 0
      244k 147k: 114k 145k: 122k 147k: 121k 0 : 65k 477k

      Another strange thing:
      I see a CPU usage peak on the first box when the second box performs a checkpoint. This leads to application timeouts.

      TimesTen Release 11.2.1.8.0 (64 bit Linux/x86_64) (timesten:53388) 2011-02-02T02:20:46Z

      [spr]
      Driver=/opt/TimesTen/timesten/lib/libtten.so
      DataStore=/var/TimesTen/timesten/spr/spr
      LogDir=/var/TimesTen/timesten/log/spr
      PermSize=4243
      TempSize=530
      CkptFrequency=36000
      CkptLogVolume=12729
      CkptRate=10
      PrivateCommands=1
      DatabaseCharacterSet=TIMESTEN8
      MemoryLock=4
      Preallocate=0
      Connections=150
      LockWait=1
      QueryThreshold=2
      #SQLQueryTimeout=2
      LogFileSize=1024
      LogBufMB=1024
      LogBufParallelism=8
      LogFlushMethod=1
      RecoveryThreads=4
      ReplicationParallelism=2
      ReplicationApplyOrdering=1
      PLSQL=0

      [session]
      Driver=/opt/TimesTen/timesten/lib/libtten.so
      DataStore=/var/TimesTen/timesten/session/session
      LogDir=/var/TimesTen/timesten/log/session
      PermSize=2870
      TempSize=358
      CkptFrequency=36000
      CkptLogVolume=8610
      CkptRate=10
      PrivateCommands=1
      DatabaseCharacterSet=TIMESTEN8
      MemoryLock=4
      Preallocate=0
      Connections=150
      LockWait=1
      QueryThreshold=2
      #SQLQueryTimeout=2
      LogFileSize=717
      LogBufMB=717
      LogBufParallelism=8
      LogFlushMethod=1
      RecoveryThreads=4
      ReplicationParallelism=2
      ReplicationApplyOrdering=1
      PLSQL=0
        • 1. Re: High iowait on replication
          Chrisjenkins-Oracle
          Hi Vladimir,

          High iowait is not something to be concerned about; it is simply idle CPU time during which there was at least one pending I/O request. It is not a very good indicator of, well, anything really. Do you actually see heavy I/O load on the disks during this period when you check with 'sar' or 'iostat'?
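To make the definition above concrete, here is a minimal sketch of how an iowait percentage can be derived from two samples of the aggregate `cpu` line in `/proc/stat` on Linux. The function names are hypothetical, and the sketch assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal). It only illustrates that iowait is a share of elapsed CPU time, not a measure of disk load.

```python
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def read_cpu_line(path="/proc/stat"):
    """Return the aggregate 'cpu' line (the first line) of /proc/stat."""
    with open(path) as f:
        return f.readline()


def cpu_percentages(before, after):
    """Percentage of CPU time spent in each state between two 'cpu' lines."""
    # Skip the leading 'cpu' label and keep only the fields named above.
    a = [int(x) for x in before.split()[1:1 + len(FIELDS)]]
    b = [int(x) for x in after.split()[1:1 + len(FIELDS)]]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {name: 100.0 * d / total for name, d in zip(FIELDS, deltas)}
```

To use it, call `read_cpu_line()` twice a few seconds apart and pass both lines to `cpu_percentages`. A non-zero `iowait` value only says the CPU was idle while some I/O was outstanding, so disk utilisation still has to be checked separately.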

          It is a little surprising that a checkpoint on one machine would impact the other but I have some questions/comments on your setup if I may:

          1. Your checkpointing parameters seem very strangely set. You only checkpoint every 10 hours or after many GB of log generation. How quickly do these datastores generate log data under normal load? You should be aiming to checkpoint every 5-10 minutes typically.

          2. How many concurrent connections do you typically have to each datastore under normal conditions (as reported by ttStatus)? Hopefully it is <<150...

          3. You say that these machines have just 2 cores but your setup (LogBufParallelism setting, number of connections, use of parallel replication) suggests a setup for a machine with a much higher number of cores...

          4. Are you using RETURN RECEIPT or RETURN TWOSAFE replication?
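The checkpoint settings mentioned in point 1 can be put into more readable units with a short script. This is only a sketch: it assumes CkptFrequency is expressed in seconds and CkptLogVolume and LogFileSize in megabytes (consistent with the "10 hours" reading above), and the function name `checkpoint_triggers` is made up for illustration. The values are copied from the DSN definitions in the original post.

```python
import configparser

# Checkpoint-related attributes copied from the two DSNs in the original post.
DSN_TEXT = """
[spr]
CkptFrequency=36000
CkptLogVolume=12729
LogFileSize=1024

[session]
CkptFrequency=36000
CkptLogVolume=8610
LogFileSize=717
"""


def checkpoint_triggers(dsn_text):
    """Summarise, per DSN, when a checkpoint would be triggered."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep attribute names case-sensitive
    parser.read_string(dsn_text)
    summary = {}
    for name in parser.sections():
        section = parser[name]
        freq_s = section.getint("CkptFrequency")   # assumed to be seconds
        vol_mb = section.getint("CkptLogVolume")   # assumed to be MB
        file_mb = section.getint("LogFileSize")    # assumed to be MB
        summary[name] = {
            "hours_between_timed_checkpoints": freq_s / 3600,
            "log_volume_gb": vol_mb / 1024,
            "log_files_per_volume_trigger": vol_mb / file_mb,
        }
    return summary


for dsn, info in checkpoint_triggers(DSN_TEXT).items():
    print(dsn, info)
```

Under those assumptions, both datastores checkpoint on a timer only every 10 hours, and the volume trigger fires after roughly 12 log files' worth of data.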

          Thanks,

          Chris
          • 2. Re: High iowait on replication
            598001
            1) We ran performance tests just today. After 7 hours of load, the session database wrote 361 log files (30 checkpoints), which is 1 checkpoint every 14 minutes.
            2) 39 and 38 connections. We do not use multithreading. Each process opens only one connection to each database.
            3) I was wrong. Each server has 2 CPUs with 4 cores each, 8 cores in total.
            4) We don't use RETURN RECEIPT or RETURN TWOSAFE
            I can publish charts with CPU load. I don't see high I/O load on disk.
            http://www.digilo.com/images/cpu.png - CPU chart. As you can see, iowait does not scale with load.
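The figures in point 1 can be cross-checked with a little arithmetic. The sketch below assumes that every log file is exactly LogFileSize MB (717 for the session datastore) and that checkpoints are triggered purely by CkptLogVolume (8610 MB). The function names are hypothetical.

```python
def checkpoint_stats(log_files, log_file_mb, ckpt_log_volume_mb, hours):
    """Derive log rate and checkpoint interval from a test run."""
    total_mb = log_files * log_file_mb            # total log volume written
    checkpoints = total_mb / ckpt_log_volume_mb   # volume-triggered checkpoints
    minutes = hours * 60
    return {
        "mb_per_minute": total_mb / minutes,
        "checkpoints": checkpoints,
        "minutes_per_checkpoint": minutes / checkpoints,
    }


def volume_for_interval(mb_per_minute, target_minutes):
    """Log volume (MB) that would trigger a checkpoint every target_minutes."""
    return mb_per_minute * target_minutes


stats = checkpoint_stats(log_files=361, log_file_mb=717,
                         ckpt_log_volume_mb=8610, hours=7)
print(stats)
print(volume_for_interval(stats["mb_per_minute"], 10))
```

With these inputs the numbers agree with the reported 30 checkpoints at about 14 minutes apart. The second function shows what log volume would correspond to a 10-minute interval at the same log rate.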

            Edited by: Vladimir Romanov on 15.03.2012 20:27
            • 3. Re: High iowait on replication
              Chrisjenkins-Oracle
              Okay. I don't see any reason to be concerned about iowait. It's a normal thing and doesn't tell us much about anything. The only 'issue' as far as I can tell is your statement that when a checkpoint occurs on the replication subscriber machine you see a CPU peak on the replication primary machine. This seems very unlikely, so can I ask whether you have checked for any correlation between the CPU peaks on the primary and other activity on the primary (for example, one of the datastores on the primary checkpointing)? A checkpoint is a fairly CPU-intensive operation and can impact the application if the system is not well balanced.
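One simple way to check for that kind of correlation is to line up the times of the CPU peaks against the start times of checkpoints on both machines. The sketch below is purely illustrative: the function name and the 60-second window are assumptions, and the timestamps would have to come from your own monitoring data and checkpoint history.

```python
from datetime import datetime, timedelta


def peaks_near_events(peaks, events, window_s=60):
    """Map each CPU peak time to the events that started within window_s of it.

    peaks  : list of datetime objects (when a CPU peak was observed)
    events : list of (label, datetime) pairs, e.g. checkpoint start times
    """
    window = timedelta(seconds=window_s)
    return {
        peak: [label for label, started in events if abs(peak - started) <= window]
        for peak in peaks
    }
```

A peak that maps to a local checkpoint on the primary would point at local activity, not at the subscriber's checkpoint, as the cause.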

              One concern I have is that you have two datastores apparently sharing the same filesystem for both checkpoints and log files. This is certainly not recommended. Can you tell me:

              1. What kind of disk storage are you using here? I'm hoping it is something like a RAID-10 stripe across several 15k rpm disks together with a cached hardware RAID controller and not just plain internal disks...

              2. Are the checkpoint and log files for both datastores really in the same filesystem as it appears from the DSN definitions?

              3. What kind of load (log files per hour) does the 'spr' datastore generate? Is this workload being run concurrently with the workload on the 'session' datastore during your tests?

              Thanks,

              Chris

              Edited by: ChrisJenkins on Mar 16, 2012 11:09 AM