11 Replies Latest reply: Jun 2, 2014 3:23 PM by Henk Vandenbergh-Oracle RSS

    RHEL 6.4 Gets HeartBeat error, erratic rates, and does not complete runs

    user2679908

      Running RHEL 5.9, 6,4, 6.4 OELUEK, Windows 2008 R2, and Windows 2012 R1.  Of these OSes, just RHEL 6.4 is giving an error.  All the other OSes are passing this test fine giving the expected output.  Some other facts are:

      • Using VDBench 5.03 (also tried 5.04 with same error)
      • Using jre1.7.0_21 (also tried IBM 1.5, Sun 1.6 versions with VDBench 5.03 & 5.04 which also did not complete)
      • This is happening for multipe machine types (Dell, IBM, HP).  All have RAID 1 with 2 disks with RAID array controller.

       

      Any feedback on what else to examine and ideas to resolve this will be appriecated.  Thanks, Sam

       

      The paramater file is

      * Data Drive OLTP, size 1m to 1g,  64k transfer size, 70/30 read/write
      fsd=fsd1,anchor=/data/perfdata/vdbench/DM-01,depth=1,width=1,files=12,size=(1m,25,100m,25,250m,25,1g,25),openflags=directio
      fsd=fsd2,anchor=/data/perfdata/vdbench/DM-02,depth=1,width=1,files=12,size=(1m,25,100m,25,250m,25,1g,25),openflags=directio
      fsd=fsd3,anchor=/data/perfdata/vdbench/DM-03,depth=1,width=1,files=12,size=(1m,25,100m,25,250m,25,1g,25),openflags=directio
      fsd=fsd4,anchor=/data/perfdata/vdbench/DM-04,depth=1,width=1,files=12,size=(1m,25,100m,25,250m,25,1g,25),openflags=directio

      fwd=fwd1,fsd=(fsd1,fsd2,fsd3,fsd4),xfersize=64k,fileselect=random,fileio=random,rdpct=70,threads=32
      rd=rd1,fwd=*,fwdrate=max,format=yes,pause=60,elapsed=930,interval=30
      rd=rd2,fwd=*,fwdrate=max,format=yes,pause=60,elapsed=930,interval=30
      rd=rd3,fwd=*,fwdrate=max,format=yes,pause=60,elapsed=930,interval=30
      rd=rd4,fwd=*,fwdrate=max,format=yes,pause=60,elapsed=930,interval=30
      rd=rd5,fwd=*,fwdrate=max,format=yes,pause=60,elapsed=930,interval=30
      rd=rd6,fwd=*,fwdrate=max,format=yes,pause=60,elapsed=930,interval=30
      rd=rd7,fwd=*,fwdrate=max,format=yes,pause=60,elapsed=930,interval=30

       

      Typical output of the problem is (trimming lines to be more readiable)

      15:01:53.000 Starting RD=rd1; elapsed=930; fwdrate=max. For loops: None

      Mar 11, 2014 .Interval. .ReqstdOps.. ...cpu%...  read ....read.... ...write.... ..mb/sec... mb/sec .xfer..
                               rate  resp total  sys   pct   rate  resp   rate  resp  read write  total    size 

      15:02:23.440          1  784.4 41.03   0.5 0.10  69.6  545.9 54.01  238.6 11.33 34.12 14.91  49.03   65536  

      15:02:53.057          2  662.2 47.70   0.6 0.11  69.7  461.5 57.86  200.7 24.32 28.85 12.54  41.38   65529  

      15:03:23.056          3  671.8 47.46   0.6 0.11  69.8  468.7 58.67  203.1 21.60 29.29 12.70  41.99   65536  

      15:04:32.551          4  704.1 45.57   0.1 0.09  70.2  494.3 59.88  209.8 11.85 30.89 13.11  44.01   65536  

      15:04:32.570          5  693.1 46.18   0.1 0.08  70.2  486.3 59.45  206.8 14.96 30.40 12.93  43.32   65536  

      15:04:53.056          6  590.6 54.09   0.6 0.10  70.0  413.2 67.89  177.4 21.95 25.82 11.09  36.91   65536  

      15:05:23.048          7  712.5 44.83   0.2 0.11  69.6  496.2 60.53  216.3  8.79 31.01 13.52  44.53   65536  

      15:05:53.053          8  535.4 60.08   0.5 0.09  69.8  373.9 73.32  161.5 29.42 23.37 10.09  33.46   65536  

      15:06:23.053          9  571.4 55.76   0.6 0.10  69.5  397.4 67.74  174.0 28.39 24.84 10.88  35.71   65536  

      15:06:53.047         10  692.6 45.40   0.6 0.12  70.7  490.0 57.72  202.6 15.61 30.63 12.66  43.29   65536  

      15:07:23.053         11  561.2 57.33   0.1 0.08  70.4  395.0 71.35  166.2 24.02 24.69 10.39  35.08   65536  

      15:07:53.052         12  660.1 48.48   0.6 0.11  70.1  462.5 61.65  197.7 17.68 28.90 12.35  41.26   65536  

      15:08:23.048         13  576.2 55.33   0.5 0.10  70.1  403.7 67.12  172.5 27.75 25.23 10.78  36.01   65536  

      15:08:53.053         14  622.8 51.55   0.1 0.09  69.9  435.4 67.91  187.3 13.52 27.21 11.71  38.92   65536  

      15:09:23.052         15  712.6 44.16   0.1 0.09  70.3  500.6 58.00  212.0 11.47 31.29 13.25  44.54   65536  

      15:10:03.646         16  673.6 47.23   0.1 0.09  70.2  472.9 60.95  200.7 14.91 29.56 12.54  42.10   65536  

      15:10:23.052         17  561.4 47.18   0.6 0.11  70.6  396.2 59.54  165.2 17.51 24.76 10.32  35.09   65536  

      15:10:53.052         18  730.7 43.95   0.1 0.09  69.9  510.8 57.78  219.9 11.85 31.92 13.75  45.67   65536  

      15:14:37.119 HeartBeat.checkHeartBeat(): slave localhost-0 has not responded for 185 seconds.
      15:17:44.961 Start/end command: executing '/banktools/perfbench/vdbench504/vdbench jstack'
      15:17:44.962 execute(): /banktools/perfbench/vdbench504/vdbench jstack
      15:17:44.977         19  707.8 45.18   0.1 0.08  69.6  492.3 62.26  215.5  6.15 30.77 13.47  44.24   65536   0.0  0.00   0.0  0.00   0.0  0.00   0.5  0.02   0.5  0.02   0.0  0.00   0.0  0.00   0.0  0.00

      15:17:45.207 tg: java.lang.ThreadGroup[name=main,maxpri=10]

      15:17:45.207 tg.get_name: main

      15:17:45.207 tg.activeCount: 7

      java.lang.ThreadGroup[name=main,maxpri=10]

          Thread[Vdbmain,5,main]

          Thread[SlaveStarter localhost-10-140311-14.58.51.646,5,main]

          Thread[Get_cmd_stream stderr /banktools/perfbench/vdbench504/vdbench SlaveJvm -m localhost -n localhost-10-140311-14.58.51.646 -l localhost-0 -p 5570,5,main]

          Thread[Get_cmd_stream stdout /banktools/perfbench/vdbench504/vdbench SlaveJvm -m localhost -n localhost-10-140311-14.58.51.646 -l localhost-0 -p 5570,5,main]

          Thread[SlaveOnMaster,10,main]

          Thread[Slave HeartBeat,5,main]

          Thread[Reporter,10,main]

      15:17:45.208 tl            [Ljava.lang.Thread;@58093139

      15:17:45.208 tl[]:         Thread[Vdbmain,5,main]

      15:17:45.208 tl[].isalive: true

      15:17:45.208 Threadname:   Vdbmain

      15:17:45.209 dumpOneStack: java.lang.Thread -2

      15:17:45.209 dumpOneStack: java.lang.Thread 340

      15:17:45.209 dumpOneStack: Vdb.common 383

      15:17:45.209 dumpOneStack: Vdb.common 367

      15:17:45.209 dumpOneStack: Vdb.Vdbmain 784

      15:17:45.209 dumpOneStack: Vdb.Vdbmain 572

      15:17:45.209 tl            [Ljava.lang.Thread;@58093139

      15:17:45.209 tl[]:         Thread[SlaveStarter localhost-10-140311-14.58.51.646,5,main]

      15:17:45.209 tl[].isalive: true

      15:17:45.209 Threadname:   SlaveStarter localhost-10-140311-14.58.51.646

      15:17:45.209 dumpOneStack: java.lang.Object -2

      15:17:45.209 dumpOneStack: java.lang.Object 503

      15:17:45.209 dumpOneStack: java.lang.UNIXProcess 210

      15:17:45.209 dumpOneStack: Utils.OS_cmd 304

      15:17:45.209 dumpOneStack: Utils.OS_cmd 234

      15:17:45.209 dumpOneStack: Vdb.SlaveStarter 208

      15:17:45.209 dumpOneStack: Vdb.SlaveStarter 157

      15:17:45.210 dumpOneStack: Vdb.SlaveStarter 50

      15:17:45.210 tl            [Ljava.lang.Thread;@58093139

      15:17:45.210 tl[]:         Thread[Get_cmd_stream stderr /banktools/perfbench/vdbench504/vdbench SlaveJvm -m localhost -n localhost-10-140311-14.58.51.646 -l localhost-0 -p 5570,5,main]

      15:17:45.210 tl[].isalive: true

      15:17:45.210 Threadname:   Get_cmd_stream stderr /banktools/perfbench/vdbench504/vdbench SlaveJvm -m localhost -n localhost-10-140311-14.58.51.646 -l localhost-0 -p 5570

      15:17:45.210 dumpOneStack: java.io.FileInputStream -2

      15:17:45.210 dumpOneStack: java.io.FileInputStream 272

      15:17:45.210 dumpOneStack: java.io.BufferedInputStream 273

      15:17:45.210 dumpOneStack: java.io.BufferedInputStream 334

      15:17:45.210 dumpOneStack: sun.nio.cs.StreamDecoder 283

      15:17:45.210 dumpOneStack: sun.nio.cs.StreamDecoder 325

      15:17:45.210 dumpOneStack: sun.nio.cs.StreamDecoder 177

      15:17:45.211 dumpOneStack: java.io.InputStreamReader 184

      15:17:45.211 dumpOneStack: java.io.BufferedReader 154

      15:17:45.211 dumpOneStack: java.io.BufferedReader 317

      15:17:45.211 dumpOneStack: java.io.BufferedReader 382

      15:17:45.211 dumpOneStack: Utils.Get_cmd_stream 640

      15:17:45.211 tl            [Ljava.lang.Thread;@58093139

      15:17:45.211 tl[]:         Thread[Get_cmd_stream stdout /banktools/perfbench/vdbench504/vdbench SlaveJvm -m localhost -n localhost-10-140311-14.58.51.646 -l localhost-0 -p 5570,5,main]

      15:17:45.211 tl[].isalive: true

      15:17:45.211 Threadname:   Get_cmd_stream stdout /banktools/perfbench/vdbench504/vdbench SlaveJvm -m localhost -n localhost-10-140311-14.58.51.646 -l localhost-0 -p 5570

      15:17:45.211 dumpOneStack: java.nio.charset.CoderResult 120

      15:17:45.211 dumpOneStack: sun.nio.cs.StreamDecoder 318

      15:17:45.211 dumpOneStack: sun.nio.cs.StreamDecoder 177

      15:17:45.211 dumpOneStack: java.io.InputStreamReader 184

      15:17:45.212 dumpOneStack: java.io.BufferedReader 154

      15:17:45.212 dumpOneStack: java.io.BufferedReader 317

      15:17:45.212 dumpOneStack: java.io.BufferedReader 382

      15:17:45.212 dumpOneStack: Utils.Get_cmd_stream 640

      15:17:45.212 tl            [Ljava.lang.Thread;@58093139

      15:17:45.212 tl[]:         Thread[SlaveOnMaster,10,main]

      15:17:45.212 tl[].isalive: true

      15:17:45.212 Threadname:   SlaveOnMaster

      15:17:45.212 dumpOneStack: sun.nio.ch.FileDispatcherImpl -2

      15:17:45.212 dumpOneStack: sun.nio.ch.FileDispatcherImpl 60

      15:17:45.212 dumpOneStack: sun.nio.ch.IOUtil 93

      15:17:45.212 dumpOneStack: sun.nio.ch.IOUtil 65

      15:17:45.212 dumpOneStack: sun.nio.ch.FileChannelImpl 205

      15:17:45.212 dumpOneStack: Utils.Bin 759

      15:17:45.212 dumpOneStack: Utils.Bin 853

      15:17:45.212 dumpOneStack: Vdb.FwdStats 668

      15:17:45.212 dumpOneStack: Vdb.FwdReport 92

      15:17:45.213 dumpOneStack: Vdb.CollectSlaveStats 415

      15:17:45.213 dumpOneStack: Vdb.CollectSlaveStats 320

      15:17:45.213 dumpOneStack: Vdb.SlaveOnMaster 113

      15:17:45.213 dumpOneStack: Vdb.SlaveOnMaster 42

      15:17:45.213 tl            [Ljava.lang.Thread;@58093139

      15:17:45.213 tl[]:         Thread[Slave HeartBeat,5,main]

      15:17:45.213 tl[].isalive: true

      15:17:45.213 Threadname:   Slave HeartBeat

      15:17:45.213 dumpOneStack: java.lang.Thread 1588

      15:17:45.213 dumpOneStack: Vdb.common 1633

      15:17:45.213 dumpOneStack: Vdb.common 1627

      15:17:45.213 dumpOneStack: Vdb.HeartBeat 78

      15:17:45.213 tl            [Ljava.lang.Thread;@58093139

      15:17:45.213 tl[]:         Thread[Reporter,10,main]

      15:17:45.213 tl[].isalive: true

      15:17:45.213 Threadname:   Reporter

      15:17:45.214 dumpOneStack: java.lang.Object -2

      15:17:45.214 dumpOneStack: Utils.Semaphore 168

      15:17:45.214 dumpOneStack: Vdb.CollectSlaveStats 489

      15:17:45.214 dumpOneStack: Vdb.Reporter 189

      15:17:45.214

      15:17:45.214 Heartbeat monitor: One or more slaves did not respond

      15:17:45.214

      java.lang.RuntimeException: Heartbeat monitor: One or more slaves did not respond

              at Vdb.common.failure(common.java:308)

              at Vdb.HeartBeat.run(HeartBeat.java:79)

        • 1. Re: RHEL 6.4 Gets HeartBeat error, erratic rates, and does not complete runs
          Henk Vandenbergh-Oracle

          My first reaction to heartbeat timeouts is usually "memory filling up, and the OS falling apart trying to free memory which then is immediately used again for the next read or write.". But you DID specify openflags=directio, which on Linux adds 0x004000 to the open() call.

          I must say though that, because of the lack of useful documentation on all the open flags in both the Solaris and Linux world that I am never sure how those flags really work and if they always work the same way.

          So, to see if we can get further with this I suggest the following:

          - add 'fsd=default,workingsetsize=xxx'    where xxx equals half of your current memory size divided by four (the # of FSDs).

          This will limit the amount of total file space that Vdbench uses.

          If it then still blows up we know it is not memory related.

          - an other test: while the above or the unchanged test is running, keep track of the amount of memory used. If memory is indeed overused then 'directio' does not work as expected (by me).

          - Try one of the other openflags.

           

          Henk

          • 2. Re: RHEL 6.4 Gets HeartBeat error, erratic rates, and does not complete runs
            user2679908

            Thanks for the tips. This is what has been done.

             

            1) Ran adding fsd=defaulf,workingsetsize=16 (since 64 GB machine and 4 FSDs), which failed.

            2) Removing the openflag=directio, it ran in spurts (faster and slower) which ran longer but then failed.

            3) Did a run, origial options, with vmstat & iostat reporting every 5 seconds. In that noticed an increase in CPU an decrease in IO rate until VDBench gave an error.

             

            I was unable to upload images with Excel graphs, so have the summary below.  Thanks for looking at this.

             

            CPU sda dm-7
            RemarksTime%user%nice%system%iowait%steal%idle TimetpsMB_read/sMB_wrtn/sMB_readMB_wrtn TimetpsMB_read/sMB_wrtn/sMB_readMB_wrtn
            5:52:502.3800.170097.45 5:52:502.200.0200 5:52:5000000
            VDBench Start5:52:551.300.640.26097.8 5:52:55433.6042.990214 5:52:55553.8043.90219
            Starting
              RD=format_for_rd1
            5:53:000.0500.1418.04081.76 5:53:0010140113.220566 5:53:001310.80113.080565
            5:53:050.0900.1317.52082.26 5:53:05994.40111.40557 5:53:051294.60111.40557
            5:53:100.0300.1616.1083.72 5:53:109920112.060560 5:53:1012590112.020560
            5:53:150.0100.1416.71083.13 5:53:15987.40111.610558 5:53:151252.20111.440557
            5:53:200.0700.1416.34083.44 5:53:2010060111.540557 5:53:201297.20111.320556
            5:53:252.800.2914.65082.26 5:53:25986.40112.870564 5:53:251265.60112.70563
            5:53:300.0200.1314.72085.14 5:53:30779.2084.180420 5:53:301024.8084.240421
            5:53:350.0300.1113.13086.73 5:53:35687077.480387 5:53:35907.8077.470387
            5:53:400.0300.1313.49086.35 5:53:40726.2083.770418 5:53:40946.2083.760418
            5:53:450.0300.1511.37088.45 5:53:459300102.990514 5:53:451205.40102.920514
            5:53:500.0500.1211.45088.38 5:53:50653.2076.020380 5:53:50873.4076.010380
            5:53:550.5400.1514085.31 5:53:55686.4076.670383 5:53:55884.4076.540382
            5:54:002.9900.2815.18081.54 5:54:00751.2085.880429 5:54:00985085.710428
            5:54:050.9600.1313.88085.03 5:54:05800.2085.450427 5:54:05997.4084.880424
            5:54:100.5600.1613.28086 5:54:10925.8088.160440 5:54:101015.6086.020430
            5:54:150.0200.1110.02089.86 5:54:15854.4073.460367 5:54:151010.4072.840364
            5:54:200.0200.138.18091.67 5:54:20848.8095.290476 5:54:201067.4095.260476
            5:54:250.1400.118.26091.49 5:54:25506.6056.350281 5:54:25647.2056.340281
            5:54:300.0200.137.78092.08 5:54:30740082.90414 5:54:30925.4082.80413
            5:54:352.3500.268.1089.29 5:54:35943.20106.950534 5:54:351205.40106.930534
            5:54:400.3600.1810.88088.57 5:54:40898.60100.750503 5:54:401145.20100.630503
            5:54:450.0300.169.81090 5:54:45856.8096.830484 5:54:451093.4096.80484
            5:54:500.0600.139.82089.99 5:54:50980.4092.570462 5:54:501206.8092.520462
            5:54:550.2300.128.68090.97 5:54:55706.4079.350396 5:54:55901079.340396
            5:55:000.0200.137.34092.51 5:55:00966.40109.260546 5:55:0012250109.250546
            5:55:050.0400.219.38090.36 5:55:059310104.740523 5:55:051185.40104.680523
            5:55:102.1200.269.21088.42 5:55:10922.20103.490517 5:55:1011760103.380516
            5:55:150.5600.169.37089.91 5:55:15852.8095.870479 5:55:151081.8095.650478
            5:55:200.0300.169.42090.4 5:55:20872098.110490 5:55:201102098.050490
            5:55:250.1800.148.99090.69 5:55:25927.2084.230421 5:55:251137.8084.160420
            5:55:300.0300.119.29090.57 5:55:30879.8099.340496 5:55:301122.8099.350496
            5:55:350.0200.148.05091.8 5:55:35878098.930494 5:55:351116.8098.90494
            5:55:400.0500.137.43092.39 5:55:40922.40103.690518 5:55:401170.60103.660518
            5:55:451.8900.227.36090.53 5:55:45737.6083.70418 5:55:45945.6083.680418
            5:55:500.8100.187.64091.37 5:55:50885098.970494 5:55:501122.6098.840494
            5:55:550.0700.136.73093.07 5:55:55964.6091.490457 5:55:551185.2091.220456
            Starting RD=rd15:56:000.2400.173.5096.09 5:56:00718.4080.520402 5:56:00902080.220401
            5:56:050.6400.143.79095.43 5:56:05101641.7318.620893 5:56:051170.642.1218.5921092
            5:56:100.0900.096.04093.78 5:56:1088437.0715.8718579 5:56:10990.237.0615.8718579
            5:56:150.0600.068.27091.61 5:56:15517.422.269.0711145 5:56:15587.222.169.1511045
            5:56:201.6800.158.39089.78 5:56:20558.222.6910.9111354 5:56:20643.622.710.8811354
            5:56:250.9900.138.92089.96 5:56:2559724.4111.3612256 5:56:25680.424.5111.1912255
            5:56:300.0300.117.75092.12 5:56:30842.835.7215.2117876 5:56:30956.635.7115.317876
            5:56:350.0600.0812.53087.34 5:56:35513.422.468.6211243 5:56:35592.622.388.6711143
            5:56:400.0300.0815.57084.33 5:56:40492.820.868.9110444 5:56:40563.820.98.8510444
            5:56:450.0300.0815.54084.36 5:56:45554.223.949.4411947 5:56:45662.623.999.3811946
            5:56:500.0400.0617.85082.05 5:56:5046119.578.379741 5:56:5050419.568.389741
            5:56:550.0200.0919.42080.47 5:56:55681.228.4612.4814262 5:56:55791.228.4612.414262
            5:57:000.0200.120.57079.32 5:57:00101142.518.6921293 5:57:001147.642.5118.6821293
            5:57:050.0300.120.53079.34 5:57:05801.433.7814.7916873 5:57:05897.233.7314.816874
            5:57:100.0300.1120.13079.74 5:57:10776.632.7613.9216369 5:57:10909.632.7913.916369
            5:57:150.0300.0820.86079.03 5:57:15681.829.2512.0914660 5:57:15757.429.212.1214660
            5:57:200.0200.0620.9079.01 5:57:2053022.699.3911346 5:57:20588.822.549.3811246
            Last VDBench5:57:250.0300.0419.81080.12 5:57:25265.210.95.015425 5:57:25315.410.855425
            normal reporting5:57:301.7800.1915.71082.32 5:57:30766.430.3313.6115168 5:57:3083430.6413.2315366
            5:57:350.9400.1815.85083.03 5:57:35928.639.216.6819683 5:57:35107739.1916.6719583
            5:57:400.0400.0916.08083.79 5:57:40810.233.9414.5316972 5:57:40971.833.9514.516972
            5:57:450.0200.0915.95083.94 5:57:4577032.3614.2516171 5:57:45873.632.3814.2416171
            5:57:500.0600.116.22083.62 5:57:50739.831.513.2315766 5:57:5083131.4713.2415766
            5:57:550.0200.0916.16083.73 5:57:55721.430.6912.815363 5:57:55823.230.712.7815363
            5:58:000.0300.0921.39078.49 5:58:00735.631.4113.0715765 5:58:00836.431.3713.1215665
            5:58:050.0300.0924.78075.1 5:58:05717.630.6112.6215363 5:58:0581630.6412.5815362
            5:58:100.0200.0924.83075.05 5:58:10683.628.8512.1714460 5:58:10776.628.8112.1914460
            5:58:150.0100.0924.57075.32 5:58:15629.72611.9313059 5:58:15729.9425.9611.9913060
            5:58:200.0200.0925.13074.76 5:58:20712.229.713.1814865 5:58:2081529.7613.0914865
            5:58:250.0300.125.43074.45 5:58:25734.831.2312.9815664 5:58:25849.631.2512.9815664
            5:58:300.0200.0926.06073.84 5:58:30724.630.7812.6815363 5:58:30832.430.7612.6815363
            5:58:350.0300.0926.47073.41 5:58:35688.229.0112.214561 5:58:35801.228.9912.214461
            5:58:400.0300.0825.42074.47 5:58:4063726.6211.913359 5:58:40712.826.5611.913259
            5:58:450.0100.0925.75074.14 5:58:45722.430.6512.9215364 5:58:4583230.6412.915364
            5:58:500.0500.0827.15072.73 5:58:50646.627.211.9813559 5:58:50717.627.2511.9813659
            5:58:550.0100.0825.38074.53 5:58:55617.426.1211.3713056 5:58:55704.226.1111.3813056
            5:59:000.0300.0725.34074.57 5:59:00637.827.6810.8913854 5:59:00728.627.6610.913854
            5:59:050.0200.0826.5073.4 5:59:0562826.1211.6713058 5:59:05715.426.0911.6513058
            5:59:100.0300.0828.28071.62 5:59:10630.226.7511.313356 5:59:10722.626.7711.313356
            5:59:150.0200.0928.84071.05 5:59:1557424.1710.3212051 5:59:15651.224.1510.3212051
            5:59:200.0200.0828.77071.13 5:59:20654.427.0612.3713561 5:59:20753.227.0612.3713561
            5:59:250.0200.0829.62070.28 5:59:25620.825.8711.2712956 5:59:25728.625.8711.2512956
            5:59:300.0100.0629.17070.76 5:59:3060625.7610.8812854 5:59:30698.625.7510.8812854
            5:59:350.0200.0928.56071.34 5:59:35654.227.4811.7413758 5:59:35755.827.4611.7513758
            5:59:400.0200.0728.65071.27 5:59:40548.823.149.9411549 5:59:4063123.169.9311549
            5:59:450.0100.0828.47071.44 5:59:45628.226.6511.2313356 5:59:45715.826.5711.2313256
            5:59:500.0400.0627.92071.98 5:59:50357.615.326.37631 5:59:50392.615.266.37631
            Test Failed5:59:550.8100.1724.11074.91 5:59:5553922.069.3311046 5:59:55595.822.328.8511144
            6:00:002.1600.219.46088.17 6:00:00924.239.515.7719778 6:00:001047.639.4215.7419778
            6:00:050.1100.151.41098.32 6:00:0568.42.561.03125 6:00:05572.240.86114
            6:00:10000.010.01099.99 6:00:102.400.0300 6:00:101.20000
            6:00:15000.010099.99 6:00:150.40000 6:00:1500000
            6:00:20000.010099.99 6:00:200.20000 6:00:200.80000
            • 3. Re: RHEL 6.4 Gets HeartBeat error, erratic rates, and does not complete runs
              Henk Vandenbergh-Oracle

              64gb machine, four FSDs having workingsetsize=16g still does add up to 64gb.

              Did vmstat data show any impact on memory usage?

              • 4. Re: RHEL 6.4 Gets HeartBeat error, erratic rates, and does not complete runs
                user2679908

                This is the vmstat info.  I'm new to Linux and asking the Linux staff here to give additional monitoring ideas.  Is there a way to get VDBench to give extra messages from the Slaves about what is happening as they run?   Thanks

                 

                ProcMemorySwapIOSystemCPU
                CommentDateTimerbswpdfreebuffcachesisobiboincsussyidwast
                3/24/201417:52:4200019563881618310870090800485801009810
                3/24/201417:52:471001956287361831167009080002220011348109900
                3/24/201417:52:52000195629472183124700908000221647512109900
                VDBench Start3/24/201417:52:5703201954597761837687008120008787630774643119170
                Starting
                  RD=format_for_rd1
                3/24/201417:53:022320195462624184376700760001114122162020400082180
                3/24/201417:53:071310195464064184924700840000115909162521090083170
                3/24/201417:53:120320195465232185532700740000113541159724500083160
                3/24/201417:53:170320195465984186068700820000114466165721150083170
                3/24/201417:53:221310195455776186592700792000114923323433221083160
                3/24/201417:53:270280195455216187128700844000111300460643102083140
                3/24/201417:53:32129019545651218757670082000081790212833990086140
                3/24/201417:53:37029019545822418796870085200082519205834250086140
                3/24/201417:53:42130019545910418842070078400090716216235970087130
                3/24/201417:53:47029019546062418886470084400090209218739410089100
                3/24/201417:53:52130019546193618928470079600081954207933890088120
                3/24/201417:53:57231019545356818970470084400074933359644051083150
                3/24/201417:54:02031019545080019019670106400092091493942092083140
                3/24/201417:54:07029019545236819068070085200088066328137631085130
                3/24/201417:54:12014019545121619113270081600089374254735370087130
                3/24/201417:54:1701201954529121915247008440007849124233433009190
                3/24/201417:54:2201201954541921920127008360009747325813986009190
                3/24/201417:54:2711201954535041922807008480005255520363065009280
                3/24/201417:54:3211201954483201927447008320008975130944262109180
                3/24/201417:54:37012019544449619328870084800011004853054492208890
                3/24/201417:54:420120195444928193808700844000103974314140540089100
                3/24/201417:54:47012019544886419430870084000099142279038720090100
                3/24/201417:54:52012019545027219478070080400097365252037490090100
                3/24/201417:54:5701201954403521952047008560008391620493851009280
                3/24/201417:55:02012019543944019580470078000011453124754052009180
                3/24/201417:55:07112019543526419633270086000010697026024531009090
                3/24/201417:55:1201201954334561968367008600009926147584256208890
                3/24/201417:55:17012019543462419733270085600098869237138210090100
                3/24/201417:55:2211201954367041978207008160009795425603979009190
                3/24/201417:55:2701201954332641982687008640008950225124073009090
                3/24/201417:55:3211201954398241987727008400009944620773701009190
                3/24/201417:55:37012019544000019929670084000010600621593922009370
                3/24/201417:55:4211201954372961996927008080007957519803457009190
                3/24/201417:55:47112019541683220026070086400011277854815009309160
                3/24/201417:55:520801954303842007287008640009597121133647009280
                3/24/201417:55:5708019543280020120470086000010118429174119009460
                Starting RD=rd13/24/201417:56:02032019540107220140870098400106374357030635792109630
                3/24/201417:56:07032019540432020144070098400499462179125763062009540
                3/24/201417:56:12032019540502420145670098400288001216314611756009280
                3/24/201417:56:17032019540587220146470098400230021028212251516009280
                3/24/201417:56:22032019539812820148870098400230781087541932767308980
                3/24/201417:56:27131019539976020150870098400331651461116002066009280
                3/24/201417:56:32033019540171220152470098400282751220520692258009190
                3/24/201417:56:37037019540352020154470098400223238733115018840086140
                3/24/201417:56:42036019540454420155670098400229509482117014590084160
                3/24/201417:56:47035019540632020157270098400241799886121415350084160
                3/24/201417:56:52037019540745620158870098400196488662102412730081190
                3/24/201417:56:571370195408208201596700984003755516389175122670080200
                3/24/201417:57:020380195408368201604700984004108818053189024420079200
                3/24/201417:57:070350195408704201612700984003155213688153319510079210
                3/24/201417:57:120350195408832201620700984003343413598156219610080200
                3/24/201417:57:170270195408896201628700984002917112229134917050079210
                3/24/201417:57:2201401954089922016327009840018688846698112270079210
                Last VDBench3/24/201417:57:2709019540828820164070098400830733715667990080200
                normal reporting3/24/201417:57:321340195396800201668700984004335418856575747353084130
                3/24/201417:57:370340195399456201688700984003822117082145823150083170
                3/24/201417:57:421340195401920201700700984003328013985130620710084160
                3/24/201417:57:470340195404752201708700984003349814136128620650084160
                3/24/201417:57:521340195406048201720700984003137313423124219200084160
                3/24/201417:57:570340195407168201728700984003274213458121819790084160
                3/24/201417:58:020400195407648201736700984003109113067119819380075250
                3/24/201417:58:070400195408224201752700984003189813219121119560075250
                3/24/201417:58:120400195408320201768700984002766112270108417800075250
                3/24/201417:58:170400195408768201776700984002793012395115118480075250
                3/24/201417:58:220400195406400201784700984002924812843115819190074260
                3/24/201417:58:270400195406496201792700984003384314302128221190075250
                3/24/201417:58:320390195400928201800700984003061812370115918900073270
                3/24/201417:58:370300195401040201808700984002873612114112918760074250
                3/24/201417:58:420340195385168201816700984002684212370104618620074260
                3/24/201417:58:470350195385392201824700984003306213830127921580074260
                3/24/201417:58:520330195369136201832700984002516511128102616740074260
                3/24/201417:58:570350195369216201840700984002895411806112218400075250
                3/24/201417:59:02035019535297620184870098400267141114199517840074260
                3/24/201417:59:070330195353040201856700984002857012651111619410072280
                3/24/201417:59:120330195342816201864700984002410210386102317270072280
                3/24/201417:59:170320195342704201872700984002656011179101817540071290
                3/24/201417:59:220320195340768201880700984002883813406113419350071290
                3/24/201417:59:27031019534088020188870098400259461071899516990070290
                3/24/201417:59:320310195336816201896700984002653411614101717360071290
                3/24/201417:59:370310195336656201904700984002618911102101017110071290
                3/24/201417:59:4203101953366722019127009840024026996392315710071290
                3/24/201417:59:47025019533673620191670098400240381100099015050072280
                3/24/201417:59:5202001953367042019247009840016128579073512800072280
                Test fails3/24/201417:59:571320195481424201952701004003383014325435041202081170
                3/24/201418:00:022001956042882019927010800025216968626232485109170
                3/24/201418:00:07000195601600202016701004000383332620010000
                3/24/201418:00:1210019560504020202470100400041021340010000
                3/24/201418:00:1700019560894420203270100400041051380010000
                • 5. Re: RHEL 6.4 Gets HeartBeat error, erratic rates, and does not complete runs
                  Henk Vandenbergh-Oracle

                  I am out of options.

                  The only thing to try:

                  -make sure you are running vdbench50401 from a JDK/SDK

                  - then, when you notice that vdbench is no longer starting i/o, run ./vdbench jstack and send me the output.

                   

                  This will give me the active execution stacks of all running java JVMs.

                   

                  Henk.

                  • 6. Re: RHEL 6.4 Gets HeartBeat error, erratic rates, and does not complete runs
                    user2679908

                    Using vdbench50401 have the Jstack traces.  I'm not able to add files to this post (says not authorized) and the bank is very restrictive on getting information out so I'm checking on options to get you the zip files.  If can email me at  xxxx I'll reply with the ZIP files.  Otherwise if Oracle has an FTP site to save them, I can try to deliver that way.

                     

                    Some more background

                    1) See the same issue with RHEL 6.5.

                    2) Don't see the issue with OEL 6.5

                    I'll be testing RHEL 5.10 soon to see if the issue started with that version.  Thanks, Sam

                    • 7. Re: RHEL 6.4 Gets HeartBeat error, erratic rates, and does not complete runs
                      user2679908

                      Some more information:

                      1) RHEL 5.10 does not have the issue (OEL 6.4 & OEL 6.5 are also fine).  Is lmited to RHEL 6.4 & RHEL 6.5.

                      2) OS is unresponsive as the issue occurs, seen both in delays in reporting and from other consoles that do not reply back for 5+ minutes for basic commands.

                      3) Backing down the fwdrate to lower levels this will run,  Near the max rate possible, get ocassional delayed reporting without failures.

                       

                      Henk has given many great things to try.  I'll pursue some additional things are report back to ths forum.

                      1) Base RHEL 6.4 build (no company specific add ons) to confirm is issue is base build or possibly the add ons.

                      2) Try to duplicate the IO load with another tool under RHEL 6.4

                      • 8. Re: RHEL 6.4 Gets HeartBeat error, erratic rates, and does not complete runs
                        user2679908

                        Did a build using RedHat RHEL 6.5 ISO and got the same problem with no bank customizations.  Created RedHat Issue 01094838 to examine this.   Mabye the new messages below give a new clue.  Thanks, Sam

                         

                        Do see one new message type with RedHat build that did not see at the console before.  These are some exampels.

                        14:44:13.523 /var/log/messages: May 15 14:31:39 cr3conlabc160 kernel: INFO: task jbd2/dm-0-8:952 blocked for more than 120 seconds.

                        14:44:13.523 /var/log/messages: May 15 14:31:39 cr3conlabc160 kernel:      Not tainted 2.6.32-431.el6.x86_64 #1

                           . . .

                        14:44:13.529 /var/log/messages: May 15 14:31:39 cr3conlabc160 kernel: INFO: task master:7698 blocked for more than 120 seconds.

                        14:44:13.529 /var/log/messages: May 15 14:31:39 cr3conlabc160 kernel:      Not tainted 2.6.32-431.el6.x86_64 #1

                        . . .

                        14:44:13.531 /var/log/messages: May 15 14:31:39 cr3conlabc160 kernel: INFO: task qmgr:7705 blocked for more than 120 seconds.

                        14:44:13.531 /var/log/messages: May 15 14:31:39 cr3conlabc160 kernel:      Not tainted 2.6.32-431.el6.x86_64 #1

                          . . .

                        14:44:13.536 /var/log/messages: May 15 14:31:39 cr3conlabc160 kernel: INFO: task java:8595 blocked for more than 120 seconds.

                        14:44:13.536 /var/log/messages: May 15 14:31:39 cr3conlabc160 kernel:      Not tainted 2.6.32-431.el6.x86_64 #1

                        • 9. Re: RHEL 6.4 Gets HeartBeat error, erratic rates, and does not complete runs
                          Henk Vandenbergh-Oracle

                          It appears to me that it is not only Vdbench complaining about something being stuck.

                          I still have no clue as to what is happening though.

                          • 10. Re: RHEL 6.4 Gets HeartBeat error, erratic rates, and does not complete runs
                            user2679908

                            RedHat analyzed the issue and identified Bug 982069 with RHEL 6.x versions as the problem.  One work around is to swtich from "cfq" to "deadline" IO scheduling for the local device.

                            Further, in checking OEL 6.4 and OEL 6.5, they default to "deadline" scheduling for IO and avoid Bug 982069.  This explains why OEL versions worked and RHEL ones did not for this specific VDBench test.

                             

                            Thanks for your help on this and getting pointed towards finding a root cause.  Sam

                            • 11. Re: RHEL 6.4 Gets HeartBeat error, erratic rates, and does not complete runs
                              Henk Vandenbergh-Oracle

                              An update here.

                              In a lab environment crazy things happen. Crazy things as in "50 second disk response time when writing detailed Vdbench statistics".

                              If you have several of those, and possibly longer ( I did not see any), that quickly adds up to a possible three-minute heartbeat timeout.

                              (Some of these report writes are synchronous writes).

                              Since Vdbench depends on a 'working OS', I don't see this as a Vdbench bug, merely an annoyance I should try to work around.

                              I'll change the code to work around this.

                               

                              Henk.