46 Replies Latest reply: May 23, 2011 9:26 AM by 834371
      • 30. Re: Long running update
        Jonathan Lewis
        Colin T wrote:
        Oh, also missed that we already tried "compress for all operations", it reduced row migrations, but they were still prevalent.
        That's a little surprising - was this still with pctfree 25 ?
        Have you done a gather_stats before and after all the update - if so what were avg_row_len before and after ?
        This figure shows the uncompressed length even when the rows are compressed.
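        A minimal sketch of the check being suggested here (schema, table and partition names are hypothetical):

        ```sql
        -- Gather partition-level stats before and after the update, then compare
        -- avg_row_len (which reports the uncompressed length even for compressed rows):
        BEGIN
          DBMS_STATS.GATHER_TABLE_STATS(
            ownname     => USER,
            tabname     => 'BIG_TABLE',   -- hypothetical table name
            partname    => 'P_201105',    -- hypothetical partition name
            granularity => 'PARTITION');
        END;
        /

        SELECT partition_name, num_rows, blocks, avg_row_len
        FROM   user_tab_partitions
        WHERE  table_name = 'BIG_TABLE';
        ```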


        As Randolf has pointed out, when you update a compressed row it is expanded and not recompressed (although when the block is full the complete block should be recompressed if you are using the "compress for OLTP" option). This is why I assumed you would only be able to update two or three rows per block before migration.

        Whether or not the ASSM bug is a problem, you have a problem with migration. As I suggested (but hadn't tested) earlier on, when you do the update by tablescan, if you find a migrated row pointer in a block you have to jump to the migrated row at once to check whether or not you need to migrate it. This may require you to do an incomplete "commit time" check so that you can determine the state of the migrated row. If you have (say) 3M rows in the partition and 1M of those have migrated then you're going to do at least 1M "extra" buffer visits - even if you don't update anything. If you then have to do a commit time check on each jump that's a total of 2M extra block visits - and you do 1,200 scans of the partition -- which easily bumps your block visits up to the billion level.

        The row migration is the major problem - it makes the buffer visits proportional to "number of rows migrated" x "number of scans", rather than "number of blocks in partition" x "number of scans".

        (Note - when you query a partition with a tablescan with lots of migrated rows, Oracle doesn't jump to the row when it hits the row pointer, it simply waits until it finds the migrated row in the course of the scan - so you don't see the same excess of buffer visits on tablescan queries)
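        The arithmetic above can be checked directly (all figures are the illustrative ones from this post, not measured values):

        ```sql
        -- 1M migrated rows x 2 block visits per jump x 1,200 scans of the partition:
        SELECT 1000000 * 2 * 1200 AS extra_block_visits  -- 2,400,000,000: billion level
        FROM   dual;
        ```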

        Regards
        Jonathan Lewis

        P.S. Why are you setting INITRANS to 100 ? It seems a little excessive, especially if you think compression is a good idea. Highly concurrent systems and compression don't go together.
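        To check and reduce the setting, something like the following should work (names are hypothetical; note that an INITRANS change only affects newly formatted blocks unless the segment is rebuilt):

        ```sql
        -- Current physical settings per partition:
        SELECT partition_name, ini_trans, pct_free, compression, compress_for
        FROM   user_tab_partitions
        WHERE  table_name = 'BIG_TABLE';

        -- Lower the default for future partitions, and rebuild one to apply it:
        ALTER TABLE big_table MODIFY DEFAULT ATTRIBUTES INITRANS 2;
        ALTER TABLE big_table MOVE PARTITION p_201105 INITRANS 2;  -- rebuilds the partition
        ```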

        Edited by: Jonathan Lewis on May 11, 2011 9:21 PM
        • 31. Re: Long running update
          Colin Thiner
          Jonathan Lewis wrote:
          Colin T wrote:
          Oh, also missed that we already tried "compress for all operations", it reduced row migrations, but they were still prevalent.
          That's a little surprising - was this still with pctfree 25 ?
          Have you done a gather_stats before and after all the update - if so what were avg_row_len before and after ?
          This figure shows the uncompressed length even when the rows are compressed.
          Will check on this.
          Whether or not the ASSM bug is a problem, you have a problem with migration. As I suggested (but hadn't tested) earlier on, when you do the update by tablescan, if you find a migrated row pointer in a block you have to jump to the migrated row at once to check whether or not you need to migrate it. This may require you to do an incomplete "commit time" check so that you can determine the state of the migrated row. If you have (say) 3M rows in the partition and 1M of those have migrated then you're going to do at least 1M "extra" buffer visits - even if you don't update anything. If you then have to do a commit time check on each jump that's a total of 2M extra block visits - and you do 1,200 scans of the partition -- which easily bumps your block visits up to the billion level.
          Note that the 1+ billion buffer gets I see is on a single update statement.
          The row migration is the major problem - it makes the buffer visits proportional to ""number of rows migrated" x "number of scans", rather than "number of blocks in partition" x "number of scans".

          (Note - when you query a partition with a table scan with lots of migrated row, Oracle doesn't jump to the row when it hits the row pointer, it simply waits until it finds the migrated row in the course of the scan - so you don't see the same excess of buffer visits on tablescan queries)
          Agreed, I think (and hope) this is really the root of the problem. Will be testing this over the next day or two.
          P.S. Why are you setting INITRANS to 100 ? It seems a little excessive, especially if you think compression is a good idea. Highly concurrent systems and compression don't go together.
          Good question - we inherited this setting and should probably decrease it.

          Thanks again for your help!
          • 32. Re: Long running update
            Jonathan Lewis
            Colin T wrote:

            Note that the 1+ billion buffer gets I see is on a single update statement.
            I've lost track of where you are with your results.

            The set of statistics you supplied earlier on showed 300M after 1,700 user commits.

            Do you have an up to date set of stats for one update statement where you can tell us how many rows were in the partition at the time, how many were updated, and how many prior updates had already taken place? We also need to know about block size, pctfree, compression option, initrans and how the data got into the table in the first place. (I've been assuming serial "insert /*+ append */" or direct path load.)


            Regards
            Jonathan Lewis
            • 33. Re: Long running update
              Randolf Geist
              Jonathan Lewis wrote:
              Colin T wrote:
              Oh, also missed that we already tried "compress for all operations", it reduced row migrations, but they were still prevalent.
              That's a little surprising - was this still with pctfree 25 ?

              As Randolf has pointed out, when you update a compressed row it is expanded and not recompressed (although when the block is full the complete block should be recompressed if you are using the "compress for OLTP" option). This is why I assumed you would only be able to update two or three rows per block before migration.
              Jonathan,

              it looks like the PCTFREE doesn't matter in this case. As I've tried to express in my previous reply, the "oltp" compression surprisingly:

              - does migrate a row rather than re-compress the block with data that should be equally re-compressable

              - in 11.1 it doesn't even attempt to compress the blocks holding the migrated rows. In 11.2 at least these get re-compressed, but the growing rows in the original block still get migrated rather than re-compressed (and therefore stay uncompressed if not migrated)
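              One way to measure how many rows actually end up migrated in a test like this (this needs the CHAINED_ROWS table created by $ORACLE_HOME/rdbms/admin/utlchain.sql; table and partition names are hypothetical, and ANALYZE lists chained as well as migrated rows):

              ```sql
              ANALYZE TABLE big_table PARTITION (p_201105)
                LIST CHAINED ROWS INTO chained_rows;

              SELECT COUNT(*) AS migrated_or_chained
              FROM   chained_rows
              WHERE  table_name = 'BIG_TABLE';
              ```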
              Whether or not the ASSM bug is a problem you have a problem with migration. As I suggested (but hadn't tested) earlier on, when you do the update by tablescan update if you find a migrated row pointer in a block you have to jump to the migrated row at once to check whether or not you need to migrate it.
              Did you mean to say "to check whether you need to clean it out" rather than "migrate it"?

              Randolf
              • 34. Re: Long running update
                Randolf Geist
                Colin T wrote:
                Since my next test will be without compression, I did the test with/without compression in my dev environment, with just a single month of data/single thread. Here are some numbers:
                Option                                    Compress OFF         Compress ON
                Migrated rows                                         0              Substantial
                cleanout - number of ktugct calls               652,768               18,362,131
                consistent gets                               5,274,519                   23,513,012
                consistent gets - examination                   672,908               18,454,782
                redo size                                  912,611,900            4,932,805,620
                (It would be interesting to see at least the "db block gets" numbers in addition - these should also be excessive with compression enabled.

                I'm pretty sure that you managed to hit the bug given these numbers. Although there is still information missing about row sizes before / after, number of rows in a partition / updated for a single update etc. - as requested by Jonathan - I've managed to reproduce the bug in both 8K and 16K ASSM tablespaces with the same PCTFREE / INITRANS settings, rows of a similar size to yours, and only a limited number of updates per block. The interesting part is that the bug only shows up if partitioning is thrown into the mix - a standalone table does not reproduce it.)

                Ignore above - although the numbers mentioned in the first post might indicate that you hit the bug - my test case is not representative for what you're doing. Need more investigation.

                Hope that helps,
                Randolf

                Edited by: Randolf Geist on May 12, 2011 3:40 PM

                Wrong conclusion, test case not representative
                • 35. Re: Long running update
                  Colin Thiner
                  Is there a bug number - documented with Oracle? I can't find it in the links provided.

                  Thanks
                    • 37. Re: Long running update
                      Randolf Geist
                      Colin T wrote:
                      Is there a bug number - documented with Oracle? I can't find it in the links provided.
                      The bug number is 6918210, see e.g. Greg Rahn's post about it: http://structureddata.org/2008/09/08/understanding-performance/

                      Hope that helps,
                      Randolf
                      • 38. Re: Long running update
                        Colin Thiner
                             Option                                    Compress OFF         Compress ON
                        Migrated rows                                         0              Substantial
                        cleanout - number of ktugct calls               652,768               18,362,131
                        consistent gets                               5,274,519                   23,513,012
                        consistent gets - examination                   672,908               18,454,782
                        redo size                                  912,611,900            4,932,805,620
                        (It would be interesting to see at least the "db block gets" numbers in addition - these should also be excessive with compression enabled.
                             Option                                    Compress OFF         Compress ON
                        Migrated rows                                         0              Substantial
                        cleanout - number of ktugct calls               652,768               18,362,131
                        consistent gets                               5,274,519                   23,513,012
                        consistent gets - examination                   672,908               18,454,782
                        redo size                                  912,611,900            4,932,805,620
                        db block gets                                 6,798,486               30,778,118
                        I'm pretty sure that you managed to hit the bug given these numbers. Although there is still information missing about row sizes before / after, number of rows in a partition / updated for a single update etc. - as requested by Jonathan - I've managed to reproduce the bug in both 8K and 16K ASSM tablespaces with the same PCTFREE / INITRANS settings, rows of a similar size to yours, and only a limited number of updates per block. The interesting part is that the bug only shows up if partitioning is thrown into the mix - a standalone table does not reproduce it.)
                        Didn't get the row length...

                        Example of one partition is:
                        num rows: 2,973,485
                        300,000-400,000 rows maximum updated in a single update statement.

                        Number of blocks in partition:
                        before updates: 91,129
                        after update: 133,760
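                        The block growth in those figures works out to roughly 47% (a quick check on the numbers quoted above):

                        ```sql
                        SELECT 133760 - 91129                        AS blocks_added,
                               ROUND(100 * (133760 - 91129) / 91129) AS pct_growth
                        FROM   dual;
                        ```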

                        Ignore above - although the numbers mentioned in the first post might indicate that you hit the bug - my test case is not representative for what you're doing. Need more investigation.

                        Hope that helps,
                        Randolf

                        Edited by: Randolf Geist on May 12, 2011 3:40 PM

                        Wrong conclusion, test case not representative
                        • 39. Re: Long running update
                          Jonathan Lewis
                          Randolf Geist wrote:

                          Did you mean to say "to check whether you need to clean it out" rather than "migrate it"?
                          Randolf,

                          Thanks for the follow-up. Yes, that was supposed to say "clean it out".

                          The hypothesis is that when you scan a table to update rows, since you have to jump to every migrated row to check if it needs to be updated, you also do a cleanout on the target block at the same time. If you then have to jump to the same target block again because you need to check another migrated row it's possible that Oracle will repeat the block cleanout.
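                          One way to test this hypothesis would be to snapshot the relevant session statistics around a single update (a sketch; the statistic names are as they appear in v$statname in 11g):

                          ```sql
                          -- Run before and after the update in the same session and diff the values:
                          SELECT n.name, s.value
                          FROM   v$mystat   s
                          JOIN   v$statname n ON n.statistic# = s.statistic#
                          WHERE  n.name IN ('cleanout - number of ktugct calls',
                                            'table fetch continued row',
                                            'consistent gets - examination');
                          ```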

                          I'm not convinced that the OP is seeing the ASSM migration bug. A critical feature of that bug was that the rows were very short and a large number of rows were migrated into the target block so that the target block ITL was full even when there was a lot of free space in the block. (The migration takes one ITL entry for every row migrated). In the case of the OP the rows are long and the ITL won't get full.


                          Sorry I missed your comments about compression not behaving the way we might expect and even making the problem worse. I look forward to reading the blog when you've written it up.


                          Regards
                          Jonathan Lewis
                          • 40. Re: Long running update
                            Randolf Geist
                            Jonathan Lewis wrote:
                            I'm not convinced that the OP is seeing the ASSM migration bug. A critical feature of that bug was that the rows were very short and a large number of rows were migrated into the target block so that the target block ITL was full even when there was a lot of free space in the block. (The migration takes one ITL entry for every row migrated). In the case of the OP the rows are long and the ITL won't get full.
                            Well, I was pretty sure that I wouldn't be able to surprise you with the first part of the post - yes, exactly, the basic bug seems to be a variation of the problem "when does ASSM mark a block as not a candidate for inserting further rows" in case the ITL directory becomes full but there is still space available in the block. It looks like the more blocks get into this state as rows get migrated, the harder the ASSM space layer has to work for every row to migrate to find a block where it can store the row, since it seems to search through all these candidate blocks only to find out that the ITL is full and the block therefore cannot be used for further migrations.

                            That also seems to be the explanation why it is more likely to see the bug with larger block sizes - there can be at most 255 ITL entries, but in block sizes less than 16K Oracle restricts the ITL entries to approx. 50% of the block size so we end up with something like 169 / 83 / 41 for 8K / 4K / 2K max. number of ITLs (and Oracle doesn't throw an error but silently lowers the number of ITLs when using a larger value for INITRANS). With only 50% of the block free space left it needs very small rows to be able to migrate that many rows into a single block with smaller block sizes (though it is possible yet very unlikely).

                            With 16K and larger the max. 255 ITL entries allocate less than 50% of the block and allow for more net data, so potentially more migrated rows fit into a single block hence the problem can be reproduced with larger row sizes as well.
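                            Those figures can be approximated from the ITL entry size (24 bytes each; the small differences from the 169 / 83 / 41 quoted above come from fixed block header overheads not modelled here):

                            ```sql
                            SELECT block_size,
                                   LEAST(255, FLOOR(block_size / 2 / 24)) AS approx_max_itl
                            FROM  (SELECT 2048  AS block_size FROM dual UNION ALL
                                   SELECT 4096  FROM dual UNION ALL
                                   SELECT 8192  FROM dual UNION ALL
                                   SELECT 16384 FROM dual);
                            ```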
                            Sorry I missed your comments about compression not behaving the way we might expect and even making the problem worse. I look forward to reading the blog when you've written it up.
                            Yep, that is going to be part two: OLTP compression, or a combination of basic compression and partitioning, seems to make it possible to hit the bug even with non-full ITL entries, which is quite worrying - all of this seems to be fixed in 11.2, too. Therefore I'm pretty sure that the figures for the first single UPDATE that did 3 billion logical I/Os to update 300,000 rows in a partition are an indication of this particular variation of the bug.

                            Randolf
                            • 41. Re: Long running update
                              Colin Thiner
                              Randolf and Jonathan - sorry this took me a while to get out. Yesterday we ran without compression and had a good run. Execution time was down quite a bit, and stats were back to normal. I'm pretty sure we will have solid and consistent processing going forward and we can now continue with tuning as we normally would. Thanks very much for leading us in the right direction here.

                              I've compiled some stats over time the last few weeks along with environment changes. It is difficult to get apples to apples comparison, but I think we are pretty close with these. Timings are for the first set of rules (the 1200), but the AWR are for all 4 sets of rules. Set one is the only one where we see the bug(??) surface.

                              http://webpages.charter.net/thinerc/CSTAR_BatchStats.htm

                              It seems that the question of whether we are hitting the bug is still open for debate. We should be able to reproduce this to determine if we are hitting the bug, but if not, what other explanations could there be? I would not expect this drastic of a change in behavior just from processing the row migrations or the delayed block cleanout effect. There was definitely overhead with the row migrations - which we see eliminated from runs 1/2 to run 5 without compression - but there was something else going on in runs 3/4.

                              The first run is basically our starting point. Notice the higher amount of reading; we started by trying to reduce that using histograms, exact cursor sharing, and also did some instance tuning. These worked positively and we had some decent runs with compression on, but it still didn't seem very efficient. In runs 3 and 4 the behavior completely changed. The updates would be running efficiently for some time, then suddenly threads would get stuck on a single update running for hours, pegging CPU, and we would see the high CR and Gets. The only things we were playing with at that point were ASYNCH_IO (FILESYSTEMIO_OPTIONS=SETALL) and block size. Notice that we never did get back to the configuration that yielded our initial best run time (runs 1/2) of ASYNCH_IO on and 8K block size, but I'm not sure that would have guaranteed the bug would not surface. I didn't capture row migrations and block counts each run, but in later testing the number of migrations and block additions with compression was very consistent, and we see 0 without compression.



                              Edited by: Colin T on May 14, 2011 3:16 PM

                              Edited by: Colin T on May 14, 2011 3:19 PM

                              Edited by: Colin T on May 14, 2011 7:24 PM
                              • 42. Re: Long running update
                                Randolf Geist
                                Colin Thiner wrote:
                                It seems that the question of whether we are hitting the bug is still open for debate. We should be able to reproduce this to determine if we are hitting the bug, but if not, what other explanations could there be? I would not expect this drastic of a change in behavior just from processing the row migrations or the delayed block cleanout effect. There was definitely overhead with the row migrations - which we see eliminated from runs 1/2 to run 5 without compression - but there was something else going on in runs 3/4.
                                You've been provided with instructions on how to determine if you're hitting the "free space search" anomaly. There is no need to debate - just run the stack profiler or the buffer gets profiler script and check the results when you see an update taking much longer than expected.

                                As mentioned above, it looks like 11.1 introduced a variation of the bug that only surfaces with OLTP compression or basic compression in combination with partitioning - the latter corresponds to your configuration used in your first four runs.

                                The underlying reason for the anomaly is probably different from the basic bug, since there are still ITL slots available but the block is full and therefore doesn't allow any further migrated rows into it due to the row size - but the symptoms ought to be the same: you'll see the "ktspscan_bmb" and "ktspfsrch" functions on top in the output if you really hit this.
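                                A sketch of how such a stack check might be done from SQL*Plus as SYSDBA while the slow update runs (the :sid bind is a placeholder for the session running the update):

                                ```sql
                                -- Find the Oracle process id for the slow session:
                                SELECT p.pid
                                FROM   v$session s
                                JOIN   v$process p ON p.addr = s.paddr
                                WHERE  s.sid = :sid;

                                -- Then sample the stack a few times:
                                -- ORADEBUG SETORAPID <pid>
                                -- ORADEBUG SHORT_STACK
                                -- Repeated samples dominated by ktspfsrch / ktspscan_bmb
                                -- point at the free space search anomaly.
                                ```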

                                Besides that, the point of hitting the bug or not is more or less moot, since your strategy of (basic) compression with subsequent updates is not a good one anyway. Unfortunately OLTP compression does not behave as expected; otherwise it might have been a reasonable option.

                                Randolf
                                • 43. Re: Long running update
                                  Hemant K Chitale
                                  I'm not sure if this is related ....

                                  Brad Peek has some tests and a bug ?
                                  UPDATE DML performance on EHCC compressed data




                                  Hemant K Chitale
                                  • 44. Re: Long running update
                                    Randolf Geist
                                    Hemant K Chitale wrote:
                                    I'm not sure if this is related ....

                                    Brad Peek has some tests and a bug ?
                                    UPDATE DML performance on EHCC compressed data
                                    Hi Hemant,

                                    Thanks for the pointer - I think it is not directly related since it is about OLTP compression, but it is possibly describing the behaviour that I've mentioned above, and could be about the variation of the bug.

                                    Randolf