10 Replies Latest reply: Jun 15, 2010 6:48 PM by 581825

    how oracle prevent "partial write"

    649748
      MySQL InnoDB uses a "Double Write" buffer to prevent partial writes.
      http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/

      But Oracle doesn't have a "Double Write" buffer, so what mechanism does Oracle use to prevent partial writes?

      We are planning to use a 32k block size, while at the file system level the block size is 8k. So if Oracle writes a 32k block, but only the first portion reaches disk before a crash, how does Oracle recover?
        • 1. Re: how oracle prevent "partial write"
          sb92075
          but only the first blocks wrote to disk before crash, how oracle recover?
          Without COMMIT, change is rolled back upon restart.
          With COMMIT, change is rolled forward from REDO upon restart.
          • 2. Re: how oracle prevent "partial write"
            Aman....
            I don't have experience with mysql so I won't comment on its "double write" functionality. In Oracle, changes are protected with the help of redo change vectors, which are persisted on disk by being copied into the physical redo log files. These writes are done in the o/s block size. I am not sure how you can say the o/s block size is 8kb since, AFAIK, at the moment most operating systems don't support more than 4kb, and disks are even smaller, 512 bytes. But whatever the case may be, Oracle would have the changes protected, and the writes themselves are done by the o/s, not by Oracle. Oracle simply issues a request to the o/s to do the job, so I don't think there should be any worries as long as the o/s does it.

            That said, what is the basis for choosing 32kb as the block size? There have been several very lengthy discussions here and at many other places on the web about this, and the conclusion was that most of the time the Oracle block size should be left at the default and not tampered with.

            HTH
            Aman....
            • 3. Re: how oracle prevent "partial write"
              581825
              It's an interesting question. I think the clue is in the link you provided: "Such logging structure is great as it requires less data to be written to the log, however it requires pages to be internally consistent."

              What that's saying (I think!) is that the contents of the innodb transaction log can only be replayed to datafile pages which are 'clean' - and that's true for Oracle, too. You can't apply Oracle redo to an Oracle database block that is internally corrupted because some of its constituent "os pages" were written at a different time from others. When such partial writes happen, you get what's called a "fractured block", warnings in the alert log... and the data file is regarded as corrupt from that point on.

              Oracle's fix for this potential problem, however, is also hinted at in the article you linked to: "Innodb does not log full pages to the log files". That's an interesting sentence because, you see, Oracle does write full pages to the logs! I should immediately qualify that: it only does so when you take a "hot backup" using O/S copy commands - because it's only then that you have to worry about the problem.

              In other words, you only have to worry about the fact that you can only apply redo to an internally consistent database block if you're actually in the business of applying redo... and you're only doing that in the event of a recovery. And complete recoveries in Oracle (as opposed to mere crash recoveries) require you to have restored something from backup. So it's only during the backup process that you have to worry about the problem of fractured blocks - and so it's only then that Oracle says: 'if you have put the tablespace into hot backup mode (alter tablespace X begin backup), then the first time a block of data is changed, the entire block is written in a consistent state into the redo (transaction) logs. Then, if the datafile copy of the block in the backup turns out to be fractured, we've got a known good copy in the redo we can restore in its place. And once you have a clean block as a starting point, you can continue to apply redo from that point on.'
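              For what it's worth, hot backup mode is driven by ordinary SQL; a minimal sketch (the tablespace name USERS is just an example):

              ```sql
              -- Enter hot backup mode: from now on, the first change to each block
              -- logs the entire block image into the redo stream.
              ALTER TABLESPACE users BEGIN BACKUP;

              --   ... copy the tablespace's datafiles with OS commands here ...

              -- Leave hot backup mode again once the copy is done.
              ALTER TABLESPACE users END BACKUP;
              ```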

              Oracle has an alternative (and more up-to-date) mechanism for achieving this "I know your data block is clean" starting state, though. It's called RMAN - the Oracle backup and recovery tool. Unlike your OS copy command, it's an Oracle utility... so it understands the concept of Oracle blocks, and it can therefore check that a block has been copied consistently, with all its constituent OS 'pages' written coherently to disk in the same state. It knows how to compare input and output in a way no OS command could ever hope to do. So when RMAN copies a data block hot, it reads the copy, compares it with the original - and if it sees the copy is fractured, it simply has another go at copying the block. It repeats this until the copy is verified as a good copy of the original. No need to write the block into the transaction log at all, because you know that the backup file itself contains the necessary clean block copy.
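              The RMAN side of that is about as simple as backups get; a sketch (RMAN prompt shown, exact options vary by version, so check the docs for yours):

              ```sql
              RMAN> BACKUP DATABASE;       -- RMAN reads each block itself and re-reads any
                                           -- fractured block until it gets a consistent copy
              RMAN> BACKUP DATAFILE 16;    -- or back up just one datafile
              ```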

              So, let's put that into practice. Say your server corrupts data on the disk for whatever reason and, in the process, your Oracle instance dies. You try to restart Oracle, but you get told that recovery is needed (you might get a message that file 16, for example, can't be read). So you restore file 16 from your hot backup taken with OS commands. In that backup, one of the blocks is fractured, because only part of the Oracle block had hit disk at the point the backup was performed. So you restore a fractured block. But that's not a problem, because as redo is replayed, you'll find the clean copy of the block in the redo stream, and restore that over the top of the fractured block. The rest of the redo can then be replayed without a problem. Or, you restore file 16 using RMAN... and what it restores cannot be fractured, because it checks for that before it reports the original backup as a success. Therefore, you restore a clean copy of file 16, and can apply redo to it without drama. Either way, you get your database recovered.
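              Those two restore paths look roughly like this - file number 16 carried over from the example above, and the syntax is a sketch rather than verbatim from any one Oracle version:

              ```sql
              -- Path 1: OS-copy hot backup. Restore the file copy by hand, then apply redo:
              ALTER DATABASE DATAFILE 16 OFFLINE;
              --   ... copy the backup of file 16 back into place with OS commands ...
              RECOVER DATAFILE 16;
              ALTER DATABASE DATAFILE 16 ONLINE;

              -- Path 2: RMAN does both steps, and what it restores cannot be fractured:
              -- RMAN> RESTORE DATAFILE 16;
              -- RMAN> RECOVER DATAFILE 16;
              ```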

              So, the article you linked to nails the important matter: "It does not matter which page version it is – it could be “current” version in which case Innodb will skip page update operation or “former” in which case Innodb will perform update. If page is inconsistent recovery can’t proceed." Absolutely true of Oracle, too. But Oracle has two alternatives for ensuring that a clean version of the block is always available: write a whole block into redo if it's changed whilst the database is being backed up with OS commands, or make sure you only write clean blocks into the backup if you're using RMAN - and you achieve that by multiple reads of the block, as many as are necessary to ensure the output is clean.

              Oracle's solutions in these regards are, I think, a lot more efficient than double-writing every block all the time, because the only time you have to worry that what's on disk isn't consistent is, as your linked article again points out, when 'power failure' or 'os crash' happens. That is, during some sort of failure. And the response to failure that involves corruption is always to restore something from backup... so, it's really only that backup that needs to worry about 'clean pages'. Instead of writing everything twice to disk during normal running (which sounds like a potentially enormous overhead to me!), therefore, Oracle only has to employ protective measures during the backup process itself (which should, ordinarily, be a mere fraction of 'normal running' time). The overhead is therefore only encountered sparingly and not something you need worry about as a potentially-constant performance problem.

              In closing, I'll second Aman's observation that it is generally and usually the case that any variation away from the default 8K block size is a bad idea. Not always, and there may be justification for it in extremis... but you will certainly be at risk of encountering more and weirder bugs than if you stick to the defaults.
              • 4. Re: how oracle prevent "partial write"
                Fahd.Mirza
                Very insightful, Rogers.

                regards
                • 5. Re: how oracle prevent "partial write"
                  649748
                  Thanks HJR for detailed analysis.

                  But the double write mechanism works without restoring from a backup and applying redo: before writing the dirty buffer blocks, innodb flushes the blocks in the double write buffer to disk, which is a sequential disk area (so it's fast). So even if a partial write happens, the blocks in the double write buffer are already persistent on disk, and when mysql restarts, innodb can compare the blocks flushed from the double write buffer with the datafile blocks. If they match (meaning no partial write), then everything is fine; if they don't match (a partial write happened), it just overwrites the datafile blocks with the blocks from the double write buffer. So no media recovery is required.

                  Based on your analysis, oracle needs media recovery.
                  • 6. Re: how oracle prevent "partial write"
                    EdStevens
                    user646745 wrote:
                    Thanks HJR for detailed analysis.

                    But the double write mechanism works without restore from a backup an apply redo: before write the dirty buffer blocks, innodb flush the blocks in the double write buffer to disk which is a sequential disk area (so it's fast),
                    [Ed's rewording, substituting Oracle terms:] before write the dirty buffer blocks, logwr flush the blocks in the redo buffer to disk which is a sequential disk area (so it's fast),
                    so even if partial write happen, the blocks in double write buffer already persistent in disk,
                    [Ed's rewording:] so even if partial write happen, the blocks in redo buffer already persistent in disk,
                    and when mysql restart, innodb can compare the blocks flushed from double write buffer with the datafile blocks,
                    [Ed's rewording:] and when mysql restart, smon can compare the control file scn with the datafile blocks scn,
                    if they matched (means no partial write), then everything is fine, if not matched (partial write happen), just overwrite the datafile blocks with the blocks from double write buffer.
                    [Ed's rewording:] if they matched (means no partial write), then everything is fine, if not matched (partial write happen), just apply the redo from the redo logs.
                    So no media recover is required.
                    sounds like media recovery to me
                    Based on your analysis, oracle needs media recovery.
                    Based on your analysis, so does mysql. It just applies it in a very slightly different fashion, and calls it all something else.
                    • 7. Re: how oracle prevent "partial write"
                      581825
                      I think Ed's right: it's really just a change of words and the same outcome. I'll just add a couple of things, though.

                      1. I've been trying for 11 years, and I've never yet managed to create a fractured block on demand. They are very difficult to make happen! (Though I've had a few when I wasn't expecting -or wanting- them!!) I've yanked out SATA cables and power cables, devised cunning RAID schemes, you name it... we're talking about something that doesn't happen very often at all, in other words.

                      2. Oracle's transaction logs are always written to by a single process (LGWR, or Log Writer), and they are supposed to be stored on fast storage, too. So, in that respect they sound pretty much the same as the double write buffer you're talking about. Also, bear in mind that redo logs, internally, are comprised of 'redo log blocks' which are always 512 bytes in size, and thus align perfectly with whatever filesystem (or non-filesystem in the case of raw devices) you are storing them on.

                      3. If we are only concerned about crash recovery (that is, the power goes out, you boot up the server and re-start Oracle, no restore from backup involved), then you need to remember that Database Writer only writes blocks to disk after the redo vectors that changed them have been written to the redo logs and are confirmed to have been written by Log Writer. Log Writer only signals Database Writer to write to the datafiles after it's finished its own work. So if it gets interrupted (power failure, etc.), then Database Writer won't yet have flushed any blocks to disk - so there are no fractured blocks to worry about. All your data blocks would therefore be 'clean' and thus able to have redo applied to them as part of an entirely automatic instance recovery. A power failure or OS crash during LGWR operations is therefore not an issue and requires no formal recovery procedures on your part.

                      4. Database Writer can make use of checksums to make sure that what it has written matches what it thought it had written, so that during normal operation, strange glitches in the physical storage layer or the file system can be detected and resolved without the database being affected too much.

                      5. Finally, if it were to happen that Database Writer died half-way through a data block write, yes, that might well result in a corrupt block on disk, where half of it is from one time and the other half is from some other time - fractured, in other words. And yes, that will mean having to perform a restore-and-recover operation, since there would be no 'clean page' to start applying redo to, except in some prior backup file or other. But see point 1... very difficult (i.e., practically impossible!) to actually get this to happen on demand.

                      And the only other thing I'd say is that, logically, the innodb solution is no real solution at all. What if the flush of the double write buffer is interrupted half-way through? You can keep inventing buffers all you like, but ultimately something has to hit disk. And no matter how fast and how sequential you make that write happen, there is always the logical possibility that it could be interrupted. At which point, your innodb double buffer mechanism is inoperative. And, if it were to happen that somehow Oracle's Log Writer messed up the contents of the redo logs, despite them also being very fast and sequential writes; well then Oracle too would be screwed (you'd be into performing incomplete recovery at that point). Ultimately, therefore, it's not what mechanisms you have in place to prevent partial writes that counts, because you can never eliminate the possibility of them happening 100%. It's more a question of having the recovery mechanisms in place to deal with them when they happen. And Oracle's mechanisms are extremely robust in that regard.
                      • 8. Re: how oracle prevent "partial write"
                        649748
                        ==>logically, the innodb solution is no real solution at all. What if the flush of the double write buffer is interrupted half-way through? You can keep inventing buffers all you like, but ultimately something has to hit disk.

                        innodb first flushes the blocks in the "double write buffer" (these are table/index blocks, not something like redo; redo consists of redo vectors rather than table/index blocks); only after that completes successfully does it write to the datafile. So if the flush of the "double write buffer" is partially done, that's not a problem, as we only care about the datafile when the database is running. If the datafile is partially written, innodb knows the blocks in the double write buffer are clean and can use them to recover.

                        Although this is something like media recovery, it doesn't need a dba to kick in. I know it's very rare, but there is a chance. Perhaps it just doesn't make sense to prevent it by imposing the additional cpu/IO cost of a double write.
                        • 9. Re: how oracle prevent "partial write"
                          EdStevens
                          user646745 wrote:
                          ==>logically, the innodb solution is no real solution at all. What if the flush of the double write buffer is interrupted half-way through? You can keep inventing buffers all you like, but ultimately something has to hit disk.

                          innodb first flushes the blocks in the "double write buffer" (these are table/index blocks, not something like redo; redo consists of redo vectors rather than table/index blocks); only after that completes successfully does it write to the datafile. So if the flush of the "double write buffer" is partially done, that's not a problem, as we only care about the datafile when the database is running. If the datafile is partially written, innodb knows the blocks in the double write buffer are clean and can use them to recover.

                          Although this is something like media recovery, it doesn't need a dba to kick in.
                          And Oracle's instance crash recovery doesn't need the dba to "kick in" either. It happens automatically on instance startup, if needed.
                          I know it's very rare, but there is a chance. Perhaps it just doesn't make sense to prevent it by imposing the additional cpu/IO cost of a double write.
                          I'm not sure where you have been going with all this, but from the beginning it has had the faint smell of "mysql is superior to oracle because ...". I'd like to think I'm wrong on that.

                          So what's the point?
                          • 10. Re: how oracle prevent "partial write"
                            581825
                            Well, I notice that PostgreSQL also has the feature of flushing entire pages to its transaction logs the first time they get updated after a checkpoint, so I see this is a common feature of other databases. (I guess if you were mad enough to insist on Oracle having an identical feature, you could just issue an 'alter database begin backup' command and leave it that way!)

                            All I can say is that the overhead of that approach is enormous (which is why PostgreSQL makes it optional, I suppose), and if you check what you've written to disk after you've written it (checksums), the full-page write to an alternative location should not really be necessary. DBWR does that. And since DBWR works in the background, the checksum computation should not amount to a noticeable overhead. You can also ask your server process to do it on each read operation (DB_BLOCK_CHECKING), though that involves more overhead.
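                             The two parameters being contrasted there are set like this; a sketch (defaults differ across Oracle versions, so treat the values as illustrative):

                             ```sql
                             -- DBWR stores a checksum in every block it writes,
                             -- verified when the block is later read back in
                             ALTER SYSTEM SET db_block_checksum = TYPICAL;

                             -- Additionally run logical self-consistency checks whenever a block
                             -- changes (more CPU; OFF by default outside the SYSTEM tablespace)
                             ALTER SYSTEM SET db_block_checking = FULL;
                             ```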

                            Meanwhile, instance (crash) recoveries in Oracle never require DBA involvement (other than to issue the 'startup' command, I guess!).

                            But yes, if you ever get told 'corrupt block detected', the response is media recovery, not instance recovery -and yes, that will involve the DBA restoring and recovering things (or running something like dbms_repair to 'workaround' the issue). Bear in mind, however, that it is possible to use RMAN to do single block recoveries whilst the rest of the database is up and running... so it's not always the case that 'media recovery' is a huge, full-on affair that involves restoring gigabytes of stuff and sitting around for hours waiting for things to get fixed.
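                             For that single-block case, the commands look something like this (file and block numbers are hypothetical; RECOVER ... BLOCK is the 11g syntax, older releases call it BLOCKRECOVER):

                             ```sql
                             -- See which blocks Oracle has flagged as corrupt
                             SELECT file#, block#, blocks FROM v$database_block_corruption;

                             -- Then, in RMAN, repair just that block while the datafile stays online:
                             -- RMAN> RECOVER DATAFILE 16 BLOCK 42;
                             ```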