2 Replies. Latest reply: Jan 21, 2013 2:28 PM by lrp1

    Oracle Database migrated to ZFS filesystem crashes server

    lrp1
      I've got some alarming results from a recent migration from UFS to ZFS-based filesystems in one of our environments.
      Quick summary of our environment:
      - two-node cluster of M4000s connected to a StorageTek SAN
      - Solaris 10 update 10 SPARC
      - Oracle Solaris Cluster 3.3u1 for Solaris 10 SPARC
      - Oracle Database 11.1.0.7 (we're in the process of upgrading to 11gR2)

      We moved the database (oracle_home, datafiles, archive logs, redo logs, tempfiles, controlfiles) from UFS to equivalent ZFS directories. The only change we made from the defaults was to set recordsize=8k and logbias=throughput on the datafiles directory, in accordance with the Oracle Database on ZFS whitepaper (Sep 2012) (http://www.oracle.com/technetwork/server-storage/solaris10/config-solaris-zfs-wp-167894.pdf).
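      For reference, property changes like the two described above are normally applied with `zfs set`. The following is only a minimal sketch: the dataset name `dbpool/oradata` is hypothetical, since the post does not name the pool or filesystem.

```shell
# Hypothetical dataset name; substitute the filesystem that holds the datafiles.
DATAFILE_FS="dbpool/oradata"

# Apply the two non-default properties mentioned in the post.
zfs set recordsize=8k "$DATAFILE_FS"
zfs set logbias=throughput "$DATAFILE_FS"

# Confirm the values and where they come from (local vs. inherited).
zfs get recordsize,logbias "$DATAFILE_FS"
```

      Note that recordsize only affects files written after the property is changed, so any datafiles copied in beforehand keep the block size they were created with.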

      To our dismay, the database hit the following errors, resulting in a system CRASH and failover to the other node. This has happened multiple times under load, and we can't make sense of it. Before anybody asks, I'm filing a ticket with Oracle Support, but I wanted to know whether the community has seen these sorts of errors.

      Below is a snippet of the alert log showing these errors:
           ----- Error Stack Dump -----
           ORA-01115: IO error reading block from file 72 (block # 2288034)
           ORA-27063: number of bytes read/written is incorrect
           SVR4 Error: 45: Deadlock situation detected/avoided
           Additional information: -1
           Additional information: 16384
           ORA-01115: IO error reading block from file 72 (block # 2288034)
           ORA-27063: number of bytes read/written is incorrect
           SVR4 Error: 45: Deadlock situation detected/avoided
           Additional information: -1
           Additional information: 16384

                ...
           ORA-01115: IO error reading block from file 74 (block # 2289584)
           ORA-27063: number of bytes read/written is incorrect
           SVR4 Error: 45: Deadlock situation detected/avoided
           Additional information: -1
           Additional information: 24576

      The key here is the term "SVR4 Error: 45". I know it's an OS error, and I'm currently going through our /var/adm/messages history. However, I don't know what the numbers 16384 and 24576 mean. I assume they are values that I'm running up against (e.g. it tried to write 16384 bytes, or hit a 16384 stack limit or open file descriptor limit).
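      One way to sanity-check those numbers is to see whether they are whole multiples of the database block size. The sketch below assumes an 8192-byte db_block_size, which the post does not state (it is only suggested by the recordsize=8k setting). It also reflects my reading, not something confirmed in this thread, that the second "Additional information" value is the size in bytes of the I/O request that failed. It uses `$(( ))` arithmetic, so run it under bash or ksh rather than the legacy Solaris /bin/sh.

```shell
# Assumption: db_block_size is 8192 bytes (not stated in the post).
DB_BLOCK_SIZE=8192

# Print how many database blocks a byte count corresponds to,
# or flag it if it is not a whole multiple of the block size.
bytes_to_blocks() {
    bytes=$1
    if [ $((bytes % DB_BLOCK_SIZE)) -eq 0 ]; then
        echo "$bytes bytes = $((bytes / DB_BLOCK_SIZE)) blocks"
    else
        echo "$bytes bytes is not a multiple of $DB_BLOCK_SIZE"
    fi
}

bytes_to_blocks 16384
bytes_to_blocks 24576
```

      Under that assumption, both values come out as whole block counts (2 and 3 blocks), which would be consistent with multi-block read requests rather than a stack or file descriptor limit.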

      The problem is that I don't see any such errors on my UFS filesystems, so I assume this is purely related to the ZFS setup. Has anybody else seen these "SVR4 Error: 45" messages in their environment?
        • 1. Re: Oracle Database migrated to ZFS filesystem crashes server
          Nik
          Hi
          At this point, your database files are corrupted.

          You should recover them from backup.

          How did you copy the files from UFS to ZFS?
          How much free space is there on ZFS?

          Regards.
          • 2. Re: Oracle Database migrated to ZFS filesystem crashes server
            lrp1
            See, that's the thing: I expected to see some corruption in the database, but when I re-opened it, it opened fine. The tablespaces and datafiles came online without any need for recovery.

            I ran an RMAN logical validate (to check for block corruption) and came up with no errors.

                 # validate the database without backing it up
                 export NLS_DATE_FORMAT='YYYY-MM-DD HH24:MI'
                 rman target / nocatalog
                 run {
                      allocate channel c1 device type disk ;
                      allocate channel c2 device type disk ;
                      backup validate check logical database filesperset 100;
                 }

                      List of Datafiles
                      =================
                      File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
                      ---- ------ -------------- ------------ --------------- ----------
                      1    OK     0              10918        89600           2513212213
                        File Name: /hlpsdbrg/oradata10/HLPS01/datafile/SYSTEM_1.dbf
                        Block Type Blocks Failing Blocks Processed
                        ---------- -------------- ----------------
                        Data       0              48488
                        Index      0              10661
                        Other      0              19533
            <etc. etc. All the datafiles showed NO 'marked corrupt' or 'failing blocks' numbers.>

            ...beyond that, we had already tested the SAME steps in our development environment successfully. They're running over ZFS without issue.

            To get more clues, I need to understand what the OS is choking on, particularly since it was bad enough to cause an entire server coredump/crash. Again, I find little documentation when I search MOS or Google for the following terms:
            "SVR4 45 deadlock zfs oracle"
            "io deadlock svr4 zfs"
            "blocks zfs oracle svr4"

            To do the migration, I mostly followed the process described in the Oracle documentation on BACKUP AS COPY, along with various blog posts illustrating the concept.
            - http://hemantoracledba.blogspot.ca/2012/06/rman-backup-as-copy.html
            - http://afatkulin.blogspot.ca/2009/01/moving-datafile.html
            There is plenty of free space in the zpool, which all filesystems share. The only thing I was concerned about was the ZFS recordsize (8k), but the errors appeared even on datafiles with a 128k recordsize.