16 Replies Latest reply: Jan 9, 2009 10:12 AM by 437959

    ORA-19502, ORA-27601, async_io fails, Linux RHEL5 Update 2, Oracle 10.2.0.4

    437959
      We have encountered the disk I/O errors described elsewhere and in Metalink note 395350.1, but we are not using Veritas and do not want to set disk_asynch_io=false and filesystemio_options=none. We use ASM with a corporate SAN and the Oracle oracleasm libraries, but the problem occurs only on the local RAID, which is a Linux filesystem, not ASM.

      Linux (Red Hat) kernel version is 2.6.18-92.el5xen x86-64
      Oracle version 10.2.0.4.0
      Server is Dell R900

      Any help would be appreciated.

      Peter
        • 1. Re: ORA-19502, ORA-27601, async_io fails, Linux RHEL5 Update 2, Oracle 10.2
          Tommyreynolds-Oracle
          What are the O/S errors listed in "/var/log/messages" when this happens?

          Is this a bare-metal or virtualized server?
          • 2. Re: ORA-19502, ORA-27601, async_io fails, Linux RHEL5 Update 2, Oracle 10.2
            437959
            all /var/log/messages are 0 bytes
            bare metal server, Dell R900

            from the alert log

            ARC1: Encountered disk I/O error 19502
            Exception 19502 encountered when closing file
            ORA-19502: write error on file
            ORA-27601: waiting for async I/Os failed
            Linux-x86_64 Error: 5: Input/output error
            Additional information: -1
            Additional information: 1048576

            Is it possible that this is simply an indication of a corrupt block in the file being written?

            db_block_checking=LOW
            db_block_checksum=TYPICAL
            • 3. Re: ORA-19502, ORA-27601, async_io fails, Linux RHEL5 Update 2, Oracle 10.2
              506787
              This error means the AIO call from the oracle database has gotten an error back from the operating system.

              The error from the o/s is error 5, I/O error.

              The 1048576 is probably the write size of the (a)IO call.

              It's the archiver (ARC1) that gets the error. Where does the archiver write?

              (db_block_checking and db_block_checksum are not parameters that influence the archiver, so setting them to other values will probably not change anything.)
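The reading of the alert-log excerpt above can be checked mechanically. The sketch below is only an illustration: `summarize_io_error` is a hypothetical helper name, and treating the last "Additional information" value as a byte count is the assumption made in this reply ("probably the write size"), not something Oracle documents here. On Linux, errno 5 is EIO.

```shell
# Hypothetical helper: reads alert-log text on stdin and prints the OS errno
# plus the last "Additional information" value, assumed to be a byte count.
summarize_io_error() {
  awk '
    /Linux-x86_64 Error:/ {
      # Line looks like: "Linux-x86_64 Error: 5: Input/output error"
      n = $3; sub(/:$/, "", n); errno = n
    }
    /Additional information:/ { last = $3 }
    END {
      if (errno != "") printf "os_errno=%s\n", errno
      if (last + 0 > 0) printf "size_bytes=%d size_mib=%.2f\n", last, last / 1048576
    }'
}
```

Feeding it the four lines from the alert log reports errno 5 and a size of 1.00 MiB, which is consistent with a 1M write request.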
              • 4. Re: ORA-19502, ORA-27601, async_io fails, Linux RHEL5 Update 2, Oracle 10.2
                437959
                the archiver writes to local RAID, as does LGWR
                I wondered if having online redo logs and archive logs on the same mountpoint caused too much contention, but we have other servers with the same configuration, and even the same load, that have no problems

                one responder asked if it was a virtual server and I said no; however, the kernel is xen, which according to our sysadmin is apparently the default for RHEL5, so we are wondering if using an SMP kernel (I don't even know what the RHEL 5 options are) might make a difference.

                I suppose it could simply be bad blocks on the local RAID; it just seems odd that our 2 DEV R900s are ok and our 2 PROD R900s are having problems. The only difference between the two environments that we are aware of is the SAN storage for our ASM diskgroups. Makes no sense to me that that would matter. I also pulled the value for /proc/sys/fs/aio-nr when the db hung up and it was very low, under 6000. The recommended value for aio-max-nr is 3 million.
                • 5. Re: ORA-19502, ORA-27601, async_io fails, Linux RHEL5 Update 2, Oracle 10.2
                  506787
                  I am not aware of any timeout on (async) I/O, which would be necessary for heavy contention to trigger this error.
                  I've seen I/O times go up to 10+ seconds on various platforms without triggering I/O errors.
                  I have not looked at the Linux AIO code, but I assume it has no such timeout.

                  As far as I know the xen kernel is not the default one; the default is the normal version. (Unlike x86, the x64 kernel has no separate UP (uniprocessor) and SMP versions, and obviously no PAE version, just the version without '-xen'.)

                  If it were bad blocks on the local RAID, the I/O error would have to be raised by the O/S, which very probably means the error would be in the O/S logfiles (messages).
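A quick way to test that expectation is to scan the O/S log for I/O-related kernel messages. This is a minimal sketch: `scan_os_log` is a hypothetical helper, and the search patterns are assumptions about what a 2.6 kernel typically logs for disk faults, not an exhaustive list. It also flags an empty log file, since an empty /var/log/messages says nothing either way.

```shell
# Hypothetical helper: print I/O-related lines from a syslog file.
# Returns 2 if the file is missing or empty (no evidence either way).
scan_os_log() {
  log=${1:-/var/log/messages}
  if [ ! -s "$log" ]; then
    echo "warning: $log is missing or empty; is syslogd running?" >&2
    return 2
  fi
  # Patterns are assumed examples of kernel disk-error messages.
  grep -E -i 'I/O error|SCSI error|medium error' "$log"
}
```

If the log files are empty, running `dmesg` is worth a try as well, because it reads the kernel ring buffer directly rather than the syslog files.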

                  AIO-MAX-NR is the total number of AIO contexts available; AIO-NR is the number of AIO contexts in use. This means AIO-MAX-NR minus AIO-NR is the number of AIO contexts still available for use. But if there are no AIO contexts available, Oracle just uses synchronous I/O. I don't think there's an issue there.
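The headroom described above can be read straight from /proc. The sketch below uses a hypothetical helper name, `aio_headroom`; the two file arguments exist only so it can be tried against sample files. Note that the kernel documentation describes these counters in terms of events requested through io_setup rather than contexts, so treat the difference as a rough indicator.

```shell
# Hypothetical helper: report AIO usage and remaining headroom.
# Optional arguments override the /proc paths (useful for testing).
aio_headroom() {
  nr_file=${1:-/proc/sys/fs/aio-nr}
  max_file=${2:-/proc/sys/fs/aio-max-nr}
  nr=$(cat "$nr_file") || return 1
  max=$(cat "$max_file") || return 1
  echo "aio-nr=$nr aio-max-nr=$max free=$((max - nr))"
}
```

Called with no arguments on the database server, ideally while the instance is hung, it shows whether aio-nr is anywhere near aio-max-nr.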

                  I still think the database issued an AIO call which got an error back from the operating system because its parameters were invalid according to the O/S.
                  This error looks remarkably like the Veritas error message, which was an issue because of I/O size.
                  • 6. Re: ORA-19502, ORA-27601, async_io fails, Linux RHEL5 Update 2, Oracle 10.2
                    506787
                    Can you search udump/bdump for trace files that contain more messages? On Metalink there is a similar case where the database writer (DBWR) trace shows messages about kernel resource starvation.
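One way to run that search is sketched below. `find_trace_warnings` is a hypothetical helper, and the default pattern is an assumption about which strings are worth looking for (the two ORA codes from this thread plus a kernel I/O resource warning); pass a different pattern as the second argument if needed.

```shell
# Hypothetical helper: list *.trc files under a dump directory that
# contain any of the strings of interest.
find_trace_warnings() {
  dir=$1
  default='out of OS kernel I/O resources|ORA-27601|ORA-19502'
  pattern=${2:-$default}
  find "$dir" -type f -name '*.trc' -exec grep -E -l "$pattern" {} +
}
```

Run it once against the directory named by background_dump_dest and once against the one named by user_dump_dest.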
                    • 7. Re: ORA-19502, ORA-27601, async_io fails, Linux RHEL5 Update 2, Oracle 10.2
                      506787
                      Also, please set the following using sqlplus / as sysdba:

                      alter system set event='27601 trace name errorstack level 3' scope=spfile;

                      and bounce the database to make the event active.
                      • 8. Re: ORA-19502, ORA-27601, async_io fails, Linux RHEL5 Update 2, Oracle 10.2
                        437959
                        Frits, many thanks for your effort on this. Unfortunately, I will not have access to the system until Jan 5.

                        I had looked for OS error messages under /var/log/messages and found empty files for aio. As for an aio timeout: when the instance last hung, a log file sync session (LGWR) was blocking everything else for over an hour.

                        We could not kill any processes, even using kill -9, and /sbin/shutdown -r now would NOT shut down the server. Shutdown abort of the instance did nothing. I shut down the ASM instance normally and then found the db instance still running.

                        I understand what you are saying about aio-max-nr and aio-nr; that the resource starvation, if indeed that is the fault, is not due to the aio configuration.

                        You wrote:
                        I still think the database issued an AIO call which got an error back from the operating system because its parameters were invalid according to the O/S.
                        This error looks remarkably like the Veritas error message, which was an issue because of I/O size.

                        If the db parameters are invalid according to the O/S, then why is this error so intermittent? I recall the Veritas error had Linux error 26, and my problem has Linux error 5. In any case, how can a faulty configuration result in a) very intermittent errors and b) no errors in a Development environment that is identical in configuration and in redo and archive activity?

                        I will set the event=27601 on Monday and report back; but why not set event=19502 instead?

                        Again, many thanks for your insight and time

                        Peter
                        • 9. Re: ORA-19502, ORA-27601, async_io fails, Linux RHEL5 Update 2, Oracle 10.2
                          Tommyreynolds-Oracle
                          Exactly which processes are hung when this happens? What does:

                          # /bin/ps -ax | /bin/grep D

                          show? (Looking for processes waiting for an event that does not happen.) Look for a "D" in the STAT column.
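A plain `grep D` also matches any line that merely contains the letter D somewhere in the command name. A slightly stricter variant, offered as a sketch, filters on the STAT column itself and shows the kernel wait channel. The function names are hypothetical; the `ps` format specifiers are those of the procps `ps` shipped with Linux.

```shell
# Keep the header line plus rows whose second column (STAT) starts with D.
filter_d_state() {
  awk 'NR == 1 || $2 ~ /^D/'
}

# List processes in uninterruptible sleep, with their wait channel.
list_d_state() {
  ps -eo pid,stat,wchan:32,args | filter_d_state
}
```

Running `list_d_state` a few times, several seconds apart, helps distinguish processes that are stuck from ones that are only briefly waiting on I/O.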
                          • 10. Re: ORA-19502, ORA-27601, async_io fails, Linux RHEL5 Update 2, Oracle 10.2
                            506787
                            Are you sure you are not on NFS? I've seen this behavior with another database (PostgreSQL) on NFS, where a write error caused by a faulty NFS client made all access to the affected inode hang. The only resolution was to reboot.

                            The intermittency could have several causes. For example, the largest possible AIO write size (1M) is not always used.

                            You could also set the event for 19502. It just causes Oracle to dump more information when you encounter that message.
                            • 11. Re: ORA-19502, ORA-27601, async_io fails, Linux RHEL5 Update 2, Oracle 10.2
                              437959
                              I am certain we are not on NFS.

                              We typically MERGE in millions of rows, our redo logfiles are 10G and switch 3-4 times per hour, so I imagine that the largest possible write size is not uncommon.
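For scale, the redo figures above translate into a sustained write rate as follows. This is back-of-the-envelope arithmetic only, assuming "10G" means 10 GiB and that each log is full when it switches; `redo_rate_mib_s` is a hypothetical helper.

```shell
# $1 = redo log size in GiB, $2 = log switches per hour.
# Prints the average redo rate in MiB per second.
redo_rate_mib_s() {
  awk -v gib="$1" -v sw="$2" 'BEGIN { printf "%.1f\n", gib * 1024 * sw / 3600 }'
}
```

With 3 and 4 switches per hour this gives roughly 8.5 and 11.4 MiB/s of redo, and the archiver writes about the same volume again to the same mountpoint.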

                              I will set event=27601 on Monday and monitor ps -efl for processes with status D.

                              thanks,

                              Peter
                              • 12. Re: ORA-19502, ORA-27601, async_io fails, Linux RHEL5 Update 2, Oracle 10.2
                                437959
                                The OS processes with status D are Oracle parallel execution servers with WCHAN=sync_p
                                The blocking session per v$session is LGWR and the event is log file sync
                                • 13. Re: ORA-19502, ORA-27601, async_io fails, Linux RHEL5 Update 2, Oracle 10.2
                                  437959
                                  I found a bdump lgwr trace file with multiple warnings

                                  Warning: Oracle process running out of OS kernel I/O resources (1)

                                  and then it calms down and simply says
                                  Warning: log write time 1280ms, size 102231KB
                                  etc

                                  but since I set aiowaittimeouts=250 (up from 100) the system has not hung despite heavy load and these lgwr trc file warnings
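To keep an eye on the LGWR warnings quoted above, the slow writes can be pulled out of the trace file. This is a sketch that assumes the warning lines have exactly the format shown in this post; `slow_log_writes` is a hypothetical helper.

```shell
# $1 = threshold in milliseconds; trace file text on stdin.
# Prints "<ms> ms <KB> KB" for each log write at or above the threshold.
slow_log_writes() {
  awk -v min="$1" '
    /Warning: log write time/ {
      # Line looks like: "Warning: log write time 1280ms, size 102231KB"
      ms = $5; sub(/ms,?$/, "", ms)
      kb = $7; sub(/KB$/, "", kb)
      if (ms + 0 >= min) printf "%d ms %d KB\n", ms, kb
    }'
}
```

For example, piping the lgwr trace file through `slow_log_writes 1000` lists only the writes that took a second or longer.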
                                  • 14. Re: ORA-19502, ORA-27601, async_io fails, Linux RHEL5 Update 2, Oracle 10.2
                                    506787
                                    Have you looked at bug 6687381? It looks like your problem, and there's a one-off patch for it.