ORA-00445 issues explained


    Have you ever encountered the error 'ORA-00445 Background Process.....'? If so, read on to find out more about this issue, including:

        o What does the error mean?

        o Which files contain details about the error?

        o Potential causes and actions

     

    What does the error "ORA-00445 Background Process "xxxx" Did Not Start After 120 Seconds" mean?

    ORA-00445 occurs when an Oracle background process is unable to spawn a child process within the allotted time of 120 seconds (2 minutes), so the startup of the child process is aborted. The parent process spawns the child synchronously and therefore cannot proceed with any other task until spawning completes. This can sometimes lead to an instance-wide hang, because the parent process may hold other resources while it is spawning the child.

     

    Where to find details about this error?

    Details regarding the error are recorded in the Alert Log and trace files - for example:

    1- Alert Log file

        Messages such as the following are written to the alert log:

        Errors in file xxx.trc (incident=12345):

        ORA-00445: background process "xxx" did not start after 120 seconds

        Incident details in: /export/oracle/diag/rdbms/orcl_u/orcl/incident/<incdir_312345>/<abc>_i12345.trc
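
    To pick these messages out of a large alert log, a small script can help. The following is a minimal sketch, assuming the message format shown above; the function name and the way the log text is passed in are illustrative and not part of any Oracle tooling.

```python
import re

# Matches lines such as:
#   ORA-00445: background process "xxx" did not start after 120 seconds
ORA_00445 = re.compile(
    r'ORA-00445: background process "(?P<process>[^"]+)" '
    r'did not start after (?P<seconds>\d+) seconds'
)


def find_ora_00445(alert_log_text):
    """Return (line_number, process_name, seconds) for each ORA-00445 line."""
    hits = []
    for number, line in enumerate(alert_log_text.splitlines(), start=1):
        match = ORA_00445.search(line)
        if match:
            hits.append(
                (number, match.group("process"), int(match.group("seconds")))
            )
    return hits
```

    The line numbers returned make it easier to locate the surrounding "Errors in file" and "Incident details in" lines, which name the trace files to review.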

     

    2- Incident Trace Files

        Information such as the following is written to the incident trace file at the time the issue occurs:

        Dump continued from file /export/oracle/diag/rdbms/orcl_u/orcl/incident/<incdir_312345>/<abc>_i12345.trc

        ORA-00445: background process "xxxx" did not start after 120 seconds

     

    3- Traditional tracefile generated at the time of issue

        The traditional background trace file contains further information generated at the time the issue occurs:

        Errors in file /export/oracle/diag/rdbms/orcl_u/orcl/trace/orcl_<abc>_12345.trc:

     

    You need to review both the traditional and the incident trace files to understand the load on the system and the major waits on the database. Typically:

    • The incident trace file shows the database-wide waits that the child process encountered while starting up.
    • The traditional trace file shows details of the load on the machine:

              - Load average
              - Memory consumption
              - Output of ps (process status)
              - Output of gdb (to view the function stack)

     

    In order to investigate the issue, it might also help to understand the general stages of the process startup:

    1. Queued phase
    2. Forking phase
    3. Execution phase
    4. Initialization phase

    In general, the forking and execution phases are directly linked to system load and resources. To check which phase the process startup has reached, open the traditional trace file (not the incident trace file) and look for the wording "Waited for process":

    Waited for process XYZ to be spawned for nnn seconds
    Waited for process XYZ to initialize for nnn seconds

    If the message contains "to be spawned", the process is still in the queued or forking phase (1 & 2).
    If the message contains "to initialize", the process is in the execution or initialization phase (3 & 4).
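
    The rule above can be expressed as a short helper. This is only a sketch: the function name is hypothetical, and it assumes the trace line uses exactly the wording quoted above.

```python
def startup_phases(trace_line):
    """Map a 'Waited for process' trace line to the startup phases it implies.

    Returns a tuple of phase names, or None if the line is not one of the
    two messages described in the text.
    """
    if "Waited for process" not in trace_line:
        return None
    if "to be spawned" in trace_line:
        return ("queued", "forking")            # phases 1 and 2
    if "to initialize" in trace_line:
        return ("execution", "initialization")  # phases 3 and 4
    return None
```

    For example, a line reading "Waited for process XYZ to be spawned for 60 seconds" maps to the queued and forking phases, which points towards system load and resources rather than database waits.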

    Other useful information can be obtained from the trace files:

    • Open the traditional trace file and review the section that starts with 'Process diagnostic dump for'. The 'load average', 'memory information', 'process state - ps output' and 'gdb output' parts give initial insight into the load on the system.
    • Open the incident trace file, find the section 'PROCESS STATE' and, within that section, look for 'Current Wait Stack'. This shows the database-wide events that the child process encountered and may provide clues and a general direction on how to proceed.
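
    Trace files can be long, so it may help to locate these sections programmatically. The sketch below simply reports the line numbers on which each marker string appears. It assumes the markers occur in the trace file exactly as quoted above (the match is case-sensitive), and the function name is illustrative.

```python
def locate_sections(trace_text, markers):
    """Return {marker: [line numbers]} for each marker found in the trace text."""
    positions = {marker: [] for marker in markers}
    for number, line in enumerate(trace_text.splitlines(), start=1):
        for marker in markers:
            if marker in line:
                positions[marker].append(number)
    return positions


# Marker strings taken from the description above.
TRADITIONAL_MARKERS = ["Process diagnostic dump for"]
INCIDENT_MARKERS = ["PROCESS STATE", "Current Wait Stack"]
```

    Calling locate_sections() with the contents of the incident trace file and INCIDENT_MARKERS gives the positions to jump to in an editor.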

     

    Why does the error occur? - Potential causes and solutions/actions

    The root cause of this issue mainly falls into one of the following two categories:

    • Contention among processes: The process that is starting up may require resources that other processes are contending for. Sometimes the parent process itself (indirectly) contends for the same resource as the child process.
    • OS and network level issues: The machine on which this is happening may be CPU or memory saturated, which can delay process spawning. Network latency can also delay process spawning when your storage is on a network file system.

    Some of the common known issues and potential solutions are listed below:

    1. Lack of OS resources or incorrect configuration
      This error may occur because of a lack of OS resources or an incorrect configuration; typically, memory or swap space is insufficient to spawn a new process.
      The following checks may help to identify the issue:
      • Check the OS error log file for the time when the error was generated
        The OS messages log can indicate whether there is a problem with the operating system itself
        * AIX: the output of the "errpt" command and the "errpt -a" command
        * Linux: /var/log/messages
        * Solaris: /var/adm/messages
        * HP-UX: /var/adm/syslog/syslog.log
        * Tru64: /var/adm/messages
        * Windows: save the Application Log and System Log as txt files using Event Viewer
      • Run HCVE script
        The HCVE script verifies whether OS resources are set as recommended by Oracle. Instructions on how to download and run the script are outlined in Document 250262.1.
        Note that the script only checks whether your system is configured as per the recommended 'minimum' values. Depending on your environment, these values may not be sufficient.
      • Run OS Watcher
        OS Watcher is an Oracle tool that allows you to monitor the system from an OS perspective. Document 301137.1 outlines the usage of this tool.
      • Check the user limit (ulimit) settings (UNIX only)
        Check the ulimit settings as the oracle user (or the owner of the Oracle software) using
        $ ulimit -a
        Minimum values can be found in Document 169706.1.
        Note that the values mentioned in the note are bare minimums for a standard environment. Depending on your environment setup, you may need to increase them.
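
    The same limits can also be read from a script run as the oracle user. The following sketch uses Python's standard resource module and assumes a Linux host (RLIMIT_NPROC is not available on every UNIX platform). The function names are hypothetical, and the minimum values must be supplied by you from Document 169706.1; none are hard-coded here.

```python
import resource


def current_soft_limits():
    """Read soft limits of the current process (run as the oracle user)."""
    names = {
        "open files": resource.RLIMIT_NOFILE,
        "max user processes": resource.RLIMIT_NPROC,  # assumption: Linux
    }
    return {label: resource.getrlimit(rlimit)[0] for label, rlimit in names.items()}


def limits_below_minimum(current, minimums):
    """Return {name: (soft_limit, minimum)} for every limit below its minimum.

    A soft limit equal to resource.RLIM_INFINITY means "unlimited" and is
    never reported as a shortfall.
    """
    shortfalls = {}
    for name, minimum in minimums.items():
        soft = current[name]
        if soft != resource.RLIM_INFINITY and soft < minimum:
            shortfalls[name] = (soft, minimum)
    return shortfalls
```

    A typical use would be limits_below_minimum(current_soft_limits(), minimums), where minimums is a dictionary you fill in with the values required for your environment.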

      Recommended actions:

      • Review the resource-related suggestions reported in the output of the HCVE script and make changes accordingly. The following two articles may help in understanding these suggestions better:
        • Document 169706.1: Oracle Database on Unix AIX,HP-UX,Linux,Mac OS X,Solaris,Tru64 Unix Operating Systems Installation and Configuration Requirements Quick Reference (8.0.5 to 11.2)
        • Document 225349.1: Address Windowing Extensions (AWE) or VLM on Windows Platforms
          (Typically, on Windows systems with more than 4 GB of RAM, enabling the /3GB switch in boot.ini is required)
      • Check the user limit (ulimit) settings (UNIX only)

     

    2. The Linux ASLR feature is in use
    ASLR (Address Space Layout Randomization) is a feature that is enabled by default on some newer Linux distributions. It is designed to load shared memory objects at random addresses. In Oracle, multiple processes map a shared memory object at the same address. With ASLR turned on, Oracle cannot guarantee that this address is available, so a process trying to attach a shared memory object at a specific address may be unable to do so, resulting in a failure in the shmat() call.
    This problem is mainly reported on Red Hat 5 with Oracle 11.2.0.2. You can verify whether ASLR is in use as follows:
    # /sbin/sysctl -a | grep randomize
    kernel.randomize_va_space = 1
    -> If the parameter is set to any value other than 0 then ASLR is in use.
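
    If this check needs to be repeated on several hosts, the sysctl output can be interpreted by a script. The sketch below parses output in the format shown above; the function names are illustrative, and it assumes one "name = value" pair per line.

```python
def aslr_setting(sysctl_output):
    """Return the value of kernel.randomize_va_space, or None if not present."""
    for line in sysctl_output.splitlines():
        name, separator, value = line.partition("=")
        if separator and name.strip() == "kernel.randomize_va_space":
            return int(value.strip())
    return None


def aslr_in_use(sysctl_output):
    """True if the parameter is present and set to any value other than 0."""
    value = aslr_setting(sysctl_output)
    return value is not None and value != 0
```

    The output of /sbin/sysctl -a can be passed to aslr_in_use() as a single string. A result of None from aslr_setting() means the parameter was not found in the output, which is different from ASLR being disabled.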

     

    Recommended actions:

    Disable the ASLR feature.
    On Red Hat 5, to disable ASLR permanently, add or modify the following parameters in /etc/sysctl.conf:

    kernel.randomize_va_space=0
    kernel.exec-shield=0

    A reboot is required for the kernel.exec-shield parameter to take effect.
    Note that both kernel parameters must be set for ASLR to be switched off.

     

    3. Incorrect Database Settings

    • Check whether PGA_AGGREGATE_TARGET is set to TRUE/FALSE
      PGA_AGGREGATE_TARGET takes a numeric value, not a Boolean, and must therefore be set to a number to function correctly.
    • Check whether PRE_PAGE_SGA is set to TRUE
      PRE_PAGE_SGA instructs Oracle to read the entire SGA into active memory at instance startup. Operating system page table entries are then prebuilt for each page of the SGA. This setting can increase the time needed for instance startup, but it is likely to decrease the time Oracle needs to reach its full performance capacity after startup. PRE_PAGE_SGA can also increase process startup time, because every process that starts must access every page in the SGA. As a result, the PMON process may take longer to start and exceed the timeout (120 seconds by default), causing instance startup to fail.
    • Check the output of SQL> select * from v$resource_limit;
      The V$RESOURCE_LIMIT dynamic view provides details of resources such as sessions, processes and locks. For each resource it shows the initial allocation, the maximum utilization reached since the last database startup, and the current utilization.

    Recommended actions:

    • Set PGA_AGGREGATE_TARGET to a proper numeric value.
    • Set PRE_PAGE_SGA to FALSE (recommended).
    • Check whether any limits were reached and, if so, increase the value of the resource concerned.
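
    The check for reached limits can be sketched as follows. The example assumes the rows were fetched with a query such as "select resource_name, max_utilization, limit_value from v$resource_limit" by whatever client you use; the function name and the way rows are passed in are illustrative. LIMIT_VALUE is a character column that may be padded with spaces or contain UNLIMITED, so it is handled as text.

```python
def exhausted_resources(rows):
    """Return names of resources whose MAX_UTILIZATION reached LIMIT_VALUE.

    Each row is a tuple (resource_name, max_utilization, limit_value).
    Non-numeric limits such as 'UNLIMITED' and limits of 0 are skipped.
    """
    flagged = []
    for name, max_utilization, limit_value in rows:
        limit = str(limit_value).strip()
        if not limit.isdigit() or int(limit) == 0:
            continue
        if int(max_utilization) >= int(limit):
            flagged.append(name)
    return flagged
```

    Any resource returned by this function reached its limit at some point since the last database startup and is a candidate for a higher setting.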

    4. Other Causes or Known Issues
    Other potential causes include:

    • Too much activity on your machine
    • NFS latency issues
    • Disk latency issues (affecting I/O)
    • Network latency

    Recommended actions:
    Since almost all of these causes are unrelated to the RDBMS itself, it is recommended that you involve your network, storage and system administrators in the investigation.

     

    Please do let us know if you have any concerns/queries related to the above error and we will be happy to assist you further.

     

    Regards,

    Rehab

     

     

    This document was generated from the following discussion: ORA-00445 issues explained