7 Replies Latest reply on Jan 7, 2011 2:23 PM by EdStevens

    error 443 on startup with network outage

    EdStevens
      Oracle 10.2.0.4.0 EE on HP-UX 11.23

      Last evening I was applying a CPU. When opatch had completed successfully I tried to start the database to run the follow-on scripts. It hung for a very long time before eventually failing without even getting to MOUNT, and left no background processes up. While poking around to figure out what was going on, I started getting network issues and eventually discovered that a net admin had done something that caused us to lose routing. At that point, symptoms at my desk were:

      - established putty sessions were operational. I could navigate, issue commands, etc.
      - unable to establish any new putty sessions.
      - "lsnrctl start" would hang
      - database startup would hang, then fail.

      Once the net guys got things fixed, the database startup was textbook perfect.

      What I don't understand is why a network issue would prevent a database startup from even completing the initialization phase. The only possible connection I can see is the DB_DOMAIN init parameter, but I don't know what is being done with that at startup.

      Here's the extract from the alert log, beginning with the last few lines of the previous shutdown, and ending with the next startup after the first failed one.

      All references to SID, SERVER and DOMAIN names have been replaced with "generic" names.
      Completed: ALTER DATABASE DISMOUNT
      ARCH: Archival disabled due to shutdown: 1089
      Shutting down archive processes
      Archiving is disabled
      Archive process shutdown avoided: 0 active
      ARCH: Archival disabled due to shutdown: 1089
      Shutting down archive processes
      Archiving is disabled
      Archive process shutdown avoided: 0 active
      Tue Jan  4 17:35:13 2011
      Starting ORACLE instance (normal)
      LICENSE_MAX_SESSION = 0
      LICENSE_SESSIONS_WARNING = 0
      Picked latch-free SCN scheme 3
      Autotune of undo retention is turned on. 
      IMODE=BR
      ILAT =36
      LICENSE_MAX_USERS = 0
      SYS auditing is disabled
      Tue Jan  4 17:37:53 2011
      ksdpec: called for event 13740 prior to event group initialization
      Starting up ORACLE RDBMS Version: 10.2.0.4.0.
      System parameters with non-default values:
        processes                = 300
        sessions                 = 335
        __shared_pool_size       = 553648128
        __large_pool_size        = 16777216
        __java_pool_size         = 16777216
        __streams_pool_size      = 0
        disk_asynch_io           = FALSE
        sga_target               = 2550136832
        control_files            = /oradata/ora_control/mysid/control01.ctl, <snip>
        db_block_size            = 8192
        __db_cache_size          = 1946157056
        compatible               = 10.2.0.1.0
        log_archive_dest_10      = location=/export/oraarch
        log_archive_format       = %t_%s_%r.arc
        db_file_multiblock_read_count= 16
        db_recovery_file_dest    = /archive/ora_fra
        db_recovery_file_dest_size= 19327352832
        undo_management          = AUTO
        undo_tablespace          = UNDOTBS1
        O7_DICTIONARY_ACCESSIBILITY= TRUE
        remote_login_passwordfile= EXCLUSIVE
        db_domain                = my.domain.com
        dispatchers              = (PROTOCOL=TCP) (SERVICE=MYSIDXDB)
        smtp_out_server          = myserver:25
        job_queue_processes      = 10
        background_dump_dest     = /oracle/app/admin/mysid/bdump
        user_dump_dest           = /oracle/app/admin/mysid/udump
        core_dump_dest           = /oracle/app/admin/mysid/cdump
        audit_file_dest          = /oracle/app/admin/mysid/adump
        audit_trail              = DB, EXTENDED
      
        db_name                  = MYSID
        open_cursors             = 300
        pga_aggregate_target     = 848297984
      Tue Jan  4 17:40:48 2011
      USER: terminating instance due to error 443
      Instance terminated by USER, pid = 12248
      Tue Jan  4 18:18:21 2011
      Starting ORACLE instance (normal)
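
The timestamps in the alert log extract above give a rough picture of where the startup stalled. As a minimal sketch (the function name `timestamp_gaps` is hypothetical and not part of any Oracle tooling), the gaps between consecutive timestamp lines can be computed like this, assuming the default English day and month abbreviations:

```python
from datetime import datetime

def timestamp_gaps(log_lines):
    """Return the seconds elapsed between consecutive alert-log timestamp lines.

    Lines that are not timestamps (e.g. 'Starting ORACLE instance (normal)')
    are skipped.
    """
    stamps = []
    for line in log_lines:
        text = " ".join(line.split())  # collapse padding such as "Jan  4"
        try:
            stamps.append(datetime.strptime(text, "%a %b %d %H:%M:%S %Y"))
        except ValueError:
            continue  # not a timestamp line
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]

extract = [
    "Tue Jan  4 17:35:13 2011",
    "Starting ORACLE instance (normal)",
    "Tue Jan  4 17:37:53 2011",
    "Starting up ORACLE RDBMS Version: 10.2.0.4.0.",
    "Tue Jan  4 17:40:48 2011",
    "USER: terminating instance due to error 443",
    "Tue Jan  4 18:18:21 2011",
]
print(timestamp_gaps(extract))
```

For the extract in this post, the first two gaps are 160 and 175 seconds, i.e. the instance sat for several minutes at each stage before terminating with error 443.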
        • 1. Re: error 443 on startup with network outage
          558383
          I have not experienced the same issue with a single instance, but I have seen something a little bit similar with RAC, as described in the thread "Is a gateway mandatory for RAC environment?"
          • 2. Re: error 443 on startup with network outage
            EdStevens
            P. Forstmann wrote:
            I have not experienced the same issue with a single instance, but I have seen something a little bit similar with RAC, as described in the thread "Is a gateway mandatory for RAC environment?"
            Thanks for the reference. Yes, not exact but very possibly a different manifestation in non-RAC environments. I particularly notice Surachart's comment "By default, the server's default gateway is used as a ping target during the Oracle RAC 10g VIP status check action." If it does that same ping of the default gateway early in a non-RAC startup, that would explain it nicely. I don't want to beat a dead horse, but would like to run this to ground just for my own education. Do you have any ideas for a line of inquiry or testing that might conclusively confirm that this is the mechanism? As I said in my opening statement, a Google AND a MetaLink search on the error code didn't turn up anything of any substance at all. Some kind of trace during startup would be ideal.
            • 3. Re: error 443 on startup with network outage
              558383
              You could try to set up a testing environment in which you can easily disable/enable network and test instance startup.
              I'm also wondering if using shared server (DISPATCHERS parameter) and SMTP configuration (SMTP_OUT_SERVER) in your init file could impact instance startup with network checks.
              • 4. Re: error 443 on startup with network outage
                EdStevens
                I do have a box I can disconnect from the network and test that on. Different OS (Linux vs HP-UX) but I wouldn't think that would make a difference for this scenario. It'll be Monday before I'm back in the office to be able to do that. In the meantime I'll do some searching to see if there is any technique to actually trace the startup, deeper than what the alert log shows.
                • 5. Re: error 443 on startup with network outage
                  EdStevens
                  P. Forstmann wrote:
                  You could try to set up a testing environment in which you can easily disable/enable network and test instance startup.
                  I'm also wondering if using shared server (DISPATCHERS parameter) and SMTP configuration (SMTP_OUT_SERVER) in your init file could impact instance startup with network checks.
                  I am not in a position at the moment to create a reproduction of the problem, but I did run some tests using strace on my virtual Linux machine on my laptop. I'm no expert with strace (never used it before, had to do some Google and man page research) but I think even with some possible apples/oranges issues between this test and my actual failure, this is pretty informative.

                  I used two putty sessions to my VM. In the first I logged on as oracle, connected as sysdba, then shut down the instance. Next I started a second putty session logged on as root, did a ps to get the pid of the sqlplus session, then:
                  [root@vmlnx01 ~]# strace -f -e trace=network -s 512 -p 23741
                  
                  Process 23741 attached - interrupt to quit
                  socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 7
                  bind(7, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
                  getsockname(7, {sa_family=AF_INET, sin_port=htons(15713), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
                  getpeername(7, 0xbfb5a9b8, [16])        = -1 ENOTCONN (Transport endpoint is not connected)
                  getsockopt(7, SOL_SOCKET, SO_SNDBUF, [262144], [4]) = 0
                  getsockopt(7, SOL_SOCKET, SO_RCVBUF, [262144], [4]) = 0
                  Process 23919 attached (waiting for parent)
                  Process 23919 resumed (parent 23741 ready)
                  [pid 23741] --- SIGCHLD (Child exited) @ 0 (0) ---
                  [pid 23919] socket(PF_FILE, SOCK_STREAM, 0) = 7
                   
                  <snip>
                  
                  Process 23944 attached
                  Process 23945 attached
                  Process 23944 detached
                  [pid 23919] --- SIGCHLD (Child exited) @ 0 (0) ---
                  [pid 23943] socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 13
                  notice here we are going to bind to localhost
                  [pid 23943] bind(13, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
                  [pid 23943] getsockname(13, {sa_family=AF_INET, sin_port=htons(32853), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
                  [pid 23943] getpeername(13, 0xbfc6bc0c, [16]) = -1 ENOTCONN (Transport endpoint is not connected)
                  [pid 23943] getsockopt(13, SOL_SOCKET, SO_SNDBUF, [262144], [4]) = 0
                  [pid 23943] getsockopt(13, SOL_SOCKET, SO_RCVBUF, [262144], [4]) = 0
                  [pid 23943] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 14
                  [pid 23943] setsockopt(14, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
                  [pid 23943] bind(14, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
                  [pid 23943] listen(14, 128)             = 0
                  [pid 23943] getsockname(14, {sa_family=AF_INET, sin_port=htons(27124), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
                  [pid 23943] getpeername(14, 0xbfc6ba00, [16]) = -1 ENOTCONN (Transport endpoint is not connected)
                  [pid 23943] getsockopt(14, SOL_SOCKET, SO_SNDBUF, [16384], [4]) = 0
                  [pid 23943] getsockopt(14, SOL_SOCKET, SO_RCVBUF, [87380], [4]) = 0
                  [pid 23943] sendto(13, "P", 1, 0, {sa_family=AF_INET, sin_port=htons(52993), sin_addr=inet_addr("127.0.0.1")}, 16) = 1
                  [pid 23921] getsockopt(0, SOL_SOCKET, SO_SNDBUF, 0xbfd9d21c, 0xbfd9d218) = -1 ENOTSOCK (Socket operation on non-socket)
                  [pid 23921] getsockopt(0, SOL_SOCKET, SO_RCVBUF, 0xbfd9d21c, 0xbfd9d218) = -1 ENOTSOCK (Socket operation on non-socket)
                  [pid 23921] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 15
                  and the next few lines show a connect to the host ip address
                  [pid 23921] connect(15, {sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("192.168.160.101")}, 16) = -1 EINPROGRESS (Operation now in progress)
                  [pid 23921] getsockopt(15, SOL_SOCKET, SO_SNDBUF, [50436], [4]) = 0
                  [pid 23921] getsockopt(15, SOL_SOCKET, SO_RCVBUF, [87680], [4]) = 0
                  [pid 23921] getsockname(15, {sa_family=AF_INET, sin_port=htons(18178), sin_addr=inet_addr("192.168.160.101")}, [16]) = 0
                  [pid 23945] socket(PF_FILE, SOCK_STREAM, 0) = 12
                  [pid 23945] connect(12, {sa_family=AF_FILE, path="/var/run/nscd/socket"...}, 110) = -1 ENOENT (No such file or directory)
                  [pid 23945] socket(PF_FILE, SOCK_STREAM, 0) = 12
                  [pid 23945] connect(12, {sa_family=AF_FILE, path="/var/run/nscd/socket"...}, 110) = -1 ENOENT (No such file or directory)
                  [pid 23945] socket(PF_FILE, SOCK_STREAM, 0) = 13
                  [pid 23945] connect(13, {sa_family=AF_FILE, path="/var/run/nscd/socket"...}, 110) = -1 ENOENT (No such file or directory)
                  [pid 23945] socket(PF_FILE, SOCK_STREAM, 0) = 13
                  
                  <snip>
                  While I don't really understand what's going on, it looks pretty clear that just getting the database started involves some network activity on both the localhost (127.0.0.1) and the host's assigned IP address from /etc/hosts. Maybe for registration with the listener? In any event, some network routing would have been required, and in my actual failure, there was a breakdown in network routing. I don't know exactly where, but the net admin did mention that - among other issues - the DNS server was not reachable.
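
The two things the trace points at (resolving the host name and opening a TCP connection to port 1521 on the host's own address) can be probed separately from the database. The following is only a minimal sketch under stated assumptions: the function name `check_listener_path` is hypothetical, port 1521 is taken from the connect() call in the trace, and nothing here reproduces what the Oracle processes actually do internally.

```python
import socket

def check_listener_path(host, port=1521, timeout=5.0):
    """Return (resolved_ip, reachable) for host:port.

    resolved_ip is None when name resolution fails. reachable is False when
    the TCP connect is refused or does not complete within `timeout` seconds.
    Note: the name lookup itself has no timeout here, so an unreachable DNS
    server can make this call stall, much like the hang described above.
    """
    try:
        ip = socket.gethostbyname(host)
    except OSError:  # includes socket.gaierror on lookup failure
        return None, False
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:  # refused, unreachable, or timed out
        return ip, False
```

Calling `check_listener_path(socket.gethostname())` on the database server, before and during a network outage, would show whether name resolution or the connection to the listener port is the step that stalls.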

                  Edited by: EdStevens on Jan 6, 2011 9:17 PM
                  • 6. Re: error 443 on startup with network outage
                    sb92075
                    On multiple occasions on this forum & others, folks report TNS or ORA errors relating to SQL*Net
                    when connecting to a local DB, when & where SQL*Net should NOT be involved.

                    Typically these are brand new installations.
                    Usually the problem is a networking/OS "mis-match" between hostname at OS level vs. what is in /etc/hosts file
                    or IP# confusion between 127.0.0.1 & IP# assigned to NIC
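
A quick way to look for that kind of mismatch is to check which addresses the hosts file assigns to the machine's own host name. This is a minimal sketch, not an Oracle-supplied check; the function names `hosts_entries_for` and `diagnose` and the three result labels are made up for illustration.

```python
import ipaddress

def hosts_entries_for(hostname, hosts_text):
    """Return the IP addresses that hosts-file text maps to `hostname`."""
    matches = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        if hostname.lower() in (name.lower() for name in names):
            matches.append(ip)
    return matches

def _is_loopback(ip):
    """True if `ip` parses as a loopback address; unparsable text counts as False."""
    try:
        return ipaddress.ip_address(ip).is_loopback
    except ValueError:
        return False

def diagnose(hostname, hosts_text):
    """Classify the hosts-file mapping as 'missing', 'loopback-only', or 'ok'."""
    ips = hosts_entries_for(hostname, hosts_text)
    if not ips:
        return "missing"
    if all(_is_loopback(ip) for ip in ips):
        return "loopback-only"
    return "ok"
```

On the server itself this could be called as `diagnose(socket.gethostname(), open("/etc/hosts").read())`. A result of "loopback-only" would correspond to the 127.0.0.1 vs. NIC address confusion described above, and "missing" to a host name that is absent from /etc/hosts.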
                    • 7. Re: error 443 on startup with network outage
                      EdStevens
                      sb92075 wrote:
                      On multiple occasions on this forum & others, folks report TNS or ORA errors relating to SQL*Net
                      when connecting to a local DB, when & where SQL*Net should NOT be involved.

                      Typically these are brand new installations.
                      Usually the problem is a networking/OS "mis-match" between hostname at OS level vs. what is in /etc/hosts file
                      or IP# confusion between 127.0.0.1 & IP# assigned to NIC
                      Very true. But in this case the question was "why would known network problems (outside of tns config) cause the instance *startup* to fail?"
                      - This was a nearly 4-year old installation - all tns config issues resolved long ago.
                      - The only connection in play was the '/ as sysdba' to an idle instance, for the purpose of starting it.

                      What I couldn't figure out was why a simple instance startup should be sensitive to network routing issues. I had failed to consider instance registration with the listener. The strace didn't give me that specifically, but it did provide the "aha!" moment. I still don't understand exactly what was going on (the strace revealed a bind on both IP addresses) but at least feel satisfied that, at a slightly higher level, I understand why fundamental network problems (not tns config) would prevent an instance from starting.