7 Replies Latest reply: Feb 14, 2013 6:37 AM by 884278

Problems with DataGuard fast start failover

884278 Newbie
Hello,

I need some help.
I'm working for a client on a Data Guard 11gR2 configuration installed on 2 physical servers running Windows Server 2008 R2.

The broker is configured and MaxAvailability mode is activated.
I'm able to switch over and fail over successfully.

The problems start when we try to configure fast-start failover (FSFO).
We configured the systems to meet the FSFO requirements:

- MaxAvailability
- LogXptMode = SYNC
- Flashback enabled on both primary and standby
- FastStartFailoverTarget properties are set properly
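
A quick way to cross-check these prerequisites is to query V$DATABASE on both the primary and the standby. This is a minimal sketch, assuming 11gR2 and a SYSDBA connection in SQL*Plus; it was not run as part of this thread.

```sql
-- Run on both primary and standby (the standby may be only mounted).
SELECT db_unique_name,
       database_role,
       protection_mode,             -- configured mode, e.g. MAXIMUM AVAILABILITY
       protection_level,            -- mode currently in effect
       flashback_on,                -- should be YES on both databases
       fs_failover_status,
       fs_failover_current_target,
       fs_failover_threshold        -- in seconds
FROM   v$database;
```

If PROTECTION_LEVEL differs from PROTECTION_MODE (for example RESYNCHRONIZATION), the standby is not synchronized at that moment.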

We configured the Observer on a third server, a virtual machine (VMware) running Windows Server 2008 R2.

All 3 servers can ping each other successfully, and tnsping works as well.

I'm able to enable FSFO successfully.
But later, both primary and standby are down, with these error messages in the alert log:

On primary:

Primary has heard from neither observer nor target standby within FastStartFailoverThreshold seconds.
It is likely an automatic failover has already occurred. Primary is shutting down.
ORA-16830: primary isolated from fast-start failover partners longer than FastStartFailoverThreshold seconds: shutting down
LGWR (ospid: 6336): terminating the instance due to error 16830
Instance terminated by LGWR, pid = 6336

The standby database has not been started...

I also have "Fatal NI connect error" 12514 and 12518 entries with "ns main err code: 12564" in the alert log,
but I can't find the problem. All my tests with ping, tnsping and connections through SQL*Plus between the servers were OK.
No firewalls are configured.

The FastStartFailoverThreshold property is set to 30 seconds.

Is there a problem in the Data Guard configuration, or is there some other problem with Oracle Net or the listeners?
Without FSFO, Data Guard works perfectly.

Thank you for your help.
Best Regards
  • 1. Re: Problems with DataGuard fast start failover
    P.Forstmann Guru
    If you have not already done so, check whether there is additional information in the Data Guard Broker trace files (named drc*.log and located in the same directory as the database instance alert log).
  • 2. Re: Problems with DataGuard fast start failover
    teits Pro
    Hello,

    Did a fast-start failover occur on the standby DB?
    What is the status of the standby at startup?
    SELECT PROTECTION_MODE,DATABASE_ROLE,FS_FAILOVER_STATUS, SWITCHOVER_STATUS FROM V$DATABASE;

    Do you experience the problem without the OBSERVER?

    Suggestions:
    1. Make sure packets are not being dropped (ping).
    2. Make sure your VMware host has enough resources to avoid processing delays.
    In other words, check your VMware performance (likely cause).
    3. Configure the observer on another server (preferably not a virtual server).

    If that does not resolve it, look for more information in drc*.log.

    HTH
    Tobi
  • 3. Re: Problems with DataGuard fast start failover
    884278 Newbie
    Hello,

    Thank you for your response.
    Here is the primary's drc*.log.

    I see that the primary has problems connecting to the standby and the observer, but I don't understand why...

    Any idea?
    Thank you in advance.
    Regards.


    2013-01-15 08:30:53.925 DMON: >> Starting Data Guard Broker bootstrap <<
    2013-01-15 08:30:53.940 DMON: Broker Configuration File Locations:
    2013-01-15 08:30:53.940 dg_broker_config_file1 = "D:\DG_BROKER_CONFIG_FILE1.DAT"
    2013-01-15 08:30:53.940 dg_broker_config_file2 = "F:\DG_BROKER_CONFIG_FILE2.DAT"
    2013-01-15 08:30:53.940 DMON: Attach state object
    2013-01-15 08:30:53.972 DMON: rfafoGetLocks reinitializing dubious PMYSHUT lock value block contents: flags=0x0, spare1=0x0, spare2=0x0, cksm=0x0, rndm=0x0
    2013-01-15 08:30:53.972 DMON: Entered rfm_get_chief_lock() for CTL_BOOTSTRAP, reason 2
    2013-01-15 08:30:53.972 7fffffff 0 DMON: chief lock convert for bootstrap
    2013-01-15 08:30:53.972 DMON: Boot configuration (0.0.0), loading from "D:\DG_BROKER_CONFIG_FILE1.DAT"
    2013-01-15 08:30:53.987 DMON Registering service my_primary_db_DGB with listener(s)
    2013-01-15 08:30:53.987 Executing SQL [ALTER SYSTEM REGISTER]
    2013-01-15 08:30:53.987 SQL [ALTER SYSTEM REGISTER] Executed successfully
    2013-01-15 08:30:53.987 DMON: ...committing load to memory, Seq.MIV = 4.57
    2013-01-15 08:30:53.987 DMON: Broker Configuration: "dg_conf1"
    2013-01-15 08:30:53.987 Metadata Version: 3.0 / UID=0x7f3462d6 / Seq.MIV=4.57 / blksz.grain=4096.8
    2013-01-15 08:30:53.987 Protection Mode: Maximum Availability
    2013-01-15 08:30:53.987 Fast-Start Failover (FSFO): Enabled, flags=0x40021, version=278
    2013-01-15 08:30:53.987 Primary Database: my_primary_db (0x01010000)
    2013-01-15 08:30:53.987 Standby Database: my_standby_db, Enabled Physical Standby (FSFO target) (0x02010000)
    2013-01-15 08:30:53.987 DMON: FSFO - observer ping countdown timer - oTime reset to current time
    2013-01-15 08:30:53.987 DMON: Instance ID 1 is the broker FSFP HOME instance
    2013-01-15 08:30:53.987 DMON: Instance ID 1 is the broker health check master
    2013-01-15 08:30:57.045 DMON: Creating process FSFP
    2013-01-15 08:30:57.372 Executing SQL [ALTER SYSTEM REGISTER]
    2013-01-15 08:30:57.372 SQL [ALTER SYSTEM REGISTER] Executed successfully
    2013-01-15 08:31:00.071 FSFP: Process started
    2013-01-15 08:31:00.414 DMON: FSFP successfully started
    2013-01-15 08:31:03.457 00001000 1795085422 DMON: Version Check Results
    2013-01-15 08:31:03.457 00001000 1795085422 Database my_standby_db returned ORA-00000
    2013-01-15 08:31:03.457 DMON: FSFO ack received
    2013-01-15 08:31:03.457 00001000 1795085422 DMON: messaging RSM to bootstrap locally
    2013-01-15 08:31:03.457 DMON: Creating process RSM0
    2013-01-15 08:31:03.831 Fore: FSFO ack received
    2013-01-15 08:31:03.831 Fore: FSFO ack received
    2013-01-15 08:31:06.483 RSM0: Process state object initialization complete
    2013-01-15 08:31:06.499 00001000 1795085422 DMON: Entered rfm_release_chief_lock() for CTL_VC
    2013-01-15 08:31:06.499 RSM0: Received Set State Request: rid=0x01010001, sid=9, phid=2, econd=7, sitehndl=0x01001000
    2013-01-15 08:31:06.499 Database Resource[IAM=PRIMARY]: SetState READ-WRITE-XPTON, phase BUILD-UP, External Cond ENABLE, Target Site Handle 0x01001000
    2013-01-15 08:31:06.514 Set log transport destination: SetState ONLINE, phase BUILD-UP, External Cond ENABLE
    2013-01-15 08:31:06.514 Executing SQL [ALTER SYSTEM SET log_archive_trace=0 SCOPE=BOTH sid='my_primary_db']
    2013-01-15 08:31:06.655 SQL [ALTER SYSTEM SET log_archive_trace=0 SCOPE=BOTH sid='my_primary_db'] Executed successfully
    2013-01-15 08:31:06.655 Executing SQL [ALTER SYSTEM SET log_archive_format='ARC%S_%R.%T' SCOPE=SPFILE sid='my_primary_db']
    2013-01-15 08:31:06.857 SQL [ALTER SYSTEM SET log_archive_format='ARC%S_%R.%T' SCOPE=SPFILE sid='my_primary_db'] Executed successfully
    2013-01-15 08:31:06.889 Executing SQL [ALTER SYSTEM SET standby_file_management='auto' SCOPE=BOTH sid='*']
    2013-01-15 08:31:06.904 SQL [ALTER SYSTEM SET standby_file_management='auto' SCOPE=BOTH sid='*'] Executed successfully
    2013-01-15 08:31:06.920 Executing SQL [ALTER SYSTEM SET archive_lag_target=0 SCOPE=BOTH sid='*']
    2013-01-15 08:31:06.935 SQL [ALTER SYSTEM SET archive_lag_target=0 SCOPE=BOTH sid='*'] Executed successfully
    2013-01-15 08:31:06.951 Executing SQL [ALTER SYSTEM SET log_archive_max_processes=4 SCOPE=BOTH sid='*']
    2013-01-15 08:31:07.934 SQL [ALTER SYSTEM SET log_archive_max_processes=4 SCOPE=BOTH sid='*'] Executed successfully
    2013-01-15 08:31:07.949 Executing SQL [ALTER SYSTEM SET log_archive_min_succeed_dest=1 SCOPE=BOTH sid='*']
    2013-01-15 08:31:07.965 SQL [ALTER SYSTEM SET log_archive_min_succeed_dest=1 SCOPE=BOTH sid='*'] Executed successfully
    2013-01-15 08:31:07.981 Database Resource SetState succeeded
    2013-01-15 08:31:07.996 RSM0: FSFO ack received
    2013-01-15 08:31:07.996 Data Guard: Fast-Start Failover state resolved, OK to open primary database now
    2013-01-15 08:31:07.996 RSM0: HEALTH CHECK ERROR: ORA-16808: primary database is not open
    2013-01-15 08:31:08.152 00000000 1795085423 Operation HEALTH_CHECK canceled during phase 2, error = ORA-16808
    2013-01-15 08:31:08.152 00000000 1795085423 Operation HEALTH_CHECK canceled during phase 2, error = ORA-16808
    2013-01-15 08:32:03.642 RSM0: HEALTH CHECK ERROR: ORA-16820: fast-start failover observer is no longer observing this database
    2013-01-15 08:32:03.642 00000000 1795085424 Operation HEALTH_CHECK canceled during phase 2, error = ORA-16820
    2013-01-15 08:32:03.642 00000000 1795085424 Operation HEALTH_CHECK canceled during phase 2, error = ORA-16820
    2013-01-15 08:32:12.925 LGWR: FSFO SetState("UNSYNC", 0x2) operation requires an ack
    2013-01-15 08:32:12.925      Potential primary shutdown if ack is not received within 27 seconds
    2013-01-15 08:32:23.564 FSFP: Failed to connect to remote database my_standby_db. Error is ORA-12528
    2013-01-15 08:32:23.704 DMON: processing last gasp from site 0x02001000, instance 1
    2013-01-15 08:32:23.704 DMON: done processing edit metadata.
    2013-01-15 08:32:23.845 RSM0: HEALTH CHECK ERROR: ORA-16820: fast-start failover observer is no longer observing this database
    2013-01-15 08:32:23.845 00000000 1795085425 Operation HEALTH_CHECK canceled during phase 2, error = ORA-16820
    2013-01-15 08:32:23.845 00000000 1795085425 Operation HEALTH_CHECK canceled during phase 2, error = ORA-16820
    2013-01-15 08:32:28.135 LGWR: still awaiting FSFO ack after 15 seconds
    2013-01-15 08:32:33.876 FSFP: Failed to connect to remote database my_standby_db. Error is ORA-01034
    2013-01-15 08:32:35.015 NSV1: Failed to connect to remote database my_standby_db. Error is ORA-01034
    2013-01-15 08:32:35.015 NSV1: Failed to send message to site my_standby_db. Error code is ORA-01034.
    2013-01-15 08:32:35.015 00000000 1795085425 DMON: Database my_standby_db returned ORA-01034
    2013-01-15 08:32:35.015 00000000 1795085425 for opcode = HEALTH_CHECK, phase = BEGIN, req_id = 1.1.1795085425
    2013-01-15 08:32:35.015 00000000 1795085425 DMON: Marking site "my_standby_db" as shutdown disabled
    2013-01-15 08:32:35.030 Executing SQL [alter system set log_archive_dest_state_3 = 'RESET']
    2013-01-15 08:32:35.046 SQL [alter system set log_archive_dest_state_3 = 'RESET'] Executed successfully
    2013-01-15 08:32:35.046 00000000 1795085425 DMON: Evaluating critical status of standbys in configuration
    2013-01-15 08:32:35.046 DMON: Updated Seq.MIV to 4.58, writing metadata to "F:\DG_BROKER_CONFIG_FILE2.DAT"
    2013-01-15 08:32:40.303 LGWR: Notifying CRS to disable services and monitoring for Primary Shutdown on Failover
    2013-01-15 08:32:40.303 LGWR: Primary has heard from neither observer nor target standby
    2013-01-15 08:32:40.303      within FastStartFailoverThreshold seconds. It is
    2013-01-15 08:32:40.303      likely an automatic failover has already occurred.
    2013-01-15 08:32:40.303      The primary is shutting down.
    2013-01-15 08:32:40.303 CLSR: CRS not configured, config = 2
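
    The two LGWR entries in this log bracket the shutdown countdown. As a small worked check (a sketch added for illustration, using only the timestamps shown above), the elapsed time between the "requires an ack" entry and the "Primary Shutdown" entry can be computed in SQL*Plus:

    ```sql
    -- Timestamps copied from the LGWR lines in the drc log above.
    SELECT TIMESTAMP '2013-01-15 08:32:40.303'
         - TIMESTAMP '2013-01-15 08:32:12.925' AS elapsed
    FROM   dual;
    ```

    The difference is 27.378 seconds, which lines up with the 27-second allowance LGWR announced, presumably what was left of the 30-second FastStartFailoverThreshold at that point.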
  • 4. Re: Problems with DataGuard fast start failover
    CKPT Guru
    >
    ORA-16830: primary isolated from fast-start failover partners longer than FastStartFailoverThreshold seconds: shutting down
    LGWR (ospid: 6336): terminating the instance due to error 16830
    Instance terminated by LGWR, pid = 6336
    >

    Where are the primary and standby located? Are there any time zone differences? Please check the MOS notes below; they might help you.

    Fast-Start Failover initiated due to clock change [ID 1131944.1]
    Bug 9647727 - Unexpected FSFO failover at onset of Daylight Savings Time (DST) [ID 9647727.8]
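
    To compare the clocks as each instance sees them, a one-line sketch (an illustration, not taken from the MOS notes) is to run the same query on primary and standby at roughly the same moment and compare both the time and the offset:

    ```sql
    -- SYSTIMESTAMP returns the server's OS time including its time zone offset.
    SELECT systimestamp FROM dual;
    ```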

    >
    2013-01-15 08:31:07.996 RSM0: HEALTH CHECK ERROR: ORA-16808: primary database is not open
    2013-01-15 08:32:03.642 RSM0: HEALTH CHECK ERROR: ORA-16820: fast-start failover observer is no longer observing this database
    >

    And what is the resetlogs change number on both primary and standby? Check whether the observer initiated a failover or not.
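
    A minimal sketch of the resetlogs comparison, assuming a SYSDBA connection to each database (run it on both and compare the rows):

    ```sql
    -- Same RESETLOGS_CHANGE# and RESETLOGS_TIME on both sides suggests
    -- that no failover has opened a new incarnation.
    SELECT db_unique_name,
           database_role,
           resetlogs_change#,
           resetlogs_time,
           prior_resetlogs_change#
    FROM   v$database;
    ```

    After a failover, the new primary would typically show a newer RESETLOGS_CHANGE# and RESETLOGS_TIME than the old primary.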
  • 5. Re: Problems with DataGuard fast start failover
    884278 Newbie
    teits wrote:
    >
    Did a fast-start failover occur on the standby DB?
    What is the status of the standby at startup?
    SELECT PROTECTION_MODE,DATABASE_ROLE,FS_FAILOVER_STATUS, SWITCHOVER_STATUS FROM V$DATABASE;

    Do you experience the problem without the OBSERVER?

    Suggestions:
    1. Make sure packets are not being dropped (ping).
    2. Make sure your VMware host has enough resources to avoid processing delays.
    In other words, check your VMware performance (likely cause).
    3. Configure the observer on another server (preferably not a virtual server).

    If that does not resolve it, look for more information in drc*.log.
    >

    No, I don't think a fast-start failover occurred on the standby.
    The standby never started up.
    I didn't run this query, and the system is now back to a normal state, without FSFO.

    We pinged the servers from each other and there is no problem.

    The virtual host has enough resources to run an Oracle database... It currently runs only the Observer.

    We didn't try moving the observer to another server. I'll think about it.

    Thank you for your help.
  • 6. Re: Problems with DataGuard fast start failover
    884278 Newbie
    CKPT wrote:
    >
    Where are the primary and standby located? Are there any time zone differences? Please check the MOS notes below; they might help you.
    >
    The primary and standby are located at the same site. The 2 servers are configured in the same time zone.


    We replayed the scenario. With Data Guard configured with the broker and without FSFO, all is OK.
    Then we enabled FSFO.
    A few hours later, the observer is "down".
    We see it with DGMGRL> show configuration, which reports
    ORA-16820: fast-start failover observer is no longer observing this database
    for both the primary and the standby.

    But when we connect to the observer server, the cmd window is still open with the observer process running. There is NOTHING in the observer log...

    So in our case, the observer seems to be unreliable...

    For the moment, the primary still seems to be OK.
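
    The same observer state can also be read from the database side. A minimal sketch, assuming 11gR2 and a SYSDBA connection to the primary (added for illustration, not something we ran at the time):

    ```sql
    -- FS_FAILOVER_OBSERVER_PRESENT shows whether the observer is currently
    -- connected to this database; FS_FAILOVER_OBSERVER_HOST shows where it runs.
    SELECT fs_failover_status,
           fs_failover_observer_present,
           fs_failover_observer_host
    FROM   v$database;
    ```

    Polling this periodically would show when the observer drops off, even if its cmd window still looks alive.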

    Edited by: 881275 on Jan 30, 2013 8:50 AM
  • 7. Re: Problems with DataGuard fast start failover
    884278 Newbie
    Hello,

    We seem to have found the problem.
    I gave wrong information earlier: the Observer is running on a Windows 7 workstation (VMware).
    It seems that power management was enabled and that was the problem (the machine went into sleep mode or something like that).

    We changed that (power management is now disabled) and the observer doesn't fail anymore.

    Furthermore, the network interfaces of the 2 physical servers had the power management option "Allow the computer to turn off this device to save power" enabled.
    We disabled it on all interfaces.
    I don't know the IT side of Windows systems well, but it seems horrible to me to have such options on server versions of Windows...

    Thank you for helping.
