This discussion is archived
9 Replies. Latest reply: Apr 13, 2013 2:59 PM by 1002897

Node crashes when enabling RDS for private interconnect.

user12028852 Newbie
OS: oel6.3 - 2.6.39-300.17.2.el6uek.x86_64
Grid and DB: 11.2.0.3.4

This is a two node Standard Edition cluster.

The node crashes upon restart of the clusterware after following the instructions from note 751343.1 (RAC Support for RDS Over Infiniband) to enable RDS.
The cluster runs fine using IPoIB for the cluster_interconnect.

1) As the ORACLE_HOME/GI_HOME owner, stop all resources (database, listener, ASM, etc.) that are running from the home. When stopping the database, use the NORMAL or IMMEDIATE option.
2) As root, if relinking 11gR2 Grid Infrastructure (GI) home, unlock GI home: GI_HOME/crs/install/rootcrs.pl -unlock
3) As the ORACLE_HOME/GI_HOME owner, go to ORACLE_HOME/GI_HOME and cd to rdbms/lib
4) As the ORACLE_HOME/GI_HOME owner, issue "make -f ins_rdbms.mk ipc_rds ioracle"
5) As root, if relinking 11gR2 Grid Infrastructure (GI) home, lock GI home: GI_HOME/crs/install/rootcrs.pl -patch
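
In practice, the steps above amount to something like the following (a sketch only; the home paths and database name are placeholders for our actual values):

# As the home owner: stop everything running from the home, e.g. the database
srvctl stop database -d <dbname> -o immediate

# As root: unlock the GI home before relinking it
$GI_HOME/crs/install/rootcrs.pl -unlock

# As the home owner: relink the Oracle binary with the RDS IPC library
cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk ipc_rds ioracle

# As root: lock the GI home again
$GI_HOME/crs/install/rootcrs.pl -patch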

It looks to abend when ASM tries to start, with the message below on the console.
I have a service request open for this issue, but I am hoping someone may have seen this and has
a way around it.

Thanks
Alan

kernel BUG at net/rds/ib_send.c:547!
invalid opcode: 0000 [#1] SMP
CPU 2
Modules linked in: 8021q garp stp llc iptable_filter ip_tables nfs lockd
fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand powernow_k8
freq_table mperf rds_rdma rds_tcp rds ib_ipoib rdma_ucm ib_ucm ib_uverbs
ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa sr_mod cdrom microcode
serio_raw pcspkr ghes hed k10temp hwmon amd64_edac_mod edac_core
edac_mce_amd i2c_piix4 i2c_core sg igb dca mlx4_ib ib_mad ib_core
mlx4_en mlx4_core ext4 mbcache jbd2 usb_storage sd_mod crc_t10dif ahci
libahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
scsi_wait_scan]

Pid: 4140, comm: kworker/u:1 Not tainted 2.6.39-300.17.2.el6uek.x86_64
#1 Supermicro BHDGT/BHDGT
RIP: 0010:[<ffffffffa02db829>] [<ffffffffa02db829>]
rds_ib_xmit+0xa69/0xaf0 [rds_rdma]
RSP: 0018:ffff880fb84a3c50 EFLAGS: 00010202
RAX: ffff880fbb694000 RBX: ffff880fb3e4e600 RCX: 0000000000000000
RDX: 0000000000000030 RSI: ffff880fbb6c3a00 RDI: ffff880fb058a048
RBP: ffff880fb84a3d30 R08: 0000000000000fd0 R09: ffff880fbb6c3b90
R10: 0000000000000000 R11: 000000000000001a R12: ffff880fbb6c3a00
R13: ffff880fbb6c3a00 R14: 0000000000000000 R15: ffff880fb84a3d90
FS: 00007fd0a3a56700(0000) GS:ffff88101e240000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000002158ca2 CR3: 0000000001783000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/u:1 (pid: 4140, threadinfo ffff880fb84a2000, task
ffff880fae970180)
Stack:
0000000000012200 0000000000012200 ffff880f00000000 0000000000000000
000000000000e5b0 ffffffff8115af81 ffffffff81b8d6c0 ffffffffa02b2e12
00000001bf272240 ffffffff81267020 ffff880fbb6c3a00 0000003000000002
Call Trace:
[<ffffffff8115af81>] ? __kmalloc+0x1f1/0x200
[<ffffffffa02b2e12>] ? rds_message_alloc+0x22/0x90 [rds]
[<ffffffff81267020>] ? sg_init_table+0x30/0x50
[<ffffffffa02b2db2>] ? rds_message_alloc_sgs+0x62/0xa0 [rds]
[<ffffffffa02b31e4>] ? rds_message_map_pages+0xa4/0x110 [rds]
[<ffffffffa02b4f3b>] rds_send_xmit+0x38b/0x6e0 [rds]
[<ffffffff81089d53>] ? cwq_activate_first_delayed+0x53/0x100
[<ffffffffa02b6040>] ? rds_recv_worker+0xc0/0xc0 [rds]
[<ffffffffa02b6075>] rds_send_worker+0x35/0xc0 [rds]
[<ffffffff81089fd6>] process_one_work+0x136/0x450
[<ffffffff8108bbe0>] worker_thread+0x170/0x3c0
[<ffffffff8108ba70>] ? manage_workers+0x120/0x120
[<ffffffff810907e6>] kthread+0x96/0xa0
[<ffffffff81515544>] kernel_thread_helper+0x4/0x10
[<ffffffff81090750>] ? kthread_worker_fn+0x1a0/0x1a0
[<ffffffff81515540>] ? gs_change+0x13/0x13
Code: ff ff e9 b1 fe ff ff 48 8b 0d b4 54 4b e1 48 89 8d 70 ff ff ff e9
71 ff ff ff 83 bd 7c ff ff ff 00 0f 84 f4 f5 ff ff 0f 0b eb fe <0f> 0b
eb fe 44 8b 8d 48 ff ff ff 41 b7 01 e9 51 f6 ff ff 0f 0b
RIP [<ffffffffa02db829>] rds_ib_xmit+0xa69/0xaf0 [rds_rdma]
RSP <ffff880fb84a3c50>
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.39-300.17.2.el6uek.x86_64
(mockbuild@ca-build44.us.oracle.com) (gcc version 4.4.6 20110731 (Red
Hat 4.4.6-3) (GCC) ) #1 SMP Wed Nov 7 17:48:36 PST 2012
Command line: ro root=UUID=5ad1a268-b813-40da-bb76-d04895215677
rd_DM_UUID=ddf1_stor rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD
SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us numa=off
console=ttyS1,115200n8 irqpoll maxcpus=1 nr_cpus=1 reset_devices
cgroup_disable=memory mce=off memmap=exactmap memmap=538K@64K
memmap=130508K@770048K elfcorehdr=900556K memmap=72K#3668608K
memmap=184K#3668680K
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000100 - 0000000000096800 (usable)
BIOS-e820: 0000000000096800 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e6000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000dfe90000 (usable)
BIOS-e820: 00000000dfe9e000 - 00000000dfea0000 (reserved)
BIOS-e820: 00000000dfea0000 - 00000000dfeb2000 (ACPI data)
BIOS-e820: 00000000dfeb2000 - 00000000dfee0000 (ACPI NVS)
BIOS-e820: 00000000dfee0000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000ffe00000 - 0000000100000000 (reserved)
  • 1. Re: Node crashes when enabling RDS for private interconnect.
    BillyVerreynne Oracle ACE
    What is the version of the OFED driver stack used?

    And what does rds-info say and have you used rds-stress for testing?

    From the looks of it ("kernel BUG at net/rds/ib_send.c:547 ... invalid opcode"), this is a bug in the driver code. The older OFED stacks were pretty buggy in my experience (we suffered a range of SRP issues), which we resolved by checking out the latest stable release from the OFED trunk and doing a manual build for the kernel being used.

    It would also be prudent to file a service request with Support on this (before doing or trying custom driver builds).
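
    A quick way to confirm the basics is something along these lines (a sketch; ofed_info is only present if the OFED utilities are installed, so fall back to the package database otherwise):

    # Report the installed OFED release, if the ofed_info utility is available
    ofed_info -s

    # Otherwise, query the package database for the RDMA/RDS bits
    rpm -qa | grep -i -E 'ofed|rdma|rds'

    # Basic RDS sanity checks against the peer's interconnect address
    rds-ping <remote-interconnect-ip>
    rds-info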
  • 2. Re: Node crashes when enabling RDS for private interconnect.
    user12028852 Newbie
    I believe the OFED version is 1.5.3.3, but I am not sure if this is correct.
    We have not added any third-party drivers. All that has been done to add Infiniband to our build is
    a yum groupinstall of "Infiniband Support".
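
    Roughly, that amounted to the following (from memory, so treat it as approximate rather than exact):

    # Install the distro Infiniband/RDMA stack
    yum groupinstall "Infiniband Support"

    # Make sure the rdma service is enabled and running
    chkconfig rdma on
    service rdma start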

    I have not tried rds-stress, but rds-ping works fine and the rds-info output looks fine.
    A service request has been opened, but so far I have had a better response here.

    oracle@blade1-6:~> rds-info

    RDS IB Connections:
    LocalAddr RemoteAddr LocalDev RemoteDev
    10.10.0.116 10.10.0.119 fe80::25:90ff:ff07:df1d fe80::25:90ff:ff07:e0e5

    TCP Connections:
    LocalAddr LPort RemoteAddr RPort HdrRemain DataRemain SentNxt ExpectUna SeenUna

    Counters:
    CounterName Value
    conn_reset 5
    recv_drop_bad_checksum 0
    recv_drop_old_seq 0
    recv_drop_no_sock 1
    recv_drop_dead_sock 0
    recv_deliver_raced 0
    recv_delivered 18
    recv_queued 18
    recv_immediate_retry 0
    recv_delayed_retry 0
    recv_ack_required 4
    recv_rdma_bytes 0
    recv_ping 14
    send_queue_empty 18
    send_queue_full 0
    send_lock_contention 0
    send_lock_queue_raced 0
    send_immediate_retry 0
    send_delayed_retry 0
    send_drop_acked 0
    send_ack_required 3
    send_queued 32
    send_rdma 0
    send_rdma_bytes 0
    send_pong 14
    page_remainder_hit 0
    page_remainder_miss 0
    copy_to_user 0
    copy_from_user 0
    cong_update_queued 0
    cong_update_received 1
    cong_send_error 0
    cong_send_blocked 0
    ib_connect_raced 4
    ib_listen_closed_stale 0
    ib_tx_cq_call 6
    ib_tx_cq_event 6
    ib_tx_ring_full 0
    ib_tx_throttle 0
    ib_tx_sg_mapping_failure 0
    ib_tx_stalled 16
    ib_tx_credit_updates 0
    ib_rx_cq_call 33
    ib_rx_cq_event 38
    ib_rx_ring_empty 0
    ib_rx_refill_from_cq 0
    ib_rx_refill_from_thread 0
    ib_rx_alloc_limit 0
    ib_rx_credit_updates 0
    ib_ack_sent 4
    ib_ack_send_failure 0
    ib_ack_send_delayed 0
    ib_ack_send_piggybacked 0
    ib_ack_received 3
    ib_rdma_mr_alloc 0
    ib_rdma_mr_free 0
    ib_rdma_mr_used 0
    ib_rdma_mr_pool_flush 8
    ib_rdma_mr_pool_wait 0
    ib_rdma_mr_pool_depleted 0
    ib_atomic_cswp 0
    ib_atomic_fadd 0
    iw_connect_raced 0
    iw_listen_closed_stale 0
    iw_tx_cq_call 0
    iw_tx_cq_event 0
    iw_tx_ring_full 0
    iw_tx_throttle 0
    iw_tx_sg_mapping_failure 0
    iw_tx_stalled 0
    iw_tx_credit_updates 0
    iw_rx_cq_call 0
    iw_rx_cq_event 0
    iw_rx_ring_empty 0
    iw_rx_refill_from_cq 0
    iw_rx_refill_from_thread 0
    iw_rx_alloc_limit 0
    iw_rx_credit_updates 0
    iw_ack_sent 0
    iw_ack_send_failure 0
    iw_ack_send_delayed 0
    iw_ack_send_piggybacked 0
    iw_ack_received 0
    iw_rdma_mr_alloc 0
    iw_rdma_mr_free 0
    iw_rdma_mr_used 0
    iw_rdma_mr_pool_flush 0
    iw_rdma_mr_pool_wait 0
    iw_rdma_mr_pool_depleted 0
    tcp_data_ready_calls 0
    tcp_write_space_calls 0
    tcp_sndbuf_full 0
    tcp_connect_raced 0
    tcp_listen_closed_stale 0

    RDS Sockets:
    BoundAddr BPort ConnAddr CPort SndBuf RcvBuf Inode
    0.0.0.0 0 0.0.0.0 0 131072 131072 340441

    RDS Connections:
    LocalAddr RemoteAddr NextTX NextRX Flg
    10.10.0.116 10.10.0.119 33 38 --C

    Receive Message Queue:
    LocalAddr LPort RemoteAddr RPort Seq Bytes

    Send Message Queue:
    LocalAddr LPort RemoteAddr RPort Seq Bytes

    Retransmit Message Queue:
    LocalAddr LPort RemoteAddr RPort Seq Bytes
    10.10.0.116 0 10.10.0.119 40549 32 0

    oracle@blade1-6:~> cat /etc/rdma/rdma.conf
    # Load IPoIB
    IPOIB_LOAD=yes
    # Load SRP module
    SRP_LOAD=no
    # Load iSER module
    ISER_LOAD=no
    # Load RDS network protocol
    RDS_LOAD=yes
    # Should we modify the system mtrr registers? We may need to do this if you
    # get messages from the ib_ipath driver saying that it couldn't enable
    # write combining for the PIO buffs on the card.
    #
    # Note: recent kernels should do this for us, but in case they don't, we'll
    # leave this option
    FIXUP_MTRR_REGS=no
    # Should we enable the NFSoRDMA service?
    NFSoRDMA_LOAD=yes
    NFSoRDMA_PORT=2050


    oracle@blade1-6:~> /etc/init.d/rdma status
    Low level hardware support loaded:
         mlx4_ib

    Upper layer protocol modules:
         rds_rdma ib_ipoib

    User space access modules:
         rdma_ucm ib_ucm ib_uverbs ib_umad

    Connection management modules:
         rdma_cm ib_cm iw_cm

    Configured IPoIB interfaces: none
    Currently active IPoIB interfaces: ib0
  • 3. Re: Node crashes when enabling RDS for private interconnect.
    user12028852 Newbie
    OK, I tried rds-stress with no joy. I guess I need to work this out before I worry about the clusterware.

    POD1 root@blade1-6:~> rds-stress
    waiting for incoming connection on 0.0.0.0:4000
    accepted connection from 10.10.0.119:19744 on 10.10.0.116:4000
    negotiated options, tasks will start in 2 seconds
    Starting up..sendto() truncated - 250..
    tsks tx/s rx/s tx+rx K/s mbi K/s mbo K/s tx us/c rtt us cpu %
    child pid 29726 exited with status 1


    POD1 root@blade1-9:~> rds-stress -s 10.10.0.116 -p 4000 -t 1 -d 1 -D 1024000
    connecting to 10.10.0.116:4000
    negotiated options, tasks will start in 2 seconds
    Starting up..sendto() truncated - 250..
    tsks tx/s rx/s tx+rx K/s mbi K/s mbo K/s tx us/c rtt us cpu %
    1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
    child pid 15954 exited with status 1
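
    What I plan to check next is the kernel log on both nodes right after a failed run, e.g.:

    # Look for RDS/RDMA errors logged around the failed rds-stress run
    dmesg | grep -i -E 'rds|rdma' | tail -50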
  • 4. Re: Node crashes when enabling RDS for private interconnect.
    BillyVerreynne Oracle ACE
    We're building another cluster this week and I want to use RDS as the interconnect protocol on it (it also uses new Infiniband hardware).

    If we run into similar issues, or there are specific config issues, I will post here or in the {forum:id=822} forum (depending on whether it is a Grid issue or an o/s one).

    Unfortunately, my partner who builds clusters with me (a network engineer and pretty much an Infiniband expert) is still on leave, so I'm not sure how far we'll get without his assistance, or whether we'll leave the RDS work for when he's back from leave.
  • 5. Re: Node crashes when enabling RDS for private interconnect.
    user12028852 Newbie
    Appreciate any information you can provide.

    We have been spinning our wheels here for too long now. We haven't made any progress with our Service Request, so we are
    going to backtrack to OEL 6.2 and install the latest OFED from Mellanox.
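
    The plan for the Mellanox stack is basically the stock installer route (a sketch; the bundle name below is a placeholder for whatever MLNX_OFED release we end up downloading):

    # Unpack the Mellanox OFED bundle and run its installer script
    tar xzf MLNX_OFED_LINUX-<version>-rhel6.2-x86_64.tgz
    cd MLNX_OFED_LINUX-<version>-rhel6.2-x86_64
    ./mlnxofedinstall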

    I'll post back if we make any progress with that track.

    Alan

  • 6. Re: Node crashes when enabling RDS for private interconnect.
    user12028852 Newbie
    RDS update.

    Still grinding painfully slowly through a service request to resolve this issue.
    We did take another track with another cluster:
    installed OEL 6.3 with the Red Hat kernel 2.6.32-279.14.1.el6.x86_64 and installed the latest OFED packages
    from OpenFabrics (OFED-1.5.4.1.tgz).
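
    The OFED install was essentially the stock tarball procedure (a sketch; our exact installer selections may have differed):

    # Build and install the OpenFabrics OFED 1.5.4.1 stack against the running kernel
    tar xzf OFED-1.5.4.1.tgz
    cd OFED-1.5.4.1
    ./install.pl    # interactive installer; select the RDS and IPoIB components

    # Reboot (or restart the RDMA stack) so the new modules are loaded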

    So far this cluster is running happily using RDS for the cluster interconnect.
    Now we need to see how it performs.

    Alan
  • 7. Re: Node crashes when enabling RDS for private interconnect.
    user12215805 Newbie
    We appear to be encountering the exact same issue. Support doesn't appear to be making progress.

    OS 2.6.39-300.26.1.el6uek.x86_64
    GI Home: 11.2.0.3.4
    DB Home: 11.2.0.3.4

    The server crashes as soon as RDS is enabled (following note 751343.1). Not sure if we want to go the Red Hat kernel route on OEL.

    Thanks,
    Tom
  • 8. Re: Node crashes when enabling RDS for private interconnect.
    BillyVerreynne Oracle ACE
    I have posted my RDS woes in {thread:id=2487906}, along with the OL 5.9 working version and the "fix" that needs to be done to prevent a modprobe mess (attempting to insert either rds_rdma or rds_ip into the kernel).

    Not sure if any of this would be relevant to OEL 6.x.
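
    For illustration only (not necessarily the exact change described in that thread), the kind of modprobe configuration involved looks like this, assuming the Infiniband transport is the one you want and the TCP transport should never be pulled in:

    # /etc/modprobe.d/rds.conf (illustrative sketch)
    # Prevent the RDS TCP transport from loading so only rds_rdma is used
    install rds_tcp /bin/true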
  • 9. Re: Node crashes when enabling RDS for private interconnect.
    1002897 Newbie
    We have experienced the same issues with the base 6.3 2.6.39-200.24.1 and 2.6.39-200.34.1 UEK2 kernel revisions. We have not, however, experienced these issues with the UEK1 kernel or with the 6.4 (2.6.39-400.17.1) UEK2 kernel. Setting up RDS on Oracle Linux with the ofa package and the UEK1 kernel is an extreme pain (it requires removing the rds.conf file and telling the rdma modules to load manually). The UEK2 kernel resolves the setup issues by utilizing the rdma startup package, but requires the corrected OFED kernel modules included in kernel 2.6.39-400 or higher to work with Oracle RAC.
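
    A quick way to confirm which kernel revision and RDS transport a node is actually running (a minimal sketch):

    # Confirm the running kernel revision
    uname -r

    # Confirm which RDS modules are loaded (rds_rdma should be present for Infiniband)
    lsmod | grep rds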
