This discussion is archived
0 Replies Latest reply: Sep 9, 2013 1:29 AM by Super_Kevin RSS

DB_ASSERT failed on __db_has_pagelock() in repmgr's process_message

Super_Kevin Newbie
Currently Being Moderated

Hi,

I am using db-6.0.20 on debian 7 64bit, build with --enable-debug --enable-diagnostic --enable-umrw.

I have two nodes, A and B, running same program and same config except ip address and port number.

When first start A with empty env home, A creates the environment and several databases, and become master. Then start B, and B become client and getting changes from A. However, I get DB_ASSERT failed on __db_has_pagelock() in repmgr's process_message():

raise ()

abort ()

__os_abort (env=0x7ffff6b97e88)

__db_assert (env=0x7ffff6b97e88, e=0x9050c0 \"__db_has_pagelock(env, locker, dbmfp, (PAGE*)bhp->buf, DB_LOCK_WRITE) == 0\", file=0x904ccc \"../src/mp/mp_fget.c\", line=1190)

__memp_fget (dbmfp=0x7ffff7f93d08, pgnoaddr=0x7ffff6cdfe8c, ip=0x7ffff6cdff50, txn=0x0, flags=1, addrp=0x7ffff4ca7868)

__rep_write_page (env=0x7ffff6b97e88, ip=0x7ffff6cdff50, rep=0x7ffff6bb2410, msgfp=0x7ffff6cdfe88)

__rep_page (env=0x7ffff6b97e88, ip=0x7ffff6cdff50, eid=1, rp=0x7ffff4ca81f0, rec=0x7ffff4ca8280)

__rep_bulk_page (env=0x7ffff6b97e88, ip=0x7ffff6cdff50, eid=1, rp=0x7ffff4ca8328, rec=0x7ffff7f02110)

__rep_process_message_int (env=0x7ffff6b97e88, control=0x7ffff7f020e8, rec=0x7ffff7f02110, eid=1, ret_lsnp=0x7ffff4ca8ab8)

process_message (env=0x7ffff6b97e88, control=0x7ffff7f020e8, rec=0x7ffff7f02110, eid=1)

message_loop (env=0x7ffff6b97e88, th=0x7ffff6bcfba8)

__repmgr_msg_thread (argsp=0x7ffff6bcfba8)

start_thread ()

clone ()

I have tried to solve the problem and found that if I shutdown A when it finish creating databases and copy the entire env dir to B, and start A again, then start B, it seems OK.

 

My settings:

dbenv.set_flags(DB_DIRECT_DB      |

                        DB_REGION_INIT    |

                        DB_TIME_NOTGRANTED,

                        1);

dbenv.log_set_config(DB_LOG_AUTO_REMOVE |

                             DB_LOG_DIRECT      |

                             DB_LOG_ZERO,

                             1);

dbenv.rep_set_config(DB_REP_CONF_BULK           |

                             DB_REP_CONF_LEASE,

                             1);

dbenv.open(db_home.c_str(),

                   DB_PRIVATE   |

                   DB_CREATE    |

                   DB_RECOVER   | DB_FAILCHK  |

                   DB_THREAD    |

                   DB_INIT_REP  |

                   DB_INIT_LOCK | DB_INIT_LOG | DB_INIT_TXN |

                   DB_INIT_MPOOL ,

                   S_IRUSR | S_IWUSR | S_IRGRP);

 

My database:

MainDb.set_flags(DB_CHKSUM);

MainDb.set_partition(PartitionParts, NULL, db_partition_fcn);

//Transaction protected

MainDb.open(&txn, "MainDb.bdb", NULL, DB_HASH,  DB_CREATE, S_IRUSR | S_IWUSR);

 

Logs on A(192.168.101.148:8866) when B(192.168.101.148:6666) is connecting:

[1378709654:950047][52533/52540] MASTER: accepted a new connection

DB_EVENT_REP_CONNECT_ESTD [eid: 1, addr: 192.168.101.148:6666]

[1378709654:952301][52533/52540] MASTER: handshake from paused site 192.168.101.148:6666 EID 1

[1378709654:952387][52533/52540] MASTER: msgin: EID 1 CONNECTED, READY.  sites_avail 1

[1378709655:949542][52533/52539] MASTER: ./PNRDB rep_process_message: msgv = 7 logv 20 gen = 0 eid 1, type newclient, LSN [0][0]  nogroup

[1378709655:949764][52533/52539] MASTER: ./PNRDB rep_send_message: msgv = 7 logv 20 gen = 1 eid -1, type newsite, LSN [0][0]  nobuf

[1378709655:949835][52533/52539] MASTER: ./PNRDB rep_send_message: msgv = 7 logv 20 gen = 1 eid -1, type newmaster, LSN [1][1658422]  nobuf

[1378709655:950016][52533/52539] MASTER: ./PNRDB rep_send_message: msgv = 7 logv 20 gen = 1 eid -1, type log, LSN [1][1658250]  lease

[1378709655:951508][52533/52539] MASTER: ./PNRDB rep_process_message: msgv = 7 logv 20 gen = 0 eid 1, type master_req, LSN [0][0]  nogroup

[1378709655:951607][52533/52539] MASTER: ./PNRDB rep_send_message: msgv = 7 logv 20 gen = 1 eid -1, type newmaster, LSN [1][1658422]  nobuf

[1378709655:951747][52533/52539] MASTER: ./PNRDB rep_send_message: msgv = 7 logv 20 gen = 1 eid -1, type log, LSN [1][1658250]  lease

[1378709665:950768][52533/52538] MASTER: ./PNRDB rep_process_message: msgv = 7 logv 20 gen = 0 eid 1, type master_req, LSN [0][0]

[1378709665:951019][52533/52538] MASTER: ./PNRDB rep_send_message: msgv = 7 logv 20 gen = 1 eid -1, type newmaster, LSN [1][1658422]  nobuf

[1378709665:951244][52533/52538] MASTER: ./PNRDB rep_send_message: msgv = 7 logv 20 gen = 1 eid -1, type log, LSN [1][1658250]  lease

[1378709665:995365][52533/52538] MASTER: ./PNRDB rep_process_message: msgv = 7 logv 20 gen = 1 eid 1, type update_req, LSN [0][0]

[1378709665:998816][52533/52538] MASTER: Walk_dir: Getting info for datadir NULL, dir: ./PNRDB

[1378709666:424][52533/52538] MASTER: Walk_dir: Dir ./PNRDB has 196 files

[1378709666:526][52533/52538] MASTER: Walk_dir: File 0 name: __dbp.MainDb.bdb.018

[1378709666:888][52533/52538] MASTER: Walk_dir: File __dbp.MainDb.bdb.018 at 0x7ffff0b7d0d8: pgsize 4096, max_pgno 2

 

Logs on B(192.168.101.148:6666) when DB_ASSERT failed:

# Log file created/rotated Monday, 9 Sep 13 06:54:14 GMT

[PNRSrv] PNRServer starting up.

[stdout] PNRServer: Terminal output redirected to log.

[stderr] PNRServer pid: 52589

[DbMsg] [1378709654:794305][52589/52589] REP_UNDEF: EID 0 is assigned for site 192.168.101.148:6666

[DbMsg] [1378709654:795318][52589/52589] REP_UNDEF: EID 1 is assigned for site 192.168.101.148:8866

[DbMsg] BDB2525 No log files found

[DbMsg] [1378709654:803775][52589/52589] REP_UNDEF: try join request to site site 192.168.101.148:8866

[DbMsg] [1378709654:804183][52589/52589] REP_UNDEF: connection established

[DbMsg] [1378709654:813622][52589/52589] REP_UNDEF: set membership for 192.168.101.148:8866 4 (was 0)

[DbMsg] [1378709654:813774][52589/52589] REP_UNDEF: set membership for 192.168.101.148:6666 4 (was 0)

[DbMsg] [1378709654:813968][52589/52589] REP_UNDEF: Check view.  Exist 0, cb 0

[DbMsg] [1378709654:929467][52589/52589] REP_UNDEF: rep_start: Found old version log 20

[DbMsg] [1378709654:929789][52589/52589] CLIENT: ./PNRDB rep_send_message: msgv = 7 logv 20 gen = 0 eid -1, type newclient, LSN [0][0]  nogroup nobuf

[Event] DB_EVENT_REP_CLIENT

[DbMsg] [1378709654:941841][52589/52597] CLIENT: starting election thread

[DbMsg] [1378709654:944870][52589/52597] CLIENT: Election with leases pause sec 1, usec 0

[DbMsg] [1378709654:949358][52589/52599] CLIENT: starting connector thread, eid 1

[DbMsg] [1378709654:949792][52589/52599] CLIENT: connecting to site 192.168.101.148:8866

[DbMsg] [1378709654:950266][52589/52599] CLIENT: connection established

[Event] DB_EVENT_REP_CONNECT_ESTD [eid: 1, addr: 192.168.101.148:8866]

[DbMsg] [1378709654:950797][52589/52599] CLIENT: connector thread is exiting

[DbMsg] [1378709654:951885][52589/52598] CLIENT: handshake from connection to 192.168.101.148:8866 EID 1

[DbMsg] [1378709654:951990][52589/52598] CLIENT: handshake with no known master to wake election thread

[DbMsg] [1378709654:952076][52589/52598] CLIENT: vers_resp: EID 1 CONNECTED, READY.  sites_avail 1

[DbMsg] [1378709655:949051][52589/52597] CLIENT: Check view.  Exist 0, cb 0

[DbMsg] [1378709655:949295][52589/52597] CLIENT: ./PNRDB rep_send_message: msgv = 7 logv 20 gen = 0 eid -1, type newclient, LSN [0][0]  nogroup nobuf

[DbMsg] [1378709655:950297][52589/52595] CLIENT: ./PNRDB rep_process_message: msgv = 7 logv 20 gen = 1 eid 1, type newsite, LSN [0][0]

[DbMsg] [1378709655:950494][52589/52595] CLIENT: ./PNRDB rep_send_message: msgv = 7 logv 20 gen = 0 eid -1, type master_req, LSN [0][0]  nogroup nobuf

[DbMsg] [1378709655:950792][52589/52596] CLIENT: ./PNRDB rep_process_message: msgv = 7 logv 20 gen = 1 eid 1, type newmaster, LSN [1][1658422]

[DbMsg] [1378709655:951197][52589/52596] CLIENT: I am now part of an established group

[DbMsg] [1378709655:951014][52589/52594] CLIENT: ./PNRDB rep_process_message: msgv = 7 logv 20 gen = 1 eid 1, type log, LSN [1][1658250]  lease

[DbMsg] [1378709655:951445][52589/52596] CLIENT: wait_time: grant_expire 0 0 lease_to 10000000

[DbMsg] [1378709655:951807][52589/52594] CLIENT: Racing replication msg lockout, ignore message.

[DbMsg] [1378709655:990359][52589/52594] CLIENT: ./PNRDB rep_process_message: msgv = 7 logv 20 gen = 1 eid 1, type newmaster, LSN [1][1658422]

[DbMsg] [1378709655:990388][52589/52595] CLIENT: ./PNRDB rep_process_message: msgv = 7 logv 20 gen = 1 eid 1, type log, LSN [1][1658250]  lease

[DbMsg] [1378709655:990537][52589/52594] CLIENT: Racing replication msg lockout, ignore message.

[DbMsg] [1378709655:990673][52589/52595] CLIENT: Racing replication msg lockout, ignore message.

[DbMsg] [1378709665:949025][52589/52597] CLIENT: Start election nsites 2, ack 2, priority 100

[DbMsg] [1378709665:949951][52589/52597] CLIENT: wait_time: grant_expire 0 0 lease_to 10000000

[DbMsg] [1378709665:950001][52589/52597] CLIENT: PHASE0 waittime from rep_lease_waittime: 10000000

[DbMsg] [1378709665:950071][52589/52597] CLIENT: ./PNRDB rep_send_message: msgv = 7 logv 20 gen = 0 eid -1, type master_req, LSN [0][0]  nobuf

[DbMsg] [1378709665:951469][52589/52594] CLIENT: ./PNRDB rep_process_message: msgv = 7 logv 20 gen = 1 eid 1, type newmaster, LSN [1][1658422]

[DbMsg] [1378709665:951554][52589/52594] CLIENT: Racing replication msg lockout, ignore message.

[DbMsg] [1378709665:951599][52589/52594] CLIENT: ./PNRDB rep_process_message: msgv = 7 logv 20 gen = 1 eid 1, type log, LSN [1][1658250]  lease

[DbMsg] [1378709665:951635][52589/52594] CLIENT: Racing replication msg lockout, ignore message.

[DbMsg] [1378709665:975016][52589/52596] CLIENT: Election done; egen 1

[DbMsg] [1378709665:975298][52589/52596] CLIENT: Updating gen from 0 to 1 from master 1

[DbMsg] [1378709665:980937][52589/52596] CLIENT: egen: 2. rep version 7

[DbMsg] [1378709665:981356][52589/52596] CLIENT: No commit or ckp found.  Truncate log.

[DbMsg] [1378709665:994727][52589/52596] CLIENT: ./PNRDB rep_send_message: msgv = 7 logv 20 gen = 1 eid 1, type update_req, LSN [0][0]  nobuf

[Event] DB_EVENT_REP_NEWMASTER [eid: 1, addr: 192.168.101.148:8866]

[DbMsg] [1378709666:38479][52589/52594] CLIENT: ./PNRDB rep_process_message: msgv = 7 logv 20 gen = 1 eid 1, type update, LSN [1][1658422]

[DbMsg] [1378709666:38933][52589/52594] CLIENT: Update setup for 190 files.

[DbMsg] [1378709666:39045][52589/52594] CLIENT: Update setup:  First LSN [1][28].

[DbMsg] [1378709666:39097][52589/52594] CLIENT: Update setup:  Last LSN [1][1658422]

[DbMsg] [1378709666:40767][52589/52594] CLIENT: Walk_dir: Getting info for datadir NULL, dir: ./PNRDB

[DbMsg] [1378709666:40949][52589/52594] CLIENT: Walk_dir: Dir ./PNRDB has 7 files

[DbMsg] [1378709666:40985][52589/52594] CLIENT: Walk_dir: File 0 name: __db.rep.gen

[DbMsg] [1378709666:41019][52589/52594] CLIENT: Walk_dir: File 1 name: __db.rep.egen

[DbMsg] [1378709666:41052][52589/52594] CLIENT: Walk_dir: File 2 name: __db.rep.db

[DbMsg] [1378709666:41093][52589/52594] CLIENT: Walk_dir: File 3 name: DB_CONFIG

[DbMsg] [1378709666:41126][52589/52594] CLIENT: Walk_dir: File 4 name: __db.rep.diag00

[DbMsg] [1378709666:41159][52589/52594] CLIENT: Walk_dir: File 5 name: __db.rep.diag01

[DbMsg] [1378709666:41206][52589/52594] CLIENT: Walk_dir: File 6 name: log.0000000001

[DbMsg] [1378709666:41241][52589/52594] CLIENT: Walk_dir: Getting info for in-memory named files

[DbMsg] [1378709666:41287][52589/52594] CLIENT: Walk_dir: Dir INMEM has 0 files

[DbMsg] [1378709666:49936][52589/52594] CLIENT: Next file 0: pgsize 4096, maxpg 2

[DbMsg] [1378709666:50137][52589/52594] CLIENT: name __dbp.MainDb.bdb.018 dir NULL

[DbMsg] [1378709666:50219][52589/52594] CLIENT: ./PNRDB rep_send_message: msgv = 7 logv 20 gen = 1 eid 1, type page_req, LSN [0][0]  any nobuf

[DbMsg] [1378709666:52860][52589/52595] CLIENT: ./PNRDB rep_process_message: msgv = 7 logv 20 gen = 1 eid 1, type bulk_page, LSN [1][1658422]

[DbMsg] [1378709666:52987][52589/52595] CLIENT: rep_bulk_page: Processing LSN [1][1658422]

[DbMsg] [1378709666:53049][52589/52595] CLIENT: rep_bulk_page: p 0x7ffff7f031ac ep 0x7ffff7f05244 pgrec data 0x7ffff7f02170, size 4156 (0x103c)

[DbMsg] [1378709666:53092][52589/52595] CLIENT: PAGE: Received page 0 from file 0

[DbMsg] [1378709666:61072][52589/52595] CLIENT: PAGE: Write page 0 into mpool

[DbMsg] [1378709666:61208][52589/52595] CLIENT: rep_write_page: Calling fop_create for __dbp.MainDb.bdb.018

[DbErr] BDB0059 assert failure: ../src/mp/mp_fget.c/1190: "__db_has_pagelock(env, locker, dbmfp, (PAGE*)bhp->buf, DB_LOCK_WRITE) == 0"

[DbErr] Debug_env1/Intel-Linux-x86/pnrserver() [0x6e6acc]

[DbErr] Debug_env1/Intel-Linux-x86/pnrserver() [0x6e19b4]

[DbErr] Debug_env1/Intel-Linux-x86/pnrserver() [0x6496f9]

[DbErr] Debug_env1/Intel-Linux-x86/pnrserver() [0x6c3ed3]

[DbErr] Debug_env1/Intel-Linux-x86/pnrserver() [0x7cbad2]

[DbErr] Debug_env1/Intel-Linux-x86/pnrserver() [0x7cb0b6]

[DbErr] Debug_env1/Intel-Linux-x86/pnrserver() [0x7ca695]

[DbErr] Debug_env1/Intel-Linux-x86/pnrserver() [0x7e726e]

[DbErr] Debug_env1/Intel-Linux-x86/pnrserver() [0x80e643]

[DbErr] Debug_env1/Intel-Linux-x86/pnrserver() [0x80dbd0]

[DbErr] Debug_env1/Intel-Linux-x86/pnrserver() [0x80d854]

[DbErr] /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e0e) [0x7ffff7bc6e0e]

[DbErr] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7ffff6edf93d]

 

Thanks for any help,

Kevin

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points