Infrastructure Software

Process receiving SIGEXIT

173650, Apr 20 2011 (edited Apr 25 2011)

I am running a process on Solaris 10 SPARC. It runs for several days and then crashes. Running pstack on the core shows that one of the threads received SIGEXIT.

Under what circumstances would a process receive SIGEXIT? The process does not call sigexit(pid,0). It does have signal handlers.

Any help is appreciated.

Matt Lord-Oracle

You are correct in assuming that when using rpl_semi_sync_master_wait_point=AFTER_SYNC (which is the default value), the transaction should only succeed when it's successful on both master and slave.

rpl_semi_sync_master_timeout=10000 is also the default value, and it's important to note that the unit is milliseconds. I assume that you meant to set a value large enough that the timeout could not be a factor in your tests? If so, then you should instead try rpl_semi_sync_master_timeout=10000000 (over 2.5 hours).
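To make the units concrete, here is a minimal sketch that converts the two timeout values mentioned above from milliseconds to seconds and hours. The helper name ms_to_human is hypothetical and used only for illustration; it is not part of MySQL.

```python
def ms_to_human(ms: int) -> str:
    """Format a millisecond value as seconds and hours (illustrative helper)."""
    seconds = ms / 1000
    hours = seconds / 3600
    return f"{ms} ms = {seconds:g} s = {hours:.2f} h"


# The default value discussed above, and the larger value suggested for testing.
default_timeout = ms_to_human(10_000)
suggested_timeout = ms_to_human(10_000_000)

print(default_timeout)
print(suggested_timeout)
```

The default works out to 10 seconds, while the suggested value is 10,000 seconds, which is a little under 2.8 hours.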

If you can repeat the failure--in this context meaning that the transaction succeeded (meaning it was persisted and externalized) on the master without reaching the slave--then it would definitely be a bug. I would encourage you to open a bug report and let us know all of the details so that we can verify it, and then fix it.

Thank you!

2784987

Hi Matt,

Thanks for your reply. In fact, my test was very simple:

1. I set rpl_semi_sync_master_timeout=10000, because 10 seconds is long enough for me to crash the master before it reverts to asynchronous replication.

2. I stopped the IO threads of all slaves.

3. I ran the following transaction:

insert into test.t5 select now();

The query then hung.

4. I killed the mysqld process on the master, and in the client window I saw an error saying that the connection was closed.

5. I then restarted the mysqld process on the master, checked the table, and found that the row I had just inserted was there.

thanks

2784987

Could you please run the same simple test and check whether you get the same result?

thanks a lot

Matt Lord-Oracle

Hi,

1. I set rpl_semi_sync_master_timeout=10000, because 10 seconds is long enough for me to crash the master before it reverts to asynchronous replication.

2. I stopped the IO threads of all slaves.

3. I ran the following transaction:

...

I'm sorry, I had completely glossed over point #2 in your steps above.

If you stopped the IO threads of all slaves, then from the master's point of view there were no slaves at the time you executed step #3, and thus the write succeeded because the master was effectively a "stand-alone" instance at that point. You don't want your production system to block all writes if there are no slaves.

If you're interested in multi-master, then I would encourage you to look at MySQL Group Replication: Group Replication — an Overview | MySQL High Availability

Best Regards

2784987

Why did I stop the IO threads of all slaves? Because I wanted the transaction on the master to hang. Otherwise the binlog would be transferred to the slaves and I could not check whether AFTER_SYNC works.

In fact, the transaction on the master really did hang. If the master were working as a stand-alone instance, it should not hang, am I right?

thanks

Matt Lord-Oracle

Hi,

Sorry, after looking into it further I can see that I was wrong:

http://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_rpl_semi_sync_master_wait_for_slave_count

http://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_rpl_semi_sync_master_wait_no_slave

So assuming rpl_semi_sync_master_wait_no_slave was ON, then the master would still wait rpl_semi_sync_master_timeout milliseconds for an ACK from 0 slaves (only entering "stand-alone" mode after the first timeout).
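The waiting behavior described above can be sketched as a toy model. This is a simplification for illustration only, not the server's implementation: the function semi_sync_wait and its parameters are hypothetical names, and the wait_no_slave=OFF branch reflects one reading of the linked documentation (revert to normal replication at once when too few slaves are connected).

```python
def semi_sync_wait(ack_times_ms, timeout_ms=10_000, wait_for_slave_count=1,
                   wait_no_slave=True, connected_slaves=0):
    """Toy model of one commit on a semi-sync master.

    ack_times_ms: times (ms after commit) at which slave ACKs would arrive.
    Returns (waited_ms, semi_sync_still_on).
    """
    # Assumption: with wait_no_slave OFF and too few connected slaves,
    # the master does not wait and falls back to asynchronous replication.
    if not wait_no_slave and connected_slaves < wait_for_slave_count:
        return 0, False

    # Only ACKs that arrive within the timeout window count.
    acks = sorted(t for t in ack_times_ms if t <= timeout_ms)
    if len(acks) >= wait_for_slave_count:
        # The commit returns once the required number of ACKs is in.
        return acks[wait_for_slave_count - 1], True

    # Not enough ACKs: wait out the full timeout, then switch semi-sync off.
    return timeout_ms, False


# No slaves connected, wait_no_slave ON: the commit blocks for the full timeout.
no_slaves = semi_sync_wait([])
# One slave that acknowledges after 120 ms.
one_slave = semi_sync_wait([120], connected_slaves=1)
# No slaves connected, wait_no_slave OFF.
no_wait = semi_sync_wait([], wait_no_slave=False)

print(no_slaves, one_slave, no_wait)
```

In this model the first case waits the whole 10000 ms and ends with semi-sync off, which is the pattern to look for in the client timing and the error log.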

I created a test environment to verify (highlights my own):

mysql> show global variables like "rpl%";
+-------------------------------------------+------------+
| Variable_name                             | Value      |
+-------------------------------------------+------------+
| rpl_semi_sync_master_enabled              | ON         |
| rpl_semi_sync_master_timeout              | 10000      |
| rpl_semi_sync_master_trace_level          | 32         |
| rpl_semi_sync_master_wait_for_slave_count | 1          |
| rpl_semi_sync_master_wait_no_slave        | ON         |
| rpl_semi_sync_master_wait_point           | AFTER_SYNC |
| rpl_stop_slave_timeout                    | 31536000   |
+-------------------------------------------+------------+
7 rows in set (0.00 sec)

mysql> insert into gcoltest (lon, lat) select lon, lat from gcoltest;
Query OK, 1 row affected (10.01 sec)
Records: 1  Duplicates: 0  Warnings: 0

And in the MySQL error log (highlights my own):

2015-12-24T22:58:12.193543Z 4 [Note] Semi-sync replication initialized for transactions.
2015-12-24T22:58:12.193584Z 4 [Note] Semi-sync replication enabled on the master.
2015-12-24T22:58:12.193802Z 0 [Note] Starting ack receiver thread
2015-12-24T23:00:10.129022Z 4 [Warning] Timeout waiting for reply of binlog (file: hanode4-bin.000007, pos: 890), semi-sync up to file , position 0.
2015-12-24T23:00:10.129064Z 4 [Note] Semi-sync replication switched OFF.

So you should check your MySQL error log. I suspect that you will see the same Timeout warning. You noted that you "killed mysqld", but if you simply ran "kill <PID>", then you sent the process a SIGTERM, which tells it to start a normal shutdown. That shutdown can take many seconds to complete as the server tries to stop its internal threads gracefully. If you want the process to terminate immediately, then you should send it a SIGKILL with "kill -9 <PID>".
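The difference between the two signals can be seen with a small stand-alone sketch. It assumes a POSIX system and uses a throwaway Python child process as a stand-in for a server with a graceful shutdown handler; it does not involve mysqld, and the names CHILD and stop_child are hypothetical.

```python
import signal
import subprocess
import sys

# Child program: traps SIGTERM and performs a (simulated) graceful shutdown.
CHILD = r"""
import signal, sys, time

def on_term(signum, frame):
    time.sleep(0.2)   # stand-in for cleanup work during a normal shutdown
    sys.exit(0)       # clean exit once the cleanup is done

signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
while True:
    time.sleep(1)
"""


def stop_child(sig):
    """Start the child, send it `sig`, and return its exit status."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE, text=True)
    proc.stdout.readline()      # wait until the SIGTERM handler is installed
    proc.send_signal(sig)
    proc.wait(timeout=10)
    proc.stdout.close()
    return proc.returncode


term_code = stop_child(signal.SIGTERM)   # handler runs, child exits cleanly
kill_code = stop_child(signal.SIGKILL)   # cannot be caught, child dies at once

print("SIGTERM exit status:", term_code)
print("SIGKILL exit status:", kill_code)
```

With SIGTERM the child gets to run its handler and exits on its own terms. With SIGKILL the handler never runs, and subprocess reports a negative exit status equal to the signal number.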

If you see something else, then you should file a bug report so that we can look into it further.

I will talk to the documentation team, as these behaviors could certainly be better documented.

Best Regards

2784987

I really did use "kill -9 <PID>".

Additionally, there were no other errors while the transaction hung, but after the mysqld process was restarted, the transaction was recovered:

thanks

2784987

Please refer to the following picture:

Matt Lord-Oracle

Hi,

I was able to repeat the issue, and it certainly seems to be a bug. I will talk with the Replication team and ensure that it's addressed.

Thank you!

2784987

Thanks a lot.

Please let me know if you get any response.

Matt Lord-Oracle

I've created two bug reports that you can follow:

1. https://bugs.mysql.com/80394

2. https://bugs.mysql.com/80395

Thank you for letting us know!

2784987

Hi Matt,

How are the bugs going?

It seems that their status has not changed yet.

thanks

Locked on May 23 2011

Added on Apr 20 2011

#oracle-solaris, #solaris-10
