Ok, now I need some official answer/help from Oracle:
I have installed Oracle Instant Client 18.5 and 19.3 on Oracle Linux 7.3:
[f4gl@saphir sf]$ cat /etc/oracle-release
Oracle Linux Server release 7.3
[f4gl@saphir sf]$ uname -a
Linux saphir 4.1.12-61.1.28.el7uek.x86_64 #2 SMP Thu Feb 23 19:55:12 PST 2017 x86_64 x86_64 x86_64 GNU/Linux
No problem with the system("ls ...") call with the following combinations (all through TCP to a remote server):
- Oracle Instant Client 18.5 to remote Oracle Server 18.3
- Oracle Instant Client 18.5 to remote Oracle Server 19.3
- Oracle Instant Client 19.3 to remote Oracle Server 18.3
But when connecting with Oracle Client 19.3 to a remote Oracle Server 19.3, system("ls ...") returns -1 from time to time!
I will provide strace details later here...
Here are the strace outputs:
strace-client-18c-server-19c-1.txt : 18.5 client => 18.3 server (OK)
strace-client-19c-server-18c-1.txt : 19.3 client => 18.3 server (OK)
strace-client-19c-server-19c-1.txt : 19.3 client => 19.3 server (system() fails)
It appears that it's related to the following mix:
1) signal(SIGCHLD, handler) - BTW, yes, we should use sigaction() now.
2) Oracle Client 19c starting a new thread (we see an additional call to clone() in strace when comparing to Oracle Client 18c)
3) system() getting confused because SIGCHLD is received and treated by waitpid()/wait4() by the wrong thread.
While we need to figure out how to avoid the SIGCHLD signal handler, can someone from Oracle explain why the Oracle Client creates a thread???
Our current OCI program is not prepared for a multi-threaded context, so is it possible to disable the thread creation with some option?
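On point 1), a minimal sketch of installing the SIGCHLD handler with sigaction() instead of signal() could look like the following. The names on_sigchld and install_sigchld_handler and the reaped_children counter are hypothetical and not taken from the program discussed here. This only makes the handler semantics explicit; it does not change which thread receives the signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical counter, only here so a caller can observe the handler. */
static volatile sig_atomic_t reaped_children = 0;

/* Reap every child that has already exited, without blocking. */
static void on_sigchld(int sig)
{
    int saved_errno = errno;   /* waitpid() may overwrite errno */
    int status;

    (void)sig;
    while (waitpid(-1, &status, WNOHANG) > 0)
        reaped_children++;
    errno = saved_errno;
}

/* Returns 0 on success, -1 on failure (errno set by sigaction()). */
static int install_sigchld_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    /* SA_RESTART: restart interrupted system calls where possible.
     * SA_NOCLDSTOP: no SIGCHLD for children that merely stop. */
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}
```

Unlike signal(), whose semantics differ between platforms and feature-test macros, sigaction() states explicitly that the handler stays installed and which flags apply.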
I'll ask someone to take a look. Have you got some compilable, sharable code that reproduces the issue?
Hello CJ and thanks for considering this!
Sorry but I could not repro with a simple OCI program...
Here is a new summary that I have reported to my Oracle contact in France, with new strace outputs:
We may have found an issue with the Oracle Client lib libclntsh.so.19.1
on Oracle Linux 7.x regarding threads, signal handlers and the system()
function, this is new to us and was not occurring with 18c clients.
The problem appears with the following configuration:
A) The OCI client is 19c, and connects to a remote 19c server.
WARNING: The problem DOES NOT occur in the following cases:
B) The OCI client is 19c, and connects to 19c server ON THE SAME COMPUTER.
C) The OCI client is 19c, and connects to a remote 18c server.
D) The OCI client is 18c, and connects to a remote 19c server.
From time to time, after connecting to Oracle, a call to the system()
function returns -1 (fail) for a simple "ls -l" command, when the program
is implementing a SIGCHLD signal handler calling waitpid()...
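One way to tell this failure apart from a failed fork is to look at errno right after system() returns -1. The sketch below assumes glibc on Linux, where a child that was already reaped elsewhere is expected to make the internal wait fail with ECHILD. run_command is a hypothetical helper name, not part of the program discussed here.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical wrapper around system().
 * Returns the command's exit status (0..255), or -1 if system() itself
 * failed or the command was terminated by a signal. */
static int run_command(const char *cmd)
{
    int rc;

    errno = 0;
    rc = system(cmd);
    if (rc == -1) {
        if (errno == ECHILD)
            /* Assumption: the child was reaped by someone else,
             * for example a SIGCHLD handler calling waitpid(-1). */
            fprintf(stderr, "system(\"%s\"): child already reaped (ECHILD)\n", cmd);
        else
            perror("system");
        return -1;
    }
    if (WIFEXITED(rc))
        return WEXITSTATUS(rc);
    return -1;   /* terminated by a signal */
}
```

Calling run_command("ls -l") instead of system("ls -l") would then log which of the two failure modes occurred.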
Looking at the strace outputs, we suspect that the libclntsh.so.19.1 client
creates a thread (with the clone() function).
In our code, we create a signal handler for SIGCHLD (to avoid zombies):
static void hSIGCHLD(int sig) {
    int result;   /* exit status of the reaped child (unused) */
    while (waitpid(-1, &result, WUNTRACED | WNOHANG) > 0);
}
Note: To work around the system() -1 issue with the Oracle 19c client, we no longer
use this code, because we have no more cases where zombies can occur, but that code
reveals the issue when mixing threads and system().
When using the good old signal() API, the program is not prepared to handle
signals as a multi-threaded process:
When using threads, a process-directed signal such as SIGCHLD is delivered to an arbitrary thread that does not block it!!!
When calling system(), the implementation of this function starts another
process with clone(), and then waits until it ends with wait4() ...
Normally system() delays any SIGCHLD delivery with:
rt_sigprocmask(SIG_BLOCK, [CHLD], , 8) = 0
But we see something wrong happens, probably because the wrong thread gets
the SIGCHLD signal, then the wrong wait4()/waitpid() is continued and then
system() gets confused and returns -1 ...
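The suspected interaction can be sketched without OCI. In the following, a plain pthread stands in for the thread that the 19c client is suspected to create (an assumption; the real thread may behave differently), and the handler mirrors hSIGCHLD above. count_system_failures and idle_thread are hypothetical names. Because this is a race, the count may well be 0 on a given machine.

```c
#define _DEFAULT_SOURCE 1   /* glibc: keep BSD signal() semantics */
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Legacy-style handler: reaps ANY child, including the one that
 * system() is currently waiting for. */
static void on_sigchld(int sig)
{
    int saved_errno = errno;
    int status;

    (void)sig;
    while (waitpid(-1, &status, WUNTRACED | WNOHANG) > 0)
        ;
    errno = saved_errno;
}

/* Stand-in for the extra library thread. SIGCHLD is not blocked here,
 * so the kernel may run the handler in this thread while the main
 * thread has SIGCHLD blocked inside system(). */
static void *idle_thread(void *arg)
{
    (void)arg;
    for (;;)
        pause();
    return NULL;
}

/* Returns how many of `iterations` system() calls returned -1,
 * or -1 if the setup itself failed. */
static int count_system_failures(int iterations)
{
    pthread_t tid;
    int failures = 0;

    if (signal(SIGCHLD, on_sigchld) == SIG_ERR)
        return -1;
    if (pthread_create(&tid, NULL, idle_thread, NULL) != 0)
        return -1;
    pthread_detach(tid);

    for (int i = 0; i < iterations; i++) {
        if (system("exit 0") == -1)
            failures++;
    }
    return failures;
}
```

If the handler in the other thread reaps the child first, the wait inside system() has nothing left to collect, which would match the -1 seen in the strace output.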
Attached you find the strace -f output:
strace-19c-threads-fail.txt: when system() returns -1
strace-19c-threads-ok.txt: when system() returns 0
We think that OCI should not create threads, unless the program has explicitly
stated that it is prepared for that, for example with a parameter in OCIEnvCreate().
My point of view:
Assuming this is the reason for the issue, the Oracle client lib should NOT create threads, because legacy C applications might use non-thread-safe APIs like signal()...
The Oracle Client lib could possibly create threads if the OCI program explicitly declares that it is thread-safe with an OCIEnvCreate() option.
We expect Oracle Client lib to be lightweight, like many other DB client libs are.
I wonder why it has to create a thread... for what feature???
I will prod the dev team again. Did you log an SR with Support on this?
As I understand it, threads are here to stay and there are plans for others in future.
If the OCI client lib is intended to work only in a multi-threaded context, then the OCI documentation needs to mention this, and legacy code using signal() needs to be reviewed.
I assume that the dev team understands this, and that it was a deliberate decision to introduce threads usage in OCI...
I am surprised that this kind of feature was added in version 19c, which is in reality part of the 12.x family and should introduce minimal backward compatibility issues.
Maybe I am wrong and we should not mix signal() with OCI at all from the beginning.
I could not find any topic about signal handling in the OCI documentation:
About the SR: Sorry, but we have only a silver partnership and no permission to create an SR.
We are a dev tool company and we do not go to production with Oracle, we don't want to spend too much money in partnership contracts (we support other DB engines).
But we have large customers using our product with Oracle.
I have found something similar to our issue: https://stackoverflow.com/questions/17550217/linux-system-sigchld-handling-multithreading
system() calls sigprocmask() to block SIGCHLD, but sigprocmask() is not thread safe (should use pthread_sigmask()).
As stated before, we no longer use SIGCHLD.
However, my conclusion is that since Oracle client lib is now multi-threaded, it's no longer possible to use standard C APIs like system().
At least on Linux.
Some more thinking:
I realized that proper signal handling is possible in a multi-threaded process, but it needs to be managed by a single piece of code/component.
Having different components (the libclntsh.so lib and my OCI program code) installing signal handlers can end up in a big mess, even if my code properly handled signals for a multi-threaded process.
The idea in the above link is to dedicate a thread to receive signals and block the signals for all other threads.
How can I be sure that Oracle client will not overwrite my signal handlers and settings for multi-thread context?
(I want my program code to master the signal handling)
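The dedicated-thread idea can be sketched as follows, under some assumptions: start_signal_thread() runs before any library creates threads, so that every later thread inherits the blocked SIGCHLD, and no other component unblocks it or installs its own handler (which is exactly the open question for the Oracle client). All names are hypothetical. The thread reaps only children that the program registered itself, rather than calling waitpid(-1), so it cannot take away the child that system() is waiting for.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_TRACKED 64                  /* hypothetical limit for this sketch */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pid_t tracked[MAX_TRACKED];      /* 0 marks a free slot */
static int reaped_total = 0;            /* protected by lock */
static sigset_t chld_set;               /* contains only SIGCHLD */

/* Fork a shell command and remember its PID. The lock is held across
 * fork() so the signal thread cannot miss a child that exits at once. */
static pid_t spawn_tracked(const char *cmd)
{
    pid_t pid = -1;

    pthread_mutex_lock(&lock);
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] != 0)
            continue;
        pid = fork();
        if (pid == 0) {
            /* Child: undo the inherited mask before exec. */
            pthread_sigmask(SIG_UNBLOCK, &chld_set, NULL);
            execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
            _exit(127);
        }
        if (pid > 0)
            tracked[i] = pid;
        break;
    }
    pthread_mutex_unlock(&lock);
    return pid;
}

/* Reap only registered children; never waitpid(-1). */
static void reap_tracked(void)
{
    pthread_mutex_lock(&lock);
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i] != 0 &&
            waitpid(tracked[i], NULL, WNOHANG) == tracked[i]) {
            tracked[i] = 0;
            reaped_total++;
        }
    }
    pthread_mutex_unlock(&lock);
}

static int reaped_so_far(void)
{
    int n;

    pthread_mutex_lock(&lock);
    n = reaped_total;
    pthread_mutex_unlock(&lock);
    return n;
}

/* The only thread that ever consumes SIGCHLD. */
static void *signal_thread(void *arg)
{
    int sig;

    (void)arg;
    for (;;) {
        if (sigwait(&chld_set, &sig) == 0 && sig == SIGCHLD)
            reap_tracked();
    }
    return NULL;
}

/* Call first thing in main(). Returns 0 on success, -1 on failure. */
static int start_signal_thread(void)
{
    pthread_t tid;

    sigemptyset(&chld_set);
    sigaddset(&chld_set, SIGCHLD);
    if (pthread_sigmask(SIG_BLOCK, &chld_set, NULL) != 0)
        return -1;
    if (pthread_create(&tid, NULL, signal_thread, NULL) != 0)
        return -1;
    return pthread_detach(tid) == 0 ? 0 : -1;
}
```

This does not answer whether the client library leaves the mask and handlers alone; it only shows what the application side could look like if it does.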
Where can I find documentation about Oracle Call Interface and signal handling?
What is Oracle two-task communication? Is this always enabled?
I also found this:
The sqlnet.ora parameter
... prevents the OCI client from installing signal handlers (tested)
Is this sufficient?
This section of the doc states that OCI is thread-safe:
However, this does not mean that the user application actually needs to be thread-safe.
To me it says that the user application can optionally be multi-threaded... that's a big difference.
The developers asked me to log a bug on this so it could be tracked. It is bug 29865658.