Are you sure that the BBL actually "kills" them, or do they crash on their own and the BBL just discovers that they're gone (and restarts them)?
Are any core files being generated? Sometimes a Tuxedo server is unable to create a core file when it crashes, due to directory permissions or file size quotas, but if you find a core file and "file core" tells you it came from one of the supplied servers, I'd say you have "the smoking gun" right there.
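To rule out the permissions/quota explanation, you can check the core-size limit and the target directory. This is a hypothetical C sketch (the function names are illustrative) built on the POSIX getrlimit(RLIMIT_CORE) and access() calls. It assumes the core would be written to the server's working directory, which may not hold everywhere (on Linux, /proc/sys/kernel/core_pattern can redirect cores). It also only sees the limits of the process running it, so run it from the same environment the servers are booted from.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/resource.h>
#include <unistd.h>

/* Returns 1 if the soft core-file size limit permits a core of at least
   min_bytes, 0 if it does not, or -1 if getrlimit() fails.
   A soft limit of 0 means core files are disabled entirely. */
int core_dumps_allowed(rlim_t min_bytes)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;
    return rl.rlim_cur >= min_bytes ? 1 : 0;
}

/* Returns 1 if the calling process may create files in dir, else 0.
   ASSUMPTION: the core would be written to this directory. */
int dir_writable(const char *dir)
{
    return access(dir, W_OK) == 0 ? 1 : 0;
}
```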
To add to what Per said: if the BBL kills a server (which it normally only does when a server exceeds its service timeout, SVCTIMEOUT), there should be an entry in the ULOG to that effect. As Per says, I suspect these servers have severe memory leaks and possibly memory corruption problems (although that is speculation on my part). Growing memory in a Tuxedo server is often caused by not calling tpfree() on buffers the server owns. But without the sources, the only option is likely TMTRACE, as was already mentioned.
Oracle Tuxedo Chief Architect
As these are third-party services, I don't have access to the code to check myself.
Yes, the BBL is killing the server after an SVCTIMEOUT; the following is in the ULOG:
BBL.1168.8884.0: CMDTUX_CAT:1836: WARN: Server(7876) processing terminated with SIGKILL after SVCTIMEOUT
Straight after this, LIBTUX_CAT:541 is logged.
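To confirm that pattern across a whole ULOG rather than by eye, a small filter is enough. This is a hypothetical C sketch (every_541_follows_1836 is an illustrative name) that only matches the message IDs "CMDTUX_CAT:1836:" and "LIBTUX_CAT:541:" on newline-separated lines, in the format shown in the log line above. It assumes the two entries are adjacent, which may not hold in a busy ULOG where messages from other processes interleave.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if needle occurs within the first len bytes of line. */
static int line_contains(const char *line, size_t len, const char *needle)
{
    size_t n = strlen(needle);
    for (size_t i = 0; i + n <= len; i++)
        if (memcmp(line + i, needle, n) == 0)
            return 1;
    return 0;
}

/* Returns 1 if every ULOG line mentioning LIBTUX_CAT:541 comes directly
   after a line mentioning CMDTUX_CAT:1836, 0 otherwise.
   Simplification: assumes the two entries are adjacent in the file. */
int every_541_follows_1836(const char *ulog)
{
    int prev_was_kill = 0;
    const char *p = ulog;
    while (*p) {
        const char *nl = strchr(p, '\n');
        size_t len = nl ? (size_t)(nl - p) : strlen(p);
        if (line_contains(p, len, "LIBTUX_CAT:541:") && !prev_was_kill)
            return 0;
        prev_was_kill = line_contains(p, len, "CMDTUX_CAT:1836:");
        if (!nl)
            break;
        p = nl + 1;
    }
    return 1;
}
```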
There are multiple offending services, which tells me the problem isn't a single service where the process exits without calling any Tuxedo exit routines... (Are these just tpreturn() and tpexit()? It's been a while since I have written a Tuxedo service.)
...or a single service where the memory leaks are.
Will TMTRACE show tpalloc() calls along with any tpfree() calls?
See the documentation for details on TMTRACE; it includes an example of logging all calls to tpacall().
As I'm not The Universal Master Of All RegExps, I don't really know whether you can catch several different calls in one search expression.
On another note, I'd look into the processing time of your services. I'd say a general performance problem is more probable than a major bug in the services (if they never called tpreturn(), not much work would get done at all...).
What is the value of SVCTIMEOUT (the one that is exceeded every now and then)?
If you can add "-r" to CLOPT, you'll get statistics for all services executed in that particular server, written to its stderr file (which can be specified with the -e option).
Using txrpt, you can then get insight into the execution times of the services in question. If they are consistently near SVCTIMEOUT, you may need to adjust SVCTIMEOUT. If they vary a lot, you may need to check for database locks or other causes of "spiky" behaviour.
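The rule of thumb above can be written down as a tiny decision function. In this hypothetical C sketch the thresholds (an average at or above 80% of SVCTIMEOUT counts as "near", a maximum more than 3x the average counts as "spiky") are arbitrary illustrations, not Tuxedo defaults. The inputs are whatever average and maximum you extract from the txrpt output, and an SVCTIMEOUT of 0 is treated as "no timeout".

```c
/* Possible verdicts for one service (hypothetical names). */
enum svc_verdict { SVC_OK, SVC_NEAR_TIMEOUT, SVC_SPIKY };

/* Illustrative thresholds only; tune for your application. */
#define NEAR_FRACTION 0.8   /* avg >= 80% of SVCTIMEOUT -> near timeout */
#define SPIKE_FACTOR  3.0   /* max > 3x avg             -> spiky        */

/* Classifies one service from its average and maximum execution times and
   the SVCTIMEOUT in effect, all in seconds. SVCTIMEOUT 0 = no timeout. */
enum svc_verdict classify_service(double avg_secs, double max_secs,
                                  double svctimeout_secs)
{
    if (svctimeout_secs > 0 && avg_secs >= NEAR_FRACTION * svctimeout_secs)
        return SVC_NEAR_TIMEOUT;
    if (avg_secs > 0 && max_secs > SPIKE_FACTOR * avg_secs)
        return SVC_SPIKY;
    return SVC_OK;
}

/* Human-readable label for a verdict. */
const char *verdict_name(enum svc_verdict v)
{
    switch (v) {
    case SVC_NEAR_TIMEOUT: return "near SVCTIMEOUT: consider adjusting it";
    case SVC_SPIKY:        return "spiky: check for database locks etc.";
    default:               return "ok";
    }
}
```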
See the documentation for more info on how to interpret the statistics that -r creates.
It is fairly common to write your own utilities for interpreting the stderr file. It's not exactly rocket science if you want, for instance, min and max values out of it, which, by the way, would make a nice enhancement to txrpt in the first place. Product Management: are you listening? :-)
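As an example of such a utility, here is a C sketch that collects the count, min, max and total elapsed time for one service. The record layout it parses (a line like "@SERVICE PID START END" with times in seconds) is an assumption made for illustration only; check the actual layout that -r writes against the txrpt documentation and adjust the sscanf() format accordingly. Lines that don't match are skipped, and the struct and function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Running statistics for one service (hypothetical helper type). */
struct svc_stats {
    char   name[32];
    long   count;
    double min, max, total;
};

/* ASSUMED record layout: "@SERVICE PID START END" (times in seconds).
   Returns 1 and fills name/elapsed on success, 0 for any other line. */
int parse_record(const char *line, char name[32], double *elapsed)
{
    long pid;
    double start, end;
    if (sscanf(line, "@%31s %ld %lf %lf", name, &pid, &start, &end) != 4)
        return 0;
    *elapsed = end - start;
    return 1;
}

/* Folds one elapsed time into the running min/max/total. */
void stats_add(struct svc_stats *s, double elapsed)
{
    if (s->count == 0 || elapsed < s->min) s->min = elapsed;
    if (s->count == 0 || elapsed > s->max) s->max = elapsed;
    s->total += elapsed;
    s->count++;
}

/* Aggregates every record for `service` found in the stderr text. */
struct svc_stats collect(const char *text, const char *service)
{
    struct svc_stats s;
    memset(&s, 0, sizeof s);
    strncpy(s.name, service, sizeof s.name - 1);
    const char *p = text;
    while (*p) {
        const char *nl = strchr(p, '\n');
        size_t len = nl ? (size_t)(nl - p) : strlen(p);
        char line[256], name[32];
        double elapsed;
        if (len >= sizeof line)
            len = sizeof line - 1;          /* truncate over-long lines */
        memcpy(line, p, len);
        line[len] = '\0';
        if (parse_record(line, name, &elapsed) && strcmp(name, service) == 0)
            stats_add(&s, elapsed);
        if (!nl)
            break;
        p = nl + 1;
    }
    return s;
}
```

The average is simply total / count; calling collect() once per service of interest gives you the min/max figures alongside it.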
Well, if all of the LIBTUX_CAT:541 messages are preceded by a CMDTUX_CAT:1836, then it makes sense that exit handlers aren't being called: the process is killed with SIGKILL, which as far as I know can't be caught, so all user exit handlers are bypassed. But what does that have to do with leaking memory? The memory leaks are most certainly not related to exit handlers not being called, since running those would be the last thing a server does before the process disappears anyway.
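The SIGKILL point is easy to verify on a POSIX system. This C sketch (the function names are illustrative) shows that sigaction() refuses to install a handler for SIGKILL, failing with EINVAL, and that a child process killed with SIGKILL never runs its atexit() handler. That mirrors what happens to a server the BBL kills after SVCTIMEOUT; the sketch uses only standard POSIX calls and is an analogy, not a model of Tuxedo's own exit routines.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void ignore_signal(int sig) { (void)sig; }

/* Tries to install a handler for SIGKILL. Returns the resulting errno
   (EINVAL on POSIX systems), or 0 if the call unexpectedly succeeded. */
int try_catch_sigkill(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = ignore_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGKILL, &sa, NULL) == -1 ? errno : 0;
}

static int report_fd = -1;

/* atexit() handler: writes one byte so the parent can see it ran. */
static void exit_handler(void)
{
    char c = 'x';
    ssize_t r = write(report_fd, &c, 1);
    (void)r;
}

/* Forks a child that registers exit_handler() and then blocks forever,
   kills it with SIGKILL, and reports whether the handler ran:
   1 = ran, 0 = did not run, -1 = setup error. */
int exit_handler_ran_after_sigkill(void)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;
    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                         /* child */
        close(fds[0]);
        report_fd = fds[1];
        atexit(exit_handler);
        char ready = 'r';
        ssize_t r = write(fds[1], &ready, 1);   /* signal "set up" */
        (void)r;
        for (;;)
            pause();                        /* like a service stuck past SVCTIMEOUT */
    }
    close(fds[1]);                          /* parent */
    char c;
    if (read(fds[0], &c, 1) != 1) {         /* wait until atexit() is registered */
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        close(fds[0]);
        return -1;
    }
    kill(pid, SIGKILL);
    int status = 0;
    waitpid(pid, &status, 0);
    ssize_t n = read(fds[0], &c, 1);        /* 0 (EOF) unless the handler wrote */
    close(fds[0]);
    if (!WIFSIGNALED(status) || WTERMSIG(status) != SIGKILL)
        return -1;
    return n == 1 ? 1 : 0;
}
```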
You would want to trace service routines, tpreturn(), tpalloc(), tpfree(), and tpcall()/tpacall(). There is no tpexit() routine to trace.
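Once such a trace is captured, a rough leak indicator is the difference between the number of tpalloc() and tpfree() entries. The C sketch below assumes each traced call entry appears on a line containing "{ tpalloc(" or "{ tpfree("; that line format is an assumption, so check it against a real trace first. Buffers that Tuxedo allocates or frees on the server's behalf (for example the one handed to tpreturn()) and tprealloc() calls are not accounted for, so treat the number only as a hint: a count that keeps climbing request after request is what would point at missing tpfree() calls.

```c
#include <string.h>

/* Counts non-overlapping occurrences of needle in text. */
static long count_occurrences(const char *text, const char *needle)
{
    long n = 0;
    size_t len = strlen(needle);
    for (const char *p = strstr(text, needle); p; p = strstr(p + len, needle))
        n++;
    return n;
}

/* ASSUMPTION: one trace line per call entry, containing "{ tpalloc(" or
   "{ tpfree(". Returns allocations minus frees seen in trace_text. */
long outstanding_buffers(const char *trace_text)
{
    return count_occurrences(trace_text, "{ tpalloc(")
         - count_occurrences(trace_text, "{ tpfree(");
}
```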