This discussion is archived.
17 Replies - Latest reply: Nov 16, 2012 8:40 AM by Dude!

IPC

BillyVerreynne Oracle ACE
Oracle Linux Server release 5.8, SVR4 IPC compatible/conformant message queues (the ftok, msgsnd and msgrcv calls).

I have two types of custom written processes - one that enqueues (msgsnd) data to a variable number of queues, and one that dequeues (msgrcv) data from a specific queue.

For example, one queue writer process writing data in a round-robin fashion to 5 message queues. With five queue reading processes, each one dedicated to a specific queue, reading data from the queue. The queue writer process has very little latency/overheads. The queue reader process in comparison is slower - thus the need for multiple queues and multiple queue reader processes.
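
Roughly, the writer side looks like the sketch below - a minimal, hypothetical example only (queue count, key path and payload struct are illustrative, not the actual code):

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define QUEUE_COUNT 5                 /* illustrative - matches the 5-queue example */

struct demo_msg {
    long mtype;                       /* SVR4 message type, must be > 0 */
    char mtext[1500];
};

int main(void)
{
    int qid[QUEUE_COUNT];
    int i, n, next = 0;

    for (i = 0; i < QUEUE_COUNT; i++) {
        key_t key = ftok("/tmp", 'A' + i);          /* per-queue key from an existing path */
        qid[i] = msgget(key, IPC_CREAT | 0660);
        if (qid[i] == -1) { perror("msgget"); return 1; }
    }

    for (n = 0; n < 100; n++) {                     /* demo: enqueue 100 messages round-robin */
        struct demo_msg msg = { .mtype = 1 };
        snprintf(msg.mtext, sizeof msg.mtext, "message %d", n);
        if (msgsnd(qid[next], &msg, strlen(msg.mtext) + 1, 0) == -1)
            perror("msgsnd");
        next = (next + 1) % QUEUE_COUNT;            /* round-robin across the queues */
    }
    return 0;
}

Each reader process would sit in a loop calling msgrcv() on its dedicated queue id and writing the payload out.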

On RHEL3 and RHEL4, this code has been fairly extensively used (old dual CPU servers). The queue writer process enqueues over 350 messages per second - with the 5 queue readers adequately dealing with the dequeuing. Far less than 1% of messages have to be discarded by the queue writer due to all the message queues being busy.

On OL5.8 (new 12 core servers), the error rate is 70%. Which means that 70% of msgsnd calls for queueing messages are failing. The exact same code running on the exact same data volumes as on RHEL3 and 4.

Any ideas what to look at, troubleshooting this discrepancy? Any significant changes with the IPC SVR4 message queue implementation? Should I be looking at porting the code to IPC Posix queues instead?

Any non-null pointers will be appreciated. :-)
  • 1. Re: IPC
    Dude! Guru
    Have you compared the output of ipcs -l between the systems to check semaphores and other kernel parameters that affect message limits (MSGMNI, MSGMAX), etc.?

    It seems Posix IPC is preferred.
    http://stackoverflow.com/questions/967335/are-message-queues-obsolete-in-linux
    http://unix.stackexchange.com/questions/6930/how-is-a-message-queue-implemented-in-the-linux-kernel
    http://www.kernel.org/doc/man-pages/online/pages/man7/mq_overview.7.html
  • 2. Re: IPC
    BillyVerreynne Oracle ACE
    Dude wrote:
    Have you compared the output of ipcs -l between the systems to check semaphores and other kernel parameters that affect message limits (MSGMNI, MSGMAX), etc.?
    Yes. New version (server and o/s) has significantly higher settings:
    OLD:
    ------ Messages: Limits --------
    max queues system wide = 16
    max size of message (bytes) = 8192
    default max size of queue (bytes) = 16384
    NEW:
    ------ Messages: Limits --------
    max queues system wide = 32768
    max size of message (bytes) = 65536
    default max size of queue (bytes) = 65536
    Also on the old RHEL3 kernel I could not use msgctl() to do an IPC_SET to increase the message queue size - on later kernel versions it worked fine.
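
    For reference, that msgctl() call looks roughly like the sketch below - illustrative only; raising msg_qbytes above the system default generally needs privilege (root or CAP_SYS_RESOURCE on Linux):

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/msg.h>

    int set_queue_bytes(int qid, unsigned long bytes)
    {
        struct msqid_ds ds;

        if (msgctl(qid, IPC_STAT, &ds) == -1) {     /* read the current queue settings */
            perror("msgctl(IPC_STAT)");
            return -1;
        }
        ds.msg_qbytes = bytes;                      /* new maximum queue size in bytes */
        if (msgctl(qid, IPC_SET, &ds) == -1) {      /* apply the new limit */
            perror("msgctl(IPC_SET)");
            return -1;
        }
        return 0;
    }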

    Which is why I find it very surprising that this IPC code actually runs faster and better on an old kernel on an old server. I was expecting a significantly higher messages/sec throughput with 0% errors.
    It seems Posix IPC is preferred.
    Yeah - but rewriting the s/w for Posix IPC means development time I do not really have available. It would also mean supporting old-style IPC for the older RHEL3 servers until they too are one day upgraded to new servers and newer Linux versions.

    Will hack in what little time I have available for the remainder of the week and if still no joy, consider rewriting the code for Posix IPC next week. Not that I do not like *nix development - I love it. But the issue is always priorities and available time to get that done...
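
    Should it come to that, the POSIX equivalent would look roughly like the sketch below - queue name and sizes are illustrative; see mq_overview(7), and link with -lrt:

    #include <fcntl.h>
    #include <mqueue.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        struct mq_attr attr = {
            .mq_maxmsg  = 10,        /* capacity in messages (bounded by /proc/sys/fs/mqueue/msg_max) */
            .mq_msgsize = 1500,      /* maximum message size in bytes */
        };

        /* O_NONBLOCK gives behaviour similar to msgsnd() with IPC_NOWAIT */
        mqd_t q = mq_open("/demo-queue", O_CREAT | O_WRONLY | O_NONBLOCK, 0660, &attr);
        if (q == (mqd_t)-1) { perror("mq_open"); return 1; }

        const char *payload = "hello";
        if (mq_send(q, payload, strlen(payload) + 1, 0) == -1)
            perror("mq_send");       /* EAGAIN when the queue is full */

        mq_close(q);
        return 0;
    }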
  • 3. Re: IPC
    800381 Explorer
    Try lowering your limits. I'm thinking there could be some scaling issue, and it's easy enough to test.
  • 4. Re: IPC
    BillyVerreynne Oracle ACE
    Will do so and try (race conditions came to mind, but without tangible ideas how to test for that) - though the limits are the default ones for OL5.8.
  • 5. Re: IPC
    Dude! Guru
    Actually I wonder how these parameters affect performance. Provided the kernel limits are sufficient to allow messages to be queued, higher values may allow the system to buffer and deal with more information and bursts, but will this necessarily increase performance or just delay errors?

    As far as I understand, when a process reads the message queue, the queue will be emptied. So is the problem reading or sending? Doesn't the process that queues and reads the messages have any error reporting? How did you come up with current failure statistics?

    Is there some sort of IPC scheduling? For instance, Oracle Linux uses the "deadline" instead of the "cfq" I/O scheduler, which affects disk I/O.
  • 6. Re: IPC
    BillyVerreynne Oracle ACE
    Dude wrote:
    Actually I wonder how these parameters affect performance. Provided the kernel limits are sufficient to allow messages to be queued, higher values may allow the system to buffer and deal with more information and bursts, but will this necessarily increase performance or just delay errors?
    Likely just delay errors - as the queue needs to be emptied fast enough. Buffering is unlikely to increase performance when queue writes (at a consistent rate) are faster than the rate at which the queue can be emptied/read.
    As far as I understand, when a process reads the message queue, the queue will be emptied.
    Correct. A read specifies what type of message to read from the queue and removes the message read.
    So is the problem reading or sending? Doesn't the process that queues and reads the messages have any error reporting? How did you come up with current failure statistics?
    Message writes are done using IPC_NOWAIT. This means a failure if the message cannot be immediately written to the queue. This process needs to spend minimal time waiting on a queue write, as it in turn receives metrics at a rate of 100's of UDP packets per second. So if a msgsnd() fails, it drops/discards that message.

    The number of discards versus the number of successful message sends is expressed as a percentage by the process as part of its statistics.

    So the error is with the msgsnd() in the writer process - but it is caused by the reader process for that queue being busy doing file I/O (servicing the previously dequeued message) and thus not getting back to a msgrcv().
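
    The send-and-discard pattern, roughly - an illustrative sketch only, counters and names are not the actual code:

    #include <errno.h>
    #include <string.h>
    #include <sys/msg.h>

    static unsigned long sent, dropped;             /* feed the failure-rate statistic */

    void enqueue(int qid, const void *data, size_t len)
    {
        struct { long mtype; char mtext[2048]; } msg;

        msg.mtype = 1;
        memcpy(msg.mtext, data, len);

        /* IPC_NOWAIT: never block the writer; if the queue is full, drop the message */
        if (msgsnd(qid, &msg, len, IPC_NOWAIT) == -1) {
            if (errno == EAGAIN)
                dropped++;                          /* queue full - reader has fallen behind */
        } else {
            sent++;
        }
    }

    The reported failure percentage would then be something like dropped * 100.0 / (sent + dropped).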
    Is there some sort of IPC scheduling? For instance, Oracle Linux uses the "deadline" instead of the "cfq" I/O scheduler, which affects disk I/O.
    Not with SVR4-style basic IPC as far as I know. A manpage on svipc shows the extent and interface of System V Release 4 IPC.
  • 7. Re: IPC
    800381 Explorer
    Is your writer process now running much faster? You were sending 350 msgs/sec on the older kernel, with 1% failure. Now you're seeing 70% failure rates, but how many messages do you actually manage to send?

    Another thought: how fast does the writer run right after you start it, before the queue fills up? What percent of msgsnd() calls fail during that time?

    Finally, if it's a race condition inside the kernel's message processing code, maybe external synchronization could help? Maybe using a PTHREAD_PROCESS_SHARED mutex between the reader and writer processes to serialize msgsnd() and msgrcv() calls would produce faster performance - and if it does, that would be indicative of a significant performance problem.
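
    A minimal sketch of that idea, assuming a POSIX shared-memory segment to hold the mutex - names are illustrative, error handling trimmed; link with -lpthread -lrt:

    #include <fcntl.h>
    #include <pthread.h>
    #include <sys/mman.h>
    #include <unistd.h>

    pthread_mutex_t *create_shared_mutex(void)
    {
        int fd = shm_open("/ipc-demo-lock", O_CREAT | O_RDWR, 0660);
        ftruncate(fd, sizeof(pthread_mutex_t));

        pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
        close(fd);

        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(m, &attr);               /* only one process should initialize it */

        return m;                                   /* wrap msgsnd()/msgrcv() in lock/unlock */
    }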

    However:
    So the error is with the msgsnd() in the writer process - but is caused by the reader process for that queue, busy doing file I/O (servicing the previously dequeued message), not being able to do a msgrcv().
    That seems to say to me that your processing pipeline is backed up, and it's bottlenecked on file output.
  • 8. Re: IPC
    Dude! Guru
    Is there any reason that your application has to deal with more I/O under the newer kernel, or more I/O than necessary? Is there any change in architecture or the size of memory pages, e.g. a missing TLB cache? Are there changes in the kernel and in how IPC messages are being processed internally, e.g. CPU time slices, etc.? Is there anything you can do to address the problem in your application code or by modifying kernel parameters?
    Also on the old RHEL3 kernel I could not use msgctl() to do an IPC_SET to increase the message queue size - on later kernel versions it worked fine.
    Is it possible these resources are sized too large, resulting in unnecessarily large chunks of data and I/O? Is it possible that something that did not work before was the reason for it working in the first place?

    Looking at the current situation, I think some more in-depth live view of what is actually going on when the system processes these IPC messages is required. You can search e.g. "analysis of the IPC mechanisms", and should find references pointing to oprofile and perf to profile the Linux kernel to analyze what is going on during IPC processing. Now the question is, what tools exist to help find the cause of your problem, without requiring a PhD in kernel design.

    I wonder, what happens if you nice the programs that are reading the queues and increase their priority? Does this help?
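
    The renice can also be done from inside the reader itself - a rough sketch only; a negative nice value needs privilege (root or CAP_SYS_NICE):

    #include <stdio.h>
    #include <sys/resource.h>

    int boost_priority(void)
    {
        if (setpriority(PRIO_PROCESS, 0, -5) == -1) {   /* who = 0 means this process */
            perror("setpriority");
            return -1;
        }
        return 0;
    }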
  • 9. Re: IPC
    BillyVerreynne Oracle ACE
    user5287726 wrote:
    Is your writer process now running much faster? You were sending 350 msgs/sec on the older kernel, with 1% failure. Now you're seeing 70% failure rates, but how many messages do you actually manage to send?
    The 350 messages/sec is fairly constant for that specific server. It varies from server to server (scattered geographically). So the issue is not doing more messages/sec (as that rate is constant), but having 0% failures when sending that amount in a round-robin fashion down 5 IPC queues.
    Another thought: how fast does the writer run right after you start it, before the queue fills up? What percent of msgsnd() calls fail during that time?
    The failures happen immediately - within the first couple of seconds of running, the writer process starts running into msgsnd() failures.
    Finally, if it's a race condition inside the kernel's message processing code, maybe external synchronization could help? Maybe using a PTHREAD_PROCESS_SHARED mutex between the reader and writer processes to serialize msgsnd() and msgrcv() calls would produce faster performance - and if it does, that would be indicative of a significant performance problem.
    Will have to look at that.
    However:
    That seems to say to me that your processing pipeline is backed up, and it's bottlenecked on file output.
    Correct. New server, newer h/w and I/O bus, faster disks - so I/O should be faster too. Less latency when the message reader process writes messages to disk/file should have let it get back to a msgrcv() faster, resulting in fewer msgsnd() calls (from the writer process) failing.

    Anyway, this was the theory. :-)
  • 10. Re: IPC
    BillyVerreynne Oracle ACE
    Dude wrote:
    Is there any reason that your application has to deal with more I/O under the newer kernel, or more I/O than necessary?
    I/O is unchanged - read a message from queue, write message to file. ext3 file system on both. Both are using vanilla defaults as far as kernel settings go.
    Also on the old RHEL3 kernel I could not use msgctl() to do an IPC_SET to increase the message queue size - on later kernel versions it worked fine.
    Is it possible these resources are sized too large, resulting in unnecessarily large chunks of data and I/O? Is it possible that something that did not work before was the reason for it working in the first place?
    The messages are fairly fixed sizes. Most are 1427 bytes. Some are less (depending on the network device sending data). So I/O size to file is mostly 1427 byte (binary) writes.

    Trying to set the IPC queue size on a 2.4 kernel results in an error. Not so on a 2.6 kernel. As I develop on the latter, setting the IPC queue size was tested and worked fine. But this option was not used when the code was deployed on 2.4 kernels - there the default kernel setting applied and the size was not manually set.
    Looking at the current situation, I think some more in-depth live view of what is actually going on when the system processes these IPC messages is required. You can search e.g. "analysis of the IPC mechanisms", and should find references pointing to oprofile and perf to profile the Linux kernel to analyze what is going on during IPC processing. Now the question is, what tools exist to help find the cause of your problem, without requiring a PhD in kernel design.

    I wonder, what happens if you nice the programs that are reading the queues and increase their priority? Does this help?
    Interesting idea to nice it... will try it (not right now though, as I've got too many production issues to deal with and no time at the moment for troubleshooting this IPC issue).

    Doing a nice reminds me of the interesting behaviour of some Win32 code I wrote for WinME and WinNT. The code did a tight loop (reading messages from a simulation process and writing that data into a motion sink to drive a motion platform). On WinME it ran fine. On WinNT this red-lined the CPU it executed on. It turned out that adding a small delay (a Win32 kernel yield call) solved that problem on the WinNT kernel... and made no difference to the process (running fine CPU-wise) on the WinME kernel.

    Always made me wonder just what was coded differently between these two kernels that supported the same basic Win32 kernel concepts and API... :-)
  • 11. Re: IPC
    800381 Explorer
    Correct. New server, newer h/w and I/O bus, faster disks - so I/O should be faster too. Less latency when the message reader process writes messages to disk/file should have let it get back to a msgrcv() faster, resulting in fewer msgsnd() calls (from the writer process) failing.
    Not necessarily. What is the old filesystem? What hardware was it running on? What's the new file system? What hardware is it running on?

    Picture a SAN array where somebody who doesn't know what they're doing builds you a file system using 4 kB blocks on a RAID-5 array of 23 SATA disks, each with a segment size of 1 MB, and puts about 30 busy file systems, each doing random small-block I/O, on the same RAID-5 array.
  • 12. Re: IPC
    BillyVerreynne Oracle ACE
    user5287726 wrote:
    Correct. New server, newer h/w and I/O bus, faster disks - so I/O should be faster too. Less latency when the message reader process writes messages to disk/file should have let it get back to a msgrcv() faster, resulting in fewer msgsnd() calls (from the writer process) failing.
    Not necessarily. What is the old filesystem? What hardware was it running on? What's the new file system? What hardware is it running on?

    Picture a SAN array where somebody who doesn't know what they're doing builds you a file system using 4 kB blocks on a RAID-5 array of 23 SATA disks, each with a segment size of 1 MB, and puts about 30 busy file systems, each doing random small-block I/O, on the same RAID-5 array.
    No SAN or NAS - just local SCSI disks.

    The new hardware (released in April this year) is significantly faster than the old hardware (purchased in 2005). So basic things such as h/d speeds are faster, with a faster/wider CPU bus, and so on. Which is why the expectation was that the queue reader processes would have less I/O latency writing to disk on the new h/w than on the old h/w. Which I believe is the case - as I/O does not seem to be the underlying cause of the large increase in msgsnd() failure percentage we are seeing.

    Unfortunately I've had very little time to spend on this problem (the old h/w is still being used in the meantime) as there were (and still are) other pressing production issues. Never rains here. Just a permanent torrential rainstorm. ;-)

    I appreciate the feedback from you and Dude - I have a much better plan of attack for troubleshooting and resolving the issue than a week ago. Thanks.
  • 13. Re: IPC
    Dude! Guru
    You are certainly welcome. Regarding new software and hardware, I have often been disappointed. I'm not saying that this is the case here, but new software may carry more overhead and can be less optimized than a previous version due to bigger libraries and more features. New hardware, although showing better specs, will usually handle large amounts of data more economically, but may not necessarily mean an improvement when dealing with regular text or small chunks of data.
  • 14. Re: IPC
    800381 Explorer
    I'll also chime in and say not to be so sure the new hardware is faster.

    I currently have a customer that just replaced some several-year-old Sun x86 servers with brand-new HP ones. The old Sun servers are faster - both in CPU and IO benchmarks. You'd better believe that caused quite a bit of consternation. Just because the specs say it should be faster doesn't mean the parts are actually engineered to work together faster.

    Heck, I'll even go so far as to state disk performance on "standard" servers peaked several years ago when U320 SCSI disks were standard. A five-year-old 3 1/2" 15K rpm U320 SCSI disk is going to kick the snot out of a brand-spanking-new 2 1/2" 5400 rpm SATA disk performance-wise, even if that SATA drive supports 6.0 Gbps connectivity. Remember - you can have a Serial Attached SCSI controller and plug SATA drives into it. So your "SCSI" drives could very well be SATA ones in reality.