10 Replies Latest reply: Jan 22, 2010 1:38 PM by Tom B RSS

    Strange Behaviour with JMS Message Sends

    710897
      Hi,
      We have a very simple JMS producer that sends a single message to a JMS cluster(with 4 MS) at 5s intervals. We time the sends and on average it takes about 10-15ms for each send. However we see that the time for some message sends spike to about 2+ seconds at times. These occurrences are very random and with no obvious pattern. All the server logs and GC logs are clean with no apparent reason for the slowdown. There is nothing running on the WL cluster other than the Distributed Queue. The OS is RHEL and all stats look clean. Our application is very sensitive to delays and these time spikes are causing issues.

      Are there any tuning parameters that can be changed? It appears that the JMS subsystem is busy doing some internal tasks and doesn't respond to the message sends. We have tried JMS debug flags. It gives lot of information but we don't know what to look for.

      Configuration:
      Weblogic: Weblogic Server 10.3
      OS: RHEL 5.3
      JDK: JRockit R27.6.4

      Thanks in advance..
        • 1. Re: Strange Behaviour with JMS Message Sends
          Tom B
          Hi,

          Unless there's an extremely heavy load of other competing JMS messaging or persistent store related work, I very much doubt that the delay is due to the JMS server somehow being "busy doing other tasks".

          My first thought is that you are likely using file based persistence, and the 10-15 ms number is a clue that the setup is using consumer grade disks for persistence (e.g. something with no battery backed write-back cache to help mitigate disk latency delays). 10 ms roughly corresponds to the latency of a basic 5400 or 7200 RPM disk drive. You can check if the disk hardware latency is the root cause of the delays by temporarily configuring your file store with a "Disabled" synchronous write policy. If disk latency is a problem in your setup, then this should speedup performance considerably -- but at the expense of safe operation Note that there are other JMS vendors that default to the equivalent of Disabled, but WL does not in the interest of transactional safety.

          If you're using database persistence, then database latency could be the issue.

          Note that 10 ms is the typical time slice for O/S context switches between threads. In addition, 10 ms also tends to be the granularity of System.currentTimeMillis() (so an actual difference if 1 ms can yield a 10 ms diff if using System.currentTimeMillis() as your measuring tool -- and you should be using Java's nano timer instead).

          On the other hand, 2 seconds is a very long time, so perhaps its unlikely that the above has anything to do with a root cause of your problem. It's possible to get long delays in many ways. For example:

          -- A backlog developing on your queues (consumers running slower than producers, a backlog develops). Check "messages high" on your queue statistics to see if a backlog is developing.

          -- An overloaded host environment (for example, competing work on the same machine that periodically consumes a lot of database, network, memory, disk, or CPU).

          -- Faulty hardware (a bad network port).

          -- GCs occurring on the client side. I've seen cases where the client JVMs that are measuring latency between send and consume are themselves having GCs.

          If the above doesn't help, some questions that might help to narrow down the issue:

          -- Are the messages persistent?

          -- Is this queueing or pub/sub?

          -- Are you using distributed destinations?

          -- What is the size of the messages?

          -- Exactly how are the latency measurements being made? (On a client? Using what measurement tooling? Is the client running on the same machine?)

          -- What other load might there be on the host machine, disk, database, and network?

          -- Is there any application processing that's included in the latency measurement?

          Tom
          • 2. Re: Strange Behaviour with JMS Message Sends
            710897
            Thanks Tom. Some of your observations are spot on! We are running these tests on boxes with consumer grade disks but that is because we don't want to run the test script on Production. Production servers are connected to SAN. but we do see spikes there in the range of 2-3s. We are using file based persistence and the file store is on SAN. Here are some answers to your questions:

            -- A backlog developing on your queues (consumers running slower than producers, a backlog develops). Check "messages high" on your queue statistics to see if a backlog is developing.

            ----> The highest backlog we have seen on any of our queues is about 300 but we didn't see any issues when the backlog happened

            -- An overloaded host environment (for example, competing work on the same machine that periodically consumes a lot of database, network, memory, disk, or CPU).

            --> SAR output didn't show any significant load on the host when the spikes happened.

            -- Faulty hardware (a bad network port).

            --> We checked all hardware for any diagnostic messages indicating an issue but didn't find anything.

            -- GCs occurring on the client side. I've seen cases where the client JVMs that are measuring latency between send and consume are themselves having GCs.

            --> We did think about this but we saw a scenario where multiple clients (on separate JVMs) connected to Weblogic exhibited the same behavior at the same time. All of them encountered spikes in message send times!

            -- Are the messages persistent?

            --> Yes messages are persistent for all our queues.

            -- Is this queueing or pub/sub?

            --> Queueing

            -- Are you using distributed destinations?

            --> Yes

            -- What is the size of the messages?

            --> In production the message sizes vary from 300 bytes to 4k. Our typical message load in production is about 35,000 to 40,000/day. We have four nodes in the cluster (1 MS per node) with 1GB heap for each node. The box itself has 4 CPU and 16GB RAM.


            -- Exactly how are the latency measurements being made? (On a client? Using what measurement tooling? Is the client running on the same machine?)

            --> I've attached the code below

            -- What other load might there be on the host machine, disk, database, and network?

            --> Nothing significant as seen in SAR otuput.

            -- Is there any application processing that's included in the latency measurement?

            --> None

            Here is the code that we are using to measure spikes in message send times....

            ===================
            import java.util.Hashtable;
            import javax.jms.Connection;
            import javax.jms.ConnectionFactory;
            import javax.jms.Destination;
            import javax.jms.JMSException;
            import javax.jms.Message;
            import javax.jms.MessageProducer;
            import javax.jms.Session;
            import javax.naming.Context;
            import javax.naming.InitialContext;
            import java.util.Calendar;
            import java.text.SimpleDateFormat;

            public class Tester {
            private final ConnectionFactory cf;

            private final long timeBetweenChecks;

            private final long alertThreshold;

            private final long alertThresholdWarn;

            private final Destination healthQueue;

            public static final String DATE_FORMAT_NOW = "yyyy-MM-dd HH:mm:ss";

            public Tester(String providerUrl, String userName, String password,
            String connectionFactory, long timeBetweenChecks, long alertThreshold) throws Exception {
            this.timeBetweenChecks = timeBetweenChecks;
            this.alertThreshold = alertThreshold;
            this.alertThresholdWarn = alertThreshold - 1000;
            Hashtable<String, String> ht = new Hashtable<String, String>();
            ht.put(Context.INITIAL_CONTEXT_FACTORY, "weblogic.jndi.WLInitialContextFactory");
            ht.put(Context.PROVIDER_URL, providerUrl);
            ht.put(Context.SECURITY_PRINCIPAL, userName);
            ht.put(Context.SECURITY_CREDENTIALS, password);
            InitialContext ctx = new InitialContext(ht);
            System.out.println("Starting health checks with Error and Warn thresholds of " + this.alertThreshold + " & " + this.alertThresholdWarn);
            cf = (ConnectionFactory) ctx.lookup(connectionFactory);
            healthQueue = (Destination) ctx.lookup("com.company.jms.CheckQueue");
            }

            public void checkHealth() throws Exception {
            Connection conn = cf.createConnection();
            conn.start();
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(healthQueue);
            Calendar cal = Calendar.getInstance();
            SimpleDateFormat sdf = new SimpleDateFormat(DATE_FORMAT_NOW);
            System.out.println("Start time is: "+ sdf.format(cal.getTime()) );
            do {
            long startTime = System.currentTimeMillis();
            try {
            Message message = session.createMessage();
            producer.send(message);
            }
            catch (JMSException e) {
            e.printStackTrace();
            throw e;
            }
            finally {
            long endTime = System.currentTimeMillis();
            long total = endTime - startTime;
            if (total > alertThreshold) {
            System.out.println("Threshold exceeded at: "+ sdf.format(cal.getTime()) );
            System.out.println("Check took longer than alert threshold of " + alertThreshold
            + ", time:" + total);
            //producer.close();
            //session.close();
            //conn.close();
            //System.exit(1);
            }
            else if (total > alertThresholdWarn) {
            System.out.println("Warn Threshold exceeded at: "+ sdf.format(cal.getTime()) );
            System.out.println("Check took longer than alert warn threshold of " + alertThresholdWarn + ", time:" + total);
            }
            }
            Thread.sleep(timeBetweenChecks);
            }
            while (true);
            }

            public static void main(String args[]) throws Exception {
            String providerUrl = args[0];
            String userName = args[1];
            String password = args[2];
            String connectionFactory = args[3];
            long timeBetweenChecks = Long.parseLong(args[4]);
            long alert = Long.parseLong(args[5]);
            new Tester(providerUrl, userName, password, connectionFactory, timeBetweenChecks, alert)
            .checkHealth();
            }
            }

            ===================================

            Thanks in advance...
            • 3. Re: Strange Behaviour with JMS Message Sends
              450961
              You should retry with nonpersistent messages (and no XA transactions - the point being to achieve zero disk access).

              If the problem persists, then you've eliminated a lot of potential causes (disk, file system, WL store code).

              If the problem goes away, then you should write a small test app that performs tiny writes and syncs to disk a few times a second and run it on the server. (Or maybe try this first if modifying the app for NON_PERSISTENT is a problem.)

              If the test app experiences the same occasional delay, then chances are it's your hard drive either doing recalibration, recovery or trying to spin-down. In that case if you can get in a quiet enough environment and get sufficiently close to the hard drive that you can hear normal seeks, you should also be able to hear the recalib/recovery when it happens. If you can hear the constant spin-whine then you might be able to hear spin-down (though it might be too short to hear the pitch drop significantly). Another option is to just try swapping out the hard drive for one of a completely different brand, or use NFS to a remote system with real write syncs disabled - just for testing of course. If you have a SMART disk diagnosis tool available you may also want to check if it says your current hard drive is en route to failure.


              -Jeff
              • 4. Re: Strange Behaviour with JMS Message Sends
                Tom B
                You should retry with nonpersistent messages (and no XA transactions - the point being to achieve zero disk access).
                There's no need to modify the app to try this out. You can force all messages to non persistent via the "delivery mode override" configurable on the destination. I don't recall if this setting is dynamic -- if not, you'll need reboot the servers for it to take effect.
                • 5. Re: Strange Behaviour with JMS Message Sends
                  710897
                  Hi,
                  We have seen one occurrence of this issue with non persistent messages as well. At this point we don't know where the issue is. Any pointers to information will be greatly appreciated.

                  Thanks,
                  • 6. Re: Strange Behaviour with JMS Message Sends
                    Tom B
                    In the non-p case, did all sender clients report the 2 second delay at the same time?

                    Tom
                    • 7. Re: Strange Behaviour with JMS Message Sends
                      710897
                      No only one client reported the issue.

                      Thanks,
                      • 8. Re: Strange Behaviour with JMS Message Sends
                        Tom B
                        Since only one client reported the issue when running everything non-persistent, I suspect you may have hit a one-time GC on the client. If you are sure that your not having GC problems on the server, then, like Jeff, I suspect your disk subsystem (the SAN you mention) may be root cause. It would be interesting to see the results after follow Jeff's advice:
                        write a small test app that performs tiny writes and syncs to disk a few times a second and run it on the server.
                        If your running the WebLogic file store with the default synchronous write policy of "Direct", then the store is likely using its native unbuffered I/O driver. The closest you can get to this mode with with pure Java is to open a RandomAccessFile in "rwd" mode.

                        Tom
                        • 9. Weblogic 10.3 - not able to update JMS Queue Overrides
                          700336
                          Hi All,

                          I HAC customer who is trying to change the values in JMS modules.

                          Eg: JMS Modules >WwtWms1Module > CustomJMSQueue > Configuration > Overrides and saving them and Activating changes throws following exception:

                          problem: cvc-enumeration-valid: string value 'No-Delivery' is not a valid enumeration value for delivery-mode-type in namesp

                          I replicated the issue at my End.

                          1>Created JMS module
                          2>Targetted to Admin server.
                          3>Create a new JMS System module resource.
                          4>Created queue.
                          5>Clicked on the queue name for the configuration information say overide some of the settings that a message producer includes with its messages.
                          6>Change the value of the priority queue to 2.
                          7>Clicked on Save button and got the exception.
                          problem: cvc-enumeration-valid: string value 'No-Delivery' is not a valid enumeration value for delivery-mode-type in namesp

                          Interesting though Delivery Mode Override: at this time was No delivery

                          When we keep the Delivery Mode Override to persistent or Non persistent and than try to override the values they get overridden.

                          Through the console information it says:->

                          Delivery Mode Override

                          The delivery mode assigned to all messages that arrive at the destination regardless of the DeliveryMode specified by the message producer.

                          A value of No-Delivery specifies that the Delivery Mode will not be overridden.

                          Note: This attribute is dynamically configurable, but only incoming messages are impacted; stored messages are not impacted.

                          I had given the workaround too:->

                          Manually remove the entries from the JMS module in the jms_module.xml file (I believe Oracle documentation do not recommend manually editing the config files)

                          Disable the schema validation by -Dweblogic.configuration.schemaValidationEnabled=false .He does not want to use this option.

                          -So want to know is it a bug.

                          -And if not why is the option of No-Delivery specified when we cannot override the values.

                          Any pointers regarding the same are appreciated.
                          • 10. Re: Weblogic 10.3 - not able to update JMS Queue Overrides
                            Tom B
                            This problem has nothing to do with latency issues of the current thread.

                            Regardless:

                            1 - The No-Delivery issue is a known problem that's already being tracked.

                            2 - Work arounds: Don't explicitly set to No-Delivery (since No-Delivery is the default you will get the desired behavior), or, if already set, shutdown and manually edit JMS module to remove the setting.

                            3 - Contact customer support if you want them to raise the priority of fixing this Bug.