9 Replies Latest reply: May 18, 2011 8:06 AM by 800381

    Thread CPU pinning on Multi socket systems

    854422
      Having a 2- or 4-socket system of, say, 8-core CPUs results in the number of available processors being reported as 16 or 32. Just building a thread pool based on that number can perform quite poorly. What I experienced is, I think, exaggerated, since a single thread first reads a lot of base info from a database into memory, and that data then gets read by all members of the thread pool. Increasing the size of the pool beyond the core count of one CPU has little effect.
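
      For reference, the naive sizing I am describing is just this (the class name is only for illustration):

          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;

          public class PoolSizing {
              public static void main(final String[] args) {
                  // Reports 16 or 32 on the 2- and 4-socket boxes above.
                  final int n = Runtime.getRuntime().availableProcessors();
                  final ExecutorService pool = Executors.newFixedThreadPool(n);
                  // ... submit the worker tasks here ...
                  pool.shutdown();
              }
          }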

      Assuming that having multiple copies of the data is not a big problem, is there a way to get a copy on all CPUs, with each worker thread knowing which copy to reference? Thread groups, maybe?

      Edited by: user3055980 on May 16, 2011 3:33 PM
        • 1. Re: Thread CPU pinning on Multi socket systems
          854422
          Refining my issue: I am willing to restrict myself to Linux. Linux has a call, sched_setaffinity, which can restrict a PID, or the current thread, to a mask of processors. Of course, any read-only data would have to be replicated once per socket, as an array. A thread that knows its assigned socket then reads the same copy as all other threads assigned to that socket.

          Can the use of ThreadLocal in combination with that native call both restrict a thread and return the socket it was assigned? Here is a mockup with the native call left as a comment:
          public final class SocketAffinity {
              private final ThreadLocal<Integer> socketHolder;

              private int socketFilling;   // socket currently being filled
              private int cpuCount;        // threads assigned to that socket so far

              private SocketAffinity(final int nSockets, final int cpuPerSocket) {
                  socketHolder = new ThreadLocal<Integer>() {
                      // Runs once per thread, on that thread's first call to get().
                      // Synchronized on the shared ThreadLocal instance so the
                      // counters are updated consistently across threads.
                      @Override protected synchronized Integer initialValue() {
                          if (cpuCount == cpuPerSocket) {
                              cpuCount = 0;
                              if (++socketFilling == nSockets) socketFilling = 0;
                          }
                          cpuCount++;
                          // Here the native call would go: sched_setaffinity with a pid
                          // of 0, which affects just the calling thread, and a mask that
                          // lets it run on any core of the assigned socket.
                          // int sched_setaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask);
                          return Integer.valueOf(socketFilling);
                      }
                  };
              }
              // - - - - - - - - - - - - - - - -
              private static volatile SocketAffinity sa;

              public static void createInstance(final int nSockets, final int cpuPerSocket) {
                  sa = new SocketAffinity(nSockets, cpuPerSocket);
              }

              public static int getSocketAssigned() {
                  return sa.socketHolder.get();   // createInstance() must be called first
              }
          }
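
          A worker thread would then use the returned index to pick its copy of the data, something along these lines (the replicas array here is just illustrative, not part of the class above):

          public class Worker implements Runnable {
              private final Object[] replicas;   // one read-only copy of the base data per socket

              public Worker(final Object[] replicas) { this.replicas = replicas; }

              @Override public void run() {
                  // The first call binds this thread to a socket and returns that socket's index.
                  final int socket = SocketAffinity.getSocketAssigned();
                  final Object myCopy = replicas[socket];
                  // ... do the real work against myCopy ...
              }
          }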
          Edited by: EJP on 17/05/2011 09:14: added {noformat} tags. Please use them.
          • 2. Re: Thread CPU pinning on Multi socket systems
            jtahlborn
            if you are only reading the data, how does multiple threads reading one shared copy differ from each thread reading its own copy? i would think more data in memory would just slow things down, as it would reduce cache hits at various levels. have you actually profiled the application to see where the actual bottleneck is?
            • 3. Re: Thread CPU pinning on Multi socket systems
              EJP
              is there a way to get a copy on all CPUs
              I do not understand. CPUs don't have memory, they have registers, caches, etc. They all share the same memory. So your question is meaningless.

              Or are you talking about a distributed system here?
              • 4. Re: Thread CPU pinning on Multi socket systems
                854422
                I was referring to the caches of the CPUs. From what I have searched, a piece of memory can only be cached in a single CPU at a time (I could not retrace a good link, though I tried). The system I am using has 4 Opteron 6128s, each with 8 cores & a shared Level 3 cache. If all 32 threads are constantly competing for a piece of memory, this could be a problem.

                Having 4 copies, where threads bound to a given CPU all reference the same copy, seemed viable to me. I have found multiple sources of JNI wrappers for sched_setaffinity since posting. I might be able to leverage one of these and just try it without too much work. Here's one: http://blog.toadhead.net/index.php/2011/01/22/cputhread-affinity-in-java/

                I have profiled this very extensively on a quad-core desktop. I am set up as a server here, no X server. I have run pools of 8 threads & they are almost identical to 32 threads in run time.
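
                If one of those wrappers works out, the Java side would presumably look something like this (the library and method names here are placeholders, not the actual API from that blog):

                // Placeholder JNI binding; the C side (not shown) would just forward
                // the arguments to sched_setaffinity(2).
                public final class AffinityNative {
                    static { System.loadLibrary("affinity"); }   // assumed native library name

                    // pid 0 means "the calling thread"; mask has one bit per CPU the thread
                    // may run on. Returns 0 on success, -1 on failure, like the C call.
                    public static native int setAffinity(int pid, long mask);
                }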
                • 5. Re: Thread CPU pinning on Multi socket systems
                  802316
                  user3055980 wrote:
                  I was referring to the caches of the CPUs. From what I have searched, a piece of memory can only be cached in a single CPU at a time (I could not retrace a good link, though I tried).
                  The same memory can be copied into every level of cache, and into every CPU's cache, at once.
                  If all 32 threads are constantly competing for a piece of memory, this could be a problem.
                  Only if they are constantly updating the same piece of memory. If this is the case, I suggest you re-design your application so it doesn't have hot resources like this.
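
                  To illustrate the difference (the class and method names here are made up for the sketch):

                  import java.util.concurrent.atomic.AtomicLong;

                  public class CounterDemo {
                      static final AtomicLong shared = new AtomicLong();

                      // "Hot" version: every thread repeatedly updates the same location,
                      // so the cache line holding it bounces between CPUs.
                      static void hotLoop(final int iterations) {
                          for (int i = 0; i < iterations; i++) {
                              shared.incrementAndGet();
                          }
                      }

                      // Redesigned version: accumulate privately, publish once per thread.
                      static void coolLoop(final int iterations) {
                          long local = 0;
                          for (int i = 0; i < iterations; i++) {
                              local++;
                          }
                          shared.addAndGet(local);   // one contended update instead of millions
                      }
                  }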
                  • 6. Re: Thread CPU pinning on Multi socket systems
                    jtahlborn
                    user3055980 wrote:
                    I have profiled this very extensively on a quad-core desktop. I am set up as a server here, no X server. I have run pools of 8 threads & they are almost identical to 32 threads in run time.
                    so what did the profiling show as your bottleneck? i find it hard to believe that simply reading values from a shared object instance was your bottleneck (assuming there is no synchronization/volatile usage involved).

                    Edited by: jtahlborn on May 17, 2011 3:30 PM
                    • 7. Re: Thread CPU pinning on Multi socket systems
                      854422
                      The only "profiling" I have done on the Linux server so far is have top running while processing (just got it last week). It showed CPU usage at around 800%.

                      Back on the Windows quad-core desktop, I have run the NetBeans profiler many, many times. I usually ran the profiler with a thread pool of 1 and got each of the 7 consumer-producer stages running well. When the pool was set to 4, I got pretty much 4x, with CPU @ 400%. I tried doing all writes using stack vars, completely independent execution, and no synchronized calls.

                      This is the first build of the server, and I am sure there will be a number of tries. I was just now working on a minor mixed-case file problem that did not show itself till now, while thinking of my next move on this front. Can profiling be done remotely, or am I going to have to switch to a Linux desktop?

                      BTW, thanks for the redirection. Dead ends are not bad at all when you avoid them.
                      • 8. Re: Thread CPU pinning on Multi socket systems
                        854422
                        Ok, found the attachment Wizard in the Netbeans Profiler.
                        • 9. Re: Thread CPU pinning on Multi socket systems
                          800381
                          EJP wrote:
                          is there a way to get a copy on all CPUs
                          I do not understand. CPUs don't have memory, they have registers, caches, etc. They all share the same memory. So your question is meaningless.

                          Or are you talking about a distributed system here?
                          NUMA.

                          While all CPUs can access all memory, CPU-to-memory access time can be (highly) variable.

                          And AMD systems are NUMA. For example, running Solaris 11 on an AMD box shows this:
                          -bash-4.0$ lgrpinfo 
                          lgroup 0 (root):
                               Children: 1 2
                               CPUs: 0-3
                               Memory: installed 16G, allocated 15G, free 765M
                               Lgroup resources: 1 2 (CPU); 1 2 (memory)
                               Latency: 104
                          lgroup 1 (leaf):
                               Children: none, Parent: 0
                               CPUs: 0 1
                               Memory: installed 8.0G, allocated 7.6G, free 460M
                               Lgroup resources: 1 (CPU); 1 (memory)
                               Load: 0.179
                               Latency: 71
                          lgroup 2 (leaf):
                               Children: none, Parent: 0
                               CPUs: 2 3
                               Memory: installed 8.0G, allocated 7.7G, free 306M
                               Lgroup resources: 2 (CPU); 2 (memory)
                               Load: 0.569
                               Latency: 71
                          -bash-4.0$ 
                          On an Intel-based box, we get this:
                          -bash-3.2$ lgrpinfo 
                          lgroup 0 (root):
                                  Children: none
                                  CPUs: 0-7
                                  Memory: installed 16G, allocated 11G, free 4.8G
                                  Lgroup resources: 0 (CPU); 0 (memory)
                                  Load: 0.0147
                                  Latency: 0
                          -bash-3.2$ 
                          Now, AMD's architecture, while it is NUMA, doesn't really produce significant performance impacts, although if you really dig you can see them. Other architectures can produce huge variations, like Sun's E10K/E15K/E25K line: up to several hundred CPUs, with significant differences in CPU-to-memory access times.

                          Dealing with this is very OS-specific. And pinning threads to CPUs isn't enough - you have to pin the memory too, after you ensure your process actually gets physical memory located where you want it.
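
                          On Linux, for instance, the blunt way to do both at once for a whole JVM is something like this (assuming the numactl utility is installed; node 0 and server.jar are just placeholders):

                          numactl --cpunodebind=0 --membind=0 java -jar server.jar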

                          Unless you're really pushing hardware limits, and your investment in hardware is such that you can't just buy faster hardware for less than it would cost to dig into OS- and architecture-specific NUMA tuning, it's pretty much not worth doing except as an academic exercise.

                          Or even paying attention to.