Oracle Solaris Tools for Locality Observability

Version 1

    by Giri Mandalika

     

    Oracle Solaris provides a variety of tools and APIs to observe, diagnose, control, and even fix issues related to locality and latency. This article describes some of the tools and APIs that can be used to examine the locality of CPUs, memory, and I/O devices.

     

     

    Modern multisocket servers exhibit non-uniform memory access (NUMA) characteristics that might hurt application performance if ignored. On a NUMA system, all memory is shared among processors. Each processor has access to its own memory (local memory) as well as memory that is local to another processor (remote memory). However, the memory access time (latency) depends on the memory location relative to the processor. A processor can access its local memory faster than remote memory, and these varying memory latencies play a big role in application performance.

     

    The Oracle Solaris operating system organizes the hardware resources—CPU, memory, and I/O devices—into one or more logical groups based on their proximity to each other in such a way that all the hardware resources in a group are considered local to that group. These groups are referred to as locality groups (lgroups) or NUMA nodes. In other words, an lgroup is an abstraction that tells what hardware resources are near each other on a NUMA system. Each lgroup has at least one processor and possibly some associated memory, I/O devices, or both, as shown in Figure 1.

     

    [Figure 1 image: loc-groups-schem.png]

    Figure 1. Diagram of lgroups

     

    To minimize the impact of NUMA, Oracle Solaris considers the lgroup-based physical topology when mapping threads and data to CPUs and memory. However, some applications might still suffer NUMA-related performance problems due to misconfiguration of software or hardware, or for other reasons. Engineered systems such as Oracle SuperCluster go to great lengths to set up environments that minimize the impact of NUMA so that applications perform as expected in a predictable manner. Nevertheless, application developers, system administrators, and application administrators need to take NUMA into account while developing and managing applications on large systems. The remainder of this article introduces, with brief explanations and examples, some of the tools and APIs that can be used to examine the locality of CPUs, memory, and I/O devices.

     

    Note: Some of the information in this article is based on information found in lgroup-related cases from the Oracle Solaris Architecture Review Committee (ARC).

     

    Locality Group Hierarchy

     

    The lgrpinfo(1) command prints information about the lgroup hierarchy and its contents. It is useful for understanding the context in which the OS is trying to optimize applications for locality, and also for figuring out which CPUs are closer, how much memory is near them, and the relative latencies between the CPUs and different memory blocks.

     

    Here is sample output from a SPARC server that has four processors and 1 TB of physical memory:

     

    # lgrpinfo -a
    lgroup 0 (root):
             Children: 1-4
             CPUs: 0-255
             Memory: installed 1024G, allocated 75G, free 948G
             Lgroup resources: 1-4 (CPU); 1-4 (memory)
             Latency: 18
    lgroup 1 (leaf):
             Children: none, Parent: 0
             CPUs: 0-63
             Memory: installed 256G, allocated 18G, free 238G
             Lgroup resources: 1 (CPU); 1 (memory)
             Load: 0.0227
             Latency: 12
    lgroup 2 (leaf):
             Children: none, Parent: 0
             CPUs: 64-127
             Memory: installed 256G, allocated 15G, free 241G
             Lgroup resources: 2 (CPU); 2 (memory)
             Load: 0.000153
             Latency: 12
    lgroup 3 (leaf):
             Children: none, Parent: 0
             CPUs: 128-191
             Memory: installed 256G, allocated 20G, free 236G
             Lgroup resources: 3 (CPU); 3 (memory)
             Load: 0.016
             Latency: 12
    lgroup 4 (leaf):
             Children: none, Parent: 0
             CPUs: 192-255
             Memory: installed 256G, allocated 23G, free 233G
             Lgroup resources: 4 (CPU); 4 (memory)
             Load: 0.00824
             Latency: 12 
    Lgroup latencies: 

    ------------------
      |  0  1  2  3  4
    ------------------
    0 | 18 18 18 18 18
    1 | 18 12 18 18 18
    2 | 18 18 12 18 18
    3 | 18 18 18 12 18
    4 | 18 18 18 18 12
    ------------------

     

    CPU Locality

     

    The lgrpinfo(1) command shown above provides CPU locality in a clear manner. Here is another way to retrieve the association between CPU IDs and locality groups.

     

    # echo ::lgrp -p | mdb -k

       LGRPID  PSRSETID      LOAD      #CPU      CPUS
            1         0     17873        64      0-63
            2         0     17755        64      64-127
            3         0      2256        64      128-191
            4         0     18173        64      192-255

     

    Memory Locality

     

    The lgrpinfo(1) command displays the total memory that belongs to each of the locality groups. However, the same command doesn't show the breakdown of memory into memory blocks and their association with locality groups. The syslayout debugger command (dcmd) in the genunix debugger module (dmod) helps in retrieving the association between a memory block and a locality group from the physical memory layout of the system.

     

    Note: To list the available debugger commands and print a brief description for each one, run echo ::dcmds | mdb -k in a shell. To list the debugger commands grouped by the debugger module, run echo "::dmods -l" | mdb -k. Keep in mind that non-built-in debugger commands might not continue to work on future versions of Oracle Solaris.

     

    Here is an example.

     

    1. List memory blocks:

     

    # ldm list-devices -a memory

    MEMORY
          PA                   SIZE            BOUND
          0xa00000             32M             _sys_
          0x2a00000            96M             _sys_
          0x8a00000            374M            _sys_
          0x20000000           1048064M        primary

    2. Print the physical memory layout of the system:

     

    # echo ::syslayout | mdb -k

              STARTPA            ENDPA  SIZE  MG MN    STL    ETL
             20000000        200000000  7.5g   0  0      4     40
            200000000        400000000    8g   1  1    800    840
            400000000        600000000    8g   2  2   1000   1040
            600000000        800000000    8g   3  3   1800   1840
            800000000        a00000000    8g   0  0     40     80
            a00000000        c00000000    8g   1  1    840    880
            c00000000        e00000000    8g   2  2   1040   1080
            e00000000       1000000000    8g   3  3   1840   1880
           1000000000       1200000000    8g   0  0     80     c0
           1200000000       1400000000    8g   1  1    880    8c0
           1400000000       1600000000    8g   2  2   1080   10c0
           1600000000       1800000000    8g   3  3   1880   18c0
           ...
           ...

     

    3. Use the ::mnode debugger command in the genunix debugger module to show the mapping of memory nodes to locality groups:

     

    # echo ::mnode | mdb -k
     
                MNODE ID LGRP ASLEEP  UTOTAL UFREE UCACHE KTOTAL  KFREE KCACHE
          2075ad80000  0    1      -   249g   237g   114m   5.7g   714m      -
          2075ad802c0  1    2      -   240g   236g   288m    15g   4.8g      -
          2075ad80580  2    3      -   246g   234g   619m   9.6g   951m      -
          2075ad80840  3    4      -   247g   231g    24m     9g   897m      -

     

    In the above example, the memory block at physical address 1600000000 is on memory node #3, which translates to locality group #4. The sample output was collected on a SPARC T4-4 server from Oracle on which main memory was interleaved across all memory banks with an 8 GB interleave size. That is, the first 8 GB chunk was populated in lgroup 1, closer to processor #1; the second 8 GB chunk in lgroup 2, closer to processor #2; the third 8 GB chunk in lgroup 3, closer to processor #3; the fourth 8 GB chunk in lgroup 4, closer to processor #4; the fifth 8 GB chunk again in lgroup 1, closer to processor #1; and so on. Memory is not interleaved on later SPARC systems that have Oracle's SPARC T5, M6, M7, or S7 processors.

     

    I/O Locality

     

    The -d option of the lgrpinfo(1) command accepts a path to an I/O device and returns the IDs of the lgroups closest to that device. Each I/O device on the system can be connected to one or more NUMA nodes, so it is not uncommon for lgrpinfo(1) to return more than one lgroup ID.

     

    Here are some examples:

     

    # lgrpinfo -d /dev/dsk/c1t0d0
    lgroup ID : 1

    # dladm show-phys | grep ixgbe0
    net4              Ethernet             up         10000  full      ixgbe0 

    # lgrpinfo -d /dev/ixgbe0
    lgroup ID : 1 

    # dladm show-phys | grep ibp0
    net12             Infiniband           up         32000  unknown   ibp0 

    # lgrpinfo -d /dev/ibp0
    lgroup IDs : 1-4

     

    The ::numaio_group debugger command can be used to examine the logical groupings of I/O devices, resources (threads, interrupts, and so on), and objects (for example, device information pointers and DMA-mapped memory) that are formed based on their relative affinity. These groups are known as NUMA I/O groups.

     

    Here is an example:

     

    # echo ::numaio_group | mdb -k
                 ADDR GROUP_NAME              CONSTRAINT
         10050e1eba48 net4                    lgrp : 1
         10050e1ebbb0 net0                    lgrp : 1
         10050e1ebd18 usbecm2                 lgrp : 1
         10050e1ebe80 scsi_hba_ngrp_mpt_sas1  lgrp : 4
         10050e1ebef8 scsi_hba_ngrp_mpt_sas0  lgrp : 1
         ...
         ...

     

    Alternatively, the prtconf(1M) command can be used to find the locality for an I/O device.

     

    For example, here's how to find the device path for the network interface:

     

    # dladm show-phys | grep ixgbe0
    net4              Ethernet             up         10000  full      ixgbe0 

    # grep ixgbe /etc/path_to_inst | grep " 0 "
    "/pci@400/pci@1/pci@0/pci@4/network@0" 0 "ixgbe"

     

    And here's how to find the NUMA I/O Lgroups:

     

    # prtconf -v /devices/pci@400/pci@1/pci@0/pci@4/network@0
        ...
         Hardware properties:
         ...
         name='numaio-lgrps' type=int items=1
                 value=00000001   <-lgroup 1
         ...

     

    Application developers have no control over the grouping of objects into a NUMA I/O group at this time.

     

    Resource Groups

     

    In a resource group, hardware resources are grouped based on the underlying physical relationship between cores, memory, and I/O buses. On some SPARC hardware platforms, certain server configurations (such as Oracle's SPARC M7-8 server) might have resource groups that map directly to locality groups.

     

    The list-rsrc-group subcommand of the Logical Domains Manager command-line interface for Oracle VM Server for SPARC, ldm(1M), shows a consolidated list of processor cores, memory blocks, and I/O devices that belong to each resource group. This subcommand is available in Oracle VM Server for SPARC 3.2 and later versions.

     

    Here is an example:

     

    # ldm ls-rsrc-group
    NAME                                    CORE  MEMORY   IO
    /SYS/CMIOU0                             32    480G     4
    /SYS/CMIOU1                             32    480G     4 

    # ldm ls-rsrc-group -l /SYS/CMIOU0
    NAME                                    CORE  MEMORY   IO
    /SYS/CMIOU0                             32    480G     4 

    CORE
         CID                                             BOUND
         0, 1, 2, 3, 8, 9, 10, 11                        primary
         16, 17, 18, 19, 24, 25                          primary
         ... 

    MEMORY
         PA               SIZE             BOUND
         0x0              60M              _sys_
         0x3c00000        32M              _sys_
         0x5c00000        94M              _sys_
         0x4c000000       64M              _sys_
         0x50000000       15104M           primary
         0x400000000      128G             primary
         ...
         0x7400000000     16128M           primary
         0x77f0000000     64M              _sys_
         0x77f4000000     192M             _sys_

    IO
         DEVICE           PSEUDONYM        BOUND
         pci@300          pci_0            primary
         pci@301          pci_1            primary
         pci@303          pci_3            primary
         pci@304          pci_4            primary

     

    Process and Thread Locality

     

    The Oracle Solaris kernel assigns a thread to an lgroup when the thread is created. That lgroup is called the thread/LWP's home lgroup. The kernel schedules and runs a thread on the CPUs that are in the thread's home lgroup, and it allocates memory from the same lgroup whenever possible.

     

    The -H option of the ps(1) command can be used to examine the home lgroup of all user processes and threads. The -h option can be used to list all processes that are in a certain locality group.

     

    Here's an example of listing the home lgroup of all processes:

     

    # ps -aeH
       PID LGRP TTY         TIME CMD
         0    0 ?           0:11 sched
         1    4 ?          21:04 init
        11    3 ?           3:09 svc.star
      3322    1 ?         301:51 cssdagen
       ...
     11155    3 ?           0:52 oracle
     13091    4 ?           0:00 sshd
     12812    2 pts/3       0:00 bash
       ...

     

    The -H option of the prstat(1M) command shows the home lgroup of active user processes and threads, for example:

     

    # prstat -H

       PID  USERNAME  SIZE   RSS STATE   PRI NICE      TIME CPU  LGRP PROCESS/NLWP
       1865 root      420M  414M sleep    59    0 447:51:13 0.1%    2 java/108
       1814 oracle    155M  110M sleep    59    0  70:45:17 0.0%    4 gipcd.bin/9
       3765 root      447M  413M sleep    59    0  29:24:20 0.0%    3 crsd.bin/43
      10825 oracle   1097M 1074M sleep    59    0  18:13:27 0.0%    3 oracle/1
       3941 root      210M  184M sleep    59    0  20:03:37 0.0%    4 orarootagent.bi/14
       1585 oracle    122M   91M sleep    59    0  18:06:34 0.0%    3 evmd.bin/10
       3918 oracle    168M  144M sleep    58    0  14:35:31 0.0%    1 oraagent.bin/28
        ...

     

    The plgrp(1) command shows the placement of threads among locality groups. The same command can also be used to set the home locality group and lgroup affinities for one or more processes, threads, or LWPs.

     

    Let's examine the home lgroup of some of the LWPs in the process with process identifier (PID) 1865:

     

    # plgrp 1865

         PID/LWPID     HOME
         1865/1        2
         1865/2        2
         ...
         1865/22       4
         1865/23       4
        ...
         1865/41       1
         1865/42       1
         ...
         1865/60       3
         1865/61       3
         ...

    # plgrp 1865 | awk '{print $2}' | grep 2 | wc -l
          30 

    # plgrp 1865 | awk '{print $2}' | grep 1 | wc -l
          25

    # plgrp 1865 | awk '{print $2}' | grep 3 | wc -l
          25 

    # plgrp 1865 | awk '{print $2}' | grep 4 | wc -l
          28

     

    Let's reset the home lgroup of all LWPs in PID 1865 to 4:

     

    # plgrp -H 4 1865

         PID/LWPID    HOME
         1865/1        2 => 4
         1865/2        2 => 4
         ...
         1865/41       1 => 4
         ...
         1865/60       3 => 4

    # plgrp 1865 | awk '{print $2}' | egrep "1|2|3" | wc -l
           0 

    # plgrp 1865 | awk '{print $2}' | grep 4 | wc -l
         108 

    # prstat -H -p 1865
       PID  USERNAME  SIZE   RSS STATE   PRI NICE      TIME  CPU LGRP PROCESS/NLWP
       1865 root      420M  414M sleep    59    0 447:57:30 0.1%    4 java/108

     

    Memory Placement

     

    The -L option of the pmap(1) command shows the lgroup that contains the physical memory backing some virtual memory.

     

    Here's an example of determining the lgroups that shared memory segments are allocated from:

     

    # pmap -Ls 24513 | egrep "Lgrp|256M|2G"

             Address       Bytes Pgsz Mode   Lgrp Mapped File
    0000000400000000   33554432K   2G rwxs-    1   [ osm shmid=0x78000047 ]
    0000000C00000000     262144K 256M rwxs-    3   [ osm shmid=0x78000048 ]
    0000000C10000000     524288K 256M rwxs-    2   [ osm shmid=0x78000048 ]
        ...

     

    Memory placement among lgroups can possibly be influenced by passing the MADV_ACCESS flags to the pmadvise(1) command at runtime, or to the madvise(3C) function during development. These flags advise the kernel's virtual memory manager that a region of user virtual memory is expected to follow a particular pattern of use, and the OS might use this hint to determine how to allocate memory for the specified range. This mechanism is beneficial when administrators and developers understand the target application's data access patterns.

     

    Here's an example of applying the MADV_ACCESS_MANY policy advice to a segment at a specific address:

     

    # pmap -Ls 27980 | grep anon
    FFFFFFFF78C70000       3648K  64K rw-----    2   [ anon ]
    FFFFFFFF79000000      53248K   4M rw-----    2   [ anon ]
    FFFFFFFF7C400000       3136K  64K rw-----    2   [ anon ]
        ... 

    # pmadvise -o FFFFFFFF79000000=access_many -v 27980 | grep anon
    FFFFFFFF78C70000       3648K rw-----    [ anon ]
    FFFFFFFF79000000      53248K rw-----    [ anon ]          <= access_many
    FFFFFFFF7C400000       3136K rw-----    [ anon ]
        ...

     

    Here's a sample code block demonstrating a madvise(3C) call that hints to the kernel that addresses in this memory region are likely to be accessed sequentially, and only once:

     

    #include <sys/mman.h>
    ...
    int fd;
    size_t sz;
    ...
    char *content = mmap(NULL, sz, PROT_READ, MAP_PRIVATE, fd, 0);
    if (content == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    if (madvise(content, sz, MADV_SEQUENTIAL) < 0) {
        perror("madvise");
    }
    ...

     

    Locality Group API

     

    As of this writing, Oracle Solaris has limited support for lgroup observability using programmatic interfaces. Applications can use the lgroup API to traverse the locality group hierarchy, discover the contents of each locality group, and even affect thread and memory placement on desired lgroups.

     

    The man page for liblgrp(3LIB) lists the currently supported public interfaces, and a brief description for most of those interfaces can be obtained by running man -k lgroup in a shell. In order to use this API, applications must link with the locality group library, liblgrp(3LIB).

     

    The following sample code demonstrates lgroup API usage by making several lgrp_*() calls, including lgrp_device_lgrps(), to find the locality groups that are closest to the specified I/O device.

     

    # cat iodevlgrp.c

    #include <stdio.h>
    #include <stdlib.h>
    #include <assert.h>
    #include <sys/lgrp_user.h>
    #include <sys/types.h>

    int main(int argc, char **argv) {

            if (argc != 2) {
                    fprintf(stderr, "Usage: %s <iodevice>\n", argv[0]);
                    exit(1);
            }

            /* lgroup interface version check */
            if (lgrp_version(LGRP_VER_CURRENT) != LGRP_VER_CURRENT) {
                    fprintf(stderr, "\nBuilt with unsupported lgroup interface %d",
                        LGRP_VER_CURRENT);
                    exit(1);
            }

            lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_OS);

            /* refresh the cookie if stale */
            if (lgrp_cookie_stale(cookie)) {
                    lgrp_fini(cookie);
                    cookie = lgrp_init(LGRP_VIEW_OS);
            }

            int nlgrps = lgrp_nlgrps(cookie);
            if (nlgrps == -1) {
                    perror("\n lgrp_nlgrps");
                    lgrp_fini(cookie);
                    exit(1);
            }
            /* nlgrps includes the root lgroup, so subtract it */
            printf("Number of locality groups on the system: %d\n", nlgrps - 1);

            /* lgroups closest to the target device */
            int numlgrps = lgrp_device_lgrps(argv[1], NULL, 0);
            if (numlgrps == -1) {
                    fprintf(stderr, "I/O device: %s. ", argv[1]);
                    perror("lgrp_device_lgrps");
            } else {
                    printf("lgroups closest to the I/O device %s: ", argv[1]);
                    lgrp_id_t *lgrpids = calloc(numlgrps, sizeof (lgrp_id_t));
                    assert(lgrpids != NULL);
                    lgrp_device_lgrps(argv[1], lgrpids, numlgrps);
                    for (int i = 0; i < numlgrps; ++i) {
                            printf(" %d ", (int)lgrpids[i]);
                    }
                    free(lgrpids);
            }
            lgrp_fini(cookie);
            printf("\n");
            return 0;
    }

    % cc -o iodevlgrp iodevlgrp.c -llgrp

    % ./iodevlgrp /dev/ixgbe0
    Number of locality groups on the system: 2
    lgroups closest to the I/O device /dev/ixgbe0:  1 

    % lgrpinfo -d /dev/ixgbe0
    lgroup ID : 1 

    % ./iodevlgrp /dev/ixgbe1
    Number of locality groups on the system: 2
    lgroups closest to the I/O device /dev/ixgbe1:  2 

    % lgrpinfo -d /dev/ixgbe1
    lgroup ID : 2

     


    About the Author

     

    Giri Mandalika is a principal software engineer in Oracle's Hardware Systems organization. Currently, Giri is focused on designing and implementing software for SPARC processor–based engineered systems. Giri manages a blog that focuses on Oracle Solaris, Oracle Database, and other topics at https://blogs.oracle.com/mandalika.

     
