
Oracle Solaris Tools for Locality Observability


by Giri Mandalika

Oracle Solaris provides a variety of tools and APIs to observe, diagnose, control, and even fix issues related to locality and latency. This article describes some of the tools and APIs that can be used to examine the locality of CPUs, memory and I/O devices.

Modern multisocket servers exhibit non-uniform memory access (NUMA) characteristics that might hurt application performance if ignored. On a NUMA system, all memory is shared among processors. Each processor has access to its own memory (local memory) as well as memory that is local to another processor (remote memory). However, the memory access time (latency) depends on the memory location relative to the processor. A processor can access its local memory faster than remote memory, and these varying memory latencies play a big role in application performance.

The Oracle Solaris operating system organizes the hardware resources—CPU, memory, and I/O devices—into one or more logical groups based on their proximity to each other in such a way that all the hardware resources in a group are considered local to that group. These groups are referred to as locality groups (lgroups) or NUMA nodes. In other words, an lgroup is an abstraction that tells what hardware resources are near each other on a NUMA system. Each lgroup has at least one processor and possibly some associated memory, I/O devices, or both, as shown in Figure 1.


Figure 1. Diagram of lgroups

To minimize the impact of NUMA, Oracle Solaris considers the lgroup-based physical topology when mapping threads and data to CPUs and memory. However, some applications might still suffer from NUMA effects because of software or hardware misconfiguration or for other reasons. Engineered systems such as Oracle SuperCluster go to great lengths in setting up environments to minimize the impact of NUMA so that applications perform as expected in a predictable manner. Nevertheless, application developers, system administrators, and application administrators need to take NUMA into account while developing and managing applications on large systems. The remainder of this article introduces, with brief explanations and examples, some of the Oracle Solaris tools and APIs that can be used to examine the locality of CPUs, memory, and I/O devices.

Note: Some of the information in this article is based on information found in lgroup-related cases from the Oracle Solaris Architecture Review Committee (ARC).

Locality Group Hierarchy

The lgrpinfo(1) command prints information about the lgroup hierarchy and its contents. It is useful for understanding the context in which the OS is trying to optimize applications for locality, and also for figuring out which CPUs are closer, how much memory is near them, and the relative latencies between the CPUs and different memory blocks.

Here is sample output from a SPARC server that has four processors and 1 TB of physical memory:

# lgrpinfo -a
lgroup 0 (root):
        Children: 1-4
        CPUs: 0-255
        Memory: installed 1024G, allocated 75G, free 948G
        Lgroup resources: 1-4 (CPU); 1-4 (memory)
        Latency: 18
lgroup 1 (leaf):
        Children: none, Parent: 0
        CPUs: 0-63
        Memory: installed 256G, allocated 18G, free 238G
        Lgroup resources: 1 (CPU); 1 (memory)
        Load: 0.0227
        Latency: 12
lgroup 2 (leaf):
        Children: none, Parent: 0
        CPUs: 64-127
        Memory: installed 256G, allocated 15G, free 241G
        Lgroup resources: 2 (CPU); 2 (memory)
        Load: 0.000153
        Latency: 12
lgroup 3 (leaf):
        Children: none, Parent: 0
        CPUs: 128-191
        Memory: installed 256G, allocated 20G, free 236G
        Lgroup resources: 3 (CPU); 3 (memory)
        Load: 0.016
        Latency: 12
lgroup 4 (leaf):
        Children: none, Parent: 0
        CPUs: 192-255
        Memory: installed 256G, allocated 23G, free 233G
        Lgroup resources: 4 (CPU); 4 (memory)
        Load: 0.00824
        Latency: 12

Lgroup latencies:

------------------
  |  0  1  2  3  4
------------------
0 | 18 18 18 18 18
1 | 18 12 18 18 18
2 | 18 18 12 18 18
3 | 18 18 18 12 18
4 | 18 18 18 18 12
------------------

CPU Locality

The lgrpinfo(1) command shown above provides CPU locality in a clear manner. Here is another way to retrieve the association between CPU IDs and locality groups.

# echo ::lgrp -p | mdb -k
  LGRPID  PSRSETID      LOAD      #CPU      CPUS
       1         0     17873        64      0-63
       2         0     17755        64      64-127
       3         0      2256        64      128-191
       4         0     18173        64      192-255

Memory Locality

The lgrpinfo(1) command displays the total memory that belongs to each of the locality groups. However, it does not show how that memory is divided into memory blocks or which locality group each block belongs to. The syslayout debugger command (dcmd) in the genunix debugger module (dmod) retrieves the association between memory blocks and locality groups from the physical memory layout of the system.

Note: To list the available debugger commands and print a brief description for each one, run echo ::dcmds | mdb -k in a shell. To list the debugger commands grouped by debugger module, run echo "::dmods -l" | mdb -k. Keep in mind that non-built-in debugger commands might not continue to work in future versions of Oracle Solaris.

Here is an example.

1. List memory blocks:

# ldm list-devices -a memory
MEMORY
    PA                   SIZE            BOUND
    0xa00000             32M             _sys_
    0x2a00000            96M             _sys_
    0x8a00000            374M            _sys_
    0x20000000           1048064M        primary

2. Print the physical memory layout of the system:

# echo ::syslayout | mdb -k
         STARTPA            ENDPA  SIZE  MG MN    STL    ETL
        20000000        200000000  7.5g   0  0      4     40
       200000000        400000000    8g   1  1    800    840
       400000000        600000000    8g   2  2   1000   1040
       600000000        800000000    8g   3  3   1800   1840
       800000000        a00000000    8g   0  0     40     80
       a00000000        c00000000    8g   1  1    840    880
       c00000000        e00000000    8g   2  2   1040   1080
       e00000000       1000000000    8g   3  3   1840   1880
      1000000000       1200000000    8g   0  0     80     c0
      1200000000       1400000000    8g   1  1    880    8c0
      1400000000       1600000000    8g   2  2   1080   10c0
      1600000000       1800000000    8g   3  3   1880   18c0
      ...
      ...

3. Use the ::mnode debugger command in the genunix debugger module to show the mapping of memory nodes to locality groups:

# echo ::mnode | mdb -k
            MNODE ID LGRP ASLEEP  UTOTAL UFREE UCACHE KTOTAL  KFREE KCACHE
      2075ad80000  0    1      -   249g   237g   114m   5.7g   714m      -
      2075ad802c0  1    2      -   240g   236g   288m    15g   4.8g      -
      2075ad80580  2    3      -   246g   234g   619m   9.6g   951m      -
      2075ad80840  3    4      -   247g   231g    24m     9g   897m      -

In the above example, the memory block at physical address 1600000000 is on memory node #3, which translates to locality group #4. The sample output was collected on a SPARC T4-4 server from Oracle, on which main memory was interleaved across all memory banks with an 8 GB interleave size: the first 8 GB chunk was populated in lgroup 1, closest to processor #1; the second 8 GB chunk in lgroup 2, closest to processor #2; the third 8 GB chunk in lgroup 3, closest to processor #3; the fourth 8 GB chunk in lgroup 4, closest to processor #4; the fifth 8 GB chunk again in lgroup 1, closest to processor #1; and so on. Memory is not interleaved on later SPARC systems that have Oracle's SPARC T5, M6, M7, or S7 processors.

I/O Locality

The -d option of the lgrpinfo(1) command accepts a path to an I/O device and returns the IDs of the lgroups closest to that device. Because each I/O device on the system can be connected to one or more NUMA nodes, it is not uncommon to see more than one lgroup ID returned by the lgrpinfo(1) command.

Here are some examples:

# lgrpinfo -d /dev/dsk/c1t0d0
lgroup ID : 1

# dladm show-phys | grep ixgbe0
net4              Ethernet             up         10000  full      ixgbe0

# lgrpinfo -d /dev/ixgbe0
lgroup ID : 1

# dladm show-phys | grep ibp0
net12             Infiniband           up         32000  unknown   ibp0

# lgrpinfo -d /dev/ibp0
lgroup IDs : 1-4

Logical groupings of I/O devices, resources (threads, interrupts, and so on), and objects (for example, device information pointers and DMA-mapped memory) that are based on relative affinity to their source can be examined with the ::numaio_group debugger command. These groups are known as NUMA I/O groups.

Here is an example:

# echo ::numaio_group | mdb -k
            ADDR GROUP_NAME              CONSTRAINT
    10050e1eba48 net4                    lgrp : 1
    10050e1ebbb0 net0                    lgrp : 1
    10050e1ebd18 usbecm2                 lgrp : 1
    10050e1ebe80 scsi_hba_ngrp_mpt_sas1  lgrp : 4
    10050e1ebef8 scsi_hba_ngrp_mpt_sas0  lgrp : 1
    ...
    ...

Alternatively, the prtconf(1M) command can be used to find the locality for an I/O device.

For example, here's how to find the device path for the network interface:

# dladm show-phys | grep ixgbe0
net4              Ethernet             up         10000  full      ixgbe0

# grep ixgbe /etc/path_to_inst | grep " 0 "
"/pci@400/pci@1/pci@0/pci@4/network@0" 0 "ixgbe"

And here's how to find the NUMA I/O lgroups for that device path:

# prtconf -v /devices/pci@400/pci@1/pci@0/pci@4/network@0
    ...
    Hardware properties:
    ...
    name='numaio-lgrps' type=int items=1
        value=00000001   <- lgroup 1
    ...

Application developers have no control over the grouping of objects into a NUMA I/O group at this time.

Resource Groups

In a resource group, hardware resources are grouped based on the underlying physical relationship between cores, memory, and I/O buses. On some SPARC hardware platforms and server configurations (such as Oracle's SPARC M7-8 server), a resource group maps directly to a locality group.

The list-rsrc-group subcommand of the Logical Domains Manager command-line interface for Oracle VM Server for SPARC, ldm(1M), shows a consolidated list of processor cores, memory blocks, and I/O devices that belong to each resource group. This subcommand is available in Oracle VM Server for SPARC 3.2 and later versions.

Here is an example:

# ldm ls-rsrc-group
NAME                                    CORE  MEMORY   IO
/SYS/CMIOU0                             32    480G     4
/SYS/CMIOU1                             32    480G     4

# ldm ls-rsrc-group -l /SYS/CMIOU0
NAME                                    CORE  MEMORY   IO
/SYS/CMIOU0                             32    480G     4

CORE
    CID                                             BOUND
    0, 1, 2, 3, 8, 9, 10, 11                        primary
    16, 17, 18, 19, 24, 25                          primary
    ...

MEMORY
    PA               SIZE             BOUND
    0x0              60M              _sys_
    0x3c00000        32M              _sys_
    0x5c00000        94M              _sys_
    0x4c000000       64M              _sys_
    0x50000000       15104M           primary
    0x400000000      128G             primary
    ...
    0x7400000000     16128M           primary
    0x77f0000000     64M              _sys_
    0x77f4000000     192M             _sys_

IO
    DEVICE           PSEUDONYM        BOUND
    pci@300          pci_0            primary
    pci@301          pci_1            primary
    pci@303          pci_3            primary
    pci@304          pci_4            primary

Process and Thread Locality

The Oracle Solaris kernel assigns a thread to an lgroup when the thread is created. That lgroup is called the thread/LWP's home lgroup. The kernel schedules and runs a thread on the CPUs that are in the thread's home lgroup, and it allocates memory from the same lgroup whenever possible.
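For example, a thread can look up its own home lgroup programmatically with the lgrp_home(3LGRP) function from liblgrp, which is covered in more detail in the "Locality Group API" section below. The following minimal sketch is illustrative only (the file name homelgrp.c is hypothetical and not part of the original article):

/* homelgrp.c -- minimal sketch: print the calling thread's home lgroup.
   Build with: cc -o homelgrp homelgrp.c -llgrp */
#include <stdio.h>
#include <sys/types.h>
#include <sys/procset.h>
#include <sys/lgrp_user.h>

int main(void) {
        /* P_LWPID together with P_MYID refers to the calling LWP */
        lgrp_id_t home = lgrp_home(P_LWPID, P_MYID);
        if (home == -1) {
                perror("lgrp_home");
                return 1;
        }
        printf("home lgroup of this thread: %d\n", (int)home);
        return 0;
}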

The -H option of the ps(1) command can be used to examine the home lgroup of all user processes and threads. The -h option can be used to list all processes that are in a certain locality group.

Here's an example of listing the home lgroup of all processes:

# ps -aeH
  PID  LGRP TTY         TIME CMD
    0     0 ?           0:11 sched
    1     4 ?          21:04 init
   11     3 ?           3:09 svc.star
 3322     1 ?         301:51 cssdagen
  ...
11155     3 ?           0:52 oracle
13091     4 ?           0:00 sshd
12812     2 pts/3       0:00 bash
  ...

The -H option of the prstat(1M) command shows the home lgroup of active user processes and threads, for example:

# prstat -H
   PID  USERNAME  SIZE   RSS STATE   PRI NICE      TIME  CPU LGRP PROCESS/NLWP
  1865 root      420M  414M sleep    59    0 447:51:13 0.1%    2 java/108
  1814 oracle    155M  110M sleep    59    0  70:45:17 0.0%    4 gipcd.bin/9
  3765 root      447M  413M sleep    59    0  29:24:20 0.0%    3 crsd.bin/43
 10825 oracle   1097M 1074M sleep    59    0  18:13:27 0.0%    3 oracle/1
  3941 root      210M  184M sleep    59    0  20:03:37 0.0%    4 orarootagent.bi/14
  1585 oracle    122M   91M sleep    59    0  18:06:34 0.0%    3 evmd.bin/10
  3918 oracle    168M  144M sleep    58    0  14:35:31 0.0%    1 oraagent.bin/28
   ...

The plgrp(1) command shows the placement of threads among locality groups. The same command can also be used to set the home locality group and lgroup affinities for one or more processes, threads, or LWPs.

Let's examine the home lgroup of some of the LWPs in the process with process identifier (PID) 1865:

# plgrp 1865
     PID/LWPID     HOME
     1865/1        2
     1865/2        2
     ...
     1865/22       4
     1865/23       4
     ...
     1865/41       1
     1865/42       1
     ...
     1865/60       3
     1865/61       3
     ...

# plgrp 1865 | awk '{print $2}' | grep 2 | wc -l
      30

# plgrp 1865 | awk '{print $2}' | grep 1 | wc -l
      25

# plgrp 1865 | awk '{print $2}' | grep 3 | wc -l
      25

# plgrp 1865 | awk '{print $2}' | grep 4 | wc -l
      28

Let's reset the home lgroup of all LWPs in PID 1865 to 4:

# plgrp -H 4 1865
     PID/LWPID    HOME
     1865/1        2 => 4
     1865/2        2 => 4
     ...
     1865/41       1 => 4
     ...
     1865/60       3 => 4

# plgrp 1865 | awk '{print $2}' | egrep "1|2|3" | wc -l
       0

# plgrp 1865 | awk '{print $2}' | grep 4 | wc -l
     108

# prstat -H -p 1865
   PID  USERNAME  SIZE   RSS STATE   PRI NICE      TIME  CPU LGRP PROCESS/NLWP
  1865 root      420M  414M sleep    59    0 447:57:30 0.1%    4 java/108
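The programmatic counterpart of plgrp is the lgroup affinity interface in liblgrp. As a minimal, illustrative sketch (not from the original article; the file name setaff.c is hypothetical), the following code uses lgrp_affinity_set(3LGRP) to request a strong affinity between the calling thread and an lgroup given on the command line, which may cause the thread to be re-homed to that lgroup, and then prints the resulting home lgroup:

/* setaff.c -- minimal sketch: set a strong lgroup affinity for the calling
   thread, then print its (possibly new) home lgroup.
   Build with: cc -o setaff setaff.c -llgrp */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/procset.h>
#include <sys/lgrp_user.h>

int main(int argc, char **argv) {
        if (argc != 2) {
                fprintf(stderr, "Usage: %s <lgroup-id>\n", argv[0]);
                exit(1);
        }
        lgrp_id_t target = (lgrp_id_t)atoi(argv[1]);

        /* Request a strong affinity between the calling LWP and the lgroup */
        if (lgrp_affinity_set(P_LWPID, P_MYID, target, LGRP_AFF_STRONG) == -1) {
                perror("lgrp_affinity_set");
                exit(1);
        }
        printf("home lgroup is now: %d\n", (int)lgrp_home(P_LWPID, P_MYID));
        return 0;
}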

Memory Placement

The -L option of the pmap(1) command shows the lgroup that contains the physical memory backing some virtual memory.

Here's an example of determining the lgroups that shared memory segments are allocated from:

# pmap -Ls 24513 | egrep "Lgrp|256M|2G"
         Address       Bytes Pgsz Mode   Lgrp Mapped File
0000000400000000   33554432K   2G rwxs-    1   [ osm shmid=0x78000047 ]
0000000C00000000     262144K 256M rwxs-    3   [ osm shmid=0x78000048 ]
0000000C10000000     524288K 256M rwxs-    2   [ osm shmid=0x78000048 ]
   ...

Memory placement among lgroups can be influenced by supplying the MADV_ACCESS flags to the pmadvise(1) command at runtime, or to the madvise(3C) function during development. These flags advise the kernel's virtual memory manager that a region of user virtual memory is expected to follow a particular pattern of use, and the OS might use the hint to determine how to allocate memory for the specified range. This mechanism is beneficial when administrators and developers understand the target application's data access patterns.

Here's an example of applying the MADV_ACCESS_MANY policy advice to a segment at a specific address:

# pmap -Ls 27980 | grep anon
FFFFFFFF78C70000       3648K  64K rw-----    2   [ anon ]
FFFFFFFF79000000      53248K   4M rw-----    2   [ anon ]
FFFFFFFF7C400000       3136K  64K rw-----    2   [ anon ]
   ...

# pmadvise -o FFFFFFFF79000000=access_many -v 27980 | grep anon
FFFFFFFF78C70000       3648K rw-----    [ anon ]
FFFFFFFF79000000      53248K rw-----    [ anon ]         <= access_many
FFFFFFFF7C400000       3136K rw-----    [ anon ]
   ...

Here's a sample code block demonstrating the madvise(3C) call that hints to the kernel that addresses in this memory region are likely to be accessed only once:

#include <sys/mman.h>
...
int sz, fd;
...
char *content = mmap((caddr_t)0, sz, PROT_READ, MAP_PRIVATE, fd, 0);
if ( madvise(content, 2048, MADV_SEQUENTIAL) < 0 ) {
     perror("madvise");
}
...
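For lgroup-aware placement specifically, the MADV_ACCESS flags mentioned earlier can be applied in the same way. The next sketch is illustrative only (it is not from the original article): it maps an anonymous segment and uses MADV_ACCESS_LWP to advise the kernel that the next LWP to touch the range will access it heavily, so the kernel may place the backing pages near that thread:

/* access_lwp.c -- minimal sketch of MADV_ACCESS_LWP (illustrative only) */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void) {
        size_t sz = 64 * 1024 * 1024;   /* 64 MB anonymous segment */

        char *buf = mmap(NULL, sz, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANON, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                exit(1);
        }

        /* Advise the VM system that the next LWP to touch this range will
           access it most heavily, so memory should be placed accordingly. */
        if (madvise((caddr_t)buf, sz, MADV_ACCESS_LWP) < 0) {
                perror("madvise(MADV_ACCESS_LWP)");
        }

        /* ... the worker thread that owns this buffer would touch it here ... */

        munmap((caddr_t)buf, sz);
        return 0;
}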

Locality Group API

As of this writing, Oracle Solaris has limited support for lgroup observability using programmatic interfaces. Applications can use the lgroup API to traverse the locality group hierarchy, discover the contents of each locality group, and even affect thread and memory placement on desired lgroups.

The man page for liblgrp(3LIB) lists the currently supported public interfaces, and a brief description for most of those interfaces can be obtained by running man -k lgroup in a shell. In order to use this API, applications must link with the locality group library, liblgrp(3LIB).
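As a warm-up before the complete example below, here is a minimal, illustrative sketch (not from the original article; the file name lgrpwalk.c is hypothetical) that walks the lgroup hierarchy from the root with lgrp_children(3LGRP) and prints the CPU count and installed memory reported by lgrp_cpus(3LGRP) and lgrp_mem_size(3LGRP) for each lgroup:

/* lgrpwalk.c -- minimal sketch of lgroup hierarchy traversal.
   Build with: cc -o lgrpwalk lgrpwalk.c -llgrp */
#include <stdio.h>
#include <stdlib.h>
#include <sys/lgrp_user.h>

static void walk(lgrp_cookie_t cookie, lgrp_id_t lgrp) {
        /* CPUs and memory directly contained in this lgroup */
        int ncpu = lgrp_cpus(cookie, lgrp, NULL, 0, LGRP_CONTENT_DIRECT);
        lgrp_mem_size_t mem = lgrp_mem_size(cookie, lgrp,
            LGRP_MEM_SZ_INSTALLED, LGRP_CONTENT_DIRECT);

        printf("lgroup %d: %d CPUs, %lld MB installed\n",
            (int)lgrp, ncpu, (long long)(mem >> 20));

        /* Recurse into the children, if any */
        int nchildren = lgrp_children(cookie, lgrp, NULL, 0);
        if (nchildren <= 0)
                return;
        lgrp_id_t *children = calloc(nchildren, sizeof (lgrp_id_t));
        if (children == NULL)
                return;
        lgrp_children(cookie, lgrp, children, nchildren);
        for (int i = 0; i < nchildren; i++)
                walk(cookie, children[i]);
        free(children);
}

int main(void) {
        lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_OS);
        if (cookie == LGRP_COOKIE_NONE) {
                perror("lgrp_init");
                exit(1);
        }
        walk(cookie, lgrp_root(cookie));
        lgrp_fini(cookie);
        return 0;
}

A more complete example that ties several of these calls together follows.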

The following sample code demonstrates lgroup API usage by making several lgrp_*() calls, including lgrp_device_lgrps(), to find the locality groups that are closest to the specified I/O device.

# cat -n iodevlgrp.c
    1  #include <stdio.h>
    2  #include <stdlib.h>
    3  #include <assert.h>
    4  #include <sys/lgrp_user.h>
    5  #include <sys/types.h>
    6
    7  int main(int argc, char **argv) {
    8
    9          if (argc != 2) {
   10                  fprintf(stderr, "Usage: %s <iodevice>\n", argv[0]);
   11                  exit(1);
   12          }
   13
   14          /* lgroup interface version check */
   15          if (lgrp_version(LGRP_VER_CURRENT) != LGRP_VER_CURRENT) {
   16                  fprintf(stderr, "\nBuilt with unsupported lgroup interface %d",
   17                      LGRP_VER_CURRENT);
   18                  exit(1);
   19          }
   20
   21          lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_OS);
   22          lgrp_id_t node = lgrp_root(cookie);
   23
   24          /* refresh the cookie if stale */
   25          if (lgrp_cookie_stale(cookie)) {
   26                  lgrp_fini(cookie);
   27                  cookie = lgrp_init(LGRP_VIEW_OS);
   28          }
   29
   30          int nlgrps = lgrp_nlgrps(cookie);
   31          if (nlgrps == -1) {
   32                  perror("\n lgrp_nlgrps");
   33                  lgrp_fini(cookie);
   34                  exit(1);
   35          }
   36          printf("Number of locality groups on the system: %d\n", (nlgrps-1));
   37
   38          /* lgroups closest to the target device */
   39          int numlgrps = lgrp_device_lgrps(argv[1], NULL, 0);
   40          if (numlgrps == -1) {
   41                  fprintf(stderr, "I/O device: %s. ", argv[1]);
   42                  perror("lgrp_device_lgrps");
   43          } else {
   44                  printf("lgroups closest to the I/O device %s: ", argv[1]);
   45                  lgrp_id_t *lgrpids = (lgrp_id_t *)calloc(numlgrps, sizeof(lgrp_id_t));
   46                  assert(lgrpids != NULL);
   47                  lgrp_device_lgrps(argv[1], lgrpids, numlgrps);
   48                  for (int i = 0; i < numlgrps; ++i) {
   49                          printf(" %d ", lgrpids[i]);
   50                  }
   51                  free(lgrpids);
   52          }
   53          lgrp_fini(cookie);
   54          printf("\n");
   55          return 0;
   56
   57  }

% cc -o iodevlgrp -llgrp iodevlgrp.c

% ./iodevlgrp /dev/ixgbe0
Number of locality groups on the system: 2
lgroups closest to the I/O device /dev/ixgbe0:  1

% lgrpinfo -d /dev/ixgbe0
lgroup ID : 1

% ./iodevlgrp /dev/ixgbe1
Number of locality groups on the system: 2
lgroups closest to the I/O device /dev/ixgbe1:  2

% lgrpinfo -d /dev/ixgbe1
lgroup ID : 2


About the Author

Giri Mandalika is a principal software engineer in Oracle's Hardware Systems organization. Currently, Giri is focused on designing and implementing software for SPARC processor–based engineered systems. Giri manages a blog that focuses on Oracle Solaris, Oracle Database, and other topics at https://blogs.oracle.com/mandalika.
