by Giri Mandalika
Oracle Solaris provides a variety of tools and APIs to observe, diagnose, control, and even fix issues related to locality and latency. This article describes some of the tools and APIs that can be used to examine the locality of CPUs, memory and I/O devices.
Modern multisocket servers exhibit non-uniform memory access (NUMA) characteristics that might hurt application performance if ignored. On a NUMA system, all memory is shared among processors. Each processor has access to its own memory (local memory) as well as memory that is local to another processor (remote memory). However, the memory access time (latency) depends on the memory location relative to the processor. A processor can access its local memory faster than remote memory, and these varying memory latencies play a big role in application performance.
The Oracle Solaris operating system organizes the hardware resources—CPU, memory, and I/O devices—into one or more logical groups based on their proximity to each other in such a way that all the hardware resources in a group are considered local to that group. These groups are referred to as locality groups (lgroups) or NUMA nodes. In other words, an lgroup is an abstraction that tells what hardware resources are near each other on a NUMA system. Each lgroup has at least one processor and possibly some associated memory, I/O devices, or both, as shown in Figure 1.

Figure 1. Diagram of lgroups
To minimize the impact of NUMA, Oracle Solaris considers the lgroup-based physical topology when mapping threads and data to CPUs and memory. However, some applications might still suffer from NUMA effects because of software or hardware misconfiguration or for some other reason. Engineered systems such as Oracle SuperCluster go to great lengths in setting up environments that minimize the impact of NUMA so applications perform as expected in a predictable manner. Nevertheless, application developers, system administrators, and application administrators need to take NUMA into account while developing and managing applications on large systems. Oracle Solaris provides a variety of tools and APIs to observe, diagnose, control, and even fix issues related to locality and latency. Some of the tools and APIs that can be used to examine the locality of CPUs, memory, and I/O devices are introduced with brief explanations and examples in the remainder of this article.
Note: Some of the information in this article is based on information found in lgroup-related cases from the Oracle Solaris Architecture Review Committee (ARC).
Locality Group Hierarchy
The lgrpinfo(1) command prints information about the lgroup hierarchy and its contents. It is useful for understanding the context in which the OS is trying to optimize applications for locality, and also for figuring out which CPUs are closer, how much memory is near them, and the relative latencies between the CPUs and different memory blocks.
Here is sample output from a SPARC server that has four processors and 1 TB of physical memory:
# lgrpinfo -a
lgroup 0 (root):
        Children: 1-4
        CPUs: 0-255
        Memory: installed 1024G, allocated 75G, free 948G
        Lgroup resources: 1-4 (CPU); 1-4 (memory)
        Latency: 18
lgroup 1 (leaf):
        Children: none, Parent: 0
        CPUs: 0-63
        Memory: installed 256G, allocated 18G, free 238G
        Lgroup resources: 1 (CPU); 1 (memory)
        Load: 0.0227
        Latency: 12
lgroup 2 (leaf):
        Children: none, Parent: 0
        CPUs: 64-127
        Memory: installed 256G, allocated 15G, free 241G
        Lgroup resources: 2 (CPU); 2 (memory)
        Load: 0.000153
        Latency: 12
lgroup 3 (leaf):
        Children: none, Parent: 0
        CPUs: 128-191
        Memory: installed 256G, allocated 20G, free 236G
        Lgroup resources: 3 (CPU); 3 (memory)
        Load: 0.016
        Latency: 12
lgroup 4 (leaf):
        Children: none, Parent: 0
        CPUs: 192-255
        Memory: installed 256G, allocated 23G, free 233G
        Lgroup resources: 4 (CPU); 4 (memory)
        Load: 0.00824
        Latency: 12

Lgroup latencies:

------------------
  |  0  1  2  3  4
------------------
0 | 18 18 18 18 18
1 | 18 12 18 18 18
2 | 18 18 12 18 18
3 | 18 18 18 12 18
4 | 18 18 18 18 12
------------------
CPU Locality
The lgrpinfo(1) command shown above presents CPU locality clearly. Here is another way to retrieve the association between CPU IDs and locality groups:
# echo ::lgrp -p | mdb -k
   LGRPID  PSRSETID      LOAD  #CPU  CPUS
        1         0     17873    64  0-63
        2         0     17755    64  64-127
        3         0      2256    64  128-191
        4         0     18173    64  192-255
Memory Locality
The lgrpinfo(1) command displays the total memory that belongs to each of the locality groups. However, the same command doesn't show the breakdown of memory into memory blocks and their association with locality groups. The syslayout debugger command (dcmd) in the genunix debugger module (dmod) helps in retrieving the association between a memory block and a locality group from the physical memory layout of the system.
Note: To list the available debugger commands and print a brief description for each one, run echo ::dcmds | mdb -k in a shell. To list the debugger commands grouped by the debugger module, run echo "::dmods -l" | mdb -k. Keep in mind that non-built-in debugger commands might not continue to work on future versions of Oracle Solaris.
Here is an example.
1. List memory blocks:
# ldm list-devices -a memory
MEMORY
    PA                   SIZE            BOUND
    0xa00000             32M             _sys_
    0x2a00000            96M             _sys_
    0x8a00000            374M            _sys_
    0x20000000           1048064M        primary
2. Print the physical memory layout of the system:
# echo ::syslayout | mdb -k
       STARTPA          ENDPA  SIZE  MG MN    STL    ETL
      20000000      200000000  7.5g   0  0      4     40
     200000000      400000000    8g   1  1    800    840
     400000000      600000000    8g   2  2   1000   1040
     600000000      800000000    8g   3  3   1800   1840
     800000000      a00000000    8g   0  0     40     80
     a00000000      c00000000    8g   1  1    840    880
     c00000000      e00000000    8g   2  2   1040   1080
     e00000000     1000000000    8g   3  3   1840   1880
    1000000000     1200000000    8g   0  0     80     c0
    1200000000     1400000000    8g   1  1    880    8c0
    1400000000     1600000000    8g   2  2   1080   10c0
    1600000000     1800000000    8g   3  3   1880   18c0
    ...
    ...
3. Use the ::mnode debugger command in the genunix debugger module to show the mapping of memory nodes to locality groups:
# echo ::mnode | mdb -k
          MNODE  ID  LGRP  ASLEEP  UTOTAL  UFREE  UCACHE  KTOTAL  KFREE  KCACHE
    2075ad80000   0     1       -    249g   237g    114m    5.7g   714m       -
    2075ad802c0   1     2       -    240g   236g    288m     15g   4.8g       -
    2075ad80580   2     3       -    246g   234g    619m    9.6g   951m       -
    2075ad80840   3     4       -    247g   231g     24m      9g   897m       -
In the above example, the memory block starting at physical address 1600000000 is on memory node #3, which translates to locality group #4. The sample output was collected on a SPARC T4-4 server from Oracle, on which the main memory was interleaved across all memory banks with an 8 GB interleave size. In other words, the first 8 GB chunk was populated in lgroup 1 closer to processor #1, the second 8 GB chunk in lgroup 2 closer to processor #2, the third 8 GB chunk in lgroup 3 closer to processor #3, the fourth 8 GB chunk in lgroup 4 closer to processor #4, the fifth 8 GB chunk again in lgroup 1 closer to processor #1, and so on. Memory is not interleaved on later SPARC systems that have Oracle's SPARC T5, M6, M7, or S7 processors.
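To make the interleaving concrete, the following small C sketch applies the same arithmetic to the layout shown above. The constants are assumptions taken from the ::syslayout and ::mnode output on this particular system (an 8 GB interleave across four memory nodes, with memory node N homed in locality group N+1); the sketch is purely illustrative and is not an Oracle Solaris interface.

#include <stdio.h>

int main(void) {
        /* Assumptions from the ::syslayout and ::mnode output above:
         * 8 GB interleave granularity, 4 memory nodes, lgroup = mnode + 1. */
        const unsigned long long interleave = 8ULL << 30;   /* 8 GB */
        const int nmnodes = 4;

        unsigned long long pa = 0x1600000000ULL;  /* start address from ::syslayout */
        int mnode = (int)((pa / interleave) % nmnodes);
        int lgrp = mnode + 1;

        printf("PA 0x%llx -> memory node %d, lgroup %d\n", pa, mnode, lgrp);
        return 0;
}

Running this prints PA 0x1600000000 -> memory node 3, lgroup 4, which matches the ::syslayout and ::mnode output above.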
I/O Locality
The -d option of the lgrpinfo(1) command accepts a path to an I/O device and returns the IDs of the lgroups closest to that device. Each I/O device on the system can be connected to one or more NUMA nodes, so it is not uncommon to see more than one lgroup ID returned by the lgrpinfo(1) command.
Here are some examples:
# lgrpinfo -d /dev/dsk/c1t0d0
lgroup ID : 1

# dladm show-phys | grep ixgbe0
net4        Ethernet      up      10000  full     ixgbe0
# lgrpinfo -d /dev/ixgbe0
lgroup ID : 1

# dladm show-phys | grep ibp0
net12       Infiniband    up      32000  unknown  ibp0
# lgrpinfo -d /dev/ibp0
lgroup IDs : 1-4
Logical groupings of I/O devices, resources (threads, interrupts, and so on), and objects (for example, device information pointers and DMA-mapped memory) that are based on relative affinity to their source can be examined with the numaio_group debugger command. These groups are known internally as NUMA I/O groups.
Here is an example:
# echo ::numaio_group | mdb -k
          ADDR  GROUP_NAME                      CONSTRAINT
  10050e1eba48  net4                            lgrp : 1
  10050e1ebbb0  net0                            lgrp : 1
  10050e1ebd18  usbecm2                         lgrp : 1
  10050e1ebe80  scsi_hba_ngrp_mpt_sas1          lgrp : 4
  10050e1ebef8  scsi_hba_ngrp_mpt_sas0          lgrp : 1
  ...
  ...
Alternatively, the prtconf(1M) command can be used to find the locality for an I/O device.
For example, here's how to find the device path for the network interface:
# dladm show-phys | grep ixgbe0
net4        Ethernet      up      10000  full     ixgbe0

# grep ixgbe /etc/path_to_inst | grep " 0 "
"/pci@400/pci@1/pci@0/pci@4/network@0" 0 "ixgbe"
And here's how to find the NUMA I/O lgroups:
# prtconf -v /devices/pci@400/pci@1/pci@0/pci@4/network@0
...
    Hardware properties:
    ...
        name='numaio-lgrps' type=int items=1
            value=00000001         <- lgroup 1
    ...
Application developers have no control over the grouping of objects into a NUMA I/O group at this time.
Resource Groups
In a resource group, hardware resources are grouped based on the underlying physical relationship between cores, memory, and I/O buses. On some SPARC hardware platforms, certain server configurations (such as Oracle's SPARC M7-8 server) might have resource groups that map directly to locality groups.
The list-rsrc-group subcommand of ldm(1M), the Logical Domains Manager command-line interface for Oracle VM Server for SPARC, shows a consolidated list of processor cores, memory blocks, and I/O devices that belong to each resource group. This subcommand is available in Oracle VM Server for SPARC 3.2 and later versions.
Here is an example:
# ldm ls-rsrc-group
NAME                        CORE  MEMORY  IO
/SYS/CMIOU0                 32    480G    4
/SYS/CMIOU1                 32    480G    4

# ldm ls-rsrc-group -l /SYS/CMIOU0
NAME                        CORE  MEMORY  IO
/SYS/CMIOU0                 32    480G    4

CORE
    CID                                     BOUND
    0, 1, 2, 3, 8, 9, 10, 11                primary
    16, 17, 18, 19, 24, 25                  primary
    ...

MEMORY
    PA                   SIZE            BOUND
    0x0                  60M             _sys_
    0x3c00000            32M             _sys_
    0x5c00000            94M             _sys_
    0x4c000000           64M             _sys_
    0x50000000           15104M          primary
    0x400000000          128G            primary
    ...
    0x7400000000         16128M          primary
    0x77f0000000         64M             _sys_
    0x77f4000000         192M            _sys_

IO
    DEVICE               PSEUDONYM       BOUND
    pci@300              pci_0           primary
    pci@301              pci_1           primary
    pci@303              pci_3           primary
    pci@304              pci_4           primary
Process and Thread Locality
The Oracle Solaris kernel assigns a thread to an lgroup when the thread is created. That lgroup is called the thread/LWP's home lgroup. The kernel schedules and runs a thread on the CPUs that are in the thread's home lgroup, and it allocates memory from the same lgroup whenever possible.
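A thread can also discover its own home lgroup programmatically through lgrp_home(3LGRP) from the locality group library that is covered later in this article. Here is a minimal sketch, shown only to illustrate the concept of a home lgroup (compile with -llgrp); it prints the home lgroup of the calling process and of the calling LWP.

#include <stdio.h>
#include <sys/types.h>
#include <sys/procset.h>
#include <sys/lgrp_user.h>

int main(void) {
        /* Home lgroup of the calling process and of the calling LWP */
        lgrp_id_t proc_home = lgrp_home(P_PID, P_MYID);
        lgrp_id_t lwp_home = lgrp_home(P_LWPID, P_MYID);

        if (proc_home == -1 || lwp_home == -1) {
                perror("lgrp_home");
                return (1);
        }

        printf("process home lgroup: %d\n", (int)proc_home);
        printf("LWP home lgroup    : %d\n", (int)lwp_home);
        return (0);
}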
The -H option of the ps(1) command can be used to examine the home lgroup of all user processes and threads. The -h option can be used to list all processes that are in a certain locality group.
Here's an example of listing the home lgroup of all processes:
# ps -aeH
  PID LGRP TTY         TIME CMD
    0    0 ?           0:11 sched
    1    4 ?          21:04 init
   11    3 ?           3:09 svc.star
 3322    1 ?         301:51 cssdagen
  ...
11155    3 ?           0:52 oracle
13091    4 ?           0:00 sshd
12812    2 pts/3       0:00 bash
  ...
The -H option of the prstat(1M) command shows the home lgroup of active user processes and threads, for example:
# prstat -H
   PID USERNAME  SIZE   RSS STATE   PRI NICE      TIME  CPU LGRP PROCESS/NLWP
  1865 root      420M  414M sleep    59    0 447:51:13 0.1%    2 java/108
  1814 oracle    155M  110M sleep    59    0  70:45:17 0.0%    4 gipcd.bin/9
  3765 root      447M  413M sleep    59    0  29:24:20 0.0%    3 crsd.bin/43
 10825 oracle   1097M 1074M sleep    59    0  18:13:27 0.0%    3 oracle/1
  3941 root      210M  184M sleep    59    0  20:03:37 0.0%    4 orarootagent.bi/14
  1585 oracle    122M   91M sleep    59    0  18:06:34 0.0%    3 evmd.bin/10
  3918 oracle    168M  144M sleep    58    0  14:35:31 0.0%    1 oraagent.bin/28
  ...
The plgrp(1) command shows the placement of threads among locality groups. The same command can also be used to set the home locality group and lgroup affinities for one or more processes, threads, or LWPs.
Let's examine the home lgroup of some of the LWPs in the process with process identifier (PID) 1865:
# plgrp 1865
     PID/LWPID    HOME
    1865/1        2
    1865/2        2
    ...
    1865/22       4
    1865/23       4
    ...
    1865/41       1
    1865/42       1
    ...
    1865/60       3
    1865/61       3
    ...

# plgrp 1865 | awk '{print $2}' | grep 2 | wc -l
      30
# plgrp 1865 | awk '{print $2}' | grep 1 | wc -l
      25
# plgrp 1865 | awk '{print $2}' | grep 3 | wc -l
      25
# plgrp 1865 | awk '{print $2}' | grep 4 | wc -l
      28
Let's reset the home lgroup of all LWPs in PID 1865 to 4:
# plgrp -H 4 1865
     PID/LWPID    HOME
    1865/1        2 => 4
    1865/2        2 => 4
    ...
    1865/41       1 => 4
    ...
    1865/60       3 => 4

# plgrp 1865 | awk '{print $2}' | egrep "1|2|3" | wc -l
       0
# plgrp 1865 | awk '{print $2}' | grep 4 | wc -l
     108

# prstat -H -p 1865
   PID USERNAME  SIZE   RSS STATE   PRI NICE      TIME  CPU LGRP PROCESS/NLWP
  1865 root      420M  414M sleep    59    0 447:57:30 0.1%    4 java/108
Memory Placement
The -L option of the pmap(1) command shows the lgroup that contains the physical memory backing some virtual memory.
Here's an example of determining the lgroups that shared memory segments are allocated from:
# pmap -Ls 24513 | egrep "Lgrp|256M|2G"
         Address       Bytes Pgsz Mode   Lgrp Mapped File
0000000400000000   33554432K   2G rwxs-     1 [ osm shmid=0x78000047 ]
0000000C00000000     262144K 256M rwxs-     3 [ osm shmid=0x78000048 ]
0000000C10000000     524288K 256M rwxs-     2 [ osm shmid=0x78000048 ]
...
Memory placement among lgroups can be influenced by supplying the MADV_ACCESS flags to the pmadvise(1) command at runtime or to the madvise(3C) function during development. Both advise the kernel's virtual memory manager that a region of user virtual memory is expected to follow a particular pattern of use, and the OS might use this hint to determine how to allocate memory for the specified range. This mechanism is beneficial when administrators and developers understand the target application's data access patterns.
Here's an example of applying the MADV_ACCESS_MANY policy advice to a segment at a specific address:
# pmap -Ls 27980 | grep anon
FFFFFFFF78C70000       3648K   64K rw-----    2 [ anon ]
FFFFFFFF79000000      53248K    4M rw-----    2 [ anon ]
FFFFFFFF7C400000       3136K   64K rw-----    2 [ anon ]
...

# pmadvise -o FFFFFFFF79000000=access_many -v 27980 | grep anon
FFFFFFFF78C70000       3648K rw-----   [ anon ]
FFFFFFFF79000000      53248K rw-----   [ anon ]    <= access_many
FFFFFFFF7C400000       3136K rw-----   [ anon ]
...
Here's a sample code block demonstrating the madvise(3C) call that hints to the kernel that addresses in this memory region are likely to be accessed only once:
#include <sys/mman.h>
...
int sz, fd;
...
char *content = mmap((caddr_t)0, sz, PROT_READ, MAP_PRIVATE, fd, 0);

if ( madvise(content, 2048, MADV_SEQUENTIAL) < 0 ) {
        perror("madvise");
}
...
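The MADV_ACCESS flags discussed above are passed through the same call. Below is a hedged sketch, assuming a 64 MB anonymous mapping chosen only for illustration, that uses MADV_ACCESS_LWP to advise the kernel that the next LWP touching the range will access it most heavily, so the backing memory can be allocated or migrated near that thread's home lgroup.

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void) {
        size_t sz = 64 * 1024 * 1024;   /* 64 MB buffer; example size only */

        /* Anonymous private mapping whose physical placement we want to influence */
        char *buf = mmap(NULL, sz, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANON, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                exit(1);
        }

        /*
         * Advise the kernel that the next LWP to touch this range is the one
         * that will access it most heavily, so its pages should be placed
         * near that LWP's home lgroup.
         */
        if (madvise(buf, sz, MADV_ACCESS_LWP) < 0) {
                perror("madvise(MADV_ACCESS_LWP)");
        }

        /* ... hand the buffer to the worker thread that owns it ... */

        munmap(buf, sz);
        return (0);
}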
Locality Group API
As of this writing, Oracle Solaris has limited support for lgroup observability using programmatic interfaces. Applications can use the lgroup API to traverse the locality group hierarchy, discover the contents of each locality group, and even affect thread and memory placement on desired lgroups.
The man page for liblgrp(3LIB) lists the currently supported public interfaces, and a brief description for most of those interfaces can be obtained by running man -k lgroup in a shell. In order to use this API, applications must link with the locality group library, liblgrp(3LIB).
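As a quick illustration of traversing the hierarchy and discovering the contents of each lgroup, here is a minimal sketch (compile with -llgrp) that walks the children of the root lgroup and prints the CPUs and installed memory directly contained in each one. The output format is an assumption made for this example; the lgrp_children(), lgrp_cpus(), and lgrp_mem_size() interfaces themselves are documented in liblgrp(3LIB).

#include <stdio.h>
#include <stdlib.h>
#include <sys/lgrp_user.h>

int main(void) {
        lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_OS);
        if (cookie == LGRP_COOKIE_NONE) {
                perror("lgrp_init");
                exit(1);
        }

        /* Children of the root lgroup are the leaf lgroups on this system */
        lgrp_id_t root = lgrp_root(cookie);
        int nchildren = lgrp_children(cookie, root, NULL, 0);
        if (nchildren <= 0) {
                fprintf(stderr, "no child lgroups found\n");
                lgrp_fini(cookie);
                exit(1);
        }

        lgrp_id_t *children = calloc(nchildren, sizeof (lgrp_id_t));
        lgrp_children(cookie, root, children, nchildren);

        for (int i = 0; i < nchildren; i++) {
                /* CPUs and installed memory directly contained in this lgroup */
                int ncpus = lgrp_cpus(cookie, children[i], NULL, 0,
                    LGRP_CONTENT_DIRECT);
                lgrp_mem_size_t mem = lgrp_mem_size(cookie, children[i],
                    LGRP_MEM_SZ_INSTALLED, LGRP_CONTENT_DIRECT);

                printf("lgroup %d: %d CPUs, %lld MB installed memory\n",
                    (int)children[i], ncpus, (long long)(mem >> 20));
        }

        free(children);
        lgrp_fini(cookie);
        return (0);
}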
The following sample code demonstrates lgroup API usage by making several lgrp_*() calls, including lgrp_device_lgrps(), to find the locality groups that are closest to the specified I/O device.
# cat iodevlgrp.c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/lgrp_user.h>
#include <sys/types.h>

int main(int argc, char **argv) {

        if (argc != 2) {
                fprintf(stderr, "Usage: %s <iodevice>\n", argv[0]);
                exit(1);
        }

        /* lgroup interface version check */
        if (lgrp_version(LGRP_VER_CURRENT) != LGRP_VER_CURRENT) {
                fprintf(stderr, "\nBuilt with unsupported lgroup interface %d",
                    LGRP_VER_CURRENT);
                exit(1);
        }

        lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_OS);
        lgrp_id_t node = lgrp_root(cookie);

        /* refresh the cookie if stale */
        if ( lgrp_cookie_stale(cookie) ) {
                lgrp_fini(cookie);
                cookie = lgrp_init(LGRP_VIEW_OS);
        }

        int nlgrps = lgrp_nlgrps(cookie);
        if ( nlgrps == -1 ) {
                perror("\n lgrp_nlgrps");
                lgrp_fini(cookie);
                exit(1);
        }
        printf("Number of locality groups on the system: %d\n", (nlgrps - 1));

        /* lgroups closest to the target device */
        int numlgrps = lgrp_device_lgrps(argv[1], NULL, 0);
        if (numlgrps == -1) {
                fprintf(stderr, "I/O device: %s. ", argv[1]);
                perror("lgrp_device_lgrps");
        } else {
                printf("lgroups closest to the I/O device %s: ", argv[1]);
                lgrp_id_t *lgrpids = (lgrp_id_t *)calloc(numlgrps, sizeof (lgrp_id_t));
                assert(lgrpids != NULL);
                lgrp_device_lgrps(argv[1], lgrpids, numlgrps);
                for (int i = 0; i < numlgrps; ++i) {
                        printf(" %d ", lgrpids[i]);
                }
                free(lgrpids);
        }
        lgrp_fini(cookie);
        printf("\n");
        return 0;
}

% cc -o iodevlgrp -llgrp iodevlgrp.c

% ./iodevlgrp /dev/ixgbe0
Number of locality groups on the system: 2
lgroups closest to the I/O device /dev/ixgbe0:  1

% lgrpinfo -d /dev/ixgbe0
lgroup ID : 1

% ./iodevlgrp /dev/ixgbe1
Number of locality groups on the system: 2
lgroups closest to the I/O device /dev/ixgbe1:  2

% lgrpinfo -d /dev/ixgbe1
lgroup ID : 2
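Beyond observation, the same library can influence placement. The following hedged sketch uses lgrp_affinity_set(3LGRP) to give the calling LWP a strong affinity for a chosen lgroup, which re-homes the thread much like plgrp -H does from the command line. The default target lgroup of 1 is only an assumption for illustration; in practice, the target would come from traversing the hierarchy as shown above.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/procset.h>
#include <sys/lgrp_user.h>

int main(int argc, char **argv) {
        /* Target lgroup from the command line; 1 is an arbitrary default */
        lgrp_id_t target = (argc > 1) ? (lgrp_id_t)atoi(argv[1]) : 1;

        /*
         * A strong affinity re-homes the calling LWP to the target lgroup,
         * so the scheduler prefers that lgroup's CPUs and memory from then on.
         */
        if (lgrp_affinity_set(P_LWPID, P_MYID, target, LGRP_AFF_STRONG) == -1) {
                perror("lgrp_affinity_set");
                exit(1);
        }

        printf("home lgroup is now: %d\n", (int)lgrp_home(P_LWPID, P_MYID));
        return (0);
}

Compile it with cc -o setaff -llgrp setaff.c (the file name setaff.c is a placeholder) and pass the desired lgroup ID as the argument.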
About the Author
Giri Mandalika is a principal software engineer in Oracle's Hardware Systems organization. Currently, Giri is focused on designing and implementing software for SPARC processor–based engineered systems. Giri manages a blog that focuses on Oracle Solaris, Oracle Database, and other topics at https://blogs.oracle.com/mandalika.