Infrastructure Software

kswapd using 100% of CPU

1506714 · Aug 1 2016 — edited Aug 12 2016

Hi team,

One of our OEL boxes has been showing some strange behaviour these last few days. Randomly, after a few hours, the kswapd process starts eating up 100% of the CPU.

Swap and memory levels:

[oracle@Cah-Dump ~]$ free

             total       used       free     shared    buffers     cached

Mem:       3915152    3806080     109072          0       1408       5392

-/+ buffers/cache:    3799280     115872

Swap:      4063228      91596    3971632

Linux version:

[oracle@Cah-Dump ~]$ uname -a

Linux Cah-Dump.cahors.local 2.6.32-504.3.3.el6.x86_64 #1 SMP Tue Dec 16 12:12:30 PST 2014 x86_64 x86_64 x86_64 GNU/Linux

Extract of top:

top - 09:08:46 up 3 days, 15 min,  1 user,  load average: 3.02, 3.40, 3.47

Tasks: 111 total,   2 running, 107 sleeping,   0 stopped,   2 zombie

Cpu(s):  1.2%us, 82.1%sy,  0.0%ni, 14.6%id,  1.3%wa,  0.0%hi,  0.8%si,  0.0%st

Mem:   3915152k total,  3800636k used,   114516k free,      508k buffers

Swap:  4063228k total,    91836k used,  3971392k free,     3212k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                      

   30 root      20   0     0    0    0 R 99.9  0.0   3494:23 kswapd0                                                                                       

    1 root      20   0 19356    4    4 S  0.0  0.0   0:01.05 init                                                                                          

    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd                                                                                      

    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0                                                                                   

    4 root      20   0     0    0    0 S  0.0  0.0   0:09.70 ksoftirqd/0                                                                                   

    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 stopper/0                                                                                     

    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.34 watchdog/0                                                                                    

    7 root      20   0     0    0    0 S  0.0  0.0   0:19.48 events/0                                                                                      

The only way I've found to get the machine back to a normal state is a reboot... Do you know a better way to handle this kind of situation?

Thanks a lot.

Best regards.

--

Jérémy

This post has been answered by remzi.akyuz on Aug 3 2016

Comments

remzi.akyuz

Hi,

You can use slabtop and smem (smem -s swap) to see which applications are using swap.

You can also temporarily disable or clear swap:

swapoff -a

https://www.kernel.org/doc/Documentation/sysctl/vm.txt

drop_caches

Writing to this will cause the kernel to drop clean caches, as well as

reclaimable slab objects like dentries and inodes. Once dropped, their

memory becomes free.

To free pagecache:

echo 1 > /proc/sys/vm/drop_caches

To free reclaimable slab objects (includes dentries and inodes):

echo 2 > /proc/sys/vm/drop_caches

To free slab objects and pagecache:

echo 3 > /proc/sys/vm/drop_caches
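To see whether dropping caches actually frees anything, here is a small sketch (run as root) that snapshots the relevant /proc/meminfo counters before and after. Only clean page cache and reclaimable slab can be dropped, so unreclaimable slab will not move.

```shell
# Snapshot reclaimable memory before and after dropping caches (values in kB).
snapshot() {
    awk '/^(MemFree|Cached|SReclaimable):/ {printf "%s %s ", $1, $2} END {print ""}' /proc/meminfo
}

echo "before: $(snapshot)"
sync                               # flush dirty pages so more of the cache is clean
echo 3 > /proc/sys/vm/drop_caches  # free page cache + reclaimable slab
echo "after:  $(snapshot)"
```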

1506714

Hi,

Thanks a lot. Next time this problem occurs, I'll try your advice and let you know if it works.

--

Jérémy

1506714

Hi,

This morning again, kswapd was eating all of the CPU. So I tried the following:

1)

slabtop :

Active / Total Objects (% used)    : 27934202 / 27949137 (99.9%)

Active / Total Slabs (% used)      : 933051 / 933051 (100.0%)

Active / Total Caches (% used)     : 95 / 191 (49.7%)

Active / Total Size (% used)       : 3502972.51K / 3505333.98K (99.9%)

Minimum / Average / Maximum Object : 0.02K / 0.12K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                  

27866310 27866310 100%    0.12K 928877       30   3715508K size-128

17248  17038  98%    0.03K    154      112       616K size-32

  8673   7682  88%    0.06K    147       59       588K size-64

  8100   8084  99%    0.14K    300       27      1200K sysfs_dir_cache

  7791   7757  99%    0.07K    147       53       588K selinux_inode_security

  7600   5488  72%    0.19K    380       20      1520K dentry

  6760   6690  98%    0.19K    338       20      1352K size-192

  4776   4605  96%    0.58K    796        6      3184K inode_cache

  3773    577  15%    0.05K     49       77       196K anon_vma_chain

  2668    382  14%    0.04K     29       92       116K anon_vma

  1995    677  33%    0.20K    105       19       420K vm_area_struct

  1296   1276  98%    0.02K      9      144        36K z718cdad054

   952    925  97%    0.50K    119        8       476K size-512

   938    416  44%    0.55K    134        7       536K radix_tree_node

   901    867  96%    0.07K     17       53        68K Acpi-Operand

   780    190  24%    0.19K     39       20       156K filp

   720    713  99%    0.77K    144        5       576K shmem_inode_cache

   668    624  93%    1.00K    167        4       668K size-1024

   666    322  48%    0.10K     18       37        72K buffer_head

   552    517  93%    0.04K      6       92        24K Acpi-Namespace

   320    110  34%    0.19K     16       20        64K cred_jar

   290    229  78%    0.38K     29       10       116K ip_dst_cache

   276    150  54%    0.98K     69        4       276K ext4_inode_cache

   272     87  31%    0.11K      8       34        32K task_delay_info

   270     88  32%    0.12K      9       30        36K pid

   240     75  31%    0.25K     16       15        64K skbuff_head_cache

   231    231 100%    4.00K    231        1       924K size-4096

   224    219  97%    0.53K     32        7       128K idr_layer_cache

   202      6   2%    0.02K      1      202         4K jbd2_revoke_table
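The size-128 line above accounts for roughly 3.5 GB on its own. A sketch to watch whether that cache keeps growing over time (reading /proc/slabinfo generally requires root; size-128 is the generic SLAB cache name on this 2.6.32 kernel, while SLUB kernels call it kmalloc-128):

```shell
# Log the size-128 slab cache once a minute; a steady climb points at a leak.
while true; do
    awk -v d="$(date '+%F %T')" '$1 == "size-128" {
        # columns: name, active objects, total objects, object size (bytes)
        printf "%s active=%s total=%s bytes=%d\n", d, $2, $3, $3 * $4
    }' /proc/slabinfo
    sleep 60
done
```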

2)

smem -s swap :

[root@Cah-Dump ~]# smem -s swap

  PID User     Command                         Swap      USS      PSS      RSS

39445 root     python /usr/bin/smem -s swa        0     5936     6026     6132

39346 root     /sbin/mingetty /dev/tty5          60        4        4        8

39344 root     /sbin/mingetty /dev/tty3          64        4        4        8

39345 root     /sbin/mingetty /dev/tty4          64        4        4        8

39347 root     /sbin/mingetty /dev/tty6          64        4        4        8

39342 root     /sbin/mingetty /dev/tty1          68        4        4        8

39343 root     /sbin/mingetty /dev/tty2          68        4        4        8

1194 root     auditd                           284        4        4        8

    1 root     /sbin/init                       308        4        4        8

28939 root     /sbin/udevd -d                   432        0        1        4

28940 root     /sbin/udevd -d                   432        0        1        4

  391 root     /sbin/udevd -d                   676        0        1        4

1586 root     /usr/sbin/sshd                   704        4        4        4

39348 root     sshd: root@pts/0                 840      276      290      320

39350 root     -bash                           1836      896      978     1076

3) swapoff -a

4) swapon -a

5) echo 1 > /proc/sys/vm/drop_caches

6) echo 2 > /proc/sys/vm/drop_caches

7) echo 3 > /proc/sys/vm/drop_caches

8) sync

Unfortunately, kswapd is still at the top of the list, so I'm rebooting the box to get it available again.

If you see anything else I'm missing, I'll be glad to try whatever solution you might have :-)

Thanks !

--

Jérémy

Alex-D

What kernel is this?

You apparently have an insane number of objects in the size-128 slab; it could be a kernel leak corrected in some later version.

Let's wait for our Linux gurus to chime in; I'm afraid my level of kernel understanding isn't great enough to give you more meaningful advice...

Alex-D

By the way, what does this box usually do? What kind of software does it run? A lot of IO?

handat

If you have schedtool installed, you could try: schedtool -D -n 19 `pidof kswapd0`

That would lower the priority of kswapd so it cannot hog the CPU.
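For what it's worth, the same effect should be possible with stock tools (a sketch; renice is part of the base EL6 install). Lowering kswapd0's priority only masks the symptom, since the reclaim pressure is still there, but it can keep the box responsive enough to investigate:

```shell
# Drop kswapd0 to the lowest scheduling priority (root required).
renice -n 19 -p "$(pidof kswapd0)"
```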

Dude!

I wouldn't do anything fancy yet. Here are some basic questions:

1. What was installed or changed recently that might have caused the problem?

2. Is this machine a virtual machine?

3. Do you have other kernels installed in the system that can reproduce the problem?

4. What does the top utility show after pressing O p, and what is the output of cat /proc/meminfo?

1506714

Hi,

Kernel is the following :

[oracle@Cah-Dump ~]$ uname -a

Linux Cah-Dump.cahors.local 2.6.32-504.3.3.el6.x86_64 #1 SMP Tue Dec 16 12:12:30 PST 2014 x86_64 x86_64 x86_64 GNU/Linux

1506714

The box is simply a mirror for some files created on another machine.

It rsyncs a specific directory on a regular basis.

We didn't add any software recently except an update of the monit service manager.

1506714

We don't have schedtool installed on this box, so I'm afraid I won't be able to test your answer.

1506714

1. What was installed or changed recently that might have caused the problem?

Nothing I can think of, except an update of a service manager called monit.

2. Is this machine a virtual machine?

Yes, it runs on Hyper-V.

3. Do you have other kernels installed in the system that can reproduce the problem?

No. The box has been up and running for 2 years now without showing this problem.

4. What does the top utility and pressing O p show and what is your output of cat /proc/meminfo?

top utility :

top - 13:20:28 up  4:42,  1 user,  load average: 0.00, 0.00, 0.02

Tasks: 150 total,   1 running, 147 sleeping,   0 stopped,   2 zombie

Cpu(s):  0.0%us,  0.3%sy,  0.0%ni, 99.3%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st

Mem:   3915152k total,  3547764k used,   367388k free,   249372k buffers

Swap:  4063228k total,      228k used,  4063000k free,  1793364k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  SWAP COMMAND                                                                                 

    1 root      20   0 19356  844  640 S  0.0  0.0   0:00.96   96 init                                                                                     

  391 root      16  -4 10916 1000  332 S  0.0  0.0   0:00.46   20 udevd                                                                                    

1983 root      18  -2 10912 1012  340 S  0.0  0.0   0:00.00   16 udevd                                                                                    

1984 root      18  -2 10912 1012  340 S  0.0  0.0   0:00.00   16 udevd                                                                                    

    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 kthreadd                                                                                 

    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.00    0 migration/0                                                                              

    4 root      20   0     0    0    0 S  0.0  0.0   0:00.09    0 ksoftirqd/0                                                                              

    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00    0 stopper/0                                                                                

    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.02    0 watchdog/0                                                                               

    7 root      20   0     0    0    0 S  0.0  0.0   0:01.27    0 events/0                                                                                 

    8 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 cgroup                                                                                   

    9 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 khelper                                                                                  

   10 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 netns                                                                                    

   11 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 async/mgr                                                                                

   12 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 pm                                                                                       

   13 root      20   0     0    0    0 S  0.0  0.0   0:00.04    0 sync_supers                                                                              

   14 root      20   0     0    0    0 S  0.0  0.0   0:00.06    0 bdi-default                                                                              

   15 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 kintegrityd/0                                                                            

   16 root      20   0     0    0    0 S  0.0  0.0   0:02.95    0 kblockd/0                                                                                

   17 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 kacpid                                                                                   

   18 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 kacpi_notify                                                                             

   19 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 kacpi_hotplug                                                                            

   20 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 ata_aux                                                                                  

   21 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 ata_sff/0                                                                                

   22 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 ksuspend_usbd                                                                            

   23 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 khubd                                                                                    

   24 root      20   0     0    0    0 S  0.0  0.0   0:00.05    0 kseriod                                                                                  

   25 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 md/0                                                                                     

   26 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 md_misc/0                                                                                

   27 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 linkwatch                                                                                

   29 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 khungtaskd                                                                               

   30 root      20   0     0    0    0 S  0.0  0.0   0:14.94    0 kswapd0                                                                                  

   31 root      25   5     0    0    0 S  0.0  0.0   0:00.00    0 ksmd                                                                                     

   32 root      39  19     0    0    0 S  0.0  0.0   0:00.07    0 khugepaged                                                                               

   33 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 aio/0                                                                                    

   34 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 crypto/0                                                                                 

   42 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 kthrotld/0                                                                               

   44 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 kpsmoused                                                                                

   45 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 usbhid_resumer

Output of /proc/meminfo :

[root@Cah-Dump ~]# cat /proc/meminfo

MemTotal:        3915152 kB

MemFree:          359136 kB

Buffers:          249404 kB

Cached:          1793364 kB

SwapCached:          228 kB

Active:           359764 kB

Inactive:        1835136 kB

Active(anon):      72704 kB

Inactive(anon):    81084 kB

Active(file):     287060 kB

Inactive(file):  1754052 kB

Unevictable:           0 kB

Mlocked:               0 kB

SwapTotal:       4063228 kB

SwapFree:        4063000 kB

Dirty:               116 kB

Writeback:             0 kB

AnonPages:        152040 kB

Mapped:            32900 kB

Shmem:              1656 kB

Slab:            1276536 kB

SReclaimable:      71436 kB

SUnreclaim:      1205100 kB

KernelStack:        1816 kB

PageTables:        14580 kB

NFS_Unstable:          0 kB

Bounce:                0 kB

WritebackTmp:          0 kB

CommitLimit:     6020804 kB

Committed_AS:     659496 kB

VmallocTotal:   34359738367 kB

VmallocUsed:       35028 kB

VmallocChunk:   34359700476 kB

HardwareCorrupted:     0 kB

AnonHugePages:     38912 kB

HugePages_Total:       0

HugePages_Free:        0

HugePages_Rsvd:        0

HugePages_Surp:        0

Hugepagesize:       2048 kB

DirectMap4k:        8128 kB

DirectMap2M:     4186112 kB

I'll do the same when the problem arises again, to see if there are differences.

Dude!

Kernel 2.6.32 is ancient. Are you sure it is supported to run under Hyper-V? I suggest checking the Microsoft documentation to find out whether the kernel is supported. You can find some more info if you search this forum.

What is the purpose of the machine? I see that you have anonymous huge pages, which would not be good if you ran Oracle Database. Other than that, you have 2 zombie processes, which means these are processes that have terminated but still occupy some resources. That's something you should investigate; it might be related to the cause of your problem.

I don't know monit, but if you can disable it, I suggest doing so to see if the problem persists.

To find out what these zombie processes are, the following should give more clues:

ps aux | grep 'Z'
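One caveat with that command: grep 'Z' matches a capital Z anywhere on the line, including the grep process itself and any command containing a Z. Matching on the state column is more precise; a sketch:

```shell
# List only true zombies by testing the STAT column instead of the whole line.
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
```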

Alex-D

If I had a similar, presumably kernel-related, problem on one of my boxes, I would certainly install a different (newer) kernel and run with it for some time to see if it makes any difference.

It's easy to switch back by modifying your grub file too, and it looks like in your case the box is not mission critical (you say it is a mirror), meaning a green light for a new kernel.

Regards.

P.S. You say it is Hyper-V; check whether some new virtual device was introduced to your VM, or whether parameters like USB3 support were put in place. This old kernel may be confused by those.

Alex-D

Oh, and I have vague memories of a similar memory leak appearing with SELinux activated, an older 2.6.x kernel, and lots of file writes. For the life of me, I cannot find the reference to it though. I'm not even sure whether it was on RedHat/Fedora/CentOS/OL.

If you have SELinux activated but do not really use its functionality, maybe try running without it?

Regards

remzi.akyuz

Hi,

Did you check dmesg and /var/log/messages?

Are there any error or warning messages?

If you cannot find what is causing this, you can reduce swap usage:

sysctl -w vm.swappiness=0

or disable swap entirely. If you only use this server for rsync/backup/copy, you can do without it.
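If the setting does help, it can be made to survive a reboot via /etc/sysctl.conf (a sketch; note that in the output above swap is barely used, so swappiness alone is unlikely to be the root cause here):

```shell
sysctl -w vm.swappiness=0                        # apply immediately
grep -q '^vm.swappiness' /etc/sysctl.conf ||
    echo 'vm.swappiness = 0' >> /etc/sysctl.conf # persist across reboots
```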

1506714

Hi guys, I'll try to sum up a bit :-)

Right now the problem is occurring on the box. Here are the results of the commands Dude! asked for:

"top O p"

top - 14:49:11 up 1 day,  6:11,  1 user,  load average: 1.00, 1.00, 1.00

Tasks:  85 total,   2 running,  83 sleeping,   0 stopped,   0 zombie

Cpu(s):  0.3%us, 99.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Mem:   3915152k total,  3801780k used,   113372k free,     2640k buffers

Swap:  4063228k total,     7228k used,  4056000k free,     3996k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  SWAP COMMAND                                   

61913 root      20   0  107m 1912  540 S  0.0  0.0   0:00.08  992 bash                                       

61911 root      20   0 97892  380  220 S  0.0  0.0   0:00.01  800 sshd                                       

1586 root      20   0 66216    0    0 S  0.0  0.0   0:00.00  712 sshd                                       

  391 root      16  -4 10916    0    0 S  0.0  0.0   0:00.47  688 udevd                                      

54109 root      18  -2 10648    0    0 S  0.0  0.0   0:00.00  428 udevd                                      

57103 root      18  -2 10648    0    0 S  0.0  0.0   0:00.00  428 udevd                                      

    1 root      20   0 19356    4    4 S  0.0  0.0   0:01.07  308 init                                       

1194 root      16  -4 93156    4    4 S  0.0  0.0   0:01.12  292 auditd                                     

61896 root      20   0  4064    4    4 S  0.0  0.0   0:00.00   72 mingetty                                   

61897 root      20   0  4064    4    4 S  0.0  0.0   0:00.00   72 mingetty                                   

61893 root      20   0  4064    4    4 S  0.0  0.0   0:00.00   68 mingetty                                   

61894 root      20   0  4064    4    4 S  0.0  0.0   0:00.00   68 mingetty                                   

61895 root      20   0  4064    4    4 S  0.0  0.0   0:00.00   68 mingetty                                   

61898 root      20   0  4064    4    4 S  0.0  0.0   0:00.00   68 mingetty                                   

    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 kthreadd                                   

    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.00    0 migration/0                                

    4 root      20   0     0    0    0 S  0.0  0.0   0:00.75    0 ksoftirqd/0                                

    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00    0 stopper/0                                  

    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.66    0 watchdog/0                                 

    7 root      20   0     0    0    0 S  0.0  0.0   0:08.47    0 events/0                                   

    8 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 cgroup                                     

    9 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 khelper                                    

   10 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 netns                                      

   11 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 async/mgr                                  

   12 root      20   0     0    0    0 S  0.0  0.0   0:00.00    0 pm                                         

   13 root      20   0     0    0    0 S  0.0  0.0   0:00.22    0 sync_supers                                

   14 root      20   0     0    0    0 S  0.0  0.0   0:00.30    0 bdi-default

cat /proc/meminfo

[root@Cah-Dump ~]# cat /proc/meminfo

MemTotal:        3915152 kB

MemFree:          116852 kB

Buffers:             504 kB

Cached:             3052 kB

SwapCached:         1864 kB

Active:             1064 kB

Inactive:           4852 kB

Active(anon):        248 kB

Inactive(anon):     2104 kB

Active(file):        816 kB

Inactive(file):     2748 kB

Unevictable:           0 kB

Mlocked:               0 kB

SwapTotal:       4063228 kB

SwapFree:        4056080 kB

Dirty:                16 kB

Writeback:             0 kB

AnonPages:          1620 kB

Mapped:             1128 kB

Shmem:                 0 kB

Slab:            3741848 kB

SReclaimable:       6012 kB

SUnreclaim:      3735836 kB

KernelStack:         680 kB

PageTables:          844 kB

NFS_Unstable:          0 kB

Bounce:                0 kB

WritebackTmp:          0 kB

CommitLimit:     6020804 kB

Committed_AS:      24092 kB

VmallocTotal:   34359738367 kB

VmallocUsed:       35028 kB

VmallocChunk:   34359700476 kB

HardwareCorrupted:     0 kB

AnonHugePages:         0 kB

HugePages_Total:       0

HugePages_Free:        0

HugePages_Rsvd:        0

HugePages_Surp:        0

Hugepagesize:       2048 kB

DirectMap4k:        8128 kB

DirectMap2M:     4186112 kB

[root@Cah-Dump ~]# ps aux |grep 'Z'

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND

root     62004  0.0  0.0 103256   840 pts/0    S+   14:50   0:00 grep Z

@"Dude!" Kernel 2.6.32 is ancient. Are you sure it is supported to run under Hyper-V?

Well, I'm not sure; my sysadmin is coming back to work next week, so I'm going to ask him to check.

@"Dude!" What is the purpose of the machine? I see that you have Anon Huge-pages, which would not be good if you run Oracle database. Other than that, you have 2 zombie processes, which means these are process that have terminated, but still occupy some resources. That's something you should investigate, which might be related to the cause of your problem.

This machine only rsyncs files from another one to its own disk. It's not running Oracle Database. The zombie processes seem to have disappeared this time.

@"T-34" I'll check with sysadmin if he introduced anything in Hyper-V and I'll check also if we can try and install some new kernels. But not before next week :-)

By the way, I'm not familiar with SELinux. Is there a simple command I can use to deactivate this thing and check if everything's running better?

Again, thanks for your answers !

1506714

Hi,

Didn't see your message while answering :-)

So, dmesg is showing a lot of lines saying "Out of memory: Kill process XXXXX (rsync) score 1 or sacrifice child" and others mentioning bash. I'm not sure whether this is the cause or a consequence.

I've just tried disabling swap to check whether it does any good. I can still see kswapd0 eating up CPU; it seems to make no difference.
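For reference, the OOM killer lines mentioned above can be pulled out of the logs like this (a sketch; /var/log/messages is the default syslog target on EL6):

```shell
# Collect recent OOM killer activity from syslog and the kernel ring buffer.
grep -i -e 'out of memory' -e 'oom-killer' /var/log/messages | tail -20
dmesg | grep -i -e 'out of memory' -e 'oom-killer' | tail -20
```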

Thanks.

Alex-D

My view on your issues is the following:

You have a lot of memory pressure (100 MB of physical memory left), yet your swap is empty: that is because the memory is hogged by kernel slab allocations, and slabs are not swappable as far as I understand.

Your rsync processes dying is probably just collateral damage from the out-of-memory (OOM) killer trying to free memory for the OS itself.

A good rundown of how it works: How to Configure the Linux Out of Memory Killer

Now, I wouldn't speculate on the reason for the kernel memory leak, but it certainly is a kernel leak; the number of object allocations in the slab is completely abnormal. Statistically, most kernel problems come from drivers, though I am not saying that is your case. Again, you need to check what your virtual environment is. Installing a newer kernel often cures many driver-related ills.

My idea about SELinux was a shot in the dark, but feel free to test it anyway:

Edit the /etc/selinux/config file and change the SELINUX variable to SELINUX=disabled.

You need to reboot to make it effective.
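As a sketch of the same steps in command form (root required): setenforce 0 switches to permissive mode immediately, which is often enough to tell whether SELinux is involved before committing to the config change.

```shell
getenforce     # shows Enforcing / Permissive / Disabled
setenforce 0   # permissive until next boot: denials are logged, not enforced
# make it permanent, then reboot for the fully disabled state:
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
```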

Edit:

Dude's previous remark on huge pages is an interesting one. It may be useful to investigate why you have those and to try running without them.

Alex-D

Haha, wait, that is exactly what one of the paragraphs from the link I've just provided says:

There are several things that might cause an OOM event other than the system running out of RAM and available swap space due to the workload.  The kernel might not be able to utilize swap space optimally due to the type of workload on the system.  Applications that utilize mlock() or HugePages have memory that can't be swapped to disk when the system starts to run low on physical memory.  Kernel data structures can also take up too much space exhausting memory on the system and causing an OOM situation.

Dude!

Let's not confuse anonymous or transparent huge pages (THP), introduced in OL 6, with huge pages. THP can be swapped. However, the kernel always needs a certain amount of low memory, which cannot be virtual memory, in order to perform any kind of memory management. OOM is the kernel's last resort to remain functioning.

https://access.redhat.com/solutions/46111

Since the system, according to the OP, has worked for years and nothing was changed, perhaps the problem is not the virtual machine but the virtual machine environment.

Alex-D

It is possible the hypervisor was migrated to a different host, or that some other configuration changes were made, resulting in new "hardware" being presented to the guest. But that was the first thing I mentioned earlier.

remzi.akyuz
Answer

So, dmesg is showing a lot of lines saying "Out of memory: Kill process XXXXX (rsync) score 1 or sacrifice child" and others talking about bash. Not sure this is the cause or a consequence.

Just tried disabling swap at the moment to check if it's doing anything good. I can still see kswapd0 eating up CPU, seems to make no difference.

You can set a memory limit for the rsync user by adding this line to /etc/security/limits.conf:

rsyncuser hard as 3015152000

Replace 'rsyncuser' with your actual rsync user. That user then cannot use all of the memory, only up to the configured amount (note that per limits.conf(5), the 'as' address-space limit is expressed in kilobytes, not bytes).

And this link may be useful:

https://www.lowendtalk.com/discussion/6734/rsync-out-of-memory-try-this
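A quick way to check that the limit is actually in effect (a sketch; 'rsyncuser' is the placeholder name from the reply above, and ulimit -v reports the address-space limit in kB):

```shell
# Print the address-space limit as seen by the limited user.
su - rsyncuser -c 'ulimit -v'
# The same mechanism demonstrated in a subshell: lower the limit, read it back.
( ulimit -v 102400; ulimit -v )   # prints 102400
```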

Marked as Answer by 1506714 · Sep 27 2020
Robert Chase

I wrote this article on the OOM killer a few years ago. It might be helpful, but I would suggest waiting for the system owner to come back before trying any aggressive changes to the system.

How to Configure the Linux Out of Memory Killer

1506714

Hi,

We found the cause of this problem: rsync's --compress flag was making the machine run out of memory while copying a huge file (34 GB).

In late July, something changed in a backup made on another server that we rsync onto this one, causing this big file to be created and transferred every day.

We just removed --compress and the machine is now running fine. We also added some niceness to our crontabbed tasks to ease the load on the machine :-)
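For anyone landing on this thread later, a sketch of what the adjusted cron job might look like (the host, paths, and nice levels here are hypothetical; the point is dropping --compress and lowering the transfer's priority):

```shell
# --compress mainly pays off on slow links and buffers heavily on huge files;
# nice/ionice keep the nightly transfer from starving the rest of the box.
nice -n 10 ionice -c2 -n7 \
    rsync -a --partial backuphost:/export/backup/ /data/mirror/
```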

Big thanks to all for your answers. I'm giving the "correct answer" flag to remzi.akyuz because he pointed out rsync's memory usage as a possible cause, which eventually led us to the solution.

--

Jérémy


Post Details

Locked on Sep 9 2016
Added on Aug 1 2016
24 comments
6,837 views