Overview

After weeks of normal operation, processes start locking up on a machine. On further inspection, an Oops is found. This behaviour is not observed with kernel 2.6.32, but has been seen with 3.10.33-1 through 3.10.37.

Steps to Reproduce

Unable to reproduce intentionally.
The Oops happens at seemingly random times over long intervals (weeks).
On a cluster of ~100 machines this occurs close to daily on one machine or another.

Actual results

The following kernel oops:

BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
IP: [<ffffffff8115145f>] isolate_migratepages_range+0x43f/0x970
PGD 80458e067 PUD 0
Oops: 0000 [#1] SMP
Modules linked in: autofs4 ipmi_devintf ipmi_si ipmi_msghandler ipv6 vfat fat ext4 jbd2 mbcache gpio_ich iTCO_wdt iTCO_vendor_support coretemp freq_table mperf intel_powerclamp kvm_intel kvm crc32_pclmul crc32c_intel ghash_clmulni_intel microcode pcspkr sb_edac edac_core i2c_i801 lpc_ich shpchp igb hwmon ptp pps_core sg ioatdma dca xfs libcrc32c sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 megaraid_sas wmi mgag200 ttm drm_kms_helper sysimgblt sysfillrect syscopyarea dm_mirror dm_region_hash dm_log dm_mod
CPU: 5 PID: 14949 Comm: java Not tainted 3.10.33-1.el6.elrepo.x86_64 #1
Hardware name: IBM System x3650 M4: -[7915AC1]-/00Y8494, BIOS -[VVE128GUS-1.41]- 07/22/2013
task: ffff880745423560 ti: ffff8807f16ac000 task.ti: ffff8807f16ac000
RIP: 0010:[<ffffffff8115145f>] [<ffffffff8115145f>] isolate_migratepages_range+0x43f/0x970
RSP: 0000:ffff8807f16ad878 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000b8d692 RCX: 0000000000005c6b
RDX: 0000000000000002 RSI: 0000000000000003 RDI: 00000000000000af
RBP: ffff8807f16ad928 R08: ffffea0000000000 R09: 0000000000b8d800
R10: ffff88107ffd2200 R11: ffffea00286ed000 R12: ffffea00286eeff0
R13: 0000000000000093 R14: 0000000000000000 R15: ffff88107ffd1d80
FS: 00007f457911c700(0000) GS:ffff88087fca0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000001c CR3: 000000082320d000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 0000000000005c6b ffff88107ffd2200 ffffea0000000000 ffffea00286ed000
 0000000000b8d800 0000000000000005 00ff88087fcb01a0 ffff880745423560
 0000000000880000 0000000000000000 ffff8807f16ad9b8 0000000000000000
Call Trace:
 [<ffffffff81152590>] ? isolate_freepages+0x270/0x270
 [<ffffffff81151c5e>] compact_zone+0x2ce/0x450
 [<ffffffff811521c2>] compact_zone_order+0xa2/0xf0
 [<ffffffff810982b8>] ? update_curr+0x178/0x1c0
 [<ffffffff811522e1>] try_to_compact_pages+0xd1/0x110
 [<ffffffff81137c60>] __alloc_pages_direct_compact+0x90/0x200
 [<ffffffff8113531d>] ? get_page_from_freelist+0x22d/0x710
 [<ffffffff811380f2>] __alloc_pages_slowpath+0x322/0x7c0
 [<ffffffff8113889e>] __alloc_pages_nodemask+0x30e/0x330
 [<ffffffff81177c31>] alloc_pages_vma+0xb1/0x160
 [<ffffffff8118a8eb>] do_huge_pmd_anonymous_page+0x17b/0x310
 [<ffffffff8115ac8b>] handle_mm_fault+0x2db/0x3d0
 [<ffffffff815f66c7>] __do_page_fault+0x167/0x490
 [<ffffffff81011937>] ? __switch_to+0x1a7/0x500
 [<ffffffff815f0bde>] ? __schedule+0x3fe/0x720
 [<ffffffff815f69fe>] do_page_fault+0xe/0x10
 [<ffffffff815f2e18>] page_fault+0x28/0x30
Code: 41 f7 04 24 ff ff ff 01 0f 85 82 fd ff ff 41 8b 44 24 18 85 c0 0f 89 75 fd ff ff 49 8b 04 24 66 85 c0 0f 88 b3 04 00 00 4c 89 e0 <8b> 40 1c 83 f8 01 0f 85 59 fd ff ff 49 8b 44 24 08 48 85 c0 0f
RIP [<ffffffff8115145f>] isolate_migratepages_range+0x43f/0x970
 RSP <ffff8807f16ad878>
CR2: 000000000000001c
---[ end trace 6c6dc51551ae50f1 ]---
ip_tables: (C) 2000-2006 Netfilter Core Team
nr_pdflush_threads exported in /proc is scheduled for removal
sysctl: The scan_unevictable_pages sysctl/node-interface has been disabled for lack of a legitimate use case. If you have one, please send an email to linux-mm@kvack.org.
BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
IP: [<ffffffff8115145f>] isolate_migratepages_range+0x43f/0x970
PGD 8cde19067 PUD 0
Oops: 0000 [#2] SMP
Modules linked in: iptable_filter ip_tables autofs4 ipmi_devintf ipmi_si ipmi_msghandler ipv6 vfat fat ext4 jbd2 mbcache gpio_ich iTCO_wdt iTCO_vendor_support coretemp freq_table mperf intel_powerclamp kvm_intel kvm crc32_pclmul crc32c_intel ghash_clmulni_intel microcode pcspkr sb_edac edac_core i2c_i801 lpc_ich shpchp igb hwmon ptp pps_core sg ioatdma dca xfs libcrc32c sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 megaraid_sas wmi mgag200 ttm drm_kms_helper sysimgblt sysfillrect syscopyarea dm_mirror dm_region_hash dm_log dm_mod
CPU: 22 PID: 3065 Comm: java Tainted: G D 3.10.33-1.el6.elrepo.x86_64 #1
Hardware name: IBM System x3650 M4: -[7915AC1]-/00Y8494, BIOS -[VVE128GUS-1.41]- 07/22/2013
task: ffff880bf3684040 ti: ffff8809280a8000 task.ti: ffff8809280a8000
RIP: 0010:[<ffffffff8115145f>] [<ffffffff8115145f>] isolate_migratepages_range+0x43f/0x970
RSP: 0018:ffff8809280a9878 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 000000000010c162 RCX: 0000000000000860
RDX: ffff8809280a99a8 RSI: 0000000000000004 RDI: 0000000000000084
RBP: ffff8809280a9928 R08: ffffea0000000000 R09: 000000000010c200
R10: ffff88087ffda200 R11: ffffea0003aa0000 R12: ffffea0003aa4d70
R13: 0000000000000163 R14: 0000000000000000 R15: ffff88087ffd9d80
FS: 00007f77f7afa700(0000) GS:ffff88107fd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000001c CR3: 0000001034cf6000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 0000000000000860 ffff88087ffda200 ffffea0000000000 ffffea0003aa0000
 000000000010c200 0000000000000001 0000000000000000 ffff880bf3684040
 00000000280a9928 0000000000000000 ffff8809280a99b8 0000000000000000
Call Trace:
 [<ffffffff81152590>] ? isolate_freepages+0x270/0x270
 [<ffffffff81151c5e>] compact_zone+0x2ce/0x450
 [<ffffffff811521c2>] compact_zone_order+0xa2/0xf0
 [<ffffffff811522e1>] try_to_compact_pages+0xd1/0x110
 [<ffffffff81137c60>] __alloc_pages_direct_compact+0x90/0x200
 [<ffffffff810bf3c3>] ? on_each_cpu_mask+0x53/0x70
 [<ffffffff81138266>] __alloc_pages_slowpath+0x496/0x7c0
 [<ffffffff8113889e>] __alloc_pages_nodemask+0x30e/0x330
 [<ffffffff81177c31>] alloc_pages_vma+0xb1/0x160
 [<ffffffff8118a8eb>] do_huge_pmd_anonymous_page+0x17b/0x310
 [<ffffffff8115ac8b>] handle_mm_fault+0x2db/0x3d0
 [<ffffffff815f66c7>] __do_page_fault+0x167/0x490
 [<ffffffff810982b8>] ? update_curr+0x178/0x1c0
 [<ffffffff81011937>] ? __switch_to+0x1a7/0x500
 [<ffffffff815f0bde>] ? __schedule+0x3fe/0x720
 [<ffffffff815f69fe>] do_page_fault+0xe/0x10
 [<ffffffff815f2e18>] page_fault+0x28/0x30
Code: 41 f7 04 24 ff ff ff 01 0f 85 82 fd ff ff 41 8b 44 24 18 85 c0 0f 89 75 fd ff ff 49 8b 04 24 66 85 c0 0f 88 b3 04 00 00 4c 89 e0 <8b> 40 1c 83 f8 01 0f 85 59 fd ff ff 49 8b 44 24 08 48 85 c0 0f
RIP [<ffffffff8115145f>] isolate_migratepages_range+0x43f/0x970
 RSP <ffff8809280a9878>
CR2: 000000000000001c
---[ end trace 6c6dc51551ae50f2 ]---

Expected results

No kernel oops.
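
Additional information

Because the oops shows up on a different machine each time, it may help to pull the same few fields out of every captured log and compare them across the cluster. The following is a minimal sketch of such a helper, not part of any existing tool: `summarize_oops` and `SAMPLE` are hypothetical names, and the regular expressions assume the 3.10-era oops layout shown in this report.

```python
import re


def summarize_oops(text):
    """Extract fault address, faulting symbol and trapping bytes from an oops.

    Hypothetical triage helper; it assumes the log layout seen in this report.
    """
    fault = re.search(r"NULL pointer dereference at ([0-9a-f]+)", text)
    ip = re.search(
        r"\bIP: \[<[0-9a-f]+>\] ([\w.]+)\+(0x[0-9a-f]+)/0x[0-9a-f]+", text
    )
    code = re.search(r"Code: ((?:<?[0-9a-f]{2}>? ?)+)", text)
    if not (fault and ip and code):
        raise ValueError("text does not look like a NULL-dereference oops")

    tokens = code.group(1).split()
    # The kernel wraps the first byte of the trapping instruction in <..>.
    marker = next((i for i, t in enumerate(tokens) if t.startswith("<")), None)
    trapping = [] if marker is None else [t.strip("<>") for t in tokens[marker:]]

    return {
        "fault_address": int(fault.group(1), 16),
        "symbol": ip.group(1),
        "offset": int(ip.group(2), 16),
        "trapping_bytes": trapping,
    }


# Shortened excerpt of the first oops in this report.
SAMPLE = """BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
IP: [<ffffffff8115145f>] isolate_migratepages_range+0x43f/0x970
Code: 4c 89 e0 <8b> 40 1c 83 f8 01
"""

summary = summarize_oops(SAMPLE)
```

For both oopses above, the bytes at the marker are `8b 40 1c`, which on x86-64 encode a 32-bit load from offset 0x1c of RAX. RAX is 0 in both register dumps, which is consistent with the reported fault address of 0x1c.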