Bug 77351 - Intermittent kernel Oops
Summary: Intermittent kernel Oops
Status: NEW
Alias: None
Product: Other
Classification: Unclassified
Component: Other
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: other_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-06-04 19:30 UTC by Liam MacInnes
Modified: 2014-06-04 19:48 UTC
CC List: 3 users

See Also:
Kernel Version: 3.10.33-1 to 3.10.37
Subsystem:
Regression: No
Bisected commit-id:


Description Liam MacInnes 2014-06-04 19:30:49 UTC
Overview
After weeks of normal operation, processes start locking up on a machine. On inspection, a kernel Oops is found. This behaviour is not observed with kernel 2.6.32, but has been seen with 3.10.33-1 through 3.10.37.

Steps to Reproduce

Unable to reproduce intentionally

Oops happens at seemingly random times over long intervals (weeks)

On a cluster of ~100 machines this occurs close to daily on one machine or another


Actual results 

The following kernel Oops occurs (two consecutive Oopses from the same machine are shown):

BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
IP: [<ffffffff8115145f>] isolate_migratepages_range+0x43f/0x970
PGD 80458e067 PUD 0
Oops: 0000 [#1] SMP
Modules linked in: autofs4 ipmi_devintf ipmi_si ipmi_msghandler ipv6 vfat fat ext4 jbd2 mbcache gpio_ich iTCO_wdt iTCO_vendor_support coretemp freq_table mperf intel_powerclamp kvm_intel kvm crc32_pclmul crc32c_intel ghash_clmulni_intel microcode pcspkr sb_edac edac_core i2c_i801 lpc_ich shpchp igb hwmon ptp pps_core sg ioatdma dca xfs libcrc32c sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 megaraid_sas wmi mgag200 ttm drm_kms_helper sysimgblt sysfillrect syscopyarea dm_mirror dm_region_hash dm_log dm_mod
CPU: 5 PID: 14949 Comm: java Not tainted 3.10.33-1.el6.elrepo.x86_64 #1
Hardware name: IBM System x3650 M4: -[7915AC1]-/00Y8494, BIOS -[VVE128GUS-1.41]- 07/22/2013
task: ffff880745423560 ti: ffff8807f16ac000 task.ti: ffff8807f16ac000
RIP: 0010:[<ffffffff8115145f>]  [<ffffffff8115145f>] isolate_migratepages_range+0x43f/0x970
RSP: 0000:ffff8807f16ad878  EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000b8d692 RCX: 0000000000005c6b
RDX: 0000000000000002 RSI: 0000000000000003 RDI: 00000000000000af
RBP: ffff8807f16ad928 R08: ffffea0000000000 R09: 0000000000b8d800
R10: ffff88107ffd2200 R11: ffffea00286ed000 R12: ffffea00286eeff0
R13: 0000000000000093 R14: 0000000000000000 R15: ffff88107ffd1d80
FS:  00007f457911c700(0000) GS:ffff88087fca0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000001c CR3: 000000082320d000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 0000000000005c6b ffff88107ffd2200 ffffea0000000000 ffffea00286ed000
 0000000000b8d800 0000000000000005 00ff88087fcb01a0 ffff880745423560
 0000000000880000 0000000000000000 ffff8807f16ad9b8 0000000000000000
Call Trace:
 [<ffffffff81152590>] ? isolate_freepages+0x270/0x270
 [<ffffffff81151c5e>] compact_zone+0x2ce/0x450
 [<ffffffff811521c2>] compact_zone_order+0xa2/0xf0
 [<ffffffff810982b8>] ? update_curr+0x178/0x1c0
 [<ffffffff811522e1>] try_to_compact_pages+0xd1/0x110
 [<ffffffff81137c60>] __alloc_pages_direct_compact+0x90/0x200
 [<ffffffff8113531d>] ? get_page_from_freelist+0x22d/0x710
 [<ffffffff811380f2>] __alloc_pages_slowpath+0x322/0x7c0
 [<ffffffff8113889e>] __alloc_pages_nodemask+0x30e/0x330
 [<ffffffff81177c31>] alloc_pages_vma+0xb1/0x160
 [<ffffffff8118a8eb>] do_huge_pmd_anonymous_page+0x17b/0x310
 [<ffffffff8115ac8b>] handle_mm_fault+0x2db/0x3d0
 [<ffffffff815f66c7>] __do_page_fault+0x167/0x490
 [<ffffffff81011937>] ? __switch_to+0x1a7/0x500
 [<ffffffff815f0bde>] ? __schedule+0x3fe/0x720
 [<ffffffff815f69fe>] do_page_fault+0xe/0x10
 [<ffffffff815f2e18>] page_fault+0x28/0x30
Code: 41 f7 04 24 ff ff ff 01 0f 85 82 fd ff ff 41 8b 44 24 18 85 c0 0f 89 75 fd ff ff 49 8b 04 24 66 85 c0 0f 88 b3 04 00 00 4c 89 e0 <8b> 40 1c 83 f8 01 0f 85 59 fd ff ff 49 8b 44 24 08 48 85 c0 0f
RIP  [<ffffffff8115145f>] isolate_migratepages_range+0x43f/0x970
 RSP <ffff8807f16ad878>
CR2: 000000000000001c
---[ end trace 6c6dc51551ae50f1 ]---
ip_tables: (C) 2000-2006 Netfilter Core Team
nr_pdflush_threads exported in /proc is scheduled for removal
sysctl: The scan_unevictable_pages sysctl/node-interface has been disabled for lack of a legitimate use case.  If you have one, please send an email to linux-mm@kvack.org.
BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
IP: [<ffffffff8115145f>] isolate_migratepages_range+0x43f/0x970
PGD 8cde19067 PUD 0
Oops: 0000 [#2] SMP
Modules linked in: iptable_filter ip_tables autofs4 ipmi_devintf ipmi_si ipmi_msghandler ipv6 vfat fat ext4 jbd2 mbcache gpio_ich iTCO_wdt iTCO_vendor_support coretemp freq_table mperf intel_powerclamp kvm_intel kvm crc32_pclmul crc32c_intel ghash_clmulni_intel microcode pcspkr sb_edac edac_core i2c_i801 lpc_ich shpchp igb hwmon ptp pps_core sg ioatdma dca xfs libcrc32c sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 megaraid_sas wmi mgag200 ttm drm_kms_helper sysimgblt sysfillrect syscopyarea dm_mirror dm_region_hash dm_log dm_mod
CPU: 22 PID: 3065 Comm: java Tainted: G      D      3.10.33-1.el6.elrepo.x86_64 #1
Hardware name: IBM System x3650 M4: -[7915AC1]-/00Y8494, BIOS -[VVE128GUS-1.41]- 07/22/2013
task: ffff880bf3684040 ti: ffff8809280a8000 task.ti: ffff8809280a8000
RIP: 0010:[<ffffffff8115145f>]  [<ffffffff8115145f>] isolate_migratepages_range+0x43f/0x970
RSP: 0018:ffff8809280a9878  EFLAGS: 00010286
RAX: 0000000000000000 RBX: 000000000010c162 RCX: 0000000000000860
RDX: ffff8809280a99a8 RSI: 0000000000000004 RDI: 0000000000000084
RBP: ffff8809280a9928 R08: ffffea0000000000 R09: 000000000010c200
R10: ffff88087ffda200 R11: ffffea0003aa0000 R12: ffffea0003aa4d70
R13: 0000000000000163 R14: 0000000000000000 R15: ffff88087ffd9d80
FS:  00007f77f7afa700(0000) GS:ffff88107fd40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000001c CR3: 0000001034cf6000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 0000000000000860 ffff88087ffda200 ffffea0000000000 ffffea0003aa0000
 000000000010c200 0000000000000001 0000000000000000 ffff880bf3684040
 00000000280a9928 0000000000000000 ffff8809280a99b8 0000000000000000
Call Trace:
 [<ffffffff81152590>] ? isolate_freepages+0x270/0x270
 [<ffffffff81151c5e>] compact_zone+0x2ce/0x450
 [<ffffffff811521c2>] compact_zone_order+0xa2/0xf0
 [<ffffffff811522e1>] try_to_compact_pages+0xd1/0x110
 [<ffffffff81137c60>] __alloc_pages_direct_compact+0x90/0x200
 [<ffffffff810bf3c3>] ? on_each_cpu_mask+0x53/0x70
 [<ffffffff81138266>] __alloc_pages_slowpath+0x496/0x7c0
 [<ffffffff8113889e>] __alloc_pages_nodemask+0x30e/0x330
 [<ffffffff81177c31>] alloc_pages_vma+0xb1/0x160
 [<ffffffff8118a8eb>] do_huge_pmd_anonymous_page+0x17b/0x310
 [<ffffffff8115ac8b>] handle_mm_fault+0x2db/0x3d0
 [<ffffffff815f66c7>] __do_page_fault+0x167/0x490
 [<ffffffff810982b8>] ? update_curr+0x178/0x1c0
 [<ffffffff81011937>] ? __switch_to+0x1a7/0x500
 [<ffffffff815f0bde>] ? __schedule+0x3fe/0x720
 [<ffffffff815f69fe>] do_page_fault+0xe/0x10
 [<ffffffff815f2e18>] page_fault+0x28/0x30
Code: 41 f7 04 24 ff ff ff 01 0f 85 82 fd ff ff 41 8b 44 24 18 85 c0 0f 89 75 fd ff ff 49 8b 04 24 66 85 c0 0f 88 b3 04 00 00 4c 89 e0 <8b> 40 1c 83 f8 01 0f 85 59 fd ff ff 49 8b 44 24 08 48 85 c0 0f
RIP  [<ffffffff8115145f>] isolate_migratepages_range+0x43f/0x970
 RSP <ffff8809280a9878>
CR2: 000000000000001c
---[ end trace 6c6dc51551ae50f2 ]---
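
Both traces above fault at the same place (isolate_migratepages_range+0x43f, address 0x1c). To check whether Oopses collected from the other machines in the cluster share that signature, something like the following sketch could be used. It is not part of the original report: the function names (summarize_oopses, count_signatures) are made up for illustration, and the regular expressions assume the 3.10-era x86-64 Oops layout quoted above.

```python
import re
from collections import Counter

# Patterns assume the Oops layout shown in this report (kernel 3.10, x86-64).
FAULT_RE = re.compile(r"BUG: unable to handle kernel (.+?) at ([0-9a-f]+)$")
IP_RE = re.compile(r"^IP: \[<[0-9a-f]+>\] (\S+)\+0x([0-9a-f]+)/0x[0-9a-f]+")
CPU_RE = re.compile(r"^CPU: (\d+) PID: (\d+) Comm: (\S+)")


def summarize_oopses(log_text):
    """Return one dict per 'BUG: unable to handle kernel ...' entry."""
    oopses = []
    current = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        fault = FAULT_RE.search(line)
        if fault:
            # A new Oops starts here; later lines fill in its details.
            current = {"kind": fault.group(1),
                       "address": int(fault.group(2), 16)}
            oopses.append(current)
            continue
        if current is None:
            continue
        ip = IP_RE.match(line)
        if ip and "symbol" not in current:
            current["symbol"] = ip.group(1)
            current["offset"] = int(ip.group(2), 16)
            continue
        cpu = CPU_RE.match(line)
        if cpu and "pid" not in current:
            current["cpu"] = int(cpu.group(1))
            current["pid"] = int(cpu.group(2))
            current["comm"] = cpu.group(3)
    return oopses


def count_signatures(oopses):
    """Count Oopses by (symbol, offset, faulting address)."""
    return Counter(
        (o.get("symbol"), o.get("offset"), o.get("address")) for o in oopses
    )
```

Feeding the saved kernel log of each machine through summarize_oopses and merging the results with count_signatures would show whether every occurrence is this same NULL dereference in the compaction path or whether several distinct faults are involved.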


Expected results 
No kernel oops
