Bug 14639

Summary: BUG: unable to handle kernel paging request
Product: Memory Management
Reporter: Vadim Zeitlin (vadim)
Component: Other
Assignee: Andrew Morton (akpm)
Status: CLOSED OBSOLETE
Severity: high
CC: alan
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.30-2-amd64
Subsystem:
Regression: No
Bisected commit-id:

Description Vadim Zeitlin 2009-11-19 01:30:39 UTC
This is a stock install of Debian testing with the 2.6.30 amd64 kernel (I plan to try stock 2.6.32 if I manage to build it) on a new system: Gigabyte GA-P55-UD5 motherboard, i7 860 (HT enabled in BIOS, hence 8 logical CPUs), 8GB RAM, and nothing else special.

The problem occurs when launching make and has been 100% reproducible since the first time it happened. Initially it was with "make -j8", but it also happens with plain "make".

Nov 19 02:07:42 twilight kernel: [  368.369328] BUG: unable to handle kernel paging request at ffff881213111518
Nov 19 02:07:42 twilight kernel: [  368.369331] IP: [<ffffffff802c8b8f>] __link_path_walk+0x5d6/0x70b
Nov 19 02:07:42 twilight kernel: [  368.369336] PGD 202063 PUD 0
Nov 19 02:07:42 twilight kernel: [  368.369338] Oops: 0000 [#2] SMP
Nov 19 02:07:42 twilight kernel: [  368.369339] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:04:06.0/class
Nov 19 02:07:42 twilight kernel: [  368.369341] CPU 6
Nov 19 02:07:42 twilight kernel: [  368.369342] Modules linked in: ppdev lp parport acpi_cpufreq cpufreq_stats cpufreq_powersave cpufreq_userspace cpufreq_conservative binfmt_misc ext2 it87 hwmon_vid coretemp firewire_sbp2 loop snd_hda_codec_realtek usbhid hid snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq firewire_ohci snd_timer firewire_core snd_seq_device crc_itu_t snd sg r8169 soundcore mii sr_mod i2c_i801 uhci_hcd i2c_core snd_page_alloc ehci_hcd cdrom evdev wmi button pcspkr serio_raw processor ext3 jbd mbcache dm_mod sd_mod crc_t10dif thermal fan thermal_sys ahci libata scsi_mod
Nov 19 02:07:42 twilight kernel: [  368.369367] Pid: 2552, comm: make Tainted: G      D W  2.6.30-2-amd64 #1 P55-UD5
Nov 19 02:07:42 twilight kernel: [  368.369368] RIP: 0010:[<ffffffff802c8b8f>]  [<ffffffff802c8b8f>] __link_path_walk+0x5d6/0x70b
Nov 19 02:07:42 twilight kernel: [  368.369370] RSP: 0018:ffff88021d55dd28  EFLAGS: 00010286
Nov 19 02:07:42 twilight kernel: [  368.369372] RAX: ffff880213110c80 RBX: ffff881213111428 RCX: 0000000000000017
Nov 19 02:07:42 twilight kernel: [  368.369373] RDX: ffff88021d55dd48 RSI: ffff880213110c80 RDI: ffff88021d55dd48
Nov 19 02:07:42 twilight kernel: [  368.369374] RBP: ffff88021d55de08 R08: 0000000000000000 R09: ffff88021959c000
Nov 19 02:07:42 twilight kernel: [  368.369375] R10: 00007fff05e5aa9a R11: ffffffff8031eb9e R12: ffff88021d55dd48
Nov 19 02:07:42 twilight kernel: [  368.369377] R13: ffff88021959c000 R14: ffff88021959c01d R15: 0000000000000001
Nov 19 02:07:42 twilight kernel: [  368.369378] FS:  00007f155d9a16f0(0000) GS:ffff8800281a0000(0000) knlGS:0000000000000000
Nov 19 02:07:42 twilight kernel: [  368.369380] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 19 02:07:42 twilight kernel: [  368.369381] CR2: ffff881213111518 CR3: 000000021485a000 CR4: 00000000000006e0
Nov 19 02:07:42 twilight kernel: [  368.369382] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 19 02:07:42 twilight kernel: [  368.369383] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov 19 02:07:42 twilight kernel: [  368.369385] Process make (pid: 2552, threadinfo ffff88021d55c000, task ffff88021d173510)
Nov 19 02:07:42 twilight kernel: [  368.369386] Stack:
Nov 19 02:07:42 twilight kernel: [  368.369386]  00000000000000a3 ffffffff802920c9 0000001714655eb3 ffff88021959c006
Nov 19 02:07:42 twilight kernel: [  368.369388]  ffff88021d245180 ffff880213110c80 ffff88021d55de08 ffff88021c376300
Nov 19 02:07:42 twilight kernel: [  368.369390]  ffff88021d55de08 ffff88021959c000 ffff88021959c000 00000000ffffff9c
Nov 19 02:07:42 twilight kernel: [  368.369393] Call Trace:
Nov 19 02:07:42 twilight kernel: [  368.369394]  [<ffffffff802920c9>] ? find_lock_page+0x15/0x50
Nov 19 02:07:42 twilight kernel: [  368.369397]  [<ffffffff802c91a0>] ? path_walk+0x66/0xc9
Nov 19 02:07:42 twilight kernel: [  368.369399]  [<ffffffff802ca115>] ? do_path_lookup+0x17a/0x1d1
Nov 19 02:07:42 twilight kernel: [  368.369401]  [<ffffffff802cab99>] ? getname+0x13e/0x1a0
Nov 19 02:07:42 twilight kernel: [  368.369403]  [<ffffffff802cb46e>] ? user_path_at+0x48/0x79
Nov 19 02:07:42 twilight kernel: [  368.369404]  [<ffffffff802c3e8e>] ? cp_new_stat+0xe9/0xfc
Nov 19 02:07:42 twilight kernel: [  368.369407]  [<ffffffff802c4052>] ? vfs_fstatat+0x2c/0x57
Nov 19 02:07:42 twilight kernel: [  368.369409]  [<ffffffff802c4145>] ? sys_newstat+0x11/0x30
Nov 19 02:07:42 twilight kernel: [  368.369411]  [<ffffffff8020fa42>] ? system_call_fastpath+0x16/0x1b
Nov 19 02:07:42 twilight kernel: [  368.369415] Code: 24 10 48 89 ef 4c 89 e2 e8 41 f4 ff ff 85 c0 89 c3 0f 85 0d 01 00 00 48 8b 44 24 28 41 f6 c7 01 48 8b 58 10 74 32 48 85 db 74 2d <48> 8b 83 f0 00 00 00 48 83 78 50 00 74 1f 48 89 ee 4c 89 e7 e8
Nov 19 02:07:42 twilight kernel: [  368.369428] RIP  [<ffffffff802c8b8f>] __link_path_walk+0x5d6/0x70b
Nov 19 02:07:42 twilight kernel: [  368.369430]  RSP <ffff88021d55dd28>
Nov 19 02:07:42 twilight kernel: [  368.369431] CR2: ffff881213111518
Nov 19 02:07:42 twilight kernel: [  368.369433] ---[ end trace d234fc8e0b1a0e56 ]---
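
The register dump above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not a diagnosis: it only uses values copied from the oops, plus two assumptions that the report does not confirm. The first is that the direct mapping of physical memory starts at 0xffff880000000000, the usual layout for x86_64 kernels of this era. The second is that the `candidate` pointer it computes is what RBX "should" have held; that value is purely hypothetical.

```python
# Values copied verbatim from the oops above.
CR2 = 0xffff881213111518  # faulting address
RBX = 0xffff881213111428  # base register of the faulting load
RAX = 0xffff880213110c80  # another kernel pointer from the same dump

# The "Code:" line marks the faulting bytes <48> 8b 83 f0 00 00 00,
# which decode to mov 0xf0(%rbx),%rax.
FIELD_OFFSET = 0xf0

# Assumption (not stated in the report): start of the kernel's direct
# mapping of physical memory on x86_64 kernels of this era.
DIRECT_MAP_BASE = 0xffff880000000000

# Mask that drops the low 16 bits so only the upper address bits are compared.
HIGH_MASK = ~0xffff & (2**64 - 1)


def differing_bits(a, b):
    """Return the bit positions at which a and b differ."""
    x = a ^ b
    return [i for i in range(64) if (x >> i) & 1]


def implied_physical_gib(addr):
    """Physical address implied by a direct-map pointer, in GiB."""
    return (addr - DIRECT_MAP_BASE) / 2**30


# 1. Is the fault address exactly RBX plus the load's displacement?
print(CR2 == RBX + FIELD_OFFSET)

# 2. In which upper address bits do RBX and RAX differ?
print(differing_bits(RBX & HIGH_MASK, RAX & HIGH_MASK))

# 3. Hypothetical pointer obtained by clearing bit 36 of RBX,
#    and its distance in bytes from RAX.
candidate = RBX & ~(1 << 36)
print(hex(candidate), candidate - RAX)

# 4. Physical addresses implied by RBX and by the hypothetical pointer.
print(round(implied_physical_gib(RBX), 1),
      round(implied_physical_gib(candidate), 1))
```

Under these assumptions, RBX points roughly 72 GiB into physical memory on a machine with 8GB of RAM, which fits the "PUD 0" (not mapped) line in the oops. Clearing bit 36 alone yields an address just above 8 GiB, about 2KB away from RAX and in the same range as CR3 (000000021485a000). That pattern is consistent with a single flipped bit in a stored pointer, but it does not prove one, and it says nothing about whether the cause was hardware or software.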

Next I removed the unnecessary modules, but I still get exactly the same bug. The only difference is that the modules line now reads:

Modules linked in: binfmt_misc ext2 crc_itu_t sg r8169 mii sr_mod i2c_i801 i2c_core cdrom evdev processor ext3 jbd mbcache dm_mod sd_mod crc_t10dif thermal_sys ahci libata scsi_mod [last unloaded: snd_page_alloc]

The sysfs device referenced above is 104c:8024, the FireWire controller, and I have no idea what it has to do with anything.

This system has serious problems after waking up from suspend (random segfaults in various programs), but this bug started happening immediately after a reboot, albeit a catastrophic one because the login shell had segfaulted. So maybe the file system was somehow corrupted, which is why I am reporting this bug while I still can: I really don't know what is going to happen the next time I reboot.

I'd be glad to help with debugging _if_ the machine remains usable at all, which is far from a safe bet. It seems that 2.6.30 just doesn't like Lynnfield at all...
Comment 1 Andrew Morton 2009-11-19 23:02:31 UTC
Gee.  We have a million machines running that codepath a million times a minute.  It's really well tested.  I expect that your machine has hardware problems.
Comment 2 Vadim Zeitlin 2009-11-22 18:11:55 UTC
FWIW I haven't been able to reproduce this problem since. I do have another, perfectly reproducible, problem on this machine, which I filed in the Fedora bugzilla because I'm not 100% sure that it's a kernel problem (although, considering that I see it under both Fedora and Debian, it might be): https://bugzilla.redhat.com/show_bug.cgi?id=540199

In short, file system data seems to be corrupted after resuming from suspend. Although the bug reported above was observed after a reboot, it is possible that something bad happened to the on-disk data too because of this. If this is indeed the case, there is probably still a bug somewhere in the kernel, as ideally it shouldn't crash even if the file system is corrupted. But I realize that this would be all but impossible to debug, so I'd understand perfectly well if this bug were simply closed.

As for hardware problems, I'd love this to be the case, as it would mean that I could still hope to use Linux on this machine. Unfortunately, several things argue against this hypothesis:
 1. Windows 7 (x64) doesn't show any problems even when redoing the equivalent of the "make -j8" I'm running under Linux a dozen times in a row (this is my primary test because building software is what I'd actually like to use this machine for, I just need to stop testing/crashing it first...)
 2. I ran LinX under Windows and sys_basher under Fedora for several hours without any problems (this was before suspending and resuming under Linux; after resuming, things start crashing immediately, see the Fedora bug report above)
 3. I ran memtest for the entire night (10+ hours) and it didn't find any problems.

So if it is a hardware problem, I really don't know what it could be...

Anyhow, to summarize: this bug is probably an extremely rare consequence of file system corruption caused by the other bug, so this report should probably be closed, as I neither know how to reproduce it nor can provide any more information about it.