Bug 59321
Summary: | [hsw] S4 broken with Haswell | ||
---|---|---|---|
Product: | Drivers | Reporter: | Takashi Iwai (tiwai) |
Component: | Video(DRI - Intel) | Assignee: | intel-gfx-bugs (intel-gfx-bugs) |
Status: | RESOLVED MOVED | ||
Severity: | normal | CC: | ben, daniel, imre.deak, intel-gfx-bugs, jens-bugzilla.kernel.org, mauromol, mmarek, przanoni |
Priority: | P2 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.10,3.11 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
Possible fix
Don't let the GT write to memory after we're suspending.
Idle harder
3.17-rc1 crash "Watchdog detected hard LOCKUP on CPU #x"
3.17-rc1 Crash during hibernate/resume, OOM failures |
Description
Takashi Iwai
2013-06-05 09:27:18 UTC
BTW, I checked fdo bugzilla https://bugs.freedesktop.org/show_bug.cgi?id=63586 and tried to revert the commit mentioned there. It didn't help.

But the Oops pattern shown there (in comment 9) looks similar as what I've seen (procfs path lookup), so this might be the same cause.

I'll try whether the point before the commit mentioned above survives my test case, too.

BTW, I tried S4 stress tests without i915 KMS (nomodeset), and it survived well. It doesn't mean that it must be i915 driver's bug, as Oops implies some memory corruption or such, but at least i915 driver influences a lot on the buggy S4 behavior.

(In reply to comment #1)
> BTW, I checked fdo bugzilla
> https://bugs.freedesktop.org/show_bug.cgi?id=63586
> and tried to revert the commit mentioned there. It didn't help.
>
> But the Oops pattern shown there (in comment 9) looks similar as what I've seen
> (procfs path lookup), so this might be the same cause.
>
> I'll try whether the point before the commit mentioned above survives my test
> case, too.

The kernel at commit 0e8ffe1bf81b crashed after 20 cycles, with similar Oops about procfs path lookup.

Afaik Haswell has been flakey ever since, with random deaths at all kinds of strange places. Thus far I haven't seen any progress on this at all, so I'll escalate this. Been a while since I've had to do a maintainer-drill internally anyway ;-)

(In reply to comment #2)
> BTW, I tried S4 stress tests without i915 KMS (nomodeset), and it survived
> well.

Please define "survived well". Did it crash at least a single time without i915 loaded?

No, it didn't crash at all without i915 for over 100 cycles. No Oops is seen, too. This is meant as "survived well". If it were more than 10000 cycles, I would have concluded that it doesn't crash in normal situation :)

Hi

I want to have exactly the same environment as you have, so I can reproduce it locally.

- Which outputs are you using when reproducing this bug? VGA? eDP? DP? HDMI? DVI?
- Do you attach/remove any of them while trying to reproduce?
- Do you have X running when you suspend?
- How do you suspend? By clicking on some interface or running some specific command?
- Do you suspend from X or do you vt switch away before doing that?

Thanks,
Paulo

Ok, so I have been playing with this for some time today. The machine I've used has only an eDP monitor.

I enabled a bunch of debug options in the Kernel, including kmemleak. I could reproduce the bug a few times, and so far I notice that the S4 problem only happens *after* I see kmemleak complains about some weird stuff that happens when we're trying to turn the eDP panel power on. I have seen this 3 times:

- Boot the machine
- Check for kmemleak on dmesg
- S4 suspend
- Boot again and check for kmemleak on dmesg
- So far, I have only seen crashes after dmesg complains about kmemleak

Another thing I have to point is that the crash happens when I'm already back to X, many seconds after the real reboot.

Do you also observe that?

Thanks,
Paulo

Another bug which looks like a memory corruption and might actually be the same thing as the one you're seeing:

Remove eDP and all other outputs, attach only HDMI. Boot the machine, load i915 and see the error message. Happens 100% of the time for me.

(In reply to comment #7)
> Hi
>
> I want to have exactly the same environment as you have, so I can reproduce it
> locally.
>
> - Which outputs are you using when reproducing this bug? VGA? eDP? DP? HDMI?
> DVI?
> - Do you attach/remove any of them while trying to reproduce?

We've seen the hangs on both laptops and a desktop machine. eDP is used on all laptops, so at least eDP is always connected/used. I forgot about the detail of desktop, but it doesn't have eDP, at least.

It happens without attaching/removing connections. Boot a laptop with eDP only, try S4 a few times, and it hangs.

> - Do you have X running when you suspend?

In most test cases, yes. It's GNOME 2.6 with compiz. But the hang happened without X, too.

> - How do you suspend? By clicking on some interface or running some specific
> command?

The hang happens in all cases. No matter whether the suspend through the button (via pm-suspend), the direct kernel suspend, or via user-space suspend.

> - Do you suspend from X or do you vt switch away before doing that?

When pm-utils hook is running, yes, it's switched to VT1 before doing suspend. But the crash happens even without it by just writing /sys/power/disk on a X terminal.

(In reply to comment #8)
> Ok, so I have been playing with this for some time today. The machine I've used
> has only an eDP monitor.
>
> I enabled a bunch of debug options in the Kernel, including kmemleak. I could
> reproduce the bug a few times, and so far I notice that the S4 problem only
> happens *after* I see kmemleak complains about some weird stuff that happens
> when we're trying to turn the eDP panel power on. I have seen this 3 times:
>
> - Boot the machine
> - Check for kmemleak on dmesg
> - S4 suspend
> - Boot again and check for kmemleak on dmesg
> - So far, I have only seen crashes after dmesg complains about kmemleak
>
> Another thing I have to point is that the crash happens when I'm already back
> to X, many seconds after the real reboot.
>
> Do you also observe that?

OK, will try with kmemleak. I already tried other debug options but it didn't catch anything special before the crash.

The crashing behavior isn't always same. As stated in the bug description, with a luck, you can get the Oops message. With a bad luck, the machine immediately crashes during the resume. Possibly because lots of tasks run via pm-utils resume hooks.

kmemleak didn't catch the error in my case. The machine first shows the general protection fault: 0000 at do_dentry_open.

Paulo, could you give your kernel config showing the kmemleak result, so that I can test on machines here, too?

Created attachment 104441 [details]
Possible fix
Hi
Does this patch help? I found it while debugging another memory corruption from our driver...
Thanks,
Paulo
No dice, unfortunately. It still gives the same Oops after the first S4.

Hi

This really seems to be a memory corruption problem. I already had to fsck my disk twice while debugging this. I think our best bet is to try to bisect this using the linux-stable tree. Do you know any Kernel version that can't reproduce the problem? If you have the time, you could perhaps try to do the bisecting.

Thanks,
Paulo

S4 has been always broken on Haswell, thus it's no bisectable regression. (Otherwise I would have done it :)

And, this seems broken only on Haswell. S4 works fine with older chips (IvyBridge, at least) with the very same kernel.

BTW, the comment 2 is still valid, and HP confirmed that, too. Don't load i915 driver in initrd of a resume kernel (i.e. the kernel that loads the S4 image), then the crash probability goes down from 10% to 2% or less.

Hi

I did some tests, and it seems that if I disable fbcon, vgacon and their friends I can't reproduce the problem. Can you please confirm that?

Also, my tests show that the problem happens even if we don't start X. Can you also confirm that?

In the meantime, I'll keep testing.

Thanks,
Paulo

(In reply to Paulo Zanoni from comment #17)
> Hi
>
> I did some tests, and it seems that if I disable fbcon, vgacon and their
> friends I can't reproduce the problem. Can you please confirm that?

Do you mean to disable the corresponding Kconfig? If so, could you share your kconfig to test here, too?

> Also, my tests show that the problem happens even if we don't start X. Can
> you also confirm that?

Yes, read comment 10 again :)

> In the meantime, I'll keep testing.

Thanks!

(In reply to Takashi Iwai from comment #18)
> (In reply to Paulo Zanoni from comment #17)
> > Hi
> >
> > I did some tests, and it seems that if I disable fbcon, vgacon and their
> > friends I can't reproduce the problem. Can you please confirm that?
>
> Do you mean to disable the corresponding Kconfig? If so, could you share
> your kconfig to test here, too?

Nevermind, I redid the tests and I was still able to reproduce the bug. I'm sorry.

But yes, I meant the .config file. You need CONFIG_EXPERT before you can change the values of fbcon and vgacon.

Hi

I did some more investigation and I discovered the following:

- It seems that, after resuming, if you run "slabinfo -v" (from tools/vm/), there's a good chance you'll see dmesg messages saying we detected corruption on our slabs. It seems to me that it is much much easier to reproduce the bug with "hibernate, resume, run slabinfo -v, check dmesg, hibernate, resume, etc" than with just "hibernate, resume". Can you confirm that?

- It also seems that the bug goes away if the kernel that resumes the machine doesn't load i915.ko. So an experiment you can try is: boot the machine normally, with i915.ko loaded, tell it to hibernate. Then make the machine wake-up, and use the "modprobe.blacklist=i915" option when loading the kernel that will resume the machine. After it resumes, check if the bug is there (possibly with slabinfo -v). The bug should be gone. Can you please confirm that?

Thanks,
Paulo

(In reply to Paulo Zanoni from comment #20)
> Hi
>
> I did some more investigation and I discovered the following:
>
> - It seems that, after resuming, if you run "slabinfo -v" (from tools/vm/),
> there's a good chance you'll see dmesg messages saying we detected
> corruption on our slabs. It seems to me that it is much much easier to
> reproduce the bug with "hibernate, resume, run slabinfo -v, check dmesg,
> hibernate, resume, etc" than with just "hibernate, resume". Can you confirm
> that?

I have no time in this week due to company's event, but I guess this would trigger more often, too.

In my tests, an easy way to reproduce the bug is to run netconsole in background on the Haswell machine. Then it causes after just a couple of S4 cycles.

> - It also seems that the bug goes away if the kernel that resumes the
> machine doesn't load i915.ko. So an experiment you can try is: boot the
> machine normally, with i915.ko loaded, tell it to hibernate. Then make the
> machine wake-up, and use the "modprobe.blacklist=i915" option when loading
> the kernel that will resume the machine. After it resumes, check if the bug
> is there (possibly with slabinfo -v). The bug should be gone. Can you please
> confirm that?

This was already mentioned in the bug description!

Created attachment 110401 [details]
Don't let the GT write to memory after we're suspending.
Takashi, can you please test this patch?
Created attachment 110411 [details]
Idle harder
Please try this patch (with and without the previous, if possible) as well. If you only have time to test 1, please test this with the previous patch.
This one is only compile tested.
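For judging test reports on an intermittent crash like this one, it helps to estimate how many clean S4 cycles are needed before a result means much. The following is only a rough sketch: it assumes each cycle fails independently with a fixed probability, and it plugs in the roughly 10% and 2% per-cycle figures quoted earlier in this report. The function names are made up for illustration.

```c
#include <assert.h>

/*
 * Probability that n consecutive hibernate/resume cycles all pass,
 * assuming each cycle crashes independently with probability p.
 * The independence assumption and the function name are illustrative
 * only; neither comes from this bug report.
 */
static double survive_probability(double p, unsigned int n)
{
	double q = 1.0;

	assert(p >= 0.0 && p <= 1.0);
	while (n--)
		q *= 1.0 - p;
	return q;
}

/*
 * Smallest number of clean cycles after which a still-present bug with
 * per-cycle crash probability p would have shown up with at least the
 * given confidence (for example 0.99).
 */
static unsigned int cycles_needed(double p, double confidence)
{
	unsigned int n = 0;
	double q = 1.0;	/* chance that the bug has stayed hidden so far */

	assert(p > 0.0 && p <= 1.0);
	assert(confidence >= 0.0 && confidence < 1.0);
	while (1.0 - q < confidence) {
		q *= 1.0 - p;
		n++;
	}
	return n;
}
```

Under these assumptions, a bug that crashes 10% of the time would very rarely survive 100 cycles, while a 2% bug would still survive 100 cycles roughly 13% of the time, so longer runs are needed to rule out the rarer case.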
(In reply to Ben Widawsky from comment #22)
> Created attachment 110401 [details]
> Don't let the GT write to memory after we're suspending.
>
> Takashi, can you please test this patch?

I tested this patch and, alone, it seems enough to fix the bug. I tested it yesterday (the similar version which you sent to my personal email) and today (the version on this bugzilla). Both versions survived many hibernate/resume cycles without problems on "slabinfo -v".

The interesting thing is that in two cases a problem happened where "slabinfo -v" got stuck, never finishing. I've never seen this problem before, so it may be caused by your patch, or it may be just another bug that was "hidden" behind the previous easier-to-reproduce bug which you fixed with the patch.

(In reply to Ben Widawsky from comment #23)
> Created attachment 110411 [details]
> Idle harder
>
> Please try this patch (with and without the previous, if possible) as well.
> If you only have time to test 1, please test this with the previous patch.
>
> This one is only compile tested.

This patch alone is not enough to fix the bug: I can still reproduce the problem.

(In reply to Paulo Zanoni from comment #25)
> (In reply to Ben Widawsky from comment #23)
> > Created attachment 110411 [details]
> > Idle harder
> >
> > Please try this patch (with and without the previous, if possible) as well.
> > If you only have time to test 1, please test this with the previous patch.
> >
> > This one is only compile tested.
>
> This patch alone is not enough to fix the bug: I can still reproduce the
> problem.

If anybody else happens to test the patch, please let me know if you see the WARN or DRM_ERROR. Thanks.

Thanks Ben, it looks promising. I'll try to find some time and test the patches.

BTW what are the magic registers 0x4194 and 0x2050? Are they available no matter which GPU generation?

(In reply to Takashi Iwai from comment #27)
> Thanks Ben, it looks promising. I'll try to find some time and test the
> patches.
>
> BTW what are the magic registers 0x4194 and 0x2050? Are they available no
> matter which GPU generation?

0x4194 has existed for some time. It's just hardcoded as a quick hack since all the experiments with Paulo shows the render ring causing issue. See
#define RING_FAULT_REG(ring) (0x4094 + 0x100*(ring)->id)
in drivers/gpu/drm/i915/i915_reg.h

I'm not sure how long 0x2050 has been in the HW. It's a register which is meant for debug purposes only, and let's just say it's magic for now (you can figure out from the code what it should be telling us).

As Paulo already confirmed, the first patch alone seems working. With the first patch, more than 100 S4 cycles with net and proc loads survived and never crashed. Great! \o/

With the second patch, there is no visible change, neither WARN nor DRM_ERROR.

Feel free to take my tested-by tags when submitting to upstream:
Tested-by: Takashi Iwai <tiwai@suse.de>

(In reply to Ben Widawsky from comment #28)
> (In reply to Takashi Iwai from comment #27)
> > BTW what are the magic registers 0x4194 and 0x2050? Are they available no
> > matter which GPU generation?
>
> 0x4194 has existed for some time. It's just hardcoded as a quick hack since
> all the experiments with Paulo shows the render ring causing issue. See
> #define RING_FAULT_REG(ring) (0x4094 + 0x100*(ring)->id) in
> drivers/gpu/drm/i915/i915_reg.h

Thanks for the pointer.

> I'm not sure how long 0x2050 has been in the HW. It's a register which is
> meant for debug purposes only, and let's just say it's magic for now (you
> can figure out from the code what it should be telling us).

OK, I was just curious because the functions are applied globally to all chips.

Let me know when the final patch is ready and upstreamed (hopefully merged in time for 3.12).

I've cleaned up the patches. Unfortunately I cannot reproduce the issue on my machine locally, so I am waiting for someone else on my team to test them. If you'd like to test them, that might speed things up.
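As a side note on the register question above: the quoted RING_FAULT_REG() macro can be evaluated by hand or with a few lines of C. This is a standalone sketch, not driver code. struct ring_sketch and ring_fault_reg_for_id() are hypothetical stand-ins, and only the macro body is taken from the comment.

```c
#include <stdint.h>

/* Hypothetical stand-in for the driver's ring structure; only id matters here. */
struct ring_sketch {
	unsigned int id;
};

/* Macro body as quoted in this report from drivers/gpu/drm/i915/i915_reg.h. */
#define RING_FAULT_REG(ring) (0x4094 + 0x100*(ring)->id)

/* Return the fault register offset that the macro yields for a given ring id. */
static uint32_t ring_fault_reg_for_id(unsigned int id)
{
	struct ring_sketch ring = { .id = id };

	return RING_FAULT_REG(&ring);
}
```

With this macro, 0x4094 is the value for id 0 and 0x4194 the value for id 1. Which engine carries which id is not stated in this report.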
They are the top two patches here:
http://cgit.freedesktop.org/~bwidawsk/drm-intel/commit/?h=bug59321&id=d78498d9ced7d0b9b1b23bdabe02f8467d4d1503

We should be able to hit 3.12.

(In reply to Ben Widawsky from comment #31)
> I've cleaned up the patches. Unfortunately I cannot reproduce the issue on
> my machine locally, so I am waiting for someone else on my team to test
> them. If you'd like to test them, that might speed things up.
>
> They are the top two patches here:
> http://cgit.freedesktop.org/~bwidawsk/drm-intel/commit/
> ?h=bug59321&id=d78498d9ced7d0b9b1b23bdabe02f8467d4d1503
>
> We should be able to hit 3.12.

Whoops, make that top two commits here (I've forced pushed)
http://cgit.freedesktop.org/~bwidawsk/drm-intel/log/?h=bug59321

(In reply to Ben Widawsky from comment #32)
> (In reply to Ben Widawsky from comment #31)
> > I've cleaned up the patches. Unfortunately I cannot reproduce the issue on
> > my machine locally, so I am waiting for someone else on my team to test
> > them. If you'd like to test them, that might speed things up.
> >
> > They are the top two patches here:
> > http://cgit.freedesktop.org/~bwidawsk/drm-intel/commit/
> > ?h=bug59321&id=d78498d9ced7d0b9b1b23bdabe02f8467d4d1503
> >
> > We should be able to hit 3.12.
>
> Whoops, make that top two commits here (I've forced pushed)
> http://cgit.freedesktop.org/~bwidawsk/drm-intel/log/?h=bug59321

In commit 0036ecbfb, hsw_pte_encode() doesn't seem to clear GEN6_PTE_VALID while others do. Is it really correct?

(In reply to Takashi Iwai from comment #33)
> (In reply to Ben Widawsky from comment #32)
> > (In reply to Ben Widawsky from comment #31)
> > > I've cleaned up the patches. Unfortunately I cannot reproduce the issue on
> > > my machine locally, so I am waiting for someone else on my team to test
> > > them. If you'd like to test them, that might speed things up.
> > >
> > > They are the top two patches here:
> > > http://cgit.freedesktop.org/~bwidawsk/drm-intel/commit/
> > > ?h=bug59321&id=d78498d9ced7d0b9b1b23bdabe02f8467d4d1503
> > >
> > > We should be able to hit 3.12.
> >
> > Whoops, make that top two commits here (I've forced pushed)
> > http://cgit.freedesktop.org/~bwidawsk/drm-intel/log/?h=bug59321
>
> In commit 0036ecbfb, hsw_pte_encode() doesn't seem to clear GEN6_PTE_VALID
> while others do. Is it really correct?

Darn. That is not correct. And that's the one platform we're trying to fix :P

Just pushed the fix. Hopefully I can get internal testing soon.

Posted for review.
http://lists.freedesktop.org/archives/intel-gfx/2013-October/034863.html

Ok, fix is merged and has landed:

commit 828c79087cec61eaf4c76bb32c222fbe35ac3930
Author: Ben Widawsky <benjamin.widawsky@intel.com>
Date: Wed Oct 16 09:21:30 2013 -0700

drm/i915: Disable GGTT PTEs on GEN6+ suspend

The fix gets reverted recently, so this bug must be reopened...

I still think applying this fix conditionally to certain chips (either gen >= 7 or only HSW) would be a better workaround. Or, we may apply it only for hibernation, too.

Imo we should attempt to refill the gtt ptes with stolen memory entries. That should address both the snb and hsw issue and looks more solid than trying to block all access to memory. Which apparently can kill the system.

Someone from our side was signed up to do the testing but that didn't seem to happen. Meh ...

Is there any activity on this bug yet? I am tracking (what I think is) the same issue at https://bugs.freedesktop.org/show_bug.cgi?id=78424 and have tested multiple kernels from 3.13 to 3.16.1 and it is always a similar ext4 related Oops that is described above when I resume from hibernation.
In brief:
* mainboard: MSI-B85M (MSI-7817 Haswell chipset, i5-4570 CPU)
* Ubuntu 14.04 LTS (amd64)
* hibernation/resume from the console using 'sudo pm-hibernate' seems to work multiple times, when i915.ko is not loaded
* hibernation/resume from within X will work fine once after reboot (will result in "WARNING: SPLL already enabled" error message like above)
* further attempts always fail on resume with a heap of Oops messages related to ext4 (lookup_fast, __inode_permission, etc.) and the machine will grind to death
* It seems like using 'no_console_suspend=1' will delay the Oops until the second or third resume.

Is there any chance of this being fixed soon? How can I help? I use multiple Haswell based machines and badly need the hibernation functionality.

Thank you!

Raising priority.

I pulled the current "drm-intel-nightly" tree (http://cgit.freedesktop.org/drm-intel/log/?h=drm-intel-nightly, commit 09fcefee...) and tried again.

Setup:
* Ubuntu 14.04 LTS, Kernel 3.16.0+ (3.17rc1 as of now)
* MSI-7817 chipset with i5-4570
* Boot Lubuntu desktop
* start "make -j4" in the above git checkout
* start Firefox with Youtube video
* hibernate and resume in a loop (I tried 4 reboots with 3..5 resume cycles each)

Results:
* No more WARNING: messages upon resume
* Multiple resumes work fine
* About one in every fifth resume the machine grinds to a halt with dozens of OOM killer messages in the logs

So: A big improvement to before (I can hibernate and resume multiple times in a row, even with a loaded machine!). But we're not quite there yet - where do the OOM errors come from? When I hibernate, only ~1.5G out of 8G RAM were actually used.

Created attachment 147091 [details]
3.17-rc1 crash "Watchdog detected hard LOCKUP on CPU #x"
Here is another (it seems) non-OOM related crash on the mentioned kernel during resume after hibernate which I was only able to catch using a digicam (thus sorry for the file format).
Created attachment 147101 [details]
3.17-rc1 Crash during hibernate/resume, OOM failures
Here is another dmesg output. The first file is a clean resume after a hibernate. The second file is a subsequent resume which triggered the mentioned OOM chaos. Unfortunately the first part was cut off because of the limited size of the dmesg buffer.
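When comparing long dmesg captures like these, it can help to flag the first line that carries a known failure marker. A minimal sketch, assuming a plain substring search is good enough: the marker strings are ones that appear in logs and titles quoted in this report, while find_failure_marker() and the marker table are hypothetical helpers, not part of any kernel tool.

```c
#include <stddef.h>
#include <string.h>

/* Failure markers seen in this report; the list is not exhaustive. */
static const char *const failure_markers[] = {
	"general protection fault",
	"BUG: unable to handle kernel",
	"Oops:",
	"WARNING:",
	"hard LOCKUP",
};

/*
 * Return the first entry of failure_markers (in table order) that occurs
 * anywhere in the given log line, or NULL if the line contains none.
 */
static const char *find_failure_marker(const char *line)
{
	size_t i;

	for (i = 0; i < sizeof(failure_markers) / sizeof(failure_markers[0]); i++) {
		if (strstr(line, failure_markers[i]))
			return failure_markers[i];
	}
	return NULL;
}
```

Feeding each line of a saved dmesg file through this function and stopping at the first non-NULL result gives the earliest flagged message, which is usually the most useful one since later traces may be follow-up damage.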
Another crash, after 5 (working) cycles of hibernate/resume during a kernel "make -j4" to keep the machine busy. I have seen ext4_ functions in the call trace often - this might be the same memory corruption that causes the OOM killer to run amok.

[ 976.247858] CPU: 2 PID: 7160 Comm: rm Tainted: G W E 3.17.0-rc1+ #5
[ 976.247876] Hardware name: MSI MS-7817/CSM-B85M-E45 (MS-7817), BIOS V10.5 05/30/2014
[ 976.247895] task: ffff8800c89ee400 ti: ffff880133e08000 task.ti: ffff880133e08000
[ 976.247914] RIP: 0010:[<ffffffff811ba875>] [<ffffffff811ba875>] kmem_cache_alloc+0x75/0x1e0
[ 976.247938] RSP: 0018:ffff880133e0bbb8 EFLAGS: 00010286
[ 976.247951] RAX: 0000000000000000 RBX: ffff88021ea5d340 RCX: 0000000000000338
[ 976.247969] RDX: 0000000000000337 RSI: 0000000000000050 RDI: ffff88020fdf6600
[ 976.247986] RBP: ffff880133e0bbe8 R08: 000000000001b200 R09: ffffffff8128e632
[ 976.248002] R10: ffff880133e0bb38 R11: ffffea00083c3040 R12: 0031000000300000
[ 976.248019] R13: 0000000000000050 R14: ffff88020fdf6600 R15: ffff88020fdf6600
[ 976.248037] FS: 00002b3156a34b80(0000) GS:ffff88021eb00000(0000) knlGS:0000000000000000
[ 976.248056] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 976.248070] CR2: 00002b3156d5dfa0 CR3: 000000011799e000 CR4: 00000000001407e0
[ 976.248087] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 976.248103] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 976.248120] Stack:
[ 976.248125] ffffffff8128e632 ffff88021ea5d340 ffff88020fe9d800 0000000000000000
[ 976.248145] 0000000004cd599c 000000000000599c ffff880133e0bcc0 ffffffff8128e632
[ 976.248166] ffffea00083c3040 0000000000000000 0000000100000000 0000000004cd59a0
[ 976.248186] Call Trace:
[ 976.248194] [<ffffffff8128e632>] ? ext4_free_blocks+0x6d2/0xb40
[ 976.248210] [<ffffffff8128e632>] ext4_free_blocks+0x6d2/0xb40
[ 976.248226] [<ffffffff81281369>] ext4_ext_remove_space+0x7d9/0x1050
[ 976.248242] [<ffffffff812954c9>] ? ext4_es_free_extent+0x59/0x60
[ 976.248258] [<ffffffff81283bd0>] ext4_ext_truncate+0xb0/0xe0
[ 976.248274] [<ffffffff8125cf57>] ext4_truncate+0x387/0x3d0
[ 976.248289] [<ffffffff8125db11>] ext4_evict_inode+0x491/0x4f0
[ 976.248305] [<ffffffff811f23b4>] evict+0xb4/0x180
[ 976.248317] [<ffffffff811f2bf5>] iput+0xf5/0x180
[ 976.248330] [<ffffffff811e4db3>] do_unlinkat+0x193/0x2c0
[ 976.248344] [<ffffffff81021d25>] ? syscall_trace_enter+0x145/0x250
[ 976.248360] [<ffffffff811e859b>] SyS_unlinkat+0x1b/0x40
[ 976.248375] [<ffffffff817513ff>] tracesys+0xe1/0xe6
[ 976.248387] Code: dd 00 00 49 8b 50 08 4d 8b 20 49 8b 40 10 4d 85 e4 0f 84 17 01 00 00 48 85 c0 0f 84 0e 01 00 00 49 63 46 20 48 8d 4a 01 4d 8b 06 <49> 8b 1c 04 4c 89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 b9 49 63
[ 976.248475] RIP [<ffffffff811ba875>] kmem_cache_alloc+0x75/0x1e0
[ 976.248491] RSP <ffff880133e0bbb8>
[ 976.254794] ---[ end trace 9b598b75bf3f05bd ]---
[ 985.828721] r8169 0000:02:00.0 eth0: link up

Could you give a try if the following branch fixes the issue? :
https://github.com/ideak/linux/commits/suspend-fix

No crashes after 8 hibernate/resume cycles with this kernel tree. Great!
However, I got the below backtrace upon the second resume, and after that the network was flaky (about each second resume I had no network access until I hibernated again):

[ 460.342312] r8169 0000:02:00.0 eth0: link down
[ 460.342331] r8169 0000:02:00.0 eth0: link down
[ 460.342369] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 463.191127] r8169 0000:02:00.0 eth0: link up
[ 463.191134] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 559.157551] ------------[ cut here ]------------
[ 559.157557] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x276/0x280()
[ 559.157558] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
[ 559.157559] Modules linked in: bnep(E) rfcomm(E) bluetooth(E) snd_hda_codec_realtek(E) snd_hda_codec_hdmi(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_controller(E) snd_hda_codec(E) snd_hwdep(E) snd_pcm(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) snd_seq_device(E) snd_timer(E) intel_rapl(E) snd(E) soundcore(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) lpc_ich(E) serio_raw(E) kvm_intel(E) mei_me(E) mei(E) kvm(E) shpchp(E) tpm_infineon(E) intel_smartconnect(E) mac_hid(E) parport_pc(E) ppdev(E) lp(E) parport(E) dm_crypt(E) netconsole(E) configfs(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) i915(E) mxm_wmi(E) r8169(E) i2c_algo_bit(E) drm_kms_helper(E) hid_generic(E) aesni_intel(E) ahci(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) libahci(E) mii(E) ablk_helper(E) usbhid(E) cryptd(E) drm(E) hid(E) wmi(E) video(E)
[ 559.157582] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G E 3.17.0-rc4+ #1
[ 559.157583] Hardware name: MSI MS-7817/CSM-B85M-E45 (MS-7817), BIOS V10.5 05/30/2014
[ 559.157584] 0000000000000009 ffff88021ea03d98 ffffffff8174630a ffff88021ea03de0
[ 559.157585] ffff88021ea03dd0 ffffffff8106c8bd 0000000000000000 ffff88020fb4a000
[ 559.157586] ffff88020f812880 0000000000000001 0000000000000000 ffff88021ea03e30
[ 559.157588] Call Trace:
[ 559.157589] <IRQ> [<ffffffff8174630a>] dump_stack+0x45/0x56
[ 559.157595] [<ffffffff8106c8bd>] warn_slowpath_common+0x7d/0xa0
[ 559.157596] [<ffffffff8106c92c>] warn_slowpath_fmt+0x4c/0x50
[ 559.157598] [<ffffffff8166b8c6>] dev_watchdog+0x276/0x280
[ 559.157600] [<ffffffff8166b650>] ? dev_graft_qdisc+0x80/0x80
[ 559.157602] [<ffffffff810cefd6>] call_timer_fn+0x36/0x100
[ 559.157603] [<ffffffff8166b650>] ? dev_graft_qdisc+0x80/0x80
[ 559.157605] [<ffffffff810d076f>] run_timer_softirq+0x20f/0x310
[ 559.157607] [<ffffffff81070865>] __do_softirq+0xf5/0x2e0
[ 559.157609] [<ffffffff81070d25>] irq_exit+0x105/0x110
[ 559.157611] [<ffffffff81751855>] smp_apic_timer_interrupt+0x45/0x60
[ 559.157612] [<ffffffff8174f95d>] apic_timer_interrupt+0x6d/0x80
[ 559.157613] <EOI> [<ffffffff815f4582>] ? poll_idle+0x42/0x90
[ 559.157616] [<ffffffff815f3fc5>] cpuidle_enter_state+0x55/0x170
[ 559.157617] [<ffffffff815f4197>] cpuidle_enter+0x17/0x20
[ 559.157620] [<ffffffff810aadfd>] cpu_startup_entry+0x31d/0x340
[ 559.157623] [<ffffffff81737747>] rest_init+0x77/0x80
[ 559.157625] [<ffffffff81d41084>] start_kernel+0x42f/0x43a
[ 559.157626] [<ffffffff81d40a4e>] ? set_init_arg+0x53/0x53
[ 559.157628] [<ffffffff81d40120>] ? early_idt_handlers+0x120/0x120
[ 559.157629] [<ffffffff81d405ee>] x86_64_start_reservations+0x2a/0x2c
[ 559.157630] [<ffffffff81d40733>] x86_64_start_kernel+0x143/0x152
[ 559.157631] ---[ end trace 94d117e156a45de1 ]---
[ 559.171533] r8169 0000:02:00.0 eth0: link up

The above was on a MSI B85M chipset:
[ 559.157583] Hardware name: MSI MS-7817/CSM-B85M-E45 (MS-7817), BIOS V10.5 05/30/2014

This has so far survived several more hibernate/resume cycles and the above network watchdog failure did not repeat (yet).

On another board (MSI B81M chipset), the same kernel will not boot, it crashes upon boot, I tried to boot several times:

....
Sep 14 22:04:34 desktop kernel: [ 2.199146] AVX2 version of gcm_enc/dec engaged.
Sep 14 22:04:34 desktop kernel: [ 2.199147] AES CTR mode by8 optimization enabled
Sep 14 22:04:34 desktop kernel: [ 2.211116] checking generic (e0000000 500000) vs hw (e0000000 10000000)
Sep 14 22:04:34 desktop kernel: [ 2.211117] fb: switching to inteldrmfb from VESA VGA
Sep 14 22:04:34 desktop kernel: [ 2.211214] Console: switching to colour dummy device 80x25
Sep 14 22:04:34 desktop kernel: [ 2.211235] BUG: unable to handle kernel NULL pointer dereference at 0000000000000100
Sep 14 22:04:34 desktop kernel: [ 2.211238] IP: [<ffffffff8147057e>] con_set_unimap+0x4e/0x270
Sep 14 22:04:34 desktop kernel: [ 2.211239] PGD 213bea067 PUD 213be9067 PMD 0
Sep 14 22:04:34 desktop kernel: [ 2.211240] Oops: 0000 [#1] SMP
Sep 14 22:04:34 desktop kernel: [ 2.211249] Modules linked in: aesni_intel(E+) mxm_wmi(E) bnep(E) snd_seq(E) i915(E+) rfcomm(E+) snd_seq_device(E) aes_x86_64(E) snd_timer(E) lrw(E) bluetooth(E) gf128mul(E) glue_helper(E) ablk_helper(E) drm_kms_helper(E) cryptd(E) snd(E) drm(E) soundcore(E) mei_me(E) serio_raw(E) i2c_algo_bit(E) lpc_ich(E) mei(E) video(E) wmi(E) intel_smartconnect(E) mac_hid(E) tpm_infineon(E) parport_pc(E) ppdev(E) lp(E) parport(E) ahci(E) libahci(E) r8169(E) mii(E)
Sep 14 22:04:34 desktop kernel: [ 2.211251] CPU: 3 PID: 497 Comm: setfont Tainted: G E 3.17.0-rc4+ #3
Sep 14 22:04:34 desktop kernel: [ 2.211251] Hardware name: MSI MS-7817/H81M-P33 (MS-7817), BIOS V1.5 05/30/2014
Sep 14 22:04:34 desktop kernel: [ 2.211252] task: ffff8800d34f0000 ti: ffff8800d34e4000 task.ti: ffff8800d34e4000
Sep 14 22:04:34 desktop kernel: [ 2.211253] RIP: 0010:[<ffffffff8147057e>] [<ffffffff8147057e>] con_set_unimap+0x4e/0x270
Sep 14 22:04:34 desktop kernel: [ 2.211254] RSP: 0018:ffff8800d34e7d50 EFLAGS: 00010246
Sep 14 22:04:34 desktop kernel: [ 2.211254] RAX: ffff88021700c310 RBX: 00000000022b9b10 RCX: ffff88021700c000
Sep 14 22:04:34 desktop kernel: [ 2.211255] RDX: 00000000000010c6 RSI: 0000000000000282 RDI: 0000000000000282
Sep 14 22:04:34 desktop kernel: [ 2.211255] RBP: ffff8800d34e7db8 R08: ffff8800d34e4000 R09: ffff8800d4293200
Sep 14 22:04:34 desktop kernel: [ 2.211255] R10: 000000000000b50d R11: 0000000000000010 R12: 0000000000000000
Sep 14 22:04:34 desktop kernel: [ 2.211256] R13: 0000000000004b67 R14: ffff88021700c000 R15: 00000000ffffffff
Sep 14 22:04:34 desktop kernel: [ 2.211257] FS: 00007f4da387e740(0000) GS:ffff88021fb80000(0000) knlGS:0000000000000000
Sep 14 22:04:34 desktop kernel: [ 2.211257] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 14 22:04:34 desktop kernel: [ 2.211257] CR2: 0000000000000100 CR3: 00000000d345c000 CR4: 00000000001407e0
Sep 14 22:04:34 desktop kernel: [ 2.211258] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 14 22:04:34 desktop kernel: [ 2.211258] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep 14 22:04:34 desktop kernel: [ 2.211258] Stack:
Sep 14 22:04:34 desktop kernel: [ 2.211259] ffffffff810b1052 ffff8800d34e7d80 ffffffff813258ac ffff8800d34f0000
Sep 14 22:04:34 desktop kernel: [ 2.211260] 0000000000000000 0000000000004b67 000002e1d34e7d90 ffff88021700c000
Sep 14 22:04:34 desktop kernel: [ 2.211261] ffff880036bb9400 0000000000000000 0000000000004b67 ffff88021700c000
Sep 14 22:04:34 desktop kernel: [ 2.211261] Call Trace:
Sep 14 22:04:34 desktop kernel: [ 2.211264] [<ffffffff810b1052>] ? up+0x32/0x50
Sep 14 22:04:34 desktop kernel: [ 2.211267] [<ffffffff813258ac>] ? apparmor_capable+0x1c/0x60
Sep 14 22:04:34 desktop kernel: [ 2.211270] [<ffffffff8146a475>] vt_ioctl+0xe75/0x11b0
Sep 14 22:04:34 desktop kernel: [ 2.211271] [<ffffffff8145e06d>] tty_ioctl+0x26d/0xbb0
Sep 14 22:04:34 desktop kernel: [ 2.211273] [<ffffffff811e7250>] do_vfs_ioctl+0x2e0/0x4c0
Sep 14 22:04:34 desktop kernel: [ 2.211275] [<ffffffff8109ab34>] ? vtime_account_user+0x54/0x60
Sep 14 22:04:34 desktop kernel: [ 2.211277] [<ffffffff811e74b1>] SyS_ioctl+0x81/0xa0
Sep 14 22:04:34 desktop kernel: [ 2.211279] [<ffffffff8174eb7f>] tracesys+0xe1/0xe6
Sep 14 22:04:34 desktop kernel: [ 2.211287] Code: 14 48 83 c4 40 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 44 00 00 48 89 d3 e8 e0 c2 c4 ff 48 8b 4d d0 48 8b 81 18 03 00 00 4c 8b 20 <49> 83 bc 24 00 01 00 00 01 0f 86 03 01 00 00 48 89 c8 48 05 18
Sep 14 22:04:34 desktop kernel: [ 2.211288] RIP [<ffffffff8147057e>] con_set_unimap+0x4e/0x270
Sep 14 22:04:34 desktop kernel: [ 2.211288] RSP <ffff8800d34e7d50>
Sep 14 22:04:34 desktop kernel: [ 2.211289] CR2: 0000000000000100
Sep 14 22:04:34 desktop kernel: [ 2.211290] ---[ end trace 4c81f325d41b44a2 ]---

Both run Lubuntu 14.04 with all current updates applied. The kernels were created using make-kpkg with all default choices (during 'make oldconfig').

If you need more debugging logs, just say so, I can help testing. Thanks!

Thanks. I can't say much about the network problem, I think the best would be opening a new bug for it.

Afaics, the console crash happens because i915 switches to the dummy console and then userspace tries to change the font mapping, but the dummy console doesn't support this (its vc_uni_pagedir is never inited).
Could you try the following:

diff --git a/drivers/tty/vt/consolemap.c b/drivers/tty/vt/consolemap.c
index 610b720..eb867fa 100644
--- a/drivers/tty/vt/consolemap.c
+++ b/drivers/tty/vt/consolemap.c
@@ -539,6 +539,9 @@ int con_set_unimap(struct vc_data *vc, ushort ct, struct unipair __user *list)
 
 	/* Save original vc_unipagdir_loc in case we allocate a new one */
 	p = *vc->vc_uni_pagedir_loc;
+
+	if (!p)
+		return -EINVAL;
 
 	if (p->refcount > 1) {
 		int j, k;

(In reply to Imre Deak from comment #48)
> diff --git a/drivers/tty/vt/consolemap.c b/drivers/tty/vt/consolemap.c
> index 610b720..eb867fa 100644
> --- a/drivers/tty/vt/consolemap.c
> +++ b/drivers/tty/vt/consolemap.c
> @@ -539,6 +539,9 @@ int con_set_unimap(struct vc_data *vc, ushort ct, struct
> unipair __user *list)
>
>  	/* Save original vc_unipagdir_loc in case we allocate a new one */
>  	p = *vc->vc_uni_pagedir_loc;
> +
> +	if (!p)
> +		return -EINVAL;
>
>  	if (p->refcount > 1) {
>  		int j, k;

Oops, wrong patch, the correct one:

diff --git a/drivers/tty/vt/consolemap.c b/drivers/tty/vt/consolemap.c
index 610b720..59b25e0 100644
--- a/drivers/tty/vt/consolemap.c
+++ b/drivers/tty/vt/consolemap.c
@@ -539,6 +539,12 @@ int con_set_unimap(struct vc_data *vc, ushort ct, struct unipair __user *list)
 
 	/* Save original vc_unipagdir_loc in case we allocate a new one */
 	p = *vc->vc_uni_pagedir_loc;
+
+	if (!p) {
+		err = -EINVAL;
+
+		goto out_unlock;
+	}
 
 	if (p->refcount > 1) {
 		int j, k;
@@ -623,6 +629,7 @@ int con_set_unimap(struct vc_data *vc, ushort ct, struct unipair __user *list)
 		set_inverse_transl(vc, p, i); /* Update inverse translations */
 	set_inverse_trans_unicode(vc, p);
 
+out_unlock:
 	console_unlock();
 	return err;
 }

The above patch seems to fix the issue. I still have occasional network outages after a resume, another suspend/resume cycle usually fixes this, but the crash is gone (so far). I am running this patch on 3.17 on two machines and they hibernate a few times a day (both MSI-7817 chipsets, one Haswell 81, one 85).
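For anyone following along, here is a minimal userspace sketch of why the first version of the patch was withdrawn. It assumes, as the out_unlock label in the corrected diff suggests, that the NULL check runs with the console lock already held. An early return at that point would leave the lock taken, while routing the error through a single unlock label releases it on every path. All names here (console_locked, set_unimap_sketch, struct uni_pagedir_sketch) are hypothetical stand-ins, not kernel APIs, and the "lock" is just a flag.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the console lock: a plain flag, not a real lock. */
static int console_locked;

static void console_lock_sketch(void)
{
	assert(!console_locked);	/* catch a lock leaked by an earlier call */
	console_locked = 1;
}

static void console_unlock_sketch(void)
{
	assert(console_locked);
	console_locked = 0;
}

/* Hypothetical stand-in for the per-console unicode page directory. */
struct uni_pagedir_sketch {
	int refcount;
};

/*
 * Same shape as the corrected patch: once the lock is taken, every exit,
 * including the "no page directory" error, goes through out_unlock.
 */
int set_unimap_sketch(struct uni_pagedir_sketch **loc)
{
	struct uni_pagedir_sketch *p;
	int err = 0;

	console_lock_sketch();

	p = *loc;
	if (!p) {
		err = -EINVAL;
		goto out_unlock;
	}

	p->refcount++;	/* placeholder for the real work on the directory */

out_unlock:
	console_unlock_sketch();
	return err;
}
```

Calling set_unimap_sketch() with a pointer to NULL should return -EINVAL and leave console_locked at 0, which is the property a bare "return -EINVAL" after taking the lock would break.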
(In reply to Jens from comment #50)
> The above patch seems to fix the issue. I still have occasional network
> outages after a resume, another suspend/resume cycle usually fixes this, but
> the crash is gone (so far). I am running this patch on 3.17 on two machines
> and they hibernate a few times a day (both MSI-7817 chipsets, one Haswell
> 81, one 85).

Thanks! I sent the VT patch to the maintainers. The patches in the suspend-fix branch are under review on the intel-gfx ML, so we can close this bug (possibly) once those get merged.

Jens, have you seen the problem since your last report (with or w/o the fixes)?

Could you still try if you can reproduce the problem with the latest -nightly kernel and the same tree with the fixes reverted (resetting to 598ae05fd937 - "drm/i915: Emit even number of dwords when emitting LRIs").

(In reply to Imre Deak from comment #52)
> Jens, have you seen the problem since your last report (with or w/o the
> fixes)?
>
> Could you still try if you can reproduce the problem with the latest
> -nightly kernel and the same tree with the fixes reverted (resetting to
> 598ae05fd937 - "drm/i915: Emit even number of dwords when emitting LRIs").

Also, let's continue to track this on fdo:
https://bugs.freedesktop.org/show_bug.cgi?id=82864