Bug 95541 - General Protection Fault on suspend to disk
Summary: General Protection Fault on suspend to disk
Status: CLOSED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Power Management
Classification: Unclassified
Component: Hibernation/Suspend
Hardware: x86-64 Linux
Importance: P1 high
Assignee: Rafael J. Wysocki
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-03-25 09:12 UTC by schrott3000
Modified: 2015-07-22 00:11 UTC

See Also:
Kernel Version: 4.0.0-rc6
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Linux Version Script (1.65 KB, text/plain), 2015-03-25 09:12 UTC, schrott3000
Kernel Modules (4.62 KB, text/plain), 2015-03-25 09:13 UTC, schrott3000
PCI Devices (890 bytes, text/plain), 2015-03-25 09:17 UTC, schrott3000
USB Devices (633 bytes, text/plain), 2015-03-25 09:17 UTC, schrott3000
suspend_analyse script stats (80.05 KB, text/html), 2015-03-25 09:20 UTC, schrott3000
suspend_analyse script ftrace (75.76 KB, text/plain), 2015-03-25 09:21 UTC, schrott3000
suspend_analyse script dmesg (5.20 KB, text/plain), 2015-03-25 09:26 UTC, schrott3000
Kernel Messages on Crash (77.20 KB, text/plain), 2015-03-25 09:46 UTC, schrott3000
Crash 2 Dmesg (1.05 KB, text/plain), 2015-03-26 07:47 UTC, schrott3000
Crash 3 New Kernel (5.02 KB, text/plain), 2015-03-26 09:11 UTC, schrott3000
Crash Kernel 3.19.2-1.gf2f9797-vanilla (9.87 KB, text/plain), 2015-03-26 10:05 UTC, schrott3000
dmesg_bug_on_resume.txt (13.21 KB, text/plain), 2015-03-30 14:30 UTC, schrott3000
disassembly_critical.txt (8.84 KB, text/plain), 2015-03-30 14:31 UTC, schrott3000
dmesg-rc5-2-debug-device-detailed.txt (128.63 KB, text/plain), 2015-03-30 14:33 UTC, schrott3000
disassembly_critical.txt (9.01 KB, text/plain), 2015-03-30 16:39 UTC, schrott3000

Description schrott3000 2015-03-25 09:12:52 UTC
Created attachment 172241
Linux Version Script

Overview:

The system is unable to hibernate to disk. Both methods, the in-kernel hibernation and s2disk, have been tried. The kernel crashes with a general protection fault during the hibernation process (see the OOPS message at the end of this report).


How to reproduce:

1. Boot the laptop
2. Log in, open some programs, and request suspend to disk
3. The kernel crashes
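
For the in-kernel method, step 2 boils down to writing "disk" to /sys/power/state. The helper below is only a sketch: the function name hibernate and its optional file argument are made up for this note (the argument exists so the helper can be pointed at a scratch file). s2disk from the uswsusp tools would be invoked directly instead.

```shell
# hibernate [STATE_FILE]: request suspend-to-disk through the kernel's own
# interface. STATE_FILE defaults to /sys/power/state; writing to the real
# file needs root and hibernates the machine immediately.
hibernate() {
    state_file="${1:-/sys/power/state}"
    # The file lists the supported sleep states, e.g. "freeze mem disk".
    if ! grep -qw disk "$state_file"; then
        echo "hibernation (disk) is not offered by $state_file" >&2
        return 1
    fi
    echo disk > "$state_file"
}
```

Calling hibernate as root with no argument starts hibernation right away, so anything unsaved should be written out first.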

This happens at least on my system, but it is certainly hardware dependent.

Actual Result: Kernel Fault
Expected Result: Hibernation and S4 Sleep


Workaround:

Disable pm_async:
# echo 0 > /sys/power/pm_async

With this setting, hibernation works perfectly. I guess this is what the suspend_analyse script does too (its results are attached, but everything works fine when using it).
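
A small sketch for switching the setting and confirming what the kernel reports afterwards. The name set_pm_async and the PM_ASYNC_FILE override are assumptions made for this note, not part of any tool; only /sys/power/pm_async itself is the real interface.

```shell
# Path of the async suspend/resume switch. It can be overridden so the helper
# can be tried against a scratch file (PM_ASYNC_FILE is a name chosen here).
PM_ASYNC_FILE="${PM_ASYNC_FILE:-/sys/power/pm_async}"

# set_pm_async VALUE: write 0 (synchronous) or 1 (asynchronous) and print the
# value read back. Needs root when pointed at the real sysfs file.
set_pm_async() {
    case "$1" in
        0|1) ;;
        *) echo "usage: set_pm_async 0|1" >&2; return 2 ;;
    esac
    printf '%s\n' "$1" > "$PM_ASYNC_FILE" || return 1
    cat "$PM_ASYNC_FILE"
}
```

The setting does not survive a reboot, so "set_pm_async 1" (or simply rebooting) restores the default asynchronous behavior after a test run.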


Additional Information:

System Description:

DELL Latitude D630
CPU: Dual Core (SMP), Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz
ARCH: x86_64
RAM: 4GB
Graphics: Intel GMA x3100 (i915)
Hard-Drive: SATA in ATA mode (no AHCI)
PCMCIA Cards: No cards

Swap-Partition: 5GB Encrypted (LUKS)
System-Partition: EXT4 Encrypted (LUKS LVM)
Boot-Partition: Separate EXT4 partition (not encrypted)
Bootloader: GRUB2
SELINUX: enabled, but unrelated to the hibernation problem (it fails even with SELinux disabled)


Distro:

Name: openSUSE 13.2
Desktop: X11/KF5 using slim


Kernel Info:

- x86_64 kernel
- 2 days old (Build Time: 2015/03/23)
- Built by openSUSE (rpm)
- Vanilla: no SUSE Patches (raw Mainline Code)

OOPS Message:

[  321.286273] pcmcia_socket pcmcia_socket0: pccard: card ejected from slot 0
[  323.284937] general protection fault: 0000 [#1] SMP 
[  323.288011] Modules linked in: bnep bluetooth xt_recent nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack xt_owner xt_tcpudp nf_log_ipv4 nf_log_common xt_LOG xt_limit iptable_filter ip_tables arptable_filter arp_tables x_tables fuse vmw_vsock_vmci_transport vsock vmw_vmci ctr ccm af_packet snd_hda_codec_idt arc4 snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer snd parport_pc ppdev parport tg3 iTCO_wdt iTCO_vendor_support ptp pps_core iwl4965 libphy iwlegacy gpio_ich lpc_ich mfd_core mac80211 joydev dell_laptop soundcore cfg80211 dell_wmi sparse_keymap serio_raw pcspkr 8250_fintek shpchp rfkill wmi coretemp kvm_intel acpi_cpufreq dcdbas tpm_tis processor i2c_i801 ac battery kvm tpm thermal i8k dm_crypt xts gf128mul algif_skcipher af_alg sr_mod cdrom ata_generic pcmcia ata_piix firewire_ohci firewire_core crc_itu_t uhci_hcd ehci_pci ehci_hcd i915 yenta_socket i2c_algo_bit pcmcia_rsrc pcmcia_core drm_kms_helper usbcore usb_common drm button video dm_mirror dm_region_hash dm_log dm_mod sg
[  323.288011] CPU: 0 PID: 321 Comm: kworker/u4:3 Tainted: G        W       4.0.0-rc5-1.g3583a4a-vanilla #1
[  323.288011] Hardware name: Dell Inc. Latitude D630                   /0KU184, BIOS A19 06/04/2013
[  323.288011] Workqueue: events_unbound async_run_entry_fn
[  323.288011] task: ffff880037066090 ti: ffff880036ff8000 task.ti: ffff880036ff8000
[  323.288011] RIP: 0010:[<ffffffff8163c956>]  [<ffffffff8163c956>] klist_next+0xd6/0xf0
[  323.288011] RSP: 0018:ffff880036ffbc98  EFLAGS: 00010283
[  323.288011] RAX: ffff8800d93ae700 RBX: ffff880036ffbcd8 RCX: ffff8800370ef66f
[  323.288011] RDX: ff8800370847f0ff RSI: ffff880036ffbcd8 RDI: ffff8800d93ae700
[  323.288011] RBP: ffff880036ffbcc8 R08: 0000000000000000 R09: ffff880060000490
[  323.288011] R10: 0000000066c8f33e R11: 0000000002da39cc R12: ff8800370847f0f7
[  323.288011] R13: ffffffff81474f10 R14: 0000000000000000 R15: 0000000000000000
[  323.288011] FS:  0000000000000000(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
[  323.288011] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  323.288011] CR2: 00007f3a8c003016 CR3: 0000000063c81000 CR4: 00000000000007f0
[  323.288011] Stack:
[  323.288011]  ffff880036ffbd30 0000000000000000 ffff880036ffbd1f ffffffff81485290
[  323.288011]  ffff88011ab1bf00 0000000000000080 ffff880036ffbd08 ffffffff81474ffe
[  323.288011]  ffff8800d93ae700 0000000000000000 ffff880036ffbd08 ffff8800373ff830
[  323.288011] Call Trace:
[  323.288011]  [<ffffffff81485290>] ? dpm_wait+0x40/0x40
[  323.288011]  [<ffffffff81474ffe>] device_for_each_child+0x4e/0x70
[  323.288011]  [<ffffffff81486332>] __device_suspend+0x42/0x380
[  323.288011]  [<ffffffff8108ff5e>] ? try_to_wake_up+0x1ee/0x320
[  323.288011]  [<ffffffff8148668f>] async_suspend+0x1f/0xa0
[  323.288011]  [<ffffffff81084f1c>] async_run_entry_fn+0x4c/0x160
[  323.288011]  [<ffffffff8107cd8a>] process_one_work+0x14a/0x3f0
[  323.288011]  [<ffffffff8107d461>] worker_thread+0x121/0x460
[  323.288011]  [<ffffffff8107d340>] ? rescuer_thread+0x310/0x310
[  323.288011]  [<ffffffff810823c9>] kthread+0xc9/0xe0
[  323.288011]  [<ffffffff81082300>] ? kthread_create_on_node+0x180/0x180
[  323.288011]  [<ffffffff81651c98>] ret_from_fork+0x58/0x90
[  323.288011]  [<ffffffff81082300>] ? kthread_create_on_node+0x180/0x180
[  323.288011] Code: 41 0f 95 c7 eb 99 0f 1f 80 00 00 00 00 48 8b 03 45 31 ff 48 8b 48 08 4c 8d 61 f8 eb 82 49 8b 54 24 08 4c 8d 62 f8 49 39 c4 74 a3 <f6> 42 f8 01 74 82 eb ea 66 90 e8 0a 0d 01 00 eb 8b 66 0f 1f 84 
[  323.288011] RIP  [<ffffffff8163c956>] klist_next+0xd6/0xf0
[  323.288011]  RSP <ffff880036ffbc98>
[  323.295406] ---[ end trace 601d92f4a439b5e2 ]---
[  323.295442] BUG: unable to handle kernel paging request at ffffffffffffffd8
[  323.295445] IP: [<ffffffff810829b0>] kthread_data+0x10/0x20
[  323.295447] PGD 1c11067 PUD 1c13067 PMD 0 
[  323.295449] Oops: 0000 [#2] SMP 
[  323.295476] Modules linked in: bnep bluetooth xt_recent nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack xt_owner xt_tcpudp nf_log_ipv4 nf_log_common xt_LOG xt_limit iptable_filter ip_tables arptable_filter arp_tables x_tables fuse vmw_vsock_vmci_transport vsock vmw_vmci ctr ccm af_packet snd_hda_codec_idt arc4 snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer snd parport_pc ppdev parport tg3 iTCO_wdt iTCO_vendor_support ptp pps_core iwl4965 libphy iwlegacy gpio_ich lpc_ich mfd_core mac80211 joydev dell_laptop soundcore cfg80211 dell_wmi sparse_keymap serio_raw pcspkr 8250_fintek shpchp rfkill wmi coretemp kvm_intel acpi_cpufreq dcdbas tpm_tis processor i2c_i801 ac battery kvm tpm thermal i8k dm_crypt xts gf128mul algif_skcipher af_alg sr_mod cdrom ata_generic pcmcia ata_piix firewire_ohci firewire_core crc_itu_t uhci_hcd ehci_pci ehci_hcd i915 yenta_socket i2c_algo_bit pcmcia_rsrc pcmcia_core drm_kms_helper usbcore usb_common drm button video dm_mirror dm_region_hash dm_log dm_mod sg
[  323.295487] CPU: 0 PID: 321 Comm: kworker/u4:3 Tainted: G      D W       4.0.0-rc5-1.g3583a4a-vanilla #1
[  323.295488] Hardware name: Dell Inc. Latitude D630                   /0KU184, BIOS A19 06/04/2013
[  323.295498] task: ffff880037066090 ti: ffff880036ff8000 task.ti: ffff880036ff8000
[  323.295500] RIP: 0010:[<ffffffff810829b0>]  [<ffffffff810829b0>] kthread_data+0x10/0x20
[  323.295501] RSP: 0018:ffff880036ffba38  EFLAGS: 00010096
[  323.295502] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000000f
[  323.295503] RDX: 000000000000000e RSI: 0000000000000000 RDI: ffff880037066090
[  323.295504] RBP: ffff880036ffba38 R08: 0000000000000000 R09: 0000000000000246
[  323.295505] R10: ffffffff81fe9188 R11: 000000000000001a R12: ffff880037066090
[  323.295505] R13: 00000000000142c0 R14: 0000000000000000 R15: 0000000000000000
[  323.295507] FS:  0000000000000000(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
[  323.295508] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  323.295509] CR2: 0000000000000028 CR3: 0000000063c81000 CR4: 00000000000007f0
[  323.295510] Stack:
[  323.295512]  ffff880036ffba58 ffffffff8107d825 ffff880036ffba58 ffff88011fc142c0
[  323.295513]  ffff880036ffbaa8 ffffffff8164dec0 ffff880036e66570 ffff880037066090
[  323.295515]  ffff880036ffbaa8 ffff880036ffbfd8 ffff880037066510 0000000000000246
[  323.295515] Call Trace:
[  323.295518]  [<ffffffff8107d825>] wq_worker_sleeping+0x15/0xa0
[  323.295521]  [<ffffffff8164dec0>] __schedule+0x6f0/0x950
[  323.295523]  [<ffffffff8164e157>] schedule+0x37/0x90
[  323.295525]  [<ffffffff8106750c>] do_exit+0x6bc/0xb40
[  323.295528]  [<ffffffff8100698f>] oops_end+0x9f/0xe0
[  323.295530]  [<ffffffff81006efb>] die+0x4b/0x70
[  323.295532]  [<ffffffff81003622>] do_general_protection+0xe2/0x170
[  323.295534]  [<ffffffff81474f10>] ? put_device+0x20/0x20
[  323.295536]  [<ffffffff81653cf8>] general_protection+0x28/0x30
[  323.295537]  [<ffffffff81474f10>] ? put_device+0x20/0x20
[  323.295540]  [<ffffffff8163c956>] ? klist_next+0xd6/0xf0
[  323.295542]  [<ffffffff8163c8a4>] ? klist_next+0x24/0xf0
[  323.295543]  [<ffffffff81485290>] ? dpm_wait+0x40/0x40
[  323.295545]  [<ffffffff81474ffe>] device_for_each_child+0x4e/0x70
[  323.295547]  [<ffffffff81486332>] __device_suspend+0x42/0x380
[  323.295549]  [<ffffffff8108ff5e>] ? try_to_wake_up+0x1ee/0x320
[  323.295551]  [<ffffffff8148668f>] async_suspend+0x1f/0xa0
[  323.295553]  [<ffffffff81084f1c>] async_run_entry_fn+0x4c/0x160
[  323.295555]  [<ffffffff8107cd8a>] process_one_work+0x14a/0x3f0
[  323.295557]  [<ffffffff8107d461>] worker_thread+0x121/0x460
[  323.295559]  [<ffffffff8107d340>] ? rescuer_thread+0x310/0x310
[  323.295561]  [<ffffffff810823c9>] kthread+0xc9/0xe0
[  323.295563]  [<ffffffff81082300>] ? kthread_create_on_node+0x180/0x180
[  323.295564]  [<ffffffff81651c98>] ret_from_fork+0x58/0x90
[  323.295566]  [<ffffffff81082300>] ? kthread_create_on_node+0x180/0x180
[  323.295583] Code: 00 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 48 8b 87 30 05 00 00 55 48 89 e5 <48> 8b 40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 
[  323.295584] RIP  [<ffffffff810829b0>] kthread_data+0x10/0x20
[  323.295585]  RSP <ffff880036ffba38>
[  323.295586] CR2: ffffffffffffffd8
[  323.295587] ---[ end trace 601d92f4a439b5e3 ]---
[  323.295588] Fixing recursive fault but reboot is needed!
[  323.299147] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
[  323.299147] Shutting down cpus with NMI
[  323.299147] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[  323.299147] drm_kms_helper: panic occurred, switching back to text console
[  323.299147] Rebooting in 90 seconds..
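
One way to read the first oops (an interpretation, not something established in this report): the marked code bytes "<f6> 42 f8 01" decode to "testb $0x1,-0x8(%rdx)", so the faulting access is at RDX - 8, which is the value held in R12. On x86-64 with 48-bit virtual addresses, an access to a non-canonical address (bits 63..47 not all equal) raises a general protection fault instead of a page fault. The sketch below checks register values from the dump as plain strings; the helper name is_canonical is made up for this note.

```shell
# is_canonical ADDR: succeed if ADDR (16 lowercase hex digits, no 0x prefix,
# as printed in the register dump) is canonical under 48-bit virtual
# addressing, i.e. bits 63..47 are all ones or all zeros.
is_canonical() {
    case "$1" in
        ffff[89abcdef]???????????) return 0 ;;   # upper (kernel) half
        0000[01234567]???????????) return 0 ;;   # lower (user) half
        *) return 1 ;;
    esac
}

# RAX, RDX and R12 from the first oops in this report.
for addr in ffff8800d93ae700 ff8800370847f0ff ff8800370847f0f7; do
    if is_canonical "$addr"; then
        echo "$addr canonical"
    else
        echo "$addr non-canonical"
    fi
done
```

RAX is reported as canonical while RDX and R12 are not, which would be consistent with klist_next following a damaged list pointer.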
Comment 1 schrott3000 2015-03-25 09:13:49 UTC
Created attachment 172251
Kernel Modules
Comment 2 schrott3000 2015-03-25 09:17:31 UTC
Created attachment 172261
PCI Devices
Comment 3 schrott3000 2015-03-25 09:17:56 UTC
Created attachment 172271
USB Devices
Comment 4 schrott3000 2015-03-25 09:20:49 UTC
Created attachment 172281
suspend_analyse script stats
Comment 5 schrott3000 2015-03-25 09:21:49 UTC
Created attachment 172291
suspend_analyse script ftrace
Comment 6 schrott3000 2015-03-25 09:26:01 UTC
Created attachment 172301
suspend_analyse script dmesg
Comment 7 schrott3000 2015-03-25 09:46:50 UTC
Created attachment 172321
Kernel Messages on Crash

These messages were acquired over a serial connection to another computer with the no_console_suspend option. The crash happens independently of this option.
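
Capturing such a log needs both a serial console and no_console_suspend on the kernel command line, for example "console=ttyS0,115200 no_console_suspend" (the port ttyS0 and the speed 115200 are assumptions; use whatever matches the cable and the receiving machine). A small sketch for checking that the running kernel was actually booted with a given parameter follows; the function name has_boot_param is made up for this note.

```shell
# has_boot_param NAME [FILE]: succeed if the kernel command line in FILE
# (default /proc/cmdline) contains NAME as a bare word or as NAME=value.
has_boot_param() {
    for word in $(cat "${2:-/proc/cmdline}"); do
        case "$word" in
            "$1"|"$1"=*) return 0 ;;
        esac
    done
    return 1
}
```

For example, "has_boot_param no_console_suspend && has_boot_param console" succeeds only if both options are active in the running kernel.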
Comment 8 schrott3000 2015-03-26 07:47:08 UTC
Created attachment 172401
Crash 2 Dmesg

There is another possible outcome when trying to hibernate: the kernel freezes before a hibernation image is created.
(In this attempt PCMCIA was completely disabled; it seems to have nothing to do with this problem.)
Comment 9 Aaron Lu 2015-03-26 08:53:51 UTC
Is this a new problem, or did it also happen on older kernels?
Comment 10 schrott3000 2015-03-26 09:10:50 UTC
Older kernels:
I had similar freezes during suspend with older versions too, but I have no kernel logs for them, so I'm not quite sure whether it was the same problem back then. They froze, but there was no kernel fault followed by a reboot (similar to Crash 2 Dmesg).
NOTE: they don't continue like the newest version does; I waited for several hours.

Newer snapshots:
I'm currently trying a newer kernel version from March 24 (4.0.0-rc5-2.g7636f33-vanilla). So far I have not been able to reproduce the problem with it, but I will keep trying. I had a similar freeze with the newer version too, but the suspend succeeded after an awfully long wait for CPU 1 to come up (I'll attach a dmesg for that too).
Comment 11 schrott3000 2015-03-26 09:11:38 UTC
Created attachment 172431
Crash 3 New Kernel
Comment 12 schrott3000 2015-03-26 09:15:46 UTC
An additional note:
In case of a freeze there is no GPF in the serial log and no reboot.
Comment 13 schrott3000 2015-03-26 10:05:00 UTC
Created attachment 172441
Crash Kernel 3.19.2-1.gf2f9797-vanilla

Kernel 3.19.2-1.gf2f9797-vanilla (current stable):

The kernel freezes too, but without an oops
(I don't get this oops every time in 4.0-rc5-1 either)

dmesg attached

NOTE (if you wonder about the drm oops in dmesg):
The drm oopses are a different, ugly problem that, as far as I can tell, has nothing to do with the hibernation (it is almost completely fixed in 4.0)
Comment 14 schrott3000 2015-03-26 10:37:41 UTC
Kernel 3.16.7 (openSUSE standard vanilla kernel)

I was not able to reproduce the bug with this kernel version. I assume that it doesn't exist there.
Comment 15 schrott3000 2015-03-30 14:29:45 UTC
OK, I did some further testing and first need to correct a few things.
Namely:
- The suggested workaround does not work; it was only luck (see SMP/non-SMP)
  -> New workaround: nomodeset
- The error happens independently of the pm_async option; my first impression was based on the random behavior of this bug when SMP is enabled
- The crash can also occur on resume (see dmesg_bug_on_resume.txt)
- I can no longer be sure that the drm errors have nothing to do with the suspend problems (see BUG analysis)

I tested 4.0.0-rc5-2.g7636f33-vanilla and the same issue is present there, as well as in 4.0.0-rc5-4.g2458897-vanilla and in 4.0.0-rc5 downloaded from kernel.org.

SMP/non-SMP:

I disabled the second core in the BIOS and tried the nosmp option in GRUB2.
Result:
- SMP enabled:
 -> The BUG can occur, but not every time; sometimes there is only a freeze
- SMP disabled:
 -> The BUG occurs reproducibly every time
 -> There are no freezes, but always the oops message

BUG analysis:

I analysed the critical code where the crash happens a bit further;
a disassembly with comments is attached in disassembly_critical.txt

Because klist_next gets called from dpm_wait in this case, I added a few printk calls to the source code to get more information, and attached a full dmesg of this (see dmesg-rc5-2-debug-device-detailed.txt). The result is that the device on which the fault occurs is different each time.
NOTE: I can add files that show the exact changes if you want.
Nevertheless, some driver corrupts the kref lists in some form.

A hint may be in the drm oops (in Kernel Messages on Crash):
WARNING: CPU: 0 PID: 271 at ../include/linux/kref.h:47 drm_framebuffer_reference+0x6e/0x80 [drm]()

WORKAROUND NOMODESET:
I did some further testing, and there are no oopses when I boot with nomodeset; suspend works just fine with this option.
So my guess is that the i915 graphics driver is to blame.
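
To keep nomodeset across reboots it has to be added to the GRUB2 kernel command line. The sketch below assumes the usual openSUSE layout (/etc/default/grub with a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, configuration regenerated with grub2-mkconfig); the helper name append_boot_param is made up for this note, and it only prints the edited file so the result can be reviewed before anything is overwritten.

```shell
# append_boot_param PARAM FILE: print FILE with PARAM appended inside the
# quoted GRUB_CMDLINE_LINUX_DEFAULT value. FILE itself is not modified.
# PARAM is pasted into a sed expression, so it must not contain / & or \.
append_boot_param() {
    sed "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $1\"/" "$2"
}
```

Typical use, as root: run "append_boot_param nomodeset /etc/default/grub > /tmp/grub.new", review the result and copy it into place, then run "grub2-mkconfig -o /boot/grub2/grub.cfg" and reboot. For a one-off test the parameter can instead be typed into the boot entry from the GRUB2 menu editor.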

I know this driver is under active development, as can be seen in the commit logs. I will try the most recent Linux version from git (commit e42391cd048809d903291d07f86ed3934ce138e9) and check whether the bug still exists.
I will report back and close this report if the bug has disappeared.

If you need more information, tell me and I'll try to provide it.
Comment 16 schrott3000 2015-03-30 14:30:48 UTC
Created attachment 172681
dmesg_bug_on_resume.txt

BUG occurring on resume
Comment 17 schrott3000 2015-03-30 14:31:22 UTC
Created attachment 172691
disassembly_critical.txt

Disassembly of the critical section with comments
Comment 18 schrott3000 2015-03-30 14:33:02 UTC
Created attachment 172701
dmesg-rc5-2-debug-device-detailed.txt

Full dmesg with the added debug statements.
(The device being processed during the crash differs from try to try)
Comment 19 schrott3000 2015-03-30 16:39:26 UTC
Created attachment 172711
disassembly_critical.txt

Some minor corrections in the comments
Summary: Some driver creates a corrupted list entry that crashes the kernel
Comment 20 schrott3000 2015-03-31 05:57:47 UTC
BUG INFO:

This BUG is caused by the i915 drm driver. An implementation of drm_framebuffer_unreference is missing, which causes the list to break. This is also what causes the drm bug. There is only a BUG() inside the placeholder function, so this can't work.

RESOLVED:

I can confirm that this bug is fixed in 4.0.0-rc6 by the changes made in drivers/gpu/drm/drm_crtc.c. The missing implementations were added and now everything works just fine.

One last question: Should I set the bug status to CODE_FIX or PATCH_ALREADY_AVAILABLE?
Comment 21 Aaron Lu 2015-03-31 06:06:16 UTC
PATCH_ALREADY_AVAILABLE
