Bug 13149

Summary: system unstable after resume from suspend
Product: Power Management Reporter: Gu Rui (chaos.proton)
Component: Hibernation/SuspendAssignee: power-management_other
Status: CLOSED UNREPRODUCIBLE    
Severity: normal CC: akpm, andyv133, chaos.proton, gavin.kinsey, rjw, rui.zhang, rw, yakui.zhao
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.30-rc3 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216, 11808    
Attachments: full dmesg
config
config for 64bit kernel
dmesg of 64bit kernel
config for 32bit kernel
dmesg of 32bit kernel
Dmesg, including error.
Kernel config for Slackware64 huge 2.6.29.6
acpidump output
lspci -vxxx output

Description Gu Rui 2009-04-22 15:10:25 UTC
2.6.27.21 doesn't have this problem, 2.6.30-rc3 still has the problem.

As the title, system go unstable after resume from suspend(suspend to RAM). Many apps quit or crash. It seems this bug will only be triggered when suspend enough long time, say, about 20mins. If I resume precisely after suspend, everything seems to be normal. Some more details can be found here:(actually it's for 2.6.29.1, but the symptoms alike)

http://www.linuxquestions.org/questions/slackware-14/system-unstable-after-resume-from-suspend-719718/

For convenience, provide some details here:

[ 7962.125970] BUG: unable to handle kernel paging request at f76f5004
[ 7962.125984] IP: [<c01e0129>] ext3_check_dir_entry+0x19/0x140
[ 7962.126000] *pde = 00007067 *pte = 77520002 
[ 7962.126010] Oops: 0000 [#1] SMP 
[ 7962.126017] last sysfs file: /sys/power/state
[ 7962.126024] Modules linked in: radeon drm vboxnetflt vboxdrv snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss ipv6 nls_cp936 vfat fat ext4 jbd2 crc16 fuse dell_laptop dcdbas b43 mac80211 joydev ricoh_mmc sdhci_pci cfg80211 led_class shpchp ohci_hcd sg input_polldev ati_agp agpgart video output
[ 7962.126075] 
[ 7962.126082] Pid: 6509, comm: pm-powersave Not tainted (2.6.29.1-slk-based-2 #4) Inspiron 1501 
[ 7962.126089] EIP: 0060:[<c01e0129>] EFLAGS: 00010292 CPU: 1
[ 7962.126097] EIP is at ext3_check_dir_entry+0x19/0x140
[ 7962.126102] EAX: c0463684 EBX: f76f5000 ECX: f76f5000 EDX: f6c5a478
[ 7962.126108] ESI: f6846a00 EDI: 00002000 EBP: f6c5a478 ESP: f6785cec
[ 7962.126113]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 7962.126120] Process pm-powersave (pid: 6509, ti=f6784000 task=f7031090 task.ti=f6784000)
[ 7962.126126] Stack:
[ 7962.126129]  f7781000 d9e5ad68 f6c35040 c01e6c4d 00000000 00000000 f6785dec f6785dec
[ 7962.126141]  f6785e80 c0463684 00000000 f76f5000 f6846a00 00002000 f6e246c0 c01e86f3
[ 7962.126153]  f6e246c0 00002000 f66c2cc0 f6ba1400 00000000 c01e5a07 0000000b d9c5fbb8
[ 7962.126166] Call Trace:
[ 7962.126171]  [<c01e6c4d>] dx_probe+0x8d/0x350
[ 7962.126180]  [<c01e86f3>] ext3_find_entry+0x403/0x650
[ 7962.126189]  [<c01e5a07>] ext3_truncate+0x317/0x8e0
[ 7962.126197]  [<c01e263a>] __ext3_get_inode_loc+0xda/0x2f0
[ 7962.126205]  [<c012378e>] __wake_up+0x3e/0x60
[ 7962.126215]  [<c01e91a6>] ext3_lookup+0x46/0xf0
[ 7962.126222]  [<c01a4dd7>] d_alloc+0xf7/0x170
[ 7962.126233]  [<c019b8ba>] do_lookup+0x1ba/0x1e0
[ 7962.126240]  [<c019d5d5>] __link_path_walk+0x675/0xd90
[ 7962.126247]  [<c016baca>] generic_file_aio_read+0x2fa/0x6d0
[ 7962.126257]  [<c011bcdc>] do_page_fault+0x28c/0x6a0
[ 7962.126266]  [<c019dea4>] path_walk+0x54/0xb0
[ 7962.126273]  [<c019e000>] do_path_lookup+0xb0/0x160
[ 7962.126280]  [<c019cf26>] getname+0x96/0xd0
[ 7962.126287]  [<c019ec3a>] user_path_at+0x5a/0x90
[ 7962.126295]  [<c0197352>] vfs_stat_fd+0x22/0x60
[ 7962.126305]  [<c019746f>] sys_stat64+0xf/0x30
[ 7962.126313]  [<c010336e>] syscall_call+0x7/0xb
[ 7962.126321] Code: ff ff 83 c4 0c 5b c3 90 90 90 90 90 90 90 90 90 90 90 83 ec 3c 89 6c 24 38 89 5c 24 2c 89 d5 89 74 24 30 89 7c 24 34 89 44 24 24 <0f> b7 41 04 3d ff ff 00 00 74 74 83 f8 0b 89 c6 0f 8f c1 00 00 
[ 7962.126379] EIP: [<c01e0129>] ext3_check_dir_entry+0x19/0x140 SS:ESP 0068:f6785cec
[ 7962.126390] ---[ end trace a30eca9c02841218 ]---
[ 7962.164065] b43-phy0: Radio turned on by software
[ 7962.164076] b43-phy0: The hardware RF-kill button still turns the radio physically off. Press the button to turn it on.
[ 7962.763264] BUG: unable to handle kernel paging request at f76fd000
[ 7962.763286] IP: [<c016f919>] get_page_from_freelist+0x2a9/0x470
[ 7962.763315] *pde = 00007067 *pte = 7c00701a 
[ 7962.763328] Oops: 0002 [#2] SMP 
[ 7962.763335] last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
[ 7962.763345] Modules linked in: radeon drm vboxnetflt vboxdrv snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss ipv6 nls_cp936 vfat fat ext4 jbd2 crc16 fuse dell_laptop dcdbas b43 mac80211 joydev ricoh_mmc sdhci_pci cfg80211 led_class shpchp ohci_hcd sg input_polldev ati_agp agpgart video output
[ 7962.763397] 
[ 7962.763407] Pid: 3085, comm: knotify4 Tainted: G      D    (2.6.29.1-slk-based-2 #4) Inspiron 1501 
[ 7962.763415] EIP: 0060:[<c016f919>] EFLAGS: 00010246 CPU: 0
[ 7962.763424] EIP is at get_page_from_freelist+0x2a9/0x470
[ 7962.763431] EAX: 00000000 EBX: c16edfa0 ECX: 00000400 EDX: 00000003
[ 7962.763436] ESI: 00000000 EDI: f76fd000 EBP: c16edfa0 ESP: e67d5d78
[ 7962.763442]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 7962.763448] Process knotify4 (pid: 3085, ti=e67d4000 task=f73a67c0 task.ti=e67d4000)
[ 7962.763453] Stack:
[ 7962.763457]  00000002 00000044 00000246 f76fd000 c0617920 00000002 00000000 00000000
[ 7962.763468]  001280d2 00000002 00000246 00000001 c0587980 00000000 00000000 c058930c
[ 7962.763480]  c0587c80 c0587140 c0587c80 f73a67c0 e66c5088 001280d2 c016fddd c0589300
[ 7962.763494] Call Trace:
[ 7962.763500]  [<c016fddd>] __alloc_pages_internal+0xbd/0x470
[ 7962.763511]  [<c017eae5>] handle_mm_fault+0x385/0x610
[ 7962.763525]  [<c016e20e>] __rmqueue+0x1e/0x220
[ 7962.763533]  [<c011bc8e>] do_page_fault+0x23e/0x6a0
[ 7962.763545]  [<c0182364>] vma_adjust+0xe4/0x4b0
[ 7962.763553]  [<c0182ce7>] vma_merge+0x1a7/0x280
[ 7962.763559]  [<c01831a3>] do_brk+0x223/0x2e0
[ 7962.763566]  [<c018332e>] sys_brk+0xce/0x110
[ 7962.763572]  [<c011ba50>] do_page_fault+0x0/0x6a0
[ 7962.763579]  [<c0451292>] error_code+0x72/0x78
[ 7962.763595] Code: 24 20 00 79 40 8b 7c 24 2c 85 ff 7e 38 89 dd 31 f6 90 ba 03 00 00 00 89 e8 e8 94 03 fb ff b9 00 04 00 00 89 44 24 0c 89 c7 31 c0 <f3> ab 8b 44 24 0c ba 03 00 00 00 46 83 c5 20 e8 b3 01 fb ff 3b 
[ 7962.763651] EIP: [<c016f919>] get_page_from_freelist+0x2a9/0x470 SS:ESP 0068:e67d5d78
[ 7962.763667] ---[ end trace a30eca9c02841219 ]---
[ 7962.763677] note: knotify4[3085] exited with preempt_count 1
[ 7963.373230] b44: eth0: powering down PHY
[ 7963.421641] ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 7966.988209] b44: eth0: Link is up at 100 Mbps, full duplex.
[ 7966.988220] b44: eth0: Flow control is off for TX and off for RX.
[ 7966.988664] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 7977.589038] eth0: no IPv6 routers present

I may provide more info tomorrow.

Thanks
Comment 1 Rafael J. Wysocki 2009-04-22 19:38:13 UTC
So this seems to be clock-related, doesn't it?
Comment 2 Robby Workman 2009-04-23 04:10:07 UTC
Based on the information given in the original report to me, yes.  I can't reproduce this on any of my hardware though.
Comment 3 Andrew Morton 2009-04-23 23:16:16 UTC
gack, what a nasty bug - crashes all over the place - it looks like memory got trashed.

Rafael, what effect does "NEED_CLOCK_SYNC=1 in /etc/pm/config.d/defaults"
have upon the kernel?
Comment 4 Robby Workman 2009-04-24 19:24:46 UTC
If NEED_CLOCK_SYNC=1, then on suspend, this is run:
        /sbin/hwclock --systohc
and on resume, this is run:
        /sbin/hwclock --hctosys
Comment 5 Gu Rui 2009-04-25 05:20:09 UTC
Created attachment 21110 [details]
full dmesg

Unfortunately, bug still exists in 2.6.30-rc3-git1. I think the key parts are(attachment is full dmesg):
=====================================================
BUG: unable to handle kernel paging request at f76f8000
IP: [<c016f6e1>] get_page_from_freelist+0x2b1/0x480
*pde = 00007067 *pte = 7bec701a 
Oops: 0002 [#1] SMP 
last sysfs file: /sys/devices/pci0000:00/0000:00:14.4/0000:08:00.0/ssb1:0/net/eth0/carrier
Modules linked in: rfkill_input radeon drm snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss ipv6 nls_cp936 vfat fat ext4 jbd2 crc16 cpufreq_conservative cpufreq_performance cpufreq_powersave powernow_k8 freq_table fuse snd_hda_codec_idt b43 snd_hda_intel snd_hda_codec mac80211 snd_hwdep snd_pcm snd_timer snd cfg80211 psmouse soundcore sdhci_pci i2c_piix4 sdhci rtc_cmos dell_laptop mmc_core rtc_core rfkill b44 ati_agp video input_polldev ohci_hcd shpchp ssb agpgart thermal processor thermal_sys mii i2c_core pcmcia pcmcia_core dcdbas ricoh_mmc output ac led_class sg ehci_hcd evdev rtc_lib serio_raw snd_page_alloc battery wmi k8temp hwmon button

Pid: 3627, comm: firefox-bin Not tainted (2.6.30-rc3-smp-00329-g0c8454f #101) Inspiron 1501 
EIP: 0060:[<c016f6e1>] EFLAGS: 00010246 CPU: 0
EIP is at get_page_from_freelist+0x2b1/0x480
EAX: 00000000 EBX: c16edf00 ECX: 00000400 EDX: 00000003
ESI: 00000000 EDI: f76f8000 EBP: c16edf00 ESP: cf9b3e90
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process firefox-bin (pid: 3627, ti=cf9b2000 task=cf99fa70 task.ti=cf9b2000)
Stack:
 00000002 00000044 00000246 f76f8000 c06133c0 00000002 00000000 00000000
 001280d2 00000002 00000246 00000001 c05801c0 00000000 00000000 c0581aac
 c05804a0 c057f9a0 c05804a0 cf99fa70 cf8d3d10 001280d2 c016fbbd c0581aa0
Call Trace:
 [<c016fbbd>] ? __alloc_pages_internal+0xbd/0x480
 [<c017ec55>] ? handle_mm_fault+0x385/0x610
 [<c0119d92>] ? do_page_fault+0x132/0x290
 [<c0119c60>] ? do_page_fault+0x0/0x290
 [<c043fb2d>] ? error_code+0x6d/0x74
 [<c0119c60>] ? do_page_fault+0x0/0x290
Code: 00 79 44 8b 44 24 2c 85 c0 7e 3c 89 dd 31 f6 8d 76 00 ba 03 00 00 00 89 e8 e8 1c e3 fa ff b9 00 04 00 00 89 44 24 0c 89 c7 31 c0 <f3> ab 8b 44 24 0c ba 03 00 00 00 83 c6 01 83 c5 20 e8 89 e1 fa 
EIP: [<c016f6e1>] get_page_from_freelist+0x2b1/0x480 SS:ESP 0068:cf9b3e90
CR2: 00000000f76f8000
---[ end trace 1e8349f6b5d8828f ]---
note: firefox-bin[3627] exited with preempt_count 1
BUG: unable to handle kernel paging request at f76f9bf9
IP: [<c01735d1>] truncate_inode_pages_range+0x301/0x360
*pde = 00007067 *pte = 00336701 
Oops: 0003 [#2] SMP 
last sysfs file: /sys/devices/pci0000:00/0000:00:14.4/0000:08:00.0/ssb1:0/net/eth0/carrier
Modules linked in: rfkill_input radeon drm snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss ipv6 nls_cp936 vfat fat ext4 jbd2 crc16 cpufreq_conservative cpufreq_performance cpufreq_powersave powernow_k8 freq_table fuse snd_hda_codec_idt b43 snd_hda_intel snd_hda_codec mac80211 snd_hwdep snd_pcm snd_timer snd cfg80211 psmouse soundcore sdhci_pci i2c_piix4 sdhci rtc_cmos dell_laptop mmc_core rtc_core rfkill b44 ati_agp video input_polldev ohci_hcd shpchp ssb agpgart thermal processor thermal_sys mii i2c_core pcmcia pcmcia_core dcdbas ricoh_mmc output ac led_class sg ehci_hcd evdev rtc_lib serio_raw snd_page_alloc battery wmi k8temp hwmon button

Pid: 2611, comm: syslogd Tainted: G      D    (2.6.30-rc3-smp-00329-g0c8454f #101) Inspiron 1501 
EIP: 0060:[<c01735d1>] EFLAGS: 00010213 CPU: 0
EIP is at truncate_inode_pages_range+0x301/0x360
EAX: 00000000 EBX: c16edf20 ECX: 00000101 EDX: 00000407
ESI: 00000bf9 EDI: f76f9bf9 EBP: f50f594c ESP: f61a3c60
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process syslogd (pid: 2611, ti=f61a2000 task=f7162cb0 task.ti=f61a2000)
Stack:
 0000000e 00000000 f6f1a030 f76f9000 00000407 00098bf9 00000000 00000099
 ffffffff 00000bf9 00000000 00000000 f50f5974 f50f596c f50f594c 00000292
 c012162e 00000206 00000000 00000003 f54bd03c f71fcc00 f54bd000 00000000
Call Trace:
 [<c012162e>] ? __wake_up+0x3e/0x60
 [<c0173647>] ? truncate_inode_pages+0x17/0x20
 [<c017e75f>] ? vmtruncate+0x14f/0x180
 [<c01e7683>] ? ext3_writeback_write_end+0x103/0x180
 [<c016abb9>] ? generic_file_buffered_write+0x169/0x300
 [<c016b1fa>] ? __generic_file_aio_write_nolock+0x23a/0x550
 [<c013f210>] ? autoremove_wake_function+0x0/0x50
 [<c016bfd0>] ? generic_file_aio_write+0x60/0xe0
 [<c01a54f4>] ? dput+0x84/0x120
 [<c01e4900>] ? ext3_file_write+0x30/0xc0
 [<c01e48d0>] ? ext3_file_write+0x0/0xc0
 [<c0194c0e>] ? do_sync_readv_writev+0xce/0x110
 [<c01a54f4>] ? dput+0x84/0x120
 [<c019ec5b>] ? getname+0x9b/0xd0
 [<c013f210>] ? autoremove_wake_function+0x0/0x50
 [<c0198897>] ? cp_new_stat64+0xf7/0x110
 [<c020d5ac>] ? security_file_permission+0xc/0x10
 [<c0194eea>] ? rw_verify_area+0x5a/0xd0
 [<c01953ca>] ? do_readv_writev+0xaa/0x1a0
 [<c01e48d0>] ? ext3_file_write+0x0/0xc0
 [<c0195503>] ? vfs_writev+0x43/0x60
 [<c0195607>] ? sys_writev+0x47/0x90
 [<c0102ed5>] ? syscall_call+0x7/0xb
Code: 46 a4 fa ff 8b 74 24 24 c7 44 24 10 00 10 00 00 89 c7 29 74 24 10 01 f7 8b 54 24 10 89 44 24 0c 31 c0 c1 ea 02 89 d1 8b 54 24 10 <f3> ab f6 c2 02 74 02 66 ab f6 c2 01 74 01 aa 8b 44 24 0c ba 03 
EIP: [<c01735d1>] truncate_inode_pages_range+0x301/0x360 SS:ESP 0068:f61a3c60
CR2: 00000000f76f9bf9
---[ end trace 1e8349f6b5d88290 ]---
note: syslogd[2611] exited with preempt_count 1
=====================================================

My laptop is DELL1501, AMD Turion(tm) 64 X2 Mobile Technology TL-56. Even with NEED_CLOCK_SYNC=1 in /etc/pm/config.d/defaults system won't survive. But it will run about 30mins. Without NEED_CLOCK_SYNC=1, system crash once I launch something(say, firefox) after resume.
Comment 6 Gu Rui 2009-04-25 05:25:18 UTC
Robby Workman,

What configuration did you use? I use the config shipped with slackware can trigger the bug above.
Comment 7 Robby Workman 2009-04-25 06:55:01 UTC
I'm using 2.6.29.1 (generic+initrd) in our -current tree.  I'm on relatively old hardware (T41) most of the time, but I've not seen a pm regression in a long time (lucky me).  I've even got two desktop boxes that I s2ram quite often, and aside from some issues with the r8169 modules having to be removed before suspend, it's been uneventful there too.  Anyway...

Are you using the hugesmp or genericsmp kernel image?  It shouldn't make a difference, but just in case you're using hugesmp, try genericsmp to be sure.  It's possible, though unlikely, that something built into the hugesmp kernel doesn't like the corresponding module for it coexisting in the module tree.
Comment 8 Gu Rui 2009-04-25 11:54:42 UTC
Created attachment 21113 [details]
config

Currently I'm using a self-compiled 2.6.30-rc3-git1 kernel in order to see whether there is anyone who fixed this bug accidentally. But the configuration is based on the configuration of kernel-generic-smp-2.6.29.1 in our -current tree. The only difference I made is compiling in Ext3 support so there is no need to bother initrd. The attachment is the full configuration.
Comment 9 Gu Rui 2009-05-03 14:48:10 UTC
The bug still exists in 2.6.30-rc4-git1. I notice that in 2.6.27.21, once I resume from a long time suspend, I will have a screen-saver on the screen that just start up. But in recent kernel I couldn't see this. It seems it's too late to /sbin/hwclock --hctosys after resume.
Comment 10 Gu Rui 2009-05-24 06:36:45 UTC
I installed slackware64 recently and found the suspend-resume works perfectly on the same hardware. While, 32 bit kernel still not working... I hope this info could help you to debug it.
Comment 11 Rafael J. Wysocki 2009-05-24 11:01:01 UTC
IOW, with 2.6.30-rc7 64-bit suspend-resume works, while with 2.6.30-rc7 32-bit it doesn't work correctly.  Is that correct?

If that's the case, please attach .config for both the 64-bit and 32-bit kernels.  Also please attach a boot log (dmesg output) for each of them.
Comment 12 Gu Rui 2009-05-24 14:50:05 UTC
Created attachment 21515 [details]
config for 64bit kernel
Comment 13 Gu Rui 2009-05-24 14:50:49 UTC
Created attachment 21516 [details]
dmesg of 64bit kernel
Comment 14 Gu Rui 2009-05-24 14:51:33 UTC
Created attachment 21517 [details]
config for 32bit kernel
Comment 15 Gu Rui 2009-05-24 14:53:40 UTC
Created attachment 21518 [details]
dmesg of 32bit kernel

Crash info included.
Comment 16 Gavin Kinsey 2009-09-10 02:12:22 UTC
*** Bug 14074 has been marked as a duplicate of this bug. ***
Comment 17 Andy V 2009-09-12 04:01:07 UTC
I believe I have this same issue, on the 2.6.29.6 Slackware64 kernel.  Previously, this hardware suspended/resumed perfectly with the 2.6.28-15.49 kernel under 64 bit Ubuntu.  Kernel config and dmesg to follow.
Comment 18 Andy V 2009-09-12 04:02:21 UTC
Created attachment 23080 [details]
Dmesg, including error.
Comment 19 Andy V 2009-09-12 04:04:11 UTC
Created attachment 23081 [details]
Kernel config for Slackware64 huge 2.6.29.6
Comment 20 ykzhao 2009-10-12 15:26:47 UTC
hi, Andy
    Will you please double check whether it can work on the 2.6.29.6 kernel? please wait for about 20 minutes after entering the sleeping state.
    Will you please also attach the output of lspci -vxxx, acpidump.
    Thanks.
Comment 21 Andy V 2009-10-17 02:12:35 UTC
ykzhao, 

I'll post those up after this comment.  Also, I think this might have something to do with cpu freqency scaling.  I put it to sleep, let it sleep a half hour or so, then ran it for a good 45 minutes and never had any issues.  I remembered that I had CPU frequency scaling disabled, re-enabled it, and repeated the process.  It was immediately after it woke up that I received the kernel paging request error in dmesg.

-Andy
Comment 22 Andy V 2009-10-17 02:14:08 UTC
Created attachment 23438 [details]
acpidump output
Comment 23 Andy V 2009-10-17 02:15:11 UTC
Created attachment 23439 [details]
lspci -vxxx output
Comment 24 Zhang Rui 2010-01-14 07:19:26 UTC
Andy and Gu Rui,

does the problem still exists in the latest upstream kernel, say 2.6.32?
Comment 25 Gu Rui 2010-01-15 03:17:42 UTC
Hello Zhang Rui and other kernel devs,

I'm sorry that I don't have 32bit linux in my box now(replaced by an ugly Windows...). So I can't test the 32bit kernel. Anyway, 64bit kernel(2.6.32.2) works very well on my side.

Thanks for your concern.
Comment 26 Zhang Rui 2010-01-18 02:33:27 UTC
close this bug as it can not be reproduced by the original bug reporter.