Bug 22642

Summary: 2.6.37-rc1: Disk takes 10 seconds to resume - MacBook2,1
Product: IO/Storage Reporter: Tobias (devnull)
Component: Serial ATAAssignee: Rafael J. Wysocki (rjw)
Severity: normal CC: acpi-bugzilla, devnull, error27, florian, rjw, stern, tj
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.37-rc1 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216, 21782    
Attachments: dmesg

Description Tobias 2010-11-10 19:33:18 UTC
New with 2.6.37-rc1 (from the ubuntu ppa):

[108572.434303] ------------[ cut here ]------------
[108572.434311] WARNING: at /home/kernel-ppa/COD/linux/kernel/power/suspend_test.c:53 suspend_test_finish+0x89/0x90()
[108572.434313] Hardware name: MacBook2,1
[108572.434315] Component: resume devices, time: 17260
[108572.434316] Modules linked in: nfs lockd fscache nfs_acl auth_rpcgss sunrpc nls_utf8 isofs nls_iso8859_1 nls_cp437 vfat fat usb_storage binfmt_misc rfcomm sco bnep l2cap parport_pc ppdev arc4 snd_hda_codec_idt ath9k snd_hda_intel mac80211 snd_hda_codec snd_hwdep snd_pcm snd_seq_midi snd_rawmidi ath9k_common ath9k_hw snd_seq_midi_event snd_seq snd_timer ath snd_seq_device applesmc cfg80211 snd soundcore joydev led_class btusb bluetooth snd_page_alloc isight_firmware input_polldev appletouch lp mbp_nvidia_bl parport sha256_generic cryptd aes_x86_64 aes_generic hid_apple dm_crypt usbhid hid i915 drm_kms_helper drm i2c_algo_bit firewire_ohci sky2 firewire_core intel_agp intel_gtt video crc_itu_t output
[108572.434369] Pid: 15747, comm: pm-suspend Tainted: G        W   2.6.37-020637rc1-generic #201011020905
[108572.434371] Call Trace:
[108572.434376]  [<ffffffff810a44c9>] ? suspend_test_finish+0x89/0x90
[108572.434382]  [<ffffffff810646e0>] warn_slowpath_common+0x90/0xc0
[108572.434385]  [<ffffffff8106480e>] warn_slowpath_fmt+0x6e/0x70
[108572.434390]  [<ffffffff81038649>] ? default_spin_lock_flags+0x9/0x10
[108572.434394]  [<ffffffff812cb607>] ? kobject_put+0x27/0x60
[108572.434398]  [<ffffffff81399ae7>] ? put_device+0x17/0x20
[108572.434402]  [<ffffffff813a3b68>] ? dpm_complete+0x128/0x140
[108572.434406]  [<ffffffff810a44c9>] suspend_test_finish+0x89/0x90
[108572.434409]  [<ffffffff810a4282>] suspend_devices_and_enter+0xc2/0x100
[108572.434412]  [<ffffffff810a43a1>] enter_state+0xe1/0xf0
[108572.434416]  [<ffffffff810a3804>] state_store+0xb4/0xf0
[108572.434419]  [<ffffffff812cb3a7>] kobj_attr_store+0x17/0x20
[108572.434423]  [<ffffffff811ca1a2>] flush_write_buffer+0x62/0x90
[108572.434426]  [<ffffffff811ca2e6>] sysfs_write_file+0x66/0xa0
[108572.434431]  [<ffffffff8115e76c>] vfs_write+0xcc/0x190
[108572.434434]  [<ffffffff8115f195>] sys_write+0x55/0x90
[108572.434437]  [<ffffffff8100c132>] system_call_fastpath+0x16/0x1b
[108572.434440] ---[ end trace bdcb9cd79c701bb3 ]---
Comment 1 Rafael J. Wysocki 2010-11-10 20:18:15 UTC
Apparently, the resume of devices took too much time.

Is this reproducible?
Comment 2 Dan Carpenter 2010-11-14 19:23:15 UTC
I can reproduce it on my eee.  How do I debug this?
Comment 3 Rafael J. Wysocki 2010-11-14 23:18:04 UTC
Please try to boot with initcall_debug in the kernel command line and see what devices take long time to resume.
Comment 4 Dan Carpenter 2010-11-18 20:06:10 UTC
It's my harddrive that takes 10 seconds.

[    1.565105] scsi 0:0:0:0: Direct-Access     ATA      ST9160310AS      0303 PQ: 0 ANSI: 5

[   69.523147] call 0:0:0:0+ returned 0 after 10428995 usecs

I'll look into it some more, but I doubt there is anything that will speed it up.  I suspect this warning may lead to a lot of bogus bug reports where we just have to tell people their hardware sucks.
Comment 5 Rafael J. Wysocki 2010-11-18 21:57:10 UTC
Well, there have not been too many of them and I wouldn't like to overlook
a valid one.

This particular one may indicate a driver problem as well, so I'm reassigning
to Storage/SATA.
Comment 6 Dan Carpenter 2010-11-19 08:15:52 UTC
It get's delayed in ata_wait_ready().

That function has a printk "link is slow to respond, please be patient" if you go over 5 seconds.  I added a printk to see exactly how long it was taking.  It first waits for 6 seconds and triggers the message and it waits for 4 seconds which is where my 10 second delay comes from.

This doesn't look like a kernel bug to me.

Tobias, you may be experiencing a different bug.  I sort of hijacked your bug report.  Could you boot with initcall_debug and post your dmesg?
Comment 7 Tejun Heo 2010-11-19 13:25:22 UTC
Yeah, around 10 seconds is pretty normal duration for disk spinup.  Nothing wrong with Dan's case.  Tobias?
Comment 8 Tobias 2010-11-22 19:58:47 UTC
Created attachment 37882 [details]
Comment 9 Tobias 2010-11-22 20:05:10 UTC
Sorry for the delay, I was on an internet-less vacation.

I have not been able to reproduce this. If it was somewhat storage-related, could a slow nfsmount or a cdrom drive that is starting to fail trigger this?

I have now switched to rc3 and will see what happens. Kernel command line twisting is not my strong point, but from my noobish look at the attached dmesg initcall_debug seems to work. 

I'll report back if this happens again. Please tell me if there is more I can do.
Comment 10 Dan Carpenter 2010-11-23 05:29:45 UTC
We need a suspend resume cycle in your dmesg.  :P
Comment 11 Tobias 2010-11-23 19:39:35 UTC
Created attachment 37962 [details]
Comment 12 Tobias 2010-11-23 19:41:13 UTC
Doh, I blame the bugzilla breakage yesterday. Is the new dmesg of any use?
Comment 13 Dan Carpenter 2010-11-23 20:23:17 UTC
[ 9287.020921] ata3.01: configured for UDMA/133
[ 9287.035898] call 2:0:1:0+ returned 0 after 2379249 usecs
(that's your /dev/sda taking 2 seconds)
[ 9289.840112] usb 1-4: reset high speed USB device using ehci_hcd and address 3
[ 9289.990593] call 1-4+ returned 0 after 5205424 usecs
(that's your usb taking 5 seconds)

The other things take an additional 2 seconds for a total of 9 seconds.
[ 9294.004368] PM: resume of devices complete after 9412.372 msecs

You're on the border.  The warning and the stack dump only get printed if you hit 10 seconds.

I doubt this is a problem that's fixable with software.

Another bad thing about the stack dump is that it taints your kernel with the 'W' for warning.
Comment 14 Rafael J. Wysocki 2010-11-23 22:18:08 UTC
Well, we can drop the stack dump, but then people won't even report the
problem, I guess.
Comment 15 Florian Mickler 2011-02-20 10:57:01 UTC
And about ehci_hcd taking 5 seconds to reset... theres nothing to fix there?
Comment 16 Alan Stern 2011-02-20 22:15:32 UTC
It was 1-4 that took 5 seconds, not usb1 or 0000:00:1d.7 (either of which might reasonably be called ehci_hcd).  What is the 1-4 device, anyway?

Dan's timing calculations in comment #13 aren't right.  The ATA and USB resumes ran in parallel, taking 5 seconds overall.  As for why the USB resume took so long, I can't tell from this log.  Enabling CONFIG_USB_DEBUG might help.

By the way, the real bottleneck is the PNP0C0A:00 device, whatever that is:

[ 9294.000258] call PNP0C0A:00+ returned 0 after 9187140 usecs
Comment 17 Rafael J. Wysocki 2011-02-21 19:54:16 UTC
PNP0C0A is ACPI battery.
Comment 18 Florian Mickler 2011-02-21 22:05:18 UTC
could this be something like https://bugzilla.kernel.org/show_bug.cgi?id=8246  ?
does the kernel boot option "acpi=rsdt" help?
Comment 19 Florian Mickler 2011-04-17 16:15:50 UTC
(In reply to comment #9)

> I'll report back if this happens again. Please tell me if there is more I can
> do.

I guess the >10s resume didn't happen anymore, closing.