Bug 62801 - [PATCH] EliteBook hangs at dock, suspend, undock, resume
Summary: [PATCH] EliteBook hangs at dock, suspend, undock, resume
Status: CLOSED CODE_FIX
Alias: None
Product: Power Management
Classification: Unclassified
Component: Hibernation/Suspend
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: Rafael J. Wysocki
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-10-10 09:46 UTC by Tomaž Šolc
Modified: 2014-01-27 08:30 UTC
CC List: 4 users

See Also:
Kernel Version: 3.12.2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
console after hang, part 1 (886.78 KB, image/jpeg)
2013-10-10 09:46 UTC, Tomaž Šolc
console after hang, part 2 (983.85 KB, image/jpeg)
2013-10-10 09:47 UTC, Tomaž Šolc
console after hang, part 3 (217.34 KB, image/jpeg)
2013-10-10 09:49 UTC, Tomaž Šolc
console after hang, part 4 (241.50 KB, image/jpeg)
2013-10-10 09:49 UTC, Tomaž Šolc
dmesg output before suspend (65.88 KB, text/plain)
2013-10-10 09:50 UTC, Tomaž Šolc
Current (incomplete) git bisect log (2.56 KB, text/plain)
2013-11-28 08:47 UTC, Tomaž Šolc
git bisect log (4.83 KB, text/plain)
2013-12-09 08:38 UTC, Tomaž Šolc
patch to revert 839a8e8 for v3.10 (18.76 KB, patch)
2013-12-09 10:12 UTC, Tomaž Šolc
wb-hang-fix.patch (3.14 KB, patch)
2013-12-13 20:38 UTC, Tejun Heo
wb-hang-fix-v2.patch (4.81 KB, patch)
2013-12-17 12:51 UTC, Tejun Heo

Description Tomaž Šolc 2013-10-10 09:46:48 UTC
Created attachment 110621
console after hang, part 1

Hi

HP EliteBook 8460p hangs if undocked from the docking station while the system is suspended.


How to reproduce:

While the laptop is docked, run "pm-suspend". Remove laptop from dock and press the power button. Virtual console switching still works, but all processes are frozen.

The hang is reproducible on 3.11.4 and 3.11.2. Resume works correctly on 3.9.6.


Hung task timeout shows hang in kworker.
Workqueue: events ata_scsi_hotplug [libata]
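
If the console output has been saved as text, the hung-task report and the workqueue it names can be picked out with a small filter. This is only a sketch: it assumes the hung-task detector's usual "blocked for more than N seconds" wording, and the function name hung_task_summary and the sample input are made up for illustration (the sample is not copied from the screenshots).

```shell
#!/bin/sh
# Print hung-task reports and workqueue lines from a kernel log on stdin.
# Assumption: the detector uses the "blocked for more than N seconds" wording.
hung_task_summary() {
    grep -E 'blocked for more than [0-9]+ seconds|Workqueue: '
}

# Synthetic sample input, for illustration only.
sample='INFO: task kworker/0:1:42 blocked for more than 120 seconds.
Workqueue: events ata_scsi_hotplug [libata]
ata4.00: disabled'

# Prints the first two lines of the sample and drops the third.
printf '%s\n' "$sample" | hung_task_summary
```

On a live system the same filter could be fed from dmesg instead of a saved file.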


See attached screenshots for the complete call trace. I can provide more information if needed.

Thanks.
Comment 1 Tomaž Šolc 2013-10-10 09:47:27 UTC
Created attachment 110631
console after hang, part 2
Comment 2 Tomaž Šolc 2013-10-10 09:49:02 UTC
Created attachment 110641
console after hang, part 3
Comment 3 Tomaž Šolc 2013-10-10 09:49:45 UTC
Created attachment 110651
console after hang, part 4
Comment 4 Tomaž Šolc 2013-10-10 09:50:54 UTC
Created attachment 110661
dmesg output before suspend
Comment 5 Aaron Lu 2013-11-14 06:48:01 UTC
One of the disks failed to resume. Does v3.10 work OK?
Comment 6 Tomaž Šolc 2013-11-14 09:09:19 UTC
I tested v3.10 and it still hangs.

It is definitely related to a hard drive. The docking station has a SATA disk mounted in it. If I remove the disk from the docking station, dock the laptop, suspend, undock and resume the laptop, the hang does not happen.
Comment 7 Aaron Lu 2013-11-19 03:08:06 UTC
The disk was left in the dock, so on resume the system failed to detect it. Is there a filesystem mounted on it?
Comment 8 Tomaž Šolc 2013-11-19 09:08:01 UTC
I checked again. The kernel always hangs, whether a filesystem is mounted on the disk in the dock or not.

In all my earlier tests, no filesystem was mounted on the disk.

If you check "console after hang, part 3" attachment, this is the suspicious part: ("sdb" is the disk in the docking station):

ata4.00: disabled
ata4.00: detaching (SCSI 3:0:0:0)
sd 3:0:0:0: [sdb] starting disk
dpm_run_callback(): scsi_bus_resume+0x0/0x24 [scsi_mod] returns -19
PM: Device 3:0:0:0 failed to resume async: error -19
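
For anyone reading along, the -19 above is -ENODEV ("No such device") on Linux. A rough way to pull the failing callback and its error code out of a saved log, assuming the dpm_run_callback() line format shown above (the function name pm_callback_errors is hypothetical):

```shell
#!/bin/sh
# Print "<callback> <error>" for each dpm_run_callback() failure line on stdin.
# Assumption: lines look like
#   dpm_run_callback(): <callback>+<offset> [<module>] returns <negative errno>
pm_callback_errors() {
    sed -n 's/^.*dpm_run_callback(): \([A-Za-z0-9_]*\)+.* returns \(-[0-9][0-9]*\)$/\1 \2/p'
}

line='dpm_run_callback(): scsi_bus_resume+0x0/0x24 [scsi_mod] returns -19'
printf '%s\n' "$line" | pm_callback_errors   # prints: scsi_bus_resume -19
```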
Comment 9 Aaron Lu 2013-11-20 00:27:34 UTC
(In reply to Tomaž Šolc from comment #8)
> If you check "console after hang, part 3" attachment, this is the suspicious
> part: ("sdb" is the disk in the docking station):
> 
> ata4.00: disabled
> ata4.00: detaching (SCSI 3:0:0:0)
> sd 3:0:0:0: [sdb] starting disk
> dpm_run_callback(): scsi_bus_resume+0x0/0x24 [scsi_mod] returns -19
> PM: Device 3:0:0:0 failed to resume async: error -19

Yes, I saw that. That error is expected since the disk is not there anymore. I'm thinking if losing a disk during resume should hang the system or not.
Comment 10 Tomaž Šolc 2013-11-20 10:55:58 UTC
Excuse me, but I fail to see how hanging the system would be the appropriate response here, especially since the disk is not even mounted. From my point of view this is no different than removing the disk while the system is running.

Would it help you if I bisect between v3.10 and v3.9.6 to find the commit that introduced this bug?
Comment 11 Rafael J. Wysocki 2013-11-20 11:45:21 UTC
On Wednesday, November 20, 2013 12:27:34 AM bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=62801
> 
> --- Comment #9 from Aaron Lu <aaron.lu@intel.com> ---
> (In reply to Tomaž Šolc from comment #8)
> > If you check "console after hang, part 3" attachment, this is the suspicious
> > part: ("sdb" is the disk in the docking station):
> > 
> > ata4.00: disabled
> > ata4.00: detaching (SCSI 3:0:0:0)
> > sd 3:0:0:0: [sdb] starting disk
> > dpm_run_callback(): scsi_bus_resume+0x0/0x24 [scsi_mod] returns -19
> > PM: Device 3:0:0:0 failed to resume async: error -19
> 
> Yes, I saw that. That error is expected since the disk is not there anymore.
> I'm thinking if losing a disk during resume should hang the system or not.

No, it shouldn't.
Comment 12 Alan 2013-11-20 11:47:40 UTC
And yes a bisect would be very helpful
Comment 13 Aaron Lu 2013-11-28 02:00:47 UTC
Any news on the bisect?
Comment 14 Tomaž Šolc 2013-11-28 08:46:24 UTC
(In reply to Aaron Lu from comment #13)
> Any news on the bisect?

It's taking longer than I thought. The initial estimate I think was 11 steps. I did 14 tests so far and git bisect still says "roughly 9 steps left".
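
For reference, an estimate like 11 steps is roughly what plain halving gives: each test cuts the remaining candidates about in half, so n commits need about ceil(log2(n)) tests. A minimal sketch of that arithmetic (bisect_steps is a made-up helper, not a git command, and git's own estimate on a history with merges may differ):

```shell
#!/bin/sh
# Rough number of bisection steps for n candidate commits: ceil(log2(n)).
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))      # halve, rounding up
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

bisect_steps 2048   # prints 11
```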

I would welcome any ideas how to speed it up.

I'm attaching the current bisect log.
Comment 15 Tomaž Šolc 2013-11-28 08:47:25 UTC
Created attachment 116511
Current (incomplete) git bisect log
Comment 16 Tomaž Šolc 2013-12-05 09:42:01 UTC
Just a note that v3.12.2 still has this problem.
Comment 17 Tomaž Šolc 2013-12-09 08:38:02 UTC
(In reply to Aaron Lu from comment #13)
> Any news on the bisect?

839a8e8660b6777e7fe4e80af1a048aebe2b5977 is the first bad commit:

Author: Tejun Heo <tj@kernel.org>
Date:   Mon Apr 1 19:08:06 2013 -0700

    writeback: replace custom worker pool implementation with unbound workqueue
Comment 18 Tomaž Šolc 2013-12-09 08:38:49 UTC
Created attachment 117911
git bisect log
Comment 19 Aaron Lu 2013-12-09 08:45:23 UTC
So if you do:
$ git reset --hard 839a8e8660b6777e7fe4e80af1a048aebe2b5977
and then build the kernel, the problem is there; then, on that tree, you revert that commit alone and rebuild the kernel, and the problem is gone. Is this correct?
Comment 20 Tomaž Šolc 2013-12-09 10:08:05 UTC
(In reply to Aaron Lu from comment #19)
> $ git reset --hard 839a8e8660b6777e7fe4e80af1a048aebe2b5977

This kernel hangs after undock.

$ git checkout 181387da2d64c3129e5b5186c4dd388bc5041d53

(Commit preceding 839a8e) 
This kernel doesn't hang.

$ git checkout v3.10

This kernel hangs after undock.

$ git checkout v3.10
$ git revert 839a8e8660b6777e7fe4e80af1a048aebe2b5977

This kernel doesn't hang.

(git revert requires manual editing - I'm attaching the exact patch I tested)
Comment 21 Tomaž Šolc 2013-12-09 10:12:12 UTC
Created attachment 117921
patch to revert 839a8e8 for v3.10
Comment 22 Aaron Lu 2013-12-10 03:11:14 UTC
Add Tejun.

Hi Tejun,

It seems that commit 839a8e8660b6777e7fe4e80af1a048aebe2b5977 "writeback: replace custom worker pool implementation with unbound workqueue" breaks resume for Tomaž. Can you please take a look? Thanks.
Comment 23 Tejun Heo 2013-12-11 22:29:55 UTC
Hah, it looks like we somehow lost writeback requests while detaching a device and ata_scsi_hotplug() ends up waiting for writeback to finish indefinitely. We have some issues with bdi shutdown sequence. I'll look into it and report back.

Thanks.
Comment 24 Tejun Heo 2013-12-13 20:38:40 UTC
Created attachment 118311
wb-hang-fix.patch

Can you please try the attached patch?

FYI, the upstream discussion is taking place in the following thread:

 http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org

Thanks.
Comment 25 Tomaž Šolc 2013-12-15 20:44:35 UTC
(In reply to Tejun Heo from comment #24)
> Can you please try the attached patch?

When I apply your patch to either v3.10 or v3.12.2 I get the following error when compiling the kernel (gcc 4.7.2)

$ make modules
...
  Building modules, stage 2.
  MODPOST 2230 modules
ERROR: "pm_freezing" [drivers/ata/libata.ko] undefined!

Is there a specific commit you want me to apply this patch to?
Comment 26 Tomaž Šolc 2013-12-16 10:11:15 UTC
(In reply to Tomaž Šolc from comment #25)
> (In reply to Tejun Heo from comment #24)
> > Can you please try the attached patch?
> 
> When I apply your patch to either v3.10 or v3.12.2 I get the following error
> when compiling the kernel (gcc 4.7.2)

Ok, I had to compile the kernel with CONFIG_ATA=y (not =m).

v3.12.2 with the "wb-hang-fix.patch" applied does not hang.
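
A note on why =y versus =m matters here, as far as I understand it: a module can only use kernel symbols that are explicitly exported, while built-in code is linked straight into vmlinux, so an unexported symbol only shows up as an error at the MODPOST stage of a modular build. To check how an option is set in a given .config, something like this should do (config_value is a hypothetical helper, and the sample file is synthetic):

```shell
#!/bin/sh
# Print the value of a Kconfig option (e.g. y or m) from a .config-style file,
# or "unset" if the file has no OPTION=value line for it.
config_value() {
    opt=$1
    file=$2
    val=$(sed -n "s/^${opt}=//p" "$file")
    echo "${val:-unset}"
}

# Synthetic sample file, for illustration only.
cfg=$(mktemp)
printf 'CONFIG_ATA=m\nCONFIG_ATA_ACPI=y\n' > "$cfg"
config_value CONFIG_ATA "$cfg"   # prints m
rm -f "$cfg"
```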

Thanks.
Comment 27 Tejun Heo 2013-12-17 12:51:41 UTC
Created attachment 118791
wb-hang-fix-v2.patch

Yeah, it needed EXPORT_SYMBOL_GPL(pm_freezing). Updated patch attached. Thanks.
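
To confirm that a symbol is exported after a build, one option is to look it up in Module.symvers at the top of the build tree. The sketch below assumes the usual whitespace-separated layout with the symbol name in the second column; symbol_exported is a made-up helper and the sample line (including its CRC) is synthetic:

```shell
#!/bin/sh
# Succeed if SYMBOL is listed in a Module.symvers-style file.
# Assumption: whitespace-separated columns, symbol name in column 2.
symbol_exported() {
    awk -v sym="$1" '$2 == sym { found = 1 } END { exit !found }' "$2"
}

# Synthetic sample file, for illustration only.
symvers=$(mktemp)
printf '0x00000000\tpm_freezing\tvmlinux\tEXPORT_SYMBOL_GPL\n' > "$symvers"
if symbol_exported pm_freezing "$symvers"; then
    echo "exported"
else
    echo "not exported"
fi
rm -f "$symvers"
```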
Comment 28 Xavier Claessens 2013-12-17 22:08:57 UTC
I have the same symptoms on a Lenovo ThinkPad X230 running Ubuntu. In my case it is a DVD reader in the docking station. I applied the patch to the Ubuntu kernel (3.11.0) and it fixed the issue.
Comment 29 Aaron Lu 2014-01-27 08:29:57 UTC
commit 85fbd722ad0f5d64d1ad15888cd1eb2188bfb557
Author: Tejun Heo <tj@kernel.org>
Date:   Wed Dec 18 07:07:32 2013 -0500

    libata, freezer: avoid block device removal while system is frozen
