Created attachment 110621 [details] console after hang, part 1
Hi,
An HP EliteBook 8460p hangs if it is undocked from the docking station while the system is suspended.
How to reproduce: While the laptop is docked, run "pm-suspend". Remove the laptop from the dock and press the power button. Virtual console switching still works, but all processes are frozen.
The hang is reproducible on 3.11.4 and 3.11.2. Resume works correctly on 3.9.6.
The hung task timeout shows a hang in kworker:
Workqueue: events ata_scsi_hotplug [libata]
See the attached screenshots for the complete call trace. I can provide more information if needed.
Thanks.
Created attachment 110631 [details] console after hang, part 2
Created attachment 110641 [details] console after hang, part 3
Created attachment 110651 [details] console after hang, part 4
Created attachment 110661 [details] dmesg output before suspend
One of the disks failed to resume. Does v3.10 work OK?
I tested v3.10 and it still hangs. It is definitely related to a hard drive. The docking station has a SATA disk mounted in it. If I remove the disk from the docking station, dock the laptop, suspend, undock and resume the laptop, the hang does not happen.
The disk was left in the dock, so on resume the system failed to detect it. Is there a filesystem mounted on it?
I checked again. The kernel always hangs, whether a filesystem is mounted on the disk in the dock or not. In all my tests before there was no filesystem mounted on the disk.
If you check the "console after hang, part 3" attachment, this is the suspicious part ("sdb" is the disk in the docking station):
ata4.00: disabled
ata4.00: detaching (SCSI 3:0:0:0)
sd 3:0:0:0: [sdb] starting disk
dpm_run_callback(): scsi_bus_resume+0x0/0x24 [scsi_mod] returns -19
PM: Device 3:0:0:0 failed to resume aync: error -19
(In reply to Tomaž Šolc from comment #8)
> If you check "console after hang, part 3" attachment, this is the suspicious
> part: ("sdb" is the disk in the docking station):
>
> ata4.00: disabled
> ata4.00: detaching (SCSI 3:0:0:0)
> sd 3:0:0:0: [sdb] starting disk
> dpm_run_callback(): scsi_bus_resume+0x0/0x24 [scsi_mod] returns -19
> PM: Device 3:0:0:0 failed to resume aync: error -19
Yes, I saw that. That error is expected since the disk is not there anymore. I'm thinking if losing a disk during resume should hang the system or not.
Excuse me, but I fail to see how hanging the system would be the appropriate response here, especially since the disk is not even mounted. From my point of view this is no different than removing the disk while the system is running. Would it help you if I bisect between v3.10 and v3.9.6 to find the commit that introduced this bug?
On Wednesday, November 20, 2013 12:27:34 AM bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=62801
>
> --- Comment #9 from Aaron Lu <aaron.lu@intel.com> ---
> (In reply to Tomaž Šolc from comment #8)
> > If you check "console after hang, part 3" attachment, this is the suspicious
> > part: ("sdb" is the disk in the docking station):
> >
> > ata4.00: disabled
> > ata4.00: detaching (SCSI 3:0:0:0)
> > sd 3:0:0:0: [sdb] starting disk
> > dpm_run_callback(): scsi_bus_resume+0x0/0x24 [scsi_mod] returns -19
> > PM: Device 3:0:0:0 failed to resume aync: error -19
>
> Yes, I saw that. That error is expected since the disk is not there anymore.
> I'm thinking if losing a disk during resume should hang the system or not.
No, it shouldn't.
And yes, a bisect would be very helpful.
Any news on the bisect?
(In reply to Aaron Lu from comment #13)
> Any news on the bisect?
It's taking longer than I thought. The initial estimate, I think, was 11 steps. I have done 14 tests so far and git bisect still says "roughly 9 steps left". I would welcome any ideas on how to speed it up. I'm attaching the current bisect log.
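Editorial note: a bisect over a linear range of N commits needs roughly log2(N) tests, which is where an estimate such as "11 steps" comes from. The sketch below only illustrates that arithmetic. The function name is hypothetical, and git's own "roughly N steps" message is computed on the real commit graph, so merges and skipped commits can make the actual count differ.

```python
def estimated_bisect_steps(num_commits: int) -> int:
    """Rough number of tests needed to bisect a linear range of commits.

    Assumption: a purely linear history, halved at every step.
    """
    if num_commits < 1:
        raise ValueError("num_commits must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding.
    return (num_commits - 1).bit_length()


if __name__ == "__main__":
    for n in (2048, 14000):
        print(n, "commits ->", estimated_bisect_steps(n), "steps")
```

Because the count grows only logarithmically, doubling the range of suspect commits adds a single extra kernel build.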
Created attachment 116511 [details] Current (incomplete) git bisect log
Just a note that v3.12.2 still has this problem.
(In reply to Aaron Lu from comment #13)
> Any news on the bisect?
839a8e8660b6777e7fe4e80af1a048aebe2b5977 is the first bad commit:
Author: Tejun Heo <tj@kernel.org>
Date: Mon Apr 1 19:08:06 2013 -0700
writeback: replace custom worker pool implementation with unbound workqueue
Created attachment 117911 [details] git bisect log
So if you do:
$ git reset --hard 839a8e8660b6777e7fe4e80af1a048aebe2b5977
and then build the kernel, the problem is there; and if you then revert that commit alone on this tree and rebuild the kernel, the problem is gone. Is this correct?
(In reply to Aaron Lu from comment #19)
> $ git reset --hard 839a8e8660b6777e7fe4e80af1a048aebe2b5977
This kernel hangs after undock.
$ git checkout 181387da2d64c3129e5b5186c4dd388bc5041d53
(Commit preceding 839a8e) This kernel doesn't hang.
$ git checkout v3.10
This kernel hangs after undock.
$ git checkout v3.10
$ git revert 839a8e8660b6777e7fe4e80af1a048aebe2b5977
This kernel doesn't hang. (git revert requires manual editing - I'm attaching the exact patch I tested)
Created attachment 117921 [details] patch to revert 839a8e8 for v3.10
Adding Tejun.
Hi Tejun,
It seems that commit 839a8e8660b6777e7fe4e80af1a048aebe2b5977 "writeback: replace custom worker pool implementation with unbound workqueue" breaks resume for Tomaz. Can you please take a look?
Thanks.
Hah, it looks like we somehow lost writeback requests while detaching a device, and ata_scsi_hotplug() ends up waiting indefinitely for writeback to finish. We have some issues with the bdi shutdown sequence. I'll look into it and report back. Thanks.
Created attachment 118311 [details] wb-hang-fix.patch
Can you please try the attached patch? FYI, the upstream discussion is taking place in the following thread.
http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
Thanks.
(In reply to Tejun Heo from comment #24)
> Can you please try the attached patch?
When I apply your patch to either v3.10 or v3.12.2, I get the following error when compiling the kernel (gcc 4.7.2):
$ make modules
...
Building modules, stage 2.
MODPOST 2230 modules
ERROR: "pm_freezing" [drivers/ata/libata.ko] undefined!
Is there a specific commit you want me to apply this patch to?
(In reply to Tomaž Šolc from comment #25)
> (In reply to Tejun Heo from comment #24)
> > Can you please try the attached patch?
>
> When I apply your patch to either v3.10 or v3.12.2 I get the following error
> when compiling the kernel (gcc 4.7.2)
Ok, I had to compile the kernel with CONFIG_ATA=y (not =m). v3.12.2 with the "wb-hang-fix.patch" applied does not hang. Thanks.
Created attachment 118791 [details] wb-hang-fix-v2.patch Yeah, it needed EXPORT_SYMBOL_GPL(pm_freezing). Updated patch attached. Thanks.
I have the same symptoms on a Lenovo ThinkPad X230 running Ubuntu. In my case it is a DVD reader in the docking station. I applied the patch to the Ubuntu kernel (3.11.0) and it fixed the issue.
commit 85fbd722ad0f5d64d1ad15888cd1eb2188bfb557
Author: Tejun Heo <tj@kernel.org>
Date: Wed Dec 18 07:07:32 2013 -0500
libata, freezer: avoid block device removal while system is frozen
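Editorial note: the patch itself is not reproduced in this thread. Going only by the commit title and by the fact that libata needed the pm_freezing symbol exported, the idea appears to be that the hotplug work holds off removing the device until the system is no longer frozen. The sketch below is a userspace analogy of that ordering, not kernel code. The threading.Event standing in for pm_freezing, the polling interval, and the function names are all assumptions made for illustration.

```python
import threading
import time


def simulate(frozen_for: float = 0.05, poll_interval: float = 0.005) -> list:
    """Return the order of events when device removal is deferred until thaw."""
    pm_freezing = threading.Event()  # hypothetical stand-in for the kernel flag
    events = []

    def hotplug_worker():
        # Do not detach the missing disk while the system is still frozen.
        while pm_freezing.is_set():
            time.sleep(poll_interval)
        events.append("device removed")

    pm_freezing.set()                # system is suspended, tasks are frozen
    worker = threading.Thread(target=hotplug_worker)
    worker.start()                   # resume notices that the docked disk is gone
    time.sleep(frozen_for)           # resume is still in progress
    events.append("thawed")
    pm_freezing.clear()              # tasks are thawed, removal may proceed
    worker.join()
    return events


if __name__ == "__main__":
    print(simulate())
```

In this model the removal always lands after the thaw, which is the ordering that the commit title describes.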