Bug 34692

Summary: IO operation hang when plug in AC adapter
Product: IO/Storage
Reporter: Gu Rui (chaos.proton)
Component: Serial ATA
Assignee: Jeff Garzik (jgarzik)
Status: CLOSED CODE_FIX
Severity: normal
CC: florian, maciej.rutecki, rjw, tj
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.39-rc5+
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216, 32012    
Attachments: dmesg of 2.6.39-rc6+
dmesg of 2.6.38
smartctl --all /dev/sda
sata_alpm in pm-utils-1.4.1
dmesg of the kernel with bad commit 270dac35c264 and ATA_VERBOSE_DEBUG in libata.h

Description Gu Rui 2011-05-08 15:10:24 UTC
My system has KDE and pm-utils-1.4.1 installed. When I plug in the AC adapter, any IO operation hangs. I cannot write dmesg to a file because of the IO hang, so I took pictures of the screen:

https://picasaweb.google.com/chaos.proton/Bugs

When this bug happens, the disk LED is on for a while, then off for a while, then on for another while, back and forth...

It seems there are fatal ATA errors. After I uninstalled pm-utils, things went OK. But the installed pm-utils is clean and, as far as I can see, the only thing it does is set link_power_management_policy to max_performance.
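
To trigger the same policy change without pm-utils, something like the following Python sketch might help. It assumes the policy is exposed as `/sys/class/scsi_host/host*/link_power_management_policy` and that `min_power`, `medium_power` and `max_performance` are the values this kernel accepts. The function name `set_link_policy` and the `sysfs_root` parameter are made up for illustration, and writing to the real sysfs files needs root.

```python
import glob
import os

# Values assumed to be accepted by this kernel's sysfs attribute.
VALID_POLICIES = ("min_power", "medium_power", "max_performance")


def set_link_policy(policy, sysfs_root="/sys/class/scsi_host"):
    """Write `policy` to every host's link_power_management_policy file.

    Returns the list of files written. Hosts that do not expose the
    attribute are simply not matched by the glob.
    """
    if policy not in VALID_POLICIES:
        raise ValueError("unknown policy: %r" % (policy,))
    pattern = os.path.join(sysfs_root, "host*", "link_power_management_policy")
    written = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as f:
            f.write(policy)
        written.append(path)
    return written
```

Calling `set_link_policy("min_power")` and then `set_link_policy("max_performance")` as root should roughly mimic unplugging and re-plugging the AC adapter, if the guess about what pm-utils does is right.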

If you couldn't see the screenshot clearly, I can attach the original ones.
Comment 1 Gu Rui 2011-05-08 15:12:19 UTC
Forgot to mention, I think you can safely ignore the usb stuff inserted in dmesg. It should have nothing to do with this bug.
Comment 2 Tejun Heo 2011-05-08 16:00:14 UTC
Can you please post full boot kernel log?  Also, is this a regression?

Thanks.
Comment 3 Tejun Heo 2011-05-08 16:01:06 UTC
Ooh, one more question.  Does the machine come back from the hang?  Or is the machine completely dead after that?
Comment 4 Gu Rui 2011-05-09 13:50:53 UTC
Yes, the hang is a regression. I've tried 2.6.38; it only gives:

[  112.076118] ata1: hard resetting link
[  112.380680] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  112.382488] ata1.00: ACPI cmd 00/00:00:00:00:00:a0 (NOP) rejected by device (Stat=0x51 Err=0x04)
[  112.385131] ata1.00: ACPI cmd 00/00:00:00:00:00:a0 (NOP) rejected by device (Stat=0x51 Err=0x04)
[  112.385495] ata1.00: configured for UDMA/133
[  112.387769] ata1: EH complete
[  112.454063] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  112.567175] EXT4-fs (sda5): re-mounted. Opts: commit=0
[  112.727557] EXT4-fs (sda6): re-mounted. Opts: commit=0

when I plug in the AC again, but there is no hang.
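
The register values in the log above can be decoded with a small Python sketch. It assumes the SStatus value is printed in hex and uses the standard SATA SStatus field layout (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8) and the standard ATA status and error bits. The helper names are made up for illustration.

```python
# ATA status register bits (Stat) and one error register bit (Err).
ATA_BUSY = 0x80     # BSY
ATA_DRDY = 0x40     # device ready
ATA_DRQ = 0x08      # data request
ATA_ERR = 0x01      # error, details in the Err register
ATA_ABORTED = 0x04  # ABRT bit in the Err register


def decode_sstatus(sstatus):
    """Split a SATA SStatus value into its DET, SPD and IPM fields."""
    return {
        "det": sstatus & 0xF,         # 3 = device present, phy communication up
        "spd": (sstatus >> 4) & 0xF,  # 2 = Gen2 (3.0 Gbps)
        "ipm": (sstatus >> 8) & 0xF,  # 1 = interface active
    }


def decode_stat(stat):
    """Return the names of the known status bits set in an ATA Stat byte."""
    names = [("BSY", ATA_BUSY), ("DRDY", ATA_DRDY),
             ("DRQ", ATA_DRQ), ("ERR", ATA_ERR)]
    return [name for name, bit in names if stat & bit]


print(decode_sstatus(0x123))       # SStatus from the "link up" line
print(decode_stat(0x51))           # Stat from the rejected ACPI NOP
print(bool(0x04 & ATA_ABORTED))    # Err=0x04 has the ABRT bit set
```

Read this way, SStatus 123 is a link that is up at Gen2 speed with a device present, and Stat=0x51 Err=0x04 is a device that is ready but aborted the command, which matches the "rejected by device" wording. Bit 4 of the Stat byte is also set in 0x51 but is not named by this sketch.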

In 2.6.39-rc6+, the machine never comes back from the hang. Since no disk operation ever returns, the system becomes dead soon (nearly every program uses the disk, right? ;)

I will attach the full dmesg and some other info in the following posts.
Comment 5 Gu Rui 2011-05-09 13:54:16 UTC
Created attachment 57002 [details]
dmesg of 2.6.39-rc6+

AC unplugged at about 154.372674.
Comment 6 Gu Rui 2011-05-09 13:55:42 UTC
Created attachment 57012 [details]
dmesg of 2.6.38

AC unplugged at about 91.747641 and plugged back in at about 112.076118.
Comment 7 Gu Rui 2011-05-09 13:58:42 UTC
Created attachment 57022 [details]
smartctl --all /dev/sda
Comment 8 Tejun Heo 2011-05-09 15:08:07 UTC
Weird, 2.6.38 is okay but 2.6.39-rc6+ isn't.  The thing is that libata had major link power saving reimplementation during 2.6.38 devel cycle but there hasn't been any significant change in the area during 39 cycle.  I've looked through all the libata changes but nothing rings a bell.  Is the problem readily reproducible?  Would you be interested in doing a bisection?

Thank you.
Comment 9 Gu Rui 2011-05-09 16:25:10 UTC
Yes, very solidly reproducible, I mean, every time. Hmm, I know how to bisect, but I may only have enough time on the weekend to do the build/reboot/test thing... ;(
Comment 10 Gu Rui 2011-05-13 17:03:48 UTC
Ok, I think I found the bad commit:

commit 270dac35c26433d06a89150c51e75ca0181ca7e4
Author: Jian Peng <jipeng2005@gmail.com>
Date:   Fri Apr 22 23:58:10 2011 -0700

    libata: ahci_start_engine compliant to AHCI spec
    
    At the end of section 10.1 of AHCI spec (rev 1.3), it states
    
    Software shall not set PxCMD.ST to 1 until it is determined that
    a functional device is present on the port as determined by
    PxTFD.STS.BSY=0, PxTFD.STS.DRQ=0 and PxSSTS.DET=3h
    
    Even though most AHCI host controller works without this check,
    specific controller will fail under this condition.
    
    Signed-off-by: Jian Peng <jipeng2005@gmail.com>
    Signed-off-by: Jeff Garzik <jgarzik@pobox.com>

The problem is gone after I revert it.
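
The rule quoted in the commit message can be restated as a small predicate. This is only a Python sketch of the quoted AHCI condition, not the kernel code. It assumes PxTFD.STS is the low byte of PxTFD and PxSSTS.DET is the low four bits of PxSSTS; the function name `may_start_engine` is made up for illustration.

```python
ATA_BUSY = 0x80  # PxTFD.STS.BSY
ATA_DRQ = 0x08   # PxTFD.STS.DRQ


def may_start_engine(tfd, ssts):
    """Return True only if the quoted rule allows setting PxCMD.ST to 1.

    tfd  -- value of the port's PxTFD register
    ssts -- value of the port's PxSSTS register
    """
    status = tfd & 0xFF  # STS field, low byte of PxTFD
    det = ssts & 0xF     # DET field, low four bits of PxSSTS
    if status & (ATA_BUSY | ATA_DRQ):
        return False     # device still busy or requesting data
    return det == 3      # 3h = device present, phy communication up
```

With the values from the 2.6.38 log (status 0x51, SStatus 0x123) the predicate is true. The thread does not establish which of the three conditions was failing on this machine when the policy changed.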
Comment 11 Rafael J. Wysocki 2011-05-13 20:20:13 UTC
First-Bad-Commit : 270dac35c26433d06a89150c51e75ca0181ca7e4
Comment 12 Tejun Heo 2011-05-14 10:41:07 UTC
Gu Rui, thank you very much for bisecting.  In hindsight, yep, that one makes sense.

Also reported in the following thread.

  http://thread.gmane.org/gmane.linux.kernel/1138771

Revert patch posted.

  http://article.gmane.org/gmane.linux.ide/49533
Comment 14 Gu Rui 2011-05-17 01:07:24 UTC
Created attachment 58232 [details]
sata_alpm in pm-utils-1.4.1

As Jian Peng asked, I am posting the pm-utils script that causes the problem. I'm not familiar with pm-utils either, but I think this script calls set_sata_alpm min_power when AC is unplugged and set_sata_alpm max_performance when AC is plugged in.
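
My reading of the script boils down to the mapping below. This is a Python sketch of that reading only, not the attached script itself, and the function name `policy_for_power_source` is made up for illustration.

```python
def policy_for_power_source(on_ac):
    """Policy the script appears to select for a given power source."""
    return "max_performance" if on_ac else "min_power"


# The sequence from the report: unplug the adapter, then plug it back in.
for event, on_ac in [("AC unplugged", False), ("AC plugged in", True)]:
    print(event, "->", policy_for_power_source(on_ac))
```

So the change that coincides with the hang would be the switch from min_power back to max_performance.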
Comment 15 Gu Rui 2011-05-17 01:09:36 UTC
Created attachment 58242 [details]
dmesg of the kernel with bad commit 270dac35c264 and ATA_VERBOSE_DEBUG in libata.h