Bug 9298 - libata: implement ata_wait_after_reset()
Summary: libata: implement ata_wait_after_reset()
Status: RESOLVED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Tejun Heo
URL:
Keywords:
Depends on:
Blocks: 9243
  Show dependency tree
 
Reported: 2007-11-03 10:53 UTC by Rafael J. Wysocki
Modified: 2007-11-08 18:45 UTC (History)
2 users (show)

See Also:
Kernel Version: 2.6.24-rc1-git12
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Boot log of working kernel (32.15 KB, text/plain)
2007-11-04 07:30 UTC, Luca Tettamanti
Details
Boot log of non-working kernel (18.88 KB, text/plain)
2007-11-04 07:31 UTC, Luca Tettamanti
Details
bug-9298-debug.patch (418 bytes, application/octet-stream)
2007-11-04 15:50 UTC, Tejun Heo
Details
debug.patch (442 bytes, patch)
2007-11-06 01:33 UTC, Tejun Heo
Details | Diff
skip-0xff-polling-for-PATA-controllers.patch (934 bytes, patch)
2007-11-06 19:26 UTC, Tejun Heo
Details | Diff

Description Rafael J. Wysocki 2007-11-03 10:53:37 UTC
Subject         : Regression: libata: implement ata_wait_after_reset()
Submitter       : Luca Tettamanti <kronos.it@gmail.com>
References      : http://lkml.org/lkml/2007/11/3/66
Comment 1 Rafael J. Wysocki 2007-11-03 10:55:57 UTC
Caused by:

commit 88ff6eafbb2a1c55f0f0e2e16d72e7b10d8ae8a5
Author: Tejun Heo <htejun@gmail.com>
Date:   Tue Oct 16 14:21:24 2007 -0700

    libata: implement ata_wait_after_reset()

http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=88ff6eafbb2a1c55f0f0e2e16d72e7b10d8ae8a5
Comment 2 Tejun Heo 2007-11-03 16:50:16 UTC
Which controller is it?  Can you please post boot log and the result of "lspci -nn" (from the kernel w/o the offending change)?
Comment 3 Luca Tettamanti 2007-11-04 07:30:43 UTC
Created attachment 13391 [details]
Boot log of working kernel
Comment 4 Luca Tettamanti 2007-11-04 07:31:31 UTC
Created attachment 13392 [details]
Boot log of non-working kernel
Comment 5 Luca Tettamanti 2007-11-04 07:32:37 UTC
The controller is a PCI card with ALi chipset; parallel port is in use, SATA is not.

05:02.0 Mass storage controller [0180]: ALi Corporation ALi M5281 Serial ATA / RAID Host Controller [10b9:5281] (rev a4)
05:02.1 Mass storage controller [0180]: ALi Corporation M5228 ALi ATA/RAID Controller [10b9:5228] (rev c6)

I've attacched dmesg for working and non working kernels.
Comment 6 Tejun Heo 2007-11-04 15:50:04 UTC
Created attachment 13395 [details]
bug-9298-debug.patch

Does this fix the problem?
Comment 7 Luca Tettamanti 2007-11-05 10:43:55 UTC
Doesn't work. The controller starts screaming when the ioread8 (ata_check_status) is executed.
Comment 8 Tejun Heo 2007-11-06 01:33:35 UTC
Created attachment 13417 [details]
debug.patch

Please apply this patch and turn on CONFIG_PRINTK_TIME and report the result.  Thanks.
Comment 9 Luca Tettamanti 2007-11-06 01:52:52 UTC
Will do when I get home. Yesterday I tested with:

msleep(100);
printk("checking status...\n");
ata_chk_status(ap);
printk("done\n");
msleep(100);

and the timing showed that the interrupt started firing before printing "done". I didn't checked the return value though.
Comment 10 Tejun Heo 2007-11-06 02:16:53 UTC
The weird thing is that the ata_wait_after_reset() change didn't change that part.  It just re-factored ata_chk_status().  Right after reset, both old and new code wait for 150ms and do ata_chk_status().  Are you positive that change is what's causing the problem?
Comment 11 Luca Tettamanti 2007-11-06 02:38:38 UTC
I was running git current with the whole while(1) loop commented out.
Comment 12 Tejun Heo 2007-11-06 02:51:36 UTC
Yeah, that's pretty conclusive.  It's really weird tho.  ata_wait_after_reset() is called from ata_bus_softreset() and all it does which touches the hardware is calling ata_chk_status() a number of times.  On return, the next thing ata_bus_softreset() does is doing ata_chk_status() and then calling ata_wait_ready() which again is basically a bunch of ata_chk_status().

So, from the controller's POV, nothing really has changed.  Maybe timing could have changed a bit and that's why I suggested the first patch.  Anyways, it just doesn't make sense as polling STATUS is a very basic operation which should never fail.  In addition, reading the STATUS register has the side effect of clearing pending interrupt which makes screaming interrupts from reading STATUS very very very weird.
Comment 13 Luca Tettamanti 2007-11-06 10:43:09 UTC
Got it. I should have looked closer at the log...
The controller explodes at the *second* iteration; also notice that ata13 (the primary channel on the board) is not physically present; disks are attacched to ata14. So I guess that the rule for this card is "you can poke at my unconnected port at most once" - which is what the old code was doing ;)

[   95.628402] ata13: PATA max UDMA/133 cmd 0xc880 ctl 0xc800 bmdma 0xc080 irq 20
[   95.637840] ata14: PATA max UDMA/133 cmd 0xc480 ctl 0xc400 bmdma 0xc088 irq 20
[   95.865997] ata13.0: calling ata_chk_status...
[   95.973862] ata13.0: status = 0xff
[   96.081720] ata13.0: calling ata_chk_status...
[   96.181786] irq 20: nobody cared (try booting with the "irqpoll" option)
cut
[   96.249116] ata13.0: status = 0xff

the code is doing:

printk("ata%u.%u: calling ata_chk_status...\n",ap->print_id, ap->port_no);
msleep(100);
u8 status = ata_chk_status(ap);
printk("ata%u.%u: status = %#x\n", ap->print_id, ap->port_no, status);
msleep(100);
Comment 14 Tejun Heo 2007-11-06 19:25:54 UTC
Hmmm... Okay, so it doesn't want to be poked more than once when there's no device attached.  <grumble>  The level of workmanship these ATA chips are designed with is just astonishing.  That's an incredible presentation of attention to detail - die horribly while raising IRQ line if status register is read while there's no device attached.  Oh wait, that's too harsh, let's allow one and just one read.  Can anyone possibly pay more attention?  Oh BTW never mind that status register is supposed to clear exactly that pending interrupt. </grumble>

Will soon attach a patch.
Comment 15 Tejun Heo 2007-11-06 19:26:42 UTC
Created attachment 13429 [details]
skip-0xff-polling-for-PATA-controllers.patch

Please verify this patch fixes the problem.  Thanks.
Comment 16 Luca Tettamanti 2007-11-07 10:48:31 UTC
Works fine, thanks.

Tested-By: Luca Tettamanti <kronos.it@gmail.com>
Comment 17 Tejun Heo 2007-11-07 18:20:39 UTC
Alright, patch posted.
Comment 18 Rafael J. Wysocki 2007-11-08 07:01:45 UTC
When it hits the mainline, please provide the commit ID and subject.

Thanks!
Comment 19 Tejun Heo 2007-11-08 18:45:52 UTC
It's now queued in libata-dev#upstream-fixes as commit id 1974e20161a2c097c481d2ff711de7db56cb2cd6 and will be pulled into Linus's tree on the next pull.

Note You need to log in before you can comment on or make changes to this bug.