Subject : Regression: libata: implement ata_wait_after_reset()
Submitter : Luca Tettamanti <kronos.it@gmail.com>
References : http://lkml.org/lkml/2007/11/3/66
Caused by:
commit 88ff6eafbb2a1c55f0f0e2e16d72e7b10d8ae8a5
Author: Tejun Heo <htejun@gmail.com>
Date: Tue Oct 16 14:21:24 2007 -0700

    libata: implement ata_wait_after_reset()

http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=88ff6eafbb2a1c55f0f0e2e16d72e7b10d8ae8a5
Which controller is it? Can you please post boot log and the result of "lspci -nn" (from the kernel w/o the offending change)?
Created attachment 13391 [details] Boot log of working kernel
Created attachment 13392 [details] Boot log of non-working kernel
The controller is a PCI card with an ALi chipset; the parallel ATA port is in use, SATA is not.

05:02.0 Mass storage controller [0180]: ALi Corporation ALi M5281 Serial ATA / RAID Host Controller [10b9:5281] (rev a4)
05:02.1 Mass storage controller [0180]: ALi Corporation M5228 ALi ATA/RAID Controller [10b9:5228] (rev c6)

I've attached dmesg for the working and non-working kernels.
Created attachment 13395 [details]
bug-9298-debug.patch

Does this fix the problem?
Doesn't work. The controller starts screaming when the ioread8 (ata_check_status) is executed.
Created attachment 13417 [details]
debug.patch

Please apply this patch and turn on CONFIG_PRINTK_TIME and report the result. Thanks.
Will do when I get home. Yesterday I tested with:

    msleep(100);
    printk("checking status...\n");
    ata_chk_status(ap);
    printk("done\n");
    msleep(100);

and the timing showed that the interrupt started firing before "done" was printed. I didn't check the return value though.
The weird thing is that the ata_wait_after_reset() change didn't change that part. It just re-factored ata_chk_status(). Right after reset, both old and new code wait for 150ms and do ata_chk_status(). Are you positive that change is what's causing the problem?
I was running git current with the whole while(1) loop commented out.
Yeah, that's pretty conclusive. It's really weird tho. ata_wait_after_reset() is called from ata_bus_softreset() and all it does which touches the hardware is calling ata_chk_status() a number of times. On return, the next thing ata_bus_softreset() does is doing ata_chk_status() and then calling ata_wait_ready() which again is basically a bunch of ata_chk_status(). So, from the controller's POV, nothing really has changed. Maybe timing could have changed a bit and that's why I suggested the first patch. Anyways, it just doesn't make sense as polling STATUS is a very basic operation which should never fail. In addition, reading the STATUS register has the side effect of clearing pending interrupt which makes screaming interrupts from reading STATUS very very very weird.
Got it. I should have looked closer at the log... The controller explodes at the *second* iteration; also notice that nothing is physically connected to ata13 (the primary channel on the board); the disks are attached to ata14. So I guess that the rule for this card is "you can poke at my unconnected port at most once" - which is what the old code was doing ;)

[ 95.628402] ata13: PATA max UDMA/133 cmd 0xc880 ctl 0xc800 bmdma 0xc080 irq 20
[ 95.637840] ata14: PATA max UDMA/133 cmd 0xc480 ctl 0xc400 bmdma 0xc088 irq 20
[ 95.865997] ata13.0: calling ata_chk_status...
[ 95.973862] ata13.0: status = 0xff
[ 96.081720] ata13.0: calling ata_chk_status...
[ 96.181786] irq 20: nobody cared (try booting with the "irqpoll" option)
cut
[ 96.249116] ata13.0: status = 0xff

The code is doing:

    printk("ata%u.%u: calling ata_chk_status...\n", ap->print_id, ap->port_no);
    msleep(100);
    u8 status = ata_chk_status(ap);
    printk("ata%u.%u: status = %#x\n", ap->print_id, ap->port_no, status);
    msleep(100);
Hmmm... Okay, so it doesn't want to be poked more than once when there's no device attached. <grumble> The level of workmanship these ATA chips are designed with is just astonishing. That's an incredible presentation of attention to detail - die horribly while raising IRQ line if status register is read while there's no device attached. Oh wait, that's too harsh, let's allow one and just one read. Can anyone possibly pay more attention? Oh BTW never mind that status register is supposed to clear exactly that pending interrupt. </grumble> Will soon attach a patch.
Created attachment 13429 [details]
skip-0xff-polling-for-PATA-controllers.patch

Please verify this patch fixes the problem. Thanks.
Works fine, thanks. Tested-By: Luca Tettamanti <kronos.it@gmail.com>
Alright, patch posted.
When it hits the mainline, please provide the commit ID and subject. Thanks!
It's now queued in libata-dev#upstream-fixes as commit id 1974e20161a2c097c481d2ff711de7db56cb2cd6 and will be pulled into Linus's tree on the next pull.