Subject : Is sata_nv compatible with async scsi scan?
Submitter : Benny Halevy <bhalevy@panasas.com>
Date : 2009-04-21 7:03
References : http://marc.info/?l=linux-kernel&m=124029746431777&w=4
Notify-Also : Jeff Garzik <jeff@garzik.org>
Notify-Also : Matthew Wilcox <matthew@wil.cx>
Notify-Also : Arjan van de Ven <arjan@infradead.org>

This entry is being used for tracking a regression from 2.6.28. Please don't close it until the problem is fixed in the mainline.
v2.6.30-rc3 seems to do much better. I believe this patch did the trick (although I didn't bisect to make sure it is the one):

commit d4d5291c8cd499b1b590336059d5cc3e24c1ced6
Author: Arjan van de Ven <arjan@linux.intel.com>
Date: Tue Apr 21 13:32:54 2009 -0700

    driver synchronization: make scsi_wait_scan more advanced

Thanks!
I verified that this specific patch makes the difference I see. However, although it improves the odds of completing the boot on my machine, it does not seem to fix the root cause; I suspect it just changes the timing. After 6 successful reboots, the machine again got into a bad state in which it fails to recognize /dev/sda7 at the resume (from swap) stage and fails to switch to the root file system.

From reading the ata_scsi_scan_host code I can't figure out how the ata scanning mechanism is supposed to play with the scsi async scan mechanisms.
Aha! The async scsi scan is just a red herring, as I was able to reproduce this with scsi_mod.scan=sync. I couldn't reproduce it with vanilla 2.6.28, but with the following patches applied (affecting MCP55) I hit the problem in 2 out of 3 boots:

e8caa3c sata_nv: rename nv_nf2_hardreset()
2d77570 sata_nv: fix MCP5x reset

The latter, which fixes bug 12351, is the meaningful one.
Can you please post the kernel boot log from a failing boot? Thanks.
Also, if you add some delay before returning from nv_noclassify_hardreset(), does it make any difference? Adding ssleep(1) right above return should do the trick.
Created attachment 21154: console log of successful boot
Created attachment 21155: console log of failed boot
(In reply to comment #5)
> Also, if you add some delay before returning from nv_noclassify_hardreset(),
> does it make any difference? Adding ssleep(1) right above return should do
> the trick.

Nope, that didn't help... I tried the following:

diff --git a/drivers/ata/sata_nv.c b/drivers/ata/sata_nv.c
index 6cda12b..26fade0 100644
--- a/drivers/ata/sata_nv.c
+++ b/drivers/ata/sata_nv.c
@@ -1567,6 +1567,7 @@ static int nv_noclassify_hardreset(struct ata_link *link, unsigned int *class,
 	rc = sata_link_hardreset(link, sata_deb_timing_hotplug, deadline,
 				 &online, NULL);
+	ssleep(1);
 	return online ? -EAGAIN : rc;
 }
From your failed boot log.

ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2.00: ATA-7: HDT722516DLA380, V43OA96A, max UDMA/133
ata2.00: 321672960 sectors, multi 16: LBA48 NCQ (depth 31/32)
ata2.00: configured for UDMA/133
isa bounce pool size: 16 pages
scsi 1:0:0:0: Direct-Access ATA HDT722516DLA380 V43O PQ: 0 ANSI: 5
sd 1:0:0:0: [sda] 321672960 512-byte hardware sectors: (164 GB/153 GiB)
sd 1:0:0:0: [sda] Write Protect is off
sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 >
sd 1:0:0:0: [sda] Attached SCSI disk
ata4: SATA link down (SStatus 0 SControl 300)
Waiting for driver initialization.
Trying to resume from /dev/sda7
Unable to access resume device (/dev/sda7)
Creating root device.
Mounting root filesystem.
mount: error mounting /dev/root on /sysroot as ext3: No such file or directory

The harddrive is detected just fine and partition table read succeeded too. I don't think this is a low level driver problem. Looks like initrd rootfs is somehow missing device nodes. Not sure whether it's a kernel problem or initrd itself screwing up.

Thanks.
Oh... one thing. If you boot w/o initrd, is it reliable?
(In reply to comment #9)
> The harddrive is detected just fine and partition table read succeeded too.
> I don't think this is a low level driver problem. Looks like initrd rootfs is
> somehow missing device nodes. Not sure whether it's a kernel problem or
> initrd itself screwing up.

FWIW, I've also seen failed boots where I didn't see these messages from sd (ending with "Attached SCSI disk").

I'm not sure how initrd can mess up in such a way that it fails to access("/dev/sda7").

As you suggested, I'll try booting with no initrd (though I admit I've never tried that before, so it'll take some time to go through the learning curve).

Benny
(In reply to comment #10)
> Oh... one thing. If you boot w/o initrd, is it reliable?

So far I haven't been able to reproduce a problem w/o initrd.

Reading nash's sources it might have a problem with finding the symlink /sys/block/sda or its target in /sys/devices/.../block/sda

[I'm trying to compile the damn thing on Fedora 9 so I can debug it, with not much success yet... it needs some old compat libraries e.g. libpump]
On Tuesday 26 May 2009, Benny Halevy wrote:
> On May. 24, 2009, 22:31 +0300, "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.28 and 2.6.29.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.28 and 2.6.29. Please verify if it still should
> > be listed and let me know (either way).
>
> Verified to still exist with 2.6.30-rc7 :-(
>
> [The bug description is misleading though. More testing
> led to post-kernel boot problems where initrd cannot find the sata
> devices during its init phase]
I'm not sure this is a kernel bug at this point. The first thing would be to figure out why initrd is failing.
Taking the bug and setting status to NEEDINFO.
(In reply to comment #14)
> I'm not sure this is a kernel bug at this point. The first thing would be to
> figure out why initrd is failing.

Agreed.

After adding some "find" commands to the nash init script around the mkblkdev command, all I can say at this point is that:

a. The symbolic link /sys/block/sda exists before mkblkdev.

b. Adding "find /sys/block/../devices/pci0000:00/0000:00:05.0/host1/target1:0:0/1:0:0:0/block/sda" before mkblkdev finds everything in place (exactly the same entries regardless of boot failure or success). Moreover, this seems to mitigate the bug and I couldn't reproduce the failure with this in place.

c. Replacing the find command with a 30-second sleep does not help, i.e. it's not a mere timing problem.

d. Moving the find command after the call to mkblkdev shows that sysfs is populated even after mkblkdev fails and there's no sda* entry in /dev.

e. Replacing the find command before mkblkdev with another call to mkblkdev also seems to mitigate the bug.

(My criterion is 5 successful boots in a row.)

In conclusion, my hypothesis is that the initial state in sysfs, in some cases, causes nash's mkblkdev to fail somewhere while scanning it the first time. Doing this scan once seems to "fix" the bad state so that a subsequent mkblkdev succeeds.
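For anyone who wants to repeat checks (a) and (d) without relying on "find" output, here is a minimal sketch in C of the same two questions: does the sysfs symlink resolve, and is there a block device node in /dev? This is not nash's code, and the function names (path_exists, resolve_link, is_block_device, report) are made up for illustration. It only uses standard POSIX calls (access, realpath, stat).

```c
#define _XOPEN_SOURCE 700 /* for realpath() and PATH_MAX under strict -std modes */
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: 1 if path exists (symlinks are followed), else 0. */
int path_exists(const char *path)
{
    assert(path != NULL);
    return access(path, F_OK) == 0;
}

/*
 * Hypothetical helper: resolve every symlink in path into buf, which must
 * hold at least PATH_MAX bytes. Returns 0 on success, -1 if the link or
 * its target is missing.
 */
int resolve_link(const char *path, char *buf)
{
    assert(path != NULL && buf != NULL);
    return realpath(path, buf) != NULL ? 0 : -1;
}

/* Hypothetical helper: 1 if path exists and is a block device node, else 0. */
int is_block_device(const char *path)
{
    struct stat st;

    assert(path != NULL);
    return stat(path, &st) == 0 && S_ISBLK(st.st_mode);
}

/* Print the sysfs side and the /dev side next to each other. */
void report(const char *sysfs_link, const char *dev_node)
{
    char target[PATH_MAX];

    if (resolve_link(sysfs_link, target) == 0)
        printf("%s -> %s\n", sysfs_link, target);
    else
        printf("%s: does not resolve\n", sysfs_link);

    printf("%s: %s\n", dev_node,
           is_block_device(dev_node) ? "block device present"
                                     : "missing or not a block device");
}
```

Calling report("/sys/block/sda", "/dev/sda7") before and after mkblkdev would show the state described in (d): a sysfs entry that resolves while the /dev node is absent. Note that, per (b), merely walking sysfs seemed to mask the failure, so a probe like this may perturb the bug in the same way.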
The original bug was hit with Fedora 9, nash-6.0.52-2.fc9.x86_64. I'm now testing with Fedora 11, nash-6.0.86-1.fc11.x86_64. So far so good.
OK. Bug not hit with the new nash. Please close. Sorry about the noise...
Thanks. Resolving as INVALID.