Bug 6356
Summary: | sbp2: scsi_add_device failed. IEEE1394 HD is not working anymore. | ||
---|---|---|---|
Product: | Drivers | Reporter: | Vanei Heidemann (linux) |
Component: | IEEE1394 | Assignee: | Stefan Richter (stefanr) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | kernel |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.16 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
lsmod
lspci Full dmesg Log messages Log dmesg echo 9216 > /sys/module/scsi_mod/parameters/scsi_logging_level git-35bdddb83f62978b5fad82a14fbfd78cc3a5a60c.patch patch It seems that nor everything functions as I thought. patch for 2.6.16/ 2.6.16.x |
Description
Vanei Heidemann
2006-04-08 09:30:40 UTC
Created attachment 7815 [details]
lsmod
Created attachment 7816 [details]
lspci
Created attachment 7817 [details]
Full dmesg
Created attachment 7818 [details]
Log messages
The dmesg lines ieee1394: Node changed: 0-01:1023 -> 0-00:1023 ieee1394: Node suspended: ID:BUS[0-00:1023] GUID[0050770e000042c0] indicate that the disk's FireWire node suddenly vanished from the bus. This is quite puzzling. Did you use the same hardware (disk, cable, PC) under 2.6.15? Can you access the disk again if you revert to Linux 2.6.15? If the answer to both is yes, please boot Linux 2.6.16(.2) again and enable "Excessive debugging output" in the IEEE 1394 section of the kernel configuration (i.e. CONFIG_IEEE1394_VERBOSEDEBUG=Y) and recompile, reinstall, reload the FireWire driver modules. Post the full dmesg from when you load the recompiled drivers and plug in the disk. Actually I would rather like to see the relevant portion of the syslog file where kernel messages are stored, instead of dmesg. The syslog is more informative than dmesg because of the human-readable timestamps. Make sure though that you look into the logfile which catches all kernel messages, also those at debug priority. I.e. compare if everything in dmesg can also be seen in the logfile. /etc/syslog.conf contains the configuration of the syslog daemon; it is different from distro to distro. Thanks. Created attachment 7819 [details]
Log dmesg
Yes, I use the same hardware and I can access the disk again if I revert to
2.6.15.
I have compiled, instaled, rebooted the system and plugued the hardware. The
dmesg output after that is attached.
I am spotting only one suspicious event in your log: When the disk granted the login by writing into sbp2's status FIFO, the Linux host returns a write response since Linux 2.6.16. (It only returned an ACK in 2.6.15 and before.) The disk should ACK this write response, but your log shows ack=0x3 == "missing ACK". But I don't know yet if this is related to the following SCSI command timeout. The host doesn't really need the ACK, and the disk even does not really need the response. But the response left unacknowledged may be a sign of something going wrong in the target. Although the subsequent three write requests from sbp2 to the target are OK (ack=0x11 == "complete"). I have two more requests: - Run "echo 9216 > /sys/module/scsi_mod/parameters/scsi_logging_level". This requires that "SCSI device support" is configured as a loadable module (and that scsi_mod is loaded when you run the command) and also configured for "SCSI logging facility" (CONFIG_SCSI_LOGGING=y). After that, plug the disk in and post the resulting dmesg. - Download the following patch which went into Linux 2.6.16: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=35bdddb83f62978b5fad82a14fbfd78cc3a5a60c Change into the kernel source main directory (e.g. /usr/src/linux-2.6.16.2 or where ever you keep the sources of the currently running kernel) and revert the patch: $ patch -b -p1 -R < ~/git-35bdddb83f62978b5fad82a14fbfd78cc3a5a60c.patch Then recompile and reinstall the modules, unload and reload sbp2, plug in the disk, and see if it behaves different. Alas there is quite a number of changes to the 1394 subsystem from 2.6.15 to 2.6.16 (http://me.in-berlin.de/~s5r6/linux1394/merged/available_in_2.6.16/) but the one above is directly related to this ack=0x3. Again, thanks for taking the time to debug this. Created attachment 7824 [details]
echo 9216 > /sys/module/scsi_mod/parameters/scsi_logging_level
I have configured the kernel for compile SCSI as module and enabled SCSI log.
Created attachment 7825 [details]
git-35bdddb83f62978b5fad82a14fbfd78cc3a5a60c.patch patch
I have reverted the patch. Here is the dmesg output. I can't access the disk
yet.
Your last log shows that the first SCSI command, the INQUIRY, succeeded. The device was properly detected by the scsi_mod core driver --- unlike all 2.6.16 attempts before. The next (missing) step is sd_mod being attached to the disk. sd_mod is responsible to make the device accessible via /dev/sd*. What if you run "modprobe sd_mod"? (The excessive debugging option of the ieee1394 drivers is no longer necessary.) Yes! After revert patch and run "modprobe sd_mod", my disk is back! I can have access it again. OK, thanks a lot. Alas, for reasons given in the offending patch's description, we cannot simply revert the patch. I will prepare a fix which will involve sbp2, ieee1394, and ohci1394. I will notify you once I got the fix ready. Do you know what chip this enclosure's bridge board is based on? (The only way to determine this for sure is to look inside the enclosure. Information given in /sys/bus/ieee1394/devices/*/ or the display by gscanbus sometimes also provides hints about the hardware, but since these are programmable firmware datafields they can also be misleading. But if you feal uncomfortable with opening the enclosure, don't do so.) I can see 3 chips: - PL-3507 05208A (Bigger) - PMC (I think) 0523 Pm39LV512-70VCE - I can't read many information about third chip, but I think is a Philips. I think is something like: pci1394p2380 tpc0525c Yes, Prolific PL-3507 is the bridge. This chip is infamous for a variety of strange bugs. If have a PL-3507 based device myself which was acting unreliably lately. I will check if this is also caused by the status FIFO patch in 2.6.16. (Side note: The issue with my own PL-3507 based CD-RW is unrelated to sbp2's status FIFO.) Vanei, I just posted a set of patches at linux1394-devel which essentially brings 2.6.15's behaviour back, but works in a cleaner fashion (i.e. is not broken on certain OHCI adapters as 2.6.15 was). Could you try these patches with a kernel of your choice? I am keeping all of the current IEEE 1394 patches as combo patches for the last few kernel releases at my web site: http://me.in-berlin.de/~s5r6/linux1394/updates/ Note, the 2.6.16.x variant requires 2.6.16.2 or later. If you are still using 2.6.16, you should apply patch-2.6.16.2 or later before applying my 1394 patchset. Alternatively I could create a patchset specifically for 2.6.16. Thanks in advance for testing. ...or rather try this reduced patch: http://me.in-berlin.de/~s5r6/linux1394/work-in-progress/2.6.16-ohci1394-sbp2-fix-scsi_add_device-failed-with-PL_3507-based-devices.patch Thanks. I have applied the patch to kernel 2.6.16.9. I have tested both, compiling SCSI Disk Support as module and built-in. And both work (if compiled as module, I need run "modprob sd_mod"). Created attachment 7904 [details]
It seems that nor everything functions as I thought.
I can access the external disk (with cfdisk) using SCSI Disk Support compiled
built-in, but when I tried mount I got a error and the attached message in
dmesg.
If I compile SCSI disk Support as module, I can access the disk, no problems. Re comment #18: Thanks for testing. The requirement to load sd_mod manually may be caused by a change in scsi_mod's sysfs representation of SCSI devices since Linux 2.6.14. Most FireWire HDDs are not shown as device type 0 but as type 14 now. The userspace helper of your distro which is responsible to load SCSI command set drivers (/etc/hotplug/scsi.agent, HAL, or whatever script or daemon) needs to be updated: http://marc.theaimsgroup.com/?l=linux-hotplug-devel&m=112129848913652 Re comment #19, #20: Very strange. AFAIU the log, the disk's node suddenly went away at the FireWire level. I don't understand how this could be related to sd_mod being a module or statically linked. Actually the way how sd_mod is linked should not matter at all, except perhaps for a slightly longer pause between scsi_mod's very first inquiry command and sd_mod's subsequent SCSI commands. If you have access to a Windows box with FireWire, you could try a firmware update to the PL-3507: http://www.prolific.com.tw/eng/downloads.asp?ID=44 (The site seems inaccessible at the moment.) The firmware update reportedly helped with various PL-3507 quirks in the past. I sent the patch which solved the original problem to Linus. I will close this bug once he merges the patch. The problem with the disk going suddenly away if sd was statically linked is unrelated; you should open another bug if that problem persists. Stefan, if it's not too much trouble, do you think you could attach the patch you intend to submit to -stable here? We'd like to include it in Gentoo's 2.6.16 kernel. Thanks. Created attachment 8117 [details]
patch for 2.6.16/ 2.6.16.x
Fix is now in Linus' tree. Fix is also in Linux 2.6.16.20 |