Most recent kernel where this bug did not occur: Distribution: Slackware 10.2 Hardware Environment: Notebook Pentium 4 3.2HT Software Environment: Kernel 2.6.16 (.1, .2) Problem Description: I have external HD (Toshiba MK8025GAS) connected through IEEE1394, using a Vipower VPA-25289 enclosure. With kernel 2.6.15 I can access it through /dev/sda. With 2.6.16, I can do it anymore. The dmesg output is: ieee1394: The root node is not cycle master capable; selecting a new root node a nd resetting... ieee1394: Error parsing configrom for node 0-00:1023 ieee1394: Node changed: 0-00:1023 -> 0-01:1023 ieee1394: Node added: ID:BUS[0-00:1023] GUID[0050770e000042c0] ieee1394: sbp2: Driver forced to serialize I/O (serialize_io=1) ieee1394: sbp2: Try serialize_io=0 for better performance scsi0 : SBP-2 IEEE-1394 ieee1394: sbp2: Logged into SBP-2 device ieee1394: Node 0-00:1023: Max speed [S400] - Max payload [2048] ieee1394: sbp2: aborting sbp2 command 0:0:0:0: command: cdb[0]=0x12: 12 00 00 00 24 00 ieee1394: Node changed: 0-01:1023 -> 0-00:1023 ieee1394: Node suspended: ID:BUS[0-00:1023] GUID[0050770e000042c0] ieee1394: sbp2: aborting sbp2 command 0:0:0:0: command: cdb[0]=0x0: 00 00 00 00 00 00 ieee1394: sbp2: reset requested 0:0:0:0: scsi: Device offlined - not ready after error recovery ieee1394: sbp2: scsi_add_device failed Steps to reproduce: Connect a external HD through IEEE1394.
Created attachment 7815 [details] lsmod
Created attachment 7816 [details] lspci
Created attachment 7817 [details] Full dmesg
Created attachment 7818 [details] Log messages
The dmesg lines ieee1394: Node changed: 0-01:1023 -> 0-00:1023 ieee1394: Node suspended: ID:BUS[0-00:1023] GUID[0050770e000042c0] indicate that the disk's FireWire node suddenly vanished from the bus. This is quite puzzling. Did you use the same hardware (disk, cable, PC) under 2.6.15? Can you access the disk again if you revert to Linux 2.6.15? If the answer to both is yes, please boot Linux 2.6.16(.2) again and enable "Excessive debugging output" in the IEEE 1394 section of the kernel configuration (i.e. CONFIG_IEEE1394_VERBOSEDEBUG=Y) and recompile, reinstall, reload the FireWire driver modules. Post the full dmesg from when you load the recompiled drivers and plug in the disk. Actually I would rather like to see the relevant portion of the syslog file where kernel messages are stored, instead of dmesg. The syslog is more informative than dmesg because of the human-readable timestamps. Make sure though that you look into the logfile which catches all kernel messages, also those at debug priority. I.e. compare if everything in dmesg can also be seen in the logfile. /etc/syslog.conf contains the configuration of the syslog daemon; it is different from distro to distro. Thanks.
Created attachment 7819 [details] Log dmesg Yes, I use the same hardware and I can access the disk again if I revert to 2.6.15. I have compiled, instaled, rebooted the system and plugued the hardware. The dmesg output after that is attached.
I am spotting only one suspicious event in your log: When the disk granted the login by writing into sbp2's status FIFO, the Linux host returns a write response since Linux 2.6.16. (It only returned an ACK in 2.6.15 and before.) The disk should ACK this write response, but your log shows ack=0x3 == "missing ACK". But I don't know yet if this is related to the following SCSI command timeout. The host doesn't really need the ACK, and the disk even does not really need the response. But the response left unacknowledged may be a sign of something going wrong in the target. Although the subsequent three write requests from sbp2 to the target are OK (ack=0x11 == "complete"). I have two more requests: - Run "echo 9216 > /sys/module/scsi_mod/parameters/scsi_logging_level". This requires that "SCSI device support" is configured as a loadable module (and that scsi_mod is loaded when you run the command) and also configured for "SCSI logging facility" (CONFIG_SCSI_LOGGING=y). After that, plug the disk in and post the resulting dmesg. - Download the following patch which went into Linux 2.6.16: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=35bdddb83f62978b5fad82a14fbfd78cc3a5a60c Change into the kernel source main directory (e.g. /usr/src/linux-2.6.16.2 or where ever you keep the sources of the currently running kernel) and revert the patch: $ patch -b -p1 -R < ~/git-35bdddb83f62978b5fad82a14fbfd78cc3a5a60c.patch Then recompile and reinstall the modules, unload and reload sbp2, plug in the disk, and see if it behaves different. Alas there is quite a number of changes to the 1394 subsystem from 2.6.15 to 2.6.16 (http://me.in-berlin.de/~s5r6/linux1394/merged/available_in_2.6.16/) but the one above is directly related to this ack=0x3. Again, thanks for taking the time to debug this.
Created attachment 7824 [details] echo 9216 > /sys/module/scsi_mod/parameters/scsi_logging_level I have configured the kernel for compile SCSI as module and enabled SCSI log.
Created attachment 7825 [details] git-35bdddb83f62978b5fad82a14fbfd78cc3a5a60c.patch patch I have reverted the patch. Here is the dmesg output. I can't access the disk yet.
Your last log shows that the first SCSI command, the INQUIRY, succeeded. The device was properly detected by the scsi_mod core driver --- unlike all 2.6.16 attempts before. The next (missing) step is sd_mod being attached to the disk. sd_mod is responsible to make the device accessible via /dev/sd*. What if you run "modprobe sd_mod"? (The excessive debugging option of the ieee1394 drivers is no longer necessary.)
Yes! After revert patch and run "modprobe sd_mod", my disk is back! I can have access it again.
OK, thanks a lot. Alas, for reasons given in the offending patch's description, we cannot simply revert the patch. I will prepare a fix which will involve sbp2, ieee1394, and ohci1394. I will notify you once I got the fix ready. Do you know what chip this enclosure's bridge board is based on? (The only way to determine this for sure is to look inside the enclosure. Information given in /sys/bus/ieee1394/devices/*/ or the display by gscanbus sometimes also provides hints about the hardware, but since these are programmable firmware datafields they can also be misleading. But if you feal uncomfortable with opening the enclosure, don't do so.)
I can see 3 chips: - PL-3507 05208A (Bigger) - PMC (I think) 0523 Pm39LV512-70VCE - I can't read many information about third chip, but I think is a Philips. I think is something like: pci1394p2380 tpc0525c
Yes, Prolific PL-3507 is the bridge. This chip is infamous for a variety of strange bugs. If have a PL-3507 based device myself which was acting unreliably lately. I will check if this is also caused by the status FIFO patch in 2.6.16.
(Side note: The issue with my own PL-3507 based CD-RW is unrelated to sbp2's status FIFO.)
Vanei, I just posted a set of patches at linux1394-devel which essentially brings 2.6.15's behaviour back, but works in a cleaner fashion (i.e. is not broken on certain OHCI adapters as 2.6.15 was). Could you try these patches with a kernel of your choice? I am keeping all of the current IEEE 1394 patches as combo patches for the last few kernel releases at my web site: http://me.in-berlin.de/~s5r6/linux1394/updates/ Note, the 2.6.16.x variant requires 2.6.16.2 or later. If you are still using 2.6.16, you should apply patch-2.6.16.2 or later before applying my 1394 patchset. Alternatively I could create a patchset specifically for 2.6.16. Thanks in advance for testing.
...or rather try this reduced patch: http://me.in-berlin.de/~s5r6/linux1394/work-in-progress/2.6.16-ohci1394-sbp2-fix-scsi_add_device-failed-with-PL_3507-based-devices.patch Thanks.
I have applied the patch to kernel 2.6.16.9. I have tested both, compiling SCSI Disk Support as module and built-in. And both work (if compiled as module, I need run "modprob sd_mod").
Created attachment 7904 [details] It seems that nor everything functions as I thought. I can access the external disk (with cfdisk) using SCSI Disk Support compiled built-in, but when I tried mount I got a error and the attached message in dmesg.
If I compile SCSI disk Support as module, I can access the disk, no problems.
Re comment #18: Thanks for testing. The requirement to load sd_mod manually may be caused by a change in scsi_mod's sysfs representation of SCSI devices since Linux 2.6.14. Most FireWire HDDs are not shown as device type 0 but as type 14 now. The userspace helper of your distro which is responsible to load SCSI command set drivers (/etc/hotplug/scsi.agent, HAL, or whatever script or daemon) needs to be updated: http://marc.theaimsgroup.com/?l=linux-hotplug-devel&m=112129848913652 Re comment #19, #20: Very strange. AFAIU the log, the disk's node suddenly went away at the FireWire level. I don't understand how this could be related to sd_mod being a module or statically linked. Actually the way how sd_mod is linked should not matter at all, except perhaps for a slightly longer pause between scsi_mod's very first inquiry command and sd_mod's subsequent SCSI commands. If you have access to a Windows box with FireWire, you could try a firmware update to the PL-3507: http://www.prolific.com.tw/eng/downloads.asp?ID=44 (The site seems inaccessible at the moment.) The firmware update reportedly helped with various PL-3507 quirks in the past.
I sent the patch which solved the original problem to Linus. I will close this bug once he merges the patch. The problem with the disk going suddenly away if sd was statically linked is unrelated; you should open another bug if that problem persists.
Stefan, if it's not too much trouble, do you think you could attach the patch you intend to submit to -stable here? We'd like to include it in Gentoo's 2.6.16 kernel. Thanks.
Created attachment 8117 [details] patch for 2.6.16/ 2.6.16.x
Fix is now in Linus' tree.
Fix is also in Linux 2.6.16.20