Bug 6356 - sbp2: scsi_add_device failed. IEEE1394 HD is not working anymore.
Summary: sbp2: scsi_add_device failed. IEEE1394 HD is not working anymore.
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: IEEE1394 (show other bugs)
Hardware: i386 Linux
: P2 normal
Assignee: Stefan Richter
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-04-08 09:30 UTC by Vanei Heidemann
Modified: 2006-06-07 06:48 UTC (History)
1 user (show)

See Also:
Kernel Version: 2.6.16
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
lsmod (1.46 KB, text/plain)
2006-04-08 09:32 UTC, Vanei Heidemann
Details
lspci (1.21 KB, text/plain)
2006-04-08 09:33 UTC, Vanei Heidemann
Details
Full dmesg (15.06 KB, text/plain)
2006-04-08 09:34 UTC, Vanei Heidemann
Details
Log messages (647 bytes, text/plain)
2006-04-08 09:35 UTC, Vanei Heidemann
Details
Log dmesg (15.27 KB, text/plain)
2006-04-08 13:20 UTC, Vanei Heidemann
Details
echo 9216 > /sys/module/scsi_mod/parameters/scsi_logging_level (15.26 KB, text/plain)
2006-04-09 06:41 UTC, Vanei Heidemann
Details
git-35bdddb83f62978b5fad82a14fbfd78cc3a5a60c.patch patch (15.27 KB, text/plain)
2006-04-09 06:53 UTC, Vanei Heidemann
Details
It seems that nor everything functions as I thought. (2.68 KB, text/plain)
2006-04-19 16:12 UTC, Vanei Heidemann
Details
patch for 2.6.16/ 2.6.16.x (2.65 KB, patch)
2006-05-15 12:10 UTC, Stefan Richter
Details | Diff

Description Vanei Heidemann 2006-04-08 09:30:40 UTC
Most recent kernel where this bug did not occur:
Distribution: Slackware 10.2
Hardware Environment: Notebook Pentium 4 3.2HT
Software Environment: Kernel 2.6.16 (.1, .2)
Problem Description: I have external HD (Toshiba MK8025GAS) connected through
IEEE1394, using a Vipower VPA-25289 enclosure. With kernel 2.6.15 I can access
it through /dev/sda. With 2.6.16, I can do it anymore. The dmesg output is:
ieee1394: The root node is not cycle master capable; selecting a new root node a
                 nd resetting...
ieee1394: Error parsing configrom for node 0-00:1023
ieee1394: Node changed: 0-00:1023 -> 0-01:1023
ieee1394: Node added: ID:BUS[0-00:1023]  GUID[0050770e000042c0]
ieee1394: sbp2: Driver forced to serialize I/O (serialize_io=1)
ieee1394: sbp2: Try serialize_io=0 for better performance
scsi0 : SBP-2 IEEE-1394
ieee1394: sbp2: Logged into SBP-2 device
ieee1394: Node 0-00:1023: Max speed [S400] - Max payload [2048]
ieee1394: sbp2: aborting sbp2 command
 0:0:0:0:
        command: cdb[0]=0x12: 12 00 00 00 24 00
ieee1394: Node changed: 0-01:1023 -> 0-00:1023
ieee1394: Node suspended: ID:BUS[0-00:1023]  GUID[0050770e000042c0]
ieee1394: sbp2: aborting sbp2 command
 0:0:0:0:
        command: cdb[0]=0x0: 00 00 00 00 00 00
ieee1394: sbp2: reset requested
 0:0:0:0: scsi: Device offlined - not ready after error recovery
ieee1394: sbp2: scsi_add_device failed


Steps to reproduce: Connect a external HD through IEEE1394.
Comment 1 Vanei Heidemann 2006-04-08 09:32:18 UTC
Created attachment 7815 [details]
lsmod
Comment 2 Vanei Heidemann 2006-04-08 09:33:42 UTC
Created attachment 7816 [details]
lspci
Comment 3 Vanei Heidemann 2006-04-08 09:34:27 UTC
Created attachment 7817 [details]
Full dmesg
Comment 4 Vanei Heidemann 2006-04-08 09:35:59 UTC
Created attachment 7818 [details]
Log messages
Comment 5 Stefan Richter 2006-04-08 11:01:39 UTC
The dmesg lines
    ieee1394: Node changed: 0-01:1023 -> 0-00:1023
    ieee1394: Node suspended: ID:BUS[0-00:1023]  GUID[0050770e000042c0]
indicate that the disk's FireWire node suddenly vanished from the bus. This is
quite puzzling.

Did you use the same hardware (disk, cable, PC) under 2.6.15?

Can you access the disk again if you revert to Linux 2.6.15?

If the answer to both is yes, please boot Linux 2.6.16(.2) again and enable
"Excessive debugging output" in the IEEE 1394 section of the kernel
configuration (i.e. CONFIG_IEEE1394_VERBOSEDEBUG=Y) and recompile, reinstall,
reload the FireWire driver modules. Post the full dmesg from when you load the
recompiled drivers and plug in the disk.

Actually I would rather like to see the relevant portion of the syslog file
where kernel messages are stored, instead of dmesg. The syslog is more
informative than dmesg because of the human-readable timestamps. Make sure
though that you look into the logfile which catches all kernel messages, also
those at debug priority. I.e. compare if everything in dmesg can also be seen in
the logfile. /etc/syslog.conf contains the configuration of the syslog daemon;
it is different from distro to distro.

Thanks.
Comment 6 Vanei Heidemann 2006-04-08 13:20:00 UTC
Created attachment 7819 [details]
Log dmesg

Yes, I use the same hardware and I can access the disk again if I revert to
2.6.15.
I have compiled, instaled, rebooted the system and plugued the hardware. The
dmesg output after that is attached.
Comment 7 Stefan Richter 2006-04-09 03:01:16 UTC
I am spotting only one suspicious event in your log: When the disk granted the
login by writing into sbp2's status FIFO, the Linux host returns a write
response since Linux 2.6.16. (It only returned an ACK in 2.6.15 and before.) The
disk should ACK this write response, but your log shows ack=0x3 == "missing
ACK". But I don't know yet if this is related to the following SCSI command
timeout. The host doesn't really need the ACK, and the disk even does not really
need the response. But the response left unacknowledged may be a sign of
something going wrong in the target. Although the subsequent three write
requests from sbp2 to the target are OK (ack=0x11 == "complete").

I have two more requests:

 - Run "echo 9216 > /sys/module/scsi_mod/parameters/scsi_logging_level". This
requires that "SCSI device support" is configured as a loadable module (and that
scsi_mod is loaded when you run the command) and also configured for "SCSI
logging facility" (CONFIG_SCSI_LOGGING=y). After that, plug the disk in and post
the resulting dmesg.

 - Download the following patch which went into Linux 2.6.16:
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=35bdddb83f62978b5fad82a14fbfd78cc3a5a60c
Change into the kernel source main directory (e.g. /usr/src/linux-2.6.16.2 or
where ever you keep the sources of the currently running kernel) and revert the
patch:
$ patch -b -p1 -R < ~/git-35bdddb83f62978b5fad82a14fbfd78cc3a5a60c.patch
Then recompile and reinstall the modules, unload and reload sbp2, plug in the
disk, and see if it behaves different.

Alas there is quite a number of changes to the 1394 subsystem from 2.6.15 to
2.6.16 (http://me.in-berlin.de/~s5r6/linux1394/merged/available_in_2.6.16/) but
the one above is directly related to this ack=0x3.

Again, thanks for taking the time to debug this.
Comment 8 Vanei Heidemann 2006-04-09 06:41:17 UTC
Created attachment 7824 [details]
echo 9216 > /sys/module/scsi_mod/parameters/scsi_logging_level

I have configured the kernel for compile SCSI as module and enabled SCSI log.
Comment 9 Vanei Heidemann 2006-04-09 06:53:02 UTC
Created attachment 7825 [details]
git-35bdddb83f62978b5fad82a14fbfd78cc3a5a60c.patch patch

I have reverted the patch. Here is the dmesg output. I can't access the disk
yet.
Comment 10 Stefan Richter 2006-04-09 08:49:39 UTC
Your last log shows that the first SCSI command, the INQUIRY, succeeded. The
device was properly detected by the scsi_mod core driver --- unlike all 2.6.16
attempts before.

The next (missing) step is sd_mod being attached to the disk. sd_mod is
responsible to make the device accessible via /dev/sd*. What if you run
"modprobe sd_mod"?

(The excessive debugging option of the ieee1394 drivers is no longer necessary.)
Comment 11 Vanei Heidemann 2006-04-09 09:42:09 UTC
Yes! After revert patch and run "modprobe sd_mod", my disk is back! I can have
access it again.
Comment 12 Stefan Richter 2006-04-09 10:46:39 UTC
OK, thanks a lot. Alas, for reasons given in the offending patch's description,
we cannot simply revert the patch. I will prepare a fix which will involve sbp2,
ieee1394, and ohci1394. I will notify you once I got the fix ready.

Do you know what chip this enclosure's bridge board is based on? (The only way
to determine this for sure is to look inside the enclosure. Information given in
/sys/bus/ieee1394/devices/*/ or the display by gscanbus sometimes also provides
hints about the hardware, but since these are programmable firmware datafields
they can also be misleading. But if you feal uncomfortable with opening the
enclosure, don't do so.)
Comment 13 Vanei Heidemann 2006-04-09 11:14:45 UTC
I can see 3 chips:
- PL-3507  05208A (Bigger)
- PMC (I think)  0523  Pm39LV512-70VCE
- I can't read many information about third chip, but I think is a Philips. I
think is something like: pci1394p2380  tpc0525c
Comment 14 Stefan Richter 2006-04-09 11:56:32 UTC
Yes, Prolific PL-3507 is the bridge. This chip is infamous for a variety of
strange bugs. If have a PL-3507 based device myself which was acting unreliably
lately. I will check if this is also caused by the status FIFO patch in 2.6.16.
Comment 15 Stefan Richter 2006-04-15 14:41:33 UTC
(Side note: The issue with my own PL-3507 based CD-RW is unrelated to sbp2's
status FIFO.)
Comment 16 Stefan Richter 2006-04-15 14:49:32 UTC
Vanei, I just posted a set of patches at linux1394-devel which essentially
brings 2.6.15's behaviour back, but works in a cleaner fashion (i.e. is not
broken on certain OHCI adapters as 2.6.15 was).

Could you try these patches with a kernel of your choice? I am keeping all of
the current IEEE 1394 patches as combo patches for the last few kernel releases
at my web site: http://me.in-berlin.de/~s5r6/linux1394/updates/

Note, the 2.6.16.x variant requires 2.6.16.2 or later. If you are still using
2.6.16, you should apply patch-2.6.16.2 or later before applying my 1394
patchset. Alternatively I could create a patchset specifically for 2.6.16.

Thanks in advance for testing.
Comment 18 Vanei Heidemann 2006-04-19 15:56:51 UTC
I have applied the patch to kernel 2.6.16.9. I have tested both, compiling SCSI
Disk Support as module and built-in. And both work (if compiled as module, I
need run "modprob sd_mod").
Comment 19 Vanei Heidemann 2006-04-19 16:12:30 UTC
Created attachment 7904 [details]
It seems that nor everything functions as I thought.

I can access the external disk (with cfdisk) using SCSI Disk Support compiled
built-in, but when I tried mount I got a error and the attached message in
dmesg.
Comment 20 Vanei Heidemann 2006-04-19 16:28:30 UTC
If I compile SCSI disk Support as module, I can access the disk, no problems.
Comment 21 Stefan Richter 2006-04-20 15:19:42 UTC
Re comment #18:

Thanks for testing.
The requirement to load sd_mod manually may be caused by a change in scsi_mod's
sysfs representation of SCSI devices since Linux 2.6.14. Most FireWire HDDs are
not shown as device type 0 but as type 14 now. The userspace helper of your
distro which is responsible to load SCSI command set drivers
(/etc/hotplug/scsi.agent, HAL, or whatever script or daemon) needs to be
updated: http://marc.theaimsgroup.com/?l=linux-hotplug-devel&m=112129848913652

Re comment #19, #20:

Very strange. AFAIU the log, the disk's node suddenly went away at the FireWire
level. I don't understand how this could be related to sd_mod being a module or
statically linked. Actually the way how sd_mod is linked should not matter at
all, except perhaps for a slightly longer pause between scsi_mod's very first
inquiry command and sd_mod's subsequent SCSI commands.

If you have access to a Windows box with FireWire, you could try a firmware
update to the PL-3507: http://www.prolific.com.tw/eng/downloads.asp?ID=44
(The site seems inaccessible at the moment.) The firmware update reportedly
helped with various PL-3507 quirks in the past.
Comment 22 Stefan Richter 2006-04-23 09:30:46 UTC
I sent the patch which solved the original problem to Linus. I will close this
bug once he merges the patch. The problem with the disk going suddenly away if
sd was statically linked is unrelated; you should open another bug if that
problem persists.
Comment 23 Daniel Drake 2006-05-15 08:33:06 UTC
Stefan, if it's not too much trouble, do you think you could attach the patch
you intend to submit to -stable here? We'd like to include it in Gentoo's 2.6.16
kernel. Thanks.
Comment 24 Stefan Richter 2006-05-15 12:10:05 UTC
Created attachment 8117 [details]
patch for 2.6.16/ 2.6.16.x
Comment 25 Stefan Richter 2006-05-18 07:57:59 UTC
Fix is now in Linus' tree.
Comment 26 Stefan Richter 2006-06-07 06:48:44 UTC
Fix is also in Linux 2.6.16.20

Note You need to log in before you can comment on or make changes to this bug.