Bug 4373

Summary:	USB DVDRW hangs during burning or reading (ehci_hcd)
Product:	Drivers	Reporter:	Brian Hall (hallbw)
Component:	USB	Assignee:	David Brownell (dbrownell)
Status:	REJECTED UNREPRODUCIBLE
Severity:	normal	CC:	greg, mdharm-usb
Priority:	P2
Hardware:	i386
OS:	Linux
Kernel Version:	2.6.11	Subsystem:
Regression:	---	Bisected commit-id:
Bug Depends on:
Bug Blocks:	5089
Attachments:	usb kernel log removes annoying and useless message syslog showing the problem process list showing cdrecord in zombie state test case to demonstrate the problem compressed 2.6.14-rc1 syslog showing successful DVD burn w/ pauses/resets output of "lspci -vvxxxs 00:10.4" kernel .config I used for testing 2.6.14-rc1 tracebacks of the hang

Description Brian Hall 2005-03-19 13:18:24 UTC

Distribution: Gentoo 2004.3 (64-bit)
Hardware Environment: AMD64 3000+, MSI FISR VIA K8T800
Software Environment: cdrecord, 2.6.11 linus
Problem Description:
While burning a CD- or DVD-RW using an external USB enclosure
and a IDE burner, the burn process "hangs" and goes into these states:

root     11328  0.0  0.2   4400  1060 pts/1    S+   13:15   0:00
/bin/sh /net/scripts/burn /down/KNOPPIX_V3.8-2005-02-28-CeBIT_Edition.iso
root     11329  0.0  1.2   7244  6224 pts/1    DL+  13:15   0:00
cdrecord -v -overburn -data /down/KNOPPIX_V3.8-2005-02-28-CeBIT_Edition.iso
root     11331  0.0  1.2   7244  6224 pts/1    D+   13:15   0:00
cdrecord -v -overburn -data /down/KNOPPIX_V3.8-2005-02-28-CeBIT_Edition.iso

In the terminal window where I was burning, I get:

cdrecord: faio_wait_on_buffer for writer timed out.

In the kernel log, I get:
Mar 19 13:20:24 syrinx usb 1-3: USB disconnect, address 2
Mar 19 13:21:04 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 19 13:21:04 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 19 13:21:04 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 19 13:21:04 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 19 13:21:04 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 19 13:21:04 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 19 13:21:04 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 19 13:21:04 syrinx usb 1-3: usb_sg_cancel, unlink --> -19

The device is working in USB 2.0 mode, according to
/proc/usb/bus/devices (not included here because once this problem
occurs I can no longer access it, although my low-speed USB mouse
continues to work). Burner is registered to the ehci_hcd driver.

The burner no longer responds to inquiries after the problem
occurs- cycling power on it doesn't help, I have to reboot.

Problem is intermittent, sometimes I can read or write an entire
DVD with no issues.


Steps to reproduce:

cdrecord -v -overburn -data /down/KNOPPIX_V3.8-2005-02-28-CeBIT_Edition.iso

Comment 1 Andrew Morton 2005-03-21 19:36:06 UTC

Did this work OK under any earlier kernels?  If so, which versions?

Thanks.

Comment 2 Greg Kroah-Hartman 2005-03-21 23:24:15 UTC

Can you enable CONFIG_USB_DEBUG and create an attachment with the kernel log
right when the error happens?

Comment 3 Brian Hall 2005-03-22 06:58:03 UTC

I was not able to get proper operation of my external USB drive or of USB flash
drives using ehci_hcd in 2.6.8, 2.6.9, or 2.6.10. For 2.6.11, I also removed the
USB front panel connection from the motherboard and connected a different FP
USB2.0 connector. So far, I have had no problems with ~10MB transfers to and
from a USB2.0 stick with ehci_hcd. I suspect I am seeing the problem only with
the external writer because of the longer (or more frequent) transfers.

Comment 4 Brian Hall 2005-03-22 09:51:03 UTC

I was not able to get proper operation of my external USB drive or of USB flash
drives using ehci_hcd in 2.6.8, 2.6.9, or 2.6.10. For 2.6.11, I also removed the
USB front panel connection from the motherboard and connected a different FP
USB2.0 connector. So far, I have had no problems with ~10MB transfers to and
from a USB2.0 stick with ehci_hcd. I suspect I am seeing the problem only with
the external writer because of the longer (or more frequent) transfers.

Comment 5 Brian Hall 2005-03-23 19:52:13 UTC

Created attachment 4791 [details]
usb kernel log

I enabled CONFIG_USB_DEBUG, but when the drive stops recording, no error shows
up in the log at that time. A few minutes later, I get a disconnect message in
the log.

Should I try again with the usb storage debug enabled?

I've put the Plextor back into my PC case. The first other CDRW I tried in the
external usb didn't work (a Sony CRX-160). I put in a IOMagic CDRW drive and it
allowed me to erase a disc and start recording another before locking up (this
is the attached logfile). Maybe the problem is with the cheap USB enclosure? 
When I get time I'll try it with a hard drive- perhaps the enclosure doesn't
like to speak ATAPI.

Comment 6 Brian Hall 2005-03-28 18:26:01 UTC

Maybe I just have a too-cheap USB->IDE enclosure. I don't have a Windows box
with USB2.0 around to test it on.

Here's the excerpt from my syslog when using usb_storage VERBOSE, and trying a
40GB Maxtor HDD with 2.6.11-ck3:

Mar 27 16:40:40 syrinx usb-storage: *** thread sleeping.
Mar 27 16:40:40 syrinx usb-storage: queuecommand called
Mar 27 16:40:40 syrinx usb-storage: *** thread awakened.
Mar 27 16:40:40 syrinx usb-storage: Command WRITE_10 (10 bytes)
Mar 27 16:40:40 syrinx usb-storage:  2a 00 04 5a b5 2f 00 00 f0 00
Mar 27 16:40:40 syrinx usb-storage: Bulk Command S 0x43425355 T 0x33473 L 122880
F 0 Trg 0 LUN 0 CL 12
Mar 27 16:40:40 syrinx usb-storage: usb_stor_bulk_transfer_buf: xfer 31 bytes
Mar 27 16:40:40 syrinx usb-storage: Status code 0; transferred 31/31
Mar 27 16:40:40 syrinx usb-storage: -- transfer complete
Mar 27 16:40:40 syrinx usb-storage: Bulk command transfer result=0
Mar 27 16:40:40 syrinx usb-storage: usb_stor_bulk_transfer_sglist: xfer 122880
bytes, 30 entries
Mar 27 16:40:40 syrinx hub 1-0:1.0: state 5 ports 8 chg 0000 evt 0008
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: GetStatus port 3 status 00180b
POWER sig=j  PEC CSC CONNECT
Mar 27 16:40:40 syrinx hub 1-0:1.0: port 3, status 0501, change 0003, 480 Mb/s
Mar 27 16:40:40 syrinx usb 1-3: USB disconnect, address 4
Mar 27 16:40:40 syrinx usb 1-3: usb_disable_device nuking all URBs
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32bc00 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001f4bd900 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81002b49cd40 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32bb40 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32b9c0 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81002b49c380 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a2200 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32b900 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32b0c0 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32b240 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a2380 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001f4bdc00 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81002e0485c0 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a22c0 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a2ec0 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32be40 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a2500 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32b300 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a2800 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81002b49cec0 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32bd80 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a28c0 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a2a40 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32b780 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81002b49c080 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a2680 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001f4bd9c0 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001f4bd840 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a2c80 pipe
c0018400 ep3out-bulk
Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a2740 pipe
c0018400 ep3out-bulk
Mar 27 16:41:10 syrinx usb-storage: command_abort called
Mar 27 16:41:10 syrinx usb-storage: usb_stor_stop_transport called
Mar 27 16:41:10 syrinx usb-storage: -- cancelling sg request
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19

Comment 7 Andrew Morton 2005-05-25 16:30:30 UTC

Is this bug still present in 2.6.12-rc5?

Comment 8 Brian Hall 2005-05-31 18:08:58 UTC

Yes, the bug it still present. That is, the transfer still hangs, athough with
12rc5 I didn't see error messages (maybe those were only present before with
debug enabled).

I have something useful to report, however. I was able to reproduce the hang
using a USB 2.0 thumbdrive (specifically, a Lexar 128MB JumpDrive), while
copying over a  110MB file. Note that in order to cause the hang I had to mount
it "sync". Problem does not occur if I mount the USB stick "async". Note this is
not a complete system hang, the problem is only with the USB 2.0/ehci device in use.

I think this eliminates my USB enclosure as the source of the problem, the only
remaining possible hardware issue would have to be with the motherboard itself
(a MSI K8T Neo FISR VIA K8T800). Possible kernel issues would be with the ehci
driver, perhaps only on AMD64. I believe it works fine in USB 1.1 mode.

Comment 9 Greg Kroah-Hartman 2005-06-01 14:50:16 UTC

David, any thoughts?

Comment 10 David Brownell 2005-06-01 21:41:05 UTC

The thing that looks most wierd is that the device seems to 
have spontaneously disconnected itself.  Combined with the 
fact that the device seems to have locked itself up at the 
time, that does suggest a device problem.  On the other hand 
it's also VIA hardware, never the most trustworthy in my own 
experience.  The symptoms have unfortunately been seen before. 
 
There's another wierdness ... why was sg_cancel trying to cancel 
the urbs that were already completed?

Comment 11 Brian Hall 2005-06-02 17:12:21 UTC

Based on my experimenting, it might be some sort of timing issue? I have been
doing repeated burns & erases of 10X CDRW media to reproduce this. Problem
occurs after 6 repeats with Cool & Quiet disabled, but occured after 3 repeats
with C&Q enabled. May or may not mean anything. Is there some kind of USB stress
test program?

I disabled C&Q on the fly with:

echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

I was hoping disabling C&Q would be viable workaround.

Comment 12 David Brownell 2005-06-06 23:57:14 UTC

Have a look at bug #4057 which is broadly similar:  usb-storage  
devices wedging after a period of much activity.   There's a patch 
I just mentioned there, now submitted ... a precursor was really 
effective at making a drive finally act reliably.  It's quite 
possible this is the same underlying (non-EHCI) bug.

Comment 13 Brian Hall 2005-06-08 09:15:44 UTC

Well, I applied both patches as mentioned in bug #4057 against 2.6.12rc5, and
the results were not encouraging. I was able to successfully blank my CDRW, but
attempting to burn it resulted in a hang at 0MB burned, the USB device was
forcibly disconnected, and cdrecord segfaulted:

scsi: Device offlined - not ready after error recovery: host 1 channel 0 id 0 lun 0
cdrecord[9169]: segfault at 00007fffff1d4f88 rip 0000000000430f2a rsp
00007fffff1d4f70 error 6
cdrecord[9170]: segfault at 00007fffff1d4ff8 rip 0000000000430f2a rsp
00007fffff1d4fe0 error 6

May be only an issue on AMD64, perhaps the patches work better on i386?

The second of the two patches was incompatible with 2.6.11-ck, but I was able to
apply the first patch from #4057. It didn't help either.

Comment 14 David Brownell 2005-08-01 10:30:25 UTC

Has this behavior changed in 2.6.13-rc4?

This really looks to me like a device lockup in the middle
of some big I/O sequence, leading to disconnect.  That's is
actually a friendly way for the device to fail, and one
that has in the past worked quite nicely at the USB level.

For now I'd like to just verify that it still IS working OK
at that level; the last several problems in this area have
all been in SCSI code.  Can you include a few Alt-SysRq-T
tracebacks, of the processes that are hanging, any usb-storage
threads, and khubd?  Also (with CONFIG_USB_DEBUG) the contents
of the /sys/bus/usb_host/usbN/async file?  (Where N=1 in the
debug logs I see here.)  This will help me see what sort of
issue we've got here.



Those usb_sg_cancel() messages should ignore -ENODEV since
those codes should always imply the URBs will have been
aborted by usbcore.  Trivial fix.

Comment 15 David Brownell 2005-08-01 10:47:57 UTC

Created attachment 5457 [details]
removes annoying and useless message

Comment 16 Brian Hall 2005-08-07 12:13:17 UTC

Well, I didn't make any tracebacks, but I did do a quick test in 2.6.13rc5 and
had  bascially the same problem. In this case, I tried to write to a CDRW that I
had not already blanked.

I understand that something has to happen when the USB device becomes
temporarily unresponsive. My opinion is, if something happens during a long
transfer and the USB device stops responding, the kernel should pause the
transfer, issue a device reset, and then resume. This isn't remotely like what
happens now. At the moment, if the USB device stops responding, the device
completely disappears from the USB device tree. And more annoying, if I re-probe
(which sometimes works to get the device working again), the USB/fake SCSI bus
is incremented by one, so that in my cdrecord command line I have to change my
dev=2,0,0 to dev=3,0,0 or whatever. Why can't this stay the same?

I will try to apply the small error patch and get some tracebacks next time I
reboot.

Has anyone else been able to reproduce this problem, or is it just me?

Aug  6 16:44:25 syrinx Vendor: CDWRITER  Model: IDE1610           Rev: 0.19
Aug  6 16:44:25 syrinx Type:   CD-ROM                             ANSI SCSI
revision: 00
Aug  6 16:44:25 syrinx sr0: scsi3-mmc drive: 4x/4x writer cd/rw xa/form2 cdda tray
Aug  6 16:44:25 syrinx Attached scsi CD-ROM sr0 at scsi3, channel 0, id 0, lun 0
Aug  6 16:44:25 syrinx Attached scsi generic sg1 at scsi3, channel 0, id 0, lun
0,  type 5
Aug  6 16:44:25 syrinx usb-storage: device scan complete
Aug  6 16:44:25 syrinx scsi.agent[5425]: cdrom at
/devices/pci0000:00/0000:00:10.4/usb5/5-1/5-1:1.0/host3/target3:0:0/3:0:0:0
Aug  6 16:48:43 syrinx ehci_hcd 0000:00:10.4: port 1 high speed
Aug  6 16:48:43 syrinx ehci_hcd 0000:00:10.4: GetStatus port 1 status 001005
POWER sig=se0 PE CONNECT
Aug  6 16:48:43 syrinx usb 5-1: reset high speed USB device using ehci_hcd and
address 2
Aug  6 16:48:43 syrinx ehci_hcd 0000:00:10.4: port 1 high speed
Aug  6 16:48:43 syrinx ehci_hcd 0000:00:10.4: GetStatus port 1 status 001005
POWER sig=se0 PE CONNECT
Aug  6 16:48:51 syrinx ehci_hcd 0000:00:10.4: port 1 high speed
Aug  6 16:48:51 syrinx ehci_hcd 0000:00:10.4: GetStatus port 1 status 001005
POWER sig=se0 PE CONNECT
Aug  6 16:48:51 syrinx usb 5-1: reset high speed USB device using ehci_hcd and
address 2
Aug  6 16:48:51 syrinx ehci_hcd 0000:00:10.4: port 1 high speed
Aug  6 16:48:51 syrinx ehci_hcd 0000:00:10.4: GetStatus port 1 status 001005
POWER sig=se0 PE CONNECT
Aug  6 16:48:51 syrinx scsi: Device offlined - not ready after error recovery:
host 3 channel 0 id 0 lun 0
Aug  6 16:48:56 syrinx cdrecord[5614]: segfault at 00007fffff3f0fa8 rip
0000000000430f2a rsp 00007fffff3f0f90 error 6

Comment 17 Greg Kroah-Hartman 2005-08-18 21:44:30 UTC

Matt, any ideas?

Comment 18 Matthew Dharm 2005-08-19 10:30:12 UTC

Ideas?  Just one.

There's a code-path which introduces a delay between command and data.  It's
currently only active when Genesys Logic chipsets are detected.  It might be
interesting to force that on and see what happens...

Just look in linux/drivers/usb/storage for a delay() which is conditional on the
vendor being Genesys.  There's only one.

What bothers me, however, is that the user reports that power-cycling the drive
does NOT bring it back.  That should force a full disconnect of the drive, reset
all of it's logic, and reconnect it.  If that doesn't work, either the drive is
toast or the HC is toast.  Since he's observed this with multiple devices, I'm
betting on the HC.

Comment 19 Brian Hall 2005-08-20 14:42:26 UTC

Created attachment 5700 [details]
syslog showing the problem

This is the syslog I generated using 2.6.13rc6 with the small usb patch listed
in this current bug report applied. Symptoms: same as before, cdrecord goes
into a zombie state after cdrecord hangs when the USB device stops responding
on long transfers.

Comment 20 Brian Hall 2005-08-20 14:44:06 UTC

Created attachment 5701 [details]
process list showing cdrecord in zombie state

Comment 21 Brian Hall 2005-08-20 14:47:47 UTC

Created attachment 5702 [details]
test case to demonstrate the problem

In order to reproduce the problem, the ehci_hcd module must be loaded. I am
testing it via a USB 2.0 connection to an external enclosure with a Sony
CRX-160 CDRW, on a 10X 650MB CDRW disk. The test file I am burning is
approximately 600MB.
On my most recent test, the erase & burn worked on the first pass, and then
hung during the data burn on the second iteration.

Comment 22 Brian Hall 2005-08-20 14:50:11 UTC

The syslog I just entered contains a traceback triggered with "echo t >
/proc/sysrq-trigger". I had usb debugging, and usb-storage debugging enabled, so
the resulting logfile is 25MB uncompressed. Hopefully there will be some clues
in there to indicate the specific problem!

Comment 23 Brian Hall 2005-08-21 13:59:26 UTC

I tried the userspace workaround I saw in the Gentoo forums (passing "ts=32768"
to cdrecord, to reduce the size of the SCSI command buffer). While it seemed to
take longer to fail (over half an hour), eventually it failed in the same manner.

Comment 24 David Brownell 2005-09-13 02:18:23 UTC

Can you:

 - verify that this still happens with 2.6.14-rc1?
         ... various bugfixes may have helped
 - say whether this device has its own power supply?
         ... relying on the 500 mA from the hub seems risky
 - include "lspci -vvxxxs 0000:00:10.4" output, dumping config space?
         ... in case something informative appears
 - try using this through an external usb 2.0 hub?
         ... one person reported this helped, oddly enough.

One interesting thing in that big syslog file, as it was failing,
as well as the sysrq-T trace.  Looks to me as if for some reason
the device decided to disconnect (why??), and during that part the
USB stack behaved fine.  The hang looks to be the same SCSI hang
that lots of people report ... not a USB issue.


Aug 20 15:27:28 syrinx usb-storage: Command WRITE_10 (10 bytes)
Aug 20 15:27:28 syrinx usb-storage:  2a 00 00 03 15 0c 00 00 1f 00
Aug 20 15:27:28 syrinx usb-storage: Bulk Command S 0x43425355 T 0x4698 L 63488 F
0Trg 0 LUN 0 CL 12
Aug 20 15:27:28 syrinx usb-storage: usb_stor_bulk_transfer_buf: xfer 31 bytes
Aug 20 15:27:28 syrinx usb-storage: Status code 0; transferred 31/31
Aug 20 15:27:28 syrinx usb-storage: -- transfer complete
Aug 20 15:27:28 syrinx usb-storage: Bulk command transfer result=0
Aug 20 15:27:28 syrinx usb-storage: usb_stor_bulk_transfer_sglist: xfer 63488
bytes, 1 entries

     ... so at 27:28 all is fine.  But wait, not so quickly...

Aug 20 15:27:28 syrinx hub 5-0:1.0: state 5 ports 8 chg 0000 evt 0002
Aug 20 15:27:28 syrinx ehci_hcd 0000:00:10.4: GetStatus port 1 status 00180b
POWERsig=j PEC CSC CONNECT

     ... for some reason, right at that moment the root hub disconnected
     the device on that port.  Specifically, it looks like the device was
     disconnected then reconnected ... unclear _why_ though, it might have
     been either end.

     Maybe it was a signal quality problem (even a bad cable!)
     or frame timing problems.    I looked at the usb 2.0 spec,
     and figure 11-10 says the hub could do this because of a "port error"
     as detailed in section 11.8.1 ... which reads basically that when
     a device is busy too late in the microframe (the EOF2 point), the
     hub must disconnect it.  So far as I can tell, there's no way to
     modify the EOF2 point; the EHCI FLADJ register also modifies the
     USB clock ("how long is a millisecond").

     As a rule, when folk have had such problems, it's turned out to be
     some kind of hardware problem.  Switching USB disks has helped.

Aug 20 15:27:28 syrinx hub 5-0:1.0: port 1, status 0501, change 0003, 480 Mb/s
Aug 20 15:27:28 syrinx usb 5-1: USB disconnect, address 2
Aug 20 15:27:28 syrinx usb 5-1: usb_disable_device nuking all URBs
Aug 20 15:27:28 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff810012325a80 pipe
c0018200 ep3out-bulk
Aug 20 15:28:08 syrinx usb-storage: command_abort called
Aug 20 15:28:08 syrinx usb-storage: usb_stor_stop_transport called
Aug 20 15:28:08 syrinx usb-storage: -- cancelling sg request
Aug 20 15:28:08 syrinx usb 5-1: usb_sg_cancel, unlink --> -19

      And here presumably is the hang, but no USB level errors show.

Comment 25 Brian Hall 2005-09-17 09:23:49 UTC

Created attachment 6048 [details]
compressed 2.6.14-rc1 syslog showing successful DVD burn w/ pauses/resets

Compressed with bzip2 -9, 139M uncomressed! USB debug and USB storage debug
were both enabled.

Comment 26 Brian Hall 2005-09-17 09:26:48 UTC

Created attachment 6049 [details]
output of "lspci -vvxxxs 00:10.4"

Comment 27 Brian Hall 2005-09-17 09:27:32 UTC

Created attachment 6050 [details]
kernel .config I used for testing 2.6.14-rc1

Comment 28 Brian Hall 2005-09-17 09:35:41 UTC

1) verify that this still happens with 2.6.14-rc1?

The problem seems to be much better in 2.6.14-rc1, in that when the device stops
responding, the burn resumes after the device reset is complete (takes about 35
to 40 seconds to reset). I have burned 3 DVDs successfully, so the original
problem seems to be fixed (the DVD burner USB device no longer "goes away" and
never responds until a reboot- device no longer hangs).

However, during each burn, the burn pauses for a reset between 1 and 3 times.
This drags out the time required to burn a DVD. I have attached the syslog and
.config for this test. It is possible that this is 100% correct behavior as far
as the kernel is concerned, if my external USB enclosure hardware is just that bad. 

I seem to get fewer resets during a burn if I disable the options under ehci, 
turing off "full speed iso transactions" and "root hub translators", but I still  
get one or two during a burn.

I have never tested this device in Windows, is there any point to building a W2K  
box with USB 2.0 and doing some DVD test burns? 

2) say whether this device has its own power supply?

Yes, it has a 12V, 2.0A power supply brick.   

3) include "lspci -vvxxxs 0000:00:10.4" output, dumping config space?

# lspci -vvxxxs 0000:00:10.4
lspci: -f: Invalid slot number

but this command works: "lspci -vvxxxs 00:10.4"

See attached usb.config for output.

4)  - try using this through an external usb 2.0 hub?

The only hubs I have are 1.1. I'll pick up a 2.0 hub the next time I see one
free w/ rebate.

Also, I would not recommend 2.6.14-rc1 for general use yet, at least on hardware
similar to mine (Opteron/AMD64). My standard PS/2 keyboard wouldn't work after
rebooting from 2.6.14-rc-1, even after a poweroff; I had to boot into a
2.6.13-series kernel to get my keyboard working again.

Comment 29 Brian Hall 2005-09-17 17:07:59 UTC

Somewaht related bug: 5274
(for my problem with using PS/2 keyboard after heavy USB I/O on 2.6.14-rc1)
http://bugzilla.kernel.org/show_bug.cgi?id=5274

Comment 30 David Brownell 2005-10-10 20:27:19 UTC

Hmm, looks like this is fixed now ... the unforgivable behavior 
(hang/deadlock) is gone, and now it's "just" a temporary wedge 
that's properly recovered-from.  Different, and relatively minor, 
problem. 
 
I'll note that you never provided the requested alt-sysrq-T traces. 
If you'd done that we might have been able to track this down more 
precisely ... "ps" output is NOT COMPARABLE, it doesn't show where 
the deadlocks appear (no stack traces).

Comment 31 Brian Hall 2005-10-11 13:30:40 UTC

Created attachment 6284 [details]
tracebacks of the hang

I thought I had posted a logfile with the tracebacks in it. Here is a log
showing a traceback from this problem, but it appears to use 2.6.11.

Also, I tested the hardware from a WinXP PC and was unable to reproduce the
failure. The PC used an Intel 82801EB USB controller, so that's not directly
comparable.

I will retest this when 2.6.14 final comes out. Should I reopen this is bug if
there is a problem, or create a new one?