Distribution: Gentoo 2004.3 (64-bit) Hardware Environment: AMD64 3000+, MSI FISR VIA K8T800 Software Environment: cdrecord, 2.6.11 linus Problem Description: While burning a CD- or DVD-RW using an external USB enclosure and a IDE burner, the burn process "hangs" and goes into these states: root 11328 0.0 0.2 4400 1060 pts/1 S+ 13:15 0:00 /bin/sh /net/scripts/burn /down/KNOPPIX_V3.8-2005-02-28-CeBIT_Edition.iso root 11329 0.0 1.2 7244 6224 pts/1 DL+ 13:15 0:00 cdrecord -v -overburn -data /down/KNOPPIX_V3.8-2005-02-28-CeBIT_Edition.iso root 11331 0.0 1.2 7244 6224 pts/1 D+ 13:15 0:00 cdrecord -v -overburn -data /down/KNOPPIX_V3.8-2005-02-28-CeBIT_Edition.iso In the terminal window where I was burning, I get: cdrecord: faio_wait_on_buffer for writer timed out. In the kernel log, I get: Mar 19 13:20:24 syrinx usb 1-3: USB disconnect, address 2 Mar 19 13:21:04 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 19 13:21:04 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 19 13:21:04 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 19 13:21:04 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 19 13:21:04 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 19 13:21:04 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 19 13:21:04 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 19 13:21:04 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 The device is working in USB 2.0 mode, according to /proc/usb/bus/devices (not included here because once this problem occurs I can no longer access it, although my low-speed USB mouse continues to work). Burner is registered to the ehci_hcd driver. The burner no longer responds to inquiries after the problem occurs- cycling power on it doesn't help, I have to reboot. Problem is intermittent, sometimes I can read or write an entire DVD with no issues. Steps to reproduce: cdrecord -v -overburn -data /down/KNOPPIX_V3.8-2005-02-28-CeBIT_Edition.iso
Did this work OK under any earlier kernels? If so, which versions? Thanks.
Can you enable CONFIG_USB_DEBUG and create an attachment with the kernel log right when the error happens?
I was not able to get proper operation of my external USB drive or of USB flash drives using ehci_hcd in 2.6.8, 2.6.9, or 2.6.10. For 2.6.11, I also removed the USB front panel connection from the motherboard and connected a different FP USB2.0 connector. So far, I have had no problems with ~10MB transfers to and from a USB2.0 stick with ehci_hcd. I suspect I am seeing the problem only with the external writer because of the longer (or more frequent) transfers.
Created attachment 4791 [details] usb kernel log I enabled CONFIG_USB_DEBUG, but when the drive stops recording, no error shows up in the log at that time. A few minutes later, I get a disconnect message in the log. Should I try again with the usb storage debug enabled? I've put the Plextor back into my PC case. The first other CDRW I tried in the external usb didn't work (a Sony CRX-160). I put in a IOMagic CDRW drive and it allowed me to erase a disc and start recording another before locking up (this is the attached logfile). Maybe the problem is with the cheap USB enclosure? When I get time I'll try it with a hard drive- perhaps the enclosure doesn't like to speak ATAPI.
Maybe I just have a too-cheap USB->IDE enclosure. I don't have a Windows box with USB2.0 around to test it on. Here's the excerpt from my syslog when using usb_storage VERBOSE, and trying a 40GB Maxtor HDD with 2.6.11-ck3: Mar 27 16:40:40 syrinx usb-storage: *** thread sleeping. Mar 27 16:40:40 syrinx usb-storage: queuecommand called Mar 27 16:40:40 syrinx usb-storage: *** thread awakened. Mar 27 16:40:40 syrinx usb-storage: Command WRITE_10 (10 bytes) Mar 27 16:40:40 syrinx usb-storage: 2a 00 04 5a b5 2f 00 00 f0 00 Mar 27 16:40:40 syrinx usb-storage: Bulk Command S 0x43425355 T 0x33473 L 122880 F 0 Trg 0 LUN 0 CL 12 Mar 27 16:40:40 syrinx usb-storage: usb_stor_bulk_transfer_buf: xfer 31 bytes Mar 27 16:40:40 syrinx usb-storage: Status code 0; transferred 31/31 Mar 27 16:40:40 syrinx usb-storage: -- transfer complete Mar 27 16:40:40 syrinx usb-storage: Bulk command transfer result=0 Mar 27 16:40:40 syrinx usb-storage: usb_stor_bulk_transfer_sglist: xfer 122880 bytes, 30 entries Mar 27 16:40:40 syrinx hub 1-0:1.0: state 5 ports 8 chg 0000 evt 0008 Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: GetStatus port 3 status 00180b POWER sig=j PEC CSC CONNECT Mar 27 16:40:40 syrinx hub 1-0:1.0: port 3, status 0501, change 0003, 480 Mb/s Mar 27 16:40:40 syrinx usb 1-3: USB disconnect, address 4 Mar 27 16:40:40 syrinx usb 1-3: usb_disable_device nuking all URBs Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32bc00 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001f4bd900 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81002b49cd40 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32bb40 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32b9c0 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81002b49c380 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a2200 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32b900 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32b0c0 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32b240 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a2380 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001f4bdc00 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81002e0485c0 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a22c0 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a2ec0 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32be40 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a2500 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32b300 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a2800 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81002b49cec0 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32bd80 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a28c0 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a2a40 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001d32b780 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81002b49c080 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a2680 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001f4bd9c0 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff81001f4bd840 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a2c80 pipe c0018400 ep3out-bulk Mar 27 16:40:40 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff8100123a2740 pipe c0018400 ep3out-bulk Mar 27 16:41:10 syrinx usb-storage: command_abort called Mar 27 16:41:10 syrinx usb-storage: usb_stor_stop_transport called Mar 27 16:41:10 syrinx usb-storage: -- cancelling sg request Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19 Mar 27 16:41:10 syrinx usb 1-3: usb_sg_cancel, unlink --> -19
Is this bug still present in 2.6.12-rc5?
Yes, the bug it still present. That is, the transfer still hangs, athough with 12rc5 I didn't see error messages (maybe those were only present before with debug enabled). I have something useful to report, however. I was able to reproduce the hang using a USB 2.0 thumbdrive (specifically, a Lexar 128MB JumpDrive), while copying over a 110MB file. Note that in order to cause the hang I had to mount it "sync". Problem does not occur if I mount the USB stick "async". Note this is not a complete system hang, the problem is only with the USB 2.0/ehci device in use. I think this eliminates my USB enclosure as the source of the problem, the only remaining possible hardware issue would have to be with the motherboard itself (a MSI K8T Neo FISR VIA K8T800). Possible kernel issues would be with the ehci driver, perhaps only on AMD64. I believe it works fine in USB 1.1 mode.
David, any thoughts?
The thing that looks most wierd is that the device seems to have spontaneously disconnected itself. Combined with the fact that the device seems to have locked itself up at the time, that does suggest a device problem. On the other hand it's also VIA hardware, never the most trustworthy in my own experience. The symptoms have unfortunately been seen before. There's another wierdness ... why was sg_cancel trying to cancel the urbs that were already completed?
Based on my experimenting, it might be some sort of timing issue? I have been doing repeated burns & erases of 10X CDRW media to reproduce this. Problem occurs after 6 repeats with Cool & Quiet disabled, but occured after 3 repeats with C&Q enabled. May or may not mean anything. Is there some kind of USB stress test program? I disabled C&Q on the fly with: echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor I was hoping disabling C&Q would be viable workaround.
Have a look at bug #4057 which is broadly similar: usb-storage devices wedging after a period of much activity. There's a patch I just mentioned there, now submitted ... a precursor was really effective at making a drive finally act reliably. It's quite possible this is the same underlying (non-EHCI) bug.
Well, I applied both patches as mentioned in bug #4057 against 2.6.12rc5, and the results were not encouraging. I was able to successfully blank my CDRW, but attempting to burn it resulted in a hang at 0MB burned, the USB device was forcibly disconnected, and cdrecord segfaulted: scsi: Device offlined - not ready after error recovery: host 1 channel 0 id 0 lun 0 cdrecord[9169]: segfault at 00007fffff1d4f88 rip 0000000000430f2a rsp 00007fffff1d4f70 error 6 cdrecord[9170]: segfault at 00007fffff1d4ff8 rip 0000000000430f2a rsp 00007fffff1d4fe0 error 6 May be only an issue on AMD64, perhaps the patches work better on i386? The second of the two patches was incompatible with 2.6.11-ck, but I was able to apply the first patch from #4057. It didn't help either.
Has this behavior changed in 2.6.13-rc4? This really looks to me like a device lockup in the middle of some big I/O sequence, leading to disconnect. That's is actually a friendly way for the device to fail, and one that has in the past worked quite nicely at the USB level. For now I'd like to just verify that it still IS working OK at that level; the last several problems in this area have all been in SCSI code. Can you include a few Alt-SysRq-T tracebacks, of the processes that are hanging, any usb-storage threads, and khubd? Also (with CONFIG_USB_DEBUG) the contents of the /sys/bus/usb_host/usbN/async file? (Where N=1 in the debug logs I see here.) This will help me see what sort of issue we've got here. Those usb_sg_cancel() messages should ignore -ENODEV since those codes should always imply the URBs will have been aborted by usbcore. Trivial fix.
Created attachment 5457 [details] removes annoying and useless message
Well, I didn't make any tracebacks, but I did do a quick test in 2.6.13rc5 and had bascially the same problem. In this case, I tried to write to a CDRW that I had not already blanked. I understand that something has to happen when the USB device becomes temporarily unresponsive. My opinion is, if something happens during a long transfer and the USB device stops responding, the kernel should pause the transfer, issue a device reset, and then resume. This isn't remotely like what happens now. At the moment, if the USB device stops responding, the device completely disappears from the USB device tree. And more annoying, if I re-probe (which sometimes works to get the device working again), the USB/fake SCSI bus is incremented by one, so that in my cdrecord command line I have to change my dev=2,0,0 to dev=3,0,0 or whatever. Why can't this stay the same? I will try to apply the small error patch and get some tracebacks next time I reboot. Has anyone else been able to reproduce this problem, or is it just me? Aug 6 16:44:25 syrinx Vendor: CDWRITER Model: IDE1610 Rev: 0.19 Aug 6 16:44:25 syrinx Type: CD-ROM ANSI SCSI revision: 00 Aug 6 16:44:25 syrinx sr0: scsi3-mmc drive: 4x/4x writer cd/rw xa/form2 cdda tray Aug 6 16:44:25 syrinx Attached scsi CD-ROM sr0 at scsi3, channel 0, id 0, lun 0 Aug 6 16:44:25 syrinx Attached scsi generic sg1 at scsi3, channel 0, id 0, lun 0, type 5 Aug 6 16:44:25 syrinx usb-storage: device scan complete Aug 6 16:44:25 syrinx scsi.agent[5425]: cdrom at /devices/pci0000:00/0000:00:10.4/usb5/5-1/5-1:1.0/host3/target3:0:0/3:0:0:0 Aug 6 16:48:43 syrinx ehci_hcd 0000:00:10.4: port 1 high speed Aug 6 16:48:43 syrinx ehci_hcd 0000:00:10.4: GetStatus port 1 status 001005 POWER sig=se0 PE CONNECT Aug 6 16:48:43 syrinx usb 5-1: reset high speed USB device using ehci_hcd and address 2 Aug 6 16:48:43 syrinx ehci_hcd 0000:00:10.4: port 1 high speed Aug 6 16:48:43 syrinx ehci_hcd 0000:00:10.4: GetStatus port 1 status 001005 POWER sig=se0 PE CONNECT Aug 6 16:48:51 syrinx ehci_hcd 0000:00:10.4: port 1 high speed Aug 6 16:48:51 syrinx ehci_hcd 0000:00:10.4: GetStatus port 1 status 001005 POWER sig=se0 PE CONNECT Aug 6 16:48:51 syrinx usb 5-1: reset high speed USB device using ehci_hcd and address 2 Aug 6 16:48:51 syrinx ehci_hcd 0000:00:10.4: port 1 high speed Aug 6 16:48:51 syrinx ehci_hcd 0000:00:10.4: GetStatus port 1 status 001005 POWER sig=se0 PE CONNECT Aug 6 16:48:51 syrinx scsi: Device offlined - not ready after error recovery: host 3 channel 0 id 0 lun 0 Aug 6 16:48:56 syrinx cdrecord[5614]: segfault at 00007fffff3f0fa8 rip 0000000000430f2a rsp 00007fffff3f0f90 error 6
Matt, any ideas?
Ideas? Just one. There's a code-path which introduces a delay between command and data. It's currently only active when Genesys Logic chipsets are detected. It might be interesting to force that on and see what happens... Just look in linux/drivers/usb/storage for a delay() which is conditional on the vendor being Genesys. There's only one. What bothers me, however, is that the user reports that power-cycling the drive does NOT bring it back. That should force a full disconnect of the drive, reset all of it's logic, and reconnect it. If that doesn't work, either the drive is toast or the HC is toast. Since he's observed this with multiple devices, I'm betting on the HC.
Created attachment 5700 [details] syslog showing the problem This is the syslog I generated using 2.6.13rc6 with the small usb patch listed in this current bug report applied. Symptoms: same as before, cdrecord goes into a zombie state after cdrecord hangs when the USB device stops responding on long transfers.
Created attachment 5701 [details] process list showing cdrecord in zombie state
Created attachment 5702 [details] test case to demonstrate the problem In order to reproduce the problem, the ehci_hcd module must be loaded. I am testing it via a USB 2.0 connection to an external enclosure with a Sony CRX-160 CDRW, on a 10X 650MB CDRW disk. The test file I am burning is approximately 600MB. On my most recent test, the erase & burn worked on the first pass, and then hung during the data burn on the second iteration.
The syslog I just entered contains a traceback triggered with "echo t > /proc/sysrq-trigger". I had usb debugging, and usb-storage debugging enabled, so the resulting logfile is 25MB uncompressed. Hopefully there will be some clues in there to indicate the specific problem!
I tried the userspace workaround I saw in the Gentoo forums (passing "ts=32768" to cdrecord, to reduce the size of the SCSI command buffer). While it seemed to take longer to fail (over half an hour), eventually it failed in the same manner.
Can you: - verify that this still happens with 2.6.14-rc1? ... various bugfixes may have helped - say whether this device has its own power supply? ... relying on the 500 mA from the hub seems risky - include "lspci -vvxxxs 0000:00:10.4" output, dumping config space? ... in case something informative appears - try using this through an external usb 2.0 hub? ... one person reported this helped, oddly enough. One interesting thing in that big syslog file, as it was failing, as well as the sysrq-T trace. Looks to me as if for some reason the device decided to disconnect (why??), and during that part the USB stack behaved fine. The hang looks to be the same SCSI hang that lots of people report ... not a USB issue. Aug 20 15:27:28 syrinx usb-storage: Command WRITE_10 (10 bytes) Aug 20 15:27:28 syrinx usb-storage: 2a 00 00 03 15 0c 00 00 1f 00 Aug 20 15:27:28 syrinx usb-storage: Bulk Command S 0x43425355 T 0x4698 L 63488 F 0Trg 0 LUN 0 CL 12 Aug 20 15:27:28 syrinx usb-storage: usb_stor_bulk_transfer_buf: xfer 31 bytes Aug 20 15:27:28 syrinx usb-storage: Status code 0; transferred 31/31 Aug 20 15:27:28 syrinx usb-storage: -- transfer complete Aug 20 15:27:28 syrinx usb-storage: Bulk command transfer result=0 Aug 20 15:27:28 syrinx usb-storage: usb_stor_bulk_transfer_sglist: xfer 63488 bytes, 1 entries ... so at 27:28 all is fine. But wait, not so quickly... Aug 20 15:27:28 syrinx hub 5-0:1.0: state 5 ports 8 chg 0000 evt 0002 Aug 20 15:27:28 syrinx ehci_hcd 0000:00:10.4: GetStatus port 1 status 00180b POWERsig=j PEC CSC CONNECT ... for some reason, right at that moment the root hub disconnected the device on that port. Specifically, it looks like the device was disconnected then reconnected ... unclear _why_ though, it might have been either end. Maybe it was a signal quality problem (even a bad cable!) or frame timing problems. I looked at the usb 2.0 spec, and figure 11-10 says the hub could do this because of a "port error" as detailed in section 11.8.1 ... which reads basically that when a device is busy too late in the microframe (the EOF2 point), the hub must disconnect it. So far as I can tell, there's no way to modify the EOF2 point; the EHCI FLADJ register also modifies the USB clock ("how long is a millisecond"). As a rule, when folk have had such problems, it's turned out to be some kind of hardware problem. Switching USB disks has helped. Aug 20 15:27:28 syrinx hub 5-0:1.0: port 1, status 0501, change 0003, 480 Mb/s Aug 20 15:27:28 syrinx usb 5-1: USB disconnect, address 2 Aug 20 15:27:28 syrinx usb 5-1: usb_disable_device nuking all URBs Aug 20 15:27:28 syrinx ehci_hcd 0000:00:10.4: shutdown urb ffff810012325a80 pipe c0018200 ep3out-bulk Aug 20 15:28:08 syrinx usb-storage: command_abort called Aug 20 15:28:08 syrinx usb-storage: usb_stor_stop_transport called Aug 20 15:28:08 syrinx usb-storage: -- cancelling sg request Aug 20 15:28:08 syrinx usb 5-1: usb_sg_cancel, unlink --> -19 And here presumably is the hang, but no USB level errors show.
Created attachment 6048 [details] compressed 2.6.14-rc1 syslog showing successful DVD burn w/ pauses/resets Compressed with bzip2 -9, 139M uncomressed! USB debug and USB storage debug were both enabled.
Created attachment 6049 [details] output of "lspci -vvxxxs 00:10.4"
Created attachment 6050 [details] kernel .config I used for testing 2.6.14-rc1
1) verify that this still happens with 2.6.14-rc1? The problem seems to be much better in 2.6.14-rc1, in that when the device stops responding, the burn resumes after the device reset is complete (takes about 35 to 40 seconds to reset). I have burned 3 DVDs successfully, so the original problem seems to be fixed (the DVD burner USB device no longer "goes away" and never responds until a reboot- device no longer hangs). However, during each burn, the burn pauses for a reset between 1 and 3 times. This drags out the time required to burn a DVD. I have attached the syslog and .config for this test. It is possible that this is 100% correct behavior as far as the kernel is concerned, if my external USB enclosure hardware is just that bad. I seem to get fewer resets during a burn if I disable the options under ehci, turing off "full speed iso transactions" and "root hub translators", but I still get one or two during a burn. I have never tested this device in Windows, is there any point to building a W2K box with USB 2.0 and doing some DVD test burns? 2) say whether this device has its own power supply? Yes, it has a 12V, 2.0A power supply brick. 3) include "lspci -vvxxxs 0000:00:10.4" output, dumping config space? # lspci -vvxxxs 0000:00:10.4 lspci: -f: Invalid slot number but this command works: "lspci -vvxxxs 00:10.4" See attached usb.config for output. 4) - try using this through an external usb 2.0 hub? The only hubs I have are 1.1. I'll pick up a 2.0 hub the next time I see one free w/ rebate. Also, I would not recommend 2.6.14-rc1 for general use yet, at least on hardware similar to mine (Opteron/AMD64). My standard PS/2 keyboard wouldn't work after rebooting from 2.6.14-rc-1, even after a poweroff; I had to boot into a 2.6.13-series kernel to get my keyboard working again.
Somewaht related bug: 5274 (for my problem with using PS/2 keyboard after heavy USB I/O on 2.6.14-rc1) http://bugzilla.kernel.org/show_bug.cgi?id=5274
Hmm, looks like this is fixed now ... the unforgivable behavior (hang/deadlock) is gone, and now it's "just" a temporary wedge that's properly recovered-from. Different, and relatively minor, problem. I'll note that you never provided the requested alt-sysrq-T traces. If you'd done that we might have been able to track this down more precisely ... "ps" output is NOT COMPARABLE, it doesn't show where the deadlocks appear (no stack traces).
Created attachment 6284 [details] tracebacks of the hang I thought I had posted a logfile with the tracebacks in it. Here is a log showing a traceback from this problem, but it appears to use 2.6.11. Also, I tested the hardware from a WinXP PC and was unable to reproduce the failure. The PC used an Intel 82801EB USB controller, so that's not directly comparable. I will retest this when 2.6.14 final comes out. Should I reopen this is bug if there is a problem, or create a new one?