Bug 8692
Description
Dominique Dumont
2007-06-30 08:50:36 UTC
Sorry, I've made a small mistake in the "Problem description section" I confirm that all USB I/O are blocked and I *cannot* mount another USB drive. Subject: Re: [Bugme-new] New: USB storage freeze On Sat, 30 Jun 2007 08:46:32 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=8692 > > Summary: USB storage freeze > Product: IO/Storage > Version: 2.5 > KernelVersion: 2.6.21 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: Other > AssignedTo: io_other@kernel-bugs.osdl.org > ReportedBy: domi.dumont@free.fr > > > Most recent kernel where this bug did not occur: unknown > Distribution: Debian/sid > Hardware Environment: AMD 64 X2, external 200GB USB disk (Prolific Technology > Inc) > Software Environment: 2.6.21 #3 SMP Sat Jun 30 16:30:03 CEST 2007 x86_64 > GNU/Linux > Problem Description: > > When doing file transfer, USB storage freezes randomly. It's possible that my > USB drive is low quality hardware, but this shouldn't freeze the USB bus. > > At this point, all I/O to USB are blocked: > - I can mount another USB disk > - sync blocks > - reboot blocks > - only reset button works to unlock the disks. > > I've seen this problem also on Debian's kernel 2.6.18 > > I've reproduced it on a vanilla kernel (built with make-kpkg) with usb debug > enabled. (traces are below) > > > Steps to reproduce: > > A simple cp of a multi GB file is often enough. (note that my disk if 90% > full) > > You can also run find <usb-drive> -type f -exec md5sum {} \; > > Traces: > Jun 30 17:21:30 gandalf kernel: usb-storage: queuecommand called > Jun 30 17:21:30 gandalf kernel: usb-storage: *** thread awakened. > Jun 30 17:21:30 gandalf kernel: usb-storage: Command WRITE_10 (10 bytes) > Jun 30 17:21:30 gandalf kernel: usb-storage: 2a 00 0c d9 01 87 00 00 f0 00 > Jun 30 17:21:30 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T > 0x4297 > L 122880 F 0 Trg 0 LUN 0 CL 10 > Jun 30 17:21:30 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer > 31 bytes > Jun 30 17:21:30 gandalf kernel: usb-storage: Status code 0; transferred 31/31 > Jun 30 17:21:30 gandalf kernel: usb-storage: -- transfer complete > Jun 30 17:21:30 gandalf kernel: usb-storage: Bulk command transfer result=0 > Jun 30 17:21:30 gandalf kernel: usb-storage: usb_stor_bulk_transfer_sglist: > xfer 122880 bytes, 17 entries > Jun 30 17:21:30 gandalf kernel: usb-storage: Status code 0; transferred > 122880/122880 > Jun 30 17:21:30 gandalf kernel: usb-storage: -- transfer complete > Jun 30 17:21:30 gandalf kernel: usb-storage: Bulk data transfer result 0x0 > Jun 30 17:21:30 gandalf kernel: usb-storage: Attempting to get CSW... > Jun 30 17:21:30 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer > 13 bytes > Jun 30 17:21:30 gandalf kernel: usb-storage: Status code 0; transferred 13/13 > Jun 30 17:21:30 gandalf kernel: usb-storage: -- transfer complete > Jun 30 17:21:30 gandalf kernel: usb-storage: Bulk status result = 0 > Jun 30 17:21:30 gandalf kernel: usb-storage: Bulk Status S 0x53425355 T > 0x4297 > R 0 Stat 0x0 > Jun 30 17:21:30 gandalf kernel: usb-storage: scsi cmd done, result=0x0 > Jun 30 17:21:30 gandalf kernel: usb-storage: *** thread sleeping. > Jun 30 17:21:30 gandalf kernel: usb-storage: queuecommand called > Jun 30 17:21:30 gandalf kernel: usb-storage: *** thread awakened. > Jun 30 17:21:30 gandalf kernel: usb-storage: Command WRITE_10 (10 bytes) > Jun 30 17:21:30 gandalf kernel: usb-storage: 2a 00 0c d9 02 77 00 00 f0 00 > Jun 30 17:21:30 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T > 0x4298 > L 122880 F 0 Trg 0 LUN 0 CL 10 > Jun 30 17:21:30 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer > 31 bytes > Jun 30 17:22:00 gandalf kernel: usb-storage: command_abort called > Jun 30 17:22:00 gandalf kernel: usb-storage: usb_stor_stop_transport called > Jun 30 17:22:00 gandalf kernel: usb-storage: -- cancelling URB > > At this point the external disk is frozen. > > Then I've powered on another USB disk: > > Jun 30 17:24:34 gandalf kernel: hub 6-0:1.0: state 7 ports 10 chg 0000 evt > 0008 > Jun 30 17:24:34 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3 > status > 001803 POWER sig=j CSC CONNECT > Jun 30 17:24:34 gandalf kernel: hub 6-0:1.0: port 3, status 0501, change > 0001, > 480 Mb/s > Jun 30 17:24:34 gandalf kernel: hub 6-0:1.0: debounce: port 3: total 100ms > stable 100ms status 0x501 > Jun 30 17:24:34 gandalf kernel: ehci_hcd 0000:00:13.5: port 3 high speed > Jun 30 17:24:34 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3 > status > 001005 POWER sig=se0 PE CONNECT > Jun 30 17:24:34 gandalf kernel: usb 6-3: new high speed USB device using > ehci_hcd and address 3 > Jun 30 17:24:34 gandalf kernel: ehci_hcd 0000:00:13.5: port 3 high speed > Jun 30 17:24:34 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3 > status > 001005 POWER sig=se0 PE CONNECT > Jun 30 17:24:34 gandalf kernel: ehci_hcd 0000:00:13.5: qh ffff81003d90d280 > (#00) state 4 > Jun 30 17:24:34 gandalf kernel: usb 6-3: default language 0x0409 > Jun 30 17:24:34 gandalf kernel: usb 6-3: new device strings: Mfr=1, > Product=2, > SerialNumber=3 > Jun 30 17:24:34 gandalf kernel: usb 6-3: Product: Mass Storage Device > Jun 30 17:24:34 gandalf kernel: usb 6-3: Manufacturer: Prolific Technology > Inc. > Jun 30 17:24:34 gandalf kernel: usb 6-3: SerialNumber: 0 > Jun 30 17:24:34 gandalf kernel: usb 6-3: uevent > Jun 30 17:24:34 gandalf kernel: usb 6-3: usb_probe_device > Jun 30 17:24:34 gandalf kernel: usb 6-3: configuration #1 chosen from 1 > choice > Jun 30 17:24:34 gandalf kernel: usb 6-3: adding 6-3:1.0 (config #1, interface > 0) > Jun 30 17:24:34 gandalf kernel: usb 6-3:1.0: uevent > Jun 30 17:24:34 gandalf kernel: usb-storage 6-3:1.0: usb_probe_interface > Jun 30 17:24:34 gandalf kernel: usb-storage 6-3:1.0: usb_probe_interface - > got > id > Jun 30 17:24:34 gandalf kernel: usb-storage: USB Mass Storage device detected > Jun 30 17:24:34 gandalf kernel: usb-storage: -- associate_dev > Jun 30 17:24:34 gandalf kernel: usb-storage: Vendor: 0x067b, Product: 0x2507, > Revision: 0x0100 > Jun 30 17:24:34 gandalf kernel: usb-storage: Interface Subclass: 0x06, > Protocol: 0x50 > Jun 30 17:24:34 gandalf kernel: usb-storage: Transport: Bulk > Jun 30 17:24:34 gandalf kernel: usb-storage: Protocol: Transparent SCSI > Jun 30 17:24:34 gandalf kernel: scsi6 : SCSI emulation for USB Mass Storage > devices > Jun 30 17:24:34 gandalf kernel: usb-storage: *** thread sleeping. > Jun 30 17:24:34 gandalf kernel: drivers/usb/core/inode.c: creating file '003' > Jun 30 17:24:34 gandalf kernel: usb-storage: device found at 3 > Jun 30 17:24:34 gandalf kernel: usb-storage: waiting for device to settle > before scanning > Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_control_msg: rq=fe > rqtype=a1 value=0000 index=00 len=1 > Jun 30 17:24:39 gandalf kernel: usb-storage: GetMaxLUN command result is 1, > data is 0 > Jun 30 17:24:39 gandalf kernel: usb-storage: device scan complete > Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened. > Jun 30 17:24:39 gandalf kernel: usb-storage: Command INQUIRY (6 bytes) > Jun 30 17:24:39 gandalf kernel: usb-storage: 12 00 00 00 24 00 > Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T 0x1 > L > 36 F 128 Trg 0 LUN 0 CL 6 > Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer > 31 bytes > Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 31/31 > Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete > Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk command transfer result=0 > Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_sglist: > xfer 36 bytes, 1 entries > Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 36/36 > Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete > Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk data transfer result 0x0 > Jun 30 17:24:39 gandalf kernel: usb-storage: Attempting to get CSW... > Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer > 13 bytes > Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 13/13 > Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete > Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk status result = 0 > Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Status S 0x53425355 T 0x1 R > 0 > Stat 0x0 > Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x0 > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping. > Jun 30 17:24:39 gandalf kernel: scsi 6:0:0:0: Direct-Access ST316002 3A > 3.06 PQ: 0 ANSI: 0 > Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened. > Jun 30 17:24:39 gandalf kernel: usb-storage: Bad LUN (0:1) > Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000 > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping. > Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened. > Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (1:0) > Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000 > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping. > Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened. > Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (2:0) > Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000 > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping. > Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened. > Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (3:0) > Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000 > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping. > Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened. > Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (4:0) > Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000 > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping. > Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened. > Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (5:0) > Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000 > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping. > Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened. > Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (6:0) > Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000 > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping. > Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened. > Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (7:0) > Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000 > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping. > Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened. > Jun 30 17:24:39 gandalf kernel: usb-storage: Command TEST_UNIT_READY (6 > bytes) > Jun 30 17:24:39 gandalf kernel: usb-storage: 00 00 00 00 00 00 > Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T 0x2 > L > 0 F 0 Trg 0 LUN 0 CL 6 > Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer > 31 bytes > Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 31/31 > Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete > Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk command transfer result=0 > Jun 30 17:24:39 gandalf kernel: usb-storage: Attempting to get CSW... > Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer > 13 bytes > Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 13/13 > Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete > Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk status result = 0 > Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Status S 0x53425355 T 0x2 R > 0 > Stat 0x0 > Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x0 > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping. > Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened. > Jun 30 17:24:39 gandalf kernel: usb-storage: Command READ_CAPACITY (10 bytes) > Jun 30 17:24:39 gandalf kernel: usb-storage: 25 00 00 00 00 00 00 00 00 00 > Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T 0x3 > L > 8 F 128 Trg 0 LUN 0 CL 10 > Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer > 31 bytes > Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 31/31 > Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete > Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk data transfer result 0x0 > Jun 30 17:24:39 gandalf kernel: usb-storage: Attempting to get CSW... > Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer > 13 bytes > Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 13/13 > Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete > Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk status result = 0 > Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Status S 0x53425355 T 0x3 R > 0 > Stat 0x0 > Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x0 > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping. > Jun 30 17:24:39 gandalf kernel: SCSI device sde: 312581807 512-byte hdwr > sectors (160042 MB) > Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called > Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened. > Jun 30 17:24:39 gandalf kernel: usb-storage: Command MODE_SENSE (6 bytes) > Jun 30 17:24:39 gandalf kernel: usb-storage: 1a 00 3f 00 c0 00 > Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T 0x4 > L > 192 F 128 Trg 0 LUN 0 CL 6 > Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer > 31 bytes > Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 31/31 > Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete > Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk command transfer result=0 > Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_sglist: > xfer 192 bytes, 1 entries > Jun 30 17:24:39 gandalf kernel: usb-storage: Status code -121; transferred > 4/192 > Jun 30 17:24:39 gandalf kernel: usb-storage: -- short read transfer > Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk data transfer result 0x1 > Jun 30 17:24:39 gandalf kernel: usb-storage: Attempting to get CSW... > Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer > 13 bytes > Jun 30 17:25:09 gandalf kernel: usb-storage: command_abort called > Jun 30 17:25:09 gandalf kernel: usb-storage: usb_stor_stop_transport called > Jun 30 17:25:09 gandalf kernel: usb-storage: -- cancelling URB > > Then the second USB disk does not show up in /dev/disk. > > Feel free to contact me if you need more informations. > > Thanks > > > -- > Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You are on the CC list for the bug, or are watching someone who is. Hello I've reproduced the problem with 2.6.22 from kernel.org. Here a trace from kern.log: Jul 15 12:15:19 gandalf kernel: usb-storage: *** thread awakened. Jul 15 12:15:19 gandalf kernel: usb-storage: Command WRITE_10 (10 bytes) Jul 15 12:15:19 gandalf kernel: usb-storage: 2a 00 0e d6 7f 87 00 00 f0 00 Jul 15 12:15:19 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T 0x75de L 122880 F 0 Trg 0 LUN 0 CL 10 Jul 15 12:15:19 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer 31 bytes Jul 15 12:15:19 gandalf kernel: usb-storage: Status code 0; transferred 31/31 Jul 15 12:15:19 gandalf kernel: usb-storage: -- transfer complete Jul 15 12:15:19 gandalf kernel: usb-storage: Bulk command transfer result=0 Jul 15 12:15:19 gandalf kernel: usb-storage: usb_stor_bulk_transfer_sglist: xfer 122880 bytes, 13 entries Jul 15 12:15:21 gandalf kernel: usb 6-6: events/1 timed out on ep1out len=0/2 Jul 15 12:15:49 gandalf kernel: usb-storage: command_abort called Jul 15 12:15:49 gandalf kernel: usb-storage: usb_stor_stop_transport called Jul 15 12:15:49 gandalf kernel: usb-storage: -- cancelling sg request At this point: - USB transfer is blocked - sync is blocked - my keyboard is also blocked. (keyboard pb also happens when usb-disk blocks while dvb_usb_nova_t_usb2 is loaded) I had one instance where /dev/sda was also blocked but my keyboard was not blocked. This is puzzling. I've reproduced the problem with a Maxtor external USB drive (should be a better quality hardware than the other USB drives I've tested) This time on Debian's 2.6.21, with dvb-udb module. My keyboard is again blocked. By logging through network, I've tried to check the status of USB modules. I've found that a simple "cat /proc/bus/usb/devices" is also blocked: Here's the output of /proc/<pid>/status of the cat command: Name: cat State: D (disk sleep) SleepAVG: 88% The cat command blocks on "disk sleep" even though I try to read a file from a virtual file system. Yet another (non-voluntary) test: USB has blocked while accessing my MP3 player. This device (Samsung YP-Z5F) is not really fast, so the USB lock may be more due to random factors than high speed transfers. This time, my keyboard was again blocked, only the mouse worked (as usual). HTH This looks like a knnown problem with some VIA EHCI USB controllers. It resembles one of the problems reported in Bug #6374. Does your system use VIA parts? No. My mobo is a MSI-K9A with an ATI RD580 + SB600 (Xpress3200) chipset. Here's the relevant part of lshw: *-usb:0 description: USB Controller product: SB600 USB (OHCI0) vendor: ATI Technologies Inc physical id: 13 bus info: pci@0000:00:13.0 version: 00 width: 32 bits clock: 66MHz capabilities: ohci bus_master configuration: driver=ohci_hcd latency=64 module=ohci_hcd That isn't the relevant part. You show the entry for an OHCI controller, but it's the EHCI controller causing the problems. However it's probably safe to assume the EHCI controller is also by ATI. Here's something you can try. Build a kernel with CONFIG_USB_DEBUG enabled. Then when you get one of these errors, make a copy of the files in /sys/class/usb_host/usb_host6 and attach it to this report. You may need to use a network login if your keyboard isn't working. Oh yes, and also attach a copy of /proc/bus/usb/devices. Created attachment 12079 [details]
async file from usb_hots6
Created attachment 12080 [details]
companion file from usb_host6
Created attachment 12081 [details]
periodic file from usb_host6
Created attachment 12082 [details]
registers file from usb_host6
Created attachment 12083 [details]
/proc/bus/usb/devices file
Hello The EHCI usb controller is indeed ATI: product: SB600 USB Controller (EHCI) vendor: ATI Technologies Inc I've attached the required information to this bug. All files from usb_host6 were read after the usb was hung on vanilla kernel 2.6.22 (compiled with Debian's kernel configuration and USB_DEBUG) HTH If you disable one of the CPUs, does it behave better? This looks like another variant of the unlink bug that shows up periodically. What needs to happen is that a cleaner version of 26f953fd884ea4879585287917f855c63c6b2666 get written (that one had to be reverted). Probably the timer cleanup and the unlink/IAA cleanup should be split. I've booted 2.6.22 with maxcpus=1 to no avail. usb still hangs. I've found that the problem is more difficult to reproduce when dvb_usb_nova_t_usb2 module is not loaded. (e.g. after 30 mn of md5sum instead of 2 or 3mns). In this case: - keyboard is not locked - I can interrupt the md5sum process with CTRL-C - If I re-run the mds5um process, it hangs immediately and I can no longer interrupt it Created attachment 12475 [details]
Add separate IAA watchdog timer for EHCI
It's been a while, but I finally found some time to work on this a little. Please try the attached patch for 2.6.22, and attach the resulting kernel log. Even if it doesn't fix the problem, it should provide some clues.
Reply-To: david-b@pacbell.net On Tuesday 21 August 2007, you wrote: > Created an attachment (id=12475) > --> (http://bugzilla.kernel.org/attachment.cgi?id=12475&action=view) > Add separate IAA watchdog timer for EHCI > > It's been a while, but I finally found some time to work on this a little. > Please try the attached patch for 2.6.22, and attach the resulting kernel > log. > Even if it doesn't fix the problem, it should provide some clues. Thanks for that, Alan ... on a first scan, it looks good. And if one hypothesis is true (that interactions with other timeouts breaks this one) it should be a good step forward... One question though: > @@ -643,7 +662,7 @@ static irqreturn_t ehci_irq (struct usb_ > /* complete the unlinking of some qh [4.15.2.3] */ > if (status & STS_IAA) { > COUNT (ehci->stats.reclaim); > - ehci->reclaim_ready = 1; > + end_unlink_async(ehci); > bh = 1; > } Given that you're not deferring that work, why is "bh" still set? Should be harmless but pointless. Because it was that way in your original patch and I didn't notice it. :-) Hello I've tried Alan's path on 2.6.22. Hangs still occur while dvb module is loaded. (I've not tried again without DVB module) I'll attach the corresponding kernel log, but I've not seen non usb-storage logs. Please tell me if I missed something. HTH Created attachment 12492 [details]
2.6.22 kernel log with separate IAA watchdog timer patch
Yes, those usb-storage log entries aren't very useful. It would be better if you turn off CONFIG_USB_STORAGE_DEBUG. Then when a transfer hangs get copies of the registers and async files from usb_host6 as before, in addition to the dmesg log. Created attachment 12509 [details]
usb_host6 files after usb hang (with IAA watchdog patch)
As requested, here's the required traces (all files from usb_host6 and dmesg) extracted after usb disk hang
Created attachment 12514 [details]
additional debugging for IAA watchdog
Crazy. It looks like the new code was never triggered.
This patch applies on top of the previous one. It adds a bunch more debugging messages, which hopefully will tell us what's going on.
One more thing. After the hang occurs and you collect the dmesg and usb_host6 data, please get a stack trace (Alt-SysRq-T) and attach that as well. Created attachment 12528 [details]
usb6 + sysrq-t traces
Here's the requested information including the sysrq-t dump (retrieved through dmesg, I've cleaned up demsg before running the test). The hung process is md5sum at the end of the file.
I still cannot find any log related to your iaa watchdog timer.
HTH
No log output at all from that patch? Are you sure you installed the patched version of ehci-hcd.ko? Just plugging in a new device should be enough to trigger some of those writes. The Alt-SysRq-T dump was incomplete; stuff near the top was lost from the ring buffer. You'll have to increase the size of CONFIG_LOG_BUF_SHIFT. Or even better, boot into single-user mode first so that most of those processes won't be running. Sorry for the long delay. Something is weird during module loading. I've modified a trace in ehci-hcd.c to make sure I load the patched ehci-hcd module. After boot I get the unmodified trace and no IAA watchdog trace: ehci_hcd: block sizes: qh 160 qtd 96 itd 192 sitd 96 Then I do "rmmod ehci-hdc" and "modprobe ehci-hcd", I get the modified trace: ehci_hcd (+patch): block sizes: qh 160 qtd 96 itd 192 sitd 96 and truckloads of IAA traces: Aug 30 18:59:56 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog done, qh ffff8100249251e0 Aug 30 18:59:56 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog start, qh ffff8100249251e0 Aug 30 18:59:56 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff8100249251e0 Aug 30 18:59:56 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog done, qh ffff8100249251e0 Aug 30 18:59:56 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog start, qh ffff8100249251e0 Aug 30 18:59:56 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff8100249251e0 Aug 30 18:59:56 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog done, qh ffff8100249251e0 Aug 30 18:59:56 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog start, qh ffff810024925140 Let me dig what's wrong during module loading and I'll get back to you with more traces. Cheers Probably at boot time ehci-hcd is loaded from an initramfs image, not from /lib/modules/... Created attachment 12657 [details]
trace for dvb communication failure
Argh ! You're right. The initrd image was not updated. I'm red with shame...
Now, I'm seeing some progress: the bug is *much* harder to trigger even when
the dvb module is up and running with vdr.
But, I had to comment out the IAA watchdog start and stop traces in ehci-q.c.
With them, it's impossible to work on the console: I can't see what I type nor
console messages. (today's kern.log is already 140M big ...)
Here's what I've observed:
- With dvb and *one* external usb disk: the bug did not show up even after
more than 30mn of testing
- When I launched a md5sum on my other usb disk, the first usb disk froze quite
fast. I don't have a kernel trace for this, but here's the relevant
extract from kern.log :
Sep 1 12:43:15 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003dffc460
Sep 1 12:43:59 gandalf last message repeated 14 times
Sep 1 12:44:46 gandalf last message repeated 41 times
Sep 1 12:45:31 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff81003dffc460
Sep 1 12:45:31 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003dffc460
Sep 1 12:45:33 gandalf last message repeated 19 times
Sep 1 12:47:03 gandalf last message repeated 7 times
Sep 1 12:47:50 gandalf last message repeated 14 times
Sep 1 12:48:36 gandalf last message repeated 21 times
Sep 1 12:50:08 gandalf last message repeated 42 times
Sep 1 12:50:54 gandalf last message repeated 7 times
Sep 1 12:50:54 gandalf kernel: fuse exit
Sep 1 12:50:56 gandalf kernel: Kernel logging (proc) stopped.
Sep 1 12:50:56 gandalf kernel: Kernel log daemon terminating.
At this point, both disk were hung, although I could interrupt them with ^C.
My keyboard was working but what I typed appeared on screen after 2s
(without your patch, the keyboard does not work except for sysrq).
I've run another test in single user mode. But this time, dvb communication
failed before the usb disk. I've attached the registers, and dmesg output (with
the sysrq-t trace)
I'll try a test in single user with only the usb disks and without the
dvb module.
HTH
Created attachment 12661 [details]
trace for usb disk failure
In single user mode with 2 usb disks running md5sum, I cannot reproduce the bug. I had not a single trace in kern.log for 2 hours. More that 300GB of data were read.
So your patch is a real improvement.
The bug shows up quite fast when dvb_usb_nova_t_usb2 is loaded and vdr is running.
Here's a trace of a usb disk failure (disk hangs and keyboard failure) done in single user mode with 2 usb disks and vdr running. I fear that some processes are missing from the trace though.
If this not enough, I'll run some other tests.
HTH
Okay, don't worry about those watchdog start and done messages. Also I think the stack traces aren't helping much, so you might as well forget them too. I don't have access to my development system at the moment, so you'll have to make the following changes yourself. In ehci-hcd.c, in the unlink_async() routine where the previous patch added an "unlink reclaim" line, have that line also print out the value of ehci->reclaim. Then immediately afterward, add this: if (ehci->reclaim == qh) dump_stack(); The reason is because I noticed in both your logs, there's a whole bunch of "unlink start" lines all with the same qh, followed by a single "unlink reclaim" line for that same qh. This should never happen -- so I want to make sure it really is occurring. No problem to implement your changes. Here are the results. The reclaim pointers are different: Sep 2 18:28:01 gandalf last message repeated 2 times Sep 2 18:28:01 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c44280 Sep 2 18:28:01 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460 Sep 2 18:28:01 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460 Sep 2 18:28:47 gandalf last message repeated 5 times Sep 2 18:28:47 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c441e0 Sep 2 18:28:47 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460 Sep 2 18:28:47 gandalf last message repeated 7 times Sep 2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c44640 Sep 2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460 Sep 2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c445a0 Sep 2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460 Sep 2 18:29:32 gandalf last message repeated 3 times Sep 2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c441e0 Sep 2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c443c0 Sep 2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460 Sep 2 18:29:53 gandalf last message repeated 14 times Sep 2 18:29:53 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c441e0 Sep 2 18:29:53 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c44280 I have not seen a case where the pointers were identical. BTW, I've also tested by actually watching a dvb-t channel while performing heavy traffic on both usb disk. Watching a channel on a dvb-usb device will require a steady data transfer from the dvb-usb device. In one instance, both usb disks froze while starting to watch the dvb-t and my keyboard was gone. First, DVB failed: Sep 2 18:17:18 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069 Sep 2 18:17:18 gandalf kernel: dvb-usb: bulk message failed: -110 (6/0) Sep 2 18:17:20 gandalf kernel: usb 6-6: kdvb-fe-1 timed out on ep1out len=0/6 Sep 2 18:17:20 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd140 Sep 2 18:17:20 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069 Sep 2 18:17:20 gandalf kernel: dvb-usb: bulk message failed: -110 (6/0) Sep 2 18:17:22 gandalf kernel: usb 6-3: usb-storage timed out on ep0out len=0/0 Sep 2 18:17:22 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd280 Sep 2 18:17:22 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069 Sep 2 18:17:22 gandalf kernel: usb 6-3: device not accepting address 4, error -110 Sep 2 18:17:22 gandalf kernel: hub 6-0:1.0: logical disconnect on port 3 Sep 2 18:17:22 gandalf kernel: hub 6-0:1.0: state 7 ports 10 chg 0008 evt 0000 Sep 2 18:17:22 gandalf kernel: hub 6-0:1.0: port 3, status 0501, change 0000, 480 Mb/s Sep 2 18:17:22 gandalf kernel: usb 6-3: USB disconnect, address 4 Sep 2 18:17:22 gandalf kernel: usb 6-3: unregistering device Sep 2 18:17:22 gandalf kernel: usb 6-3: usb_disable_device nuking all URBs Sep 2 18:17:22 gandalf kernel: usb 6-3: unregistering interface 6-3:1.0 Sep 2 18:17:22 gandalf kernel: usb_endpoint usbdev6.4_ep01: ep_device_release called for usbdev6.4_ep01 Sep 2 18:17:22 gandalf kernel: usb_endpoint usbdev6.4_ep82: ep_device_release called for usbdev6.4_ep82 Sep 2 18:17:22 gandalf kernel: sd 5:0:0:0: scsi: Device offlined - not ready after error recovery Sep 2 18:17:22 gandalf kernel: sd 5:0:0:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK Then both disk failed (I've removed most traces related to scsi or ext3). Sep 2 18:17:22 gandalf kernel: EXT3-fs error (device sdd1): ext3_get_inode_loc: unable to read inode block - inode=4521986, block=9043970 Sep 2 18:17:22 gandalf kernel: usb 6-3:1.0: uevent Sep 2 18:17:22 gandalf kernel: usb 6-3:1.0: uevent Sep 2 18:17:22 gandalf kernel: usb_endpoint usbdev6.4_ep00: ep_device_release called for usbdev6.4_ep00 Sep 2 18:17:22 gandalf kernel: usb 6-3: uevent Sep 2 18:17:22 gandalf kernel: scsi 5:0:0:0: rejecting I/O to dead device Sep 2 18:17:22 gandalf kernel: ehci_hcd 0000:00:13.5: port 5 high speed Sep 2 18:17:22 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 5 status 001005 POWER sig=se0 PE CONNECT Sep 2 18:17:22 gandalf kernel: scsi 5:0:0:0: rejecting I/O to dead device Sep 2 18:17:22 gandalf kernel: EXT3-fs error (device sdd1): ext3_find_entry: reading directory #18792457 offset 0 Then I had these traces on and on until I had to hit 'reset': Sep 2 18:17:23 gandalf kernel: scsi 5:0:0:0: rejecting I/O to dead device Sep 2 18:17:24 gandalf kernel: usb 6-6: kdvb-fe-1 timed out on ep1out len=0/6 Sep 2 18:17:24 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd140 Sep 2 18:17:24 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069 Sep 2 18:17:24 gandalf kernel: dvb-usb: bulk message failed: -110 (6/0) Sep 2 18:17:26 gandalf kernel: usb 6-6: events/1 timed out on ep1out len=0/2 Sep 2 18:17:26 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd140 Sep 2 18:17:26 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069 Sep 2 18:17:26 gandalf kernel: dvb-usb: bulk message failed: -110 (2/0) Sep 2 18:17:27 gandalf kernel: usb 6-5: usb-storage timed out on ep0in len=0/64 Sep 2 18:17:27 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd280 Sep 2 18:17:27 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069 Sep 2 18:17:28 gandalf kernel: usb 6-6: kdvb-fe-1 timed out on ep1out len=0/6 Sep 2 18:17:28 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd140 Sep 2 18:17:28 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069 Sep 2 18:17:28 gandalf kernel: dvb-usb: bulk message failed: -110 (6/0) Sep 2 18:17:36 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd140 Sep 2 18:17:36 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069 Sep 2 18:17:36 gandalf kernel: dvb-usb: bulk message failed: -110 (6/0) Sep 2 18:17:37 gandalf kernel: usb 6-5: usb-storage timed out on ep0in len=0/64 Sep 2 18:17:37 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd280 Sep 2 18:17:37 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069 Sep 2 18:17:37 gandalf kernel: ehci_hcd 0000:00:13.5: port 5 high speed Sep 2 18:17:37 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 5 status 001005 POWER sig=se0 PE CONNECT Sep 2 18:17:37 gandalf kernel: usb 6-5: device descriptor read/64, error -110 Sep 2 18:17:38 gandalf kernel: usb 6-6: events/1 timed out on ep1out len=0/2 Sep 2 18:17:38 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd140 Sep 2 18:17:38 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069 Sep 2 18:17:38 gandalf kernel: dvb-usb: bulk message failed: -110 (2/0) Sep 2 18:17:40 gandalf kernel: usb 6-6: kdvb-fe-1 timed out on ep1out len=0/6 Sep 2 18:17:40 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd140 Sep 2 18:17:40 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069 In the second instance of starting to watch a channel during heavy usb disks load, I had a different behavior: both disks were still working but the dvb channel was barely watchable as if a lot of data were thrown out. Here are the traces: Sep 2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460 Sep 2 18:29:32 gandalf last message repeated 3 times Sep 2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c441e0 Sep 2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c443c0 Sep 2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460 Sep 2 18:29:53 gandalf last message repeated 14 times Sep 2 18:29:53 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c441e0 Sep 2 18:29:53 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c44280 Sep 2 18:29:53 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460 Sep 2 18:29:53 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460 Sep 2 18:34:02 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460 In this case, watching dvb did not change much the traces. I just had more of them. HTH So when you have all three devices running, errors show up pretty quickly, right? But with only two of them running, everything works more or less okay. What do you see in the async and registers files after a failure with the patched driver? Let's concentrate on the case where you start the disk activity first and then use the dvb device. Part of my confusion before was because I missed a code path. If you look about 8 lines before the end of ehci-q.c, you'll see a call to start_unlink_async(). Please add braces around it and include in them this line: ehci_info(ehci, "unlink idle qh %p\n", qh); That should solve the mystery of the strange pointer values. Created attachment 12683 [details]
lot of "unlink idle qh" when dvb is plugged
Here's a summary of the failures:
- heavy traffic on one usb disk: ok
- heavy traffic on 2 usb disk: mostly ok (very difficult to reproduce)
- heavy traffic on 2 usb disks + dvb plugged: usb disk hang in 5 to 10mn
- heavy traffic on 2 usb disks + wathc dvb: dvb unwatchable, difficult to reproduce usb disk hangs
I've added your "unlink idle qh" traces. As soon as I plug the dvb (or load
dvb-usb-nova-t-usb2 module), I get *lots* of unlink idle traces (never stops).
When I mount an usb disk, I get about 30 unlik idle traces and then no more
[ I removed many of them here ]:
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf460
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf5a0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf460
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf5a0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf460
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf5a0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf460
kjournald starting. Commit interval 5 seconds
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf5a0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf460
EXT3 FS on sdd1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf5a0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf460
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf5a0
Similar story when I plug a second disk:
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf3c0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf500
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf3c0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf500
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf3c0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf500
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf3c0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf500
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf3c0
kjournald starting. Commit interval 5 seconds
EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf500
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf3c0
EXT3 FS on sdc1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf500
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf3c0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf500
I aslo get lots of trace while doing heavy traffic on one disk:
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf460
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf5a0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf460
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf5a0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf460
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf5a0
Doing more tests is not easy with so many traces.
HTH
There was no disk hang in the log file! In fact it all looked correct. The poor TV quality may simply be caused by insufficient bandwidth; the one bus may not be able to carry all that disk data plus the digital video. So I'm most interested in those disk hangs. Even with poor quality and slow transfers, nothing should actually hang. Likewise, the contents of the registers and async files aren't interesting until after a hang has occurred. By the way, I didn't mention it before but these lines from your log: Sep 2 18:17:24 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069 clearly indicate that your EHCI controller has an IAA bug. The new patch is supposed to keep things working in spite of that bug, but maybe there's something else going wrong as well. Whatever it is, apparently it gets triggered by high traffic levels. An enhanced version of the first patch might be able to help. I'll work on it and let you know when it's ready. Created attachment 12719 [details] Additions to the IAA watchdog Here's the additional IAA-watchdog patch. You should put everything back the way it was after the first patch (the one in comment #18) before applying this. Created attachment 12758 [details]
disk hang trace (2.6.22.6 + 2 IAA patches)
Sorry for the previous "normal" traces. I thought that the unlink reclaim traces were not normal.
Anyway, I've re-started from a vanilla 2.6.22.6 kernel with mostly Debian config and applied your 2 IAA patches.
I've included a trace of usb disk hang (with dvb running, but I was not watching a channel). I had a disk hand after about 5mns of traffic. The attachmeent contains usb register dump and dmesg output
HTH
Boy, that controller of yours sure is in a messed-up state! It's just not running the async schedule at all. The only thing left to try is a soft reset of the controller. I'll write another patch next week. It would be great if we could avoid the problem entirely, but since (1) we have no idea what the cause is, and (2) even if we knew the cause (most likely a bug in the controller hardware) we might not know how to avoid it, I don't see any other choice. My USB controller is AMD/ATI's SB600. Given AMD's new stance ragarding graphic open source development, may be you could get help from AMD. May be in AMD developer forum: http://forums.amd.com/devforum/ Another possibility is to enable the "USB legacy support" in the BIOS. I'll try that also. HTH Created attachment 12768 [details]
disk hang while watching dvb trace (2.6.22.6 + 2 IAA patches)
Hello
I think I've found a work-around. I've run the test (md5sum on 2 disks) while watching a dvb channel. For sure, the dvb stream was heavily corrupted, but the interesting thing is that the md5sum test ran for more than 20 mns without locking. Moreover there was not a single IAA trace in the kernel logs.
I still had a failure after 20mns, but I'm not sure that it was due to a disk failure: X had crashed, one disk stopped running while the other one was still running. So the symptoms are completely different. I've attached the usb registers content (after failure) in case you find something wrong with them.
I've also found this trace during boot sequence:
dvb-usb: schedule remote query interval to 100 msecs.
This means that when loaded, the dvb-t module is queried every 100ms, it's possible that the hang is triggered by one of these remote queries.
It may be time to watch more closely the interaction between dvb-t module and usb controller.
HTH
What is the workaround you found? BTW, the files at the top of the last attachment were taken from the wrong directory. For some reason, this time your EHCI controller was usb_host1, not usb_host6. In addition, there was no indication in the log of a disk crash. In theory, it shouldn't be possible for a module like dvb-t to crash the USB controller no matter what queries it makes. But of course, if the module actually contains a bug then all bets are off. The work-around I mentioned was just watching a DVB channel while running the test. Which means a sustained data stream (even if garbled) from the DVB device. This is to be compared to the 10 queries per seconds mentioned by the dvb log. Anyway I'll have to run more tests to check if this really makes a difference. I'll keep you posted. I'll also do some tests with dvb debug traces. I may find some clues there. HTH Some weird stuff I've noticed: - When the usb disk is hung. My keyboard (PS2) is not completely hung: what I type appears on screen after about 70s. - Once the usb disk is hung, after reboot, the sub disk itself is unresponsive: Sep 15 17:53:42 gandalf kernel: dvb_usb_nova_t_usb2 6-6:1.0: usb_probe_interface Sep 15 17:53:42 gandalf kernel: dvb_usb_nova_t_usb2 6-6:1.0: usb_probe_interface - got id Sep 15 17:53:42 gandalf kernel: usb usb6: usb auto-resume Sep 15 17:53:42 gandalf kernel: usb usb6: finish resume Sep 15 17:53:42 gandalf kernel: hub 6-0:1.0: hub_resume Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: resume root hub Sep 15 17:53:42 gandalf kernel: usb 6-6: usb auto-resume Sep 15 17:53:42 gandalf kernel: hub 6-0:1.0: state 7 ports 10 chg 0000 evt 0000 Sep 15 17:53:42 gandalf kernel: hub 6-0:1.0: port 3, status 0501, change 0000, 480 Mb/s Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 6 status 001005 POWER sig=se0 PE CONNECT Sep 15 17:53:42 gandalf kernel: usb 6-6: finish resume Sep 15 17:53:42 gandalf kernel: dvb-usb: found a 'Hauppauge WinTV-NOVA-T usb2' in warm state. Sep 15 17:53:42 gandalf kernel: dvb-usb: will pass the complete MPEG2 transport stream to the software demuxer. Sep 15 17:53:42 gandalf kernel: DVB: registering new adapter (Hauppauge WinTV-NOVA-T usb2). Sep 15 17:53:42 gandalf kernel: dvb-usb: MAC address: 00:0d:fe:ff:ff:ff Sep 15 17:53:42 gandalf kernel: DiB3000MC/P:-I- found DiB3000MC/P: 3002 Sep 15 17:53:42 gandalf kernel: DVB: registering frontend 1 (DiBcom 3000MC/P)... Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: port 3 high speed Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3 status 001005 POWER sig=se0 PE CONNECT Sep 15 17:53:42 gandalf kernel: input: IR-receiver inside an USB DVB receiver as /class/input/input3 Sep 15 17:53:42 gandalf kernel: dvb-usb: schedule remote query interval to 100 msecs. Sep 15 17:53:42 gandalf kernel: dvb-usb: Hauppauge WinTV-NOVA-T usb2 successfully initialized and connected. Sep 15 17:53:42 gandalf kernel: usb 6-3: new high speed USB device using ehci_hcd and address 7 Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: devpath 3 ep0in 3strikes Sep 15 17:53:42 gandalf last message repeated 2 times Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: port 3 high speed Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3 status 001005 POWER sig=se0 PE CONNECT Sep 15 17:53:42 gandalf kernel: DVB: registering frontend 0 (ST STV0299 DVB-S)... Sep 15 17:53:42 gandalf kernel: ACPI: PCI Interrupt 0000:03:01.0[A] -> GSI 21 (level, low) -> IRQ 21 Sep 15 17:53:42 gandalf kernel: gameport: EMU10K1 is pci0000:03:01.1/gameport0, io 0xbc00, speed 592kHz Sep 15 17:53:42 gandalf kernel: ACPI: PCI Interrupt 0000:00:14.2[A] -> GSI 16 (level, low) -> IRQ 16 Sep 15 17:53:42 gandalf kernel: usb 6-3: device descriptor read/64, error -71 Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: devpath 3 ep0in 3strikes Sep 15 17:53:42 gandalf last message repeated 2 times Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: port 3 high speed Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3 status 001005 POWER sig=se0 PE CONNECT Sep 15 17:53:42 gandalf kernel: usb 6-3: device descriptor read/64, error -71 Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: port 3 high speed Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3 status 001005 POWER sig=se0 PE CONNECT Sep 15 17:53:42 gandalf kernel: input: ImExPS/2 Logitech Wheel Mouse as /class/input/input4 Sep 15 17:53:42 gandalf kernel: usb 6-3: new high speed USB device using ehci_hcd and address 8 Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: devpath 3 ep0in 3strikes Sep 15 17:53:42 gandalf last message repeated 2 times Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: port 3 high speed Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3 status 001005 POWER sig=se0 PE CONNECT Sep 15 17:53:42 gandalf kernel: usb 6-3: device descriptor read/64, error -71 Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: devpath 3 ep0in 3strikes Sep 15 17:53:42 gandalf last message repeated 2 times Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: port 3 high speed Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3 status 001005 POWER sig=se0 PE CONNECT Sep 15 17:53:42 gandalf kernel: usb 6-3: device descriptor read/64, error -71 I have to switch off/on the external disk to get it back. - Here's a trace of a usb failure : Sep 15 17:46:24 gandalf kernel: DiB3000MC/P:timf: 11075200 Sep 15 17:46:24 gandalf kernel: DiB3000MC/P:autosearch returns: 2 Sep 15 17:46:24 gandalf kernel: DiB3000MC/P:-I- Setting output mode for demod ffff81003eba9000 to 5 Sep 15 17:46:24 gandalf kernel: DiB3000MC/P:timf: 11075200 Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069 Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009 Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069 Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009 Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069 Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009 Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069 Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009 Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069 Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009 Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status Sep 15 17:46:46 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069 Sep 15 17:46:46 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009 Sep 15 17:46:46 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status Sep 15 17:46:47 gandalf kernel: usb 6-6: vdr-kbd timed out on ep1out len=0/6 Sep 15 17:46:47 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069 Sep 15 17:46:47 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009 Sep 15 17:46:47 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status Sep 15 17:46:47 gandalf kernel: dvb-usb: bulk message failed: -110 (6/0) Sep 15 17:46:47 gandalf kernel: DiB3000MC/P:i2c read error on 509 Sep 15 17:46:49 gandalf kernel: usb 6-6: events/1 timed out on ep1out len=0/2 Sep 15 17:46:49 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069 Sep 15 17:46:49 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009 Sep 15 17:46:49 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status According to DiB3000MC/P tace, it looks like i2c bus is also gone ... I'm going to contact dvb usb people to see if one of them can help. It looks like a pure dvb bug: I just had a dvb failure even though the only dvb device was my nova-t dvb usb device. I've sent this information to the dvb list. I'll get back to this bug report as soon as some progress is made on the dvb front. Cheers Here are the most important lines from the log, right near the start of the failure: Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069 Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009 Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069 Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009 In the status register, bit 8000 indicates the async schedule is running and bit 0020 (which never got turned on) is the IAA flag. Bits 0060 in the cmd register are the IAA-doorbell and async-enabled bits. So translating this into English, originally the system asked for an IAA signal and never received one. When the watchdog timer expired, the async schedule was still running and the IAA-doorbell request was still on. The driver then turned them both off, and at the second watchdog expiration you can see that the async schedule is indeed no longer running. But then the driver tried to turn the async schedule back on, and the controller refused to do it! All the remaining watchdog log entries show that the async schedule is not running, even when the async-enable command bit is set. Evidently the controller has malfunctioned. Giving it a light reset might get it going again -- but support for a light controller reset is optional and the chip might not include it, in which case there wouldn't be anything left for the driver to do but give up. I haven't had time to write a patch for this, but maybe sometime next week. Have you considered buying an add-on PCI USB controller card? It might work better than the onboard USB controller. Hello Sorry for the long delay in following up. Here's the status so far: - I've tried to reproduce the problem at work to no avail. I've recently found that the dvb device must be hooked to an antenna to trigger the bug. And I don't have dvb antenna at work... - I'm able to reproduce usb hang without disk traffic: the dvb device (with the antenna) fails once or twice a day. In this case the keyboard is "delayed" (characters appears on screen after 50 to 60 seconds or more) - When dvb is locked, ehci-hcd module hangs. Simple command like 'cat /proc/bus/usb/devices' hangs. Looks like a mutex is not freed. (See kernel stack trace below). - If I unplug the dvb device, I get back my keyboard but ehci-hcd is still hung. rmmod ehci-hcd hangs. Only a reboot restores usb functions. - When dvb hangs, I've no IAA watchdog trace before or after the hang I'll buy a USB controller PCI card if I cannot borrow a laptop to reproduce the problem. But this is not my favorite solution since I have no free PCI slot. Cheers Oct 6 14:47:23 gandalf kernel: cat D 000000010069bcc6 0 23980 12418 (NOTLB) Oct 6 14:47:23 gandalf kernel: ffff810018ef7df8 0000000000000082 0000000000000000 0000000000000000 Oct 6 14:47:23 gandalf kernel: 00003ffffffff000 ffffffff80263b30 0000000000000008 ffff81002f566ea0 Oct 6 14:47:23 gandalf kernel: ffffffff804aa4c0 00000da7d831a456 00000000000b0406 ffff81002f567050 Oct 6 14:47:23 gandalf kernel: Call Trace: Oct 6 14:47:23 gandalf kernel: [<ffffffff80263b30>] __alloc_pages+0x5a/0x2bc Oct 6 14:47:23 gandalf kernel: [<ffffffff803f4c0c>] __down+0xe8/0x100 Oct 6 14:47:23 gandalf kernel: [<ffffffff8022c321>] default_wake_function+0x0/0xe Oct 6 14:47:23 gandalf kernel: [<ffffffff802f6a3b>] __up_read+0x13/0x8a Oct 6 14:47:23 gandalf kernel: [<ffffffff803f4880>] __down_failed+0x35/0x3a Oct 6 14:47:23 gandalf kernel: [<ffffffff80373e3c>] usb_device_read+0xb4/0x149 Oct 6 14:47:23 gandalf kernel: [<ffffffff802831dd>] vfs_read+0xcb/0x153 Oct 6 14:47:23 gandalf kernel: [<ffffffff80283579>] sys_read+0x45/0x6e Oct 6 14:47:23 gandalf kernel: [<ffffffff80209d8e>] system_call+0x7e/0x83 Oct 6 14:47:23 gandalf kernel: Oct 6 14:47:23 gandalf kernel: cat D 000000010070ddb6 0 24464 12003 (NOTLB) Oct 6 14:47:23 gandalf kernel: ffff810008947dc8 0000000000000082 0000000000000000 0000000000000000 Oct 6 14:47:23 gandalf kernel: 00007fff2a345860 ffff81000adf8000 0000000000000008 ffff81001d253160 Oct 6 14:47:23 gandalf kernel: ffff8100010fb530 00000e7e95b838d1 00000000000ace6d ffff81001d253310 Oct 6 14:47:23 gandalf kernel: Call Trace: Oct 6 14:47:23 gandalf kernel: [<ffffffff803f4c0c>] __down+0xe8/0x100 Oct 6 14:47:23 gandalf kernel: [<ffffffff8022c321>] default_wake_function+0x0/0xe Oct 6 14:47:23 gandalf kernel: [<ffffffff8035a23e>] get_device+0x17/0x1f Oct 6 14:47:23 gandalf kernel: [<ffffffff803f4880>] __down_failed+0x35/0x3a Oct 6 14:47:23 gandalf kernel: [<ffffffff80370672>] usbdev_read+0x43/0x1c3 Oct 6 14:47:23 gandalf kernel: [<ffffffff802831dd>] vfs_read+0xcb/0x153 Oct 6 14:47:23 gandalf kernel: [<ffffffff80283579>] sys_read+0x45/0x6e Oct 6 14:47:23 gandalf kernel: [<ffffffff80209d8e>] system_call+0x7e/0x83 Here's another example of hung cat command where usb_device_read is involved. Oct 6 14:26:21 gandalf kernel: cat D 000000010069bcc6 0 23980 12418 (NOTLB) Oct 6 14:26:21 gandalf kernel: ffff810018ef7df8 0000000000000082 0000000000000000 0000000000000000 Oct 6 14:26:21 gandalf kernel: 00003ffffffff000 ffffffff80263b30 0000000000000008 ffff81002f566ea0 Oct 6 14:26:21 gandalf kernel: ffffffff804aa4c0 00000da7d831a456 00000000000b0406 ffff81002f567050 Oct 6 14:26:21 gandalf kernel: Call Trace: Oct 6 14:26:21 gandalf kernel: [<ffffffff80263b30>] __alloc_pages+0x5a/0x2bc Oct 6 14:26:21 gandalf kernel: [<ffffffff803f4c0c>] __down+0xe8/0x100 Oct 6 14:26:21 gandalf kernel: [<ffffffff8022c321>] default_wake_function+0x0/0xe Oct 6 14:26:21 gandalf kernel: [<ffffffff802f6a3b>] __up_read+0x13/0x8a Oct 6 14:26:21 gandalf kernel: [<ffffffff803f4880>] __down_failed+0x35/0x3a Oct 6 14:26:21 gandalf kernel: [<ffffffff80373e3c>] usb_device_read+0xb4/0x149 Oct 6 14:26:21 gandalf kernel: [<ffffffff802831dd>] vfs_read+0xcb/0x153 Oct 6 14:26:21 gandalf kernel: [<ffffffff80283579>] sys_read+0x45/0x6e Oct 6 14:26:21 gandalf kernel: [<ffffffff80209d8e>] system_call+0x7e/0x83 Created attachment 13057 [details]
usbmon trace of dvb failure
Hello
Here's an usbmon trace captured just before and during a dvb failure. I hope you
can find some information in there.
Cheers
Well, it's pretty clear where the problem started. At timestamp 1659505782 a transfer failed with error -2. The next transfer succeeded, and then all the following transfers failed. (Error code -2 means the transfer was cancelled; by looking at the timestamps you can see that each of those transfers was cancelled 2 seconds after it was submitted. That must be the dvb driver's timeout limit.) What the usbmon log doesn't say is _why_ the transfers timed out. There's only one clue: In the very first failed transfer (the one mentioned above), it timed out even though all the data had been sent! That wasn't true of the later transfers; none of them managed to send anything. Still, this indicates something is wrong with the EHCI controller -- but we already knew that. I don't think we have any hope of figuring out why the controller malfunctions. It could be a bug in the design or a flaw in the manufacturing. You've never been able to duplicate this problem on any other computer, have you? Which indicates a hardware error in your one machine, rather than a widespread bug. Created attachment 13144 [details]
hub failure log
No, I have not been able to duplicate the problem on another
hardware. But I haven't found another 64 bits machine that I
could bring home. :-(
Anyway, I have another clue.
I've bought an USB 2.0 hub (with a Philips chip) to check if the
problem was not due to a power supply problem.
I've found that this hub barely works. Once I plug a thumb drive
in it (or after 2 or 3 plug/unplug cycles), I get "IAA lost"
traces without mounting the drive.
On the other hand, I've booted a knoppix 4.0 (kernel 2.6.12 32 bits)
on the same hardware: In this case, the hub and thumb drives work
perfectly well.
So I believe there's a problem with usb core code, either a code
difference between 2.6.12 and 2.6.22.6 or a problem due to the
different architecture (32 vs 64 bits).
I'm going to try more recent knoppix to narrow down the code
difference between the working and non-working setup.
HTH
It's always good to have a working version of the kernel for comparison purposes. You would probably be better off using a vanilla kernel rather than a Knoppix kernel, if at all possible. Then you could try out intermediate versions and zero down on the particular change responsible for the problem. I can use recent vanilla kernels, but I'm not sure that Debian unstable will work fine on very old kernels. May be in single user. Anyway, here's what I've found so far: - 2.6.22.6 (vanilla + your patches) tends to fail fast (after 2 or 3 insertions) - 2.6.21-2-amd64 (Debian's): failure after 9 insertions - 2.6.19.1 (vanilla) single user: ok after 22 insertions (probably got lucky) - 2.6.19.1 (vanilla) multi user: failure after 9 insertions - 2.6.19.1 (knoppix) multi user: failure after 36 insertions - 2.6.12 (knoppix) multi user: ok after 127 insertions So there's a chance that the problem was introduced between 2.6.12 and 2.6.19 (even though I've not enough results to be really sure). Note that more recent kernels are more likely to fail than older ones. I'm going to try older kernels, I'll begin with vanilla kernel. If I get too many problems, I'll try with older knoppix. HTH I've great news !!! Until now, I've done all the tests reported here on a machine with powernowd (which uses cpufreq_userspace module) set on "performance" mode. The freq of my CPU varied from 1GHZ and 2.3GHz. Without powernowd (and cpufreq_userspace is not loaded), the CPU always runs at 2.3GHz. The good news are: without powernowd and cpufreq, all the USB system works fine : - I can plug re-plug USB hub wihtout problem (I;ve tested about 100 cycles) - DVB no longer fails (although i'd need more days to be really sure on this one) - USB disk traffic + DVB plugged works normally (I've tested more than 20mn of md5sum on the USB drive while DVB was plugged) So my USB problems are solved by removing CPU throttling. But this raises the question of why do I get problems when cpu throttling is enabled ? Does USB have timers which depends on CPU frequency ? Can this problem occur on other HW ? on laptop which most often use CPU throttling ? Feel free to send me patches if you want me to test solutions that would work with CPU throttling. I'll do my best to try them. All the best. Sounds like a problem observed on some systems, whereby cpufreq breaks EHCI. Linus didn't much like the fix since it broke some systems ... plus it did look pretty ugly, and a simpler one should probably have been possible. However, nobody has yet provided a mergeable patch. I think a mergeable patch would use such a notifier and either abort the cpufreq change if there are ISO transfers queued, or else stop both periodic and async schedules before the change (then restart the appropriate ones after the change completes). The underlying problem appears to be that when the actual CPU frequency change occurs the USB controller hardware is prevented from accessing the PCI bus, causing various sorts of errors or failures. At least one other person has seen these sorts of errors. However, David, it's worth noticing that Dominique's failures don't appear to involve Iso transfers -- only Bulk. Yes, the failures only involve bulk transfer: weirdly enough, the Nova-t dvb-t device only provides bulk endpoints: T: Bus=06 Lev=01 Prnt=01 Port=03 Cnt=01 Dev#= 3 Spd=480 MxCh= 0 D: Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs= 1 P: Vendor=2040 ProdID=9301 Rev= 0.00 S: Manufacturer=Hauppau?? S: Product=SOHO-FX2 S: SerialNumber=9301-00-F008B260 C:* #Ifs= 1 Cfg#= 1 Atr=80 MxPwr=500mA I:* If#= 0 Alt= 0 #EPs= 4 Cls=ff(vend.) Sub=00 Prot=00 Driver=dvb_usb_nova_t_usb2 E: Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms E: Ad=86(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms HTH Hello Looks like I'm not out of the woods yet. I had a usb disk failure without powernowd or cpufreq (with dvb). This happened while reading a file from the usb disk at a rather slow rate (about 1.3MB/s) compared to the usual md5sum tests. So having a fixed frequency makes the bugs less likely to happen, but not impossible. Nov 5 21:04:28 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069 Nov 5 21:04:28 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009 Nov 5 21:04:28 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status Nov 5 21:04:28 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069 Nov 5 21:04:28 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009 Nov 5 21:04:28 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status Nov 5 21:04:28 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069 Nov 5 21:04:29 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009 HTH Created attachment 13497 [details]
dvb usb disk failure related to killing urb ?
Hello
I've tried the same test with dvb code from mercurial. I still have failure every now and then. The failures look related to killing urbs:
Nov 11 14:51:31 gandalf kernel: stop feeding
Nov 11 14:51:31 gandalf kernel: killing URB no. 0.
Nov 11 14:51:31 gandalf kernel: killing URB no. 1.
Nov 11 14:51:31 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069
Nov 11 14:51:32 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009
Nov 11 14:51:32 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status
Nov 11 14:51:32 gandalf kernel: killing URB no. 2.
and:
Nov 11 18:58:15 gandalf kernel: stop feeding
Nov 11 18:58:15 gandalf kernel: killing URB no. 0.
Nov 11 18:58:15 gandalf kernel: killing URB no. 1.
Nov 11 18:58:15 gandalf kernel: killing URB no. 2.
Nov 11 18:58:15 gandalf kernel: killing URB no. 3.
Nov 11 18:58:15 gandalf kernel: killing URB no. 4.
Nov 11 18:58:15 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069
Nov 11 18:58:15 gandalf kernel: >>> 04 35
Nov 11 18:58:15 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009
Nov 11 18:58:15 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status
Nov 11 18:58:15 gandalf kernel: killing URB no. 5.
The "Killing URB" trace comes from dvb-USB/usb-urb.c line 73:
int usb_urb_kill(struct usb_data_stream *stream)
{
int i;
for (i = 0; i < stream->urbs_submitted; i++) {
deb_ts("killing URB no. %d.\n",i);
/* stop the URB */
usb_kill_urb(stream->urb_list[i]);
}
stream->urbs_submitted = 0;
return 0;
}
If this does not give you a clue about what is wrong (beside HW pb), I'm going to buy a PCI usb controller card to check if this problem can be reproduced with another controller (as Alan suggested).
HTH
Actually the problem is related to unlinking URBs, which is part of what usb_kill_urb() does. When an URB is unlinked its active QH has to be removed from the schedule. Removing QHs requires an IAA interrupt, so when your balky hardware fails to set the proper IAA status the problem occurs. This is all stuff we already knew. Hello Sorry for the delay in following up. I've tested like DVB+external USB disk with an add-on USB card (VIA chip). It's working normally with CPU freq set at max speed. I'm going to check if changing cpu freq causes problems with this chip. All the best Hi, I am experiencing the same problem as Dominique reported here. My board (Asus M2A-VM HDMI) also uses the SB600 chipset. Attached to one USB port is a D-LINK 7-port hub (USB 2.0). 7 USB-HDDs are attached to this hub. At least 5 USB-HDDs are using a Cypress chipset. The system is Ubuntu 7.10 with kernel "2.6.22-14-server #1 SMP i686". Ok, it is not a vanilla kernel, but I think it also happens to a vanilla kernel. I can try this as well if you wish. Access to the USB-HDDs without high traffic works without any problem. As soon as I start to copy a few big files from one USB-HDD to another USB-HDD, the cp command hangs at some point. "sync" does not work either. The only solution so far for me is to reboot the machine. This is reproducable. Copying from the internal IDE-HDD does not produce this bug, but copying via Samba to the USB-HDDs does. I have an Add-On PCI USB2.0 controller card which uses a VIA controller. I am going to see if the described behaviour also happens with the card. Regards, Christian Christian, there's no way you can possibly know whether your problem is the same as Dominique's without looking at the EHCI "registers" debugging file. See comments 8 and 12. Created attachment 14515 [details]
cat /proc/bus/usb/devices
ok. the debug output follows.
Created attachment 14516 [details]
cat /sys/class/usb_host/usb_host6/registers
Created attachment 14517 [details]
cat /sys/class/usb_host/usb_host6/async
Created attachment 14518 [details]
cat /sys/class/usb_host/usb_host6/periodic
the file "companion" is empty.
I have started to copy a few files from one USB-HDD to another USB-HDD.
Now the cp hangs again.
Okay, that does look like the same bug. But as far as I've been able to tell it's a hardware bug in the host controller, not a software bug in the driver. Nobody to my knowledge has found a way to work around it. If you have any ideas, I'd like to hear them. Same tests on the same board with the same kernel with an additional USB2.0 controller card (VIA VT6212L): Big bulk transfers from one USB2.0 HDD to another one work without any problems. No more hangs. Can you provide a patch which soft-resets the EHCI controller? -- Linux sleepless 2.6.22.9-custom #1 SMP Mon Jan 21 21:44:50 CET 2008 i686 GNU/Linux -- 03:06.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 61) (prog-if 00 [UHCI]) Subsystem: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller Flags: bus master, medium devsel, latency 64, IRQ 22 I/O ports at c800 [size=32] Capabilities: [80] Power Management version 2 03:06.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 61) (prog-if 00 [UHCI]) Subsystem: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller Flags: bus master, medium devsel, latency 64, IRQ 19 I/O ports at c400 [size=32] Capabilities: [80] Power Management version 2 03:06.2 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 63) (prog-if 20 [EHCI]) Subsystem: VIA Technologies, Inc. USB 2.0 Flags: bus master, medium devsel, latency 64, IRQ 20 Memory at fddfe000 (32-bit, non-prefetchable) [size=256] Capabilities: [80] Power Management version 2 -- Unfortunately I can't. The driver isn't designed in a way that makes it straightforward. Furthermore the EHCI spec says explicitly that hardware support for soft reset is optional. You can always do a manual reset. The easiest way is to rmmod ehci-hcd and then modprobe it back. Created attachment 14769 [details]
experimental IAA/IAAD patch for 2.6.24-git20+
Try this patch, especially on ATI hardware. It tries to handle one case that's been described as a hardware bug (albeit not one in public erata listings), and generally adds lots of paranoia about the two IAA related bits to detect similar cases of bits not being set or cleared when they should be.
Run with CONFIG_USB_DEBUG=y for the best diagnostics, and please report whether any interesting behavioral changes appear.
I will test your patch and get back to you. But before that, it looks like 2.6.24.1 behaves better: I still have to produce a single failure while slowly reading the content of an USB disk (i.e. playing a movie stored there). So far, it has worked for 2 hours. I may have been lucky, so I need to perform mode tests, including tests with DVB usb and re-enabling CPU throttle. Christian, could you test 2.6.24.1 on your side ? All the best Incidentally, Dominique, you shouldn't be using the patches attached to comments 18 and 38 any more. The first has been updated by a patch included in 2.6.25 which you can find here: <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=07d29b63ef6b39963ab37818653284d861cf55af>. The second is essentially replaced by Dave's new patches. If have a similar problem with an ATI IXP SB400 EHCI controller. I didn't read all comments, but maybe i can add some new facts here. If i can do anything further to help, feel free to contact me. I do not have this problem with usb storage but with a usb mouse. After some random time, all my usb ports freeze and i cannot use any usb devices. rmmod usb_storage freezes, too. I don't have any of the problems i describe here, when i boot with acpi=off noapic parameters. This problem occurs since at least 2.6.18 but not on all kernel versions since then. Versions where usb ports don't freeze despite having acpi and apic enabled have a known problem with netdev watchdog timer for realtek ethernet card (modules 8139cp/8139too). See http://bugzilla.kernel.org/show_bug.cgi?id=6138 Sorry for the delay in testing. Alan, thanks for the tip. I've tested 2.6.24-git22 with David's IAA patch. When any device is plugged in the usb bus, I get a lot of "IAAD cleared without IAA" messages. On the SB600 device: kernel: ehci_hcd 0000:00:13.5: IAAD cleared without IAA But also on the VIA chip (which worked flawlessly so far): kernel: ehci_hcd 0000:03:01.2: IAAD cleared without IAA To continue the tests, I've removed all devices connected to the VIA chip. Here's what I get with minimal traffic (i.e. dvb connected and USB disk mounted with no traffic): Feb 24 19:33:54 gandalf kernel: EXT3 FS on sdh1, internal journal Feb 24 19:33:54 gandalf kernel: EXT3-fs: mounted filesystem with ordered data mode. Feb 24 19:33:54 gandalf kernel: ehci_hcd 0000:00:13.5: IAAD cleared without IAA Feb 24 19:34:25 gandalf last message repeated 1093 times Feb 24 19:35:26 gandalf last message repeated 2190 times Feb 24 19:36:27 gandalf last message repeated 2191 times Feb 24 19:37:28 gandalf last message repeated 2180 times Feb 24 19:38:29 gandalf last message repeated 2212 times And when I start the md5sum + dvb device test : Feb 24 19:40:29 gandalf last message repeated 26947 times Feb 24 19:41:29 gandalf last message repeated 35874 times Feb 24 19:42:29 gandalf last message repeated 35038 times Feb 24 19:43:29 gandalf last message repeated 35316 times With 2.6.24 and before, this test usually hangs the USB controller in a few minutes. After a while, the DVB device fails: Feb 24 20:19:09 gandalf last message repeated 21933 times Feb 24 20:19:09 gandalf kernel: usb 6-4: vdr-kbd timed out on ep1in len=0/2 Feb 24 20:19:09 gandalf kernel: dvb-usb: recv bulk message failed: -110 Feb 24 20:19:09 gandalf kernel: ehci_hcd 0000:00:13.5: IAAD cleared without IAA Feb 24 20:19:11 gandalf last message repeated 191 times At this time the symptoms are: - USB disk access still works (this is new behavior) - usbview hangs. - keyboard is weird: key pressed appears on screen after several seconds (symptoms always seen when dvb device fails). Then I stop disk traffic. And I need to unplug the DVB device to get a normal keyboard behavior. usbview is still hung and is unkillable (even with SIGKILL) Feel free to ask me other tests if needed. HTH Comment on attachment 14769 [details]
experimental IAA/IAAD patch for 2.6.24-git20+
original IAA patch had some bogus messaging in the typical case; replacing it.
Created attachment 14997 [details]
improved iaa diagnostics patch
This is basically same as the previous patch, except that it doesn't emit spurious diagnostics ("IAAD cleared without IAA") in the normal all-is-working-right case. (It should have checked for IAA clear before assuming it wasn't set! I'm told there is hardware which will trigger that corrected diagnostic...) It also restores the diagnostic for the normal watchdog-firing case to "verbose" level, to avoid log clutter on that uninteresting case.
Dominique, sorry for the noise, but that original patch had a goof; I posted the updated patch on linux-usb@vger but forgot to update the version here. Unfortunately I can't tell which of those "IAAD cleared" messages may be valid (if any!) versus which are noise (most of them). It'd be good if you could retest with the updated patch (say, RC3 -- or else just the post-2.6.24 kernel you're already using) plus the anti-oops patch in http://lkml.org/lkml/2008/2/19/546 Hello I've tested your new patch (2008-02-25 patch + anti-oops patch ) with 2.6.24-git22. Test setup is dvb-t device, reading a movie on usb disk (not md5sum sum), with powernowd running in performance mode. I indeed had a failure: - usb disk com failed - dvb-t com failed (with usual impact: sluggish keyboard) - but usb is not locked: usbview still works (that's an improvement :-) ) - cycling power on usb disk is not enough to get it back: the power cycleeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee is not detected [you just witnessed another keyboard problem consecutive to usb failure: I hit 'e' only once and it got repeated :-/ ] See the traces below: Mar 2 16:18:32 gandalf kernel: ehci_hcd 0000:00:13.5: IAAD wasn't clear 00010079 Mar 2 16:18:32 gandalf kernel: usb 6-4: events/1 timed out on ep1out len=0/2 Mar 2 16:18:32 gandalf kernel: dvb-usb: bulk message failed: -110 (2/0) Mar 2 16:18:34 gandalf kernel: ehci_hcd 0000:00:13.5: IAAD wasn't clear 00010079 Mar 2 16:18:34 gandalf kernel: usb 6-4: vdr-kbd timed out on ep1out len=0/6 Mar 2 16:18:34 gandalf kernel: dvb-usb: bulk message failed: -110 (6/0) [...] Mar 2 16:20:22 gandalf kernel: usb 6-3: usb-storage timed out on ep0out len=0/0 Mar 2 16:20:23 gandalf kernel: usb 6-3: device not accepting address 2, error -110 Mar 2 16:20:23 gandalf kernel: hub 6-0:1.0: logical disconnect on port 3 Mar 2 16:20:23 gandalf kernel: hub 6-0:1.0: state 7 ports 10 chg 0008 evt 0000 Mar 2 16:20:23 gandalf kernel: hub 6-0:1.0: port 3, status 0501, change 0000, 480 Mb/s Mar 2 16:20:23 gandalf kernel: usb 6-3: USB disconnect, address 2 Mar 2 16:20:23 gandalf kernel: usb 6-3: unregistering device Mar 2 16:20:23 gandalf kernel: usb 6-3: usb_disable_device nuking all URBs Mar 2 16:20:23 gandalf kernel: usb 6-3: unregistering interface 6-3:1.0 Mar 2 16:20:23 gandalf kernel: sd 4:0:0:0: Device offlined - not ready after error recovery Mar 2 16:20:23 gandalf kernel: sd 4:0:0:0: [sdh] Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK,SUGGEST_OK Mar 2 16:20:23 gandalf kernel: end_request: I/O error, dev sdh, sector 18237903 Mar 2 16:20:23 gandalf kernel: sd 4:0:0:0: rejecting I/O to offline device Mar 2 16:20:23 gandalf kernel: sd 4:0:0:0: rejecting I/O to offline device Mar 2 16:20:23 gandalf kernel: sd 4:0:0:0: [sdh] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK Mar 2 16:20:23 gandalf kernel: end_request: I/O error, dev sdh, sector 18238143 Mar 2 16:20:23 gandalf kernel: Buffer I/O error on device sdh1, logical block 1138 Mar 2 16:20:23 gandalf kernel: lost page write due to I/O error on sdh1 Mar 2 16:20:23 gandalf kernel: Aborting journal on device sdh1. Mar 2 16:20:23 gandalf kernel: Buffer I/O error on device sdh1, logical block 532 [...] Mar 2 16:25:24 gandalf kernel: usb-storage: device scan complete Mar 2 16:25:24 gandalf kernel: scsi 19:0:0:0: Direct-Access Generic USB SD Reader 1.00 PQ: 0 ANSI: 0 Mar 2 16:25:24 gandalf kernel: scsi 19:0:0:1: Direct-Access Generic USB CF Reader 1.01 PQ: 0 ANSI: 0 Mar 2 16:25:24 gandalf kernel: scsi 19:0:0:2: Direct-Access Generic USB SM Reader 1.02 PQ: 0 ANSI: 0 Mar 2 16:25:24 gandalf kernel: scsi 19:0:0:3: Direct-Access Generic USB MS Reader 1.03 PQ: 0 ANSI: 0 Mar 2 16:25:24 gandalf kernel: sd 19:0:0:0: [sde] Attached SCSI removable disk Mar 2 16:25:24 gandalf kernel: sd 19:0:0:1: [sdf] Attached SCSI removable disk Mar 2 16:25:24 gandalf kernel: sd 19:0:0:2: [sdg] Attached SCSI removable disk Mar 2 16:25:24 gandalf kernel: sd 19:0:0:3: [sdk] Attached SCSI removable disk Mar 2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd65bc0 path 2 ep2in 93120000 cc 9 --> status -121 Mar 2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd656c0 path 2 ep2in 40160000 cc 4 --> status -32 Mar 2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd65bc0 path 2 ep2in 93120000 cc 9 --> status -121 Mar 2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd656c0 path 2 ep2in 40160000 cc 4 --> status -32 Mar 2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd65bc0 path 2 ep2in 92120000 cc 9 --> status -121 Mar 2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd656c0 path 2 ep2in 40160000 cc 4 --> status -32 Mar 2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd65bc0 path 2 ep2in 92120000 cc 9 --> status -121 Mar 2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd656c0 path 2 ep2in 40160000 cc 4 --> status -32 Mar 2 16:25:24 gandalf kernel: ehci_hcd 0000:00:13.5: IAAD wasn't clear 00010079 Mar 2 16:25:24 gandalf kernel: usb 6-4: kdvb-fe-0 timed out on ep1out len=0/6 Mar 2 16:25:24 gandalf kernel: dvb-usb: bulk message failed: -110 (6/0) Mar 2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd65bc0 path 2 ep2in 93120000 cc 9 --> status -121 Mar 2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd656c0 path 2 ep2in 40160000 cc 4 --> status -32 Mar 2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd65bc0 path 2 ep2in 92120000 cc 9 --> status -121 Mar 2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd656c0 path 2 ep2in 40160000 cc 4 --> status -32 Mar 2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd65bc0 path 2 ep2in 92120000 cc 9 --> status -121 Mar 2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd656c0 path 2 ep2in 40160000 cc 4 --> status -32 Mar 2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81007e4a7ec0 path 2 ep2in 92120000 cc 9 --> status -121 Mar 2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd656c0 path 2 ep2in 40160000 cc 4 --> status -32 Mar 2 16:25:26 gandalf kernel: ehci_hcd 0000:00:13.5: IAAD wasn't clear 00010079 Mar 2 16:25:26 gandalf kernel: usb 6-4: events/1 timed out on ep1out len=0/2 Mar 2 16:25:26 gandalf kernel: dvb-usb: bulk message failed: -110 (2/0) Mar 2 16:25:28 gandalf kernel: ehci_hcd 0000:00:13.5: IAAD wasn't clear 00010079 I also got: Mar 2 18:44:18 gandalf kernel: INFO: task khubd:147 blocked for more than 120 seconds. Mar 2 18:44:18 gandalf kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Mar 2 18:44:18 gandalf kernel: khubd D 0000000000000000 0 147 2 Mar 2 18:44:18 gandalf kernel: ffff81007f0f7c40 0000000000000046 ffff810000000000 ffffffff00000000 Mar 2 18:44:18 gandalf kernel: ffff81007f0f7b90 ffff81007f0f7bf0 ffffffff805b3680 ffffffff805b3680 Mar 2 18:44:18 gandalf kernel: ffffffff805b3680 ffffffff805b0880 ffffffff805b3680 ffff81007f0f5878 Mar 2 18:44:18 gandalf kernel: Call Trace: Mar 2 18:44:18 gandalf kernel: [<ffffffff8843ba9c>] :dvb_core:dvb_unregister_frontend+0xaf/0xef Mar 2 18:44:18 gandalf kernel: [<ffffffff80243a2d>] autoremove_wake_function+0x0/0x2e Mar 2 18:44:18 gandalf kernel: [<ffffffff8031171b>] kref_put+0x41/0x4c Mar 2 18:44:18 gandalf kernel: [<ffffffff8844dbc9>] :dvb_usb:dvb_usb_adapter_frontend_exit+0x15/0x25 Mar 2 18:44:18 gandalf kernel: [<ffffffff8844d336>] :dvb_usb:dvb_usb_exit+0x31/0x98 Mar 2 18:44:18 gandalf kernel: [<ffffffff8844d3d3>] :dvb_usb:dvb_usb_device_exit+0x36/0x48 Mar 2 18:44:18 gandalf kernel: [<ffffffff803899ef>] usb_unbind_interface+0x59/0xb6 Mar 2 18:44:18 gandalf kernel: [<ffffffff80379c0d>] __device_release_driver+0x74/0x97 Mar 2 18:44:18 gandalf kernel: [<ffffffff80379fe4>] device_release_driver+0x31/0x49 Mar 2 18:44:18 gandalf kernel: [<ffffffff803794d3>] bus_remove_device+0x8e/0xa1 Mar 2 18:44:18 gandalf kernel: [<ffffffff803779b3>] device_del+0x10e/0x173 Mar 2 18:44:18 gandalf kernel: [<ffffffff80386e7e>] usb_disable_device+0xe2/0x163 Mar 2 18:44:18 gandalf kernel: [<ffffffff80382330>] usb_disconnect+0xd9/0x172 Mar 2 18:44:18 gandalf kernel: [<ffffffff803839fc>] hub_thread+0x66d/0xff3 Mar 2 18:44:18 gandalf kernel: [<ffffffff80243a2d>] autoremove_wake_function+0x0/0x2e Mar 2 18:44:18 gandalf kernel: [<ffffffff8038338f>] hub_thread+0x0/0xff3 Mar 2 18:44:18 gandalf kernel: [<ffffffff802438f9>] kthread+0x47/0x76 Mar 2 18:44:18 gandalf kernel: [<ffffffff8022cef5>] schedule_tail+0x28/0x5d Mar 2 18:44:18 gandalf kernel: [<ffffffff8020cc58>] child_rip+0xa/0x12 Mar 2 18:44:18 gandalf kernel: [<ffffffff802438b2>] kthread+0x0/0x76 Mar 2 18:44:18 gandalf kernel: [<ffffffff8020cc4e>] child_rip+0x0/0x12 On a related note, I've also seen some improvements between 2.6.22 and 2.6.24.1: Using an external hub with powernowd no longer fails. I can plug/unplug usb thumb drive on the external hub without problems (used to hang usb after 3 or 4 tries). HTH My impression is that this is just a buggy controller. David may have some more constructive suggestions. That's possible... On the other hand, 2.6.24 behaves much better than 2.6.22 with regard to USB disks. I've also some trouble with DVB-T device with the VIA chipset. On some occasion, when I reboot, the LED of the Nova-T device is still on during the whole reboot process. Normally, it should switch off when the machine goes down. I can only assume that the DVB-T device is trashed. Even when if the DVB-T device is plugged on the VIA card (PCI add-on usb controller) I get the following symptoms: - lot of "dvb-usb: bulk message failed: -110 (5/0)" in kern.log - PS/2 keyboard is sluggish (about 1 or 2s delay between key hit and effect) - behavior go back to normal when I unplug the DVB device. Looks like error recovery in DVB driver has some problem. Anyway, this may be unrelated to the USB disk hang I've experienced before. So I'm going to test only USB storage (without DVB) on 2.6.24-git22 + david's patch, so I can check ehci + usb storage behavior on SB600 chipset without DVB interferences. Here are some results with USB disk alone on 2.6.24-git12 + Dave's patch To test the disk I run mplayer on a movie stored on the usb disk. After about 75mnm I got one (and only one) IAA hiccup (mplayer did not stop): Mar 8 19:01:32 gandalf kernel: ehci_hcd 0000:00:13.5: IAAD cleared without IAA Mar 8 19:01:32 gandalf kernel: ehci_hcd 0000:00:13.5: IAAD wasn't clear 00010069 After this trace, I had no other trouble for at least one hour (and I had to stop the test). I'll keep testing this setup for a while to see if this kind of IAA hiccups happens often or not. HTH Hmm, the "IAAD cleared without IAA" seems to show that there's a real silicon issue there ... perhaps more prevalent with the SB600 than the SB700. Please try this with the patch attached to Bug 10078 ... that's essentially what I asked be merged for 2.6.25-final (so it's not got as much in the way of diagnostics). It lets the watchdog handle the case which triggers that particular message. Re the "IAAD wasn't clear", that message may be a bit dubious, but in any case the situation it addresses should also be addressed by that patch. I've tested your patch for 2 hours without a hiccup. USB disk is working normally. But, the file ehci-hdc.c contains: #undef EHCI_VERBOSE_DEBUG And you patch uses ehci_vdbg which is silent when EHCI_VERBOSE_DEBUG is not defined, so I don't think I will get traces. What should I do to enable these traces ? - update to a more recent git kernel (I'm still at 2.6.24 git22) - modify the patch to use ehci_dbg instead of ehci_vdbg - remove the '#undef EHCI_VERBOSE_DEBUG' Reply-To: david-b@pacbell.net > What should I do to enable these traces ? > - update to a more recent git kernel (I'm still at 2.6.24 git22) > - modify the patch to use ehci_dbg instead of ehci_vdbg > - remove the '#undef EHCI_VERBOSE_DEBUG' By all means update to 2.6.25-rc6 or newer. If you really want to see such messages, #define EHCI_VERBOSE_DEBUG. You should try 2.6.25-rc6; it includes the IAA patch. As regards the debugging output: The most important messages use ehci_dbg, not ehci_vdbg. (These are the messages in the STS_IAA part of ehci_irq().) So you would see them provided you enable CONFIG_USB_DEBUG. If you really want to see the ehci_vdbg messages then #define EHCI_VERBOSE_DEBUG, i.e., change the #undef to #define. Can we close this bug report? Sorry, I forgot to update this bug report. I couldn't get failure on 2.6.25-rc kernel for 2 hours. Then I had to switch back to 2.6.24. To be really sure that the bug is fixed, I'd like to test this with 2.6.25 for everyday use. But in this case, I need a working ATI driver (fglrx). This one was released by ATI this week and should be available soon for Debian. So I should get back to you next week or so for an update. Until then, please don't close this bug. Thanks for your patience. I have had this problem for quite some time as well. I am running 2.6.25.4 now and have not experienced any hangs with my external USB disk. Hello With 2.6.25.6 (Debian sid) kernel, I've been reading and writing to my USB disks during this week without problems when they are directly connected to the machine (through SB600). When connected through a hub, I sometime get errors when connecting a thumb drive to the hub. I cannot say if it's a HW problem or software problem. The good news is that the USB systems comes back alive after 10 or 20 seconds. After power cycling the disk and the hub, the usb and usb-storage will show them correctly. The only hitch is that udev mounts them under a different name. For instance /dev/disk/by-id/usb-MAXTOR_S_TM3320820A_00-0:0 instead of /dev/disk/by-id/usb-MAXTOR_S_TM3320820A_0-0:0 In any case, this is a minor issue compared to the USB hang. Do you want me to open another bug for this issue ? (so you can close this one) Thanks for all your help The different name probably reflects the different physical port being used by the device. You ought to be able to fix it by altering the appropriate udev script. In any event, it's a udev issue rather than a kernel issue so it doesn't below in the kernel bugzilla. If you're happy with the way the driver is working, we can close this bug report. Yes, you can close this bug. Thanks for your help :-) |