Bug 8692 - USB storage freeze
Summary: USB storage freeze
Status: CLOSED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: USB (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: David Brownell
URL:
Keywords:
Depends on:
Blocks: USB
  Show dependency tree
 
Reported: 2007-06-30 08:50 UTC by Dominique Dumont
Modified: 2008-07-02 01:41 UTC (History)
8 users (show)

See Also:
Kernel Version: 2.6.21
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
async file from usb_hots6 (3.00 KB, text/plain)
2007-07-21 00:28 UTC, Dominique Dumont
Details
companion file from usb_host6 (4.00 KB, text/plain)
2007-07-21 00:30 UTC, Dominique Dumont
Details
periodic file from usb_host6 (4.00 KB, text/plain)
2007-07-21 00:31 UTC, Dominique Dumont
Details
registers file from usb_host6 (4.00 KB, text/plain)
2007-07-21 00:47 UTC, Dominique Dumont
Details
/proc/bus/usb/devices file (4.17 KB, text/plain)
2007-07-21 00:51 UTC, Dominique Dumont
Details
Add separate IAA watchdog timer for EHCI (9.84 KB, patch)
2007-08-21 12:09 UTC, Alan Stern
Details | Diff
2.6.22 kernel log with separate IAA watchdog timer patch (3.69 KB, text/plain)
2007-08-22 12:24 UTC, Dominique Dumont
Details
usb_host6 files after usb hang (with IAA watchdog patch) (37.75 KB, text/plain)
2007-08-23 12:24 UTC, Dominique Dumont
Details
additional debugging for IAA watchdog (1.77 KB, patch)
2007-08-23 13:36 UTC, Alan Stern
Details | Diff
usb6 + sysrq-t traces (62.23 KB, text/x-patch)
2007-08-25 03:08 UTC, Dominique Dumont
Details
trace for dvb communication failure (61.38 KB, text/plain)
2007-09-01 05:38 UTC, Dominique Dumont
Details
trace for usb disk failure (61.52 KB, text/plain)
2007-09-01 08:57 UTC, Dominique Dumont
Details
lot of "unlink idle qh" when dvb is plugged (12.88 KB, text/x-patch)
2007-09-03 13:37 UTC, Dominique Dumont
Details
Additions to the IAA watchdog (8.35 KB, patch)
2007-09-05 14:27 UTC, Alan Stern
Details | Diff
disk hang trace (2.6.22.6 + 2 IAA patches) (61.64 KB, text/plain)
2007-09-08 03:46 UTC, Dominique Dumont
Details
disk hang while watching dvb trace (2.6.22.6 + 2 IAA patches) (37.46 KB, text/plain)
2007-09-09 07:37 UTC, Dominique Dumont
Details
usbmon trace of dvb failure (10.19 KB, text/x-patch)
2007-10-06 10:13 UTC, Dominique Dumont
Details
hub failure log (31.14 KB, text/plain)
2007-10-13 10:07 UTC, Dominique Dumont
Details
dvb usb disk failure related to killing urb ? (8.90 KB, text/plain)
2007-11-11 11:25 UTC, Dominique Dumont
Details
cat /proc/bus/usb/devices (6.62 KB, text/plain)
2008-01-21 14:24 UTC, Christian Wick
Details
cat /sys/class/usb_host/usb_host6/registers (810 bytes, text/plain)
2008-01-21 14:26 UTC, Christian Wick
Details
cat /sys/class/usb_host/usb_host6/async (508 bytes, text/plain)
2008-01-21 14:27 UTC, Christian Wick
Details
cat /sys/class/usb_host/usb_host6/periodic (61 bytes, text/plain)
2008-01-21 14:31 UTC, Christian Wick
Details
experimental IAA/IAAD patch for 2.6.24-git20+ (4.78 KB, patch)
2008-02-09 15:28 UTC, David Brownell
Details | Diff
improved iaa diagnostics patch (5.38 KB, patch)
2008-02-25 21:48 UTC, David Brownell
Details | Diff

Description Dominique Dumont 2007-06-30 08:50:36 UTC
Most recent kernel where this bug did not occur: unknown
Distribution: Debian/sid
Hardware Environment: AMD 64 X2, external 200GB USB disk (Prolific Technology Inc)
Software Environment: 2.6.21 #3 SMP Sat Jun 30 16:30:03 CEST 2007 x86_64 GNU/Linux
Problem Description:

When doing file transfer, USB storage freezes randomly. It's possible that my USB drive is low quality hardware, but this shouldn't freeze the USB bus.

At this point, all I/O to USB are blocked:
- I can mount another USB disk
- sync blocks
- reboot blocks
- only reset button works to unlock the disks.

I've seen this problem also on Debian's kernel 2.6.18

I've reproduced it on a vanilla kernel (built with make-kpkg) with usb debug
enabled. (traces are below)


Steps to reproduce:

A simple cp of a multi GB file is often enough. (note that my disk if 90% full)

You can also run find <usb-drive> -type f -exec md5sum {} \;

Traces:
Jun 30 17:21:30 gandalf kernel: usb-storage: queuecommand called
Jun 30 17:21:30 gandalf kernel: usb-storage: *** thread awakened.
Jun 30 17:21:30 gandalf kernel: usb-storage: Command WRITE_10 (10 bytes)
Jun 30 17:21:30 gandalf kernel: usb-storage:  2a 00 0c d9 01 87 00 00 f0 00
Jun 30 17:21:30 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T 0x4297 L 122880 F 0 Trg 0 LUN 0 CL 10
Jun 30 17:21:30 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer 31 bytes
Jun 30 17:21:30 gandalf kernel: usb-storage: Status code 0; transferred 31/31
Jun 30 17:21:30 gandalf kernel: usb-storage: -- transfer complete
Jun 30 17:21:30 gandalf kernel: usb-storage: Bulk command transfer result=0
Jun 30 17:21:30 gandalf kernel: usb-storage: usb_stor_bulk_transfer_sglist: xfer 122880 bytes, 17 entries
Jun 30 17:21:30 gandalf kernel: usb-storage: Status code 0; transferred 122880/122880
Jun 30 17:21:30 gandalf kernel: usb-storage: -- transfer complete
Jun 30 17:21:30 gandalf kernel: usb-storage: Bulk data transfer result 0x0
Jun 30 17:21:30 gandalf kernel: usb-storage: Attempting to get CSW...
Jun 30 17:21:30 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer 13 bytes
Jun 30 17:21:30 gandalf kernel: usb-storage: Status code 0; transferred 13/13
Jun 30 17:21:30 gandalf kernel: usb-storage: -- transfer complete
Jun 30 17:21:30 gandalf kernel: usb-storage: Bulk status result = 0
Jun 30 17:21:30 gandalf kernel: usb-storage: Bulk Status S 0x53425355 T 0x4297 R 0 Stat 0x0
Jun 30 17:21:30 gandalf kernel: usb-storage: scsi cmd done, result=0x0
Jun 30 17:21:30 gandalf kernel: usb-storage: *** thread sleeping.
Jun 30 17:21:30 gandalf kernel: usb-storage: queuecommand called
Jun 30 17:21:30 gandalf kernel: usb-storage: *** thread awakened.
Jun 30 17:21:30 gandalf kernel: usb-storage: Command WRITE_10 (10 bytes)
Jun 30 17:21:30 gandalf kernel: usb-storage:  2a 00 0c d9 02 77 00 00 f0 00
Jun 30 17:21:30 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T 0x4298 L 122880 F 0 Trg 0 LUN 0 CL 10
Jun 30 17:21:30 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer 31 bytes
Jun 30 17:22:00 gandalf kernel: usb-storage: command_abort called
Jun 30 17:22:00 gandalf kernel: usb-storage: usb_stor_stop_transport called
Jun 30 17:22:00 gandalf kernel: usb-storage: -- cancelling URB

At this point the external disk is frozen.

Then I've powered on another USB disk:

Jun 30 17:24:34 gandalf kernel: hub 6-0:1.0: state 7 ports 10 chg 0000 evt 0008
Jun 30 17:24:34 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3 status 001803 POWER sig=j CSC CONNECT
Jun 30 17:24:34 gandalf kernel: hub 6-0:1.0: port 3, status 0501, change 0001, 480 Mb/s
Jun 30 17:24:34 gandalf kernel: hub 6-0:1.0: debounce: port 3: total 100ms stable 100ms status 0x501
Jun 30 17:24:34 gandalf kernel: ehci_hcd 0000:00:13.5: port 3 high speed
Jun 30 17:24:34 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3 status 001005 POWER sig=se0 PE CONNECT
Jun 30 17:24:34 gandalf kernel: usb 6-3: new high speed USB device using ehci_hcd and address 3
Jun 30 17:24:34 gandalf kernel: ehci_hcd 0000:00:13.5: port 3 high speed
Jun 30 17:24:34 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3 status 001005 POWER sig=se0 PE CONNECT
Jun 30 17:24:34 gandalf kernel: ehci_hcd 0000:00:13.5: qh ffff81003d90d280 (#00) state 4
Jun 30 17:24:34 gandalf kernel: usb 6-3: default language 0x0409
Jun 30 17:24:34 gandalf kernel: usb 6-3: new device strings: Mfr=1, Product=2, SerialNumber=3
Jun 30 17:24:34 gandalf kernel: usb 6-3: Product: Mass Storage Device
Jun 30 17:24:34 gandalf kernel: usb 6-3: Manufacturer: Prolific Technology Inc.
Jun 30 17:24:34 gandalf kernel: usb 6-3: SerialNumber: 0
Jun 30 17:24:34 gandalf kernel: usb 6-3: uevent
Jun 30 17:24:34 gandalf kernel: usb 6-3: usb_probe_device
Jun 30 17:24:34 gandalf kernel: usb 6-3: configuration #1 chosen from 1 choice
Jun 30 17:24:34 gandalf kernel: usb 6-3: adding 6-3:1.0 (config #1, interface 0)
Jun 30 17:24:34 gandalf kernel: usb 6-3:1.0: uevent
Jun 30 17:24:34 gandalf kernel: usb-storage 6-3:1.0: usb_probe_interface
Jun 30 17:24:34 gandalf kernel: usb-storage 6-3:1.0: usb_probe_interface - got id
Jun 30 17:24:34 gandalf kernel: usb-storage: USB Mass Storage device detected
Jun 30 17:24:34 gandalf kernel: usb-storage: -- associate_dev
Jun 30 17:24:34 gandalf kernel: usb-storage: Vendor: 0x067b, Product: 0x2507, Revision: 0x0100
Jun 30 17:24:34 gandalf kernel: usb-storage: Interface Subclass: 0x06, Protocol: 0x50
Jun 30 17:24:34 gandalf kernel: usb-storage: Transport: Bulk
Jun 30 17:24:34 gandalf kernel: usb-storage: Protocol: Transparent SCSI
Jun 30 17:24:34 gandalf kernel: scsi6 : SCSI emulation for USB Mass Storage devices
Jun 30 17:24:34 gandalf kernel: usb-storage: *** thread sleeping.
Jun 30 17:24:34 gandalf kernel: drivers/usb/core/inode.c: creating file '003'
Jun 30 17:24:34 gandalf kernel: usb-storage: device found at 3
Jun 30 17:24:34 gandalf kernel: usb-storage: waiting for device to settle before scanning
Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_control_msg: rq=fe rqtype=a1 value=0000 index=00 len=1
Jun 30 17:24:39 gandalf kernel: usb-storage: GetMaxLUN command result is 1, data is 0
Jun 30 17:24:39 gandalf kernel: usb-storage: device scan complete
Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
Jun 30 17:24:39 gandalf kernel: usb-storage: Command INQUIRY (6 bytes)
Jun 30 17:24:39 gandalf kernel: usb-storage:  12 00 00 00 24 00
Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T 0x1 L 36 F 128 Trg 0 LUN 0 CL 6
Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer 31 bytes
Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 31/31
Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete
Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk command transfer result=0
Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_sglist: xfer 36 bytes, 1 entries
Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 36/36
Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete
Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk data transfer result 0x0
Jun 30 17:24:39 gandalf kernel: usb-storage: Attempting to get CSW...
Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer 13 bytes
Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 13/13
Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete
Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk status result = 0
Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Status S 0x53425355 T 0x1 R 0 Stat 0x0
Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x0
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
Jun 30 17:24:39 gandalf kernel: scsi 6:0:0:0: Direct-Access     ST316002 3A               3.06 PQ: 0 ANSI: 0
Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
Jun 30 17:24:39 gandalf kernel: usb-storage: Bad LUN (0:1)
Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (1:0)
Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (2:0)
Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (3:0)
Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (4:0)
Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (5:0)
Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (6:0)
Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (7:0)
Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
Jun 30 17:24:39 gandalf kernel: usb-storage: Command TEST_UNIT_READY (6 bytes)
Jun 30 17:24:39 gandalf kernel: usb-storage:  00 00 00 00 00 00
Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T 0x2 L 0 F 0 Trg 0 LUN 0 CL 6
Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer 31 bytes
Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 31/31
Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete
Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk command transfer result=0
Jun 30 17:24:39 gandalf kernel: usb-storage: Attempting to get CSW...
Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer 13 bytes
Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 13/13
Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete
Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk status result = 0
Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Status S 0x53425355 T 0x2 R 0 Stat 0x0
Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x0
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
Jun 30 17:24:39 gandalf kernel: usb-storage: Command READ_CAPACITY (10 bytes)
Jun 30 17:24:39 gandalf kernel: usb-storage:  25 00 00 00 00 00 00 00 00 00
Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T 0x3 L 8 F 128 Trg 0 LUN 0 CL 10
Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer 31 bytes
Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 31/31
Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete
Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk data transfer result 0x0
Jun 30 17:24:39 gandalf kernel: usb-storage: Attempting to get CSW...
Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer 13 bytes
Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 13/13
Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete
Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk status result = 0
Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Status S 0x53425355 T 0x3 R 0 Stat 0x0
Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x0
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
Jun 30 17:24:39 gandalf kernel: SCSI device sde: 312581807 512-byte hdwr sectors (160042 MB)
Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
Jun 30 17:24:39 gandalf kernel: usb-storage: Command MODE_SENSE (6 bytes)
Jun 30 17:24:39 gandalf kernel: usb-storage:  1a 00 3f 00 c0 00
Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T 0x4 L 192 F 128 Trg 0 LUN 0 CL 6
Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer 31 bytes
Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 31/31
Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete
Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk command transfer result=0
Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_sglist: xfer 192 bytes, 1 entries
Jun 30 17:24:39 gandalf kernel: usb-storage: Status code -121; transferred 4/192
Jun 30 17:24:39 gandalf kernel: usb-storage: -- short read transfer
Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk data transfer result 0x1
Jun 30 17:24:39 gandalf kernel: usb-storage: Attempting to get CSW...
Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer 13 bytes
Jun 30 17:25:09 gandalf kernel: usb-storage: command_abort called
Jun 30 17:25:09 gandalf kernel: usb-storage: usb_stor_stop_transport called
Jun 30 17:25:09 gandalf kernel: usb-storage: -- cancelling URB

Then the second USB disk does not show up in /dev/disk.

Feel free to contact me if you need more informations.

Thanks
Comment 1 Dominique Dumont 2007-06-30 08:55:03 UTC
Sorry, I've made a small mistake in the "Problem description section"

I confirm that all USB I/O are blocked and I *cannot* mount another USB drive.
Comment 2 Andrew Morton 2007-06-30 09:42:54 UTC
Subject: Re: [Bugme-new]  New: USB storage freeze

On Sat, 30 Jun 2007 08:46:32 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=8692
> 
>            Summary: USB storage freeze
>            Product: IO/Storage
>            Version: 2.5
>      KernelVersion: 2.6.21
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>         AssignedTo: io_other@kernel-bugs.osdl.org
>         ReportedBy: domi.dumont@free.fr
> 
> 
> Most recent kernel where this bug did not occur: unknown
> Distribution: Debian/sid
> Hardware Environment: AMD 64 X2, external 200GB USB disk (Prolific Technology
> Inc)
> Software Environment: 2.6.21 #3 SMP Sat Jun 30 16:30:03 CEST 2007 x86_64
> GNU/Linux
> Problem Description:
> 
> When doing file transfer, USB storage freezes randomly. It's possible that my
> USB drive is low quality hardware, but this shouldn't freeze the USB bus.
> 
> At this point, all I/O to USB are blocked:
> - I can mount another USB disk
> - sync blocks
> - reboot blocks
> - only reset button works to unlock the disks.
> 
> I've seen this problem also on Debian's kernel 2.6.18
> 
> I've reproduced it on a vanilla kernel (built with make-kpkg) with usb debug
> enabled. (traces are below)
> 
> 
> Steps to reproduce:
> 
> A simple cp of a multi GB file is often enough. (note that my disk if 90%
> full)
> 
> You can also run find <usb-drive> -type f -exec md5sum {} \;
> 
> Traces:
> Jun 30 17:21:30 gandalf kernel: usb-storage: queuecommand called
> Jun 30 17:21:30 gandalf kernel: usb-storage: *** thread awakened.
> Jun 30 17:21:30 gandalf kernel: usb-storage: Command WRITE_10 (10 bytes)
> Jun 30 17:21:30 gandalf kernel: usb-storage:  2a 00 0c d9 01 87 00 00 f0 00
> Jun 30 17:21:30 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T
> 0x4297
> L 122880 F 0 Trg 0 LUN 0 CL 10
> Jun 30 17:21:30 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer
> 31 bytes
> Jun 30 17:21:30 gandalf kernel: usb-storage: Status code 0; transferred 31/31
> Jun 30 17:21:30 gandalf kernel: usb-storage: -- transfer complete
> Jun 30 17:21:30 gandalf kernel: usb-storage: Bulk command transfer result=0
> Jun 30 17:21:30 gandalf kernel: usb-storage: usb_stor_bulk_transfer_sglist:
> xfer 122880 bytes, 17 entries
> Jun 30 17:21:30 gandalf kernel: usb-storage: Status code 0; transferred
> 122880/122880
> Jun 30 17:21:30 gandalf kernel: usb-storage: -- transfer complete
> Jun 30 17:21:30 gandalf kernel: usb-storage: Bulk data transfer result 0x0
> Jun 30 17:21:30 gandalf kernel: usb-storage: Attempting to get CSW...
> Jun 30 17:21:30 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer
> 13 bytes
> Jun 30 17:21:30 gandalf kernel: usb-storage: Status code 0; transferred 13/13
> Jun 30 17:21:30 gandalf kernel: usb-storage: -- transfer complete
> Jun 30 17:21:30 gandalf kernel: usb-storage: Bulk status result = 0
> Jun 30 17:21:30 gandalf kernel: usb-storage: Bulk Status S 0x53425355 T
> 0x4297
> R 0 Stat 0x0
> Jun 30 17:21:30 gandalf kernel: usb-storage: scsi cmd done, result=0x0
> Jun 30 17:21:30 gandalf kernel: usb-storage: *** thread sleeping.
> Jun 30 17:21:30 gandalf kernel: usb-storage: queuecommand called
> Jun 30 17:21:30 gandalf kernel: usb-storage: *** thread awakened.
> Jun 30 17:21:30 gandalf kernel: usb-storage: Command WRITE_10 (10 bytes)
> Jun 30 17:21:30 gandalf kernel: usb-storage:  2a 00 0c d9 02 77 00 00 f0 00
> Jun 30 17:21:30 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T
> 0x4298
> L 122880 F 0 Trg 0 LUN 0 CL 10
> Jun 30 17:21:30 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer
> 31 bytes
> Jun 30 17:22:00 gandalf kernel: usb-storage: command_abort called
> Jun 30 17:22:00 gandalf kernel: usb-storage: usb_stor_stop_transport called
> Jun 30 17:22:00 gandalf kernel: usb-storage: -- cancelling URB
> 
> At this point the external disk is frozen.
> 
> Then I've powered on another USB disk:
> 
> Jun 30 17:24:34 gandalf kernel: hub 6-0:1.0: state 7 ports 10 chg 0000 evt
> 0008
> Jun 30 17:24:34 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3
> status
> 001803 POWER sig=j CSC CONNECT
> Jun 30 17:24:34 gandalf kernel: hub 6-0:1.0: port 3, status 0501, change
> 0001,
> 480 Mb/s
> Jun 30 17:24:34 gandalf kernel: hub 6-0:1.0: debounce: port 3: total 100ms
> stable 100ms status 0x501
> Jun 30 17:24:34 gandalf kernel: ehci_hcd 0000:00:13.5: port 3 high speed
> Jun 30 17:24:34 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3
> status
> 001005 POWER sig=se0 PE CONNECT
> Jun 30 17:24:34 gandalf kernel: usb 6-3: new high speed USB device using
> ehci_hcd and address 3
> Jun 30 17:24:34 gandalf kernel: ehci_hcd 0000:00:13.5: port 3 high speed
> Jun 30 17:24:34 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3
> status
> 001005 POWER sig=se0 PE CONNECT
> Jun 30 17:24:34 gandalf kernel: ehci_hcd 0000:00:13.5: qh ffff81003d90d280
> (#00) state 4
> Jun 30 17:24:34 gandalf kernel: usb 6-3: default language 0x0409
> Jun 30 17:24:34 gandalf kernel: usb 6-3: new device strings: Mfr=1,
> Product=2,
> SerialNumber=3
> Jun 30 17:24:34 gandalf kernel: usb 6-3: Product: Mass Storage Device
> Jun 30 17:24:34 gandalf kernel: usb 6-3: Manufacturer: Prolific Technology
> Inc.
> Jun 30 17:24:34 gandalf kernel: usb 6-3: SerialNumber: 0
> Jun 30 17:24:34 gandalf kernel: usb 6-3: uevent
> Jun 30 17:24:34 gandalf kernel: usb 6-3: usb_probe_device
> Jun 30 17:24:34 gandalf kernel: usb 6-3: configuration #1 chosen from 1
> choice
> Jun 30 17:24:34 gandalf kernel: usb 6-3: adding 6-3:1.0 (config #1, interface
> 0)
> Jun 30 17:24:34 gandalf kernel: usb 6-3:1.0: uevent
> Jun 30 17:24:34 gandalf kernel: usb-storage 6-3:1.0: usb_probe_interface
> Jun 30 17:24:34 gandalf kernel: usb-storage 6-3:1.0: usb_probe_interface -
> got
> id
> Jun 30 17:24:34 gandalf kernel: usb-storage: USB Mass Storage device detected
> Jun 30 17:24:34 gandalf kernel: usb-storage: -- associate_dev
> Jun 30 17:24:34 gandalf kernel: usb-storage: Vendor: 0x067b, Product: 0x2507,
> Revision: 0x0100
> Jun 30 17:24:34 gandalf kernel: usb-storage: Interface Subclass: 0x06,
> Protocol: 0x50
> Jun 30 17:24:34 gandalf kernel: usb-storage: Transport: Bulk
> Jun 30 17:24:34 gandalf kernel: usb-storage: Protocol: Transparent SCSI
> Jun 30 17:24:34 gandalf kernel: scsi6 : SCSI emulation for USB Mass Storage
> devices
> Jun 30 17:24:34 gandalf kernel: usb-storage: *** thread sleeping.
> Jun 30 17:24:34 gandalf kernel: drivers/usb/core/inode.c: creating file '003'
> Jun 30 17:24:34 gandalf kernel: usb-storage: device found at 3
> Jun 30 17:24:34 gandalf kernel: usb-storage: waiting for device to settle
> before scanning
> Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_control_msg: rq=fe
> rqtype=a1 value=0000 index=00 len=1
> Jun 30 17:24:39 gandalf kernel: usb-storage: GetMaxLUN command result is 1,
> data is 0
> Jun 30 17:24:39 gandalf kernel: usb-storage: device scan complete
> Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
> Jun 30 17:24:39 gandalf kernel: usb-storage: Command INQUIRY (6 bytes)
> Jun 30 17:24:39 gandalf kernel: usb-storage:  12 00 00 00 24 00
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T 0x1
> L
> 36 F 128 Trg 0 LUN 0 CL 6
> Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer
> 31 bytes
> Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 31/31
> Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk command transfer result=0
> Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_sglist:
> xfer 36 bytes, 1 entries
> Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 36/36
> Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk data transfer result 0x0
> Jun 30 17:24:39 gandalf kernel: usb-storage: Attempting to get CSW...
> Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer
> 13 bytes
> Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 13/13
> Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk status result = 0
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Status S 0x53425355 T 0x1 R
> 0
> Stat 0x0
> Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x0
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
> Jun 30 17:24:39 gandalf kernel: scsi 6:0:0:0: Direct-Access     ST316002 3A   
>           3.06 PQ: 0 ANSI: 0
> Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bad LUN (0:1)
> Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
> Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (1:0)
> Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
> Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (2:0)
> Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
> Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (3:0)
> Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
> Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (4:0)
> Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
> Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (5:0)
> Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
> Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (6:0)
> Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
> Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bad target number (7:0)
> Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x40000
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
> Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
> Jun 30 17:24:39 gandalf kernel: usb-storage: Command TEST_UNIT_READY (6
> bytes)
> Jun 30 17:24:39 gandalf kernel: usb-storage:  00 00 00 00 00 00
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T 0x2
> L
> 0 F 0 Trg 0 LUN 0 CL 6
> Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer
> 31 bytes
> Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 31/31
> Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk command transfer result=0
> Jun 30 17:24:39 gandalf kernel: usb-storage: Attempting to get CSW...
> Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer
> 13 bytes
> Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 13/13
> Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk status result = 0
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Status S 0x53425355 T 0x2 R
> 0
> Stat 0x0
> Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x0
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
> Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
> Jun 30 17:24:39 gandalf kernel: usb-storage: Command READ_CAPACITY (10 bytes)
> Jun 30 17:24:39 gandalf kernel: usb-storage:  25 00 00 00 00 00 00 00 00 00
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T 0x3
> L
> 8 F 128 Trg 0 LUN 0 CL 10
> Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer
> 31 bytes
> Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 31/31
> Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk data transfer result 0x0
> Jun 30 17:24:39 gandalf kernel: usb-storage: Attempting to get CSW...
> Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer
> 13 bytes
> Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 13/13
> Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk status result = 0
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Status S 0x53425355 T 0x3 R
> 0
> Stat 0x0
> Jun 30 17:24:39 gandalf kernel: usb-storage: scsi cmd done, result=0x0
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread sleeping.
> Jun 30 17:24:39 gandalf kernel: SCSI device sde: 312581807 512-byte hdwr
> sectors (160042 MB)
> Jun 30 17:24:39 gandalf kernel: usb-storage: queuecommand called
> Jun 30 17:24:39 gandalf kernel: usb-storage: *** thread awakened.
> Jun 30 17:24:39 gandalf kernel: usb-storage: Command MODE_SENSE (6 bytes)
> Jun 30 17:24:39 gandalf kernel: usb-storage:  1a 00 3f 00 c0 00
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T 0x4
> L
> 192 F 128 Trg 0 LUN 0 CL 6
> Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer
> 31 bytes
> Jun 30 17:24:39 gandalf kernel: usb-storage: Status code 0; transferred 31/31
> Jun 30 17:24:39 gandalf kernel: usb-storage: -- transfer complete
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk command transfer result=0
> Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_sglist:
> xfer 192 bytes, 1 entries
> Jun 30 17:24:39 gandalf kernel: usb-storage: Status code -121; transferred
> 4/192
> Jun 30 17:24:39 gandalf kernel: usb-storage: -- short read transfer
> Jun 30 17:24:39 gandalf kernel: usb-storage: Bulk data transfer result 0x1
> Jun 30 17:24:39 gandalf kernel: usb-storage: Attempting to get CSW...
> Jun 30 17:24:39 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer
> 13 bytes
> Jun 30 17:25:09 gandalf kernel: usb-storage: command_abort called
> Jun 30 17:25:09 gandalf kernel: usb-storage: usb_stor_stop_transport called
> Jun 30 17:25:09 gandalf kernel: usb-storage: -- cancelling URB
> 
> Then the second USB disk does not show up in /dev/disk.
> 
> Feel free to contact me if you need more informations.
> 
> Thanks
> 
> 
> -- 
> Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug, or are watching someone who is.
Comment 3 Dominique Dumont 2007-07-15 05:36:53 UTC
Hello

I've reproduced the problem with 2.6.22 from kernel.org.

Here a trace from kern.log:

Jul 15 12:15:19 gandalf kernel: usb-storage: *** thread awakened.
Jul 15 12:15:19 gandalf kernel: usb-storage: Command WRITE_10 (10 bytes)
Jul 15 12:15:19 gandalf kernel: usb-storage:  2a 00 0e d6 7f 87 00 00 f0 00
Jul 15 12:15:19 gandalf kernel: usb-storage: Bulk Command S 0x43425355 T 0x75de L 122880 F 0 Trg 0 LUN 0 CL 10
Jul 15 12:15:19 gandalf kernel: usb-storage: usb_stor_bulk_transfer_buf: xfer 31 bytes
Jul 15 12:15:19 gandalf kernel: usb-storage: Status code 0; transferred 31/31
Jul 15 12:15:19 gandalf kernel: usb-storage: -- transfer complete
Jul 15 12:15:19 gandalf kernel: usb-storage: Bulk command transfer result=0
Jul 15 12:15:19 gandalf kernel: usb-storage: usb_stor_bulk_transfer_sglist: xfer 122880 bytes, 13 entries
Jul 15 12:15:21 gandalf kernel: usb 6-6: events/1 timed out on ep1out len=0/2
Jul 15 12:15:49 gandalf kernel: usb-storage: command_abort called
Jul 15 12:15:49 gandalf kernel: usb-storage: usb_stor_stop_transport called
Jul 15 12:15:49 gandalf kernel: usb-storage: -- cancelling sg request

At this point:
- USB transfer is blocked
- sync is blocked 
- my keyboard is also blocked. (keyboard pb also happens when usb-disk blocks while dvb_usb_nova_t_usb2 is loaded)

I had one instance where /dev/sda was also blocked but my keyboard was not blocked. This is puzzling.
Comment 4 Dominique Dumont 2007-07-15 06:47:57 UTC
I've reproduced the problem with a Maxtor external USB drive (should be a better quality hardware than the other USB drives I've tested)

This time on Debian's 2.6.21, with dvb-udb module. My keyboard is again blocked.

By logging through network, I've tried to check the status of USB modules.

I've found that a simple "cat /proc/bus/usb/devices"  is also blocked:

Here's the output of /proc/<pid>/status of the cat command:

Name:   cat
State:  D (disk sleep)
SleepAVG:       88%

The cat command blocks on "disk sleep" even though I try to read a file from a virtual file system.
Comment 5 Dominique Dumont 2007-07-15 07:29:18 UTC
Yet another (non-voluntary) test: USB has blocked while accessing my MP3 player. This device (Samsung YP-Z5F) is not really fast, so the USB lock may be more due to random factors than high speed transfers.

This time, my keyboard was again blocked, only the mouse worked (as usual).

HTH
Comment 6 Alan Stern 2007-07-16 09:36:13 UTC
This looks like a knnown problem with some VIA EHCI USB controllers.  It resembles one of the problems reported in Bug #6374.  Does your system use VIA parts?
Comment 7 Dominique Dumont 2007-07-16 11:24:38 UTC
No. My mobo is a MSI-K9A with an ATI RD580 + SB600 (Xpress3200) chipset.

Here's the relevant part of lshw:

        *-usb:0
             description: USB Controller
             product: SB600 USB (OHCI0)
             vendor: ATI Technologies Inc
             physical id: 13
             bus info: pci@0000:00:13.0
             version: 00
             width: 32 bits
             clock: 66MHz
             capabilities: ohci bus_master
             configuration: driver=ohci_hcd latency=64 module=ohci_hcd
Comment 8 Alan Stern 2007-07-16 13:04:30 UTC
That isn't the relevant part.  You show the entry for an OHCI controller, but it's the EHCI controller causing the problems.  However it's probably safe to assume the EHCI controller is also by ATI.

Here's something you can try.  Build a kernel with CONFIG_USB_DEBUG enabled.  Then when you get one of these errors, make a copy of the files in /sys/class/usb_host/usb_host6 and attach it to this report.  You may need to use a network login if your keyboard isn't working.

Oh yes, and also attach a copy of /proc/bus/usb/devices.
Comment 9 Dominique Dumont 2007-07-21 00:28:45 UTC
Created attachment 12079 [details]
async file from usb_hots6
Comment 10 Dominique Dumont 2007-07-21 00:30:46 UTC
Created attachment 12080 [details]
companion file from usb_host6
Comment 11 Dominique Dumont 2007-07-21 00:31:29 UTC
Created attachment 12081 [details]
periodic file from usb_host6
Comment 12 Dominique Dumont 2007-07-21 00:47:54 UTC
Created attachment 12082 [details]
registers file from usb_host6
Comment 13 Dominique Dumont 2007-07-21 00:51:12 UTC
Created attachment 12083 [details]
/proc/bus/usb/devices file
Comment 14 Dominique Dumont 2007-07-21 01:05:30 UTC
Hello

The EHCI usb controller is indeed ATI:
             product: SB600 USB Controller (EHCI)
             vendor: ATI Technologies Inc

I've attached the required information to this bug. All files from usb_host6 were read after the usb was hung on vanilla kernel 2.6.22 (compiled with Debian's kernel configuration and USB_DEBUG)

HTH
Comment 15 David Brownell 2007-07-21 12:52:06 UTC
If you disable one of the CPUs, does it behave better?

This looks like another variant of the unlink bug that shows up periodically.
What needs to happen is that a cleaner version of 26f953fd884ea4879585287917f855c63c6b2666
get written (that one had to be reverted).  Probably the timer cleanup
and the unlink/IAA cleanup should be split.
Comment 16 Dominique Dumont 2007-07-22 01:34:02 UTC
I've booted 2.6.22 with maxcpus=1 to no avail. usb still hangs.
Comment 17 Dominique Dumont 2007-07-22 05:15:47 UTC
I've found that the problem is more difficult to reproduce when dvb_usb_nova_t_usb2 module is not loaded. (e.g. after 30 mn of md5sum instead of 2 or 3mns).

In this case:
- keyboard is not locked
- I can interrupt the md5sum process with CTRL-C
- If I re-run the mds5um process, it hangs immediately and I can no longer interrupt it
Comment 18 Alan Stern 2007-08-21 12:09:04 UTC
Created attachment 12475 [details]
Add separate IAA watchdog timer for EHCI

It's been a while, but I finally found some time to work on this a little.  Please try the attached patch for 2.6.22, and attach the resulting kernel log.  Even if it doesn't fix the problem, it should provide some clues.
Comment 19 Anonymous Emailer 2007-08-21 12:51:48 UTC
Reply-To: david-b@pacbell.net

On Tuesday 21 August 2007, you wrote:
> Created an attachment (id=12475)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=12475&action=view)
> Add separate IAA watchdog timer for EHCI
> 
> It's been a while, but I finally found some time to work on this a little. 
> Please try the attached patch for 2.6.22, and attach the resulting kernel
> log. 
> Even if it doesn't fix the problem, it should provide some clues.

Thanks for that, Alan ... on a first scan, it looks good.
And if one hypothesis is true (that interactions with other
timeouts breaks this one) it should be a good step forward...

One question though:

> @@ -643,7 +662,7 @@ static irqreturn_t ehci_irq (struct usb_
>         /* complete the unlinking of some qh [4.15.2.3] */
>         if (status & STS_IAA) {
>                 COUNT (ehci->stats.reclaim);
> -               ehci->reclaim_ready = 1;
> +               end_unlink_async(ehci);
>                 bh = 1;
>         }

Given that you're not deferring that work, why is "bh" still set?
Should be harmless but pointless.
Comment 20 Alan Stern 2007-08-21 14:49:19 UTC
Because it was that way in your original patch and I didn't notice it.  :-)
Comment 21 Dominique Dumont 2007-08-22 12:22:46 UTC
Hello

I've tried Alan's path on 2.6.22. Hangs still occur while dvb module is loaded.
(I've not tried again without DVB module)

I'll attach the corresponding kernel log, but I've not seen non usb-storage logs. Please tell me if I missed something.

HTH
Comment 22 Dominique Dumont 2007-08-22 12:24:55 UTC
Created attachment 12492 [details]
2.6.22 kernel log with separate IAA watchdog timer patch
Comment 23 Alan Stern 2007-08-22 12:56:02 UTC
Yes, those usb-storage log entries aren't very useful.  It would be better if you turn off CONFIG_USB_STORAGE_DEBUG.  Then when a transfer hangs get copies of the registers and async files from usb_host6 as before, in addition to the dmesg log.
Comment 24 Dominique Dumont 2007-08-23 12:24:17 UTC
Created attachment 12509 [details]
usb_host6 files after usb hang (with IAA watchdog patch)

As requested, here's the required traces (all files from usb_host6 and dmesg) extracted after usb disk hang
Comment 25 Alan Stern 2007-08-23 13:36:47 UTC
Created attachment 12514 [details]
additional debugging for IAA watchdog

Crazy.  It looks like the new code was never triggered.

This patch applies on top of the previous one.  It adds a bunch more debugging messages, which hopefully will tell us what's going on.
Comment 26 Alan Stern 2007-08-23 13:39:51 UTC
One more thing.  After the hang occurs and you collect the dmesg and usb_host6 data, please get a stack trace (Alt-SysRq-T) and attach that as well.
Comment 27 Dominique Dumont 2007-08-25 03:08:45 UTC
Created attachment 12528 [details]
usb6 + sysrq-t traces

Here's the requested information including the sysrq-t dump (retrieved through dmesg, I've cleaned up demsg before running the test). The hung process is md5sum at the end of the file.

I still cannot find any log related to your iaa watchdog timer.

HTH
Comment 28 Alan Stern 2007-08-25 08:10:27 UTC
No log output at all from that patch?  Are you sure you installed the patched version of ehci-hcd.ko?  Just plugging in a new device should be enough to trigger some of those writes.

The Alt-SysRq-T dump was incomplete; stuff near the top was lost from the ring buffer.  You'll have to increase the size of CONFIG_LOG_BUF_SHIFT.  Or even better, boot into single-user mode first so that most of those processes won't be running.
Comment 29 Dominique Dumont 2007-08-30 10:07:04 UTC
Sorry for the long delay. Something is weird during module loading.

I've modified a trace in ehci-hcd.c to make sure I load the patched ehci-hcd module.

After boot I get the unmodified trace and no IAA watchdog trace:
 ehci_hcd: block sizes: qh 160 qtd 96 itd 192 sitd 96

Then I do "rmmod ehci-hdc" and "modprobe ehci-hcd", I get the modified trace:
ehci_hcd (+patch): block sizes: qh 160 qtd 96 itd 192 sitd 96

and truckloads of IAA traces:
Aug 30 18:59:56 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog done, qh ffff8100249251e0
Aug 30 18:59:56 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog start, qh ffff8100249251e0
Aug 30 18:59:56 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff8100249251e0
Aug 30 18:59:56 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog done, qh ffff8100249251e0
Aug 30 18:59:56 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog start, qh ffff8100249251e0
Aug 30 18:59:56 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff8100249251e0
Aug 30 18:59:56 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog done, qh ffff8100249251e0
Aug 30 18:59:56 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog start, qh ffff810024925140


Let me dig what's wrong during module loading and I'll get back to you with more traces.

Cheers
Comment 30 Alan Stern 2007-08-30 10:55:19 UTC
Probably at boot time ehci-hcd is loaded from an initramfs image, not from /lib/modules/...
Comment 31 Dominique Dumont 2007-09-01 05:38:52 UTC
Created attachment 12657 [details]
trace for dvb communication failure

Argh ! You're right. The initrd image was not updated. I'm red with shame...

Now, I'm seeing some progress: the bug is *much* harder to trigger even when 
the dvb module is up and running with vdr.

But, I had to comment out the IAA watchdog start and stop traces in ehci-q.c. 
With them, it's impossible to work on the console: I can't see what I type nor 
console messages. (today's kern.log is already 140M big ...)

Here's what I've observed:
- With dvb and *one* external usb disk: the bug did not show up even after 
  more than 30mn of testing
- When I launched a md5sum on my other usb disk, the first usb disk froze quite 
  fast. I don't have a kernel trace for this, but here's the relevant
  extract from kern.log :
Sep  1 12:43:15 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003dffc460
Sep  1 12:43:59 gandalf last message repeated 14 times
Sep  1 12:44:46 gandalf last message repeated 41 times
Sep  1 12:45:31 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff81003dffc460
Sep  1 12:45:31 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003dffc460
Sep  1 12:45:33 gandalf last message repeated 19 times
Sep  1 12:47:03 gandalf last message repeated 7 times
Sep  1 12:47:50 gandalf last message repeated 14 times
Sep  1 12:48:36 gandalf last message repeated 21 times
Sep  1 12:50:08 gandalf last message repeated 42 times
Sep  1 12:50:54 gandalf last message repeated 7 times
Sep  1 12:50:54 gandalf kernel: fuse exit
Sep  1 12:50:56 gandalf kernel: Kernel logging (proc) stopped.
Sep  1 12:50:56 gandalf kernel: Kernel log daemon terminating.

At this point, both disk were hung, although I could interrupt them with ^C. 
My keyboard was working but what I typed appeared on screen after 2s 
(without your patch, the keyboard does not work except for sysrq).

I've run another test in single user mode. But this time, dvb communication 
failed before the usb disk. I've attached the registers, and dmesg output (with
the sysrq-t trace)

I'll try a test in single user with only the usb disks and without the 
dvb module.

HTH
Comment 32 Dominique Dumont 2007-09-01 08:57:28 UTC
Created attachment 12661 [details]
trace for usb disk failure

In single user mode with 2 usb disks running md5sum, I cannot reproduce the bug. I had not a single trace in kern.log for 2 hours. More that 300GB of data were read.
So your patch is a real improvement.

The bug shows up quite fast when dvb_usb_nova_t_usb2 is loaded and vdr is running. 
Here's a trace of a usb disk failure (disk hangs and keyboard failure) done in single user mode with 2 usb disks and vdr running. I fear that some processes are missing from the trace though.

If this not enough, I'll run some other tests.

HTH
Comment 33 Alan Stern 2007-09-01 09:20:49 UTC
Okay, don't worry about those watchdog start and done messages.  Also I think the stack traces aren't helping much, so you might as well forget them too.

I don't have access to my development system at the moment, so you'll have to make the following changes yourself.  In ehci-hcd.c, in the unlink_async() routine where the previous patch added an "unlink reclaim" line, have that line also print out the value of ehci->reclaim.  Then immediately afterward, add this:

		if (ehci->reclaim == qh)  dump_stack();

The reason is because I noticed in both your logs, there's a whole bunch of "unlink start" lines all with the same qh, followed by a single "unlink reclaim" line for that same qh.  This should never happen -- so I want to make sure it really is occurring.
Comment 34 Dominique Dumont 2007-09-02 09:53:51 UTC
No problem to implement your changes. Here are the results.

The reclaim pointers are different:

Sep  2 18:28:01 gandalf last message repeated 2 times
Sep  2 18:28:01 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c44280
Sep  2 18:28:01 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460
Sep  2 18:28:01 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460
Sep  2 18:28:47 gandalf last message repeated 5 times
Sep  2 18:28:47 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c441e0
Sep  2 18:28:47 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460
Sep  2 18:28:47 gandalf last message repeated 7 times
Sep  2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c44640
Sep  2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460
Sep  2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c445a0
Sep  2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460
Sep  2 18:29:32 gandalf last message repeated 3 times
Sep  2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c441e0
Sep  2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c443c0
Sep  2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460
Sep  2 18:29:53 gandalf last message repeated 14 times
Sep  2 18:29:53 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c441e0
Sep  2 18:29:53 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c44280

I have not seen a case where the pointers were identical.

BTW, I've also tested by actually watching a dvb-t channel while performing
heavy traffic on both usb disk. Watching a channel on a dvb-usb device will
require a steady data transfer from the dvb-usb device.

In one instance, both usb disks froze while starting to watch the dvb-t and my keyboard was gone. First, DVB failed:

Sep  2 18:17:18 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069
Sep  2 18:17:18 gandalf kernel: dvb-usb: bulk message failed: -110 (6/0)
Sep  2 18:17:20 gandalf kernel: usb 6-6: kdvb-fe-1 timed out on ep1out len=0/6
Sep  2 18:17:20 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd140
Sep  2 18:17:20 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069
Sep  2 18:17:20 gandalf kernel: dvb-usb: bulk message failed: -110 (6/0)
Sep  2 18:17:22 gandalf kernel: usb 6-3: usb-storage timed out on ep0out len=0/0
Sep  2 18:17:22 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd280
Sep  2 18:17:22 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069
Sep  2 18:17:22 gandalf kernel: usb 6-3: device not accepting address 4, error -110
Sep  2 18:17:22 gandalf kernel: hub 6-0:1.0: logical disconnect on port 3
Sep  2 18:17:22 gandalf kernel: hub 6-0:1.0: state 7 ports 10 chg 0008 evt 0000
Sep  2 18:17:22 gandalf kernel: hub 6-0:1.0: port 3, status 0501, change 0000, 480 Mb/s
Sep  2 18:17:22 gandalf kernel: usb 6-3: USB disconnect, address 4
Sep  2 18:17:22 gandalf kernel: usb 6-3: unregistering device
Sep  2 18:17:22 gandalf kernel: usb 6-3: usb_disable_device nuking all URBs
Sep  2 18:17:22 gandalf kernel: usb 6-3: unregistering interface 6-3:1.0
Sep  2 18:17:22 gandalf kernel: usb_endpoint usbdev6.4_ep01: ep_device_release called for usbdev6.4_ep01
Sep  2 18:17:22 gandalf kernel: usb_endpoint usbdev6.4_ep82: ep_device_release called for usbdev6.4_ep82
Sep  2 18:17:22 gandalf kernel: sd 5:0:0:0: scsi: Device offlined - not ready after error recovery
Sep  2 18:17:22 gandalf kernel: sd 5:0:0:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK

Then both disk failed (I've removed most traces related to scsi or ext3).

Sep  2 18:17:22 gandalf kernel: EXT3-fs error (device sdd1): ext3_get_inode_loc: unable to read inode block - inode=4521986, block=9043970
Sep  2 18:17:22 gandalf kernel: usb 6-3:1.0: uevent
Sep  2 18:17:22 gandalf kernel: usb 6-3:1.0: uevent
Sep  2 18:17:22 gandalf kernel: usb_endpoint usbdev6.4_ep00: ep_device_release called for usbdev6.4_ep00
Sep  2 18:17:22 gandalf kernel: usb 6-3: uevent
Sep  2 18:17:22 gandalf kernel: scsi 5:0:0:0: rejecting I/O to dead device
Sep  2 18:17:22 gandalf kernel: ehci_hcd 0000:00:13.5: port 5 high speed
Sep  2 18:17:22 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 5 status 001005 POWER sig=se0 PE CONNECT
Sep  2 18:17:22 gandalf kernel: scsi 5:0:0:0: rejecting I/O to dead device
Sep  2 18:17:22 gandalf kernel: EXT3-fs error (device sdd1): ext3_find_entry: reading directory #18792457 offset 0

Then I had these traces on and on until I had to hit 'reset':

Sep  2 18:17:23 gandalf kernel: scsi 5:0:0:0: rejecting I/O to dead device
Sep  2 18:17:24 gandalf kernel: usb 6-6: kdvb-fe-1 timed out on ep1out len=0/6
Sep  2 18:17:24 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd140
Sep  2 18:17:24 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069
Sep  2 18:17:24 gandalf kernel: dvb-usb: bulk message failed: -110 (6/0)
Sep  2 18:17:26 gandalf kernel: usb 6-6: events/1 timed out on ep1out len=0/2
Sep  2 18:17:26 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd140
Sep  2 18:17:26 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069
Sep  2 18:17:26 gandalf kernel: dvb-usb: bulk message failed: -110 (2/0)
Sep  2 18:17:27 gandalf kernel: usb 6-5: usb-storage timed out on ep0in len=0/64
Sep  2 18:17:27 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd280
Sep  2 18:17:27 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069
Sep  2 18:17:28 gandalf kernel: usb 6-6: kdvb-fe-1 timed out on ep1out len=0/6
Sep  2 18:17:28 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd140
Sep  2 18:17:28 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069
Sep  2 18:17:28 gandalf kernel: dvb-usb: bulk message failed: -110 (6/0)

Sep  2 18:17:36 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd140
Sep  2 18:17:36 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069
Sep  2 18:17:36 gandalf kernel: dvb-usb: bulk message failed: -110 (6/0)
Sep  2 18:17:37 gandalf kernel: usb 6-5: usb-storage timed out on ep0in len=0/64
Sep  2 18:17:37 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd280
Sep  2 18:17:37 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069
Sep  2 18:17:37 gandalf kernel: ehci_hcd 0000:00:13.5: port 5 high speed
Sep  2 18:17:37 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 5 status 001005 POWER sig=se0 PE CONNECT
Sep  2 18:17:37 gandalf kernel: usb 6-5: device descriptor read/64, error -110
Sep  2 18:17:38 gandalf kernel: usb 6-6: events/1 timed out on ep1out len=0/2
Sep  2 18:17:38 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd140
Sep  2 18:17:38 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069
Sep  2 18:17:38 gandalf kernel: dvb-usb: bulk message failed: -110 (2/0)
Sep  2 18:17:40 gandalf kernel: usb 6-6: kdvb-fe-1 timed out on ep1out len=0/6
Sep  2 18:17:40 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff81003d8dd140
Sep  2 18:17:40 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069


In the second instance of starting to watch a channel during heavy usb disks
load, I had a different behavior: both disks were still working but the dvb
channel was barely watchable as if a lot of data were thrown out. Here are the 
traces:

Sep  2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460
Sep  2 18:29:32 gandalf last message repeated 3 times
Sep  2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c441e0
Sep  2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c443c0
Sep  2 18:29:32 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460
Sep  2 18:29:53 gandalf last message repeated 14 times
Sep  2 18:29:53 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c441e0
Sep  2 18:29:53 gandalf kernel: ehci_hcd 0000:00:13.5: unlink reclaim qh ffff810037c44460, ehci->reclaim ffff810037c44280
Sep  2 18:29:53 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460
Sep  2 18:29:53 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460
Sep  2 18:34:02 gandalf kernel: ehci_hcd 0000:00:13.5: unlink start qh ffff810037c44460

In this case, watching dvb did not change much the traces. I just had more of them.

HTH
Comment 35 Alan Stern 2007-09-03 09:11:52 UTC
So when you have all three devices running, errors show up pretty quickly, right?  But with only two of them running, everything works more or less okay.

What do you see in the async and registers files after a failure with the patched driver?  Let's concentrate on the case where you start the disk activity first and then use the dvb device.

Part of my confusion before was because I missed a code path.  If you look about 8 lines before the end of ehci-q.c, you'll see a call to start_unlink_async().  Please add braces around it and include in them this line:

		ehci_info(ehci, "unlink idle qh %p\n", qh);

That should solve the mystery of the strange pointer values.
Comment 36 Dominique Dumont 2007-09-03 13:37:33 UTC
Created attachment 12683 [details]
lot of "unlink idle qh" when dvb is plugged

Here's a summary of the failures:
- heavy traffic on one usb disk: ok
- heavy traffic on 2 usb disk: mostly ok (very difficult to reproduce)
- heavy traffic on 2 usb disks + dvb plugged: usb disk hang in 5 to 10mn
- heavy traffic on 2 usb disks + wathc dvb: dvb unwatchable, difficult to reproduce usb disk hangs

I've added your "unlink idle qh" traces. As soon as I plug the dvb (or load
dvb-usb-nova-t-usb2 module), I get *lots* of unlink idle traces (never stops).

When I mount an usb disk, I get about 30 unlik idle traces and then no more
[ I removed many of them here ]:
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf460
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf5a0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf460
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf5a0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf460
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf5a0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf460
kjournald starting.  Commit interval 5 seconds
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf5a0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf460
EXT3 FS on sdd1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf5a0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf460
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf5a0

Similar story when I plug a second disk:
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf3c0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf500
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf3c0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf500
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf3c0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf500
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf3c0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf500
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf3c0
kjournald starting.  Commit interval 5 seconds
EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf500
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf3c0
EXT3 FS on sdc1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf500
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf3c0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf500

I aslo get lots of trace while doing heavy traffic on one disk:
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf460
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf5a0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf460
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf5a0
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf460
ehci_hcd 0000:00:13.5: unlink idle qh ffff81003daaf5a0


Doing more tests is not easy with so many traces. 

HTH
Comment 37 Alan Stern 2007-09-03 15:02:21 UTC
There was no disk hang in the log file!  In fact it all looked correct.  The poor TV quality may simply be caused by insufficient bandwidth; the one bus may not be able to carry all that disk data plus the digital video.

So I'm most interested in those disk hangs.  Even with poor quality and slow transfers, nothing should actually hang.  Likewise, the contents of the registers and async files aren't interesting until after a hang has occurred.

By the way, I didn't mention it before but these lines from your log:

Sep  2 18:17:24 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069

clearly indicate that your EHCI controller has an IAA bug.  The new patch is supposed to keep things working in spite of that bug, but maybe there's something else going wrong as well.  Whatever it is, apparently it gets triggered by high traffic levels.

An enhanced version of the first patch might be able to help.  I'll work on it and let you know when it's ready.
Comment 38 Alan Stern 2007-09-05 14:27:54 UTC
Created attachment 12719 [details]
Additions to the IAA watchdog

Here's the additional IAA-watchdog patch.  You should put everything back the way it was after the first patch (the one in comment #18) before applying this.
Comment 39 Dominique Dumont 2007-09-08 03:46:44 UTC
Created attachment 12758 [details]
disk hang trace (2.6.22.6 + 2 IAA patches)

Sorry for the previous "normal" traces. I thought that the unlink reclaim traces were not normal.

Anyway, I've re-started from a vanilla 2.6.22.6 kernel with mostly Debian config and applied your 2 IAA patches.

I've included a trace of usb disk hang (with dvb running, but I was not watching a channel). I had a disk hand after about 5mns of traffic. The attachmeent contains usb register dump and dmesg output

HTH
Comment 40 Alan Stern 2007-09-08 06:53:32 UTC
Boy, that controller of yours sure is in a messed-up state!  It's just not running the async schedule at all.

The only thing left to try is a soft reset of the controller.  I'll write another patch next week.  It would be great if we could avoid the problem entirely, but since (1) we have no idea what the cause is, and (2) even if we knew the cause (most likely a bug in the controller hardware) we might not know how to avoid it, I don't see any other choice.
Comment 41 Dominique Dumont 2007-09-08 11:01:59 UTC
My USB controller is AMD/ATI's SB600. Given AMD's new stance ragarding graphic open source development, may be you could get help from AMD.

May be in AMD developer forum:
http://forums.amd.com/devforum/

Another possibility is to enable the "USB legacy support" in the BIOS. I'll try that also.

HTH
Comment 42 Dominique Dumont 2007-09-09 07:37:26 UTC
Created attachment 12768 [details]
disk hang while watching dvb trace (2.6.22.6 + 2 IAA patches)

Hello

I think I've found a work-around. I've run the test (md5sum on 2 disks) while watching a dvb channel. For sure, the dvb stream was heavily corrupted, but the interesting thing is that the md5sum test ran for more than 20 mns without locking. Moreover there was not a single IAA trace in the kernel logs. 

I still had a failure after 20mns, but I'm not sure that it was due to a disk failure: X had crashed, one disk stopped running while the other one was still running. So the symptoms are completely different. I've attached the usb registers content (after failure) in case you find something wrong with them.

I've also found this trace during boot sequence:
  dvb-usb: schedule remote query interval to 100 msecs.

This means that when loaded, the dvb-t module is queried every 100ms, it's possible that the hang is triggered by one of these remote queries.

It may be time to watch more closely the interaction between dvb-t module and usb controller.

HTH
Comment 43 Alan Stern 2007-09-09 12:38:49 UTC
What is the workaround you found?

BTW, the files at the top of the last attachment were taken from the wrong directory.  For some reason, this time your EHCI controller was usb_host1, not usb_host6.  In addition, there was no indication in the log of a disk crash.

In theory, it shouldn't be possible for a module like dvb-t to crash the USB controller no matter what queries it makes.  But of course, if the module actually contains a bug then all bets are off.
Comment 44 Dominique Dumont 2007-09-09 13:44:50 UTC
The work-around I mentioned was just watching a DVB channel while running the test. Which means a sustained data stream (even if garbled) from the DVB device. This is to be compared to the 10 queries per seconds mentioned by the dvb log.

Anyway I'll have to run more tests to check if this really makes a difference. I'll keep you posted.

I'll also do some tests with dvb debug traces. I may find some clues there.

HTH
Comment 45 Dominique Dumont 2007-09-15 09:11:52 UTC
Some weird stuff I've noticed:

- When the usb disk is hung. My keyboard (PS2) is not completely hung: what I type appears on screen after about 70s.

- Once the usb disk is hung, after reboot, the sub disk itself is unresponsive:
Sep 15 17:53:42 gandalf kernel: dvb_usb_nova_t_usb2 6-6:1.0: usb_probe_interface
Sep 15 17:53:42 gandalf kernel: dvb_usb_nova_t_usb2 6-6:1.0: usb_probe_interface - got id
Sep 15 17:53:42 gandalf kernel: usb usb6: usb auto-resume
Sep 15 17:53:42 gandalf kernel: usb usb6: finish resume
Sep 15 17:53:42 gandalf kernel: hub 6-0:1.0: hub_resume
Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: resume root hub
Sep 15 17:53:42 gandalf kernel: usb 6-6: usb auto-resume
Sep 15 17:53:42 gandalf kernel: hub 6-0:1.0: state 7 ports 10 chg 0000 evt 0000
Sep 15 17:53:42 gandalf kernel: hub 6-0:1.0: port 3, status 0501, change 0000, 480 Mb/s
Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 6 status 001005 POWER sig=se0 PE CONNECT
Sep 15 17:53:42 gandalf kernel: usb 6-6: finish resume
Sep 15 17:53:42 gandalf kernel: dvb-usb: found a 'Hauppauge WinTV-NOVA-T usb2' in warm state.
Sep 15 17:53:42 gandalf kernel: dvb-usb: will pass the complete MPEG2 transport stream to the software demuxer.
Sep 15 17:53:42 gandalf kernel: DVB: registering new adapter (Hauppauge WinTV-NOVA-T usb2).
Sep 15 17:53:42 gandalf kernel: dvb-usb: MAC address: 00:0d:fe:ff:ff:ff
Sep 15 17:53:42 gandalf kernel: DiB3000MC/P:-I-  found DiB3000MC/P: 3002
Sep 15 17:53:42 gandalf kernel: DVB: registering frontend 1 (DiBcom 3000MC/P)...
Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: port 3 high speed
Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3 status 001005 POWER sig=se0 PE CONNECT
Sep 15 17:53:42 gandalf kernel: input: IR-receiver inside an USB DVB receiver as /class/input/input3
Sep 15 17:53:42 gandalf kernel: dvb-usb: schedule remote query interval to 100 msecs.
Sep 15 17:53:42 gandalf kernel: dvb-usb: Hauppauge WinTV-NOVA-T usb2 successfully initialized and connected.
Sep 15 17:53:42 gandalf kernel: usb 6-3: new high speed USB device using ehci_hcd and address 7
Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: devpath 3 ep0in 3strikes
Sep 15 17:53:42 gandalf last message repeated 2 times
Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: port 3 high speed
Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3 status 001005 POWER sig=se0 PE CONNECT
Sep 15 17:53:42 gandalf kernel: DVB: registering frontend 0 (ST STV0299 DVB-S)...
Sep 15 17:53:42 gandalf kernel: ACPI: PCI Interrupt 0000:03:01.0[A] -> GSI 21 (level, low) -> IRQ 21
Sep 15 17:53:42 gandalf kernel: gameport: EMU10K1 is pci0000:03:01.1/gameport0, io 0xbc00, speed 592kHz
Sep 15 17:53:42 gandalf kernel: ACPI: PCI Interrupt 0000:00:14.2[A] -> GSI 16 (level, low) -> IRQ 16
Sep 15 17:53:42 gandalf kernel: usb 6-3: device descriptor read/64, error -71
Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: devpath 3 ep0in 3strikes
Sep 15 17:53:42 gandalf last message repeated 2 times
Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: port 3 high speed
Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3 status 001005 POWER sig=se0 PE CONNECT
Sep 15 17:53:42 gandalf kernel: usb 6-3: device descriptor read/64, error -71
Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: port 3 high speed
Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3 status 001005 POWER sig=se0 PE CONNECT
Sep 15 17:53:42 gandalf kernel: input: ImExPS/2 Logitech Wheel Mouse as /class/input/input4
Sep 15 17:53:42 gandalf kernel: usb 6-3: new high speed USB device using ehci_hcd and address 8
Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: devpath 3 ep0in 3strikes
Sep 15 17:53:42 gandalf last message repeated 2 times
Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: port 3 high speed
Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3 status 001005 POWER sig=se0 PE CONNECT
Sep 15 17:53:42 gandalf kernel: usb 6-3: device descriptor read/64, error -71
Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: devpath 3 ep0in 3strikes
Sep 15 17:53:42 gandalf last message repeated 2 times
Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: port 3 high speed
Sep 15 17:53:42 gandalf kernel: ehci_hcd 0000:00:13.5: GetStatus port 3 status 001005 POWER sig=se0 PE CONNECT
Sep 15 17:53:42 gandalf kernel: usb 6-3: device descriptor read/64, error -71

I have to switch off/on the external disk to get it back.

- Here's a trace of a usb failure :
 Sep 15 17:46:24 gandalf kernel: DiB3000MC/P:timf: 11075200
Sep 15 17:46:24 gandalf kernel: DiB3000MC/P:autosearch returns: 2
Sep 15 17:46:24 gandalf kernel: DiB3000MC/P:-I-  Setting output mode for demod ffff81003eba9000 to 5
Sep 15 17:46:24 gandalf kernel: DiB3000MC/P:timf: 11075200
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status
Sep 15 17:46:46 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069
Sep 15 17:46:46 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009
Sep 15 17:46:46 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status
Sep 15 17:46:47 gandalf kernel: usb 6-6: vdr-kbd timed out on ep1out len=0/6
Sep 15 17:46:47 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069
Sep 15 17:46:47 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009
Sep 15 17:46:47 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status
Sep 15 17:46:47 gandalf kernel: dvb-usb: bulk message failed: -110 (6/0)
Sep 15 17:46:47 gandalf kernel: DiB3000MC/P:i2c read error on 509
Sep 15 17:46:49 gandalf kernel: usb 6-6: events/1 timed out on ep1out len=0/2
Sep 15 17:46:49 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069
Sep 15 17:46:49 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009
Sep 15 17:46:49 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status

According to DiB3000MC/P tace, it looks like i2c bus is also gone ... 

I'm going to contact dvb usb people to see if one of them can help.
Comment 46 Dominique Dumont 2007-09-15 11:48:21 UTC
It looks like a pure dvb bug: I just had a dvb failure even though the only dvb device was my nova-t dvb usb device. 

I've sent this information to the dvb list. I'll get back to this bug report as soon as some progress is made on the dvb front.

Cheers
Comment 47 Alan Stern 2007-09-15 13:13:56 UTC
Here are the most important lines from the log, right near the start of the failure:

Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069
Sep 15 17:46:45 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009

In the status register, bit 8000 indicates the async schedule is running and bit 0020 (which never got turned on) is the IAA flag.  Bits 0060 in the cmd register are the IAA-doorbell and async-enabled bits.

So translating this into English, originally the system asked for an IAA signal and never received one.  When the watchdog timer expired, the async schedule was still running and the IAA-doorbell request was still on.  The driver then turned them both off, and at the second watchdog expiration you can see that the async schedule is indeed no longer running.

But then the driver tried to turn the async schedule back on, and the controller refused to do it!  All the remaining watchdog log entries show that the async schedule is not running, even when the async-enable command bit is set.

Evidently the controller has malfunctioned.  Giving it a light reset might get it going again -- but support for a light controller reset is optional and the chip might not include it, in which case there wouldn't be anything left for the driver to do but give up.  I haven't had time to write a patch for this, but maybe sometime next week.

Have you considered buying an add-on PCI USB controller card?  It might work better than the onboard USB controller.
Comment 48 Dominique Dumont 2007-10-06 09:03:12 UTC
Hello

Sorry for the long delay in following up.

Here's the status so far:
- I've tried to reproduce the problem at work to no avail. I've recently found
  that the dvb device must be hooked to an antenna to trigger the bug. And I 
  don't have dvb antenna at work...
- I'm able to reproduce usb hang without disk traffic: the dvb device (with the
  antenna) fails once or twice a day. In this case the keyboard is "delayed"
  (characters appears on screen after 50 to 60 seconds or more)
- When dvb is locked, ehci-hcd module hangs. Simple command like 
  'cat /proc/bus/usb/devices' hangs. Looks like a mutex is not freed.
  (See kernel stack trace below).
- If I unplug the dvb device, I get back my keyboard but ehci-hcd is still
  hung. rmmod ehci-hcd hangs. Only a reboot restores usb functions.
- When dvb hangs, I've no IAA watchdog trace before or after the hang

I'll buy a USB controller PCI card if I cannot borrow a laptop to reproduce 
the problem. But this is not my favorite solution since I have no free PCI slot.

Cheers

Oct  6 14:47:23 gandalf kernel: cat           D 000000010069bcc6     0 23980  12418 (NOTLB)
Oct  6 14:47:23 gandalf kernel:  ffff810018ef7df8 0000000000000082 0000000000000000 0000000000000000
Oct  6 14:47:23 gandalf kernel:  00003ffffffff000 ffffffff80263b30 0000000000000008 ffff81002f566ea0
Oct  6 14:47:23 gandalf kernel:  ffffffff804aa4c0 00000da7d831a456 00000000000b0406 ffff81002f567050
Oct  6 14:47:23 gandalf kernel: Call Trace:
Oct  6 14:47:23 gandalf kernel:  [<ffffffff80263b30>] __alloc_pages+0x5a/0x2bc
Oct  6 14:47:23 gandalf kernel:  [<ffffffff803f4c0c>] __down+0xe8/0x100
Oct  6 14:47:23 gandalf kernel:  [<ffffffff8022c321>] default_wake_function+0x0/0xe
Oct  6 14:47:23 gandalf kernel:  [<ffffffff802f6a3b>] __up_read+0x13/0x8a
Oct  6 14:47:23 gandalf kernel:  [<ffffffff803f4880>] __down_failed+0x35/0x3a
Oct  6 14:47:23 gandalf kernel:  [<ffffffff80373e3c>] usb_device_read+0xb4/0x149
Oct  6 14:47:23 gandalf kernel:  [<ffffffff802831dd>] vfs_read+0xcb/0x153
Oct  6 14:47:23 gandalf kernel:  [<ffffffff80283579>] sys_read+0x45/0x6e
Oct  6 14:47:23 gandalf kernel:  [<ffffffff80209d8e>] system_call+0x7e/0x83
Oct  6 14:47:23 gandalf kernel:
Oct  6 14:47:23 gandalf kernel: cat           D 000000010070ddb6     0 24464  12003 (NOTLB)
Oct  6 14:47:23 gandalf kernel:  ffff810008947dc8 0000000000000082 0000000000000000 0000000000000000
Oct  6 14:47:23 gandalf kernel:  00007fff2a345860 ffff81000adf8000 0000000000000008 ffff81001d253160
Oct  6 14:47:23 gandalf kernel:  ffff8100010fb530 00000e7e95b838d1 00000000000ace6d ffff81001d253310
Oct  6 14:47:23 gandalf kernel: Call Trace:
Oct  6 14:47:23 gandalf kernel:  [<ffffffff803f4c0c>] __down+0xe8/0x100
Oct  6 14:47:23 gandalf kernel:  [<ffffffff8022c321>] default_wake_function+0x0/0xe
Oct  6 14:47:23 gandalf kernel:  [<ffffffff8035a23e>] get_device+0x17/0x1f
Oct  6 14:47:23 gandalf kernel:  [<ffffffff803f4880>] __down_failed+0x35/0x3a
Oct  6 14:47:23 gandalf kernel:  [<ffffffff80370672>] usbdev_read+0x43/0x1c3
Oct  6 14:47:23 gandalf kernel:  [<ffffffff802831dd>] vfs_read+0xcb/0x153
Oct  6 14:47:23 gandalf kernel:  [<ffffffff80283579>] sys_read+0x45/0x6e
Oct  6 14:47:23 gandalf kernel:  [<ffffffff80209d8e>] system_call+0x7e/0x83
Comment 49 Dominique Dumont 2007-10-06 09:32:13 UTC
Here's another example of hung cat command where usb_device_read is involved.


Oct  6 14:26:21 gandalf kernel: cat           D 000000010069bcc6     0 23980  12418 (NOTLB)
Oct  6 14:26:21 gandalf kernel:  ffff810018ef7df8 0000000000000082 0000000000000000 0000000000000000
Oct  6 14:26:21 gandalf kernel:  00003ffffffff000 ffffffff80263b30 0000000000000008 ffff81002f566ea0
Oct  6 14:26:21 gandalf kernel:  ffffffff804aa4c0 00000da7d831a456 00000000000b0406 ffff81002f567050
Oct  6 14:26:21 gandalf kernel: Call Trace:
Oct  6 14:26:21 gandalf kernel:  [<ffffffff80263b30>] __alloc_pages+0x5a/0x2bc
Oct  6 14:26:21 gandalf kernel:  [<ffffffff803f4c0c>] __down+0xe8/0x100
Oct  6 14:26:21 gandalf kernel:  [<ffffffff8022c321>]
default_wake_function+0x0/0xe
Oct  6 14:26:21 gandalf kernel:  [<ffffffff802f6a3b>] __up_read+0x13/0x8a
Oct  6 14:26:21 gandalf kernel:  [<ffffffff803f4880>] __down_failed+0x35/0x3a
Oct  6 14:26:21 gandalf kernel:  [<ffffffff80373e3c>] usb_device_read+0xb4/0x149
Oct  6 14:26:21 gandalf kernel:  [<ffffffff802831dd>] vfs_read+0xcb/0x153
Oct  6 14:26:21 gandalf kernel:  [<ffffffff80283579>] sys_read+0x45/0x6e
Oct  6 14:26:21 gandalf kernel:  [<ffffffff80209d8e>] system_call+0x7e/0x83
Comment 50 Dominique Dumont 2007-10-06 10:13:46 UTC
Created attachment 13057 [details]
usbmon trace of dvb failure

Hello

Here's an usbmon trace captured just before and during a dvb failure. I hope you
can find some information in there.

Cheers
Comment 51 Alan Stern 2007-10-06 12:51:40 UTC
Well, it's pretty clear where the problem started.  At timestamp 1659505782 a transfer failed with error -2.  The next transfer succeeded, and then all the following transfers failed.  (Error code -2 means the transfer was cancelled; by looking at the timestamps you can see that each of those transfers was cancelled 2 seconds after it was submitted.  That must be the dvb driver's timeout limit.)

What the usbmon log doesn't say is _why_ the transfers timed out.  There's only one clue: In the very first failed transfer (the one mentioned above), it timed out even though all the data had been sent!  That wasn't true of the later transfers; none of them managed to send anything.  Still, this indicates something is wrong with the EHCI controller -- but we already knew that.

I don't think we have any hope of figuring out why the controller malfunctions.  It could be a bug in the design or a flaw in the manufacturing.  You've never been able to duplicate this problem on any other computer, have you?  Which indicates a hardware error in your one machine, rather than a widespread bug.
Comment 52 Dominique Dumont 2007-10-13 10:07:35 UTC
Created attachment 13144 [details]
hub failure log

No, I have not been able to duplicate the problem on another
hardware. But I haven't found another 64 bits machine that I
could bring home. :-(

Anyway, I have another clue.

I've bought an USB 2.0 hub (with a Philips chip) to check if the
problem was not due to a power supply problem.

I've found that this hub barely works. Once I plug a thumb drive
in it (or after 2 or 3 plug/unplug cycles), I get "IAA lost"
traces without mounting the drive.

On the other hand, I've booted a knoppix 4.0 (kernel 2.6.12 32 bits) 
on the same hardware: In this case, the hub and thumb drives work
perfectly well.

So I believe there's a problem with usb core code, either a code
difference between 2.6.12 and 2.6.22.6 or a problem due to the
different architecture (32 vs 64 bits).

I'm going to try more recent knoppix to narrow down the code
difference between the working and non-working setup.

HTH
Comment 53 Alan Stern 2007-10-13 14:12:36 UTC
It's always good to have a working version of the kernel for comparison purposes.  You would probably be better off using a vanilla kernel rather than a Knoppix kernel, if at all possible.  Then you could try out intermediate versions and zero down on the particular change responsible for the problem.
Comment 54 Dominique Dumont 2007-10-21 08:40:00 UTC
I can use recent vanilla kernels, but I'm not sure that Debian unstable will work fine on very old kernels. May be in single user.

Anyway, here's what I've found so far:
- 2.6.22.6 (vanilla + your patches) tends to fail fast (after 2 or 3 insertions)
- 2.6.21-2-amd64 (Debian's): failure after 9 insertions
- 2.6.19.1 (vanilla) single user: ok after 22 insertions (probably got lucky)
- 2.6.19.1 (vanilla) multi user: failure after 9 insertions
- 2.6.19.1 (knoppix) multi user: failure after 36 insertions 
- 2.6.12   (knoppix) multi user: ok after 127 insertions 

So there's a chance that the problem was introduced between 2.6.12 and 2.6.19 (even though I've not enough results to be really sure). Note that more recent kernels are more likely to fail than older ones.

I'm going to try older kernels, I'll begin with vanilla kernel. If I get too
many problems, I'll try with older knoppix.

HTH
Comment 55 Dominique Dumont 2007-10-23 10:19:51 UTC
I've great news !!!

Until now, I've done all the tests reported here on a machine with powernowd (which uses cpufreq_userspace module) set on "performance" mode. The freq of my CPU varied from 1GHZ and 2.3GHz. Without powernowd (and cpufreq_userspace is not loaded), the CPU
always runs at 2.3GHz.

The good news are: without powernowd and cpufreq, all the USB system works fine :
- I can plug re-plug USB hub wihtout problem (I;ve tested about 100 cycles)
- DVB no longer fails (although i'd need more days to be really sure on this one)
- USB disk traffic + DVB plugged works normally (I've tested more than 20mn of md5sum on the USB drive while DVB was plugged)

So my USB problems are solved by removing CPU throttling.

But this raises the question of why do I get problems when cpu throttling is enabled ?

Does USB have timers which depends on CPU frequency ?

Can this problem occur on other HW  ? on laptop which most often use CPU throttling ?

Feel free to send me patches if you want me to test solutions that would work with CPU throttling. I'll do my best to try them.

All the best.
Comment 56 David Brownell 2007-10-23 16:31:44 UTC
Sounds like a problem observed on some systems, whereby cpufreq breaks EHCI.  Linus didn't much like the fix since it broke some systems ... plus it did look pretty ugly, and a simpler one should probably have been possible.  However, nobody has yet provided a mergeable patch.

I think a mergeable patch would use such a notifier and either abort the cpufreq change if there are ISO transfers queued, or else stop both periodic and async schedules before the change (then restart the appropriate ones after the change completes).
Comment 57 Alan Stern 2007-10-24 10:20:39 UTC
The underlying problem appears to be that when the actual CPU frequency change occurs the USB controller hardware is prevented from accessing the PCI bus, causing various sorts of errors or failures.  At least one other person has seen these sorts of errors.

However, David, it's worth noticing that Dominique's failures don't appear to involve Iso transfers -- only Bulk.
Comment 58 Dominique Dumont 2007-10-27 12:19:58 UTC
Yes, the failures only involve bulk transfer: weirdly enough, the Nova-t dvb-t device only provides bulk endpoints:

T:  Bus=06 Lev=01 Prnt=01 Port=03 Cnt=01 Dev#=  3 Spd=480 MxCh= 0
D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=2040 ProdID=9301 Rev= 0.00
S:  Manufacturer=Hauppau??
S:  Product=SOHO-FX2
S:  SerialNumber=9301-00-F008B260
C:* #Ifs= 1 Cfg#= 1 Atr=80 MxPwr=500mA
I:* If#= 0 Alt= 0 #EPs= 4 Cls=ff(vend.) Sub=00 Prot=00 Driver=dvb_usb_nova_t_usb2
E:  Ad=01(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=81(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=0ms
E:  Ad=86(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms

HTH
Comment 59 Dominique Dumont 2007-11-05 13:43:59 UTC
Hello

Looks like I'm not out of the woods yet. I had a usb disk failure without 
powernowd or cpufreq (with dvb). This happened while reading a file from the
usb disk at a rather slow rate (about 1.3MB/s) compared to the usual md5sum tests. So having a fixed frequency makes the bugs less likely to happen, but not  impossible.

Nov  5 21:04:28 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069
Nov  5 21:04:28 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009
Nov  5 21:04:28 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status
Nov  5 21:04:28 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069
Nov  5 21:04:28 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009
Nov  5 21:04:28 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status
Nov  5 21:04:28 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10069
Nov  5 21:04:29 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009

HTH
Comment 60 Dominique Dumont 2007-11-11 11:25:03 UTC
Created attachment 13497 [details]
dvb usb disk failure related to killing urb ?

Hello

I've tried the same test with dvb code from mercurial. I still have failure every now and then. The failures look related to killing urbs:

Nov 11 14:51:31 gandalf kernel: stop feeding
Nov 11 14:51:31 gandalf kernel: killing URB no. 0.
Nov 11 14:51:31 gandalf kernel: killing URB no. 1.
Nov 11 14:51:31 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069
Nov 11 14:51:32 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009
Nov 11 14:51:32 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status
Nov 11 14:51:32 gandalf kernel: killing URB no. 2.


and:

Nov 11 18:58:15 gandalf kernel: stop feeding
Nov 11 18:58:15 gandalf kernel: killing URB no. 0.
Nov 11 18:58:15 gandalf kernel: killing URB no. 1.
Nov 11 18:58:15 gandalf kernel: killing URB no. 2.
Nov 11 18:58:15 gandalf kernel: killing URB no. 3.
Nov 11 18:58:15 gandalf kernel: killing URB no. 4.
Nov 11 18:58:15 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status a008 cmd 10069
Nov 11 18:58:15 gandalf kernel: >>> 04 35
Nov 11 18:58:15 gandalf kernel: ehci_hcd 0000:00:13.5: IAA watchdog: status 2008 cmd 10009
Nov 11 18:58:15 gandalf kernel: ehci_hcd 0000:00:13.5: lost IAA status
Nov 11 18:58:15 gandalf kernel: killing URB no. 5.

The "Killing URB" trace comes from dvb-USB/usb-urb.c line 73:

int usb_urb_kill(struct usb_data_stream *stream)
{
        int i;
        for (i = 0; i < stream->urbs_submitted; i++) {
                deb_ts("killing URB no. %d.\n",i);

                /* stop the URB */
                usb_kill_urb(stream->urb_list[i]);
        }
        stream->urbs_submitted = 0;
        return 0;
}

If this does not give you a clue about what is wrong (beside HW pb), I'm going to buy a PCI usb controller card to check if this problem can be reproduced with another controller (as Alan suggested). 

HTH
Comment 61 Alan Stern 2007-11-12 11:30:04 UTC
Actually the problem is related to unlinking URBs, which is part of what usb_kill_urb() does.  When an URB is unlinked its active QH has to be removed from the schedule.  Removing QHs requires an IAA interrupt, so when your balky hardware fails to set the proper IAA status the problem occurs.

This is all stuff we already knew.
Comment 62 Dominique Dumont 2007-12-02 08:33:50 UTC
Hello

Sorry for the delay in following up.

I've tested like DVB+external USB disk with an add-on USB card (VIA chip). 
It's working normally with CPU freq set at max speed.

I'm going to check if changing cpu freq causes problems with this chip.

All the best
Comment 63 Christian Wick 2008-01-21 04:41:31 UTC
Hi,

I am experiencing the same problem as Dominique reported here.
My board (Asus M2A-VM HDMI) also uses the SB600 chipset.
Attached to one USB port is a D-LINK 7-port hub (USB 2.0). 7 USB-HDDs are attached to this hub. At least 5 USB-HDDs are using a Cypress chipset.
The system is Ubuntu 7.10 with kernel "2.6.22-14-server #1 SMP i686".
Ok, it is not a vanilla kernel, but I think it also happens to a vanilla kernel. 
I can try this as well if you wish.

Access to the USB-HDDs without high traffic works without any problem. 
As soon as I start to copy a few big files from one USB-HDD to another USB-HDD, the cp command hangs at some point. "sync" does not work either. The only solution so far for me is to reboot the machine. This is reproducable.

Copying from the internal IDE-HDD does not produce this bug, but copying via Samba to the USB-HDDs does.

I have an Add-On PCI USB2.0 controller card which uses a VIA controller. I am going to see if the described behaviour also happens with the card.

Regards,
Christian
Comment 64 Alan Stern 2008-01-21 08:47:24 UTC
Christian, there's no way you can possibly know whether your problem is the same as Dominique's without looking at the EHCI "registers" debugging file.  See comments 8 and 12.
Comment 65 Christian Wick 2008-01-21 14:24:52 UTC
Created attachment 14515 [details]
cat /proc/bus/usb/devices

ok. the debug output follows.
Comment 66 Christian Wick 2008-01-21 14:26:38 UTC
Created attachment 14516 [details]
cat /sys/class/usb_host/usb_host6/registers
Comment 67 Christian Wick 2008-01-21 14:27:13 UTC
Created attachment 14517 [details]
cat /sys/class/usb_host/usb_host6/async
Comment 68 Christian Wick 2008-01-21 14:31:04 UTC
Created attachment 14518 [details]
cat /sys/class/usb_host/usb_host6/periodic

the file "companion" is empty.

I have started to copy a few files from one USB-HDD to another USB-HDD.
Now the cp hangs again.
Comment 69 Alan Stern 2008-01-21 14:47:11 UTC
Okay, that does look like the same bug.  But as far as I've been able to tell it's a hardware bug in the host controller, not a software bug in the driver.  Nobody to my knowledge has found a way to work around it.  If you have any ideas, I'd like to hear them.
Comment 70 Christian Wick 2008-01-27 05:30:32 UTC
Same tests on the same board with the same kernel with an additional USB2.0 controller card (VIA VT6212L):
Big bulk transfers from one USB2.0 HDD to another one work without any problems. No more hangs. 

Can you provide a patch which soft-resets the EHCI controller?

--

Linux sleepless 2.6.22.9-custom #1 SMP Mon Jan 21 21:44:50 CET 2008 i686 GNU/Linux

--

03:06.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 61) (prog-if 00 [UHCI])
        Subsystem: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
        Flags: bus master, medium devsel, latency 64, IRQ 22
        I/O ports at c800 [size=32]
        Capabilities: [80] Power Management version 2

03:06.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 61) (prog-if 00 [UHCI])
        Subsystem: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
        Flags: bus master, medium devsel, latency 64, IRQ 19
        I/O ports at c400 [size=32]
        Capabilities: [80] Power Management version 2

03:06.2 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 63) (prog-if 20 [EHCI])
        Subsystem: VIA Technologies, Inc. USB 2.0
        Flags: bus master, medium devsel, latency 64, IRQ 20
        Memory at fddfe000 (32-bit, non-prefetchable) [size=256]
        Capabilities: [80] Power Management version 2

--
Comment 71 Alan Stern 2008-01-27 08:03:07 UTC
Unfortunately I can't.  The driver isn't designed in a way that makes it straightforward.  Furthermore the EHCI spec says explicitly that hardware support for soft reset is optional.

You can always do a manual reset.  The easiest way is to rmmod ehci-hcd and then modprobe it back.
Comment 72 David Brownell 2008-02-09 15:28:38 UTC
Created attachment 14769 [details]
experimental IAA/IAAD patch for 2.6.24-git20+

Try this patch, especially on ATI hardware.  It tries to handle one case that's been described as a hardware bug (albeit not one in public erata listings), and generally adds lots of paranoia about the two IAA related bits to detect similar cases of bits not being set or cleared when they should be.

Run with CONFIG_USB_DEBUG=y for the best diagnostics, and please report whether any interesting behavioral changes appear.
Comment 73 Dominique Dumont 2008-02-10 08:30:16 UTC
I will test your patch and get back to you.

But before that, it looks like 2.6.24.1 behaves better: I still have to produce
a single failure while slowly reading the content of an USB disk (i.e. playing a movie stored there). So far, it has worked for 2 hours. I may have been lucky, so I need to perform mode tests, including tests with DVB usb and re-enabling CPU throttle.

Christian, could you test 2.6.24.1 on your side ?

All the best
Comment 74 Alan Stern 2008-02-10 13:09:52 UTC
Incidentally, Dominique, you shouldn't be using the patches attached to comments 18 and 38 any more.  The first has been updated by a patch included in 2.6.25 which you can find here: <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=07d29b63ef6b39963ab37818653284d861cf55af>.  The second is essentially replaced by Dave's new patches.
Comment 75 Martin Ammerm 2008-02-16 17:02:49 UTC
If have a similar problem with an ATI IXP SB400 EHCI controller. I didn't read all comments, but maybe i can add some new facts here. If i can do anything further to help, feel free to contact me.

I do not have this problem with usb storage but with a usb mouse. After some random time, all my usb ports freeze and i cannot use any usb devices. rmmod usb_storage freezes, too. I don't have any of the problems i describe here, when i boot with acpi=off noapic parameters.

This problem occurs since at least 2.6.18 but not on all kernel versions since then. Versions where usb ports don't freeze despite having acpi and apic enabled have a known problem with netdev watchdog timer for realtek ethernet card (modules 8139cp/8139too). See  http://bugzilla.kernel.org/show_bug.cgi?id=6138
Comment 76 Dominique Dumont 2008-02-24 11:49:45 UTC
Sorry for the delay in testing.

Alan, thanks for the tip.

I've tested 2.6.24-git22 with David's IAA patch.

When any device is plugged in the usb bus, I get a lot of "IAAD cleared without IAA" messages.

On the SB600 device:
kernel: ehci_hcd 0000:00:13.5: IAAD cleared without IAA

But also on the VIA chip (which worked flawlessly so far):
kernel: ehci_hcd 0000:03:01.2: IAAD cleared without IAA


To continue the tests, I've removed all devices connected to the VIA chip. Here's what I get with minimal traffic (i.e. dvb connected and USB disk mounted with no traffic):
Feb 24 19:33:54 gandalf kernel: EXT3 FS on sdh1, internal journal
Feb 24 19:33:54 gandalf kernel: EXT3-fs: mounted filesystem with ordered data mode.
Feb 24 19:33:54 gandalf kernel: ehci_hcd 0000:00:13.5: IAAD cleared without IAA
Feb 24 19:34:25 gandalf last message repeated 1093 times
Feb 24 19:35:26 gandalf last message repeated 2190 times
Feb 24 19:36:27 gandalf last message repeated 2191 times
Feb 24 19:37:28 gandalf last message repeated 2180 times
Feb 24 19:38:29 gandalf last message repeated 2212 times

And when I start the md5sum + dvb device test :
Feb 24 19:40:29 gandalf last message repeated 26947 times
Feb 24 19:41:29 gandalf last message repeated 35874 times
Feb 24 19:42:29 gandalf last message repeated 35038 times
Feb 24 19:43:29 gandalf last message repeated 35316 times

With 2.6.24 and before, this test usually hangs the USB controller in a few minutes.

After a while, the DVB device fails:
Feb 24 20:19:09 gandalf last message repeated 21933 times
Feb 24 20:19:09 gandalf kernel: usb 6-4: vdr-kbd timed out on ep1in len=0/2
Feb 24 20:19:09 gandalf kernel: dvb-usb: recv bulk message failed: -110
Feb 24 20:19:09 gandalf kernel: ehci_hcd 0000:00:13.5: IAAD cleared without IAA
Feb 24 20:19:11 gandalf last message repeated 191 times

At this time the symptoms are:
- USB disk access still works (this is new behavior)
- usbview hangs.
- keyboard is weird: key pressed appears on screen after several seconds 
  (symptoms always seen when dvb device fails).

Then I stop disk traffic. 

And I need to unplug the DVB device to get a normal keyboard behavior.

usbview is still hung and is unkillable (even with SIGKILL)

Feel free to ask me other tests if needed.

HTH
Comment 77 David Brownell 2008-02-25 21:43:35 UTC
Comment on attachment 14769 [details]
experimental IAA/IAAD patch for 2.6.24-git20+

original IAA patch had some bogus messaging in the typical case; replacing it.
Comment 78 David Brownell 2008-02-25 21:48:56 UTC
Created attachment 14997 [details]
improved iaa diagnostics patch

This is basically same as the previous patch, except that it doesn't emit spurious  diagnostics ("IAAD cleared without IAA") in the normal all-is-working-right case.  (It should have checked for IAA clear before assuming it wasn't set!  I'm told there is hardware which will trigger that corrected diagnostic...)  It also restores the diagnostic for the normal watchdog-firing case to "verbose" level, to avoid log clutter on that uninteresting case.
Comment 79 David Brownell 2008-02-25 21:55:22 UTC
Dominique, sorry for the noise, but that original patch had a goof; I posted the
updated patch on linux-usb@vger but forgot to update the version here.  Unfortunately I can't tell which of those "IAAD cleared" messages may be valid (if any!) versus which are noise (most of them).

It'd be good if you could retest with the updated patch (say, RC3 -- or else just the post-2.6.24 kernel you're already using) plus the anti-oops patch in

http://lkml.org/lkml/2008/2/19/546
Comment 80 Dominique Dumont 2008-03-02 10:02:48 UTC
Hello

I've tested your new patch (2008-02-25 patch + anti-oops patch ) with 2.6.24-git22. Test setup is dvb-t device, reading a movie on usb disk (not md5sum sum), with powernowd running in performance mode.

I indeed had a failure:

- usb disk com failed
- dvb-t com failed (with usual impact: sluggish keyboard)
- but usb is not locked: usbview still works (that's an improvement :-) )
- cycling power on usb disk is not enough to get it back: the power cycleeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee is not detected [you just witnessed another keyboard problem consecutive to usb failure: I hit 'e' only once and it got repeated :-/ ]

See the traces below:

Mar  2 16:18:32 gandalf kernel: ehci_hcd 0000:00:13.5: IAAD wasn't clear 00010079
Mar  2 16:18:32 gandalf kernel: usb 6-4: events/1 timed out on ep1out len=0/2
Mar  2 16:18:32 gandalf kernel: dvb-usb: bulk message failed: -110 (2/0)
Mar  2 16:18:34 gandalf kernel: ehci_hcd 0000:00:13.5: IAAD wasn't clear 00010079
Mar  2 16:18:34 gandalf kernel: usb 6-4: vdr-kbd timed out on ep1out len=0/6
Mar  2 16:18:34 gandalf kernel: dvb-usb: bulk message failed: -110 (6/0)

[...]

Mar  2 16:20:22 gandalf kernel: usb 6-3: usb-storage timed out on ep0out len=0/0
Mar  2 16:20:23 gandalf kernel: usb 6-3: device not accepting address 2, error -110
Mar  2 16:20:23 gandalf kernel: hub 6-0:1.0: logical disconnect on port 3
Mar  2 16:20:23 gandalf kernel: hub 6-0:1.0: state 7 ports 10 chg 0008 evt 0000
Mar  2 16:20:23 gandalf kernel: hub 6-0:1.0: port 3, status 0501, change 0000, 480 Mb/s
Mar  2 16:20:23 gandalf kernel: usb 6-3: USB disconnect, address 2
Mar  2 16:20:23 gandalf kernel: usb 6-3: unregistering device
Mar  2 16:20:23 gandalf kernel: usb 6-3: usb_disable_device nuking all URBs
Mar  2 16:20:23 gandalf kernel: usb 6-3: unregistering interface 6-3:1.0
Mar  2 16:20:23 gandalf kernel: sd 4:0:0:0: Device offlined - not ready after error recovery
Mar  2 16:20:23 gandalf kernel: sd 4:0:0:0: [sdh] Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK,SUGGEST_OK
Mar  2 16:20:23 gandalf kernel: end_request: I/O error, dev sdh, sector 18237903
Mar  2 16:20:23 gandalf kernel: sd 4:0:0:0: rejecting I/O to offline device
Mar  2 16:20:23 gandalf kernel: sd 4:0:0:0: rejecting I/O to offline device
Mar  2 16:20:23 gandalf kernel: sd 4:0:0:0: [sdh] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Mar  2 16:20:23 gandalf kernel: end_request: I/O error, dev sdh, sector 18238143
Mar  2 16:20:23 gandalf kernel: Buffer I/O error on device sdh1, logical block 1138
Mar  2 16:20:23 gandalf kernel: lost page write due to I/O error on sdh1
Mar  2 16:20:23 gandalf kernel: Aborting journal on device sdh1.
Mar  2 16:20:23 gandalf kernel: Buffer I/O error on device sdh1, logical block 532

[...]

Mar  2 16:25:24 gandalf kernel: usb-storage: device scan complete
Mar  2 16:25:24 gandalf kernel: scsi 19:0:0:0: Direct-Access     Generic  USB SD Reader    1.00 PQ: 0 ANSI: 0
Mar  2 16:25:24 gandalf kernel: scsi 19:0:0:1: Direct-Access     Generic  USB CF Reader    1.01 PQ: 0 ANSI: 0
Mar  2 16:25:24 gandalf kernel: scsi 19:0:0:2: Direct-Access     Generic  USB SM Reader    1.02 PQ: 0 ANSI: 0
Mar  2 16:25:24 gandalf kernel: scsi 19:0:0:3: Direct-Access     Generic  USB MS Reader    1.03 PQ: 0 ANSI: 0
Mar  2 16:25:24 gandalf kernel: sd 19:0:0:0: [sde] Attached SCSI removable disk
Mar  2 16:25:24 gandalf kernel: sd 19:0:0:1: [sdf] Attached SCSI removable disk
Mar  2 16:25:24 gandalf kernel: sd 19:0:0:2: [sdg] Attached SCSI removable disk
Mar  2 16:25:24 gandalf kernel: sd 19:0:0:3: [sdk] Attached SCSI removable disk
Mar  2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd65bc0 path 2 ep2in 93120000 cc 9 --> status -121
Mar  2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd656c0 path 2 ep2in 40160000 cc 4 --> status -32
Mar  2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd65bc0 path 2 ep2in 93120000 cc 9 --> status -121
Mar  2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd656c0 path 2 ep2in 40160000 cc 4 --> status -32
Mar  2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd65bc0 path 2 ep2in 92120000 cc 9 --> status -121
Mar  2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd656c0 path 2 ep2in 40160000 cc 4 --> status -32
Mar  2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd65bc0 path 2 ep2in 92120000 cc 9 --> status -121
Mar  2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd656c0 path 2 ep2in 40160000 cc 4 --> status -32
Mar  2 16:25:24 gandalf kernel: ehci_hcd 0000:00:13.5: IAAD wasn't clear 00010079
Mar  2 16:25:24 gandalf kernel: usb 6-4: kdvb-fe-0 timed out on ep1out len=0/6
Mar  2 16:25:24 gandalf kernel: dvb-usb: bulk message failed: -110 (6/0)
Mar  2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd65bc0 path 2 ep2in 93120000 cc 9 --> status -121
Mar  2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd656c0 path 2 ep2in 40160000 cc 4 --> status -32
Mar  2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd65bc0 path 2 ep2in 92120000 cc 9 --> status -121
Mar  2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd656c0 path 2 ep2in 40160000 cc 4 --> status -32
Mar  2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd65bc0 path 2 ep2in 92120000 cc 9 --> status -121
Mar  2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd656c0 path 2 ep2in 40160000 cc 4 --> status -32
Mar  2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81007e4a7ec0 path 2 ep2in 92120000 cc 9 --> status -121
Mar  2 16:25:24 gandalf kernel: ohci_hcd 0000:00:13.2: urb ffff81006fd656c0 path 2 ep2in 40160000 cc 4 --> status -32
Mar  2 16:25:26 gandalf kernel: ehci_hcd 0000:00:13.5: IAAD wasn't clear 00010079
Mar  2 16:25:26 gandalf kernel: usb 6-4: events/1 timed out on ep1out len=0/2
Mar  2 16:25:26 gandalf kernel: dvb-usb: bulk message failed: -110 (2/0)
Mar  2 16:25:28 gandalf kernel: ehci_hcd 0000:00:13.5: IAAD wasn't clear 00010079

I also got:

Mar  2 18:44:18 gandalf kernel: INFO: task khubd:147 blocked for more than 120 seconds.
Mar  2 18:44:18 gandalf kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  2 18:44:18 gandalf kernel: khubd         D 0000000000000000     0   147      2
Mar  2 18:44:18 gandalf kernel:  ffff81007f0f7c40 0000000000000046 ffff810000000000 ffffffff00000000
Mar  2 18:44:18 gandalf kernel:  ffff81007f0f7b90 ffff81007f0f7bf0 ffffffff805b3680 ffffffff805b3680
Mar  2 18:44:18 gandalf kernel:  ffffffff805b3680 ffffffff805b0880 ffffffff805b3680 ffff81007f0f5878
Mar  2 18:44:18 gandalf kernel: Call Trace:
Mar  2 18:44:18 gandalf kernel:  [<ffffffff8843ba9c>] :dvb_core:dvb_unregister_frontend+0xaf/0xef
Mar  2 18:44:18 gandalf kernel:  [<ffffffff80243a2d>] autoremove_wake_function+0x0/0x2e
Mar  2 18:44:18 gandalf kernel:  [<ffffffff8031171b>] kref_put+0x41/0x4c
Mar  2 18:44:18 gandalf kernel:  [<ffffffff8844dbc9>] :dvb_usb:dvb_usb_adapter_frontend_exit+0x15/0x25
Mar  2 18:44:18 gandalf kernel:  [<ffffffff8844d336>] :dvb_usb:dvb_usb_exit+0x31/0x98
Mar  2 18:44:18 gandalf kernel:  [<ffffffff8844d3d3>] :dvb_usb:dvb_usb_device_exit+0x36/0x48
Mar  2 18:44:18 gandalf kernel:  [<ffffffff803899ef>] usb_unbind_interface+0x59/0xb6
Mar  2 18:44:18 gandalf kernel:  [<ffffffff80379c0d>] __device_release_driver+0x74/0x97
Mar  2 18:44:18 gandalf kernel:  [<ffffffff80379fe4>] device_release_driver+0x31/0x49
Mar  2 18:44:18 gandalf kernel:  [<ffffffff803794d3>] bus_remove_device+0x8e/0xa1
Mar  2 18:44:18 gandalf kernel:  [<ffffffff803779b3>] device_del+0x10e/0x173
Mar  2 18:44:18 gandalf kernel:  [<ffffffff80386e7e>] usb_disable_device+0xe2/0x163
Mar  2 18:44:18 gandalf kernel:  [<ffffffff80382330>] usb_disconnect+0xd9/0x172
Mar  2 18:44:18 gandalf kernel:  [<ffffffff803839fc>] hub_thread+0x66d/0xff3
Mar  2 18:44:18 gandalf kernel:  [<ffffffff80243a2d>] autoremove_wake_function+0x0/0x2e
Mar  2 18:44:18 gandalf kernel:  [<ffffffff8038338f>] hub_thread+0x0/0xff3
Mar  2 18:44:18 gandalf kernel:  [<ffffffff802438f9>] kthread+0x47/0x76
Mar  2 18:44:18 gandalf kernel:  [<ffffffff8022cef5>] schedule_tail+0x28/0x5d
Mar  2 18:44:18 gandalf kernel:  [<ffffffff8020cc58>] child_rip+0xa/0x12
Mar  2 18:44:18 gandalf kernel:  [<ffffffff802438b2>] kthread+0x0/0x76
Mar  2 18:44:18 gandalf kernel:  [<ffffffff8020cc4e>] child_rip+0x0/0x12


On a related note, I've also seen some improvements between 2.6.22 and 2.6.24.1:
Using an external hub with powernowd no longer fails. I can plug/unplug usb thumb drive on the external hub without problems (used to hang usb after 3 or 4 tries).

HTH
Comment 81 Alan Stern 2008-03-03 07:56:29 UTC
My impression is that this is just a buggy controller.  David may have some more constructive suggestions.
Comment 82 Dominique Dumont 2008-03-08 08:32:55 UTC
That's possible... On the other hand, 2.6.24 behaves much better than 2.6.22 with regard to USB disks.

I've also some trouble with DVB-T device with the VIA chipset. On some occasion, when I reboot, the LED of the Nova-T device is still on during the whole reboot process. Normally, it should switch off when the machine goes down. I can only assume that the DVB-T device is trashed.

Even when if the DVB-T device is plugged on the VIA card (PCI add-on usb controller) I get the following symptoms:
- lot of "dvb-usb: bulk message failed: -110 (5/0)" in kern.log
- PS/2 keyboard is sluggish (about 1 or 2s delay between key hit and effect)
- behavior go back to normal when I unplug the DVB device.

Looks like error recovery in DVB driver has some problem.

Anyway, this may be unrelated to the USB disk hang I've experienced before.

So I'm going to test only USB storage (without DVB) on 2.6.24-git22 + david's patch, so I can check ehci + usb storage behavior on SB600 chipset without DVB interferences.
Comment 83 Dominique Dumont 2008-03-10 13:04:31 UTC
Here are some results with USB disk alone on 2.6.24-git12 + Dave's patch 

To test the disk I run mplayer on a movie stored on the usb disk.

After about 75mnm I got one (and only one) IAA hiccup (mplayer did not stop):

Mar  8 19:01:32 gandalf kernel: ehci_hcd 0000:00:13.5: IAAD cleared without IAA
Mar  8 19:01:32 gandalf kernel: ehci_hcd 0000:00:13.5: IAAD wasn't clear 00010069

After this trace, I had no other trouble for at least one hour (and I had to stop the test). I'll keep testing this setup for a while to see if this kind of IAA hiccups happens often or not.

HTH
Comment 84 David Brownell 2008-03-10 13:42:34 UTC
Hmm, the "IAAD cleared without IAA" seems to show that there's a real silicon issue there ... perhaps more prevalent with the SB600 than the SB700.  Please try this with the patch attached to Bug 10078 ... that's essentially what I asked be merged for 2.6.25-final (so it's not got as much in the way of diagnostics).  It lets the watchdog handle the case which triggers that  particular message.

Re the "IAAD wasn't clear", that message may be a bit dubious, but in any case the situation it addresses should also be addressed by that patch.
Comment 85 Dominique Dumont 2008-03-24 05:27:08 UTC
I've tested your patch for 2 hours without a hiccup. USB disk is working normally.

But, the file ehci-hdc.c contains:
 #undef EHCI_VERBOSE_DEBUG

And you patch uses ehci_vdbg which is silent when EHCI_VERBOSE_DEBUG is not defined, so I don't think I will get traces.

What should I do to enable these traces ?
- update to a more recent git kernel (I'm still at 2.6.24 git22)
- modify the patch to use ehci_dbg instead of ehci_vdbg
- remove the '#undef EHCI_VERBOSE_DEBUG'
Comment 86 Anonymous Emailer 2008-03-24 08:03:15 UTC
Reply-To: david-b@pacbell.net

> What should I do to enable these traces ?
> - update to a more recent git kernel (I'm still at 2.6.24 git22)
> - modify the patch to use ehci_dbg instead of ehci_vdbg
> - remove the '#undef EHCI_VERBOSE_DEBUG'

By all means update to 2.6.25-rc6 or newer.  If you really want to
see such messages, #define EHCI_VERBOSE_DEBUG.
Comment 87 Alan Stern 2008-03-24 08:06:11 UTC
You should try 2.6.25-rc6; it includes the IAA patch.

As regards the debugging output: The most important messages use ehci_dbg, not ehci_vdbg.  (These are the messages in the STS_IAA part of ehci_irq().)  So you would see them provided you enable CONFIG_USB_DEBUG.

If you really want to see the ehci_vdbg messages then #define EHCI_VERBOSE_DEBUG, i.e., change the #undef to #define.
Comment 88 Alan Stern 2008-05-23 14:02:36 UTC
Can we close this bug report?
Comment 89 Dominique Dumont 2008-05-24 00:46:41 UTC
Sorry, I forgot to update this bug report. 

I couldn't get failure on 2.6.25-rc kernel for 2 hours. Then I had to switch back to 2.6.24.

To be really sure that the bug is fixed, I'd like to test this with 2.6.25 for everyday use. But in this case, I need a working ATI driver (fglrx). This one was released by ATI this week and should be available soon for Debian. So I should get back to you next week or so for an update.

Until then, please don't close this bug.

Thanks for your patience.
Comment 90 Stephen Evanchik 2008-05-30 04:44:36 UTC
I have had this problem for quite some time as well. I am running 2.6.25.4 now and have not experienced any hangs with my external USB disk.
Comment 91 Dominique Dumont 2008-06-21 09:26:57 UTC
Hello

With 2.6.25.6 (Debian sid) kernel, I've been reading and writing to my USB disks during this week without problems when they are directly connected to the machine (through SB600).

When connected through a hub, I sometime get errors when connecting a thumb drive to the hub. I cannot say if it's a HW problem or software problem. The good news is that the USB systems comes back alive after 10 or 20 seconds. After power cycling the disk and the hub, the usb and usb-storage will show them correctly.

The only hitch is that udev mounts them under a different name. For instance
 
  /dev/disk/by-id/usb-MAXTOR_S_TM3320820A_00-0:0

instead of

  /dev/disk/by-id/usb-MAXTOR_S_TM3320820A_0-0:0

In any case, this is a minor issue compared to the USB hang. Do you want me to open another bug for this issue ? (so you can close this one)

Thanks for all your help
Comment 92 Alan Stern 2008-06-22 10:41:11 UTC
The different name probably reflects the different physical port being used by the device.  You ought to be able to fix it by altering the appropriate udev script.  In any event, it's a udev issue rather than a kernel issue so it doesn't below in the kernel bugzilla.

If you're happy with the way the driver is working, we can close this bug report.
Comment 93 Dominique Dumont 2008-06-22 12:13:44 UTC
Yes, you can close this bug.

Thanks for your help

:-)

Note You need to log in before you can comment on or make changes to this bug.