Bug 4500

Summary:	OHCI wedge with AMD756
Product:	Drivers	Reporter:	Brian Beardall (rapsure)
Component:	USB	Assignee:	David Brownell (dbrownell)
Status:	REJECTED INSUFFICIENT_DATA
Severity:	normal	CC:	bunk
Priority:	P2
Hardware:	i386
OS:	Linux
Kernel Version:	2.6.11 2.6.12 2.6.13_rc6	Subsystem:
Regression:	---	Bisected commit-id:
Bug Depends on:
Bug Blocks:	5089
Attachments:	kernel log at the time of USB error
	Kernel log at boot time
	These are the kernel messages from the MSI-6167 mainboard at bootup.
	This is the kernel config for the MSI-6167 mainboard system
	Kernel config for MSI-6195 mainboard
	stress script for usb mass storage
	These are email from usb devel mailing list documenting the problem.
	The kernel output from device attachment until irq disabled.
	This is with the option irqpoll set as a kernel option at boot.
	ohci resume detect irq patch
	tweak "bogus NDP' messages

Description Brian Beardall 2005-04-15 08:33:34 UTC

Distribution: Gentoo 2005.0
Hardware Environment: MSI 6195, and MSI 6167 mainboards
Software Environment: camserv, camorama, and gyache-enhanced
Problem Description: After about 10 - 30 minutes of using my webcam with the
qc-usb modules, the USB stops responding and irq 10 is shut down with an
"irq 10: nobody cared!" message.  All devices on that irq stop working as a
result.

Steps to reproduce:
1.  Boot Linux
2.  Load qc-usb module
3.  Load camorama version 0.17
4.  watch for 10 - 30 minutes
5.  I get:  irq 10: nobody cared!
6.  reboot computer to get all devices that use irq 10 to work again.
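
Since every device sharing the line is affected, it can help to record which
devices sit on irq 10 before the failure.  A minimal sketch, assuming the
2.6-era /proc/interrupts layout (one count column per CPU, one controller
column, then a comma-separated device list).  The function name and the sample
text are hypothetical and are not taken from the attached logs:

```python
# Hypothetical sample in the 2.6-era /proc/interrupts layout (not real data
# from this bug report).
SAMPLE = """\
           CPU0
  0:     123456          XT-PIC  timer
 10:      54321          XT-PIC  ohci_hcd:usb1, eth0
NMI:          0
"""


def devices_on_irq(interrupts_text, irq):
    """Return the device names listed for one IRQ line, or [] if none."""
    lines = interrupts_text.splitlines()
    if not lines:
        return []
    # The header carries one "CPUn" label per CPU, i.e. one count column each.
    ncpus = len(lines[0].split())
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep or label.strip() != str(irq):
            continue
        # Assumed fields: <count per CPU> <controller> <device list>
        fields = rest.split(None, ncpus + 1)
        if len(fields) < ncpus + 2:
            return []
        return [name.strip() for name in fields[-1].split(",")]
    return []
```

On a live system the text would come from reading /proc/interrupts.  Newer
kernels print extra columns, so the field split would need adjusting there.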

Comment 1 Brian Beardall 2005-04-15 08:35:08 UTC

Created attachment 4927 [details]
kernel log at the time of USB error

Comment 2 Brian Beardall 2005-04-15 08:36:04 UTC

Created attachment 4928 [details]
Kernel log at boot time

Comment 3 Brian Beardall 2005-04-15 08:38:38 UTC

This bug is not reproducible on an HP PIII 900 with a VIA chipset using the UHCI
controller.

Comment 4 Brian Beardall 2005-04-15 14:09:05 UTC

Created attachment 4929 [details]
These are the kernel messages from the MSI-6167 mainboard at bootup.

Comment 5 Brian Beardall 2005-04-15 14:12:39 UTC

Created attachment 4930 [details]
This is the kernel config for the MSI-6167 mainboard system

Comment 6 Brian Beardall 2005-04-15 14:14:08 UTC

Created attachment 4931 [details]
Kernel config for MSI-6195 mainboard

Comment 7 Brian Beardall 2005-04-17 07:39:18 UTC

When noirqdebug is set as a kernel boot option there are no longer any problems
with the IRQ being shut down or the computer locking up on me.

Comment 8 Brian Beardall 2005-04-20 19:38:29 UTC

Adding the kernel option noirqdebug did not resolve the issue with USB.  It
just made the bug harder to trigger.  With further testing on the AMD756 USB
OHCI_HCD controller I have now been able to crash it using a scanner which uses
libusb, and with a USB mass storage device running a program that I will post.
What I have done is fill the USB device with a file of all zeros.  Then I cat
the contents of that file to /dev/null.  It does crash the USB controller, and
the computer.  The bug is driver independent.  It can occur with ANY USB device
that can stream data over a long period of time that is plugged into the USB
OHCI_HCD AMD 756 controller.  This bug does need to be fixed.
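
The zero-fill-then-read procedure described above can be sketched roughly as
follows.  This is not the attached stress script; the function names, chunk
size and pass count are assumptions made for illustration:

```python
import os

CHUNK = 1024 * 1024  # 1 MiB per read/write call; an arbitrary choice


def write_zero_file(path, size_bytes):
    """Create a file of size_bytes zero bytes and push it out to the device."""
    zeros = bytes(CHUNK)
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(CHUNK, remaining)
            f.write(zeros[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())


def read_and_discard(path):
    """Read the whole file and throw the data away, like cat to /dev/null."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            total += len(chunk)
    return total


def stress(path, size_bytes, passes):
    """Write the zero file once, then read it back `passes` times."""
    write_zero_file(path, size_bytes)
    return sum(read_and_discard(path) for _ in range(passes))
```

The path would point at a file on the mounted USB mass storage device.  Unless
the file is larger than RAM (or the device is remounted between passes), later
reads may be served from the page cache and never reach the USB controller.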

Comment 9 Brian Beardall 2005-04-20 19:41:07 UTC

Created attachment 4961 [details]
stress script for usb mass storage

Comment 10 Brian Beardall 2005-04-25 13:16:25 UTC

This bug has to do with isochronous transfers.  Any driver that uses isochronous
transfers will crash the hardware for the AMD 756 OHCI USB controller.

Comment 11 David Brownell 2005-06-15 06:14:09 UTC

Dropping priority, as the issue is by all reports specific 
to this particular no-longer-manufactured chipset. 
 
Your comment #8 says this is reproducible without requiring 
the out-of-tree quickcam driver.  Can you provide kernel logs 
for that failure happening?  Right now this bug report is 
rather uselessly thin on details.

Comment 12 Brian Beardall 2005-06-24 22:30:35 UTC

Created attachment 5212 [details]
These are email from usb devel mailing list documenting the problem.

Comment 13 Brian Beardall 2005-06-24 22:46:34 UTC

I have had the computer on for a while and have had gnome-pilot running the
entire time.  I have been able to get this error message: 
ohci_hcd 0000:00:07.4: bogus NDP=255, rereads as NDP=4

The difference is that irqs are not being submitted and processed.  I think I
can crash the usb provided I am transferring data (irqs are being submitted).  I
doubt the data rate matters since just simple polling of the bus seems to cause
a bogus NDP.  I know my board doesn't have 255 usb ports on the root hub. :)  I
studied the code where these error messages are generated, and there aren't
supposed to be irqs being submitted while these statuses are being checked.
Hmmm, to me it sounds like an irq race issue.  These error messages are all in
the ohci-hub.c file under the ohci_hub_status_data() function.  The first thing
done in the function is to call spin_lock_irqsave();
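
To see whether the bogus NDP messages line up with the irq being disabled, the
two message types quoted in this report can be pulled out of a saved kernel
log.  A small sketch; the function name is hypothetical and the patterns match
only the exact wording quoted above:

```python
import re

# Patterns for the two messages quoted in this report.
NDP_RE = re.compile(r"bogus NDP=(\d+), rereads as NDP=(\d+)")
IRQ_RE = re.compile(r"irq (\d+): nobody cared")


def scan_kernel_log(text):
    """Return ([(first_ndp, reread_ndp), ...], [disabled_irq, ...])."""
    ndp_events = []
    disabled_irqs = []
    for line in text.splitlines():
        m = NDP_RE.search(line)
        if m:
            ndp_events.append((int(m.group(1)), int(m.group(2))))
        m = IRQ_RE.search(line)
        if m:
            disabled_irqs.append(int(m.group(1)))
    return ndp_events, disabled_irqs
```

A first value of 255 means all eight bits of the port-count field read back as
ones, which is more consistent with a bad register read than with a real port
count.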

One more thing.  I used the spca5xx driver.  It is out of the tree, but killed
the root hub exactly the same as the qc-usb driver.  I would like to test an in
tree device that does isochronous transfers.  I did test with the 2.6.12-rc3
kernel.  The results from that are in the big attachment.

Comment 14 Greg Kroah-Hartman 2005-08-18 21:46:18 UTC

Still an issue on 2.6.13-rc6 or greater?

Comment 15 Brian Beardall 2005-08-19 22:42:49 UTC

This is still a bug with the 2.6.13_rc6 kernel. I am attaching the output of
this version of the kernel with USB debug enabled.  The second attachment has
irqpoll enabled.

Comment 16 Brian Beardall 2005-08-19 22:44:05 UTC

Created attachment 5692 [details]
The kernel output from device attachment until irq disabled.

Comment 17 Brian Beardall 2005-08-19 22:45:13 UTC

Created attachment 5693 [details]
This is with the option irqpoll set as a kernel option at boot.

Comment 18 David Brownell 2005-08-25 09:53:41 UTC

Created attachment 5759 [details]
ohci resume detect irq patch

It'd be really nice if this problem appeared on other hardware;
I'm inclined to drop the priority again...

I re-read the chip "revision guide", and one point it mentioned
to address one of the relevant errata was to update the BIOS.
Are you running the current BIOS for your amd756 board?

Also, missing information here seems to be just which revision
of the amd756 chip you're using.  Erratum 6 says how to figure
that out.

And along the same lines ... you should reproduce this with some
in-tree driver, and without the NVidia driver tainting the kernel.
The in-tree drivers are at least more widely understood, if not
better debugged, than the out-of-tree ones.  And not everyone gets
ISO right either...

That said, here's the only interrupt-related patch to OHCI that's
appeared for some time.  Maybe it'll help.

Comment 19 David Brownell 2005-08-25 10:00:40 UTC

Created attachment 5760 [details]
tweak "bogus NDP' messages

And here's one more patch to try.  The "bogus NDP" messages
should be no more than a minor annoyance (== safe to ignore)
but here's a patch that might improve behavior in their
vicinity.

Again, this looks to be a case where you may have a revision of
the amd756 chip that isn't addressed by the amd756 workaround
code we now have (supplied by AMD) ...

You never posted results of trying some non-AMD OHCI controller
(i.e. some PCI card) on that motherboard.  Did you try?  Did it
act the same?

Comment 20 Brian Beardall 2005-08-25 18:24:24 UTC

My AMD USB revision is: 0000:00:07.4 USB Controller: Advanced Micro Devices
[AMD] AMD-756 [Viper] USB (rev 06)

I got that from lspci, but I have looked at the IC itself, and it is a revision
D4.  I hope that helps.  Both computers that I have use the same revision for
the AMD-756.
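
For comparing the two boards, the PCI revision can be extracted from the lspci
line quoted above.  A minimal sketch with a hypothetical helper name.  Note
that this is the PCI revision ID (06 here), which is a separate number from the
silicon marking printed on the chip (D4):

```python
import re

# lspci prints the PCI revision ID as two hex digits, e.g. "(rev 06)".
REV_RE = re.compile(r"\(rev ([0-9a-fA-F]{2})\)")


def pci_revision(lspci_line):
    """Return the revision ID as an int, or None if the line has no (rev xx)."""
    m = REV_RE.search(lspci_line)
    return int(m.group(1), 16) if m else None
```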

I tested another OHCI card.  I tested it a while back, in April, and I have
used my webcam with it without any crashes.  In fact the USB on my ALi USB
controller card is a lot more stable than my AMD-756, and it is also EHCI.

I would like to use a USB camera to test that has a driver that is in the kernel
tree.  I am like you in that I trust those internal kernel tree drivers a lot
more because there has been more testing done with them.  If you would recommend
me a camera to buy to test then I will buy it.  I like testing software/hardware
because then it can be less buggy.  I have thought about installing Windows 98
to test the USB controller with my USB camera because the driver has been
written for Windows 98 by those who designed the hardware.  There could be a
timing issue that isn't quite correct with AMD 756, and the camera.  I am not sure.

In the last output I gave there was no NVidia driver loaded into the kernel.  I
was using an ATI Mach64 card because the NVidia driver doesn't work.  Something
to do with access to kernel interfaces being changed.

I do want a camera that has its driver in the kernel tree, because the in-tree
drivers are always current with what is being developed in the kernel.

Comment 21 Brian Beardall 2005-08-25 18:44:40 UTC

I didn't add the irq patch because one of the workarounds for the AMD-756
already disables resume from a low power state.  I have not had this problem
with my other OHCI card.  I am also not concerned about the other messages from
the kernel.  I do have the latest BIOS that was released for my mainboard.  I am
not quite sure how to get the revision code from the silicon though.  I am
actually not quite sure what the revision is, but it is D?  I don't know what
revision of D it is.

Comment 22 Brian Beardall 2005-08-25 20:53:06 UTC

I am testing the last patch that deals with the bogus NDP.  It seems to always
crash at the bogus NDP error.  Thanks for the patches; with luck one of them
will help.

Comment 23 Brian Beardall 2005-08-25 21:52:15 UTC

I tested the bottom patch, and it still crashes.  Maybe I could find a different
camera to test that is in the kernel tree.

Comment 24 Brian Beardall 2005-09-03 12:00:20 UTC

I may be closing the bug in a couple of days.  I may have narrowed the problem
to RAM problems.  This would be related to bug 21 in the AMD-751 Northbridge
controller.  There are memory timing issues with the chipset, and to have system
stability on the chipset you have to have DIMMs that meet the requirements in
bug 21.  I am still testing for a while, but 13 hours of no crashing is a good
time for using isochronous transfers for the webcam. :) I do recommend the code
fix for the bogus NDP.

Comment 25 Brian Beardall 2005-09-03 12:37:26 UTC

Oh, I should have said erratum 21 for the AMD 751 Northbridge.

Comment 26 David Brownell 2005-09-12 23:09:53 UTC

So -- time to close this out as a hardware problem???

I'm dropping severity to "normal", in any case, since it's so
hardware-specific and rare.

And I'm not sure what you mean about the NDP messages.
Does the current (2.6.13-git13) kernel have an issue there?

Comment 27 Brian Beardall 2005-09-13 16:53:48 UTC

I changed some RAM in my one computer, and enabled super bypass on the
northbridge, and from what I have done it seems as though there never was a USB
problem.  However, my other computer, which I have been doing a lot more testing
with, still has the USB crash.  The only difference is the amount of time it
takes to write to RAM.  The bug is still there.  The USB can still crash, but
right now I wonder if I am experiencing a race problem with the driver itself.
I am going to check into the driver, and make sure it isn't trying to read two
times at the same time, and if there is a check to prevent such operations from
occurring.  I do agree that this bug should be marked as normal.

Comment 28 David Brownell 2005-09-14 07:02:43 UTC

Sounds to me like you had hardware problems plus maybe a bug in
that ISO driver.

If you still expect action on this bug, rather than just closing
this bug out as bad (memory) hardware, I'd still be needing answers
to some of the questions I've asked:

 - test results with current kernel (2.6.14-rc1)
 - clarification on what you mean with the NDP thing
 - chiprev information (see the errata for how to determine that)

Boards old enough to care about "super bypass" have memory bandwidth
issues that Linux can't do much about.