Distribution: Gentoo 2005.0
Hardware Environment: MSI 6195 and MSI 6167 mainboards
Software Environment: camserv, camorama, and gyache-enhanced
Problem Description: After about 10 to 30 minutes of using my webcam with the qc-usb module, USB stops responding and irq 10 is shut down with a "nobody cared" message. All devices on that irq stop working as a result.
Steps to reproduce:
1. Boot Linux
2. Load the qc-usb module
3. Load camorama version 0.17
4. Watch for 10 to 30 minutes
5. I get: irq 10: nobody cared!
6. Reboot the computer to get all devices that use irq 10 working again.
Created attachment 4927 [details] kernel log at the time of USB error
Created attachment 4928 [details] Kernel log at boot time
This bug is not reproducible on an HP PIII 900 with a VIA chipset using the UHCI controller.
Created attachment 4929 [details] These are the kernel messages from the MSI-6167 mainboard at bootup.
Created attachment 4930 [details] This is the kernel config for the MSI-6167 mainboard system
Created attachment 4931 [details] Kernel config for MSI-6195 mainboard
When noirqdebug is set as a kernel boot option there are no longer any problems with the IRQ being shut down or the computer locking up on me.
Adding the kernel option noirqdebug did not resolve the issue with USB. It only made the bug harder to trigger. With further testing on the AMD 756 USB OHCI_HCD controller I have now been able to crash it using a scanner which uses libusb, and with a USB mass storage device running a script that I am attaching. What I have done is fill the USB device with a file containing all zeros. Then I cat the contents of that file to /dev/null. It does crash the USB controller, and the computer. The bug is driver independent. It can occur with ANY USB device plugged into the AMD 756 USB OHCI_HCD controller that can stream data over a long period of time. This bug does need to be fixed.
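The stress procedure described above (fill the device with a file of zeros, then read it back and throw the data away) can be sketched roughly as follows. This is a minimal illustration in Python, not the attached script; the function names, the chunk size, and the target path are assumptions made for the example.

```python
import os
import tempfile

CHUNK = 1 << 20  # 1 MiB per read/write; an arbitrary choice for this sketch


def make_zero_file(path, size_bytes):
    """Create a file at `path` containing `size_bytes` zero bytes."""
    zeros = bytes(CHUNK)
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(CHUNK, remaining)
            f.write(zeros[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out to the device


def read_and_discard(path, passes=1):
    """Read the file `passes` times and discard the data (like cat > /dev/null).

    Returns the total number of bytes read.
    """
    total = 0
    for _ in range(passes):
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK)
                if not chunk:
                    break
                total += len(chunk)
    return total


if __name__ == "__main__":
    # Hypothetical usage: replace the temporary directory with the mount
    # point of the USB mass storage device under test.
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "zeros.bin")
        make_zero_file(target, 4 * CHUNK)
        print(read_and_discard(target, passes=3))
```

Note that repeated reads of a small file may be served from the page cache instead of the device, so the file probably needs to be larger than available RAM (or the device remounted between passes) for the reads to actually keep the USB controller busy.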
Created attachment 4961 [details] stress script for usb mass storage
This bug has to do with isochronous transfers. Any driver that uses isochronous transfers will crash the hardware for the AMD 756 OHCI USB controller.
Dropping priority, as the issue is by all reports specific to this particular no-longer-manufactured chipset. Your comment #8 says this is reproducible without requiring the out-of-tree quickcam driver. Can you provide kernel logs for that failure happening? Right now this bug report is rather uselessly thin on details.
Created attachment 5212 [details] These are emails from the usb-devel mailing list documenting the problem.
I have had the computer on for a while and have had gnome-pilot running the entire time. I have been able to get this error message:

ohci_hcd 0000:00:07.4: bogus NDP=255, rereads as NDP=4

The difference is that IRQs are not being submitted and processed. I think I can crash the USB provided I am transferring data (IRQs are being submitted). I doubt the data rate matters, since just simple polling of the bus seems to cause a bogus NDP. I know my board doesn't have 255 USB ports on the root hub. :)

I studied the code where these error messages are generated, and there aren't supposed to be IRQs being submitted while these statuses are being checked. Hmmm, to me it sounds like an irq race issue. These error messages are all in the ohci-hub.c file, in the ohci_hub_status_data() function. The first thing done in the function is a call to spin_lock_irqsave().

One more thing: I used the spca5xx driver. It is out of tree, but it killed the root hub in exactly the same way as the qc-usb driver. I would like to test an in-tree device that does isochronous transfers. I did test with the 2.6.12-rc3 kernel. The results from that are in the big attachment.
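For anyone comparing logs across test runs, here is a small sketch that pulls the two messages quoted in this report ("bogus NDP=..., rereads as NDP=..." and "irq N: nobody cared") out of kernel log text. It only assumes those two message formats as they appear above; the function name and the returned structure are made up for illustration.

```python
import re

# Patterns based only on the two messages quoted in this report.
BOGUS_NDP = re.compile(r"bogus NDP=(\d+), rereads as NDP=(\d+)")
NOBODY_CARED = re.compile(r"irq (\d+): nobody cared")


def scan_kernel_log(lines):
    """Collect bogus-NDP readings and disabled IRQ numbers from log lines."""
    bogus = []       # list of (first_read, reread) port counts
    dead_irqs = []   # IRQ numbers reported as "nobody cared"
    for line in lines:
        m = BOGUS_NDP.search(line)
        if m:
            bogus.append((int(m.group(1)), int(m.group(2))))
        m = NOBODY_CARED.search(line)
        if m:
            dead_irqs.append(int(m.group(1)))
    return {"bogus_ndp": bogus, "nobody_cared": dead_irqs}


if __name__ == "__main__":
    sample = [
        "ohci_hcd 0000:00:07.4: bogus NDP=255, rereads as NDP=4",
        "irq 10: nobody cared!",
    ]
    print(scan_kernel_log(sample))
```

Feeding it the output of dmesg (or a saved log file, one line per list element) would show whether the bogus NDP readings always precede the IRQ being disabled, which is the pattern being suggested here.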
Still an issue on 2.6.13-rc6 or greater?
This is still a bug with the 2.6.13-rc6 kernel. I am attaching the output of this version of the kernel with USB debug enabled. The second attachment has irqpoll enabled.
Created attachment 5692 [details] The kernel output from device attachment until irq disabled.
Created attachment 5693 [details] This is with the option irqpoll set as a kernel option at boot.
Created attachment 5759 [details] ohci resume detect irq patch

It'd be really nice if this problem appeared on other hardware; I'm inclined to drop the priority again...

I re-read the chip "revision guide", and one point it mentioned to address one of the relevant errata was to update the BIOS. Are you running the current BIOS for your amd756 board?

Also, missing information here seems to be just which revision of the amd756 chip you're using. Erratum 6 says how to figure that out.

And along the same lines ... you should reproduce this with some in-tree driver, and without the NVidia driver tainting the kernel. The in-tree drivers are at least more widely understood, if not better debugged, than the out-of-tree ones. And not everyone gets ISO right either...

That said, here's the only interrupt-related patch to OHCI that's appeared for some time. Maybe it'll help.
Created attachment 5760 [details] tweak "bogus NDP" messages

And here's one more patch to try. The "bogus NDP" messages should be no more than a minor annoyance (== safe to ignore) but here's a patch that might improve behavior in their vicinity.

Again, this looks to be a case where you may have a revision of the amd756 chip that isn't addressed by the amd756 workaround code we now have (supplied by AMD) ...

You never posted results of trying some non-AMD OHCI controller (i.e. some PCI card) on that motherboard. Did you try? Did it act the same?
My AMD USB revision is:

0000:00:07.4 USB Controller: Advanced Micro Devices [AMD] AMD-756 [Viper] USB (rev 06)

I got that from lspci, but I have also looked at the IC itself, and it is a revision D4. I hope that helps. Both computers that I have use the same revision of the AMD-756.

I tested another OHCI card. I tested it a while back, around April, and I have used my webcam with it without any crashes. In fact the USB on my ALi USB controller card is a lot more stable than my AMD-756, and it is also EHCI.

I would like to test with a USB camera whose driver is in the kernel tree. I am like you in that I trust the in-tree drivers a lot more, because more testing has been done with them. If you can recommend a camera to buy for testing, I will buy it. I like testing software and hardware because it helps make them less buggy.

I have thought about installing Windows 98 to test the USB controller with my USB camera, because the Windows 98 driver was written by those who designed the hardware. There could be a timing issue between the AMD 756 and the camera that isn't quite right. I am not sure.

In the last output I gave, there was no NVidia driver loaded into the kernel. I was using an ATI Mach64 card because the NVidia driver doesn't work. Something to do with access to kernel interfaces being changed. I do want a camera that has its driver in the kernel tree, because the in-tree drivers are always current with what is being developed in the kernel.
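As a rough aid for reporting the same information from other machines, the lspci line quoted above can be split into its slot, device class, name, and revision with a sketch like the one below. The regular expression is an assumption based only on the single line shown in this comment, and the function name is hypothetical. The value lspci prints is the PCI revision ID, which need not match the stepping printed on the chip package (rev 06 versus D4 in this case).

```python
import re

# Pattern derived from the one lspci line quoted in this report:
#   <slot> <class>: <name> (rev <hex>)
LSPCI_LINE = re.compile(
    r"^(?P<slot>\S+)\s+(?P<cls>[^:]+):\s+(?P<name>.*?)"
    r"(?:\s+\(rev (?P<rev>[0-9a-fA-F]+)\))?$"
)


def parse_lspci_line(line):
    """Return slot, class, name and PCI revision (int or None), or None."""
    m = LSPCI_LINE.match(line.strip())
    if not m:
        return None
    rev = m.group("rev")
    return {
        "slot": m.group("slot"),
        "class": m.group("cls"),
        "name": m.group("name"),
        "rev": int(rev, 16) if rev is not None else None,
    }


if __name__ == "__main__":
    line = ("0000:00:07.4 USB Controller: Advanced Micro Devices [AMD] "
            "AMD-756 [Viper] USB (rev 06)")
    print(parse_lspci_line(line))
```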
I didn't apply the irq patch because one of the workarounds for the AMD-756 already disables resume from a low-power state. I have not had this problem with my other OHCI card. I am also not concerned about the other messages from the kernel. I do have the latest BIOS that was released for my mainboard. I am not quite sure how to read the revision code from the silicon, though. I know the revision is D-something, but I don't know which D revision it is.
I am testing the last patch, the one that deals with the bogus NDP messages. It seems to always crash at the bogus NDP error. Thanks for the patches; perhaps they will bring some luck.
I tested the bottom patch, and it still crashes. Maybe I could find a different camera to test that is in the kernel tree.
I may be closing the bug in a couple of days. I may have narrowed the problem down to RAM. This would be related to bug 21 in the AMD-751 Northbridge controller. There are memory timing issues with the chipset, and for the system to be stable you have to use DIMMs that meet the requirements in bug 21. I am still testing for a while, but 13 hours without a crash is a good run for isochronous transfers from the webcam. :) I do recommend the code fix for the bogus NDP.
Oh, I should have said erratum 21 for the AMD 751 Northbridge.
So -- time to close this out as a hardware problem??? I'm dropping severity to "normal", in any case, since it's so hardware-specific and rare. And I'm not sure what you mean about the NDP messages. Does the current (2.6.13-git13) kernel have an issue there?
I changed some RAM in one computer and enabled super bypass on the northbridge, and from what I have seen since, it seems as though there never was a USB problem on that machine. However, my other computer, which I have been doing a lot more testing with, still has the USB crash. The only difference is the amount of time it takes to write to RAM. The bug is still there. The USB can still crash, but right now I wonder if I am experiencing a race problem with the driver itself. I am going to look into the driver and make sure it isn't trying to do two reads at the same time, and whether there is a check to prevent such operations from occurring. I do agree that this bug should be marked as normal.
Sounds to me like you had hardware problems plus maybe a bug in that ISO driver. If you still expect action on this bug, rather than just closing it out as bad (memory) hardware, I'd still need answers to some of the questions I've asked:

- test results with current kernel (2.6.14-rc1)
- clarification on what you mean with the NDP thing
- chiprev information (see the errata for how to determine that)

Boards old enough to care about "super bypass" have memory bandwidth issues that Linux can't do much about.