This is a regression that started in 2.6.15; 2.6.14 booted OK every time (the handoff option has not been tried there).

2.6.15 and newer hang on warm boot at this line:

    io scheduler cfq registered

Cold boot works fine, and the problem doesn't happen on *every* warm boot (but it does occur most of the time).

Further investigation shows that it hangs in quirk_usb_disable_ehci() at this point:

    /* always say Linux will own the hardware
     * by setting EHCI_USBLEGSUP_OS.
     */
    pci_write_config_byte(pdev, offset + 3, 1);

The pci_write_config_byte() call never returns. This happens on the first iteration of the loop.

This is an ICH-based system with a Pentium 4 CPU.

Downstream bug is at http://bugs.gentoo.org/show_bug.cgi?id=122277

I'm reporting this on behalf of Tersius.
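For readers unfamiliar with the register, here is a minimal sketch of what that one-byte write does to the 32-bit value at `offset`. It uses a simulated, little-endian config space rather than real PCI accesses: the `sim_*` helpers, `claim_for_os()` and `CFG_SIZE` are hypothetical names made up for illustration, and only the bit position of EHCI_USBLEGSUP_OS (bit 24) is taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

#define CFG_SIZE 256                    /* size of the simulated config space */
#define EHCI_USBLEGSUP_OS (1u << 24)    /* OS-owned semaphore: byte 3, bit 0 */

/* Hypothetical stand-in for a config-space dword read (little-endian). */
static uint32_t sim_read_config_dword(const uint8_t *cfg, unsigned int where)
{
    return (uint32_t)cfg[where]
         | ((uint32_t)cfg[where + 1] << 8)
         | ((uint32_t)cfg[where + 2] << 16)
         | ((uint32_t)cfg[where + 3] << 24);
}

/* Hypothetical stand-in for a config-space byte write. */
static void sim_write_config_byte(uint8_t *cfg, unsigned int where, uint8_t val)
{
    cfg[where] = val;
}

/* Mirror the write from the report, then return the resulting dword. */
static uint32_t claim_for_os(uint8_t *cfg, unsigned int offset)
{
    assert(offset + 3 < CFG_SIZE);
    sim_write_config_byte(cfg, offset + 3, 1);
    return sim_read_config_dword(cfg, offset);
}
```

If the simulated register starts out as 0x00000001, `claim_for_os()` leaves it at 0x01000001, i.e. with EHCI_USBLEGSUP_OS set. On real hardware the same write may also trigger BIOS activity, which this sketch does not model.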
Created attachment 7398: dmesg from 2.6.16-rc4
dmesg from a cold boot
Earlier you said that the patch in http://bugzilla.kernel.org/show_bug.cgi?id=6011#c12 didn't help. Is that still true? 2.6.14 contains essentially identical code in drivers/usb/host/ehci-hcd.c in the bios_handoff() routine. It gets called even when you don't specify the "usb-handoff" boot-line parameter. Can you confirm that the code works by adding printk statements? If it does, this indicates that the problem is not in the EHCI driver but somewhere else entirely -- maybe when control of the device is handed back to the BIOS during rebooting.
Upgrade that BIOS. It's clear that this is a BIOS bug; it works on cold boot, but not otherwise... ergo, the BIOS is doing something different.
By the way, it's not at all clear that the downstream bug is really against 2.6.16-rc4 ... the dmesg you attached, from rc4, shows everything working just fine. So judging by that evidence, there is no bug!
The dmesg was for a cold boot, but the problem supposedly shows up only during warm boots. Checking for a BIOS upgrade indeed should be the first thing tried.
Alan: yes, the patch in http://bugzilla.kernel.org/show_bug.cgi?id=6011#c12 does not solve the problem. The problematic function call is a little below the code disabled by that patch. We didn't patch this in manually as it is already included in 2.6.16-rc4.

David: the dmesg attached was from a cold boot, which works OK. The hang only happens when you reboot, and it does not happen every single time.

I will get the user to test 2.6.14 and to hunt for BIOS updates. Thanks for your input.
David Brownell <david-b@pacbell.net> wrote:
>
> On Saturday 18 February 2006 12:49 pm, Andrew Morton wrote:
> >
> > I thought we'd fixed this hang? It's been known about for a
> > month or so?
>
> Seems like a slightly different case, with a similar symptom. One
> that's still caused by the BIOS, of course:
> >
> > Cold boot works fine, and this problem doesn't happen on *every* warm boot (but
> > does occur most of the time).
>
> Which means that it's a BIOS bug that doesn't trigger in all cases.
>
> However, it's also confusing that the bug doesn't match the downstream
> bug that allegedly corresponds to this. In particular, this report
> claims the issue appears with 2.6.14-rc4, and the downstream bug
> describes older kernels _without_ a patch that addresses a similar
> bug that also "hangs at this point".
>
> Are we sure the bug report is really against 2.6.14-rc4? That seems
> oddly fast for a bug to get through the Gentoo bug database and then
> get bounced up to the osdl database ...
>

There isn't much point in asking me.. Cc's added.

>
> > Further investigation shows that it hangs in quirk_usb_disable_ehci() at this point:
> >
> > /* always say Linux will own the hardware
> > * by setting EHCI_USBLEGSUP_OS.
> > */
> > pci_write_config_byte(pdev, offset + 3, 1);
> >
> > The pci_write_config_byte call never returns. This happens on the first
> > iteration of the loop.
David Brownell wrote:
> However, it's also confusing that the bug doesn't match the downstream
> bug that allegedly corresponds to this. In particular, this report
> claims the issue appears with 2.6.14-rc4, and the downstream bug
> describes older kernels _without_ a patch that addresses a similar
> bug that also "hangs at this point".
>
> Are we sure the bug report is really against 2.6.14-rc4? That seems
> oddly fast for a bug to get through the Gentoo bug database and then
> get bounced up to the osdl database ...

I assume you meant 2.6.16-rc4 rather than 2.6.14-rc4.

I'll try to clarify what has gone on in the downstream bug (http://bugs.gentoo.org/122277) from the top:

- The user ran 2.6.14 (and older kernels) without problem.
- 2.6.15 came out, and the user upgraded through the package system.
- 2.6.15 introduced this problem where it would hang during bootup on most warm boots. It looks like this happened because early handoff was enabled by default in 2.6.15 whereas it was not in 2.6.14, although the exact cause of the problem was not known at this point. The user filed the bug report at http://bugs.gentoo.org/122277 and you can see the point of the hang in comment #2.
- You then reworked the EHCI early handoff in the 2.6.16-rc tree based on feedback from others. This didn't solve the problem. See comment #4. At this point the user was running 2.6.16-rc2 and the point of the hang changed to a message about the io scheduler, but it became clearer later that the hang actually occurs within the usb pci-quirks file.
- You then disabled the SMI control code in the EHCI handoff. This didn't solve the problem. See comment #9. (There is no feedback shown on the bug as we discussed this on IRC.)
- I posted a comment at http://bugzilla.kernel.org/show_bug.cgi?id=6011#c17 describing the situation and Alan Stern gave some pointers towards diagnosing it.
- I created 2 patches (comment #12 and comment #15) which narrowed down the exact function call which was causing the hang. At this point, the user was testing 2.6.16-rc4 and applying my patch on top.
- Not really knowing what to do next, I filed the bug report found at http://bugzilla.kernel.org/show_bug.cgi?id=6098

As for speed, Gentoo immediately include every non-rc kernel release in the testing tree (which is very popular), and typically move those releases into the stable tree 2-3 weeks later. If a bug is reported and a fix is not obvious, we ask the user to reproduce it on the latest development kernel (in this case 2.6.16-rc4) and then file a bug at the kernel bugzilla. This is the kind of process that has happened here.

I guess I may have confused you by making you think the bug was *introduced* in 2.6.16-rc4? If so, I apologise: it probably existed in 2.6.14 and older if the handoff parameter was used. I was just following the usual procedure of making sure the problem hasn't been fixed in the latest bleeding edge kernel before troubling you with it. I'll be clearer in future :)

Daniel
Just a random observation, not sure if there is any relevance:

In the quirk function the first function call inside the loop is:

    pci_read_config_dword(pdev, offset, &cap);

The lowest byte of cap is then used to decide what to do next. When the system froze, cap had the value 0x1. You can see this in the photo at http://img423.imageshack.us/img423/9526/rc4crash8my.jpg. On cold boot, cap had value 0x1000001 - you can see this in comment #1.

I know the upper bits are ignored so this won't affect the function flow, but perhaps that upper 1 bit has some interesting meaning?

I have sent the user a patch as suggested in comment #2 and will see what results that brings. Thanks again.
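As a reading aid for the two cap values, here is a small sketch that splits the dword into its fields. It assumes the USBLEGSUP layout from the EHCI specification (capability ID in bits 7:0, next-capability pointer in bits 15:8, BIOS-owned semaphore at bit 16, OS-owned semaphore at bit 24). The enum and the helper names are made up for illustration and are not the kernel's own code.

```c
#include <assert.h>
#include <stdint.h>

#define EHCI_USBLEGSUP_BIOS (1u << 16)  /* BIOS-owned semaphore */
#define EHCI_USBLEGSUP_OS   (1u << 24)  /* OS-owned semaphore */

/* Hypothetical result type for this sketch. */
enum legsup_owner { OWNER_NOBODY, OWNER_BIOS, OWNER_OS, OWNER_BOTH };

/* Lowest byte: the capability ID the quirk loop dispatches on. */
static unsigned int cap_id(uint32_t cap)
{
    return cap & 0xff;
}

/* Second byte: config-space offset of the next extended capability. */
static unsigned int cap_next(uint32_t cap)
{
    return (cap >> 8) & 0xff;
}

/* Report which side the semaphore bits say owns the controller. */
static enum legsup_owner decode_owner(uint32_t cap)
{
    int bios = (cap & EHCI_USBLEGSUP_BIOS) != 0;
    int os = (cap & EHCI_USBLEGSUP_OS) != 0;

    /* Only meaningful for the legacy-support capability (ID 1). */
    assert(cap_id(cap) == 1);

    if (bios && os)
        return OWNER_BOTH;
    if (bios)
        return OWNER_BIOS;
    if (os)
        return OWNER_OS;
    return OWNER_NOBODY;
}
```

Under that layout, 0x1 (the warm-boot value) decodes as owned by nobody, and 0x1000001 (the cold-boot value) decodes as owned by the OS. A controller still held by the BIOS would read 0x10001.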
If there's an issue in the config space, let's see a complete dump of it: lspci -vv -xxx -s BB:SS.F should do.
"On cold boot, cap had value 0x1000001 - you can see this in comment #1. I know the upper bits are ignored so this won't affect the function flow, but perhaps that upper 1 bit has some interesting meaning?" Sort of -- it means the BIOS is pretty mixed up. That 1 in the high-order byte means that Linux owns the host controller, which obviously it doesn't at this point during a cold boot. The value of cap should be 0x10001, meaning that the BIOS owns the controller. During the warm boot, the value 0x1 means that nobody owns the controller. It's a reasonable value, if the BIOS has been configured not to support legacy USB emulation. The pci_write call that hangs merely sets the 1 in the high-order byte. This probably generates an SMI interrupt that the BIOS doesn't expect, causing the BIOS to crash. It might be interesting to know what the corresponding value is during a warm boot in 2.6.14. For that matter, if 2.6.14 and 2.6.16-rc4 can both be installed on the test machine, it might also be interesting to cold boot into one and then warm boot into the other. Try doing it each way.
The user upgraded the BIOS and disabled USB legacy support. The problem appears to have gone. I've asked him to check if enabling USB legacy support brings the problem back, or if it all works as expected.
Enabling USB Legacy support doesn't make the problem reappear. Sorry to have bothered you guys with a silly BIOS bug.