Bug 6098

Summary: EHCI handoff causes boot hang
Product: Drivers
Reporter: Daniel Drake (dsd)
Component: USB
Assignee: Greg Kroah-Hartman (greg)
Status: REJECTED INVALID
Severity: normal
CC: dbrownell, kernel, raphael, stern
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.16-rc4
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: dmesg from 2.6.16-rc4

Description Daniel Drake 2006-02-18 08:03:29 UTC
This is a regression that started in 2.6.15; 2.6.14 booted OK every time
(the handoff option was not tried on 2.6.14).

2.6.15 and newer hang on warm boot at this line:
io scheduler cfq registered

Cold boot works fine, and this problem doesn't happen on *every* warm boot (but
does occur most of the time).

Further investigation shows that it hangs in quirk_usb_disable_ehci() at this point:

			/* always say Linux will own the hardware
			 * by setting EHCI_USBLEGSUP_OS.
			 */
			pci_write_config_byte(pdev, offset + 3, 1);

The pci_write_config_byte call never returns. This happens on the first
iteration of the loop.
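
As a minimal sketch of what that one-byte write amounts to: PCI config space is little-endian, so writing the value 1 to byte offset + 3 sets bit 24 of the 32-bit register at offset. The helper name apply_byte_write and the two macros below are hypothetical, introduced only for illustration; the bit positions are assumed from the EHCI specification's USBLEGSUP layout and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed USBLEGSUP semaphore bits (EHCI specification layout). */
#define USBLEGSUP_BIOS_OWNED (UINT32_C(1) << 16)
#define USBLEGSUP_OS_OWNED   (UINT32_C(1) << 24)

/*
 * Hypothetical helper: model a one-byte config-space write as the
 * equivalent update of the little-endian 32-bit register value.
 */
static uint32_t apply_byte_write(uint32_t reg, unsigned byte_index,
				 uint8_t value)
{
	assert(byte_index < 4);

	unsigned shift = 8u * byte_index;

	/* Clear the target byte, then place the new value there. */
	return (reg & ~(UINT32_C(0xff) << shift)) | ((uint32_t)value << shift);
}
```

With the warm-boot register value 0x1, apply_byte_write(0x1, 3, 1) models the write that hangs and yields 0x01000001, i.e. the OS-owned bit set on top of the capability ID.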

This is an ICH-based system with a Pentium 4 CPU.

Downstream bug is at http://bugs.gentoo.org/show_bug.cgi?id=122277
I'm reporting this on behalf of Tersius.
Comment 1 Daniel Drake 2006-02-18 08:22:13 UTC
Created attachment 7398 [details]
dmesg from 2.6.16-rc4

dmesg from a cold boot
Comment 2 Alan Stern 2006-02-18 13:18:05 UTC
Earlier you said that the patch in
http://bugzilla.kernel.org/show_bug.cgi?id=6011#c12 didn't help.  Is that still
true?

2.6.14 contains essentially identical code in drivers/usb/host/ehci-hcd.c in the
bios_handoff() routine.  It gets called even when you don't specify the
"usb-handoff" boot-line parameter.  Can you confirm that the code works by
adding printk statements?  If it does, this indicates that the problem is not in
the EHCI driver but somewhere else entirely -- maybe when control of the device
is handed back to the BIOS during rebooting.
Comment 3 David Brownell 2006-02-18 13:22:18 UTC
Upgrade that BIOS.  It's clear that this is a BIOS bug;  
it works on cold boot, but not otherwise... ergo, the  
BIOS is doing something different.
Comment 4 David Brownell 2006-02-18 13:28:23 UTC
By the way, it's not at all clear that the downstream bug 
is really against 2.6.16-rc4 ... the dmesg you attached, 
from rc4, shows everything working just fine.  So judging 
by that evidence, there is no bug! 
Comment 5 Alan Stern 2006-02-18 13:32:44 UTC
The dmesg was for a cold boot, but the problem supposedly shows up only during
warm boots.

Checking for a BIOS upgrade indeed should be the first thing tried.
Comment 6 Daniel Drake 2006-02-18 13:45:18 UTC
Alan, yes, http://bugzilla.kernel.org/show_bug.cgi?id=6011#c12 does not solve
the problem. The problematic function call is a little below the code disabled
by that patch. We didn't patch this in manually as it is already included in
2.6.16-rc4.

David: the dmesg attached was from a cold boot which works ok. The hang only
happens when you reboot, and it does not happen every single time.

I will get the user to test 2.6.14 and to hunt for bios updates. Thanks for your
input.
Comment 7 Andrew Morton 2006-02-18 14:05:03 UTC
David Brownell <david-b@pacbell.net> wrote:
>
> On Saturday 18 February 2006 12:49 pm, Andrew Morton wrote:
> > 
> > I thought we'd fixed this hang?  It's been known about for a
> > month or so?
> 
> Seems like a slightly different case, with a similar symptom.  One
> that's still caused by the BIOS, of course:
> 
> 
> > Cold boot works fine, and this problem doesn't happen on *every* warm boot (but
> > does occur most of the time).
> 
> Which means that it's a BIOS bug that doesn't trigger in all cases.
> 
> However, it's also confusing that the bug doesn't match the downstream
> bug that allegedly corresponds to this.  In particular, this report
> claims the issue appears with 2.6.14-rc4, and the downstream bug
> describes older kernels _without_ a patch that addresses a similar
> bug that also "hangs at this point".
> 
> Are we sure the bug report is really against 2.6.14-rc4?  That seems
> oddly fast for a bug to get through the Gentoo bug database and then
> get bounced up to the osdl database ...
> 

There isn't much point in asking me..  Cc's added.

> 
> > Further investigation shows that it hangs in quirk_usb_disable_ehci() at this point:
> > 
> > 			/* always say Linux will own the hardware
> > 			 * by setting EHCI_USBLEGSUP_OS.
> > 			 */
> > 			pci_write_config_byte(pdev, offset + 3, 1);
> > 
> > The pci_write_config_byte call never returns. This happens on the first
> > iteration of the loop.

Comment 8 Daniel Drake 2006-02-18 16:29:32 UTC
David Brownell wrote:
> However, it's also confusing that the bug doesn't match the downstream
> bug that allegedly corresponds to this.  In particular, this report
> claims the issue appears with 2.6.14-rc4, and the downstream bug
> describes older kernels _without_ a patch that addresses a similar
> bug that also "hangs at this point".
> 
> Are we sure the bug report is really against 2.6.14-rc4?  That seems
> oddly fast for a bug to get through the Gentoo bug database and then
> get bounced up to the osdl database ...

I assume you meant 2.6.16-rc4 rather than 2.6.14-rc4.

I'll try to clarify what has happened on the downstream bug 
(http://bugs.gentoo.org/122277) from the top:

- The user ran 2.6.14 (and older kernels) without problem.

- 2.6.15 came out, and the user upgraded through the package system.

- 2.6.15 introduced this problem where it would hang during bootup on 
most warm boots. It looks like this happened because early handoff was 
enabled by default in 2.6.15 whereas it was not in 2.6.14, although the 
exact cause of the problem was not known at this point. The user filed 
the bug report at http://bugs.gentoo.org/122277 and you can see the point 
of the hang in comment #2.

- You then reworked the EHCI early handoff in the 2.6.16-rc tree based 
on feedback from others. This didn't solve the problem. See comment #4. 
At this point the user was running 2.6.16-rc2 and the point of the hang 
changed to a message about the io scheduler, but it became clearer later 
that the hang actually occurs within the usb pci-quirks file.

- You then disabled the SMI control code in the EHCI handoff. This 
didn't solve the problem. See comment #9. (There is no feedback shown on 
the bug as we discussed this on IRC).

- I posted a comment at http://bugzilla.kernel.org/show_bug.cgi?id=6011#c17
describing the situation and Alan Stern gave some pointers towards 
diagnosing it.

- I created 2 patches (comment #12 and comment #15) which narrowed down 
the exact function call which was causing the hang. At this point, the 
user was testing 2.6.16-rc4 and applying my patch on top.

- Not really knowing what to do next, I filed the bug report found at 
http://bugzilla.kernel.org/show_bug.cgi?id=6098

As for speed, Gentoo immediately include every non-rc kernel release in 
the testing tree (which is very popular), and typically move those 
releases into the stable tree 2-3 weeks later.

If a bug is reported and a fix is not obvious we ask the user to 
reproduce it on the latest development kernel (in this case 2.6.16-rc4) 
and then file bugs at the kernel bugzilla. This is the kind of process 
that has happened here.

I guess I may have confused you by making you think the bug was 
*introduced* in 2.6.16-rc4? If so, I apologise: it probably existed in 
2.6.14 and older if the handoff parameter was used. I was just following 
the usual procedure of making sure the problem hasn't been fixed in the 
latest bleeding edge kernel before troubling you with it. I'll be 
clearer in future :)

Daniel

Comment 9 Daniel Drake 2006-02-18 16:55:33 UTC
Just a random observation, not sure if there is any relevance:

In the quirk function the first function call inside the loop is:

		pci_read_config_dword(pdev, offset, &cap);

The lowest byte of cap is then used to decide what to do next.
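
A rough sketch of that loop, run against a simulated 256-byte config space rather than real hardware. It assumes the EHCI extended-capability layout: bits 7:0 hold the capability ID (1 meaning legacy support), and bits 15:8 hold the offset of the next capability. The names cfg_read_dword and find_legacy_support are hypothetical, and this is not the kernel's actual code.

```c
#include <stdint.h>

#define CFG_SIZE 256

/* Read a little-endian dword; caller ensures offset + 3 < CFG_SIZE. */
static uint32_t cfg_read_dword(const uint8_t cfg[CFG_SIZE], unsigned offset)
{
	return (uint32_t)cfg[offset] |
	       (uint32_t)cfg[offset + 1] << 8 |
	       (uint32_t)cfg[offset + 2] << 16 |
	       (uint32_t)cfg[offset + 3] << 24;
}

/*
 * Walk the capability chain starting at 'offset' and return the offset
 * of the legacy-support capability (ID 1), or 0 if none is found.
 */
static unsigned find_legacy_support(const uint8_t cfg[CFG_SIZE],
				    unsigned offset)
{
	int count = CFG_SIZE / 4;	/* bound the walk against loops */

	while (offset && offset + 3 < CFG_SIZE && count--) {
		uint32_t cap = cfg_read_dword(cfg, offset);

		if ((cap & 0xff) == 1)
			return offset;	/* legacy-support capability */
		if ((cap & 0xff) == 0)
			break;		/* reserved ID: stop walking */

		offset = (cap >> 8) & 0xff;	/* follow the next pointer */
	}
	return 0;
}
```

The starting offset is supplied by the caller here; where the real quirk gets it from is outside the scope of this sketch.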

When the system froze, cap had the value 0x1. You can see this in the photo at
http://img423.imageshack.us/img423/9526/rc4crash8my.jpg.

On cold boot, cap had value 0x1000001 - you can see this in comment #1. I know
the upper bits are ignored so this won't affect the function flow, but perhaps
that upper 1 bit has some interesting meaning?

I have sent the user a patch as suggested in comment #2 and will see what
results that brings. Thanks again.
Comment 10 David Brownell 2006-02-18 17:26:23 UTC
If there's an issue in the config space, let's see a complete 
dump of it:  lspci -vv -xxx -s BB:SS.F should do. 
Comment 11 Alan Stern 2006-02-18 18:43:04 UTC
"On cold boot, cap had value 0x1000001 - you can see this in comment #1. I know
the upper bits are ignored so this won't affect the function flow, but perhaps
that upper 1 bit has some interesting meaning?"

Sort of -- it means the BIOS is pretty mixed up.  That 1 in the high-order byte
means that Linux owns the host controller, which obviously it doesn't at this
point during a cold boot.  The value of cap should be 0x10001, meaning that the
BIOS owns the controller.

During the warm boot, the value 0x1 means that nobody owns the controller.  It's
a reasonable value, if the BIOS has been configured not to support legacy USB
emulation.  The pci_write call that hangs merely sets the 1 in the high-order
byte.  This probably generates an SMI interrupt that the BIOS doesn't expect,
causing the BIOS to crash.
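
The three values discussed above can be decoded with a small sketch like the following. It assumes the ownership semaphores sit at bit 16 (BIOS) and bit 24 (OS) of the register, which matches the values quoted in this report; the enum and the function name decode_owner are hypothetical.

```c
#include <stdint.h>

enum legsup_owner { OWNER_NONE, OWNER_BIOS, OWNER_OS, OWNER_BOTH };

/* Classify who claims the controller according to the semaphore bits. */
static enum legsup_owner decode_owner(uint32_t cap)
{
	int bios = (cap >> 16) & 1;	/* BIOS-owned semaphore */
	int os = (cap >> 24) & 1;	/* OS-owned semaphore */

	if (bios && os)
		return OWNER_BOTH;
	if (bios)
		return OWNER_BIOS;
	if (os)
		return OWNER_OS;
	return OWNER_NONE;
}
```

Under these assumptions, 0x1 decodes to OWNER_NONE (the warm-boot value), 0x10001 to OWNER_BIOS (the expected cold-boot value), and 0x1000001 to OWNER_OS (the value actually seen on cold boot).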

It might be interesting to know what the corresponding value is during a warm
boot in 2.6.14.

For that matter, if 2.6.14 and 2.6.16-rc4 can both be installed on the test
machine, it might also be interesting to cold boot into one and then warm boot
into the other.  Try doing it each way.
Comment 12 Daniel Drake 2006-02-19 07:15:04 UTC
The user upgraded the BIOS and disabled USB legacy support. The problem appears
to have gone.

I've asked him to check if enabling USB legacy support brings the problem back,
or if it all works as expected.
Comment 13 Daniel Drake 2006-02-21 03:45:49 UTC
Enabling USB Legacy support doesn't make the problem reappear. Sorry to have
bothered you guys with a silly BIOS bug.