Bug 9026

Summary: PROBLEM: kernel hang in ohci init (pci-quirks)
Product: Drivers
Reporter: Satyam Sharma (satyam)
Component: PCI
Assignee: Greg Kroah-Hartman (greg)
Status: CLOSED PATCH_ALREADY_AVAILABLE
Severity: normal
CC: protasnb, yyyeer.bo
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.22
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 5089

Description Satyam Sharma 2007-09-16 05:49:44 UTC
[ Bug opened by satyam@infradead.org for problem originally
  reported by Timo Lindemann <tlindemann@arcor.de> on LKML.
  Analysis by David Brownell pointed at something to do with
  ACPI/PCI early initialization, so ccing Greg K-H and ACPI. ]


Most recent kernel where this bug did not occur: 2.6.20
Distribution:
Hardware Environment: } 32-bit userspace on x86_64 CPU
Software Environment: }
Problem Description: See original post to LKML below for details.
Steps to reproduce:

Hi all,

a problem report to something giving me a real headache:

[1.] Kernel hangs when initializing ohci-controller

[2.] The version 2.6.22 of the linux kernel hangs when initializing the
integrated ohci controller of the nvidia MCP51 chipset (pci device ids
vendor:product == 10de:26d). I have traced through various printks that
pci_init calls pci_fixup_device, later on in quirk_usb_ohci_handoff
(file linux/drivers/usb/host/pci-quirks.c) kernel freezes in this
section:
...
if (control & OHCI_CTRL_IR) {
       int wait_time = 500;
       writel(OHCI_INTR_OC, base + OHCI_INTRENABLE);
       writel(OHCI_OCR, base + OHCI_CMDSTATUS); // this never returns
...
after this, kernel apparently goes into busy waiting (fans gradually
turn louder) and hangs indefinitely. I have also made sure that writel
(in linux/include/asm/io.h) really is entered, but never returns.
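
[ Editor's sketch, not part of the original report: a user-space model of the handoff sequence quoted above, which may help when reasoning about where the boot can stall. The register offsets and bit positions follow the OHCI specification. Everything else is an assumption made for illustration: struct fake_ohci, fake_readl(), fake_writel() and ohci_handoff_model() are hypothetical names, the "firmware" is simulated by a poll counter, and the decrement of wait_time by 10 per poll is assumed from the 500 starting value rather than taken from the quoted lines. ]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register offsets and bits as laid out in the OHCI specification. */
#define OHCI_CONTROL     0x04
#define OHCI_CMDSTATUS   0x08
#define OHCI_INTRENABLE  0x10
#define OHCI_CTRL_IR     (1u << 8)   /* InterruptRouting: firmware owns the HC */
#define OHCI_OCR         (1u << 3)   /* OwnershipChangeRequest */
#define OHCI_INTR_OC     (1u << 30)  /* OwnershipChange interrupt */

/* Hypothetical stand-in for the memory-mapped controller. */
struct fake_ohci {
    uint32_t regs[0x20 / 4];
    int polls_until_release;   /* < 0 means the firmware never lets go */
};

static void fake_writel(struct fake_ohci *hc, uint32_t val, size_t off)
{
    hc->regs[off / 4] = val;
}

static uint32_t fake_readl(struct fake_ohci *hc, size_t off)
{
    /* Simulated firmware: once OCR is set, clear IR after N polls of CONTROL. */
    if (off == OHCI_CONTROL && (hc->regs[OHCI_CMDSTATUS / 4] & OHCI_OCR)) {
        if (hc->polls_until_release == 0)
            hc->regs[OHCI_CONTROL / 4] &= ~OHCI_CTRL_IR;
        else if (hc->polls_until_release > 0)
            hc->polls_until_release--;
    }
    return hc->regs[off / 4];
}

/* Returns 0 if the OS ends up owning the controller, -1 on timeout. */
int ohci_handoff_model(struct fake_ohci *hc)
{
    int wait_time = 500;

    assert(hc != NULL);
    if (!(fake_readl(hc, OHCI_CONTROL) & OHCI_CTRL_IR))
        return 0;                       /* firmware is not in control */

    fake_writel(hc, OHCI_INTR_OC, OHCI_INTRENABLE);
    fake_writel(hc, OHCI_OCR, OHCI_CMDSTATUS);

    /* Bounded poll; a real driver would sleep between reads. */
    while (wait_time > 0 && (fake_readl(hc, OHCI_CONTROL) & OHCI_CTRL_IR))
        wait_time -= 10;

    return (fake_readl(hc, OHCI_CONTROL) & OHCI_CTRL_IR) ? -1 : 0;
}
```

[ In this model, a controller whose firmware never clears the IR bit makes ohci_handoff_model() return -1 after the poll budget runs out, so a bounded loop alone should not spin forever. A hang at the write itself, as reported here, therefore points at something outside this loop. ]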

[3.] keywords: pci ohci kernel

[4.] /proc/version can not be read, as kernel freezes in startup

[5.] No Oops, no panic

[6.] Reproducible by booting any version 2.6.21+ on that machine
(nvidia MCP51-Chipset, see the lspci output)

[7.1] the ver_linux output under 2.6.20.6, in the directory of 2.6.22,
says:

Gnu C                  4.2.1
Gnu make               3.81
binutils               2.17.50.0.17
util-linux             2.12r
mount                  2.12r
module-init-tools      3.2.2
e2fsprogs              1.40
jfsutils               1.1.11
reiserfsprogs          3.6.20
xfsprogs               2.8.21
pcmciautils            014
PPP                    2.4.4
Linux C Library        > libc.2.6
Dynamic linker (ldd)   2.6
Linux C++ Library      so.6.0
Procps                 3.2.7
Net-tools              1.60
Kbd                    1.12
Sh-utils               6.9
udev                   113
wireless-tools         29
Modules Loaded         rt2500* nvidia* forcedeth

* nvidia and rt2500 are most assuredly not involved in this. They are
not loaded by that kernel.

[7.2] Processor information:
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 36
model name      : AMD Turion(tm) 64 Mobile Technology ML-37
stepping        : 2
cpu MHz         : 800.000
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt
lm 3dnowext 3dnow pni lahf_lm ts fid vid ttp tm stc
bogomips        : 1608.22
clflush size    : 64

[7.3] no modules have been configured (all in-kernel)

[7.4] n/a

[7.5] (I cannot run this with 2.6.22. In 2.6.20.6, the output can be
retrieved from http://cip.uni-trier.de/~lindem/lspci.txt as this is
really large)

[7.6] (I have SATA, but again, I don't reach /proc from within that
kernel)

[7.7] What is striking about that problem is that kernel 2.6.20.6 does
not even enter the section mentioned in [2.]. If booted, serial console
and netconsole do not work either, nor does magic sysrq key. Also, this
is a 64bit cpu, running a 32bit linux distro, and it happens regardless
whether 64bit resources are activated or not.

[X.] I tried hard to understand what's going on, but ultimately, I could
not yet write a fix, workaround, or anything like that, so I am asking
for help/enlightenment, or even an already-done fix. Really very
sorry. Also, different options like noapic, nolapic, acpi=off,
pci=routeirq|biosirq|usepirqmask were already tried; I also tried
disabling quirks for that particular vendor:device-combination, which
leads to another freeze further along. Also, commenting the writel()
will hang indefinitely in the following wait_time loop.

I can only guess that it might
have to do with the patch
"commit 4302a595cd9c6363b495460497ecbda49fa16858
Author: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date:   Fri Dec 15 06:53:55 2006 +1100
USB: Rework the OHCI quirk mecanism as suggested by David
"
but I don't really have a clue, so this might be groundless suspicion.
If so, I apologize about that.

Greetings and thanks for all the work with the kernel!

--
Timo Lindemann
Comment 1 Satyam Sharma 2007-09-16 06:10:58 UTC
[ Getting bugzilla up to speed with later discussion on the
  original LKML thread at http://lkml.org/lkml/2007/7/12/64 ]


David Brownell:
> 
> > (file linux/drivers/usb/host/pci-quirks.c) kernel freezes in this
> > section:
> 
> Note that hangs in that file almost always mean "your BIOS is goofy".
> Hunt for BIOS settings related to USB, and change them.  As a rule, if
> you tell your BIOS to ignore USB devices (mostly keyboards and disks),
> it will have even less of an excuse to break like that.


Timo Lindemann:
> 
> > Note that hangs in that file almost always mean "your BIOS is goofy".
> > Hunt for BIOS settings related to USB, and change them.
> 
> This laptop's BIOS only offers "legacy support" enabled or disabled,
> both of which lead to frozen kernel.
> [...]
> 
> It is just odd that up to (not including) the 2.6.21-series every kernel
> boots, and after that, they just freeze.


Satyam Sharma:
> 
> Hey, just try git-bisect already :-)


Timo Lindemann:
> 
> To sum this up:
> 
> the userspace 2.6.20.6 (the "good" kernel) and 2.6.22 (the "bad" kernel)
> were compiled in is exactly the same setup. I recompiled "good" to check
> for that, earlier, but "good" also works then.
> 
> "good" does not exhibit the printks I placed in the section (the same
> ones I did for "bad"), making it plausible that the section is not
> executed at all.
> 
> dmesg is not captured to disk, netconsole and serial console also do not
> work (they both did in the "good" kernel). Also, my keyboard does not
> work with "bad" during that phase -- Magic SysRq is also not working then.
> 
> I can try to hook up the laptop to an external monitor to capture some
> more dmesg, and just shoot a photo, but I am right now trying to work
> with git, as Satyam suggested.


David Brownell:
> 
> Thing is, pci-quirks.c runs early
> enough in the boot process -- before the OHCI driver can even
> run!! -- that you can probably rule out the USB stack as being
> the cause of this regression.  Disable the USB host controllers
> in your config, and see what happens...
> [...]
> 
> Where the subsystem in question is early PCI/ACPI initialization,
> before the drivers start binding to PCI devices... it's always
> annoying when changes in that area cause USB to break, since the
> only involvement of USB is to display a "rude failure" symptom.
> It took a long time to get the IRQ setup glitches fixed!
> 
> One thing you might do is enable all the ACPI debug messaging and
> disable the usb/host/pci-quirks.c stuff (just comment it all out),
> assuming you can boot without USB keyboard/mouse.  Then compare
> the relevant diagnostics between "good" and "bad" kernels.  It's
> likely something interesting will appear.
Comment 2 David Brownell 2007-10-09 15:29:59 UTC
One theory about what's going wrong: something's interfering with SMI handling, and that's required for the BIOS to do its part of that handoff.
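
[ Editor's sketch of the theory above, not code from the kernel or any BIOS: per the OHCI specification, an unmasked OwnershipChange event raises an SMI, and the firmware's handler is then expected to clear the InterruptRouting bit. The toy model below shows the firmware side of that exchange. struct hc_model, bios_smi_handler(), request_ownership(), os_owns_controller() and the smi_delivery_works flag are all hypothetical names, and the handler's behaviour is a simplification. ]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CTRL_IR  (1u << 8)    /* HcControl.InterruptRouting */
#define CMD_OCR  (1u << 3)    /* HcCommandStatus.OwnershipChangeRequest */
#define INTR_OC  (1u << 30)   /* OwnershipChange event */

/* Hypothetical model of the controller plus the platform's SMI path. */
struct hc_model {
    uint32_t control, cmdstatus, intrenable, intrstatus;
    bool smi_delivery_works;  /* false models "something interferes with SMI" */
};

/* Simplified stand-in for the BIOS SMM handler: release the controller. */
static void bios_smi_handler(struct hc_model *hc)
{
    hc->control    &= ~CTRL_IR;
    hc->cmdstatus  &= ~CMD_OCR;
    hc->intrstatus &= ~INTR_OC;
}

/* OS side: unmask the OwnershipChange event and request ownership. */
void request_ownership(struct hc_model *hc)
{
    assert(hc != NULL);
    hc->intrenable |= INTR_OC;
    hc->cmdstatus  |= CMD_OCR;
    hc->intrstatus |= INTR_OC;   /* controller latches the event */

    /* The event only reaches the firmware if SMI delivery works. */
    if ((hc->control & CTRL_IR) && (hc->intrenable & INTR_OC) &&
        hc->smi_delivery_works)
        bios_smi_handler(hc);
}

bool os_owns_controller(const struct hc_model *hc)
{
    return !(hc->control & CTRL_IR);
}
```

[ With smi_delivery_works set to false, request_ownership() leaves CTRL_IR and CMD_OCR set, which is the state an OS-side poll would wait on until it times out. Whether this matches what actually happened on the reporter's machine was never confirmed in this bug. ]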
Comment 3 Natalie Protasevich 2008-02-11 20:29:50 UTC
Any updates on this problem please? It looks like reporter is not giving any more feedback, unless someone has been working with him directly.
Comment 4 Satyam Sharma 2008-04-14 12:58:31 UTC

On Mon, 11 Feb 2008, bugme-daemon@bugzilla.kernel.org wrote:
> 
> ------- Comment #3 from protasnb@gmail.com  2008-02-11 20:29 -------
> Any updates on this problem please? It looks like reporter is not giving any
> more feedback, unless someone has been working with him directly.

Apologies for the late reply, but I haven't really been keeping up with
kernel development in a big way for the last few months.

Regarding this particular issue, I was contacted by the original reporter
(Timo Lindemann) maybe a month back and he said the latest kernel seems
to be booting/working fine on his laptop now. Hard for me or others to
confirm considering the problem was reproduced only on the original
reporter's laptop. Moreover, it is not known whether Timo built the latest
kernel using the same .config or a new one. We could probably request him
to do a git-bisect to find both the buggy commit (and the one that
resolved it "automagically") to really get down to the bottom of this,
but otherwise I guess we may have to close this one (or keep as is) for
lack of further information ...

Thanks,

Satyam
Comment 5 Natalie Protasevich 2008-04-14 13:45:55 UTC
Great, thanks for the update. 
I think on this positive note we can close it now, taking into account how many fixes/updates went into the subsystem.
Please reopen if the problem is confirmed with the latest kernel.