[ Bug opened by satyam@infradead.org for problem originally reported by
  Timo Lindemann <tlindemann@arcor.de> on LKML. Analysis by David Brownell
  pointed at something to do with ACPI/PCI early initialization, so ccing
  Greg K-H and ACPI. ]

Most recent kernel where this bug did not occur: 2.6.20
Distribution:
Hardware Environment: } 32-bit userspace on x86_64 CPU
Software Environment: }
Problem Description: See original post to LKML below for details.

Steps to reproduce:

Hi all,

a problem report about something giving me a real headache:

[1.] Kernel hangs when initializing ohci-controller

[2.] The version 2.6.22 of the linux kernel hangs when initializing the
integrated ohci controller of the nvidia MCP51 chipset (pci device ids
vendor:product == 10de:26d).

Using various printks, I have traced that pci_init calls
pci_fixup_device; later on, in quirk_usb_ohci_handoff (file
linux/drivers/usb/host/pci-quirks.c), the kernel freezes in this section:

    ...
    if (control & OHCI_CTRL_IR) {
            int wait_time = 500;
            writel(OHCI_INTR_OC, base + OHCI_INTRENABLE);
            writel(OHCI_OCR, base + OHCI_CMDSTATUS); // this never returns
    ...

After this, the kernel apparently goes into busy waiting (fans gradually
turn louder) and hangs indefinitely. I have also made sure that writel
(in linux/include/asm/io.h) really is entered, but never returns.

[3.] keywords: pci ohci kernel

[4.] /proc/version cannot be read, as the kernel freezes during startup

[5.] No Oops, no panic

[6.] Reproducible by booting any version 2.6.21+ on that machine (nvidia
MCP51-Chipset, see the lspci output)

[7.1] the ver_linux output under 2.6.20.6, in the directory of 2.6.22,
says:

Gnu C                  4.2.1
Gnu make               3.81
binutils               2.17.50.0.17
util-linux             2.12r
mount                  2.12r
module-init-tools      3.2.2
e2fsprogs              1.40
jfsutils               1.1.11
reiserfsprogs          3.6.20
xfsprogs               2.8.21
pcmciautils            014
PPP                    2.4.4
Linux C Library        > libc.2.6
Dynamic linker (ldd)   2.6
Linux C++ Library      so.6.0
Procps                 3.2.7
Net-tools              1.60
Kbd                    1.12
Sh-utils               6.9
udev                   113
wireless-tools         29
Modules Loaded         rt2500* nvidia* forcedeth

* nvidia and rt2500 are most assuredly not involved in this. They are
not loaded by that kernel.

[7.2] Processor information:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 36
model name      : AMD Turion(tm) 64 Mobile Technology ML-37
stepping        : 2
cpu MHz         : 800.000
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm ts fid vid ttp tm stc
bogomips        : 1608.22
clflush size    : 64

[7.3] no modules have been configured (all in-kernel)

[7.4] n/a

[7.5] (I cannot run this with 2.6.22. In 2.6.20.6, the output can be
retrieved from http://cip.uni-trier.de/~lindem/lspci.txt as this is
really large)

[7.6] (I have SATA, but again, I don't reach /proc from within that
kernel)

[7.7] What is striking about that problem is that kernel 2.6.20.6 does
not even enter the section mentioned in [2.]. When 2.6.22 is booted,
serial console and netconsole do not work either, nor does the magic
sysrq key. Also, this is a 64bit cpu, running a 32bit linux distro, and
it happens regardless of whether 64bit resources are activated or not.

[X.] I tried hard to understand what's going on, but ultimately, I could
not yet write a fix, workaround, or anything like that, so I am asking
for help/enlightenment, or even an already-done fix. Really very sorry.

Also, different options like noapic, nolapic, acpi=off,
pci=routeirq|biosirq|usepirqmask were already tried; I also tried
disabling quirks for that particular vendor:device-combination, which
leads to another freeze further along. Also, commenting out the writel()
makes the kernel hang indefinitely in the following wait_time loop.

I can only guess that it might have to do with the patch

"commit 4302a595cd9c6363b495460497ecbda49fa16858
Author: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date:   Fri Dec 15 06:53:55 2006 +1100

    USB: Rework the OHCI quirk mecanism as suggested by David
"

but I don't really have a clue, so this might be groundless suspicion.
If so, I apologize about that.

Greetings and thanks for all the work with the kernel!
--
Timo Lindemann
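[ Editorial sketch, not part of the original report. ] For readers who do not have pci-quirks.c open, the handoff section quoted in [2.] can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: struct fake_ohci, fake_read_control() and polls_until_release are hypothetical stand-ins for the memory-mapped controller and the BIOS/SMM side, the OHCI_INTR_OC write is omitted, and the shape of the wait loop (a 500 budget reduced by 10 per poll) is assumed from the wait_time variable in the quoted snippet. It shows the intended behaviour: the loop is bounded, so a BIOS that never releases the controller should produce a timeout rather than an endless hang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register bits as named in the OHCI handoff code. */
#define OHCI_CTRL_IR (1u << 8) /* HcControl: interrupt routing (BIOS/SMM owns HC) */
#define OHCI_OCR     (1u << 3) /* HcCommandStatus: ownership change request */

/* Hypothetical stand-in for the memory-mapped controller registers. */
struct fake_ohci {
    uint32_t control;
    uint32_t cmdstatus;
    int polls_until_release; /* negative: the fake BIOS never lets go */
};

/* Simulated readl() of HcControl: the fake BIOS clears IR after N polls. */
static uint32_t fake_read_control(struct fake_ohci *hc)
{
    if (hc->polls_until_release == 0)
        hc->control &= ~OHCI_CTRL_IR;
    else if (hc->polls_until_release > 0)
        hc->polls_until_release--;
    return hc->control;
}

/*
 * Request ownership and poll until the BIOS releases the controller.
 * Returns the simulated milliseconds waited, or -1 on timeout.
 */
int ohci_handoff(struct fake_ohci *hc)
{
    int wait_time = 500;

    assert(hc != NULL);
    if (!(hc->control & OHCI_CTRL_IR))
        return 0; /* BIOS does not own the controller: nothing to do */

    hc->cmdstatus = OHCI_OCR; /* stands in for writel(OHCI_OCR, ...) */

    while (wait_time > 0 && (fake_read_control(hc) & OHCI_CTRL_IR))
        wait_time -= 10; /* the kernel would sleep between polls here */

    if (wait_time <= 0)
        return -1; /* handoff failed, loop still terminated */
    return 500 - wait_time;
}
```

With a fake BIOS that releases after three polls, ohci_handoff() returns 30; with one that never releases, it gives up after 50 polls and returns -1. In this model neither the write nor the loop can block forever, which is why a hang at exactly this point on real hardware suggests something outside the loop logic itself.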
[ Getting bugzilla up to speed with later discussion on the original
  LKML thread at http://lkml.org/lkml/2007/7/12/64 ]

David Brownell:
>
> > (file linux/drivers/usb/host/pci-quirks.c) kernel freezes in this
> > section:
>
> Note that hangs in that file almost always mean "your BIOS is goofy".
> Hunt for BIOS settings related to USB, and change them. As a rule, if
> you tell your BIOS to ignore USB devices (mostly keyboards and disks),
> it will have even less of an excuse to break like that.

Timo Lindemann:
>
> > Note that hangs in that file almost always mean "your BIOS is goofy".
> > Hunt for BIOS settings related to USB, and change them.
>
> This laptop's BIOS only offers "legacy support" enabled or disabled,
> both of which lead to frozen kernel.
> [...]
>
> It is just odd that up to (not including) the 2.6.21-series every kernel
> boots, and after that, they just freeze.

Satyam Sharma:
>
> Hey, just try git-bisect already :-)

Timo Lindemann:
>
> To sum this up:
>
> the userspace 2.6.20.6 (the "good" kernel) and 2.6.22 (the "bad" kernel)
> were compiled in is exactly the same setup. I recompiled "good" to check
> for that, earlier, but "good" also works then.
>
> "good" does not exhibit the printks I placed in the section (the same
> ones I did for "bad"), making it plausible that the section is not
> executed at all.
>
> dmesg is not captured to disk, netconsole and serial console also do not
> work (they both did in the "good" kernel). Also, my keyboard does not
> work with "bad" during that phase -- Magic SysRq is also not working then.
>
> I can try to hook up the laptop to an external monitor to capture some
> more dmesg, and just shoot a photo, but I am right now trying to work
> with git, as Satyam suggested.

David Brownell:
>
> Thing is, pci-quirks.c runs early
> enough in the boot process -- before the OHCI driver can even
> run!! -- that you can probably rule out the USB stack as being
> the cause of this regression. Disable the USB host controllers
> in your config, and see what happens...
> [...]
>
> Where the subsystem in question is early PCI/ACPI initialization,
> before the drivers start binding to PCI devices... it's always
> annoying when changes in that area cause USB to break, since the
> only involvement of USB is to display a "rude failure" symptom.
> It took a long time to get the IRQ setup glitches fixed!
>
> One thing you might do is enable all the ACPI debug messaging and
> disable the usb/host/pci-quirks.c stuff (just comment it all out),
> assuming you can boot without USB keyboard/mouse. Then compare
> the relevant diagnostics between "good" and "bad" kernels. It's
> likely something interesting will appear.
One theory about what's going wrong: something's interfering with SMI handling, and that's required for the BIOS to do its part of that handoff.
Any updates on this problem, please? It looks like the reporter is not giving any more feedback, unless someone has been working with him directly.
On Mon, 11 Feb 2008, bugme-daemon@bugzilla.kernel.org wrote:
>
> ------- Comment #3 from protasnb@gmail.com 2008-02-11 20:29 -------
> Any updates on this problem please? It looks like reporter is not giving any
> more feedback, unless someone has been working with him directly.

Apologies for the late reply, but I haven't really been keeping up with
kernel development in a big way for the last few months.

Regarding this particular issue, I was contacted by the original reporter
(Timo Lindemann) maybe a month back and he said the latest kernel seems
to be booting/working fine on his laptop now. That is hard for me or
others to confirm, considering the problem was reproduced only on the
original reporter's laptop. Moreover, it is not known whether Timo built
the latest kernel using the same .config or a new one.

We could probably request him to do a git-bisect to find both the buggy
commit and the one that resolved it "automagically", to really get to the
bottom of this, but otherwise I guess we may have to close this one (or
keep as is) for lack of further information ...

Thanks,
Satyam
Great, thanks for the update. I think on this positive note we can close it now, taking into account how many fixes/updates went into the subsystem. Please reopen if the problem is confirmed with the latest kernel.