Hello,
somewhere between v2.6.36 and v2.6.37-rc1, the kernel stopped booting on my machine. The boot process stops at "Freeing initrd memory: [..]k freed", and after that the machine is completely hard locked.
After a thrilling git bisect session, the guilty commit was found:
66db60eaf158aa953651d03e43e931e757e87262
PCI: add quirk for non-symmetric-mode irq routing to versions 0 and 4 of the MCP55 northbridge
Reverting that commit indeed fixes the bug.
lspci says:
00:00.0 RAM memory [0500]: nVidia Corporation MCP55 Memory Controller [10de:0369] (rev a1)
00:01.0 ISA bridge [0601]: nVidia Corporation MCP55 LPC Bridge [10de:0360] (rev a2)
00:01.1 SMBus [0c05]: nVidia Corporation MCP55 SMBus [10de:0368] (rev a2)
00:02.0 USB Controller [0c03]: nVidia Corporation MCP55 USB Controller [10de:036c] (rev a1)
00:02.1 USB Controller [0c03]: nVidia Corporation MCP55 USB Controller [10de:036d] (rev a2)
00:04.0 IDE interface [0101]: nVidia Corporation MCP55 IDE [10de:036e] (rev a1)
00:05.0 IDE interface [0101]: nVidia Corporation MCP55 SATA Controller [10de:037f] (rev a2)
00:05.1 IDE interface [0101]: nVidia Corporation MCP55 SATA Controller [10de:037f] (rev a2)
00:05.2 IDE interface [0101]: nVidia Corporation MCP55 SATA Controller [10de:037f] (rev a2)
00:06.0 PCI bridge [0604]: nVidia Corporation MCP55 PCI bridge [10de:0370] (rev a2)
00:06.1 Audio device [0403]: nVidia Corporation MCP55 High Definition Audio [10de:0371] (rev a2)
00:08.0 Bridge [0680]: nVidia Corporation MCP55 Ethernet [10de:0373] (rev a2)
00:0f.0 PCI bridge [0604]: nVidia Corporation MCP55 PCI Express bridge [10de:0377] (rev a2)
00:18.0 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration [1022:1200]
00:18.1 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h Processor Address Map [1022:1201]
00:18.2 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller [1022:1202]
00:18.3 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control [1022:1203]
00:18.4 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h Processor Link Control [1022:1204]
02:00.0 VGA compatible controller [0300]: ATI Technologies Inc Cypress [Radeon HD 5800 Series] [1002:6899]
02:00.1 Audio device [0403]: ATI Technologies Inc Cypress HDMI Audio [Radeon HD 5800 Series] [1002:aa50]
We can't just yank out that commit; without it, kdump will not work on systems with the listed revisions of MCP55. Perhaps the availability of that register is more specific than what we can tell with just the PCI device and vendor ID. Can you please post the output of lspci -vvv so that I can compare your bridge to the one I worked with here? Thanks!
Created attachment 38822 [details] lspci -vvv
Created attachment 38832 [details] dmesg with 66db60eaf15 reverted
Complete dmesg log from 2.6.37-rc with commit 66db60eaf15 reverted.
Well, I agree that fixing kdump on some systems is a very good thing. But if you think in terms of regressions, that patch introduces a very serious one: it breaks Linux on some other systems, which seems quite an unfavorable balance to me. As a side note, kexec has always worked very well on this particular system. Would a dump of the PCI config space of that PCI device on my system be useful to you?
You're correct, but it's important to remember that the MCP55 is:
1) a fairly widespread northbridge
2) relatively old (nvidia no longer manufactures it, as I understand it)
Given those facts, along with the fact that you're the first to report this (this patch has been in RHEL since 5.3, over 1.5 years), I feel like this report may be a corner case, and as such something that we can fine-tune the existing patch for. I'm really hoping that when I compare your PCI output to what I have here, we'll be able to do some filtering on PCI sub-device ID and get this working. And yes, please: your PCI config space might be handy for comparison with my system here.
Thanks!
Neil
Okay, here is the conf space of my 10de:0360 device, on a kernel with the patch reverted:
10de 0360 000f 00a0 00a2 0601 0000 0080
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 1043 8239
0000 0000 0000 0000 0000 0000 00ff 0000
1043 8239 f000 feff 3efa 00ff 3efa 00ff
3efa 00ff 5a00 0262 0000 0100 0000 ffff
0000 0000 0000 0000 0000 0000 0000 fff9
0010 ffff 80c5 0000 0000 1944 0000 0330
8009 1200 d201 d000 00f0 0100 00f0 0000
0800 0000 0000 0000 4721 8695 cdef 00ab
0001 c030 0000 0000 0000 0000 0000 0000
0290 02ef 0800 085f 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 2a50 fe00 e1fd b000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0010 0000 0000 0000
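For anyone reading along, the header fields in a dump like this can be decoded by hand. Below is a minimal sketch, assuming the dump is a sequence of 16-bit little-endian words as a hexdump-style tool would print them (the vendor ID appearing as the first word suggests this). The struct and function names are made up for illustration; the offsets are the standard PCI type 0 header offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of a few standard PCI header fields. */
struct pci_header_summary {
    uint16_t vendor;        /* offset 0x00 */
    uint16_t device;        /* offset 0x02 */
    uint16_t status;        /* offset 0x06 */
    uint8_t  revision;      /* offset 0x08 */
    uint16_t class_code;    /* offset 0x0a (class and subclass) */
    uint16_t subsys_vendor; /* offset 0x2c */
    uint16_t subsys_device; /* offset 0x2e */
    int      has_cap_list;  /* status bit 4: capability list present */
};

/* First 24 words (48 bytes) of the dump posted above. */
static const uint16_t posted_words[24] = {
    0x10de, 0x0360, 0x000f, 0x00a0, 0x00a2, 0x0601, 0x0000, 0x0080,
    0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x1043, 0x8239,
};

/* Decode header fields from 16-bit words; word i covers byte offset 2*i. */
static struct pci_header_summary summarize_header(const uint16_t *w, size_t nwords)
{
    struct pci_header_summary s;

    assert(w != NULL && nwords >= 24);
    s.vendor        = w[0];
    s.device        = w[1];
    s.status        = w[3];
    s.revision      = (uint8_t)(w[4] & 0xff); /* low byte of the word at 0x08 */
    s.class_code    = w[5];
    s.subsys_vendor = w[22];
    s.subsys_device = w[23];
    s.has_cap_list  = (s.status & 0x10) != 0;
    return s;
}
```

Calling summarize_header(posted_words, 24) on those words gives vendor 10de, device 0360, revision a2, class 0601 and subsystem 1043:8239. The status word is 00a0, so the capability-list bit (0x10) is clear for this device.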
OK, so some good news. It appears that there are two major differences between your system and the ones I was testing against. The first is that this is a rev a1 while everything I tested on is a rev a2, so it's possible that the register I need is nonexistent on your system, and we can key off that to know whether we should set it or not. Also, on the systems I test with, the MCP55 has a HyperTransport-capable interface to the CPU, whereas yours does not. Given that the problem I sought to fix only occurs on AMD systems in which a HyperTransport bus is used, I think the best solution is to key off that fact in the quirk and not do anything if there is no HT bus. I'll have a patch for you later today.
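Checking whether a device advertises an HT capability amounts to walking its PCI capability list. The following is a userspace sketch of that walk over a 256-byte config space buffer, not the actual quirk code. The register offsets and the HyperTransport capability ID (0x08) are the standard PCI values, the macro names mirror those in the kernel's pci_regs.h, and find_capability() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_STATUS          0x06 /* status register (low byte) */
#define PCI_STATUS_CAP_LIST 0x10 /* status bit: capability list present */
#define PCI_CAPABILITY_LIST 0x34 /* offset of the first capability pointer */
#define PCI_CAP_LIST_NEXT   1    /* next-pointer byte within a capability */
#define PCI_CAP_ID_HT       0x08 /* HyperTransport capability ID */

/*
 * Return the config space offset of the first capability with the given
 * ID, or 0 if the device has no such capability. cfg must point to at
 * least 256 bytes of config space.
 */
static int find_capability(const uint8_t *cfg, uint8_t cap_id)
{
    int ttl = 48; /* bound the walk in case the list loops */
    uint8_t pos;

    assert(cfg != NULL);
    if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return 0; /* no capability list at all */

    pos = cfg[PCI_CAPABILITY_LIST] & 0xfc;
    while (pos >= 0x40 && ttl-- > 0) {
        uint8_t id = cfg[pos];

        if (id == 0xff)
            break; /* reads as all ones: nothing there */
        if (id == cap_id)
            return pos;
        pos = cfg[pos + PCI_CAP_LIST_NEXT] & 0xfc;
    }
    return 0;
}
```

A quirk could then return early when find_capability(cfg, PCI_CAP_ID_HT) is 0. Inside the kernel the equivalent question is normally asked with pci_find_capability(dev, PCI_CAP_ID_HT) rather than by parsing config space by hand.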
Created attachment 39192 [details] patch to filter out mcp55 chips w/o hyperthread interface capabilities
Hey, here you go, as promised. The problem that this quirk addresses was only noted on AMD systems with HyperTransport buses, so this patch should cause the quirk not to be applied on MCP55 chips without the HT capability (which I think makes sense, as I would imagine this register may be nonexistent on those versions of the chip). Anywho, I don't have a non-HT system here to test on, but with this patch, kdump still works on my MCP55-based server. If you could test on your non-HyperTransport MCP55-based system and see whether a post-2.6.36 kernel boots on it, I'd appreciate it. If you give it a thumbs up, I'll post it ASAP. Thanks!
The patch allows the latest git pull of 2.6.37-rc to boot on my system. Thanks!
Ok good, I'll clean it up and post it in the AM. Thanks!
http://marc.info/?l=linux-kernel&m=129181976528659&w=2 posted for review
This is fixed in mainline.
Fixed in .37-rc7 by:
commit 49c2fa08a77a7eefa4cbc73601f64984aceacfa7
Author: Neil Horman <nhorman@tuxdriver.com>
Date: Wed Dec 8 09:47:48 2010 -0500
    PCI: Update MCP55 quirk to not affect non HyperTransport variants