Bug 23952 - Bisected regression: kernel won't boot (MCP55 bridge)
Status: CLOSED CODE_FIX
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 normal
Assigned To: Neil Horman
Depends on:
Blocks: 21782
 
Reported: 2010-11-28 21:46 UTC by Mathieu Bérard
Modified: 2011-01-23 15:32 UTC (History)

See Also:
Kernel Version: 2.6.37-rc1+
Tree: Mainline
Regression: Yes


Attachments
lspci -vvv (21.03 KB, text/plain)
2010-12-02 19:30 UTC, Mathieu Bérard
dmesg with 66db60eaf15 reverted (57.28 KB, text/plain)
2010-12-02 19:31 UTC, Mathieu Bérard
patch to filter out mcp55 chips w/o hyperthread interface capabilities (461 bytes, patch)
2010-12-06 20:36 UTC, Neil Horman

Description Mathieu Bérard 2010-11-28 21:46:30 UTC
Hello,
somewhere between v2.6.36 and v2.6.37-rc1, the kernel stopped booting on my machine.
The boot process stops at "Freeing initrd memory: [..]k freed" and after that the machine is completely hard locked.

After a thrilling git bisect session, the guilty commit was found:
66db60eaf158aa953651d03e43e931e757e87262
PCI: add quirk for non-symmetric-mode irq routing to versions 0 and 4 of the MCP55 northbridge

Reverting that commit fixes the bug.

lspci says:
00:00.0 RAM memory [0500]: nVidia Corporation MCP55 Memory Controller [10de:0369] (rev a1)
00:01.0 ISA bridge [0601]: nVidia Corporation MCP55 LPC Bridge [10de:0360] (rev a2)
00:01.1 SMBus [0c05]: nVidia Corporation MCP55 SMBus [10de:0368] (rev a2)
00:02.0 USB Controller [0c03]: nVidia Corporation MCP55 USB Controller [10de:036c] (rev a1)
00:02.1 USB Controller [0c03]: nVidia Corporation MCP55 USB Controller [10de:036d] (rev a2)
00:04.0 IDE interface [0101]: nVidia Corporation MCP55 IDE [10de:036e] (rev a1)
00:05.0 IDE interface [0101]: nVidia Corporation MCP55 SATA Controller [10de:037f] (rev a2)
00:05.1 IDE interface [0101]: nVidia Corporation MCP55 SATA Controller [10de:037f] (rev a2)
00:05.2 IDE interface [0101]: nVidia Corporation MCP55 SATA Controller [10de:037f] (rev a2)
00:06.0 PCI bridge [0604]: nVidia Corporation MCP55 PCI bridge [10de:0370] (rev a2)
00:06.1 Audio device [0403]: nVidia Corporation MCP55 High Definition Audio [10de:0371] (rev a2)
00:08.0 Bridge [0680]: nVidia Corporation MCP55 Ethernet [10de:0373] (rev a2)
00:0f.0 PCI bridge [0604]: nVidia Corporation MCP55 PCI Express bridge [10de:0377] (rev a2)
00:18.0 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration [1022:1200]
00:18.1 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h Processor Address Map [1022:1201]
00:18.2 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller [1022:1202]
00:18.3 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control [1022:1203]
00:18.4 Host bridge [0600]: Advanced Micro Devices [AMD] Family 10h Processor Link Control [1022:1204]
02:00.0 VGA compatible controller [0300]: ATI Technologies Inc Cypress [Radeon HD 5800 Series] [1002:6899]
02:00.1 Audio device [0403]: ATI Technologies Inc Cypress HDMI Audio [Radeon HD 5800 Series] [1002:aa50]
Comment 1 Neil Horman 2010-12-01 16:50:13 UTC
We can't just yank out that commit; without it, kdump will not work on systems with the listed revisions of MCP55.  Perhaps the availability of that register is more specific than what we can tell with just the PCI device and vendor ID.  Can you please post the output of lspci -vvv so that I can compare your bridge to the one I worked with here?  Thanks!
Comment 2 Mathieu Bérard 2010-12-02 19:30:36 UTC
Created attachment 38822 [details]
lspci -vvv
Comment 3 Mathieu Bérard 2010-12-02 19:31:55 UTC
Created attachment 38832 [details]
dmesg with 66db60eaf15 reverted

complete dmesg log from 2.6.37-rc with commit 66db60eaf15 reverted
Comment 4 Mathieu Bérard 2010-12-02 19:40:30 UTC
Well, I agree that fixing kdump on some systems is a very good thing.
But in terms of regressions, that patch introduces a very serious one: it breaks Linux on some other systems, which seems a quite unfavorable trade-off to me.
As a side note, kexec has always worked very well on this particular system.

Would a dump of the PCI config space of that PCI device on my system be useful to you?
Comment 5 Neil Horman 2010-12-03 02:15:05 UTC
You're correct, but it's important to remember that the MCP55 is:
1) A fairly widespread northbridge
2) relatively old (nvidia no longer manufactures it as I understand it)

Given those facts, along with the fact that you're the first to report this (this patch has been in RHEL since 5.3, over 1.5 years), I feel like this report may be a corner case, and as such something that we can fine-tune the existing patch for.  I'm really hoping that when I compare your PCI output to what I have here, we'll be able to do some filtering on PCI sub-device ID and get this working.

Yes, please: your PCI config space might be handy for comparison with my system here.

thanks!
Neil
Comment 6 Mathieu Bérard 2010-12-05 22:46:44 UTC
Okay, 
here is the conf space of my 10de:0360 device, on a kernel with the patch reverted:

10de 0360 000f 00a0 00a2 0601 0000 0080
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 1043 8239
0000 0000 0000 0000 0000 0000 00ff 0000
1043 8239 f000 feff 3efa 00ff 3efa 00ff
3efa 00ff 5a00 0262 0000 0100 0000 ffff
0000 0000 0000 0000 0000 0000 0000 fff9
0010 ffff 80c5 0000 0000 1944 0000 0330
8009 1200 d201 d000 00f0 0100 00f0 0000
0800 0000 0000 0000 4721 8695 cdef 00ab
0001 c030 0000 0000 0000 0000 0000 0000
0290 02ef 0800 085f 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 2a50 fe00 e1fd b000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0010 0000 0000 0000
Comment 7 Neil Horman 2010-12-06 11:56:52 UTC
OK, so some good news.  It appears that there are 2 major differences between your system and the ones I was testing against.  The first is that this is a rev a1 while everything I tested on is a rev a2, so it's possible that the register I need is nonexistent on your system, and we can key off that to know if we should set it or not.  Also, on the systems I test with, the MCP55 has a HyperTransport-capable interface to the CPU, whereas yours does not.  Given that the problem that I sought to fix only occurs on AMD systems in which a HyperTransport bus was used, I think the best solution is to key off that fact in the quirk, and not do anything if there is no HT bus.  I'll have a patch for you later today.
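
For readers following along, the "no HT capability" observation can be checked against the config space posted in comment 6. The following is a minimal user-space sketch, not kernel code. It assumes the dump is a sequence of 16-bit words with the low byte at the lower offset (so 10de at offset 0 is the vendor ID); the array and function names are made up for this illustration. Under that assumption the status word at offset 0x06 is 0x00a0, the capability-list bit is clear, and the lookup reports no HyperTransport capability.

```c
#include <assert.h>
#include <stdint.h>

#define CFG_WORDS       128   /* 256 bytes of config space */
#define PCI_STATUS      0x06
#define PCI_STATUS_CAPS 0x10  /* status bit 4: capability list present */
#define PCI_CAP_PTR     0x34  /* offset of the first capability */
#define CAP_ID_HT       0x08  /* HyperTransport capability ID */

/* Config space of 10de:0360 as posted in comment 6, one row per 16 bytes. */
static const uint16_t mcp55_lpc[CFG_WORDS] = {
    0x10de, 0x0360, 0x000f, 0x00a0, 0x00a2, 0x0601, 0x0000, 0x0080,
    0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x1043, 0x8239,
    0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x00ff, 0x0000,
    0x1043, 0x8239, 0xf000, 0xfeff, 0x3efa, 0x00ff, 0x3efa, 0x00ff,
    0x3efa, 0x00ff, 0x5a00, 0x0262, 0x0000, 0x0100, 0x0000, 0xffff,
    0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0xfff9,
    0x0010, 0xffff, 0x80c5, 0x0000, 0x0000, 0x1944, 0x0000, 0x0330,
    0x8009, 0x1200, 0xd201, 0xd000, 0x00f0, 0x0100, 0x00f0, 0x0000,
    0x0800, 0x0000, 0x0000, 0x0000, 0x4721, 0x8695, 0xcdef, 0x00ab,
    0x0001, 0xc030, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    0x0290, 0x02ef, 0x0800, 0x085f, 0x0000, 0x0000, 0x0000, 0x0000,
    0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    0x0000, 0x0000, 0x0000, 0x0000, 0x2a50, 0xfe00, 0xe1fd, 0xb000,
    0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
    0x0000, 0x0000, 0x0000, 0x0000, 0x0010, 0x0000, 0x0000, 0x0000,
};

/* Read one byte, assuming each word holds its low byte at the lower offset. */
static uint8_t cfg_byte(const uint16_t *cfg, unsigned off)
{
    assert(off < CFG_WORDS * 2);
    uint16_t w = cfg[off / 2];
    return (off & 1) ? (uint8_t)(w >> 8) : (uint8_t)(w & 0xff);
}

/* Return the offset of capability `id`, or 0 if the device lacks it. */
static unsigned find_capability(const uint16_t *cfg, uint8_t id)
{
    /* Without the status bit there is no list to walk. */
    if (!(cfg[PCI_STATUS / 2] & PCI_STATUS_CAPS))
        return 0;

    unsigned pos = cfg_byte(cfg, PCI_CAP_PTR) & 0xfc;
    /* Bounded walk so a corrupt list cannot loop forever. */
    for (int ttl = 48; pos >= 0x40 && ttl > 0; ttl--) {
        if (cfg_byte(cfg, pos) == id)
            return pos;
        pos = cfg_byte(cfg, pos + 1) & 0xfc;  /* next-capability pointer */
    }
    return 0;
}
```

Calling `find_capability(mcp55_lpc, CAP_ID_HT)` on this dump returns 0, which matches the description above of a bridge without a HyperTransport interface.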
Comment 8 Neil Horman 2010-12-06 20:36:35 UTC
Created attachment 39192 [details]
patch to filter out mcp55 chips w/o hyperthread interface capabilities

Hey, here you go, as promised.  The problem that this quirk addresses was only noted on AMD systems with HyperTransport buses, so this patch should cause the quirk to not be applied on MCP55 chips without the HT capability (which I think makes sense, as I would imagine this register may be nonexistent on those versions of the chip).  Anywho, I don't have a non-HT system here to test on, but with this patch, kdump still works on my MCP55-based server.  If you could test on your non-HyperTransport MCP55-based system and see if a post-2.6.36 kernel boots on it, I'd appreciate it.  If you give it a thumbs up, I'll post it ASAP.

Thanks!
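
The shape of the change described in this comment, an early return when the HT capability is absent, can be illustrated with a small user-space mock. This is a sketch under stated assumptions and not the attached patch: `struct mock_bridge`, `mcp55_quirk`, and the `ROUTE_BITS` mask are hypothetical names and values chosen for the example, and the register is modelled as a plain field rather than read through the kernel's PCI accessors.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a PCI device; not a kernel structure. */
struct mock_bridge {
    bool     has_ht_cap;  /* result of a capability-list lookup */
    uint32_t irq_route;   /* stand-in for the register the quirk rewrites */
};

/* Assumed for illustration only: the bits the quirk clears. */
#define ROUTE_BITS ((1u << 2) | (1u << 15))

/* Apply the quirk; return true only if the register was rewritten. */
static bool mcp55_quirk(struct mock_bridge *dev)
{
    /* The guard: leave bridges without a HyperTransport capability alone. */
    if (!dev->has_ht_cap)
        return false;

    /* Nothing to do if the bits are already clear. */
    if (!(dev->irq_route & ROUTE_BITS))
        return false;

    dev->irq_route &= ~ROUTE_BITS;
    return true;
}
```

With `has_ht_cap` false the function returns before touching `irq_route`, so a non-HT bridge keeps whatever value its firmware programmed; with `has_ht_cap` true the masked bits are cleared as before.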
Comment 9 Mathieu Bérard 2010-12-07 23:06:30 UTC
The patch allows the latest git pull of 2.6.37-rc to boot on my system.
Thanks!
Comment 10 Neil Horman 2010-12-08 00:18:46 UTC
Ok good, I'll clean it up and post it in the AM.  Thanks!
Comment 11 Neil Horman 2010-12-08 14:55:28 UTC
http://marc.info/?l=linux-kernel&m=129181976528659&w=2

posted for review
Comment 12 Ozan Caglayan 2011-01-09 20:11:27 UTC
This is fixed in mainline.
Comment 13 Florian Mickler 2011-01-23 15:31:21 UTC
fixed in .37-rc7 by

commit 49c2fa08a77a7eefa4cbc73601f64984aceacfa7
Author: Neil Horman <nhorman@tuxdriver.com>
Date:   Wed Dec 8 09:47:48 2010 -0500

    PCI: Update MCP55 quirk to not affect non HyperTransport variants
