Most recent kernel where this bug did not occur: Not Sure
Distribution: FC4
Hardware Environment:
Dual Opteron board (single vs. dual doesn't make any difference)
4 memory slots: 2x2GB, 2x256MB (all Registered+ECC)
qla2312 in 32-bit 33MHz PCI
Software Environment: vanilla kernel built from kernel.org source
Problem Description:
When the system has more than 4GB of memory, even 2 x 2GB + 2 x 256MB, the NMI watchdog timer goes off during the chip configuration (see trace below). Without the watchdog timer, the kernel just hangs.

Crash Log:
NMI Watchdog detected LOCKUP on CPU 1
CPU 1
Modules linked in: qla2xxx scsi_transport_fc sata_via libata dm_snapshot dm_zero dm_mirror
Pid: 863, comm: modprobe Tainted: G M 2.6.15-55.1831 #1
RIP: 0010:[<ffffffff80218bdc>] <ffffffff80218bdc>{__delay+12}
RSP: 0000:ffff81011eef5be8 EFLAGS: 00000087
RAX: 0000000089bb1ad6 RBX: 00000000004b9fe7 RCX: 0000000089bb0479
RDX: 000000000000001f RSI: 0000000000000292 RDI: 0000000000002328
RBP: ffff81011eef5be8 R08: 000000000000289c R09: 0000000000000034
R10: 0000000000000000 R11: 0000000000000010 R12: ffffc2000001c000
R13: ffff81011df38518 R14: ffff81011df38598 R15: 0000000000000296
FS: 00002aaaaaabc3c0(0000) GS:ffffffff8053f080(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaacef00f CR3: 000000011e1b4000 CR4: 00000000000006e0
Process modprobe (pid: 863, threadinfo ffff81011eef4000, task ffff81011f5b08a0)
Stack: ffff81011eef5bf8 ffffffff80218c18 ffff81011eef5c48 ffffffff88038389
       ffff81011eef5c48 ffffffff8803703e 0001000100000000 ffff81011df38518
       0000000000000000 0000000000000009
Call Trace:<ffffffff80218c18>{__const_udelay+51}
       <ffffffff88038389>{:qla2xxx:qla2x00_chip_diag+145}
       <ffffffff8803703e>{:qla2xxx:qla2x00_nvram_config+2046}
       <ffffffff88039f3f>{:qla2xxx:qla2x00_initialize_adapter+343}
       <ffffffff88033fd7>{:qla2xxx:qla2x00_probe_one+4184}
       <ffffffff8012f0b0>{default_wake_function+0}
       <ffffffff8012fdfc>{set_cpus_allowed+318}
       <ffffffff880344c2>{:qla2xxx:qla2xxx_probe_one+16}
       <ffffffff80222f54>{pci_device_probe+232}
       <ffffffff80274e90>{driver_probe_device+72}
       <ffffffff80274fc6>{__driver_attach+88}
       <ffffffff80274f6e>{__driver_attach+0}
       <ffffffff8027435f>{bus_for_each_dev+79}
       <ffffffff80274c74>{driver_attach+28}
       <ffffffff8027480f>{bus_add_driver+114}
       <ffffffff802752c8>{driver_register+103}
       <ffffffff80222b83>{__pci_register_driver+187}
       <ffffffff8806109e>{:qla2xxx:qla2x00_module_init+158}
       <ffffffff801501f5>{sys_init_module+311}
       <ffffffff8010da9a>{system_call+126}
It says right above where you file the bug: "NO BINARY MODULES or other tainted kernels"
bugme-daemon@bugzilla.kernel.org wrote:
>
> Status|NEW |REJECTED
> Resolution| |INVALID

Well it's almost certainly still a bug - I don't see how a binary module could have caused this lockup. And I don't see any modules in there which are proprietary, so perhaps we're just missing a MODULE_LICENSE tag or something.

Anyway. Nathaniel, please let us know why that kernel is tainted and, if poss, double-check that the lockup still happens on an untainted kernel - I'm sure it will.
Sorry, wrong crash log:

NMI Watchdog detected LOCKUP on CPU 1
CPU 1
Modules linked in: qla2xxx scsi_transport_fc dm_snapshot dm_zero dm_mirror sata_via libata sd_mod
Pid: 889, comm: modprobe Not tainted 2.6.16-rc5 #1
RIP: 0010:[<ffffffff8021b23c>] <ffffffff8021b23c>{__delay+12}
RSP: 0000:ffff81011e2fbbe8 EFLAGS: 00000097
RAX: 00000000e6bffd80 RBX: 00000000004a7fba RCX: 00000000e6bff5d5
RDX: 0000000000000014 RSI: 0000000000000296 RDI: 000000000000233a
RBP: ffff81011e2fbbe8 R08: 000000000000277b R09: 0000000000000036
R10: 00000000ffffffff R11: 0000000000000010 R12: ffffc2000001c000
R13: ffff81011d95c548 R14: ffff81011d95c5c8 R15: 0000000000000296
FS: 00002b9ede0ab3c0(0000) GS:ffff81011fc7ba40(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b5aae655660 CR3: 000000011e2e0000 CR4: 00000000000006e0
Process modprobe (pid: 889, threadinfo ffff81011e2fa000, task ffff81011f66e0c0)
Stack: ffff81011e2fbbf8 ffffffff8021b286 ffff81011e2fbc48 ffffffff8803cb4c
       ffff81011e2fbc48 ffffffff8803b6f3 0001000100000000 ffff81011d95c548
       0000000000000000 0000000000000009
Call Trace: <ffffffff8021b286>{__const_udelay+54}
       <ffffffff8803cb4c>{:qla2xxx:qla2x00_chip_diag+124}
       <ffffffff8803b6f3>{:qla2xxx:qla2x00_nvram_config+947}
       <ffffffff8803e967>{:qla2xxx:qla2x00_initialize_adapter+327}
       <ffffffff88038bdf>{:qla2xxx:qla2x00_probe_one+4175}
       <ffffffff8033ff6e>{wait_for_completion+254}
       <ffffffff80140dc8>{__queue_work+104}
       <ffffffff8012a6dd>{set_cpus_allowed+333}
       <ffffffff880390c0>{:qla2xxx:qla2xxx_probe_one+16}
       <ffffffff80225f93>{pci_device_probe+243}
       <ffffffff8027ad57>{driver_probe_device+103}
       <ffffffff8027af21>{__driver_attach+161}
       <ffffffff8027ae80>{__driver_attach+0}
       <ffffffff80279faf>{bus_for_each_dev+79}
       <ffffffff8027abac>{driver_attach+28}
       <ffffffff8027a398>{bus_add_driver+136}
       <ffffffff8027b49d>{driver_register+189}
       <ffffffff802259e2>{__pci_register_driver+162}
       <ffffffff8805c09e>{:qla2xxx:qla2x00_module_init+158}
       <ffffffff8014e3cd>{sys_init_module+301}
       <ffffffff8010adee>{system_call+126}
This dmesg is with the offending code replaced with a delay. The line that the watchdog trips on is in a loop waiting for a soft reset to complete on the chip. The spec for the chip says that we should be able to just wait 16 PCI clock cycles for it to complete, so I replaced the loop w/ a udelay(200) to get the debug messages to actually print. The driver is functional with this change and <= 4GB of memory.

As you can see below, there is a problem prior to the line that actually fails: the NVRAM checksum is entirely wrong because the NVRAM reads back as all zeros. Could it be the iobase is wrong? Would this be caused by something that kicks in when more than 4GB of memory is present? The only major difference I could come up with is IOMMU, but disabling that for >4GB did not change the outcome.

NOTE: The iobase listed w/ only 4GB is the same as w/ >4GB of memory.

DMESG: (w/ >4GB)
... snip ...
[ 71.948730] QLogic Fibre Channel HBA Driver
[ 71.948859] GSI 17 sharing vector 0xB1 and IRQ 17
[ 71.948875] ACPI: PCI Interrupt 0000:00:06.0[A] -> GSI 17 (level, low) -> IRQ 177
[ 71.949507] qla2xxx 0000:00:06.0: Found an ISP2312, irq 177, iobase 0xffffc2000001c000
[ 71.950048] qla2xxx 0000:00:06.0: Configuring PCI space...
[ 72.249938] qla2xxx 0000:00:06.0: Configure NVRAM parameters...
[ 72.334593] scsi(2): Contents of NVRAM
[ 72.334596]  0  1  2  3  4  5  6  7  8  9 Ah Bh Ch Dh Eh Fh
[ 72.334599] --------------------------------------------------------------
[ 72.334601] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 72.334612] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 72.334621] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 72.334631] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 72.334640] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 72.334649] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 72.334659] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 72.334668] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 72.334678] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 72.334687] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 72.334696] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 72.334706] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 72.334715] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 72.334725] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 72.334734] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 72.334743] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 72.334754] qla2xxx 0000:00:06.0: Inconsistent NVRAM detected: checksum=0x0 id=<4>qla2xxx 0000:00:06.0: Falling back to functioning (yet invalid -- WWPN) defaults.
[ 72.334762] scsi(2): NVRAM configuration failed!
[ 72.334765] qla2xxx 0000:00:06.0: Verifying loaded RISC code...
[ 72.334768] scsi(2): **** Load RISC code ****
[ 72.334771] scsi(2): Testing device at ffffc2000001c000.
... snip ...

DMESG: (w/ only 4GB, for comparison of identical code)
... snip ...
[ 67.107800] QLogic Fibre Channel HBA Driver
[ 67.108288] GSI 17 sharing vector 0xB1 and IRQ 17
[ 67.108304] ACPI: PCI Interrupt 0000:00:06.0[A] -> GSI 17 (level, low) -> IRQ 177
[ 67.108358] qla2xxx 0000:00:06.0: Found an ISP2312, irq 177, iobase 0xffffc2000001c000
[ 67.108813] qla2xxx 0000:00:06.0: Configuring PCI space...
[ 67.108971] qla2xxx 0000:00:06.0: Configure NVRAM parameters...
[ 67.193664] scsi(2): Contents of NVRAM
[ 67.193667]  0  1  2  3  4  5  6  7  8  9 Ah Bh Ch Dh Eh Fh
[ 67.193670] --------------------------------------------------------------
[ 67.193673] 49 53 50 20 01 00 01 00 06 a4 00 08 00 01 00 01
[ 67.193684] 08 01 21 00 00 e0 8b 14 e2 6f 00 00 00 00 00 00
[ 67.193694] 00 00 00 00 00 00 20 30 00 00 00 a0 00 00 00 00
[ 67.193704] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.193713] 00 00 00 00 00 00 02 0d 00 00 00 00 00 00 00 00
[ 67.193723] 00 05 2d 00 00 01 00 00 00 00 00 00 00 00 00 00
[ 67.193733] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.193742] 46 43 35 30 31 30 34 30 39 2d 32 35 20 20 20 20
[ 67.193753] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.193762] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.193772] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.193781] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.193791] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.193800] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.193809] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 67.193819] 00 00 00 00 00 00 00 00 00 00 77 10 00 01 00 8a
[ 67.193833] qla2xxx 0000:00:06.0: Verifying loaded RISC code...
[ 67.193836] scsi(2): **** Load RISC code ****
[ 67.193839] scsi(2): Testing device at ffffc2000001c000.
[ 67.194042] scsi(2): Reset register cleared by chip reset
[ 67.194054] scsi(2): Checking product ID of chip
[ 67.194059] scsi(2): Checking mailboxes.
... snip ...
> This dmesg is with the offending code replaced with a delay. The line that
> the Watchdog trips on is in a loop waiting for a softreset to complete on
> the chip (the spec for the chip says that we should be able to just wait 16
> PCI clock cycles for it to complete, so I replaced the loop w/ a udelay(200)
> to get the debug messages to actually print, the driver is functional with
> this change and <= 4GB of memory).

Something else is wrong... It's almost as if there's some hardware-level failure within this particular configuration at getting the ISP to complete reset...

> As you can see below there is a problem prior to the line that actually
> fails, the NVRAM checksum is entirely wrong because the NVRAM is 0. Could it
> be the iobase is wrong? Would this be caused by something that kicks when
> more than 4GB of memory are present? The only major difference I could come
> up with is IOMMU, but disabling that for >4GB did not change the outcome.
>
> NOTE: The iobase listed w/ only 4GB is the same as w/ >4GB of memory.
>
> DMESG: (w/ >4GB)
> ... snip ...
> [ 71.948730] QLogic Fibre Channel HBA Driver
> [ 71.948859] GSI 17 sharing vector 0xB1 and IRQ 17
> [ 71.948875] ACPI: PCI Interrupt 0000:00:06.0[A] -> GSI 17 (level, low) -> IRQ 177
> [ 71.949507] qla2xxx 0000:00:06.0: Found an ISP2312, irq 177, iobase 0xffffc2000001c000
> [ 71.950048] qla2xxx 0000:00:06.0: Configuring PCI space...
> [ 72.249938] qla2xxx 0000:00:06.0: Configure NVRAM parameters...
> [ 72.334593] scsi(2): Contents of NVRAM
> [ 72.334596]  0  1  2  3  4  5  6  7  8  9 Ah Bh Ch Dh Eh Fh
> [ 72.334599] --------------------------------------------------------------
> [ 72.334601] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [ 72.334612] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
...
> [ 72.334754] qla2xxx 0000:00:06.0: Inconsistent NVRAM detected: checksum=0x0 id=<4>qla2xxx 0000:00:06.0: Falling back to functioning (yet invalid -- WWPN) defaults.
> [ 72.334762] scsi(2): NVRAM configuration failed!
> [ 72.334765] qla2xxx 0000:00:06.0: Verifying loaded RISC code...
> [ 72.334768] scsi(2): **** Load RISC code ****

And this just verifies that -- running with invalid NVRAM is not a good thing...

We've run 23xx and 24xx boards on several dual-proc Opteron boards with 8GB of memory -- what type (model/manufacturer) of motherboard do you have?

> DMESG: (w/ only 4GB, for comparison of identical code)
> ... snip ...
> [ 67.107800] QLogic Fibre Channel HBA Driver
> [ 67.108288] GSI 17 sharing vector 0xB1 and IRQ 17
> [ 67.108304] ACPI: PCI Interrupt 0000:00:06.0[A] -> GSI 17 (level, low) -> IRQ 177
> [ 67.108358] qla2xxx 0000:00:06.0: Found an ISP2312, irq 177, iobase 0xffffc2000001c000
> [ 67.108813] qla2xxx 0000:00:06.0: Configuring PCI space...
> [ 67.108971] qla2xxx 0000:00:06.0: Configure NVRAM parameters...
> [ 67.193664] scsi(2): Contents of NVRAM
> [ 67.193667]  0  1  2  3  4  5  6  7  8  9 Ah Bh Ch Dh Eh Fh
> [ 67.193670] --------------------------------------------------------------
> [ 67.193673] 49 53 50 20 01 00 01 00 06 a4 00 08 00 01 00 01

Hmm, it's almost as if MMIO accesses are limping along...

--
av
CPUs: 2x Opteron(tm) Processor 244 (stepping: 10 family: 15 model: 5)
Motherboard: MSI K8T Master2-FAR
Chipset: VIA K8T 800 / VIA VT8237

Side Note: (This may be relevant) The Broadcom gigabit ethernet chip (driver: tg3) doesn't come up correctly either. There are no error messages to explain why. The link just never comes up. I haven't investigated to any extent (it does work in the 4GB case).
Google search for the mobo yielded:
http://groups.google.com/group/lucky.freebsd.amd64/browse_thread/thread/b859e7ecf2d97513/e4d765f8120a5664?lnk=st&q=MSI+K8T+Master+2+problems&rnum=3&hl=en#e4d765f8120a5664
I've tried the same setup on a different motherboard (Asus K8N-DL, BIOS rev 1007) and the driver loads with no problems. (The ethernet chip is the same, and it also loads.) After reading the other thread, it's clear that this is a bug with the hardware, though the MSI board works with 8GB of RAM under Windows, so a software fix exists. I've changed the summary to reflect the hardware-dependent nature of the bug; I'm not sure how the categorization should change. Thanks for all the help.
Any updates on this issue? Is it still present in the current kernel (2.6.22+)? Thanks.
Please reopen this bug if it's still present with kernel 2.6.22.