Bug 6235 - Crash/Lockup on loading qla2xxx on x86_64 with >4GB memory on MSI K8T Master2-FAR motherboard
Status: REJECTED INSUFFICIENT_DATA
Alias: None
Product: SCSI Drivers
Classification: Unclassified
Component: QLOGIC QLA2XXX
Hardware: i386 Linux
Importance: P2 high
Assignee: Andrew Vasquez
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-03-16 14:05 UTC by Nathaniel Clark
Modified: 2007-09-18 14:18 UTC
CC List: 3 users

See Also:
Kernel Version: 2.6.16-rc5
Subsystem:
Regression: ---
Bisected commit-id:


Description Nathaniel Clark 2006-03-16 14:05:57 UTC
Most recent kernel where this bug did not occur: Not Sure
Distribution: FC4
Hardware Environment: Dual opteron board (single vs. dual doesn't make any
difference)
  4 memory slots 2x2GB,2x256MB (all Registered+ECC)
  qla2312 in 32-bit 33MHz PCI

Software Environment: vanilla kernel built from kernel.org source

Problem Description:

When the system has more than 4GB of memory, even 2 x 2GB + 2 x 256MB, the NMI
watchdog timer goes off during the chip configuration (see trace below).

Without the watchdog timer, the kernel just hangs.

Crash Log:

NMI Watchdog detected LOCKUP on CPU 1
CPU 1 
Modules linked in: qla2xxx scsi_transport_fc sata_via libata dm_snapshot dm_zero
dm_mirror
Pid: 863, comm: modprobe Tainted: G   M  2.6.15-55.1831 #1
RIP: 0010:[<ffffffff80218bdc>] <ffffffff80218bdc>{__delay+12}
RSP: 0000:ffff81011eef5be8  EFLAGS: 00000087
RAX: 0000000089bb1ad6 RBX: 00000000004b9fe7 RCX: 0000000089bb0479
RDX: 000000000000001f RSI: 0000000000000292 RDI: 0000000000002328
RBP: ffff81011eef5be8 R08: 000000000000289c R09: 0000000000000034
R10: 0000000000000000 R11: 0000000000000010 R12: ffffc2000001c000
R13: ffff81011df38518 R14: ffff81011df38598 R15: 0000000000000296
FS:  00002aaaaaabc3c0(0000) GS:ffffffff8053f080(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaacef00f CR3: 000000011e1b4000 CR4: 00000000000006e0
Process modprobe (pid: 863, threadinfo ffff81011eef4000, task ffff81011f5b08a0)
Stack: ffff81011eef5bf8 ffffffff80218c18 ffff81011eef5c48 ffffffff88038389 
       ffff81011eef5c48 ffffffff8803703e 0001000100000000 ffff81011df38518 
       0000000000000000 0000000000000009 
 
Call Trace:<ffffffff80218c18>{__const_udelay+51}
       <ffffffff88038389>{:qla2xxx:qla2x00_chip_diag+145}
       <ffffffff8803703e>{:qla2xxx:qla2x00_nvram_config+2046}
       <ffffffff88039f3f>{:qla2xxx:qla2x00_initialize_adapter+343}
       <ffffffff88033fd7>{:qla2xxx:qla2x00_probe_one+4184}
       <ffffffff8012f0b0>{default_wake_function+0}
       <ffffffff8012fdfc>{set_cpus_allowed+318}
       <ffffffff880344c2>{:qla2xxx:qla2xxx_probe_one+16}
       <ffffffff80222f54>{pci_device_probe+232}
       <ffffffff80274e90>{driver_probe_device+72}
       <ffffffff80274fc6>{__driver_attach+88}
       <ffffffff80274f6e>{__driver_attach+0}
       <ffffffff8027435f>{bus_for_each_dev+79}
       <ffffffff80274c74>{driver_attach+28}
       <ffffffff8027480f>{bus_add_driver+114}
       <ffffffff802752c8>{driver_register+103}
       <ffffffff80222b83>{__pci_register_driver+187}
       <ffffffff8806109e>{:qla2xxx:qla2x00_module_init+158}
       <ffffffff801501f5>{sys_init_module+311}
       <ffffffff8010da9a>{system_call+126}
Comment 1 Martin J. Bligh 2006-03-16 14:19:59 UTC
It says right above where you file the bug:

"NO BINARY MODULES or other tainted kernels"
Comment 2 Andrew Morton 2006-03-16 14:29:33 UTC
bugme-daemon@bugzilla.kernel.org wrote:
>
>              Status|NEW                         |REJECTED
>          Resolution|                            |INVALID

Well it's almost certainly still a bug - I don't see how a binary module
could have caused this lockup.  And I don't see any modules in there which
are proprietary, so perhaps we're just missing a MODULE_LICENSE tag or
something.

Anyway.  Nathaniel, please let us know why that kernel is tainted and, if
possible, double-check that the lockup still happens on an untainted kernel -
I'm sure it will.

Comment 3 Nathaniel Clark 2006-03-16 14:39:07 UTC
Sorry, wrong crash log:

NMI Watchdog detected LOCKUP on CPU 1
CPU 1 
Modules linked in: qla2xxx scsi_transport_fc dm_snapshot dm_zero dm_mirror
sata_via libata sd_mod
Pid: 889, comm: modprobe Not tainted 2.6.16-rc5 #1
RIP: 0010:[<ffffffff8021b23c>] <ffffffff8021b23c>{__delay+12}
RSP: 0000:ffff81011e2fbbe8  EFLAGS: 00000097
RAX: 00000000e6bffd80 RBX: 00000000004a7fba RCX: 00000000e6bff5d5
RDX: 0000000000000014 RSI: 0000000000000296 RDI: 000000000000233a
RBP: ffff81011e2fbbe8 R08: 000000000000277b R09: 0000000000000036
R10: 00000000ffffffff R11: 0000000000000010 R12: ffffc2000001c000
R13: ffff81011d95c548 R14: ffff81011d95c5c8 R15: 0000000000000296
FS:  00002b9ede0ab3c0(0000) GS:ffff81011fc7ba40(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b5aae655660 CR3: 000000011e2e0000 CR4: 00000000000006e0
Process modprobe (pid: 889, threadinfo ffff81011e2fa000, task ffff81011f66e0c0)
Stack: ffff81011e2fbbf8 ffffffff8021b286 ffff81011e2fbc48 ffffffff8803cb4c 
       ffff81011e2fbc48 ffffffff8803b6f3 0001000100000000 ffff81011d95c548 
       0000000000000000 0000000000000009 
 
Call Trace: <ffffffff8021b286>{__const_udelay+54}
       <ffffffff8803cb4c>{:qla2xxx:qla2x00_chip_diag+124}
       <ffffffff8803b6f3>{:qla2xxx:qla2x00_nvram_config+947}
       <ffffffff8803e967>{:qla2xxx:qla2x00_initialize_adapter+327}
       <ffffffff88038bdf>{:qla2xxx:qla2x00_probe_one+4175}
       <ffffffff8033ff6e>{wait_for_completion+254}
       <ffffffff80140dc8>{__queue_work+104}
       <ffffffff8012a6dd>{set_cpus_allowed+333}
       <ffffffff880390c0>{:qla2xxx:qla2xxx_probe_one+16}
       <ffffffff80225f93>{pci_device_probe+243}
       <ffffffff8027ad57>{driver_probe_device+103}
       <ffffffff8027af21>{__driver_attach+161}
       <ffffffff8027ae80>{__driver_attach+0}
       <ffffffff80279faf>{bus_for_each_dev+79}
       <ffffffff8027abac>{driver_attach+28}
       <ffffffff8027a398>{bus_add_driver+136}
       <ffffffff8027b49d>{driver_register+189}
       <ffffffff802259e2>{__pci_register_driver+162}
       <ffffffff8805c09e>{:qla2xxx:qla2x00_module_init+158}
       <ffffffff8014e3cd>{sys_init_module+301}
       <ffffffff8010adee>{system_call+126}
Comment 4 Nathaniel Clark 2006-03-21 08:27:30 UTC
This dmesg is with the offending code replaced with a delay.  The line that the
watchdog trips on is in a loop waiting for a soft reset to complete on the chip.
The spec for the chip says that we should be able to just wait 16 PCI clock
cycles for it to complete, so I replaced the loop with a udelay(200) to get the
debug messages to actually print.  The driver is functional with this change and
<= 4GB of memory.
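
For illustration only, here is a toy model of the difference between a wait
loop that never sees the reset bit clear and a poll that gives up after a fixed
budget.  It is Python rather than driver code, and FakeChip, wait_reset_done
and the SOFT_RESET bit value are all made up for the example; nothing here is
taken from the qla2xxx source.

```python
# Toy model, NOT driver code: a bounded poll on a "soft reset" bit.
# All names and values here are hypothetical.

SOFT_RESET = 0x0001  # assumed bit position, for illustration only


class FakeChip:
    """Register whose reset bit clears after `clears_after` reads (None = never)."""

    def __init__(self, clears_after):
        self.clears_after = clears_after
        self.reads = 0

    def read_ctrl_status(self):
        self.reads += 1
        if self.clears_after is not None and self.reads >= self.clears_after:
            return 0
        return SOFT_RESET


def wait_reset_done(chip, max_polls):
    """Poll until the reset bit clears. Return polls used, or None on timeout."""
    for polls in range(1, max_polls + 1):
        if (chip.read_ctrl_status() & SOFT_RESET) == 0:
            return polls
    return None  # caller can report an error instead of spinning on


if __name__ == "__main__":
    # A chip that finishes its reset, and one that never does.
    print(wait_reset_done(FakeChip(clears_after=3), max_polls=100))
    print(wait_reset_done(FakeChip(clears_after=None), max_polls=100))
```

With a budget, a chip that never completes the reset produces a reportable
timeout rather than a CPU stuck in the loop, which is roughly the effect the
udelay(200) replacement had here.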

As you can see below, there is a problem prior to the line that actually fails:
the NVRAM validation fails because the NVRAM reads back as all zeros.  Could it
be that the iobase is wrong?  Would this be caused by something that kicks in
when more than 4GB of memory is present?  The only major difference I could come
up with is the IOMMU, but disabling that for >4GB did not change the outcome.

NOTE: The iobase listed w/ only 4GB is the same as w/ >4GB of memory.

DMESG: (w/ >4GB)
... snip ...
[   71.948730] QLogic Fibre Channel HBA Driver
[   71.948859] GSI 17 sharing vector 0xB1 and IRQ 17
[   71.948875] ACPI: PCI Interrupt 0000:00:06.0[A] -> GSI 17 (level, low) -> IRQ 177
[   71.949507] qla2xxx 0000:00:06.0: Found an ISP2312, irq 177, iobase
0xffffc2000001c000
[   71.950048] qla2xxx 0000:00:06.0: Configuring PCI space...
[   72.249938] qla2xxx 0000:00:06.0: Configure NVRAM parameters...
[   72.334593] scsi(2): Contents of NVRAM
[   72.334596]  0   1   2   3   4   5   6   7   8   9  Ah  Bh  Ch  Dh  Eh  Fh
[   72.334599] --------------------------------------------------------------
[   72.334601] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   72.334612] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   72.334621] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   72.334631] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   72.334640] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   72.334649] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   72.334659] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   72.334668] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   72.334678] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   72.334687] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   72.334696] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   72.334706] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   72.334715] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   72.334725] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   72.334734] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   72.334743] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   72.334754] qla2xxx 0000:00:06.0: Inconsistent NVRAM detected: checksum=0x0
id=<4>qla2xxx 0000:00:06.0: Falling back to functioning (yet invalid -- WWPN)
defaults.
[   72.334762] scsi(2): NVRAM configuration failed!
[   72.334765] qla2xxx 0000:00:06.0: Verifying loaded RISC code...
[   72.334768] scsi(2): **** Load RISC code ****
[   72.334771] scsi(2): Testing device at ffffc2000001c000.
... snip ...


DMESG: (w/ only 4GB, for comparison of identical code)
... snip ...
[   67.107800] QLogic Fibre Channel HBA Driver
[   67.108288] GSI 17 sharing vector 0xB1 and IRQ 17
[   67.108304] ACPI: PCI Interrupt 0000:00:06.0[A] -> GSI 17 (level, low) -> IRQ 177
[   67.108358] qla2xxx 0000:00:06.0: Found an ISP2312, irq 177, iobase
0xffffc2000001c000
[   67.108813] qla2xxx 0000:00:06.0: Configuring PCI space...
[   67.108971] qla2xxx 0000:00:06.0: Configure NVRAM parameters...
[   67.193664] scsi(2): Contents of NVRAM
[   67.193667]  0   1   2   3   4   5   6   7   8   9  Ah  Bh  Ch  Dh  Eh  Fh
[   67.193670] --------------------------------------------------------------
[   67.193673] 49  53  50  20  01  00  01  00  06  a4  00  08  00  01  00  01
[   67.193684] 08  01  21  00  00  e0  8b  14  e2  6f  00  00  00  00  00  00
[   67.193694] 00  00  00  00  00  00  20  30  00  00  00  a0  00  00  00  00
[   67.193704] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   67.193713] 00  00  00  00  00  00  02  0d  00  00  00  00  00  00  00  00
[   67.193723] 00  05  2d  00  00  01  00  00  00  00  00  00  00  00  00  00
[   67.193733] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   67.193742] 46  43  35  30  31  30  34  30  39  2d  32  35  20  20  20  20
[   67.193753] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   67.193762] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   67.193772] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   67.193781] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   67.193791] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   67.193800] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   67.193809] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   67.193819] 00  00  00  00  00  00  00  00  00  00  77  10  00  01  00  8a
[   67.193833] qla2xxx 0000:00:06.0: Verifying loaded RISC code...
[   67.193836] scsi(2): **** Load RISC code ****
[   67.193839] scsi(2): Testing device at ffffc2000001c000.
[   67.194042] scsi(2): Reset register cleared by chip reset
[   67.194054] scsi(2): Checking product ID of chip
[   67.194059] scsi(2): Checking mailboxes.
... snip ...
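
As a cross-check on the two NVRAM dumps above, the sketch below re-parses the
4GB dump and applies a simple validity test.  The test itself is an assumption
inferred from the "checksum=... id=..." message, not something confirmed in
this report: the image is treated as valid when its byte sum is 0 modulo 256
and its first four bytes are "ISP ".  The function names are made up for the
example.

```python
# Sketch under assumptions: validity = (byte sum % 256 == 0) and id == b"ISP ".
# This rule is inferred from the log message, not taken from the driver source.

GOOD_DUMP = """\
[   67.193673] 49  53  50  20  01  00  01  00  06  a4  00  08  00  01  00  01
[   67.193684] 08  01  21  00  00  e0  8b  14  e2  6f  00  00  00  00  00  00
[   67.193694] 00  00  00  00  00  00  20  30  00  00  00  a0  00  00  00  00
[   67.193704] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   67.193713] 00  00  00  00  00  00  02  0d  00  00  00  00  00  00  00  00
[   67.193723] 00  05  2d  00  00  01  00  00  00  00  00  00  00  00  00  00
[   67.193733] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   67.193742] 46  43  35  30  31  30  34  30  39  2d  32  35  20  20  20  20
[   67.193753] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   67.193762] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   67.193772] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   67.193781] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   67.193791] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   67.193800] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   67.193809] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
[   67.193819] 00  00  00  00  00  00  00  00  00  00  77  10  00  01  00  8a
"""


def parse_dump(text):
    """Turn '[timestamp] xx xx ...' dmesg rows into raw bytes."""
    data = bytearray()
    for line in text.splitlines():
        _, _, hexpart = line.partition("]")  # drop the printk timestamp
        data += bytes.fromhex(hexpart)       # whitespace between bytes is ignored
    return bytes(data)


def checksum(nvram):
    """8-bit sum of all bytes in the image."""
    return sum(nvram) & 0xFF


def looks_valid(nvram):
    """Assumed acceptance rule: zero checksum and an 'ISP ' signature."""
    return checksum(nvram) == 0 and nvram[:4] == b"ISP "


if __name__ == "__main__":
    good = parse_dump(GOOD_DUMP)
    zeros = bytes(256)  # what the >4GB boot read back
    print(len(good), checksum(good), good[:4], looks_valid(good))
    print(len(zeros), checksum(zeros), zeros[:4], looks_valid(zeros))
```

Under that assumption the 4GB image passes (its bytes sum to 2560, a multiple
of 256).  The all-zero image also has a checksum of 0, which matches the
"checksum=0x0" in the log, so it would be the missing "ISP " signature that
makes it inconsistent.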
Comment 5 Andrew Vasquez 2006-03-21 08:46:31 UTC
> This dmesg is with the offending code replaced with a delay.  The line that
> the watchdog trips on is in a loop waiting for a soft reset to complete on
> the chip.  The spec for the chip says that we should be able to just wait 16
> PCI clock cycles for it to complete, so I replaced the loop with a
> udelay(200) to get the debug messages to actually print.  The driver is
> functional with this change and <= 4GB of memory.

Something else is wrong...  It's almost as if there's some
hardware-level failure within this particular configuration at
getting the ISP to complete reset...

> As you can see below, there is a problem prior to the line that actually
> fails: the NVRAM validation fails because the NVRAM reads back as all zeros.
> Could it be that the iobase is wrong?  Would this be caused by something that
> kicks in when more than 4GB of memory is present?  The only major difference
> I could come up with is the IOMMU, but disabling that for >4GB did not change
> the outcome.
> 
> NOTE: The iobase listed w/ only 4GB is the same as w/ >4GB of memory.
> 
> DMESG: (w/ >4GB)
> ... snip ...
> [   71.948730] QLogic Fibre Channel HBA Driver
> [   71.948859] GSI 17 sharing vector 0xB1 and IRQ 17
> [   71.948875] ACPI: PCI Interrupt 0000:00:06.0[A] -> GSI 17 (level, low) ->
> IRQ 177
> [   71.949507] qla2xxx 0000:00:06.0: Found an ISP2312, irq 177, iobase
> 0xffffc2000001c000
> [   71.950048] qla2xxx 0000:00:06.0: Configuring PCI space...
> [   72.249938] qla2xxx 0000:00:06.0: Configure NVRAM parameters...
> [   72.334593] scsi(2): Contents of NVRAM
> [   72.334596]  0   1   2   3   4   5   6   7   8   9  Ah  Bh  Ch  Dh  Eh  Fh
> [   72.334599] --------------------------------------------------------------
> [   72.334601] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
> [   72.334612] 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
...
> [   72.334754] qla2xxx 0000:00:06.0: Inconsistent NVRAM detected:
> checksum=0x0
> id=<4>qla2xxx 0000:00:06.0: Falling back to functioning (yet invalid -- WWPN)
> defaults.
> [   72.334762] scsi(2): NVRAM configuration failed!
> [   72.334765] qla2xxx 0000:00:06.0: Verifying loaded RISC code...
> [   72.334768] scsi(2): **** Load RISC code ****

And this just verifies that -- running with invalid NVRAM is not
a good thing...

We've run 23xx and 24xx boards on several dual-proc Opteron
boards with 8GB of memory -- what type (model/manufacturer) of
motherboard do you have?

> DMESG: (w/ only 4GB, for comparison of identical code)
> ... snip ...
> [   67.107800] QLogic Fibre Channel HBA Driver
> [   67.108288] GSI 17 sharing vector 0xB1 and IRQ 17
> [   67.108304] ACPI: PCI Interrupt 0000:00:06.0[A] -> GSI 17 (level, low) ->
> IRQ 177
> [   67.108358] qla2xxx 0000:00:06.0: Found an ISP2312, irq 177, iobase
> 0xffffc2000001c000
> [   67.108813] qla2xxx 0000:00:06.0: Configuring PCI space...
> [   67.108971] qla2xxx 0000:00:06.0: Configure NVRAM parameters...
> [   67.193664] scsi(2): Contents of NVRAM
> [   67.193667]  0   1   2   3   4   5   6   7   8   9  Ah  Bh  Ch  Dh  Eh  Fh
> [   67.193670] --------------------------------------------------------------
> [   67.193673] 49  53  50  20  01  00  01  00  06  a4  00  08  00  01  00  01

Hmm, it's almost as if MMIO accesses are limping along...
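
One way to put numbers on this is to diff the printk timestamps around the
same steps in the two logs.  The sketch below does only that arithmetic; the
log lines are copied from the dmesg excerpts in this report, the helper names
are made up for the example, and which accesses happen during each step is not
something the logs themselves show.

```python
import re

# Lines copied from the two dmesg excerpts in this report.
LOG_OVER_4GB = """\
[   71.950048] qla2xxx 0000:00:06.0: Configuring PCI space...
[   72.249938] qla2xxx 0000:00:06.0: Configure NVRAM parameters...
[   72.334593] scsi(2): Contents of NVRAM
"""

LOG_4GB = """\
[   67.108813] qla2xxx 0000:00:06.0: Configuring PCI space...
[   67.108971] qla2xxx 0000:00:06.0: Configure NVRAM parameters...
[   67.193664] scsi(2): Contents of NVRAM
"""

STAMP = re.compile(r"^\[\s*(\d+)\.(\d{6})\]")


def stamps_us(log):
    """Return each line's printk timestamp as integer microseconds."""
    out = []
    for line in log.splitlines():
        m = STAMP.match(line)
        if m:
            out.append(int(m.group(1)) * 1_000_000 + int(m.group(2)))
    return out


def step_durations_us(log):
    """Microseconds elapsed between consecutive log lines."""
    t = stamps_us(log)
    return [later - earlier for earlier, later in zip(t, t[1:])]


if __name__ == "__main__":
    print(">4GB:", step_durations_us(LOG_OVER_4GB))
    print(" 4GB:", step_durations_us(LOG_4GB))
```

By these timestamps the "Configuring PCI space" step takes about 300 ms with
>4GB against about 0.16 ms with 4GB, while the NVRAM read step takes about
85 ms in both cases.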

--
av
Comment 6 Nathaniel Clark 2006-03-21 09:03:34 UTC
CPUs: 2x Opteron(tm) Processor 244 (stepping: 10 family: 15 model: 5)

Motherboard: MSI K8T Master2-FAR
    Chipset: VIA K8T 800 / VIA VT8237

Side Note: (This may be relevant)
The Broadcom gigabit ethernet chip (driver: tg3) doesn't come up correctly
either. There are no error messages to explain why.  The link just never comes
up.  I haven't investigated to any extent (it does work in the 4GB case).
Comment 7 Andrew Vasquez 2006-03-21 09:46:54 UTC
Google search for the mobo yielded:

http://groups.google.com/group/lucky.freebsd.amd64/browse_thread/thread/b859e7ecf2d97513/e4d765f8120a5664?lnk=st&q=MSI+K8T+Master+2+problems&rnum=3&hl=en#e4d765f8120a5664
Comment 8 Nathaniel Clark 2006-03-21 11:54:47 UTC
I've tried the same setup on a different motherboard (Asus K8N-DL BIOS rev 1007)
and the driver loads with no problems.  (The ethernet chip is the same, and it
also loads.)

After reading the other thread, it's clear that this is a bug with the hardware.
However, the MSI board works with 8GB of RAM under Windows, so a software fix
exists.

I've changed the summary to reflect the hardware-dependent nature of the bug.
I'm not sure how the categorization should change.

Thanks for all the help.
Comment 9 Natalie Protasevich 2007-07-22 22:58:52 UTC
Any updates on this issue?  Is it still present in the current kernel (2.6.22+)?
Thanks.
Comment 10 Adrian Bunk 2007-09-18 14:18:44 UTC
Please reopen this bug if it's still present with kernel 2.6.22.
