Bug 12164
Summary: | cciss driver hangs on boot | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | richlv |
Component: | SCSI | Assignee: | Willy Tarreau (w) |
Status: | CLOSED WILL_NOT_FIX | ||
Severity: | normal | CC: | alan, w |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.4.37 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | kernel .config, as used with 2.4.37 |
Description
richlv
2008-12-04 08:11:06 UTC
lspci -v -s 00:04.0

    00:04.0 RAID bus controller: Compaq Computer Corporation Smart Array 5i/532 (rev 01)
        Subsystem: Compaq Computer Corporation Smart Array 5i
        Flags: bus master, 66MHz, medium devsel, latency 71, IRQ 3
        Memory at f5f80000 (64-bit, non-prefetchable) [size=256K]
        I/O ports at 2800 [size=256]
        Memory at f5df0000 (64-bit, prefetchable) [size=16K]
        Expansion ROM at <unassigned> [disabled] [size=16K]
        Capabilities: [c0] Power Management version 2
        Capabilities: [cc] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable-
        Capabilities: [dc] PCI-X non-bridge device

grepping changelogs for cciss shows the following changes in kernel versions that could potentially break this :

    ChangeLog-2.4.26: o cciss update: support the new MSA30 storage enclosure
    ChangeLog-2.4.26: o cciss update: If no device attached we return -ENXIO instead of some bogus numbers
    ChangeLog-2.4.27: o cciss update
    ChangeLog-2.4.27: o cciss update: support for two new controllers
    ChangeLog-2.4.27: o Fix cciss bug in proc reporting
    ChangeLog-2.4.28: o Mike Miller: cciss typo fix
    ChangeLog-2.4.28: o cpqarray/cciss gcc3.4 inline fixes
    ChangeLog-2.4.28: o cciss update [1/5] PCI ID fix for cciss SATA hba
    ChangeLog-2.4.28: o cciss update [2/5] fix for 32/64-bit conversions
    ChangeLog-2.4.28: o cciss update [3/5] pci_dev->irq fix
    ChangeLog-2.4.28: o cciss update [4/5] fix for HP utilities
    ChangeLog-2.4.28: o cciss update [5/5] maintainers update for HP drivers

Hi,

sorry for the delay, I've not been much available these weeks. With which compiler(s) have you tried ? I have vague memories of a few of the changes you outlined above (eg: gcc-3.4 fixes). Those were done with a lot of care, so I would bet they're OK. But maybe the updates 1-5 contain something wrong.

I still find it strange that nobody encountered any problem with this driver for the last 4 years. Do you think that your setup has something specific which could explain why others are not affected ?
Willy

hi, it's ok - this is an older kernel after all :)

so far i have compiled kernels for testing only with gcc 3.4.6. nothing specific about this setup, single cciss adapter, no fancy configuration.

this adapter has been running fine with older 2.4 versions for... don't remember for sure, but 5 years or more :)

i could try compiling kernels on another machine with gcc 4.2.4, if only to nail down problematic version more - but thus testing requires rebooting the machine, so i can't do that too soon.

> ------- Comment #3 from rich@hq.vsaa.lv 2009-01-05 00:45 -------
> hi, it's ok - this is an older kernel after all :)

it's not a reason for not trying to fix it ;-)

> so far i have compiled kernels for testing only with gcc 3.4.6.
> nothing specific about this setup, single cciss adapter, no fancy
> configuration.

OK, I was wondering if the gcc-3.4 fixes could be incomplete.

> this adapter has been running fine with older 2.4 versions for... don't
> remember for sure, but 5 years or more :)

That's what I understood. It's certainly a regression in newer kernels, but seeing only *one* affected user makes the case suspicious. I really think that there is something specific in your setup that we must find in order to fix the driver's bug.

> i could try compiling kernels on another machine with gcc 4.2.4, if only to
> nail down problematic version more - but thus testing requires rebooting the
> machine, so i can't do that too soon.

It will not build with 4.2 (well, it may but the generated code will be wrong, with missing sections). Ideally, an older 3.3 or 2.95.3 would be fine.

Could you please post your last boot messages before the system hangs, as well as your config ? Also, could you try to press Alt-SysRq-T when the system is stuck to see if it's running a busy loop (the trace will then help us find where) or if it's completely frozen ?

attaching config used with 2.4.37.
previous boot messages, i'll have to get to the machine, reboot it, then probably take a photo of the screen and then transcribe it - i'll try to do that as soon as possible (hopefully, this week).

Alt-SysRq-T doesn't require magic sysrq support specifically compiled in like with newer kernels, right ?

> attaching config used with 2.4.37.

Your attachment failed :-) Did you use the "create a new attachment" link at the bottom of the page ?

> previous boot messages, i'll have to get to the machine, reboot it, then
> probably take a photo of the screen and then transcribe it - i'll try to do
> that as soon as possible (hopefully, this week).

OK

> Alt-SysRq-T doesn't require magic sysrq support specifically compiled in like
> with newer kernels, right ?

Yes it requires support to be compiled in, but it's the default. Grep for CONFIG_MAGIC_SYSRQ in your config.

Created attachment 19652 [details]
kernel .config, as used with 2.4.37

> Your attachment failed :-) Did you use the "create a new attachment" link
> at the bottom of the page ?

nah, it was a pebkac error :)

>> Alt-SysRq-T doesn't require magic sysrq support specifically compiled in like
>> with newer kernels, right ?

> Yes it requires support to be compiled in, but it's the default. Grep for
> CONFIG_MAGIC_SYSRQ in your config.

that's why i asked - grepped for magic, no magic (except CONFIG_FB_NEOMAGIC).

OK, you need to enable CONFIG_DEBUG_KERNEL to get MAGIC_SYSRQ.

BTW, I think that we have an HP DL360-G3 at Exosec, I'll check that next week. It would be really nice if I could reproduce the problem on similar hardware.
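The config check discussed above can be made less error-prone than a loose grep for "magic", which also matches unrelated options such as CONFIG_FB_NEOMAGIC. The following is only a minimal sketch: the helper name `config_option_state` and the sample text are made up for illustration, and the sample is not the attached .config. It relies on the usual .config convention that enabled options appear as `CONFIG_FOO=value` and disabled ones as `# CONFIG_FOO is not set`.

```python
import re


def config_option_state(config_text, option):
    """Return the value of `option` ('y', 'm', ...) or None if it is unset or absent.

    Hypothetical helper: only lines of the exact form OPTION=value count,
    so CONFIG_FB_NEOMAGIC is not mistaken for CONFIG_MAGIC_SYSRQ.
    """
    pattern = re.compile(r"^%s=(.*)$" % re.escape(option), re.MULTILINE)
    match = pattern.search(config_text)
    return match.group(1) if match else None


# Made-up sample, NOT the attached config: it mimics a config where
# CONFIG_DEBUG_KERNEL is off and MAGIC_SYSRQ therefore does not appear at all.
sample = """\
CONFIG_FB_NEOMAGIC=y
# CONFIG_DEBUG_KERNEL is not set
"""

for name in ("CONFIG_DEBUG_KERNEL", "CONFIG_MAGIC_SYSRQ"):
    print(name, "->", config_option_state(sample, name))
```

To check a real tree, read the `.config` file into a string and pass it in place of `sample`.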
i accidentally ended near the machine, so i took the chance to get some data :)

console output, when booting 2.4.37 (and hang happens) :

    ----------------------------- 2.4.37 ---
    FDC 0 is a National Semiconductor PC87306
    loop: loaded (max 8 devices)
    HP CISS Driver (v 2.4.60)
    blocks= 71122560 block_size= 512
    heads= 255, sectors= 32, cylinders= 8716 RAID 0
    Partition check:
    cciss/c0d0:
    ----------------------------- 2.4.37 ---

and now, 2.4.25, which works. notice, output differs - cciss driver version, two additional lines :

    ----------------------------- 2.4.25 ---
    FDC 0 is a National Semiconductor PC87306
    loop: loaded (max 8 devices)
    HP CISS Driver (v 2.4.50)
    cciss: Device 0xb178 has been found at bus 0 dev 4 func 0
    blocks= 71122560 block_size= 512
    heads= 255, sectors= 32, cylinders= 8716 RAID 0
    blk: queue c0403000, I/O limit 4294967295Mb (mask 0xffffffffffffffff)
    Partition check:
    cciss/c0d0: p1 p2
    ----------------------------- 2.4.25 ---

(retyped from a crappy phone camera output, a character here or there might be off).

i compiled 2.4.37 with sysrq support, but somehow i could not get any of the combos to work - not even alt+sysrq+b (i verified, .config contains CONFIG_MAGIC_SYSRQ=y).

on the other hand, ctrl+alt+del at this point does reboot the machine, so maybe i did something wrong there.

gcc 3.4.6 seems to be the oldest version i have access to right now.

On Mon, Jan 05, 2009 at 04:57:11AM -0800, bugme-daemon@bugzilla.kernel.org wrote:
> Partition check:
> cciss/c0d0:

OK so the disk is detected and the hang happens while reading the partition table.

> blk: queue c0403000, I/O limit 4294967295Mb (mask 0xffffffffffffffff)

This line is not important, it's an old debug message that was left over and that we finally removed in recent kernels. It's not part of the difference.

> (retyped from a crappy phone camera output, a character here or there might
> be off).

OK.

> i compiled 2.4.37 with sysrq support, but somehow i could not get any of the
> combos to work - not even alt+sysrq+b (i verified, .config contains
> CONFIG_MAGIC_SYSRQ=y).
> on the other hand, ctrl+alt+del at this point does reboot the machine, so
> maybe i did something wrong there.

Strange indeed. I'm used to perform it with AltGr-sysrq, but recently found a keyboard on which it does not work and I had to use the left Alt for this. Maybe you have something similar.

> gcc 3.4.6 seems to be the oldest version i have access to right now.

I do have older ones. If you're interested, I can build you a kernel with your config on 2.95.3 for instance. I'd also suggest trying the cciss code from 2.6.27 and 2.6.26 (copy them over 2.6.28's). If it's too much work for you, let's wait for me to get access to a similar machine first.

> Strange indeed. I'm used to perform it with AltGr-sysrq, but recently found
> a keyboard on which it does not work and I had to use the left Alt for this.
> Maybe you have something similar.

head-rackconsole. i only tried left alt. somehow i'm used to altgr being used for diacritic chars and it did not occur to me to try it at all...

> I do have older ones. If you're interested, I can build you a kernel with
> your config on 2.95.3 for instance. I'd also suggest trying the cciss code
> from 2.6.27 and 2.6.26 (copy them over 2.6.28's). If it's too much work for
> you, let's wait for me to get access to a similar machine first.

this depends on me being the location where the machine is - if i get there before you get to yours, i'll try altgr for sysrq, and copying cciss over .28 kernel.

what about .25 (known working), should i try that one as well, if i get to it ?

> what about .25 (known working), should i try that one as well, if i get to it ?

you could if you want, it would confirm that the regression is really located in the driver and not somewhere else.
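As a side check on the console output retyped above, the reported block count can be compared against the reported geometry. This is only a sketch (the function name `blocks_from_geometry` is made up here); it shows that heads x sectors x cylinders reproduces the `blocks=` value that both 2.4.25 and 2.4.37 print, which fits the reading that the disk itself is detected and the hang comes later, during the partition check.

```python
def blocks_from_geometry(heads, sectors, cylinders):
    """Total number of blocks implied by a heads/sectors/cylinders geometry."""
    return heads * sectors * cylinders


# Values as retyped from the console of the affected machine
# (the same in the 2.4.25 and 2.4.37 output).
reported_blocks = 71122560
block_size = 512
computed = blocks_from_geometry(heads=255, sectors=32, cylinders=8716)

print("geometry matches reported blocks:", computed == reported_blocks)
print("capacity in bytes:", reported_blocks * block_size)
```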
Rich,

I have performed some tests on an HP DL360 which contains a Smart Array 5i. This one works perfectly fine :

    # cat /proc/driver/cciss/cciss0
    cciss0: HP Smart Array 5i Controller
    Board ID: 0x40800e11
    Firmware Version: 2.38
    IRQ: 31
    Logical drives: 1
    Current Q depth: 0
    Current # commands on controller: 0
    Max Q depth since init: 1
    Max # commands on controller since init: 1
    Max SG entries since init: 4
    Sequential access devices: 0
    cciss/c0d0: 17.35GB RAID 0

    # cat /proc/partitions
    major minor  #blocks  name
     104     0   17776560 cciss/c0d0
     104     1     102384 cciss/c0d0p1
     104     2    5120000 cciss/c0d0p2
     104     3    9845760 cciss/c0d0p3
     104     4          1 cciss/c0d0p4
     104     5     557040 cciss/c0d0p5
     104     6    2047984 cciss/c0d0p6
     104     7     102384 cciss/c0d0p7

    # lspci -nnvvv
    01:04.0 Class 0104: 0e11:b178 (rev 01)
        Subsystem: 0e11:4080
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr+ Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 71, cache line size 08
        Interrupt: pin A routed to IRQ 31
        Region 0: Memory at f7ec0000 (64-bit, non-prefetchable) [size=256K]
        Region 2: I/O ports at 3000 [size=256]
        Region 3: Memory at f7df0000 (64-bit, prefetchable) [size=16K]
        Expansion ROM at <unassigned> [disabled] [size=16K]
        Capabilities: [c0] Power Management version 2
            Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
            Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [cc] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable-
            Address: 0000000000000000 Data: 0000
        Capabilities: [dc] PCI-X non-bridge device.
            Command: DPERE- ERO- RBC=0 OST=3
            Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, DC=simple, DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-

During boot, the kernel reports the following information about the driver :

    FDC 0 is a National Semiconductor PC87306
    RAMDISK driver initialized: 16 RAM disks of 18000K size 1024 blocksize
    loop: loaded (max 8 devices)
    Compaq SMART2 Driver (v 2.4.28)
    HP CISS Driver (v 2.4.60)
    cciss: Device 0xb178 has been found at bus 1 dev 4 func 0
    blocks= 35553120 block_size= 512
    heads= 255, sectors= 32, cylinders= 4357 RAID 0
    Partition check:
    cciss/c0d0: p1 p2 p3 p4 < p5 p6 p7 >
    v2.3 : Micro Memory(tm) PCI memory board block driver

Would it be possible that you have too old a firmware on this board and that a simple firmware upgrade would solve the issue ? You can check this in /proc/driver/cciss/cciss0.

Willy

thanks for doing the tests. indeed, firmware version is somewhat older - 2.36 (vs 2.38 for you).

if i'm reading the correct information, going 2.36->2.38 only changes "Added support for the 4-Port Shared Storage Module.", which shouldn't be relevant, i think.

what's interesting, another, identical machine... is running 2.4.29 just fine. so it's something either with the particular kernel config, or with hardware.

these machines run the same distribution, same version. only difference is the one that boots fine has two disks in raid 1 config, while the one having problems has raid 0 with single hdd only.

could this difference be important ?

> thanks for doing the tests. indeed, firmware version is somewhat older - 2.36
> (vs 2.38 for you).
>
> if i'm reading the correct information, going 2.36->2.38 only changes "Added
> support for the 4-Port Shared Storage Module.", which shouldn't be relevant,
> i think.

I don't think so either, unless they have fixed other minor issues, of course.

> what's interesting, another, identical machine... is running 2.4.29 just fine.
> so it's something either with the particular kernel config, or with hardware.

OK.

> these machines run the same distribution, same version. only difference is the
> one that boots fine has two disks in raid 1 config, while the one having
> problems has raid 0 with single hdd only.
> could this difference be important ?

maybe. It is possible that some RAID configs exhibit the problems and others not. Could you try to boot your 2.4.29 kernel on the machine showing the problem ? This would indicate whether the hard or the soft makes the difference.

Willy

got some more data :

2.4.29 from the other machine also hangs on boot;

downgraded gcc on the problematic machine to the one from slackware 10.2 - 3.3.6. that allowed me to compile all kernels, thus i can report that it boots with 2.4.26, and hangs with 2.4.27 -> 2.4.27 broke the partition listing part somehow.

2.4.27 still seems to have quite many changes, so it might not be much help, but maybe that gives some hints.

On Thu, Jan 29, 2009 at 09:29:57AM -0800, bugme-daemon@bugzilla.kernel.org wrote:
> 2.4.29 from the other machine also hangs on boot;

OK so the hardware (or the RAID controller's firmware) makes a difference between the two machines.

> downgraded gcc on the problematic machine to the one from slackware 10.2 -
> 3.3.6.
> that allowed me to compile all kernels, thus i can report that it boots with
> 2.4.26, and hangs with 2.4.27 -> 2.4.27 broke the partition listing part
> somehow.
> 2.4.27 still seems to have quite many changes, so it might not be much help,
> but maybe that gives some hints.

OK it's good that you could isolate the precise version which introduced the change. However, the changes between the two versions' driver are really minimal. Just two new controllers added, and better support for 64-bit systems. I don't see anything else :-( It is possible that another change outside this driver had a side effect though.
I seem to remember that there was one long-term stable debian kernel based on 2.4.27. If you can put your hands on it, it would help us troubleshoot the issue, as it might contain a fix for this obscure problem, that might not have gone upstream.

The difficulty right now is that 2.4.27 is 5-years old and we'll not find anyone able to explain change X or Y back then. Also, since we have other systems running the same hardware fine, and I'm not aware of any other such report (and HP servers are very common), it is really possible that your particular firmware is botched and that an upgrade will fix all the issues up to 2.4.37. Do you know if you can try such an upgrade easily ? From my memories, it should be on the smartsuite CD or floppy, but I may be wrong.

Willy

i tried to copy cciss code over from "last good" to "first bad" and compile that. produced some warnings regarding cciss, but compiled. such a combination failed at boot the same way, which seems to indicate that the problem actually is outside cciss code...

Hi,

that's very useful information, thanks. However it will be harder to debug :-(

Willy
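The narrowing-down done in this thread (boots with 2.4.26, hangs with 2.4.27) amounts to searching an ordered list of kernel versions for the first one that fails. The following is only a sketch of that idea as a binary search. The `boots` predicate is a stand-in for actually building and booting each kernel on the affected machine, and it assumes every release up to 2.4.26 boots and every later one hangs. The thread reports results only for 2.4.25, 2.4.26, 2.4.27, 2.4.29 and 2.4.37, so the rest is an assumption.

```python
def first_bad(versions, boots):
    """Return the first version for which boots() is False.

    Assumes versions[0] is known good, versions[-1] is known bad, and that
    once a version fails, all later ones fail too.
    """
    lo, hi = 0, len(versions) - 1  # versions[lo] boots, versions[hi] hangs
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if boots(versions[mid]):
            lo = mid
        else:
            hi = mid
    return versions[hi]


# 2.4.25 (known working) through 2.4.37 (known hanging).
versions = ["2.4.%d" % n for n in range(25, 38)]


def boots(version):
    # Stand-in for a real boot test; encodes the assumption described above.
    return int(version.split(".")[2]) <= 26


print(first_bad(versions, boots))
```

With 13 candidate releases this needs four boot tests instead of up to eleven, which matters when each test means rebooting a production machine.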