Bug 12164 - cciss driver hangs on boot
Status: CLOSED WILL_NOT_FIX
Product: IO/Storage
Classification: Unclassified
Component: SCSI
Hardware: All Linux
Importance: P1 normal
Assigned To: Willy Tarreau
Reported: 2008-12-04 08:11 UTC by richlv
Modified: 2012-05-22 16:58 UTC (History)
Kernel Version: 2.4.37
Tree: Mainline
Regression: No


Attachments
kernel .config, as used with 2.4.37 (22.04 KB, text/plain)
2009-01-05 01:50 UTC, richlv

Description richlv 2008-12-04 08:11:06 UTC
Latest working kernel version: 2.4.25
Earliest failing kernel version: 2.4.28
Distribution: slackware 11.0
Hardware Environment: HP ProLiant DL360 G3
Software Environment:
Problem Description:

booting kernels 2.4.28 through 2.4.37 hangs at the "Partition check" step.
the latest kernel that boots is 2.4.25.
2.4.26 and 2.4.27 fail to compile on that machine, so nailing this down to a precise version has been problematic.
the cciss driver is compiled into the kernel.

more information probably will be needed, i'll be happy to provide it.

adapter :
00:04.0 RAID bus controller: Compaq Computer Corporation Smart Array 5i/532 (rev 01)

Steps to reproduce:
boot a kernel 2.4.28 to 2.4.37 on such a hw configuration
Comment 1 richlv 2008-12-05 01:15:46 UTC
lspci -v -s 00:04.0

00:04.0 RAID bus controller: Compaq Computer Corporation Smart Array 5i/532 (rev 01)
        Subsystem: Compaq Computer Corporation Smart Array 5i
        Flags: bus master, 66MHz, medium devsel, latency 71, IRQ 3
        Memory at f5f80000 (64-bit, non-prefetchable) [size=256K]
        I/O ports at 2800 [size=256]
        Memory at f5df0000 (64-bit, prefetchable) [size=16K]
        Expansion ROM at <unassigned> [disabled] [size=16K]
        Capabilities: [c0] Power Management version 2
        Capabilities: [cc] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable-
        Capabilities: [dc] PCI-X non-bridge device


grepping changelogs for cciss shows the following changes in kernel versions that could potentially break this :

ChangeLog-2.4.26:  o cciss update: support the new MSA30 storage enclosure
ChangeLog-2.4.26:  o cciss update: If no device attached we return -ENXIO instead of some bogus numbers
ChangeLog-2.4.27:  o cciss update
ChangeLog-2.4.27:  o cciss update: support for two new controllers
ChangeLog-2.4.27:  o Fix cciss bug in proc reporting
ChangeLog-2.4.28:  o Mike Miller: cciss typo fix
ChangeLog-2.4.28:  o cpqarray/cciss gcc3.4 inline fixes
ChangeLog-2.4.28:  o cciss update [1/5] PCI ID fix for cciss SATA hba
ChangeLog-2.4.28:  o cciss update [2/5] fix for 32/64-bit conversions
ChangeLog-2.4.28:  o cciss update [3/5] pci_dev->irq fix
ChangeLog-2.4.28:  o cciss update [4/5] fix for HP utilities
ChangeLog-2.4.28:  o cciss update [5/5] maintainers update for HP drivers
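
To see at a glance how many cciss changes each release carries, the grep output can be grouped by kernel version. This is only a minimal sketch: the input is a subset of the lines quoted above, and the exact grep invocation mentioned in the comment is an assumption, since the command used is not shown in this report.

```python
import re
from collections import defaultdict

# A subset of the changelog lines quoted above, as they might come out of
# something like `grep -i cciss ChangeLog-2.4.*` (the command is an assumption).
GREP_OUTPUT = """\
ChangeLog-2.4.26:  o cciss update: support the new MSA30 storage enclosure
ChangeLog-2.4.27:  o cciss update
ChangeLog-2.4.27:  o cciss update: support for two new controllers
ChangeLog-2.4.28:  o cciss update [3/5] pci_dev->irq fix
"""

# "ChangeLog-<version>:  o <entry>"
LINE_RE = re.compile(r"^ChangeLog-(\d+\.\d+\.\d+):\s+o\s+(.*)$")


def group_by_version(text):
    """Return {kernel version: [changelog entries]} for grep-style lines."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            version, entry = match.groups()
            groups[version].append(entry)
    return dict(groups)


if __name__ == "__main__":
    for version, entries in group_by_version(GREP_OUTPUT).items():
        print(version, len(entries), "cciss change(s)")
```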
Comment 2 Willy Tarreau 2009-01-04 15:54:01 UTC
Hi,

sorry for the delay, I've not been much available these weeks.

Which compiler(s) have you tried ? I have vague memories of a few of the
changes you outlined above (eg: gcc-3.4 fixes). Those were done with a lot of
care, so I would bet they're OK. But maybe the updates 1-5 contain something
wrong. I still find it strange that nobody encountered any problem with this
driver for the last 4 years. Do you think that your setup has something specific
which could explain why others are not affected ?

Willy
Comment 3 richlv 2009-01-05 00:45:46 UTC
hi, it's ok - this is an older kernel after all :)

so far i have compiled kernels for testing only with gcc 3.4.6.
nothing specific about this setup, single cciss adapter, no fancy configuration.

this adapter has been running fine with older 2.4 versions for... don't remember for sure, but 5 years or more :)

i could try compiling kernels on another machine with gcc 4.2.4, if only to nail down problematic version more - but this testing requires rebooting the machine, so i can't do that too soon.
Comment 4 Willy Tarreau 2009-01-05 01:21:26 UTC
> ------- Comment #3 from rich@hq.vsaa.lv  2009-01-05 00:45 -------
> hi, it's ok - this is an older kernel after all :)

it's not a reason for not trying to fix it ;-)

> so far i have compiled kernels for testing only with gcc 3.4.6.
> nothing specific about this setup, single cciss adapter, no fancy
> configuration.

OK, I was wondering if the gcc-3.4 fixes could be incomplete.

> this adapter has been running fine with older 2.4 versions for... don't
> remember for sure, but 5 years or more :)

That's what I understood. It's certainly a regression in newer kernels,
but seeing only *one* affected user makes the case suspicious. I really
think that there is something specific in your setup that we must find
in order to fix the driver's bug.

> i could try compiling kernels on another machine with gcc 4.2.4, if only to
> nail down problematic version more - but this testing requires rebooting the
> machine, so i can't do that too soon.

It will not build with 4.2 (well, it may but the generated code will be
wrong, with missing sections). Ideally, an older 3.3 or 2.95.3 would be
fine.

Could you please post your last boot messages before the system hangs,
as well as your config ?

Also, could you try to press Alt-SysRq-T when the system is stuck to see
if it's running a busy loop (the trace will then help us find where) or
if it's completely frozen ?


Comment 5 richlv 2009-01-05 01:29:56 UTC
attaching config used with 2.4.37.
previous boot messages, i'll have to get to the machine, reboot it, then probably take a photo of the screen and then transcribe it - i'll try to do that as soon as possible (hopefully, this week).

Alt-SysRq-T doesn't require magic sysrq support specifically compiled in like with newer kernels, right ?
Comment 6 Willy Tarreau 2009-01-05 01:36:04 UTC
> attaching config used with 2.4.37.

Your attachment failed :-) Did you use the "create a new attachment" link
at the bottom of the page ?

> previous boot messages, i'll have to get to the machine, reboot it, then
> probably take a photo of the screen and then transcribe it - i'll try to do
> that as soon as possible (hopefully, this week).

OK

> Alt-SysRq-T doesn't require magic sysrq support specifically compiled in like
> with newer kernels, right ?

Yes it requires support to be compiled in, but it's the default. Grep for
CONFIG_MAGIC_SYSRQ in your config.
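
A plain grep for the option name can miss the difference between "set", "explicitly not set" and "absent from the file". A minimal sketch of that check follows; the helper name `config_option_state` and the sample text are hypothetical and are not taken from the attached .config.

```python
def config_option_state(config_text, option):
    """Return the option's value ('y', 'm', ...) or None if unset or absent."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
        if line == "# %s is not set" % option:
            return None
    return None


# Hypothetical sample, made up for illustration only.
SAMPLE = """\
CONFIG_FB_NEOMAGIC=y
# CONFIG_DEBUG_KERNEL is not set
"""

if __name__ == "__main__":
    for name in ("CONFIG_DEBUG_KERNEL", "CONFIG_MAGIC_SYSRQ"):
        print(name, "->", config_option_state(SAMPLE, name))
```

An option that is missing entirely, as opposed to "is not set", usually means that something it depends on is disabled, so the prompt was never offered.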

Comment 7 richlv 2009-01-05 01:50:23 UTC
Created attachment 19652
kernel .config, as used with 2.4.37

> Your attachment failed :-) Did you use the "create a new attachment" link
> at the bottom of the page ?

nah, it was a pebkac error :)

>> Alt-SysRq-T doesn't require magic sysrq support specifically compiled in like
>> with newer kernels, right ?

> Yes it requires support to be compiled in, but it's the default. Grep for
> CONFIG_MAGIC_SYSRQ in your config.

that's why i asked - grepped for magic, no magic (except CONFIG_FB_NEOMAGIC).
Comment 8 Willy Tarreau 2009-01-05 02:13:17 UTC
OK, you need to enable CONFIG_DEBUG_KERNEL to get MAGIC_SYSRQ. BTW, I think that we have an HP DL360-G3 at Exosec, I'll check that next week. It would be really nice if I could reproduce the problem on similar hardware.
Comment 9 richlv 2009-01-05 04:57:10 UTC
i accidentally ended up near the machine, so i took the chance to get some data :)

console output, when booting 2.4.37 (and hang happens) :

----------------------------- 2.4.37 ---
FDC 0 is a National Semiconductor PC87306
loop: loaded (max 8 devices)
HP CISS Driver (v 2.4.60)
    blocks= 71122560 block_size= 512
    heads= 255, sectors= 32, cylinders= 8716 RAID 0

Partition check:
 cciss/c0d0:
----------------------------- 2.4.37 ---

and now, 2.4.25, which works. notice, output differs - cciss driver version, two additional lines :

----------------------------- 2.4.25 ---
FDC 0 is a National Semiconductor PC87306
loop: loaded (max 8 devices)
HP CISS Driver (v 2.4.50)
cciss: Device 0xb178 has been found at bus 0 dev 4 func 0
    blocks= 71122560 block_size= 512
    heads= 255, sectors= 32, cylinders= 8716 RAID 0

blk: queue c0403000, I/O limit 4294967295Mb (mask 0xffffffffffffffff)
Partition check:
 cciss/c0d0: p1 p2
----------------------------- 2.4.25 ---

(retyped from a crappy phone camera output, a character here or there might be off).
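
The differences between the two transcripts can also be listed mechanically. A minimal sketch using Python's difflib on the cciss-related lines retyped above (blank lines and the unrelated FDC/loop lines are left out); the function name `changed_lines` is just an illustration.

```python
import difflib

# cciss-related console lines from the working 2.4.25 boot, as retyped above.
LOG_2_4_25 = """\
HP CISS Driver (v 2.4.50)
cciss: Device 0xb178 has been found at bus 0 dev 4 func 0
    blocks= 71122560 block_size= 512
    heads= 255, sectors= 32, cylinders= 8716 RAID 0
blk: queue c0403000, I/O limit 4294967295Mb (mask 0xffffffffffffffff)
Partition check:
 cciss/c0d0: p1 p2
""".splitlines()

# The same section from the hanging 2.4.37 boot.
LOG_2_4_37 = """\
HP CISS Driver (v 2.4.60)
    blocks= 71122560 block_size= 512
    heads= 255, sectors= 32, cylinders= 8716 RAID 0
Partition check:
 cciss/c0d0:
""".splitlines()


def changed_lines(old, new):
    """Return (removed, added): lines only in old and lines only in new."""
    removed, added = [], []
    for line in difflib.ndiff(old, new):
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
    return removed, added


if __name__ == "__main__":
    removed, added = changed_lines(LOG_2_4_25, LOG_2_4_37)
    for line in removed:
        print("only in 2.4.25:", line)
    for line in added:
        print("only in 2.4.37:", line)
```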

i compiled 2.4.37 with sysrq support, but somehow i could not get any of the combos to work - not even alt+sysrq+b (i verified, .config contains CONFIG_MAGIC_SYSRQ=y).
on the other hand, ctrl+alt+del at this point does reboot the machine, so maybe i did something wrong there.

gcc 3.4.6 seems to be the oldest version i have access to right now.
Comment 10 Willy Tarreau 2009-01-05 05:24:50 UTC
On Mon, Jan 05, 2009 at 04:57:11AM -0800, bugme-daemon@bugzilla.kernel.org wrote:
> Partition check:
>  cciss/c0d0:

OK so the disk is detected and the hang happens while reading the partition
table.

> blk: queue c0403000, I/O limit 4294967295Mb (mask 0xffffffffffffffff)

This line is not important, it's an old debug message that was left over
and that we finally removed in recent kernels. It's not part of the
difference.

> (retyped from a crappy phone camera output, a character here or there might be
> off).

OK.

> i compiled 2.4.37 with sysrq support, but somehow i could not get any of the
> combos to work - not even alt+sysrq+b (i verified, .config contains
> CONFIG_MAGIC_SYSRQ=y).
> on the other hand, ctrl+alt+del at this point does reboot the machine, so maybe
> i did something wrong there.

Strange indeed. I'm used to perform it with AltGr-sysrq, but recently found
a keyboard on which it does not work and I had to use the left Alt for this.
Maybe you have something similar.

> gcc 3.4.6 seems to be the oldest version i have access to right now.

I do have older ones. If you're interested, I can build you a kernel with
your config on 2.95.3 for instance. I'd also suggest trying the cciss code
from 2.4.27 and 2.4.26 (copy them over 2.4.28's). If it's too much work for
you, let's wait for me to get access to a similar machine first.

Comment 11 richlv 2009-01-05 05:34:25 UTC
> Strange indeed. I'm used to perform it with AltGr-sysrq, but recently found
> a keyboard on which it does not work and I had to use the left Alt for this.
> Maybe you have something similar.

head-rackconsole. i only tried left alt. somehow i'm used to altgr being used for diacritic chars and it did not occur to me to try it at all...

> I do have older ones. If you're interested, I can build you a kernel with
> your config on 2.95.3 for instance. I'd also suggest trying the cciss code
> from 2.4.27 and 2.4.26 (copy them over 2.4.28's). If it's too much work for
> you, let's wait for me to get access to a similar machine first.

this depends on me being at the location where the machine is - if i get there before you get to yours, i'll try altgr for sysrq, and copying cciss over .28 kernel.
what about .25 (known working), should i try that one as well, if i get to it ?
Comment 12 Willy Tarreau 2009-01-05 05:38:24 UTC
> what about .25 (known working), should i try that one as well, if i get to it ?
you could if you want, it would confirm that the regression is really located
in the driver and not somewhere else.

Comment 13 Willy Tarreau 2009-01-21 02:49:20 UTC
Rich,

I have performed some tests on an HP DL360 which contains a Smart Array 5i.
This one works perfectly fine :

# cat /proc/driver/cciss/cciss0

cciss0: HP Smart Array 5i Controller
Board ID: 0x40800e11
Firmware Version: 2.38
IRQ: 31
Logical drives: 1
Current Q depth: 0
Current # commands on controller: 0
Max Q depth since init: 1
Max # commands on controller since init: 1
Max SG entries since init: 4

Sequential access devices: 0

cciss/c0d0:       17.35GB       RAID 0


# cat /proc/partitions
major minor  #blocks  name

 104     0   17776560 cciss/c0d0
 104     1     102384 cciss/c0d0p1
 104     2    5120000 cciss/c0d0p2
 104     3    9845760 cciss/c0d0p3
 104     4          1 cciss/c0d0p4
 104     5     557040 cciss/c0d0p5
 104     6    2047984 cciss/c0d0p6
 104     7     102384 cciss/c0d0p7


# lspci -nnvvv

01:04.0 Class 0104: 0e11:b178 (rev 01)
        Subsystem: 0e11:4080
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr+ Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 71, cache line size 08
        Interrupt: pin A routed to IRQ 31
        Region 0: Memory at f7ec0000 (64-bit, non-prefetchable) [size=256K]
        Region 2: I/O ports at 3000 [size=256]
        Region 3: Memory at f7df0000 (64-bit, prefetchable) [size=16K]
        Expansion ROM at <unassigned> [disabled] [size=16K]
        Capabilities: [c0] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [cc] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: [dc] PCI-X non-bridge device.
                Command: DPERE- ERO- RBC=0 OST=3
                Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, DC=simple, DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-


During boot, the kernel reports the following information about the driver :

FDC 0 is a National Semiconductor PC87306
RAMDISK driver initialized: 16 RAM disks of 18000K size 1024 blocksize
loop: loaded (max 8 devices)
Compaq SMART2 Driver (v 2.4.28)
HP CISS Driver (v 2.4.60)
cciss: Device 0xb178 has been found at bus 1 dev 4 func 0
      blocks= 35553120 block_size= 512
      heads= 255, sectors= 32, cylinders= 4357 RAID 0

Partition check:
 cciss/c0d0: p1 p2 p3 p4 < p5 p6 p7 >
v2.3 : Micro Memory(tm) PCI memory board block driver

Would it be possible that you have too old a firmware on this board and that
a simple firmware upgrade would solve the issue ?

You can check this in /proc/driver/cciss/cciss0.
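
When several machines need to be compared, the firmware line can be pulled out of that file's text with a few lines of script. A minimal sketch, assuming the file keeps the layout shown in the output above; the function names are hypothetical, and the sample is copied from that output.

```python
import re

# Start of /proc/driver/cciss/cciss0 as shown in the output above.
SAMPLE = """\
cciss0: HP Smart Array 5i Controller
Board ID: 0x40800e11
Firmware Version: 2.38
IRQ: 31
"""


def firmware_version(proc_text):
    """Return the firmware version string, or None if the line is missing."""
    match = re.search(r"^Firmware Version:\s*(\S+)", proc_text, re.MULTILINE)
    return match.group(1) if match else None


def version_tuple(version):
    """Turn '2.38' into (2, 38) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


if __name__ == "__main__":
    print("firmware:", firmware_version(SAMPLE))
```

On a live system the text would come from reading /proc/driver/cciss/cciss0 instead of the sample string.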

Willy

Comment 14 richlv 2009-01-21 04:36:19 UTC
thanks for doing the tests. indeed, firmware version is somewhat older - 2.36 (vs 2.38 for you).

if i'm reading the correct information, going 2.36->2.38 only changes "Added support for the 4-Port Shared Storage Module.", which shouldn't be relevant, i think.

what's interesting, another, identical machine... is running 2.4.29 just fine. so it's something either with the particular kernel config, or with hardware.

these machines run the same distribution, same version. only difference is the one that boots fine has two disks in raid 1 config, while the one having problems has raid 0 with single hdd only.
could this difference be important ?
Comment 15 Willy Tarreau 2009-01-21 05:55:18 UTC
> thanks for doing the tests. indeed, firmware version is somewhat older - 2.36
> (vs 2.38 for you).
> 
> if i'm reading the correct information, going 2.36->2.38 only changes "Added
> support for the 4-Port Shared Storage Module.", which shouldn't be relevant, i
> think.

I don't think so either, unless they have fixed other minor issues, of course.

> what's interesting, another, identical machine... is running 2.4.29 just fine.
> so it's something either with the particular kernel config, or with hardware.

OK.

> these machines run the same distribution, same version. only difference is the
> one that boots fine has two disks in raid 1 config, while the one having
> problems has raid 0 with single hdd only.
> could this difference be important ?

maybe. It is possible that some RAID configs exhibit the problems and others
not. Could you try to boot your 2.4.29 kernel on the machine showing the
problem ? This would indicate whether the hardware or the software makes
the difference.

Willy

Comment 16 richlv 2009-01-29 09:29:57 UTC
got some more data :

2.4.29 from the other machine also hangs on boot;

downgraded gcc on the problematic machine to the one from slackware 10.2 - 3.3.6.
that allowed me to compile all kernels, thus i can report that it boots with 2.4.26, and hangs with 2.4.27 -> 2.4.27 broke the partition listing part somehow.
2.4.27 still seems to have quite many changes, so it might not be much help, but maybe that gives some hints.
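
For reference, narrowing a range of releases down to the first failing one is a plain binary search. A minimal sketch, assuming a single good-to-bad transition in the series; the `boots` function here is only a stand-in that encodes the results reported in this thread, whereas in reality each call means building and booting that kernel on the affected machine.

```python
def first_bad(versions, boots):
    """Return the first version for which boots(version) is False.

    Assumes versions is ordered, versions[0] boots, versions[-1] does not,
    and there is exactly one transition from good to bad.
    """
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if boots(versions[mid]):
            good = mid
        else:
            bad = mid
    return versions[bad]


# 2.4.25 through 2.4.37, the range discussed in this report.
VERSIONS = ["2.4.%d" % n for n in range(25, 38)]


def boots(version):
    """Stand-in for a real boot test: 2.4.26 and older boot, newer ones hang."""
    return int(version.rsplit(".", 1)[1]) <= 26


if __name__ == "__main__":
    print("first failing version:", first_bad(VERSIONS, boots))
```

With 13 candidate releases this needs at most 4 test boots instead of one per release, which matters when every test requires rebooting the machine.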
Comment 17 Willy Tarreau 2009-01-29 14:09:17 UTC
On Thu, Jan 29, 2009 at 09:29:57AM -0800, bugme-daemon@bugzilla.kernel.org wrote:
> 2.4.29 from the other machine also hangs on boot;

OK so the hardware (or the RAID controller's firmware) makes a
difference between the two machines.

> downgraded gcc on the problematic machine to the one from slackware 10.2 -
> 3.3.6.
> that allowed me to compile all kernels, thus i can report that it boots with
> 2.4.26, and hangs with 2.4.27 -> 2.4.27 broke the partition listing part
> somehow.
> 2.4.27 still seems to have quite many changes, so it might not be much help,
> but maybe that gives some hints.

OK it's good that you could isolate the precise version which introduced
the change. However, the changes between the two versions' driver are
really minimal. Just two new controllers added, and better support for
64-bit systems. I don't see anything else :-(

It is possible that another change outside this driver had a side effect
though. I seem to remember that there was one long-term stable debian
kernel based on 2.4.27. If you can put your hands on it, it would help
us troubleshoot the issue, as it might contain a fix for this obscure
problem, that might not have gone upstream.

The difficulty right now is that 2.4.27 is 5-years old and we'll not
find anyone able to explain change X or Y back then. Also, since we
have other systems running the same hardware fine, and I'm not aware
of any other such report (and HP servers are very common), it is
really possible that your particular firmware is botched and that an
upgrade will fix all the issues up to 2.4.37.

Do you know if you can try such an upgrade easily ? From my memories,
it should be on the smartsuite CD or floppy, but I may be wrong.

Willy

Comment 18 richlv 2009-02-25 04:52:50 UTC
i tried to copy cciss code over from "last good" to "first bad" and compile that. produced some warnings regarding cciss, but compiled.
such a combination failed at boot the same way, which seems to indicate that the problem actually is outside cciss code...
Comment 19 Willy Tarreau 2009-02-25 06:39:11 UTC
Hi,

that's very useful information, thanks. However it will be harder to debug :-(

Willy

