Most recent kernel where this bug did *NOT* occur: supposedly any kernel with
the external Promise stex driver (<2.6.19), please see details
Distribution: OpenSuSE 10.2 (x86_64 smp)
Hardware Environment: Tyan S3970 w/1 Opteron 2210 dual core, 2GB DDR2@667MHz,
Promise SuperTrak EX 16350/16300 SATA RAID controller w/ 14 disks attached (10
in RAID10, 4 in RAID5); OS installed on extra HDD connected to mainboard
Software Environment: stock SuSE with vanilla kernel 188.8.131.52, setup as NFS
I am filing this under SCSI, as the stex driver resides there, though it is a
SATA device and the actual problem could even be in more generic code.
- installed 184.108.40.206 since stock kernel does not have stex driver
- System or just the filesystems on the RAIDs freeze after some time / amount
of data processed (normally total freeze, one time the box was still
responsive, but any access to the RAID mounts was blocking)
- tried different kernel configurations (e.g. without -Os), added debugging...
the effect stays the same
I was able to get two Kernel BUG messages (I'll attach the detailed messages):
Kernel BUG at block/elevator.c:332
invalid opcode: 0000  SMP
Kernel BUG at block/ll_rw_blk.c:1130
invalid opcode: 0000  SMP
I contacted Promise and they provided an updated version of their external
stex/shasta driver source that compiles against the 220.127.116.11-like kernel of
OpenSuSE 10.2. This driver works under varying load (also stressed it) for 9
There seems to be a problem either with the changes the driver underwent to be
included in the kernel (replacing Promise code with in-kernel common code) or
with the common kernel code itself...
That this setup now works reliably (I hope;-) is an indication that switching
to the external driver is the solution and it's indeed a software problem.
However, I won't hide that I see a correlation between RAID usage and the
occurrence of MCEs reporting corrected ECC errors. CPU/memory load alone does
not trigger them -- some I/O / PCI-X / whatever activity seems to be needed.
Steps to reproduce:
- get Promise STEX163x0 card and perhaps an SMP box ...
- configure two RAID sets
- copy big files from one to the other -- that reliably triggered a system
freeze before reaching 100G of copied data for me
Created attachment 10105 [details]
trace of the last seen kernel BUG
Created attachment 10106 [details]
kernel messages from syslog with a crash/freeze
Same problem here. The system freezes sporadically.
AMD X2 Dualcore 2GHz, 4GB DDR RAM
Tyan Mainboard S2865 Tomcat K8E
Promise EX 8350 Raid
5 Western Digital HDs (160GB) in one Raid5 (OS + Data)
I'm using Ubuntu 7.04 with Kernel 2.6.20.
The server is used as a VMware server.
Same error messages as above.
Christian, are you also using the driver from Promise's site
or are you referring to the kernel.org driver?
If it's the external driver then I'll close the bug, sorry - there
isn't anything we can do about it.
Please note that it is the kernel.org driver (be it in OpenSUSE or Ubuntu variation) that is causing the freezes for Christian and me; we both got around the problem by using the external driver.
kernel.org -> freeze
external -> works
Just in case that info got mixed up...
Also, Christian informed me about a patch the driver guy from Promise (Ed Lin) posted to LKML:
There were two comments on that post and no further discussion.
Perhaps one could revive this one... and generally: Ed, are you with us on this bugzilla entry?
The patch should still be technically valid; it may need a little rework. However, I must be careful in order to get this patch accepted. I will resubmit it in the near future, and hopefully it can get into some 2.6.23-rcx.
which is now in Linus' tree; it should fix this issue.
This issue may be closed.
I use the stex driver with kernel 2.6.21-r4 (Gentoo).
The kernel hangs when I do heavy I/O on the SCSI volume.
This is the error:
Process kblockd/1 (pid:46, ti=c19e6000 task=c19be530 task.ti=c19e6000)
STACK: C1AE6448 D752F578 F7E5BBA4 C1A7E800 C1A7880 F7E5BBA4 C01CC54B
00000287 C027A1CC C0DDA080 00000000 C1A7E800 C1A78800 C027E98E C1979764
F7E5BBA4 C1979740 C1A7E884 D752F578 FE5BBA4 C1979740 F7E5BC30 00000292
CODE: C9 74 28 85 D2 74 10 8B 43 2C 8B 4B 3C 8D 50
01 89 53 2C 39 C8 73 9F 31 C9 EB AE C7 43 20
00 00 00 00 8B 54 CB 08 E9 56 FF FF FF <0F>
0B EB FE 0F 0B EB FE 8B 40 0C 8B 50 04 83
C2 0B 8D 42 08 31
Do you think that the 2.6.23 kernel (with Ed's patch included) resolves this bug?
Are these errors caused by the stex driver?
A lot can change between 2.6.21 and 2.6.23. Yes, please do test 2.6.23.
It's almost 100% certainly the same bug, so yes, it is fixed.
Updated to the 2.6.23 kernel with the in-kernel stex driver: same error (see post #9).
Nothing has changed with the new kernel.
I don't know why this driver isn't marked as "experimental".
I am wondering if Lorenzo really is talking about the same issue I had.
Sadly, I cannot test the current in-kernel driver since we only have one stex system which is needed in production and there is no further backup space for the many 100s of GB on the RAID:-(
Our box is running fine with the external driver from Promise.
I read that your box is running fine with the external driver from Promise.
Which kernel version is running? Which Promise driver have you selected?
Can you post your .config?
atlas:~ # uname -a
Linux atlas 18.104.22.168-34-thor #5 SMP Sun Jan 7 15:13:46 CET 2007 x86_64 x86_64 x86_64 GNU/Linux
We have the 22.214.171.124 Kernel from OpenSuse 10.2; I took the OpenSuse kernel sources with the default (smp) config... modified a bit...
The important part is that Ed Lin sent me an updated version of the Promise driver that I then compiled externally.
I recommend you ask him or check out the promise site:
This looks like the version I got.
Created attachment 13271 [details]
Fix bad sharing of tag busy list
I think there's one more bug there, for shared maps. For the locking to
work, only the tag map and tag bit map may be shared (incidentally, I
was just explaining this to Nick yesterday, but I apparently didn't
review the code well enough myself). But we also share the busy list!
The busy_list must be queue private, or we need a block_queue_tag
covering lock as well.
So we have to move the busy_list to the queue. This'll work fine, and
it'll actually also fix a problem with blk_queue_invalidate_tags() which
will invalidate tags across all shared queues. This is a bit confusing,
the low level driver should call it for each queue separately since
otherwise you cannot kill tags on just a single queue for e.g. a hard
drive that stops responding. Since the function has no callers
currently, it's not an issue.
Please test this patch, it should fix the issue!
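To make the idea above easier to follow, here is a minimal userspace sketch in C. It is not the patch from the attachment and not the kernel's actual block-layer code: all struct and function names (`tag_map`, `queue`, `queue_start_tag`, `queue_end_tag`, `queue_invalidate_tags`) are hypothetical, it is single-threaded, and it uses a plain boolean array where the kernel would use an atomically updated bitmap. It only illustrates the split described above: the tag index and tag bitmap are shared between queues, while the busy list is private to each queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TAGS 8

/* Hypothetical stand-in for a block request. */
struct request {
    int tag;                    /* tag assigned from the shared map, -1 if none */
    struct request *next_busy;  /* link in the owning queue's busy list */
};

/* Shared between queues: only the tag index and the tag bitmap. */
struct tag_map {
    struct request *tag_index[MAX_TAGS];
    bool in_use[MAX_TAGS];      /* assumption: atomic bit operations in real code */
};

/* Per queue: the busy list lives here, so the queue's own lock can cover it. */
struct queue {
    struct tag_map *tags;       /* may point to a map shared with other queues */
    struct request *busy_head;  /* queue-private list of tagged requests */
    int busy_count;
};

/* Take the first free tag from the (possibly shared) map and record the
 * request on this queue's private busy list. Returns the tag, or -1. */
static int queue_start_tag(struct queue *q, struct request *rq)
{
    for (int t = 0; t < MAX_TAGS; t++) {
        if (!q->tags->in_use[t]) {
            q->tags->in_use[t] = true;
            q->tags->tag_index[t] = rq;
            rq->tag = t;
            rq->next_busy = q->busy_head;
            q->busy_head = rq;
            q->busy_count++;
            return t;
        }
    }
    return -1;
}

/* Unlink the request from this queue's busy list and release its tag. */
static void queue_end_tag(struct queue *q, struct request *rq)
{
    struct request **pp = &q->busy_head;

    while (*pp && *pp != rq)
        pp = &(*pp)->next_busy;
    if (*pp) {
        *pp = rq->next_busy;
        q->busy_count--;
    }
    q->tags->in_use[rq->tag] = false;
    q->tags->tag_index[rq->tag] = NULL;
    rq->tag = -1;
    rq->next_busy = NULL;
}

/* Release every tag held by this queue only; other queues sharing the
 * same map keep theirs. Returns the number of tags released. */
static int queue_invalidate_tags(struct queue *q)
{
    int released = 0;

    while (q->busy_head) {
        queue_end_tag(q, q->busy_head);
        released++;
    }
    return released;
}
```

With two queues pointing at the same `tag_map`, calling `queue_invalidate_tags()` on one of them walks only that queue's busy list, so a single unresponsive device can have its tags killed without touching the tags of the other queue.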
Created attachment 13272 [details]
126.96.36.199 variant of the patch
Same as 13271, just applies cleanly for 188.8.131.52 for testing there. Lorenzo, please test?
Ed verified the patch, we should close this bug report. Fix is in 2.6.24-git and is also queued for 2.6.23.x stable series.
I guess it's not solved yet.
I gave it a try and tested kernel 184.108.40.206 (which is the Ubuntu 7.10 kernel) and compiled 220.127.116.11 myself. Both crashed under heavy I/O. In fact, a single FTP transfer of one 700MB file (GBit ethernet) was enough to crash the system.
One thing I forgot to tell: I used the 64 bit version. I didn't test any 32 bit kernel.
Closing obsolete bugs, if still relevant please re-open