Most recent kernel where this bug did *NOT* occur: supposedly any kernel with the external Promise stex driver (<2.6.19), please see details
Distribution: OpenSuSE 10.2 (x86_64 smp)
Hardware Environment: Tyan S3970 w/1 Opteron 2210 dual core, 2GB DDR2@667MHz, Promise SuperTrak EX 16350/16300 SATA RAID controller w/ 14 disks attached (10 in RAID10, 4 in RAID5); OS installed on extra HDD connected to mainboard
Software Environment: stock SuSE with vanilla kernel 2.6.19.1, set up as NFS server
Problem Description:
I am filing this under SCSI, as the stex driver resides there, though it is a SATA device and the actual problem could even be in more generic code.

- installed 2.6.19.1 since the stock kernel does not have the stex driver
- the system, or just the filesystems on the RAIDs, freezes after some time / amount of data processed (normally a total freeze; one time the box was still responsive, but any access to the RAID mounts was blocking)
- tried different kernel configurations (e.g. no -Os), added debugging... the effect stays the same

I was able to get two kernel BUG messages (I'll attach the detailed messages):

Kernel BUG at block/elevator.c:332
invalid opcode: 0000 [1] SMP
CPU 0

Kernel BUG at block/ll_rw_blk.c:1130
invalid opcode: 0000 [1] SMP
CPU 0

I contacted Promise and they provided an updated version of their external stex/shasta driver source that compiles against the 2.6.18.2-like kernel of OpenSuSE 10.2. This driver has been working under varying load (I also stress-tested it) for 9 days now.

There seems to be a problem with the changes the driver underwent to be included in the kernel (replacing code from Promise with in-kernel common code) or with the common kernel code itself... That this setup now works reliably (I hope;-) is an indication that switching to the external driver is the solution and that it is indeed a software problem. However, I won't hide that I see a correlation between RAID usage and the occurrence of MCEs regarding corrected ECC errors. Not with CPU/memory load alone -- there seems to be the I/O / PCI-X / whatever action needed.

Steps to reproduce:
- get a Promise STEX163x0 card and perhaps an SMP box ...
- configure two RAID sets
- copy big files from one to the other -- for me that reliably triggered a system freeze before reaching 100G of copied data
Created attachment 10105 [details] trace of the last seen kernel BUG
Created attachment 10106 [details] kernel messages from syslog with a crash/freeze
Same problem here. The system freezes sporadically.

System:
AMD X2 Dualcore 2GHz, 4GB DDR RAM
Tyan Mainboard S2865 Tomcat K8E
Promise EX 8350 Raid 5 Western Digital HDs (160GB) in one Raid5 (OS + Data)

I'm using Ubuntu 7.04 with kernel 2.6.20. The server is used as a VMware server. Same error messages as above.
Christian, are you also using the driver from Promise's site or are you referring to the kernel.org driver? If it's the external driver then I'll close the bug, sorry - there isn't anything we can do about it.
Please note that it is the kernel.org driver (be it in the OpenSUSE or Ubuntu variation) that is causing the freezes for Christian and me; we both got around the problem by using the external driver.

In short:
kernel.org -> freeze
external -> works

Just in case that info got mixed up...

Also, Christian informed me about a patch that the driver developer from Promise (Ed Lin) posted to LKML:

http://lkml.org/lkml/2007/1/23/268

There were two comments on that post and no further discussion. Perhaps one could revive this one... and generally: Ed, are you with us on this bugzilla entry?
The patch should still be technically valid, though it may need a little rework. However, I must be careful in order to get this patch accepted. I will resubmit it in the near future, and hopefully it can get into some 2.6.23-rcX.
See http://git.kernel.dk/?p=linux-2.6.git;a=commit;h=f3da54ba140c6427fa4a32913e1bf406f41b5dda which is now in Linus' tree; it should fix this issue.
This issue may be closed.
Hi. I use the stex driver with kernel 2.6.21-r4 (gentoo). The kernel hangs when I do heavy I/O operations on a SCSI volume. This is the error:

Process kblockd/1 (pid:46, ti=c19e6000 task=c19be530 task.ti=c19e6000)
STACK:
C1AE6448 D752F578 F7E5BBA4 C1A7E800 C1A7880 F7E5BBA4 C01CC54B C1A78800
00000287 C027A1CC C0DDA080 00000000 C1A7E800 C1A78800 C027E98E C1979764
F7E5BBA4 C1979740 C1A7E884 D752F578 FE5BBA4 C1979740 F7E5BC30 00000292
CALL TRACE:
[<C01CC54B>] ELV_NEXT_REQUEST+0x18/0x125
C027A1CC SCSI_DISPATCH_CMD+0x13c/0X215
C027E98E SCSI_REQUEST_FN+0x186/0x269
C01CF273 __GENERIC_UNPLUG_DEVICE+0x21/0x23
C01CFFBE GENERIC_UNPLUG_DEVICE+0x15/0x21
C01CD177 BLK_UNPLUG_WORK+0xB/0xC
C0127A9E RUN_WORK_QUEUE+0x94/0x13F
C01CD16C BLK_UNPLUG_WORK+0x0/0xC
C0128188 WORKER_THREAD+0x143/0x168
C01150B1 DEFAULT_WAKE_FUNCTION+0x0/0xC
C0128045 WORKER_THREAD+0x0/0x168
C012A8F2 KTHREAD+0xAE/0xD3
C012A844 KTHREAD+0x0/0xD3
C01032DB KERNEL_THREAD_HELPER+0x7/0x1C
CODE: C9 74 28 85 D2 74 10 8B 43 2C 8B 4B 3C 8D 50 01 89 53 2C 39 C8 73 9F 31 C9 EB AE C7 43 20 00 00 00 00 8B 54 CB 08 E9 56 FF FF FF <0F> 0B EB FE 0F 0B EB FE 8B 40 0C 8B 50 04 83 C2 0B 8D 42 08 31
EIP:[<C01D3576>] DEADLINE_DISPATCH_REQUEST+0xF3/0xFB

Do you think that the 2.6.23 kernel (with Ed's patch included) resolves this bug? Are these errors caused by the stex driver? Thanks.
A lot can change between 2.6.21 and 2.6.23. Yes, please do test 2.6.23.
It's almost certainly the same bug, so yes, it is fixed.
Updated to the 2.6.23 kernel with the in-kernel stex driver: same error (see post #9). Nothing has changed with the new kernel. I don't know why this driver isn't marked as "experimental".
I am wondering if Lorenzo really is talking about the same issue I had. Sadly, I cannot test the current in-kernel driver since we only have one stex system which is needed in production and there is no further backup space for the many 100s of GB on the RAID:-( Our box is running fine with the external driver from Promise.
Hi Thomas. I read that your box is running fine with the external driver from Promise. Which kernel version is running? Which Promise driver have you selected? Can you post your .config? Thank you
atlas:~ # uname -a
Linux atlas 2.6.18.2-34-thor #5 SMP Sun Jan 7 15:13:46 CET 2007 x86_64 x86_64 x86_64 GNU/Linux

We have the 2.6.18.2 kernel from OpenSuse 10.2; I took the OpenSuse kernel sources with the default (smp) config... modified a bit...

The important part is that Ed Lin sent me an updated version of the Promise driver that I then compiled externally. I recommend you ask him or check out the Promise site:

http://www.promise.com/upload/Support/Driver/SuperTrak-EX-Series-suse-10.2-x86_64-2.9.0.22.tgz

This looks like the version I got.
Created attachment 13271 [details]
Fix bad sharing of tag busy list

I think there's one more bug there, for shared maps. For the locking to work, only the tag map and tag bit map may be shared (incidentally, I was just explaining this to Nick yesterday, but I apparently didn't review the code well enough myself). But we also share the busy list! The busy_list must be queue private, or we need a block_queue_tag covering lock as well.

So we have to move the busy_list to the queue. This'll work fine, and it'll actually also fix a problem with blk_queue_invalidate_tags(), which will invalidate tags across all shared queues. This is a bit confusing; the low level driver should call it for each queue separately, since otherwise you cannot kill tags on just a single queue, e.g. for a hard drive that stops responding. Since the function has no callers currently, it's not an issue.

Please test this patch, it should fix the issue!
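
For readers following along, the ownership change described in this comment can be modelled roughly as below. This is only a userspace sketch, not the kernel code or the attached patch: every name in it (model_request, model_tag_map, model_queue, model_start_tag, model_end_tag, model_invalidate_tags, MAX_TAGS) is hypothetical, it is single-threaded, and it does not reproduce the race itself. It only shows the layout where the tag bitmap and tag index are shared between queues while each queue keeps its own busy list, so that invalidating tags on one queue leaves the other queue's requests alone.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TAGS 32   /* hypothetical depth, chosen to fit one unsigned int */

/* Simplified stand-in for a request; not the kernel's struct request. */
struct model_request {
    int tag;
    struct model_request *next_busy;   /* link in the owning queue's busy list */
};

/* Shared between queues: only the tag bitmap and the tag -> request index. */
struct model_tag_map {
    unsigned int bitmap;                    /* bit n set means tag n is in use */
    struct model_request *index[MAX_TAGS];
};

/* Per-queue state: the busy list is private to the queue. */
struct model_queue {
    struct model_tag_map *tags;             /* may point at a shared map */
    struct model_request *busy_head;        /* queue-private busy list */
    int busy_count;
};

/* Take the lowest free tag from the (possibly shared) map and put the
 * request on this queue's own busy list. Returns the tag, or -1 if full. */
int model_start_tag(struct model_queue *q, struct model_request *rq)
{
    for (int tag = 0; tag < MAX_TAGS; tag++) {
        unsigned int bit = 1u << tag;
        if (!(q->tags->bitmap & bit)) {
            q->tags->bitmap |= bit;
            q->tags->index[tag] = rq;
            rq->tag = tag;
            rq->next_busy = q->busy_head;
            q->busy_head = rq;
            q->busy_count++;
            return tag;
        }
    }
    return -1;
}

/* Release a request's tag and unlink it from this queue's busy list. */
void model_end_tag(struct model_queue *q, struct model_request *rq)
{
    struct model_request **pp = &q->busy_head;

    while (*pp && *pp != rq)
        pp = &(*pp)->next_busy;
    assert(*pp == rq);                      /* rq must belong to this queue */

    *pp = rq->next_busy;
    rq->next_busy = NULL;
    q->busy_count--;
    q->tags->bitmap &= ~(1u << rq->tag);
    q->tags->index[rq->tag] = NULL;
    rq->tag = -1;
}

/* Drop every tag held by this queue only; other queues sharing the same
 * map keep theirs. Returns the number of tags released. */
int model_invalidate_tags(struct model_queue *q)
{
    int released = 0;

    while (q->busy_head) {
        model_end_tag(q, q->busy_head);
        released++;
    }
    return released;
}
```

With two model_queue instances pointing at one model_tag_map, starting requests on both and then calling model_invalidate_tags() on the first leaves the second queue's busy list and its bits in the shared bitmap untouched, which is the per-queue behaviour asked for above.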
Created attachment 13272 [details] 2.6.23.1 variant of the patch Same as 13271, just applies cleanly for 2.6.23.1 for testing there. Lorenzo, please test?
Ed verified the patch, we should close this bug report. Fix is in 2.6.24-git and is also queued for 2.6.23.x stable series.
I guess it's not solved yet. I gave it a try and tested kernel 2.6.22.14 (which is the Ubuntu 7.10 kernel) and compiled 2.6.24.3 myself. Both crashed under heavy I/O. In fact, an FTP transfer of a single 700MB file (GBit ethernet) was enough to crash it.
One thing I forgot to mention: I used the 64-bit version. I didn't test any 32-bit kernel.
Closing obsolete bugs, if still relevant please re-open