Bug 7842
Summary: | in-kernel stex driver locks up system, problem with common block layer code? | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Thomas Orgis (orgis) |
Component: | SCSI | Assignee: | Ed Lin (ed.lin) |
Status: | CLOSED OBSOLETE | ||
Severity: | normal | CC: | akpm, alan, axboe, ed.lin |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.19.1 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | trace of the last seen kernel BUG; kernel messages from syslog with a crash/freeze; Fix bad sharing of tag busy list; 2.6.23.1 variant of the patch | ||
Description
Thomas Orgis
2007-01-17 08:18:55 UTC
Created attachment 10105
trace of the last seen kernel BUG
Created attachment 10106
kernel messages from syslog with a crash/freeze
Same problem here. The system freezes sporadically. System: AMD X2 dual-core 2 GHz, 4 GB DDR RAM, Tyan S2865 Tomcat K8E mainboard, Promise EX 8350, Western Digital HDs (160 GB) in one RAID 5 (OS + data). I'm using Ubuntu 7.04 with kernel 2.6.20. The server is used as a VMware server. Same error messages as above.

Christian, are you also using the driver from Promise's site, or are you referring to the kernel.org driver? If it's the external driver, then I'll close the bug, sorry; there isn't anything we can do about it.

Please note that it is the kernel.org driver (whether the OpenSUSE or the Ubuntu variant) that is causing the freezes for Christian and me; we both got around the problem by using the external driver. In short:
kernel.org -> freeze
external -> works
Just in case that info got mixed up...

Also, Christian told me about a patch that the Promise driver developer (Ed Lin) posted to LKML: http://lkml.org/lkml/2007/1/23/268 There were two comments on that post and no further discussion. Perhaps it could be revived... And generally: Ed, are you following this bugzilla entry?

The patch should still be valid technically, though it may need a little rework. However, I must be careful to get this patch accepted. I will resubmit it in the near future, and hopefully it can get into some 2.6.23-rcX.

See http://git.kernel.dk/?p=linux-2.6.git;a=commit;h=f3da54ba140c6427fa4a32913e1bf406f41b5dda which is now in Linus' tree; it should fix this issue.

This issue may be closed.

Hi. I use the stex driver with kernel 2.6.21-r4 (Gentoo). The kernel hangs when I do heavy I/O on the SCSI volume.
This is the error:
Process kblockd/1 (pid:46, ti=c19e6000 task=c19be530 task.ti=c19e6000)
STACK: C1AE6448 D752F578 F7E5BBA4 C1A7E800 C1A7880 F7E5BBA4 C01CC54B C1A78800
00000287 C027A1CC C0DDA080 00000000 C1A7E800 C1A78800 C027E98E C1979764
F7E5BBA4 C1979740 C1A7E884 D752F578 FE5BBA4 C1979740 F7E5BC30 00000292
CALL TRACE:
[<C01CC54B>] ELV_NEXT_REQUEST+0x18/0x125
C027A1CC SCSI_DISPATCH_CMD+0x13c/0X215
C027E98E SCSI_REQUEST_FN+0x186/0x269
C01CF273 __GENERIC_UNPLUG_DEVICE+0x21/0x23
C01CFFBE GENERIC_UNPLUG_DEVICE+0x15/0x21
C01CD177 BLK_UNPLUG_WORK+0xB/0xC
C0127A9E RUN_WORK_QUEUE+0x94/0x13F
C01CD16C BLK_UNPLUG_WORK+0x0/0xC
C0128188 WORKER_THREAD+0x143/0x168
C01150B1 DEFAULT_WAKE_FUNCTION+0x0/0xC
C0128045 WORKER_THREAD+0x0/0x168
C012A8F2 KTHREAD+0xAE/0xD3
C012A844 KTHREAD+0x0/0xD3
C01032DB KERNEL_THREAD_HELPER+0x7/0x1C
CODE: C9 74 28 85 D2 74 10 8B 43 2C 8B 4B 3C 8D 50 01 89 53 2C 39 C8 73 9F 31 C9 EB AE C7 43 20 00 00 00 00 8B 54 CB 08 E9 56 FF FF FF <0F> 0B EB FE 0F 0B EB FE 8B 40 0C 8B 50 04 83 C2 0B 8D 42 08 31
EIP:[<C01D3576>] DEADLINE_DISPATCH_REQUEST+0xF3/0xFB
Do you think the 2.6.23 kernel (which includes Ed's patch) resolves this bug? Are these errors caused by the stex driver? Thanks.

A lot can change between 2.6.21 and 2.6.23. Yes, please do test 2.6.23.

It's almost 100% certain to be the same bug, so yes, it is fixed.

Updated to the 2.6.23 kernel with the in-kernel stex driver: same error (see post #9). Nothing has changed with the new kernel. I don't know why this driver isn't marked as "experimental".

I am wondering if Lorenzo really is talking about the same issue I had.
Sadly, I cannot test the current in-kernel driver, since we only have one stex system, which is needed in production, and there is no further backup space for the many hundreds of GB on the RAID :-( Our box is running fine with the external driver from Promise.

Hi Thomas. I read that your box is running fine with the external driver from Promise. Which kernel version is it running? Which Promise driver did you choose? Can you post your .config? Thank you

atlas:~ # uname -a
Linux atlas 2.6.18.2-34-thor #5 SMP Sun Jan 7 15:13:46 CET 2007 x86_64 x86_64 x86_64 GNU/Linux

We have the 2.6.18.2 kernel from OpenSUSE 10.2; I took the OpenSUSE kernel sources with the default (smp) config... modified a bit... The important part is that Ed Lin sent me an updated version of the Promise driver, which I then compiled externally. I recommend you ask him or check the Promise site: http://www.promise.com/upload/Support/Driver/SuperTrak-EX-Series-suse-10.2-x86_64-2.9.0.22.tgz This looks like the version I got.

Created attachment 13271
Fix bad sharing of tag busy list
I think there's one more bug there, for shared maps. For the locking to
work, only the tag map and tag bit map may be shared (incidentally, I
was just explaining this to Nick yesterday, but I apparently didn't
review the code well enough myself). But we also share the busy list!
The busy_list must be queue private, or we need a block_queue_tag
covering lock as well.
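The second option named here, a lock covering the whole shared tag structure, can be illustrated with a minimal userspace C sketch, assuming a pthread mutex stands in for that covering lock. Every name below (struct shared_tags, shared_tags_start(), shared_tags_end(), run_two_queues()) is hypothetical and does not reflect the kernel's actual struct blk_queue_tag or its API; the sketch only shows that once every queue takes the same lock, two queues can no longer corrupt the shared busy list while each holds only its own queue lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_TAGS 32 /* assumption: one unsigned long bitmap holds all tags */

/* Hypothetical, simplified request; NOT the kernel's struct request. */
struct req {
    int tag;                 /* -1 while untagged */
    struct req *prev, *next; /* links on the busy list */
};

/* Hypothetical tag structure shared by several queues. */
struct shared_tags {
    pthread_mutex_t lock;        /* the "covering lock": guards all fields */
    unsigned long map;           /* bit n set => tag n in use */
    struct req *index[MAX_TAGS]; /* tag -> request */
    struct req busy;             /* sentinel head of the shared busy list */
    int nr_busy;
};

static void shared_tags_init(struct shared_tags *t)
{
    pthread_mutex_init(&t->lock, NULL);
    t->map = 0;
    for (int i = 0; i < MAX_TAGS; i++)
        t->index[i] = NULL;
    t->busy.prev = t->busy.next = &t->busy;
    t->nr_busy = 0;
}

/* Allocates a free tag for rq; returns it, or -1 if all tags are busy. */
static int shared_tags_start(struct shared_tags *t, struct req *rq)
{
    int tag = -1;
    pthread_mutex_lock(&t->lock);
    for (int i = 0; i < MAX_TAGS; i++) {
        if (!(t->map & (1UL << i))) {
            tag = i;
            break;
        }
    }
    if (tag >= 0) {
        t->map |= 1UL << tag;
        t->index[tag] = rq;
        rq->tag = tag;
        /* Safe list insert: every queue serializes on the same lock. */
        rq->next = t->busy.next;
        rq->prev = &t->busy;
        t->busy.next->prev = rq;
        t->busy.next = rq;
        t->nr_busy++;
    }
    pthread_mutex_unlock(&t->lock);
    return tag;
}

/* Releases rq's tag and unlinks it from the shared busy list. */
static void shared_tags_end(struct shared_tags *t, struct req *rq)
{
    pthread_mutex_lock(&t->lock);
    assert(t->index[rq->tag] == rq);
    t->map &= ~(1UL << rq->tag);
    t->index[rq->tag] = NULL;
    rq->prev->next = rq->next;
    rq->next->prev = rq->prev;
    rq->tag = -1;
    t->nr_busy--;
    pthread_mutex_unlock(&t->lock);
}

struct worker_arg {
    struct shared_tags *t;
    int iterations;
};

/* Simulates one request queue issuing and completing requests. */
static void *queue_worker(void *p)
{
    struct worker_arg *a = p;
    struct req rq = { .tag = -1 };
    for (int i = 0; i < a->iterations; i++) {
        if (shared_tags_start(a->t, &rq) >= 0)
            shared_tags_end(a->t, &rq);
    }
    return NULL;
}

/* Runs two "queues" concurrently against one shared tag structure. */
static void run_two_queues(struct shared_tags *t, int iterations)
{
    pthread_t a, b;
    struct worker_arg arg = { t, iterations };
    pthread_create(&a, NULL, queue_worker, &arg);
    pthread_create(&b, NULL, queue_worker, &arg);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
}
```

A single covering lock serializes every queue that shares the map; moving busy_list into each queue, as the patch does, avoids that contention.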
So we have to move the busy_list to the queue. This'll work fine, and
it'll actually also fix a problem with blk_queue_invalidate_tags() which
will invalidate tags across all shared queues. This is a bit confusing,
the low level driver should call it for each queue separately, since
otherwise you cannot kill tags on just a single queue for e.g. a hard
drive that stops responding. Since the function has no callers
currently, it's not an issue.
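The effect of moving busy_list into the queue can be sketched with a simplified, single-threaded userspace model. All names here (struct tag_map, struct queue, queue_start_tag(), queue_invalidate_tags()) are hypothetical and this is not the kernel's blk_queue_* code; it assumes, as described above, that only the tag bitmap and the tag-to-request index are shared, while each queue keeps a private busy list. Invalidating tags then walks only that queue's own requests.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TAGS 32 /* assumption: one unsigned long bitmap holds all tags */

/* Hypothetical, simplified request; NOT the kernel's struct request. */
struct req {
    int tag;                 /* -1 while untagged */
    struct req *prev, *next; /* links on the owning queue's busy list */
};

/* Shared between queues: only the tag bitmap and tag -> request index. */
struct tag_map {
    unsigned long map;
    struct req *index[MAX_TAGS];
};

/* Private to each queue: its own busy list. */
struct queue {
    struct tag_map *tags; /* may be shared with other queues */
    struct req busy;      /* sentinel head of this queue's busy list */
};

static void queue_init(struct queue *q, struct tag_map *shared)
{
    q->tags = shared;
    q->busy.prev = q->busy.next = &q->busy;
}

/* Takes the lowest free tag from the shared map; -1 if none is free. */
static int queue_start_tag(struct queue *q, struct req *rq)
{
    struct tag_map *t = q->tags;
    for (int i = 0; i < MAX_TAGS; i++) {
        if (!(t->map & (1UL << i))) {
            /* Shared-map updates must be atomic in real code; this
             * single-threaded sketch omits that. */
            t->map |= 1UL << i;
            t->index[i] = rq;
            rq->tag = i;
            rq->next = q->busy.next;
            rq->prev = &q->busy;
            q->busy.next->prev = rq;
            q->busy.next = rq;
            return i;
        }
    }
    return -1;
}

/* Frees rq's tag in the shared map and unlinks it from q's busy list. */
static void queue_end_tag(struct queue *q, struct req *rq)
{
    q->tags->map &= ~(1UL << rq->tag);
    q->tags->index[rq->tag] = NULL;
    rq->prev->next = rq->next;
    rq->next->prev = rq->prev;
    rq->tag = -1;
}

/* Drops the tags of every request on THIS queue's busy list only;
 * requests of other queues sharing the map are untouched.
 * Returns the number of requests invalidated. */
static int queue_invalidate_tags(struct queue *q)
{
    int n = 0;
    while (q->busy.next != &q->busy) {
        queue_end_tag(q, q->busy.next);
        n++;
    }
    return n;
}
```

For example, if queues q1 and q2 share one map, with q1 holding tags 0 and 2 and q2 holding tag 1, queue_invalidate_tags(q1) frees tags 0 and 2 and leaves q2's request tagged. Walking one busy list shared by both queues would have dropped q2's request as well.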
Please test this patch, it should fix the issue!
Created attachment 13272
2.6.23.1 variant of the patch
Same as 13271, but applies cleanly to 2.6.23.1 for testing there. Lorenzo, can you please test?
Ed verified the patch; we should close this bug report. The fix is in 2.6.24-git and is also queued for the 2.6.23.x stable series.

I guess it's not solved yet. I gave it a try and tested kernel 2.6.22.14 (the Ubuntu 7.10 kernel) and 2.6.24.3, which I compiled myself. Both crashed under heavy I/O. In fact, a single FTP transfer of one 700 MB file (Gbit Ethernet) was enough to crash them.

One thing I forgot to mention: I used the 64-bit version. I didn't test any 32-bit kernel.

Closing obsolete bugs; if still relevant, please reopen.