Bug 5733
Summary: | Oops writing to SATA RAID disks | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Hans-Joachim Baader (Hans-Joachim.Baader) |
Component: | LVM2/DM | Assignee: | Alasdair G Kergon (agk) |
Status: | CLOSED CODE_FIX | ||
Severity: | high | ||
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.15-rc5, 2.6.13.4 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
Output of lspci -vv
Stacktrace of partimage processes |
Description
Hans-Joachim Baader
2005-12-12 06:26:54 UTC
Based on the stack trace, this is DM, not MD. Reassigning....

What's your dm configuration? Run
  # dmsetup info -c
  # dmsetup table
  # dmsetup status
(might be dmsetup.static in your environment)

Here's the requested dmsetup info. I've also attached a better lspci output.

dmsetup info -c
  Name Maj Min Stat Open Targ Event UUID
  isw_cfdbggdhji_Volume05 254 5 L--w 0 1 0
  isw_cfdbggdhji_Volume0 254 0 L--w 2 1 0
  isw_cfdbggdhji_Volume01 254 1 L--w 1 1 0

dmsetup table
  isw_cfdbggdhji_Volume05: 0 202992174 linear 254:0 31439268
  isw_cfdbggdhji_Volume0: 0 234436608 mirror core 2 2048 nosync 2 8:0 0 8:16 0
  isw_cfdbggdhji_Volume01: 0 31439142 linear 254:0 63

dmsetup status
  isw_cfdbggdhji_Volume05: 0 202992174 linear
  isw_cfdbggdhji_Volume0: 0 234436608 mirror 2 8:0 8:16 114471/114471
  isw_cfdbggdhji_Volume01: 0 31439142 linear

Created attachment 6809 [details]
Output of lspci -vv
Created attachment 6863 [details]
Stacktrace of partimage processes
This is a stack trace of the partimage processes, the first of which is stuck
in 'D' state. Obtained with sysrq.
I have some additional info. The oops occurs earlier than I thought; perhaps during definition of the DM RAID or shortly after. Also, I see "Access beyond end of device" for both /dev/sda and /dev/sdb in the log (couldn't save that log yet).

I added kdb to the kernel, but the kernel didn't break into kdb on the Oops. I thought it should. What could have gone wrong?

I did some debugging today. The loop in core_get_resync_work is started with
  lc->sync_bits = 0xf88db000 (the value contained in %edi)
  lc->region_count = 114471
  lc->sync_search = 0
After the first call of find_next_zero_bit, *region becomes 114496 and hence lc->sync_search becomes 114497. Since *region is not equal to lc->region_count, the loop continues, even though *region is already larger than lc->region_count and out of range.

The following patch fixes the problem for me:

--- linux-2.6.14.2/drivers/md/dm-log.c 2005-10-28 02:02:08.000000000 +0200
+++ linux-2.6.15-rc5/drivers/md/dm-log.c 2005-12-23 15:36:39.000000000 +0100
@@ -573,7 +573,7 @@
 lc->sync_search);
 lc->sync_search = *region + 1;
- if (*region == lc->region_count)
+ if (*region >= lc->region_count)
 return 0;
 } while (log_test_bit(lc->recovering_bits, *region));

I wonder why one would only test for equality. Is find_next_zero_bit supposed to only return a result within the range? Or was it once, but has been changed recently? Or is the real problem elsewhere, perhaps in incorrect drive geometry?

Hans-Joachim's patch fixes the problem for me as well.

The patch attached here is in (at least) 2.6.16-rc3. Please close this report. Thanks. |
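
For anyone reading this later, the numbers in the debugging comment can be reproduced in user space. The following is only a rough sketch, not the kernel code: every `model_*` name is made up for illustration, and it assumes (inferred from the reported values, not confirmed in this report) that the bit search works on whole 32-bit words and returns the size rounded up to a word boundary when no zero bit is found. With the 234436608-sector mirror and 2048-sector regions from the dmsetup table, that gives 114471 regions, and 114471 rounded up to a multiple of 32 is 114496, the value seen in *region. The recovering_bits loop is left out to keep the model in bounds.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the dirty log state; not the kernel struct. */
struct model_log {
    uint32_t region_count;   /* number of valid regions */
    uint32_t sync_search;    /* next bit to start searching from */
    uint32_t *sync_bits;     /* one bit per region, padded to whole words */
};

/*
 * ASSUMPTION: word-granular search. If no zero bit exists in the padded
 * bitmap, the result is size rounded up to a multiple of 32, which can
 * exceed size itself.
 */
static uint32_t model_find_next_zero_bit(const uint32_t *bits, uint32_t size,
                                         uint32_t offset)
{
    uint32_t padded = ((size + 31u) / 32u) * 32u;
    uint32_t i;

    for (i = offset; i < padded; i++) {
        if (!(bits[i / 32u] & (UINT32_C(1) << (i % 32u))))
            return i;
    }
    return padded;
}

/* One pass of the search; 'fixed' selects the >= test from the patch. */
static int model_get_resync_work(struct model_log *lc, uint32_t *region,
                                 int fixed)
{
    *region = model_find_next_zero_bit(lc->sync_bits, lc->region_count,
                                       lc->sync_search);
    lc->sync_search = *region + 1;

    if (fixed ? (*region >= lc->region_count)
              : (*region == lc->region_count))
        return 0;            /* no work left */
    return 1;                /* caller would now resync *region */
}

/* Builds the reported setup: all regions in sync, padding bits set too. */
static int model_run(int fixed, uint32_t *region_out)
{
    uint32_t region_count = 234436608u / 2048u;   /* sectors / region size */
    uint32_t nwords = (region_count + 31u) / 32u;
    uint32_t *bits = malloc(nwords * sizeof *bits);
    struct model_log lc;
    int r;

    if (!bits)
        return -1;
    memset(bits, 0xff, nwords * sizeof *bits);    /* "nosync": every bit set */

    lc.region_count = region_count;
    lc.sync_search = 0;
    lc.sync_bits = bits;
    r = model_get_resync_work(&lc, region_out, fixed);
    free(bits);
    return r;
}
```

In this model, model_run(0, &region) reports work with region set to 114496, which is past the last valid region, while model_run(1, &region) reports no work, matching what the one-character patch achieves.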