Bug 7835

Summary: Status? RAID-6 chunk_aligned_read
Product: IO/Storage
Component: MD
Status: CLOSED CODE_FIX
Severity: normal
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.20-rc1 - rc5
Regression: ---
Reporter: Duncan (1i5t5.duncan)
Assignee: Neil Brown (neilb)
CC: bunk
Attachments: Patch to remove annoying message

Description Duncan 2007-01-16 10:18:56 UTC
Most recent kernel where this bug did *NOT* occur: 2.6.19

Distribution:  Gentoo (mainline kernel directly from kernel.org)

Hardware Environment: AMD64, Tyan s2885, dual Opteron 242, 8 gig memory, 4 SATA 
drives

Software Environment: md-RAID-1, -6 (partitioned), -0, all across all four SATA 
drives.  root (and a backup root snapshot) on RAID-6 partitions.  LVM2 on third 
RAID-6 partition.  reiserfs for everything.

Problem Description:  I'm having the chunk_aligned_read issue, still present in 
-rc5.  I googled chunk_aligned_read and see the pre-rc1 commits, but nothing 
since.  I searched Bugzilla and nothing on it showed up, though I'm not 
especially skilled at Bugzilla queries.

From Google's results I see you were aware of the issue pre-rc1 and submitted 
patches for it.  I'm still seeing the errors at boot while loading root off the 
RAID-6, though based on the changelog entries that's only the first attempt; 
the second attempt, through the stripe cache, goes through.  Still, it's -rc5 
and I'm still seeing the issue, so whatever was done to fix that bad RAID-6 
assumption (I read changelog entries but not C code so well, though I can 
sometimes get the gist, so it's "whatever" from my perspective) apparently 
didn't fix it for me.

I gather from the changelogs and kernel list postings that before the changes 
in 2.6.19-git18, the kernel /always/ read through the stripe cache, so falling 
back to that now isn't a big deal.  However, those warnings still look rather 
alarming, "harmless" or not, and I can't help but worry that I'm putting 
system stability on the line.
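
As a toy illustration of what "aligned" means in these messages (this is not 
the raid5.c source; the function name and the chunk/sector numbers below are 
made up for the example): a read qualifies for the cache-bypass path only if 
it starts and ends inside a single chunk.  Anything else is reported as "non 
aligned" and is simply serviced through the stripe cache, as every read was 
before 2.6.19-git18.

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy model of the chunk-boundary test, not kernel code.
     * Assumes chunk_sectors is a power of two, as md chunk sizes are. */
    static bool fits_in_one_chunk(unsigned long long sector,
                                  unsigned int sectors,
                                  unsigned int chunk_sectors)
    {
        /* offset within the chunk plus length must not cross the boundary */
        return (sector & (chunk_sectors - 1)) + sectors <= chunk_sectors;
    }

    int main(void)
    {
        unsigned int chunk_sectors = 128;  /* 64 KiB chunks, 512-byte sectors */

        /* fits in one chunk: eligible to bypass the stripe cache */
        printf("aligned:   %d\n", fits_in_one_chunk(256, 8, chunk_sectors));
        /* straddles a chunk boundary: "non aligned", goes through the cache */
        printf("straddles: %d\n", fits_in_one_chunk(120, 16, chunk_sectors));
        return 0;
    }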

Bottom line, what's the status?  Will proper RAID-6 assumptions be merged, 
curing the issue?  Will the direct-read changes be rolled back?  Since it's now 
using as a fallback what it did before, will the warnings just be quieted, at 
least for 2.6.20, with a real fix presumably for 2.6.21?  A different solution?  
Did you think it was all fixed and this is new data?  This RAID-6 user needs 
to know.  His data depends on it! =8^)

Hopefully this bug report will head off duplicate inquiries as others find it.  
Thanks for all you kernel hackers do for us users. =8^)

Duncan
Comment 1 Andrew Morton 2007-01-16 14:00:26 UTC
What is "the chunk_aligned_read issue"?
Comment 2 Duncan 2007-01-16 16:24:15 UTC
I figured you'd be familiar with it.  This will make way more sense than my 
explanation is likely to; it's a Google search for "chunk_aligned_read":

http://www.google.com/linux?lr=lang_en&hl=en&q=%22chunk_aligned_read%22

In particular, from the 2.6.19-git18 long format changelog:

commit 46031f9a38a9773021f1872abc713d62467ac22e
Author: Raz Ben-Jehuda(caro) <raziebe@gmail.com>
Date:   Sun Dec 10 02:20:47 2006 -0800

    [PATCH] md: allow reads that have bypassed the cache to be retried on failure

    If a bypass-the-cache read fails, we simply try again through the cache.  If
    it fails again it will trigger normal recovery procedures.

    update 1:

    From: NeilBrown <neilb@suse.de>

    1/
      chunk_aligned_read and retry_aligned_read assume that
          data_disks == raid_disks - 1
      which is not true for raid6.
      So when an aligned read request bypasses the cache, we can get the
      wrong data.

    2/ The cloned bio is being used-after-free in raid5_align_endio
       (to test BIO_UPTODATE).

    3/ We forgot to add rdev->data_offset when submitting
       a bio for aligned-read

    4/ clone_bio calls blk_recount_segments and then we change bi_bdev,
       so we need to invalidate the segment counts.

    5/ We don't de-reference the rdev when the read completes.
       This means we need to record the rdev so that it is still
       available in the end_io routine.  Fortunately
       bi_next in the original bio is unused at this point so
       we can stuff it in there.

    6/ We leak a cloned bio if the target rdev is not usable.

    From: NeilBrown <neilb@suse.de>

    update 2:

    1/ When aligned requests fail (read error) they need to be retried
       via the normal method (stripe cache).  As we cannot be sure that
       we can process a single read in one go (we may not be able to
       allocate all the stripes needed) we store a bio-being-retried
       and a list of bioes-that-still-need-to-be-retried.
       When we find a bio that needs to be retried, we should add it to
       the list, not to the single bio...

    2/ We were never incrementing 'scnt' when resubmitting failed
       aligned requests.

    [akpm@osdl.org: build fix]
    Signed-off-by: Neil Brown <neilb@suse.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

OK, so as I said, that (and the other references) seems to indicate that 2.6.19 
and earlier read from the (stripe?) cache -- new to .20-rc1 is the direct-read 
code, bypassing the cache.  Only it made a wrong assumption about RAID-6 stripe 
alignment, assuming it was the same as RAID-5.  As best I can decipher as a 
non-coder, the situation now is that if the direct read fails, it tries again 
through the cache -- which seems to work here, thankfully.  
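
To illustrate the bad assumption described in that changelog entry (toy code, 
not drivers/md/raid5.c; the function name and four-disk example below are made 
up for illustration): the bypass path computed the number of data disks per 
stripe as raid_disks - 1, which is only correct for RAID-5.  RAID-6 reserves 
two disks per stripe for P and Q parity, so the same formula can map a sector 
to the wrong device and offset.

    #include <stdio.h>

    /* Toy illustration of the data_disks assumption, not kernel code. */
    static int data_disks(int raid_disks, int level)
    {
        return level == 6 ? raid_disks - 2   /* RAID-6: P and Q parity */
                          : raid_disks - 1;  /* RAID-5: single parity  */
    }

    int main(void)
    {
        int raid_disks = 4;  /* e.g. a four-drive array like the one above */

        printf("RAID-5: %d data disks per stripe\n", data_disks(raid_disks, 5));
        printf("RAID-6: %d data disks per stripe\n", data_disks(raid_disks, 6));
        /* the old code used raid_disks - 1 for both, i.e. 3 either way,
         * so a RAID-6 read that bypassed the cache could be mapped to
         * the wrong place and return the wrong data */
        return 0;
    }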

Only, apparently either the RAID-5/RAID-6 mismatch wasn't entirely fixed, or it 
didn't take, or there's a related issue still hanging around, because while the 
second try through the cache (which AFAIK was the only try, previously) works, 
I'm continuing to get "chunk_aligned_read : non aligned" errors on the console 
at boot, and they continue in syslog.  The second try through the cache works, 
so I'm operational (and I think/hope data-safe), but the first try is still 
failing, and all those errors stacking up remain rather alarming, in addition 
to spamming my logs.

So I'm just asking what the status is.  Am I supposed to still be getting those 
errors?  Will the issue be fixed, or at least the errors quieted, by the .20 
main release (since after all it's simply falling back where needed to the old 
cached path, which might be good enough for .20, with the real fix to be 
tackled in .21)?  If not, I expect that even if it's a relatively harmless 
error, a lot of RAID-6 folks will be rather alarmed when they start seeing 
these after they upgrade, so if only to prevent needless bug noise, something 
needs to be done about this by the .20 release.

Duncan (hope I'm being helpful)
Comment 3 Neil Brown 2007-01-24 20:54:14 UTC
Yes, you are being helpful.
The particularly helpful thing you say is:

 I'm continuing to get "chunk_aligned_read : non aligned" errors on the console 

This helps me understand what is happening.  I'll attach a patch that fixes 
this.  It just removes the printk, since it is reporting something that can 
happen in normal operation.
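
For reference, the shape of the change, paraphrased from the description above 
(this is not the literal attachment 10179; the function name and example values 
are made up): the not-aligned case stops logging and simply tells the caller to 
use the stripe cache, because that is a normal outcome rather than an error.

    #include <stdbool.h>
    #include <stdio.h>

    /* Sketch of the effect of the fix, not the actual patch text. */
    static bool try_cache_bypass(unsigned long long sector, unsigned int sectors,
                                 unsigned int chunk_sectors)
    {
        bool aligned = (sector & (chunk_sectors - 1)) + sectors <= chunk_sectors;

        if (!aligned) {
            /* removed: printk("chunk_aligned_read : non aligned\n"); */
            return false;  /* caller quietly uses the stripe cache instead */
        }
        return true;       /* read is submitted directly, bypassing the cache */
    }

    int main(void)
    {
        /* a boundary-crossing read now falls back silently, with no log spam */
        printf("%d\n", try_cache_bypass(120, 16, 128));
        return 0;
    }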
Comment 4 Neil Brown 2007-01-24 20:56:25 UTC
Created attachment 10179 [details]
Patch to remove annoying message
Comment 5 Duncan 2007-01-25 08:23:39 UTC
I applied the patch to -rc6 before compiling it.
The patch is confirmed to work as promised. =8^)

Looking forward to seeing it in .20 release!

Duncan
Comment 6 Adrian Bunk 2007-01-27 04:49:32 UTC
The patch from this bug is now in Linus' tree.