Bug 32552 - threads relating to md lock up, causing data loss and preventing halt/reboot
Status: RESOLVED OBSOLETE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: MD
Hardware: All Linux
Importance: P1 normal
Assignee: io_md
Reported: 2011-04-03 12:29 UTC by Delan Azabani
Modified: 2013-12-23 11:52 UTC
CC List: 3 users

Kernel Version: -
Regression: No

Description Delan Azabani 2011-04-03 12:29:00 UTC
In 2.6.39 from mainline/snapshot/next, but not in 2.6.38, threads relating to md frequently lock up permanently into the 'Uninterruptible' status, causing:

* any data written after the problem begins to fail to commit properly (meaning I could lose hours' worth of downloads, because the lock up happens silently, without anything drastic such as a full freeze)
* halt/reboot to be prevented (the shutdown process also becomes 'Uninterruptible')
* processes relying on files on a volume on the md to hang (e.g. transmission-daemon won't respond to RPC anymore with torrents on the RAID)
* killed processes relying on files on a volume on the md to stay in Zombie state and not disappear (e.g. transmission-daemon)

The details of the lock up events vary, but this is one I'm encountering now:

* flush-9:127 is Uninterruptible on waiting channel 0, not using any CPU
* jbd2/md127p1-8 is Uninterruptible on waiting channel sleep_on_page, not using any CPU

Another lock up I had involved a thread (I can't remember the name) that was Uninterruptible on a waiting channel named something like 'jbd2 transaction commit'. I'm sorry for not being specific.
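
For anyone trying to capture the same details (thread name, state, waiting channel) the next time this happens, here is a minimal sketch that reads them from /proc. It is not part of the original report. The function names `parse_stat` and `uninterruptible_tasks` are made up for illustration, and it assumes a Linux /proc that exposes `stat` and `wchan` for each task. On some kernels or without sufficient privileges `wchan` may read as 0 or be unreadable.

```python
import os

def parse_stat(stat_text):
    """Return (pid, comm, state) from the text of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')' instead of splitting on spaces.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state

def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm, wchan) for every task currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        task_dir = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(task_dir, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # task exited while scanning, or unexpected format
        if state != "D":
            continue
        try:
            with open(os.path.join(task_dir, "wchan")) as f:
                wchan = f.read().strip() or "0"
        except OSError:
            wchan = "?"  # wchan not readable for this task
        found.append((pid, comm, wchan))
    return found
```

Calling `uninterruptible_tasks()` while the array is stuck and printing each tuple would give a list comparable to the two entries above, which can then be pasted into the report.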

My current mdadm array would have been created with the following:

mdadm -C /dev/md/delan:Greens -e1.0 --name=Greens -l5 -n4 /dev/sd[cdef]1
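
For readers less familiar with mdadm's short options, the sketch below spells the same command out with the long option names (`--create`, `--metadata`, `--level`, `--raid-devices`). The helper `mdadm_create_argv` is hypothetical. It only builds and prints the argument list and never runs mdadm, since creating an array is destructive. It also assumes the shell glob /dev/sd[cdef]1 matched all four partitions.

```python
def mdadm_create_argv(array, name, level, members, metadata="1.0"):
    """Build (but do not run) an mdadm create command using long options."""
    return [
        "mdadm",
        "--create", array,                  # -C /dev/md/delan:Greens
        f"--metadata={metadata}",           # -e1.0
        f"--name={name}",                   # --name=Greens
        f"--level={level}",                 # -l5
        f"--raid-devices={len(members)}",   # -n4
        *members,                           # /dev/sd[cdef]1 after expansion
    ]

# Assumed expansion of the glob /dev/sd[cdef]1.
members = [f"/dev/sd{letter}1" for letter in "cdef"]
argv = mdadm_create_argv("/dev/md/delan:Greens", "Greens", 5, members)
print(" ".join(argv))
```

In other words: a four-device RAID-5 array with version 1.0 metadata, which matters here because comment 1 suspects the problem is specific to the raid-5 personality.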
Comment 1 Duncan 2011-04-18 04:53:17 UTC
FWIW, I'm running md/raid-1 and raid-0 here on x86_64 (older dual dual-core Opterons, 4x sata-based devices), and am NOT seeing this.

Reiserfs here, if it makes a difference.

Is it still an issue with rc3?  Because there's another issue I'm running into both triggered and fixed before rc1, that's complicating a bisect of a different problem.  But that problem too I expect is different, as it's killing my raid-1 personality entirely.  As that's what my rootfs is on, it's killing that and I'm getting a kernel panic when it can't load the rootfs.  Yours seems way more intermittent, and I'm seeing nothing like it, either with rc3, or previous.

I thus suspect yours is likely to be raid-5 (and possibly 4 and 6) personality specific.

(FWIW, the bug I'm trying to bisect and research is delayed and often apparently hung logins/session-exits.  I'll file a bug if I don't see anything else on it.  Right now I'm stuck on about 1900 commits left (954 /after/ the current round), so a hint as to which commit I might cherry-pick to finish the bisect, would be nice.  I'll probably find it eventually, but...)

Duncan
Comment 2 Neil Brown 2011-04-18 05:18:55 UTC
The commit that breaks md is
  7eaceaccab5f40bbfda044629a6298616aeaed50

though just removing that is unlikely to be easy.

Also it would not affect reading from the md, so it wouldn't cause your problem with not being able to mount the md device.

Do you have the text of the kernel panic?
Comment 3 Duncan 2011-05-02 03:20:24 UTC
(In reply to comment #2)
> The commit that breaks md is
>   7eaceaccab5f40bbfda044629a6298616aeaed50
> 
> though just removing that is unlikely to be easy.
> 
> Also it would not affect reading from the md, so it wouldn't cause your
> problem with not being able to mount the md device.

(FWIW, my logins/logouts issue disappeared sometime between rc3 and rc5-127-g1be6a1f.  So I don't have to try to finish that bisect after all.  Just thought I'd tie up the loose end in case anyone was wondering about another latent bug alluded to but not yet properly filed.  =:^)
