Summary:        does not seem to shut down one of two md raid-1 arrays on hibernation
Product:        IO/Storage
Component:      MD
Severity:       normal
Reporter:       Martin Steigerwald (martin.steigerwald)
Assignee:       Neil Brown (neilb)
CC:             maciej.rutecki, neilb, rjw
Bug Depends on:
Bug Blocks:     7216, 15310
Attachments:
- first occurrence of the hang on shutting down one of the softraids
- one other occurrence of the hang where I waited a bit longer; an ext4 error occurred
Description Martin Steigerwald 2010-07-21 13:08:48 UTC
Last kernel version where this problem does not appear: 188.8.131.52-workstation-toi-3.1-04962-g60dd176 (I will be testing with 184.108.40.206 soon too; if it doesn't work, I'll report here).

I am using TuxOnIce, but it worked with 2.6.33 plus TuxOnIce as well, and the screen output does not indicate any TuxOnIce issue. Also, it only happens with one of the md devices. I switched back to 220.127.116.11 to have a working setup for my workstation again.

On hibernation, after TuxOnIce has saved the caches and kernel & process data, i.e. everything it needs prior to turning off the machine, the kernel tries to stop the RAID arrays but sometimes hangs on one of them:

md: stopping all md devices
md: md2 switched to read only mode

No md1 appears here. md1 is Ext4 on /, md2 is BTRFS on some other directory. Both RAIDs span two local Western Digital 1.5 TB SATA drives. Once, after about a minute, I saw an EXT4-fs error, but a fsck.ext4 -f from a grml 2010.04 did not yield any issue.

On restart I see in syslog that md1 is not clean and needs to be resync'd from the last checkpoint:

Jul 21 09:30:01 mango kernel: md: raid1 personality registered for level 1
Jul 21 09:30:01 mango kernel: md: md1 stopped.
Jul 21 09:30:01 mango kernel: usb 5-4: new high speed USB device using ehci_hcd and address 2
Jul 21 09:30:01 mango kernel: md: bind<sdb6>
Jul 21 09:30:01 mango kernel: md: bind<sda6>
Jul 21 09:30:01 mango kernel: raid1: md1 is not clean -- starting background reconstruction
Jul 21 09:30:01 mango kernel: raid1: raid set md1 active with 2 out of 2 mirrors
Jul 21 09:30:01 mango kernel: md1: detected capacity change from 0 to 20003749888
Jul 21 09:30:01 mango kernel: md1: unknown partition table
Jul 21 09:30:01 mango kernel: md: md2 stopped.
Jul 21 09:30:01 mango kernel: md: bind<sdb9>
Jul 21 09:30:01 mango kernel: md: bind<sda9>
Jul 21 09:30:01 mango kernel: raid1: raid set md2 active with 2 out of 2 mirrors
Jul 21 09:30:01 mango kernel: md2: detected capacity change from 0 to 600001150976
[...]
Jul 21 09:30:01 mango kernel: md: resync of RAID array md1
Jul 21 09:30:01 mango kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Jul 21 09:30:01 mango kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
Jul 21 09:30:01 mango kernel: md: using 128k window, over a total of 19534912 blocks.
Jul 21 09:30:01 mango kernel: md: resuming resync of md1 from checkpoint.

The sync appears to finish correctly:

mango:~# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda9[0] sdb9[1]
      585938624 blocks [2/2] [UU]

md1 : active raid1 sda6[0] sdb6[1]
      19534912 blocks [2/2] [UU]

unused devices: <none>

Details of the RAIDs:

mango:~# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90
  Creation Time : Wed Dec 16 16:20:47 2009
     Raid Level : raid1
     Array Size : 19534912 (18.63 GiB 20.00 GB)
  Used Dev Size : 19534912 (18.63 GiB 20.00 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Wed Jul 21 15:06:50 2010
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 57[...] (local to host mango)
         Events : 0.58058

    Number   Major   Minor   RaidDevice State
       0       8        6        0      active sync   /dev/sda6
       1       8       22        1      active sync   /dev/sdb6

mango:~# mdadm --detail /dev/md2
/dev/md2:
        Version : 00.90
  Creation Time : Wed Dec 16 16:20:54 2009
     Raid Level : raid1
     Array Size : 585938624 (558.79 GiB 600.00 GB)
  Used Dev Size : 585938624 (558.79 GiB 600.00 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Wed Jul 21 10:11:48 2010
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : cd[...] (local to host mango)
         Events : 0.39618

    Number   Major   Minor   RaidDevice State
       0       8        9        0      active sync   /dev/sda9
       1       8       25        1      active sync   /dev/sdb9
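The "not clean" state that md1 restarts in can also be inspected from sysfs before hibernating, without waiting for a resync to tell you. A minimal sketch (the helper name is mine; it assumes md arrays named as in this report, and simply reports when none exist):

```shell
#!/bin/sh
# Sketch: report each md array's state from sysfs. An array that is
# 'clean' (or read-only) at suspend time should not need a resync
# after resume.
check_md_state() {
    for f in /sys/block/md*/md/array_state; do
        # An unmatched glob stays literal in POSIX sh, so guard on it.
        [ -r "$f" ] || { echo "no md arrays found in sysfs"; return 0; }
        echo "$f: $(cat "$f")"
    done
}

check_md_state
```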
Comment 1 Martin Steigerwald 2010-07-21 13:13:31 UTC
Created attachment 27182 [details]: first occurrence of the hang on shutting down one of the softraids

Some more information to add:
- It's a hang. The kernel does not skip shutting down the RAID array, but just hangs there.
- It only seems to happen with md1.
- It does not happen every time; sometimes the kernel just shuts down immediately.
Comment 2 Martin Steigerwald 2010-07-21 13:15:45 UTC
Created attachment 27183 [details]: one other occurrence of the hang where I waited a bit longer; an ext4 error occurred

After about one or two minutes ext4 showed an error, but a fsck.ext4 -f from a grml 2010.04 did not find anything.
Comment 3 Martin Steigerwald 2010-07-21 13:27:45 UTC
I just tested with 18.104.22.168, and it shut down both RAID arrays, i.e. put both of them into read-only mode, including md1. Putting md1 into read-only mode is the step that is missing in the cases where 22.214.171.124 hangs on shutdown.
Comment 4 Neil Brown 2010-07-24 05:23:03 UTC
I'm guessing that it is hanging in md_update_sb. It would be good if you could confirm that, by collecting the output of "alt-sysrq-T" or by adding some printks to the code.

Why it would hang there I don't know. Maybe the device under the array is still suspended and isn't responding to IO?

The rare occasions where you don't get a hang would be explained by the 'safe_mode' timer having fired and the array being marked 'clean' before the suspend.

I assume the suspend-to-disk image is not being written to the md array? Maybe TuxOnIce needs to sync the md array before entering suspend, perhaps after making the filesystem read-only if it does that... unfortunately that isn't easy to do. Maybe "mdadm -r /dev/md1 ; mdadm -w /dev/md1" would do it.

Anyway, the first step is to confirm that md_update_sb is the culprit.
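A pre-hibernation hook along the lines Neil suggests could look roughly like this. This is only a sketch: the function name and the DRY_RUN guard are mine, and `mdadm --readonly` / `--readwrite` are the long forms of the read-only/read-write toggle he mentions:

```shell
#!/bin/sh
# Sketch of a pre-hibernation hook: bounce each array to read-only and
# back, so md writes out a 'clean' superblock before the suspend.
# With DRY_RUN=1 (the default here) it only prints what it would run.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

flush_md_arrays() {
    for md in /dev/md1 /dev/md2; do
        run mdadm --readonly "$md"     # marks the array clean
        run mdadm --readwrite "$md"    # back to normal operation
    done
}

flush_md_arrays
```

Whether this bounce is actually sufficient depends on the hang really being in md_update_sb, which is the thing to confirm first.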
Comment 5 Martin Steigerwald 2010-07-26 08:12:07 UTC
Thanks, Neil. Alt-SysRq-T prints quite a lot of output, and it would be just prior to shutting the machine off. The only thing I noticed after shutting down the MDs are the messages "Synchronising SCSI caches", and that is with the working 126.96.36.199 kernel. I don't think I have network at that time any more, and the machine has no serial port either. Is there some way to put a short "I am here" message into md_update_sb? Or a helpful SysRq key combination that outputs just one page that I could take a photo of?

TuxOnIce is using a swap RAID 0 for its hibernation image; it doesn't use the SoftRAID for it:

mango:~# swapon -s
Filename                Type            Size    Used    Priority
/dev/sda7               partition       7815580 8528    1
/dev/sdb7               partition       7815580 8448    1

I do not know whether it actually utilizes the swap RAID 0; AFAIK it should. I get the following image writing and reading speeds:

mango:~# grep -i "I/O speed" /var/log/syslog
Jul 20 16:37:04 mango kernel: - I/O speed: Write 73 MB/s, Read 207 MB/s.
Jul 21 09:26:49 mango kernel: - I/O speed: Write 71 MB/s, Read 204 MB/s.
Jul 21 09:42:03 mango kernel: - I/O speed: Write 65 MB/s, Read 188 MB/s.
Jul 21 15:25:52 mango kernel: - I/O speed: Write 89 MB/s, Read 217 MB/s.
Jul 26 08:21:34 mango kernel: - I/O speed: Write 78 MB/s, Read 216 MB/s.

But anyway, it's not on the SoftRAID.
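For the record, the task-state dump Neil asked for can also be triggered from a script via /proc/sysrq-trigger and then pulled out of the kernel ring buffer, rather than photographed. A sketch (the output path is arbitrary, and this needs root):

```shell
#!/bin/sh
# Sketch: trigger the same task-state dump as alt-sysrq-T from a
# script, then save the ring buffer to a file for reading after resume.
capture_task_dump() {
    if [ -w /proc/sysrq-trigger ]; then
        echo t > /proc/sysrq-trigger   # same effect as alt-sysrq-t
        dmesg > /tmp/sysrq-t.txt       # keep the dump for later reading
        echo "task dump saved to /tmp/sysrq-t.txt"
    else
        echo "need root (writable /proc/sysrq-trigger) for this"
    fi
}

capture_task_dump
```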
Comment 7 Rafael J. Wysocki 2011-01-16 22:35:28 UTC
Is the problem still present in 2.6.37?
Comment 8 Martin Steigerwald 2011-01-17 08:28:11 UTC
I haven't seen this problem since I switched to in-mainline-kernel hibernation. So it might have something to do with TuxOnIce, or not; I do not remember exactly what I did back then anymore. 2.6.36 is working fine here. Thus closing. Thanks for the reminder.