Bug 192941

Summary: MDRAID disk failure causes BTRFS corruption
Product: File System Reporter: Arie Skliarouk (skliarie)
Component: btrfsAssignee: Josef Bacik (josef)
Status: RESOLVED OBSOLETE    
Severity: normal CC: dsterba
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.8.0 Subsystem:
Regression: No Bisected commit-id:
Attachments: full kernel log

Description Arie Skliarouk 2017-01-20 06:07:41 UTC
Created attachment 252521 [details]
full kernel log

One of the harddisks of RAID5 (mdraid) started failing, e.g. timeouting on I/O operations. For some reason it is not enough for mdraid to take out the drive from the RAID, but all operations on it now start taking loooong time.
Somehow this causes BTRFS to corrupt the filesystem - after some time kernel prints BTFS corruption warnings.

Here is the first BTRFS-related kernel message:
Jan 19 20:27:32 cmdesk01 kernel: [1394013.164071] ata2.00: hard resetting link                                                       Jan 19 20:27:33 cmdesk01 kernel: [1394013.876023] ata2.01: hard resetting link                                                       Jan 19 20:27:33 cmdesk01 kernel: [1394014.348069] ata2.00: SATA link up 1.5 Gbps (SStatus 113 SControl 310)                          Jan 19 20:27:33 cmdesk01 kernel: [1394014.348080] ata2.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)                          Jan 19 20:27:33 cmdesk01 kernel: [1394014.476409] ata2.00: configured for UDMA/33                                                    Jan 19 20:27:33 cmdesk01 kernel: [1394014.481084] ata2.01: configured for UDMA/133
Jan 19 20:27:33 cmdesk01 kernel: [1394014.481118] ata2: EH complete
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776165] INFO: task kworker/u8:4:20884 blocked for more than 120 seconds.
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776171]       Not tainted 4.8.0-32-generic #34-Ubuntu
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776173] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776175] kworker/u8:4    D ffff99395791fb08     0 20884      2 0x00000000
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776210] Workqueue: btrfs-delayed-meta btrfs_delayed_meta_helper [btrfs]
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776213]  ffff99395791fb08 0000000000000004 ffffffff89e0d540 ffff9939cdcb0d40
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776216]  ffff9939c3f6a7c0 ffff993957920000 7fffffffffffffff ffff993922138ce0
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776219]  ffff9939cdcb0d40 0000000000000020 ffff99395791fb20 ffffffff89896b15
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776222] Call Trace:
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776228]  [<ffffffff89896b15>] schedule+0x35/0x80
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776230]  [<ffffffff8989a2ca>] schedule_timeout+0x22a/0x3f0
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776234]  [<ffffffff890aec35>] ? wake_up_process+0x15/0x20
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776237]  [<ffffffff8909b7dc>] ? insert_work+0x7c/0xc0
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776239]  [<ffffffff89897563>] wait_for_completion+0xb3/0x140
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776241]  [<ffffffff890aecc0>] ? wake_up_q+0x80/0x80
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776258]  [<ffffffffc05a0ac6>] btrfs_async_run_delayed_refs+0x136/0x160 [btrfs]
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776278]  [<ffffffffc05be807>] __btrfs_end_transaction+0x207/0x350 [btrfs]
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776299]  [<ffffffffc05be960>] btrfs_end_transaction+0x10/0x20 [btrfs]
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776320]  [<ffffffffc061ba83>] btrfs_async_run_delayed_root+0x1d3/0x650 [btrfs]
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776323]  [<ffffffff890ba9e5>] ? dequeue_entity+0x245/0xab0
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776325]  [<ffffffff890bb87f>] ? dequeue_task_fair+0x62f/0x8f0
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776328]  [<ffffffff8902c77e>] ? __switch_to+0x2ce/0x6c0
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776330]  [<ffffffff890c11f3>] ? pick_next_task_fair+0x113/0x4c0
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776352]  [<ffffffffc05f2998>] btrfs_scrubparity_helper+0xc8/0x2d0 [btrfs]
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776374]  [<ffffffffc05f2c0e>] btrfs_delayed_meta_helper+0xe/0x10 [btrfs]
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776376]  [<ffffffff8909d85c>] process_one_work+0x1fc/0x4b0
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776378]  [<ffffffff8909db5b>] worker_thread+0x4b/0x500
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776380]  [<ffffffff8909db10>] ? process_one_work+0x4b0/0x4b0
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776382]  [<ffffffff890a3e58>] kthread+0xd8/0xf0
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776384]  [<ffffffff8989b4df>] ret_from_fork+0x1f/0x40
Jan 19 20:27:59 cmdesk01 kernel: [1394039.776386]  [<ffffffff890a3d80>] ? kthread_create_on_node+0x1e0/0x1e0

After that (amidst I/O errors) BTRFS errors started showing up:

Jan 19 20:42:05 cmdesk01 kernel: [1394885.837167] BTRFS error (device md0): parent transid verify failed on 3713608400896 wanted 16015 found 16004                                                                                                                        Jan 19 20:42:05 cmdesk01 kernel: [1394885.956277] BTRFS info (device md0): read error corrected: ino 1 off 3713608400896 (dev /dev/md0 sector 7283566496)                                                                                                                 Jan 19 20:42:05 cmdesk01 kernel: [1394885.969552] BTRFS info (device md0): read error corrected: ino 1 off 3713608404992 (dev /dev/md0 sector 7283566504)                                                                                                                 Jan 19 20:42:05 cmdesk01 kernel: [1394885.980363] BTRFS info (device md0): read error corrected: ino 1 off 3713608409088 (dev /dev/md0 sector 7283566512)                                                                                                                 Jan 19 20:42:05 cmdesk01 kernel: [1394885.993575] BTRFS info (device md0): read error corrected: ino 1 off 3713608413184 (dev /dev/md0 sector 7283566520)                                                                      

Another one

Jan 19 21:10:53 cmdesk01 kernel: [1396614.192142] BTRFS error (device md0): parent transid verify failed on 3688251686912 wanted 16054 found 15822                                                                                                                        Jan 19 21:10:53 cmdesk01 kernel: [1396614.309237] BTRFS info (device md0): read error corrected: ino 1 off 3688251686912 (dev /dev/md0 sector 7062075200)                                                                                                                 Jan 19 21:10:53 cmdesk01 kernel: [1396614.324435] BTRFS info (device md0): read error corrected: ino 1 off 3688251691008 (dev /dev/md0 sector 7062075208)                                                                                                                 Jan 19 21:10:53 cmdesk01 kernel: [1396614.346305] BTRFS info (device md0): read error corrected: ino 1 off 3688251695104 (dev /dev/md0 sector 7062075216)                                                                                                                 Jan 19 21:10:53 cmdesk01 kernel: [1396614.348532] BTRFS info (device md0): read error corrected: ino 1 off 3688251699200 (dev /dev/md0 sector 7062075224)

After I resolved the MDADM problem, the BTRFS filesystem is unmountable anymore.

For full logs dump see attachment
Comment 1 Arie Skliarouk 2017-03-14 14:00:21 UTC
Couple of days ago, same problem happened again.

One of the disks developed 1 Offline_Uncorrectable block. Again, mdadm did not took the drive out of the RAID6 but speed of all I/O operations to the RAID slowed down dramatically. While BTRFS was mounted, I was able to see my files, even took backup of critical ones. After unmount the FS died, btrfsck did not help as well.

There really is something with BTRFS being sensitive to slow I/O and corrupting the FS.
Comment 2 Arie Skliarouk 2018-02-11 09:36:48 UTC
Couple of days ago, same problem happened again. An lost interrupt on one of RAID6 drives caused BTRFS corruption!

kernel 4.4.0-96-generic (ubuntu 16.04.3).

Feb  9 19:58:36 backup1 kernel: [888948.836030] ata4: lost interrupt (Status 0x50)
Feb  9 19:58:36 backup1 kernel: [888948.836047] ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0xe frozen
Feb  9 19:58:36 backup1 kernel: [888948.836124] ata4: SError: { RecovComm PHYRdyChg CommWake DevExch }
Feb  9 19:58:36 backup1 kernel: [888948.836181] ata4.00: failed command: WRITE DMA EXT
Feb  9 19:58:36 backup1 kernel: [888948.836238] ata4.00: cmd 35/00:00:90:5f:55/00:04:3c:00:00/e0 tag 0 dma 524288 out
Feb  9 19:58:36 backup1 kernel: [888948.836238]          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x14 (ATA bus error)
Feb  9 19:58:36 backup1 kernel: [888948.836353] ata4.00: status: { DRDY }
Feb  9 19:58:36 backup1 kernel: [888948.836409] ata4: hard resetting link
Feb  9 19:58:37 backup1 kernel: [888949.716052] ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Feb  9 19:58:38 backup1 kernel: [888950.159352] ata4.00: configured for UDMA/133
Feb  9 19:58:38 backup1 kernel: [888950.159381] ata4: EH complete
Feb  9 19:58:58 backup1 kernel: [888971.035501] BTRFS warning (device md0): csum failed ino 128547 off 61796352 csum 2721855190 expected csum 269856498
Feb  9 19:58:58 backup1 kernel: [888971.036557] BTRFS warning (device md0): csum failed ino 128547 off 61865984 csum 3059364696 expected csum 3766411157
Feb  9 19:58:58 backup1 kernel: [888971.038068] BTRFS warning (device md0): csum failed ino 128547 off 64942080 csum 3096280063 expected csum 4208637941
Comment 3 David Sterba 2022-10-06 17:41:13 UTC
This is a semi-automated bugzilla cleanup, report is against an old kernel version. If the problem still happens, please open a new bug. Thanks.